## Data Gathering
Three pieces of data were gathered for this project, each using a different method:

1. Directly downloading the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)
 
      * This archive was provided by Udacity, and was readily available for download.
      
      
 
2. Using the Requests library to download the tweet image predictions (image_predictions.tsv)

    * Assigning the provided URL to a variable for use in the get() method
    * Using the get() method to send a GET request to the URL
    * Checking the status code of the response to confirm a successful download
    * Writing the response to an output file in tab-delimited format (image_predictions.tsv)
    * Loading the file into a pandas dataframe
    
    
    
3. Uploading and reading additional tweet data from the tweet_json.txt file

    * The contents of the file were stored in JSON format. The data was read line by line to obtain additional information about the tweets in the Twitter archive. The information obtained included:
        1. Retweet counts
        2. Favorite counts
        3. Time of creation of each tweet
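
The line-by-line reading described in item 3 can be sketched roughly as follows. This is a minimal illustration, not the exact code used in the project: the function name `read_tweet_json` is hypothetical, and the JSON keys (`id`, `retweet_count`, `favorite_count`, `created_at`) are assumed to follow the usual Twitter API tweet object.

```python
import json

import pandas as pd


def read_tweet_json(path):
    """Read one JSON object per line and keep a few fields per tweet."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            tweet = json.loads(line)
            rows.append(
                {
                    # Key names are assumptions based on the Twitter API.
                    "tweet_id": tweet["id"],
                    "retweet_count": tweet["retweet_count"],
                    "favorite_count": tweet["favorite_count"],
                    "created_at": tweet["created_at"],
                }
            )
    columns = ["tweet_id", "retweet_count", "favorite_count", "created_at"]
    return pd.DataFrame(rows, columns=columns)


# Example call (the file must exist locally):
# tweet_json_data = read_tweet_json("tweet_json.txt")
```

Collecting the rows in a list and building the dataframe once at the end avoids repeatedly appending to a dataframe inside the loop.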
        
       
       
The three files gathered were:

1. `twitter_archive_enhanced.csv`
2. `image_predictions.tsv`
3. `tweet_json.txt`

They were all loaded into separate pandas dataframes:

1. `twitter_archive`
2. `image_predictions`
3. `tweet_json_data`
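
As an illustration of the Requests-based download and the loading step, the sketch below downloads a file, checks the status code, writes the response to disk and reads the tab-delimited file into a dataframe. The helper names `download_file` and `load_tsv` and the `timeout` value are hypothetical, and the URL is deliberately left as a placeholder because the report does not state it.

```python
import pandas as pd
import requests


def download_file(url, path, timeout=30):
    """Send a GET request and write the response body to `path`."""
    response = requests.get(url, timeout=timeout)
    # Confirm that the download succeeded before writing anything.
    if response.status_code != 200:
        raise RuntimeError(f"Download failed with status {response.status_code}")
    with open(path, "wb") as handle:
        handle.write(response.content)
    return response.status_code


def load_tsv(path):
    """Load a tab-delimited file into a pandas dataframe."""
    return pd.read_csv(path, sep="\t")


# Example calls (url is a placeholder for the address provided by Udacity):
# download_file(url, "image_predictions.tsv")
# image_predictions = load_tsv("image_predictions.tsv")
# twitter_archive = pd.read_csv("twitter_archive_enhanced.csv")
```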


## Assessing Data

This involved both visual and programmatic assessment.
The following key points acted as guidelines during assessment of the data:

* Only original ratings (no retweets) that have images are required. Not all are dog ratings and some are retweets.
* The requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned.

The following steps were followed during assessment:

`twitter_archive`

1. Describing each column in the table to gain awareness of all its variables
2. Checking the distribution of missing values using the `missingno` library
3. Checking for duplicated records using the duplicated() function
4. Checking the value counts in the rating denominator and rating numerator columns
5. Checking which entries had values of 0 in both the rating numerator and rating denominator columns
6. Finding entries that were labelled as 'None' in the name column of the `twitter_archive` table
7. Going through the text of records whose name column was 'None', to assess whether names were present in the text but had been left out
8. Checking the value counts of the dog stage columns, followed by checking their sum totals
9. Checking the entries in the source column
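
Several of the programmatic checks listed above can be expressed in a few lines of pandas. The sketch below is only an illustration: `assess_archive` is a hypothetical helper, the column names `rating_numerator`, `rating_denominator` and `name` are assumed to match the archive, and `isna().sum()` stands in for the visual `missingno` check.

```python
import pandas as pd


def assess_archive(df):
    """Collect a few programmatic assessment results in one dictionary."""
    both_zero = (df["rating_numerator"] == 0) & (df["rating_denominator"] == 0)
    return {
        # Number of missing values per column.
        "missing_per_column": df.isna().sum().to_dict(),
        # Number of fully duplicated rows.
        "duplicate_rows": int(df.duplicated().sum()),
        # How often each denominator value occurs.
        "denominator_counts": df["rating_denominator"].value_counts().to_dict(),
        # Rows with 0 in both rating columns.
        "zero_ratings": int(both_zero.sum()),
        # Rows whose name is the placeholder string 'None'.
        "unnamed": int((df["name"] == "None").sum()),
    }


# Example call:
# findings = assess_archive(twitter_archive)
```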

`image_predictions`

1. Displaying the table
2. Describing each column in the table to gain awareness of all its variables
3. Checking the distribution of missing values using the `missingno` library
4. Checking for duplicated records using the duplicated() function
5. Using the describe() function to check the distribution of the confidence columns
6. Checking the prediction with a confidence of 1
7. Checking the value counts of the p1_dog, p2_dog and p3_dog columns

`tweet_json_data`
    
1. Displaying the table
2. Describing each column in the table to gain awareness of all its variables
3. Using `.info()` to assess the columns and their datatypes
4. Checking for duplicates



### Assessment Findings

### Quality issues

#### *Twitter Archive Table*
1. The table includes retweets and replies, which are not original ratings
2. Some rating denominators have a value of 0 or greater than 10
3. The timestamp column is an object datatype instead of datetime
4. The text column contains both the tweet text and a short version of the tweet status URL
5. The source column contains both text and a URL
6. Missing values in the 'expanded_urls' column
7. Missing names are labelled as 'None' in the name column
8. Missing counts in the dog stage category columns (doggo, floofer, pupper, puppo)
9. The twitter_archive table contains ratings that have no corresponding image in the image predictions table


#### *Image Predictions Table*
10. The Image_pred_clean table is missing a prediction column indicating whether the image is of a dog or not, which is needed to determine the true positives and false negatives of the neural network
11. The table is missing a dog breed column containing the prediction with the highest confidence

#### *Tweet json data Table*
12. The created_at column is not in the datetime datatype


### Tidiness issues
#### *Twitter_archive Table*
1. Four columns (doggo, floofer, pupper and puppo) represent the same type of data, the dog category

#### *Image_prediction Table*
2. The image predictions table contains redundant columns

#### *Tweet_json_data Table*
3. This data is separate from the other tweet data in the twitter_archive table
4. The Twitter archive table and the Tweet json data table need to be merged

## Cleaning data

A copy of each table was made for the cleaning stage.

`twitter_archive`
 
1. Removed the retweets and replies included in the Twitter archive table
      * Identified and removed rows that contain values in the 'in_reply_to_status_id' and 'retweeted_status_id' columns
      * Dropped the columns that refer to replies and retweets in the table
      
    
2. Corrected some rating denominators that had a value of 0 or greater than 10
      * Created a function to replace any denominator greater or less than 10 with 10
      
    
3. Changed the timestamp column from the object datatype to datetime
      * Applied the pandas to_datetime() function to the timestamp column
      
    
4. Removed the short version of the tweet status URL from the text column, which contained both the tweet text and the URL
      * Wrote a function to split the text from the url using .split() and return only the text portion
      * Applied the function to the text column using .apply()
      
      
5. Obtained the tweet source from the source url in the source column
      * Wrote a function to extract the source of the tweet embedded in the url and applied it to the source column using .apply()
      
    
6. Handled the missing data in the expanded_urls column
      * The missing values were replaced with "None" using the pandas fillna() function
      
      
7. Handled the missing names labelled as 'None' in the name column
      * No action was performed
      
      
8. Handled missing counts in the dog stage category columns (doggo, floofer, pupper, puppo)
      * No action was performed
      
      
9. Removed ratings from the twitter_archive table that had no corresponding image in the image_predictions table
      * Removed the ratings in the twitter_archive_clean table that did not have a corresponding tweet ID in the image_predictions_clean table
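
The `twitter_archive` cleaning steps above could be combined roughly as in the following sketch. It is an illustration under stated assumptions rather than the project's actual code: `clean_archive` and `extract_source` are hypothetical names, the short URL is assumed to start with `https://t.co/`, the source column is assumed to hold an HTML anchor tag, and the column names are assumed to match the archive.

```python
import re

import pandas as pd


def extract_source(value):
    """Return the text between the anchor tags, or the value unchanged."""
    match = re.search(r">([^<]+)<", value)
    return match.group(1) if match else value


def clean_archive(archive, image_predictions):
    df = archive.copy()

    # Step 1: keep original tweets, then drop reply and retweet columns.
    original = df["retweeted_status_id"].isna() & df["in_reply_to_status_id"].isna()
    related = [c for c in df.columns
               if c.startswith(("retweeted_status", "in_reply_to"))]
    df = df[original].drop(columns=related)

    # Step 2: replace every denominator that is not 10 with 10.
    df["rating_denominator"] = df["rating_denominator"].where(
        df["rating_denominator"] == 10, 10)

    # Step 3: convert the timestamp strings to datetime.
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)

    # Step 4: keep only the text before the short URL (assumed prefix).
    df["text"] = df["text"].apply(lambda t: t.split("https://t.co/")[0].strip())

    # Step 5: extract the readable source from the anchor tag.
    df["source"] = df["source"].apply(extract_source)

    # Step 6: fill missing expanded URLs with the string "None".
    df["expanded_urls"] = df["expanded_urls"].fillna("None")

    # Step 9: keep only tweets that have an image prediction.
    df = df[df["tweet_id"].isin(image_predictions["tweet_id"])]
    return df.reset_index(drop=True)


# Example call:
# twitter_archive_clean = clean_archive(twitter_archive, image_predictions)
```

Steps 7 and 8 are omitted because no action was performed for them.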
    
`image_predictions`

10. Added a prediction column to determine whether the image is of a dog or not, so as to determine true positives and false negatives of the neural network.
    * Wrote a function that checked p1_dog, p2_dog and p3_dog boolean values to classify an image as 'dog' or 'not_dog'
    * Created a dog prediction column to show whether the neural network labelled the picture as a 'dog' or 'not_dog'
    
    
11. Created a dog breed column containing the prediction with the highest confidence

    * Determined the breed by selecting, among the p1, p2 and p3 predictions of the neural network, the one with the highest confidence
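
A possible shape for steps 10 and 11 is sketched below. The report does not say how the three boolean flags were combined, so this sketch assumes an image counts as a 'dog' when any of p1_dog, p2_dog or p3_dog is True. The function name `add_prediction_columns`, the new column names `prediction` and `dog_breed`, and the confidence column names `p1_conf`, `p2_conf` and `p3_conf` are also assumptions.

```python
import numpy as np
import pandas as pd


def add_prediction_columns(image_predictions):
    out = image_predictions.copy()

    # Step 10: label the image 'dog' if any prediction is a dog (assumption).
    is_dog = out[["p1_dog", "p2_dog", "p3_dog"]].any(axis=1)
    out["prediction"] = is_dog.map({True: "dog", False: "not_dog"})

    # Step 11: pick the predicted label with the highest confidence per row.
    confidences = out[["p1_conf", "p2_conf", "p3_conf"]].to_numpy()
    labels = out[["p1", "p2", "p3"]].to_numpy()
    best = confidences.argmax(axis=1)
    out["dog_breed"] = labels[np.arange(len(out)), best]
    return out


# Example call:
# image_predictions_clean = add_prediction_columns(image_predictions)
```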
    
`tweet_json_data`

12. Changed the created_at column datatype to datetime datatype


13. Merged the four columns (doggo, floofer, pupper and puppo) that represent the same type of data, the dog category

    * Assigned the values in the four different columns to one new column 'dog_stage'
    * Dropped the redundant columns from the twitter archive table
    
    
14. Dropped redundant columns from the image predictions table: 'img_num', 'p1', 'p1_dog', 'p2', 'p2_dog', 'p3', 'p3_dog'


15. Merged the Twitter archive table with the Tweet json data table on tweet IDs
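
The tidiness steps 13 and 15 can be illustrated with the sketch below. It assumes that each stage column holds either the stage name or the string 'None', that tweets with more than one stage are joined with a comma, and that the merge is an inner join on a shared `tweet_id` column. None of these details are confirmed by the report, and the function names `combine_stages` and `merge_tables` are hypothetical.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(archive):
    """Collapse the four stage columns into a single 'dog_stage' column."""
    out = archive.copy()
    # Treat the placeholder 'None' as empty so only real stages remain.
    stages = out[STAGES].replace("None", "")
    combined = stages.apply(lambda row: ",".join(s for s in row if s), axis=1)
    out["dog_stage"] = combined.replace("", "None")
    # Drop the four columns that are now redundant.
    return out.drop(columns=STAGES)


def merge_tables(archive, tweet_json_data):
    """Join the archive with the tweet JSON data on the tweet ID."""
    return archive.merge(tweet_json_data, on="tweet_id", how="inner")


# Example calls:
# twitter_archive_clean = combine_stages(twitter_archive_clean)
# merged = merge_tables(twitter_archive_clean, tweet_json_data_clean)
```

An inner join keeps only tweets present in both tables. A left join would instead keep every archive row and leave the counts empty where no JSON record exists.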