# Data Wrangling Report

This document briefly explains the wrangling process that was conducted on the tweet data from the WeRateDogs Twitter page.

### Gathering Data
To start the wrangling process, data was gathered from three different sources. The first was a CSV file containing a WeRateDogs Twitter archive that was already on hand. The second was a TSV file of tweet image prediction data, downloaded from Udacity's server. Lastly, the retweet and favorite counts for each tweet were gathered directly from the Twitter API.

Each data set was loaded into its own dataframe for assessment.
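
As a rough illustration of this loading step, the sketch below reads a CSV, a TSV and line-delimited JSON into three dataframes. The in-memory sample rows, the column names and the one-JSON-object-per-line layout of the API results are all assumptions made for the example; the real files are not reproduced here.

```python
import io
import json

import pandas as pd

# Hypothetical stand-ins for the three sources (sample rows are invented).
archive_csv = io.StringIO(
    "tweet_id,text,rating_numerator,rating_denominator\n"
    "1001,This is Bella. 13/10,13,10\n"
    "1002,Meet Max. 12/10,12,10\n"
)
predictions_tsv = io.StringIO(
    "tweet_id\tjpg_url\timage_num\n"
    "1001\thttps://example.com/a.jpg\t1\n"
)
api_json_lines = io.StringIO(
    '{"tweet_id": 1001, "retweet_count": 50, "favorite_count": 200}\n'
    '{"tweet_id": 1002, "retweet_count": 10, "favorite_count": 80}\n'
)

# CSV: comma-separated by default.
twitter_archive_df = pd.read_csv(archive_csv)

# TSV: same reader, with a tab separator.
image_predictions_df = pd.read_csv(predictions_tsv, sep="\t")

# API results: assumed to be stored as one JSON object per line.
tweet_stats_df = pd.DataFrame([json.loads(line) for line in api_json_lines])
```

With real files, the `io.StringIO` objects would be replaced by file paths.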

### Assessing Data
Each dataframe was assessed visually and programmatically. Both data quality and data tidiness issues were clearly documented for each dataframe with the following summary produced:

#### Data quality issues (9):

- On *twitter_archive_df*, there are some retweets and replies.
- On *twitter_archive_df*, the 'expanded_urls', 'source' and 'name' column names are unclear.
- On *twitter_archive_df*, the 'timestamp' data is not using the correct data type.
- On *twitter_archive_df*, the 'source' data contains only 4 unique values, so it could use the category data type rather than object.
- On *twitter_archive_df*, the 'source' information contains unnecessary HTML.
- On *twitter_archive_df*, dividing 'rating_numerator' by 'rating_denominator' produces some odd results, such as infinity.
- On *twitter_archive_df*, there are some dog names that are likely mistakes, e.g. 'a', 'an' and 'the'.
- On *image_predictions_df*, the column names are unclear.
- On *image_predictions_df*, the 'image_num' column has an inappropriate data type.

#### Data tidiness issues (3):

- image_predictions_df is part of the same observational unit as twitter_archive_df.
- tweet_stats_df is part of the same observational unit as twitter_archive_df.
- On twitter_archive_df, the 4 different columns doggo, floofer, pupper and puppo all relate to the same variable, which identifies the stage of the dog.
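
Several of the issues above can be surfaced with a handful of pandas checks. The following is a minimal sketch on an invented sample; the `retweeted_status_id` column and all row values are assumptions for illustration, not data from the real archive.

```python
import pandas as pd

# Invented sample that mimics a few of the documented issues.
twitter_archive_df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
    "name": ["Bella", "a", "the", "Max"],
    "rating_numerator": [13, 12, 24, 11],
    "rating_denominator": [10, 10, 0, 10],
    "retweeted_status_id": [None, 99.0, None, None],
})

# Data types: 'timestamp' is stored as text, not as a datetime.
column_types = twitter_archive_df.dtypes

# Rows with a retweet id are retweets rather than original tweets.
retweet_count = twitter_archive_df["retweeted_status_id"].notnull().sum()

# All-lowercase "names" such as 'a' or 'the' are probably parsing mistakes.
suspect_names = twitter_archive_df.loc[
    twitter_archive_df["name"].str.islower(), "name"
]

# A zero denominator is what makes numerator/denominator come out as infinity.
zero_denominators = (twitter_archive_df["rating_denominator"] == 0).sum()

# Duplicate tweet ids would indicate repeated observations.
duplicate_ids = twitter_archive_df["tweet_id"].duplicated().sum()
```

Visual assessment (scrolling through the data in a spreadsheet or notebook) complements checks like these, since it can reveal problems that no one thought to test for.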

### Cleaning Data
Prior to cleaning, a copy of the data was made. An ordered list of cleaning steps was then defined, as shown below, with the type of data quality issue in brackets. Each step was coded and tested.

#### Data Completeness Quality Steps:
1. On twitter_archive_df, remove the retweets and replies and drop retweet and reply related columns (unnecessary data).
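
Step 1 might look roughly like the sketch below. The retweet and reply column names (`retweeted_status_*`, `in_reply_to_*`) are assumed for illustration, as are the sample rows.

```python
import numpy as np
import pandas as pd

# Invented sample: tweet 2 is a reply, tweet 3 is a retweet.
twitter_archive_df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "in_reply_to_status_id": [np.nan, 55.0, np.nan],
    "in_reply_to_user_id": [np.nan, 7.0, np.nan],
    "retweeted_status_id": [np.nan, np.nan, 77.0],
    "retweeted_status_user_id": [np.nan, np.nan, 8.0],
})

# Work on a copy so the gathered data stays untouched.
archive_clean = twitter_archive_df.copy()

# Keep only rows that are neither a retweet nor a reply.
is_original = (
    archive_clean["retweeted_status_id"].isnull()
    & archive_clean["in_reply_to_status_id"].isnull()
)
archive_clean = archive_clean[is_original]

# The retweet/reply columns are now empty, so drop them.
related_columns = [
    col for col in archive_clean.columns
    if col.startswith(("retweeted_status", "in_reply_to"))
]
archive_clean = archive_clean.drop(columns=related_columns)
```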

#### Data Tidiness Steps:
2. On twitter_archive_df, melt the 4 different columns doggo, floofer, pupper and puppo together into a single column named "dog stage".
3. tweet_stats_df is part of the same observational unit as twitter_archive_df and should therefore be merged. Disregard any tweets without a favorite and retweet count as they may no longer be valid.
4. image_predictions_df is part of the same observational unit as twitter_archive_df and should therefore be merged. Select only the most probable dog breed from image_predictions_df for the merge, as the full set of prediction probabilities is not required for the analysis.
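
The tidiness steps above can be sketched as follows. This is one possible approach, not necessarily the exact code used: it assumes the stage columns hold either the stage name or the string `"None"`, that the prediction columns are named `p1`, `p1_conf`, `p1_dog` and so on, and that "most probable dog breed" means the first prediction flagged as a dog. The sample rows and the `breed` / `breed_confidence` names are invented.

```python
import pandas as pd

archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})
stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Step 2: melt the four stage columns into one, dropping the "None" fillers.
melted = archive.melt(id_vars=["tweet_id"], value_vars=stage_cols,
                      value_name="dog_stage")
melted = melted[melted["dog_stage"] != "None"]
# A tweet can mention more than one stage, so join them per tweet.
stages = melted.groupby("tweet_id")["dog_stage"].agg(", ".join).reset_index()
archive = archive.drop(columns=stage_cols).merge(stages, on="tweet_id", how="left")

# Step 3: an inner merge discards tweets that have no retweet/favorite counts.
tweet_stats_df = pd.DataFrame({
    "tweet_id": [1, 2],
    "retweet_count": [50, 10],
    "favorite_count": [200, 80],
})
archive = archive.merge(tweet_stats_df, on="tweet_id", how="inner")

# Step 4: keep only the top-ranked prediction that is actually a dog.
image_predictions_df = pd.DataFrame({
    "tweet_id": [1, 2],
    "p1": ["golden_retriever", "paper_towel"],
    "p1_conf": [0.9, 0.6],
    "p1_dog": [True, False],
    "p2": ["labrador", "pug"],
    "p2_conf": [0.05, 0.3],
    "p2_dog": [True, True],
})

def top_dog_breed(row):
    """Return the first prediction (in rank order) flagged as a dog."""
    for rank in (1, 2):
        if row[f"p{rank}_dog"]:
            return pd.Series({"breed": row[f"p{rank}"],
                              "breed_confidence": row[f"p{rank}_conf"]})
    return pd.Series({"breed": None, "breed_confidence": None})

best = image_predictions_df.apply(top_dog_breed, axis=1)
best.insert(0, "tweet_id", image_predictions_df["tweet_id"])
master = archive.merge(best, on="tweet_id", how="left")
```

The left merge in step 4 keeps tweets that have no image prediction; an inner merge would be the choice if such tweets should be dropped as well.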

#### Remaining Data Quality Steps:
5. On twitter_archive_df, check and correct 'rating_numerator' and 'rating_denominator' for accuracy (accuracy issue).
6. On twitter_archive_df, remove names that are likely mistakes, including 'a', 'an' and 'the' (accuracy issue).
7. On twitter_archive_df, appropriately rename the 'expanded_urls', 'source' and 'name' columns to be clear (consistency issue).
8. On twitter_archive_df, appropriately rename the newly added columns from image_predictions_df (consistency issue).
9. On twitter_archive_df, change the 'timestamp' data to use the datetime datatype (validity issue).
10. On twitter_archive_df, remove the unnecessary HTML from the 'source' column (consistency issue).
11. On twitter_archive_df, change the 'source' data to use the category data type (validity issue).
12. On twitter_archive_df, change the 'selected_image_number' data to use the category data type (validity issue).
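
A condensed sketch of several of these quality steps is shown below. It is illustrative only: the sample rows, the `text` column, the regular expressions and the new column names `tweet_source` and `dog_name` are assumptions, and re-extracting the rating from the tweet text is just one plausible way to carry out step 5.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    ],
    "name": ["Bella", "a"],
    "text": ["This is Bella. She is 13.5/10", "Here is a pupper. 12/10 would pet"],
})

# Step 5: re-extract the rating from the text, allowing decimal numerators.
ratings = df["text"].str.extract(r"(\d+(?:\.\d+)?)/(\d+)").astype(float)
df["rating_numerator"] = ratings[0]
df["rating_denominator"] = ratings[1]

# Step 6: blank out names that are really just articles.
bad_names = ["a", "an", "the"]
df["name"] = df["name"].mask(df["name"].isin(bad_names))

# Step 9: parse the timestamp strings into timezone-aware datetimes.
df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)

# Step 10: strip the HTML tags, keeping only the link text.
df["source"] = df["source"].str.replace(r"<[^>]+>", "", regex=True)

# Step 11: few distinct values, so store as a category.
df["source"] = df["source"].astype("category")

# Step 7: rename to clearer (hypothetical) column names.
df = df.rename(columns={"source": "tweet_source", "name": "dog_name"})
```

Each step can then be tested with a quick check, for example confirming that no remaining value in `tweet_source` contains a `<` character.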

### Storing Data
The final cleaned dataframe was stored as a CSV file to be loaded and used later.
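
A minimal sketch of the storing step follows. The file name `twitter_archive_master.csv`, the temporary directory and the sample rows are hypothetical. One point worth noting is that CSV does not preserve pandas data types, so the datetime and category columns need to be restored when the file is loaded again.

```python
import os
import tempfile

import pandas as pd

# Invented stand-in for the final cleaned dataframe.
master_df = pd.DataFrame({
    "tweet_id": [1001, 1002],
    "timestamp": pd.to_datetime(
        ["2017-08-01 16:23:56", "2017-07-31 00:18:03"], utc=True
    ),
    "tweet_source": pd.Categorical(["Twitter for iPhone", "Twitter Web Client"]),
})

# Hypothetical output location; index=False avoids writing the row index.
path = os.path.join(tempfile.mkdtemp(), "twitter_archive_master.csv")
master_df.to_csv(path, index=False)

# On reload, ask pandas to rebuild the datetime and category types.
reloaded = pd.read_csv(
    path,
    parse_dates=["timestamp"],
    dtype={"tweet_source": "category"},
)
```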