# Wrangle Report - WeRateDogs

In this project we gathered data from the WeRateDogs Twitter account. In particular, we dealt with 3 datasets:  

* Twitter archive: contains roughly 2.5k tweets collected from the mentioned account, with 17 columns including some extracted attributes such as dog name or stage.  

* Image predictions: stores about 2,000 image prediction entries across 12 columns.

* Tweet details: gathered by querying the Twitter API. It includes the retweet count and favorite count for each tweet id, and consists of 2,331 rows.

We explored these 3 datasets looking for potential data quality issues that would need to be assessed. We used pandas and numpy functions like *describe*, *info*, *head*, *sample*, *count*, *unique*, *sum*, etc.  
Here are the most critical findings:

### Data quality

For the `twitter_archive` table, we found that:  
- some rows were retweets and replies, and we did not want to use them for our analysis to prevent skewness.  
- there were missing values in *in_reply_to_status_id*, *in_reply_to_user_id*, *retweeted_status_id*, *retweeted_status_user_id*, *retweeted_status_timestamp* and *expanded_urls*.  
- some *rating_numerator* and some *rating_denominator* had extreme values, which could affect our future analysis.
- some dog names were wonky, like *a* or *th*.  

Regarding the `image_predictions` table, we identified that:  
- some of the columns *p1*, *p1_conf*, etc., had nondescriptive names.  
- the breed names were inconsistent: some were capitalized, some were not. Although some words must be capitalized in English, we decided that this could be an issue for analysis purposes.  

About the `tweets_list` table, we spotted that:  
- the variables *tweet_id*, *retweet_count* and *favorite_count* were stored as object types, not integers.  
- there were at least 25 rows missing compared to the `twitter_archive` table.
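
As an illustration of how findings like these can be surfaced, here is a minimal pandas sketch. The toy data is made up; only the column names come from the report, and the exact checks used in the project may have differed.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are invented for illustration).
twitter_archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "name": ["Charlie", "a", "the", "Bella"],
})
tweets_list = pd.DataFrame({
    "tweet_id": ["1", "2", "4"],
    "retweet_count": ["10", "25", "7"],
    "favorite_count": ["100", "250", "70"],
})

# Names that start with a lowercase letter are likely extraction errors.
suspicious_names = twitter_archive.loc[
    twitter_archive["name"].str.match(r"^[a-z]"), "name"
].tolist()

# Columns that should hold counts or ids but are not stored as integers.
non_integer_columns = [
    col for col in tweets_list.columns
    if not pd.api.types.is_integer_dtype(tweets_list[col])
]

# Tweets present in the archive but absent from the API results.
missing_ids = set(twitter_archive["tweet_id"]) - set(tweets_list["tweet_id"].astype(int))

print(suspicious_names, non_integer_columns, missing_ids)
```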

### Tidiness

- In the `twitter_archive` table, the variable *stage* was stored in 4 different columns instead of one.
- The `tweets_list` table should be part of the `twitter_archive` table. 
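
The first tidiness issue can be sketched as follows. This assumes each stage column holds either the stage name or the string `"None"`, and that a tweet naming several stages keeps all of them joined by a comma; the report does not say how such tweets were handled, so that part is an assumption. The helper `combine_stages` is a hypothetical name.

```python
import pandas as pd

stage_columns = ["doggo", "floofer", "pupper", "puppo"]

# Toy archive rows (invented values); "None" marks an absent stage.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})

def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    found = [row[col] for col in stage_columns if row[col] != "None"]
    return ", ".join(found) if found else None

# Build the single stage column, then drop the four original ones.
archive["stage"] = archive[stage_columns].apply(combine_stages, axis=1)
archive = archive.drop(columns=stage_columns)
```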

After this exercise, we worked on solutions to clean the above issues. We created copies of the dataframes in order to not lose any data. This is a summary of the implemented solutions:

1. From `twitter_archive` table, remove retweets, i.e. rows where columns *retweeted_status_id*, *retweeted_status_user_id* and *retweeted_status_timestamp* are non-empty. Remove replies, i.e. rows where columns *in_reply_to_status_id* and *in_reply_to_user_id* are non-empty.  
    Then, drop columns *retweeted_status_id*, *retweeted_status_user_id*, *retweeted_status_timestamp*, *in_reply_to_status_id* and *in_reply_to_user_id*. 
2. From `twitter_archive` table, fill in values for expanded urls. The format should be: "https://twitter.com/dog_rates/status/{tweet_id}/photo/1".
3. From `twitter_archive` table, compute numerator/denominator and store the result in a new *rating* column. Because the *WeRateDogs* rating method tends to give high ratings, we keep ratios up to 2 to 1. Ratios higher than 2 are assigned a rating of 2.  
    Drop the columns *rating_numerator* and *rating_denominator*.  
4. From `twitter_archive` table, transform all the names that start with a lowercase to *None*.  
5. From `image_predictions` table, rename the column headers to more descriptive ones.  
6. From `image_predictions` table, convert breed names to lowercase in columns *prediction_1*, *prediction_2* and *prediction_3* (although we know that some words must be capitalized in English, this is done for analysis purposes).  
7. From `tweets_list` table, convert *tweet_id*, *retweet_count* and *favorite_count* to __integer__.  
8. From `twitter_archive` table, combine *doggo*, *floofer*, *pupper* and *puppo* columns into one column *stage*.  
9. Left join `tweets_list` with `twitter_archive` on *tweet_id*.  
10. From `tweets_list` table, drop the 7 rows that are missing values for *retweet_count* and *favorite_count*.  
11. From `twitter_archive` table, convert *retweet_count* and *favorite_count* to __integer__ again. We had already done this, but the join introduced missing values and the variable type changed to __float__.
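
Steps 1, 3, 4, 7 and 9 to 11 above can be sketched in pandas as follows. This is a simplified illustration on made-up data, not the project's actual code: it uses only one retweet column and one reply column, and it assumes the archive is the left side of the join, which is what the missing counts mentioned in step 11 suggest.

```python
import pandas as pd

# Toy data with invented values; only the column names come from the report.
twitter_archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4, 5],
    "retweeted_status_id": [None, 99.0, None, None, None],
    "in_reply_to_status_id": [None, None, 88.0, None, None],
    "rating_numerator": [12, 13, 11, 1776, 10],
    "rating_denominator": [10, 10, 10, 10, 10],
    "name": ["Charlie", "Max", "Lucy", "a", "Atticus"],
})
tweets_list = pd.DataFrame({
    "tweet_id": ["1", "4"],
    "retweet_count": ["10", "7"],
    "favorite_count": ["100", "70"],
})

# Work on a copy so the raw data is never lost.
clean = twitter_archive.copy()

# Step 1: keep only original tweets, then drop the now-empty columns.
is_original = (
    clean["retweeted_status_id"].isna() & clean["in_reply_to_status_id"].isna()
)
clean = clean[is_original].drop(
    columns=["retweeted_status_id", "in_reply_to_status_id"]
)

# Step 3: a single rating column, capped at a ratio of 2.
clean["rating"] = (
    clean["rating_numerator"] / clean["rating_denominator"]
).clip(upper=2)
clean = clean.drop(columns=["rating_numerator", "rating_denominator"])

# Step 4: names starting with a lowercase letter become missing.
lowercase_name = clean["name"].str.match(r"^[a-z]")
clean.loc[lowercase_name, "name"] = None

# Step 7: API fields arrive as strings; convert them to integers.
counts = tweets_list.astype(
    {"tweet_id": "int64", "retweet_count": "int64", "favorite_count": "int64"}
)

# Step 9: left join on tweet_id; unmatched tweets get NaN counts (float).
clean = clean.merge(counts, on="tweet_id", how="left")

# Steps 10 and 11: drop rows without counts, then restore the integer type.
clean = clean.dropna(subset=["retweet_count", "favorite_count"])
clean = clean.astype({"retweet_count": "int64", "favorite_count": "int64"})
```

The second integer conversion is needed because NaN cannot be stored in a plain integer column, so pandas silently upcasts the count columns to float during the join.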

Finally, we ended up with 2 clean files:  
* `twitter_archive_master.csv`
* `image_predictions_master.csv`

We kept them separate because they contain information about 2 different units or entities. The first file has attributes associated with specific tweets, while the second file is focused on images and their attributes.  

We will use these 2 files to come up with insights and visualizations in the next step.