# Wrangle Report

by Markus Stachl, 18.07.2021

## Gather

There were 3 datasets to be gathered from various sources and used for the analysis. Those were:
- `tweets`: This dataset was provided by Udacity as a CSV file. It consists of tweets from the well-known WeRateDogs Twitter account and contains tweet metadata such as the ID and the tweet text, as well as the rating and dog name retrieved from the text through basic text processing.
- `predictions`: The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and was downloaded programmatically using the Requests library.
- `tweets_extended`: A dataset containing detailed tweet metadata, including the number of favorites, the number of retweets and extended user information. The dataset was retrieved from the Twitter API using the Tweepy library and read as JSON data.
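
The JSON step for `tweets_extended` can be sketched as follows. This is only an illustration under assumptions: it presumes the API output was stored as one JSON object per line, and the helper name `parse_tweet_json` and the sample values are made up. The keys `id`, `retweet_count` and `favorite_count` are the ones found in Twitter status objects.

```python
import json


def parse_tweet_json(lines):
    """Turn one-JSON-object-per-line text into a list of flat records."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return records


# Two made-up tweets standing in for the real API output.
sample = [
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}',
    '',
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}',
]
records = parse_tweet_json(sample)
print(records)
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame(records)` to obtain a table with one row per tweet.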

## Assess

After reading each dataset into a pandas dataframe, I assessed them individually for quality and tidiness. This meant looking at the content of the tables and checking for validity, correctness and completeness. It also meant assessing the table structures to ensure that every observation is modelled as a row, every feature is a column and every cell holds a single value. The assessment relied on pandas built-in functions like `.info()` and `.describe()`. It revealed the following quality and tidiness issues:
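
A programmatic assessment of this kind might look like the sketch below. The dataframe is a toy stand-in, and the column names (`rating_numerator`, `rating_denominator`, `name`) are assumptions for illustration rather than a confirmed schema.

```python
import pandas as pd

# Toy stand-in for the tweets table; values and column names are assumed.
tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": ["2017-01-01 10:00:00", "2017-01-02 11:00:00",
                  "2017-01-03 12:00:00", "2017-01-04 13:00:00"],
    "rating_numerator": [12, 1776, 13, 5],
    "rating_denominator": [10, 10, 10, 10],
    "name": ["Bella", "Atticus", "a", None],
})

tweets.info()             # dtypes and non-null counts per column
print(tweets.describe())  # summary statistics; an extreme max hints at outliers

# Targeted checks for specific issues
n_duplicate_ids = int(tweets["tweet_id"].duplicated().sum())
n_missing_names = int(tweets["name"].isna().sum())

# Lowercase "names" are usually ordinary words picked up by the text processing.
names = tweets["name"].dropna()
lowercase_names = names[names.str.islower()].tolist()

# The timestamp was read as text, not as a datetime column.
timestamp_is_text = not pd.api.types.is_datetime64_any_dtype(tweets["timestamp"])
```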

### Tweets
#### Quality
- contains retweets
- contains replies
- contains non-dog related ratings, tweets without images (https://twitter.com/dog_rates/status/828361771580813312)
- ratings (numerator and denominator) are not all base 10
- wrong numerator/denominator for ids 66628740622469529, 835246439529840640, 740373189193256964, 881633300179243008
- outlier ratings for ids 680494726643068929, 778027034220126208, 786709082849828864 due to delimiter '.'
- outlier ratings for ids 670842764863651840, 749981277374128128
- wrong names "a","an","light","old" et al.
- contains invalid ids (see error_ids)
- contains redundant/unnecessary columns
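
One plausible reading of the delimiter issue above is that a pattern matching only whole numbers picks up the digits after the decimal point as the numerator. The sketch below illustrates this with made-up tweet text. The patterns and the helper `extract_rating` are hypothetical and not the processing that produced the original dataset.

```python
import re

NAIVE = re.compile(r"(\d+)/(\d+)")                  # whole-number numerators only
DECIMAL_AWARE = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")  # allows an optional decimal part


def extract_rating(text, pattern=DECIMAL_AWARE):
    """Return (numerator, denominator) of the first rating-like match, or None."""
    match = pattern.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


text = "This is Bella. She deserves a 13.5/10"  # made-up tweet text
naive = extract_rating(text, NAIVE)          # picks up only the digits after the '.'
fixed = extract_rating(text, DECIMAL_AWARE)  # keeps the full decimal numerator
print(naive, fixed)
```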
#### Tidiness
- wrong datatype for timestamp
- index not set to tweet_id

### Predictions
#### Tidiness
- predictions in separate table
- multiple columns for dog stages

### Tweets_extended
#### Tidiness
- two separate tables for tweets

## Clean

Before performing any cleaning steps, a copy of the original data was created in case a rollback is needed. Every issue was addressed individually by following a "Define-Code-Test" methodology. Firstly, retweets and replies were removed from the dataset. Secondly, I ensured a correct structure by merging the three dataframes into one dataset, joining on `tweet_id`, and by setting the correct datatype for the timestamp. Multiple representations of the dog stage were melted into a single column `stages`. WeRateDogs regularly uses a 1-10 scale, but adjusts the numerator and denominator for groups of dogs, as in 121/110. These ratings were normalized to the 1-10 scale (11/10 in the given example). Furthermore, joke ratings were present in the dataset, like a dog with a rating of 1776/10 representing US Independence Day, or Snoop Dogg with a rating of 420/10. These outliers were excluded from the dataset to ease future analysis. Lastly, invalid dog names were removed from the dataset.
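
The cleaning steps described above can be sketched on toy data as follows. Several details are assumptions and not taken from the original notebook: the column names `retweeted_status_id`, `in_reply_to_status_id`, `rating_numerator`, `rating_denominator`, `doggo`, `pupper` and `p1`, the cutoff of 20 for joke ratings, and the choice to blank out invalid names instead of dropping rows. The stage columns are collapsed with a row-wise join here, which is a simpler substitute for a `melt`-based approach.

```python
import pandas as pd

# Toy stand-ins for the three tables; all values are made up.
tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4, 5],
    "timestamp": ["2017-01-01 10:00:00"] * 5,
    "retweeted_status_id": [None, 99.0, None, None, None],
    "in_reply_to_status_id": [None, None, None, None, None],
    "rating_numerator": [121, 12, 1776, 13, 11],
    "rating_denominator": [110, 10, 10, 10, 10],
    "name": ["Bella", "Rex", "Atticus", "a", "Milo"],
    "doggo": ["None", "None", "None", "doggo", "None"],
    "pupper": ["None", "None", "None", "None", "pupper"],
})
predictions = pd.DataFrame({"tweet_id": [1, 2, 3, 4, 5],
                            "p1": ["pug", "pug", "flag", "corgi", "beagle"]})
extended = pd.DataFrame({"tweet_id": [1, 2, 3, 4, 5],
                         "retweet_count": [10, 3, 50, 8, 2],
                         "favorite_count": [25, 7, 90, 30, 5]})

clean = tweets.copy()  # work on a copy so the original stays available

# 1. Drop retweets and replies (rows that reference another status).
is_original = clean["retweeted_status_id"].isna() & clean["in_reply_to_status_id"].isna()
clean = clean[is_original]

# 2. Merge the three tables on tweet_id and parse the timestamp.
clean = clean.merge(predictions, on="tweet_id").merge(extended, on="tweet_id")
clean["timestamp"] = pd.to_datetime(clean["timestamp"])

# 3. Collapse the stage columns into a single `stages` column.
stage_cols = ["doggo", "pupper"]
clean["stages"] = clean[stage_cols].apply(
    lambda row: ",".join(v for v in row if v != "None") or None, axis=1)
clean = clean.drop(columns=stage_cols)

# 4. Normalize group ratings such as 121/110 to a denominator of 10.
scale = clean["rating_denominator"] / 10
clean["rating_numerator"] = clean["rating_numerator"] / scale
clean["rating_denominator"] = 10

# 5. Exclude joke ratings; the cutoff of 20 is an arbitrary assumption.
clean = clean[clean["rating_numerator"] <= 20].copy()

# 6. Treat lowercase "names" such as "a" or "an" as missing.
clean["name"] = clean["name"].where(~clean["name"].str.islower())

print(clean[["tweet_id", "rating_numerator", "name", "stages"]])
```

Setting invalid names to missing keeps the otherwise valid tweet in the dataset. Dropping the rows instead would be a one-line change if that is preferred.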

Finally, the fully cleaned dataset was exported to a new file for future analysis.