## Data Wrangling Report

<span style="color: gray; font-size:1em;">Mateusz Zajac</span>
<br><span style="color: gray; font-size:1em;">Feb-2019</span>

- [Part I - Gathering Data](#gather)
- [Part II - Assessing Data](#assess)
- [Part III - Cleaning Data](#clean)

<a id='gather'></a>
### Part I - Gathering Data
For this project I used data from three sources:
 1. The WeRateDogs Twitter archive, which I downloaded manually from: [twitter_archive_enhanced.csv](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv)
 2. The tweet image predictions, which I downloaded programmatically with the Requests library from: [image_predictions.tsv](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv)
 3. tweet_json.txt, which holds the JSON data for each tweet in the Twitter archive file, downloaded from the Twitter API with Python's Tweepy library
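
As a rough illustration of how the third source might be read back in, here is a minimal sketch. It assumes tweet_json.txt stores one JSON object per line and that each object carries the Twitter API fields `id`, `retweet_count` and `favorite_count`. The helper name `read_tweet_counts` and the sample values are hypothetical, not taken from the project code.

```python
import io
import json


def read_tweet_counts(lines):
    """Collect tweet ID, retweet count and favorite count from JSON lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return records


# Made-up sample standing in for tweet_json.txt (assumed: one JSON object per line)
sample = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25, "lang": "en"}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7, "lang": "en"}\n'
)
counts = read_tweet_counts(sample)
```

With a real file, the same helper could be called as `read_tweet_counts(open("tweet_json.txt"))`, and the resulting list of dictionaries can be passed straight to `pandas.DataFrame`.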

<a id='assess'></a>
### Part II - Assessing Data

At this stage I assessed the data visually and programmatically, using pandas methods such as .sample(10), .info(), .describe(), .value_counts() and .duplicated().sum(). I identified and documented several issues with:


<span style="color:red; font-size:1em;">**Data Quality:**</span>

**Twitter Archive**
 1. dataset contains retweets
 2. in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id should be integers instead of floats (check if these fields are essential for the analysis)
 3. timestamp and retweeted_status_timestamp should be datetime instead of object (string)
 4. incorrect dog names (for instance: 'such', 'a', 'an')
 5. the rating numerator and denominator have ratings above the standard scale (for instance numerator equal to 1776 and denominator equal to 170)
 6. urls appear in the source column
 7. source and dog stage datatype can be set to 'category'
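
A few of the archive issues listed above could be surfaced programmatically along these lines. This is only a sketch on an invented toy dataframe, not the project's actual code. It relies on two assumptions: that wrongly extracted names are all lowercase (a heuristic suggested by the examples 'such', 'a', 'an') and that the standard rating scale uses a denominator of 10.

```python
import pandas as pd

# Toy stand-in for the archive; all values are invented for illustration
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [float("nan"), 9.0, float("nan"), float("nan")],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
    "name": ["Phineas", "Tilly", "such", "a"],
    "rating_numerator": [13, 12, 1776, 204],
    "rating_denominator": [10, 10, 10, 170],
})

archive.info()  # dtypes and non-null counts per column

# Extreme ratings stand out in the maximum
max_numerator = int(archive["rating_numerator"].max())

# Retweets are the rows with a non-null retweeted_status_id
n_retweets = int(archive["retweeted_status_id"].notna().sum())

# Heuristic (assumption): wrongly extracted names are all lowercase
suspect_names = archive.loc[archive["name"].str.islower(), "name"].tolist()

# Assumption: the standard rating scale uses a denominator of 10
odd_denominators = archive.loc[
    archive["rating_denominator"] != 10, "tweet_id"
].tolist()
```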

**Twitter Images**
 1. Missing records: 2075 rows instead of 2356
 2. 66 tweet_ids have duplicated jpg_urls
 3. For some images, the neural network didn't recognize a dog in any of its attempts
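
The duplicate and "no dog" checks on the image predictions might look like the sketch below. The boolean columns `p1_dog`, `p2_dog` and `p3_dog` (one flag per prediction attempt) are an assumption about the file's layout, and the toy values are invented.

```python
import pandas as pd

# Toy stand-in for the image predictions table (invented values)
images = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "jpg_url": ["a.jpg", "b.jpg", "a.jpg", "c.jpg"],
    "p1_dog": [True, False, True, False],
    "p2_dog": [True, False, True, True],
    "p3_dog": [False, False, False, False],
})

# Rows whose jpg_url repeats an earlier row
n_duplicate_urls = int(images["jpg_url"].duplicated().sum())

# Rows where none of the prediction attempts is flagged as a dog
no_dog = ~images[["p1_dog", "p2_dog", "p3_dog"]].any(axis=1)
no_dog_ids = images.loc[no_dog, "tweet_id"].tolist()
```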

**Twitter API**
 1. There are a few data type issues but for this project we need only 3 columns: Tweet ID, retweet count and favorite count
 2. Tweet ID 666020888022790149 is duplicated 


<span style="color:red; font-size:1em;">**Tidiness:**</span>

**General**
- All tables should be part of one dataset

**Twitter Archive**
- Dog stages are stored in separate columns instead of as values in rows

**Twitter API**
- For this project we need only 3 columns: Tweet ID, retweet count and favorite count. Drop the remaining columns.
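
One possible way to tidy the dog stage columns is to collapse them into a single column, as sketched below. The column names `doggo`, `floofer`, `pupper` and `puppo`, the string "None" as the placeholder for a missing stage, and the helper `combine_stages` are all assumptions made for illustration.

```python
import pandas as pd

# Assumed names of the four stage columns
stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Toy stand-in for the archive; "None" is assumed to mark a missing stage
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "doggo": ["None", "doggo", "doggo", "None"],
    "floofer": ["None", "None", "None", "None"],
    "pupper": ["pupper", "None", "pupper", "None"],
    "puppo": ["None", "None", "None", "None"],
})


def combine_stages(row):
    """Join every stage present in a row; return None when there is none."""
    stages = [row[col] for col in stage_cols if row[col] != "None"]
    return ", ".join(stages) if stages else None


# One dog_stage column replaces the four stage columns
archive["dog_stage"] = archive[stage_cols].apply(combine_stages, axis=1)
archive = archive.drop(columns=stage_cols)
archive["dog_stage"] = archive["dog_stage"].astype("category")
```

Joining multiple stages with a comma keeps tweets that mention two stages visible instead of silently picking one of them.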



Assessing data quality and tidiness issues was an iterative process. While dealing with one issue, I often found another that was not originally listed and tried to clean it to some extent. As mentioned in the key points of this project, assessing and cleaning the entire dataset completely would take a lot of time and is not necessary to practice and demonstrate data wrangling skills, so I tried to keep the scope reasonable.

<a id='clean'></a>
### Part III - Cleaning Data

I created copies of the original dataframes and cleaned them separately, one after another. At the end, I inner joined the three datasets on the common key ('tweet_id') and polished the joined dataframe a bit. The result was saved as a separate file, "twitter_archive_master.csv".
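
The final join can be sketched as two chained inner merges on 'tweet_id'. The dataframe names `archive_clean`, `images_clean` and `api_clean` and their contents are hypothetical stand-ins for the cleaned copies.

```python
import pandas as pd

# Toy stand-ins for the three cleaned copies (invented values)
archive_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "rating_numerator": [13, 12, 11],
})
images_clean = pd.DataFrame({
    "tweet_id": [1, 2],
    "jpg_url": ["a.jpg", "b.jpg"],
})
api_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweet_count": [10, 3, 5],
    "favorite_count": [25, 7, 9],
})

# An inner join keeps only the tweets present in all three tables
master = (
    archive_clean
    .merge(images_clean, on="tweet_id", how="inner")
    .merge(api_clean, on="tweet_id", how="inner")
)
```

Here tweet 3 is dropped because it has no image prediction. Saving the result would then be a single call such as `master.to_csv("twitter_archive_master.csv", index=False)`.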

That was a challenging task that took me some time. There were some parts I struggled with and had to research online (which I always consider part of the learning curve). I am sure there were moments where I could have done a better job with the cleaning (or made it more automated and reusable with new data), and the code itself could have been cleaner and smarter.