## Wrangle Report for the Tweet Archive of Twitter User @dog_rates
Author: Jem Chang  
Date: 11Mar2019  
Purpose: document the data wrangling process, including gathering, assessing, and cleaning the data  

## Table of Contents
<ul>
<li><a href="#env">Environment and tools</a></li>
<li><a href="#gather">Data Gathering</a></li>
<li><a href="#assess">Data Assessing and Cleaning</a></li>
<li><a href="#summary">Summary</a></li>
</ul>

<a id='env'></a>
#### Environment and Tools
The data wrangling process is performed in a Jupyter Notebook with Python 3.7. The libraries used in this project are pandas, requests, tweepy, json, re, timeit (for default_timer), matplotlib.pyplot, seaborn, and scipy.stats. `%matplotlib inline` is added so that plots render directly in the notebook, and `pd.options.display.max_colwidth = 600` is set so that long text is not truncated.  

<a id='gather'></a>
#### Data Gathering  
The datasets for this project come from the tweet archive of Twitter user @dog_rates (WeRateDogs).    

1. Enhanced Twitter Archive: contains tweet data for all 5000+ tweets. Only 2356 records have ratings.   
File name: twitter-archive-enhanced  
Format: csv  
Source: downloaded directly from the Udacity website.  

2. Image Predictions File: the output of a neural network   
File name: image-predictions  
Format: tsv  
Source: downloaded from url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'    

| Variable Name  | Definition                                                                                  |
|----------------|---------------------------------------------------------------------------------------------|
| tweet_id       | the last part of the tweet URL after "status/"                                              |
| p1             | the algorithm's #1 prediction for the image in the tweet                                    |
| p1_conf        | how confident the algorithm is in its #1 prediction                                         |
| p1_dog         | whether or not the #1 prediction is a breed of dog                                          |
| p2             | the algorithm's #2 prediction for the image in the tweet                                    |
| p2_conf        | how confident the algorithm is in its #2 prediction                                         |
| p2_dog         | whether or not the #2 prediction is a breed of dog                                          |
| p3             | the algorithm's #3 prediction for the image in the tweet                                    |
| p3_conf        | how confident the algorithm is in its #3 prediction                                         |
| p3_dog         | whether or not the #3 prediction is a breed of dog                                          |
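
The download and parsing of the image predictions file could look roughly like the sketch below. It is not the notebook's actual code: the helper names `download_text` and `parse_image_predictions`, the 30-second timeout, and the one-row sample are assumptions made for illustration. Only the URL comes from this report.

```python
import io

import pandas as pd

IMAGE_PREDICTIONS_URL = (
    "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
    "599fd2ad_image-predictions/image-predictions.tsv"
)


def download_text(url: str) -> str:
    """Fetch a URL and return the response body as text (hypothetical helper)."""
    # Imported here so the parsing helper below can be tried without network access.
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on HTTP errors
    return response.text


def parse_image_predictions(tsv_text: str) -> pd.DataFrame:
    """Parse tab-separated text into a DataFrame (hypothetical helper)."""
    return pd.read_csv(io.StringIO(tsv_text), sep="\t")


# Made-up one-row sample with a subset of the columns described in the table.
sample = "tweet_id\tp1\tp1_conf\tp1_dog\n101\tgolden_retriever\t0.8\tTrue\n"
predictions = parse_image_predictions(sample)
```

With network access, `parse_image_predictions(download_text(IMAGE_PREDICTIONS_URL))` would apply the same parsing to the real file.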

3. Additional Data via the Twitter API  
File name: tweet_json  
Format: txt  
Source: connect to the Twitter API to download the tweet data as a JSON-format text file, then use pandas to read it into the notebook.  
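
Reading the saved text file into pandas might look like the sketch below. It assumes the file holds one JSON object per line, which this report does not state, and it uses a made-up two-line stand-in for the file. The helper name `read_tweet_json` and the fields `retweet_count` and `favorite_count` are illustrative assumptions; the API download step itself is not shown.

```python
import io
import json

import pandas as pd


def read_tweet_json(lines) -> pd.DataFrame:
    """Build a DataFrame from an iterable of JSON strings, one tweet per line.

    Hypothetical helper; blank lines are skipped.
    """
    records = [json.loads(line) for line in lines if line.strip()]
    return pd.DataFrame(records)


# Made-up stand-in for the downloaded text file (assumed layout: one JSON object per line).
sample_file = io.StringIO(
    '{"id": 101, "retweet_count": 5, "favorite_count": 20}\n'
    '{"id": 102, "retweet_count": 7, "favorite_count": 31}\n'
)
tweets = read_tweet_json(sample_file)
```

For the real file, the same helper could be called on an open file handle, for example `with open("tweet_json.txt") as f: tweets = read_tweet_json(f)`, assuming that file name.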

<a id='assess'></a>
#### Data Assessing and Cleaning
##### Quality
`Enhanced Twitter Archive` table:  
1. The `text` column includes URLs. Because the rating, name, and stage extracted from the text might not all be correct, removing the URLs first makes later cleaning easier.  
2. The ratings are not all correct: for some ratings, only the decimal part was extracted instead of the full number. I clean numerators and denominators separately and set cutoffs to flag unusually large values, then visually review them to find the logic that caused the incorrect extraction.  
3. The denominator of a rating should not be 0: the original dataset contains some 0 denominators, but after the ratings are corrected, they are all gone.  
4. Timestamp should be a datetime instead of a string, and tweets with a timestamp later than August 1st, 2017 should be removed because there is no image prediction output after that date.
5. For the future analysis, this table should keep only original ratings that have images.
6. Dog stages (doggo, floofer, pupper, puppo) might not all be correct. After checking, the stages are all extracted correctly from the text.
7. The names are not all correct: I fix only one issue, using the IG account as the name if the name is missing or was extracted incorrectly. I leave the others as is because I don't think they will influence the future analysis.
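
Quality items 1, 2, and 4 above can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the notebook's actual code: the regular expressions, the helper names `strip_urls` and `extract_rating`, the column names `rating_numerator` and `rating_denominator`, and the choice to keep all of August 1st, 2017 are assumptions.

```python
import re

import pandas as pd

URL_RE = re.compile(r"https?://\S+")
# The numerator may carry a decimal part (e.g. 13.5/10); the denominator is an integer.
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def strip_urls(text: str) -> str:
    """Remove URLs so later regexes only see the tweet's own words."""
    return URL_RE.sub("", text).strip()


def extract_rating(text: str):
    """Return (numerator, denominator) from the first 'x/y' match, else (None, None)."""
    match = RATING_RE.search(text)
    if match is None:
        return None, None
    return float(match.group(1)), int(match.group(2))


# Made-up rows that mimic the archive's timestamp and text columns.
archive = pd.DataFrame({
    "timestamp": ["2017-07-30 15:58:51 +0000", "2017-08-05 10:00:00 +0000"],
    "text": [
        "Meet Sample. 13.5/10 would pet https://t.co/abc123",
        "Another sample pup. 12/10 https://t.co/def456",
    ],
})

# Item 1: drop URLs before any further extraction.
archive["text"] = archive["text"].apply(strip_urls)

# Item 2: extract the full numerator, including any decimal part.
pairs = [extract_rating(text) for text in archive["text"]]
archive["rating_numerator"] = [pair[0] for pair in pairs]
archive["rating_denominator"] = [pair[1] for pair in pairs]

# Item 4: parse timestamps and keep tweets up to and including August 1st, 2017 (UTC).
archive["timestamp"] = pd.to_datetime(archive["timestamp"], utc=True)
cutoff = pd.Timestamp("2017-08-02", tz="UTC")
archive = archive[archive["timestamp"] < cutoff].reset_index(drop=True)
```

Taking the first `x/y` match can still pick up fractions that are not ratings, which is consistent with the visual review described in item 2.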

`Additional Data via the Twitter API` table:  
8. `id` should be renamed to `tweet_id` to be consistent with the ids in the other tables, for future joining/merging with the other datasets.
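
As a small sketch of why the rename helps, the made-up frames below share a `tweet_id` key after the rename and can be merged directly. The columns `retweet_count`, `favorite_count`, and `rating_numerator` and all values are illustrative assumptions.

```python
import pandas as pd

# Made-up stand-ins for the API table and the archive table.
api = pd.DataFrame({"id": [101, 102], "retweet_count": [5, 7], "favorite_count": [20, 31]})
archive = pd.DataFrame({"tweet_id": [101, 102], "rating_numerator": [13.0, 12.0]})

# Rename the key so both tables use the same column name.
api = api.rename(columns={"id": "tweet_id"})

# With a shared key, the merge needs no left_on/right_on arguments.
merged = archive.merge(api, on="tweet_id", how="left")
```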

##### Tidiness  
Based on tidy data rules: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html  

`Enhanced Twitter Archive` table: column headers are values, not variable names 
1. All the dog stages (doggo, floofer, pupper, puppo) should be combined into one column, dog_stage. If a tweet has more than one stage, the stages are concatenated and separated by one space.  
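
One way to collapse the four stage columns is sketched below. It assumes that a missing stage is stored as the string "None", which this report does not state, and it uses made-up rows. The helper name `combine_stages` is hypothetical.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]

# Made-up rows; "None" is assumed to mark a stage that does not apply.
df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})


def combine_stages(row):
    """Join every stage present in the row with one space; return None if there is none."""
    found = [row[stage] for stage in STAGES if row[stage] != "None"]
    return " ".join(found) if found else None


df["dog_stage"] = df[STAGES].apply(combine_stages, axis=1)
df = df.drop(columns=STAGES)  # the four value-named columns are no longer needed
```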

`Image Predictions File` table: column headers are values, not variable names  
2. The columns p1, p2, p3, p1_conf, p2_conf, p3_conf, p1_dog, p2_dog, and p3_dog should be reshaped into p, p_conf, and p_dog. 
Note: after tidying the image predictions table, I do not merge it back into the Twitter archive table because doing so would triple the number of observations in the archive. I merge them only when performing the analysis, not at the data storing stage.
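
The reshaping of the prediction columns can be sketched as below, using made-up rows. The extra `prediction_rank` column, which records whether a row came from p1, p2, or p3, is an assumption added for illustration and is not named in this report.

```python
import pandas as pd

# Made-up rows in the wide layout of the image predictions file.
wide = pd.DataFrame({
    "tweet_id": [101, 102],
    "p1": ["golden_retriever", "paper_towel"],
    "p1_conf": [0.80, 0.50],
    "p1_dog": [True, False],
    "p2": ["Labrador_retriever", "pug"],
    "p2_conf": [0.15, 0.30],
    "p2_dog": [True, True],
    "p3": ["kuvasz", "toaster"],
    "p3_conf": [0.03, 0.10],
    "p3_dog": [True, False],
})

frames = []
for rank in (1, 2, 3):
    # Take one prediction's three columns and give them the shared names.
    part = wide[["tweet_id", f"p{rank}", f"p{rank}_conf", f"p{rank}_dog"]].copy()
    part.columns = ["tweet_id", "p", "p_conf", "p_dog"]
    part.insert(1, "prediction_rank", rank)  # hypothetical column: 1, 2, or 3
    frames.append(part)

# Stack the three pieces: one row per (tweet, prediction).
tidy = (
    pd.concat(frames, ignore_index=True)
    .sort_values(["tweet_id", "prediction_rank"])
    .reset_index(drop=True)
)
```

The tidy table has three rows per tweet, which is why merging it into the archive table would triple the archive's row count.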

<a id='summary'></a>
#### Summary  
The most challenging part of the wrangling process is extracting useful information from each tweet; regular expressions are used heavily for this. After data wrangling, the datasets are ready for analysis and modeling. 