# WeRateDogs wrangling report

Helen

## Introduction

Real-world data rarely comes clean. The goal of this project is to wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations.

The data wrangling effort in this project consists of three steps:
- Gathering data  
- Assessing data
- Cleaning data
 

## Gathering Data

Gather data from 3 different sources.
 


**Source 1 : Enhanced WeRateDogs Twitter Archive**

- Download the file twitter_archive_enhanced.csv manually from the link provided. This archive was given to Udacity by @WeRateDogs.

- The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain, though, is each tweet's text, which was used to extract the rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced."

**Source 2 : Image Predictions File**
- This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
  
- Every image in the WeRateDogs Twitter archive was run through a neural network that can classify breeds of dogs. The result is a table of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).
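A minimal sketch of the programmatic download described above, using the Requests library and the URL given in this section. The function name `download_file`, the constant name, and the 30-second timeout are illustrative assumptions, not part of the original project code.

```python
import requests

# URL as given in this report.
IMAGE_PREDICTIONS_URL = (
    "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
    "599fd2ad_image-predictions/image-predictions.tsv"
)


def download_file(url, dest_path):
    """Download `url` and write the raw response bytes to `dest_path`.

    Hypothetical helper: the name and signature are assumptions.
    """
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors (4xx/5xx)
    with open(dest_path, "wb") as f:
        f.write(response.content)
    return dest_path


# Example call (needs network access), left commented out on purpose:
# download_file(IMAGE_PREDICTIONS_URL, "image_predictions.tsv")
```

Writing `response.content` in binary mode keeps the file byte-for-byte identical to what the server sent, so it can later be read with a tab separator.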

**Source 3 : Additional Data via the Twitter API**

- The recommended way to acquire this data is to query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt.

- However, I could not set up a Twitter developer account using the recommended steps, so for this part I gathered this piece of data without a Twitter account.

- The Twitter archive is basic: retweet count and favorite count are two of its notable column omissions. This additional data can be gathered from Twitter's API. Normally the API only gives access to data for the 3000 most recent tweets, but because the WeRateDogs Twitter archive contains the tweet IDs, the data can be gathered for all 5000+ tweets.
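Once tweet_json.txt is available, the retweet and favorite counts can be pulled out of it. The sketch below assumes the file holds one JSON object per line and that each object carries the `id`, `retweet_count`, and `favorite_count` fields; the function name `tweets_to_frame` and the sample lines are made up for illustration.

```python
import json

import pandas as pd


def tweets_to_frame(lines):
    """Build a DataFrame of counts from an iterable of JSON lines.

    Assumes one tweet object per line (an assumption about the file layout).
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return pd.DataFrame(
        rows, columns=["tweet_id", "retweet_count", "favorite_count"]
    )


# Made-up sample lines standing in for the real file contents.
sample_lines = [
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}',
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}',
]
df_tweet_json = tweets_to_frame(sample_lines)
```

With the real file, the same function can be fed an open file handle, for example `with open("tweet_json.txt") as f: df_tweet_json = tweets_to_frame(f)`.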

## Assessing Data

After gathering each of the above pieces of data, assess them visually and programmatically for quality and tidiness issues.

Detect and document at least eight (8) quality issues and two (2) tidiness issues.
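A programmatic assessment can start from a few summary numbers per dataframe. The helper below is only a sketch: `summarize` is a hypothetical name and the sample rows are invented, but the pandas calls (`isna`, `duplicated`, `dtypes`) are the usual starting points for spotting missing values, duplicates, and wrong datatypes.

```python
import pandas as pd


def summarize(df):
    """Return basic quality indicators for a dataframe (hypothetical helper)."""
    return {
        "rows": len(df),
        "missing": df.isna().sum().to_dict(),          # nulls per column
        "duplicate_rows": int(df.duplicated().sum()),  # fully repeated rows
        "dtypes": df.dtypes.astype(str).to_dict(),     # datatype per column
    }


# Invented sample: the last two rows are identical on purpose.
sample = pd.DataFrame({
    "tweet_id": [1, 2, 2],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-08-01 00:17:27 +0000",
        "2017-08-01 00:17:27 +0000",
    ],
    "retweeted_status_id": [None, 8.9e17, 8.9e17],
})
report = summarize(sample)
```

In this sample, the float dtype reported for `retweeted_status_id` is the kind of datatype finding that ends up in the observations list.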

### Assessment Observations
**Quality - Completeness, validity, accuracy, consistency (content issues)**


`df_twitter_archive`

1) Wrong datatype for some columns, e.g. timestamp should be datetime instead of object, and in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, and retweeted_status_user_id should be integers/strings instead of float.

2) There are 181 retweet entries. We only want original rating tweets, not retweets.

3) Rating numerator and denominator values are inconsistent or incorrect.

  (However, rating numerators that are greater than the denominators do not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.)

4) Wrong numerators and denominators are captured from the text:
  - some actually refer to a date or have another meaning, e.g. 15/8 and 24/7.
  - some are wrongly captured because of decimals and spaces, e.g. "11. 26/10" was captured as "26/10".

5) Some tweets have no image.
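The rating problems in items 3 and 4 can be illustrated with a regular expression that allows a decimal numerator. This is a sketch under assumptions, not the extraction used to build the archive: the pattern, the sample tweets, and the rule "flag any denominator other than 10 for manual review" are all illustrative choices.

```python
import pandas as pd

# Numerator may contain a decimal part; denominator is a plain integer.
RATING_PATTERN = r"(\d+(?:\.\d+)?)/(\d+)"

# Invented tweet texts covering a decimal rating and a "24/7" false match.
tweets = pd.DataFrame({"text": [
    "This is Logan. He's a fluffy pupper. 11.26/10 would pet",
    "Open 24/7 and still a good boy. 12/10",
    "Meet Sam. 13/10",
]})

# str.extract returns the first match per row, one column per group.
extracted = tweets["text"].str.extract(RATING_PATTERN)
tweets["rating_numerator"] = extracted[0].astype(float)
tweets["rating_denominator"] = extracted[1].astype(float)

# The pattern cannot tell "24/7" from a rating, so unusual denominators
# are only flagged here and still need to be checked by eye.
needs_review = tweets[tweets["rating_denominator"] != 10]
```

The decimal-aware pattern reads "11.26/10" as 11.26 rather than 26, while the second tweet shows why a visual check is still needed: the first fraction in the text is "24/7", not the rating.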
  
`df_image_predictions`

6) Inconsistent capitalization of predicted dog names

7) There are 324 rows with non-dog images (where p1, p2, and p3 are all false)

8) Duplicated jpg URLs with different tweet IDs

9) The 'None' string should be replaced with NaN

10) The 3 files contain different numbers of rows.
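Items 6 to 8 could be handled along the following lines. The sketch uses an invented four-row table, and it assumes that the dog/not-dog flags live in boolean columns named `p1_dog`, `p2_dog`, and `p3_dog` next to the `p1`, `p2`, `p3` prediction columns; those flag names are an assumption, as is the choice to lowercase the names and keep the first of each duplicated URL.

```python
import pandas as pd

# Invented sample rows; column names for the boolean flags are assumed.
preds = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "jpg_url": ["a.jpg", "b.jpg", "b.jpg", "c.jpg"],
    "p1": ["golden_retriever", "Pembroke", "Pembroke", "web_site"],
    "p1_dog": [True, True, True, False],
    "p2": ["Labrador_retriever", "Cardigan", "Cardigan", "menu"],
    "p2_dog": [True, True, True, False],
    "p3": ["kuvasz", "Chihuahua", "Chihuahua", "envelope"],
    "p3_dog": [True, True, True, False],
})

clean = preds.copy()

# Issue 6: make capitalization consistent (lowercase chosen arbitrarily).
for col in ["p1", "p2", "p3"]:
    clean[col] = clean[col].str.lower()

# Issue 7: drop rows where none of the three predictions is a dog.
is_dog = clean[["p1_dog", "p2_dog", "p3_dog"]].any(axis=1)
clean = clean[is_dog]

# Issue 8: keep one row per image URL.
clean = clean.drop_duplicates(subset="jpg_url", keep="first")
```

After these steps only tweet IDs 1 and 2 remain in the sample: row 4 is not a dog and row 3 repeats the image of row 2.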

**Tidiness**

1)  2 columns storing rating information.

2)  4 columns (doggo, floofer, pupper, puppo) to indicate dog stages.

3)  All 3 files contain a common tweet_id column, which can be used to join them into one dataframe.
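Tidiness items 2 and 3 can be sketched as follows: the four stage columns are collapsed into a single `dog_stage` column, and the tables are joined on `tweet_id`. The sample data, the `dog_stage` name, the comma-joined format for tweets with two stages, and the inner join are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Invented sample rows; 'None' strings mark a missing stage.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})
counts = pd.DataFrame({"tweet_id": [1, 2], "retweet_count": [10, 3]})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Turn the 'None' strings into real missing values.
stages = archive[stage_cols].mask(archive[stage_cols] == "None")

# Join whatever stages are present; an empty result becomes NaN.
archive["dog_stage"] = stages.apply(
    lambda row: ",".join(row.dropna()) or np.nan, axis=1
)
archive = archive.drop(columns=stage_cols)

# Inner join keeps only tweets present in both tables (an assumption).
master = archive.merge(counts, on="tweet_id", how="inner")
```

An inner join also deals with the differing row counts between files, at the cost of dropping tweets that are missing from any one source.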


## Cleaning data


Here we will fix the quality and tidiness issues that we identified in step 2.

Before cleaning starts, each dataframe is copied to a new dataframe.

Cleaning process involves three steps:
- Define : convert assessments into defined cleaning tasks. These definitions also serve as an instruction list so others (in the future) can look at the work and reproduce it.
- Code : convert those definitions to code and run that code.
- Test : test dataset, visually or with code, to make sure cleaning operations worked.
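One pass through Define, Code, and Test might look like the sketch below, which works on a copy and handles the retweet and timestamp issues from the assessment. The three sample rows and the name `archive_clean` are invented for illustration.

```python
import pandas as pd

# Invented sample standing in for the real archive.
df_twitter_archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-07-31 00:18:03 +0000",
        "2017-07-30 15:58:51 +0000",
    ],
    "retweeted_status_id": [None, 8.87e17, None],
})

# Work on a copy so the original stays available for comparison.
archive_clean = df_twitter_archive.copy()

# Define: keep only original tweets (retweeted_status_id is null)
#         and convert timestamp from string to datetime.

# Code
archive_clean = archive_clean[archive_clean["retweeted_status_id"].isna()].copy()
archive_clean["timestamp"] = pd.to_datetime(archive_clean["timestamp"])

# Test
assert archive_clean["retweeted_status_id"].notna().sum() == 0
assert pd.api.types.is_datetime64_any_dtype(archive_clean["timestamp"])
```

Keeping the checks as `assert` statements means a failed cleaning step stops the notebook instead of passing silently.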



In this project, the steps of assessing and cleaning data are repeated multiple times. Each iteration makes the data more meaningful to analyze at a later stage.

## Conclusion

The wrangled data is stored in twitter_archive_master.csv.
It is then ready to be analyzed to draw useful insights and to build visualizations and reports.