# Wrangling WeRateDogs Twitter Data

## Introduction

The goal of this project is to wrangle the WeRateDogs Twitter data provided by Udacity in order to uncover insights and generate visualizations about the data archive. The __[WeRateDogs](https://twitter.com/dog_rates)__ Twitter archive contained over 5,000 tweets from their Twitter account. This dataset was then combined with quantitative data on retweet and favorite activity for each tweet, which I collected via the Twitter API, as well as a file of image predictions that Udacity generated with a neural network algorithm.

For the purposes of gathering, assessing, cleaning, and analyzing the data, I imported the following Python libraries: requests, NumPy, pandas, os, tweepy, json, matplotlib, seaborn, and warnings.

## 1. Data Gathering

First, I downloaded a CSV file of tweets from WeRateDogs that was provided inside the Udacity classroom. The file was then read into my Jupyter notebook as the dataframe original_df.

Then, I used the requests library in Python to programmatically retrieve the image-predictions.tsv file. This file was then read into the Jupyter notebook as the dataframe image_predictions_df.
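A minimal sketch of this download-and-read step is shown below. The helper name `download_file` and the inline sample rows are hypothetical, and the real file URL (supplied by Udacity) is deliberately not reproduced; the sample only mimics the tab-separated layout of image-predictions.tsv.

```python
import io

import pandas as pd


def download_file(url, dest_path):
    """Download `url` and write the response body to `dest_path`."""
    import requests  # imported lazily so the parsing demo below runs offline

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    with open(dest_path, "wb") as fh:
        fh.write(response.content)


# Hypothetical two-row sample in the same tab-separated layout,
# used here instead of a real download.
sample_tsv = (
    "tweet_id\tjpg_url\timg_num\tp1\tp1_conf\tp1_dog\n"
    "1001\thttps://example.com/a.jpg\t1\tpug\t0.91\tTrue\n"
    "1002\thttps://example.com/b.jpg\t1\tbagel\t0.55\tFalse\n"
)

# sep="\t" tells pandas the file is tab-separated rather than comma-separated.
image_predictions_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
```

With a real file on disk, the same `pd.read_csv(path, sep="\t")` call would be pointed at the downloaded path instead of the in-memory sample.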

To collect the retweet and favorite data needed to develop insights about this dataset, I created a developer account on Twitter and configured the Jupyter notebook to request the data via the tweepy Python library. The data was stored as a JSON-formatted text file and then read into the Jupyter notebook as the dataframe tweets_info_df.
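The read-back step could look roughly like the sketch below. It assumes the text file holds one JSON object per line and that each object uses the field names `id`, `retweet_count`, `favorite_count`, and `user.followers_count`; the notebook's actual file layout may differ, and the function name `tweets_to_dataframe` is hypothetical. The API calls themselves are omitted because they require credentials.

```python
import io
import json

import pandas as pd


def tweets_to_dataframe(lines):
    """Build a dataframe from an iterable of JSON strings, one tweet per line."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append(
            {
                "tweet_id": str(tweet["id"]),  # keep IDs as strings, not numbers
                "retweets": tweet["retweet_count"],
                "favorites": tweet["favorite_count"],
                "followers": tweet["user"]["followers_count"],
            }
        )
    return pd.DataFrame(rows, columns=["tweet_id", "retweets", "favorites", "followers"])


# Hypothetical file contents; real data would come from the API via tweepy.
sample_file = io.StringIO(
    '{"id": 1001, "retweet_count": 5, "favorite_count": 20, "user": {"followers_count": 300}}\n'
    '{"id": 1002, "retweet_count": 1, "favorite_count": 0, "user": {"followers_count": 300}}\n'
)
tweets_info_df = tweets_to_dataframe(sample_file)
```

Converting the ID to a string at load time is one way to end up with a string **tweet_id** column in tweets_info_df.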

## 2. Data Assessment
Using both programmatic and visual methods to explore the three datasets, I identified the following quality and tidiness issues. Individual observations for each dataset can be found in the subsections that follow:

*Tidiness*:
1. **doggo, floofer, pupper,** and **puppo** are all descriptions of dog stages, as per *The Dogtionary*, and as such should be combined into a single column.
2. All three of these datasets appear to be aspects of the same observational unit and as such should be combined into one table. 
3. The **source** column is not a variable for analysis, is unnecessary, and as such should be removed.

*Quality*:
1. Tweets for user IDs that are no longer valid on Twitter are present and should be removed, since favorite and retweet data cannot be retrieved for them.
2. Retweets and replies are present and should be removed.
3. **retweeted_status_id, retweeted_status_user_id,** and **retweeted_status_timestamp** columns are unnecessary once retweets are removed, as they are only useful as metadata for analyzing retweets.
4. The **_id** fields in original_df and image_predictions_df are not in string or object format.
5. The **timestamp** column is not in datetime format.
6. There exist 0 denominator observations for ratings.
7. There exist invalid numerator observations for ratings that contain decimals.
8. There are non-dog names present in the **name** column.
9. There are tweets present that do not have images for use with the prediction algorithm.


### 2.1 WeRateDogs Twitter Archive
original_df Column Descriptions:
- **tweet_id**: The unique identifier for each tweet.
- **in_reply_to_status_id**: The ID of the tweet to which a given tweet is a reply.
- **in_reply_to_user_id**: The user ID of the account to which a given tweet is a reply.
- **timestamp**: Date and time the tweet was created.
- **source**: The URL source for a given tweet.
- **text**: The text of a tweet.
- **retweeted_status_id**: The status ID of the original tweet of which this is a retweet.
- **retweeted_status_user_id**: The user ID of the original tweet of which this is a retweet.
- **retweeted_status_timestamp**: Date and time the retweet was created.
- **expanded_urls**: Expanded version of url1; URL entered by user and displayed in Twitter. Note that the user-entered URL may itself be a shortened URL, e.g. from bit.ly.
- **rating_numerator**: The ranking given by the user.
- **rating_denominator**: The reference ranking given by the user. 
- **name**: The name of the dog.
- **doggo**, **floofer**, **pupper**, **puppo**: The stage of the dog.

Observations:
- Unnecessary **source** column.
- The **rating_numerator** and **rating_denominator** values appear to be arbitrary and do not share a combined scale.
- There is one observation that has a **rating_denominator** value of 0.
- There are invalid **rating_numerator** values that contain decimals.
- **tweet_id, in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id,** and **retweeted_status_user_id** are not strings.
- **timestamp** and **retweeted_status_timestamp** are not in datetime format.
- **doggo, floofer, pupper,** and **puppo** all appear to be different aspects of the same variable.
- **in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id,** and **retweeted_status_timestamp** all appear to be NaN values for most of the observations in the dataset.
- There are 181 retweets.
- There are non-names in the **name** column.
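Several of these observations can be reproduced programmatically. The sketch below runs on a hypothetical four-row miniature of original_df (the rows are invented for illustration) and treats lowercase entries in **name** as likely non-names, which is a heuristic rather than a rule.

```python
import pandas as pd

# Hypothetical miniature of original_df with one example of each issue.
original_df = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3, 4],
        "text": ["Meet Bella. 13/10", "RT great dog 12/10", "rated 9.75/10", "uh 960/0"],
        "retweeted_status_id": [None, 55.0, None, None],
        "rating_numerator": [13, 12, 75, 960],
        "rating_denominator": [10, 10, 10, 0],
        "name": ["Bella", "None", "a", "None"],
    }
)

# Column dtypes show which ID fields are stored as numbers instead of strings.
dtypes = original_df.dtypes

# Rows whose denominator is zero.
zero_denominator = original_df[original_df["rating_denominator"] == 0]

# A non-null retweeted_status_id marks a retweet.
retweet_count = original_df["retweeted_status_id"].notna().sum()

# Heuristic: names start with a capital letter, so lowercase entries
# are likely ordinary words picked up by mistake.
lowercase_names = original_df.loc[original_df["name"].str.islower(), "name"]

# Ratings written with a decimal point in the tweet text.
decimal_ratings = original_df[original_df["text"].str.contains(r"\d+\.\d+/\d+")]
```

In the sample, the tweet rated 9.75/10 carries a stored numerator of 75, which illustrates how a decimal rating can end up as an invalid numerator.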

### 2.2 Udacity Image Dog Breed Predictions Dataset
image_predictions_df Column Descriptions:
- **tweet_id**: The unique ID of a given tweet.
- **jpg_url**: The URL for the tweet's image. 
- **img_num**: The number of images associated with a given tweet.
- **p1**: The top prediction produced by the algorithm for the breed of the dog in the image.
- **p1_conf**: The confidence associated with the algorithm's top prediction for the image in the tweet.
- **p1_dog**: A boolean value reporting whether or not the top prediction is a breed of dog.
- **p2**: The second most-likely prediction produced by the algorithm for the breed of the dog in the image.
- **p2_conf**: The confidence associated with the algorithm's second most-likely prediction for the image in the tweet.
- **p2_dog**: A boolean value reporting whether or not the second most-likely prediction is a breed of dog.
- **p3**: The third most-likely prediction produced by the algorithm for the breed of the dog in the image.
- **p3_conf**: The confidence associated with the algorithm's third most-likely prediction for the image in the tweet.
- **p3_dog**: A boolean value reporting whether or not the third most-likely prediction is a breed of dog.

image_predictions_df Observations:
- **tweet_id** is not a string (quality).
- Not all of the tweets have at least one image (quality).


### 2.3 Twitter Retweet and Favorite Counts
tweets_info_df Column Descriptions:
- **tweet_id**: The unique identifier for each tweet.
- **retweets**: The count of retweets for a given tweet (by user).
- **favorites**: The count of favorites for a given tweet (by user).
- **followers**: The number of followers of the account that posted a given tweet.

tweets_info_df Observations:
- **tweet_id** is a string!
- 165 tweets have a **favorites** count of zero (quality).
- All of the tweets collected have at least 1 retweet.


## 3. Data Cleaning

The following tasks were coded and performed in order to address the Tidiness and Quality issues outlined in *2. Data Assessment*.

*Tidiness*:
1. There will be one column for **stage** that is combined from **doggo, floofer, pupper,** and **puppo**.
2. Create one universal dataset from the three individual tables so that it is more useful for analysis.
3. The **source** column is unnecessary and should be removed.
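The three tidiness tasks can be sketched as follows. This is an illustration on invented rows, not the notebook's code: it assumes a missing stage is stored as the string "None", the helper `combine_stages` is hypothetical, and the order of operations here differs from the notebook's section order.

```python
import pandas as pd

STAGE_COLUMNS = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(df):
    """Collapse the four stage columns into one `stage` column."""
    # Assumption: a missing stage is stored as the string "None".
    flags = df[STAGE_COLUMNS].replace("None", "")
    # Join the non-empty labels so a tweet tagged with two stages keeps both.
    stage = flags.apply(lambda row: ", ".join(v for v in row if v), axis=1)
    out = df.drop(columns=STAGE_COLUMNS)
    out["stage"] = stage.replace("", "None")
    return out


# Hypothetical miniature versions of the three tables.
archive = pd.DataFrame(
    {
        "tweet_id": ["1", "2", "3"],
        "source": ["<a>iPhone</a>"] * 3,
        "doggo": ["doggo", "None", "doggo"],
        "floofer": ["None", "None", "None"],
        "pupper": ["None", "None", "pupper"],
        "puppo": ["None", "None", "None"],
    }
)
counts = pd.DataFrame({"tweet_id": ["1", "2", "3"], "retweets": [5, 1, 7]})
predictions = pd.DataFrame({"tweet_id": ["1", "3"], "p1": ["pug", "corgi"]})

tidy = combine_stages(archive).drop(columns=["source"])

# Inner joins keep only tweets present in all three tables.
master_df = tidy.merge(counts, on="tweet_id", how="inner").merge(
    predictions, on="tweet_id", how="inner"
)
```

Because the joins are inner joins, a tweet with no row in the predictions table (tweet "2" in the sample) drops out of the combined dataset.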

*Quality*:
1. Remove all tweets for which the user ID is no longer valid on Twitter.
2. Remove all retweet observations.
3. Remove the **retweeted_status_id, retweeted_status_user_id,** and **retweeted_status_timestamp** columns once retweets have been removed.
4. Convert all **_id** fields to strings.
5. Convert the **timestamp** column into datetime format.
6. Remove 0 denominator observations for ratings.
7. Remove/correct all non-dog names.
8. Remove all tweets that do not have at least one image.
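Several of these quality tasks reduce to a few pandas operations. The sketch below applies them to a hypothetical four-row dataframe; the timestamp strings are assumed to look like `2017-08-01 16:23:56 +0000`, and the real notebook may perform the steps differently or in another order.

```python
import pandas as pd

# Hypothetical miniature of the archive.
df = pd.DataFrame(
    {
        "tweet_id": [1001, 1002, 1003, 1004],
        "in_reply_to_status_id": [None, None, 900.0, None],
        "retweeted_status_id": [None, 800.0, None, None],
        "timestamp": [
            "2017-08-01 16:23:56 +0000",
            "2017-07-31 00:18:03 +0000",
            "2017-07-30 15:58:51 +0000",
            "2017-07-29 16:00:24 +0000",
        ],
        "rating_denominator": [10, 10, 10, 0],
    }
)

# Original tweets have null values in both the retweet and reply ID columns.
is_original = df["retweeted_status_id"].isna() & df["in_reply_to_status_id"].isna()
clean = df[is_original].copy()

# These columns only describe retweets and replies, so they are now empty.
clean = clean.drop(columns=["retweeted_status_id", "in_reply_to_status_id"])

# IDs are labels, not quantities, so store them as strings.
clean["tweet_id"] = clean["tweet_id"].astype(str)

# Parse the timestamp strings into datetimes.
clean["timestamp"] = pd.to_datetime(clean["timestamp"])

# A rating with a zero denominator cannot be interpreted, so remove it.
clean = clean[clean["rating_denominator"] != 0]
```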


The __[wrangle_act.ipynb Jupyter Notebook](wrangle_act.ipynb#3-data-cleaning)__ accomplishes those tasks in the following sections:

1. __[(3.1 Copy the Dataframes)](wrangle_act.ipynb#3-1-copy-dataframes)__, copies all of the gathered data into separate dataframes for cleaning.
2. __[(3.2 Convert all id fields to strings and remove source column)](wrangle_act.ipynb#3-2-convert-id-source)__, converts the **id** fields to strings and removes the source column.  
3. __[(3.3 Join All Three Datasets)](wrangle_act.ipynb#3-3-join-datasets)__, all three datasets were joined together with the **tweet_id** field as the key value for an inner-join.
4. __[(3.4 Remove Retweets and Replies)](wrangle_act.ipynb#3-4-remove-retweets-replies)__, all replies were removed from the dataset by querying whether or not the **in_reply_to_status_id** was a null value, and all retweets were removed from the dataset by querying whether or not the **retweeted_status_id** was a null value.
5. __[(3.5 Remove Retweet and Reply Data Columns)](wrangle_act.ipynb#3-5-remove-retweet-reply-col)__, the following empty columns were removed--as they relate explicitly to retweets and replies: **in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id,** and **retweeted_status_timestamp**.
6. __[(3.6 Convert Timestamp)](wrangle_act.ipynb#3-6-timestamp)__, the **timestamp** column was converted to datetime format using pandas. 
7. __[(3.7 Remove Ratings with a 0 Denominator)](wrangle_act.ipynb#3-7-ratings-denominator)__, all observations containing a **rating_denominator** of 0 were queried and removed from the dataset. 
8. __[(3.8 Correct erroneous ratings (numerators and denominators))](wrangle_act.ipynb#3-8-ratings-corr)__, after the visual assessment phase, it was noted that some of the rating_numerator values are entirely incorrect. The rating_numerator and rating_denominator fields were changed to floats and then the invalid values were identified and corrected.
9. __[(3.9 Identify (if possible) Dog Names Where Non-Names Currently Exist)](wrangle_act.ipynb#3-9-names)__, after identifying all of the non-dog name words in the **name** column each word was queried and entries were verified to make certain no name was referenced. If a name was referenced, then it was corrected in the dataset. For all tweets not containing a name, the non-name words were replaced with "None." 
10. __[(3.10 Combine Dogtionary Stages)](wrangle_act.ipynb#3-10-stages)__, all four of *The Dogtionary* stage variables were combined into a single **stage** column. 
11. __[(3.11 Verify All Tweets Contain Images)](wrangle_act.ipynb#3-11-image-check)__, the dataset was queried to make certain there were no null entries for the **jpg_url** column.
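Steps 8 and 9 involve manual verification in the notebook, but the mechanical part can be sketched with the standard library. The pattern and both helper functions below are hypothetical illustrations: `extract_rating` re-reads a rating from the tweet text while allowing a decimal numerator, and `clean_name` replaces entries that do not start with a capital letter with "None".

```python
import re

# Matches ratings such as "13/10" or "9.75/10"; the decimal part is optional.
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def extract_rating(text):
    """Return (numerator, denominator) as floats, or None if no valid rating is found."""
    for numerator, denominator in RATING_PATTERN.findall(text):
        if float(denominator) != 0:
            # Take the first fraction whose denominator is not zero.
            return float(numerator), float(denominator)
    return None


def clean_name(name):
    """Replace entries that do not start with a capital letter (e.g. "a") with "None"."""
    return name if name[:1].isupper() else "None"


rating = extract_rating("This is Bella. 13.5/10 would pet")
name = clean_name("a")
```

A pattern like this still picks up the first fraction it sees, so text such as a date or "24/7" ahead of the rating would be misread; cases like that are why the flagged tweets were checked by hand.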


## 4. Data Storage

After all of the cleaning tasks were completed, and before Analysis and Visualization was conducted (__[notebook](wrangle_act.ipynb#analysis-viz)__, __[report](act_report.html)__), the data was saved as twitter_archive_master.csv __[(data file)](twitter_archive_master.csv)__.
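A minimal sketch of the save step follows, using a hypothetical one-row dataframe and a temporary directory so that no real file is overwritten. It also shows one detail worth keeping in mind when the file is read back: pandas infers numeric-looking IDs as integers unless the column's dtype is given explicitly.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row stand-in for the cleaned dataset.
master_df = pd.DataFrame({"tweet_id": ["1001"], "stage": ["pupper"]})

# Write to a temporary directory here so the sketch does not overwrite real data.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "twitter_archive_master.csv")
    # index=False avoids writing the row index as an extra unnamed column.
    master_df.to_csv(path, index=False)
    # Read the IDs back as strings; otherwise pandas infers them as integers.
    reloaded = pd.read_csv(path, dtype={"tweet_id": str})
```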