# Wrangle Data

## Project Details:

This project is from Udacity's Data Wrangling course, which is part of the Data Analyst Nanodegree program. Using Python and its libraries, we will gather data from a variety of sources and in different formats, assess its quality and tidiness, and then clean it. That is, we will perform data wrangling. These efforts are documented in this Jupyter Notebook in order to showcase them through analyses and visualizations using Python (and its libraries) and/or SQL.

The tasks in this project include:

- Gathering data
- Assessing this data
- Cleaning it
- Storing, analyzing, and visualizing the wrangled data
- Reporting on:
    1. Data wrangling efforts
    2. Data analyses and visualizations

*Please note some of the descriptive text in this notebook has been taken directly from the Udacity course materials.*

## Table of Contents

- [Introduction](#intro)
- [Part I: Gathering Data](#gather)
- [Part II: Assessing Data](#assess)
- [Part III: Cleaning Data](#clean)
- [Part IV: Storing, Analyzing and Visualizing Data](#analyze)
- [Conclusion](#concl)

<a id='intro'></a>
### Introduction

This project involves downloading a Twitter archive and supplementing it with additional data. According to the Udacity course materials:

> The dataset that we will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

> WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017.

<a id='gather'></a>
### Part I: Gathering Data

The Twitter archive contains basic data for over 5000 WeRateDogs tweets. To make our analysis more interesting, we will need to gather data from additional sources. Specifically, we will need:

1. The WeRateDogs Twitter archive (twitter-archive-enhanced.csv). This file has already been placed in the same directory as this notebook.

2. Tweet image predictions that have been created with a neural network image classifier and are accessible as part of the Udacity course materials.

3. Additional information gathered by using the Twitter API, such as like and retweet counts.

Let's start by importing the necessary Python libraries for our efforts.

In [3]:
#Import the necessary libraries needed for the project.
import pandas as pd
import numpy as np

import requests

Let's load our twitter archive into a Pandas dataframe.

In [5]:
# Create dataframe of WeRateDogs twitter archive and view the first few records.
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')
twitter_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


Next we will gather the tweet image predictions, which include the predicted dog breed for each image.

> Every image in the WeRateDogs Twitter archive has been run through a neural network that can classify breeds of dogs. The results include a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images). This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

In [4]:
# Download the tweet image predictions file
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

# Request the url and save the response
response = requests.get(url)

# Stop with an exception on an HTTP error instead of silently skipping the download;
# otherwise write the file to our directory
response.raise_for_status()
with open('image_predictions.tsv', 'wb') as file:
    file.write(response.content)

In [6]:
# Load the predictions into a Pandas dataframe.
image_predictions = pd.read_csv('image_predictions.tsv', sep='\t')
image_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
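
Each row carries three ranked predictions, so a later analysis will probably want a single breed per tweet. The following is a minimal sketch, not part of the course materials, of one way to pick it. It assumes that `p1` is the most confident prediction and `p3` the least, and that the `p*_dog` columns flag whether a prediction is a dog breed. The helper name `best_dog_prediction` is hypothetical.

```python
def best_dog_prediction(row):
    """Return (breed, confidence) for the top-ranked prediction that is a dog breed."""
    # Assumption: p1 is the most confident prediction and p3 the least,
    # so checking them in order returns the best dog-breed guess.
    for rank in ('p1', 'p2', 'p3'):
        if row[rank + '_dog']:
            return row[rank], row[rank + '_conf']
    # None of the three predictions was a dog breed
    return None, None
```

The function only indexes by column name, so it should work on a plain dictionary or on a DataFrame row, for example through `image_predictions.apply(best_dog_prediction, axis=1)`.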


Because the archive doesn't contain everything, we will need to gather additional data from Twitter's API: at minimum each tweet's retweet count and favorite ("like") count, plus any additional data we find interesting.

> Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt. Each tweet's JSON data should be written to its own line. 

In [None]:
# Import the required libraries
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

# Keys, secrets, etc. are hidden to comply with Twitter's API terms and conditions
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor

# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = twitter_archive.tweet_id.values
len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            # tweepy.TweepError is the Tweepy 3.x name; Tweepy 4.x renamed it to tweepy.TweepyException
            print("Fail")
            fails_dict[tweet_id] = e
end = timer()
print(end - start)
print(fails_dict)

> Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count.

In [17]:
# Import library necessary to process JSON
import json

# Create a temporary list to hold the extracted JSON tweet data
data_list = []

with open('tweet_json.txt', 'r') as file:

    # Each line of the file holds one tweet's JSON data
    for line in file:

        # Load JSON tweet data
        tweet_data = json.loads(line)

        # Extract the fields we will use in our analysis
        tweet_id = tweet_data['id']
        retweet_count = tweet_data['retweet_count']
        favorite_count = tweet_data['favorite_count']

        # Place the extracted fields into a dictionary
        extracted_data = {'tweet_id': tweet_id,
                          'retweet_count': retweet_count,
                          'favorite_count': favorite_count
                         }
        # Append the dictionary to our list
        data_list.append(extracted_data)

        
# Convert the temporary list of dictionaries into a dataframe
twitter_api_data = pd.DataFrame(data_list, columns = ['tweet_id', 'retweet_count', 'favorite_count'])

twitter_api_data.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,8853,39467
1,892177421306343426,6514,33819
2,891815181378084864,4328,25461
3,891689557279858688,8964,42908
4,891327558926688256,9774,41048
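
As an alternative to the explicit loop, pandas can read a file with one JSON object per line directly through `pd.read_json(..., lines=True)`. The sketch below is an assumption about how that could look here, not the method the course prescribes, and `load_tweet_counts` is a hypothetical helper name. Tweet IDs are larger than 2^53, so it is worth confirming that they survive the round trip unchanged.

```python
import pandas as pd


def load_tweet_counts(path):
    """Read newline-delimited tweet JSON and keep only the columns used here."""
    # lines=True tells pandas that each line of the file is one JSON object
    tweets = pd.read_json(path, lines=True)
    # The JSON key is 'id'; rename it to match the other tables
    counts = tweets[['id', 'retweet_count', 'favorite_count']]
    return counts.rename(columns={'id': 'tweet_id'})
```

Calling `load_tweet_counts('tweet_json.txt')` should produce the same three columns as the loop-based version.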


<a id='assess'></a>
### Part II: Assessing Data

Now that we have gathered data, we must assess it visually and programmatically in order to identify quality and tidiness issues. Our work must meet the following criteria:

- Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset. 
- We only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
- Cleaning includes merging individual pieces of data according to the rules of tidy data.
- The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
- We do not need to gather the tweets beyond August 1st, 2017. Note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
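
The "original ratings with images" criterion can be sketched as a filter followed by a merge. This is a minimal illustration under two assumptions: a retweet is any row with a non-null `retweeted_status_id`, and a tweet "has an image" when its `tweet_id` appears in the image predictions table. The function name is hypothetical.

```python
import pandas as pd


def keep_original_ratings_with_images(archive, predictions):
    """Drop retweets, then keep only tweets that have an image prediction."""
    # Assumption: a retweet has a non-null retweeted_status_id
    originals = archive[archive['retweeted_status_id'].isna()]
    # An inner merge keeps only tweet_ids present in both tables
    return originals.merge(predictions, on='tweet_id', how='inner')
```

With the tables loaded earlier, the call would be `keep_original_ratings_with_images(twitter_archive, image_predictions)`.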

The original WeRateDogs Twitter archive was filtered and enhanced before we imported it. The 5000+ original tweets have been filtered for those with ratings only (2356). The rating, dog name, and dog "stage" (doggo, floofer, pupper, and puppo) have also been extracted from each tweet's text. This process was likely not perfect, so we'll need to assess and clean these columns in order to use them for analysis.
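
One way to check the extracted ratings programmatically is to pull them out of the tweet text again and compare. The pattern below is a hypothetical sketch and not the extraction Udacity actually used. It accepts an integer or decimal numerator, a slash, and an integer denominator, and it returns every match so that tweets with several candidates can be reviewed by hand.

```python
import re

# Hypothetical pattern: integer or decimal numerator, a slash, integer denominator
RATING_PATTERN = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')


def find_ratings(text):
    """Return every (numerator, denominator) pair found in a tweet's text."""
    return [(float(num), int(den)) for num, den in RATING_PATTERN.findall(text)]
```

Text containing a date such as "9/11" next to a rating would yield two pairs, which is a useful flag for visual assessment.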

In [None]:
#### Quality
##### `patients` table
- Zip code is a float not a string

In [None]:
#### Tidiness
- Contact column in `patients` table should be split into phone number and email

<a id='clean'></a>
### Part III: Cleaning Data

> Clean each of the issues you documented while assessing. Perform this cleaning in wrangle_act.ipynb as well. The result should be a high quality and tidy master pandas DataFrame (or DataFrames, if appropriate). Again, the issues that satisfy the Project Motivation must be cleaned.

### Missing Data

#### `treatments`: Missing records (280 instead of 350)

##### Define
Import the cut treatments into a DataFrame and concatenate it with the original treatments DataFrame.

##### Code

##### Test

### Tidiness

#### Issue text here

##### Define

##### Code

##### Test

### Quality

#### Issue text here

##### Define

##### Code

##### Test

<a id='analyze'></a>
### Part IV: Storing, Analyzing and Visualizing Data

> Store the clean DataFrame(s) in a CSV file with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, name these files appropriately. Additionally, you may store the cleaned data in a SQLite database (which is to be submitted as well if you do).
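
A minimal sketch of the storage step follows, assuming the cleaned data lives in a single DataFrame. The function name, the database path, and the table name are hypothetical choices. Only the CSV file name `twitter_archive_master.csv` comes from the requirement above.

```python
import sqlite3

import pandas as pd


def store_master(df, csv_path, db_path, table='twitter_archive_master'):
    """Write the cleaned DataFrame to a CSV file and to a SQLite table."""
    # index=False avoids writing the row index as an extra unnamed column
    df.to_csv(csv_path, index=False)
    # pandas accepts a plain sqlite3 connection for to_sql
    conn = sqlite3.connect(db_path)
    try:
        df.to_sql(table, conn, if_exists='replace', index=False)
        conn.commit()
    finally:
        conn.close()
```

For example, `store_master(master_df, 'twitter_archive_master.csv', 'wrangled.db')` would write both files, where `master_df` stands for whatever cleaned DataFrame Part III produces.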

<a id='concl'></a>
### Conclusion

#### Reporting for this Project

> Create a 300-600 word written report called wrangle_report.pdf or wrangle_report.html that briefly describes your wrangling efforts. This is to be framed as an internal document.

> Create a 250-word-minimum written report called act_report.pdf or act_report.html that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.

> Both of these documents can be created in separate Jupyter Notebooks using the Markdown functionality of Jupyter Notebooks, then downloading those notebooks as PDF files or HTML files (see image below). You might prefer to use a word processor like Google Docs or Microsoft Word, however.