# Wrangle and Analyze Data

## Introduction
The purpose of this project is to wrangle and analyze data from WeRateDogs, a humorous Twitter account that rates dog photos. I will gather data from various sources, in several formats, assess the quality of the data, clean it, and then provide analysis. 

From the course **Project Details**, the tasks to accomplish are:
- Data wrangling, which consists of:
 - Gathering data
 - Assessing data
 - Cleaning data
- Storing, analyzing, and visualizing your wrangled data
- Reporting on 1) your data wrangling efforts and 2) your data analyses and visualizations

#### Imports
Import the libraries dictated by the project description.

In [1]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import tweepy
import requests as rq
import time
import json

pd.set_option('display.max_colwidth', 30)

## Gathering Data
#### Open DogRates Twitter Archive with *Pandas*
This file was provided as-is. Open *twitter-archive-enhanced.csv* with Pandas.

In [2]:
# Specific Dtypes for columns and date columns to parse
archive_dtypes = {'tweet_id':str, 'in_reply_to_status_id': str, 'in_reply_to_user_id': str,
                  'retweeted_status_id': str, 'retweeted_status_user_id': str, 'rating_numerator': int,
                  'rating_denominator': int}
archive_parse_dates = ['timestamp', 'retweeted_status_timestamp']

archive_df = pd.read_csv('twitter-archive-enhanced.csv', dtype=archive_dtypes, parse_dates=archive_parse_dates)
archive_df.set_index('tweet_id', inplace=True)
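
The reason for reading the ID columns as `str` can be illustrated with a small sketch. The CSV text and both IDs below are made up for illustration; the point is only that a numeric column containing a missing value is inferred as `float64`, which cannot hold an 18-digit ID exactly.

```python
import io
import pandas as pd

# Toy CSV with made-up IDs: the second row has no reply ID
csv_text = (
    "tweet_id,in_reply_to_status_id\n"
    "100000000000000001,100000000000000003\n"
    "100000000000000005,\n"
)

# Default inference: the column with a gap becomes float64 and loses digits
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit str dtype: every digit survives and the gap stays missing
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(inferred.in_reply_to_status_id.dtype)
print(as_text.in_reply_to_status_id[0])
```

Keeping IDs as text also makes them safe to use as join keys later on.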

#### Get Tweets with *Tweepy*
Download the detailed tweet data from Twitter using Tweepy. Save the data to *tweet_json.txt* once downloaded.

If the file already exists, open it locally to avoid downloading again. (The download takes approximately 30 minutes: the Tweepy bulk-lookup function does not support extended tweets, so each tweet is requested individually and the API rate limit applies.)

In [3]:
retry_count = 0

while True:
    try:
        # open the detailed tweet archive file saved locally
        with open('tweet_json.txt', 'r') as fr:
            # pd.json_normalize replaces the deprecated pd.io.json.json_normalize (pandas >= 1.0)
            tweets_df = pd.json_normalize(
                        json.load(fr)
                     )
        break
    
    except FileNotFoundError as e:
        # list of tweets to download from the archive file (tweet_id is the index)
        tweet_ids = archive_df.index.unique().tolist()
        tweets = []

        # build the API client once, outside the per-tweet loop
        twitter_keys = {"ConsumerAPIKey":None,
                        "ConsumerSecret":None,
                        "AccessToken":None,
                        "AccessTokenSecret":None}
        # comment this out and insert values to twitter_keys above if using other Twitter API keys
        twitter_keys = pd.read_json('twitterkeys.json', typ='series')

        twitter_auth = tweepy.OAuthHandler(twitter_keys['ConsumerAPIKey'], twitter_keys['ConsumerSecret'])
        twitter_auth.set_access_token(twitter_keys['AccessToken'], twitter_keys['AccessTokenSecret'])
        # Parsing Tweepy objects: https://stackoverflow.com/questions/27900451
        twitter = tweepy.API(twitter_auth, parser=tweepy.parsers.JSONParser(),
                             wait_on_rate_limit=True,
                             wait_on_rate_limit_notify=True)

        for t in tweet_ids:
            try:
                # get tweets one-by-one because tweepy.statuses_lookup doesn't support tweet_mode=extended
                tweets.append(
                    twitter.get_status(t, tweet_mode='extended', include_entities=True, trim_user=True)
                )

            except tweepy.TweepError as te:
                #swallow 'status not found' errors, raise all others
                if(te.api_code != 144):
                    raise

            #delay to reduce request rate; 15m limiting causes proxy timeout on intranet    
            time.sleep(.2)

        # write file to disk to avoid re-downloading
        with open('tweet_json.txt', 'w') as fw:
            json.dump(tweets, fw)
        
        # if met maximum retries, raise an error
        retry_count = retry_count + 1
        if retry_count == 4:
            raise
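
The "open the local file, otherwise download and save" pattern used above can be factored into a small helper. This is a minimal sketch, not the code used in this notebook: `load_or_fetch` and `fake_fetch` are hypothetical names, and the fetch function stands in for the Tweepy download so the example runs without network access.

```python
import json
import os
import tempfile

def load_or_fetch(path, fetch):
    """Return cached JSON from path; on a miss, call fetch() and cache its result."""
    try:
        with open(path, 'r') as f:
            return json.load(f)
    except FileNotFoundError:
        data = fetch()
        with open(path, 'w') as f:
            json.dump(data, f)
        return data

calls = []

def fake_fetch():
    # Stand-in for the slow API download; records how often it runs
    calls.append(1)
    return [{'id_str': '1', 'retweet_count': 2}]

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, 'tweet_json.txt')
    first = load_or_fetch(cache_path, fake_fetch)   # miss: calls fake_fetch
    second = load_or_fetch(cache_path, fake_fetch)  # hit: reads the file

print(len(calls))
```

Separating the cache check from the download keeps the slow step in one place and avoids the retry loop.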

#### Download Image Predictions with *Requests*

Download the image predictions file using the Requests library. Save the file locally as *image-predictions.tsv*. If the file was already downloaded, open it locally to avoid hitting the server for the same file many times.

In [4]:
retry_count = 0

while True:
    try: 
        # Open image prediction file saved locally
        images_df = pd.read_csv('image-predictions.tsv', sep='\t', dtype={'tweet_id':str}).set_index('tweet_id')
        break
    
    except FileNotFoundError as e:
        # Use Requests library to download
        image_predictions_url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

        # Download first so a failed request does not leave an empty file behind
        r = rq.get(image_predictions_url)
        r.raise_for_status()

        # Save the file
        with open('image-predictions.tsv', 'w') as fw:
            fw.write(r.text)
        
        # if met maximum retries, raise an error
        retry_count = retry_count + 1
        if retry_count == 4:
            raise

## Assessing Data

#### Columns of Archive data

In [6]:
archive_df.head()

In [7]:
list(archive_df)

Several columns contain data not necessary for this analysis. I intend to remove them for clarity below. Slated for removal:
- in_reply_to_status_id
- in_reply_to_user_id
- source
- retweeted_status_user_id
- retweeted_status_timestamp

#### Rating Denominators

In [7]:
archive_df.rating_denominator.value_counts(sort=False)

Ratings are almost exclusively out of 10. Ratings out of a multiple of 10 tend to contain multiple dogs. Let's look into the outlier rating denominators.

In [8]:
c_denom = [0, 2, 7, 11, 15, 16]
# filter on the denominator column, since c_denom lists outlier denominators
archive_df[archive_df.rating_denominator.isin(c_denom)][['text', 'rating_numerator', 'rating_denominator']]

All of the above ratings appear to be incorrect and will be corrected in the cleaning section below.
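
One way to find candidates for manual review is to pull every fraction-like pattern out of the tweet text. The sketch below is an assumption about how such a check could look, not the method used in this notebook; the tweet texts are made up and the regular expression is only one reasonable choice.

```python
import pandas as pd

# Made-up tweet texts, for illustration only
texts = pd.Series([
    "Meet Rex. 9.5/10 would pet",
    "Open 24/7 and still a 12/10 pup",
    "No rating here",
])

# Every "number/number" pattern per tweet, decimals allowed in the numerator
pattern = r'(\d+(?:\.\d+)?)/(\d+)'
candidates = texts.str.findall(pattern)

# Tweets with more than one candidate are the ones worth reading by hand
ambiguous = candidates.str.len() > 1

print(candidates.tolist())
print(ambiguous.tolist())
```

Tweets with two or more matches are where an automatic parser is most likely to pick the wrong fraction.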

#### Rating Numerators

In [9]:
archive_df.rating_numerator.value_counts(sort=False)

There are many more possible numerators, but a few stand out as outliers or non-conforming.

In [10]:
c_numer = [0, 420, 666, 1776]
archive_df[archive_df.rating_numerator.isin(c_numer)][['text', 'rating_numerator', 'rating_denominator']]

Since joke ratings seem valid for a humor account, the tweets that feature dogs will be kept for now. The Snoop Dogg tweet and tweets without dogs will be deleted.

#### Names

In [11]:
archive_df.name.value_counts()

A large set of tweets have dogs named 'None'. Let's see if that's correct.

In [12]:
archive_df[archive_df.name == 'None'].sample(20, random_state=1210)

A random sample indicates that these tweets do, in fact, lack names. The value counts above also suggest that a name written entirely in lower case is usually an ordinary word that was misread as a name.

In [13]:
archive_df[archive_df.name.str.islower()]

Names that are lower case appear to typically belong to *descriptions* of the dog. Those, and names that are None, will be converted to blanks in the next section.

#### Columns of Tweets Download

In [14]:
tweets_df.head()

In [15]:
list(tweets_df)

There are 158 columns in the normalized download of tweets. I'll remove all except the useful columns from this view. For inclusion:
- id_str
- retweet_count
- favorite_count

#### Columns of Image Prediction

In [16]:
images_df.head()

In [17]:
images_df.reset_index().tweet_id.nunique(), len(images_df)

All predictions are on single rows, without duplicate tweet_id.
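
The same uniqueness check can be written against the index directly. A minimal sketch on a toy frame (the IDs and breeds are made up) that also lists any offending rows:

```python
import pandas as pd

# Toy frame with a deliberately duplicated tweet_id
toy = pd.DataFrame(
    {'p1': ['pug', 'corgi', 'pug']},
    index=pd.Index(['1', '2', '2'], name='tweet_id'),
)

# False as soon as any index label repeats
print(toy.index.is_unique)

# keep=False marks every member of a duplicated group, not just the repeats
dupes = toy[toy.index.duplicated(keep=False)]
print(dupes)
```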

In [18]:
plt.subplot(1, 3, 1)
plt.boxplot(images_df[images_df.p1_dog == True].p1_conf)
plt.ylim(ymax=1)

plt.subplot(1, 3, 2)
plt.boxplot(images_df[images_df.p2_dog == True].p2_conf)
plt.ylim(ymax=1)

plt.subplot(1, 3, 3)
plt.boxplot(images_df[images_df.p3_dog == True].p3_conf)
plt.ylim(ymax=1)

plt.show()

The highest confidence predictions come from Prediction 1.

## Cleaning and Tidying Data
- Low quality (dirty) data has content issues. Fix 8 of these.
- Untidy (messy) data has structural issues. Fix 2 of these.

#### Remove Extra Archive Columns
Remove columns with data unnecessary for analysis.

In [19]:
archive_df.drop(columns=['in_reply_to_status_id', 'in_reply_to_user_id',
                         'source', 'retweeted_status_user_id',
                         'retweeted_status_timestamp'], inplace=True, errors='ignore')

#### Remove Extra Tweet Download Columns
Remove all columns except those needed for analysis.

In [20]:
tweets_df = tweets_df[['id_str', 'retweet_count', 'favorite_count']].set_index('id_str')

#### There Aren't Even Dogs in These (Remove Tweets without Dogs)
Some tweets had no dogs in them or they weren't specifically about the dog; remove these.

In [21]:
r_col = ['835152434251116546', '746906459439529985', '670842764863651840', '855862651834028034']
archive_df.drop(index=r_col, inplace=True)

#### Remove Retweets and Tweets with No Photos
We love them all, but we want to see original tweets with photos.

In [22]:
archive_df = archive_df[archive_df.retweeted_status_id.isna() &
                        ~archive_df.expanded_urls.isna()]
len(archive_df)

After removal of retweets and tweets without photos, 2114 tweets remain.

#### DoggoLingo (Tidy Structure)
Combine *doggo, floofer, pupper,* and *puppo* columns into a single *doggolingo* variable.

In [23]:
cols = ['doggo', 'floofer', 'pupper', 'puppo']
archive_df[cols] = archive_df[cols].replace('None', '')
archive_df['doggolingo'] = archive_df[cols].apply(lambda x: ';'.join(filter(None, x)), axis=1)

archive_df.drop(cols, axis='columns', inplace=True, errors='ignore')
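
To show what the combined column looks like, here is the same transformation on a three-row toy frame (the rows are made up). A tweet with no stage yields an empty string, and a tweet with two stages yields both, separated by a semicolon.

```python
import pandas as pd

cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy rows: one stage, no stage, two stages
toy = pd.DataFrame({
    'doggo':   ['doggo', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None'],
})

# Blank out the 'None' placeholders, then join whatever is left in each row
stages = toy[cols].replace('None', '')
toy['doggolingo'] = stages.apply(lambda row: ';'.join(filter(None, row)), axis=1)

print(toy.doggolingo.tolist())
```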

#### A Dog Named None (Correct Names)

For names that are 'None' or all lower case, convert them to blanks.

In [24]:
archive_df.loc[(archive_df.name=='None') |
               (archive_df.name.str.islower()), 'name'] = ''
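
A short sketch of the same mask on made-up names shows which values are blanked. Capitalized names pass through unchanged; `'None'` and all-lowercase words do not.

```python
import pandas as pd

# Made-up name column, for illustration only
toy = pd.DataFrame({'name': ['Charlie', 'None', 'quite', 'such', 'Bo']})

# True where the name is the 'None' placeholder or entirely lower case
not_a_name = (toy.name == 'None') | toy.name.str.islower()
toy.loc[not_a_name, 'name'] = ''

print(toy.name.tolist())
```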

#### They're Good Dogs (Correct Ratings)
Update ratings that were found with incorrect denominators.

In [25]:
# Corrected ratings for tweets whose rating was parsed incorrectly
update_ratings = pd.DataFrame([
    {'tweet_id': '835246439529840640', 'rating_numerator': 13, 'rating_denominator': 10},
    {'tweet_id': '775096608509886464', 'rating_numerator': 14, 'rating_denominator': 10},
    {'tweet_id': '740373189193256964', 'rating_numerator': 14, 'rating_denominator': 10},
    {'tweet_id': '682962037429899265', 'rating_numerator': 10, 'rating_denominator': 10},
    {'tweet_id': '666287406224695296', 'rating_numerator': 9, 'rating_denominator': 10}
]).set_index('tweet_id')

# Tweets with no valid rating; DataFrame.update skips NaN values, so blank these separately
no_rating_ids = ['832088576586297345', '810984652412424192', '682808988178739200']
rating_cols = ['rating_numerator', 'rating_denominator']

# Cast to float so the rating columns can hold NaN
archive_df[rating_cols] = archive_df[rating_cols].astype(float)
archive_df.update(update_ratings, overwrite=True)
archive_df.loc[archive_df.index.isin(no_rating_ids), rating_cols] = float('nan')
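
One caveat worth checking: `DataFrame.update` copies only non-NA values, so a NaN in a correction table leaves the existing value in place. The sketch below demonstrates this on made-up data and shows one way to blank such rows explicitly; the frame and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

idx = pd.Index(['a', 'b'], name='tweet_id')

# Toy ratings (float so the columns can hold NaN) and a correction table
toy = pd.DataFrame({'rating_numerator': [24.0, 960.0],
                    'rating_denominator': [7.0, 0.0]}, index=idx)
fixes = pd.DataFrame({'rating_numerator': [13.0, np.nan],
                      'rating_denominator': [10.0, np.nan]}, index=idx)

# update() applies the 13/10 fix but skips the all-NaN row
toy.update(fixes)
after_update = toy.copy()

# Blank the rows whose correction is entirely NaN
blank_ids = fixes.index[fixes.isna().all(axis=1)]
toy.loc[blank_ids] = np.nan

print(after_update)
print(toy)
```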

#### Normalize Ratings
For ratings that contain multiple dogs (denominators are a multiple of 10), convert the rating to a 10 point scale and save separately.

In [31]:
archive_df[archive_df.rating_denominator > 10].head()
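
The conversion to a 10 point scale described above could look like the following. This is a sketch on made-up ratings, and the column name `rating_per_10` is a hypothetical choice, not one defined elsewhere in this notebook.

```python
import pandas as pd

# Made-up ratings: one single-dog tweet and two multi-dog tweets
toy = pd.DataFrame({
    'rating_numerator':   [12, 84, 165],
    'rating_denominator': [10, 70, 150],
})

# Scale every rating to "out of 10" and keep the original columns untouched
toy['rating_per_10'] = toy.rating_numerator * 10 / toy.rating_denominator

print(toy)
```

Storing the scaled value in its own column keeps the original rating available for later checks.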

#### Award for the Most Confident Dog (Tidy Structure and Simplify Predictions)
For each tweet, keep the highest-confidence prediction that is classified as a dog and treat it as the final breed prediction.

In [26]:
# create three temporary DataFrames of info, rename columns to append all stacked
images_p1_df = images_df.reset_index()[['tweet_id', 'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog']]
images_p2_df = images_df.reset_index()[['tweet_id', 'jpg_url', 'img_num', 'p2', 'p2_conf', 'p2_dog']]
images_p3_df = images_df.reset_index()[['tweet_id', 'jpg_url', 'img_num', 'p3', 'p3_conf', 'p3_dog']]

images_p1_df.rename(columns={'p1': 'breed', 'p1_conf': 'conf', 'p1_dog': 'is_dog'}, inplace=True)
images_p2_df.rename(columns={'p2': 'breed', 'p2_conf': 'conf', 'p2_dog': 'is_dog'}, inplace=True)
images_p3_df.rename(columns={'p3': 'breed', 'p3_conf': 'conf', 'p3_dog': 'is_dog'}, inplace=True)

# stack the three prediction tables (DataFrame.append was removed in pandas 2.0)
a = pd.concat([images_p1_df, images_p2_df, images_p3_df])

# Select only 'dog' images, then keep only the highest confidence dog image
# reset index and massage DataFrame for later use
images_df = (a[a.is_dog==True]
              .groupby(['tweet_id', 'jpg_url', 'img_num'])
              .apply(lambda x: x[x.conf==x.conf.max()])
              .drop(columns=['tweet_id', 'jpg_url', 'img_num'])
              .reset_index()
              .set_index('tweet_id')
              .drop(columns=['level_3'])
            )
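
An alternative way to pick the most confident dog prediction per tweet is `groupby(...).idxmax()`. The sketch below uses made-up predictions and is not the code used in this notebook. Unlike the filter on `conf.max()`, it keeps exactly one row per tweet even when two predictions tie.

```python
import pandas as pd

# Made-up stacked predictions: three per tweet, in the long format built above
stacked = pd.DataFrame({
    'tweet_id': ['1', '1', '1', '2', '2', '2'],
    'breed':    ['pug', 'bull_mastiff', 'seat_belt', 'paper_towel', 'chow', 'collie'],
    'conf':     [0.6, 0.3, 0.05, 0.5, 0.3, 0.1],
    'is_dog':   [True, True, False, False, True, True],
})

# Keep dog predictions only, then take the row with the highest confidence per tweet
dogs = stacked[stacked.is_dog]
best = dogs.loc[dogs.groupby('tweet_id').conf.idxmax()].set_index('tweet_id')

print(best)
```

Tweet `2` shows why the dog filter comes first: its top prediction overall is not a dog, so the second prediction is selected.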

#### Clean Dog Breeds
Clean the names of breeds to remove underscores and convert to title case.

In [27]:
images_df.breed = images_df.breed.str.replace('_', ' ').str.title()

#### Join Tables
Join the information from the tweet details downloaded from Twitter and image predictions to the WeRateDogs Twitter archive.

In [28]:
master_df = archive_df.join(tweets_df)
master_df = master_df.join(images_df)
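
`DataFrame.join` defaults to a left join on the index, so every archive row is kept and unmatched rows receive NaN. A toy sketch (made-up rows) of how to count the tweets that found no match:

```python
import pandas as pd

# Made-up archive with two tweets, and download data for only one of them
archive = pd.DataFrame({'name': ['Rex', 'Bo']},
                       index=pd.Index(['1', '2'], name='tweet_id'))
counts = pd.DataFrame({'retweet_count': [5]},
                      index=pd.Index(['1'], name='id_str'))

# Left join on the index: both archive rows survive
joined = archive.join(counts)

# Rows without a match carry NaN in the joined columns
missing = int(joined.retweet_count.isna().sum())

print(joined)
print(missing)
```

Running the same count on the master table shows how many tweets lack download data or a dog prediction.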

## Store Twitter Archive Master
Store the final cleaned archive according to project requirements.

*Store the clean DataFrame(s) in a CSV file with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, name these files appropriately.*

In [29]:
master_df.to_csv('twitter_archive_master.csv')

## Visualization and Insights