# Wrangle Act
### By Julio Uribe

The purpose of this project is to expand on data wrangling skills. In this file we gather data about "WeRateDogs" Twitter posts from three sources: the provided Twitter enhanced archive file (used for its tweet IDs), the Twitter API queried directly, and a TSV file of neural net results pulled from Udacity's server.

# Setup: Import Modules

In [1]:
import tweepy
from tweepy import OAuthHandler
import json
import numpy as np
from timeit import default_timer as timer
import pandas as pd
import requests

# Gather Data

## First Source File: Twitter Enhanced file and set up API keys

In [2]:
#load file info into dataframe for tweet id's to use later
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')
#load tweet ids for api extraction
tweet_ids = twitter_archive.tweet_id.values

#Set up API credentials from file outside directory
creds = []
with open('/Users/Jules/Desktop/DAND/twitter_credentials.txt', 'r') as f:
    creds = f.read().split("'")
consumer_key = creds[1]
consumer_secret = creds[3]
access_token = creds[5]
access_secret = creds[7]
#create auth object with keys
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
#create tweepy api object for requests
api = tweepy.API(auth, wait_on_rate_limit = True)

print (len(tweet_ids))

2356


## Second Source File: Query Twitter's API for JSON data

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # Loop pauses/resumes at about 900 tweets due to api's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)

### Load JSON data we got from Twitter API into a cleaner dataframe

In [4]:
#load twitter json file into a pandas dataframe
tweets_json_full = pd.read_json("tweet_json.txt", lines=True)
#tweets_json_full.info()

In [5]:
#create a smaller version of tweets_json_full with only the columns we're interested in
tweets_json = pd.DataFrame(tweets_json_full[['id', 'created_at', 'favorite_count', 'retweet_count', 'full_text', 'extended_entities']])
#tweets_json.head()
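
The two cells above rely on `pd.read_json(..., lines=True)` and then select columns. As a minimal sketch of what that amounts to, the snippet below parses line-delimited JSON by hand and keeps only a few keys. The two records are made-up stand-ins for lines of `tweet_json.txt`, and the names `sample_lines`, `wanted`, and `tweets_sample` are assumptions for illustration only.

```python
import io
import json

import pandas as pd

# Two made-up records standing in for lines of tweet_json.txt (one JSON object per line)
sample_lines = io.StringIO(
    '{"id": 1, "favorite_count": 10, "retweet_count": 2, "full_text": "Good dog 13/10"}\n'
    '{"id": 2, "favorite_count": 5, "retweet_count": 1, "full_text": "Also good 12/10"}\n'
)

# Keys we want to keep from each tweet
wanted = ["id", "favorite_count", "retweet_count", "full_text"]

records = []
for line in sample_lines:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    tweet = json.loads(line)
    # .get() returns None when a key is missing instead of raising KeyError
    records.append({key: tweet.get(key) for key in wanted})

tweets_sample = pd.DataFrame(records, columns=wanted)
print(tweets_sample)
```

Parsing by hand like this can be useful when only a handful of fields are needed, since it avoids building the full dataframe first.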

In [6]:
#seeing if we can extract anything interesting from the extended_entities values
# for i in range(5):
#     print(tweets_json_full['extended_entities'][i]['media'][0]['url'])

## Third Source file: Use Requests Module to Load Neural Net Results

In [7]:
r = requests.get('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')
with open("image_predictions.tsv", 'wb') as file:
    for chunk in r.iter_content(chunk_size=128):
        file.write(chunk)
#read file back in and create a df for image predictions data     
image_predictions = pd.read_csv("image_predictions.tsv", sep='\t')

# Assessing Data

For this project, we have three dataframes we're currently working with: 
* twitter_archive - imported tweet info from twitter-archive-enhanced.csv provided by Udacity. Has tweet IDs, tweet text, ratings, and other information.
* tweets_json - data from twitter API containing retweets, favorited count, tweet text, and more.
* image_predictions - results from neural net. Contains three predictions, image url, number of images, etc.

The three dataframes provide info about the tweets posted from the WeRateDogs twitter profile. We'll do some assessing of the data before we merge these dataframes together. Then we'll clean the data to get the most complete data set we can.

## Quality Issues
* twitter_archive has 2356 entries, tweets_json has 2340 entries, and image_predictions has 2075 entries
* 'None' values are strings but should be NaN values
* In 'twitter_archive', incorrect and missing names under 'name' column: 'None', 'a', 'the', 'an', etc.
* In 'twitter_archive', a few rating numerators are under 10 but according to the twitter profile, all ratings should be above 10
* In 'twitter_archive', there are rows where the 'rating_denominator' is lower or higher than 10. We need to standardize all rows to be out of 10.
* In 'twitter_archive', timestamp column is in string format instead of datetime.
* In 'image_predictions', results from the neural net in p1 are in inconsistent lower/upper case. Need to make them consistent
* In 'tweets_json', 'id' column should be renamed to 'tweet_id' to be consistent with other two dataframes
* In 'image_predictions', we get invalid results from our neural net in p1 such as 'desktop_computer', 'electric_fan', 'wild_boar'.
* In 'image_predictions', some predicted dog breeds aren't actual dog breeds
* From 'image_predictions', I see not all twitter posts actually have dogs in the post. We should toss these out
* From 'image_predictions', we have tweets that do not have a dog present. We should toss out rows that do not have a dog present. Consider using three prediction values to toss out tweets?
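
Two of the issues above (inconsistent case in the prediction columns, and rows with no dog present) could be handled along these lines. This is only a sketch on a made-up three-row frame that borrows the column names of `image_predictions`; the rule "keep a row if any of the three predictions is a dog" is an assumption, not a decision taken in this notebook.

```python
import pandas as pd

# Made-up rows using the same column names as image_predictions
preds = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "p1": ["Golden_Retriever", "coffee_mug", "pug"],
    "p1_dog": [True, False, True],
    "p2": ["Labrador_retriever", "cup", "desk"],
    "p2_dog": [True, False, False],
    "p3": ["kuvasz", "espresso", "bulldog"],
    "p3_dog": [True, False, True],
})

# Make the case of every prediction label consistent
for col in ["p1", "p2", "p3"]:
    preds[col] = preds[col].str.lower()

# Assumed rule: keep a row when at least one of the three predictions is a dog
has_dog = preds[["p1_dog", "p2_dog", "p3_dog"]].any(axis=1)
dogs_only = preds[has_dog].reset_index(drop=True)
print(dogs_only[["tweet_id", "p1"]])
```

A stricter rule (for example, requiring `p1_dog` to be True) would drop more rows; which threshold is right depends on how much the predictions are trusted.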


## Tidiness Issues
* In 'twitter_archive', the last four columns (doggo, floofer, pupper, puppo) are not always observed and best serve as a category. We should combine these 4 columns into one
* In 'twitter_archive', there are multiple values in 'source' column. Clean up column values to iphone, vine, web client, etc.
* In 'tweets_json', we have multiple values in the 'extended_entities' column.
* twitter_archive has 2356 entries, tweets_json has 2340 entries, and image_predictions has 2075 entries. We'll merge later on tweet IDs.

In [8]:
#twitter_archive

In [9]:
# twitter_archive has 2356 entries, tweets_json has 2340 entries, and image_predictions has 2075 entries. We'll merge later
# on tweet IDs.
#twitter_archive.describe()
#tweets_json.info()
#twitter_archive.isnull().sum()
#type(twitter_archive.timestamp[0])
#twitter_archive.source.value_counts()

In [10]:
#twitter_archive.name.value_counts()

In [11]:
#twitter_archive.rating_denominator.value_counts()

In [12]:
#twitter_archive.rating_numerator.value_counts()

In [13]:
#tweets_json

In [14]:
# tweets_json.isnull().sum()
# tweets_json.info()

In [15]:
# image_predictions.isnull().sum()
# image_predictions

In [16]:
# image_predictions.describe()

In [17]:
#image_predictions.p1.value_counts()
# image_predictions.p1.value_counts()
#image_predictions.p3.value_counts()

In [18]:
# if it hits false multiple times, toss out row
#image_predictions[image_predictions.p3 == 'space_shuttle']

In [19]:
#image_predictions[image_predictions.p1 == 'coffee_mug'].jpg_url

In [20]:
# there's a good chance that a large part of our data set doesn't actually contain dogs in the image, throwing off ratings
#image_predictions['p1_dog'].mean(), image_predictions['p2_dog'].mean(), image_predictions['p3_dog'].mean()

In [21]:
# explore prediction results for tweets with more than one image. How does the neural net handle multiple images?
# multi_pic = image_predictions[image_predictions["img_num"] > 1]
# multi_pic

In [22]:
# let's compare the average p1_dog, p2_dog, p3_dog rates from multiple images to the whole dataframe
# multi_pic['p1_dog'].mean(), multi_pic['p2_dog'].mean(), multi_pic['p3_dog'].mean()
# Tweets with multiple images are more likely to have a dog in them than the general dog prediction rate from the entire dataframe

In [23]:
#checking for duplicated values
# twitter_archive[twitter_archive.tweet_id.duplicated()]
# tweets_json[tweets_json.id.duplicated()]
# image_predictions[image_predictions.tweet_id.duplicated()]

# Cleaning Data

### Create copies for data

In [24]:
# Create copies of all three dataframes
twitter_archive_clean = twitter_archive.copy()
tweets_json_clean = tweets_json.copy()
image_predictions_clean = image_predictions.copy()

## Goal is to merge all three data sets. First we need to clean up some columns

### First Merge and Cleaning

In [25]:
# Define
# Rename a few columns in the dataframes for consistency
# Clean
tweets_json_clean.rename(columns={'id':'tweet_id', 'created_at': 'timestamp'}, inplace=True)
twitter_archive_clean.rename(columns={'text':'full_text', 'created_at': 'timestamp'}, inplace=True)
# Test: Make sure column names are consistent when shared/overlapping
#twitter_archive_clean.columns, tweets_json_clean.columns

In [26]:
# Define
# I'm going to delete several columns that are less interesting in the dataframes
#Clean
twitter_archive_clean.drop(['in_reply_to_status_id', 'in_reply_to_user_id', 'source', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp', 'expanded_urls'], axis=1, inplace=True)
tweets_json_clean.drop(['extended_entities'], axis=1, inplace=True)
# Test: Make sure all the appropriate columns have been deleted
#twitter_archive_clean.columns, tweets_json_clean.columns

In [27]:
# Define
# Merge twitter_archive_clean and tweets_json_clean together using 'tweet_id'
tweets_super_clean = tweets_json_clean.merge(twitter_archive_clean, how='inner', on='tweet_id')
# Test: let's see what columns we have now and if the merge is doing what we want it to do
#tweets_super_clean.head()
#tweets_super_clean.info()
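
An inner merge silently drops tweet IDs that appear in only one dataframe. One way to see which IDs those are is an outer merge with `indicator=True`, which adds a `_merge` column. The sketch below uses two tiny made-up frames (`left` and `right` are illustrative names, not variables from this notebook).

```python
import pandas as pd

# Made-up frames that overlap on tweet_id 2 and 3 only
left = pd.DataFrame({"tweet_id": [1, 2, 3], "retweet_count": [5, 6, 7]})
right = pd.DataFrame({"tweet_id": [2, 3, 4], "rating_numerator": [12, 13, 11]})

# Outer merge keeps every row; _merge says where each row came from
check = left.merge(right, how="outer", on="tweet_id", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["tweet_id", "_merge"]])

# The inner merge keeps only the IDs present in both frames
merged = left.merge(right, how="inner", on="tweet_id")
print(len(merged))
```

Checking the unmatched rows before merging makes it easier to explain why the row count shrinks after each merge.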

In [28]:
# Before we move onto our second merge, we can clean up this data set a bit more
# Define
# timestamp_x and timestamp_y show the same data but timestamp_x is already in the datetime format we want so we'll keep that one
# two columns for full text as well. We'll keep the first one
# Drop the columns
tweets_super_clean.drop(['timestamp_y', 'full_text_y'], axis=1, inplace=True)
# Rename the columns
tweets_super_clean.rename(columns={'timestamp_x':'timestamp', 'full_text_x': 'full_text'}, inplace=True)
# Test: verify our column surgery was clean and successful
tweets_super_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2339 entries, 0 to 2338
Data columns (total 12 columns):
tweet_id              2339 non-null int64
timestamp             2339 non-null datetime64[ns]
favorite_count        2339 non-null int64
retweet_count         2339 non-null int64
full_text             2339 non-null object
rating_numerator      2339 non-null int64
rating_denominator    2339 non-null int64
name                  2339 non-null object
doggo                 2339 non-null object
floofer               2339 non-null object
pupper                2339 non-null object
puppo                 2339 non-null object
dtypes: datetime64[ns](1), int64(5), object(6)
memory usage: 237.6+ KB


### Second Merge

In [29]:
# We'll now merge tweets_super_clean with image_predictions_clean using tweet_ids
tweets_super_clean = tweets_super_clean.merge(image_predictions_clean, on='tweet_id', how='inner')
tweets_super_clean.columns

Index(['tweet_id', 'timestamp', 'favorite_count', 'retweet_count', 'full_text',
       'rating_numerator', 'rating_denominator', 'name', 'doggo', 'floofer',
       'pupper', 'puppo', 'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog',
       'p2', 'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog'],
      dtype='object')

In [30]:
# None values should be replaced with NaN's to reflect missing data
tweets_super_clean.replace('None', np.nan, inplace=True)
tweets_super_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2066 entries, 0 to 2065
Data columns (total 23 columns):
tweet_id              2066 non-null int64
timestamp             2066 non-null datetime64[ns]
favorite_count        2066 non-null int64
retweet_count         2066 non-null int64
full_text             2066 non-null object
rating_numerator      2066 non-null int64
rating_denominator    2066 non-null int64
name                  1491 non-null object
doggo                 80 non-null object
floofer               8 non-null object
pupper                222 non-null object
puppo                 24 non-null object
jpg_url               2066 non-null object
img_num               2066 non-null int64
p1                    2066 non-null object
p1_conf               2066 non-null float64
p1_dog                2066 non-null bool
p2                    2066 non-null object
p2_conf               2066 non-null float64
p2_dog                2066 non-null bool
p3                    2066 non-null objec

In [31]:
# Define
# There are rows where the 'rating_denominator' is lower or higher than 10. We need to standardize all rows to be out of 10
# Clean
tweets_super_clean.rating_denominator = 10
# Test
tweets_super_clean.rating_denominator.value_counts()

10    2066
Name: rating_denominator, dtype: int64
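
The cell above overwrites every denominator with 10 but leaves the numerators untouched, so a rating such as 84/70 would become 84/10. If the intent is to keep each rating proportional, the numerator can be rescaled first. This is a sketch on made-up ratings, assuming no denominator is zero; it is an alternative, not what the cell above does.

```python
import pandas as pd

# Made-up ratings, some of them for groups of dogs (denominator above 10)
ratings = pd.DataFrame({
    "rating_numerator": [13, 84, 165, 9],
    "rating_denominator": [10, 70, 150, 10],
})

# Rescale the numerator so the rating keeps its value when expressed out of 10
# (assumes no zero denominators; those rows would need separate handling)
ratings["rating_numerator"] = (
    ratings["rating_numerator"] * 10 / ratings["rating_denominator"]
)
ratings["rating_denominator"] = 10
print(ratings)
```

Note that the rescaled numerator becomes a float column, since not every rating divides evenly.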

In [32]:
# Define
#In 'twitter_archive', correct all 'a', 'the', 'an', etc. dog names by replacing them with NaN values
# Bad names tend to start with lowercase so we'll put all the lower case names into a list of bad_names
bad_names = ['None']
# put all the names into a series
names_left = tweets_super_clean.name.value_counts()
for i in names_left.index:
    if i.islower():
        bad_names.append(i)
# iterate through our df and replace bad names with NaN values
for name in bad_names:
    tweets_super_clean.name.replace(name, np.nan, inplace=True)
tweets_super_clean.name.value_counts()

Tucker      10
Penny       10
Lucy        10
Charlie     10
Oliver      10
Cooper      10
Winston      8
Sadie        8
Bo           8
Lola         8
Daisy        7
Toby         7
Bella        6
Jax          6
Rusty        6
Scout        6
Dave         6
Koda         6
Milo         6
Bailey       6
Stanley      6
Leo          5
Chester      5
Buddy        5
Alfie        5
Louis        5
Larry        5
Oscar        5
Sophie       4
Gus          4
            ..
Jackie       1
Jay          1
Chase        1
Levi         1
Dido         1
Ralphson     1
Simba        1
Mairi        1
Jim          1
Rodney       1
Jo           1
Juckson      1
Franq        1
Crimson      1
Bayley       1
Carll        1
Jennifur     1
Bobb         1
Rey          1
Kayla        1
Clyde        1
Covach       1
Malikai      1
Goose        1
Timber       1
Alf          1
Tedrick      1
Cupid        1
Ashleigh     1
Perry        1
Name: name, Length: 912, dtype: int64
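
The same lowercase rule can be applied without looping over `value_counts()`. The sketch below builds a boolean mask and blanks the matching names in one step; the `names` series is made up, and `is_bad` and `clean_names` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up names, including lowercase words, the 'None' placeholder, and a real NaN
names = pd.Series(["Tucker", "a", "None", "the", "Penny", np.nan], dtype=object)

# A name is bad if it is a string that is all lowercase or the literal 'None'.
# The isinstance check keeps NaN (a float) from raising an AttributeError.
is_bad = names.apply(lambda n: isinstance(n, str) and (n.islower() or n == "None"))

# mask() replaces the flagged entries with NaN and leaves the rest untouched
clean_names = names.mask(is_bad)
print(clean_names)
```

This flags exactly what the loop above flags, so it shares the same limitation: a real name that happens to be written in lowercase would also be removed.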

In [33]:
# Test
# Look at the full text from tweets where we found no dog name and manually look over to confirm there aren't dog names
#need_names = tweets_super_clean[tweets_super_clean.name.isnull()].full_text
#for i in need_names:
#    print (i)
# Our visual assessment here doesn't catch dog names that we skipped. We have many tweets w/o a specific dog name mentioned.

In [34]:
tweets_super_clean.columns
#tweets_super_clean['stage'] = tweets_super_clean.doggo + tweets_super_clean.floofer + tweets_super_clean.pupper + tweets_super_clean.puppo
#tweets_super_clean.stage.value_counts()
#tweets_super_clean.drop(['stage'], axis=1, inplace=True)
tweets_super_clean.iloc[:, 8:12]

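   doggo floofer pupper puppo
0    NaN     NaN    NaN   NaN
1    NaN     NaN    NaN   NaN
2    NaN     NaN    NaN   NaN
3    NaN     NaN    NaN   NaN
4    NaN     NaN    NaN   NaN
5    NaN     NaN    NaN   NaN
6    NaN     NaN    NaN   NaN
7    NaN     NaN    NaN   NaN
8    NaN     NaN    NaN   NaN
9  doggo     NaN    NaN   NaN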


In [35]:
# replace doggo, pupper, floofer, and puppo columns strings with 1's to stack back
super_copy = tweets_super_clean.copy()
#mapping = {'doggo': 1, 'pupper': 1, 'floofer': 1, 'puppo': 1}
#super_copy.iloc[['doggo','pupper', 'floofer', 'puppop']].replace({'doggo': mapping, 'pupper': mapping, 'floofer': mapping, 'puppo': mapping}, inplace=True)

# super_copy.doggo.replace('doggo', 1, inplace=True)
# super_copy.doggo.replace('None', np.nan, inplace=True)
# super_copy.pupper.replace('pupper', 1, inplace=True)
# super_copy.floofer.replace('floofer', 1, inplace=True)
# super_copy.puppo.replace('puppo', 1, inplace=True)
#super_copy['stage'] = list(zip(super_copy.doggo, super_copy.floofer, super_copy.pupper, super_copy.puppo))
#super_copy['stage'] = super_copy.doggo.astype(str)+super_copy.floofer.astype(str)+super_copy.pupper.astype(str)+super_copy.puppo.astype(str)
super_copy['stage'] = super_copy.doggo.combine_first(super_copy.floofer).combine_first(super_copy.pupper).combine_first(super_copy.puppo)
#(++super_copy.pupper+super_copy.puppo).astype(str)

# for index, row in super_copy.iterrows():
#     if row.doggo+row.pupper+row.floofer+row.puppo > 1:
#         row.mixed = 1

super_copy.stage.value_counts()
#df['stages'] = (super_copy.iloc[:, 1:] == 1).idxmax(1)
# type(tweets_super_clean.doggo[5])
# tweets_super_clean.doggo[5]

pupper     211
doggo       80
puppo       23
floofer      7
Name: stage, dtype: int64
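
`combine_first` keeps only the first non-null value per row, so a tweet tagged with two stages is counted under the earlier column only. That would explain why pupper drops from 222 non-null values to 211 in the counts above. A sketch of one alternative that keeps every stage found in a row is shown below; the data is made up and `join_stages` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows: one doggo, one pupper, one tagged as both, one with no stage
stages = pd.DataFrame({
    "doggo":   ["doggo", np.nan, "doggo", np.nan],
    "floofer": [np.nan, np.nan, np.nan, np.nan],
    "pupper":  [np.nan, "pupper", "pupper", np.nan],
    "puppo":   [np.nan, np.nan, np.nan, np.nan],
})
stage_cols = ["doggo", "floofer", "pupper", "puppo"]


def join_stages(row):
    """Join every stage string found in the row; return NaN when there is none."""
    found = [value for value in row if isinstance(value, str)]
    return ", ".join(found) if found else np.nan


stages["stage"] = stages[stage_cols].apply(join_stages, axis=1)
print(stages["stage"].value_counts())
```

Rows with two stages end up with a combined label such as `doggo, pupper`, which can then be reviewed by hand or treated as its own category.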

In [36]:
for i in super_copy.stage:
    for j in i:
        print (type(j))


TypeError: 'float' object is not iterable
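
The error above comes from rows with no stage: a missing value is stored as NaN, which is a Python `float`, and a float cannot be looped over the way a string can. A small sketch that inspects the types safely, using a made-up `stage` series:

```python
import numpy as np
import pandas as pd

# Made-up stage column with one missing value
stage = pd.Series(["doggo", np.nan, "pupper"], dtype=object)

# Look at the type of each value without iterating inside it
value_types = [type(value).__name__ for value in stage]
print(value_types)

# Only strings are safe to iterate character by character, so filter first
labelled = [value for value in stage if isinstance(value, str)]
print(labelled)
```

Filtering with `isinstance(value, str)` (or calling `stage.dropna()` first) avoids the TypeError when the goal is to work only with rows that have a stage.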

## Quality Issues
* In 'twitter_archive', incorrect and missing names under 'name' column: 'None', 'a', 'the', 'an', etc.
* 'None' values are strings but should be NaN values
* In 'twitter_archive', a few rating numerators are under 10 but according to the twitter profile, all ratings should be above 10
* In 'twitter_archive', there are rows where the 'rating_denominator' is lower or higher than 10. We need to standardize all rows to be out of 10.
* ~~In 'twitter_archive', timestamp column is in string format instead of datetime.~~
* In 'image_predictions', results from the neural net in p1 are in inconsistent lower/upper case. Need to make them consistent
* ~~In 'tweets_json', 'id' column should be renamed to 'tweet_id' to be consistent with other two dataframes~~
* In 'image_predictions', we get invalid results from our neural net in p1 such as 'desktop_computer', 'electric_fan', 'wild_boar'.
* In 'image_predictions', some predicted dog breeds aren't actual dog breeds
* From 'image_predictions', not all twitter posts actually have dogs in the post
* From 'image_predictions', we have tweets that do not have a dog present. We should toss out rows that do not have a dog present. Consider using three prediction values to toss out tweets?


## Tidiness Issues
* ~~all three dataframes should be one table~~
* ~~rename several columns before merging for consistency: full_text, tweet_id, timestamp to be the standard~~
* ~~get rid of less interesting columns so that our dataframe isn't massive when fully merged~~
* ~~twitter_archive has 2356 entries, tweets_json has 2340 entries, and image_predictions has 2075 entries~~
* In 'twitter_archive', the last four columns (doggo, floofer, pupper, puppo) are not always observed and best serve as a category. We should combine these 4 columns into one