# Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#questions">Questions to Answer</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Data Visualization and Analysis</a></li>
<li><a href="#conclusions">Conclusions and Insights</a></li>
</ul>


<a id='intro'></a>
## Introduction

This data analytics project uses tools such as NumPy, pandas, Requests, Tweepy, json, Beautiful Soup and Matplotlib to analyze datasets pertaining to the highly popular Twitter account [@WeRateDogs](https://twitter.com/dog_rates?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor).

Launched in 2015 by then Campbell University student Matt Nelson, We Rate Dogs posts submitted photos of adorable dogs, each accompanied by a rating and a snarky, often humorous comment. As of writing, We Rate Dogs has over 10 million followers across Twitter and Instagram combined.

The datasets we will be analyzing are We Rate Dogs' Twitter archive, the image predictions for each tweet (i.e., what breed of dog a neural network believes is present in the image) and, at minimum, each tweet's retweet count and favorite ("like") count.

<a id='wrangling'></a>
## Data Wrangling

### Gather

In [1]:
#Import libraries
import pandas as pd
import numpy as np
import requests
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer
import matplotlib.pyplot as plt

from pandas import Series, DataFrame

In [2]:
#First Dataset - We Rate Dogs Twitter Archive
#Feed We Rate Dogs Twitter Archive csv file into dataframe
archive_df = pd.read_csv('twitter-archive-enhanced.csv') 

In [3]:
#Second Dataset - Tweet Predictions
#Download the image predictions file programmatically using requests

url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
r = requests.get(url)
r.raise_for_status()  #Stop here if the download failed

#Save the response under the file name taken from the URL
file_name = url.split('/')[-1]
with open(file_name, mode = 'wb') as file:
    file.write(r.content)

predictions_df = pd.read_csv(file_name, sep = '\t')

In [4]:
#Parse tweet_json.txt line by line; each line holds one tweet as a JSON object
status = []
with open('tweet_json.txt') as file:
    for line in file:
        status.append(json.loads(line))

#Load the parsed tweets into a pandas DataFrame; the columns of interest are
#the tweet ID (id), retweet_count and favorite_count
analytics_df = pd.DataFrame(status)
analytics_df

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorite_count,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status
0,Tue Aug 01 16:23:56 +0000 2017,892420643555336193,892420643555336193,This is Phineas. He's a mystical boy. Only eve...,False,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,39467,False,False,False,False,en,,,,
1,Tue Aug 01 00:17:27 +0000 2017,892177421306343426,892177421306343426,This is Tilly. She's just checking pup on you....,False,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,33819,False,False,False,False,en,,,,
2,Mon Jul 31 00:18:03 +0000 2017,891815181378084864,891815181378084864,This is Archie. He is a rare Norwegian Pouncin...,False,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,25461,False,False,False,False,en,,,,
3,Sun Jul 30 15:58:51 +0000 2017,891689557279858688,891689557279858688,This is Darla. She commenced a snooze mid meal...,False,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,42908,False,False,False,False,en,,,,
4,Sat Jul 29 16:00:24 +0000 2017,891327558926688256,891327558926688256,This is Franklin. He would like you to stop ca...,False,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,41048,False,False,False,False,en,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2349,Mon Nov 16 00:24:50 +0000 2015,666049248165822465,666049248165822465,Here we have a 1949 1st generation vulpix. Enj...,False,"[0, 120]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666049244999131136, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,111,False,False,False,False,en,,,,
2350,Mon Nov 16 00:04:52 +0000 2015,666044226329800704,666044226329800704,This is a purebred Piers Morgan. Loves to Netf...,False,"[0, 137]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666044217047650304, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,311,False,False,False,False,en,,,,
2351,Sun Nov 15 23:21:54 +0000 2015,666033412701032449,666033412701032449,Here is a very happy pup. Big fan of well-main...,False,"[0, 130]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666033409081393153, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,128,False,False,False,False,en,,,,
2352,Sun Nov 15 23:05:30 +0000 2015,666029285002620928,666029285002620928,This is a western brown Mitsubishi terrier. Up...,False,"[0, 139]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666029276303482880, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,132,False,False,False,False,en,,,,
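
The displayed frame carries every field of the tweet object, while the analysis needs, at minimum, the tweet ID, retweet count and favorite count. A minimal sketch of keeping just those three fields while reading line-delimited JSON is shown here; the helper name `load_tweet_counts` and the two-line in-memory sample are illustrative assumptions, not part of the original notebook.

```python
import io
import json

import pandas as pd


def load_tweet_counts(lines):
    """Build a DataFrame with id, retweet_count and favorite_count from JSON lines."""
    wanted = ["id", "retweet_count", "favorite_count"]
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # Keep only the fields of interest; a missing key becomes None
        records.append({key: tweet.get(key) for key in wanted})
    return pd.DataFrame(records, columns=wanted)


# Hypothetical two-line sample standing in for tweet_json.txt
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 10, "lang": "en"}\n'
    '{"id": 2, "retweet_count": 7, "favorite_count": 20, "lang": "en"}\n'
)
counts_df = load_tweet_counts(sample)
```

With the real file, the same function could be called as `load_tweet_counts(open('tweet_json.txt'))`, ideally inside a `with` block.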


### Assess

In [5]:
#Visual Assessment
archive_df

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,


In [6]:
#Programmatic Assessment
#Check for data type inconsistencies and missing data in archive_df
archive_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob
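
The `info()` output above lists `timestamp` with dtype `object`, meaning the values are stored as plain strings. One possible way to turn such strings into timezone-aware datetimes is sketched below, using two timestamp strings copied from the archive output; whether and where the notebook performs this conversion is not shown in this section, so treat it as an assumption.

```python
import pandas as pd

# Two timestamp strings in the same format as the archive's timestamp column
raw = pd.Series(["2017-08-01 16:23:56 +0000", "2015-11-15 22:32:08 +0000"])

# utc=True returns timezone-aware datetimes normalized to UTC
parsed = pd.to_datetime(raw, utc=True)

# Datetime accessors such as .dt.year become available after conversion
years = parsed.dt.year
```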

In [7]:
archive_df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [8]:
archive_df.tail()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,
2355,666020888022790149,,,2015-11-15 22:32:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a Japanese Irish Setter. Lost eye...,,,,https://twitter.com/dog_rates/status/666020888...,8,10,,,,,


In [9]:
archive_df.sample(10)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2164,669371483794317312,,,2015-11-25 04:26:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Oliviér. He's a Baptist Hindquarter. A...,,,,https://twitter.com/dog_rates/status/669371483...,10,10,Oliviér,,,,
2028,671866342182637568,,,2015-12-02 01:39:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Dylan. He can use a fork but clearly can'...,,,,https://twitter.com/dog_rates/status/671866342...,10,10,Dylan,,,,
1598,686035780142297088,6.86034e+17,4196984000.0,2016-01-10 04:04:10 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Yes I do realize a rating of 4/20 would've bee...,,,,,4,20,,,,,
108,871515927908634625,,,2017-06-04 23:56:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Napolean. He's a Raggedy East Nicaragu...,,,,https://twitter.com/dog_rates/status/871515927...,12,10,Napolean,doggo,,,
501,813096984823349248,,,2016-12-25 19:00:02 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Rocky. He got triple-doggo-dared. Stuc...,,,,https://twitter.com/dog_rates/status/813096984...,11,10,Rocky,doggo,,,
1815,676613908052996102,,,2015-12-15 04:05:01 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is the saddest/sweetest/best picture I've...,,,,https://twitter.com/dog_rates/status/676613908...,12,10,the,,,,
2349,666051853826850816,,,2015-11-16 00:35:11 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is an odd dog. Hard on the outside but lo...,,,,https://twitter.com/dog_rates/status/666051853...,2,10,an,,,,
14,889531135344209921,,,2017-07-24 17:02:04 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Stuart. He's sporting his favorite fan...,,,,https://twitter.com/dog_rates/status/889531135...,13,10,Stuart,,,,puppo
990,748705597323898880,,,2016-07-01 02:31:39 +0000,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",#BarkWeek is getting rather heckin terrifying ...,,,,https://twitter.com/dog_rates/status/748705597...,13,10,,,,,
762,778039087836069888,,,2016-09-20 01:12:28 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Evolution of a pupper yawn featuring Max. 12/1...,,,,https://twitter.com/dog_rates/status/778039087...,12,10,,,,pupper,


In [10]:
archive_df.isnull().sum()

tweet_id                         0
in_reply_to_status_id         2278
in_reply_to_user_id           2278
timestamp                        0
source                           0
text                             0
retweeted_status_id           2175
retweeted_status_user_id      2175
retweeted_status_timestamp    2175
expanded_urls                   59
rating_numerator                 0
rating_denominator               0
name                             0
doggo                            0
floofer                          0
pupper                           0
puppo                            0
dtype: int64
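
The null counts above imply that `retweeted_status_id` is populated for 2356 - 2175 = 181 rows, which are the retweets in the archive. A small sketch of keeping only original tweets based on that column follows. It runs on a made-up three-row frame rather than the real archive, and the helper name `drop_retweets` is an assumption.

```python
import numpy as np
import pandas as pd


def drop_retweets(df):
    """Keep rows with no retweeted_status_id, then drop the retweet columns."""
    retweet_cols = [
        "retweeted_status_id",
        "retweeted_status_user_id",
        "retweeted_status_timestamp",
    ]
    originals = df[df["retweeted_status_id"].isnull()]
    return originals.drop(columns=retweet_cols)


# Made-up rows: the second one is a retweet
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [np.nan, 9.0, np.nan],
    "retweeted_status_user_id": [np.nan, 42.0, np.nan],
    "retweeted_status_timestamp": [None, "2017-01-01 00:00:00 +0000", None],
})
clean = drop_retweets(toy)
```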

In [11]:
predictions_df

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
...,...,...,...,...,...,...,...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True


In [12]:
predictions_df.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [13]:
predictions_df.tail()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True
2074,892420643555336193,https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg,1,orange,0.097049,False,bagel,0.085851,False,banana,0.07611,False


In [14]:
predictions_df.sample(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
243,670452855871037440,https://pbs.twimg.com/media/CU3tUC4WEAAoZby.jpg,1,Arctic_fox,0.188174,False,indri,0.123584,False,malamute,0.080379,True
1641,807106840509214720,https://pbs.twimg.com/ext_tw_video_thumb/80710...,1,Chihuahua,0.50537,True,Pomeranian,0.120358,True,toy_terrier,0.077008,True
956,705428427625635840,https://pbs.twimg.com/media/CcovaMUXIAApFDl.jpg,1,Chihuahua,0.774792,True,quilt,0.073079,False,Pembroke,0.022365,True
1645,808134635716833280,https://pbs.twimg.com/media/Cx5R8wPVEAALa9r.jpg,1,cocker_spaniel,0.74022,True,Dandie_Dinmont,0.061604,True,English_setter,0.041331,True
337,672231046314901505,https://pbs.twimg.com/media/CVQ-kfWWoAAXV15.jpg,1,killer_whale,0.823919,False,grey_whale,0.036601,False,hammerhead,0.029522,False
538,676949632774234114,https://pbs.twimg.com/media/CWUCGMtWEAAjXnS.jpg,1,Welsh_springer_spaniel,0.206479,True,Saint_Bernard,0.139339,True,boxer,0.114606,True
336,672222792075620352,https://pbs.twimg.com/media/CVQ3EDdWIAINyhM.jpg,1,beagle,0.958178,True,basset,0.009117,True,Italian_greyhound,0.007731,True
1980,871032628920680449,https://pbs.twimg.com/media/DBaHi3YXgAE6knM.jpg,1,kelpie,0.398053,True,macaque,0.068955,False,dingo,0.050602,False
1140,729823566028484608,https://pbs.twimg.com/media/CiDap8fWEAAC4iW.jpg,1,kelpie,0.218408,True,Arabian_camel,0.114368,False,coyote,0.096409,False
1157,733482008106668032,https://pbs.twimg.com/media/Ci3Z_idUkAA8RUh.jpg,1,French_bulldog,0.619382,True,computer_keyboard,0.142274,False,mouse,0.058505,False


In [70]:
predictions_df.isnull().sum()

tweet_id    0
jpg_url     0
img_num     0
p1          0
p1_conf     0
p1_dog      0
p2          0
p2_conf     0
p2_dog      0
p3          0
p3_conf     0
p3_dog      0
dtype: int64

In [16]:
predictions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [17]:
analytics_df.head()

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorite_count,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status
0,Tue Aug 01 16:23:56 +0000 2017,892420643555336193,892420643555336193,This is Phineas. He's a mystical boy. Only eve...,False,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,39467,False,False,False,False,en,,,,
1,Tue Aug 01 00:17:27 +0000 2017,892177421306343426,892177421306343426,This is Tilly. She's just checking pup on you....,False,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,33819,False,False,False,False,en,,,,
2,Mon Jul 31 00:18:03 +0000 2017,891815181378084864,891815181378084864,This is Archie. He is a rare Norwegian Pouncin...,False,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,25461,False,False,False,False,en,,,,
3,Sun Jul 30 15:58:51 +0000 2017,891689557279858688,891689557279858688,This is Darla. She commenced a snooze mid meal...,False,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,42908,False,False,False,False,en,,,,
4,Sat Jul 29 16:00:24 +0000 2017,891327558926688256,891327558926688256,This is Franklin. He would like you to stop ca...,False,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,41048,False,False,False,False,en,,,,


In [18]:
analytics_df.tail()

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorite_count,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status
2349,Mon Nov 16 00:24:50 +0000 2015,666049248165822465,666049248165822465,Here we have a 1949 1st generation vulpix. Enj...,False,"[0, 120]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666049244999131136, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,111,False,False,False,False,en,,,,
2350,Mon Nov 16 00:04:52 +0000 2015,666044226329800704,666044226329800704,This is a purebred Piers Morgan. Loves to Netf...,False,"[0, 137]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666044217047650304, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,311,False,False,False,False,en,,,,
2351,Sun Nov 15 23:21:54 +0000 2015,666033412701032449,666033412701032449,Here is a very happy pup. Big fan of well-main...,False,"[0, 130]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666033409081393153, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,128,False,False,False,False,en,,,,
2352,Sun Nov 15 23:05:30 +0000 2015,666029285002620928,666029285002620928,This is a western brown Mitsubishi terrier. Up...,False,"[0, 139]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666029276303482880, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,132,False,False,False,False,en,,,,
2353,Sun Nov 15 22:32:08 +0000 2015,666020888022790149,666020888022790149,Here we have a Japanese Irish Setter. Lost eye...,False,"[0, 131]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 666020881337073664, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,2535,False,False,False,False,en,,,,


In [19]:
analytics_df.sample(10)

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorite_count,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status
1850,Sat Dec 12 16:02:36 +0000 2015,675707330206547968,675707330206547968,We've got ourselves a battle here. Watch out R...,False,"[0, 82]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 675707321759039488, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",6.754971e+17,...,2154,False,False,False,False,en,,,,
456,Sun Jan 08 17:20:31 +0000 2017,818145370475810820,818145370475810820,This is Autumn. Her favorite toy is a cheesebu...,False,"[0, 82]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 818145346668916739, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,13671,False,False,False,False,en,,,,
260,Fri Mar 17 21:13:10 +0000 2017,842846295480000512,842846295480000512,This is Charlie. He's wishing you a very fun a...,False,"[0, 90]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 842846286093209601, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,16440,False,False,False,False,en,,,,
1840,Sun Dec 13 02:51:51 +0000 2015,675870721063669760,675870721063669760,&amp; this is Yoshi. Another world record cont...,False,"[0, 144]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 675870715636224001, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",6.757073e+17,...,1783,False,False,False,False,en,,,,
33,Thu Jul 13 15:58:47 +0000 2017,885528943205470208,885528943205470208,This is Maisey. She fell asleep mid-excavation...,False,"[0, 109]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 885528931826368512, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,36689,False,False,False,False,en,,,,
704,Tue Oct 11 00:34:48 +0000 2016,785639753186217984,785639753186217984,This is Pinot. He's a sophisticated doggo. You...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 785639740259303424, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,8735,False,False,False,False,en,,,,
983,Sat Jul 02 03:00:36 +0000 2016,749075273010798592,749075273010798592,This is Boomer. He's self-baptizing. Other dog...,False,"[0, 130]","{'hashtags': [], 'symbols': [], 'user_mentions...",,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",,...,6353,False,False,False,False,en,,,,
2127,Fri Nov 27 17:17:44 +0000 2015,670290420111441920,670290420111441920,This is Sandra. She's going skydiving. Nice ad...,False,"[0, 110]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 670290335751237632, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,750,False,False,False,False,en,,,,
2060,Mon Nov 30 03:06:07 +0000 2015,671163268581498880,671163268581498880,Pack of horned dogs here. Very team-oriented b...,False,"[0, 137]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 671163258263482369, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,1763,False,False,False,False,en,,,,
1146,Sun May 01 21:32:40 +0000 2016,726887082820554753,726887082820554753,This is Blitz. He's a new dad struggling to co...,False,"[0, 113]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 726887073358159872, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,4195,False,False,False,False,en,,,,


In [69]:
analytics_df.isnull().sum()

created_at                          0
id                                  0
id_str                              0
full_text                           0
truncated                           0
display_text_range                  0
entities                            0
extended_entities                 281
source                              0
in_reply_to_status_id            2276
in_reply_to_status_id_str        2276
in_reply_to_user_id              2276
in_reply_to_user_id_str          2276
in_reply_to_screen_name          2276
user                                0
geo                              2354
coordinates                      2354
place                            2353
contributors                     2354
is_quote_status                     0
retweet_count                       0
favorite_count                      0
favorited                           0
retweeted                           0
possibly_sensitive                143
possibly_sensitive_appealable     143
lang        

In [21]:
#Check for duplicate tweet_id, name and expanded_urls values in archive_df
#(the notebook displays only the result of the last expression)
archive_df[archive_df.tweet_id.duplicated()]
archive_df[archive_df.name.duplicated()]
archive_df[archive_df.expanded_urls.duplicated()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
55,881633300179243008,8.816070e+17,4.738443e+07,2017-07-02 21:58:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@roushfenway These are good dogs but 17/10 is ...,,,,,17,10,,,,,
64,879674319642796034,8.795538e+17,3.105441e+09,2017-06-27 12:14:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@RealKentMurphy 14/10 confirmed,,,,,14,10,,,,,
75,878281511006478336,,,2017-06-23 16:00:04 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Shadow. In an attempt to reach maximum zo...,,,,"https://www.gofundme.com/3yd6y1c,https://twitt...",13,10,Shadow,,,,
76,878057613040115712,,,2017-06-23 01:10:23 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Emmy. She was adopted today. Massive r...,,,,https://twitter.com/dog_rates/status/878057613...,14,10,Emmy,,,,
98,873213775632977920,,,2017-06-09 16:22:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Sierra. She's one precious pupper. Abs...,,,,https://www.gofundme.com/help-my-baby-sierra-g...,12,10,Sierra,,,pupper,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2286,667182792070062081,,,2015-11-19 03:29:07 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Timison. He just told an awful joke bu...,,,,https://twitter.com/dog_rates/status/667182792...,10,10,Timison,,,,
2293,667152164079423490,,,2015-11-19 01:27:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Pipsy. He is a fluffball. Enjoys trave...,,,,https://twitter.com/dog_rates/status/667152164...,12,10,Pipsy,,,,
2294,667138269671505920,,,2015-11-19 00:32:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Extremely intelligent dog here. Has learned to...,,,,https://twitter.com/dog_rates/status/667138269...,10,10,,,,,
2298,667070482143944705,6.670655e+17,4.196984e+09,2015-11-18 20:02:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",After much debate this dog is being upgraded t...,,,,,10,10,,,,,


In [22]:
analytics_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 31 columns):
created_at                       2354 non-null object
id                               2354 non-null int64
id_str                           2354 non-null object
full_text                        2354 non-null object
truncated                        2354 non-null bool
display_text_range               2354 non-null object
entities                         2354 non-null object
extended_entities                2073 non-null object
source                           2354 non-null object
in_reply_to_status_id            78 non-null float64
in_reply_to_status_id_str        78 non-null object
in_reply_to_user_id              78 non-null float64
in_reply_to_user_id_str          78 non-null object
in_reply_to_screen_name          78 non-null object
user                             2354 non-null object
geo                              0 non-null object
coordinates                      0 non-null

In [23]:
#Check for duplicate ids in analytics_df
analytics_df[analytics_df.id.duplicated()]

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorite_count,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status


### Assessment Observations
##### Tidiness
<ul>
    <li> Columns floofer, doggo, pupper and puppo need to be in one column.</li>
    <li> Merge all tables after quality issues have been addressed. </li>
</ul>

##### Quality

###### Archive_df
<ul>
    <li> Some names were parsed incorrectly and appear as empty values or lowercase strings. </li>
    <li> There are retweets included in the data, which should be removed. Drop retweeted_status_user_id, retweeted_status_id, retweeted_status_timestamp. </li>
    <li> Datatype of tweet_id is an integer instead of an object. </li>
    <li> Datatype for timestamp is a string instead of a timestamp.</li>
    <li> Rating numerators and denominators were parsed incorrectly in places: some numerators are less than 10 and some denominators are not 10. This project assumes that denominators should be 10 and numerators 10 or greater. </li>
    <li> Ratings need to be standardized into one column, as they represent a single variable. </li>
    <li> Drop irrelevant columns. </li>
</ul>


###### Predictions_df
<ul>
    <li> p1_conf, p2_conf, p3_conf and img_num should be verified as numeric datatypes rather than strings. </li>
    <li> Inconsistent text cases for prediction1, prediction2, prediction3.</li>
    <li> Update column names for information accuracy.</li>
    <li> Datatype for tweet_id is an integer instead of an object.</li>
</ul>

###### Analytics_df
<ul>
    <li> Datatype of created_at is a string instead of a timestamp. </li>
    <li> Rename id_str to tweet_id. </li>
    <li> Duplicate variable for tweet_id. Drop id, which is just the tweet id in integer format.</li>
    <li> Drop irrelevant columns. </li>
</ul>

### Clean

In [24]:
#Create copies prior to cleaning
archive_clean = archive_df.copy()
predictions_clean = predictions_df.copy()
analytics_clean = analytics_df.copy()

#### Tidiness

##### Define

Combine data from columns floofer, doggo, pupper & puppo into a single column called 'category' using a combination of replace(), fillna() and vectorized operations.

##### Code

In [25]:
#replace string value None with empty strings
archive_clean['doggo'] = archive_clean['doggo'].replace('None', '')
archive_clean['floofer'] = archive_clean['floofer'].replace('None', '')
archive_clean['pupper'] = archive_clean['pupper'].replace('None', '')
archive_clean['puppo'] = archive_clean['puppo'].replace('None', '')

#concatenate the four columns element-wise into a new column called category, then replace empty results with NaN.
archive_clean['category'] = (archive_clean['doggo'].fillna('') + archive_clean['floofer'].fillna('') 
                  + archive_clean['pupper'].fillna('') + archive_clean['puppo'].fillna('')).replace('', np.nan)

#verify
archive_clean.category.value_counts()

pupper          245
doggo            83
puppo            29
doggopupper      12
floofer           9
doggofloofer      1
doggopuppo        1
Name: category, dtype: int64

In [26]:
#rows that matched more than one category are relabeled as 'multiple'
remove = ['doggopupper', 'doggopuppo', 'doggofloofer']

#replace() is vectorized, so a single call updates every matching row
archive_clean['category'] = archive_clean['category'].replace(remove, 'multiple')

In [27]:
archive_clean.drop(columns=['doggo', 'pupper','floofer','puppo'], inplace=True)

##### Testing

In [28]:
archive_clean.head(1)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,category
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,


In [29]:
archive_clean.category.value_counts()

pupper      245
doggo        83
puppo        29
multiple     14
floofer       9
Name: category, dtype: int64
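
As a self-contained illustration of the tidiness step above, the following sketch applies the same idea to a small made-up frame. The `toy` rows, and the `stages`, `combined` and `n_stages` names, are hypothetical and are not part of the archive. It is one possible formulation, not the code used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the four stage columns in the archive
toy = pd.DataFrame({
    'doggo':   ['doggo', 'None', 'doggo', 'None'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None', 'None'],
})

stages = ['doggo', 'floofer', 'pupper', 'puppo']

# Treat the literal string 'None' as empty, then concatenate row-wise
combined = toy[stages].replace('None', '').apply(''.join, axis=1)

# Count how many stage columns were filled in for each row
n_stages = (toy[stages] != 'None').sum(axis=1)

# Exactly one stage keeps its label, several become 'multiple', none becomes NaN
toy['category'] = (combined
                   .where(n_stages == 1, 'multiple')
                   .where(n_stages > 0, np.nan))
```

Counting the filled columns avoids having to list every concatenated combination (such as 'doggopupper') by hand.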

##### Define
Set numerators and denominators that are less than 10 to 10, so that numerators are 10 or greater and denominators are at least 10. Larger values are left unchanged.

##### Code

In [30]:
#raise numerators and denominators that are less than 10 up to 10
archive_clean['rating_numerator'] = archive_clean['rating_numerator'].clip(lower=10)
archive_clean['rating_denominator'] = archive_clean['rating_denominator'].clip(lower=10)

##### Testing

In [31]:
archive_clean.rating_numerator.nsmallest()

45     10
113    10
165    10
212    10
229    10
Name: rating_numerator, dtype: int64

In [32]:
archive_clean.rating_numerator.nlargest()

979     1776
313      960
189      666
188      420
2074     420
Name: rating_numerator, dtype: int64

> Numerators with more than two digits will be treated as outliers in this data analysis project.
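
A minimal sketch of how such outliers could be flagged, assuming a boolean mask on the numerator is acceptable. The `ratings` frame and the `is_outlier` and `typical` names are hypothetical; the notebook itself does not filter these rows.

```python
import pandas as pd

# Hypothetical numerators, including values like those flagged above
ratings = pd.DataFrame({'rating_numerator': [10, 13, 14, 420, 1776]})

# A numerator of 100 or more has more than two digits
is_outlier = ratings['rating_numerator'] >= 100

# Rows kept for analysis when outliers are excluded
typical = ratings[~is_outlier]
```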

In [33]:
archive_clean.rating_denominator.nsmallest()

0    10
1    10
2    10
3    10
4    10
Name: rating_denominator, dtype: int64

In [34]:
archive_clean.rating_denominator.nlargest()

1120    170
902     150
1634    130
1779    120
1635    110
Name: rating_denominator, dtype: int64

##### Define
Create a single column with a standardized rating by dividing rating_numerator by rating_denominator using vectorized operations.

##### Code

In [35]:
archive_clean['rating'] = (archive_clean['rating_numerator'] / archive_clean['rating_denominator'])

#### Quality 

##### Define
Remove retweets by keeping only rows where retweeted_status_id is null (using .isnull()), as our goal is to analyze only original tweets. The retweeted_status_user_id, retweeted_status_id and retweeted_status_timestamp columns then carry no information and are dropped in a later step.

##### Code

In [36]:
#Use .isnull() to keep only tweets that are not retweets
archive_clean = archive_clean[archive_clean.retweeted_status_id.isnull()]

##### Testing

In [37]:
archive_clean[archive_clean.retweeted_status_id.notnull()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,category,rating


##### Define
Update datatype of tweet_id from integer to string.

##### Code

In [38]:
archive_clean['tweet_id'] = archive_clean['tweet_id'].apply(str)

##### Testing

In [39]:
archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 15 columns):
tweet_id                      2175 non-null object
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2175 non-null object
source                        2175 non-null object
text                          2175 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 2117 non-null object
rating_numerator              2175 non-null int64
rating_denominator            2175 non-null int64
name                          2175 non-null object
category                      344 non-null object
rating                        2175 non-null float64
dtypes: float64(5), int64(2), object(8)
memory usage: 271.9+ KB


##### Define
Clean erroneously parsed names by replacing lowercase values with the name found in the tweet text, using a combination of str.islower(), tolist(), for loops and Python's regular expressions.

##### Code

In [40]:
replacing_named = archive_clean.loc[(archive_clean['name'].str.islower()) & (archive_clean['text'].str.contains('named'))]
replacing_name_is = archive_clean.loc[(archive_clean['name'].str.islower()) & (archive_clean['text'].str.contains('name is'))]
replacing_not_named = archive_clean.loc[(archive_clean['name'].str.islower())]

#save as list
replacing_named = replacing_named['text'].tolist()
replacing_name_is = replacing_name_is['text'].tolist()
replacing_not_named = replacing_not_named['text'].tolist()

In [41]:
#iterate through the saved list of tweets whose parsed name is lowercase and whose text contains 'named',
#and set name to the word that appears after 'named'.
import re
for entry in replacing_named:
    match = re.search(r"named\s(\w+)", entry)
    #only update when a word actually follows 'named'
    if match:
        archive_clean.loc[archive_clean.text == entry, 'name'] = match.group(1)

In [42]:
#iterate through the saved list of tweets whose parsed name is lowercase and whose text contains 'name is',
#and set name to the word that appears after 'name is'.
for entry in replacing_name_is:
    match = re.search(r"name is\s(\w+)", entry)
    #only update when a word actually follows 'name is'
    if match:
        archive_clean.loc[archive_clean.text == entry, 'name'] = match.group(1)

In [43]:
#set any name that is still lowercase to "None"; the mask is recomputed here so that
#names repaired in the previous two cells are kept
archive_clean.loc[archive_clean['name'].str.islower(), 'name'] = "None"

##### Testing

In [44]:
#check on names
archive_clean.name.sort_values()

1035     Abby
1021     Abby
938       Ace
1933     Acro
1327    Adele
        ...  
2141     Zoey
115      Zoey
8        Zoey
151     Zooey
1875     Zuzu
Name: name, Length: 2175, dtype: object
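
The name repair in this section can also be expressed without explicit loops. The sketch below is an alternative formulation using `Series.str.extract`, shown on a small made-up frame. The `toy` rows and the `lower` and `found` names are hypothetical, and the pattern is an assumption about how names follow 'named' or 'name is' in the text.

```python
import pandas as pd

# Hypothetical tweets; the names and text are made up for illustration
toy = pd.DataFrame({
    'name': ['a', 'my', 'the', 'Bella'],
    'text': ['This is a dog named Rex. 12/10',
             'Hello, my name is Daisy. 13/10',
             'This is the happiest pupper. 11/10',
             'This is Bella. 12/10'],
})

# Rows whose parsed name is a lowercase word rather than a name
lower = toy['name'].str.islower()

# Capture the word that follows 'named' or 'name is'; NaN where neither appears
found = toy['text'].str.extract(r'(?:named|name is)\s(\w+)', expand=False)

# Fill lowercase names with the captured word, or 'None' when nothing was captured
toy.loc[lower, 'name'] = found[lower].fillna('None')
```

Because the captured names are assigned by index label, rows with a valid name (such as 'Bella') are left untouched.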

##### Define
Change archive_clean timestamp from string to datetime using pd.to_datetime().

##### Code

In [45]:
archive_clean['timestamp'] = pd.to_datetime(archive_clean['timestamp'])

##### Testing

In [46]:
archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 15 columns):
tweet_id                      2175 non-null object
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2175 non-null datetime64[ns, UTC]
source                        2175 non-null object
text                          2175 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 2117 non-null object
rating_numerator              2175 non-null int64
rating_denominator            2175 non-null int64
name                          2175 non-null object
category                      344 non-null object
rating                        2175 non-null float64
dtypes: datetime64[ns, UTC](1), float64(5), int64(2), object(7)
memory usage: 271.9+ KB
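
The archive timestamps carry an explicit '+0000' offset, which is why the converted column is timezone-aware. As a minimal sketch on two sample strings in that format, passing `utc=True` (an optional argument that the cell above does not use) makes the conversion to UTC explicit. The `stamps` and `parsed` names are hypothetical.

```python
import pandas as pd

# Two sample strings in the same format as the archive's timestamp column
stamps = pd.Series(['2017-08-01 16:23:56 +0000',
                    '2015-11-19 03:29:07 +0000'])

# utc=True converts every parsed value to UTC and returns a tz-aware Series
parsed = pd.to_datetime(stamps, utc=True)

# The .dt accessor then exposes datetime components for analysis
years = parsed.dt.year
```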


##### Define
Update datatype of tweet_id from integer to string in predictions_clean.

##### Code

In [47]:
predictions_clean['tweet_id'] = predictions_clean['tweet_id'].apply(str)

##### Testing

In [48]:
predictions_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null object
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(1), object(5)
memory usage: 152.1+ KB


##### Define

Rename column names in predictions_clean for accuracy.

##### Code

In [49]:
#rename columns for accuracy of column name
predictions_clean.columns = ['tweet_id', 'jpg_url','img_num', 'prediction1', 'p1_conf', 'p1_is_dog','prediction2', 'p2_conf', 'p2_is_dog','prediction3', 'p3_conf', 'p3_is_dog']

##### Testing

In [50]:
predictions_clean.head()

Unnamed: 0,tweet_id,jpg_url,img_num,prediction1,p1_conf,p1_is_dog,prediction2,p2_conf,p2_is_dog,prediction3,p3_conf,p3_is_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


##### Define
Ensure that p1_conf, p2_conf, p3_conf and img_num are stored as numeric datatypes by creating a helper function that wraps pd.to_numeric(). The info() output above shows these columns are already float64 and int64, so this step leaves them unchanged.

##### Code

In [51]:
#Create function to convert a column to a numeric datatype
def string_to_integer(array):
    return pd.to_numeric(array)
    
predictions_clean['p1_conf'] = string_to_integer(predictions_clean['p1_conf'])
predictions_clean['p2_conf'] = string_to_integer(predictions_clean['p2_conf'])
predictions_clean['p3_conf'] = string_to_integer(predictions_clean['p3_conf'])
predictions_clean['img_num'] = string_to_integer(predictions_clean['img_num'])

##### Testing

In [52]:
predictions_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id       2075 non-null object
jpg_url        2075 non-null object
img_num        2075 non-null int64
prediction1    2075 non-null object
p1_conf        2075 non-null float64
p1_is_dog      2075 non-null bool
prediction2    2075 non-null object
p2_conf        2075 non-null float64
p2_is_dog      2075 non-null bool
prediction3    2075 non-null object
p3_conf        2075 non-null float64
p3_is_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(1), object(5)
memory usage: 152.1+ KB


In [53]:
predictions_clean.head()

Unnamed: 0,tweet_id,jpg_url,img_num,prediction1,p1_conf,p1_is_dog,prediction2,p2_conf,p2_is_dog,prediction3,p3_conf,p3_is_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


##### Define
Change text case of prediction1, prediction2 and prediction3 from a variety of text cases to title case by creating a function for str.title().

##### Code

In [54]:
def to_title(array):
    return array.str.title()

predictions_clean.prediction1 = to_title(predictions_clean.prediction1)
predictions_clean.prediction2 = to_title(predictions_clean.prediction2)
predictions_clean.prediction3 = to_title(predictions_clean.prediction3)

##### Testing

In [55]:
predictions_clean.head()

Unnamed: 0,tweet_id,jpg_url,img_num,prediction1,p1_conf,p1_is_dog,prediction2,p2_conf,p2_is_dog,prediction3,p3_conf,p3_is_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_Springer_Spaniel,0.465074,True,Collie,0.156665,True,Shetland_Sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,Redbone,0.506826,True,Miniature_Pinscher,0.074192,True,Rhodesian_Ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_Shepherd,0.596461,True,Malinois,0.138584,True,Bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_Ridgeback,0.408143,True,Redbone,0.360687,True,Miniature_Pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,Miniature_Pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


##### Define
Change created_at from string to datetime using pd.to_datetime().

##### Code

In [56]:
analytics_clean.created_at = pd.to_datetime(analytics_clean.created_at)

##### Testing

In [57]:
analytics_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 31 columns):
created_at                       2354 non-null datetime64[ns, UTC]
id                               2354 non-null int64
id_str                           2354 non-null object
full_text                        2354 non-null object
truncated                        2354 non-null bool
display_text_range               2354 non-null object
entities                         2354 non-null object
extended_entities                2073 non-null object
source                           2354 non-null object
in_reply_to_status_id            78 non-null float64
in_reply_to_status_id_str        78 non-null object
in_reply_to_user_id              78 non-null float64
in_reply_to_user_id_str          78 non-null object
in_reply_to_screen_name          78 non-null object
user                             2354 non-null object
geo                              0 non-null object
coordinates                   

##### Define
Rename id_str to tweet_id for consistency of information using rename() function.

##### Code

In [58]:
analytics_clean = analytics_clean.rename(columns={'id_str':'tweet_id'})

##### Testing

In [59]:
analytics_clean.head()

Unnamed: 0,created_at,id,tweet_id,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorite_count,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status
0,2017-08-01 16:23:56+00:00,892420643555336193,892420643555336193,This is Phineas. He's a mystical boy. Only eve...,False,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,39467,False,False,False,False,en,,,,
1,2017-08-01 00:17:27+00:00,892177421306343426,892177421306343426,This is Tilly. She's just checking pup on you....,False,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,33819,False,False,False,False,en,,,,
2,2017-07-31 00:18:03+00:00,891815181378084864,891815181378084864,This is Archie. He is a rare Norwegian Pouncin...,False,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,25461,False,False,False,False,en,,,,
3,2017-07-30 15:58:51+00:00,891689557279858688,891689557279858688,This is Darla. She commenced a snooze mid meal...,False,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,42908,False,False,False,False,en,,,,
4,2017-07-29 16:00:24+00:00,891327558926688256,891327558926688256,This is Franklin. He would like you to stop ca...,False,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,41048,False,False,False,False,en,,,,


##### Define
Drop the id column using drop(), as it duplicates tweet_id in an integer datatype.

##### Code

In [60]:
analytics_clean = analytics_clean.drop('id', axis=1)

In [61]:
predictions_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id       2075 non-null object
jpg_url        2075 non-null object
img_num        2075 non-null int64
prediction1    2075 non-null object
p1_conf        2075 non-null float64
p1_is_dog      2075 non-null bool
prediction2    2075 non-null object
p2_conf        2075 non-null float64
p2_is_dog      2075 non-null bool
prediction3    2075 non-null object
p3_conf        2075 non-null float64
p3_is_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(1), object(5)
memory usage: 152.1+ KB


##### Define
Drop all irrelevant columns from archive_clean and analytics_clean using the drop() function before merging.

##### Code & Testing

In [62]:
#drop irrelevant data columns for archive_clean
archive_clean = archive_clean.drop(columns=['retweeted_status_user_id', 'retweeted_status_id',
                            'retweeted_status_timestamp', 'in_reply_to_status_id', 
                            'in_reply_to_user_id', 'rating_denominator', 'rating_numerator', 'source'])

In [63]:
analytics_clean.columns

Index(['created_at', 'tweet_id', 'full_text', 'truncated',
       'display_text_range', 'entities', 'extended_entities', 'source',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'retweet_count', 'favorite_count',
       'favorited', 'retweeted', 'possibly_sensitive',
       'possibly_sensitive_appealable', 'lang', 'retweeted_status',
       'quoted_status_id', 'quoted_status_id_str', 'quoted_status'],
      dtype='object')

In [64]:
#drop irrelevant data columns for analytics_clean
analytics_clean = analytics_clean.drop(columns=['in_reply_to_status_id', 'in_reply_to_status_id_str',
                              'in_reply_to_user_id', 'in_reply_to_user_id_str',
                              'in_reply_to_screen_name', 'retweeted_status',
                              'quoted_status_id', 'quoted_status_id_str', 'quoted_status',
                              'is_quote_status', 'possibly_sensitive', 'possibly_sensitive_appealable',
                                               'favorited', 'retweeted', 'full_text'])

##### Define
Merge analytics_clean, archive_clean and predictions_clean into a single dataframe using merge().

##### Code

In [65]:
analytics_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 15 columns):
created_at            2354 non-null datetime64[ns, UTC]
tweet_id              2354 non-null object
truncated             2354 non-null bool
display_text_range    2354 non-null object
entities              2354 non-null object
extended_entities     2073 non-null object
source                2354 non-null object
user                  2354 non-null object
geo                   0 non-null object
coordinates           0 non-null object
place                 1 non-null object
contributors          0 non-null object
retweet_count         2354 non-null int64
favorite_count        2354 non-null int64
lang                  2354 non-null object
dtypes: bool(1), datetime64[ns, UTC](1), int64(2), object(11)
memory usage: 259.9+ KB


In [66]:
#merge all three 
tweet_masterdf = pd.merge(archive_clean, analytics_clean, on='tweet_id')
tweet_masterdf = pd.merge(tweet_masterdf, predictions_clean, on='tweet_id')

##### Testing

In [67]:
tweet_masterdf

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,name,category,rating,created_at,truncated,display_text_range,...,img_num,prediction1,p1_conf,p1_is_dog,prediction2,p2_conf,p2_is_dog,prediction3,p3_conf,p3_is_dog
0,892420643555336193,2017-08-01 16:23:56+00:00,This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,Phineas,,1.3,2017-08-01 16:23:56+00:00,False,"[0, 85]",...,1,Orange,0.097049,False,Bagel,0.085851,False,Banana,0.076110,False
1,892177421306343426,2017-08-01 00:17:27+00:00,This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,Tilly,,1.3,2017-08-01 00:17:27+00:00,False,"[0, 138]",...,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,Papillon,0.068957,True
2,891815181378084864,2017-07-31 00:18:03+00:00,This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,Archie,,1.2,2017-07-31 00:18:03+00:00,False,"[0, 121]",...,1,Chihuahua,0.716012,True,Malamute,0.078253,True,Kelpie,0.031379,True
3,891689557279858688,2017-07-30 15:58:51+00:00,This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,Darla,,1.3,2017-07-30 15:58:51+00:00,False,"[0, 79]",...,1,Paper_Towel,0.170278,False,Labrador_Retriever,0.168086,True,Spatula,0.040836,False
4,891327558926688256,2017-07-29 16:00:24+00:00,This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,Franklin,,1.2,2017-07-29 16:00:24+00:00,False,"[0, 138]",...,2,Basset,0.555712,True,English_Springer,0.225770,True,German_Short-Haired_Pointer,0.175219,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1989,666049248165822465,2015-11-16 00:24:50+00:00,Here we have a 1949 1st generation vulpix. Enj...,https://twitter.com/dog_rates/status/666049248...,,,1.0,2015-11-16 00:24:50+00:00,False,"[0, 120]",...,1,Miniature_Pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
1990,666044226329800704,2015-11-16 00:04:52+00:00,This is a purebred Piers Morgan. Loves to Netf...,https://twitter.com/dog_rates/status/666044226...,,,1.0,2015-11-16 00:04:52+00:00,False,"[0, 137]",...,1,Rhodesian_Ridgeback,0.408143,True,Redbone,0.360687,True,Miniature_Pinscher,0.222752,True
1991,666033412701032449,2015-11-15 23:21:54+00:00,Here is a very happy pup. Big fan of well-main...,https://twitter.com/dog_rates/status/666033412...,,,1.0,2015-11-15 23:21:54+00:00,False,"[0, 130]",...,1,German_Shepherd,0.596461,True,Malinois,0.138584,True,Bloodhound,0.116197,True
1992,666029285002620928,2015-11-15 23:05:30+00:00,This is a western brown Mitsubishi terrier. Up...,https://twitter.com/dog_rates/status/666029285...,,,1.0,2015-11-15 23:05:30+00:00,False,"[0, 139]",...,1,Redbone,0.506826,True,Miniature_Pinscher,0.074192,True,Rhodesian_Ridgeback,0.072010,True
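
The merge above relies on the default inner join, so only tweets present in all three tables remain. The sketch below illustrates this on made-up miniature tables and adds the optional `validate` argument of `merge()` as a guard against duplicated keys. The `archive`, `analytics`, `predictions` and `master` frames here are hypothetical stand-ins, and the use of `validate` is a suggestion rather than part of the notebook.

```python
import pandas as pd

# Hypothetical miniature versions of the three cleaned tables
archive = pd.DataFrame({'tweet_id': ['1', '2', '3'],
                        'rating': [1.3, 1.2, 1.0]})
analytics = pd.DataFrame({'tweet_id': ['1', '2', '4'],
                          'favorite_count': [50, 40, 30]})
predictions = pd.DataFrame({'tweet_id': ['1', '3'],
                            'prediction1': ['Pug', 'Collie']})

# how='inner' is the default: only tweet_ids present in every table survive.
# validate='one_to_one' raises an error if a tweet_id repeats on either side.
master = (archive
          .merge(analytics, on='tweet_id', how='inner', validate='one_to_one')
          .merge(predictions, on='tweet_id', how='inner', validate='one_to_one'))
```

In this toy case only tweet_id '1' appears in all three tables, so the result has a single row.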


# Data Visualization & Analysis

<a id='questions'></a>
## Questions to Answer

1.) What is the most favorited tweet?

2.) What is the most retweeted tweet?

3.) What is the distribution of dog categories?

4.) What is the rating column's statistical data?

5.) What are the most common predictions?

6.) What is the possible relationship between retweet counts and favorite counts?

### 1.) What is the most favorited tweet?

In [72]:
tweet_masterdf.favorite_count.nlargest()

309    132810
775    131075
58     107956
400    107015
108    106827
Name: favorite_count, dtype: int64

In [None]:
#view information of the most favorited tweet
tweet_masterdf.loc[309]

In [None]:
#total number of favorites
total_favs = sum(tweet_masterdf.favorite_count)

In [None]:
#percentage of favorites for the most favorited tweet
(tweet_masterdf.favorite_count.loc[309]/total_favs) * 100

> The most favorited We Rate Dogs tweet has the tweet ID '822872901745569793'. It was favorited 132,810 times, retweeted 48,265 times and created on 01-21-2017. It has a standardized rating of 1.3, and the tweet itself is 87 characters long. The neural network's top three predictions for the accompanying image are Lakeland Terrier, Labrador Retriever and Irish Terrier.

> The total number of favorites as of August 1, 2017 is 17,738,077.

> The tweet with the highest number of favorites accounts for only 0.74% of We Rate Dogs' total favorites.
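
The lookup above reads the index label from the `nlargest()` output by hand. As a minimal sketch, `idxmax()` returns that label directly, which can then be passed to `.loc`. The `toy` frame and the `top_label`, `top_tweet` and `share` names are hypothetical.

```python
import pandas as pd

# Hypothetical engagement counts
toy = pd.DataFrame({'tweet_id': ['a', 'b', 'c'],
                    'favorite_count': [120, 300, 80]})

# Index label of the row with the highest favorite count
top_label = toy['favorite_count'].idxmax()

# Label-based lookup of that row
top_tweet = toy.loc[top_label]

# Share of all favorites held by the top tweet, in percent
share = top_tweet['favorite_count'] / toy['favorite_count'].sum() * 100
```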

### 2.) What is the most retweeted tweet?

In [None]:
#finding the index of the tweet with the highest retweet
tweet_masterdf.retweet_count.nlargest()

In [None]:
#find all rows pertaining to the tweet with the highest retweet
tweet_masterdf.loc[775]

In [None]:
#total number of retweets
total_retweets = sum(tweet_masterdf.retweet_count)

In [None]:
#percentage of retweets for the most retweeted tweet
(tweet_masterdf.retweet_count.loc[775]/total_retweets) * 100

> The most retweeted We Rate Dogs tweet has the tweet ID '744234799360020481'. It was favorited 131,075 times, retweeted 79,515 times and created on 06-18-2016. It has a standardized rating of 1.3, and the tweet itself is 91 characters long. The neural network's top three predictions for the accompanying image are Labrador Retriever, Ice Bear and Whippet.

> The total number of retweets up until August 1, 2017 is 5,516,906.

> The most retweeted tweet accounts for only about 1.44% (79,515 of 5,516,906) of We Rate Dogs' total retweets.

### 3.) What is the distribution of dog categories?

In [None]:
#obtain percentage
category_percent = (tweet_masterdf.category.value_counts() / tweet_masterdf.category.count()) * 100
category_percent

In [None]:
#create a bar plot to visualize percentage of distribution for each category
#use the category_percent variable from the cell above
category_percent.sort_values(ascending=False).plot.barh()
plt.title('Percentage Distribution of Dog Categories')
plt.xlabel('Percentage of Category')
plt.ylabel('Category');

> Of the tweets with an assigned category, the majority (66.34%) were puppers, followed by doggos (20.6%) and puppos (7.19%). Dogs assigned multiple categories accounted for 3.6%, with floofers close behind at 2.2%.
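
The same percentages can be obtained in one step with `value_counts(normalize=True)`, which divides by the number of non-null values. A minimal sketch on made-up data follows; the `category` Series here is hypothetical, not the project's column.

```python
import pandas as pd

# Hypothetical category labels; None stands for a tweet with no category
category = pd.Series(['pupper', 'pupper', 'pupper', 'doggo', None])

# normalize=True returns proportions of the non-null values
category_percent = category.value_counts(normalize=True) * 100
```

Missing values are excluded by default, so the percentages describe only the tweets that were assigned a category.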

### 4.) What is the rating column's statistical data?

In [None]:
tweet_masterdf.rating.describe()

> There are 1,994 values in the rating column, with a mean of 1.22 and a standard deviation of 4.06. The minimum rating is 1.0, whereas the maximum rating, which is considered an outlier, is 177.60.
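
Because a single extreme rating can pull the mean and standard deviation upward, the median is a useful companion statistic. The sketch below shows the effect on made-up values; the `rating` Series and `summary` dictionary are hypothetical and do not reproduce the project's figures.

```python
import pandas as pd

# Hypothetical ratings with one extreme value
rating = pd.Series([1.0, 1.1, 1.2, 1.3, 177.6])

# The mean is sensitive to the extreme value, the median is not
summary = {'mean': rating.mean(), 'median': rating.median()}
```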

### 5.) What are the most common predictions?

In [None]:
#tally the neural network's top prediction (prediction1) for each tweet
predictions = tweet_masterdf.prediction1

predictions.value_counts().head()

> The most common top predictions (prediction1) are as follows: Golden Retriever, Labrador Retriever, Pembroke, Chihuahua and Pug.
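
One way to pool the predictions from all three columns into a single tally is `pd.concat()`, which returns a new Series instead of modifying an existing one. The sketch below uses a small made-up frame (the `toy` data and the `pooled` and `pooled_counts` names are hypothetical), so its counts say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical predictions for three tweets
toy = pd.DataFrame({
    'prediction1': ['Pug', 'Collie', 'Pug'],
    'prediction2': ['Collie', 'Pug', 'Beagle'],
    'prediction3': ['Beagle', 'Pug', 'Collie'],
})

# Stack the three columns into one Series; the result must be assigned to a name
pooled = pd.concat([toy['prediction1'], toy['prediction2'], toy['prediction3']],
                   ignore_index=True)

# Frequency of each label across all three prediction slots
pooled_counts = pooled.value_counts()
```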

### 6.) What is the possible relationship between retweet counts and favorite counts?

In [None]:
#create a scatter plot
y = tweet_masterdf.retweet_count
x = tweet_masterdf.favorite_count
plt.scatter(x,y, marker='o', alpha=0.5)
plt.title('Correlation of Retweet Counts and Favorite Counts')
plt.xlabel('Favorite Counts')
plt.ylabel('Retweet Counts');

> According to the scatterplot of favorite counts and retweet counts, there may be a positive correlation between the two. However, since no statistical test was performed, we are unable to say for certain.
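
As a first descriptive step toward quantifying that relationship, the Pearson correlation coefficient can be computed with `Series.corr()`. The sketch below uses made-up counts (the `toy` frame and `r` are hypothetical); a coefficient on its own is not an inferential test, so it would not settle the question by itself.

```python
import pandas as pd

# Hypothetical engagement counts for four tweets
toy = pd.DataFrame({'favorite_count': [100, 200, 300, 400],
                    'retweet_count': [30, 55, 95, 120]})

# Series.corr() computes the Pearson coefficient by default
r = toy['favorite_count'].corr(toy['retweet_count'])
```

Values of `r` near 1 indicate a strong positive linear relationship, while values near 0 indicate little linear relationship.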

# Conclusion & Insights

<ul>
    <li> There seems to be a positive relationship between retweet counts and favorite counts. However, we are unable to state this with certainty until we use inferential statistics to assess the relationship accurately. </li>
    <li> From the creation of We Rate Dogs up until August 1, 2017, it received a total of 5,516,906 retweets and 17,738,077 favorites. </li>
    <li> Of the 17,738,077 total favorites as of August 1, 2017, the most favorited tweet accounts for only 0.74%, or 132,810 favorites.</li>
    <li> Dogs are most often categorized as puppers, followed by doggos, puppos, dogs assigned multiple categories and floofers. </li>
</ul>

In [71]:
#save the merged dataframe to a csv file
tweet_masterdf.to_csv('twitter_master.csv', index=False)

#### Citations:

https://kite.com/python/examples/4420/beautifulsoup-parse-an-html-table-and-write-to-a-csv

https://stackoverflow.com/questions/39213597/convert-text-data-from-requests-object-to-dataframe-with-pandas

https://stackoverflow.com/questions/53578054/combining-different-columns

https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html