In [1]:
import pandas as pd
import requests
import numpy as np
import json
import tweepy

Your goal: wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.

### Key points to keep in mind when data wrangling for this project:

- You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
- Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
- Cleaning includes merging individual pieces of data according to the rules of tidy data.
- The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
- You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
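
The first key point (original ratings only, with images) can be sketched as a filter. This is a minimal illustration on made-up rows, assuming the archive's `retweeted_status_id` column and the image predictions' `jpg_url` column; the function name `original_with_images` is hypothetical and not part of the project.

```python
import numpy as np
import pandas as pd

def original_with_images(archive, predictions):
    """Keep tweets that are not retweets and that have an image prediction."""
    # Retweets carry a non-null retweeted_status_id.
    originals = archive[archive['retweeted_status_id'].isnull()]
    # An inner merge keeps only tweets that also appear in the image table.
    return originals.merge(predictions[['tweet_id', 'jpg_url']],
                           on='tweet_id', how='inner')

# Toy data, not taken from the real archive.
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'retweeted_status_id': [np.nan, 99.0, np.nan],
})
predictions = pd.DataFrame({
    'tweet_id': [1, 2],
    'jpg_url': ['a.jpg', 'b.jpg'],
})

kept = original_with_images(archive, predictions)
print(kept['tweet_id'].tolist())  # [1]
```

Tweet 2 is dropped because it is a retweet, and tweet 3 because it has no image.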

In [2]:
#create df for twitter archive
twit_arch = pd.read_csv('twitter-archive-enhanced.csv')

In [3]:
#access img predictions
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
r = requests.get(url, stream=True)

In [4]:
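#create tsv for img predictions
r.raise_for_status()  #stop here if the download failed
with open('image-predictions.tsv', 'wb') as f:
    #write the streamed response to disk in chunks
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)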

In [5]:
#create df for img predictions
img_pred = pd.read_csv('image-predictions.tsv', sep='\t')

In [6]:
#keys for accessing API, read from environment variables so secrets stay out of the notebook
#(the variable names below are assumptions; use whatever names you export)
import os

consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']

access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_token_secret = os.environ['TWITTER_ACCESS_TOKEN_SECRET']

In [7]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

In [8]:
#wait out rate limits instead of failing partway through the tweet loop
api = tweepy.API(auth, wait_on_rate_limit=True)

In [9]:
#access we rate dogs username, @dog_rates
we_rate_dogs = api.get_user('dog_rates')

In [10]:
#get all tweets and put them into a dict with retweet and like counts
retweets_likes = {}
for tw_id in twit_arch.tweet_id:
    try:
        tweet = api.get_status(tw_id)
        retweets_likes[tweet.id] = {
            'retweets' : tweet.retweet_count,
            'likes' : tweet.favorite_count}
    except Exception:
        #tweet could not be fetched (e.g. deleted); record missing counts
        retweets_likes[tw_id] = {
            'retweets' : None,
            'likes' : None}

In [11]:
#put dict of retweets_likes into json file
with open('tweet_json.txt', 'w') as outfile:  
    json.dump(retweets_likes, outfile)

In [12]:
#make dict into dataframe
retweets_likes = pd.DataFrame(retweets_likes)

In [13]:
#open file to make dict into dataframe
with open('tweet_json.txt', 'r') as infile:
    retweets_likes = pd.DataFrame(json.load(infile))

### To run on startup

Run these cells when the files created above already exist and the data does not need to be regathered.

In [224]:
import pandas as pd
import requests
import numpy as np
import json
import tweepy

In [225]:
#create df for twitter archive
twit_arch = pd.read_csv('twitter-archive-enhanced.csv')

In [226]:
#create df for img predictions
img_pred = pd.read_csv('image-predictions.tsv', sep='\t')

In [227]:
with open('tweet_json.txt', 'r') as infile:
    retweets_likes = pd.DataFrame(json.load(infile))

### Data Assessment

In [228]:
#clean copies
retweets_likes_clean = retweets_likes.copy()
img_pred_clean = img_pred.copy()
twit_arch_clean = twit_arch.copy()

Visual Check of Data

In [229]:
twit_arch_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [230]:
twit_arch_clean.head(25)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


In [231]:
twit_arch_clean.name.value_counts()

None         745
a             55
Charlie       12
Cooper        11
Lucy          11
Oliver        11
Penny         10
Lola          10
Tucker        10
Winston        9
Bo             9
the            8
Sadie          8
Daisy          7
Toby           7
Bailey         7
Buddy          7
an             7
Stanley        6
Koda           6
Scout          6
Jax            6
Dave           6
Leo            6
Bella          6
Jack           6
Milo           6
Rusty          6
Oscar          6
Gus            5
            ... 
Tove           1
Glacier        1
Pip            1
Zoe            1
Cilantro       1
Jennifur       1
Olaf           1
Teddy          1
Socks          1
Snoop          1
Lenox          1
Jeremy         1
Zooey          1
Todo           1
Kanu           1
Ember          1
Steve          1
Laika          1
Eazy           1
Obi            1
Kane           1
Finnegus       1
Alexander      1
Meatball       1
Zara           1
Bronte         1
Edgar          1
Glenn         

- not all dogs are categorized as doggo, floofer, pupper, or puppo
- some dogs are missing names; the name column holds placeholders such as 'None', 'a', or 'an'
- the rating numerator and denominator could be combined into one column for easier reading
- retweets can be removed, since they are not original ratings; the retweet-related columns can then be dropped for a cleaner dataset

In [232]:
img_pred_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.0+ KB


In [233]:
img_pred_clean.head(25)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese_mountain_dog,0.651137,True,English_springer,0.263788,True,Greater_Swiss_Mountain_dog,0.016199,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.045885,False,terrapin,0.017885,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517,True,Tibetan_mastiff,0.058279,True,fur_coat,0.054449,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False,shopping_basket,0.014594,False,golden_retriever,0.007959,True
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature_poodle,0.201493,True,komondor,0.192305,True,soft-coated_wheaten_terrier,0.082086,True


- can combine this dataset with the other: where tweet IDs match, add the predicted breed if its confidence is above a certain score
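
The idea of keeping a guess only above a certain score can be sketched as follows. This is a minimal example under stated assumptions: the three rows are copied from the `img_pred_clean.head(25)` output above, while the 0.5 threshold and the names `min_conf` and `best_guess` are arbitrary choices made for illustration.

```python
import pandas as pd

# Three rows copied from the image predictions shown above.
preds = pd.DataFrame({
    'tweet_id': [666020888022790149, 666051853826850816, 666055525042405380],
    'p1': ['Welsh_springer_spaniel', 'box_turtle', 'chow'],
    'p1_conf': [0.465074, 0.933012, 0.692517],
    'p1_dog': [True, False, True],
})

min_conf = 0.5  # assumed cut-off; tune to taste

# Keep the top prediction only when it is a dog breed and confident enough;
# Series.where leaves a missing value in every other row.
confident_dog = preds['p1_dog'] & (preds['p1_conf'] >= min_conf)
preds['best_guess'] = preds['p1'].where(confident_dog)
print(preds[['tweet_id', 'best_guess']])
```

Only the `chow` row keeps its guess here: the spaniel falls below the threshold and the turtle is not a dog.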

In [234]:
retweets_likes_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, likes to retweets
Columns: 2356 entries, 666020888022790149 to 892420643555336193
dtypes: int64(1789), object(567)
memory usage: 36.8+ KB


In [235]:
retweets_likes_clean.head()

Unnamed: 0,666020888022790149,666029285002620928,666033412701032449,666044226329800704,666049248165822465,666050758794694657,666051853826850816,666055525042405380,666057090499244032,666058600524156928,...,890240255349198849,890609185150312448,890729181411237888,890971913173991426,891087950875897856,891327558926688256,891689557279858688,891815181378084864,892177421306343426,892420643555336193
likes,,,,,,,,,,,...,32092,27898,65879,11912,20318,40496,42340,25159,33388,38984
retweets,,,,,,,,,,,...,7529,4323,19172,2105,3160,9530,8764,4216,6355,8650


- need to transpose dataset
- can combine this with other dataset
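
As an alternative to transposing afterwards, the frame can be built with tweet IDs as rows from the start. This is a sketch on toy counts, assuming the same nested-dict shape as `retweets_likes`; it is not what the notebook does below.

```python
import pandas as pd

# Toy counts shaped like retweets_likes; the numbers are made up.
counts = {
    101: {'retweets': 5, 'likes': 20},
    102: {'retweets': 7, 'likes': 31},
}

# orient='index' turns each outer key into a row, so no transpose is needed.
df = pd.DataFrame.from_dict(counts, orient='index')

# Name the index so reset_index() produces a tweet_id column.
df.index.name = 'tweet_id'
df = df.reset_index()
print(df)
```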

### Quality Issues

twit_arch

- X Change timestamp data type into datetime (from object) in twit_arch_clean
- X Erase useless columns (source)
- X delete dogs with no names (unlikely that there's a rating)
- X Remove tweets that are replies
- X Remove tweets that are retweets

img_pred

- X only include column of id, link, and best guess
- X add to twit_arch

retweets_likes
- X Switch retweets and likes columns & rows (transpose)
- X add column header for tweet id
- X add columns to twit_arch, match by tweet id

### Tidiness Issues:

twit_arch

- X transpose retweets_likes dataframe
- X Combine doggo, floofer, puppo and pupper into one column
- Combine ratings into one column and make into a string
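
The last item, which is not carried out in this notebook, could look like the sketch below. It assumes the archive's `rating_numerator` and `rating_denominator` columns; the new column name `rating` is a hypothetical choice.

```python
import pandas as pd

# Toy ratings using the archive's column names.
ratings = pd.DataFrame({
    'rating_numerator': [13, 12],
    'rating_denominator': [10, 10],
})

# Convert both parts to strings and join them with a slash.
ratings['rating'] = (ratings['rating_numerator'].astype(str)
                     + '/'
                     + ratings['rating_denominator'].astype(str))
print(ratings['rating'].tolist())  # ['13/10', '12/10']
```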

### Tidiness Issues & Testing

In [236]:
#transpose retweets_likes so columns are retweets & likes, rows are tweet IDs
retweets_likes_clean = retweets_likes_clean.transpose()

In [237]:
#test retweets_likes
retweets_likes_clean.sample(10)

Unnamed: 0,likes,retweets
682638830361513985,2214.0,663.0
667509364010450944,,
791774931465953280,49568.0,25257.0
786233965241827333,16847.0,5432.0
834209720923721728,22182.0,5327.0
676606785097199616,2001.0,482.0
735991953473572864,,
670783437142401025,853.0,420.0
674410619106390016,1259.0,507.0
670755717859713024,466.0,120.0


### Quality Issues & Testing

In [238]:
#Change timestamp data type into datetime (from object) in twit_arch_clean
twit_arch_clean.timestamp = pd.to_datetime(twit_arch_clean.timestamp)

In [239]:
#testing for datetime
twit_arch_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null datetime64[ns]
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: datetime64[ns](1

##### BACK TO TIDY ISSUES!! 

In [240]:
#reset retweets_likes_clean index
retweets_likes_clean = retweets_likes_clean.reset_index()

In [241]:
#rename column headers to match twit_arch_clean
retweets_likes_clean.columns = ['tweet_id', 'likes', 'retweets']

In [242]:
retweets_likes_clean.sample(10)

Unnamed: 0,tweet_id,likes,retweets
850,691793053716221953,8762,4636
1764,799063482566066176,8891,2791
259,670755717859713024,466,120
121,668190681446379520,677,207
918,697242256848379904,2707,737
543,676811746707918848,1508,462
2137,850333567704068097,3599,362
885,694001791655137281,3634,1148
931,697990423684476929,3531,1424
126,668248472370458624,1033,517


In [243]:
#Combine twit_arch with retweets_likes
twit_arch_clean = pd.merge(twit_arch_clean, retweets_likes_clean,
                            on=['tweet_id'], how='left')

In [244]:
twit_arch_clean.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,likes,retweets
0,892420643555336193,,,2017-08-01 16:23:56,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,,,
1,892177421306343426,,,2017-08-01 00:17:27,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,,,
2,891815181378084864,,,2017-07-31 00:18:03,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,,,
3,891689557279858688,,,2017-07-30 15:58:51,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,,,
4,891327558926688256,,,2017-07-29 16:00:24,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,,,
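
The `likes` and `retweets` columns in the output above are empty even for tweets that have counts in `retweets_likes_clean`. One likely cause, offered as an assumption rather than a confirmed diagnosis: `json.dump` writes dictionary keys as strings, so after reloading `tweet_json.txt` the tweet IDs are strings, while `twit_arch_clean.tweet_id` is `int64`, and the merge keys never match. The sketch below reproduces the round trip on toy data and casts the IDs before merging.

```python
import json
import pandas as pd

# Toy stand-ins for the two tables; the values are made up.
archive = pd.DataFrame({'tweet_id': [101, 102]})
counts = {101: {'retweets': 5, 'likes': 20}}

# JSON object keys are always strings, so the integer key comes back as '101'.
loaded = json.loads(json.dumps(counts))
print(list(loaded))  # ['101']

counts_df = (pd.DataFrame(loaded)
             .transpose()
             .reset_index()
             .rename(columns={'index': 'tweet_id'}))

# Cast the string IDs to int64 so they match the archive's key dtype.
counts_df['tweet_id'] = counts_df['tweet_id'].astype('int64')

merged = archive.merge(counts_df, on='tweet_id', how='left')
print(merged)
```

Renaming only the `index` column, rather than assigning all column names by position, also avoids depending on the order in which `likes` and `retweets` happen to appear.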


### Quality, Round 2

In [245]:
#Check for tweets that are replies, and delete
twit_arch_clean = twit_arch_clean[twit_arch_clean.in_reply_to_status_id.isnull()]

In [246]:
#check (using user_id) that there are no values besides null listed
twit_arch_clean.in_reply_to_user_id.value_counts()

Series([], Name: in_reply_to_user_id, dtype: int64)

In [247]:
#drop both columns
twit_arch_clean = twit_arch_clean.drop(['in_reply_to_user_id', 'in_reply_to_status_id'], axis=1)

In [248]:
#make sure the columns are gone
twit_arch_clean.head()

Unnamed: 0,tweet_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,likes,retweets
0,892420643555336193,2017-08-01 16:23:56,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,,,
1,892177421306343426,2017-08-01 00:17:27,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,,,
2,891815181378084864,2017-07-31 00:18:03,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,,,
3,891689557279858688,2017-07-30 15:58:51,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,,,
4,891327558926688256,2017-07-29 16:00:24,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,,,


In [249]:
#Check for tweets that are retweets, and delete
twit_arch_clean = twit_arch_clean[twit_arch_clean.retweeted_status_id.isnull()]

In [250]:
#check (using user_id) that there are no values besides null listed
twit_arch_clean.retweeted_status_user_id.value_counts()

Series([], Name: retweeted_status_user_id, dtype: int64)

In [251]:
#drop all retweet columns
twit_arch_clean = twit_arch_clean.drop(['retweeted_status_id',
                                        'retweeted_status_user_id',
                                        'retweeted_status_timestamp'], axis=1)

In [252]:
#make sure the columns are gone
twit_arch_clean.head()

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,likes,retweets
0,892420643555336193,2017-08-01 16:23:56,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,,,
1,892177421306343426,2017-08-01 00:17:27,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,,,
2,891815181378084864,2017-07-31 00:18:03,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,,,
3,891689557279858688,2017-07-30 15:58:51,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,,,
4,891327558926688256,2017-07-29 16:00:24,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,,,


In [253]:
#Erase source column; it's not used for our purposes
twit_arch_clean = twit_arch_clean.drop('source', axis=1)

In [254]:
#test to make sure it's done

twit_arch_clean.head()

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,likes,retweets
0,892420643555336193,2017-08-01 16:23:56,This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,,,
1,892177421306343426,2017-08-01 00:17:27,This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,,,
2,891815181378084864,2017-07-31 00:18:03,This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,,,
3,891689557279858688,2017-07-30 15:58:51,This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,,,
4,891327558926688256,2017-07-29 16:00:24,This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,,,


In [255]:
#move to tidiness- combine doggo categories
#Combine dog category columns (doggo, floofer, pupper, puppo) into one column

In [256]:
twit_arch_clean = pd.melt(twit_arch_clean, id_vars=['tweet_id', 'timestamp', 'text', 'expanded_urls',
                                                    'rating_numerator', 'rating_denominator', 'name', 'likes', 'retweets'], 
                          value_vars=['floofer', 'pupper', 'puppo', 'doggo'], 
                         var_name='dog_type')

In [257]:
#test to see dog values
twit_arch_clean.sample(10)

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,likes,retweets,dog_type,value
1354,686947101016735744,2016-01-12 16:25:26,This is Jackson. He was specifically told not ...,https://twitter.com/dog_rates/status/686947101...,11,10,Jackson,,,floofer,
635,769940425801170949,2016-08-28 16:51:16,This is Klein. These pics were taken a month a...,https://twitter.com/dog_rates/status/769940425...,12,10,Klein,,,floofer,
3187,707059547140169728,2016-03-08 04:25:07,Say hello to Cupcake. She's an Icelandic Dippe...,https://twitter.com/dog_rates/status/707059547...,11,10,Cupcake,,,pupper,
6484,846505985330044928,2017-03-27 23:35:28,THIS WAS NOT HIS FAULT HE HAD NO IDEA. 11/10 S...,https://twitter.com/shomaristone/status/846484...,11,10,,,,doggo,
1228,695629776980148225,2016-02-05 15:27:17,Meet Calvin. He's proof that degrees mean abso...,https://twitter.com/dog_rates/status/695629776...,8,10,Calvin,,,floofer,
4259,878604707211726852,2017-06-24 13:24:20,Martha is stunning how h*ckin dare you. 13/10 ...,https://twitter.com/bbcworld/status/8785998685...,13,10,,,,puppo,
522,788039637453406209,2016-10-17 15:31:05,Did... did they pick out that license plate? 1...,https://twitter.com/dog_rates/status/788039637...,12,10,,,,floofer,
5922,672980819271634944,2015-12-05 03:28:25,Extraordinary dog here. Looks large. Just a he...,https://twitter.com/dog_rates/status/672980819...,5,10,,,,puppo,
6375,873580283840344065,2017-06-10 16:39:04,We usually don't rate Deck-bound Saskatoon Bla...,https://twitter.com/dog_rates/status/873580283...,13,10,,,,doggo,
451,800141422401830912,2016-11-20 00:59:15,This is Peaches. She's the ultimate selfie sid...,https://twitter.com/dog_rates/status/800141422...,13,10,Peaches,,,floofer,
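
Note that melting the four stage columns produces four rows per tweet (one per stage), which is why the row indices above run past 6000. A possible alternative, sketched here on toy rows, collapses the four columns into a single `dog_type` column while keeping one row per tweet. It assumes the archive marks an absent stage with the string 'None'; the helper `join_stages` is hypothetical.

```python
import pandas as pd

# Toy rows; 'None' (a string) marks "not this stage", as assumed above.
dogs = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo':   ['doggo', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None'],
})
stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

def join_stages(row):
    """Join the stage names present in one row; None when there are none."""
    found = [value for value in row if value != 'None']
    return ', '.join(found) if found else None

# Build the single column, then drop the four originals.
dogs['dog_type'] = dogs[stage_cols].apply(join_stages, axis=1)
dogs = dogs.drop(columns=stage_cols)
print(dogs)
```

Tweets tagged with more than one stage keep both labels (for example `doggo, pupper`) instead of being split across rows.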


In [258]:
#delete dogs with no names ('None' and 'a')
twit_arch_clean = twit_arch_clean[twit_arch_clean.name != 'None']

In [259]:
twit_arch_clean = twit_arch_clean[twit_arch_clean.name != 'a']

In [260]:
twit_arch_clean.name.value_counts()

Charlie     44
Lucy        44
Oliver      40
Cooper      40
Tucker      36
Penny       36
Sadie       32
Lola        32
Winston     32
the         32
Toby        28
Daisy       28
Stanley     24
an          24
Bo          24
Jax         24
Bailey      24
Bella       24
Koda        24
Oscar       24
Rusty       20
Bentley     20
Leo         20
Buddy       20
Milo        20
Dave        20
Louis       20
Chester     20
Scout       20
Chip        16
            ..
Suki         4
Creg         4
Stubert      4
Maisey       4
Bronte       4
Meatball     4
Clyde        4
Finnegus     4
Livvie       4
Brudge       4
Lili         4
Grizzie      4
Mollie       4
Teddy        4
Snoop        4
Lenox        4
Jeremy       4
Philbert     4
Zooey        4
Todo         4
Kanu         4
Dawn         4
Ember        4
Steve        4
Laika        4
Socks        4
Eazy         4
Kane         4
Moreton      4
Opie         4
Name: name, Length: 953, dtype: int64
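
The counts above still include entries such as 'the' and 'an', which the two filters did not target. One heuristic, sketched on toy names, treats any all-lowercase entry as a non-name. This is an assumption: it would also drop a genuine name that happens to be written in lowercase.

```python
import pandas as pd

# Toy names mixing real names with the placeholders seen in the archive.
names = pd.DataFrame({'name': ['Charlie', 'None', 'a', 'an', 'the', 'Lucy']})

# Assumption: real names start with an uppercase letter, so all-lowercase
# entries and the placeholder 'None' are treated as missing names.
not_a_name = (names['name'] == 'None') | names['name'].str.islower()
named = names[~not_a_name]
print(named['name'].tolist())  # ['Charlie', 'Lucy']
```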

In [261]:
img_pred.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [262]:
#delete extra columns in img_pred
img_pred_clean = img_pred_clean.drop(['p2', 'p2_conf', 'p2_dog',
                                      'p3', 'p3_conf', 'p3_dog'], axis=1)

In [263]:
img_pred_clean = img_pred_clean.drop('img_num', axis=1)

In [264]:
img_pred_clean.head()

Unnamed: 0,tweet_id,jpg_url,p1,p1_conf,p1_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,Welsh_springer_spaniel,0.465074,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,redbone,0.506826,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,German_shepherd,0.596461,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,Rhodesian_ridgeback,0.408143,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,miniature_pinscher,0.560311,True


In [265]:
#Combine twit_arch with img_pred
twit_arch_clean = pd.merge(twit_arch_clean, img_pred_clean,
                            on=['tweet_id'], how='left')

In [266]:
twit_arch_clean.head()

Unnamed: 0,tweet_id,timestamp,text,expanded_urls,rating_numerator,rating_denominator,name,likes,retweets,dog_type,value,jpg_url,p1,p1_conf,p1_dog
0,892420643555336193,2017-08-01 16:23:56,This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,floofer,,https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg,orange,0.097049,False
1,892177421306343426,2017-08-01 00:17:27,This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,floofer,,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,Chihuahua,0.323581,True
2,891815181378084864,2017-07-31 00:18:03,This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,floofer,,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,Chihuahua,0.716012,True
3,891689557279858688,2017-07-30 15:58:51,This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,floofer,,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,paper_towel,0.170278,False
4,891327558926688256,2017-07-29 16:00:24,This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,floofer,,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,basset,0.555712,True


Insights

1. 

2. 

3. 

Visualizations

1. 

2. 