In [1]:
import pandas as pd
import numpy as np
import requests
import os
from random import sample
import re

# Gather

Code for downloading Udacity's Dog Prediction Data

In [2]:
#url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
#r = requests.get('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')

In [3]:
# make a directory if one does not already exist
#folder_name = 'dog_predictions'
#if not os.path.exists(folder_name):
    #os.makedirs(folder_name)

In [4]:
#with open(os.path.join(folder_name, 
                           #url.split('/')[-1]), mode='wb') as file:
             #file.write(r.content)
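
The three commented-out cells above can be combined into one helper. This is a minimal sketch, not the code that was actually run: the function names `local_path_for` and `download_file`, the 30-second timeout, and the status check are assumptions added for illustration.

```python
import os
import requests


def local_path_for(url, folder_name):
    # Hypothetical helper: the local file name is the last segment of the URL.
    return os.path.join(folder_name, url.split('/')[-1])


def download_file(url, folder_name):
    # Hypothetical helper: create the folder if needed, then save the response body.
    os.makedirs(folder_name, exist_ok=True)
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on an HTTP error instead of saving an error page
    path = local_path_for(url, folder_name)
    with open(path, mode='wb') as file:
        file.write(response.content)
    return path

# Example call (performs a network request, so it is left commented out):
# download_file('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv', 'dog_predictions')
```

`os.makedirs(..., exist_ok=True)` replaces the explicit `os.path.exists` check from the original cell.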

Initializing All Relevant Datasets

In [5]:
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')
twitter_archive;

In [6]:
predicts = pd.read_csv('dog_predictions/image-predictions.tsv', sep='\t')
predicts;

In [45]:
twitter_api = pd.read_csv('twitter_api_data.csv')
twitter_api;

# Assess

### Documented Issues

`twitter_archive`
#### Quality
- timestamp is a string, not a datetime
- text column contains irrelevant material (URLs)
- missing values in expanded_urls
- There are 42 instances where categorical variables are found in the text but are not accurately accounted for in the categorical columns
- There are 109 instances where the name column is inaccurate and an incorrect name is in place (ex: at index 542 the name is recorded as "incredibly", a word picked up from the text rather than an actual name).
    - I recognize that it is an oversight that I cannot test whether or not a name is missed because I do not yet have knowledge of a language processing library.
- Missing values in "in_reply_to_status_id", "in_reply_to_user_id", "retweeted_status_id", "retweeted_status_user_id", "retweeted_status_timestamp"
- Ratings may contain floats. The text needs to be checked again
- change id to string
- duplicate tweets

#### Tidiness
- text column contains a source variable for the tweet
- dog "ages/types" (floofer, pupper etc.) should be a single categorical column
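
The "change id to string" item matters because tweet IDs are identifiers, not quantities, and they are too large to survive a round trip through floating point (the reply and retweet ID columns are read as float64 because they contain missing values). A small sketch of the problem, using two `tweet_id` values from the archive; the variable names are illustrative only.

```python
import pandas as pd

tweet_id = 876537666061221889  # an odd integer far above 2**53

# float64 has a 53-bit significand, so an odd integer this large cannot be stored exactly.
as_float = float(tweet_id)
lost_precision = int(as_float) != tweet_id

# Converting the integer column to strings keeps every digit.
ids = pd.Series([881666595344535552, tweet_id])
ids_as_str = ids.astype(str)
print(lost_precision)
print(ids_as_str.tolist())
```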

`predicts`
#### Quality
- prediction dog breeds have inconsistent casing
- column titles should be full, descriptive names
- change id to string
- extract predictions for images where predictions are both dogs and above 70% confidence
    - If our confidence level is too low, then our statements become less meaningful. However, because I am not sure how to test the accuracy of the predictions, I will choose a lowish confidence level since I am aware that many of the pictures will contain dogs.
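
The casing fix and the 70% filter described above could look like the sketch below. It only considers the first prediction (`p1`), which is an assumption; the first four rows are taken from the `predicts` sample, and the last row is made up so that the filter has something to keep.

```python
import pandas as pd

demo = pd.DataFrame({
    "p1": ["Pembroke", "bookcase", "seat_belt", "beagle", "Golden_Retriever"],
    "p1_conf": [0.244705, 0.890601, 0.747739, 0.39628, 0.95],  # last value is hypothetical
    "p1_dog": [True, False, False, True, True],
})

# Normalize the inconsistent casing of breed names.
demo["p1"] = demo["p1"].str.lower()

# Keep rows where the top prediction is a dog AND its confidence is above 70%.
confident_dogs = demo[demo["p1_dog"] & (demo["p1_conf"] > 0.70)]
print(confident_dogs)
```

Note that "bookcase" and "seat_belt" exceed the confidence threshold but are dropped because they are not dogs, which is why both conditions are needed.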

#### Tidiness
 

`twitter_api`
#### Quality
- change id to string (should have done this when extracting)

#### Tidiness
- tables need to be reorganized
    - 1 for source metadata (urls

In [46]:
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated since pandas 1.0

### twitter_archive

#### Visual Assessment

In [9]:
#to be used for visual assessments. Suppressed to save space.
twitter_archive.sample(5);

In [10]:
twitter_archive.source.unique()

array(['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
       '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
       '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
       '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>'],
      dtype=object)

In [109]:
#no duplicated values
list(twitter_archive.text.duplicated()).count(True)

0

In [12]:
#to be used for visual assessments
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [13]:
#investigating max values for numerator/denominator
twitter_archive.describe();

#### Function(s) used for programmatic assessment

In [14]:
def category_accuracy(df, columns_list):
    '''Parse the "text" column of a dataframe and check whether it agrees with the
    categorical columns, GIVEN that the text is the source of the categorical values
    and each categorical column's value is the same as its column header.
    Restrictions: the text column must be called "text".
    Returns: the index labels of rows where the categorical values do not match the text.'''
    offending_rows = []
    # iterate through df, matching the text against the categorical columns
    for index, row in df.iterrows():
        match = []
        # current values in the categorical columns
        secondary = [row[column] for column in columns_list]
        # go through every word and collect each category name it contains (substring match)
        source = row.text.split()
        for word in source:
            match += [value for value in columns_list if value in word and pd.notnull(word) and word != 'None']
        # checking for accuracy: if there was at least one match, was it recorded correctly?
        if len(match) > 0:
            test = []
            # this loop checks whether each row value is the same as a matched value
            for current in secondary:
                for i in match:
                    test.append(i == current)
            # if there are fewer correct values than matches, then a column is inaccurate
            if test.count(True) < len(match):
                offending_rows.append(index)
    return offending_rows
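
For comparison, here is a vectorized sketch of a similar check. It is not the function above: it matches whole words only (so "puppertunity" does not count as "pupper", and plurals such as "doggos" do not count either), and it flags a row whenever the text and a column disagree in either direction, so its counts would likely differ from those of `category_accuracy`. The name `mismatched_rows` and the demo rows are assumptions for illustration.

```python
import pandas as pd


def mismatched_rows(df, stages, text_col="text"):
    """Return index labels where a stage word in the text disagrees with its column."""
    bad = pd.Series(False, index=df.index)
    for stage in stages:
        # True where the stage appears in the text as a whole word
        in_text = df[text_col].str.contains(rf"\b{stage}\b", case=False, regex=True)
        # True where the categorical column records that stage
        in_col = df[stage] == stage
        bad |= in_text != in_col
    return list(df.index[bad])


demo = pd.DataFrame({
    "text": ["This is a doggo. 12/10",
             "He couldn't miss this puppertunity",
             "Such a good pupper"],
    "doggo": ["doggo", "None", "None"],
    "pupper": ["None", "pupper", "None"],
})
flagged = mismatched_rows(demo, ["doggo", "pupper"])
print(flagged)
```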

### Parsing Text for Accuracy

In [15]:
#category_accuracy(df, columns)
offending_categorical_rows = category_accuracy(twitter_archive, ['doggo', 'floofer', 'pupper', 'puppo'])
print('The number of instances where doggo, floofer, pupper, and puppo is found in the text, but does not have the correct value is: ', len(offending_categorical_rows))

The number of instances where doggo, floofer, pupper, and puppo is found in the text, but does not have the correct value is:  42


In [16]:
print('The list of offending rows: ', offending_categorical_rows)

The list of offending rows:  [54, 83, 85, 106, 134, 172, 228, 268, 274, 296, 302, 475, 477, 545, 798, 934, 946, 987, 993, 1027, 1093, 1120, 1220, 1228, 1254, 1265, 1351, 1516, 1634, 1635, 1636, 1643, 1710, 1712, 1743, 1826, 1843, 1847, 1862, 1900, 1928, 2141]


In [17]:
#investigate offending rows. Rows had issues. Code has been commented out to save space.
#for i in offending_categorical_rows:
    #display(twitter_archive[twitter_archive.index == i])

### Checking Names Column

Since real names are capitalized, lowercase entries in the name column are flagged because they are unlikely to be actual names.
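
The same lowercase check can be expressed without an explicit loop. A minimal sketch on a handful of values that appear in the name column; `str.match` anchors the pattern at the start of each string, so only the first character is tested.

```python
import pandas as pd

names = pd.Series(["Gary", "a", "None", "quite", "Venti"])

# Keep entries whose first character is a lowercase ASCII letter.
lowercase_names = names[names.str.match(r"[a-z]")]
print(lowercase_names)
```

The placeholder "None" starts with a capital letter, so it is not flagged here and would need to be handled separately.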

In [18]:
#demonstrating regular names
regular_names = []
for name in twitter_archive.name:
    if pd.notnull(name) and name != 'None' and name[0].isupper():
        regular_names.append(name)
sample(regular_names, 20)

['Clark',
 'Doc',
 'Oscar',
 'Rosie',
 'Moofasa',
 'Ron',
 'Charlie',
 'Dexter',
 'Percy',
 'Shooter',
 'Gary',
 'Sammy',
 'Phred',
 'Wilson',
 'Dave',
 'Bella',
 'Daisy',
 'Tucker',
 'Chase',
 'Reggie']

In [19]:
#flagging lowercase names and index
incorrect_names = []
for index, row in twitter_archive.iterrows():
    if pd.notnull(row['name']) and row['name'] != 'None' and row['name'][0].islower():
        incorrect_names.append((row['name'], index))
incorrect_names
#len(incorrect_names) returns 109 instances

[('such', 22),
 ('a', 56),
 ('quite', 118),
 ('quite', 169),
 ('quite', 193),
 ('not', 335),
 ('one', 369),
 ('incredibly', 542),
 ('a', 649),
 ('mad', 682),
 ('an', 759),
 ('very', 773),
 ('a', 801),
 ('very', 819),
 ('just', 822),
 ('my', 852),
 ('one', 924),
 ('not', 988),
 ('his', 992),
 ('one', 993),
 ('a', 1002),
 ('a', 1004),
 ('a', 1017),
 ('an', 1025),
 ('very', 1031),
 ('actually', 1040),
 ('a', 1049),
 ('just', 1063),
 ('getting', 1071),
 ('mad', 1095),
 ('very', 1097),
 ('this', 1120),
 ('unacceptable', 1121),
 ('all', 1138),
 ('a', 1193),
 ('old', 1206),
 ('a', 1207),
 ('infuriating', 1259),
 ('a', 1340),
 ('a', 1351),
 ('a', 1361),
 ('an', 1362),
 ('a', 1368),
 ('a', 1382),
 ('very', 1385),
 ('getting', 1435),
 ('just', 1457),
 ('a', 1499),
 ('the', 1527),
 ('the', 1603),
 ('actually', 1693),
 ('by', 1724),
 ('a', 1737),
 ('officially', 1747),
 ('a', 1785),
 ('the', 1797),
 ('the', 1815),
 ('a', 1853),
 ('a', 1854),
 ('a', 1877),
 ('a', 1878),
 ('life', 1916),
 ('a', 1923

## predicts

In [20]:
predicts.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1629,805207613751304193,https://pbs.twimg.com/media/CyysDQlVIAAYgrl.jpg,1,Pembroke,0.244705,True,Rhodesian_ridgeback,0.180461,True,Cardigan,0.094664,True
1622,803380650405482500,https://pbs.twimg.com/media/CyYub2kWEAEYdaq.jpg,1,bookcase,0.890601,False,entertainment_center,0.019287,False,file,0.00949,False
1517,787322443945877504,https://pbs.twimg.com/media/Cu0hlfwWYAEdnXO.jpg,1,seat_belt,0.747739,False,golden_retriever,0.105703,True,dingo,0.017257,False
1660,811627233043480576,https://pbs.twimg.com/media/C0N6opSXAAAkCtN.jpg,1,beagle,0.39628,True,Pembroke,0.049562,True,wire-haired_fox_terrier,0.046349,True
700,684800227459624960,https://pbs.twimg.com/media/CYDmK7ZVAAI_ylL.jpg,1,miniature_schnauzer,0.294457,True,Norfolk_terrier,0.161885,True,West_Highland_white_terrier,0.120992,True


In [21]:
predicts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


## twitter_api

In [22]:
twitter_api.sample(5)

Unnamed: 0,id,favorite_count,retweet_count,expanded_url
1229,713761197720473600,4979,1403,https://twitter.com/dog_rates/status/713761197720473600/photo/1
1621,684926975086034944,3586,474,https://twitter.com/dog_rates/status/684926975086034944/photo/1
523,809448704142938112,7263,1544,https://twitter.com/dog_rates/status/809448704142938112/photo/1
133,866720684873056260,19554,4622,"https://twitter.com/NBCNews/status/866458718883467265/photo/1, https://twitter.com/nbcnews/status/866458718883467265"
1022,746542875601690625,5171,1925,


In [23]:
twitter_api.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 4 columns):
id                2356 non-null int64
favorite_count    2356 non-null object
retweet_count     2356 non-null object
expanded_url      2191 non-null object
dtypes: int64(1), object(3)
memory usage: 73.7+ KB


# Clean

Creating Cleaning Copies

In [24]:
twitter_archive_clean = twitter_archive.copy()

In [25]:
predicts_clean = predicts.copy()

In [26]:
twitter_api_clean = twitter_api.copy()

### Twitter Archive

#### Quality
- timestamp is a string, not a datetime
- text column contains irrelevant material (URLs)
- 112 duplicate posts, possibly because they are retweets; many duplicates carry the prefix "RT @dog_rates: "
- texts must be stripped of "RT @dog_rates: "
- There are 42 instances where categorical variables are found in the text but are not accurately accounted for in the categorical columns
- There are 109 instances where the name column is inaccurate and an incorrect name is in place (ex: at index 542 the name is recorded as "incredibly", a word picked up from the text rather than an actual name).
    - I recognize that it is an oversight that I cannot test whether or not a name is missed because I do not yet have knowledge of a language processing library.
- Missing values in "in_reply_to_status_id", "in_reply_to_user_id", "retweeted_status_id", "retweeted_status_user_id", "retweeted_status_timestamp"
- Ratings may contain floats. The text needs to be checked again
- change id to string
- missing values in expanded_urls

#### Tidiness
- text column contains a source variable for the tweet
- dog "ages/types" (floofer, pupper etc.) should be a single categorical column
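
One way the four stage columns could be collapsed into a single column is sketched below. This is an illustration under assumptions, not the cleaning step used in this notebook: the column name `stage`, the helper `combine`, the comma-joined label for rows with two stages, and the last two demo rows are all hypothetical.

```python
import pandas as pd

stages = ["doggo", "floofer", "pupper", "puppo"]
demo = pd.DataFrame({
    "name": ["Gary", "Venti", "Rex", "Duo"],  # "Rex" and "Duo" are made-up rows
    "doggo": ["None", "None", "None", "doggo"],
    "floofer": ["None", "None", "None", "None"],
    "pupper": ["pupper", "None", "None", "None"],
    "puppo": ["None", "puppo", "None", "puppo"],
})


def combine(row):
    # Collect every stage column whose value equals its own header.
    found = [s for s in stages if row[s] == s]
    return ",".join(found) if found else None


demo["stage"] = demo[stages].apply(combine, axis=1)
demo = demo.drop(columns=stages)
print(demo)
```

Rows with no stage end up as missing values, and rows with more than one stage keep both labels so that nothing is silently discarded.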

#### Define 
Use pd.to_datetime to convert timestamp from string to datetime

#### Code

In [27]:
twitter_archive_clean.timestamp = pd.to_datetime(twitter_archive_clean.timestamp)

#### Test

In [28]:
#timestamp was successfully changed to datetime, commented for space
#twitter_archive_clean.info()
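
Instead of reading the `info()` output by eye, the conversion can be checked programmatically. A small sketch on two timestamp strings written in the archive's format; `demo` and `is_datetime` are illustrative names.

```python
import pandas as pd

demo = pd.DataFrame({"timestamp": ["2017-07-03 00:11:11 +0000",
                                   "2017-06-18 20:30:39 +0000"]})
demo["timestamp"] = pd.to_datetime(demo["timestamp"])

# True for both timezone-naive and timezone-aware datetime columns.
is_datetime = pd.api.types.is_datetime64_any_dtype(demo["timestamp"])
print(is_datetime, demo["timestamp"].dtype)
```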

#### Define
Use str.replace to remove unwanted URLs from the text column

#### Code

In [60]:
#testing
twitter_archive.text[0]

"This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU"

In [74]:
#regex citation: https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string
#the regex has been altered to work with this dataset (ellipses and other permutations have been added)
#includes the characters ://
#([\w_-]+(?:(?:\.[\w_-]+)+)) groups combinations of "alphanumeric.alphanumeric"
#([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])? groups combinations of alphanumerics and special symbols
text = re.sub(r'(\\n\\nhttps|http|ftp|https)([:…/]+)([\w_…-]*(?:(?:\.[\w_…-]*)*))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-…]*)?', '', twitter_archive.text[0], flags = re.MULTILINE)
text

"This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 "

In [72]:
twitter_archive_clean.text = twitter_archive_clean.text.str.replace(r'(\\n\\nhttps|http|ftp|https)([:…/]+)([\w_…-]*(?:(?:\.[\w_…-]*)*))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-…]*)?', '', regex=True)
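
For well-formed links, a much shorter pattern does the same job. The sketch below is a simpler alternative, not the regex used above: it removes `http://` or `https://` followed by any run of non-whitespace characters, plus the whitespace before it. It would not catch the truncated or ellipsis-containing links that the longer pattern was adapted for.

```python
import pandas as pd

texts = pd.Series([
    "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU",
])

# regex=True is passed explicitly because recent pandas versions treat the pattern as a literal by default.
cleaned = texts.str.replace(r"\s*https?://\S+", "", regex=True)
print(cleaned[0])
```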

#### Define
Use str.replace to remove the "RT @dog_rates: " prefix from the text column

#### Code

In [114]:
twitter_archive_clean.text = twitter_archive_clean.text.str.replace(r'(RT @dog_rates: )', '', regex=True)

#### Test

In [115]:
for i in twitter_archive_clean.text:
    if 'RT @dog_rates: ' in i:
        print(i)

#### Define
Drop duplicate tweets. Although they might have different metadata, they are still the same "tweet", and keeping duplicates would skew the analysis.

#### Code

In [117]:
#investigate duplicates?
list(twitter_archive_clean.text.duplicated()).count(True)

112
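
The cell above only counts the duplicates. A sketch of how they could then be dropped is shown below, on shortened versions of three tweets from the archive. Using `keep="last"` is an assumption: the archive lists newer tweets first, so the retweet appears before the original and the last occurrence is the original post.

```python
import pandas as pd

demo = pd.DataFrame({
    "tweet_id": ["816062466425819140", "815990720817401858", "805826884734976000"],
    "text": [
        "Meet Jack. He's one of the rare doggos that doesn't mind baths. 11/10",  # retweet, prefix already stripped
        "Meet Jack. He's one of the rare doggos that doesn't mind baths. 11/10",  # original tweet
        "This is Duke. He is not a fan of the pupporazzi. 12/10",
    ],
})

# Count rows whose text repeats an earlier row.
n_duplicates = demo["text"].duplicated().sum()

# Keep one row per distinct text, preferring the last (oldest) occurrence.
deduplicated = demo.drop_duplicates(subset="text", keep="last").reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```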

#### Test

In [75]:
[i for i in twitter_archive_clean.text if 'https' in i]

[]

In [32]:
#urls removed, suppressed for space
twitter_archive_clean.text;

In [33]:
twitter_archive_clean.sample(5);

#### Define
Investigate and fix inaccurate classifications, using a modified category_accuracy function

#### Code

In [76]:
for i in offending_categorical_rows:
    display(twitter_archive_clean[twitter_archive_clean.index == i])

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
54,881666595344535552,,,2017-07-03 00:11:11+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Gary. He couldn't miss this puppertunity for a selfie. Flawless focusing skills. 13/10 would boop intensely,,,,https://twitter.com/dog_rates/status/881666595344535552/photo/1,13,10,Gary,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
83,876537666061221889,,,2017-06-18 20:30:39+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",I can say with the pupmost confidence that the doggos who assisted with this search are heroic as h*ck. 14/10 for all,,,,https://twitter.com/mpstowerham/status/876162994446753793,14,10,,doggo,,,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
85,876120275196170240,,,2017-06-17 16:52:05+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","Meet Venti, a seemingly caffeinated puppoccino. She was just informed the weekend would include walks, pats and scritches. 13/10 much excite",,,,https://twitter.com/dog_rates/status/876120275196170240/photo/1,13,10,Venti,,,,puppo


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
106,871879754684805121,,,2017-06-06 00:01:46+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Say hello to Lassie. She's celebrating #PrideMonth by being a splendid mix of astute and adorable. Proudly supupporting her owner. 13/10,,,,"https://twitter.com/dog_rates/status/871879754684805121/photo/1,https://twitter.com/dog_rates/status/871879754684805121/photo/1",13,10,Lassie,,,,puppo


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
134,866686824827068416,,,2017-05-22 16:06:55+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Lili. She can't believe you betrayed her with bath time. Never looking you in the eye again. 12/10 would puppologize profusely,,,,"https://twitter.com/dog_rates/status/866686824827068416/photo/1,https://twitter.com/dog_rates/status/866686824827068416/photo/1",12,10,Lili,,,,puppo


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
172,858843525470990336,,,2017-05-01 00:40:27+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",I have stumbled puppon a doggo painting party. They're looking to be the next Pupcasso or Puppollock. All 13/10 would put it on the fridge,,,,https://twitter.com/dog_rates/status/858843525470990336/photo/1,13,10,,doggo,,,puppo


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
228,848213670039564288,8.482121e+17,4196984000.0,2017-04-01 16:41:12+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Jerry just apuppologized to me. He said there was no ill-intent to the slippage. I overreacted I admit. Pupgraded to an 11/10 would pet,,,,,11,10,,,,,puppo


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
268,841439858740625411,,,2017-03-14 00:04:30+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here we have some incredible doggos for #K9VeteransDay. All brave as h*ck. Salute your dog in solidarity. 14/10 for all,,,,"https://twitter.com/dog_rates/status/841439858740625411/photo/1,https://twitter.com/dog_rates/status/841439858740625411/photo/1,https://twitter.com/dog_rates/status/841439858740625411/photo/1,https://twitter.com/dog_rates/status/841439858740625411/photo/1",14,10,,doggo,,,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
274,840698636975636481,8.406983e+17,8.405479e+17,2017-03-11 22:59:09+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@0_kelvin_0 &gt;10/10 is reserved for puppos sorry Kevin,,,,,10,10,,,,,puppo


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
296,837366284874571778,,,2017-03-02 18:17:34+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Lucy. She has a portrait of herself on her ear. Excellent for identification pupposes. 13/10 innovative af,,,,https://twitter.com/dog_rates/status/837366284874571778/photo/1,13,10,Lucy,,,,puppo


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
302,836648853927522308,,,2017-02-28 18:46:45+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","RT @SchafeBacon2016: @dog_rates Slightly disturbed by the outright profanity, but confident doggos were involved. 11/10, would tailgate aga…",8.366481e+17,7.124572e+17,2017-02-28 18:43:57 +0000,https://twitter.com/SchafeBacon2016/status/836648149003485187/photo/1,11,10,,doggo,,,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
475,816062466425819140,,,2017-01-02 23:23:48+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: Meet Jack. He's one of the rare doggos that doesn't mind baths. 11/10 click the link to see how you can help Jack!\n\n,8.159907e+17,4196984000.0,2017-01-02 18:38:42 +0000,"https://www.gofundme.com/surgeryforjacktheminpin,https://twitter.com/dog_rates/status/815990720817401858/photo/1",11,10,Jack,doggo,,,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
477,815990720817401858,,,2017-01-02 18:38:42+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Meet Jack. He's one of the rare doggos that doesn't mind baths. 11/10 click the link to see how you can help Jack!\n\n,,,,"https://www.gofundme.com/surgeryforjacktheminpin,https://twitter.com/dog_rates/status/815990720817401858/photo/1",11,10,Jack,doggo,,,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
545,805826884734976000,,,2016-12-05 17:31:15+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Duke. He is not a fan of the pupporazzi. 12/10,,,,https://twitter.com/dog_rates/status/805826884734976000/video/1,12,10,Duke,,,,puppo


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
798,772877495989305348,,,2016-09-05 19:22:09+00:00,"<a href=""http://twitter.com"" rel=""nofollow"">Twitter Web Client</a>",You need to watch these two doggos argue through a cat door. Both 11/10,,,,https://twitter.com/dog_rates/status/772877495989305348/video/1,11,10,,doggo,,,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
934,753420520834629632,,,2016-07-14 02:47:04+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here we are witnessing an isolated squad of bouncing doggos. Unbelievably rare for this time of year. 11/10 for all,,,,https://twitter.com/dog_rates/status/753420520834629632/video/1,11,10,,doggo,,,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
946,752568224206688256,,,2016-07-11 18:20:21+00:00,"<a href=""http://vine.co"" rel=""nofollow"">Vine - Make a Scene</a>",Here are three doggos completely misjudging an airborne stick. Decent efforts tho. All 9/10,,,,https://vine.co/v/5W0bdhEUUVT,9,10,,doggo,,,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
987,749036806121881602,,,2016-07-02 00:27:45+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Dietrich. He hops at random. Other doggos don't understand him. It upsets him greatly. 8/10 would comfort,,,,https://twitter.com/dog_rates/status/749036806121881602/photo/1,8,10,Dietrich,doggo,,,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
993,748575535303884801,,,2016-06-30 17:54:50+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is one of the most reckless puppers I've ever seen. How she got a license in the first place is beyond me. 6/10,,,,https://twitter.com/dog_rates/status/748575535303884801/photo/1,6,10,one,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1027,746056683365994496,,,2016-06-23 19:05:49+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Arlen and Thumpelina. They are best pals. Cuddly af. 11/10 for both puppers,,,,"https://twitter.com/dog_rates/status/746056683365994496/photo/1,https://twitter.com/dog_rates/status/746056683365994496/photo/1",11,10,Arlen,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1093,737310737551491075,,,2016-05-30 15:52:33+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Everybody stop what you're doing and watch these puppers enjoy summer. Both 13/10,,,,https://twitter.com/dog_rates/status/737310737551491075/video/1,13,10,,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1120,731156023742988288,,,2016-05-13 16:15:54+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once,,,,https://twitter.com/dog_rates/status/731156023742988288/photo/1,204,170,this,doggo,,,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1220,714606013974974464,,,2016-03-29 00:12:05+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here are two lil cuddly puppers. Both 12/10 would snug like so much,,,,https://twitter.com/dog_rates/status/714606013974974464/photo/1,12,10,,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1228,713900603437621249,,,2016-03-27 01:29:02+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody,,,,https://twitter.com/dog_rates/status/713900603437621249/photo/1,99,90,,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1254,710658690886586372,,,2016-03-18 02:46:49+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80,,,,https://twitter.com/dog_rates/status/710658690886586372/photo/1,80,80,,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1265,709901256215666688,,,2016-03-16 00:37:03+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","WeRateDogs stickers are here and they're 12/10! Use code ""puppers"" at checkout 🐶🐾\n\nShop now:",,,,"http://goo.gl/ArWZfi,https://twitter.com/dog_rates/status/709901256215666688/photo/1,https://twitter.com/dog_rates/status/709901256215666688/photo/1,https://twitter.com/dog_rates/status/709901256215666688/photo/1,https://twitter.com/dog_rates/status/709901256215666688/photo/1",12,10,,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1351,704054845121142784,,,2016-02-28 21:25:30+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here is a whole flock of puppers. 60/50 I'll take the lot,,,,https://twitter.com/dog_rates/status/704054845121142784/photo/1,60,50,a,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1516,690959652130045952,,,2016-01-23 18:09:53+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This golden is happy to refute the soft mouth egg test. Not a fan of sweeping generalizations. 11/10 #notallpuppers,,,,"https://twitter.com/dog_rates/status/690959652130045952/photo/1,https://twitter.com/dog_rates/status/690959652130045952/photo/1,https://twitter.com/dog_rates/status/690959652130045952/photo/1,https://twitter.com/dog_rates/status/690959652130045952/photo/1",11,10,,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1634,684225744407494656,6.842229e+17,4196984000.0,2016-01-05 04:11:44+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","Two sneaky puppers were not initially seen, moving the rating to 143/130. Please forgive us. Thank you",,,,"https://twitter.com/dog_rates/status/684225744407494656/photo/1,https://twitter.com/dog_rates/status/684225744407494656/photo/1",143,130,,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1635,684222868335505415,,,2016-01-05 04:00:18+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110,,,,https://twitter.com/dog_rates/status/684222868335505415/photo/1,121,110,,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1636,684200372118904832,,,2016-01-05 02:30:55+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Gang of fearless hoofed puppers here. Straight savages. Elevated for extra terror. Front one has killed before 6/10s,,,,https://twitter.com/dog_rates/status/684200372118904832/photo/1,6,10,,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1643,683857920510050305,,,2016-01-04 03:50:08+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Meet Sadie. She fell asleep on the beach and her friends buried her. 10/10 can't trust fellow puppers these days,,,,https://twitter.com/dog_rates/status/683857920510050305/photo/1,10,10,Sadie,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1710,680583894916304897,,,2015-12-26 03:00:19+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Penny. Her tennis ball slowly rolled down her cone and into the pool. 8/10 bad things happen to good puppers,,,,"https://twitter.com/dog_rates/status/680583894916304897/photo/1,https://twitter.com/dog_rates/status/680583894916304897/photo/1,https://twitter.com/dog_rates/status/680583894916304897/photo/1,https://twitter.com/dog_rates/status/680583894916304897/photo/1",8,10,Penny,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1712,680494726643068929,,,2015-12-25 21:06:00+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10,,,,https://twitter.com/dog_rates/status/680494726643068929/photo/1,26,10,,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1743,679405845277462528,,,2015-12-22 20:59:10+00:00,"<a href=""http://vine.co"" rel=""nofollow"">Vine - Make a Scene</a>",Crazy unseen footage from Jurassic Park. 10/10 for both dinosaur puppers,,,,https://vine.co/v/iKVFEigMLxP,10,10,,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1826,676440007570247681,,,2015-12-14 16:34:00+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Hope your Monday isn't too awful. Here's two baseball puppers. 11/10 for each,,,,"https://twitter.com/dog_rates/status/676440007570247681/photo/1,https://twitter.com/dog_rates/status/676440007570247681/photo/1",11,10,,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1843,675853064436391936,,,2015-12-13 01:41:41+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once,,,,"https://twitter.com/dog_rates/status/675853064436391936/photo/1,https://twitter.com/dog_rates/status/675853064436391936/photo/1",88,80,,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1847,675820929667219457,,,2015-12-12 23:34:00+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here's a handful of sleepy puppers. All look unaware of their surroundings. Lousy guard dogs. Still cute tho 11/10s,,,,https://twitter.com/dog_rates/status/675820929667219457/photo/1,11,10,,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1862,675432746517426176,,,2015-12-11 21:51:30+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Happy Friday. Here's some golden puppers. 12/10 for all,,,,"https://twitter.com/dog_rates/status/675432746517426176/photo/1,https://twitter.com/dog_rates/status/675432746517426176/photo/1,https://twitter.com/dog_rates/status/675432746517426176/photo/1",12,10,,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1900,674664755118911488,,,2015-12-09 18:59:46+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Rodman. He's getting destroyed by the surfs. Valiant effort though. 10/10 better than most puppers probably,,,,https://twitter.com/dog_rates/status/674664755118911488/photo/1,10,10,Rodman,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1928,674045139690631169,,,2015-12-08 01:57:39+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Herd of wild dogs here. Not sure what they're trying to do. No real goals in life. 3/10 find your purpose puppers,,,,https://twitter.com/dog_rates/status/674045139690631169/photo/1,3,10,,,,pupper,


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2141,669993076832759809,,,2015-11-26 21:36:12+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Zoey. Her dreams of becoming a hippo ballerina don't look promising. 9/10 it'll be ok puppers,,,,https://twitter.com/dog_rates/status/669993076832759809/photo/1,9,10,Zoey,,,pupper,


In [35]:
def category_correction(df, columns_list):
    '''Programmatically fills the stage columns based on the tweet text.'''
    # iterate through the rows of the dataframe
    for index, row in df.iterrows():
        # split the tweet text into whitespace-separated words
        source = row.text.split()
        # set a stage column when its name appears inside any word,
        # so plural forms such as "puppers" also count as "pupper"
        for word in source:
            for value in columns_list:
                if value in word:
                    df.loc[index, value] = value
    return df

In [36]:
twitter_archive_clean = category_correction(twitter_archive_clean, ['doggo', 'floofer', 'pupper', 'puppo'])
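
The row-by-row loop in `category_correction` can also be expressed with pandas string methods. The sketch below is a hypothetical alternative rather than the code used in this notebook: the function name `category_correction_vectorized` and the small `demo` frame are made up for illustration. Because none of the stage names contain whitespace, a substring search over the whole text flags the same rows as checking each word separately.

```python
import pandas as pd

def category_correction_vectorized(df, columns_list):
    """Hypothetical vectorized variant: set each stage column where its name occurs in the text."""
    out = df.copy()  # leave the caller's dataframe untouched
    for value in columns_list:
        # plain substring match (no regex); rows with missing text are treated as no match
        mask = out['text'].str.contains(value, regex=False, na=False)
        out.loc[mask, value] = value
    return out

# Illustrative data only; the first text is taken from the rows displayed above.
stages = ['doggo', 'floofer', 'pupper', 'puppo']
demo = pd.DataFrame({
    'text': ["Here is a whole flock of puppers. 60/50 I'll take the lot",
             'This doggo is ready',
             'No stage mentioned here'],
    'doggo': [None, None, None],
    'floofer': [None, None, None],
    'pupper': [None, None, None],
    'puppo': [None, None, None],
})
result = category_correction_vectorized(demo, stages)
print(result[stages])
```

Like the loop version, this match is case-sensitive, so a capitalized "Doggo" would not be flagged.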

#### Test

In [37]:
#checking to see if changes in offending rows were made. commented for space
#for i in offending_categorical_rows:
#    display(twitter_archive_clean[twitter_archive_clean.index == i])

In [38]:
print('The number of categorical rows with missing or inaccurate categories is: ',len(category_accuracy(twitter_archive_clean, ['doggo', 'floofer', 'pupper', 'puppo'])))

The number of categorical rows with missing or inaccurate categories is:  0


#### Define
Manually change each row's name?
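
Before editing names by hand, a rough heuristic may shrink the list of rows that need a manual look. The sketch below assumes that a value starting with a lowercase letter (such as "a" or "incredibly") is not a real name; that is my assumption, not a rule established in this notebook, and it cannot detect names that were missed entirely. The `names` series is a toy example.

```python
import pandas as pd

# Toy values; "a" and "incredibly" are examples of wrongly extracted names.
names = pd.Series(['Sadie', 'a', 'incredibly', 'Penny', None])

# Assumption: real names are capitalized, so a lowercase first letter is suspicious.
suspect = names.str.match(r'^[a-z]', na=False)

# Replace suspicious values with NaN so they can be reviewed or filled in manually.
cleaned = names.mask(suspect)
print(cleaned)
```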

In [106]:
twitter_archive_clean.expanded_urls[682]

'https://vine.co/v/iEggaEOiLO3,https://vine.co/v/iEggaEOiLO3'

#### Code

In [99]:
for i, index in incorrect_names:
    print(index)
    print(twitter_archive_clean.text[index])
    print()
name_list = ['Grace', 'puffie', 'Forrest', 'Zoey', 'Quizno']

22
I've yet to rate a Venezuelan Hover Wiener. This is such an honor. 14/10 paw-inspiring af (IG: roxy.thedoxy) 

56
Here is a pupper approaching maximum borkdrive. Zooming at never before seen speeds. 14/10 paw-inspiring af 
(IG: puffie_the_chow) 

118
RT @dog_rates: We only rate dogs. This is quite clearly a smol broken polar bear. We'd appreciate if you only send dogs. Thank you... 12/10…

169
We only rate dogs. This is quite clearly a smol broken polar bear. We'd appreciate if you only send dogs. Thank you... 12/10 

193
Guys, we only rate dogs. This is quite clearly a bulbasaur. Please only send dogs. Thank you... 12/10 human used pet, it's super effective 

335
There's going to be a dog terminal at JFK Airport. This is not a drill. 10/10  


369
Occasionally, we're sent fantastic stories. This is one of them. 14/10 for Grace 

542
We only rate dogs. Please stop sending in non-canines like this Freudian Poof Lion. This is incredibly frustrating... 11/10 

649
Here is a perfect e

In [48]:
twitter_archive_clean.text[773]

'RT @dog_rates: We only rate dogs. Pls stop sending in non-canines like this Mongolian grass snake. This is very frustrating. 11/10 https://…'

#### Test

# Analysis

(jotting ideas down early so as to not forget)
### Motivating Questions:
- Do higher "ratings" correlate with higher number of retweets? (Must define ratings. Std might be useful)
- Investigate which is more "popular": Cute or Funny. Cute defined by Kindchenschema and funny defined by Benign Violation.
- Within the categories of cuteness and funniness, are more extreme examples more popular? Measured by retweets.
- What are observed characteristics of the top 3 most popular tweets? Is there a theme?

Although individual ratings do not seem to make much sense because the numerator often exceeds the denominator, they may be better understood as a decimal score (numerator divided by denominator), with higher scores indicating higher approval.
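
A minimal sketch of that idea, under stated assumptions: the numerator and denominator pairs below are taken from rows displayed earlier, while the `retweet_count` column name and its values are invented purely for illustration (the real counts would come from the API data). The resulting correlation therefore says nothing about the actual tweets.

```python
import pandas as pd

ratings = pd.DataFrame({
    'rating_numerator': [12, 60, 143, 6],
    'rating_denominator': [10, 50, 130, 10],
    # hypothetical retweet counts, NOT real data
    'retweet_count': [500, 3000, 1500, 800],
})

# Decimal score: numerator divided by denominator, so 60/50 and 12/10 both become 1.2.
ratings['rating_score'] = ratings['rating_numerator'] / ratings['rating_denominator']

# Pearson correlation between the decimal score and retweets.
corr = ratings['rating_score'].corr(ratings['retweet_count'])
print(ratings[['rating_score', 'retweet_count']])
```

Dividing by the denominator puts group ratings such as 143/130 on the same scale as single-dog ratings, which makes the first motivating question easier to test.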