In [1]:
import pandas as pd
import numpy as np
import requests
import os
from random import sample
import re

# Gather

Code for downloading Udacity's Dog Prediction Data

In [2]:
#url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
#r = requests.get('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')

In [3]:
# make a directory if one does not already exist
#folder_name = 'dog_predictions'
#if not os.path.exists(folder_name):
    #os.makedirs(folder_name)

In [4]:
#with open(os.path.join(folder_name, 
                           #url.split('/')[-1]), mode='wb') as file:
             #file.write(r.content)

Initializing All Relevant Datasets

In [5]:
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')
twitter_archive;

In [6]:
predicts = pd.read_csv('dog_predictions/image-predictions.tsv', sep='\t')
predicts;

In [7]:
twitter_api = pd.read_csv('twitter_api_data.csv')
twitter_api;

# Assess

### Documented Issues

`twitter_archive`
#### Quality
- timestamp is a string, not a datetime
- text column contains irrelevant material (such as URLs)
- 137 duplicate posts, possibly because they were reposted; many duplicates are prefixed with the tag "RT @dog_rates: "
- texts must be stripped of "RT @dog_rates: "
- There are 42 instances where categorical variables are found in the text but are not accurately reflected in the categorical columns
- There are 109 instances where the name column is inaccurate and holds a word that is not a name (ex: index 542 has the name "incredibly" because the preceding text contains "incredible")
    - I recognize that it is an oversight that I cannot test whether a name was missed, because I do not yet have knowledge of a language processing library.
- Missing values in "in_reply_to_status_id", "in_reply_to_user_id", "retweeted_status_id", "retweeted_status_user_id", and "retweeted_status_timestamp"
- Ratings may contain floats. Texts need to be checked again
- change id to string
- drop expanded_urls and source, as they are incomplete and not useful for analysis
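
The note about float ratings can be checked against the text directly. This is a minimal sketch, assuming ratings appear in the text as `numerator/denominator`; the toy `tweets` frame and the `text_numerator` column are hypothetical and not part of the dataset.

```python
import pandas as pd

# Toy rows standing in for twitter_archive (hypothetical data).
tweets = pd.DataFrame({
    'text': ['This is Bella. 13.5/10 would pet', 'Good dog 12/10', 'No rating here'],
    'rating_numerator': [5, 12, 0],
})

# Capture an optional decimal part in the numerator, e.g. "13.5/10".
pattern = r'(\d+(?:\.\d+)?)/\d+'
tweets['text_numerator'] = tweets['text'].str.extract(pattern, expand=False).astype(float)

# Rows where the stored integer numerator disagrees with the text.
mismatch = tweets[tweets['text_numerator'].notnull()
                  & (tweets['text_numerator'] != tweets['rating_numerator'])]
print(mismatch[['text', 'rating_numerator', 'text_numerator']])
```

Only the first rating-like pattern in each text is captured, so tweets with several ratings would still need a manual look.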

#### Tidiness
- text column contains a source variable for the tweet
- dog "ages/types" (floofer, pupper etc.) should be a single, categorical column
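
One possible way to collapse the four stage columns into one is sketched here. It assumes each stage column holds either its own name or the string 'None'; the toy `dogs` frame, the `stage` column name, and the `combine_stages` helper are all hypothetical.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy rows; each stage column holds its own name or the string 'None' (assumption).
dogs = pd.DataFrame({
    'doggo':   ['doggo', 'None', 'doggo', 'None'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'pupper', 'None'],
    'puppo':   ['None', 'None', 'None', 'None'],
})

def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    found = [col for col in stage_cols if row[col] == col]
    return ','.join(found) if found else None

# Build the single column, then drop the four originals.
dogs['stage'] = dogs[stage_cols].apply(combine_stages, axis=1).astype('category')
dogs = dogs.drop(columns=stage_cols)
print(dogs)
```

Rows that mention more than one stage keep both values joined by a comma, so nothing is silently discarded.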

`predicts`
#### Quality
- prediction dog breeds have inconsistent casing
- column titles should be full names
- change id to string
- extract predictions for images where predictions are both dogs and above 70% confidence
    - If our confidence level is too low, then our statements become less meaningful. However, because I am not sure how to test the accuracy of the predictions, I will choose a lowish confidence level since I am aware that many of the pictures will contain dogs.
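
The extraction step could look roughly like this. The sketch reads the bullet as applying to the top prediction only (`p1`, `p1_conf`, `p1_dog`), which is an assumption, and it uses toy rows with made-up `tweet_id` values.

```python
import pandas as pd

# Toy rows mirroring the p1, p1_conf and p1_dog columns of predicts.
preds = pd.DataFrame({
    'tweet_id': ['1', '2', '3', '4'],  # hypothetical ids
    'p1': ['Pembroke', 'gondola', 'boxer', 'Mexican_hairless'],
    'p1_conf': [0.756441, 0.318851, 0.599076, 0.887771],
    'p1_dog': [True, False, True, True],
})

CONFIDENCE_THRESHOLD = 0.70  # cut-off taken from the bullet above

# Keep rows whose top prediction is a dog with confidence above the threshold.
confident_dogs = preds[preds['p1_dog'] & (preds['p1_conf'] > CONFIDENCE_THRESHOLD)]
print(confident_dogs[['tweet_id', 'p1', 'p1_conf']])
```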

#### Tidiness
 

`twitter_api`
#### Quality
- change id to string (should have done this when extracting)
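
Reading the id as a string at load time avoids a later conversion. A small sketch, using an in-memory CSV as a stand-in for the real file (the contents are illustrative):

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk (illustrative contents).
csv_text = "id,favorite_count,retweet_count\n666826780179869698,247,92\n"

# dtype={'id': str} keeps the identifier as text instead of int64.
api = pd.read_csv(io.StringIO(csv_text), dtype={'id': str})
print(api.dtypes)

# An already-loaded integer column can also be converted afterwards.
api_int = pd.read_csv(io.StringIO(csv_text))
api_int['id'] = api_int['id'].astype(str)
print(api_int['id'].iloc[0])
```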

#### Tidiness
- tables need to be reorganized
    - 1 for source metadata (urls

In [8]:
# -1 disables column-width truncation in older pandas; pandas >= 1.0 expects None instead
pd.set_option('display.max_colwidth', -1)

### twitter_archive

#### Visual Assessment

In [9]:
#to be used for visual assessments. Suppressed to save space.
twitter_archive.sample(5);

In [10]:
twitter_archive.source.unique()

array(['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
       '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
       '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
       '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>'],
      dtype=object)

In [11]:
#count duplicated texts (none found)
twitter_archive.text.duplicated().sum()

0

In [12]:
#to be used for visual assessments
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [13]:
#investigating max values for numerator/denominator
twitter_archive.describe();

#### Function(s) used for programmatic assessment

In [14]:
def category_accuracy(df, columns_list):
    '''Check whether category names found in a row's text are reflected in its categorical columns.

    Assumes the text is the source of the categorical variables and that each
    categorical column holds its own header as its value.
    Restrictions: the text column must be called "text".
    Returns: a list of index labels for rows where the categorical values do not match the text.'''
    offending_rows = []
    # iterate through df, matching the text against the categorical columns
    for index, row in df.iterrows():
        match = []
        # current values in the categorical columns
        secondary = [row[column] for column in columns_list]
        # go through every word and collect each column name it contains
        source = row.text.split()
        for word in source:
            match += [value for value in columns_list if value in word and pd.notnull(word) and word != 'None']
        # if there was at least one match, was it accurate?
        if len(match) > 0:
            test = []
            # compare every matched value with every current column value
            for current in secondary:
                for i in match:
                    test.append(i == current)
            # if there are fewer correct values than matches, a column is inaccurate
            if test.count(True) < len(match):
                offending_rows.append(index)
    return offending_rows
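
Because `category_accuracy` tests `value in word`, any word that merely contains a stage name (for example "puppoccino" or "doggos") counts as a match. A stricter check, sketched here as an alternative and not as the function used in this notebook, matches whole words only. The helper name `find_stages` is hypothetical.

```python
import re

STAGES = ['doggo', 'floofer', 'pupper', 'puppo']

# \b anchors the match at word boundaries, so 'puppoccino' is not read as 'puppo'.
STAGE_PATTERN = re.compile(r'\b(' + '|'.join(STAGES) + r')\b', flags=re.IGNORECASE)

def find_stages(text):
    """Return the stage names that appear as whole words in text."""
    return [match.lower() for match in STAGE_PATTERN.findall(text)]

print(find_stages('Meet Venti, a seemingly caffeinated puppoccino.'))
print(find_stages('12/10 great doggo'))
```

Whether plurals such as "doggos" should count is a judgment call; this pattern ignores them.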

### Parsing Text for Accuracy

In [15]:
#category_accuracy(df, columns)
offending_categorical_rows = category_accuracy(twitter_archive, ['doggo', 'floofer', 'pupper', 'puppo'])
print('The number of instances where doggo, floofer, pupper, and puppo is found in the text, but does not have the correct value is: ', len(offending_categorical_rows))

The number of instances where doggo, floofer, pupper, and puppo is found in the text, but does not have the correct value is:  42


In [16]:
print('The list of offending rows: ', offending_categorical_rows)

The list of offending rows:  [54, 83, 85, 106, 134, 172, 228, 268, 274, 296, 302, 475, 477, 545, 798, 934, 946, 987, 993, 1027, 1093, 1120, 1220, 1228, 1254, 1265, 1351, 1516, 1634, 1635, 1636, 1643, 1710, 1712, 1743, 1826, 1843, 1847, 1862, 1900, 1928, 2141]


In [17]:
#investigate offending rows. Rows had issues. Code has been commented out to save space.
#for i in offending_categorical_rows:
    #display(twitter_archive[twitter_archive.index == i])

### Checking Names Column

Real names are capitalized, so entries that start with a lowercase letter are flagged as likely not being actual names.
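
The same lowercase check can be written without a loop. A minimal sketch on a toy Series (the values are illustrative), using `str.match`, which anchors the pattern at the start of each string:

```python
import pandas as pd

# Toy names; 'None' marks a missing name (assumption about the archive's encoding).
names = pd.Series(['Hank', 'a', 'None', 'quite', 'Toby'], name='name')

# True where the first character is a lowercase letter.
starts_lower = names.str.match(r'[a-z]')
flagged = names[starts_lower]
print(flagged)
```

The index of `flagged` keeps the original row labels, so the offending rows can be looked up afterwards.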

In [18]:
#demonstrating regular names
regular_names = []
for name in twitter_archive.name:
    if pd.notnull(name) and name != 'None' and name[0].isupper():
        regular_names.append(name)
sample(regular_names, 20)

['Hank',
 'Toby',
 'Monster',
 'Spanky',
 'Mitch',
 'Earl',
 'Boomer',
 'Charlie',
 'Mojo',
 'Crumpet',
 'Lola',
 'Milo',
 'Deacon',
 'Dudley',
 'Gizmo',
 'Jamesy',
 'Wallace',
 'Crystal',
 'Rey',
 'Tucker']

In [19]:
#flagging lowercase names and index
incorrect_names = []
for index, row in twitter_archive.iterrows():
    if pd.notnull(row['name']) and row['name'] != 'None' and row['name'][0].islower():
        incorrect_names.append((row['name'], index))
incorrect_names
#len(incorrect_names) returns 109 instances

[('such', 22),
 ('a', 56),
 ('quite', 118),
 ('quite', 169),
 ('quite', 193),
 ('not', 335),
 ('one', 369),
 ('incredibly', 542),
 ('a', 649),
 ('mad', 682),
 ('an', 759),
 ('very', 773),
 ('a', 801),
 ('very', 819),
 ('just', 822),
 ('my', 852),
 ('one', 924),
 ('not', 988),
 ('his', 992),
 ('one', 993),
 ('a', 1002),
 ('a', 1004),
 ('a', 1017),
 ('an', 1025),
 ('very', 1031),
 ('actually', 1040),
 ('a', 1049),
 ('just', 1063),
 ('getting', 1071),
 ('mad', 1095),
 ('very', 1097),
 ('this', 1120),
 ('unacceptable', 1121),
 ('all', 1138),
 ('a', 1193),
 ('old', 1206),
 ('a', 1207),
 ('infuriating', 1259),
 ('a', 1340),
 ('a', 1351),
 ('a', 1361),
 ('an', 1362),
 ('a', 1368),
 ('a', 1382),
 ('very', 1385),
 ('getting', 1435),
 ('just', 1457),
 ('a', 1499),
 ('the', 1527),
 ('the', 1603),
 ('actually', 1693),
 ('by', 1724),
 ('a', 1737),
 ('officially', 1747),
 ('a', 1785),
 ('the', 1797),
 ('the', 1815),
 ('a', 1853),
 ('a', 1854),
 ('a', 1877),
 ('a', 1878),
 ('life', 1916),
 ('a', 1923

## predicts

In [20]:
predicts.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
937,703611486317502464,https://pbs.twimg.com/media/CcO66OjXEAASXmH.jpg,1,Pembroke,0.756441,True,basenji,0.126621,True,Cardigan,0.080117,True
1116,725729321944506368,https://pbs.twimg.com/media/ChJO9YaWYAEL0zC.jpg,1,boxer,0.599076,True,bull_mastiff,0.177318,True,French_bulldog,0.141461,True
1483,781251288990355457,https://pbs.twimg.com/media/CteP5H5WcAEhdLO.jpg,2,Mexican_hairless,0.887771,True,Italian_greyhound,0.030666,True,seat_belt,0.02673,False
612,680130881361686529,https://pbs.twimg.com/media/CXBPbVtWAAA2Vus.jpg,1,Maltese_dog,0.199121,True,West_Highland_white_terrier,0.197897,True,Shih-Tzu,0.15713,True
1366,761672994376806400,https://pbs.twimg.com/ext_tw_video_thumb/761672828462718981/pu/img/R00UYAAWB3GtuHdI.jpg,1,gondola,0.318851,False,sea_lion,0.306525,False,pool_table,0.111565,False


In [21]:
predicts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


## twitter_api

In [22]:
twitter_api.sample(5)

Unnamed: 0,id,favorite_count,retweet_count,expanded_url
2307,666826780179869698,247,92,https://twitter.com/dog_rates/status/666826780179869698/photo/1
17,888804989199671297,24584,4000,https://twitter.com/i/web/status/888804989199671297
1378,701570477911896070,2884,951,https://twitter.com/dog_rates/status/701570477911896070/photo/1
2228,668256321989451776,1288,599,https://twitter.com/dog_rates/status/668256321989451776/photo/1
236,847251039262605312,20760,4346,https://twitter.com/dog_rates/status/847251039262605312/photo/1


In [23]:
twitter_api.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 4 columns):
id                2356 non-null int64
favorite_count    2356 non-null object
retweet_count     2356 non-null object
expanded_url      2299 non-null object
dtypes: int64(1), object(3)
memory usage: 73.7+ KB


# Clean

Creating Cleaning Copies

In [24]:
twitter_archive_clean = twitter_archive.copy()

In [25]:
predicts_clean = predicts.copy()

In [26]:
twitter_api_clean = twitter_api.copy()

### Twitter Archive

#### Quality
- timestamp is a string, not a datetime
- text column contains irrelevant material (such as URLs)
- 137 duplicate posts, possibly because they were reposted; many duplicates are prefixed with the tag "RT @dog_rates: "
- texts must be stripped of "RT @dog_rates: "
- There are 42 instances where categorical variables are found in the text but are not accurately reflected in the categorical columns
- There are 109 instances where the name column is inaccurate and holds a word that is not a name (ex: index 542 has the name "incredibly" because the preceding text contains "incredible")
    - I recognize that it is an oversight that I cannot test whether a name was missed, because I do not yet have knowledge of a language processing library.
- Missing values in "in_reply_to_status_id", "in_reply_to_user_id", "retweeted_status_id", "retweeted_status_user_id", and "retweeted_status_timestamp"
- Ratings may contain floats. Texts need to be checked again
- change id to string
- missing values in expanded_urls
- expanded urls need space between each to be human readable
- expanded urls are not all unique

#### Tidiness
- text column contains a source variable for the tweet
- dog "ages/types" (floofer, pupper etc.) should be a single, categorical column
- expanded_urls need tidier formatting, but will leave as is for now since column isn't vital to analysis

#### Define 
Use pd.to_datetime to convert timestamp to datetime

#### Code

In [27]:
twitter_archive_clean.timestamp = pd.to_datetime(twitter_archive_clean.timestamp)

#### Test

In [28]:
#timestamp was successfully changed to datetime, commented for space
#twitter_archive_clean.info()

#### Define
Use str.replace to remove unwanted URLs from the text column
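
For orientation, the idea can be shown with a much simpler pattern than the one used in this notebook. This sketch assumes every URL starts with a scheme and ends at the next whitespace, so it does not handle the truncated, ellipsis-containing URLs that the notebook's pattern was extended for.

```python
import re

text = "This is Phineas. He's a mystical boy. 13/10 https://t.co/MgUWQ76dJU"

# Simplified pattern: a scheme followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r'(?:https?|ftp)://\S+')

# Remove the URL, then trim the space it leaves behind.
cleaned = URL_PATTERN.sub('', text).strip()
print(cleaned)
```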

#### Code

In [29]:
#testing
twitter_archive.text[0]

"This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU"

In [30]:
#regex citation: https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string
#the regex has been altered to work with the dataset (ellipses and other permutations have been added)
#includes the characters ://
#([\w_-]+(?:(?:\.[\w_-]+)+)) groups combinations of "alphanumeric.alphanumeric"
#([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])? groups combinations of alphanumeric characters and special symbols
text = re.sub(r'(\\n\\nhttps|http|ftp|https)([:…/]+)([\w_…-]*(?:(?:\.[\w_…-]*)*))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-…]*)?', '', twitter_archive.text[0], flags = re.MULTILINE)
text

"This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 "

In [31]:
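#regex=True is stated explicitly; newer pandas versions treat the pattern as a literal string by default
twitter_archive_clean.text = twitter_archive_clean.text.str.replace(r'(\\n\\nhttps|http|ftp|https)([:…/]+)([\w_…-]*(?:(?:\.[\w_…-]*)*))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-…]*)?', '', regex=True);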

#### Test

In [32]:
[i for i in twitter_archive_clean.text if 'https' in i]

[]

In [33]:
#urls removed, suppressed for space
twitter_archive_clean.text;

In [34]:
twitter_archive_clean.sample(5);

#### Define
Use str.replace to remove "RT @dog_rates: " from the text column

#### Code

In [35]:
twitter_archive_clean.text = twitter_archive_clean.text.str.replace('RT @dog_rates: ', '', regex=False);

#### Test

In [36]:
for i in twitter_archive_clean.text:
    if 'RT @dog_rates: ' in i:
        print(i)

#### Define
Drop duplicate tweets. Although they might have different metadata, they are still the same "tweet", and keeping the duplicates would skew the analysis
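
A small sketch of this step on toy data (the frame and its values are hypothetical). It also shows that `reset_index()` without `drop=True` keeps the old row labels in an `index` column, which matters when rows flagged earlier need to be found again.

```python
import pandas as pd

# Toy frame with one duplicated text (hypothetical data).
df = pd.DataFrame({'text': ['a', 'b', 'a', 'c']}, index=[0, 1, 2, 3])

# Keep the first occurrence of each text.
deduped = df.drop_duplicates('text', keep='first')

# reset_index() without drop=True stores the old labels in an 'index' column.
deduped = deduped.reset_index()
print(deduped)

# Look a row up by its original label, not by its new position.
original_label = 3
row = deduped[deduped['index'] == original_label]
print(row)
```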

#### Code

In [37]:
#count duplicated texts
twitter_archive_clean.text.duplicated().sum()

137

In [38]:
#dropping all duplicates based on text data
twitter_archive_clean.drop_duplicates('text', keep = 'first', inplace = True)
#reset_index keeps the original row labels in a new "index" column
twitter_archive_clean.reset_index(inplace = True)

#### Test

In [39]:
#former shape was 2356. 2356 - 2219 = 137
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2219 entries, 0 to 2218
Data columns (total 18 columns):
index                         2219 non-null int64
tweet_id                      2219 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2219 non-null datetime64[ns, UTC]
source                        2219 non-null object
text                          2219 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2160 non-null object
rating_numerator              2219 non-null int64
rating_denominator            2219 non-null int64
name                          2219 non-null object
doggo                         2219 non-null object
floofer                       2219 non-null object
pupper                        2219 non-null object
puppo               

#### Define
Investigate and fix inaccurate classifications, using a modified category_accuracy function

#### Code

In [40]:
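#note: offending_categorical_rows holds row labels from the original twitter_archive;
#after dropping duplicates and resetting the index, those labels are stored in the "index" column,
#so the rows selected here by the new index can differ from the originally flagged rows
for i in offending_categorical_rows:
    display(twitter_archive_clean[twitter_archive_clean.index == i])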

Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
54,55,881633300179243008,8.81607e+17,47384430.0,2017-07-02 21:58:53+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@roushfenway These are good dogs but 17/10 is an emotional impulse rating. More like 13/10s,,,,,17,10,,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
83,85,876120275196170240,,,2017-06-17 16:52:05+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","Meet Venti, a seemingly caffeinated puppoccino. She was just informed the weekend would include walks, pats and scritches. 13/10 much excite",,,,https://twitter.com/dog_rates/status/876120275196170240/photo/1,13,10,Venti,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
85,87,875144289856114688,,,2017-06-15 00:13:52+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Meet Nugget and Hank. Nugget took Hank's bone. Hank is wondering if you would please return it to him. Both 13/10 would not intervene,,,,https://twitter.com/dog_rates/status/875144289856114688/video/1,13,10,Nugget,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
106,108,871515927908634625,,,2017-06-04 23:56:03+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Napolean. He's a Raggedy East Nicaraguan Zoom Zoom. Runs on one leg. Built for deception. No eyes. Good with kids. 12/10 great doggo,,,,"https://twitter.com/dog_rates/status/871515927908634625/photo/1,https://twitter.com/dog_rates/status/871515927908634625/photo/1",12,10,Napolean,doggo,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
134,139,865359393868664832,,,2017-05-19 00:12:11+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Sammy. Her tongue ejects without warning sometimes. It's a serious condition. Needs a hefty dose from a BlepiPen. 13/10,,,,"https://twitter.com/dog_rates/status/865359393868664832/photo/1,https://twitter.com/dog_rates/status/865359393868664832/photo/1",13,10,Sammy,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
172,177,857393404942143489,,,2017-04-27 00:38:11+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","Instead of the usual nightly dog rate, I'm sharing this story with you. Meeko is 13/10 and would like your help \n\n",,,,"https://www.gofundme.com/meeko-needs-heart-surgery,https://twitter.com/dog_rates/status/857393404942143489/photo/1,https://twitter.com/dog_rates/status/857393404942143489/photo/1,https://twitter.com/dog_rates/status/857393404942143489/photo/1,https://twitter.com/dog_rates/status/857393404942143489/photo/1",13,10,,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
228,233,847842811428974592,,,2017-03-31 16:07:33+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Rontu. He is described as a pal, cuddle bug, protector and constant shadow. 12/10, but he needs your help\n\n",,,,"https://www.gofundme.com/help-save-rontu,https://twitter.com/dog_rates/status/847842811428974592/photo/1",12,10,Rontu,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
268,274,840698636975636481,8.406983e+17,8.405479e+17,2017-03-11 22:59:09+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@0_kelvin_0 &gt;10/10 is reserved for puppos sorry Kevin,,,,,10,10,,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
274,280,839549326359670784,,,2017-03-08 18:52:12+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","Meet Winston. He knows he's a little too big for the swing, but he doesn't care. Kindly requests a push. 12/10 would happily oblige",,,,https://twitter.com/dog_rates/status/839549326359670784/photo/1,12,10,Winston,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
296,303,836397794269200385,,,2017-02-28 02:09:08+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Buddy. He ran into a glass door once. Now he's h*ckin skeptical. 13/10 empowering af (vid by Brittany Gaunt),8.178278e+17,4196984000.0,2017-01-07 20:18:46 +0000,https://twitter.com/dog_rates/status/817827839487737858/video/1,13,10,Buddy,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
302,309,835536468978302976,,,2017-02-25 17:06:32+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Meet Lola. Her hobbies include being precious af and using her foot as a toothbrush. 12/10 Lola requests your help\n\n,8.352641e+17,4196984000.0,2017-02-24 23:04:14 +0000,"https://www.gofundme.com/lolas-life-saving-surgery-funds,https://twitter.com/dog_rates/status/835264098648616962/photo/1,https://twitter.com/dog_rates/status/835264098648616962/photo/1",12,10,Lola,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
475,497,813142292504645637,,,2016-12-25 22:00:04+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Everybody stop what you're doing and look at this dog with her tiny Santa hat. 13/10,,,,"https://twitter.com/dog_rates/status/813142292504645637/photo/1,https://twitter.com/dog_rates/status/813142292504645637/photo/1,https://twitter.com/dog_rates/status/813142292504645637/photo/1",13,10,,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
477,499,813127251579564032,,,2016-12-25 21:00:18+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here's an anonymous doggo that appears to be very done with Christmas. 11/10 cheer up pup,,,,"https://twitter.com/dog_rates/status/813127251579564032/photo/1,https://twitter.com/dog_rates/status/813127251579564032/photo/1",11,10,,doggo,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
545,572,801285448605831168,,,2016-11-23 04:45:12+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",oh h*ck 10/10,,,,https://twitter.com/dog_rates/status/801285448605831168/photo/1,10,10,,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
798,869,761745352076779520,,,2016-08-06 02:06:59+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Guys.. we only rate dogs. Pls don't send any more pics of the Loch Ness Monster. Only send in dogs. Thank you. 11/10,,,,https://twitter.com/dog_rates/status/761745352076779520/photo/1,11,10,,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
934,1016,746906459439529985,7.468859e+17,4196984000.0,2016-06-26 03:22:31+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","PUPDATE: can't see any. Even if I could, I couldn't reach them to pet. 0/10 much disappointment",,,,https://twitter.com/dog_rates/status/746906459439529985/photo/1,0,10,,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
946,1029,745712589599014916,,,2016-06-22 20:18:30+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Percy. He fell asleep at the wheel. Irresponsible af. 7/10 absolute menace on the roadway,,,,https://twitter.com/dog_rates/status/745712589599014916/photo/1,7,10,Percy,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
987,1075,739623569819336705,,,2016-06-06 01:02:55+00:00,"<a href=""http://vine.co"" rel=""nofollow"">Vine - Make a Scene</a>",Here's a doggo that don't need no human. 12/10 independent af (vid by @MichelleLiuCee),,,,https://vine.co/v/iY9Fr1I31U6,12,10,,doggo,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
993,1083,738537504001953792,,,2016-06-03 01:07:16+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Bayley. She fell asleep trying to escape her evil fence enclosure. 11/10 night night puppo,,,,"https://twitter.com/dog_rates/status/738537504001953792/photo/1,https://twitter.com/dog_rates/status/738537504001953792/photo/1",11,10,Bayley,,,,puppo


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1027,1120,731156023742988288,,,2016-05-13 16:15:54+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once,,,,https://twitter.com/dog_rates/status/731156023742988288/photo/1,204,170,this,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1093,1188,718454725339934721,,,2016-04-08 15:05:29+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This pic is old but I hadn't seen it until today and had to share. Creative af. 13/10 very good boy, would pet well",,,,https://twitter.com/dog_rates/status/718454725339934721/photo/1,13,10,,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1120,1215,715009755312439296,,,2016-03-30 02:56:24+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Siba. She's remarkably mobile. Very sleepy as well. 12/10 would happily transport,,,,https://twitter.com/dog_rates/status/715009755312439296/photo/1,12,10,Siba,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1220,1319,706516534877929472,,,2016-03-06 16:27:23+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Please enjoy this pup in a cooler. Permanently ready for someone to throw a tennis ball his way. 12/10,,,,https://twitter.com/dog_rates/status/706516534877929472/photo/1,12,10,,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1228,1327,705975130514706432,,,2016-03-05 04:36:02+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Adele. Her tongue flies out of her mouth at random. It's a debilitating illness. 10/10 stay strong pupper,,,,"https://twitter.com/dog_rates/status/705975130514706432/photo/1,https://twitter.com/dog_rates/status/705975130514706432/photo/1",10,10,Adele,,,pupper,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1254,1354,703631701117943808,,,2016-02-27 17:24:05+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Bella. Based on this picture she's at least 8ft tall (wow)! Must be rare. 11/10 would pet on tippy toes,,,,"https://twitter.com/dog_rates/status/703631701117943808/photo/1,https://twitter.com/dog_rates/status/703631701117943808/photo/1,https://twitter.com/dog_rates/status/703631701117943808/photo/1,https://twitter.com/dog_rates/status/703631701117943808/photo/1",11,10,Bella,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1265,1366,702671118226825216,,,2016-02-25 01:47:04+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Meet Rambo &amp; Kiwi. Rambo's the pup with the sharp toes &amp; rad mohawk. One stays woke while one sleeps. 10/10 for both,,,,https://twitter.com/dog_rates/status/702671118226825216/photo/1,10,10,Rambo,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1351,1456,695314793360662529,,,2016-02-04 18:35:39+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Colin. He really likes green beans. It's tearing his family apart. 10/10 please pray for Colin,,,,"https://twitter.com/dog_rates/status/695314793360662529/photo/1,https://twitter.com/dog_rates/status/695314793360662529/photo/1,https://twitter.com/dog_rates/status/695314793360662529/photo/1,https://twitter.com/dog_rates/status/695314793360662529/photo/1",10,10,Colin,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1516,1627,684594889858887680,,,2016-01-06 04:38:35+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","""FOR THE LAST TIME I DON'T WANNA PLAY TWISTER ALL THE SPOTS ARE GREY DAMN IT CINDY"" ...10/10",,,,https://twitter.com/dog_rates/status/684594889858887680/photo/1,10,10,,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1634,1755,678774928607469569,,,2015-12-21 03:12:08+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Tug. He's not required to wear the cone he just wants his voice to project more clearly. 11/10,,,,https://twitter.com/dog_rates/status/678774928607469569/photo/1,11,10,Tug,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1635,1756,678767140346941444,,,2015-12-21 02:41:11+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Mia. She makes awful decisions. 8/10,,,,https://twitter.com/dog_rates/status/678767140346941444/photo/1,8,10,Mia,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1636,1757,678764513869611008,,,2015-12-21 02:30:45+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","Meet Wilson. He got caught humping the futon. He's like ""dude, help me out here"" 10/10 I'd help Wilson out",,,,https://twitter.com/dog_rates/status/678764513869611008/photo/1,10,10,Wilson,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1643,1764,678424312106393600,,,2015-12-20 03:58:55+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Crystal. She's a shitty fireman. No sense of urgency. People could be dying Crystal. 2/10 just irresponsible,,,,https://twitter.com/dog_rates/status/678424312106393600/photo/1,2,10,Crystal,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1710,1832,676191832485810177,,,2015-12-14 00:07:50+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",These two pups just met and have instantly bonded. Spectacular scene. Mesmerizing af. 10/10 and 7/10 for blue dog,,,,"https://twitter.com/dog_rates/status/676191832485810177/photo/1,https://twitter.com/dog_rates/status/676191832485810177/photo/1,https://twitter.com/dog_rates/status/676191832485810177/photo/1",10,10,,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1712,1834,676121918416756736,,,2015-12-13 19:30:01+00:00,"<a href=""http://vine.co"" rel=""nofollow"">Vine - Make a Scene</a>",Here we are witnessing a very excited dog. Clearly has no control over neck movements. 8/10 would still pet,,,,https://vine.co/v/iZXg7VpeDAv,8,10,,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1743,1867,675334060156301312,,,2015-12-11 15:19:21+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Good morning here's a grass pupper. 12/10,,,,"https://twitter.com/dog_rates/status/675334060156301312/photo/1,https://twitter.com/dog_rates/status/675334060156301312/photo/1",12,10,,,,pupper,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1826,1951,673686845050527744,,,2015-12-07 02:13:55+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is George. He's upset that the 4th of July isn't everyday. 11/10,,,,https://twitter.com/dog_rates/status/673686845050527744/photo/1,11,10,George,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1843,1969,673317986296586240,,,2015-12-06 01:48:12+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Take a moment and appreciate how these two dogs fell asleep. Simply magnificent. 10/10 for both,,,,"https://twitter.com/dog_rates/status/673317986296586240/photo/1,https://twitter.com/dog_rates/status/673317986296586240/photo/1",10,10,,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1847,1974,673148804208660480,,,2015-12-05 14:35:56+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Fletcher. He's had a ruff night. No more Fireball for Fletcher. 8/10 it'll be over soon pupper,,,,https://twitter.com/dog_rates/status/673148804208660480/photo/1,8,10,Fletcher,,,pupper,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1862,1989,672828477930868736,,,2015-12-04 17:23:04+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Jerry. He's a Timbuk Slytherin. Eats his pizza from the side first. Crushed that cup with his bare paws 9/10,,,,https://twitter.com/dog_rates/status/672828477930868736/photo/1,9,10,Jerry,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1900,2028,671866342182637568,,,2015-12-02 01:39:53+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Meet Dylan. He can use a fork but clearly can't put on a sweatshirt correctly. Looks like a disgruntled teen. 10/10,,,,https://twitter.com/dog_rates/status/671866342182637568/photo/1,10,10,Dylan,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1928,2056,671357843010908160,,,2015-11-30 15:59:17+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Tfw she says hello from the other side. 9/10,,,,"https://twitter.com/dog_rates/status/671357843010908160/photo/1,https://twitter.com/dog_rates/status/671357843010908160/photo/1,https://twitter.com/dog_rates/status/671357843010908160/photo/1,https://twitter.com/dog_rates/status/671357843010908160/photo/1",9,10,,,,,


Unnamed: 0,index,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2141,2274,667455448082227200,,,2015-11-19 21:32:34+00:00,"<a href=""http://twitter.com"" rel=""nofollow"">Twitter Web Client</a>",This is Reese and Twips. Reese protects Twips. Both think they're too good for seat belts. Simply reckless. 7/10s,,,,https://twitter.com/dog_rates/status/667455448082227200/photo/1,7,10,Reese,,,,


In [41]:
def category_correction(df, columns_list):
    '''Programmatically fills the dog-stage columns based on the tweet text.'''
    # iterate through the rows of the dataframe
    for index, row in df.iterrows():
        # check whether any word in the text contains a stage name
        for word in row.text.split():
            for value in columns_list:
                if value in word:
                    df.loc[index, value] = value
    return df

In [42]:
twitter_archive_clean = category_correction(twitter_archive_clean, ['doggo', 'floofer', 'pupper', 'puppo'])
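
The row-by-row fill above could also be written without `iterrows`. The following is only a sketch: `fill_stage_columns` and the toy frame are hypothetical names and made-up rows, not part of this notebook's pipeline. It keeps the same case-sensitive substring check, so "puppers" still counts as "pupper".

```python
import pandas as pd

def fill_stage_columns(df, stages):
    """Return a copy of df where df[stage] == stage wherever the text mentions it."""
    out = df.copy()
    for stage in stages:
        # plain substring match, mirroring the `value in word` check in the loop version
        mentioned = out['text'].str.contains(stage, regex=False, na=False)
        out.loc[mentioned, stage] = stage
    return out

# toy rows for illustration only
toy = pd.DataFrame({
    'text': ["Good morning here's a grass pupper. 12/10",
             "This is Mia. She makes awful decisions. 8/10"],
    'doggo': ['None', 'None'],
    'pupper': ['None', 'None'],
})
fixed = fill_stage_columns(toy, ['doggo', 'pupper'])
print(fixed[['doggo', 'pupper']])
```

Working on a copy leaves the input frame untouched, which makes the before/after comparison in the Test step easier.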

#### Test

In [43]:
#checking to see if changes in offending rows were made. commented for space
#for i in offending_categorical_rows:
#    display(twitter_archive_clean[twitter_archive_clean.index == i])

In [44]:
print('The number of categorical rows with missing or inaccurate categories is: ',len(category_accuracy(twitter_archive_clean, ['doggo', 'floofer', 'pupper', 'puppo'])))

The number of categorical rows with missing or inaccurate categories is:  0


#### Define
Record the list of correct names by hand, then change the names programmatically

#### Code

In [45]:
twitter_archive.expanded_urls[649]

'https://twitter.com/dog_rates/status/792913359805018113/photo/1,https://twitter.com/dog_rates/status/792913359805018113/photo/1,https://twitter.com/dog_rates/status/792913359805018113/photo/1,https://twitter.com/dog_rates/status/792913359805018113/photo/1'

In [46]:
# Running this script again since observations have been dropped
incorrect_names = []
for index, row in twitter_archive_clean.iterrows():
    # check for nulls first so that indexing into the name cannot fail
    if pd.notnull(row['name']) and row['name'] != 'None' and row['name'][0].islower():
        incorrect_names.append((row['name'], index))
incorrect_names;
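
As an aside, the same "name starts with a lowercase letter" check can be sketched as a boolean mask. The `names` Series below is a made-up stand-in for `twitter_archive_clean.name`; `str.match` anchors at the start of the string, and `na=False` treats missing names as non-matches.

```python
import pandas as pd

# made-up stand-in for the name column
names = pd.Series(['Adele', 'incredibly', 'None', None, 'a'], name='name')

# True where the name begins with a lowercase ASCII letter
starts_lower = names.str.match(r'[a-z]', na=False)

# (name, index) pairs, in the same shape as incorrect_names above
suspect = [(name, index) for index, name in names[starts_lower].items()]
print(suspect)
```

Unlike indexing with `[0]`, this does not fail on missing values or empty strings.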

In [47]:
#print incorrect name text for manual checking, commented out for space
#for i, index in incorrect_names:
#    print(index)
#    print(twitter_archive_clean.text[index])
#    print()

In [48]:
name_list = ['roxy', 'Grace', 'puffie', 'Forrest', 'Zoey', 'Quizno', 'Wylie', 'Kip', 'Jacob', 'Rufus', 'Spork', 'Cherokee', 'Hemry', 'Alphred', 'Alfredo', 'Leroi', 'Berta', 'Chuk', 'Alfonso', 'Cheryl', 'Jessiga', 'Klint', 'Kohl', 'Daryl', 'Pepe', 'Octaviath', 'Johm']
#iterating through the incorrect names only (looping through all names and fixing them programmatically may cause more problems than it solves)
for i, index in incorrect_names:
    #set all incorrect names to 'None'
    twitter_archive_clean.at[index, 'name'] = 'None'
    #iterate through the text to find a matching name
    for word in twitter_archive_clean.text[index].split():
        for name in name_list:
            #correct from None to name if there is a match
            if name in word:
                twitter_archive_clean.at[index, 'name'] = name

#### Test

In [49]:
# running the "incorrect_names" script again to test whether the names were changed
# should only return 'puffie' and 'roxy', as those were the only names that were lowercase
incorrect_names = []
for index, row in twitter_archive_clean.iterrows():
    # check for nulls first so that indexing into the name cannot fail
    if pd.notnull(row['name']) and row['name'] != 'None' and row['name'][0].islower():
        incorrect_names.append((row['name'], index))
incorrect_names

[('roxy', 22), ('puffie', 55)]

In [65]:
# capitalize the remaining lowercase names (puffie and roxy)
for name, index in incorrect_names:
    twitter_archive_clean.at[index, 'name'] = name.capitalize()

In [51]:
twitter_archive_clean.at[incorrect_names[0][1], 'name']

'Puffie'

In [52]:
twitter_archive_clean.name[55]

'puffie'

#### Define
Drop columns "in_reply_to_status_id", "in_reply_to_user_id", "retweeted_status_id", "retweeted_status_user_id" and "retweeted_status_timestamp"

#### Code

In [53]:
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2219 entries, 0 to 2218
Data columns (total 18 columns):
index                         2219 non-null int64
tweet_id                      2219 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2219 non-null datetime64[ns, UTC]
source                        2219 non-null object
text                          2219 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2160 non-null object
rating_numerator              2219 non-null int64
rating_denominator            2219 non-null int64
name                          2219 non-null object
doggo                         2219 non-null object
floofer                       2219 non-null object
pupper                        2219 non-null object
puppo                         2219 non-null object

In [54]:
twitter_archive_clean.drop(columns=["in_reply_to_status_id", "in_reply_to_user_id", "retweeted_status_id", "retweeted_status_user_id", "retweeted_status_timestamp"], inplace = True)

#### Test

In [55]:
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2219 entries, 0 to 2218
Data columns (total 13 columns):
index                 2219 non-null int64
tweet_id              2219 non-null int64
timestamp             2219 non-null datetime64[ns, UTC]
source                2219 non-null object
text                  2219 non-null object
expanded_urls         2160 non-null object
rating_numerator      2219 non-null int64
rating_denominator    2219 non-null int64
name                  2219 non-null object
doggo                 2219 non-null object
floofer               2219 non-null object
pupper                2219 non-null object
puppo                 2219 non-null object
dtypes: datetime64[ns, UTC](1), int64(4), object(8)
memory usage: 225.4+ KB


#### Define
Change tweet_id to string using astype

#### Code

In [56]:
twitter_archive_clean.tweet_id = twitter_archive_clean.tweet_id.astype(str)

#### Test

In [57]:
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2219 entries, 0 to 2218
Data columns (total 13 columns):
index                 2219 non-null int64
tweet_id              2219 non-null object
timestamp             2219 non-null datetime64[ns, UTC]
source                2219 non-null object
text                  2219 non-null object
expanded_urls         2160 non-null object
rating_numerator      2219 non-null int64
rating_denominator    2219 non-null int64
name                  2219 non-null object
doggo                 2219 non-null object
floofer               2219 non-null object
pupper                2219 non-null object
puppo                 2219 non-null object
dtypes: datetime64[ns, UTC](1), int64(3), object(9)
memory usage: 225.4+ KB


#### Define
Use str.extract on text to obtain the rating as a string, split it on "/" into new numerator and denominator columns, then change the new columns to float type using astype

#### Code

In [58]:
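# \.? allows at most one decimal point; expand=False returns a Series rather than a one-column DataFrame
twitter_archive_clean['rating'] = twitter_archive_clean.text.str.extract(r'(\d+\.?\d*/\d+\.?\d*)', expand=False)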

In [59]:
twitter_archive_clean['rating'] = twitter_archive_clean['rating'].str.split('/')

In [60]:
twitter_archive_clean[['numerator','denominator']] = pd.DataFrame(twitter_archive_clean.rating.values.tolist(), index= twitter_archive_clean.index)

In [61]:
twitter_archive_clean = twitter_archive_clean.astype({"numerator":'float', "denominator":'float'})

In [62]:
twitter_archive_clean.drop(columns=['rating_numerator', 'rating_denominator', 'rating'], inplace = True)
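
The extract, split and astype steps above could possibly be folded into one call by using named groups, which `str.extract` turns into column names. This is a sketch on three texts copied from the archive rows shown earlier; `texts`, `pattern` and `ratings` are illustrative names only. Like the code above, it picks up only the first rating in a tweet.

```python
import pandas as pd

texts = pd.Series([
    "Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10",
    "This is Mia. She makes awful decisions. 8/10",
    "These two pups just met and have instantly bonded. Spectacular scene. "
    "Mesmerizing af. 10/10 and 7/10 for blue dog",
])

# each named group becomes a column; (?:\.\d+)? allows an optional decimal part
pattern = r'(?P<numerator>\d+(?:\.\d+)?)/(?P<denominator>\d+(?:\.\d+)?)'
ratings = texts.str.extract(pattern).astype(float)
print(ratings)
```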

#### Test

In [63]:
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2219 entries, 0 to 2218
Data columns (total 13 columns):
index            2219 non-null int64
tweet_id         2219 non-null object
timestamp        2219 non-null datetime64[ns, UTC]
source           2219 non-null object
text             2219 non-null object
expanded_urls    2160 non-null object
name             2219 non-null object
doggo            2219 non-null object
floofer          2219 non-null object
pupper           2219 non-null object
puppo            2219 non-null object
numerator        2219 non-null float64
denominator      2219 non-null float64
dtypes: datetime64[ns, UTC](1), float64(2), int64(1), object(9)
memory usage: 225.4+ KB


In [64]:
twitter_archive_clean[twitter_archive_clean['numerator'] == 11.26]

Unnamed: 0,index,tweet_id,timestamp,source,text,expanded_urls,name,doggo,floofer,pupper,puppo,numerator,denominator
1596,1712,680494726643068929,2015-12-25 21:06:00+00:00,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here we have uncovered an entire battalion of holiday puppers. Average of 11.26/10,https://twitter.com/dog_rates/status/680494726643068929/photo/1,,,,pupper,,11.26,10.0


# Analysis

(jotting ideas down early so as to not forget)
### Motivating Questions:
- Do higher "ratings" correlate with higher number of retweets? (Must define ratings. Std might be useful)
- Investigate which is more "popular": Cute or Funny. Cute defined by Kindchenschema and funny defined by Benign Violation.
- Within the categories of cuteness and funniness, are more extreme examples more popular? Measured by retweets.
- What are observed characteristics of the top 3 most popular tweets? Is there a theme?

Although individual ratings do not seem to make much sense because the numerator often exceeds the denominator, they may be better understood as a decimal score (numerator divided by denominator), with higher scores indicating higher approval.
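
A minimal sketch of that decimal score, assuming the `numerator` and `denominator` columns created above. The `score` column name and the toy values are illustrative and not yet part of the cleaned dataset.

```python
import pandas as pd

# toy values for illustration only
toy_ratings = pd.DataFrame({
    'numerator': [11.26, 8.0, 12.0],
    'denominator': [10.0, 10.0, 10.0],
})

# a score above 1.0 means the numerator exceeded the denominator
toy_ratings['score'] = toy_ratings['numerator'] / toy_ratings['denominator']
print(toy_ratings.sort_values('score', ascending=False))
```

Dividing by the denominator puts ratings given out of different denominators on the same scale.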