# Wrangle Act
### By Julio Uribe

The purpose of this project is to expand on data wrangling abilities. In this file we will gather data about WeRateDogs Twitter posts from three sources: the Twitter API, the provided Twitter enhanced archive file (used for its tweet IDs), and Udacity's server, which hosts a TSV file of neural net results.

# Setup: Import Modules

In [17]:
import tweepy
from tweepy import OAuthHandler
import json
import numpy as np
from timeit import default_timer as timer
import pandas as pd
import requests

# Gather Data

## First Source File: Twitter Enhanced file and set up API keys

In [2]:
#load file info into dataframe for tweet id's to use later
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')
#load tweet ids for api extraction
tweet_ids = twitter_archive.tweet_id.values

#Set up API credentials from file outside directory
creds = []
with open('/Users/Jules/Desktop/DAND/twitter_credentials.txt', 'r') as f:
    creds = f.read().split("'")
consumer_key = creds[1]
consumer_secret = creds[3]
access_token = creds[5]
access_secret = creds[7]
#create auth object with keys
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
#create tweepy api object for requests
api = tweepy.API(auth, wait_on_rate_limit = True)

print (len(tweet_ids))

2356
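
Splitting the credentials file on quote characters works only while the file keeps its exact layout. A sketch of a sturdier alternative, assuming the keys were instead stored in a JSON file (the helper `load_credentials`, the JSON layout, and the key names are hypothetical, not part of the original project):

```python
import json
from pathlib import Path

# Hypothetical key names for the four values the notebook reads
REQUIRED_KEYS = ("consumer_key", "consumer_secret", "access_token", "access_secret")

def load_credentials(path):
    """Read the four Twitter keys from a JSON file and fail early if one is missing."""
    creds = json.loads(Path(path).read_text())
    missing = [key for key in REQUIRED_KEYS if key not in creds]
    if missing:
        raise KeyError(f"missing credentials: {missing}")
    # Return only the expected keys, ignoring anything else in the file
    return {key: creds[key] for key in REQUIRED_KEYS}
```

The returned dictionary could then supply the arguments to `tweepy.OAuthHandler` and `auth.set_access_token` in the cell above.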


## Second Source File: Query Twitter's API for JSON data

In [9]:
# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # Loop pauses/resumes at about 900 tweets due to api's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
end = timer()
print(end - start)
print(fails_dict)

1: 892420643555336193
Success
2: 892177421306343426
Success
3: 891815181378084864
Success
4: 891689557279858688
Success
5: 891327558926688256
Success
6: 891087950875897856
Success
7: 890971913173991426
Success
8: 890729181411237888
Success
9: 890609185150312448
Success
10: 890240255349198849
Success
11: 890006608113172480
Success
12: 889880896479866881
Success
13: 889665388333682689
Success
14: 889638837579907072
Success
15: 889531135344209921
Success
16: 889278841981685760
Success
17: 888917238123831296
Success
18: 888804989199671297
Success
19: 888554962724278272
Success
20: 888202515573088257
Fail
21: 888078434458587136
Success
22: 887705289381826560
Success
23: 887517139158093824
Success
24: 887473957103951883
Success
25: 887343217045368832
Success
26: 887101392804085760
Success
27: 886983233522544640
Success
28: 886736880519319552
Success
29: 886680336477933568
Success
30: 886366144734445568
Success
31: 886267009285017600
Success
32: 886258384151887873
Success
33: 8860541600590725
...
Success
263: 842765311967449089
Success
264: 842535590457499648
Success
265: 842163532590374912
Success
266: 842115215311396866
Success
267: 841833993020538882
Success
268: 841680585030541313
Success
269: 841439858740625411
Success
270: 841320156043304961
Success
271: 841314665196081154
Success
272: 841077006473256960
Success
273: 840761248237133825
Success
274: 840728873075638272
Success
275: 840698636975636481
Success
276: 840696689258311684
Success
277: 840632337062862849
Success
278: 840370681858686976
Success
279: 840268004936019968
Success
280: 839990271299457024
Success
281: 839549326359670784
Success
282: 839290600511926273
Success
283: 839239871831150596
Success
284: 838952994649550848
Success
285: 838921590096166913
Success
286: 838916489579200512
Success
287: 838831947270979586
Success
288: 838561493054533637
Success
289: 838476387338051585
Success
290: 838201503651401729
Success
291: 838150277551247360
Success
292: 838085839343206401
Success
293: 838083903487373313
Success
...
Success
520: 810284430598270976
Success
521: 810254108431155201
Success
522: 809920764300447744
Success
523: 809808892968534016
Success
524: 809448704142938112
Success
525: 809220051211603969
Success
526: 809084759137812480
Success
527: 808838249661788160
Success
528: 808733504066486276
Success
529: 808501579447930884
Success
530: 808344865868283904
Success
531: 808134635716833280
Success
532: 808106460588765185
Success
533: 808001312164028416
Success
534: 807621403335917568
Success
535: 807106840509214720
Success
536: 807059379405148160
Success
537: 807010152071229440
Success
538: 806629075125202948
Success
539: 806620845233815552
Success
540: 806576416489959424
Success
541: 806542213899489280
Success
542: 806242860592926720
Success
543: 806219024703037440
Success
544: 805958939288408065
Success
545: 805932879469572096
Success
546: 805826884734976000
Success
547: 805823200554876929
Success
548: 805520635690676224
Success
549: 805487436403003392
Success
550: 805207613751304193
Success
...
Success
777: 776113305656188928
Success
778: 776088319444877312
Success
779: 775898661951791106
Success
780: 775842724423557120
Success
781: 775733305207554048
Success
782: 775729183532220416
Success
783: 775364825476165632
Success
784: 775350846108426240
Success
785: 775096608509886464
Fail
786: 775085132600442880
Success
787: 774757898236878852
Success
788: 774639387460112384
Success
789: 774314403806253056
Success
790: 773985732834758656
Success
791: 773922284943896577
Success
792: 773704687002451968
Success
793: 773670353721753600
Success
794: 773547596996571136
Success
795: 773336787167145985
Success
796: 773308824254029826
Success
797: 773247561583001600
Success
798: 773191612633579521
Success
799: 772877495989305348
Success
800: 772826264096874500
Success
801: 772615324260794368
Success
802: 772581559778025472
Success
803: 772193107915964416
Success
804: 772152991789019136
Success
805: 772117678702071809
Success
806: 772114945936949249
Success
807: 772102971039580160
Success
808
...
Success
1033: 745314880350101504
Success
1034: 745074613265149952
Success
1035: 745057283344719872
Success
1036: 744995568523612160
Success
1037: 744971049620602880
Success
1038: 744709971296780288
Success
1039: 744334592493166593
Success
1040: 744234799360020481
Success
1041: 744223424764059648
Success
1042: 743980027717509120
Success
1043: 743895849529389061
Success
1044: 743835915802583040
Success
1045: 743609206067040256
Success
1046: 743595368194129920
Success
1047: 743545585370791937
Success
1048: 743510151680958465
Success
1049: 743253157753532416
Success
1050: 743222593470234624
Success
1051: 743210557239623680
Success
1052: 742534281772302336
Success
1053: 742528092657332225
Success
1054: 742465774154047488
Success
1055: 742423170473463808
Success
1056: 742385895052087300
Success
1057: 742161199639494656
Success
1058: 742150209887731712
Success
1059: 741793263812808706
Success
1060: 741743634094141440
Success
1061: 741438259667034112
Success
1062: 741303864243200000
Success
10
...
Success
1282: 708810915978854401
Success
1283: 708738143638450176
Success
1284: 708711088997666817
Success
1285: 708479650088034305
Success
1286: 708469915515297792
Success
1287: 708400866336894977
Success
1288: 708356463048204288
Success
1289: 708349470027751425
Success
1290: 708149363256774660
Success
1291: 708130923141795840
Success
1292: 708119489313951744
Success
1293: 708109389455101952
Success
1294: 708026248782585858
Success
1295: 707995814724026368
Success
1296: 707983188426153984
Success
1297: 707969809498152960
Success
1298: 707776935007539200
Success
1299: 707741517457260545
Success
1300: 707738799544082433
Success
1301: 707693576495472641
Success
1302: 707629649552134146
Success
1303: 707610948723478529
Success
1304: 707420581654872064
Success
1305: 707411934438625280
Success
1306: 707387676719185920
Success
1307: 707377100785885184
Success
1308: 707315916783140866
Success
1309: 707297311098011648
Success
1310: 707059547140169728
Success
1311: 707038192327901184
Success
13
...
Success
1532: 690015576308211712
Success
1533: 690005060500217858
Success
1534: 689999384604450816
Success
1535: 689993469801164801
Success
1536: 689977555533848577
Success
1537: 689905486972461056
Success
1538: 689877686181715968
Success
1539: 689835978131935233
Success
1540: 689661964914655233
Success
1541: 689659372465688576
Success
1542: 689623661272240129
Success
1543: 689599056876867584
Success
1544: 689557536375177216
Success
1545: 689517482558820352
Success
1546: 689289219123089408
Success
1547: 689283819090870273
Success
1548: 689280876073582592
Success
1549: 689275259254616065
Success
1550: 689255633275777024
Success
1551: 689154315265683456
Success
1552: 689143371370250240
Success
1553: 688916208532455424
Success
1554: 688908934925697024
Success
1555: 688898160958271489
Success
1556: 688894073864884227
Success
1557: 688828561667567616
Success
1558: 688804835492233216
Success
1559: 688789766343622656
Success
1560: 688547210804498433
Success
1561: 688519176466644993
Success
15
...
Success
1781: 677700003327029250
Success
1782: 677698403548192770
Success
1783: 677687604918272002
Success
1784: 677673981332312066
Success
1785: 677662372920729601
Success
1786: 677644091929329666
Success
1787: 677573743309385728
Success
1788: 677565715327688705
Success
1789: 677557565589463040
Success
1790: 677547928504967168
Success
1791: 677530072887205888
Success
1792: 677335745548390400
Success
1793: 677334615166730240
Success
1794: 677331501395156992
Success
1795: 677328882937298944
Success
1796: 677314812125323265
Success
1797: 677301033169788928
Success
1798: 677269281705472000
Success
1799: 677228873407442944
Success
1800: 677187300187611136
Success
1801: 676975532580409345
Success
1802: 676957860086095872
Success
1803: 676949632774234114
Success
1804: 676948236477857792
Success
1805: 676946864479084545
Success
1806: 676942428000112642
Success
1807: 676936541936185344
Success
1808: 676916996760600576
Success
1809: 676897532954456065
Success
1810: 676864501615042560
Success
18
...
Success
2030: 671855973984772097
Success
2031: 671789708968640512
Success
2032: 671768281401958400
Success
2033: 671763349865160704
Success
2034: 671744970634719232
Success
2035: 671743150407421952
Success
2036: 671735591348891648
Success
2037: 671729906628341761
Success
2038: 671561002136281088
Success
2039: 671550332464455680
Success
2040: 671547767500775424
Success
2041: 671544874165002241
Success
2042: 671542985629241344
Success
2043: 671538301157904385
Success
2044: 671536543010570240
Success
2045: 671533943490011136
Success
2046: 671528761649688577
Success
2047: 671520732782923777
Success
2048: 671518598289059840
Success
2049: 671511350426865664
Success
2050: 671504605491109889
Success
2051: 671497587707535361
Success
2052: 671488513339211776
Success
2053: 671486386088865792
Success
2054: 671485057807351808
Success
2055: 671390180817915904
Success
2056: 671362598324076544
Success
2057: 671357843010908160
Success
2058: 671355857343524864
Success
2059: 671347597085433856
Success
20
...
Success
2279: 667435689202614272
Success
2280: 667405339315146752
Success
2281: 667393430834667520
Success
2282: 667369227918143488
Success
2283: 667211855547486208
Success
2284: 667200525029539841
Success
2285: 667192066997374976
Success
2286: 667188689915760640
Success
2287: 667182792070062081
Success
2288: 667177989038297088
Success
2289: 667176164155375616
Success
2290: 667174963120574464
Success
2291: 667171260800061440
Success
2292: 667165590075940865
Success
2293: 667160273090932737
Success
2294: 667152164079423490
Success
2295: 667138269671505920
Success
2296: 667119796878725120
Success
2297: 667090893657276420
Success
2298: 667073648344346624
Success
2299: 667070482143944705
Success
2300: 667065535570550784
Success
2301: 667062181243039745
Success
2302: 667044094246576128
Success
2303: 667012601033924608
Success
2304: 666996132027977728
Success
2305: 666983947667116034
Success
2306: 666837028449972224
Success
2307: 666835007768551424
Success
2308: 666826780179869698
Success
23
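
Because the query loop opens `tweet_json.txt` in write mode, a rerun starts over from the first tweet. A minimal sketch of one way to resume instead, assuming the file keeps one JSON object per line as written by the loop (`fetched_ids` and `remaining_ids` are hypothetical helper names):

```python
import json
import os

def fetched_ids(path):
    """Return the set of tweet IDs already saved in a JSON-lines file."""
    if not os.path.exists(path):
        return set()
    ids = set()
    with open(path) as infile:
        for line in infile:
            line = line.strip()
            if line:  # skip blank lines
                ids.add(json.loads(line)["id"])
    return ids

def remaining_ids(all_ids, path):
    """Keep the original order, dropping IDs that are already on disk."""
    done = fetched_ids(path)
    return [tweet_id for tweet_id in all_ids if tweet_id not in done]
```

The loop could then iterate over `remaining_ids(tweet_ids, 'tweet_json.txt')` and open the output file in append mode (`'a'`) so that earlier results are kept.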

### Load JSON data we got from Twitter API into a cleaner dataframe

In [3]:
#load twitter json file into a pandas dataframe
tweets_json_full = pd.read_json("tweet_json.txt", lines=True)
#tweets_json_full.info()

In [4]:
#create a smaller version of tweets_json_full with only the columns we're interested in
tweets_json = pd.DataFrame(tweets_json_full[['id', 'created_at', 'favorite_count', 'retweet_count', 'full_text', 'extended_entities']])
#tweets_json.head()

In [6]:
#seeing if we can extract anything interesting from the extended_entities values
# for i in range(5):
#     print(tweets_json_full['extended_entities'][i]['media'][0]['url'])
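
The commented-out lookup above would likely raise an error on tweets without media, since `extended_entities` is typically missing (NaN) for those rows. A hedged sketch of a safer accessor; `first_media_url` is a hypothetical helper, and the `media` and `url` keys are the ones used in the cell above:

```python
def first_media_url(entities):
    """Return the first media URL from an extended_entities value, or None."""
    if not isinstance(entities, dict):  # NaN or None when the tweet has no media
        return None
    media = entities.get("media") or []
    if not media:
        return None
    return media[0].get("url")
```

Applied to the whole column, this would look like `tweets_json_full['extended_entities'].apply(first_media_url)`.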

## Third Source file: Use Requests Module to Load Neural Net Results

In [5]:
r = requests.get('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv')
with open("image_predictions.tsv", 'wb') as file:
    for chunk in r.iter_content(chunk_size=128):
        file.write(chunk)
#read file back in and create a df for image predictions data     
image_predictions = pd.read_csv("image_predictions.tsv", sep='\t')

# Assessing Data

For this project, we have three dataframes we're currently working with:
* twitter_archive - tweet info imported from twitter-archive-enhanced.csv, provided by Udacity. Has tweet IDs, tweet text, ratings, and other information.
* tweets_json - data from the Twitter API containing retweet counts, favorite counts, tweet text, and more.
* image_predictions - results from the neural net. Contains three predictions, the image URL, the number of images, etc.

The three dataframes provide info about the tweets posted from the WeRateDogs twitter profile. We'll do some assessing of the data before we merge these dataframes together. Then we'll clean the data to get the most complete data set we can.

## Quality Issues
* twitter_archive has 2356 entries, tweets_json has 2340 entries, and image_predictions has 2075 entries
* In 'twitter_archive', incorrect and missing names under the 'name' column: 'None', 'a', 'the', 'an', etc.
* In 'twitter_archive', a few rating numerators are under 10, but according to the Twitter profile, all ratings should be above 10
* In 'twitter_archive', there are rows where the 'rating_denominator' is lower or higher than 10. We need to standardize all rows to be out of 10.
* In 'twitter_archive', the timestamp column is in string format instead of datetime.
* In 'image_predictions', the neural net results in p1 have inconsistent lower/upper case. These need to be made consistent.
* In 'tweets_json', the 'id' column should be renamed to 'tweet_id' to be consistent with the other two dataframes
* In 'image_predictions', we get invalid results from our neural net in p1 such as 'desktop_computer', 'electric_fan', 'wild_boar'.
* In 'image_predictions', some predicted dog breeds aren't actual dog breeds
* From 'image_predictions', not all Twitter posts actually have dogs in them. We should toss these out
* From 'image_predictions', we have tweets that do not have a dog present. We should toss out rows that do not have a dog present. Consider using all three prediction values to decide which tweets to toss out?
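
Two of the issues above lend themselves to a short sketch: normalizing the case of the predictions and keeping only rows where at least one of the three predictions is a dog. The toy frame below is made up for illustration; only the column names (`p1`, `p1_dog`, and so on) come from `image_predictions`, and requiring a single dog prediction is one possible rule, not necessarily the one adopted later.

```python
import pandas as pd

# Made-up rows that mimic the image_predictions columns
sample = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "p1": ["Golden_Retriever", "desktop_computer", "wild_boar"],
    "p1_dog": [True, False, False],
    "p2": ["Labrador_retriever", "electric_fan", "pug"],
    "p2_dog": [True, False, True],
    "p3": ["kuvasz", "coffee_mug", "hog"],
    "p3_dog": [True, False, False],
})

# Lower-case every prediction label so that casing is consistent
for col in ["p1", "p2", "p3"]:
    sample[col] = sample[col].str.lower()

# Keep a row if any of the three predictions is flagged as a dog
has_dog = sample[["p1_dog", "p2_dog", "p3_dog"]].any(axis=1)
dogs_only = sample[has_dog].reset_index(drop=True)
```

A stricter rule, such as requiring `p1_dog` alone, would only change the `has_dog` line.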


## Tidiness Issues
* In 'twitter_archive', the last four columns (doggo, floofer, pupper, puppo) are not always observed and best serve as a category. We should combine these 4 columns into one
* In 'twitter_archive', there are multiple values in 'source' column.
* In 'tweets_json', we have multiple values in the 'extended_entities' column. Clean up column values to iphone, vine, web client, etc.
* twitter_archive has 2356 entries, tweets_json has 2340 entries, and image_predictions has 2075 entries. We'll merge later on tweet IDs.
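
The first tidiness issue, collapsing the four stage columns into one, could be approached as in the sketch below. It runs on made-up rows and assumes a missing stage is stored as the string 'None'; the new column name `stage` and the helper `combine_stages` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the four stage columns of twitter_archive
sample = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

def combine_stages(row):
    """Join the stages present in a row; return NaN when none is present."""
    found = [row[col] for col in stage_cols if row[col] != "None"]
    return ", ".join(found) if found else np.nan

# Build the single category column, then drop the four originals
sample["stage"] = sample[stage_cols].apply(combine_stages, axis=1)
sample = sample.drop(columns=stage_cols)
```

Joining with a comma keeps tweets that mention two stages visible instead of silently picking one.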

In [8]:
#twitter_archive

In [9]:
# twitter_archive has 2356 entries, tweets_json has 2340 entries, and image_predictions has 2075 entries. We'll merge later
# on tweet IDs.
#twitter_archive.describe()
#tweets_json.info()
#twitter_archive.isnull().sum()
#type(twitter_archive.timestamp[0])
#twitter_archive.source.value_counts()

In [10]:
#twitter_archive.name.value_counts()

In [11]:
#twitter_archive.rating_denominator.value_counts()

In [12]:
#twitter_archive.rating_numerator.value_counts()

In [13]:
#tweets_json

In [14]:
# tweets_json.isnull().sum()
# tweets_json.info()

In [15]:
# image_predictions.isnull().sum()
# image_predictions

In [16]:
# image_predictions.describe()

In [17]:
#image_predictions.p1.value_counts()
#image_predictions.p2.value_counts()
#image_predictions.p3.value_counts()

In [18]:
# if it hits false multiple times, toss out row
#image_predictions[image_predictions.p3 == 'space_shuttle']

In [19]:
#image_predictions[image_predictions.p1 == 'coffee_mug'].jpg_url

In [20]:
# there's a good chance that a large part of our data set doesn't actually contain dogs in the image, throwing off ratings
#image_predictions['p1_dog'].mean(), image_predictions['p2_dog'].mean(), image_predictions['p3_dog'].mean()

In [21]:
# explore prediction results for tweets with more than one image. How does the neural net handle multiple images?
# multi_pic = image_predictions[image_predictions["img_num"] > 1]
# multi_pic

In [22]:
# lets compare the average p1_dog, p2_dog, p3_dog rates from multiple images to the whole dataframe
# multi_pic['p1_dog'].mean(), multi_pic['p2_dog'].mean(), multi_pic['p3_dog'].mean()
# Tweets with multiple images are more likely to have a dog in them than the general dog prediction rate from the entire dataframe

In [23]:
#checking for duplicated values
# twitter_archive[twitter_archive.tweet_id.duplicated()]
# tweets_json[tweets_json.id.duplicated()]
# image_predictions[image_predictions.tweet_id.duplicated()]

# Cleaning Data

### Create copies for data

In [6]:
# Create copies of all three dataframes
twitter_archive_clean = twitter_archive.copy()
tweets_json_clean = tweets_json.copy()
image_predictions_clean = image_predictions.copy()

## Goal is to merge all three data sets. First we need to clean up some columns

### First Merge and Cleaning

In [10]:
# Define
# Rename a few columns in the dataframes for consistency
# Clean
tweets_json_clean.rename(columns={'id':'tweet_id', 'created_at': 'timestamp'}, inplace=True)
twitter_archive_clean.rename(columns={'text':'full_text', 'created_at': 'timestamp'}, inplace=True)
# Test: Make sure column names are consistent when shared/overlapping
#twitter_archive_clean.columns, tweets_json_clean.columns

In [7]:
# Define
# I'm going to delete several columns that are less interesting in the dataframes
#Clean
twitter_archive_clean.drop(['in_reply_to_status_id', 'in_reply_to_user_id', 'source', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp', 'expanded_urls'], axis=1, inplace=True)
tweets_json_clean.drop(['extended_entities'], axis=1, inplace=True)
# Test: Make sure all the appropriate columns have been deleted
#twitter_archive_clean.columns, tweets_json_clean.columns

In [12]:
# Define
# Merge twitter_archive_clean and tweets_json_clean together using 'tweet_id'
tweets_super_clean = tweets_json_clean.merge(twitter_archive_clean, how='inner', on='tweet_id')
# Test: let's see what columns we have now and if the merge is doing what we want it to do
#tweets_super_clean.head()
#tweets_super_clean.info()
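
An inner merge silently drops any tweet_id that is missing from either side (here, the tweets that the API could not return). A small sketch of how one might list those rows before dropping them, using made-up frames rather than the project data:

```python
import pandas as pd

# Made-up stand-ins for tweets_json_clean and twitter_archive_clean
api_sample = pd.DataFrame({"tweet_id": [1, 2], "retweet_count": [10, 20]})
archive_sample = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [13, 12, 11]})

# An outer merge with indicator=True adds a `_merge` column that records
# whether each tweet_id was found in the left frame, the right frame, or both
check = api_sample.merge(archive_sample, how="outer", on="tweet_id", indicator=True)
unmatched = check[check["_merge"] != "both"]
```

Rows marked `right_only` exist only in the archive sample, which is the pattern to expect for tweets the API failed to return.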

In [13]:
# Before we move onto our second merge, we can clean up this data set a bit more
# Define
# timestamp_x and timestamp_y show the same data but timestamp_x is already in the datetime format we want so we'll keep that one
# two columns for full text as well. We'll keep the first one
# Drop the columns
tweets_super_clean.drop(['timestamp_y', 'full_text_y'], axis=1, inplace=True)
# Rename the columns
tweets_super_clean.rename(columns={'timestamp_x':'timestamp', 'full_text_x': 'full_text'}, inplace=True)
# Test: verify our column surgery was clean and successful
tweets_super_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2340 entries, 0 to 2339
Data columns (total 12 columns):
tweet_id              2340 non-null int64
timestamp             2340 non-null datetime64[ns]
favorite_count        2340 non-null int64
retweet_count         2340 non-null int64
full_text             2340 non-null object
rating_numerator      2340 non-null int64
rating_denominator    2340 non-null int64
name                  2340 non-null object
doggo                 2340 non-null object
floofer               2340 non-null object
pupper                2340 non-null object
puppo                 2340 non-null object
dtypes: datetime64[ns](1), int64(5), object(6)
memory usage: 237.7+ KB


### Second Merge

In [14]:
# We'll now merge tweets_super_clean with image_predictions_clean using tweet_ids
tweets_super_clean = tweets_super_clean.merge(image_predictions_clean, on='tweet_id', how='inner')
tweets_super_clean.columns
tweets_super_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2067 entries, 0 to 2066
Data columns (total 23 columns):
tweet_id              2067 non-null int64
timestamp             2067 non-null datetime64[ns]
favorite_count        2067 non-null int64
retweet_count         2067 non-null int64
full_text             2067 non-null object
rating_numerator      2067 non-null int64
rating_denominator    2067 non-null int64
name                  2067 non-null object
doggo                 2067 non-null object
floofer               2067 non-null object
pupper                2067 non-null object
puppo                 2067 non-null object
jpg_url               2067 non-null object
img_num               2067 non-null int64
p1                    2067 non-null object
p1_conf               2067 non-null float64
p1_dog                2067 non-null bool
p2                    2067 non-null object
p2_conf               2067 non-null float64
p2_dog                2067 non-null bool
p3                    2067 non-nu

In [15]:
# Define
# There are rows where the 'rating_denominator' is lower or higher than 10. We need to standardize all rows to be out of 10
# Clean
tweets_super_clean.rating_denominator = 10
# Test
tweets_super_clean.rating_denominator.value_counts()

10    2067
Name: rating_denominator, dtype: int64
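
Overwriting the denominator alone leaves a hypothetical rating such as 84/70 reading as 84/10. One alternative, sketched here on made-up rows, is to rescale the numerator by the same factor first; whether that suits every tweet (for example, ratings given to several dogs at once) is a judgment call left open here.

```python
import pandas as pd

# Made-up ratings; the second row is out of 70 rather than 10
sample = pd.DataFrame({
    "rating_numerator": [13, 84, 5],
    "rating_denominator": [10, 70, 10],
})

# Scale each numerator so that the rating is expressed out of 10
# (rows with a zero denominator would need separate handling)
scale = 10 / sample["rating_denominator"]
sample["rating_numerator"] = sample["rating_numerator"] * scale
sample["rating_denominator"] = 10
```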

In [24]:
# Define
# Replace the non-names 'a', 'an', and 'the' in the 'name' column with NaN values
# Clean
not_names = ['a', 'an', 'the']
tweets_super_clean['name'] = tweets_super_clean['name'].replace(not_names, np.nan)
# Test
tweets_super_clean.name.value_counts()

None           575
Penny           10
Cooper          10
Tucker          10
Lucy            10
Charlie         10
Oliver          10
Bo               8
Lola             8
Winston          8
Sadie            8
Daisy            7
Toby             7
Scout            6
Rusty            6
Koda             6
Milo             6
Bella            6
Dave             6
Stanley          6
Bailey           6
Jax              6
Leo              5
Larry            5
Chester          5
Alfie            5
Oscar            5
Louis            5
Buddy            5
Clarence         4
              ... 
Derby            1
Cilantro         1
Gerbald          1
Travis           1
Timmy            1
Vince            1
Anna             1
Snickers         1
Bowie            1
Shakespeare      1
Glacier          1
Patch            1
Claude           1
Jameson          1
Trigger          1
Dunkin           1
Comet            1
Aiden            1
Shawwn           1
Banjo            1
BeBe             1
Franq       

In [30]:
# Test
# Look at the full text from tweets where we found no dog name and manually look over to confirm there aren't dog names
need_names = tweets_super_clean[tweets_super_clean.name == 'None'].full_text
for i in need_names:
    print (i)
# Our visual assessment here doesn't catch dog names that we skipped. We have 

Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh
When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq
Here's a puppo that seems to be on the fence about something haha no but seriously someone help her. 13/10 https://t.co/BxvuXk0UCm
You may not have known you needed to see this today. 13/10 please enjoy (IG: emmylouroo) https://t.co/WZqNqygEyV
This... is a Jubilant Antarctic House Bear. We only rate dogs. Please only send dogs. Thank you... 12/10 would suffocate in floof https://t.co/4Ad1jzJSdp
Here we have a corgi undercover as a malamute. Pawbably doing important investigative work. Zero control over tongue happenings. 13/10 https://t.co/44ItaMubBf
I present to you, Pup in Hat. Pup in Hat is great for all occasions. Extremely versatile. Compact as h*ck. 14/10 (IG: itselizabethgales) 
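
Reading several hundred tweets by eye is error-prone. A rough sketch of a pattern search that could narrow the review down, assuming names tend to follow phrases such as "named" or "name is" (these two phrases are a guess about the tweet wording, and `find_name` is a hypothetical helper):

```python
import re

# Capitalized word following "named" or "name is"; the phrases are assumptions
NAME_PATTERN = re.compile(r"\b(?:named|name is)\s+([A-Z][a-z]+)")

def find_name(text):
    """Return the first candidate dog name in a tweet, or None."""
    match = NAME_PATTERN.search(text)
    return match.group(1) if match else None
```

Something like `need_names.apply(find_name).dropna()` would then list only the tweets where a candidate name was found, leaving the rest for manual review.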

## Quality Issues
* In 'twitter_archive', incorrect and missing names under the 'name' column: 'None', 'a', 'the', 'an', etc.
* In 'twitter_archive', a few rating numerators are under 10, but according to the Twitter profile, all ratings should be above 10
* In 'twitter_archive', there are rows where the 'rating_denominator' is lower or higher than 10. We need to standardize all rows to be out of 10.
* ~~In 'twitter_archive', timestamp column is in string format instead of datetime.~~
* In 'image_predictions', the neural net results in p1 have inconsistent lower/upper case. These need to be made consistent.
* ~~In 'tweets_json', 'id' column should be renamed to 'tweet_id' to be consistent with other two dataframes~~
* In 'image_predictions', we get invalid results from our neural net in p1 such as 'desktop_computer', 'electric_fan', 'wild_boar'.
* In 'image_predictions', some predicted dog breeds aren't actual dog breeds
* From 'image_predictions', not all Twitter posts actually have dogs in them
* From 'image_predictions', we have tweets that do not have a dog present. We should toss out rows that do not have a dog present. Consider using all three prediction values to decide which tweets to toss out?


## Tidiness Issues
* ~~all three dataframes should be one table~~
* ~~rename several columns before merging for consistency: full_text, tweet_id, timestamp to be the standard~~
* ~~get rid of less interesting columns so that our dataframe isn't massive when fully merged~~
* ~~twitter_archive has 2356 entries, tweets_json has 2340 entries, and image_predictions has 2075 entries~~
* In 'twitter_archive', the last four columns (doggo, floofer, pupper, puppo) are not always observed and best serve as a category. We should combine these 4 columns into one