# Analyze Tweet Data
## Part 1: Data Wrangling

## Table of Contents
- [Introduction](#intro)
- [Gathering Data](#gather)
- [Assessing Data](#assess)
- [Cleaning Data](#clean)
- [Conclusion](#conclusion)

<a id='intro'></a>
## Introduction

In [7]:
import pandas as pd
import requests
import os
import tweepy  # provides tweepy.TweepError (tweepy 3.x) used in the query loop
import twitter_api as t_api
from timeit import default_timer as timer
import json
import copy
import numpy as np

<a id='gather'></a>
## Gathering Data

### 1. Enhanced Twitter Archive

In [40]:
# enhanced twitter archive data
df_archive = pd.read_csv('data/twitter_archive_enhanced.csv')
df_archive.head(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,


### 2. Image Predictions

In [41]:
# GET request URL
get_url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

# response from GET request
response = requests.get(get_url)

# .tsv file name
tsv_file = os.path.join('data', get_url.split('/')[-1])

# store response data: open the .tsv file in binary write mode
# and write the response content to it
with open(tsv_file, mode = 'wb') as file:
    file.write(response.content)

In [42]:
# verify successful download of .tsv file
tsv_name = os.path.basename(tsv_file)

if os.path.isfile(tsv_file):
    print('Successfully downloaded {}'.format(tsv_name))
else:
    print('Downloading {} failed'.format(tsv_name))

Successfully downloaded image-predictions.tsv


In [43]:
# import .tsv data
df_image = pd.read_csv(tsv_file, sep = '\t')
df_image.head(3)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True


### 3. Additional Tweet Data

In [44]:
# script from twitter_api.py

# import tweepy

# def twitter_api():
#     # Keys and Tokens
#     consumer_key = 'CONSUMER KEY'
#     consumer_secret = 'CONSUMER SECRET'
#     access_token = 'ACCESS TOKEN'
#     access_secret = 'ACCESS SECRET'

#     # OAuthHandler instance equipped with an access token for OAuth Authentication
#     auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
#     auth.set_access_token(access_token, access_secret)

#     # twitter API wrapper
#     return tweepy.API(auth_handler = auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)

In [45]:
# twitter API wrapper from twitter_api.py
api = t_api.twitter_api()

# input: tweet ids from enhanced twitter archive
tweet_ids = df_archive.tweet_id

In [46]:
# dict-object containing failed tweet ids and error details
error_dict = {}

# track progress of iteration over all tweets in archive
count = 0
start = timer()

# query Twitter API for JSON data
with open('data/tweet_json.txt', 'w') as outfile:
    for tweet_id in tweet_ids:
        count += 1
        print('Tweet ID: {} (Count: {})'.format(tweet_id, count)) # tweet id being queried
        try: # query with tweet id and obtain status object
            api_data = api.get_status(tweet_id, tweet_mode = 'extended')
        except tweepy.TweepError as e: # failed query
            print(' - Fail\n')
            error_dict[tweet_id] = e
        else: # write JSON data to .txt file
            print(' - Success\n')
            json.dump(api_data._json, outfile)
            outfile.write('\n') # change line for the next tweet id
end = timer()
print('Query took {} minutes.'.format(round((end-start)/60, 2))) # duration of iteration
print('Query failed for {} tweets.'.format(len(error_dict)))

Tweet ID: 892420643555336193 (Count: 1)
 - Success

Tweet ID: 892177421306343426 (Count: 2)
 - Success

Tweet ID: 891815181378084864 (Count: 3)
 - Success

Tweet ID: 891689557279858688 (Count: 4)
 - Success

Tweet ID: 891327558926688256 (Count: 5)
 - Success

Tweet ID: 891087950875897856 (Count: 6)
 - Success

Tweet ID: 890971913173991426 (Count: 7)
 - Success

Tweet ID: 890729181411237888 (Count: 8)
 - Success

Tweet ID: 890609185150312448 (Count: 9)
 - Success

Tweet ID: 890240255349198849 (Count: 10)
 - Success

Tweet ID: 890006608113172480 (Count: 11)
 - Success

Tweet ID: 889880896479866881 (Count: 12)
 - Success

Tweet ID: 889665388333682689 (Count: 13)
 - Success

Tweet ID: 889638837579907072 (Count: 14)
 - Success

Tweet ID: 889531135344209921 (Count: 15)
 - Success

Tweet ID: 889278841981685760 (Count: 16)
 - Success

Tweet ID: 888917238123831296 (Count: 17)
 - Success

Tweet ID: 888804989199671297 (Count: 18)
 - Success

Tweet ID: 888554962724278272 (Count: 19)
 - Success

...

 - Success

Tweet ID: 861769973181624320 (Count: 156)
 - Fail

Tweet ID: 861383897657036800 (Count: 157)
 - Success

Tweet ID: 861288531465048066 (Count: 158)
 - Success

Tweet ID: 861005113778896900 (Count: 159)
 - Success

Tweet ID: 860981674716409858 (Count: 160)
 - Success

Tweet ID: 860924035999428608 (Count: 161)
 - Success

Tweet ID: 860563773140209665 (Count: 162)
 - Success

Tweet ID: 860524505164394496 (Count: 163)
 - Success

Tweet ID: 860276583193509888 (Count: 164)
 - Success

Tweet ID: 860184849394610176 (Count: 165)
 - Success

Tweet ID: 860177593139703809 (Count: 166)
 - Success

Tweet ID: 859924526012018688 (Count: 167)
 - Success

Tweet ID: 859851578198683649 (Count: 168)
 - Success

Tweet ID: 859607811541651456 (Count: 169)
 - Success

Tweet ID: 859196978902773760 (Count: 170)
 - Success

Tweet ID: 859074603037188101 (Count: 171)
 - Success

Tweet ID: 858860390427611136 (Count: 172)
 - Success

Tweet ID: 858843525470990336 (Count: 173)
 - Success

...

 - Success

Tweet ID: 835685285446955009 (Count: 308)
 - Success

Tweet ID: 835574547218894849 (Count: 309)
 - Success

Tweet ID: 835536468978302976 (Count: 310)
 - Success

Tweet ID: 835309094223372289 (Count: 311)
 - Success

Tweet ID: 835297930240217089 (Count: 312)
 - Success

Tweet ID: 835264098648616962 (Count: 313)
 - Success

Tweet ID: 835246439529840640 (Count: 314)
 - Success

Tweet ID: 835172783151792128 (Count: 315)
 - Success

Tweet ID: 835152434251116546 (Count: 316)
 - Success

Tweet ID: 834931633769889797 (Count: 317)
 - Success

Tweet ID: 834786237630337024 (Count: 318)
 - Success

Tweet ID: 834574053763584002 (Count: 319)
 - Success

Tweet ID: 834477809192075265 (Count: 320)
 - Success

Tweet ID: 834458053273591808 (Count: 321)
 - Success

Tweet ID: 834209720923721728 (Count: 322)
 - Success

Tweet ID: 834167344700198914 (Count: 323)
 - Success

Tweet ID: 834089966724603904 (Count: 324)
 - Success

Tweet ID: 834086379323871233 (Count: 325)
 - Success

...

 - Success

Tweet ID: 817827839487737858 (Count: 460)
 - Success

Tweet ID: 817777686764523521 (Count: 461)
 - Success

Tweet ID: 817536400337801217 (Count: 462)
 - Success

Tweet ID: 817502432452313088 (Count: 463)
 - Success

Tweet ID: 817423860136083457 (Count: 464)
 - Success

Tweet ID: 817415592588222464 (Count: 465)
 - Success

Tweet ID: 817181837579653120 (Count: 466)
 - Success

Tweet ID: 817171292965273600 (Count: 467)
 - Success

Tweet ID: 817120970343411712 (Count: 468)
 - Success

Tweet ID: 817056546584727552 (Count: 469)
 - Success

Tweet ID: 816829038950027264 (Count: 470)
 - Success

Tweet ID: 816816676327063552 (Count: 471)
 - Success

Tweet ID: 816697700272001025 (Count: 472)
 - Success

Tweet ID: 816450570814898180 (Count: 473)
 - Success

Tweet ID: 816336735214911488 (Count: 474)
 - Success

Tweet ID: 816091915477250048 (Count: 475)
 - Success

Tweet ID: 816062466425819140 (Count: 476)
 - Success

Tweet ID: 816014286006976512 (Count: 477)
 - Success

...

 - Success

Tweet ID: 797165961484890113 (Count: 612)
 - Success

Tweet ID: 796904159865868288 (Count: 613)
 - Success

Tweet ID: 796865951799083009 (Count: 614)
 - Success

Tweet ID: 796759840936919040 (Count: 615)
 - Success

Tweet ID: 796563435802726400 (Count: 616)
 - Success

Tweet ID: 796484825502875648 (Count: 617)
 - Success

Tweet ID: 796387464403357696 (Count: 618)
 - Success

Tweet ID: 796177847564038144 (Count: 619)
 - Success

Tweet ID: 796149749086875649 (Count: 620)
 - Success

Tweet ID: 796125600683540480 (Count: 621)
 - Success

Tweet ID: 796116448414461957 (Count: 622)
 - Success

Tweet ID: 796080075804475393 (Count: 623)
 - Success

Tweet ID: 796031486298386433 (Count: 624)
 - Success

Tweet ID: 795464331001561088 (Count: 625)
 - Success

Tweet ID: 795400264262053889 (Count: 626)
 - Success

Tweet ID: 795076730285391872 (Count: 627)
 - Success

Tweet ID: 794983741416415232 (Count: 628)
 - Success

Tweet ID: 794926597468000259 (Count: 629)
 - Success

...

 - Success

Tweet ID: 778027034220126208 (Count: 764)
 - Success

Tweet ID: 777953400541634568 (Count: 765)
 - Success

Tweet ID: 777885040357281792 (Count: 766)
 - Success

Tweet ID: 777684233540206592 (Count: 767)
 - Success

Tweet ID: 777641927919427584 (Count: 768)
 - Success

Tweet ID: 777621514455814149 (Count: 769)
 - Success

Tweet ID: 777189768882946048 (Count: 770)
 - Success

Tweet ID: 776819012571455488 (Count: 771)
 - Success

Tweet ID: 776813020089548800 (Count: 772)
 - Success

Tweet ID: 776477788987613185 (Count: 773)
 - Success

Tweet ID: 776249906839351296 (Count: 774)
 - Success

Tweet ID: 776218204058357768 (Count: 775)
 - Success

Tweet ID: 776201521193218049 (Count: 776)
 - Success

Tweet ID: 776113305656188928 (Count: 777)
 - Success

Tweet ID: 776088319444877312 (Count: 778)
 - Success

Tweet ID: 775898661951791106 (Count: 779)
 - Success

Tweet ID: 775842724423557120 (Count: 780)
 - Success

Tweet ID: 775733305207554048 (Count: 781)
 - Success

...

 - Fail

Tweet ID: 756998049151549440 (Count: 917)
 - Fail

Tweet ID: 756939218950160384 (Count: 918)
 - Fail

Tweet ID: 756651752796094464 (Count: 919)
 - Fail

Tweet ID: 756526248105566208 (Count: 920)
 - Fail

Tweet ID: 756303284449767430 (Count: 921)
 - Fail

Tweet ID: 756288534030475264 (Count: 922)
 - Fail

Tweet ID: 756275833623502848 (Count: 923)
 - Fail

Tweet ID: 755955933503782912 (Count: 924)
 - Fail

Tweet ID: 755206590534418437 (Count: 925)
 - Fail

Tweet ID: 755110668769038337 (Count: 926)
 - Fail

Tweet ID: 754874841593970688 (Count: 927)
 - Fail

Tweet ID: 754856583969079297 (Count: 928)
 - Fail

Tweet ID: 754747087846248448 (Count: 929)
 - Fail

Tweet ID: 754482103782404096 (Count: 930)
 - Fail

Tweet ID: 754449512966619136 (Count: 931)
 - Fail

Tweet ID: 754120377874386944 (Count: 932)
 - Fail

Tweet ID: 754011816964026368 (Count: 933)
 - Fail

Tweet ID: 753655901052166144 (Count: 934)
 - Fail

Tweet ID: 753420520834629632 (Count: 935)
 - Fail

...

 - Fail

Tweet ID: 739606147276148736 (Count: 1077)
 - Fail

Tweet ID: 739544079319588864 (Count: 1078)
 - Fail

Tweet ID: 739485634323156992 (Count: 1079)
 - Fail

Tweet ID: 739238157791694849 (Count: 1080)
 - Fail

Tweet ID: 738891149612572673 (Count: 1081)
 - Fail

Tweet ID: 738885046782832640 (Count: 1082)
 - Fail

Tweet ID: 738883359779196928 (Count: 1083)
 - Fail

Tweet ID: 738537504001953792 (Count: 1084)
 - Fail

Tweet ID: 738402415918125056 (Count: 1085)
 - Fail

Tweet ID: 738184450748633089 (Count: 1086)
 - Fail

Tweet ID: 738166403467907072 (Count: 1087)
 - Fail

Tweet ID: 738156290900254721 (Count: 1088)
 - Fail

Tweet ID: 737826014890496000 (Count: 1089)
 - Fail

Tweet ID: 737800304142471168 (Count: 1090)
 - Fail

Tweet ID: 737678689543020544 (Count: 1091)
 - Fail

Tweet ID: 737445876994609152 (Count: 1092)
 - Fail

Tweet ID: 737322739594330112 (Count: 1093)
 - Fail

Tweet ID: 737310737551491075 (Count: 1094)
 - Fail

Tweet ID: 736736130620620800 (Count: 1095)
 - Fail

...

 - Fail

Tweet ID: 712717840512598017 (Count: 1235)
 - Fail

Tweet ID: 712668654853337088 (Count: 1236)
 - Fail

Tweet ID: 712438159032893441 (Count: 1237)
 - Fail

Tweet ID: 712309440758808576 (Count: 1238)
 - Fail

Tweet ID: 712097430750289920 (Count: 1239)
 - Fail

Tweet ID: 712092745624633345 (Count: 1240)
 - Fail

Tweet ID: 712085617388212225 (Count: 1241)
 - Fail

Tweet ID: 712065007010385924 (Count: 1242)
 - Fail

Tweet ID: 711998809858043904 (Count: 1243)
 - Fail

Tweet ID: 711968124745228288 (Count: 1244)
 - Fail

Tweet ID: 711743778164514816 (Count: 1245)
 - Fail

Tweet ID: 711732680602345472 (Count: 1246)
 - Fail

Tweet ID: 711694788429553666 (Count: 1247)
 - Fail

Tweet ID: 711652651650457602 (Count: 1248)
 - Fail

Tweet ID: 711363825979756544 (Count: 1249)
 - Fail

Tweet ID: 711306686208872448 (Count: 1250)
 - Fail

Tweet ID: 711008018775851008 (Count: 1251)
 - Fail

Tweet ID: 710997087345876993 (Count: 1252)
 - Fail

Tweet ID: 710844581445812225 (Count: 1253)
 - Fail

...

 - Fail

Tweet ID: 700062718104104960 (Count: 1393)
 - Fail

Tweet ID: 700029284593901568 (Count: 1394)
 - Fail

Tweet ID: 700002074055016451 (Count: 1395)
 - Fail

Tweet ID: 699801817392291840 (Count: 1396)
 - Fail

Tweet ID: 699788877217865730 (Count: 1397)
 - Fail

Tweet ID: 699779630832685056 (Count: 1398)
 - Fail

Tweet ID: 699775878809702401 (Count: 1399)
 - Fail

Tweet ID: 699691744225525762 (Count: 1400)
 - Fail

Tweet ID: 699446877801091073 (Count: 1401)
 - Fail

Tweet ID: 699434518667751424 (Count: 1402)
 - Fail

Tweet ID: 699423671849451520 (Count: 1403)
 - Fail

Tweet ID: 699413908797464576 (Count: 1404)
 - Fail

Tweet ID: 699370870310113280 (Count: 1405)
 - Fail

Tweet ID: 699323444782047232 (Count: 1406)
 - Fail

Tweet ID: 699088579889332224 (Count: 1407)
 - Fail

Tweet ID: 699079609774645248 (Count: 1408)
 - Fail

Tweet ID: 699072405256409088 (Count: 1409)
 - Fail

Tweet ID: 699060279947165696 (Count: 1410)
 - Fail

Tweet ID: 699036661657767936 (Count: 1411)
 - Fail

...

 - Success

Tweet ID: 689289219123089408 (Count: 1546)
 - Success

Tweet ID: 689283819090870273 (Count: 1547)
 - Success

Tweet ID: 689280876073582592 (Count: 1548)
 - Success

Tweet ID: 689275259254616065 (Count: 1549)
 - Success

Tweet ID: 689255633275777024 (Count: 1550)
 - Success

Tweet ID: 689154315265683456 (Count: 1551)
 - Success

Tweet ID: 689143371370250240 (Count: 1552)
 - Success

Tweet ID: 688916208532455424 (Count: 1553)
 - Success

Tweet ID: 688908934925697024 (Count: 1554)
 - Success

Tweet ID: 688898160958271489 (Count: 1555)
 - Success

Tweet ID: 688894073864884227 (Count: 1556)
 - Success

Tweet ID: 688828561667567616 (Count: 1557)
 - Success

Tweet ID: 688804835492233216 (Count: 1558)
 - Success

Tweet ID: 688789766343622656 (Count: 1559)
 - Success

Tweet ID: 688547210804498433 (Count: 1560)
 - Success

Tweet ID: 688519176466644993 (Count: 1561)
 - Success

Tweet ID: 688385280030670848 (Count: 1562)
 - Success

Tweet ID: 688211956440801280 (Count: 1563)
 - Success

...

 - Success

Tweet ID: 681281657291280384 (Count: 1695)
 - Success

Tweet ID: 681261549936340994 (Count: 1696)
 - Success

Tweet ID: 681242418453299201 (Count: 1697)
 - Success

Tweet ID: 681231109724700672 (Count: 1698)
 - Success

Tweet ID: 681193455364796417 (Count: 1699)
 - Success

Tweet ID: 680970795137544192 (Count: 1700)
 - Success

Tweet ID: 680959110691590145 (Count: 1701)
 - Success

Tweet ID: 680940246314430465 (Count: 1702)
 - Success

Tweet ID: 680934982542561280 (Count: 1703)
 - Success

Tweet ID: 680913438424612864 (Count: 1704)
 - Success

Tweet ID: 680889648562991104 (Count: 1705)
 - Success

Tweet ID: 680836378243002368 (Count: 1706)
 - Success

Tweet ID: 680805554198020098 (Count: 1707)
 - Success

Tweet ID: 680801747103793152 (Count: 1708)
 - Success

Tweet ID: 680798457301471234 (Count: 1709)
 - Success

Tweet ID: 680609293079592961 (Count: 1710)
 - Success

Tweet ID: 680583894916304897 (Count: 1711)
 - Success

Tweet ID: 680497766108381184 (Count: 1712)
 - Success

...

 - Success

Tweet ID: 675849018447167488 (Count: 1845)
 - Success

Tweet ID: 675845657354215424 (Count: 1846)
 - Success

Tweet ID: 675822767435051008 (Count: 1847)
 - Success

Tweet ID: 675820929667219457 (Count: 1848)
 - Success

Tweet ID: 675798442703122432 (Count: 1849)
 - Success

Tweet ID: 675781562965868544 (Count: 1850)
 - Success

Tweet ID: 675740360753160193 (Count: 1851)
 - Success

Tweet ID: 675710890956750848 (Count: 1852)
 - Success

Tweet ID: 675707330206547968 (Count: 1853)
 - Success

Tweet ID: 675706639471788032 (Count: 1854)
 - Success

Tweet ID: 675534494439489536 (Count: 1855)
 - Success

Tweet ID: 675531475945709568 (Count: 1856)
 - Success

Tweet ID: 675522403582218240 (Count: 1857)
 - Success

Tweet ID: 675517828909424640 (Count: 1858)
 - Success

Tweet ID: 675501075957489664 (Count: 1859)
 - Success

Tweet ID: 675497103322386432 (Count: 1860)
 - Success

Tweet ID: 675489971617296384 (Count: 1861)
 - Success

Tweet ID: 675483430902214656 (Count: 1862)
 - Success

...

 - Success

Tweet ID: 672609152938721280 (Count: 1994)
 - Success

Tweet ID: 672604026190569472 (Count: 1995)
 - Success

Tweet ID: 672594978741354496 (Count: 1996)
 - Success

Tweet ID: 672591762242805761 (Count: 1997)
 - Success

Tweet ID: 672591271085670400 (Count: 1998)
 - Success

Tweet ID: 672538107540070400 (Count: 1999)
 - Success

Tweet ID: 672523490734551040 (Count: 2000)
 - Success

Tweet ID: 672488522314567680 (Count: 2001)
 - Success

Tweet ID: 672482722825261057 (Count: 2002)
 - Success

Tweet ID: 672481316919734272 (Count: 2003)
 - Success

Tweet ID: 672475084225949696 (Count: 2004)
 - Success

Tweet ID: 672466075045466113 (Count: 2005)
 - Success

Tweet ID: 672272411274932228 (Count: 2006)
 - Success

Tweet ID: 672267570918129665 (Count: 2007)
 - Success

Tweet ID: 672264251789176834 (Count: 2008)
 - Success

Tweet ID: 672256522047614977 (Count: 2009)
 - Success

Tweet ID: 672254177670729728 (Count: 2010)
 - Success

Tweet ID: 672248013293752320 (Count: 2011)
 - Success

...

 - Success

Tweet ID: 669972011175813120 (Count: 2143)
 - Success

Tweet ID: 669970042633789440 (Count: 2144)
 - Success

Tweet ID: 669942763794931712 (Count: 2145)
 - Success

Tweet ID: 669926384437997569 (Count: 2146)
 - Success

Tweet ID: 669923323644657664 (Count: 2147)
 - Success

Tweet ID: 669753178989142016 (Count: 2148)
 - Success

Tweet ID: 669749430875258880 (Count: 2149)
 - Success

Tweet ID: 669684865554620416 (Count: 2150)
 - Success

Tweet ID: 669683899023405056 (Count: 2151)
 - Success

Tweet ID: 669682095984410625 (Count: 2152)
 - Success

Tweet ID: 669680153564442624 (Count: 2153)
 - Success

Tweet ID: 669661792646373376 (Count: 2154)
 - Success

Tweet ID: 669625907762618368 (Count: 2155)
 - Success

Tweet ID: 669603084620980224 (Count: 2156)
 - Success

Tweet ID: 669597912108789760 (Count: 2157)
 - Success

Tweet ID: 669583744538451968 (Count: 2158)
 - Success

Tweet ID: 669573570759163904 (Count: 2159)
 - Success

Tweet ID: 669571471778410496 (Count: 2160)
 - Success

...

 - Success

Tweet ID: 667165590075940865 (Count: 2292)
 - Success

Tweet ID: 667160273090932737 (Count: 2293)
 - Success

Tweet ID: 667152164079423490 (Count: 2294)
 - Success

Tweet ID: 667138269671505920 (Count: 2295)
 - Success

Tweet ID: 667119796878725120 (Count: 2296)
 - Success

Tweet ID: 667090893657276420 (Count: 2297)
 - Success

Tweet ID: 667073648344346624 (Count: 2298)
 - Success

Tweet ID: 667070482143944705 (Count: 2299)
 - Success

Tweet ID: 667065535570550784 (Count: 2300)
 - Success

Tweet ID: 667062181243039745 (Count: 2301)
 - Success

Tweet ID: 667044094246576128 (Count: 2302)
 - Success

Tweet ID: 667012601033924608 (Count: 2303)
 - Success

Tweet ID: 666996132027977728 (Count: 2304)
 - Success

Tweet ID: 666983947667116034 (Count: 2305)
 - Success

Tweet ID: 666837028449972224 (Count: 2306)
 - Success

Tweet ID: 666835007768551424 (Count: 2307)
 - Success

Tweet ID: 666826780179869698 (Count: 2308)
 - Success

Tweet ID: 666817836334096384 (Count: 2309)
 - Success

In [9]:
# list-object containing counts of re-tweets and favorites for each tweet
json_list = []
json_attributes = ['id', 'retweet_count','favorite_count']

# append each JSON data as dict-object to the list
with open('data/tweet_json.txt') as file:
    for line in file:
        json_data = json.loads(line) # json.loads tolerates the trailing \n
        # keep only the attributes of interest
        json_list.append({attribute: json_data[attribute] for attribute in json_attributes})

# create dataframe from the list of dict-objects containing JSON data
df_json = pd.DataFrame(data = json_list, columns = json_attributes)
df_json.head(3)

Unnamed: 0,id,retweet_count,favorite_count
0,892420643555336193,8227,37760
1,892177421306343426,6079,32441
2,891815181378084864,4023,24432


<a id='assess'></a>
## Assessing Data

In [11]:
# copy of twitter_archive_enhanced.csv
# df_archive = pd.read_csv('data/twitter_archive_enhanced.csv')
df_archive_clean = df_archive.copy()

# copy of image-predictions.tsv 
# df_image = pd.read_csv('data/image-predictions.tsv', sep = '\t')
df_image_clean = df_image.copy()

# copy of JSON data from tweet_json.txt
df_json_clean = df_json.copy()

### 1. Enhanced Twitter Archive

1. The `tweet_id` column is stored as an integer instead of a string (object).
2. The `in_reply_to_status_id` and `in_reply_to_user_id` columns are stored as floats instead of strings (object).
3. The `timestamp` column is stored as an object instead of a datetime.
4. The four distinct values in the `source` column wrap the source of the tweet in HTML tags and attributes instead of giving the source name alone.
5. 181 tweets are re-tweets of existing tweets.
6. The `expanded_urls` column carries no meaningful information beyond the tweet id, which is already listed in the `tweet_id` column.
7. 23 dog ratings extracted from the tweets are suspect because their `rating_denominator` value is not 10.
8. 109 entries in the `name` column are not names: they begin with a lowercase letter and are ordinary words such as "such", "a", and "quite".
9. The four dog "stages" (doggo, floofer, pupper, and puppo) appear as separate column headers, although they are values of a single variable rather than variables themselves.
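
Item 9 is a tidiness issue, and a toy frame may make it more concrete. The sketch below is a minimal illustration, not the notebook's cleaning code. It assumes a missing stage is encoded as the string `'None'` (an assumption; the archive output only shows that these columns contain no nulls), and both the helper `collapse_stages` and the `stage` column name are hypothetical.

```python
import numpy as np
import pandas as pd

STAGE_COLS = ['doggo', 'floofer', 'pupper', 'puppo']

# Hypothetical three-row miniature of the archive (IDs are made up).
df_toy = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'doggo':   ['None', 'doggo', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None'],
})

def collapse_stages(frame, placeholder='None'):
    """Return a copy with the four stage columns merged into one 'stage' column."""
    out = frame.copy()
    # Keep a value only where it differs from the placeholder; the rest becomes NaN.
    stages = out[STAGE_COLS].where(out[STAGE_COLS] != placeholder)

    def join_stages(row):
        # Join whatever stage names remain; rows with no stage stay missing.
        found = [value for value in row if pd.notna(value)]
        return ','.join(found) if found else np.nan

    out['stage'] = stages.apply(join_stages, axis=1)
    return out.drop(columns=STAGE_COLS)

df_tidy = collapse_stages(df_toy)
print(df_tidy)
```

Joining with a comma keeps rows that mention more than one stage, should any exist, instead of silently picking one.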

In [12]:
df_archive_clean.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1884,674800520222154752,,,2015-12-10 03:59:15 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tedders. He broke his leg saving babie...,,,,https://twitter.com/dog_rates/status/674800520...,11,10,Tedders,,,,
2128,670303360680108032,,,2015-11-27 18:09:09 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a Speckled Cauliflower Yosemite named ...,,,,https://twitter.com/dog_rates/status/670303360...,9,10,a,,,,
704,785872687017132033,,,2016-10-11 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Rusty. He appears to be rather h*ckin flu...,,,,https://twitter.com/dog_rates/status/785872687...,12,10,Rusty,,,,
1320,706346369204748288,,,2016-03-06 05:11:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Koda. She's a Beneboom Cumberwiggle. 1...,,,,https://twitter.com/dog_rates/status/706346369...,12,10,Koda,,,,
2025,671882082306625538,,,2015-12-02 02:42:26 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Batdog. He's sleeping now but when he ...,,,,https://twitter.com/dog_rates/status/671882082...,11,10,Batdog,,,,


In [13]:
df_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
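
The `int64`, `float64`, and `object` dtypes reported for the ID and timestamp columns correspond to items 1 to 3 of the assessment. One possible remedy, sketched below on a hypothetical two-row CSV, is to read the ID columns as strings at load time and parse the timestamps afterwards. Reading IDs as strings also sidesteps rounding, since 18-digit IDs cannot be represented exactly as 64-bit floats. The reply ID in the sample is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical excerpt with the same column layout as the archive;
# the in_reply_to_status_id value is invented.
csv_text = (
    'tweet_id,in_reply_to_status_id,timestamp\n'
    '892420643555336193,,2017-08-01 16:23:56 +0000\n'
    '892177421306343426,123456789012345678,2017-08-01 00:17:27 +0000\n'
)

id_cols = ['tweet_id', 'in_reply_to_status_id']

# Read the ID columns as strings so no digits are lost to float rounding.
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={col: str for col in id_cols})

# Parse the timestamp strings (which carry a +0000 offset) into datetimes.
df_typed['timestamp'] = pd.to_datetime(df_typed['timestamp'])

print(df_typed.dtypes)
```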

In [65]:
for i, source in enumerate(df_archive_clean.source.unique()):
    print(i+1, source)

1 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
2 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
3 <a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>
4 <a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>
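
Since every `source` value is an HTML anchor, the readable label can be pulled out with a regular expression. This is a minimal sketch using two of the values printed above; the variable names are illustrative, and the pattern assumes each value contains exactly one `<a>...</a>` element.

```python
import pandas as pd

sources = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
])

# Capture the text between the opening tag's '>' and the closing '</a>'.
source_labels = sources.str.extract(r'>([^<]+)</a>', expand=False)

print(source_labels.tolist())
```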


In [68]:
df_archive_clean.retweeted_status_id.notna().sum()

181

In [70]:
df_archive_clean.retweeted_status_user_id.notna().sum()

181

In [69]:
df_archive_clean.retweeted_status_timestamp.notna().sum()

181

In [100]:
(df_archive_clean.rating_denominator != 10).sum()

23

In [89]:
df_archive_clean.name.str.extract(pat = '(^[a-z])').dropna().shape[0]

109

In [90]:
error_name = [name for name in df_archive_clean.name.unique() if name.lower() == name]
df_archive_clean.query('name in @error_name').shape[0]

109

In [92]:
for name in error_name:
    print(name)

such
a
quite
not
one
incredibly
mad
an
very
just
my
his
actually
getting
this
unacceptable
all
old
infuriating
the
by
officially
life
light
space


### 2. Image Predictions

### 3. Additional Tweet Data

<a id='clean'></a>
## Cleaning Data

<a id='conclusion'></a>
## Conclusion