# Project Motivation

The goal is to wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is a good starting point, but it contains only basic tweet information. Additional gathering, followed by assessing and cleaning, is required for "Wow!"-worthy analyses and visualizations.

In [178]:
import pandas as pd
import numpy as np
import requests
import tweepy
import json
import time

## Gathering Data for the project

The twitter-archive-enhanced.csv file was provided. Here I am loading the file into a pandas dataframe.

In [179]:
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

In [180]:
twitter_archive.sample()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1593,686386521809772549,,,2016-01-11 03:17:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Say hello to Crimson. He's a Speckled Winnebag...,,,,https://twitter.com/dog_rates/status/686386521...,11,10,Crimson,,,,


The second dataset is the tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

In [181]:
prediction_file_url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

In [182]:
response = requests.get(prediction_file_url)
response.raise_for_status()  # stop here if the download failed
with open('prediction_file.tsv', 'wb') as file:
    file.write(response.content)

In [183]:
image_predictions = pd.read_csv('prediction_file.tsv', sep='\t')

In [184]:
image_predictions.sample()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1571,794983741416415232,https://pbs.twimg.com/media/CvT6IV6WEAQhhV5.jpg,3,schipperke,0.363272,True,kelpie,0.197021,True,Norwegian_elkhound,0.151024,True


The third dataset holds each tweet's retweet count and favorite ("like") count at minimum, plus any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count. Note: do not include your Twitter API keys, secrets, and tokens in your project submission.

In [73]:
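import tweepy
# Credentials are kept out of the notebook; settings is expected to define
# consumer_key, consumer_secret, access_token and access_secret
from settings import *
import json

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

# Let Tweepy sleep automatically whenever the rate limit is reached
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)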

Next, I pull the JSON data for every tweet ID in the archive.

In [108]:
twitter_archive.sample()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
693,786963064373534720,,,2016-10-14 16:13:10 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Rory. He's got an interview in a few m...,,,,https://twitter.com/dog_rates/status/786963064...,12,10,Rory,,,,


In [109]:
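start = time.time()
exceptions = []         # IDs of tweets that could not be fetched
file_exception = False  # set to True if writing a tweet to disk fails
for tweet_id in twitter_archive.tweet_id:
    print(tweet_id)
    try:
        # tweet_mode='extended' requests the untruncated tweet text
        tweet = api.get_status(tweet_id, tweet_mode='extended')
        with open('tweet_json.txt', 'a') as outfile:
            try:
                # Write each tweet's JSON on its own line
                json.dump(tweet._json, outfile)
                outfile.write('\n')
            except Exception:
                file_exception = True
                print('file exception')
    except Exception:
        exceptions.append(tweet_id)
        print('tweet exception')

    # Report cumulative elapsed seconds after every tweet
    end = time.time()
    print(end - start)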

892420643555336193
0.31557202339172363
892177421306343426
0.5924980640411377
891815181378084864
0.8684940338134766
891689557279858688
1.1483218669891357
891327558926688256
1.4640731811523438
891087950875897856
1.7382149696350098
890971913173991426
2.046069860458374
890729181411237888
2.2992758750915527
890609185150312448
2.5914881229400635
890240255349198849
2.8725500106811523
890006608113172480
3.157594919204712
889880896479866881
3.487740993499756
889665388333682689
3.7787909507751465
889638837579907072
4.011507987976074
889531135344209921
4.29113507270813
889278841981685760
4.592304944992065
888917238123831296
4.886746883392334
888804989199671297
5.189771890640259
888554962724278272
5.465207815170288
888202515573088257
tweet exception
5.774913787841797
888078434458587136
6.054730176925659
887705289381826560
6.384222030639648
887517139158093824
6.685495853424072
887473957103951883
6.988052129745483
887343217045368832
7.2992119789123535
887101392804085760
7.584299802780151
88698323352

62.20159697532654
850753642995093505
62.4446439743042
850380195714523136
62.74682402610779
850333567704068097
63.030510902404785
850145622816686080
63.34998893737793
850019790995546112
63.53744697570801
849776966551130114
63.85103917121887
849668094696017920
64.11123275756836
849412302885593088
64.4143009185791
849336543269576704
64.69470500946045
849051919805034497
64.96575999259949
848690551926992896
65.27987289428711
848324959059550208
65.55103302001953
848213670039564288
65.75364208221436
848212111729840128
65.99687695503235
847978865427394560
66.21858501434326
847971574464610304
66.50185298919678
847962785489326080
66.78799390792847
847842811428974592
67.08604001998901
847617282490613760
67.3666729927063
847606175596138505
67.62328696250916
847251039262605312
67.84549689292908
847157206088847362
68.13250494003296
847116187444137987
68.4376118183136
846874817362120707
68.74807691574097
846514051647705089
69.02957201004028
846505985330044928
69.54095816612244
846153765933735936
69.8

124.28382706642151
820690176645140481
124.59144616127014
820494788566847489
124.92288899421692
820446719150292993
125.24314403533936
820314633777061888
125.55517196655273
820078625395449857
125.81845498085022
820013781606658049
126.10305905342102
819952236453363712
126.35127091407776
819924195358416896
126.64446306228638
819711362133872643
126.9298038482666
819588359383371776
127.21690392494202
819347104292290561
127.53534507751465
819238181065359361
127.74358010292053
819227688460238848
128.02051997184753
819015337530290176
128.32891821861267
819015331746349057
128.6124382019043
819006400881917954
128.87598299980164
819004803107983360
129.13848686218262
818646164899774465
129.57409000396729
818627210458333184
129.893394947052
818614493328580609
130.20068788528442
818588835076603904
130.53543996810913
818536468981415936
130.7474389076233
818307523543449600
131.32066702842712
818259473185828864
131.6308901309967
818145370475810820
131.88197493553162
817908911860748288
132.13879108428955

188.054692029953
792883833364439040
188.37086296081543
792773781206999040
188.6968491077423
792394556390137856
188.9511570930481
792050063153438720
189.27162718772888
791821351946420224
189.49312710762024
791784077045166082
189.75280690193176
791780927877898241
190.0315659046173
791774931465953280
190.36291217803955
791672322847637504
190.6537230014801
791406955684368384
190.9396481513977
791312159183634433
191.2185618877411
791026214425268224
191.5279049873352
790987426131050500
191.7601659297943
790946055508652032
192.04037380218506
790723298204217344
192.52666306495667
790698755171364864
192.84184575080872
790581949425475584
193.07575798034668
790337589677002753
193.32762002944946
790277117346975746
193.5762050151825
790227638568808452
193.8639199733734
789986466051088384
194.11373496055603
789960241177853952
194.39247393608093
789903600034189313
194.67873692512512
789628658055020548
194.93086099624634
789599242079838210
195.2257959842682
789530877013393408
195.48403787612915
789314

247.69511008262634
762035686371364864
247.95620799064636
761976711479193600
248.20180320739746
761750502866649088
248.46015095710754
761745352076779520
248.8524408340454
761672994376806400
249.1541211605072
761599872357261312
249.43290901184082
761371037149827077
249.75371718406677
761334018830917632
250.02133107185364
761292947749015552
250.27640104293823
761227390836215808
250.5575089454651
761004547850530816
250.85080790519714
760893934457552897
251.04380202293396
760656994973933572
251.33244800567627
760641137271070720
251.64177107810974
760539183865880579
251.95176696777344
760521673607086080
252.22077298164368
760290219849637889
252.47635221481323
760252756032651264
252.74562907218933
760190180481531904
253.0263431072235
760153949710192640
253.32738399505615
759943073749200896
253.5428659915924
759923798737051648
253.85739302635193
759846353224826880
254.15329003334045
759793422261743616
254.44201612472534
759566828574212096
tweet exception
254.67829298973083
759557299618865152
2

Rate limit reached. Sleeping for: 643


905.4044950008392
758474966123810816
905.6574170589447
758467244762497024
905.8892800807953
758405701903519748
906.1672620773315
758355060040593408
906.4740130901337
758099635764359168
906.7326710224152
758041019896193024
907.0515470504761
757741869644341248
907.3263959884644
757729163776290825
907.6234540939331
757725642876129280
907.9422309398651
757611664640446465
908.2256557941437
757597904299253760
908.510223865509
757596066325864448
908.8112411499023
757400162377592832
909.0775971412659
757393109802180609
909.3674550056458
757354760399941633
909.6373851299286
756998049151549440
909.9521241188049
756939218950160384
910.224898815155
756651752796094464
910.5558240413666
756526248105566208
910.8902180194855
756303284449767430
911.1978509426117
756288534030475264
911.5175929069519
756275833623502848
911.7782139778137
755955933503782912
912.0815460681915
755206590534418437
912.3937590122223
755110668769038337
912.7065298557281
754874841593970688
913.0109169483185
754856583969079297
913

971.7758989334106
730427201120833536
972.0827419757843
730211855403241472
972.3747429847717
730196704625098752
972.6754159927368
729854734790754305
972.9723291397095
729838605770891264
973.2777299880981
729823566028484608
973.5452959537506
729463711119904772
973.874694108963
729113531270991872
974.1571290493011
728986383096946689
974.4497530460358
728760639972315136
974.7595250606537
728751179681943552
975.0552868843079
728653952833728512
975.3131458759308
728409960103686147
975.6072039604187
728387165835677696
975.9941120147705
728046963732717569
976.2806088924408
728035342121635841
976.5976119041443
728015554473250816
976.8595979213715
727685679342333952
977.1922190189362
727644517743104000
977.4861788749695
727524757080539137
977.8092761039734
727314416056803329
978.1196839809418
727286334147182592
978.3979358673096
727175381690781696
978.7125089168549
727155742655025152
979.0158548355103
726935089318363137
979.3056311607361
726887082820554753
979.5881681442261
726828223124897792
97

1038.0569188594818
704819833553219584
1038.3694829940796
704761120771465216
1038.670156955719
704499785726889984
1038.9866189956665
704491224099647488
1039.2786178588867
704480331685040129
1039.5830428600311
704364645503647744
1040.2344479560852
704347321748819968
1040.5139830112457
704134088924532736
1041.0155329704285
704113298707505153
1041.6469810009003
704054845121142784
1042.4503378868103
703774238772166656
1043.122724056244
703769065844768768
1043.4134821891785
703631701117943808
1043.7421278953552
703611486317502464
1044.053633928299
703425003149250560
1044.3970201015472
703407252292673536
1044.7055168151855
703382836347330562
1044.985962152481
703356393781329922
1045.2923102378845
703268521220972544
1045.5850939750671
703079050210877440
1045.8872349262238
703041949650034688
1046.1618430614471
702932127499816960
1046.4607818126678
702899151802126337
1046.7579500675201
702684942141153280
1047.074219942093
702671118226825216
1047.3811748027802
702598099714314240
1047.701173782348

1105.8393218517303
688519176466644993
1106.1583518981934
688385280030670848
1106.4497609138489
688211956440801280
1106.7769479751587
688179443353796608
1107.022922039032
688116655151435777
1107.3457140922546
688064179421470721
1107.642930984497
687841446767013888
1107.9483959674835
687826841265172480
1108.2312479019165
687818504314159109
1108.5247230529785
687807801670897665
1108.8040328025818
687732144991551489
1109.0887689590454
687704180304273409
1109.397047996521
687664829264453632
1109.7086369991302
687494652870668288
1110.0057559013367
687480748861947905
1110.3122780323029
687476254459715584
1110.6449251174927
687460506001633280
1110.9607350826263
687399393394311168
1111.2597260475159
687317306314240000
1111.6421768665314
687312378585812992
1111.902675151825
687127927494963200
1112.2210938930511
687124485711986689
1112.5203969478607
687109925361856513
1112.8062000274658
687102708889812993
1113.0726799964905
687096057537363968
1113.3885409832
686947101016735744
1113.652755022049
6

1171.4534230232239
677918531514703872
1171.7546010017395
677895101218201600
1172.0277981758118
677716515794329600
1172.3195941448212
677700003327029250
1172.6287581920624
677698403548192770
1172.9124219417572
677687604918272002
1173.2269370555878
677673981332312066
1173.5169608592987
677662372920729601
1173.875
677644091929329666
1174.181783914566
677573743309385728
1174.502431154251
677565715327688705
1174.832102060318
677557565589463040
1175.1368570327759
677547928504967168
1175.4873712062836
677530072887205888
1175.7707319259644
677335745548390400
1176.109200000763
677334615166730240
1176.3859639167786
677331501395156992
1176.6678538322449
677328882937298944
1177.0102071762085
677314812125323265
1177.285987854004
677301033169788928
1177.5905177593231
677269281705472000
1177.9055178165436
677228873407442944
1178.1960680484772
677187300187611136
1178.479169845581
676975532580409345


Rate limit reached. Sleeping for: 626


1809.8575599193573
676957860086095872
1810.1458070278168
676949632774234114
1810.429016828537
676948236477857792
1810.739953994751
676946864479084545
1811.0575618743896
676942428000112642
1811.3700051307678
676936541936185344
1811.6881530284882
676916996760600576
1811.9510869979858
676897532954456065
1812.2736220359802
676864501615042560
1812.5951659679413
676821958043033607
1812.9357080459595
676819651066732545
1813.2330529689789
676811746707918848
1813.5075538158417
676776431406465024
1813.815822839737
676617503762681856
1814.139014005661
676613908052996102
1814.4424159526825
676606785097199616
1814.7805750370026
676603393314578432
1815.1082980632782
676593408224403456
1815.4073779582977
676590572941893632
1815.7301859855652
676588346097852417
1816.0191440582275
676582956622721024
1816.3092079162598
676575501977128964
1816.609838962555
676533798876651520
1816.9382491111755
676496375194980353
1817.2105729579926
676470639084101634
1817.544489145279
676440007570247681
1817.85795211792
6

1877.2282929420471
672125275208069120
1877.5313210487366
672095186491711488
1877.8182151317596
672082170312290304
1878.1243410110474
672068090318987265
1878.4427428245544
671896809300709376
1878.756488084793
671891728106971137
1879.0321381092072
671882082306625538
1879.3507778644562
671879137494245376
1879.6303651332855
671874878652489728
1879.9418940544128
671866342182637568
1880.2124819755554
671855973984772097
1880.543699979782
671789708968640512
1880.864017009735
671768281401958400
1881.1672809123993
671763349865160704
1881.4770247936249
671744970634719232
1881.80925989151
671743150407421952
1882.1337509155273
671735591348891648
1882.4394011497498
671729906628341761
1882.7430350780487
671561002136281088
1883.0117230415344
671550332464455680
1883.2953419685364
671547767500775424
1883.582927942276
671544874165002241
1883.8993701934814
671542985629241344
1884.1840431690216
671538301157904385
1884.491056919098
671536543010570240
1884.790412902832
671533943490011136
1885.0953829288483
6

1946.0699400901794
668142349051129856
1946.404000043869
668113020489474048
1946.7225461006165
667937095915278337
1947.0116670131683
667924896115245057
1947.2641899585724
667915453470232577
1947.569787979126
667911425562669056
1947.8654029369354
667902449697558528
1948.1717600822449
667886921285246976
1948.475093126297
667885044254572545
1948.7552018165588
667878741721415682
1949.0681970119476
667873844930215936
1949.3658549785614
667866724293877760
1949.6520750522614
667861340749471744
1949.9795291423798
667832474953625600
1950.2927269935608
667806454573760512
1950.5717070102692
667801013445750784
1950.903729915619
667793409583771648
1951.1909120082855
667782464991965184
1951.5149068832397
667773195014021121
1951.7956540584564
667766675769573376
1952.0588459968567
667728196545200128
1952.3946719169617
667724302356258817
1952.6828019618988
667550904950915073
1952.9712448120117
667550882905632768
1953.2601850032806
667549055577362432
1953.5729129314423
667546741521195010
1953.88567686080

In [110]:
exceptions

[888202515573088257,
 873697596434513921,
 872668790621863937,
 872261713294495745,
 869988702071779329,
 866816280283807744,
 861769973181624320,
 856602993587888130,
 851953902622658560,
 845459076796616705,
 844704788403113984,
 842892208864923648,
 837366284874571778,
 837012587749474308,
 829374341691346946,
 827228250799742977,
 812747805718642688,
 802247111496568832,
 779123168116150273,
 775096608509886464,
 771004394259247104,
 770743923962707968,
 759566828574212096,
 754011816964026368,
 680055455951884288]
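
As a cross-check on the list above, the IDs that never made it into tweet_json.txt can be recomputed from the file itself. This is only a minimal sketch, assuming one JSON object per line with an `id` field; the helper name `missing_ids` and the throwaway demo file are illustrative and not part of the original notebook.

```python
import json
import os
import tempfile


def missing_ids(requested_ids, json_path):
    """Return the requested tweet IDs that have no line in the JSON file (hypothetical helper)."""
    fetched = set()
    with open(json_path) as infile:
        for line in infile:
            line = line.strip()
            if line:  # skip blank lines
                fetched.add(json.loads(line)['id'])
    # Preserve the order of the requested IDs
    return [tweet_id for tweet_id in requested_ids if tweet_id not in fetched]


# Small demonstration with a temporary file instead of the real tweet_json.txt
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_tweet_json.txt')
    with open(demo_path, 'w') as outfile:
        outfile.write(json.dumps({'id': 892420643555336193}) + '\n')
    demo_missing = missing_ids([892420643555336193, 888202515573088257], demo_path)

print(demo_missing)
```

Against the real data, the call would presumably be `missing_ids(twitter_archive.tweet_id, 'tweet_json.txt')`.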

In [185]:
df = pd.read_json('tweet_json.txt', lines=True)
df.sample(5)

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,quoted_status
1143,2016-04-17 00:58:53+00:00,721503162398597120,721503162398597120,This is Panda. He's happy af. 11/10 https://t....,False,"[0, 35]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 721503152168681472, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
2006,2015-12-01 19:10:13+00:00,671768281401958400,671768281401958400,When you try to recreate the scene from Lady &...,False,"[0, 144]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 671768277677441024, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
272,2017-03-07 01:17:48+00:00,838921590096166913,838921590096166912,This is Arlo. He's officially the king of snow...,False,"[0, 112]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 838921573767618560, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
2033,2015-11-30 15:18:34+00:00,671347597085433856,671347597085433856,This is Lola. She was not fully prepared for t...,False,"[0, 90]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 671347593046306816, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
638,2016-10-27 23:17:38+00:00,791780927877898241,791780927877898240,RT @dog_rates: This is Maddie. She gets some w...,False,"[0, 119]","{'hashtags': [], 'symbols': [], 'user_mentions...",,"<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,{'created_at': 'Sat Jun 25 17:31:25 +0000 2016...,,,,


In [186]:
tweet_details_df = df.copy()

In [187]:
tweet_details_df = tweet_details_df[['id', 'retweet_count', 'favorite_count']]

In [188]:
tweet_details_df

Unnamed: 0,id,retweet_count,favorite_count
0,892420643555336193,7718,36250
1,892177421306343426,5704,31259
2,891815181378084864,3781,23535
3,891689557279858688,7870,39527
4,891327558926688256,8489,37744
...,...,...,...
2326,666049248165822465,39,96
2327,666044226329800704,132,272
2328,666033412701032449,41,112
2329,666029285002620928,42,121
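
The cells above rely on `pd.read_json(..., lines=True)`. For comparison, the line-by-line reading described in the task could look roughly like the sketch below. The helper name `read_tweet_counts` and the temporary demo file are assumptions made for illustration; the two demo records reuse values from the table above.

```python
import json
import os
import tempfile

import pandas as pd


def read_tweet_counts(json_path):
    """Read a file with one tweet JSON object per line into a DataFrame (hypothetical helper)."""
    rows = []
    with open(json_path) as infile:
        for line in infile:
            if not line.strip():
                continue  # ignore blank lines
            tweet = json.loads(line)
            rows.append({
                'tweet_id': tweet['id'],
                'retweet_count': tweet['retweet_count'],
                'favorite_count': tweet['favorite_count'],
            })
    return pd.DataFrame(rows, columns=['tweet_id', 'retweet_count', 'favorite_count'])


# Demonstration on a temporary file holding two records
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_tweet_json.txt')
    with open(demo_path, 'w') as outfile:
        for record in [
            {'id': 892420643555336193, 'retweet_count': 7718, 'favorite_count': 36250},
            {'id': 892177421306343426, 'retweet_count': 5704, 'favorite_count': 31259},
        ]:
            outfile.write(json.dumps(record) + '\n')
    demo_counts = read_tweet_counts(demo_path)

print(demo_counts)
```

Parsing each line with `json.loads` keeps the IDs as exact Python integers and only materializes the three columns that are needed.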


## Assessing data for the project

#### Data quality Dimensions
- Completeness
- Validity
- Accuracy
- Consistency

#### Tidiness

Tidiness issues are issues with structure that prevent easy analysis. Untidy data is also known as messy data. Tidy data requirements:

- Each variable forms a column.

- Each observation forms a row.

- Each type of observational unit forms a table.

<font color='green' size='4'>
twitter-archive-enhanced dataset

##### Quality issues

- About 2000 records are not classified as doggo, floofer, pupper, or puppo. In other words, all four columns are set to None for about 2000 records.
- There are 59 records without an associated image.
- There are 181 retweets, which from this dataset's perspective are duplicate records.
- About 800 records have an inaccurate name: 745 are set to None and 55 are set to 'a'.
- One record has both doggo and floofer. It should only be floofer.
- Some records have a denominator other than 10.
- Some records have a numerator far from the median.
- One numerator should be 11 instead of 27. Ideally it should be 11.27, but for this exercise I will leave it as 11 because most of the numerators are integers.

##### Tidiness

- The columns doggo, floofer, pupper and puppo should be one column because it is one variable.
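
To make the tidiness issue concrete, the four stage columns could be collapsed into a single column along the lines of the sketch below. It assumes that missing stages are stored as the string 'None', as the queries in this notebook suggest. The helper `combine_stages`, the column name `dog_stage`, and the demo rows are hypothetical.

```python
import pandas as pd

STAGE_COLUMNS = ['doggo', 'floofer', 'pupper', 'puppo']


def combine_stages(df):
    """Collapse the four stage columns into one dog_stage column (hypothetical helper)."""
    def join_row(row):
        # Keep only the stages that are actually set for this tweet
        stages = [value for value in row if value != 'None']
        return ','.join(stages) if stages else None

    result = df.drop(columns=STAGE_COLUMNS)
    result['dog_stage'] = df[STAGE_COLUMNS].apply(join_row, axis=1)
    return result


# Made-up rows: one stage, no stage, and two stages at once
demo = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo': ['doggo', 'None', 'doggo'],
    'floofer': ['None', 'None', 'floofer'],
    'pupper': ['None', 'None', 'None'],
    'puppo': ['None', 'None', 'None'],
})
demo_tidy = combine_stages(demo)
print(demo_tidy)
```

Rows with more than one stage are joined with a comma so that conflicting records remain visible for manual review.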

#### Getting acquainted with the dataset

In [129]:
twitter_archive.shape

(2356, 17)

In [127]:
twitter_archive.sample()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1621,684926975086034944,,,2016-01-07 02:38:10 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Bruiser &amp; Charlie. They are the best ...,,,,https://twitter.com/dog_rates/status/684926975...,11,10,Bruiser,,,,


In [128]:
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [204]:
twitter_archive.in_reply_to_status_id.value_counts().count()

77

#### Quality Issue. There are 59 records without an associated image.

In [208]:
twitter_archive.expanded_urls.isnull().sum()

59

In [209]:
twitter_archive[twitter_archive.expanded_urls.isnull()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
30,886267009285017600,8.862664e+17,2281182000.0,2017-07-15 16:51:35 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@NonWhiteHat @MayhewMayhem omg hello tanner you are a scary good boy 12/10 would pet with extreme caution,,,,,12,10,,,,,
55,881633300179243008,8.81607e+17,47384430.0,2017-07-02 21:58:53 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@roushfenway These are good dogs but 17/10 is an emotional impulse rating. More like 13/10s,,,,,17,10,,,,,
64,879674319642796034,8.795538e+17,3105441000.0,2017-06-27 12:14:36 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@RealKentMurphy 14/10 confirmed,,,,,14,10,,,,,
113,870726314365509632,8.707262e+17,16487760.0,2017-06-02 19:38:25 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@ComplicitOwl @ShopWeRateDogs &gt;10/10 is reserved for dogs,,,,,10,10,,,,,
148,863427515083354112,8.634256e+17,77596200.0,2017-05-13 16:15:35 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","@Jack_Septic_Eye I'd need a few more pics to polish a full analysis, but based on the good boy content above I'm leaning towards 12/10",,,,,12,10,,,,,
179,857214891891077121,8.571567e+17,180671000.0,2017-04-26 12:48:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@Marc_IRL pixelated af 12/10,,,,,12,10,,,,,
185,856330835276025856,,,2017-04-24 02:15:55 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @Jenna_Marbles: @dog_rates Thanks for rating my cermets 14/10 wow I'm so proud I watered them so much,8.563302e+17,66699013.0,2017-04-24 02:13:14 +0000,,14,10,,,,,
186,856288084350160898,8.56286e+17,279281000.0,2017-04-23 23:26:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@xianmcguire @Jenna_Marbles Kardashians wouldn't be famous if as a society we didn't place enormous value on what they do. The dogs are very deserving of their 14/10,,,,,14,10,,,,,
188,855862651834028034,8.558616e+17,194351800.0,2017-04-22 19:15:32 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research,,,,,420,10,,,,,
189,855860136149123072,8.558585e+17,13615720.0,2017-04-22 19:05:32 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","@s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10",,,,,666,10,,,,,


#### Quality Issue. There are 181 retweets, which from this dataset's perspective are duplicate records.

In [206]:
twitter_archive.retweeted_status_id.notnull().sum()

181

In [201]:
twitter_archive[twitter_archive.retweeted_status_id.notnull()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
19,888202515573088257,,,2017-07-21 01:02:36 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Canela. She attempted some fancy porch pics. They were unsuccessful. 13/10 someone help her https://t.co/cLyzpcUcMX,8.874740e+17,4.196984e+09,2017-07-19 00:47:34 +0000,"https://twitter.com/dog_rates/status/887473957103951883/photo/1,https://twitter.com/dog_rates/status/887473957103951883/photo/1,https://twitter.com/dog_rates/status/887473957103951883/photo/1,http...",13,10,Canela,,,,
32,886054160059072513,,,2017-07-15 02:45:48 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @Athletics: 12/10 #BATP https://t.co/WxwJmvjfxo,8.860537e+17,1.960740e+07,2017-07-15 02:44:07 +0000,"https://twitter.com/dog_rates/status/886053434075471873,https://twitter.com/dog_rates/status/886053434075471873",12,10,,,,,
36,885311592912609280,,,2017-07-13 01:35:06 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Lilly. She just parallel barked. Kindly requests a reward now. 13/10 would pet so well https://t.co/SATN4If5H5,8.305833e+17,4.196984e+09,2017-02-12 01:04:29 +0000,"https://twitter.com/dog_rates/status/830583320585068544/photo/1,https://twitter.com/dog_rates/status/830583320585068544/photo/1,https://twitter.com/dog_rates/status/830583320585068544/photo/1,http...",13,10,Lilly,,,,
68,879130579576475649,,,2017-06-26 00:13:58 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Emmy. She was adopted today. Massive round of pupplause for Emmy and her new family. 14/10 for all involved https://…,8.780576e+17,4.196984e+09,2017-06-23 01:10:23 +0000,"https://twitter.com/dog_rates/status/878057613040115712/photo/1,https://twitter.com/dog_rates/status/878057613040115712/photo/1",14,10,Emmy,,,,
73,878404777348136964,,,2017-06-24 00:09:53 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","RT @dog_rates: Meet Shadow. In an attempt to reach maximum zooming borkdrive, he tore his ACL. Still 13/10 tho. Help him out below\n\nhttps:/…",8.782815e+17,4.196984e+09,2017-06-23 16:00:04 +0000,"https://www.gofundme.com/3yd6y1c,https://twitter.com/dog_rates/status/878281511006478336/photo/1",13,10,Shadow,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1023,746521445350707200,,,2016-06-25 01:52:36 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Shaggy. He knows exactly how to solve the puzzle but can't talk. All he wants to do is help. 10/10 great guy https:/…,6.678667e+17,4.196984e+09,2015-11-21 00:46:50 +0000,https://twitter.com/dog_rates/status/667866724293877760/photo/1,10,10,Shaggy,,,,
1043,743835915802583040,,,2016-06-17 16:01:16 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: Extremely intelligent dog here. Has learned to walk like human. Even has his own dog. Very impressive 10/10 https://t.co/0Dv…,6.671383e+17,4.196984e+09,2015-11-19 00:32:12 +0000,https://twitter.com/dog_rates/status/667138269671505920/photo/1,10,10,,,,,
1242,711998809858043904,,,2016-03-21 19:31:59 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @twitter: @dog_rates Awesome Tweet! 12/10. Would Retweet. #LoveTwitter https://t.co/j6FQGhxYuN,7.119983e+17,7.832140e+05,2016-03-21 19:29:52 +0000,"https://twitter.com/twitter/status/711998279773347841/photo/1,https://twitter.com/twitter/status/711998279773347841/photo/1",12,10,,,,,
2259,667550904950915073,,,2015-11-20 03:51:52 +0000,"<a href=""http://twitter.com"" rel=""nofollow"">Twitter Web Client</a>","RT @dogratingrating: Exceptional talent. Original humor. Cutting edge, Nova Scotian comedian. 12/10 https://t.co/uarnTjBeVA",6.675487e+17,4.296832e+09,2015-11-20 03:43:06 +0000,"https://twitter.com/dogratingrating/status/667548695664070656/photo/1,https://twitter.com/dogratingrating/status/667548695664070656/photo/1",12,10,,,,,


#### Quality Issue. There are almost 2000 records out of 2356 that do not have a dog stage. 

In [133]:
twitter_archive.query('doggo == "None" and floofer == "None" and pupper == "None" and puppo == "None"').tweet_id.count()

1976

#### Quality Issue. One record is marked as both doggo and floofer. It should only be floofer.
Complete tweet: At first I thought this was a shy doggo, but it's actually a Rare Canadian Floofer Owl. Amateurs would confuse the two. 11/10 only send dogs

In [136]:
twitter_archive.query('floofer == "floofer" and doggo == "doggo"')[['text']]

Unnamed: 0,text
200,"At first I thought this was a shy doggo, but i..."


#### Quality Issue. Records with a denominator other than 10

In [139]:
twitter_archive.query('rating_denominator != 10').tweet_id.count()

23
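
Many of these unusual denominators appear to come from fraction-like text that is not a rating at all (dates, "24/7", and so on). A rough sketch of how to list every candidate in a tweet follows; the regular expression and the helper `rating_candidates` are assumptions for illustration, and the third sample text is abbreviated from the archive.

```python
import re

# A number (optionally with decimals), a slash, then a whole number
RATING_PATTERN = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')


def rating_candidates(text):
    """Return every numerator/denominator pair found in the text, as strings."""
    return RATING_PATTERN.findall(text)


texts = [
    "@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho",
    "@docmisterio account started on 11/15/15",
    "Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer.",
]
demo_candidates = [rating_candidates(text) for text in texts]

for text, candidates in zip(texts, demo_candidates):
    print(candidates, '<-', text[:40])
```

Listing all candidates makes it easier to decide, record by record, whether a later pair such as 13/10 is the real rating or whether the tweet contains no rating at all.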

##### Quality Issue. Review, one by one, the records with a denominator other than 10

In [199]:
twitter_archive.query('rating_denominator != 10')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
313,835246439529840640,8.35246e+17,26259580.0,2017-02-24 21:54:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho",,,,,960,0,,,,,
342,832088576586297345,8.320875e+17,30582080.0,2017-02-16 04:45:50 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",@docmisterio account started on 11/15/15,,,,,11,15,,,,,
433,820690176645140481,,,2017-01-15 17:52:40 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd,,,,"https://twitter.com/dog_rates/status/820690176645140481/photo/1,https://twitter.com/dog_rates/status/820690176645140481/photo/1,https://twitter.com/dog_rates/status/820690176645140481/photo/1",84,70,,,,,
516,810984652412424192,,,2016-12-19 23:06:23 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx,,,,"https://www.gofundme.com/sams-smile,https://twitter.com/dog_rates/status/810984652412424192/photo/1",24,7,Sam,,,,
784,775096608509886464,,,2016-09-11 22:20:06 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","RT @dog_rates: After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https:/…",7.403732e+17,4196984000.0,2016-06-08 02:41:38 +0000,"https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1,http...",9,11,,,,,
902,758467244762497024,,,2016-07-28 01:00:57 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE,,,,https://twitter.com/dog_rates/status/758467244762497024/video/1,165,150,,,,,
1068,740373189193256964,,,2016-06-08 02:41:38 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ",,,,"https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1,http...",9,11,,,,,
1120,731156023742988288,,,2016-05-13 16:15:54 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv,,,,https://twitter.com/dog_rates/status/731156023742988288/photo/1,204,170,this,,,,
1165,722974582966214656,,,2016-04-21 02:25:47 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a,,,,https://twitter.com/dog_rates/status/722974582966214656/photo/1,4,20,,,,,
1202,716439118184652801,,,2016-04-03 01:36:11 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq,,,,https://twitter.com/dog_rates/status/716439118184652801/photo/1,50,50,Bluebert,,,,
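
Several of these rows contain more than one fraction in the text (a date or a score such as 4/20, followed by the actual rating). As a rough aid for the one-by-one review, the sketch below lists every `digits/digits` pair in each text and flags the tweets that contain more than one. The texts are shortened excerpts of tweets shown above; the variable names are my own.

```python
import pandas as pd

# Shortened excerpts of tweets from the output above.
texts = pd.Series([
    "Happy 4/20 from the squad! 13/10 for all",
    "The floofs have been released I repeat the floofs have been released. 84/70",
    "Meet Sam. She smiles 24/7",
])

# Every numerator/denominator pair found in each text, in order of appearance.
fractions = texts.str.findall(r'(\d+)/(\d+)')

# Tweets with more than one fraction deserve a manual look.
has_multiple = fractions.str.len() > 1

print(fractions[0])            # [('4', '20'), ('13', '10')]
print(has_multiple.tolist())   # [True, False, False]
```

A tweet with a single fraction can still be wrong (24/7 is not a rating), so this only narrows down the manual check.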


#### Quality Issue. Records with a numerator far from the median.

The cutoff of 15 was chosen somewhat arbitrarily.

In [141]:
twitter_archive.rating_numerator.describe()

count    2356.000000
mean       13.126486
std        45.876648
min         0.000000
25%        10.000000
50%        11.000000
75%        12.000000
max      1776.000000
Name: rating_numerator, dtype: float64

In [148]:
twitter_archive.query('rating_numerator > 15').tweet_id.count()

26
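
As a side note, the threshold of 15 coincides with the upper fence of Tukey's rule (Q3 + 1.5 × IQR) when it is computed from the quartiles reported by `describe()` above. The sketch below shows the arithmetic; the helper `tukey_upper_fence` and the small sample series are my own illustration, not part of the original analysis.

```python
import pandas as pd

def tukey_upper_fence(values: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR for a numeric series (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return float(q3 + k * (q3 - q1))

# Quartiles taken from the describe() output above: 25% = 10, 75% = 12.
q1, q3 = 10.0, 12.0
fence_from_summary = q3 + 1.5 * (q3 - q1)
print(fence_from_summary)  # 15.0

# The same helper applied to a small made-up sample with the same quartiles.
sample = pd.Series([10, 10, 11, 12, 12])
print(tukey_upper_fence(sample))  # 15.0
```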

#### Quality Issue. Name accuracy. About 800 names are clearly not valid: None (745) or a (55).

In [149]:
twitter_archive.name.value_counts()

None       745
a           55
Charlie     12
Oliver      11
Cooper      11
          ... 
Jameson      1
Huck         1
Willem       1
Emmie        1
Tedders      1
Name: name, Length: 957, dtype: int64
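
Besides None and a, other lowercase words can end up in the name column (an appears later in this notebook). A minimal sketch for flagging suspect names, assuming that real names always start with an uppercase letter; the toy series below mixes values seen in this notebook with the made-up value the.

```python
import pandas as pd

# Toy values; 'the' is a hypothetical example of a lowercase non-name.
names = pd.Series(['None', 'a', 'Charlie', 'an', 'Oliver', 'the'])

is_placeholder = names.eq('None')          # the literal string "None"
starts_lower = names.str.match(r'[a-z]')   # match() anchors at the start

suspect = is_placeholder | starts_lower
print(names[suspect].tolist())  # ['None', 'a', 'an', 'the']
```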

In [198]:
twitter_archive.query('name in ["None", "a"]')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh,,,,https://twitter.com/dog_rates/status/891087950875897856/photo/1,13,10,,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq,,,,"https://twitter.com/dog_rates/status/890729181411237888/photo/1,https://twitter.com/dog_rates/status/890729181411237888/photo/1",13,10,,,,,
12,889665388333682689,,,2017-07-25 01:55:32 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here's a puppo that seems to be on the fence about something haha no but seriously someone help her. 13/10 https://t.co/BxvuXk0UCm,,,,https://twitter.com/dog_rates/status/889665388333682689/photo/1,13,10,,,,,puppo
24,887343217045368832,,,2017-07-18 16:08:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",You may not have known you needed to see this today. 13/10 please enjoy (IG: emmylouroo) https://t.co/WZqNqygEyV,,,,https://twitter.com/dog_rates/status/887343217045368832/video/1,13,10,,,,,
25,887101392804085760,,,2017-07-18 00:07:08 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This... is a Jubilant Antarctic House Bear. We only rate dogs. Please only send dogs. Thank you... 12/10 would suffocate in floof https://t.co/4Ad1jzJSdp,,,,https://twitter.com/dog_rates/status/887101392804085760/photo/1,12,10,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 https://t.co/4B7cOc1EDq,,,,https://twitter.com/dog_rates/status/666049248165822465/photo/1,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 https://t.co/DWnyCjf2mx,,,,https://twitter.com/dog_rates/status/666044226329800704/photo/1,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af https://t.co/y671yMhoiR,,,,https://twitter.com/dog_rates/status/666033412701032449/photo/1,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of https://t.co/r7mOb2m0UI,,,,https://twitter.com/dog_rates/status/666029285002620928/photo/1,7,10,a,,,,


#### Tidiness issue. The columns doggo, floofer, pupper and puppo should be one column instead of four because they represent one variable.

In [152]:
(twitter_archive[['doggo', 'floofer', 'pupper', 'puppo']]
 .query('doggo != "None" or floofer != "None" or pupper != "None" or puppo != "None"'))

Unnamed: 0,doggo,floofer,pupper,puppo
9,doggo,,,
12,,,,puppo
14,,,,puppo
29,,,pupper,
43,doggo,,,
...,...,...,...,...
1995,,,pupper,
2002,,,pupper,
2009,,,pupper,
2015,,,pupper,
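
One possible way to collapse the four columns into a single `stage` column is sketched below. It assumes missing stages are stored as the string "None", as in the archive, and it joins multiple stages with a comma instead of choosing one. The row with index 99 is made up to show that case; the column name `stage` is my own choice.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy frame; rows 9, 12 and 29 mirror the output above, row 99 is hypothetical.
archive = pd.DataFrame({
    'doggo':   ['doggo', 'None', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper':  ['None', 'None', 'pupper', 'pupper'],
    'puppo':   ['None', 'puppo', 'None', 'None'],
}, index=[9, 12, 29, 99])

def combine_stages(row):
    """Join every stage that is set on the row; 'None' if no stage is set."""
    found = [value for value in row if value != 'None']
    return ','.join(found) if found else 'None'

archive['stage'] = archive[stage_cols].apply(combine_stages, axis=1)
tidy = archive.drop(columns=stage_cols)
print(tidy['stage'].tolist())  # ['doggo', 'puppo', 'pupper', 'doggo,pupper']
```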


In [191]:
(twitter_archive
 .query('(doggo != "None" or floofer != "None" or pupper != "None" or puppo != "None") and rating_denominator != 10'))

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


#### Quality Issue. Numerator should be 11 instead of 27. Ideally it should be 11.27, but for this exercise I will leave it as 11 because most of the numbers are integers.

In [197]:
(twitter_archive
 .query('(doggo != "None" or floofer != "None" or pupper != "None" or puppo != "None") and rating_numerator >= 15'))

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
763,778027034220126208,,,2016-09-20 00:24:34 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq,,,,https://twitter.com/dog_rates/status/778027034220126208/photo/1,27,10,Sophie,,,pupper,
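
A digits-only pattern picks up 27 from the text 11.27/10, which would explain the stored value (how the archive was actually built is an assumption on my part). The sketch below compares that pattern with a decimal-aware one on two tweets from this notebook.

```python
import pandas as pd

texts = pd.Series([
    "This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. "
    "Appears at random just to smile at the locals. 11.27/10 would smile back",
    "Who leaves the last cupcake just sitting there? 9/10",
])

# Digits-only numerator: stops at the decimal point and captures "27".
naive = texts.str.extract(r'(\d+)/\d+')[0]
print(naive.tolist())  # ['27', '9']

# Decimal-aware pattern: optional fractional part in the numerator.
pattern = r'(?P<numerator>\d+(?:\.\d+)?)/(?P<denominator>\d+)'
ratings = texts.str.extract(pattern).astype(float)
print(ratings['numerator'].tolist())  # [11.27, 9.0]
```

Keeping the decimal would require storing the numerator as a float, which is why this notebook settles on 11.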


In [196]:
pd.options.display.max_colwidth = 200

<font color='green' size='4'>
image_predictions dataset

#### Data quality Dimensions
- Completeness
- Validity
- Accuracy
- Consistency

#### Tidiness: issues with structure that prevent easy analysis. Untidy data is also known as messy data. Tidy data requirements:

- Each variable forms a column.

- Each observation forms a row.

- Each type of observational unit forms a table.

##### Quality issues

- Pass

##### Tidiness

- The predictions should be in their own table, and each prediction should be on its own row.

In [175]:
image_predictions.sample()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
982,707387676719185920,https://pbs.twimg.com/media/CdElVm7XEAADP6o.jpg,1,Chihuahua,0.888468,True,Italian_greyhound,0.088635,True,toy_terrier,0.015938,True
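
A minimal sketch of what "each prediction on its own row" could look like, using the sampled row above. The output column names (`prediction_rank`, `prediction`, `confidence`, `is_dog`) are my own and not part of the original file.

```python
import pandas as pd

# The sampled row shown above.
image_predictions = pd.DataFrame({
    'tweet_id': [707387676719185920],
    'jpg_url': ['https://pbs.twimg.com/media/CdElVm7XEAADP6o.jpg'],
    'img_num': [1],
    'p1': ['Chihuahua'], 'p1_conf': [0.888468], 'p1_dog': [True],
    'p2': ['Italian_greyhound'], 'p2_conf': [0.088635], 'p2_dog': [True],
    'p3': ['toy_terrier'], 'p3_conf': [0.015938], 'p3_dog': [True],
})

frames = []
for rank in (1, 2, 3):
    # Take the three columns of one prediction and give them common names.
    part = image_predictions[
        ['tweet_id', f'p{rank}', f'p{rank}_conf', f'p{rank}_dog']
    ].copy()
    part.columns = ['tweet_id', 'prediction', 'confidence', 'is_dog']
    part.insert(1, 'prediction_rank', rank)
    frames.append(part)

# One row per (tweet, prediction rank).
predictions_long = pd.concat(frames, ignore_index=True)
print(predictions_long['prediction'].tolist())
# ['Chihuahua', 'Italian_greyhound', 'toy_terrier']
```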


In [168]:
image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [166]:
image_predictions.img_num.value_counts()

1    1780
2     198
3      66
4      31
Name: img_num, dtype: int64

In [156]:
image_predictions.p1.value_counts()

golden_retriever      150
Labrador_retriever    100
Pembroke               89
Chihuahua              83
pug                    57
                     ... 
revolver                1
water_buffalo           1
walking_stick           1
nail                    1
trombone                1
Name: p1, Length: 378, dtype: int64

In [160]:
image_predictions.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [161]:
image_predictions.p1_dog.value_counts()

True     1532
False     543
Name: p1_dog, dtype: int64

In [163]:
image_predictions.tweet_id.duplicated().sum()

0

<font color='green' size='4'>
tweet_details_df dataset

#### Data quality Dimensions
- Completeness
- Validity
- Accuracy
- Consistency

#### Tidiness: issues with structure that prevent easy analysis. Untidy data is also known as messy data. Tidy data requirements:

- Each variable forms a column.

- Each observation forms a row.

- Each type of observational unit forms a table.

In [169]:
tweet_details_df.sample()

Unnamed: 0,id,retweet_count,favorite_count
441,818145370475810820,2618,12488


In [170]:
tweet_details_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   id              2331 non-null   int64
 1   retweet_count   2331 non-null   int64
 2   favorite_count  2331 non-null   int64
dtypes: int64(3)
memory usage: 54.8 KB


In [172]:
tweet_details_df.describe()

Unnamed: 0,id,retweet_count,favorite_count
count,2331.0,2331.0,2331.0
mean,7.419079e+17,2707.513085,7569.226083
std,6.82317e+16,4578.311885,11747.116239
min,6.660209e+17,1.0,0.0
25%,6.78267e+17,548.0,1319.5
50%,7.182469e+17,1269.0,3293.0
75%,7.986692e+17,3146.0,9264.5
max,8.924206e+17,77917.0,156349.0


In [174]:
tweet_details_df.id.duplicated().sum()

0
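
`tweet_details_df` has 2331 rows while `twitter_archive` has 2356, and the key column is named `id` instead of `tweet_id`. A possible completeness check is to list the archive IDs that have no match in the details table. The sketch below uses small made-up IDs only to show the idea.

```python
import pandas as pd

# Made-up IDs standing in for twitter_archive.tweet_id and tweet_details_df.
archive_ids = pd.Series([1, 2, 3, 4], name='tweet_id')
details = pd.DataFrame({'id': [1, 2, 4], 'retweet_count': [5, 6, 7]})

# Archive IDs with no matching row in the details table.
missing = archive_ids[~archive_ids.isin(details['id'])]
print(missing.tolist())  # [3]
```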

## Cleaning

<font color='green' size='4'>
twitter-archive-enhanced dataset

##### Quality issues

- (Not selected to be fixed during this exercise). About 2000 records are not classified as doggo, floofer, pupper, or puppo. In other words, all four columns are set to None for those records.
- Remove the 59 records without an associated image.
- Remove the 181 retweets, which from this dataset's perspective are duplicate records.
- Set to None the names that start with a lowercase letter, so that every record with an unknown name is set to None.
- Keep only doggo for the record that has both floofer and doggo.
- Records with a denominator other than 10.
- Records with a numerator far from the median.
- Numerator should be 11 instead of 27. Ideally it should be 11.27 but for this exercise I will leave it as 11 because most of the numbers are integers.

##### Tidiness

- The columns doggo, floofer, pupper and puppo should be one column because it is one variable.

#### Getting acquainted with the dataset

In [129]:
twitter_archive.shape

(2356, 17)

In [210]:
twitter_archive.sample()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1805,676942428000112642,,,2015-12-16 01:50:26 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Who leaves the last cupcake just sitting there? 9/10 https://t.co/PWMqAoEx2a,,,,https://twitter.com/dog_rates/status/676942428000112642/photo/1,9,10,,,,,


In [211]:
twitter_archive_cp = twitter_archive.copy()

Define

#### 1. Remove the 59 records without associated image.

In [214]:
twitter_archive_cp.expanded_urls.isnull().sum()

59

Code

In [215]:
twitter_archive_cp = twitter_archive_cp[~twitter_archive_cp.expanded_urls.isnull()]

Test. There are no records with expanded_urls as null.

In [216]:
twitter_archive_cp.expanded_urls.isnull().sum()

0
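
An equivalent way to write this step is `dropna` with a `subset`, which states the intent directly. A small sketch on a toy frame with made-up URLs:

```python
import numpy as np
import pandas as pd

# Toy frame; the URLs are placeholders.
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'expanded_urls': ['https://example.com/a', np.nan, 'https://example.com/b'],
})

# Boolean-mask version, as used in the cell above.
masked = archive[~archive.expanded_urls.isnull()]

# dropna version: drop rows where expanded_urls is missing.
cleaned = archive.dropna(subset=['expanded_urls'])

print(cleaned.tweet_id.tolist())  # [1, 3]
print(cleaned.equals(masked))     # True
```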

Define

#### 2. Remove the 181 retweets, which from this dataset's perspective are duplicate records. Note: one of them was already removed as part of the previous task.

In [219]:
twitter_archive_cp.retweeted_status_id.isna().sum()

2117

Code

In [220]:
twitter_archive_cp = twitter_archive_cp[twitter_archive_cp.retweeted_status_id.isna()]

Test. No more records with retweet status exist in the dataset.

In [222]:
twitter_archive_cp.retweeted_status_id.notna().sum()

0
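
The row counts reported so far are consistent with each other, which is a quick sanity check on the first two steps. The numbers below are taken from the cell outputs in this notebook.

```python
total_rows = 2356      # twitter_archive.shape[0]
missing_urls = 59      # rows with a null expanded_urls, removed in step 1
non_retweets = 2117    # rows with a null retweeted_status_id after step 1

after_step_1 = total_rows - missing_urls
retweets_removed_in_step_2 = after_step_1 - non_retweets

# 180 retweets remain to be removed here, because one of the 181
# was already dropped in step 1.
print(after_step_1, retweets_removed_in_step_2)  # 2297 180
```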

Define

#### 4. Keep only doggo for the record that has both floofer and doggo. In that tweet, the word floofer refers to an owl.

In [232]:
twitter_archive_cp.query('floofer == "floofer" and doggo == "doggo"')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
200,854010172552949760,,,2017-04-17 16:34:26 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","At first I thought this was a shy doggo, but it's actually a Rare Canadian Floofer Owl. Amateurs would confuse the two. 11/10 only send dogs https://t.co/TXdT3tmuYk",,,,"https://twitter.com/dog_rates/status/854010172552949760/photo/1,https://twitter.com/dog_rates/status/854010172552949760/photo/1",11,10,,doggo,floofer,,


Code

In [233]:
twitter_archive_cp.loc[200, 'floofer'] = 'None'

Test

In [234]:
twitter_archive_cp.query('floofer == "floofer" and doggo == "doggo"')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
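
The fix above relies on the index label 200. A variant that does not depend on a hard-coded label is to build a boolean mask for rows that have both stages and clear floofer on those rows. This is a sketch on a toy frame; the rows with index 9 and 46 are made up.

```python
import pandas as pd

# Toy frame; only the row labelled 200 has both stages set.
archive = pd.DataFrame({
    'doggo':   ['doggo', 'doggo', 'None'],
    'floofer': ['floofer', 'None', 'floofer'],
}, index=[200, 9, 46])

# Rows tagged as both doggo and floofer.
both = archive['doggo'].eq('doggo') & archive['floofer'].eq('floofer')

# Keep doggo and clear floofer on those rows only.
archive.loc[both, 'floofer'] = 'None'
print(archive['floofer'].tolist())  # ['None', 'None', 'floofer']
```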


Define

#### 5. Remove records whose text does not include a rating. After reviewing the results, those are records 516 and 1662.

In [237]:
twitter_archive_cp.query('rating_denominator != 10')[['text', 'rating_numerator', 'rating_denominator']]

Unnamed: 0,text,rating_numerator,rating_denominator
433,The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd,84,70
516,Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx,24,7
902,Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE,165,150
1068,"After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ",9,11
1120,Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv,204,170
1165,Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a,4,20
1202,This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq,50,50
1228,Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1,99,90
1254,Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12,80,80
1274,"From left to right:\nCletus, Jerome, Alejandro, Burp, &amp; Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK",45,50


Code

In [239]:
drop_list = [516, 1662]

In [241]:
twitter_archive_cp.drop(drop_list, inplace=True)

Test. The records 516 and 1662 no longer exist.

In [250]:
twitter_archive_cp.query('rating_denominator != 10')[['tweet_id','text', 'rating_numerator', 'rating_denominator']]

Unnamed: 0,tweet_id,text,rating_numerator,rating_denominator
433,820690176645140481,The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd,84,70
902,758467244762497024,Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE,165,150
1068,740373189193256964,"After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ",9,11
1120,731156023742988288,Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv,204,170
1165,722974582966214656,Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a,4,20
1202,716439118184652801,This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq,50,50
1228,713900603437621249,Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1,99,90
1254,710658690886586372,Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12,80,80
1274,709198395643068416,"From left to right:\nCletus, Jerome, Alejandro, Burp, &amp; Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK",45,50
1351,704054845121142784,Here is a whole flock of puppers. 60/50 I'll take the lot https://t.co/9dpcw6MdWa,60,50
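
The listing above checks the drop indirectly. A more direct test is to ask whether any of the dropped labels is still present in the index. The sketch below uses a toy frame with a placeholder column; only the index labels matter here.

```python
import pandas as pd

# Toy frame; the 'value' column is a placeholder.
archive = pd.DataFrame({'value': ['x', 'y', 'z']}, index=[433, 516, 1662])

drop_list = [516, 1662]
archive = archive.drop(index=drop_list)

# True if any dropped label survived, False otherwise.
still_present = bool(archive.index.isin(drop_list).any())
print(still_present)          # False
print(archive.index.tolist()) # [433]
```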


Define

#### 6. Correct the record with tweet_id 740373189193256964. The numerator and denominator should be 14/10.

Code

In [252]:
mask = twitter_archive_cp.tweet_id == 740373189193256964
twitter_archive_cp.loc[mask, 'rating_numerator'] = 14
twitter_archive_cp.loc[mask, 'rating_denominator'] = 10

Test

In [253]:
twitter_archive_cp[twitter_archive_cp['tweet_id'] == 740373189193256964]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1068,740373189193256964,,,2016-06-08 02:41:38 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ",,,,"https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1,http...",14,10,,,,,


#### 7. Correct tweet_id 722974582966214656. The numerator and denominator should be 13/10.

In [254]:
twitter_archive_cp[twitter_archive_cp['tweet_id'] == 722974582966214656]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1165,722974582966214656,,,2016-04-21 02:25:47 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a,,,,https://twitter.com/dog_rates/status/722974582966214656/photo/1,4,20,,,,,


Code

In [255]:
mask = twitter_archive_cp.tweet_id == 722974582966214656
twitter_archive_cp.loc[mask, 'rating_numerator'] = 13
twitter_archive_cp.loc[mask, 'rating_denominator'] = 10

Test

In [256]:
twitter_archive_cp[twitter_archive_cp['tweet_id'] == 722974582966214656]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1165,722974582966214656,,,2016-04-21 02:25:47 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a,,,,https://twitter.com/dog_rates/status/722974582966214656/photo/1,13,10,,,,,


#### 8. Correct tweet_id 666287406224695296. The numerator and denominator should be 9/10.

Code

In [257]:
mask = twitter_archive_cp.tweet_id == 666287406224695296
twitter_archive_cp.loc[mask, 'rating_numerator'] = 9
twitter_archive_cp.loc[mask, 'rating_denominator'] = 10

Test

In [258]:
twitter_archive_cp[twitter_archive_cp['tweet_id'] == 666287406224695296]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2335,666287406224695296,,,2015-11-16 16:11:11 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is an Albanian 3 1/2 legged Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv,,,,https://twitter.com/dog_rates/status/666287406224695296/photo/1,9,10,an,,,,
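
Steps 6 to 8 repeat the same three lines with different values. One way to reduce the repetition is a small helper that corrects a rating by tweet_id and fails loudly when the ID is not found exactly once. `set_rating` is a hypothetical helper of mine, and the starting values for the third tweet (1 and 2) are an assumption because the notebook does not show them.

```python
import pandas as pd

def set_rating(df, tweet_id, numerator, denominator=10):
    """Overwrite the rating of exactly one tweet, in place (hypothetical helper)."""
    mask = df['tweet_id'] == tweet_id
    if mask.sum() != 1:
        raise ValueError(f'expected exactly one row for tweet_id {tweet_id}')
    df.loc[mask, 'rating_numerator'] = numerator
    df.loc[mask, 'rating_denominator'] = denominator

# Toy frame with the three tweets corrected in steps 6 to 8.
archive = pd.DataFrame({
    'tweet_id': [740373189193256964, 722974582966214656, 666287406224695296],
    'rating_numerator': [9, 4, 1],      # last value is assumed
    'rating_denominator': [11, 20, 2],  # last value is assumed
})

corrections = {
    740373189193256964: 14,
    722974582966214656: 13,
    666287406224695296: 9,
}
for tweet_id, numerator in corrections.items():
    set_rating(archive, tweet_id, numerator)

print(archive['rating_numerator'].tolist())    # [14, 13, 9]
print(archive['rating_denominator'].tolist())  # [10, 10, 10]
```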


#### 9. Replace all the names that start with a lowercase letter with 'None'

In [None]:
twitter_archive_cp[twitter_archive_cp.name.str.contains('^[a-z]')]

Code

In [267]:
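# .* (rather than \w+) so that single-letter names such as 'a' are replaced too.
# regex=True is explicit because newer pandas versions treat the pattern literally by default.
twitter_archive_cp.name = twitter_archive_cp.name.str.replace(r'^[a-z].*$', 'None', regex=True)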

Test. There are no names that start with a lowercase letter.

In [269]:
twitter_archive_cp[twitter_archive_cp.name.str.contains('^[a-z]')]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


#### 10. Numerator should be 11 instead of 27. Ideally it should be 11.27 but for this exercise I will leave it as 11 because most of the numbers are integers.

In [270]:
twitter_archive_cp.query('rating_numerator == 27')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
763,778027034220126208,,,2016-09-20 00:24:34 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq,,,,https://twitter.com/dog_rates/status/778027034220126208/photo/1,27,10,Sophie,,,pupper,


Code

In [271]:
mask = twitter_archive_cp.tweet_id == 778027034220126208
twitter_archive_cp.loc[mask, 'rating_numerator'] = 11

Test

In [272]:
twitter_archive_cp.loc[mask]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
763,778027034220126208,,,2016-09-20 00:24:34 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq,,,,https://twitter.com/dog_rates/status/778027034220126208/photo/1,11,10,Sophie,,,pupper,
