# Project 4: WeRateDogs data wrangling

## Table of contents

<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
    <ul>
        <li><a href="#gathering">Data gathering</a></li>
        <li><a href="#assessment">Data assessment</a></li>
        <li><a href="#cleaning">Data cleaning</a></li>
    </ul>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
<li><a href="#limitations">Limitations</a></li>
</ul>

<a id='intro'></a>
## Introduction 

In this project, we wrangle, analyse, and visualize the tweet archive of Twitter user [@dog_rates](https://twitter.com/dog_rates), also known as [WeRateDogs](https://en.wikipedia.org/wiki/WeRateDogs). After gathering this data from Twitter's API using [Tweepy](https://www.tweepy.org/), we assess and clean it. We then draw insights from the cleaned data and produce some visualizations. The last, but not least, step is a report of our results, provided in the **act\_report.pdf** file associated with the project.

<a id='wrangling'></a>
## Data wrangling 

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import tweepy 
import os
import time
import json
import requests
from PIL import Image
from io import BytesIO
from sqlalchemy import create_engine

<a id='gathering'></a>
### Data gathering

#### `1.` WeRateDogs archive

In [2]:
# load WeRateDogs archive into dataframe
df_twitter = pd.read_csv('../data/twitter-archive-enhanced.csv')
df_twitter.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


#### `2.` The tweet image predictions.

In [3]:
# download 'image-predictions.tsv' into the data folder
folder_path = '../data'
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
with open(os.path.join(folder_path, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)
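
The download cell fetches the file again on every run. As a minimal sketch, assuming a hypothetical helper named `download_if_missing` and an injectable `fetch` callable (neither is part of the original notebook), the download could be skipped when the file is already in the data folder. The demonstration uses a stand-in fetch function and a made-up `example.com` URL so that it runs without network access.

```python
import os
import tempfile

def download_if_missing(url, folder, fetch):
    """Save the file behind `url` into `folder` unless it is already there.

    `fetch` is a hypothetical callable that takes a URL and returns bytes,
    for example `lambda u: requests.get(u).content` inside the notebook.
    """
    os.makedirs(folder, exist_ok=True)
    # Same file-name rule as the notebook: last path component of the URL
    path = os.path.join(folder, url.split('/')[-1])
    if not os.path.exists(path):
        with open(path, mode='wb') as file:
            file.write(fetch(url))
    return path

# Offline demonstration with a stand-in fetch function (no network access).
calls = []

def fake_fetch(u):
    calls.append(u)
    return b"tweet_id\tjpg_url\n"

demo_folder = tempfile.mkdtemp()
demo_url = "https://example.com/files/image-predictions.tsv"  # placeholder URL
first_path = download_if_missing(demo_url, demo_folder, fake_fetch)
second_path = download_if_missing(demo_url, demo_folder, fake_fetch)
print(os.path.basename(first_path), len(calls))
```

The second call returns the same path without invoking `fetch` again, which is the behaviour one would want when rerunning the notebook.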

In [4]:
# check if file in data folder
os.listdir(folder_path)

['image-predictions.tsv', 'twitter-archive-enhanced.csv', 'tweet_json.txt']

In [5]:
# load 'image-predictions.tsv' into dataframe
df_image_predictions = pd.read_csv('../data/image-predictions.tsv', sep='\t')
df_image_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


#### `3.` Gather retweet and favorite counts from Tweepy.

In [6]:
# WeRateDogs tweet ids
twitter_ids_list = list(df_twitter.tweet_id)
twitter_ids_list

[892420643555336193,
 892177421306343426,
 891815181378084864,
 891689557279858688,
 891327558926688256,
 891087950875897856,
 890971913173991426,
 890729181411237888,
 890609185150312448,
 890240255349198849,
 890006608113172480,
 889880896479866881,
 889665388333682689,
 889638837579907072,
 889531135344209921,
 889278841981685760,
 888917238123831296,
 888804989199671297,
 888554962724278272,
 888202515573088257,
 888078434458587136,
 887705289381826560,
 887517139158093824,
 887473957103951883,
 887343217045368832,
 887101392804085760,
 886983233522544640,
 886736880519319552,
 886680336477933568,
 886366144734445568,
 886267009285017600,
 886258384151887873,
 886054160059072513,
 885984800019947520,
 885528943205470208,
 885518971528720385,
 885311592912609280,
 885167619883638784,
 884925521741709313,
 884876753390489601,
 884562892145688576,
 884441805382717440,
 884247878851493888,
 884162670584377345,
 883838122936631299,
 883482846933004288,
 883360690899218434,
 883117836046

In [7]:
# Instantiate a tweepy api object
# Credentials are read from environment variables instead of being hard-coded in the notebook;
# the variable names below are assumptions, adjust them to your own setup.
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_secret = os.environ['TWITTER_ACCESS_SECRET']

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

# Note: `wait_on_rate_limit_notify` is a Tweepy 3.x parameter; it was removed in Tweepy 4
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [8]:
# File path to save each tweet's counts
tweet_json_path = '../data/tweet_json.txt'
# Query every tweet id in 'twitter_ids_list'
for tweet_id in twitter_ids_list:
    try:
        start_time = time.time()  # start timing request
        tweet = api.get_status(tweet_id, tweet_mode="extended")
        end_time = time.time()  # end timing request
        print(f"Tweet id: {tweet_id}. Request duration: {end_time - start_time}")
        # Append one JSON object per line to ../data/tweet_json.txt: {tweet_id: "retweet_count favorite_count"}
        tweet_json = {tweet.id: f'{tweet.retweet_count} {tweet.favorite_count}'}
        with open(tweet_json_path, mode='a', encoding='utf-8') as file:
            json.dump(tweet_json, file)
            file.write("\n")
    except Exception as error:
        # A deleted tweet is the most likely cause, but report the actual error as well
        print(f"⚠️ Warning ⚠️ Tweet associated with the id {tweet_id} could not be retrieved: {error}")

Tweet id: 892420643555336193. Request duration: 1.2902090549468994
Tweet id: 892177421306343426. Request duration: 0.9786131381988525
Tweet id: 891815181378084864. Request duration: 0.9588901996612549
Tweet id: 891689557279858688. Request duration: 1.0192813873291016
Tweet id: 891327558926688256. Request duration: 0.9209067821502686
Tweet id: 891087950875897856. Request duration: 1.0234184265136719
Tweet id: 890971913173991426. Request duration: 0.8564424514770508
Tweet id: 890729181411237888. Request duration: 1.0903472900390625
Tweet id: 890609185150312448. Request duration: 0.9095399379730225
Tweet id: 890240255349198849. Request duration: 0.925957441329956
Tweet id: 890006608113172480. Request duration: 0.9200854301452637
Tweet id: 889880896479866881. Request duration: 1.0255212783813477
Tweet id: 889665388333682689. Request duration: 0.8985958099365234
Tweet id: 889638837579907072. Request duration: 0.8405821323394775
Tweet id: 889531135344209921. Request duration: 0.9233627319335

Tweet id: 869227993411051520. Request duration: 0.9139001369476318
Tweet id: 868880397819494401. Request duration: 1.3306903839111328
Tweet id: 868639477480148993. Request duration: 1.0241937637329102
Tweet id: 868622495443632128. Request duration: 0.9204497337341309
Tweet id: 868552278524837888. Request duration: 0.869666337966919
Tweet id: 867900495410671616. Request duration: 0.9726014137268066
Tweet id: 867774946302451713. Request duration: 1.0269663333892822
Tweet id: 867421006826221569. Request duration: 1.0182535648345947
Tweet id: 867072653475098625. Request duration: 1.0249483585357666
Tweet id: 867051520902168576. Request duration: 0.9195449352264404
Tweet id: 866720684873056260. Request duration: 0.9301846027374268
Tweet id: 866686824827068416. Request duration: 0.912757396697998
Tweet id: 866450705531457537. Request duration: 1.1256334781646729
Tweet id: 866334964761202691. Request duration: 0.9213430881500244
Tweet id: 866094527597207552. Request duration: 0.92197728157043

Tweet id: 846042936437604353. Request duration: 0.9013514518737793
Tweet id: 845812042753855489. Request duration: 1.0234348773956299
Tweet id: 845677943972139009. Request duration: 0.9071593284606934
Tweet id: 845397057150107648. Request duration: 1.7427027225494385
Tweet id: 845306882940190720. Request duration: 0.8911645412445068
Tweet id: 845098359547420673. Request duration: 0.9486782550811768
Tweet id: 844979544864018432. Request duration: 0.902684211730957
Tweet id: 844973813909606400. Request duration: 0.9243545532226562
Tweet id: 844580511645339650. Request duration: 0.924769401550293
Tweet id: 844223788422217728. Request duration: 0.9183757305145264
Tweet id: 843981021012017153. Request duration: 1.1150047779083252
Tweet id: 843856843873095681. Request duration: 0.9245445728302002
Tweet id: 843604394117681152. Request duration: 0.9211862087249756
Tweet id: 843235543001513987. Request duration: 0.9265477657318115
Tweet id: 842846295480000512. Request duration: 1.02673149108886

Tweet id: 828801551087042563. Request duration: 0.9206714630126953
Tweet id: 828770345708580865. Request duration: 0.9396529197692871
Tweet id: 828708714936930305. Request duration: 0.9021000862121582
Tweet id: 828650029636317184. Request duration: 1.0229568481445312
Tweet id: 828409743546925057. Request duration: 0.9209096431732178
Tweet id: 828408677031882754. Request duration: 0.9037528038024902
Tweet id: 828381636999917570. Request duration: 0.9351456165313721
Tweet id: 828376505180889089. Request duration: 0.9239780902862549
Tweet id: 828372645993398273. Request duration: 0.9209442138671875
Tweet id: 828361771580813312. Request duration: 0.9214255809783936
Tweet id: 828046555563323392. Request duration: 0.9210474491119385
Tweet id: 828011680017821696. Request duration: 0.9213871955871582
Tweet id: 827933404142436356. Request duration: 0.9078960418701172
Tweet id: 827653905312006145. Request duration: 0.7612316608428955
Tweet id: 827600520311402496. Request duration: 0.887816190719

Tweet id: 813910438903693312. Request duration: 0.966374397277832
Tweet id: 813812741911748608. Request duration: 0.7813127040863037
Tweet id: 813800681631023104. Request duration: 0.9787931442260742
Tweet id: 813217897535406080. Request duration: 0.9211547374725342
Tweet id: 813202720496779264. Request duration: 0.8239359855651855
Tweet id: 813187593374461952. Request duration: 0.9128541946411133
Tweet id: 813172488309972993. Request duration: 0.9225144386291504
Tweet id: 813157409116065792. Request duration: 0.922034740447998
Tweet id: 813142292504645637. Request duration: 0.9187078475952148
Tweet id: 813130366689148928. Request duration: 0.9208383560180664
Tweet id: 813127251579564032. Request duration: 0.9213590621948242
Tweet id: 813112105746448384. Request duration: 0.8970296382904053
Tweet id: 813096984823349248. Request duration: 0.9420976638793945
Tweet id: 813081950185472002. Request duration: 0.9221148490905762
Tweet id: 813066809284972545. Request duration: 0.92089939117431

Tweet id: 796904159865868288. Request duration: 0.801872968673706
Tweet id: 796865951799083009. Request duration: 0.8775510787963867
Tweet id: 796759840936919040. Request duration: 0.7730562686920166
Tweet id: 796563435802726400. Request duration: 0.8328649997711182
Tweet id: 796484825502875648. Request duration: 0.7932043075561523
Tweet id: 796387464403357696. Request duration: 0.8270692825317383
Tweet id: 796177847564038144. Request duration: 0.7405502796173096
Tweet id: 796149749086875649. Request duration: 0.7739138603210449
Tweet id: 796125600683540480. Request duration: 0.8805849552154541
Tweet id: 796116448414461957. Request duration: 0.7522084712982178
Tweet id: 796080075804475393. Request duration: 0.7710371017456055
Tweet id: 796031486298386433. Request duration: 0.7417211532592773
Tweet id: 795464331001561088. Request duration: 0.9282739162445068
Tweet id: 795400264262053889. Request duration: 0.919266939163208
Tweet id: 795076730285391872. Request duration: 0.92124485969543

Tweet id: 781163403222056960. Request duration: 0.9002840518951416
Tweet id: 780931614150983680. Request duration: 1.0158107280731201
Tweet id: 780858289093574656. Request duration: 0.9257941246032715
Tweet id: 780800785462489090. Request duration: 0.9111223220825195
Tweet id: 780601303617732608. Request duration: 0.931328535079956
Tweet id: 780543529827336192. Request duration: 1.219437599182129
Tweet id: 780496263422808064. Request duration: 0.9776527881622314
Tweet id: 780476555013349377. Request duration: 1.0783426761627197
Tweet id: 780459368902959104. Request duration: 0.9092397689819336
Tweet id: 780192070812196864. Request duration: 0.9155457019805908
Tweet id: 780092040432480260. Request duration: 0.8800621032714844
Tweet id: 780074436359819264. Request duration: 1.0725743770599365
Tweet id: 779834332596887552. Request duration: 0.8777117729187012
Tweet id: 779377524342161408. Request duration: 0.8809270858764648
Tweet id: 779124354206535695. Request duration: 0.88772106170654

Tweet id: 763956972077010945. Request duration: 0.9265897274017334
Tweet id: 763837565564780549. Request duration: 0.9303290843963623
Tweet id: 763183847194451968. Request duration: 0.9294607639312744
Tweet id: 763167063695355904. Request duration: 0.8664126396179199
Tweet id: 763103485927849985. Request duration: 0.8661017417907715
Tweet id: 762699858130116608. Request duration: 0.9222614765167236
Tweet id: 762471784394268675. Request duration: 0.9191691875457764
Tweet id: 762464539388485633. Request duration: 0.8714001178741455
Tweet id: 762316489655476224. Request duration: 1.0168168544769287
Tweet id: 762035686371364864. Request duration: 0.9725496768951416
Tweet id: 761976711479193600. Request duration: 1.0184845924377441
Tweet id: 761750502866649088. Request duration: 0.9262809753417969
Tweet id: 761745352076779520. Request duration: 0.9146301746368408
Tweet id: 761672994376806400. Request duration: 0.9303596019744873
Tweet id: 761599872357261312. Request duration: 0.908831596374

Rate limit reached. Sleeping for: 66


Tweet id: 758740312047005698. Request duration: 72.37657165527344
Tweet id: 758474966123810816. Request duration: 1.2528128623962402
Tweet id: 758467244762497024. Request duration: 1.016186237335205
Tweet id: 758405701903519748. Request duration: 0.8415026664733887
Tweet id: 758355060040593408. Request duration: 0.8985249996185303
Tweet id: 758099635764359168. Request duration: 0.9158251285552979
Tweet id: 758041019896193024. Request duration: 0.9238951206207275
Tweet id: 757741869644341248. Request duration: 0.9221334457397461
Tweet id: 757729163776290825. Request duration: 0.9299192428588867
Tweet id: 757725642876129280. Request duration: 0.9142186641693115
Tweet id: 757611664640446465. Request duration: 0.91587233543396
Tweet id: 757597904299253760. Request duration: 0.8437683582305908
Tweet id: 757596066325864448. Request duration: 0.8805680274963379
Tweet id: 757400162377592832. Request duration: 0.9375364780426025
Tweet id: 757393109802180609. Request duration: 0.9194850921630859

Tweet id: 746521445350707200. Request duration: 0.9460263252258301
Tweet id: 746507379341139972. Request duration: 0.9553656578063965
Tweet id: 746369468511756288. Request duration: 0.8733310699462891
Tweet id: 746131877086527488. Request duration: 0.9359939098358154
Tweet id: 746056683365994496. Request duration: 1.0148439407348633
Tweet id: 745789745784041472. Request duration: 0.9246187210083008
Tweet id: 745712589599014916. Request duration: 0.8631258010864258
Tweet id: 745433870967832576. Request duration: 1.0517020225524902
Tweet id: 745422732645535745. Request duration: 1.8835198879241943
Tweet id: 745314880350101504. Request duration: 0.9255437850952148
Tweet id: 745074613265149952. Request duration: 1.0830142498016357
Tweet id: 745057283344719872. Request duration: 1.2142622470855713
Tweet id: 744995568523612160. Request duration: 1.2943742275238037
Tweet id: 744971049620602880. Request duration: 1.2396602630615234
Tweet id: 744709971296780288. Request duration: 1.095504045486

Tweet id: 727155742655025152. Request duration: 0.8993070125579834
Tweet id: 726935089318363137. Request duration: 0.970745325088501
Tweet id: 726887082820554753. Request duration: 0.9497487545013428
Tweet id: 726828223124897792. Request duration: 1.1902339458465576
Tweet id: 726224900189511680. Request duration: 1.0306284427642822
Tweet id: 725842289046749185. Request duration: 1.0108890533447266
Tweet id: 725786712245440512. Request duration: 1.017073631286621
Tweet id: 725729321944506368. Request duration: 0.9239730834960938
Tweet id: 725458796924002305. Request duration: 0.9307870864868164
Tweet id: 724983749226668032. Request duration: 1.0054028034210205
Tweet id: 724771698126512129. Request duration: 0.9325506687164307
Tweet id: 724405726123311104. Request duration: 0.9124770164489746
Tweet id: 724049859469295616. Request duration: 1.0156524181365967
Tweet id: 724046343203856385. Request duration: 1.0386416912078857
Tweet id: 724004602748780546. Request duration: 0.86428523063659

Tweet id: 709519240576036864. Request duration: 0.9047384262084961
Tweet id: 709449600415961088. Request duration: 1.1719651222229004
Tweet id: 709409458133323776. Request duration: 1.2379505634307861
Tweet id: 709225125749587968. Request duration: 0.9604616165161133
Tweet id: 709207347839836162. Request duration: 0.9259157180786133
Tweet id: 709198395643068416. Request duration: 0.9072191715240479
Tweet id: 709179584944730112. Request duration: 0.9277751445770264
Tweet id: 709158332880297985. Request duration: 0.9205770492553711
Tweet id: 709042156699303936. Request duration: 0.9010145664215088
Tweet id: 708853462201716736. Request duration: 0.9265406131744385
Tweet id: 708845821941387268. Request duration: 0.9307243824005127
Tweet id: 708834316713893888. Request duration: 0.9145393371582031
Tweet id: 708810915978854401. Request duration: 0.918604850769043
Tweet id: 708738143638450176. Request duration: 0.9155068397521973
Tweet id: 708711088997666817. Request duration: 0.9127173423767

Tweet id: 700062718104104960. Request duration: 0.9025967121124268
Tweet id: 700029284593901568. Request duration: 0.9756879806518555
Tweet id: 700002074055016451. Request duration: 1.0741498470306396
Tweet id: 699801817392291840. Request duration: 0.9385902881622314
Tweet id: 699788877217865730. Request duration: 1.1017286777496338
Tweet id: 699779630832685056. Request duration: 1.0240585803985596
Tweet id: 699775878809702401. Request duration: 0.880378007888794
Tweet id: 699691744225525762. Request duration: 0.9141712188720703
Tweet id: 699446877801091073. Request duration: 1.0759663581848145
Tweet id: 699434518667751424. Request duration: 1.0188190937042236
Tweet id: 699423671849451520. Request duration: 1.0307304859161377
Tweet id: 699413908797464576. Request duration: 1.0074772834777832
Tweet id: 699370870310113280. Request duration: 1.1231184005737305
Tweet id: 699323444782047232. Request duration: 1.0190708637237549
Tweet id: 699088579889332224. Request duration: 1.0202145576477

Tweet id: 690989312272396288. Request duration: 0.7913374900817871
Tweet id: 690959652130045952. Request duration: 0.7771990299224854
Tweet id: 690938899477221376. Request duration: 0.7869710922241211
Tweet id: 690932576555528194. Request duration: 0.7605807781219482
Tweet id: 690735892932222976. Request duration: 0.7839882373809814
Tweet id: 690728923253055490. Request duration: 0.7839367389678955
Tweet id: 690690673629138944. Request duration: 0.7901527881622314
Tweet id: 690649993829576704. Request duration: 0.751110315322876
Tweet id: 690607260360429569. Request duration: 0.7971227169036865
Tweet id: 690597161306841088. Request duration: 0.7815859317779541
Tweet id: 690400367696297985. Request duration: 0.7818031311035156
Tweet id: 690374419777196032. Request duration: 0.8244643211364746
Tweet id: 690360449368465409. Request duration: 0.8298313617706299
Tweet id: 690348396616552449. Request duration: 0.8323073387145996
Tweet id: 690248561355657216. Request duration: 0.8031752109527

Tweet id: 684188786104872960. Request duration: 0.7708566188812256
Tweet id: 684177701129875456. Request duration: 0.7578449249267578
Tweet id: 684147889187209216. Request duration: 0.7715141773223877
Tweet id: 684122891630342144. Request duration: 0.7771768569946289
Tweet id: 684097758874210310. Request duration: 0.783585786819458
Tweet id: 683857920510050305. Request duration: 0.7739417552947998
Tweet id: 683852578183077888. Request duration: 0.7719550132751465
Tweet id: 683849932751646720. Request duration: 0.7628962993621826
Tweet id: 683834909291606017. Request duration: 0.7833549976348877
Tweet id: 683828599284170753. Request duration: 0.7869939804077148
Tweet id: 683773439333797890. Request duration: 0.7758574485778809
Tweet id: 683742671509258241. Request duration: 0.7785496711730957
Tweet id: 683515932363329536. Request duration: 0.7836637496948242
Tweet id: 683498322573824003. Request duration: 0.7874500751495361
Tweet id: 683481228088049664. Request duration: 0.7662117481231

Tweet id: 678675843183484930. Request duration: 0.7880845069885254
Tweet id: 678643457146150913. Request duration: 0.7873013019561768
Tweet id: 678446151570427904. Request duration: 0.8014123439788818
Tweet id: 678424312106393600. Request duration: 0.7883830070495605
Tweet id: 678410210315247616. Request duration: 0.7720284461975098
Tweet id: 678399652199309312. Request duration: 0.7863495349884033
Tweet id: 678396796259975168. Request duration: 0.7750058174133301
Tweet id: 678389028614488064. Request duration: 0.7883832454681396
Tweet id: 678380236862578688. Request duration: 0.753455638885498
Tweet id: 678341075375947776. Request duration: 0.7793951034545898
Tweet id: 678334497360859136. Request duration: 0.7475659847259521
Tweet id: 678278586130948096. Request duration: 0.7623364925384521
Tweet id: 678255464182861824. Request duration: 0.7785253524780273
Tweet id: 678023323247357953. Request duration: 0.802976131439209
Tweet id: 678021115718029313. Request duration: 0.82395696640014

Rate limit reached. Sleeping for: 73


Tweet id: 676975532580409345. Request duration: 79.08842158317566
Tweet id: 676957860086095872. Request duration: 1.0997319221496582
Tweet id: 676949632774234114. Request duration: 0.8016350269317627
Tweet id: 676948236477857792. Request duration: 0.8159084320068359
Tweet id: 676946864479084545. Request duration: 0.8015284538269043
Tweet id: 676942428000112642. Request duration: 0.8154034614562988
Tweet id: 676936541936185344. Request duration: 0.8349182605743408
Tweet id: 676916996760600576. Request duration: 0.8015491962432861
Tweet id: 676897532954456065. Request duration: 0.8088629245758057
Tweet id: 676864501615042560. Request duration: 0.7842509746551514
Tweet id: 676821958043033607. Request duration: 0.7763857841491699
Tweet id: 676819651066732545. Request duration: 0.7748305797576904
Tweet id: 676811746707918848. Request duration: 0.7804780006408691
Tweet id: 676776431406465024. Request duration: 0.7591044902801514
Tweet id: 676617503762681856. Request duration: 0.7595744132995

Tweet id: 674082852460433408. Request duration: 0.8244466781616211
Tweet id: 674075285688614912. Request duration: 0.788644552230835
Tweet id: 674063288070742018. Request duration: 0.8569066524505615
Tweet id: 674053186244734976. Request duration: 0.8241920471191406
Tweet id: 674051556661161984. Request duration: 0.8409535884857178
Tweet id: 674045139690631169. Request duration: 0.8235330581665039
Tweet id: 674042553264685056. Request duration: 1.0340907573699951
Tweet id: 674038233588723717. Request duration: 1.1605839729309082
Tweet id: 674036086168010753. Request duration: 1.11562180519104
Tweet id: 674024893172875264. Request duration: 1.5397212505340576
Tweet id: 674019345211760640. Request duration: 0.9790449142456055
Tweet id: 674014384960745472. Request duration: 0.855694055557251
Tweet id: 674008982932058114. Request duration: 1.2363171577453613
Tweet id: 673956914389192708. Request duration: 1.2615246772766113
Tweet id: 673919437611909120. Request duration: 1.2572896480560303

Tweet id: 671520732782923777. Request duration: 0.7902603149414062
Tweet id: 671518598289059840. Request duration: 0.9865868091583252
Tweet id: 671511350426865664. Request duration: 0.81553053855896
Tweet id: 671504605491109889. Request duration: 0.8112878799438477
Tweet id: 671497587707535361. Request duration: 0.8132941722869873
Tweet id: 671488513339211776. Request duration: 0.7818787097930908
Tweet id: 671486386088865792. Request duration: 0.8425414562225342
Tweet id: 671485057807351808. Request duration: 0.8272252082824707
Tweet id: 671390180817915904. Request duration: 0.7988049983978271
Tweet id: 671362598324076544. Request duration: 0.842933177947998
Tweet id: 671357843010908160. Request duration: 0.8707618713378906
Tweet id: 671355857343524864. Request duration: 0.8267500400543213
Tweet id: 671347597085433856. Request duration: 0.8347973823547363
Tweet id: 671186162933985280. Request duration: 0.8416688442230225
Tweet id: 671182547775299584. Request duration: 0.843556880950927

Tweet id: 669353438988365824. Request duration: 1.3733305931091309
Tweet id: 669351434509529089. Request duration: 0.7913706302642822
Tweet id: 669328503091937280. Request duration: 0.8061091899871826
Tweet id: 669327207240699904. Request duration: 0.7793867588043213
Tweet id: 669324657376567296. Request duration: 0.7949738502502441
Tweet id: 669216679721873412. Request duration: 0.7812716960906982
Tweet id: 669214165781868544. Request duration: 0.7957267761230469
Tweet id: 669203728096960512. Request duration: 0.802926778793335
Tweet id: 669037058363662336. Request duration: 0.7982022762298584
Tweet id: 669015743032369152. Request duration: 0.8340697288513184
Tweet id: 669006782128353280. Request duration: 0.8970341682434082
Tweet id: 669000397445533696. Request duration: 0.8093233108520508
Tweet id: 668994913074286592. Request duration: 0.7696435451507568
Tweet id: 668992363537309700. Request duration: 0.7974152565002441
Tweet id: 668989615043424256. Request duration: 0.7993528842926

Tweet id: 667160273090932737. Request duration: 0.8165102005004883
Tweet id: 667152164079423490. Request duration: 1.0353317260742188
Tweet id: 667138269671505920. Request duration: 0.7917463779449463
Tweet id: 667119796878725120. Request duration: 0.8018467426300049
Tweet id: 667090893657276420. Request duration: 0.7777495384216309
Tweet id: 667073648344346624. Request duration: 0.781609058380127
Tweet id: 667070482143944705. Request duration: 1.0347261428833008
Tweet id: 667065535570550784. Request duration: 0.8016502857208252
Tweet id: 667062181243039745. Request duration: 0.8769495487213135
Tweet id: 667044094246576128. Request duration: 0.8090105056762695
Tweet id: 667012601033924608. Request duration: 0.7679715156555176
Tweet id: 666996132027977728. Request duration: 0.8230152130126953
Tweet id: 666983947667116034. Request duration: 0.815237283706665
Tweet id: 666837028449972224. Request duration: 0.8671259880065918
Tweet id: 666835007768551424. Request duration: 0.81598854064941

In [13]:
# Strip a set of characters from a string, Source: https://www.w3resource.com/python-exercises/string/python-data-type-string-exercise-41.php
def strip_chars(text, chars):
    # 'text' rather than 'str', so the built-in str type is not shadowed
    return "".join(c for c in text if c not in chars)

In [61]:
# Read data from tweet_json.txt and append each line to df_tweet_json_list in the form f'{tweet_id} {retweet_count} {favorite_count}'
df_tweet_json_list = []

# The file is only read here, so open it in read-only mode and iterate over it line by line
with open('../data/tweet_json.txt', mode='r', encoding='utf-8') as file:
    for line in file:
        line = strip_chars(line, '{}":\n')
        df_tweet_json_list.append(line)
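
Since each line of `tweet_json.txt` is a small JSON object, an alternative to stripping characters is to parse the lines with `json.loads`. The following is only a sketch of that idea, not the approach taken in this notebook; `sample_lines` stands in for the file and reuses the first two rows shown in the `df_tweet_json` preview, and the name `records` is illustrative.

```python
import json

# Two sample lines in the format written to tweet_json.txt
# (values taken from the df_tweet_json preview in this notebook).
sample_lines = [
    '{"892420643555336193": "7487 35458"}\n',
    '{"892177421306343426": "5557 30696"}\n',
]

records = []
for line in sample_lines:
    entry = json.loads(line)  # one-key dict: {tweet_id: "retweet_count favorite_count"}
    for tweet_id, counts in entry.items():
        retweet_count, favorite_count = counts.split()
        records.append({
            'tweet_id': int(tweet_id),
            'retweet_count': int(retweet_count),
            'favorite_count': int(favorite_count),
        })

print(records[0])
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame`, which would give the three integer columns without any regular expressions.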

In [62]:
# create df_tweet_json and populate its column 'tweet_json' with the content of df_tweet_json_list
df_tweet_json = pd.DataFrame(df_tweet_json_list, columns=['tweet_json'])

In [63]:
# Create tweet_id, retweet_count and favorite_count columns from the 'tweet_json' strings
# 'int64' is requested explicitly because tweet ids do not fit in a 32-bit integer
df_tweet_json['tweet_id'] = df_tweet_json.tweet_json.str.extract(r'([0-9]+(?=\s))', expand=False).astype('int64')

df_tweet_json['retweet_count'] = df_tweet_json.tweet_json.str.extract(r'((?<=\s)[0-9]+(?=\s))', expand=False).astype('int64')

df_tweet_json['favorite_count'] = df_tweet_json.tweet_json.str.extract(r'[\s]{1}([0-9]+)$', expand=False).astype('int64')
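
To see what each of the three patterns captures, here is a small sketch that applies them to a single cleaned line with Python's standard `re` module. The string `cleaned` is an assumed example of what `strip_chars` produces for the first tweet in the preview; `re.search` returns the first match, which is also what `str.extract` keeps.

```python
import re

# One line of tweet_json.txt after strip_chars has removed '{}":' and the newline
cleaned = '892420643555336193 7487 35458'

# First run of digits that is followed by whitespace: the tweet id
tweet_id = re.search(r'([0-9]+(?=\s))', cleaned).group(1)

# Digits with whitespace on both sides: the retweet count
retweet_count = re.search(r'((?<=\s)[0-9]+(?=\s))', cleaned).group(1)

# Digits after a single whitespace character at the end of the string: the favorite count
favorite_count = re.search(r'[\s]{1}([0-9]+)$', cleaned).group(1)

print(tweet_id, retweet_count, favorite_count)
```

All three captures are strings at this point, which is why the notebook converts them to integers afterwards.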

In [64]:
# drop column tweet_json
df_tweet_json.drop('tweet_json', axis='columns', inplace=True)

In [65]:
df_tweet_json.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,7487,35458
1,892177421306343426,5557,30696
2,891815181378084864,3680,23088
3,891689557279858688,7664,38741
4,891327558926688256,8265,37024


<a id='assessment'></a>
### Data assessment

#### Quality

##### `X` table

- First assessment
- Second assessment
- Third assessment

##### `Y` table 

- First assessment
- Second assessment
- Third assessment

##### `Z` table 

- First assessment
- Second assessment
- Third assessment

#### Tidiness

##### `X` table

- First assessment
- Second assessment
- Third assessment

##### `Y` table

- First assessment
- Second assessment
- Third assessment

##### `Z` table

- First assessment
- Second assessment
- Third assessment

<a id='cleaning'></a>
### Data cleaning

#### Quality

##### `X` table

###### Define

###### Code

###### Test

##### `Y` table

###### Define

###### Code

###### Test

##### `Z` table

###### Define

###### Code

###### Test

<a id='eda'></a>
## Exploratory Data Analysis 

<a id='conclusions'></a>
## Conclusions

<a id='limitations'></a>
## Limitations