# Analyze Tweet Data
## Part 1: Data Wrangling

## Table of Contents
- [Introduction](#intro)
- [Gathering Data](#gather)
- [Assessing Data](#assess)
- [Cleaning Data](#clean)
- [Conclusion](#conclusion)
- [Reference](#reference)

<a id='intro'></a>
## Introduction
Cleaning and preparing data makes up the bulk (approximately 80%) of data analysis (Dasu and Johnson 2003). In fact, data often must be cleaned and prepared multiple times over the course of an analysis, as new questions arise from a series of observations or new data is gathered to address these questions (Wickham 2014). To explore the range of methods for _cleaning and preparing data_ and to understand the significance of this process in data analysis, this project on analyzing Tweets from [_WeRateDogs (@dog_rates)_](https://twitter.com/dog_rates) was divided into two sections, __Data Wrangling__ and __Exploratory Data Analysis and Data Visualization__.

This document covers the first section, the _data wrangling_ phase of analyzing Tweets, which encompasses gathering, assessing, and cleaning data.
* Each of the three sources of data on Tweets from _WeRateDogs (@dog_rates)_ was gathered with a different method: manually downloading available data, sending a GET request to the URL that hosts the data, and querying the Twitter API for Tweet data.
* Then, these datasets were assessed for two types of issues, quality and tidiness.
* The identified issues were addressed by defining a cleaning operation for each issue, translating each plan into code, running the code, and verifying that the issue was resolved.
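
The define, code, and test cycle described in the last bullet can be sketched on a toy record set. The records and the two issues below are made up for illustration, and the sketch uses plain Python rather than the dataframes used in this project.

```python
# toy records standing in for a dataset (values are illustrative only)
records = [
    {'tweet_id': 1001, 'name': 'None'},
    {'tweet_id': 1002, 'name': 'Tilly'},
]

# define: tweet_id should be a string, and the placeholder 'None'
#         should become a real missing value
# code: build cleaned rows so that the original records stay untouched
cleaned = [
    {'tweet_id': str(row['tweet_id']),
     'name': None if row['name'] == 'None' else row['name']}
    for row in records
]

# test: verify that each issue is resolved
assert all(isinstance(row['tweet_id'], str) for row in cleaned)
assert all(row['name'] != 'None' for row in cleaned)
print(cleaned)
```

Keeping the verification as explicit assertions means a failed cleaning step stops the run instead of passing unnoticed.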

At the end of this phase, the cleaned datasets were stored as separate files so that _functional_ datasets, structured in formats that permit analysis, are readily available for the second phase of the data analysis, _exploratory data analysis (EDA) and data visualization_. This second section of the project, which explores and augments the data and derives insights from visualizations, is covered in a separate document.

All packages needed for wrangling the Tweet data from _WeRateDogs (@dog_rates)_ are imported below.

In [1]:
import pandas as pd
import requests
import os
import twitter_api as t_api
from timeit import default_timer as timer
import json
import copy
import numpy as np

<a id='gather'></a>
## Gathering Data

### 1. Enhanced Twitter Archive
[Udacity](http://udacity.com) was provided with the _WeRateDogs_ Twitter archive, which contains the basic data for all 5,000+ Tweets. Udacity _enhanced_ this original dataset by extracting dog ratings, dog names, and dog _stages_ from each Tweet's text. As part of this enhancement, the original dataset was filtered down to 2,356 rows, which include only the Tweets that appear to mention dog ratings. The enhanced version was made available for download and was imported from the manually downloaded file as shown below.

In [2]:
# enhanced twitter archive data
df_archive = pd.read_csv('data/twitter_archive_enhanced.csv')
df_archive.head(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,


### 2. Image Predictions
Udacity ran the images included in the Tweets from the enhanced Twitter archive through a [neural network](https://www.youtube.com/watch?v=2-Ol7ZB0MmU) and produced the top three predictions of each dog's breed. The `.tsv` file that contains these predictions is hosted on Udacity's servers ([URL](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv)). Python's [Requests](http://docs.python-requests.org/en/master/) library was used to submit a GET request to this URL and download the file programmatically.

In [3]:
# GET request URL
get_url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

# response from GET request
response = requests.get(get_url)

# .tsv file name
tsv_file = os.path.join('data', get_url.split('/')[-1])

# store response data
    # 1. create a file handle from the .tsv file name
    # 2. open the file handle in writing mode
    # 3. write the binary response to the file
with open(tsv_file, mode = 'wb') as file:
    file.write(response.content)
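
The download step above can also be wrapped in a small helper that stops on HTTP errors before anything is written. This is a minimal sketch rather than the code used in this project: the function names, the 30-second timeout, and the example URL are assumptions, and the demonstration at the bottom writes made-up bytes so that it runs without network access.

```python
import os
import tempfile

def save_download(content, url, folder):
    """Write raw bytes to folder, naming the file after the last URL segment."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, url.split('/')[-1])
    with open(path, mode='wb') as file:
        file.write(content)
    return path

def download(url, folder):
    """Fetch url and store the response body; raises on HTTP errors (4xx/5xx)."""
    import requests  # imported lazily so the helper above also works offline
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return save_download(response.content, url, folder)

# offline demonstration with made-up bytes and a hypothetical URL
demo_dir = tempfile.mkdtemp()
demo_path = save_download(b'tweet_id\tjpg_url\n',
                          'https://example.com/files/demo.tsv', demo_dir)
print(os.path.basename(demo_path))
```

Calling `raise_for_status()` before writing avoids saving an error page under the name of the expected data file.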

In [4]:
# verify successful download of .tsv file
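tsv_name = os.path.basename(tsv_file) # file name without the folder, on any operating system
tsv_check = [file for file in os.listdir('data') if file == tsv_name]

if len(tsv_check) == 1:
    print('Successfully downloaded {}'.format(tsv_name))
else: # tsv_check is empty here, so the name is taken from tsv_name
    print('Downloading {} failed'.format(tsv_name))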

Successfully downloaded image-predictions.tsv


In [5]:
# import .tsv data
df_image = pd.read_csv(tsv_file, sep = '\t')
df_image.head(3)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True


### 3. Additional Tweet Data
Additional Tweet data that was omitted when the Twitter archive was enhanced was gathered by querying Twitter's API with Python's [Tweepy](http://www.tweepy.org/) library for each Tweet ID in the enhanced Twitter archive. The JSON-formatted data was _dumped_ to a `.txt` file, one Tweet per line. Because the retweet and favorite counts were the only additional data the author intended to analyze, and the previous datasets had been imported as dataframe objects, only these two counts were extracted for each Tweet ID from the `.txt` file and used to create a dataframe object.

Note that the Twitter API wrapper was imported from a separate Python file, `twitter_api.py`, because this file contains private keys and tokens unique to the author's Twitter Developer account. The code in this file that returns the Twitter API wrapper, with the actual keys and tokens removed, is available for review in the `README.md` file.

In [6]:
# twitter API wrapper from twitter_api.py
api = t_api.twitter_api()

# input: tweet ids from enhanced twitter archive
tweet_ids = df_archive.tweet_id

In [7]:
# dict-object containing failed tweet ids and error details
error_dict = {}

# track progress of iteration over all tweets in archive
count = 0
start = timer()

# query Twitter API for JSON data
with open('data/tweet_json.txt', 'w') as outfile:
    for tweet_id in tweet_ids:
        count += 1
        print('Tweet ID: {} (Count: {})'.format(tweet_id, count)) # tweet id being queried
        try: # query with tweet id and obtain status object
            api_data = api.get_status(tweet_id, tweet_mode = 'extended')
        except t_api.tweepy.TweepError as e: # failed query
            print(' - Fail\n')
            error_dict[tweet_id] = e
        else: # write JSON data to .txt file
            print(' - Success\n')
            json.dump(api_data._json, outfile)
            outfile.write('\n') # change line for the next tweet id
end = timer()
print('Query took {} minutes.'.format(round((end-start)/60, 2))) # duration of iteration
print('Query failed for {} tweets.'.format(len(error_dict)))

Tweet ID: 892420643555336193 (Count: 1)
 - Success

Tweet ID: 892177421306343426 (Count: 2)
 - Success

Tweet ID: 891815181378084864 (Count: 3)
 - Success

[... output truncated ...]

Tweet ID: 861769973181624320 (Count: 156)
 - Fail

[... output truncated ...]

Rate limit reached. Sleeping for: 713

[... output truncated ...]

Rate limit reached. Sleeping for: 719

[... output truncated ...]

Tweet ID: 667530908589760512 (Count: 2267)
 - Success

[... output truncated ...]
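
The query loop above relies on Python's `try`/`except`/`else` structure: the `else` branch runs only when no exception was raised, so a Tweet is written out only after a successful query. The sketch below reproduces that control flow with a stand-in client. `FakeAPI`, `LookupFailed`, and the IDs are hypothetical, and nothing in it calls Twitter.

```python
import io
import json

class LookupFailed(Exception):
    """Stand-in for the error raised by the real API wrapper."""

class FakeAPI:
    """Hypothetical client that knows only two tweets."""
    def __init__(self):
        self._tweets = {1: {'id': 1, 'retweet_count': 5},
                        3: {'id': 3, 'retweet_count': 7}}

    def get_status(self, tweet_id):
        if tweet_id not in self._tweets:
            raise LookupFailed('no status found for {}'.format(tweet_id))
        return self._tweets[tweet_id]

api = FakeAPI()
errors = {}
outfile = io.StringIO()  # in-memory replacement for the .txt file

for tweet_id in [1, 2, 3]:
    try:
        data = api.get_status(tweet_id)
    except LookupFailed as e:  # failed query: record the ID and the reason
        errors[tweet_id] = e
    else:  # successful query: one JSON document per line
        json.dump(data, outfile)
        outfile.write('\n')

lines = outfile.getvalue().splitlines()
print(len(lines), 'written;', len(errors), 'failed:', sorted(errors))
```

Collecting the failures in a dictionary keyed by ID keeps the loop running and leaves a record of which Tweets to investigate afterwards.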

In [8]:
# list-object containing counts of re-tweets and favorites for each tweet
json_list = []
json_attributes = ['id', 'retweet_count','favorite_count']

# append each JSON data as dict-object to the list
with open('data/tweet_json.txt') as file:
    for line in file:
        json_data = json.loads(line) # json.loads ignores the trailing \n
        # keep only the attributes of interest
        json_list.append({attribute: json_data[attribute] for attribute in json_attributes})

# create dataframe from the list of dict-objects containing JSON data
df_json = pd.DataFrame(data = json_list, columns = json_attributes)
df_json.head(3)

Unnamed: 0,id,retweet_count,favorite_count
0,892420643555336193,8219,37722
1,892177421306343426,6077,32385
2,891815181378084864,4020,24391
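
Because each line of the `.txt` file holds one complete JSON document, the file can be parsed line by line and reduced to the fields of interest. The sketch below does this on made-up data (the two records, the in-memory file, and the requested IDs are assumptions for illustration) and also compares the requested IDs with the returned ones to list the Tweets whose query failed.

```python
import io
import json

# in-memory stand-in for tweet_json.txt with made-up values
fake_file = io.StringIO(
    '{"id": 11, "retweet_count": 3, "favorite_count": 9, "lang": "en"}\n'
    '{"id": 13, "retweet_count": 0, "favorite_count": 2, "lang": "en"}\n'
)
wanted = ['id', 'retweet_count', 'favorite_count']

rows = []
for line in fake_file:
    record = json.loads(line)  # the trailing newline is ignored by json.loads
    rows.append({key: record[key] for key in wanted})

requested_ids = [11, 12, 13]  # hypothetical IDs sent to the API
returned_ids = {row['id'] for row in rows}
missing_ids = [i for i in requested_ids if i not in returned_ids]

print(rows)
print('missing:', missing_ids)
```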


<a id='assess'></a>
## Assessing Data

The three datasets collected in __Gathering Data__ were assessed for quality issues, which concern the content of the data, and tidiness issues, which concern the structure of the data. Only the observations that call for cleaning in __Cleaning Data__ were documented.

In this step of the data wrangling phase, a copy of each dataset was created and assessed. The original datasets could equally be assessed directly, in which case creating the copies can be postponed to the cleaning step.

In [9]:
# copy of twitter_archive_enhanced.csv
# df_archive = pd.read_csv('data/twitter_archive_enhanced.csv')
df_archive_clean = df_archive.copy()

# copy of image-predictions.tsv 
# df_image = pd.read_csv('data/image-predictions.tsv', sep = '\t')
df_image_clean = df_image.copy()

# copy of JSON data from tweet_json.txt
df_json_clean = df_json.copy()
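
Working on copies keeps the gathered data intact, so a cleaning step that goes wrong can be redone without gathering the data again. The plain-Python sketch below shows the difference between an alias and a copy with made-up values; `DataFrame.copy()` plays the same role for the dataframes above.

```python
import copy

original = [{'tweet_id': 1, 'name': 'None'}]

alias = original                       # same object under a second name
independent = copy.deepcopy(original)  # separate object with the same content

independent[0]['name'] = 'Tilly'       # edit the copy only
print(original[0]['name'])             # prints None: the original is unchanged

alias[0]['name'] = 'Archie'            # edit through the alias
print(original[0]['name'])             # prints Archie: the original changed too
```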

### 1. Enhanced Twitter Archive
1. Data type for the `tweet_id` column is integer, instead of string (object). \[see _assessment 1.2_\]
2. The two columns `in_reply_to_status_id` and `in_reply_to_user_id` appear to be outside the scope of the interest of this project. \[see _assessments 1.1 and 1.2_ and documentation on [Tweet Object](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html)\]
3. Data type for the `timestamp` column is object, instead of datetime. \[see _assessment 1.2_\]
4. The four possible values of the `source` column not only indicate the source of the tweet but also include HTML tags and attributes. \[see _assessment 1.3_\]
5. 181 tweets were created by re-tweeting existing tweets. \[see _assessments 1.4.1 ~ 1.4.3_\]
6. The `expanded_urls` column does not show meaningful information besides the tweet id which is already listed under the `tweet_id` column. \[see _assessment 1.1_\]
7. 23 dog ratings extracted from the tweet text may be inaccurate: their values in the `rating_denominator` column are not 10. \[see _assessments 1.5.1, 1.5.2_\]
8. 109 names of dogs in the `name` column are inaccurate. \[see _assessments 1.6.1 ~ 1.6.4_\]
    * These names not only begin with a lowercase letter but are ordinary words such as "such", "a", and "quite".
    * A few of the tweet texts corresponding to these names do introduce the actual name of the dog in a phrase such as 'name is ~' or 'named ~'.
9. The five columns, `name`, `doggo`, `floofer`, `pupper`, and `puppo`, use the string "None" instead of `NaN` for missing values. \[see _assessment 1.1_\]
10. The four possible dog "stages", doggo, floofer, pupper, and puppo, appear as column headers although they are values of a single variable rather than variable names. \[see _assessment 1.1_\]

In [10]:
# assessment 1.1
df_archive_clean.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1343,704761120771465216,,,2016-03-01 20:11:59 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This pupper killed this great white in an epic...,,,,https://twitter.com/dog_rates/status/704761120...,13,10,,,,pupper,
911,757597904299253760,,,2016-07-25 15:26:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @jon_hill987: @dog_rates There is a cunning...,7.575971e+17,280479778.0,2016-07-25 15:23:28 +0000,https://twitter.com/jon_hill987/status/7575971...,11,10,,,,pupper,
1613,685315239903100929,,,2016-01-08 04:21:00 +0000,"<a href=""http://twitter.com/download/iphone"" r...",I would like everyone to appreciate this pup's...,,,,https://twitter.com/dog_rates/status/685315239...,11,10,,,,,
2244,667886921285246976,,,2015-11-21 02:07:05 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Erik. He's fucken massive. But also ki...,,,,https://twitter.com/dog_rates/status/667886921...,11,10,Erik,,,,
1812,676811746707918848,,,2015-12-15 17:11:09 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Say hello to Penny &amp; Gizmo. They are pract...,,,,https://twitter.com/dog_rates/status/676811746...,9,10,Penny,,,,


In [11]:
# assessment 1.2
df_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
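
Two of the observations from assessment 1.2 can be illustrated with values from the first row of the archive. Tweet IDs are labels rather than quantities, and a float64, the type reported above for the `in_reply_to_*` and `retweeted_status_*` ID columns, cannot hold every 18-digit ID exactly. Timestamps stored as text, in turn, cannot be compared or subtracted as dates. The sketch below uses only the standard library, and the format string is an assumption based on the timestamps shown in the samples.

```python
from datetime import datetime

tweet_id = 892420643555336193

# a float64 keeps 53 significant bits, so neighbouring 18-digit IDs can collapse
as_float = float(tweet_id)
print(int(as_float) == tweet_id)  # False: the last digit was rounded away
print(str(tweet_id))              # a string keeps every digit

# parse the archive's timestamp text into a timezone-aware datetime
raw = '2017-08-01 16:23:56 +0000'
parsed = datetime.strptime(raw, '%Y-%m-%d %H:%M:%S %z')
print(parsed.isoformat())
```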

In [12]:
# assessment 1.3: unique values for source column
for i, source in enumerate(df_archive_clean.source.unique()):
    print(i+1, source)

1 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
2 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
3 <a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>
4 <a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>
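
Each `source` value wraps a short label in an HTML anchor tag. One possible way to keep only the label is a regular expression that captures the text between the opening and closing tags. The pattern below is an assumption that fits the four values listed above; it is not a general HTML parser.

```python
import re

sources = [
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
]

# capture everything between the end of the opening tag and the closing tag
label_pattern = re.compile(r'>([^<]+)</a>')

labels = [label_pattern.search(source).group(1) for source in sources]
print(labels)
```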


In [13]:
# assessment 1.4.1: number of non-na's in retweeted_status_id column
df_archive_clean.retweeted_status_id.notna().sum()

181

In [14]:
# assessment 1.4.2: number of non-na's in retweeted_status_user_id column
df_archive_clean.retweeted_status_user_id.notna().sum()

181

In [15]:
# assessment 1.4.3: number of non-na's in retweeted_status_timestamp column
df_archive_clean.retweeted_status_timestamp.notna().sum()

181
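
The three counts agree, which suggests that the retweet columns are filled in together. A row-level check makes that explicit: a row counts as a retweet when `retweeted_status_id` is present, and the other two columns should be present on exactly the same rows. The rows below are made up, and `None` stands in for a missing value.

```python
rows = [
    {'tweet_id': '1', 'retweeted_status_id': None,
     'retweeted_status_user_id': None, 'retweeted_status_timestamp': None},
    {'tweet_id': '2', 'retweeted_status_id': '9',
     'retweeted_status_user_id': '42',
     'retweeted_status_timestamp': '2016-07-25 15:23:28 +0000'},
]
retweet_columns = ['retweeted_status_id', 'retweeted_status_user_id',
                   'retweeted_status_timestamp']

def filled(row):
    """Return, for each retweet column, whether this row holds a value."""
    return [row[column] is not None for column in retweet_columns]

# every row should have either all three columns filled or none of them
consistent = all(len(set(filled(row))) == 1 for row in rows)

# original tweets are the rows without a retweeted_status_id
originals = [row['tweet_id'] for row in rows if row['retweeted_status_id'] is None]

print(consistent, originals)
```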

In [16]:
# assessment 1.5.1: number of values in rating_denominator column besides 10
(df_archive_clean.rating_denominator != 10).sum()

23

In [17]:
# assessment 1.5.2: extract rating-like substring from tweet text

# sub-dataframe for rows with rating_denominator values besides 10 
df_ratings = df_archive_clean.query('rating_denominator != 10').loc[:,['text', 'rating_numerator', 'rating_denominator']]

# extract rating, if any, of format digit(s)/10
rating_format = '(\d+/10)'
df_ratings['rating'] = df_ratings.text.str.extract(pat = rating_format)

# compare given ratings and ratings extracted from text
# None removes the column-width limit (pandas >= 1.0; older versions used -1)
pd.set_option('display.max_colwidth', None)
df_ratings

Unnamed: 0,text,rating_numerator,rating_denominator,rating
313,"@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho",960,0,13/10
342,@docmisterio account started on 11/15/15,11,15,
433,The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd,84,70,
516,Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx,24,7,
784,"RT @dog_rates: After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https:/…",9,11,14/10
902,Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE,165,150,
1068,"After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ",9,11,14/10
1120,Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv,204,170,
1165,Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a,4,20,13/10
1202,This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq,50,50,11/10


In [18]:
# reset the option configured in assessment 1.5.2
pd.reset_option('display.max_colwidth')
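
The extraction step in assessment 1.5.2 can be sketched on a small scale as follows. The `toy` series below is made up for illustration and is not part of the archive. `str.extract` returns the first match of the capture group, or `NaN` when the text contains no substring of the form digit(s)/10.

```python
import pandas as pd

# hypothetical tweet texts, used only for illustration
toy = pd.Series(['Happy 4/20 from the squad! 13/10 for all',
                 'account started on 11/15/15'])

# first substring of the form digit(s)/10, if any
extracted = toy.str.extract(r'(\d+/10)', expand=False)
print(extracted.tolist())
```

In this sketch the first text yields a rating even though `4/20` appears earlier, while the second text yields `NaN` because no fraction in it has a denominator of 10.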

In [19]:
# assessment 1.6.1: number of values in name column that begin with a lowercase letter
df_archive_clean.name.str.extract(pat = '(^[a-z])').dropna().shape[0]

109

In [20]:
# assessment 1.6.2: cross-check of 1.6.1 by counting rows whose name is entirely lowercase
error_name = [name for name in df_archive_clean.name.unique() if name.lower() == name]
df_archive_clean.query('name in @error_name').shape[0]

109

In [21]:
# assessment 1.6.3: unique values in name column that are entirely lowercase
for name in error_name:
    print(name)

such
a
quite
not
one
incredibly
mad
an
very
just
my
his
actually
getting
this
unacceptable
all
old
infuriating
the
by
officially
life
light
space


In [22]:
# assessment 1.6.4: extract name-like string from tweet text

# sub-dataframe of rows with dog names which are ordinary words
df_names = df_archive_clean.query('name in @error_name').loc[:, ['text','name']]

# extract dog-name, if any, following 'named' or 'name is'
name_format = r'(?:named|name is)\s([A-Z][a-z]+)'
df_names['new_name'] = df_names.text.str.extract(pat = name_format)

# compare given names and names extracted from text
# None removes the column-width limit (pandas >= 1.0; older versions used -1)
pd.set_option('display.max_colwidth', None)
df_names

Unnamed: 0,text,name,new_name
22,I've yet to rate a Venezuelan Hover Wiener. This is such an honor. 14/10 paw-inspiring af (IG: roxy.thedoxy) https://t.co/20VrLAA8ba,such,
56,Here is a pupper approaching maximum borkdrive. Zooming at never before seen speeds. 14/10 paw-inspiring af \n(IG: puffie_the_chow) https://t.co/ghXBIIeQZF,a,
118,RT @dog_rates: We only rate dogs. This is quite clearly a smol broken polar bear. We'd appreciate if you only send dogs. Thank you... 12/10…,quite,
169,We only rate dogs. This is quite clearly a smol broken polar bear. We'd appreciate if you only send dogs. Thank you... 12/10 https://t.co/g2nSyGenG9,quite,
193,"Guys, we only rate dogs. This is quite clearly a bulbasaur. Please only send dogs. Thank you... 12/10 human used pet, it's super effective https://t.co/Xc7uj1C64x",quite,
335,There's going to be a dog terminal at JFK Airport. This is not a drill. 10/10 \nhttps://t.co/dp5h9bCwU7,not,
369,"Occasionally, we're sent fantastic stories. This is one of them. 14/10 for Grace https://t.co/bZ4axuH6OK",one,
542,We only rate dogs. Please stop sending in non-canines like this Freudian Poof Lion. This is incredibly frustrating... 11/10 https://t.co/IZidSrBvhi,incredibly,
649,Here is a perfect example of someone who has their priorities in order. 13/10 for both owner and Forrest https://t.co/LRyMrU7Wfq,a,
682,RT @dog_rates: Say hello to mad pupper. You know what you did. 13/10 would pet until no longer furustrated https://t.co/u1ulQ5heLX,mad,


In [23]:
# reset the option configured in assessment 1.6.4
pd.reset_option('display.max_colwidth')
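
As a minimal sketch of assessment 1.6.4, the same pattern is applied below to two made-up texts (the `toy` series is hypothetical). The non-capturing group `(?:named|name is)` anchors the search, and only the capitalized word that follows it is captured.

```python
import pandas as pd

# hypothetical tweet texts, used only for illustration
toy = pd.Series(['This is a pupper named Rufus. 12/10',
                 'This is quite clearly a polar bear. 12/10'])

# capitalized word following 'named' or 'name is', if any
new_name = toy.str.extract(r'(?:named|name is)\s([A-Z][a-z]+)', expand=False)
print(new_name.tolist())
```

Texts that never introduce a name after "named" or "name is" produce `NaN`, which is what later replaces the erroneous lowercase names.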

### 2. Image Predictions
1. Data type for the `tweet_id` column is integer, instead of string (object). \[see _assessment 2.2_\]
2. Each set of three columns listed below encodes the rank of the prediction (a value) in its column headers, so the headers are not variable names. \[see _assessment 2.1_\]
    * `p1`, `p2`, `p3`: 1st, 2nd, and 3rd predictions of a dog's breed
    * `p1_conf`, `p2_conf`, `p3_conf`: confidence in the 1st, 2nd, and 3rd predictions of a dog's breed
    * `p1_dog`, `p2_dog`, `p3_dog`: whether the 1st, 2nd, and 3rd predictions of a dog's breed are in fact breeds of dog

In [24]:
# assessment 2.1
df_image_clean.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
684,683852578183077888,https://pbs.twimg.com/media/CX2ISqSWYAAEtCF.jpg,1,toy_poodle,0.551352,True,teddy,0.180678,False,miniature_poodle,0.164095,True
193,669571471778410496,https://pbs.twimg.com/media/CUrLsI-UsAALfUL.jpg,1,minivan,0.873488,False,pickup,0.041259,False,beach_wagon,0.0154,False
1464,778408200802557953,https://pbs.twimg.com/media/Cs12ICuWAAECNRy.jpg,3,Pembroke,0.848362,True,Cardigan,0.108124,True,beagle,0.011942,True
370,672975131468300288,https://pbs.twimg.com/media/CVbjRSIWsAElw2s.jpg,1,pug,0.836421,True,Brabancon_griffon,0.044668,True,French_bulldog,0.03657,True
1951,863432100342583297,https://pbs.twimg.com/media/C_uG6eAUAAAvMvR.jpg,1,Staffordshire_bullterrier,0.690517,True,French_bulldog,0.10336,True,beagle,0.079489,True


In [25]:
# assessment 2.2
df_image_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


### 3. Additional Tweet Data
1. Unlike the two previous datasets, the name of the column for tweet IDs in this dataset is `id` instead of `tweet_id`. \[see _assessments 3.1 and 3.2_\]
2. Data type for the `id` column is integer, instead of string (object). \[see _assessment 3.2_\]

In [26]:
# assessment 3.1
df_json_clean.sample(5)

Unnamed: 0,id,retweet_count,favorite_count
341,831322785565769729,1638,9634
2307,666421158376562688,111,310
461,816450570814898180,8784,32254
970,749064354620928000,1612,4989
1470,693109034023534592,648,1784


In [27]:
# assessment 3.2
df_json_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2339 entries, 0 to 2338
Data columns (total 3 columns):
id                2339 non-null int64
retweet_count     2339 non-null int64
favorite_count    2339 non-null int64
dtypes: int64(3)
memory usage: 54.9 KB


<a id='clean'></a>
## Cleaning Data

### 1. Enhanced Twitter Archive
#### 1.1 Data type for the `tweet_id` column is integer, instead of string (object).
__1.1.1 Define__

Convert the data type of the `tweet_id` column from integer to string (object).

__1.1.2 Code__

In [28]:
df_archive_clean.tweet_id = df_archive_clean.tweet_id.astype(dtype = 'str')

__1.1.3 Test__

The data type for the `tweet_id` column was successfully converted to string (object).

In [29]:
df_archive_clean.tweet_id.dtype

dtype('O')
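
The notebook does not state why Tweet IDs are stored as strings. One common motivation, offered here as an assumption rather than the stated rationale, is that IDs are labels rather than quantities and that 18-digit integers do not survive an accidental conversion to float. A minimal sketch:

```python
# a Tweet ID that appears in the archive
tweet_id = 892420643555336193

# float64 carries only 53 bits of precision, so the round trip alters the value
round_trip = int(float(tweet_id))
print(round_trip == tweet_id)   # False

# a string keeps every digit unchanged
print(str(tweet_id))
```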

#### 1.2 The two columns `in_reply_to_status_id` and `in_reply_to_user_id` appear to be outside the scope of interest of this project.
__1.2.1 Define__

Remove the two columns `in_reply_to_status_id` and `in_reply_to_user_id` from the dataframe object `df_archive_clean`.

__1.2.2 Code__

In [30]:
df_archive_clean.drop(columns = ['in_reply_to_status_id', 'in_reply_to_user_id'], inplace = True)

__1.2.3 Test__

The two columns `in_reply_to_status_id` and `in_reply_to_user_id` were successfully removed.

In [31]:
df_archive_clean.columns

Index(['tweet_id', 'timestamp', 'source', 'text', 'retweeted_status_id',
       'retweeted_status_user_id', 'retweeted_status_timestamp',
       'expanded_urls', 'rating_numerator', 'rating_denominator', 'name',
       'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

#### 1.3 Data type for the `timestamp` column is object, instead of datetime.
__1.3.1 Define__

Convert the data type of the `timestamp` column from object to datetime.

__1.3.2 Code__

In [32]:
df_archive_clean.timestamp = pd.to_datetime(arg = df_archive_clean.timestamp)

__1.3.3 Test__

The data type of the `timestamp` column was successfully converted to datetime.

In [33]:
df_archive_clean.timestamp.dtype

datetime64[ns, UTC]

#### 1.4 The four possible values for the `source` column not only indicate the source of the tweet but also include HTML tags and attributes.
__1.4.1 Define__
* Remove the HTML tags and attributes from the values of the `source` column by mapping these values to their corresponding sources defined in a dictionary object.
* Convert the data type of the `source` column from string (object) to category.

__1.4.2 Code__

In [34]:
# empty dictionary for storing mapping of sources
source_dict = {}

for source in df_archive_clean.source.unique():
    source_dict[source] = source[source.find('>')+1:source.find('</a>')]
    # source.find('>')+1: starting position of source name
    # source.find('</a>'): ending position + 1 of source name

In [35]:
# map sources to the source names defined in the above dictionary
df_archive_clean['source'] = df_archive_clean['source'].map(source_dict)

In [36]:
# convert data type to category
df_archive_clean.source = df_archive_clean.source.astype(dtype = 'category')

__1.4.3 Test__
After the values of the `source` column were shortened by removing the HTML tags and attributes, the data type of the column was successfully converted to category.

In [37]:
# unique values for source column
for i, source in enumerate(df_archive_clean.source.unique()):
    print(i+1, source)

1 Twitter for iPhone
2 Twitter Web Client
3 Vine - Make a Scene
4 TweetDeck


In [38]:
df_archive_clean.source.dtype

CategoricalDtype(categories=['TweetDeck', 'Twitter Web Client', 'Twitter for iPhone',
                  'Vine - Make a Scene'],
                 ordered=False)
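
An alternative to the dictionary mapping, shown here only as a sketch on two of the observed `source` values, is to capture the anchor text directly with a regular expression. The names `toy` and `source_name` are illustrative.

```python
import pandas as pd

toy = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
])

# capture the text between the end of the opening tag and the closing </a>
source_name = toy.str.extract(r'>([^<]+)</a>', expand=False).astype('category')
print(source_name.tolist())
```

Both approaches give the same shortened labels; the regular expression avoids building the intermediate dictionary but assumes every value is a single anchor element.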

#### 1.5 181 tweets were created by re-tweeting existing tweets.
__1.5.1 Define__

1. Add a new column `retweet` with the value `True` for retweets and the value `False` for non-retweets.
2. Drop all three columns, `retweeted_status_id`, `retweeted_status_user_id`, and `retweeted_status_timestamp`.

__1.5.2 Code__

In [39]:
# add a new column retweet
df_archive_clean['retweet'] = df_archive_clean.retweeted_status_id.notna()

In [40]:
# remove three columns
df_archive_clean.drop(columns = ['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], inplace = True)

__1.5.3 Test__
* The `retweet` column shows 181 retweets and 2175 non-retweets.
* The three columns, `retweeted_status_id`, `retweeted_status_user_id`, and `retweeted_status_timestamp`, were successfully removed.

In [41]:
# numbers of retweets and non-retweets
df_archive_clean.retweet.value_counts()

False    2175
True      181
Name: retweet, dtype: int64

In [42]:
# remaining columns
df_archive_clean.columns

Index(['tweet_id', 'timestamp', 'source', 'text', 'expanded_urls',
       'rating_numerator', 'rating_denominator', 'name', 'doggo', 'floofer',
       'pupper', 'puppo', 'retweet'],
      dtype='object')

#### 1.6 The `expanded_urls` column does not show meaningful information besides the tweet id which is already listed under the `tweet_id` column.
__1.6.1 Define__

Remove the `expanded_urls` column.

__1.6.2 Code__

In [43]:
df_archive_clean.drop(columns = 'expanded_urls', inplace = True)

__1.6.3 Test__

The `expanded_urls` column was successfully removed.

In [44]:
df_archive_clean.columns

Index(['tweet_id', 'timestamp', 'source', 'text', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo',
       'retweet'],
      dtype='object')

#### 1.7 23 dog ratings extracted from the tweet are inaccurate in that these ratings have values for the `rating_denominator` column other than 10.
__1.7.1 Define__

* Convert all `NaN` values under the `rating` column of the dataframe object `df_ratings` to the string `0/0`.
* Correct the `rating_numerator` and `rating_denominator` columns in the original dataframe based on the values under the `rating` column of the dataframe object `df_ratings`.
    * Split the values under the `rating` column by the separator `/`.
    * Assign the first of the two split values to `rating_numerator` and the second to `rating_denominator`.
    * Convert the data type of the split values from string to integer.

__1.7.2 Code__

In [45]:
# convert NaN's to 0/0
df_ratings['rating'] = df_ratings['rating'].fillna(value = '0/0')

# split ratings by /, assign values to numerator and denominator, and convert dtype to int
df_archive_clean.loc[df_ratings.index, 'rating_numerator'] = df_ratings.rating.str.split('/').str[0].astype(dtype = 'int')
df_archive_clean.loc[df_ratings.index, 'rating_denominator'] = df_ratings.rating.str.split('/').str[1].astype(dtype = 'int')

__1.7.3 Test__

The only remaining rows with values for the `rating_denominator` column besides 10 are those for which no rating is available in the text of the original tweet. A _rating_ of `0/0` instead of `NaN` was assigned to all instances of this case.

In [46]:
(df_archive_clean.rating_denominator != 10).sum(), (df_ratings.rating == '0/0').sum()

(16, 16)
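
The split-and-convert step can be sketched on a few rating strings as follows. The `toy` series is made up, with `0/0` standing in for a tweet without a rating, and `expand=True` is used here so that both parts are obtained from a single split.

```python
import pandas as pd

# hypothetical rating strings; '0/0' marks a tweet with no rating in its text
toy = pd.Series(['13/10', '0/0', '14/10'])

# split once into two columns and convert both to integers
parts = toy.str.split('/', expand=True).astype(int)
parts.columns = ['rating_numerator', 'rating_denominator']
print(parts)
```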

#### 1.8 109 names of dogs in the `name` column are inaccurate.
__1.8.1 Define__

Replace the erroneous names in the `name` column with the values under the `new_name` column of the dataframe object `df_names`.

__1.8.2 Code__

In [47]:
df_archive_clean.loc[df_names.index, 'name'] = df_names['new_name']

__1.8.3 Test__

Erroneous names in the `name` column were successfully replaced with the correct names of the dogs. If the corresponding tweet did not introduce the name of the dog, then `NaN` was assigned, thus still replacing any erroneous names which were ordinary words.

In [48]:
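# note: shape[0] returns the number of rows covered (109 in both frames), not the number of NaN values;
# .isna().sum() would count the names that are missing after the replacement
df_archive_clean.loc[df_names.index, 'name'].isna().shape[0], df_names.new_name.isna().shape[0]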

(109, 109)

#### 1.9 The five columns, `name`, `doggo`, `floofer`, `pupper`, and `puppo`, use "None" instead of `NaN` for missing values.
__1.9.1 Define__

Replace all instances of "None" in the five columns with `NaN`.

__1.9.2 Code__

In [49]:
# columns to be corrected
column_list = ['name', 'doggo', 'floofer', 'pupper','puppo']

# number of instances of None in each column
for column in column_list:
    print(column, df_archive_clean[df_archive_clean[column] == "None"].shape[0])

name 745
doggo 2259
floofer 2346
pupper 2099
puppo 2326


In [50]:
# replace None with NaN
for column in column_list:
    df_archive_clean.loc[df_archive_clean[column] == "None", column] = np.nan

__1.9.3 Test__

No instance of "None" was found in any of the five columns after the replacement with `NaN`.

In [51]:
# number of instances of None remaining in each column
for column in column_list:
    print(column, df_archive_clean[df_archive_clean[column] == "None"].shape[0])

name 0
doggo 0
floofer 0
pupper 0
puppo 0
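
The same replacement can also be expressed without a loop. The sketch below uses a small made-up frame (`toy`) rather than the archive itself and relies on `DataFrame.replace` to swap the placeholder string for `NaN` in every column at once.

```python
import numpy as np
import pandas as pd

# hypothetical frame that uses the string "None" for missing values
toy = pd.DataFrame({'name': ['Phineas', 'None'],
                    'doggo': ['None', 'doggo']})

# replace the placeholder string with NaN in all columns
cleaned = toy.replace('None', np.nan)
print(cleaned.isna().sum().to_dict())
```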


#### 1.10 Four possible dog "stages", doggo, floofer, pupper, and puppo, are shown as column headers although these "stages" are not variable names.
__1.10.1.1 Define__

* Extract the stage from each row across the four columns `doggo`, `floofer`, `pupper`, and `puppo`.
* Assign the stage to a new column labeled as `stage`.
* Drop the four columns `doggo`, `floofer`, `pupper`, and `puppo`.

__1.10.1.2 Code__

In [52]:
# number of instances of each stage
df_archive_clean.loc[:, ['doggo', 'floofer', 'pupper', 'puppo']].notna().sum()

doggo       97
floofer     10
pupper     257
puppo       30
dtype: int64

In [53]:
# sub-dataframe of columns representing stages
df_stages = df_archive_clean.loc[:, ['doggo', 'floofer', 'pupper', 'puppo']]

# extract stage from the four columns in each row
df_archive_clean['stage'] = df_stages.apply(lambda x: ','.join(x.dropna()), axis = 1)

# replace empty values with NaN
df_archive_clean['stage'] = df_archive_clean['stage'].replace(to_replace = '', value = np.nan)

# drop the four columns
df_archive_clean.drop(columns = ['doggo', 'floofer', 'pupper', 'puppo'], inplace = True)

__1.10.1.3 Test__

* The four columns `doggo`, `floofer`, `pupper`, and `puppo` were successfully removed from the dataframe.
* For 14 tweets, more than one stage was observed.
    * Some instances of this case occurred because the image appears to include pictures of more than one dog.
    * In other instances, the stages were not specifically referring to a dog in that the image was not a picture of a dog.

In [54]:
# remaining columns
df_archive_clean.columns

Index(['tweet_id', 'timestamp', 'source', 'text', 'rating_numerator',
       'rating_denominator', 'name', 'retweet', 'stage'],
      dtype='object')

In [55]:
# number of instances of each stage
df_archive_clean['stage'].value_counts()

pupper           245
doggo             83
puppo             29
doggo,pupper      12
floofer            9
doggo,floofer      1
doggo,puppo        1
Name: stage, dtype: int64

In [56]:
# list-object containing unique stages which include multiple stages
multiple_stages = [stage_ for stage_ in df_archive_clean.stage.dropna().unique() if stage_.find(',') != -1]

# compare stages mentioned in the tweet with the extracted stages
# None removes the column-width limit (pandas >= 1.0; older versions used -1)
pd.set_option('display.max_colwidth', None)
df_archive_clean.loc[df_archive_clean.query('stage in @multiple_stages').index, ['text', 'stage']]

Unnamed: 0,text,stage
191,Here's a puppo participating in the #ScienceMarch. Cleverly disguising her own doggo agenda. 13/10 would keep the planet habitable for https://t.co/cMhq16isel,"doggo,puppo"
200,"At first I thought this was a shy doggo, but it's actually a Rare Canadian Floofer Owl. Amateurs would confuse the two. 11/10 only send dogs https://t.co/TXdT3tmuYk","doggo,floofer"
460,"This is Dido. She's playing the lead role in ""Pupper Stops to Catch Snow Before Resuming Shadow Box with Dried Apple."" 13/10 (IG: didodoggo) https://t.co/m7isZrOBX7","doggo,pupper"
531,Here we have Burke (pupper) and Dexter (doggo). Pupper wants to be exactly like doggo. Both 12/10 would pet at same time https://t.co/ANBpEYHaho,"doggo,pupper"
565,"Like doggo, like pupper version 2. Both 11/10 https://t.co/9IxWAXFqze","doggo,pupper"
575,This is Bones. He's being haunted by another doggo of roughly the same size. 12/10 deep breaths pupper everything's fine https://t.co/55Dqe0SJNj,"doggo,pupper"
705,This is Pinot. He's a sophisticated doggo. You can tell by the hat. Also pointier than your average pupper. Still 10/10 would pet cautiously https://t.co/f2wmLZTPHd,"doggo,pupper"
733,"Pupper butt 1, Doggo 0. Both 12/10 https://t.co/WQvcPEpH2u","doggo,pupper"
778,"RT @dog_rates: Like father (doggo), like son (pupper). Both 12/10 https://t.co/pG2inLaOda","doggo,pupper"
822,RT @dog_rates: This is just downright precious af. 12/10 for both pupper and doggo https://t.co/o5J479bZUC,"doggo,pupper"


In [57]:
df_archive_clean.stage.unique()

array([nan, 'doggo', 'puppo', 'pupper', 'floofer', 'doggo,puppo',
       'doggo,floofer', 'doggo,pupper'], dtype=object)

__1.10.2.1 Define__

* Replace all instances of multiple stages with `NaN` because, as noted above, these cases arise for different reasons and cannot be resolved to a single stage by one rule.
* Convert the data type of the `stage` column from string to category.

__1.10.2.2 Code__

In [58]:
# replace multiple stages with NaN
df_archive_clean.loc[df_archive_clean.query('stage in @multiple_stages').index, 'stage'] = np.nan

# convert data type to category
df_archive_clean.stage = df_archive_clean.stage.astype(dtype = 'category')

__1.10.2.3 Test__

The `stage` column was successfully converted to a categorical variable with four possible values: `pupper`, `doggo`, `puppo`, and `floofer`.

In [59]:
# number of instances of each stage
df_archive_clean.stage.value_counts()

pupper     245
doggo      83 
puppo      29 
floofer    9  
Name: stage, dtype: int64

In [60]:
# data type of stage column
df_archive_clean.stage.dtype

CategoricalDtype(categories=['doggo', 'floofer', 'pupper', 'puppo'], ordered=False)
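
The two steps of section 1.10 (collapsing the stage columns, then discarding ambiguous rows) can be sketched together on a made-up frame. The frame `toy` and its two columns are illustrative, and `Series.where` is used here in place of the `query`-based assignment above.

```python
import numpy as np
import pandas as pd

# hypothetical frame with two of the four stage columns
toy = pd.DataFrame({'doggo': ['doggo', np.nan, 'doggo', np.nan],
                    'pupper': [np.nan, 'pupper', 'pupper', np.nan]})

# join whichever stages are present in each row
stage = toy.apply(lambda row: ','.join(row.dropna()), axis=1)

# keep single stages only; empty and multiple stages become NaN
stage = stage.where(~stage.str.contains(',') & (stage != ''))
stage = stage.astype('category')
print(stage.tolist())
```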

### 2. Image Predictions

#### 2.1 Data type for the `tweet_id` column is integer, instead of string (object).
__2.1.1 Define__

Convert the data type of the `tweet_id` column from integer to string (object).

__2.1.2 Code__

In [61]:
df_image_clean.tweet_id = df_image_clean.tweet_id.astype(dtype = 'str')

__2.1.3 Test__

The data type of the `tweet_id` column was successfully converted to string (object).

In [62]:
df_image_clean.tweet_id.dtype

dtype('O')

#### 2.2 Three sets of three columns are shown as column headers although these are not variable names.
__2.2.1 Define__

* For each set of the three columns, unpivot the three columns, rank the predictions under a new variable `prediction`, and assign the values under a new value column.
    * Assign the values of `p1`, `p2`, `p3` to a new column `breed`.
    * Assign the values of `p1_conf`, `p2_conf`, `p3_conf` to a new column `confidence`.
    * Assign the values of `p1_dog`, `p2_dog`, `p3_dog` to a new column `dog_breed`.
* Join all three tidy datasets and assign the resulting object to the original dataframe object `df_image_clean`.

__2.2.2 Code__

In [63]:
# unpivot the dataframe for each set of three columns
df_breed = df_image_clean.melt(id_vars = ['tweet_id', 'jpg_url', 'img_num']
                               , value_vars = ['p1', 'p2', 'p3']
                               , var_name = 'prediction'
                               , value_name = 'breed')
df_confidence = df_image_clean.melt(id_vars = ['tweet_id', 'jpg_url', 'img_num']
                               , value_vars = ['p1_conf', 'p2_conf', 'p3_conf']
                               , var_name = 'prediction'
                               , value_name = 'confidence')
df_check = df_image_clean.melt(id_vars = ['tweet_id', 'jpg_url', 'img_num']
                               , value_vars = ['p1_dog', 'p2_dog', 'p3_dog']
                               , var_name = 'prediction'
                               , value_name = 'dog_breed')

In [64]:
# reconcile the ranks for predictions to p1, p2, and p3 for all three sub-dataframes
df_confidence.prediction = df_confidence.prediction.str.split('_').str[0]
df_check.prediction = df_check.prediction.str.split('_').str[0]

In [65]:
# merge first two sub-dataframes
df2_image = df_breed.merge(right = df_confidence
                           , on = ['tweet_id', 'jpg_url', 'img_num', 'prediction']
                           , how = 'inner')

# merge the last sub-dataframe
df3_image = df2_image.merge(right = df_check
                            , on = ['tweet_id', 'jpg_url', 'img_num', 'prediction']
                            , how = 'inner')

# assign the merged datasets to the original dataframe
df3_image.sort_values(by = ['tweet_id', 'img_num', 'prediction'], inplace = True)
df3_image.reset_index(inplace = True, drop = True)
df_image_clean = df3_image

__2.2.3 Test__

Three sets of three columns were successfully tidied from a wide format to a long format.

In [66]:
df_image_clean.head(20)

Unnamed: 0,tweet_id,jpg_url,img_num,prediction,breed,confidence,dog_breed
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,p1,Welsh_springer_spaniel,0.465074,True
1,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,p2,collie,0.156665,True
2,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,p3,Shetland_sheepdog,0.061428,True
3,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,p1,redbone,0.506826,True
4,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,p2,miniature_pinscher,0.074192,True
5,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,p3,Rhodesian_ridgeback,0.07201,True
6,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,p1,German_shepherd,0.596461,True
7,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,p2,malinois,0.138584,True
8,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,p3,bloodhound,0.116197,True
9,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,p1,Rhodesian_ridgeback,0.408143,True
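
A different route to the same long format, sketched below on a made-up two-rank frame, is to slice the columns of each rank, give them common names, and stack the slices. This is not the method used above; the names `wide`, `frames`, and `tidy` are illustrative.

```python
import pandas as pd

# hypothetical wide frame with two prediction ranks
wide = pd.DataFrame({
    'tweet_id': ['1', '2'],
    'p1': ['pug', 'minivan'], 'p1_conf': [0.8, 0.9], 'p1_dog': [True, False],
    'p2': ['beagle', 'pickup'], 'p2_conf': [0.1, 0.05], 'p2_dog': [True, False],
})

frames = []
for rank in ['p1', 'p2']:
    # take the three columns belonging to this rank and rename them
    part = wide[['tweet_id', rank, f'{rank}_conf', f'{rank}_dog']].copy()
    part.columns = ['tweet_id', 'breed', 'confidence', 'dog_breed']
    part.insert(1, 'prediction', rank)
    frames.append(part)

# stack the slices and order them by tweet and rank
tidy = (pd.concat(frames, ignore_index=True)
          .sort_values(['tweet_id', 'prediction'])
          .reset_index(drop=True))
print(tidy)
```

Stacking avoids the two merges on four key columns, at the cost of a short explicit loop over the ranks.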


### 3. Additional Tweet Data

#### 3.1 Data type for the `id` column is integer, instead of string (object).
__3.1.1 Define__

Convert the data type of the `id` column from integer to string (object).

__3.1.2 Code__

In [67]:
df_json_clean.id = df_json_clean.id.astype(dtype = 'str')

__3.1.3 Test__

The data type of the `id` column was successfully converted to string (object).

In [68]:
df_json_clean.id.dtype

dtype('O')

#### 3.2 Unlike the two previous datasets, the name of the column for tweet IDs in this dataset is `id` instead of `tweet_id`.
__3.2.1 Define__

Change the name of the `id` column to `tweet_id`.

__3.2.2 Code__

In [69]:
df_json_clean.rename(columns = {'id':'tweet_id'}, inplace = True)

__3.2.3 Test__

The name of the `id` column was successfully changed to `tweet_id`.

In [70]:
df_json_clean.columns

Index(['tweet_id', 'retweet_count', 'favorite_count'], dtype='object')

### 4. Merge and Store Cleaned Datasets
#### 4.1 Twitter Archive Data
__4.1.1 Define__

Merge the two dataframe objects `df_archive_clean` and `df_json_clean`, which were cleaned in the first and third sections of __Cleaning Data__, by the Tweet IDs shared between the two.

__4.1.2 Code__

In [71]:
# combine two datasets
df_archive_master = df_archive_clean.merge(right = df_json_clean
                                           , on = 'tweet_id'
                                           , how = 'inner')

__4.1.3 Test__

The two dataframe objects `df_archive_clean` and `df_json_clean` were successfully joined by the Tweet IDs shared between the two. The merged dataset has the same number of rows as `df_json_clean`.

In [72]:
df_archive_master.head()

Unnamed: 0,tweet_id,timestamp,source,text,rating_numerator,rating_denominator,name,retweet,stage,retweet_count,favorite_count
0,892420643555336193,2017-08-01 16:23:56+00:00,Twitter for iPhone,This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,13,10,Phineas,False,,8219,37722
1,892177421306343426,2017-08-01 00:17:27+00:00,Twitter for iPhone,"This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",13,10,Tilly,False,,6077,32385
2,891815181378084864,2017-07-31 00:18:03+00:00,Twitter for iPhone,This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB,12,10,Archie,False,,4020,24391
3,891689557279858688,2017-07-30 15:58:51+00:00,Twitter for iPhone,This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,13,10,Darla,False,,8373,41030
4,891327558926688256,2017-07-29 16:00:24+00:00,Twitter for iPhone,"This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f",12,10,Franklin,False,,9079,39227


In [73]:
# total number of rows in each individual dataset and the merged dataset
df_archive_clean.shape[0], df_json_clean.shape[0], df_archive_master.shape[0]

(2356, 2339, 2339)
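
The row counts imply that 17 archive rows (2356 - 2339) have no counterpart in `df_json_clean` and are therefore dropped by the inner join. One way to list such rows, sketched here on made-up frames, is a left join with `indicator=True`; the frames `archive` and `counts` are illustrative.

```python
import pandas as pd

# hypothetical stand-ins for the archive and the retweet/favorite counts
archive = pd.DataFrame({'tweet_id': ['1', '2', '3'], 'name': ['A', 'B', 'C']})
counts = pd.DataFrame({'tweet_id': ['1', '3'], 'retweet_count': [10, 30]})

# the _merge column records whether each archive row found a match
check = archive.merge(counts, on='tweet_id', how='left', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'tweet_id'].tolist()
print(unmatched)
```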

Having tested and verified that the merge was successfully completed, store both datasets `df_archive_clean` and `df_archive_master` as `.csv` files.

In [74]:
# clean twitter archive data w/out retweet and favorite counts
df_archive_clean.to_csv('data/twitter_archive_clean.csv', index = False)

# clean twitter archive data w/ retweet and favorite counts
df_archive_master.to_csv('data/twitter_archive_master.csv', index = False)

#### 4.2 Image Predictions
Store the cleaned dataset `df_image_clean` to a separate `.csv` file. Unlike the previous two datasets, this dataset was not merged because
* this dataset by itself represents a unique observational unit.
* merging this dataset with either of the other two datasets would duplicate rows of the other dataset without forming a distinct observational unit.

In [75]:
df_image_clean.to_csv('data/image_prediction_clean.csv', index = False)
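
Because a `.csv` file does not store data types, the conversions made during cleaning (string IDs, datetimes, categories) have to be restated when the files are read back. The sketch below assumes the second phase loads the files with `pd.read_csv`; it uses an in-memory buffer and a one-row made-up frame instead of the stored files.

```python
import io
import pandas as pd

# hypothetical one-row frame with a string ID and a timezone-aware timestamp
cleaned = pd.DataFrame({'tweet_id': ['892420643555336193'],
                        'timestamp': pd.to_datetime(['2017-08-01 16:23:56+00:00'])})
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# default read: the ID is inferred as an integer again
buffer.seek(0)
naive = pd.read_csv(buffer)

# explicit read: restate the intended types
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={'tweet_id': str}, parse_dates=['timestamp'])
print(naive['tweet_id'].dtype, type(typed['tweet_id'].iloc[0]).__name__)
```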

<a id='conclusion'></a>
## Conclusion

<a id='reference'></a>
## Reference
Dasu, T., & Johnson, T. (2003). _Exploratory Data Mining and Data Cleaning._ John Wiley & Sons.

Wickham, H. (2014). Tidy Data. _Journal of Statistical Software, 59_(10). doi:10.18637/jss.v059.i10