## Gathering Data for this Project

Gather each of the three pieces of data as described below in a Jupyter Notebook titled `wrangle_act.ipynb`:
1. The WeRateDogs Twitter archive. This file is provided to you, so treat it as a file on hand. Download it manually by clicking the following link: twitter_archive_enhanced.csv

2. The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

3. Each tweet's retweet count and favorite ("like") count at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count. Note: do not include your Twitter API keys, secrets, and tokens in your project submission.
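
The line-by-line read described in item 3 might look like the sketch below. It assumes each line of the file holds one tweet's JSON with `id`, `retweet_count`, and `favorite_count` keys. The helper name `read_tweet_counts`, the sample records, and the temporary file are hypothetical and exist only to make the example self-contained.

```python
import json
import os
import tempfile

import pandas as pd


def read_tweet_counts(path):
    """Read a JSON-lines file and keep only the tweet ID and the two counts."""
    rows = []
    with open(path, encoding='utf-8') as json_file:
        for line in json_file:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            tweet = json.loads(line)
            rows.append({'tweet_id': tweet['id'],
                         'retweet_count': tweet['retweet_count'],
                         'favorite_count': tweet['favorite_count']})
    return pd.DataFrame(rows, columns=['tweet_id', 'retweet_count', 'favorite_count'])


# Hypothetical sample records standing in for real API responses.
sample_tweets = [
    {'id': 1001, 'retweet_count': 488, 'favorite_count': 2494, 'lang': 'en'},
    {'id': 1002, 'retweet_count': 46, 'favorite_count': 125, 'lang': 'en'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'tweet_json.txt')
    # Write one JSON object per line, as the brief requires.
    with open(sample_path, 'w', encoding='utf-8') as savefile:
        for tweet in sample_tweets:
            json.dump(tweet, savefile)
            savefile.write('\n')
    counts_df = read_tweet_counts(sample_path)

print(counts_df)
```

Selecting the three fields while reading keeps the DataFrame small; the full JSON remains available in the text file if more fields are needed later.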

In [49]:
import requests
import tweepy
import pandas as pd
import numpy as np
import os
import creds
import json
import time

### WeRateDogs Twitter Archive

In [35]:
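archive_csv = "twitter-archive-enhanced.csv"

# Move the manually downloaded file into the working directory if it is not already there.
if not os.path.exists(archive_csv):
    os.rename(os.path.join('Downloads', archive_csv), os.path.join(os.getcwd(), archive_csv))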

In [40]:
archive_df = pd.read_csv(archive_csv)
archive_df.head()

| | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | | | 2017-08-01 16:23:56 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Phineas. He's a mystical boy. Only eve... | | | | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | | | | |
| 1 | 892177421306343426 | | | 2017-08-01 00:17:27 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Tilly. She's just checking pup on you.... | | | | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | | | | |
| 2 | 891815181378084864 | | | 2017-07-31 00:18:03 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Archie. He is a rare Norwegian Pouncin... | | | | https://twitter.com/dog_rates/status/891815181... | 12 | 10 | Archie | | | | |
| 3 | 891689557279858688 | | | 2017-07-30 15:58:51 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Darla. She commenced a snooze mid meal... | | | | https://twitter.com/dog_rates/status/891689557... | 13 | 10 | Darla | | | | |
| 4 | 891327558926688256 | | | 2017-07-29 16:00:24 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Franklin. He would like you to stop ca... | | | | https://twitter.com/dog_rates/status/891327558... | 12 | 10 | Franklin | | | | |


### Tweet Image Predictions

In [16]:
predictions_url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"

In [23]:
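r = requests.get(predictions_url)
# Use a context manager so the file is closed once the bytes are written.
with open('image-predictions.tsv', 'wb') as f:
    n_bytes = f.write(r.content)
n_bytes

335079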

In [41]:
predictions_df = pd.read_csv('image-predictions.tsv', sep='\t')
predictions_df.head()

| | tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
| 1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.07201 | True |
| 2 | 666033412701032449 | https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg | 1 | German_shepherd | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True |
| 3 | 666044226329800704 | https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg | 1 | Rhodesian_ridgeback | 0.408143 | True | redbone | 0.360687 | True | miniature_pinscher | 0.222752 | True |
| 4 | 666049248165822465 | https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg | 1 | miniature_pinscher | 0.560311 | True | Rottweiler | 0.243682 | True | Doberman | 0.154629 | True |


### Twitter API Data

In [62]:
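# Authenticate with credentials kept in a local creds module (not submitted).
auth = tweepy.OAuthHandler(creds.consumer_key, creds.consumer_secret)
auth.set_access_token(creds.access_token, creds.access_secret)

# wait_on_rate_limit_notify is a Tweepy 3.x argument; it was removed in Tweepy 4.0.
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Map each tweet ID that could not be fetched to the exception raised.
fails_dict = {}

start = time.time()
with open('tweet_json.txt', 'w') as savefile:
    # The IDs come from predictions_df here; the brief refers to the archive's tweet IDs.
    # Series.items() replaces iteritems(), which pandas 2.0 removed.
    for _, tweet_id in predictions_df.tweet_id.items():
        print(tweet_id)
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            # Write one tweet's JSON per line.
            json.dump(tweet._json, savefile)
            savefile.write('\n')
        except Exception as e:
            print("Fail")
            print(e)
            fails_dict[tweet_id] = e

end = time.time()
print("Total Run Time: {} Seconds".format(end-start))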

666020888022790149
Success
666029285002620928
Success
666033412701032449
Success
666044226329800704
Success
666049248165822465
Success
666050758794694657
Success
666051853826850816
Success
666055525042405380
Success
666057090499244032
Success
666058600524156928
Success
666063827256086533
Success
666071193221509120
Success
666073100786774016
Success
666082916733198337
Success
666094000022159362
Success
666099513787052032
Success
666102155909144576
Success
666104133288665088
Success
666268910803644416
Success
666273097616637952
Success
666287406224695296
Success
666293911632134144
Success
666337882303524864
Success
666345417576210432
Success
666353288456101888
Success
666362758909284353
Success
666373753744588802
Success
666396247373291520
Success
666407126856765440
Success
666411507551481857
Success
666418789513326592
Success
666421158376562688
Success
666428276349472768
Success
666430724426358785
Success
666435652385423360
Success
666437273139982337
Success
666447344410484738
Success
...
671518598289059840
Success
671520732782923777
Success
671528761649688577
Success
671533943490011136
Success
671536543010570240
Success
671538301157904385
Success
671542985629241344
Success
671544874165002241
Success
671547767500775424
Success
671561002136281088
Success
671729906628341761
Success
671735591348891648
Success
671743150407421952
Success
671744970634719232
Success
671763349865160704
Success
671768281401958400
Success
671789708968640512
Success
671855973984772097
Success
671866342182637568
Success
671874878652489728
Success
671879137494245376
Success
671882082306625538
Success
671891728106971137
Success
671896809300709376
Success
672068090318987265
Success
672082170312290304
Success
672095186491711488
Success
672125275208069120
Success
672139350159835138
Success
672160042234327040
Success
672169685991993344
Success
672205392827572224
Success
672222792075620352
Success
672231046314901505
Success
672239279297454080
Success
672245253877968896
Success
672248013293752320
...
Fail
[{'code': 144, 'message': 'No status found with that ID.'}]
680070545539371008
Success
680085611152338944
Success
680100725817409536
Success
680115823365742593
Success
680130881361686529
Success
680145970311643136
Success
680161097740095489
Success
680176173301628928
Success
680191257256136705
Success
680206703334408192
Success
680221482581123072
Success
680440374763077632
Success
680473011644985345
Success
680494726643068929
Success
680497766108381184
Success
680583894916304897
Success
680609293079592961
Success
680798457301471234
Success
680801747103793152
Success
680836378243002368
Success
680889648562991104
Success
680913438424612864
Success
680934982542561280
Success
680940246314430465
Success
680959110691590145
Success
680970795137544192
Success
681193455364796417
Success
681231109724700672
Success
681242418453299201
Success
681261549936340994
Success
681281657291280384
Success
681297372102656000
Success
681302363064414209
Success
681320187870711809
Success
...

Rate limit reached. Sleeping for: 640

...
700002074055016451
Success
700029284593901568
Success
700062718104104960
Success
700143752053182464
Success
700151421916807169
Success
700167517596164096
Success
700462010979500032
Success
700505138482569216
Success
700518061187723268
Success
700747788515020802
Success
700796979434098688
Success
700847567345688576
Success
700864154249383937
Success
700890391244103680
Success
701214700881756160
Success
701545186879471618
Success
701570477911896070
Success
701601587219795968
Success
701889187134500865
Success
701952816642965504
Success
701981390485725185
Success
702217446468493312
Success
702276748847800320
Success
702321140488925184
Success
702539513671897089
Success
702598099714314240
Success
702671118226825216
Success
702684942141153280
Success
702932127499816960
Success
703041949650034688
Success
703079050210877440
Success
703268521220972544
Success
703356393781329922
Success
703382836347330562
Success
703407252292673536
Success
703425003149250560
Success
703611486317502464
...
741793263812808706
Success
742150209887731712
Success
742161199639494656
Success
742385895052087300
Success
742423170473463808
Success
742465774154047488
Success
742528092657332225
Success
743210557239623680
Success
743222593470234624
Success
743253157753532416
Success
743510151680958465
Success
743545585370791937
Success
743595368194129920
Success
743609206067040256
Success
743895849529389061
Success
743980027717509120
Success
744234799360020481
Success
744334592493166593
Success
744709971296780288
Success
744971049620602880
Success
744995568523612160
Success
745057283344719872
Success
745314880350101504
Success
745422732645535745
Success
745433870967832576
Success
745712589599014916
Success
745789745784041472
Success
746056683365994496
Success
746131877086527488
Success
746369468511756288
Success
746507379341139972
Success
746726898085036033
Success
746790600704425984
Success
746818907684614144
Success
746872823977771008
Success
746906459439529985
Success
747103485104099331
...
785639753186217984
Success
785872687017132033
Success
785927819176054784
Success
786036967502913536
Success
786233965241827333
Success
786363235746385920
Success
786595970293370880
Success
786664955043049472
Success
786709082849828864
Success
786963064373534720
Success
787322443945877504
Success
787397959788929025
Success
787717603741622272
Success
787810552592695296
Success
788039637453406209
Success
788070120937619456
Success
788150585577050112
Success
788178268662984705
Success
788412144018661376
Success
788765914992902144
Success
788908386943430656
Success
789137962068021249
Success
789268448748703744
Success
789530877013393408
Success
789599242079838210
Success
789628658055020548
Success
789986466051088384
Success
790277117346975746
Success
790337589677002753
Success
790581949425475584
Success
790698755171364864
Success
790723298204217344
Success
790946055508652032
Success
790987426131050500
Success
791026214425268224
Success
791312159183634433
Success
791406955684368384
...

Rate limit reached. Sleeping for: 602

...
831939777352105988
Success
832032802820481025
Success
832040443403784192
Success
832215726631055365
Success
832273440279240704
Success
832369877331693569
Success
832397543355072512
Success
832636094638288896
Success
832757312314028032
Success
832769181346996225
Success
832998151111966721
Success
833124694597443584
Success
833479644947025920
Success
833722901757046785
Success
833826103416520705
Success
833863086058651648
Success
834086379323871233
Success
834167344700198914
Success
834209720923721728
Success
834458053273591808
Success
834574053763584002
Success
834786237630337024
Success
834931633769889797
Success
835152434251116546
Success
835172783151792128
Success
835264098648616962
Success
835297930240217089
Success
835574547218894849
Success
836001077879255040
Success
836260088725786625
Success
836380477523124226
Success
836677758902222849
Success
836753516572119041
Success
836989968035819520
Success
837012587749474308
Fail
[{'code': 144, 'message': 'No status found with th
...

In [73]:
# Read the saved file back, parsing one JSON object per line.
with open('tweet_json.txt') as json_file:
    data = [json.loads(line) for line in json_file]

In [74]:
twitter_df = pd.DataFrame.from_records(data)

In [76]:
twitter_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2062 entries, 0 to 2061
Data columns (total 28 columns):
contributors                     0 non-null object
coordinates                      0 non-null object
created_at                       2062 non-null object
display_text_range               2062 non-null object
entities                         2062 non-null object
extended_entities                2062 non-null object
favorite_count                   2062 non-null int64
favorited                        2062 non-null bool
full_text                        2062 non-null object
geo                              0 non-null object
id                               2062 non-null int64
id_str                           2062 non-null object
in_reply_to_screen_name          23 non-null object
in_reply_to_status_id            23 non-null float64
in_reply_to_status_id_str        23 non-null object
in_reply_to_user_id              23 non-null float64
in_reply_to_user_id_str          23 non-null object
...

In [77]:
twitter_df.head()

| | contributors | coordinates | created_at | display_text_range | entities | extended_entities | favorite_count | favorited | full_text | geo | ... | lang | place | possibly_sensitive | possibly_sensitive_appealable | retweet_count | retweeted | retweeted_status | source | truncated | user |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | Sun Nov 15 22:32:08 +0000 2015 | `[0, 131]` | `{'hashtags': [], 'symbols': [], 'user_mentions...` | `{'media': [{'id': 666020881337073664, 'id_str'...` | 2494 | False | Here we have a Japanese Irish Setter. Lost eye... | | ... | en | | False | False | 488 | False | | `<a href="http://twitter.com/download/iphone" r...` | False | `{'id': 4196983835, 'id_str': '4196983835', 'na...` |
| 1 | | | Sun Nov 15 23:05:30 +0000 2015 | `[0, 139]` | `{'hashtags': [], 'symbols': [], 'user_mentions...` | `{'media': [{'id': 666029276303482880, 'id_str'...` | 125 | False | This is a western brown Mitsubishi terrier. Up... | | ... | en | | False | False | 46 | False | | `<a href="http://twitter.com/download/iphone" r...` | False | `{'id': 4196983835, 'id_str': '4196983835', 'na...` |
| 2 | | | Sun Nov 15 23:21:54 +0000 2015 | `[0, 130]` | `{'hashtags': [], 'symbols': [], 'user_mentions...` | `{'media': [{'id': 666033409081393153, 'id_str'...` | 121 | False | Here is a very happy pup. Big fan of well-main... | | ... | en | | False | False | 43 | False | | `<a href="http://twitter.com/download/iphone" r...` | False | `{'id': 4196983835, 'id_str': '4196983835', 'na...` |
| 3 | | | Mon Nov 16 00:04:52 +0000 2015 | `[0, 137]` | `{'hashtags': [], 'symbols': [], 'user_mentions...` | `{'media': [{'id': 666044217047650304, 'id_str'...` | 286 | False | This is a purebred Piers Morgan. Loves to Netf... | | ... | en | | False | False | 135 | False | | `<a href="http://twitter.com/download/iphone" r...` | False | `{'id': 4196983835, 'id_str': '4196983835', 'na...` |
| 4 | | | Mon Nov 16 00:24:50 +0000 2015 | `[0, 120]` | `{'hashtags': [], 'symbols': [], 'user_mentions...` | `{'media': [{'id': 666049244999131136, 'id_str'...` | 103 | False | Here we have a 1949 1st generation vulpix. Enj... | | ... | en | | False | False | 41 | False | | `<a href="http://twitter.com/download/iphone" r...` | False | `{'id': 4196983835, 'id_str': '4196983835', 'na...` |


In [87]:
twitter_df_clean = twitter_df.copy()

In [88]:
twitter_df_clean.drop(['entities',
                       'extended_entities',
                       'contributors',
                       'coordinates',
                       'geo',
                       'place',
                       'retweeted_status',
                       'in_reply_to_screen_name',
                       'in_reply_to_status_id',
                       'in_reply_to_status_id_str',
                       'in_reply_to_user_id',
                       'in_reply_to_user_id_str'],
                     axis = 1,
                     inplace = True)

In [89]:
twitter_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2062 entries, 0 to 2061
Data columns (total 16 columns):
created_at                       2062 non-null object
display_text_range               2062 non-null object
favorite_count                   2062 non-null int64
favorited                        2062 non-null bool
full_text                        2062 non-null object
id                               2062 non-null int64
id_str                           2062 non-null object
is_quote_status                  2062 non-null bool
lang                             2062 non-null object
possibly_sensitive               2062 non-null bool
possibly_sensitive_appealable    2062 non-null bool
retweet_count                    2062 non-null int64
retweeted                        2062 non-null bool
source                           2062 non-null object
truncated                        2062 non-null bool
user                             2062 non-null object
dtypes: bool(6), int64(3), object(7)
memory usage: ...

In [90]:
twitter_df_clean.to_csv('twitter_data_clean.csv')

## Assessing Data
After gathering each of the above pieces of data, assess them visually and programmatically for quality and tidiness issues. Detect and document at least eight (8) quality issues and two (2) tidiness issues in your wrangle_act.ipynb Jupyter Notebook. To meet specifications, the issues that satisfy the Project Motivation (see the Key Points header on the previous page) must be assessed.

### Quality Issues
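- missing values: `contributors`, `coordinates`, and `geo` are entirely null, and the `in_reply_to_*` columns have only 23 non-null values out of 2062 rows
- `created_at` is stored as a string (object dtype) rather than a datetime
- `lang` is stored as a string (object dtype) rather than a categorical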
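
A programmatic check and fix for issues of this kind could be sketched as follows. The two `created_at` strings are copied from the `twitter_df.head()` output above; the small `sample_df` frame itself is a hypothetical stand-in, and treating `lang` as a categorical is one possible choice rather than a requirement.

```python
import pandas as pd

# Hypothetical two-row sample using column names from twitter_df.
sample_df = pd.DataFrame({
    'created_at': ['Sun Nov 15 22:32:08 +0000 2015',
                   'Sun Nov 15 23:05:30 +0000 2015'],
    'lang': ['en', 'en'],
    'geo': [None, None],
})

# Count missing values per column to document the first issue.
missing_counts = sample_df.isnull().sum()

# Parse the Twitter timestamp format into timezone-aware datetimes.
sample_df['created_at'] = pd.to_datetime(
    sample_df['created_at'], format='%a %b %d %H:%M:%S %z %Y', utc=True)

# Language codes repeat heavily, so a categorical dtype is one reasonable choice.
sample_df['lang'] = sample_df['lang'].astype('category')

print(missing_counts)
print(sample_df.dtypes)
```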

## Cleaning Data
Clean each of the issues you documented while assessing. Perform this cleaning in wrangle_act.ipynb as well. The result should be a high quality and tidy master pandas DataFrame (or DataFrames, if appropriate). Again, the issues that satisfy the Project Motivation must be cleaned.
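
One way to arrive at a single master DataFrame is to join the three gathered tables on the tweet ID. The sketch below uses hypothetical miniature tables. It assumes the archive and predictions tables key on `tweet_id` while the API table keys on `id`, as the column listings above suggest, and it picks inner joins as an illustrative choice.

```python
import pandas as pd

# Hypothetical miniature versions of the three gathered tables.
archive_sample = pd.DataFrame({'tweet_id': [1, 2, 3],
                               'rating_numerator': [13, 12, 11]})
predictions_sample = pd.DataFrame({'tweet_id': [1, 2],
                                   'p1': ['redbone', 'collie']})
api_sample = pd.DataFrame({'id': [1, 2, 3],
                           'retweet_count': [488, 46, 43],
                           'favorite_count': [2494, 125, 121]})

# The API table calls the key `id`; rename it so all tables share `tweet_id`.
api_sample = api_sample.rename(columns={'id': 'tweet_id'})

# Inner joins keep only tweets present in every table.
master_df = (archive_sample
             .merge(predictions_sample, on='tweet_id', how='inner')
             .merge(api_sample, on='tweet_id', how='inner'))

print(master_df)
```

An inner join drops tweets that lack an image prediction or API record; a left join from the archive would keep them with missing values instead.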

## Storing, Analyzing, and Visualizing Data for this Project
Store the clean DataFrame(s) in a CSV file with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, name these files appropriately. Additionally, you may store the cleaned data in a SQLite database (which is to be submitted as well if you do).
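
The storage step could be sketched as below, writing to a temporary directory so the example leaves no files behind. The `master_sample` frame, the database file name, and the table name `master` are assumptions made for illustration; only `twitter_archive_master.csv` comes from the brief.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Hypothetical cleaned table standing in for the real master DataFrame.
master_sample = pd.DataFrame({'tweet_id': [1, 2], 'retweet_count': [488, 46]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'twitter_archive_master.csv')
    db_path = os.path.join(tmp_dir, 'twitter_archive_master.db')  # assumed name

    # index=False avoids writing the row index as an extra unnamed column.
    master_sample.to_csv(csv_path, index=False)
    csv_roundtrip = pd.read_csv(csv_path)

    # Optional SQLite copy; the table name `master` is an assumption.
    conn = sqlite3.connect(db_path)
    try:
        master_sample.to_sql('master', conn, index=False, if_exists='replace')
        sql_roundtrip = pd.read_sql_query('SELECT * FROM master', conn)
    finally:
        conn.close()

print(csv_roundtrip.equals(master_sample), sql_roundtrip.equals(master_sample))
```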

Analyze and visualize your wrangled data in your wrangle_act.ipynb Jupyter Notebook. At least three (3) insights and one (1) visualization must be produced.
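
As a starting point for an insight, the relationship between favorites and retweets can be measured directly. The sketch below uses only the five `favorite_count` and `retweet_count` values shown in the `twitter_df.head()` output above; five rows are far too few to support a conclusion, so treat this as a pattern to apply to the full table. The `rt_per_fav` column name is hypothetical.

```python
import pandas as pd

# favorite_count and retweet_count for the five rows shown by twitter_df.head().
head_sample = pd.DataFrame({
    'favorite_count': [2494, 125, 121, 286, 103],
    'retweet_count': [488, 46, 43, 135, 41],
})

# Pearson correlation between the two engagement measures.
fav_rt_corr = head_sample['favorite_count'].corr(head_sample['retweet_count'])

# Retweets per favorite, as a rough per-tweet engagement ratio.
head_sample['rt_per_fav'] = (head_sample['retweet_count']
                             / head_sample['favorite_count'])

print(round(fav_rt_corr, 3))
print(head_sample)
```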

## Reporting for this Project
Create a 300-600 word written report called wrangle_report.pdf or wrangle_report.html that briefly describes your wrangling efforts. This is to be framed as an internal document.

Create a 250-word-minimum written report called act_report.pdf or act_report.html that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.

Both of these documents can be created in separate Jupyter Notebooks using the Markdown functionality of Jupyter Notebooks, then downloading those notebooks as PDF files or HTML files (see image below). You might prefer to use a word processor like Google Docs or Microsoft Word, however.