# WeRateDogs Project - Wrangle & Analyze Twitter Data

By Nagashri Nagaraj<br>
Date November 13 2018

## Introduction:
Goal of this project is to wrangle and analyze @WeRateDogs Twitter data thru' web scraping using Tweepy API, Requests library, downloading csv for gathering data. Then assess and clean the data to create interesting and trustworthy analyses and visualizations. 

## Gathering Data:
1. Download the given csv file manually: twitter_archive_enhanced.csv<br>
2. The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv<br>
3. Each tweet's retweet count and favorite ("like") count at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt file. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count.<br>

In [118]:
#import libraries

import numpy as np
import pandas as pd
import os # to download files from Udacity server
import requests # to download files from Udacity server
import tweepy 
import json
from tqdm import *
import datetime as dt
import re
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


In [119]:
#Read twitter-archive-enhanced.csv and store it as dataframe variable archive

archive = pd.read_csv("data/twitter-archive-enhanced.csv")


In [120]:
archive.sample(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1851,675710890956750848,,,2015-12-12 16:16:45 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Lenny. He was just told that he couldn...,,,,https://twitter.com/dog_rates/status/675710890...,12,10,Lenny,,,,
2182,668992363537309700,,,2015-11-24 03:19:43 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Harrison. He braves the snow like a ch...,,,,https://twitter.com/dog_rates/status/668992363...,8,10,Harrison,,,,


In [121]:

# Programatically download image prediction file from Udacity server using Requests library
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)

# Save data
with open('data/image-predictions.tsv', "wb") as file: 
    file.write(response.content)
    
# Import data
df_breeds = pd.read_csv('data/image-predictions.tsv', sep = "\t")

df_breeds.sample(2)


Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
2036,884441805382717440,https://pbs.twimg.com/media/DEYrIZwWsAA2Wo5.jpg,1,Pembroke,0.993225,True,Cardigan,0.003216,True,Chihuahua,0.002081,True
1558,793241302385262592,https://pbs.twimg.com/media/CwIougTWcAAMLyq.jpg,1,golden_retriever,0.559308,True,Labrador_retriever,0.390222,True,cocker_spaniel,0.036316,True


Consumer API keys KJ4snTTqprVl13xESdynv0FLD (API key)

2I8tD942i78jbYOmttOrarGfnHJwSCDct6V3611el6to5AJyt8 (API secret key)

755364732-Qeb0twZzwoOHSwdjy2WNEO6V9SxVjs9rjfylXSeV (Access token)

uXwh7cRrVOwzHOa9d9bXCjzI63tiXgLM98rGjqm62h5tO (Access token secret)

https://developer.twitter.com/en/apps/15941238

In [122]:
# Import data from Twitter API

# authentication pieces
consumer_key = "KJ4snTTqprVl13xESdynv0FLD"
consumer_secret = "2I8tD942i78jbYOmttOrarGfnHJwSCDct6V3611el6to5AJyt8"
access_token = "755364732-Qeb0twZzwoOHSwdjy2WNEO6V9SxVjs9rjfylXSeV"
access_secret = "uXwh7cRrVOwzHOa9d9bXCjzI63tiXgLM98rGjqm62h5tO"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

# Create connection to API
api = tweepy.API(auth, 
                 parser = tweepy.parsers.JSONParser(), 
                 wait_on_rate_limit = True, 
                 wait_on_rate_limit_notify = True)

# Create list of twitter ids
tweet_ids = archive["tweet_id"].astype(list)


In [123]:
# Download tweepy status object based on tweet_id in archive and store it in a list.
tweets_data = []

# Tweets that can't be found for the tweet_ids are saved in the list below
failed_ids = []

for tweet_id in tqdm(tweet_ids):
    try:
        tweets_data.append(api.get_status(row["tweet_id"], tweet_mode='extended'))
    except Exception as e:
        failed_ids.append((row["tweet_id"]))




  0%|                                                 | 0/2356 [00:00<?, ?it/s]

  0%|                                         | 2/2356 [00:00<06:11,  6.33it/s]

  0%|                                         | 4/2356 [00:00<05:11,  7.56it/s]

  0%|                                         | 5/2356 [00:00<06:21,  6.16it/s]

  0%|▏                                        | 8/2356 [00:00<05:06,  7.67it/s]

  0%|▏                                       | 10/2356 [00:01<05:12,  7.50it/s]

  1%|▏                                       | 12/2356 [00:01<04:36,  8.49it/s]

  1%|▏                                       | 13/2356 [00:01<06:04,  6.44it/s]

  1%|▎                                       | 15/2356 [00:01<04:53,  7.98it/s]

  1%|▎                                       | 17/2356 [00:01<04:14,  9.20it/s]

  1%|▎                                       | 19/2356 [00:02<05:01,  7.76it/s]

  1%|▎                                       | 21/2356 [00:02<05:05,  7.65it/s]

  1%|▍                    

  9%|███▍                                   | 209/2356 [00:15<02:17, 15.60it/s]

  9%|███▍                                   | 211/2356 [00:15<02:20, 15.21it/s]

  9%|███▌                                   | 213/2356 [00:15<02:20, 15.23it/s]

  9%|███▌                                   | 215/2356 [00:15<02:23, 14.93it/s]

  9%|███▌                                   | 217/2356 [00:15<02:17, 15.60it/s]

  9%|███▋                                   | 220/2356 [00:16<02:08, 16.64it/s]

  9%|███▋                                   | 222/2356 [00:16<02:13, 15.97it/s]

 10%|███▋                                   | 224/2356 [00:16<02:24, 14.71it/s]

 10%|███▋                                   | 226/2356 [00:16<02:30, 14.18it/s]

 10%|███▊                                   | 228/2356 [00:16<02:29, 14.21it/s]

 10%|███▊                                   | 230/2356 [00:16<02:28, 14.29it/s]

 10%|███▊                                   | 232/2356 [00:17<02:34, 13.73it/s]

 10%|███▊                   

 18%|██████▉                                | 416/2356 [00:28<01:57, 16.45it/s]

 18%|██████▉                                | 418/2356 [00:28<02:01, 15.92it/s]

 18%|██████▉                                | 420/2356 [00:28<02:04, 15.50it/s]

 18%|██████▉                                | 422/2356 [00:29<02:05, 15.36it/s]

 18%|███████                                | 424/2356 [00:29<02:17, 14.10it/s]

 18%|███████                                | 426/2356 [00:29<02:09, 14.85it/s]

 18%|███████                                | 428/2356 [00:29<02:07, 15.11it/s]

 18%|███████                                | 430/2356 [00:29<02:09, 14.82it/s]

 18%|███████▏                               | 432/2356 [00:29<02:11, 14.59it/s]

 18%|███████▏                               | 434/2356 [00:29<02:12, 14.50it/s]

 19%|███████▏                               | 436/2356 [00:30<02:15, 14.16it/s]

 19%|███████▎                               | 438/2356 [00:30<02:16, 14.02it/s]

 19%|███████▎               

 27%|██████████▎                            | 625/2356 [00:42<01:52, 15.32it/s]

 27%|██████████▍                            | 627/2356 [00:42<01:52, 15.38it/s]

 27%|██████████▍                            | 629/2356 [00:42<01:53, 15.17it/s]

 27%|██████████▍                            | 631/2356 [00:42<01:53, 15.20it/s]

 27%|██████████▍                            | 633/2356 [00:42<02:02, 14.12it/s]

 27%|██████████▌                            | 635/2356 [00:42<01:52, 15.24it/s]

 27%|██████████▌                            | 637/2356 [00:42<01:46, 16.13it/s]

 27%|██████████▌                            | 639/2356 [00:43<01:49, 15.75it/s]

 27%|██████████▌                            | 641/2356 [00:43<01:53, 15.07it/s]

 27%|██████████▋                            | 643/2356 [00:43<01:56, 14.76it/s]

 27%|██████████▋                            | 645/2356 [00:43<01:47, 15.98it/s]

 28%|██████████▋                            | 648/2356 [00:43<01:42, 16.72it/s]

 28%|██████████▊            

 35%|█████████████▊                         | 833/2356 [00:55<01:36, 15.73it/s]

 35%|█████████████▊                         | 835/2356 [00:55<01:32, 16.48it/s]

 36%|█████████████▊                         | 837/2356 [00:55<01:34, 16.06it/s]

 36%|█████████████▉                         | 839/2356 [00:55<01:33, 16.23it/s]

 36%|█████████████▉                         | 841/2356 [00:55<01:32, 16.32it/s]

 36%|█████████████▉                         | 843/2356 [00:55<01:33, 16.15it/s]

 36%|█████████████▉                         | 845/2356 [00:55<01:37, 15.43it/s]

 36%|██████████████                         | 847/2356 [00:56<01:31, 16.40it/s]

 36%|██████████████                         | 849/2356 [00:56<01:33, 16.16it/s]

 36%|██████████████                         | 851/2356 [00:56<01:28, 17.06it/s]

 36%|██████████████                         | 853/2356 [00:56<01:41, 14.86it/s]

 36%|██████████████▏                        | 855/2356 [00:56<01:37, 15.33it/s]

 36%|██████████████▏        

 44%|████████████████▊                     | 1039/2356 [01:08<01:22, 16.04it/s]

 44%|████████████████▊                     | 1041/2356 [01:08<01:22, 16.03it/s]

 44%|████████████████▊                     | 1043/2356 [01:08<01:23, 15.72it/s]

 44%|████████████████▊                     | 1045/2356 [01:08<01:24, 15.58it/s]

 44%|████████████████▉                     | 1047/2356 [01:08<01:22, 15.78it/s]

 45%|████████████████▉                     | 1049/2356 [01:08<01:27, 14.92it/s]

 45%|████████████████▉                     | 1051/2356 [01:08<01:25, 15.33it/s]

 45%|████████████████▉                     | 1053/2356 [01:09<01:22, 15.71it/s]

 45%|█████████████████                     | 1055/2356 [01:09<01:27, 14.95it/s]

 45%|█████████████████                     | 1057/2356 [01:09<01:25, 15.25it/s]

 45%|█████████████████                     | 1059/2356 [01:09<01:32, 14.03it/s]

 45%|█████████████████                     | 1061/2356 [01:09<01:28, 14.60it/s]

 45%|█████████████████▏     

 53%|████████████████████                  | 1241/2356 [01:21<01:07, 16.49it/s]

 53%|████████████████████                  | 1243/2356 [01:21<01:06, 16.75it/s]

 53%|████████████████████                  | 1245/2356 [01:21<01:07, 16.40it/s]

 53%|████████████████████                  | 1247/2356 [01:21<01:11, 15.59it/s]

 53%|████████████████████▏                 | 1250/2356 [01:21<01:05, 16.98it/s]

 53%|████████████████████▏                 | 1252/2356 [01:21<01:06, 16.51it/s]

 53%|████████████████████▏                 | 1254/2356 [01:22<01:05, 16.80it/s]

 53%|████████████████████▎                 | 1256/2356 [01:22<01:02, 17.56it/s]

 53%|████████████████████▎                 | 1258/2356 [01:22<01:02, 17.46it/s]

 53%|████████████████████▎                 | 1260/2356 [01:22<01:02, 17.67it/s]

 54%|████████████████████▎                 | 1262/2356 [01:22<01:02, 17.45it/s]

 54%|████████████████████▍                 | 1264/2356 [01:22<01:00, 18.14it/s]

 54%|████████████████████▍  

 61%|███████████████████████▎              | 1447/2356 [01:34<01:02, 14.43it/s]

 62%|███████████████████████▎              | 1449/2356 [01:34<00:59, 15.21it/s]

 62%|███████████████████████▍              | 1451/2356 [01:34<00:57, 15.73it/s]

 62%|███████████████████████▍              | 1453/2356 [01:34<00:57, 15.66it/s]

 62%|███████████████████████▍              | 1455/2356 [01:34<00:56, 15.95it/s]

 62%|███████████████████████▌              | 1457/2356 [01:34<00:55, 16.16it/s]

 62%|███████████████████████▌              | 1459/2356 [01:34<00:57, 15.66it/s]

 62%|███████████████████████▌              | 1461/2356 [01:35<00:57, 15.54it/s]

 62%|███████████████████████▌              | 1463/2356 [01:35<00:57, 15.56it/s]

 62%|███████████████████████▋              | 1465/2356 [01:35<00:58, 15.16it/s]

 62%|███████████████████████▋              | 1467/2356 [01:35<00:57, 15.51it/s]

 62%|███████████████████████▋              | 1469/2356 [01:35<00:56, 15.58it/s]

 62%|███████████████████████

 70%|██████████████████████████▌           | 1650/2356 [01:47<00:44, 15.95it/s]

 70%|██████████████████████████▋           | 1652/2356 [01:47<00:43, 16.28it/s]

 70%|██████████████████████████▋           | 1652/2356 [02:00<00:43, 16.28it/s]

 70%|█████████████████████████▎          | 1653/2356 [02:47<3:31:34, 18.06s/it]

 70%|█████████████████████████▎          | 1655/2356 [02:47<2:27:52, 12.66s/it]

 70%|█████████████████████████▎          | 1657/2356 [02:47<1:43:25,  8.88s/it]

 70%|█████████████████████████▎          | 1659/2356 [02:47<1:12:23,  6.23s/it]

 71%|██████████████████████████▊           | 1661/2356 [02:47<50:48,  4.39s/it]

 71%|██████████████████████████▊           | 1663/2356 [02:48<35:42,  3.09s/it]

 71%|██████████████████████████▊           | 1665/2356 [02:48<25:06,  2.18s/it]

 71%|██████████████████████████▉           | 1667/2356 [02:48<17:44,  1.55s/it]

 71%|██████████████████████████▉           | 1669/2356 [02:48<12:36,  1.10s/it]

 71%|███████████████████████

 79%|█████████████████████████████▉        | 1854/2356 [03:00<00:30, 16.34it/s]

 79%|█████████████████████████████▉        | 1856/2356 [03:00<00:30, 16.60it/s]

 79%|█████████████████████████████▉        | 1858/2356 [03:00<00:30, 16.34it/s]

 79%|██████████████████████████████        | 1860/2356 [03:00<00:32, 15.44it/s]

 79%|██████████████████████████████        | 1862/2356 [03:00<00:34, 14.46it/s]

 79%|██████████████████████████████        | 1864/2356 [03:01<00:35, 13.92it/s]

 79%|██████████████████████████████        | 1866/2356 [03:01<00:34, 14.39it/s]

 79%|██████████████████████████████▏       | 1868/2356 [03:01<00:33, 14.39it/s]

 79%|██████████████████████████████▏       | 1870/2356 [03:01<00:33, 14.33it/s]

 79%|██████████████████████████████▏       | 1872/2356 [03:01<00:31, 15.13it/s]

 80%|██████████████████████████████▏       | 1874/2356 [03:01<00:31, 15.42it/s]

 80%|██████████████████████████████▎       | 1876/2356 [03:01<00:30, 15.77it/s]

 80%|███████████████████████

 87%|█████████████████████████████████▏    | 2060/2356 [03:13<00:16, 17.92it/s]

 88%|█████████████████████████████████▎    | 2062/2356 [03:13<00:16, 18.30it/s]

 88%|█████████████████████████████████▎    | 2064/2356 [03:13<00:16, 17.40it/s]

 88%|█████████████████████████████████▎    | 2066/2356 [03:13<00:17, 16.70it/s]

 88%|█████████████████████████████████▎    | 2068/2356 [03:13<00:18, 15.31it/s]

 88%|█████████████████████████████████▍    | 2070/2356 [03:13<00:18, 15.16it/s]

 88%|█████████████████████████████████▍    | 2072/2356 [03:14<00:18, 15.51it/s]

 88%|█████████████████████████████████▍    | 2074/2356 [03:14<00:17, 16.07it/s]

 88%|█████████████████████████████████▍    | 2076/2356 [03:14<00:18, 15.28it/s]

 88%|█████████████████████████████████▌    | 2078/2356 [03:14<00:17, 15.56it/s]

 88%|█████████████████████████████████▌    | 2080/2356 [03:14<00:17, 15.40it/s]

 88%|█████████████████████████████████▌    | 2082/2356 [03:14<00:18, 14.98it/s]

 88%|███████████████████████

 96%|████████████████████████████████████▌ | 2265/2356 [03:26<00:05, 16.37it/s]

 96%|████████████████████████████████████▌ | 2267/2356 [03:26<00:05, 16.67it/s]

 96%|████████████████████████████████████▌ | 2269/2356 [03:26<00:05, 16.79it/s]

 96%|████████████████████████████████████▋ | 2271/2356 [03:26<00:05, 15.12it/s]

 96%|████████████████████████████████████▋ | 2273/2356 [03:27<00:05, 14.38it/s]

 97%|████████████████████████████████████▋ | 2275/2356 [03:27<00:05, 14.67it/s]

 97%|████████████████████████████████████▋ | 2277/2356 [03:27<00:05, 15.08it/s]

 97%|████████████████████████████████████▊ | 2279/2356 [03:27<00:05, 15.06it/s]

 97%|████████████████████████████████████▊ | 2281/2356 [03:27<00:04, 15.58it/s]

 97%|████████████████████████████████████▊ | 2283/2356 [03:27<00:04, 15.24it/s]

 97%|████████████████████████████████████▊ | 2286/2356 [03:27<00:04, 15.72it/s]

 97%|████████████████████████████████████▉ | 2288/2356 [03:28<00:04, 15.99it/s]

 97%|███████████████████████

In [124]:
print("The list of tweets" ,len(tweets_data))
print("The list of tweets no found" , len(failed_ids))

The list of tweets 16
The list of tweets no found 2340


In [125]:
#Then in this code block we isolate the json part of each tweepy 
#status object that we have downloaded and we add them all into a list

my_list_of_dicts = []
for each_json_tweet in tweets_data:
    my_list_of_dicts.append(each_json_tweet)

In [130]:
#we write this list into a txt file:

with open('tweet_json.txt', 'w') as file:
        file.write(json.dumps(my_list_of_dicts, indent=4))

In [131]:
tweet_json.sample(2)

Unnamed: 0,tweet_id,favorite_count,retweet_count,followers_count,friends_count,source,retweeted_status,url
307,821813639212650496,0,3659,7440039,10,Twitter for iPhone,This is a retweet,This is a retweet
326,819015337530290176,0,40291,7440039,10,Twitter for iPhone,This is a retweet,This is a retweet


In [129]:
df_api = pd.DataFrame(tweet_json, columns=list(tweets_data[0].keys()))

df_api.sample(8)

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,place,contributors,is_quote_status,retweet_count,favorite_count,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang
1,,,,,,,,,Twitter for iPhone,,...,,,,6190,32804,,,,,
303,,,,,,,,,Twitter for iPhone,,...,,,,3245,15757,,,,,
298,,,,,,,,,Twitter for iPhone,,...,,,,29,743,,,,,
45,,,,,,,,,Twitter for iPhone,,...,,,,3770,21013,,,,,
87,,,,,,,,,Twitter for iPhone,,...,,,,6368,28165,,,,,
241,,,,,,,,,Twitter for iPhone,,...,,,,40,0,,,,,
244,,,,,,,,,Twitter for iPhone,,...,,,,3037,16782,,,,,
209,,,,,,,,,Twitter for iPhone,,...,,,,354,1768,,,,,


In [88]:
#identify information of interest from JSON dictionaries in txt file
#and put it in a dataframe called tweet JSON
my_demo_list = []
with open('tweet_json.txt', encoding='utf-8') as json_file:  
    all_data = json.load(json_file)
    for each_dictionary in all_data:
        tweet_id = each_dictionary['id']
        whole_tweet = each_dictionary['text']
        only_url = whole_tweet[whole_tweet.find('https'):]
        favorite_count = each_dictionary['favorite_count']
        retweet_count = each_dictionary['retweet_count']
        followers_count = each_dictionary['user']['followers_count']
        friends_count = each_dictionary['user']['friends_count']
        whole_source = each_dictionary['source']
        only_device = whole_source[whole_source.find('rel="nofollow">') + 15:-4]
        source = only_device
        retweeted_status = each_dictionary['retweeted_status'] = each_dictionary.get('retweeted_status', 'Original tweet')
        if retweeted_status == 'Original tweet':
            url = only_url
        else:
            retweeted_status = 'This is a retweet'
            url = 'This is a retweet'
        my_demo_list.append({'tweet_id': str(tweet_id),
                             'favorite_count': int(favorite_count),
                             'retweet_count': int(retweet_count),
                             'followers_count': int(followers_count),
                             'friends_count': int(friends_count),
                             'url': url,
                             'source': source,
                             'retweeted_status': retweeted_status,
                            })
        tweet_json = pd.DataFrame(my_demo_list, columns = ['tweet_id', 'favorite_count','retweet_count', 
                                                           'followers_count', 'friends_count','source', 
                                                           'retweeted_status', 'url'])                    

KeyError: 'text'

In [89]:
tweet_json.sample(2)

Unnamed: 0,tweet_id,favorite_count,retweet_count,followers_count,friends_count,source,retweeted_status,url
296,823699002998870016,13415,2651,7440039,10,Twitter for iPhone,Original tweet,https://t.co/RsMs6iThDO
122,859074603037188101,34370,13975,7440038,10,Twitter for iPhone,Original tweet,https://t.co/lAySVN8EBp


In [26]:
"""
#for index, row in df.iterrows():
#    print(row["tweet_id"])
for i in df['tweet_id']:
    print(i)
    
#for tweet_id in archive['tweet_id']:
    #print(archive['tweet_id'])
    #print(api.get_status(tweet_id))
"""    

'\n#for index, row in df.iterrows():\n#    print(row["tweet_id"])\nfor i in df[\'tweet_id\']:\n    print(i)\n    \n#for tweet_id in archive[\'tweet_id\']:\n    #print(archive[\'tweet_id\'])\n    #print(api.get_status(tweet_id))\n'

In [27]:
"""
# Programatically download image prediction file from Udacity server using Requests library

# create folder and save images in a file
folder_name = 'image_predictions'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)
    
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
response

# response content should be in write binary mode
with open(os.path.join(folder_name, url.split('/')[-1]), mode = 'wb') as file:
    file.write(response.content)
    
# open the tsv file
images = pd.read_csv('image_predictions/image-predictions.tsv',
                       sep='\t')
"""                       

In [28]:
images.sample(2)


Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1524,788178268662984705,https://pbs.twimg.com/media/CvAr88kW8AEKNAO.jpg,2,Samoyed,0.73548,True,Pomeranian,0.075101,True,Arctic_fox,0.036072,False
232,670417414769758208,https://pbs.twimg.com/media/CU3NE8EWUAEVdPD.jpg,1,sea_urchin,0.493257,False,porcupine,0.460565,False,cardoon,0.008146,False


Consumer API keys
KJ4snTTqprVl13xESdynv0FLD (API key)

2I8tD942i78jbYOmttOrarGfnHJwSCDct6V3611el6to5AJyt8 (API secret key)

755364732-Qeb0twZzwoOHSwdjy2WNEO6V9SxVjs9rjfylXSeV (Access token)

uXwh7cRrVOwzHOa9d9bXCjzI63tiXgLM98rGjqm62h5tO (Access token secret)

https://developer.twitter.com/en/apps/15941238

In [30]:
# authentication pieces
consumer_key = "KJ4snTTqprVl13xESdynv0FLD"
consumer_secret = "2I8tD942i78jbYOmttOrarGfnHJwSCDct6V3611el6to5AJyt8"
access_token = "755364732-Qeb0twZzwoOHSwdjy2WNEO6V9SxVjs9rjfylXSeV"
access_secret = "uXwh7cRrVOwzHOa9d9bXCjzI63tiXgLM98rGjqm62h5tO"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

# Create connection to API
api = tweepy.API(auth, 
                 parser = tweepy.parsers.JSONParser(), 
                 wait_on_rate_limit = True, 
                 wait_on_rate_limit_notify = True)

# Create list of twitter ids
tweet_ids = archive["tweet_id"].astype(list)


In [31]:
api.get_settings()
#print(api.get_status(666104133288665088))

{'allow_contributor_request': 'all',
 'allow_dm_groups_from': 'following',
 'allow_dms_from': 'following',
 'always_use_https': True,
 'discoverable_by_email': False,
 'discoverable_by_mobile_phone': False,
 'display_sensitive_media': False,
 'geo_enabled': False,
 'language': 'en',
 'protected': False,
 'screen_name': 'nagashrin',
 'sleep_time': {'enabled': False, 'end_time': None, 'start_time': None},
 'time_zone': {'name': 'Eastern Time (US & Canada)',
  'tzinfo_name': 'America/New_York',
  'utc_offset': -14400},
 'translator_type': 'none',
 'trend_location': [{'country': 'United States',
   'countryCode': 'US',
   'name': 'United States',
   'parentid': 1,
   'placeType': {'code': 12, 'name': 'Country'},
   'url': 'http://where.yahooapis.com/v1/place/23424977',
   'woeid': 23424977}],
 'use_cookie_personalization': True}

In [32]:

# Download tweepy status object based on tweet_id in archive and store it in a list.
list_of_tweets = []

# Tweets that can't be found for the tweet_ids are saved in the list below
failed_ids = []

for index, row in archive.iterrows():
    #print(row["tweet_id"])
    #print(api.get_status(row["tweet_id"]))
    try:
        list_of_tweets.append(api.get_status(row["tweet_id"]))
    except Exception as e:
        failed_ids.append((row["tweet_id"]))

"""    
for tweet_id in archive['tweet_id']:
    print(archive['tweet_id'])
    print(api.get_status(tweet_id))
    try:
        list_of_tweets.append(api.get_status(tweet_id))
    except Exception as e:
        cant_find_tweets_for_ids.append((tweet_id))
"""

"    \nfor tweet_id in archive['tweet_id']:\n    print(archive['tweet_id'])\n    print(api.get_status(tweet_id))\n    try:\n        list_of_tweets.append(api.get_status(tweet_id))\n    except Exception as e:\n        cant_find_tweets_for_ids.append((tweet_id))\n"

In [9]:

print("The list of tweets" ,len(list_of_tweets))
print("The list of tweets no found" , len(failed_ids))


The list of tweets 697
The list of tweets no found 1659


In [95]:
#Then in this code block we isolate the json part of each tweepy 
#status object that we have downloaded and we add them all into a list

my_list_of_dicts = []
for each_json_tweet in list_of_tweets:
    my_list_of_dicts.append(each_json_tweet)

In [96]:
#we write this list into a txt file:

with open('tweet_json.txt', 'w') as file:
        file.write(json.dumps(my_list_of_dicts, indent=4))

In [97]:
#identify information of interest from JSON dictionaries in txt file
#and put it in a dataframe called tweet JSON
my_demo_list = []
with open('tweet_json.txt', encoding='utf-8') as json_file:  
    all_data = json.load(json_file)
    for each_dictionary in all_data:
        tweet_id = each_dictionary['id']
        whole_tweet = each_dictionary['text']
        only_url = whole_tweet[whole_tweet.find('https'):]
        favorite_count = each_dictionary['favorite_count']
        retweet_count = each_dictionary['retweet_count']
        followers_count = each_dictionary['user']['followers_count']
        friends_count = each_dictionary['user']['friends_count']
        whole_source = each_dictionary['source']
        only_device = whole_source[whole_source.find('rel="nofollow">') + 15:-4]
        source = only_device
        retweeted_status = each_dictionary['retweeted_status'] = each_dictionary.get('retweeted_status', 'Original tweet')
        if retweeted_status == 'Original tweet':
            url = only_url
        else:
            retweeted_status = 'This is a retweet'
            url = 'This is a retweet'
        my_demo_list.append({'tweet_id': str(tweet_id),
                             'favorite_count': int(favorite_count),
                             'retweet_count': int(retweet_count),
                             'followers_count': int(followers_count),
                             'friends_count': int(friends_count),
                             'url': url,
                             'source': source,
                             'retweeted_status': retweeted_status,
                            })
        tweet_json = pd.DataFrame(my_demo_list, columns = ['tweet_id', 'favorite_count','retweet_count', 
                                                           'followers_count', 'friends_count','source', 
                                                           'retweeted_status', 'url'])            

In [104]:
tweet_json.sample(2)

Unnamed: 0,tweet_id,favorite_count,retweet_count,followers_count,friends_count,source,retweeted_status,url
244,819015331746349057,0,20767,7438054,10,Twitter for iPhone,This is a retweet,This is a retweet
447,783821107061198850,7925,2167,7438058,10,Twitter for iPhone,Original tweet,https://t.co/STcPjiNAHp


In [106]:
tweet_json.to_csv('tweet_clean.csv', sep = ',', encoding = 'utf-8', index = False)

## Assessing Data

In [109]:
archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob