
# Twitter data analysis

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#gathering">Data gathering</a></li>
<li><a href="#Assessing">Assessing</a></li>
<li><a href="#Cleaning">Cleaning</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

In [18]:
# imports 
import pandas as pd
import numpy as np
import requests
import tweepy
import json
import os

<a id='intro'></a>
## Introduction

> In this project, I analyze tweets from the [@WeRateDogs](https://twitter.com/dog_rates) Twitter account. WeRateDogs has more than eight million followers and rates people's dogs, pairing each rating with a humorous comment about the dog.
<br>
I gather the data from three sources: a CSV archive of the tweets, a TSV file of image predictions, and the Twitter API.

------------------------------------

<a id='gathering'></a>
## Data gathering

#### Read data from a CSV file

In [22]:
#Read twitter-archive-enhanced.csv file
df_csv = pd.read_csv('twitter-archive-enhanced.csv', index_col='tweet_id')
df_csv.head()

tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


#### Download file

In [3]:
# Download the image_predictions.tsv file
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
r = requests.get(url)
r.raise_for_status()  # fail early if the download did not succeed
with open('image_predictions.tsv', mode='wb') as file:
    file.write(r.content)

# Read the downloaded file    
df_tsv = pd.read_csv('image_predictions.tsv', sep='\t', index_col='tweet_id')
df_tsv.head()

tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


#### Twitter API

In [4]:
# Connect to the Twitter API
# Authentication credentials are stored in auth.json, outside the notebook, for security reasons

with open('auth.json') as f:
    credentials = json.load(f)

consumer_key = credentials['consumer_key']
consumer_secret = credentials['consumer_secret']
access_token = credentials['access_token']
access_secret = credentials['access_secret']

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

# Note: wait_on_rate_limit_notify is accepted by tweepy 3.x only; it was removed in tweepy 4
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [5]:
# Check contents of a tweet response
tweet = api.get_status(892420643555336193)


In [6]:
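# Get all dictionary keys recursively
def getAllKeys(d, keys=None):
    """Return every key in d, descending into nested dictionaries."""
    if keys is None:
        keys = []  # fresh list per top-level call, so repeated calls do not accumulate
    for key, value in d.items():
        keys.append(key)
        if isinstance(value, dict):
            getAllKeys(value, keys)
    return keys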

In [7]:
# Print tweet's properties
print(getAllKeys(tweet._json))

['created_at', 'id', 'id_str', 'text', 'truncated', 'entities', 'hashtags', 'symbols', 'user_mentions', 'urls', 'media', 'extended_entities', 'media', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'id', 'id_str', 'name', 'screen_name', 'location', 'description', 'url', 'entities', 'url', 'urls', 'description', 'urls', 'protected', 'followers_count', 'friends_count', 'listed_count', 'created_at', 'favourites_count', 'utc_offset', 'time_zone', 'geo_enabled', 'verified', 'statuses_count', 'lang', 'contributors_enabled', 'is_translator', 'is_translation_enabled', 'profile_background_color', 'profile_background_image_url', 'profile_background_image_url_https', 'profile_background_tile', 'profile_image_url', 'profile_image_url_https', 'profile_banner_url', 'profile_link_color', 'profile_sidebar_border_color', 'profile_sidebar_fill_color', 'profile_text_color', 'profile_use_background_image'

#### From the tweet properties above, I select the following properties to store in the JSON file
* tweet_id
* created_at
* favorite_count
* retweet_count
* tweet_status_ok (true if the status exists, else false)
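
As a rough illustration of the selection above, the sketch below builds one record from a tweet payload given as a plain dictionary (such as `tweet._json`). The helper names `extract_tweet_fields` and `missing_tweet_fields` and the sample payload are hypothetical, introduced only for this example; they are not part of tweepy or of this notebook's own cells.

```python
from typing import Any, Dict


def extract_tweet_fields(tweet_json: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the selected properties out of a tweet payload (hypothetical helper)."""
    return {
        'tweet_id': int(tweet_json['id']),
        'created_at': str(tweet_json['created_at']),
        'favorite_count': int(tweet_json['favorite_count']),
        'retweet_count': int(tweet_json['retweet_count']),
        'tweet_status_ok': True,
    }


def missing_tweet_fields(tweet_id: int) -> Dict[str, Any]:
    """Record for a tweet whose status could not be fetched (hypothetical helper)."""
    return {
        'tweet_id': int(tweet_id),
        'created_at': None,
        'favorite_count': None,
        'retweet_count': None,
        'tweet_status_ok': False,
    }


# Hypothetical payload with only the keys this example needs, plus one extra key
sample = {
    'id': 892420643555336193,
    'created_at': 'Tue Aug 01 16:23:56 +0000 2017',
    'favorite_count': 36599,
    'retweet_count': 7789,
    'text': 'ignored by the helper',
}

record = extract_tweet_fields(sample)
print(record)
```

Keeping an explicit "missing" record with `tweet_status_ok` set to false is one way to make sure a tweet that no longer exists never inherits the counts of another tweet.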
 

In [8]:
# Look up each tweet by ID and save the selected properties, one JSON object per line
error_list = []
with open('tweet_json.txt', 'w') as file:
    for t in df_csv.index:
        try:
            tweet = api.get_status(t)
            created_at = str(tweet.created_at)
            favorite_count = int(tweet.favorite_count)
            retweet_count = int(tweet.retweet_count)
            tweet_status_ok = True
        except Exception as e:
            # Status unavailable: log the ID and do not reuse the previous tweet's values
            error_list.append(str(t))
            created_at = favorite_count = retweet_count = None
            tweet_status_ok = False

        j = {'tweet_id': int(t),
             'created_at': created_at,
             'favorite_count': favorite_count,
             'retweet_count': retweet_count,
             'tweet_status_ok': tweet_status_ok}
        json.dump(j, file)
        file.write('\n')

Rate limit reached. Sleeping for: 611
Rate limit reached. Sleeping for: 679


In [8]:
# Read the JSON file (one JSON object per line) into a DataFrame
json_list = []
with open('tweet_json.txt', 'r') as file:
    for line in file:
        json_list.append(json.loads(line))

# Build the frame once: DataFrame.append was removed in pandas 2.0 and is slow inside a loop
df_json = pd.DataFrame(json_list, columns=['tweet_id','created_at','favorite_count','retweet_count','tweet_status_ok'])

df_json.head()

,tweet_id,created_at,favorite_count,retweet_count,tweet_status_ok
0,892420643555336193,2017-08-01 16:23:56,36599,7789,True
1,892177421306343426,2017-08-01 00:17:27,31536,5771,True
2,891815181378084864,2017-07-31 00:18:03,23755,3820,True
3,891689557279858688,2017-07-30 15:58:51,39927,7961,True
4,891327558926688256,2017-07-29 16:00:24,38124,8598,True


------------------------------

<a id='Assessing'></a>
## Assessing

#### Unclean Data: Dirty vs. Messy

>There are two types of unclean data:
>* **Dirty** data, also known as **low quality** data. Low quality data has **content issues**.
>* **Messy** data, also known as **untidy data**. Untidy data has **structural issues**.
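
To make the distinction concrete, here is a minimal sketch on a small hypothetical frame. The column names mirror `df_csv`, but the rows are invented for illustration, and the fixes shown are only one possible approach, not necessarily the cleaning applied later in this notebook.

```python
import pandas as pd

# Hypothetical toy data: three tweets with two of the dog-stage columns
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'timestamp': ['2017-08-01 16:23:56 +0000',
                  '2017-08-01 00:17:27 +0000',
                  '2017-07-31 00:18:03 +0000'],
    'doggo': ['doggo', None, None],
    'pupper': [None, 'pupper', None],
})

# Dirty (content issue): the timestamp is text with a trailing " +0000".
# Strip the suffix and convert the column to a datetime type.
toy['timestamp'] = pd.to_datetime(
    toy['timestamp'].str.replace(' +0000', '', regex=False)
)

# Messy (structural issue): one variable, the dog stage, is spread over
# several columns. Collapse them into a single 'stage' column.
toy['stage'] = toy['doggo'].combine_first(toy['pupper'])
tidy = toy.drop(columns=['doggo', 'pupper'])

print(tidy)
```

The first fix changes the content of a column without touching the table's shape; the second changes the shape (fewer columns) without altering what any value means.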

#### Visual Assessment

In [12]:
df_csv.sample(10)
# also checked
#df_csv.head(50)
#df_csv.tail(50)
#df_csv.sample(50)

tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
798340744599797760,,,2016-11-15 01:44:00 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Davey. He'll have your ...,7.717705e+17,4196984000.0,2016-09-02 18:03:10 +0000,https://twitter.com/dog_rates/status/771770456...,11,10,Davey,,,,
707738799544082433,,,2016-03-10 01:24:13 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",He's doing his best. 12/10 very impressive tha...,,,,https://vine.co/v/hUvHKYrdb1d,12,10,,,,,
674271431610523648,,,2015-12-08 16:56:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...","""AT DAWN, WE RIDE""\n10/10 for both dogs https:...",,,,https://twitter.com/dog_rates/status/674271431...,10,10,,,,,
717421804990701568,,,2016-04-05 18:41:02 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Ebby. She's a Zimbabwean Feta. Embarra...,,,,https://twitter.com/dog_rates/status/717421804...,9,10,Ebby,,,,
698262614669991936,,,2016-02-12 21:49:15 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He's a yoga master. Trying t...,,,,https://twitter.com/dog_rates/status/698262614...,11,10,Franklin,,,,
744234799360020481,,,2016-06-18 18:26:18 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here's a doggo realizing you can stand in a po...,,,,https://twitter.com/dog_rates/status/744234799...,13,10,,doggo,,,
828011680017821696,,,2017-02-04 22:45:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Say hello to Brutus and Jersey. They think the...,,,,https://twitter.com/dog_rates/status/828011680...,11,10,Brutus,,,,
841833993020538882,,,2017-03-15 02:10:39 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Ken. His cheeks are mag...,8.174239e+17,4196984000.0,2017-01-06 17:33:29 +0000,https://twitter.com/dog_rates/status/817423860...,13,10,Ken,,,,
852189679701164033,,,2017-04-12 16:00:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Sailor. He has collected the best dirt...,,,,https://twitter.com/dog_rates/status/852189679...,12,10,Sailor,,,,
871032628920680449,,,2017-06-03 15:55:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Boomer. He's doing an advanced water t...,,,,https://twitter.com/dog_rates/status/871032628...,13,10,Boomer,,,,


In [23]:
df_tsv.head() 
#also checked
#df_tsv.head(50)
#df_tsv.tail(50)
#df_tsv.sample(50)

tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [24]:
df_json.head()
# also checked
#df_json.head(50)
#df_json.tail(50)
#df_json.sample(50)

,tweet_id,created_at,favorite_count,retweet_count,tweet_status_ok
0,892420643555336193,2017-08-01 16:23:56,36599,7789,True
1,892177421306343426,2017-08-01 00:17:27,31536,5771,True
2,891815181378084864,2017-07-31 00:18:03,23755,3820,True
3,891689557279858688,2017-07-30 15:58:51,39927,7961,True
4,891327558926688256,2017-07-29 16:00:24,38124,8598,True


#### Quality issues:
> **`df_csv`:**
* `source` is stored as an HTML anchor tag rather than as plain text.
* Some `name` values are null.
* Values in the `timestamp` column end with a `+0000` offset.
* Non-null values in the `retweeted_status_timestamp` column end with a `+0000` offset.
* The `doggo`, `floofer`, `pupper` and `puppo` columns contain many null values.

>**`df_tsv`:**
* No notable issue found visually

>**`df_json`:**
* No notable issue found visually
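
Two of the `df_csv` observations above can be checked in a few lines. The sketch below is an assumption-laden illustration: the `source` strings are hypothetical stand-ins with the same anchor-tag shape (the real values are truncated in the displayed output), and the regular expression is just one way to pull out the link text.

```python
import pandas as pd

# Hypothetical rows: 'source' has the anchor-tag shape seen in df_csv
toy = pd.DataFrame({
    'source': [
        '<a href="http://example.com/app" rel="nofollow">Example App</a>',
        '<a href="http://example.org" rel="nofollow">Another Client</a>',
    ],
    'name': ['Phineas', None],
})

# Keep only the text between the opening tag and the closing </a>
toy['source_text'] = toy['source'].str.extract(r'>([^<]*)</a>', expand=False)

# Count how many names are missing
n_missing_names = int(toy['name'].isna().sum())

print(toy[['source_text', 'name']])
print('missing names:', n_missing_names)
```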

#### Structural issues:
> **`df_csv`:** 
* 
* 
* 

>**`df_tsv`:**
* 

>**`df_json`:**
* 

#### Programmatic Assessment
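
As a minimal sketch of what a programmatic assessment could look like, the following runs a few common pandas checks on a small hypothetical frame. The column names mirror `df_csv`, but the values are invented, and these are not necessarily the checks run in this notebook.

```python
import pandas as pd

# Hypothetical toy data indexed by tweet_id, with one duplicated ID
toy = pd.DataFrame({
    'tweet_id': [1, 2, 2, 3],
    'rating_denominator': [10, 10, 10, 7],
    'name': ['Tilly', None, None, 'a'],
}).set_index('tweet_id')

# Column dtypes and non-null counts
toy.info()

# Missing values per column
null_counts = toy.isna().sum()

# Number of repeated tweet IDs in the index
duplicate_ids = int(toy.index.duplicated().sum())

# Distribution of a column that is expected to hold a single value
denominator_counts = toy['rating_denominator'].value_counts()

print(null_counts)
print('duplicated ids:', duplicate_ids)
print(denominator_counts)
```

Checks like these complement the visual pass: they surface problems (duplicates, unexpected values, wrong dtypes) that are easy to miss when scrolling through a sample of rows.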

--------------

<a id='Cleaning'></a>
## Cleaning

-------------------------------------------------------

<a id='conclusions'></a>
## Conclusions