# Project: Wrangle and Analyze Data

## Data Gathering
In the cells below, **all** three pieces of data for this project will be gathered and loaded in the notebook. 

In [1]:
# Importing the needed libraries
import json
import os
from timeit import default_timer as timer

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns
import tweepy
from tweepy import OAuthHandler

# twitter_keys is a local module holding API credentials. It is only needed when
# querying the Twitter API, not when loading the saved tweet-json.txt file.
try:
    import twitter_keys as keys
except ModuleNotFoundError:
    keys = None

%matplotlib inline

In [None]:
# Downloading the image_predictions.tsv file programmatically
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
with open('image_predictions.tsv', 'wb') as file:
    file.write(response.content)

In [None]:
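# Loading the three datasets
twitter_archive_df = pd.read_csv('twitter-archive-enhanced.csv')
image_predictions_df = pd.read_csv('image_predictions.tsv', sep='\t')
# tweet-json.txt holds one JSON object per line; the with block closes the file afterwards
with open('tweet-json.txt') as json_file:
    arr = [json.loads(line) for line in json_file]
twitter_jason_df = pd.DataFrame(arr)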
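
The loading cell above parses one JSON object per line. A minimal sketch of the same idea that keeps only a few fields is shown below. The helper name `load_tweet_json` and the default field list are assumptions made for illustration, not part of the original notebook.

```python
import json

import pandas as pd


def load_tweet_json(path, fields=('id', 'retweet_count', 'favorite_count')):
    """Read a file with one JSON object per line into a DataFrame (hypothetical helper)."""
    records = []
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            # keep only the requested fields; missing keys become None
            records.append({field: tweet.get(field) for field in fields})
    return pd.DataFrame(records, columns=list(fields))
```

Calling `load_tweet_json('tweet-json.txt')` would return a three-column table, which may be easier to inspect than the full set of tweet attributes.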

In [None]:
twitter_archive_df

In [None]:
twitter_archive_df.shape

In [None]:
image_predictions_df

In [None]:
image_predictions_df.shape

In [None]:
twitter_jason_df

In [None]:
twitter_jason_df.shape

In [None]:
twitter_archive_df.info()

In [None]:
twitter_archive_df.sample(10)

In [None]:
improper_denominator_df = twitter_archive_df[twitter_archive_df['rating_denominator'] != 10]
improper_denominator_df

In [None]:
for text in improper_denominator_df['text'].values:
    print(text)

In [None]:
image_predictions_df.sample(10)

In [None]:
image_predictions_df.info()

In [None]:
twitter_jason_df.sample(10)

In [None]:
twitter_jason_df.info()

## Assessment
### Quality Issues (twitter_archive_df)

    1. 13/10 rating for dog at index 313, not 13/0. Inaccurate rating_denominator value.
    2. 14/10 rating for dog at index 784, not 9/11. Inaccurate rating_numerator and rating_denominator values.
    3. 13/10 rating for dog at index 1165, not 4/20. Inaccurate rating_numerator and rating_denominator values.
    4. 11/10 rating for dog at index 1202, not 50/50. Inaccurate rating_numerator and rating_denominator values.
    5. 10/10 rating for dog at index 1662, not 7/11. Inaccurate rating_numerator and rating_denominator values.
    6. 9/10 rating for dog at index 2335, not 1/2. Inaccurate rating_numerator and rating_denominator values.
    7. The rating text and score for the dog at index 1068 are the same as at index 784. Both rows describe the same dog.
    8. There is no actual rating for the dog at index 516.
    9. There is no actual rating for the dog at index 342.
    10. 204/170 rating for dog at index 1120. rating_denominator is not consistent.
    11. 99/90 rating for dog at index 1228. rating_denominator is not consistent.
    12. 80/80 rating for dog at index 1254. rating_denominator is not consistent.
    13. 45/50 rating for dog at index 1274. rating_denominator is not consistent.
    14. 60/50 rating for dog at index 1351. rating_denominator is not consistent.
    15. 44/40 rating for dog at index 1433. rating_denominator is not consistent.
    16. 4/20 rating for dog at index 1598. rating_denominator is not consistent.
    17. 143/130 rating for dog at index 1634. rating_denominator is not consistent.
    18. 121/110 rating for dog at index 1635. rating_denominator is not consistent.
    19. 84/70 rating for dog at index 433. rating_denominator is not consistent.
    20. 20/16 rating for dog at index 1663. rating_denominator is not consistent.
    21. 144/120 rating for dog at index 1779. rating_denominator is not consistent.
    22. 88/80 rating for dog at index 1843. rating_denominator is not consistent.
    23. 165/150 rating for dog at index 902. rating_denominator is not consistent.
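
The issues above were found by reading the tweet text next to the stored numerator and denominator. A rough sketch of how candidate ratings could be pulled out of a text programmatically is given below. The pattern, the function name `find_ratings`, and the sample sentence are all assumptions for illustration; the sample is invented and not taken from the archive.

```python
import re

# Matches "numerator/denominator", allowing a decimal numerator such as 9.75/10
RATING_PATTERN = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')


def find_ratings(text):
    """Return every numerator/denominator pair found in a text as (float, int) tuples."""
    return [(float(num), int(den)) for num, den in RATING_PATTERN.findall(text)]


# Invented example: a date-like fraction appears before the real rating
sample = "Born on 9/11 but still a 14/10 good dog"
print(find_ratings(sample))  # [(9.0, 11), (14.0, 10)]
```

When a text yields more than one pair, the first match is not necessarily the rating, which is the kind of error listed in issues 2 to 6.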
    
### Tidiness Issues (twitter_jason_df)
    1. The created_at column should be split into week_day, month, year and time columns.
    2. The entities column should be a table of its own.
    3. The extended_entities column should be a table of its own.


# Cleaning Data
In this section, all of the issues documented during assessment will be cleaned.

In [None]:
# Making copies of the datasets
twitter_archive_clean_copy = twitter_archive_df.copy()
twitter_jason_df_clean_copy = twitter_jason_df.copy()
image_predictions_df_clean_copy = image_predictions_df.copy()

### Quality Issue #1: 13/10 rating for dog at index 313 not 13/0

#### Define
Replace 0 with 10 in the rating_denominator at index 313.

#### Code

In [None]:
twitter_archive_clean_copy.at[313, 'rating_denominator'] = 10

#### Test

In [None]:
assert twitter_archive_clean_copy.at[313, 'rating_denominator'] == 10

### Quality Issue #2: 14/10 rating for dog at index 784 not 9/11

#### Define
Replace 11 with 10 in the rating_denominator at index 784. <br>
Replace 9 with 14 in the rating_numerator at index 784.

#### Code

In [None]:
twitter_archive_clean_copy.at[784, 'rating_denominator'] 

In [None]:
twitter_archive_clean_copy.at[784, 'rating_denominator']  = 10

In [None]:
twitter_archive_clean_copy.at[784, 'rating_numerator']

In [None]:
twitter_archive_clean_copy.at[784, 'rating_numerator'] = 14

#### Test

In [None]:
assert twitter_archive_clean_copy.at[784, 'rating_denominator']  == 10

In [None]:
assert twitter_archive_clean_copy.at[784, 'rating_numerator'] == 14

### Quality Issue #3: 13/10 rating for dog at index 1165 not 4/20.

#### Define
Replace 4 with 13 in the rating_numerator at index 1165. <br>
Replace 20 with 10 in the rating_denominator at index 1165.

#### Code

In [None]:
twitter_archive_clean_copy.at[1165, 'rating_numerator'] 

In [None]:
twitter_archive_clean_copy.at[1165, 'rating_numerator'] = 13

In [None]:
twitter_archive_clean_copy.at[1165, 'rating_denominator']

In [None]:
twitter_archive_clean_copy.at[1165, 'rating_denominator'] = 10

#### Test

In [None]:
assert twitter_archive_clean_copy.at[1165, 'rating_numerator'] == 13

In [None]:
assert twitter_archive_clean_copy.at[1165, 'rating_denominator'] == 10

### Quality Issue #4: 11/10 rating for dog at index 1202 not 50/50

#### Define
Replace 50 with 11 in the rating_numerator at index 1202. <br>
Replace 50 with 10 in the rating_denominator at index 1202.

#### Code

In [None]:
twitter_archive_clean_copy.at[1202, 'rating_numerator'] 

In [None]:
twitter_archive_clean_copy.at[1202, 'rating_numerator'] = 11

In [None]:
twitter_archive_clean_copy.at[1202, 'rating_denominator'] 

In [None]:
twitter_archive_clean_copy.at[1202, 'rating_denominator'] = 10

#### Test

In [None]:
assert twitter_archive_clean_copy.at[1202, 'rating_numerator'] == 11
assert twitter_archive_clean_copy.at[1202, 'rating_denominator'] == 10

### Quality Issue #5: 10/10 rating for dog at index 1662 not 7/11

#### Define
Replace 7 with 10 in the rating_numerator at index 1662. <br>
Replace 11 with 10 in the rating_denominator at index 1662.

#### Code

In [None]:
twitter_archive_clean_copy.at[1662, 'rating_numerator'] 

In [None]:
twitter_archive_clean_copy.at[1662, 'rating_numerator'] = 10
twitter_archive_clean_copy.at[1662, 'rating_denominator'] = 10

#### Test

In [None]:
assert twitter_archive_clean_copy.at[1662, 'rating_numerator'] == 10
assert twitter_archive_clean_copy.at[1662, 'rating_denominator'] == 10

### Quality Issue #6: 9/10 rating for dog at index 2335 not 1/2

#### Define
Replace 1 with 9 in the rating_numerator at index 2335. <br>
Replace 2 with 10 in the rating_denominator at index 2335.

#### Code

In [None]:
twitter_archive_clean_copy.at[2335, 'rating_numerator'] 

In [None]:
twitter_archive_clean_copy.at[2335, 'rating_denominator'] 

In [None]:
twitter_archive_clean_copy.at[2335, 'rating_numerator'] = 9
twitter_archive_clean_copy.at[2335, 'rating_denominator'] = 10

#### Test

In [None]:
assert twitter_archive_clean_copy.at[2335, 'rating_numerator'] == 9
assert twitter_archive_clean_copy.at[2335, 'rating_denominator'] == 10
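
Quality issues #1 to #6 all follow the same pattern: write a corrected numerator and denominator at a known index. A minimal sketch that applies such corrections from one dictionary is shown below. The helper name `apply_rating_corrections` is hypothetical; the index and rating values are the ones stated in the issue titles above.

```python
import pandas as pd

# Corrections from quality issues #1 to #6: index -> (numerator, denominator)
corrections = {
    313: (13, 10),
    784: (14, 10),
    1165: (13, 10),
    1202: (11, 10),
    1662: (10, 10),
    2335: (9, 10),
}


def apply_rating_corrections(df, corrections):
    """Return a copy of df with corrected ratings written at the given index labels."""
    fixed = df.copy()
    for index, (numerator, denominator) in corrections.items():
        if index not in fixed.index:
            raise KeyError(f'index {index} not found')
        fixed.loc[index, ['rating_numerator', 'rating_denominator']] = [numerator, denominator]
    return fixed
```

Under these assumptions, `apply_rating_corrections(twitter_archive_clean_copy, corrections)` would give the same values as the individual `.at` assignments, and the existing assert cells could be reused unchanged.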

### Quality Issues #10 to #23: rating_denominator is not consistent

#### Define
Reduce every denominator greater than 10 to 10 by dividing both the numerator and the denominator by the same number (the denominator divided by 10).

#### Code

In [None]:
twitter_archive_clean_copy[twitter_archive_clean_copy['rating_denominator'] < 10]
# This row needs to be dropped because the text column shows that there is no actual rating for this dog.

In [None]:
twitter_archive_clean_copy[twitter_archive_clean_copy['rating_denominator'] > 10]

In [None]:
# Rescaling every rating whose denominator is greater than 10 so that the denominator becomes 10.
# The numerator is divided by the same factor (denominator / 10) and truncated to an integer.
new_numerator = []
new_denominator = []
for numerator, denominator in zip(twitter_archive_clean_copy['rating_numerator'].values, twitter_archive_clean_copy['rating_denominator'].values):
    if denominator > 10:
        divisor = denominator / 10
        new_numerator.append(int(numerator / divisor))
        # the rescaled denominator is 10 by construction; setting it directly avoids float rounding
        new_denominator.append(10)
    else:
        new_numerator.append(numerator)
        new_denominator.append(denominator)
len(new_numerator)

In [None]:
for numerator in new_numerator:
    print(numerator)

In [None]:
for denominator in new_denominator:
    print(denominator)

In [None]:
twitter_archive_clean_copy['rating_numerator'] = new_numerator
twitter_archive_clean_copy['rating_denominator'] = new_denominator
twitter_archive_clean_copy.head()

#### Test

In [None]:
# No denominator should be greater than 10 after rescaling
assert (twitter_archive_clean_copy['rating_denominator'] <= 10).all()

In [None]:
twitter_archive_clean_copy.info()
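
The rescaling rule used for issues #10 to #23 can be checked on individual ratings. The sketch below restates the rule as a small function; the name `rescale_rating` is an assumption, and the inputs are ratings listed in the assessment section.

```python
def rescale_rating(numerator, denominator):
    """Scale a rating so its denominator becomes 10; leave denominators of 10 or less unchanged."""
    if denominator <= 10:
        return numerator, denominator
    divisor = denominator / 10
    # int() truncates, so a fractional result such as 12.5 becomes 12
    return int(numerator / divisor), 10


print(rescale_rating(204, 170))  # (12, 10)
print(rescale_rating(84, 70))    # (12, 10)
print(rescale_rating(20, 16))    # (12, 10)
```

The last call shows the effect of truncation: 20/16 corresponds to 12.5/10, which is stored as 12/10.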

### Quality Issues #7, #8 and #9
7. The rating text and score for the dog at index 1068 are the same as at index 784. Both rows describe the same dog.
8. There is no actual rating for the dog at index 516.
9. There is no actual rating for the dog at index 342.

#### Define
Drop the rows at indices 342, 516 and 1068.

#### Code

In [None]:
rows_to_drop = [342, 516, 1068]
twitter_archive_clean_copy.drop(rows_to_drop, axis=0, inplace = True)

#### Test

In [None]:
twitter_archive_clean_copy.shape
# Before, there were 2356 rows. Now, there are 2353 rows. 
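
The shape check above confirms that three rows are gone, but not which ones. A small sketch of a stricter check on a toy table follows; the toy values are invented for illustration.

```python
import pandas as pd

# Invented three-row table; only the index labels matter here
toy = pd.DataFrame({'rating_numerator': [12, 24, 13]}, index=[341, 516, 517])
rows_to_drop = [516]
cleaned = toy.drop(rows_to_drop, axis=0)

# None of the dropped labels should remain in the index
assert not cleaned.index.isin(rows_to_drop).any()
```

The same assertion could be applied to `twitter_archive_clean_copy` with the three index labels dropped above.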

### Tidiness Issue #1

#### Define
The created_at column in twitter_jason_df should be split into hour_in_the_day, week_day, month, year and time columns.

#### Code

In [None]:
# Split each created_at string once and reuse the parts:
# position 0 is the week day, 1 the month, 3 the time and the last one the year
created_at_parts = [i.split() for i in twitter_jason_df_clean_copy['created_at'].values]
date = {'week_day': [parts[0] for parts in created_at_parts],
        'month': [parts[1] for parts in created_at_parts],
        'time': [parts[3] for parts in created_at_parts],
        'year': [parts[-1] for parts in created_at_parts]}

In [None]:
created_at_df = pd.DataFrame(date)
hour_in_the_day = [time.split(':')[0] for time in created_at_df['time'].values]
created_at_df['hour_in_the_day'] = hour_in_the_day
created_at_df

In [None]:
new_jason_df = pd.concat([created_at_df, twitter_jason_df_clean_copy], axis=1)
new_jason_df.drop('created_at', axis=1, inplace=True)

#### Test

In [None]:
new_jason_df.head()
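
The split above relies on the position of each part in the created_at string. An alternative sketch parses the string as a date instead, assuming the layout `'Tue Aug 01 16:23:56 +0000 2017'` (week day, month, day, time, UTC offset, year). The function name `split_created_at` is hypothetical, and unlike the notebook's columns, the year and hour come back as integers.

```python
from datetime import datetime

# Assumed created_at layout, e.g. 'Tue Aug 01 16:23:56 +0000 2017'
CREATED_AT_FORMAT = '%a %b %d %H:%M:%S %z %Y'


def split_created_at(created_at):
    """Parse a created_at string and return the parts used in this analysis."""
    parsed = datetime.strptime(created_at, CREATED_AT_FORMAT)
    return {
        'week_day': parsed.strftime('%a'),
        'month': parsed.strftime('%b'),
        'year': parsed.year,
        'hour_in_the_day': parsed.hour,
    }


print(split_created_at('Tue Aug 01 16:23:56 +0000 2017'))
```

Parsing has the side benefit of failing loudly on a malformed timestamp, whereas positional splitting would silently produce wrong columns.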

### Tidiness Issue #2

#### Define
The entities column in twitter_jason_df should be a table of its own.

#### Code

In [None]:
entities_table = list(twitter_jason_df_clean_copy['entities'])
entities_df = pd.DataFrame(entities_table)

#### Test

In [None]:
entities_df.head()

In [None]:
entities_df.info()

##### The entities table is not needed for this analysis.

### Tidiness Issue #3

#### Define
The extended_entities column in twitter_jason_df should be a table of its own.

#### Code

In [None]:
twitter_jason_df_clean_copy['extended_entities'].isna().sum()

In [None]:
twitter_jason_df_clean_copy.info()

In [None]:
twitter_jason_df_clean_copy['extended_entities'].values

In [None]:
# Treating the null values in the extended_entities column:
# rows without media get a placeholder dictionary with the same keys as a real entry
placeholder_size = {'w': 0, 'h': 0, 'resize': 'missing_value'}
missing_values = {'media': [{'id': 0, 'id_str': '000000000000000000', 'indices': [0, 0],
                'media_url': 'missing_value', 'media_url_https': 'missing_value',
                'url': 'missing_value', 'display_url': 'missing_value',
                'expanded_url': 'missing_value', 'type': 'missing_value',
                'sizes': {'large': dict(placeholder_size), 'thumb': dict(placeholder_size),
                'small': dict(placeholder_size), 'medium': dict(placeholder_size)}}]}

whole_list = []
extended_entities_list = twitter_jason_df_clean_copy['extended_entities'].values
for i in extended_entities_list:
    # missing entries are stored as NaN, which is a float
    if isinstance(i, float):
        whole_list.append(missing_values)
    else:
        whole_list.append(i)
for a in whole_list:
    assert isinstance(a, dict)

In [None]:
extended_entities_new_list = []
for item in whole_list:
    for value in item['media']:
        extended_entities_new_list.append(value)
extended_entities_df = pd.DataFrame(extended_entities_new_list)

#### Test

In [None]:
extended_entities_df.head()

##### The sizes column should be a table of its own.

In [None]:
extended_entities_sizes = list(extended_entities_df['sizes'])
extended_entities_sizes_df = pd.DataFrame(extended_entities_sizes)
extended_entities_sizes_df

##### Let's split the columns in extended_entities_sizes_df.

In [None]:
# Helper that replaces one column of {'w', 'h', 'resize'} dictionaries with three separate columns
def split_size_df(df, column_in_quotes):
    sizes_list = list(df[column_in_quotes])
    split_df = pd.DataFrame(sizes_list)
    split_df.columns = [column_in_quotes + '_width', column_in_quotes + '_height', column_in_quotes + '_resize']
    drop_df = df.drop([column_in_quotes], axis=1)
    new_df = pd.concat([drop_df, split_df], axis=1)
    return new_df

In [None]:
large = split_size_df(extended_entities_sizes_df, 'large')
thumb = split_size_df(large, 'thumb')
small = split_size_df(thumb, 'small')
extended_entities_sizes_df = split_size_df(small, 'medium')
extended_entities_sizes_df

In [None]:
extended_entities_sizes_df.info()

##### The extended_entities column and all the subtables are not needed in this analysis
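
The nested sizes dictionaries were flattened above one column at a time. As a possible shortcut, `pd.json_normalize` can flatten nested dictionaries in a single call. The record below is invented and only mirrors the shape of a sizes entry; the resulting column names (`large_w` rather than `large_width`) differ from those produced by `split_size_df`.

```python
import pandas as pd

# Invented record in the shape of one 'sizes' entry; the numbers are placeholders
sizes = [
    {'large': {'w': 1024, 'h': 768, 'resize': 'fit'},
     'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}},
]

# sep='_' joins the outer and inner keys, e.g. 'large' + 'w' -> 'large_w'
flat = pd.json_normalize(sizes, sep='_')
print(sorted(flat.columns))
# ['large_h', 'large_resize', 'large_w', 'thumb_h', 'thumb_resize', 'thumb_w']
```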

In [None]:
needed_twitter_jason_dfcolumns = new_jason_df[['id', 'hour_in_the_day', 'week_day', 'month', 'year', 'retweet_count', 'favorite_count']]
# rename returns a new DataFrame, which avoids modifying a slice of new_jason_df in place
needed_twitter_jason_dfcolumns = needed_twitter_jason_dfcolumns.rename(columns={'id': 'tweet_id'})
needed_twitter_jason_dfcolumns

In [None]:
twitter_archive_clean_copy.columns

In [None]:
twitter_archive_clean_copy.drop(['in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls'], axis=1, inplace=True)
# The above columns are not needed for the analysis

In [None]:
twitter_archive_clean_copy

In [None]:
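# tweet_id is the only column the two tables share, so it is named explicitly as the merge key
new_df = pd.merge(needed_twitter_jason_dfcolumns, twitter_archive_clean_copy, on='tweet_id')
new_df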

In [None]:
new_df.info()

In [None]:
new_df.sample(10)

In [None]:
image_predictions_df_clean_copy

In [None]:
# Inner merge on tweet_id: only tweets that also have image predictions are kept
twitter_archive_master = pd.merge(new_df, image_predictions_df_clean_copy, on='tweet_id')
twitter_archive_master.head(10)
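
Both merges above are inner joins on tweet_id, so tweets missing from either table are dropped. The sketch below shows this on two invented tables and uses the `validate` argument of `pd.merge` to assert that each tweet_id appears at most once on each side. The toy values are assumptions for illustration.

```python
import pandas as pd

# Invented toy tables sharing only the tweet_id column
left = pd.DataFrame({'tweet_id': [1, 2, 3], 'favorite_count': [10, 20, 30]})
right = pd.DataFrame({'tweet_id': [2, 3, 4], 'p1': ['pug', 'corgi', 'seat_belt']})

# validate='one_to_one' raises pandas.errors.MergeError if a key is duplicated on either side
merged = pd.merge(left, right, on='tweet_id', how='inner', validate='one_to_one')
print(merged['tweet_id'].tolist())  # [2, 3]
```

Comparing the row counts before and after the merge is a quick way to see how many tweets were lost to the join.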

## Storing Data
The gathered, assessed, and cleaned master dataset will be saved to a CSV file named "twitter_archive_master.csv".

In [None]:
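# index=False keeps the row index out of the file, so no extra unnamed column appears on reload
twitter_archive_master.to_csv('twitter_archive_master.csv', index=False)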

In [None]:
# checking whether the twitter_archive_master.csv is in the present working directory
os.listdir()
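
Listing the directory shows that the file exists, but not that it can be read back intact. A minimal round-trip sketch on an invented two-row table, written to a temporary folder so that nothing in the working directory is touched:

```python
import os
import tempfile

import pandas as pd

# Invented stand-in for the master table
frame = pd.DataFrame({'tweet_id': [1, 2], 'rating_numerator': [12, 13]})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'twitter_archive_master.csv')
    frame.to_csv(path, index=False)   # write without the row index
    file_exists = os.path.isfile(path)
    reloaded = pd.read_csv(path)      # read it back before the folder is removed

print(file_exists, reloaded.shape)
```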

In [None]:
twitter_archive_master.info()

# Exploratory Data Analysis

In [None]:
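# numeric_only=True restricts the matrix to numeric and boolean columns;
# pandas 2.0 and later no longer drop text columns automatically
correlation_matrix = twitter_archive_master.corr(numeric_only=True)
sns.set(rc={'figure.figsize':(12,10)})
sns.heatmap(correlation_matrix, annot=True)
plt.show()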

The following can be deduced from the correlation matrix: <br>
<ul>
<li>favorite_count and retweet_count have a correlation of 0.79, which is a strong positive correlation. Ratings that receive more likes also tend to receive more retweets.</li>
<li>There is no meaningful correlation between the rating score (rating_numerator) and the number of likes a rating receives (favorite_count). The correlation is 0.016, which is negligible.</li>
<li>There is no meaningful correlation between the rating score (rating_numerator) and the number of retweets a rating receives (retweet_count). The correlation is 0.018, which is negligible.</li>
<li>There is a negative correlation of -0.71 between p1_conf and p3_conf. As p1_conf increases, p3_conf tends to decrease, and vice versa.</li>
<li>There is a negative correlation of -0.51 between p1_conf and p2_conf. As p1_conf increases, p2_conf tends to decrease, and vice versa.</li>
<li>There is a positive correlation of 0.63 between p1_dog and p2_dog. If p1_dog is True, p2_dog is also likely to be True.</li>
<li>There is a positive correlation of 0.55 between p2_dog and p3_dog. If p2_dog is True, p3_dog is also likely to be True.</li>
<li>There is a positive correlation of 0.56 between p1_dog and p3_dog. If p1_dog is True, p3_dog is also likely to be True.</li>
</ul>
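
The coefficients in the matrix are Pearson correlations, the default method of `DataFrame.corr`. To make the meaning of values such as 0.79 or -0.71 concrete, the sketch below computes the coefficient by hand on invented numbers. The function name `pearson` and the toy lists are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Invented counts: retweets rise in step with likes, so the coefficient is close to 1
likes = [10, 20, 30, 40]
retweets = [2, 4, 6, 8]
print(pearson(likes, retweets))
```

A value near 1 or -1 indicates a nearly linear relationship, while a value near 0, such as the 0.016 reported above, indicates no linear relationship.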

In [None]:
twitter_archive_master

### Is there any observed pattern between the hour_in_the_day and the rating score?

In [None]:
# Selecting the column before calling mean() avoids averaging the text columns
twitter_archive_master.groupby('hour_in_the_day')['rating_numerator'].mean().sort_values(ascending=False)

Ratings posted between 15:00 and 16:00 have the highest mean rating.
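
A group mean can be driven by very few tweets, so it may help to look at the group size next to it. The sketch below does this on an invented table with the same column names; the values are assumptions and do not come from the archive.

```python
import pandas as pd

# Invented toy data using the notebook's column names
toy = pd.DataFrame({
    'hour_in_the_day': ['15', '15', '16', '00'],
    'rating_numerator': [12, 14, 11, 10],
})

# mean and number of tweets per hour, highest mean first
summary = (toy.groupby('hour_in_the_day')['rating_numerator']
              .agg(['mean', 'count'])
              .sort_values('mean', ascending=False))
print(summary)
```

The same pattern applies to the week_day, month and year groupings in the cells that follow.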

### Is there any observed pattern between the week_day and the rating score?

In [None]:
twitter_archive_master.groupby('week_day')['rating_numerator'].mean().sort_values(ascending=False)

Ratings posted on Mondays have the highest mean rating, while those posted on Saturdays have the lowest.

### Is there any observed pattern between the month and the rating score?

In [None]:
twitter_archive_master.groupby('month')['rating_numerator'].mean().sort_values(ascending=False)

WeRateDogs tends to give its highest ratings in July and its lowest ratings in December.

### Is there any observed pattern between the year and the rating score?

In [None]:
twitter_archive_master.groupby('year')['rating_numerator'].mean().sort_values(ascending=False)

Although the mean ratings do not differ significantly across the three years under study, 2016 has the highest mean rating and 2015 the lowest.

### Is there any observed pattern between the hour_in_the_day and the favorite_count?

In [None]:
twitter_archive_master.groupby('hour_in_the_day')['favorite_count'].mean().sort_values(ascending=False)

Ratings posted between 6:00 and 7:00 receive the most likes on average.

### Is there any observed pattern between the week_day and the favorite_count?

In [None]:
twitter_archive_master.groupby('week_day')['favorite_count'].mean().sort_values(ascending=False)

Ratings posted on Wednesdays receive the most likes on average, while those posted on Thursdays receive the fewest.

### Is there any observed pattern between the hour_in_the_day and the retweet_count?

In [None]:
twitter_archive_master.groupby('hour_in_the_day')['retweet_count'].mean().sort_values(ascending=False)

Ratings posted between 6:00 and 7:00 are retweeted the most on average.

### Is there any observed pattern between the week_day and the retweet_count?

In [None]:
twitter_archive_master.groupby('week_day')['retweet_count'].mean().sort_values(ascending=False)

Ratings posted on Wednesdays are retweeted the most on average, while those posted on Thursdays are retweeted the least.

### Is there any observed pattern between the month and the favorite_count?

In [None]:
twitter_archive_master.groupby('month')['favorite_count'].mean().sort_values(ascending=False)

Ratings posted in June receive the most likes on average, while those posted in November receive the fewest.

### Is there any observed pattern between the month and the retweet_count?

In [None]:
twitter_archive_master.groupby('month')['retweet_count'].mean().sort_values(ascending=False)

Ratings posted in June are retweeted the most on average, while those posted in November are retweeted the least.

### Is there any observed pattern between the year and the retweet_count?

In [None]:
twitter_archive_master.groupby('year')['retweet_count'].mean().sort_values(ascending=False)

Ratings posted in 2017 were retweeted the most on average, about six times more than those posted in 2015.

### Is there any observed pattern between the year and the favorite_count?

In [None]:
twitter_archive_master.groupby('year')['favorite_count'].mean().sort_values(ascending=False)

Ratings posted in 2017 received the most likes on average, about ten times more than those posted in 2015.

### Is there a relationship between favorite_count and retweet_count?

In [None]:
# Scatter plot of favorite_count vs. retweet_count
plt.figure(figsize=[8, 6])
plt.scatter(twitter_archive_master['favorite_count'], twitter_archive_master['retweet_count'])
plt.title('Scatter plot of likes (favorite_count) vs. retweets (retweet_count)')
plt.xlabel('favorite_count')
plt.ylabel('retweet_count')
plt.show()

There is a strong positive correlation between favorite_count and retweet_count. Ratings that receive more likes also tend to receive more retweets.

### Is there a relationship between rating score and favorite_count?

In [None]:
# Scatter plot of favorite_count vs. rating_numerator
plt.figure(figsize=[8, 6])
plt.scatter(twitter_archive_master['favorite_count'], twitter_archive_master['rating_numerator'])
plt.title('Scatter plot of likes (favorite_count) vs. rating score (rating_numerator)')
plt.xlabel('favorite_count')
plt.ylabel('rating_numerator')
plt.show()

There is no correlation between the rating score given by WeRateDogs and the number of likes the rating receives.

### Is there a relationship between rating score and retweet_count?

In [None]:
# Scatter plot of retweet_count vs. rating_numerator
plt.figure(figsize=[8, 6])
plt.scatter(twitter_archive_master['retweet_count'], twitter_archive_master['rating_numerator'])
plt.title('Scatter plot of retweets (retweet_count) vs. rating score (rating_numerator)')
plt.xlabel('retweet_count')
plt.ylabel('rating_numerator')
plt.show()

There is no correlation between the rating score given by WeRateDogs and the number of retweets the rating receives.

# Conclusion

This analysis is based on the WeRateDogs Twitter archive. The following are the insights drawn from the study.

# Insights
<ul>
<li>favorite_count and retweet_count have a correlation of 0.79, which is a strong positive correlation. Ratings that receive more likes also tend to receive more retweets.</li>
<li>There is no meaningful correlation between the rating score (rating_numerator) and the number of likes a rating receives (favorite_count). The correlation is 0.016, which is negligible.</li>
<li>There is no meaningful correlation between the rating score (rating_numerator) and the number of retweets a rating receives (retweet_count). The correlation is 0.018, which is negligible.</li>
<li>Ratings posted between 15:00 and 16:00 have the highest mean rating, so WeRateDogs tends to give high ratings within this hour of the day.</li>
<li>Ratings posted on Mondays have the highest mean rating, while those posted on Saturdays have the lowest.</li>
<li>WeRateDogs tends to give its highest ratings in July and its lowest ratings in December.</li>
<li>Although the mean ratings do not differ significantly across the three years under study, 2016 has the highest mean rating and 2015 the lowest.</li>
<li>Ratings posted between 6:00 and 7:00 receive the most likes on average.</li>
<li>Ratings posted on Wednesdays receive the most likes on average, while those posted on Thursdays receive the fewest.</li>
<li>Ratings posted in June receive the most likes on average, while those posted in November receive the fewest.</li>
<li>Ratings posted in June are retweeted the most on average, while those posted in November are retweeted the least.</li>
<li>Ratings posted in 2017 were retweeted the most on average, about six times more than those posted in 2015.</li>
<li>Ratings posted in 2017 received the most likes on average, about ten times more than those posted in 2015.</li>
</ul>


# Limitations
There were outliers in the rating_numerator column, and they affected the results of the analysis, in particular the group means. Different insights might have emerged if those outliers had been removed.
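
One way to gauge the effect of such outliers is to compare the mean and the median before and after filtering. The sketch below uses the common 1.5 × IQR rule, which is an assumption here and not a step performed in this notebook. The ratings are invented, and `iqr_bounds` is a hypothetical helper.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) limits; values outside them are treated as outliers."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


# Invented ratings with one extreme value; not taken from the archive
ratings = [10, 11, 12, 12, 13, 13, 14, 1776]
lower, upper = iqr_bounds(ratings)
kept = [r for r in ratings if lower <= r <= upper]

# The mean shifts strongly once the extreme value is removed; the median barely moves
print(statistics.mean(ratings), statistics.median(ratings))
print(statistics.mean(kept), statistics.median(kept))
```

Reporting medians alongside means in the groupby cells would be another way to make the comparisons less sensitive to a handful of extreme ratings.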

