### Data Analysis of WeRateDogs Twitter archive

#### Context
Goal: Wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.

#### Data
In this project, three datasets were used.

* WeRateDogs Twitter archive
* Tweet image predictions File(URL)
* Additional data from the Twitter API.

#### Import necessary libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import requests
import tweepy
from tweepy import OAuthHandler
from timeit import default_timer as timer
import json
import os


### Gathering data

* WeRateDogs Twitter Archive

In [None]:
df_archive = pd.read_csv("twitter-archive-enhanced.csv")

In [None]:
df_archive.head(3)

In [None]:
df_archive.info()

#### Tweet image predictions File(URL)

In [None]:
# Download Tweet image predictions programmatically using the Requests library 

url= 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)

In [None]:
# open a file and write the content into the file
with open('image-predictions.tsv',mode='wb') as fileImage:
    fileImage.write(response.content)

In [None]:
df_image = pd.read_csv('image-predictions.tsv',sep='\t') # read tsv file

In [None]:
df_image.head(3)

In [None]:
df_image.info()

### Additional data from Twitter API

* I used information from https://stackabuse.com/reading-and-writing-json-to-a-file-in-python/ to read the downloaded dataset (tweet_json.txt file)

The dataset has been downloaded already and written into the tweet_json.txt file. The data will be extracted from this file below

In [None]:
# Download the tweet JSON data programmatically using the Requests library

url= 'https://video.udacity-data.com/topher/2018/November/5be5fb7d_tweet-json/tweet-json.txt'
response_tweet = requests.get(url)

In [None]:
# Write the downloaded content to a text file
with open('tweet_json.txt', mode='wb') as file:
    file.write(response_tweet.content)

In [None]:
# Read the file line by line; each line holds one tweet as a JSON object
tweets = []
with open('tweet_json.txt', 'r') as file:
    for line in file:
        tweets.append(json.loads(line))


In [None]:
tweets[0]

In [None]:
#create a list of data from tweet_json.txt
list_tweets = []

for json_data in tweets:
    list_tweets.append({'id' : json_data['id'],
                       'retweet_count': int(json_data['retweet_count']),
                       'favorite_count' : int(json_data['favorite_count'])})
   
#create a Dataframe
tweets_api = pd.DataFrame(list_tweets, columns = ['id', 'retweet_count' , 'favorite_count']) 

# Check out the obtained DataFrame
tweets_api.head() 

In [None]:
tweets_api.info()
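
The line-by-line extraction above can be tried on a small in-memory sample before running it on the full file. The sketch below is only an illustration: the two records and the extra `lang` field are made up, and `io.StringIO` stands in for `tweet_json.txt`.

```python
import io
import json

import pandas as pd

# Two hypothetical records standing in for lines of tweet_json.txt
sample_file = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25, "lang": "en"}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7, "lang": "en"}\n'
)

# Parse one JSON object per line, skipping blank lines
records = [json.loads(line) for line in sample_file if line.strip()]

# Keep only the three fields used in this project
sample_api = pd.DataFrame(records)[["id", "retweet_count", "favorite_count"]]
print(sample_api)
```

Selecting the columns after building the DataFrame gives the same three-column result as building a list of dictionaries first.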

## Assessing Data

### Visual Assessment

* The three datasets are viewed using the head() and tail() methods to assess them for Tidiness and Quality

#### WeRateDogs Twitter Archive

In [None]:
df_archive.head(3)

In [None]:
df_archive.tail(3)

#### Observation
- Missing values are noticed in the dataset

#### Tweet image predictions File(URL)

In [None]:
df_image.head(3)

In [None]:
df_image.tail(3)

#### Observation
- The values in some columns need to be standardized to lowercase for consistency. Otherwise, the dataset looks clean on visual inspection.

#### Additional data from Twitter API

In [None]:
tweets_api.head(3)

In [None]:
tweets_api.tail(3)

#### Observation
- The dataset visually looks clean.

### Programmatic Assessment

- Assessment of the 3 datasets using pandas functions

#### WeRateDogs Twitter Archive

In [None]:
df_archive.sample(5)

In [None]:
df_archive.info()

#### Observation:
- Presence of null values in 5 columns

In [None]:
df_archive.describe(include = 'number').T

In [None]:
df_archive.describe(exclude = 'number').T

#### Observation:
- The doggo, floofer, pupper and puppo columns each seem to have 2 unique values.

In [None]:
# check for duplicated data

df_archive.duplicated().sum()

In [None]:
# check for counts of each value

df_archive.rating_denominator.value_counts()

#### Observation:
- Rating denominator should be 10. Further analysis will be made
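
One way to follow up on this observation is to list the rows whose denominator differs from 10. This is a minimal sketch on made-up values; the column names match the archive, but the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical sample with the same column names as df_archive
sample = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "rating_numerator": [12, 84, 9, 24],
    "rating_denominator": [10, 70, 10, 7],
})

# Rows that break the "denominator is 10" convention
unusual = sample[sample["rating_denominator"] != 10]
print(unusual)
```

Applied to `df_archive`, the same filter would show the tweets that need a closer look before the denominator column is dropped.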

In [None]:
# check for counts of each value

df_archive.rating_numerator.value_counts()

#### Observation:
- Some extreme values are detected

In [None]:
# check the value counts of the source column

df_archive.source.value_counts()

#### Observation:
- Most tweets were posted from Twitter for iPhone. The source column will be cleaned so that this information is easier to read.

#### Tweet image predictions File(URL)

In [None]:
df_image.sample(5)

In [None]:
df_image.info()

#### Observation:

* There are no missing values.

In [None]:
df_image.describe(include = 'number').T

In [None]:
df_image.describe(exclude = 'number').T

#### Observation:
- There are 408 unique p3 with Labrador_retriever the most frequent
- There are 378 unique p1 with golden_retriever the most frequent

In [None]:
# check for duplicated data

df_image.duplicated().sum()

In [None]:
df_image.p1_dog.value_counts()

In [None]:
df_image.p2_dog.value_counts()

In [None]:
df_image.p3_dog.value_counts()

### Additional data from Twitter API

In [None]:
tweets_api.sample(2)

In [None]:
tweets_api.info()

In [None]:
tweets_api.duplicated().sum()

In [None]:
# Top 5 favorite_count records

tweets_api.sort_values('favorite_count', ascending=False).head(5)

In [None]:
tweets_api.isnull().sum()

### Cleaning Summary

#### Tidiness List

`WeRateDogs`

- Columns 'doggo', 'floofer', 'pupper', 'puppo' in df_archive should be a single column.

- Change tweet_id to integer datatype. Merge with the two tables for a master dataset

`Tweets_api`

Merge 'tweets_api' and 'df_image' to 'df_archive' 

#### Quality List

`WeRateDogs`

- Drop columns that won't be used for analysis especially with null values.
- Change the timestamp to a datetime datatype.
- In timestamp column, +0000 should be removed.
- Create a standard for "rating_denominator". The standard is 10. 
- The "rating_numerator" has some extreme values. Cap it so that ratings fall in the range 0-10.
- The name column contains 'a' and 'None'. These should be changed to NaN.
- The dog names format should be consistent. Make the first letter capital for all the names or all small letters.
- The source column observations can be extracted in simpler form.

`Tweet image Prediction File`

- Drop duplicate values from jpg_url
- The column names such as p1, p2 are not descriptive.
- The prediction dog breeds involve both uppercase and lowercase for the first letter.


### Tidiness

- Change the source column to simpler words.
- Columns 'doggo', 'floofer', 'pupper', 'puppo' in df_archive should be a single column, stage.

#### Code

* Make copies of the dataset
* Change the source column to simpler words.

In [None]:
# make copies of the 3 datasets for cleaning

df_archive_clean = df_archive.copy()

df_image_clean = df_image.copy()

tweets_api_clean = tweets_api.copy()

#### Define

- Drop columns that won't be used for analysis especially with null values.

#### Code

In [None]:
df_archive_clean.drop(['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], 
                axis = 1, inplace = True)

#### Test

In [None]:
df_archive_clean.sample(2)

#### Define

- In timestamp column, +0000 should be removed. 

#### Code

In [None]:
df_archive_clean.timestamp = df_archive_clean.timestamp.str[:-5].str.strip()

#### Test

In [None]:
df_archive_clean.timestamp.head(2)

#### Define

- Change the timestamp to a datetime datatype

#### Code

In [None]:
df_archive_clean["timestamp"] = pd.to_datetime(df_archive_clean["timestamp"])

#### Test

In [None]:
df_archive_clean["timestamp"].info()
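
As an alternative to slicing off the `+0000` suffix before converting, `pd.to_datetime` can parse the offset directly. The sketch below assumes the timestamps look like the two sample strings, which are illustrative values only.

```python
import pandas as pd

# Illustrative timestamps in the assumed "YYYY-MM-DD HH:MM:SS +0000" layout
raw = pd.Series(["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"])

# Parse the UTC offset explicitly, then drop the timezone information
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S %z", utc=True)
naive = parsed.dt.tz_localize(None)
print(naive)
```

Parsing the offset avoids relying on the suffix always being exactly five characters long.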

#### Define
 
- Change the source column to simpler words.

In [None]:
# change the source list: 
source_list = ['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
              '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
              '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
              '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>']


source_list_new = ['Twitter for iPhone', 'Vine', 'Twitter Web Client', 'TweetDeck']

In [None]:
# Map each raw HTML source string to its short label in one call
df_archive_clean['source'] = df_archive_clean['source'].replace(source_list, source_list_new)

#### Test

In [None]:
df_archive_clean.sample(3)
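
If the list of sources is not known in advance, the label can also be pulled out of the HTML anchor with a regular expression. This is a sketch of that alternative, not the approach used above; note that it keeps the full anchor text, so Vine appears as "Vine - Make a Scene" rather than "Vine".

```python
import pandas as pd

# Two of the raw source strings from the archive
raw_source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
])

# Capture the text between the opening tag and the closing </a>
label = raw_source.str.extract(r">([^<]+)</a>", expand=False)
print(label)
```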

#### Define

* Columns 'doggo', 'floofer', 'pupper', 'puppo' in df_archive should be a single column, stage.

#### Code



In [None]:
# for loop to replace all the 'None' 
stage = ['doggo','pupper', 'floofer', 'puppo' ]
for i in stage:
        df_archive_clean[i] = df_archive_clean[i].replace('None', '')

* I used the cat documentation https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.cat.html

In [None]:
df_archive_clean.head()

In [None]:
# use cat to combine  
df_archive_clean['stage'] = df_archive_clean.doggo.str.cat(df_archive_clean.floofer).str.cat(df_archive_clean.pupper).str.cat(df_archive_clean.puppo)

# drop the four old columns
df_archive_clean = df_archive_clean.drop(['doggo','floofer','pupper','puppo'], axis = 1)

# use np.nan to fill the empty values
df_archive_clean['stage'] = df_archive_clean['stage'].replace('', np.nan)

In [None]:
# Assign the result back instead of filling in place on a column accessor
df_archive_clean['stage'] = df_archive_clean['stage'].fillna('not classified')

#### Test

In [None]:
df_archive_clean.sample(3)
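
Plain string concatenation runs stage names together when a tweet mentions more than one stage (for example "doggopupper"). A possible variation, shown here on made-up rows, joins the stages with a separator instead. The helper name `combine_stages` is hypothetical.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Hypothetical rows: one stage, another stage, two stages, no stage
sample = pd.DataFrame({
    "doggo": ["doggo", "None", "doggo", "None"],
    "floofer": ["None", "None", "None", "None"],
    "pupper": ["None", "pupper", "pupper", "None"],
    "puppo": ["None", "None", "None", "None"],
})


def combine_stages(row):
    """Join every stage present in the row, or mark the row as unclassified."""
    found = [row[col] for col in stage_cols if row[col] != "None"]
    return ", ".join(found) if found else "not classified"


sample["stage"] = sample[stage_cols].apply(combine_stages, axis=1)
print(sample["stage"].tolist())
```

Whether multi-stage tweets deserve their own category is a judgment call; the separator just keeps them readable.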

#### Define

- Create a standard for "rating_denominator". The standard is 10. Since the standard is 10, I decided to drop the column.

#### Code



In [None]:
df_archive_clean.drop('rating_denominator', axis=1, inplace=True)

#### Test


In [None]:
df_archive_clean.head(2)

#### Define

- The "rating_numerator" has some extreme values. Cap it so that ratings fall in the range 0-10.

#### Code

In [None]:
df_archive_clean.loc[df_archive_clean['rating_numerator'] > 10 , 'rating_numerator'] = 10

#### Test

In [None]:
df_archive_clean.describe()
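
The capping step can also be written with `Series.clip`, which states the intent in one call. The values below are made up for illustration.

```python
import pandas as pd

# Hypothetical numerators, including two extreme values
ratings = pd.Series([5, 12, 1776, 10], name="rating_numerator")

# Every value above 10 is replaced by 10; lower values are unchanged
capped = ratings.clip(upper=10)
print(capped.tolist())
```

Keep in mind that capping discards the difference between, say, 11 and 13, so it changes what the later rating plots can show.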

#### Define

- The name column contains 'a' and 'None'. These should be changed to NaN.

#### Code

In [None]:
# Assign the result back instead of replacing in place on a column accessor
df_archive_clean['name'] = df_archive_clean['name'].replace(['None', 'a'], np.nan)

#### Test

In [None]:
df_archive_clean.name.value_counts().head()
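
The replacement above targets two known placeholders. A broader check, sketched here under the assumption that real dog names start with a capital letter, flags every all-lowercase entry as well. The sample names are illustrative, and the assumption should be verified against the actual column before use.

```python
import pandas as pd

# Hypothetical names: two real ones plus three placeholders
names = pd.Series(["Phineas", "a", "None", "the", "Tilly"])

# Flag the literal 'None' and any all-lowercase word as invalid
invalid = names.eq("None") | names.str.islower()

# mask() replaces flagged entries with NaN
cleaned = names.mask(invalid)
print(cleaned)
```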

#### Define

- The dog names format should be consistent. Make the first letter capital for all the names or all small letters.

#### Code

In [None]:

df_archive_clean['name'] = df_archive_clean.name.str.capitalize()

#### Test

In [None]:
df_archive_clean.head()

### Tweet image Prediction File



#### Define

- Drop duplicate values from jpg_url

#### Code

In [None]:
df_image_clean = df_image_clean.drop_duplicates(subset=['jpg_url'], keep='first')

#### Test

In [None]:
sum(df_image_clean['jpg_url'].duplicated())

#### Define

- The column names such as p1, p2, p3 are not descriptive.

#### Code

In [None]:
df_image_clean.sample(1)

In [None]:
df_image_clean.rename(columns={'p1':'predict_1', 'p1_conf': 'probability_1', 'p1_dog': 'classify_1',
                                  'p2': 'predict_2', 'p2_conf': 'probability_2', 'p2_dog': 'classify_2',
                                  'p3': 'predict_3', 'p3_conf': 'probability_3', 'p3_dog': 'classify_3'}, inplace = True)

#### Test

In [None]:
df_image_clean.sample(1)

## Merge datasets: WeRateDogs archive and Tweet image prediction dataset

In [None]:
master_dataset1 = pd.merge(df_archive_clean, 
                      df_image_clean, 
                      how = 'left', on = ['tweet_id'])

In [None]:
master_dataset1.sample(3)

In [None]:
master_dataset1.shape

In [None]:
master_dataset1.isnull().sum()

In [None]:
master_dataset1.info()

In [None]:
#keep rows that have picture (jpg_url)

master_dataset1 = master_dataset1[master_dataset1['jpg_url'].notnull()]

In [None]:
master_dataset1.info()
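
A left merge followed by dropping rows without a picture keeps the same rows as an inner merge, provided the image table itself has no missing `jpg_url`. The sketch below checks this on two small made-up tables.

```python
import pandas as pd

# Hypothetical archive and image tables; tweet 2 has no image
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [10, 9, 10]})
images = pd.DataFrame({"tweet_id": [1, 3], "jpg_url": ["a.jpg", "c.jpg"]})

# Approach used above: left merge, then keep rows that have a picture
left_filtered = pd.merge(archive, images, how="left", on="tweet_id")
left_filtered = left_filtered[left_filtered["jpg_url"].notnull()].reset_index(drop=True)

# Alternative: an inner merge keeps only tweets present in both tables
inner = pd.merge(archive, images, how="inner", on="tweet_id")

print(left_filtered.to_dict("records") == inner.to_dict("records"))
```

The two-step version has the advantage of showing how many tweets lack an image before they are removed.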

## Merge datasets; Master dataset and Tweet_api dataset

In [None]:
tweets_api.head()

In [None]:
twitter_data = pd.merge(master_dataset1, tweets_api, 
                      how = 'left', left_on = 'tweet_id', right_on = 'id')

In [None]:
twitter_data.head()
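
Merging with `left_on` and `right_on` keeps both key columns, which is why a redundant `id` column shows up in the result. One way to avoid it, sketched on made-up tables, is to rename the key before merging.

```python
import pandas as pd

# Hypothetical stand-ins for master_dataset1 and tweets_api
master = pd.DataFrame({"tweet_id": [1, 2], "predict_1": ["pug", "beagle"]})
api = pd.DataFrame({"id": [1, 2], "retweet_count": [5, 8], "favorite_count": [20, 30]})

# Rename the key so both tables share one column name, then merge on it
merged = master.merge(api.rename(columns={"id": "tweet_id"}), how="left", on="tweet_id")
print(merged.columns.tolist())
```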

## Assessing merged dataset

#### Visual Assessment

In [None]:
twitter_data.head()

In [None]:
twitter_data.tail()

#### Programmatic Assessment

- Using Pandas functions

In [None]:
twitter_data.info()

In [None]:
twitter_data.describe()

In [None]:
twitter_data.describe(exclude = 'number').T

### Cleaning summary

`Quality List`
- Drop the id column, it is redundant.
- Drop columns not needed for further analysis.
- Assign the proper datatype to columns

### Quality


#### Define
- Drop the id column, it is redundant.

#### Code

In [None]:
#Remove redundant variable id
twitter_data.drop('id', axis=1, inplace = True)

#### Test

In [None]:
twitter_data.shape

#### Define
- Drop columns not needed for further analysis.

#### Code

In [None]:
twitter_data.drop(['expanded_urls', 'jpg_url'], axis = 1, inplace = True)

In [None]:
twitter_data.dropna(subset=["retweet_count", "favorite_count"], inplace = True)

#### Test

In [None]:
twitter_data.info()

#### Define

- All the values within columns should be in lowercase.

#### Code

In [None]:
strings = list(twitter_data.dtypes[twitter_data.dtypes == 'object'].index)

In [None]:
strings

In [None]:
for col in strings:
    # Convert to string, then lowercase so that the values are consistent
    twitter_data[col] = twitter_data[col].astype(str).str.lower()

#### Test

In [None]:
twitter_data.sample(3)

#### Define
- Assign the proper datatype to columns

#### Code


In [None]:
twitter_data.stage = twitter_data.stage.astype('category')
twitter_data.retweet_count = twitter_data.retweet_count.astype('int64')
twitter_data.favorite_count = twitter_data.favorite_count.astype('int64')

In [None]:
twitter_data.info()
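
The count columns need an explicit cast because a left merge introduces NaN for unmatched tweets, and NaN forces an integer column to float. The sketch below reproduces this on made-up tables and shows why the rows with missing counts must be dropped before casting.

```python
import pandas as pd

# Hypothetical tables; tweet 2 has no API record
left = pd.DataFrame({"tweet_id": [1, 2, 3]})
right = pd.DataFrame({"tweet_id": [1, 3], "retweet_count": [5, 8]})

merged = left.merge(right, how="left", on="tweet_id")
dtype_after_merge = merged["retweet_count"].dtype  # float, because of the NaN

# Drop the unmatched row first; casting NaN to int64 would raise an error
merged = merged.dropna(subset=["retweet_count"])
merged["retweet_count"] = merged["retweet_count"].astype("int64")
print(dtype_after_merge, merged["retweet_count"].dtype)
```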

In [None]:
twitter_data.drop(['text','img_num'], axis = 1, inplace = True)

In [None]:
twitter_data.head()

### Storing data in a csv file

In [None]:
# Store the clean DataFrame in a CSV file

twitter_data.to_csv('twitter_archive_master.csv',index=False, encoding = 'utf-8')
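
CSV files do not store column types, so the datetime and category types set during cleaning have to be restored when the file is read back. This is a minimal round-trip sketch on a made-up frame, using an in-memory buffer in place of `twitter_archive_master.csv`.

```python
import io

import pandas as pd

# Hypothetical cleaned data with a datetime and a categorical column
clean = pd.DataFrame({
    "tweet_id": [1, 2],
    "timestamp": pd.to_datetime(["2016-05-10 12:00:00", "2016-06-01 08:30:00"]),
    "stage": pd.Categorical(["pupper", "not classified"]),
})

buffer = io.StringIO()
clean.to_csv(buffer, index=False)

# A plain read loses the types...
buffer.seek(0)
plain = pd.read_csv(buffer)

# ...so they are requested explicitly when loading
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["timestamp"], dtype={"stage": "category"})
print(restored.dtypes)
```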

## Visualizing Data

- Using the cleaned dataset

In [None]:
twitter_data.sample(2)

In [None]:
twitter_data.describe()

#### Univariate Analysis

In [None]:
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a marker indicates the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
    )  # histogram; falls back to automatic binning when bins is None
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

In [None]:
histogram_boxplot(twitter_data, "retweet_count")

#### Insights:

- The retweet count is right skewed which means the average retweet count is more than the median.

- The box plot shows that there are many outliers.
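
The two observations can be checked numerically as well as visually. The sketch below uses made-up counts and the common 1.5 × IQR rule for outliers; the rule is an assumption of this example and may differ in detail from how the boxplot whiskers are drawn.

```python
import pandas as pd

# Hypothetical counts with one very large value
counts = pd.Series([10, 20, 30, 40, 1000], name="retweet_count")

# Right skew: the mean is pulled above the median and skewness is positive
mean_above_median = counts.mean() > counts.median()
positive_skew = counts.skew() > 0

# Values above the upper fence (Q3 + 1.5 * IQR) count as outliers
q1, q3 = counts.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = counts[counts > upper_fence]
print(mean_above_median, positive_skew, outliers.tolist())
```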

In [None]:
histogram_boxplot(twitter_data, "favorite_count")

#### Insights:

- The favorite count is right skewed, which means the average favorite count is more than the median.

- The box plot shows that there are many outliers.

In [None]:
histogram_boxplot(twitter_data, "rating_numerator")

#### Insights

- All ratings are at or below 10, because values above 10 were capped during cleaning.

In [None]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

In [None]:
labeled_barplot(twitter_data, "source", perc=True)

#### Insights

- Most tweets come from Twitter for iPhone, followed by the Twitter Web Client.

In [None]:
labeled_barplot(twitter_data, "stage", perc=True)

#### Insights

- Most tweets are not classified, meaning the dog stage is missing. Pupper is the most common stage among the rest.

In [None]:
labeled_barplot(twitter_data, "classify_1", perc=True)

In [None]:
labeled_barplot(twitter_data, "classify_2", perc=True)

In [None]:
labeled_barplot(twitter_data, "classify_3", perc=True)

#### Insights

- Most of the predictions are classified as True.

In [None]:
labeled_barplot(twitter_data, "rating_numerator", perc=True)

#### Insights

- Most ratings are 10.

In [None]:
plt.figure(figsize = (17,6))
top_breeds = twitter_data['predict_1'].value_counts().head(10)  # ten most frequent first predictions
ax = sns.barplot(x=top_breeds.index, y=top_breeds.values);
ax.set_xticklabels(ax.get_xticklabels(),rotation = 60, fontsize = 15);
plt.xlabel("Dog Breeds",fontsize = 18);
plt.ylabel("Prediction Count",fontsize = 18);
plt.title("Popular Dog Breeds",fontsize = 18);

#### Insights

- The Golden retriever is the most predicted dog breed.

In [None]:
plt.figure(figsize=(12,8))
plt.xticks(rotation=60)
plt.plot(twitter_data.timestamp, twitter_data.favorite_count)
plt.grid()

In [None]:
plt.figure(figsize=(12,8))
plt.xticks(rotation=60)
plt.plot(twitter_data.timestamp, twitter_data.retweet_count)
plt.grid()

#### Insights:

- There was more activity between May and June 2016, based on retweet and favorite counts.
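
Reading a busy period off a line plot is approximate, so it can help to total the counts per month. This is a sketch on made-up values; the real figures would come from `twitter_data`.

```python
import pandas as pd

# Hypothetical tweets: two in May 2016, one in June 2016
sample = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-05-10", "2016-05-20", "2016-06-01"]),
    "retweet_count": [100, 200, 50],
})

# Group by calendar month and total the retweets in each month
monthly = sample.groupby(sample["timestamp"].dt.to_period("M"))["retweet_count"].sum()

# The month with the highest total
busiest = monthly.idxmax()
print(monthly)
print(busiest)
```

The same grouping applied to `favorite_count` would show whether both measures peak in the same month.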