## Notebook 3 - EDA on Non-Text Data and Next Steps
The purpose of this notebook is to perform EDA on the non-text aspects of the tweets and create a list of next steps based on findings.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
df = pd.read_pickle('../data/all_tweets.p')

In [None]:
df.info()

In [None]:
df.sample(5)

**Note:**  
Checking whether the username field contains only SPCA accounts. It also contains a few other users, so I'm going to remove them.

In [None]:
df.username.value_counts()

In [None]:
# created this list by printing unique usernames and erasing the non-SPCA accounts
all_spcas = ['sfspca', 'PSPCA', 'HoustonSPCA', 'spcaoftexas', 'Tulsa_SPCA', 'RichmondSPCA',
             'OntarioSPCA', 'FMSPCA', 'BC_SPCA']

# keep only tweets from the SPCA accounts listed above
df = df.loc[df.username.isin(all_spcas)]
df.shape

**Note:**  
Because Hurricane Harvey boosted retweet numbers, I researched other natural disasters in the organizations' locations and found two more. I'm going to remove all tweets in a two-week window that starts a few days before each disaster. Here's a list of each disaster:
1. Hurricane Harvey (Houston) - made landfall on 8/25/17
2. Alberta Wildfire (Fort McMurray) - started 5/1/16 and under control 6/3/16
3. Moore Tornado (Oklahoma) - touched down 5/20/13

In [None]:
from datetime import datetime
from datetime import timedelta
import pytz

In [None]:
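# create disaster dictionary that lists the beginning of each timeframe as 2-3 days prior
# note: pytz zones must be attached with localize(); passing them as tzinfo= to the
# datetime constructor uses the zone's local-mean-time offset instead of CST/CDT or MST/MDT
central = pytz.timezone('US/Central')
mountain = pytz.timezone('US/Mountain')
disaster_dict = {
    'hurricane': central.localize(datetime(2017, 8, 22)),
    'wildfire': mountain.localize(datetime(2016, 4, 29)),
    'tornado': central.localize(datetime(2013, 5, 18))
}

for disaster, start in disaster_dict.items():
    # each removal window runs 14 days from its start date
    end = start + timedelta(days=14)
    df = df.loc[~((df.local_datetime > start) & (df.local_datetime < end))]
    print('Shape after removing {}: {}'.format(disaster, df.shape))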
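The window logic can also be written relative to the disaster date itself, which makes "days before" and "days after" explicit. This is only a sketch on toy data: `drop_event_window`, `days_before`, and `days_after` are hypothetical names, and `America/Chicago` is used as the IANA name for the US Central zone.

```python
import pandas as pd

# Toy stand-in for the tweet table; only the timestamp column matters here.
tweets = pd.DataFrame({
    'local_datetime': pd.to_datetime([
        '2017-08-20 12:00', '2017-08-24 09:00',
        '2017-09-05 18:00', '2017-09-12 08:00',
    ]).tz_localize('America/Chicago'),
})

def drop_event_window(frame, event, days_before=3, days_after=14, col='local_datetime'):
    """Drop rows with a timestamp in [event - days_before, event + days_after]."""
    start = event - pd.Timedelta(days=days_before)
    end = event + pd.Timedelta(days=days_after)
    in_window = frame[col].between(start, end)
    return frame.loc[~in_window]

# Hurricane Harvey landfall date from the list above
harvey = pd.Timestamp('2017-08-25', tz='America/Chicago')
kept = drop_event_window(tweets, harvey)
print(kept)
```

Unlike the loop above, this measures the 14 days from the event rather than from the start of the window, so it removes a few more days of tweets.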

**Note:**  
Looking at tweets per user per year to decide whether to drop the first couple of years of data.

In [None]:
by_year_user = df.groupby(by=['year','username'],as_index=False,sort=False)[['author_id']].count()
by_year_user.rename(columns = {'author_id': 'num_tweets'}, inplace=True)

In [None]:
fig = plt.figure(figsize=(15,6))
sns.barplot(x='year', y='num_tweets', hue='username', data=by_year_user, ci=None)
plt.show()
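The same counts can be read as a table, which makes it easier to see how much data an early-year cutoff would discard. A minimal sketch on made-up rows, assuming only the `year` and `username` columns; `early_share` is a hypothetical name.

```python
import pandas as pd

# Toy data: one row per tweet
tweets = pd.DataFrame({
    'year': [2010, 2010, 2012, 2012, 2012, 2013],
    'username': ['sfspca', 'PSPCA', 'sfspca', 'sfspca', 'PSPCA', 'sfspca'],
})

# Rows are years, columns are users, cells are tweet counts (0 where a user has none)
counts = tweets.groupby(['year', 'username']).size().unstack(fill_value=0)
print(counts)

# Fraction of all tweets that a "before 2012" cutoff would remove
early_share = counts.loc[counts.index < 2012].to_numpy().sum() / counts.to_numpy().sum()
print(early_share)
```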

**Note:**  
Dropping all tweets before 2012, since Twitter was still gaining in popularity in those first couple of years (its IPO did not come until November 2013).

In [None]:
df = df.loc[df.year > 2011]
df.shape

**Note:**  
Because retweets of a tweet create more potential viewers of the tweet than favoriting a tweet does, I'm going to use number of retweets as the scoring metric (and the target for the model). I will not be using favorites as a predictor because it is the result of a tweet and isn't a known value before the tweet is sent. I will now look at minimizing the effects of outliers by first addressing the skew in the target and then identifying outliers using the Tukey method.

In [None]:
# looking at stats for the target
df[['retweets']].describe()

**Note:**  
With much of the data gathered around 0 and a very high maximum relative to the mean, median and standard deviation, this data appears heavily right-skewed. I need to transform the data to minimize the skew.

In [None]:
# get measure of skew to confirm suspicion stated above
from scipy.stats import skew

retweet_skew = df[['retweets']].apply(lambda x: skew(x))
retweet_skew

**Note:**  
This is a very high skew measurement (as suspected from the max compared to the mean and median), so I'm going to transform the target with `log1p` (add 1 and take the natural log).

In [None]:
df['retweets'] = np.log1p(df['retweets'])
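To see what the transform does, here is a small sketch on made-up retweet counts. `sample_skew` is a hypothetical helper that computes the biased Fisher-Pearson coefficient (the same definition `scipy.stats.skew` uses by default), written out so the example needs only NumPy.

```python
import numpy as np

def sample_skew(values):
    """Biased Fisher-Pearson skewness: m3 / m2**1.5."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)   # second central moment
    m3 = np.mean(centered ** 3)   # third central moment
    return m3 / m2 ** 1.5

# Toy counts: mostly small values plus one viral tweet
raw = np.array([0, 0, 0, 1, 1, 2, 3, 5, 8, 250])
logged = np.log1p(raw)          # log(1 + x), so 0 retweets maps to 0
recovered = np.expm1(logged)    # inverse transform, back to raw counts

print(sample_skew(raw), sample_skew(logged))
```

Because `np.expm1` inverts `np.log1p`, predictions made on the transformed scale can be converted back to retweet counts later.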

In [None]:
# get measure of skew again for sake of comparison (skew was imported above)
retweet_skew = df[['retweets']].apply(lambda x: skew(x))
retweet_skew

In [None]:
# plot the distribution of retweets
fig = plt.figure(figsize=(8,4))
sns.distplot(df.retweets, kde=False)
plt.show()

**Note:**  
The transformation has significantly reduced the skew. Time to identify outliers.

In [None]:
# function to implement the Tukey method for identifying outliers
def identify_outliers(dataframe, col):
    Q1 = np.percentile(dataframe[col], 25)
    Q3 = np.percentile(dataframe[col], 75)
    tukey_window = 1.5*(Q3-Q1)
    less_than_Q1 = dataframe[col] < Q1 - tukey_window
    greater_than_Q3 = dataframe[col] > Q3 + tukey_window
    tukey_mask = (less_than_Q1 | greater_than_Q3)
    return dataframe[tukey_mask]

In [None]:
outliers = identify_outliers(df,'retweets')
outliers.shape

In [None]:
outliers
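As a worked example of the Tukey rule, the sketch below computes the fences on ten made-up values. `tukey_fences` and its `k` parameter are hypothetical names; the arithmetic mirrors `identify_outliers` above.

```python
import numpy as np

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    window = k * (q3 - q1)
    return q1 - window, q3 + window

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
lower, upper = tukey_fences(values)

# Q1 = 3.25 and Q3 = 7.75, so IQR = 4.5 and the window is 6.75
print(lower, upper)

# Anything outside the fences is flagged
flagged = values[(values < lower) | (values > upper)]
print(flagged)
```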

**Note:**  
I'm going to keep these 12 outliers as the tweets do not look to be the result of anything extraordinary. They include celebrity visitors, campaigns for pet safety, etc.

**Note:**  
I'm going to create three more indicator columns (0 or 1), one each for whether or not a tweet has a hashtag, a mention or a URL.

In [None]:
# new columns based on presence of mention, hashtag, url
has_one = lambda x: 0 if x == '' else 1
df['has_mention'] = df.mentions.apply(has_one)
df['has_hashtag'] = df.hashtags.apply(has_one)
df['has_url'] = df.urls.apply(has_one)
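A vectorized variant of the same flag is sketched below on toy data. It assumes, as the lambda above does, that an empty string means "none present"; treating missing values and whitespace-only strings as empty is an extra assumption that the lambda does not make (it would mark a `NaN` as 1).

```python
import numpy as np
import pandas as pd

# Toy column with a present value, an empty string, a missing value, and two hashtags
tweets = pd.DataFrame({
    'hashtags': ['#adopt', '', np.nan, '#cats #dogs'],
})

# 1 when the field has any non-blank text, 0 otherwise
tweets['has_hashtag'] = (
    tweets['hashtags'].fillna('').str.strip().ne('').astype(int)
)
print(tweets)
```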

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,6))
plt.suptitle('Mean and Median Retweets for Each Organization Based on Presence of Mentions', fontsize=20)
sns.barplot(x='username', y='retweets', hue='has_mention', data=df, ci=None, ax=axes[0])
sns.barplot(x='username', y='retweets', hue='has_mention', data=df, estimator=np.median,ci=None, ax=axes[1])
for ax in axes:
    plt.sca(ax)
    plt.xticks(rotation=45)
    ax.yaxis.label.set_fontsize(16)
    ax.xaxis.label.set_fontsize(16)

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,6))
plt.suptitle('Mean and Median Retweets for Each Organization Based on Presence of Hashtags', fontsize=20)
sns.barplot(x='username', y='retweets', hue='has_hashtag', data=df, ci=None, ax=axes[0])
sns.barplot(x='username', y='retweets', hue='has_hashtag', data=df, estimator=np.median,ci=None, ax=axes[1])
for ax in axes:
    plt.sca(ax)
    plt.xticks(rotation=45)
    ax.yaxis.label.set_fontsize(16)
    ax.xaxis.label.set_fontsize(16)

In [None]:
fig = plt.figure(figsize=(20,7))
plt.suptitle('Retweets by User Based on Presence of Hashtag', fontsize=20)
sns.boxplot(x='username', y='retweets', hue='has_hashtag',data=df)
plt.xlabel('User', fontsize=16)
plt.ylabel('Retweets', fontsize=16)
plt.show()

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,6))
plt.suptitle('Mean and Median Retweets for Each Organization Based on Presence of URLs', fontsize=20)
sns.barplot(x='username', y='retweets', hue='has_url', data=df, ci=None, ax=axes[0])
sns.barplot(x='username', y='retweets', hue='has_url', data=df, estimator=np.median,ci=None, ax=axes[1])
for ax in axes:
    plt.sca(ax)
    plt.xticks(rotation=45)
    ax.yaxis.label.set_fontsize(16)
    ax.xaxis.label.set_fontsize(16)

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,6))
plt.suptitle('Mean and Median Retweets by Hour', fontsize=20)
sns.barplot(x='hour', y='retweets', data=df, ci=None, ax=axes[0])
sns.barplot(x='hour', y='retweets', data=df, estimator=np.median,ci=None, ax=axes[1])
for ax in axes:
    plt.sca(ax)
    ax.yaxis.label.set_fontsize(16)
    ax.xaxis.label.set_fontsize(16)

In [None]:
fig = plt.figure(figsize=(20,7))
plt.suptitle('Retweets by Hour Based on Presence of Hashtag', fontsize=20)
sns.boxplot(x='hour', y='retweets', hue='has_hashtag',data=df)
plt.xlabel('Hour', fontsize=16)
plt.ylabel('Retweets', fontsize=16)
plt.show()

In [None]:
# look at tweets at 3 am to see what's causing such a high value
df.loc[df.hour == 3].username.value_counts()

**Note:**  
Almost all of the 3 am tweets are from the Alberta organization. They may be using a scheduling tool like Zoho or Hootsuite.
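One way to check whether a single account dominates an hour is to look at each user's share of tweets per hour. The sketch below uses made-up counts; only the `hour` and `username` column names come from this notebook.

```python
import pandas as pd

# Toy data: one row per tweet
tweets = pd.DataFrame({
    'hour': [3, 3, 3, 3, 14, 14],
    'username': ['FMSPCA', 'FMSPCA', 'FMSPCA', 'sfspca', 'sfspca', 'PSPCA'],
})

# normalize='index' makes each row (hour) sum to 1
share = pd.crosstab(tweets['hour'], tweets['username'], normalize='index')
print(share)
```

A row where one column is close to 1 means that hour's average mostly reflects one organization rather than the time of day.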

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,6))
plt.suptitle('Mean and Median Retweets by Month', fontsize=20)
sns.barplot(x='month', y='retweets', data=df, ci=None, ax=axes[0])
sns.barplot(x='month', y='retweets', data=df, estimator=np.median,ci=None, ax=axes[1])
for ax in axes:
    plt.sca(ax)
    ax.yaxis.label.set_fontsize(16)
    ax.xaxis.label.set_fontsize(16)

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,6))
plt.suptitle('Mean and Median Retweets by Day of Week', fontsize=20)
sns.barplot(x='weekday', y='retweets', data=df, ci=None, ax=axes[0])
sns.barplot(x='weekday', y='retweets', data=df, estimator=np.median,ci=None, ax=axes[1])
for ax in axes:
    plt.sca(ax)
    ax.yaxis.label.set_fontsize(16)
    ax.xaxis.label.set_fontsize(16)

**Notes:**  
1. Having a mention seems to result in fewer retweets.
1. Having a hashtag seems to result in more retweets, more so for some organizations (e.g. BC SPCA), and the increase also varies a lot by time of day.
1. Having a URL has varying effects depending on the organization.
1. Tweets sent in the very early morning hours appear to have the highest number of retweets.
1. Tweets sent in July got a little bump in retweets, and it's interesting to note that none of the outliers were tweeted in July.
1. It looks like weekend tweets get more retweets than those sent during the week.
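The observations above are read off bar charts; the same comparisons can be put into numbers with a grouped summary. This is a sketch on made-up values, and `lift` is a hypothetical name for the difference in group means.

```python
import pandas as pd

# Toy data; in the notebook, retweets are already on the log1p scale
tweets = pd.DataFrame({
    'username': ['sfspca'] * 4 + ['BC_SPCA'] * 4,
    'has_hashtag': [0, 0, 1, 1, 0, 0, 1, 1],
    'retweets': [1.0, 2.0, 2.0, 4.0, 0.0, 1.0, 3.0, 5.0],
})

# Mean, median and group size per organization, with and without a hashtag
summary = (tweets.groupby(['username', 'has_hashtag'])['retweets']
                 .agg(['mean', 'median', 'count'])
                 .unstack('has_hashtag'))
print(summary)

# Difference in mean retweets (with hashtag minus without) per organization
lift = summary['mean'][1] - summary['mean'][0]
print(lift)
```

Including `count` alongside the averages helps show when a difference rests on only a handful of tweets.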

## Next Steps
1. Can I get number of followers from the Twitter API?  
**Result:** Yes, but only back to mid-2016  

1. Look into the time zone of the timestamp so that "hour" and day of week are local to the user. This might need to be added as tweets are pulled, since the adjustment for each SPCA may be different. There are also lots of tweets at 3 am from the Alberta organization. Can this be real?  
**Result:** Fixed the time zone issue and the graph remained relatively unchanged. I suspect they're using a scheduling tool to send tweets in the wee hours.  

1. Investigate how those 11 tweets from non-SPCA users got into the data.  
**Result:** Not worth investigating. Just gonna drop them.  

1. Drop the tweets from Houston since Hurricane Harvey.  
**Result:** Dropped all the tweets during the time periods of three major natural disasters.
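For the time zone step in the list above, a per-organization conversion might look like the sketch below. The `ORG_TZ` mapping, the `created_utc` column, and the zone chosen for each account are all assumptions made for illustration, not the values used when the tweets were pulled.

```python
import pandas as pd

# Hypothetical mapping from account to IANA time zone
ORG_TZ = {'HoustonSPCA': 'America/Chicago', 'BC_SPCA': 'America/Vancouver'}

# Toy data: two tweets sent at the same UTC instant
tweets = pd.DataFrame({
    'username': ['HoustonSPCA', 'BC_SPCA'],
    'created_utc': pd.to_datetime(['2017-07-01 03:00', '2017-07-01 03:00'], utc=True),
})

tweets['local_hour'] = 0
tweets['local_weekday'] = 0
for user, tz in ORG_TZ.items():
    mask = tweets['username'] == user
    # convert this account's timestamps to its own zone, then extract the parts
    local = tweets.loc[mask, 'created_utc'].dt.tz_convert(tz)
    tweets.loc[mask, 'local_hour'] = local.dt.hour
    tweets.loc[mask, 'local_weekday'] = local.dt.dayofweek  # Monday = 0

print(tweets)
```

Storing the derived hour and weekday rather than the converted timestamps avoids mixing several time zones in one datetime column.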


In [None]:
df.to_pickle('../data/post_eda.p')