## Notebook 3 - EDA and Next Steps
The purpose of this notebook is to perform exploratory data analysis (EDA) on the tweets and create a list of next steps based on the findings.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
df = pd.read_pickle('../data/all_tweets.p')

In [None]:
df.info()

In [None]:
df.sample(5)

**Note:**  
Checking whether the username field contains only SPCA accounts. It looks like there are 11 others in there. For now I'm going to remove them, and I'll add investigating how they got in there to the list of next steps.

In [None]:
df.username.value_counts()

In [None]:
# created this list by printing unique usernames and erasing non-SPCAs
all_sfspca = ['sfspca', 'PSPCA', 'HoustonSPCA', 'spcaoftexas', 'Tulsa_SPCA', 'RichmondSPCA',
              'OntarioSPCA', 'FMSPCA', 'BC_SPCA']

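# keep only SPCA accounts; .copy() so new columns can be added later without a SettingWithCopyWarning
df = df.loc[df.username.isin(all_sfspca)].copy()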
df.shape

**Note:**  
Looking at correlation between favorites and retweets to see if both need to be factored into scoring.

In [None]:
fig = plt.figure(figsize = (10,6))
sns.heatmap(df[['favorites','retweets']].corr(), annot=True)
# plt.savefig('retweet-favorite-correlation.png')

**Note:**  
Because these two are highly correlated, it makes sense to consider only one of them in the scoring, and that will be the number of retweets. I need to consider whether scoring should incorporate the number of followers (i.e. retweets divided by followers). The data does not currently include the number of followers, so add researching this to the list of next steps.
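
**Note:**  
A minimal sketch of what the follower-adjusted score could look like, assuming follower counts can be obtained somehow. The `followers` dictionary, its numbers, the toy tweets and the `retweet_rate` column name are all hypothetical and not part of the current data.

```python
import pandas as pd

# Toy data standing in for the real tweets; all values are made up.
toy_followers = pd.DataFrame({
    'username': ['sfspca', 'sfspca', 'BC_SPCA'],
    'retweets': [10, 30, 5],
})

# Hypothetical follower counts (the current data does not have these).
followers = {'sfspca': 1000, 'BC_SPCA': 500}

# Retweets per follower, so large and small accounts can be compared.
toy_followers['retweet_rate'] = (
    toy_followers['retweets'] / toy_followers['username'].map(followers)
)
print(toy_followers)
```

Follower counts change over time, so a single count per account would only be a rough adjustment for older tweets.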

**Note:**  
I'm going to create three more indicator columns (0 or 1), one each for whether or not a tweet has a mention, a hashtag or a URL.

In [None]:
# look at how many tweets are without mentions, hashtags, urls
sum(df.mentions == ''), sum(df.hashtags == ''), sum(df.urls == '')

In [None]:
# new 0/1 columns based on presence of mention, hashtag, url
# (1 if the field is non-empty, 0 if it is an empty string)
has_value = lambda x: 0 if x == '' else 1
df['has_mention'] = df.mentions.apply(has_value)
df['has_hashtag'] = df.hashtags.apply(has_value)
df['has_url'] = df.urls.apply(has_value)

In [None]:
df.sample(5)

In [None]:
# sns.pairplot(df, vars=['favorites', 'retweets', 'year', 'month', 'weekday', 'local_hour'])

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,6))
plt.suptitle('Mean and Median Retweets for Each Organization Based on Mentions', fontsize=20)
sns.barplot(x='username', y='retweets', hue='has_mention', data=df, ci=None, ax=axes[0])
sns.barplot(x='username', y='retweets', hue='has_mention', data=df, estimator=np.median,ci=None, ax=axes[1])
for ax in axes:
    plt.sca(ax)
    plt.xticks(rotation=45)
    ax.yaxis.label.set_fontsize(16)
    ax.xaxis.label.set_fontsize(16)

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,6))
plt.suptitle('Mean and Median Retweets for Each Organization Based on Hashtags', fontsize=20)
sns.barplot(x='username', y='retweets', hue='has_hashtag', data=df, ci=None, ax=axes[0])
sns.barplot(x='username', y='retweets', hue='has_hashtag', data=df, estimator=np.median,ci=None, ax=axes[1])
for ax in axes:
    plt.sca(ax)
    plt.xticks(rotation=45)
    ax.yaxis.label.set_fontsize(16)
    ax.xaxis.label.set_fontsize(16)

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,6))
plt.suptitle('Mean and Median Retweets for Each Organization Based on URLs', fontsize=20)
sns.barplot(x='username', y='retweets', hue='has_url', data=df, ci=None, ax=axes[0])
sns.barplot(x='username', y='retweets', hue='has_url', data=df, estimator=np.median,ci=None, ax=axes[1])
for ax in axes:
    plt.sca(ax)
    plt.xticks(rotation=45)
    ax.yaxis.label.set_fontsize(16)
    ax.xaxis.label.set_fontsize(16)

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,6))
plt.suptitle('Mean and Median Retweets by Hour', fontsize=20)
sns.barplot(x='local_hour', y='retweets', data=df, ci=None, ax=axes[0])
sns.barplot(x='local_hour', y='retweets', data=df, estimator=np.median,ci=None, ax=axes[1])
for ax in axes:
    plt.sca(ax)
#     plt.xticks(rotation=45)
    ax.yaxis.label.set_fontsize(16)
    ax.xaxis.label.set_fontsize(16)

In [None]:
# look at tweets at 3 am to see what's causing such a high value
df.loc[df.local_hour == 3].username.value_counts()

**Note:**  
Almost all of the 3 am tweets are from the Alberta organization. Might there be an issue with the timestamp? Add to list of next steps.
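
**Note:**  
One way to check the hour would be to recompute it per organization from a known time zone. This is only a sketch: it assumes the stored timestamps are naive UTC, which I have not confirmed, and the `tz_by_user` mapping and `local_hour_check` column are hypothetical names used for illustration.

```python
import pandas as pd

# Hypothetical username -> IANA time zone mapping; not taken from the data.
tz_by_user = {'HoustonSPCA': 'America/Chicago', 'BC_SPCA': 'America/Vancouver'}

# Toy timestamps, assumed here to be naive UTC.
toy_tz = pd.DataFrame({
    'username': ['HoustonSPCA', 'BC_SPCA'],
    'date': pd.to_datetime(['2017-08-25 08:00', '2017-08-25 08:00']),
})

# Interpret each timestamp as UTC, convert to the account's zone, take the hour.
toy_tz['local_hour_check'] = [
    ts.tz_localize('UTC').tz_convert(tz_by_user[user]).hour
    for ts, user in zip(toy_tz['date'], toy_tz['username'])
]
print(toy_tz)
```

If the recomputed hours differ from `local_hour` in a consistent way for one organization, that would point to a time zone offset problem rather than real 3 am tweeting.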

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,6))
plt.suptitle('Mean and Median Retweets by Month', fontsize=20)
sns.barplot(x='month', y='retweets', data=df, ci=None, ax=axes[0])
sns.barplot(x='month', y='retweets', data=df, estimator=np.median,ci=None, ax=axes[1])
for ax in axes:
    plt.sca(ax)
#     plt.xticks(rotation=45)
    ax.yaxis.label.set_fontsize(16)
    ax.xaxis.label.set_fontsize(16)

**Note:**  
I'm thinking that Hurricane Harvey boosted the retweet numbers for Houston in the month of August. Add considering whether to remove these tweets to the list of next steps.

In [None]:
from datetime import datetime
# median retweets for HoustonSPCA tweets after August 24, 2017
df.loc[(df.date > datetime(2017, 8, 24)) & (df.username == 'HoustonSPCA')][['retweets']].median()
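
**Note:**  
If I do decide to remove the Harvey tweets, something like the sketch below might work. The `drop_window` helper is hypothetical, the toy numbers are made up, and the end date of the window is an assumption I would need to pick after looking at the data.

```python
import pandas as pd

def drop_window(tweets, username, start, end):
    """Return tweets without one account's rows dated from start to end (inclusive)."""
    in_window = (tweets['username'] == username) & tweets['date'].between(start, end)
    return tweets.loc[~in_window]

# Toy data; all values are made up.
toy_harvey = pd.DataFrame({
    'username': ['HoustonSPCA', 'HoustonSPCA', 'sfspca'],
    'date': pd.to_datetime(['2017-08-20', '2017-08-28', '2017-08-28']),
    'retweets': [4, 250, 6],
})

# The end of the window (2017-09-30) is an assumption for illustration only.
trimmed = drop_window(toy_harvey, 'HoustonSPCA',
                      pd.Timestamp('2017-08-24'), pd.Timestamp('2017-09-30'))

# Compare the median before and after dropping the window.
print(toy_harvey['retweets'].median(), trimmed['retweets'].median())
```

Only the one account's tweets in the window are dropped, so other organizations' August tweets stay in the monthly comparison.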

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,6))
plt.suptitle('Mean and Median Retweets by Day of Week', fontsize=20)
sns.barplot(x='weekday', y='retweets', data=df, ci=None, ax=axes[0])
sns.barplot(x='weekday', y='retweets', data=df, estimator=np.median,ci=None, ax=axes[1])
for ax in axes:
    plt.sca(ax)
#     plt.xticks(rotation=45)
    ax.yaxis.label.set_fontsize(16)
    ax.xaxis.label.set_fontsize(16)

**Notes:**  
1. Having a mention seems to result in fewer retweets.
1. Having a hashtag seems to result in more retweets.
1. Having a URL has varying effects depending on the organization.
1. It appears tweets sent in the very early morning hours have the highest number of retweets, but I wonder if this is an error in processing the timestamp.
1. The month a tweet was sent seems to make no difference, assuming the bump in August is from Hurricane Harvey in Houston. Need to look into this more.
1. It looks like weekend tweets get more retweets than weekday tweets.

## Next Steps
1. Can I get number of followers from the Twitter API?
1. Look into the time zone of the timestamp so that "hour" and day of week are both local to the user. This might need to be handled as tweets are pulled, since the adjustment may be different for each SPCA. Also, there are lots of tweets at 3 am from the Alberta organization. Can this be real?
1. Investigate how those 11 tweets from non-SPCA users got into the data.
1. Consider removing the HoustonSPCA tweets from around Hurricane Harvey, since they may be inflating the August retweet numbers.
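
A possible starting point for the non-SPCA investigation is to pull out the unexpected rows and read them. This is a sketch on made-up data: the `unexpected_rows` helper and the `text` column are hypothetical, and on the real data it would need to run on the freshly loaded pickle, because `df` in this notebook has already been filtered.

```python
import pandas as pd

def unexpected_rows(tweets, expected_usernames):
    """Return the rows whose username is not in the expected list."""
    return tweets.loc[~tweets['username'].isin(expected_usernames)]

# Toy data; usernames and text are made up.
toy_raw = pd.DataFrame({
    'username': ['sfspca', 'some_other_user', 'PSPCA'],
    'text': ['adopt!', 'RT @sfspca adopt!', 'kittens'],
})

strays = unexpected_rows(toy_raw, ['sfspca', 'PSPCA'])
print(strays)
```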
