# NCAA Tweet Exploration
#### Katherine Taylor
The focus of this project is to generate fake tweets using team account data from the first and second rounds of the 2019 NCAA tournament. Two models will be built: one to produce the tweets, and the other to predict the number of likes, retweets, and replies each tweet would have received. The end goal is to create a joke NCAA basketball team account with several tweets. The notebook below explores the existing data.

### 1. Imports

In [20]:
from bokeh.layouts import gridplot
from bokeh.plotting import figure, output_notebook, show
import pandas as pd
import numpy as np
output_notebook()

In [6]:
# import data
basketball_tweets = pd.read_csv("data/cleaned_team_data.csv")

In [8]:
basketball_tweets.head()

Unnamed: 0.1,Unnamed: 0,ID,AuthorUserName,AuthorDisplayName,Created,Message,ContentType,ContextUrl,AuthorFollowerCount,LikeCount,TwitterAuthorListedCount,TWAuthorIsVerified,TWRetweetCount,TWReplyCount,TWAuthorCreated,Tags,ZPoints,Sentiment,No.Tags,Engagement.Score
0,1,1.10924e+18,LibertyMBB,Liberty Men’s Basketball,2019-03-22 23:59:00,WELCOME TO THE DANCE MR. CABBIL The Senior has...,Video,https://twitter.com/LibertyMBB/statuses/110924...,8377,14,108,False,2,0,2011-09-05 17:54:00,Liberty,29,Positive,False,16
1,2,1.10924e+18,FSUHoops,FSU Hoops,2019-03-22 23:59:00,We stand together. Help a Seminole in need.,Text,https://twitter.com/FSUHoops/statuses/11092431...,63235,3911,635,True,1838,49,2009-02-20 15:59:00,Florida State,134,Positive,False,5798
2,3,1.10924e+18,UBmenshoops,UB Men's Basketball,2019-03-22 23:59:00,Bulls Cruise to First-Round Win over Arizona S...,Photo,https://twitter.com/UBmenshoops/statuses/11092...,15485,420,172,True,100,6,2012-08-17 20:32:00,Buffalo,86,Positive,False,526
3,4,1.10924e+18,umichbball,Michigan Men's Basketball,2019-03-22 23:58:00,There is a great challenge ahead of us tomorro...,Photo,https://twitter.com/umichbball/statuses/110924...,309001,224,1687,True,21,1,2009-03-03 16:38:00,Michigan,68,Positive,False,246
4,5,1.10924e+18,ACU_MBB,ACU Mens Basketball,2019-03-22 23:57:00,To our 3 seniors @JayFrank0 @jaren_lewis and @...,Video,https://twitter.com/ACU_MBB/statuses/110924275...,3098,165,42,False,36,0,2013-12-04 16:57:00,Abilene Christi,66,Positive,False,201


In [11]:
basketball_tweets['engagement'] = basketball_tweets['LikeCount'] + basketball_tweets['TWRetweetCount'] + basketball_tweets['TWReplyCount']
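
As a quick sanity check, the summed column can be compared with the `Engagement.Score` column that already ships with the data. The sketch below does not read the CSV; it rebuilds only the count columns of the five rows shown in the `head()` output above, so it runs on its own. The name `sample` is introduced here purely for illustration.

```python
import pandas as pd

# Count columns of the five rows from the head() output, copied by hand
# for illustration (the full data set has 3316 rows).
sample = pd.DataFrame({
    "LikeCount": [14, 3911, 420, 224, 165],
    "TWRetweetCount": [2, 1838, 100, 21, 36],
    "TWReplyCount": [0, 49, 6, 1, 0],
    "Engagement.Score": [16, 5798, 526, 246, 201],
})

# Same sum as in the cell above: likes + retweets + replies.
sample["engagement"] = (
    sample["LikeCount"] + sample["TWRetweetCount"] + sample["TWReplyCount"]
)

# True only if the derived column equals the provided score on every row.
matches = bool((sample["engagement"] == sample["Engagement.Score"]).all())
print(matches)
```

On these five rows the two columns agree. Running the same comparison on `basketball_tweets` would show whether `engagement` simply duplicates `Engagement.Score` across the whole data set.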

In [33]:
basketball_tweets.drop(['Unnamed: 0','ID'],inplace = True, axis = 1)

### 2. Data exploration

In [34]:
basketball_tweets.info()
# no null cells, that's good!

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3316 entries, 0 to 3315
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   AuthorUserName            3316 non-null   object
 1   AuthorDisplayName         3316 non-null   object
 2   Created                   3316 non-null   object
 3   Message                   3316 non-null   object
 4   ContentType               3316 non-null   object
 5   ContextUrl                3316 non-null   object
 6   AuthorFollowerCount       3316 non-null   int64 
 7   LikeCount                 3316 non-null   int64 
 8   TwitterAuthorListedCount  3316 non-null   int64 
 9   TWAuthorIsVerified        3316 non-null   bool  
 10  TWRetweetCount            3316 non-null   int64 
 11  TWReplyCount              3316 non-null   int64 
 12  TWAuthorCreated           3316 non-null   object
 13  Tags                      3316 non-null   object
 14  ZPoints                 

In [35]:
basketball_tweets.describe()

Unnamed: 0,AuthorFollowerCount,LikeCount,TwitterAuthorListedCount,TWRetweetCount,TWReplyCount,ZPoints,Engagement.Score,engagement
count,3316.0,3316.0,3316.0,3316.0,3316.0,3316.0,3316.0,3316.0
mean,131206.9,165.606755,447.219542,29.264174,3.103136,44.50392,197.974065,197.974065
std,271575.6,450.235173,448.982939,111.873623,9.443884,28.201735,559.97167,559.97167
min,1097.0,0.0,32.0,0.0,0.0,6.0,0.0,0.0
25%,8937.0,0.0,122.0,0.0,0.0,13.0,0.0,0.0
50%,39360.5,33.5,356.0,6.0,1.0,46.0,41.0,41.0
75%,109598.2,138.0,593.0,19.0,3.0,64.0,163.25,163.25
max,2218576.0,10116.0,2486.0,3304.0,160.0,144.0,13580.0,13580.0


### 3. Data visualization

In [27]:
hist, edges = np.histogram(basketball_tweets['engagement'], bins = 50)
p = figure(title='Engagement Scores', background_fill_color="#fafafa",plot_width = 800, plot_height = 300)
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], fill_color="navy", line_color="white", alpha=0.5)
show(p)

Engagement scores look heavily right skewed: most tweets sit near zero while a few reach into the thousands (median 41, maximum 13,580). It would be good to break the score out into its components: likes, retweets, and replies.
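
One way to check the direction of the skew numerically, rather than by eye, is to compare the mean with the median and compute the sample skewness. A positive skewness indicates a long right tail. The sketch below uses a small made-up series (`toy_engagement`, not the real data) and also shows that a `log1p` transform reduces the skew, which may make the histogram easier to read.

```python
import numpy as np
import pandas as pd

# Hypothetical engagement values, invented for illustration only:
# many small counts and one very large one.
toy_engagement = pd.Series([0, 0, 0, 6, 41, 41, 163, 500, 13580])

mean_value = toy_engagement.mean()
median_value = toy_engagement.median()

# Sample skewness: positive means the long tail is on the right.
raw_skew = toy_engagement.skew()

# log1p(x) = log(1 + x) is defined at zero and compresses large values.
log_skew = np.log1p(toy_engagement).skew()

print(f"mean={mean_value:.1f}, median={median_value:.1f}")
print(f"skew raw={raw_skew:.2f}, skew after log1p={log_skew:.2f}")
```

The same idea could be applied to the real column, for example by passing `np.log1p(basketball_tweets['engagement'])` to `np.histogram` in place of the raw values.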

In [42]:
hist, edges = np.histogram(basketball_tweets['LikeCount'], bins = 50)
p1 = figure(title='Likes', background_fill_color="#fafafa",plot_width = 800, plot_height = 300)
p1.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], fill_color="navy", line_color="white", alpha=0.5)

hist, edges = np.histogram(basketball_tweets['TWRetweetCount'], bins = 50)
p2 = figure(title='Retweets', background_fill_color="#fafafa",plot_width = 800, plot_height = 300)
p2.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], fill_color="orange", line_color="white", alpha=0.5)

hist, edges = np.histogram(basketball_tweets['TWReplyCount'], bins = 50)
p3 = figure(title='Replies', background_fill_color="#fafafa",plot_width = 800, plot_height = 300)
p3.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], fill_color="pink", line_color="white", alpha=0.5)

show(gridplot([p1,p2,p3], ncols=3, plot_width=265, plot_height=300, toolbar_location=None))

The engagement distribution is similar to the distributions of all three components, with likes having the largest values, followed by retweets and then replies.
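
To put rough numbers on that ordering, the column means reported by `describe()` above can be turned into shares of average engagement. This is a minimal sketch: the three means are copied by hand from that output, and `means` and `shares` are names introduced here for illustration.

```python
# Column means copied from the describe() output above.
means = {
    "LikeCount": 165.606755,
    "TWRetweetCount": 29.264174,
    "TWReplyCount": 3.103136,
}

# The three means add up to the mean of the engagement column.
total = sum(means.values())

# Fraction of average engagement contributed by each component.
shares = {name: value / total for name, value in means.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

By this measure likes account for about 84% of average engagement, retweets for about 15%, and replies for under 2%.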

In [45]:
hist, edges = np.histogram(basketball_tweets['AuthorFollowerCount'], bins = 50)
p = figure(title='Number of Followers', background_fill_color="#fafafa",plot_width = 800, plot_height = 300)
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], fill_color="navy", line_color="white", alpha=0.5)
show(p)