# What makes athletes popular? A sentiment and regression analysis
## Part III: Regression analysis of sentiment

In this project, I use natural language processing (NLP) to try to understand what factors drive public opinion towards athletes. In [part 1](https://nbviewer.jupyter.org/github/map222/trailofpapers/blob/master/sentiment_sports/paper/scrape_reddit_covariates.ipynb), I showed how to scrape Reddit; in [part 2](https://nbviewer.jupyter.org/github/map222/trailofpapers/blob/master/sentiment_sports/paper/regression_sentiment.ipynb), I showed how to use sentiment analysis to calculate opinion towards players. In this notebook, I fit regression models to comment sentiment and votes in order to determine which features of an athlete are predictive of sentiment.

### How regression allows us to understand underlying factors

We know that some athletes are more popular than others, and that athletes differ in their characteristics, both intrinsic (height, age, race) and extrinsic (points scored, yards run). Given that there are dozens of these characteristics, how do we work out which ones make athletes more or less popular?

One naive approach would be to draw scatterplots of sentiment against each characteristic, as we did in part 2. This can be illuminating and can reveal obvious relationships, such as sentiment being higher for young players. However, age is correlated with many other things; for example, young players usually get less playing time than the average player. Does this mean people prefer players who don't play a lot, or is something else going on? What we need is a technique that can consider all of these characteristics simultaneously.

The method we use in this project is multiple linear regression (often loosely called multivariate regression). It estimates the effect of every characteristic at once, with each coefficient measuring how sentiment changes with one characteristic while the others are held fixed. For example, if the coefficient for age is -0.1 sentiment points per year, then for two players who are identical in every other characteristic in the model, we expect the player who is one year older to have sentiment that is 0.1 points lower on average.
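To make the age versus playing time question concrete, here is a minimal sketch on synthetic data (not the project's data). It assumes, purely for illustration, that sentiment depends only on age while minutes played merely correlates with age; the variable names and effect sizes are invented. It uses plain NumPy least squares rather than the statsmodels calls used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic players: older players tend to get more minutes (assumed relationship).
age = rng.uniform(19, 38, size=n)
minutes = 10 + 1.0 * (age - 19) + rng.normal(0, 4, size=n)

# Assumed ground truth: sentiment falls 0.1 per year of age; minutes have no effect.
sentiment = 3.0 - 0.1 * age + rng.normal(0, 0.5, size=n)


def ols(columns, y):
    """Least-squares coefficients, with the intercept as the first entry."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


minutes_only = ols([minutes], sentiment)   # [intercept, minutes]
both = ols([age, minutes], sentiment)      # [intercept, age, minutes]

print("minutes coefficient, fitted alone:   ", minutes_only[1])
print("age coefficient, fitted with minutes:", both[1])
print("minutes coefficient, fitted with age:", both[2])
```

Fitted alone, minutes picks up a negative slope that really belongs to age. Fitted together, the age coefficient lands near the assumed -0.1 and the minutes coefficient shrinks towards zero.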


### Aggregation

Before describing the regression, let's briefly go over how we aggregate the data. The raw data consists of sentences, each written by a single user about a single player. We first aggregate by taking the mean sentiment of each user towards each player in a given season. For players, thanks to the large sample size, we analyze the data at this level; for coaches, we aggregate one step further and average across all users to get a single sentiment per coach per season (a mean of means). At each level we also count the number of sentences that went into each average, which we use for weighting later.
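The two aggregation steps can be sketched with pandas `groupby`. This is an illustration on a tiny made-up table, not the project's pipeline: the sentence-level column name `compound` and the data values are assumptions, while `compound_mean`, `comment_count` and `user_count` follow names used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical sentence-level data: one row per sentence about one player by one user.
sentences = pd.DataFrame({
    'Player':   ['a', 'a', 'a', 'b', 'b', 'b'],
    'user':     ['u1', 'u1', 'u2', 'u1', 'u2', 'u2'],
    'season':   [2017] * 6,
    'compound': [0.5, 0.1, -0.2, 0.0, 0.4, 0.8],
})

# Step 1: mean sentiment per player, per user, per season, plus the sentence count.
user_level = (sentences
              .groupby(['Player', 'user', 'season'], as_index=False)
              .agg(compound_mean=('compound', 'mean'),
                   comment_count=('compound', 'count')))

# Step 2 (the extra step described for coaches): average the per-user means across users.
entity_level = (user_level
                .groupby(['Player', 'season'], as_index=False)
                .agg(compound_mean=('compound_mean', 'mean'),
                     user_count=('user', 'nunique'),
                     comment_count=('comment_count', 'sum')))

print(user_level)
print(entity_level)
```

For the hypothetical player `a`, the mean of means is 0.05, while pooling all three sentences would give about 0.13. The mean of means gives each user one vote regardless of how much they wrote.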

Here is what our data looks like in aggregated form:

In [7]:
import pandas as pd
nba_df = pd.read_csv('c:/Users/map22/Google Drive/sentiment_nba/nba_user_player_sentiment.tsv', sep='\t')
# drop rows without Race or PPG: name matches for coaches, or for seasons when the player wasn't playing
nba_df = nba_df.dropna(subset=['Race', 'PPG'])



In [8]:
nba_df[['Player', 'user', 'season', 'flair', 'compound_mean', 'comment_count', 'PPG']].sample(3, random_state = 24601)

| | Player | user | season | flair | compound_mean | comment_count | PPG |
|---|---|---|---|---|---|---|---|
| 426694 | jermaine o'neal | The_Grand_Wizowd | 2015 | Bulls | 0.4404 | 1 | |
| 374829 | jamal crawford | runningobsessed | 2015 | NBA | 0.0772 | 1 | 14.2 |
| 588090 | kristaps porzingis | Helicase21 | 2017 | [GSW] JaVale McGee | 0.0 | 1 | 22.7 |


### Simple regression model: unweighted, unclustered



In [10]:

import statsmodels.formula.api as smf

# Baseline model: sentiment as a function of minutes played (MP) and points per game (PPG).
# weights=1 weights every row equally, so this WLS fit is equivalent to ordinary least squares.
model = smf.wls(formula='compound_mean ~ MP + PPG',
                data=nba_df,
                weights=1,
                ).fit()

In [11]:
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | compound_mean | R-squared: | 0.000 |
| Model: | WLS | Adj. R-squared: | 0.000 |
| Method: | Least Squares | F-statistic: | 107.4 |
| Date: | Sun, 24 Feb 2019 | Prob (F-statistic): | 2.36e-47 |
| Time: | 14:18:27 | Log-Likelihood: | -271560.0 |
| No. Observations: | 855181 | AIC: | 543100.0 |
| Df Residuals: | 855178 | BIC: | 543200.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.0699 | 0.001 | 74.243 | 0.000 | 0.068 | 0.072 |
| MP | -1.212e-06 | 6.45e-07 | -1.879 | 0.060 | -2.48e-06 | 5.24e-08 |
| PPG | 0.0007 | 6.42e-05 | 11.450 | 0.000 | 0.001 | 0.001 |

| | | | |
|---|---|---|---|
| Omnibus: | 1042.415 | Durbin-Watson: | 1.981 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 1166.915 |
| Skew: | -0.044 | Prob(JB): | 4.05e-254 |
| Kurtosis: | 3.158 | Cond. No. | 5460.0 |


### Adding weighting
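
As noted in the aggregation section, each average comes with a count of the sentences behind it, and an average of many sentences is less noisy than an average of one. A minimal sketch of how that count could enter the model, assuming the weights are simply `comment_count` (the weighting actually used in this project may differ). The data frame `toy_df` is synthetic, with made-up values; only the column names follow the aggregated table above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5000

# Synthetic stand-in for nba_df: same column names, invented values.
toy_df = pd.DataFrame({
    'PPG': rng.uniform(0, 30, size=n),
    'MP': rng.uniform(0, 3000, size=n),
    'comment_count': rng.integers(1, 20, size=n),
})

# Assumed noise model: the mean of k sentences has noise that shrinks like 1/sqrt(k).
noise = rng.normal(0, 0.4, size=n) / np.sqrt(toy_df['comment_count'])
toy_df['compound_mean'] = 0.07 + 0.005 * toy_df['PPG'] + noise

# Unweighted fit: every row counts the same.
unweighted = smf.ols('compound_mean ~ MP + PPG', data=toy_df).fit()

# Weighted fit: rows built from more sentences get proportionally more weight.
weighted = smf.wls('compound_mean ~ MP + PPG', data=toy_df,
                   weights=toy_df['comment_count']).fit()

print("unweighted PPG coef, std err:", unweighted.params['PPG'], unweighted.bse['PPG'])
print("weighted PPG coef, std err:  ", weighted.params['PPG'], weighted.bse['PPG'])
```

Under this assumed noise model, weighting by the count is the inverse-variance weighting that weighted least squares expects, so in this toy setup the weighted fit should report a smaller standard error for `PPG` than the unweighted one.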



### Adding clustering of errors
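
Rows that share a player are unlikely to be independent: every user commenting on the same player sees the same `PPG`, and anything unmodeled about that player affects all of those rows together. Cluster-robust standard errors account for this. Below is a minimal sketch using statsmodels' `cov_type='cluster'` option. It assumes the clusters are players, which is a guess on my part (clustering on users would be coded the same way), and it runs on synthetic data with a hypothetical `player_id` column.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_players, rows_per_player = 200, 50

# Synthetic data: each hypothetical player has one PPG value and one shared offset.
player_id = np.repeat(np.arange(n_players), rows_per_player)
ppg_by_player = rng.uniform(0, 30, size=n_players)
player_effect = rng.normal(0, 0.1, size=n_players)

toy_df = pd.DataFrame({
    'player_id': player_id,
    'PPG': ppg_by_player[player_id],
})
toy_df['compound_mean'] = (0.07 + 0.002 * toy_df['PPG']
                           + player_effect[player_id]          # shared within a player
                           + rng.normal(0, 0.3, size=len(toy_df)))

model = smf.ols('compound_mean ~ PPG', data=toy_df)

# Default (nonrobust) errors treat all rows as independent.
naive = model.fit()

# Cluster-robust errors allow arbitrary correlation among rows of the same player.
clustered = model.fit(cov_type='cluster',
                      cov_kwds={'groups': toy_df['player_id'].to_numpy()})

print("PPG coef (identical in both fits):", naive.params['PPG'], clustered.params['PPG'])
print("nonrobust std err:", naive.bse['PPG'])
print("clustered std err:", clustered.bse['PPG'])
```

Clustering leaves the coefficients unchanged and only alters the standard errors. In this toy setup the clustered standard error comes out larger, because the effective sample size is closer to the number of players than to the number of rows.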