# Why are some athletes more popular than others?
## Part II: Calculating sentiment towards players
In this project I am using natural language processing to try to understand what factors drive public opinion towards athletes. In part 1 I covered how I scraped the data off of Reddit and other sports websites. In this notebook, part 2, I will cover how I identified which player each comment was about, and how I calculated sentiment towards that player. In part 3 I will cover how to use regression models to isolate what drives public opinion.

## What are entities and sentiment?
In this notebook I am going to use a lot of jargon from natural language processing (NLP). Before diving into how I calculated opinion towards players, let's review a few terms:
* Corpus: a corpus is a collection of documents.
* Named entity: In NLP, an "entity" is basically a noun. Thus a *named* entity is simply a proper noun. The most common named entity in basketball right now is "LeBron."
* Named entity recognition (NER): This is the task of identifying which parts of a sentence are named entities. A simple NER model would use things like capitalization to identify named entities. For example, given "Nice assist by Wall," such a model would pick out "Wall," which refers to the player John Wall. More complicated NER models use part-of-speech tagging or even neural nets to identify named entities.
* Sentiment analysis: This is the task of identifying whether a sentence or document expresses generally positive or negative feelings. Simple models assign a positive or negative value to each word (e.g. "love" is a positive word). More complex models assign sentiment for each entity in a document. Sentiment models are often trained for a specific task.
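
To make the two "simple" models above concrete, here is a minimal sketch of a capitalization-based entity finder and a word-level sentiment scorer. This is not the code used in this project; the lexicon `TOY_LEXICON` and both function names are made up for illustration.

```python
import re

# Hypothetical toy lexicon, invented for this example; real lexicons are far larger.
TOY_LEXICON = {'love': 1, 'goat': 1, 'nice': 1, 'worst': -1, 'hate': -1}

def capitalized_candidates(sentence):
    """Return capitalized words, skipping the first word of the sentence."""
    words = re.findall(r"[A-Za-z']+", sentence)
    return [w for w in words[1:] if w[0].isupper()]

def lexicon_sentiment(sentence):
    """Sum the lexicon value of each word; unknown words count as 0."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(TOY_LEXICON.get(w, 0) for w in words)

print(capitalized_candidates('Nice assist by Wall'))    # ['Wall']
print(lexicon_sentiment('Isaiah Thomas is the worst'))  # -1
```

The capitalization heuristic breaks easily: on "Cedi is the GOAT!" it skips "Cedi" (the first word) and flags "GOAT" instead, which is one reason real NER models lean on part-of-speech tags or learned features.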

## How do we calculate the sentiment towards a player?
After scraping data from Reddit, I had a corpus of comments about NBA and NFL players. These comments ranged from short exclamations about specific players ("Cedi is the GOAT!"), to longer comments involving multiple named entities ("JR Smith threw a bowl of chicken tortilla soup at Damon Jones.").

Probably the best way to calculate sentiment towards players would be to use a combined entity-sentiment model. These models parse each sentence for parts of speech and named entities, and assign sentiment towards each named entity. For example, a combined model could take "LeBron is better than Jordan," and assign positive sentiment to LeBron directly. In the end, I did not use these methods for a few reasons: this was my first time doing sentiment analysis, and these models are complicated, so I wanted to start simpler; these models are less interpretable than other models; these models take a looooong time to run, making iteration slower.

Instead, I took a two-step approach. First, I identified sentences that contained a single named entity for a player; then I calculated the overall sentiment of each of those sentences. This approach has two downsides: I had to throw away a lot of information from sentences that contained multiple named entities, and it assumes that the sentiment of a sentence reflects the sentiment towards the player in that sentence. In the end, though, I had more than enough data to overcome the first obstacle. As for the second, while the sentiment of a sentence doesn't always reflect sentiment towards the player, in general the relationship seemed to hold.
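
The first step, keeping only single-player sentences, might look like the sketch below. It assumes each sentence already carries a list of matched player names in a hypothetical `entities` column; that column name and the example rows are illustrative, not taken from the project code.

```python
import pandas as pd

# Illustrative data: 'entities' is a hypothetical column holding the
# player names already found in each sentence.
sentences = pd.DataFrame({
    'sentences': ['Cedi is the GOAT!',
                  'LeBron is better than Jordan',
                  'What a game'],
    'entities': [['Cedi'], ['LeBron', 'Jordan'], []],
})

# Keep only sentences that mention exactly one player.
has_one_player = sentences['entities'].map(len) == 1
single = sentences[has_one_player].copy()

# With one entity per row, the sentence can be attributed to that player.
single['player'] = single['entities'].map(lambda names: names[0])
print(single[['sentences', 'player']])
```

Sentences with zero or several players are dropped here, which is exactly the information loss described above.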

## Identifying which athletes a sentence is about
### Named entity recognition
I took two broad approaches to named entity recognition. First, I used Stanford's SNER package to identify the named entities in a sentence. Second, I started with a list of named entities I cared about (NBA and NFL players), and then simply checked whether those names were present. In this section, I am going to show both approaches.
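
The second approach, checking each sentence against a known list of names, could be sketched as follows. The list `PLAYER_NAMES` and the function `find_players` are hypothetical names for this example; the real list would come from the scraped NBA and NFL rosters.

```python
import re

# Hypothetical, hand-picked names; the real list would be much longer.
PLAYER_NAMES = ['Cedi', 'Isaiah Thomas', 'LeBron']

def find_players(sentence, names=PLAYER_NAMES):
    """Return every known name that appears as a whole word or phrase in the sentence."""
    found = []
    for name in names:
        # \b keeps 'Cedi' from matching inside a longer word;
        # re.escape guards against punctuation in names.
        pattern = r'\b' + re.escape(name) + r'\b'
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            found.append(name)
    return found

print(find_players('Isaiah Thomas is the worst'))  # ['Isaiah Thomas']
```

A lookup like this only finds names that are on the list, so nicknames and misspellings have to be added explicitly, but it is fast and easy to inspect.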

This notebook will show an example of how to do this on a single document. I function-ified this process in the module [`sports_sentiment.py`](https://github.com/map222/trailofpapers/blob/sentiment_sports/sentiment_sports/sports_sentiment.py) so you can use it at your leisure.
### Imports

In [2]:
import re
import string
from ast import literal_eval

import dask.dataframe as dd
import pandas as pd
from nltk import sent_tokenize
#from sner import Ner

#### Chunking comments into sentences
After scraping I had full comments off of Reddit. To get more samples, we can chunk the comments into sentences for analysis using NLTK's `sent_tokenize` function.

First, let's start with a typical Cavs fan comment. I use `pandas` DataFrames for everything, so let's use one here.

In [39]:
comment_df = pd.DataFrame({'comment':['Cedi is the GOAT! Isaiah Thomas is the worst'], 'user': ['map222'], 'flair':'CLE'})
comment_df

|  | comment | flair | user |
|---|---|---|---|
| 0 | Cedi is the GOAT! Isaiah Thomas is the worst | CLE | map222 |


Since `sent_tokenize` can return multiple sentences for a single comment, I did some manipulation to get back a Series with one row per sentence.

In [40]:
sentences_df = (comment_df['comment'].apply(lambda row: pd.Series(sent_tokenize(row)))
                                     .stack())
sentences_df

0  0             Cedi is the GOAT!
   1    Isaiah Thomas is the worst
dtype: object

We then need to do a bit more `pandas` manipulation to get a DataFrame where the index for each sentence is the same as its parent comment:

In [41]:
sentences_df = (sentences_df.reset_index()
                  .set_index('level_0')
                  .rename(columns={0:'sentences'})
                  .drop(['level_1'], axis = 1))
sentences_df

| level_0 | sentences |
|---|---|
| 0 | Cedi is the GOAT! |
| 0 | Isaiah Thomas is the worst |


Now that the index is sorted out, we can rejoin the sentences to the original comments, which allows us to retain metadata like the user and flair.

In [42]:
comment_df = (comment_df.join(sentences_df)
                        .drop(columns = ['comment']))
comment_df

|  | flair | user | sentences |
|---|---|---|---|
| 0 | CLE | map222 | Cedi is the GOAT! |
| 0 | CLE | map222 | Isaiah Thomas is the worst |
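
As an aside, on pandas 0.25 or later the stack, reset and join sequence above can be collapsed with `DataFrame.explode`. The sketch below is an alternative, not what this notebook ran. To stay self-contained it uses a naive regex splitter, `split_sentences`, as a hypothetical stand-in for NLTK's `sent_tokenize`.

```python
import re
import pandas as pd

def split_sentences(text):
    """Naive stand-in for sent_tokenize: split after ., ! or ? followed by whitespace."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

comments = pd.DataFrame({'comment': ['Cedi is the GOAT! Isaiah Thomas is the worst'],
                         'user': ['map222'],
                         'flair': ['CLE']})

# Store each comment's sentences as a list, then explode to one row per sentence.
# The parent comment's index and metadata (user, flair) are repeated automatically.
exploded = (comments.assign(sentences=comments['comment'].map(split_sentences))
                    .explode('sentences')
                    .drop(columns=['comment']))
print(exploded)
```

Swapping `split_sentences` for `sent_tokenize` should give the same shape of result, with better handling of abbreviations and other edge cases.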


### Sentiment model