In [6]:
import pandas as pd
pd.set_option('display.max_colwidth', 100)

# What makes athletes popular?

## A sentiment analysis of NBA and NFL reddit

### Powerpoint slide of Cedi and Tristan

## Outline

I. It's easy to scrape reddit

II. How to calculate sentiment per player

III. How to use regression to understand what makes athletes popular

IV. Results!

## I. How to scrape reddit

### Scraping using reddit's API

#### Reddit's API is quite good

Allows you to post comments, reply, and act like a user

#### Disadvantages

It's designed around acting like a user rather than scraping

It requires authentication

It broke a month after I started, annoying me

### Scraping using the pushshift API

[pushshift.io](https://pushshift.io/api-parameters/) is a third-party social media data aggregator that specializes in reddit

#### Advantages

No authentication

The API did not change while I used it

Easy to use

#### Example query to pushshift
As simple as parsing the JSON returned by a `requests` query

In [2]:
import requests

# Request up to 500 submissions from r/nba
url_params = {'subreddit': 'nba',
              'size': 500}
submission_url = 'https://api.pushshift.io/reddit/search/submission/'
# requests decodes the JSON body itself, so no separate JSON library is needed
pushshift_response = requests.get(submission_url, params=url_params).json()

In [5]:
pushshift_response['data'][0]

{'author': 'deadskin',
 'author_flair_background_color': '',
 'author_flair_css_class': 'Raptors2',
 'author_flair_richtext': [],
 'author_flair_text': '[TOR] Jose Calderon',
 'author_flair_text_color': 'dark',
 'author_flair_type': 'text',
 'author_fullname': 't2_5nr6h',
 'author_patreon_flair': False,
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1554215998,
 'domain': 'streamable.com',
 'full_link': 'https://www.reddit.com/r/nba/comments/b8k2m6/brook_lopez_turned_31_yesterday_and_received_a/',
 'gildings': {'gid_1': 0, 'gid_2': 0, 'gid_3': 0},
 'id': 'b8k2m6',
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': False,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_css_class': 'highlights',
 'link_flair_richtext': [],
 'link_flair_text': 'Highlights',
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media': {'oemb
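A single request only returns one page of results. Below is a minimal sketch of how one might walk backwards through older submissions, assuming pushshift's `before` parameter (an epoch timestamp) and the `created_utc` field shown in the response above. The names `fetch_all` and `fetch_page` are hypothetical, and the page fetcher is passed in as an argument so the loop can be tried without touching the network.

```python
import time


def fetch_all(fetch_page, subreddit, max_pages=3, page_size=500, pause=0.0):
    """Collect submissions page by page, moving backwards in time.

    fetch_page is any callable that takes a dict of query parameters and
    returns a list of submission dicts (the 'data' part of a response).
    """
    results = []
    before = None  # no upper time bound on the first request
    for _ in range(max_pages):
        params = {'subreddit': subreddit, 'size': page_size}
        if before is not None:
            params['before'] = before  # assumption: only older items come back
        page = fetch_page(params)
        if not page:
            break  # nothing older is left
        results.extend(page)
        # The next page should end where the oldest item of this page begins
        before = min(item['created_utc'] for item in page)
        time.sleep(pause)  # be polite to the API between requests
    return results
```

With the cell above, a real fetcher could be something like `lambda p: requests.get(submission_url, params=p).json()['data']`. Submissions that share the exact same timestamp at a page boundary may be skipped or duplicated, so treat this as a starting point.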

#### Parse the JSON into a DataFrame

In [10]:
sample_df = pd.read_csv('d:/data/sentiment_sports/combined_months_ner_sentiment_2017.tsv', nrows=100, sep='\t')
sample_df.head(3)[['user', 'flair', 'year_month', 'sentences']]

| | user | flair | year_month | sentences |
|---|---|---|---|---|
| 0 | nymusix | Warriors | 201710 | jokic with an absolutely beautiful dime [highlight]. |
| 1 | Hustle_Marsalis | Pelicans | 201710 | also espn and si ranking player like jokic ahead of him. |
| 2 | 21sewage | Nuggets | 201710 | [wind] jokic on westbrook’s shoulder check: “i flopped.”. |
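The cell above loads an already-processed file, so here is a hedged sketch of the intermediate step: flattening the raw pushshift response into rows that pandas can turn into a table. The function name `extract_rows`, the choice of fields, and the way `year_month` is derived from `created_utc` are all assumptions for illustration, not necessarily how the sample file was built.

```python
from datetime import datetime, timezone


def extract_rows(response,
                 fields=('id', 'author', 'author_flair_css_class', 'created_utc')):
    """Keep a few fields per submission and add a YYYYMM string."""
    rows = []
    for item in response.get('data', []):
        # Missing fields become None instead of raising KeyError
        row = {field: item.get(field) for field in fields}
        created = item.get('created_utc')
        if created is not None:
            # created_utc is seconds since the epoch, in UTC
            stamp = datetime.fromtimestamp(created, tz=timezone.utc)
            row['year_month'] = stamp.strftime('%Y%m')
        rows.append(row)
    return rows
```

From there, `pd.DataFrame(extract_rows(pushshift_response))` gives one row per submission.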


## II. Calculate sentiment towards players
### Part 1: Figure out which players the comments are about

#### Approach 1: Use Named entity recognition to identify players
Models approach this in different ways. Some classify each token on its own, using properties of the token (capitalization, position in the sentence). Others, like conditional random fields (CRFs), also use contextual cues from neighboring words. There are even deep neural nets for this. I used Stanford's off-the-shelf NER tagger.

In [13]:
import os
from nltk.tag import StanfordNERTagger

# Placeholder path (an assumption): point this at your local Stanford NER
# download from https://nlp.stanford.edu/software
# Without the jar on the CLASSPATH or passed via path_to_jar, NLTK raises
# "LookupError: NLTK was unable to find stanford-ner.jar!"
stanford_dir = 'path/to/stanford-ner'
st = StanfordNERTagger(
    os.path.join(stanford_dir, 'classifiers', 'english.all.3class.distsim.crf.ser.gz'),
    path_to_jar=os.path.join(stanford_dir, 'stanford-ner.jar'))

text = 'Cedi is the GOAT'
st.tag(text.split())  # list of (token, entity label) pairs
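For intuition about the feature-based approaches described earlier, here is a minimal sketch of the kind of per-token features such a classifier might look at. The function name `token_features` and the exact feature set are illustrative assumptions, not what Stanford's tagger actually uses.

```python
def token_features(tokens, i):
    """Describe token i by its own shape and by its neighbors."""
    token = tokens[i]
    return {
        # Properties of the token itself
        'lower': token.lower(),
        'is_capitalized': token[:1].isupper(),
        'is_all_caps': token.isupper(),
        'position': i,
        # Contextual cues from the surrounding words
        'prev_lower': tokens[i - 1].lower() if i > 0 else '<START>',
        'next_lower': tokens[i + 1].lower() if i < len(tokens) - 1 else '<END>',
    }


tokens = 'Cedi is the GOAT'.split()
features = [token_features(tokens, i) for i in range(len(tokens))]
```

Note that capitalization cues are less useful for reddit comments like the lowercased samples above, which is one reason an off-the-shelf tagger can struggle on this kind of text.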

#### Approach 2: Use a known list of players
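A minimal sketch of the idea, assuming a hand-made list of lowercase player names and simple whole-word matching. The list `KNOWN_PLAYERS` and the function `find_players` are hypothetical; a real list would come from a roster source and would need to handle nicknames and shared last names.

```python
import re

# Hypothetical player list for illustration only
KNOWN_PLAYERS = ['jokic', 'westbrook', 'cedi']


def find_players(sentence, known_players=KNOWN_PLAYERS):
    """Return the known players mentioned in a sentence (case-insensitive)."""
    lowered = sentence.lower()
    # \b keeps 'cedi' from matching inside a longer word
    return [player for player in known_players
            if re.search(r'\b' + re.escape(player) + r'\b', lowered)]
```

For the sample sentences above, `find_players('[wind] jokic on westbrook’s shoulder check: “i flopped.”.')` picks up both `jokic` and `westbrook`.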

#### Comparison of approaches

### Part 2: Calculating sentiment towards players

## III. Using regression to understand factors

## IV. Results!