# Data Collection

The code below uses the Pushshift API (a third-party archive of Reddit data) and the PRAW API wrapper to extract Reddit comments from two posts. These two posts are the teams' respective game threads from the r/SFGiants and r/Dodgers subreddits. A game thread is where fans post comments and interact with each other while watching the game. The context for the comments was the two teams' June 28th matchup against each other at Dodger Stadium. The starting pitchers were Anthony DeSclafani (8-2, 2.77 ERA) and Trevor Bauer (7-5, 2.57 ERA). The Dodgers defeated the Giants 3-2, cutting the Giants' lead in the division to 2.5 games.

In [1]:
import praw
import requests
import pandas as pd

In [2]:
url= 'https://api.pushshift.io/reddit/search/submission'

In [3]:
params = {
    'subreddit' : 'sfgiants',
    'author' :'sfgbot'
}

In [4]:
res = requests.get(url, params)
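
When `params` is passed to `requests.get`, the dictionary is encoded as a query string and appended to the URL. As a rough illustration of what that request looks like, the sketch below builds the same URL with the standard library only. `build_search_url` is a hypothetical helper written for this example; it is not part of `requests` or Pushshift.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/submission"


def build_search_url(subreddit, author):
    """Assemble the search URL from the two filters used in this notebook."""
    # urlencode turns the dict into "key=value" pairs joined by "&"
    query = urlencode({"subreddit": subreddit, "author": author})
    return f"{BASE_URL}?{query}"


giants_url = build_search_url("sfgiants", "sfgbot")
print(giants_url)
```

Printing the URL is a quick way to check that the filters are spelled as intended before sending the request.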

In [5]:
#check status
res.status_code

200

In [6]:
#convert to json
data = res.json()

In [7]:
posts = data['data']

In [8]:
#api drill down to id
posts[0]['id']

'oa5ymf'

In [9]:
df = pd.DataFrame(posts) #each key of the post dictionaries becomes a column header

df[['subreddit', 'selftext', 'title','id']].head()

Unnamed: 0,subreddit,selftext,title,id
0,SFGiants,### Giants (50-28) @ Dodgers (48-31)\n\nFirst ...,Gameday Thread 6/29/21 Giants (Gausman) @ Dodg...,oa5ymf
1,SFGiants,### Giants 2 @ Dodgers 3\n\n**Purpose of this ...,"POSTGAME THREAD: Giants @ Dodgers, 6/28. Join ...",oa1qzn
2,SFGiants,### Giants (50-27) @ Dodgers (47-31)\n\nFirst ...,Gameday Thread 6/28/21 Giants (DeSclafani) @ D...,o9hxxu
3,SFGiants,### Athletics 6 @ Giants 2\n\n**Purpose of thi...,"POSTGAME THREAD: Athletics @ Giants, 6/27. Joi...",o97thy
4,SFGiants,### Athletics (46-33) @ Giants (50-26)\n\nFirs...,Gameday Thread 6/27/21 Athletics (Irvin) @ Gia...,o8uxrz


**List of Submission IDs for just game threads**

In [10]:
submission_ids = [code for thread, code in zip(df['title'],df['id']) if 'Dodgers' in thread]

The comments in the thread are extracted using the PRAW API wrapper.

PRAW Documentation ([*source*](https://praw.readthedocs.io/en/latest/tutorials/comments.html))

PRAW YouTube Tutorial by Sentdex ([*source*](https://www.youtube.com/watch?v=KX2jvnQ3u60&ab_channel=sentdex))

In [11]:

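import os

#credentials are read from environment variables instead of being written in the notebook
#(the variable names below are placeholders; use whichever names you export)
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    username=os.environ["REDDIT_USERNAME"],
    password=os.environ["REDDIT_PASSWORD"],
    user_agent="prawproject",
)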
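
One way to supply the arguments for `praw.Reddit` without writing secrets in the notebook is to read them from environment variables and fail early if any are missing. The sketch below is only an illustration: the environment variable names and the `load_reddit_credentials` helper are assumptions made for this example, and the values passed in are placeholders.

```python
# Hypothetical environment variable names; pick whatever names you export.
REQUIRED_KEYS = (
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "REDDIT_USERNAME",
    "REDDIT_PASSWORD",
)


def load_reddit_credentials(env):
    """Return keyword arguments for praw.Reddit taken from a mapping such as os.environ."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        # Report every missing name at once rather than failing on the first
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "username": env["REDDIT_USERNAME"],
        "password": env["REDDIT_PASSWORD"],
    }


# Placeholder values stand in for os.environ so the sketch runs anywhere
example = load_reddit_credentials({
    "REDDIT_CLIENT_ID": "placeholder-id",
    "REDDIT_CLIENT_SECRET": "placeholder-secret",
    "REDDIT_USERNAME": "placeholder-user",
    "REDDIT_PASSWORD": "placeholder-password",
})
print(sorted(example))
```

In the notebook the result could then be unpacked as `praw.Reddit(user_agent="prawproject", **load_reddit_credentials(os.environ))`.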

In this section, the game thread submission for 6/28/2021 is retrieved using the submission ID obtained from the Pushshift API.

In [12]:
giantsthread = reddit.submission(id='o9hxxu')

In [13]:
#subreddit = reddit.subreddit('sfgiants')

In [14]:
#this code takes a while to run
giantsthread.comments.replace_more(limit=None) #resolves every "More Comments" placeholder so the whole thread is loaded
comments = giantsthread.comments.list() #flattens the comment tree so replies to comments are also extracted
giants_rows = []
for comment in comments:
    try:
        giants_rows.append({'Comments': comment.body}) #each comment becomes one row of giants_df
    except AttributeError: #skips any leftover "More Comments" placeholder, which has no body
        continue

giants_df = pd.DataFrame(giants_rows) #built once from the list; DataFrame.append was removed in pandas 2.0
giants_df

Unnamed: 0,Comments
0,Not scared of Bauer. Let's fucking rattle him.
1,Obligatory going to the game today. Heard ther...
2,https://www.mlb.com/all-star/ballot \n\nHave y...
3,games past 6pm should be illegal. i’m old and ...
4,The thing about the dodgers that I have to say...
...,...
3961,Extreme hatred of the wave is justifiable. \n\...
3962,Yeah!!!!
3963,The same as 2/$10 Barefoot? Asking for a friend
3964,It ain't right.
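
To make the idea of flattening a comment tree concrete, here is a minimal sketch that uses plain dictionaries as stand-ins for comments. It is not PRAW's implementation; `flatten_comments` and the sample thread are hypothetical and exist only for this illustration.

```python
from collections import deque


def flatten_comments(top_level):
    """Return every comment body in the tree, visiting each level before the next."""
    bodies = []
    queue = deque(top_level)
    while queue:
        comment = queue.popleft()
        bodies.append(comment["body"])
        # Replies are queued behind the comments already waiting
        queue.extend(comment.get("replies", []))
    return bodies


# Hypothetical thread: two top-level comments, one of which has a reply
thread = [
    {"body": "first", "replies": [{"body": "reply to first", "replies": []}]},
    {"body": "second", "replies": []},
]
flat = flatten_comments(thread)
print(flat)
```

The flattened list contains replies alongside top-level comments, which is why the DataFrame above has one row per comment regardless of nesting depth.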


Add a column for the team. This is the first of the two class labels in our binary classification model. The other label will be Dodgers.

In [15]:
giants_df['Team'] = 'Giants'

giants_df.head()

Unnamed: 0,Comments,Team
0,Not scared of Bauer. Let's fucking rattle him.,Giants
1,Obligatory going to the game today. Heard ther...,Giants
2,https://www.mlb.com/all-star/ballot \n\nHave y...,Giants
3,games past 6pm should be illegal. i’m old and ...,Giants
4,The thing about the dodgers that I have to say...,Giants


In [16]:
params_2 = {
'subreddit' : 'dodgers',
'author' :'dodgerbot'
}

In [17]:
res_2 = requests.get(url, params_2)

In [18]:
res_2.status_code

200

In [19]:
data_2 = res_2.json()

In [20]:
posts_2 = data_2['data']

In [21]:
#api drill down to id
posts_2[0]['id']

'oa8uo9'

In [22]:
df_2 = pd.DataFrame(posts_2) #all of the keys of the dictionary are column heads

df_2[['subreddit', 'selftext', 'title','id']].head()

Unnamed: 0,subreddit,selftext,title,id
0,Dodgers,## Today's Matchup\n\n## ⚾ Dodgers vs. Giants ...,Daily Chat 6/29 ⚾ Game Day,oa8uo9
1,Dodgers,### Line Score - Final\n\n| |1|2|3|4|5|6|7|8|9...,Postgame Thread ⚾ Giants 2 @ Dodgers 3,oa1rel
2,Dodgers,### Giants (50-27) @ Dodgers (47-31) [](http:/...,Game Chat 6/28 - Giants (50-27) @ Dodgers (47-...,o9xssr
3,Dodgers,## Today's Matchup\n\n## ⚾ Dodgers vs. Giants ...,Daily Chat 6/28 ⚾ Game Day,o9ktqs
4,Dodgers,### Line Score - Final\n\n| |1|2|3|4|5|6|7|8|9...,Postgame Thread ⚾ Cubs 1 @ Dodgers 7,o9ailb


The cell below filters the submission IDs for only Giants/Dodgers matchups. The for loop further down could be extended to cover multiple games by iterating over the submission IDs, but one game provided sufficient data.

In [23]:
submission_ids_2 = [code for thread, code in zip(df_2['title'], df_2['id']) if 'Giants' in thread]

submission_ids_2

['oa1rel', 'o9xssr']
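
The filtered list contains the postgame thread (`oa1rel`) as well as the live game thread (`o9xssr`), because both titles mention the Giants. A possible way to keep only the live thread is to also check how the title starts. The sketch below assumes that live threads on r/Dodgers begin with "Game Chat", which is inferred only from the five titles shown above; `live_game_thread_ids` is a hypothetical helper.

```python
# (title, id) pairs copied from the r/Dodgers table above; the Game Chat
# title is truncated exactly as it appears in that output
posts = [
    ("Daily Chat 6/29 ⚾ Game Day", "oa8uo9"),
    ("Postgame Thread ⚾ Giants 2 @ Dodgers 3", "oa1rel"),
    ("Game Chat 6/28 - Giants (50-27) @ Dodgers (47-...", "o9xssr"),
    ("Daily Chat 6/28 ⚾ Game Day", "o9ktqs"),
    ("Postgame Thread ⚾ Cubs 1 @ Dodgers 7", "o9ailb"),
]


def live_game_thread_ids(posts, opponent, prefix="Game Chat"):
    """Keep ids whose title names the opponent and starts with the live-thread prefix."""
    return [
        post_id
        for title, post_id in posts
        if opponent in title and title.startswith(prefix)
    ]


matchup_ids = live_game_thread_ids(posts, "Giants")
print(matchup_ids)
```

With a filter like this, the submission ID would not need to be typed by hand in the next cell.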

In [24]:
dodgers_thread = reddit.submission(id='o9xssr')

In [25]:
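dodgers_thread.comments.replace_more(limit=None) #resolves every "More Comments" placeholder so the whole thread is loaded
comments_2 = dodgers_thread.comments.list() #flattens the comment tree so replies are included
dodgers_rows = []
for comment in comments_2:
    try:
        dodgers_rows.append({'Comments': comment.body})
    except AttributeError: #skips any leftover "More Comments" placeholder, which has no body
        continue

dodgers_df = pd.DataFrame(dodgers_rows) #built once from the list; DataFrame.append was removed in pandas 2.0
dodgers_df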

Unnamed: 0,Comments
0,Kenley throwing three straight sliders to Pose...
1,"Not going to lie, I’ve been going through a bi..."
2,"Kenley with a 1.42 ERA, pretty good for a wash..."
3,I talked shit about Pollock and he’s been goin...
4,"People acting like two ER in 6 innings is bad,..."
...,...
2349,What's wrong with liking assholes?
2350,FIP doesn't invalidate ERA aka results. it sup...
2351,Look as a bisexual I like people that like my ...
2352,How about xERA? That’s a whole run over his E...


In [26]:
dodgers_df['Team'] = 'Dodgers'

dodgers_df.head()

Unnamed: 0,Comments,Team
0,Kenley throwing three straight sliders to Pose...,Dodgers
1,"Not going to lie, I’ve been going through a bi...",Dodgers
2,"Kenley with a 1.42 ERA, pretty good for a wash...",Dodgers
3,I talked shit about Pollock and he’s been goin...,Dodgers
4,"People acting like two ER in 6 innings is bad,...",Dodgers


In [27]:
#combine the dataset for each team's comments to be used for the binary classification model
rivalry_df = pd.concat([giants_df,dodgers_df])

rivalry_df.shape

(6320, 2)

In [29]:
#export to csv file, the modeling will be completed in the model.ipynb file
rivalry_df.to_csv('../datasets/rivalry_df.csv', index = False)
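
Many comments contain commas, quotation marks, and blank lines, so it is worth confirming that such text survives a CSV round trip. The sketch below uses the standard library `csv` module rather than `to_csv`, and the two sample rows are made up for the illustration. With minimal quoting, which is also my understanding of the pandas default, fields containing those characters are wrapped in quotes.

```python
import csv
import io

# Made-up rows that mimic awkward comment text: a blank line, a comma, quotes
rows = [
    {"Comments": "first line\n\nsecond line, with a comma", "Team": "Giants"},
    {"Comments": 'a "quoted" remark', "Team": "Dodgers"},
]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["Comments", "Team"])
writer.writeheader()
writer.writerows(rows)

# Read the same text back and compare it with what was written
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored == rows)
```

A similar check on the real file would be to read `rivalry_df.csv` back and compare its shape with `rivalry_df.shape`.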