# Coffee_NLP

In [1]:
import numpy as np
import requests
import datetime
import time
import pandas as pd
import json
import re

from sklearn.feature_extraction.text import CountVectorizer

np.random.seed(42)

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline

## Pushshift API

The Pushshift API lets us access Reddit data by constructing a URL with the relevant parameters, without needing Reddit credentials. For example, the URL below queries posts from the Coffee subreddit (the subreddit of interest) created within the specified range of Unix timestamps. A Unix timestamp is the number of seconds elapsed since 00:00:00 UTC on Thursday, 1 January 1970.

https://api.pushshift.io/reddit/submission/search/?subreddit=Coffee&limit=1000&after=1514764800&before=1517443200

The official documentation for Pushshift API can be found at https://github.com/pushshift/api
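
As a minimal sketch of how such a URL can be assembled, the snippet below converts calendar dates to Unix timestamps and encodes the query parameters. The helper names `to_unix` and `build_search_url` are hypothetical and not part of Pushshift. The dates are interpreted as UTC, which is an assumption made here so that the result does not depend on the machine's local timezone.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def to_unix(year, month, day):
    """Midnight UTC on the given date, as whole seconds since the epoch."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def build_search_url(subreddit, after, before, limit=1000):
    """Assemble a submission-search URL (hypothetical helper)."""
    base = "https://api.pushshift.io/reddit/submission/search/"
    params = {"subreddit": subreddit, "limit": limit,
              "after": after, "before": before}
    return base + "?" + urlencode(params)


# 1 January 2018 to 1 February 2018, the range used in the example URL above
url = build_search_url("Coffee", to_unix(2018, 1, 1), to_unix(2018, 2, 1))
print(url)
```

Calling `.timestamp()` on a naive `datetime` (one without `tzinfo`) interprets it in the local timezone, so the query boundaries shift by the UTC offset of the machine running the notebook.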

Attribution: the Pushshift API function code below comes from the following source: https://www.reddit.com/r/redditdev/comments/8vaum1/how_to_retrieve_old_posts_from_subreddits/

### Collect "Old" Coffee posts

In [2]:
import datetime

The function below gathers posts from the Coffee subreddit. Each request asks for at most 1000 posts (`limit=1000`), so when a timeframe contains more posts than that, we query the Pushshift API repeatedly until all of them are collected.

In [3]:
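def get_posts_for_time_period(sub, beginning, end=None):
    # Resolve the default at call time; a now() default in the signature
    # would be evaluated only once, when the function is defined.
    if end is None:
        end = int(datetime.datetime.now().timestamp())
    print("Querying pushshift")
    url = "https://apiv2.pushshift.io/reddit/submission/search/" \
          "?subreddit={0}" \
          "&limit=1000" \
          "&after={1}" \
          "&before={2}".format(sub, beginning, end)

    response = requests.get(url)
    resp_json = response.json()
    # The posts are stored under the 'data' key as a list of dicts
    return resp_json['data']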

I chose the year 2015 as my 'old' data. When I initially used 1/1/2017 to 12/31/2017 as the 'old' data and compared it with 'new' data from 1/1/2018 to 12/1/2018, there was no discernible difference in coffee trends between the two timeframes. To detect enough signal, I chose the older timeframe of 2015 for analysis.

In [4]:
beginning_timestamp = int(datetime.datetime(year=2015, month=1, day=1).timestamp()) 
end_timestamp = int(datetime.datetime(year=2015, month=12, day=31).timestamp()) 
data_1 = get_posts_for_time_period("Coffee", beginning_timestamp, end_timestamp)
all_data_old = data_1

Querying pushshift


In [5]:
response = requests.get("https://apiv2.pushshift.io/reddit/submission/search/?subreddit={0}&limit=1000&after={1}&before={2}".format("Coffee", beginning_timestamp, end_timestamp))

In [6]:
type(all_data_old)

list

The request returns a JSON-formatted result. After parsing, the first record looks like this:

In [7]:
all_data_old[0:1]

[{'author': 'hilinecoffee',
  'author_flair_css_class': None,
  'author_flair_text': None,
  'created_utc': 1420112239,
  'domain': 'articlesfactory.com',
  'full_link': 'https://www.reddit.com/r/Coffee/comments/2r02kg/delicious_blue_bottle_coffee/',
  'id': '2r02kg',
  'is_self': False,
  'num_comments': 0,
  'over_18': False,
  'permalink': '/r/Coffee/comments/2r02kg/delicious_blue_bottle_coffee/',
  'retrieved_on': 1440990924,
  'score': 0,
  'stickied': False,
  'subreddit': 'Coffee',
  'subreddit_id': 't5_2qhze',
  'thumbnail': 'default',
  'title': 'Delicious Blue Bottle Coffee',
  'url': 'http://www.articlesfactory.com/articles/advice/know-more-about-the-different-types-of-coffee.html'}]

In [8]:
all_data_old[999]['created_utc']

1421842810

In [9]:
import datetime

Through a while loop, I gather all posts from the subreddit within a given timeframe.

In [10]:
while len(data_1) >= 1000:
    last_one = data_1[999]
    updated_timestamp = last_one['created_utc'] + 1
    data_1 = get_posts_for_time_period(sub="Coffee", beginning=updated_timestamp, end=end_timestamp)
    print('Queried pushshift until', datetime.datetime.utcfromtimestamp(data_1[-1]['created_utc']).strftime('%Y-%m-%d %H:%M:%S'))
    all_data_old.extend(data_1)

Querying pushshift
Queried pushshift until 2015-02-10 15:13:04
Querying pushshift
Queried pushshift until 2015-03-03 19:59:19
Querying pushshift
Queried pushshift until 2015-03-26 16:12:18
Querying pushshift
Queried pushshift until 2015-04-20 18:11:24
Querying pushshift
Queried pushshift until 2015-05-18 21:01:27
Querying pushshift
Queried pushshift until 2015-06-16 13:09:27
Querying pushshift
Queried pushshift until 2015-07-13 16:04:28
Querying pushshift
Queried pushshift until 2015-08-08 12:38:02
Querying pushshift
Queried pushshift until 2015-08-31 20:06:40
Querying pushshift
Queried pushshift until 2015-09-27 12:53:39
Querying pushshift
Queried pushshift until 2015-10-21 01:48:44
Querying pushshift
Queried pushshift until 2015-11-16 06:58:57
Querying pushshift
Queried pushshift until 2015-12-09 20:15:10
Querying pushshift
Queried pushshift until 2015-12-30 19:10:19
Querying pushshift
Queried pushshift until 2015-12-31 07:35:05


In [61]:
len(all_data_old)

15035

We collected a total of 15,035 posts from 1/1/15 to 12/31/15.
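
The pagination loop above can also be sketched as a reusable function. This is an illustration rather than the notebook's actual code: `collect_all`, `fake_fetch`, and `FAKE_POSTS` are hypothetical names, the fake fetcher stands in for the API so the example runs offline, and it assumes each page comes back sorted by ascending `created_utc`.

```python
def collect_all(fetch_page, beginning, end, page_size=1000):
    """Call fetch_page(after, before) until a short page signals the end."""
    all_posts = []
    cursor = beginning
    while True:
        page = fetch_page(cursor, end)
        all_posts.extend(page)
        # A page shorter than page_size (including an empty one) means nothing is left
        if len(page) < page_size:
            break
        # Resume one second after the newest post seen so far
        cursor = page[-1]["created_utc"] + 1
    return all_posts


# Offline stand-in for the API (hypothetical): 2500 fake posts, one per second
FAKE_POSTS = [{"created_utc": t} for t in range(2500)]


def fake_fetch(after, before, page_size=1000):
    matches = [p for p in FAKE_POSTS if after <= p["created_utc"] < before]
    return matches[:page_size]


posts = collect_all(fake_fetch, beginning=0, end=2500)
print(len(posts))
```

With the real function, the call would look like `collect_all(lambda a, b: get_posts_for_time_period("Coffee", a, b), beginning_timestamp, end_timestamp)`. Because the length check happens before the last element is read, an empty final page does not raise an `IndexError`.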

### Collect "New" coffee posts

Now we collect more posts from the Coffee subreddit for a different timeframe. I chose 12/1/17 to 12/1/18 to understand recent coffee trends and what users talk about.

In [12]:
beginning_timestamp_new = int(datetime.datetime(year=2017, month=12, day=1).timestamp())
end_timestamp_new = int(datetime.datetime(year=2018, month=12, day=1).timestamp())
data_new = get_posts_for_time_period("Coffee", beginning_timestamp_new, end_timestamp_new)
all_data_new = data_new

Querying pushshift


In [13]:
while len(data_new) >= 1000:
    # go back for more data
    last_one_new = data_new[999]
    updated_timestamp_new = last_one_new['created_utc'] + 1
    data_new = get_posts_for_time_period(sub="Coffee", beginning=updated_timestamp_new, end=end_timestamp_new)
    print('Queried pushshift until', datetime.datetime.utcfromtimestamp(data_new[-1]['created_utc']).strftime('%Y-%m-%d %H:%M:%S'))
    all_data_new.extend(data_new)

Querying pushshift
Queried pushshift until 2018-01-06 18:13:05
Querying pushshift
Queried pushshift until 2018-01-22 23:58:30
Querying pushshift
Queried pushshift until 2018-02-07 14:46:51
Querying pushshift
Queried pushshift until 2018-02-24 23:29:53
Querying pushshift
Queried pushshift until 2018-03-15 10:31:45
Querying pushshift
Queried pushshift until 2018-04-03 11:47:40
Querying pushshift
Queried pushshift until 2018-04-22 21:21:31
Querying pushshift
Queried pushshift until 2018-05-14 01:02:54
Querying pushshift
Queried pushshift until 2018-06-03 21:29:36
Querying pushshift
Queried pushshift until 2018-06-25 06:10:33
Querying pushshift
Queried pushshift until 2018-07-13 19:30:01
Querying pushshift
Queried pushshift until 2018-08-02 05:46:38
Querying pushshift
Queried pushshift until 2018-08-22 01:54:54
Querying pushshift
Queried pushshift until 2018-09-10 00:57:49
Querying pushshift
Queried pushshift until 2018-09-27 18:47:26
Querying pushshift
Queried pushshift until 2018-10-15 1

In [14]:
all_data_new[499]['created_utc']

1420931336

In [62]:
len(all_data_new)

19638

We obtained 19,638 posts from the Coffee subreddit.

In [64]:
all_data_new[1:2]

[{'author': 'dickpiano',
  'author_flair_css_class': None,
  'author_flair_text': None,
  'brand_safe': True,
  'can_mod_post': False,
  'contest_mode': False,
  'created_utc': 1512127700,
  'domain': 'self.Coffee',
  'full_link': 'https://www.reddit.com/r/Coffee/comments/7gueve/is_instant_coffee_a_scam/',
  'id': '7gueve',
  'is_crosspostable': True,
  'is_reddit_media_domain': False,
  'is_self': True,
  'is_video': False,
  'locked': False,
  'num_comments': 13,
  'num_crossposts': 0,
  'over_18': False,
  'parent_whitelist_status': 'all_ads',
  'permalink': '/r/Coffee/comments/7gueve/is_instant_coffee_a_scam/',
  'pinned': False,
  'retrieved_on': 1512209306,
  'score': 0,
  'selftext': "I tried drinking instant coffee espresso powder with cold water in the morning and I didn't feel any buzz at all. I rarely drink coffee so my tolerance is low, thus, I would notice a buzz. I tried this several different times and didn't notice SHIT. Are you supposed to use hot water to somehow acti

Each result is a list of dicts, and we are interested in the 'title' and 'selftext' of each post in the subreddit. We now create lists of 'title' and 'selftext' through list comprehensions for both all_data_old and all_data_new.

### Process into dataframes

Extract 'title' and 'selftext' through list comprehensions and read them into a pandas dataframe.

In [17]:
title_old = [i.get('title', '') for i in all_data_old]

In [19]:
self_text_old = [i.get('selftext', '') for i in all_data_old]

In [20]:
len(title_old)

15035

In [21]:
len(self_text_old)

15035

We confirmed that the lengths of the 'title' and 'selftext' lists match the length of all_data_old, which contains all our posts.

Now we create a dataframe of the extracted texts to perform exploratory data analysis.

In [22]:
df_old = pd.DataFrame({'title':title_old, 'selftext':self_text_old})

In [23]:
df_old.shape

(15035, 2)
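
As an alternative sketch, pandas can build the same two-column frame directly from the list of dicts. The records below are made up for illustration. Passing `columns=` keeps only the listed keys, and `fillna('')` plays the role of the `else ''` branch for posts that lack a 'selftext' key.

```python
import pandas as pd

# Made-up records: the second one has no 'selftext' key, like a link post
records = [
    {"title": "Post with a body", "selftext": "Some text", "score": 3},
    {"title": "Link post without selftext", "score": 1},
]

# Keep only the two text columns; missing keys become NaN, then empty strings
df = pd.DataFrame(records, columns=["title", "selftext"]).fillna("")
print(df.shape)
```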

Join the contents of each post (the title and selftext) into a single document per row via concatenation.

In [24]:
df_old['all'] = df_old['title'] + ' ' + df_old['selftext']

In [25]:
df_old.head(10)

| | title | selftext | all |
|---|---|---|---|
| 0 | Delicious Blue Bottle Coffee | | Delicious Blue Bottle Coffee |
| 1 | PSA: Looking for an inexpensive alternative to... | | PSA: Looking for an inexpensive alternative to... |
| 2 | Is there a feasible way to store coffee for lo... | I just received two 2 pound bags of coffee for... | Is there a feasible way to store coffee for lo... |
| 3 | A shot in the dark (name that coffee) | I want to get some coffee for a very left-lean... | A shot in the dark (name that coffee) I want t... |
| 4 | [Serious question] How can I keep drinking cof... | This isn't a joke at all. I'm 100% serious. I ... | [Serious question] How can I keep drinking cof... |
| 5 | 2014 Coffee Collage | | 2014 Coffee Collage |
| 6 | What's your favorite coffee producing country/... | What countries and regions produce your favori... | What's your favorite coffee producing country/... |
| 7 | To-go mug for Keurig mini. | So I got a Keurig for Christmas in order to ma... | To-go mug for Keurig mini. So I got a Keurig f... |
| 8 | Cold Bruer Help | I set up my new Cold Bruer to brew overnight l... | Cold Bruer Help I set up my new Cold Bruer to ... |
| 9 | Have you ever made a decent cup with folgers? | | Have you ever made a decent cup with folgers? |


Let's drop posts that were removed by a moderator or deleted by the user to minimize noise.

In [26]:
df_old = df_old[df_old['selftext'] != '[removed]']

In [27]:
df_old = df_old[df_old['selftext'] != '[deleted]']

In [28]:
df_old.shape

(14269, 3)

This removed 766 posts (from 15,035 down to 14,269).
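
The two filters can also be expressed as a single mask, which makes it easy to count the affected rows before dropping them. A minimal sketch on made-up data (the frame and the `placeholders` name are illustrative only):

```python
import pandas as pd

# Made-up frame: two rows carry Reddit's placeholder text
df = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "selftext": ["kept", "[removed]", "[deleted]", ""],
})

placeholders = ["[removed]", "[deleted]"]
mask = df["selftext"].isin(placeholders)

# Count first, then keep only the rows that do not match
cleaned = df[~mask].reset_index(drop=True)
print(mask.sum(), len(cleaned))
```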

In [29]:
df_old.reset_index(drop=True, inplace=True)

Now I create a separate column indicating the class to which each post belongs. We are interested in determining which coffee trends are in demand. Therefore, the posts gathered from the earlier timeframe are labeled 0, the negative class.

In [30]:
df_old['in'] = 0

Let's also drop duplicates that we may have obtained during our querying of the API.

In [31]:
df_old.drop_duplicates(keep='first', inplace=True)

In [32]:
df_old.shape

(14012, 4)
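
The shapes above imply that 257 duplicate rows were dropped (14,269 down to 14,012). A small sketch of how that count can be checked before dropping, using a made-up frame: `duplicated` flags every row that repeats an earlier row across all columns, which is the same criterion `drop_duplicates` uses by default.

```python
import pandas as pd

# Made-up frame: the second row repeats the first, the third differs in selftext
df = pd.DataFrame({
    "title": ["Chemex help", "Chemex help", "Chemex help"],
    "selftext": ["first body", "first body", "different body"],
})

# Rows flagged True are exact repeats of an earlier row
n_dupes = df.duplicated(keep="first").sum()
deduped = df.drop_duplicates(keep="first").reset_index(drop=True)
print(n_dupes, len(deduped))
```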

In [33]:
df_old.head()

| | title | selftext | all | in |
|---|---|---|---|---|
| 0 | Delicious Blue Bottle Coffee | | Delicious Blue Bottle Coffee | 0 |
| 1 | PSA: Looking for an inexpensive alternative to... | | PSA: Looking for an inexpensive alternative to... | 0 |
| 2 | Is there a feasible way to store coffee for lo... | I just received two 2 pound bags of coffee for... | Is there a feasible way to store coffee for lo... | 0 |
| 3 | A shot in the dark (name that coffee) | I want to get some coffee for a very left-lean... | A shot in the dark (name that coffee) I want t... | 0 |
| 4 | [Serious question] How can I keep drinking cof... | This isn't a joke at all. I'm 100% serious. I ... | [Serious question] How can I keep drinking cof... | 0 |


### Repeat the same process for 'new' data

We now repeat the same process for the 'new' data (posts from 12/1/17 to 12/1/18).

In [34]:
len(all_data_new)

19638

In [35]:
title_new = [i.get('title', '') for i in all_data_new]

In [36]:
self_text_new = [i.get('selftext', '') for i in all_data_new]

In [37]:
df_new = pd.DataFrame({'title':title_new, 'selftext':self_text_new})

In [38]:
df_new.head(10)

| | title | selftext |
|---|---|---|
| 0 | Is there any hope with automatic coffee brewers? | I've been doing pour-overs for a while and rea... |
| 1 | Is instant coffee a scam? | I tried drinking instant coffee espresso powde... |
| 2 | DIY Drip Coffee Maker? | I've been in a DIY mood lately. I would love ... |
| 3 | Does anyone else feel like they don’t feel the... | I drink a LOT of coffee, and I was wondering i... |
| 4 | Is there anyone who feels completely the same ... | [removed] |
| 5 | Never had coffee | Hi folks, I know this may sound crazy, but I'v... |
| 6 | Starbucks and Cheetos have teamed up to bring ... | [deleted] |
| 7 | [news] that extra cup of java is good for you,... | |
| 8 | Best coffee to buy in Seattle? | So I’ll be heading to the emerald city one wee... |
| 9 | Chemex drinkers - What’s your trick? | I just got an 8 cup Chemex and I’m having trou... |


In [39]:
df_new.shape

(19638, 2)

In [40]:
df_new = df_new[df_new['selftext'] != '[removed]']

In [41]:
df_new = df_new[df_new['selftext'] != '[deleted]']

In [42]:
df_new.shape

(18339, 2)

In [43]:
df_new['all'] = df_new['title'] + ' ' + df_new['selftext']

In [44]:
df_new.reset_index(drop=True, inplace=True)

Because this will be our class of interest, we assign it a value of 1 to indicate the positive class in a new column.

In [45]:
df_new['in'] = 1

In [46]:
df_new.head()

| | title | selftext | all | in |
|---|---|---|---|---|
| 0 | Is there any hope with automatic coffee brewers? | I've been doing pour-overs for a while and rea... | Is there any hope with automatic coffee brewer... | 1 |
| 1 | Is instant coffee a scam? | I tried drinking instant coffee espresso powde... | Is instant coffee a scam? I tried drinking ins... | 1 |
| 2 | DIY Drip Coffee Maker? | I've been in a DIY mood lately. I would love ... | DIY Drip Coffee Maker? I've been in a DIY mood... | 1 |
| 3 | Does anyone else feel like they don’t feel the... | I drink a LOT of coffee, and I was wondering i... | Does anyone else feel like they don’t feel the... | 1 |
| 4 | Never had coffee | Hi folks, I know this may sound crazy, but I'v... | Never had coffee Hi folks, I know this may sou... | 1 |


In [47]:
df_new.drop_duplicates(keep='first', inplace=True)

In [48]:
df_new.shape

(18033, 4)

In [49]:
df_new.head(15)

| | title | selftext | all | in |
|---|---|---|---|---|
| 0 | Is there any hope with automatic coffee brewers? | I've been doing pour-overs for a while and rea... | Is there any hope with automatic coffee brewer... | 1 |
| 1 | Is instant coffee a scam? | I tried drinking instant coffee espresso powde... | Is instant coffee a scam? I tried drinking ins... | 1 |
| 2 | DIY Drip Coffee Maker? | I've been in a DIY mood lately. I would love ... | DIY Drip Coffee Maker? I've been in a DIY mood... | 1 |
| 3 | Does anyone else feel like they don’t feel the... | I drink a LOT of coffee, and I was wondering i... | Does anyone else feel like they don’t feel the... | 1 |
| 4 | Never had coffee | Hi folks, I know this may sound crazy, but I'v... | Never had coffee Hi folks, I know this may sou... | 1 |
| 5 | [news] that extra cup of java is good for you,... | | [news] that extra cup of java is good for you,... | 1 |
| 6 | Best coffee to buy in Seattle? | So I’ll be heading to the emerald city one wee... | Best coffee to buy in Seattle? So I’ll be head... | 1 |
| 7 | Chemex drinkers - What’s your trick? | I just got an 8 cup Chemex and I’m having trou... | Chemex drinkers - What’s your trick? I just go... | 1 |
| 8 | [MOD] The Official Noob-Tastic Question Fest | Welcome to the weekly /r/Coffee question threa... | [MOD] The Official Noob-Tastic Question Fest W... | 1 |
| 9 | [MOD] What have you been brewing this week?/ C... | Hey everyone!\n\nWelcome back to the weekly /r... | [MOD] What have you been brewing this week?/ C... | 1 |


In [50]:
df_old['in'].value_counts()

0    14012
Name: in, dtype: int64

In [51]:
df_new['in'].value_counts()

1    18033
Name: in, dtype: int64

The two classes are reasonably balanced (14,012 old posts versus 18,033 new posts, roughly 44% to 56%), so I am not concerned about class imbalance at the moment. Had the classes been imbalanced, here are two ways to address the problem:
- Undersampling: We undersample the prevalent class so that the data to be modeled is more balanced between 0s and 1s. The idea is that the dominant class contains many redundant records, and a smaller, more balanced dataset can yield better model performance.
- Oversampling and Up/Down Weighting: A prominent criticism of undersampling is that it throws away data and does not use all the information at hand. Instead, we can oversample the rarer class by drawing additional rows with replacement (a.k.a. bootstrapping), or achieve a similar effect by giving rows of the rarer class a larger weight.
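
Both resampling ideas can be sketched with plain pandas. The frame below is a made-up 8-to-2 example, not the coffee data, and the variable names are illustrative. `DataFrame.sample` draws rows without replacement by default and with replacement when `replace=True`.

```python
import pandas as pd

# Made-up imbalanced data: 8 rows of class 0, 2 rows of class 1
df = pd.DataFrame({
    "text": [f"post {i}" for i in range(10)],
    "in": [0] * 8 + [1] * 2,
})
majority = df[df["in"] == 0]
minority = df[df["in"] == 1]

# Undersampling: shrink the majority class to the size of the minority class
undersampled = pd.concat(
    [majority.sample(n=len(minority), random_state=42), minority]
)

# Oversampling: bootstrap the minority class up to the size of the majority class
oversampled = pd.concat(
    [majority, minority.sample(n=len(majority), replace=True, random_state=42)]
)

print(undersampled["in"].value_counts().to_dict())
print(oversampled["in"].value_counts().to_dict())
```

In practice, resampling is usually applied to the training split only, so that the evaluation data keeps the original class proportions.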


## Exporting for EDA

We save the dataframes to CSV files that are now ready for EDA.

In [53]:
df_old.to_csv('../datasets/df_old.csv')

In [54]:
df_new.to_csv('../datasets/df_new.csv')
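
Because `to_csv` writes the index as an unnamed first column by default, a later `pd.read_csv` will show it as `Unnamed: 0` unless it is read back as the index. The sketch below round-trips a made-up frame through an in-memory buffer instead of a file. `keep_default_na=False` is one option (an assumption about what the next notebook wants) for keeping empty 'selftext' values as empty strings rather than NaN.

```python
import io

import pandas as pd

# Made-up frame with an empty selftext, like a link post
df = pd.DataFrame({"title": ["a", "b"], "selftext": ["", "body"], "in": [0, 0]})

buffer = io.StringIO()
df.to_csv(buffer)  # index is written as an unnamed first column
buffer.seek(0)

# Read the first column back as the index and keep empty strings as-is
restored = pd.read_csv(buffer, index_col=0, keep_default_na=False)
print(list(restored.columns))
```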

**Please proceed to 02_eda_coffee.ipynb**