# GA DSI Project 3: Classification

# Problem Statement

With the advent of COVID-19, people have been hit by waves of lockdowns and have taken up new hobbies to destress and to indulge in things they have always wanted to do. These include brewing and fermenting, which require little startup equipment but can produce delicious alcoholic drinks for friends and family. With growing interest in these techniques, many companies and online services have sprung up to meet this need, providing forums where aspiring homebrewers and winemakers can find information and learn alongside their peers.

A new alcohol company wishes to understand consumer patterns with regard to winemaking and brewing, its two largest services. Since people are mostly stuck at home, many have taken up one or the other to pass the time. The company wishes to create a chatbot that can take consumer queries and respond with winemaking or homebrewing tips. However, because the two processes share many keywords (e.g. fermentation, tank, yeast), this is not a trivial task. The company has therefore asked us for a machine learning model that can identify whether a prospective customer wants to know about homebrewing or winemaking.

We will train our model on posts from the Enology and Viticulture (r/winemaking) and Homebrewing (r/homebrewing) subreddits, and compare the classifiers we have learnt so far to find the model that best suits the company's needs and can be integrated into its chatbot.

In [1]:
# unholy trinity
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# api requests and html parsing
import requests
from bs4 import BeautifulSoup as bs

# plotting
import seaborn as sns

# timing code
import time

# Scraping Data

As mentioned above, we will be scraping [r/winemaking](https://www.reddit.com/r/winemaking/) and [r/homebrewing](https://www.reddit.com/r/homebrewing/) to train our model. Since Pushshift limits the number of posts returned by a single request, we need a workaround: pull one batch of posts, take the `created_utc` timestamp of the last post in that batch as the `before` parameter for the next request, and repeat until 1000 posts have been collected for each subreddit.

In [2]:
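tic = time.perf_counter()

reddit_data = []
pages = ['winemaking', 'homebrewing']
for page in pages:
    url = f'https://api.pushshift.io/reddit/search/submission/?subreddit={page}'
    print(f'The status code for the {page} subreddit is: {requests.get(url).status_code}')
    n_posts = 0  # posts collected so far for this subreddit
    last = ''    # timestamp of the last post seen; empty on the first pull
    while n_posts < 1000:
        response = requests.get(f'{url}&before={last}')
        posts = response.json()['data']
        if not posts:  # stop instead of looping forever if nothing more is returned
            break
        reddit_data.extend(posts)
        n_posts += len(posts)
        # the next pull only asks for posts created before the last one in this batch
        last = int(posts[-1]['created_utc'])

toc = time.perf_counter()
print(f"The scraping took {toc - tic:0.4f} seconds")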

The status code for the winemaking subreddit is: 200
The status code for the homebrewing subreddit is: 200
The scraping took 250.1583 seconds
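
The timestamp-based paging used above can also be sketched as a small helper that is independent of the network call, which makes the stopping behaviour easy to check offline. This is a minimal sketch, not the notebook's own code: `paginate` and `fake_fetch` are hypothetical names, and it assumes each batch arrives newest-first so that the last post in a batch is the oldest one.

```python
from typing import Callable, Dict, List, Optional

Post = Dict[str, object]


def paginate(fetch: Callable[[Optional[int]], List[Post]], target: int) -> List[Post]:
    """Collect at least `target` posts by repeatedly calling `fetch(before)`."""
    collected: List[Post] = []
    before: Optional[int] = None  # None means "start from the newest posts"
    while len(collected) < target:
        batch = fetch(before)
        if not batch:  # nothing older is available, so stop early
            break
        collected.extend(batch)
        # The next call asks only for posts older than the last one in this batch.
        before = int(batch[-1]["created_utc"])
    return collected


# Hypothetical stand-in for the API: 10 posts with timestamps 10, 9, ..., 1,
# served newest-first in batches of at most 4.
def fake_fetch(before: Optional[int]) -> List[Post]:
    all_posts = [{"created_utc": t} for t in range(10, 0, -1)]
    older = [p for p in all_posts if before is None or p["created_utc"] < before]
    return older[:4]


posts = paginate(fake_fetch, target=6)
print(len(posts))
```

Because whole batches are appended, the helper can return more posts than `target` (here two batches of 4 are collected for a target of 6), and it stops cleanly once the source runs out of older posts.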


Let's put the data into a pandas DataFrame for easier reference, and confirm that 1000 entries have been scraped for each subreddit.

In [3]:
reddit_df = pd.DataFrame(reddit_data)
reddit_df['subreddit'].value_counts()

winemaking     1000
Homebrewing    1000
Name: subreddit, dtype: int64
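
The counts above confirm the number of rows per subreddit, but not that every row is a distinct post. As a hedged extra check, the sketch below counts repeated values in the `id` column (one of the columns returned by the API) on a small toy frame; `toy_df` and its values are made up for illustration and are not the scraped data.

```python
import pandas as pd

# Toy stand-in for the scraped data; the real frame has many more columns.
toy_df = pd.DataFrame({
    "id": ["a1", "a2", "a2", "b1"],
    "subreddit": ["winemaking", "winemaking", "winemaking", "Homebrewing"],
})

# Count rows whose post id has already appeared earlier in the frame.
n_duplicates = int(toy_df["id"].duplicated().sum())

# Keep the first occurrence of each post, then recount per subreddit.
deduped = toy_df.drop_duplicates(subset="id")
counts = deduped["subreddit"].value_counts()

print(n_duplicates)
print(counts)
```

Note that the API returns the labels as `winemaking` and `Homebrewing`, so any later comparison against the `subreddit` column needs to use that capitalisation.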

In [4]:
reddit_df.columns.sort_values()

Index(['all_awardings', 'allow_live_comments', 'author', 'author_cakeday',
       'author_flair_background_color', 'author_flair_css_class',
       'author_flair_richtext', 'author_flair_template_id',
       'author_flair_text', 'author_flair_text_color', 'author_flair_type',
       'author_fullname', 'author_is_blocked', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'crosspost_parent', 'crosspost_parent_list',
       'distinguished', 'domain', 'edited', 'full_link', 'gallery_data',
       'gildings', 'id', 'is_created_from_ads_ui', 'is_crosspostable',
       'is_gallery', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_template_id', 'link_flair_text', 'link_flair_text_color',
       'link_flair_type', 'locked', 'media', 'media_embed', 'media_metadata',
       'media_only', 'n

In [5]:
reddit_df.head()

| | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | secure_media | secure_media_embed | author_cakeday | author_flair_background_color | author_flair_template_id | author_flair_text_color | edited | poll_data | distinguished | suggested_sort |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | Plenox | | [] | | text | t2_hdu1d | False | False | ... | | | | | | | | | | |
| 1 | [] | False | thesnakewithin | | [] | | text | t2_2sc63pjf | False | False | ... | | | | | | | | | | |
| 2 | [] | False | handbanana42069 | | [] | | text | t2_8txoqw2b | False | False | ... | | | | | | | | | | |
| 3 | [] | False | combhonn | | [] | | text | t2_ay6uj6sh | False | False | ... | | | | | | | | | | |
| 4 | [] | False | WoodenPear | | [] | | text | t2_1xaou9f0 | False | False | ... | | | | | | | | | | |


We can check whether any posts consist only of media, which would make our analysis more difficult. Thankfully, there are none.

In [6]:
reddit_df.groupby('subreddit')['media_only'].value_counts()

subreddit    media_only
Homebrewing  False         1000
winemaking   False         1000
Name: media_only, dtype: int64

For our model training purposes, we only need 3 columns:
1. the title of the post
2. the text of the post
3. which subreddit the data came from (dependent variable)

In [7]:
reddit_df = reddit_df[['title', 'selftext', 'subreddit']]

In [8]:
reddit_df.head()

| | title | selftext | subreddit |
|---|---|---|---|
| 0 | What is this buildup? | | winemaking |
| 1 | Added a bit too much water 200ish ml, asked fo... | | winemaking |
| 2 | How many vines would you start with? | Hi all- \n\n&amp;#x200B;\n\nWine lover and avi... | winemaking |
| 3 | Misc CO2 / Oxygen protection question ... | I have a wine batch that has CO2 naturally dis... | winemaking |
| 4 | What's going on in this bottle? | | winemaking |


We can now save our data to a CSV file, ready for processing.

In [10]:
reddit_df.to_csv('../data/reddit-data.csv', index=False)
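
One thing to keep in mind when this file is read back: several posts in the preview have an empty `selftext`, and pandas parses empty CSV fields as `NaN` by default. The sketch below shows the round trip on toy data using an in-memory buffer instead of the real file; `toy_df` is illustrative only, and whether to keep empty strings or `NaN` is a choice for the processing step rather than something this notebook decides.

```python
import io

import pandas as pd

# Toy rows shaped like the saved data; the second post has body text, the first does not.
toy_df = pd.DataFrame({
    "title": ["What is this buildup?", "How many vines would you start with?"],
    "selftext": ["", "Hi all"],
    "subreddit": ["winemaking", "winemaking"],
})

# Write to an in-memory buffer the same way the notebook writes to disk.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)

# Default parsing turns the empty field into NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps empty fields as empty strings.
buffer.seek(0)
string_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["selftext"].isna().sum())
print(string_read["selftext"].tolist())
```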