# Reddit Subreddit Classification
## Notebook 1 - Data Acquisition and Cleaning
---

## Problem Statement

Subreddit posts related to dating and relationships seem to cover many common issues and situations. In many of these posts, users lament the frustrations and struggles of finding a romantic partner among a sea of people with vastly different familial, financial and other personal interests.

This project aims to build a classifier using natural language processing that can distinguish between posts from the subreddits r/dating and r/datingoverforty as accurately as possible despite their similarities.

## Executive Summary

Although many of the discussions within both of these subreddits share common themes and issues, I have determined that natural language processing tools can uncover enough differences to create an accurate classifier of their posts.

### Findings

While exploring the data, I ran a logistic regression model that calculated a coefficient for each word used in the posts of these subreddits. These coefficients gauge how much the presence of a word in a post affects the likelihood that the post is predicted to belong to one subreddit or the other. The words and topics with the most predictive impact in the logistic regression model for each subreddit are summarized below.

| **r/dating**                | **r/datingoverforty**        |
|-----------------------------|------------------------------|
| Numbers in the 20s          | Numbers in the 40s and above |
| Girl, girlfriend, boyfriend | Divorce                      |
| School/College              | Children                     |
|                             | Marriage                     |
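
To make the role of a coefficient concrete: in logistic regression, each coefficient is the change in the log-odds of the positive class per one-unit increase in that feature, so exponentiating it gives the factor by which the odds are multiplied. The sketch below uses made-up coefficient values purely for illustration; they are not the values from this project's model, and `odds_multiplier` is a hypothetical helper name.

```python
import math

# Hypothetical coefficients for illustration only; NOT taken from the fitted model.
# Positive values push a post toward r/datingoverforty (target 1),
# negative values push it toward r/dating (target 0).
example_coefs = {"divorce": 1.2, "college": -0.8}


def odds_multiplier(coef):
    """Factor by which the odds of target 1 change per one-unit increase in the feature."""
    return math.exp(coef)


for word, coef in example_coefs.items():
    print(f"{word}: odds multiplied by {odds_multiplier(coef):.2f}")
```

A multiplier above 1 means the word makes r/datingoverforty more likely; a multiplier below 1 means it makes r/dating more likely.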

### Results

I ran the text data through five different models with various pre-processing strategies and model hyper-parameters. The best model was a logistic regression model that used a function to preprocess the text based on the insights from the analysis above. This model achieved an accuracy score of 82.02%. A full summary of the metrics achieved by this model is provided below.

| **Metric**      | **Score** |
|-----------------|-----------|
| Accuracy Score  | 82.02%    |
| Precision Score | 82%       |
| Recall Score    | 82%       |
| F1 Score        | 82%       |
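
For reference, the four metrics in the table can be derived from the counts of true and false positives and negatives. The sketch below computes them for the positive class on a small set of invented labels. The function name `binary_metrics` and the toy labels are assumptions for illustration, and the averaging method behind the scores in the table is not stated in this notebook.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented labels for illustration only (1 = r/datingoverforty, 0 = r/dating)
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)
```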

To improve this model's accuracy further, I would like to explore the following next steps: (i) adding more features about the text, such as (a) sentiment analysis on each post and (b) counts of the number of words and sentences used in each post, and (ii) exploring other natural language processing strategies to differentiate the subreddits further.




---

## Data Acquisition
Data for the two selected subreddits was acquired from the Pushshift API using the following code. Submissions were collected starting from 1 year prior to March 24th, 2022.

### Import Libraries

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import time
import datetime
import os
from nltk.sentiment.vader import SentimentIntensityAnalyzer

### Data Acquisition Code

In [10]:
# Base url for Pushshift API
base_url = "https://api.pushshift.io/reddit/search/submission"

In [11]:
# Function to make a request to the Pushshift API and return the submissions as a list of dicts
# sleep code inspired by: https://realpython.com/python-sleep/

def extract_reddit_data(subreddit, size, after):
    # Create URL for request from pushshift based on subreddit, size of data response and starting time period
    url = base_url + f"?subreddit={subreddit}&size={size}&after={after}&sort=asc"
    
    # Set variable to use in while loop which will keep trying if our request doesn't get a 200 status code response
    no_response = True
    
    # While loop to make requests as long as response status code is not 200
    while no_response:
        print(f'Making request at: {url}')
        res = requests.get(url)
        print(f'Request response code: {res.status_code}')
        if res.status_code == 200:
            no_response = False
        else:
            print('Trying again')
            time.sleep(3)
            
    # Saving response data into dictionary
    data = res.json()['data']
    
    # Print length of data to check response was not empty
    print(f'Length of Data: {len(data)}')
    
    # Loop through response to collect certain items for each post into a list
    submissions = []
    for submission in data:
        s = {
            "id": submission.get('id', ""),
            "created_utc": submission.get('created_utc',""),
            "title": submission.get("title", ""),
            "selftext": submission.get("selftext", ""),
            "subreddit": submission.get("subreddit", ""),
            "subreddit_id": submission.get("subreddit_id", ""),
            "url": submission.get("url", ""),
        }
        # Append the trimmed dict (not the raw submission) so only the selected fields are kept
        submissions.append(s)
    return submissions
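
The retry loop above keeps requesting until it receives a 200 response, so it never gives up if the API stays unavailable. A bounded variant with exponential backoff could look like the sketch below. The names `get_with_retries`, `max_attempts`, and `base_delay` are hypothetical and not part of the original notebook; the request itself is passed in as a callable so the logic can be tried without network access.

```python
import time


def get_with_retries(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns status code 200 or the attempts run out.

    fetch is any zero-argument callable returning an object with a
    .status_code attribute, e.g. lambda: requests.get(url).
    """
    for attempt in range(max_attempts):
        res = fetch()
        if res.status_code == 200:
            return res
        # Wait 1x, 2x, 4x, ... the base delay before the next attempt
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"No 200 response after {max_attempts} attempts")
```

Inside `extract_reddit_data`, this could be used as `res = get_with_retries(lambda: requests.get(url))`.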

In [12]:
# Function to make all requests necessary to collect a year's worth of posts for a subreddit

def extract_all_submissions(subreddit, after):
    # Get first set of submissions
    print(f'Getting first set of submissions')
    first = extract_reddit_data(subreddit, "100", after)
    
    # Write first set of submissions to CSV
    pd.DataFrame(first).to_csv(f"../data/{subreddit}-full-{after}.csv")
    
    # Get timestamp of last post in first set to use for the next API call
    last_timestamp = datetime.datetime.fromtimestamp(first[-1]['created_utc'])

    # While loop to keep making API calls as long as the timestamp of the last post collected is before today's date
    # Writes each API response to CSV
    while last_timestamp.date() < datetime.date.today():
        time.sleep(3)
        print(f'Getting submissions after {last_timestamp}')
        next_submissions = extract_reddit_data(subreddit, "100", str(int(last_timestamp.timestamp())))
        # Stop if the API returns no further posts; otherwise next_submissions[-1] raises an IndexError
        if not next_submissions:
            break
        pd.DataFrame(next_submissions).to_csv(f"../data/{subreddit}-full-{last_timestamp}.csv")
        last_timestamp = datetime.datetime.fromtimestamp(next_submissions[-1]['created_utc'])
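
The paging strategy above uses the `created_utc` of the last post in each response as the `after` value for the next request. A minimal sketch of that cursor logic, using an in-memory stand-in for the API so it runs offline, might look like this. The names `collect_pages`, `fake_fetch_page`, and the toy posts are all hypothetical.

```python
def collect_pages(fetch_page, start_after, stop_before):
    """Collect posts page by page, using the last timestamp of each page as the next cursor.

    fetch_page(after) must return a list of dicts sorted ascending by
    'created_utc', containing only posts created after `after`.
    """
    all_posts = []
    after = start_after
    while after < stop_before:
        page = fetch_page(after)
        if not page:  # an empty page means there is nothing left to collect
            break
        all_posts.extend(page)
        after = page[-1]["created_utc"]
    return all_posts


# In-memory stand-in for the API: seven toy posts, returned three at a time
toy_posts = [{"id": f"p{i}", "created_utc": i} for i in range(1, 8)]


def fake_fetch_page(after, size=3):
    return [p for p in toy_posts if p["created_utc"] > after][:size]


collected = collect_pages(fake_fetch_page, start_after=0, stop_before=100)
print(len(collected))
```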
        

In [13]:
# Acquiring data from first subreddit; commented out to avoid running it again

# extract_all_submissions("dating", "365d")

In [14]:
# Acquiring data from second subreddit; commented out to avoid running it again

# extract_all_submissions("datingoverforty", "365d")

## Data Cleaning
After all the data obtained from the Pushshift API was written to CSV, the CSV files were then each read into a pandas DataFrame and concatenated. Then, the data was processed with some light cleaning steps.

In [15]:
# Function to collect all csvs for a subreddit and return a dataframe

def subreddit_to_df(subreddit):

    # Get file names in data folder for subreddit
    subreddit_files = []
    for file in os.listdir("../data"):
        split_filename = file.split("-")
        if split_filename[0] == subreddit and split_filename[1] != "full":
            subreddit_files.append(file)
        
    # Read csvs into dataframes then concatenate each of them 
    df = pd.concat([pd.read_csv(f'../data/{file}') for file in subreddit_files]).drop(columns="Unnamed: 0")
    
    return df

In [16]:
# Function to process dataframe by removing columns, dealing with missing text

def process_df(df):
    # Create new copy of dataframe to be processed
    new_df = df.copy()

    # Replace all "[removed]" values in dataframe with np.nan
    new_df.replace("[removed]", np.nan, inplace=True)
    
    # Drop np.nans
    new_df.dropna(inplace=True)
    
    # Create new column with text from the title and the body of the post
    new_df['alltext'] = new_df['title'].str.cat(new_df['selftext'], sep = " \n ")
    
    # Drop all columns except for alltext and subreddit which will become our target
    new_df = new_df[["alltext", "subreddit"]]
    
    # Set new target column by setting it to 0 where subreddit is dating, and 1 where datingoverforty
    new_df.loc[:, "||__target__||"] = np.where(new_df["subreddit"] == "datingoverforty", 1, 0)
    
    # Return dataframe with just alltext and new target column
    return new_df[["alltext", "||__target__||"]]
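
To illustrate what these cleaning steps do, here is a small sketch on an invented three-row DataFrame. The toy data is an assumption for illustration only; the operations mirror those in `process_df`.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "title": ["First date tips", "Deleted post", "No body text"],
    "selftext": ["Any advice?", "[removed]", np.nan],
    "subreddit": ["dating", "dating", "datingoverforty"],
})

# Treat "[removed]" as missing, then drop every row with a missing value
cleaned = toy.replace("[removed]", np.nan).dropna().copy()

# Join title and body into a single text column, as process_df does
cleaned["alltext"] = cleaned["title"].str.cat(cleaned["selftext"], sep=" \n ")
print(cleaned[["alltext", "subreddit"]])
```

Only the first row survives: the second has a removed body and the third has no body text at all.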

In [17]:
# Create dataframes of subreddit posts
dating_df = subreddit_to_df("dating")
over_forty_df = subreddit_to_df("datingoverforty")

# Process dataframes
dating_df = process_df(dating_df)
over_forty_df = process_df(over_forty_df)

In [18]:
# Concatenate dataframes
model_df = pd.concat([dating_df, over_forty_df])

# Reindex dataframe after the series of concatenations
model_df.reset_index(drop=True, inplace=True)

### Fixing imbalanced classes in data
Our final dataset is very imbalanced: roughly 93% of the posts come from r/dating and only about 7% from r/datingoverforty, because posts are submitted to r/dating far more frequently. The following code undersamples the majority class to balance the classes before the final dataset is written to CSV.

In [19]:
# Show imbalanced classes
model_df['||__target__||'].value_counts(normalize=True)

0    0.927439
1    0.072561
Name: ||__target__||, dtype: float64

In [20]:
# Create under sampled dataframe for the majority class matching the number of samples in the minority class
under_sample_0 = model_df[model_df["||__target__||"]==0].sample(n=4584, random_state=42, replace=False)

# Concatenate undersampled dataframe with minority class data frame
model_df = pd.concat([under_sample_0, model_df[model_df["||__target__||"]==1]])

# Reset index after sampling and concatenation
model_df.reset_index(drop=True, inplace=True)
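
The cell above hard-codes the minority class size. A more general sketch that derives the sample size from the data is shown below. The function name `undersample_majority` and the toy DataFrame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def undersample_majority(df, target_col, random_state=42):
    """Downsample every class to the size of the smallest class."""
    counts = df[target_col].value_counts()
    n_minority = int(counts.min())
    parts = [
        df[df[target_col] == label].sample(n=n_minority, random_state=random_state, replace=False)
        for label in counts.index
    ]
    return pd.concat(parts).reset_index(drop=True)


# Invented data for illustration: eight posts of class 0, two of class 1
toy = pd.DataFrame({
    "text": [f"post {i}" for i in range(10)],
    "label": [0] * 8 + [1] * 2,
})
balanced = undersample_majority(toy, "label")
print(balanced["label"].value_counts())
```

Deriving the count this way keeps the classes balanced even if the underlying data is collected again and the class sizes change.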

## Addition of Sentiment Analysis
After the primary dataset of post text and target variable was assembled, sentiment scores from NLTK's VADER sentiment analyzer were added to it.

In [21]:
# Instantiate Vader sentiment analyzer
sia = SentimentIntensityAnalyzer()

# Create list of sentiment analysis results
sentiments = []

for text in list(model_df['alltext']):
    sentiments.append(sia.polarity_scores(text))

# Create dataframe from list of sentiments
sentiments_df = pd.DataFrame(sentiments)

In [22]:
# Add sentiment results to main dataframe
model_df = pd.concat([model_df, sentiments_df], axis=1)

In [23]:
# Rename columns so that they don't interfere with any columns that may be made later by text vectorizers
model_df.rename(columns={
    "neg": "||__neg__||",
    "neu": "||__neu__||",
    "pos": "||__pos__||",
    "compound": "||__compound__||",
}, inplace=True)

## Export to CSV

In [24]:
# Write final dataset to CSV
model_df.to_csv("../data/final.csv", index=False)