# Data Preprocessing

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings; warnings.simplefilter('ignore')

Reading in the data collected during the Data Gathering step

In [2]:
ppc_df = pd.read_csv('../Data/PPC.csv')
seo_df = pd.read_csv('../Data/SEO.csv')
email_df = pd.read_csv('../Data/email1.csv')
extra_email = pd.read_csv('../Data/email2.csv')

Combining my two email dataframes into one

In [3]:
email_df = pd.concat([email_df,extra_email])
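One thing to be aware of: `pd.concat` keeps each frame's original row labels by default, so the combined frame can contain repeated index values. The sketch below uses two small made-up frames (not the real email data) to show the difference `ignore_index=True` makes. Whether that matters depends on how the index is used later.

```python
import pandas as pd

# Toy stand-ins for email_df and extra_email (hypothetical data).
first = pd.DataFrame({"title": ["a", "b"]})
second = pd.DataFrame({"title": ["c"]})

# Default behaviour: each frame keeps its own row labels.
stacked = pd.concat([first, second])

# ignore_index=True builds a fresh 0..n-1 index instead.
reindexed = pd.concat([first, second], ignore_index=True)

print(stacked.index.tolist())    # [0, 1, 0]
print(reindexed.index.tolist())  # [0, 1, 2]
```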

### For each dataframe I go through the following steps
1. Creating a new column, `Total_text`, which combines each Reddit post's title and self text
2. Assigning each dataframe a value (0 for PPC, 1 for SEO, 2 for email) in a `subreddit` column to identify which subreddit each post came from
3. Filling all null values with a single-space string, which ensures that the dataframe does not have any null values
4. Removing all columns other than `Total_text` and `subreddit`
5. Dropping all posts that have the same `Total_text`

In [4]:
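list_of_df = [ppc_df, seo_df, email_df]
# The position in the list becomes the label: 0 = PPC, 1 = SEO, 2 = email
for i, df in enumerate(list_of_df):
    # Note: a missing title or selftext makes the sum NaN, which fillna below turns into " "
    df['Total_text'] = df['title'] + df['selftext']
    df['subreddit'] = i
    df.fillna(" ", inplace=True)
    # Keep only the combined text and the label
    cols_to_drop = [col for col in df.columns if col not in ('Total_text', 'subreddit')]
    df.drop(columns=cols_to_drop, inplace=True)
    df.drop_duplicates(subset="Total_text", inplace=True)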
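Because `Total_text` is built before the nulls are filled, any post with a missing `selftext` ends up with `" "` as its text, and `drop_duplicates` then keeps only one such post per dataframe. A possible variant, shown here only as a sketch, fills the nulls first so that title-only posts keep their title. The helper name `build_text_frame` and the toy data are hypothetical, and running this on the real data would give different counts from the ones shown further down.

```python
import numpy as np
import pandas as pd


def build_text_frame(df, label):
    """Hypothetical helper: return a frame with only Total_text and subreddit."""
    # Fill missing values first so a missing body does not erase the title.
    title = df["title"].fillna("")
    body = df["selftext"].fillna("")
    out = pd.DataFrame({
        # A space between title and body keeps the last and first words apart.
        "Total_text": (title + " " + body).str.strip(),
        "subreddit": label,
    })
    # Drop posts whose combined text is identical.
    return out.drop_duplicates(subset="Total_text").reset_index(drop=True)


# Made-up posts: one exact repeat and one post without a body.
toy = pd.DataFrame({
    "title": ["Bid strategy", "Bid strategy", "Link post"],
    "selftext": ["How do I start?", "How do I start?", np.nan],
    "score": [3, 5, 1],
})

result = build_text_frame(toy, 0)
print(result)
```

The function returns a new frame instead of modifying its input, so it would be called as, for example, `ppc_clean = build_text_frame(ppc_df, 0)` rather than inside an in-place loop.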

Combining all Data into one Dataframe

In [5]:
final_df = pd.concat([ppc_df,seo_df,email_df])

Checking to ensure that our data is balanced

In [6]:
final_df['subreddit'].value_counts()

0    970
2    902
1    643
Name: subreddit, dtype: int64

Even though SEO has fewer unique posts than email and PPC, stratifying the train test split on the y variable keeps the class proportions the same in the training and test sets, so we should still be able to make accurate predictions
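To illustrate what stratifying does, the sketch below builds a label series with the same class counts as the `value_counts` output above and compares class proportions before and after the split. The placeholder text and the variable names are assumptions made for the demonstration; only the counts come from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the value_counts output above.
labels = pd.Series([0] * 970 + [2] * 902 + [1] * 643)
texts = pd.Series(["post"] * len(labels))  # placeholder text, not real posts

_, _, y_tr, y_te = train_test_split(
    texts, labels, random_state=42, stratify=labels
)

# Share of each class in the full data, the training set, and the test set.
summary = pd.DataFrame({
    "full": labels.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).round(3)
print(summary)
```

Each row of `summary` should show nearly identical shares across the three columns, which is the property the stratified split is relied on for here.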

### Defining our X and y variables

In [7]:
y = final_df['subreddit']
X = final_df['Total_text']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [9]:
y_train.to_csv('../Data/y_train.csv',header=True)
X_train.to_csv('../Data/X_train.csv',header=True)
y_test.to_csv('../Data/y_test.csv',header=True)
X_test.to_csv('../Data/X_test.csv',header=True)
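Since each split is a pandas Series, `to_csv` writes the index as the first column and the values as the second. A minimal sketch of reading such a file back into a Series follows. It writes to a temporary directory with made-up data, so the path and the `X_demo` name are assumptions and not part of this project.

```python
import os
import tempfile

import pandas as pd

# Made-up Series standing in for X_train, with a non-default index.
X_demo = pd.Series(["first post", "second post"], index=[5, 9], name="Total_text")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "X_train.csv")
    X_demo.to_csv(path, header=True)
    # index_col=0 restores the saved index; selecting the column gives a Series.
    restored = pd.read_csv(path, index_col=0)["Total_text"]

print(restored)
```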