# Creating a Stress Detection Tool using Data From Subreddits: Pre-Processing

In this notebook I create two train/test sets, each built with a different vectorizer. The first is the Count vectorizer. As its name implies, it counts the occurrences of each word in each post, so a word that appears more often gets a larger feature value. The second is tf-idf, or term frequency-inverse document frequency. Like the Count vectorizer, tf-idf starts from word frequencies, but it also scales each word's weight down according to how many posts contain it, so a word that shows up in nearly every post counts for less than a word concentrated in a few.
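To make the difference concrete, here is a minimal sketch on three made-up posts (the `toy_docs` list and all variable names are hypothetical and not taken from the Reddit data). It uses scikit-learn's default settings, so the exact tf-idf values depend on those defaults.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy posts, not taken from the Reddit data
toy_docs = [
    "work stress stress deadline",
    "work lunch break",
    "work weekend hike",
]

count_vect = CountVectorizer()
counts = count_vect.fit_transform(toy_docs)

tfidf_vect = TfidfVectorizer()
weights = tfidf_vect.fit_transform(toy_docs)

# vocabulary_ maps each word to its column index
count_cols = count_vect.vocabulary_
tfidf_cols = tfidf_vect.vocabulary_

# Raw counts in the first post: "stress" appears twice, "work" once
stress_count = counts[0, count_cols["stress"]]
work_count = counts[0, count_cols["work"]]

# tf-idf in the first post: "work" appears in every post,
# so its weight is scaled down relative to "stress"
stress_weight = weights[0, tfidf_cols["stress"]]
work_weight = weights[0, tfidf_cols["work"]]

print(stress_count, work_count)
print(stress_weight > work_weight)
```

With counts alone, "stress" is only twice as large as "work" in the first post. With tf-idf the gap widens, because "work" appears in all three posts and carries little information about any single one.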

#### Import necessary libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.model_selection import train_test_split

import pickle

In [2]:
import warnings
warnings.filterwarnings("ignore")

#### Import dataframe from pickle

In [3]:
df = pd.read_pickle('df.pickle')

## First train/test set:
* The first set uses count vectors

#### Define x and y

In [4]:
x=df['text']
y=df['stress_label']

#### Define vectorizer, stopwords

In [5]:
vect=CountVectorizer(stop_words="english")
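As a quick illustration of what `stop_words="english"` changes, this sketch fits the vectorizer with and without the option on one made-up sentence (the `sample` text is hypothetical, not from the dataset) and compares the resulting vocabularies.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example sentence, not taken from the dataset
sample = ["I am so stressed about the exam and the deadline"]

with_stops = CountVectorizer().fit(sample)
without_stops = CountVectorizer(stop_words="english").fit(sample)

# Common function words such as "the" and "and" are dropped
# when the built-in English stop word list is used
print(sorted(with_stops.vocabulary_))
print(sorted(without_stops.vocabulary_))
```

The built-in list is a general-purpose one, so it is worth checking that it does not remove words that matter for stress detection in this dataset.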

#### Vectorize x, then train/test split x and y

In [6]:
x=vect.fit_transform(x)

In [7]:
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=42)
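Note that the cells above fit the vectorizer on all of `x` before splitting, so the vocabulary is learned partly from posts that end up in the test set. One possible alternative, sketched here on a hypothetical `toy_df` that stands in for the real dataframe, is to split the raw text first and fit the vectorizer on the training portion only. This is an illustration of the idea, not the method used in the rest of this notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df; the real data comes from df.pickle
toy_df = pd.DataFrame({
    "text": [
        "deadline panic tonight", "calm walk outside",
        "exam anxiety again", "nice dinner with friends",
        "overwhelmed at work", "relaxing weekend trip",
        "cannot sleep worried", "enjoyed a good book",
    ],
    "stress_label": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Split the raw text first ...
text_train, text_test, y_train, y_test = train_test_split(
    toy_df["text"], toy_df["stress_label"], random_state=42
)

# ... then learn the vocabulary from the training posts only
vect = CountVectorizer(stop_words="english")
x_train = vect.fit_transform(text_train)
x_test = vect.transform(text_test)  # reuses the training vocabulary

print(x_train.shape, x_test.shape)
```

Both matrices end up with the same number of columns, because `transform` maps the test posts onto the vocabulary learned from the training posts and ignores words it has not seen.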

#### Save as pickle

In [8]:
pkl_filename = "x_train_CV.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(x_train, file)
    
    
pkl_filename = "x_test_CV.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(x_test, file)
    
    
pkl_filename = "y_train_CV.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(y_train, file)
    
    
pkl_filename = "y_test_CV.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(y_test, file)
    
# Save the fitted vectorizer; the with-block closes the file handle
vec_file = "cv_vectorizer.pickle"
with open(vec_file, "wb") as file:
    pickle.dump(vect, file)

## Second train/test set:
* The second set uses tf-idf vectors

#### Define x and y

In [9]:
x=df['text']
y=df['stress_label']

#### Define vectorizer, stopwords

In [10]:
vectorizer = TfidfVectorizer(stop_words="english")

#### Vectorize x, then train/test split x and y

In [11]:
X_tfidf = vectorizer.fit_transform(x)

In [12]:
x_train, x_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.33, random_state=42)
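One detail to keep in mind: the count-vector split earlier in this notebook does not pass `test_size`, so it uses scikit-learn's default of 0.25, while this split holds out 0.33. The sketch below, which uses 100 hypothetical row indices in place of the posts, shows how the two calls differ in the number of rows held out.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical row indices standing in for the posts
rows = np.arange(100)

# No test_size given: scikit-learn falls back to 0.25
_, test_default = train_test_split(rows, random_state=42)

# test_size=0.33, as in the tf-idf split
_, test_third = train_test_split(rows, test_size=0.33, random_state=42)

print(len(test_default), len(test_third))
```

If the two feature sets are meant to be compared on equal footing, passing the same `test_size` and `random_state` to both calls is one way to make sure they are evaluated on the same held-out posts.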

#### Save as pickle

In [13]:
pkl_filename = "x_train_TFID.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(x_train, file)
    
    
pkl_filename = "x_test_TFID.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(x_test, file)
    
    
pkl_filename = "y_train_TFID.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(y_train, file)
    
    
pkl_filename = "y_test_TFID.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(y_test, file)
    
# Save the fitted vectorizer; the with-block closes the file handle
vec_file = "tfidf_vectorizer.pickle"
with open(vec_file, "wb") as file:
    pickle.dump(vectorizer, file)
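The saved vectorizer is what lets a later notebook turn new text into features with the same columns. As a minimal sketch of that round trip, the example below fits a vectorizer on made-up posts (the `posts` list is hypothetical), writes it to a temporary directory so the real pickle files are not touched, loads it back, and transforms a new sentence.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical posts standing in for df['text']
posts = ["so stressed about rent", "lovely calm morning", "stressed and tired"]
fitted = TfidfVectorizer(stop_words="english").fit(posts)

# Write to a temporary directory so the real pickle files are not touched
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tfidf_vectorizer.pickle")
    with open(path, "wb") as file:
        pickle.dump(fitted, file)
    with open(path, "rb") as file:
        loaded = pickle.load(file)

# transform (not fit_transform) keeps the saved vocabulary and idf weights
new_features = loaded.transform(["feeling stressed today"])
print(new_features.shape)
```

Calling `fit_transform` on the new text instead would rebuild the vocabulary from scratch, and the resulting columns would no longer line up with the ones a model was trained on. Pickles should also only be loaded from sources you trust, and with a scikit-learn version compatible with the one that wrote them.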