# Modelling

# Background

Competition among coding bootcamps is increasing. Competitors include Hack Reactor, Vertical Institute, Rocket Academy and Le Wagon. If no action is taken, we can expect a decline in market share, a poorer marketing return on investment (ROI) and weaker lead generation, which means we will not be able to meet the enrollment KPIs.

GA marketing is therefore trying to figure out how to better identify the online persona of a bootcamp seeker as opposed to that of a computer science major to aid in targeted advertising.


Because the two audiences have a lot in common, separating them more precisely could yield a better advertising ROI.

# Problem Statement

Due to increased competition in the bootcamp market, General Assembly has been facing poorer enrollments, and it intends to maintain its position as one of the better bootcamps out there. We are a team of data scientists tasked by General Assembly to build a model with >90% accuracy that distinguishes people looking for bootcamp-style learning from computer science majors and prospective students, based on the words they use online.

## Data Dictionary

| Feature | Type | Dataset | Description|
| :--- | :--- | :--- | :--- |
| subreddit | Object | cs_major / coding_bootcamp | Subreddit (topic) the post belongs to: either CS major or coding bootcamp. |
| selftext | Object | cs_major / coding_bootcamp | Body text of the post written by the end user. |
| title | Object | cs_major / coding_bootcamp | Title of the post. |
| csMajors | Object | cs_major | csMajors is the subreddit (topic) about the computer science majors that universities offer to students. |
| coding_bootcamp | Object | coding_bootcamp | coding_bootcamp is the subreddit (topic) about coding bootcamps, which are taken by mid-career switchers, companies and students interested in upskilling. |
| combined_text | Object | cs_major / coding_bootcamp | Concatenation of the selftext and title columns. |



In [2]:
# import libraries
import numpy as np
import pandas as pd

import string
import re
import nltk
wn = nltk.WordNetLemmatizer()
stopwords = nltk.corpus.stopwords.words('english')

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import metrics
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import cross_val_predict, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

from pylab import rcParams

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import r2_score
# note: plot_roc_curve was removed in scikit-learn 1.2; drop it from this import on newer versions
from sklearn.metrics import accuracy_score, plot_roc_curve, roc_auc_score, recall_score, precision_score, f1_score


from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

pd.set_option('display.max_colwidth', 100)
import warnings
warnings.filterwarnings('ignore')

### Load CSV Files

In [3]:
cs = pd.read_csv('data/cs_major.csv')
bc = pd.read_csv('data/coding_bootcamp.csv')

We add the extra stopwords identified during the EDA to the NLTK stopword list to improve the model.

In [4]:
new_stopwords = ['like','im','know','boot','would','camp','looking','got','get','one','back']
stopwords.extend(new_stopwords)

In [5]:
# fill null values with an empty string
cs.fillna('', inplace = True)

# combine selftext and title into a new column, then drop the originals
cs['combined_text'] = cs['selftext'] + ' ' + cs['title']
cs.drop(columns = ['selftext', 'title'], inplace = True)

# fill null values with an empty string
bc.fillna('', inplace = True)

# combine selftext and title into a new column, then drop the originals
bc['combined_text'] = bc['selftext'] + ' ' + bc['title']
bc.drop(columns = ['selftext', 'title'], inplace = True)

In [6]:
# clean text: lowercase, strip punctuation, drop stopwords, lemmatize
def clean_text(text):
    text = str(text)
    # remove punctuation character by character and lowercase the rest
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    # raw string avoids the invalid escape sequence warning for \W
    tokens = re.split(r'\W+', text)

    # lemmatize and exclude stopwords (and empty tokens) within the same step
    text = [wn.lemmatize(word) for word in tokens if word and word not in stopwords]
    return ' '.join(text)

cs['combined_text_clean'] = cs['combined_text'].apply(clean_text)
cs.drop(columns = 'combined_text', inplace = True)


bc['combined_text_clean'] = bc['combined_text'].apply(clean_text)
bc.drop(columns = 'combined_text', inplace = True)

In [7]:
cs.head()

| Unnamed: 0 | subreddit | combined_text_clean |
| :--- | :--- | :--- |
| 0 | csMajors | recently accepted first job offer starting june previously spoken relocating assistance told ava... |
| 1 | csMajors | sed update took month hear cap rejected |
| 2 | csMajors | wanted make post hopefully inspire people go low ranked school gonna talk advice general thing l... |
| 3 | csMajors | renegotiate intern offer try different time fall spring instead summer |
| 4 | csMajors | applied around 200 internship season ended receiving two offer uhgoptum cigna first internship c... |


In [8]:
# combine both subreddits into one (not yet vectorised) dataframe and encode the target
df = pd.concat([cs, bc], axis = 0, ignore_index = True)
df["subreddit"] = df["subreddit"].map({'codingbootcamp': 1,
                                      'csMajors': 0})

# Modelling

Here we split the modelling by vectorisation method and tune each model using GridSearchCV with a Pipeline. Each pipeline contains one vectorizer and one classifier. For vectorisation we use TF-IDF and CountVectorizer, searching n-gram ranges from (1,1) up to (1,3). For classification we use Multinomial Naive Bayes and Logistic Regression. The aim is to find out which pair has the best F1 score.
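
As a rough sketch of the comparison described above, the four vectorizer/classifier pairs can be enumerated in a single loop. The toy corpus, the names `toy_docs` and `toy_labels`, and the reduced grid are assumptions made for illustration only; the notebook itself fits each pair separately on the Reddit data.

```python
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = bootcamp-style post, 0 = CS-major-style post.
toy_docs = [
    "career switch bootcamp tuition job placement",
    "bootcamp curriculum full stack project",
    "bootcamp income share agreement review",
    "part time bootcamp while working",
    "bootcamp graduate hiring outcome",
    "bootcamp prep course javascript",
    "internship offer summer new grad",
    "university course algorithm exam",
    "leetcode internship interview return offer",
    "college gpa research professor",
    "new grad offer negotiation intern",
    "data structure class semester university",
]
toy_labels = [1] * 6 + [0] * 6

vectorizers = {"tfidf": TfidfVectorizer, "count": CountVectorizer}
classifiers = {"nb": MultinomialNB, "logreg": LogisticRegression}

results = {}
for (vec_name, vec_cls), (clf_name, clf_cls) in product(
    vectorizers.items(), classifiers.items()
):
    # One pipeline per pair: a vectorizer followed by a classifier.
    pipe = Pipeline([("vec", vec_cls()), ("clf", clf_cls())])
    grid = {"vec__ngram_range": [(1, 1), (1, 2), (1, 3)]}
    search = GridSearchCV(pipe, param_grid=grid, scoring="f1", cv=3)
    search.fit(toy_docs, toy_labels)
    # best_score_ is the mean cross-validated F1 of the best candidate.
    results[(vec_name, clf_name)] = search.best_score_

for pair, score in results.items():
    print(pair, round(score, 3))
```

The scores on such a small corpus are not meaningful in themselves; the point is only the structure of the comparison.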

In [9]:
# set X, y for train test split
X = df['combined_text_clean']
y = df['subreddit']
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size = 0.3, stratify = y,random_state = 42)
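
The `stratify = y` argument keeps the class proportions the same in the train and test splits. A minimal sketch on made-up labels (the names `toy_X` and `toy_y` are illustrative and not part of the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, exactly half in each class.
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.array([0] * 50 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, stratify=toy_y, random_state=42
)

# The mean of a 0/1 label vector is the share of class 1 in that split.
train_share = y_tr.mean()
test_share = y_te.mean()
print(len(y_tr), len(y_te), train_share, test_share)
```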


## TF-IDF Modelling

### GridSearch and Pipeline (TF-IDF w/ N-gram(1,3), Multinomial Naive Bayes)

In [10]:
y_test.value_counts(normalize=True)

1    0.5
0    0.5
Name: subreddit, dtype: float64
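
Since the two classes are perfectly balanced, a classifier that always predicts the same class would reach an accuracy of only 0.5, which is the baseline any model has to beat. A small sketch of that baseline on synthetic labels (the data here is made up, and `DummyClassifier` is used for illustration rather than taken from the notebook):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical balanced labels, mirroring the 50/50 split shown above.
toy_y = np.array([0] * 50 + [1] * 50)
toy_X = np.zeros((100, 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_X, toy_y)

baseline_acc = accuracy_score(toy_y, baseline.predict(toy_X))
print(baseline_acc)
```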

In [11]:
# Set up a pipeline with tf-idf vectorizer and multinomial naive bayes

pipe_tvec = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

In [24]:
#parameters for pipeline with tf-idf
pipe_tvec_params = {
    'tvec__max_features': [2_000, 3_000, 4_000, 5_000],
    'tvec__stop_words': [None, 'english'],
    'tvec__ngram_range': [(1,1), (1,2),(1,3)]
}

In [25]:
# Instantiate GridSearchCV.

gs_tvec = GridSearchCV(pipe_tvec, 
                        param_grid = pipe_tvec_params, 
                       scoring='f1',
                        cv=5) # 5-fold cross-validation.
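
To get a feel for the cost of this search, the number of candidate settings can be counted with `ParameterGrid`. The sketch below copies the grid from the cell above; `n_folds` is just a local name for the `cv=5` setting.

```python
from sklearn.model_selection import ParameterGrid

pipe_tvec_params = {
    "tvec__max_features": [2_000, 3_000, 4_000, 5_000],
    "tvec__stop_words": [None, "english"],
    "tvec__ngram_range": [(1, 1), (1, 2), (1, 3)],
}
n_folds = 5

# 4 feature caps x 2 stopword settings x 3 n-gram ranges
n_candidates = len(ParameterGrid(pipe_tvec_params))
# Each candidate is fitted once per fold (the final refit is not counted).
total_fits = n_candidates * n_folds
print(n_candidates, total_fits)
```

So the search evaluates 24 candidate settings and performs 120 cross-validation fits before refitting the best one on the full training set.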

In [26]:
# Fit GridSearch to training data.
gs_tvec.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tvec', TfidfVectorizer()),
                                       ('nb', MultinomialNB())]),
             param_grid={'tvec__max_features': [2000, 3000, 4000, 5000],
                         'tvec__ngram_range': [(1, 1), (1, 2), (1, 3)],
                         'tvec__stop_words': [None, 'english']},
             scoring='f1')

In [27]:
# Score model on training set.
print(gs_tvec.score(X_train, y_train))
# Score model on testing set.
print(gs_tvec.score(X_test, y_test))


0.9330558446063129
0.9110032362459547


In [None]:
print(gs_tvec.best_params_)
print(gs_tvec.best_score_)
print(gs_tvec.best_estimator_)

In [None]:
# Get predictions
preds_tvec = gs_tvec.predict(X_test)

print(classification_report(y_test, preds_tvec)) # 0: CS Major, 1: Coding bootcamp

### GridSearch and Pipeline (TF-IDF w/ N-gram(1,3), Logistic Regression)

In [None]:
#instantiate the model
logreg = LogisticRegression()
#instantiate pipeline, reusing the model created above
pipe = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('logreg', logreg)
])
# selected parameters for pipeline
pipe_tvec_params = {
    'tvec__max_features': [2_000, 3_000, 4_000, 5_000],
    'tvec__stop_words': [None, 'english'],
    'tvec__ngram_range': [(1,1), (1,2),(1,3)]
}
# Instantiate GridSearchCV.
gs_tvec = GridSearchCV(pipe, 
                  param_grid=pipe_tvec_params, 
                  scoring = 'f1',
                  cv=5) # 5-fold cross-validation.

#fit X,y train into gridsearch
gs_tvec.fit(X_train, y_train)

In [None]:
# Score model on training set.
print(gs_tvec.score(X_train, y_train))
#score model on test set.
print(gs_tvec.score(X_test, y_test))

In [None]:
print(gs_tvec.best_params_)
print(gs_tvec.best_score_)
print(gs_tvec.best_estimator_)

In [None]:
# Get predictions
y_pred = gs_tvec.predict(X_test)
print(classification_report(y_test, y_pred)) # 0: CS Major, 1: Coding bootcamp

## Count Vectorizer Modelling

### GridSearch and Pipeline (CountVectorizer w/ N-gram(1,3), Multinomial Naive Bayes)

In [None]:
# set X, y for train test split
X = df['combined_text_clean']
y = df['subreddit']
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size = 0.3, stratify = y,random_state = 42)

In [None]:
y_test.value_counts(normalize=True)

In [None]:
#instatiate pipeline
pipe = Pipeline([
    ('cvec', CountVectorizer(ngram_range = (1,3))),
    ('nb', MultinomialNB())
])

# Search over the following values of hyperparameters:
# Maximum number of features fit: 2000, 3000, 4000, 5000
# Minimum number of documents a token must appear in: 2, 3
# Maximum share of documents a token may appear in: 90%, 95%
# N-gram ranges: unigrams only, up to 2-grams, and up to 3-grams
# (the grid overrides the ngram_range set in the pipeline above).

pipe_params = {
    'cvec__max_features': [2_000, 3_000, 4_000, 5_000],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [.9, .95],
    'cvec__ngram_range': [(1,1), (1,2),(1,3)]
}

# Instantiate GridSearchCV.
# No `scoring` is passed here, so the search uses the classifier's default score (accuracy).
gs = GridSearchCV(pipe, 
                  param_grid=pipe_params, 
                  cv=5) # 5-fold cross-validation.

# Fit GridSearch to training data.
gs.fit(X_train, y_train)

In [22]:
# the best score from GridSearch CV
print(gs.best_score_)

# Score model on training set.
print(gs.score(X_train, y_train))

# Score model on testing set.
print(gs.score(X_test, y_test))

0.9112500000000001
0.9251785714285714
0.9079166666666667
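
The search above is created without a `scoring` argument, in which case `GridSearchCV` falls back to the estimator's own `score` method, which is accuracy for classifiers. The three numbers printed here are therefore accuracy values, whereas the searches that pass `scoring = 'f1'` report F1. The sketch below illustrates the difference on synthetic data; the dataset, the grid and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced data so that accuracy and F1 differ visibly.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0
)
grid = {"C": [0.1, 1.0]}

# Same model and grid, once with the default scorer and once with F1.
gs_default = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=5)
gs_f1 = GridSearchCV(LogisticRegression(max_iter=1000), grid, scoring="f1", cv=5)
gs_default.fit(X_toy, y_toy)
gs_f1.fit(X_toy, y_toy)

# .score() uses whichever scorer the search was configured with.
default_matches_accuracy = np.isclose(
    gs_default.score(X_toy, y_toy),
    accuracy_score(y_toy, gs_default.predict(X_toy)),
)
f1_matches_f1 = np.isclose(
    gs_f1.score(X_toy, y_toy),
    f1_score(y_toy, gs_f1.predict(X_toy)),
)
print(default_matches_accuracy, f1_matches_f1)
```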


In [23]:
# Get predictions
y_pred = gs.predict(X_test)
print(classification_report(y_test, y_pred)) # 0: CS Major, 1: Coding bootcamp

              precision    recall  f1-score   support

           0       0.92      0.89      0.91      1200
           1       0.90      0.92      0.91      1200

    accuracy                           0.91      2400
   macro avg       0.91      0.91      0.91      2400
weighted avg       0.91      0.91      0.91      2400
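
The precision, recall and f1-score columns of the report all derive from the confusion matrix. The sketch below recomputes them by hand for class 1 on a small set of hypothetical labels (`y_true` and `y_hat` are made up, not the notebook's predictions):

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 0 = CS major, 1 = coding bootcamp.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted bootcamp posts that are correct
recall = tp / (tp + fn)     # share of actual bootcamp posts that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```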



### GridSearch and Pipeline (CountVectorizer w/ N-gram(1,3), Logistic Regression)

In [24]:
# set X, y for train test split
X = df['combined_text_clean']
y = df['subreddit']
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size = 0.3, stratify = y,random_state = 42)

In [25]:
#instantiate the model
logreg = LogisticRegression()
#instantiate pipeline, reusing the model created above
pipe = Pipeline([
    ('cvec', CountVectorizer(ngram_range = (1,3))),
    ('logreg', logreg)
])

# Search over the following values of hyperparameters:
# Maximum number of features fit: 2000, 3000, 4000, 5000
# Minimum number of documents a token must appear in: 2, 3
# Maximum share of documents a token may appear in: 90%, 95%
# N-gram ranges: unigrams only, up to 2-grams, and up to 3-grams
# (the grid overrides the ngram_range set in the pipeline above).
pipe_params = {
    'cvec__max_features': [2_000, 3_000, 4_000, 5_000],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [.9, .95],
    'cvec__ngram_range': [(1,1), (1,2),(1,3)]
}

# Instantiate GridSearchCV.
gs = GridSearchCV(pipe, 
                  param_grid=pipe_params, 
                  scoring = 'f1',
                  cv=5) # 5-fold cross-validation.

# Fit GridSearch to training data.
gs.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec',
                                        CountVectorizer(ngram_range=(1, 3))),
                                       ('logreg', LogisticRegression())]),
             param_grid={'cvec__max_df': [0.9, 0.95],
                         'cvec__max_features': [2000, 3000, 4000, 5000],
                         'cvec__min_df': [2, 3],
                         'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)]},
             scoring='f1')

In [26]:
# the best score from GridSearch CV
print(gs.best_score_)
# Score model on training set.
print(gs.score(X_train, y_train))
#score model on test set.
print(gs.score(X_test, y_test))


0.9201746533460197
0.9875489149768766
0.9153526970954358


In [27]:
# Get predictions
y_pred = gs.predict(X_test)
print(classification_report(y_test, y_pred)) # 0: CS Major, 1: Coding bootcamp

              precision    recall  f1-score   support

           0       0.92      0.91      0.91      1200
           1       0.91      0.92      0.92      1200

    accuracy                           0.92      2400
   macro avg       0.92      0.92      0.91      2400
weighted avg       0.92      0.92      0.91      2400



# Evaluation of Models

| Vectorizer | Classification Model | Train | Test |
|:---|:---|:---|:---|
| TF-IDF | Naive Bayes | 0.93305 | 0.91100 |
| TF-IDF | Logistic Regression | 0.962240 | 0.925879 |
| CountVectorizer | Naive Bayes | 0.92517 | 0.90791 |
| CountVectorizer | Logistic Regression | 0.98754 | 0.91535 |

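For every model the gap between the train and test scores is below 0.1, and every test score is above 0.90, so the models generalise reasonably well. The largest gap (about 0.07) belongs to CountVectorizer with Logistic Regression, which suggests it overfits the most. Note that the CountVectorizer / Naive Bayes search was run without `scoring = 'f1'`, so its train and test scores are accuracy values, whereas the other three rows are F1 scores.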
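
The train/test gaps can be checked directly from the reported scores. The sketch below simply copies the numbers from the table above into a dictionary; the variable names are illustrative.

```python
# (train, test) scores copied from the table above.
scores = {
    ("TF-IDF", "Naive Bayes"): (0.93305, 0.91100),
    ("TF-IDF", "Logistic Regression"): (0.962240, 0.925879),
    ("CountVectorizer", "Naive Bayes"): (0.92517, 0.90791),
    ("CountVectorizer", "Logistic Regression"): (0.98754, 0.91535),
}

# Gap between train and test score for each model.
gaps = {pair: round(train - test, 5) for pair, (train, test) in scores.items()}
largest_gap_pair = max(gaps, key=gaps.get)

for pair, gap in gaps.items():
    print(pair, gap)
print("largest gap:", largest_gap_pair)
```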

| Vectorizer | Classification Model | F1 score |
|:---|:---|:---|
| TF-IDF | Naive Bayes | 0.91 |
| TF-IDF | Logistic Regression | 0.93 |
| CountVectorizer | Naive Bayes | 0.91 |
| CountVectorizer | Logistic Regression | 0.92 |

We select TF-IDF with Logistic Regression as the best-performing model, with an F1 score of 0.93.