In the following codealong, we will combine our new NLP knowledge with our knowledge of pipelines. We will apply this combination of skills to a common task: separating `spam` from `ham` in a set of messages. 

In [5]:
ls ../..

code_challenge/           phase_3/
dsc-ml-fundamentals-lab/  phase_4/
instructor_repo_071921/   src/
new_caller/               student_caller_july/
phase_1/                  students/


In [8]:
# Import not necessary for students
import sys
sys.path.append('../..')

from new_caller.random_student_engager.student_caller import CohortCaller
from new_caller.random_student_engager.student_list import avocoder_toasters

caller = CohortCaller(avocoder_toasters)

The dataset comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). 

In [9]:
# Run cell with no changes to import Ham vs. Spam SMS dataset
import pandas as pd

with open('data/SMSSpamCollection') as read_file:
    texts = read_file.readlines()
    
# Each line has the form "<label>\t<message>": keep the message and the label separately
messages = [line.split('\t')[1] for line in texts]
labels = [line.split('\t')[0] for line in texts]
df = pd.DataFrame(messages, columns=['text'])
df['label'] = labels

In [10]:
df.head()

Unnamed: 0,text,label
0,"Go until jurong point, crazy.. Available only ...",ham
1,Ok lar... Joking wif u oni...\n,ham
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam
3,U dun say so early hor... U c already then say...,ham
4,"Nah I don't think he goes to usf, he lives aro...",ham


As the output of `head` shows, each message is labeled either `ham` or `spam`.

Check the distribution of the target in the cell below.

In [None]:
# Use pandas to find the distribution of Spam to Ham in the dataset


In [None]:
#__SOLUTION__
df['label'].value_counts(normalize=True)

In [None]:
caller.call_n_students(1)

Certain metrics require the target to be encoded as 0s and 1s. Use sklearn's `LabelEncoder` class to transform the target.  
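As a minimal sketch of what the encoder does, the snippet below runs `LabelEncoder` on a hypothetical toy list of labels rather than on `df`. `LabelEncoder` assigns integers to classes in sorted order, so with these two labels `ham` becomes 0 and `spam` becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels, standing in for df['label']
toy_labels = ['ham', 'spam', 'ham', 'ham']

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_labels)

# classes_ lists the original labels; the position of each label is its code
toy_classes = list(toy_encoder.classes_)
print(toy_classes, list(toy_encoded))
```

Because `spam` ends up as 1, it is treated as the positive class by metrics such as f1.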

In [None]:
# The f1 metric requires 0/1 labels
# Which class should be 0 and which should be 1?
from sklearn.preprocessing import LabelEncoder


In [None]:
caller.call_n_students(1)

# Target Distribution and Train-Test Split

The model building workflow is similar to what we have performed in Phase 3.  

To begin, train-test split the data set.  Preserve the class balance in the test set.
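The sketch below shows one way to preserve class balance: passing the target to the `stratify` argument of `train_test_split`. The toy frame, its column values, and the `test_size` of 0.25 are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 messages, 4 of them spam (label 1)
toy_df = pd.DataFrame({
    'text': [f'message {i}' for i in range(20)],
    'label': [1] * 4 + [0] * 16,
})

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_df[['text']],
    toy_df['label'],
    test_size=0.25,
    stratify=toy_df['label'],  # keep the 20% spam share in both splits
    random_state=42,
)

print(toy_y_train.mean(), toy_y_test.mean())
```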

In [None]:
# train-test split the dataset while preserving the class balance shown above
# Pass random_state=42 as an argument as well
from sklearn.model_selection import train_test_split



In [None]:
caller.call_n_students(1)

# EDA: Frequency Distributions

For some EDA, let's look at the frequency distribution of words across the entire dataset. 

In order to do so, we need to perform a few preprocessing steps.  

First, use the RegexpTokenizer from NLTK to isolate words from the messages.  You can use https://regexr.com/ to play with different regex patterns. 
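The preprocessing steps can be sketched end to end on a few messages taken from the `head` output above (left truncated as displayed). The regex pattern and the tiny stop word set are assumptions chosen for illustration; the cells below use NLTK's `stopwords` corpus instead.

```python
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer

toy_messages = [
    "Free entry in 2 a wkly comp to win FA Cup",
    "Ok lar... Joking wif u oni...",
    "U dun say so early hor... U c already then say...",
]

# Assumed pattern: runs of letters, optionally followed by an apostrophe suffix
toy_pattern = r"[a-zA-Z]+(?:'[a-z]+)?"
toy_tokenizer = RegexpTokenizer(toy_pattern)

# Hypothetical stop word set, much smaller than a real one
toy_stop_words = {'a', 'in', 'to', 'so', 'then'}

toy_tokens = []
for message in toy_messages:
    for token in toy_tokenizer.tokenize(message):
        token = token.lower()
        if token not in toy_stop_words:
            toy_tokens.append(token)

toy_fd = FreqDist(toy_tokens)
print(toy_fd.most_common(2))
```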



In [None]:
# Use the RegexpTokenizer to isolate words in the text
from nltk.tokenize import RegexpTokenizer

# Create regex pattern
pattern = None

# Instantiate the RegexpTokenizer with the pattern you chose 

# tokenize the X_train texts


In [None]:
caller.call_n_students(1)

In [None]:
# make words lowercase


In [None]:
caller.call_n_students(1)

In [None]:
# remove stop words
from nltk.corpus import stopwords


In [None]:
caller.call_n_students(1)

In [None]:
# create frequency distribution of all words in the training set
from nltk import FreqDist

fd = FreqDist(token_list)
fd.most_common(10)


In [None]:
caller.call_n_students(1)

# Count Vectorizer and TFIDF Vectorizer

The sklearn `CountVectorizer` and `TfidfVectorizer` will do a lot of the preprocessing work for us. By default, they make the tokens lowercase.  Stop words can be removed by passing a list of stop words as the `stop_words` argument.  The default `token_pattern` already drops punctuation and single-character tokens; a different regular expression can be passed as `token_pattern` to change what counts as a token.
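A small sketch of these options, using two made-up documents: the first vectorizer uses the defaults, the second passes an assumed one-word stop list and an assumed custom `token_pattern`. `TfidfVectorizer` accepts the same arguments.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents
toy_docs = ["Free entry to WIN", "free to call"]

# Defaults: lowercase tokens, default token_pattern, no stop word removal
default_vec = CountVectorizer()
default_vec.fit(toy_docs)
default_vocab = sorted(default_vec.vocabulary_)

# Custom: remove 'to' and use a letters-only pattern (both are assumptions)
custom_vec = CountVectorizer(
    stop_words=['to'],
    token_pattern=r"[a-zA-Z]+(?:'[a-z]+)?",
)
custom_vec.fit(toy_docs)
custom_vocab = sorted(custom_vec.vocabulary_)

print(default_vocab)
print(custom_vocab)
```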

In the cell below, instantiate a TFIDF vectorizer with the **default** parameters and fit it on the sample.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer 

X_train_sample = X_train.sample(10, random_state=42)

# fit the tfidf vectorizer to this sample

In [None]:
caller.call_n_students(1)

The vectorizer, by default, returns a **sparse matrix**. A sparse matrix is a matrix composed mostly of zeros.  We don't necessarily have to convert sparse matrices to arrays, but doing so can help visualize what is going on. 
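To make this concrete, the sketch below fits a default `TfidfVectorizer` on two made-up documents, converts the sparse result with `toarray()`, and labels the columns. The column names are read from the fitted `vocabulary_` dictionary here so the snippet does not depend on the sklearn version; the variable names are illustrative.

```python
import pandas as pd
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents with no words in common
toy_docs = ["free entry to win", "ok see you later"]

toy_vec = TfidfVectorizer()
toy_matrix = toy_vec.fit_transform(toy_docs)  # scipy sparse matrix
is_sparse = issparse(toy_matrix)

# toarray() returns a regular numpy ndarray
toy_dense = toy_matrix.toarray()

# vocabulary_ maps each token to its column index; sort tokens by that index
toy_columns = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
toy_frame = pd.DataFrame(toy_dense, columns=toy_columns)
print(toy_frame.round(2))
```

Each row has four nonzero weights out of eight columns, so even in this tiny example half of the entries are zeros.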

In [None]:
# Convert the sparse matrix to a dense one using the todense() method (toarray() returns a numpy array instead)


In [None]:
caller.call_n_students(1)

In [None]:
# To further help us visualize what is going on, convert the array to a dataframe


In [None]:
caller.call_n_students(1)

In [None]:
# Set the columns attribute equal to the return of the get_feature_names_out() method
# (get_feature_names() in sklearn versions before 1.0)


In [None]:
caller.call_n_students(1)

When building our model, we want to use our `tfidf` the same way as we would any sklearn object.  In other words, we fit it on the train set and transform the test set.  

As in the Phase 3 projects, pipelines will help us do so. In the cell below, create a simple pipeline that includes a `TfidfVectorizer` and a `MultinomialNB` model.  
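As a rough sketch of the idea, the snippet below builds such a pipeline and cross-validates it on made-up messages (0 = ham, 1 = spam) rather than on the SMS data. Within each fold, the vectorizer is fit only on the training portion, which is the point of putting it inside the pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, not the SMS dataset
toy_ham = ["see you at lunch", "ok talk later", "are you home yet",
           "call me when free", "running late sorry"] * 2
toy_spam = ["win a free prize now", "free entry text now", "claim your cash prize",
            "urgent call to win", "free cash offer now"] * 2
toy_texts = toy_ham + toy_spam
toy_labels = [0] * len(toy_ham) + [1] * len(toy_spam)

toy_pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())

# return_train_score=True adds per-fold training scores to the result
toy_cv = cross_validate(toy_pipe, toy_texts, toy_labels, cv=5,
                        return_train_score=True)

print(toy_cv['train_score'].mean(), toy_cv['test_score'].mean())
```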

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB

# your code here


In [None]:
caller.call_n_students(1)

In [None]:
# pass the pipeline into sklearn's cross validate function.  
from sklearn.model_selection import cross_validate

# your code here: return the train score so we can look at the bias variance tradeoff


In [None]:
caller.call_n_students(1)

Let's use sklearn's cross_val_predict function to see what type of mistakes our model is making.  
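The sketch below shows the shape of that workflow on made-up messages (0 = ham, 1 = spam), not on the SMS data. `cross_val_predict` returns one out-of-fold prediction per row, and `confusion_matrix` puts actual classes in rows and predicted classes in columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, not the SMS dataset
toy_ham = ["see you at lunch", "ok talk later", "are you home yet",
           "call me when free", "running late sorry"] * 2
toy_spam = ["win a free prize now", "free entry text now", "claim your cash prize",
            "urgent call to win", "free cash offer now"] * 2
toy_texts = toy_ham + toy_spam
toy_labels = [0] * len(toy_ham) + [1] * len(toy_spam)

toy_pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Each prediction comes from a model that did not see that row during fitting
toy_y_hat = cross_val_predict(toy_pipe, toy_texts, toy_labels, cv=5)

toy_cm = confusion_matrix(toy_labels, toy_y_hat)
tn, fp, fn, tp = toy_cm.ravel()
print(toy_cm)
```

With spam encoded as 1, `fp` counts ham flagged as spam and `fn` counts spam that slipped through as ham.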

In [None]:
# Pass the pipeline, as well as X_train['text'] and y_train to cross_val_predict
from sklearn.model_selection import cross_val_predict

y_hat_train = None

In [None]:
# Create a confusion matrix with the results of cross_val_predict
from sklearn.metrics import confusion_matrix


In [None]:
caller.call_n_students(1)

Interpret the results above. What type of mistakes are most important to reduce in a spam detector?

In [None]:
caller.call_n_students(1)

In [None]:
# change the scoring metric to return f1 score: use the argument scoring='f1' 


In [None]:
# print out the mean training score


In [None]:
# print out the mean test score


In [None]:
caller.call_n_students(1)

Let's try to improve our model by passing in stopwords.



In [None]:
# Create a new pipeline with a Tfidf Vectorizer that removes stopwords


In [None]:
caller.call_n_students(1)

# Grid Search

Removing stop words helps improve our model. False positives are still 0, but our false negatives are still high.  Let's try to reduce our false negative rate with grid search.
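A minimal grid search sketch on made-up messages (0 = ham, 1 = spam) follows. The parameter values are assumptions for illustration; the dictionary keys follow the `<step name>__<parameter>` convention, where `make_pipeline` names each step after its lowercased class name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, not the SMS dataset
toy_ham = ["see you at lunch", "ok talk later", "are you home yet",
           "call me when free", "running late sorry"] * 2
toy_spam = ["win a free prize now", "free entry text now", "claim your cash prize",
            "urgent call to win", "free cash offer now"] * 2
toy_texts = toy_ham + toy_spam
toy_labels = [0] * len(toy_ham) + [1] * len(toy_spam)

# Assumed search space: stop word handling and the NB smoothing parameter
toy_params = {
    'tfidfvectorizer__stop_words': [None, 'english'],
    'multinomialnb__alpha': [0.1, 1.0],
}

toy_grid = GridSearchCV(
    make_pipeline(TfidfVectorizer(), MultinomialNB()),
    param_grid=toy_params,
    scoring='f1',
    cv=5,
    return_train_score=True,
)
toy_grid.fit(toy_texts, toy_labels)

# Mean train score of the best parameter combination
toy_best_train = toy_grid.cv_results_['mean_train_score'][toy_grid.best_index_]
print(toy_grid.best_params_, toy_grid.best_score_, toy_best_train)
```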

In [None]:
# Use grid search to narrow in on the best hyperparameters
# Print out the sw_pipe to see how to define the dictionary keys. 
# What are some hyperparameters we can choose for the tfidf vectorizer?

parameter_dict = {}

In [None]:
# Define new pipeline with default parameters for tfidf and multinomialNB

In [None]:
# define a Grid Search object. 
# Make sure to return the train score
# and set scoring to `f1`

In [None]:
# print out the mean train score of the best estimator
# Hint look at 

In [None]:
# print out the best_score_

In [None]:
# print out the best_params_

In [None]:
# using the best_estimator_, 
# use cross_val_predict to generate a confusion matrix