## Introduction to Machine Learning  

## Assignment 6:  Preprocessing Categorical Variables

You can't learn technical subjects without hands-on practice. The assignments are an important part of the course. To submit this assignment you will need to make sure that you save your Jupyter notebook. 

The two videos linked below explain:

1. [How to save your Jupyter notebook](https://youtu.be/0aoLgBoAUSA) and,       
2. [How to answer a question in a Jupyter notebook assignment](https://youtu.be/7j0WKhI3W4s).

### Assignment Learning Goals:

By the end of the module, students are expected to:

- Explain the `handle_unknown="ignore"` hyperparameter of `scikit-learn`'s `OneHotEncoder`.
- Identify when it's appropriate to apply ordinal encoding vs one-hot encoding.
- Explain strategies to deal with categorical variables with too many categories.
- Explain why text data needs a different treatment than categorical variables.
- Use `scikit-learn`'s `CountVectorizer` to encode text data.
- Explain different hyperparameters of `CountVectorizer`.
- Use `ColumnTransformer` to build all our transformations together into one object and use it with `scikit-learn` pipelines.

This assignment covers [Module 6](https://ml-learn.mds.ubc.ca/en/module6) of the online course. You should complete this module before attempting this assignment.

Any place you see `...`, you must fill in the function, variable, or data to complete the code. Substitute the `None` with your completed code and answers then proceed to run the cell!

Note that some of the questions in this assignment will have hidden tests. This means that no feedback will be given as to the correctness of your solution. It will be left up to you to decide if your answer is sufficiently correct. These questions are worth 2 points.

In [6]:
# Import libraries needed for this lab
from hashlib import sha1

import altair as alt
import graphviz
import numpy as np
import pandas as pd

from sklearn import tree
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import make_column_transformer 
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import (
    FunctionTransformer,
    Normalizer,
    OneHotEncoder,
    StandardScaler,
    normalize,
    scale)
from sklearn.svm import SVC

import test_assignment6 as t
#alt.renderers.enable('mimetype')
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

## 1. Introducing and Exploring the dataset <a name="1"></a>
<hr>


In this lab you will be working with [the Olympics Games DataSet](https://www.kaggle.com/samruddhim/olympics-althlete-events-analysis).

Our problem is to predict the medal type of each example. 
You can find more information on the dataset and features [here](https://www.kaggle.com/samruddhim/olympics-althlete-events-analysis).


*Note that many popular datasets have sex as a feature where the possible values are male and female. This representation reflects how the data were collected and is not meant to imply that, for example, gender is binary.*


The following starter code preprocesses the data to get rid of rows with `NaN` values in the target column `Medal`.

In [7]:
medal_df = pd.read_csv("data/athlete_events.csv")
medal_df = medal_df.dropna(subset=['Medal'])

**Question 1.1** <br> {points: 1}  

In order to avoid violating the golden rule, before we do anything with the data, let's split it.

Split the data into `train_df` (80%) and `test_df` (20%). 

Keep the target column (`Medal`) in the splits so that we can use it in EDA. 

Make sure to set `random_state=123` for grading purposes. 


In [8]:
train_df, test_df = train_test_split(medal_df, train_size = 0.8, random_state = 123)
train_df

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
221047,111063,Harbinder Singh Chimni,M,21.0,174.0,60.0,India,IND,1964 Summer,1964,Summer,Tokyo,Hockey,Hockey Men's Hockey,Gold
222232,111670,Lars-Erik Skild,M,28.0,170.0,70.0,Sweden,SWE,1980 Summer,1980,Summer,Moskva,Wrestling,"Wrestling Men's Lightweight, Greco-Roman",Bronze
122592,61968,Christa Khler (-Kinast),F,24.0,162.0,56.0,East Germany,GDR,1976 Summer,1976,Summer,Montreal,Diving,Diving Women's Springboard,Silver
14077,7597,Bao Yingying,F,24.0,172.0,67.0,China,CHN,2008 Summer,2008,Summer,Beijing,Fencing,"Fencing Women's Sabre, Team",Silver
222715,111877,Miroslav Slma,M,30.0,,,Czechoslovakia,TCH,1948 Winter,1948,Winter,Sankt Moritz,Ice Hockey,Ice Hockey Men's Ice Hockey,Silver
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56592,28989,Milen Atanasov Dobrev,M,24.0,177.0,94.0,Bulgaria,BUL,2004 Summer,2004,Summer,Athina,Weightlifting,Weightlifting Men's Middle-Heavyweight,Gold
107633,54409,Larsen Alan Jensen,M,22.0,185.0,88.0,United States,USA,2008 Summer,2008,Summer,Beijing,Swimming,Swimming Men's 400 metres Freestyle,Bronze
122929,62140,Peter-Michael Kolbe,M,35.0,194.0,84.0,West Germany,FRG,1988 Summer,1988,Summer,Seoul,Rowing,Rowing Men's Single Sculls,Silver
193716,97228,Oleg Alekseyevich Protopopov,M,35.0,175.0,71.0,Soviet Union-1,URS,1968 Winter,1968,Winter,Grenoble,Figure Skating,Figure Skating Mixed Pairs,Gold


In [9]:
t.test_1_1(train_df,test_df)

'Success'

**Question 1.2** <br> {points: 1}  

How many examples are there in our training data? 

Save your answer in an object named `training_size`.

In [10]:
training_size = len(train_df)
training_size

31826

In [11]:
t.test_1_2(training_size)

'Success'

**Question 1.3** <br> {points: 3}  

Let's examine our `train_df` a bit. 

What are the youngest and oldest ages of athletes who won a medal in the Olympics?

Save the results in objects `youngest_age` and `oldest_age`. 


In [12]:
describe_df = train_df.describe(include = 'all')
youngest_age = describe_df.loc['min','Age']
oldest_age = describe_df.loc['max','Age']

In [13]:
# check that the variable exists
assert 'oldest_age' in globals(
), "Please make sure that your solution is named 'oldest_age'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

In [14]:
t.test_1_3_2(youngest_age)

'Success'

**Question 1.4** <br> {points: 1}  

Look at the column dtypes using `.info()`.

How many non-numeric **features** are there? 

Save the results in an object named `num_cat_feats`.

In [15]:
train_df.info()
num_cat_feats = 9

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31826 entries, 221047 to 109718
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      31826 non-null  int64  
 1   Name    31826 non-null  object 
 2   Sex     31826 non-null  object 
 3   Age     31228 non-null  float64
 4   Height  24918 non-null  float64
 5   Weight  24393 non-null  float64
 6   Team    31826 non-null  object 
 7   NOC     31826 non-null  object 
 8   Games   31826 non-null  object 
 9   Year    31826 non-null  int64  
 10  Season  31826 non-null  object 
 11  City    31826 non-null  object 
 12  Sport   31826 non-null  object 
 13  Event   31826 non-null  object 
 14  Medal   31826 non-null  object 
dtypes: float64(3), int64(2), object(10)
memory usage: 4.9+ MB


In [16]:
t.test_1_4(num_cat_feats)

'Success'

**Question 1.5** <br> {points: 3}  

Let's take a look at some of the columns and the categories within them. 

Use `.describe` to answer the following questions. Save the describe dataframe in an object named `describe_df`.  

a) Which categorical feature has the most unique values? Save this in an object named `most_unique`. 

b) How many binary columns are there? Save this in an object named `binary_cols`. 

c) How many categorical features have missing values? Save this number in an object named `missing_cat`.



In [17]:
describe_df

most_unique = 'Name'
binary_cols = 2
missing_cat = 0

In [18]:
t.test_1_5_1(most_unique)

'Success'

In [19]:
t.test_1_5_2(binary_cols)

'Success'

In [20]:
t.test_1_5_3(missing_cat)

'Success'
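
As a rough illustration of how the rows of `.describe(include="all")` answer the three parts of Question 1.5, here is a minimal sketch on a made-up frame. The frame `toy` and its values are hypothetical and are not taken from the Olympics data.

```python
import pandas as pd

# Hypothetical toy frame, not the Olympics data
toy = pd.DataFrame({
    "Sex": ["M", "F", "F", "M"],
    "Sport": ["Hockey", "Diving", None, "Rowing"],
    "Age": [21.0, 24.0, 30.0, None],
})

summary = toy.describe(include="all")

# The `unique` row is only filled in for non-numeric columns
n_unique = summary.loc["unique"].dropna()
binary = n_unique[n_unique == 2].index.tolist()

# The `count` row ignores missing values, so count < number of rows flags gaps
counts = summary.loc["count"].astype(int)
missing = counts[counts < len(toy)].index.tolist()

print(binary)   # columns with exactly two distinct values
print(missing)  # columns with at least one missing value
```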

**Question 1.6** <br> {points: 2}  

Filter or groupby the `train_df` dataframe to answer the next question. 

Which `NOC` won the most medals? Save this in an object named `most_medals`. 

Which `NOC` won the most `Gold` medals? Save this in an object named `most_gold`. 


In [21]:
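# Count medals per NOC without overwriting the original `medal_df`
medal_counts = train_df['NOC'].value_counts()
gold_counts = train_df.loc[train_df['Medal'] == 'Gold', 'NOC'].value_counts()
most_medals = medal_counts.idxmax()
most_gold = gold_counts.idxmax()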

In [22]:
t.test_1_6_1(most_medals)

'Success'

In [23]:
t.test_1_6_2(most_gold)

'Success'

We are going to separate feature vectors from the targets.

We are only going to use the following columns:

- `Sex`
- `Age`
- `Height`
- `Weight`
- `NOC`
- `Year`
- `Season`
- `City`
- `Sport`


and use `Medal` as the target column. 

We've created  `X_train`, `y_train`, `X_test`, `y_test` for you. 

In [24]:
X_train = train_df.drop(columns=['ID', 'Name', 'Team', 'Event','Medal', 'Games'])
y_train = train_df['Medal']

X_test = test_df.drop(columns=['ID', 'Name', 'Team', 'Event','Medal', 'Games'])
y_test = test_df['Medal']

X_train.head()

Unnamed: 0,Sex,Age,Height,Weight,NOC,Year,Season,City,Sport
221047,M,21.0,174.0,60.0,IND,1964,Summer,Tokyo,Hockey
222232,M,28.0,170.0,70.0,SWE,1980,Summer,Moskva,Wrestling
122592,F,24.0,162.0,56.0,GDR,1976,Summer,Montreal,Diving
14077,F,24.0,172.0,67.0,CHN,2008,Summer,Beijing,Fencing
222715,M,30.0,,,TCH,1948,Winter,Sankt Moritz,Ice Hockey


## 2. Preprocessing and building your pipelines

**Question 2.1** <br> {points: 4}  

Before you can start preprocessing our data, you need to identify the binary, categorical, ordinal and numeric columns in your `X_train` and build lists of each feature type. 


Save the column names in lists named  `numeric_feats`, `binary_feats`, `categorical_feats` and `ordinal_feat`.


In [25]:
X_train.head()

Unnamed: 0,Sex,Age,Height,Weight,NOC,Year,Season,City,Sport
221047,M,21.0,174.0,60.0,IND,1964,Summer,Tokyo,Hockey
222232,M,28.0,170.0,70.0,SWE,1980,Summer,Moskva,Wrestling
122592,F,24.0,162.0,56.0,GDR,1976,Summer,Montreal,Diving
14077,F,24.0,172.0,67.0,CHN,2008,Summer,Beijing,Fencing
222715,M,30.0,,,TCH,1948,Winter,Sankt Moritz,Ice Hockey


In [26]:
train_cols = list(X_train)
numeric_feats = [train_cols[i] for i in [1, 2, 3, 5]]   # Age, Height, Weight, Year
binary_feats = [train_cols[i] for i in [0, 6]]          # Sex, Season
categorical_feats = [train_cols[i] for i in [4, 7, 8]]  # NOC, City, Sport
ordinal_feat = []                                       # none of these columns is ordinal

In [27]:
t.test_2_1_1(numeric_feats)

'Success'

In [28]:
t.test_2_1_2(binary_feats)

'Success'

In [29]:
t.test_2_1_3(categorical_feats)

'Success'

In [30]:
t.test_2_1_4(ordinal_feat)

'Success'
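
None of the columns in `X_train` is ordinal, so the contrast between ordinal and one-hot encoding does not show up in this dataset. The sketch below uses a hypothetical `size` column (an assumption for illustration only) to show the difference: ordinal encoding keeps a single column and preserves the order, while one-hot encoding creates one column per category and assumes no order.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical column whose categories have a natural order
toy = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Ordinal encoding: pass the order explicitly, otherwise it is alphabetical
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]], dtype=int)
ordinal_codes = ordinal.fit_transform(toy)

# One-hot encoding: one 0/1 column per category (alphabetical column order)
onehot = OneHotEncoder(dtype=int)
onehot_codes = onehot.fit_transform(toy).toarray()

print(ordinal_codes.ravel())  # small=0, medium=1, large=2
print(onehot.categories_[0])  # column order of the one-hot output
print(onehot_codes)
```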

**Question 2.2** <br> {points: 1}  

Ok let's start making our pipelines. Use `make_pipeline()` to make a pipeline for the numeric features called `numeric_transformer`. 

Use `SimpleImputer()` with `strategy="median"`. For the second step make sure to use standardization with `StandardScaler()`.

In [31]:
numeric_transformer = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler()
)

In [32]:
t.test_2_2(numeric_transformer)

'Success'

**Question 2.3** <br> {points: 1}  

Next, use `make_pipeline()` to make a pipeline for the categorical features called `categorical_transformer`. 

Use `SimpleImputer()` with `strategy="most_frequent"`. 

Make sure to use the necessary one-hot encoding transformer with `dtype=int` and `handle_unknown="ignore"`.

In [33]:
categorical_transformer = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(dtype = int, handle_unknown = 'ignore')
)

In [34]:
t.test_2_3(categorical_transformer)

'Success'
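
To see what `handle_unknown="ignore"` does in Question 2.3, here is a minimal sketch on made-up `NOC` values (the two small frames are hypothetical). A category that was not present during `fit` is encoded as a row of zeros, whereas the default setting raises an error at transform time.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"NOC": ["USA", "CAN", "USA"]})
new = pd.DataFrame({"NOC": ["CAN", "FRA"]})  # "FRA" was never seen during fit

ohe = OneHotEncoder(dtype=int, handle_unknown="ignore")
ohe.fit(train)
encoded = ohe.transform(new).toarray()

# Columns follow ohe.categories_[0], here ["CAN", "USA"];
# the unseen "FRA" row becomes all zeros
print(encoded)

# With the default handle_unknown="error", the same transform fails
strict = OneHotEncoder(dtype=int).fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError:
    raised = True
print(raised)
```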

**Question 2.4** <br> {points: 1}  
  
Use `make_pipeline()` to make a pipeline for the binary features called `binary_transformer`. 

Use `SimpleImputer()` with `strategy="most_frequent"`. 

Make sure to use the necessary one-hot encoding transformer with `dtype=int`.

In [35]:
binary_transformer = make_pipeline(SimpleImputer(strategy = 'most_frequent'),
                                  OneHotEncoder(dtype = int, drop = 'if_binary'))

In [36]:
t.test_2_4(binary_transformer)

'Success'
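
The solution above passes `drop="if_binary"` to the encoder. As a small sketch on a hypothetical `Season` column, this setting keeps a single 0/1 column for a two-category feature instead of two redundant columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({"Season": ["Summer", "Winter", "Summer"]})

# Default: one column per category, so two columns that mirror each other
full = OneHotEncoder(dtype=int).fit_transform(toy).toarray()

# drop="if_binary": the first category is dropped for two-category features
compact = OneHotEncoder(dtype=int, drop="if_binary").fit_transform(toy).toarray()

print(full.shape)       # two columns
print(compact.ravel())  # one column: 1 marks "Winter"
```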

**Question 2.5** <br> {points: 1}  


Define a column transformer called `preprocessor` using [`make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html) for the numeric, categorical, and binary feature types.


In [37]:
preprocessor = make_column_transformer(
    (numeric_transformer, numeric_feats), 
    (categorical_transformer, categorical_feats),
    (binary_transformer, binary_feats))

In [38]:
t.test_2_5(preprocessor)

'Success'
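
To check how many output columns a column transformer produces, it can help to run the same structure on a tiny frame. The sketch below mirrors the three transformers above on hypothetical data; the name `toy_preprocessor` and the values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "Age": [21.0, 28.0, np.nan, 24.0],
    "Sex": ["M", "F", "F", "M"],
    "Sport": ["Hockey", "Diving", "Hockey", "Rowing"],
})

toy_preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), ["Age"]),
    (OneHotEncoder(dtype=int, handle_unknown="ignore"), ["Sport"]),
    (OneHotEncoder(dtype=int, drop="if_binary"), ["Sex"]),
)

transformed = toy_preprocessor.fit_transform(toy)

# 1 scaled numeric column + 3 Sport columns + 1 Sex column
print(transformed.shape)
```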

## 3. Model Building

**Question 3.1** <br> {points: 1}  

It's important to build a dummy classifier to compare our model to. Make a `DummyClassifier` using `strategy="prior"`. 

Carry out 5-fold cross-validation on `X_train` and `y_train` using `cross_validate()`. Don't forget to include the training score. 

Save the results in a dataframe named `dummy_scores`. 

In [39]:
dummy_clf = DummyClassifier(strategy = 'prior')
dummy_clf.fit(X_train, y_train)
dummy_scores = pd.DataFrame(cross_validate(dummy_clf, X_train, y_train, cv = 5, return_train_score=True))
dummy_scores

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.026392,0.006615,0.334433,0.334328
1,0.026507,0.006675,0.334328,0.334355
2,0.026132,0.006682,0.334328,0.334355
3,0.026131,0.006717,0.334328,0.334355
4,0.026275,0.006613,0.334328,0.334355


In [40]:
t.test_3_1(dummy_scores)

'Success'
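
The baseline scores of roughly 0.33 follow from how `strategy="prior"` predicts: it always returns the most frequent class, so its accuracy equals that class's share of the labels. A minimal sketch with made-up labels (the arrays are hypothetical):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# The features are ignored by DummyClassifier, so zeros are enough here
X = np.zeros((6, 1))
y = np.array(["Gold", "Gold", "Gold", "Silver", "Silver", "Bronze"])

dummy = DummyClassifier(strategy="prior").fit(X, y)

prediction = dummy.predict(X[:1])[0]   # always the most frequent class
baseline_accuracy = dummy.score(X, y)  # 3 of 6 labels are "Gold"

print(prediction, baseline_accuracy)
```

With three medal types that are almost equally common, the same reasoning gives an accuracy close to one third.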

**Question 3.2** <br> {points: 1}  


Define a main pipeline called `main_pipe` that transforms all the different features and uses a `RandomForestClassifier` model using `random_state=77` and setting the hyperparameter `n_estimators` to 10. 

In [41]:
main_pipe = make_pipeline(preprocessor, RandomForestClassifier(random_state = 77, n_estimators = 10))

In [42]:
t.test_3_2(main_pipe)

'Success'

**Question 3.3** <br> {points: 1}  

Perform 5 fold cross-validation on `X_train` and `y_train` using the main pipeline `main_pipe`. Make sure to set `return_train_score=True` and save the result in a dataframe called `scores_df`. 

*Note: This could take 5 minutes. Remember how large our training data is.*

In [43]:
scores_df = pd.DataFrame(cross_validate(main_pipe, X_train, y_train, cv = 5, return_train_score = True))

In [44]:
t.test_3_3(scores_df)

'Success'

**Question 3.4** <br> {points: 2}

What are the mean training and cross-validation scores? 

Save the mean training score in `mean_training_score` and the mean cross-validation score in the object named `cv_score`.

In [45]:
mean_training_score = scores_df['train_score'].mean()
cv_score = scores_df['test_score'].mean() 
print(mean_training_score, cv_score)

0.9341261908105037 0.6159117898280807


In [46]:
# check that the variable exists
assert 'cv_score' in globals(
), "Please make sure that your solution is named 'cv_score'"

assert 'mean_training_score' in globals(
), "Please make sure that your solution is named 'mean_training_score'"
# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 3.5** <br> {points: 1}

Is the model overfitting or underfitting? 

A) Overfitting

B) Underfitting

C) Neither

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer3_05`.*

In [47]:
answer3_05 = 'A'
answer3_05

'A'

In [48]:
t.test_3_5(answer3_05)

'Success'

**Question 3.6** <br> {points: 1}

Which model performed better?

A) `RandomForestClassifier`

B) `DummyClassifier`

C) Neither

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer3_06`.*

In [49]:
answer3_06 = 'A'
answer3_06

'A'

In [50]:
t.test_3_6(answer3_06)

'Success'

**Question 3.7** <br> {points: 1}  
Now that we have our pipelines and a model let's tune the hyperparameter `max_depth`. 

Sweep over the hyperparameters in `param_grid` using `RandomizedSearchCV` with `cv=5`, `n_iter=5`, and `return_train_score=True`. Don't forget to set `random_state=77`.

Save your randomized search in an object named `depth_search`. 

You may also want to set `verbose=2` since it may take some time. 

Don't forget to fit `depth_search`.


In [51]:
param_grid = {
    "randomforestclassifier__max_depth": range(1,151,10)
}

depth_search = RandomizedSearchCV(main_pipe, param_grid, cv = 5, n_iter = 5, return_train_score = True, random_state = 77, verbose = 2)
depth_search.fit(X_train, y_train)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV] END ..............randomforestclassifier__max_depth=141; total time=   3.7s
[CV] END ..............randomforestclassifier__max_depth=141; total time=   3.6s
[CV] END ..............randomforestclassifier__max_depth=141; total time=   3.7s
[CV] END ..............randomforestclassifier__max_depth=141; total time=   3.6s
[CV] END ..............randomforestclassifier__max_depth=141; total time=   3.6s
[CV] END ...............randomforestclassifier__max_depth=61; total time=   3.3s
[CV] END ...............randomforestclassifier__max_depth=61; total time=   2.9s
[CV] END ...............randomforestclassifier__max_depth=61; total time=   3.0s
[CV] END ...............randomforestclassifier__max_depth=61; total time=   2.8s
[CV] END ...............randomforestclassifier__max_depth=61; total time=   2.9s
[CV] END ...............randomforestclassifier__max_depth=21; total time=   0.7s
[CV] END ...............randomforestclassifier__m

RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('columntransformer',
                                              ColumnTransformer(transformers=[('pipeline-1',
                                                                               Pipeline(steps=[('simpleimputer',
                                                                                                SimpleImputer(strategy='median')),
                                                                                               ('standardscaler',
                                                                                                StandardScaler())]),
                                                                               ['Age',
                                                                                'Height',
                                                                                'Weight',
                                                                         

In [52]:
t.test_3_7(depth_search)

'Success'

**Question 3.8** <br> {points: 1}  

Obtain the cross-validation results from the randomized search using `depth_search.cv_results_`.

Select the columns:

- `mean_test_score`
- `param_randomforestclassifier__max_depth`
- `mean_fit_time`
- `rank_test_score`

Sort your values in ascending order of `rank_test_score`. 

Make sure to save it as a dataframe and display it. Save this as an object named `grid_results`.

In [53]:
cv_df = pd.DataFrame(depth_search.cv_results_)
grid_results = cv_df.loc[:,['mean_test_score','param_randomforestclassifier__max_depth','mean_fit_time','rank_test_score']].sort_values('rank_test_score')

In [54]:
t.test_3_8(grid_results)

'Success'
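
The pattern used in Questions 3.7 and 3.8 can be tried quickly on synthetic data. The sketch below is only an illustration under stated assumptions: the data come from `make_classification`, the depth values are arbitrary, and the names `pipe`, `search`, and `results` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=10, random_state=77),
)

# make_pipeline names each step after its lowercased class, hence the prefix
param_dist = {"randomforestclassifier__max_depth": [1, 3, 5, 7, 9]}
search = RandomizedSearchCV(
    pipe, param_dist, n_iter=3, cv=5, return_train_score=True, random_state=77
)
search.fit(X, y)

columns = [
    "mean_test_score",
    "param_randomforestclassifier__max_depth",
    "mean_fit_time",
    "rank_test_score",
]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")

# After sorting, the first row is the best candidate
best_row = results.iloc[0]
print(results)
```

Only `n_iter` of the five candidate depths are tried, which is why the table has three rows rather than five.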

**Question 3.9** <br> {points: 1} 

What is the best hyperparameter value for `max_depth`? Save it in an object named `best_depth`. 

What was the corresponding validation score for it? Save this in an object named `best_depth_score`. 

*Hint: `.best_params_`  and `.best_score_` are helpful here.* 

In [138]:
best_depth = depth_search.best_params_['randomforestclassifier__max_depth'] 

best_depth_score = depth_search.best_score_

In [139]:
t.test_3_9(depth_search, best_depth, best_depth_score)

'Success'

## 4. Evaluating on the test set <a name="5"></a>
<hr>

Now that we have a best performing model, it's time to assess our model on the set aside test set. 

**Question 4.1** <br> {points: 2} 

What is the training score of the best scoring model? Save the result in an object named `train_score`. 

In [62]:
# Use the row of the best candidate rather than assuming it is row 0
train_score = cv_df.loc[depth_search.best_index_, 'mean_train_score']
train_score

0.9341261908105037

In [63]:
assert 'train_score' in globals(
), "Please make sure that your solution is named 'train_score'"
# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 4.2** <br> {points: 1} 


What is the test score of the best model? 

Score the best model from `depth_search` on `X_test` and `y_test`. 

Save the result in an object named `test_score`. 


In [66]:
# Score the refitted best model from the search; never fit on the test set
test_score = depth_search.score(X_test, y_test)

In [67]:
t.test_4_2(test_score)

'Success'
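
A fitted search object refits the best candidate on the whole training set by default, so it can be scored on held-out data directly without fitting anything again. The sketch below shows this on synthetic data; `GridSearchCV`, the decision tree, and all variable names are stand-ins chosen for illustration, not the assignment's model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=123)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 3, 5]},
    cv=5,
    return_train_score=True,
)
search.fit(X_tr, y_tr)  # only the training split is used for fitting

# Mean training score of the best candidate, looked up by its index
best_train = search.cv_results_["mean_train_score"][search.best_index_]

# search.score delegates to the refitted best_estimator_
held_out = search.score(X_te, y_te)
same = held_out == search.best_estimator_.score(X_te, y_te)

print(best_train, held_out, same)
```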

## 5. Text Data

Let's develop our own SMS spam filtering system using Kaggle's [SMS Spam Collection Dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset) that was originally referenced from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). 

We will use `CountVectorizer` to encode text messages and `SVC` for classification. 

**Sorry for the offensive language in some text messages; it's the reality of such platforms 😔. If you are sensitive to such language try not to read the raw messages.** 

In [97]:
sms_df = pd.read_csv("data/spam.csv", encoding="latin-1")
sms_df = sms_df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
sms_df = sms_df.rename(columns={"v1": "target", "v2": "sms"})
sms_df

Unnamed: 0,target,sms
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [69]:
sms_df.shape

(5572, 2)

**Question 5.1** <br> {points: 1}  

Split `sms_df` into train (80%) and test splits (20%) setting `random_state=123`. 
Name your objects `text_train_df` and `text_test_df`. 
Examine the first few rows of the train portion. 

In [72]:
text_train_df, text_test_df = train_test_split(sms_df, test_size = 0.2, random_state = 123)

In [73]:
t.test_5_1(text_train_df, text_test_df)

'Success'

**Question 5.2** <br> {points: 1}  

Split both `text_train_df` and `text_test_df` into the target and feature columns. Here,  `target` is the target column (`y`) and `sms` is the column in your `X`. 
    
Name your objects `X_text_train`, `y_text_train`, `X_text_test`, and `y_text_test`.

*Hint: Make sure that you are using single brackets (a Pandas Series) for your target (y) objects. The tests will not pass unless your y variables are of type Pandas Series. This can be done by selecting the column target with single square brackets.*

In [166]:
# Selecting a single column already returns a Series, so no drop is needed
X_text_train = text_train_df['sms']
y_text_train = text_train_df['target']
X_text_test = text_test_df['sms']
y_text_test = text_test_df['target']

In [167]:
t.test_5_2(X_text_train, X_text_test, y_text_train, y_text_test)

'Success'

**Question 5.3** <br> {points: 2}  

Note that in the case of text data, the usual EDA is not applicable. In this exercise you will carry out some simple EDA to get a sense of the data.  

What's the label distribution in the target column (How many `ham` and how many `spam` values do you have in the column `target`) in the training set? 

Save the result in an object named `target_freq`.

The autograder is expecting an answer as a pandas series. 

*Hint: There is a function that we use quite often that will give us the frequency of each category in a column.*

In [168]:
target_freq = text_train_df.groupby('target').size() 
target_freq

target
ham     3843
spam     614
dtype: int64

In [169]:
assert 'target_freq' in globals(
), "Please make sure that your solution is named 'target_freq'"
# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 5.4** <br> {points: 1} 

What's the average length in characters of text messages? Save the value to the nearest whole value in an object named `avg_text`. 

*Hint: `str.len()` may come in handy here.* 

In [170]:
# Round to the nearest whole number, as the question asks
avg_text = round(text_train_df['sms'].str.len().mean())

In [171]:
t.test_5_4(avg_text)

'Success'
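
As a small worked example of the `str.len()` hint in Question 5.4, the sketch below computes per-message lengths and their rounded mean on three made-up messages (the series `msgs` is hypothetical).

```python
import pandas as pd

msgs = pd.Series(["Ok lar...", "Free entry", "U dun say so early"])

lengths = msgs.str.len()     # characters per message, spaces included
avg = round(lengths.mean())  # 37 / 3 is about 12.33

print(lengths.tolist(), avg)
```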

**Question 5.5** <br> {points: 1} 

Would you classify `sms` column as a categorical column? Does it make sense to carry out one-hot encoding on this column?

A) It is a categorical column and I would carry out one-hot encoding on this column.

B) It is a categorical column and I would **NOT** carry out one-hot encoding on this column.

C) It is a free text column and I would carry out one-hot encoding on this column.

D) It is a free text column and I would **NOT** carry out one-hot encoding on this column.

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer5_05`.*

In [172]:
answer5_05 = 'D'
answer5_05

'D'

In [173]:
t.test_5_5(answer5_05)

'Success'
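
One way to see why one-hot encoding is a poor fit for free text: it treats each whole message as a single category, so a new message almost never matches anything seen during training. A word-count representation can still match on shared words. The sketch below uses three hypothetical messages to show the difference.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

msgs = pd.DataFrame({"sms": ["free prize now", "ok see you", "free call now"]})
new = pd.DataFrame({"sms": ["free prize"]})  # a message not seen before

# One-hot: one column per distinct message, so the new message matches nothing
ohe = OneHotEncoder(handle_unknown="ignore").fit(msgs)
ohe_hits = int(ohe.transform(new).toarray().sum())

# Bag of words: one column per word, so "free" and "prize" are both recognised
bow = CountVectorizer().fit(msgs["sms"])
bow_hits = int(bow.transform(new["sms"]).toarray().sum())

print(ohe_hits, bow_hits)
```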

**Question 5.6** <br> {points: 0}  
Import `CountVectorizer` from the appropriate library. 

In [174]:
from sklearn.feature_extraction.text import CountVectorizer

In [175]:
t.test_5_6()

'Success'

**Question 5.7** <br> {points: 1} 

Transform the training data using `CountVectorizer` with default parameters. Create an object named `vec`, fit it on `X_text_train` and `y_text_train` and transform `X_text_train`. 

Save the newly transformed `X_text_train` in an object named `transformed_X_train`. 

In [180]:
vec = CountVectorizer()
transformed_X_train = vec.fit(X_text_train, y_text_train).transform(X_text_train)

In [181]:
t.test_5_7(transformed_X_train)

'Success'

**Question 5.8** <br> {points: 1} 

How many features have been created to represent each text message? 

Save the value in an object named `vocab_size`.

In [182]:
vocab_size = len(vec.vocabulary_)

In [183]:
t.test_5_8(vocab_size)

'Success'
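
To make the link between vocabulary and feature columns concrete, here is a minimal sketch on a hypothetical three-message corpus. The name `toy_vec` is used to avoid clashing with the `vec` object from the assignment.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["free entry free prize", "ok see you", "free call"]

toy_vec = CountVectorizer()
counts = toy_vec.fit_transform(corpus)

# Columns are ordered alphabetically by word
vocab = sorted(toy_vec.vocabulary_)
row0 = counts.toarray()[0]

print(counts.shape)  # (number of messages, number of distinct words)
print(vocab)
print(row0)          # "free" appears twice in the first message
```

Note that the default tokenizer only keeps tokens of two or more characters, so single-letter words would not appear in the vocabulary.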

**Question 5.9** <br> {points: 2} 

What do each feature and each feature value represent? 

A) A word in the corpus with the value representing the number of times the word occurs in the given text message.

B) A text message in the corpus with the value representing the distance from the closest text in the corpus.

C) An example in the corpus with the value representing the length of the text message.

D) A sentence in the corpus with the value representing the number of times the sentence occurs in the given text message.

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer5_09`.*

In [184]:
answer5_09 ='A'
answer5_09

'A'

In [185]:
assert 'answer5_09' in globals(
), "Please make sure that your solution is named 'answer5_09'"
# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 5.10** <br> {points: 1} 

Build a pipeline named `dummy_pipe` for feature extraction using `CountVectorizer` with `binary=True` and `DummyClassifier` with strategy equal to `most_frequent`.

Use `cross_validate()` with `dummy_pipe`, setting `cv=5` and `return_train_score=True`, on `X_text_train` and `y_text_train` to obtain the train and test scores. 

Save this in a dataframe named `dummy_scores`. 

In [189]:
dummy_pipe = make_pipeline(CountVectorizer(binary = True), DummyClassifier(strategy = 'most_frequent'))
dummy_pipe.fit(X_text_train, y_text_train)
dummy_scores = pd.DataFrame(cross_validate(dummy_pipe, X_text_train, y_text_train, cv = 5, return_train_score = True))

In [190]:
t.test_5_10(dummy_pipe, dummy_scores)

'Success'

**Question 5.11** <br> {points: 1} 

What are the mean values of the columns in `dummy_scores`? Save this in an object named `dummy_scores_mean`

In [191]:
dummy_scores_mean = dummy_scores.mean()

In [192]:
t.test_5_11(dummy_scores_mean)

'Success'

**Question 5.12** <br> {points: 1} 

Very often representing your free text feature values in a binary format works better in practice than the default one and so we are going with that. 

Now build a pipeline named `svc_pipe_binary` for feature extraction using `CountVectorizer` with `binary=True` and `SVC` with default hyperparameters. Make sure you are using `make_pipeline()` for this. 

Cross validate on `svc_pipe_binary` using `X_text_train` and `y_text_train` and setting `cv=5`  and `return_train_score=True`.  

Save the results in a dataframe named `svc_scores`. 


In [197]:
svc_pipe_binary = make_pipeline(CountVectorizer(binary = True), SVC())
svc_scores = pd.DataFrame(cross_validate(svc_pipe_binary, X_text_train, y_text_train, cv = 5, return_train_score = True))

In [198]:
t.test_5_12(svc_scores)

'Success'
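
The effect of `binary=True` is easiest to see side by side with the default. In this sketch on a hypothetical two-message corpus, the default records how many times each word occurs, while the binary version only records whether it occurs.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["free free free prize", "ok ok"]

# Columns are ordered alphabetically: free, ok, prize
default_counts = CountVectorizer().fit_transform(corpus).toarray()
binary_counts = CountVectorizer(binary=True).fit_transform(corpus).toarray()

print(default_counts)  # word frequencies
print(binary_counts)   # presence or absence only
```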

**Question 5.13** <br> {points: 1} 

What are the mean values of the columns in `svc_scores`? Save this in an object named `svc_scores_mean`.

In [199]:
svc_scores_mean = svc_scores.mean()

In [200]:
t.test_5_13(svc_scores_mean)

'Success'

**Question 5.14** <br> {points: 1} 

Are you getting better results with `SVC` compared to `DummyClassifier`?

A) I am getting better results with `SVC`.

B) I am getting better results with `DummyClassifier`.


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer5_14`.*

In [201]:
answer5_14 = 'A'
answer5_14

'A'

In [202]:
t.test_5_14(answer5_14)

'Success'

## Attributions
- The Olympics Games DataSet - [Kaggle](https://www.kaggle.com/samruddhim/olympics-althlete-events-analysis)

- The SMS Spam Collection Dataset - [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset) and [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection)

    *Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011*


- MDS DSCI 571 - Supervised Learning I - [MDS's GitHub website](https://github.com/UBC-MDS/DSCI_571_sup-learn-1) 


## Before Submitting 

Before submitting your assignment please do the following:

- Read through your solutions
- **Restart your kernel and clear output and rerun your cells from top to bottom** 
- Make sure that none of your code is broken 
- Verify that the tests from the questions you answered have obtained the output "Success"

This is a simple way to make sure that you are submitting all the variables needed to mark the assignment. This method should help avoid losing marks due to changes in your environment.  