# Lab 8: Implement Your Machine Learning Project Plan

In this lab assignment, you will implement the machine learning project plan you created in the written assignment. You will:

1. Load your data set and save it to a Pandas DataFrame.
2. Perform exploratory data analysis on your data to determine which feature engineering and data preparation techniques you will use.
3. Prepare your data for your model and create features and a label.
4. Fit your model to the training data and evaluate your model.
5. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.

### Import Packages

Before you get started, import a few packages.

In [2]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need for this task.

In [41]:
import gensim
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# Note: plot_roc_curve was removed in scikit-learn 1.2; newer versions provide RocCurveDisplay instead
from sklearn.metrics import plot_roc_curve, accuracy_score, roc_auc_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor


## Part 1: Load the Data Set


You have chosen to work with one of four data sets. The data sets are located in a folder named "data." The file names of the four data sets are as follows:

* The "adult" data set that contains Census information from 1994 is located in file `adultData.csv`
* The Airbnb NYC "listings" data set is located in file `airbnbListingsData.csv`
* The World Happiness Report (WHR) data set is located in file `WHR2018Chapter2OnlineData.csv`
* The book review data set is located in file `bookReviewsData.csv`



<b>Task:</b> In the code cell below, load your data with `pd.read_csv()`, as you have done throughout this course, and save it to DataFrame `df`.

In [4]:
adultDataSet_filename = os.path.join(os.getcwd(), "data", "adultData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")
# This project uses the book review data set
filename = bookReviewDataSet_filename
df = pd.read_csv(filename, header=0)

## Part 2: Exploratory Data Analysis

The next step is to inspect and analyze your data set with your machine learning problem and project plan in mind. 

This step will help you determine data preparation and feature engineering techniques you will need to apply to your data to build a balanced modeling data set for your problem and model. These data preparation techniques may include:
* addressing missingness, such as replacing missing values with means
* renaming features and labels
* finding and replacing outliers
* performing winsorization if needed
* performing one-hot encoding on categorical features
* performing vectorization for an NLP problem
* addressing class imbalance in your data sample to promote fair AI
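
Three of the techniques listed above (mean imputation, winsorization, and one-hot encoding) can be sketched as follows. This is a minimal illustration on a hypothetical toy DataFrame: the column names, the values, and the 5th/95th percentile cutoffs are assumptions chosen for the example, not part of any of the lab's data sets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0, 230.0],   # one missing value, one outlier
    "workclass": ["Private", "State-gov", "Private", None, "Self-emp"],
})

# 1. Address missingness: replace missing numeric values with the column mean.
toy["age"] = toy["age"].fillna(toy["age"].mean())

# 2. Winsorize: clip values to the 5th and 95th percentiles (assumed cutoffs).
low, high = toy["age"].quantile([0.05, 0.95])
toy["age_win"] = toy["age"].clip(lower=low, upper=high)

# 3. One-hot encode the categorical column; rows with a missing category
#    get zeros in every indicator column.
toy_encoded = pd.get_dummies(toy, columns=["workclass"], prefix="workclass")

print(toy_encoded.head())
```

Which of these steps applies depends on the data set: a text-only data set such as the book reviews has no numeric columns to winsorize, for example.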


Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
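
The inspection techniques described above can be sketched on a small example. The DataFrame below is hypothetical (the `Length` column in particular is an assumption made for illustration); the same calls apply to whichever data set you loaded into `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for a labeled data set.
toy = pd.DataFrame({
    "Review": ["great read", "dull", "loved it", "not for me", "fine"],
    "Length": [10, 4, 8, 10, 4],
    "Positive Review": [True, False, True, False, True],
})

print(toy.dtypes)           # data type of each column
print(toy.describe())       # summary statistics for the numeric columns
print(toy.isnull().sum())   # number of missing values per column

# Class balance: fraction of examples in each class.
class_share = toy["Positive Review"].value_counts(normalize=True)
print(class_share)

# Data filter: keep only rows whose length is above the column mean.
long_rows = toy[toy["Length"] > toy["Length"].mean()]
print(long_rows)
```

A class share far from an even split would be one signal that class imbalance needs to be addressed before modeling.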


<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. 

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [5]:
# Checking if Data is balanced
print(df['Positive Review'].sum())
print(len(df) - df['Positive Review'].sum())

980
993


In [6]:
# Ensuring no null values
df.isnull().any()

Review             False
Positive Review    False
dtype: bool

In [7]:
y = df['Positive Review']
X = df['Review']

# Tokenize each review into a list of lowercase words
original_X = X
X = X.apply(lambda row: gensim.utils.simple_preprocess(row))

X.head()

0    [this, was, perhaps, the, best, of, johannes, ...
1    [this, very, fascinating, book, is, story, wri...
2    [the, four, tales, in, this, collection, are, ...
3    [the, book, contained, more, profanity, than, ...
4    [we, have, now, entered, second, time, of, dee...
Name: Review, dtype: object

In [8]:
# Creating training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.80, random_state=1234)

X_train.head()

1369    [as, my, brother, said, when, flipping, throug...
1366    [cooper, book, is, yet, another, warm, and, fu...
385     [have, many, robot, books, and, this, is, the,...
750     [as, china, re, emerges, as, dominant, power, ...
643     [have, been, huge, fan, of, michael, crichton,...
Name: Review, dtype: object

In [9]:
word2vec_model = gensim.models.Word2Vec(X_train,
                                   vector_size=100,
                                   window=5,
                                   min_count=2)

In [10]:
# Measure size of model
len(word2vec_model.wv.key_to_index)

10354

## Part 3: Implement Your Project Plan

<b>Task:</b> Use the rest of this notebook to carry out your project plan. You will:

1. Prepare your data for your model and create features and a label.
2. Fit your model to the training data and evaluate your model.
3. Improve your model by performing model selection and/or feature selection techniques to find best model for your problem.


Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit.

In [11]:
# Map each review to the vectors of the vocabulary words it contains
# (each distinct in-vocabulary word contributes one vector)
words = set(word2vec_model.wv.index_to_key)

X_train = np.array([np.array([word2vec_model.wv[word] for word in words if word in training_example])
                        for training_example in X_train], dtype=object)

X_test = np.array([np.array([word2vec_model.wv[word] for word in words if word in test_example])
                        for test_example in X_test], dtype=object)


In [12]:
# Making a consistent set of features per example
X_train_feature_vector = []
for w in X_train:
    if w.size:
        X_train_feature_vector.append(w.mean(axis=0))
    else:
        X_train_feature_vector.append(np.zeros(100, dtype=float))
        
X_test_feature_vector = []
for w in X_test:
    if w.size:
        X_test_feature_vector.append(w.mean(axis=0))
    else:
        X_test_feature_vector.append(np.zeros(100, dtype=float))
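
The two loops above apply the same rule to the training and test sets: average the word vectors of a review, or fall back to a zero vector when no word is in the vocabulary. One way to avoid the duplication is a small helper, sketched below. The function name `mean_embedding` and the 3-dimensional toy vectors are hypothetical; a plain dict stands in for the trained `word2vec_model.wv`, which supports the same `in` and `[]` lookups.

```python
import numpy as np

def mean_embedding(tokens, vectors, dim):
    """Average the vectors of the distinct in-vocabulary tokens.

    `vectors` is any mapping from token to 1-D array. Returns a zero
    vector of length `dim` when no token is in the vocabulary.
    """
    found = [vectors[t] for t in set(tokens) if t in vectors]
    if not found:
        return np.zeros(dim, dtype=float)
    return np.mean(found, axis=0)

# Hypothetical 3-dimensional embeddings, for illustration only.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "book": np.array([3.0, 2.0, 0.0]),
}

features = np.vstack([
    mean_embedding(["good", "book", "good"], toy_vectors, dim=3),
    mean_embedding(["unknown"], toy_vectors, dim=3),
])
print(features)
```

Stacking the results with `np.vstack` gives a regular 2-D array with one row per review, which scikit-learn estimators accept directly.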

In [35]:
# Model selection: 5-fold cross-validated accuracy for a range of tree depths
hyperparams = [2**n for n in range(0, 8)]
accuracy_scores = []

for md in hyperparams:
    model = DecisionTreeClassifier(max_depth=md, min_samples_leaf=1)
    acc_score = cross_val_score(model, X_train_feature_vector, y_train, cv=5)
    accuracy_scores.append(np.mean(acc_score))

for md, acc in zip(hyperparams, accuracy_scores):
    print('Accuracy score for max_depth {0}: {1}'.format(md, acc))

Accuracy score for max_depth 1: 0.5754149085794655
Accuracy score for max_depth 2: 0.6007675306409485
Accuracy score for max_depth 4: 0.5842616033755273
Accuracy score for max_depth 8: 0.57541289933695
Accuracy score for max_depth 16: 0.5640204942736589
Accuracy score for max_depth 32: 0.5576933895921238
Accuracy score for max_depth 64: 0.5652903355435
Accuracy score for max_depth 128: 0.5519750853928069
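
`GridSearchCV` is imported at the top of this notebook, and it can run the same kind of search as the manual loop above. The sketch below is an illustration only: it uses synthetic data from `make_classification` as a stand-in for the averaged word-vector features, and the variable names `X_demo`, `y_demo`, and `search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature vectors and labels (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=1234)

# Same candidate depths as the manual loop: 1, 2, 4, ..., 128.
param_grid = {"max_depth": [2**n for n in range(8)]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=1234),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best max_depth:", search.best_params_["max_depth"])
print("Best mean CV accuracy:", search.best_score_)
```

By default `GridSearchCV` also refits the best configuration on all of the data passed to `fit`, so `search.best_estimator_` is ready for prediction afterwards.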


In [33]:
# 5-fold cross-validated accuracy for a range of max_iter values
hyperparams = [100*n for n in range(2,8)]
accuracy_scores = []

for m_iter in hyperparams:
    # cross_val_score fits a fresh copy of the model on each fold,
    # so no separate call to fit() is needed here
    model = LogisticRegression(max_iter=m_iter)
    acc_score = cross_val_score(model, X_train_feature_vector, y_train, cv=5)
    accuracy_scores.append(np.mean(acc_score))

# Identical scores across all values suggest the solver already converges within 200 iterations
for m_iter, acc in zip(hyperparams, accuracy_scores):
    print('Accuracy score for max_iter {0}: {1}'.format(m_iter, acc))

Accuracy score for max_iter 200: 0.5969660438014868
Accuracy score for max_iter 300: 0.5969660438014868
Accuracy score for max_iter 400: 0.5969660438014868
Accuracy score for max_iter 500: 0.5969660438014868
Accuracy score for max_iter 600: 0.5969660438014868
Accuracy score for max_iter 700: 0.5969660438014868
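
The scores above are cross-validated on the training split only. Once a configuration has been chosen, a final check on the held-out test split might look like the sketch below. It uses synthetic data from `make_classification` as a stand-in for the review features, and all variable names are hypothetical; in this notebook the corresponding inputs would be `X_train_feature_vector`, `X_test_feature_vector`, `y_train`, and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature vectors and labels (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=1234)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.80, random_state=1234
)

# Fit on the training split only.
model = LogisticRegression(max_iter=200)
model.fit(X_tr, y_tr)

# Class predictions for accuracy; positive-class probabilities for ROC AUC.
class_preds = model.predict(X_te)
proba_preds = model.predict_proba(X_te)[:, 1]

test_accuracy = accuracy_score(y_te, class_preds)
test_auc = roc_auc_score(y_te, proba_preds)

print("Test accuracy:", test_accuracy)
print("Test ROC AUC:", test_auc)
```

Keeping the test split out of model selection means these two numbers estimate performance on unseen reviews rather than on data that influenced the choice of hyperparameters.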


In [44]:
# 5-fold cross-validated accuracy for a range of neighbor counts
hyperparams = list(range(2, 30))
accuracy_scores = []

for k in hyperparams:
    # cross_val_score fits a fresh copy of the model on each fold,
    # so no separate call to fit() is needed here
    model = KNeighborsClassifier(n_neighbors=k)
    acc_score = cross_val_score(model, X_train_feature_vector, y_train, cv=5)
    accuracy_scores.append(np.mean(acc_score))

for k, acc in zip(hyperparams, accuracy_scores):
    print('Accuracy score for n_neighbors {0}: {1}'.format(k, acc))

Accuracy score for n_neighbors 2: 0.5507052441229657
Accuracy score for n_neighbors 3: 0.5583001808318263
Accuracy score for n_neighbors 4: 0.5462748643761302
Accuracy score for n_neighbors 5: 0.5621076953988347
Accuracy score for n_neighbors 6: 0.5627285513361462
Accuracy score for n_neighbors 7: 0.574765923246936
Accuracy score for n_neighbors 8: 0.5513281093027929
Accuracy score for n_neighbors 9: 0.5602190074341974
Accuracy score for n_neighbors 10: 0.5658870805706249
Accuracy score for n_neighbors 11: 0.5608298171589311
Accuracy score for n_neighbors 12: 0.5677878239903557
Accuracy score for n_neighbors 13: 0.565905163753265
Accuracy score for n_neighbors 14: 0.5671830419931687
Accuracy score for n_neighbors 15: 0.566562186055857
Accuracy score for n_neighbors 16: 0.5652822985734379
Accuracy score for n_neighbors 17: 0.5608519188266022
Accuracy score for n_neighbors 18: 0.5583001808318263
Accuracy score for n_neighbors 19: 0.5684528832630098
Accuracy score for n_neighbors 20: 0.572875226039783
Accuracy score for max_d