# Lab 8: Define and Solve an ML Problem of Your Choosing

In [2]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind and begin to formulate a project plan. You will then implement the machine learning project plan.

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [3]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(bookReviewDataSet_filename)

df.head()

Unnamed: 0,Review,Positive Review
0,This was perhaps the best of Johannes Steinhof...,True
1,This very fascinating book is a story written ...,True
2,The four tales in this collection are beautifu...,True
3,The book contained more profanity than I expec...,False
4,We have now entered a second time of deep conc...,True


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

I have chosen the book review data set. I will be predicting whether a book review expresses positive sentiment; the label is the `Positive Review` column. This is a supervised learning problem, specifically binary classification. The only feature is the review text (`Review`), and the label is True when the review is positive and False otherwise. This problem is important because a company may want to collect large numbers of reviews and determine the general sentiment about a given product. For example, stockbrokers may want to know whether a new product has generally positive or negative reviews when deciding whether to buy stock in the company that released it.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) would you like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.
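One quick check for a classification problem is the class balance, together with the accuracy that a trivial "always predict the majority class" model would reach. That baseline is a useful reference point when reading later accuracy numbers. The sketch below uses only the standard library and a toy list of labels; the labels are placeholders, not values from any of the data sets. With Pandas, `value_counts(normalize=True)` on the label column reports the same class shares.

```python
from collections import Counter

# Toy stand-in for a boolean label column; these labels are placeholders,
# not values from bookReviewsData.csv.
labels = [True, True, False, True, False, False, True, True]

counts = Counter(labels)
total = len(labels)

# Share of each class in the sample.
class_share = {label: n / total for label, n in counts.items()}

# Accuracy of a model that always predicts the most common class.
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total

print(class_share)
print(f"Majority class: {majority_label}, baseline accuracy: {baseline_accuracy:.3f}")
```

If the shares are far from equal, accuracy alone can be misleading, and a model should at least beat the baseline accuracy to be considered useful.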

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split


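# Check the class balance. A bare expression is only displayed when it is the
# last line of a cell, so print it explicitly.
print(df["Positive Review"].value_counts())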
X = df["Review"]
y = df["Positive Review"]
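# Hold out 25% of the reviews for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.75, random_state=1234)

# Fit the vectorizer on the training reviews only, so the vocabulary and
# IDF weights are not influenced by the test reviews.
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(X_train)
X_train_tfidf = tfidf_vectorizer.transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Convert the sparse matrices to dense NumPy arrays for Keras.
X_train_tfidf = X_train_tfidf.toarray()
X_test_tfidf = X_test_tfidf.toarray()

# Encode the boolean labels as 0/1 integers.
y_train_encoded = y_train.astype(int)
y_test_encoded = y_test.astype(int)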

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

I applied TF-IDF to each review, so every review is turned into a sparse, high-dimensional vector with one column per word in the vocabulary of the training reviews. Each word that is present in a review receives a weight that is high if the word is frequent in that review but rare in the others. I also converted the sparse matrices to NumPy arrays and encoded the labels as integers. This is needed to turn the text data into numeric vectors that machine learning models can work with. The model I intend to use is a dense neural network. To find the best hyperparameters, I will run a grid search first on the first hidden layer and then on the second. I will vary the number of units and add a dropout layer to reduce overfitting to the training data.
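To make the weighting concrete, here is a small hand-rolled TF-IDF computation. It is a sketch under stated assumptions, not the code this notebook runs: it uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which to my understanding matches the `TfidfVectorizer` defaults, but it splits on whitespace instead of using the vectorizer's tokenizer. The three documents are placeholder text, not reviews from the data set, and `tfidf_vector` is a name introduced here for illustration.

```python
import math
from collections import Counter

# Toy corpus (placeholder text, not taken from the data set).
docs = [
    "great book great story",
    "boring book",
    "great characters boring plot",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """Return {term: weight} for one document, L2-normalized."""
    term_counts = Counter(tokens)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf_vector(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:>6}: {weight:.3f}")
```

In the first document, "story" and "book" each occur once, but "story" appears in no other document while "book" appears in two, so "story" receives the larger weight. "great" occurs twice in the document, which outweighs its lower IDF.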

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [5]:
import tensorflow.keras as keras


<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find the best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [6]:
def build_model(units, dropout, learning_rate = 0.1):
    model = keras.Sequential()
    model.add(keras.layers.Dense(units = units, 
                                 activation = 'relu',
                                 input_shape=(X_train_tfidf.shape[1],)))
    model.add(keras.layers.Dropout(dropout))
    model.add(keras.layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer= keras.optimizers.SGD(learning_rate),
                           loss='binary_crossentropy',
                           metrics=['accuracy']) 
    return model

In [13]:
units_list = [64, 96, 128]
dropout_list = [0.2, 0.3, 0.4]
lr_list = [0.1]

param_grid = []

for units in units_list:
    for dropout in dropout_list:
        for lr in lr_list:
            param_grid.append({
                'units': units,
                'dropout': dropout,
                'lr': lr
            })
print(param_grid)

[{'units': 64, 'dropout': 0.2, 'lr': 0.1}, {'units': 64, 'dropout': 0.3, 'lr': 0.1}, {'units': 64, 'dropout': 0.4, 'lr': 0.1}, {'units': 96, 'dropout': 0.2, 'lr': 0.1}, {'units': 96, 'dropout': 0.3, 'lr': 0.1}, {'units': 96, 'dropout': 0.4, 'lr': 0.1}, {'units': 128, 'dropout': 0.2, 'lr': 0.1}, {'units': 128, 'dropout': 0.3, 'lr': 0.1}, {'units': 128, 'dropout': 0.4, 'lr': 0.1}]
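The same grid can be built without nested loops. The sketch below uses `itertools.product` from the standard library, which yields the combinations in the same order as the three nested `for` loops (the last list varies fastest), so it stays compact if more hyperparameters are added later.

```python
from itertools import product

units_list = [64, 96, 128]
dropout_list = [0.2, 0.3, 0.4]
lr_list = [0.1]

# One dictionary per combination of units, dropout, and learning rate.
param_grid = [
    {'units': units, 'dropout': dropout, 'lr': lr}
    for units, dropout, lr in product(units_list, dropout_list, lr_list)
]

print(len(param_grid))
print(param_grid[0])
```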


In [8]:
import gc
# Used to clear the Keras session between runs; without this the kernel kept crashing.
from tensorflow.keras import backend as K

In [24]:

best_acc = 0
best_params = None
best_model = None

for params in param_grid:
    print(f"Training with params: {params}")
    
    model = build_model(units = params['units'],
                        dropout=params['dropout'],
                        learning_rate=params['lr'])
    
    history = model.fit(X_train_tfidf,
             y_train_encoded,
              epochs = 10,
              batch_size = 32,
              validation_split = 0.2,
              verbose = 0)
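    # Note: candidates are compared on the test set, so the best accuracy
    # reported below is an optimistic estimate of performance on new data.
    loss, acc = model.evaluate(X_test_tfidf, y_test_encoded, verbose=0)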
    print(f"→ Accuracy: {acc:.4f}")
    
    if acc > best_acc:
        best_acc = acc
        best_params = params
        best_model = model
        
    K.clear_session()
    del model
    gc.collect()
    
print("Best Accuracy: {:.4f}".format(best_acc))
print("Best parameters:", best_params)
    

Training with params: {'units': 64, 'dropout': 0.2, 'lr': 0.1}
→ Accuracy: 0.7591
Training with params: {'units': 64, 'dropout': 0.3, 'lr': 0.1}
→ Accuracy: 0.5810
Training with params: {'units': 64, 'dropout': 0.4, 'lr': 0.1}
→ Accuracy: 0.7085
Training with params: {'units': 96, 'dropout': 0.2, 'lr': 0.1}
→ Accuracy: 0.5951
Training with params: {'units': 96, 'dropout': 0.3, 'lr': 0.1}
→ Accuracy: 0.5405
Training with params: {'units': 96, 'dropout': 0.4, 'lr': 0.1}
→ Accuracy: 0.6073
Training with params: {'units': 128, 'dropout': 0.2, 'lr': 0.1}
→ Accuracy: 0.6579
Training with params: {'units': 128, 'dropout': 0.3, 'lr': 0.1}
→ Accuracy: 0.5020
Training with params: {'units': 128, 'dropout': 0.4, 'lr': 0.1}
→ Accuracy: 0.5162
Best Accuracy: 0.7591
Best parameters: {'units': 64, 'dropout': 0.2, 'lr': 0.1}
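The loop above picks the winner by test accuracy, which makes the reported best accuracy an optimistic estimate. Since `model.fit` is already called with `validation_split = 0.2`, one alternative is to select on the validation accuracy stored in `history.history` and keep the test set for a final check. The following is a minimal sketch, assuming the history dictionary exposes a `val_accuracy` key (the name used by recent TensorFlow 2 releases when the metric is `'accuracy'`). The helper names and the numbers are placeholders for illustration, not results from this notebook.

```python
def best_val_accuracy(history_dict):
    """Highest validation accuracy reached in any epoch."""
    return max(history_dict["val_accuracy"])

def select_by_validation(runs):
    """Pick the run whose best validation accuracy is highest.

    `runs` is a list of (params, history_dict) pairs, where history_dict
    has the same shape as `history.history` returned by model.fit.
    """
    return max(runs, key=lambda run: best_val_accuracy(run[1]))

# Placeholder histories for illustration only; not results from this notebook.
runs = [
    ({'units': 64, 'dropout': 0.2, 'lr': 0.1}, {"val_accuracy": [0.55, 0.61, 0.70]}),
    ({'units': 96, 'dropout': 0.2, 'lr': 0.1}, {"val_accuracy": [0.52, 0.66, 0.64]}),
]

best_params, best_history = select_by_validation(runs)
print(best_params, best_val_accuracy(best_history))
```

In the training loop, this would mean appending `(params, history.history)` to `runs` after each `model.fit` call and evaluating on the test set only once, for the selected configuration.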


In [14]:
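# Add a second hidden layer to try to improve accuracy, starting from these first-layer settings:
# 'units': 64, 'dropout': 0.4, 'lr': 0.1
# Note: the grid search output above reports dropout 0.2 as the best value, not 0.4.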

In [6]:
def build_model_secondlayer(units, learning_rate = 0.1):
    model = keras.Sequential()
    model.add(keras.layers.Dense(units = 64, 
                                 activation = 'relu',
                                 input_shape=(X_train_tfidf.shape[1],)))
    model.add(keras.layers.Dropout(0.4))
    
    # Second hidden layer; Keras infers its input size from the previous layer,
    # so no input_shape is needed here.
    model.add(keras.layers.Dense(units = units, 
                                 activation = 'relu'))
    
    model.add(keras.layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer= keras.optimizers.SGD(learning_rate),
                           loss='binary_crossentropy',
                           metrics=['accuracy']) 
    return model

In [26]:

best_acc = 0
best_params = None
best_model = None

units_options = [16, 32, 64]  # a list keeps the iteration order fixed
for units in units_options:
    print(f"Training with units: {units}")
    
    model = build_model_secondlayer(units = units,
                        learning_rate= 0.1)
    
    history = model.fit(X_train_tfidf,
             y_train_encoded,
              epochs = 10,
              batch_size = 32,
              validation_split = 0.2,
              verbose = 0)
    loss, acc = model.evaluate(X_test_tfidf, y_test_encoded, verbose=0)
    print(f"→ Accuracy: {acc:.4f}")
    
    if acc > best_acc:
        best_acc = acc
        best_params = {'units': units}  # record the second-layer size being tested
        best_model = model
        
    K.clear_session()
    del model
    gc.collect()
    
print("Best Accuracy: {:.4f}".format(best_acc))

Training with units: 16
→ Accuracy: 0.7814
Training with units: 32
→ Accuracy: 0.5628
Training with units: 64
→ Accuracy: 0.7146
Best Accuracy: 0.7814


In [27]:
# The best model has two hidden layers (64 and 16 units) with a single dropout layer in between.
# Note: the call below builds a fresh, untrained model with that architecture.

best_model = build_model_secondlayer(16)
best_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 64)                1187776   
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                1040      
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 1,188,833
Trainable params: 1,188,833
Non-trainable params: 0
_________________________________________________________________
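The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below reproduces the three counts and, as a side effect, recovers the input width (the TF-IDF vocabulary size) from the first layer's count. The helper name `dense_params` is introduced here for illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

first_layer_params = 1187776  # "dense" row of the summary

# Invert n_inputs * 64 + 64 = 1187776 to recover the input width.
vocab_size = first_layer_params // 64 - 1

layers = [
    ("dense", dense_params(vocab_size, 64)),
    ("dense_1", dense_params(64, 16)),
    ("dense_2", dense_params(16, 1)),
]
total = sum(count for _, count in layers)

for name, count in layers:
    print(f"{name:<8} {count:>9,}")
print(f"input width: {vocab_size:,}  total params: {total:,}")
```

Almost all of the parameters sit in the first layer, because its size scales with the vocabulary. The dropout layer has no parameters, which is why it contributes 0 in the summary.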


In [11]:
num_epochs = [10, 20, 30, 40, 50]

for epochs in num_epochs:
    print(f"num epochs: {epochs}")
    best_model = build_model_secondlayer(16)
    history = best_model.fit(X_train_tfidf,
             y_train_encoded,
              epochs = epochs,
              batch_size = 32,
              validation_split = 0.2,
              verbose = 0)
    loss, acc = best_model.evaluate(X_test_tfidf, y_test_encoded, verbose=0)
    print(f"→ Accuracy: {acc:.4f}")
    
    K.clear_session()  
    del best_model           
    del history        
    gc.collect() 

num epochs: 10
→ Accuracy: 0.6964
num epochs: 20
→ Accuracy: 0.8138
num epochs: 30
→ Accuracy: 0.7591
num epochs: 40
→ Accuracy: 0.8198
num epochs: 50
→ Accuracy: 0.8158


In [20]:
# The model's accuracy does not improve meaningfully beyond 20 epochs (0.8138 at 20 vs. 0.8198 at 40).
# The best model has two hidden layers with 64 and 16 units,
# uses a learning rate of 0.1, and is trained for 20 epochs.
best_model = build_model_secondlayer(16)
history = best_model.fit(X_train_tfidf,
             y_train_encoded,
              epochs = 20,
              batch_size = 32,
              validation_split = 0.2,
              verbose = 0)
loss, acc = best_model.evaluate(X_test_tfidf, y_test_encoded, verbose=0)
print(acc)

0.7975708246231079
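The same 64 + 16 architecture appears with different test accuracies in different cells above: 0.7814 and 0.6964 at 10 epochs, and 0.8138 and 0.7976 (rounded) at 20 epochs. A rough way to see how much of the difference between settings could be run-to-run noise is to compare the mean and spread per setting. This is only a sketch over the four numbers already printed in this notebook; two runs per setting are too few for firm conclusions, and the cells may have been executed in different sessions.

```python
from statistics import mean

# Test accuracies printed in this notebook for the 64 + 16 model,
# grouped by number of training epochs.
runs = {
    10: [0.7814, 0.6964],
    20: [0.8138, 0.7976],
}

summary = {}
for epochs, accs in runs.items():
    spread = max(accs) - min(accs)
    summary[epochs] = (mean(accs), spread)
    print(f"{epochs} epochs: mean={mean(accs):.4f}, spread={spread:.4f}")
```

When the spread within one setting is comparable to the gap between settings, repeating each configuration several times and comparing averages gives a more reliable ranking than a single run.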


In [21]:

sample_indices = np.random.choice(range(len(X_test)), 3, replace=False)

for idx in sample_indices:
    review_text = X_test.iloc[idx]
    true_label = y_test.iloc[idx]
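    # The sigmoid output is the estimated probability that the review is positive;
    # it is printed below under the "Confidence" label.
    pred_prob = best_model.predict(X_test_tfidf[idx:idx+1])[0][0]
    pred_label = int(round(pred_prob))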
    
    pred_label_str = "True" if pred_label == 1 else "False"
    
    print(f"Review: {review_text[:100]}{'...' if len(review_text) > 100 else ''}")
    print(f"True Label: {true_label}")
    print(f"Predicted Label: {pred_label_str} (Confidence: {pred_prob:.2f})\n")


Review: I have read 3 books of Ann in last 3-4 months and no doubt loved it, but not sure If I am reading sa...
True Label: False
Predicted Label: False (Confidence: 0.25)

Review: This book places too much emphasis on spending money instead of eating...well if we all had money to...
True Label: False
Predicted Label: False (Confidence: 0.23)

Review: Why use 1 word when 12 will bewilder the reader and make the book that much thicker.  I can not beli...
True Label: False
Predicted Label: False (Confidence: 0.37)
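The value printed as "Confidence" above is the sigmoid output, that is, the estimated probability that the review is positive. For a review predicted as False, the confidence in that prediction is therefore one minus the printed value. The sketch below makes the conversion explicit for the three probabilities shown; `to_label` and the 0.5 threshold are illustrative choices introduced here.

```python
def to_label(prob_positive, threshold=0.5):
    """Turn P(positive) into a predicted label and the confidence in that label."""
    label = prob_positive >= threshold
    confidence = prob_positive if label else 1.0 - prob_positive
    return label, confidence

# Probabilities printed for the three sampled reviews.
for p in [0.25, 0.23, 0.37]:
    label, confidence = to_label(p)
    print(f"P(positive)={p:.2f} -> predicted {label}, confidence {confidence:.2f}")
```

Read this way, the third review (0.37) is the least certain of the three predictions, since its probability is closest to the threshold.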

