# Homework 3: Predicting Product Review Sentiment Using Classification

## Overview
The goal of this assignment is to build a classification machine learning (ML) pipeline in a web application and use it as a tool for analyzing models and gaining insight into their performance. Using the trained classification models, build an ML application that predicts whether a product review is positive or negative.

The learning outcomes for this assignment are:

●	Build an end-to-end classification pipeline with three classifiers: 1) Logistic Regression, 2) Stochastic Gradient Descent, and 3) Stochastic Gradient Descent with Cross Validation.

●	Evaluate classification methods using standard metrics including precision, recall, accuracy, ROC curves, and area under the curve (AUC).

●	Develop a web application that walks users through the steps of the classification pipeline and provides tools to analyze multiple methods across multiple metrics.

●	Develop a web application that classifies product reviews as positive or negative and indicates the cost of displaying false positives and false negatives using a specified model.

## 3.1	Amazon Product Reviews Dataset

This assignment involves training and evaluating an end-to-end ML pipeline in a web application using the Amazon Product Reviews dataset. Millions of Amazon customers have contributed over a hundred million reviews to express opinions and describe their experiences with products on the Amazon.com website. This makes Amazon Customer Reviews a rich source of information for academic researchers in natural language processing (NLP), information retrieval (IR), and machine learning (ML), among other fields. Specifically, this dataset was constructed to represent a sample of customer evaluations and opinions, variation in the perception of a product across geographical regions, and promotional intent or bias in reviews.
We have added features to the dataset. There are many features, but the important ones include:

●	name: name of Amazon product

●	reviews.text: text of the review

●	reviews.title: title of the review

## 3.2	Explore and Preprocess Data (3 points)
The goal of this page is to explore and preprocess the dataset. First, import the dataset from your machine. We have provided code to remove irrelevant features using the clean_data() helper function (see helper_functions.py). Then, remove punctuation from the reviews. We have provided UI and functions to summarize text statistics from reviews, search reviews by keyword, and remove reviews. At the end of this page, encode documents with word counts and Term Frequency-Inverse Document Frequency (TF-IDF) features. See details about the checkpoint functions below.
Some activities (in later sections) require a try and except block when training classification models:
```
try:
    ...  # write some code
except ValueError as err:
    st.write(str(err))  # Print the error message
```


In [2]:
# import libraries and helper functions
import pandas as pd 
import numpy as np 
import helper_functions
import string
import streamlit as st
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.metrics import precision_score, recall_score, accuracy_score, roc_auc_score


In [3]:
# read data
products = pd.read_csv('datasets/Amazon Product Reviews I.csv')
products.head()

Unnamed: 0,id,asins,brand,categories,colors,dateAdded,dateUpdated,dimension,ean,keys,...,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.userCity,reviews.userProvince,reviews.username,sizes,upc,weight
0,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I initially had trouble deciding between the p...,"Paperwhite voyage, no regrets!",,,Cristina M,,,205 grams
1,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,Allow me to preface this with a little history...,One Simply Could Not Ask For More,,,Ricky,,,205 grams
2,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,4.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I am enjoying it so far. Great for reading. Ha...,Great for those that just want an e-reader,,,Tedd Gardiner,,,205 grams
3,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I bought one of the first Paperwhites and have...,Love / Hate relationship,,,Dougal,,,205 grams
4,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I have to say upfront - I don't like coroporat...,I LOVE IT,,,Miljan David Tanic,,,205 grams


In [4]:
# remove unuseful features 
products=helper_functions.clean_data(products)[0]
products.head()



Unnamed: 0,reviews,rating,title
0,I initially had trouble deciding between the p...,5.0,"Paperwhite voyage, no regrets!"
1,Allow me to preface this with a little history...,5.0,One Simply Could Not Ask For More
2,I am enjoying it so far. Great for reading. Ha...,4.0,Great for those that just want an e-reader
3,I bought one of the first Paperwhites and have...,5.0,Love / Hate relationship
4,I have to say upfront - I don't like coroporat...,5.0,I LOVE IT


**Checkpoint 1:** Start cleaning the text by removing punctuation from features in the dataset.
To do this, fill in the code for the remove_punctuation function, which takes the following inputs: the pandas dataframe (df) and a list of the feature(s) to clean (features). The function returns the dataframe with punctuation removed from those features (df).
Perform the following tasks in the remove_punctuation function:
1.	Create a translator using the string library that maps each punctuation character to its replacement (here, deletion).
2.	Write a for loop that iterates through the feature names and checks whether each feature contains strings.
a.	If the feature contains strings, use the translator to remove punctuation from them. It’s recommended that you use a lambda function.
3.	Store the updated dataframe df in st.session_state['data'].

Example code:
```
translator = str.maketrans('', '', string.punctuation)
for feature_name in features:
    if df[feature_name].dtype == 'object':
        df[feature_name] = ...  # add code here
```


In [5]:
def remove_punctuation(df, features):
    translator = str.maketrans('', '', string.punctuation)
    for feature in features:
        if df[feature].dtype == 'object':
            df[feature] = df[feature].apply(lambda x: x.translate(translator))
    st.session_state['data'] = df
    return df

products = remove_punctuation(products, ['reviews', 'title'])
products.head()

Unnamed: 0,reviews,rating,title
0,I initially had trouble deciding between the p...,5.0,Paperwhite voyage no regrets
1,Allow me to preface this with a little history...,5.0,One Simply Could Not Ask For More
2,I am enjoying it so far Great for reading Had ...,4.0,Great for those that just want an ereader
3,I bought one of the first Paperwhites and have...,5.0,Love Hate relationship
4,I have to say upfront I dont like coroporate ...,5.0,I LOVE IT
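
The translator idea from Checkpoint 1 can be tried on plain strings before applying it to a dataframe column. The following is a minimal sketch, not the assignment solution: the helper name strip_punctuation and the sample values are illustrative assumptions. It also leaves non-string values (such as missing entries) untouched, which a bare `x.translate(...)` lambda would not handle.

```python
import string

# Translation table that deletes every ASCII punctuation character
translator = str.maketrans('', '', string.punctuation)

def strip_punctuation(value):
    """Remove punctuation from strings; return non-string values unchanged."""
    if isinstance(value, str):
        return value.translate(translator)
    return value

# Hypothetical sample values, including a missing entry and a number
samples = ["Paperwhite voyage, no regrets!", "Love / Hate relationship", None, 4.0]
cleaned = [strip_punctuation(s) for s in samples]
print(cleaned)
```

Note that string.punctuation covers ASCII punctuation only, so typographic characters such as curly quotes would pass through unchanged.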


**Checkpoint 2:** Words need to be encoded as integers or floating point values before they can be input to a machine learning algorithm. Your task is to perform word frequency encoding in the word_count_encoder function, which takes three inputs: the pandas dataframe (df), a list of the feature(s) to perform word count encoding on (features), and a list of strings with word encoding names 'TF-IDF', 'Word Count' (word_encoder). The function performs word count encoding on the given features and returns the dataframe with word count encoded features (df).
Perform the following tasks in the word_count_encoder function:
1.	Use CountVectorizer() to create a count vectorizer class object.
2.	Apply the count vectorizer's fit_transform() function to the feature in df to create frequency counts for words.
3.	Convert the frequency counts to an array using the toarray() function and convert the array to a pandas dataframe.
4.	Add a prefix to the column names in the dataframe created in Step 3 using the add_prefix() pandas function with 'word_count_' as the prefix.
5.	Add the word count dataframe to df using the pd.concat() function.
6.	Update the confirmation statement to show the length of the word count dataframe.


In [6]:
def word_count_encoder(df, features, word_encoder):
    for feature in features:
        # create a CountVectorizer object
        count_vect = CountVectorizer()

        # fit and transform the feature using the CountVectorizer
        freq_counts = count_vect.fit_transform(df[feature])

        # convert the resulting sparse matrix to a dense array and then to a dataframe
        freq_counts_df = pd.DataFrame(freq_counts.toarray())

        # add prefix to column names
        freq_counts_df = freq_counts_df.add_prefix('word_count_')

        # concatenate the resulting word count dataframe to the original dataframe
        df = pd.concat([df, freq_counts_df], axis=1)

    # print confirmation message
    if 'Word Count' in word_encoder:
        print(f"Word count encoding has been performed. Added {len(freq_counts_df.columns)} columns to the dataframe.")
    return df

products= word_count_encoder(products, ['reviews'], ['Word Count'])
products.head()

Word count encoding has been performed. Added 6750 columns to the dataframe.


Unnamed: 0,reviews,rating,title,word_count_0,word_count_1,word_count_2,word_count_3,word_count_4,word_count_5,word_count_6,...,word_count_6740,word_count_6741,word_count_6742,word_count_6743,word_count_6744,word_count_6745,word_count_6746,word_count_6747,word_count_6748,word_count_6749
0,I initially had trouble deciding between the p...,5.0,Paperwhite voyage no regrets,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Allow me to preface this with a little history...,5.0,One Simply Could Not Ask For More,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,I am enjoying it so far Great for reading Had ...,4.0,Great for those that just want an ereader,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,I bought one of the first Paperwhites and have...,5.0,Love Hate relationship,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,I have to say upfront I dont like coroporate ...,5.0,I LOVE IT,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
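
To see what the word_count_ columns represent, the encoding can be reproduced by hand on two short documents. This is a sketch under stated assumptions: the tokenizer below mimics what CountVectorizer does with its default settings (lowercasing, tokens of two or more word characters, vocabulary in sorted order), and the names tokenize and count_encode are hypothetical helpers, not part of scikit-learn or the assignment code.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    # (an assumption modeled on CountVectorizer's default settings)
    return re.findall(r"\b\w\w+\b", text.lower())

def count_encode(documents):
    """Return the sorted vocabulary and one row of word counts per document."""
    vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})
    index = {tok: i for i, tok in enumerate(vocabulary)}
    matrix = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for tok, n in Counter(tokenize(doc)).items():
            row[index[tok]] = n
        matrix.append(row)
    return vocabulary, matrix

docs = ["I LOVE IT", "Love it or hate it"]
vocab, counts = count_encode(docs)
print(vocab)   # column j of the count matrix corresponds to vocab[j]
print(counts)
```

Each column index corresponds to one vocabulary word, which is why most entries in a large review corpus are zero: any single review uses only a small fraction of the vocabulary.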


**Checkpoint 3:** Next, we want to perform TF-IDF encoding, which quantifies the importance or relevance of words or phrases. Fill in the code for the tf_idf_encoder function, which takes three inputs: the pandas dataframe (df), a list of the feature(s) to perform TF-IDF encoding on (features), and a list of strings with word encoding names 'TF-IDF', 'Word Count' (word_encoder). This function returns the dataframe with TF-IDF encoded feature(s).
Perform the following tasks in the tf_idf_encoder function:
1. Use CountVectorizer() to create a count vectorizer class object.
2. Apply the count vectorizer's fit_transform() function to the feature in df to create frequency counts for words.
3. Use TfidfTransformer() to create a TF-IDF transformer class object.
4. Transform the frequency counts (from Step 2) into TF-IDF features using the TfidfTransformer object.
5. Convert the TF-IDF features to an array using the toarray() function, and create a pandas dataframe from that array.
6. Add a prefix to the column names in the dataframe created in Step 5 using the add_prefix() pandas function with 'tf_idf_word_count_' as the prefix.
7. Add the TF-IDF dataframe to df using the pd.concat() function.


In [6]:

def tf_idf_encoder(df, features, word_encoder):
    for feature in features:
        # create a CountVectorizer object
        count_vect = CountVectorizer()

        # fit and transform the feature using the CountVectorizer
        freq_counts = count_vect.fit_transform(df[feature])

        # create a TfidfTransformer object
        tfidf_transformer = TfidfTransformer()

        # fit and transform the frequency counts using the TfidfTransformer
        tfidf_features = tfidf_transformer.fit_transform(freq_counts)

        # convert the resulting sparse matrix to a dense array and then to a dataframe
        tfidf_features_df = pd.DataFrame(tfidf_features.toarray())

        # add prefix to column names
        tfidf_features_df = tfidf_features_df.add_prefix('tf_idf_word_count_')

        # concatenate the resulting tf-idf dataframe to the original dataframe
        df = pd.concat([df, tfidf_features_df], axis=1)

    # print confirmation message
    if 'TF-IDF' in word_encoder:
        print(f"TF-IDF encoding has been performed. Added {len(tfidf_features_df.columns)} columns to the dataframe.")
    return df
products = tf_idf_encoder(products, ['reviews'], ['Word Count'])
products.head()

Unnamed: 0,reviews,rating,title,word_count_0,word_count_1,word_count_2,word_count_3,word_count_4,word_count_5,word_count_6,...,tf_idf_word_count_6740,tf_idf_word_count_6741,tf_idf_word_count_6742,tf_idf_word_count_6743,tf_idf_word_count_6744,tf_idf_word_count_6745,tf_idf_word_count_6746,tf_idf_word_count_6747,tf_idf_word_count_6748,tf_idf_word_count_6749
0,I initially had trouble deciding between the p...,5.0,Paperwhite voyage no regrets,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Allow me to preface this with a little history...,5.0,One Simply Could Not Ask For More,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,I am enjoying it so far Great for reading Had ...,4.0,Great for those that just want an ereader,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,I bought one of the first Paperwhites and have...,5.0,Love Hate relationship,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,I have to say upfront I dont like coroporate ...,5.0,I LOVE IT,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
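
The TF-IDF weighting in Checkpoint 3 can be checked by hand on a tiny count matrix. The sketch below assumes the default TfidfTransformer behavior in scikit-learn, namely a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The function name tfidf_rows and the sample counts are illustrative assumptions.

```python
import math

def tfidf_rows(counts):
    """Turn a list of word-count rows into smoothed, L2-normalized TF-IDF rows."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many documents each term appears
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency
    idf = [math.log((1 + n_docs) / (1 + df[j])) + 1 for j in range(n_terms)]
    rows = []
    for row in counts:
        weighted = [row[j] * idf[j] for j in range(n_terms)]
        norm = math.sqrt(sum(w * w for w in weighted))
        rows.append([w / norm if norm else 0.0 for w in weighted])
    return idf, rows

# Hypothetical counts for two documents over a four-word vocabulary
counts = [[0, 1, 1, 0], [1, 2, 1, 1]]
idf, tfidf = tfidf_rows(counts)
print(idf)
print(tfidf)
```

A word that appears in every document gets the minimum IDF of 1, while rarer words are weighted up, which is the sense in which TF-IDF quantifies relevance.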


## 3.3 Train Regression Models (6 points)
The goal of this page is to train multiple models and inspect model coefficients and cross validation results for relevant models. First, we have provided code in the set_pos_neg_reviews() function that labels product ratings as negative based on the negative ratings selected by the user (assume rating=3 is neutral). Then, write the split_dataset() function, which splits the dataset into training and validation inputs and outputs using the appropriate word encoding (as specified by the user). Next, write three training functions to train the following models: 1) Logistic Regression, 2) Stochastic Gradient Descent, and 3) Stochastic Gradient Descent with Cross Validation. Lastly, write a function to inspect the coefficients of each model.

**Checkpoint 4:** Before training the models, you need to split the dataset into training and test sets. Complete the split_dataset function, which takes the following inputs: training features (X), training targets (y), and the percentage of test samples (number). Pass the data matrix X along with the corresponding target vector y into scikit-learn’s train_test_split() function. Set the default test_size to 0.2 and the default random_state to 42. The function outputs four objects, in the following order: X_train, X_val, y_train, y_val. Refer to the scikit-learn train_test_split() documentation for help.
```
(df, number, test_size, target, feature_encoding, random_state=42)
```
Perform the following tasks in the split_dataset function:
1. Use the train_test_split() function to split the dataset into four parts, X_train, X_val, y_train, and y_val, using the inputs X, y, number/100 (the test fraction), and random_state.
2. Check whether the feature_encoding list of strings contains 'TF-IDF' or 'Word Count', and set the input to the feature names that start with 'tf_idf_word_count_' for TF-IDF and 'word_count_' for word count (see example below). Note that the dataset can contain both feature encodings.
```
if 'Word Count' in feature_encoding:
    X_train_sentiment = X_train.loc[:, X_train.columns.str.startswith('word_count_')]
    X_val_sentiment = X_val.loc[:, X_val.columns.str.startswith('word_count_')]
```

In [7]:
def split_dataset(X, y, number, feature_encoding, random_state=42):
    # split the dataset into train and validation sets
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=number/100, random_state=random_state)
    
    # check if the feature encoding is Word Count or TF-IDF and extract the corresponding features
    if 'Word Count' in feature_encoding:
        X_train = X_train.loc[:, X_train.columns.str.startswith('word_count_')]
        X_val = X_val.loc[:, X_val.columns.str.startswith('word_count_')]
    elif 'TF-IDF' in feature_encoding:
        X_train = X_train.loc[:, X_train.columns.str.startswith('tf_idf_word_count_')]
        X_val = X_val.loc[:, X_val.columns.str.startswith('tf_idf_word_count_')]
        
    return X_train, X_val, y_train, y_val


In [8]:
X = products.drop(columns=['rating'])
y = products['rating']
X_train, X_val, y_train, y_val =split_dataset(X, y, number=20, feature_encoding=['Word Count'], random_state=42)
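
Because the dataset can contain both encodings, it may help to see the prefix selection written so that both can be requested at once. The sketch below works on a plain list of column names rather than a dataframe; select_encoded_columns and split_sizes are hypothetical helpers, and the rounding in split_sizes (test count rounded up) is an assumption about how a fractional test size is resolved.

```python
import math

def select_encoded_columns(columns, feature_encoding):
    """Keep the column names that belong to the requested encoding(s)."""
    prefixes = []
    if 'Word Count' in feature_encoding:
        prefixes.append('word_count_')
    if 'TF-IDF' in feature_encoding:
        prefixes.append('tf_idf_word_count_')
    # str.startswith accepts a tuple of prefixes
    return [c for c in columns if c.startswith(tuple(prefixes))]

def split_sizes(n_samples, number):
    """Sizes (train, test) when `number` percent of the rows are held out."""
    n_test = math.ceil(n_samples * number / 100)
    return n_samples - n_test, n_test

# Hypothetical column names following the prefixes used in this assignment
columns = ['reviews', 'title', 'word_count_0', 'word_count_1', 'tf_idf_word_count_0']
print(select_encoded_columns(columns, ['Word Count']))
print(select_encoded_columns(columns, ['Word Count', 'TF-IDF']))
print(split_sizes(1000, 20))
```

Two independent `if` checks, rather than `if`/`elif`, are what allow both encodings to be selected together. The prefix 'word_count_' does not accidentally match the TF-IDF columns, because those names begin with 'tf_idf_'.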

**Checkpoint 5:** Now it is time to train the model with logistic regression and store it in st.session_state[model_name]. Complete the train_logistic_regression function, which takes the following inputs: training features, training targets, a string of the model name, and a dictionary with the logistic regression hyperparameters max_iter, solver, tol, and penalty (X_train, y_train, model_name, params, random_state=42). The function outputs the trained model (lg_model).
Perform the following tasks in the train_logistic_regression function:
1. Create a try and except block to train a logistic regression model.
2. Create a LogisticRegression class object using the random_state as input.
3. Fit the model to the data using the fit() function with input data X_train, y_train. Remember to flatten y_train into a one-dimensional array using the np.ravel() function.
4. Save the model in st.session_state[model_name].
5. Return the trained model.

In [24]:
products = products[products['rating'] != 3]
products.reset_index(drop=True, inplace=True)
products.shape

(1053, 6753)

In [30]:
products['sentiment'] = products['rating'].apply(lambda r : +1 if r > 3 else -1)
products.head(1)

Unnamed: 0,reviews,rating,title,word_count_0,word_count_1,word_count_2,word_count_3,word_count_4,word_count_5,word_count_6,...,word_count_6741,word_count_6742,word_count_6743,word_count_6744,word_count_6745,word_count_6746,word_count_6747,word_count_6748,word_count_6749,sentiment
0,I initially had trouble deciding between the p...,5.0,Paperwhite voyage no regrets,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
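
The labeling rule used above (ratings above 3 become +1, ratings below 3 become -1, and rating 3 is dropped as neutral) can be written as a small standalone function. This is an illustrative sketch with made-up ratings; rating_to_sentiment is a hypothetical helper and not the provided set_pos_neg_reviews() function.

```python
from collections import Counter

def rating_to_sentiment(rating, neutral=3):
    """Map a star rating to +1 or -1; return None for the neutral rating."""
    if rating == neutral:
        return None
    return 1 if rating > neutral else -1

# Hypothetical ratings, including one neutral review
ratings = [5.0, 5.0, 4.0, 3.0, 2.0, 1.0, 5.0]
labels = [s for s in map(rating_to_sentiment, ratings) if s is not None]
class_counts = Counter(labels)
print(labels)
print(class_counts)
```

Counting the labels is worth doing before training: if one class dominates, accuracy alone can look high even for a model that rarely predicts the minority class.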


In [10]:

def split_dataset(df, number, target, feature_encoding, random_state=42):
    """
    This function splits the dataset into the training and test sets.

    Input:
        - df: pandas dataframe containing the encoded features and the target
        - number: the percentage of test samples (e.g., 20 for a 20% test split)
        - target: name of the target column (e.g., 'sentiment')
        - feature_encoding: list of strings containing 'Word Count' and/or 'TF-IDF'
        - random_state: seed for the random shuffling applied before the split
    Output:
        - X_train_sentiment: training features (word encoded)
        - X_val_sentiment: test/validation features (word encoded)
        - y_train: training targets
        - y_val: test/validation targets
    """
    X_train, X_val, y_train, y_val = [], [], [], []
    X_train_sentiment, X_val_sentiment = [], []
    try:
        # Split dataset into y (target='sentiment') and X (all other features)
        X = df.drop(columns=[target])
        y = df[target]

        # Split the train and test sets into X_train, X_val, y_train, y_val using X, y, number/100, and random_state
        X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=number/100, random_state=random_state)

        # Use the column word_count and tf_idf_word_count as a feature prefix on X_train and X_val sets
        if 'Word Count' in feature_encoding:
            X_train_sentiment = X_train.loc[:, X_train.columns.str.startswith('word_count_')]
            X_val_sentiment = X_val.loc[:, X_val.columns.str.startswith('word_count_')]
        elif 'TF-IDF' in feature_encoding:
            X_train_sentiment = X_train.loc[:, X_train.columns.str.startswith('tf_idf_word_count_')]
            X_val_sentiment = X_val.loc[:, X_val.columns.str.startswith('tf_idf_word_count_')]

        # Compute dataset percentages
        train_percentage = (len(X_train) /
                            (len(X_train)+len(X_val)))*100
        test_percentage = (len(X_val) /
                           (len(X_train)+len(X_val)))*100

        # Print dataset split result (counts are integers, percentages use two decimals)
        st.markdown('The training dataset contains {0} observations ({1:.2f}%) and the test dataset contains {2} observations ({3:.2f}%).'.format(
            len(X_train), train_percentage, len(X_val), test_percentage))

        # Save train and test split to st.session_state
        st.session_state['X_train'] = X_train_sentiment
        st.session_state['X_val'] = X_val_sentiment
        st.session_state['y_train'] = y_train
        st.session_state['y_val'] = y_val
    except Exception as e:
        # Report what went wrong instead of hiding the error
        print(f'Exception thrown while splitting the dataset: {e}')
    return X_train_sentiment, X_val_sentiment, y_train, y_val

In [26]:
X = products.drop(columns=['rating'])
y = products['rating']
X_train, X_val, y_train, y_val =split_dataset(products,target='sentiment',number=20, feature_encoding=['Word Count'], random_state=42)

In [31]:
X_train

Unnamed: 0,word_count_0,word_count_1,word_count_2,word_count_3,word_count_4,word_count_5,word_count_6,word_count_7,word_count_8,word_count_9,...,word_count_6740,word_count_6741,word_count_6742,word_count_6743,word_count_6744,word_count_6745,word_count_6746,word_count_6747,word_count_6748,word_count_6749
786,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
261,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
299,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
713,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
990,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
330,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
466,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
121,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1044,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
def train_logistic_regression(X_train, y_train, model_name, params, random_state=42):
    # create a try and except block to train a logistic regression model
    try:
        # create a LogisticRegression object
        lg_model = LogisticRegression(random_state=random_state, **params)
        
        # fit the model to the data
        lg_model.fit(X_train, np.ravel(y_train))
        
        # save the model in session state
        st.session_state[model_name] = lg_model
        
        # print confirmation message
        print(f"{model_name} trained and saved in session state.")
        
        # return the trained model
        return lg_model
    
    except Exception as e:
        # print error message and return None if there was an error
        print(f"Error training logistic regression model: {e}")
        return None


In [20]:
train_logistic_regression(X_train, y_train, 'lg_model', {'max_iter': 100, 'solver': 'lbfgs', 'tol': 1e-4, 'penalty': 'l2'}, random_state=42)

lg_model trained and saved in session state.


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


**Checkpoint 6:** To fit the training data to the corresponding targets, we use the train_sgd_classifier function. It relies on scikit-learn's SGDClassifier, which implements regularized linear models with stochastic gradient descent (SGD) learning, such as SVM, logistic regression, squared hinge, and perceptron. We will only use it for logistic regression in this assignment, so pass “log_loss” for the loss parameter. The function takes the following inputs: training features, training targets, a string of the model name, and a dictionary of the model hyperparameters (X_train, y_train, model_name, params, random_state=42). To prevent overfitting, we will use ridge (L2) regularization: pass “l2” to the penalty parameter. The function returns the trained model (sgd_model).
Perform the following tasks in the train_sgd_classifier function:
1.	Create a try and except block to train a logistic regression model with the Stochastic Gradient Descent algorithm.
2.	Create a SGDClassifier class object using the random_state and params as input.
```
sgd_model = SGDClassifier(random_state=random_state,
    loss=params['loss'], penalty=params['penalty'], alpha=params['alpha'],
    max_iter=params['max_iter'], tol=params['tol'])
```
3.	Fit the model to the data using the fit() function with input data X_train, y_train. Remember to flatten y_train into a one-dimensional array using the np.ravel() function.
4.	Save the model in st.session_state[model_name].
5.	Return the model.
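
To make the idea behind this checkpoint concrete, here is a minimal pure-Python sketch of logistic regression trained with stochastic gradient descent and an L2 penalty. It is a simplified illustration, not how SGDClassifier is implemented: it assumes a constant learning rate and labels in {-1, +1}, and the function names and toy data are made up for the example.

```python
import math
import random

def train_sgd_logistic(X, y, alpha=0.0001, learning_rate=0.1, epochs=200, seed=42):
    """Fit weights w and bias b for labels in {-1, +1} with per-sample updates."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            z = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            # Derivative of the log loss log(1 + exp(-y*z)) with respect to z
            g = -y[i] / (1.0 + math.exp(y[i] * z))
            # Gradient step; alpha * wj is the gradient of the L2 penalty
            w = [wj - learning_rate * (g * xj + alpha * wj) for wj, xj in zip(w, X[i])]
            b -= learning_rate * g
    return w, b

def predict(X, w, b):
    """Predict +1 when the linear score is non-negative, otherwise -1."""
    return [1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1 for x in X]

# Toy word counts: column 0 counts "great", column 1 counts "broken"
X = [[2, 0], [1, 0], [0, 1], [0, 2]]
y = [1, 1, -1, -1]
w, b = train_sgd_logistic(X, y)
print(w, b)
print(predict(X, w, b))
```

The weight for the word associated with positive reviews ends up positive and the other negative, which is the pattern the coefficient inspection in Checkpoint 8 summarizes.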


In [11]:
def train_sgd_classifier(X_train, y_train, model_name, params, random_state=42):
    try:
        sgd_model = SGDClassifier(random_state=random_state,
                                  loss=params['loss'],
                                  penalty=params['penalty'],
                                  alpha=params['alpha'],
                                  max_iter=params['max_iter'],
                                  tol=params['tol'])
        sgd_model.fit(X_train, np.ravel(y_train))
        st.session_state[model_name] = sgd_model
        return sgd_model
    except Exception as e:
        st.write("Error during training:", e)


In [12]:
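# params must contain every key the function reads: loss, penalty, alpha, max_iter, tol
train_sgd_classifier(X_train, y_train, 'sgd_model', {'loss': 'log_loss', 'penalty': 'l2', 'alpha': 0.0001, 'max_iter': 100, 'tol': 1e-4}, random_state=42)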



**Checkpoint 7:** Now we will train the logistic regression model using cross validation. The train_sgdcv_classifer function takes the following inputs: training features, training targets, a string of the model name, a dictionary of the hyperparameters to tune during cross validation, and a dictionary of cross validation settings (X_train, y_train, model_name, params, cv_params, random_state=42). The function returns the trained model (sgdcv_model).
Perform the following tasks in the train_sgdcv_classifer function:
1.	Create a try and except block to train a logistic regression model with the Stochastic Gradient Descent algorithm and Repeated K-Fold Cross Validation, and search for the optimal parameters with grid search.
2.	Find the optimal hyperparameters using GridSearchCV with SGDClassifier as the estimator.
```
sgd_cv_model = GridSearchCV(estimator=SGDClassifier(random_state=random_state), param_grid=params, cv=cv_params['n_splits'])
```
3.	Fit the model to the data using the fit() function.
4.	Save the cross validation results (sgd_cv_model.cv_results_) in st.session_state['cv_results_'].
5.	Save the best model estimator in st.session_state[model_name].
```
st.session_state[model_name] = sgd_cv_model.best_estimator_
```
6.	Return the best model estimator.


In [25]:
def train_sgdcv_classifer(X_train, y_train, model_name, params, cv_params, random_state=42):
    try:
        # Create SGDClassifier object
        sgd_model = SGDClassifier(random_state=random_state, loss=params['loss'], penalty=params['penalty'], 
                                  alpha=params['alpha'], max_iter=params['max_iter'], tol=params['tol'])
        
        # Define grid of hyperparameters to search over
        param_grid = {'alpha': params['alpha_range'], 'penalty': params['penalty_range']}
        
        # Define cross validation strategy
        cv = RepeatedKFold(n_splits=cv_params['n_splits'], n_repeats=cv_params['n_repeats'], random_state=random_state)
        
        # Perform GridSearchCV to find optimal hyperparameters
        sgd_cv_model = GridSearchCV(estimator=sgd_model, param_grid=param_grid, cv=cv, scoring='accuracy')
        sgd_cv_model.fit(X_train, y_train)
        
        # Save cross validation results and best model estimator
        st.session_state['cv_results_'] = sgd_cv_model.cv_results_
        st.session_state[model_name] = sgd_cv_model.best_estimator_
        # Return best model estimator
        return sgd_cv_model.best_estimator_
    
    except Exception as e:
        # Include the underlying error so failures can be diagnosed
        st.error(f'Error: Could not train logistic regression model with SGDClassifier algorithm and cross validation: {e}')


In [26]:
params = {'loss': 'log_loss', 'penalty': 'l2', 'alpha': 0.0001, 'max_iter': 1000, 'tol': 1e-3, 'alpha_range': [0.0001, 0.001, 0.01, 0.1], 'penalty_range': ['l2', 'l1', 'elasticnet']}
cv_params = {'n_splits': 5, 'n_repeats': 2}

train_sgdcv_classifer(X_train, y_train, 'SGDClassifier', params, cv_params, random_state=42)
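
It can be useful to estimate how much work this grid search does before running it. The sketch below enumerates the candidate settings from the alpha and penalty ranges used above and generates repeated K-fold splits by hand. The repeated_kfold generator is a hypothetical stand-in for RepeatedKFold (it assigns folds by striding through a shuffled index list, which is an assumption and not scikit-learn's exact fold layout), and the sample size of 20 is arbitrary.

```python
import random
from itertools import product

param_grid = {'alpha': [0.0001, 0.001, 0.01, 0.1],
              'penalty': ['l2', 'l1', 'elasticnet']}
cv_params = {'n_splits': 5, 'n_repeats': 2}

# Every combination of hyperparameter values is one candidate model
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

def repeated_kfold(n_samples, n_splits, n_repeats, seed=42):
    """Yield (train_indices, val_indices) pairs, reshuffling before each repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        for k in range(n_splits):
            val = indices[k::n_splits]  # every n_splits-th shuffled index
            val_set = set(val)
            train = [i for i in indices if i not in val_set]
            yield train, val

splits = list(repeated_kfold(20, cv_params['n_splits'], cv_params['n_repeats']))
total_fits = len(candidates) * len(splits)
print(len(candidates), len(splits), total_fits)
```

With 12 candidates and 10 splits, the search trains 120 models, and GridSearchCV with its default refit behavior then fits the best candidate once more on the full training set. On thousands of word-count columns this can take noticeably longer than a single fit.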




**Checkpoint 8:** Now that we have the trained models, we need to get their coefficients and summarize the positive and negative coefficients. Complete the inspect_coefficients function, which takes the following inputs: a dictionary mapping model names to trained models, and a list of strings of the model names whose coefficients to print (trained_models, inspect_models). The inspect_models parameter is given by the user in Streamlit. The function returns a dictionary that contains the coefficients of the selected models, with the keys: 'Logistic Regression', 'Stochastic Gradient Descent', and 'Stochastic Gradient Descent with Cross Validation'.
Perform the following tasks in the inspect_coefficients function:
1.	Write a for loop through the model names and trained models.
```
for name, model in trained_models.items():
```
2.	In the for loop,
a.	Check that the model is not None.
b.	If the model is valid, store the coefficients in out_dict[name] using model.coef_ (same for all models) and display the coefficients.
c.	Compute and print the following values:
i.	Total number of coefficients
ii.	Number of positive coefficients
iii.	Number of negative coefficients
3.	Display 'cv_results_' in st.session_state['cv_results_'] if it exists (from Checkpoint 7).


In [60]:
def inspect_coefficients(trained_models, inspect_models):
    out_dict = {}

    for name, model in trained_models.items():
        # Only report models the user selected that have actually been trained
        if name in inspect_models and model is not None:
            # coef_ has shape (1, n_features) for a binary classifier; take the single row
            coefficients = model.coef_[0]
            out_dict[name] = coefficients
            print(f"Model: {name}")
            print(f"Number of coefficients: {len(coefficients)}")
            print(f"Number of positive coefficients: {(coefficients > 0).sum()}")
            print(f"Number of negative coefficients: {(coefficients < 0).sum()}")
            print(f"Coefficients: {coefficients}")
            print("---------------------------------------------------")

    # Show the cross-validation results saved in Checkpoint 7, if any
    if 'cv_results_' in st.session_state:
        print('cv_results_')
        print(st.session_state['cv_results_'])

    return out_dict


## 3.4	Test Regression Models (2 points)

The goal of this page is to evaluate the classification models using precision, recall, accuracy, and ROC curves. First, the user selects the performance metrics for evaluation. Then, they select the classification models to evaluate. Using these inputs, you will write two functions: 1) one that computes the evaluation metrics using a trained model and 2) one that displays a ROC curve using precision and recall. At the end of this page, the user can select a model to deploy on the 'Deploy App' page.


**Checkpoint 9:** Next, evaluate metrics for a given logistic regression model. Complete the compute_eval_metrics function, which takes the following inputs: the pandas dataframe with input features, the pandas dataframe with true targets, the model to evaluate, and a list of strings naming the metrics to compute, chosen from 'precision', 'recall', and 'accuracy' (X, y_true, model, metrics). The function returns a dictionary which contains the computed metrics of the selected model, with the following structure: {metric1: value1, metric2: value2, ...} (metric_dict).
Perform the following tasks in the compute_eval_metrics function:
1.	Make a prediction using the model and input data
2.	Write a for loop that iterates through metrics, a list containing one or more of the strings 'precision', 'recall', 'accuracy'
a.	Check the metric name and compute it based on the string input. For example, if metric='precision', compute the precision from the predictions and the input y_true.
b.	Store the result in metric_dict[metric]
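
As a reminder of what the three metrics measure, the sketch below computes them by hand from counts of true positives, false positives, and false negatives. It is an illustration only: `manual_metrics` and the toy labels are hypothetical, the labels are assumed to be +1 (positive) and -1 (negative), and an undefined ratio is reported as 0.0. In the checkpoint itself, use scikit-learn's precision_score, recall_score, and accuracy_score.

```python
def manual_metrics(y_true, y_pred, positive_label=1):
    """Compute precision, recall, and accuracy from paired true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        # Of the reviews predicted positive, the fraction that really are positive
        'precision': tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the truly positive reviews, the fraction the model found
        'recall': tp / (tp + fn) if (tp + fn) else 0.0,
        # Fraction of all reviews labeled correctly
        'accuracy': correct / len(pairs),
    }

# Toy labels for illustration only
y_true = [1, 1, 1, 1, -1, -1]
y_pred = [1, 1, -1, -1, 1, -1]
result = manual_metrics(y_true, y_pred)
print(result)
```

Here two of the three predicted positives are correct (precision 2/3), two of the four true positives are found (recall 1/2), and three of the six labels match (accuracy 1/2).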


In [62]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

def compute_eval_metrics(X, y_true, model, metrics):
    # Make predictions
    y_pred = model.predict(X)
    
    # Initialize dictionary for metric values
    metric_dict = {}
    
    # Compute selected metrics
    for metric in metrics:
        if metric == 'precision':
            metric_dict[metric] = precision_score(y_true, y_pred)
        elif metric == 'recall':
            metric_dict[metric] = recall_score(y_true, y_pred)
        elif metric == 'accuracy':
            metric_dict[metric] = accuracy_score(y_true, y_pred)
        elif metric == 'roc_auc':
            # AUC needs continuous scores; fall back to hard labels if the model has no predict_proba
            y_score = model.predict_proba(X)[:, 1] if hasattr(model, 'predict_proba') else y_pred
            metric_dict[metric] = roc_auc_score(y_true, y_score)
            
    return metric_dict

**Checkpoint 10:** Next, plot the ROC curve for each model in trained_models on the training and validation datasets. (Because the plot shows precision against recall, it is strictly speaking a precision-recall curve; this assignment refers to it as a ROC curve throughout.) To do this, complete the plot_roc_curve function, which takes the following inputs: training input data, validation input data, true training targets, true validation targets, and the trained models in a dictionary accessed with model name (X_train, X_val, y_train, y_val, trained_models). The function returns the plotted figure and a dataframe containing the train and validation precision and recall, with the following keys:
●	df[model_name + " Train Precision"] = train_precision_all
●	df[model_name + " Train Recall"] = train_recall_all
●	df[model_name + " Validation Precision"] = val_precision_all
●	df[model_name + " Validation Recall"] = val_recall_all
Perform the following tasks in the plot_roc_curve function: Write a for loop that iterates through the trained model names with an enumerator variable (e.g., i) to use for plotting.
1.	Use the trained model in trained_models[model_name] to:
i.	Make predictions on the training set using the predict_proba() function
ii.	Make predictions on the validation set using the predict_proba() function
iii.	Apply the threshold to the predictions on the training set using the apply_threshold function
iv.	Apply the threshold to the predictions on the validation set using the apply_threshold function
v.	Compute precision and recall on the training set using the predictions on the training set (with threshold applied) and the true values (y_train). Use the precision_score (set zero_division=1) and recall_score functions.
vi.	Compute precision and recall on the validation set using the predictions on the validation set (with threshold applied) and the true values (y_val). Use the precision_score (set zero_division=1) and recall_score functions.
The apply_threshold function converts the predicted probabilities of a classification model into class labels using a threshold value. The probabilities are a list of arrays where each element has two values (the probability that a review is negative and the probability that the review is positive). These probabilities are related: probability(positive) = 1 - probability(negative), so the two always sum to 1. As a result, we only need to check one probability value and apply the threshold.
2.	Plot a ROC curve showing the results on the training and validation sets using train_precision_all, train_recall_all, val_precision_all, and val_recall_all. Plot precision on the y-axis and recall on the x-axis (see the code snippet below).
```
fig.add_trace(go.Scatter(x=train_recall_all, y=train_precision_all, name="Train"), row=i+1, col=1) # use enumerated value i to align figures vertically
fig.add_trace(go.Scatter(x=val_recall_all, y=val_precision_all, name="Validation"), row=i+1, col=1) # use enumerated value i
fig.update_xaxes(title_text="Recall")
fig.update_yaxes(title_text='Precision', row=i+1, col=1) # use enumerated value i
fig.update_layout(title=model_name+' ROC Curve')
```

3. Save the results (train_precision_all, train_recall_all, val_precision_all, and val_recall_all) in df.
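
To see why varying the threshold traces out a curve, the sketch below applies three thresholds to made-up probabilities. It is an illustration under stated assumptions: the probability values are invented, `apply_threshold_sketch` is a hypothetical stand-in for apply_threshold, and the negative class is assumed to be encoded as 0.

```python
def apply_threshold_sketch(probabilities, threshold):
    """Label a review 1 when P(positive) >= threshold, otherwise 0 (assumed negative label)."""
    return [1 if p_pos >= threshold else 0 for _, p_pos in probabilities]

# Made-up [P(negative), P(positive)] pairs for four reviews
probabilities = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.05, 0.95]]

# Each pair sums to 1, so checking P(positive) alone is enough
pairs_sum_to_one = all(abs(sum(pair) - 1.0) < 1e-9 for pair in probabilities)

# A higher threshold labels fewer reviews as positive
labels_by_threshold = {
    t: apply_threshold_sketch(probabilities, t) for t in (0.25, 0.5, 0.75)
}
for t, labels in labels_by_threshold.items():
    print(t, labels)
```

Raising the threshold turns predicted positives into negatives, which typically raises precision and lowers recall. Each threshold therefore contributes one (recall, precision) point to the curve.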


In [63]:
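import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.metrics import precision_score, recall_score

def apply_threshold(y_pred, threshold):
    # Each row of y_pred is [P(negative), P(positive)]; label 1 when P(positive) meets the threshold.
    # Assumes the negative class is encoded as 0; return -1 instead if the targets are +1/-1.
    return [1 if pred[1] >= threshold else 0 for pred in y_pred]

def plot_roc_curve(X_train, X_val, y_train, y_val, trained_models):
    df = pd.DataFrame()
    # The threshold grid is an assumption; the handout does not fix its range or step
    threshold_values = np.linspace(0, 1, num=100)
    fig = make_subplots(rows=len(trained_models), cols=1)
    for i, model_name in enumerate(trained_models.keys()):
        model = trained_models[model_name]
        y_train_pred = model.predict_proba(X_train)
        y_val_pred = model.predict_proba(X_val)

        # Sweep the threshold to get one (recall, precision) point per value;
        # go.Scatter needs sequences, so a single threshold would not plot
        train_precision_all, train_recall_all = [], []
        val_precision_all, val_recall_all = [], []
        for threshold in threshold_values:
            train_pred = apply_threshold(y_train_pred, threshold)
            val_pred = apply_threshold(y_val_pred, threshold)
            train_precision_all.append(precision_score(y_train, train_pred, zero_division=1))
            train_recall_all.append(recall_score(y_train, train_pred, zero_division=1))
            val_precision_all.append(precision_score(y_val, val_pred, zero_division=1))
            val_recall_all.append(recall_score(y_val, val_pred, zero_division=1))

        fig.add_trace(go.Scatter(x=train_recall_all, y=train_precision_all, name="Train"), row=i+1, col=1)
        fig.add_trace(go.Scatter(x=val_recall_all, y=val_precision_all, name="Validation"), row=i+1, col=1)
        fig.update_xaxes(title_text="Recall")
        fig.update_yaxes(title_text='Precision', row=i+1, col=1)
        fig.update_layout(title=model_name+' ROC Curve')
        # model_name is already a string key, so it is used directly in the column names
        df[model_name + " Train Precision"] = train_precision_all
        df[model_name + " Train Recall"] = train_recall_all
        df[model_name + " Validation Precision"] = val_precision_all
        df[model_name + " Validation Recall"] = val_recall_all

    # Return the figure and the dataframe; display the figure with fig.show() or st.plotly_chart(fig)
    return fig, df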


## 3.5	Deploy ML Application (1 point)
The goal of this page is to deploy a Product Sentiment Classification application that takes a user’s review text as input and predicts whether the review is positive or negative. Your goal is to restore the dataset from the previous page. Then, write the deploy_model() function, which uses the selected model from Page C and the input text from the user to predict the review sentiment.

**Checkpoint 11:** Finally, deploy the model! Restore the trained model from st.session_state['deploy_model'] and use it to predict the sentiment of the input data. To do this, complete the deploy_model function, which takes the dataframe containing the dataset (df). The function returns the product sentiment, +1 or -1 (Product_sentiment).
Perform the following tasks in the deploy_model function:
1.	Restore the model for deployment from st.session_state['deploy_model']
2.	Predict the product sentiment of the input text using the predict function, e.g., model.predict(data)
The website uses the output of deploy_model to display the product sentiment.


In [None]:
def deploy_model(df):
    try:
        # Restore trained model
        model = st.session_state['deploy_model']

        # Make predictions on test data
        y_pred = model.predict(df)

        # Convert predictions to +1 or -1 sentiment
        Product_sentiment = [1 if pred == 1 else -1 for pred in y_pred]

        return Product_sentiment

    except Exception:
        # Covers a missing 'deploy_model' key as well as prediction failures
        st.error('Error: Could not deploy model. Please train the model first.')
        return None
