# Lab 8: Define and Solve an ML Problem of Your Choosing

In [3]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns
import gensim

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports.  You will then inspect the data with your problem in mind and begin to formulate a  project plan. You will then implement the machine learning project plan. 

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [4]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


#df = pd.read_csv(WHRDataSet_filename, header = 0)
df = pd.read_csv(bookReviewDataSet_filename, header = 0)

df.head()
#print(df['country'])
# print(df.shape)
# nan_count = df.isnull().sum()# YOUR CODE HERE
# nan_count

Unnamed: 0,Review,Positive Review
0,This was perhaps the best of Johannes Steinhof...,True
1,This very fascinating book is a story written ...,True
2,The four tales in this collection are beautifu...,True
3,The book contained more profanity than I expec...,False
4,We have now entered a second time of deep conc...,True


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classificaiton or multi-class classifiction problem?
4. What are your features? (note: this list may change after your explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

I would like to use unsupervised learning on the book review dataset. maybe use a different model that uses word embeddings instead of TF-IDF. After doing more research, I am planning on using the word2vec model since it's an unsupervised learning model. and then in the end, i can use the actual labels of each data point to check the model's accuracy.

This is a classification problem. I only have one feature which in its raw form, is the review itself, but I will convert it into word embeddings

This is an important problem because it can help authors assess whether or not their book is getting good feedback/responses without having to go through thousands of reviews manually. It saves a lot of time and resources which will always be valuable. It's also different from our past labs and specifically, our past lab using this same dataset. This is because the accuracy I got for using TF-IDF was only 80% on the test data, so hopefully, this approach will provide better results. 

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * finding and replacing outliers
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) you would like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-drown menu.

In [5]:
y = df['Positive Review'] 
X = df['Review']

X.shape
X.head()

0    This was perhaps the best of Johannes Steinhof...
1    This very fascinating book is a story written ...
2    The four tales in this collection are beautifu...
3    The book contained more profanity than I expec...
4    We have now entered a second time of deep conc...
Name: Review, dtype: object

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

I do not have a completely new feature list since I am only working with one feature. But I will be transforming the raw feature into a vectorized form for my model. For the data preparation technique, I will use the built-in function from Gensim to preprocess the text. This function will remove some stop words, covert all text to lowercase, remove punctuation and tokenize the text. I will be using the word2vec model. I will do this by converting my training and testing data to word embeddings.

Then I will convert the features in the training and test datasets into feature vectors using the word embeddings. I will use this to train a logisitc regression model

After training and testing, to improve the model, I may implement a k fold cross validation since my dataset is quite small (only 1973 data points).

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [7]:
original_X = X
X = X.apply(lambda row: gensim.utils.simple_preprocess(row))

In [8]:
X.head()

0    [this, was, perhaps, the, best, of, johannes, ...
1    [this, very, fascinating, book, is, story, wri...
2    [the, four, tales, in, this, collection, are, ...
3    [the, book, contained, more, profanity, than, ...
4    [we, have, now, entered, second, time, of, dee...
Name: Review, dtype: object

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.80, random_state=1234)

X_train.head()

1369    [as, my, brother, said, when, flipping, throug...
1366    [cooper, book, is, yet, another, warm, and, fu...
385     [have, many, robot, books, and, this, is, the,...
750     [as, china, re, emerges, as, dominant, power, ...
643     [have, been, huge, fan, of, michael, crichton,...
Name: Review, dtype: object

In [10]:
print("Begin")
word2vec_model = gensim.models.Word2Vec(X_train,
                                   vector_size=100,
                                   window=5,
                                   min_count=2)

print("End")

Begin
End


In [11]:
len(word2vec_model.wv.key_to_index)  # retrieve vocabulary and measure its size

10354

In [12]:
top25 = word2vec_model.wv.index_to_key[:25]
top25

['the',
 'of',
 'and',
 'to',
 'is',
 'in',
 'this',
 'it',
 'that',
 'book',
 'for',
 'with',
 'as',
 'not',
 'was',
 'but',
 'you',
 'are',
 'on',
 'have',
 'he',
 'be',
 'his',
 'or',
 'one']

In [13]:
pd.DataFrame({w:word2vec_model.wv[w] for w in top25}).T.style.background_gradient(cmap='coolwarm').set_precision(2)

  pd.DataFrame({w:word2vec_model.wv[w] for w in top25}).T.style.background_gradient(cmap='coolwarm').set_precision(2)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
the,-0.13,0.09,0.63,0.13,-0.21,-0.74,0.97,0.94,-0.62,-0.03,-0.45,-1.34,-0.23,0.34,0.5,-0.78,0.43,-0.55,-0.8,-1.08,0.09,0.35,0.56,-0.36,0.53,-0.04,-0.47,0.07,-0.74,-0.07,0.32,0.71,-0.46,-0.62,-0.59,0.33,0.63,-0.32,-0.49,-0.62,-0.28,-0.13,-0.47,-0.65,0.44,0.19,-0.8,0.69,0.44,0.59,0.35,-0.52,-0.08,0.59,-0.51,0.5,0.46,-0.15,-0.73,0.54,0.31,0.07,-0.22,-0.19,-1.08,0.69,-0.35,-0.19,-0.92,1.32,-0.33,1.75,0.93,-0.76,0.53,0.18,0.19,0.67,-0.74,-0.04,-0.6,0.62,0.16,0.77,-0.46,-0.01,0.57,0.91,0.78,0.71,1.1,0.33,0.22,-0.03,1.98,0.84,1.09,-1.12,-0.15,0.19
of,-0.49,0.7,-0.0,-0.13,0.36,-1.3,1.17,1.52,-0.95,-0.4,-0.44,-1.65,-0.4,-0.19,0.86,-0.86,0.83,-0.92,-0.58,-1.63,-0.27,-0.06,0.82,-1.08,0.17,-0.3,-0.92,0.25,-0.44,0.36,1.0,0.43,-1.12,-0.89,-0.56,1.27,1.2,0.36,-0.37,-0.51,0.56,-0.38,-0.83,-0.16,0.18,-0.09,-0.31,0.87,0.62,0.56,1.03,-1.28,0.12,0.4,-1.14,0.9,0.59,0.31,-0.58,0.63,-0.17,0.17,0.01,-0.83,-0.37,0.6,0.25,-0.14,-0.12,0.7,-0.02,1.35,0.69,-0.34,0.17,0.31,-0.59,0.7,-0.81,-0.05,-0.77,0.75,0.2,0.25,-0.14,-0.17,0.34,0.4,1.26,0.57,0.92,0.01,0.22,-0.6,1.76,0.54,1.48,-0.67,0.12,-0.12
and,-0.22,0.55,0.3,-0.26,0.19,-1.02,0.8,1.3,-0.63,-0.16,-0.27,-1.29,-0.21,0.08,0.56,-0.72,0.21,-0.8,-0.41,-1.4,-0.14,0.28,0.71,-0.6,-0.04,-0.02,-0.5,-0.1,-0.37,0.05,0.66,0.12,-0.22,-0.71,-0.46,0.81,0.76,0.0,-0.53,-0.96,0.19,-0.33,-0.51,-0.14,0.51,-0.15,-0.32,0.32,0.67,0.57,0.3,-0.82,-0.16,0.25,-0.69,0.48,0.6,-0.05,-0.56,0.39,0.22,0.15,0.06,-0.05,-0.68,0.72,0.14,0.11,-0.79,1.12,-0.3,1.31,0.93,-0.2,0.62,0.17,-0.03,0.25,-0.84,0.23,-0.63,0.39,-0.3,0.83,-0.34,-0.41,0.21,0.8,0.92,0.51,1.01,0.46,0.15,-0.2,1.68,0.68,0.94,-0.59,-0.01,-0.19
to,-0.1,0.46,0.12,-0.55,0.4,-1.39,0.67,1.66,-0.23,-0.22,0.6,-0.65,-0.27,0.12,0.34,-0.19,-0.07,-0.99,-0.24,-1.64,0.11,0.1,1.13,-0.45,-0.59,0.03,-0.11,-0.4,0.11,-0.1,0.81,-0.42,0.72,-0.83,-0.1,0.62,0.7,-0.3,-0.65,-1.39,0.39,-0.65,-0.34,0.15,0.96,-0.21,-0.13,-0.11,1.07,0.64,-0.23,-0.86,-0.26,-0.01,-0.3,0.21,0.95,0.06,-0.3,0.39,0.42,0.11,0.63,0.4,-0.61,0.72,0.36,0.85,-1.01,0.93,-0.48,0.72,0.95,0.42,0.9,-0.38,0.19,-0.26,-1.02,0.28,-0.57,0.16,-0.95,1.35,-0.34,-0.84,0.22,0.64,0.68,0.27,0.72,0.85,0.18,0.23,1.55,0.41,0.33,-0.3,-0.17,-0.55
is,-0.18,0.43,-0.3,-0.1,0.15,-1.17,0.96,1.86,-0.5,-0.35,0.01,-0.81,-0.68,-0.06,-0.26,-0.41,0.65,-0.38,-0.27,-1.85,0.1,0.01,1.07,-0.73,-0.06,0.02,-0.52,0.3,-0.53,0.2,1.01,0.05,0.32,-0.91,-0.52,0.69,0.7,0.12,-0.28,-0.38,0.36,-0.78,-0.66,0.17,0.55,-0.16,-0.23,0.09,0.68,0.19,0.55,-0.57,-0.05,0.05,-0.7,-0.14,0.57,-0.18,-0.69,0.48,0.24,-0.47,0.69,0.1,-0.57,0.95,0.33,0.37,-0.94,0.72,-0.55,0.82,0.76,0.12,1.0,-0.03,0.26,0.59,-0.56,0.23,-0.99,-0.12,-0.18,0.49,-0.15,-0.81,0.93,0.48,0.66,0.33,0.8,0.8,0.75,-0.13,1.77,-0.13,0.56,-0.31,0.36,-0.0
in,-0.28,0.57,0.03,-0.23,0.22,-1.13,0.95,1.49,-0.71,-0.32,-0.19,-1.23,-0.35,-0.09,0.49,-0.68,0.5,-0.84,-0.42,-1.56,-0.12,0.07,0.8,-0.78,-0.03,-0.15,-0.63,-0.05,-0.3,0.19,0.87,0.17,-0.37,-0.79,-0.48,0.96,0.84,0.13,-0.38,-0.77,0.39,-0.45,-0.57,-0.1,0.39,-0.14,-0.25,0.44,0.65,0.53,0.5,-0.96,0.02,0.3,-0.79,0.51,0.62,0.07,-0.57,0.48,0.13,0.02,0.23,-0.33,-0.5,0.66,0.17,0.13,-0.54,0.76,-0.2,1.12,0.81,-0.15,0.51,0.12,-0.17,0.48,-0.75,0.1,-0.63,0.37,-0.18,0.53,-0.19,-0.46,0.35,0.5,0.97,0.45,0.92,0.37,0.19,-0.27,1.68,0.46,0.9,-0.59,0.11,-0.17
this,0.45,-0.12,0.22,0.25,-0.3,-0.87,0.81,1.67,-0.55,-0.26,0.17,-0.68,-0.32,0.68,-0.65,0.07,0.69,0.16,-0.74,-1.69,0.46,0.08,0.98,-0.57,0.16,0.69,-0.16,-0.38,-0.66,0.26,0.91,0.36,0.99,-0.49,-0.62,0.03,0.06,-0.8,-0.71,-0.93,-0.21,-0.93,-0.02,-0.44,0.67,0.18,-0.7,-0.28,0.24,0.11,-0.09,-0.86,-0.6,0.5,-0.47,-0.48,0.4,-0.85,-1.04,0.45,0.66,-0.74,0.85,0.52,-1.1,1.14,-0.51,0.35,-1.44,0.75,-0.56,1.06,0.67,-0.28,1.39,-0.5,0.85,1.01,-0.67,-0.02,-0.44,-0.38,-0.73,1.1,-0.23,-0.93,1.16,0.66,0.3,0.43,0.7,1.11,0.3,0.44,1.59,0.17,0.37,-1.1,0.22,0.67
it,0.22,0.26,0.1,0.2,-0.2,-0.99,0.45,1.75,-0.63,-0.33,0.17,-0.64,-0.31,0.58,-0.43,0.07,0.32,-0.31,-0.4,-1.71,0.49,0.48,0.94,-0.41,-0.16,0.5,-0.17,-0.23,-0.6,0.22,0.95,-0.12,1.34,-0.64,-0.57,0.45,0.28,-0.34,-0.65,-1.05,-0.11,-0.67,-0.15,-0.14,0.76,-0.1,-0.47,-0.35,0.64,0.23,-0.21,-0.43,-0.55,0.19,-0.26,-0.36,0.47,-0.65,-0.81,0.44,0.71,-0.27,0.83,0.75,-0.82,0.99,-0.32,0.46,-1.3,0.9,-0.45,0.66,0.73,-0.13,1.38,-0.34,0.75,0.37,-0.4,0.17,-0.65,-0.52,-0.92,1.24,-0.36,-0.94,0.67,0.8,0.56,0.43,0.72,1.21,0.32,0.58,1.48,0.02,-0.2,-0.55,0.26,0.14
that,-0.05,0.58,0.16,-0.03,0.01,-1.08,0.75,1.54,-0.68,-0.26,0.14,-0.91,-0.17,0.38,0.12,-0.29,0.13,-0.55,-0.43,-1.62,0.11,0.47,0.88,-0.53,-0.21,0.19,-0.3,-0.38,-0.4,-0.07,0.86,-0.2,0.85,-0.59,-0.36,0.53,0.45,-0.19,-0.55,-1.2,-0.01,-0.5,-0.24,-0.16,0.62,-0.2,-0.32,-0.15,0.64,0.48,-0.06,-0.54,-0.39,0.16,-0.21,-0.04,0.62,-0.32,-0.63,0.39,0.47,-0.1,0.45,0.47,-0.67,0.76,-0.09,0.34,-1.09,1.05,-0.45,0.73,0.92,0.01,1.07,-0.16,0.43,0.17,-0.6,0.12,-0.66,-0.18,-0.77,1.11,-0.38,-0.67,0.37,0.68,0.65,0.43,0.84,0.91,0.22,0.31,1.52,0.34,0.21,-0.42,0.18,-0.26
book,-0.11,0.22,-0.68,0.44,-0.21,-1.21,0.38,2.22,-0.87,-0.41,0.06,-0.46,-0.66,0.22,-0.7,0.36,0.71,-0.64,-0.28,-1.99,0.64,0.35,0.84,-0.47,-0.01,-0.08,-0.54,0.51,-0.87,0.36,1.51,0.08,1.31,-0.66,-0.81,0.97,0.7,0.22,-0.4,-0.33,0.23,-0.44,-0.7,-0.07,0.6,-0.09,-0.46,0.25,0.98,0.19,0.29,-0.15,-0.41,0.27,-0.32,-0.5,0.32,-0.49,-0.63,0.66,0.61,-0.16,1.2,0.37,-0.63,0.85,-0.23,0.35,-0.78,0.35,-0.14,0.41,0.38,-0.3,1.25,-0.4,0.53,0.43,0.13,0.07,-0.71,-0.67,-0.64,1.0,-0.23,-1.07,0.6,0.37,0.99,0.55,0.5,1.39,0.79,0.65,1.85,-0.75,-0.6,-0.22,0.83,0.05


In [14]:
print(X_train.head())
print(X_test.head())

1369    [as, my, brother, said, when, flipping, throug...
1366    [cooper, book, is, yet, another, warm, and, fu...
385     [have, many, robot, books, and, this, is, the,...
750     [as, china, re, emerges, as, dominant, power, ...
643     [have, been, huge, fan, of, michael, crichton,...
Name: Review, dtype: object
1692    [bought, this, book, this, weekend, as, we, re...
1744    [when, first, came, to, iran, black, clad, wom...
1236    [this, book, is, packed, full, of, incredible,...
21      [while, this, book, is, good, attempt, at, pla...
894     [if, your, looking, to, increase, your, person...
Name: Review, dtype: object


In [15]:
words = set(word2vec_model.wv.index_to_key)

print('Begin transforming X_train')
X_train_word_embeddings = np.array([np.array([word2vec_model.wv[word] for word in words if word in training_example])
                        for training_example in X_train], dtype=object)
print('Finish transforming X_train')

print('Begin transforming X_test')
X_test_word_embeddings = np.array([np.array([word2vec_model.wv[word] for word in words if word in training_example])
                        for training_example in X_test], dtype=object)
print('Finish transforming X_test')


Begin transforming X_train
Finish transforming X_train
Begin transforming X_test
Finish transforming X_test


In [16]:
X_train_feature_vector = []
for w in X_train_word_embeddings:
    if w.size:
        X_train_feature_vector.append(w.mean(axis=0))
    else:
        X_train_feature_vector.append(np.zeros(50, dtype=float))
        
X_test_feature_vector = []
for w in X_test_word_embeddings:
    if w.size:
        X_test_feature_vector.append(w.mean(axis=0))
    else:
        X_test_feature_vector.append(np.zeros(50, dtype=float))

In [19]:
# # 1. Create a LogisticRegression model object, and fit a Logistic Regression model to the transformed training data
# model = LogisticRegression(max_iter=500)
# model.fit(X_train_feature_vector, y_train)

# # 2. Make predictions on the transformed test data using the predict_proba() method and 
# # save the values of the second column
# probability_predictions = model.predict_proba(X_test_feature_vector)[:,1]

# # 3. Make predictions on the transformed test data using the predict() method 
# class_label_predictions = model.predict(X_test_feature_vector)

# # 4. Compute the Area Under the ROC curve (AUC) for the test data. Note that this time we are using one 
# # function 'roc_auc_score()' to compute the auc rather than using both 'roc_curve()' and 'auc()' as we have 
# # done in the past
# auc = roc_auc_score(y_test, probability_predictions)
# print('AUC on the test data: {:.4f}'.format(auc))
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

# 1. Create a KMeans model object
# Note: You'll need to choose the number of clusters (n_clusters) based on your data and domain knowledge
model = KMeans(n_clusters=2, random_state=42)

# 2. Fit the model to the feature vectors (without using labels)
model.fit(X_train_feature_vector)

# 3. Make predictions on the test data
cluster_predictions = model.predict(X_test_feature_vector)

# 4. If you want to get probability-like scores, you can use the distances to cluster centers
distances = model.transform(X_test_feature_vector)
probability_like_scores = 1 / (1 + distances)

# 5. Evaluate the clustering performance using silhouette score
silhouette_avg = silhouette_score(X_test_feature_vector, cluster_predictions)
print('Silhouette Score on the test data: {:.4f}'.format(silhouette_avg))

# 6. If you have true labels and want to compare, you can use adjusted rand index
from sklearn.metrics import adjusted_rand_score
ari = adjusted_rand_score(y_test, cluster_predictions)
print('Adjusted Rand Index: {:.4f}'.format(ari))

Silhouette Score on the test data: 0.4061
Adjusted Rand Index: -0.0024


## ANALYSIS/ PIVOT:
It seems that using word embeddings does not do very well with unsupervised learning. 
I am planning on pivoting to trying to on TD-IDF instead

In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import classification_report

# File path
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")

# Load dataset
df = pd.read_csv(bookReviewDataSet_filename, header=0)

# Extract features and labels
X = df['Review']
y = df['Positive Review']  # This will be used later for comparison

# Define custom stop words
custom_stop_words = [
    'the', 'and', 'is', 'in', 'to', 'a', 'of', 'that', 'it', 'for', 'on', 'with', 'as', 'an', 'this', 
    'was', 'which', 'at', 'by', 'from', 'but', 'or', 'are', 'not', 'we', 'be', 'have', 'has', 'had', 
    'will', 'would', 'should', 'may', 'might', 'could', 'than', 'then', 'also', 'some', 'so', 'one', 
    'two', 'three', 'these', 'those', 'there', 'where', 'how', 'when', 'why'
]

# Create and fit TfidfVectorizer with custom stop words
tfidf_vectorizer = TfidfVectorizer(stop_words=custom_stop_words)
X_tfidf = tfidf_vectorizer.fit_transform(X)

# Define number of clusters
num_clusters = 2

# Create and fit KMeans model
kmeans = KMeans(n_clusters=num_clusters, random_state=1234)
kmeans.fit(X_tfidf)

# Assign cluster labels to original data
df['Cluster'] = kmeans.labels_

# Print the first few entries with their cluster labels
print(df.head())

# Analyze the clusters
# Calculate the majority label for each cluster
cluster_labels = df.groupby('Cluster')['Positive Review'].apply(lambda x: x.mode()[0])

# Map clusters to predicted labels
df['Predicted Label'] = df['Cluster'].map(cluster_labels)

# Evaluate the performance
print(classification_report(df['Positive Review'], df['Predicted Label']))

# Print out the clusters to analyze manually
for cluster_num in range(num_clusters):
    print(f"Cluster {cluster_num}:")
    print(df[df['Cluster'] == cluster_num]['Review'].head(10))  # Print the first 10 reviews in each cluster
    print()


                                              Review  Positive Review  Cluster
0  This was perhaps the best of Johannes Steinhof...             True        1
1  This very fascinating book is a story written ...             True        1
2  The four tales in this collection are beautifu...             True        1
3  The book contained more profanity than I expec...            False        1
4  We have now entered a second time of deep conc...             True        1
              precision    recall  f1-score   support

       False       0.54      0.54      0.54       993
        True       0.53      0.53      0.53       980

    accuracy                           0.54      1973
   macro avg       0.54      0.54      0.54      1973
weighted avg       0.54      0.54      0.54      1973

Cluster 0:
5     I don't know why it won the National Book Awar...
7     I was very disapointed in the book.Basicly the...
13    Lovers of Mr. Rochester beware - in this, his ...
15    As the name im

In [7]:
# # 1. Create a TfidfVectorizer object
# tfidf_vectorizer = TfidfVectorizer()

# # 2. Fit the vectorizer to X_train
# tfidf_vectorizer.fit(X_train)

# # 3. Print the first 50 items in the vocabulary
# print("Vocabulary size {0}: ".format(len(tfidf_vectorizer.vocabulary_)))
# print(str(list(tfidf_vectorizer.vocabulary_.items())[0:50])+'\n')

      
# # 4. Transform *both* the training and test data using the fitted vectorizer and its 'transform' attribute
# X_train_tfidf = tfidf_vectorizer.transform(X_train)
# X_test_tfidf = tfidf_vectorizer.transform(X_test)


# # 5. Print the matrix
# print(X_train_tfidf.todense())

Vocabulary size 19029: 
[('as', 1344), ('my', 11353), ('brother', 2455), ('said', 14836), ('when', 18601), ('flipping', 6860), ('through', 17175), ('this', 17133), ('book', 2242), ('if', 8558), ('girls', 7460), ('start', 16165), ('acting', 517), ('like', 10066), ('guys', 7842), ('then', 17066), ('what', 18593), ('do', 5226), ('we', 18496), ('need', 11479), ('them', 17061), ('for', 6962), ('why', 18655), ('should', 15421), ('you', 18965), ('pretend', 13207), ('to', 17259), ('be', 1809), ('someone', 15841), ('or', 11982), ('something', 15845), ('re', 13834), ('not', 11673), ('get', 7424), ('guy', 7840), ('dumps', 5467), ('because', 1839), ('mushy', 11332), ('love', 10268), ('songs', 15859), ('cry', 4260), ('at', 1438), ('sad', 14814), ('movies', 11269), ('babies', 1629), ('and', 1000), ('puppies', 13575), ('have', 8028), ('embarrassing', 5745), ('girl', 7457)]

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.

In [2]:
# 1. Create a LogisticRegression model object, and fit a Logistic Regression model to the transformed training data
model = LogisticRegression(max_iter=200)
model.fit(X_train_tfidf, y_train)

# 2. Make predictions on the transformed test data using the predict_proba() method and 
# save the values of the second column
probability_predictions = model.predict_proba(X_test_tfidf)[:,1]

# 3. Make predictions on the transformed test data using the predict() method 
class_label_predictions = model.predict(X_test_tfidf)

# 4. Compute the Area Under the ROC curve (AUC) for the test data. Note that this time we are using one 
# function 'roc_auc_score()' to compute the auc rather than using both 'roc_curve()' and 'auc()' as we have 
# done in the past
auc = roc_auc_score(y_test, probability_predictions)
print('AUC on the test data: {:.4f}'.format(auc))

# 5. Print out the size of the resulting feature space using the 'vocabulary_' attribute of the vectorizer
len_feature_space = len(tfidf_vectorizer.vocabulary_)
print('The size of the feature space: {0}'.format(len_feature_space))

# 6. Get a glimpse of the features:
first_five = list(tfidf_vectorizer.vocabulary_.items())[0:5]
print('Glimpse of first 5 entries of the mapping of a word to its column/feature index \n{}:'.format(first_five))


AUC on the test data: 0.9141
The size of the feature space: 18980
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('my', 11332), ('brother', 2448), ('said', 14810), ('flipping', 6850), ('through', 17137)]:


In [3]:
print('Review #1:\n')
print(X_test.to_numpy()[124])

print('\nPrediction: Is this a good review? {}\n'.format(class_label_predictions[124])) 

print('Actual: Is this a good review? {}\n'.format(y_test.to_numpy()[124]))

Review #1:

I've been a fan of Carol Dweck's scholarly work for years. Her work on self-esteem, self-concept, and the incremental vs. entity theories of intelligence provides some of the most powerfully useful tools I've encountered for educators and parents in their work with children, as well as in their own self-awareness and lives. I'm delighted to see this information written here in such a user-friendly conversational tone, rich with stories that illustrate the nuances and complexities of Dweck's research and ideas. I'm recommending this book to all of my graduate students (teachers and principals working with gifted learners), as well as to parents of high-ability children.

Dona Matthews, Ph.D., Director of the Hunter College Center for Gifted Studies and Education, City University of New York


Prediction: Is this a good review? True

Actual: Is this a good review? True



In [10]:
print('Review #2:\n')
print(X_test.to_numpy()[238])

print('\nPrediction: Is this a good review? {}\n'.format(class_label_predictions[238])) 

print('Actual: Is this a good review? {}\n'.format(y_test.to_numpy()[238]))

Review #2:

I have read other books by Alesia Holliday and enjoyed them so I looked forward to reading this book.  Unfortunately, I could not get any farther than the first 25 pages.  I even tried diving in further into the book to see if it got better and I still could not read more than 5 pages without turning away.  The best I can do to pin down why I dislike it so much is to say that it tries too hard.  No character seems to even approach reality.  They are all, including the main character and her love interest, over the top


Prediction: Is this a good review? False

Actual: Is this a good review? False



In [11]:
for min_df in [1,10,100,1000]:
    
    print('\nMin Document Frequency Value: {0}'.format(min_df))
    
    # 1. Create a TfidfVectorizer object
    tfidf_vectorizer = TfidfVectorizer(min_df=min_df, ngram_range=(1,2))

    # 2. Fit the vectorizer to X_train
    tfidf_vectorizer.fit(X_train)

    # 3. Transform the training and test data
    X_train_tfidf = tfidf_vectorizer.transform(X_train)
    X_test_tfidf = tfidf_vectorizer.transform(X_test)

    # 4. Create a LogisticRegression model object, and fit a Logistic Regression model to the transformed 
    # training data
    model = LogisticRegression(max_iter=200)
    model.fit(X_train_tfidf, y_train)
    
    # 5. Make predictions on the transformed test data using the predict_proba() method and save 
    # the values of the second column
    probability_predictions = model.predict_proba(X_test_tfidf)[:,1]

    # 6. Compute the Area Under the ROC curve (AUC) for the test data.
    auc = roc_auc_score(y_test, probability_predictions)
    print('AUC on the test data: {:.4f}'.format(auc))

    # 7. Compute the size of the resulting feature space using the 'vocabulary_' attribute of the vectorizer
    len_feature_space = len(tfidf_vectorizer.vocabulary_)
    print('The size of the feature space: {0}'.format(len_feature_space))
    
    # 8. Get a glimpse of the features:
    first_five = list(tfidf_vectorizer.vocabulary_.items())[0:5]
    print('Glimpse of first 5 entries of the mapping of a word to its column/feature index \n{}:'.format(first_five))

    # 9: Print the first five "stop words" - words that we are ignoring
    first_five_stop = list(tfidf_vectorizer.stop_words_)[0:5]
    print('Glimpse of first 5 stop words \n{}:'.format(first_five_stop))
    


Min Document Frequency Value: 1
AUC on the test data: 0.9312
The size of the feature space: 143560
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('as', 11962), ('my', 79875), ('brother', 20610), ('said', 105149), ('when', 137651)]:
Glimpse of first 5 stop words 
[]:

Min Document Frequency Value: 10
AUC on the test data: 0.9252
The size of the feature space: 4257
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('as', 316), ('my', 2288), ('brother', 588), ('said', 2967), ('when', 4049)]:
Glimpse of first 5 stop words 
['children so', 'starting her', 'the turkish', 'unconnected', 'fantasies of']:

Min Document Frequency Value: 100
AUC on the test data: 0.8621
The size of the feature space: 279
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('as', 19), ('my', 144), ('when', 258), ('through', 233), ('this', 226)]:
Glimpse of first 5 stop words 
['children so', 'starting her', 'the turk

ANALYSIS: after running different document frequency values, it seems 
like having a minimum frequency value of 1 gives us the best result. 

CONCLUSION:
This unsupervised learning model does do slightly better than the supervised approach that we did in the previous sections. Also, after experimenting with word embeddings vs TF-IDF, TF-IDF does seem to do much better


