<a href="https://colab.research.google.com/github/joyinning/python_lie_detection/blob/main/Sentiment_Classification_vs_Lie_Detection_With_Results_(NLP).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **1. Information**

**1) Background** <br>
Many machine learning approaches attempt to detect whether a person is lying. 

What is **Sentiment Classification**?
- It analyzes the sentiment or emotional tone using text data, such as a review, and determines whether it is positive, negative, or neutral. 
- It is used for understanding the overall sentiment expressed in the text. 

What is **Lie Detection**?
- It involves identifying whether a statement is true or false. 
- It is used for detecting whether a text is accurate by analyzing the factual claims made in it.

**2) Goal** <br>
The goals of this research are as follows.
- To build sentiment classification and lie detection models using reviews of hotels in the United States.
- To understand the difference between sentiment classification and lie detection, and to conclude which model performs better.
> While sentiment classification and lie detection both analyze text data, they rely on fundamentally different features. 
> > In other words, sentiment classification focuses on emotions and opinions, while lie detection focuses on the accuracy of factual claims. 
- To calculate gain ratio scores and select the top features.

**3) Research Process** <br>
The research follows the process below.

1. **Data Preprocessing** <br>
Load, clean, (transform, if needed), and explore the data set to decide which values to use for building the models.

2. **Text Preprocessing** <br>
Clean and preprocess the review data by tokenizing, removing stop words and punctuation, converting to lowercase, lemmatizing, and (optionally) stemming, so that the reviews are in the form required by the NLP models. 

3. **Vectorization and Feature Selection** <br>
Vectorize the preprocessed text data and select the top 15 features using gain ratio scores.

4. **Building the Sentiment Classification and Lie Detection Model** <br>
Split the dataset into training and testing sets, train machine learning algorithms (MultinomialNB, SVM, Decision Tree, and Random Forest), evaluate the models, and conduct hyperparameter tuning for improvements, if needed. 

5. **Evaluating the Models** <br>
Compare the results of each algorithm and consider the next steps for improving the models. 

## **2. Data Preprocessing**

This step is for preparing and exploring the given data set (the reviews of US hotels) for further machine learning research. 

### **1) Upload the data set**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd 
review = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/IST 707/us_hotel_review.csv")

### **2) EDA**

**Basic Information of the Data Structure**

The data set, '`review`' has 35,437 reviews with the following three attributes.
1. **`is_positive`**: whether the review is positive or negative (positive = '`y`', negative = '`n`')
2. **`Reviewer_score`**: the review score (10-point scale)
3. **`review`**: actual reviews, string values

In [None]:
review.shape

In [None]:
review.head()

**Handling Missing Values**

There are no missing values in any attribute.

In [None]:
review.isna().sum()

**Value Transformation** <br>
For further modeling, convert the character values in **`is_positive`** to numeric values.
- positive ('y') = 1
- negative ('n') = 0

In [None]:
# Map only the label column so that review texts and scores are left untouched
review['is_positive'] = review['is_positive'].map({'y': 1, 'n': 0})

## **3. Text Preprocessing**

**1) Load the Required Tools for Text Preprocessing**

Before text preprocessing, load the following tools into this environment. 
- Sentence Tokenizer
- Stopwords
- RegexpTokenizer
- WordNetLemmatizer
- PorterStemmer

In [None]:
# sent_tokenize
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

In [None]:
# stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

In [None]:
# RegexpTokenizer
from nltk.tokenize import RegexpTokenizer, word_tokenize
regexp_tokenizer = RegexpTokenizer('[\'a-zA-Z]+')

In [None]:
# WordNetLemmatizer and PorterStemmer
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer, PorterStemmer
wordnet_lemmatizer = WordNetLemmatizer()
porter_stemmer = PorterStemmer()

**2) Define a Function for Text Preprocessing** <br>
For convenience, define a custom function that applies the tools loaded above. Note that the function lemmatizes each token; the `PorterStemmer` is loaded but not applied.

In [None]:
import re 
def text_preprocess(document, rebuild_document = True):
  words = []

  for sentence in sent_tokenize(document):
    tokens = [wordnet_lemmatizer.lemmatize(t.lower()) for t in regexp_tokenizer.tokenize(sentence) if t.lower() not in stop_words]
    words += tokens
  if rebuild_document:
    content = ' '.join(words).strip()
    content = content.replace(r"'"," ")
    content = re.sub(r'\s+', ' ', content)  # collapse repeated whitespace
    content = content.strip()

    return content
  else:

    return words
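
To make the effect of each step visible without downloading NLTK data, here is a minimal standard-library sketch of the same idea. It is not the function above: `simple_preprocess` and the tiny `DEMO_STOP_WORDS` set are hypothetical names introduced for illustration, and lemmatization is left out.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the notebook uses NLTK's English list.
DEMO_STOP_WORDS = {"the", "was", "and", "a", "but", "were", "is"}

def simple_preprocess(document, rebuild_document=True):
    """Lowercase, keep alphabetic tokens, drop stop words (no lemmatization)."""
    # Same character class as the notebook's RegexpTokenizer pattern.
    tokens = re.findall(r"['a-zA-Z]+", document)
    words = [t.lower() for t in tokens if t.lower() not in DEMO_STOP_WORDS]
    if not rebuild_document:
        return words
    # Apostrophes become spaces, which can leave runs of whitespace behind.
    content = " ".join(words).replace("'", " ")
    # Collapse those runs into single spaces.
    return re.sub(r"\s+", " ", content).strip()

sample = "The room was 'clean', but the staff weren't friendly!"
print(simple_preprocess(sample))
print(simple_preprocess(sample, rebuild_document=False))
```

The first call returns one cleaned string and the second returns the token list, mirroring the two modes of `text_preprocess`.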

**3) Implement Text Preprocessing**

Create new lists of review texts, tokenized review texts, and sentiment labels to split the review text data into predictive and target variables. 

In [None]:
X_reviews = [] # review texts after text preprocessing
X_token_reviews = [] # tokenized review texts after text preprocessing
Y_reviews = [] # sentiment labels

Run a for loop to divide the data set and append values into the new lists.

In [None]:
for index, row in review.iterrows():
  sentiment_index = row.is_positive
  review_text = row.review  # separate name so the `review` DataFrame is not overwritten

  X_reviews.append(text_preprocess(review_text))
  X_token_reviews.append(text_preprocess(review_text, False))
  Y_reviews.append(sentiment_index)

Check the samples of new lists.

In [None]:
print('X_reviews: ', X_reviews[0])

In [None]:
print('X_token_reviews: ', X_token_reviews[0])

In [None]:
print('Y_reviews: ', Y_reviews[0:10])

## **4. Vectorization & Feature Selection**

Two techniques are used in this step: count vectorization (`CountVectorizer`) and TF-IDF.

What is **Count Vectorization**?
- It counts the frequency of each word in a document.
- It shows a document as **a vector of word frequencies**. 

What is **TF-IDF**?
- It takes into account both the frequency of a word in a document (term frequency) and how rare the word is across all documents (inverse document frequency). 
- It represents **a document as a vector of weights** that reflect the importance of each word in the document relative to the corpus. 
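
The difference between the two representations can be worked through by hand on a toy corpus. The sketch below assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents as the `TfidfVectorizer` default, and it skips the final length normalization; the three example documents are made up.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already preprocessed.
docs = ["great room great staff", "bad room", "great location"]
tokenized = [d.split() for d in docs]
vocabulary = sorted({w for doc in tokenized for w in doc})

# Count vectorization: one row of raw term counts per document.
counts = [[Counter(doc)[w] for w in vocabulary] for doc in tokenized]

# Document frequency: number of documents that contain each word.
n_docs = len(docs)
df = {w: sum(w in doc for doc in tokenized) for w in vocabulary}

# Smoothed inverse document frequency: rarer words get larger weights.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}

# TF-IDF weights before any length normalization.
tfidf = [[row[i] * idf[w] for i, w in enumerate(vocabulary)] for row in counts]

print(vocabulary)
print(counts[0])
print([round(v, 3) for v in tfidf[0]])
```

In this corpus "great" and "room" each appear in two of the three documents, so their IDF is lower than that of "bad", which appears in only one.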

**4-1) Count Vectorization**

**1) Import the Required Vectorizer**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()

**2) Transform `X_reviews` Using the Count Vectorizer**

In [None]:
X_reviews_cv = count_vectorizer.fit_transform(X_reviews)

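**3) Calculate the Gain Ratio for Selecting Features** <br>
Compute a **gain ratio** score for each feature of the count-vectorized data set. <br>
The score used here is the mutual information between each feature and the label, divided by the constant `log2(number of features)`. Because the divisor is the same for every feature, the ranking is identical to ranking by mutual information; it is an approximation, not the textbook gain ratio (information gain divided by the feature's split information). <br>
This score is the criterion for feature selection.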

In [None]:
from sklearn.feature_selection import mutual_info_classif
import numpy as np

mi_cv = mutual_info_classif(X_reviews_cv, Y_reviews)
feature_scores_cv = mi_cv / np.log2(X_reviews_cv.shape[1])
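
For comparison, the textbook gain ratio divides the information gain of a feature by that feature's own split information. The following is a small pure-Python sketch of that definition for discrete features, not the computation used in this notebook; `entropy`, `gain_ratio`, and the toy `has_rude` data are hypothetical names and values chosen for illustration.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a list of category counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, divided by split information."""
    n = len(labels)
    label_entropy = entropy([labels.count(v) for v in set(labels)])
    conditional_entropy = 0.0
    value_counts = []
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        value_counts.append(len(subset))
        conditional_entropy += len(subset) / n * entropy(
            [subset.count(v) for v in set(subset)])
    info_gain = label_entropy - conditional_entropy
    split_info = entropy(value_counts)
    # A constant feature has zero split information and carries no signal.
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical example: whether a review contains the word "rude".
has_rude = [1, 1, 1, 0, 0, 0, 0, 0]
is_positive = [0, 0, 0, 1, 1, 1, 1, 0]
print(round(gain_ratio(has_rude, is_positive), 3))
```

Unlike a constant divisor, the split information differs per feature, so the textbook gain ratio can rank features differently from plain mutual information.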

**4) Select Top 15 Features** <br>
Select top 15 features based on the gain ratio scores.

In [None]:
# Find the name of features
feature_names_cv = count_vectorizer.get_feature_names_out()

In [None]:
# Select the top 15 features using gain ratios
k = 15
top_index_cv = np.argsort(feature_scores_cv)[-k:]
top_names_cv = [feature_names_cv[i] for i in top_index_cv]
top_scores_cv = [feature_scores_cv[i] for i in top_index_cv]

The top 15 features from the reviews after count vectorization include the following sentiment words. 
- **Positive**: positive, great, excellent, helpful, comfortable, friendly
- **Negative**: negative, rude, horrible, never, bad, poor

In [None]:
# Display the selected features and their gain ratio scores
pd.DataFrame({'Features': top_names_cv, 'Scores' : top_scores_cv}).sort_values('Scores', ascending = False).reset_index().drop(labels='index',axis=1)

**4-2) TF-IDF**

**1) Import the Required Vectorizer**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tdidf = TfidfVectorizer()

**2) Transform `X_reviews` Using the TF-IDF Vectorizer**

In [None]:
X_reviews_tdidf = tdidf.fit_transform(X_reviews)

**3) Calculate the Gain Ratio for Selecting Features** <br>
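Compute the gain ratio score of each feature from the TF-IDF-vectorized data set, in the same way as for count vectorization. <br>
Note that `mutual_info_classif` treats the features of a sparse matrix as discrete by default, so each distinct TF-IDF weight is handled as a separate category. <br>
This score is the criterion for feature selection.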

In [None]:
from sklearn.feature_selection import mutual_info_classif
import numpy as np

mi_tdidf = mutual_info_classif(X_reviews_tdidf, Y_reviews)
feature_scores_tdidf = mi_tdidf / np.log2(X_reviews_tdidf.shape[1])

**4) Select Top 15 Features** <br>
Select top 15 features based on the gain ratio scores.

In [None]:
# Find the name of features
feature_names_tdidf = tdidf.get_feature_names_out()

In [None]:
# Select the top 15 features using gain ratio scores
k = 15
top_index_tdidf = np.argsort(feature_scores_tdidf)[-k:]
top_names_tdidf = [feature_names_tdidf[i] for i in top_index_tdidf]
top_scores_tdidf = [feature_scores_tdidf[i] for i in top_index_tdidf]

The top 15 features after TF-IDF include more words that name review categories (room, staff, location, bed, and bathroom) than the top features after count vectorization. In other words, TF-IDF yields fewer high-scoring words related to sentiment.

- **Category**: room, staff, location, bed, bathroom
- **Sentiment**: positive, good, like ("like" can have more than one meaning)

In [None]:
# Display the selected top 15 features
pd.DataFrame({'Features': top_names_tdidf, 'Scores' : top_scores_tdidf}).sort_values('Scores', ascending = False).reset_index().drop(labels='index',axis=1)

## **5. Modeling**

Build sentiment classification and lie detection models using the following machine learning techniques. 
- **Sentiment Classification**: MultinomialNB (Naive Bayes), SVM(Support Vector Machine)
- **Lie Detection**: Decision Tree, Random Forest

**5-1) Create the new review and sentiment label sets after vectorization and feature selection** <br>
Create the new review sets based on the top 15 features selected from count vectorization and TF-IDF.

In [None]:
X_reviews_cv_top = X_reviews_cv[:, top_index_cv]
X_reviews_tdidf_top = X_reviews_tdidf[:, top_index_tdidf]

**5-2) Split into train and test sets** <br>
For modeling, divide the review and sentiment label data into train and test sets.

In [None]:
# count vectorized reviews
from sklearn.model_selection import train_test_split
X_train_cv, X_test_cv, y_train_cv, y_test_cv = train_test_split(X_reviews_cv_top, Y_reviews, test_size=0.2, random_state=42)

In [None]:
# TF-IDF reviews
from sklearn.model_selection import train_test_split
X_train_tdidf, X_test_tdidf, y_train_tdidf, y_test_tdidf = train_test_split(X_reviews_tdidf_top, Y_reviews, test_size=0.2, random_state=42)
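
The top 15 features above were selected on the full data set before splitting, so the test reviews influenced feature selection. One way to avoid this, sketched below under assumptions, is to split the raw texts first and place vectorization and selection inside a scikit-learn pipeline. The toy corpus, `k=5`, and the use of `SelectKBest` with plain mutual information are illustrative choices, not this notebook's method.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the preprocessed hotel reviews.
texts = ["great friendly staff", "rude staff bad room",
         "excellent location great room", "horrible poor service",
         "helpful staff excellent", "bad rude horrible",
         "great helpful friendly", "poor room bad bed"] * 5
labels = [1, 0, 1, 0, 1, 0, 1, 0] * 5

# Split the raw texts first, so the test set stays unseen by every later step.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

# Vocabulary, feature scores, and the classifier are all fit on training data only.
pipeline = make_pipeline(
    CountVectorizer(),
    SelectKBest(mutual_info_classif, k=5),
    MultinomialNB(),
)
pipeline.fit(train_texts, y_train)
print(pipeline.score(test_texts, y_test))
```

Because the pipeline is a single estimator, it can also be passed to `GridSearchCV`, which then repeats the feature selection inside each cross-validation fold.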

### **5-3) Sentiment Classification** <br>

`MultinomialNB (Naive Bayes)` and `SVM(Support Vector Machine)` models are used for building the sentiment classification algorithm.

**MultinomialNB (Naive Bayes)**

**1) Import Required Classifier**

In [None]:
from sklearn.naive_bayes import MultinomialNB
ld_cv_nb = MultinomialNB()
ld_tdidf_nb = MultinomialNB()

**2) Train the model with default parameters** <br>
To get a baseline for hyperparameter tuning, train the model with its default parameter setting.

In [None]:
# count vectorization model
ld_cv_nb.fit(X_train_cv, y_train_cv)
y_pred_cv_nb = ld_cv_nb.predict(X_test_cv)

In [None]:
# TF-IDF reviews
ld_tdidf_nb.fit(X_train_tdidf, y_train_tdidf)
y_pred_tdidf_nb = ld_tdidf_nb.predict(X_test_tdidf)

**3) Check the Default Parameter**

In [None]:
print(str(ld_cv_nb.get_params()))

In [None]:
print(str(ld_tdidf_nb.get_params()))

**4) Evaluate the Models**

In [None]:
## count vectorization
from sklearn.metrics import roc_curve, auc, accuracy_score, precision_score, recall_score, f1_score

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test_cv, y_pred_cv_nb)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('ROC_AUC:',roc_auc)
print('Accuracy:',accuracy_score(y_test_cv, y_pred_cv_nb))
print('Precision:',precision_score(y_test_cv, y_pred_cv_nb))
print('Recall:',recall_score(y_test_cv, y_pred_cv_nb))
print('F1:',f1_score(y_test_cv, y_pred_cv_nb))
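
The cell above passes hard 0/1 predictions to `roc_curve`, which yields a curve with a single operating point; the resulting AUC equals the mean of the true positive and true negative rates. Passing scores instead (for example `predict_proba(X)[:, 1]` for `MultinomialNB`) uses the full ranking. The pure-Python sketch below illustrates the difference on made-up numbers; `roc_auc` is a hypothetical helper, not a scikit-learn function.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

y_true = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]  # hypothetical predict_proba output
hard_labels = [int(p >= 0.5) for p in probabilities]

auc_from_scores = roc_auc(y_true, probabilities)
auc_from_labels = roc_auc(y_true, hard_labels)
print(round(auc_from_scores, 3), round(auc_from_labels, 3))
```

Here the score-based AUC is 8/9, while the label-based AUC drops to 2/3 because thresholding discards the ordering within each predicted class.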

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns

cf_matrix_cv_nb = confusion_matrix(y_test_cv, y_pred_cv_nb)
sns.heatmap(cf_matrix_cv_nb/np.sum(cf_matrix_cv_nb), annot=True, fmt='.2%')
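
The scores printed earlier can all be read off the confusion matrix. For binary labels 0 and 1, scikit-learn's `confusion_matrix` returns `[[TN, FP], [FN, TP]]`. The sketch below recomputes the metrics from such a matrix; `metrics_from_confusion` and the example counts are hypothetical.

```python
def metrics_from_confusion(matrix):
    """Accuracy, precision, recall, and F1 from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / total, "precision": precision,
            "recall": recall, "f1": f1}

example = [[50, 10], [5, 35]]  # hypothetical counts, not results from this notebook
for name, value in metrics_from_confusion(example).items():
    print(f"{name}: {value:.3f}")
```

Reading the four cells this way makes it easier to see which kind of error (false positives or false negatives) drives a low precision or recall.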

In [None]:
## TF-IDF
from sklearn.metrics import roc_curve, auc, accuracy_score, precision_score, recall_score, f1_score

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test_tdidf, y_pred_tdidf_nb)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('ROC_AUC:',roc_auc)
print('Accuracy:',accuracy_score(y_test_tdidf, y_pred_tdidf_nb))
print('Precision:',precision_score(y_test_tdidf, y_pred_tdidf_nb))
print('Recall:',recall_score(y_test_tdidf, y_pred_tdidf_nb))
print('F1:',f1_score(y_test_tdidf, y_pred_tdidf_nb))

In [None]:
cf_matrix_tdidf_nb = confusion_matrix(y_test_tdidf, y_pred_tdidf_nb)
sns.heatmap(cf_matrix_tdidf_nb/np.sum(cf_matrix_tdidf_nb), annot=True, fmt='.2%')

**5) Hyperparameter Tuning** <br>
Conduct hyperparameter tuning to find parameters that improve performance. <br>
<br>
In this case, find the parameter value that gives the best **roc_auc** score. <br>
Only the **`alpha`** parameter is tuned in this step.

In [None]:
alpha = np.linspace(0.1, 1.0, 10, endpoint = True)
parameters = {'alpha': alpha}

Set the parameter grid to the classifier. <br>
In this hyperparameter tuning, we will focus on improving AUC scores.

In [None]:
from sklearn.model_selection import GridSearchCV
ld_cv_nb_hyper = GridSearchCV(MultinomialNB(), param_grid = parameters, cv=3, return_train_score=True, scoring= 'roc_auc')

Train a model again and print the best parameters

In [None]:
# count vectorization reviews
ld_cv_nb_hyper.fit(X_train_cv, y_train_cv)
print('Best model: %s' % str(ld_cv_nb_hyper.best_params_))
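
As a self-contained illustration of what the grid search does, the sketch below runs the same kind of search on synthetic count data. The data, the random seed, and the labeling rule are assumptions made up for the example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
# Hypothetical count features: 200 documents, 15 selected words.
X = rng.poisson(1.0, size=(200, 15))
# Hypothetical label rule so that the first four features carry signal.
y = (X[:, 0] + X[:, 1] > X[:, 2] + X[:, 3]).astype(int)

search = GridSearchCV(
    MultinomialNB(),
    param_grid={"alpha": np.linspace(0.1, 1.0, 10)},
    cv=3,
    scoring="roc_auc",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))

# With the default refit=True, best_estimator_ is already fit on all of X, y.
probabilities = search.best_estimator_.predict_proba(X)[:, 1]
```

`best_score_` is the mean cross-validated AUC of the best `alpha`, computed from scores inside each fold, so it is not directly comparable to an AUC computed from hard test-set predictions.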

Train a model with the best parameters (`alpha` = 0.8) and evaluate the metrics.

In [None]:
# Train
ld_cv_nb_best = ld_cv_nb_hyper.best_estimator_
ld_cv_nb_best.fit(X_train_cv, y_train_cv)
y_pred_cv_nb_best = ld_cv_nb_best.predict(X_test_cv)

Unfortunately, the **roc_auc** score after hyperparameter tuning decreased by about 0.0001.

In [None]:
# Evaluate
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test_cv, y_pred_cv_nb_best)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('ROC_AUC:',roc_auc)
print('Accuracy:',accuracy_score(y_test_cv, y_pred_cv_nb_best))
print('Precision:',precision_score(y_test_cv, y_pred_cv_nb_best))
print('Recall:',recall_score(y_test_cv, y_pred_cv_nb_best))
print('F1:',f1_score(y_test_cv, y_pred_cv_nb_best))

Also, the MultinomialNB model with the tuned hyperparameters still does not separate positive and negative reviews perfectly.

In [None]:
cf_matrix_cv_nb_best = confusion_matrix(y_test_cv, y_pred_cv_nb_best)
sns.heatmap(cf_matrix_cv_nb_best/np.sum(cf_matrix_cv_nb_best), annot=True, fmt='.2%')

Conduct hyperparameter tuning with the TF-IDF data sets.

In [None]:
from sklearn.model_selection import GridSearchCV
ld_tdidf_nb_hyper = GridSearchCV(MultinomialNB(), param_grid = parameters, cv=3, return_train_score=True, scoring= 'roc_auc')

In [None]:
# tdidf review
ld_tdidf_nb_hyper.fit(X_train_tdidf, y_train_tdidf)
print('Best model: %s' % str(ld_tdidf_nb_hyper.best_params_))

In [None]:
# Train
ld_tdidf_nb_best = ld_tdidf_nb_hyper.best_estimator_
ld_tdidf_nb_best.fit(X_train_tdidf, y_train_tdidf)
y_pred_tdidf_nb_best = ld_tdidf_nb_best.predict(X_test_tdidf)

In [None]:
# Evaluate
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test_tdidf, y_pred_tdidf_nb_best)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('ROC_AUC:',roc_auc)
print('Accuracy:',accuracy_score(y_test_tdidf, y_pred_tdidf_nb_best))
print('Precision:',precision_score(y_test_tdidf, y_pred_tdidf_nb_best))
print('Recall:',recall_score(y_test_tdidf, y_pred_tdidf_nb_best))
print('F1:',f1_score(y_test_tdidf, y_pred_tdidf_nb_best))

In [None]:
cf_matrix_tdidf_nb_best = confusion_matrix(y_test_tdidf, y_pred_tdidf_nb_best)
sns.heatmap(cf_matrix_tdidf_nb_best/np.sum(cf_matrix_tdidf_nb_best), annot=True, fmt='.2%')

**SVM (Support Vector Machine)**

**1) Import Required Classifier**

In [None]:
from sklearn.svm import SVC
ld_cv_svm = SVC()
ld_tdidf_svm = SVC()

**2) Train the model with default parameters** <br>
To get a baseline for hyperparameter tuning, train the model with its default parameter setting.

In [None]:
# count vectorization model
ld_cv_svm.fit(X_train_cv, y_train_cv)
y_pred_cv_svm = ld_cv_svm.predict(X_test_cv)

In [None]:
# TF-IDF reviews
ld_tdidf_svm.fit(X_train_tdidf, y_train_tdidf)
y_pred_tdidf_svm = ld_tdidf_svm.predict(X_test_tdidf)

**3) Check the Default Parameter**

In [None]:
print(str(ld_cv_svm.get_params()))

In [None]:
print(str(ld_tdidf_svm.get_params()))

**4) Evaluate the Models**

In [None]:
## count vectorization
from sklearn.metrics import roc_curve, auc, accuracy_score, precision_score, recall_score, f1_score

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test_cv, y_pred_cv_svm)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('ROC_AUC:',roc_auc)
print('Accuracy:',accuracy_score(y_test_cv, y_pred_cv_svm))
print('Precision:',precision_score(y_test_cv, y_pred_cv_svm))
print('Recall:',recall_score(y_test_cv, y_pred_cv_svm))
print('F1:',f1_score(y_test_cv, y_pred_cv_svm))

In [None]:
cf_matrix_cv_svm = confusion_matrix(y_test_cv, y_pred_cv_svm)
sns.heatmap(cf_matrix_cv_svm/np.sum(cf_matrix_cv_svm), annot=True, fmt='.2%')

In [None]:
## TF-IDF
from sklearn.metrics import roc_curve, auc, accuracy_score, precision_score, recall_score, f1_score

# Use the SVM predictions (not the Naive Bayes ones) for the ROC curve
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test_tdidf, y_pred_tdidf_svm)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('ROC_AUC:',roc_auc)
print('Accuracy:',accuracy_score(y_test_tdidf, y_pred_tdidf_svm))
print('Precision:',precision_score(y_test_tdidf, y_pred_tdidf_svm))
print('Recall:',recall_score(y_test_tdidf, y_pred_tdidf_svm))
print('F1:',f1_score(y_test_tdidf, y_pred_tdidf_svm))

In [None]:
cf_matrix_tdidf_svm = confusion_matrix(y_test_tdidf, y_pred_tdidf_svm)
sns.heatmap(cf_matrix_tdidf_svm/np.sum(cf_matrix_tdidf_svm), annot=True, fmt='.2%')

**5) Hyperparameter Tuning** <br>
Conduct hyperparameter tuning to find parameters that improve performance. <br>
<br>
In this case, find the parameter values that give the best **roc_auc** score. <br>
Use `kernel`, `C`, and `gamma` as the parameters for hyperparameter tuning.

In [None]:
kernel = ['linear', 'rbf']
C = [1,10,20,50,100]
gamma = [0.1, 0.01, 0.001]
parameters = {'kernel': kernel, 'C': C, 'gamma': gamma}
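
`SVC` uses `gamma` only for the `rbf`, `poly`, and `sigmoid` kernels, so in a single combined grid every linear-kernel setting is refit once per `gamma` value with no effect. A hedged alternative is to pass a list of grids, which `GridSearchCV` also accepts. The sketch below only counts the candidates with `ParameterGrid`; `single_grid` and `split_grid` are hypothetical names.

```python
from sklearn.model_selection import ParameterGrid

C = [1, 10, 20, 50, 100]
gamma = [0.1, 0.01, 0.001]

# One combined grid: gamma is crossed with the linear kernel too.
single_grid = {"kernel": ["linear", "rbf"], "C": C, "gamma": gamma}

# Separate grids: gamma is only varied where the kernel uses it.
split_grid = [
    {"kernel": ["linear"], "C": C},
    {"kernel": ["rbf"], "C": C, "gamma": gamma},
]

print(len(ParameterGrid(single_grid)))  # 30 candidates
print(len(ParameterGrid(split_grid)))   # 20 candidates
```

With `cv=3`, that is 60 fits instead of 90, which matters because `SVC` training time grows quickly with the number of reviews.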

Set the parameter grid to the classifier. <br>
In this hyperparameter tuning, we will focus on improving AUC scores.

In [None]:
from sklearn.model_selection import GridSearchCV
ld_cv_svm_hyper = GridSearchCV(SVC(), param_grid = parameters, cv=3, return_train_score=True, scoring= 'roc_auc')

Train a model again and print the best parameters

In [None]:
# count vectorization reviews
ld_cv_svm_hyper.fit(X_train_cv, y_train_cv)
print('Best model: %s' % str(ld_cv_svm_hyper.best_params_))

Train a model with the best parameters found above and evaluate the metrics.

In [None]:
# Train
ld_cv_svm_best = ld_cv_svm_hyper.best_estimator_
ld_cv_svm_best.fit(X_train_cv, y_train_cv)
y_pred_cv_svm_best = ld_cv_svm_best.predict(X_test_cv)

Unfortunately, the **roc_auc** score after hyperparameter tuning decreased by about 0.0001.

In [None]:
# Evaluate
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test_cv, y_pred_cv_svm_best)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('ROC_AUC:',roc_auc)
print('Accuracy:',accuracy_score(y_test_cv, y_pred_cv_svm_best))
print('Precision:',precision_score(y_test_cv, y_pred_cv_svm_best))
print('Recall:',recall_score(y_test_cv, y_pred_cv_svm_best))
print('F1:',f1_score(y_test_cv, y_pred_cv_svm_best))

Also, the SVM model with the tuned hyperparameters still does not separate positive and negative reviews perfectly.

In [None]:
cf_matrix_cv_svm_best = confusion_matrix(y_test_cv, y_pred_cv_svm_best)
sns.heatmap(cf_matrix_cv_svm_best/np.sum(cf_matrix_cv_svm_best), annot=True, fmt='.2%')

### **5-4) Lie Detection** <br>

`Decision Tree` and `Random Forest` models are used for building the lie detection algorithm.

**Decision Tree**

**1) Import Required Classifier**

In [None]:
from sklearn.tree import DecisionTreeClassifier
ld_cv_dt = DecisionTreeClassifier()
ld_tdidf_dt = DecisionTreeClassifier()

**2) Train the model with default parameters** <br>
To get a baseline for hyperparameter tuning, train the model with its default parameter setting.

In [None]:
# count vectorization model
ld_cv_dt.fit(X_train_cv, y_train_cv)
y_pred_cv_dt = ld_cv_dt.predict(X_test_cv)

In [None]:
# TF-IDF reviews
ld_tdidf_dt.fit(X_train_tdidf, y_train_tdidf)
y_pred_tdidf_dt = ld_tdidf_dt.predict(X_test_tdidf)

**3) Check the Default Parameter**

In [None]:
print(str(ld_cv_dt.get_params()))

In [None]:
print(str(ld_tdidf_dt.get_params()))

**4) Evaluate the Models**

In [None]:
## count vectorization
from sklearn.metrics import roc_curve, auc, accuracy_score, precision_score, recall_score, f1_score

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test_cv, y_pred_cv_dt)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('ROC_AUC:',roc_auc)
print('Accuracy:',accuracy_score(y_test_cv, y_pred_cv_dt))
print('Precision:',precision_score(y_test_cv, y_pred_cv_dt))
print('Recall:',recall_score(y_test_cv, y_pred_cv_dt))
print('F1:',f1_score(y_test_cv, y_pred_cv_dt))

In [None]:
cf_matrix_cv_df = confusion_matrix(y_test_cv, y_pred_cv_dt)
sns.heatmap(cf_matrix_cv_df/np.sum(cf_matrix_cv_df), annot=True, fmt='.2%')

In [None]:
## TF-IDF
from sklearn.metrics import roc_curve, auc, accuracy_score, precision_score, recall_score, f1_score

# Use the Decision Tree predictions (not the Naive Bayes ones) for the ROC curve
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test_tdidf, y_pred_tdidf_dt)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('ROC_AUC:',roc_auc)
print('Accuracy:',accuracy_score(y_test_tdidf, y_pred_tdidf_dt))
print('Precision:',precision_score(y_test_tdidf, y_pred_tdidf_dt))
print('Recall:',recall_score(y_test_tdidf, y_pred_tdidf_dt))
print('F1:',f1_score(y_test_tdidf, y_pred_tdidf_dt))

In [None]:
cf_matrix_tdidf_df = confusion_matrix(y_test_tdidf, y_pred_tdidf_dt)
sns.heatmap(cf_matrix_tdidf_df/np.sum(cf_matrix_tdidf_df), annot=True, fmt='.2%')

**5) Hyperparameter Tuning** <br>
Conduct hyperparameter tuning to find parameters that improve performance. <br>
<br>
Only the **`max_depth`** parameter of the Decision Tree is tuned in this step; the candidate values are an illustrative choice.

In [None]:
# Assumption: illustrative max_depth candidates (None lets the tree grow fully)
max_depth = [3, 5, 10, 20, None]
parameters = {'max_depth': max_depth}

Set the parameter grid to the classifier.

In [None]:
from sklearn.model_selection import GridSearchCV
ld_cv_dt_hyper = GridSearchCV(DecisionTreeClassifier(), param_grid = parameters, cv=3, return_train_score=True, scoring= 'roc_auc')

Train a model again and print the best parameters.

In [None]:
# count vectorization reviews
ld_cv_dt_hyper.fit(X_train_cv, y_train_cv)
print('Best model: %s' % str(ld_cv_dt_hyper.best_params_))

Train a model with the best parameters and evaluate the AUC and accuracy scores.

In [None]:
# Train
ld_cv_dt_best = ld_cv_dt_hyper.best_estimator_
ld_cv_dt_best.fit(X_train_cv, y_train_cv)
y_pred_cv_dt_best = ld_cv_dt_best.predict(X_test_cv)

In [None]:
# Evaluate
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test_cv, y_pred_cv_dt_best)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('ROC_AUC:',roc_auc)
print('Accuracy:',accuracy_score(y_test_cv, y_pred_cv_dt_best))
print('Precision:',precision_score(y_test_cv, y_pred_cv_dt_best))
print('Recall:',recall_score(y_test_cv, y_pred_cv_dt_best))
print('F1:',f1_score(y_test_cv, y_pred_cv_dt_best))

**Random Forest**

**1) Import Required Classifier**

In [None]:
from sklearn.ensemble import RandomForestClassifier
ld_cv_rf = RandomForestClassifier()
ld_tdidf_rf = RandomForestClassifier()

**2) Train the model with default parameters** <br>
To get a baseline for hyperparameter tuning, train the model with its default parameter setting.

In [None]:
# count vectorization model
ld_cv_rf.fit(X_train_cv, y_train_cv)
y_pred_cv_rf = ld_cv_rf.predict(X_test_cv)

In [None]:
# TF-IDF reviews
ld_tdidf_rf.fit(X_train_tdidf, y_train_tdidf)
y_pred_tdidf_rf = ld_tdidf_rf.predict(X_test_tdidf)

**3) Check the Default Parameter**

In [None]:
print(str(ld_cv_rf.get_params()))

In [None]:
print(str(ld_tdidf_rf.get_params()))

**4) Evaluate the Models**

In [None]:
## count vectorization
from sklearn.metrics import roc_curve, auc, accuracy_score, precision_score, recall_score, f1_score

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test_cv, y_pred_cv_rf)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('ROC_AUC:',roc_auc)
print('Accuracy:',accuracy_score(y_test_cv, y_pred_cv_rf))
print('Precision:',precision_score(y_test_cv, y_pred_cv_rf))
print('Recall:',recall_score(y_test_cv, y_pred_cv_rf))
print('F1:',f1_score(y_test_cv, y_pred_cv_rf))

In [None]:
cf_matrix_cv_rf = confusion_matrix(y_test_cv, y_pred_cv_rf)
sns.heatmap(cf_matrix_cv_rf/np.sum(cf_matrix_cv_rf), annot=True, fmt='.2%')

In [None]:
## TF-IDF
from sklearn.metrics import roc_curve, auc, accuracy_score, precision_score, recall_score, f1_score

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test_tdidf, y_pred_tdidf_rf)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('ROC_AUC:',roc_auc)
print('Accuracy:',accuracy_score(y_test_tdidf, y_pred_tdidf_rf))
print('Precision:',precision_score(y_test_tdidf, y_pred_tdidf_rf))
print('Recall:',recall_score(y_test_tdidf, y_pred_tdidf_rf))
print('F1:',f1_score(y_test_tdidf, y_pred_tdidf_rf))

In [None]:
cf_matrix_tdidf_rf = confusion_matrix(y_test_tdidf, y_pred_tdidf_rf)
sns.heatmap(cf_matrix_tdidf_rf/np.sum(cf_matrix_tdidf_rf), annot=True, fmt='.2%')

## **6. Evaluation**
See the README in my GitHub repository. [Link]()