___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" alt="CLRSWY"></p>

___

# WELCOME!

Welcome to the "***Sentiment Analysis and Classification***" project, the first and only project of the ***Natural Language Processing (NLP)*** course.

This analysis focuses on using natural language processing techniques to find broad trends in the customers' written reviews. 
The goal in this project is to predict whether customers recommend the product they purchased using the information in their review text.

One of the challenges in this project is to extract useful information from the *Review Text* variable using text mining techniques. The other challenge is that you need to convert the text into numeric feature vectors to run machine learning algorithms.

At the end of this project, you will learn how to build sentiment classification models using Machine Learning algorithms (***Logistic Regression, Naive Bayes, Support Vector Machine, Random Forest*** and ***Ada Boosting***), **Deep Learning algorithms** and **BERT algorithm**.

Before diving into the project, please take a look at the Determines and Tasks.

- ***NOTE:*** *This tutorial assumes that you already know the basics of coding in Python and are familiar with the theory behind the algorithms mentioned above as well as NLP techniques.*



---
---


# #Determines
The data is a collection of 23486 rows and 10 feature variables; 22641 of these rows have a non-missing review text. 
Each row corresponds to a customer review and includes a written comment as well as additional customer information. The variables are:


**Feature Information:**

**Clothing ID:** Integer Categorical variable that refers to the specific piece being reviewed.

**Age:** Positive Integer variable of the reviewer's age.

**Title:** String variable for the title of the review.

**Review Text:** String variable for the review body.

**Rating:** Positive Ordinal Integer variable for the product score granted by the customer, from 1 (worst) to 5 (best).

**Recommended IND:** Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.

**Positive Feedback Count:** Positive Integer documenting the number of other customers who found this review positive.

**Division Name:** Categorical name of the product's high-level division.

**Department Name:** Categorical name of the product department.

**Class Name:** Categorical name of the product class.

---

The basic goal in this project is to predict whether customers recommend the product they purchased using the information in their *Review Text*.
Note in particular that the expectation in this project is to use only the "Review Text" variable and neglect the other ones. 
Of course, if you want, you can work on other variables individually.

The project is structured in five tasks: ***EDA, Feature Selection and Data Cleaning, Text Mining, Word Cloud*** and ***Sentiment Classification with Machine Learning, Deep Learning and BERT model***.

As usual, you can start getting to know the data after the import and load operations. 
You need to do missing value detection for Review Text, which is the only variable you need to care about. You can drop other variables.

To make the data set ready for text mining, you will need to apply ***noise removal*** and ***lexicon normalization*** to it, using the capabilities of the ***nltk*** library.

Afterwards, you will implement ***Word Cloud*** as a visual analysis of word repetition.

Finally, you will build models with different algorithms and compare their performance. Thus, you will determine the algorithm that predicts sentiment most accurately by using the information obtained from the *Review Text* variable.






---
---


# #Tasks

#### 1. Exploratory Data Analysis

- Import Modules, Load and Discover the Data

#### 2. Feature Selection and Data Cleaning

- Feature Selection and Rename Column Name
- Missing Value Detection

#### 3. Text Mining

- Tokenization
- Noise Removal
- Lexicon Normalization

#### 4. WordCloud - Repetition of Words

- Detect Reviews
- Collect Words 
- Create Word Cloud 


#### 5. Sentiment Classification with Machine Learning, Deep Learning and BERT Model

- Train - Test Split
- Vectorization
- TF-IDF
- Logistic Regression
- Naive Bayes
- Support Vector Machine
- Random Forest
- AdaBoost
- Deep Learning Model
- BERT Model
- Model Comparison

---
---


# Sentiment analysis of women's clothes reviews


In this project, we use sentiment analysis to determine whether a product is recommended or not. We use different machine learning algorithms to get more accurate predictions. The following classification algorithms are used: ML algorithms (Logistic Regression, Naive Bayes, Support Vector Machine (SVM), Random Forest and Ada Boosting), a Deep Learning algorithm and the BERT algorithm. The dataset is the Women's Clothing E-Commerce Reviews dataset, which can be found at https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews. 


## 1. Exploratory Data Analysis

### Import Libraries, Load and Discover the Data

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer


import warnings
warnings.filterwarnings("ignore")
plt.rcParams["figure.figsize"] = (10,6)
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

In [1]:
pip install -U matplotlib


In [3]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [7]:
# to connect google drive

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [10]:
# if you work in colab, execute this code to read the data

df = pd.read_csv("/content/drive/MyDrive/Womens Clothing E-Commerce Reviews.csv")

In [11]:
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               23486 non-null  int64 
 1   Clothing ID              23486 non-null  int64 
 2   Age                      23486 non-null  int64 
 3   Title                    19676 non-null  object
 4   Review Text              22641 non-null  object
 5   Rating                   23486 non-null  int64 
 6   Recommended IND          23486 non-null  int64 
 7   Positive Feedback Count  23486 non-null  int64 
 8   Division Name            23472 non-null  object
 9   Department Name          23472 non-null  object
 10  Class Name               23472 non-null  object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB


In [13]:
df.isnull().sum()

Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

In [14]:
df.duplicated().sum()

0

In [15]:
df_copy1 = df.copy()

#### Check Proportion of Target Class Variable:

In [19]:
df["Recommended IND"].value_counts()

1    19314
0     4172
Name: Recommended IND, dtype: int64

In [20]:
df["Recommended IND"].value_counts(normalize=True)

1    0.822362
0    0.177638
Name: Recommended IND, dtype: float64

The target class variable is imbalanced: "Recommended" values dominate "Not Recommended" values.
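
Because of this imbalance, accuracy alone can be misleading. As a rough illustration (a sketch that uses only the class counts printed above and is not part of the original notebook), a hypothetical baseline that always predicts the majority class already reaches about 82% accuracy while never detecting a minority-class review:

```python
# Class counts copied from the value_counts() output above.
counts = {1: 19314, 0: 4172}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a hypothetical model that always predicts the majority class.
baseline_accuracy = counts[majority_label] / total

# That model never predicts the minority class, so its minority recall is zero.
minority_hits = 0
minority_recall = minority_hits / min(counts.values())

print(f"majority label: {majority_label}")
print(f"baseline accuracy: {baseline_accuracy:.6f}")
print(f"minority recall: {minority_recall:.1f}")
```

This is one reason the per-class precision, recall and F1 scores are worth checking alongside accuracy in the modeling sections.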

## 2. Feature Selection and Data Cleaning

From now on, the DataFrame you will work with should contain two columns: **"Review Text"** and **"Recommended IND"**. You can do the missing value detection operations from now on. You can also rename the column names if you want.



### Feature Selection and Rename Column Name

In [25]:
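# .copy() makes the selection an independent DataFrame, so the in-place edits below do not act on a view of the original
df = df[["Review Text", "Recommended IND"]].copy()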

In [26]:
df.head()

Unnamed: 0,Review Text,Recommended IND
0,Absolutely wonderful - silky and sexy and comf...,1
1,Love this dress! it's sooo pretty. i happene...,1
2,I had such high hopes for this dress and reall...,0
3,"I love, love, love this jumpsuit. it's fun, fl...",1
4,This shirt is very flattering to all due to th...,1


In [27]:
df.rename(columns={"Review Text": "Review_Text", "Recommended IND": "Recommended_IND"}, inplace=True)

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Review_Text      22641 non-null  object
 1   Recommended_IND  23486 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 367.1+ KB


In [29]:
# Swap the labels so that "not recommended" (originally 0) becomes the positive class 1

df.Recommended_IND = df.Recommended_IND.map({1:0, 0:1})

In [30]:
df.Recommended_IND.value_counts()

0    19314
1     4172
Name: Recommended_IND, dtype: int64

In [31]:
df.Recommended_IND.value_counts(normalize=True)

0    0.822362
1    0.177638
Name: Recommended_IND, dtype: float64

### Missing Value Detection

In [32]:
df.isnull().sum()

Review_Text        845
Recommended_IND      0
dtype: int64

In [33]:
df.dropna(inplace=True)
df = df.reset_index(drop=True)

In [34]:
df.isnull().sum()

Review_Text        0
Recommended_IND    0
dtype: int64

In [35]:
df.Recommended_IND.value_counts(normalize=True)

0    0.818868
1    0.181132
Name: Recommended_IND, dtype: float64

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22641 entries, 0 to 22640
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Review_Text      22641 non-null  object
 1   Recommended_IND  22641 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 353.9+ KB


In [37]:
df_copy2 = df.copy()

In [38]:
df_copy2.head()

Unnamed: 0,Review_Text,Recommended_IND
0,Absolutely wonderful - silky and sexy and comf...,0
1,Love this dress! it's sooo pretty. i happene...,0
2,I had such high hopes for this dress and reall...,1
3,"I love, love, love this jumpsuit. it's fun, fl...",0
4,This shirt is very flattering to all due to th...,0


## 3. Text Mining

Text is the most unstructured form of all the available data, therefore various types of noise are present in it. This means that the data is not readily analyzable without any pre-processing. The entire process of cleaning and standardization of text, making it noise-free and ready for analysis is known as **text preprocessing**.

The three key steps of text preprocessing:

- **Tokenization:**
This step is one of the top priorities when it comes to working on text mining. Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

- **Noise Removal:**
Any piece of text which is not relevant to the context of the data and the end-output can be regarded as noise.
For example: language stopwords (commonly used words of a language such as is, am, the, of, in), URLs or links, upper and lower case differentiation, punctuation and industry-specific words. This step deals with the removal of all types of noisy entities present in the text.


- **Lexicon Normalization:**
Another type of textual noise comes from the multiple representations of a single word.
For example, “play”, “player”, “played”, “plays” and “playing” are different variations of the word “play”. Though they mean different things, contextually they are all similar. This step converts all the variations of a word into their normalized form (also known as the lemma). 
There are two methods of lexicon normalization: **[Stemming and Lemmatization](https://www.guru99.com/stemming-lemmatization-python-nltk.html)**. Lemmatization is recommended for this case because it returns the dictionary form of each word rather than just stripping suffixes, which is what stemming does.

As the first step, convert the text to tokens and all of the words to lower case. Next, remove punctuation, bad characters, numbers and stop words. The final step normalizes the remaining tokens through lemmatization. 


***Note:*** *Use the functions of the ***[nltk Library](https://www.guru99.com/nltk-tutorial.html)*** for all the above operations.*
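
To make the first two steps concrete, here is a minimal sketch that uses only the Python standard library. It is an illustration, not the notebook's implementation: the function `toy_clean` and the stop-word set `TOY_STOP_WORDS` are hypothetical stand-ins, whitespace splitting replaces NLTK's tokenizer, and lemmatization is left out because it needs NLTK's WordNet data.

```python
import re

# Toy stop-word list for illustration only; the notebook itself uses NLTK's
# English list with "not" and "no" taken out of it.
TOY_STOP_WORDS = {"i", "the", "of", "this", "it", "is", "as", "a", "and"}

def toy_clean(text):
    """Lowercase, tokenize, and drop non-alphabetic tokens and stop words."""
    # Tokenization: split on whitespace after lowercasing.
    tokens = text.lower().split()
    # Noise removal: strip surrounding punctuation, keep alphabetic tokens only.
    tokens = [re.sub(r"^\W+|\W+$", "", t) for t in tokens]
    tokens = [t for t in tokens if t.isalpha()]
    # Noise removal: drop stop words ("not" is deliberately not in the list).
    return [t for t in tokens if t not in TOY_STOP_WORDS]

sample = "I loved the photo of this dress, but it is NOT as flattering as 2 others!"
print(toy_clean(sample))
```

Keeping "not" in the output matters for sentiment: without it, "not as flattering" would be reduced to "flattering".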



### Tokenization, Noise Removal, Lexicon Normalization

In [39]:
df.Review_Text[850]

'I loved the photo of this dress. upon examination of the dress (and trying it on) after receiving in the mail, the dress shown online is nothing like the dress i received save for the pattern. the dress i received has a side zip as well as a belt and no pleats on the top. the bottom is also cut straight across not as it appears in the photo. turns out it is not as flattering as it should appear.'

In [40]:
stop_words = stopwords.words('english')

# Keep the negations "not" and "no": they carry sentiment
for i in ["not", "no"]:
    stop_words.remove(i)

In [41]:
def cleaning(data):

    # 1. Remove apostrophes so that negative contractions (e.g. "didn't" -> "didnt") stay in the text
    data = data.replace("'", "")

    # 2. Tokenize the lowercased text
    text_tokens = word_tokenize(data.lower())

    # 3. Remove punctuation and numbers (keep alphabetic tokens only)
    tokens_without_punc = [w for w in text_tokens if w.isalpha()]

    # 4. Remove stop words
    tokens_without_sw = [t for t in tokens_without_punc if t not in stop_words]

    # 5. Lemmatize each token, reusing a single lemmatizer instance
    lemmatizer = WordNetLemmatizer()
    text_cleaned = [lemmatizer.lemmatize(t) for t in tokens_without_sw]

    # Join the tokens back into a single string
    return " ".join(text_cleaned)
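
Step 1 of `cleaning` matters because of how it interacts with step 3. NLTK's default word tokenizer splits contractions such as "didn't" into "did" and "n't", and `str.isalpha()` rejects any token that contains an apostrophe. The sketch below uses plain string methods and hand-written token lists (an assumption standing in for the tokenizer output) to show what the alphabetic filter keeps in each case:

```python
# Tokens as they would look with and without the apostrophe-removal step.
with_apostrophe = ["did", "n't", "fit"]   # "didn't fit" after contraction splitting
without_apostrophe = ["didnt", "fit"]     # "didnt fit" once "'" has been removed

# The same alphabetic filter used in step 3 of cleaning().
kept_with = [t for t in with_apostrophe if t.isalpha()]
kept_without = [t for t in without_apostrophe if t.isalpha()]

print(kept_with)     # the negation marker "n't" has been filtered out
print(kept_without)  # the negation survives inside "didnt"
```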

In [42]:
df["Review_Text"] = df["Review_Text"].apply(cleaning)
df["Review_Text"].head()

0          absolutely wonderful silky sexy comfortable
1    love dress sooo pretty happened find store im ...
2    high hope dress really wanted work initially o...
3    love love love jumpsuit fun flirty fabulous ev...
4    shirt flattering due adjustable front tie perf...
Name: Review_Text, dtype: object

## 4. WordCloud - Repetition of Words

Now you'll create word clouds for the reviews, representing the most common words in each target class.

Word Cloud is a data visualization technique used for representing text data in which the size of each word indicates its frequency or importance. Significant textual data points can be highlighted using a word cloud.

You are expected to create separate word clouds for positive and negative reviews. You can qualify a review as positive or negative, by looking at its recommended status. You may need to use capabilities of matplotlib for visualizations.

You can follow the steps below:

- Detect Reviews
- Collect Words 
- Create Word Cloud 
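
The quantity a word cloud visualizes is simply how often each word occurs. As a small sketch (the three reviews below are made-up examples, not rows from the dataset), the same counts can be inspected with the standard library before drawing anything:

```python
from collections import Counter

# Hypothetical cleaned reviews standing in for one target class.
reviews = [
    "love dress fit perfect",
    "love color fit true size",
    "dress run small",
]

# Collect words: join all reviews into one string, as the cells below do.
all_words = " ".join(reviews)

# Count repetitions; a word cloud draws more frequent words larger.
frequencies = Counter(all_words.split())
print(frequencies.most_common(3))
```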


### Detect Reviews (positive and negative separately)

In [43]:
df_negative = df[df.Recommended_IND == 1]
df_negative.head()

Unnamed: 0,Review_Text,Recommended_IND
2,high hope dress really wanted work initially o...,1
5,love tracy reese dress one not petite foot tal...,1
10,dress run small esp zipper area run ordered sp...,1
22,first not pullover styling side zipper wouldnt...,1
25,loved material didnt really look long dress pu...,1


In [44]:
df_positive = df[df.Recommended_IND == 0]
df_positive.head()

Unnamed: 0,Review_Text,Recommended_IND
0,absolutely wonderful silky sexy comfortable,0
1,love dress sooo pretty happened find store im ...,0
3,love love love jumpsuit fun flirty fabulous ev...,0
4,shirt flattering due adjustable front tie perf...,0
6,aded basket hte last mintue see would look lik...,0


### Collect Words (positive and negative separately)

In [45]:
positive_words = " ".join(df_positive.Review_Text)
positive_words[:100]

'absolutely wonderful silky sexy comfortable love dress sooo pretty happened find store im glad bc ne'

In [46]:
negative_words = " ".join(df_negative.Review_Text)
negative_words[:100]

'high hope dress really wanted work initially ordered petite small usual size found outrageously smal'

### Create Word Cloud (for most common words in recommended and not recommended reviews separately)

In [47]:
# pip install wordcloud

In [48]:
from wordcloud import WordCloud

In [49]:
worldcloud = WordCloud(background_color="black", max_words =250, colormap=plt.cm.hsv)

In [50]:
worldcloud.generate(positive_words)

<wordcloud.wordcloud.WordCloud at 0x7f741f8ba410>

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize = (15,15))
plt.imshow(worldcloud, interpolation="bilinear")
plt.title("Most Frequently Used Words in Positive Comments", fontdict={"size": 15})
plt.axis("off")
plt.show()

In [52]:
worldcloud.generate(negative_words)

<wordcloud.wordcloud.WordCloud at 0x7f741f8ba410>

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize = (15,15))
plt.imshow(worldcloud, interpolation="bilinear")
plt.title("Most Frequently Used Words in Negative Comments", fontdict={"size": 15})
plt.axis("off")
plt.show()

## 5. Sentiment Classification with Machine Learning, Deep Learning and BERT model

Before moving on to modeling, you will need to perform two data preprocessing steps: **[vectorization](https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/)** and **train-test split**. You have performed the train-test split many times before.
But you will perform the vectorization for the first time.

Machine learning algorithms most often take numeric feature vectors as input. Thus, when working with text documents, you need a way to convert each document into a numeric vector. This process is known as text vectorization. A commonly used vectorization approach, which you will use here, is to represent each text as a vector of word counts.

At this point, your review text column holds cleaned tokens (with no punctuation or stop words). You can use Scikit-learn’s CountVectorizer to convert the text collection into a matrix of token counts. You can picture the result as a 2-D matrix in which each row is a review and each column is a unique word.
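
The layout of a count matrix can be sketched without scikit-learn. The example below is a hand-rolled illustration on three made-up reviews: it builds one row per document and one column per vocabulary term, which is the orientation `CountVectorizer` produces. It does not reproduce options such as `min_df`.

```python
from collections import Counter

# Hypothetical cleaned reviews (already lowercased, no punctuation).
docs = ["love dress love fit", "dress run small", "not love fit"]

# Vocabulary: sorted unique tokens, one column per token.
vocabulary = sorted({token for doc in docs for token in doc.split()})

# Count matrix: one row per document, one column per vocabulary term.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Most entries are zero even in this tiny example, which is why the real vectorizer returns a sparse matrix.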

Train all models using both the TF-IDF and the CountVectorizer data.

**For the Deep Learning model, use an embedding layer for all words.** 

**For the BERT model, use TF tensors.**

After performing data preprocessing, build your models using the following classification algorithms:

- Logistic Regression
- Naive Bayes
- Support Vector Machine
- Random Forest
- Ada Boosting
- Deep Learning Model
- BERT Model

### Train - Test Split

To run machine learning algorithms, we need to convert the text into numerical feature vectors. We will use the bag-of-words model for our analysis.

First, we split the data into train and test sets:

In [54]:
from sklearn.model_selection import train_test_split

In [55]:
X = df["Review_Text"]
y= df["Recommended_IND"]

In [56]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y, random_state=101)
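
The effect of `stratify=y` can be checked from the support counts that the classification reports print later in this notebook. The sketch below only re-uses those printed numbers (it does not re-run the split) to compare the share of class 1 in the two sets:

```python
# Support counts (class 0, class 1) copied from the classification reports
# printed later in this notebook.
train_support = {0: 16685, 1: 3691}
test_support = {0: 1855, 1: 410}

def minority_share(support):
    """Fraction of samples that belong to class 1."""
    return support[1] / (support[0] + support[1])

n_train = sum(train_support.values())
n_test = sum(test_support.values())

print(f"train: {n_train} rows, class 1 share {minority_share(train_support):.4f}")
print(f"test:  {n_test} rows, class 1 share {minority_share(test_support):.4f}")
print(f"test fraction: {n_test / (n_train + n_test):.3f}")
```

Both shares are close to 0.18, so the split keeps the class proportions of the full data set.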

Before creating the numerical feature vectors, we define a helper function that prints the confusion matrix and the classification reports for the test and train sets:

In [57]:
from sklearn.metrics import confusion_matrix, classification_report, f1_score, recall_score

In [58]:
def eval(model, X_train, X_test):
    y_pred = model.predict(X_test)
    y_pred_train = model.predict(X_train)
    print(confusion_matrix(y_test, y_pred))
    print("Test_Set")
    print(classification_report(y_test,y_pred))
    print("Train_Set")
    print(classification_report(y_train,y_pred_train))

### Count Vectorization

In [59]:
from sklearn.feature_extraction.text import CountVectorizer

In [60]:
vectorizer = CountVectorizer(preprocessor=cleaning, min_df=3)
X_train_count = vectorizer.fit_transform(X_train)
X_test_count = vectorizer.transform(X_test)

In [61]:
X_train_count.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [62]:
pd.DataFrame(X_train_count.toarray(), columns = vectorizer.get_feature_names_out())

Unnamed: 0,ab,abby,abdomen,ability,able,abo,absolute,absolutely,abstract,absurd,abt,abundance,ac,accent,accented,accentuate,accentuated,accentuates,accentuating,accept,acceptable,access,accessorize,accessorized,accessorizing,...,yo,yoga,yoke,york,youd,youll,young,younger,youre,youthful,youve,yuck,yucky,yummy,zag,zero,zig,zigzag,zip,zipped,zipper,zippered,zipping,zone,zoom
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20371,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
20372,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
20373,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
20374,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### TF-IDF

In [63]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [64]:
tf_idf_vectorizer = TfidfVectorizer(preprocessor=cleaning, min_df=3)
X_train_tf_idf = tf_idf_vectorizer.fit_transform(X_train)            
X_test_tf_idf = tf_idf_vectorizer.transform(X_test)
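
As a rough sketch of what the TF-IDF weights mean, the example below computes them by hand for three made-up reviews. It assumes the vectorizer's default settings (raw term counts, smoothed idf, and L2 normalization of each row); the names `doc_freq`, `idf` and `tfidf_row` are illustrative and not part of scikit-learn.

```python
import math

# Hypothetical cleaned reviews; each inner list is one tokenized document.
docs = [["love", "dress"], ["dress", "small"], ["love", "love", "fit"]]
n_docs = len(docs)
vocabulary = sorted({t for doc in docs for t in doc})

# Document frequency: number of documents that contain each term.
doc_freq = {t: sum(t in doc for doc in docs) for t in vocabulary}

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_row(doc):
    """Raw term count times idf, then scaled to unit Euclidean length."""
    weights = [doc.count(t) * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

print(vocabulary)
for doc in docs:
    print([round(w, 3) for w in tfidf_row(doc)])
```

In the second review, "small" appears in fewer documents than "dress", so it receives the larger weight even though both occur once.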

## Logistic Regression

### CountVectorizer

In [65]:
from sklearn.linear_model import LogisticRegression
log = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=101)
log.fit(X_train_count,y_train)

LogisticRegression(class_weight='balanced', max_iter=1000, random_state=101)

In [66]:
print("LOG MODEL")
eval(log, X_train_count, X_test_count)

LOG MODEL
[[1653  202]
 [  82  328]]
Test_Set
              precision    recall  f1-score   support

           0       0.95      0.89      0.92      1855
           1       0.62      0.80      0.70       410

    accuracy                           0.87      2265
   macro avg       0.79      0.85      0.81      2265
weighted avg       0.89      0.87      0.88      2265

Train_Set
              precision    recall  f1-score   support

           0       0.99      0.92      0.95     16685
           1       0.73      0.96      0.83      3691

    accuracy                           0.93     20376
   macro avg       0.86      0.94      0.89     20376
weighted avg       0.94      0.93      0.93     20376
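
The class 1 scores in the test report can be re-derived from the confusion matrix printed above it. The sketch below copies the four cell values and assumes scikit-learn's convention that rows are true labels and columns are predicted labels:

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted).
tn, fp = 1653, 202
fn, tp = 82, 328

precision = tp / (tp + fp)   # share of predicted class 1 that is truly class 1
recall = tp / (tp + fn)      # share of true class 1 that was found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```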



In [67]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

# scoring = {'accuracy': make_scorer(accuracy_score),
#             'precision-neg': make_scorer(precision_score, average=None),
#             'recall-neg': make_scorer(recall_score, average=None),
#             'f1-neg': make_scorer(f1_score, average=None)}

In [68]:
# Cross Validation

model = LogisticRegression(max_iter=1000, class_weight="balanced", random_state=101)
scores = cross_validate(model, X_train_count, y_train, 
                        scoring = ['precision','recall','f1','accuracy'], 
                        cv = 10, return_train_score=True)
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

test_precision     0.607573
train_precision    0.734620
test_recall        0.789492
train_recall       0.967338
test_f1            0.686448
train_f1           0.835065
test_accuracy      0.869356
train_accuracy     0.930779
dtype: float64

In [69]:
# Logistic Model with C=0.01

log = LogisticRegression(C=0.01, max_iter=1000, class_weight="balanced", random_state=101)
log.fit(X_train_count,y_train)

LogisticRegression(C=0.01, class_weight='balanced', max_iter=1000,
                   random_state=101)

In [70]:
print("LOG MODEL with C=0.01")
eval(log, X_train_count, X_test_count)

LOG MODEL with C=0.01
[[1581  274]
 [  64  346]]
Test_Set
              precision    recall  f1-score   support

           0       0.96      0.85      0.90      1855
           1       0.56      0.84      0.67       410

    accuracy                           0.85      2265
   macro avg       0.76      0.85      0.79      2265
weighted avg       0.89      0.85      0.86      2265

Train_Set
              precision    recall  f1-score   support

           0       0.97      0.86      0.91     16685
           1       0.58      0.88      0.70      3691

    accuracy                           0.86     20376
   macro avg       0.77      0.87      0.80     20376
weighted avg       0.90      0.86      0.87     20376



In [71]:
# Cross Validation with C=0.01

model = LogisticRegression(C=0.01, max_iter=1000, class_weight="balanced", random_state=101)
scores = cross_validate(model, X_train_count, y_train, 
                        scoring = ['precision','recall','f1','accuracy'], 
                        cv = 10, return_train_score=True)
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

test_precision     0.554106
train_precision    0.575180
test_recall        0.849363
train_recall       0.883380
test_f1            0.670597
train_f1           0.696717
test_accuracy      0.848842
train_accuracy     0.860686
dtype: float64
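
In scikit-learn's `LogisticRegression`, a smaller `C` means stronger regularization (the default is `C=1.0`). One way to read the two cross-validation outputs above is to compare the gap between train and test F1 scores. The sketch below only re-uses the printed means:

```python
# Mean 10-fold F1 scores copied from the two cross-validation outputs above.
cv_f1 = {
    "C=1.0 (default)": {"train": 0.835065, "test": 0.686448},
    "C=0.01": {"train": 0.696717, "test": 0.670597},
}

# Gap between train and test F1; a large gap suggests overfitting.
gaps = {name: s["train"] - s["test"] for name, s in cv_f1.items()}

for name, gap in gaps.items():
    print(f"{name}: train-test F1 gap = {gap:.3f}")
```

With `C=0.01` the gap shrinks considerably, at the cost of a slightly lower test F1.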

In [72]:
from yellowbrick.classifier import PrecisionRecallCurve
from sklearn.metrics import PrecisionRecallDisplay, f1_score, recall_score, average_precision_score

In [73]:
viz = PrecisionRecallCurve(
    LogisticRegression(C = 0.01, max_iter=1000, class_weight="balanced", random_state=101),
    per_class=True,
    cmap="Set1"
)
viz.fit(X_train_count,y_train)
viz.score(X_test_count, y_test)
viz.show();

ImportError: ignored

<Figure size 576x396 with 1 Axes>

In [74]:
log = LogisticRegression(C=0.01, max_iter=1000, class_weight="balanced", random_state=101).fit(X_train_count,y_train)

y_pred = log.predict(X_test_count)
log_count_rec_neg = recall_score(y_test, y_pred, average = 'binary')
log_count_f1_neg = f1_score(y_test, y_pred, average = 'binary')
log_count_AP = viz.score_

### TF-IDF

In [75]:
from sklearn.linear_model import LogisticRegression
log = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=101)
log.fit(X_train_tf_idf,y_train)

LogisticRegression(class_weight='balanced', max_iter=1000, random_state=101)

In [76]:
print("LOG MODEL")
eval(log, X_train_tf_idf, X_test_tf_idf)

LOG MODEL
[[1609  246]
 [  60  350]]
Test_Set
              precision    recall  f1-score   support

           0       0.96      0.87      0.91      1855
           1       0.59      0.85      0.70       410

    accuracy                           0.86      2265
   macro avg       0.78      0.86      0.80      2265
weighted avg       0.90      0.86      0.87      2265

Train_Set
              precision    recall  f1-score   support

           0       0.98      0.88      0.93     16685
           1       0.64      0.93      0.76      3691

    accuracy                           0.89     20376
   macro avg       0.81      0.90      0.84     20376
weighted avg       0.92      0.89      0.90     20376



In [77]:
# Cross Validation

model = LogisticRegression(max_iter=1000, class_weight="balanced", random_state=101)
scores = cross_validate(model, X_train_tf_idf, y_train, 
                        scoring = ['precision','recall','f1','accuracy'], 
                        cv = 10, return_train_score=True)
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

test_precision     0.592765
train_precision    0.639967
test_recall        0.851803
train_recall       0.927602
test_f1            0.698920
train_f1           0.757393
test_accuracy      0.867049
train_accuracy     0.892352
dtype: float64

In [78]:
# Cross Validation with C=0.1

model = LogisticRegression(C= 0.1, max_iter=1000, class_weight="balanced", random_state=101)
scores = cross_validate(model, X_train_tf_idf, y_train, 
                        scoring = ['precision','recall','f1','accuracy'], 
                        cv = 10, return_train_score=True)
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

test_precision     0.552917
train_precision    0.568957
test_recall        0.864269
train_recall       0.892471
test_f1            0.674282
train_f1           0.694905
test_accuracy      0.848694
train_accuracy     0.858041
dtype: float64

In [79]:
viz = PrecisionRecallCurve(
    LogisticRegression(C= 0.1, max_iter=1000, class_weight="balanced", random_state=101),
    per_class=True,
    cmap="Set1"
)
viz.fit(X_train_tf_idf,y_train)
viz.score(X_test_tf_idf, y_test)
viz.show();

ImportError: ignored

<Figure size 576x396 with 1 Axes>

In [80]:
log = LogisticRegression(C= 0.1, max_iter=1000, class_weight="balanced", random_state=101).fit(X_train_tf_idf, y_train)

y_pred = log.predict(X_test_tf_idf)
log_tfidf_rec_neg = recall_score(y_test, y_pred, average = 'binary')
log_tfidf_f1_neg = f1_score(y_test, y_pred, average = 'binary')
log_tfidf_AP = viz.score_

## Naive Bayes 

### CountVectorizer

In [81]:
from sklearn.utils import class_weight
classes_weights = class_weight.compute_sample_weight(class_weight='balanced', y=y_train)
classes_weights

array([2.76022758, 2.76022758, 0.61060833, ..., 0.61060833, 0.61060833,
       0.61060833])
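
The two distinct values in this array follow from the "balanced" weighting rule, `n_samples / (n_classes * class_count)`. As a quick check, the sketch below applies that rule to the training-set class supports shown in the classification reports of this notebook:

```python
# Class supports of the training set (class 0, class 1), as printed in the
# classification reports of this notebook.
class_counts = {0: 16685, 1: 3691}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "balanced" weighting: n_samples / (n_classes * count of that class).
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in class_weights.items():
    print(f"class {label}: weight {weight:.8f}")
```

Every sample of the minority class 1 therefore counts about 4.5 times as much as a sample of class 0 when the model is fitted.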

In [82]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

In [83]:
nb = MultinomialNB(alpha=1) 
nb.fit(X_train_count, y_train, sample_weight=classes_weights)

MultinomialNB(alpha=1)

In [84]:
print("NB MODEL with CountVectorizer")
eval(nb, X_train_count, X_test_count)

NB MODEL with CountVectorizer
[[1586  269]
 [  53  357]]
Test_Set
              precision    recall  f1-score   support

           0       0.97      0.85      0.91      1855
           1       0.57      0.87      0.69       410

    accuracy                           0.86      2265
   macro avg       0.77      0.86      0.80      2265
weighted avg       0.90      0.86      0.87      2265

Train_Set
              precision    recall  f1-score   support

           0       0.98      0.87      0.92     16685
           1       0.61      0.91      0.73      3691

    accuracy                           0.88     20376
   macro avg       0.79      0.89      0.82     20376
weighted avg       0.91      0.88      0.89     20376
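
The class-1 figures in the test report can be recovered directly from the test-set confusion matrix printed above. The following sketch is only a worked check; it assumes the usual layout in which rows are true classes and columns are predicted classes.

```python
# Test-set confusion matrix from the output above:
# rows = true class, columns = predicted class.
cm = [[1586, 269],
      [53, 357]]

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)   # share of predicted 1s that are truly 1
recall = tp / (tp + fn)      # share of true 1s that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```

The high recall paired with modest precision shows the effect of the balanced sample weights: the model catches most class-1 reviews at the cost of more false positives.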



In [85]:
model = MultinomialNB(alpha=1)
scores = cross_validate(model, X_train_count, y_train, 
                        scoring = ['precision','recall','f1','accuracy'], 
                        cv = 10, return_train_score=True, 
                        fit_params={"sample_weight":classes_weights})
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

test_precision     0.573209
train_precision    0.608333
test_recall        0.856416
train_recall       0.914477
test_f1            0.686646
train_f1           0.730629
test_accuracy      0.858363
train_accuracy     0.877852
dtype: float64

In [86]:
# cross validation  with alpha=10

model = MultinomialNB(alpha=10)
scores = cross_validate(model, X_train_count, y_train, 
                        scoring = ['precision','recall','f1','accuracy'], 
                        cv = 10, return_train_score=True, 
                        fit_params={"sample_weight":classes_weights})
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

test_precision     0.557855
train_precision    0.578800
test_recall        0.872668
train_recall       0.906620
test_f1            0.680546
train_f1           0.706535
test_accuracy      0.851590
train_accuracy     0.863570
dtype: float64

In [87]:
viz = PrecisionRecallCurve(
    MultinomialNB(alpha=10),
    classes=nb.classes_,
    per_class=True,
    cmap="Set1"
)
viz.fit(X_train_count,y_train)
viz.score(X_test_count, y_test)
viz.show();

<Figure size 576x396 with 1 Axes>

In [88]:
# Note: unlike the fits above, this final fit does not pass sample_weight.
nb = MultinomialNB(alpha=10).fit(X_train_count, y_train)

y_pred = nb.predict(X_test_count)
nb_count_rec_neg = recall_score(y_test, y_pred, average = 'binary')
nb_count_f1_neg = f1_score(y_test, y_pred, average = 'binary')
nb_count_AP = viz.score_

### TF-IDF

In [89]:
nb = MultinomialNB(alpha=1) 
                           
nb.fit(X_train_tf_idf, y_train, sample_weight=classes_weights)

MultinomialNB(alpha=1)

In [90]:
print("NB MODEL with TF-IDF")
eval(nb, X_train_tf_idf, X_test_tf_idf)

NB MODEL with TF-IDF
[[1572  283]
 [  51  359]]
Test_Set
              precision    recall  f1-score   support

           0       0.97      0.85      0.90      1855
           1       0.56      0.88      0.68       410

    accuracy                           0.85      2265
   macro avg       0.76      0.86      0.79      2265
weighted avg       0.89      0.85      0.86      2265

Train_Set
              precision    recall  f1-score   support

           0       0.98      0.86      0.92     16685
           1       0.59      0.92      0.72      3691

    accuracy                           0.87     20376
   macro avg       0.78      0.89      0.82     20376
weighted avg       0.91      0.87      0.88     20376



In [91]:
model = MultinomialNB(alpha=1)
scores = cross_validate(model, X_train_tf_idf, y_train, 
                        scoring = ['precision','recall','f1','accuracy'], 
                        cv = 10, return_train_score=True, 
                        fit_params={"sample_weight":classes_weights})
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

test_precision     0.559323
train_precision    0.593929
test_recall        0.864814
train_recall       0.918480
test_f1            0.679154
train_f1           0.721379
test_accuracy      0.851933
train_accuracy     0.871477
dtype: float64

In [92]:
# cross validation with alpha=10

model = MultinomialNB(alpha=10)
scores = cross_validate(model, X_train_tf_idf, y_train, 
                        scoring = ['precision','recall','f1','accuracy'], 
                        cv = 10, return_train_score=True, 
                        fit_params={"sample_weight":classes_weights})
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

test_precision     0.545759
train_precision    0.564449
test_recall        0.880526
train_recall       0.909961
test_f1            0.673771
train_f1           0.696718
test_accuracy      0.845504
train_accuracy     0.856492
dtype: float64

In [93]:
viz = PrecisionRecallCurve(
    MultinomialNB(alpha=10),
    classes=nb.classes_,
    per_class=True,
    cmap="Set1"
)
viz.fit(X_train_tf_idf,y_train)
viz.score(X_test_tf_idf, y_test)
viz.show();

<Figure size 576x396 with 1 Axes>

In [94]:
# Note: unlike the fits above, this final fit does not pass sample_weight.
nb = MultinomialNB(alpha=10).fit(X_train_tf_idf, y_train)

y_pred = nb.predict(X_test_tf_idf)
nb_tfidf_rec_neg = recall_score(y_test, y_pred, average = 'binary')
nb_tfidf_f1_neg = f1_score(y_test, y_pred, average = 'binary')
nb_tfidf_AP = viz.score_

## Support Vector Machine (SVM)

### Countvectorizer

In [95]:
from sklearn.svm import LinearSVC 
svc = LinearSVC(class_weight="balanced", random_state=101)
svc.fit(X_train_count,y_train)

LinearSVC(class_weight='balanced', random_state=101)

In [96]:
print("SVC MODEL")
eval(svc, X_train_count, X_test_count)

SVC MODEL
[[1651  204]
 [ 105  305]]
Test_Set
              precision    recall  f1-score   support

           0       0.94      0.89      0.91      1855
           1       0.60      0.74      0.66       410

    accuracy                           0.86      2265
   macro avg       0.77      0.82      0.79      2265
weighted avg       0.88      0.86      0.87      2265

Train_Set
              precision    recall  f1-score   support

           0       1.00      0.95      0.97     16685
           1       0.81      0.98      0.89      3691

    accuracy                           0.95     20376
   macro avg       0.90      0.96      0.93     20376
weighted avg       0.96      0.95      0.96     20376



In [97]:
model = LinearSVC(class_weight="balanced", random_state=101)
scores = cross_validate(model, X_train_count, y_train, 
                        scoring = ['precision','recall','f1','accuracy'], 
                        cv = 10, return_train_score=True)
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

test_precision     0.580787
train_precision    0.824954
test_recall        0.704699
train_recall       0.983895
test_f1            0.636548
train_f1           0.897437
test_accuracy      0.854289
train_accuracy     0.959260
dtype: float64

In [98]:
# cross validation with C=0.001 

model = LinearSVC(C=0.001, class_weight="balanced", random_state=101)
scores = cross_validate(model, X_train_count, y_train, 
                        scoring = ['precision','recall','f1','accuracy'], 
                        cv = 10, return_train_score=True)
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

test_precision     0.553761
train_precision    0.575577
test_recall        0.851532
train_recall       0.886691
test_f1            0.671016
train_f1           0.698036
test_accuracy      0.848744
train_accuracy     0.861035
dtype: float64

In [99]:
viz = PrecisionRecallCurve(
    LinearSVC(C=0.001, class_weight="balanced", random_state=101),
    classes=svc.classes_,
    per_class=True,
    cmap="Set1"
)
viz.fit(X_train_count,y_train)
viz.score(X_test_count, y_test)
viz.show();

<Figure size 576x396 with 1 Axes>

In [100]:
svc = LinearSVC(C=0.001, class_weight="balanced", random_state=101).fit(X_train_count, y_train)

y_pred = svc.predict(X_test_count)
svc_count_rec_neg = recall_score(y_test, y_pred, average = 'binary')
svc_count_f1_neg = f1_score(y_test, y_pred, average = 'binary')
svc_count_AP = viz.score_

### TF-IDF

In [101]:
from sklearn.svm import LinearSVC
svc = LinearSVC(class_weight="balanced", random_state=101)
svc.fit(X_train_tf_idf,y_train)

LinearSVC(class_weight='balanced', random_state=101)

In [102]:
print("SVC MODEL")
eval(svc, X_train_tf_idf, X_test_tf_idf)

SVC MODEL
[[1632  223]
 [  76  334]]
Test_Set
              precision    recall  f1-score   support

           0       0.96      0.88      0.92      1855
           1       0.60      0.81      0.69       410

    accuracy                           0.87      2265
   macro avg       0.78      0.85      0.80      2265
weighted avg       0.89      0.87      0.88      2265

Train_Set
              precision    recall  f1-score   support

           0       0.99      0.92      0.95     16685
           1       0.72      0.97      0.83      3691

    accuracy                           0.93     20376
   macro avg       0.86      0.94      0.89     20376
weighted avg       0.94      0.93      0.93     20376



In [103]:
model = LinearSVC(class_weight="balanced", random_state=101)
scores = cross_validate(model, X_train_tf_idf, y_train, scoring = ['precision','recall','f1','accuracy'], cv = 10, return_train_score=True)
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

test_precision     0.604945
train_precision    0.725171
test_recall        0.799783
train_recall       0.971673
test_f1            0.688623
train_f1           0.830512
test_accuracy      0.869013
train_accuracy     0.928156
dtype: float64

In [104]:
# cross validation with C=0.01

model = LinearSVC(C=0.01, class_weight="balanced", random_state=101)
scores = cross_validate(model, X_train_tf_idf, y_train, 
                        scoring = ['precision','recall','f1','accuracy'], 
                        cv = 10, return_train_score=True)
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

test_precision     0.548936
train_precision    0.565584
test_recall        0.868874
train_recall       0.896114
test_f1            0.672714
train_f1           0.693476
test_accuracy      0.846780
train_accuracy     0.856498
dtype: float64

In [105]:
# Note: the cross-validation above used C=0.01, while this cell and the final fit use C=0.001.
viz = PrecisionRecallCurve(
    LinearSVC(C=0.001, class_weight="balanced", random_state=101),
    classes=svc.classes_,
    per_class=True,
    cmap="Set1"
)
viz.fit(X_train_tf_idf, y_train)
viz.score(X_test_tf_idf, y_test)
viz.show();

<Figure size 576x396 with 1 Axes>

In [106]:
svc = LinearSVC(C=0.001, class_weight="balanced", random_state=101).fit(X_train_tf_idf, y_train)

y_pred = svc.predict(X_test_tf_idf)
svc_tfidf_rec_neg = recall_score(y_test, y_pred, average = 'binary')
svc_tfidf_f1_neg = f1_score(y_test, y_pred, average = 'binary')
svc_tfidf_AP = viz.score_

## Random Forest

### Countvectorizer

In [107]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_depth= 10, class_weight="balanced", random_state = 101, n_jobs = -1)
rf.fit(X_train_count, y_train)

RandomForestClassifier(class_weight='balanced', max_depth=10, n_jobs=-1,
                       random_state=101)

In [108]:
print("RF MODEL")
eval(rf, X_train_count, X_test_count)

RF MODEL
[[1583  272]
 [  81  329]]
Test_Set
              precision    recall  f1-score   support

           0       0.95      0.85      0.90      1855
           1       0.55      0.80      0.65       410

    accuracy                           0.84      2265
   macro avg       0.75      0.83      0.78      2265
weighted avg       0.88      0.84      0.85      2265

Train_Set
              precision    recall  f1-score   support

           0       0.97      0.87      0.92     16685
           1       0.60      0.88      0.71      3691

    accuracy                           0.87     20376
   macro avg       0.78      0.87      0.81     20376
weighted avg       0.90      0.87      0.88     20376



In [109]:
model = RandomForestClassifier(n_estimators=100, max_depth= 6, class_weight="balanced", random_state = 101, n_jobs = -1)
scores = cross_validate(model, X_train_count, y_train, 
                        scoring = ['precision','recall','f1','accuracy'], 
                        cv = 10, return_train_score=True)
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

test_precision     0.514194
train_precision    0.539390
test_recall        0.802219
train_recall       0.845510
test_f1            0.626456
train_f1           0.658527
test_accuracy      0.826708
train_accuracy     0.841098
dtype: float64

In [110]:
viz = PrecisionRecallCurve(
    RandomForestClassifier(n_estimators=100, max_depth= 6, class_weight="balanced", random_state = 101, n_jobs = -1),
    classes=rf.classes_,
    per_class=True,
    cmap="Set1"
)
viz.fit(X_train_count,y_train)
viz.score(X_test_count, y_test)
viz.show();

<Figure size 576x396 with 1 Axes>

In [111]:
rf = RandomForestClassifier(n_estimators=100, max_depth= 6, class_weight="balanced", random_state = 101, n_jobs = -1).fit(X_train_count, y_train)

y_pred = rf.predict(X_test_count)
rf_count_rec_neg = recall_score(y_test, y_pred, average = 'binary')
rf_count_f1_neg = f1_score(y_test, y_pred, average = 'binary')
rf_count_AP = viz.score_

### TF-IDF

In [112]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_depth= 10, class_weight="balanced", random_state = 101, n_jobs = -1)
rf.fit(X_train_tf_idf, y_train)

RandomForestClassifier(class_weight='balanced', max_depth=10, n_jobs=-1,
                       random_state=101)

In [113]:
print("RF MODEL")
eval(rf, X_train_tf_idf, X_test_tf_idf)

RF MODEL
[[1571  284]
 [  82  328]]
Test_Set
              precision    recall  f1-score   support

           0       0.95      0.85      0.90      1855
           1       0.54      0.80      0.64       410

    accuracy                           0.84      2265
   macro avg       0.74      0.82      0.77      2265
weighted avg       0.88      0.84      0.85      2265

Train_Set
              precision    recall  f1-score   support

           0       0.97      0.87      0.92     16685
           1       0.60      0.89      0.72      3691

    accuracy                           0.87     20376
   macro avg       0.79      0.88      0.82     20376
weighted avg       0.90      0.87      0.88     20376



In [114]:
model = RandomForestClassifier(n_estimators=100, max_depth= 5, class_weight="balanced", random_state = 101, n_jobs = -1)
scores = cross_validate(model, X_train_tf_idf, y_train, 
                        scoring = ['precision','recall','f1','accuracy'], 
                        cv = 10, return_train_score=True)
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

test_precision     0.490700
train_precision    0.517044
test_recall        0.810077
train_recall       0.854902
test_f1            0.610891
train_f1           0.644208
test_accuracy      0.813016
train_accuracy     0.828840
dtype: float64

In [115]:
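# Note: the cross-validation above used max_depth=5, while this cell and the final fit use max_depth=6.
viz = PrecisionRecallCurve(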
    RandomForestClassifier(n_estimators=100, max_depth= 6, class_weight="balanced", random_state = 101, n_jobs = -1),
    classes=rf.classes_,
    per_class=True,
    cmap="Set1"
)
viz.fit(X_train_tf_idf,y_train)
viz.score(X_test_tf_idf, y_test)
viz.show();

<Figure size 576x396 with 1 Axes>

In [116]:
rf = RandomForestClassifier(n_estimators=100, max_depth= 6, class_weight="balanced", random_state = 101, n_jobs = -1).fit(X_train_tf_idf, y_train)

y_pred = rf.predict(X_test_tf_idf)
rf_tfidf_rec_neg = recall_score(y_test, y_pred, average = 'binary')
rf_tfidf_f1_neg = f1_score(y_test, y_pred, average = 'binary')
rf_tfidf_AP = viz.score_

## Ada Boosting

### Countvectorizer

In [117]:
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(random_state = 101)
ada.fit(X_train_count, y_train, sample_weight=classes_weights)

AdaBoostClassifier(random_state=101)

In [118]:
print("Ada MODEL")
eval(ada, X_train_count, X_test_count)

Ada MODEL
[[1562  293]
 [  95  315]]
Test_Set
              precision    recall  f1-score   support

           0       0.94      0.84      0.89      1855
           1       0.52      0.77      0.62       410

    accuracy                           0.83      2265
   macro avg       0.73      0.81      0.75      2265
weighted avg       0.87      0.83      0.84      2265

Train_Set
              precision    recall  f1-score   support

           0       0.94      0.83      0.89     16685
           1       0.51      0.78      0.62      3691

    accuracy                           0.82     20376
   macro avg       0.73      0.81      0.75     20376
weighted avg       0.87      0.82      0.84     20376



In [119]:
model = AdaBoostClassifier(random_state = 101)
scores = cross_validate(model, X_train_count, y_train,
                        scoring = ['precision','recall','f1','accuracy'],
                        cv = 10,
                        return_train_score=True,
                        fit_params={"sample_weight":classes_weights})
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

test_precision     0.496929
train_precision    0.506349
test_recall        0.769708
train_recall       0.782926
test_f1            0.603719
train_f1           0.614961
test_accuracy      0.816893
train_accuracy     0.822400
dtype: float64

In [120]:
model = AdaBoostClassifier(random_state = 101)
model.fit(X_train_count, y_train, sample_weight=classes_weights)

PrecisionRecallDisplay.from_estimator(model, X_test_count, y_test, pos_label=1)
plt.show()

<Figure size 576x396 with 1 Axes>

In [121]:
ada = AdaBoostClassifier(random_state = 101).fit(X_train_count, y_train, sample_weight=classes_weights)

y_pred = ada.predict(X_test_count)
y_pred_proba = ada.predict_proba(X_test_count)
ada_count_rec_neg = recall_score(y_test, y_pred, average = 'binary')
ada_count_f1_neg = f1_score(y_test, y_pred, average = 'binary')
ada_count_AP = average_precision_score(y_test, y_pred_proba[:,1])

### TF-IDF

In [122]:
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(random_state = 101)
ada.fit(X_train_tf_idf, y_train, sample_weight=classes_weights)

AdaBoostClassifier(random_state=101)

In [123]:
print("Ada MODEL")
eval(ada, X_train_tf_idf, X_test_tf_idf)

Ada MODEL
[[1523  332]
 [ 105  305]]
Test_Set
              precision    recall  f1-score   support

           0       0.94      0.82      0.87      1855
           1       0.48      0.74      0.58       410

    accuracy                           0.81      2265
   macro avg       0.71      0.78      0.73      2265
weighted avg       0.85      0.81      0.82      2265

Train_Set
              precision    recall  f1-score   support

           0       0.95      0.83      0.88     16685
           1       0.50      0.79      0.61      3691

    accuracy                           0.82     20376
   macro avg       0.72      0.81      0.75     20376
weighted avg       0.87      0.82      0.83     20376



In [124]:
model = AdaBoostClassifier(random_state = 101)
scores = cross_validate(model, X_train_tf_idf, y_train, 
                        scoring = ['precision','recall','f1','accuracy'], 
                        cv = 10, return_train_score=True, 
                        fit_params={"sample_weight":classes_weights})
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

test_precision     0.483321
train_precision    0.495765
test_recall        0.774313
train_recall       0.796441
test_f1            0.594817
train_f1           0.610690
test_accuracy      0.808746
train_accuracy     0.815851
dtype: float64

In [125]:
model = AdaBoostClassifier(random_state = 101)
model.fit(X_train_tf_idf, y_train, sample_weight=classes_weights)

PrecisionRecallDisplay.from_estimator(model, X_test_tf_idf, y_test, pos_label=1)
plt.show()

<Figure size 576x396 with 1 Axes>

In [126]:
ada = AdaBoostClassifier(random_state = 101).fit(X_train_tf_idf, y_train, sample_weight=classes_weights)

y_pred = ada.predict(X_test_tf_idf)
y_pred_proba = ada.predict_proba(X_test_tf_idf)
ada_tfidf_rec_neg = recall_score(y_test, y_pred, average = 'binary')
ada_tfidf_f1_neg = f1_score(y_test, y_pred, average = 'binary')
ada_tfidf_AP = average_precision_score(y_test, y_pred_proba[:,1])

## DL modeling

In [127]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, GRU, Embedding, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [128]:
X = df_copy2["Review_Text"]
y = df_copy2["Recommended_IND"]

### Tokenization

In [129]:
tokenizer = Tokenizer()

tokenizer.fit_on_texts(X)

### Creating word index

In [130]:
tokenizer.word_index

{'the': 1,
 'i': 2,
 'and': 3,
 'a': 4,
 'it': 5,
 'is': 6,
 'this': 7,
 'to': 8,
 'in': 9,
 'but': 10,
 'on': 11,
 'for': 12,
 'of': 13,
 'with': 14,
 'was': 15,
 'so': 16,
 'my': 17,
 'dress': 18,
 'not': 19,
 'that': 20,
 'love': 21,
 'size': 22,
 'very': 23,
 'have': 24,
 'top': 25,
 'fit': 26,
 'are': 27,
 'like': 28,
 'be': 29,
 'as': 30,
 'me': 31,
 'wear': 32,
 "it's": 33,
 'great': 34,
 'too': 35,
 "i'm": 36,
 'or': 37,
 'am': 38,
 'just': 39,
 'you': 40,
 'would': 41,
 'they': 42,
 'up': 43,
 'at': 44,
 'fabric': 45,
 'small': 46,
 'color': 47,
 'look': 48,
 'if': 49,
 'more': 50,
 'really': 51,
 'ordered': 52,
 'little': 53,
 'perfect': 54,
 'will': 55,
 'one': 56,
 'these': 57,
 'flattering': 58,
 'well': 59,
 'an': 60,
 'soft': 61,
 'out': 62,
 'back': 63,
 'because': 64,
 'had': 65,
 'can': 66,
 '\r': 67,
 'comfortable': 68,
 'cute': 69,
 'nice': 70,
 'than': 71,
 'bought': 72,
 'beautiful': 73,
 'when': 74,
 'all': 75,
 'looks': 76,
 'bit': 77,
 'fits': 78,
 'large': 79,

In [131]:
# to see the number of unique tokens

len(tokenizer.word_index)

14847

### Converting tokens to numeric

In [132]:
X_num_tokens = tokenizer.texts_to_sequences(X)

In [133]:
X[70]

'This top is so cute, but it is massively babydoll shaped (a- line) which is not apparent from the pictures. i measured the xs i have and the chest is about 42" and sweep is over 70". i would definitely keep this top if it hung straighter. the craftsmanship is lovely and fabric is so natural and handwoven looking. i\'m thinking about asking my tailor if she can take the sides in, but that might ruin it as the fabric layout is cut and proportioned to this swing style.'

In [134]:
print(X_num_tokens[70])

[7, 25, 6, 16, 69, 10, 5, 6, 5105, 3983, 892, 4, 335, 82, 6, 19, 2296, 105, 1, 526, 2, 2772, 1, 98, 2, 24, 3, 1, 190, 6, 110, 3145, 3, 6702, 6, 151, 4248, 2, 41, 155, 260, 7, 25, 49, 5, 768, 2437, 1, 2862, 6, 193, 3, 45, 6, 16, 933, 3, 8532, 185, 36, 650, 110, 2686, 17, 1285, 49, 591, 66, 374, 1, 508, 9, 10, 20, 269, 2558, 5, 30, 1, 45, 6703, 6, 120, 3, 2559, 8, 7, 499, 133]
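
The mapping from words to integers seen above can be pictured with a small plain-Python sketch: words are ranked by frequency, the most frequent word gets index 1, and each text is then replaced by the indices of its words. This is a simplified illustration, not the Keras implementation. The helper names and toy texts are assumptions, punctuation handling is ignored, and words missing from the index are simply skipped.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace each known word with its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


toy_texts = ["the dress is cute", "the top is great", "the dress fits"]
toy_index = build_word_index(toy_texts)
toy_seqs = texts_to_sequences(["the top fits well"], toy_index)
print(toy_index)
print(toy_seqs)  # [[1, 5, 7]]
```

Index 0 is never assigned to a word, which leaves it free to act as the padding value later on.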


### Maximum number of tokens for all documents

In [135]:
# In DL models, all inputs fed to the model must have the same length.

len(X_num_tokens[70])

89

In [136]:
X[100]

"The fabric felt cheap and i didn't find it to be a flattering top. for reference i am wearing a medium in the photos and my measurements are 38-30-40."

In [137]:
print(X_num_tokens[100])

[1, 45, 267, 489, 3, 2, 124, 221, 5, 8, 29, 4, 58, 25, 12, 330, 2, 38, 141, 4, 97, 9, 1, 451, 3, 17, 1506, 27, 920, 691, 927]


In [138]:
len(X_num_tokens[100])

31

In [139]:
len(X_num_tokens[40])

26

In [140]:
num_tokens = [len(tokens) for tokens in X_num_tokens]
num_tokens = np.array(num_tokens)

In [141]:
num_tokens

array([ 7, 62, 97, ..., 42, 85, 19])

In [142]:
num_tokens.mean()

60.60699615741354

In [143]:
num_tokens.max()

116

In [144]:
num_tokens.argmax() 

16263

In [145]:
X[10111]

'This is a great staple piece for your closet. i do agree that it runs large. i am usually a small and ordered this in an xsmall and that was perfect for the billowy-ness of the top. i have a long torso and arms, so going with an xsmall made the arms slightly short on me. i do not see this as an issue as i just pull them up to 3/4 length on my arm (the sleeves are elastic so this is possible). i receive compliments every time i wear this top. it is beautiful and comfortable. i definitely recommen'

In [146]:
len(X_num_tokens[10111])

103

In [147]:
max_tokens = 103

In [148]:
sum(num_tokens < max_tokens) / len(num_tokens)

0.9564948544675589

In [149]:
sum(num_tokens < max_tokens)

21656

In [150]:
len(num_tokens)

22641

### Fixing token counts of all documents (pad_sequences)

In [151]:
X_pad = pad_sequences(X_num_tokens, maxlen=max_tokens)

In [152]:
X_pad.shape

(22641, 103)

In [153]:
np.array(X_num_tokens[600])

array([   7,  483,   78,   28,    4,  864,  231,   65, 1773, 1748,    3,
         38, 1181, 2495,   17,  468,  726,   16,    2,   15,  185,   12,
          4,  483,    8,   32,    8,    1,  865,    7,  138,   74,   36,
         39,   19,  797,    4, 2063,    7,    6,    5,   33,   68,    3,
         19, 3561,  277,    8, 2495,    3, 1868,    9,  250,  308,    1,
        285, 1801,   36,  322,   35,   16,    5,   15,  891,   90,   11,
         17,  403])

In [154]:
len(np.array(X_num_tokens[600]))

68

In [155]:
X_pad[600]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    7,  483,   78,   28,    4,  864,  231,   65, 1773,
       1748,    3,   38, 1181, 2495,   17,  468,  726,   16,    2,   15,
        185,   12,    4,  483,    8,   32,    8,    1,  865,    7,  138,
         74,   36,   39,   19,  797,    4, 2063,    7,    6,    5,   33,
         68,    3,   19, 3561,  277,    8, 2495,    3, 1868,    9,  250,
        308,    1,  285, 1801,   36,  322,   35,   16,    5,   15,  891,
         90,   11,   17,  403], dtype=int32)
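
The leading zeros in the padded array above come from the default "pre" behaviour of `pad_sequences`: short sequences are padded on the left, and sequences longer than `maxlen` lose tokens from the front. The function below is a plain-Python sketch of that behaviour. The name `pad_pre` and the toy sequences are assumptions for illustration.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    if len(sequence) >= maxlen:
        return sequence[-maxlen:]
    return [value] * (maxlen - len(sequence)) + sequence


short_seq = [7, 483, 78]
long_seq = list(range(1, 9))

padded_short = pad_pre(short_seq, 5)
padded_long = pad_pre(long_seq, 5)
print(padded_short)  # [0, 0, 7, 483, 78]
print(padded_long)   # [4, 5, 6, 7, 8]
```

Padding on the left keeps the real tokens next to the end of the sequence, which is where a recurrent layer produces its final state.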

In [156]:
np.array(X_num_tokens[10111])

array([   7,    6,    4,   34,  605,  198,   12,  173,  686,    2,  134,
        466,   20,    5,  132,   79,    2,   38,  107,    4,   46,    3,
         52,    7,    9,   60, 1598,    3,   20,   15,   54,   12,    1,
       1014, 3131,   13,    1,   25,    2,   24,    4,   90,  403,    3,
        194,   16,  176,   14,   60, 1598,  112,    1,  194,  259,  114,
         11,   31,    2,  134,   19,  121,    7,   30,   60,  449,   30,
          2,   39,  535,   85,   43,    8,  380,  154,   86,   11,   17,
        414,    1,  135,   27,  601,   16,    7,    6, 1401,    2, 1000,
        210,  316,  183,    2,   32,    7,   25,    5,    6,   73,    3,
         68,    2,  155, 6269])

In [157]:
X_pad[10111]

array([   7,    6,    4,   34,  605,  198,   12,  173,  686,    2,  134,
        466,   20,    5,  132,   79,    2,   38,  107,    4,   46,    3,
         52,    7,    9,   60, 1598,    3,   20,   15,   54,   12,    1,
       1014, 3131,   13,    1,   25,    2,   24,    4,   90,  403,    3,
        194,   16,  176,   14,   60, 1598,  112,    1,  194,  259,  114,
         11,   31,    2,  134,   19,  121,    7,   30,   60,  449,   30,
          2,   39,  535,   85,   43,    8,  380,  154,   86,   11,   17,
        414,    1,  135,   27,  601,   16,    7,    6, 1401,    2, 1000,
        210,  316,  183,    2,   32,    7,   25,    5,    6,   73,    3,
         68,    2,  155, 6269], dtype=int32)

In [158]:
len(X_num_tokens)

22641

### Train Test Split

In [159]:
X_train, X_test, y_train, y_test = train_test_split(X_pad, y, test_size=0.1, stratify=y, random_state=0)

### Modeling

In [160]:
embedding_size = 50

In [161]:
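model = Sequential()

# Embedding: vocabulary size + 1 because index 0 is reserved for padding.
model.add(Embedding(input_dim=len(tokenizer.word_index)+1,
                    output_dim=embedding_size,
                    input_length=max_tokens))

model.add(Dropout(0.5))

# Stacked GRUs: return_sequences=True passes the full sequence to the next GRU.
model.add(GRU(units=48, return_sequences=True))
model.add(Dropout(0.5))

model.add(GRU(units=24, return_sequences=True))
model.add(Dropout(0.5))

# The last GRU returns only its final state, which feeds the classifier.
model.add(GRU(units=12))
model.add(Dropout(0.5))

# Single sigmoid unit for binary classification.
model.add(Dense(1, activation='sigmoid'))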


In [162]:
optimizer = Adam(learning_rate=0.001)

In [163]:
model.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=['Recall'])

In [164]:
model.summary() 

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 103, 50)           742400    
                                                                 
 dropout (Dropout)           (None, 103, 50)           0         
                                                                 
 gru (GRU)                   (None, 103, 48)           14400     
                                                                 
 dropout_1 (Dropout)         (None, 103, 48)           0         
                                                                 
 gru_1 (GRU)                 (None, 103, 24)           5328      
                                                                 
 dropout_2 (Dropout)         (None, 103, 24)           0         
                                                                 
 gru_2 (GRU)                 (None, 12)                1
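
The parameter counts visible in the summary above can be checked with simple arithmetic. The sketch below assumes a GRU layout with three weight blocks (update gate, reset gate, candidate state), each holding an input kernel, a recurrent kernel and two bias vectors. That layout is an assumption, chosen because it reproduces the counts shown; the helper names are hypothetical.

```python
def embedding_params(vocab_size, output_dim):
    """One output_dim-sized vector per vocabulary entry."""
    return vocab_size * output_dim


def gru_params(input_dim, units):
    """Three weight blocks, each with input kernel, recurrent kernel, two biases."""
    return 3 * (units * input_dim + units * units + 2 * units)


vocab_size = 14847 + 1  # unique tokens plus the padding index
emb = embedding_params(vocab_size, 50)
gru_first = gru_params(50, 48)
gru_second = gru_params(48, 24)
print(emb, gru_first, gru_second)  # 742400 14400 5328
```

Most of the model's parameters sit in the embedding layer, so the vocabulary size has the largest effect on model size here.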

In [165]:
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_recall", mode="max", 
                           verbose=1, patience = 2, restore_best_weights=True)

In [166]:
from sklearn.utils import class_weight
classes_weights = class_weight.compute_sample_weight(class_weight='balanced', y=y_train)
pd.Series(classes_weights).unique()

array([0.61060833, 2.76022758])

In [167]:
model.fit(X_train, y_train, epochs=10, batch_size=32, sample_weight= classes_weights, 
         validation_data=(X_test, y_test), callbacks=[early_stop])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 3: early stopping


<keras.callbacks.History at 0x7f73bc85e150>

In [168]:
model_loss = pd.DataFrame(model.history.history)
model_loss.head()

       loss    recall  val_loss  val_recall
0  0.502560  0.745868  0.422751    0.919512
1  0.334791  0.892441  0.322159    0.868293
2  0.279128  0.919805  0.302340    0.800000
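
The history above explains why training stopped after three epochs: `val_recall` peaked in the first epoch and did not improve during the next two, which exhausts `patience=2`. The sketch below mimics that stopping rule in plain Python. The function `early_stopping` is a hypothetical helper, not the Keras callback itself.

```python
def early_stopping(val_scores, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a metric to maximize."""
    best_score = float("-inf")
    best_epoch = 0
    waited = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_scores), best_epoch


# val_recall per epoch, copied from the training history above.
val_recall = [0.919512, 0.868293, 0.8]
stop_epoch, best_epoch = early_stopping(val_recall, patience=2)
print(stop_epoch, best_epoch)  # 3 1
```

Because `restore_best_weights=True` is set, the model keeps the first-epoch weights, which is consistent with `model.evaluate(X_test, y_test)` reporting the first row's `val_loss` and `val_recall`.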


In [169]:
model_loss.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7f73bc6dc490>

<Figure size 576x396 with 1 Axes>

In [170]:
model.evaluate(X_test, y_test)



[0.4227510094642639, 0.9195122122764587]

In [171]:
model.evaluate(X_train, y_train)



[0.3807419538497925, 0.9376862645149231]

In [172]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score, roc_auc_score

y_pred = model.predict(X_test) >= 0.5
y_train_pred = model.predict(X_train) >= 0.5

print("Test Result")
print(confusion_matrix(y_test, y_pred))
print("-------------------------------------------------------")
print(classification_report(y_test, y_pred))
print("###"*20)
print("Train Result")
print(confusion_matrix(y_train, y_train_pred))
print("-------------------------------------------------------")
print(classification_report(y_train, y_train_pred))

Test Result
[[1450  405]
 [  33  377]]
-------------------------------------------------------
              precision    recall  f1-score   support

           0       0.98      0.78      0.87      1855
           1       0.48      0.92      0.63       410

    accuracy                           0.81      2265
   macro avg       0.73      0.85      0.75      2265
weighted avg       0.89      0.81      0.83      2265

############################################################
Train Result
[[13368  3317]
 [  230  3461]]
-------------------------------------------------------
              precision    recall  f1-score   support

           0       0.98      0.80      0.88     16685
           1       0.51      0.94      0.66      3691

    accuracy                           0.83     20376
   macro avg       0.75      0.87      0.77     20376
weighted avg       0.90      0.83      0.84     20376
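
The class-1 row of the test report can be re-derived by hand from the test confusion matrix printed above. The snippet below is only a sanity-check sketch: it hard-codes the four counts from that matrix (rows are true classes, columns are predicted classes) and assumes class 1 is the positive class.

```python
# Counts copied from the test confusion matrix above:
# [[1450  405]
#  [  33  377]]
tn, fp = 1450, 405   # true class 0: predicted 0 / predicted 1
fn, tp = 33, 377     # true class 1: predicted 0 / predicted 1

precision = tp / (tp + fp)                          # share of predicted 1s that are truly 1
recall = tp / (tp + fn)                             # share of true 1s that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

The rounded values agree with the class-1 row (0.48, 0.92, 0.63) and the accuracy (0.81) in the test report.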



In [173]:
y_pred_proba = model.predict(X_test)
PrecisionRecallDisplay.from_predictions(y_test, y_pred_proba)
plt.show()



<Figure size 576x396 with 1 Axes>

In [174]:
# Saving the model

model.save('/content/drive/MyDrive/NLP/DL_model_sentiment.h5')

In [175]:
y_pred = (model.predict(X_test) > 0.5).astype("int")
y_pred_proba = model.predict(X_test)
DL_rec_neg = recall_score(y_test, y_pred, average = 'binary')
DL_f1_neg = f1_score(y_test, y_pred, average = 'binary')
DL_AP = average_precision_score(y_test, y_pred_proba)



## BERT Modeling

In [176]:
import tensorflow as tf
import os

# Note: the `tpu` argument below is Colab-specific
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])

tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))  # in this notebook the TPU is accessed through TensorFlow

All devices:  [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU')]


In [177]:
strategy = tf.distribute.TPUStrategy(resolver)

In [178]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.23.1-py3-none-any.whl (5.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.1 tokenizers-0.13.1 transformers-4.23.1


In [179]:
df_copy2.head()

Unnamed: 0,Review_Text,Recommended_IND
0,Absolutely wonderful - silky and sexy and comf...,0
1,Love this dress! it's sooo pretty. i happene...,0
2,I had such high hopes for this dress and reall...,1
3,"I love, love, love this jumpsuit. it's fun, fl...",0
4,This shirt is very flattering to all due to th...,0


In [180]:
X = df_copy2['Review_Text'].values
y = df_copy2['Recommended_IND'].values


### Train test split

In [181]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y, random_state=101)

### Tokenization

In [182]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


In [183]:
# For every sentence...
max_token = []
for sent in X:

    # Tokenize the text and add `[CLS]` and `[SEP]` tokens.
    
    input_ids = tokenizer.encode(sent, add_special_tokens=True)
    max_token.append(len(input_ids))

print('Max sentence length: ', max(max_token))

Max sentence length:  162


In [184]:
import numpy as np
from scipy import stats

arr = np.array(max_token)
 
print("Descriptive analysis")
print("Document Size \t=", arr.shape[0])
print("Doc Token Count\t=", arr)

# measures of central tendency
print("Measures of Central Tendency")
print("Mean \t\t=", arr.mean())
print("Median \t\t=", np.median(arr))
print("Mode \t\t=", np.atleast_1d(stats.mode(arr)[0])[0])  # works whether SciPy returns the mode as an array or a scalar

# measures of dispersion
print("Measures of Dispersion")
print("Minimum \t=", arr.min())
print("Maximum \t=", arr.max())
print("Range \t\t=", np.ptp(arr))  # np.ptp also works on NumPy versions without the ndarray.ptp method
print("Variance \t=", arr.var())
print("Standard Deviation =", arr.std())

Descriptive analysis
Document Size 	= 22641
Doc Token Count	= [ 10  82 118 ...  54 102  26]
Measures of Central Tendency
Mean 		= 76.75659202332052
Median 		= 75.0
Mode 		= 122
Measures of Dispersion
Minimum 	= 4
Maximum 	= 162
Range 		= 158
Variance 	= 1251.4350770125225
Standard Deviation = 35.37562829141728
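
The notebook pads every review to the maximum observed length plus one. When a few very long documents dominate the maximum, a coverage-based length is a common alternative. The helper below is a hypothetical sketch (the name `suggest_seq_len`, the `coverage` parameter and the toy counts are assumptions, not part of the notebook): it returns the smallest length that fits at least the requested share of documents without truncation.

```python
import math

def suggest_seq_len(token_counts, coverage=0.99):
    """Smallest length that fits at least `coverage` of the documents untruncated."""
    counts = sorted(token_counts)
    # Index of the last document (in sorted order) that must still fit.
    k = max(math.ceil(coverage * len(counts)) - 1, 0)
    return counts[k]

# Toy token counts for illustration only (not the real review lengths).
toy_counts = [10, 82, 118, 54, 102, 26, 162, 75]

print(suggest_seq_len(toy_counts, coverage=1.0))  # covers every document
print(suggest_seq_len(toy_counts, coverage=0.5))  # covers half of them
```

With `coverage=1.0` this reduces to the maximum, which is the choice made in this notebook.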


### Transforming text to tensors

In [185]:
# define a function that turns raw texts into padded token-id and attention-mask matrices

def transformation(X):
  # set array dimensions
  seq_len = 163  # max +1
  num_samples = len(X)

  # initialize empty zero arrays
  Xids = np.zeros((num_samples, seq_len))
  Xmask = np.zeros((num_samples, seq_len))

    
  for i, phrase in enumerate(X):
      tokens = tokenizer.encode_plus(phrase, max_length=seq_len, truncation=True,
                                      padding='max_length', add_special_tokens=True) 
      
      # assign tokenized outputs to respective rows in numpy arrays
      Xids[i] = tokens['input_ids']
      Xmask[i] = tokens['attention_mask']
  return Xids, Xmask
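
To make the two returned arrays concrete, here is a tokenizer-free sketch of what padding to a fixed length and building an attention mask amount to. The function `pad_and_mask` is hypothetical and simplified: it cuts off the tail when truncating, whereas the real tokenizer keeps the closing `[SEP]` token. In `bert-base-uncased`, ids 101, 102 and 0 are `[CLS]`, `[SEP]` and `[PAD]`; the two ids in between are arbitrary placeholders.

```python
def pad_and_mask(token_ids, seq_len, pad_id=0):
    """Pad (or naively truncate) one id sequence and build its attention mask."""
    ids = list(token_ids)[:seq_len]              # naive truncation (assumption)
    n_real = len(ids)
    n_pad = seq_len - n_real
    input_ids = ids + [pad_id] * n_pad           # right-pad with the [PAD] id
    attention_mask = [1] * n_real + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

ids, mask = pad_and_mask([101, 2293, 2023, 102], seq_len=6)
print(ids)   # [101, 2293, 2023, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The mask tells BERT which positions to ignore, so padded positions do not influence the attention scores.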

In [186]:
Xids_train, Xmask_train = transformation(X_train)

Xids_test, Xmask_test = transformation(X_test)

In [187]:
print("Xids_train.shape  :", Xids_train.shape)
print("Xmask_train.shape :", Xmask_train.shape)
print("Xids_test.shape   :", Xids_test.shape)
print("Xmask_test.shape  :", Xmask_test.shape)

Xids_train.shape  : (20376, 163)
Xmask_train.shape : (20376, 163)
Xids_test.shape   : (2265, 163)
Xmask_test.shape  : (2265, 163)


In [188]:
labels_train = y_train.reshape(-1,1)
labels_train

array([[1],
       [1],
       [0],
       ...,
       [0],
       [0],
       [0]])

In [189]:
labels_test = y_test.reshape(-1,1)
labels_test

array([[0],
       [0],
       [0],
       ...,
       [1],
       [0],
       [0]])

In [190]:
import tensorflow as tf

dataset_train = tf.data.Dataset.from_tensor_slices((Xids_train, Xmask_train, labels_train))
dataset_train

<TensorSliceDataset element_spec=(TensorSpec(shape=(163,), dtype=tf.float64, name=None), TensorSpec(shape=(163,), dtype=tf.float64, name=None), TensorSpec(shape=(1,), dtype=tf.int64, name=None))>

In [191]:
dataset_test = tf.data.Dataset.from_tensor_slices((Xids_test, Xmask_test, labels_test))
dataset_test

<TensorSliceDataset element_spec=(TensorSpec(shape=(163,), dtype=tf.float64, name=None), TensorSpec(shape=(163,), dtype=tf.float64, name=None), TensorSpec(shape=(1,), dtype=tf.int64, name=None))>

In [192]:
def map_func(Xids, Xmask, labels):
    # we convert our three-item tuple into a two-item tuple where the input item is a dictionary
    return {'input_ids': Xids, 'attention_mask': Xmask}, labels  # inputs go in the dict (curly braces); the target (labels) is returned separately

In [193]:
dataset_train = dataset_train.map(map_func)
dataset_test = dataset_test.map(map_func)

In [194]:
dataset_train

<MapDataset element_spec=({'input_ids': TensorSpec(shape=(163,), dtype=tf.float64, name=None), 'attention_mask': TensorSpec(shape=(163,), dtype=tf.float64, name=None)}, TensorSpec(shape=(1,), dtype=tf.int64, name=None))>

In [195]:
dataset_test

<MapDataset element_spec=({'input_ids': TensorSpec(shape=(163,), dtype=tf.float64, name=None), 'attention_mask': TensorSpec(shape=(163,), dtype=tf.float64, name=None)}, TensorSpec(shape=(1,), dtype=tf.int64, name=None))>

### Batch Size

In [196]:
batch_size = 32 # 16, 32

# group the samples into batches of `batch_size` (32 here)
train_ds = dataset_train.batch(batch_size)  
val_ds = dataset_test.batch(batch_size)

length = len(X_train)
train_ds2 = dataset_train.shuffle(buffer_size = length, reshuffle_each_iteration=True).batch(batch_size)
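
As a rough mental model of `shuffle(...).batch(...)` with a buffer as large as the training set, the pure-Python sketch below shuffles all sample indices and then slices them into batches. The function name `shuffled_batches` and the `seed` argument are assumptions made for illustration; `tf.data` performs the equivalent work lazily on tensors.

```python
import random

def shuffled_batches(n_samples, batch_size, seed=None):
    """Shuffle all indices once, then cut them into consecutive batches."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, n_samples, batch_size)]

# 20376 training samples and batch size 32, as in this notebook.
batches = shuffled_batches(20376, 32, seed=0)
print(len(batches), len(batches[-1]))
```

With these sizes one epoch has 637 steps, and the last batch holds only 24 samples because 20376 is not a multiple of 32.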

### Creating Model

In [197]:
def create_model():
    seq_len = 163  # must match the seq_len used in transformation()
    from transformers import TFAutoModel
    model = TFAutoModel.from_pretrained("bert-base-uncased")
    input_ids = tf.keras.layers.Input(shape=(seq_len,), name='input_ids', dtype='int32')
    attention_mask = tf.keras.layers.Input(shape=(seq_len,), name='attention_mask', dtype='int32')

    embeddings = model.bert(input_ids=input_ids, attention_mask=attention_mask)["pooler_output"]

    x = tf.keras.layers.Dense(seq_len, activation='relu')(embeddings) 
    x = tf.keras.layers.Dropout(0.1, name="dropout")(x) 
    y = tf.keras.layers.Dense(1, activation='sigmoid', name='outputs')(x)

    return tf.keras.Model(inputs=[input_ids, attention_mask], outputs=y)

In [198]:
with strategy.scope():  
  
  optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5) 
  loss = tf.keras.losses.BinaryCrossentropy()
  recall = tf.keras.metrics.Recall()
  model3 = create_model()
  model3.compile(optimizer=optimizer, loss=loss, metrics=[recall])

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [199]:
model3.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 163)]        0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 163)]        0           []                               
                                                                                                  
 bert (TFBertMainLayer)         TFBaseModelOutputWi  109482240   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]']         
                                tentions(last_hidde                                               
                                n_state=(None, 163,                                           

In [200]:
history = model3.fit(
    train_ds2, validation_data= val_ds, class_weight= {0:1, 1:4},
    epochs=1)
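
The `class_weight={0:1, 1:4}` setting roughly offsets the class imbalance in the training split (16685 vs. 3691 samples). For comparison, the sketch below computes the common "balanced" heuristic, weight = n_total / (n_classes * n_class). The helper name `balanced_class_weights` is hypothetical, and the notebook's 1:4 choice is a manual value rather than the output of this formula.

```python
def balanced_class_weights(counts):
    """Map each class to n_total / (n_classes * n_class)."""
    n_total = sum(counts.values())
    n_classes = len(counts)
    return {label: n_total / (n_classes * n) for label, n in counts.items()}

# Class counts taken from the training classification report.
train_counts = {0: 16685, 1: 3691}
weights = balanced_class_weights(train_counts)
ratio = weights[1] / weights[0]

print({label: round(w, 3) for label, w in weights.items()}, round(ratio, 2))
```

The resulting ratio of about 4.52 between the two weights is close to the 1:4 ratio passed to `fit`.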



### Model evaluation

In [201]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model3.predict(val_ds) >= 0.5


print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.89      0.94      1855
           1       0.66      0.92      0.77       410

    accuracy                           0.90      2265
   macro avg       0.82      0.91      0.85      2265
weighted avg       0.92      0.90      0.91      2265



In [202]:
y_train_pred = model3.predict(train_ds) >= 0.5  


print(classification_report(y_train, y_train_pred)) 

              precision    recall  f1-score   support

           0       0.99      0.91      0.95     16685
           1       0.71      0.97      0.82      3691

    accuracy                           0.92     20376
   macro avg       0.85      0.94      0.89     20376
weighted avg       0.94      0.92      0.93     20376



In [203]:
from sklearn.metrics import PrecisionRecallDisplay

y_pred_proba = model3.predict(val_ds)

PrecisionRecallDisplay.from_predictions(y_test, y_pred_proba)
plt.show();



<Figure size 576x396 with 1 Axes>

In [204]:
model3.save("/content/drive/MyDrive/NLP/BERT_model.h5")

In [205]:
from tensorflow.keras.models import load_model

model4 = load_model('/content/drive/MyDrive/NLP/BERT_model.h5')

In [206]:
y_pred_proba = model4.predict(val_ds)
y_pred = (y_pred_proba > 0.5).astype("int")
BERT_rec = recall_score(y_test, y_pred)
BERT_f1 = f1_score(y_test, y_pred)
BERT_AP = average_precision_score(y_test, y_pred_proba)



In [207]:
BERT_AP

0.8488167533661037
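
Average precision summarizes the precision-recall curve as the sum of precision values weighted by each increase in recall. The function below is a minimal re-implementation for intuition only, assuming binary labels, at least one positive sample and no tied scores; the notebook itself relies on `average_precision_score` from scikit-learn, and the toy labels and scores are made up.

```python
def average_precision(y_true, scores):
    """Sum of precision@rank over the ranks where a positive appears, divided by #positives."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    tp = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            # Recall rises by 1 / n_pos here; weight the current precision by that step.
            ap += (tp / rank) / n_pos
    return ap

toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
print(round(average_precision(toy_labels, toy_scores), 4))  # 0.8333
```

A score near 1 means the positive class is ranked ahead of almost all negatives across thresholds, which is why this metric complements the fixed-threshold recall and F1 values.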

### Comparing Models by F1, Recall and Average Precision Scores

In [208]:
compare = pd.DataFrame({"Model": ["LogReg_count", "LogReg_tfidf", "NaiveBayes_count", "NaiveBayes_tfidf", "SVM_count", "SVM_tfidf", "Random Forest_count", 
                                  "Random Forest_tfidf", "AdaBoost_count", "AdaBoost_tfidf", "DL_model", "BERT_model"],
                        
                        "F1_Score_Negative": [log_count_f1_neg, log_tfidf_f1_neg, nb_count_f1_neg, nb_tfidf_f1_neg, svc_count_f1_neg, svc_tfidf_f1_neg,
                                             rf_count_f1_neg, rf_tfidf_f1_neg, ada_count_f1_neg, ada_tfidf_f1_neg, DL_f1_neg, BERT_f1],  
                        
                        "Recall_Score_Negative": [log_count_rec_neg, log_tfidf_rec_neg, nb_count_rec_neg, nb_tfidf_rec_neg, svc_count_rec_neg,
                                                  svc_tfidf_rec_neg, rf_count_rec_neg, rf_tfidf_rec_neg, ada_count_rec_neg, ada_tfidf_rec_neg,
                                                  DL_rec_neg, BERT_rec],
                        
                        "Average_Precision_Score": [log_count_AP, log_tfidf_AP, nb_count_AP, nb_tfidf_AP, svc_count_AP, svc_tfidf_AP, rf_count_AP,
                                                          rf_tfidf_AP, ada_count_AP, ada_tfidf_AP, DL_AP, BERT_AP]})


    
plt.figure(figsize=(15,30))
plt.subplot(311)
compare = compare.sort_values(by="Recall_Score_Negative", ascending=False)
ax=sns.barplot(x="Recall_Score_Negative", y="Model", data=compare, palette="Blues_d")            
ax.bar_label(ax.containers[0], fmt="%.3f")  # bar_label requires Matplotlib >= 3.4

plt.subplot(312)
compare = compare.sort_values(by="F1_Score_Negative", ascending=False)
ax=sns.barplot(x="F1_Score_Negative", y="Model", data=compare, palette="Blues_d")
ax.bar_label(ax.containers[0],fmt="%.3f")


plt.subplot(313)
compare = compare.sort_values(by="Average_Precision_Score", ascending=False)
ax=sns.barplot(x="Average_Precision_Score", y="Model", data=compare, palette="Blues_d")
ax.bar_label(ax.containers[0],fmt="%.3f")
plt.show();

AttributeError: ignored

<Figure size 1080x2160 with 1 Axes>