# Grade prediction with BERT embedding

In this notebook, we will use a pre-trained deep learning model to process some text. We will then use the output of that model to classify the reviews we scraped in previous courses, predicting whether each one was positive or negative.

Under the hood, the model is actually made up of two models.


1.   DistilBERT processes the sentence and passes the information it extracted on to the next model. DistilBERT is a smaller version of BERT developed and open sourced by the team at HuggingFace. It’s a lighter and faster version of BERT that roughly matches its performance.
2.   The next model, a basic Logistic Regression model from scikit-learn, classifies the review as positive or negative.



## Installing the transformers library

Let's start by installing the huggingface transformers library so we can load our deep learning NLP model.

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Register progress_apply on pandas objects (tqdm.auto picks the notebook progress bar when available)
from tqdm.auto import tqdm
tqdm.pandas()

In [None]:
import nltk
nltk.download("punkt")
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
import itertools

## Importing the dataset

We'll import the dataset as we did in previous notebooks:

In [None]:
from google.colab import drive
import os
drive.mount('/content/drive')
dirpath = "drive" 
os.listdir(dirpath)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


['MyDrive', '.shortcut-targets-by-id', '.file-revisions-by-id', '.Trash-0']

In [None]:
dirpath = "/content/drive/MyDrive/Capgemini_data_camp/4. Embedding 2"
data_file = "reviews.csv"

# Read the scraped reviews from the mounted Drive folder
reviews = pd.read_csv(os.path.join(dirpath, data_file))
reviews.head()


Unnamed: 0.1,Unnamed: 0,page,titre,verbatim,date,note,reponse,date_experience,fournisseur,source
0,0,1,Aucun soucis particulier,Je paie ma facture tous les deux mois en fonct...,Il y a 17 heures,4,,Date de l'expérience: 01 décembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot
1,1,1,Engie facture a ses clients des sommes…,Engie facture a ses clients des sommes exorbit...,Il y a un jour,1,"Bonjour Julien Blanco,\n\nPour des raisons de ...",Date de l'expérience: 26 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot
2,2,1,Facturation sur consommation d'un autre logement,Ils me facturent sur le pdl du logement au des...,ll y a 3 jours,1,"Bonjour BlooDz,\n\nPour des raisons de confide...",Date de l'expérience: 29 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot
3,3,1,un service client ou il est dur de…,un service client ou il est dur de comprendre ...,ll y a 3 jours,1,"Bonjour Ricanto77,\nPour des raisons de confid...",Date de l'expérience: 29 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot
4,4,1,Client d'ENGIE depuis longtemps toujours satis...,Excellente expérience avec ENGIE et une interl...,Il y a 24 minutes,5,,Date de l'expérience: 01 décembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot


In [None]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37294 entries, 0 to 37293
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Unnamed: 0       37294 non-null  int64 
 1   page             37294 non-null  int64 
 2   titre            37294 non-null  object
 3   verbatim         36314 non-null  object
 4   date             37294 non-null  object
 5   note             37294 non-null  int64 
 6   reponse          7972 non-null   object
 7   date_experience  37294 non-null  object
 8   fournisseur      37294 non-null  object
 9   source           37294 non-null  object
dtypes: int64(3), object(7)
memory usage: 2.8+ MB


First, we need to decide what counts as a positive review. Here, we consider a comment positive if its grade is 4 or 5, and negative otherwise.

In [None]:
# Assign a label to each review to differentiate positive (1) and negative (0) reviews
reviews['positive'] = reviews['note'].isin([4, 5]).astype(int)

# Remove rows with null values. dropna() drops a row if ANY column is null,
# so only reviews that received a reply (non-null `reponse`) are kept.
reviews.dropna(inplace=True)

reviews.head(20)

Unnamed: 0.1,Unnamed: 0,page,titre,verbatim,date,note,reponse,date_experience,fournisseur,source,positive
1,1,1,Engie facture a ses clients des sommes…,Engie facture a ses clients des sommes exorbit...,Il y a un jour,1,"Bonjour Julien Blanco,\n\nPour des raisons de ...",Date de l'expérience: 26 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,0
2,2,1,Facturation sur consommation d'un autre logement,Ils me facturent sur le pdl du logement au des...,ll y a 3 jours,1,"Bonjour BlooDz,\n\nPour des raisons de confide...",Date de l'expérience: 29 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,0
3,3,1,un service client ou il est dur de…,un service client ou il est dur de comprendre ...,ll y a 3 jours,1,"Bonjour Ricanto77,\nPour des raisons de confid...",Date de l'expérience: 29 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,0
5,5,1,Service commercial déplorable dont on…,Service commercial déplorable dont on ne compr...,ll y a 4 jours,1,"Bonjour Dubois,\n\nPour des raisons de confide...",Date de l'expérience: 28 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,0
6,6,1,En cours de litige actuellement j…,En cours de litige actuellement j attends des ...,ll y a 5 jours,1,"Bonjour Pierre “juniorB”,\nPour des raisons de...",Date de l'expérience: 17 octobre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,0
7,7,1,ARNAQUE ARNAQUE ARNAQUE ARNAQUE ARNAQUE…,ARNAQUE ARNAQUE ARNAQUE ARNAQUE ARNAQUE ARNAQU...,22 nov. 2022,1,"Bonjour Kikou Kikou,\nPour des raisons de conf...",Date de l'expérience: 22 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,0
8,8,1,A déconseiller …,Service client parlant a peine correctement le...,23 nov. 2022,1,"Bonjour Antoine LAURENT,\n\nPour des raisons d...",Date de l'expérience: 23 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,0
9,9,1,On m’avait pourtant prévenu… le 21…,On m’avait pourtant prévenu… le 21 novembre je...,22 nov. 2022,1,"Bonjour Manon Chaussat,\n\nPour des raisons de...",Date de l'expérience: 22 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,0
11,11,1,Reçu la régularisation ce jour très…,Reçu la régularisation ce jour très surpris on...,21 nov. 2022,1,"Bonjour Djoe Joe,\n\nPour des raisons de confi...",Date de l'expérience: 21 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,0
13,13,1,Arnaqueurs,Arnaqueurs\nFaites très attention chers lecteu...,Actualisé le 22 nov. 2022,1,"Bonjour Christophe,\n\nPour des raisons de con...",Date de l'expérience: 22 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,0


To keep the computation manageable, we'll only use 1,000 sentences, each containing fewer than 100 words.

In [None]:
def split_reviews_per_sentence(reviews, col ='verbatim' ):
    reviews["review_sentences"] = reviews[col].progress_apply(
        lambda rvw: nltk.sent_tokenize(rvw)
    )
    return reviews

In [None]:
!pip install unidecode

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import unidecode

def preprocess_comment(comment):
  """Lowercase a comment and strip punctuation, accents, line breaks and digits."""
  lowered = comment.lower()

  # In most cases, punctuation does not help in understanding a sentence or a document
  characters_to_remove = ["@", "/", "#", ".", ",", "!", "?", "(", ")", "-", "_","’","'", "\"", ":"]
  transformation_dict = {initial: " " for initial in characters_to_remove}
  no_punctuation = lowered.translate(str.maketrans(transformation_dict))

  no_accent = unidecode.unidecode(no_punctuation)
  cleaned = no_accent.replace("\n", "").replace("\xa0", "")

  # Remove digits
  cleaned = ''.join(ch for ch in cleaned if not ch.isdigit())

  return cleaned

In [None]:
reviews['verbatim'] = reviews['verbatim'].astype(str)

In [None]:
reviews = split_reviews_per_sentence(reviews, col ='verbatim' )

  0%|          | 0/7972 [00:00<?, ?it/s]

In [None]:
reviews = reviews.explode('review_sentences')
reviews['review_sentences_words'] = reviews['review_sentences'].str.split(" ", expand = False)
reviews['word_count'] = reviews['review_sentences_words'].str.len()

In [None]:
reviews.head(50)

Unnamed: 0.1,Unnamed: 0,page,titre,verbatim,date,note,reponse,date_experience,fournisseur,source,positive,review_sentences,review_sentences_words,word_count
1,1,1,Engie facture a ses clients des sommes…,Engie facture a ses clients des sommes exorbit...,Il y a un jour,1,"Bonjour Julien Blanco,\n\nPour des raisons de ...",Date de l'expérience: 26 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,0,Engie facture a ses clients des sommes exorbit...,"[Engie, facture, a, ses, clients, des, sommes,...",9
1,1,1,Engie facture a ses clients des sommes…,Engie facture a ses clients des sommes exorbit...,Il y a un jour,1,"Bonjour Julien Blanco,\n\nPour des raisons de ...",Date de l'expérience: 26 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,0,"Engie mon facturer un technicien pour le gaz ,...","[Engie, mon, facturer, un, technicien, pour, l...",26
1,1,1,Engie facture a ses clients des sommes…,Engie facture a ses clients des sommes exorbit...,Il y a un jour,1,"Bonjour Julien Blanco,\n\nPour des raisons de ...",Date de l'expérience: 26 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,0,Résultat des courses une facture de 71 euros j...,"[Résultat, des, courses, une, facture, de, 71,...",35
1,1,1,Engie facture a ses clients des sommes…,Engie facture a ses clients des sommes exorbit...,Il y a un jour,1,"Bonjour Julien Blanco,\n\nPour des raisons de ...",Date de l'expérience: 26 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,0,Engie savent prendre l’argent à tord mais quan...,"[Engie, savent, prendre, l’argent, à, tord, ma...",17
2,2,1,Facturation sur consommation d'un autre logement,Ils me facturent sur le pdl du logement au des...,ll y a 3 jours,1,"Bonjour BlooDz,\n\nPour des raisons de confide...",Date de l'expérience: 29 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,0,Ils me facturent sur le pdl du logement au des...,"[Ils, me, facturent, sur, le, pdl, du, logemen...",57
3,3,1,un service client ou il est dur de…,un service client ou il est dur de comprendre ...,ll y a 3 jours,1,"Bonjour Ricanto77,\nPour des raisons de confid...",Date de l'expérience: 29 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,0,un service client ou il est dur de comprendre ...,"[un, service, client, ou, il, est, dur, de, co...",43
5,5,1,Service commercial déplorable dont on…,Service commercial déplorable dont on ne compr...,ll y a 4 jours,1,"Bonjour Dubois,\n\nPour des raisons de confide...",Date de l'expérience: 28 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,0,Service commercial déplorable dont on ne compr...,"[Service, commercial, déplorable, dont, on, ne...",42
5,5,1,Service commercial déplorable dont on…,Service commercial déplorable dont on ne compr...,ll y a 4 jours,1,"Bonjour Dubois,\n\nPour des raisons de confide...",Date de l'expérience: 28 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,0,"En attendant, le client d'ici est une vache à ...","[En, attendant,, le, client, d'ici, est, une, ...",12
5,5,1,Service commercial déplorable dont on…,Service commercial déplorable dont on ne compr...,ll y a 4 jours,1,"Bonjour Dubois,\n\nPour des raisons de confide...",Date de l'expérience: 28 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,0,5 années chez eux (je n'avais jamais eu affair...,"[5, années, chez, eux, (je, n'avais, jamais, e...",47
5,5,1,Service commercial déplorable dont on…,Service commercial déplorable dont on ne compr...,ll y a 4 jours,1,"Bonjour Dubois,\n\nPour des raisons de confide...",Date de l'expérience: 28 novembre 2022,https://fr.trustpilot.com/review/engie.fr,trustpilot,0,"Vraiment, n'allez pas chez le détestable Engie !","[Vraiment,, n'allez, pas, chez, le, détestable...",8


In [None]:
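# Keep only sentences with fewer than 100 words, then draw 1,000 of them at random
# (no random_state is set, so the sample changes from run to run)
reviews_filtered = reviews.loc[reviews['word_count'] < 100, ['review_sentences', 'positive', 'word_count']]
reviews_filtered_1000 = reviews_filtered.sample(n=1000)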
reviews_filtered_1000.head(50)

Unnamed: 0,review_sentences,positive,word_count
8817,Affligeant en 2020 ...,0,4
371,quelle tristesse pour un tel groupe.,0,6
28700,C'est de la tromperie vis-à-vis des clients qu...,0,29
604,"En fait, ils sont comme beaucoup d'entreprises...",0,22
434,J'essaie d'envoyer un mail via leur page de co...,0,21
1713,Pas de soucis..ecoute des interlocuteurs au top..,1,7
9480,Rien de plus a dire que c est devenu un vrai b...,0,38
457,Les différentes personnes de chez Engie m'ont ...,0,16
17730,J'aimerais savoir le prix du kwh en hc et hp a...,0,25
1619,"Depuis que je suis avec Ohm energie, je n'ai e...",1,12


In [None]:
reviews_filtered_1000.groupby('positive')['review_sentences'].count()

positive
0    867
1    133
Name: review_sentences, dtype: int64

Our sample is imbalanced: 86.7% of the sentences come from negative reviews and 13.3% from positive reviews.
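As a quick check, the percentages can be recomputed from the counts printed above. This is only a minimal sketch: the `counts` dictionary is typed in by hand from that output rather than read from the DataFrame.

```python
# Sentence counts per label, copied from the groupby output above
counts = {0: 867, 1: 133}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"label {label}: {share:.1%}")
```

On the real data, `reviews_filtered_1000['positive'].value_counts(normalize=True)` should give the same proportions.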

In [None]:
#Save in two variables the comments and the labels

sentences = reviews_filtered_1000['review_sentences']
labels = reviews_filtered_1000['positive']

## Loading the Pre-trained BERT model
Let's now load a pre-trained BERT model.

In [None]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Right now, the variable model holds a pretrained distilBERT model -- a version of BERT that is smaller, but much faster and requiring a lot less memory.

## Model 1: Preparing the Dataset
Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the format it requires.

### Tokenization
Our first step is to tokenize the sentences -- break them up into words and subwords in the format BERT is comfortable with.
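To give an intuition of what "words and subwords" means, here is a toy sketch of greedy longest-match-first subword splitting, the idea behind the WordPiece tokenizer used by BERT and DistilBERT. The function `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and written for illustration only; the real tokenizer has a vocabulary of about 30,000 pieces and also handles casing, punctuation and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into subword pieces, longest match first."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical vocabulary, far smaller than the real one
toy_vocab = {"fact", "##ur", "##ation", "client", "##s"}

print(wordpiece_tokenize("facturation", toy_vocab))  # ['fact', '##ur', '##ation']
print(wordpiece_tokenize("clients", toy_vocab))      # ['client', '##s']
print(wordpiece_tokenize("gaz", toy_vocab))          # ['[UNK]']
```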

In [None]:
sentences = list(reviews_filtered_1000['review_sentences'])
len(sentences) # check 

1000

In [None]:
# Apply tokenization

encoding = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
encoding

{'input_ids': tensor([[  101, 21358, 10258,  ...,     0,     0,     0],
        [  101, 10861,  6216,  ...,     0,     0,     0],
        [  101,  1039,  1005,  ...,     0,     0,     0],
        ...,
        [  101, 16021, 10450,  ...,     0,     0,     0],
        [  101,  1011,  1019,  ...,     0,     0,     0],
        [  101,  6581, 16222,  ...,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}

### Padding

Here we notice that the tokenizer has already padded the lists of tokens. Moreover, the output is already a tensor. That is why the questions were slightly modified.

In [None]:
# Sanity check: verify that all token sequences have the same length

max_length = len(encoding['input_ids'][0])  # length of the first token sequence

# If the corpus is correctly padded, all lengths should be equal
padding_ok = all(len(ids) == max_length for ids in encoding['input_ids'])

print("Padding is ok" if padding_ok else "Padding is not ok")

Padding is ok


We want BERT to process all our examples at once, as one batch, because it is faster that way. For that reason, all token lists must be padded to the same length, so that the input can be represented as one 2-D array rather than as a list of lists of different lengths.


The padded dataset is stored in encoding['input_ids']; we can view its dimensions below:

In [None]:
np.array(encoding['input_ids']).shape

(1000, 187)

### Masking
If we sent the padded input directly to BERT, the padding would slightly confuse it. We need another variable that tells it to ignore (mask) the padding we've added when it processes its input. That's what attention_mask is:

In [None]:
attention_mask = np.where(encoding['input_ids'] != 0, 1, 0)
print(attention_mask.shape)

# We can also use the built-in method
print(np.array(encoding['attention_mask']).shape)

# Sanity check to see if the two attention masks are the same
print(f"The two methods give the same result : {np.array_equal(attention_mask, np.array(encoding['attention_mask']))}")

(1000, 187)
(1000, 187)
The two methods give the same result : True
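The padding and masking steps can be reproduced by hand on a toy batch. The sketch below is an illustration, not what the tokenizer does internally. The token ids are made up, apart from 101 and 102, which stand for the [CLS] and [SEP] tokens of the uncased BERT vocabulary, and 0, which is the padding id seen in the encoding above.

```python
import numpy as np

# Hypothetical token id lists of different lengths
token_ids = [
    [101, 7592, 102],
    [101, 2023, 2003, 2307, 102],
    [101, 102],
]

pad_id = 0
max_len = max(len(ids) for ids in token_ids)

# Right-pad every list with pad_id so that all rows share the same length
padded = np.array([ids + [pad_id] * (max_len - len(ids)) for ids in token_ids])

# 1 marks a real token, 0 marks padding
toy_attention_mask = (padded != pad_id).astype(int)

print(padded.shape)        # (3, 5)
print(toy_attention_mask)
```

Each row of the mask contains as many ones as there were tokens in the original, unpadded list.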


## Model 1: And Now, Deep Learning!
Now that we have our model and inputs ready, let's run our model! 

The model() function runs our sentences through BERT. The results of the processing will be returned into last_hidden_states.

In [None]:
input_ids = encoding['input_ids'] # Encoding is already made of tensors
attention_mask = encoding['attention_mask']

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

Let's keep only the part of the output that we need: the output corresponding to the first token of each sentence. To do sentence classification, BERT adds a special token called [CLS] (for classification) at the beginning of every sentence. The output corresponding to that token can be thought of as an embedding for the entire sentence.

We'll save those in the features variable, as they'll serve as the features for our logistic regression model.
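The slicing is easier to follow on the shapes alone. The sketch below uses a random array as a stand-in for the model output, with an arbitrary batch size and sequence length; only the hidden size of 768 matches DistilBERT.

```python
import numpy as np

# Stand-in for the last hidden state: (batch size, sequence length, hidden size)
batch_size, seq_len, hidden_size = 4, 7, 768
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(batch_size, seq_len, hidden_size))

# Keep every sentence, only position 0 (the [CLS] token), and all hidden units
cls_embeddings = hidden_states[:, 0, :]

print(hidden_states.shape)   # (4, 7, 768)
print(cls_embeddings.shape)  # (4, 768)
```

The result is one 768-dimensional vector per sentence, which is the shape a scikit-learn classifier expects for its feature matrix.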



In [None]:
features = last_hidden_states[0][:,0,:].numpy()

In [None]:
features

array([[-0.17517811, -0.21334815,  0.02934225, ..., -0.09681305,
         0.17115438,  0.41159746],
       [-0.33222446, -0.19886987, -0.20308407, ..., -0.16556385,
         0.16818312,  0.5150594 ],
       [-0.37591895, -0.04533167, -0.22401412, ...,  0.02739608,
         0.11627943,  0.43378788],
       ...,
       [-0.44437608, -0.09750488, -0.4205883 , ..., -0.14284353,
         0.162042  ,  0.49023363],
       [-0.20313033, -0.06822723, -0.01797578, ..., -0.19606154,
         0.2440709 ,  0.41131645],
       [-0.36309308,  0.02670904, -0.06815834, ..., -0.09744149,
         0.3339356 ,  0.38100356]], dtype=float32)

## Model 2: Train/Test Split
Let's now split our dataset of 1,000 sentences into a training set and a testing set.

In [None]:
# Split the data in train and test with the train_test_split function

X_train, X_test, y_train, y_test = train_test_split(features, reviews_filtered_1000['positive'], test_size=0.3, random_state=42)

## [Bonus] Grid Search for Parameters
We could fit a logistic regression directly with scikit-learn's default parameters, but it is sometimes worth searching for the best value of the C parameter, the inverse of the regularization strength (smaller values of C mean stronger regularization).
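As a reminder of where C enters the model, L2-regularized logistic regression can be written, up to scaling conventions that vary between library versions, as the following minimization. The notation is an assumption made for this illustration: labels $y_i \in \{-1, +1\}$, feature vectors $x_i$, weights $w$ and intercept $b$.

```latex
\min_{w,\, b} \;\; \frac{1}{2}\, w^\top w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^\top w + b\right)\right)\right)
```

A large C gives more weight to the data-fit term and therefore regularizes less, while a small C keeps the weights closer to zero.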

In [None]:
from scipy.stats import uniform
uniform(0, 100)

<scipy.stats._distn_infrastructure.rv_frozen at 0x7fc26ec80ee0>

In [None]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from scipy.stats import randint

model = LogisticRegression()

# Define the parameter distribution for the randomized search
# (C must be strictly positive, so the lower bound is 1 rather than 0)
param_dist = {'C': randint(1, 100)}

# Run the randomized search to find a good value of C
# As the dataset is quite imbalanced, with few positives, we score with F1 rather than accuracy
random_search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=100, cv=5, random_state=42, scoring='f1')
random_search.fit(X_train, y_train)
best_c = random_search.best_params_['C']

print("Best C for the logistic regression (random_search): ", best_c)

# Fine-tune the value of C with a classical grid search over [best_c - 2, best_c + 2]
# (the lower bound is clipped so that C stays strictly positive)
param_grid = {'C': np.linspace(max(best_c - 2, 0.01), best_c + 2, num=100)}
grid_search = GridSearchCV(model, param_grid=param_grid, cv=5, scoring='f1')
grid_search.fit(X_train, y_train)

print("Best C for the logistic regression after fine-tuning: ", grid_search.best_params_['C'])
print("Best f1 score: ", grid_search.best_score_)

Best C for the logistic regression (random_search):  13
Best C for the logistic regression after fine-tuning:  11.0
Best f1 score:  0.47175627240143375


We now train the LogisticRegression model. If you've chosen to do the grid search, you can plug the value of C into the model declaration (e.g. LogisticRegression(C=5.2)).

In [None]:
# Fit the logistic regression model

model = LogisticRegression(C=grid_search.best_params_['C'])
model.fit(X_train, y_train)

LogisticRegression(C=11.0)

## Evaluating Model 2
So how well does our model do in classifying sentences? One way is to check the accuracy against the testing dataset:

In [None]:
# Evaluate the score of the model

from sklearn.metrics import accuracy_score, f1_score,  precision_recall_fscore_support
y_pred =  model.predict(X_test)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))

Accuracy : 0.8433333333333334
F1-score : 0.40506329113924044


As the dataset is imbalanced, it is better to look at the F1-score of the model rather than the accuracy.

In [None]:
scores = precision_recall_fscore_support(y_test, y_pred, average=None, labels=[1,0])
df = pd.DataFrame(scores, columns=['positive', 'negative'], index=['precision', 'recall', 'fscore', 'support'])
df

Unnamed: 0,positive,negative
precision,0.444444,0.897727
recall,0.372093,0.922179
fscore,0.405063,0.909789
support,43.0,257.0
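The figures in this table can be reproduced by hand. The confusion-matrix counts used below are not printed by the notebook. They are inferred from the reported precision, recall and support of each class, so treat them as a reconstruction.

```python
# Counts implied by the table above, with "positive" (label 1) as the class of interest
tp, fp, fn, tn = 16, 20, 27, 237

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.6f} recall={recall:.6f} f1={f1:.6f} accuracy={accuracy:.6f}")
```

Accuracy counts the 237 correctly classified negatives, which dominate the test set, whereas the F1-score of the positive class ignores them. This is why the two metrics differ so much here.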


How good is this score? What can we compare it against? Let's first look at a dummy classifier:

In [None]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Compare your model with a dummy classifier
scores = precision_recall_fscore_support(y_test, y_pred, average=None, labels=[1,0])
df = pd.DataFrame(scores, columns=['positive','negative'], index=['precision', 'recall', 'fscore', 'support'])
df

Unnamed: 0,positive,negative
precision,0.0,0.856667
recall,0.0,1.0
fscore,0.0,0.922801
support,43.0,257.0
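These numbers are what one would expect from a classifier that always predicts the majority class, which is what DummyClassifier does with its default strategy in recent scikit-learn versions. A minimal sketch of the arithmetic, using only the class supports from the table:

```python
n_pos, n_neg = 43, 257  # class supports in the test set, read from the table above

# Always predicting "negative" yields no true or false positives
tp, fp, fn, tn = 0, 0, n_pos, n_neg

accuracy = (tp + tn) / (n_pos + n_neg)
neg_precision = tn / (tn + fn)
neg_recall = tn / (tn + fp)
neg_f1 = 2 * neg_precision * neg_recall / (neg_precision + neg_recall)

# With no predicted positives, the positive-class scores are reported as 0
pos_f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

print(f"accuracy={accuracy:.6f} neg_precision={neg_precision:.6f} neg_f1={neg_f1:.6f} pos_f1={pos_f1}")
```

The dummy classifier reaches about 85.7% accuracy without learning anything, so accuracy alone says little on this dataset.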


Judging by the F1-score on the positive class (0.41 versus 0.0), our model clearly does better than the dummy classifier at predicting the minority class, which here is the positive comments.