
In this exercise we will build a classification model for the newsgroup dataset. We will apply the following steps:

> A. Document representation with tf-idf

> B. Naïve Bayes classification model

> C. Pipelines and Random Forest

> D. Grid search with tf-idf

> E. Grid search with Doc2Vec

1. Import the following packages:

In [None]:
import pickle
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import Pipeline
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from google.colab import files

2. Load the stemmed data from Exercise 2 together with the columns target and target_names. What are the classes here? What is their distribution?

In [None]:
uploaded = files.upload()

Saving Stemmed.pkl to Stemmed.pkl


In [None]:
# Load the stemmed data; the context manager closes the file after reading
with open('Stemmed.pkl', 'rb') as f:
    stemmed = pickle.load(f)
stemmed[0]

'car wonder enlighten car saw dai door sport car look late earli call bricklin door small addit bumper separ rest bodi know tellm model engin spec year product car histori info funki look car mail thank'

In [None]:
#Loading the original data
original=pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
print(original.head())


                                             content  ...           target_names
0  From: lerxst@wam.umd.edu (where's my thing)\nS...  ...              rec.autos
1  From: guykuo@carson.u.washington.edu (Guy Kuo)...  ...  comp.sys.mac.hardware
2  From: twillis@ec.ecn.purdue.edu (Thomas E Will...  ...  comp.sys.mac.hardware
3  From: jgreen@amber (Joe Green)\nSubject: Re: W...  ...          comp.graphics
4  From: jcm@head-cfa.harvard.edu (Jonathan McDow...  ...              sci.space

[5 rows x 3 columns]


In [None]:
df = pd.DataFrame(columns=['preprocessed', 'target', 'target_names']) # creating new DF
df['preprocessed']=stemmed # getting the preprocessed data
df['target']=original['target'] # getting the original target number
df['target_names']=original['target_names'] # getting the original target names
df.head()

| | preprocessed | target | target_names |
|---|---|---|---|
| 0 | car wonder enlighten car saw dai door sport ca... | 7 | rec.autos |
| 1 | clock poll final final clock report acceler cl... | 4 | comp.sys.mac.hardware |
| 2 | question folk mac plu final gave ghost weekend... | 4 | comp.sys.mac.hardware |
| 3 | weitek robert kyanko rob rjck uucp wrote abrax... | 1 | comp.graphics |
| 4 | shuttl launch question articl cowcb world std ... | 14 | sci.space |
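
Step 2 asks for the classes and their distribution. A minimal sketch of one way to answer it, using a small hypothetical frame `toy` in place of `df` (the rows below are made up for illustration; on the real data the same two calls would be applied to `df['target_names']`):

```python
import pandas as pd

# Hypothetical stand-in for df; the real frame has one row per newsgroup post.
toy = pd.DataFrame({
    "target": [7, 4, 4, 1, 14, 4],
    "target_names": [
        "rec.autos", "comp.sys.mac.hardware", "comp.sys.mac.hardware",
        "comp.graphics", "sci.space", "comp.sys.mac.hardware",
    ],
})

# Absolute number of documents per class.
counts = toy["target_names"].value_counts()
# Relative share of each class (sums to 1).
shares = toy["target_names"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

The distinct values of `target_names` are the classes; `value_counts` shows whether they are roughly balanced.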


3. Restrict the dataset from 2. to the following topics: *'soc.religion.christian', 'rec.sport.hockey', 'talk.politics.mideast', 'rec.motorcycles'*. Remove the 'contents' column.

In [None]:
targetsnames =  ['talk.politics.mideast', 'rec.sport.hockey' , 'soc.religion.christian', 'rec.motorcycles'] # relevant topics
mask = df.target_names.isin(targetsnames) # boolean mask: True where the topic is relevant
df_2 = df[mask]  # new DataFrame with only the relevant topics
df_2.head()

| | preprocessed | target | target_names |
|---|---|---|---|
| 10 | recommend duc worth ducati gt line ducati gt m... | 8 | rec.motorcycles |
| 21 | nhl team captain articl apr samba oit unc edu ... | 10 | rec.sport.hockey |
| 28 | pantheism environment articl apr atho rutger e... | 15 | soc.religion.christian |
| 33 | isra expans lust articl spam math adelaid edu ... | 17 | talk.politics.mideast |
| 35 | goali mask articl netnew upenn edu kkeller mai... | 10 | rec.sport.hockey |


## Part A: Document representation with tf-idf
4. Apply the following function to df:
  
  *docs_train, docs_test, y_train, y_test =train_test_split(df.preprocessed, df.target, test_size = 0.20, random_state = 12)*

  What is it doing and why?

In [None]:
docs_train, docs_test, y_train, y_test = train_test_split(df_2.preprocessed, df_2.target, test_size = 0.20, random_state = 12)
# Splits the data into 80% training and 20% test data so that we can later measure the model's accuracy on documents it has not seen; random_state makes the split reproducible

In [None]:
print(docs_test, y_test)

9319     washington beat pitt articl kkq acsu buffalo e...
7883     latest branch davidian articl apr geneva rutge...
6290     qualiti cathol liturgi tim rolf write activ pa...
947      tuff christian realiz frequent get troubl stra...
11242    abc canada abc coverag king flame game suppos ...
                               ...                        
7699     chant passion mike rolf write know latin beaut...
2169     playoff predict predict try laugh hyster someb...
764      seventh centuri armenian math problem problem ...
8139     stan fischler keenan stuff articl apr new colu...
3269     ship bike recommend ship motorcycl san francis...
Name: preprocessed, Length: 473, dtype: object 9319     10
7883     15
6290     15
947      15
11242    10
         ..
7699     15
2169     10
764      17
8139     10
3269      8
Name: target, Length: 473, dtype: int64
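
A small sketch of what `train_test_split` does with `test_size=0.20` and a fixed `random_state`, on ten hypothetical documents (the names `docs`, `labels`, `a_*` and `b_*` are illustrative and not part of the exercise):

```python
from sklearn.model_selection import train_test_split

# Ten hypothetical documents with alternating labels.
docs = [f"doc {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

# Two calls with the same random_state give the same shuffled split.
a_train, a_test, ya_train, ya_test = train_test_split(
    docs, labels, test_size=0.20, random_state=12)
b_train, b_test, yb_train, yb_test = train_test_split(
    docs, labels, test_size=0.20, random_state=12)

print(len(a_train), len(a_test))          # sizes of the two parts
print(a_test == b_test)                   # reproducibility check
print(set(a_train).isdisjoint(a_test))    # no document is in both parts
```

Because the test documents never enter training, accuracy measured on them estimates how the model behaves on unseen text.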


5. Derive the tf-idf frequency matrix for docs_train using *TfidfVectorizer* with *max_df=0.7, min_df=0.1*. Store it in tf_train.

  Apply the trained transformer to the test data using transform(). Store the result in tf_test.

  Why are we not using fit_transform on the test data?

In [None]:
model = TfidfVectorizer(max_df=0.7, min_df=0.1)
tf_train = model.fit_transform(docs_train)
print("train" '\n', tf_train[0])

train
   (0, 17)	0.1390466118001983
  (0, 53)	0.09817941673585602
  (0, 71)	0.13435314501375176
  (0, 39)	0.19287904798535266
  (0, 11)	0.14806750809457395
  (0, 70)	0.1265014336483417
  (0, 35)	0.1466956012087497
  (0, 68)	0.1490057688850429
  (0, 75)	0.11882822042574444
  (0, 87)	0.12263461795633522
  (0, 84)	0.3285367160772088
  (0, 67)	0.12452675151019846
  (0, 8)	0.14322830236191653
  (0, 57)	0.24687173660005488
  (0, 90)	0.10833888239946689
  (0, 29)	0.22645324664826336
  (0, 22)	0.14199360411972675
  (0, 47)	0.29567190047239905
  (0, 6)	0.5885931984293965
  (0, 19)	0.14079168938068798
  (0, 66)	0.12766974268609335
  (0, 12)	0.10220960874887552
  (0, 89)	0.06672396194008437
  (0, 0)	0.09533980685475581
  (0, 1)	0.15135163664161153


In [None]:
tf_test = model.transform(docs_test)
print("test" '\n', tf_test[0])
print("feature names" '\n',  model.get_feature_names())  # scikit-learn 1.2+ removed this method; use get_feature_names_out() there

#fit_transform(self, raw_documents[, y]) --Learn vocabulary and idf, return document-term matrix.
#transform(self, raw_documents[, copy]) --Transform documents to document-term matrix.
#https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

test
   (0, 90)	0.4218419068406405
  (0, 89)	0.25980481534768
  (0, 76)	0.18366113234773754
  (0, 73)	0.2629127316250822
  (0, 71)	0.26156715681108517
  (0, 52)	0.266017557369826
  (0, 42)	0.17537403770957705
  (0, 24)	0.26636938793032844
  (0, 18)	0.29520972341240415
  (0, 14)	0.27844065967624904
  (0, 13)	0.23134225672574707
  (0, 2)	0.26707729083953
  (0, 1)	0.2946608899330096
  (0, 0)	0.18561353519014362
feature names
 ['apr', 'articl', 'ask', 'awai', 'believ', 'best', 'better', 'bike', 'call', 'case', 'christian', 'claim', 'com', 'come', 'cours', 'dai', 'differ', 'dod', 'edu', 'end', 'exist', 'fact', 'far', 'follow', 'game', 'gener', 'get', 'go', 'god', 'good', 'got', 'great', 'happen', 'help', 'hockei', 'includ', 'interest', 'israel', 'kill', 'know', 'let', 'life', 'like', 'littl', 'live', 'long', 'look', 'lot', 'make', 'mean', 'need', 'new', 'opinion', 'peopl', 'person', 'place', 'plai', 'point', 'post', 'probabl', 'problem', 'question', 'read', 'reason', 'right', 'rutger', 'sai

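We don't use fit_transform on the test data because fitting would learn a new vocabulary and new idf weights from the test documents. The resulting columns would no longer match the features the classifier is trained on, and information from the test set would leak into the representation. The test data must therefore only be transformed with the vocabulary and idf weights learned from the training data.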
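
The difference between `transform` and `fit_transform` on test data can be seen on a toy corpus. This is only a sketch; the documents and the names `vec`, `refit` and `X_*` are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["bike ride engine", "hockey game goal", "bike helmet road"]
test_docs = ["hockey goal keeper", "bike tire"]

# Fit on the training documents only, then reuse the fitted vectorizer.
vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)          # same columns as X_train

# Refitting on the test documents builds a different vocabulary.
refit = TfidfVectorizer()
X_test_refit = refit.fit_transform(test_docs)

print(X_train.shape, X_test.shape, X_test_refit.shape)
print(sorted(vec.vocabulary_))
print(sorted(refit.vocabulary_))
```

A classifier trained on `X_train` could not be applied to `X_test_refit`, because the number and meaning of the columns differ. Words that occur only in the test documents, such as "keeper" here, are simply ignored by `transform`.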

## Part B: Naïve Bayes classification model
6. Run the following two lines. What are they doing?


In [None]:
clf =MultinomialNB() #initialize the Naive Bayes classifier for multinomial models
clf.fit(tf_train,y_train) #Fit Naive Bayes classifier according to X, y 
#--X:Training vectors, where n_samples is the number of samples and n_features is the number of features.
#--y:Target values.

#The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). 
#The multinomial distribution normally requires integer feature counts. 
#However, in practice, fractional counts such as tf-idf may also work.

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
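
To make the two lines more concrete, here is a minimal sketch of the same fit and predict pattern on four made-up documents. The labels 8 and 10 mirror the motorcycle and hockey targets, but the corpus and the names `vec`, `nb`, `pred` and `proba` are assumptions for illustration only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["bike ride engine", "bike helmet road",
              "hockey game goal", "hockey team ice"]
train_labels = [8, 8, 10, 10]

# Learn per-class word weights from the tf-idf training matrix.
vec = TfidfVectorizer()
nb = MultinomialNB()
nb.fit(vec.fit_transform(train_docs), train_labels)

# Classify new documents with the vocabulary learned above.
new_docs = ["bike engine", "hockey goal"]
X_new = vec.transform(new_docs)
pred = nb.predict(X_new)
proba = nb.predict_proba(X_new)   # one column per entry of nb.classes_

print(nb.classes_, pred)
print(proba.round(2))
```

`fit` estimates, for each class, how strongly each term is associated with it; `predict` then picks the class with the highest posterior probability for each document.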

7. Run *y_pred = clf.predict(tf_test)*. What is it doing?



In [None]:
y_pred = clf.predict(tf_test) #Perform classification on an array of test vectors X.
print(len(y_pred))
print(y_pred)

473
[10 15 17 15 10 15 10 10  8  8 15 15 17 15 15 10  8 10 10 10 15 15  8  8
 15  8 17 10 17 10 15 15 17 15  8  8 17  8  8  8  8 17 10 15 17 15 10 17
 17 15 17 17 10 10 10 10 17  8 15  8  8 10 17 17  8 15 17 15 10  8 17 10
 15  8 15 10 15  8 17 10 10  8 10 10 17  8  8  8 10  8 10 10 17 10 17 15
 15 15  8 17 15 17 15 15 17  8  8 15  8 10 10 15 17 15  8 10 17  8 15 15
 17 17  8 10  8 17  8 10  8  8  8 10 15 17  8 17 17 17 15  8 17  8 10 15
  8  8  8  8  8  8  8 15 10 10 17 17 10 10  8 10  8 17 10 17 10 10 15 10
  8 15 15 15  8  8  8 10  8 15 17 15  8 10 10 17  8  8 15 17 17 17  8 17
  8 17 17  8  8  8 15  8  8 10 10 17  8  8 15 10 15 10 17  8 10 15 15 17
  8  8 15  8 15 15  8 15  8  8 10 10  8 17  8  8 17  8 15 15 15  8 10 17
 17  8 17 10  8 15 15 17  8 17 10 10 15 17  8  8  8 10  8  8  8 15 17 10
  8  8  8 15 15 17 15  8 17 17 17 15 15 15 17 10 15 17  8 10 15 10 10 15
  8 17 15 10  8 17 10  8 15 15 17 17 10  8 15 10 15 17  8 10 10 15  8  8
 15  8 15  8 10  8  8 10  8  8 15 10 15 17 10 1

8. Determine clf.score(tf_train ,y_train), accuracy_score(y_test, y_pred) and classification_report(y_test, y_pred). What do they say about the model? Is it a good one?

In [None]:
score_1 = clf.score(tf_train ,y_train) # Mean accuracy of the classifier on the training data (X, y)
print(score_1)
# Each document has exactly one label here (multiclass, not multi-label),
# so this is simply the share of training documents classified correctly.


0.871822033898305


In [None]:
acc_1 =accuracy_score(y_test, y_pred) # Accuracy classification score (y_true, y_pred)
print(acc_1)
# Share of test documents whose predicted label equals the true label.

0.8393234672304439


In [None]:
report_1 = classification_report(y_test, y_pred) # Build a text report showing the main classification metrics. (y_true, y_pred)
print(report_1) 

              precision    recall  f1-score   support

           8       0.82      0.87      0.84       134
          10       0.90      0.89      0.89       118
          15       0.79      0.87      0.83       106
          17       0.85      0.73      0.79       115

    accuracy                           0.84       473
   macro avg       0.84      0.84      0.84       473
weighted avg       0.84      0.84      0.84       473



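**Class 15 has lower precision (0.79) than recall (0.87), so the model assigns this label too often. Class 17 has the lowest recall (0.73), so more than a quarter of its documents are assigned to other classes.**

**Overall this is a good model for such a simple approach: the test accuracy (0.84) is close to the training accuracy (0.87), so there is little overfitting.**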
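
The per-class precision and recall in the report can be derived from a confusion matrix. A short sketch on made-up labels (the vectors `y_true` and `y_hat` are hypothetical and unrelated to the actual predictions above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [8, 8, 8, 15, 15, 17, 17, 17]
y_hat = [8, 8, 15, 15, 15, 15, 17, 17]
labels = [8, 15, 17]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=labels)

# Precision: correct predictions divided by all predictions of that class.
precision = cm.diagonal() / cm.sum(axis=0)
# Recall: correct predictions divided by all true members of that class.
recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(precision.round(2), recall.round(2))
```

In this toy case class 15 is predicted four times but occurs only twice, so its precision is below its recall. A class whose recall is below its precision shows the opposite pattern: the model misses some of its documents.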

## Part C: Pipelines and Random Forest
9. Apply the following two lines. What are they doing? What does the resulting model look like?

In [None]:
# Pipeline of transforms with a final estimator.
rf = Pipeline([('tfidf', TfidfVectorizer(max_df=0.7, min_df=0.1)), ('clf', RandomForestClassifier(random_state = 42)),])
# The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. 

In [None]:
# Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
rf.fit(docs_train, y_train)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=0.7, max_features=None,
                                 min_df=0.1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=None, max_features='auto',
                                        max_leaf_nodes=None, max_samples=None
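
A pipeline bundles the vectorizer and the classifier so that one `fit` or `predict` call runs both steps. The sketch below, on a made-up corpus with hypothetical names (`pipe`, `vec`, `clf`), compares a pipeline with the equivalent manual two-step version:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

docs = ["bike ride engine", "bike helmet road", "bike road trip",
        "hockey game goal", "hockey team ice", "ice game team"]
labels = [8, 8, 8, 10, 10, 10]
new_docs = ["bike road trip", "ice hockey game"]

# Pipeline: raw text in, predictions out.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=42)),
])
pipe.fit(docs, labels)
pipe_pred = pipe.predict(new_docs)

# Manual version: vectorize first, then fit and predict on the matrices.
vec = TfidfVectorizer()
clf = RandomForestClassifier(n_estimators=10, random_state=42)
clf.fit(vec.fit_transform(docs), labels)
manual_pred = clf.predict(vec.transform(new_docs))

print([name for name, _ in pipe.steps])
print(list(pipe_pred) == list(manual_pred))
```

The pipeline applies `fit_transform` during `fit` and `transform` during `predict`, so the train and test handling from Part A happens automatically.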

10. Determine the performance of the model in 9. Is it better than the Naïve Bayes one?

In [None]:
score_2 = rf.score(docs_train, y_train) # Apply transforms, and score with the final estimator
print(score_2)

0.9936440677966102


In [None]:
rf_y_pred = rf.predict(docs_test)
# Apply transforms to the data, and predict with the final estimator
print(len(rf_y_pred))

473


In [None]:
acc_2 = accuracy_score(y_test, rf_y_pred)
print(acc_2)

0.8668076109936576


In [None]:
report_2 = classification_report(y_test, rf_y_pred)
print(report_2)

              precision    recall  f1-score   support

           8       0.88      0.87      0.88       134
          10       0.86      0.91      0.88       118
          15       0.85      0.84      0.84       106
          17       0.87      0.84      0.86       115

    accuracy                           0.87       473
   macro avg       0.87      0.87      0.87       473
weighted avg       0.87      0.87      0.87       473



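The Random Forest performs slightly better than Naïve Bayes on the test data (accuracy 0.87 vs. 0.84). Its training accuracy of 0.99 is much higher than its test accuracy, however, so it overfits the training data more strongly.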

## Part D: Grid search with tf-idf
11. Run the following lines. What are they doing and why?

In [None]:
param_grid = {'min_samples_leaf': [3, 4, 5], 'n_estimators': [10, 50, 100, 200, 300, 1000]}
# The parameter grid to explore, as a dictionary mapping estimator parameters to sequences of allowed values.

In [None]:
rf_2 = RandomForestClassifier(random_state = 42) # Initializing RandomForest Classifier

In [None]:
grid_search = GridSearchCV(estimator = rf_2, param_grid = param_grid, cv = 10)
# Exhaustive search over specified parameter values for an estimator.
# The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.

In [None]:
grid_search.fit(tf_train, y_train) # Run fit with all sets of parameters.

GridSearchCV(cv=10, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False, random_state=42,
                                 

12. What is the best model (Hint: grid_search.best_params_) in 11?

In [None]:
best_model = grid_search.best_params_  # Parameter setting that gave the best results on the hold out data.
print(best_model)

{'min_samples_leaf': 4, 'n_estimators': 1000}
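
Besides `best_params_`, the fitted search keeps the cross-validated score of every parameter combination in `cv_results_`. A reduced sketch on synthetic data (the data from `make_classification`, the smaller grid, `cv=3` and the names `search` and `results` are assumptions chosen to keep the example fast):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the tf-idf matrix and the targets.
X, y = make_classification(n_samples=120, n_features=8,
                           n_informative=4, random_state=0)

grid = {"min_samples_leaf": [3, 5], "n_estimators": [10, 30]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid=grid, cv=3)
search.fit(X, y)

# One row per parameter combination, with its mean validation accuracy.
results = pd.DataFrame(search.cv_results_)[
    ["param_min_samples_leaf", "param_n_estimators",
     "mean_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
print(search.best_params_, round(search.best_score_, 3))
```

Looking at the full table shows how close the candidates are to each other, which helps judge whether the best setting is meaningfully better or only marginally so.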


13. Determine the performance of the best model in 12 and compare it with the performance in 10 (Hint: Access the model with grid_search.best_estimator_).

In [None]:
# Estimator chosen by the search, i.e. the one with the highest mean cross-validated score, refit on the whole training set. Not available if refit=False.
random_f = grid_search.best_estimator_
y_pred_3 = random_f.predict(tf_test)
score_3 = grid_search.score(tf_train, y_train)
print(score_3)

0.9533898305084746


In [None]:
acc_3 = accuracy_score(y_test, y_pred_3)
print(acc_3)

0.8731501057082452


In [None]:
report_3 = classification_report(y_test, y_pred_3)
print(report_3)

              precision    recall  f1-score   support

           8       0.94      0.87      0.90       134
          10       0.85      0.93      0.89       118
          15       0.85      0.82      0.84       106
          17       0.85      0.87      0.86       115

    accuracy                           0.87       473
   macro avg       0.87      0.87      0.87       473
weighted avg       0.88      0.87      0.87       473



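**On the test data the grid-searched Random Forest from 12 is marginally the best model (accuracy 0.873), ahead of the default Random Forest from 10 (0.867) and Naïve Bayes (0.839). With 473 test documents, the gap between the two forests is only three documents, so the tuning gain is small. The tuned forest also has a lower training accuracy (0.95 vs. 0.99), i.e. it overfits less.**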
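
The test accuracies printed in steps 8, 10 and 13 can be converted back into numbers of correctly classified documents, given the 473 test documents. A small arithmetic sketch (the dictionary keys are illustrative names, the values are copied from the outputs above):

```python
n_test = 473  # number of test documents, from the classification reports

test_accuracy = {
    "naive_bayes": 0.8393234672304439,
    "random_forest_default": 0.8668076109936576,
    "random_forest_grid_search": 0.8731501057082452,
}

# Accuracy times test-set size gives the count of correct predictions.
correct = {name: round(acc * n_test) for name, acc in test_accuracy.items()}
best = max(test_accuracy, key=test_accuracy.get)

print(correct)
print(best)
```

Expressed as counts, the differences between the two Random Forest variants amount to a handful of documents, which is worth keeping in mind when ranking the models.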

## Part E: Grid search with Doc2Vec
14. Derive the text representation of the dataset from 3. with the Doc2vec model (vector_size=100, min_count=566). Make sure that you consider the train/test split. Then apply the same approach as in 11. to 13. Are the results better than in 13.?

In [None]:
#Prepare dataset
corpus_gen = [doc.split() for doc in df_2.preprocessed]
docs2_train, docs2_test, y_train_2, y_test_2 =train_test_split(corpus_gen, df_2.target, test_size = 0.20, random_state = 12) 

documents_train = [TaggedDocument(doc, [i]) for i, doc in enumerate(docs2_train)]

In [None]:
# Run doc2vec on tagged texts
model2 = Doc2Vec(documents_train, vector_size=100, min_count=566)

data_train = pd.DataFrame([model2.infer_vector(doc) for doc in docs2_train])
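# Infer the test vectors from the tokenized split docs2_test, not from the raw strings in docs_test
data_test = pd.DataFrame([model2.infer_vector(doc) for doc in docs2_test])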

In [None]:
data_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
0,0.007446,-0.000986,-0.010638,0.011423,-0.000872,0.004708,-0.004648,-0.000894,-0.001165,-0.002955,0.003636,0.001516,0.00935,-0.002444,-0.002069,-0.011438,0.001573,0.013785,0.005954,0.006411,-0.013671,0.009582,0.003618,-0.005288,-0.005215,-0.001546,-0.007783,-0.009049,0.014987,0.001659,0.012572,-0.003872,0.004383,-0.015075,0.003393,-0.004249,-0.001048,-0.004095,0.004827,-0.009589,...,-0.007941,0.000431,-0.003274,0.016654,0.01479,0.017837,0.004124,0.000398,-0.016857,0.001292,0.007838,-0.021353,0.016556,0.01209083,0.004239,0.00963,-0.003473,0.00629,0.002058,0.010159,-0.009383,0.00871,-0.00222,0.006144,-0.008508,-0.005139,-0.005068,-0.013913,0.000797,-0.002434,0.002483,-0.000646,-0.001042,0.014663,0.003834,0.003764,-0.002463,0.009802,0.002558,-0.013692
1,0.008682,-0.002034,-0.001568,0.022848,0.002711,0.014521,-0.014123,-0.011882,-0.002656,-0.010413,0.015991,0.008217,0.009964,-0.017364,0.000651,-0.012127,0.016034,0.000266,-0.002549,0.005676,-0.020371,-0.001052,0.005841,0.000721,0.008876,0.001694,-0.019209,-0.012652,0.008708,-0.00782,-0.000457,0.004069,0.004547,-0.018393,-0.007978,0.015887,0.00899,0.002111,-0.007509,-0.01377,...,-0.006662,0.010018,0.00655,0.023768,0.01773,0.023151,0.016392,0.008507,-0.00251,0.001345,0.000872,-0.020125,0.014458,0.00935138,0.007274,0.006856,-0.002791,0.012659,-0.00489,0.023065,0.005345,0.010861,-0.003609,0.022151,-0.005048,-0.003544,0.000788,-0.013396,0.010217,0.001329,0.020461,-0.010438,0.002425,0.023387,0.006402,-0.007602,0.002157,0.010048,0.008154,-0.023768
2,0.008094,-0.006639,-0.008165,0.016391,-0.008243,0.005007,0.005787,-0.006307,0.012727,-0.004804,0.022919,0.008962,0.00035,-0.009479,-0.010033,0.002994,0.015209,0.005381,0.007153,0.001536,-0.012903,0.007617,0.000676,-0.019338,0.008161,-0.002877,-0.021196,-0.010167,0.012425,-0.008161,-0.001147,-0.002437,0.006246,-0.007943,0.000821,0.009258,0.007975,0.006936,-0.009885,-0.003608,...,-0.00449,0.005919,-0.006164,0.023576,0.015403,0.019483,0.008129,0.01766,-0.008208,0.001216,-0.004588,-0.017413,0.023873,0.005426322,0.010927,0.00178,-0.00371,0.008036,0.006753,0.024907,0.007795,-0.004573,0.001059,0.028608,-0.003812,-0.010329,0.004698,-0.01418,0.00175,-0.016257,0.010639,-0.007596,-0.004653,0.019195,-0.002375,-0.010197,-0.005115,0.007443,0.010493,-0.029363
3,0.004603,-0.005152,-0.004259,0.009375,-0.003916,0.001466,0.004413,-0.000281,0.003976,-0.002836,0.002133,0.006059,0.001158,-0.004146,-0.010889,-0.000622,0.003436,0.00217,0.00265,-0.004856,-0.002122,0.004431,0.005323,-0.007501,0.004315,-0.002236,-0.005989,-0.000849,0.007535,-0.003045,-0.000337,-0.004859,0.001096,-0.01401,-0.003797,-0.000896,0.004697,0.001065,-0.001394,-0.006394,...,-0.004792,0.007404,-0.005944,0.013817,0.008303,0.011211,0.000177,0.003302,-0.012332,-0.000625,0.004407,-0.010741,0.010471,0.003328565,0.004211,-0.001087,0.004841,0.005844,0.011151,0.017944,4.7e-05,0.002454,-0.005007,0.008847,0.002119,-0.004608,-0.001125,-0.012019,0.005231,-0.011234,0.001708,-0.001072,0.004669,0.007096,0.004637,-0.002949,-0.005312,0.00407,0.008176,-0.008119
4,0.001308,-0.000447,-0.006382,0.000843,0.001849,-0.000386,-0.005315,2.3e-05,0.002657,-0.001101,0.006053,0.005212,-0.000678,-0.005576,-0.004032,-0.006276,-0.000396,-0.000683,0.004497,-0.00265,0.001387,0.005307,-0.001419,-0.003216,-0.00462,-0.003945,0.000641,0.002837,-0.000136,0.003659,0.005152,0.003192,0.002667,-0.003805,-0.000406,-0.002068,-0.000708,0.003073,0.003811,-0.000832,...,-0.005098,0.003582,0.000626,0.002279,0.005982,0.003601,-0.003666,0.004465,0.000957,-0.005011,0.001262,-0.000135,0.000167,5.341135e-07,0.002483,-0.001217,0.001249,0.001736,-0.001829,0.005459,-0.004162,0.003862,0.001172,-0.001073,-0.00276,-0.005752,0.003001,-0.00102,-0.0023,0.004329,-0.003366,-0.004595,0.001707,0.003599,0.00203,0.000315,-0.002722,0.001779,0.003574,-0.008453


In [None]:
grid_search.fit(data_train, y_train_2)


GridSearchCV(cv=10, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False, random_state=42,
                                 

In [None]:
docs2v_best = grid_search.best_estimator_
print(grid_search.best_params_)

{'min_samples_leaf': 5, 'n_estimators': 300}


In [None]:
score_4 = docs2v_best.score(data_train, y_train_2)
print(score_4)


0.961864406779661


In [None]:
y_pred_4 = docs2v_best.predict(data_test)
acc_4 = accuracy_score(y_test_2, y_pred_4)
print(acc_4)

0.2769556025369979


In [None]:
report_4 = classification_report(y_test_2, y_pred_4)
print(report_4)

              precision    recall  f1-score   support

           8       0.29      0.49      0.36       134
          10       0.29      0.33      0.31       118
          15       0.24      0.23      0.23       106
          17       0.33      0.02      0.03       115

    accuracy                           0.28       473
   macro avg       0.29      0.27      0.23       473
weighted avg       0.29      0.28      0.24       473



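**The training accuracy is 0.96, but the test accuracy is only 0.28, close to the 0.25 expected from random guessing among four classes. The model therefore overfits heavily, and the results are much worse than in 13. Note that these outputs were produced with test vectors inferred from docs_test (raw strings) rather than from the tokenized docs2_test, so the test representation did not match the one used for training.**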
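
Doc2Vec is trained here on token lists (the output of `doc.split()`), and `infer_vector` expects the same kind of input. The plain-Python sketch below shows how a raw string, like the entries of `docs_test`, differs from its tokenized form when iterated. How a particular gensim version reacts to a plain string is not shown in this notebook, so the sketch only illustrates the input mismatch, not gensim's behavior:

```python
# A raw preprocessed document, as stored in the 'preprocessed' column.
raw_doc = "car wonder enlighten"

# Token list, the form used to build the TaggedDocument objects.
tokens = raw_doc.split()

# Iterating over the raw string yields single characters, not words.
as_iterated = list(raw_doc)

print(tokens)
print(as_iterated[:5])
```

Any code that loops over a document word by word sees three words in the first case and twenty single characters in the second, so train and test documents should both be passed as token lists.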