

# Multinomial Naive Bayes | Semi-supervised learning



**Problem statement**: Predict the author of a given text, one of EAP, HPL or MWS. In simpler words, this is text classification with 3 different classes.

**Evaluation metric**:  Multi-class logarithmic loss
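As a rough illustration of the metric: multi-class log loss is the mean negative log of the probability the model assigns to the true class, so confident correct predictions score close to 0 and a uniform guess over 3 classes scores ln(3). The helper name `log_loss_sketch` and the probabilities below are made up for this sketch.

```python
import numpy as np

def log_loss_sketch(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class (hypothetical helper)."""
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    y_true = np.asarray(y_true)
    # pick, for every row, the probability predicted for its true class
    true_class_proba = proba[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(true_class_proba)))

# made-up predictions for three samples whose true classes are 0, 1 and 2
y_demo = np.array([0, 1, 2])
confident = np.array([[0.8, 0.1, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.1, 0.1, 0.8]])
uniform = np.full((3, 3), 1 / 3)

loss_confident = log_loss_sketch(y_demo, confident)  # equals -ln(0.8)
loss_uniform = log_loss_sketch(y_demo, uniform)      # equals ln(3)
print(loss_confident, loss_uniform)
```

Lower is better, which is why the comparisons later in the notebook treat a higher log loss as a worse model.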

**Objective**: Use semi-supervised learning as an approach to the given NLP problem and compare it to the supervised learning approach (based on the given evaluation metric).

**Dataset**: We have a train and a test dataset.

**Approach:**
For this problem we will use **Multinomial Naive Bayes** as our classification algorithm. We will use both **TF-IDF** and **CountVectorizer** to convert the terms into numeric values.
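To make the difference between the two representations concrete, here is a small sketch on a made-up three-sentence corpus (the sentences and variable names are assumptions for illustration, and both vectorizers use their default settings rather than the ones configured later in this notebook). CountVectorizer stores raw term counts, while TfidfVectorizer reweights them and, by default, scales every row to unit L2 norm.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# made-up toy corpus, only for illustration
toy_corpus = [
    "the raven sat upon the bust",
    "the raven spoke one word",
    "a nameless horror rose from the deep",
]

toy_ctv = CountVectorizer()
count_matrix = toy_ctv.fit_transform(toy_corpus)

toy_tfv = TfidfVectorizer()
tfidf_matrix = toy_tfv.fit_transform(toy_corpus)

# raw count of the word "the" in the first sentence
the_count = count_matrix[0, toy_ctv.vocabulary_["the"]]

# L2 norm of every TF-IDF row (the default norm='l2' rescales each document)
row_norms = np.sqrt(np.asarray(tfidf_matrix.multiply(tfidf_matrix).sum(axis=1)).ravel())

print(the_count)
print(row_norms)
```

Both matrices are non-negative, which is what MultinomialNB requires of its input features.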

Steps we will follow:
1. Split the dataset into train, valid and unlabeled
2. Fit the TF-IDF vectorizer on the train, valid and unlabeled texts
3. Train the model on the labeled dataset
4. Predict on the unlabeled dataset
5. Use the predicted values as pseudo-labels and combine them with the train dataset
6. Train the model on this combined dataset and evaluate on the valid dataset
7. Repeat steps 2-6 using CountVectorizer to convert the terms to counts
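The steps above can be sketched in a few lines. This is a minimal illustration on made-up sentences, not the notebook's actual pipeline: the helper name `pseudo_label_fit`, the toy texts and the default vectorizer settings are all assumptions.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def pseudo_label_fit(labeled_texts, labels, unlabeled_texts, vectorizer):
    """Hypothetical helper: one round of pseudo-labeling with MultinomialNB."""
    # fit the vectorizer on all available texts, labeled or not
    vectorizer.fit(list(labeled_texts) + list(unlabeled_texts))
    x_labeled = vectorizer.transform(labeled_texts)
    x_unlabeled = vectorizer.transform(unlabeled_texts)

    # train on the labeled part and predict pseudo-labels for the rest
    base_model = MultinomialNB().fit(x_labeled, labels)
    pseudo_labels = base_model.predict(x_unlabeled)

    # retrain on labeled + pseudo-labeled samples
    x_all = vstack([x_labeled, x_unlabeled])
    y_all = np.concatenate([labels, pseudo_labels])
    final_model = MultinomialNB().fit(x_all, y_all)
    return final_model, pseudo_labels

# made-up data: three labeled sentences (classes 0, 1, 2) and two unlabeled ones
toy_labeled = ["the raven croaked at midnight",
               "ancient horror beneath the sea",
               "the creature fled across the ice"]
toy_labels = np.array([0, 1, 2])
toy_unlabeled = ["a raven at midnight", "horror from the sea"]

toy_vectorizer = TfidfVectorizer()
toy_model, toy_pseudo = pseudo_label_fit(toy_labeled, toy_labels,
                                         toy_unlabeled, toy_vectorizer)
toy_proba = toy_model.predict_proba(toy_vectorizer.transform(["the ice at midnight"]))
print(toy_pseudo, toy_proba)
```

Swapping `TfidfVectorizer()` for `CountVectorizer()` gives the variant described in step 7.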


**Note**: We don't have an actual unlabeled dataset here, so we first use a simple model for semi-supervised learning based on self-training: we divide the dataset into three parts (train, valid and unlabeled) and drop the target column from the unlabeled part.

We will then use the **self-training classifier** (`SelfTrainingClassifier`) from scikit-learn's semi-supervised module.


Reference:

* https://www.kaggle.com/code/abhishek/approaching-almost-any-nlp-problem-on-kaggle
* https://www.kaggle.com/code/sasakitetsuya/semi-supervised-classification-on-a-text-dataset



## Simple semi-supervised learning model

First we use a simple semi-supervised learning model, without any dedicated semi-supervised learning technique.

In [None]:
# Importing libraries 

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer,TfidfTransformer
from sklearn import model_selection, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.semi_supervised import LabelSpreading, SelfTrainingClassifier
from sklearn.preprocessing import FunctionTransformer




In [None]:
#loading dataset

train=pd.read_csv('/kaggle/input/spooky/train.csv')
test=pd.read_csv('/kaggle/input/spooky/test.csv')


In [None]:
#looking at the data

train.head()


In [None]:
# we use the LabelEncoder from scikit-learn to convert text labels to integers: 0,1,2

lbl_enc= preprocessing.LabelEncoder()
y=lbl_enc.fit_transform(train.author.values)

In [None]:
# Logloss function

def multiclass_logloss(actual,predicted,eps=1e-15):
    
    # Convert 'actual' to a binary array if it's not already:
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2

    clip = np.clip(predicted, eps, 1 - eps)
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))
    return -1.0 / rows * vsota

In [None]:
# keep the first 14,000 rows as labeled data and treat the remaining rows as "unlabeled"
xtrain=train.iloc[:14000]
X_unlabeled=train.iloc[14000:]  # no need to worry about class imbalance here, as the labels are dropped
X_unlabeled=X_unlabeled.text.values
ytrain=y[:14000]



In [None]:
#splitting the dataset into train and validation 

Xtrain,Xvalid,Ytrain,Yvalid = train_test_split(xtrain.text.values,ytrain,
                                              stratify=ytrain,
                                              random_state=42,
                                              test_size=0.1,
                                              shuffle=True)
Xtest=test.text.values



In [None]:
print(Xtrain.shape)
print(Xvalid.shape)
print(X_unlabeled.shape)


### TF-IDF 

In [None]:
# Convert terms to numeric values, weighted by the importance of each term

tfv= TfidfVectorizer(min_df=3,
                     max_features=None,
                     strip_accents='unicode',
                     analyzer='word',
                     token_pattern=r'\w{1,}',
                     ngram_range=(1,3),
                     use_idf=True,
                     smooth_idf=True,
                     sublinear_tf=True,                 
                     stop_words='english'
                    )

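# combine the labeled train texts with the unlabeled texts
X_train_unlabeled= np.concatenate((Xtrain,X_unlabeled))

# note: X_train_unlabeled repeats Xtrain and X_unlabeled, so those documents
# are counted twice when fitting (this affects min_df and the IDF weights)
tfv.fit(list(Xtrain)+list(Xvalid)+list(X_unlabeled)+list(X_train_unlabeled))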
xtrain_tfv=tfv.transform(Xtrain)
x_unlabeled_tfv=tfv.transform(X_unlabeled)
X_tfv=tfv.transform(X_train_unlabeled)
xtest_tfv=tfv.transform(Xtest)                                   #---need to check!
xvalid_tfv=tfv.transform(Xvalid)





In [None]:
#fitting a simple naive bayes on TF-IDF

clf=MultinomialNB()
clf.fit(xtrain_tfv,Ytrain)
pseudo_labels=clf.predict(x_unlabeled_tfv)

In [None]:
Y_tfv=np.concatenate((Ytrain,pseudo_labels))


#fit the model on new dataset

clf_train_unlabeled=MultinomialNB()
clf_train_unlabeled.fit(X_tfv,Y_tfv)
predictions=clf_train_unlabeled.predict_proba(xvalid_tfv)

# Evaluate the model

print('logloss: %0.3f '%multiclass_logloss(Yvalid,predictions))


### CountVectorizer

In [None]:
ctv = CountVectorizer(analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3), stop_words = 'english')

# Fitting CountVectorizer on the train, valid and unlabeled texts (semi-supervised learning)
X_train_unlabeled= np.concatenate((Xtrain,X_unlabeled))
ctv.fit(list(Xtrain) + list(Xvalid)+list(X_unlabeled)+list(X_train_unlabeled))
xtrain_ctv =  ctv.transform(Xtrain) 
xvalid_ctv = ctv.transform(Xvalid)
x_unlabeled_ctv=ctv.transform(X_unlabeled)
X_ctv=ctv.transform(X_train_unlabeled)

In [None]:
#fitting a simple naive bayes on CountVectorizer

clf=MultinomialNB()
clf.fit(xtrain_ctv,Ytrain)
pseudo_labels=clf.predict(x_unlabeled_ctv)

In [None]:
#Retrain model with pseudo labels | Semi-supervised learning


Y_ctv=np.concatenate((Ytrain,pseudo_labels))


#fit the model on new dataset

clf_train_unlabeled=MultinomialNB()
clf_train_unlabeled.fit(X_ctv,Y_ctv)
predictions=clf_train_unlabeled.predict_proba(xvalid_ctv)

# Evaluate the model

print('logloss: %0.3f '%multiclass_logloss(Yvalid,predictions))




### Observations

* We observe a log loss of **0.611** (TF-IDF) / **0.530** (CountVectorizer), which is a bit higher, i.e. worse, than the **0.57** (TF-IDF) / **0.485** (CountVectorizer) observed when no semi-supervised learning was used. It is important to note that we have used **self-training** here to mimic the case where there is an actual unlabeled dataset.
* We observe no improvement in our model.
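The supervised baseline is not shown in this notebook. A comparison of this kind could be set up roughly as follows; the helper name `compare_supervised_and_pseudo` and the synthetic Poisson count data are assumptions for this sketch, and its numbers say nothing about the Spooky dataset.

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.naive_bayes import MultinomialNB

def compare_supervised_and_pseudo(x_lab, y_lab, x_unl, x_val, y_val):
    """Hypothetical helper: log loss of a supervised baseline vs. a pseudo-labeled model."""
    baseline = MultinomialNB().fit(x_lab, y_lab)

    # pseudo-label the unlabeled rows with the baseline, then retrain on everything
    pseudo = baseline.predict(x_unl)
    augmented = MultinomialNB().fit(np.vstack([x_lab, x_unl]),
                                    np.concatenate([y_lab, pseudo]))

    labels = baseline.classes_
    loss_baseline = log_loss(y_val, baseline.predict_proba(x_val), labels=labels)
    loss_augmented = log_loss(y_val, augmented.predict_proba(x_val), labels=labels)
    return loss_baseline, loss_augmented

# synthetic count data: each of the 3 classes favours one of 3 "terms"
rng = np.random.default_rng(0)
rates = np.array([[5, 1, 1], [1, 5, 1], [1, 1, 5]])
y_syn = np.tile([0, 1, 2], 100)
x_syn = rng.poisson(rates[y_syn])

loss_baseline, loss_augmented = compare_supervised_and_pseudo(
    x_syn[:90], y_syn[:90],    # labeled
    x_syn[90:240],             # treated as unlabeled
    x_syn[240:], y_syn[240:],  # validation
)
print('baseline: %0.3f, pseudo-labeled: %0.3f' % (loss_baseline, loss_augmented))
```

Evaluating both models on the same validation rows keeps the comparison fair.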

**Why have we seen no improvement?**

We were supposed to use semi-supervised learning to avoid two major shortcomings that can arise in MultinomialNB:
* Early convergence
* Cold start issues

But self-training was not able to handle these issues, because here the dataset size remains the same.

## Using scikit-learn's self-training classifier

Now we will use the self-training classifier, which uses a **pseudo-label threshold** to decide which unlabeled samples receive a label
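As a small sketch of how the threshold works, consider the made-up count matrix below (the data and variable names are assumptions for illustration). Samples marked `-1` are unlabeled; in each iteration, only those whose highest predicted class probability exceeds `threshold` are given a pseudo-label and added to the training set. The base estimator is passed positionally here because its keyword name differs between scikit-learn versions.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.semi_supervised import SelfTrainingClassifier

# made-up term counts: two labeled rows followed by four unlabeled rows (-1)
x_toy = np.array([[3, 0], [0, 3], [4, 0], [0, 4], [5, 1], [1, 5]])
y_toy = np.array([0, 1, -1, -1, -1, -1])

toy_self_training = SelfTrainingClassifier(MultinomialNB(), threshold=0.70)
toy_self_training.fit(x_toy, y_toy)

# transduction_ holds the label finally used for every training row
# (-1 for rows that never passed the threshold)
print(toy_self_training.transduction_)
# labeled_iter_ records the iteration in which each row was labeled
# (0 for originally labeled rows, -1 for rows left unlabeled)
print(toy_self_training.labeled_iter_)
print(toy_self_training.termination_condition_)
```

A higher threshold admits fewer but more reliable pseudo-labels; a lower one admits more of them at the risk of reinforcing early mistakes.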

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(train.text.values, y,
                                                      stratify=y, 
                                                      random_state=42, 
                                                      test_size=0.2,
                                                      shuffle=True
                                                      )

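# mask about 5% of the labels within the training data (y_mask is True for ~95% of rows, which stay labeled)
# and create a target variable that uses -1 to denote unlabeled data
# note: no random seed is set, so the mask differs from run to run

y_mask = np.random.rand(len(y_train)) < 0.95

y_train[~y_mask] = -1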


In [None]:
print(y_mask)
print(y_train)
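If the masking needs to be repeatable across runs, one option is a seeded generator that writes into a copy of the labels. This is a sketch under assumptions: the helper name `mask_labels`, the seed and the toy labels are not part of the notebook.

```python
import numpy as np

def mask_labels(y, labeled_fraction, seed=0):
    """Hypothetical helper: keep roughly `labeled_fraction` of the labels, set the rest to -1."""
    rng = np.random.default_rng(seed)
    keep = rng.random(len(y)) < labeled_fraction
    y_masked = np.array(y, copy=True)  # leave the original labels untouched
    y_masked[~keep] = -1
    return y_masked, keep

# made-up labels for illustration
y_demo_labels = np.tile([0, 1, 2], 1000)
y_semi, keep_mask = mask_labels(y_demo_labels, labeled_fraction=0.95)

print('share of labels kept: %0.3f' % keep_mask.mean())
```

Changing `labeled_fraction` makes it easy to rerun the experiment with a different share of labeled data.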

### TF-IDF

In [None]:
# apply tfidf

tfv= TfidfVectorizer(min_df=3,
                     max_features=None,
                     strip_accents='unicode',
                     analyzer='word',
                     token_pattern=r'\w{1,}',
                     ngram_range=(1,3),
                     use_idf=True,
                     smooth_idf=True,
                     sublinear_tf=True,                 
                     stop_words='english'
                    )

tfv.fit(list(X_train)+list(X_valid))    
xtrain_tfv=tfv.transform(X_train)
#xtest_tfv=tfv.transform(Xtest)                                   #---need to check!
xvalid_tfv=tfv.transform(X_valid)

In [None]:
model=MultinomialNB()

# base_estimator: an estimator object implementing fit and predict_proba
# threshold: decision threshold for use with criterion='threshold' (default=0.75), should be in [0, 1)
# criterion: 'threshold' (default) adds pseudo-labels with prediction probabilities above the threshold;
#            'k_best' adds the k_best pseudo-labels with the highest prediction probabilities
# verbose: prints some information after each iteration (default=False)
self_training_model = SelfTrainingClassifier(base_estimator=model,
                                             threshold=0.70,
                                             criterion='threshold',
                                             max_iter=100,
                                             verbose=True
                                            )

self_training_model.fit(xtrain_tfv,y_train)
y_pred = self_training_model.predict_proba(xvalid_tfv)

# Evaluate the model

print('logloss: %0.3f '%multiclass_logloss(y_valid,y_pred))

### CountVectorizer

In [None]:
# apply CountVectorizer

ctv = CountVectorizer(analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3), stop_words = 'english')

# Fitting CountVectorizer on the training and validation texts
ctv.fit(list(X_train) + list(X_valid))
xtrain_ctv =  ctv.transform(X_train) 
xvalid_ctv = ctv.transform(X_valid)

In [None]:
model=MultinomialNB()

# threshold criterion at 0.70 (default threshold is 0.75), at most 100 iterations,
# and verbose output after each iteration
self_training_model = SelfTrainingClassifier(base_estimator=model,
                                             threshold=0.70,
                                             criterion='threshold',
                                             max_iter=100,
                                             verbose=True
                                            )

self_training_model.fit(xtrain_ctv,y_train)
y_pred = self_training_model.predict_proba(xvalid_ctv)

# Evaluate the model

print('logloss: %0.3f '%multiclass_logloss(y_valid,y_pred))



Here we observe a log loss of **0.466**, which is an improvement over the **0.48** of the base model (supervised learning). Note, however, that we masked only 5% of the labels, setting them to -1. When we masked the majority of the data, about 90%, as unlabeled, the performance worsened.