### NB-LogisticRegression
[Source](https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline)

In [1]:
import pandas as pd, numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
encoded_label_dict = {"CG" : 0, "OR" : 1}
def encode_label(x):
    return encoded_label_dict.get(x,-1)

In [3]:
df = pd.read_csv("../../data/classification/data/data.csv")

In [4]:
df["target"] = df["label"].apply(encode_label)

In [5]:
train, test = train_test_split(df, test_size=0.2, shuffle=True, stratify=None, random_state=2021)

In [6]:
train.head()

| Unnamed: 0 | category | rating | label | text_ | target |
|---|---|---|---|---|---|
| 29115 | Books_5 | 5.0 | OR | Nora Roberts never disappoints! Loved the book... | 1 |
| 31611 | Books_5 | 5.0 | OR | This was my first time reading this classic an... | 1 |
| 16922 | Tools_and_Home_Improvement_5 | 5.0 | CG | Bought this for my dad, who uses the tools in ... | 0 |
| 5946 | Sports_and_Outdoors_5 | 5.0 | CG | These are great beanies. I use them as an exer... | 0 |
| 24761 | Kindle_Store_5 | 4.0 | CG | Lots of twists and turns that will make you fe... | 0 |


In [7]:
len(train),len(test)

(32345, 8087)

In [8]:
COMMENT = 'text_'
# Work on explicit copies so the fill below does not raise SettingWithCopyWarning
train, test = train.copy(), test.copy()
train[COMMENT] = train[COMMENT].fillna("unknown")
test[COMMENT] = test[COMMENT].fillna("unknown")


## Building the model

We'll start by creating a *bag of words* representation, as a *term document matrix*. We'll use n-grams (unigrams and bigrams here), as suggested in the NBSVM paper.

In [9]:
import re, string
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()
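
To see what this tokenizer does, here is a minimal sketch that repeats the two definitions above and runs them on a made-up sentence (the sample text is an assumption, not taken from the dataset). Each punctuation character is padded with spaces, so it becomes its own token after `split()`.

```python
import re
import string

# Same pattern as above: capture any single punctuation character
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')

def tokenize(s):
    # Surround each punctuation character with spaces, then split on whitespace
    return re_tok.sub(r' \1 ', s).split()

sample = "Loved it, didn't you?"  # hypothetical review text
tokens = tokenize(sample)
print(tokens)
# ['Loved', 'it', ',', 'didn', "'", 't', 'you', '?']
```

Note that contractions such as "didn't" are split into three tokens, which is why bigrams help recover some of that context.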

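According to the source kernel, TF-IDF gives even better priors than the binarized features used in the paper. Its author notes that this may not have been mentioned in any paper before, and reports that it improved the leaderboard score from 0.59 to 0.55. Those scores belong to the source kernel's competition, not to this dataset.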

In [10]:
n = train.shape[0]
vec = TfidfVectorizer(ngram_range=(1,2), tokenizer=tokenize,
               min_df=3, max_df=0.9, strip_accents='unicode', use_idf=True,
               smooth_idf=True, sublinear_tf=True)
trn_term_doc = vec.fit_transform(train[COMMENT])
test_term_doc = vec.transform(test[COMMENT])

This creates a *sparse matrix* with only a small number of non-zero elements (*stored elements* in the representation below).

In [11]:
trn_term_doc, test_term_doc

(<32345x88640 sparse matrix of type '<class 'numpy.float64'>'
 	with 3307677 stored elements in Compressed Sparse Row format>,
 <8087x88640 sparse matrix of type '<class 'numpy.float64'>'
 	with 807557 stored elements in Compressed Sparse Row format>)
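
To put "a small number" into perspective, the sketch below computes the density of the training matrix from the shape and stored-element count printed above. Only the three numbers copied from that output are used; the variable names are illustrative.

```python
# Figures copied from the training matrix repr above
n_rows, n_cols, nnz = 32345, 88640, 3307677

# Fraction of cells that hold a non-zero value
density = nnz / (n_rows * n_cols)
# Average number of distinct n-gram features per review
avg_per_doc = nnz / n_rows

print(f"density: {density:.4%}")
print(f"average stored elements per document: {avg_per_doc:.1f}")
```

Roughly one cell in a thousand is non-zero, which is why a dense array of the same shape would be impractical here.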

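Here's the basic Naive Bayes feature equation. `pr` reads the global `x`, which is assigned in the next cell, so run that cell before calling `pr`: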

In [12]:
def pr(y_i, y):
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)
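
Written out, the function above computes the following, where $\mathbf{x}_i$ is the TF-IDF row of document $i$ and $y_i$ its label. The symbols are notation introduced here for illustration; the sum and the added one are applied per feature.

$$
\mathrm{pr}(c) = \frac{\mathbf{1} + \sum_{i \,:\, y_i = c} \mathbf{x}_i}{1 + \left|\{\, i : y_i = c \,\}\right|}, \qquad c \in \{0, 1\}
$$

The `get_mdl` function further down then takes the element-wise log ratio of the two classes:

$$
\mathbf{r} = \log \frac{\mathrm{pr}(1)}{\mathrm{pr}(0)}
$$

The added ones act as smoothing, so a feature that never occurs in one class does not produce a division by zero or an infinite log ratio.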

In [13]:
x = trn_term_doc
test_x = test_term_doc

Fit a model for one dependent variable at a time (here there is only one, `target`):

In [14]:
def get_mdl(y):
    y = y.values
    r = np.log(pr(1,y) / pr(0,y))
    m = LogisticRegression(C=4)
    x_nb = x.multiply(r)
    return m.fit(x_nb, y), r
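
As a worked illustration of the log-count ratio `r`, here is a minimal sketch on a hypothetical 4-document, 3-term matrix. It uses a dense NumPy array instead of the sparse TF-IDF matrix, and `pr` takes `x` as an argument rather than reading a global; both are simplifications made for this example.

```python
import numpy as np

# Hypothetical toy data: rows are documents, columns are terms
x = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
y = np.array([1, 1, 0, 0])

def pr(x, y, y_i):
    # Smoothed per-feature sum over the documents of class y_i
    p = x[y == y_i].sum(0)
    return (p + 1) / ((y == y_i).sum() + 1)

# Log-count ratio: positive for terms typical of class 1, negative for class 0
r = np.log(pr(x, y, 1) / pr(x, y, 0))
# Scale every feature by its ratio before fitting the classifier
x_nb = x * r

print(r)
```

The first term only appears in class 1 and gets a positive ratio, the third only appears in class 0 and gets a negative one, and the second is shared equally and gets zero, so it contributes nothing to the scaled features.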

In [15]:
m,r = get_mdl(train["target"])
preds_probas = m.predict_proba(test_x.multiply(r))[:,1]

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
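
The message above is the tail of scikit-learn's `ConvergenceWarning`: the default `lbfgs` solver stopped at its default limit of 100 iterations. One common remedy is to raise `max_iter`. The sketch below shows the idea on hypothetical random data rather than on the review corpus; the value 1000 is an assumption, and refitting the real model this way may shift the metrics reported below slightly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the NB-scaled features
rng = np.random.RandomState(0)
X = rng.rand(200, 20)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Same C as in get_mdl, with a larger iteration budget (assumed value)
m_toy = LogisticRegression(C=4, max_iter=1000)
m_toy.fit(X, y)

# n_iter_ below max_iter means the solver stopped because it converged
print(m_toy.n_iter_)
```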


In [16]:
preds = [1 if prob>=0.5 else 0 for prob in preds_probas]

In [17]:
from sklearn.metrics import confusion_matrix
y_true = test.target.values
y_pred = preds
confusion_matrix(y_true,y_pred)

array([[3743,  267],
       [ 158, 3919]])

In [21]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report
acc = accuracy_score(y_true,y_pred)
precision = precision_score(y_true,y_pred)
recall = recall_score(y_true,y_pred)

In [22]:
print(f"Accuracy: {acc*100}; Precision:{precision*100}; Recall:{recall*100}")

Accuracy: 94.74465191047359; Precision:93.62159579550884; Recall:96.12460142261466
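
As a cross-check, the same three numbers can be derived by hand from the confusion matrix printed earlier. This sketch assumes scikit-learn's layout for binary labels 0 and 1, where rows are true classes and columns are predicted classes, with OR (1) as the positive class.

```python
# Counts copied from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp = 3743, 267
fn, tp = 158, 3919

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of the reviews predicted OR, how many are OR
recall = tp / (tp + fn)      # of the true OR reviews, how many were found

print(f"Accuracy: {accuracy:.4f}; Precision: {precision:.4f}; Recall: {recall:.4f}")
```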


In [23]:
print(classification_report(y_true, y_pred, target_names=["CG","OR"]))

              precision    recall  f1-score   support

          CG       0.96      0.93      0.95      4010
          OR       0.94      0.96      0.95      4077

    accuracy                           0.95      8087
   macro avg       0.95      0.95      0.95      8087
weighted avg       0.95      0.95      0.95      8087



#### Understanding weights

In [None]:
import eli5
eli5.show_weights(estimator=m,
                  # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on >= 1.0
                  feature_names=list(vec.get_feature_names_out()),
                  target_names=["CG","OR"],
                  top=(50, 50))
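
One point to keep in mind when reading these weights: the classifier was fit on features already multiplied by `r`, so the contribution of a raw TF-IDF value to the logit is `coef * r * value`, not `coef * value`. The sketch below illustrates this with made-up numbers; the feature names and both arrays are hypothetical stand-ins for `vec`, `m.coef_[0]`, and the flattened `r` from the notebook.

```python
import numpy as np

feature_names = np.array(["great", "book", "the", "very"])  # hypothetical
coef = np.array([0.5, 1.2, 0.1, -2.0])   # stands in for m.coef_[0]
r = np.array([0.8, -0.3, 0.0, 1.5])      # stands in for np.asarray(r).ravel()

# Weight that each raw TF-IDF feature effectively receives
effective = coef * r
order = np.argsort(effective)

print("strongest toward OR (class 1):", feature_names[order[-1]])
print("strongest toward CG (class 0):", feature_names[order[0]])
```

In this toy case "very" has a negative coefficient but a positive ratio, so its effective weight is negative even though the ratio alone would suggest class 1.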

#### Writing predictions to disk

In [None]:
# Build the frame directly from the aligned prediction arrays (same row order as test)
preds_df = pd.DataFrame({
    "NbLogReg_Model_Probability": preds_probas,
    "NbLogReg_Model_Prediction": preds,
})

In [None]:
preds_df.to_csv("../../data/classification/data/NbLogReg_predictions.csv", index=False)