# Baselines with a fully scikit-learn pipeline: BoW, Tf-idf, Logreg, SVM, NB, etc.

The purpose of this notebook is to present a fully scikit-learn-based pipeline for processing text data, training simple models, and submitting the best result.

We try the following vectorizers:
* Bag-of-Words (BoW)
* Bag-of-Words + term frequency–inverse document frequency (tf-idf)

We try the following models:
* Logistic Regression (Logreg)
* Support Vector Machines (SVM)
* Naive Bayes (NB)

In [None]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
print(os.listdir("../input"))

## Preprocessing

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [None]:
train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')
y_train = train_df['target'].values

In [None]:
print("Train data:", train_df.shape)
print("Test data:", test_df.shape)
train_df.head(5)

In [None]:
train_df['target'].value_counts()

The dataset is very imbalanced. This is something to revisit if your submission does not reach the desired score. There is no way to know in advance whether rebalancing the training set is worthwhile, since the test set may be just as imbalanced.
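
One inexpensive way to probe whether the imbalance matters is to compare cross-validated F1 with and without class re-weighting. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in (an assumption, not the competition data) and the `class_weight='balanced'` option of `LogisticRegression`. Whether re-weighting helps on the real data can only be checked empirically.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for an imbalanced problem (roughly 5% positives).
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=0
)

f1_by_weighting = {}
for class_weight in (None, "balanced"):
    clf = LogisticRegression(
        solver="lbfgs", class_weight=class_weight, max_iter=1000
    )
    # Mean F1 of the positive class over 4 stratified folds.
    scores = cross_val_score(clf, X_toy, y_toy, scoring="f1", cv=4)
    f1_by_weighting[str(class_weight)] = scores.mean()

print(f1_by_weighting)
```

The same `class_weight` argument is also accepted by `LinearSVC`, so the comparison could be repeated for the SVM baselines.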

### Bag of Words (BoW)

Here is the description of BoW from the [feature extraction user guide](https://scikit-learn.org/stable/modules/feature_extraction.html) in the scikit-learn documentation:
> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length.
>
> ...
>
> A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
>
> We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.

We limit the vocabulary to the 50,000 most frequent words for faster training. Try increasing `max_features` if you feel this is too small.
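
To make the effect of `max_features` concrete, here is a small sketch on a made-up three-sentence corpus (the corpus and the `toy_` names are illustrative, not part of the competition data). With `max_features=3`, only the three terms with the highest total count are kept as columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Keep only the 3 terms with the highest total count across the corpus.
toy_vec = CountVectorizer(max_features=3)
toy_counts = toy_vec.fit_transform(toy_corpus)

# Columns follow the alphabetical order of the kept terms.
print(sorted(toy_vec.vocabulary_))
# One row per document, one column per kept term.
print(toy_counts.toarray())
```

The third document contains none of the kept terms, so its row is all zeros. This is the information that is lost when the vocabulary is truncated too aggressively.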

In [None]:
count_vec = CountVectorizer(max_features=50000)
train_bow = count_vec.fit_transform(train_df['question_text'])
test_bow = count_vec.transform(test_df['question_text'])

print(train_bow.shape)
print(test_bow.shape)

### Tf-idf

[Tf-idf documentation](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting):
> In a large text corpus, some words will be very present (e.g. “the”, “a”, “is” in English) hence carrying very little meaningful information about the actual contents of the document. If we were to feed the direct count data directly to a classifier those very frequent terms would shadow the frequencies of rarer yet more interesting terms.
>
> In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common to use the tf–idf transform.
>
> Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency: 
>
> $\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t)$
>
> Using the `TfidfTransformer`’s default settings, `TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)` the term frequency, the number of times a term occurs in a given document, is multiplied with idf component, which is computed as
>
> $\text{idf}(t) = \log \frac{1 + n_d}{1 + \text{df}(d,t)} + 1$
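
As a quick check, the smoothed idf that `TfidfTransformer` uses by default, $\log \frac{1 + n_d}{1 + \text{df}} + 1$, can be recomputed by hand and compared with the fitted `idf_` attribute. The three-document corpus below is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

toy_corpus = ["apple banana", "apple cherry", "apple banana banana"]

toy_counts = CountVectorizer().fit_transform(toy_corpus)
toy_tfidf = TfidfTransformer()  # defaults: norm='l2', use_idf=True, smooth_idf=True
toy_tfidf.fit(toy_counts)

# n_d: number of documents; df: number of documents containing each term.
n_d = toy_counts.shape[0]
df = (toy_counts.toarray() > 0).sum(axis=0)

# Smoothed idf: log((1 + n_d) / (1 + df)) + 1
manual_idf = np.log((1 + n_d) / (1 + df)) + 1

print(manual_idf)
print(toy_tfidf.idf_)
```

A term that appears in every document ("apple" here) gets an idf of exactly 1, the minimum possible value, while rarer terms get larger weights.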

In [None]:
tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_bow)
test_tfidf = tfidf.transform(test_bow)

print(train_tfidf.shape)
print(test_tfidf.shape)
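
The two preprocessing steps can also be chained with a classifier in a single scikit-learn `Pipeline`. This is a sketch of an alternative layout, not what the rest of this notebook does: the toy texts, labels, and step names below are made up for illustration. One practical difference is that, under cross-validation, the vocabulary and idf weights are then fitted on the training folds only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Tiny made-up dataset: 0 = ordinary question, 1 = spam-like text.
toy_texts = [
    "how do I learn python quickly",
    "what is the best way to cook rice",
    "how can I improve my writing",
    "what is machine learning used for",
    "win a free prize now",
    "free prize click now to win",
    "click now for a free gift",
    "win free money now click here",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

text_clf = Pipeline([
    ("bow", CountVectorizer(max_features=50000)),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(solver="lbfgs")),
])

# Each fold refits the whole pipeline, vectorizer included, on its training part.
toy_scores = cross_val_score(text_clf, toy_texts, toy_labels, scoring="f1", cv=2)
print(toy_scores)

# The fitted pipeline accepts raw strings directly at prediction time.
text_clf.fit(toy_texts, toy_labels)
toy_pred = text_clf.predict(["claim your free prize now"])
print(toy_pred)
```

Dropping the `"tfidf"` step gives the plain BoW variant, so both feature sets can share the same code path.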

## Model Training and Evaluation

In [None]:
# Models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn import naive_bayes
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score, GridSearchCV

In [None]:
# Create a scorer object we will use to evaluate the methods.
f1_scorer = make_scorer(f1_score)
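
For reference, F1 is the harmonic mean of precision and recall on the positive class, which makes it much less forgiving than accuracy on imbalanced data. The labels below are made up to show the gap.

```python
from sklearn.metrics import accuracy_score, f1_score

# 10 toy labels with 2 positives: one true positive, one false positive,
# one false negative.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

toy_accuracy = accuracy_score(y_true, y_pred)  # 8 of 10 correct
toy_f1 = f1_score(y_true, y_pred)              # precision = recall = 0.5

print("accuracy:", toy_accuracy)
print("F1:", toy_f1)
```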

### Logistic Regression on BoW

In [None]:
logreg_bow = LogisticRegression(solver='lbfgs')
logreg_bow_score = cross_val_score(
    estimator=logreg_bow,
    X=train_bow, 
    y=y_train,
    verbose=2,
    scoring=f1_scorer,
    cv=4, # Since kaggle CPUs have 4 cores
    n_jobs=-1
)
print(logreg_bow_score)

### Logistic Regression on Tfidf

In [None]:
logreg_tfidf = LogisticRegression(solver='saga')
logreg_tfidf_score = cross_val_score(
    estimator=logreg_tfidf,
    X=train_tfidf, 
    y=y_train,
    scoring=f1_scorer,
    verbose=2,
    cv=4, # Since kaggle CPUs have 4 cores
    n_jobs=-1
)
print(logreg_tfidf_score)

### SVM on BoW

In [None]:
# Use dual=False since n_samples > n_features
svm_bow = LinearSVC(dual=False)
svm_bow_score = cross_val_score(
    estimator=svm_bow,
    X=train_bow, 
    y=y_train,
    verbose=2,
    scoring=f1_scorer,
    cv=4, # Since kaggle CPUs have 4 cores
    n_jobs=-1
)
print(svm_bow_score)

### SVM on Tfidf

In [None]:
svm_tfidf = LinearSVC(dual=False)
svm_tfidf_score = cross_val_score(
    estimator=svm_tfidf,
    X=train_tfidf, 
    y=y_train,
    scoring=f1_scorer,
    verbose=2,
    cv=4, # Since kaggle CPUs have 4 cores
    n_jobs=-1
)
print(svm_tfidf_score)

### Multinomial Naive Bayes on BoW

[Documentation](https://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes):
> MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice).

In [None]:
# MultinomialNB expects non-negative features such as word counts
mnb_bow = naive_bayes.MultinomialNB()
mnb_bow_score = cross_val_score(
    estimator=mnb_bow,
    X=train_bow, 
    y=y_train,
    verbose=2,
    scoring=f1_scorer,
    cv=4, # Since kaggle CPUs have 4 cores
    n_jobs=-1
)
print(mnb_bow_score)

### Multinomial Naive Bayes on Tfidf

In [None]:
mnb_tfidf = naive_bayes.MultinomialNB()
mnb_tfidf_score = cross_val_score(
    estimator=mnb_tfidf,
    X=train_tfidf, 
    y=y_train,
    scoring=f1_scorer,
    verbose=2,
    cv=4, # Since kaggle CPUs have 4 cores
    n_jobs=-1
)
print(mnb_tfidf_score)

### Complement Naive Bayes on BoW

In [None]:
cnb_bow = naive_bayes.ComplementNB()
cnb_bow_score = cross_val_score(
    estimator=cnb_bow,
    X=train_bow, 
    y=y_train,
    verbose=2,
    scoring=f1_scorer,
    cv=4, # Since kaggle CPUs have 4 cores
    n_jobs=-1
)
print(cnb_bow_score)

### Complement Naive Bayes on Tfidf

In [None]:
cnb_tfidf = naive_bayes.ComplementNB()
cnb_tfidf_score = cross_val_score(
    estimator=cnb_tfidf,
    X=train_tfidf, 
    y=y_train,
    scoring=f1_scorer,
    verbose=2,
    cv=4, # Since kaggle CPUs have 4 cores
    n_jobs=-1
)
print(cnb_tfidf_score)

## Evaluating the models

In [None]:
model_scores = dict(
    logreg_bow=logreg_bow_score,
    svm_bow=svm_bow_score,
    mnb_bow=mnb_bow_score,
    cnb_bow=cnb_bow_score,
    logreg_tfidf=logreg_tfidf_score,
    svm_tfidf=svm_tfidf_score,
    mnb_tfidf=mnb_tfidf_score,
    cnb_tfidf=cnb_tfidf_score,
)

In [None]:
model_scores_df = pd.DataFrame(model_scores)
model_scores_df

In [None]:
best_model_name = model_scores_df.mean().idxmax()
print("Best Model:", best_model_name)
model_scores_df.mean().plot(kind='bar', rot=45)

In [None]:
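# Look up the estimator by name rather than calling eval() on a string.
models = dict(
    logreg_bow=logreg_bow, svm_bow=svm_bow, mnb_bow=mnb_bow, cnb_bow=cnb_bow,
    logreg_tfidf=logreg_tfidf, svm_tfidf=svm_tfidf,
    mnb_tfidf=mnb_tfidf, cnb_tfidf=cnb_tfidf,
)
model = models[best_model_name]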

if 'bow' in best_model_name:
    X_train = train_bow
    X_test = test_bow
else:
    X_train = train_tfidf
    X_test = test_tfidf

model.fit(X_train, y_train)

## Submission

In [None]:
submission_df = pd.read_csv('../input/sample_submission.csv')
submission_df.head()

In [None]:
submission_df['prediction'] = model.predict(X_test)
submission_df.to_csv('submission.csv', index=False)