# MBTI Classifier Training
To perform classification with a number of different algorithms, extract text features for Bag of Words analysis, use those features to train a classifier, then evaluate its performance on a test set. In this notebook I will classify the cleaned `mbti_1.csv` dataset.

In [1]:
import pandas as pd
import numpy as np


In [2]:
#read in data
df = pd.read_csv('mbti_cleaned_unsplit1.csv', encoding="ISO-8859-1")
df = df.drop('Unnamed: 0', axis=1)

#select only entries with no null values
df = df[pd.notnull(df['clean_posts'])]
df = df[pd.notnull(df['posts'])]
df.head()


Unnamed: 0,type,posts,clean_posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,enfp intj moments sportscenter top ten plays p...
1,ENTP,'I'm finding the lack of me in these posts ver...,im finding lack posts alarming sex boring posi...
2,INTP,'Good one _____ https://www.youtube.com/wat...,good one course say know blessing curse absolu...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...",dear intp enjoyed conversation day esoteric ga...
4,ENTJ,'You're fired.|||That's another silly misconce...,youre fired another silly misconception approa...


## Feature Extraction
Make sure to keep both `CountVectorizer` and `TfidfTransformer` in global variables: when the time comes to extract features from the test set, we will need the same fitted instances in order to map the test data into the feature space learned from the training data.

In [27]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

#define the text and labels to be used as corpus, labels
corpus = np.array(df.clean_posts)
labels = np.array(df.type)
#declare count_vect and tfidf_transformer
count_vect = CountVectorizer(binary=False, ngram_range=(1,1))
tfidf_transformer = TfidfTransformer()
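
To make the two stages concrete, here is a minimal sketch on a hypothetical three-document corpus (the `toy_` names are assumptions for illustration and are not part of the notebook's data). It shows that `CountVectorizer` produces raw token counts and that `TfidfTransformer` re-weights them without changing the shape of the matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus standing in for df.clean_posts
toy_corpus = [
    "enfp intj moments",
    "finding lack posts alarming",
    "intj posts posts",
]

# Stage 1: bag-of-words counts (one column per distinct token)
toy_vect = CountVectorizer(binary=False, ngram_range=(1, 1))
toy_counts = toy_vect.fit_transform(toy_corpus)

# Stage 2: tf-idf re-weighting of the same matrix
toy_tfidf = TfidfTransformer().fit_transform(toy_counts)

# With the default norm='l2', each non-empty row has unit Euclidean length
row_norms = np.asarray(np.sqrt(toy_tfidf.multiply(toy_tfidf).sum(axis=1))).ravel()

print(toy_counts.shape, toy_tfidf.shape)
print(row_norms)
```

Because the rows are normalized, long and short posts end up on a comparable scale, which is usually helpful for linear classifiers.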

I created functions to perform each step of feature extraction, tf-idf transformation, and classifier testing so that any problems can be debugged step by step. The entire process can also be performed by creating a training pipeline using `Pipeline` from `sklearn.pipeline`.

In [28]:
def perpare_data_set(corpus, labels, test_size=0.3):
    "splits data into training corpus, test corpus, training labels, test labels"
    train_c, test_c, train_l, test_l  = train_test_split(corpus, labels, test_size=test_size, random_state=42)
    return train_c, test_c, train_l, test_l

def cv_feature_extract(train_c):
    "tokenizes text data"
    return count_vect.fit_transform(train_c)

def fit_transform(X_train_counts):
    "re-weights tokenized data"
    return tfidf_transformer.fit_transform(X_train_counts)

def test_model(test_corpus, test_labels, fit_model):
    "tests the predictions of a classifier against test data"
    extracted = count_vect.transform(test_corpus)
    transformed = tfidf_transformer.transform(extracted)
    predicted = fit_model.predict(transformed)
    return np.mean(predicted == test_labels)
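
The value returned by `test_model` is plain accuracy: the fraction of predictions that exactly match the true label. As a small check, assuming hypothetical labels made up for illustration, the same number can be obtained from `sklearn.metrics.accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, for illustration only
true_labels = np.array(["INFJ", "ENTP", "INTP", "INTJ"])
pred_labels = np.array(["INFJ", "ENTP", "INTJ", "INTJ"])

# Element-wise comparison gives booleans; their mean is the accuracy
manual_accuracy = np.mean(pred_labels == true_labels)
library_accuracy = accuracy_score(true_labels, pred_labels)

print(manual_accuracy, library_accuracy)
```

Three of the four hypothetical predictions match, so both expressions give 0.75.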

Apply the above functions to the data. Pause after `fit_transform` to check how many features have been extracted.

In [29]:
#split data into train and test
train_corpus, test_corpus, train_labels, test_labels = perpare_data_set(corpus, labels, test_size=0.3)

#extract features
X_train_counts = cv_feature_extract(train_corpus)
X_train_tfidf = fit_transform(X_train_counts)

#examine features
X_train_tfidf.shape


(6071, 115959)

Notice that 115959 features have been extracted from 6071 data entries. For clarity's sake, check the shape of the test data to make sure it has the same number of features when extracted.

In [20]:
cv_test_features = count_vect.transform(test_corpus)
cv_test_features.shape

(2603, 115959)

Excellent. If the feature counts did not match, it would be because the train and test data were processed with different instances of `CountVectorizer()` and `TfidfTransformer()`; such features would not be compatible when it comes time to test the classifier's predictions.
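
The mismatch is easy to reproduce. The following sketch uses a hypothetical toy train/test split (the `toy_` names are assumptions) and compares a vectorizer fitted on the training text with a fresh one fitted on the test text:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data, for illustration only
toy_train = ["intj posts alarming", "enfp moments posts"]
toy_test = ["intj moments unseen words"]

# Correct: fit once on the training text, then reuse the same instance
fitted_vect = CountVectorizer().fit(toy_train)
train_features = fitted_vect.transform(toy_train)
test_features = fitted_vect.transform(toy_test)  # tokens unseen in training are dropped

# Incorrect: a fresh instance learns a different vocabulary from the test text
fresh_vect = CountVectorizer()
mismatched = fresh_vect.fit_transform(toy_test)

print(train_features.shape, test_features.shape, mismatched.shape)
```

The first two shapes share the same number of columns, while the third does not, so a classifier trained on `train_features` could not consume `mismatched`.
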
## Classifier Training
The following is the training of classifiers using the algorithms `MultinomialNB`, `LogisticRegression`, and `SGDClassifier`.
### `MultinomialNB`


In [34]:
from sklearn.naive_bayes import MultinomialNB

#train classifier mnb
mnb = MultinomialNB().fit(X_train_counts, train_labels)

In [35]:
#test classifier against test data.
test_model(test_corpus, test_labels, mnb)

0.40376488666922783

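`MultinomialNB` was only successful at predicting the label of test data 40% of the time. Note that `mnb` was fit on the raw counts (`X_train_counts`), while `test_model` evaluates it on tf-idf features, so this figure may not reflect what the model would achieve with matching features.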
### `LogisticRegression`

In [32]:
from sklearn.linear_model import LogisticRegression

#train classifier lr
lr = LogisticRegression(penalty='l2', max_iter=100, C=1)
lr.fit(X_train_tfidf, train_labels)

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [33]:
#test classifier against test data.
test_model(test_corpus, test_labels, lr)

0.63388398002305035

`LogisticRegression` was successful at predicting the label of test data 63% of the time. Quite an improvement.
### `SGDClassifier`

In [40]:
from sklearn.linear_model import SGDClassifier

#train classifier sgdc
sgdc = SGDClassifier()
sgdc.fit(X_train_tfidf, train_labels)

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)

In [41]:
#test classifier against test data.
test_model(test_corpus, test_labels, sgdc)

0.67460622358816746

`SGDClassifier` was successful at predicting the label of test data 67% of the time. This is unsurprising: `SGDClassifier` is widely considered to be one of the more useful algorithms for text classification.
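
The three train-and-test steps above follow the same pattern, so they can be collapsed into a loop. This is only a sketch on a hypothetical two-class toy corpus (the texts, the `"I"`/`"E"` labels, and all variable names are assumptions); the scores it prints say nothing about the MBTI results reported above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, for illustration only
toy_train_text = ["quiet reading books alone", "books library quiet evening",
                  "party friends loud music", "music dancing friends party"]
toy_train_labels = np.array(["I", "I", "E", "E"])
toy_test_text = ["quiet books", "loud party"]
toy_test_labels = np.array(["I", "E"])

# Fit the feature extractors on the training text only, then reuse them
vect = CountVectorizer()
tfidf = TfidfTransformer()
X_train = tfidf.fit_transform(vect.fit_transform(toy_train_text))
X_test = tfidf.transform(vect.transform(toy_test_text))

candidates = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(),
    "SGDClassifier": SGDClassifier(random_state=42),
}

# Train every candidate on the same features and record its test accuracy
scores = {}
for name, clf in candidates.items():
    clf.fit(X_train, toy_train_labels)
    scores[name] = np.mean(clf.predict(X_test) == toy_test_labels)

print(scores)
```

Training every candidate on the same feature matrix keeps the comparison between algorithms like for like.
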
## Utilizing a Pipeline
Here is some sample code for building a pipeline for model training.

In [None]:
from sklearn.pipeline import Pipeline

#declare classifier text_clf
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

Below is an example of performing feature extraction then training classifier `text_clf` in one line of code.

In [None]:
#train classifier text_clf
text_clf.fit(train_corpus, train_labels)  
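
Once fitted, a pipeline accepts raw text directly for prediction and scoring, which replaces the manual `transform` calls in `test_model`. Here is a minimal, self-contained sketch, again on hypothetical toy data (the `toy_` names and `"I"`/`"E"` labels are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, for illustration only
toy_train_text = ["quiet reading books alone", "books library quiet evening",
                  "party friends loud music", "music dancing friends party"]
toy_train_labels = ["I", "I", "E", "E"]
toy_test_text = ["quiet books", "loud party"]
toy_test_labels = ["I", "E"]

toy_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# fit() runs fit_transform on each feature step, then fits the classifier
toy_clf.fit(toy_train_text, toy_train_labels)

# predict() and score() apply the already-fitted steps to raw text
toy_predictions = toy_clf.predict(toy_test_text)
toy_accuracy = toy_clf.score(toy_test_text, toy_test_labels)

print(toy_predictions, toy_accuracy)
```

Because the vectorizer and transformer live inside the pipeline, there is no risk of accidentally evaluating with a different instance than the one used for training.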