# MBTI Classifier Training
To perform classification with a number of different algorithms, extract text features for Bag of Words analysis, use those features to train a classifier, then evaluate its performance on a test set. In this notebook I will classify the cleaned `mbti_cleaned_unsplit2.csv` dataset.

In [1]:
import pandas as pd
import numpy as np


In [2]:
#read in data
df = pd.read_csv('mbti_cleaned_unsplit2.csv', encoding="ISO-8859-1")
df = df.drop('Unnamed: 0', axis=1)

#select only entries with no null values
df = df[pd.notnull(df['clean_posts'])]
df = df[pd.notnull(df['posts'])]
df.head()


,type,posts,clean_posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,enfp intj moments sportscenter top ten plays p...
1,ENTP,'I'm finding the lack of me in these posts ver...,im finding lack posts alarming sex boring posi...
2,INTP,'Good one _____ https://www.youtube.com/wat...,good one course say know blessing curse absolu...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...",dear intp enjoyed conversation day esoteric ga...
4,ENTJ,'You're fired.|||That's another silly misconce...,youre fired another silly misconception approa...


## Feature Extraction for Bag of Words
After cleaning, preparing text for classification takes two steps. First, create a matrix of token counts (the bag of words) with `CountVectorizer`. Then normalize those counts with `TfidfTransformer`. Make sure to save both `CountVectorizer` and `TfidfTransformer` in global variables: when the time comes to extract features from the test set, the same fitted objects must be reused so that the test data has the same number of features as the training data. Otherwise the trained model won't be compatible with the features of the test data.
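
As a minimal sketch of what these two steps produce, the toy example below runs them on a three-sentence corpus. The corpus and the `toy_` names are made up for illustration and are not part of the MBTI data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# toy corpus, invented for illustration only
toy_corpus = ["cats chase mice", "dogs chase cats", "mice eat cheese"]

# step 1: token counts, one column per distinct token in the corpus
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_corpus)
print(sorted(toy_vect.vocabulary_))
print(toy_counts.toarray())

# step 2: re-weight the counts with tf-idf
toy_tfidf = TfidfTransformer().fit_transform(toy_counts)
print(np.round(toy_tfidf.toarray(), 2))

# with the default norm='l2', every row is scaled to unit length
row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)
print(row_norms)
```

Tokens that appear in fewer documents (such as `dogs`) receive more weight than tokens shared across documents (such as `chase`).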

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

#define the text and labels to be used as corpus, labels
corpus = np.array(df.clean_posts)
labels = np.array(df.type)
#declare count_vect and tfidf_transformer
count_vect = CountVectorizer(binary=False, ngram_range=(1,1))
tfidf_transformer = TfidfTransformer()

I created functions to perform each step of feature extraction, tf-idf transformation, and classifier testing so as to be able to debug any problems step by step. This entire process can also be performed by creating a training pipeline utilizing `Pipeline` from `sklearn.pipeline`.

In [4]:
def perpare_data_set(corpus, labels, test_size=0.3):
    "splits data into training corpus, test corpus, training labels, test labels"
    train_c, test_c, train_l, test_l  = train_test_split(corpus, labels, test_size=test_size, random_state=42)
    return train_c, test_c, train_l, test_l

def cv_feature_extract(train_c):
    "tokenizes text data"
    return count_vect.fit_transform(train_c)

def fit_transform(X_train_counts):
    "re-weights tokenized data"
    return tfidf_transformer.fit_transform(X_train_counts)

def test_model(test_corpus, test_labels, fit_model):
    "tests the predictions of a classifier against test data"
    extracted = count_vect.transform(test_corpus)
    transformed = tfidf_transformer.transform(extracted)
    predicted = fit_model.predict(transformed)
    return np.mean(predicted == test_labels)
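
The score returned by `test_model` is plain accuracy: the fraction of predictions that equal the true label. As a small check on made-up labels (the `toy_` arrays are illustrative assumptions), it should agree with scikit-learn's `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# made-up labels for illustration
toy_true = np.array(["INFJ", "ENTP", "INTP", "INFJ"])
toy_pred = np.array(["INFJ", "INTP", "INTP", "INFP"])

# element-wise comparison gives booleans; their mean is the fraction correct
manual_accuracy = np.mean(toy_pred == toy_true)
sklearn_accuracy = accuracy_score(toy_true, toy_pred)
print(manual_accuracy, sklearn_accuracy)
```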

Apply the above functions to the data. Pause after ```fit_transform``` to check how many features have been extracted. 

In [5]:
#split data into train and test
train_corpus, test_corpus, train_labels, test_labels = perpare_data_set(corpus, labels, test_size=0.3)

#extract features
X_train_counts = cv_feature_extract(train_corpus)
X_train_tfidf = fit_transform(X_train_counts)

#examine features
X_train_tfidf.shape


(6071, 116161)

Notice that 116161 features have been extracted from 6071 data entries. For clarity's sake, check the shape of the test data to make sure it has the same number of features when extracted.

In [6]:
#examine features
cv_test_features = count_vect.transform(test_corpus)
cv_test_features.shape

(2603, 116161)

Excellent. If the shapes do not match, it is because the training and test data were processed with different instances of `CountVectorizer()` and `TfidfTransformer()`, and the features will not be compatible when it comes time to test the classifier's predictions.

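
To illustrate the mismatch, here is a hedged sketch on toy text (the corpora and variable names are invented for this example). Fitting a second vectorizer on the test text builds a different vocabulary, so the columns no longer line up with the training features.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy corpora, invented for illustration only
toy_train = ["cats chase mice", "dogs chase cats", "mice eat cheese"]
toy_test = ["dogs eat cheese", "birds sing"]

shared_vect = CountVectorizer().fit(toy_train)
# the mistake: a second instance fit on the test text
separate_vect = CountVectorizer().fit(toy_test)

n_train_features = shared_vect.transform(toy_train).shape[1]
n_test_shared = shared_vect.transform(toy_test).shape[1]
n_test_separate = separate_vect.transform(toy_test).shape[1]
print(n_train_features, n_test_shared, n_test_separate)
```

Even when the two counts happen to be equal, the columns would refer to different words, so reusing the fitted instance is the safer habit.
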
## Classifier Training
The following is the training of classifiers using the algorithms `MultinomialNB`, `LogisticRegression`, and `SGDClassifier`.
### `MultinomialNB`


In [7]:
from sklearn.naive_bayes import MultinomialNB

#train classifier mnb
mnb = MultinomialNB().fit(X_train_counts, train_labels)

In [8]:
#test classifier against test data.
test_model(test_corpus, test_labels, mnb)

0.40261237034191316

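`MultinomialNB` correctly predicted the label of only about 40% of the test data. Note that this classifier was trained on the raw counts (`X_train_counts`), while `test_model` scores it on tf-idf features, which may contribute to the low score.
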
### `LogisticRegression`

In [9]:
from sklearn.linear_model import LogisticRegression

#train classifier lr
lr = LogisticRegression(penalty='l2', max_iter=100, C=1)
lr.fit(X_train_tfidf, train_labels)

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [10]:
#test classifier against test data.
test_model(test_corpus, test_labels, lr)

0.63349980791394545

`LogisticRegression` correctly predicted the label of test data 63% of the time. Quite an improvement.

### `SGDClassifier`

In [11]:
from sklearn.linear_model import SGDClassifier

#train classifier sgdc
sgdc = SGDClassifier()
sgdc.fit(X_train_tfidf, train_labels)



SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=5, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False)

In [12]:
#test classifier against test data.
test_model(test_corpus, test_labels, sgdc)

0.67345370726085285

`SGDClassifier` correctly predicted the label of test data 67% of the time. This is unsurprising: `SGDClassifier` is widely considered to be one of the more useful algorithms for text classification.

## Parameter Tuning and Cross Validation
Having tested a few different classification methods, it's time to select the most successful one and perform cross validation using `GridSearchCV` to select the best parameters. This step can be computationally expensive and time consuming, so test parameters incrementally.

In [13]:
from sklearn.model_selection import GridSearchCV

In [14]:
#specify a parameter grid to search over
parameters_2 = {
    'penalty': ['l1'],
    'l1_ratio': [0.15, 0.3, 0.5],
    'max_iter': [10, 100, 200],
    'loss': ['hinge', 'log'],
    'n_jobs': [-1]
}

sgdc_cv2 = GridSearchCV(sgdc, parameters_2, cv=5) #specify GridSearchCV object

sgdc_cv2.fit(X_train_tfidf, train_labels) #fit to training data

GridSearchCV(cv=5, error_score='raise',
       estimator=SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=5, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1'], 'l1_ratio': [0.15, 0.3, 0.5], 'max_iter': [10, 100, 200], 'loss': ['hinge', 'log'], 'n_jobs': [-1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

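We can check which parameters `GridSearchCV` selected. This is useful if you want to go back and test other parameters without rechecking certain parameters unnecessarily. The output below shows that it selected `l1_ratio=0.5`, `loss='hinge'`, `max_iter=10`, and `penalty='l1'`. Note that in scikit-learn `l1_ratio` is only used when `penalty='elasticnet'`, so with `penalty='l1'` the three `l1_ratio` values should not change the fitted model.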

In [15]:
sgdc_cv2.best_params_

{'l1_ratio': 0.5,
 'loss': 'hinge',
 'max_iter': 10,
 'n_jobs': -1,
 'penalty': 'l1'}
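
Beyond `best_params_`, the fitted search object keeps the score of every combination in `cv_results_`, which helps when deciding what to test next. The sketch below is an illustration on synthetic stand-in data, not on the MBTI features; the grid, the `toy_` names, and the choice of `penalty='elasticnet'` (so that `l1_ratio` has an effect) are assumptions made for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in data (hypothetical, not the MBTI features)
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

# l1_ratio only has an effect when penalty='elasticnet'
toy_grid = {
    'penalty': ['elasticnet'],
    'l1_ratio': [0.15, 0.5],
    'alpha': [0.0001, 0.001],
}

toy_search = GridSearchCV(
    SGDClassifier(max_iter=1000, tol=1e-3, random_state=0), toy_grid, cv=3)
toy_search.fit(X_toy, y_toy)

# one row per parameter combination, best-ranked first
results = pd.DataFrame(toy_search.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score'))
print(toy_search.best_params_)
```

Fixing `random_state` makes repeated searches comparable, which matters when the differences between combinations are small.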

Score the cross-validated model on the held-out test data.

In [16]:
X_test_extracted = count_vect.transform(test_corpus)
X_test_transformed = tfidf_transformer.transform(X_test_extracted)

sgdc_cv2.score(X_test_transformed, test_labels)

0.67883211678832112

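An improvement of about 0.5 percentage points in accuracy (0.6735 to 0.6788) is nothing to sneeze at, especially if this model were taken to scale. Since no `random_state` was set for `SGDClassifier`, part of that difference may be run-to-run variation.
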
## Model Evaluation
Below I've borrowed a few metrics for evaluating the model performance in more detail from Dipanjan Sarkar's repository for his book Text Analytics with Python, available at https://github.com/dipanjanS/text-analytics-with-python, which includes a useful homegrown library for model performance evaluation.

In [17]:
from sklearn import metrics

def display_confusion_matrix(true_labels, predicted_labels, classes=(1, 0)):
    'takes predicted classifications and compares them to true labels in a matrix'
    classes = list(classes)

    cm = metrics.confusion_matrix(y_true=true_labels, y_pred=predicted_labels,
                                  labels=classes)
    #build the two-level headers with from_product; the MultiIndex constructor's
    #`labels` argument was renamed to `codes` in later pandas releases
    cm_frame = pd.DataFrame(data=cm,
                            columns=pd.MultiIndex.from_product([['Predicted:'], classes]),
                            index=pd.MultiIndex.from_product([['Actual:'], classes]))
    return cm_frame # I've adjusted this function to return a dataframe rather than print it


In [18]:
#use the trained model to predict processed test data
y_pred = sgdc_cv2.predict(X_test_transformed)
labels = list(np.unique(df.type)) #make a list of unique target variables

#display confusion matrix
cm_df = display_confusion_matrix(true_labels=test_labels, predicted_labels=y_pred, classes=labels)
cm_df

,,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:
,,ENFJ,ENFP,ENTJ,ENTP,ESFJ,ESFP,ESTJ,ESTP,INFJ,INFP,INTJ,INTP,ISFJ,ISFP,ISTJ,ISTP
Actual:,ENFJ,13,3,0,0,0,0,1,0,9,13,4,8,0,1,1,2
Actual:,ENFP,1,125,0,9,0,0,0,1,17,28,16,7,0,2,0,0
Actual:,ENTJ,1,6,31,6,1,0,1,0,4,4,5,5,0,2,0,2
Actual:,ENTP,0,15,0,119,0,0,3,0,13,7,17,15,0,2,1,1
Actual:,ESFJ,0,0,0,2,1,0,0,0,1,1,0,3,0,1,1,0
Actual:,ESFP,0,1,0,0,0,0,0,0,3,2,1,0,0,1,0,1
Actual:,ESTJ,0,3,1,1,0,0,0,0,3,4,0,1,0,1,1,0
Actual:,ESTP,0,0,0,1,0,0,0,6,6,1,7,2,0,0,0,2
Actual:,INFJ,1,9,1,18,0,0,1,0,304,58,19,10,0,2,1,4
Actual:,INFP,1,15,0,9,0,0,8,1,24,476,20,17,0,8,2,5


In [19]:
#print precision and recall scores for model
print(metrics.classification_report(test_labels, y_pred, labels=labels))

             precision    recall  f1-score   support

       ENFJ       0.76      0.24      0.36        55
       ENFP       0.65      0.61      0.63       206
       ENTJ       0.82      0.46      0.58        68
       ENTP       0.63      0.62      0.62       193
       ESFJ       0.50      0.10      0.17        10
       ESFP       0.00      0.00      0.00         9
       ESTJ       0.00      0.00      0.00        15
       ESTP       0.67      0.24      0.35        25
       INFJ       0.68      0.71      0.69       428
       INFP       0.70      0.81      0.75       586
       INTJ       0.63      0.71      0.67       320
       INTP       0.75      0.77      0.76       403
       ISFJ       0.83      0.56      0.67        54
       ISFP       0.54      0.44      0.49        70
       ISTJ       0.71      0.38      0.50        65
       ISTP       0.74      0.70      0.72        96

avg / total       0.68      0.68      0.67      2603



We can see that class imbalance (representation bias) seems to have a large effect on the recall of the model. Some categories could not be classified successfully because they had so few examples: ESFP and ESTJ, with only 9 and 15 test entries, both have a recall of 0.00.
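
A toy illustration of how overall accuracy can hide this effect, using made-up labels rather than the MBTI results: a classifier that always predicts the common class still scores well on accuracy, while per-class and macro-averaged recall expose the class it never finds.

```python
import numpy as np
from sklearn import metrics

# made-up labels: 8 examples of a common class, 2 of a rare one
toy_true = np.array(['INFP'] * 8 + ['ESFP'] * 2)
# a classifier that always predicts the common class
toy_pred = np.array(['INFP'] * 10)

# how many examples each class has
classes, counts = np.unique(toy_true, return_counts=True)
print(dict(zip(classes, counts)))

accuracy = metrics.accuracy_score(toy_true, toy_pred)
# recall for each class separately, in the order given by `classes`
per_class_recall = metrics.recall_score(toy_true, toy_pred, average=None, labels=classes)
# unweighted mean of the per-class recalls
macro_recall = metrics.recall_score(toy_true, toy_pred, average='macro')
print(accuracy, per_class_recall, macro_recall)
```

Reporting a macro-averaged score alongside accuracy is one way to keep the rare types visible when comparing models.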

## Utilizing a Pipeline
Here is some sample code for building a pipeline for model training.

In [20]:
df.type.unique()

array(['INFJ', 'ENTP', 'INTP', 'INTJ', 'ENTJ', 'ENFJ', 'INFP', 'ENFP',
       'ISFP', 'ISTP', 'ISFJ', 'ISTJ', 'ESTP', 'ESFP', 'ESTJ', 'ESFJ'], dtype=object)

In [21]:
from sklearn.pipeline import Pipeline

#declare classifier text_clf
text_clf = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf', MultinomialNB()),
 ])

Below is an example of performing feature extraction then training classifier `text_clf` in one line of code.

In [22]:
#train classifier text_clf
text_clf.fit(train_corpus, train_labels)  

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
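
Once fitted, a pipeline applies the same vectorizer and tf-idf steps to raw text at prediction time, so no separate `transform` calls are needed. A minimal self-contained sketch follows; the sentences and the `I`/`E` labels are hypothetical, chosen only so the example runs on its own.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# hypothetical toy data, not drawn from the MBTI dataset
toy_train = ["love quiet reading alone", "enjoy quiet evenings reading",
             "love loud parties friends", "enjoy parties meeting friends"]
toy_train_labels = ["I", "I", "E", "E"]

toy_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultinomialNB())])
toy_clf.fit(toy_train, toy_train_labels)

# raw text goes straight in; the pipeline handles feature extraction
toy_test = ["quiet reading", "loud parties"]
toy_predictions = toy_clf.predict(toy_test)
print(toy_predictions)
print(toy_clf.score(toy_test, ["I", "E"]))
```

The same pattern would apply to `text_clf` above, for example `text_clf.score(test_corpus, test_labels)`.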