# Myers-Briggs Personality Type Predictor 

The data (mbti_1.csv) comes from https://www.kaggle.com/datasnaek/mbti-type/data

#### Packages used:
* pandas - used for csv file reading
* nltk - natural language tool kit - explained later
* numpy - calculating the mean prediction accuracy
* sklearn - model creation, fitting, and predictions
* re - string pruning

In [1]:
import pandas as pd #used for csv file reading
import nltk 
import numpy as np 
import sklearn #used for model fitting and predictions

from nltk.corpus import stopwords #used for removing basic words from posts, e.g. "a", "the", "of", etc
import re #used for string pruning
#nltk.download() #downloads a list of the basic words
train = pd.read_csv("mbti_1.csv", header=0, delimiter=',')

To get an idea of the data we're working with, the number of occurrences of each type is calculated.

In [2]:
categories = ['INTJ','INTP','INFJ','ISTJ','ENTJ','INFP','ISFP','ESFP','ESFJ','ESTJ','ISFJ','ENFP','ENFJ','ISTP','ESTP','ENTP']
type_counts = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
for i in range(len(train["type"])):
    for k in range(len(categories)):
        if train["type"][i] == categories[k]:
            type_counts[k] = type_counts[k]+1
    
    
for j in range(len(categories)):
    print(categories[j])
    print(type_counts[j])

INTJ
1091
INTP
1304
INFJ
1470
ISTJ
205
ENTJ
231
INFP
1832
ISFP
271
ESFP
48
ESFJ
42
ESTJ
39
ISFJ
166
ENFP
675
ENFJ
190
ISTP
337
ESTP
89
ENTP
685


As seen above, the data is unfortunately far from evenly distributed: there are 1832 INFPs but only 39 ESTJs and 42 ESFJs.
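
The nested counting loop above can be written more compactly. The following is a minimal sketch using `collections.Counter` on a short, made-up list of labels (`sample_types` is a hypothetical stand-in for `train["type"]`; in the notebook itself, pandas' `train["type"].value_counts()` would give the same tallies).

```python
from collections import Counter

# Hypothetical stand-in for train["type"]; the real column has 8675 labels.
sample_types = ["INFP", "INTJ", "INFP", "ENTP", "INFJ", "INFP", "INTJ"]

# Counter tallies every label in a single pass over the data.
sample_counts = Counter(sample_types)

# most_common() orders the labels from most to least frequent.
for mbti_type, count in sample_counts.most_common():
    print(mbti_type, count)
```

Sorting by frequency makes an uneven distribution easy to spot at a glance.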

Looking at the data as a whole, there are 8675 entries, each with a personality type and the text that person has posted.

In [3]:
print(train.shape)
print(train.columns.values)

(8675, 2)
['type' 'posts']


Looking at the row at index 10, we can see an example of the text posts and the personality type of the person who posted them.

In [4]:
print(train["posts"][10])
print(train["type"][10])

'One time my parents were fighting over my dad's affair and my dad pushed my mom. The fall broke her finger.  She's pointed a gun at him and made him get on his knees and beg for his life. She's...|||I'm gonna talk about what a piece of shit my dad is now.  He's an alcoholic and he has some kind of serious mental problem when it comes to complying with the IRS. (In his words, Laws don't apply...|||OMG...at the women's center I lived at, run by a Catholic charity, the fat bully program manager took it upon herself to change policy so that tenants were FORCED to attend the Christmas party. If...|||I don't work, but I have a calling I am 100% committed to 24/7 with no vacation or off days EVER. I'm a Kundalini mystic.  Oh, I don't get paid, either!  It's one of those destined things...|||My art teacher in high school had a stack of art school catalogs. When I saw the one for the school I ended up going to, I immediately knew that was the one. Without any research. It was like when...|||IN

For each data entry (row), the '|' separators and '!' characters are replaced with spaces and the hyperlinks are removed using regular expressions.

In [5]:
#replace the | separators and ! characters with spaces, then strip the hyperlinks
#(assigning the whole column avoids pandas' chained-assignment pitfalls)
train["posts"] = train["posts"].str.replace(r'[!|]', ' ', regex=True)
train["posts"] = train["posts"].str.replace(r'http\S+', '', regex=True)

print(train["posts"][10])

'One time my parents were fighting over my dad's affair and my dad pushed my mom. The fall broke her finger.  She's pointed a gun at him and made him get on his knees and beg for his life. She's...   I'm gonna talk about what a piece of shit my dad is now.  He's an alcoholic and he has some kind of serious mental problem when it comes to complying with the IRS. (In his words, Laws don't apply...   OMG...at the women's center I lived at, run by a Catholic charity, the fat bully program manager took it upon herself to change policy so that tenants were FORCED to attend the Christmas party. If...   I don't work, but I have a calling I am 100% committed to 24/7 with no vacation or off days EVER. I'm a Kundalini mystic.  Oh, I don't get paid, either   It's one of those destined things...   My art teacher in high school had a stack of art school catalogs. When I saw the one for the school I ended up going to, I immediately knew that was the one. Without any research. It was like when...   IN
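
To see what the two substitutions do in isolation, here is a small sketch that applies them to a single made-up string. The helper name `clean_post` and the example text are assumptions for illustration; only the two regular expressions come from the cell above.

```python
import re

def clean_post(text):
    """Apply the same two substitutions as the cleaning cell to one string."""
    # Replace '|' separators and '!' characters with spaces.
    text = re.sub(r'[!|]', ' ', text)
    # Drop anything that starts with "http", up to the next whitespace.
    text = re.sub(r'http\S+', '', text)
    return text

# Made-up example string, not taken from the dataset.
raw = "Great video! https://example.com/watch|||I agree"
print(clean_post(raw).split())
```

The order matters: `http\S+` runs until the next whitespace, so stripping hyperlinks before the '|' separators are turned into spaces would also swallow the start of the following post.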

Here, nltk is used to collect what are called "stop words". Stop words are commonly used words that the model should not take into account when training.

In [6]:
nltk.download('stopwords')
swords = stopwords.words('english')
print(swords)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lauri\AppData\Roaming\nltk_data...


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data]   Unzipping corpora\stopwords.zip.
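
As a rough illustration of what removing stop words means, the sketch below filters a made-up sentence against a handful of entries copied from the list printed above. The helper `drop_stop_words` is hypothetical, and splitting on whitespace is a simplification: CountVectorizer tokenizes text with its own rules.

```python
# A few entries copied from the stop-word list printed above.
sample_stop_words = {"i", "my", "a", "the", "of", "and", "is", "to"}

def drop_stop_words(text, stop_words):
    """Lowercase, split on whitespace, and keep only non-stop-word tokens."""
    return [tok for tok in text.lower().split() if tok not in stop_words]

# Made-up sentence for illustration.
kept = drop_stop_words("I love the sound of my guitar", sample_stop_words)
print(kept)
```

Only the words that carry some meaning on their own ("love", "sound", "guitar") survive the filter.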


Here, the data is split into the training and testing datasets at a 90/10 ratio (test_size = 0.1).

In [7]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(train, test_size = 0.1)
print(train.shape)
print(test.shape)

(7807, 2)
(868, 2)
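
The printed shapes follow from the row count: 10% of 8675 is 867.5, which is rounded up to 868 test rows, leaving 7807 for training. A plain-Python sketch of such a shuffle-and-split is shown below. `shuffle_split` and its `seed` parameter are hypothetical names used for illustration, not sklearn's implementation.

```python
import math
import random

def shuffle_split(n_rows, test_size, seed=0):
    """Return (train_idx, test_idx) for a random split of range(n_rows)."""
    indices = list(range(n_rows))
    # A seeded generator makes the shuffle repeatable between runs.
    random.Random(seed).shuffle(indices)
    # Round the test share up, which gives 868 test rows for 8675 rows.
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(8675, 0.1)
print(len(train_idx), len(test_idx))
```

In the notebook itself, passing a `random_state` value to `train_test_split` would similarly make the split, and therefore the final accuracy, repeatable.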


Here the sklearn class CountVectorizer is used to build the "bag" of words.
CountVectorizer creates a matrix with one row per data entry and one column per distinct word, where each cell counts how many times that word is used in that entry.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
count_vec = CountVectorizer(analyzer = 'word')
train_counts = count_vec.fit_transform(train["posts"])
print(train_counts.shape)

(7807, 102357)
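
The shape above means one row for each of the 7807 training entries and one column for each of the 102357 distinct words found. The idea can be sketched in plain Python on two made-up documents. This is an illustration of the bag-of-words concept only: CountVectorizer tokenizes differently and returns a sparse matrix rather than nested lists.

```python
from collections import Counter

# Two made-up documents standing in for rows of train["posts"].
docs = ["cats chase mice", "mice fear cats and cats nap"]

# Vocabulary: every distinct word, sorted so each word gets a fixed column.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word.
counts = []
for doc in docs:
    tally = Counter(doc.split())
    counts.append([tally[word] for word in vocab])

print(vocab)
for row in counts:
    print(row)
```

Word order is discarded: each document is reduced to how often each vocabulary word occurs in it.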


This cell looks inside the bag and prints every word it contains, 102357 in total for the run shown above.

In [9]:
bag = count_vec.get_feature_names() #on scikit-learn 1.0 or newer, use get_feature_names_out() instead
print(bag)



# sklearn Pipeline

The sklearn object Pipeline packages all of the steps of a bag-of-words model into one object for ease of use.

* CountVectorizer - builds a matrix that tracks how many times each word is used per data entry. Here it is also given the stop word list collected earlier.
* TfidfTransformer - reweights the counts from CountVectorizer by tf-idf, so that words appearing in most entries count for less than words that are distinctive to a few.
* SGDClassifier - a linear classifier trained with stochastic gradient descent; with loss='hinge' it acts as a linear Support Vector Machine.

The model is then fit to the training data.

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words = swords)),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, tol=None)),
])

text_clf.fit(train["posts"], train["type"])

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=['i', 'me', 'my', 'myself', 'we',
                                             'our', 'ours', 'ourselves', 'you',
                                             "you're", "you've", "you'll",...
                ('clf',
                 SGDClassifier(alpha=0.001, average=False, class_weight=None,
                               early_stopping=False, epsilon=0.1, eta0=0.0,
                               fit_intercept=True, l1_ratio=0.15,
                               learning_rate='optimal', loss=
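
To make the TfidfTransformer step more concrete, the sketch below reweights a tiny, made-up count matrix in plain Python. It assumes a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by scaling each row to unit length, which to my understanding mirrors the transformer's default settings. Treat the exact formula as an assumption rather than a statement of what the library does internally.

```python
import math

# Toy count matrix: rows are documents, columns are words (made-up values).
counts = [
    [2, 1, 0],
    [0, 1, 1],
]
n_docs = len(counts)
n_words = len(counts[0])

# Document frequency: in how many documents each word appears.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_words)]

# Smoothed inverse document frequency; rarer words get larger weights.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

tfidf = []
for row in counts:
    weighted = [count * w for count, w in zip(row, idf)]
    # Scale each row to unit Euclidean length so long posts do not dominate.
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 3) for v in row])
```

The middle word appears in both documents, so it receives the smallest weight, while words unique to one document are boosted.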

# Results

Finally, the fitted pipeline is used to predict the personality types of the test data. The accuracy is the fraction of predictions that match the true type.

In [13]:
predicts = text_clf.predict(test["posts"])
acc = np.mean(predicts == test["type"])
print(acc)

0.6647465437788018
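
One way to put this number in context is to compare it with a trivial baseline that always guesses the most common type. The sketch below computes that baseline from the per-type counts printed earlier. It is calculated over the full dataset, so the figure for the actual test split will differ slightly.

```python
# Per-type counts copied from the tally printed earlier in this notebook.
type_totals = {
    'INTJ': 1091, 'INTP': 1304, 'INFJ': 1470, 'ISTJ': 205,
    'ENTJ': 231, 'INFP': 1832, 'ISFP': 271, 'ESFP': 48,
    'ESFJ': 42, 'ESTJ': 39, 'ISFJ': 166, 'ENFP': 675,
    'ENFJ': 190, 'ISTP': 337, 'ESTP': 89, 'ENTP': 685,
}

total = sum(type_totals.values())
majority_type = max(type_totals, key=type_totals.get)

# Accuracy of always guessing the most common type, over the full dataset.
baseline_acc = type_totals[majority_type] / total
print(majority_type, round(baseline_acc, 3))
```

Always guessing INFP would be right roughly 21% of the time, so the accuracy above is well beyond what the class imbalance alone would give.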
