# **MBTI PREDICTION**

The Myers–Briggs Type Indicator (MBTI) is an introspective self-report questionnaire indicating differing psychological preferences in how people perceive the world and make decisions. 
The test attempts to assign four categories:

   * introversion or extraversion 
   * sensing or intuition 
   * thinking or feeling 
   * judging or perceiving
  
One letter from each category is taken to produce a four-letter test result, like "INFJ" or "ENFP".

Source: <a href="https://en.wikipedia.org/wiki/Myers%E2%80%93Briggs_Type_Indicator" target="_blank">https://en.wikipedia.org/wiki/Myers%E2%80%93Briggs_Type_Indicator</a>


## **Libraries**

In [3]:
import pandas as pd
import re
import nltk
import numpy as np
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.svm import LinearSVC

## **Import Data**

Source: <a href="https://www.kaggle.com/datasnaek/mbti-type" target="_blank">https://www.kaggle.com/datasnaek/mbti-type</a>

In [4]:
df = pd.read_csv("mbti_dataset.csv",encoding="utf-8")

In [5]:
df.head()

Unnamed: 0,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...


## **Data Cleaning**

### **Label encoding**

In [6]:
types = df["type"].unique()
for idx,mbti in enumerate(types):
    df["type"]= df["type"].replace(mbti,idx)

In [7]:
print(types)
df["type"].value_counts()

['INFJ' 'ENTP' 'INTP' 'INTJ' 'ENTJ' 'ENFJ' 'INFP' 'ENFP' 'ISFP' 'ISTP'
 'ISFJ' 'ISTJ' 'ESTP' 'ESFP' 'ESTJ' 'ESFJ']


6     1832
0     1470
2     1304
3     1091
1      685
7      675
9      337
8      271
4      231
11     205
5      190
10     166
12      89
13      48
15      42
14      39
Name: type, dtype: int64
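
The same integer encoding can be obtained in one call. The sketch below assumes `pandas.factorize` is acceptable in place of the `replace` loop; the toy `labels` series is made up for illustration and stands in for `df["type"]`.

```python
import pandas as pd

# Toy labels standing in for df["type"] (made-up sample, not the real data).
labels = pd.Series(["INFJ", "ENTP", "INFJ", "INTP", "ENTP"])

# factorize numbers the labels in order of first appearance,
# which matches enumerate(df["type"].unique()) in the loop above.
codes, uniques = pd.factorize(labels)

print(codes.tolist())
print(list(uniques))

# Decoding: uniques[code] gives back the original label.
decoded = [uniques[c] for c in codes]
print(decoded)
```

Keeping `uniques` around makes it easy to turn predictions back into four-letter types later on.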

### **Preprocessing data**

In [8]:
stop_words = stopwords.words("english")
#create an object of class PorterStemmer
porter = PorterStemmer()

In [9]:
def cleanData(posts):
    # Lowercase
    clean_text = posts.lower()
    #remove all hyperlinks
    clean_text = re.sub(r'(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})','',clean_text)
    word_list = word_tokenize(clean_text)
    clean_posts = []
    for word in word_list:
        if word.isalpha() and word not in stop_words:
            word = porter.stem(word)
            clean_posts.append(word)
    return clean_posts



In [10]:
df["posts_preprocessed"]= df["posts"].apply(lambda row: cleanData(row))
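
To see what the cleaning step does to a single row, here is a simplified, standard-library-only sketch. It is not the notebook's `cleanData`: the URL pattern is shorter, the stop-word set is a tiny hypothetical one instead of NLTK's list, and there is no stemming. The sample text is made up.

```python
import re

# Simplified URL pattern (an assumption; the notebook uses a longer regex).
URL_RE = re.compile(r"https?://\S+|www\.\S+")

# Tiny illustrative stop-word set; the notebook uses NLTK's English list.
STOP_WORDS = {"this", "it", "is", "the", "a"}

def clean_text_sketch(posts):
    """Lowercase, strip URLs, keep alphabetic tokens that are not stop words."""
    text = URL_RE.sub(" ", posts.lower())
    # Tokens may contain apostrophes so that contractions such as "i'm"
    # stay in one piece and are then rejected by isalpha().
    tokens = re.findall(r"[a-z']+", text)
    return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]

# Posts of one user are separated by "|||" in the dataset.
sample = "I'm watching this video|||https://example.com/watch?v=123 It is REALLY good"
cleaned = clean_text_sketch(sample)
print(cleaned)
```

The `|||` separator disappears on its own because it contains no alphabetic characters.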


**Preprocessing data analysis**

* We group the posts of the same type in order to detect the most commonly used words for each type

In [11]:
df_grouped = df.groupby('type')['posts_preprocessed'].apply(list).reset_index(name='posts_grouped')
df_grouped['posts_grouped'] = df_grouped['posts_grouped'].apply(lambda row: [y for x in row for y in x]) # Flatten list
df_grouped 

Unnamed: 0,type,posts_grouped
0,0,"[intj, moment, sportscent, top, ten, play, exp..."
1,1,"[find, lack, post, bore, posit, often, exampl,..."
2,2,"[one, cours, say, know, bless, absolut, posit,..."
3,3,"[intp, enjoy, convers, day, esoter, gab, natur..."
4,4,"[anoth, silli, misconcept, approach, logic, go..."
5,5,"[went, break, month, ago, togeth, year, plan, ..."
6,6,"[think, agre, person, consid, alpha, beta, fox..."
7,7,"[want, go, trip, without, stay, behind, would,..."
8,8,"[paint, without, guess, istp, best, bud, esfp,..."
9,9,"[got, read, enneagram, though, read, somewher,..."


In [12]:
def select_important_words(post_preprocessed):
    
    w=dict.fromkeys(post_preprocessed,0)
    for i in post_preprocessed:
        w[i]=w[i]+1
    data_sorted = {k: v for k, v in sorted(w.items(), key=lambda x: x[1],reverse=True)}
    return data_sorted

* We create a dictionary sorted in descending order: each word is a key and its number of occurrences is the value
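
For reference, `collections.Counter` (already imported at the top) can build the same sorted dictionary. This is a sketch of an equivalent helper; the name `select_important_words_counter` is hypothetical and the token list is a toy example.

```python
from collections import Counter

def select_important_words_counter(tokens):
    """Return {word: count}, ordered from most to least frequent."""
    # most_common() sorts by count in descending order and keeps
    # first-seen order for ties, like the stable sorted() call above.
    return dict(Counter(tokens).most_common())

tokens = ["like", "think", "like", "peopl", "think", "like"]
word_counts = select_important_words_counter(tokens)
print(word_counts)
```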

In [13]:
df_grouped['dictionnary'] = df_grouped['posts_grouped'].apply(lambda row: select_important_words(row))
df_grouped["dictionnary"]

0     {'like': 12861, 'think': 10235, 'peopl': 8405,...
1     {'like': 5805, 'think': 4563, 'peopl': 3602, '...
2     {'like': 10546, 'think': 8421, 'peopl': 6831, ...
3     {'like': 8517, 'think': 6513, 'peopl': 5759, '...
4     {'like': 1846, 'think': 1471, 'peopl': 1181, '...
5     {'like': 1759, 'think': 1396, 'peopl': 1169, '...
6     {'like': 16990, 'think': 12980, 'peopl': 10261...
7     {'like': 6501, 'think': 4633, 'enfp': 3779, 'p...
8     {'like': 2541, 'think': 1766, 'realli': 1288, ...
9     {'like': 2696, 'think': 1965, 'get': 1732, 'pe...
10    {'like': 1534, 'think': 1216, 'isfj': 930, 'pe...
11    {'like': 1798, 'think': 1226, 'istj': 939, 'pe...
12    {'like': 803, 'think': 564, 'get': 514, 'estp'...
13    {'like': 380, 'think': 292, 'peopl': 220, 'kno...
14    {'like': 292, 'think': 269, 'estj': 213, 'peop...
15    {'like': 411, 'esfj': 365, 'think': 352, 'peop...
Name: dictionnary, dtype: object

* We select the 20 most frequent words for each type

In [14]:
df_grouped["most_frequent_words"]=df_grouped["dictionnary"].apply(lambda row: list(row.keys())[0:20])
df_grouped["most_frequent_words"]

0     [like, think, peopl, feel, infj, know, one, ge...
1     [like, think, peopl, entp, one, get, would, kn...
2     [like, think, peopl, would, one, intp, get, kn...
3     [like, think, peopl, intj, would, one, know, g...
4     [like, think, peopl, entj, get, would, one, kn...
5     [like, think, peopl, feel, enfj, know, get, re...
6     [like, think, peopl, feel, realli, know, infp,...
7     [like, think, enfp, peopl, know, get, feel, re...
8     [like, think, realli, peopl, feel, get, know, ...
9     [like, think, get, peopl, would, know, istp, o...
10    [like, think, isfj, peopl, get, would, know, r...
11    [like, think, istj, peopl, would, get, know, o...
12    [like, think, get, estp, peopl, know, one, typ...
13    [like, think, peopl, know, get, realli, would,...
14    [like, think, estj, peopl, would, know, get, o...
15    [like, esfj, think, peopl, type, know, get, fe...
Name: most_frequent_words, dtype: object

* We group the most frequent words of all types and create a dictionary with their occurrences

In [15]:
total=[]
for i in range(0,len(types)-1):
    total = total + df_grouped["most_frequent_words"][i]
most_freq_dict=Counter(total)
most_freq_dict

Counter({'like': 15,
         'think': 15,
         'peopl': 15,
         'feel': 15,
         'infj': 1,
         'know': 15,
         'one': 15,
         'get': 15,
         'would': 15,
         'realli': 15,
         'thing': 15,
         'time': 15,
         'say': 15,
         'person': 14,
         'go': 15,
         'make': 15,
         'want': 15,
         'love': 7,
         'type': 14,
         'much': 7,
         'entp': 1,
         'see': 7,
         'way': 1,
         'intp': 1,
         'use': 2,
         'intj': 1,
         'entj': 1,
         'good': 1,
         'enfj': 1,
         'friend': 6,
         'infp': 1,
         'enfp': 1,
         'isfp': 1,
         'istp': 1,
         'someth': 1,
         'isfj': 1,
         'istj': 1,
         'estp': 1,
         'esfp': 1,
         'estj': 1})

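The loop stops at `len(types)-1`, so only the first 15 of the 16 types are counted and the last one (ESFJ, index 15) is left out; this is why no count exceeds 15 and "esfj" is missing from the counter. Among these 15 types, the word "like" is always one of the most frequent words.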
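
The effect of the loop bound `range(0, len(types)-1)` can be checked on a toy example. The three word lists below are made up and stand in for `df_grouped["most_frequent_words"]`; `chain.from_iterable` is just one way to flatten them.

```python
from collections import Counter
from itertools import chain

# Toy stand-in for df_grouped["most_frequent_words"]: one list per type.
top_words_per_type = [
    ["like", "think", "infj"],
    ["like", "think", "entp"],
    ["like", "esfj", "think"],
]
n_types = len(top_words_per_type)

# Same bound as the loop above: the last list is never visited.
partial = Counter(
    chain.from_iterable(top_words_per_type[i] for i in range(0, n_types - 1))
)

# Iterating over the lists directly covers every type.
full = Counter(chain.from_iterable(top_words_per_type))

print(partial["like"], full["like"])
print("esfj" in partial, "esfj" in full)
```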

* We create a list of the words that are among the most frequent words for at least 7 of the counted types (the code keeps every word whose count is greater than 6)

In [16]:
words_to_delete=[]
for i in most_freq_dict:
    if most_freq_dict[i]>6:
        words_to_delete.append(i)
words_to_delete

['like',
 'think',
 'peopl',
 'feel',
 'know',
 'one',
 'get',
 'would',
 'realli',
 'thing',
 'time',
 'say',
 'person',
 'go',
 'make',
 'want',
 'love',
 'type',
 'much',
 'see']

## **Filtering preprocessed data**

In [17]:
df["posts_preprocessed_filtered"] = df["posts_preprocessed"].apply(lambda row: [w for w in row if w not in words_to_delete])

In [18]:
df.head()

Unnamed: 0,type,posts,posts_preprocessed,posts_preprocessed_filtered
0,0,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,"[intj, moment, sportscent, top, ten, play, exp...","[intj, moment, sportscent, top, ten, play, exp..."
1,1,'I'm finding the lack of me in these posts ver...,"[find, lack, post, bore, posit, often, exampl,...","[find, lack, post, bore, posit, often, exampl,..."
2,2,'Good one _____ https://www.youtube.com/wat...,"[one, cours, say, know, bless, absolut, posit,...","[cours, bless, absolut, posit, best, friend, c..."
3,3,"'Dear INTP, I enjoyed our conversation the o...","[intp, enjoy, convers, day, esoter, gab, natur...","[intp, enjoy, convers, day, esoter, gab, natur..."
4,4,'You're fired.|||That's another silly misconce...,"[anoth, silli, misconcept, approach, logic, go...","[anoth, silli, misconcept, approach, logic, ke..."
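
The selection and filtering steps can be sketched together on a small example. The four counts below are taken from the counter printed earlier; the two token lists are made up, and using a `set` instead of a list for `words_to_delete` is a suggestion for faster lookups, not what the notebook does.

```python
from collections import Counter

# A few entries copied from the counter above.
most_freq = Counter({"like": 15, "love": 7, "friend": 6, "infj": 1})

# Same rule as the notebook (count > 6), stored in a set for O(1) membership tests.
min_types = 7
words_to_delete = {w for w, n in most_freq.items() if n >= min_types}

# Made-up preprocessed posts standing in for df["posts_preprocessed"].
posts = [["like", "friend", "infj"], ["love", "like"]]
filtered = [[w for w in post if w not in words_to_delete] for post in posts]

print(sorted(words_to_delete))
print(filtered)
```

Words such as "friend" (6 types) survive the filter, as do the type names themselves, which each appear for a single type.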


## **Models**

### **Vectorization**

 * CountVectorizer
 * TfidfVectorizer

In [19]:
cv = CountVectorizer()
tfidf = TfidfVectorizer()

In [20]:
X=df["posts_preprocessed"].map(' '.join)
y=df["type"]

cnt_vector = cv.fit_transform(X)
X_train,X_test,y_train,y_test = train_test_split(cnt_vector,y,test_size=0.2)


In [21]:
tfidf_vector = tfidf.fit_transform(X)
X_train_tfidf,X_test_tfidf,y_train_tfidf,y_test_tfidf = train_test_split(tfidf_vector,y,test_size=0.2)

In [22]:
X_filtered=df["posts_preprocessed_filtered"].map(' '.join)

tfidf_vector_filtered = tfidf.fit_transform(X_filtered)
X_train_tfidf_filtered,X_test_tfidf_filtered,y_train_tfidf_filtered,y_test_tfidf_filtered = train_test_split(tfidf_vector_filtered,y,test_size=0.2)
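
Two details of the cells above are worth noting: each vectorizer is fitted on the whole corpus before the split, so test posts influence the vocabulary and IDF weights, and each `train_test_split` call draws a different random split, so the three models are not scored on the same test set. A possible alternative, sketched below on a made-up toy corpus, is to split once with a fixed seed and put the vectorizer inside a scikit-learn pipeline. The `random_state` and `stratify` settings are assumptions, not choices made in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus standing in for df["posts_preprocessed"].map(" ".join) and df["type"].
texts = [
    "plan logic system", "logic theori system",
    "plan theori logic", "system plan theori",
    "feel love friend", "friend feel heart",
    "love heart feel", "heart friend love",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# One split, stratified by label, with a fixed seed so that every model
# trained afterwards can be evaluated on the same test set.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The vectorizer lives inside the pipeline, so its vocabulary and IDF
# weights are learned from the training texts only.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(accuracy)
```

Swapping `TfidfVectorizer()` for `CountVectorizer()` in the pipeline would give the count-based variant on the same split.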

### **Prediction**

**Model 1:** Preprocessed data + CountVectorizer + LinearSVC

In [23]:
model = LinearSVC().fit(X_train,y_train)

predictions=model.predict(X_test)

classification = metrics.classification_report(y_test,predictions, target_names= types,zero_division=0)


print(classification)

              precision    recall  f1-score   support

        INFJ       0.50      0.62      0.55       264
        ENTP       0.55      0.46      0.50       147
        INTP       0.54      0.57      0.56       256
        INTJ       0.50      0.50      0.50       206
        ENTJ       0.45      0.32      0.37        53
        ENFJ       0.33      0.27      0.30        41
        INFP       0.58      0.64      0.61       369
        ENFP       0.53      0.50      0.51       145
        ISFP       0.36      0.30      0.32        54
        ISTP       0.42      0.43      0.43        67
        ISFJ       0.41      0.40      0.41        35
        ISTJ       0.31      0.21      0.25        42
        ESTP       0.73      0.36      0.48        22
        ESFP       0.00      0.00      0.00        14
        ESTJ       0.50      0.12      0.20         8
        ESFJ       0.20      0.08      0.12        12

    accuracy                           0.52      1735
   macro avg       0.43   

**Model 2:** Preprocessed data + TfidfVectorizer + LinearSVC

In [24]:
model = LinearSVC().fit(X_train_tfidf,y_train_tfidf)

predictions2=model.predict(X_test_tfidf)

classification2 = metrics.classification_report(y_test_tfidf,predictions2, target_names= types,zero_division=0)


print(classification2)

              precision    recall  f1-score   support

        INFJ       0.66      0.67      0.67       305
        ENTP       0.64      0.67      0.65       130
        INTP       0.64      0.73      0.68       260
        INTJ       0.58      0.60      0.59       200
        ENTJ       0.75      0.41      0.53        44
        ENFJ       0.78      0.47      0.58        30
        INFP       0.66      0.79      0.72       386
        ENFP       0.66      0.59      0.62       126
        ISFP       0.54      0.46      0.50        57
        ISTP       0.70      0.61      0.65        61
        ISFJ       0.95      0.49      0.64        39
        ISTJ       0.67      0.39      0.49        56
        ESTP       0.62      0.29      0.40        17
        ESFP       0.00      0.00      0.00         6
        ESTJ       1.00      0.25      0.40         8
        ESFJ       0.67      0.20      0.31        10

    accuracy                           0.65      1735
   macro avg       0.66   



**Model 3:** Filtered Preprocessed data + TfidfVectorizer + LinearSVC

In [25]:
model = LinearSVC().fit(X_train_tfidf_filtered,y_train_tfidf_filtered)

predictions3=model.predict(X_test_tfidf_filtered)

classification3 = metrics.classification_report(y_test_tfidf_filtered,predictions3, target_names= types,zero_division=0)

print(classification3)


              precision    recall  f1-score   support

        INFJ       0.62      0.64      0.63       283
        ENTP       0.69      0.49      0.57       146
        INTP       0.59      0.74      0.65       242
        INTJ       0.55      0.70      0.61       197
        ENTJ       0.61      0.40      0.48        35
        ENFJ       0.60      0.35      0.44        34
        INFP       0.68      0.80      0.73       391
        ENFP       0.75      0.64      0.69       157
        ISFP       0.54      0.30      0.38        50
        ISTP       0.79      0.59      0.67        75
        ISFJ       0.73      0.38      0.50        29
        ISTJ       0.67      0.44      0.53        41
        ESTP       0.70      0.29      0.41        24
        ESFP       0.00      0.00      0.00        11
        ESTJ       0.25      0.10      0.14        10
        ESFJ       0.75      0.30      0.43        10

    accuracy                           0.64      1735
   macro avg       0.59   

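We can see that the filtering doesn't improve the results: Model 3 reaches an accuracy of 0.64, against 0.65 for Model 2 without filtering. Each model is evaluated on its own random split, so a gap this small should not be over-interpreted.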
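
The reports also show why accuracy alone is a weak summary here: a rare type that is never predicted (ESFP in all three reports) gets precision, recall and f1-score of 0.00, which pulls the macro average down while accuracy barely moves. The sketch below recomputes these quantities by hand on made-up labels; the helper name `f1_for_label` is hypothetical, and undefined ratios are set to 0, which is what `zero_division=0` does.

```python
def f1_for_label(y_true, y_pred, label):
    """F1-score of one class; undefined ratios are replaced by 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up labels: ESFP is rare and never predicted.
y_true = ["INFP", "INFP", "INFP", "INFJ", "INFJ", "ESFP"]
y_pred = ["INFP", "INFP", "INFJ", "INFJ", "INFJ", "INFP"]

scores = {lab: f1_for_label(y_true, y_pred, lab) for lab in sorted(set(y_true))}
macro_f1 = sum(scores.values()) / len(scores)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(scores)
print(round(macro_f1, 2), round(accuracy, 2))
```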

## **Conclusion**

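This approach does not work well: none of the 3 models fits the problem, and the rarest types are barely recognised (ESFP has an f1-score of 0.00 in all three reports).
The idea is now to make a prediction for each of the four categories separately.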
Have a look at the <a href="https://github.com/pmecchia/mbti_prediction/blob/master/model%20improvement.ipynb">model improvement notebook.</a>
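
As an illustration of the per-category idea, a four-letter type can be split into four binary targets, one per category, so that four small classifiers replace one 16-class model. This is only a sketch of the label transformation, not necessarily how the linked notebook implements it; the names `AXES`, `type_to_axes` and `axes_to_type` are hypothetical.

```python
# One (first letter, second letter) pair per MBTI category.
AXES = [("I", "E"), ("N", "S"), ("T", "F"), ("J", "P")]

def type_to_axes(mbti):
    """Encode a type such as 'INFJ' as four 0/1 values (1 = first letter of the pair)."""
    return [int(mbti[i] == first) for i, (first, _) in enumerate(AXES)]

def axes_to_type(bits):
    """Rebuild the four-letter type from four 0/1 values."""
    return "".join(first if bit else second for bit, (first, second) in zip(bits, AXES))

print(type_to_axes("INFJ"))
print(type_to_axes("ENFP"))
print(axes_to_type([1, 1, 0, 1]))
```

Each of the four columns can then serve as the target of its own binary classifier, and the four predictions are recombined with `axes_to_type`.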