This notebook contains preprocessing steps for MBTI dataset. 

The Myers Briggs Type Indicator (or MBTI for short) is a personality type system that divides everyone into 16 distinct personality types across 4 axis:
* Introversion (I) – Extroversion (E)
* Intuition (N) – Sensing (S)
* Thinking (T) – Feeling (F)
* Judging (J) – Perceiving (P)

In the dataset, there are 8600 rows of data. Each row contains a person's MBTI personality class and the last 50 things that he/she posted in PersonalityCafe Forum. 

In [1]:
# Import libraries 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from wordcloud import WordCloud, STOPWORDS
import requests
from lxml.html import fromstring
import re 
from nltk.corpus import stopwords 
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm
import json 
import numpy as np
from os.path import join

In [2]:
import nltk 
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Error loading stopwords: <urlopen error [Errno -2] Name or
[nltk_data]     service not known>
[nltk_data] Error loading wordnet: <urlopen error [Errno -2] Name or
[nltk_data]     service not known>


False

In [3]:
# Define constant
# define paths, constants etc. 
datadir = "../dataset/"
datafile = "../dataset/mbti-type/mbti_1.csv"
HTTP = ["http://", "https://", ".com", "www."]
IMAGE = [".jpg",".png", ".gif"]
EMOJI = [":D",":)",":(","D:",":o"]
LINK = r'http\S+'

# Open dataset folder into a DataFrame

In [4]:
# Opening dataset as pandas dataframe 
df = pd.read_csv(datafile)
print("There are %d number of data "  %len(df))
# Looking the first 5 elements 
df.head(5)

There are 8675 number of data 


Unnamed: 0,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...


In [5]:
# copy the original to use later
df_copy = df.copy()
# split the posts into array 
df_copy['posts'] = df_copy['posts'].apply(lambda x: x.split("|||"))
print("The shape (%d,%d)" %(df_copy.shape))

The shape (8675,2)


In [6]:
# the first five posts of the first user
df_copy.loc[0].posts[0:5]

["'http://www.youtube.com/watch?v=qsXHcwe3krw",
 'http://41.media.tumblr.com/tumblr_lfouy03PMA1qa1rooo1_500.jpg',
 'enfp and intj moments  https://www.youtube.com/watch?v=iz7lE1g4XM4  sportscenter not top ten plays  https://www.youtube.com/watch?v=uCdfze1etec  pranks',
 'What has been the most life-changing experience in your life?',
 'http://www.youtube.com/watch?v=vXZeYwwRDw8   http://www.youtube.com/watch?v=u8ejam5DP3E  On repeat for most of today.']

# Preprocess 

###  What to do with links? -> Scrap links and get the title of the page 


In [7]:
NOT_ALLOWED= "Bilgi" # when page is closed by Bilgi Teknolojileri bk. 

def replace_links_title(df_copy):
    """ for each row checks if the post 
    is a link, if it's a link then request the page. 
    For alexxandra-.tumblr.com the function gives a parser error. 
    saves the preprocessed file into csv. 
    @df_copy, a dataframe, each row has a posts array 
    returns : None 
    """
    nbr_link = 0 
    link_dict = dict()
    for i,post in enumerate(df_copy.posts):
        link_dict[i] = dict()
        for j,p in enumerate(post):
            if any(f in p for f in HTTP):
                # get the page title 
                link = re.findall(LINK,  p)
                if not ("http://-alexxxandra-.tumblr.com/" in link) and not("http://memearchive.net/memerial.net/fullsize/1370.jpg" in link):
                    if len(link)>0:
                        for l in link: # if multiple links 
                            try:
                                r = requests.get(l)
                                r.raise_for_status()
                                try : 
                                    tree = fromstring(r.content)
                                    title = tree.findtext('.//title') 
                                    if title and not("Bilgi" in title): 
                                        p = re.sub(
                                            LINK, 
                                            title, 
                                            p) 
                                    link_dict[i][j] = p
                                    nbr_link += 1
                                except: 
                                    print("Error in from String for i: %d j: %d", (i,j))
                            except: 
                                print("Error for i: %d j:%d" %(i,j))
            post[j] = p                  
        if i%50==0: 
            # save the results 
            print("Saving result for %d " %(i))
            df_copy.to_csv("backup_df.csv")

        df_copy.loc[i]['posts'] = post
    print("Number of links in the whole data %d " %(nbr_link))

### Lemmatization and Removing StopWords 

In [7]:
# Labels for types
lab_encoder = LabelEncoder().fit(list(df.type.unique()))
classes=lab_encoder.inverse_transform(range(16))
print(classes)
print(['INFJ', 'ENTP', 'INTP', 'INTJ', 'ENTJ', 'ENFJ', 'INFP', 'ENFP', 'ISFP', 'ISTP', 'ISFJ', 'ISTJ', 'ESTP', 'ESFP', 'ESTJ', 'ESFJ'])

['ENFJ' 'ENFP' 'ENTJ' 'ENTP' 'ESFJ' 'ESFP' 'ESTJ' 'ESTP' 'INFJ' 'INFP'
 'INTJ' 'INTP' 'ISFJ' 'ISFP' 'ISTJ' 'ISTP']
['INFJ', 'ENTP', 'INTP', 'INTJ', 'ENTJ', 'ENFJ', 'INFP', 'ENFP', 'ISFP', 'ISTP', 'ISFJ', 'ISTJ', 'ESTP', 'ESFP', 'ESTJ', 'ESFJ']


In [8]:
# remove stop words 
lemmatiser = WordNetLemmatizer()
def preprocess_data(df_copy, 
                    punc_remove=True,
                    link_remove=True,
                    only_letters=True,
                    type_remove=True
                   ):
    stopWordsEng = stopwords.words("english")
    post_list = []
    label_list = []
    exclusions = '|'.join([f.lower() for f in list(df.type.unique())])
    for row in tqdm(df_copy.iterrows()):
        posts = row[1].posts
        temp_post = ""
        for p in posts:
            # change remaining links as LNK
            temp = p.lower()
            if link_remove: temp = re.sub(LINK, 'link',temp)
            # change type names as TYP 
            if type_remove: temp = re.sub(exclusions, 'type', temp)
            # chose only letters 
            if only_letters: temp = re.sub("[^a-zA-Z]", " ", temp)
            # remove punctuations 
            if punc_remove: temp = re.sub(' +', ' ', temp)
            # remove stopwords and lemmatize 
            temp = " ".join([lemmatiser.lemmatize(w) for w in temp.split(' ') if w not in stopWordsEng])
            temp_post += " " +temp
        post_list.append(temp_post)
        label_list.append(lab_encoder.transform([row[1].type])[0])
    return post_list, label_list

In [11]:
# First replace links 
# replace_links_title(df_copy)
# apply remaining preprocess 
posts_all, label_all = preprocess_data(df_copy)
# save the preprocessed file 
data = dict()
data['posts'] = posts_all
data['types'] = [int(l) for l in label_all] 
filename = join(datadir,"preprocessed_data_all.json")
with open(filename, "w+") as fp:
    json.dump(data,fp)

8675it [00:57, 149.94it/s]


In [12]:
# only lemmatize
posts_none, label_none = preprocess_data(df_copy, False,False,False,False)
# save the preprocessed file 
data = dict()
data['posts'] = posts_none
data['types'] = [int(l) for l in label_none] 
filename = join(datadir,"preprocessed_data_none.json")
with open(filename, "w+") as fp:
    json.dump(data,fp)

8675it [00:47, 182.00it/s]


In [41]:
# For the ablation study different version added 
nbr_feature = 4 
# experiment names related to the removed features 
# For punc  -> only punctuations removed 
#     links -> only links replaced 
#     letters-> only non-letters removed
#     type  -> only type names removed 
exp_names = ["punc","links","letters","type"]
for i in range(nbr_feature):
    posts, labels = preprocess_data(df_copy, 
                                    (exp_names[i]=="punc"),
                                    (exp_names[i]=="links"),
                                    (exp_names[i]=="letters"),
                                    (exp_names[i]=="type"))
    data = dict()
    data['posts'] = posts
    data['types'] = [int(l) for l in labels] 
    filename = join(datadir,"preprocessed_data_"+exp_names[i]+".json")
    with open(filename, "w+") as fp:
        json.dump(data,fp)

8675it [00:48, 177.98it/s]
8675it [00:46, 184.88it/s]
8675it [01:05, 133.45it/s]
8675it [00:49, 176.80it/s]


## Split classes as 4 dichotomies 

In [10]:
def make_4classes(data_file):
    with open(data_file) as fp:
        data = json.load(fp)
    data['posts']  = [p.lower() for p in data['posts']]   
    # change the class names as originals  
    for i,l in enumerate(data['types']):
        cls = lab_encoder.inverse_transform([l])
        data['types'][i] = cls[0]
    #print(data['types'][0:10])
    # construct a dataframe 
    df_preprocessed = pd.DataFrame.from_dict(data)
    # print(df_preprocessed.head())
    # check if class names are true 
    # print(sum(df_preprocessed.types == df.type)) 
    # Create column for E/I 
    # if types start I -> 1, if starts with E->0 
    df_preprocessed['I-E'] = np.zeros(len(df_preprocessed),dtype=int)
    df_preprocessed['I-E'][df_preprocessed.types.str.startswith('I')] = 1 
    # Create column for S-N 
    # if types start S -> 1, if starts with N->0 
    df_preprocessed['S-N'] = np.zeros(len(df_preprocessed),dtype=int)
    df_preprocessed['S-N'][df_preprocessed.types.str[1] == 'S'] = 1 
    # Create column for T-F 
    # if types start T -> 1, if starts with F->0 
    df_preprocessed['T-F'] = np.zeros(len(df_preprocessed),dtype=int)
    df_preprocessed['T-F'][df_preprocessed.types.str[2] == 'T'] = 1 
    # Create column for J-P 
    # if types start J -> 1, if starts with P->0 
    df_preprocessed['J-P'] = np.zeros(len(df_preprocessed),dtype=int)
    df_preprocessed['J-P'][df_preprocessed.types.str[3] == 'J'] = 1 
    return df_preprocessed 

In [48]:
# open preprocessed dataset 
# Labels for types
lab_encoder = LabelEncoder().fit(list(df.type.unique()))
classes=lab_encoder.inverse_transform(range(16))

In [12]:
data_file = "../dataset/preprocessed_data_letters.json"
df_preprocessed = make_4classes(data_file)
df_preprocessed.head()
# save the 4 dichotomies dataset 
df_preprocessed.to_csv("../dataset/preprocessed_data_letters_4class.json")

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [13]:
# check the written file 
df_w = pd.read_csv('../dataset/preprocessed_data_letters_4class.json')
df_w.head()

Unnamed: 0.1,Unnamed: 0,posts,types,I-E,S-N,T-F,J-P
0,0,http www youtube com watch v qsxhcwe krw h...,INFJ,1,0,0,1
1,1,finding lack post alarming sex boring posit...,ENTP,0,0,1,0
2,2,good one http www youtube com wat...,INTP,1,0,1,0
3,3,dear intp enjoyed conversation day esot...,INTJ,1,0,1,1
4,4,fired another silly misconception approach...,ENTJ,0,0,1,1


In [None]:
df_w[['I-E','S-N','T-F','J-P']].sum()