# Dataset

For unsupervised learning, we use a dataset extracted from https://www.kaggle.com/tunguz/big-five-personality-test

It contains the results of a personality test, which I've replicated here:
https://sondages.inria.fr/index.php/996557?lang=en

For additional explanations, see https://en.wikipedia.org/wiki/Big_Five_personality_traits

For each of the five traits, you get a score based on your answers (10 questions per trait, each answered on a scale from 1 to 5).
Some questions are asked in a positive way 'P' (i.e. the higher your answer, the more you correspond to this type of personality); others are asked in a negative way 'N' (the higher your answer, the less you correspond to it).
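
As a small illustration of the two keyings, the sketch below scores a single answer on the 1-5 scale. The helper name `keyed_score` is hypothetical and not part of the notebook; the reversal `6 - answer` for 'N' questions is the one used in the aggregation code further down.

```python
def keyed_score(answer: int, keying: str) -> int:
    """Return the contribution of one answer (1-5) to its trait.

    keying is 'P' for a positively asked question and 'N' for a negatively
    asked one. `keyed_score` is an illustrative helper, not notebook code.
    """
    if answer not in (1, 2, 3, 4, 5):
        raise ValueError("answer must be between 1 and 5")
    if keying not in ("P", "N"):
        raise ValueError("keying must be 'P' or 'N'")
    # A negatively asked answer is mirrored around 3: 1 <-> 5, 2 <-> 4.
    return 6 - answer if keying == "N" else answer


print(keyed_score(5, "P"))  # 5
print(keyed_score(5, "N"))  # 1
print(keyed_score(3, "N"))  # 3
```

Answering 5 to a 'P' question and 1 to an 'N' question therefore contribute the same amount to the trait.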

In [16]:
import numpy as np
import pandas as pd

The original dataset is heavy (500MB), so the following code aggregates the results and stores them in a 20MB file.

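For each trait, we compute a score from the person's ten answers: answers to negative questions are reversed (6 - answer), then the ten values are averaged. With answers from 1 to 5, the score lies between 1 and 5 (5: personality ++, 1: personality --).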

In this dataset, 1 line = 1 person

In [27]:
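df = pd.read_csv("./data/data-final.csv", sep='\t')
# Keep only the first 50 columns (the answers) and store them as 8-bit integers
df = pd.DataFrame(np.array(df[df.columns[:50]], dtype=np.int8),
                  columns=df.columns[:50])
# to_csv returns None when given a path, so its result must not be assigned back to df
df.to_csv('./data/opti.csv', index=False)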
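
The cast to `np.int8` in the cell above is possible because every answer fits in one byte. The following sketch, on randomly generated toy data rather than the real file, compares the in-memory size of the same answers stored as `int64` and as `int8`. The names `toy64` and `toy8` are assumptions made for this illustration. It only measures memory, not the size of the CSV written to disk.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the answer columns: 1000 people, 50 answers in 1..5.
rng = np.random.default_rng(0)
toy64 = pd.DataFrame(rng.integers(1, 6, size=(1000, 50), dtype=np.int64))

# Same values, one byte per answer instead of eight.
toy8 = toy64.astype(np.int8)

bytes64 = int(toy64.memory_usage(index=False).sum())
bytes8 = int(toy8.memory_usage(index=False).sum())
print(bytes64, bytes8)  # 400000 50000

# The cast does not change any answer.
print(bool((toy64 == toy8).all().all()))  # True
```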

In [38]:
question_schema = {
    'EXT1':'P', 'EXT2':'N', 'EXT3':'P', 'EXT4':'N', 'EXT5' :'P',
    'EXT6':'N', 'EXT7':'P', 'EXT8':'N', 'EXT9':'P', 'EXT10':'N',
    'EST1':'P', 'EST2':'N', 'EST3':'P', 'EST4':'N', 'EST5' :'P',
    'EST6':'P', 'EST7':'P', 'EST8':'P', 'EST9':'P', 'EST10':'P',
    'AGR1':'N', 'AGR2':'N', 'AGR3':'N', 'AGR4':'P', 'AGR5' :'N',
    'AGR6':'P', 'AGR7':'N', 'AGR8':'P', 'AGR9':'P', 'AGR10':'P',
    'CSN1':'P', 'CSN2':'N', 'CSN3':'P', 'CSN4':'N', 'CSN5' :'P',
    'CSN6':'N', 'CSN7':'P', 'CSN8':'P', 'CSN9':'P', 'CSN10':'P',
    'OPN1':'P', 'OPN2':'N', 'OPN3':'P', 'OPN4':'N', 'OPN5' :'P',
    'OPN6':'N', 'OPN7':'P', 'OPN8':'P', 'OPN9':'P', 'OPN10':'P',
}

In [39]:
df

Unnamed: 0,EXT1,EXT2,EXT3,EXT4,EXT5,EXT6,EXT7,EXT8,EXT9,EXT10,...,OPN1,OPN2,OPN3,OPN4,OPN5,OPN6,OPN7,OPN8,OPN9,OPN10
0,4,1,5,2,5,1,5,2,4,1,...,5,1,4,1,4,1,5,3,4,5
1,3,5,3,4,3,3,2,5,1,5,...,1,2,4,2,3,1,4,2,5,3
2,2,3,4,4,3,2,1,3,2,5,...,5,1,2,1,4,2,5,3,4,4
3,2,2,2,3,4,2,2,4,1,4,...,4,2,5,2,3,1,4,4,3,3
4,3,3,3,3,5,3,3,5,3,4,...,5,1,5,1,5,1,5,3,5,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1015336,4,2,4,3,4,3,3,3,3,3,...,2,2,4,3,4,2,4,2,2,4
1015337,4,3,4,3,3,3,4,4,3,3,...,4,1,5,1,5,1,3,4,5,4
1015338,4,2,4,3,5,1,4,2,4,4,...,5,1,5,1,4,1,5,5,4,5
1015339,2,4,3,4,2,2,1,4,2,4,...,5,2,4,2,3,2,4,5,5,3


In [49]:
names = ['EXT', 'EST', 'AGR', 'CSN', 'OPN']
arg_df = pd.DataFrame()
for name in names:
    # The ten question columns of this trait, e.g. EXT1 ... EXT10
    all_names = [name + str(k) for k in range(1, 11)]
    res = np.zeros(len(df))
    for val in all_names:
        # Reverse the answers to negative questions (1 <-> 5, 2 <-> 4)
        res_ext = 6 - df[val] if question_schema[val] == 'N' else df[val]
        res = res + np.array(res_ext)
    # Average over the ten questions of the trait
    arg_df[name] = res / 10
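
The same aggregation can be checked by hand on a tiny example. This is a minimal sketch on toy data, assuming answers between 1 and 5; `trait_scores`, `toy` and `toy_schema` are hypothetical names that do not appear in the notebook, and the toy trait has only two questions instead of ten.

```python
import numpy as np
import pandas as pd


def trait_scores(answers, schema, traits, n_questions=10):
    """Mean answer per trait, with 'N' questions reversed as 6 - answer."""
    scores = pd.DataFrame(index=answers.index)
    for trait in traits:
        cols = [f"{trait}{k}" for k in range(1, n_questions + 1)]
        # One boolean per column: True where the question is asked negatively.
        is_negative = np.array([schema[c] == "N" for c in cols])
        values = answers[cols].to_numpy()
        # The column mask broadcasts over all rows.
        keyed = np.where(is_negative, 6 - values, values)
        scores[trait] = keyed.mean(axis=1)
    return scores


# Toy data: two people, one trait, two questions (illustrative only).
toy = pd.DataFrame({"EXT1": [5, 2], "EXT2": [1, 4]})
toy_schema = {"EXT1": "P", "EXT2": "N"}
result = trait_scores(toy, toy_schema, ["EXT"], n_questions=2)
print(result)
```

The first person answers 5 to the positive question and 1 to the negative one, so both count as 5 and the score is 5.0. For the second person both count as 2, giving 2.0.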

In [52]:
arg_df.to_csv("./data/data.csv", index=False)

The result is the data.csv file in the data folder.

codebook.txt is taken directly from the Kaggle dataset and explains the data.