# Exploration of the Big Five Personality Dataset

This dataset was found on [Kaggle](https://www.kaggle.com/tunguz/big-five-personality-test). It contains 1,015,342 questionnaire responses collected online by [Open Psychometrics](https://openpsychometrics.org/tests/IPIP-BFFM/). The questionnaire used 50 items to measure five personality traits: openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism. Participants responded to each item (e.g. "I am always prepared") on a scale of 1-5 (1=Disagree, 3=Neutral, 5=Agree).

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('./data/data-final.csv', sep='\t')
df.head()

Unnamed: 0,EXT1,EXT2,EXT3,EXT4,EXT5,EXT6,EXT7,EXT8,EXT9,EXT10,...,dateload,screenw,screenh,introelapse,testelapse,endelapse,IPC,country,lat_appx_lots_of_err,long_appx_lots_of_err
0,4.0,1.0,5.0,2.0,5.0,1.0,5.0,2.0,4.0,1.0,...,2016-03-03 02:01:01,768.0,1024.0,9.0,234.0,6,1,GB,51.5448,0.1991
1,3.0,5.0,3.0,4.0,3.0,3.0,2.0,5.0,1.0,5.0,...,2016-03-03 02:01:20,1360.0,768.0,12.0,179.0,11,1,MY,3.1698,101.706
2,2.0,3.0,4.0,4.0,3.0,2.0,1.0,3.0,2.0,5.0,...,2016-03-03 02:01:56,1366.0,768.0,3.0,186.0,7,1,GB,54.9119,-1.3833
3,2.0,2.0,2.0,3.0,4.0,2.0,2.0,4.0,1.0,4.0,...,2016-03-03 02:02:02,1920.0,1200.0,186.0,219.0,7,1,GB,51.75,-1.25
4,3.0,3.0,3.0,3.0,5.0,3.0,3.0,5.0,3.0,4.0,...,2016-03-03 02:02:57,1366.0,768.0,8.0,315.0,17,2,KE,1.0,38.0
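
The preview truncates the middle columns, so it can help to build the list of 50 item column names explicitly and check that they are all present. The sketch below is only an illustration: it assumes the item columns follow the `<prefix><number>` pattern seen above, with the prefixes EXT, EST, AGR, CSN and OPN taken from the codebook, and it uses a hypothetical one-row stand-in (`toy`) instead of the real `df`.

```python
import pandas as pd

# Trait prefixes as they appear in the column names and the codebook.
trait_prefixes = ["EXT", "EST", "AGR", "CSN", "OPN"]

# Ten items per trait gives the 50 questionnaire items.
item_cols = [f"{prefix}{i}" for prefix in trait_prefixes for i in range(1, 11)]

# Hypothetical stand-in for `df`: one row with every item column plus one metadata column.
toy = pd.DataFrame([[3.0] * len(item_cols) + ["GB"]], columns=item_cols + ["country"])

# Any item column that is absent from the frame ends up in `missing`.
missing = [col for col in item_cols if col not in toy.columns]
print(len(item_cols), missing)
```

Running the same check against the real `df` instead of `toy` would confirm that no item column was lost when reading the file.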


A .txt codebook describing the data was downloaded from Kaggle along with the dataset. It lists the questions that participants were asked and the additional fields that were recorded (e.g. dateload, the timestamp at which the questionnaire was started).

In [4]:
# The context manager closes the file automatically, even if reading fails
with open('./data/codebook.txt', 'r') as f:
    for line in f:
        print(line)

This data was collected (2016-2018) through an interactive on-line personality test.

The personality test was constructed with the "Big-Five Factor Markers" from the IPIP. https://ipip.ori.org/newBigFive5broadKey.htm

Participants were informed that their responses would be recorded and used for research at the beginning of the test, and asked to confirm their consent at the end of the test.



The following items were presented on one page and each was rated on a five point scale using radio buttons. The order on page was was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc.

The scale was labeled 1=Disagree, 3=Neutral, 5=Agree



EXT1	I am the life of the party.

EXT2	I don't talk a lot.

EXT3	I feel comfortable around people.

EXT4	I keep in the background.

EXT5	I start conversations.

EXT6	I have little to say.

EXT7	I talk to a lot of different people at parties.

EXT8	I don't like to draw attention to myself.

EXT9	I don't mind being the center of attention.

EXT10	I am quiet around str

### Subset to single IP address use only

As stated in codebook.txt, some IP addresses were used more than once to complete the questionnaire (column 'IPC'). This could indicate multiple submissions from the same person or submissions over a shared network (e.g. a university). As seen below, 696,845 responses come from an IP address that was used only once. As suggested in codebook.txt, I will use only these responses in the remaining analyses for maximum cleanliness.

In [5]:
df['IPC'].value_counts()

1      696845
2      105868
3       34323
4       17332
5       11135
        ...  
103       103
99         99
98         98
88         88
87         87
Name: IPC, Length: 201, dtype: int64

In [6]:
# Keep only responses whose IP address appears once; copy so later edits do not touch df
single_IP_df = df.loc[df['IPC'] == 1].copy()
len(single_IP_df)

696845
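
To put the filter in perspective, the share of responses it keeps can be worked out from the two counts quoted in this notebook (696,845 single-IP responses out of 1,015,342 in total), which comes to roughly 69%. The sketch below repeats the same boolean-mask filter on a small hypothetical frame (`toy`) so that it runs without the data file; the toy values are invented for illustration only.

```python
import pandas as pd

# Hypothetical miniature of the dataset: IPC counts how often the IP address appears.
toy = pd.DataFrame({
    "EXT1": [4.0, 3.0, 2.0, 5.0],
    "IPC": [1, 2, 1, 2],
})

# Same idea as above: keep rows where the IP address was used exactly once.
single_ip = toy.loc[toy["IPC"] == 1].copy()
toy_share = len(single_ip) / len(toy)

# Share retained in the real data, using the two counts quoted in the text.
share_from_text = 696845 / 1015342
print(f"toy: {toy_share:.0%}, real data: {share_from_text:.1%}")
```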

### Reversing scores on negatively keyed items

By following the link to the [IPIP website](https://ipip.ori.org/newBigFive5broadKey.htm) provided in codebook.txt, we can see that some of the items are negatively keyed. For example, item EXT2 is "I don't talk a lot". As this is a measure of extraversion, we would expect someone high in extraversion to disagree with the statement and give a low score on the 1-5 scale (1=Disagree, 3=Neutral, 5=Agree). By contrast, we would expect the same person to agree with item EXT1 ("I am the life of the party") and give a high score. The scores on negatively keyed items therefore need to be reversed so that scores consistently indicate each personality trait (e.g. high scores on all EXT items indicate a tendency towards high extraversion).

In [7]:
neg_keyed_items = [
    'EXT2', 
    'EXT4', 
    'EXT6',
    'EXT8',
    'EXT10', 
    
    'EST1',
    'EST3',
    'EST5',
    'EST6',
    'EST7',
    'EST8',
    'EST9',
    'EST10',
    
    'AGR1',
    'AGR3',
    'AGR5',
    'AGR7',
    
    'CSN2',
    'CSN4',
    'CSN6',
    'CSN8',
    
    'OPN2',
    'OPN4',
    'OPN6'
]
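
A hand-typed list like this is easy to get wrong, so a quick consistency check can be worthwhile. The sketch below rebuilds the same item names from a per-trait dictionary of item numbers (copied from the list above, not independently from the IPIP key) and checks the total and the absence of duplicates. The names `neg_keyed_numbers` and `built` are hypothetical helpers introduced for this illustration.

```python
# Negatively keyed item numbers per trait, mirroring the list above.
neg_keyed_numbers = {
    "EXT": [2, 4, 6, 8, 10],
    "EST": [1, 3, 5, 6, 7, 8, 9, 10],
    "AGR": [1, 3, 5, 7],
    "CSN": [2, 4, 6, 8],
    "OPN": [2, 4, 6],
}

# Expand the dictionary into column names such as 'EXT2' or 'OPN6'.
built = [f"{trait}{n}" for trait, numbers in neg_keyed_numbers.items() for n in numbers]

# Number of negatively keyed items per trait, plus a duplicate check.
per_trait = {trait: len(numbers) for trait, numbers in neg_keyed_numbers.items()}
has_duplicates = len(built) != len(set(built))
print(len(built), per_trait, has_duplicates)
```

Comparing `built` with `neg_keyed_items` (for example `built == neg_keyed_items`) would flag any typo in either version.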

In [8]:
adj_df = single_IP_df.copy()
# Reverse the 1-5 scale on negatively keyed items; values not in the mapping become NaN
adj_df[neg_keyed_items] = adj_df[neg_keyed_items].apply(lambda x: x.map({1.0:5.0, 2.0:4.0, 3.0:3.0, 4.0:2.0, 5.0:1.0}))
adj_df.head()

Unnamed: 0,EXT1,EXT2,EXT3,EXT4,EXT5,EXT6,EXT7,EXT8,EXT9,EXT10,...,dateload,screenw,screenh,introelapse,testelapse,endelapse,IPC,country,lat_appx_lots_of_err,long_appx_lots_of_err
0,4.0,5.0,5.0,4.0,5.0,5.0,5.0,4.0,4.0,5.0,...,2016-03-03 02:01:01,768.0,1024.0,9.0,234.0,6,1,GB,51.5448,0.1991
1,3.0,1.0,3.0,2.0,3.0,3.0,2.0,1.0,1.0,1.0,...,2016-03-03 02:01:20,1360.0,768.0,12.0,179.0,11,1,MY,3.1698,101.706
2,2.0,3.0,4.0,2.0,3.0,4.0,1.0,3.0,2.0,1.0,...,2016-03-03 02:01:56,1366.0,768.0,3.0,186.0,7,1,GB,54.9119,-1.3833
3,2.0,4.0,2.0,3.0,4.0,4.0,2.0,2.0,1.0,2.0,...,2016-03-03 02:02:02,1920.0,1200.0,186.0,219.0,7,1,GB,51.75,-1.25
5,3.0,3.0,4.0,4.0,4.0,4.0,2.0,3.0,3.0,2.0,...,2016-03-03 02:03:12,1600.0,1000.0,4.0,196.0,3,1,SE,59.3333,18.05


In [9]:
len(adj_df)

696845
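
On a 1-5 scale, the reversal used above is equivalent to computing `6 - score`. The sketch below is one possible arithmetic alternative, not the method used in this notebook: it reverses the selected columns in a single vectorised step and, like the dictionary mapping, turns any value outside 1-5 into NaN. The function `reverse_scores` and the toy frame are hypothetical, and the `0.0` entry is an invented out-of-range value included only to show that behaviour.

```python
import pandas as pd

# Toy responses; the 0.0 is an invented out-of-range value for illustration.
toy = pd.DataFrame({
    "EXT1": [4.0, 3.0, 2.0],
    "EXT2": [1.0, 5.0, 0.0],
    "EXT4": [2.0, 4.0, 3.0],
})
toy_neg_items = ["EXT2", "EXT4"]


def reverse_scores(frame, items, low=1.0, high=5.0):
    """Return a copy with `items` reversed on a low..high scale.

    Values outside the scale (and missing values) come out as NaN.
    """
    out = frame.copy()
    block = out[items]
    valid = block.ge(low) & block.le(high)
    # low + high - score maps 1 to 5, 2 to 4, 3 to 3, and so on.
    out[items] = (low + high - block).where(valid)
    return out


reversed_toy = reverse_scores(toy, toy_neg_items)
print(reversed_toy)
```

The arithmetic form avoids spelling out every pair in a dictionary and extends to other scale lengths by changing `low` and `high`; the explicit mapping is arguably easier to read for a fixed five-point scale.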