# **Speed Dating**


Contents
--------
1. [Loading the dataset](#loading)
2. [Perspectives](#perspectives)



## <a name="loading"></a>Loading the dataset

In [1]:
import numpy as np
import pandas as pd

fname = "./Speed+Dating+Data.csv"
data = pd.read_csv(fname)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8e in position 195982: invalid start byte

Oops: the file contains a byte `\x8e` that is not valid UTF-8 at that position. Let's see which character it is meant to be.

In [4]:
with open(fname, 'rb') as f:
    raw_data = f.read()

i = raw_data.find(b'\x8e')
i

5176718

In [6]:
raw_data[i-50:i+150]

b'.00,2,25,Climate Dynamics,18.00,"Ecole Normale Sup\x8erieure, Paris",,,2,1,1,France,"78,110",,1,2,1,assistant master of the universe (otherwise it\'s too much work),15.00,8,2,5,10,10,10,7,1,9,8,3,7,9,10,1'

From the context ("Ecole Normale Sup\x8erieure"), `\x8e` should be read as "é" (LATIN SMALL LETTER E WITH ACUTE), which is the character the `mac_roman` charset assigns to this byte.
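As a quick cross-check of this reading, the byte can be decoded under a few candidate single-byte encodings. This is only a sketch: the extra candidates (`cp1252`, `latin_1`) are assumptions added for comparison and are not mentioned anywhere in the dataset.

```python
import unicodedata

byte = b"\x8e"
# Candidate encodings; cp1252 and latin_1 are assumptions added for comparison.
candidates = ["mac_roman", "cp1252", "latin_1"]

decoded = {enc: byte.decode(enc) for enc in candidates}
for enc, ch in decoded.items():
    # Control characters have no Unicode name, hence the fallback label.
    name = unicodedata.name(ch, "<no name: control character>")
    print(f"{enc:>9}: {ch!r} ({name})")
```

Of these three candidates, only `mac_roman` produces "é", which is the letter that fits "Supérieure".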

In [13]:
df = pd.read_csv(fname, encoding='mac_roman')
df.head()

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
0,1,1.0,0,1,1,1,10,7,,4,...,5.0,7.0,7.0,7.0,7.0,,,,,
1,1,1.0,0,1,1,1,10,7,,3,...,5.0,7.0,7.0,7.0,7.0,,,,,
2,1,1.0,0,1,1,1,10,7,,10,...,5.0,7.0,7.0,7.0,7.0,,,,,
3,1,1.0,0,1,1,1,10,7,,5,...,5.0,7.0,7.0,7.0,7.0,,,,,
4,1,1.0,0,1,1,1,10,7,,7,...,5.0,7.0,7.0,7.0,7.0,,,,,


In the following data analysis, we distinguish single-subject data (`'age'`, `'income'`, `'gender'`, etc.) from dating data (`'dec'`, `'age_o'`, etc.). We therefore build a reduced dataframe that contains only the subject-specific data.

Subjects are identified by their `'iid'`. If a column takes a single value across all rows of a given `'iid'`, and this holds for every `iid`, then the column must describe the subject rather than a particular date. For instance, for `iid = 1`, every row of the column `'age'` has the value 21. Since the same is true for all `iid`s, we can deduce that `'age'` pertains to the subject and not to the dating.
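The rule can be checked on a toy frame before applying it to the real data. This is a minimal sketch with made-up values; the frame `toy` and the names `max_distinct` and `is_subject_col` are hypothetical. Note that `nunique` ignores missing values by default, so a subject with no recorded value contributes 0 distinct values, which is why the comparison is `<= 1` rather than `== 1`.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values: two participants, two dates each.
toy = pd.DataFrame({
    "iid":    [1, 1, 2, 2],
    "age":    [21.0, 21.0, 24.0, 24.0],    # constant within each iid
    "income": [np.nan, np.nan, 5e4, 5e4],  # missing for iid 1, constant for iid 2
    "dec":    [1, 0, 0, 0],                # varies within iid 1
})

# Number of distinct non-missing values per participant and column,
# then the worst case over all participants.
max_distinct = toy.groupby("iid").nunique().max()

# At most one distinct value for every participant => subject-level column.
is_subject_col = max_distinct <= 1
print(is_subject_col)
```

Here `'age'` and `'income'` come out as subject-level columns, while `'dec'` does not because it varies within `iid = 1`.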

In [16]:
# for each iid (participant), unique value => entry pertains to the subject
subject_entries = df.groupby('iid').nunique().max() <= 1
# re-insert iid (the grouping key is absent from the result) so the mask lines up with df.columns
subject_entries = pd.concat([pd.Series([True], index=['iid']), subject_entries])
# multiple values => entry pertains to dating
partner_entries = ~subject_entries

subject_cols = df.columns[subject_entries]
subject_df = df[subject_cols].groupby('iid').aggregate(lambda x: x.iloc[0])

In [15]:
subject_df

iid,id,gender,idg,condtn,wave,round,age,field,field_cd,undergra,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
1,1.0,0,1,1,1,10,21.0,Law,1.0,,...,5.0,7.0,7.0,7.0,7.0,,,,,
2,2.0,0,3,1,1,10,24.0,law,1.0,,...,7.0,6.0,9.0,9.0,4.0,,,,,
3,3.0,0,5,1,1,10,25.0,Economics,2.0,,...,,,,,,,,,,
4,4.0,0,7,1,1,10,23.0,Law,1.0,,...,6.0,5.0,6.0,8.0,5.0,,,,,
5,5.0,0,9,1,1,10,21.0,Law,1.0,,...,4.0,5.0,10.0,6.0,10.0,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
548,18.0,1,36,2,21,22,30.0,Business,8.0,"University of Cologne, Germany",...,8.0,9.0,9.0,9.0,9.0,8.0,9.0,9.0,9.0,7.0
549,19.0,1,38,2,21,22,28.0,General management/finance,8.0,"LUISS, Rome",...,7.0,9.0,8.0,7.0,8.0,5.0,8.0,8.0,6.0,8.0
550,20.0,1,40,2,21,22,30.0,MBA,8.0,Oxford,...,,,,,,,,,,
551,21.0,1,42,2,21,22,27.0,Business,8.0,Harvard,...,,,,,,,,,,


The encoding issue also revealed another problem with the dataset, which calls for a word of caution: the answers are self-reported and not always serious. The participant who studied at Ecole Normale Supérieure clearly did not give an honest answer for `'career'`. To make this concrete, here are the rows corresponding to the raw bytes shown earlier.

In [11]:
# keep only the data relevant to the participant (not the people met):
# take the first row recorded for each iid
user_df = df.groupby('iid').aggregate(lambda x: x.iloc[0])


Unnamed: 0,iid,age,field,undergra,goal,career
0,1,21.0,Law,,2.0,lawyer
1,1,21.0,Law,,2.0,lawyer
2,1,21.0,Law,,2.0,lawyer
3,1,21.0,Law,,2.0,lawyer
4,1,21.0,Law,,2.0,lawyer
...,...,...,...,...,...,...
8373,552,25.0,Climate Dynamics,"Ecole Normale Supérieure, Paris",1.0,assistant master of the universe (otherwise it...
8374,552,25.0,Climate Dynamics,"Ecole Normale Supérieure, Paris",1.0,assistant master of the universe (otherwise it...
8375,552,25.0,Climate Dynamics,"Ecole Normale Supérieure, Paris",1.0,assistant master of the universe (otherwise it...
8376,552,25.0,Climate Dynamics,"Ecole Normale Supérieure, Paris",1.0,assistant master of the universe (otherwise it...
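Rows like those above can be pulled out by filtering on `'iid'` and selecting a few columns. The sketch below uses a small stand-in frame (`toy`, a hypothetical name) instead of the real file, with values copied from the outputs shown earlier; it also lists one `'career'` answer per participant, which is one possible way to review free-text fields by eye.

```python
import pandas as pd

# Stand-in for df: a few rows with values copied from the outputs above.
joke_career = "assistant master of the universe (otherwise it's too much work)"
toy = pd.DataFrame({
    "iid":    [1, 1, 552, 552],
    "age":    [21.0, 21.0, 25.0, 25.0],
    "field":  ["Law", "Law", "Climate Dynamics", "Climate Dynamics"],
    "career": ["lawyer", "lawyer", joke_career, joke_career],
})

# All dating rows of one participant, restricted to a few columns.
cols = ["iid", "age", "field", "career"]
rows_552 = toy.loc[toy["iid"] == 552, cols]
print(rows_552)

# One career answer per participant, for a quick manual review.
careers = toy.groupby("iid")["career"].first()
print(careers)
```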


## <a name="perspectives"></a>Perspectives

Possible extensions, given more data or more precise data:

- Zip code: do people prefer those who live closer?
- Sexual orientation