# Main study

In this file, I'll explore the data from the OKCupid app and try to build ML models for analyzing and predicting it.

The whats, hows, and any further questions will come up as I study the dataset. So read this file as a diary, where I code and ask myself questions.

First, reading the data and getting the column names.

In [71]:
import pandas as pd

profiles = pd.read_csv('profiles.csv')

profiles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          59946 non-null  int64  
 1   body_type    54650 non-null  object 
 2   diet         35551 non-null  object 
 3   drinks       56961 non-null  object 
 4   drugs        45866 non-null  object 
 5   education    53318 non-null  object 
 6   essay0       54458 non-null  object 
 7   essay1       52374 non-null  object 
 8   essay2       50308 non-null  object 
 9   essay3       48470 non-null  object 
 10  essay4       49409 non-null  object 
 11  essay5       49096 non-null  object 
 12  essay6       46175 non-null  object 
 13  essay7       47495 non-null  object 
 14  essay8       40721 non-null  object 
 15  essay9       47343 non-null  object 
 16  ethnicity    54266 non-null  object 
 17  height       59943 non-null  float64
 18  income       59946 non-null  int64  
 19  job 

In [72]:
profiles.head()

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,about me:<br />\n<br />\ni would love to think...,currently working as an international agent fo...,making people laugh.<br />\nranting about a go...,"the way i look. i am a six foot half asian, ha...",...,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,...,"oakland, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,...,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,...,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
4,29,athletic,,socially,never,graduated from college/university,hey how's it going? currently vague on the pro...,work work work work + play,creating imagery to look at:<br />\nhttp://bag...,i smile a lot and my inquisitive nature,...,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single


In [73]:
profiles.loc[0]

age                                                           22
body_type                                         a little extra
diet                                           strictly anything
drinks                                                  socially
drugs                                                      never
education                          working on college/university
essay0         about me:<br />\n<br />\ni would love to think...
essay1         currently working as an international agent fo...
essay2         making people laugh.<br />\nranting about a go...
essay3         the way i look. i am a six foot half asian, ha...
essay4         books:<br />\nabsurdistan, the republic, of mi...
essay5         food.<br />\nwater.<br />\ncell phone.<br />\n...
essay6                               duality and humorous things
essay7         trying to find someone to hang out with. i am ...
essay8         i am new to california and looking for someone...
essay9         you want t

## Fixing data

Fun stuff. My first ideas are to label-encode and one-hot encode some columns, but let's keep it open-ended.

Let's check `body_type`

In [74]:
profiles['body_type'].value_counts()

body_type
average           14652
fit               12711
athletic          11819
thin               4711
curvy              3924
a little extra     2629
skinny             1777
full figured       1009
overweight          444
jacked              421
used up             355
rather not say      198
Name: count, dtype: int64

In [75]:
profiles['body_type'].value_counts(1)

body_type
average           0.268106
fit               0.232589
athletic          0.216267
thin              0.086203
curvy             0.071802
a little extra    0.048106
skinny            0.032516
full figured      0.018463
overweight        0.008124
jacked            0.007704
used up           0.006496
rather not say    0.003623
Name: proportion, dtype: float64

The idea was to label-encode this data, but there's no natural order (and imposing one would have some social implications, I guess), so one-hot encoding feels like the right way to feed this data into the ML models.
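To make the difference concrete, here is a minimal sketch on a toy Series (the `toy` values are made up for illustration and are not read from `profiles.csv`). Label encoding forces an order on the categories, while one-hot encoding gives each category its own 0/1 column.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.Series(['fit', 'average', 'thin', 'fit'], name='body_type')

# Label encoding: categories get integer codes in alphabetical order,
# which implies average < fit < thin (an order that means nothing here)
label_encoded = toy.astype('category').cat.codes

# One-hot encoding: one 0/1 column per category, no implied order
one_hot = pd.get_dummies(toy, prefix='body_type', dtype=int)

print(label_encoded.tolist())
print(one_hot.columns.tolist())
```

Each row of `one_hot` has exactly one 1, so no category is treated as "bigger" than another.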

In [76]:
# note that there are both 'rather not say' values and missing values; the missing ones are filled with 'unknown'
profiles['body_type'] = profiles['body_type'].fillna(value='unknown')

profiles = pd.get_dummies(profiles, prefix='body_type', columns=['body_type'], dtype=int)


Now for `diet`

In [77]:
profiles['diet'].value_counts()

diet
mostly anything        16585
anything                6183
strictly anything       5113
mostly vegetarian       3444
mostly other            1007
strictly vegetarian      875
vegetarian               667
strictly other           452
mostly vegan             338
other                    331
strictly vegan           228
vegan                    136
mostly kosher             86
mostly halal              48
strictly halal            18
strictly kosher           18
halal                     11
kosher                    11
Name: count, dtype: int64

In [78]:
profiles['diet'].value_counts(1)

diet
mostly anything        0.466513
anything               0.173919
strictly anything      0.143822
mostly vegetarian      0.096875
mostly other           0.028326
strictly vegetarian    0.024613
vegetarian             0.018762
strictly other         0.012714
mostly vegan           0.009507
other                  0.009311
strictly vegan         0.006413
vegan                  0.003825
mostly kosher          0.002419
mostly halal           0.001350
strictly halal         0.000506
strictly kosher        0.000506
halal                  0.000309
kosher                 0.000309
Name: proportion, dtype: float64

Here I won't even attempt label encoding. But plain dummies would create 18 new columns, one per value.

But some values repeat themselves: some people are 'mostly vegan' and others are 'strictly vegan'. So the idea is to store information like vegan = true and strictness = mostly (or something like this).

So I'll one-hot encode the type of diet (vegan, anything, kosher ...).

But the 'strictness' feels weird. In one sense, it feels logical for no prefix to be labeled 0, 'mostly' 1, and 'strictly' 2. On the other hand, imagine one person answered 'vegan', another 'mostly vegan', and the last 'strictly vegan'. Then the weakest commitment to being vegan is 'mostly' and the strongest is 'strictly', which puts the no-prefix answer in the middle. In that sense, the baseline would be 1 (no prefix), 'mostly' would be 0, and 'strictly' would be 2 (unchanged).

There's also the question of why someone would answer 'anything' versus 'strictly anything', which I can't quite grasp, so I'll just stick with the numbers. Also, we'll set the strictness baseline at 0 for convenience, so 'mostly' is -1 and 'strictly' is 1.

**Conclusion**: We'll one-hot encode diet_type and label strictness as such:

```python
{
    'strictly': 1,
    'mostly': -1,
    # anything else is a 0 (middle value)
}
```
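As a quick sanity check of this scheme, here is a minimal sketch that splits a few toy answers into a diet type and a strictness label. The `split_diet` helper, the `STRICTNESS` dict, and the toy values are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Strictness labels from the conclusion above; no prefix means 0
STRICTNESS = {'strictly': 1, 'mostly': -1}

def split_diet(value):
    """Return (diet_type, strictness) for one diet answer."""
    prefix, _, rest = value.partition(' ')
    if prefix in STRICTNESS:
        return rest, STRICTNESS[prefix]
    # No recognised prefix: the whole answer is the diet type
    return value, 0

# Toy data, made up for illustration only
diet = ['mostly vegan', 'vegan', 'strictly vegan', 'strictly anything', 'unknown']

rows = [split_diet(v) for v in diet]
result = pd.DataFrame(rows, columns=['diet_type', 'diet_strictness'])
print(result)
```

Note that the prefix has to be spelled exactly 'strictly' to match the values in the data; a misspelled key would silently label every strict answer as 0.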


In [79]:
# fill NaN with 'unknown' (plain assignment; calling fillna with inplace=True
# on a selected column is deprecated and will stop working in pandas 3.0)
profiles['diet'] = profiles['diet'].fillna('unknown')

# create and one-hot encode the diet types
profiles['diet_type'] = profiles['diet'].str.replace(r'^(strictly |mostly )', '', regex=True)
profiles = pd.get_dummies(profiles, columns=['diet_type'], dtype=int)

# helper for labelling strictness
def get_strictness(x):
    if 'strictly' in x:
        return 1
    elif 'mostly' in x:
        return -1
    else:
        return 0

# labels
profiles['diet_strictness'] = profiles['diet'].apply(get_strictness)

profiles = profiles.drop(columns=['diet'])

profiles.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 50 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       59946 non-null  int64  
 1   drinks                    56961 non-null  object 
 2   drugs                     45866 non-null  object 
 3   education                 53318 non-null  object 
 4   essay0                    54458 non-null  object 
 5   essay1                    52374 non-null  object 
 6   essay2                    50308 non-null  object 
 7   essay3                    48470 non-null  object 
 8   essay4                    49409 non-null  object 
 9   essay5                    49096 non-null  object 
 10  essay6                    46175 non-null  object 
 11  essay7                    47495 non-null  object 
 12  essay8                    40721 non-null  object 
 13  essay9                    47343 non-null  object 
 14  ethnic

## Now for drinking!!



In [80]:
print(profiles['drinks'].value_counts())
print(profiles['drinks'].value_counts(1))

drinks
socially       41780
rarely          5957
often           5164
not at all      3267
very often       471
desperately      322
Name: count, dtype: int64
drinks
socially       0.733484
rarely         0.104580
often          0.090659
not at all     0.057355
very often     0.008269
desperately    0.005653
Name: proportion, dtype: float64


There seems to be an order to it, so labeling from 0 to 5 (6 values) might be the way to go:
- 0 -> not at all
- 1 -> rarely
- 2 -> socially
- 3 -> often
- 4 -> very often
- 5 -> desperately

BUT missing values are a pain. 'socially' IS the most common value by a long shot, so instead of filling NaN with the string 'unknown', I'll opt to make NaN become 'socially'.

There are obviously pros and cons to doing this, but since the answers have an implied order, keeping that order feels like the right choice.
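One way to soften the downside of filling with the mode is to remember which rows were missing before imputing. This is only a sketch on toy data, and the `drinks_missing` column is a hypothetical addition that the notebook itself does not create.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({'drinks': ['socially', np.nan, 'rarely', 'socially', np.nan]})

# Flag the rows that had no answer, before the information is lost
toy['drinks_missing'] = toy['drinks'].isna().astype(int)

# Fill the gaps with the most frequent answer
mode_value = toy['drinks'].mode()[0]
toy['drinks'] = toy['drinks'].fillna(mode_value)

print(toy)
```

With the flag in place, a model can still tell an imputed 'socially' apart from a real one, while the ordinal column keeps its order.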



In [81]:
# fill missing values with the mode, then map each answer to its ordinal label
profiles['drinks'] = profiles['drinks'].fillna('socially')

profiles['drinks'] = profiles['drinks'].map({
    'not at all': 0,
    'rarely': 1,
    'socially': 2,
    'often': 3,
    'very often': 4,
    'desperately': 5,
})

print(profiles['drinks'].value_counts())
print(profiles['drinks'].value_counts(1))

drinks
2    44765
1     5957
3     5164
0     3267
4      471
5      322
Name: count, dtype: int64
drinks
2    0.746755
1    0.099373
3    0.086144
0    0.054499
4    0.007857
5    0.005372
Name: proportion, dtype: float64


In [82]:
profiles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 50 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       59946 non-null  int64  
 1   drinks                    59946 non-null  int64  
 2   drugs                     45866 non-null  object 
 3   education                 53318 non-null  object 
 4   essay0                    54458 non-null  object 
 5   essay1                    52374 non-null  object 
 6   essay2                    50308 non-null  object 
 7   essay3                    48470 non-null  object 
 8   essay4                    49409 non-null  object 
 9   essay5                    49096 non-null  object 
 10  essay6                    46175 non-null  object 
 11  essay7                    47495 non-null  object 
 12  essay8                    40721 non-null  object 
 13  essay9                    47343 non-null  object 
 14  ethnic

Now for the drugs

In [83]:
print(profiles['drugs'].value_counts(0))
print(profiles['drugs'].value_counts(1))

drugs
never        37724
sometimes     7732
often          410
Name: count, dtype: int64
drugs
never        0.822483
sometimes    0.168578
often        0.008939
Name: proportion, dtype: float64


For exploratory reasons, I'll set all missing values to the mode ('never'). It can be a controversial choice, since not answering the drug question may itself reflect some stigma, but I'll just keep this in mind when analysing the model. ALSO there are about 14k missing values in this column (59946 - 45866 = 14080), so there's that.

I chose to do this because there's a clear order to the labels, and I'd rather keep it :)
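An alternative way to express the same order is an ordered categorical, sketched here on toy data (the `levels` list and `toy` values are assumptions for illustration). It makes the order explicit and marks missing answers with the code -1, so the decision about how to fill them can be made later.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
toy = pd.Series(['never', 'often', np.nan, 'sometimes'])

# Explicit order, from least to most frequent use
levels = ['never', 'sometimes', 'often']
cat = pd.Categorical(toy, categories=levels, ordered=True)

# Integer codes follow the order of `levels`; missing values get -1
codes = pd.Series(cat.codes)
print(codes.tolist())  # [0, 2, -1, 1]
```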

In [84]:
profiles['drugs'] = profiles['drugs'].fillna('never')

profiles['drugs'] = profiles['drugs'].map({
    'never': 0,
    'sometimes': 1,
    'often': 2
})

In [85]:
print(profiles['drugs'].value_counts(0))
print(profiles['drugs'].value_counts(1))

drugs
0    51804
1     7732
2      410
Name: count, dtype: int64
drugs
0    0.864178
1    0.128983
2    0.006839
Name: proportion, dtype: float64


Once again, let's keep in mind this should affect our model in the end!

## Education!

In [86]:
print(profiles['education'].value_counts(0))
print(profiles['education'].value_counts(1))

education
graduated from college/university    23959
graduated from masters program        8961
working on college/university         5712
working on masters program            1683
graduated from two-year college       1531
graduated from high school            1428
graduated from ph.d program           1272
graduated from law school             1122
working on two-year college           1074
dropped out of college/university      995
working on ph.d program                983
college/university                     801
graduated from space camp              657
dropped out of space camp              523
graduated from med school              446
working on space camp                  445
working on law school                  269
two-year college                       222
working on med school                  212
dropped out of two-year college        191
dropped out of masters program         140
masters program                        136
dropped out of ph.d program            127
d

This one is cool, and it follows the same idea as the diets. We'll have an education status, where the options are graduated, working on, and dropped out, and an education level, which can be college/university, masters program, two-year college, high school, ph.d program, law school, space camp (wow), or med school.

Here, NaN becomes 'unknown', since it's literally that, and there will be no attempt to order anything.
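The status/level split can also be sketched with a single regular expression. This is a toy illustration only: the `parts` and `status` names are hypothetical, and the 'unspecified' label for answers without a prefix is an assumption of this sketch.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.Series([
    'graduated from masters program',
    'working on space camp',
    'dropped out of law school',
    'college/university',
    'unknown',
])

# Optional status prefix in the first group, education level in the second
parts = toy.str.extract(r'^(?:(graduated from|working on|dropped out of) )?(.+)$')
parts.columns = ['status', 'level']

# No prefix means the status was not specified; 'unknown' rows stay 'unknown'
status = parts['status'].fillna('unspecified')
status = status.mask(parts['level'] == 'unknown', 'unknown')

print(pd.DataFrame({'status': status, 'level': parts['level']}))
```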


In [87]:
profiles['education'] = profiles['education'].fillna('unknown')
# get columns with only education_level
profiles['education_level'] = profiles['education'].replace(['graduated from ', 'working on ', 'dropped out of '], '', regex=True)

# get column with education status
# helper that maps each answer to its status
def get_edu_status(x):
    if 'graduated from ' in x: 
        return 'graduated'
    elif 'working on ' in x:
        return 'working_on'
    elif 'dropped out of ' in x: 
        return 'dropped_out'
    elif 'unknown' in x:
        return 'unknown'
    else:
        return 'unspecified'

profiles['education_status'] = profiles['education'].apply(get_edu_status)
profiles = pd.get_dummies(profiles, columns=['education_status', 'education_level'], dtype=int)

profiles = profiles.drop('education', axis=1)


profiles.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 63 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   age                                 59946 non-null  int64  
 1   drinks                              59946 non-null  int64  
 2   drugs                               59946 non-null  int64  
 3   essay0                              54458 non-null  object 
 4   essay1                              52374 non-null  object 
 5   essay2                              50308 non-null  object 
 6   essay3                              48470 non-null  object 
 7   essay4                              49409 non-null  object 
 8   essay5                              49096 non-null  object 
 9   essay6                              46175 non-null  object 
 10  essay7                              47495 non-null  object 
 11  essay8                              40721

## Ethnicity!
Let's have a look

In [88]:
# just so I can see everything
pd.set_option('display.max_rows', None)
print(profiles['ethnicity'].value_counts(0).sum())
print(profiles['ethnicity'].value_counts(1))

54266
ethnicity
white                                                                                                      0.605001
asian                                                                                                      0.113036
hispanic / latin                                                                                           0.052022
black                                                                                                      0.037003
other                                                                                                      0.031438
hispanic / latin, white                                                                                    0.023974
indian                                                                                                     0.019847
asian, white                                                                                               0.014945
white, other                                            

So yeah, a bunch of options and some NaN values. I'll fill those with 'unknown' and we'll one-hot encode the rest!
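Since one answer can list several ethnicities separated by ', ' (for example 'hispanic / latin, white'), this is really a multi-label column. A minimal sketch of one way to encode it, using `Series.str.get_dummies` on toy data (the `toy` values are made up for illustration):

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.Series(['white', 'asian, white', 'hispanic / latin, white', 'unknown'])

# Split each answer on ', ' and create one 0/1 column per distinct label
dummies = toy.str.get_dummies(sep=', ')

print(dummies)
```

A row that lists two ethnicities gets a 1 in both columns, which a plain `pd.get_dummies` on the raw strings would not do.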

In [89]:
profiles['ethnicity'] = profiles['ethnicity'].fillna('unknown')

ethinicities = ['white', 'asian', 'hispanic / latin', 'black', 'other', 'indian', 'pacific islander', 'native american', 'middle eastern', 'unknown']

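for ethinicity in ethinicities:
    col_name = 'ethinicity_' + ethinicity.replace(' / ', '_').replace(' ', '_')
    # literal substring match, so a row listing several ethnicities sets several columns to 1
    profiles[col_name] = profiles['ethnicity'].str.contains(ethinicity, regex=False).astype(int)
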
profiles = profiles.drop('ethnicity', axis=1)

profiles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 72 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   age                                 59946 non-null  int64  
 1   drinks                              59946 non-null  int64  
 2   drugs                               59946 non-null  int64  
 3   essay0                              54458 non-null  object 
 4   essay1                              52374 non-null  object 
 5   essay2                              50308 non-null  object 
 6   essay3                              48470 non-null  object 
 7   essay4                              49409 non-null  object 
 8   essay5                              49096 non-null  object 
 9   essay6                              46175 non-null  object 
 10  essay7                              47495 non-null  object 
 11  essay8                              40721

In [90]:
profiles.iloc[0]

age                                                                                  22
drinks                                                                                2
drugs                                                                                 0
essay0                                about me:<br />\n<br />\ni would love to think...
essay1                                currently working as an international agent fo...
essay2                                making people laugh.<br />\nranting about a go...
essay3                                the way i look. i am a six foot half asian, ha...
essay4                                books:<br />\nabsurdistan, the republic, of mi...
essay5                                food.<br />\nwater.<br />\ncell phone.<br />\n...
essay6                                                      duality and humorous things
essay7                                trying to find someone to hang out with. i am ...
essay8                          

## Height!

It's a straight number! Just filling NaN with the mean will suffice (only 3 values are missing).

In [91]:
profiles['height'] = profiles['height'].fillna(profiles['height'].mean())

## Income!

In [92]:
print(profiles['income'].value_counts(0))

income
-1          48442
 20000       2952
 100000      1621
 80000       1111
 30000       1048
 40000       1005
 50000        975
 60000        736
 70000        707
 150000       631
 1000000      521
 250000       149
 500000        48
Name: count, dtype: int64


A LOT of -1's (48442 of 59946 rows), presumably meaning no answer. I'll change those to 0's.

But I think this wouldn't be a very nice column for training. We'll keep this in mind.
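One thing to keep in mind is that a 0 reads as "no income" rather than "no answer". A hedged alternative, sketched on toy data, is to keep the gap as NaN and add a flag column; `income_reported` is a hypothetical column name that the notebook does not use.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({'income': [-1, 20000, -1, 100000]})

# Flag the rows where an income was actually given
toy['income_reported'] = (toy['income'] != -1).astype(int)

# Keep the gap as NaN so statistics only use the reported values
toy['income'] = toy['income'].replace(-1, np.nan)

print(toy['income'].mean())  # 60000.0
```

With 0's in place of the -1's, the same mean would drop to 30000, which shows how much the placeholder can distort the column.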

In [93]:
profiles['income'] = profiles['income'].replace(-1, 0)

## jobs!

In [94]:
profiles['job'].value_counts(0)


job
other                                7589
student                              4882
science / tech / engineering         4848
computer / hardware / software       4709
artistic / musical / writer          4439
sales / marketing / biz dev          4391
medicine / health                    3680
education / academia                 3513
executive / management               2373
banking / financial / real estate    2266
entertainment / media                2250
law / legal services                 1381
hospitality / travel                 1364
construction / craftsmanship         1021
clerical / administrative             805
political / government                708
rather not say                        436
transportation                        366
unemployed                            273
retired                               250
military                              204
Name: count, dtype: int64

Another categorical column with many values and no order, so one-hot encoding again!

In [95]:
profiles['job'] = profiles['job'].fillna('unknown')

profiles = pd.get_dummies(profiles, columns=['job'], dtype=int)

## 

In [None]:
profiles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 93 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   age                                    59946 non-null  int64  
 1   drinks                                 59946 non-null  int64  
 2   drugs                                  59946 non-null  int64  
 3   essay0                                 54458 non-null  object 
 4   essay1                                 52374 non-null  object 
 5   essay2                                 50308 non-null  object 
 6   essay3                                 48470 non-null  object 
 7   essay4                                 49409 non-null  object 
 8   essay5                                 49096 non-null  object 
 9   essay6                                 46175 non-null  object 
 10  essay7                                 47495 non-null  object 
 11  es

In [100]:
profiles['orientation'].value_counts()

orientation
straight    51606
gay          5573
bisexual     2767
Name: count, dtype: int64