# Name(s): Ojas Patel, Pranav Naravetla, Suhas Dara, Avinash Damania

# OKCupid Data Mining Project

# Introduction


In this project, we use an OKCupid dataset to predict a user's education level from other information on their dating profile, such as physical traits and lifestyle choices. This prediction is useful for people whose dating preferences include education. Users often leave profile fields blank, so someone could match most of your preferences but be at a different stage of their education or career. For those who would prefer to date someone they can relate to in terms of school or work, a predicted education level can fill that gap.

The OKCupid dataset we will use contains 19782 profiles from various residents of California. These profiles each have several different attributes that describe the person, such as age, body type, diet, and more. These attributes will be the basis of our predictive models.

In [1]:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

# Some headers
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import model_selection, preprocessing, decomposition, neighbors, pipeline, tree, svm, multiclass
from sklearn import naive_bayes, neural_network, ensemble, metrics, linear_model

## Data Exploration

In this section, we will clean and engineer the data in preparation for use in training our models.

### Data Prep

In [2]:
df = pd.read_csv("test_profiles.csv")
df.columns

# drop rows that have no education value, since education is the label we want to predict
df = df[df.education.notnull()]
df = df.reset_index(drop=True)

label = df['education']
data = df.drop(columns=['education'])

data.head()

Unnamed: 0.1,Unnamed: 0,age,body_type,diet,drinks,drugs,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,0,28,athletic,mostly anything,socially,sometimes,i'm looking to meet someone who i have a lot i...,"i work in the tech industry during the day, an...","i'm an expert at scrabble, designing ambigrams...","my lip ring, which i use to distract people fr...",...,"san francisco, california",,straight,likes dogs and has cats,christianity and laughing about it,m,taurus but it doesn&rsquo;t matter,no,"english (fluently), chinese (fluently), japane...",single
1,1,34,average,mostly anything,socially,never,"they say i'm a smart, funny, worldly girl who ...","i'm transitioning from a lost, once ambitious ...",listening<br />\n<br />\nnot judging others<br...,eyes<br />\n<br />\nsmile<br />\n<br />\nthoug...,...,"san francisco, california",,straight,has dogs and likes cats,,f,aries and it&rsquo;s fun to think about,no,english,single
2,2,29,fit,mostly anything,socially,never,update: i am in bmore/philly till may 27th for...,i'm self-employed events technician and it's w...,"being a smart ass, saying inappropriate things...",my charming good looks and the piece of lettuc...,...,"oakland, california",,straight,likes dogs and likes cats,,m,,no,"english (fluently), swedish (okay), spanish (p...",single
3,3,45,athletic,,not at all,never,,,,,...,"san francisco, california",,gay,likes dogs and likes cats,other,m,cancer,no,"english, spanish (poorly)",single
4,4,37,average,,socially,,hmmi freely give compliments. i appreciate a g...,i'm a teacher by day and i usually love it. i'...,,,...,"san francisco, california",,straight,,,f,taurus and it&rsquo;s fun to think about,no,english,single


In [3]:
len(data)

1782

In [4]:
data.count()

Unnamed: 0     1782
age            1782
body_type      1635
diet           1082
drinks         1727
drugs          1392
essay0         1615
essay1         1572
essay2         1530
essay3         1469
essay4         1509
essay5         1497
essay6         1429
essay7         1451
essay8         1235
essay9         1453
ethnicity      1627
height         1782
income         1782
job            1609
last_online    1782
location       1782
offspring       757
orientation    1782
pets           1230
religion       1243
sex            1782
sign           1511
smokes         1645
speaks         1780
status         1782
dtype: int64

In [5]:
data = data.drop(columns=['Unnamed: 0', 'essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6',
                          'essay7', 'essay8', 'essay9', 'last_online', 'sign', 'offspring', 'diet', 'speaks',
                          'location', 'status', 'income'])

# removing Unnamed: 0 as it is a repeat of the index

# removing essay0, essay1, essay2, essay3, essay4, essay5, essay6, essay7, essay8, essay9 as every row has
# a unique value

# removing last_online as it can't be used to predict the label (education)

# removing sign as it can't be used to predict the label (education)

# removing offspring as too many rows have NaN as a value

# removing diet as too many rows have NaN as a value

# removing speaks as there are too many distinct values and cannot be mapped into a smaller domain

# removing location as there are too many distinct values and it cannot be used to predict the label (education)

# removing status as there is a heavy imbalance with almost all of the values being 'single'

# removing income as almost all values are not listed (value put as -1)
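
The reasons given above for dropping columns (too many NaNs, too many distinct values, a sentinel value such as -1) can be checked numerically. The following is a minimal sketch on a toy frame, not the code used for this project; the helper name `summarize_columns` and the toy values are assumptions made for illustration.

```python
import pandas as pd

def summarize_columns(frame):
    """Per-column share of missing values and number of distinct non-null values."""
    return pd.DataFrame({
        "missing_share": frame.isna().mean(),
        "n_unique": frame.nunique(),
    })

# Toy stand-in for a few profile columns (values are made up)
toy = pd.DataFrame({
    "offspring": [None, None, "has kids", None],
    "status": ["single", "single", "single", "available"],
    "income": [-1, -1, 50000, -1],
})

summary = summarize_columns(toy)
# A sentinel such as -1 is not NaN, so it has to be counted separately
sentinel_share = (toy["income"] == -1).mean()

print(summary)
print("share of income == -1:", sentinel_share)
```

A column with a high `missing_share`, a very high `n_unique`, or a high sentinel share is a candidate for removal under the criteria listed above.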

In [6]:
# dictionary to consolidate the labels for education
label_engineering = {
    'graduated from college/university': 'bachelors',
    'working on college/university': 'bachelors',
    'working on masters program': 'advanced degree',
    'graduated from two-year college': 'associates',
    'graduated from high school': 'high-school',
    'graduated from ph.d program': 'advanced degree',
    'graduated from law school': 'advanced degree',
    'working on two-year college': 'associates',
    'working on ph.d program': 'advanced degree',
    'dropped out of college/university': 'high-school',
    'college/university': 'bachelors',
    'graduated from space camp': 'spacecamp',
    'dropped out of space camp': 'spacecamp',
    'graduated from med school': 'advanced degree',
    'working on space camp': 'spacecamp',
    'working on law school': 'advanced degree',
    'working on med school': 'advanced degree',
    'dropped out of two-year college': 'high-school',
    'two-year college': 'associates',
    'masters program': 'advanced degree',
    'dropped out of masters program': 'advanced degree',
    'dropped out of ph.d program': 'advanced degree',
    'high school': 'high-school',
    'dropped out of high school': 'high-school',
    'working on high school': 'high-school',
    'space camp': 'spacecamp',
    'ph.d program': 'advanced degree',
    'med school': 'advanced degree',
    'law school': 'advanced degree',
    'dropped out of law school': 'advanced degree',
    'dropped out of med school': 'advanced degree',
    'graduated from masters program': 'advanced degree'}

In [7]:
label = label.replace(label_engineering)

In [8]:
label.value_counts()

bachelors          1059
advanced degree     501
associates           84
high-school          77
spacecamp            61
Name: education, dtype: int64
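
Because `replace` silently leaves any value it has no key for, it can help to compare the dictionary keys against the values that actually occur. This is a small sketch under the assumption that the mapping is a plain dict; `check_mapping` and the toy labels are hypothetical names used only here.

```python
import pandas as pd

def check_mapping(series, mapping):
    """Return (values with no key in mapping, keys that never occur in series)."""
    observed = set(series.dropna().unique())
    unmapped = sorted(observed - set(mapping))
    unused = sorted(set(mapping) - observed)
    return unmapped, unused

toy_labels = pd.Series(["masters program", "working on space camp", "high school"])
toy_mapping = {
    "masters program": "advanced degree",
    "high school": "high-school",
    "hgih school": "high-school",  # deliberately misspelled key, never matches
}

unmapped, unused = check_mapping(toy_labels, toy_mapping)
print("values without a mapping:", unmapped)
print("keys that match nothing:", unused)
```

An empty `unmapped` list means every raw label is consolidated; a non-empty `unused` list usually points at a misspelled key.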

In [9]:
# Now we must fill in missing values for remaining columns
data.count()

age            1782
body_type      1635
drinks         1727
drugs          1392
ethnicity      1627
height         1782
job            1609
orientation    1782
pets           1230
religion       1243
sex            1782
smokes         1645
dtype: int64

In [10]:
# age, sex, orientation are completely full
# First we will check unique values for body_type and fill in missing values
data['body_type'].value_counts()

average           428
fit               377
athletic          367
thin              147
curvy             115
a little extra     73
skinny             51
full figured       38
jacked             16
overweight         13
used up             6
rather not say      4
Name: body_type, dtype: int64

In [11]:
# About 8% of the rows are missing a value for body_type. Given that 'average' is the mode of all the values
# and that it is reasonable to assume that the typical person has an 'average' body_type, we will fill the NaNs
# with 'average'

# dictionary to consolidate the values for the body_type column because a lot of them are redundant
body_type_dictionary = {
    'average': 'average',
    'athletic': 'athletic',
    'fit': 'athletic',
    'thin': 'underweight',
    'curvy': 'overweight',
    'a little extra': 'overweight',
    'skinny': 'underweight',
    'full figured': 'overweight',
    'jacked': 'athletic',
    'overweight': 'overweight',
    'used up': 'overweight',
    'rather not say': 'average'
}

data['body_type'] = data['body_type'].fillna('average')
data['body_type'] = data['body_type'].replace(body_type_dictionary)
data['body_type'].value_counts()

athletic       760
average        579
overweight     245
underweight    198
Name: body_type, dtype: int64

In [12]:
# we will check unique values for drinks and fill in missing values
data['drinks'].value_counts()

socially       1239
rarely          194
often           176
not at all       90
desperately      15
very often       13
Name: drinks, dtype: int64

In [13]:
# as the values are categorical, the mode is 'socially' and makes up about 70% of the total data, so we will fill all 
# NaNs with 'socially'

# dictionary to consolidate the values for the drinks column because a lot of them are redundant
drinks_dictionary = {
    'socially': 'sometimes',
    'often': 'often',
    'rarely': 'rarely',
    'not at all': 'never',
    'desperately': 'very often',
    'very often': 'very often'
}

data['drinks'] = data['drinks'].fillna('socially')
data['drinks'] = data['drinks'].replace(drinks_dictionary)
data['drinks'].value_counts()

sometimes     1294
rarely         194
often          176
never           90
very often      28
Name: drinks, dtype: int64

In [14]:
# we will check unique values for drugs and fill in missing values
data['drugs'].value_counts()

never        1134
sometimes     240
often          18
Name: drugs, dtype: int64

In [15]:
# as the values are categorical and the mode, 'never', makes up about 80% of the rows that have a value for 'drugs',
# we will fill all NaNs with 'never'
data['drugs'] = data['drugs'].fillna('never')
data['drugs'].value_counts()

never        1524
sometimes     240
often          18
Name: drugs, dtype: int64
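
The mode-imputation pattern used for body_type, drinks, and drugs can be written once as a helper. The sketch below is illustrative only: `mode_share` is a hypothetical helper, and the toy series is made up.

```python
import pandas as pd

def mode_share(series):
    """Return the most frequent non-null value and its share of the non-null rows."""
    counts = series.value_counts(dropna=True)
    return counts.index[0], counts.iloc[0] / counts.sum()

toy = pd.Series(["never", "never", None, "sometimes", "never", None])

value, share = mode_share(toy)
filled = toy.fillna(value)  # same idea as data['drugs'].fillna('never')

print(value, share)
print(filled.value_counts())
```

Checking the share before filling makes it explicit how dominant the mode is, which is the justification the comments above rely on.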

In [16]:
# we will check unique values for ethnicity and fill in missing values
data['ethnicity'].value_counts()

white                                                                                                      1013
asian                                                                                                       163
hispanic / latin                                                                                             69
black                                                                                                        61
other                                                                                                        48
hispanic / latin, white                                                                                      42
indian                                                                                                       38
asian, white                                                                                                 26
white, other                                                                                            

In [17]:
# as the values are categorical, the mode is 'white' and makes up about 57% of the total data, so we will fill
# all NaNs with 'white'

# Narrowed down the categories to white, asian, black, hispanic, native, pacific, mixed, and other
# because the raw values are very diverse. If the only second ethnicity is 'other', we treat it as insignificant
# and classify the row by its first ethnicity
ethnicity_dictionary = {
    'white': 'white',
    'asian': 'asian',
    'hispanic / latin': 'hispanic',
    'black': 'black',
    'other': 'other',
    'hispanic / latin, white': 'mixed',
    'indian': 'asian',
    'asian, white': 'mixed',
    'white, other': 'white',
    'asian, pacific islander': 'mixed',
    'middle eastern': 'asian',
    'black, white': 'mixed',
    'native american, white': 'mixed',
    'black, other': 'black',
    'middle eastern, white': 'mixed',
    'hispanic / latin, other': 'hispanic',
    'black, native american, white': 'mixed',
    'pacific islander': 'pacific',
    'asian, other': 'asian',
    'pacific islander, white': 'mixed',
    'native american': 'native',
    'middle eastern, hispanic / latin': 'mixed',
    'hispanic / latin, white, other': 'mixed',
    'black, hispanic / latin': 'mixed',
    'asian, pacific islander, hispanic / latin, white, other': 'mixed',
    'black, hispanic / latin, white': 'mixed',
    'asian, white, other': 'mixed',
    'asian, pacific islander, white': 'mixed',
    'asian, middle eastern, black, native american, indian, pacific islander, hispanic / latin, white, other': 'mixed',
    'black, white, other': 'mixed',
    'indian, other': 'asian',
    'indian, white, other': 'mixed',
    'black, indian, white, other': 'mixed',
    'native american, other': 'native',
    'asian, native american, white, other': 'mixed',
    'native american, white, other': 'mixed',
    'asian, hispanic / latin': 'mixed',
    'asian, hispanic / latin, white': 'mixed',
    'native american, hispanic / latin': 'mixed',
    'native american, pacific islander': 'mixed',
    'pacific islander, white, other': 'mixed',
    'middle eastern, indian, other': 'mixed',
    'pacific islander, other': 'pacific',
    'middle eastern, other': 'asian',
    'asian, pacific islander, hispanic / latin': 'mixed',
    'asian, middle eastern, black, indian, pacific islander, hispanic / latin, white': 'mixed',
    'pacific islander, hispanic / latin, white': 'mixed',
    'asian, black, pacific islander, hispanic / latin, white': 'mixed',
    'black, native american, white, other': 'mixed',
    'native american, pacific islander, hispanic / latin, white, other': 'mixed',
    'indian, hispanic / latin': 'mixed',
    'asian, black': 'mixed'
}

data['ethnicity'] = data['ethnicity'].fillna('white')
data['ethnicity'] = data['ethnicity'].replace(ethnicity_dictionary)
data['ethnicity'].value_counts()

white       1189
asian        221
mixed        164
hispanic      77
black         71
other         48
pacific        7
native         5
Name: ethnicity, dtype: int64

In [18]:
# we will fill missing values in height with the mean height of their respective genders
data['height'] = df['height'].fillna(df.groupby('sex')['height'].transform('mean'))
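
To make the group-wise fill concrete, here is a small worked example on made-up heights. It assumes only pandas and NumPy; the toy frame is not taken from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "sex": ["m", "m", "f", "f", "m"],
    "height": [70.0, np.nan, 64.0, np.nan, 72.0],
})

# transform('mean') returns one value per row: the mean height of that row's group
group_mean = toy.groupby("sex")["height"].transform("mean")

# fillna aligns on the index, so each NaN receives its own group's mean
toy["height_filled"] = toy["height"].fillna(group_mean)
print(toy)
```

The missing 'm' height becomes 71.0 (the mean of 70 and 72) and the missing 'f' height becomes 64.0.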

In [19]:
# we will check unique values for jobs and fill in missing values
data['job'].value_counts()

other                                217
student                              159
science / tech / engineering         158
computer / hardware / software       153
artistic / musical / writer          136
sales / marketing / biz dev          127
education / academia                 117
medicine / health                     93
entertainment / media                 89
banking / financial / real estate     80
executive / management                67
hospitality / travel                  40
law / legal services                  35
clerical / administrative             31
political / government                27
construction / craftsmanship          26
rather not say                        14
transportation                        14
unemployed                            10
retired                                9
military                               7
Name: job, dtype: int64

In [20]:
# as the values are categorical, and quite a few people already did not want to reveal their job anyway,
# the empty fields are also grouped into 'rather not say', which is then mapped to 'other'

# Narrowed down the categories to student, STEM, arts, business, education, military, not working, and other
# because there are many different jobs and a lot of them fall into the same category
job_dictionary = {
    'other': 'other',
    'student': 'student',
    'science / tech / engineering': 'STEM',
    'computer / hardware / software': 'STEM',
    'artistic / musical / writer': 'arts',
    'sales / marketing / biz dev': 'business',
    'education / academia': 'education',
    'medicine / health': 'STEM',
    'banking / financial / real estate': 'business',
    'executive / management': 'business',
    'hospitality / travel': 'business',
    'entertainment / media': 'arts',
    'law / legal services': 'arts',
    'clerical / administrative': 'business',
    'political / government': 'arts', 
    'construction / craftsmanship': 'STEM',
    'rather not say': 'other',
    'transportation': 'STEM',
    'unemployed': 'not working',
    'retired': 'not working',
    'military': 'military'
}

data['job'] = data['job'].fillna('rather not say')
data['job'] = data['job'].replace(job_dictionary)
data['job'].value_counts()

STEM           444
other          404
business       345
arts           287
student        159
education      117
not working     19
military         7
Name: job, dtype: int64

In [21]:
# we will check unique values for pets and fill in missing values
# maybe we should remove because there are a lot of missing values
data['pets'].value_counts()

likes dogs and likes cats          455
likes dogs                         190
likes dogs and has cats            155
has dogs                           137
has dogs and likes cats             80
likes dogs and dislikes cats        63
has cats                            49
has dogs and has cats               47
likes cats                          27
dislikes dogs and dislikes cats     10
has dogs and dislikes cats           9
dislikes dogs and likes cats         4
dislikes cats                        3
dislikes dogs and has cats           1
Name: pets, dtype: int64

In [22]:
# Narrowed down the categories to owns, likes, or dislikes pets.
pet_dictionary = {
    'likes dogs and likes cats': 'likes',
    'likes dogs': 'likes',
    'likes dogs and has cats': 'owns',
    'has dogs': 'owns',
    'has dogs and likes cats': 'owns',
    'likes dogs and dislikes cats': 'likes',
    'has cats': 'owns',
    'has dogs and has cats': 'owns',
    'likes cats': 'likes',
    'dislikes dogs and dislikes cats': 'dislikes',
    'has dogs and dislikes cats': 'owns',
    'dislikes dogs and likes cats': 'likes',
    'dislikes cats': 'dislikes',
    'dislikes dogs and has cats': 'owns'
}

# will fill NaNs with 'likes' as the average person does not mind pets, but we do not want to assume ownership
data['pets'] = data['pets'].fillna('likes')
data['pets'] = data['pets'].replace(pet_dictionary)
data['pets'].value_counts()

likes       1291
owns         478
dislikes      13
Name: pets, dtype: int64

In [23]:
# we will check unique values for religion and fill in missing values
data['religion'].value_counts()

agnosticism but not too serious about it      97
agnosticism                                   86
agnosticism and laughing about it             80
other                                         77
atheism                                       65
other and laughing about it                   63
christianity                                  62
catholicism but not too serious about it      62
atheism and laughing about it                 58
christianity but not too serious about it     56
atheism but not too serious about it          50
other but not too serious about it            47
judaism but not too serious about it          45
catholicism                                   38
other and somewhat serious about it           30
catholicism and laughing about it             29
christianity and somewhat serious about it    26
atheism and somewhat serious about it         26
buddhism and laughing about it                23
agnosticism and somewhat serious about it     21
judaism and laughing

In [24]:
# dictionary to consolidate values for the religion column because there were a lot of redundant values
religion_dictionary = {    
    "agnosticism but not too serious about it": "agnostic",
    "agnosticism": "agnostic",
    "agnosticism and laughing about it": "agnostic",
    "other": "not-listed",
    "atheism": "atheist",
    "other and laughing about it": "not-listed",
    "christianity": "christian",
    "catholicism but not too serious about it": "christian",
    "atheism and laughing about it": "atheist",
    "christianity but not too serious about it": "christian",
    "atheism but not too serious about it": "atheist",
    "other but not too serious about it": "not-listed",
    "judaism but not too serious about it": "jewish",
    "catholicism": "christian",
    "other and somewhat serious about it": "not-listed",
    "catholicism and laughing about it": "christian",
    "christianity and somewhat serious about it": "christian",
    "atheism and somewhat serious about it": "atheist",
    "buddhism and laughing about it": "buddhist",
    "agnosticism and somewhat serious about it": "agnostic",
    "judaism and laughing about it": "jewish",
    "judaism": "jewish",
    "atheism and very serious about it": "atheist",
    "catholicism and somewhat serious about it": "christian",
    "buddhism but not too serious about it": "buddhist",
    "buddhism": "buddhist",
    "christianity and very serious about it": "christian",
    "other and very serious about it": "not-listed",
    "christianity and laughing about it": "christian",
    "agnosticism and very serious about it": "agnostic",
    "buddhism and somewhat serious about it": "buddhist",
    "hinduism but not too serious about it": "hindu",
    "judaism and somewhat serious about it": "jewish",
    "catholicism and very serious about it": "christian",
    "hinduism and somewhat serious about it": "hindu",
    "buddhism and very serious about it": "buddhist",
    "hinduism": "hindu",
    "islam": "muslim",
    "islam but not too serious about it": "muslim",
    "islam and very serious about it": "muslim",
    "hinduism and laughing about it": "hindu",
    "islam and somewhat serious about it": "muslim",
    "hinduism and very serious about it": "hindu",
    "judaism and very serious about it": "jewish",
    "islam and laughing about it": "muslim"
}

data['religion'] = data['religion'].fillna('not-listed')
data['religion'] = data['religion'].replace(religion_dictionary)
data['religion'].value_counts()

not-listed    770
christian     323
agnostic      296
atheist       216
jewish         88
buddhist       66
hindu          15
muslim          8
Name: religion, dtype: int64

In [25]:
# we will check unique values for smokes and fill in missing values
data['smokes'].value_counts()

no                1352
sometimes          110
when drinking       83
yes                 55
trying to quit      45
Name: smokes, dtype: int64

In [26]:
# 'no' is the mode and makes up about 75% of the data, so we will fill NaNs with 'no'
data['smokes'] = data['smokes'].fillna('no')
data['smokes'].value_counts()

no                1489
sometimes          110
when drinking       83
yes                 55
trying to quit      45
Name: smokes, dtype: int64

In [27]:
# We do not handle noise/outliers at this stage. The data is self-reported, so there is no measurement noise
# from the collection process, and because it is mainly categorical, it is hard to identify a row as an
# outlier: the rows can't be plotted in an n-dimensional space until the features are encoded numerically.

### Feature Engineering

In [28]:
# We can convert some of the categorical features into numerical values because they are on an ordinal scale. This
# lets values that are further apart on the scale be further apart numerically. For example, in the drinks column,
# 'often' is further from 'never' than 'sometimes' is. Encoding ordinal features as numbers preserves this
# difference in magnitude, rather than treating every pair of distinct categories as being a distance of 1 apart.

# converts the ordinal categorical features into integer codes
def convert_ordinal(data):
    copy = data.copy()
    
    drugs_codes = {
        'never': 0,
        'sometimes': 1,
        'often': 2
    }

    drinks_codes = {
        'never': 0,
        'rarely': 1,
        'sometimes': 2,
        'often': 3,
        'very often': 4
    }

    smokes_codes = {
        "no": 0,
        "when drinking": 1,
        "trying to quit": 2,
        "sometimes": 3,
        "yes": 4
    }

    pets_codes = {
        "dislikes": 0,
        "likes": 1,
        "owns": 2
    }
    
    body_codes = {
        "underweight": 0,
        "average": 1,
        "athletic": 2,
        "overweight": 3
    }

    copy['drugs'] = data['drugs'].replace(drugs_codes)
    copy['drinks'] = data['drinks'].replace(drinks_codes)
    copy['smokes'] = data['smokes'].replace(smokes_codes)
    copy['pets'] = data['pets'].replace(pets_codes)
    copy['body_type'] = data['body_type'].replace(body_codes)
    return copy

data = convert_ordinal(data)
data

Unnamed: 0,age,body_type,drinks,drugs,ethnicity,height,job,orientation,pets,religion,sex,smokes
0,28,2,2,1,asian,65.0,STEM,straight,2,christian,m,0
1,34,1,2,0,white,63.0,other,straight,2,not-listed,f,0
2,29,2,2,0,white,70.0,arts,straight,1,not-listed,m,0
3,45,2,0,0,mixed,68.0,education,gay,1,not-listed,m,0
4,37,1,2,0,white,63.0,education,straight,1,not-listed,f,0
5,28,1,2,0,asian,65.0,business,straight,1,not-listed,f,0
6,25,1,2,1,white,66.0,student,bisexual,2,not-listed,f,4
7,20,0,2,0,white,69.0,student,straight,1,christian,m,0
8,27,2,3,0,white,72.0,STEM,straight,1,agnostic,m,0
9,47,1,2,0,white,71.0,STEM,straight,2,not-listed,m,0
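
An alternative way to express the same ordinal idea is an ordered `pd.Categorical`, which assigns integer codes by position in an explicit category list. This is a sketch of that alternative, not the approach used in `convert_ordinal`; the order is copied from `drinks_codes`, and the toy series is made up.

```python
import pandas as pd

# Same order as the drinks_codes dictionary: position in the list is the code
drinks_order = ["never", "rarely", "sometimes", "often", "very often"]

# 'socially' is included on purpose: it is not in the category list
toy = pd.Series(["sometimes", "never", "very often", "socially"])

codes = pd.Categorical(toy, categories=drinks_order, ordered=True).codes
print(codes.tolist())
```

Values that are not in the category list get the code -1, which makes an unconsolidated value easy to spot, whereas `replace` would leave the original string in the column.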


In [29]:
# most of the models that we plan on using do not accept categorical data. Therefore, for the categorical features that
# weren't ordinal, we had to one-hot encode them.

# one hot encoding categorical features that are not ordinal
def one_hot_encode(data, column=None):
    copy = data.copy()
    
    if column is None:
        to_encode = ['ethnicity', 'job', 'orientation', 'religion', 'sex']
        for column in to_encode:
            copy = one_hot_encode(copy, column)
    else:
        dummies = pd.get_dummies(copy[[column]])
        copy = pd.concat([copy, dummies], axis=1)
        copy = copy.drop([column], axis=1)
        
    return copy

one_hot_encode(data)

Unnamed: 0,age,body_type,drinks,drugs,height,pets,smokes,ethnicity_asian,ethnicity_black,ethnicity_hispanic,...,religion_agnostic,religion_atheist,religion_buddhist,religion_christian,religion_hindu,religion_jewish,religion_muslim,religion_not-listed,sex_f,sex_m
0,28,2,2,1,65.0,2,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
1,34,1,2,0,63.0,2,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
2,29,2,2,0,70.0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1
3,45,2,0,0,68.0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1
4,37,1,2,0,63.0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
5,28,1,2,0,65.0,1,0,1,0,0,...,0,0,0,0,0,0,0,1,1,0
6,25,1,2,1,66.0,2,4,0,0,0,...,0,0,0,0,0,0,0,1,1,0
7,20,0,2,0,69.0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
8,27,2,3,0,72.0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
9,47,1,2,0,71.0,2,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1
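
For reference, this is what one-hot encoding does on a tiny frame. The sketch passes `columns=` to `pd.get_dummies`, which is a slightly different call from the one inside `one_hot_encode` but produces the same `column_value` naming; the toy data is made up.

```python
import pandas as pd

toy = pd.DataFrame({"age": [28, 34, 29], "sex": ["m", "f", "m"]})

# Each distinct value of 'sex' becomes its own indicator column; 'age' is untouched
encoded = pd.get_dummies(toy, columns=["sex"])
print(encoded)
```

Depending on the pandas version, the indicator columns are stored as integers or booleans; both behave as 0/1 when passed to scikit-learn.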


### More Data Exploration

In [30]:
# Some of the models that we plan on using are very sensitive to the effect of outliers, so we have to remove them
# in order for the models to function effectively.

# Removes outliers from the data using isolation forests
def remove_outliers(data, labels):
    data_cp = data.copy()
    labels_cp = labels.copy()
    
    iso_forest = ensemble.IsolationForest(n_estimators=100, contamination=0.05)
    pred = iso_forest.fit_predict(data_cp.drop(columns=['ethnicity', 'job', 'orientation', 'religion', 'sex']))
    pred = pd.Series(pred)
    data_cp = data_cp[pred == 1].reset_index(drop=True)
    labels_cp = labels_cp[pred == 1].reset_index(drop=True)
    
    return (data_cp, labels_cp)

remove_outliers(data, label)

(      age  body_type  drinks  drugs ethnicity  height        job orientation  \
 0      28          2       2      1     asian    65.0       STEM    straight   
 1      34          1       2      0     white    63.0      other    straight   
 2      29          2       2      0     white    70.0       arts    straight   
 3      45          2       0      0     mixed    68.0  education         gay   
 4      37          1       2      0     white    63.0  education    straight   
 5      28          1       2      0     asian    65.0   business    straight   
 6      25          1       2      1     white    66.0    student    bisexual   
 7      20          0       2      0     white    69.0    student    straight   
 8      27          2       3      0     white    72.0       STEM    straight   
 9      47          1       2      0     white    71.0       STEM    straight   
 10     31          2       2      0     mixed    70.0       STEM    straight   
 11     48          0       
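
The behavior of the isolation forest is easier to see on synthetic data. The sketch below assumes a 2-D Gaussian cloud with two planted extreme points; the data, the seed, and the variable names are illustrative and not part of the project.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array([[8.0, 8.0], [-9.0, 7.5]])  # far from the cloud
X = np.vstack([inliers, planted])

# contamination=0.05 flags roughly the 5% most isolated points
forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
flags = forest.fit_predict(X)  # 1 = inlier, -1 = outlier

kept = X[flags == 1]
print("flagged:", int((flags == -1).sum()), "kept:", len(kept))
```

Note that `contamination` fixes the share of rows that get flagged, so about 5% of rows are removed whether or not they are truly anomalous. Setting `random_state` makes the flagged set reproducible between runs.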

In [31]:
# balancing the data 
def balance_data(data, labels, rows=400):
    all_data = pd.concat([data, labels], axis=1)
    unique_labels = labels.unique()
    
    to_concat = []
    for label in unique_labels:
        choices = all_data[all_data['education']==label]
        if len(choices) >= rows:
            # downsample (a class with exactly `rows` rows is kept whole)
            to_concat.append(choices.sample(n=rows))
        else:
            # upsample by drawing row indices with replacement
            indices = np.array(choices.index)
            resampled_indices = np.random.choice(indices, size=rows, replace=True)
            to_concat.append(choices.loc[resampled_indices])
    
    new_df = pd.concat(to_concat, axis=0).sample(frac=1).reset_index(drop=True)
    new_labels = new_df['education']
    new_data = new_df.drop(columns=['education'])
    
    return new_data, new_labels

balance_data(data, label)

(      age  body_type  drinks  drugs ethnicity  height          job  \
 0      32          2       2      0     white    72.0        other   
 1      25          2       2      0     white    64.0      student   
 2      29          0       2      0     white    65.0        other   
 3      24          2       2      0     white    68.0     business   
 4      21          2       2      0     white    74.0         arts   
 5      36          2       2      0     white    70.0        other   
 6      22          1       2      0     white    66.0      student   
 7      30          1       2      1     asian    69.0        other   
 8      54          2       2      0     white    63.0         STEM   
 9      28          2       0      0     asian    65.0     business   
 10     24          2       2      0     white    71.0         STEM   
 11     26          3       4      0     white    72.0     business   
 12     32          2       2      0     white    70.0     business   
 13   
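
The same balancing idea can be written compactly with `groupby` and `sample`. This is a sketch on a toy frame, not a replacement for `balance_data`; `resample_to`, the `seed` parameter, and the toy values are assumptions made for illustration.

```python
import pandas as pd

def resample_to(frame, label_col, rows, seed=0):
    """Resample every class to exactly `rows` rows (with replacement only if needed)."""
    parts = []
    for _, group in frame.groupby(label_col):
        need_replacement = len(group) < rows
        parts.append(group.sample(n=rows, replace=need_replacement, random_state=seed))
    # shuffle so the classes are interleaved
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

toy = pd.DataFrame({
    "x": range(10),
    "education": ["bachelors"] * 7 + ["associates"] * 3,
})

balanced = resample_to(toy, "education", rows=5)
print(balanced["education"].value_counts())
```

One caveat worth keeping in mind: if upsampling happens before the data is split for cross-validation, copies of the same row can land in both the training and the validation fold, which tends to inflate scores.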

## Modeling

### K-Nearest Neighbors Classifier

In [43]:
scaler = preprocessing.StandardScaler()
pca = decomposition.PCA()
knn = neighbors.KNeighborsClassifier()
ppl = pipeline.Pipeline(steps=[('scaler', scaler), ('pca', pca), ('knn', knn)])

knn_data = one_hot_encode(data)

param_grid = {'pca__n_components': [val / 20 for val in range(12, 20)], 'knn__n_neighbors': list(range(5, 10))}
inner = model_selection.GridSearchCV(ppl, param_grid, scoring='f1_micro', cv=5)
scores = model_selection.cross_val_score(inner, knn_data, label, cv=5)
print('F1-score:', scores.mean())

F1-score: 0.5651774309945504
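
Some cells in this section report an F1-score (`scoring='f1_micro'`) while others report accuracy. For a single-label multiclass problem like this one, micro-averaged F1 equals accuracy, so the two kinds of numbers can be compared directly. A minimal check, using made-up labels rather than the OKCupid data (the names `y_true` and `y_pred` are hypothetical):

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels for illustration only (not taken from the dataset).
rng = np.random.default_rng(0)
classes = np.array(['bachelors', 'advanced degree', 'space camp'])
y_true = rng.choice(classes, size=200)
y_pred = rng.choice(classes, size=200)

accuracy = metrics.accuracy_score(y_true, y_pred)
f1_micro = metrics.f1_score(y_true, y_pred, average='micro')

# Each sample has exactly one label, so every mistake counts as one false
# positive and one false negative; micro precision, recall and F1 therefore
# all reduce to the fraction of correct predictions.
print('accuracy:', accuracy)
print('f1_micro:', f1_micro)
```

The two printed values should agree, which is why the KNN and Naive Bayes F1-scores can be read alongside the accuracies of the other models.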


### Decision Tree

In [51]:
tree_data = one_hot_encode(data)

param_grid = {'criterion': ['gini', 'entropy'], 'splitter': ['best', 'random'], 'min_samples_leaf': [val / 200 for val in range(1, 11)]}
inner = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid, scoring='accuracy', cv=5)
scores = model_selection.cross_val_score(inner, tree_data, label, cv=5)
print('Accuracy:', scores.mean())

Accuracy: 0.5936837143980884


### SVMs

In [60]:
def run_svm_model(data, labels, outer_model, name):
    scaler = preprocessing.StandardScaler()
    pca = decomposition.PCA()
    ppl = pipeline.Pipeline(steps=[('scaler', scaler), ('pca', pca), ('clf', outer_model)])
    
    svm_data = one_hot_encode(data)

    inner_clfs = [svm.SVC(kernel=kernel, class_weight='balanced') for kernel in ['linear', 'poly', 'sigmoid', 'rbf']]
    inner_clfs.extend([svm.LinearSVC(loss=loss, class_weight='balanced') for loss in ['hinge', 'squared_hinge']])
    param_grid = {'pca__n_components': [val / 10 for val in range(6, 10)], 'clf__estimator': inner_clfs}
    
    inner = model_selection.GridSearchCV(ppl, param_grid, scoring='accuracy', cv=5)
    scores = model_selection.cross_val_score(inner, svm_data, labels, cv=5)
    print(name, 'accuracy:', scores.mean())

run_svm_model(data, label, multiclass.OneVsOneClassifier(None), 'One vs One')
run_svm_model(data, label, multiclass.OneVsRestClassifier(None), 'One vs Rest')

One vs One accuracy: 0.3938828221292119
One vs Rest accuracy: 0.42421893723631987


In [63]:
def run_svc_model(data, labels, balanced):
    weight = 'balanced' if balanced else None
    score = 'accuracy' if balanced else 'f1_micro'

    scaler = preprocessing.StandardScaler()
    pca = decomposition.PCA()
    svc = svm.SVC(class_weight=weight)
    ppl = pipeline.Pipeline(steps=[('scaler', scaler), ('pca', pca), ('svm', svc)])

    svm_data = one_hot_encode(data)

    kernels = ['linear', 'poly', 'sigmoid', 'rbf']
    costs = [1.0, 10.0, 100.0]
    param_grid = {'svm__kernel': kernels, 'svm__C': costs, 'svm__decision_function_shape': ['ovo', 'ovr']}

    inner = model_selection.GridSearchCV(ppl, param_grid, scoring=score, cv=5)
    scores = model_selection.cross_val_score(inner, svm_data, labels, cv=5)
    
    if balanced:
        print('Balanced SVM accuracy', scores.mean())
    else:
        print('Unbalanced SVM F1-score', scores.mean())

run_svc_model(data, label, True)
run_svc_model(data, label, False)

Balanced SVM accuracy 0.46575822496666425
Unbalanced SVM F1-score 0.585315913487519
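
The gap between the balanced and unbalanced runs is easier to interpret with the weighting in mind. scikit-learn documents `class_weight='balanced'` as `n_samples / (n_classes * np.bincount(y))`, so rare classes are weighted up and the majority class is weighted down, which can trade overall accuracy for better coverage of minority classes. A small sketch with hypothetical class counts (60 / 30 / 10, not the real label distribution):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label counts for illustration only.
y = np.array(['bachelors'] * 60 + ['advanced degree'] * 30 + ['space camp'] * 10)
classes = np.unique(y)  # sorted: 'advanced degree', 'bachelors', 'space camp'

# Weights that scikit-learn derives for class_weight='balanced'.
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)

# The same heuristic written out by hand: n_samples / (n_classes * count).
counts = np.array([(y == c).sum() for c in classes])
manual = len(y) / (len(classes) * counts)

for name, weight in zip(classes, weights):
    print(name, round(float(weight), 3))
```

Under these assumed counts the majority class gets a weight below 1 and the rarest class a weight above 3, so errors on rare classes cost the SVM far more than errors on 'bachelors'.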


### Naive Bayes

In [40]:
# Used a naive Bayes model. Since we don't know the distribution of the data, I tried both GaussianNB and MultinomialNB.
# The model accuracy was extremely poor with one-hot encoding, so I had to remove the categorical features that were
# not ordinal (ethnicity, job, orientation, religion, sex). I used 10-fold cross-validation and got an F1-score of
# roughly 53% to 55% for GaussianNB and MultinomialNB, with MultinomialNB having the slightly higher score.

def run_naive_bayes(data, labels, isGaussianNB):
    if isGaussianNB:
        classifier = naive_bayes.GaussianNB()
    else:
        classifier = naive_bayes.MultinomialNB()
    bayes_data = data.drop(columns=['ethnicity', 'job', 'orientation', 'religion', 'sex'])
    
    scores = model_selection.cross_val_score(classifier, bayes_data, labels, scoring='f1_micro', cv=10)
    
    print('10 Fold Cross Validation F1-score:', str(scores.mean()), '\n')

print('Naive Bayes using GaussianNB')
run_naive_bayes(data, label, True)

print('Naive Bayes using MultinomialNB')
run_naive_bayes(data, label, False)

Naive Bayes using GaussianNB
10 Fold Cross Validation F1-score: 0.5325069477409279 

Naive Bayes using MultinomialNB
10 Fold Cross Validation F1-score: 0.5477295899201874 



### Neural Network

In [41]:
scaler = preprocessing.MinMaxScaler()
nn = neural_network.MLPClassifier()
ppl = pipeline.Pipeline(steps=[('scaler', scaler), ('nn', nn)])

nn_data, nn_labels = remove_outliers(data, label)
# nn_data, nn_labels = balance_data(nn_data, nn_labels)
nn_data = one_hot_encode(nn_data)

param_grid = {'nn__activation': ['logistic', 'tanh', 'relu'], 'nn__solver': ['sgd', 'adam'], 'nn__learning_rate': ['constant', 'adaptive']}
inner = model_selection.GridSearchCV(ppl, param_grid, scoring='accuracy', cv=5)
scores = model_selection.cross_val_score(inner, nn_data, nn_labels, cv=5)
print('Accuracy:', scores.mean())

Accuracy: 0.5992948303581788


### Ensembles

In [78]:
ensemble_data, ensemble_labels = remove_outliers(data, label)
ensemble_data = one_hot_encode(ensemble_data)

In [79]:
param_grid = {'max_depth': [25,30,35,40], 'min_samples_leaf': [0.002,0.005,0.01], 'max_features': ('sqrt','log2')}
inner = model_selection.GridSearchCV(ensemble.RandomForestClassifier(n_estimators=100), param_grid, scoring='accuracy', cv=5)
scores = model_selection.cross_val_score(inner, ensemble_data, ensemble_labels, cv=5)
print('Accuracy:', scores.mean())

Accuracy: 0.6022422905752398


In [80]:
param_grid = {'n_estimators': [50, 100, 150], 'learning_rate': [0.7, 0.85, 1.0]}
inner = model_selection.GridSearchCV(ensemble.AdaBoostClassifier(), param_grid, scoring='accuracy', cv=5)
scores = model_selection.cross_val_score(inner, ensemble_data, ensemble_labels, cv=5)
print('Accuracy:', scores.mean())

Accuracy: 0.5797859931670587


In [83]:
estimators = [
    ('dt', tree.DecisionTreeClassifier()),
    ('nn', pipeline.make_pipeline(preprocessing.MinMaxScaler(), neural_network.MLPClassifier())),
    ('svc', pipeline.make_pipeline(preprocessing.StandardScaler(), decomposition.PCA(n_components=7), svm.SVC(decision_function_shape='ovr')))
]

param_grid = {'weights': [[1, 1, 1], [1, 1, 0.7]]}
inner = model_selection.GridSearchCV(ensemble.VotingClassifier(estimators=estimators), param_grid, scoring='accuracy', cv=5)
scores = model_selection.cross_val_score(inner, ensemble_data, ensemble_labels, cv=5)
print('Accuracy:', scores.mean())

Accuracy: 0.5809520312459757


## Results and Analysis

In [None]:
'''
Most of our models scored between roughly 53% and 60%, with the random forest highest at about 60% and the
class-weighted SVMs lower at 39% to 47%. Our data was primarily categorical and had to be encoded using one-hot
encoding, which increased our number of columns. By increasing the number of dimensions, our models and predictive
algorithms fell victim to the curse of dimensionality. Additionally, we were attempting to predict one of 5 different
labels, which makes it difficult to predict with high accuracy, especially given that our data was primarily
categorical. Given these circumstances, I do believe that our predictive algorithms are reasonably accurate.

However, there was a heavy class imbalance favoring 'bachelors', which made up about 60% of the data. Additionally,
drawing on real-life experience, it seems very difficult to predict someone's education level based on factors such as
religion, whether or not they smoke or drink, body type, preference for pets, and height. Someone of any religion can
have any level of education, and someone who is tall or short could receive their Ph.D. Features such as income and job
are far better indicators of education level. The problem we faced with income was that the majority of values were
missing, which makes sense as the data is self-reported. The problem we faced with job/occupation was that there were a
lot of unique values, which would further worsen the curse of dimensionality. Even with the given values, the job
values were very vague and didn't give any specific role details that could be used to predict education. For example,
there was a value of 'entertainment'. This could include the most successful singers in the world who make millions of
dollars or a local band member who is still in high school.
We also found that another potential source of error was that the data was self-reported. If someone was embarrassed
about a factor, such as body type, or thought it was unattractive, they might not report it, which could leave many
significant or differentiating values missing. In addition, people on a dating website are encouraged to portray the
best version of themselves to make themselves attractive to potential partners. Therefore, some of the data may be
inaccurate, especially in features such as height, body type, drinks, and smokes, where people might lie to conceal
some of their less flattering traits, affecting the models that we use.

Another aspect that complicated predicting the labels was that one of the classes was 'space camp'. We were not sure
why this was an option, as it clearly did not indicate a real or serious education level, but we felt that we should
not remove the elements that belonged to the 'space camp' class as it was not insignificant. Given that this was a fake
class, predicting it was difficult, as there are no real factors that indicate whether someone's education level was
space camp. Two elements could have the exact same attributes, and one could have a class of 'advanced degree' while
the other could have a class of 'space camp', which simply complicates the model or algorithm.


All in all, while we were able to predict the education level with around 60% accuracy, we cannot conclude that there
is necessarily a strong correlation between the features that we explored and education level. Given the imbalance
favoring 'bachelors' (about 60% of the data), predicting 'bachelors' for every element would result in roughly the same
accuracy. However, we did find that our predictions were not just 'bachelors', which was predicted only 60 - 75% of the
time. This is a good sign that our models did fit to the data and find some level of correlation between the features
and the labels.

'''
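
The comparison against always predicting 'bachelors' can be made explicit with a majority-class baseline. The following is a minimal sketch, not part of the original analysis: it uses synthetic labels in which 'bachelors' is assumed to make up about 60% of rows, and the names `X_demo`, `y_demo`, `class_4`, and `class_5` are placeholders rather than values from the OKCupid data.

```python
import numpy as np
from sklearn import model_selection
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)

# Hypothetical 5-class labels; the proportions are assumed, not measured.
names = ['bachelors', 'advanced degree', 'space camp', 'class_4', 'class_5']
y_demo = rng.choice(names, size=1000, p=[0.6, 0.2, 0.1, 0.07, 0.03])
X_demo = rng.normal(size=(1000, 4))  # features are ignored by the baseline

# Always predicts the most frequent label seen in the training fold.
baseline = DummyClassifier(strategy='most_frequent')
baseline_scores = model_selection.cross_val_score(
    baseline, X_demo, y_demo, cv=5, scoring='accuracy')

majority_share = (y_demo == 'bachelors').mean()
print('Share of majority class:', majority_share)
print('Baseline accuracy:', baseline_scores.mean())
```

Running the same baseline on the real `data` and `label` would give the reference point that each model's cross-validated score needs to beat.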