# Name(s): Ojas Patel, Pranav Naravetla, Suhas Dara, Avinash Damania

# OKCupid Data Mining Project

## Introduction

In this project, we use an OKCupid dataset to predict a user's education level from dating-profile information such as physical traits and lifestyle choices.


(What is the data science problem you are trying to solve? Why does the problem matter? What could the results of your predictive model be used for? Why would we want to be able to predict the thing you’re trying to predict? Then describe the dataset that you will use to tackle this problem.)

In [1]:
# Some headers
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from numpy.random import rand
from numpy import square, sqrt
from pandas import DataFrame
from sklearn.manifold import MDS
from sklearn.model_selection import StratifiedKFold
from scipy.spatial.distance import pdist

## Data Prep

In this section, we will clean the data in preparation for use in training our models.

In [2]:
df = pd.read_csv("test_profiles.csv")
df.columns

# we must drop rows that do not have an education value to deal with missing values
df = df[df.education.notnull()]

label = df['education']
data = df.drop(columns=['education'])

data.head()

Unnamed: 0.1,Unnamed: 0,age,body_type,diet,drinks,drugs,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,0,28,athletic,mostly anything,socially,sometimes,i'm looking to meet someone who i have a lot i...,"i work in the tech industry during the day, an...","i'm an expert at scrabble, designing ambigrams...","my lip ring, which i use to distract people fr...",...,"san francisco, california",,straight,likes dogs and has cats,christianity and laughing about it,m,taurus but it doesn&rsquo;t matter,no,"english (fluently), chinese (fluently), japane...",single
1,1,34,average,mostly anything,socially,never,"they say i'm a smart, funny, worldly girl who ...","i'm transitioning from a lost, once ambitious ...",listening<br />\n<br />\nnot judging others<br...,eyes<br />\n<br />\nsmile<br />\n<br />\nthoug...,...,"san francisco, california",,straight,has dogs and likes cats,,f,aries and it&rsquo;s fun to think about,no,english,single
2,2,29,fit,mostly anything,socially,never,update: i am in bmore/philly till may 27th for...,i'm self-employed events technician and it's w...,"being a smart ass, saying inappropriate things...",my charming good looks and the piece of lettuc...,...,"oakland, california",,straight,likes dogs and likes cats,,m,,no,"english (fluently), swedish (okay), spanish (p...",single
3,3,45,athletic,,not at all,never,,,,,...,"san francisco, california",,gay,likes dogs and likes cats,other,m,cancer,no,"english, spanish (poorly)",single
4,4,37,average,,socially,,hmmi freely give compliments. i appreciate a g...,i'm a teacher by day and i usually love it. i'...,,,...,"san francisco, california",,straight,,,f,taurus and it&rsquo;s fun to think about,no,english,single


In [3]:
len(data)

1782

In [4]:
data.count()

Unnamed: 0     1782
age            1782
body_type      1635
diet           1082
drinks         1727
drugs          1392
essay0         1615
essay1         1572
essay2         1530
essay3         1469
essay4         1509
essay5         1497
essay6         1429
essay7         1451
essay8         1235
essay9         1453
ethnicity      1627
height         1782
income         1782
job            1609
last_online    1782
location       1782
offspring       757
orientation    1782
pets           1230
religion       1243
sex            1782
sign           1511
smokes         1645
speaks         1780
status         1782
dtype: int64

In [5]:
data = data.drop(columns=['Unnamed: 0', 'essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6',
                          'essay7', 'essay8', 'essay9', 'last_online', 'sign', 'offspring', 'diet', 'speaks',
                          'location', 'status', 'income'])

# removing Unnamed: 0 as it is a repeat of the index

# removing essay0 through essay9 as every row has a unique value

# removing last_online as it can't be used to predict the label (education)

# removing sign as it can't be used to predict the label (education)

# removing offspring as too many rows have NaN as a value

# removing diet as too many rows have NaN as a value

# removing speaks as there are too many distinct values and they cannot be mapped into a smaller domain

# removing location as there are too many distinct values and it cannot be used to predict the label (education)

# removing status as there is a heavy imbalance with almost all of the values being 'single'

# removing income as almost all values are not listed (value put as -1)
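
The reasons given in the comments above (too many NaNs, too many distinct values, heavy imbalance) can be checked numerically before dropping anything. The following is a minimal sketch, not the method used in this notebook: the helper name `summarize_columns` and the toy frame are hypothetical, and the values are made up for illustration.

```python
import pandas as pd

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Per column: fraction missing, number of distinct values, share of the most common value."""
    rows = []
    for name in frame.columns:
        col = frame[name]
        counts = col.value_counts(dropna=True)  # sorted, most common first
        rows.append({
            "column": name,
            "missing_frac": col.isna().mean(),
            "n_unique": col.nunique(dropna=True),
            "top_share": counts.iloc[0] / len(col) if len(counts) else 0.0,
        })
    return pd.DataFrame(rows).set_index("column")

# Toy frame standing in for the profile data (hypothetical values).
toy = pd.DataFrame({
    "status": ["single", "single", "single", "married"],
    "offspring": [None, None, "has kids", None],
    "essay0": ["a", "b", "c", "d"],
})
summary = summarize_columns(toy)
print(summary)
```

A high `missing_frac` corresponds to the "too many NaNs" reason, an `n_unique` close to the row count to the "every row is unique" reason, and a `top_share` near 1 to the "heavy imbalance" reason. Any cutoffs for these numbers would be a judgment call.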

In [6]:
label_engineering = {
    'graduated from college/university': 'bachelors',
    'working on college/university': 'bachelors',
    'working on masters program': 'advanced degree',
    'graduated from two-year college': 'associates',
    'graduated from high school': 'high-school',
    'graduated from ph.d program': 'advanced degree',
    'graduated from law school': 'advanced degree',
    'working on two-year college': 'associates',
    'working on ph.d program': 'advanced degree',
    'dropped out of college/university': 'high-school',
    'college/university': 'bachelors',
    'graduated from space camp': 'spacecamp',
    'dropped out of space camp': 'spacecamp',
    'graduated from med school': 'advanced degree',
    'working on space camp': 'spacecamp',
    'working on law school': 'advanced degree',
    'working on med school': 'advanced degree',
    'dropped out of two-year college': 'high-school',
    'two-year college': 'associates',
    'masters program': 'advanced degree',
    'dropped out of masters program': 'advanced degree',
    'dropped out of ph.d program': 'advanced degree',
    'high school': 'high-school',
    'dropped out of high school': 'high-school',
    'working on high school': 'high-school',
    'space camp': 'spacecamp',
    'ph.d program': 'advanced degree',
    'med school': 'advanced degree',
    'law school': 'advanced degree',
    'dropped out of law school': 'advanced degree',
    'dropped out of med school': 'advanced degree',
    'graduated from masters program': 'advanced degree'}



In [7]:
label = label.replace(label_engineering)

In [8]:
label.unique()

array(['advanced degree', 'bachelors', 'spacecamp', 'associates',
       'high-school'], dtype=object)
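
Because `Series.replace` leaves any value without a matching key unchanged, a misspelled key in a mapping dictionary fails silently. A check like the following, run on the raw labels before the replacement, can catch that. This is a sketch under assumptions: `unmapped_values` is a hypothetical helper, and the toy series and mapping are made up.

```python
import pandas as pd

def unmapped_values(series: pd.Series, mapping: dict) -> set:
    """Return the non-null values in `series` that are not keys of `mapping`."""
    return set(series.dropna().unique()) - set(mapping)

# Toy example (hypothetical values); the second key is misspelled on purpose.
toy_mapping = {
    "graduated from college/university": "bachelors",
    "greatued from masters program": "advanced degree",
}
toy_label = pd.Series([
    "graduated from college/university",
    "graduated from masters program",
])
missing = unmapped_values(toy_label, toy_mapping)
print(missing)
```

An empty set means every raw value is covered by the mapping.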

In [9]:
# Now we must fill in missing values for remaining columns
data.count()

age            1782
body_type      1635
drinks         1727
drugs          1392
ethnicity      1627
height         1782
job            1609
orientation    1782
pets           1230
religion       1243
sex            1782
smokes         1645
dtype: int64

In [11]:
# age, sex, orientation are completely full
# First we will check unique values for body_type and fill in missing values
data['body_type'].value_counts()

average           428
fit               377
athletic          367
thin              147
curvy             115
a little extra     73
skinny             51
full figured       38
jacked             16
overweight         13
used up             6
rather not say      4
Name: body_type, dtype: int64

In [18]:
# About 8% of the rows (147 of 1782) are missing a value for body_type. Since 'average' is the mode and it is
# reasonable to assume that the typical person has an 'average' body_type, we will fill the NaNs
# with 'average'

body_type_dictionary = {
    'average': 'average',
    'athletic': 'athletic',
    'fit': 'athletic',
    'thin': 'underweight',
    'curvy': 'overweight',
    'a little extra': 'overweight',
    'skinny': 'underweight',
    'full figured': 'overweight',
    'jacked': 'athletic',
    'overweight': 'overweight',
    'used up': 'overweight',
    'rather not say': 'average'
}

data['body_type'] = data['body_type'].fillna('average')
data['body_type'] = data['body_type'].replace(body_type_dictionary)
data['body_type'].value_counts()

athletic       760
average        579
overweight     245
underweight    198
Name: body_type, dtype: int64

In [13]:
# we will check unique values for drinks and fill in missing values
data['drinks'].value_counts()

socially       1239
rarely          194
often           176
not at all       90
desperately      15
very often       13
Name: drinks, dtype: int64

In [17]:
# as the values are categorical, we fill all NaNs with the mode, 'socially', which makes up about 70% of
# the total data

# Narrowed down the categories to sometimes, often, and rarely because some of them were redundant

drinks_dictionary = {
    'socially': 'sometimes',
    'often': 'often',
    'rarely': 'rarely',
    'not at all': 'rarely',
    'desperately': 'often',
    'very often': 'often'
}

data['drinks'] = data['drinks'].fillna('socially')
data['drinks'] = data['drinks'].replace(drinks_dictionary)
data['drinks'].value_counts()

sometimes    1294
rarely        284
often         204
Name: drinks, dtype: int64
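
The fill-then-map pattern used for `body_type` and `drinks` repeats for several other columns, so it could be wrapped in one helper. The following is only a sketch: `fill_and_bucket` is a hypothetical name, and defaulting to the mode is an assumption that matches these two columns but not necessarily every column.

```python
import pandas as pd

def fill_and_bucket(series: pd.Series, mapping: dict, fill_value=None) -> pd.Series:
    """Fill NaNs (with the mode unless `fill_value` is given), then map raw values to buckets."""
    if fill_value is None:
        # mode() ignores NaN by default; take the first mode if there are ties
        fill_value = series.mode().iloc[0]
    filled = series.fillna(fill_value)
    return filled.replace(mapping)

# Toy data (hypothetical values) using a subset of the drinks categories.
toy_drinks = pd.Series(["socially", "often", None, "not at all", "socially"])
toy_mapping = {"socially": "sometimes", "often": "often", "not at all": "rarely"}
bucketed = fill_and_bucket(toy_drinks, toy_mapping)
print(bucketed.value_counts())
```

Passing `fill_value` explicitly covers columns where the fill is a deliberate choice rather than the mode.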

In [21]:
# we will check unique values for drugs and fill in missing values
data['drugs'].value_counts()

never        1134
sometimes     240
often          18
Name: drugs, dtype: int64

In [22]:
# as the values are categorical, we fill NaNs with the mode, 'never', which makes up about 80% of the rows
# that have a value for 'drugs'
data['drugs'] = data['drugs'].fillna('never')
data['drugs'].value_counts()

never        1524
sometimes     240
often          18
Name: drugs, dtype: int64

In [19]:
# we will check unique values for ethnicity and fill in missing values


In [24]:
# height has no missing values in this data (1782 of 1782 rows), so filling with the mean height
# only acts as a safeguard
data['height'] = data['height'].fillna(data['height'].mean())

In [25]:
# we will check unique values for jobs and fill in missing values
data['job'].value_counts()

STEM            444
other           404
business        345
liberal arts    287
student         159
education       117
not working      19
military          7
Name: job, dtype: int64

In [26]:
job_dictionary = {
    'other': 'other',
    'student': 'student',
    'science / tech / engineering': 'STEM',
    'computer / hardware / software': 'STEM',
    'artistic / musical / writer': 'liberal arts',
    'sales / marketing / biz dev': 'business',
    'education / academia': 'education',
    'medicine / health': 'STEM',
    'banking / financial / real estate': 'business',
    'executive / management': 'business',
    'hospitality / travel': 'business',
    'entertainment / media': 'liberal arts',
    'law / legal services': 'liberal arts',
    'clerical / administrative': 'business',
    'political / government': 'liberal arts', 
    'construction / craftsmanship': 'STEM',
    'rather not say': 'other',
    'transportation': 'STEM',
    'unemployed': 'not working',
    'retired': 'not working',
    'military': 'military'
}

data['job'] = data['job'].fillna('other')
data['job'] = data['job'].replace(job_dictionary)
data['job'].value_counts()

STEM            444
other           404
business        345
liberal arts    287
student         159
education       117
not working      19
military          7
Name: job, dtype: int64

In [27]:
# we will check unique values for pets and fill in missing values
# we may want to drop this column, since about 31% of its values (552 of 1782) are missing
data['pets'].value_counts()

likes dogs and likes cats          455
likes dogs                         190
likes dogs and has cats            155
has dogs                           137
has dogs and likes cats             80
likes dogs and dislikes cats        63
has cats                            49
has dogs and has cats               47
likes cats                          27
dislikes dogs and dislikes cats     10
has dogs and dislikes cats           9
dislikes dogs and likes cats         4
dislikes cats                        3
dislikes dogs and has cats           1
Name: pets, dtype: int64

In [28]:
pet_dictionary = {
    'likes dogs and likes cats': 'likes pets',
    'likes dogs': 'likes pets',
    'likes dogs and has cats': 'has pets',
    'has dogs': 'has pets',
    'has dogs and likes cats': 'has pets',
    'likes dogs and dislikes cats': 'likes pets',
    'has cats': 'has pets',
    'has dogs and has cats': 'has pets',
    'likes cats': 'likes pets',
    'dislikes dogs and dislikes cats': 'dislikes pets',
    'has dogs and dislikes cats': 'has pets',
    'dislikes dogs and likes cats': 'likes pets',
    'dislikes cats': 'dislikes pets',
    'dislikes dogs and has cats': 'has pets'
}

# will fill NaNs with 'likes pets' as the average person does not mind pets, but we do not want to assume 
# ownership
data['pets'] = data['pets'].fillna('likes pets')
data['pets'] = data['pets'].replace(pet_dictionary)
data['pets'].value_counts()

likes pets       1291
has pets          478
dislikes pets      13
Name: pets, dtype: int64

In [32]:
# we will check unique values for religion and fill in missing values
data['religion'].value_counts()

not-listed    770
agnostic      296
christian     285
atheist       216
jewish         88
buddhist       66
catholic       38
hindu          15
muslim          8
Name: religion, dtype: int64

In [33]:
# map distinct religion values to buckets
religion_dictionary = {
    
    "agnosticism but not too serious about it": "agnostic",
    "agnosticism": "agnostic",
    "agnosticism and laughing about it": "agnostic",
    "other": "not-listed",
    "atheism": "atheist",
    "other and laughing about it": "not-listed",
    "christianity": "christian",
    "catholicism but not too serious about it": "christian",
    "atheism and laughing about it": "atheist",
    "christianity but not too serious about it": "christian",
    "atheism but not too serious about it": "atheist",
    "other but not too serious about it": "not-listed",
    "judaism but not too serious about it": "jewish",
    "catholicism": "christian",
    "other and somewhat serious about it": "not-listed",
    "catholicism and laughing about it": "christian",
    "christianity and somewhat serious about it": "christian",
    "atheism and somewhat serious about it": "atheist",
    "buddhism and laughing about it": "buddhist",
    "agnosticism and somewhat serious about it": "agnostic",
    "judaism and laughing about it": "jewish",
    "judaism": "jewish",
    "atheism and very serious about it": "atheist",
    "catholicism and somewhat serious about it": "christian",
    "buddhism but not too serious about it": "buddhist",
    "buddhism": "buddhist",
    "christianity and very serious about it": "christian",
    "other and very serious about it": "not-listed",
    "christianity and laughing about it": "christian",
    "agnosticism and very serious about it": "agnostic",
    "buddhism and somewhat serious about it": "buddhist",
    "hinduism but not too serious about it": "hindu",
    "judaism and somewhat serious about it": "jewish",
    "catholicism and very serious about it": "christian",
    "hinduism and somewhat serious about it": "hindu",
    "buddhism and very serious about it": "buddhist",
    "hinduism": "hindu",
    "islam": "muslim",
    "islam but not too serious about it": "muslim",
    "islam and very serious about it": "muslim",
    "hinduism and laughing about it": "hindu",
    "islam and somewhat serious about it": "muslim",
    "hinduism and very serious about it": "hindu",
    "judaism and very serious about it": "jewish",
    "islam and laughing about it": "muslim"
}

data['religion'] = data['religion'].fillna('not-listed')
data['religion'] = data['religion'].replace(religion_dictionary)
data['religion'].value_counts()

not-listed    770
agnostic      296
christian     285
atheist       216
jewish         88
buddhist       66
catholic       38
hindu          15
muslim          8
Name: religion, dtype: int64

In [54]:
# we will check unique values for smokes and fill in missing values
data['smokes'].value_counts()

In [56]:
# 'no' is the mode and makes up about 75% of the data, so we will fill NaNs with 'no'
data['smokes'] = data['smokes'].fillna('no')
data['smokes'].value_counts()

no                1489
sometimes          110
when drinking       83
yes                 55
trying to quit      45
Name: smokes, dtype: int64

In [None]:
'''
We do not handle noise or outliers separately. All of the data is user-entered, so there is no measurement
noise from data collection. Additionally, as the data is mainly categorical, it is tough to flag a value
as an outlier because it can't be plotted in an n-dimensional plot.
'''

## Data Exploration

## Feature Engineering

In [None]:
# we can possibly normalize the age and height as those are our only numerical attributes
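
One way to do this is min-max scaling, which maps each numerical column onto [0, 1]. The following is a minimal sketch and not necessarily the normalization this project will end up using: `min_max_normalize` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

def min_max_normalize(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of `frame` with each listed column rescaled to [0, 1]."""
    result = frame.copy()
    for name in columns:
        low, high = result[name].min(), result[name].max()
        # Guard against a constant column, which would divide by zero.
        result[name] = 0.0 if high == low else (result[name] - low) / (high - low)
    return result

# Toy frame standing in for the two numerical attributes (hypothetical values).
toy = pd.DataFrame({"age": [20, 30, 40], "height": [60.0, 66.0, 72.0]})
normalized = min_max_normalize(toy, ["age", "height"])
print(normalized)
```

When a model is evaluated on held-out data, the minimum and maximum should come from the training split only, so that no information from the test split leaks into the features.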

## Modeling

## Results and Analysis