# Name(s): Ojas Patel, Pranav Naravetla, Suhas Dara, Avinash Damania

# OKCupid Data Mining Project

## Introduction

In this project, we will use an OKCupid dataset to predict a user's education level from information in their dating profile, such as physical traits and lifestyle choices.



In [2]:
# Imports used throughout the notebook
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from numpy.random import rand
from numpy import square, sqrt
from pandas import DataFrame
from sklearn.manifold import MDS
from sklearn.model_selection import StratifiedKFold
from scipy.spatial.distance import pdist

## Data Prep

In this section, we will clean the data in preparation for use in training our models.

In [25]:
df = pd.read_csv("test_profiles.csv")
df.columns

# we must drop rows that do not have an education value
df = df[df.education.notnull()]

label = df['education']
data = df.drop(columns=['education'])

data.head()

Unnamed: 0.1,Unnamed: 0,age,body_type,diet,drinks,drugs,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,0,28,athletic,mostly anything,socially,sometimes,i'm looking to meet someone who i have a lot i...,"i work in the tech industry during the day, an...","i'm an expert at scrabble, designing ambigrams...","my lip ring, which i use to distract people fr...",...,"san francisco, california",,straight,likes dogs and has cats,christianity and laughing about it,m,taurus but it doesn&rsquo;t matter,no,"english (fluently), chinese (fluently), japane...",single
1,1,34,average,mostly anything,socially,never,"they say i'm a smart, funny, worldly girl who ...","i'm transitioning from a lost, once ambitious ...",listening<br />\n<br />\nnot judging others<br...,eyes<br />\n<br />\nsmile<br />\n<br />\nthoug...,...,"san francisco, california",,straight,has dogs and likes cats,,f,aries and it&rsquo;s fun to think about,no,english,single
2,2,29,fit,mostly anything,socially,never,update: i am in bmore/philly till may 27th for...,i'm self-employed events technician and it's w...,"being a smart ass, saying inappropriate things...",my charming good looks and the piece of lettuc...,...,"oakland, california",,straight,likes dogs and likes cats,,m,,no,"english (fluently), swedish (okay), spanish (p...",single
3,3,45,athletic,,not at all,never,,,,,...,"san francisco, california",,gay,likes dogs and likes cats,other,m,cancer,no,"english, spanish (poorly)",single
4,4,37,average,,socially,,hmmi freely give compliments. i appreciate a g...,i'm a teacher by day and i usually love it. i'...,,,...,"san francisco, california",,straight,,,f,taurus and it&rsquo;s fun to think about,no,english,single


In [26]:
len(data)

1782

In [27]:
data.count()

Unnamed: 0     1782
age            1782
body_type      1635
diet           1082
drinks         1727
drugs          1392
essay0         1615
essay1         1572
essay2         1530
essay3         1469
essay4         1509
essay5         1497
essay6         1429
essay7         1451
essay8         1235
essay9         1453
ethnicity      1627
height         1782
income         1782
job            1609
last_online    1782
location       1782
offspring       757
orientation    1782
pets           1230
religion       1243
sex            1782
sign           1511
smokes         1645
speaks         1780
status         1782
dtype: int64

In [28]:
data = data.drop(columns=['Unnamed: 0', 'essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6',
                          'essay7', 'essay8', 'essay9', 'last_online', 'sign', 'offspring', 'diet', 'speaks',
                          'location', 'status'])

# removing Unnamed: 0 as it is a repeat of the index

# removing essay0, essay1, essay2, essay3, essay4, essay5, essay6, essay7, essay8, essay9 as every row has 
#a unique value

# removing last_online as it can't be used to predict the label (education)

# removing sign as it can't be used to predict the label (education)

# removing offspring as too many rows have NaN as a value

# removing diet as too many rows have NaN as a value

# removing speaks as there are too many distinct values and cannot be mapped into a smaller domain

# removing location as there are too many distinct values and it cannot be used to predict the label (education)

# removing status as there is a heavy imbalance with almost all of the values being 'single'
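
The reasons given in the comments above (too many NaNs, too many distinct values, heavy imbalance) can be checked numerically before dropping a column. The following is a minimal sketch, not the method used in this notebook: the helper name `profile_columns` and the toy frame are hypothetical, and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd


def profile_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarise missingness, cardinality, and mode dominance per column."""
    n_rows = len(frame)
    rows = []
    for col in frame.columns:
        counts = frame[col].value_counts(dropna=True)
        rows.append({
            "column": col,
            # fraction of rows that are NaN
            "missing_frac": frame[col].isna().mean(),
            # number of distinct non-null values
            "n_unique": int(frame[col].nunique(dropna=True)),
            # share of all rows taken by the most frequent value
            "top_value_frac": counts.iloc[0] / n_rows if len(counts) else 0.0,
        })
    return pd.DataFrame(rows).set_index("column")


# Toy stand-in for the profile data (values are invented, not from the dataset).
toy = pd.DataFrame({
    "status": ["single", "single", "single", "single"],
    "offspring": [np.nan, np.nan, "has kids", np.nan],
    "essay0": ["a", "b", "c", "d"],
})
summary = profile_columns(toy)
print(summary)
```

On the real data, calling `profile_columns(data)` before the drop would give one row per column, which makes thresholds such as "too many NaNs" explicit rather than implicit.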

In [29]:
label_engineering = {
    'graduated from college/university': 'bachelors',
    'working on college/university': 'bachelors',
    'working on masters program': 'advanced degree',
    'graduated from two-year college': 'associates',
    'graduated from high school': 'high-school',
    'graduated from ph.d program': 'advanced degree',
    'graduated from law school': 'advanced degree',
    'working on two-year college': 'associates',
    'working on ph.d program': 'advanced degree',
    'dropped out of college/university': 'high-school',
    'college/university': 'bachelors',
    'graduated from space camp': 'spacecamp',
    'dropped out of space camp': 'spacecamp',
    'graduated from med school': 'advanced degree',
    'working on space camp': 'spacecamp',
    'working on law school': 'advanced degree',
    'working on med school': 'advanced degree',
    'dropped out of two-year college': 'high-school',
    'two-year college': 'associates',
    'masters program': 'advanced degree',
    'dropped out of masters program': 'advanced degree',
    'dropped out of ph.d program': 'advanced degree',
    'high school': 'high-school',
    'dropped out of high school': 'high-school',
    'working on high school': 'high-school',
    'space camp': 'spacecamp',
    'ph.d program': 'advanced degree',
    'med school': 'advanced degree',
    'law school': 'advanced degree',
    'dropped out of law school': 'advanced degree',
    'dropped out of med school': 'advanced degree',
    'graduated from masters program': 'advanced degree'}



In [30]:
label = label.replace(label_engineering)

In [31]:
label.unique()

array(['advanced degree', 'bachelors', 'spacecamp', 'associates',
       'high-school'], dtype=object)
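
Because `Series.replace` leaves any string that is missing from the dictionary untouched, a misspelled or absent key does not raise an error. The sketch below shows one possible way to audit a mapping with `Series.map`, which yields NaN for values that have no entry. The miniature `mapping`, the `raw` series, and the value 'graduated from culinary school' are hypothetical stand-ins, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the mapping and the raw label series.
mapping = {
    "graduated from college/university": "bachelors",
    "graduated from masters program": "advanced degree",
    "graduated from high school": "high-school",
}
raw = pd.Series([
    "graduated from college/university",
    "graduated from masters program",
    "graduated from culinary school",  # made-up value with no mapping entry
])

# Series.map returns NaN for values absent from the dict, whereas
# Series.replace would silently keep the raw string.
mapped = raw.map(mapping)

# Raw values that the mapping does not cover.
unmapped = sorted(raw[mapped.isna()].unique())
# Mapping keys that never occur in the data (possible typos or unused entries).
unused_keys = sorted(set(mapping) - set(raw))

print(unmapped)
print(unused_keys)
```

Running the same two checks with `label_engineering` and the original `education` column would list any raw category that slipped through and any key that matches nothing.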

In [32]:
# Now we must fill in missing values for remaining columns
data.count()

age            1782
body_type      1635
drinks         1727
drugs          1392
ethnicity      1627
height         1782
income         1782
job            1609
orientation    1782
pets           1230
religion       1243
sex            1782
smokes         1645
dtype: int64

In [33]:
# age and sex have no missing values
# First we will check unique values for body_type and fill in missing values

In [34]:
# we will check unique values for drinks and fill in missing values
data['drinks'].value_counts()

socially       1239
rarely          194
often           176
not at all       90
desperately      15
very often       13
Name: drinks, dtype: int64
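
To see how dominant each category is, the counts can be converted to shares that include the missing rows. This is a small sketch that rebuilds the column from the counts printed above (1782 total rows minus 1727 non-null values leaves 55 NaNs); on the real data the last two lines would be applied to `data['drinks']` directly.

```python
import numpy as np
import pandas as pd

# Counts copied from the value_counts() output above; 1782 - 1727 = 55 rows are NaN.
counts = {"socially": 1239, "rarely": 194, "often": 176,
          "not at all": 90, "desperately": 15, "very often": 13}
values = [name for name, n in counts.items() for _ in range(n)] + [np.nan] * 55
drinks = pd.Series(values, dtype="object")

# normalize=True gives fractions; dropna=False keeps NaN as its own category.
shares = drinks.value_counts(normalize=True, dropna=False)
print(shares.round(3))
```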

In [42]:
# drinks is categorical, and its mode, 'socially', accounts for about 70% of all rows (1239 of 1782),
# so we will fill all NaNs with 'socially'

data['drinks'] = data['drinks'].fillna('socially')
data['drinks'].value_counts()

socially       1294
rarely          194
often           176
not at all       90
desperately      15
very often       13
Name: drinks, dtype: int64
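
The same mode-based fill could be reused for the other categorical columns. The helper below is a minimal sketch under that assumption; the name `fill_with_mode` and the toy series are hypothetical. Filling with the mode inflates the majority category, so it is most defensible when few values are missing, as with `drinks`.

```python
import numpy as np
import pandas as pd


def fill_with_mode(series: pd.Series) -> pd.Series:
    """Return a copy of `series` with NaNs replaced by its most frequent value."""
    modes = series.mode(dropna=True)
    if modes.empty:  # column is entirely NaN, so there is nothing to impute with
        return series.copy()
    # mode() can return several values on ties; take the first one
    return series.fillna(modes.iloc[0])


# Toy stand-in for a categorical column (values invented for illustration).
toy_drinks = pd.Series(["socially", "rarely", np.nan, "socially", np.nan])
filled = fill_with_mode(toy_drinks)
print(filled.value_counts())
```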

In [None]:
# we will check unique values for drugs and fill in missing values

In [None]:
# we will check unique values for ethnicity and fill in missing values

In [None]:
# height has no NaN values (1782 of 1782 rows), so we will only check its unique values

In [None]:
# income has no NaN values (1782 of 1782 rows), so we will only check its unique values

In [None]:
# we will check unique values for job and fill in missing values

In [None]:
# orientation has no NaN values (1782 of 1782 rows), so we will only check its unique values

In [None]:
# we will check unique values for pets and fill in missing values

In [None]:
# we will check unique values for religion and fill in missing values

In [None]:
# sex has no NaN values (1782 of 1782 rows), so we will only check its unique values

In [None]:
# we will check unique values for smokes and fill in missing values

## Data Exploration

## Feature Engineering

## Modeling

## Results and Analysis