In [71]:
%matplotlib inline

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Problem Statement:

Using the Speed Dating Data provided by Kaggle, estimate how likely a participant is to choose to see a partner again (dec=1), based on the participant's perception of that partner and their history with them (attr, sinc, intel, fun, amb, shar, like, prob, met).

### An outline of any potential methods and models:

- Get an understanding of the counts for each feature and the outcome
- See how each feature individually affects the outcome (using odds ratios and seaborn graphs)
- Split data set into train and test
- Perform Logistic Regression to calculate the probability a participant will choose to see a person again.
- Evaluate the model with cross_val_score; use GridSearchCV to tune it if the score is not satisfactory
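
The outline above can be sketched end to end on made-up data. This is only a minimal illustration, not the final analysis: the `toy` frame, its two ratings, the cut-off of 6 for a "high" rating, and the random seed are all assumptions chosen for the example, and scikit-learn is assumed to be installed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the real data (assumption): two 1-10 ratings and a
# binary decision that becomes more likely as `attr` increases.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "attr": rng.integers(1, 11, size=n),
    "sinc": rng.integers(1, 11, size=n),
})
p_yes = 1 / (1 + np.exp(-(toy["attr"] - 5.5)))
toy["dec"] = (rng.random(n) < p_yes).astype(int)

# Counts for the outcome.
print(toy["dec"].value_counts())

# Odds ratio of dec=1 for high (>= 6) versus low attraction ratings.
level = pd.Series(np.where(toy["attr"] >= 6, "high", "low"), name="attr_level")
table = pd.crosstab(level, toy["dec"])
odds_high = table.loc["high", 1] / table.loc["high", 0]
odds_low = table.loc["low", 1] / table.loc["low", 0]
odds_ratio = odds_high / odds_low
print("odds ratio (high vs low attr):", round(odds_ratio, 2))

# Train/test split, then logistic regression on the training part.
X = toy[["attr", "sinc"]]
y = toy["dec"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model = LogisticRegression()
model.fit(X_train, y_train)

# Predicted probability that the participant says yes, per test row.
proba_yes = model.predict_proba(X_test)[:, 1]

# 5-fold cross-validated accuracy on the training data.
cv_scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)
print("mean CV accuracy:", round(cv_scores.mean(), 3))
```

An odds ratio above 1 here means a "yes" is more likely among high attraction ratings than among low ones; the same crosstab could be repeated for each feature of interest.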

In [72]:
#df.columns.get_loc('attr1_s')
# Read only the first 108 columns (positions 0-107) of the file
df = pd.read_csv('Speed Dating Data.csv', usecols=list(range(108)))

### Data Dictionary:

(The file has more columns than the analysis needs, in case I decide to include more features later.)

FieldName|Description
---------|-----------------------------------------------------------------------------------
iid|unique subject number, group(wave id gender)
gender|Female=0 Male=1
order|the number of the date that night on which the participant met the partner
pid|partner’s iid number
match|1=yes, 0=no
int_corr|correlation between participant’s and partner’s ratings of interests in Time 1
samerace|participant and the partner were the same race. 1= yes, 0=no
dec_o|decision of partner the night of event
attr_o|rating of attraction by partner the night of the event
sinc_o|rating of sincerity by partner the night of the event
intel_o|rating of intelligence by partner the night of the event
fun_o|rating of fun by partner the night of the event
amb_o|rating of ambition by partner the night of the event
shar_o|rating of shared interest by partner the night of the event
like_o|how much does your partner like you overall? (1=don't like at all, 10=like a lot)
prob_o|How probable your partner thinks it is that you will say 'yes' for him? (1=not probable, 10=extremely probable)
dec|Decision: 1=Yes 0=No
attr|Rate attraction of partner 1-10
sinc|Rate sincerity of partner 1-10
intel|Rate intelligence of partner 1-10
fun|Rate fun(ness) of partner 1-10
amb|Rate ambition of partner 1-10
shar|Rate shared interest of partner 1-10
like|Overall, how much do you like this person? (1=don't like at all, 10=like a lot)
prob|How probable do you think it is that this person will say 'yes' for you? (1=not probable, 10=extremely probable)
met|Have you met this partner before: 1=Yes 2=No

field_cd|Field of study (coded)
---------|-----------------------------------------------------------------------------------
1|Law
2|Math
3|Social Science, Psychologist
4|Medical Science, Pharmaceuticals, and Bio Tech
5|Engineering
6|English/Creative Writing/Journalism
7|History/Religion/Philosophy
8|Business/Econ/Finance
9|Education, Academia
10|Biological Sciences/Chemistry/Physics
11|Social Work
12|Undergrad/undecided
13|Political Science/International Affairs
14|Film
15|Fine Arts/Arts Administration
16|Languages
17|Architecture
18|Other

In [None]:
# The data has many features, so remove the columns not needed for this analysis
col_to_drop = ['id', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1', 'partner', 'met_o',
               'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb', 'pf_o_sha',
               'age', 'age_o', 'field', 'field_cd']
# Columns from the start of the file through 'field_cd' (inclusive)
first_part = df.iloc[:, :df.columns.get_loc('field_cd')+1]
# Columns from 'dec' through 'met' (inclusive): the participant's decision and ratings
second_part = df.iloc[:, df.columns.get_loc('dec'):df.columns.get_loc('met')+1]

# Recombine the two blocks on the row index, then drop the unused columns
df = first_part.join(second_part).drop(col_to_drop, axis=1)
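
As a possible alternative to the positional slicing above, the same kind of column selection can be written with label-based `.loc` slices, which include both endpoints. The sketch below runs on a small made-up frame: the column names are borrowed from the data dictionary (`career` and `match_es` are only filler columns for the example), and the values are invented.

```python
import pandas as pd

# Toy frame with invented values; it only mimics the column order of the real file.
toy = pd.DataFrame({
    "iid": [1, 2], "id": [1, 2], "gender": [0, 1], "field_cd": [1, 8],
    "career": ["lawyer", "banker"],
    "dec": [1, 0], "attr": [6, 7], "like": [7, 5], "met": [2, 2],
    "match_es": [4, 3],
})

# .loc slices by label and includes both endpoints, so no get_loc(...) + 1 is needed.
first_part = toy.loc[:, :"field_cd"]
second_part = toy.loc[:, "dec":"met"]

# drop() raises a KeyError by default if a name in the list does not exist,
# which catches misspelled column names early.
trimmed = pd.concat([first_part, second_part], axis=1).drop(columns=["id", "field_cd"])
print(trimmed.columns.tolist())
```

Columns that fall between the two slices (`career` here) and after the second slice (`match_es`) are left out without having to be listed in the drop list.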

### Describe any outstanding questions, assumptions, risks, caveats

- Most of the features are the participant's subjective opinions rather than objective facts, so the way they use the rating scale may drift over the course of the surveys
- Outliers can skew the data
- A large percentage of the observations contain null values, which will be filled with the mode, median, or mean; this imputation will also affect the data
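
The null-value caveat can be made concrete with a small sketch. The frame below is invented, and the choice of median for the 1-10 ratings and mode for the coded `met` column is only one assumption among the mode/median/mean options mentioned above.

```python
import numpy as np
import pandas as pd

# Invented example rows with gaps in different columns.
toy = pd.DataFrame({
    "attr": [6.0, np.nan, 8.0, 7.0, 5.0],
    "shar": [np.nan, 4.0, np.nan, 6.0, 5.0],
    "met": [2.0, 2.0, np.nan, 1.0, 2.0],
})

# Share of missing values per column, and number of fully complete rows.
null_share = toy.isnull().mean()
complete_rows = len(toy.dropna())
print(null_share)
print("complete rows:", complete_rows)

# Fill ratings with the column median and the coded column with its mode.
filled = toy.copy()
for col in ["attr", "shar"]:
    filled[col] = filled[col].fillna(filled[col].median())
filled["met"] = filled["met"].fillna(filled["met"].mode().iloc[0])
print(filled)
```

Because the fill values are computed from the data itself, it is usually safer to compute them on the training split only and then apply them to the test split, so that the test rows do not influence the imputation.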

### Demonstrate domain knowledge, including specific features or relevant benchmarks from similar projects

- Type: classification problem
- Outcome: dec column (1=Yes, 0=No)
- Number of observations: 8378
- Number of observations with no null values: 5669
- Method: Logistic Regression, similar to the flight delay and Titanic problem sets

### Define your goals and criteria, in order to explain what success looks like
1. Understanding the data: Are there any outliers? What is each feature's relationship with the outcome? What are the value counts?
2. Performing the analysis: Make sure none of the variables are highly correlated, and make sure the data is not skewed by outliers.
3. Getting the results: Is the cross-validation score satisfactory? If not, use a grid search to see how it can be improved.
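
A rough sketch of how these three checks might look in code follows. Everything in it is an assumption made for illustration: the synthetic `toy` frame, the planted out-of-range value, the 0.8 correlation threshold, and the grid of `C` values. It also assumes scikit-learn is available.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption): `like` is built to track `attr` closely.
rng = np.random.default_rng(1)
n = 400
attr = rng.integers(1, 11, size=n).astype(float)
toy = pd.DataFrame({
    "attr": attr,
    "like": np.clip(attr + rng.normal(0, 1, size=n), 1, 10),
    "amb": rng.integers(1, 11, size=n).astype(float),
})
toy["dec"] = (rng.random(n) < 1 / (1 + np.exp(-(attr - 5.5)))).astype(int)
toy.loc[0, "amb"] = 99.0  # one planted value outside the 1-10 scale

features = ["attr", "like", "amb"]

# 1. Outliers: ratings are defined on 1-10, so flag rows outside that range.
below = (toy[features] < 1).any(axis=1)
above = (toy[features] > 10).any(axis=1)
out_of_range = toy[below | above]

# 2. Correlation: list feature pairs whose absolute correlation exceeds 0.8.
corr = toy[features].corr().abs()
high_pairs = [
    (a, b)
    for i, a in enumerate(features)
    for b in features[i + 1:]
    if corr.loc[a, b] > 0.8
]
print("highly correlated pairs:", high_pairs)

# 3. Results: grid search over the regularization strength C with 5-fold CV.
clean = toy.drop(out_of_range.index)
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
)
grid.fit(clean[features], clean["dec"])
print(grid.best_params_, round(grid.best_score_, 3))
```

If a pair such as `attr` and `like` turns out to be highly correlated in the real data, one option is to keep only one of the two in the model.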

### Bonus:

1. Consider alternative hypotheses:
    - Compare how partners rate each other and use the similarity of their ratings to predict whether they will match.
    - Use the participant's background, such as age, field, SATs (some currently excluded), race, school[...], to predict whether a person will want to see the participant again.

2. "Convert" your goal metric from a statistical one (like Mean Squared Error) and tie it to something non-data people can understand, like a cost/benefit analysis, etc.
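
One way to approach this conversion is to turn the model's confusion-matrix counts into a net benefit. The sketch below is purely illustrative: the counts and the per-outcome values are hypothetical numbers, not results from this data, and the framing (an organizer using the model to suggest second dates) is an assumption.

```python
# Hypothetical confusion-matrix counts from a held-out test set (not real results).
tp, fp, fn, tn = 120, 40, 60, 280

# Hypothetical values per outcome, in arbitrary units:
value_tp = 10.0  # a correctly predicted "yes" leads to a suggested second date
cost_fp = 4.0    # a wrongly predicted "yes" wastes a suggestion
cost_fn = 6.0    # a missed "yes" is a lost second date

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Net benefit of acting on the model's predictions.
net_benefit = tp * value_tp - fp * cost_fp - fn * cost_fn
net_per_prediction = net_benefit / total

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
print(f"net benefit={net_benefit:.0f} ({net_per_prediction:.2f} per prediction)")
```

With real counts from the test set, the same arithmetic lets two models be compared by net benefit per prediction rather than by accuracy alone.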