Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70% ? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [None]:
import pandas as pd
import pandas_profiling.profile_report as profile
import numpy as np

In [None]:
path = '../data/speed_dating/speed_dating_data.csv'
df = pd.read_csv(path)

In [None]:
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

In [None]:
#Uncomment to run profile report

#profile.ProfileReport(df, True, dark_mode=True).to_file('speed_dating_profile_report.html')

In [None]:
print(df.shape)
df.head()

(8378, 195)


Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,match,int_corr,samerace,age_o,race_o,pf_o_att,pf_o_sin,pf_o_int,pf_o_fun,pf_o_amb,pf_o_sha,dec_o,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o,met_o,age,field,field_cd,undergra,mn_sat,tuition,race,imprace,imprelig,from,zipcode,income,goal,date,go_out,career,career_c,...,shar2_2,attr3_2,sinc3_2,intel3_2,fun3_2,amb3_2,attr5_2,sinc5_2,intel5_2,fun5_2,amb5_2,you_call,them_cal,date_3,numdat_3,num_in_3,attr1_3,sinc1_3,intel1_3,fun1_3,amb1_3,shar1_3,attr7_3,sinc7_3,intel7_3,fun7_3,amb7_3,shar7_3,attr4_3,sinc4_3,intel4_3,fun4_3,amb4_3,shar4_3,attr2_3,sinc2_3,intel2_3,fun2_3,amb2_3,shar2_3,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
0,1,1.0,0,1,1,1,10,7,,4,1,11.0,0,0.14,0,27.0,2.0,35.0,20.0,20.0,20.0,0.0,5.0,0,6.0,8.0,8.0,8.0,8.0,6.0,7.0,4.0,2.0,21.0,Law,1.0,,,,4.0,2.0,4.0,Chicago,60521,69487.0,2.0,7.0,1.0,lawyer,,...,,6.0,7.0,8.0,7.0,6.0,,,,,,1.0,1.0,0.0,,,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,,,,,,,,,,,,,5.0,7.0,7.0,7.0,7.0,,,,,
1,1,1.0,0,1,1,1,10,7,,3,2,12.0,0,0.54,0,22.0,2.0,60.0,0.0,0.0,40.0,0.0,0.0,0,7.0,8.0,10.0,7.0,7.0,5.0,8.0,4.0,2.0,21.0,Law,1.0,,,,4.0,2.0,4.0,Chicago,60521,69487.0,2.0,7.0,1.0,lawyer,,...,,6.0,7.0,8.0,7.0,6.0,,,,,,1.0,1.0,0.0,,,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,,,,,,,,,,,,,5.0,7.0,7.0,7.0,7.0,,,,,
2,1,1.0,0,1,1,1,10,7,,10,3,13.0,1,0.16,1,22.0,4.0,19.0,18.0,19.0,18.0,14.0,12.0,1,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,1.0,21.0,Law,1.0,,,,4.0,2.0,4.0,Chicago,60521,69487.0,2.0,7.0,1.0,lawyer,,...,,6.0,7.0,8.0,7.0,6.0,,,,,,1.0,1.0,0.0,,,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,,,,,,,,,,,,,5.0,7.0,7.0,7.0,7.0,,,,,
3,1,1.0,0,1,1,1,10,7,,5,4,14.0,1,0.61,0,23.0,2.0,30.0,5.0,15.0,40.0,5.0,5.0,1,7.0,8.0,9.0,8.0,9.0,8.0,7.0,7.0,2.0,21.0,Law,1.0,,,,4.0,2.0,4.0,Chicago,60521,69487.0,2.0,7.0,1.0,lawyer,,...,,6.0,7.0,8.0,7.0,6.0,,,,,,1.0,1.0,0.0,,,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,,,,,,,,,,,,,5.0,7.0,7.0,7.0,7.0,,,,,
4,1,1.0,0,1,1,1,10,7,,7,5,15.0,1,0.21,0,24.0,3.0,30.0,10.0,20.0,10.0,10.0,20.0,1,8.0,7.0,9.0,6.0,9.0,7.0,8.0,6.0,2.0,21.0,Law,1.0,,,,4.0,2.0,4.0,Chicago,60521,69487.0,2.0,7.0,1.0,lawyer,,...,,6.0,7.0,8.0,7.0,6.0,,,,,,1.0,1.0,0.0,,,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,,,,,,,,,,,,,5.0,7.0,7.0,7.0,7.0,,,,,


Most participants did not match with anyone during their respective event. This is most likely the reason for the missing values on questionnaires sent out after the event.

In [None]:
target = df['match']
train_targ_vcounts = target.value_counts()
print('Class Percentages')
print('Not Matched: ', train_targ_vcounts[0]/len(target))
print('Matched: ', train_targ_vcounts[1]/len(target))

Class Percentages
Not Matched:  0.8352828837431368
Matched:  0.16471711625686322
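
With a majority class near 84%, the target falls outside the 50-70% range given in the assignment, so accuracy alone could mislead. A minimal sketch of why, using a small synthetic target (the `y_true` values are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Synthetic target with roughly the same imbalance as `match` (illustrative only)
y_true = pd.Series([0] * 84 + [1] * 16)

# A "model" that always predicts the majority class
y_pred = pd.Series([0] * len(y_true))

# Accuracy: share of predictions that equal the true label
accuracy = (y_true == y_pred).mean()

# Recall for the positive class: share of actual matches that were predicted
true_pos = ((y_true == 1) & (y_pred == 1)).sum()
recall = true_pos / (y_true == 1).sum()

print('Accuracy:', accuracy)
print('Recall:', recall)
```

The always-negative predictor scores high on accuracy while finding no matches at all, which is why metrics such as recall, precision, F1, or ROC AUC may be worth reporting alongside or instead of accuracy.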


In [None]:
# Hold out the last 20% of rows (by position) as the test set
split_at = int(df.shape[0] * .8)
test = df.loc[split_at:, :]
print(test.shape)
test.head()

(1676, 195)


Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,match,int_corr,samerace,age_o,race_o,pf_o_att,pf_o_sin,pf_o_int,pf_o_fun,pf_o_amb,pf_o_sha,dec_o,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o,met_o,age,field,field_cd,undergra,mn_sat,tuition,race,imprace,imprelig,from,zipcode,income,goal,date,go_out,career,career_c,...,shar2_2,attr3_2,sinc3_2,intel3_2,fun3_2,amb3_2,attr5_2,sinc5_2,intel5_2,fun5_2,amb5_2,you_call,them_cal,date_3,numdat_3,num_in_3,attr1_3,sinc1_3,intel1_3,fun1_3,amb1_3,shar1_3,attr7_3,sinc7_3,intel7_3,fun7_3,amb7_3,shar7_3,attr4_3,sinc4_3,intel4_3,fun4_3,amb4_3,shar4_3,attr2_3,sinc2_3,intel2_3,fun2_3,amb2_3,shar2_3,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
6702,443,4.0,1,8,2,17,11,13,10.0,3,9,438.0,0,0.38,0,23.0,2.0,14.0,15.0,16.0,17.0,18.0,20.0,0,4.0,4.0,6.0,5.0,10.0,4.0,5.0,6.0,,27.0,International Affairs - Economic Policy,13.0,Univeristy of Michigan,1290.0,21645.0,3.0,1.0,3.0,New York,17403,27794.0,2.0,6.0,2.0,Economic Policy Advisor on Latin America,9.0,...,10.0,7.0,7.0,7.0,8.0,8.0,7.0,7.0,7.0,8.0,8.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6703,443,4.0,1,8,2,17,11,2,10.0,4,10,439.0,0,0.59,0,29.0,2.0,20.0,10.0,50.0,5.0,10.0,5.0,0,6.0,7.0,9.0,,7.0,,5.0,5.0,2.0,27.0,International Affairs - Economic Policy,13.0,Univeristy of Michigan,1290.0,21645.0,3.0,1.0,3.0,New York,17403,27794.0,2.0,6.0,2.0,Economic Policy Advisor on Latin America,9.0,...,10.0,7.0,7.0,7.0,8.0,8.0,7.0,7.0,7.0,8.0,8.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6704,444,5.0,1,10,2,17,11,8,11.0,9,1,430.0,0,0.33,0,22.0,4.0,10.0,25.0,15.0,20.0,15.0,15.0,0,7.0,6.0,8.0,7.0,8.0,2.0,5.0,1.0,2.0,26.0,MBA,8.0,Middlebury College,1470.0,34300.0,2.0,5.0,1.0,"Los Angeles, CA",10016,41737.0,2.0,4.0,2.0,Media Management,7.0,...,0.0,8.0,7.0,8.0,8.0,7.0,8.0,4.0,7.0,8.0,7.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6705,444,5.0,1,10,2,17,11,9,11.0,10,2,431.0,0,-0.03,0,27.0,4.0,15.0,15.0,25.0,15.0,15.0,15.0,1,8.0,7.0,8.0,8.0,7.0,4.0,6.0,4.0,2.0,26.0,MBA,8.0,Middlebury College,1470.0,34300.0,2.0,5.0,1.0,"Los Angeles, CA",10016,41737.0,2.0,4.0,2.0,Media Management,7.0,...,0.0,8.0,7.0,8.0,8.0,7.0,8.0,4.0,7.0,8.0,7.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6706,444,5.0,1,10,2,17,11,11,11.0,1,3,432.0,0,-0.23,1,28.0,2.0,15.0,30.0,30.0,10.0,10.0,5.0,1,10.0,10.0,10.0,10.0,10.0,9.0,10.0,9.0,2.0,26.0,MBA,8.0,Middlebury College,1470.0,34300.0,2.0,5.0,1.0,"Los Angeles, CA",10016,41737.0,2.0,4.0,2.0,Media Management,7.0,...,0.0,8.0,7.0,8.0,8.0,7.0,8.0,4.0,7.0,8.0,7.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [None]:
train = df.loc[:6701, :]
print(train.shape)
train.tail()

(6702, 195)


Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,match,int_corr,samerace,age_o,race_o,pf_o_att,pf_o_sin,pf_o_int,pf_o_fun,pf_o_amb,pf_o_sha,dec_o,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o,met_o,age,field,field_cd,undergra,mn_sat,tuition,race,imprace,imprelig,from,zipcode,income,goal,date,go_out,career,career_c,...,shar2_2,attr3_2,sinc3_2,intel3_2,fun3_2,amb3_2,attr5_2,sinc5_2,intel5_2,fun5_2,amb5_2,you_call,them_cal,date_3,numdat_3,num_in_3,attr1_3,sinc1_3,intel1_3,fun1_3,amb1_3,shar1_3,attr7_3,sinc7_3,intel7_3,fun7_3,amb7_3,shar7_3,attr4_3,sinc4_3,intel4_3,fun4_3,amb4_3,shar4_3,attr2_3,sinc2_3,intel2_3,fun2_3,amb2_3,shar2_3,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
6697,443,4.0,1,8,2,17,11,3,10.0,5,4,433.0,0,0.2,0,30.0,6.0,20.0,20.0,20.0,20.0,10.0,10.0,1,7.0,6.0,6.0,6.0,4.0,4.0,8.0,7.0,2.0,27.0,International Affairs - Economic Policy,13.0,Univeristy of Michigan,1290.0,21645.0,3.0,1.0,3.0,New York,17403,27794.0,2.0,6.0,2.0,Economic Policy Advisor on Latin America,9.0,...,10.0,7.0,7.0,7.0,8.0,8.0,7.0,7.0,7.0,8.0,8.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6698,443,4.0,1,8,2,17,11,4,10.0,6,5,434.0,0,0.13,0,27.0,2.0,30.0,15.0,15.0,30.0,0.0,10.0,0,6.0,7.0,8.0,9.0,7.0,8.0,7.0,8.0,2.0,27.0,International Affairs - Economic Policy,13.0,Univeristy of Michigan,1290.0,21645.0,3.0,1.0,3.0,New York,17403,27794.0,2.0,6.0,2.0,Economic Policy Advisor on Latin America,9.0,...,10.0,7.0,7.0,7.0,8.0,8.0,7.0,7.0,7.0,8.0,8.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6699,443,4.0,1,8,2,17,11,7,10.0,8,6,435.0,0,0.56,0,22.0,1.0,25.0,5.0,20.0,25.0,15.0,10.0,0,7.0,9.0,9.0,8.0,9.0,,8.0,6.0,2.0,27.0,International Affairs - Economic Policy,13.0,Univeristy of Michigan,1290.0,21645.0,3.0,1.0,3.0,New York,17403,27794.0,2.0,6.0,2.0,Economic Policy Advisor on Latin America,9.0,...,10.0,7.0,7.0,7.0,8.0,8.0,7.0,7.0,7.0,8.0,8.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6700,443,4.0,1,8,2,17,11,6,10.0,7,7,436.0,0,-0.02,0,23.0,2.0,28.0,8.0,17.0,22.0,17.0,8.0,0,7.0,10.0,9.0,9.0,9.0,7.0,8.0,10.0,2.0,27.0,International Affairs - Economic Policy,13.0,Univeristy of Michigan,1290.0,21645.0,3.0,1.0,3.0,New York,17403,27794.0,2.0,6.0,2.0,Economic Policy Advisor on Latin America,9.0,...,10.0,7.0,7.0,7.0,8.0,8.0,7.0,7.0,7.0,8.0,8.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6701,443,4.0,1,8,2,17,11,12,10.0,2,8,437.0,0,0.35,0,23.0,2.0,15.0,15.0,18.0,18.0,14.0,20.0,0,7.0,8.0,10.0,8.0,10.0,7.0,7.0,5.0,2.0,27.0,International Affairs - Economic Policy,13.0,Univeristy of Michigan,1290.0,21645.0,3.0,1.0,3.0,New York,17403,27794.0,2.0,6.0,2.0,Economic Policy Advisor on Latin America,9.0,...,10.0,7.0,7.0,7.0,8.0,8.0,7.0,7.0,7.0,8.0,8.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
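
The positional split above places rows for participant `iid` 443 in both `train` and `test`. One possible alternative, sketched here on a toy frame, is to split on `wave` so that each event falls entirely on one side. The toy values and the cutoff of wave 2 are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the real data: three waves, two participants each (illustrative)
toy = pd.DataFrame({
    'iid':   [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    'wave':  [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'match': [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
})

# Hypothetical cutoff: waves up to and including 2 train, later waves test
cutoff_wave = 2
toy_train = toy[toy['wave'] <= cutoff_wave]
toy_test = toy[toy['wave'] > cutoff_wave]

# No participant should appear on both sides of the split
shared = set(toy_train['iid']) & set(toy_test['iid'])
print(toy_train.shape, toy_test.shape, shared)
```

If waves are numbered in chronological order, this also behaves like a time-based split; that ordering is not confirmed here and would need to be checked against the dataset's documentation.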


In [None]:
# Majority-class baseline, computed on the training rows only
baseline = train['match'].value_counts(normalize=True).max()

In [None]:
train.describe()

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,match,int_corr,samerace,age_o,race_o,pf_o_att,pf_o_sin,pf_o_int,pf_o_fun,pf_o_amb,pf_o_sha,dec_o,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o,met_o,age,field_cd,race,imprace,imprelig,goal,date,go_out,career_c,sports,tvsports,exercise,dining,museums,art,hiking,gaming,...,shar2_2,attr3_2,sinc3_2,intel3_2,fun3_2,amb3_2,attr5_2,sinc5_2,intel5_2,fun5_2,amb5_2,you_call,them_cal,date_3,numdat_3,num_in_3,attr1_3,sinc1_3,intel1_3,fun1_3,amb1_3,shar1_3,attr7_3,sinc7_3,intel7_3,fun7_3,amb7_3,shar7_3,attr4_3,sinc4_3,intel4_3,fun4_3,amb4_3,shar4_3,attr2_3,sinc2_3,intel2_3,fun2_3,amb2_3,shar2_3,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
count,6702.0,6702.0,6702.0,6702.0,6702.0,6702.0,6702.0,6702.0,4856.0,6702.0,6702.0,6692.0,6702.0,6544.0,6702.0,6620.0,6629.0,6613.0,6613.0,6613.0,6604.0,6595.0,6595.0,6702.0,6561.0,6496.0,6474.0,6427.0,6122.0,5789.0,6537.0,6485.0,6420.0,6629.0,6620.0,6639.0,6623.0,6623.0,6623.0,6605.0,6623.0,6564.0,6623.0,6623.0,6623.0,6623.0,6623.0,6623.0,6623.0,6623.0,...,4290.0,5978.0,5978.0,5978.0,5978.0,5978.0,2892.0,2892.0,2892.0,2892.0,2892.0,3379.0,3379.0,3379.0,1218.0,513.0,3379.0,3379.0,3379.0,3379.0,3379.0,3379.0,1421.0,1421.0,1421.0,1421.0,1421.0,1421.0,2364.0,2364.0,2364.0,2364.0,2364.0,2364.0,2364.0,2364.0,2364.0,2364.0,2364.0,1421.0,3379.0,3379.0,3379.0,3379.0,3379.0,1421.0,1421.0,1421.0,1421.0,1421.0
mean,227.681438,8.76604,0.493136,16.942554,1.809311,9.177857,16.593256,8.910325,9.133237,8.79439,8.826321,228.043186,0.168308,0.195665,0.412563,26.38565,2.696334,22.276717,17.474464,20.231713,17.358333,11.046208,11.806229,0.425395,6.240497,7.240148,7.406163,6.424148,6.809948,5.57186,6.186202,5.256284,1.957788,26.359632,7.533686,2.687754,3.820776,3.793598,2.163068,5.00651,2.133474,5.06947,6.431074,4.585837,6.307716,7.721274,6.94202,6.650159,5.795561,3.863355,...,13.420294,7.159585,7.968886,8.278354,7.594848,7.496989,6.872407,7.506224,7.849931,7.327801,7.328838,0.74963,0.999408,0.375555,1.341544,0.871345,23.374993,16.760293,19.622874,16.315949,11.387674,12.791178,30.585503,15.995074,16.636875,16.392681,8.401126,12.467277,22.371404,10.682318,11.166244,13.790186,9.228003,10.841794,22.290186,10.51692,11.208968,14.470812,9.67555,12.406052,7.250666,8.10654,8.46937,7.658183,7.418467,6.757917,7.659395,7.923293,7.111893,7.017593
std,124.622115,5.357606,0.49999,10.667926,0.392874,4.576303,4.060673,5.382173,5.497173,5.325948,5.341548,125.013957,0.374167,0.303992,0.492332,3.572579,1.211101,12.773226,6.976048,6.854377,6.121483,6.205317,6.228629,0.49444,1.911643,1.69759,1.513865,1.938872,1.755944,2.120306,1.830183,2.12254,0.257482,3.582518,3.651239,1.211681,2.88871,2.847871,1.435405,1.423749,1.096107,3.191064,2.605847,2.777812,2.431014,1.753603,2.07026,2.274782,2.542041,2.573925,...,5.360048,1.331951,1.448359,1.142281,1.542153,1.792967,1.31956,1.530378,1.257681,1.631924,1.554084,1.664965,1.396731,0.484338,1.396061,0.828577,13.075361,7.326475,6.147169,4.939169,5.691686,6.488607,17.480671,9.24323,7.165185,6.563175,6.199691,8.750683,15.688999,5.674572,5.720521,6.576736,6.218415,6.150009,16.148079,5.862449,5.878613,7.866184,6.531869,7.190502,1.579986,1.593694,1.428955,1.758192,1.988443,1.270667,1.330876,1.347047,1.617947,1.824049
min,1.0,1.0,0.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,1.0,1.0,0.0,-0.83,0.0,18.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,18.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,...,0.0,2.0,2.0,4.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,4.0,2.0,1.0,2.0,4.0,5.0,1.0,1.0
25%,113.25,4.0,0.0,8.0,2.0,5.0,14.0,4.0,4.0,4.0,4.0,113.0,0.0,-0.01,0.0,24.0,2.0,15.0,15.0,17.31,15.0,5.0,9.52,0.0,5.0,6.0,7.0,5.0,6.0,4.0,5.0,4.0,2.0,24.0,5.0,2.0,1.0,1.0,1.0,4.0,1.0,2.0,5.0,2.0,5.0,7.0,6.0,5.0,4.0,1.0,...,10.0,7.0,7.0,8.0,7.0,7.0,6.0,7.0,7.0,6.0,6.0,0.0,0.0,0.0,1.0,0.0,15.22,14.58,16.67,15.0,9.0,10.0,20.0,10.0,10.0,10.0,5.0,5.0,9.0,7.0,7.0,8.0,5.0,7.0,9.0,7.0,7.0,8.0,6.0,10.0,7.0,7.0,8.0,7.0,6.0,6.0,7.0,7.0,6.0,6.0
50%,231.0,8.0,0.0,16.0,2.0,9.0,18.0,8.0,8.5,8.0,8.0,231.0,0.0,0.21,0.0,26.0,2.0,20.0,18.0,20.0,18.0,10.0,10.64,0.0,6.0,7.0,8.0,7.0,7.0,6.0,6.0,5.0,2.0,26.0,8.0,2.0,3.0,3.0,2.0,5.0,2.0,6.0,7.0,4.0,7.0,8.0,7.0,7.0,6.0,3.0,...,15.0,7.0,8.0,8.0,8.0,8.0,7.0,8.0,8.0,7.0,7.0,0.0,1.0,0.0,1.0,1.0,20.0,16.98,20.0,16.67,10.0,14.29,25.0,15.0,15.0,18.0,10.0,10.0,20.0,10.0,10.0,10.0,8.0,10.0,20.0,10.0,10.0,10.0,9.0,10.0,7.0,8.0,8.0,8.0,8.0,7.0,8.0,8.0,7.0,7.0
75%,341.0,13.0,1.0,25.0,2.0,14.0,20.0,13.0,13.0,13.0,13.0,341.0,0.0,0.43,1.0,28.0,4.0,25.0,20.0,23.08,20.0,15.0,15.56,1.0,8.0,8.0,8.0,8.0,8.0,7.0,7.0,7.0,2.0,28.0,10.0,4.0,6.0,6.0,2.0,6.0,3.0,7.0,9.0,7.0,8.0,9.0,8.0,8.0,8.0,6.0,...,16.67,8.0,9.0,9.0,9.0,9.0,8.0,8.0,9.0,8.0,8.0,1.0,1.0,1.0,2.0,1.0,30.0,20.0,20.0,20.0,15.22,16.67,40.0,20.0,20.0,20.0,14.0,20.0,30.0,15.0,15.0,20.0,10.0,15.0,30.0,15.0,15.0,20.0,10.0,15.0,8.0,9.0,9.0,9.0,9.0,8.0,9.0,9.0,8.0,8.0
max,443.0,21.0,1.0,42.0,2.0,17.0,21.0,21.0,21.0,21.0,21.0,453.0,1.0,0.9,1.0,42.0,6.0,100.0,60.0,50.0,50.0,53.0,30.0,1.0,10.5,10.0,10.0,11.0,10.0,10.0,10.0,10.0,8.0,42.0,17.0,6.0,10.0,10.0,6.0,7.0,7.0,17.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,14.0,...,30.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,21.0,9.0,1.0,9.0,4.0,80.0,65.0,45.0,30.0,30.0,55.0,75.0,60.0,30.0,35.0,30.0,55.0,70.0,40.0,25.0,30.0,40.0,45.0,80.0,50.0,30.0,40.0,50.0,45.0,12.0,12.0,12.0,12.0,12.0,9.0,10.0,10.0,10.0,10.0


The lists below group the attributes each person was asked to score. The 'other' lists refer to the importance of each attribute to the respondent when choosing a mate. The 'self' lists refer to the respondent's opinion of how they themselves rate on those attributes. Two scoring methods were used. One scored each attribute on a scale of 1-10, with 1 being least desirable. The other distributed a total of 100 points across all attributes. In both cases, lower scores indicate lower desirability and/or importance.

The ratings were collected at several points in time. Each form has been assigned a number (1-4), which is appended to the end of the list variable name. Lower numbers correspond to earlier surveys:
* Prior to the event
* Midway through the event
* Day after the event
* Three to Four weeks after the event

The attributes are listed below. They are compiled into the lists that follow, preserving the original column names in the dataset.
* Attractiveness
* Sincerity
* Intelligence
* Fun
* Ambition
* Shared Interests
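
Because the two scoring methods use different scales, it may help to put the 100-point allocations on a common footing before comparing respondents. A minimal sketch, assuming some respondents' six allocations do not sum to exactly 100 (the sample values and short column names are invented):

```python
import pandas as pd

# Invented allocations for two respondents across the six attributes
points = pd.DataFrame(
    [[30, 20, 20, 15, 10, 5],
     [40, 40, 40, 40, 20, 20]],
    columns=['attr', 'sinc', 'intel', 'fun', 'amb', 'shar'],
)

# Rescale each row so that its allocations sum to 100
row_totals = points.sum(axis=1)
rescaled = points.div(row_totals, axis=0) * 100

print(rescaled.round(2))
```

The first row already sums to 100 and is unchanged; the second row sums to 200 and is halved.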

In [None]:
# Ratings prior to event

# What does each person look for in the opposite sex when choosing a date
other_quals_1 = ['attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1']

# Self assessment of the same qualities
self_quals_1 = ['attr3_1', 'sinc3_1', 'intel3_1', 'fun3_1', 'amb3_1']

In [None]:
# Middle of event

# What does each person look for in the opposite sex when choosing a date
other_quals_2 = ['attr1_s', 'sinc1_s', 'intel1_s', 'fun1_s', 'amb1_s', 'shar1_s']

# Self assessment of the same qualities
self_quals_2 = ['attr3_s', 'sinc3_s', 'intel3_s', 'fun3_s', 'amb3_s']

In [None]:
# Day after event

# What does each person look for in the opposite sex when choosing a date
other_quals_3 = ['attr1_2', 'sinc1_2', 'intel1_2', 'fun1_2', 'amb1_2', 'shar1_2']

# Self assessment of the same qualities
self_quals_3 = ['attr3_2', 'sinc3_2', 'intel3_2', 'fun3_2', 'amb3_2']

In [None]:
# Three to four weeks after event after receiving matches

# What does each person look for in the opposite sex when choosing a date
other_quals_4 = ['attr1_3', 'sinc1_3', 'intel1_3', 'fun1_3', 'amb1_3', 'shar1_3']

# Self assessment of the same qualities
self_quals_4 = ['attr3_3', 'sinc3_3', 'intel3_3', 'fun3_3', 'amb3_3']

The samerace column indicates whether or not the two participants in a pairing are the same race. The original values will be recoded so that 0 is free to serve as a missing data indicator.

Original Values
* 0 No
* 1 Yes
&nbsp;<br />
&nbsp;<br />

Recoded Values
* 1 No
* 2 Yes
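
The recoding described above could be applied as follows. This is a sketch on a toy Series standing in for `df['samerace']`; it assumes missing values should be coded as 0, which is consistent with the 0-2 range defined in the next cell.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['samerace'] with one missing value
samerace = pd.Series([0, 1, np.nan, 1, 0])

# Shift 0/1 to 1/2, then use 0 as the missing-data indicator
recoded = samerace.add(1).fillna(0).astype(int)

print(recoded.tolist())
```

Shifting before filling matters: filling with 0 first would make missing values indistinguishable from the original "No".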

In [None]:
same_race = range(0, 3, 1)
[i for i in same_race]

[0, 1, 2]

The lists below hold the possible answers to survey questions. Each question is represented as a list of numeric codes, shown here with their plain-English counterparts.

Each person's encoded field of study.

*  &nbsp;1 Law
*  &nbsp;2 Math
*  &nbsp;3 Social Science, Psychologist
*  &nbsp;4 Medical Science, Pharmaceuticals, and Bio Tech
*  &nbsp;5 Engineering
*  &nbsp;6 English/Creative Writing/ Journalism
*  &nbsp;7 History/Religion/Philosophy
*  &nbsp;8 Business/Econ/Finance
*  &nbsp;9 Education, Academia
* 10 Biological Sciences/Chemistry/Physics
* 11 Social Work
* 12 Undergrad/undecided
* 13 Political Science/International Affairs
* 14 Film
* 15 Fine Arts/Arts Administration
* 16 Languages
* 17 Architecture
* 18 Other

In [None]:
field_of_study = range(0, 19, 1)

[i for i in field_of_study]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]

Each person's race.

* 1 Black/African American
* 2 European/Caucasian-American
* 3 Latino/Hispanic American
* 4 Asian/Pacific Islander/Asian-American
* 5 Native American
* 6 Other

In [None]:
race = range(0, 7, 1)

[i for i in race]

[0, 1, 2, 3, 4, 5, 6]
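
Assuming the numeric codes map to the labels listed above, a dictionary lookup can translate them back into plain English. The label 'Missing' for code 0 is a hypothetical choice, and the same pattern would apply to the other encoded questions.

```python
import pandas as pd

race_labels = {
    0: 'Missing',  # hypothetical label for the missing-data code
    1: 'Black/African American',
    2: 'European/Caucasian-American',
    3: 'Latino/Hispanic American',
    4: 'Asian/Pacific Islander/Asian-American',
    5: 'Native American',
    6: 'Other',
}

# Toy codes standing in for a recoded df['race'] column
codes = pd.Series([4, 2, 0, 6])

# Series.map looks each code up in the dictionary
labels = codes.map(race_labels)
print(labels.tolist())
```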

Each person's goal of the night regarding why they chose to participate in the speed dating event.

* 1 Seemed like a fun night out
* 2 To meet new people
* 3 To get a date
* 4 Looking for a serious relationship
* 5 To say I did it
* 6 Other

In [None]:
event_goal = range(0, 7, 1)

[i for i in event_goal]

[0, 1, 2, 3, 4, 5, 6]

Frequency each person goes on dates.

* 1 Several times a week
* 2 Twice a week
* 3 Once a week
* 4 Twice a month
* 5 Once a month
* 6 Several times a year
* 7 Almost never

In [None]:
go_dates_freq = range(0, 8, 1)

[i for i in go_dates_freq]

[0, 1, 2, 3, 4, 5, 6, 7]

How often each person goes out in general but not necessarily on a date.
 
* 1 Several times a week
* 2 Twice a week
* 3 Once a week
* 4 Twice a month
* 5 Once a month
* 6 Several times a year
* 7 Almost never

In [None]:
go_out_freq = range(0, 8, 1)

[i for i in go_out_freq]

[0, 1, 2, 3, 4, 5, 6, 7]

Intended career for each person encoded.

* &nbsp;1 Lawyer
* &nbsp;2 Academic/Research
* &nbsp;3 Psychologist
* &nbsp;4 Doctor/Medicine
* &nbsp;5 Engineer
* &nbsp;6 Creative Arts/Entertainment
* &nbsp;7 Banking/Consulting/Finance/Marketing/Business/CEO/Entrepreneur/Admin
* &nbsp;8 Real Estate
* &nbsp;9 International/Humanitarian Affairs
* 10 Undecided
* 11 Social Work
* 12 Speech Pathology
* 13 Politics
* 14 Pro sports/Athletics
* 15 Other
* 16 Journalism
* 17 Architecture

In [None]:
intended_career = range(0, 18, 1)

[i for i in intended_career]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]

These lists are text-based. Numeric values can be acquired through the .index() function.

On a scale of 1-10, how interested is each person in each of the following activities? Each activity is its own column.

<table>
    <tr>
        <td align='right'>sports:</td>
        <td align='left'>playing sports/athletics</td>
    </tr>
    <tr>
        <td align='right'>tvsports:</td>
        <td align='left'>watching sports</td>
    </tr>
    <tr>
        <td align='right'>exercise:</td>
        <td align='left'>body building/exercising</td>
    </tr>
    <tr>
        <td align='right'>dining:</td>
        <td align='left'>dining out</td>
    </tr>
    <tr>
        <td align='right'>museums:</td>
        <td align='left'>museums/galleries</td>
    </tr>
    <tr>
        <td align='right'>art:</td>
        <td align='left'>art</td>
    </tr>
    <tr>
        <td align='right'>hiking:</td>
        <td align='left'>hiking/camping</td>
    </tr>
    <tr>
        <td align='right'>gaming:</td>
        <td align='left'>gaming</td>
    </tr>
    <tr>
        <td align='right'>clubbing:</td>
        <td align='left'>dancing/clubbing</td>
    </tr>
    <tr>
        <td align='right'>reading:</td>
        <td align='left'>reading</td>
    </tr>
    <tr>
        <td align='right'>tv:</td>
        <td align='left'>watching tv</td>
    </tr>
    <tr>
        <td align='right'>theater:</td>
        <td align='left'>theater</td>
    </tr>
    <tr>
        <td align='right'>movies:</td>
        <td align='left'>Movies</td>
    </tr>
    <tr>
        <td align='right'>concerts:</td>
        <td align='left'>going to concerts</td>
    </tr>
    <tr>
        <td align='right'>music:</td>
        <td align='left'>music</td>
    </tr>
    <tr>
        <td align='right'>shopping:</td>
        <td align='left'>shopping</td>
    </tr>
    <tr>
        <td align='right'>yoga:</td>
        <td align='left'>yoga/meditation</td>
    </tr>
</table>

In [None]:
activity_interest = ['sports', 'tvsports','exercise',
                    'dining', 'museums', 'art', 'hiking',
                    'gaming', 'clubbing', 'reading', 'tv',
                    'theater', 'movies', 'concerts', 'music',
                    'shopping', 'yoga']
                    
activity_interest

['sports',
 'tvsports',
 'exercise',
 'dining',
 'museums',
 'art',
 'hiking',
 'gaming',
 'clubbing',
 'reading',
 'tv',
 'theater',
 'movies',
 'concerts',
 'music',
 'shopping',
 'yoga']
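
As a small illustration of the `.index()` lookup mentioned earlier, the position of an activity in the list can serve as its numeric code:

```python
activity_interest = ['sports', 'tvsports', 'exercise',
                     'dining', 'museums', 'art', 'hiking',
                     'gaming', 'clubbing', 'reading', 'tv',
                     'theater', 'movies', 'concerts', 'music',
                     'shopping', 'yoga']

# list.index() returns the zero-based position of the first match
hiking_code = activity_interest.index('hiking')
print(hiking_code)  # 6

# The reverse lookup is plain indexing
print(activity_interest[hiking_code])  # hiking
```

Note that `.index()` raises a `ValueError` when the value is not in the list, so unexpected answers would need to be handled separately.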

In [None]:
# This code is a list of all the feature column lists, a list of lists.

features_list = ['samerace', 'field_cd', 'race',
                 'goal', 'date',
                 'go_out', 'career_c']

feature_lists = [other_quals_1, self_quals_1,
                 other_quals_2, self_quals_2,
                 other_quals_3, self_quals_3,
                 other_quals_4, self_quals_4,
                 activity_interest, features_list]

This function exists solely to show column information. The columns in each group are related in that they all come from the same survey. Breaking them into their respective groups makes it easier to see relationships within each group and changes from one survey to the next.

In [None]:
# df must be a pandas DataFrame
# cols is either a list of column lists
# (for example, one list of columns per survey)
# or a single flat list of unrelated columns

def feat_info(df, cols):
    # Wrap a flat list so both input shapes are handled the same way
    if not isinstance(cols[0], list):
        cols = [cols]
    for i in cols:
        print('Shape\n', df[i].shape)
        print('\n\nDescribe\n', df[i].describe())
        print('\n\nNull Values\n', df[i].isna().sum())

In [None]:
feat_info(df, feature_lists)

Shape
 (8378, 6)


Describe
            attr1_1      sinc1_1     intel1_1       fun1_1       amb1_1  \
count  8299.000000  8299.000000  8299.000000  8289.000000  8279.000000   
mean     22.514632    17.396389    20.265613    17.457043    10.682539   
std      12.587674     7.046700     6.783003     6.085239     6.124888   
min       0.000000     0.000000     0.000000     0.000000     0.000000   
25%      15.000000    15.000000    17.390000    15.000000     5.000000   
50%      20.000000    18.180000    20.000000    18.000000    10.000000   
75%      25.000000    20.000000    23.810000    20.000000    15.000000   
max     100.000000    60.000000    50.000000    50.000000    53.000000   

           shar1_1  
count  8257.000000  
mean     11.845111  
std       6.362154  
min       0.000000  
25%       9.520000  
50%      10.640000  
75%      16.000000  
max      30.000000  


Null Values
 attr1_1      79
sinc1_1      79
intel1_1     79
fun1_1       89
amb1_1       99
shar1_1     121
dtyp

The null filter contains all columns with more than 20% null values. I wrote this to help determine how much of the implied data doesn't exist.
Although null values are normally a problem, I believe the ones here can be used to illustrate patterns in how participants responded. If a participant puts more effort into their participation, will they achieve better results, i.e. more matches? The amount of effort by each person may signify varying mindsets and attitudes toward the event, beyond the survey question regarding each person's goal.

In [None]:
null_filter = df.isna().sum()[df.isna().sum()>len(df)*.2]
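
To explore the idea that missing answers reflect effort, one option is to count how many survey answers each row contains and compare that with the match outcome. A sketch on invented data; the column names `q1` to `q3` are placeholders, not columns from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'match': [1, 0, 0, 1, 0],
    'q1': [5.0, np.nan, 3.0, 8.0, np.nan],
    'q2': [7.0, np.nan, np.nan, 6.0, 2.0],
    'q3': [6.0, 4.0, np.nan, 9.0, np.nan],
})

survey_cols = ['q1', 'q2', 'q3']

# Share of missing values per column (the quantity the null filter thresholds)
null_share = toy[survey_cols].isna().mean()

# Number of answered questions per row, usable as an "effort" feature
toy['answered'] = toy[survey_cols].notna().sum(axis=1)

# Average number of answers for matched vs. unmatched rows
print(null_share)
print(toy.groupby('match')['answered'].mean())
```

On the real data, any such comparison should only use surveys completed before the match outcome was known, since later surveys could leak the result.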

In [None]:
# save the dataframe in the data folder for use in future assignments
path = '../data/speed_dating_ap1'

df.to_csv(path, index=False)

In [None]:
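# This function removes all columns named in null_list (here, the columns
# with more than 20% missing information).
# The missing information may eventually be removed on an observational basis
# to compare the effectiveness of each methodology.
# Note: the inner column lists in cols are updated in place.

def handle_nulls(df, cols=None, null_list=None):
    # Avoid mutable default arguments
    cols = [] if cols is None else cols.copy()
    null_list = [] if null_list is None else list(null_list)

    df = df.drop(null_list, axis=1)

    for i in cols.copy():
        if isinstance(i, list):
            # Iterate over a copy so that removing items does not skip entries
            for j in i.copy():
                if j in null_list:
                    i.remove(j)
        elif isinstance(i, str) and i in null_list:
            cols.remove(i)

    return df, cols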

In [None]:
# get the number of features prior to dropping columns with more than 20% nulls
prior_null_drop = sum(len(i) for i in feature_lists if isinstance(i, list))
prior_null_drop

68

In [None]:
# drop the null columns
df_null20, cols_null20 = handle_nulls(df, feature_lists, null_filter.index)

In [None]:
# get the number of features after dropping columns with more than 20% nulls
post_null_drop = sum(len(i) for i in feature_lists if isinstance(i, list))
post_null_drop

56

In [None]:
# show percentage of features with 20% or more missing data
print('Percentage Features Dropped: ',round(((prior_null_drop-post_null_drop)/prior_null_drop)*100, 2))

Percentage Features Dropped:  17.65


In [None]:
# Total number of columns (out of 195) with more than 20% missing data.
# Many columns are numeric encodings of string columns. The string columns
# will be excluded from the final model. This reduces preprocessing and lets
# null values be accounted for by representing them with the number 0.
len(null_filter)

90
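
The plan described in the comments above (keep the numeric columns, mark nulls with 0) might look like the following. This is a sketch on a toy frame, and it assumes 0 is not already a valid answer in the columns being filled.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'field': ['Law', 'MBA', None],       # string variant
    'field_cd': [1.0, 8.0, np.nan],      # numeric encoding of the same answer
    'age': [21.0, np.nan, 26.0],
})

# Keep only numeric columns, then use 0 as the missing-data indicator
numeric = toy.select_dtypes(include='number').fillna(0)

print(numeric)
```

Columns where 0 is a real answer would need to be shifted first so that 0 stays unambiguous.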