# ECON 213R FINAL PROJECT: COUPLE MATCHING  

### By Ben Chapdelaine and Matt Youngberg  

**Description**: Using survey data from Gender Differences In Mate Selection: Evidence From a Speed Dating Experiment (Raymond Fisman, Sheena S. Iyengar, Emir Kamenica, & Itamar Simonson, 2006), we use machine learning to predict whether two survey respondents would both agree to a second date after going on a first date. Although this data likely generalizes only to a heterosexual college crowd, its application in an app or matchmaking service could potentially create good matches, leading to fewer first dates where sparks *don't* fly.

In [1]:
import pandas as pd

from imblearn.over_sampling import SMOTE

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

**Note:** The survey from which the data came had a substantial number of fields that were not applicable for our purposes. The `dataset.csv` file imported below is a version we had already narrowed down, keeping only the columns that were useful for this project.

In [2]:
dataset = pd.read_csv('dataset.csv')
dataset.head()

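| | match | iid_1 | iid_2 | interested_1 | interested_2 | gender_1 | age_1 | field_1 | field_cd_1 | race_1 | ... | movies_2 | concerts_2 | music_2 | shopping_2 | yoga_2 | attr3_1_2 | sinc3_1_2 | intel3_1_2 | fun3_1_2 | amb3_1_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 11 | 0 | 0 | 0 | 21.0 | Law | 1.0 | 4.0 | ... | 8.0 | 7.0 | 8.0 | 5.0 | 1.0 | 8.0 | 9.0 | 8.0 | 7.0 | 5.0 |
| 1 | 0 | 1 | 12 | 0 | 0 | 0 | 21.0 | Law | 1.0 | 4.0 | ... | 7.0 | 7.0 | 9.0 | 5.0 | 5.0 | 9.0 | 9.0 | 10.0 | 9.0 | 9.0 |
| 2 | 1 | 1 | 13 | 1 | 1 | 0 | 21.0 | Law | 1.0 | 4.0 | ... | 8.0 | 9.0 | 9.0 | 8.0 | 1.0 | 4.0 | 7.0 | 8.0 | 8.0 | 3.0 |
| 3 | 1 | 1 | 14 | 1 | 1 | 0 | 21.0 | Law | 1.0 | 4.0 | ... | 10.0 | 6.0 | 8.0 | 6.0 | 1.0 | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 |
| 4 | 1 | 1 | 15 | 1 | 1 | 0 | 21.0 | Law | 1.0 | 4.0 | ... | 9.0 | 6.0 | 7.0 | 2.0 | 1.0 | 7.0 | 7.0 | 9.0 | 7.0 | 9.0 |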


Checking for missing values below shows that almost every field has at least some missing entries...

In [3]:
dataset.isnull().sum().sort_values(ascending=False).to_frame().T

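| | career_c_1 | career_1 | intel3_1_1 | amb3_1_1 | sinc3_1_1 | attr3_1_1 | fun3_1_1 | age_1 | imprelig_1 | gaming_1 | ... | imprace_2 | race_2 | field_2 | interested_1 | iid_1 | iid_2 | gender_2 | interested_2 | gender_1 | match |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 99 | 69 | 67 | 67 | 67 | 67 | 67 | 65 | 59 | 59 | ... | 20 | 20 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |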


We intend this machine learning model to be applicable in an app or matchmaking site. In that setting, we believe we could reliably depend on full survey participation, and could solicit answers from respondents who haven't completely filled out the survey. So we feel justified in simply dropping any responses that don't have full data, since doing so only reduced our sample by about 100 observations, not a huge loss with a sample of 4000+ observations.
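As a quick sanity check on a decision like this, one can count how many rows `dropna` actually removes. The sketch below uses a tiny made-up frame (`toy_rows` and its values are hypothetical, not taken from `dataset.csv`); the same three lines of arithmetic would apply to the real `dataset`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the survey data; values are made up.
toy_rows = pd.DataFrame({
    "age_1": [21.0, np.nan, 24.0, 23.0],
    "career_1": ["lawyer", "teacher", None, "engineer"],
    "match": [0, 1, 0, 1],
})

n_before = len(toy_rows)
toy_complete = toy_rows.dropna(axis=0)   # keep only rows with no missing values
n_dropped = n_before - len(toy_complete)

print(f"dropped {n_dropped} of {n_before} rows ({n_dropped / n_before:.1%})")
```

Note that the per-column counts from `isnull().sum()` only give a lower bound on the rows lost, because different rows can be missing different fields.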

In [4]:
dataset = dataset.dropna(axis=0)
dataset = dataset.drop(['iid_1', 'iid_2', 'interested_1', 'interested_2', 'gender_1', 'gender_2'], axis=1)
dataset = pd.get_dummies(dataset)

X = dataset.drop('match', axis=1).values
y = dataset['match'].values
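To illustrate what `pd.get_dummies` does to text columns such as `field_1`, here is a minimal sketch on a made-up frame. The name `toy_fields` and the value `"Economics"` are assumptions for illustration only; numeric columns pass through unchanged, while each distinct string becomes its own indicator column.

```python
import pandas as pd

# Hypothetical toy frame; "Economics" is an illustrative value, not from the data.
toy_fields = pd.DataFrame({
    "age_1": [21.0, 24.0, 23.0],
    "field_1": ["Law", "Economics", "Law"],
})

encoded = pd.get_dummies(toy_fields)

# Numeric column is kept as-is; field_1 is split into one column per category.
print(encoded.columns.tolist())
```

Because free-text columns can have many distinct values, this step can widen the feature matrix considerably, which is worth keeping in mind when reading `X.shape`.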

After much feature engineering and experimentation with different algorithms, the most intuitive model turned out to be the one that outperformed the rest by far: a Random Forest model. This model relies on many individual decision trees casting votes on whether the couple will be a match or not. We tried many different parameters and tested for robustness, but the block of code below represents the best results we were able to achieve.
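The voting idea can be made concrete with a small sketch on synthetic data (`X_vote`, `y_vote`, and the forest size are assumptions chosen for speed, not the project's settings). In scikit-learn, the forest's prediction is the class with the highest average of the per-tree predicted probabilities, which is a soft form of voting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; not the speed dating dataset.
X_vote, y_vote = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_vote, y_vote)

# Collect each tree's class probabilities: shape (n_trees, n_samples, n_classes).
tree_probs = np.stack([tree.predict_proba(X_vote) for tree in forest.estimators_])

# Average across trees, then pick the class with the highest mean probability.
avg_probs = tree_probs.mean(axis=0)
manual_pred = forest.classes_[avg_probs.argmax(axis=1)]

# Check whether the hand-rolled vote agrees with the forest's own prediction.
print((manual_pred == forest.predict(X_vote)).all())
```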

In [5]:
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)

X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=.33, random_state=42)

clf = RandomForestClassifier(n_estimators=1000, verbose=0, n_jobs=-1, random_state=42)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.90      0.96      0.93      1114
           1       0.96      0.90      0.93      1080

   micro avg       0.93      0.93      0.93      2194
   macro avg       0.93      0.93      0.93      2194
weighted avg       0.93      0.93      0.93      2194

