# Speed Dating Prediction

We are going to evaluate the factors that could predict the success of speed dating, using a random forest classifier (scikit-learn's RandomForestClassifier).

In [2]:
import pandas as pd
import sklearn as sk
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

## Importing the data

In [3]:
df = pd.read_csv("Speed Dating Data.csv",encoding='unicode_escape')
df.head()

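| | iid | id | gender | idg | condtn | wave | round | position | positin1 | order | ... | attr3_3 | sinc3_3 | intel3_3 | fun3_3 | amb3_3 | attr5_3 | sinc5_3 | intel5_3 | fun5_3 | amb5_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 4 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
| 1 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 3 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
| 2 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 10 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
| 3 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 5 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |
| 4 | 1 | 1.0 | 0 | 1 | 1 | 1 | 10 | 7 | NaN | 7 | ... | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | NaN | NaN | NaN | NaN | NaN |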


We're going to predict the success of a speed date, which is recorded in the "dec" column (short for decision) as yes (1) or no (0). The other variables, except the id-related attributes, are used as predictors.

In [4]:
df["dec"].value_counts()

0    4860
1    3518
Name: dec, dtype: int64

There are more "no" decisions than "yes": 3518 of the 8378 decisions, or around 42%, are "yes". Because the classes are somewhat imbalanced, it is worth looking at the precision and recall of both the "yes" and the "no" predictions, not only the overall accuracy.
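The quick calculation mentioned above can be reproduced from the two counts printed by `value_counts()`. This is a minimal sketch in plain Python; the variable names are illustrative, and on the real DataFrame `df["dec"].value_counts(normalize=True)` should give the same shares directly.

```python
# Counts copied from the value_counts() output above.
no_count = 4860
yes_count = 3518

total = no_count + yes_count
yes_share = yes_count / total
no_share = no_count / total

print(f"total decisions: {total}")
print(f"yes share: {yes_share:.1%}")
print(f"no share: {no_share:.1%}")
```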

Let's drop the id-related columns and the columns with dtype "object" (strings), assign the variables to X and y, and split the data.

Note: since RandomForest doesn't handle missing values (at least in the scikit-learn version used here), I added a line of code to remove all the columns containing NaN values. However, I don't know whether there are alternative ways to still use the values from those columns.
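One common alternative to dropping the columns is to impute the missing values, for example with scikit-learn's `SimpleImputer`. The sketch below uses a tiny made-up DataFrame (the values are not from the dataset) and median imputation; whether the median is a sensible fill value for these survey ratings is an assumption that would need checking.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy ratings with gaps; the values are invented for illustration.
toy = pd.DataFrame({
    "attr": [6.0, np.nan, 8.0, 7.0],
    "fun": [np.nan, 5.0, 9.0, 7.0],
})

# Replace each NaN with the median of its column.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

On the real data, the imputer would be fitted on X_train only and then applied to X_test with `transform`, so that test-set values do not influence the fill values.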

In [5]:
df = df.drop(columns=["id","iid","idg","partner","positin1"])
df = df.select_dtypes(exclude=['object'])  # drop columns with dtype "object" (strings)
df = df.set_index("dec")
df = df.reset_index()  # set_index + reset_index moves "dec" to the first column
df = df.dropna(axis=1)  # drop columns containing NaN
df.head()

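| | dec | gender | condtn | wave | round | position | order | match | samerace | dec_o |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 1 | 10 | 7 | 4 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 1 | 10 | 7 | 3 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 1 | 10 | 7 | 10 | 1 | 1 | 1 |
| 3 | 1 | 0 | 1 | 1 | 10 | 7 | 5 | 1 | 0 | 1 |
| 4 | 1 | 0 | 1 | 1 | 10 | 7 | 7 | 1 | 0 | 1 |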


In [6]:
df["dec"].value_counts()

0    4860
1    3518
Name: dec, dtype: int64

In [10]:
X = df.loc[:,"gender":"dec_o"]
y = df["dec"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
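One thing worth checking before training: X includes "match" and "dec_o". In the five rows shown by `df.head()`, "match" is 1 exactly when both "dec" and "dec_o" are 1, which suggests that "match" is derived from the target. The sketch below runs that check on those five rows only. That the pattern holds for the full dataset is an assumption; if it does, dropping "match" from X would avoid leaking the target into the features.

```python
# (dec, match, dec_o) copied from the five rows of df.head() above.
rows = [
    (1, 0, 0),
    (1, 0, 0),
    (1, 1, 1),
    (1, 1, 1),
    (1, 1, 1),
]

# "match" looks derived if it always equals dec AND dec_o.
match_is_derived = all(match == (dec & dec_o) for dec, match, dec_o in rows)
print(match_is_derived)
```

On the full DataFrame, the same check could be written as `(df["match"] == (df["dec"] & df["dec_o"])).all()`.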

## Training the algorithm

In [11]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=1, n_estimators=100)
rf = rf.fit(X_train,y_train)

In [12]:
rf.score(X_test,y_test)

0.7613365155131265

## Evaluating the model

Let's first create the confusion matrix.

In [13]:
# check the order of the class labels
rf.classes_

array([0, 1])

In [14]:
y_pred = rf.predict(X_test)
cm = confusion_matrix(y_test,y_pred)
cm = pd.DataFrame(cm, index=["No (actual)","Yes (actual)"], columns=["No (pred)","Yes (pred)"])
cm

| | No (pred) | Yes (pred) |
|---|---|---|
| No (actual) | 1191 | 272 |
| Yes (actual) | 328 | 723 |


In [17]:
(1191+723)/(1191+272+328+723)

0.7613365155131265

The accuracy is around 76%. Let's print out the precision and recall via classification report.

In [18]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.78      0.81      0.80      1463
           1       0.73      0.69      0.71      1051

    accuracy                           0.76      2514
   macro avg       0.76      0.75      0.75      2514
weighted avg       0.76      0.76      0.76      2514



The precision and recall for predicting "no" are better than for predicting "yes".

In particular, the recall for "yes" is fairly low at 69%: the model misses about 31% of the actual "yes" decisions (328 of 1051).
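The precision and recall figures in the report follow directly from the confusion matrix. As a worked check, the sketch below recomputes them in plain Python from the four counts shown earlier; the variable names (`tn`, `fp`, `fn`, `tp`) are illustrative and treat "yes" as the positive class.

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted).
tn, fp = 1191, 272   # actual "no":  predicted no / predicted yes
fn, tp = 328, 723    # actual "yes": predicted no / predicted yes

precision_yes = tp / (tp + fp)   # of all predicted "yes", how many were right
recall_yes = tp / (tp + fn)      # of all actual "yes", how many were found
precision_no = tn / (tn + fn)    # of all predicted "no", how many were right
recall_no = tn / (tn + fp)       # of all actual "no", how many were found

print(f"yes: precision={precision_yes:.2f}, recall={recall_yes:.2f}")
print(f"no:  precision={precision_no:.2f}, recall={recall_no:.2f}")
```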

I personally think precision is more important than recall in this speed dating case :)

## Different parameters

Let's try different parameters to see whether the prediction metrics change.

In [20]:
rf_new = RandomForestClassifier(n_estimators=30,max_features=8,random_state=1)
rf_new = rf_new.fit(X_train, y_train)
y_new_pred = rf_new.predict(X_test)
print(classification_report(y_test,y_new_pred))

              precision    recall  f1-score   support

           0       0.78      0.82      0.80      1463
           1       0.73      0.67      0.70      1051

    accuracy                           0.76      2514
   macro avg       0.75      0.75      0.75      2514
weighted avg       0.76      0.76      0.76      2514



The results show that the model performs similarly with different parameters: accuracy stays at 76%, while the recall for "yes" drops slightly from 69% to 67%.
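Rather than trying parameter combinations by hand, a cross-validated grid search can compare several at once. The following is a minimal sketch using scikit-learn's `GridSearchCV` on synthetic data from `make_classification`; the data, the candidate values in `param_grid`, and the names `X_demo` and `y_demo` are all assumptions for illustration, not results from the speed dating dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (9 features, like X above).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=4, n_redundant=0, random_state=1
)

# Candidate values are illustrative, not tuned recommendations.
param_grid = {"n_estimators": [30, 100], "max_features": [3, 8]}

# Fit every combination with 3-fold cross-validation and keep the best one.
search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data, `search.fit(X_train, y_train)` would take the place of the synthetic arrays, and `scoring` could be switched to "precision" or "recall" depending on which metric matters more.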