# Speed Dating Prediction

We are going to use a random forest classifier (scikit-learn's `RandomForestClassifier`) to evaluate which factors could predict the success of a speed date.

In [1]:
import pandas as pd
import sklearn as sk
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

## Importing the data

In [2]:
df = pd.read_csv("Speed Dating Data.csv",encoding='unicode_escape')
df.head()

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
0,1,1.0,0,1,1,1,10,7,,4,...,5.0,7.0,7.0,7.0,7.0,,,,,
1,1,1.0,0,1,1,1,10,7,,3,...,5.0,7.0,7.0,7.0,7.0,,,,,
2,1,1.0,0,1,1,1,10,7,,10,...,5.0,7.0,7.0,7.0,7.0,,,,,
3,1,1.0,0,1,1,1,10,7,,5,...,5.0,7.0,7.0,7.0,7.0,,,,,
4,1,1.0,0,1,1,1,10,7,,7,...,5.0,7.0,7.0,7.0,7.0,,,,,


We're going to predict the success of a speed date, which the dataset records in the `match` column: 1 for a match and 0 for no match. All other variables, except the id-related attributes, are used as predictors.

In [3]:
df["match"].value_counts()

0    6998
1    1380
Name: match, dtype: int64

There are far more "not match" cases than "match" cases. A quick calculation shows that "not match" makes up around 83.5% of the dataset (6998 of 8378 rows), so a model that always predicts "not match" would already reach that accuracy. It is therefore still interesting to see the accuracy, precision and recall of both the "match" and "not match" predictions.
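The quick calculation can be reproduced with `value_counts(normalize=True)`. This is a minimal sketch that rebuilds the label column from the counts printed above; the name `labels` is only for illustration and is not part of the notebook.

```python
import pandas as pd

# Rebuild the label column from the counts shown above
# (6998 rows with match == 0, 1380 rows with match == 1).
labels = pd.Series([0] * 6998 + [1] * 1380, name="match")

# normalize=True returns each class as a fraction of all rows
shares = labels.value_counts(normalize=True)

not_match_pct = round(shares[0] * 100, 1)
match_pct = round(shares[1] * 100, 1)
print(f"not match: {not_match_pct}%, match: {match_pct}%")
```

On the real data the same idea would be `df["match"].value_counts(normalize=True)`.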

Let's drop the id-related columns and the columns with dtype `object` (strings), assign the remaining variables to `X` and `y`, and split the data.

Note: since RandomForest doesn't handle missing values, I added a line of code that removes every column containing NaN values. However, I don't know whether there are alternative ways to still use the values from those columns.
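One common alternative to dropping those columns is to impute the missing values before the forest sees them. Below is a minimal sketch, assuming median imputation is acceptable for the rating-style columns. The toy frame and the names `toy`, `attr`, `fun` and `y_toy` are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)

# Hypothetical toy data: two rating-like columns on a 1-10 scale
toy = pd.DataFrame({
    "attr": rng.integers(1, 11, size=200).astype(float),
    "fun": rng.integers(1, 11, size=200).astype(float),
})
y_toy = (toy["fun"] > 5).astype(int)

# Knock out roughly 20% of one column to mimic the NaN values
toy.loc[rng.random(200) < 0.2, "attr"] = np.nan

# The imputer fills each NaN with the column median, then the forest is fitted
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=50, random_state=1),
)
model.fit(toy, y_toy)
preds = model.predict(toy)
```

Putting the imputer inside the pipeline means the medians are learned only from the data passed to `fit`, so the test set does not influence them. Whether the median is a sensible fill value for each column is a judgment call that depends on what the column measures.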

In [4]:
df = df.drop(columns=["id","iid","idg","partner","positin1"])
df = df.select_dtypes(exclude=['object']) #drop columns with dtype "object" (strings)
df = df.set_index("match")
df = df.reset_index() #set_index + reset_index moves "match" to the first column
df = df.dropna(axis=1) #drop every column that contains at least one NaN
df.head()

Unnamed: 0,match,gender,condtn,wave,round,position,order,samerace,dec_o,dec
0,0,0,1,1,10,7,4,0,0,1
1,0,0,1,1,10,7,3,0,0,1
2,1,0,1,1,10,7,10,1,1,1
3,1,0,1,1,10,7,5,0,1,1
4,1,0,1,1,10,7,7,0,1,1


In [5]:
df["match"].value_counts()

0    6998
1    1380
Name: match, dtype: int64

In [6]:
X = df.loc[:,"gender":"dec"]
y = df["match"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
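Because the classes are imbalanced, one optional variation on the split above is to pass `stratify=y` so that train and test keep the same share of matches. This is a sketch on a stand-in label column rebuilt from the printed counts; `X_toy` and `y_toy` are placeholder names, not the notebook's variables.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with the same class counts as the dataset
y_toy = pd.Series([0] * 6998 + [1] * 1380, name="match")
X_toy = pd.DataFrame({"row": range(len(y_toy))})

# stratify=y_toy keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

train_share = y_tr.mean()  # fraction of matches in the training set
test_share = y_te.mean()   # fraction of matches in the test set
print(round(train_share, 3), round(test_share, 3))
```

With a plain random split on this many rows the proportions usually end up close anyway, so this matters more for smaller datasets.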

## Training the algorithm

In [7]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=1, n_estimators=100)
rf = rf.fit(X_train,y_train)

In [8]:
rf.score(X_test,y_test)

1.0

## Evaluating the model

Let's first create the confusion matrix.

In [9]:
#find out label
rf.classes_

array([0, 1])

In [10]:
y_pred = rf.predict(X_test)
cm = confusion_matrix(y_test,y_pred)
cm = pd.DataFrame(cm, index=["Not match (actual)","Match (actual)"], columns=["Not match (pred)","Match (pred)"])
cm

Unnamed: 0,Not match (pred),Match (pred)
Not match (actual),2107,0
Match (actual),0,407


Surprisingly, the algorithm predicts perfectly: accuracy, precision and recall are all 100%. Let's print out the precision and recall via the classification report.

In [13]:
from sklearn.metrics import classification_report
print (classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2107
           1       1.00      1.00      1.00       407

    accuracy                           1.00      2514
   macro avg       1.00      1.00      1.00      2514
weighted avg       1.00      1.00      1.00      2514
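The figures in the report for the "match" class can be reproduced by hand from the four cells of the confusion matrix. A minimal sketch using the counts shown earlier; the variable names are only for illustration.

```python
import numpy as np

# Rows are actual classes, columns are predicted classes (order: 0, 1)
cm_values = np.array([[2107, 0],
                      [0, 407]])

# For a binary problem ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = cm_values.ravel()

accuracy = (tp + tn) / cm_values.sum()  # share of all predictions that are correct
precision = tp / (tp + fp)              # of predicted matches, how many are real
recall = tp / (tp + fn)                 # of real matches, how many were found
print(accuracy, precision, recall)
```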



## Different parameters

Since the algorithm predicts 100% accurately, let's try out different parameters to see if there is any difference.

In [17]:
rf_new = RandomForestClassifier(n_estimators=30,max_features=8,random_state=1)
rf_new = rf_new.fit(X_train, y_train)
y_new_pred = rf_new.predict(X_test)
print(classification_report(y_test,y_new_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2107
           1       1.00      1.00      1.00       407

    accuracy                           1.00      2514
   macro avg       1.00      1.00      1.00      2514
weighted avg       1.00      1.00      1.00      2514



The results show that the algorithm still works perfectly even with different parameters. 😱
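One possible explanation for the perfect scores, which I have not verified on the full dataset: `X` still contains `dec` (this person's decision) and `dec_o` (the partner's decision). If, as I assume here, a match means that both said yes, then `match` is simply `dec * dec_o` and the forest only has to learn that rule. The sketch below checks this on the five rows shown by `df.head()` above; `toy`, `rebuilt`, `leak` and `X_no_leak` are names made up for this example.

```python
import pandas as pd

# The five rows printed by df.head() after the cleaning step
toy = pd.DataFrame({
    "match":    [0, 0, 1, 1, 1],
    "gender":   [0, 0, 0, 0, 0],
    "condtn":   [1, 1, 1, 1, 1],
    "wave":     [1, 1, 1, 1, 1],
    "round":    [10, 10, 10, 10, 10],
    "position": [7, 7, 7, 7, 7],
    "order":    [4, 3, 10, 5, 7],
    "samerace": [0, 0, 1, 0, 0],
    "dec_o":    [0, 0, 1, 1, 1],
    "dec":      [1, 1, 1, 1, 1],
})

# Assumption: a match happens only when both decisions are 1
rebuilt = toy["dec"] * toy["dec_o"]
leak = bool((rebuilt == toy["match"]).all())
print("match fully determined by dec and dec_o:", leak)

# A feature set without the two decision columns
X_no_leak = toy.drop(columns=["match", "dec", "dec_o"])
```

Running the same comparison on the full `df` would confirm or refute the assumption. If it holds, retraining on features without `dec` and `dec_o` should give a more realistic picture of how well the remaining variables predict a match.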