# Speed Dating Analysis Report
<img src="https://images.seattletimes.com/wp-content/uploads/2018/12/speeddating-1205-RGB-1.jpg?d=1200x630" width = "600"/>

Source: https://www.seattletimes.com/life/lifestyle/speed-dating-in-the-age-of-swiping-the-irl-dating-trend-is-still-ringing-bells/

#### Can we predict whether two heterosexual people matched or not based on 6 attributes after 4 minutes of speed dating? 

### Introduction:
"Speed dating" is a matchmaking process that involves two people getting to know one another within a very short time frame (usually less than 10 minutes), and then switching to another person once the time is up. This speedy and energetic approach to meeting other single people may result in a match between two individuals where they both choose to pursue dating. But what factors influence whether or not two people match? <b> Is there a way to predict whether two people will match or not in a speed date based on certain attributes? </b> 
The Speed Dating dataset posted by Ulrik Thyge Pedersen gathered data from participants in a four-minute speed dating experiment where they were asked to rate their date on six different attributes: attractiveness, sincerity, intelligence, fun, ambition, and shared interests. Using this data, we will attempt to predict whether or not two people will match based on these attributes.

In [5]:
# Run this cell to import the required libraries
import altair as alt
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split

#### Speed Dating Dataframe

To begin, we read in the raw data.

In [6]:
import matplotlib.pyplot as plt

df_speeddating = pd.read_csv("speeddating.csv")

df_speeddating.head(5)

Unnamed: 0,has_null,wave,gender,age,age_o,d_age,d_d_age,race,race_o,samerace,...,d_expected_num_interested_in_me,d_expected_num_matches,like,guess_prob_liked,d_like,d_guess_prob_liked,met,decision,decision_o,match
0,b'',1.0,b'female',21.0,27.0,6.0,b'[4-6]',b'Asian/Pacific Islander/Asian-American',b'European/Caucasian-American',b'0',...,b'[0-3]',b'[3-5]',7.0,6.0,b'[6-8]',b'[5-6]',0.0,b'1',b'0',b'0'
1,b'',1.0,b'female',21.0,22.0,1.0,b'[0-1]',b'Asian/Pacific Islander/Asian-American',b'European/Caucasian-American',b'0',...,b'[0-3]',b'[3-5]',7.0,5.0,b'[6-8]',b'[5-6]',1.0,b'1',b'0',b'0'
2,b'',1.0,b'female',21.0,22.0,1.0,b'[0-1]',b'Asian/Pacific Islander/Asian-American',b'Asian/Pacific Islander/Asian-American',b'1',...,b'[0-3]',b'[3-5]',7.0,,b'[6-8]',b'[0-4]',1.0,b'1',b'1',b'1'
3,b'',1.0,b'female',21.0,23.0,2.0,b'[2-3]',b'Asian/Pacific Islander/Asian-American',b'European/Caucasian-American',b'0',...,b'[0-3]',b'[3-5]',7.0,6.0,b'[6-8]',b'[5-6]',0.0,b'1',b'1',b'1'
4,b'',1.0,b'female',21.0,24.0,3.0,b'[2-3]',b'Asian/Pacific Islander/Asian-American',b'Latino/Hispanic American',b'0',...,b'[0-3]',b'[3-5]',6.0,6.0,b'[6-8]',b'[5-6]',0.0,b'1',b'1',b'1'


#### Cleaning and wrangling data
Next, we make the speed dating data suitable for our analysis. First, we use [] to select the class column (match) and the six attribute columns (attractive_partner, sincere_partner, intelligence_partner, funny_partner, ambition_partner, shared_interests_partner) required for this analysis. We then draw a random sample of 1000 rows and use the .dropna() method to remove any rows that contain missing values (NaN). We also rename the values of the "match" column from b'0' to "fail" and from b'1' to "success". Finally, we convert the "match" column from an object type to a categorical type.


In [7]:
speeddating = df_speeddating[["match","attractive_partner", "sincere_partner","intelligence_partner", "funny_partner", "ambition_partner", "shared_interests_partner"]]

speeddating = speeddating.sample(n=1000, random_state=2023).dropna()

speeddating["match"] = speeddating["match"].replace({"b'0'":"fail","b'1'":"success"}).astype("category")
speeddating.head(5)

Unnamed: 0,match,attractive_partner,sincere_partner,intelligence_partner,funny_partner,ambition_partner,shared_interests_partner
3852,fail,9.0,8.0,8.0,8.0,6.0,2.0
1938,fail,5.0,5.0,6.0,6.0,5.0,5.0
4879,fail,7.0,5.0,6.0,4.0,5.0,5.0
1619,fail,7.0,10.0,8.0,7.0,7.0,3.0
3151,fail,7.0,6.0,7.0,6.0,6.0,5.0
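
As a minimal sketch of the cleaning steps described above, the following toy example drops rows with missing ratings and relabels the class column. The `toy` DataFrame and its values are invented for illustration and are not taken from speeddating.csv; it also assumes, as the cleaning cell does, that the labels are stored as the literal strings "b'0'" and "b'1'".

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw data (not real rows from the dataset)
toy = pd.DataFrame({
    "match": ["b'0'", "b'1'", "b'0'", "b'1'"],
    "attractive_partner": [6.0, 8.0, np.nan, 9.0],
})

# Drop rows with missing ratings; .copy() gives an independent frame to modify
toy_clean = toy.dropna().copy()

# Relabel the class column and store it as a categorical type
toy_clean["match"] = (
    toy_clean["match"]
    .map({"b'0'": "fail", "b'1'": "success"})
    .astype("category")
)

print(toy_clean)
```

Because rows are sampled before .dropna() is applied in the notebook, the cleaned data can contain fewer than 1000 rows.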


#### Summary of Speeddating Data
Using the info function, we can see that the cleaned data has seven columns: one categorical column (match) and six numeric columns. The data is now organised and ready to be divided into training and test sets.

In [None]:
speeddating.info()

#### Creating the training and test sets
Next, we split the cleaned speed dating data into a training set and a test set. We use the training set to perform an exploratory analysis of the six variables. After building a classifier to predict "match" from the training set, we use the test set to evaluate how accurately the model predicts "match" for new observations.

In [None]:
speeddating_training, speeddating_testing = train_test_split(
   speeddating, train_size=0.75, random_state=2023 # do not change the random_state
)
speeddating_training = pd.DataFrame(speeddating_training)
speeddating_training.head(5)
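
The split above is not stratified, whereas later cells in this notebook pass stratify=speeddating["match"]. The sketch below, on invented labels with an assumed 80/20 imbalance, illustrates what stratification does: the class proportions are preserved in both the training and the test portion.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical labels with an 80/20 class imbalance (not the real data)
labels = ["fail"] * 80 + ["success"] * 20
rows = list(range(len(labels)))

# stratify=labels keeps the fail/success ratio the same in both parts
train_rows, test_rows, train_labels, test_labels = train_test_split(
    rows, labels, train_size=0.75, stratify=labels, random_state=2023
)

train_counts = Counter(train_labels)
test_counts = Counter(test_labels)
print(train_counts, test_counts)
```

With an imbalanced class such as "match", a stratified split reduces the chance that the test set ends up with very few successful matches.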

#### Checking the "match" data distribution

In [None]:
speeddating_training_dd = speeddating_training.assign(row_number=range(len(speeddating_training)))

explore_speeddating = pd.DataFrame()

explore_speeddating['count'] = speeddating_training_dd.groupby('match')['row_number'].count()
explore_speeddating['percentage'] = 100 * explore_speeddating['count']/len(speeddating_training_dd)

explore_speeddating

We can see from this table that the match class is imbalanced: the large majority of speed dates (more than 80%) failed to match, and fewer than 20% were successful. This is a limitation for our model and for how we interpret its accuracy. Because most speed dates failed, a model can reach a high accuracy (around 80 percent) simply by predicting "fail" most of the time, while still being poor at predicting successful dates.
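
To make the imbalance argument concrete, here is a small sketch of the accuracy a trivial "always predict the majority class" rule would achieve. The helper `majority_baseline` and the counts are hypothetical and chosen only for illustration; they are not the notebook's actual counts.

```python
def majority_baseline(counts):
    """Return the majority label, its accuracy, and the per-class recall
    of a rule that always predicts the majority label."""
    total = sum(counts.values())
    majority_label = max(counts, key=counts.get)
    accuracy = counts[majority_label] / total
    # Every majority-class row is "found"; no minority-class row ever is
    recall = {label: (1.0 if label == majority_label else 0.0) for label in counts}
    return majority_label, accuracy, recall


# Hypothetical class counts for illustration only
example_counts = {"fail": 83, "success": 17}
baseline_label, baseline_accuracy, baseline_recall = majority_baseline(example_counts)

print(baseline_label, baseline_accuracy, baseline_recall)
```

Any classifier we build should be compared against this baseline, since an accuracy close to the majority-class share does not by itself show that the model has learned anything about successful matches.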

#### Creating Boxplot 
A box plot shows the distribution of each attribute's ratings and makes it possible to compare those distributions across match outcomes. To assess the relationship between each attribute and the match outcome, we produced one boxplot per attribute (six in total) using seaborn.boxplot.

In [None]:
sns.boxplot(x='match', y='attractive_partner', width=0.25, data=speeddating_training).set(title='Figure 1: Comparison of match results for Attraction Attribute', ylabel="Rating of Partner's Attractiveness", xlabel="Match")

Interpretation of Figure 1: This boxplot compares the explanatory variable "Partner's Attractiveness" with the response variable "match outcome". The median attractiveness rating was two points higher for successful matches than for failed matches. On a rating scale of 1-10, a two-point difference in medians suggests that the partner's attractiveness has an influence on match results.
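
The quantities read off a boxplot (median and interquartile range per class) can also be computed numerically. The following is a minimal sketch on invented ratings; the `toy` values are assumptions for illustration, not rows from the training set.

```python
import pandas as pd

# Hypothetical ratings for illustration only
toy = pd.DataFrame({
    "match": ["fail"] * 5 + ["success"] * 5,
    "attractive_partner": [4.0, 5.0, 5.0, 6.0, 7.0, 6.0, 7.0, 7.0, 8.0, 9.0],
})

# Quartiles of the rating within each match class
quartiles = (
    toy.groupby("match")["attractive_partner"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q1", "median", "q3"]
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]

# Difference between the class medians, as discussed for Figure 1
median_gap = quartiles.loc["success", "median"] - quartiles.loc["fail", "median"]

print(quartiles)
print(median_gap)
```

The same pattern could be applied to speeddating_training with any of the six attribute columns to back up the visual interpretation with numbers.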

In [None]:
sns.boxplot(x='match', y='sincere_partner', width=0.25, data=speeddating_training).set(title='Figure 2: Comparison of match results for Sincere Attribute', ylabel="Rating of Partner's Sincerity", xlabel="Match")
# Draft note: a plot of the explanatory variable "like" shows separation; participants who liked their partner tended to match successfully.
# We won't include "like" in the model.

Interpretation of Figure 2: This boxplot compares the explanatory variable "Partner's Sincerity" with the response variable "match outcome". There is little overlap between the boxes for the fail and success categories, so the two groups differ. The rating of the partner's sincerity seems to have an influence on match results.

In [None]:
sns.boxplot(x='match', y='intelligence_partner', width=0.25, data=speeddating_training).set(title='Figure 3: Comparison of match results for Intelligence Attribute', ylabel="Rating of Partner's Intelligence", xlabel="Match")


# some overlap but not exact, not include age 

Interpretation of Figure 3: This boxplot compares the explanatory variable "Partner's Intelligence" with the response variable "match outcome". Similar to Figure 2, the non-overlapping boxes and the narrow interquartile range of the success boxplot indicate that higher intelligence ratings are associated with successful matches.

In [None]:
sns.boxplot(x='match', y='funny_partner', width=0.25, data=speeddating_training).set(title='Figure 4: Comparison of match results for Fun Attribute', ylabel="Rating of Partner being funny", xlabel="Match")

Interpretation of Figure 4: median significant... keep?

In [None]:
sns.boxplot(x='match', y='ambition_partner', width=0.25, data=speeddating_training).set(title='Figure 5: Comparison of match results for Ambition Attribute', ylabel="Rating of Partner's Ambition", xlabel="Match")

Interpretation of Figure 5: low outlier for success.. drop?

In [None]:
sns.boxplot(x='match', y='shared_interests_partner', width=0.25, data=speeddating_training).set(title='Figure 6: Comparison of match results for Shared Interests Attribute', ylabel="Rating of Partner's Shared Interests", xlabel="Match")

Interpretation of Figure 6: median difference is significant... keep ? 

##### Hence, relevant attributes include ... will be used for the model.

In [None]:
# hard to interpret... cannot interpret,
# correlation of interests and difference in age by matching: we cannot distinguish the classes

In [None]:
# after we build the model: cross-validation, k-nearest neighbours, set seeds in every cell
# for visualization, use a classification (confusion) matrix; even if we have a high level of accuracy,
# we see a lot of false negatives due to class imbalance (handling this is outside the scope of the course)
# research the topic and cite sources
# discussion (summarize, what impact)

#### Deciding which columns are significant

In [None]:
speeddating_train, speeddating_test = train_test_split(speeddating, test_size=0.25, random_state=123) # fixed seed for reproducibility

predictors = [
    "attractive_partner", "sincere_partner","intelligence_partner", "funny_partner", "shared_interests_partner","ambition_partner"]

# standardize the predictors only; the target "match" is passed separately as y,
# so it must not be listed in the column transformer
speeddating_preprocessor = make_column_transformer(
    (StandardScaler(), predictors),
)

knn_spec = KNeighborsClassifier(n_neighbors=3)

X = speeddating_train[predictors]
y = speeddating_train["match"]

speeddating_fit = make_pipeline(speeddating_preprocessor, knn_spec).fit(X, y)

# predict from the test predictors and attach the predictions to the test rows
speeddating_test_predictions = speeddating_fit.predict(speeddating_test[predictors])
speeddating_test_predictions = pd.concat(
    [
        speeddating_test.reset_index(drop=True),
        pd.DataFrame(speeddating_test_predictions, columns=["predicted"]),
    ],
    axis=1,
)

speeddating_test_predictions

In [None]:
X_test = speeddating_test[[
    "attractive_partner", "sincere_partner","intelligence_partner", "funny_partner", "shared_interests_partner","ambition_partner"]]
y_test = speeddating_test["match"]

speeddating_prediction_accuracy = speeddating_fit.score(X_test, y_test)

speeddating_prediction_accuracy

In [None]:
speeddating_mat = sklearn.metrics.confusion_matrix(
    speeddating_test_predictions["match"],
    speeddating_test_predictions["predicted"],
    labels=speeddating_fit.classes_,
)

from sklearn.metrics import ConfusionMatrixDisplay

disp = ConfusionMatrixDisplay(
    confusion_matrix=speeddating_mat, display_labels=speeddating_fit.classes_
)
disp.plot()
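
Because of the class imbalance, accuracy alone can hide how well the "success" class is predicted. As a hedged sketch, the example below computes recall and precision for the "success" class alongside the confusion matrix. The `y_true` and `y_pred` lists are invented for illustration and are not the notebook's predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical true and predicted labels for illustration only
y_true = ["fail"] * 8 + ["success"] * 2
y_pred = ["fail"] * 7 + ["success"] + ["fail"] + ["success"]

# Rows are true classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true, y_pred, labels=["fail", "success"])

accuracy = accuracy_score(y_true, y_pred)
# Of the truly successful dates, what fraction did we predict as "success"?
success_recall = recall_score(y_true, y_pred, pos_label="success")
# Of the dates predicted as "success", what fraction really matched?
success_precision = precision_score(y_true, y_pred, pos_label="success")

print(cm)
print(accuracy, success_recall, success_precision)
```

In this toy case the accuracy is 0.8 even though only half of the successful dates are identified, which is the pattern the draft notes above anticipate (many false negatives despite high accuracy).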

In [None]:
X_val = speeddating_train[[
    "attractive_partner", "sincere_partner","intelligence_partner", "funny_partner", "shared_interests_partner","ambition_partner"]]
y_val = speeddating_train["match"]

speeddating_pipe = make_pipeline(speeddating_preprocessor, knn_spec)

speeddating_vfold_score = cross_validate(estimator=speeddating_pipe,  cv=5, X=X_val, y=y_val, return_train_score=True,)

pd.DataFrame(speeddating_vfold_score)

speeddating_metrics_mean = pd.DataFrame(speeddating_vfold_score).mean()
speeddating_metrics_std = pd.DataFrame(speeddating_vfold_score).std()

In [None]:
# hyperparameter optimization
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
    ParameterGrid)

# candidate values of n_neighbors to try in the grid search
# (one entry per hyperparameter we would like to tune)
param_grid = {
  "kneighborsclassifier__n_neighbors": range(2,15,1),
}

# redefine the pipeline to use default values for parameters
speeddating_tune_pipe = make_pipeline(
  speeddating_preprocessor,
  KNeighborsClassifier())

# run 5-fold-cross-validations to tune hyperparameters
knn_tune_grid = GridSearchCV(
  estimator=speeddating_tune_pipe,
  param_grid=param_grid,
  cv=5
)


# fit the models to the data
# predictors and target
X_tune=speeddating_training[["attractive_partner", "sincere_partner","intelligence_partner", "funny_partner", "shared_interests_partner","ambition_partner"]]
y_tune=speeddating_training["match"]

# assign tuned models
knn_model_grid = knn_tune_grid.fit(X_tune, y_tune)


# find the cv_results_ and save in dataframe
accuracies_grid = pd.DataFrame(knn_model_grid.cv_results_)


# use a line plot to find the best value of the number of neighbors
accuracy_versus_k_grid = (
    alt.Chart(accuracies_grid, title="Grid Search")
    .mark_line(point=True)
    .encode(
        x=alt.X(
            "param_kneighborsclassifier__n_neighbors",
            title="Neighbors",
            scale=alt.Scale(zero=False),
        ),
        y=alt.Y(
            "mean_test_score", 
            title="Mean Test Score", 
            scale=alt.Scale(zero=False)
        ),
    )
    .configure_axis(labelFontSize=10, titleFontSize=15)
    .properties(width=400, height=300)
)
accuracy_versus_k_grid


The grid search plot shows that k = 6 gives the highest mean cross-validation accuracy, while k = 3 gives the lowest. Therefore, we will use 6 as the n_neighbors value for our KNeighborsClassifier.
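
Instead of reading the best k off the plot, the fitted GridSearchCV object exposes it directly through its best_params_ and best_score_ attributes. The sketch below demonstrates this on synthetic data generated with make_classification; the data and the names `X_demo`, `y_demo`, and `demo_grid` are assumptions for illustration, not the speed dating data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the speed dating data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Same search range as above: k = 2, ..., 14 with 5-fold cross-validation
demo_grid = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": list(range(2, 15))},
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

# The best k and its mean cross-validation accuracy
best_k = demo_grid.best_params_["kneighborsclassifier__n_neighbors"]
best_score = demo_grid.best_score_
print(best_k, round(best_score, 3))
```

Applied to this notebook, the same attributes on knn_model_grid would confirm the value read from the plot.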

#### Cross Validation (CV)

##### CV Accuracy: Attractiveness of the Partner & Match

In [None]:
np.random.seed(1000)
speeddating_preprocessor = make_column_transformer(
    (StandardScaler(), ["attractive_partner"]),
)

# create the 25/75 split of the *training data* into sub-training and validation
speeddating_subtrain, speeddating_validation = train_test_split(
    speeddating_training, test_size=0.25
)

# fit the model on the sub-training data
knn = KNeighborsClassifier(n_neighbors=6) 
X = speeddating_subtrain.loc[:, ["attractive_partner"]]
y = speeddating_subtrain["match"]
knn_fit = make_pipeline(speeddating_preprocessor, knn).fit(X, y)

# compute the score on validation data
acc = knn_fit.score(
    speeddating_validation.loc[:, ["attractive_partner"]],
    speeddating_validation["match"]
)
acc

##### CV Accuracy: Intelligence of the Partner & Match

In [None]:
np.random.seed(1000)
speeddating_preprocessor = make_column_transformer(
    (StandardScaler(), ["intelligence_partner"]),
)

# create the 25/75 split of the *training data* into sub-training and validation
speeddating_subtrain, speeddating_validation = train_test_split(
    speeddating_training, test_size=0.25
)

# fit the model on the sub-training data
knn = KNeighborsClassifier(n_neighbors=6) 
X = speeddating_subtrain.loc[:, ["intelligence_partner"]]
y = speeddating_subtrain["match"]
knn_fit = make_pipeline(speeddating_preprocessor, knn).fit(X, y)

# compute the score on validation data
acc = knn_fit.score(
    speeddating_validation.loc[:, ["intelligence_partner"]],
    speeddating_validation["match"]
)
acc

##### CV Accuracy: Funny attribute of Partner & Match

In [None]:
np.random.seed(1000)
speeddating_preprocessor = make_column_transformer(
    (StandardScaler(), ["funny_partner"]),
)

# create the 25/75 split of the *training data* into sub-training and validation
speeddating_subtrain, speeddating_validation = train_test_split(
    speeddating_training, test_size=0.25
)

# fit the model on the sub-training data
knn = KNeighborsClassifier(n_neighbors=6) 
X = speeddating_subtrain.loc[:, ["funny_partner"]]
y = speeddating_subtrain["match"]
knn_fit = make_pipeline(speeddating_preprocessor, knn).fit(X, y)

# compute the score on validation data
acc = knn_fit.score(
    speeddating_validation.loc[:, ["funny_partner"]],
    speeddating_validation["match"]
)
acc

##### CV Accuracy: Shared Interests attribute of Partner & Match

In [None]:
np.random.seed(1000)
speeddating_preprocessor = make_column_transformer(
    (StandardScaler(), ["shared_interests_partner"]),
)

# create the 25/75 split of the *training data* into sub-training and validation
speeddating_subtrain, speeddating_validation = train_test_split(
    speeddating_training, test_size=0.25
)

# fit the model on the sub-training data
knn = KNeighborsClassifier(n_neighbors=6) 
X = speeddating_subtrain.loc[:, ["shared_interests_partner"]]
y = speeddating_subtrain["match"]
knn_fit = make_pipeline(speeddating_preprocessor, knn).fit(X, y)

# compute the score on validation data
acc = knn_fit.score(
    speeddating_validation.loc[:, ["shared_interests_partner"]],
    speeddating_validation["match"]
)
acc

##### CV Accuracy: Ambition attribute of Partner & Match

In [None]:
np.random.seed(1000)
# standardize the predictor used in this model
speeddating_preprocessor = make_column_transformer(
    (StandardScaler(), ["ambition_partner"]),
)

# create the 25/75 split of the *training data* into sub-training and validation
speeddating_subtrain, speeddating_validation = train_test_split(
    speeddating_training, test_size=0.25
)

knn = KNeighborsClassifier(n_neighbors=6) 
X = speeddating_subtrain.loc[:, ["ambition_partner"]]
y = speeddating_subtrain["match"]
knn_fit = make_pipeline(speeddating_preprocessor, knn).fit(X, y)

# compute the score on validation data
acc = knn_fit.score(
    speeddating_validation.loc[:, ["ambition_partner"]],
    speeddating_validation["match"]
)
acc

##### CV Accuracy: Sincere attribute of Partner & Match

In [None]:
np.random.seed(1000)
speeddating_preprocessor = make_column_transformer(
    (StandardScaler(), ["sincere_partner"]),
)

# create the 25/75 split of the *training data* into sub-training and validation
speeddating_subtrain, speeddating_validation = train_test_split(
    speeddating_training, test_size=0.25
)

# fit the model on the sub-training data
knn = KNeighborsClassifier(n_neighbors=6) 
X = speeddating_subtrain.loc[:, ["sincere_partner"]]
y = speeddating_subtrain["match"]
knn_fit = make_pipeline(speeddating_preprocessor, knn).fit(X, y)

# compute the score on validation data
acc = knn_fit.score(
    speeddating_validation.loc[:, ["sincere_partner"]],
    speeddating_validation["match"]
)
acc
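
The six single-predictor cells above each score one validation split. As a possible alternative, the sketch below loops over the predictors and uses 5-fold cross-validation via cross_val_score, which averages over several splits. The `demo` DataFrame is randomly generated for illustration (with "match" made to depend on attractiveness by construction), so its scores say nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features = [
    "attractive_partner", "sincere_partner", "intelligence_partner",
    "funny_partner", "shared_interests_partner", "ambition_partner",
]

# Synthetic stand-in data: integer ratings from 1 to 10 (not the real dataset)
rng = np.random.default_rng(1000)
n_rows = 200
demo = pd.DataFrame(
    {name: rng.integers(1, 11, size=n_rows).astype(float) for name in features}
)
# By construction, "match" depends only on attractiveness plus noise
noisy_score = demo["attractive_partner"] + rng.normal(0, 1, size=n_rows)
demo["match"] = np.where(noisy_score > 7, "success", "fail")

# Mean 5-fold cross-validation accuracy for each single-predictor model
cv_accuracy = {}
for name in features:
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=6))
    scores = cross_val_score(pipe, demo[[name]], demo["match"], cv=5)
    cv_accuracy[name] = scores.mean()

print(pd.Series(cv_accuracy).sort_values(ascending=False))
```

A loop like this keeps the comparison consistent across predictors, because every model is evaluated with the same procedure and the same value of k.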

##### CV Accuracy: Sincere, intelligence, ambition, and shared interests attributes of Partner & Match

In [None]:
np.random.seed(1000)
# standardize the predictors used in this model
speeddating_preprocessor = make_column_transformer(
    (StandardScaler(), ["sincere_partner","intelligence_partner", "shared_interests_partner", "ambition_partner"]),
)

# create the 25/75 split of the *training data* into sub-training and validation
speeddating_subtrain, speeddating_validation = train_test_split(
    speeddating_training, test_size=0.25
)

knn = KNeighborsClassifier(n_neighbors=6) 
X = speeddating_subtrain.loc[:, ["sincere_partner","intelligence_partner", "shared_interests_partner", "ambition_partner"]]
y = speeddating_subtrain["match"]
knn_fit = make_pipeline(speeddating_preprocessor, knn).fit(X, y)

# compute the score on validation data
acc = knn_fit.score(
    speeddating_validation.loc[:, ["sincere_partner","intelligence_partner", "shared_interests_partner", "ambition_partner"]],
    speeddating_validation["match"]
)
acc

After dropping the "attractive_partner" and "funny_partner" columns, so that the predictors are sincerity, intelligence, ambition, and shared interests, the validation accuracy of the model was 0.87. This is a fairly high score. However, since we found that attractiveness is a strong predictor of matching and shared interests a weak one, we will try another model that adds "attractive_partner" and removes "shared_interests_partner".

In [None]:
np.random.seed(1000)
# standardize the predictors used in this model
speeddating_preprocessor = make_column_transformer(
    (StandardScaler(), ["attractive_partner","sincere_partner","intelligence_partner", "ambition_partner"]),
)

# create the 25/75 split of the *training data* into sub-training and validation
speeddating_subtrain, speeddating_validation = train_test_split(
    speeddating_training, test_size=0.25
)

knn = KNeighborsClassifier(n_neighbors=6) 
X = speeddating_subtrain.loc[:, ["attractive_partner", "sincere_partner","intelligence_partner", "ambition_partner"]]
y = speeddating_subtrain["match"]
knn_fit = make_pipeline(speeddating_preprocessor, knn).fit(X, y)

# compute the score on validation data
acc = knn_fit.score(
    speeddating_validation.loc[:, ["attractive_partner","sincere_partner","intelligence_partner", "ambition_partner"]],
    speeddating_validation["match"]
)
acc

In [None]:
final_speeddating = speeddating[["sincere_partner","intelligence_partner", "shared_interests_partner", "ambition_partner", "match"]]
final_speeddating.head(5)

In [None]:
final_speeddating_train, final_speeddating_test = train_test_split(final_speeddating, test_size=0.25, random_state=123) # set the random state to be 123


## Discussion

#### Summary of results:

#### Expectations and impacts of results:

#### Future questions

### Sources/ Citation 

#### messy data wrangling, clean up later

In [10]:
# first combination: attractiveness, intelligence, funny, sincere
np.random.seed(1000)

speeddating_train, speeddating_test = train_test_split(
    speeddating, train_size=0.75, stratify=speeddating["match"]
)
speeddating_train.info()

speeddating_train["match"].value_counts(normalize=True)

# preprocess data
speeddating_preprocessor = make_column_transformer(
    (StandardScaler(), ["attractive_partner", "intelligence_partner", 
                        "funny_partner", "sincere_partner"])
)

# make a dataframe for finding the highest accuracy
# accuracies grid
knn = KNeighborsClassifier()
speeddating_tune_pipe = make_pipeline(speeddating_preprocessor, knn)


parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 100, 5),
}

speeddating_tune_grid = GridSearchCV(
    estimator=speeddating_tune_pipe,
    param_grid=parameter_grid,
    cv=10
)

accuracies_grid = pd.DataFrame(
             speeddating_tune_grid
             .fit(speeddating_train.loc[:, ["attractive_partner", "intelligence_partner",
                                            "funny_partner", "sincere_partner"]],
                  speeddating_train["match"]
            ).cv_results_)


accuracies_grid = accuracies_grid[["param_kneighborsclassifier__n_neighbors", 
                                   "mean_test_score", "std_test_score"]
              ].assign(
                  sem_test_score = accuracies_grid["std_test_score"] / 10**(1/2)
              ).rename(
                  columns = {"param_kneighborsclassifier__n_neighbors" : "n_neighbors"}
              ).drop(
                  columns = ["std_test_score"]
              )
accuracies_grid




# now use k=6 to find accuracy of model?
# use a split of the training set for validation set
# create the 25/75 split of the *training data* into sub-training and validation
speeddating_subtrain, speeddating_validation = train_test_split(
    speeddating_train, test_size=0.25
)

# fit the model on the sub-training data
knn = KNeighborsClassifier(n_neighbors=6) 
X = speeddating_subtrain.loc[:, ["attractive_partner","intelligence_partner",
                                 "funny_partner", "sincere_partner"]]
y = speeddating_subtrain["match"]
knn_fit = make_pipeline(speeddating_preprocessor, knn).fit(X, y)

# compute the score on validation data
acc = knn_fit.score(
    speeddating_validation.loc[:, ["attractive_partner","intelligence_partner",
                                   "funny_partner", "sincere_partner"]],
    speeddating_validation["match"]
)
acc

<class 'pandas.core.frame.DataFrame'>
Int64Index: 619 entries, 2778 to 5556
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   match                     619 non-null    category
 1   attractive_partner        619 non-null    float64 
 2   sincere_partner           619 non-null    float64 
 3   intelligence_partner      619 non-null    float64 
 4   funny_partner             619 non-null    float64 
 5   ambition_partner          619 non-null    float64 
 6   shared_interests_partner  619 non-null    float64 
dtypes: category(1), float64(6)
memory usage: 34.6 KB


0.8129032258064516
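
The sem_test_score column in the cell above is the standard error of the mean: the standard deviation of the fold scores divided by the square root of the number of folds (10 here). The sketch below works through that arithmetic on invented fold accuracies; the `fold_scores` values are hypothetical. It uses the population standard deviation, on the assumption that this matches how std_test_score is computed.

```python
import math
import statistics

# Hypothetical accuracies from 10 cross-validation folds (illustration only)
fold_scores = [0.80, 0.82, 0.78, 0.81, 0.79, 0.80, 0.83, 0.77, 0.80, 0.80]

n_folds = len(fold_scores)
mean_score = statistics.fmean(fold_scores)

# Population standard deviation of the fold scores
std_score = statistics.pstdev(fold_scores)

# Standard error of the mean: std / sqrt(number of folds)
sem_score = std_score / math.sqrt(n_folds)

print(round(mean_score, 4), round(std_score, 4), round(sem_score, 4))
```

A smaller standard error means the mean accuracy for that k is estimated more precisely, which is useful when two values of k have nearly identical mean scores.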

In [17]:
# second combination: sincere, intelligence, shared interests, ambition 
np.random.seed(1000)

speeddating_train, speeddating_test = train_test_split(
    speeddating, train_size=0.75, stratify=speeddating["match"]
)
speeddating_train.info()

speeddating_train["match"].value_counts(normalize=True)

# preprocess data
speeddating_preprocessor = make_column_transformer(
    (StandardScaler(), ["sincere_partner","intelligence_partner",
                        "shared_interests_partner", "ambition_partner"])
)

# make a dataframe for finding the highest accuracy
# accuracies grid
knn = KNeighborsClassifier()
speeddating_tune_pipe = make_pipeline(speeddating_preprocessor, knn)


parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 100, 5),
}

speeddating_tune_grid = GridSearchCV(
    estimator=speeddating_tune_pipe,
    param_grid=parameter_grid,
    cv=10
)

accuracies_grid = pd.DataFrame(
             speeddating_tune_grid
             .fit(speeddating_train.loc[:, ["sincere_partner","intelligence_partner",
                                            "shared_interests_partner", "ambition_partner"]],
                  speeddating_train["match"]
            ).cv_results_)


accuracies_grid = accuracies_grid[["param_kneighborsclassifier__n_neighbors", 
                                   "mean_test_score", "std_test_score"]
              ].assign(
                  sem_test_score = accuracies_grid["std_test_score"] / 10**(1/2)
              ).rename(
                  columns = {"param_kneighborsclassifier__n_neighbors" : "n_neighbors"}
              ).drop(
                  columns = ["std_test_score"]
              )
accuracies_grid




# k=56 has the highest accuracy :O
# k=6 had accuracy of 0.787 and k=56 has 0.794
# use a split of the training set for validation set
# create the 25/75 split of the *training data* into sub-training and validation
speeddating_subtrain, speeddating_validation = train_test_split(
    speeddating_train, test_size=0.25
)

# fit the model on the sub-training data
knn = KNeighborsClassifier(n_neighbors=56) 
X = speeddating_subtrain.loc[:, ["sincere_partner","intelligence_partner",
                                 "shared_interests_partner", "ambition_partner"]]
y = speeddating_subtrain["match"]
knn_fit = make_pipeline(speeddating_preprocessor, knn).fit(X, y)

# compute the score on validation data
acc = knn_fit.score(
    speeddating_validation.loc[:, ["sincere_partner","intelligence_partner",
                                   "shared_interests_partner", "ambition_partner"]],
    speeddating_validation["match"]
)
acc

<class 'pandas.core.frame.DataFrame'>
Int64Index: 619 entries, 2778 to 5556
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   match                     619 non-null    category
 1   attractive_partner        619 non-null    float64 
 2   sincere_partner           619 non-null    float64 
 3   intelligence_partner      619 non-null    float64 
 4   funny_partner             619 non-null    float64 
 5   ambition_partner          619 non-null    float64 
 6   shared_interests_partner  619 non-null    float64 
dtypes: category(1), float64(6)
memory usage: 34.6 KB


0.7935483870967742

In [22]:
# third combination: attractiveness, funny, shared_interests, intelligence
np.random.seed(1000)

speeddating_train, speeddating_test = train_test_split(
    speeddating, train_size=0.75, stratify=speeddating["match"]
)
speeddating_train.info()

speeddating_train["match"].value_counts(normalize=True)

# preprocess data
speeddating_preprocessor = make_column_transformer(
    (StandardScaler(), ["attractive_partner","intelligence_partner",
                        "funny_partner", "shared_interests_partner"])
)

# make a dataframe for finding the highest accuracy
# accuracies grid
knn = KNeighborsClassifier()
speeddating_tune_pipe = make_pipeline(speeddating_preprocessor, knn)


parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 100, 5),
}

speeddating_tune_grid = GridSearchCV(
    estimator=speeddating_tune_pipe,
    param_grid=parameter_grid,
    cv=10
)

accuracies_grid = pd.DataFrame(
             speeddating_tune_grid
             .fit(speeddating_train.loc[:, ["attractive_partner","intelligence_partner",
                                            "funny_partner", "shared_interests_partner"]],
                  speeddating_train["match"]
            ).cv_results_)


accuracies_grid = accuracies_grid[["param_kneighborsclassifier__n_neighbors", 
                                   "mean_test_score", "std_test_score"]
              ].assign(
                  sem_test_score = accuracies_grid["std_test_score"] / 10**(1/2)
              ).rename(
                  columns = {"param_kneighborsclassifier__n_neighbors" : "n_neighbors"}
              ).drop(
                  columns = ["std_test_score"]
              )
accuracies_grid




# k=51 and k=11 have the highest accuracy but k=51 has the lower std

# use a split of the training set for validation set
# create the 25/75 split of the *training data* into sub-training and validation
speeddating_subtrain, speeddating_validation = train_test_split(
    speeddating_train, test_size=0.25
)

# fit the model on the sub-training data
knn = KNeighborsClassifier(n_neighbors=51) 
X = speeddating_subtrain.loc[:, ["attractive_partner","intelligence_partner",
                                "funny_partner", "shared_interests_partner"]]
y = speeddating_subtrain["match"]
knn_fit = make_pipeline(speeddating_preprocessor, knn).fit(X, y)

# compute the score on validation data
acc = knn_fit.score(
    speeddating_validation.loc[:, ["attractive_partner","intelligence_partner",
                                    "funny_partner", "shared_interests_partner"]],
    speeddating_validation["match"]
)
acc

<class 'pandas.core.frame.DataFrame'>
Int64Index: 619 entries, 2778 to 5556
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   match                     619 non-null    category
 1   attractive_partner        619 non-null    float64 
 2   sincere_partner           619 non-null    float64 
 3   intelligence_partner      619 non-null    float64 
 4   funny_partner             619 non-null    float64 
 5   ambition_partner          619 non-null    float64 
 6   shared_interests_partner  619 non-null    float64 
dtypes: category(1), float64(6)
memory usage: 34.6 KB


0.7935483870967742