<a href="https://colab.research.google.com/github/henrymazer/glassmorphism/blob/main/A_perfect_ML_model%3F.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'predict-online-dating-matches-dataset:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F5250700%2F8744629%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240823%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240823T231210Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D02e5e9f3d397203f8f358df59655f5a27886c1b8419035bf7f13d43bab77c915a861115f9b81613386dd46acfac781a01fb98798bb3c0be9d7dbd7669e39b13535be44731ef182cebbc86a024fd6f37fab3c14799d50c5401e51efce4c9c7442fc9da7bb54fb00a6c6a7bf9a5939f73a842159fd1247bf64c22e34f3557b71cc1781f303ec52825047c82025eb2cd83fdcf3761ba4edddb4c3bd5467ac2fa235a3eccd1bb41b02010f34db9013754ba61830e8af4fbbf20cc52b6c8459aeaacfb962bdcb2c22e94475da790f8515fc293784d9b9e379ddc05e3a8407c7fc0836bfd436e7169e85450d1165b022f51862c0a0e306d64444a6af1b6cf3d38f9751'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
              # Use a distinct name so the tarfile module is not shadowed
              with tarfile.open(tfile.name) as tar:
                tar.extractall(destination_path)
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


# INTRODUCTION

![](https://i.imgur.com/FtIazQy.png)

In this data analysis, we explore the world of online dating, aiming to uncover the differences between men with and without matches. By examining variable distributions, relationships, and running statistical tests, we find the key factors that influence match success. We also run a machine learning model to predict matches based on personal information!

If you're new to the data world, you'll get to learn about:
- Treating datasets with Pandas
- Exploratory Data Analysis
- Hypothesis Testing
- Predictions
- Storytelling with Data

I'll also end with a short investigation into why our predictions were so perfect, something almost impossible with real-world data.

As you'll see during the analysis, this actually looks much more like a real task on a data team than I expected: I even found a problem in the data!

# LOAD LIBRARIES AND DATA

In [None]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

import statsmodels.api as sm
from statsmodels.formula.api import ols
import statsmodels.stats.multicomp as multi


from scipy.stats import ttest_ind

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



In [None]:
# dirname and filename still hold the last file found by the os.walk loop above
data = pd.read_csv(os.path.join(dirname, filename))
data.head()

In [None]:
data.info()

No column has missing values.

# EXPLORATORY DATA ANALYSIS

It's important to note that we don't use exploratory data analysis as a determining factor in feature selection, but rather as a way to gain new insights about the data.

Why don't we use it?<br>
Imagine you have a feature y that is determined by x_1, x_2, x_3, ..., x_10, with each x accounting for 10% of the variability of y. It is very likely that you won't see any relationship between the x's and y. Not to mention other cases where the feature's contribution to the target is hard to identify, such as certain non-linear relationships.
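To make that scenario concrete, here is a minimal simulation on synthetic data (not the dating dataset). It assumes ten independent, standard-normal features whose sum is exactly the target, so each one accounts for 10% of the variance. All names (`X`, `y`, `single_corr`, `r2_all`) are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 5000, 10

# Ten independent standard-normal features; y is exactly their sum,
# so each feature accounts for 1/10 of the variance of y.
X = rng.normal(size=(n_samples, n_features))
y = X.sum(axis=1)

# Pearson correlation of each single feature with y (theory: sqrt(0.1), about 0.32).
single_corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])

# R-squared of a least-squares fit on all ten features together.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ coef
r2_all = 1 - residual.var() / y.var()

print(np.round(single_corr, 2))
print(round(float(r2_all), 3))
```

Each feature alone shows only a weak correlation, which is easy to dismiss in a scatter plot, yet the ten together explain the target completely. That is why a faint univariate relationship is not a good reason to drop a feature.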

In [None]:
data.head()

Let's separate the features from the target right away and identify which ones are dummy (binary) variables:

In [None]:
features = ['Gender', 'PurchasedVIP', 'Income', 'Children', 'Age', 'Attractiveness']
target = 'Matches'

In [None]:
data.nunique().sort_values()

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(15, 8))
fig.suptitle("Features Distributions", fontsize=14)
for ax, col in zip(axs.flatten(), ['Attractiveness', 'Matches', 'Age', 'Income']):
    data.hist(column=col, ax=ax, bins=50)
    ax.set_title(col, fontsize=10)

fig.tight_layout()
fig.subplots_adjust(top=0.85)
plt.show()

To continue the univariate analysis, let's check the variables with fewer unique values:

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(20, 6))

sns.countplot(x='Gender', data=data, ax=axs[0])
axs[0].set_title('Gender Count')

sns.countplot(x='PurchasedVIP', data=data, ax=axs[1])
axs[1].set_title('Purchased VIP Count')

sns.countplot(x='Children', data=data, ax=axs[2])
axs[2].set_title('Children Count')

fig.tight_layout()

plt.show()


Our binary variables are pretty balanced. For children, we have something closer to what we see in the population.

Now, going back to matches, our target: it has an unusual distribution. The author of the dataset mentioned the possibility of "ghost users", but we're still going to study these users with no matches. Let's start by comparing those without matches with the rest of the sample:

In [None]:
data['Match_Status'] = data['Matches'].apply(lambda x: 'With Matches' if x > 0 else 'Without Matches')

columns = ['Attractiveness', 'Age', 'Income']

fig, axs = plt.subplots(1, 3, figsize=(20, 6))

for ax, col in zip(axs, columns):
    sns.violinplot(x='Match_Status', y=col, data=data, ax=ax)
    ax.set_title(f'{col} vs Matches')
    ax.set_xlabel('Matches')
    ax.set_ylabel(col)

fig.tight_layout()
plt.show()

Not a big difference; it probably won't even pass a t-test (we'll check that shortly). First, let's look at the remaining variables:

In [None]:
cols = ['Gender', 'PurchasedVIP', 'Children']


fig, axs = plt.subplots(1, 3, figsize=(20, 6))

for ax, col in zip(axs, cols):
    mean_values = data.groupby('Match_Status')[col].mean().reset_index()
    sns.barplot(x='Match_Status', y=col, data=mean_values, ax=ax)
    ax.set_title(f'Mean {col} by Match Status')
    ax.set_xlabel('Match Status')
    ax.set_ylabel(f'Mean {col}')


All women and all VIP users got matches! I expected them to get more matches, but I wasn't expecting this!
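A bar chart of means can hide small groups, so a row-normalised cross-tabulation is a handy way to confirm a claim like this exactly. The sketch below uses a tiny made-up table (not the real data) with the same column names and the same coding as this notebook (`Gender` 0 = male, 1 = female). The match counts are invented for illustration.

```python
import pandas as pd

# Made-up rows, NOT the Kaggle data; Gender: 0 = male, 1 = female.
toy = pd.DataFrame({
    "Gender":       [0, 0, 0, 0, 1, 1, 1, 1],
    "PurchasedVIP": [0, 1, 0, 1, 0, 1, 0, 1],
    "Matches":      [0, 70, 0, 70, 90, 120, 80, 110],
})
toy["Match_Status"] = toy["Matches"].gt(0).map(
    {True: "With Matches", False: "Without Matches"}
)

# Row-normalised shares: every row sums to 1.
by_gender = pd.crosstab(toy["Gender"], toy["Match_Status"], normalize="index")
by_vip = pd.crosstab(toy["PurchasedVIP"], toy["Match_Status"], normalize="index")

print(by_gender)
print(by_vip)
```

A share of exactly 1.0 in the "With Matches" column for a row means nobody in that group is without matches. Running the same two `pd.crosstab` calls on `data` would check the statement above directly.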

What's the profile of those who didn't get matches?

In [None]:
no_matches_data = data.query("Matches == 0 ")
no_matches_data.describe()

The main characteristic of those without matches is being a man without a VIP subscription. The other characteristics don't appear to be very different from those of users with matches. But let's compare men with men:

In [None]:
men_matches_data = data.query("Matches > 0 and Gender==0")
men_matches_data.describe()

See? Pretty close! And I bet it's not statistically significant!

In [None]:


numerical_variables = ['Income', 'Children', 'Age', 'Attractiveness']

for var in numerical_variables:
    t_stat, p_value = ttest_ind(no_matches_data[var], men_matches_data[var])
    if p_value < 0.05:
        print(f'The difference in {var} between the two populations is statistically significant (t-statistic = {t_stat:.2f}, p-value = {p_value:.4f})')
    else:
        print(f'The difference in {var} between the two populations is not statistically significant (t-statistic = {t_stat:.2f}, p-value = {p_value:.4f})')
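A p-value alone says nothing about how large a difference is, and with big samples even tiny gaps can come out significant. One common complement is Cohen's d, the mean difference in units of the pooled standard deviation. The sketch below is an illustration on synthetic numbers, not a result from this dataset. The helper `cohens_d` and the two groups are hypothetical.

```python
import numpy as np

def cohens_d(a, b):
    """Standardised mean difference using the pooled sample standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
# Synthetic "ages" for two large groups whose true means differ by half a year.
group_a = rng.normal(loc=35.0, scale=8.0, size=20000)
group_b = rng.normal(loc=35.5, scale=8.0, size=20000)

d = cohens_d(group_a, group_b)
print(f"Cohen's d = {d:.3f}")
```

By the usual rule of thumb, an absolute d below about 0.2 is a small effect. Calling `cohens_d(no_matches_data[var], men_matches_data[var])` inside the loop above would put each p-value in context.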

Men usually think that being rich is a huge advantage in dating. They also believe women have many more 'opportunities' when it comes to dating. Let's assess these claims:

In [None]:
import plotly.express as px

fig = px.scatter(data, x='Matches', y='Income', title='Matches vs Income')

fig.show()

In [None]:
import plotly.graph_objects as go
import plotly.subplots as sp

male_data = data[data['Gender'] == 0]
female_data = data[data['Gender'] == 1]

fig = sp.make_subplots(rows=1, cols=2, subplot_titles=('Male', 'Female'))

fig.add_trace(
    go.Scatter(x=male_data['Matches'], y=male_data['Income'], mode='markers', name='Male'),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=female_data['Matches'], y=female_data['Income'], mode='markers', name='Female'),
    row=1, col=2
)

fig.update_layout(
    title_text='Matches vs Income by Gender',
    width=1000, height=500
)

fig.update_xaxes(title_text='Matches', row=1, col=1)
fig.update_yaxes(title_text='Income', row=1, col=1)

fig.update_xaxes(title_text='Matches', row=1, col=2)
fig.update_yaxes(title_text='Income', row=1, col=2)

# Show the plot
fig.show()


Well, income doesn't seem to bring men more matches. At least, we can't say it does based on this data, which offers no insight on the subject.

The weirdest part is that every man has either 0 or 70 matches. I believe there was an error when the data was collected.

We'll continue our study, but we have enough evidence not to trust this data.

Hey, did you see? This is what I told you about before: this is what we do with EDA! Finding these inconsistencies is a big part of it!

In [None]:
data.query('Gender==0').Matches.value_counts(normalize=True)

Now, let's compare men's and women's matches:

In [None]:
data.groupby('Gender')['Matches'].mean()

In [None]:
sns.boxplot(x='Gender', y='Matches', data=data)

Well, no need to do any statistical tests here, right?

What about kids? Do they scare possible partners?

In [None]:
_ = sns.boxplot(x='Children', y='Matches', data=data)



The distributions look different for 2 or more children, but it's not clear whether the difference in mean matches is significant. Let's check with a one-way ANOVA:

In [None]:
model = ols('Matches ~ C(Children)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
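For intuition about the `F` value in that table, here is a hand computation of the one-way ANOVA statistic: the between-group mean square divided by the within-group mean square. It is a sketch on three made-up groups, and `one_way_f` is a hypothetical helper. For a single categorical factor it should agree with what `anova_lm` reports.

```python
import numpy as np

def one_way_f(groups):
    """F statistic of a one-way ANOVA computed from a list of samples."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up groups, e.g. matches for users with 0, 1 and 2 children.
toy_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat = one_way_f(toy_groups)
print(f_stat)
```

An F close to 1 means the group means differ about as much as chance alone would produce, and larger values point to real differences between groups.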

Wow, not really! That is unexpected. People usually say a lot of things about single people with kids, but it looks like it's not a problem!

Since the male data seems to have a problem, let's see what the data says about women:

In [None]:
women = data.query("Gender==1")
model = ols('Matches ~ C(Children)', data=women).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

Well, still not different. So having a kid doesn't matter that much in the dating world!

# PREDICTING MATCHES

Can we build a reasonable predictive model?

In [None]:
data.columns

In [None]:
features

In [None]:
target

In [None]:
X = data[features]
y = data[target]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

regressor = RandomForestRegressor(random_state=42)  # fixed seed so results are reproducible
regressor.fit(X_train_scaled, y_train)


In [None]:
y_pred = regressor.predict(X_test_scaled)

r_squared = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
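# MAPE divides by the true value, so it is undefined for users with 0 matches;
# compute it on the nonzero targets only
nonzero = (y_test != 0).to_numpy()
mape = np.mean(np.abs((y_test[nonzero] - y_pred[nonzero]) / y_test[nonzero])) * 100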

# Print the evaluation metrics
print(f'R-squared: {r_squared:.2f}')
print(f'RMSE: {rmse:.2f}')
print(f'MAE: {mae:.2f}')
print(f'MAPE: {mape:.2f}%')
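A word of caution on MAPE: it divides by the true value, and our target contains zeros (the users without matches), so a plain average of `abs((y - y_hat) / y)` is not a finite number. The sketch below shows the effect on made-up arrays, along with one common workaround, which is to average over nonzero targets only. The helper name `mape_nonzero` is hypothetical.

```python
import numpy as np

# Made-up true values and predictions, including zeros in the target.
y_true = np.array([0.0, 70.0, 100.0, 0.0])
y_hat = np.array([0.0, 63.0, 110.0, 5.0])

# Naive MAPE: 0/0 gives NaN and 5/0 gives inf, which poisons the mean.
with np.errstate(divide="ignore", invalid="ignore"):
    naive = np.mean(np.abs((y_true - y_hat) / y_true)) * 100

def mape_nonzero(y_true, y_hat):
    """MAPE in percent, restricted to rows where the true value is not zero."""
    mask = y_true != 0
    return np.mean(np.abs((y_true[mask] - y_hat[mask]) / y_true[mask])) * 100

print(naive)
print(mape_nonzero(y_true, y_hat))
```

The restricted version ignores every error made on zero-match users, so it is worth reading it alongside MAE and RMSE, which have no such blind spot.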

In [None]:
_ = plt.scatter(y_test, y_pred)

In [None]:
feature_importances = regressor.feature_importances_

features = X.columns
importances_df = pd.DataFrame({'Feature': features, 'Importance': feature_importances})
importances_df = importances_df.sort_values(by='Importance', ascending=False)

plt.figure(figsize=(12, 8))
plt.barh(importances_df['Feature'], importances_df['Importance'])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importances from Random Forest Regressor')
plt.gca().invert_yaxis()
plt.show()


Well, it seems to be a perfect model, as if we had everything we need to know the number of matches. As I told you before, I believe there is a mistake in this dataset.



# POST-MODEL EVALUATION

As mentioned before, I found this model pretty weird and atypical. According to our feature importances, only three variables matter for the prediction. I want to assess a few things before ending the analysis.

In [None]:
data.head()

In [None]:
data.Matches.value_counts(normalize=True)

Is there any woman with the same number of matches as a man?

In [None]:
male_data = data.query("Gender==0")
male_data.Matches.value_counts(normalize=True)

In [None]:
female_data = data.query("Gender==1")
female_data.Matches.value_counts(normalize=True)

Well, actually there are some women with 70 matches, just like some of the men. Gender makes the prediction easier, since men always have either 0 or 70 matches, but it doesn't explain everything. Let's check the PurchasedVIP variable for each gender.

In [None]:
pd.crosstab(male_data.PurchasedVIP, male_data.Matches)

So if I know someone is male, I just need to know whether he has purchased VIP to find his Matches! That's why it's so easy to predict men's matches!

In [None]:
pd.crosstab(female_data.PurchasedVIP, female_data.Matches)

For women, knowing whether she has purchased VIP is not enough. So we'll look at the third most important feature according to our previous model: Attractiveness!

In [None]:
pd.crosstab(female_data.Attractiveness, female_data.Matches)

Well, that's it! By knowing a woman's attractiveness, you know her number of matches! That's why we found a perfect model: the variables follow a specific pattern directly related to the number of matches.

Well, hope you enjoyed this investigation! Cheers!