# First data science project - data science gym

## Introduction

**Project description**

The fitness center network, 'Bodybuilder-Data Scientist,' is working on a strategy to engage users based on analytical data. One of the most common problems facing fitness clubs and similar services is customer churn. It's not always clear when a user has stopped using the service, as they may not always leave in an obvious way.

For a fitness center, a client is considered to have churned if they have not visited the gym once in the last month. They may have gone on vacation and will resume visits when they get back, but more often they will not. If a client starts going to the gym and then suddenly stops, they are unlikely to return.

Your task is to analyze the data and develop an action plan to retain customers.

Specifically, the objectives are to:

1. Learn how to predict the probability of customer churn (for the following month) for each client.
2. Create typical user profiles: identify several of the most prominent groups and characterize their key attributes.
3. Analyze the main features that have the greatest impact on churn.
4. Formulate key conclusions and develop recommendations to improve customer relationship management, including:
   1. Identifying target customer groups;
   2. Proposing measures to reduce churn;
   3. Determining other nuances of customer interactions.

**Data description**

The dataset `gym_churn.csv` contains customer data for the month before the churn check, together with the churn flag for the checked month. It includes the following fields:

1. `Churn` - whether the customer churned in the current month.

The following fields contain user data for the month prior to the churn check:

2. `Gender` - the gender of the customer.
3. `Near_Location` - whether the customer lives or works in the area where the fitness center is located.
4. `Partner` - indicating whether the customer is an employee of a club partner company, in which case the fitness center stores information about the customer's employer.
5. `Promo_friends` - indicating whether the customer registered under the "bring a friend" promotion, using a promo code from an acquaintance when paying for the first subscription.
6. `Phone` - indicating whether the customer provided a contact phone number.
7. `Age` - the age of the customer.
8. `Lifetime` - the time since the customer's first visit to the fitness center (in months).

The dataset also includes information based on the client's visit log, purchases, and current subscription status, such as:

9. `Contract_period` - the duration of the customer's current active subscription, which can be a month, 3 months, 6 months, or a year.
10. `Month_to_end_contract` - the time until the end of the customer's current active subscription (in months).
11. `Group_visits` - indicating whether the customer attends group classes.
12. `Avg_class_frequency_total` - the average frequency of visits per week for the entire duration of the subscription.
13. `Avg_class_frequency_current_month` - the average frequency of visits per week for the previous month.
14. `Avg_additional_charges_total` - the total revenue from other fitness center services, such as the cafe, sporting goods, and the beauty and massage salon.

**Project plan**

1. Load the data.

2. Conduct an exploratory data analysis (EDA):
   1. Check the dataset for missing values and study the mean values and standard deviations.
   2. Analyze the mean values of features in two groups: those who churned and those who stayed.
   3. Plot bar charts and histograms of the features for those who churned and those who stayed.
   4. Create a correlation matrix and display it.
<br><br>

3. Develop a customer churn prediction model.

4. Build a binary classification model for customers, where the target feature is customer churn in the next month:
   1. Split the data into training and validation sets.
   2. Train the model in two ways:
      1. Logistic regression.
      2. Random forest.
   3. Evaluate the accuracy, precision, and recall metrics for both models. Compare the models based on the metrics. Which model performed better based on the metrics?
<br><br>

5. Perform customer clustering:
   1. Standardize the data.
   2. Build a distance matrix on the standardized feature matrix and draw a dendrogram.
   3. Train the clustering model based on the K-Means algorithm and predict customer clusters. Let's agree to use `n = 5` as the number of clusters.
   4. Examine the mean values of features for the clusters. Can any immediate observations be made?
   5. Construct feature distributions for the clusters. Can any observations be made from them?
   6. Calculate the churn rate for each obtained cluster. Are they different in terms of churn rate? Which clusters are prone to churn, and which are reliable?
<br><br>

6. Formulate conclusions and make basic recommendations for working with customers.
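
Steps 4 and 5 of the plan can be sketched ahead of time on synthetic data. This is a minimal illustration, not the project's final code: the data comes from `make_classification` rather than `gym_churn.csv`, and the split ratio, `random_state` values, and hyperparameters are assumptions chosen for the example.

```python
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the gym data (assumption: 13 features, binary target)
X, y = make_classification(n_samples=1000, n_features=13, random_state=0)

# Step 4.1: split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Standardize using statistics from the training set only
scaler = StandardScaler().fit(X_train)
X_train_st = scaler.transform(X_train)
X_val_st = scaler.transform(X_val)

# Step 4.2: train both models
models = {
    'logistic_regression': LogisticRegression(max_iter=1000, random_state=0),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

# Step 4.3: compare accuracy, precision, and recall on the validation set
metrics = {}
for name, model in models.items():
    model.fit(X_train_st, y_train)
    y_pred = model.predict(X_val_st)
    metrics[name] = {
        'accuracy': accuracy_score(y_val, y_pred),
        'precision': precision_score(y_val, y_pred),
        'recall': recall_score(y_val, y_pred),
    }

# Step 5: standardize all features, build the linkage matrix for a dendrogram,
# and assign clusters with K-Means (n = 5, as agreed in the plan)
X_st = StandardScaler().fit_transform(X)
linked = linkage(X_st, method='ward')
cluster_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_st)
```

Passing `linked` to `scipy.cluster.hierarchy.dendrogram` would draw the dendrogram, and `metrics` can be wrapped in `pd.DataFrame(metrics)` for a side-by-side comparison of the two models.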

In [2]:
import pandas as pd
import numpy as np
import sklearn as sk
import plotly.graph_objects as go
import plotly.subplots as sp
from IPython.display import display

# Save raw dataset in case we need it later
raw_gym = pd.read_csv('gym_churn.csv')

FIG_WIDTH = 8
FIG_HEIGHT = 5


# Data preprocessing

This section is short: we only need to convert the column names to lowercase so that they follow `snake_case`.

In [3]:
df_gym = (
    raw_gym
    .copy()
    .rename(columns=str.lower)  # lowercase every column name
)


# Exploratory data analysis

We need to complete several steps:

1. Check the dataset for missing values and study the mean values and standard deviations.
2. Analyze the mean values of features in two groups: those who churned and those who stayed.
3. Plot bar charts and histograms of the features for those who churned and those who stayed.
4. Create a correlation matrix and display it.

Let's start!

In [4]:
display(df_gym.isna().sum())


gender                               0
near_location                        0
partner                              0
promo_friends                        0
phone                                0
contract_period                      0
group_visits                         0
age                                  0
avg_additional_charges_total         0
month_to_end_contract                0
lifetime                             0
avg_class_frequency_total            0
avg_class_frequency_current_month    0
churn                                0
dtype: int64

Looks like we don't have any missing values. Let's check the `mean` and `std` of each feature.

In [5]:
display(
    df_gym
    .describe()
    .round(2)
    .T
    [['min', 'mean', 'std', 'max']]
)


| feature | min | mean | std | max |
| --- | --- | --- | --- | --- |
| gender | 0.0 | 0.51 | 0.5 | 1.0 |
| near_location | 0.0 | 0.85 | 0.36 | 1.0 |
| partner | 0.0 | 0.49 | 0.5 | 1.0 |
| promo_friends | 0.0 | 0.31 | 0.46 | 1.0 |
| phone | 0.0 | 0.9 | 0.3 | 1.0 |
| contract_period | 1.0 | 4.68 | 4.55 | 12.0 |
| group_visits | 0.0 | 0.41 | 0.49 | 1.0 |
| age | 18.0 | 29.18 | 3.26 | 41.0 |
| avg_additional_charges_total | 0.15 | 146.94 | 96.36 | 552.59 |
| month_to_end_contract | 1.0 | 4.32 | 4.19 | 12.0 |


Many features in this dataset are binary: their `min` is 0 and their `max` is 1. Let's compare the `mean` value of each feature for customers who churned and customers who stayed.
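
The `min`/`max` check can also be done programmatically. The sketch below flags a column as binary when all of its non-missing values are 0 or 1, which also rules out a continuous column that merely ranges from 0 to 1. The toy frame stands in for `df_gym`, and the helper name `find_binary_columns` is made up for this example.

```python
import pandas as pd


def find_binary_columns(df: pd.DataFrame) -> list:
    """Return the columns whose non-missing values are all 0 or 1."""
    return [
        column for column in df.columns
        if df[column].dropna().isin([0, 1]).all()
    ]


# Toy data standing in for df_gym (values are illustrative only)
toy = pd.DataFrame({
    'gender': [0, 1, 1, 0],
    'age': [29, 31, 26, 33],
    'contract_period': [1, 12, 6, 1],
    'churn': [0, 0, 1, 1],
})

binary_columns = find_binary_columns(toy)
print(binary_columns)
```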

In [6]:
display(
    pd.pivot_table(
        data=df_gym,
        index='churn',
        aggfunc='mean'
    )
    .round(2)
    .T
)


| feature | churn = 0 | churn = 1 |
| --- | --- | --- |
| age | 29.98 | 26.99 |
| avg_additional_charges_total | 158.45 | 115.08 |
| avg_class_frequency_current_month | 2.03 | 1.04 |
| avg_class_frequency_total | 2.02 | 1.47 |
| contract_period | 5.75 | 1.73 |
| gender | 0.51 | 0.51 |
| group_visits | 0.46 | 0.27 |
| lifetime | 4.71 | 0.99 |
| month_to_end_contract | 5.28 | 1.66 |
| near_location | 0.87 | 0.77 |


On average, the customers who stayed and the customers who left look quite different. A couple of interesting examples:

1. `avg_class_frequency_total`, `avg_class_frequency_current_month`, `promo_friends`, and `group_visits` are lower for people who left than for those who stayed.
2. The `gender` split is essentially identical in both groups.
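
To make this comparison less dependent on eyeballing the table, one option is to rank features by the relative gap between the two group means. The sketch uses a toy frame in place of `df_gym`; dividing by the mean of the group that stayed is an assumption about how to normalize, and it breaks down for features whose mean in that group is zero.

```python
import pandas as pd

# Toy data standing in for df_gym (values are illustrative only)
toy = pd.DataFrame({
    'churn': [0, 0, 0, 1, 1, 1],
    'lifetime': [5, 4, 6, 1, 1, 1],
    'age': [30, 31, 29, 27, 26, 28],
    'gender': [0, 1, 1, 1, 0, 1],
})

# Mean of every feature within each churn group (rows: features, columns: 0 and 1)
group_means = toy.groupby('churn').mean().T

# Relative gap: how far the churned mean is from the stayed mean
relative_gap = (group_means[1] - group_means[0]) / group_means[0]

# Largest absolute gaps first
ranking = relative_gap.abs().sort_values(ascending=False)
print(ranking.round(2))
```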

Let's look at the distributions of the numerical features.

In [12]:
col_numerical = [
    'contract_period', 'age', 'avg_additional_charges_total', 'month_to_end_contract',
    'lifetime', 'avg_class_frequency_total', 'avg_class_frequency_current_month'
]

fig = sp.make_subplots(
    rows=4, cols=2,
    subplot_titles=col_numerical,
    horizontal_spacing=0.1, vertical_spacing=0.07
)

for counter, column in enumerate(col_numerical, start=1):
    row = (counter - 1) // 2 + 1
    col = (counter - 1) % 2 + 1
    fig.add_trace(
        go.Histogram(x=df_gym[column], nbinsx=50, name=column, showlegend=False),
        row=row, col=col
    )

fig.update_layout(
    title='Distribution of numerical features',
    width=FIG_WIDTH * 100,
    height=FIG_HEIGHT * 100 * 2,
    template='plotly_white'
)
fig.show()


Let's look at the distribution of each binary feature.

In [11]:
col_binary = [
    'churn', 'gender', 'group_visits', 'near_location',
    'partner', 'phone', 'promo_friends'
]

fig = sp.make_subplots(
    rows=4, cols=2,
    subplot_titles=col_binary,
    horizontal_spacing=0.1, vertical_spacing=0.07
)

for counter, column in enumerate(col_binary, start=1):
    row = (counter - 1) // 2 + 1
    col = (counter - 1) % 2 + 1
    counts = df_gym[column].value_counts()  # compute the counts once per feature
    fig.add_trace(
        go.Bar(
            x=counts.index,
            y=counts.values,
            name=column, showlegend=False
        ),
        row=row, col=col
    )
    fig.update_xaxes(row=row, col=col, tickmode='array', tickvals=[0, 1])

fig.update_layout(
    title='Distribution of binary features',
    width=FIG_WIDTH * 100,
    height=FIG_HEIGHT * 100 * 2,
    xaxis=dict(tickmode='array', tickvals=[0, 1]),
    template='plotly_white'
)
fig.show()
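
Counts alone do not show how each binary feature relates to churn. A possible follow-up is the churn rate per feature value, sketched here on a toy frame that stands in for `df_gym`; the helper name `churn_rate_by_value` is hypothetical.

```python
import pandas as pd


def churn_rate_by_value(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return the share of churned customers for value 0 and value 1 of each column."""
    rates = {column: df.groupby(column)['churn'].mean() for column in columns}
    # Rows: features, columns: feature value (0 or 1)
    return pd.DataFrame(rates).T


# Toy data standing in for df_gym (values are illustrative only)
toy = pd.DataFrame({
    'group_visits': [0, 0, 0, 0, 1, 1, 1, 1],
    'phone': [1, 1, 0, 1, 1, 0, 1, 1],
    'churn': [1, 1, 1, 0, 0, 0, 0, 1],
})

rates = churn_rate_by_value(toy, ['group_visits', 'phone'])
print(rates)
```

A feature whose two columns differ strongly (here `group_visits`) separates churned customers from those who stayed, while near-equal columns (here `phone`) suggest little relationship with churn.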
