# First data science project - data science gym

## Introduction

**Project description**

The fitness center network, 'Bodybuilder-Data Scientist,' is working on a strategy to engage users based on analytical data. One of the most common problems facing fitness clubs and similar services is customer churn. It's not always clear when a user has stopped using the service, as they may not always leave in an obvious way.

For a fitness center, a client is considered to have churned if they haven't visited the gym at least once in the last month. While it's possible that they went on vacation and will come back to the gym upon their return, it's more likely that they won't. If a client starts going to the gym but then suddenly stops, they are unlikely to return.
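
The churn rule above can be written down directly. The sketch below is only an illustration: it assumes that "the last month" means a 30-day window, and `is_churned` is a hypothetical helper. The project dataset already ships with a precomputed `Churn` column, so nothing like this is needed for the analysis itself.

```python
from datetime import date, timedelta

# Assumption: "the last month" is approximated by a 30-day window.
CHURN_WINDOW = timedelta(days=30)


def is_churned(visit_dates, check_date, window=CHURN_WINDOW):
    """Return True if there is no visit within `window` before `check_date`."""
    window_start = check_date - window
    return not any(window_start < visit <= check_date for visit in visit_dates)


check = date(2024, 3, 31)
active_client = [date(2024, 1, 10), date(2024, 3, 20)]
lapsed_client = [date(2024, 1, 10), date(2024, 2, 15)]

print(is_churned(active_client, check))  # False
print(is_churned(lapsed_client, check))  # True
```

A client with no visits at all is also reported as churned, because no visit falls inside the window.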

Your task is to analyze the data and develop an action plan to retain customers.

Specifically, the objectives are to:

1. Learn how to predict the probability of customer churn (for the following month) for each client.
2. Create typical user profiles: identify several of the most prominent groups and characterize their key attributes.
3. Analyze the main features that have the greatest impact on churn.
4. Formulate key conclusions and develop recommendations to improve customer relationship management, including:
   1. Identifying target customer groups;
   2. Proposing measures to reduce churn;
   3. Determining other nuances of customer interactions.

**Data description**

We have a dataset `gym_churn.csv` with customer data for the month before the churn check and a flag showing whether the customer churned in the checked month. The dataset includes the following fields:

1. `Churn` - indicating whether the customer churned in the current month.

The current fields in the dataset contain user data for the month prior to the churn check, such as:

2. `Gender` - the gender of the customer.
3. `Near_Location` - whether the customer lives or works in the area where the fitness center is located.
4. `Partner` - indicating whether the customer is an employee of a club partner company, in which case the fitness center stores information about the customer's employer.
5. `Promo_friends` - indicating whether the customer registered under the "bring a friend" promotion, using a promo code from an acquaintance when paying for the first subscription.
6. `Phone` - indicating whether the customer provided a contact phone number.
7. `Age` - the age of the customer.
8. `Lifetime` - the time since the customer's first visit to the fitness center (in months).

The dataset also includes information based on the client's visit log, purchases, and current subscription status, such as:

9. `Contract_period` - the duration of the customer's current active subscription, which can be a month, 3 months, 6 months, or a year.
10. `Month_to_end_contract` - the time until the end of the customer's current active subscription (in months).
11. `Group_visits` - indicating whether the customer attends group classes.
12. `Avg_class_frequency_total` - the average frequency of visits per week for the entire duration of the subscription.
13. `Avg_class_frequency_current_month` - the average frequency of visits per week for the previous month.
14. `Avg_additional_charges_total` - the total revenue from other fitness center services, such as cafes, sports goods, beauty, and massage salon.

**Project plan**

1. Load the data.

2. Conduct an exploratory data analysis (EDA):
   1. Examine the dataset for missing values and study the mean values and standard deviations.
   2. Analyze the mean values of features in two groups: those who churned and those who stayed.
   3. Construct bar charts and histograms of feature distributions for those who churned and those who stayed.
   4. Create a correlation matrix and display it.
<br><br>

3. Build a binary classification model for customers, where the target feature is customer churn in the next month:
   1. Split the data into training and validation sets.
   2. Train the model in two ways:
      1. Logistic regression.
      2. Random forest.
   3. Evaluate the accuracy, precision, and recall metrics for both models. Compare the models based on the metrics. Which model performed better based on the metrics?
<br><br>

4. Perform customer clustering:
   1. Standardize the data.
   2. Build a distance matrix on the standardized feature matrix and draw a dendrogram.
   3. Train the clustering model based on the K-Means algorithm and predict customer clusters. Let's agree to use `n = 5` as the number of clusters.
   4. Examine the mean values of features for the clusters. Can any immediate observations be made?
   5. Construct feature distributions for the clusters. Can any observations be made from them?
   6. Calculate the churn rate for each obtained cluster. Are they different in terms of churn rate? Which clusters are prone to churn, and which are reliable?
<br><br>

5. Formulate conclusions and make basic recommendations for working with customers.

In [1]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.figure_factory as ff
import plotly.subplots as sp
from IPython.display import display

from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, silhouette_score
from sklearn.cluster import KMeans

# Save raw dataset in case we need it later
raw_gym = pd.read_csv('gym_churn.csv')

FIG_WIDTH = 8
FIG_HEIGHT = 5


# Data preconditioning

This section will be short, as we only need to convert the column names to lowercase so that they follow `snake_case`.

In [2]:
df_gym = (
    raw_gym
    .copy()
    # Lowercase every column name, e.g. 'Near_Location' -> 'near_location'
    .rename(columns=str.lower)
)


# Exploratory data analysis

We need to complete several steps:

1. Examine the dataset for missing values and study the mean values and standard deviations.
2. Analyze the mean values of features in two groups: those who churned and those who stayed.
3. Construct bar charts and histograms of feature distributions for those who churned and those who stayed.
4. Create a correlation matrix and display it.

Let's start!

In [3]:
display(df_gym.isna().sum())


gender                               0
near_location                        0
partner                              0
promo_friends                        0
phone                                0
contract_period                      0
group_visits                         0
age                                  0
avg_additional_charges_total         0
month_to_end_contract                0
lifetime                             0
avg_class_frequency_total            0
avg_class_frequency_current_month    0
churn                                0
dtype: int64

It looks like we don't have any missing values. Let's check the `mean` and `std` of each feature.

In [4]:
display(
    df_gym
    .describe()
    .round(2)
    .T
    [['min', 'mean', 'std', 'max']]
)


| feature | min | mean | std | max |
|---|---|---|---|---|
| gender | 0.0 | 0.51 | 0.5 | 1.0 |
| near_location | 0.0 | 0.85 | 0.36 | 1.0 |
| partner | 0.0 | 0.49 | 0.5 | 1.0 |
| promo_friends | 0.0 | 0.31 | 0.46 | 1.0 |
| phone | 0.0 | 0.9 | 0.3 | 1.0 |
| contract_period | 1.0 | 4.68 | 4.55 | 12.0 |
| group_visits | 0.0 | 0.41 | 0.49 | 1.0 |
| age | 18.0 | 29.18 | 3.26 | 41.0 |
| avg_additional_charges_total | 0.15 | 146.94 | 96.36 | 552.59 |
| month_to_end_contract | 1.0 | 4.32 | 4.19 | 12.0 |


There are many binary features in this dataset: they are the ones with `min = 0` and `max = 1` in the table above. Let's look at the `mean` value of each feature depending on churn.
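
Checking `min` and `max` alone could mistake a continuous feature bounded by 0 and 1 for a binary one, so a stricter test looks at the set of distinct values. The sketch below uses plain Python and made-up toy values; `find_binary_columns` is a hypothetical helper, not part of the notebook.

```python
def find_binary_columns(columns):
    """Return names of columns whose distinct values are a subset of {0, 1}."""
    return [
        name for name, values in columns.items()
        if set(values) <= {0, 1}
    ]


# Toy data reusing column names from the dataset; the values are made up.
sample = {
    'gender': [0, 1, 1, 0],
    'phone': [1, 1, 1, 1],
    'age': [29, 31, 26, 33],
    'contract_period': [1, 12, 6, 1],
}

print(find_binary_columns(sample))  # ['gender', 'phone']
```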

In [5]:
display(
    pd.pivot_table(
        data=df_gym,
        index='churn',
        aggfunc='mean'
    )
    .round(2)
    .T
)


| feature | churn = 0 | churn = 1 |
|---|---|---|
| age | 29.98 | 26.99 |
| avg_additional_charges_total | 158.45 | 115.08 |
| avg_class_frequency_current_month | 2.03 | 1.04 |
| avg_class_frequency_total | 2.02 | 1.47 |
| contract_period | 5.75 | 1.73 |
| gender | 0.51 | 0.51 |
| group_visits | 0.46 | 0.27 |
| lifetime | 4.71 | 0.99 |
| month_to_end_contract | 5.28 | 1.66 |
| near_location | 0.87 | 0.77 |


On average, customers who stayed differ noticeably from those who left. A couple of interesting examples:

1. `avg_class_frequency_total`, `avg_class_frequency_current_month`, `promo_friends`, and `group_visits` are lower for people who left than for those who stayed.
2. The `gender` split seems to be identical in both groups.

Let's see distributions for numerical features.

In [6]:
col_numerical = [
    'contract_period', 'age', 'avg_additional_charges_total', 'month_to_end_contract',
     'lifetime', 'avg_class_frequency_total', 'avg_class_frequency_current_month'
]

fig = sp.make_subplots(
    rows=4, cols=2,
    subplot_titles=col_numerical,
    horizontal_spacing=0.2, vertical_spacing=0.1
)

for counter, column in enumerate(col_numerical, start=1):
    row = (counter - 1) // 2 + 1
    col = (counter - 1) % 2 + 1
    fig.add_trace(
        go.Histogram(
            x=df_gym[df_gym.churn == 0][column], nbinsx=50,
            marker=dict(color='green'), name='stayed', legendgroup='stayed',
            showlegend=(row == 1 and col == 1)
        ),
        row=row, col=col
    )
    fig.add_trace(
        go.Histogram(
            x=df_gym[df_gym.churn == 1][column], nbinsx=50,
            marker=dict(color='red'), name='churn', legendgroup='churn',
            showlegend=(row == 1 and col == 1)
        ),
        row=row, col=col
    )
    fig.update_xaxes(row=row, col=col, title_text='Values')
    fig.update_yaxes(row=row, col=col, title_text='Count')

fig.update_layout(
    title='Distribution of numerical features',
    width=FIG_WIDTH * 100,
    height=FIG_HEIGHT * 100 * 2,
    template='plotly_white',
    # legend=dict(orientation='v')
)
fig.show()


Let's look at the distributions of the binary features.

In [7]:
col_binary = [
    'gender', 'group_visits', 'near_location',
    'partner', 'phone', 'promo_friends'
]

fig = sp.make_subplots(
    rows=3, cols=2,
    subplot_titles=col_binary,
    horizontal_spacing=0.1, vertical_spacing=0.07
)

for counter, column in enumerate(col_binary, start=1):
    row = (counter - 1) // 2 + 1
    col = (counter - 1) % 2 + 1
    fig.add_trace(
        go.Bar(
            x=df_gym[df_gym.churn == 0][column].value_counts().index,
            y=df_gym[df_gym.churn == 0][column].value_counts().values,
            marker=dict(color='green'), name='stayed', legendgroup='stayed',
            showlegend=(row == 1 and col == 1)
        ),
        row=row, col=col
    )
    fig.add_trace(
        go.Bar(
            x=df_gym[df_gym.churn == 1][column].value_counts().index,
            y=df_gym[df_gym.churn == 1][column].value_counts().values,
            marker=dict(color='red'), name='churn', legendgroup='churn',
            showlegend=(row == 1 and col == 1)
        ),
        row=row, col=col
    )
    fig.update_xaxes(row=row, col=col, tickmode='array', tickvals=[0, 1])
    fig.update_yaxes(row=row, col=col, title_text='Count')

fig.update_layout(
    title='Distribution of binary features',
    width=FIG_WIDTH * 100,
    height=FIG_HEIGHT * 100 * 2,
    template='plotly_white'
)
fig.show()


Based on the visuals and the group means above, we can make the following observations. Because the retained group is much larger than the churned group, bar heights are not directly comparable between groups, so the comparison relies on shares within each group.

1. Gender: The gender split is almost identical among churned and retained customers (a mean of 0.51 in both groups), so gender does not appear to be related to churn.

2. Group visits: Retained customers attend group classes more often. The share is 0.46 among those who stayed versus 0.27 among those who churned, so group visits are associated with better retention.

3. Near location: Most customers in both groups live or work near the gym, but the share is higher among retained customers (0.87) than among churned ones (0.77). Proximity is associated with retention, although the gap is moderate.

4. Partner: Employees of partner companies make up a larger share of retained customers than of churned ones. This suggests that the partner program is associated with higher loyalty.

5. Phone: Most customers in both groups provided a phone number (2656 retained and 958 churned, about 90% of all customers). This feature is unlikely to be informative for predicting churn, as there is no substantial difference between the two groups.

6. Promo friends: Customers who joined through the "bring a friend" promotion are better represented among retained customers than among churned ones. This suggests that the promotion is associated with retention.

In summary, group visits, near location, the partner program, and promo friends appear to be related to customer retention. Gender and phone number are unlikely to be informative for predicting churn.
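
When reading bar charts like these, raw counts depend on how large each group is, so the share within each group is the safer comparison. The toy example below uses made-up numbers (not values from `gym_churn.csv`) and a hypothetical `share_with_feature` helper to show how counts and shares can point in opposite directions.

```python
def share_with_feature(flags):
    """Share of customers in a group for whom a binary feature equals 1."""
    return sum(flags) / len(flags)


# Made-up toy groups: 10 retained customers and 4 churned customers.
stayed_flags = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # 5 of 10 have the feature
churned_flags = [1, 1, 1, 0]                    # 3 of 4 have the feature

print(sum(stayed_flags), sum(churned_flags))    # 5 3  (raw counts)
print(share_with_feature(stayed_flags))         # 0.5
print(share_with_feature(churned_flags))        # 0.75
```

Here the retained group has the larger count but the smaller share. With pandas, `df_gym.groupby('churn').mean()` gives the same within-group shares for binary columns.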

Now let's have a look at the correlation matrix.

In [8]:
# Compute the correlation matrix once and reuse it
corr = df_gym.corr()

fig = go.Figure(
    data=go.Heatmap(
        z=corr.values,
        x=corr.columns.values,
        y=corr.index.values,
        text=corr.values.round(2),
        texttemplate="%{text}",
        showscale=False
    )
)

fig.update_layout(
    title='Correlation matrix',
    width=FIG_WIDTH * 100,
    height=FIG_HEIGHT * 100 * 1.5,
    template='plotly_white'
)
fig.show()


Based on the correlation matrix, we can make the following observations and conclusions:

1. The `churn` (target variable) is negatively correlated with several features like `contract_period`, `age`, `lifetime`, and `avg_class_frequency_total`. This means that as these variables increase, the churn rate tends to decrease. For example, customers with a longer contract period, older age, longer lifetime with the gym, and higher average class frequency are less likely to churn.

2. There are some moderate positive correlations between certain pairs of independent features, such as `promo_friends` and `partner` (0.452), `contract_period` and `partner` (0.306), and `promo_friends` and `near_location` (0.211). These pairs might carry partly overlapping information, and it might be worth investigating whether reducing the number of features would improve the model's performance.

3. Other correlations in the matrix are relatively weak, indicating that most features provide distinct information that can be used for modeling.

It is important to keep in mind that correlation does not imply causation, and these observations should be used to inform further analysis and modeling decisions.

It's also worth noting that the correlation matrix does not capture non-linear relationships between variables, so other techniques might be needed to uncover more complex relationships.
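
The limitation with non-linear relationships is easy to demonstrate. The sketch below implements the Pearson coefficient in plain Python (the `pearson` helper and the toy data are illustrative, not taken from the notebook) and applies it to a linear and to a quadratic dependency.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]   # perfectly linear in x
squared = [x ** 2 for x in xs]     # fully determined by x, but not linearly

print(pearson(xs, linear))   # 1.0
print(pearson(xs, squared))  # 0.0
```

The second value is zero even though `squared` is completely determined by `xs`, which is exactly the kind of relationship a correlation matrix cannot show.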

# Building a prediction model

Our plan for this section is:

1. Split the data into training and validation sets.
2. Train the model in two ways:
   1. Logistic regression.
   2. Random forest.
3. Evaluate the accuracy, precision, and recall metrics for both models. Compare the models based on the metrics. Which model performed better based on the metrics?

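Based on the previous exercise, we know that some of the features are correlated with each other. In particular, `avg_class_frequency_current_month` and `month_to_end_contract` appear to largely duplicate `avg_class_frequency_total` and `contract_period`, so we drop them. This is feature selection to reduce multicollinearity rather than regularisation, and it mainly helps the linear model.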

In [9]:
df_gym = df_gym.drop(
    ['avg_class_frequency_current_month', 'month_to_end_contract'], axis=1
)

col_numerical = [
    item for item in col_numerical if item not in ['avg_class_frequency_current_month', 'month_to_end_contract']
]


Now let's split the data.

In [10]:
features = df_gym.drop('churn', axis=1)
target = df_gym.churn

features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=0, stratify = target
)

scaler = StandardScaler()
scaler.fit(features_train)

features_train_st = scaler.transform(features_train)
features_test_st = scaler.transform(features_test)


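Now we create the models. In general, it is good practice to standardize input features when working with linear models, like Logistic Regression, while tree-based models, like Random Forest, can handle raw input features without issue. Note that the scaler is fitted on the training set only and then applied to both sets, so no information from the validation set leaks into training.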
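
To make the standardization step concrete, here is a minimal plain-Python sketch of what a scaler does when it is fitted on the training data and then applied to the validation data. The helpers `fit_scaler` and `transform` and the age values are made up for illustration; the notebook itself relies on `StandardScaler`.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and (population) standard deviation from training values."""
    return mean(values), pstdev(values)


def transform(values, center, scale):
    """Apply z-score scaling with previously learned statistics."""
    return [(value - center) / scale for value in values]


# Made-up ages; the split into train and test is illustrative.
age_train = [20, 30, 40]
age_test = [30, 50]

center, scale = fit_scaler(age_train)
train_scaled = transform(age_train, center, scale)
test_scaled = transform(age_test, center, scale)

print(round(mean(train_scaled), 6), round(pstdev(train_scaled), 6))  # 0.0 1.0
print([round(value, 3) for value in test_scaled])  # [0.0, 2.449]
```

The scaled training values have zero mean and unit standard deviation, while the test values are expressed in the training set's units and need not have those properties.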

In [11]:
# Linear model
model_lr = LogisticRegression(random_state=0)
model_lr.fit(features_train_st, target_train)

predictions_lr = model_lr.predict(features_test_st)
probabilities_lr = model_lr.predict_proba(features_test_st)[:, 1]

# Random forest model
model_rf = RandomForestClassifier(n_estimators = 100, random_state = 0)
model_rf.fit(features_train, target_train)

predictions_rf = model_rf.predict(features_test)
probabilities_rf = model_rf.predict_proba(features_test)[:, 1]


Now let's check the metrics.

In [12]:
for model, prediction in zip([model_lr, model_rf], [predictions_lr, predictions_rf]):    
    print(
        type(model).__name__, ' metrics:', '\n',
        'Accuracy: ', round(accuracy_score(target_test, prediction), 2), '\n',
        'Precision: ', round(precision_score(target_test, prediction), 2), '\n',
        'Recall: ', round(recall_score(target_test, prediction), 2), '\n',
        sep=''
    )


LogisticRegression metrics:
Accuracy: 0.91
Precision: 0.83
Recall: 0.83

RandomForestClassifier metrics:
Accuracy: 0.89
Precision: 0.81
Recall: 0.79



Based on the results of the two models, we can draw the following conclusions:

1. The Logistic Regression model performs slightly better than the Random Forest Classifier in terms of accuracy (0.91 vs. 0.89). This indicates that the Logistic Regression model is more successful in predicting the correct outcomes (churn or no churn) for the given dataset.

2. Both models have relatively similar precision scores (0.83 for Logistic Regression and 0.81 for Random Forest Classifier). Precision measures the proportion of true positive predictions (correctly predicted churns) out of all positive predictions made. The small difference in precision between the two models suggests that they are comparably effective in predicting churn without generating too many false positives.

3. The Logistic Regression model has a slightly higher recall score than the Random Forest Classifier (0.83 vs. 0.79). Recall measures the proportion of true positive predictions (correctly predicted churns) out of all actual positive instances (all actual churns). The higher recall score for the Logistic Regression model indicates that it is more successful in identifying the majority of the actual churn cases compared to the Random Forest Classifier.
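
The three metrics can be computed by hand from the four cells of the confusion matrix. The following sketch does this in plain Python on made-up labels; `classification_metrics` is a hypothetical helper that mirrors what `accuracy_score`, `precision_score`, and `recall_score` report for binary labels.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = churned)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churn, predicted churn
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # stayed, predicted churn
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churn, predicted stay
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # stayed, predicted stay
    return {
        'accuracy': (tp + tn) / len(pairs),
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels: 4 actual churners among 10 customers.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

print(classification_metrics(y_true, y_pred))
# {'accuracy': 0.8, 'precision': 0.75, 'recall': 0.75}
```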

In summary, the Logistic Regression model outperforms the Random Forest Classifier in terms of accuracy and recall, and it is comparably effective in terms of precision. Given the results, the Logistic Regression model would be a more suitable choice for predicting member churn in this case.

However, it's important to consider the specific needs and priorities of the gym when choosing a model. On this validation set, Logistic Regression is at least as good as Random Forest on all three metrics, so there is no trade-off between the two models here. If catching as many churners as possible matters more than avoiding false alarms, recall can be raised by lowering the decision threshold applied to the predicted probabilities, at the cost of lower precision.
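
The balance between precision and recall can also be tuned within a single model by moving the decision threshold applied to predicted churn probabilities such as `probabilities_lr`. The sketch below uses made-up probabilities and labels, and the helpers `predict_with_threshold` and `precision_recall` are hypothetical; it compares the usual 0.5 cut-off with a lower one.

```python
def predict_with_threshold(probabilities, threshold):
    """Label a customer as churned (1) when the probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = churned)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_pos = sum(y_pred)
    actual_pos = sum(y_true)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall


# Made-up churn probabilities and true labels for 8 customers.
probabilities = [0.9, 0.8, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1]
y_true = [1, 1, 1, 0, 1, 0, 0, 0]

for threshold in (0.5, 0.3):
    y_pred = predict_with_threshold(probabilities, threshold)
    precision, recall = precision_recall(y_true, y_pred)
    print(threshold, round(precision, 2), round(recall, 2))
# 0.5 1.0 0.75
# 0.3 0.67 1.0
```

Lowering the threshold catches the remaining churner but flags two customers who actually stayed, so recall rises while precision falls.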

# Clustering

Let's identify different clusters of our users.

In [13]:
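# Note: df_gym still contains `churn`, so the target is part of the clustering features
features_st = scaler.fit_transform(df_gym)

# create_dendrogram expects the observations and computes the linkage itself,
# so Ward linkage is passed via `linkagefun`
fig = ff.create_dendrogram(
    features_st,
    linkagefun=lambda dist: linkage(dist, method='ward')
)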

fig.update_layout(
    title='User clusters',
    width=FIG_WIDTH * 100,
    height=FIG_HEIGHT * 100,
    template='plotly_white',
    xaxis=dict(visible=False)
)
fig.show()


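Now let's cluster the users with the `KMeans` algorithm. The project plan suggests `n = 5` clusters; we check this choice with silhouette scores.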

In [14]:
sil_scores = []
max_clusters = 10

for cluster in range(2, max_clusters + 1):
    sil_scores.append(
        silhouette_score(
            features_st, KMeans(n_clusters=cluster, random_state=6).fit_predict(features_st)
        )
    )

optimal_n_clusters = 2 + np.argmax(sil_scores)

fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=list(range(2, max_clusters + 1)),
        y=sil_scores,
        mode='lines+markers'
    )
)

fig.update_layout(
    title='Silhouette method score for KMeans',
    xaxis_title='Number of clusters',
    yaxis_title='Silhouette score',
    width=FIG_WIDTH * 100,
    height=FIG_HEIGHT * 100,
    template='plotly_white'
)
fig.show()


Based on the silhouette scores for different numbers of clusters (`k`) in K-means, we can draw the following conclusions:

1. The highest silhouette score is obtained for `k = 5`, which suggests that dividing the data into 5 clusters results in the best-defined and most separated clusters.

2. The silhouette scores for `k = 4` and `k = 2` are also relatively high, indicating that they might provide reasonably good clustering as well, although not as optimal as `k = 5`.

3. As the number of clusters increases beyond 5, the silhouette scores generally decrease, indicating that the quality of clustering diminishes as we create more clusters.

In conclusion, based on the silhouette scores, it would be best to choose 5 clusters for this dataset using the K-means algorithm, as it results in the highest silhouette score and the most well-defined clusters.
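
For intuition about what the silhouette score measures, here is a small plain-Python version applied to four toy points. It is a sketch of the standard definition (for each point, `a` is the mean distance to its own cluster and `b` the smallest mean distance to another cluster), not the code used above, which relies on `silhouette_score`. The `silhouette` helper and the points are made up, and at least two clusters are assumed.

```python
from math import dist


def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two distinct labels."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other points in the same cluster
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(
            sum(dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = [0, 0, 1, 1]   # matches the two visible groups
bad = [0, 1, 0, 1]    # mixes the groups

print(round(silhouette(points, good), 3))
print(round(silhouette(points, bad), 3))
```

The labelling that matches the two visible groups scores close to 1, while the mixed labelling scores below zero.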

In [15]:
df_gym['user_cluster'] = KMeans(n_clusters=optimal_n_clusters, random_state=6).fit_predict(features_st)

display(
    pd.pivot_table(
        data=df_gym,
        index='user_cluster',
        aggfunc='mean'
    )
    .round(2)
    .T
)


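| feature | cluster 0 | cluster 1 | cluster 2 | cluster 3 | cluster 4 |
|---|---|---|---|---|---|
| age | 29.3 | 28.71 | 29.9 | 26.9 | 30.06 |
| avg_additional_charges_total | 144.21 | 137.15 | 158.46 | 114.54 | 160.93 |
| avg_class_frequency_total | 1.85 | 1.77 | 2.02 | 1.47 | 2.05 |
| churn | 0.27 | 0.4 | 0.01 | 1.0 | 0.0 |
| contract_period | 4.78 | 3.01 | 8.62 | 1.7 | 3.87 |
| gender | 0.52 | 0.5 | 0.5 | 0.51 | 0.52 |
| group_visits | 0.43 | 0.24 | 0.53 | 0.29 | 0.46 |
| lifetime | 3.94 | 3.02 | 4.65 | 0.98 | 4.76 |
| near_location | 0.86 | 0.0 | 0.99 | 1.0 | 1.0 |
| partner | 0.47 | 0.49 | 0.94 | 0.33 | 0.21 |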


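The cluster means show that Clusters 2 and 4 have the most stable memberships: churn is close to zero (0.01 and 0.0), lifetimes are the longest (4.65 and 4.76 months), and class attendance and additional spending are the highest. Cluster 3 consists entirely of churned customers (churn rate 1.0), with the shortest contracts (1.7 months on average), the shortest lifetime (0.98 months), and the lowest attendance. Cluster 1 is made up of customers who do not live or work near the gym (`near_location` = 0.0); it has the lowest participation in group classes (0.24) and a churn rate of 0.4. Cluster 0 sits in between, with a churn rate of 0.27. Note that `churn` was among the standardized features passed to `KMeans`, which likely explains why Clusters 3 and 4 separate so cleanly by churn.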
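
The churn rate per cluster is simply the mean of the `churn` flag within each cluster. The sketch below spells this out in plain Python on made-up labels; `churn_rate_by_cluster` is a hypothetical helper.

```python
from collections import defaultdict


def churn_rate_by_cluster(clusters, churn_flags):
    """Share of churned customers (flag 1) within each cluster label."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for cluster, flag in zip(clusters, churn_flags):
        totals[cluster] += 1
        churned[cluster] += flag
    return {cluster: churned[cluster] / totals[cluster] for cluster in sorted(totals)}


# Made-up cluster labels and churn flags for 8 customers.
clusters = [0, 0, 0, 0, 1, 1, 2, 2]
churn_flags = [1, 0, 0, 0, 1, 1, 0, 0]

print(churn_rate_by_cluster(clusters, churn_flags))
# {0: 0.25, 1: 1.0, 2: 0.0}
```

On the notebook's data frame the same numbers come from `df_gym.groupby('user_cluster')['churn'].mean()`.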

In [16]:
cluster_colors = {0: 'rgb(0, 0, 255)', 1: 'rgb(0, 100, 255)', 2: 'rgb(0, 150, 255)', 3: 'red', 4: 'rgb(0, 200, 255)'}

fig = sp.make_subplots(
    rows=len(col_numerical), cols=1,
    subplot_titles=col_numerical, vertical_spacing=0.05
)

for counter, feature in enumerate(col_numerical, start=1):
    for cluster in np.sort(df_gym.user_cluster.unique())[::-1]:
        fig.add_trace(
            go.Box(
                x=df_gym[df_gym['user_cluster'] == cluster][feature],
                name=f'Cluster {cluster}',
                legendgroup=f'Cluster {cluster}',
                showlegend=False,
                marker=dict(color=cluster_colors[cluster])
            ),
            row=counter, col=1
        )

fig.update_layout(
    width=FIG_WIDTH * 100,
    height=FIG_HEIGHT * 300,
    title_text="Box plots of numerical features by cluster",
    template='plotly_white'
)
fig.show()


In [18]:
fig = sp.make_subplots(
    rows=len(col_binary), cols=1,
    subplot_titles=col_binary, vertical_spacing=0.05
)

for counter, feature in enumerate(col_binary, start=1):
    for cluster in np.sort(df_gym.user_cluster.unique()):
        fig.add_trace(
            go.Histogram(
                x=df_gym[df_gym['user_cluster'] == cluster][feature],
                name=f'Cluster {cluster}',
                legendgroup=f'Cluster {cluster}',
                showlegend=(counter == 1),
                marker=dict(color=cluster_colors[cluster])
            ),
            row=counter, col=1
        )
        fig.update_yaxes(row=counter, col=1, title_text='Count')

fig.update_layout(
    width=FIG_WIDTH * 100,
    height=FIG_HEIGHT * 300,
    title_text="Histograms of binary features by cluster",
    template='plotly_white'
)
fig.show()


Based on the analyses performed, we can make the following key observations and conclusions to help the gym reduce churn:

1. Age: Retained customers tend to be slightly older than churned customers. The gym could target older age groups with tailored marketing campaigns and special offers to attract and retain more customers.

2. Contract period: Longer contract periods are associated with lower churn rates. The gym could encourage customers to sign up for longer contracts by offering discounted rates, additional benefits, or exclusive services.

3. Group visits: Customers who participate in group activities have lower churn rates. The gym could promote group classes and activities, or even introduce new and engaging group sessions to attract and retain customers.

4. Partner and promo_friends: Customers who come through partner companies or via the "bring a friend" promotion have lower churn rates. The gym should strengthen partnerships with local businesses and encourage existing members to bring friends, offering incentives for both the referrer and the new member.

5. Near location: Proximity to the gym plays a role in churn. The gym could focus marketing efforts on nearby neighborhoods and consider opening new branches in areas with high demand.

6. Average additional charges: Customers who spend more on additional services (e.g., the cafe, sports goods, beauty and massage salon) are less likely to churn. The gym could upsell these services, offer package deals, or introduce new, appealing services to retain customers.

7. Lifetime and average class frequency: Retained customers have longer lifetimes and higher average class attendance frequency. The gym should implement strategies to keep customers engaged and motivated, such as loyalty programs, regular check-ins with personal trainers, or offering a variety of classes.

Based on the machine learning models, the Logistic Regression model performed slightly better than the Random Forest Classifier in terms of accuracy, precision, and recall. This model can be used to predict customer churn and identify at-risk customers, allowing the gym to take proactive measures to retain them.
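
One simple way to act on the predicted probabilities is to rank customers by churn risk and contact the top of the list first. The sketch below assumes hypothetical customer ids and made-up probabilities; in the notebook the probabilities would come from `model_lr.predict_proba`. The `top_at_risk` helper is illustrative.

```python
def top_at_risk(customer_ids, churn_probabilities, k):
    """Return the k customers with the highest predicted churn probability."""
    ranked = sorted(
        zip(customer_ids, churn_probabilities),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Hypothetical customer ids and made-up probabilities.
customer_ids = ['c1', 'c2', 'c3', 'c4']
churn_probabilities = [0.12, 0.81, 0.47, 0.66]

print(top_at_risk(customer_ids, churn_probabilities, k=2))
# [('c2', 0.81), ('c4', 0.66)]
```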

In summary, the gym should focus on attracting older customers, promoting longer contracts, encouraging group activities, leveraging partnerships and referrals, targeting local neighborhoods, upselling additional services, and keeping customers engaged to reduce churn.