# Executive Summary

# Input Data and Transformations

### Access data

Import necessary libraries

In [1]:
import pandas as pd

Clone files stored in the git repository

In [None]:
# The access token is read from the GITHUB_TOKEN environment variable (assumed to be set) instead of being hard-coded
!git clone https://pkarczma:$GITHUB_TOKEN@github.com/pkarczma/gym-subscription-predictor.git

Read CSV and JSON files with data

In [3]:
path = 'gym-subscription-predictor/'
df_csv = pd.read_csv(path+'train.csv')
df_json = pd.read_json(path+'train.json')

### Analyse and clear data

Get familiar with the data

In [None]:
df_csv.info()
df_csv.head()

Some columns seem unnecessary for our model, so we will drop them:

In [5]:
df_csv = df_csv.drop(columns=['name', 'location_population', 'location_from_population', 'daily_commute', 'credit_card_type'])

Count the number of missing values in each of the remaining columns:

In [None]:
df_csv.isnull().sum(axis = 0)

Several columns contain NaN values. The appropriate treatment depends on the column, so the following procedure will be applied:
* 'user_id' / 'target' / 'location' / 'occupation' / 'friends_number': no missing values and the columns are useful, so nothing changes
* 'name' / 'location_population' / 'location_from_population' / 'daily_commute' / 'credit_card_type': these columns are not necessary for the model prediction and were already dropped above, so their missing values do not matter
* 'education': fill missing values with the median of the column
* 'hobbies': fill missing values with an empty string

For the remaining columns there is no sensible replacement for a missing value. Thus, rows with at least one missing value will be dropped from the dataset.



In [None]:
df_csv['hobbies'] = df_csv['hobbies'].fillna('')
df_csv['education'] = df_csv['education'].fillna(df_csv['education'].median())
df_csv = df_csv.dropna()
df_csv.info()

As a result, we removed around 25% of all rows, but now the data is clean and ready for the next step.
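The cleaning procedure above can be illustrated on a small frame. This is a minimal sketch: the toy values are hypothetical and only the column names mirror the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the real dataset
toy = pd.DataFrame({
    'education': [3.0, np.nan, 5.0, 4.0],
    'hobbies': ['yoga', None, 'running,yoga', 'chess'],
    'sex': ['F', 'M', None, 'M'],
})

# Text column: a missing value becomes an empty string
toy['hobbies'] = toy['hobbies'].fillna('')
# Numeric column: a missing value becomes the column median (NaN is skipped)
toy['education'] = toy['education'].fillna(toy['education'].median())
# Any row that still has a missing value is dropped
cleaned = toy.dropna()

# Share of rows lost to dropna()
removed_fraction = 1 - len(cleaned) / len(toy)
print(cleaned)
print(f'Removed {removed_fraction:.0%} of rows')
```

Computing `removed_fraction` the same way on `df_csv` before and after `dropna()` is one way to verify the share of removed rows directly.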

### Transform data

In order to prepare the data for the model, we need to convert it to a numeric format. The following code converts the text columns to integer category codes so that it is easier for the model to read them:

In [8]:
df_csv['sex'] = df_csv['sex'].astype('category').cat.codes
df_csv['location'] = df_csv['location'].astype('category').cat.codes
df_csv['location_from'] = df_csv['location_from'].astype('category').cat.codes
df_csv['occupation'] = df_csv['occupation'].astype('category').cat.codes
df_csv['relationship_status'] = df_csv['relationship_status'].astype('category').cat.codes
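To see what `.astype('category').cat.codes` produces, here is a small sketch on hypothetical values. Each distinct string gets an integer code, assigned in the sorted order of the categories, and a missing value would get the code -1.

```python
import pandas as pd

# Hypothetical values for illustration only
status = pd.Series(['single', 'married', 'single', None, 'divorced'])

as_category = status.astype('category')
codes = as_category.cat.codes

# Categories are stored in sorted order, so codes follow alphabetical order
print(list(as_category.cat.categories))  # ['divorced', 'married', 'single']
# A missing value is encoded as -1
print(codes.tolist())                    # [2, 1, 2, -1, 0]
```

Because rows with missing values were dropped earlier, the -1 code should not occur in `df_csv`. Note that the integer codes imply an ordering between categories that has no real meaning, which is a simplification of this encoding.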

For the date of birth, I assume there is no need to keep the exact date - having just a year of birth should be enough for the model. I will drop the day and month information from 'dob' column:

In [9]:
df_csv['dob'] = pd.DatetimeIndex(df_csv['dob']).year
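An equivalent way to extract the year, sketched here on hypothetical date strings, is `pd.to_datetime(...).dt.year`. It returns a Series that keeps the index of the input, which can be convenient after rows have been dropped. The ISO date format used below is an assumption about how 'dob' is stored.

```python
import pandas as pd

# Hypothetical dates of birth stored as ISO-formatted strings
dob = pd.Series(['1990-05-17', '1985-12-01', '2001-01-30'], index=[0, 2, 5])

# Parse to datetimes, then keep only the year component
birth_year = pd.to_datetime(dob).dt.year
print(birth_year.tolist())       # [1990, 1985, 2001]
print(birth_year.index.tolist()) # [0, 2, 5]
```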

For the 'hobbies' column, the best way is to get dummies for each value, splitting it into several columns with the numbers 0 and 1 indicating lack of interest or interest in a particular hobby. An additional 'hobby_' prefix indicates that a column represents a hobby and ensures that none of the new column names overlap with the existing ones.

In [10]:
df_csv = pd.concat([df_csv.drop('hobbies', axis=1), df_csv['hobbies'].str.get_dummies(sep=',').add_prefix('hobby_')], axis=1)
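The effect of `str.get_dummies` is easiest to see on a few hypothetical hobby strings. In this sketch, the empty string (used above to fill missing values) produces a row of zeros.

```python
import pandas as pd

# Hypothetical hobby strings; '' stands for "no hobbies listed"
hobbies = pd.Series(['yoga,running', '', 'running'])

# One 0/1 column per distinct hobby, with a common prefix
dummies = hobbies.str.get_dummies(sep=',').add_prefix('hobby_')
print(dummies)
```

The split is exact, so a value such as `'yoga, running'` (with a space after the comma) would create a separate `'hobby_ running'` column. Whether the real data contains such spaces is not checked here.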

At this point the data contains only numbers, has no missing values, and is prepared for the next step.

In [None]:
df_csv.info()
df_csv.head()

# Model Selection and Training

### Split data

For model training and testing, the dataset will be split into two subsets:
* 80% of the data will be used for training
* the remaining 20% of the data will be used for testing

In [12]:
train_dataset = df_csv.sample(frac=0.8, random_state=0)
test_dataset = df_csv.drop(train_dataset.index)
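The sample-then-drop pattern can be checked on a small hypothetical frame. The sketch below confirms that the two subsets are disjoint and together cover all rows. It relies on the index being unique, which is an assumption about `df_csv`.

```python
import pandas as pd

# Hypothetical frame with a unique default index
toy = pd.DataFrame({'x': range(10), 'target': [0, 1] * 5})

# Random 80% of the rows, reproducible thanks to random_state
train = toy.sample(frac=0.8, random_state=0)
# Everything that was not sampled, selected by index label
test = toy.drop(train.index)

print(len(train), len(test))  # 8 2
# No row appears in both subsets
print(train.index.intersection(test.index).empty)  # True
```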

Afterwards, let's separate labels from features:

In [13]:
train_features = train_dataset.copy()
test_features = test_dataset.copy()
train_labels = train_features.pop('target')
test_labels = test_features.pop('target')

### Build model

Now it is time to build a model. For this task I am going to use the Keras interface from the TensorFlow library.

In [14]:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

The selected model will be a feed-forward neural network for binary classification, consisting of an input layer, hidden layers, and an output layer with a sigmoid activation. It will use the data prepared in the previous section as input in order to predict the 'target' variable.

Neural networks generally train more reliably when all input features are on a similar scale. Thus, we need to create a normalization layer that is adapted to the training dataset:

In [15]:
normalizer = layers.experimental.preprocessing.Normalization()
normalizer.adapt(np.array(train_features))
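As a rough illustration of what the adapted layer computes, the following NumPy sketch standardizes each feature column to zero mean and unit variance. It is not the Keras implementation, and the feature values are hypothetical.

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features
features = np.array([[1.0, 100.0],
                     [2.0, 200.0],
                     [3.0, 300.0]])

# Statistics are computed per column, as adapt() does on the training data
mean = features.mean(axis=0)
std = features.std(axis=0)

# Shift and scale every column independently
normalized = (features - mean) / std
print(normalized.mean(axis=0))  # close to 0 for every feature
print(normalized.std(axis=0))   # close to 1 for every feature
```

This sketch does not handle a constant column, where `std` would be zero.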

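Before building the model, check how many feature columns the training set has, since this determines the input dimension:

In [None]:
train_features.info()

Afterwards, we can build a model consisting of a sequential stack of layers:

In [None]: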
model = tf.keras.Sequential([
    normalizer,
    layers.Dense(183, input_dim=183, activation='relu'),
    layers.Dense(60, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
model.summary()

### Compile and fit model



The next step is to compile the model.

In [64]:
model.compile(
    optimizer=tf.optimizers.Adam(),
    loss='binary_crossentropy',
    metrics=['accuracy'])
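To make the difference between the loss and the metric concrete, here is a minimal NumPy sketch of binary cross-entropy and thresholded accuracy. The helper names, the clipping constant `eps`, and the example probabilities are assumptions for illustration, not the Keras internals.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that fall on the correct side of the threshold."""
    predicted_positive = np.asarray(y_prob) > threshold
    return float(np.mean(predicted_positive == (np.asarray(y_true) == 1)))

# Hypothetical labels and two sets of predicted probabilities
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]
hesitant = [0.6, 0.4, 0.6, 0.4]

# Both sets classify every sample correctly, but the loss rewards confidence
print(accuracy(labels, confident), binary_crossentropy(labels, confident))
print(accuracy(labels, hesitant), binary_crossentropy(labels, hesitant))
```

Both prediction sets reach the same accuracy, yet the loss is lower for the more confident one. This is why the loss can keep improving during training while the accuracy stays flat.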

Then, we can fit the model. Settings such as the number of epochs and the validation split can be adjusted to improve the model's performance.



In [68]:
history = model.fit(
    # Data to be used for training
    train_features, train_labels,
    # Number of epochs
    epochs=10,
    # Suppress logging
    verbose=0,
    # Calculate validation results on 20% of the training data
    validation_split = 0.2)
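According to the Keras documentation, `validation_split` takes the validation samples from the end of the provided data, before any shuffling. The NumPy sketch below mimics that slicing on hypothetical arrays. The helper name is made up for illustration.

```python
import numpy as np

def split_like_validation_split(x, y, validation_split):
    """Hold out the last fraction of the samples, without shuffling (illustrative helper)."""
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

# Hypothetical data: 10 samples with a single feature
x = np.arange(10).reshape(-1, 1)
y = np.array([0, 1] * 5)

(train_x, train_y), (val_x, val_y) = split_like_validation_split(x, y, 0.2)
print(len(train_x), len(val_x))  # 8 2
print(val_x.ravel().tolist())    # [8, 9]
```

Since `train_dataset` was created with `sample()`, its rows are already in random order, so taking the last 20% should not introduce an ordering bias here.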

We can have a look at the last few epochs of the training of the model in order to see if everything works well.

In [None]:
# Show history in the last few epochs
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()

# Model Quality Assessment

In [None]:
# Predicted probabilities as a flat array
test_predictions = model.predict(test_features).flatten()
# Round probabilities to 0/1 and compute the fraction matching the true labels
accuracy = np.mean(np.around(test_predictions) == test_labels.to_numpy())
print(accuracy)
a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True values')
plt.ylabel('Predictions')
lims = [-0.1, 1.1]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims,lims)
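Accuracy alone does not show which kind of mistake the model makes. As a possible complement, the sketch below counts true and false positives and negatives and derives precision and recall from them. The helper name, the 0.5 threshold, and the example values are assumptions for illustration.

```python
import numpy as np

def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (tp, tn, fp, fn) for 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_prob).ravel() > threshold
    tp = int(np.sum(y_pred & y_true))    # predicted 1, actually 1
    tn = int(np.sum(~y_pred & ~y_true))  # predicted 0, actually 0
    fp = int(np.sum(y_pred & ~y_true))   # predicted 1, actually 0
    fn = int(np.sum(~y_pred & y_true))   # predicted 0, actually 1
    return tp, tn, fp, fn

# Hypothetical labels and predicted probabilities
tp, tn, fp, fn = confusion_counts([1, 0, 1, 0, 0], [0.8, 0.3, 0.4, 0.6, 0.1])

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted subscribers that are correct
recall = tp / (tp + fn)     # share of actual subscribers that are found
print(tp, tn, fp, fn)
print(accuracy, precision, recall)
```

With the notebook's variables, the call would be `confusion_counts(test_labels, test_predictions)`.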

# Findings

# Limitations of the Approach