# Executive Summary

# Input Data and Transformations

### Access data

Import necessary libraries

In [153]:
import pandas as pd
import json

Clone files stored in the git repository

In [None]:
# Credentials removed: authenticate through a git credential helper instead of embedding an access token in the URL
!git clone https://github.com/pkarczma/gym-subscription-predictor.git

Read CSV and JSON files with data

In [155]:
path = 'gym-subscription-predictor/'
df_csv = pd.read_csv(path+'train.csv')
df_json = pd.read_json(path+'train.json', orient='split').set_index('id')

The JSON file needs some conversion in order to extract the data nested inside it. A new dataframe containing only the group names is extracted and joined column-wise with the data from the CSV file:

In [156]:
# Join the group names of each user into a single '|'-separated string.
# A list is collected first because DataFrame.append was removed in pandas 2.0.
groups = ['|'.join(item['group_name'] for item in entry['data'])
          for entry in df_json['groups']]
df_groups = pd.DataFrame({'groups': groups})
df_groups.head()
df = pd.concat([df_csv, df_groups], axis=1)
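
To make the nested structure explicit, here is a minimal sketch on hand-made data. It assumes that each 'groups' entry is a dictionary with a 'data' list whose records carry a 'group_name' key, as the loop above implies. The sample records and the helper name `join_group_names` are hypothetical.

```python
import pandas as pd

# Hypothetical records mimicking the nested 'groups' field of the JSON file
records = [
    {'data': [{'group_name': 'Runners'}, {'group_name': 'Yoga fans'}]},
    {'data': []},
    {'data': [{'group_name': 'Cycling'}]},
]

def join_group_names(entry):
    """Join all group names of one user into a '|'-separated string."""
    return '|'.join(item['group_name'] for item in entry['data'])

df_groups_demo = pd.DataFrame({'groups': [join_group_names(r) for r in records]})
print(df_groups_demo['groups'].tolist())
# ['Runners|Yoga fans', '', 'Cycling']
```

A user without any group ends up with an empty string, which later produces all-zero group columns.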

### Analyse and clear data

Get familiar with the data

In [None]:
df.info()
df.head()

Some columns seem unnecessary for our model. We will drop them:

In [158]:
df = df.drop(columns=['name', 'location_population', 'location_from_population', 'daily_commute', 'credit_card_type'])

Count the number of missing values in each of the remaining columns:

In [None]:
df.isnull().sum(axis = 0)

Several columns contain NaN values, and the appropriate treatment depends on the column. The following procedure will be applied:
* 'user_id' / 'target' / 'location' / 'occupation' / 'friends_number': no missing values, columns are useful, nothing changes
* 'name' / 'location_population' / 'location_from_population': a few missing values; these columns are not needed for the prediction and have already been dropped above
* 'education': fill missing values with the median of the column
* 'hobbies': fill missing values with an empty string

For the remaining columns with missing values, such as 'location_from', there is no sensible replacement. Thus, the rows with at least one missing value will be dropped from the dataset.



In [None]:
df['hobbies'] = df['hobbies'].fillna('')
df['education'] = df['education'].fillna(df['education'].median())
df = df.dropna()
df.info()

As a result, we removed around 25% of all rows, but now the data is clean and ready for the next step.
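
The effect of the three cleaning steps can be checked on a toy frame. This is a minimal sketch; the column values below are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in three columns
demo = pd.DataFrame({
    'education': [3.0, np.nan, 5.0, 4.0],
    'hobbies': ['Chess', None, 'Yoga,Running', 'Chess'],
    'sex': ['male', 'female', None, 'female'],
})

# Empty string for missing hobbies, column median for missing education
demo['hobbies'] = demo['hobbies'].fillna('')
demo['education'] = demo['education'].fillna(demo['education'].median())

# Any row that still has a gap (here: the missing 'sex') is removed
cleaned = demo.dropna()
print(cleaned['education'].tolist())  # [3.0, 4.0, 4.0]
print(len(demo) - len(cleaned))       # 1
```

Only the row whose gap could not be filled is lost; the rows repaired by `fillna` survive `dropna`.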

### Transform data

In order to prepare the data for the model we need to convert it to a numeric format. The following code converts the text columns to categories and replaces each label with its integer category code:

In [161]:
df['sex'] = df['sex'].astype('category').cat.codes
df['location'] = df['location'].astype('category').cat.codes
df['location_from'] = df['location_from'].astype('category').cat.codes
df['occupation'] = df['occupation'].astype('category').cat.codes
df['relationship_status'] = df['relationship_status'].astype('category').cat.codes
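
As a quick illustration of what `astype('category').cat.codes` returns, consider the following sketch. The occupation labels are hypothetical.

```python
import pandas as pd

# Hypothetical labels; the real columns hold the values from train.csv
occupation = pd.Series(['teacher', 'nurse', 'teacher', 'engineer'])
as_category = occupation.astype('category')

print(list(as_category.cat.categories))  # ['engineer', 'nurse', 'teacher']
print(as_category.cat.codes.tolist())    # [2, 1, 2, 0]
```

The codes follow the sorted order of the labels, so they carry no meaningful ranking of their own.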

For the date of birth, I assume there is no need to keep the exact date - having just a year of birth should be enough for the model. I will drop the day and month information from 'dob' column:

In [162]:
df['dob'] = pd.DatetimeIndex(df['dob']).year
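
The same year extraction can be tried on a couple of hand-written dates. This sketch uses `pd.to_datetime` with the `.dt` accessor, an equivalent alternative to `pd.DatetimeIndex`; the date strings and their format are assumptions.

```python
import pandas as pd

# Hypothetical dates of birth in ISO format
dob = pd.Series(['1990-05-17', '1985-12-01'])
years = pd.to_datetime(dob).dt.year
print(years.tolist())  # [1990, 1985]
```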

For the 'hobbies' column the best way is to get dummies for each value and split it into several columns with the numbers 0 and 1 indicating the lack of or the presence of interest in a particular hobby. An additional 'hobby_' prefix marks these columns as hobbies and ensures that none of the new column names overlap with the existing ones.

In [163]:
df = pd.concat([df.drop('hobbies', axis=1), df['hobbies'].str.get_dummies(sep=',').add_prefix('hobby_')], axis=1)

Let's do the same for the 'groups' column:

In [164]:
df = pd.concat([df.drop('groups', axis=1), df['groups'].str.get_dummies(sep='|').add_prefix('group_')], axis=1)
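
Both one-hot steps rely on `Series.str.get_dummies`. A minimal sketch with invented hobby names shows the shape of the result:

```python
import pandas as pd

# Hypothetical comma-separated hobby strings; '' stands for "no hobbies"
hobbies = pd.Series(['Chess,Yoga', '', 'Yoga'])
dummies = hobbies.str.get_dummies(sep=',').add_prefix('hobby_')

print(dummies.columns.tolist())  # ['hobby_Chess', 'hobby_Yoga']
print(dummies.values.tolist())   # [[1, 1], [0, 0], [0, 1]]
```

The empty string does not create a column of its own; such a row simply gets zeros everywhere.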

At this point the data contain only numbers, there are no missing values, and it is prepared for the next step.

In [None]:
df.info()
df.head()

# Model Selection and Training

### Split data

For the model training and testing the dataset will be split into two subsets:
* 80% of the data will be used for training
* the remaining 20% of the data will be used for testing

In [166]:
train_dataset = df.sample(frac=0.8, random_state=0)
test_dataset = df.drop(train_dataset.index)
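
The `sample` / `drop` pattern can be verified on a toy frame. The sketch below uses a hypothetical ten-row dataframe and checks that the two subsets are disjoint and together cover all rows.

```python
import pandas as pd

demo = pd.DataFrame({'x': range(10)})

# 80% of the rows are drawn at random; the rest become the test set
train_demo = demo.sample(frac=0.8, random_state=0)
test_demo = demo.drop(train_demo.index)

print(len(train_demo), len(test_demo))  # 8 2
# No row appears in both subsets
assert train_demo.index.intersection(test_demo.index).empty
```

Fixing `random_state` makes the split reproducible between runs.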

Afterwards, let's separate labels from features:

In [167]:
train_features = train_dataset.copy()
test_features = test_dataset.copy()
train_labels = train_features.pop('target')
test_labels = test_features.pop('target')

### Build model

Now it is time to build a model. For this task I am going to use the Keras interface from the TensorFlow library.

In [168]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix
import itertools
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

The selected model will be a feed-forward neural network for binary classification, consisting of an input normalization layer, a hidden layer, and an output layer. It will use the data prepared in the previous section as input in order to predict the probability of the target variable being '1'.

Neural networks train more reliably when all input features are on a comparable scale. Thus, we need to create a normalization layer that is adapted to the training data:

In [169]:
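# Since TensorFlow 2.6 the layer is available directly as layers.Normalization;
# older versions expose it as layers.experimental.preprocessing.Normalization
normalizer = layers.Normalization()
normalizer.adapt(np.array(train_features))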
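
What the adapted layer computes can be sketched in plain NumPy: each feature is shifted by its mean and divided by its standard deviation, both estimated from the training data. The feature matrix below is hypothetical.

```python
import numpy as np

# Hypothetical feature matrix: rows are users, columns are features
features = np.array([[1990., 2.], [1980., 4.], [2000., 6.]])

# Statistics are learned from the training data only
mean = features.mean(axis=0)
std = features.std(axis=0)

normalized = (features - mean) / std
print(normalized.mean(axis=0))  # approximately [0. 0.]
print(normalized.std(axis=0))   # approximately [1. 1.]
```

After this step a year of birth and a small count contribute on the same scale.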

Afterwards, we can build a fully-connected model as a sequential stack of layers, where the hidden layer uses a rectified linear unit (ReLU) activation function and the output layer uses a sigmoid function:

In [None]:
model = tf.keras.Sequential([
    normalizer,
    layers.Dense(3520, input_dim=3520, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
model.summary()
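
The forward pass of such a network can be written out in a few lines of NumPy. This is only a sketch with small, randomly initialised weights (hypothetical values; the real model learns its weights during training and is much wider).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))  # 4 samples with 3 already normalized features

# Hypothetical weights: 3 inputs -> 5 hidden units -> 1 output
w_hidden, b_hidden = rng.normal(size=(3, 5)), np.zeros(5)
w_out, b_out = rng.normal(size=(5, 1)), np.zeros(1)

# Hidden layer with ReLU: negative activations are set to zero
hidden = np.maximum(0.0, x @ w_hidden + b_hidden)
# Output layer with sigmoid: squashes the score into the range (0, 1)
probability = 1.0 / (1.0 + np.exp(-(hidden @ w_out + b_out)))

print(probability.shape)  # (4, 1)
```

The sigmoid output is what allows the prediction to be read as a probability of class '1'.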

### Compile and fit model



The next step is to compile the model with the Adam optimizer, a binary cross-entropy loss, and accuracy as the reported metric.

In [171]:
model.compile(
    optimizer=tf.optimizers.Adam(),
    loss='binary_crossentropy',
    metrics=['accuracy'])
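
To see what the binary cross-entropy loss rewards, here is a minimal NumPy sketch. The helper `binary_crossentropy` and the clipping constant `eps` are written for this illustration and are not the Keras implementation.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under y_prob."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

y_true = np.array([1., 0., 1., 0.])
confident = np.array([0.9, 0.1, 0.8, 0.2])   # mostly right and sure about it
uninformed = np.array([0.5, 0.5, 0.5, 0.5])  # always predicts 50%

print(binary_crossentropy(y_true, confident))
print(binary_crossentropy(y_true, uninformed))  # equals ln(2)
```

Confident and correct probabilities give a lower loss than the uninformed 50% guess.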

Then, we can fit the model. The settings passed to `fit`, such as the number of epochs and the validation split, can be adjusted to improve the model's performance.



In [172]:
history = model.fit(
    # Data to be used for training
    train_features, train_labels,
    # Number of epochs
    epochs=5,
    # Suppress logging
    verbose=0,
    # Calculate validation results on 20% of the training data
    validation_split=0.2)

We can have a look at the last few epochs of the training of the model in order to see if everything works well.

In [None]:
# Show history in the last few epochs
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()

# Model Quality Assessment

In order to assess the quality of the model we are going to use the part of the dataset that hasn't been provided to the model yet.

In [174]:
test_predictions = model.predict(test_features)

Now we can compute the fraction of correct predictions by comparing the rounded probabilities to the true labels:

In [None]:
# Round the probabilities to 0/1 and compare them element-wise with the true labels
predicted = np.around(test_predictions).ravel()
print((predicted == test_labels.to_numpy()).mean())

One can see that over 80% of the predictions are correct. An even better way to look at the results is to create a confusion matrix showing the fraction of correct and incorrect predictions in each class (in this case '0' and '1', as this is a binary classification).

In [None]:
cm = confusion_matrix(test_labels, np.around(test_predictions))
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Normalized confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, "{0:0.2f}".format(cm[i, j]),
        horizontalalignment="center",
        color="white" if cm[i, j] > thresh else "black")

One can see that over 90% of the labels marked as '0' are correctly identified. For labels marked as '1' the model performs considerably worse and correctly predicts only around 28% of the cases.
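
The row normalization used for the matrix can be checked on toy labels. The sketch below builds the confusion matrix by hand with NumPy; the labels are invented and unrelated to the results above.

```python
import numpy as np

# Hypothetical true and predicted labels
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])

# Rows are true classes, columns are predicted classes
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

# Dividing each row by its sum turns counts into per-class fractions
cm_normalized = cm / cm.sum(axis=1, keepdims=True)
print(cm.tolist())             # [[3, 1], [1, 1]]
print(cm_normalized.tolist())  # [[0.75, 0.25], [0.5, 0.5]]
```

Each diagonal entry of the normalized matrix is the fraction of that class which was predicted correctly, so a weak minority class stays visible even when the overall accuracy is high.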

# Scoring Test File

In this section the model will be used to produce the target variable for the data stored in the test.csv and test.json files. First, we need to read the files and transform the data for the model in the same way as before.

In [None]:
df_model = df.copy()
df_csv = pd.read_csv(path+'test.csv')
df_json = pd.read_json(path+'test.json', orient='split').set_index('id')
# Join the group names of each user into a single '|'-separated string.
# A list is collected first because DataFrame.append was removed in pandas 2.0.
groups = ['|'.join(item['group_name'] for item in entry['data'])
          for entry in df_json['groups']]
df_groups = pd.DataFrame({'groups': groups})
df_groups.head()
df = pd.concat([df_csv, df_groups], axis=1)
df = df.drop(columns=['target', 'name', 'location_population', 'location_from_population', 'daily_commute', 'credit_card_type'])
df['hobbies'] = df['hobbies'].fillna('')
df['education'] = df['education'].fillna(df['education'].median())
df = df.dropna()
df['sex'] = df['sex'].astype('category').cat.codes
df['location'] = df['location'].astype('category').cat.codes
df['location_from'] = df['location_from'].astype('category').cat.codes
df['occupation'] = df['occupation'].astype('category').cat.codes
df['relationship_status'] = df['relationship_status'].astype('category').cat.codes
df['dob'] = pd.DatetimeIndex(df['dob']).year
df = pd.concat([df.drop('hobbies', axis=1), df['hobbies'].str.get_dummies(sep=',').add_prefix('hobby_')], axis=1)
df = pd.concat([df.drop('groups', axis=1), df['groups'].str.get_dummies(sep='|').add_prefix('group_')], axis=1)
df.info()
df.head()
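
One caveat of calling `astype('category').cat.codes` separately on the training and the test data is that the codes are derived from the labels present in each frame, so the same label may receive different codes. A minimal sketch of one possible remedy, reusing the training categories, is shown below. The series names and labels are hypothetical.

```python
import pandas as pd

# Hypothetical labels: the test data lacks one of the training labels
train_occupation = pd.Series(['teacher', 'nurse', 'engineer'])
test_occupation = pd.Series(['teacher', 'nurse'])

# Encoding each frame on its own gives 'teacher' two different codes
print(train_occupation.astype('category').cat.codes.tolist())  # [2, 1, 0]
print(test_occupation.astype('category').cat.codes.tolist())   # [1, 0]

# Reusing the training categories keeps the mapping stable
categories = train_occupation.astype('category').cat.categories
test_codes = pd.Categorical(test_occupation, categories=categories).codes
print(test_codes.tolist())  # [2, 1]
```

Labels that never occurred in the training data receive the code -1 with this approach and need separate handling.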

Now we need to add the columns that are missing from this dataframe and remove the additional columns that didn't exist previously, in order to have exactly the same set of columns as in the dataframe used for building the model. New columns will be filled with '0'. Additional columns will be dropped because the model doesn't know what to do with hobbies or groups that didn't exist in the fitting data.

In [None]:
# Align the test columns with the training features: columns missing from the
# test data are added and filled with 0, unknown columns are dropped, and the
# column order is made identical to the one used for fitting the model
feature_columns = df_model.columns.drop('target')
df = df.reindex(columns=feature_columns, fill_value=0)
df.info()
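
Column alignment matters in two ways: both the set and the order of the columns have to match what the model was fitted on, because the model identifies features by position. A minimal sketch on toy frames, using `DataFrame.reindex` with hypothetical column names:

```python
import pandas as pd

# Hypothetical feature columns seen during training, in training order
train_columns = ['user_id', 'hobby_Chess', 'hobby_Yoga']

# Hypothetical test data: one unknown hobby, one missing hobby, other order
test_demo = pd.DataFrame({
    'hobby_Yoga': [1, 0],
    'hobby_Golf': [0, 1],
    'user_id': [7, 8],
})

aligned = test_demo.reindex(columns=train_columns, fill_value=0)
print(aligned.columns.tolist())  # ['user_id', 'hobby_Chess', 'hobby_Yoga']
print(aligned.values.tolist())   # [[7, 0, 1], [8, 0, 0]]
```

The unknown 'hobby_Golf' column disappears, the missing 'hobby_Chess' column is created with zeros, and the order follows the training columns.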

Now we can use the model to predict the target value for the test dataset. As a result we get an array of probabilities:

In [None]:
test_features = df.copy()
test_predictions = model.predict(test_features)
print(test_predictions)

We can finally prepare the scored test file. For the rows that were skipped in the prediction phase due to missing information we assume a target value of '0'. For the rest of the users we use the predicted probability and set the target value by rounding the probability to the nearest integer. Then, we save the output as a CSV file in the desired format consisting of 3 columns: 'user_id', 'probability_of_one', and 'target'.

In [None]:
# Predictions exist only for the rows that survived dropna; index them by row label
probabilities = pd.Series(test_predictions.ravel(), index=df.index)
# Rows skipped during prediction get probability 0 and therefore target 0
probabilities = probabilities.reindex(df_csv.index, fill_value=0.)
df_score = pd.DataFrame({
    'user_id': df_csv['user_id'],
    'probability_of_one': probabilities,
    'target': np.around(probabilities).astype(int)})
df_score.head()
df_score.to_csv('test.csv', index=False)
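
One detail of the rounding step is worth knowing: `np.around` rounds exact halves to the nearest even number, so a probability of exactly 0.5 becomes 0. The sketch below compares it with an explicit threshold on hypothetical probabilities; the choice of 0.5 as the threshold is an assumption.

```python
import numpy as np

# Hypothetical predicted probabilities
probabilities = np.array([0.2, 0.5, 0.51, 0.9])

print(np.around(probabilities).astype(int).tolist())  # [0, 0, 1, 1]
print((probabilities >= 0.5).astype(int).tolist())    # [0, 1, 1, 1]
```

An explicit comparison makes the decision boundary visible and easy to change if a different threshold is desired.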

# Findings

# Limitations of the Approach