# Executive Summary

# Input Data and Transformations

### Access data

Import necessary libraries

In [89]:
import pandas as pd

Clone files stored in the git repository

In [None]:
# Never embed an access token in the clone URL; for a private repository,
# authenticate through a git credential helper or an SSH key instead.
!git clone https://github.com/pkarczma/gym-subscription-predictor.git

Read CSV and JSON files with data

In [91]:
path = 'gym-subscription-predictor/'
df_csv = pd.read_csv(path+'train.csv')
df_json = pd.read_json(path+'train.json')

### Analyse and clear data

Get familiar with the data

In [None]:
df_csv.info()
df_csv.head()

Some columns seem unnecessary for our model. We will drop them:

In [93]:
df_csv = df_csv.drop(columns=['name', 'location_population', 'location_from_population', 'daily_commute', 'credit_card_type'])

Count the number of missing values in each of the remaining columns:

In [None]:
df_csv.isnull().sum(axis=0)
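Absolute counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a made-up toy frame (the `toy` data and its values are assumptions for illustration, not taken from `train.csv`); it relies on the fact that the mean of a boolean mask equals the fraction of `True` values.

```python
import pandas as pd

# Hypothetical stand-in for df_csv; the values are invented for illustration.
toy = pd.DataFrame({
    'education': [3.0, None, 5.0, 4.0],
    'sex': ['male', None, None, 'female'],
})

# isnull() gives a boolean mask; its column-wise mean is the share of missing values.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, the same expression would be applied to `df_csv` instead of `toy`.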

Several columns contain NaN values. The right treatment depends on the column, so the following procedure will be applied:
* 'user_id' / 'target' / 'location' / 'occupation' / 'friends_number': no missing values, columns are useful, nothing changes
* 'name' / 'location_population' / 'location_from_population': a few missing values, but these columns aren't necessary for model prediction and have already been dropped above
* 'education': fill missing values with the median of the column
* 'hobbies': fill missing values with an empty string

For the remaining columns, there is no sensible replacement for the missing values. Thus, every row with at least one missing value will be dropped from the dataset.



In [None]:
df_csv['hobbies'] = df_csv['hobbies'].fillna('')
df_csv['education'] = df_csv['education'].fillna(df_csv['education'].median())
df_csv = df_csv.dropna()
df_csv.info()

As a result, we removed around 25% of all rows, but now the data is clean and ready for the next step.
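The share of removed rows can be measured directly by comparing the row count before and after cleaning. This is a minimal sketch on a made-up toy frame (`toy`, its columns, and its values are assumptions for illustration); the cleaning steps mirror the ones applied to `df_csv` above.

```python
import pandas as pd

# Hypothetical stand-in for df_csv; the values are invented for illustration.
toy = pd.DataFrame({
    'education': [3.0, None, 5.0, 4.0],
    'hobbies': ['Gym', None, 'Chess,Gym', ''],
    'sex': ['male', 'female', None, 'female'],
})

rows_before = len(toy)

# Same cleaning steps as in the notebook: fill what can be filled, drop the rest.
toy['hobbies'] = toy['hobbies'].fillna('')
toy['education'] = toy['education'].fillna(toy['education'].median())
toy = toy.dropna()

rows_after = len(toy)
removed_share = 1 - rows_after / rows_before
print(f'Removed {removed_share:.0%} of rows')
```

In the notebook itself, `rows_before` would have to be recorded before the cleaning cell runs.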

### Transform data

To prepare the data for the model, we need to convert it to a numeric format. The following code converts each categorical column to integer category codes so that it is easier for the model to read:

In [96]:
df_csv['sex'] = df_csv['sex'].astype('category').cat.codes
df_csv['location'] = df_csv['location'].astype('category').cat.codes
df_csv['location_from'] = df_csv['location_from'].astype('category').cat.codes
df_csv['occupation'] = df_csv['occupation'].astype('category').cat.codes
df_csv['relationship_status'] = df_csv['relationship_status'].astype('category').cat.codes
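Replacing a column with `cat.codes` discards the original labels, so it can help to store the code-to-label mapping before overwriting the column. The sketch below uses a made-up toy column (the `toy` frame and the `code_to_label` name are assumptions for illustration). pandas assigns codes in the sorted order of the labels.

```python
import pandas as pd

# Hypothetical stand-in for one categorical column of df_csv.
toy = pd.DataFrame({'occupation': ['teacher', 'driver', 'teacher', 'nurse']})

occupation_cat = toy['occupation'].astype('category')

# Keep the mapping from integer code back to the original label.
code_to_label = dict(enumerate(occupation_cat.cat.categories))

# Replace the labels with their integer codes, as in the notebook.
toy['occupation'] = occupation_cat.cat.codes
print(code_to_label)
print(toy)
```

The stored mapping makes it possible to translate model inputs back to readable labels later.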

For the date of birth, I assume there is no need to keep the exact date; the year of birth alone should be enough for the model. I will drop the day and month information from the 'dob' column:

In [None]:
df_csv['dob'] = pd.DatetimeIndex(df_csv['dob']).year
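An equivalent way to extract the year is `pd.to_datetime` combined with the `.dt` accessor, which returns a Series aligned with the frame's index. This is a minimal sketch on made-up dates; the ISO `YYYY-MM-DD` format is an assumption, since the actual format of 'dob' in `train.csv` is not shown here.

```python
import pandas as pd

# Hypothetical dates; the real 'dob' format in train.csv is assumed, not known.
toy = pd.DataFrame({'dob': ['1990-05-17', '1985-12-01', '2001-01-30']})

# Parse the strings as datetimes and keep only the year component.
toy['dob'] = pd.to_datetime(toy['dob']).dt.year
print(toy)
```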

For the 'hobbies' column, the best approach is to create dummy variables: each hobby becomes its own column holding 1 or 0 to indicate interest (or lack of interest) in that hobby. A 'hobby_' prefix marks these columns as hobbies and ensures that their names do not collide with the existing columns.

In [102]:
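# One 0/1 column per hobby, prefixed with 'hobby_'
hobby_dummies = df_csv['hobbies'].str.get_dummies(sep=',').add_prefix('hobby_')
# Swap the original text column for the dummy columns
df_csv = pd.concat([df_csv.drop('hobbies', axis=1), hobby_dummies], axis=1)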
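To show what this transformation produces, here is a minimal sketch on a made-up toy frame (the `toy` data and the hobby names are assumptions for illustration). A row with an empty 'hobbies' string, such as one filled in during cleaning, gets 0 in every hobby column.

```python
import pandas as pd

# Hypothetical stand-in for df_csv; hobby names are invented for illustration.
toy = pd.DataFrame({
    'user_id': [1, 2, 3],
    'hobbies': ['Gym,Chess', '', 'Chess'],
})

# Split the comma-separated strings into one indicator column per hobby.
dummies = toy['hobbies'].str.get_dummies(sep=',').add_prefix('hobby_')

# Replace the text column with the indicator columns.
toy = pd.concat([toy.drop('hobbies', axis=1), dummies], axis=1)
print(toy)
```

Note that `sep=','` splits on the bare comma only, so this sketch assumes the hobbies are stored without spaces after the commas.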

At this point the data contains only numbers and has no missing values, so it is ready for the next step.

In [None]:
df_csv.info()
df_csv.head()
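Instead of reading the `info()` output by eye, the two claims (only numbers, no missing values) can be checked programmatically. The helper below is a hypothetical sketch, not part of the original notebook; the name `check_model_ready` and the toy frames are assumptions for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_model_ready(df):
    """Return the non-numeric column names and the total count of missing values."""
    non_numeric = [col for col in df.columns if not is_numeric_dtype(df[col])]
    n_missing = int(df.isnull().sum().sum())
    return non_numeric, n_missing


# Hypothetical frames: one already clean, one that still needs work.
clean = pd.DataFrame({'sex': [0, 1], 'dob': [1990, 1985], 'hobby_Gym': [1, 0]})
dirty = pd.DataFrame({'sex': ['male', None], 'dob': [1990, 1985]})

print(check_model_ready(clean))
print(check_model_ready(dirty))
```

Calling `check_model_ready(df_csv)` at this point should return an empty list and zero if the preparation steps worked as intended.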

# Model Selection and Training

# Model Quality Assessment

# Findings

# Limitations of the Approach