# Business Case Report, Tymoteusz Cieślik

## Introduction

Whenever a company enters a market in a new country, it needs to analyse the local circumstances, the environment and the potential target clients. Regardless of its size or recognition, a business aims to generate profit from the services it distributes. For the gym chain to achieve that, it is essential to reach the people who are likely to be interested in what the company offers. In this case we work with information about people stored in different files, together with an indication of whether they would be willing to buy a long-term gym subscription. Our goal is to build a predictive model on the provided training data in order to identify the potential targets in the test data. First, we import the libraries and modules needed to solve the problem.

In [1]:
import json
import re
import warnings

import numpy as np
import pandas as pd
import sklearn as sk
import sklearn.ensemble as ske
import sklearn.model_selection as skm
import catboost as ct
import lightgbm as lgb
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, cross_val_predict, train_test_split
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

## Input and data transformations

In the next step we load the training and test data into the Python session. For each of the two sets there are two files:
- one .csv file, where the data is stored as a structured table with variables describing each person's characteristics and personal information,
- one .json file, where the data is unstructured and stored in the form of nested dictionaries.

The most important step is cleaning and ordering the data so that it can be used as input to a predictive model. In terms of size, the training data contains 4000 entries, while the test data contains half as many.

In [2]:
# Context managers make sure the files are closed after loading
with open('train.json', encoding='utf-8') as f:
    data_json = json.load(f)
with open('test.json', encoding='utf-8') as f:
    data_json_test = json.load(f)

df_csv = pd.read_csv(r'train.csv')
df_csv_test = pd.read_csv(r'test.csv')

We first focus on the data in the .json file. As mentioned, it has no tabular structure, so we need to extract the information for each surveyed person. The file lists the social media groups that each person belongs to. By iterating over the file we find each person by their 'id' and build a data frame in which each column represents one particular group. During this process we drop the city name that appears in parentheses after a group name and indicates the area the person operates from, as it does not provide useful information and would only increase the dimensionality of the data unnecessarily. Finally, for the purpose of predictive modelling, we transform the data frame with one-hot encoding, so that instead of categorical attributes we store a binary indicator of whether each user belongs to a group.

In [3]:
def group_processing(data_json):
    """
    Process the data loaded from the .json file, which is passed as the argument.
    The function cleans the stored information and returns it as a pandas DataFrame with one-hot encoding applied,
    so that the categorical group memberships can be used for predictive modelling.
    """
    groups = []
    for person in data_json['data']:
        person_groups = []
        for group in person['groups']['data']:
            # Drop the city name given in parentheses or square brackets
            person_groups.append(re.sub(r"[\(\[].*?[\)\]]", "", group['group_name']))
        if not person_groups:
            # People without any group get a placeholder category
            person_groups = ['no group']
        groups.append([int(person['id']), person_groups])
    df_groups = pd.DataFrame(groups, columns=['id', 'groups']).set_index('id')
    # One row per (person, group) pair, one-hot encoded, then summed back per person.
    # groupby(level=0).sum() replaces the sum(level=0) call removed in pandas 2.0.
    stacked = df_groups['groups'].apply(pd.Series).stack()
    df_groups_onehot = pd.get_dummies(stacked, dtype=int).groupby(level=0).sum()
    return df_groups_onehot
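
To make the cleaning and encoding steps concrete, the following minimal sketch applies them to a tiny, made-up membership dictionary. The names `toy_groups`, `toy` and `toy_onehot` and the group names are hypothetical and not taken from the real files, and the additional `.strip()` call (which removes the space left behind by the deleted city) is an assumption that the function above does not make.

```python
import re

import pandas as pd

# Hypothetical group memberships keyed by user id (not taken from the real files)
toy_groups = {
    1: ["Runners (Warsaw)", "Chess club (Krakow)"],
    2: ["Runners (Gdansk)"],
    3: [],  # a person without any group
}

rows = []
for user_id, names in toy_groups.items():
    # Remove the parenthesised city, then the whitespace it leaves behind
    cleaned = [re.sub(r"[\(\[].*?[\)\]]", "", name).strip() for name in names]
    rows.append([user_id, cleaned or ["no group"]])

toy = pd.DataFrame(rows, columns=["id", "groups"]).set_index("id")
# explode() gives one row per (person, group) pair; get_dummies() creates one
# indicator column per group, and the groupby sums them back to one row per person
toy_onehot = pd.get_dummies(toy["groups"].explode(), dtype=int).groupby(level=0).sum()
print(toy_onehot)
```

Because the city is removed, "Runners (Warsaw)" and "Runners (Gdansk)" end up in the same column, which is the dimensionality saving described above.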

The .csv file is already structured, so its transformation is more straightforward. One thing that catches the eye is the hobbies variable, which holds a comma-separated list of hobbies for each person rather than a single value. As before, we create a new data frame in which each column represents one particular hobby. We then apply one-hot encoding so that the variables are binary and can be used in predictive modelling.

In [4]:
def hobbies_processing(data_csv):
    """
    The function operates on the data frame read from the .csv file and transforms the hobbies column into a separate
    frame, where each column represents one hobby. One-hot encoding is then applied for predictive purposes and the
    transformed data frame is the output of the function.
    """
    hobbies = []
    for ids, hobby in data_csv[['user_id', 'hobbies']].itertuples(index=False):
        try:
            # Hobbies are stored as one comma-separated string per person
            hobby_l = [h.lower() for h in hobby.split(",")]
            hobbies.append([ids, hobby_l])
        except AttributeError:
            # Missing values (NaN) have no split method and get a placeholder category
            hobbies.append([ids, ['no hobby']])
    df_hobbies = pd.DataFrame(hobbies, columns=['id', 'hobbies']).set_index('id')
    # groupby(level=0).sum() replaces the sum(level=0) call removed in pandas 2.0
    stacked = df_hobbies['hobbies'].apply(pd.Series).stack()
    df_hobbies_onehot = pd.get_dummies(stacked, dtype=int).groupby(level=0).sum()
    return df_hobbies_onehot

For the remaining information in the .csv file we can perform further transformations to improve its usefulness for prediction. For example, we discretize each person's date of birth by assigning the decade of birth instead of the exact day, as that is enough to assess the person's age. We then divide the table into two groups of variables, numerical and categorical:
- for numerical variables we perform min-max normalization, so that each variable's values lie in the range [0,1] and every attribute enters the model on a comparable scale,
- for categorical variables we once more apply one-hot encoding, so that the information is preserved as binary indicators.

In addition, we omit the information that should not contribute to the models, such as names and locations.

In [5]:
def data_processing(data):
    """
    The function takes the data frame read from the .csv file and processes it so that it can be used for predictive
    modelling. It transforms the date of birth into a decade, divides the variables into categorical and numerical ones,
    normalizes the numerical ones and returns two data frames with attributes that can then contribute to the model.
    """
    decade = []
    for date in data['dob']:
        try:
            year = int(date[0:4])
            if year < 1930:
                decade.append("1920s")
            elif year >= 2000:
                decade.append("00s")
            else:
                # 1930-1999: round down to the decade, e.g. 1987 -> "1980s"
                decade.append(f"{year // 10 * 10}s")
        except (TypeError, ValueError):
            # Missing or malformed dates are kept unchanged
            decade.append(date)
    data['decade_of_birth'] = decade
    numerical_csv = data[['location_population', 'location_from_population', 'daily_commute', 'friends_number', 'education']]
    # Rescale every numerical column to the range [0, 1]
    scaler = MinMaxScaler(feature_range=(0, 1))
    numerical_csv_norm = scaler.fit_transform(numerical_csv)
    numerical_csv_norm = pd.DataFrame(numerical_csv_norm, columns=numerical_csv.columns)
    onehot_csv = pd.get_dummies(data=data[['sex', 'decade_of_birth', 'occupation', 'relationship_status', 'credit_card_type']])
    return onehot_csv, numerical_csv_norm
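
As a small illustration of the normalization step, the sketch below compares `MinMaxScaler` with the formula (x - min) / (max - min) on a made-up column. The array `values` is hypothetical. The last lines show one possible design choice that the function above does not use: reusing a scaler fitted on one data set to rescale new data with `transform`, so that both are mapped with the same minimum and maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numerical column, e.g. a number of friends per person
values = np.array([[10.0], [20.0], [40.0], [50.0]])

# Manual min-max formula: (x - min) / (max - min), computed per column
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
print(np.allclose(scaled, manual))

# The fitted scaler remembers the minimum and maximum it has seen, so new
# data can be rescaled consistently with transform() instead of a second fit
new_scaled = scaler.transform(np.array([[30.0]]))
print(new_scaled)
```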

After the necessary transformations we can call the functions on the training data in order to obtain four different data frames. 

In [6]:
df_groups_onehot_train = group_processing(data_json)
df_hobbies_onehot_train = hobbies_processing(df_csv)
onehot_csv_train, numerical_csv_norm_train = data_processing(df_csv)

Then we merge all of them into one large data frame containing all the information that can be used for predictive modelling. It is worth noting that the data contains many missing values, which can negatively affect the quality of predictions, so as a potential remedy we also create a separate frame containing only the rows where a value is provided for every column. In addition, we have decided not to impute the missing values: each individual is different and has their own preferences, so replacing missing entries with averaged values would not be meaningful.

In [7]:
merged = pd.concat([numerical_csv_norm_train, df_groups_onehot_train,df_hobbies_onehot_train,onehot_csv_train],axis = 1)
merged_no_nan = merged.dropna()
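
One detail worth keeping in mind is that `pd.concat` with `axis=1` matches rows by index label, not by position. The sketch below uses two made-up frames (`numeric` and `groups`, both hypothetical) to show what happens when the indexes do not fully coincide; whether this occurs in the real data depends on how the user ids relate to the default row numbers, which we do not assume here.

```python
import pandas as pd

# Two hypothetical feature frames: one with a default 0..n-1 index,
# one indexed by user id (values chosen only for illustration)
numeric = pd.DataFrame({"friends_number": [0.1, 0.5, 0.9]})      # index 0, 1, 2
groups = pd.DataFrame({"Runners": [1, 0, 1]}, index=[0, 1, 5])   # index 0, 1, 5

merged_toy = pd.concat([numeric, groups], axis=1)
# Labels 2 and 5 exist in only one of the frames, so the columns coming
# from the other frame are NaN in those rows
print(merged_toy)

# dropna() keeps only the rows that are complete in every column
merged_toy_no_nan = merged_toy.dropna()
print(merged_toy_no_nan)
```

Checking `merged.shape` against the number of surveyed people is a quick way to confirm that the rows were aligned as intended.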

With all the necessary information prepared, we can move on to the predictions themselves. First it is important to understand what kind of problem we are dealing with. Based only on the training data, we have to predict the binary label of the target. The task is therefore a supervised learning problem, and because the target is a categorical variable we need a classification method. With that established, we can present the solution to the problem.

## Model selection

The first step is to choose how to approach the problem. To get a broader picture and a wider variety of results, we apply different machine learning models to different subsets of features. Then, using several model evaluation metrics, we choose the best combination and use it to predict the labels of the test data.

### Subsets of features

The main goal of the predictive modelling is to achieve the highest possible classification accuracy, and we can increase the chance of success by trying different subsets of features. We decided on three options:
- the first subset is the whole data frame obtained in the previous steps, which lets us assess the accuracy when all variables are considered,
- the second subset contains ten variables chosen automatically by the SelectKBest class from the sklearn module, which here scores each attribute with the ANOVA F-test (f_classif) and keeps the ten highest-scoring ones. The drawback of this approach is that it can only be used on the table without missing values,
- for the last one we use a dimensionality reduction method, PCA, which replaces the original columns with principal components. Such a transformation can drastically change the values of the columns, but as it is applied here without limiting the number of components it preserves the variance of the data, so it is useful to analyse the performance of this subset.

In [8]:
labels = np.array(df_csv['target'])
labels_no_nan = df_csv.iloc[list(merged_no_nan.index)]['target']
features = np.array(merged)
features_no_nan = np.array(merged_no_nan)

In [9]:
fsc_10 = SelectKBest(score_func=f_classif, k=10)
features_selected_10 = fsc_10.fit_transform(features_no_nan, labels_no_nan)
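
The following sketch shows the same selection step on synthetic data, so that it can be run without the original files. The data from `make_classification` is only a stand-in for the real feature matrix, and the variable names are hypothetical. It also shows how to inspect which columns were kept.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the feature matrix (the real data is not reproduced here)
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# get_support() returns the positions of the kept columns;
# scores_ holds the F-statistic computed for every original column
kept = selector.get_support(indices=True)
print(X_selected.shape)
print(kept)
print(selector.scores_[kept])
```

If such a subset were used for the final predictions, the fitted selector would have to be applied to the test features with `selector.transform`, so that the same ten columns are kept.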

In [10]:
pca = PCA() 
features_pca = pca.fit_transform(features_no_nan)
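
Called without arguments, `PCA()` keeps every component, so the number of columns is reduced only if there are fewer rows than columns. A minimal sketch on synthetic data (a stand-in for the real features, with hypothetical variable names) contrasts this with passing a fraction as `n_components`, which keeps the smallest number of components that explains at least that share of the variance.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in for the feature matrix
X, _ = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)

# Default PCA: all 20 components are kept, the data is only rotated
pca_full = PCA()
X_full = pca_full.fit_transform(X)
print(X_full.shape, pca_full.explained_variance_ratio_.sum())

# A float in (0, 1) keeps just enough components to explain that much variance
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_95 = pca_95.fit_transform(X)
print(X_95.shape, pca_95.explained_variance_ratio_.sum())
```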

### Quality assessment

In this part we present four model evaluation metrics that help us decide which combination of ML method and feature subset performs best. The metrics are:
- accuracy, the fraction of labels that were correctly assigned,
- area under the ROC curve (AUC), which measures the ability of a classifier to distinguish between the classes,
- precision, the fraction of predicted positives that are truly positive, and recall, the fraction of actual positives that the classifier finds.

In [11]:
scorings_class = ['accuracy', 'roc_auc', 'precision', 'recall']
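
To show how these four metrics relate to the counts of correct and incorrect predictions, here is a small worked example on made-up labels and scores (all values are hypothetical). The manual formulas are compared with the corresponding functions from `sklearn.metrics`.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score)

# Hypothetical true labels, predicted labels and predicted probabilities
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.35, 0.6]

# Confusion-matrix counts computed by hand
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

manual_accuracy = (tp + tn) / len(y_true)
manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)

print(manual_accuracy, accuracy_score(y_true, y_pred))
print(manual_precision, precision_score(y_true, y_pred))
print(manual_recall, recall_score(y_true, y_pred))
# AUC uses the scores, not the thresholded labels: it is the share of
# (positive, negative) pairs in which the positive has the higher score
auc = roc_auc_score(y_true, y_score)
print(auc)
```

A model can have high accuracy and still a modest recall when the positive class is the smaller one, which is why the tables in the following sections report all four values.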

### Machine learning methods

Now that we know the subsets and the way the models will be evaluated, we can present the methods we will use. As the data contains many missing values and we would prefer not to drop the rows containing them, we should use methods that handle missing values natively and still perform well.

We have decided to implement three different ML algorithms:
- XGBoost,
- CatBoost,
- LightGBM.

They are state-of-the-art algorithms that can be used as classifiers and accept inputs with missing values, so no data has to be excluded.

### Training process

For all of the models we use 5-fold cross-validation, so that the entire training set is used for both training and validation in a consistent way. We then average the results and the values of the evaluation metrics over the folds, which gives a more reliable estimate of how well the classifiers generalize. In addition, once we have the classification results, we tune the hyperparameters of each method on the best performing subset in order to calibrate the model, which should result in a better fit and improved performance.
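
The cross-validation and averaging described here can be wrapped in a small helper. This is only a sketch: the function name `mean_cv_scores` is hypothetical, the data is synthetic, and scikit-learn's `GradientBoostingClassifier` is used as a stand-in so that the example runs without the boosting libraries installed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate


def mean_cv_scores(model, X, y, scoring, cv=5):
    """Run k-fold cross-validation and return the mean of each metric."""
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # cross_validate stores the per-fold values under 'test_<metric>'
    return {metric: float(np.mean(scores[f"test_{metric}"])) for metric in scoring}


# Synthetic binary classification data as a stand-in for the real features
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
scoring = ["accuracy", "roc_auc", "precision", "recall"]

summary = mean_cv_scores(GradientBoostingClassifier(random_state=0), X, y, scoring)
print(summary)
```

Any estimator with the scikit-learn interface can be passed as `model`, so the same helper could be called once per classifier and feature subset.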

### XGBoost

We create an instance of the classifier and train it using cross-validation. For each subset of features we compute the evaluation metrics and present them in the form of a table.

In [12]:
model_XGB = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')

In [13]:
XGB_scores_1 = skm.cross_validate(model_XGB, features, labels, cv=5,scoring=scorings_class)
XGB_accuracy_1 = np.mean(XGB_scores_1['test_accuracy'])
XGB_roc_auc_1 = np.mean(XGB_scores_1['test_roc_auc'])
XGB_precision_1 = np.mean(XGB_scores_1['test_precision'])
XGB_recall_1 = np.mean(XGB_scores_1['test_recall'])

In [14]:
XGB_scores_2 = skm.cross_validate(model_XGB, features_selected_10, labels_no_nan, cv=5,scoring=scorings_class)
XGB_accuracy_2 = np.mean(XGB_scores_2['test_accuracy'])
XGB_roc_auc_2 = np.mean(XGB_scores_2['test_roc_auc'])
XGB_precision_2 = np.mean(XGB_scores_2['test_precision'])
XGB_recall_2 = np.mean(XGB_scores_2['test_recall'])

In [15]:
XGB_scores_3 = skm.cross_validate(model_XGB, features_pca, labels_no_nan, cv=5,scoring=scorings_class)
XGB_accuracy_3 = np.mean(XGB_scores_3['test_accuracy'])
XGB_roc_auc_3 = np.mean(XGB_scores_3['test_roc_auc'])
XGB_precision_3 = np.mean(XGB_scores_3['test_precision'])
XGB_recall_3 = np.mean(XGB_scores_3['test_recall'])

In the table below we can see the numerical values of the quality metrics. The best performing subset is the first one: its values are the highest in every column, with accuracy reaching approximately 93%, which can be considered a good result.

In [16]:
scores_XGB_dict = {'Accuracy':[XGB_accuracy_1,XGB_accuracy_2,XGB_accuracy_3],
              'AUC':[XGB_roc_auc_1,XGB_roc_auc_2,XGB_roc_auc_3],
              'Precision':[XGB_precision_1,XGB_precision_2,XGB_precision_3],
              'Recall':[XGB_recall_1,XGB_recall_2,XGB_recall_3]}
scores_XGB_table = pd.DataFrame.from_dict(scores_XGB_dict, orient='index',
                       columns=['subset 1', 'subset 2', 'subset 3']).transpose()
scores_XGB_table

| | Accuracy | AUC | Precision | Recall |
|---|---|---|---|---|
| subset 1 | 0.93225 | 0.966074 | 0.882411 | 0.766728 |
| subset 2 | 0.884211 | 0.878074 | 0.831397 | 0.533154 |
| subset 3 | 0.90031 | 0.922246 | 0.877634 | 0.587048 |


#### Hyperparameter tuning

As mentioned before, we now take the first subset and try to calibrate the model's hyperparameters in order to boost its performance. We look for the optimal values with the grid search method, which evaluates every combination in the grid with the chosen scoring methods so that we can pick the values for which the model is the most accurate.

In [None]:
model_XGB_to_tune = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', colsample_bylevel=0.9, colsample_bytree=0.1)

param_grid = {'reg_alpha': [0,0.3,0.6,1,2],
             'reg_lambda': [0,0.3,0.6,1,2]}
 
grid = GridSearchCV(model_XGB_to_tune, param_grid, scoring = scorings_class, cv = 5, verbose = 3, refit = False)
grid.fit(features, labels)
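
With several scoring metrics and `refit=False`, the fitted search object does not expose `best_params_`, so the best combination has to be read from `cv_results_`. The sketch below shows one way to do this. It uses synthetic data and a `DecisionTreeClassifier` with a hypothetical parameter grid as a stand-in, so that it runs quickly and independently of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a small stand-in model, for illustration only
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
toy_param_grid = {"max_depth": [2, 4], "min_samples_leaf": [1, 5]}

toy_grid = GridSearchCV(DecisionTreeClassifier(random_state=0), toy_param_grid,
                        scoring=["accuracy", "roc_auc"], cv=5, refit=False)
toy_grid.fit(X, y)

# cv_results_ has one row per parameter combination and one
# 'mean_test_<metric>' column per scoring metric
results = pd.DataFrame(toy_grid.cv_results_)
ranked = results.sort_values("mean_test_accuracy", ascending=False)
best_params = ranked.iloc[0]["params"]
print(ranked[["params", "mean_test_accuracy", "mean_test_roc_auc"]])
print(best_params)
```

Sorting by a different column, for example `mean_test_roc_auc`, selects the best combination according to another metric.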

It turns out that setting colsample_bylevel to 0.9 and colsample_bytree to 0.1 boosted the performance, so we repeat the training process with these parameters and check whether there is in fact an improvement.

In [17]:
model_XGB_tuned = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', colsample_bylevel=0.9, colsample_bytree=0.1)
XGB_scores_tuned = skm.cross_validate(model_XGB_tuned, features, labels, cv=5,scoring=scorings_class)
XGB_accuracy_tuned = np.mean(XGB_scores_tuned['test_accuracy'])
XGB_roc_auc_tuned = np.mean(XGB_scores_tuned['test_roc_auc'])
XGB_precision_tuned = np.mean(XGB_scores_tuned['test_precision'])
XGB_recall_tuned = np.mean(XGB_scores_tuned['test_recall'])

As the table below shows, the values of all metrics have increased, so we can claim that the tuning process was successful. We can now move on and check the performance of the remaining methods.

In [18]:
scores_XGB_dict_tuned = {'Accuracy':[XGB_accuracy_tuned],
              'AUC':[XGB_roc_auc_tuned],
              'Precision':[XGB_precision_tuned],
              'Recall':[XGB_recall_tuned]}
scores_XGB_table_tuned = pd.DataFrame.from_dict(scores_XGB_dict_tuned, orient='index',
                       columns=['subset 1']).transpose()
scores_XGB_table_tuned

| | Accuracy | AUC | Precision | Recall |
|---|---|---|---|---|
| subset 1 | 0.94425 | 0.972501 | 0.924426 | 0.789035 |


### CatBoost

As before, we follow the same training path with the same algorithms and resources, which allows us to check how this method behaves.

In [19]:
model_cat = ct.CatBoostClassifier(verbose = False)

In [20]:
Cat_scores_1 = skm.cross_validate(model_cat, features, labels, cv=5,scoring=scorings_class)
Cat_accuracy_1 = np.mean(Cat_scores_1['test_accuracy'])
Cat_roc_auc_1 = np.mean(Cat_scores_1['test_roc_auc'])
Cat_precision_1 = np.mean(Cat_scores_1['test_precision'])
Cat_recall_1 = np.mean(Cat_scores_1['test_recall'])

In [21]:
Cat_scores_2 = skm.cross_validate(model_cat, features_selected_10, labels_no_nan, cv=5,scoring=scorings_class)
Cat_accuracy_2 = np.mean(Cat_scores_2['test_accuracy'])
Cat_roc_auc_2 = np.mean(Cat_scores_2['test_roc_auc'])
Cat_precision_2 = np.mean(Cat_scores_2['test_precision'])
Cat_recall_2 = np.mean(Cat_scores_2['test_recall'])

In [22]:
Cat_scores_3 = skm.cross_validate(model_cat, features_pca, labels_no_nan, cv=5,scoring=scorings_class)
Cat_accuracy_3 = np.mean(Cat_scores_3['test_accuracy'])
Cat_roc_auc_3 = np.mean(Cat_scores_3['test_roc_auc'])
Cat_precision_3 = np.mean(Cat_scores_3['test_precision'])
Cat_recall_3 = np.mean(Cat_scores_3['test_recall'])

The table shows that CatBoost is again the most accurate when applied to the first subset, where all of the metrics take their highest values.

In [23]:
scores_Cat_dict = {'Accuracy':[Cat_accuracy_1,Cat_accuracy_2,Cat_accuracy_3],
              'AUC':[Cat_roc_auc_1,Cat_roc_auc_2,Cat_roc_auc_3],
              'Precision':[Cat_precision_1,Cat_precision_2,Cat_precision_3],
              'Recall':[Cat_recall_1,Cat_recall_2,Cat_recall_3]}
scores_Cat_table = pd.DataFrame.from_dict(scores_Cat_dict, orient='index',
                       columns=['subset 1', 'subset 2', 'subset 3']).transpose()
scores_Cat_table

| | Accuracy | AUC | Precision | Recall |
|---|---|---|---|---|
| subset 1 | 0.9395 | 0.97263 | 0.938299 | 0.749398 |
| subset 2 | 0.890093 | 0.877535 | 0.895433 | 0.511568 |
| subset 3 | 0.9 | 0.928736 | 0.886105 | 0.576243 |


#### Hyperparameter tuning

This time as well, we can search for the optimal hyperparameters of the model using grid search.

In [None]:
model_Cat_to_tune = ct.CatBoostClassifier(verbose = False,iterations=500)
param_grid = {'learning_rate': [0.01, 0.1,0.5,0.9],
             'random_strength': [0.2,0.5,0.8]}
 
grid_cat = GridSearchCV(model_Cat_to_tune, param_grid, scoring = scorings_class, cv = 5, verbose = 3, refit = False)
grid_cat.fit(features, labels)

It appears that no combination of these parameters boosted the performance. The only change that gave slightly better evaluations was setting the number of iterations of the algorithm to 500 (CatBoost's default is 1000).

In [24]:
model_Cat_tuned = ct.CatBoostClassifier(verbose = False, iterations=500)

In [25]:
Cat_scores_tuned = skm.cross_validate(model_Cat_tuned, features, labels, cv=5,scoring=scorings_class)
Cat_accuracy_tuned = np.mean(Cat_scores_tuned['test_accuracy'])
Cat_roc_auc_tuned = np.mean(Cat_scores_tuned['test_roc_auc'])
Cat_precision_tuned = np.mean(Cat_scores_tuned['test_precision'])
Cat_recall_tuned = np.mean(Cat_scores_tuned['test_recall'])

The table below indicates that the accuracy along with remaining metrics is slightly better for the tuned model.

In [26]:
scores_Cat_dict_tuned = {'Accuracy':[Cat_accuracy_tuned],
              'AUC':[Cat_roc_auc_tuned],
              'Precision':[Cat_precision_tuned],
              'Recall':[Cat_recall_tuned]}
scores_Cat_table_tuned = pd.DataFrame.from_dict(scores_Cat_dict_tuned, orient='index',
                       columns=['subset 1']).transpose()
scores_Cat_table_tuned

| | Accuracy | AUC | Precision | Recall |
|---|---|---|---|---|
| subset 1 | 0.941 | 0.973182 | 0.941762 | 0.754344 |


### LightGBM

Now we consider the last of the proposed methods. The procedure is the same as before: we apply the method to the three subsets and check their performance using the evaluation metrics.

In [27]:
model_LGB = lgb.LGBMClassifier()
LGB_scores_1 = skm.cross_validate(model_LGB, features, labels, cv=5,scoring=scorings_class)
LGB_accuracy_1 = np.mean(LGB_scores_1['test_accuracy'])
LGB_roc_auc_1 = np.mean(LGB_scores_1['test_roc_auc'])
LGB_precision_1 = np.mean(LGB_scores_1['test_precision'])
LGB_recall_1 = np.mean(LGB_scores_1['test_recall'])

In [28]:
LGB_scores_2 = skm.cross_validate(model_LGB, features_selected_10, labels_no_nan, cv=5,scoring=scorings_class)
LGB_accuracy_2 = np.mean(LGB_scores_2['test_accuracy'])
LGB_roc_auc_2 = np.mean(LGB_scores_2['test_roc_auc'])
LGB_precision_2 = np.mean(LGB_scores_2['test_precision'])
LGB_recall_2 = np.mean(LGB_scores_2['test_recall'])

In [29]:
LGB_scores_3 = skm.cross_validate(model_LGB, features_pca, labels_no_nan, cv=5,scoring=scorings_class)
LGB_accuracy_3 = np.mean(LGB_scores_3['test_accuracy'])
LGB_roc_auc_3 = np.mean(LGB_scores_3['test_roc_auc'])
LGB_precision_3 = np.mean(LGB_scores_3['test_precision'])
LGB_recall_3 = np.mean(LGB_scores_3['test_recall'])

Unsurprisingly, the situation is similar to the previous ones: the best performing subset is the first one, whose accuracy exceeds 93%.

In [30]:
scores_LGB_dict = {'Accuracy':[LGB_accuracy_1,LGB_accuracy_2,LGB_accuracy_3],
              'AUC':[LGB_roc_auc_1,LGB_roc_auc_2,LGB_roc_auc_3],
              'Precision':[LGB_precision_1,LGB_precision_2,LGB_precision_3],
              'Recall':[LGB_recall_1,LGB_recall_2,LGB_recall_3]}
scores_LGB_table = pd.DataFrame.from_dict(scores_LGB_dict, orient='index',
                       columns=['subset 1', 'subset 2', 'subset 3']).transpose()
scores_LGB_table

| | Accuracy | AUC | Precision | Recall |
|---|---|---|---|---|
| subset 1 | 0.9375 | 0.969193 | 0.897921 | 0.779135 |
| subset 2 | 0.881115 | 0.876959 | 0.785137 | 0.562421 |
| subset 3 | 0.894118 | 0.924984 | 0.870004 | 0.556243 |


#### Hyperparameter tuning

We can try to improve the results by tuning the method's hyperparameters. It turns out that this did not provide any improvement, so we keep the original model and its evaluations.

In [None]:
model_LGB_to_tune = lgb.LGBMClassifier()
param_grid = {'max_depth' : [10,50,100,200,500]
             }
 
# Tune the LightGBM model defined above (the original call passed the CatBoost model by mistake)
grid_LGB = GridSearchCV(model_LGB_to_tune, param_grid, scoring = scorings_class, cv = 5, verbose = 3, refit = False)
grid_LGB.fit(features, labels)

### Final model

To predict the labels of the test data we have to pick one of the presented models. Judging by the evaluation metrics, the two best performing models are the tuned XGBoost and the tuned CatBoost, both on the first subset. The results are very close, but we treat accuracy as the most important factor, so we select XGBoost as the final model, as it achieves the highest accuracy (0.944 against 0.941).

To perform the final predictions we apply the same functions to the test data in order to structure it and prepare it as input.

In [31]:
df_groups_onehot_test= group_processing(data_json_test)
df_hobbies_onehot_test = hobbies_processing(df_csv_test)
onehot_csv_test, numerical_csv_norm_test = data_processing(df_csv_test)
merged_test = pd.concat([numerical_csv_norm_test, df_groups_onehot_test,df_hobbies_onehot_test,onehot_csv_test,pd.DataFrame(np.zeros(len(onehot_csv_test)))],axis = 1)
features_test = np.array(merged_test)
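
Because the features are passed to the model as NumPy arrays, the columns of the test matrix have to appear in the same order and number as in the training matrix. One-hot encoding the two sets separately can produce different columns, for example when a group occurs in only one of them. The sketch below shows one possible way to align them by column name. The frames `train_toy` and `test_toy` and their columns are made up, and whether such an alignment is needed for the real data is an assumption that would have to be checked by comparing `merged.columns` with `merged_test.columns`.

```python
import pandas as pd

# Hypothetical one-hot frames: the test set lacks 'Chess club'
# and contains 'Yoga', which was never seen in training
train_toy = pd.DataFrame({"Runners": [1, 0], "Chess club": [0, 1], "sex_F": [1, 0]})
test_toy = pd.DataFrame({"Runners": [1], "Yoga": [1], "sex_F": [0]})

# reindex() keeps exactly the training columns, in the training order:
# columns missing from the test set are filled with 0,
# columns unknown to the training set are dropped
test_aligned = test_toy.reindex(columns=train_toy.columns, fill_value=0)
print(test_aligned)
```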

We then take the tuned model and fit it on the whole training data with the first subset of variables. To complete the task we store the binary label predictions as well as the propensity probabilities and then write them to the final .csv file, where these values are saved for each user ID.

In [32]:
best_model = model_XGB_tuned
best_model.fit(features, labels)
predictions = best_model.predict(features_test)
# Column 1 of predict_proba holds the probability of the positive class
probabilities = best_model.predict_proba(features_test)[:, 1]

In [46]:
to_save = pd.DataFrame(df_csv_test['user_id'])
to_save['probability_of_one'] = probabilities
to_save['target'] = predictions
to_save.set_index('user_id').to_csv("test_results.csv")

## Summary

#### Insights

For an international chain of gyms it is crucial to analyse the behavior of potential new clients and subscribers, and predictive models allow an efficient selection of targets. The data on the surveyed people can be very insightful, and its correct preparation and processing is an essential part of building such models. Unfortunately, some information about users is missing, which may distort the quality of the data, and we judged that imputing the absent values would not be meaningful. Because of that, the range of machine learning methods that can be used for the classification is limited. Nevertheless, selecting three different algorithms and applying them to three different subsets produced valuable predictions. After comparing the evaluation metrics, with accuracy exceeding 90% for some of the models, we picked one model and applied it to the provided test data. The resulting predictions and probabilities were then saved in a final .csv file.



#### Limitations
This approach was suitable for the number of entries in the data, but with many more rows and surveyed users the dimensionality of the prepared data could become a serious obstacle and result in prolonged computation time. Another issue, mentioned before, is that values are missing for some attributes and people. Without proper steps to deal with them, the final results could be inaccurate and inconsistent. Finally, the functions created to prepare the data for the ML models work only for the structure of the data provided in the training and test files. If there were many inputs in which the layout of the information varied across files, a different approach would be preferable, such as natural language processing or even the use of ML methods or an API to scrape the needed data.