# Santander Customer Transaction Prediction
https://www.kaggle.com/c/santander-customer-transaction-prediction/data

## Data description

### File descriptions

    train.csv - the training set.
    test.csv - the test set. The test set contains some rows which are not included in scoring.
    sample_submission.csv - a sample submission file in the correct format.
    
    
### Data Fields

You are provided with an anonymized dataset containing numeric feature variables, the binary target column, and a string ID_code column.

The task is to predict the value of the `target` column in the test set.

## First look at the data

### Library import

In [1]:
# Library import 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.preprocessing import normalize
import seaborn as sns

%matplotlib inline

ModuleNotFoundError: No module named 'seaborn'

In [None]:
# Reading in the csv files

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

### Data fields

In [None]:
train.head()

In [None]:
train.info()

In [None]:
test.head()

In [None]:
test.info()

In [None]:
train_IDtarget = pd.DataFrame(columns=['ID_code', 'target'])
test_ID = pd.DataFrame(columns=['ID_code'])

train_IDtarget['ID_code'] = train['ID_code']
train_IDtarget['target'] = train['target']
test_ID['ID_code'] = test['ID_code']

In [None]:
train = train.drop(['ID_code', 'target'], axis=1)
train = normalize(train)
train = pd.DataFrame(data=train)
train = pd.concat([train_IDtarget, train], axis=1)

test = test.drop(['ID_code'], axis=1)
test = normalize(test)
test = pd.DataFrame(data=test)
test = pd.concat([test_ID, test], axis=1)
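
Note that `sklearn.preprocessing.normalize` rescales each row (sample) to unit L2 norm by default; it does not scale each feature column. The sketch below uses a small made-up array to contrast this with column-wise standardization via `StandardScaler`. Whether row-wise scaling is what this notebook intends is left open here; the array and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Toy data (made up for illustration): 3 samples, 2 features
X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [1.0, 0.0]])

# normalize() works per row by default (axis=1, norm='l2'):
# every sample ends up with Euclidean length 1
row_scaled = normalize(X)
row_norms = np.linalg.norm(row_scaled, axis=1)

# StandardScaler works per column: every feature gets mean 0 and unit variance
col_scaled = StandardScaler().fit_transform(X)
col_means = col_scaled.mean(axis=0)

print(row_scaled)
print(row_norms)
print(col_means)
```

In this toy example the first two rows become identical after `normalize`, because they point in the same direction and only differ in length.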

In [None]:
train.head()

In [None]:
train.info()

In [None]:
test.head()

In [None]:
test.info()

In [None]:
f, axes = plt.subplots(2, 2, figsize=(7, 7), sharex=True)
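# After pd.DataFrame(data=train) the feature columns carry integer labels (0, 1, 2, ...), not strings
# distplot is deprecated in seaborn >= 0.11; histplot(..., kde=True) is the replacement
sns.distplot(train[1], color='skyblue', ax=axes[0, 0])
sns.distplot(train[2], color='olive', ax=axes[0, 1])
sns.distplot(train[3], color='gold', ax=axes[1, 0])
sns.distplot(train[4], color='teal', ax=axes[1, 1])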

In [None]:
# A pair plot over every feature column is impractical, so only a few columns are shown
sns.pairplot(train[[1, 2, 3, 4]])
plt.show();

In [None]:
pd.plotting.scatter_matrix(train[[1, 2, 3, 4]], alpha=0.3, figsize=(14, 8), diagonal='kde');

## Data cleanup and feature engineering
### Data fields

Things to try:

*Training data*
- Text

## Final dataset and normalization

### Training dataset

The feature engineering applied to the training dataset is applied to the test dataset in the same way.

In [None]:
full_train = train

### Test dataset

##### Compiling test datafields, sentiment, CNN adoption prediction and dataset merger

In [None]:
full_test = test

## Score function

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
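
Because the metric is computed on predicted probabilities, passing hard 0/1 labels to `roc_auc_score` discards the ranking information and usually gives a different number. The following is a minimal sketch on made-up labels and probabilities (not competition data) that shows the difference.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.45])

# AUC on probabilities uses the full ranking of the samples
auc_from_probs = roc_auc_score(y_true, y_prob)

# Thresholding at 0.5 first throws away the ranking information
y_label = (y_prob >= 0.5).astype(int)
auc_from_labels = roc_auc_score(y_true, y_label)

print(auc_from_probs, auc_from_labels)
```

For classifiers without `predict_proba`, the output of `decision_function` can be passed to `roc_auc_score` in the same way, since only the ordering of the scores matters.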


**Score function used for algorithm development**

For internal training, the scikit-learn classification metrics accuracy, precision, recall and F1 score are used in addition to the competition metric, the area under the ROC curve (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score).

In [None]:
# Importing libraries and creating scores table
# https://scikit-learn.org/stable/modules/model_evaluation.html
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

scores = pd.DataFrame(columns=['Classifier', 'Accuracy', 'Precision', 'Recall', 'F1', 'ROC'])
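
The same five metrics are computed for every classifier, so they could be gathered in one place. The sketch below is one possible way to do that; `score_row` is a hypothetical helper that is not part of this notebook, and the labels and scores are made up.

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)


def score_row(name, y_true, y_pred, y_score):
    """Return one row of metrics for a classifier (hypothetical helper).

    y_pred holds hard class labels; y_score holds probabilities or
    decision values for the positive class.
    """
    return {
        'Classifier': name,
        'Accuracy': accuracy_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred, average='weighted', zero_division=0),
        'Recall': recall_score(y_true, y_pred, average='weighted', zero_division=0),
        'F1': f1_score(y_true, y_pred, average='weighted', zero_division=0),
        'ROC': roc_auc_score(y_true, y_score),
    }


# Made-up example values
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
y_score = [0.2, 0.9, 0.4, 0.3]

rows = [score_row('demo', y_true, y_pred, y_score)]
demo_scores = pd.DataFrame(rows, columns=['Classifier', 'Accuracy', 'Precision', 'Recall', 'F1', 'ROC'])
print(demo_scores)
```

Collecting the rows in a list and building the DataFrame once avoids growing a DataFrame row by row.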

## Train validation split

We create a train/validation split on the training data before exposing the model to the test dataset. The training dataset contains 200,000 rows, and the provided test dataset also contains 200,000 rows.

The training set is split into 160,000 training rows and 40,000 validation rows. The models are trained on the training rows and evaluated on the validation rows before predicting on the test set. For simplicity, this held-out portion is referred to as the "validation set". Both parts are drawn from one comprehensive dataframe.

In [None]:
# Splitting the dataset into train and validation

from sklearn.model_selection import train_test_split

train_data, valid_data = train_test_split(full_train, train_size=0.8, shuffle=True, random_state=25)

print('Observations: %d' % (len(full_train)))
print('Training Observations: %d' % (len(train_data)))
print('Validation Observations: %d' % (len(valid_data)))
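
If the target classes are imbalanced, a stratified split keeps the class proportions the same in both parts. The sketch below is an optional variant of the split above, shown on a made-up frame with a 10% positive rate. The `demo` frame and its columns are illustrative assumptions, not the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with a 10% positive rate, standing in for full_train
demo = pd.DataFrame({
    'feature': np.arange(1000),
    'target': np.r_[np.ones(100, dtype=int), np.zeros(900, dtype=int)],
})

# stratify= preserves the share of each class in both parts of the split
demo_train, demo_valid = train_test_split(
    demo, train_size=0.8, shuffle=True, random_state=25, stratify=demo['target'])

print(len(demo_train), len(demo_valid))
print(demo_train['target'].mean(), demo_valid['target'].mean())
```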

## Develop the classifier model

The following classifier models will be tried for this project:
- https://scikit-learn.org/stable/modules/sgd.html#classification
- https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
- https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier
- https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC
- https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
- https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
- https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/

### Selecting model input data

In [None]:
# Separating features and targets; imputing missing values with the column mean
# (Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it)

from sklearn.impute import SimpleImputer

imputer = SimpleImputer()

train_variables = train_data.drop(['ID_code', 'target'], axis=1)

train_variables = imputer.fit_transform(train_variables)

train_targets = train_data['target']

valid_variables = valid_data.drop(['ID_code', 'target'], axis=1)

# Only transform here: the imputer is fitted on the training rows alone
valid_variables = imputer.transform(valid_variables)

valid_targets = valid_data['target']

### Stochastic Gradient Descent (SGD) classifier

In [None]:
# https://scikit-learn.org/stable/modules/sgd.html#classification

from sklearn.linear_model import SGDClassifier

SGD_model = SGDClassifier(alpha=0.0001, epsilon=0.1, eta0=0.0, fit_intercept=True, 
                          learning_rate='optimal', loss="hinge", penalty="l2", max_iter=5, random_state=25)

SGD_model.fit(train_variables, train_targets)

SGD_preds = SGD_model.predict(valid_variables)

classifier = 'SGD'
accuracy = accuracy_score(valid_targets, SGD_preds)
precision = precision_score(valid_targets, SGD_preds, average='weighted')
recall = recall_score(valid_targets, SGD_preds, average='weighted')
f1 = f1_score(valid_targets, SGD_preds, average='weighted')
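# AUC is computed on continuous decision values rather than hard labels
ROC = roc_auc_score(valid_targets, SGD_model.decision_function(valid_variables))

# DataFrame.append was removed in pandas 2.0; add the row by label instead
scores.loc[len(scores)] = [classifier, accuracy, precision, recall, f1, ROC]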

# Accuracy
print('Accuracy: %.2f' % accuracy)

# Precision
print('Precision: %.2f' % precision)

# Recall
print('Recall: %.2f' % recall)

# F1 score
print('F1 score: %.2f' % f1)

# ROC
print('ROC curve area: %.2f' % ROC)

### k neighbors classifier

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier

from sklearn.neighbors import KNeighborsClassifier

kneigh_model = KNeighborsClassifier(n_neighbors=15)

kneigh_model.fit(train_variables, train_targets)

kneigh_preds = kneigh_model.predict(valid_variables)

classifier = 'kNeigh'
accuracy = accuracy_score(valid_targets, kneigh_preds)
precision = precision_score(valid_targets, kneigh_preds, average='weighted')
recall = recall_score(valid_targets, kneigh_preds, average='weighted')
f1 = f1_score(valid_targets, kneigh_preds, average='weighted')
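# AUC is computed on positive-class probabilities rather than hard labels
ROC = roc_auc_score(valid_targets, kneigh_model.predict_proba(valid_variables)[:, 1])

# DataFrame.append was removed in pandas 2.0; add the row by label instead
scores.loc[len(scores)] = [classifier, accuracy, precision, recall, f1, ROC]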

# Accuracy
print('Accuracy: %.2f' % accuracy)

# Precision
print('Precision: %.2f' % precision)

# Recall
print('Recall: %.2f' % recall)

# F1 score
print('F1 score: %.2f' % f1)

# ROC
print('ROC curve area: %.2f' % ROC)

### Multi-layer perceptron (MLP) classifier

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier

from sklearn.neural_network import MLPClassifier

MLP_model = MLPClassifier(hidden_layer_sizes=(100, ), 
                          activation='relu', 
                          solver='adam', 
                          alpha=0.0001, 
                          batch_size='auto', 
                          learning_rate='constant', 
                          learning_rate_init=0.001,
                          random_state=25)

MLP_model.fit(train_variables, train_targets)

MLP_preds = MLP_model.predict(valid_variables)

classifier = 'MLP'
accuracy = accuracy_score(valid_targets, MLP_preds)
precision = precision_score(valid_targets, MLP_preds, average='weighted')
recall = recall_score(valid_targets, MLP_preds, average='weighted')
f1 = f1_score(valid_targets, MLP_preds, average='weighted')
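# AUC is computed on positive-class probabilities rather than hard labels
ROC = roc_auc_score(valid_targets, MLP_model.predict_proba(valid_variables)[:, 1])

# DataFrame.append was removed in pandas 2.0; add the row by label instead
scores.loc[len(scores)] = [classifier, accuracy, precision, recall, f1, ROC]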

# Accuracy
print('Accuracy: %.2f' % accuracy)

# Precision
print('Precision: %.2f' % precision)

# Recall
print('Recall: %.2f' % recall)

# F1 score
print('F1 score: %.2f' % f1)

# ROC
print('ROC curve area: %.2f' % ROC)

### Support vector classifier (SVC)

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

from sklearn.svm import SVC

SVC_model = SVC(random_state=25)

SVC_model.fit(train_variables, train_targets)

SVC_preds = SVC_model.predict(valid_variables)

classifier = 'SVC'
accuracy = accuracy_score(valid_targets, SVC_preds)
precision = precision_score(valid_targets, SVC_preds, average='weighted')
recall = recall_score(valid_targets, SVC_preds, average='weighted')
f1 = f1_score(valid_targets, SVC_preds, average='weighted')
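# AUC is computed on continuous decision values rather than hard labels
ROC = roc_auc_score(valid_targets, SVC_model.decision_function(valid_variables))

# DataFrame.append was removed in pandas 2.0; add the row by label instead
scores.loc[len(scores)] = [classifier, accuracy, precision, recall, f1, ROC]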

# Accuracy
print('Accuracy: %.2f' % accuracy)

# Precision
print('Precision: %.2f' % precision)

# Recall
print('Recall: %.2f' % recall)

# F1 score
print('F1 score: %.2f' % f1)

# ROC
print('ROC curve area: %.2f' % ROC)

### Linear support vector classifier (Linear SVC)

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC

from sklearn.svm import LinearSVC

LinSVC_model = LinearSVC(random_state=25)

LinSVC_model.fit(train_variables, train_targets)

LinSVC_preds = LinSVC_model.predict(valid_variables)

classifier = 'Linear SVC'
accuracy = accuracy_score(valid_targets, LinSVC_preds)
precision = precision_score(valid_targets, LinSVC_preds, average='weighted')
recall = recall_score(valid_targets, LinSVC_preds, average='weighted')
f1 = f1_score(valid_targets, LinSVC_preds, average='weighted')
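# AUC is computed on continuous decision values rather than hard labels
ROC = roc_auc_score(valid_targets, LinSVC_model.decision_function(valid_variables))

# DataFrame.append was removed in pandas 2.0; add the row by label instead
scores.loc[len(scores)] = [classifier, accuracy, precision, recall, f1, ROC]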

# Accuracy
print('Accuracy: %.2f' % accuracy)

# Precision
print('Precision: %.2f' % precision)

# Recall
print('Recall: %.2f' % recall)

# F1 score
print('F1 score: %.2f' % f1)

# ROC
print('ROC curve area: %.2f' % ROC)

### Decision Tree Classifier

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

from sklearn.tree import DecisionTreeClassifier

DecTree_model = DecisionTreeClassifier(random_state=25)

DecTree_model.fit(train_variables, train_targets)

DecTree_preds = DecTree_model.predict(valid_variables)

classifier = 'Decision Tree'
accuracy = accuracy_score(valid_targets, DecTree_preds)
precision = precision_score(valid_targets, DecTree_preds, average='weighted')
recall = recall_score(valid_targets, DecTree_preds, average='weighted')
f1 = f1_score(valid_targets, DecTree_preds, average='weighted')
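# AUC is computed on positive-class probabilities rather than hard labels
ROC = roc_auc_score(valid_targets, DecTree_model.predict_proba(valid_variables)[:, 1])

# DataFrame.append was removed in pandas 2.0; add the row by label instead
scores.loc[len(scores)] = [classifier, accuracy, precision, recall, f1, ROC]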

# Accuracy
print('Accuracy: %.2f' % accuracy)

# Precision
print('Precision: %.2f' % precision)

# Recall
print('Recall: %.2f' % recall)

# F1 score
print('F1 score: %.2f' % f1)

# ROC
print('ROC curve area: %.2f' % ROC)

### Gradient boosting ensemble classifier

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier

from sklearn.ensemble import GradientBoostingClassifier

GradBoost_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=25)

GradBoost_model.fit(train_variables, train_targets)

GradBoost_preds = GradBoost_model.predict(valid_variables)

classifier = 'Gradient Boost'
accuracy = accuracy_score(valid_targets, GradBoost_preds)
precision = precision_score(valid_targets, GradBoost_preds, average='weighted')
recall = recall_score(valid_targets, GradBoost_preds, average='weighted')
f1 = f1_score(valid_targets, GradBoost_preds, average='weighted')
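# AUC is computed on positive-class probabilities rather than hard labels
ROC = roc_auc_score(valid_targets, GradBoost_model.predict_proba(valid_variables)[:, 1])

# DataFrame.append was removed in pandas 2.0; add the row by label instead
scores.loc[len(scores)] = [classifier, accuracy, precision, recall, f1, ROC]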

# Accuracy
print('Accuracy: %.2f' % accuracy)

# Precision
print('Precision: %.2f' % precision)

# Recall
print('Recall: %.2f' % recall)

# F1 score
print('F1 score: %.2f' % f1)

# ROC
print('ROC curve area: %.2f' % ROC)

### Random forest ensemble classifier

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier

RndmForest_model = RandomForestClassifier(random_state=25)

RndmForest_model.fit(train_variables, train_targets)

RndmForest_preds = RndmForest_model.predict(valid_variables)

classifier = 'Random Forest'
accuracy = accuracy_score(valid_targets, RndmForest_preds)
precision = precision_score(valid_targets, RndmForest_preds, average='weighted')
recall = recall_score(valid_targets, RndmForest_preds, average='weighted')
f1 = f1_score(valid_targets, RndmForest_preds, average='weighted')
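# AUC is computed on positive-class probabilities rather than hard labels
ROC = roc_auc_score(valid_targets, RndmForest_model.predict_proba(valid_variables)[:, 1])

# DataFrame.append was removed in pandas 2.0; add the row by label instead
scores.loc[len(scores)] = [classifier, accuracy, precision, recall, f1, ROC]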

# Accuracy
print('Accuracy: %.2f' % accuracy)

# Precision
print('Precision: %.2f' % precision)

# Recall
print('Recall: %.2f' % recall)

# F1 score
print('F1 score: %.2f' % f1)

# ROC
print('ROC curve area: %.2f' % ROC)

### Bagging ensemble classifier

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier

from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier

base_estimator = GradientBoostingClassifier(learning_rate=0.1, random_state=25)

# The `base_estimator` argument was renamed to `estimator` in scikit-learn 1.2
Bagging_model = BaggingClassifier(estimator=base_estimator, random_state=25)

Bagging_model.fit(train_variables, train_targets)

Bagging_preds = Bagging_model.predict(valid_variables)

classifier = 'Bagging'
accuracy = accuracy_score(valid_targets, Bagging_preds)
precision = precision_score(valid_targets, Bagging_preds, average='weighted')
recall = recall_score(valid_targets, Bagging_preds, average='weighted')
f1 = f1_score(valid_targets, Bagging_preds, average='weighted')
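# AUC is computed on positive-class probabilities rather than hard labels
ROC = roc_auc_score(valid_targets, Bagging_model.predict_proba(valid_variables)[:, 1])

# DataFrame.append was removed in pandas 2.0; add the row by label instead
scores.loc[len(scores)] = [classifier, accuracy, precision, recall, f1, ROC]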

# Accuracy
print('Accuracy: %.2f' % accuracy)

# Precision
print('Precision: %.2f' % precision)

# Recall
print('Recall: %.2f' % recall)

# F1 score
print('F1 score: %.2f' % f1)

# ROC
print('ROC curve area: %.2f' % ROC)

### AdaBoost ensemble classifier

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

base_estimator = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=25)

# The `base_estimator` argument was renamed to `estimator` in scikit-learn 1.2
AdaBoost_model = AdaBoostClassifier(estimator=base_estimator,
                                    algorithm='SAMME',
                                    learning_rate=0.001,
                                    random_state=25)

AdaBoost_model.fit(train_variables, train_targets)

AdaBoost_preds = AdaBoost_model.predict(valid_variables)

classifier = 'AdaBoost'
accuracy = accuracy_score(valid_targets, AdaBoost_preds)
precision = precision_score(valid_targets, AdaBoost_preds, average='weighted')
recall = recall_score(valid_targets, AdaBoost_preds, average='weighted')
f1 = f1_score(valid_targets, AdaBoost_preds, average='weighted')
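# AUC is computed on positive-class probabilities rather than hard labels
ROC = roc_auc_score(valid_targets, AdaBoost_model.predict_proba(valid_variables)[:, 1])

# DataFrame.append was removed in pandas 2.0; add the row by label instead
scores.loc[len(scores)] = [classifier, accuracy, precision, recall, f1, ROC]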

# Accuracy
print('Accuracy: %.2f' % accuracy)

# Precision
print('Precision: %.2f' % precision)

# Recall
print('Recall: %.2f' % recall)

# F1 score
print('F1 score: %.2f' % f1)

# ROC
print('ROC curve area: %.2f' % ROC)

### XGBoost classifier

In [None]:
# https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/

from xgboost import XGBClassifier

XGBoost_model = XGBClassifier(learning_rate=0.001, random_state=25)

XGBoost_model.fit(train_variables, train_targets)

XGBoost_preds = XGBoost_model.predict(valid_variables)

classifier = 'XGBoost'
accuracy = accuracy_score(valid_targets, XGBoost_preds)
precision = precision_score(valid_targets, XGBoost_preds, average='weighted')
recall = recall_score(valid_targets, XGBoost_preds, average='weighted')
f1 = f1_score(valid_targets, XGBoost_preds, average='weighted')
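# AUC is computed on positive-class probabilities rather than hard labels
ROC = roc_auc_score(valid_targets, XGBoost_model.predict_proba(valid_variables)[:, 1])

# DataFrame.append was removed in pandas 2.0; add the row by label instead
scores.loc[len(scores)] = [classifier, accuracy, precision, recall, f1, ROC]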

# Accuracy
print('Accuracy: %.2f' % accuracy)

# Precision
print('Precision: %.2f' % precision)

# Recall
print('Recall: %.2f' % recall)

# F1 score
print('F1 score: %.2f' % f1)

# ROC
print('ROC curve area: %.2f' % ROC)

### Performance comparison

In [None]:
# Looking at performance scores

print(scores)

In [None]:
# Plotting accuracy comparison

plt.bar(scores['Classifier'], scores['Accuracy'], color='C1')
plt.xticks(rotation=90)
plt.ylabel('Accuracy Score')
plt.xlabel('Classifiers');

In [None]:
# Plotting precision comparison

plt.bar(scores['Classifier'], scores['Precision'], color='C5')
plt.xticks(rotation=90)
plt.ylabel('Precision Score')
plt.xlabel('Classifiers');

In [None]:
# Plotting recall comparison

plt.bar(scores['Classifier'], scores['Recall'], color='C8')
plt.xticks(rotation=90)
plt.ylabel('Recall Score')
plt.xlabel('Classifiers');

In [None]:
# Plotting F1 comparison

plt.bar(scores['Classifier'], scores['F1'], color='C9')
plt.xticks(rotation=90)
plt.ylabel('F1 Score')
plt.xlabel('Classifiers');

In [None]:
# Plotting ROC AUC comparison

plt.bar(scores['Classifier'], scores['ROC'], color='C2')
plt.xticks(rotation=90)
plt.ylabel('ROC curve area')
plt.xlabel('Classifiers');

### Top performer fine-tuning

In [None]:
# Use grid search https://machinelearningmastery.com/tune-learning-rate-for-gradient-boosting-with-xgboost-in-python/
# Params useful for grid search in gradient boosting: n_estimators and learning rate

from sklearn.model_selection import GridSearchCV

GradBoost_simple = GradientBoostingClassifier()

n_estimators = [25, 50, 100, 500]

learning_rate = [0.001, 0.01, 0.1]

param_grid = dict(learning_rate=learning_rate, n_estimators=n_estimators)

# scoring='roc_auc' scores on continuous model outputs; make_scorer(roc_auc_score) would score hard labels
grid_search = GridSearchCV(GradBoost_simple, param_grid, scoring='roc_auc', n_jobs=-1, cv=5, return_train_score=True)
grid_result = grid_search.fit(train_variables, train_targets)

# Summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

# Plot results
results = np.array(means).reshape(len(learning_rate), len(n_estimators))
for i, value in enumerate(learning_rate):
    plt.plot(n_estimators, results[i], label='learning_rate: ' + str(value))
plt.legend()
plt.title("Gradient boosting learning rate / n_estimators / ROC curve area")
plt.xlabel('n_estimators')
plt.ylabel("ROC curve area")
plt.savefig('Santander_ALGORITHM_gridsearch.png')
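
The reshape in the plotting code relies on the order in which `GridSearchCV` enumerates the parameter grid. The sketch below checks that order on a small synthetic problem. The dataset, grid values and `demo_` names are made up for illustration and are much smaller than the grid used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic problem (made up) so the search finishes quickly
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=25)

demo_learning_rate = [0.1, 0.5]
demo_n_estimators = [5, 10]
demo_grid = dict(learning_rate=demo_learning_rate, n_estimators=demo_n_estimators)

demo_search = GridSearchCV(GradientBoostingClassifier(random_state=25), demo_grid,
                           scoring='roc_auc', cv=3)
demo_search.fit(X_demo, y_demo)

# Parameter names are iterated in sorted order with the last name varying fastest,
# so each row of the reshaped array corresponds to one learning rate
demo_params = demo_search.cv_results_['params']
demo_means = demo_search.cv_results_['mean_test_score'].reshape(
    len(demo_learning_rate), len(demo_n_estimators))

print(demo_params[:2])
print(demo_means)
```

If a parameter whose name sorts before `learning_rate` were added to the grid, the reshape would need an extra leading dimension.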

## And the winners are..

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier

from sklearn.ensemble import GradientBoostingClassifier

GradBoost_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=25)

GradBoost_model.fit(train_variables, train_targets)

GradBoost_preds = GradBoost_model.predict(valid_variables)

classifier = 'Gradient Boost (Tuned)'
accuracy = accuracy_score(valid_targets, GradBoost_preds)
precision = precision_score(valid_targets, GradBoost_preds, average='weighted')
recall = recall_score(valid_targets, GradBoost_preds, average='weighted')
f1 = f1_score(valid_targets, GradBoost_preds, average='weighted')
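# AUC is computed on positive-class probabilities rather than hard labels
ROC = roc_auc_score(valid_targets, GradBoost_model.predict_proba(valid_variables)[:, 1])

# DataFrame.append was removed in pandas 2.0; add the row by label instead
scores.loc[len(scores)] = [classifier, accuracy, precision, recall, f1, ROC]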

# Accuracy
print('Accuracy: %.2f' % accuracy)

# Precision
print('Precision: %.2f' % precision)

# Recall
print('Recall: %.2f' % recall)

# F1 score
print('F1 score: %.2f' % f1)

# ROC
print('ROC curve area: %.2f' % ROC)

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier

from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier

base_estimator = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=25)

# The `base_estimator` argument was renamed to `estimator` in scikit-learn 1.2
Bagging_model = BaggingClassifier(n_estimators=250, estimator=base_estimator, random_state=25)

Bagging_model.fit(train_variables, train_targets)

Bagging_preds = Bagging_model.predict(valid_variables)

classifier = 'Bagging (Tuned)'
accuracy = accuracy_score(valid_targets, Bagging_preds)
precision = precision_score(valid_targets, Bagging_preds, average='weighted')
recall = recall_score(valid_targets, Bagging_preds, average='weighted')
f1 = f1_score(valid_targets, Bagging_preds, average='weighted')
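# AUC is computed on positive-class probabilities rather than hard labels
ROC = roc_auc_score(valid_targets, Bagging_model.predict_proba(valid_variables)[:, 1])

# DataFrame.append was removed in pandas 2.0; add the row by label instead
scores.loc[len(scores)] = [classifier, accuracy, precision, recall, f1, ROC]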

# Accuracy
print('Accuracy: %.2f' % accuracy)

# Precision
print('Precision: %.2f' % precision)

# Recall
print('Recall: %.2f' % recall)

# F1 score
print('F1 score: %.2f' % f1)

# ROC
print('ROC curve area: %.2f' % ROC)

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

base_estimator = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=25)

# The `base_estimator` argument was renamed to `estimator` in scikit-learn 1.2
AdaBoost_model = AdaBoostClassifier(estimator=base_estimator,
                                    n_estimators=250,
                                    algorithm='SAMME',
                                    learning_rate=0.1,
                                    random_state=25)

AdaBoost_model.fit(train_variables, train_targets)

AdaBoost_preds = AdaBoost_model.predict(valid_variables)

classifier = 'AdaBoost (Tuned)'
accuracy = accuracy_score(valid_targets, AdaBoost_preds)
precision = precision_score(valid_targets, AdaBoost_preds, average='weighted')
recall = recall_score(valid_targets, AdaBoost_preds, average='weighted')
f1 = f1_score(valid_targets, AdaBoost_preds, average='weighted')
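# AUC is computed on positive-class probabilities rather than hard labels
ROC = roc_auc_score(valid_targets, AdaBoost_model.predict_proba(valid_variables)[:, 1])

# DataFrame.append was removed in pandas 2.0; add the row by label instead
scores.loc[len(scores)] = [classifier, accuracy, precision, recall, f1, ROC]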

# Accuracy
print('Accuracy: %.2f' % accuracy)

# Precision
print('Precision: %.2f' % precision)

# Recall
print('Recall: %.2f' % recall)

# F1 score
print('F1 score: %.2f' % f1)

# ROC
print('ROC curve area: %.2f' % ROC)

In [None]:
print(scores)

## Writing predictions to submission file

In [None]:
# https://www.kaggle.com/gaborvecsei/adoption-speed-from-images

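# Predicting on the test features; the validation predictions do not match the test IDs in length
test_variables = imputer.transform(full_test.drop(['ID_code'], axis=1))

submission = pd.DataFrame()

submission['ID_code'] = full_test['ID_code']

# Positive-class probability, since submissions are scored on ROC AUC
submission['target'] = AdaBoost_model.predict_proba(test_variables)[:, 1]

submission.head()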

In [None]:
submission.info()

In [None]:
submission['target'].value_counts()

In [None]:
submission.to_csv('submission.csv', index=False)
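
Before uploading, it can help to confirm that the file has the two expected columns and one row per test ID. The sketch below builds a tiny submission from made-up IDs and probabilities and inspects the CSV text. The `demo_` names and values are illustrative assumptions, not model output.

```python
import numpy as np
import pandas as pd

# Made-up IDs and probabilities standing in for full_test['ID_code'] and model output
demo_ids = pd.Series(['test_0', 'test_1', 'test_2'], name='ID_code')
demo_probs = np.array([0.03, 0.71, 0.12])

# A length mismatch is a typical symptom of predicting on the wrong dataset
if len(demo_ids) != len(demo_probs):
    raise ValueError('Number of predictions does not match number of IDs')

demo_submission = pd.DataFrame({'ID_code': demo_ids, 'target': demo_probs})

# to_csv without a path returns the CSV content as a string
csv_text = demo_submission.to_csv(index=False)
print(csv_text)
```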

In [None]:
#train.to_csv('train_facets.csv', index=False)