# Predicting Customer Satisfaction Level from Santander 

This project is hosted on the Kaggle platform (link in the cell below). The dataset is anonymized and consists of a large number of numeric variables.

https://www.kaggle.com/c/santander-customer-satisfaction

In [89]:
# Importing libraries and frameworks
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import scipy.stats
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer

## Importing dataset

In [None]:
# Importing train data
df_train = pd.read_csv("data/train.csv")
df_train.head()

In [None]:
# Importing test data
df_data_test = pd.read_csv("data/test.csv")
df_result_test = pd.read_csv("data/sample_submission.csv")
df_test = df_data_test.merge(df_result_test, on = 'ID')
df_test.head()

## Feature Engineering

In [None]:
print(len(df_train))
print(len(df_test))

In [None]:
# Saving the original column names before modifying them
original_col_names = df_train.columns

In [None]:
# Setting new column names to facilitate data manipulation.
new_col_names = ["ID"]
new_col_names = new_col_names + (["var" + str(i) for i in range(1,370)])
new_col_names = new_col_names + ["TARGET"]

df_train.columns = new_col_names
df_test.columns = new_col_names
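Because the generic names hide the original ones, it can help to keep a lookup table between the two. The sketch below shows one way to do this on a toy frame; the column names `col_a` and `col_b` and the `name_map` dictionary are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the training data; "col_a" and "col_b" are made-up names.
original_cols = ["ID", "col_a", "col_b", "TARGET"]
toy = pd.DataFrame([[1, 2, 23, 0], [3, 2, 34, 1]], columns=original_cols)

# Build generic names the same way as the cell above, sized to the toy frame.
n_features = toy.shape[1] - 2  # every column except ID and TARGET
generic_cols = ["ID"] + ["var" + str(i) for i in range(1, n_features + 1)] + ["TARGET"]

# Map each generic name back to its original name before renaming.
name_map = dict(zip(generic_cols, original_cols))
toy.columns = generic_cols

print(name_map)
```

With the real data, the same dictionary could be built from `new_col_names` and `original_col_names`.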

Because the number of features is very large, I applied Principal Component Analysis (PCA) to reduce the dimensionality of the dataset and to simplify the analyses and transformations that precede the creation of the predictive model.
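The PCA step itself is not shown in this part of the notebook. As a rough illustration only, the sketch below applies scikit-learn's `PCA` to synthetic data; the data, the standardization step, and the 95% explained-variance target are all assumptions, not the settings used in this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 10 features driven by 2 latent factors plus small noise.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(200, 10))

# PCA is sensitive to scale, so standardize each column first.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps enough components to explain that share of variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```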

In [85]:
# Checking for missing values
# (pd.isnull is an alias of pd.isna, so the last two checks repeat the first two)
print(pd.isna(df_train).any().any())
print(pd.isna(df_test).any().any())
print(pd.isnull(df_train).any().any())
print(pd.isnull(df_test).any().any())

False
False
False
False


I saved the dataset before applying transformations. Then I dropped the "ID" variable and converted all independent variables to float. Finally, I selected the features whose proportion of non-zero values is greater than 60%.

In [78]:
# Saving dataset
df_train.to_csv("data/df_train.csv", index=False)
df_test.to_csv("data/df_test.csv", index=False)
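The non-zero filter mentioned before the saving cell is not shown in the notebook. A minimal sketch of how such a filter could look, on a made-up toy frame (the values and the `nonzero_share` and `kept` names are illustrative assumptions):

```python
import pandas as pd

# Toy data; only the column layout mirrors the notebook.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "var1": [0, 0, 0, 0, 7],   # 20% non-zero
    "var2": [1, 2, 0, 4, 5],   # 80% non-zero
    "var3": [3, 0, 0, 0, 2],   # 40% non-zero
    "TARGET": [0, 0, 1, 0, 0],
})

# Drop ID and TARGET, then convert the independent variables to float.
features = toy.drop(columns=["ID", "TARGET"]).astype(float)

# Share of non-zero entries per column.
nonzero_share = (features != 0).mean()

# Keep the columns whose non-zero share is greater than 60%.
kept = nonzero_share[nonzero_share > 0.6].index.tolist()
print(kept)
```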

### Removing duplicated features

I eliminated duplicate columns by applying the pandas drop_duplicates function to the transposed dataframe.

In [113]:
df_no_duplicate_train = df_train.T.drop_duplicates(keep='first').T
df_no_duplicate_test = df_test[list(df_no_duplicate_train.columns)]
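To see what the transpose trick does, here is a small example on made-up data. It also shows a variant based on `duplicated`, which avoids transposing back and so leaves the column dtypes untouched; that variant is a suggestion, not what the notebook uses.

```python
import pandas as pd

# Toy frame in which var2 is an exact copy of var1.
toy = pd.DataFrame({"var1": [1, 0, 3], "var2": [1, 0, 3], "var3": [5, 5, 5]})

# Same approach as the cell above: duplicate rows of the transpose are duplicate columns.
deduped = toy.T.drop_duplicates(keep="first").T

# Variant: build a boolean mask of first occurrences and select those columns.
mask = ~toy.T.duplicated(keep="first")
deduped_alt = toy.loc[:, mask.values]

print(list(deduped.columns), list(deduped_alt.columns))
```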

### Applying more filters to features

Before filtering the features, I normalized the dataset with scikit-learn's Normalizer, which rescales each row (sample) to unit norm.

In [115]:
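# Normalizer rescales each row to unit L2 norm. It is stateless, so calling
# fit_transform separately on train and test shares nothing between them.
# Note: TARGET is still among the columns here and is rescaled with the features.
train_features = df_no_duplicate_train.drop("ID", axis=1)
normalizer = Normalizer()
df_normalized_train = pd.DataFrame(normalizer.fit_transform(train_features),
                                  columns = train_features.columns)

test_features = df_no_duplicate_test.drop("ID", axis=1)
normalizer = Normalizer()
df_normalized_test = pd.DataFrame(normalizer.fit_transform(test_features),
                                  columns = test_features.columns)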

df_normalized_train.to_csv("data/df_normalized_train.csv", index=False)
df_normalized_test.to_csv("data/df_normalized_test.csv", index=False)
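Since `Normalizer` works row by row, its effect differs from column-wise scaling. The small example below, on made-up numbers, contrasts it with `StandardScaler` (imported at the top of the notebook).

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [0.0, 5.0]])

# Normalizer: every row is divided by its own L2 norm.
row_scaled = Normalizer().fit_transform(X)

# StandardScaler: every column is shifted to mean 0 and scaled to unit variance.
col_scaled = StandardScaler().fit_transform(X)

print(row_scaled)
print(col_scaled)
```

The first two rows point in the same direction, so after row normalization they become identical; the difference in magnitude between samples is discarded.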

### Removing Constant and Quasi-Constant Features Using Variance Threshold

I removed constant and quasi-constant variables from the dataset using a variance threshold.

In [121]:
# Removing constant and quasi-constant columns (variance <= 0.0001)
features_train = df_normalized_train.drop("TARGET", axis=1)
constant_filter = VarianceThreshold(threshold=0.0001)
constant_filter.fit(features_train)

# get_support() is True for the columns that pass the threshold, so invert it
filtered_columns = features_train.columns[~constant_filter.get_support()]

df_filtered_train = df_normalized_train.drop(filtered_columns, axis=1)
df_filtered_test = df_normalized_test.drop(filtered_columns, axis=1)

In [122]:
df_filtered_train.shape

(76020, 42)
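To illustrate how `VarianceThreshold` decides which columns survive, here is a toy example using the same threshold as above; the data is made up.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    "var1": [1.0, 1.0, 1.0, 1.0],    # constant: variance 0
    "var2": [0.0, 0.0, 0.0, 0.01],   # quasi-constant: variance below 0.0001
    "var3": [0.1, 0.5, 0.9, 0.3],    # clearly varying
})

selector = VarianceThreshold(threshold=0.0001)
selector.fit(toy)

# get_support() marks the columns whose variance is above the threshold.
kept = toy.columns[selector.get_support()].tolist()
dropped = toy.columns[~selector.get_support()].tolist()

print(kept, dropped)
```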

### Removing Highly Correlated Features

I removed the highly correlated independent features. To do this, I computed a correlation matrix and dropped every feature whose correlation with another feature is greater than 0.8.

In [123]:
correlation_matrix = df_filtered_train.corr()
correlated_features = set()

# Collect every feature that has a correlation above 0.8 with another feature.
# The matrix is symmetric, so both members of each such pair are collected.
for rowname in correlation_matrix.columns:
    for colname in correlation_matrix.columns:
        if rowname != colname and correlation_matrix.loc[rowname, colname] > 0.8:
            correlated_features.add(rowname)

df_filter_norm_train = df_filtered_train.drop(labels=list(correlated_features), axis=1)
df_filter_norm_test = df_filtered_test.drop(labels=list(correlated_features), axis=1)
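The cell above removes both members of every pair with correlation above 0.8 and ignores strong negative correlations. A common alternative, sketched here on synthetic data, looks only at the upper triangle of the absolute correlation matrix so that one feature from each pair is kept. This is a suggested variant under those assumptions, not the method used in the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Synthetic features: var2 nearly copies var1, var3 nearly mirrors it, var4 is independent.
toy = pd.DataFrame({
    "var1": base,
    "var2": base + 0.01 * rng.normal(size=200),
    "var3": -base + 0.01 * rng.normal(size=200),
    "var4": rng.normal(size=200),
})

# Absolute correlations, keeping only entries strictly above the diagonal.
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop a column if it is highly correlated with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop, list(reduced.columns))
```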

In [130]:
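# index=False keeps the row index out of the file, consistent with the earlier saves
df_filter_norm_train.to_csv("data/df_filter_norm_train.csv", index=False)
df_filter_norm_test.to_csv("data/df_filter_norm_test.csv", index=False)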

In [145]:
df_reduced_train = df_train[list(df_filter_norm_train.columns)]
df_reduced_test = df_test[list(df_filter_norm_train.columns)]

In [None]:
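# to_csv() without a path only returns the CSV text and saves nothing.
# The file names below are assumed from the naming pattern of the earlier cells.
df_reduced_train.to_csv("data/df_reduced_train.csv", index=False)
df_reduced_test.to_csv("data/df_reduced_test.csv", index=False)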