# Deep Feature Synthesis - PCA - KNN and XGBoost


Score on Driven Data KNN: 0.6995
Score on Driven Data XGBoost: 0.7279

#**Summary:** For these submissions our goal was to learn a little bit about Deep Feature Synthesis for automated feature generation and see how these new features impacted our score. We must say the attempt was not as successfull as others, first because the score on Driven Data was much lower but mostly because the gap between the cross-validation scores where higher than the scores we got on Driven Data. This might be a matter of the way the DFS was implemented : because of time and computational power restrictions we decided to perform PCA after the feature generation to be able to fit somehow the outcome of DFS in our model but these did not seem to capture the data better than the original features. All and all it was an interesting experience for us to play a little bit with the different aggregation methos that DFS provides and hope to get a better understandment in future works, as we have found from experience that feature creation (at least manually) can have a very positive impact on prediction.

**Content:**
1. Data Loading
2. Data Cleaning
3. One hote encoding and scaling
4. Feature Creation
 4.1 K-means clustering for location
 4.2 Deep Feature Synthesis
5. PCA
6. Models
 6.1 KNN
 6.2 XGboost
7.Predictions

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PowerTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier


import featuretools as ft

import xgboost as xgb

%matplotlib inline

# 1. Data Loading

In [2]:
# load data
X_train = pd.read_csv('train_features.csv')
y_train = pd.read_csv('train_labels.csv')
X_test = pd.read_csv('test_features.csv')
y_test = pd.read_csv('submission_format.csv')

# merge features and labels on train set
train = X_train.copy()
train = train.merge(y_train, how = 'left', on = 'id')

In [3]:
# column to always drop
columns_to_drop = [

    'subvillage',
    'region_code',
    'district_code',
    'wpt_name',
    'recorded_by',
    'scheme_name',
    'management_group',
    'payment',
    'extraction_type_group',
    'extraction_type_class',
    'waterpoint_type_group',
    'quality_group',
    'quantity_group',
    'source_type',
    'source_class',
    'num_private', 
    'date_recorded',
  
]

In [4]:
# columns to drop for now
additional_columns_to_drop = [
    'funder',
    'installer',
    'amount_tsh',
    'lga',
    'ward',
    'scheme_management'
]

In [5]:
X_train.drop(columns_to_drop, axis = 1, inplace = True)
X_train.drop(additional_columns_to_drop, axis = 1, inplace = True)

X_test.drop(columns_to_drop, axis = 1, inplace = True)
X_test.drop(additional_columns_to_drop, axis = 1, inplace = True)

# 2. Data Cleaning

Following the same approach as usual.

In [6]:
# create a column storing the info whether construction year was recorded or not
X_train['construction_year_recorded'] = np.where(X_train.construction_year == 0, False, True)
X_test['construction_year_recorded'] = np.where(X_test.construction_year == 0, False, True)

# replace construction_year == 0 with the mean construction year
mean_construction_year = round(X_train.loc[X_train.construction_year != 0, 'construction_year'].mean(), 0)
X_train.loc[X_train.construction_year == 0, 'construction_year'] = mean_construction_year
X_test.loc[X_test.construction_year == 0, 'construction_year'] = mean_construction_year

In [7]:
# create a column storing the info whether longitude/latitude was recorded or not
X_train['longitude_recorded'] = np.where(abs(X_train.longitude) < 0.1, False, True)
X_train['latitude_recorded'] = np.where(abs(X_train.latitude) < 0.1, False, True)

X_test['longitude_recorded'] = np.where(X_test.longitude < 0.1, False, True)
X_test['latitude_recorded'] = np.where(X_test.latitude < 0.1, False, True)

In [8]:
# replace missing values in public_meeting with the majority category (True)
X_train.loc[X_train.public_meeting.isna(), 'public_meeting'] = True
X_test.loc[X_test.public_meeting.isna(), 'public_meeting'] = True

# replace missing values in permit with the majority category (True)
X_train.loc[X_train.permit.isna(), 'permit'] = True
X_test.loc[X_test.permit.isna(), 'permit'] = True

# 3. One hot encoding and scaling

In [9]:
# one-hot encoding
X_train = pd.get_dummies(X_train, 
                         prefix = X_train.select_dtypes('object').columns, 
                         columns = X_train.select_dtypes('object').columns,
                         drop_first = True
                        )

X_test = pd.get_dummies(X_test, 
                         prefix = X_test.select_dtypes('object').columns, 
                         columns = X_test.select_dtypes('object').columns,
                         drop_first = True
                        )

# power transformation of numerical columns
numerical_columns = X_train.select_dtypes(['int64', 'float64']).columns

pt = PowerTransformer()
X_train.loc[:,numerical_columns] = pt.fit_transform(X_train.loc[:,numerical_columns])
X_test.loc[:,numerical_columns] = pt.transform(X_test.loc[:,numerical_columns])

# add columns to test set that only exist in train set
X_test[list(set(X_train.columns).difference(set(X_test.columns)))[0]] = 0

# make sure columns are in the same order
X_train = X_train[sorted(X_train.columns)].copy()
X_test = X_test[sorted(X_test.columns)].copy()


# creating a mapping for the classes
classes = {
    'functional' : 0,
    'non functional' : 1,
    'functional needs repair' : 2
}

# create the inverse mapping
classes_inv = {v: k for k, v in classes.items()}

# map the target to numerical
y_train = y_train.status_group.map(classes)

# 4.Feature Creation


### 4.1 K-Means Clustering For Latitude and Longitude

With this approach we wanted to better capture different regions best on their geographical location instead of having the raw longitude and latitude coordinates.

In [10]:
#extracting only latitude and longitude from the features

X_train_geo= X_train[['latitude','longitude']]
X_test_geo= X_test[['latitude','longitude']]

In [11]:
#Create the clusters. We tried 9 clusters based on the number of basins found in the EDA

kmeans = KMeans(n_clusters = 9, init ='k-means++')
kmeans.fit(X_train_geo[X_train_geo.columns[0:2]]) 
X_train_geo['cluster_label'] = kmeans.fit_predict(X_train_geo[X_train_geo.columns[0:2]])

kmeans = KMeans(n_clusters = 9, init ='k-means++')
kmeans.fit(X_test_geo[X_test_geo.columns[0:2]])
X_test_geo['cluster_label'] = kmeans.fit_predict(X_test_geo[X_test_geo.columns[0:2]])


In [12]:
#dropping the longitude and latitude columns  from the features
columns_to_drop = ["longitude","latitude"]
X_train.drop(columns_to_drop, axis = 1, inplace = True)
X_test.drop(columns_to_drop, axis = 1, inplace = True)

#creating DataFrames with the new features
X_train_geo = pd.get_dummies(X_train_geo["cluster_label"], drop_first=True)
X_test_geo = pd.get_dummies(X_test_geo["cluster_label"], drop_first=True)

#concatenating them with the original features DF
X_train = pd.concat([X_train,X_train_geo], axis = 1)
X_test = pd.concat([X_test,X_test_geo], axis = 1)

#renaming the columns
X_train= X_train.rename(columns={1: "cluster_1",2:"cluster_2",3:"cluster_3",4:"cluster_4", 5:"cluster_5",6:"cluster_6",7:"cluster_7", 8: "cluster_8"})
X_test= X_test.rename(columns={1: "cluster_1",2:"cluster_2",3:"cluster_3",4:"cluster_4", 5:"cluster_5",6:"cluster_6",7:"cluster_7", 8: "cluster_8"})

In [13]:
X_test.index = range(len(X_train),(len(X_train) + len(X_test)))

In [14]:
X = pd.concat([X_train,X_test])


### 4.2 Deep Feature Synthesis 

To make sure that the same variables are created (we saw there was an issue when we split DFS into train and test), we are going to join train and test feature for the automated feature creation to later re-join them by the index.

In [None]:
#make sure the index in X_test starts after X_train begins for the join
X_test.index = range(len(X_train),(len(X_train) + len(X_test)))
#join the two dfs
X = pd.concat([X_train,X_test])


In [15]:

#defining my entity set
es = ft.EntitySet(id = 'id')

#defining boolean varialbes as boolean following approach by https://brendanhasz.github.io/2018/11/11/featuretools
BOOL = ft.variable_types.Boolean

#variable types dictionary for boolean variables not to be identified as numerical
variable_types = {
    'cluster_1':BOOL,
    'cluster_2':BOOL,
    'cluster_3':BOOL,
    'cluster_4':BOOL,
    'cluster_5':BOOL,
    'cluster_6':BOOL,
    'cluster_7':BOOL,
    'cluster_8':BOOL,
    'basin_Lake Nyasa': BOOL,
    'basin_Lake Rukwa': BOOL,
    'basin_Lake Tanganyika': BOOL,
    'basin_Lake Victoria': BOOL,
    'basin_Pangani': BOOL,
    'basin_Rufiji': BOOL,
    'basin_Ruvuma / Southern Coast': BOOL,
    'basin_Wami / Ruvu': BOOL,
    'extraction_type_cemo': BOOL,
    'extraction_type_climax': BOOL,
    'extraction_type_gravity': BOOL,
    'extraction_type_india mark ii': BOOL,
    'extraction_type_india mark iii': BOOL,
    'extraction_type_ksb': BOOL,
    'extraction_type_mono': BOOL,
    'extraction_type_nira/tanira': BOOL,
    'extraction_type_other': BOOL,
    'extraction_type_other - mkulima/shinyanga': BOOL,
    'extraction_type_other - play pump': BOOL,
    'extraction_type_other - rope pump': BOOL,
    'extraction_type_other - swn 81': BOOL,
    'extraction_type_submersible': BOOL,
    'extraction_type_swn 80': BOOL,
    'extraction_type_walimi': BOOL,
    'extraction_type_windmill': BOOL,
    'management_other': BOOL,
    'management_other - school': BOOL,
    'management_parastatal': BOOL,
    'management_private operator': BOOL,
    'management_trust': BOOL,
    'management_unknown': BOOL,
    'management_vwc': BOOL,
    'management_water authority': BOOL,
    'management_water board': BOOL,
    'management_wua': BOOL,
    'management_wug': BOOL,
    'payment_type_monthly': BOOL,
    'payment_type_never pay': BOOL,
    'payment_type_on failure': BOOL,
    'payment_type_other': BOOL,
    'payment_type_per bucket': BOOL,
    'payment_type_unknown': BOOL,
    'permit_True': BOOL,
    'public_meeting_True': BOOL,
    'quantity_enough': BOOL,
    'quantity_insufficient': BOOL,
    'quantity_seasonal': BOOL,
    'quantity_unknown': BOOL,
    'region_Dar es Salaam': BOOL,
    'region_Dodoma': BOOL,
    'region_Iringa': BOOL,
    'region_Kagera': BOOL,
    'region_Kigoma': BOOL,
    'region_Kilimanjaro': BOOL,
    'region_Lindi': BOOL,
    'region_Manyara': BOOL,
    'region_Mara': BOOL,
    'region_Mbeya': BOOL,
    'region_Morogoro': BOOL,
    'region_Mtwara': BOOL,
    'region_Mwanza': BOOL,
    'region_Pwani': BOOL,
    'region_Rukwa': BOOL,
    'region_Ruvuma': BOOL,
    'region_Shinyanga': BOOL,
    'region_Singida': BOOL,
    'region_Tabora': BOOL,
    'region_Tanga': BOOL,
    'source_hand dtw': BOOL,
    'source_lake': BOOL,
    'source_machine dbh': BOOL,
    'source_other': BOOL,
    'source_rainwater harvesting': BOOL,
    'source_river': BOOL,
    'source_shallow well': BOOL,
    'source_spring': BOOL,
    'source_unknown': BOOL,
    'water_quality_fluoride': BOOL,
    'water_quality_fluoride abandoned': BOOL,
    'water_quality_milky': BOOL,
    'water_quality_salty': BOOL,
    'water_quality_salty abandoned': BOOL,
    'water_quality_soft': BOOL,
    'water_quality_unknown': BOOL,
    'waterpoint_type_communal standpipe': BOOL,
    'waterpoint_type_communal standpipe multiple': BOOL,
    'waterpoint_type_dam': BOOL,
    'waterpoint_type_hand pump': BOOL,
    'waterpoint_type_improved spring': BOOL,
    'waterpoint_type_other': BOOL

}



In [16]:
#creating my entities for X_train and X_test with the defined variable types

es = es.entity_from_dataframe(entity_id = 'entity_id', dataframe = X, 
                              index = 'id', variable_types=variable_types)


In [17]:
#creating the new features using dfs algorithm. 
#For numerical values we chose the mean, skewness, max value and std as aggregators and multiply_numeric and divide_numeric as transformations
#For boolean variables, we selected "and" as the only transformation variable (if 2 booleans are true at the same time)

features, feature_names = ft.dfs(entityset=es, target_entity='entity_id',agg_primitives =["mean","skew","max","std"]
                        ,trans_primitives = ["and","multiply_numeric","divide_numeric"], max_depth = 2)


In [18]:
#set the index for the created features to split again after
features.index = (range(0,(len(X_train) + len(X_test))))

# 5. PCA for feature selection

As DFS left us with over 4K features, we are going to perform PCA to reduce them into 50 components, which is around half of the original number of features. As such we expect them to capture the principal components of the DFS, as inputting the 4k variables is over our computer processing capabilities.

In [40]:
#first we redefine X_train and X_test as the features we created
X_train_dfs = features.loc[0:59399,:]
X_test_dfs = features.loc[59400:74250,:]


PCA(copy=True, iterated_power='auto', n_components=50, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

In [41]:
#perform the PCA
pca = PCA(n_components=50)
pca.fit(X_train_dfs)

PCA(copy=True, iterated_power='auto', n_components=50, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

In [42]:
#transform our features
X_train_dfs = pca.transform(X_train_dfs)
X_test_dfs = pca.transform(X_test_dfs)

# 6. Model 1 - KNeighbors Classifier

We input the new train variables and slightly tune the number of neighbors.

In [43]:
from sklearn.neighbors import KNeighborsClassifier

# create a knn classifier
knn = KNeighborsClassifier()

# create a param grid
param_grid = {'n_neighbors' : [7,9,11]}

# do a grid search to find the best parameter
grid_knn = GridSearchCV(
    estimator = knn,
    param_grid = param_grid,
    scoring = 'accuracy',
    n_jobs = -1,
    cv = 10,
    refit = True,
    return_train_score = True
)

# fit the model
grid_knn.fit(X_train_dfs, y_train)

# read results of grid search into dataframe
cv_results_df = pd.DataFrame(grid_knn.cv_results_)

# print results
cv_results_df[['params', 'mean_test_score', 'mean_train_score']].sort_values(by = ['mean_test_score'], ascending = False)

Unnamed: 0,params,mean_test_score,mean_train_score
1,{'n_neighbors': 9},0.744394,0.782177
0,{'n_neighbors': 7},0.744343,0.79136
2,{'n_neighbors': 11},0.743232,0.774654


## 6.2 Model : XG Boost

Just to give it a try, we decided to put the two PCA components in to an XGBoost model, as we got the highest scores from it in all our efforts. Fitting this model took a lot of computer power and almost 40 minutes.

In [44]:

param_test = {
 'max_depth':[7,8,9],
 'min_child_weight':[2,3],
}


In [45]:
#model creation and gridsearch
xgb_model = xgb.XGBClassifier(learning_rate=0.1, 
                              n_estimators=120, 
                              gamma=0.2, 
                              #max_depth = 14,
                              #min_child_weight = 2,
                              num_class = 3,
                              subsample=0.8, 
                              colsample_bytree=0.8,
                              objective= 'multi:softmax', 
                              nthread=4, 
                              scale_pos_weight=1,
                              seed=27)

gsearch = GridSearchCV(estimator = xgb_model, 
                       param_grid = param_test, 
                       scoring='accuracy',
                       n_jobs=4,
                       cv=5,
                       refit = True,
                       return_train_score = True)

In [46]:
train_model_9 = gsearch.fit(X_train_dfs, y_train)

In [47]:
pd.DataFrame(gsearch.cv_results_)[['params', 'mean_test_score', 'mean_train_score']]

Unnamed: 0,params,mean_test_score,mean_train_score
0,"{'max_depth': 7, 'min_child_weight': 2}",0.777458,0.835968
1,"{'max_depth': 7, 'min_child_weight': 3}",0.777761,0.834276
2,"{'max_depth': 8, 'min_child_weight': 2}",0.782071,0.860408
3,"{'max_depth': 8, 'min_child_weight': 3}",0.781633,0.856688
4,"{'max_depth': 9, 'min_child_weight': 2}",0.783468,0.882727
5,"{'max_depth': 9, 'min_child_weight': 3}",0.783603,0.878081


The scores look slightly higher than KNN but the model looks overfitted. However the gap between test and train scores is not as big as when we did multiple iterations of hyperparameter tuning on XGBoost.

# 7. Prediction

## 7.1 Prediction for KNN

In [52]:

#predicting y with the best parameters and checking the output makes sense

y_pred = grid_knn.best_estimator_.predict(X_test_dfs)
y_pred_df = pd.DataFrame(y_pred)

#to make sure distribution of classes make sense
y_pred_df[0].value_counts()



0    8920
1    5479
2     451
Name: 0, dtype: int64

In [39]:
# map back to string classes
y_pred = pd.Series(y_pred).map(classes_inv)

# create submission data frame
y_test.loc[:,'status_group'] = y_pred

# write to csv
y_test.to_csv('submission11_dfs_pca_knn.csv', index = False)

## 7.1 Prediction for XGBoost

In [59]:
y_pred= gsearch.best_estimator_.predict(X_test_dfs)
y_pred_df= pd.DataFrame(y_pred)


In [60]:
#to make sure the values make sense.
y_pred_df[0].value_counts()

0    8747
1    6003
2     100
Name: 0, dtype: int64

In [61]:
# map back to string classes
y_pred = pd.Series(y_pred).map(classes_inv)

# create submission data frame
y_test.loc[:,'status_group'] = y_pred

# write to csv
y_test.to_csv('submission11_dfs_pca_xgboost.csv', index = False)