# Customer Segmentation Report for Arvato Financial Services

In this project, we analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. We use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, we apply what we have learned to a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to become customers of the company. The data has been provided by Bertelsmann Arvato Analytics and represents a real-life data science task.

In [None]:
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


# magic word for producing visualizations in notebook
%matplotlib inline

## Part 0: Get to Know the Data

There are four data files associated with this project:

- `Udacity_AZDIAS_052018.csv`: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
- `Udacity_CUSTOMERS_052018.csv`: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
- `Udacity_MAILOUT_052018_TRAIN.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 features (columns).
- `Udacity_MAILOUT_052018_TEST.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 features (columns).

Each row of the demographics files represents a single person, but also includes information beyond the individual, such as information about their household, building, and neighborhood. We use the first two files to figure out how customers ("CUSTOMERS") are similar to or differ from the general population at large ("AZDIAS"), then use our analysis to make predictions on the other two files ("MAILOUT"), predicting which recipients are most likely to become customers of the mail-order company.


In [None]:
# load in the data
azdias = pd.read_csv('./Data/Udacity_AZDIAS_052018.csv', sep=';')
customers = pd.read_csv('./Data/Udacity_CUSTOMERS_052018.csv', sep=';')

In [None]:
# check the categorical features of azdias to see whether they need to be corrected
cat_columns_azdias = azdias.select_dtypes(include='object').columns
for cat in cat_columns_azdias:
    print('for column {} there are {} unique elements'.format(cat,len(azdias.loc[:,cat].unique())))

In [None]:
# check the categorical features of customers to see whether they need to be corrected
cat_columns_customers = customers.select_dtypes(include='object').columns
for cat in cat_columns_customers:
    print('for column {} there are {} unique elements'.format(cat,len(customers.loc[:,cat].unique())))

In [None]:
# The EINGEFUEGT_AM column in both datasets holds timestamps, so we convert it to numbers.
# This keeps it out of the categorical features, which would otherwise greatly inflate
# the number of columns created by get_dummies().
azdias['EINGEFUEGT_AM'] = pd.to_datetime(azdias['EINGEFUEGT_AM'])
# cast explicitly to int64 so the result does not depend on the platform's default int size
azdias['EINGEFUEGT_AM'] = azdias.loc[~azdias['EINGEFUEGT_AM'].isnull(),'EINGEFUEGT_AM'].astype('int64')

In [None]:
# do the same for customers
customers['EINGEFUEGT_AM'] = pd.to_datetime(customers['EINGEFUEGT_AM'])
customers['EINGEFUEGT_AM'] = customers.loc[~customers['EINGEFUEGT_AM'].isnull(),'EINGEFUEGT_AM'].astype('int64')
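
The conversion above turns timestamps into large integers while leaving missing dates as NaN. A minimal sketch of one alternative, which expresses each date as seconds since the Unix epoch, is shown here. The helper name `datetime_to_epoch_seconds` and the toy values are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def datetime_to_epoch_seconds(column: pd.Series) -> pd.Series:
    """Convert a date-like column to seconds since 1970-01-01 (NaN for missing dates)."""
    parsed = pd.to_datetime(column, errors="coerce")
    # Subtracting the epoch gives Timedeltas; dividing by one second gives floats.
    return (parsed - pd.Timestamp("1970-01-01")) / pd.Timedelta(seconds=1)


# Toy stand-in for EINGEFUEGT_AM (made-up values)
toy_dates = pd.Series(["1970-01-02 00:00:00", None, "1970-01-01 00:01:00"])
epoch_seconds = datetime_to_epoch_seconds(toy_dates)
print(epoch_seconds)
```

Dividing by a `Timedelta` avoids relying on the internal resolution of the datetime dtype, which can differ between pandas versions.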

In [None]:
# now we check the categorical variables in azdias and customers
print('--------------azdias-----------')
cat_columns_azdias = azdias.select_dtypes(include='object').columns
for cat in cat_columns_azdias:
    print('for column {} there are {} unique elements'.format(cat,len(azdias.loc[:,cat].unique())))
print('--------------customers-----------')
cat_columns_customers = customers.select_dtypes(include='object').columns
for cat in cat_columns_customers:
    print('for column {} there are {} unique elements'.format(cat,len(customers.loc[:,cat].unique())))

In [None]:
# The EINGEFUEGT_AM problem is solved, so we check the remaining categorical columns.
# CAMEO_DEUG_2015 and CAMEO_INTL_2015 hold integer codes but are stored as objects,
# so we convert them to numbers.
# First, map the placeholder 'X' to -1 (unknown)
azdias.loc[azdias['CAMEO_DEUG_2015']=='X','CAMEO_DEUG_2015']=-1
# Then convert every non-null value to int
azdias['CAMEO_DEUG_2015'] = azdias.loc[~azdias['CAMEO_DEUG_2015'].isnull(),'CAMEO_DEUG_2015'].astype(int)

In [None]:
# do the same for CAMEO_INTL_2015 column
azdias.loc[azdias['CAMEO_INTL_2015']=='XX','CAMEO_INTL_2015']=-1
azdias['CAMEO_INTL_2015'] = azdias.loc[~azdias['CAMEO_INTL_2015'].isnull(),'CAMEO_INTL_2015'].astype(int)
# do it for customers
customers.loc[customers['CAMEO_DEUG_2015']=='X','CAMEO_DEUG_2015']=-1
customers['CAMEO_DEUG_2015'] = customers.loc[~customers['CAMEO_DEUG_2015'].isnull(),'CAMEO_DEUG_2015'].astype(int)
customers.loc[customers['CAMEO_INTL_2015']=='XX','CAMEO_INTL_2015']=-1
customers['CAMEO_INTL_2015'] = customers.loc[~customers['CAMEO_INTL_2015'].isnull(),'CAMEO_INTL_2015'].astype(int)
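
The same clean-up is repeated for two columns and several datasets. As a sketch only, it could be wrapped in a small helper that maps the placeholders `'X'` and `'XX'` to -1 and parses everything else as a number. The function name `cameo_to_numeric` and the toy column are hypothetical.

```python
import numpy as np
import pandas as pd


def cameo_to_numeric(column: pd.Series) -> pd.Series:
    """Map the 'X'/'XX' placeholders to -1 and parse the remaining values as numbers."""
    # Keep values that are not placeholders; replace placeholders with -1.
    cleaned = column.where(~column.isin(["X", "XX"]), -1)
    # Strings such as '44' become numbers; NaN stays NaN.
    return pd.to_numeric(cleaned, errors="coerce")


# Toy column mixing strings, floats, placeholders and a missing value
toy_cameo = pd.Series(["1", "X", np.nan, 8.0, "44", "XX"], dtype=object)
numeric_cameo = cameo_to_numeric(toy_cameo)
print(numeric_cameo.tolist())
```

The result is a float column, because NaN cannot be stored in a plain integer column.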


In [None]:
# now check for categorical features
print('--------------azdias-----------')
cat_columns_azdias = azdias.select_dtypes(include='object').columns
for cat in cat_columns_azdias:
    print('for column {} there are {} unique elements'.format(cat,len(azdias.loc[:,cat].unique())))
print('--------------customers-----------')
cat_columns_customers = customers.select_dtypes(include='object').columns
for cat in cat_columns_customers:
    print('for column {} there are {} unique elements'.format(cat,len(customers.loc[:,cat].unique())))

In [None]:
# we try to wrangle azdias and prepare it to start our analysis
# calculating null percentage of each column to decide which column to be eliminated
for i in range(azdias.shape[1]):
    print('Null percentage for column {} is {:.1f}%'.format(i,(azdias.iloc[:,i].isnull().sum()/azdias.shape[0])*100))

In [None]:
# Column 300 relates to customer status, which is important for our modelling,
# but 65.6% of it is null. We therefore replace all NaN values with -1, meaning unknown.
# Assign the result back: fillna(inplace=True) on a slice may not update the dataframe.
azdias.iloc[:,300] = azdias.iloc[:,300].fillna(-1)

In [None]:
# Column 100 is EXTSEL992, which has no definition in the Excel file and is 73.4% null.
# We consider this column unimportant for the modelling and drop the whole column.
azdias.drop(['EXTSEL992'],axis=1,inplace=True)

In [None]:
# The next step is to identify all columns with more than 90 percent null data, because
# imputing that many values would not give a reliable picture of these columns.
for i in range(azdias.shape[1]):
    null_pct = azdias.iloc[:,i].isnull().sum()/azdias.shape[0]*100
    if null_pct > 90:
        print('Null percentage more than 90% for column {} with null percentage {:.1f}%'.format(i, null_pct))

In [None]:
# we then decide to eliminate each column with more than 90% null data
dropped_cols = azdias.columns[4:8]
azdias.drop(dropped_cols,axis=1,inplace=True)
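
Selecting columns by position works as long as the column order never changes. A hedged alternative is to compute the null share per column and select by name. The sketch below uses a toy dataframe, and the helper `columns_above_null_share` is a hypothetical name, not part of the notebook.

```python
import numpy as np
import pandas as pd


def columns_above_null_share(df: pd.DataFrame, threshold: float) -> list:
    """Return the names of columns whose share of nulls is strictly above threshold (0 to 1)."""
    null_share = df.isnull().mean()  # fraction of nulls per column
    return null_share[null_share > threshold].index.tolist()


toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "b": [np.nan] * 10,          # 100% null
    "c": [1.0] + [np.nan] * 9,   # 90% null
})
to_drop = columns_above_null_share(toy, threshold=0.85)
toy_trimmed = toy.drop(columns=to_drop)
print(to_drop)
```

Because the selection is by name, the same list can be reused on another dataframe that shares those columns.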

In [None]:
# Check whether any record has more than 90% null data. If so, we would drop it
# before imputing null values for all the columns.
for i in range(azdias.shape[0]):
    null_pct = azdias.iloc[i,:].isnull().sum()/azdias.shape[1]*100
    if null_pct > 90:
        print('Record {} has more than 90% null columns with null percent {:.1f}%'.format(i, null_pct))
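
Looping over several hundred thousand rows with `iloc` can be slow. The same row-wise check can be sketched in vectorized form with `isnull().mean(axis=1)`; the toy dataframe below is an assumption used only to show the idea.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, np.nan, 9.0],
})

# Fraction of null columns for every row, computed in one pass
row_null_share = toy.isnull().mean(axis=1)
# Index labels of rows with more than 90% null columns
sparse_rows = row_null_share[row_null_share > 0.9].index.tolist()
print(sparse_rows)
```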

In [None]:
# No record has more than 90% null data, so we continue with imputing null values
# for all the remaining columns. We replace nulls with the mode of each column because
# all the columns hold categorical data.
for col in azdias.columns:
    try:
        azdias[col] = azdias[col].fillna(azdias[col].mode()[0])
    except Exception as err:
        # e.g. a column without any non-null value has no mode
        print('Could not impute column {}: {}'.format(col, err))
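
Mode imputation can also be written without an explicit loop. This is a minimal sketch on a toy dataframe, assuming every column has at least one non-null value: `DataFrame.mode()` returns the modes per column, and its first row can be passed to `fillna`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, 1.0, 2.0, np.nan],
    "y": ["a", np.nan, "b", "b"],
})

# First row of mode() holds the most frequent value of each column
modes = toy.mode().iloc[0]
# fillna with a Series fills each column with the value stored under its name
filled = toy.fillna(modes)
print(filled)
```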


In [None]:
# Checking null values
for i in range(azdias.shape[1]):
    print('Null percentage for column {} is {:.1f}%'.format(i,(azdias.iloc[:,i].isnull().sum()/azdias.shape[0])*100))

In [None]:
# We do the same data wrangling for the customers dataframe to prepare it for our analysis.
# First, calculate the null percentage of each column to decide which columns to eliminate.
for i in range(customers.shape[1]):
    print('Null percentage for column {} is {:.1f}%'.format(i,(customers.iloc[:,i].isnull().sum()/customers.shape[0])*100))


In [None]:
# we decide to eliminate each column with more than 90% null data
dropped_cols = customers.columns[4:8]
customers.drop(dropped_cols,axis=1,inplace=True)

In [None]:
# We impute null values for all the remaining columns. We replace nulls with the mode of
# each column because all the columns hold categorical data.
for col in customers.columns:
    try:
        customers[col] = customers[col].fillna(customers[col].mode()[0])
    except Exception as err:
        # e.g. a column without any non-null value has no mode
        print('Could not impute column {}: {}'.format(col, err))

In [None]:
# Checking null values
for i in range(customers.shape[1]):
    print('Null percentage for column {} is {:.1f}%'.format(i,(customers.iloc[:,i].isnull().sum()/customers.shape[0])*100))

## Part 1: Customer Segmentation Report

The bulk of our analysis comes in this part of the project. Here, we use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and the general population of Germany. By the end of this part, we should be able to describe which parts of the general population are more likely to be part of the mail-order company's main customer base, and which parts are less so.

In [None]:
# The approach: we first one-hot encode both datasets with dummy variables. The datasets are then
# standardized, and their dimensionality is reduced with PCA to the number of components that
# explains 90% of the total variance of the original data. Next, the reduced data is clustered with
# the K-means algorithm, using the elbow method to choose the number of clusters. Finally, the cluster
# centroids of the two datasets are compared with a distance-based similarity measure. The outcome is
# the set of general-population clusters that are most similar to the customers, so that their
# members can represent the company's customers.


# Step 1
# add dummy variables to customers and azdias datasets
customers = pd.get_dummies(customers)
azdias = pd.get_dummies(azdias)
# write to drive
customers.to_csv('./Data/customers_filled.csv',index=False)
azdias.to_csv('./Data/azdias_filled.csv',index=False)

In [None]:
# we work with a 30 percent sample because the data is high-dimensional and has a very large number of records
customers = pd.read_csv('./Data/customers_filled.csv')
azdias = pd.read_csv('./Data/azdias_filled.csv')

# sampling original processed data
customers_sample = customers.sample(frac=0.3).reset_index(drop=True)
azdias_sample = azdias.sample(frac=0.3).reset_index(drop=True)

In [None]:
# customers standardization
# Select columns to standardize (excluding the first column)
columns_to_standardize = customers_sample.columns[1:]

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the data
scaler.fit(customers_sample[columns_to_standardize])

# Transform the data
customers_sample[columns_to_standardize] = scaler.transform(customers_sample[columns_to_standardize])


In [None]:
# azdias standardization
# Select columns to standardize (excluding the first column)
columns_to_standardize = azdias_sample.columns[1:]

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the data
scaler.fit(azdias_sample[columns_to_standardize])

# Transform the data
azdias_sample[columns_to_standardize] = scaler.transform(azdias_sample[columns_to_standardize])

In [None]:
# copy to new dataframes
azdias=azdias_sample.copy()
customers=customers_sample.copy()
# write to drive
azdias.to_csv('./Data/azdias_std_sample.csv',index=False)
customers.to_csv('./Data/customers_std_sample.csv',index=False)

In [None]:
# read sample data
azdias=pd.read_csv('./Data/azdias_std_sample.csv')
customers=pd.read_csv('./Data/customers_std_sample.csv')

In [None]:
# Step 2
# Now we reduce both datasets to a COMMON number of components that explains about 90% of the variance of the original data.
# We first find that number of components on the combined data.

# find the common features between the two datasets
common_features = list(set(azdias.columns) & set(customers.columns))
# remove identifier column
common_features.remove('LNR')
# concatenate two datasets considering common features  
combined_df = pd.concat([azdias[common_features], customers[common_features]], ignore_index=True)

best_components = 0  # Variable to store the best number of components
max_features = combined_df.shape[1]

for num_components in range(1, max_features + 1):
    pca = PCA(n_components=num_components)
    df_reduced = pca.fit_transform(combined_df)
    
    explained_variance = sum(pca.explained_variance_ratio_)
    print('for number of features={}, explained variance ratio is {}'.format(num_components,explained_variance))

    if explained_variance > 0.9:
        
        best_components = num_components
        break

print('Best number of features that can explain 90 percent of the variance is: ',best_components)
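
Refitting PCA once per candidate number of components is expensive on wide data. The same threshold can be found from a single fit by accumulating `explained_variance_ratio_`. The sketch below uses synthetic data as a stand-in for the combined dataframe, and the variable names are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 20 features with decreasing variance
toy = rng.normal(size=(500, 20)) * np.linspace(5.0, 0.5, 20)

# One fit with all components, then a running total of the explained variance
pca_full = PCA().fit(toy)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.9
n_components_90 = int(np.argmax(cumulative > 0.9)) + 1
print(n_components_90, cumulative[n_components_90 - 1])
```

This follows the same rule as the loop above (stop at the first count whose explained variance exceeds 0.9), but needs only one decomposition.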

In [None]:
# we found that ncomponents=232 can explain 90% of the variance in the original combined dataset
# we then proceed with doing dimensional reduction for azdias and customers with this number of
# components
ncomponents=232
# azdias dimensional reduction
pca = PCA(n_components=ncomponents)
df_reduced1 = pca.fit_transform(azdias.iloc[:,1:])
azdias_reduced = pd.concat([azdias.iloc[:,0],pd.DataFrame(df_reduced1)],axis=1)

# customers dimensional reduction
pca = PCA(n_components=ncomponents)
df_reduced2 = pca.fit_transform(customers.iloc[:,1:])
customers_reduced = pd.concat([customers.iloc[:,0],pd.DataFrame(df_reduced2)],axis=1)

# write to drive
azdias_reduced.to_csv('./Data/azdias_reduced.csv',index=False)
customers_reduced.to_csv('./Data/customers_reduced.csv',index=False)
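
Fitting a separate PCA on each dataset means that component 0 of one projection need not correspond to component 0 of the other. One possible alternative, which is not what the cells above do, is to fit the scaler and the PCA on the general population only and apply the same fitted objects to both datasets. The sketch below uses synthetic arrays, and all names are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
population = rng.normal(size=(300, 8))               # stand-in for AZDIAS features
customers_toy = rng.normal(loc=0.5, size=(100, 8))   # stand-in for CUSTOMERS features

# Fit the transformations on the population only
scaler = StandardScaler().fit(population)
pca_shared = PCA(n_components=3).fit(scaler.transform(population))

# Apply the SAME fitted scaler and PCA to both datasets
population_reduced = pca_shared.transform(scaler.transform(population))
customers_toy_reduced = pca_shared.transform(scaler.transform(customers_toy))
print(population_reduced.shape, customers_toy_reduced.shape)
```

With a shared projection, distances between points or centroids of the two datasets are measured along the same axes. This requires both datasets to have the same feature columns in the same order.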


In [None]:
# read the reduced data
azdias_reduced = pd.read_csv('./Data/azdias_reduced.csv')
customer_reduced = pd.read_csv('./Data/customers_reduced.csv')

In [None]:
# proceed with implementing elbow method to find the best cluster number for azdias dataset
# kmeans.inertia_ is sum of squared distances of samples to their closest cluster center. We are looking 
# for the place where this number would not change drastically by increasing the number of clusters (elbow).
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 20):
    kmeans = KMeans(n_clusters = i, init = 'k-means++')
    kmeans.fit(azdias_reduced.iloc[:,1:].values)
    wcss.append(kmeans.inertia_)

In [None]:
# plot the result
plt.plot(range(1, 20), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters for demography data')
plt.ylabel('WCSS')
plt.show()

In [None]:
# based on the plot, we see that n_clusters=10 would be a good choice for clustering as inertia 
# is not meaningfully changing after this number. We thus go forward to cluster azdias dataset with
# n_clusters=10 and consider 1 cluster for customer dataset

# azdias clustering
kmeans1 = KMeans(n_clusters = 10, init = 'k-means++')
azdias_kmeans = kmeans1.fit_predict(azdias_reduced.iloc[:,1:])

# customers clustering
kmeans2 = KMeans(n_clusters = 1, init = 'k-means++')
customer_kmeans = kmeans2.fit_predict(customer_reduced.iloc[:,1:])


In [None]:
from sklearn.metrics import pairwise_distances

# calculate cluster similarity by calculating Euclidean pairwise distance between clusters in two datasets
centroid_distances = pairwise_distances(kmeans1.cluster_centers_, kmeans2.cluster_centers_)
# convert Euclidean distances to similarities (using inverse distance)
similarity_matrix = 1 / (1 + centroid_distances)  
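
To make the similarity formula concrete, here is a small worked example with hand-picked centroids (hypothetical values, not taken from the data). A distance of 0 maps to a similarity of 1, and larger distances map to values closer to 0.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

population_centroids = np.array([[0.0, 0.0], [3.0, 4.0]])  # two hypothetical centroids
customer_centroid = np.array([[0.0, 0.0]])                 # one hypothetical centroid

# Euclidean distances: 0 for the identical point, 5 for the point (3, 4)
distances = pairwise_distances(population_centroids, customer_centroid)
# Inverse-distance similarity: 1 / (1 + 0) = 1 and 1 / (1 + 5) = 1/6
similarities = 1 / (1 + distances)
print(similarities)
```

The resulting matrix has one row per population cluster and one column per customer cluster, so the most similar clusters are the rows with the largest values.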


In [None]:
similarity_matrix
# we choose clusters 3 and 8 of the demography as the most similar clusters to the company customers

In [None]:
# it means that clusters 3 and 8 from azdias are most similar to customers of the company

# assigning cluster labels to each dataset
azdias = azdias_reduced.copy()
customer = customer_reduced.copy()
azdias['Cluster_label']=kmeans1.labels_
customer['Cluster_label']=kmeans2.labels_

In [None]:
# we have now two datasets that are clustered and each cluster from azdias is mapped to the cluster in customer
# we make a dictionary including all individuals that are most likely similar to company customers

affine_individuals = {'Similar Individuals':[]}
for i in (3,8):
    affine_individuals['Similar Individuals'].extend(azdias.loc[azdias['Cluster_label']==i,'LNR'].values)
    

In [None]:
print('The number of individuals in the demographics data that\ncan represent customers of the company is {}'.format(len(affine_individuals['Similar Individuals'])))

In [None]:
# save dictionary using pickling
import pickle

with open('affinity.pickle', 'wb') as file:
    pickle.dump(affine_individuals, file)

## Part 2: Supervised Learning Model

Now that we've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.

The "MAILOUT" data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, we can verify our model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, we'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.

In [None]:
# in this part, we train and verify our model based on Udacity_MAILOUT_052018_TRAIN.csv dataset
train_dataset = pd.read_csv('./Data/Udacity_MAILOUT_052018_TRAIN.csv',sep=';')

# preprocess train data
# we check the categorical features for train dataset and if they need to be corrected
cat_columns = train_dataset.select_dtypes(include='object').columns
for cat in cat_columns:
    print('for column {} there are {} unique elements'.format(cat,len(train_dataset.loc[:,cat].unique())))

In [None]:
# The EINGEFUEGT_AM column in the train dataset holds timestamps, so we convert it to numbers.
# This keeps it out of the categorical features, which would otherwise greatly inflate
# the number of columns created by get_dummies().
train_dataset['EINGEFUEGT_AM'] = pd.to_datetime(train_dataset['EINGEFUEGT_AM'])
# cast explicitly to int64 so the result does not depend on the platform's default int size
train_dataset['EINGEFUEGT_AM'] = train_dataset.loc[~train_dataset['EINGEFUEGT_AM'].isnull(),'EINGEFUEGT_AM'].astype('int64')

In [None]:
# now we check the categorical variables in train dataset 
print('--------------Train dataset-----------')
cat_columns = train_dataset.select_dtypes(include='object').columns
for cat in cat_columns:
    print('for column {} there are {} unique elements'.format(cat,len(train_dataset.loc[:,cat].unique())))


In [None]:
# The EINGEFUEGT_AM problem is solved, so we check the remaining categorical columns.
# CAMEO_DEUG_2015 and CAMEO_INTL_2015 hold integer codes but are stored as objects,
# so we convert them to numbers.
# First, map the placeholder 'X' to -1 (unknown)
train_dataset.loc[train_dataset['CAMEO_DEUG_2015']=='X','CAMEO_DEUG_2015']=-1
# Then convert every non-null value to int
train_dataset['CAMEO_DEUG_2015'] = train_dataset.loc[~train_dataset['CAMEO_DEUG_2015'].isnull(),'CAMEO_DEUG_2015'].astype(int)

In [None]:
# do the same for CAMEO_INTL_2015 column
train_dataset.loc[train_dataset['CAMEO_INTL_2015']=='XX','CAMEO_INTL_2015']=-1
train_dataset['CAMEO_INTL_2015'] = train_dataset.loc[~train_dataset['CAMEO_INTL_2015'].isnull(),'CAMEO_INTL_2015'].astype(int)

In [None]:
# now check for categorical features
print('--------------Train dataset-----------')
cat_columns = train_dataset.select_dtypes(include='object').columns
for cat in cat_columns:
    print('for column {} there are {} unique elements'.format(cat,len(train_dataset.loc[:,cat].unique())))

In [None]:
# calculate null percentage of each column to decide which column to be eliminated
for i in range(train_dataset.shape[1]):
    print('Null percentage for column {} is {:.1f}%'.format(i,(train_dataset.iloc[:,i].isnull().sum()/train_dataset.shape[0])*100))

In [None]:
# Column 300 relates to customer status, which is important for our modelling,
# but 65.6% of it is null. We therefore replace all NaN values with -1, meaning unknown.
# Assign the result back: fillna(inplace=True) on a slice may not update the dataframe.
train_dataset.iloc[:,300] = train_dataset.iloc[:,300].fillna(-1)

In [None]:
# The next step is to identify all columns with more than 90 percent null data, because
# imputing that many values would not give a reliable picture of these columns.
for i in range(train_dataset.shape[1]):
    null_pct = train_dataset.iloc[:,i].isnull().sum()/train_dataset.shape[0]*100
    if null_pct > 90:
        print('Null percentage more than 90% for column {} with null percentage {:.1f}%'.format(i, null_pct))

In [None]:
# we then decide to eliminate each column with more than 90% null data
dropped_cols = train_dataset.columns[4:8]
train_dataset.drop(dropped_cols,axis=1,inplace=True)

In [None]:
# Check whether any record has more than 90% null data. If so, we would drop it
# before imputing null values for all the columns.
for i in range(train_dataset.shape[0]):
    null_pct = train_dataset.iloc[i,:].isnull().sum()/train_dataset.shape[1]*100
    if null_pct > 90:
        print('Record {} has more than 90% null columns with null percent {:.1f}%'.format(i, null_pct))

In [None]:
# No record has more than 90% null data, so we continue with imputing null values
# for all the remaining columns. We replace nulls with the mode of each column because
# all the columns hold categorical data.
for col in train_dataset.columns:
    try:
        train_dataset[col] = train_dataset[col].fillna(train_dataset[col].mode()[0])
    except Exception as err:
        # e.g. a column without any non-null value has no mode
        print('Could not impute column {}: {}'.format(col, err))

In [None]:
# check null values
for i in range(train_dataset.shape[1]):
    print('Null percentage for column {} is {:.1f}%'.format(i,(train_dataset.iloc[:,i].isnull().sum()/train_dataset.shape[0])*100))

In [None]:
# deal with categorical variables
train_dataset = pd.get_dummies(train_dataset)

In [None]:
# now preprocessing is done and we can start training the model with train dataset
# and employing random forest algorithm
X = train_dataset.drop('RESPONSE',axis=1).values
y = train_dataset.loc[:, 'RESPONSE'].values

In [None]:
# split dataset: 75% train data, 25% test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
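
If the positive class in `RESPONSE` is rare, a plain random split can leave the held-out set with very few positives. A hedged option is to pass `stratify` to `train_test_split` so both parts keep the same class proportions. The sketch uses synthetic labels with an arbitrary 2% positive rate chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X_toy = rng.normal(size=(1000, 5))
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1  # 2% positives, an arbitrary rate for this sketch

# stratify keeps the class proportions equal in both parts;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)
print(y_tr.sum(), y_te.sum())
```

With 20 positives among 1000 rows, the 25% held-out part receives 5 positives and the training part 15.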

In [None]:
# feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
# train the model
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy')
classifier.fit(X_train, y_train)

In [None]:
# predict the response for test set
y_pred = classifier.predict(X_test)

In [None]:
# make confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)
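
Accuracy can look high when one class dominates, because always predicting the majority class already scores well. As a sketch under that assumption, the example below compares accuracy with a majority-class baseline and also computes ROC AUC from predicted probabilities. The data is synthetic (`make_classification` with an arbitrary class imbalance), so the numbers say nothing about the MAILOUT data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the training data
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.97, 0.03], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

accuracy = accuracy_score(y_te, model.predict(X_te))
# Accuracy obtained by always predicting the majority class
baseline_accuracy = np.bincount(y_te).max() / len(y_te)
# ROC AUC is computed from the predicted probability of the positive class
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(accuracy, baseline_accuracy, auc)
```

A model is informative only to the extent that it beats the baseline, and ROC AUC gives a view that does not depend on a single decision threshold.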

In [None]:
# We see that only 123 responses are incorrectly predicted and the accuracy is 98.8%.
# We take this as verification of the trained model. The last part of the project is to
# use this trained model to predict the response for Udacity_MAILOUT_052018_TEST.csv
import pickle

with open('model.pickle', 'wb') as file:
    pickle.dump(classifier, file)

In [None]:
# read csv file
test_dataset = pd.read_csv('./Data/Udacity_MAILOUT_052018_TEST.csv', sep=';')

In [None]:
# we do the same preprocessing as for the training dataset
test_dataset['EINGEFUEGT_AM'] = pd.to_datetime(test_dataset['EINGEFUEGT_AM'])
test_dataset['EINGEFUEGT_AM'] = test_dataset.loc[~test_dataset['EINGEFUEGT_AM'].isnull(),'EINGEFUEGT_AM'].astype('int64')
test_dataset.loc[test_dataset['CAMEO_INTL_2015']=='XX','CAMEO_INTL_2015']=-1
test_dataset['CAMEO_INTL_2015'] = test_dataset.loc[~test_dataset['CAMEO_INTL_2015'].isnull(),'CAMEO_INTL_2015'].astype(int)
test_dataset.loc[test_dataset['CAMEO_DEUG_2015']=='X','CAMEO_DEUG_2015']=-1
test_dataset['CAMEO_DEUG_2015'] = test_dataset.loc[~test_dataset['CAMEO_DEUG_2015'].isnull(),'CAMEO_DEUG_2015'].astype(int)
# assign the result back: fillna(inplace=True) on a slice may not update the dataframe
test_dataset.iloc[:,300] = test_dataset.iloc[:,300].fillna(-1)
dropped_cols = test_dataset.columns[4:8]
test_dataset.drop(dropped_cols,axis=1,inplace=True)
for col in test_dataset.columns:
    try:
        test_dataset[col] = test_dataset[col].fillna(test_dataset[col].mode()[0])
    except Exception as err:
        # e.g. a column without any non-null value has no mode
        print('Could not impute column {}: {}'.format(col, err))

In [None]:
test_dataset = pd.get_dummies(test_dataset)
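
When `get_dummies` is applied to two datasets separately, the resulting columns can differ if a category appears in only one of them. A minimal sketch of aligning the test columns to the training columns with `reindex` follows; the toy dataframes are assumptions used only to show the idea.

```python
import pandas as pd

train_toy = pd.DataFrame({"num": [1, 2, 3], "cat": ["a", "b", "c"]})
test_toy = pd.DataFrame({"num": [4, 5], "cat": ["a", "d"]})  # 'd' is unseen in training

train_dummies = pd.get_dummies(train_toy)
test_dummies = pd.get_dummies(test_toy)

# Keep exactly the training columns, in the training order:
# unseen test columns are dropped, missing ones are added and filled with 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
```

After alignment, the test matrix has the same number and order of features as the matrix the model was trained on.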

In [None]:
# feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(test_dataset.values)

In [None]:
# predict the response for test set
y = classifier.predict(X)
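
The cells above fit a new `StandardScaler` on the test data, so the test features are scaled with different statistics than the ones the classifier saw during training. One way to keep the two consistent is to bundle the scaler and the classifier in a scikit-learn pipeline, sketched here on synthetic data. The variable names are assumptions, and the classifier settings simply mirror the ones used earlier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X_train_toy = rng.normal(size=(200, 4))
y_train_toy = (X_train_toy[:, 0] > 0).astype(int)
X_new_toy = rng.normal(loc=2.0, size=(50, 4))  # new data with a shifted mean

pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0),
)
# fit() learns the scaling statistics from the training data only
pipeline.fit(X_train_toy, y_train_toy)
# predict() reuses those training statistics to scale the new data
predictions = pipeline.predict(X_new_toy)
print(predictions[:10])
```

Pickling the fitted pipeline instead of the bare classifier would also store the scaler, so the saved model can be applied to raw feature matrices directly.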

In [None]:
# add RESPONSE column to test dataset
labeled_test_data = pd.read_csv('./Data/Udacity_MAILOUT_052018_TEST.csv', sep=';')
labeled_test_data['RESPONSE'] = y

In [None]:
# save the newly labeled test data as a csv file
labeled_test_data.to_csv('./Data/Udacity_MAILOUT_052018_TEST_LABELED.csv',index=False)

In [None]:
# check for RESPONSE=1 in the labeled data
potential_customers = labeled_test_data.loc[labeled_test_data.iloc[:,-1]==1,:].LNR

In [None]:
# we see that our trained model predicts that 39 individuals would become customers of the
# company, given their demographic features
potential_customers.to_csv('./potential_customers.csv',index=False)