# Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services

In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned on a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to convert into becoming customers for the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

If you completed the first term of this program, you will be familiar with the first part of this project from the unsupervised learning project. The versions of those two datasets used in this project include many more features and have not been pre-cleaned. You are also free to choose whatever approach you'd like for analyzing the data rather than following pre-determined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable for this project will be a blog post reporting your findings.

In [None]:
#pip install xgboost


In [None]:
#pip install bayesian-optimization

In [None]:
#pip install lightgbm

In [None]:
#pip install scikit_optimize==0.8.1

In [None]:
#pip install scikit-learn==0.23.2


## Part 0: Get to Know the Data

There are four data files associated with this project:

- `Udacity_AZDIAS_052018.csv`: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
- `Udacity_CUSTOMERS_052018.csv`: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
- `Udacity_MAILOUT_052018_TRAIN.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
- `Udacity_MAILOUT_052018_TEST.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).




In [None]:
# import libraries here; add more as necessary

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans

from sklearn.impute import SimpleImputer

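import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# project helpers such as feature_transform() live in utils.py;
# a relative import ('from .utils') fails inside a notebook, so import the module directly
from utils import *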
# import skopt
# from skopt import BayesSearchCV
# class BayesSearchCV(BayesSearchCV):
#     def _run_search(self, x): raise BaseException('Use newer skopt')

%matplotlib inline
plt.style.use('fivethirtyeight')

In [None]:
# load in the data
azdias = pd.read_csv('data/azdias.csv',sep=';')
customers = pd.read_csv('data/customers.csv',sep=';')
attributes = pd.read_excel('data/attributes.xlsx' , engine='openpyxl', skiprows = 1)

#drop the 3 customer columns that are missing from the general population dataset
#(with inplace=True, drop() returns None, so the result must not be assigned)
customers.drop(['PRODUCT_GROUP', 'CUSTOMER_GROUP', 'ONLINE_PURCHASE'], axis = 1, inplace = True)

In [None]:
#size of the datasets
print('azdias dimensions: ' + str(azdias.shape))
print('customers dimensions: ' + str(customers.shape))

In [None]:
#azdias first 5 rows
azdias.head()

In [None]:
customers.head()

In [None]:
#description of the azdias frame ('Unnamed: 0' appears to repeat the index and LNR is the ID of each person)
azdias.describe()

In [None]:
#check for object values within the dataset
print(cat_check(azdias))
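
`cat_check` is defined in `utils.py` and is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming it simply lists the `object`-dtype columns together with a few sample values (the name `cat_check_sketch` and the toy frame are hypothetical):

```python
import pandas as pd


def cat_check_sketch(df: pd.DataFrame) -> dict:
    """Return {column: up to 5 unique non-null values} for object-dtype columns."""
    object_cols = df.select_dtypes(include="object").columns
    return {col: df[col].dropna().unique()[:5].tolist() for col in object_cols}


# Toy frame standing in for azdias (values are illustrative only)
toy = pd.DataFrame({
    "LNR": [1, 2, 3],
    "CAMEO_DEUG_2015": pd.Series(["8", "X", None], dtype=object),
    "OST_WEST_KZ": pd.Series(["W", "O", "W"], dtype=object),
})
result = cat_check_sketch(toy)
print(result)
```

Listing sample values next to each column name makes placeholder codes such as `X` easy to spot before any conversion to numeric types.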

In [None]:
# as we can see, there are some X and XX values that should be corrected in the following columns
print(azdias['CAMEO_DEU_2015'].unique())
print(azdias['CAMEO_DEUG_2015'].unique())
print(azdias['CAMEO_INTL_2015'].unique())

EINGEFUEGT_AM is a datetime column and OST_WEST_KZ should be mapped to binary values.
All transformations are done in `feature_transform()` in the file `utils.py`.
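
The real logic lives in `feature_transform()` and is not reproduced here. A minimal sketch of the three fixes mentioned so far, assuming that X/XX are treated as missing, that OST_WEST_KZ is mapped as W → 1 and O → 0, and that only the year of EINGEFUEGT_AM is kept (these choices and the name `transform_sketch` are assumptions, not the author's implementation):

```python
import numpy as np
import pandas as pd


def transform_sketch(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Non-numeric placeholders such as 'X' / 'XX' become NaN
    for col in ["CAMEO_DEUG_2015", "CAMEO_INTL_2015"]:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Binary encoding of the East/West flag (assumed mapping)
    out["OST_WEST_KZ"] = out["OST_WEST_KZ"].map({"W": 1, "O": 0})
    # Keep only the year of the insertion timestamp
    out["EINGEFUEGT_AM"] = pd.to_datetime(out["EINGEFUEGT_AM"]).dt.year
    return out


# Toy rows (illustrative values only)
toy = pd.DataFrame({
    "CAMEO_DEUG_2015": ["8", "X", "4"],
    "CAMEO_INTL_2015": ["51", "XX", "24"],
    "OST_WEST_KZ": ["W", "O", "W"],
    "EINGEFUEGT_AM": ["1992-02-10 00:00:00", "1997-05-14 00:00:00", "2005-12-01 00:00:00"],
})
clean = transform_sketch(toy)
print(clean)
```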

In [None]:
attributes.drop(['Unnamed: 0'], axis = 1, inplace = True)

### Missing values

In [None]:
#use the attributes Excel file to map the codes that mean "unknown" (0, -1, 9) to -1
azdias = unknown_unify(azdias, attributes)
customers = unknown_unify(customers, attributes)
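
`unknown_unify` is defined in `utils.py` and is not shown here. A minimal sketch of the idea described in the comment above, assuming a lookup that lists, per attribute, the codes meaning "unknown" (the `unknown_codes` dictionary, the column names, and the function name are hypothetical; the real helper derives this information from the attributes spreadsheet):

```python
import pandas as pd


def unify_unknown_sketch(df: pd.DataFrame, unknown_codes: dict) -> pd.DataFrame:
    """Replace every attribute-specific 'unknown' code with the single value -1."""
    out = df.copy()
    for col, codes in unknown_codes.items():
        if col in out.columns:
            # keep values that are not unknown codes, write -1 elsewhere
            out[col] = out[col].where(~out[col].isin(codes), -1)
    return out


# Hypothetical columns and codes, for illustration only
toy = pd.DataFrame({"FEATURE_A": [-1, 0, 2, 3], "FEATURE_B": [0, 5, 9, 12]})
unknown_codes = {"FEATURE_A": [-1, 0], "FEATURE_B": [0, 9]}
unified = unify_unknown_sketch(toy, unknown_codes)
print(unified)
```

Handling each column with its own list of codes matters because a value such as 0 or 9 can be a legitimate category in one feature and an "unknown" marker in another.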

In [None]:
#histogram of missing values per feature (only features with more than 30% missing)
col_azdiaz = azdias.isnull().sum()/len(azdias)
col_customers = customers.isnull().sum()/len(customers)
plt.hist(col_azdiaz[col_azdiaz > 0.3], fc=(1, 0, 0, 0.5), label='Azdias')
plt.hist(col_customers[col_customers > 0.3], fc=(0, 0, 1, 0.5), label='Customers')
plt.xlabel('Fraction of Missing Values')
plt.ylabel('Number of Features')
plt.legend()

In [None]:
#plot histogram missing values per row
bins = 50
plt.hist(azdias.isnull().sum(axis=1), bins, fc=(1, 0, 0, 0.5), label='Azdias')
plt.hist(customers.isnull().sum(axis=1), bins, fc=(0, 0, 1, 0.5), label='Customers')
plt.xlabel('Number of missing values')
plt.ylabel('Number of Rows')
plt.legend()

In [None]:
#display features with more than 30% missing values
missing_values = azdias.isnull().sum()/len(azdias)
azdias[missing_values[missing_values > 0.30].index].head()

#### ALTER_KIND features mark the age of children. Having a lot of NaN values is normal, and dropping the features may result in losing information.

In [None]:
#transforming some of the features and removing incorrect values
azdias = feature_transform(azdias)
customers = feature_transform(customers)

In [None]:
#Distribution of DECADE feature
fig = plt.figure(figsize = (10,6))
sns.distplot(azdias['PRAEGENDE_JUGENDJAHRE_DECADE'], norm_hist = True, label='Azdias')
sns.distplot(customers['PRAEGENDE_JUGENDJAHRE_DECADE'], norm_hist = True, label='Customers')
plt.legend()

In [None]:
#Distribution of INCOME feature
fig = plt.figure(figsize = (10,6))
sns.distplot(azdias['HH_EINKOMMEN_SCORE'], norm_hist = True, label='Azdias')
sns.distplot(customers['HH_EINKOMMEN_SCORE'], norm_hist = True, label='Customers')
plt.legend()

In [None]:
#Distribution of VACATION HABITS feature
fig = plt.figure(figsize = (10,6))
sns.distplot(azdias['GFK_URLAUBERTYP'], norm_hist = True, label='Azdias')
sns.distplot(customers['GFK_URLAUBERTYP'], norm_hist = True, label='Customers')
plt.legend()

## Part 1: Unsupervised model

In [None]:
#unsupervised data preprocessing
pipeline_unsup = Pipeline([  ('impute', SimpleImputer(strategy= 'constant', fill_value = -1)),
                             ('scale', StandardScaler()),
                             ('pca' , PCA()),
                        ])

#fit and transform the sets
azdias_pca = pipeline_unsup.fit_transform(azdias)
customers_pca =  pipeline_unsup.transform(customers)

In [None]:
#calculate PCA features to use in the clustering model
fig = plt.figure(figsize = (14,8))
plt.bar(list(range(0, len(pipeline_unsup[2].explained_variance_ratio_))), pipeline_unsup[2].explained_variance_ratio_)
plt.xlabel('PCA components')
plt.ylabel('Variance Ratio')
plt.xlim(-1, 100)
plt.show()

pipeline_unsup[2].explained_variance_ratio_[:130].sum()
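
The cell above sums the variance ratio of the first 130 components. One way to arrive at such a cutoff is to look at the cumulative explained variance and take the smallest number of components that reaches a target share. A minimal sketch on synthetic data (the 90% target and the random data are assumptions for illustration, not values taken from this analysis):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 500 rows, 20 features driven by 5 latent factors plus noise
latent = rng.normal(size=(500, 5))
X = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(500, 20))

pca = PCA().fit(StandardScaler().fit_transform(X))
cumulative = np.cumsum(pca.explained_variance_ratio_)

target = 0.90  # assumed threshold
# first index whose cumulative share reaches the target, converted to a count
n_components = int(np.searchsorted(cumulative, target) + 1)
print(n_components, cumulative[n_components - 1])
```

scikit-learn can also choose the count itself when a float is passed, as in `PCA(n_components=0.90)`.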

In [None]:
#calculate number of clusters for the KMeans++ model:
score = []

for i in range(2,15):
    clt = KMeans(n_clusters = i)
    global_cluster = clt.fit(azdias_pca[:, :130])
    score.append(global_cluster.inertia_)
    print(score[i-2])



In [None]:
#plot elbow curve
fig = plt.figure(figsize = (12,8))
sns.lineplot(x = list(range(2,15)), y = score)
plt.xlabel('Cluster Number')
plt.show()
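
Reading an elbow from the plot is a judgment call. A minimal sketch on synthetic data with three well-separated groups, showing how the gain from adding another cluster collapses once the number of clusters passes the true group count (the data and the range of k are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
# 200 points scattered around each of the three centers
X = np.vstack([c + rng.normal(size=(200, 2)) for c in centers])

ks = list(range(1, 7))
inertia = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# Decrease in inertia gained by adding one more cluster
drops = -np.diff(inertia)
for k, d in zip(ks[1:], drops):
    print(f"k={k}: inertia drop {d:.1f}")
```

On real demographic data the curve is usually much smoother than in this toy case, so the chosen number of clusters remains partly a matter of interpretation.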

In [None]:
#init and fit the KMeans++ model with the chosen number of clusters (8)
clt = KMeans(n_clusters = 8)

azdiaz_cluster = clt.fit(azdias_pca[:, :130])

In [None]:
#predict cluster labels for Arvato customers
customers_cluster = clt.predict(customers_pca[:, :130])

In [None]:
#plot the distribution between azdias and customers data inside the clusters
fig = plt.figure(figsize = (13,8))
#label both series so that plt.legend() has entries to display
sns.distplot(azdiaz_cluster.labels_, label='Azdias')
sns.distplot(customers_cluster, label='Customers')
plt.legend()
plt.show()
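
To identify the most over- and underrepresented clusters numerically, the share of each cluster among customers can be compared with its share in the general population. A minimal sketch with made-up label arrays (in this notebook the inputs would be `azdiaz_cluster.labels_` and `customers_cluster`; the numbers below are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical cluster labels for the two groups
population_labels = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
customer_labels = np.array([0, 0, 0, 0, 1, 2])

shares = pd.DataFrame({
    "population": pd.Series(population_labels).value_counts(normalize=True),
    "customers": pd.Series(customer_labels).value_counts(normalize=True),
}).fillna(0).sort_index()

# Ratio > 1: the cluster is overrepresented among customers
shares["ratio"] = shares["customers"] / shares["population"]
print(shares)
print("most overrepresented:", shares["ratio"].idxmax())
print("most underrepresented:", shares["ratio"].idxmin())
```

Comparing shares rather than raw counts is necessary because the two datasets differ greatly in size.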

#### Let's take a look at how the features differ between the two most different clusters

### Identify important feature differences between the clusters

In [None]:
azdias_clst = azdias
azdias_clst['Cluster'] = azdiaz_cluster.labels_

customers_clst = customers
customers_clst['Cluster'] = customers_cluster

azdias_clst = azdias_clst[azdias_clst['Cluster'] == 4]
customers_clst = customers_clst[customers_clst['Cluster'] == 0]

#get mean difference between features
diff = pd.DataFrame({'Cluster_4': azdias_clst.mean(), 'Cluster_0': customers_clst.mean()})
diff['delta'] = abs(diff['Cluster_4'] - diff['Cluster_0'])
diff.head(20).sort_values(['delta'], ascending = False)

##### CAMEO_DEUG_2015: Customers are more likely to be 'established middle class', whereas the underrepresented cluster leans toward 'low-consumption middle class'.

##### CAMEO_INTL_2015: Prosperous households are more likely to be customers.

##### ANZ_PERSONEN: Households with more adults are more likely to use the company's products.

##### ALTER_HH: The main age within customer households is lower than in the underrepresented cluster.

