#### Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

# magic word for producing visualizations in notebook
%matplotlib inline

## Part 1: Customer Segmentation Report

The bulk of your analysis will come in this part of the project. Here, you should use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and the general population of Germany. By the end of this part, you should be able to describe which parts of the general population are more likely to belong to the mail-order company's main customer base, and which parts are less likely.

Reading pre-processed data:

In [4]:
customers = pd.read_csv('customers_scaler.csv')
customers.drop(['Unnamed: 0'], axis = 1, inplace = True)
customers.head()

Unnamed: 0,LNR,AKT_DAT_KL,ALTERSKATEGORIE_FEIN,ANZ_HAUSHALTE_AKTIV,ANZ_KINDER,ANZ_PERSONEN,ANZ_STATISTISCHE_HAUSHALTE,ANZ_TITEL,ARBEIT,BALLRAUM,...,KKK_2.0,KKK_3.0,KKK_4.0,REGIOTYP_1.0,REGIOTYP_2.0,REGIOTYP_3.0,REGIOTYP_4.0,REGIOTYP_5.0,REGIOTYP_6.0,REGIOTYP_7.0
0,9626,1.0,10.0,1.0,0.0,2.0,1.0,0.0,1.0,3.0,...,0,0,0,1,0,0,0,0,0,0
1,9628,9.0,4.0,4.0,0.0,3.0,4.0,0.0,4.0,4.0,...,0,0,1,0,0,0,1,0,0,0
2,143872,1.0,0.0,1.0,0.0,1.0,1.0,0.0,3.0,7.0,...,0,1,0,0,0,0,0,0,0,1
3,143873,1.0,8.0,0.0,0.0,0.0,1.0,0.0,1.0,7.0,...,0,1,0,0,0,0,0,0,1,0
4,143874,1.0,14.0,7.0,0.0,4.0,7.0,0.0,3.0,3.0,...,0,0,1,0,0,0,0,0,0,1


In [5]:
azdias = pd.read_csv('azdias_scaler.csv')
azdias.drop(['Unnamed: 0'], axis = 1, inplace = True)
azdias.head()

Unnamed: 0,LNR,AKT_DAT_KL,ALTERSKATEGORIE_FEIN,ANZ_HAUSHALTE_AKTIV,ANZ_KINDER,ANZ_PERSONEN,ANZ_STATISTISCHE_HAUSHALTE,ANZ_TITEL,ARBEIT,BALLRAUM,...,KKK_2.0,KKK_3.0,KKK_4.0,REGIOTYP_1.0,REGIOTYP_2.0,REGIOTYP_3.0,REGIOTYP_4.0,REGIOTYP_5.0,REGIOTYP_6.0,REGIOTYP_7.0
0,910215,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,...,0,0,1,0,0,0,1,0,0,0
1,910220,9.0,21.0,11.0,0.0,2.0,12.0,0.0,3.0,6.0,...,1,0,0,0,0,1,0,0,0,0
2,910225,9.0,17.0,10.0,0.0,1.0,7.0,0.0,3.0,2.0,...,1,0,0,0,1,0,0,0,0,0
3,910226,1.0,13.0,1.0,0.0,0.0,2.0,0.0,2.0,4.0,...,0,0,1,0,0,0,1,0,0,0
4,910241,1.0,14.0,3.0,0.0,4.0,3.0,0.0,4.0,2.0,...,0,1,0,0,0,0,0,1,0,0
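
One common way to compare the two groups is to cluster the general population and then check which clusters the customers fall into disproportionately often. The sketch below is only an illustration under stated assumptions: it uses small synthetic frames (`population_demo`, `customers_demo`, both hypothetical) in place of `azdias` and `customers`, and PCA followed by k-means is one possible choice of technique, not necessarily the one used in this notebook. With the real data, an identifier column such as `LNR` would likely be excluded before fitting.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-ins for the scaled `azdias` and `customers` frames (assumption).
rng = np.random.default_rng(0)
cols = [f"f{i}" for i in range(6)]
population_demo = pd.DataFrame(rng.normal(0.0, 1.0, size=(600, 6)), columns=cols)
customers_demo = pd.DataFrame(rng.normal(0.8, 1.0, size=(200, 6)), columns=cols)

# Fit the dimensionality reduction and the clustering on the general
# population only, then apply the same fitted transforms to the customers.
pca = PCA(n_components=3, random_state=0).fit(population_demo)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(
    pca.transform(population_demo)
)

pop_labels = kmeans.labels_
cust_labels = kmeans.predict(pca.transform(customers_demo))

# Share of each group in every cluster. A ratio above 1 marks a cluster in
# which customers are over-represented relative to the general population.
shares = pd.DataFrame({
    "population": pd.Series(pop_labels).value_counts(normalize=True),
    "customers": pd.Series(cust_labels).value_counts(normalize=True),
}).reindex(range(4)).fillna(0.0)
shares["ratio"] = shares["customers"] / shares["population"]
print(shares.sort_values("ratio", ascending=False))
```

Clusters with a high ratio describe segments of the population that resemble existing customers; clusters with a low ratio describe segments that do not. The number of components and clusters here is arbitrary and would need to be chosen from the real data.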


## Part 2: Supervised Learning Model

Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each row in the "MAILOUT" data files represents an individual who was targeted for a mailout campaign. Ideally, we should be able to use each individual's demographic information to decide whether it is worth including that person in the campaign.

The "MAILOUT" data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.

In [2]:
# mailout_train = pd.read_csv('data/Udacity_MAILOUT_052018_TRAIN.csv')
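
As a rough illustration of how a model could be checked on the "TRAIN" partition, the sketch below cross-validates a classifier on a binary target and scores it with ROC AUC. Everything in it is an assumption for demonstration purposes: the data are synthetic (`X_demo`, `y_demo` are hypothetical names), responders are assumed to be a small minority, and logistic regression with balanced class weights is just one reasonable baseline, not the model used here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the TRAIN features and RESPONSE (assumption).
rng = np.random.default_rng(0)
n_rows = 4000
X_demo = rng.normal(size=(n_rows, 5))
# Only a few percent of rows are positive, loosely driven by the first feature.
logit = -4.0 + 1.5 * X_demo[:, 0]
y_demo = (rng.random(n_rows) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# class_weight="balanced" reweights the rare positive class during fitting.
model = LogisticRegression(class_weight="balanced", max_iter=1000)

# Stratified folds keep the positive rate roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")

print(f"positive rate: {y_demo.mean():.3f}")
print(f"mean ROC AUC:  {auc_scores.mean():.3f}")
```

Stratified splitting matters when positives are rare, because an unstratified fold could end up with very few responders and give an unstable score.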

## Part 3: Kaggle Competition

Now that you've created a model to predict which individuals are most likely to respond to a mailout campaign, it's time to test that model in competition through Kaggle. The competition page is linked [here](http://www.kaggle.com/t/21e6d45d4c574c7fa2d868f0e8c83140); you can enter if you have a Kaggle account. If you're one of the top performers, you may have the chance to be contacted by a hiring manager from Arvato or Bertelsmann for an interview!

Your entry to the competition should be a CSV file with two columns. The first column should be a copy of "LNR", which acts as an ID number for each individual in the "TEST" partition. The second column, "RESPONSE", should be some measure of how likely each individual is to have become a customer; this need not be a straightforward probability. As you should have found in Part 2, there is a large output class imbalance: most individuals did not respond to the mailout. Predicting individual classes and scoring them with accuracy is therefore not an appropriate way to evaluate performance. Instead, the competition uses AUC. The exact values in the "RESPONSE" column matter less than their ordering: the highest values should capture as many of the actual customers as possible, early in the ROC curve sweep.

In [3]:
#mailout_test = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TEST.csv', sep=';')
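
The following sketch illustrates two points from the description above: AUC depends only on how the scores rank the individuals, and the submission needs exactly the columns "LNR" and "RESPONSE". The IDs, labels and scores are made up for the example (`lnr_demo`, `y_true_demo` and `scores_demo` are hypothetical names); in practice the scores would come from a fitted model applied to the "TEST" partition.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Made-up IDs, labels and scores standing in for real model output (assumption).
rng = np.random.default_rng(0)
lnr_demo = np.arange(1000, 1020)
y_true_demo = np.zeros(20, dtype=int)
y_true_demo[:3] = 1                      # only a few responders
scores_demo = rng.random(20) + 0.5 * y_true_demo

# AUC is rank-based: a strictly increasing transform of the scores
# leaves it unchanged, so RESPONSE does not have to be a probability.
auc_raw = roc_auc_score(y_true_demo, scores_demo)
auc_scaled = roc_auc_score(y_true_demo, 100.0 * scores_demo + 7.0)
print(auc_raw == auc_scaled)

# Two-column submission: the ID first, then the score.
submission = pd.DataFrame({"LNR": lnr_demo, "RESPONSE": scores_demo})

# With no path, to_csv returns the CSV text; pass a file name to write a file.
csv_text = submission.to_csv(index=False)
print(csv_text.splitlines()[0])
```

The labels are available here only because the data are synthetic; for the real "TEST" partition the "RESPONSE" column is withheld, so the AUC is computed by the competition rather than locally.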