# Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services

In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned on a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to convert into becoming customers for the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

The versions of those two datasets used in this project include many more features and have not been pre-cleaned. You are also free to choose whatever approach you'd like for analyzing the data rather than following pre-determined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable for this project will be a blog post reporting your findings.

In [155]:
# Check the present working directory
!pwd

/Users/mohanadelemary/Desktop/Uda/Nano_3_Capstone/bertelsmann-arvato-customer-segmentation


### Download the data
If you do not have the required **data/** directory in your workspace, use either of the two methods below to get it.

**Method 1** <br/>
You must [download this dataset](https://video.udacity-data.com/topher/2024/August/66b9ba05_arvato_data.tar/arvato_data.tar.gz) from the Downloads section in the classroom and upload it into the workspace. After you upload the tar file to the present working directory, **/workspace/cd1971 Data Scientist Capstone/Bertelsmann_Arvato Project Workspace/**, in the Jupyter server, open a terminal and run the following command to extract the dataset from the compressed file.
```bash
tar -xzvf arvato_data.tar.gz
```
This command will extract all the contents of arvato_data.tar.gz into the current directory. 

**Method 2** <br/>
Execute the Python code below to download the dataset. 


In [156]:
import requests
import tarfile
import os


def download_and_extract(url, extract_to='.'):
    """
    Downloads a tar.gz file from a URL and extracts it to a directory.
    Args:
    - url (str): URL of the tar.gz file to download.
    - extract_to (str): Directory path to extract the contents of the tar.gz file.
    """
    # Get the filename from the URL
    filename = url.split('/')[-1]

    # Download the file in chunks so it never has to fit in memory at once
    print("Downloading the file...")
    response = requests.get(url, stream=True)
    if response.status_code == 200:
        with open(filename, 'wb') as file:
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                file.write(chunk)
        print("Download completed.")
    else:
        print(f"Failed to download the file (HTTP {response.status_code}).")
        return

    # Extract the tar.gz file
    print("Extracting the file...")
    try:
        with tarfile.open(filename, 'r:gz') as tar:
            tar.extractall(path=extract_to)
        print("Extraction completed.")
    except Exception as e:
        print(f"Failed to extract the file: {e}")
    finally:
        # Optionally remove the tar.gz file after extraction
        os.remove(filename)
        print("Downloaded tar.gz file removed.")

# URL of the tar.gz file
url = 'https://video.udacity-data.com/topher/2024/August/66b9ba05_arvato_data.tar/arvato_data.tar.gz'

# Call the function with the URL
download_and_extract(url)



Downloading the file...
Download completed.
Extracting the file...


Extraction completed.
Downloaded tar.gz file removed.



### Important Note
>Delete the **data/** folder and the downloaded tar file before you submit your code. The workspace cannot store more than 1 GB of files in total.


### Import the Packages

In [157]:
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# magic word for producing visualizations in notebook
%matplotlib inline

## Part 0: Get to Know the Data

There are four data files associated with this project:

- `Udacity_AZDIAS_052018.csv`: Demographics data for the general population of Germany; 891 221 persons (rows) x 366 features (columns).
- `Udacity_CUSTOMERS_052018.csv`: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
- `Udacity_MAILOUT_052018_TRAIN.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
- `Udacity_MAILOUT_052018_TEST.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. Use the information from the first two files to figure out how customers ("CUSTOMERS") are similar to or differ from the general population at large ("AZDIAS"), then use your analysis to make predictions on the other two files ("MAILOUT"), predicting which recipients are most likely to become a customer for the mail-order company.

The "CUSTOMERS" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which provide broad information about the customers depicted in the file. The original "MAILOUT" file included one additional column, "RESPONSE", which indicated whether or not each recipient became a customer of the company. For the "TRAIN" subset, this column has been retained, but in the "TEST" subset it has been removed; it is against that withheld column that your final predictions will be assessed in the Kaggle competition.

Otherwise, all of the remaining columns are the same between the three data files. For more information about the columns depicted in the files, you can refer to two Excel spreadsheets provided in the workspace. [One of them](./DIAS Information Levels - Attributes 2017.xlsx) is a top-level list of attributes and descriptions, organized by informational category. [The other](./DIAS Attributes - Values 2017.xlsx) is a detailed mapping of data values for each feature in alphabetical order.

In the below cell, we've provided some initial code to load in the first two datasets. Note for all of the `.csv` data files in this project that they're semicolon (`;`) delimited, so an additional argument in the [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) call has been included to read in the data properly. Also, considering the size of the datasets, it may take some time for them to load completely.

You'll notice when the data is loaded in that a warning message will immediately pop up. Before you really start digging into the modeling and analysis, you're going to need to perform some cleaning. Take some time to browse the structure of the data and look over the informational spreadsheets to understand the data values. Make some decisions on which features to keep, which features to drop, and if any revisions need to be made on data formats. It'll be a good idea to create a function with pre-processing steps, since you'll need to clean all of the datasets before you work with them.

>Note: If you encounter a "Kernel died" error while running the code block below, load fewer rows from the .csv files.
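
One way to load fewer rows is the `nrows` argument of `read_csv()`. The sketch below uses a tiny in-memory CSV with made-up values as a stand-in for the real files, so the column values and the row limit are illustrative assumptions only.

```python
import io
import pandas as pd

# Tiny semicolon-delimited stand-in for the real CSV files (hypothetical data).
csv_text = "LNR;AGER_TYP;ANREDE_KZ\n1;-1;2\n2;2;1\n3;-1;2\n4;1;1\n"

# nrows caps how many data rows are parsed; the header row is not counted.
sample = pd.read_csv(io.StringIO(csv_text), sep=';', nrows=2)
print(sample.shape)
```

For the project files the same argument would be passed alongside `sep=';'`. Keep in mind that `nrows` takes the first rows of the file, which may not be a representative sample.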

In [158]:
# load in the data
azdias = pd.read_csv('data/Udacity_AZDIAS_052018.csv', sep=';')
customers = pd.read_csv('data/Udacity_CUSTOMERS_052018.csv', sep=';')



### Part 0.1: Attributes Dictionary

We'll read in the two files that explain our features and their values, both for reference and to understand how to interpret the population and customer data.

In [159]:
#Reading in the two files explaining our attributes and values
attributes_values = pd.read_excel('DIAS Attributes - Values 2017.xlsx')
info = pd.read_excel('DIAS Information Levels - Attributes 2017.xlsx')

In [160]:
info.head(20)

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,,Information level,Attribute,Description,Additional notes
1,,,AGER_TYP,best-ager typology,in cooperation with Kantar TNS; the informatio...
2,,Person,ALTERSKATEGORIE_GROB,age through prename analysis,modelled on millions of first name-age-referen...
3,,,ANREDE_KZ,gender,
4,,,CJT_GESAMTTYP,Customer-Journey-Typology relating to the pref...,"relating to the preferred information, marketi..."
5,,,FINANZ_MINIMALIST,financial typology: low financial interest,Gfk-Typology based on a representative househo...
6,,,FINANZ_SPARER,financial typology: money saver,
7,,,FINANZ_VORSORGER,financial typology: be prepared,
8,,,FINANZ_ANLEGER,financial typology: investor,
9,,,FINANZ_UNAUFFAELLIGER,financial typology: unremarkable,


In [161]:
attributes_values.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,,Attribute,Description,Value,Meaning
1,,AGER_TYP,best-ager typology,-1,unknown
2,,,,0,no classification possible
3,,,,1,passive elderly
4,,,,2,cultural elderly


In [162]:
# Setting up the proper column names for attributes_values
attributes_values.columns = attributes_values.iloc[0]    # Set first row as header
attributes_values = attributes_values[1:].reset_index(drop=True)   # Drop the first row and reset index
attributes_values.head()

Unnamed: 0,NaN,Attribute,Description,Value,Meaning
0,,AGER_TYP,best-ager typology,-1,unknown
1,,,,0,no classification possible
2,,,,1,passive elderly
3,,,,2,cultural elderly
4,,,,3,experience-driven elderly


In [163]:
# The Attribute and Description columns hold a value only in the first row of each attribute.
# The following rows list the other possible values of that same attribute,
# so we simply forward-fill these two columns.

columns_to_fill = ['Attribute', 'Description']
attributes_values[columns_to_fill] = attributes_values[columns_to_fill].ffill()
attributes_values.head()


Unnamed: 0,NaN,Attribute,Description,Value,Meaning
0,,AGER_TYP,best-ager typology,-1,unknown
1,,AGER_TYP,best-ager typology,0,no classification possible
2,,AGER_TYP,best-ager typology,1,passive elderly
3,,AGER_TYP,best-ager typology,2,cultural elderly
4,,AGER_TYP,best-ager typology,3,experience-driven elderly


In [164]:
# Setting up the proper column names for info

info.columns = info.iloc[0]    # Set first row as header
info = info[1:].reset_index(drop=True)   # Drop the first row and reset index
info.head()

Unnamed: 0,NaN,Information level,Attribute,Description,Additional notes
0,,,AGER_TYP,best-ager typology,in cooperation with Kantar TNS; the informatio...
1,,Person,ALTERSKATEGORIE_GROB,age through prename analysis,modelled on millions of first name-age-referen...
2,,,ANREDE_KZ,gender,
3,,,CJT_GESAMTTYP,Customer-Journey-Typology relating to the pref...,"relating to the preferred information, marketi..."
4,,,FINANZ_MINIMALIST,financial typology: low financial interest,Gfk-Typology based on a representative househo...


In [165]:
# Removing the empty first column from both the attributes_values and info dataframes

attributes_values = attributes_values.iloc[:, 1:]
info = info.iloc[:, 1:]
attributes_values.head()

Unnamed: 0,Attribute,Description,Value,Meaning
0,AGER_TYP,best-ager typology,-1,unknown
1,AGER_TYP,best-ager typology,0,no classification possible
2,AGER_TYP,best-ager typology,1,passive elderly
3,AGER_TYP,best-ager typology,2,cultural elderly
4,AGER_TYP,best-ager typology,3,experience-driven elderly


In [166]:
# Joining both df's into a single df on the attribute column

attributes = info.merge(attributes_values, how='outer', on='Attribute')
attributes.head()

Unnamed: 0,Information level,Attribute,Description_x,Additional notes,Description_y,Value,Meaning
0,,AGER_TYP,best-ager typology,in cooperation with Kantar TNS; the informatio...,best-ager typology,-1,unknown
1,,AGER_TYP,best-ager typology,in cooperation with Kantar TNS; the informatio...,best-ager typology,0,no classification possible
2,,AGER_TYP,best-ager typology,in cooperation with Kantar TNS; the informatio...,best-ager typology,1,passive elderly
3,,AGER_TYP,best-ager typology,in cooperation with Kantar TNS; the informatio...,best-ager typology,2,cultural elderly
4,,AGER_TYP,best-ager typology,in cooperation with Kantar TNS; the informatio...,best-ager typology,3,experience-driven elderly


In [167]:
# The 'Information level' column is mostly null; very few cells hold a value

attributes['Information level'].value_counts()

Information level
Household             22
Microcell (RR4_ID)    10
Postcode               8
125m x 125m Grid       8
PLZ8                   7
Person                 6
Microcell (RR3_ID)     6
RR1_ID                 5
Building               1
Community              1
Name: count, dtype: int64

In [168]:
attributes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2271 entries, 0 to 2270
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Information level  74 non-null     object
 1   Attribute          2271 non-null   object
 2   Description_x      2176 non-null   object
 3   Additional notes   173 non-null    object
 4   Description_y      2258 non-null   object
 5   Value              2258 non-null   object
 6   Meaning            2247 non-null   object
dtypes: object(7)
memory usage: 124.3+ KB


In [169]:
def fetch_contains(search_string, df=attributes):
    """
    PURPOSE:
    - Provide a quick way to look up any attribute that we'll be working with in the population and customer data

    INPUT:
    - search_string: substring to look for (case-insensitive) in every cell of the dataframe
    - df: dataframe to search; defaults to the attributes dataframe

    OUTPUT:
    - dataframe containing all rows of df in which the substring was found
    """
    # Build a boolean mask: True for rows where any cell contains the substring.
    # regex=False treats search_string as literal text rather than a pattern.
    mask = df.apply(
        lambda row: row.astype(str).str.contains(search_string, case=False, na=False, regex=False).any(),
        axis=1,
    )

    # Select the matching rows from the dataframe that was passed in
    return df[mask]

In [170]:
fetch_contains('ager')

Unnamed: 0,Information level,Attribute,Description_x,Additional notes,Description_y,Value,Meaning
0,,AGER_TYP,best-ager typology,in cooperation with Kantar TNS; the informatio...,best-ager typology,-1,unknown
1,,AGER_TYP,best-ager typology,in cooperation with Kantar TNS; the informatio...,best-ager typology,0,no classification possible
2,,AGER_TYP,best-ager typology,in cooperation with Kantar TNS; the informatio...,best-ager typology,1,passive elderly
3,,AGER_TYP,best-ager typology,in cooperation with Kantar TNS; the informatio...,best-ager typology,2,cultural elderly
4,,AGER_TYP,best-ager typology,in cooperation with Kantar TNS; the informatio...,best-ager typology,3,experience-driven elderly
89,,CAMEO_DEU_2015,CAMEO_4.0: specific group,,CAMEO classification 2015 - detailled classifi...,1B,Wealthy Best Ager
105,,CAMEO_DEU_2015,CAMEO_4.0: specific group,,CAMEO classification 2015 - detailled classifi...,4E,Golden Ager
736,,GFK_URLAUBERTYP,vacation habits,,vacation habits,7,Golden ager
1906,,LP_FAMILIE_FEIN,family type fine,,familytyp fine,4,single parent with teenager
1909,,LP_FAMILIE_FEIN,family type fine,,familytyp fine,7,family with teenager


### Part 0.2: Customers & Population Data 

Given that the two dataframes are closely related and will be compared to each other in the end, it makes sense to analyze and clean both in parallel.


In [171]:
azdias.head()

Unnamed: 0,LNR,AGER_TYP,AKT_DAT_KL,ALTER_HH,ALTER_KIND1,ALTER_KIND2,ALTER_KIND3,ALTER_KIND4,ALTERSKATEGORIE_FEIN,ANZ_HAUSHALTE_AKTIV,...,VHN,VK_DHT4A,VK_DISTANZ,VK_ZG11,W_KEIT_KIND_HH,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,ANREDE_KZ,ALTERSKATEGORIE_GROB
0,910215,-1,,,,,,,,,...,,,,,,,,3,1,2
1,910220,-1,9.0,0.0,,,,,21.0,11.0,...,4.0,8.0,11.0,10.0,3.0,9.0,4.0,5,2,1
2,910225,-1,9.0,17.0,,,,,17.0,10.0,...,2.0,9.0,9.0,6.0,3.0,9.0,2.0,5,2,3
3,910226,2,1.0,13.0,,,,,13.0,1.0,...,0.0,7.0,10.0,11.0,,9.0,7.0,3,2,4
4,910241,-1,1.0,20.0,,,,,14.0,3.0,...,2.0,3.0,5.0,4.0,2.0,9.0,3.0,4,1,3


In [172]:
customers.head()

Unnamed: 0,LNR,AGER_TYP,AKT_DAT_KL,ALTER_HH,ALTER_KIND1,ALTER_KIND2,ALTER_KIND3,ALTER_KIND4,ALTERSKATEGORIE_FEIN,ANZ_HAUSHALTE_AKTIV,...,VK_ZG11,W_KEIT_KIND_HH,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,PRODUCT_GROUP,CUSTOMER_GROUP,ONLINE_PURCHASE,ANREDE_KZ,ALTERSKATEGORIE_GROB
0,9626,2,1.0,10.0,,,,,10.0,1.0,...,2.0,6.0,9.0,7.0,3,COSMETIC_AND_FOOD,MULTI_BUYER,0,1,4
1,9628,-1,9.0,11.0,,,,,,,...,3.0,0.0,9.0,,3,FOOD,SINGLE_BUYER,0,1,4
2,143872,-1,1.0,6.0,,,,,0.0,1.0,...,11.0,6.0,9.0,2.0,3,COSMETIC_AND_FOOD,MULTI_BUYER,0,2,4
3,143873,1,1.0,8.0,,,,,8.0,0.0,...,2.0,,9.0,7.0,1,COSMETIC,MULTI_BUYER,0,1,4
4,143874,-1,1.0,20.0,,,,,14.0,7.0,...,4.0,2.0,9.0,3.0,1,FOOD,MULTI_BUYER,0,1,3


In [173]:
# Looking into the df's shapes
print(azdias.shape)
print(customers.shape)

(891221, 366)
(191652, 369)
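
The three-column difference between the two shapes matches the extra customer columns described in Part 0. A quick way to confirm which columns differ is an index set difference. The sketch below uses two toy frames (`azdias_demo` and `customers_demo`, hypothetical names with made-up values) rather than the full datasets.

```python
import pandas as pd

# Toy frames standing in for azdias and customers (hypothetical values).
azdias_demo = pd.DataFrame({'LNR': [1, 2], 'AGER_TYP': [-1, 2]})
customers_demo = pd.DataFrame({
    'LNR': [3, 4],
    'AGER_TYP': [1, -1],
    'PRODUCT_GROUP': ['FOOD', 'COSMETIC'],
    'CUSTOMER_GROUP': ['MULTI_BUYER', 'SINGLE_BUYER'],
    'ONLINE_PURCHASE': [0, 1],
})

# Columns present in one frame but not in the other.
only_in_customers = customers_demo.columns.difference(azdias_demo.columns)
only_in_azdias = azdias_demo.columns.difference(customers_demo.columns)

print(list(only_in_customers))
print(list(only_in_azdias))
```

Running the same two `difference()` calls on the real `customers` and `azdias` frames would show whether anything other than the three customer-only columns differs.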


In [174]:
# Checking for duplicates
print(azdias.duplicated().sum())
print(customers.duplicated().sum())

0
0


In [175]:
# Checking for nulls
# To support the analysis of each feature independently, we'll create a function that builds a table of null counts per column for a given df

def create_null_table(df):
    """
    INPUT:
    df: dataframe to be processed

    OUTPUT:
    null_df: dataframe with one row per column of df ('index') and its number of null values (0), sorted in descending order
    """
    null_df = df.isna().sum()
    null_df = null_df.reset_index()
    null_df = null_df.sort_values(by=0, ascending=False)

    return null_df
    




In [176]:
cn = create_null_table(customers)

In [177]:
an = create_null_table(azdias)


In [178]:
customers.shape

(191652, 369)

In [179]:
cn.head(10)

Unnamed: 0,index,0
7,ALTER_KIND4,191416
6,ALTER_KIND3,190377
5,ALTER_KIND2,186552
4,ALTER_KIND1,179886
300,KK_KUNDENTYP,111937
100,EXTSEL992,85283
148,KBA05_KRSOBER,55980
144,KBA05_KRSHERST1,55980
136,KBA05_GBZ,55980
137,KBA05_HERST1,55980


In [180]:
azdias.shape

(891221, 366)

In [181]:
an.head(10)

Unnamed: 0,index,0
7,ALTER_KIND4,890016
6,ALTER_KIND3,885051
5,ALTER_KIND2,861722
4,ALTER_KIND1,810163
100,EXTSEL992,654153
300,KK_KUNDENTYP,584612
8,ALTERSKATEGORIE_FEIN,262947
85,D19_VERSAND_ONLINE_QUOTE_12,257113
62,D19_LOTTO,257113
36,D19_BANKEN_ONLINE_QUOTE_12,257113


In [182]:
# The top four columns in both null tables are ['ALTER_KIND4','ALTER_KIND3','ALTER_KIND2','ALTER_KIND1'].
# Each is more than 90% null in both dataframes, so we drop them.

azdias.drop(columns=['ALTER_KIND4','ALTER_KIND3','ALTER_KIND2','ALTER_KIND1'], inplace=True)
customers.drop(columns=['ALTER_KIND4','ALTER_KIND3','ALTER_KIND2','ALTER_KIND1'], inplace=True)
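
The four columns above were picked by reading the null tables. A threshold on the share of missing values is one way to make that choice repeatable. The sketch below is only an illustration: `high_null_columns` is a hypothetical helper, the 0.9 cutoff is an assumption rather than a value taken from the analysis, and the demo frame holds made-up data.

```python
import numpy as np
import pandas as pd

def high_null_columns(df, threshold=0.9):
    """Return the columns of df whose share of missing values exceeds threshold."""
    null_share = df.isna().mean()  # fraction of NaN per column
    return null_share[null_share > threshold].index.tolist()

# Made-up frame: one column is 95% missing, one 25%, one complete.
demo = pd.DataFrame({
    'ALTER_KIND1': [np.nan] * 19 + [5.0],
    'ALTER_HH': [10.0] * 15 + [np.nan] * 5,
    'LNR': range(20),
})

to_drop = high_null_columns(demo, threshold=0.9)
print(to_drop)
```

Applying the same helper to both `azdias` and `customers` and dropping the union of the results would keep the two frames aligned on the same set of columns.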


In [183]:
# re-running the create_null_table function to get updated tables
cn = create_null_table(customers)
an = create_null_table(azdias)


In [185]:
cn.head(30)

Unnamed: 0,index,0
296,KK_KUNDENTYP,111937
96,EXTSEL992,85283
146,KBA05_KRSZUL,55980
141,KBA05_KRSHERST2,55980
132,KBA05_GBZ,55980
133,KBA05_HERST1,55980
134,KBA05_HERST2,55980
135,KBA05_HERST3,55980
136,KBA05_HERST4,55980
137,KBA05_HERST5,55980


In [103]:
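# isin() only matches whole cell values, so search the Attribute column for the substring instead
result = attributes[attributes['Attribute'].str.contains('ALTERS', na=False)]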
result

Unnamed: 0,Information level,Attribute,Description_x,Additional notes,Description_y,Value,Meaning
5,Person,ALTERSKATEGORIE_GROB,age through prename analysis,modelled on millions of first name-age-referen...,age classification through prename analysis,"-1, 0",unknown
6,Person,ALTERSKATEGORIE_GROB,age through prename analysis,modelled on millions of first name-age-referen...,age classification through prename analysis,1,< 30 years
7,Person,ALTERSKATEGORIE_GROB,age through prename analysis,modelled on millions of first name-age-referen...,age classification through prename analysis,2,30 - 45 years
8,Person,ALTERSKATEGORIE_GROB,age through prename analysis,modelled on millions of first name-age-referen...,age classification through prename analysis,3,46 - 60 years
9,Person,ALTERSKATEGORIE_GROB,age through prename analysis,modelled on millions of first name-age-referen...,age classification through prename analysis,4,> 60 years
10,Person,ALTERSKATEGORIE_GROB,age through prename analysis,modelled on millions of first name-age-referen...,age classification through prename analysis,9,uniformly distributed
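
The dictionary marks some codes as `unknown`, either as a single number (`-1`) or as a string listing several (`"-1, 0"`). One possible cleaning step is to turn those codes into NaN so they are counted as missing. The sketch below assumes that the meaning is spelled exactly `unknown` and that every such value parses as comma-separated integers. `unknown_codes` and `mask_unknown` are hypothetical helpers, and the demo frames are miniature copies of the outputs shown above.

```python
import numpy as np
import pandas as pd

# Miniature version of the attributes table (rows copied from the outputs above).
attributes_demo = pd.DataFrame({
    'Attribute': ['AGER_TYP', 'AGER_TYP', 'ALTERSKATEGORIE_GROB', 'ALTERSKATEGORIE_GROB'],
    'Value': [-1, 0, '-1, 0', 1],
    'Meaning': ['unknown', 'no classification possible', 'unknown', '< 30 years'],
})

def unknown_codes(attr_df):
    """Map each attribute to the list of integer codes whose meaning is 'unknown'."""
    unknown = attr_df[attr_df['Meaning'] == 'unknown']
    codes = {}
    for attribute, value in zip(unknown['Attribute'], unknown['Value']):
        # Value is either a single number or a string such as "-1, 0".
        parsed = [int(part) for part in str(value).split(',')]
        codes.setdefault(attribute, []).extend(parsed)
    return codes

def mask_unknown(df, codes):
    """Return a copy of df with the listed codes replaced by NaN, column by column."""
    out = df.copy()
    for column, values in codes.items():
        if column in out.columns:
            out[column] = out[column].where(~out[column].isin(values), np.nan)
    return out

people = pd.DataFrame({'AGER_TYP': [-1, 2, 0], 'ALTERSKATEGORIE_GROB': [0, 1, 4]})
cleaned = mask_unknown(people, unknown_codes(attributes_demo))
print(cleaned)
```

Whether codes such as `0` ("no classification possible") should also count as missing is a separate decision that this sketch leaves open.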


In [127]:
# Create a mask that checks for substring containment
mask = attributes.apply(lambda row: row.astype(str).str.contains('kk', case=False, na=False).any(), axis=1)

# Select the matching rows with the boolean mask
result = attributes[mask]
result


Unnamed: 0,Information level,Attribute,Description_x,Additional notes,Description_y,Value,Meaning
387,,D19_KK_KUNDENTYP,consumption movement in the last 12 months,AZ has access to approx. 650 Million transacti...,consumption movement in the last 12 months,-1,unknown
388,,D19_KK_KUNDENTYP,consumption movement in the last 12 months,AZ has access to approx. 650 Million transacti...,consumption movement in the last 12 months,1,regular customer
389,,D19_KK_KUNDENTYP,consumption movement in the last 12 months,AZ has access to approx. 650 Million transacti...,consumption movement in the last 12 months,2,active customer
390,,D19_KK_KUNDENTYP,consumption movement in the last 12 months,AZ has access to approx. 650 Million transacti...,consumption movement in the last 12 months,3,new costumer
391,,D19_KK_KUNDENTYP,consumption movement in the last 12 months,AZ has access to approx. 650 Million transacti...,consumption movement in the last 12 months,4,stray customer
392,,D19_KK_KUNDENTYP,consumption movement in the last 12 months,AZ has access to approx. 650 Million transacti...,consumption movement in the last 12 months,5,inactive customer
393,,D19_KK_KUNDENTYP,consumption movement in the last 12 months,AZ has access to approx. 650 Million transacti...,consumption movement in the last 12 months,6,passive customer
1891,,KKK,purchasing power,modelled on different AZ DIAS data,purchasing power,"-1, 0",unknown
1892,,KKK,purchasing power,modelled on different AZ DIAS data,purchasing power,1,very high
1893,,KKK,purchasing power,modelled on different AZ DIAS data,purchasing power,2,high


In [122]:
attr = attributes['Attribute'].value_counts()
attr = attr.reset_index()
attr

Unnamed: 0,Attribute,count
0,CAMEO_DEU_2015,44
1,LP_LEBENSPHASE_FEIN,40
2,CAMEO_DEUINTL_2015,26
3,ALTER_HH,22
4,PRAEGENDE_JUGENDJAHRE,16
...,...,...
322,MIN_GEBAEUDEJAHR,1
323,GEBURTSJAHR,1
324,GKZ,1
325,PLZ,1


In [123]:
attr = attr.sort_values(by='Attribute',ascending=True)

In [125]:
attr.head(20)

Unnamed: 0,Attribute,count
278,AGER_TYP,5
258,ALTERSKATEGORIE_GROB,6
3,ALTER_HH,22
301,ANREDE_KZ,3
312,ANZ_HAUSHALTE_AKTIV,1
311,ANZ_HH_TITEL,1
308,ANZ_PERSONEN,1
309,ANZ_TITEL,1
307,ARBEIT,1
58,BALLRAUM,8


## Part 1: Customer Segmentation Report

The main bulk of your analysis will come in this part of the project. Here, you should use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and the general population of Germany. By the end of this part, you should be able to describe parts of the general population that are more likely to be part of the mail-order company's main customer base, and which parts of the general population are less so.
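
As a rough illustration of what such an analysis could look like, the sketch below scales the data, reduces it with PCA, clusters the general population with k-means, and then compares how often each cluster occurs among customers versus the population. Everything here is an assumption made for illustration: the data is synthetic, the choice of scikit-learn, PCA and k-means is one option among many, and the number of components and clusters is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins: the "population" mixes two groups equally,
# the "customers" come mostly from the second group.
group_a = rng.normal(loc=0.0, scale=1.0, size=(300, 6))
group_b = rng.normal(loc=4.0, scale=1.0, size=(300, 6))
population_demo = np.vstack([group_a, group_b])
customers_demo = np.vstack([group_a[:20], rng.normal(loc=4.0, scale=1.0, size=(180, 6))])

# Fit every transformer on the population only, then reuse it for the customers.
scaler = StandardScaler().fit(population_demo)
pca = PCA(n_components=2, random_state=0).fit(scaler.transform(population_demo))
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(pca.transform(scaler.transform(population_demo)))

pop_labels = kmeans.labels_
cust_labels = kmeans.predict(pca.transform(scaler.transform(customers_demo)))

# Share of each cluster in both groups; a ratio above 1 marks over-representation.
shares = pd.DataFrame({
    'population': pd.Series(pop_labels).value_counts(normalize=True),
    'customers': pd.Series(cust_labels).value_counts(normalize=True),
}).fillna(0).sort_index()
shares['ratio'] = shares['customers'] / shares['population']
print(shares)
```

Clusters with a ratio well above 1 would correspond to parts of the population that are over-represented among customers, and clusters with a ratio well below 1 to parts that are under-represented.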

## Part 2: Supervised Learning Model

Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.

The "MAILOUT" data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.
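
A minimal sketch of such a verification step follows, using a held-out validation split of the training data. The data is synthetic, and several choices are assumptions not stated above: that responders are a small minority, that logistic regression is a reasonable baseline, and that ROC AUC is the evaluation metric.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic, imbalanced stand-in for the TRAIN features and the RESPONSE column.
X = rng.normal(size=(2000, 5))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1] - 3.0
y = (rng.random(2000) < 1 / (1 + np.exp(-logits))).astype(int)

# stratify keeps the share of positives equal in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# class_weight='balanced' reweights the rare positive class during fitting.
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)

# Score with predicted probabilities, not hard labels.
val_scores = model.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, val_scores)
print(f"validation ROC AUC: {auc:.3f}")
```

With the real data, `X` and `y` would come from the cleaned `mailout_train` frame, after the same pre-processing that was applied to the population and customer data.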

In [None]:
mailout_train = pd.read_csv('data/Udacity_MAILOUT_052018_TRAIN.csv', sep=';')

In [None]:
mailout_test = pd.read_csv('data/Udacity_MAILOUT_052018_TEST.csv', sep=';')