# Project: Identify Customer Segments

In this project, I apply unsupervised learning techniques to identify segments of the population that form the core customer base for Bertelsmann Arvato Analytics (a mail-order sales company in Germany). These segments can then be used to direct marketing campaigns towards the audiences with the highest expected ROI.

There are four files associated with this project; however, because I had to sign an NDA, I can't publish the actual data files to GitHub. Below is what each of these files contains.

- 'azdias.csv': Demographics data for the general population of Germany; 891211 persons (rows) x 85 features (columns).
- 'customers.csv': Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
- 'data_Dictionary.md': Detailed information file about the features in the provided datasets.
- 'feature_summary.csv': Summary of feature attributes for demographics data; 85 features (rows) x 4 columns.
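The dimensions listed above can double as a quick sanity check after loading. The sketch below is only an illustration: the `EXPECTED_SHAPES` dictionary and the `shape_matches` helper are hypothetical names, and the small `demo` frame stands in for the real files, which cannot be shared.

```python
import pandas as pd

# Expected (rows, columns) taken from the file descriptions above.
EXPECTED_SHAPES = {
    "azdias.csv": (891211, 85),
    "customers.csv": (191652, 85),
    "feature_summary.csv": (85, 4),
}


def shape_matches(df, filename, expected=EXPECTED_SHAPES):
    """Return True if df has the (rows, columns) listed for filename."""
    return df.shape == expected[filename]


# Tiny stand-in frame with made-up contents (hypothetical, not real data).
demo = pd.DataFrame({
    "attribute": ["AGER_TYP"],
    "information_level": ["person"],
    "type": ["categorical"],
    "missing_or_unknown": ["[-1,0]"],
})

# One row instead of 85, so the check fails for this stand-in.
print(shape_matches(demo, "feature_summary.csv"))
```

A check like this would flag the case where a variable ends up holding a different file than intended.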

Each row of the demographics files represents a single person, but also includes information beyond the individual, such as details about their household, building, and neighborhood. I use this information to cluster the general population into groups with similar demographic properties, then look at how the people in the customers dataset fit into those clusters. The hope is that certain clusters are over-represented in the customers data compared to the general population; those over-represented clusters are assumed to be part of the core user base. This information can then be used for further applications, such as targeting a marketing campaign.
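The over-representation comparison described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `cluster_representation` is a hypothetical helper, the cluster labels are made up, and it assumes both datasets have already been assigned to the same set of clusters by some clustering model.

```python
import pandas as pd


def cluster_representation(general_labels, customer_labels):
    """Compare each cluster's share of customers with its share of the population."""
    general_share = pd.Series(general_labels).value_counts(normalize=True)
    customer_share = pd.Series(customer_labels).value_counts(normalize=True)
    # Align on cluster id; a cluster absent from one dataset gets a share of 0.
    table = pd.DataFrame(
        {"general_share": general_share, "customer_share": customer_share}
    ).fillna(0.0)
    # ratio > 1 means the cluster is over-represented among customers.
    # Assumes every cluster contains at least one person from the general population.
    table["ratio"] = table["customer_share"] / table["general_share"]
    return table.sort_values("ratio", ascending=False)


# Made-up cluster assignments for illustration only.
general_labels = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
customer_labels = [0, 2, 2, 2, 2]

result = cluster_representation(general_labels, customer_labels)
print(result)
```

In this toy input, cluster 2 holds 30% of the population but 80% of the customers, so it sorts to the top as the most over-represented cluster.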

In [7]:
# Import standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [11]:
# Load in the general demographics data.
df_azdias = pd.read_csv('azdias.csv', sep=';')

# Load in the feature summary file.
df_feat_sum = pd.read_csv('feature_summary.csv', sep=';')

## Exploratory Data Analysis

In [13]:
df_azdias.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 4 columns):
attribute             85 non-null object
information_level     85 non-null object
type                  85 non-null object
missing_or_unknown    85 non-null object
dtypes: object(4)
memory usage: 2.7+ KB


In [14]:
df_feat_sum.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 4 columns):
attribute             85 non-null object
information_level     85 non-null object
type                  85 non-null object
missing_or_unknown    85 non-null object
dtypes: object(4)
memory usage: 2.7+ KB


In [16]:
df_azdias.describe()

| | attribute | information_level | type | missing_or_unknown |
|---|---|---|---|---|
| count | 85 | 85 | 85 | 85 |
| unique | 85 | 9 | 5 | 9 |
| top | RETOURTYP_BK_S | person | ordinal | [-1] |
| freq | 1 | 43 | 49 | 26 |


In [17]:
df_feat_sum.describe()

| | attribute | information_level | type | missing_or_unknown |
|---|---|---|---|---|
| count | 85 | 85 | 85 | 85 |
| unique | 85 | 9 | 5 | 9 |
| top | RETOURTYP_BK_S | person | ordinal | [-1] |
| freq | 1 | 43 | 49 | 26 |


In [23]:
print("The general dataset has %s rows and %s columns" %(df_azdias.shape[0], df_azdias.shape[1]))
print("The summary dataset has %s rows and %s columns"  %(df_feat_sum.shape[0], df_feat_sum.shape[1]))

The general dataset has 85 rows and 4 columns
The summary dataset has 85 rows and 4 columns


In [18]:
df_azdias.head()

| | attribute | information_level | type | missing_or_unknown |
|---|---|---|---|---|
| 0 | AGER_TYP | person | categorical | [-1,0] |
| 1 | ALTERSKATEGORIE_GROB | person | ordinal | [-1,0,9] |
| 2 | ANREDE_KZ | person | categorical | [-1,0] |
| 3 | CJT_GESAMTTYP | person | categorical | [0] |
| 4 | FINANZ_MINIMALIST | person | ordinal | [-1] |


In [19]:
df_feat_sum.head()

| | attribute | information_level | type | missing_or_unknown |
|---|---|---|---|---|
| 0 | AGER_TYP | person | categorical | [-1,0] |
| 1 | ALTERSKATEGORIE_GROB | person | ordinal | [-1,0,9] |
| 2 | ANREDE_KZ | person | categorical | [-1,0] |
| 3 | CJT_GESAMTTYP | person | categorical | [0] |
| 4 | FINANZ_MINIMALIST | person | ordinal | [-1] |
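The `missing_or_unknown` column stores its codes as strings such as `[-1,0]`. One plausible use, sketched below, is to parse those strings and mark the matching values in the demographics data as missing. This is an assumption about how the column is meant to be used, not a step shown in the notebook; `parse_missing_codes` and `mark_missing` are hypothetical helpers, and the demographic values in the example are made up.

```python
import pandas as pd


def parse_missing_codes(code_string):
    """Turn a string such as '[-1,0]' into a list like [-1, 0]."""
    inner = code_string.strip().strip("[]")
    codes = []
    for token in inner.split(","):
        token = token.strip()
        if not token:
            continue
        try:
            codes.append(int(token))
        except ValueError:
            codes.append(token)  # keep any non-numeric code as a string
    return codes


def mark_missing(df, feature_summary):
    """Return a copy of df where each attribute's listed codes become NaN."""
    cleaned = df.copy()
    pairs = zip(feature_summary["attribute"], feature_summary["missing_or_unknown"])
    for attribute, code_string in pairs:
        if attribute in cleaned.columns:
            codes = parse_missing_codes(code_string)
            # Keep values that are not listed codes; everything else becomes NaN.
            cleaned[attribute] = cleaned[attribute].where(~cleaned[attribute].isin(codes))
    return cleaned


# Two summary rows copied from the head() output; the data values are made up.
summary = pd.DataFrame({
    "attribute": ["AGER_TYP", "CJT_GESAMTTYP"],
    "missing_or_unknown": ["[-1,0]", "[0]"],
})
demo = pd.DataFrame({"AGER_TYP": [-1, 2, 0, 1], "CJT_GESAMTTYP": [0, 3, 5, 2]})

cleaned = mark_missing(demo, summary)
print(cleaned)
```

Working on a copy keeps the raw frame available for comparing missing-value counts before and after the conversion.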
