# Project: Identify Customer Segments

In this project, I will apply unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards audiences that will have the highest expected rate of returns. The data that I will use has been provided by partners at Bertelsmann Arvato Analytics.

In [None]:
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
# magic word for producing visualizations in notebook
%matplotlib inline

In [1]:
### Step 0: Load the Data

There are four files associated with this project (not including this one):

- `Udacity_AZDIAS_Subset.csv`: Demographics data for the general population of Germany; 891211 persons (rows) x 85 features (columns).
- `Udacity_CUSTOMERS_Subset.csv`: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
- `Data_Dictionary.md`: Detailed information file about the features in the provided datasets.
- `AZDIAS_Feature_Summary.csv`: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns



In [2]:
# Load in the general demographics data.
azdias = pd.read_csv("Udacity_AZDIAS_Subset.csv",delimiter=";")

# Load in the feature summary file.
feat_info = pd.read_csv("AZDIAS_Feature_Summary.csv",delimiter=";")

NameError: name 'pd' is not defined

In [3]:
# Check the structure of the data after it's loaded
print("The general demographic data shape is " , azdias.shape)

NameError: name 'azdias' is not defined

In [4]:
azdias.head()

NameError: name 'azdias' is not defined

In [5]:
azdias.tail()

NameError: name 'azdias' is not defined

In [6]:
azdias.describe()

NameError: name 'azdias' is not defined

In [7]:
azdias.info()

NameError: name 'azdias' is not defined

In [8]:
# Check the structure of the data after it's loaded
print("The feature summary data shape is " , feat_info.shape)

NameError: name 'feat_info' is not defined

In [9]:
feat_info.head()

NameError: name 'feat_info' is not defined

In [10]:
feat_info.tail()

NameError: name 'feat_info' is not defined

In [11]:
feat_info.describe()

NameError: name 'feat_info' is not defined

In [12]:
feat_info.info()

NameError: name 'feat_info' is not defined

In [13]:
## Step 1: Preprocessing

### Step 1.1: Assess Missing Data

The feature summary file contains a summary of properties for each demographics data column. I will use this file to help me make cleaning decisions during this stage of the project. First of all, we should assess the demographics data in terms of missing data.

#### Step 1.1.1: Convert Missing Value Codes to NaNs
The fourth column of the feature attributes summary (loaded in above as `feat_info`) documents the codes from the data dictionary that indicate missing or unknown data.
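As a minimal sketch of this conversion, assuming the codes are stored as strings such as `[-1,0]` or `[X]`: the helper `parse_missing_codes` and the toy frames below are hypothetical stand-ins for illustration, not part of the project files.

```python
import pandas as pd

def parse_missing_codes(raw):
    """Turn a string like '[-1,X]' into a list such as [-1, 'X']."""
    inner = raw.strip().strip("[]")
    if not inner:
        return []
    codes = []
    for token in inner.split(","):
        token = token.strip()
        try:
            codes.append(int(token))
        except ValueError:
            codes.append(token)  # non-numeric codes such as 'X' stay as strings
    return codes

# Toy stand-ins for feat_info and azdias (illustrative values only).
toy_feat_info = pd.DataFrame({
    "attribute": ["A", "B", "C"],
    "missing_or_unknown": ["[-1,0]", "[X]", "[]"],
})
toy_data = pd.DataFrame({
    "A": [1, -1, 0, 3],
    "B": ["W", "X", "O", "W"],
    "C": [5, 6, 7, 8],
})

for attribute, raw in zip(toy_feat_info["attribute"], toy_feat_info["missing_or_unknown"]):
    codes = parse_missing_codes(raw)
    if codes:
        # mask() replaces the flagged entries with NaN in one vectorized pass
        toy_data[attribute] = toy_data[attribute].mask(toy_data[attribute].isin(codes))

missing_per_column = toy_data.isna().sum().to_dict()
```

Using `isin` with `mask` handles all codes of a column at once, which may be noticeably faster on a large frame than applying a lambda once per code.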


In [14]:
# see what kind of missing code values
print('Unique missing values code in all columns are:', feat_info['missing_or_unknown'].unique())

NameError: name 'feat_info' is not defined

In [15]:
# Split feat_info into two dataframes: feat_info1 holds the attributes that have missing-value codes and needs
# preprocessing; feat_info2 holds the attributes without any missing-value codes ('[]') and is left untouched.
# .copy() makes both frames independent, so the column assignments below do not act on a view of feat_info.
feat_info1 = feat_info[feat_info['missing_or_unknown'] != '[]'].copy()
feat_info2 = feat_info[feat_info['missing_or_unknown'] == '[]'].copy()

NameError: name 'feat_info' is not defined

In [16]:
# preprocess the columns of missing_or_unknown values codes to make it easy to extract information from it
feat_info1['missing_or_unknown'] = feat_info1['missing_or_unknown'].apply(lambda x: x.replace("[",""))
feat_info1['missing_or_unknown'] = feat_info1['missing_or_unknown'].apply(lambda x: x.replace("]",""))
feat_info1['missing_or_unknown'] = feat_info1['missing_or_unknown'].apply(lambda x: x.split(","))
feat_info1['missing_or_unknown'] = feat_info1['missing_or_unknown'].apply(lambda x: [t if t.isalpha() else int(t) for t in x])
feat_info = pd.concat([feat_info1, feat_info2], axis=0)
feat_info = feat_info.reset_index(drop=True)

NameError: name 'feat_info1' is not defined

In [17]:
# Identify missing or unknown data values and convert them to NaNs.
missing_values_codes = []
columns = []
for column, missing_value_code in zip(feat_info["attribute"],feat_info["missing_or_unknown"]):
    missing_values_codes.extend(missing_value_code)
    columns.extend([column] * len(missing_value_code))

NameError: name 'feat_info' is not defined

In [18]:
# Convert missing-value codes to NaNs
for attribute, code in zip(columns, missing_values_codes):
    azdias[attribute] = azdias[attribute].apply(lambda x: np.nan if x == code else x)

In [19]:
#### Step 1.1.2: Assess Missing Data in Each Column

How much missing data is present in each column? There are a few columns that are outliers in terms of the proportion of values that are missing.


In [20]:
# Perform an assessment of how much missing data there is in each column of the
# dataset.
proportion_missing_values = azdias.isna().mean()
for col , proportion in zip(azdias.columns, proportion_missing_values):
    print("The proportion of values that are missing in " , col , " column is ", proportion)

NameError: name 'azdias' is not defined

In [21]:
# Investigate patterns in the amount of missing data in each column.
plt.figure(figsize=(17,10))
plt.title("The patterns in amount of missing data in each column")
plt.hist(proportion_missing_values,rwidth=0.96)
plt.xlabel('Proportion of Missing Values')
plt.ylabel('Number of Columns')
plt.show()

NameError: name 'plt' is not defined

In [22]:
# Boxplot of the missing-value proportions, used to spot the outlier columns.
plt.figure(figsize=(17,10))
plt.title("Boxplot for Proportion of Missing Values in each attribute")
plt.boxplot(proportion_missing_values,vert=False)
plt.xlabel('Proportion of Missing Values')

NameError: name 'plt' is not defined

In [23]:
# Remove the outlier columns, identified with the 1.5 * IQR rule
Q1_col = np.quantile(proportion_missing_values, 0.25)
Q3_col = np.quantile(proportion_missing_values, 0.75)
IQR_col = Q3_col - Q1_col
lower_bound_col = Q1_col - (1.5 * IQR_col)
upper_bound_col = Q3_col + (1.5 * IQR_col)
removed_columns = []
for col , proportion in zip(proportion_missing_values.index , proportion_missing_values):
    if proportion >= upper_bound_col or proportion <= lower_bound_col: removed_columns.append(col)
print('Removed Columns are:', removed_columns)
azdias_dropped_columns = azdias.drop(removed_columns,axis=1)

NameError: name 'np' is not defined

In [24]:
The missing values follow a clear pattern. The majority of columns (**93%**) have between **0%** and **20%** of their values missing. A small group of **6** columns stands apart from the rest, each with more than **30%** of its values missing. I will drop these columns from the dataset.

The outlier columns that will be dropped are:
- **AGER_TYP**
- **GEBURTSJAHR**
- **TITEL_KZ**
- **ALTER_HH**
- **KK_KUNDENTYP**
- **KBA05_BAUMAX**
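
The outlier rule behind this list can be sketched on made-up numbers. The proportions below are hypothetical (not the real AZDIAS figures), and only the upper bound is checked, on the assumption that a column cannot have too little missing data.

```python
import numpy as np
import pandas as pd

# Hypothetical per-column missing proportions, for illustration only.
toy_proportions = pd.Series(
    {"col_a": 0.00, "col_b": 0.05, "col_c": 0.10, "col_d": 0.12, "col_e": 0.15, "col_f": 0.70}
)

q1, q3 = np.quantile(toy_proportions, [0.25, 0.75])
upper_bound = q3 + 1.5 * (q3 - q1)

# Columns whose missing proportion lies above the upper whisker are treated as outliers.
outlier_columns = toy_proportions[toy_proportions > upper_bound].index.tolist()
```

Here only `col_f` exceeds the upper bound, so it would be the single column dropped.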


In [25]:
#### Step 1.1.3: Assess Missing Data in Each Row

How much data is missing in each row? As with the columns, we should see some groups of points that have very different numbers of missing values. We will divide the data into two subsets: one for data points that are above some threshold for missing values, and a second subset for points below that threshold.

In order to know what to do with the outlier rows, we should see whether the distributions of values in columns that have no (or very little) missing data are similar or different between the two groups. We will select at least five of these columns and compare the distributions of values.

What we observe in this comparison has implications for how we approach our conclusions later in the analysis. If the distributions of non-missing features look similar between the data with many missing values and the data with few or no missing values, then we could argue that simply dropping those points from the analysis won't present a major issue. On the other hand, if the data with many missing values looks very different from the data with few or no missing values, then we should flag those data points as special.
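
Besides plotting, the comparison can also be made numerically. The following is a rough sketch under the assumption that a side-by-side table of value shares is informative enough; `compare_distributions` and the toy frames are hypothetical names introduced only for this example.

```python
import pandas as pd

def compare_distributions(many_missing, few_missing, column):
    """Return the share of each value of `column` in both subsets, side by side."""
    table = pd.DataFrame({
        "many_missing": many_missing[column].value_counts(normalize=True),
        "few_missing": few_missing[column].value_counts(normalize=True),
    }).fillna(0.0)
    # A large absolute difference means the two subsets disagree on that value.
    table["abs_diff"] = (table["many_missing"] - table["few_missing"]).abs()
    return table.sort_index()

# Toy data standing in for the two row subsets.
toy_many = pd.DataFrame({"ANREDE_KZ": [1, 1, 2, 2]})
toy_few = pd.DataFrame({"ANREDE_KZ": [1, 2, 2, 2]})
comparison = compare_distributions(toy_many, toy_few, "ANREDE_KZ")
```

Values with a small `abs_diff` are distributed similarly in both subsets, while large differences suggest that the rows with many missing values form a distinct group.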


In [26]:
# How much data is missing in each row of the dataset?
azdias_dropped_columns["NAN count"] = azdias_dropped_columns.isna().sum(axis=1)
print("The number of missing values in each row of the dataset")
print(azdias_dropped_columns["NAN count"])

NameError: name 'azdias_dropped_columns' is not defined

In [27]:
plt.figure(figsize=(17,10))
plt.title('Boxplot for Number of Missing Values in each row')
plt.boxplot(azdias_dropped_columns["NAN count"],vert=False)
plt.xlabel('Number of Missing Values in each row')

NameError: name 'plt' is not defined

In [28]:
# code to divide the data into two subsets based on the number of missing
# values in each row.
Q1_row = np.quantile(azdias_dropped_columns["NAN count"], 0.25)
Q3_row = np.quantile(azdias_dropped_columns["NAN count"], 0.75)
IQR_row = Q3_row - Q1_row
upper_bound_row = np.ceil(Q3_row + (1.5 * IQR_row))
# Data with many missing values
first_subset = azdias_dropped_columns.loc[azdias_dropped_columns["NAN count"] >= upper_bound_row]
# Data with few or no missing values
second_subset = azdias_dropped_columns.loc[azdias_dropped_columns["NAN count"] < upper_bound_row]

NameError: name 'np' is not defined

In [29]:
# Compare the distribution of values for at least five columns where there are
# no or few missing values, between the two subsets.
def countplot(data1, data2, column_name, super_title, sub_title1, sub_title2, mode='count'):
    """
    This function creates a countplot of a column of data in two different data sets. The plot is divided into two subplots, 
    with the option to display the count of occurrences or the proportion of occurrences.

    Parameters:
    data1 (pandas dataframe): The first data set to be plotted.
    data2 (pandas dataframe): The second data set to be plotted.
    column_name (str): The name of the column to be plotted.
    super_title (str): The title of the entire plot.
    sub_title1 (str): The title of the first subplot.
    sub_title2 (str): The title of the second subplot.
    mode (str, optional): The mode of the plot. Can be either 'count' (default) or 'proportion'.

    Returns:
    None
    """
    figure , ax = plt.subplots(1,2)
    figure.suptitle(super_title)
    figure.set_figheight(7)
    figure.set_figwidth(18)
    if mode == 'proportion':
        proportions_1 = data1[column_name].value_counts(normalize=True)
        proportions_2 = data2[column_name].value_counts(normalize=True)
        sns.barplot(x=proportions_1.index, y=proportions_1.values, ax=ax[0])
        ax[0].set_title(sub_title1)
        sns.barplot(x=proportions_2.index, y=proportions_2.values, ax=ax[1])
        ax[1].set_title(sub_title2)
    elif mode == 'count':
        sns.countplot(x=column_name, data=data1, ax=ax[0])
        ax[0].set_title(sub_title1)
        sns.countplot(x=column_name, data=data2, ax=ax[1])
        ax[1].set_title(sub_title2)
columns_names = [col for col, value in zip(proportion_missing_values.index, proportion_missing_values) if value == 0]
for col in columns_names:
    countplot(first_subset, second_subset, col, "Distribution of data values", "Data with many missing values",
             "Data with few or no missing values")

NameError: name 'proportion_missing_values' is not defined

In [30]:
# initialize different imputers
categorical_mixed_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
numerical_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
ordinal_imputer = SimpleImputer(missing_values=np.nan, strategy="median")

NameError: name 'SimpleImputer' is not defined

In [31]:
def fill_missing_values(df, flag= 'customer'):
    """
    This function imputes missing values in a DataFrame. It imputes different data types differently.
    For categorical and mixed type columns, it uses the 'most_frequent' strategy.
    For numerical type columns, it uses the 'mean' strategy.
    For ordinal type columns, it uses the 'median' strategy.
    
    Parameters:
    df (pandas.DataFrame): The DataFrame which has missing values to be imputed.
    flag (str): Either 'general' or 'customer'. With 'general', the imputers are fitted on this data and then
    applied to it. With 'customer', the already-fitted imputers are only applied, without a fit step.
    
    Returns:
    df (pandas.DataFrame): The DataFrame with imputed missing values.
    
    """
    # impute categorical and mixed data types features
    categorical_mixed_cols = feat_info[(feat_info['type'] == 'categorical') | (feat_info['type'] == 'mixed')].attribute
    categorical_mixed_cols = categorical_mixed_cols.apply(lambda x: x if x in df.columns else np.nan).dropna().to_list()
    if flag == 'general':
        categorical_mixed_imputer.fit(df[categorical_mixed_cols])
        imputed_categorical_mixed_columns = categorical_mixed_imputer.transform(df[categorical_mixed_cols])
    elif flag == 'customer':
        imputed_categorical_mixed_columns = categorical_mixed_imputer.transform(df[categorical_mixed_cols])
    imputed_categorical_mixed_columns = pd.DataFrame(imputed_categorical_mixed_columns, index= df.index, columns= categorical_mixed_cols)
    df = df.drop(columns=categorical_mixed_cols)
    df = pd.concat([df, imputed_categorical_mixed_columns], axis=1)
    # impute numerical data types features
    numerical_cols = feat_info[feat_info['type'] == 'numeric'].attribute
    numerical_cols = numerical_cols.apply(lambda x: x if x in df.columns else np.nan).dropna().to_list()
    if flag == 'general':
        numerical_imputer.fit(df[numerical_cols])
        imputed_numerical_columns = numerical_imputer.transform(df[numerical_cols])
    elif flag == 'customer':
        imputed_numerical_columns = numerical_imputer.transform(df[numerical_cols])
    imputed_numerical_columns = pd.DataFrame(imputed_numerical_columns, index= df.index, columns= numerical_cols)
    df = df.drop(columns=numerical_cols)
    df = pd.concat([df, imputed_numerical_columns], axis=1)
    # impute ordinal data types features
    ordinal_cols = feat_info[feat_info['type'] == 'ordinal'].attribute
    ordinal_cols = ordinal_cols.apply(lambda x: x if x in df.columns else np.nan).dropna().to_list()
    if flag == 'general':
        ordinal_imputer.fit(df[ordinal_cols])
        imputed_ordinal_columns = ordinal_imputer.transform(df[ordinal_cols])
    elif flag == 'customer':
        imputed_ordinal_columns = ordinal_imputer.transform(df[ordinal_cols])
    imputed_ordinal_columns = pd.DataFrame(imputed_ordinal_columns, index= df.index, columns= ordinal_cols)
    df = df.drop(columns=ordinal_cols)
    df = pd.concat([df, imputed_ordinal_columns], axis=1)
    return df

In [32]:
# impute the missing values in each column before encoding steps
second_subset = fill_missing_values(second_subset, 'general')

NameError: name 'second_subset' is not defined

In [33]:
The distributions of non-missing features look similar between the data with many missing values and the data with few or no missing values only for the **ANREDE_KZ** feature; they differ for the rest of the features. I therefore decided not to drop any data points, because they are special.


In [34]:
### Step 1.2: Select and Re-Encode Features

- For numeric and interval data, these features can be kept without changes.
- Most of the variables in the dataset are ordinal in nature. While ordinal values may technically be non-linear in spacing, make the simplifying assumption that the ordinal variables can be treated as being interval in nature (that is, kept without any changes).
- Special handling may be necessary for the remaining two variable types: categorical, and 'mixed'.


In [35]:
# How many features are there of each data type?
feature__data_type_count = feat_info['type'].value_counts()
for data_type, count in zip(feature__data_type_count.index, feature__data_type_count):
    print("There are ", count , " Features in {} data type".format(data_type))

NameError: name 'feat_info' is not defined

In [36]:
#### Step 1.2.1: Re-Encode Categorical Features

For categorical data, we would ordinarily need to encode the levels as dummy variables. Depending on the number of categories, perform one of the following:
- For binary (two-level) categoricals that take numeric values, we can keep them without needing to do anything.
- There is one binary variable that takes on non-numeric values. For this one, we need to re-encode the values as numbers or create a dummy variable.
- For multi-level categoricals (three or more values), you can choose to encode the values using multiple dummy variables, or (to keep things straightforward) just drop them from the analysis.
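
The three cases above can be illustrated on a small made-up frame. The column names and the `W`/`O` levels are assumptions chosen for the example, not values taken from the data dictionary.

```python
import pandas as pd

toy = pd.DataFrame({
    "binary_numeric": [1, 2, 1],     # already numeric: kept as is
    "binary_text": ["W", "O", "W"],  # two non-numeric levels: mapped to 0/1
    "multi_level": ["1", "3", "2"],  # three or more levels: one dummy column per level
})

toy["binary_text"] = toy["binary_text"].map({"W": 0, "O": 1})
dummies = pd.get_dummies(toy["multi_level"], prefix="multi_level")
encoded = pd.concat([toy.drop(columns="multi_level"), dummies], axis=1)
```

The multi-level column is replaced by `multi_level_1`, `multi_level_2`, and `multi_level_3`, while the two binary columns stay as single numeric columns.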


In [37]:
# Drop the GEBAEUDETYP feature instead of re-encoding it: its value distribution in the customer data differs from
# the one in the general population data, which would cause conflicts when the same transformations are applied
# to the customer data.
second_subset = second_subset.drop(columns = ['GEBAEUDETYP'])

NameError: name 'second_subset' is not defined

In [38]:
# Assess categorical variables: which are binary, which are multi-level, and
# which one needs to be re-encoded?
categorical = feat_info[feat_info['type'] == 'categorical'].attribute
categorical = categorical.apply(lambda x: x if x in second_subset.columns else np.nan).dropna()
categorical_levels = {'binary-level':[], 'binary-level-re-encoded':[],'multi-level-re-encoded':[]}
for categorical_col in categorical:
    unique_categories = pd.Series(second_subset[categorical_col].unique()).dropna().to_list()
    if len(unique_categories) == 2:
        if str(unique_categories[0]).isalpha() or str(unique_categories[1]).isalpha():
            categorical_levels['binary-level-re-encoded'].append(categorical_col)
        else:
            categorical_levels['binary-level'].append(categorical_col)
    elif len(unique_categories) > 2:
        categorical_levels['multi-level-re-encoded'].append(categorical_col)
for key, value in  categorical_levels.items():
    print('============================================')
    for val in value:
        print('The attribute ' , val , 'is' , key)
    print('============================================\n')

NameError: name 'feat_info' is not defined

In [39]:
# preprocessing for multi-level attributes to be able to convert them to multiple dummy variables
for categorical_col in categorical_levels['multi-level-re-encoded']:
    second_subset.loc[:, categorical_col] = second_subset[categorical_col].apply(lambda x: x if str(x).isalpha() else str(x))

NameError: name 'categorical_levels' is not defined

In [40]:
# Re-encode categorical variable(s) to be kept in the analysis.
# re-encode binary-level attribute
second_subset[categorical_levels['binary-level-re-encoded'][0]] = second_subset[categorical_levels['binary-level-re-encoded'][0]].apply(lambda x: 0 if x == 'W' else 1)
# re-encode multi-level attribute
multiple_dummy_attributes = pd.get_dummies(second_subset[categorical_levels['multi-level-re-encoded']])
second_subset = pd.concat([second_subset, multiple_dummy_attributes],axis=1)

NameError: name 'second_subset' is not defined

In [41]:
I re-encoded the binary variable that has non-numeric values, replacing its two categorical levels with the numbers **0** and **1**. Among the multi-level attributes, I dropped **GEBAEUDETYP**, because its value distribution in the general demographics differs from that in the customer demographics, which would lead to conflicts. I re-encoded the other multi-level attributes and replaced them with the new dummy variables.


In [42]:
#### Step 1.2.2: Engineer Mixed-Type Features

There are a handful of features that are marked as "mixed" in the feature summary that require special treatment in order to be included in the analysis. There are two in particular that deserve attention; the handling of the rest is up to your own choices:
- "PRAEGENDE_JUGENDJAHRE" combines information on three dimensions: generation by decade, movement (mainstream vs. avantgarde), and nation (east vs. west). While there aren't enough levels to disentangle east from west, I should create two new variables to capture the other two dimensions: an interval-type variable for decade, and a binary variable for movement.
- "CAMEO_INTL_2015" combines information on two axes: wealth and life stage. Break up the two-digit codes by their 'tens'-place and 'ones'-place digits into two new ordinal variables.
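
The digit split described for "CAMEO_INTL_2015" can be sketched with integer arithmetic. This is an illustrative alternative to a lookup-based encoding: it keeps the raw digits (1 to 5) rather than re-basing them to start at 0, and the sample codes are made up.

```python
import numpy as np
import pandas as pd

# Made-up two-digit codes, with one missing value.
codes = pd.Series([14, 51, np.nan, 35])

wealth = codes // 10     # tens digit carries the wealth axis
life_stage = codes % 10  # ones digit carries the life-stage axis
```

Missing codes stay `NaN` in both derived series, so they can be handled by the same imputation step as the other ordinal features.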


In [43]:
def movement_encoding(x):
    """
    Encodes an input value into a binary value for movement.
    
    Parameters:
    x (int): The input value to be encoded.
    
    Returns:
    int: The encoded binary value. 0 for mainstream, 1 for avantgarde.
    If x is not in any of the above, x is returned unchanged.
    
    """
    mainstream_codes = [1, 3, 5, 8, 10, 12, 14]
    avantgarde_codes = [2, 4, 6, 7, 9, 11, 13, 15]
    if x in mainstream_codes:
        x = 0
    elif x in avantgarde_codes:
        x = 1
    else:
        x = x
    return x

In [44]:
def decade_encoding(x):
    """
    Encodes an input value into a discrete value for decade.
    
    Parameters:
    x (int): The input value to be encoded.
    
    Returns:
    int: The encoded discrete value representing a decade.
    if x = 1 or 2 return 0 (40s)
    if x = 3 or 4 return 1 (50s)
    if x = 5 or 6 or 7 return 2 (60s)
    if x = 8 or 9 return 3 (70s)
    if x = 10 or 11 or 12 return 4 (80s)
    if x = 13 or 14 or 15 return 5 (90s)
    if x not in any of the above return x 
    
    """
    decade_40s = [1, 2]
    decade_50s = [3, 4]
    decade_60s = [5, 6, 7]
    decade_70s = [8, 9]
    decade_80s = [10, 11, 12]
    decade_90s = [13, 14, 15]
    if x in decade_40s:
        x = 0
    elif x in decade_50s:
        x = 1
    elif x in decade_60s:
        x = 2
    elif x in decade_70s:
        x = 3
    elif x in decade_80s:
        x = 4
    elif x in decade_90s:
        x = 5
    else:
        x = x 
    return x

In [45]:
def wealth_encoding(x):
    """
    Encodes an input value into a discrete value for wealth.
    
    Parameters:
    x (int): The input value to be encoded.
    
    Returns:
    int: The encoded discrete value representing a wealth.
    if x in wealthy_households return 0
    if x in prosperous_households return 1
    if x in comfortable_households return 2
    if x in less_affluent_households return 3
    if x in poorer_households return 4
    if x not in any of the above return x 
    
    """
    wealthy_households = [11,12,13,14,15]
    prosperous_households = [21,22,23,24,25]
    comfortable_households = [31,32,33,34,35]
    less_affluent_households = [41,42,43,44,45]
    poorer_households = [51,52,53,54,55]
    if x in wealthy_households:
        x = 0
    elif x in prosperous_households:
        x = 1
    elif x in comfortable_households:
        x = 2
    elif x in less_affluent_households:
        x = 3
    elif x in poorer_households:
        x = 4
    else:
        x = x
    return x

In [46]:
def life_stage_encoding(x):
    """
    Encodes an input value into a discrete value for life stage.
    
    Parameters:
    x (int): The input value to be encoded.
    
    Returns:
    int: The encoded discrete value representing a life stage.
    if x in pre_family_couples_and_singles return 0
    if x in young_couples_with_children return 1
    if x in families_with_school_age_children return 2
    if x in older_families_and_mature_couples return 3
    if x in elders_in_retirement return 4
    if x not in any of the above return x 
    
    """
    pre_family_couples_and_singles = [11,21,31,41,51]
    young_couples_with_children = [12,22,32,42,52]
    families_with_school_age_children = [13,23,33,43,53]
    older_families_and_mature_couples = [14,24,34,44,54]
    elders_in_retirement = [15,25,35,45,55]
    if x in pre_family_couples_and_singles:
        x = 0
    elif x in young_couples_with_children:
        x = 1
    elif x in families_with_school_age_children:
        x = 2
    elif x in older_families_and_mature_couples:
        x = 3
    elif x in elders_in_retirement:
        x = 4
    else:
        x = x
    return x

In [47]:
# Investigate "PRAEGENDE_JUGENDJAHRE" and engineer two new variables.
# Create the first variable, movement, with two binary values: 0 for mainstream and 1 for avantgarde
second_subset['movement'] = second_subset['PRAEGENDE_JUGENDJAHRE'].apply(movement_encoding)
# Create second variable decade with multi-values: 0 for 40s and 1 for 50s and 2 for 60s 
# and 3 for 70s and 4 for 80s and 5 for 90s
second_subset['decade'] = second_subset['PRAEGENDE_JUGENDJAHRE'].apply(decade_encoding)

NameError: name 'second_subset' is not defined

In [48]:
# Investigate "CAMEO_INTL_2015" and engineer two new variables.
second_subset["CAMEO_INTL_2015"]= second_subset["CAMEO_INTL_2015"].astype(float)
# Create first variable wealth with multi-values: 0 for (11,12,13,14,15) and 1 for (21,22,23,24,25) and 2 for (31,32,33,34,35)
# and 3 for (41,42,43,44,45) and 4 for (51,52,53,54,55)
second_subset['wealth'] = second_subset['CAMEO_INTL_2015'].apply(wealth_encoding)
# create second variable life_stage with multi-values: 0 for (11,21,31,41,51) and 1 for (12,22,32,42,52)
# and 2 for (13,23,33,43,53) and 3 for (14,24,34,44,54) and 4 for (15,25,35,45,55)
second_subset['life_stage'] = second_subset['CAMEO_INTL_2015'].apply(life_stage_encoding)

NameError: name 'second_subset' is not defined

In [49]:
I created two variables, **decade** and **movement**, from **PRAEGENDE_JUGENDJAHRE**, and two variables, **life_stage** and **wealth**, from **CAMEO_INTL_2015**. In both cases I removed the original column and kept the two new variables.

I decided to keep the following mixed features without any feature engineering, because their levels are well distinguished in their original encoding:

- **WOHNLAGE**
- **KBA05_BAUMAX**
- **PLZ8_BAUMAX**

I also decided to remove the following features, because there is not enough information to distinguish between the different dimensions represented in each feature:

- **LP_LEBENSPHASE_FEIN**
- **LP_LEBENSPHASE_GROB**


In [50]:
#### Step 1.2.3: Complete Feature Selection

The dataframe should consist of the following:
- All numeric, interval, and ordinal type columns from the original dataset.
- Binary categorical features (all numerically-encoded).
- Engineered features from other multi-level categorical features and mixed features.


In [51]:
# Drop the original columns that have been replaced by engineered features
second_subset = second_subset.drop(["PRAEGENDE_JUGENDJAHRE", "CAMEO_INTL_2015"], axis= 1)
second_subset = second_subset.drop(categorical_levels['multi-level-re-encoded'],axis='columns')

NameError: name 'second_subset' is not defined

In [52]:
second_subset = second_subset.drop(columns=['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB'])
second_subset = second_subset.drop(columns=["NAN count"])

NameError: name 'second_subset' is not defined

In [53]:
### Step 1.3: Create a Cleaning Function

In [54]:
def clean_data(df,feat_info):
    """
    Perform feature trimming, re-encoding, and engineering for demographics
    data
    
    INPUT: Demographics DataFrame, feature summary DataFrame (feat_info)
    OUTPUT: Trimmed and cleaned demographics DataFrame
    """
    
    # Put in code here to execute all main cleaning steps:
    # convert missing value codes into NaNs, ...
    missing_values_codes = []
    columns = []
    for column, missing_value_code in zip(feat_info["attribute"],feat_info["missing_or_unknown"]):
        missing_values_codes.extend(missing_value_code)
        columns.extend([column] * len(missing_value_code))
    for attribute , code in zip(columns, missing_values_codes):
        df[attribute] = df[attribute].apply(lambda x: np.nan if x == code else x)
    # remove selected columns and rows, ...
    removed_columns = ['AGER_TYP', 'GEBURTSJAHR', 'TITEL_KZ', 'ALTER_HH', 'KK_KUNDENTYP', 'KBA05_BAUMAX']
    print('Removed Columns are:', removed_columns)
    df = df.drop(removed_columns,axis=1)
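    df["NAN count"] = df.isna().sum(axis=1)
    # upper_bound_row is the global row threshold computed in Step 1.1.3 on the general population data
    df = df.loc[df["NAN count"] < upper_bound_row]
    df = df.drop(columns=["NAN count"])
    # fill_missing_values defaults to flag='customer', so the imputers must already be fitted on the general data
    df = fill_missing_values(df)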
    # select, re-encode, and engineer column values.
    df = df.drop(columns = ['GEBAEUDETYP'])
    categorical = feat_info[feat_info['type'] == 'categorical'].attribute
    categorical = categorical.apply(lambda x: x if x in df.columns else np.nan).dropna()
    categorical_levels = {'binary-level':[], 'binary-level-re-encoded':[],'multi-level-re-encoded':[]}
    for categorical_col in categorical:
        unique_categories = pd.Series(df[categorical_col].unique()).dropna().to_list()
        if len(unique_categories) == 2:
            if str(unique_categories[0]).isalpha() or str(unique_categories[1]).isalpha():
                categorical_levels['binary-level-re-encoded'].append(categorical_col)
            else:
                categorical_levels['binary-level'].append(categorical_col)
        elif len(unique_categories) > 2:
            categorical_levels['multi-level-re-encoded'].append(categorical_col)
    for categorical_col in categorical_levels['multi-level-re-encoded']:
        df.loc[:, categorical_col] = df[categorical_col].apply(lambda x: x if str(x).isalpha() else str(x))
    df[categorical_levels['binary-level-re-encoded'][0]] = df[categorical_levels['binary-level-re-encoded'][0]].apply(lambda x: 0 if x == 'W' else 1)
    multiple_dummy_attributes = pd.get_dummies(df[categorical_levels['multi-level-re-encoded']])
    df = pd.concat([df, multiple_dummy_attributes],axis=1)
    df = df.drop(categorical_levels['multi-level-re-encoded'],axis='columns')
    # Create the first variable, movement, with two binary values: 0 for mainstream and 1 for avantgarde
    df['movement'] = df['PRAEGENDE_JUGENDJAHRE'].apply(movement_encoding)
    # Create second variable decade with multi-values: 0 for 40s and 1 for 50s and 2 for 60s 
    # and 3 for 70s and 4 for 80s and 5 for 90s
    df['decade'] = df['PRAEGENDE_JUGENDJAHRE'].apply(decade_encoding)
    # Create first variable wealth with multi-values: 0 for (11,12,13,14,15) and 1 for (21,22,23,24,25) and 2 for (31,32,33,34,35)
    # and 3 for (41,42,43,44,45) and 4 for (51,52,53,54,55)
    df["CAMEO_INTL_2015"]= df["CAMEO_INTL_2015"].astype(float)
    df['wealth'] = df['CAMEO_INTL_2015'].apply(wealth_encoding)
    # create second variable life_stage with multi-values: 0 for (11,21,31,41,51) and 1 for (12,22,32,42,52)
    # and 2 for (13,23,33,43,53) and 3 for (14,24,34,44,54) and 4 for (15,25,35,45,55)
    df['life_stage'] = df['CAMEO_INTL_2015'].apply(life_stage_encoding)
    df = df.drop(["PRAEGENDE_JUGENDJAHRE", "CAMEO_INTL_2015"], axis= 1)
    df = df.drop(columns=['LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB'])
    # Return the cleaned dataframe.
    return df