## I. Introduction
In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned on a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to convert into becoming customers for the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics and represents a real-life data science task.

If you completed the first term of this program, you will be familiar with the first part of this project, from the unsupervised learning project. The versions of those two datasets used in this project will include many more features and have not been pre-cleaned. You are also free to choose whatever approach you'd like for analyzing the data, rather than following pre-determined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable for this project will be a blog post reporting your findings.

### Get to Know the Data

There are four data files associated with this project:

- `Udacity_AZDIAS_052018.csv`: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
- `Udacity_CUSTOMERS_052018.csv`: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
- `Udacity_MAILOUT_052018_TRAIN.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
- `Udacity_MAILOUT_052018_TEST.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. Use the information from the first two files to figure out how customers ("CUSTOMERS") are similar to or differ from the general population at large ("AZDIAS"), then use your analysis to make predictions on the other two files ("MAILOUT"), predicting which recipients are most likely to become a customer for the mail-order company.

The "CUSTOMERS" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which provide broad information about the customers depicted in the file. The original "MAILOUT" file included one additional column, "RESPONSE", which indicated whether or not each recipient became a customer of the company. For the "TRAIN" subset, this column has been retained, but in the "TEST" subset it has been removed; it is against that withheld column that your final predictions will be assessed in the Kaggle competition.

Otherwise, all of the remaining columns are the same between the three data files. For more information about the columns depicted in the files, you can refer to two Excel spreadsheets provided in the workspace. [One of them](./DIAS Information Levels - Attributes 2017.xlsx) is a top-level list of attributes and descriptions, organized by informational category. [The other](./DIAS Attributes - Values 2017.xlsx) is a detailed mapping of data values for each feature in alphabetical order.

In the below cell, we've provided some initial code to load in the first two datasets. Note for all of the `.csv` data files in this project that they're semicolon (`;`) delimited, so an additional argument in the [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) call has been included to read in the data properly. Also, considering the size of the datasets, it may take some time for them to load completely.

You'll notice when the data is loaded in that a warning message will immediately pop up. Before you really start digging into the modeling and analysis, you're going to need to perform some cleaning. Take some time to browse the structure of the data and look over the informational spreadsheets to understand the data values. Make some decisions on which features to keep, which features to drop, and if any revisions need to be made on data formats. It'll be a good idea to create a function with pre-processing steps, since you'll need to clean all of the datasets before you work with them.
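As a minimal sketch of the loading step described above, the snippet below reads a small semicolon-delimited sample with `pandas.read_csv`. The sample text and its values are invented for illustration; only the `sep=';'` argument comes from the project description. Mixed-type columns are one common cause of a load-time warning in pandas, so the sketch reads everything as text first and converts afterwards, which is an assumption about a reasonable workflow rather than the prescribed one.

```python
import io
import pandas as pd

# Invented sample mimicking a semicolon-delimited file with a mixed-type column.
sample = io.StringIO(
    "LNR;CAMEO_DEUG_2015;BALLRAUM\n"
    "1;8;6\n"
    "2;X;1\n"
    "3;4;\n"
)

# sep=';' is required because the project files are semicolon-delimited.
# dtype=str keeps every column as text (empty cells are still read as NaN).
raw = pd.read_csv(sample, sep=';', dtype=str)

# Convert to numbers afterwards; non-numeric codes such as 'X' become NaN.
numeric = raw.apply(pd.to_numeric, errors='coerce')
print(numeric)
```

Reading as text first makes the placeholder codes visible before any decision is made on how to treat them.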


In [None]:
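# Importing libraries:
import joblib

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

# Project helper functions from the local utils module
# (join_nan_with_unknown, change_dtypes, feature_engineer):
from utils import *

# magic word for producing visualizations in notebook
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')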

## Part 1: Customer Segmentation Report<a name="part1"></a>

The main bulk of your analysis will come in this part of the project. Here, you should use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and the general population of Germany. By the end of this part, you should be able to describe parts of the general population that are more likely to be part of the mail-order company's main customer base, and which parts of the general population are less so.

### 1.1 Data Overview<a name="overview"></a>

In [None]:
# load in the data
#azdias = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_AZDIAS_052018.csv', sep=';')

# Loading general population data:
azdias = joblib.load('azdias')

In [None]:
# Checking general population demographics first rows:
azdias.head()

In [None]:
# Verifying dataframe's shape:
print('General population dataframe has {} observations and {} variables.'.format(azdias.shape[0], azdias.shape[1]))

### 1.2 Cleaning Data<a name="cleandata"></a>

#### 1.2.1 NaN and Unknown Values<a name="nanunknown"></a>

Because the dataset has a large number of columns, it is affordable to be strict about dropping those with many `nan` values. In this case, columns in which at least 35% of the values are `nan` will be deleted.

First, the variables are checked for classes that represent unknown values. Any such class is converted to `nan` as well, so that it counts toward the threshold.
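The conversion of unknown codes into `nan` can be sketched as follows. The notebook relies on `join_nan_with_unknown` from its helper module, whose implementation is not shown here, so this is only an illustration of the idea: the function name `replace_unknown_codes`, the column names, and the codes in `unknown_codes` are all hypothetical.

```python
import pandas as pd

# Hypothetical mapping: column name -> codes that mean "unknown".
unknown_codes = {'FEATURE_A': [-1, 0], 'FEATURE_B': [9]}

def replace_unknown_codes(df, mapping):
    '''Returns a copy of df where the listed codes are replaced by NaN.'''
    out = df.copy()
    for col, codes in mapping.items():
        if col in out.columns:
            # mask() sets the value to NaN wherever the condition is True
            out[col] = out[col].mask(out[col].isin(codes))
    return out

# Invented toy data
toy = pd.DataFrame({'FEATURE_A': [1, -1, 0, 3], 'FEATURE_B': [9, 2, 9, 4]})
cleaned = replace_unknown_codes(toy, unknown_codes)

# Proportion of NaN values per column after the replacement
print(cleaned.isnull().mean())
```

Once unknown codes are `nan`, a single call to `isnull().mean()` measures both kinds of missing information at the same time.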


In [None]:
# Applying join_nan_with_unknown on the azdias dataframe:
azdias = join_nan_with_unknown(azdias)

In [None]:
# Creating a list of variables that reach the threshold for nan values:
nan_threshold = 0.35 # 35%

# Nan proportion, using general population dataframe (more observations):
var_nan_prop = azdias.isnull().mean()

# List of columns whose nan proportion is at or above the threshold:
nan_list = var_nan_prop[var_nan_prop >= nan_threshold].index.tolist()

print('{} columns with at least {:.0f}% of nan values.'.format(len(nan_list), nan_threshold*100))

In [None]:
# Creating function that deletes the columns listed in nan_list:
def eliminate_nan_columns(df, nan_cols = nan_list):
    '''
    It deletes the dataframe columns listed in nan_cols.

    Inputs:
    df: original dataframe (modified in place);
    nan_cols: list of columns to be deleted.

    Output:
    df: dataframe updated without nan_cols.
    '''
    # Deleting nan_cols:
    df.drop(columns = nan_cols, inplace = True)

    return df

In [None]:
# Applying eliminate_nan_columns on the azdias dataframe:
azdias = eliminate_nan_columns(azdias)

#### 1.2.2 Non-Informative Columns<a name="noninform"></a>

In [None]:
# Defining function to drop useless columns:
def drop_useless_cols(df):
    '''
    It deletes columns with no useful information on people ('LNR', 'EINGEFUEGT_AM', and
    'EINGEZOGENAM_HH_JAHR').

    Input:
    df: original dataframe.

    Output:
    df: updated dataframe.
    '''
    # Deleting columns:
    df.drop(columns = ['LNR', 'EINGEFUEGT_AM', 'EINGEZOGENAM_HH_JAHR'], inplace = True)

    return df

In [None]:
# Applying drop_useless_cols on the azdias dataframe:
azdias = drop_useless_cols(azdias)

#### 1.2.3 Columns' Types<a name="coltype"></a>




In [None]:
# Applying change_dtypes on the azdias dataframe:
azdias = change_dtypes(azdias)

#### 1.2.4 Feature Engineering<a name="feateng"></a>

While going through the features one by one, two of them stood out for encoding more than one piece of apparently important information:
* `CAMEO_INTL_2015`: it is composed of two digits, each of which takes one of 5 classes. The first digit classifies wealth as *1. Wealthy*, *2. Prosperous*, *3. Comfortable*, *4. Less Affluent*, or *5. Poorer*. The second digit classifies family composition as *1. Pre-family couples and singles*, *2. Young couples with children*, *3. Families with school-age children*, *4. Older families and mature couples*, or *5. Elders in retirement*.

* `PRAEGENDE_JUGENDJAHRE`: indicates the decade of the person's youth (*youth decade*), as well as the dominant movement of that period (*avant-garde* or *mainstream*).

These observations will be used to create new features that may help throughout the analysis. Another transformation to be performed is to simplify the `ALTER_HH` variable so that it represents decades.
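The split of `CAMEO_INTL_2015` into its two digits can be sketched as below. The notebook performs this step inside `feature_engineer`, whose code is not shown, so this is an illustration under assumptions: the helper name `split_cameo_intl` is hypothetical, the toy values are invented, and treating non-numeric entries as `nan` is an assumed policy. The output column names match the ones used later in the analysis.

```python
import pandas as pd

def split_cameo_intl(df, col='CAMEO_INTL_2015'):
    '''Hypothetical helper: splits a two-digit code into its tens and units digits.'''
    out = df.copy()
    # Non-numeric entries (e.g. placeholder strings) become NaN
    codes = pd.to_numeric(out[col], errors='coerce')
    out['CAMEO_INTL_FAM_STATUS'] = codes // 10       # first digit: wealth status
    out['CAMEO_INTL_FAM_COMPOSITION'] = codes % 10   # second digit: family composition
    return out.drop(columns=col)

# Invented toy data mixing strings, a placeholder, a missing value and a float
toy = pd.DataFrame({'CAMEO_INTL_2015': ['51', '24', 'XX', None, 13.0]})
split = split_cameo_intl(toy)
print(split)
```

Integer division and the remainder keep missing values as `nan`, so no separate handling of incomplete rows is needed.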


In [None]:
# Applying feature_engineer on the azdias dataframe:
azdias = feature_engineer(azdias)

#### 1.2.5 Correlation Analysis<a name="corranalysis"></a>

The next step will be to verify the `correlation` between columns.

Highly correlated features are likely to carry similar information. Keeping only one feature from each highly correlated group reduces the number of variables to be considered in the rest of the process.

Since most of the numerical variables represent ordinal classes, the most appropriate correlation measure would be a rank-based one (such as `Spearman` or `Kendall`). Because of limited computational power, and considering that these variables represent, in essence, a metric measure, the `Pearson` correlation will be applied in this task.
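To illustrate the difference between the two families of measures, the sketch below compares `Pearson` and `Spearman` on invented toy data in which one variable grows monotonically, but not linearly, with the other. It is not part of the original analysis; it only shows what is traded away when a linear measure is applied to ranked data.

```python
import pandas as pd

# Invented toy data: y grows monotonically with x, but not linearly.
toy = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1, 2, 4, 8, 16]})

# Linear (Pearson) versus rank-based (Spearman) correlation
pearson = toy.corr(method='pearson').loc['x', 'y']
spearman = toy.corr(method='spearman').loc['x', 'y']

print('Pearson:  {:.3f}'.format(pearson))
print('Spearman: {:.3f}'.format(spearman))
```

Because the ranks of `x` and `y` are identical, the rank-based coefficient is 1, while the linear coefficient is lower. A threshold applied to `Pearson` values can therefore miss some monotonic relationships.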

In [None]:
# Defining function to compute correlation, and return columns and their correlated features:
def analyze_correlation(df, method = 'pearson', corr_threshold = 0.7):
    '''
    It computes the correlation between the numeric variables, selects one among the correlated (the one
    with fewer nan values), and returns a list of selected columns as well as a dictionary relating
    selected columns with their correlated ones.

    Inputs:
    df: dataframe on which correlation will be computed;
    method: string indicating the method to compute the correlation ('pearson', 'kendall' or 'spearman');
    corr_threshold: absolute correlation threshold to consider columns as being highly correlated.

    Outputs:
    cols: list of selected columns;
    corr_dict: dictionary relating selected columns to their correlated features.
    '''
    # Computing correlation (numeric and boolean columns only):
    corr = df.select_dtypes(include = ['number', 'bool']).corr(method = method)

    # Creating selected columns list:
    cols = list()

    # Creating correlated columns list:
    corr_cols = list()

    # Creating correlation dictionary:
    corr_dict = dict()

    # Ordering columns by nan values ratio (fewest nan values first):
    nan_ord_cols = list(df[list(corr.columns)].isnull().mean().sort_values(kind = 'mergesort').index.values)

    # Looping through columns:
    for col in nan_ord_cols:

        # If column doesn't appear as selected column nor as correlated column, verify correlations:
        if (col not in cols) and (col not in corr_cols):

            # Add column to selected columns:
            cols.append(col)
            # Initialize list of highly correlated columns:
            col_corr = list()

            # Looping through rows:
            for i in range(corr.shape[0]):

                # If correlation is higher than threshold:
                if abs(corr[col].iloc[i]) > corr_threshold:
                    # Add feature as a correlated column:
                    corr_cols.append(corr.index.values[i])
                    # Add feature as highly correlated:
                    col_corr.append((corr.index.values[i], corr[col].iloc[i]))

            # Assigning highly correlated features:
            corr_dict[col] = col_corr

    # Adding columns with no computed correlation:
    for col in df.columns:
        if col not in corr.columns:
            cols.append(col)

    # Sorting values:
    cols.sort()

    return cols, corr_dict

In [None]:
# Applying analyze_correlation on the azdias dataframe:
sel_cols, corr_dict = analyze_correlation(azdias)

print('{} columns were selected, considering correlations between features.'.format(len(sel_cols)))

In [None]:
# Defining function to select features:
def select_features(df, sel = sel_cols):
    '''
    It selects columns presented in the sel list.
    '''
    # Selecting features:
    df = df[sel]

    return df

In [None]:
# Applying select_features on the azdias dataframe:
azdias = select_features(azdias)

In [None]:
azdias.head()

#### 1.2.6 Applying Data Cleaning on Customer Dataset<a name="cleancustomer"></a>

In [None]:
# Load in the data:
#customers = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_CUSTOMERS_052018.csv', sep=';')

# Loading customers data:
customers = joblib.load('customers')

In [None]:
# Checking customers population demographics first rows:
customers.head()

In [None]:
# Verifying dataframe's shape:
print('Customers dataframe has {} observations and {} variables.'.format(customers.shape[0], customers.shape[1]))

In [None]:
# Defining function to apply data cleaning process on data:
def clean_data(df):
    '''
    It applies data cleaning process on the dataframe.

    Input:
    df: original dataframe.

    Output:
    df: clean dataframe.
    '''
    # Applying join_nan_with_unknown on the dataframe:
    df = join_nan_with_unknown(df)

    # Applying eliminate_nan_columns on the dataframe:
    df = eliminate_nan_columns(df)

    # Applying drop_useless_cols on the dataframe:
    df = drop_useless_cols(df)

    # Applying change_dtypes on the dataframe:
    df = change_dtypes(df)

    # Applying feature_engineer on the dataframe:
    df = feature_engineer(df)

    # Applying select_features on the dataframe:
    df = select_features(df)

    return df

In [None]:
# Applying transformations on the customers data:
customers = clean_data(customers)

In [None]:
customers.head()

In [None]:
azdias.to_csv('../data/azdias_1.csv', index=False)
customers.to_csv('../data/customers_1.csv', index=False)

With the features cleaned and pre-selected, some *exploratory data analysis* will be performed.

### 1.3 Exploratory Data Analysis<a name="eda"></a>
<p>
    Through this exploratory analysis, the goal is to understand the company's customer profile and how this profile relates to the general population. The analysis will be focused on answering a few questions:
    <ul>
        <li>How old are the customers?</li>
        <li>Where do they live?</li>
        <li>How are they classified as consumers?</li>
        <li>What about their incomes?</li>
        <li>What are their consumption habits, lifestyle and family composition?</li>
    </ul>
</p>

In [None]:
# Setting seaborn general style:
sns.set_theme(style = "whitegrid", font_scale = 1.1)

# Defining a function to create a comparison barplot:
def customers_against_general(col, customers = customers, general = azdias):
    '''
    Given one variable, it counts the number of observations per class, both for customers and general population, and
    creates a bar plot comparing the percentages for each group.

    Inputs:
    col: string identifying target column to be compared;
    customers: dataframe containing customers' observations;
    general: dataframe containing general population's observations.
    '''
    # Counting customers observations for each class:
    cust_series = customers[col].value_counts()

    # Counting general population observations for each class:
    pop_series = general[col].value_counts().sort_index()

    # Creating dataframe:
    comparison_df = pd.DataFrame(index = pop_series.index.values, \
                                 columns = ['cust_count', 'gen_pop_count', 'customers', 'general_population'])
    comparison_df.gen_pop_count = pop_series.values

    # Assigning values to customers column (classes absent among customers get a count of 0):
    comparison_df.cust_count = cust_series.reindex(comparison_df.index, fill_value = 0).values

    # Printing percentage of NaN values:
    nan_cust = customers[col].isnull().mean() * 100
    nan_pop = general[col].isnull().mean() * 100
    print('"' + col + '"' + ' VARIABLE INFO:')
    print('{:.2f}% of the customers observations are related to NaN values.'.format(nan_cust))
    print('{:.2f}% of the general population observations are related to NaN values.\n'.format(nan_pop))

    # Computing each class' percentage:
    comparison_df.customers = (comparison_df.cust_count / np.sum(comparison_df.cust_count))*100
    comparison_df.general_population = (comparison_df.gen_pop_count / np.sum(comparison_df.gen_pop_count))*100

    # Defining xsize plot:
    if len(comparison_df.index) <= 11:
        plot_xsize = 1.5 * len(comparison_df.index)

        if plot_xsize < 5:
            plot_xsize = 5
    else:
        plot_xsize = 16.5

    comparison_df['Class'] = comparison_df.index.values
    melted_df = pd.melt(comparison_df, id_vars = ['Class'] , value_vars = ['customers', 'general_population'], \
                        var_name = 'Group', value_name ='Percentage')
    palette = {'customers': 'darkcyan', 'general_population': 'springgreen'}

    fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (plot_xsize, 5))
    sns.barplot(x = 'Class', y = 'Percentage', hue = 'Group', data = melted_df, palette = palette, ax = ax).set(
        title = 'Customers Vs. General Population - "{}"'.format(col))
    sns.despine(left=True, top = True)
    plt.show()

In [None]:
# Defining function to compare statistics:
def stats_comparison(col, customers = customers, general = azdias):
    '''
    It computes the variable's statistics (count, mean, standard deviation, min, max, and quartiles) for customers and
    general population and joins them together in one dataframe.

    Inputs:
    col: string identifying target column;
    customers: dataframe containing customers' observations;
    general: dataframe containing general population's observations.

    Output:
    stats: dataframe containing statistics for both groups.
    '''
    # Computing customers' statistics:
    cust_stats = customers[col].describe()

    # Computing general population's statistics:
    gen_stats = general[col].describe()

    # Creating dataframe:
    stats = pd.DataFrame(index = cust_stats.index.values, columns = ['Customers', 'General_Pop'])

    # Assigning the values to the respective columns:
    stats.Customers = cust_stats.values
    stats.General_Pop = gen_stats.values

    return stats

#### 1.3.1 Age<a name="age"></a>

After the pre-selection of features, age will be analyzed through the `YOUTH_DECADE` variable, which represents the decade corresponding to the person's youth. The decades go from the 40s to the 90s, and considering the youth period as going from age 15 to age 25, this variable can be interpreted as follows:
* `4`: *youth in the 40s (more than 85 years old)*;
* `5`: *youth in the 50s (75 to 85 years old)*;
* `6`: *youth in the 60s (65 to 75 years old)*;
* `7`: *youth in the 70s (55 to 65 years old)*;
* `8`: *youth in the 80s (45 to 55 years old)*;
* `9`: *youth in the 90s (less than 45 years old)*.
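The interpretation above can be kept next to the data as a lookup table, which is handy for labeling plots or tables. This is a minimal sketch: the dictionary name `youth_decade_labels` and the toy values are hypothetical, while the labels themselves are taken from the list above.

```python
import pandas as pd

# Labels taken from the interpretation list; the dictionary name is hypothetical.
youth_decade_labels = {
    4: '40s (more than 85 years old)',
    5: '50s (75 to 85 years old)',
    6: '60s (65 to 75 years old)',
    7: '70s (55 to 65 years old)',
    8: '80s (45 to 55 years old)',
    9: '90s (less than 45 years old)',
}

# Invented toy codes
toy = pd.Series([9, 5, 6, 4, 9], name='YOUTH_DECADE')

# Replace each code by its readable label
labeled = toy.map(youth_decade_labels)
print(labeled.value_counts())
```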

In [None]:
# Applying customers_against_general function to YOUTH_DECADE column:
customers_against_general('YOUTH_DECADE')

In [None]:
# Checking variable statistics for customers and general population (considering classes as numerical type):
youth_dec_stats = stats_comparison('YOUTH_DECADE')
youth_dec_stats

In the general population, most people have their youth related to the `90s`, whereas among customers this is the least represented group.

In the customers' group, the most common youth decades are the `50s` and the `60s`. As an approximation, this corresponds to people **between 65 and 85 years old**.

It's interesting to notice that the younger the group is, the less represented it is among clients and the more represented it is in the general population. In other words, older people are overrepresented, while younger people are underrepresented in the customers' group.

This age analysis brings up more questions to be studied and new possibilities to the company:

* Does the interest in organic products come with age, or is it a matter of reaching out to the new generations?

* Although younger people may not be the company's direct target, they could also represent a segment to be explored, since this group represents the majority of the general population.

<h4>Age Insights</h4>

* If the company would like to achieve more clients with a profile *similar* to the current clients' profile, then it would be better to focus on **elderly people**.

* Given the underrepresentation of younger clients compared with the general population, there is a **great opportunity** to increase the number of clients by reaching out to these generations through, for example, marketing campaigns designed for this specific audience.

#### 1.3.2 Youth Movements<a name="youthmov"></a>

`PRAEGENDE_JUGENDJAHRE` was previously split into the `YOUTH_DECADE` and the `AVANT_GARDE` features. Age was analyzed through the first engineered feature, and the second one indicating the dominating movement in the person's youth (avant-garde or mainstream) will be checked now:
* `0`: *not avant-garde (mainstream)*;
* `1`: *avant-garde*.

In [None]:
# Applying customers_against_general function to AVANT_GARDE column:
customers_against_general('AVANT_GARDE')

In [None]:
# Checking variable statistics for customers and general population (considering classes as numerical type):
avant_stats = stats_comparison('AVANT_GARDE')
avant_stats

If the analysis were performed only on customers' data, it would be possible to say that clients were equally represented by *avant-gardes* and *mainstreamers*.

When comparing this distribution to the general population, it's possible to see that *avant-gardes* are actually more likely to be interested in the products offered by the mail-order company. While people related to the *avant-garde* movement represent about 20% of the general population, among clients this share rises to about 50%.

The opposite happens to *mainstreamers*: they represent almost 80% of the general population and about 50% of the customers.
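One way to quantify this over- and underrepresentation is the ratio between a class's share among customers and its share in the general population. The sketch below is an illustration only: the helper `representation_ratio` is hypothetical, and the toy dataframes are invented to reproduce the approximate shares quoted above.

```python
import pandas as pd

def representation_ratio(customers, general, col):
    '''Hypothetical helper: share among customers divided by share in the general population.'''
    cust_share = customers[col].value_counts(normalize=True)
    gen_share = general[col].value_counts(normalize=True)
    # Classes absent among customers get a share of 0
    cust_share = cust_share.reindex(gen_share.index, fill_value=0)
    return (cust_share / gen_share).sort_index()

# Invented toy data reproducing the approximate shares quoted in the text
toy_general = pd.DataFrame({'AVANT_GARDE': [1] * 20 + [0] * 80})
toy_customers = pd.DataFrame({'AVANT_GARDE': [1] * 50 + [0] * 50})

ratios = representation_ratio(toy_customers, toy_general, 'AVANT_GARDE')
print(ratios)
```

Values above 1 indicate overrepresentation among customers and values below 1 indicate underrepresentation. With these approximate shares, *avant-garde* gets a ratio of 2.5 (0.5 / 0.2) and *mainstream* a ratio of about 0.6 (0.5 / 0.8).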

Another important aspect is the fact that, during the correlation analysis, the features `AVANT_GARDE` and `GREEN_AVANTGARDE` presented a perfect positive correlation with each other. It could indicate that, although avant-garde movements may be related to different social/economic aspects through the years, they can always be related in some aspects to the green avant-garde movement.

More studies should be done to prove this theory, but in this brief analysis, it's possible to say that this correlation indicates that the company's clients are more interested in topics related to sustainability, or more concerned about the impacts that people's actions cause on the environment.

Since the *green movement* has been increasing over the years both in society and also in politics, this could be an important aspect to be explored in order to reach younger generations.

<h4>Youth Movements Insights</h4>

* `Avant-gardes` are more interested in the products offered by the company.

* `Mainstreamers` represent almost the same proportion as `avant-gardes` among customers, but they are highly underrepresented compared with the general population.

* If the youth movement is analyzed together with age (`YOUTH_DECADE`), it's possible to see that for elderly people even the `mainstreamers` group is overrepresented among clients. This indicates that age and movement can reinforce one another.

* The correlation between `avant-garde` and `green avant-garde` movements indicates that this could be an important cultural aspect that could be explored by the company, both for retaining clients and for reaching out to new generations.

#### 1.3.3 Location<a name="location"></a>

The `BALLRAUM` variable is described as the *distance to next urban center*, and it can give an indication of where the customers live and how this relates to the general population. Its classes go from `1` to `7`:
* `Class 1`: *up to 10 km*;
* `Class 2`: *10 - 20 km*;
* `Class 3`: *20 - 30 km*;
* `Class 4`: *30 - 40 km*;
* `Class 5`: *40 - 50 km*;
* `Class 6`: *50 - 100 km*;
* `Class 7`: *over 100 km*.

In [None]:
# Applying customers_against_general function to BALLRAUM column:
customers_against_general('BALLRAUM')

In [None]:
# Checking variable statistics for customers and general population (considering classes as numerical type):
ball_stats = stats_comparison('BALLRAUM')
ball_stats

Most of the clients live between 50 and 100 km from urban centers, which corresponds to class `6`. However, this is not a characteristic that *specifically* defines the company's customers, since it follows the distribution of the general population.

Looking both at the bar plot and the statistics, it's clear that the general population and the customers follow roughly the same distribution, meaning that people at different distances from urban centers are being reached in similar proportions.

The largest difference appears in class `1`, which represents people living up to 10 km from the urban center. This makes sense, since it's a mail-order company, and people close to urban centers may have more opportunities to buy these products directly in stores.

`REGIOTYP` classifies people according to their neighbourhood:
* `0`: *unknown*;
* `1`: *upper class*;
* `2`: *conservatives*;
* `3`: *upper middle class*;
* `4`: *middle class*;
* `5`: *lower middle class*;
* `6`: *traditional workers*;
* `7`: *marginal groups*.

In [None]:
# Applying customers_against_general function to REGIOTYP column:
customers_against_general('REGIOTYP')

In [None]:
# Checking variable statistics for customers and general population (considering classes as numerical type):
regio_stats = stats_comparison('REGIOTYP')
regio_stats

Although the proportional distributions don't show great differences between customers and the general population, there's a tendency toward overrepresentation among customers for upper-class neighborhoods and underrepresentation for the other neighborhood types.

This becomes clearer when analyzing class `1` (*upper class*), which represents about 7% of the general population and about 13% of the customers.

The `ORTSGR_KLS9` variable represents community size, considering the number of inhabitants:
* `0`: *unknown*;
* `1`: *less than or equal to 2,000 inhabitants*;
* `2`: *2,001 to 5,000 inhabitants*;
* `3`: *5,001 to 10,000 inhabitants*;
* `4`: *10,001 to 20,000 inhabitants*;
* `5`: *20,001 to 50,000 inhabitants*;
* `6`: *50,001 to 100,000 inhabitants*;
* `7`: *100,001 to 300,000 inhabitants*;
* `8`: *300,001 to 700,000 inhabitants*;
* `9`: *over 700,000 inhabitants*.

In [None]:
# Applying customers_against_general function to ORTSGR_KLS9 column:
customers_against_general('ORTSGR_KLS9')

In [None]:
# Checking variable statistics for customers and general population (considering classes as numerical type):
inhab_stats = stats_comparison('ORTSGR_KLS9')
inhab_stats

Overall, customers are distributed across the different community sizes in roughly the same proportions as the general population.

There's a slight tendency toward overrepresentation in cities of up to 50 thousand inhabitants and toward underrepresentation in bigger cities with more than 700 thousand inhabitants, which corroborates the previous `BALLRAUM` variable analysis.

<h4>Location Insights</h4>

* As a generalization, the proportion of the company's clients follows the same distribution as the proportion presented in the general population when considering people's location.

* There's a slight tendency toward overrepresentation of customers in upper-class neighborhoods and in small to medium cities (up to 50 thousand inhabitants).

* In the general population, almost 20% of the people live close to urban centers (class `1`). Among clients, this share is almost 5 percentage points lower. The underrepresentation of clients in cities with more than 700 thousand inhabitants corroborates this. It makes sense considering the mail-order nature of the company.

#### 1.3.4 Consumer Classification<a name="cameo"></a>

So far, it was possible to identify the company's typical clients as elderly people, slightly more concentrated in small to medium cities (up to 50 thousand inhabitants). Now, the `CAMEO Classification` will be used in order to better understand customers' consumption and lifestyle habits and compare them to the general population's habits.

`CAMEO_INTL_2015` was divided into two variables: `CAMEO_INTL_FAM_STATUS` and `CAMEO_INTL_FAM_COMPOSITION`. The first one relates to 5 different classes:
* `1`: *wealthy*;
* `2`: *prosperous*;
* `3`: *comfortable*;
* `4`: *less affluent*;
* `5`: *poorer*.

`CAMEO_INTL_FAM_COMPOSITION` represents:
* `1`: *pre-family couples and singles*;
* `2`: *young couples with children*;
* `3`: *families with school age children*;
* `4`: *older families and mature couples*;
* `5`: *elders in retirement*.

In [None]:
# Applying customers_against_general function to CAMEO_INTL_FAM_STATUS column:
customers_against_general('CAMEO_INTL_FAM_STATUS')

In [None]:
# Checking variable statistics for customers and general population (considering classes as numerical type):
cameo_status_stats = stats_comparison('CAMEO_INTL_FAM_STATUS')
cameo_status_stats

`Wealthy` and `prosperous` classes are the most represented among clients. In particular, `wealthy` is the most overrepresented in comparison to the general population.

On the other hand, the `poorer` status appears as the most represented in the general population and the most underrepresented among clients.

In [None]:
# Applying customers_against_general function to CAMEO_INTL_FAM_COMPOSITION column:
customers_against_general('CAMEO_INTL_FAM_COMPOSITION')

In [None]:
# Checking variable statistics for customers and general population (considering classes as numerical type):
cameo_comp_stats = stats_comparison('CAMEO_INTL_FAM_COMPOSITION')
cameo_comp_stats

Classes `4` and `5` are overrepresented, indicating that *older families*, *mature couples*, and *elders in retirement* are considerably more common among clients than in the general population. Even more striking is the underrepresentation of class `1`, related to *pre-family couples* and *singles*.

This corroborates one aspect seen before: that younger people are less likely to become clients of the company. Whether it's a matter of age, life situation, or any other possible condition would require deeper research.

`CAMEO_DEU_2015` covers similar content to the previous variable, also being a CAMEO Classification 2015, this time with a more detailed classification:
* `1A`: *Work-Life Balance*;
* `1B`: *Wealthy Best Ager*;
* `1C`: *Successful Songwriter*;
* `1D`: *Old Nobility*;
* `1E`: *City Nobility*;
* `2A`: *Cottage Chic*;
* `2B`: *Noble Jogger*;
* `2C`: *Established Gourmet*;
* `2D`: *Fine Management*;
* `3A`: *Career & Family*;
* `3B`: *Powershopping Families*;
* `3C`: *Rural Neighborhood*;
* `3D`: *Secure Retirement*;
* `4A`: *Family Starter*;
* `4B`: *Family Life*;
* `4C`: *String Trimmer*;
* `4D`: *Empty Nest*;
* `4E`: *Golden Ager*;
* `5A`: *Younger Employees*;
* `5B`: *Suddenly Family*;
* `5C`: *Family First*;
* `5D`: *Stock Market Junkies*;
* `5E`: *Coffee Rider*;
* `5F`: *Active Retirement*;
* `6A`: *Jobstarter*;
* `6B`: *Petty Bourgeois*;
* `6C`: *Long-Established*;
* `6D`: *Sportgardener*;
* `6E`: *Urban Parents*;
* `6F`: *Frugal Aging*;
* `7A`: *Journey Man*;
* `7B`: *Mantaplatte*;
* `7C`: *Factory Worker*;
* `7D`: *Rear Window*;
* `7E`: *Interested Retirees*;
* `8A`: *Multi-cultural*;
* `8B`: *Young & Mobile*;
* `8C`: *Prefab*;
* `8D`: *Town Seniors*;
* `9A`: *First Shared Apartment*;
* `9B`: *Temporary Workers*;
* `9C`: *Afternoon Talk Show*;
* `9D`: *Mini-Jobber*;
* `9E`: *Socking Away*.

In [None]:
# Applying customers_against_general function to CAMEO_DEU_2015 column:
customers_against_general('CAMEO_DEU_2015')

In [None]:
# Checking variable statistics for customers and general population:
cameo_deu_stats = stats_comparison('CAMEO_DEU_2015')
cameo_deu_stats

Once again, there are indications that the distribution of customers along the different segments doesn't follow the general population distribution.

Among customers, the top classification is `Fine Management`, while among the general population, this position is taken by `Petty Bourgeois`.

Classes `1A` to `2D`, `3D`, `4A` and `5D` tend to be overrepresented among customers in comparison to the general population. The opposite happens in classes `7A` and `8A` to `9D`.

It reinforces that there are specific segments in the population that are especially attracted to the products offered by the company. Although there's no further explanation about these classes, we can deduce by their names that they relate not only to the social class but also to people's behavior and habits.
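Reading the results by class name rather than by code requires a code-to-name lookup. The sketch below shows one way to do it under stated assumptions: `CAMEO_DEU_LABELS` holds only a subset of the mapping listed above, `top_class` is a hypothetical helper, and the toy codes (including the unmapped placeholder `"ZZ"`) are made up for illustration.

```python
import pandas as pd

# Subset of the code-to-name mapping listed above; extend with the remaining codes.
CAMEO_DEU_LABELS = {
    "1A": "Work-Life Balance",
    "2D": "Fine Management",
    "6B": "Petty Bourgeois",
    "8A": "Multi-cultural",
}


def top_class(codes: pd.Series, labels: dict) -> str:
    """Return the readable name of the most frequent code, ignoring missing values."""
    most_common = codes.dropna().mode().iloc[0]
    # Fall back to the raw code when it is not present in the mapping.
    return labels.get(most_common, most_common)


# Made-up toy codes, not the project's data; "ZZ" stands for any unmapped code.
toy_codes = pd.Series(["2D", "2D", "1A", "6B", None, "ZZ"])
print(top_class(toy_codes, CAMEO_DEU_LABELS))
```

When several codes tie for the highest count, `mode()` returns them sorted and this sketch picks the first one.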

<h4>Consumer Classifications Insights</h4>

* *Upper social classes* tend to be more representative among the company's customers, following the opposite pattern of the general population.

* Some specific segments of the population stand out among customers, if not by their overall percentage, then by their overrepresentation when compared to the general population. Some examples: `Fine Management`, `Work-Life Balance`, `Successful Songwriter`, `Old Nobility`, `City Nobility`, and so on.

* Not only social class but also life habits and mindset may be related to the consumption of the organic products offered by the mail-order company.

#### 1.3.5 Income<a name="income"></a>

`HH_EINKOMMEN_SCORE` indicates the estimated household net income, corresponding to the following code:
* `1`: *highest income*;
* `2`: *very high income*;
* `3`: *high income*;
* `4`: *average income*;
* `5`: *lower income*;
* `6`: *very low income*.

In [None]:
# Applying customers_against_general function to HH_EINKOMMEN_SCORE column:
customers_against_general('HH_EINKOMMEN_SCORE')

In [None]:
# Checking variable statistics for customers and general population (considering classes as numerical type):
income_stats = stats_comparison('HH_EINKOMMEN_SCORE')
income_stats

Clients classified with `highest income` and `very high income` represent more than 50% of the clients. Their share among customers is more than twice as high as their share in the general population. The top class among customers is `2` (`very high income`), exceeding 35%, while in the general population it represents about 15% of the people.

Classes `3` and `4`, representing `high income` and `average income`, are practically equally represented both in the general population and among customers.

`Lower income` and especially `very low income` are underrepresented classes: while in the general population `very low income` represents almost 30% of the people, among clients its representation decreases to 7%.
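Combined shares such as "classes `1` and `2` together" can be computed directly from the score column. This is a minimal sketch, assuming the scores are stored as numbers with missing values as `NaN`; `share_at_or_below` is a hypothetical helper and the toy scores are made up, not taken from the datasets.

```python
import pandas as pd


def share_at_or_below(scores: pd.Series, threshold: int) -> float:
    """Fraction of non-missing scores that are less than or equal to threshold."""
    valid = scores.dropna()
    return float((valid <= threshold).mean())


# Made-up toy scores (1 = highest income ... 6 = very low income).
toy_customers = pd.Series([1, 2, 2, 3, 4, 6, None, 2])
toy_general = pd.Series([2, 4, 5, 6, 6, 6, 3, 5])

# Combined share of the two highest income classes in each group.
customers_top = share_at_or_below(toy_customers, 2)
general_top = share_at_or_below(toy_general, 2)
print(customers_top, general_top)
```

Because lower scores mean higher income in `HH_EINKOMMEN_SCORE`, a threshold of `2` selects the `highest income` and `very high income` classes.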

Again, this can be seen as an opportunity. Since the majority of the population belongs to the lower income classes, the company could consider releasing cheaper versions of its products if it wanted to reach a broader audience.

<h4>Income Insights</h4>

* The higher the income, the higher the propensity to buy the company's products. Clients with the highest incomes are proportionally the most represented, the opposite of what happens in the general population.

* Since the clients' profile represents the opposite of the population's distribution through the income aspect, it could also be seen as an opportunity. Cheaper versions of the products could be considered in order to reach the majority of the population that is represented by medium to low incomes.

#### 1.3.6 Habits and Other Curiosities<a name="habits"></a>

To better understand customers' profiles, `GFK_URLAUBERTYP` indicates people's vacation habits. These vacation habits are represented by the following codes:
* `1`: *event travelers*;
* `2`: *family-oriented vacationists*;
* `3`: *winter sportspeople*;
* `4`: *culture lovers*;
* `5`: *nature fans*;
* `6`: *hiker*;
* `7`: *golden ager*;
* `8`: *homeland-connected vacationists*;
* `9`: *package tour travelers*;
* `10`: *connoisseurs*;
* `11`: *active families*;
* `12`: *without vacation*.

In [None]:
# Applying customers_against_general function to GFK_URLAUBERTYP column:
customers_against_general('GFK_URLAUBERTYP')

In [None]:
# Checking variable statistics for customers and general population (considering classes as numerical type):
vacation_stats = stats_comparison('GFK_URLAUBERTYP')
vacation_stats

It is possible to see some clear trends among customers that differ from the general population. There is a huge overrepresentation of `nature fans` as a vacation habit among customers. This may indicate that the company's customers have a mindset that values connection with nature beyond their vacation habits, which makes sense given that the company sells organic products. To a lesser extent, `golden agers` also seem to be overrepresented in the customers' group.

When looking for underrepresentation, at least 4 classes catch the attention: the ones `without vacation`, `active families` and `family-oriented vacationists`, and `package tour travelers`. All of these are more representative among the general population than among customers.

`LP_LEBENSPHASE_GROB` refines the last classification, including information about incomes:
* `1`: *single low-income, and average earners of younger age*;
* `2`: *single low-income, and average earners of higher age*;
* `3`: *single high-income earners*;
* `4`: *single low-income, and average earner-couples*;
* `5`: *single high-income, and earner-couples*;
* `6`: *single parents*;
* `7`: *single low-income, and average earner-families*;
* `8`: *high-income earner-families*;
* `9`: *average earners of younger age from multiperson households*;
* `10`: *low-income, and average earners of higher age from multiperson households*;
* `11`: *high-income earners of younger age from multiperson households*;
* `12`: *high-income earners of higher age from multiperson households*.

In [None]:
# Applying customers_against_general function to LP_LEBENSPHASE_GROB column:
customers_against_general('LP_LEBENSPHASE_GROB')

In [None]:
# Checking variable statistics for customers and general population (considering classes as numerical type):
lebensphase_stats = stats_comparison('LP_LEBENSPHASE_GROB')
lebensphase_stats

Corroborating the previous analysis on *age* and *incomes*, the top class for customers is the one representing *high-income earners of higher age*.

The other classes that stand out when compared to the general population are also related to high incomes, independently of the family structure. The only exception is the class representing *low-income, and average earners of higher age*, which seems to be overrepresented among customers, possibly because of the higher age factor.

As seen before, pre-families and singles are the most underrepresented among clients. This time, however, class `5`, representing *single high-income, and earner-couples*, is overrepresented among clients. This could indicate that income matters more than age or family composition when it comes to becoming a client.

On the other hand, while people classified as *single low-income, and average earners of younger age* represent over 15% of the general population, among clients this percentage barely exceeds 1%.

`ZABEOTYP` indicates the type of energy consumer:
* `1`: *green*;
* `2`: *smart*;
* `3`: *fair supplied*;
* `4`: *price driven*;
* `5`: *seeking orientation*;
* `6`: *indifferent*.

In [None]:
# Applying customers_against_general function to ZABEOTYP column:
customers_against_general('ZABEOTYP')

Among customers, it's possible to see an overrepresentation of energy consumers of the types `green` and `fair supplied`, indicating a tendency toward responsible, conscious, and sustainable energy consumption.
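A labelled side-by-side table makes this kind of comparison easier to read. The sketch below is independent of the notebook's helpers: `share_difference` is a hypothetical function, the labels come from the list above, and the toy codes are made up rather than taken from the datasets.

```python
import pandas as pd

# Labels taken from the ZABEOTYP code list above.
ZABEOTYP_LABELS = {
    1: "green", 2: "smart", 3: "fair supplied",
    4: "price driven", 5: "seeking orientation", 6: "indifferent",
}


def share_difference(customers: pd.Series, general: pd.Series, labels: dict) -> pd.DataFrame:
    """Percentage share per class in both groups and their difference in points."""
    table = pd.DataFrame({
        "customers_pct": customers.value_counts(normalize=True) * 100,
        "general_pct": general.value_counts(normalize=True) * 100,
    }).reindex(list(labels)).fillna(0.0)
    # Positive values mean the class is more common among customers.
    table["diff_points"] = table["customers_pct"] - table["general_pct"]
    table.index = [labels[code] for code in table.index]
    return table.sort_values("diff_points", ascending=False)


# Made-up toy codes, only to show the shape of the output.
toy_customers = pd.Series([1, 1, 3, 3, 3, 4, 6, 1])
toy_general = pd.Series([1, 2, 3, 4, 4, 5, 6, 6])
energy_table = share_difference(toy_customers, toy_general, ZABEOTYP_LABELS)
print(energy_table)
```

The difference is expressed in percentage points, so it complements a ratio-based view: a small class can have a large ratio but only a small difference in points.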

<h4>Habits and Curiosities Insights</h4>

* People with travel habits classified as `golden agers` and especially `nature fans` are more representative among customers than among the general population, which indicates a good aspect for customer segmentation.

* On the other hand, people with vacation habits classified as `without vacation`, `active families`, `family-oriented vacationists` and `package tour travelers` seem to be less likely to become customers of the company.

* Looking only at family composition, clients are more represented by `couples`, `two-generational` and `multi-generational householders`.

* When family structure is analyzed together with income, income appears to matter more than the family structure itself.

* In comparison to the general population, in terms of energy consumption habits, customers seem to be less indifferent and more interested in sustainable energy solutions.

### 1.4 The Wise-Conscious Avant-Gardes<a name="report"></a>

If I had to reach out to the public that is more likely to join the company's customers group through a marketing campaign, I would focus on the `wisdom` related to the elders, but also on the `consciousness` of the impact that the consumption habits have on the planet.

Since they may have a special connection with nature, the idea that consuming organic products can benefit both the individual's health and the planet is an aspect that can be explored when reaching out to customers.

It would also be important to highlight the `avant-garde` profile of these people, of those who think ahead of their time, indicating that the habit of consuming organic products is not just a lifestyle, but a legacy for future generations.

As a big picture, the regular customer of the mail-order company would be:
* mostly elderly people with different mindsets, but also younger people from upper classes related to higher incomes;

* the typical picture would be of *older families*, *mature couples* or *elders in retirement*, but also *pre-family couples* and *singles* if combined with a high-income situation;

* the majority of the clients are related to higher social classes that could be classified as *wealthy* and *prosperous*;

* there is a predominant mindset related to the *avant-garde* and *green avant-garde* movements, regardless of the specificities of the movement throughout the years;

* among elders, even *mainstreamers* are overrepresented in the clients' distribution.