# Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services

In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the company's core customer base. Then, you'll apply what you've learned to a third dataset with demographics information for targets of one of the company's marketing campaigns, and use a model to predict which individuals are most likely to become customers. The data has been provided by our partners at Bertelsmann Arvato Analytics and represents a real-life data science task.

If you completed the first term of this program, you will recognize the first part of this project from the unsupervised learning project. The versions of those two datasets used here include many more features and have not been pre-cleaned. You are also free to choose whatever approach you like for analyzing the data, rather than following predetermined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable will be a blog post reporting your findings.

### Download the data
If you do not have the required **data/** directory in your workspace, use either of the methods below to obtain it.

**Method 1** <br/>
[Download this dataset](https://video.udacity-data.com/topher/2024/May/66393287_arvato_data.tar/arvato_data.tar.gz) from the classroom and upload it into the workspace. After you upload the tar file to the appropriate directory, **/cd0549_bertelsmann_arvato_project_workspace/**, in the Jupyter server, open a terminal and run the following command to extract the dataset from the compressed file (to run it from a notebook cell instead, prefix it with `!`).
```bash
tar -xzvf arvato_data.tar.gz
```
This command extracts the contents of arvato_data.tar.gz into the directory from which you ran it.

**Method 2** <br/>
Execute the Python code below to download the dataset. 
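
A minimal sketch of such a download step, using only the standard library, might look like the following. It assumes the archive URL from Method 1 is still reachable and that the archive unpacks into the expected **data/** directory; the helper name `download_and_extract` and its parameters are hypothetical and not part of the original workspace.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from Method 1; whether it stays reachable is an assumption.
DATA_URL = ("https://video.udacity-data.com/topher/2024/May/"
            "66393287_arvato_data.tar/arvato_data.tar.gz")


def download_and_extract(url, dest_dir=".", archive_name="arvato_data.tar.gz"):
    """Hypothetical helper: fetch the archive once, then unpack it into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive_path = dest / archive_name

    # Skip the download when the archive is already present.
    if not archive_path.exists():
        urllib.request.urlretrieve(url, str(archive_path))

    # Mode "r:*" lets tarfile detect the compression on its own.
    with tarfile.open(str(archive_path), mode="r:*") as archive:
        archive.extractall(path=str(dest))
    return archive_path
```

Calling `download_and_extract(DATA_URL)` from the workspace directory would then mirror the `tar` command from Method 1.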

In [5]:
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# magic word for producing visualizations in notebook
%matplotlib inline

ModuleNotFoundError: No module named 'seaborn'

## Part 0: Get to Know the Data

There are four data files associated with this project:

- `Udacity_AZDIAS_052018.csv`: Demographics data for the general population of Germany; 891 221 persons (rows) x 366 features (columns).
- `Udacity_CUSTOMERS_052018.csv`: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
- `Udacity_MAILOUT_052018_TRAIN.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
- `Udacity_MAILOUT_052018_TEST.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. Use the information from the first two files to figure out how customers ("CUSTOMERS") are similar to or differ from the general population at large ("AZDIAS"), then use your analysis to make predictions on the other two files ("MAILOUT"), predicting which recipients are most likely to become a customer for the mail-order company.

The "CUSTOMERS" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which provide broad information about the customers depicted in the file. The original "MAILOUT" file included one additional column, "RESPONSE", which indicated whether or not each recipient became a customer of the company. For the "TRAIN" subset, this column has been retained, but in the "TEST" subset it has been removed; it is against that withheld column that your final predictions will be assessed in the Kaggle competition.

Otherwise, all of the remaining columns are the same between the three data files. For more information about the columns depicted in the files, you can refer to two Excel spreadsheets provided in the workspace. [One of them](./DIAS Information Levels - Attributes 2017.xlsx) is a top-level list of attributes and descriptions, organized by informational category. [The other](./DIAS Attributes - Values 2017.xlsx) is a detailed mapping of data values for each feature in alphabetical order.
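
The value-mapping spreadsheet can also drive the cleaning step instead of hand-written lists of codes. The sketch below shows one way to collect, per feature, the codes whose meaning is "unknown". It assumes the sheet has columns named `Attribute`, `Value` and `Meaning`, with the attribute name filled in only on the first row of each group; that layout, the helper name `build_unknown_code_map`, and the toy rows are assumptions to check against the real file (which could be loaded with `pd.read_excel`).

```python
import pandas as pd


def build_unknown_code_map(attr_values):
    """Map each attribute to the codes whose meaning mentions 'unknown'.

    Assumes columns 'Attribute', 'Value' and 'Meaning' (hypothetical layout),
    with 'Attribute' filled only on the first row of each group.
    """
    table = attr_values.copy()
    table["Attribute"] = table["Attribute"].ffill()
    is_unknown = table["Meaning"].astype(str).str.contains("unknown", case=False)

    code_map = {}
    rows = table.loc[is_unknown, ["Attribute", "Value"]].itertuples(index=False)
    for attribute, raw_value in rows:
        # A single cell may list several codes at once, e.g. "-1, 9".
        for token in str(raw_value).split(","):
            token = token.strip()
            try:
                number = float(token)
                code = int(number) if number.is_integer() else number
            except ValueError:
                code = token  # keep non-numeric codes such as 'X' as strings
            code_map.setdefault(attribute, []).append(code)
    return code_map


# Toy stand-in for the spreadsheet; the rows are illustrative only.
toy = pd.DataFrame({
    "Attribute": ["AGER_TYP", None, None, "SEMIO_SOZ", None],
    "Value": ["-1", 0, 1, "-1, 9", 1],
    "Meaning": ["unknown", "no classification possible", "passive elderly",
                "unknown", "highest affinity"],
})
unknown_codes = build_unknown_code_map(toy)
print(unknown_codes)
```

The resulting dictionary can be passed column by column to `Series.replace` so that each feature only loses the codes documented as unknown for it.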

In the cell below, we've provided some initial code to load the first two datasets. Note that all of the `.csv` data files in this project are semicolon (`;`) delimited, so an additional argument has been included in the [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) call to read the data properly. Also, given the size of the datasets, they may take some time to load completely.

You'll notice when the data is loaded in that a warning message will immediately pop up. Before you really start digging into the modeling and analysis, you're going to need to perform some cleaning. Take some time to browse the structure of the data and look over the informational spreadsheets to understand the data values. Make some decisions on which features to keep, which features to drop, and if any revisions need to be made on data formats. It'll be a good idea to create a function with pre-processing steps, since you'll need to clean all of the datasets before you work with them.
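
The warning is most likely pandas' `DtypeWarning` about columns with mixed types (an assumption, since the message itself is not reproduced here). One hedged way to sidestep it is to read the ambiguous columns as strings and convert them explicitly afterwards. The sketch below does this on a tiny in-memory sample with invented values; the two `CAMEO_*` column names are real, but which columns trigger the warning in your copy of the data should be verified.

```python
import io
import pandas as pd

# Tiny semicolon-delimited stand-in for the real files (values are made up).
sample_csv = io.StringIO(
    "LNR;CAMEO_DEUG_2015;CAMEO_INTL_2015\n"
    "1;8;51\n"
    "2;4.0;24.0\n"
    "3;X;XX\n"
    "4;;\n"
)

# Reading the ambiguous columns as strings avoids per-chunk type guessing.
mixed_cols = ["CAMEO_DEUG_2015", "CAMEO_INTL_2015"]
sample = pd.read_csv(sample_csv, sep=";", dtype={col: str for col in mixed_cols})

# Non-numeric placeholders become NaN; '4.0' and '4' both end up as 4.0.
for col in mixed_cols:
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

print(sample.dtypes)
```

Passing `low_memory=False` to `read_csv` is another option; it silences the warning but leaves the mixed values in an `object` column, so the explicit conversion is still needed.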

In [3]:
# load in the data
azdias = pd.read_csv('data/Udacity_AZDIAS_052018.csv', sep=';')
customers = pd.read_csv('data/Udacity_CUSTOMERS_052018.csv', sep=';')



In [4]:
azdias.head()

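| | LNR | AGER_TYP | AKT_DAT_KL | ALTER_HH | ALTER_KIND1 | ALTER_KIND2 | ALTER_KIND3 | ALTER_KIND4 | ALTERSKATEGORIE_FEIN | ANZ_HAUSHALTE_AKTIV | ... | VHN | VK_DHT4A | VK_DISTANZ | VK_ZG11 | W_KEIT_KIND_HH | WOHNDAUER_2008 | WOHNLAGE | ZABEOTYP | ANREDE_KZ | ALTERSKATEGORIE_GROB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 910215 | -1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3 | 1 | 2 |
| 1 | 910220 | -1 | 9.0 | 0.0 | NaN | NaN | NaN | NaN | 21.0 | 11.0 | ... | 4.0 | 8.0 | 11.0 | 10.0 | 3.0 | 9.0 | 4.0 | 5 | 2 | 1 |
| 2 | 910225 | -1 | 9.0 | 17.0 | NaN | NaN | NaN | NaN | 17.0 | 10.0 | ... | 2.0 | 9.0 | 9.0 | 6.0 | 3.0 | 9.0 | 2.0 | 5 | 2 | 3 |
| 3 | 910226 | 2 | 1.0 | 13.0 | NaN | NaN | NaN | NaN | 13.0 | 1.0 | ... | 0.0 | 7.0 | 10.0 | 11.0 | NaN | 9.0 | 7.0 | 3 | 2 | 4 |
| 4 | 910241 | -1 | 1.0 | 20.0 | NaN | NaN | NaN | NaN | 14.0 | 3.0 | ... | 2.0 | 3.0 | 5.0 | 4.0 | 2.0 | 9.0 | 3.0 | 4 | 1 | 3 |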


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# import seaborn as sns

# Load datasets
print("Loading data...")
azdias = pd.read_csv('data/Udacity_AZDIAS_052018.csv', sep=';', nrows=1000)  # Load first 1000 rows for speed
customers = pd.read_csv('data/Udacity_CUSTOMERS_052018.csv', sep=';', nrows=1000)

print("Data loaded successfully!\n")

# Basic exploration
print("="*60)
print("DATASET SHAPES")
print("="*60)
print(f"AZDIAS: {azdias.shape}")
print(f"CUSTOMERS: {customers.shape}")

print("\n" + "="*60)
print("FIRST 5 ROWS - FIRST 10 COLUMNS (AZDIAS)")
print("="*60)
print(azdias.iloc[:5, :10])

print("\n" + "="*60)
print("DATA TYPES")
print("="*60)
print(azdias.dtypes.value_counts())

print("\n" + "="*60)
print("MISSING VALUES (Top 20 columns with most missing)")
print("="*60)
missing = azdias.isnull().sum().sort_values(ascending=False)
print(missing.head(20))

print("\n" + "="*60)
print("SAMPLE VALUES FROM FIRST 10 COLUMNS")
print("="*60)
for col in azdias.columns[:10]:
    print(f"\n{col}: {azdias[col].unique()[:10]}")

print("\n" + "="*60)
print("CHECKING FOR SPECIAL MISSING CODES")
print("="*60)
# Check for common missing value indicators
for col in azdias.columns[:10]:
    if azdias[col].dtype in ['int64', 'float64']:
        if -1 in azdias[col].values:
            print(f"{col}: Contains -1")
        if 0 in azdias[col].values:
            print(f"{col}: Contains 0")

Loading data...
Data loaded successfully!

DATASET SHAPES
AZDIAS: (1000, 366)
CUSTOMERS: (1000, 369)

FIRST 5 ROWS - FIRST 10 COLUMNS (AZDIAS)
      LNR  AGER_TYP  AKT_DAT_KL  ALTER_HH  ALTER_KIND1  ALTER_KIND2  \
0  910215        -1         NaN       NaN          NaN          NaN   
1  910220        -1         9.0       0.0          NaN          NaN   
2  910225        -1         9.0      17.0          NaN          NaN   
3  910226         2         1.0      13.0          NaN          NaN   
4  910241        -1         1.0      20.0          NaN          NaN   

   ALTER_KIND3  ALTER_KIND4  ALTERSKATEGORIE_FEIN  ANZ_HAUSHALTE_AKTIV  
0          NaN          NaN                   NaN                  NaN  
1          NaN          NaN                  21.0                 11.0  
2          NaN          NaN                  17.0                 10.0  
3          NaN          NaN                  13.0                  1.0  
4          NaN          NaN                  14.0                

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load first 5000 rows for better sample
print("Loading data...")
azdias = pd.read_csv('data/Udacity_AZDIAS_052018.csv', sep=';', nrows=5000)
customers = pd.read_csv('data/Udacity_CUSTOMERS_052018.csv', sep=';', nrows=5000)

print("="*70)
print("STEP 1: IDENTIFY OBJECT (STRING) COLUMNS")
print("="*70)
object_cols = azdias.select_dtypes(include=['object']).columns
print(f"Object columns: {object_cols.tolist()}")

for col in object_cols:
    print(f"\n{col}:")
    print(f"  Unique values: {azdias[col].unique()[:20]}")
    print(f"  Value counts:\n{azdias[col].value_counts().head(10)}")

print("\n" + "="*70)
print("STEP 2: CHECK FOR -1, 0, X PATTERNS (Potential Missing Codes)")
print("="*70)

# Check numeric columns for suspicious values
suspicious_cols = []
for col in azdias.columns[:50]:  # Check first 50 columns
    if azdias[col].dtype in ['int64', 'float64']:
        unique_vals = azdias[col].dropna().unique()
        
        # Check for -1
        if -1 in unique_vals:
            count_neg1 = (azdias[col] == -1).sum()
            suspicious_cols.append((col, '-1', count_neg1))
        
        # Check for 0 (if it seems like it might be a missing code)
        if 0 in unique_vals:
            count_0 = (azdias[col] == 0).sum()
            # If more than 30% are zeros, it might be suspicious
            if count_0 / len(azdias) > 0.3:
                suspicious_cols.append((col, '0', count_0))

print(f"Found {len(suspicious_cols)} columns with suspicious values:")
for col, value, count in suspicious_cols[:20]:
    print(f"  {col}: {value} appears {count} times ({count/len(azdias)*100:.1f}%)")

print("\n" + "="*70)
print("STEP 3: MISSING VALUE SUMMARY")
print("="*70)

missing_summary = pd.DataFrame({
    'column': azdias.columns,
    'missing_count': azdias.isnull().sum(),
    'missing_percent': (azdias.isnull().sum() / len(azdias) * 100).round(2)
})

missing_summary = missing_summary[missing_summary['missing_count'] > 0].sort_values(
    'missing_percent', ascending=False
)

print(f"\nColumns with >50% missing values: {len(missing_summary[missing_summary['missing_percent'] > 50])}")
print(f"Columns with >80% missing values: {len(missing_summary[missing_summary['missing_percent'] > 80])}")

print("\nTop 30 columns by missing percentage:")
print(missing_summary.head(30))

print("\n" + "="*70)
print("STEP 4: COMPARE AZDIAS vs CUSTOMERS")
print("="*70)

# Check if CUSTOMERS has the 3 extra columns
extra_cols = set(customers.columns) - set(azdias.columns)
print(f"Extra columns in CUSTOMERS: {extra_cols}")

if extra_cols:
    for col in extra_cols:
        print(f"\n{col}:")
        print(customers[col].value_counts())

print("\n" + "="*70)
print("STEP 5: VISUALIZE MISSING DATA PATTERN")
print("="*70)

# Calculate missing percentage for each column
missing_pct = (azdias.isnull().sum() / len(azdias) * 100).sort_values(ascending=False)

# Plot
plt.figure(figsize=(12, 6))
plt.bar(range(len(missing_pct[:30])), missing_pct[:30])
plt.axhline(y=50, color='r', linestyle='--', label='50% threshold')
plt.axhline(y=80, color='orange', linestyle='--', label='80% threshold')
plt.xlabel('Column Index')
plt.ylabel('Missing Percentage (%)')
plt.title('Top 30 Columns with Missing Data')
plt.legend()
plt.tight_layout()
plt.savefig('missing_data_analysis.png', dpi=100, bbox_inches='tight')
print("Plot saved as 'missing_data_analysis.png'")
plt.close()

print("\n" + "="*70)
print("ANALYSIS COMPLETE!")
print("="*70)

Loading data...
STEP 1: IDENTIFY OBJECT (STRING) COLUMNS
Object columns: ['CAMEO_DEU_2015', 'CAMEO_DEUG_2015', 'CAMEO_INTL_2015', 'D19_LETZTER_KAUF_BRANCHE', 'EINGEFUEGT_AM', 'OST_WEST_KZ']

CAMEO_DEU_2015:
  Unique values: [nan '8A' '4C' '2A' '6B' '8C' '4A' '2D' '1A' '1E' '9D' '5C' '8B' '7A' '5D'
 '9E' '9B' '1B' '3D' '4E']
  Value counts:
CAMEO_DEU_2015
6B    313
4C    264
8A    262
8B    199
7A    191
2D    179
4A    177
3C    176
3D    171
9D    160
Name: count, dtype: int64

CAMEO_DEUG_2015:
  Unique values: [nan 8.0 4.0 2.0 6.0 1.0 9.0 5.0 7.0 3.0 '4' '3' '7' '2' '8' '9' '6' '5'
 '1' 'X']
  Value counts:
CAMEO_DEUG_2015
8      439
9      366
6      346
4      324
8.0    273
3      267
2      267
7      261
4.0    243
6.0    230
Name: count, dtype: int64

CAMEO_INTL_2015:
  Unique values: [nan 51.0 24.0 12.0 43.0 54.0 22.0 14.0 13.0 15.0 33.0 41.0 34.0 55.0 25.0
 23.0 31.0 52.0 35.0 45.0]
  Value counts:
CAMEO_INTL_2015
51      434
41      322
24      287
51.0    268
41.0    215
24



Plot saved as 'missing_data_analysis.png'

ANALYSIS COMPLETE!


In [10]:
import pandas as pd
import numpy as np

def clean_demographics_data(df):
    """
    Clean German demographics data based on official DIAS documentation
    
    Missing value codes:
    - -1: unknown (universal)
    - 0: unknown (context-dependent - check feature by feature)
    - 9: unknown (specific features like KBA05_, SEMIO_)
    - 'X': unknown (CAMEO columns)
    """
    
    df_clean = df.copy()
    
    # ========================================
    # STEP 1: Universal -1 = unknown
    # ========================================
    print("Step 1: Converting -1 to NaN...")
    df_clean = df_clean.replace(-1, np.nan)
    
    # ========================================
    # STEP 2: Feature-specific 0 = unknown
    # ========================================
    print("Step 2: Converting specific 0 values to NaN...")
    
    # Features where 0 = unknown/no classification
    zero_is_missing = [
        'AGER_TYP',
        'ALTER_HH', 
        'ALTERSKATEGORIE_GROB',
        'ANREDE_KZ',
        'BIP_FLAG',
        'GEBAEUDETYP',
        'GEBAEUDETYP_RASTER',
        'GEOSCORE_KLS7',
        'HAUSHALTSSTRUKTUR',
        'HEALTH_TYP',
        'HH_EINKOMMEN_SCORE',
        'KBA05_BAUMAX',
        'KBA05_GBZ',
        'KKK',
        'LP_FAMILIE_GROB',
        'LP_STATUS_GROB',
        'NATIONALITAET_KZ',
        'ONLINE_AFFINITAET',
        'PRAEGENDE_JUGENDJAHRE',
        'REGIOTYP',
        'RETOURTYP_BK_S',
        'TITEL_KZ',
        'WOHNDAUER_2008',
        'WOHNLAGE',
        'WACHSTUMSGEBIET_NB',
        'W_KEIT_KIND_HH'
    ]
    
    for col in zero_is_missing:
        if col in df_clean.columns:
            df_clean[col] = df_clean[col].replace(0, np.nan)
    
    # ========================================
    # STEP 3: D19 columns - 0 = "no transaction known"
    # ========================================
    print("Step 3: Converting D19 transaction 0's to NaN...")
    
    # All D19 columns where 0 = "no transaction known"
    d19_cols = [col for col in df_clean.columns if col.startswith('D19_')]
    
    for col in d19_cols:
        if col in df_clean.columns:
            df_clean[col] = df_clean[col].replace(0, np.nan)
    
    # ========================================
    # STEP 4: Feature-specific 9 = unknown
    # ========================================
    print("Step 4: Converting 9 to NaN for specific columns...")
    
    # KBA05 columns where 9 = unknown
    kba05_cols = [col for col in df_clean.columns if col.startswith('KBA05_')]
    for col in kba05_cols:
        df_clean[col] = df_clean[col].replace(9, np.nan)
    
    # SEMIO columns where 9 = unknown
    semio_cols = [col for col in df_clean.columns if col.startswith('SEMIO_')]
    for col in semio_cols:
        df_clean[col] = df_clean[col].replace(9, np.nan)
    
    # Other columns with 9 = unknown
    nine_is_missing = [
        'ALTERSKATEGORIE_GROB',
        'RELAT_AB',
        'ZABEOTYP'
    ]
    
    for col in nine_is_missing:
        if col in df_clean.columns:
            df_clean[col] = df_clean[col].replace(9, np.nan)
    
    # ========================================
    # STEP 5: Handle mixed-type CAMEO columns
    # ========================================
    print("Step 5: Fixing CAMEO columns...")
    
    # CAMEO_DEUG_2015: Mixed types (floats and strings) + 'X'
    if 'CAMEO_DEUG_2015' in df_clean.columns:
        df_clean['CAMEO_DEUG_2015'] = df_clean['CAMEO_DEUG_2015'].replace('X', np.nan)
        df_clean['CAMEO_DEUG_2015'] = pd.to_numeric(df_clean['CAMEO_DEUG_2015'], errors='coerce')
    
    # CAMEO_INTL_2015: Same issue
    if 'CAMEO_INTL_2015' in df_clean.columns:
        df_clean['CAMEO_INTL_2015'] = pd.to_numeric(df_clean['CAMEO_INTL_2015'], errors='coerce')
    
    # CAMEO_DEU_2015: Keep as categorical (alphanumeric codes like '8A')
    # No changes needed - it's legitimately categorical
    
    # ========================================
    # STEP 6: Handle D19_LETZTER_KAUF_BRANCHE
    # ========================================
    print("Step 6: Converting 'D19_UNBEKANNT' to NaN...")
    
    if 'D19_LETZTER_KAUF_BRANCHE' in df_clean.columns:
        df_clean['D19_LETZTER_KAUF_BRANCHE'] = df_clean['D19_LETZTER_KAUF_BRANCHE'].replace(
            'D19_UNBEKANNT', np.nan
        )
    
    # ========================================
    # STEP 7: Convert EINGEFUEGT_AM to datetime
    # ========================================
    print("Step 7: Converting dates...")
    
    if 'EINGEFUEGT_AM' in df_clean.columns:
        df_clean['EINGEFUEGT_AM'] = pd.to_datetime(df_clean['EINGEFUEGT_AM'], errors='coerce')
    
    print("Data cleaning complete!")
    print(f"Shape after cleaning: {df_clean.shape}")
    
    return df_clean


def get_missing_summary(df):
    """Get comprehensive missing value summary"""
    missing = pd.DataFrame({
        'column': df.columns,
        'missing_count': df.isnull().sum(),
        'missing_percent': (df.isnull().sum() / len(df) * 100).round(2),
        'dtype': df.dtypes
    })
    
    missing = missing[missing['missing_count'] > 0].sort_values(
        'missing_percent', ascending=False
    )
    
    return missing


def drop_high_missing_columns(df, threshold=80):
    """Drop columns with more than threshold% missing"""
    missing_pct = (df.isnull().sum() / len(df) * 100)
    cols_to_drop = missing_pct[missing_pct > threshold].index.tolist()
    
    print(f"Dropping {len(cols_to_drop)} columns with >{threshold}% missing:")
    print(cols_to_drop)
    
    df_reduced = df.drop(columns=cols_to_drop)
    
    print(f"Shape: {df.shape} → {df_reduced.shape}")
    
    return df_reduced, cols_to_drop
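
The helpers above filter column by column. A row-wise counterpart can be useful as well, since some individuals may have most of their features missing. The following is a minimal sketch under that assumption; the helper name `drop_high_missing_rows` and the 50% default are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def drop_high_missing_rows(df, threshold=50):
    """Hypothetical helper: drop rows with more than threshold% missing values."""
    row_missing_pct = df.isnull().mean(axis=1) * 100
    keep = row_missing_pct <= threshold
    return df.loc[keep].copy(), int((~keep).sum())


# Toy frame: the middle row is missing three of its four values.
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, 6.0],
    "C": [7.0, np.nan, 9.0],
    "D": [1.0, 2.0, 3.0],
})
kept, n_dropped = drop_high_missing_rows(toy, threshold=50)
print(f"Kept {len(kept)} rows, dropped {n_dropped}")
```

Whether dropping rows is appropriate depends on the later steps: for the segmentation it may be reasonable, but for the MAILOUT test data every row needs a prediction, so rows there would have to be imputed instead.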

In [None]:
# Load full datasets (not just 5000 rows)
print("Loading full datasets...")
azdias = pd.read_csv('data/Udacity_AZDIAS_052018.csv', sep=';')
customers = pd.read_csv('data/Udacity_CUSTOMERS_052018.csv', sep=';')

print(f"AZDIAS shape: {azdias.shape}")
print(f"CUSTOMERS shape: {customers.shape}")

# Clean the data
print("\n" + "="*70)
print("CLEANING AZDIAS")
print("="*70)
azdias_clean = clean_demographics_data(azdias)

print("\n" + "="*70)
print("CLEANING CUSTOMERS")
print("="*70)
customers_clean = clean_demographics_data(customers)

# Get missing summary AFTER cleaning
print("\n" + "="*70)
print("MISSING VALUES AFTER CLEANING")
print("="*70)
missing_summary = get_missing_summary(azdias_clean)
print(missing_summary.head(30))

# Drop columns with >80% missing
print("\n" + "="*70)
print("DROPPING HIGH-MISSING COLUMNS")
print("="*70)
azdias_reduced, dropped_cols = drop_high_missing_columns(azdias_clean, threshold=80)
# Drop the same columns from CUSTOMERS so both datasets keep identical features
customers_reduced = customers_clean.drop(columns=dropped_cols, errors='ignore')

print(f"\nFinal shapes:")
print(f"AZDIAS: {azdias_reduced.shape}")
print(f"CUSTOMERS: {customers_reduced.shape}")

# Save cleaned datasets
azdias_reduced.to_csv('data/cleaned_AZDIAS.csv', index=False)
customers_reduced.to_csv('data/cleaned_CUSTOMERS.csv', index=False)

Loading full datasets...




AZDIAS shape: (891221, 366)
CUSTOMERS shape: (191652, 369)

CLEANING AZDIAS
Step 1: Converting -1 to NaN...
Step 2: Converting specific 0 values to NaN...
Step 3: Converting D19 transaction 0's to NaN...
Step 4: Converting 9 to NaN for specific columns...
Step 5: Fixing CAMEO columns...
Step 6: Converting 'D19_UNBEKANNT' to NaN...
Step 7: Converting dates...
Data cleaning complete!
Shape after cleaning: (891221, 366)

CLEANING CUSTOMERS
Step 1: Converting -1 to NaN...
Step 2: Converting specific 0 values to NaN...
Step 3: Converting D19 transaction 0's to NaN...
Step 4: Converting 9 to NaN for specific columns...
Step 5: Fixing CAMEO columns...
Step 6: Converting 'D19_UNBEKANNT' to NaN...
Step 7: Converting dates...
Data cleaning complete!
Shape after cleaning: (191652, 369)

MISSING VALUES AFTER CLEANING
                                                column  missing_count  \
D19_TELKO_ONLINE_QUOTE_12    D19_TELKO_ONLINE_QUOTE_12         890433   
ALTER_KIND4                          

In [12]:
azdias_reduced.to_csv('data/cleaned_AZDIAS.csv', index=False)
customers_reduced.to_csv('data/cleaned_CUSTOMERS.csv', index=False)
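
One caveat when saving to CSV is that column types are not stored in the file, so a datetime column such as `EINGEFUEGT_AM` comes back as plain text unless it is parsed again on load. The sketch below illustrates the round trip on a toy frame with invented values, writing to an in-memory buffer rather than to the `data/` directory.

```python
import io
import pandas as pd

# Toy frame standing in for a cleaned dataset (values are made up).
cleaned = pd.DataFrame({
    "LNR": [910215, 910220],
    "CAMEO_DEU_2015": ["8A", None],
    "EINGEFUEGT_AM": pd.to_datetime(["1992-02-10", "1992-02-12"]),
})

buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)  # same call as for the real files, but into memory
buffer.seek(0)

# Without parse_dates the date column would come back as plain strings.
restored = pd.read_csv(buffer, parse_dates=["EINGEFUEGT_AM"])
print(restored.dtypes)
```

Note that the cleaned files are comma-delimited (the `to_csv` default), so they should be read back without `sep=';'`.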

In [None]:
# Be sure to add in a lot more cells (both markdown and code) to document your
# approach and findings!

## Part 1: Customer Segmentation Report

The bulk of your analysis will come in this part of the project. Here, you should use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and the general population of Germany. By the end of this part, you should be able to describe which parts of the general population are more likely to belong to the mail-order company's main customer base, and which parts are less so.
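
One possible shape for this analysis, sketched here on synthetic data, is to fit a dimensionality reduction and clustering pipeline on the general population, assign customers to the same clusters, and compare cluster shares. The choice of PCA followed by k-means, the number of components and clusters, and the synthetic feature names are all assumptions for illustration; the project leaves the approach open.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
cols = [f"f{i}" for i in range(6)]  # placeholder feature names

# Synthetic stand-ins: the population mixes two groups evenly,
# while the customers are drawn mostly from the second group.
group_a = rng.normal(loc=0.0, scale=1.0, size=(500, 6))
group_b = rng.normal(loc=4.0, scale=1.0, size=(500, 6))
population = pd.DataFrame(np.vstack([group_a, group_b]), columns=cols)
customers = pd.DataFrame(
    np.vstack([group_a[:50], rng.normal(4.0, 1.0, size=(250, 6))]), columns=cols
)

# Fit every step on the general population only, then reuse it for customers.
segmenter = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    PCA(n_components=3, random_state=0),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
pop_labels = segmenter.fit_predict(population)
cust_labels = segmenter.predict(customers)

# Share of each cluster in both groups; a ratio above 1 marks a segment
# that is over-represented among customers.
shares = pd.DataFrame({
    "population": pd.Series(pop_labels).value_counts(normalize=True),
    "customers": pd.Series(cust_labels).value_counts(normalize=True),
}).fillna(0.0).sort_index()
shares["ratio"] = shares["customers"] / shares["population"]
print(shares)
```

Fitting on the population and only predicting for customers keeps the clusters defined by the general population, so the ratio column reads directly as over- or under-representation of each segment.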

## Part 2: Supervised Learning Model

Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.

The "MAILOUT" data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.

In [None]:
mailout_train = pd.read_csv('data/Udacity_MAILOUT_052018_TRAIN.csv', sep=';')
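
Once the training data is loaded and cleaned, a baseline model can be evaluated with cross-validation before any tuning. The sketch below uses synthetic data in place of the real features and `RESPONSE` column. It assumes responders are a small minority (worth confirming with `value_counts()` on the real data) and uses ROC AUC as one metric that does not depend on a decision threshold; the metric actually used by the competition should be checked separately. The logistic regression is only a placeholder for whatever model you choose.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and the RESPONSE column;
# the 95/5 class split is an assumption, not a property of the real data.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)

# Imputation and scaling live inside the pipeline so they are refit per fold.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Stratified folds keep the share of positives similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"ROC AUC: {auc_scores.mean():.3f} +/- {auc_scores.std():.3f}")
```

Keeping the preprocessing inside the pipeline avoids leaking information from the validation folds into the imputer and scaler.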