Tidy a bank marketing campaign dataset by splitting it into subsets, updating values, converting data types, and storing the results as multiple CSV files.
Project Description
Data cleaning is an essential skill for data engineers, encompassing reading, modifying, splitting, and storing data.
In this notebook, you will apply your data-cleaning skills to process information about marketing campaigns run by a bank.
You will need to modify values, add new features, convert data types, and save data into multiple files.



Project Instructions
Subset, clean, and reformat the bank_marketing.csv dataset to create and store three new files based on the requirements detailed in the notebook.
Split and tidy bank_marketing.csv into three DataFrames called client, campaign, and economics, each containing the columns outlined in the notebook and formatted to the data types listed.
Save the three DataFrames to CSV files, without an index, as client.csv, campaign.csv, and economics.csv respectively.


In [3]:
import pandas as pd
import numpy as np

#### Reading Data

In [7]:

# Load the dataset
df = pd.read_csv("bank_marketing.csv")
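Before splitting, it can help to confirm that the columns each subset relies on are actually present. The sketch below uses a tiny in-memory CSV as a stand-in for bank_marketing.csv; the sample rows and the `missing_columns` helper are illustrative assumptions, not part of the project.

```python
import io

import pandas as pd

# Toy stand-in for bank_marketing.csv (values are made up for illustration)
sample_csv = io.StringIO(
    "client_id,age,job,pdays,y\n"
    "0,56,housemaid,999,no\n"
    "1,57,services,999,no\n"
)
sample_df = pd.read_csv(sample_csv)


def missing_columns(frame, required):
    """Return the required column names that are absent from the frame."""
    return [name for name in required if name not in frame.columns]


# An empty list would mean every expected column was found
absent = missing_columns(sample_df, ["client_id", "age", "job", "marital"])
print(absent)
```

Running the same check against the real `df` with each of the three column lists would surface a misspelled or renamed column before any cleaning starts.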



## Splitting Data

#### Client DataFrame and Cleaning

In [8]:
# Cleaning the 'client' DataFrame
client_columns = ['client_id', 'age', 'job', 'marital', 'education', 'credit_default', 'housing', 'loan']
# .copy() gives an independent DataFrame, which avoids SettingWithCopyWarning
client_df = df[client_columns].copy()

# Fill missing values in 'age' with the median
# (assign the result back instead of using inplace=True on a column)
client_df['age'] = client_df['age'].fillna(client_df['age'].median())

# Standardize 'education' levels by merging the basic.* categories
client_df['education'] = client_df['education'].replace({
    'basic.4y': 'basic',
    'basic.6y': 'basic',
    'basic.9y': 'basic'
})

# Replace missing values in categorical columns with 'unknown'
for col in ['job', 'marital', 'education', 'credit_default', 'housing', 'loan']:
    client_df[col] = client_df[col].fillna('unknown')

# Ensure 'client_id' is of integer type
client_df['client_id'] = client_df['client_id'].astype(int)
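Selecting columns with `df[columns]` can return an object that pandas treats as a slice of the original frame, which is where `SettingWithCopyWarning` and the chained-assignment warning on `inplace=True` come from. A minimal sketch on toy data (the `toy` and `subset` names and their values are made up for illustration) shows one way to sidestep both: take an explicit copy and assign the filled column back.

```python
import pandas as pd

# Toy data with one missing age and one missing job
toy = pd.DataFrame({
    "client_id": [0, 1, 2],
    "age": [56.0, None, 37.0],
    "job": ["housemaid", None, "services"],
})

# .copy() makes the subset an independent DataFrame
subset = toy[["client_id", "age", "job"]].copy()

# Assign the result back rather than calling fillna(..., inplace=True) on a column
subset["age"] = subset["age"].fillna(subset["age"].median())
subset["job"] = subset["job"].fillna("unknown")

print(subset)
# The original frame still has its missing age
print(int(toy["age"].isna().sum()))
```

Because `subset` is a copy, later edits to it never propagate back to `toy`, which is usually the desired behaviour when a source frame is split into several cleaned outputs.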

#### Campaign DataFrame with Necessary Cleaning

In [9]:

# Cleaning the 'campaign' DataFrame
campaign_columns = ['client_id', 'contact', 'month', 'campaign', 'pdays', 'previous', 'poutcome', 'y']
# .copy() gives an independent DataFrame, which avoids SettingWithCopyWarning
campaign_df = df[campaign_columns].copy()

# Replace 999 in 'pdays' with NaN (indicating the client was never contacted before)
campaign_df['pdays'] = campaign_df['pdays'].replace(999, np.nan)

# Fill missing values in 'poutcome' with 'unknown'
# (assign the result back instead of using inplace=True on a column)
campaign_df['poutcome'] = campaign_df['poutcome'].fillna('unknown')

# Convert 'campaign' to integer
campaign_df['campaign'] = campaign_df['campaign'].astype(int)

# Map 'y' to binary values for easier analysis
campaign_df['y'] = campaign_df['y'].map({'yes': 1, 'no': 0})
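Two side effects of this cleaning step are easy to miss: replacing the 999 sentinel with NaN turns an integer column into a float column, and `Series.map` with a dict silently yields NaN for any label the dict does not cover. The sketch below illustrates both on made-up data; the `toy_campaign` frame and the "maybe" label are assumptions for demonstration only.

```python
import numpy as np
import pandas as pd

toy_campaign = pd.DataFrame({
    "pdays": [999, 3, 999, 6],
    "y": ["no", "yes", "no", "maybe"],  # "maybe" is a deliberately unexpected label
})

# 999 is a sentinel; replacing it with NaN converts the column to float
toy_campaign["pdays"] = toy_campaign["pdays"].replace(999, np.nan)

# Labels missing from the mapping become NaN instead of raising an error
toy_campaign["y"] = toy_campaign["y"].map({"yes": 1, "no": 0})

# Count labels that the mapping did not cover
unmapped = int(toy_campaign["y"].isna().sum())
print(toy_campaign.dtypes)
print(unmapped)
```

Checking the unmapped count after `map` is a cheap way to catch stray labels before the column is treated as binary.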

#### Economics DataFrame and Cleaning

In [10]:

# Cleaning the 'economics' DataFrame
economics_columns = ['client_id', 'emp_var_rate', 'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employed']
# .copy() gives an independent DataFrame, which avoids SettingWithCopyWarning
economics_df = df[economics_columns].copy()

# Fill any missing economic values with the mean of their respective columns
# (assign the result back instead of using inplace=True on a column)
for col in ['emp_var_rate', 'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employed']:
    economics_df[col] = economics_df[col].fillna(economics_df[col].mean())

# Convert 'client_id' to integer to match the other DataFrames
economics_df['client_id'] = economics_df['client_id'].astype(int)
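The per-column mean fill can also be written without a loop, since `DataFrame.fillna` accepts a Series of fill values and aligns it by column name. This is a sketch on invented numbers (the `toy_econ` frame and `rate_cols` list are hypothetical), offered as an alternative rather than a replacement for the cell above.

```python
import pandas as pd

toy_econ = pd.DataFrame({
    "client_id": [0, 1, 2],
    "euribor3m": [4.0, None, 5.0],
    "nr_employed": [5000.0, 5200.0, None],
})
rate_cols = ["euribor3m", "nr_employed"]

# DataFrame.mean() returns one mean per column; fillna matches them by column name
toy_econ[rate_cols] = toy_econ[rate_cols].fillna(toy_econ[rate_cols].mean())

print(toy_econ)
```

Restricting the operation to `rate_cols` keeps identifier columns such as `client_id` out of the imputation.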

   client_id  age        job  marital    education credit_default housing loan
0          0   56  housemaid  married        basic             no      no   no
1          1   57   services  married  high.school        unknown      no   no
2          2   37   services  married  high.school             no     yes   no
3          3   40     admin.  married        basic             no      no   no
4          4   56   services  married  high.school             no      no  yes

   client_id    contact month  campaign  pdays  previous     poutcome  y
0          0  telephone   may         1    NaN         0  nonexistent  0
1          1  telephone   may         1    NaN         0  nonexistent  0
2          2  telephone   may         1    NaN         0  nonexistent  0
3          3  telephone   may         1    NaN         0  nonexistent  0
4          4  telephone   may         1    NaN         0  nonexistent  0

   client_id  emp_var_rate  cons_price_idx  cons_conf_idx  euribor3m  \

In [None]:
# Saving the cleaned DataFrames to CSV files without the index
client_df.to_csv('client.csv', index=False)
campaign_df.to_csv('campaign.csv', index=False)
economics_df.to_csv('economics.csv', index=False)

# Display head of each cleaned DataFrame
client_df.head(), campaign_df.head(), economics_df.head()
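One way to confirm that `index=False` had the intended effect is to read a saved file back and look at its columns. The sketch below does this with a small made-up frame written to a temporary directory, so it does not touch the real output files; `toy_client` and the temporary path are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy_client = pd.DataFrame({"client_id": [0, 1], "age": [56, 57]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "client.csv"
    # index=False keeps the row index out of the file
    toy_client.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# With index=False, no extra "Unnamed: 0" column appears on reload
print(list(reloaded.columns))
```

Had the index been written, the reloaded frame would carry an additional unnamed first column.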