# Machine Learning in Production with Python by Nirmal Rawal

### Summary for Code-Along Session

**Background Context:**

In this code-along session, we will work with a dataset from a Portuguese banking institution's direct marketing campaigns, sourced from Kaggle. The dataset contains information about various marketing efforts, including telephonic outreach, aimed at promoting term deposits. Term deposits are significant for banks as they provide a stable income stream. Identifying and targeting potential customers effectively can enhance marketing efficiency and reduce costs.

https://www.kaggle.com/datasets/prakharrathi25/banking-dataset-marketing-targets/data

[https://github.com/mconwa02/datacamp-code-along-2024](https://github.com/nirmal-rawal/Banking-Dataset---Marketing-Targets-ML_project-.git)

**Tasks Covered:**

1. **Feature Engineering**:
   - We develop functions that enhance the DataFrame with new features.
   - These functions are combined into a streamlined data transformation process using the Pandas `pipe` method.

2. **Unit Testing**:
   - We implement unit tests using the `pytest` framework to ensure the new features are correctly added to the DataFrame.
   - The tests verify that the new columns are accurately calculated and correctly incorporated into the DataFrame.

Overall, this session provides practical experience in enhancing a dataset with new features and checking those enhancements with tests.

### Loading Data
The cells below read the bank client data from a CSV file into a Pandas DataFrame, preview the first rows, and then inspect the column names, dtypes, and missing-value counts.

In [3]:
# Loading Data: Reads the data from a CSV file into a Pandas DataFrame.
import pandas as pd

df = pd.read_csv("train.csv", sep=';')
print(df.head(10))

   age           job   marital  education  ... pdays  previous poutcome   y
0   58    management   married   tertiary  ...    -1         0  unknown  no
1   44    technician    single  secondary  ...    -1         0  unknown  no
2   33  entrepreneur   married  secondary  ...    -1         0  unknown  no
3   47   blue-collar   married    unknown  ...    -1         0  unknown  no
4   33       unknown    single    unknown  ...    -1         0  unknown  no
5   35    management   married   tertiary  ...    -1         0  unknown  no
6   28    management    single   tertiary  ...    -1         0  unknown  no
7   42  entrepreneur  divorced   tertiary  ...    -1         0  unknown  no
8   58       retired   married    primary  ...    -1         0  unknown  no
9   43    technician    single  secondary  ...    -1         0  unknown  no

[10 rows x 17 columns]


In [4]:
df.columns

Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'y'],
      dtype='object')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


In [6]:
df.isna().sum().values

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
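
The file is semicolon-delimited, which is why `read_csv` is called with `sep=';'`. The following is a minimal sketch of what the separator changes, using a hypothetical three-row sample (`sample_text`) rather than the real `train.csv`:

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same semicolon-delimited layout as train.csv.
sample_text = (
    "age;job;balance;y\n"
    "58;management;2143;no\n"
    "44;technician;29;no\n"
    "33;entrepreneur;2;yes\n"
)

# With sep=";" every field lands in its own column.
sample_df = pd.read_csv(io.StringIO(sample_text), sep=";")
print(sample_df.shape)

# With the default comma separator each line is read as a single field.
wrong_df = pd.read_csv(io.StringIO(sample_text))
print(wrong_df.shape)

# Total count of missing values across all columns.
total_missing = int(sample_df.isna().sum().sum())
print(total_missing)
```

Summing `isna()` twice collapses the per-column counts shown above into a single number, which can be convenient for a quick check.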

### Adding New Columns
### annual_duration 
Adds a column in which each entry is a two-element list: the value from the `duration` column and half of that value.

In [7]:
df["annual_duration"] = [[i, i / 2] for i in df["duration"]]

In [8]:
df["annual_duration"]

0         [261, 130.5]
1          [151, 75.5]
2           [76, 38.0]
3           [92, 46.0]
4          [198, 99.0]
             ...      
45206     [977, 488.5]
45207     [456, 228.0]
45208    [1127, 563.5]
45209     [508, 254.0]
45210     [361, 180.5]
Name: annual_duration, Length: 45211, dtype: object

### campaign_limit 
Adds a column in which each entry is a tuple: the value from the `campaign` column and its square.

In [9]:
df["campaign_limit"] = [(i, i**2) for i in df["campaign"]]
df["campaign_limit"]

0         (1, 1)
1         (1, 1)
2         (1, 1)
3         (1, 1)
4         (1, 1)
          ...   
45206     (3, 9)
45207     (2, 4)
45208    (5, 25)
45209    (4, 16)
45210     (2, 4)
Name: campaign_limit, Length: 45211, dtype: object
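
Both new columns store Python objects (a list or a tuple per row), so their dtype is `object`. The sketch below repeats the construction on a small hypothetical frame and, as one possible alternative, keeps each derived value in its own numeric column. The names `toy`, `duration_half`, and `campaign_squared` are illustrative and not part of the session code.

```python
import pandas as pd

# Hypothetical toy data standing in for the duration and campaign columns.
toy = pd.DataFrame({"duration": [261, 151, 76], "campaign": [1, 3, 5]})

# Same construction as above: one list or tuple per row (dtype: object).
toy["annual_duration"] = [[d, d / 2] for d in toy["duration"]]
toy["campaign_limit"] = [(c, c**2) for c in toy["campaign"]]

# Alternative: vectorized arithmetic, one numeric column per derived value.
toy["duration_half"] = toy["duration"] / 2
toy["campaign_squared"] = toy["campaign"] ** 2

print(toy.dtypes)
```

Numeric columns are usually easier to aggregate later, whereas the list and tuple columns need to be unpacked first.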

### Grouping and Aggregating

- Groups the DataFrame by the `marital` column.
- Applies a lambda function that aggregates each group into one row of a new DataFrame. The cell below computes:
    - balance_max: Maximum value of the balance column within each group.
    - age_mean: Mean value of the age column within each group.
- Two further aggregations are described for this step but are not computed in the cell below:
    - annual_duration_flat: Flattened list of all annual_duration values within each group.
    - campaign_limit_concat: Concatenated string of all campaign_limit values within each group.

In [10]:
result = df.groupby("marital").apply(
    lambda x: pd.Series(
        {
            "balance_max": x["balance"].max(),
            "age_mean": x["age"].mean(),
        }
    )
)

### Printing the Result
Prints the aggregated `result` DataFrame, which is indexed by `marital` and contains the `balance_max` and `age_mean` columns.

In [11]:
print(result)

          balance_max   age_mean
marital                         
divorced      66721.0  45.782984
married       98417.0  43.408099
single       102127.0  33.703440
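
The `annual_duration_flat` and `campaign_limit_concat` aggregations named in the grouping step do not appear in the result above. One way they might be computed is sketched below on a small hypothetical frame; the separator `", "` and the use of `itertools.chain` are assumptions, not the session's own code.

```python
from itertools import chain

import pandas as pd

# Hypothetical toy rows with the same kinds of list and tuple columns.
toy = pd.DataFrame(
    {
        "marital": ["married", "single", "married"],
        "annual_duration": [[261, 130.5], [151, 75.5], [76, 38.0]],
        "campaign_limit": [(1, 1), (3, 9), (5, 25)],
    }
)

grouped = toy.groupby("marital")

# Flatten the per-row lists of each group into one list.
annual_duration_flat = grouped["annual_duration"].agg(
    lambda lists: list(chain.from_iterable(lists))
)

# Join the string form of every tuple in the group.
campaign_limit_concat = grouped["campaign_limit"].agg(
    lambda tuples: ", ".join(str(t) for t in tuples)
)

extra = pd.DataFrame(
    {
        "annual_duration_flat": annual_duration_flat,
        "campaign_limit_concat": campaign_limit_concat,
    }
)
print(extra)
```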


### Function Definitions
This code enhances a DataFrame containing banking customer data by adding new features derived from existing columns using functions.

The max_customer_account_balance function below takes a DataFrame as input and adds a new column, balance_max, which contains the maximum account balance from the balance column across all customers.

In [12]:
from pandas import DataFrame


def max_customer_account_balance(df: DataFrame) -> DataFrame:
    """Add a column with the maximum account balance of all customers."""
    df["balance_max"] = df["balance"].max()
    return df


# Pass a copy so the assignment does not touch a slice of df.
max_customer_account_balance(df.head(5).copy())

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y,annual_duration,campaign_limit,balance_max
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no,"[261, 130.5]","(1, 1)",2143
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no,"[151, 75.5]","(1, 1)",2143
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no,"[76, 38.0]","(1, 1)",2143
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no,"[92, 46.0]","(1, 1)",2143
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no,"[198, 99.0]","(1, 1)",2143


The mean_age_of_customer function below takes a DataFrame as input and adds a new column, age_mean, which contains the mean age of all customers from the age column.

In [13]:
def mean_age_of_customer(df: DataFrame) -> DataFrame:
    """Add a column with the mean age of all customers."""
    df["age_mean"] = df["age"].mean()
    return df


# Pass a copy so the assignment does not touch a slice of df.
mean_age_of_customer(df.head(5).copy())

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y,annual_duration,campaign_limit,age_mean
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no,"[261, 130.5]","(1, 1)",43.0
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no,"[151, 75.5]","(1, 1)",43.0
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no,"[76, 38.0]","(1, 1)",43.0
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no,"[92, 46.0]","(1, 1)",43.0
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no,"[198, 99.0]","(1, 1)",43.0
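
Both functions assign the new column on the DataFrame they receive, so the caller's object is modified in place. A non-mutating variant could be written with `DataFrame.assign`, which returns a new DataFrame. This is a sketch only, and `with_balance_max` is a hypothetical name rather than a function from the session.

```python
from pandas import DataFrame


def with_balance_max(df: DataFrame) -> DataFrame:
    """Return a new DataFrame with a balance_max column (hypothetical helper)."""
    return df.assign(balance_max=df["balance"].max())


# Hypothetical toy input.
original = DataFrame({"balance": [2143, 29, 2]})
enriched = with_balance_max(original)

# The input keeps its original columns; only the returned frame has the new one.
print(list(original.columns))
print(list(enriched.columns))
```

Returning a new object can make the functions easier to test, at the cost of an extra copy of the data.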


In [14]:
def gb_mean_age_of_customer(df: DataFrame) -> float:
    """Return the mean age of the customers in df (for use inside groupby)."""
    return df["age"].mean()


In [15]:
def gb_max_customer_account_balance(df: DataFrame) -> int:
    """Return the maximum account balance in df (for use inside groupby)."""
    return df["balance"].max()

In [16]:
df.groupby("marital").apply(
    lambda x: pd.Series(
        {
            "balance_max": gb_max_customer_account_balance(x),
            "age_mean": gb_mean_age_of_customer(x),
        }
    )
)

          balance_max   age_mean
marital                         
divorced      66721.0  45.782984
married       98417.0  43.408099
single       102127.0  33.703440
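
For simple per-group statistics like these, pandas named aggregation is a possible alternative to `apply` with a lambda. The sketch below uses a small hypothetical frame; with this form `balance_max` keeps the integer dtype of `balance` instead of being converted to float.

```python
import pandas as pd

# Hypothetical toy data with the three columns used in the aggregation.
toy = pd.DataFrame(
    {
        "marital": ["married", "single", "married", "single"],
        "balance": [100, 250, 400, 50],
        "age": [30, 40, 50, 20],
    }
)

# Named aggregation: output_column=(input_column, aggregation function).
summary = toy.groupby("marital").agg(
    balance_max=("balance", "max"),
    age_mean=("age", "mean"),
)
print(summary)
```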


In [17]:
df.groupby("marital").apply(
    lambda x: pd.Series(
        {
            "balance_max": max_customer_account_balance(x),
            "age_mean": mean_age_of_customer(x),
        }
    )
)

Error: Output too large to be added to notebook
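
A likely reason for the oversized output is that `max_customer_account_balance` and `mean_age_of_customer` return whole DataFrames, whereas the `gb_` variants return a single value per group. The difference in return types can be seen on a small hypothetical group (the functions are redefined here so the sketch runs on its own):

```python
from pandas import DataFrame


def max_customer_account_balance(df: DataFrame) -> DataFrame:
    """Same logic as the notebook function: returns the whole DataFrame."""
    df["balance_max"] = df["balance"].max()
    return df


def gb_max_customer_account_balance(df: DataFrame):
    """Scalar-returning variant intended for use inside groupby."""
    return df["balance"].max()


# Hypothetical stand-in for the rows of one marital group.
group = DataFrame({"balance": [100, 200, 300]})

frame_result = max_customer_account_balance(group.copy())
scalar_result = gb_max_customer_account_balance(group)

print(type(frame_result).__name__, frame_result.shape)
print(scalar_result)
```

Only the scalar form fits naturally into a one-row-per-group summary.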

### Feature Creation Pipeline

The feature creation step, referred to here as `creating_features_banking_data`, applies the `max_customer_account_balance` and `mean_age_of_customer` functions to the DataFrame using the Pandas `pipe` method, which allows clean and readable chaining of operations. In the cell below the chain is written inline.
It returns the modified DataFrame with the new features added.

In [18]:
df.pipe(max_customer_account_balance).pipe(mean_age_of_customer)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y,annual_duration,campaign_limit,balance_max,age_mean
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no,"[261, 130.5]","(1, 1)",102127,40.93621
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no,"[151, 75.5]","(1, 1)",102127,40.93621
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no,"[76, 38.0]","(1, 1)",102127,40.93621
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no,"[92, 46.0]","(1, 1)",102127,40.93621
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no,"[198, 99.0]","(1, 1)",102127,40.93621
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,cellular,17,nov,977,3,-1,0,unknown,yes,"[977, 488.5]","(3, 9)",102127,40.93621
45207,71,retired,divorced,primary,no,1729,no,no,cellular,17,nov,456,2,-1,0,unknown,yes,"[456, 228.0]","(2, 4)",102127,40.93621
45208,72,retired,married,secondary,no,5715,no,no,cellular,17,nov,1127,5,184,3,success,yes,"[1127, 563.5]","(5, 25)",102127,40.93621
45209,57,blue-collar,married,secondary,no,668,no,no,telephone,17,nov,508,4,-1,0,unknown,no,"[508, 254.0]","(4, 16)",102127,40.93621


### Execution

Calls the creating_features_banking_data function with the sample DataFrame df.
Stores the resulting DataFrame in banking_customers_df.
Prints the first few rows of the resulting DataFrame to verify the added features.
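
The notebook does not show a definition of `creating_features_banking_data`, so the following is only a sketch of what it might look like, assuming it simply wraps the `pipe` chain. The sample data is hypothetical, and the two feature functions are repeated so the block runs on its own.

```python
from pandas import DataFrame


def max_customer_account_balance(df: DataFrame) -> DataFrame:
    """Add a column with the maximum account balance of all customers."""
    df["balance_max"] = df["balance"].max()
    return df


def mean_age_of_customer(df: DataFrame) -> DataFrame:
    """Add a column with the mean age of all customers."""
    df["age_mean"] = df["age"].mean()
    return df


def creating_features_banking_data(df: DataFrame) -> DataFrame:
    """Chain the feature functions with DataFrame.pipe (assumed definition)."""
    return df.pipe(max_customer_account_balance).pipe(mean_age_of_customer)


# Hypothetical sample standing in for the full banking DataFrame.
sample_df = DataFrame({"balance": [2143, 29, 2], "age": [58, 44, 33]})
banking_customers_df = creating_features_banking_data(sample_df)
print(banking_customers_df.head())
```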

### Unit Testing

This section provides unit tests for two functions, `max_customer_account_balance` and `mean_age_of_customer`, using the `pytest` framework. The goal is to ensure that these functions correctly add calculated columns to a Pandas DataFrame.

**Imports and Fixtures**:
   - Imports the necessary modules and functions: `pytest`, `DataFrame` from Pandas, and the production functions.
   - Defines a fixture `test_df` that sets up a sample DataFrame for testing purposes.

In [19]:
import pytest
@pytest.fixture()
def test_df():
    return DataFrame(
        data={
            "balance": [100, 200, 300, 400, 500],
            "age": [30, 40, 50, 60, 70],
        }
    )
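
When the tests are run by `pytest`, a test function can receive the fixture's DataFrame by declaring an argument with the same name as the fixture. A minimal sketch, assuming the tests live in a file collected by `pytest` (the test name `test_balance_max_with_fixture` is illustrative):

```python
import pytest
from pandas import DataFrame


def max_customer_account_balance(df: DataFrame) -> DataFrame:
    """Add a column with the maximum account balance of all customers."""
    df["balance_max"] = df["balance"].max()
    return df


@pytest.fixture()
def test_df() -> DataFrame:
    """Sample DataFrame shared by the tests."""
    return DataFrame(
        data={
            "balance": [100, 200, 300, 400, 500],
            "age": [30, 40, 50, 60, 70],
        }
    )


def test_balance_max_with_fixture(test_df: DataFrame) -> None:
    # pytest injects the fixture's return value because the argument
    # name matches the fixture name.
    actual_df = max_customer_account_balance(test_df)
    assert "balance_max" in actual_df.columns
    assert (actual_df["balance_max"] == 500).all()
```

Each test that requests the fixture gets a freshly built DataFrame, so in-place changes made by one test do not leak into another.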

**Unit Test for `max_customer_account_balance` Function**
   - Creates a DataFrame with balance data.
   - Applies the `max_customer_account_balance` function to add the `balance_max` column.
   - Asserts that the `balance_max` column is present and contains the correct maximum value (600).

In [20]:
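def test_max_customer_account_balance():
    """Test the max_customer_account_balance function."""
    test_df = DataFrame(data={"balance": [100, 200, 400, 600]})
    actual_df = max_customer_account_balance(test_df)
    actual = actual_df["balance_max"].iloc[0]
    expected = 600
    assert "balance_max" in actual_df.columns
    assert actual == expected


# Run the test directly in the notebook; under pytest this call is not needed.
test_max_customer_account_balance()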

**Unit Test for `mean_age_of_customer` Function**:
   - Creates a DataFrame with age data.
   - Applies the `mean_age_of_customer` function to add the `age_mean` column.
   - Asserts that the `age_mean` column is present and contains the correct mean value (50).

In [23]:
def test_customers_mean_age():
    """Test the mean_age_of_customer function."""
    test_df = DataFrame(data={"age": [30, 40, 50, 60, 70]})
    test_df = mean_age_of_customer(test_df)
    actual = test_df["age_mean"].iloc[0]
    expected = 50
    assert "age_mean" in test_df.columns
    assert actual == expected


# Run the test directly in the notebook; under pytest this call is not needed.
test_customers_mean_age()

**Purpose**:
- Ensure the `max_customer_account_balance` function correctly computes and adds the maximum account balance as a new column.
- Ensure the `mean_age_of_customer` function correctly computes and adds the mean age of customers as a new column.

These tests validate the correctness of the feature engineering functions, ensuring they perform as expected when applied to a DataFrame.
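
The tests above check only the first row of each new column. A stricter check could compare the whole output against an expected DataFrame with `pandas.testing.assert_frame_equal`. This is a sketch, and the test name `test_mean_age_full_frame` is illustrative.

```python
from pandas import DataFrame
from pandas.testing import assert_frame_equal


def mean_age_of_customer(df: DataFrame) -> DataFrame:
    """Add a column with the mean age of all customers."""
    df["age_mean"] = df["age"].mean()
    return df


def test_mean_age_full_frame() -> None:
    input_df = DataFrame({"age": [30, 40, 50, 60, 70]})
    expected_df = DataFrame(
        {"age": [30, 40, 50, 60, 70], "age_mean": [50.0] * 5}
    )
    # Work on a copy so the input fixture data stays unchanged.
    actual_df = mean_age_of_customer(input_df.copy())
    # Compares column names, dtypes, index, and every value.
    assert_frame_equal(actual_df, expected_df)


# Direct call for notebook use; pytest would collect the function itself.
test_mean_age_full_frame()
```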