# Data Cleaning with Python (Part 3)

In this topic, we will learn about: 
1. **Feature Engineering**
2. **Data Integration**

These are two critical steps in preparing data for analysis and modeling. Feature engineering transforms raw data into meaningful features, while data integration combines data from multiple sources, enhancing the dataset’s depth and scope.

### Import Libraries and Load Data

First, we need to import the necessary libraries and load our dataset that was cleaned in Part 1. Pandas is the primary library we’ll use to manipulate our data.


In [2]:
# Importing the Pandas library
import pandas as pd

# Loading the dataset
df = pd.read_csv('cleaned_data_3.csv')

# Displaying the first few rows of the dataset
df.head()


Unnamed: 0,CustomerID,TransactionID,Transaction Date,Product Name,Product Category,Quantity,Price Per Unit,Payment Method,Customer Age,Total Amount
0,CUST041,TXN0001,2023-12-14,Tablet,Electronics,9.0,229.78,Debit Card,34.0,2068.02
1,CUST008,TXN0002,2023-12-02,Tablet,Electronics,4.0,443.23,Debit Card,39.0,1772.92
2,CUST002,TXN0003,2023-12-01,Smartphone,Electronics,4.0,221.94,Cash,34.0,887.76
3,CUST048,TXN0004,2023-08-07,Monitor,Electronics,9.0,226.87,Debit Card,47.0,2041.83
4,CUST018,TXN0005,2023-06-27,Laptop,Electronics,1.0,169.77,Debit Card,71.0,169.77


## 1. Feature Engineering

Feature engineering is the process of creating new features from existing data, which can help improve the quality and relevance of data for modeling and analysis.

### a) Extracting Year, Month, and Day from Transaction Date

Extracting `Year`, `Month`, and `Day` from a date column makes time-based patterns easier to analyze. With separate columns, we can group transactions by year, month, or day to reveal seasonal trends, monthly fluctuations, and day-of-month patterns.



In [3]:
# First, ensure that the Transaction Date column is in datetime format.
# errors='coerce' turns values that cannot be parsed into NaT instead of raising an error.
df['Transaction Date'] = pd.to_datetime(df['Transaction Date'], errors='coerce')

In [4]:
# Extract year, month, and day
df['Year'] = df['Transaction Date'].dt.year
df['Month'] = df['Transaction Date'].dt.month
df['Day'] = df['Transaction Date'].dt.day

In [5]:
df.columns

Index(['CustomerID', 'TransactionID', 'Transaction Date', 'Product Name',
       'Product Category', 'Quantity', 'Price Per Unit', 'Payment Method',
       'Customer Age', 'Total Amount', 'Year', 'Month', 'Day'],
      dtype='object')
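
Because `errors='coerce'` silently converts unparseable values to `NaT`, it can be worth checking how many rows were affected before using the extracted parts. The following is a minimal sketch on a small hypothetical DataFrame (the `sample` data and the `n_invalid`, `valid`, and `monthly_sales` names are illustrative assumptions, not part of the dataset above). It also shows one way the `Month` feature might be used.

```python
import pandas as pd

# Hypothetical sample: the second value is deliberately not a valid date
sample = pd.DataFrame({
    'Transaction Date': ['2023-12-14', 'not a date', '2023-06-27'],
    'Total Amount': [2068.02, 50.00, 169.77],
})

# Unparseable strings become NaT instead of raising an error
sample['Transaction Date'] = pd.to_datetime(sample['Transaction Date'], errors='coerce')

# Count the rows that failed to parse before relying on the date parts
n_invalid = sample['Transaction Date'].isna().sum()

# Keep only rows with a valid date, then extract the month
valid = sample.dropna(subset=['Transaction Date']).copy()
valid['Month'] = valid['Transaction Date'].dt.month

# Total sales per month, one possible use of the extracted feature
monthly_sales = valid.groupby('Month')['Total Amount'].sum()
print(n_invalid)
print(monthly_sales)
```

Whether to drop, fix, or keep rows with `NaT` depends on the analysis; dropping them here is only one option.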

### b) Creating a Loyalty Points per Transaction Feature

The `Loyalty Points per Transaction` feature is a simple way to measure customer loyalty by assigning points based on customer spending in each transaction. This feature can be used to develop loyalty programs or identify high-value customers.

- **Assign Loyalty Points per Transaction** based on the `Total Amount` (e.g., 1 point for every $10 spent).


In [6]:
# Define points conversion rate
points_rate = 10  # 1 point for every $10 spent

# Calculate loyalty points by dividing Total Amount by points_rate
df['Loyalty Points'] = df['Total Amount'] / points_rate

# Convert loyalty points to whole numbers (astype(int) truncates the decimals,
# which rounds down for non-negative amounts)
df['Loyalty Points'] = df['Loyalty Points'].astype(int)

# Display the DataFrame with the new Loyalty Points feature
df

Unnamed: 0,CustomerID,TransactionID,Transaction Date,Product Name,Product Category,Quantity,Price Per Unit,Payment Method,Customer Age,Total Amount,Year,Month,Day,Loyalty Points
0,CUST041,TXN0001,2023-12-14,Tablet,Electronics,9.0,229.78,Debit Card,34.0,2068.02,2023,12,14,206
1,CUST008,TXN0002,2023-12-02,Tablet,Electronics,4.0,443.23,Debit Card,39.0,1772.92,2023,12,2,177
2,CUST002,TXN0003,2023-12-01,Smartphone,Electronics,4.0,221.94,Cash,34.0,887.76,2023,12,1,88
3,CUST048,TXN0004,2023-08-07,Monitor,Electronics,9.0,226.87,Debit Card,47.0,2041.83,2023,8,7,204
4,CUST018,TXN0005,2023-06-27,Laptop,Electronics,1.0,169.77,Debit Card,71.0,169.77,2023,6,27,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,CUST010,TXN0196,2023-12-21,Tablet,Electronics,5.0,164.50,Cash,25.0,822.50,2023,12,21,82
196,CUST024,TXN0197,2023-09-30,Headphones,Electronics,6.0,260.11,Debit Card,52.0,1560.66,2023,9,30,156
197,CUST049,TXN0198,2023-12-03,Keyboard,Electronics,3.0,211.03,Debit Card,63.0,633.09,2023,12,3,63
198,CUST011,TXN0199,2023-03-28,Monitor,Electronics,7.0,61.26,Debit Card,71.0,428.82,2023,3,28,42


For analysis purposes, these loyalty points can be grouped by `CustomerID` to find the **total loyalty points per customer**. This allows businesses to assess loyalty at the individual level and identify high-value customers based on the points they accumulate across multiple transactions.
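
A minimal sketch of that grouping step, assuming a small hypothetical transactions table (the `tx` DataFrame, its values, and the `points_per_customer` name are illustrative only). It uses floor division to compute whole points, which for non-negative amounts gives the same result as dividing and then truncating.

```python
import pandas as pd

# Hypothetical transactions; values are illustrative only
tx = pd.DataFrame({
    'CustomerID': ['CUST001', 'CUST002', 'CUST001', 'CUST002', 'CUST003'],
    'Total Amount': [123.28, 887.76, 492.42, 804.50, 478.24],
})

points_rate = 10  # 1 point for every $10 spent

# Floor division gives whole points per transaction
tx['Loyalty Points'] = (tx['Total Amount'] // points_rate).astype(int)

# Sum points per customer and rank from highest to lowest
points_per_customer = (
    tx.groupby('CustomerID')['Loyalty Points']
      .sum()
      .sort_values(ascending=False)
)
print(points_per_customer)
```

Sorting in descending order puts the customers with the most accumulated points at the top of the result.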


## 2. Data Integration

Data integration combines datasets from multiple sources to provide a more comprehensive view. This is useful for enriching our data with additional information and enabling deeper analysis.

### Merging Datasets

To integrate data, we can use `pd.merge()` to join datasets on a common column (e.g., `CustomerID`). This function aligns rows from each dataset based on the values in the specified column, allowing us to combine related data.

### Load the second dataset

In [7]:
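# Loading the second dataset (customer demographics)
# Note: read_csv expects delimited text; a true Excel workbook would need pd.read_excel instead
df2 = pd.read_csv('../inputs/datasets/raw/Customer_Demographics.xls')

# Displaying the dataset
df2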

Unnamed: 0,CustomerID,City
0,CUST041,New York
1,CUST008,New York
2,CUST002,Houston
3,CUST048,Houston
4,CUST018,Los Angeles
5,CUST016,New York
6,CUST015,New York
7,CUST009,Houston
8,CUST007,Los Angeles
9,CUST044,Los Angeles


In [13]:
df_combined = pd.merge(df, df2, on='CustomerID', how='outer')
df_combined.head(50)

Unnamed: 0,CustomerID,TransactionID,Transaction Date,Product Name,Product Category,Quantity,Price Per Unit,Payment Method,Customer Age,Total Amount,Year,Month,Day,Loyalty Points,City
0,CUST001,TXN0036,2023-01-08,Smartphone,Electronics,4.0,30.82,Cash,42.0,123.28,2023,1,8,12,Los Angeles
1,CUST001,TXN0165,2023-11-27,Laptop,Electronics,1.0,492.42,Debit Card,56.0,492.42,2023,11,27,49,Los Angeles
2,CUST001,TXN0181,2023-01-14,Monitor,Electronics,3.0,346.2,Cash,55.0,1038.6,2023,1,14,103,Los Angeles
3,CUST002,TXN0003,2023-12-01,Smartphone,Electronics,4.0,221.94,Cash,34.0,887.76,2023,12,1,88,Houston
4,CUST002,TXN0018,2023-02-15,Smartphone,Electronics,5.0,160.9,Cash,23.0,804.5,2023,2,15,80,Houston
5,CUST002,TXN0024,2023-12-18,Laptop,Electronics,9.0,34.54,Credit Card,20.0,310.86,2023,12,18,31,Houston
6,CUST003,TXN0017,2023-07-13,Desk Chair,Furniture,6.0,280.59,Credit Card,78.0,1683.54,2023,7,13,168,Los Angeles
7,CUST003,TXN0055,2023-02-04,Sofa,Furniture,3.0,265.21,Credit Card,50.0,795.63,2023,2,4,79,Los Angeles
8,CUST003,TXN0071,2023-06-18,Keyboard,Electronics,8.0,59.78,Debit Card,63.0,478.24,2023,6,18,47,Los Angeles
9,CUST003,TXN0114,2023-05-27,Monitor,Electronics,1.0,475.18,Cash,21.0,475.18,2023,5,27,47,Los Angeles


### Explanation:

`pd.merge()` joins two DataFrames (`df` and `df2`) based on the common column `CustomerID`.

The `how` parameter determines the type of join, and it has the following options:

- `how='inner'`: Only includes rows with matching values in both datasets (default behavior).
- `how='left'`: Keeps all rows from the left DataFrame (`df`), filling in `NaN` for missing values from the right DataFrame (`df2`).
- `how='right'`: Keeps all rows from the right DataFrame (`df2`), filling in `NaN` for missing values from the left DataFrame (`df`).
- `how='outer'`: Includes all rows from both DataFrames, filling in `NaN` for any missing values.

Using these options, you can tailor the merge to fit your data needs and create a unified dataset for analysis.
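
To see how the four options differ in practice, here is a minimal sketch that compares the number of rows each join type produces. The `transactions` and `cities` DataFrames are hypothetical and deliberately contain one customer without a city and one city record without transactions. The `indicator=True` argument adds a `_merge` column showing which side each row came from.

```python
import pandas as pd

# Hypothetical frames: CUST003 has no city, CUST004 has no transactions
transactions = pd.DataFrame({
    'CustomerID': ['CUST001', 'CUST001', 'CUST002', 'CUST003'],
    'Total Amount': [123.28, 492.42, 887.76, 478.24],
})
cities = pd.DataFrame({
    'CustomerID': ['CUST001', 'CUST002', 'CUST004'],
    'City': ['Los Angeles', 'Houston', 'New York'],
})

# Row count produced by each join type
row_counts = {
    how: len(pd.merge(transactions, cities, on='CustomerID', how=how))
    for how in ['inner', 'left', 'right', 'outer']
}
print(row_counts)

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
outer = pd.merge(transactions, cities, on='CustomerID', how='outer', indicator=True)

# Rows that exist in only one of the two frames
unmatched = outer[outer['_merge'] != 'both']
print(unmatched)
```

Inspecting the unmatched rows is a quick way to decide which join type suits the analysis, for example whether customers without demographic data should be kept.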


In [14]:
MYdf1 = pd.DataFrame({
    'name': ['John', 'Jane', 'Jim'],
    'age': [28, 34, 29],
    'city': ['New York', 'Los Angeles', 'Chicago']
})

MYdf2 = pd.DataFrame({
    'name': ['Abert', 'Benny', 'Carlos', 'John', 'Jane', 'Jim'],
    'age': [22, 33, 49, 28, 34, 29],
    'city': ['York', 'Newport', 'Coventry', 'New York', 'Los Angeles', 'Chicago']
})

In [15]:
MYdf3 = pd.merge(MYdf1, MYdf2, how='outer', on='name')
MYdf3

Unnamed: 0,name,age_x,city_x,age_y,city_y
0,Abert,,,22,York
1,Benny,,,33,Newport
2,Carlos,,,49,Coventry
3,Jane,34.0,Los Angeles,34,Los Angeles
4,Jim,29.0,Chicago,29,Chicago
5,John,28.0,New York,28,New York


In [16]:
MYdf2

Unnamed: 0,name,age,city
0,Abert,22,York
1,Benny,33,Newport
2,Carlos,49,Coventry
3,John,28,New York
4,Jane,34,Los Angeles
5,Jim,29,Chicago
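
In the merged result, `age` and `city` each appear twice with `_x` and `_y` suffixes because both DataFrames contain those columns and only `name` was used as the join key. The sketch below shows two possible ways to handle this, using small hypothetical frames (`left`, `right`, and the result names are illustrative assumptions): naming the suffixes explicitly, or joining on all shared columns.

```python
import pandas as pd

# Hypothetical frames that share the columns name, age, and city
left = pd.DataFrame({
    'name': ['John', 'Jane'],
    'age': [28, 34],
    'city': ['New York', 'Los Angeles'],
})
right = pd.DataFrame({
    'name': ['Benny', 'John', 'Jane'],
    'age': [33, 28, 34],
    'city': ['Newport', 'New York', 'Los Angeles'],
})

# Option 1: keep both copies but give them descriptive suffixes
labelled = pd.merge(left, right, on='name', how='outer', suffixes=('_left', '_right'))
print(labelled.columns.tolist())

# Option 2: treat all shared columns as the key so no duplicate columns are created
combined = pd.merge(left, right, on=['name', 'age', 'city'], how='outer')
print(combined)
```

With the second option, rows match only when all three values agree, so a person whose `age` or `city` differs between the two sources would appear as two separate rows.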


### Conclusion

Data integration and feature engineering are crucial steps in preparing data for analysis. By merging datasets, we gain a more complete view that can uncover insights hidden in isolated sources. Feature engineering allows us to transform raw variables into more informative features and improve interpretability. Together, these techniques improve data quality and set the foundation for impactful, data-driven decisions.
