<h1 style="text-align: center;"><a title="Data Science-AIMS-Cmr-2021-22">Chapter 4: Data Transformation and Feature Engineering</a></h1>

**Instructor:** 

* Rockefeller

## Learning Objectives

By the end of this lesson, you will be able to:
- Create new columns and derived features from existing data
- Master string operations for text processing
- Handle categorical data effectively
- Perform grouping and aggregation operations
- Merge and join datasets from multiple sources
- Apply feature engineering techniques to African business datasets

## Introduction: The Art of Feature Engineering

Feature engineering is often considered one of the most important skills in data science. It is the process of creating new variables (features) from existing data so that they better represent the underlying patterns and relationships.

In the African context, this might mean:
- Creating **seasonal indicators** from dates for agricultural data
- Calculating **profit margins** from revenue and cost data for businesses
- Deriving **population density** from population and area data for urban planning
- Extracting **age groups** from birth dates for demographic analysis
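
As a rough illustration of the ideas in this list, the sketch below derives a seasonal flag, a population density and an age group from a tiny made-up table. Every value and column name here is hypothetical, and the April to October rainy season is only an assumption for the example, since seasons differ by region.

```python
import pandas as pd

# Hypothetical toy records; none of these values come from the course datasets.
toy = pd.DataFrame({
    "observed_on": pd.to_datetime(["2024-01-15", "2024-07-03", "2024-10-20"]),
    "population": [1_200_000, 350_000, 80_000],
    "area_km2": [400.0, 700.0, 1_600.0],
    "birth_date": pd.to_datetime(["2010-05-01", "1990-03-12", "1955-11-30"]),
})

# Seasonal indicator: assumes a rainy season running from April to October.
toy["is_rainy_season"] = toy["observed_on"].dt.month.between(4, 10)

# Population density in people per square kilometre.
toy["density_per_km2"] = toy["population"] / toy["area_km2"]

# Approximate age in whole years, then binned into three illustrative groups.
age_years = (toy["observed_on"] - toy["birth_date"]).dt.days // 365
toy["age_group"] = pd.cut(age_years, bins=[0, 17, 64, 120],
                          labels=["Child", "Adult", "Senior"])

print(toy[["is_rainy_season", "density_per_km2", "age_group"]])
```
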

Let's explore these techniques with African business datasets!

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")
print(f"pandas version: {pd.__version__}")
print(f"numpy version: {np.__version__}")

Libraries imported successfully!
pandas version: 2.2.3
numpy version: 2.2.5


## Dataset 1: Nigerian Small Business Data

Let's start with a dataset representing small businesses across Nigeria, including revenue, costs, and operational data.

In [11]:
df_nigeria = pd.read_csv(filepath_or_buffer='data/naija_businesses.csv')
df_nigeria.head()

Unnamed: 0,business_id,business_name,state,business_type,start_date,monthly_revenue_naira,monthly_costs_naira,employees,has_website,accepts_mobile_money
0,NG_0001,Business_1,Benue,Retail,2023-02-01,417741,406718,8,False,True
1,NG_0002,Business_2,Kaduna,Manufacturing,2023-11-17,696804,256593,10,True,False
2,NG_0003,Business_3,Oyo,Technology,2020-09-02,520713,87719,11,False,True
3,NG_0004,Business_4,Benue,Manufacturing,2023-04-14,553527,389310,7,False,True
4,NG_0005,Business_5,Rivers,Technology,2022-11-28,231537,-11869,10,True,False


## 1. Creating New Columns

Let's start by creating new features from existing data:

In [14]:
# 1. Calculate profit and profit margin
df_nigeria['monthly_profit_naira'] = df_nigeria['monthly_revenue_naira'] - df_nigeria['monthly_costs_naira']
df_nigeria['profit_margin_percent'] = (df_nigeria['monthly_profit_naira'] / df_nigeria['monthly_revenue_naira'] * 100).round(2)
df_nigeria

Unnamed: 0,business_id,business_name,state,business_type,start_date,monthly_revenue_naira,monthly_costs_naira,employees,has_website,accepts_mobile_money,monthly_profit_naira,profit_margin_percent
0,NG_0001,Business_1,Benue,Retail,2023-02-01,417741,406718,8,False,True,11023,2.64
1,NG_0002,Business_2,Kaduna,Manufacturing,2023-11-17,696804,256593,10,True,False,440211,63.18
2,NG_0003,Business_3,Oyo,Technology,2020-09-02,520713,87719,11,False,True,432994,83.15
3,NG_0004,Business_4,Benue,Manufacturing,2023-04-14,553527,389310,7,False,True,164217,29.67
4,NG_0005,Business_5,Rivers,Technology,2022-11-28,231537,-11869,10,True,False,243406,105.13
...,...,...,...,...,...,...,...,...,...,...,...,...
195,NG_0196,Business_196,Kano,Services,2023-09-27,611128,415487,8,True,False,195641,32.01
196,NG_0197,Business_197,Kano,Agriculture,2021-01-31,916895,349547,7,False,True,567348,61.88
197,NG_0198,Business_198,Kaduna,Agriculture,2021-11-29,778180,430501,8,False,True,347679,44.68
198,NG_0199,Business_199,Oyo,Agriculture,2022-11-08,98486,78788,4,False,True,19698,20.00
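
Some rows in this dataset have negative costs or negative revenue (for example `NG_0005` and `NG_0006`), which produces margins such as 105.13% or a positive 20% on a loss. The sketch below is one possible way to flag such rows before trusting the ratio. The column names `has_valid_financials` and `profit_margin_percent_checked` are hypothetical, and the validity rule (revenue above zero, costs not negative) is an assumption rather than part of the lesson.

```python
import pandas as pd

# Values for three businesses as displayed in this notebook's outputs.
check = pd.DataFrame({
    "business_id": ["NG_0001", "NG_0005", "NG_0006"],
    "monthly_revenue_naira": [417741, 231537, -172819],
    "monthly_costs_naira": [406718, -11869, -138255],
})

# Assumed rule: revenue must be positive and costs must not be negative.
valid = (check["monthly_revenue_naira"] > 0) & (check["monthly_costs_naira"] >= 0)
check["has_valid_financials"] = valid

# Compute the margin as before, but keep it only where the inputs look valid.
profit = check["monthly_revenue_naira"] - check["monthly_costs_naira"]
margin = (profit / check["monthly_revenue_naira"] * 100).round(2)
check["profit_margin_percent_checked"] = margin.where(valid)

print(check)
```

Rows that fail the check get `NaN` instead of a misleading percentage, so they can be reviewed or excluded explicitly.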


In [16]:

# 2. Calculate revenue per employee
df_nigeria['revenue_per_employee'] = (df_nigeria['monthly_revenue_naira'] / df_nigeria['employees']).round(2)
df_nigeria.head()

Unnamed: 0,business_id,business_name,state,business_type,start_date,monthly_revenue_naira,monthly_costs_naira,employees,has_website,accepts_mobile_money,monthly_profit_naira,profit_margin_percent,revenue_per_employee
0,NG_0001,Business_1,Benue,Retail,2023-02-01,417741,406718,8,False,True,11023,2.64,52217.62
1,NG_0002,Business_2,Kaduna,Manufacturing,2023-11-17,696804,256593,10,True,False,440211,63.18,69680.4
2,NG_0003,Business_3,Oyo,Technology,2020-09-02,520713,87719,11,False,True,432994,83.15,47337.55
3,NG_0004,Business_4,Benue,Manufacturing,2023-04-14,553527,389310,7,False,True,164217,29.67,79075.29
4,NG_0005,Business_5,Rivers,Technology,2022-11-28,231537,-11869,10,True,False,243406,105.13,23153.7


In [40]:
# 3. Create business age in years
current_date = pd.Timestamp('2024-10-28')
df_nigeria['business_age_years'] = ((current_date - pd.to_datetime(df_nigeria['start_date'])).dt.days / 365).round(2)
df_nigeria.sample(5)


Unnamed: 0,business_id,business_name,state,business_type,start_date,monthly_revenue_naira,monthly_costs_naira,employees,has_website,accepts_mobile_money,monthly_profit_naira,profit_margin_percent,revenue_per_employee,business_age_years
153,NG_0154,Business_154,Rivers,Services,2021-12-27,239679,13311,4,False,True,226368,94.45,59919.75,2.84
134,NG_0135,Business_135,Benue,Agriculture,2020-06-08,270389,216311,6,False,True,54078,20.0,45064.83,4.39
76,NG_0077,Business_77,Kaduna,Agriculture,2021-05-27,836936,555873,6,False,False,281063,33.58,139489.33,3.42
34,NG_0035,Business_35,Imo,Retail,2023-09-27,182707,89967,5,True,True,92740,50.76,36541.4,1.09
172,NG_0173,Business_173,Rivers,Technology,2022-01-30,538418,377069,7,False,True,161349,29.97,76916.86,2.75


In [42]:

# 4. Create size categories based on employees
def categorize_business_size(employees):
    if employees <= 5:
        return 'Micro'
    elif employees <= 20:
        return 'Small'
    elif employees <= 50:
        return 'Medium'
    else:
        return 'Large'

df_nigeria['business_size'] = df_nigeria['employees'].apply(categorize_business_size)
df_nigeria.head(10)

Unnamed: 0,business_id,business_name,state,business_type,start_date,monthly_revenue_naira,monthly_costs_naira,employees,has_website,accepts_mobile_money,monthly_profit_naira,profit_margin_percent,revenue_per_employee,business_age_years,business_size
0,NG_0001,Business_1,Benue,Retail,2023-02-01,417741,406718,8,False,True,11023,2.64,52217.62,1.74,Small
1,NG_0002,Business_2,Kaduna,Manufacturing,2023-11-17,696804,256593,10,True,False,440211,63.18,69680.4,0.95,Small
2,NG_0003,Business_3,Oyo,Technology,2020-09-02,520713,87719,11,False,True,432994,83.15,47337.55,4.16,Small
3,NG_0004,Business_4,Benue,Manufacturing,2023-04-14,553527,389310,7,False,True,164217,29.67,79075.29,1.54,Small
4,NG_0005,Business_5,Rivers,Technology,2022-11-28,231537,-11869,10,True,False,243406,105.13,23153.7,1.92,Small
5,NG_0006,Business_6,Anambra,Technology,2022-01-28,-172819,-138255,5,False,False,-34564,20.0,-34563.8,2.75,Micro
6,NG_0007,Business_7,Oyo,Services,2021-06-21,724708,207436,12,False,True,517272,71.38,60392.33,3.36,Small
7,NG_0008,Business_8,Oyo,Technology,2022-07-09,928641,111173,11,False,False,817468,88.03,84421.91,2.31,Small
8,NG_0009,Business_9,Benue,Manufacturing,2021-02-11,431482,345185,2,True,False,86297,20.0,215741.0,3.71,Micro
9,NG_0010,Business_10,Kano,Technology,2021-06-08,549862,293803,3,True,False,256059,46.57,183287.33,3.39,Micro
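
The same size thresholds can also be expressed without a row-by-row function. The sketch below uses `pd.cut` with right-inclusive bins, which should mirror the `<=` comparisons in `categorize_business_size`. The employee counts are the first ten values shown above, and the variable names are only illustrative.

```python
import pandas as pd

# Employee counts for the first ten businesses shown in the output above.
employees = pd.Series([8, 10, 11, 7, 10, 5, 12, 11, 2, 3], name="employees")

# Bins are right-inclusive by default: (-inf, 5], (5, 20], (20, 50], (50, inf).
size = pd.cut(
    employees,
    bins=[-float("inf"), 5, 20, 50, float("inf")],
    labels=["Micro", "Small", "Medium", "Large"],
)

print(size.tolist())
```

The result is an ordered categorical column, which keeps the Micro < Small < Medium < Large ordering available for sorting and plotting.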


In [None]:
# 5. Create digital adoption score
df_nigeria['digital_score'] = (df_nigeria['has_website'].astype(int) + 
                              df_nigeria['accepts_mobile_money'].astype(int))

df_nigeria.head(5)

Unnamed: 0,business_id,business_name,state,business_type,start_date,monthly_revenue_naira,monthly_costs_naira,employees,has_website,accepts_mobile_money,monthly_profit_naira,profit_margin_percent,revenue_per_employee,business_age_years,business_size,digital_score
0,NG_0001,Business_1,Benue,Retail,2023-02-01,417741,406718,8,False,True,11023,2.64,52217.62,1.74,Small,1
1,NG_0002,Business_2,Kaduna,Manufacturing,2023-11-17,696804,256593,10,True,False,440211,63.18,69680.4,0.95,Small,1
2,NG_0003,Business_3,Oyo,Technology,2020-09-02,520713,87719,11,False,True,432994,83.15,47337.55,4.16,Small,1
3,NG_0004,Business_4,Benue,Manufacturing,2023-04-14,553527,389310,7,False,True,164217,29.67,79075.29,1.54,Small,1
4,NG_0005,Business_5,Rivers,Technology,2022-11-28,231537,-11869,10,True,False,243406,105.13,23153.7,1.92,Small,1


In [45]:

# Display sample of new features
df_nigeria[['business_name', 'monthly_profit_naira', 'profit_margin_percent', "business_age_years",
           'business_size', 'digital_score']].head(10)

Unnamed: 0,business_name,monthly_profit_naira,profit_margin_percent,business_age_years,business_size,digital_score
0,Business_1,11023,2.64,1.74,Small,1
1,Business_2,440211,63.18,0.95,Small,1
2,Business_3,432994,83.15,4.16,Small,1
3,Business_4,164217,29.67,1.54,Small,1
4,Business_5,243406,105.13,1.92,Small,1
5,Business_6,-34564,20.0,2.75,Micro,0
6,Business_7,517272,71.38,3.36,Small,1
7,Business_8,817468,88.03,2.31,Small,0
8,Business_9,86297,20.0,3.71,Micro,1
9,Business_10,256059,46.57,3.39,Micro,1


## 2. String Operations and Text Processing

Let's work with text data to extract meaningful information:

In [46]:
df_nigeria

Unnamed: 0,business_id,business_name,state,business_type,start_date,monthly_revenue_naira,monthly_costs_naira,employees,has_website,accepts_mobile_money,monthly_profit_naira,profit_margin_percent,revenue_per_employee,business_age_years,business_size,digital_score
0,NG_0001,Business_1,Benue,Retail,2023-02-01,417741,406718,8,False,True,11023,2.64,52217.62,1.74,Small,1
1,NG_0002,Business_2,Kaduna,Manufacturing,2023-11-17,696804,256593,10,True,False,440211,63.18,69680.40,0.95,Small,1
2,NG_0003,Business_3,Oyo,Technology,2020-09-02,520713,87719,11,False,True,432994,83.15,47337.55,4.16,Small,1
3,NG_0004,Business_4,Benue,Manufacturing,2023-04-14,553527,389310,7,False,True,164217,29.67,79075.29,1.54,Small,1
4,NG_0005,Business_5,Rivers,Technology,2022-11-28,231537,-11869,10,True,False,243406,105.13,23153.70,1.92,Small,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,NG_0196,Business_196,Kano,Services,2023-09-27,611128,415487,8,True,False,195641,32.01,76391.00,1.09,Small,1
196,NG_0197,Business_197,Kano,Agriculture,2021-01-31,916895,349547,7,False,True,567348,61.88,130985.00,3.74,Small,1
197,NG_0198,Business_198,Kaduna,Agriculture,2021-11-29,778180,430501,8,False,True,347679,44.68,97272.50,2.92,Small,1
198,NG_0199,Business_199,Oyo,Agriculture,2022-11-08,98486,78788,4,False,True,19698,20.00,24621.50,1.97,Micro,1


In [47]:
# Take an explicit copy so that later edits to the sample never affect df_nigeria
df_nigeria_sample = df_nigeria.head(20).copy()

df_nigeria_sample

Unnamed: 0,business_id,business_name,state,business_type,start_date,monthly_revenue_naira,monthly_costs_naira,employees,has_website,accepts_mobile_money,monthly_profit_naira,profit_margin_percent,revenue_per_employee,business_age_years,business_size,digital_score
0,NG_0001,Business_1,Benue,Retail,2023-02-01,417741,406718,8,False,True,11023,2.64,52217.62,1.74,Small,1
1,NG_0002,Business_2,Kaduna,Manufacturing,2023-11-17,696804,256593,10,True,False,440211,63.18,69680.4,0.95,Small,1
2,NG_0003,Business_3,Oyo,Technology,2020-09-02,520713,87719,11,False,True,432994,83.15,47337.55,4.16,Small,1
3,NG_0004,Business_4,Benue,Manufacturing,2023-04-14,553527,389310,7,False,True,164217,29.67,79075.29,1.54,Small,1
4,NG_0005,Business_5,Rivers,Technology,2022-11-28,231537,-11869,10,True,False,243406,105.13,23153.7,1.92,Small,1
5,NG_0006,Business_6,Anambra,Technology,2022-01-28,-172819,-138255,5,False,False,-34564,20.0,-34563.8,2.75,Micro,0
6,NG_0007,Business_7,Oyo,Services,2021-06-21,724708,207436,12,False,True,517272,71.38,60392.33,3.36,Small,1
7,NG_0008,Business_8,Oyo,Technology,2022-07-09,928641,111173,11,False,False,817468,88.03,84421.91,2.31,Small,0
8,NG_0009,Business_9,Benue,Manufacturing,2021-02-11,431482,345185,2,True,False,86297,20.0,215741.0,3.71,Micro,1
9,NG_0010,Business_10,Kano,Technology,2021-06-08,549862,293803,3,True,False,256059,46.57,183287.33,3.39,Micro,1
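
A subset returned by `head()` is derived from the original frame, and assigning into it can raise pandas' `SettingWithCopyWarning` (hidden in this notebook because warnings are silenced). A minimal sketch of working on an explicit copy, using a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical miniature version of the business table.
full = pd.DataFrame({
    "business_name": ["Business_1", "Business_2", "Business_3"],
    "employees": [8, 10, 11],
})

# .copy() gives the sample its own data, independent of `full`.
sample = full.head(2).copy()
sample.loc[:1, "business_name"] = [
    "Lagos-Adunni Fashion House/Union",
    "Kano-Groundnut/Cooperative Ltd",
]

# The original frame still holds the placeholder names.
print(full["business_name"].tolist())
print(sample["business_name"].tolist())
```
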


In [57]:
# Create more realistic business names with African context
african_business_names = [
    'Lagos-Adunni Fashion House/Union', 'Kano-Groundnut/Cooperative Ltd',
    'Rivers-Palm Oil Processing/Association', 'Kaduna-Textile Manufacturing/Services Ltd',
    'Oyo-Agro Pastoral/Services Limited', 'Imo-Cassava Farmers/Union',
    'Benue-Yam Gem/Trading Company', 'Anambra-Tech Solutions/Association',
    'Lagos-Digital Marketing Hub/Society', 'Kano-Leather Works/Services Ltd',
    'Port Harcourt-Fish Market/Union', 'Abuja-Movie Consulting/Limited Services',
    'Ibadan-Food Processing/Cooperative', 'Enugu-Coal Trading Co/Association',
    'Calabar-Tourism Services/Union', 'Abuja-Mining Equipments/Association',
    'Maiduguri-Livestock Market/Trading Partners Ltd', 'Sokoto-Onion Farmers/Services Limited',
    'Benin-Rubber Plantation/Cooperative', 'Warri-Oil Refining/Services Ltd'
]

# Assign realistic names to first 20 businesses
df_nigeria_sample.loc[:19, 'business_name'] = african_business_names

df_nigeria_sample

Unnamed: 0,business_id,business_name,state,business_type,start_date,monthly_revenue_naira,monthly_costs_naira,employees,has_website,accepts_mobile_money,monthly_profit_naira,profit_margin_percent,revenue_per_employee,business_age_years,business_size,digital_score
0,NG_0001,Lagos-Adunni Fashion House/Union,Benue,Retail,2023-02-01,417741,406718,8,False,True,11023,2.64,52217.62,1.74,Small,1
1,NG_0002,Kano-Groundnut/Cooperative Ltd,Kaduna,Manufacturing,2023-11-17,696804,256593,10,True,False,440211,63.18,69680.4,0.95,Small,1
2,NG_0003,Rivers-Palm Oil Processing/Association,Oyo,Technology,2020-09-02,520713,87719,11,False,True,432994,83.15,47337.55,4.16,Small,1
3,NG_0004,Kaduna-Textile Manufacturing/Services Ltd,Benue,Manufacturing,2023-04-14,553527,389310,7,False,True,164217,29.67,79075.29,1.54,Small,1
4,NG_0005,Oyo-Agro Pastoral/Services Limited,Rivers,Technology,2022-11-28,231537,-11869,10,True,False,243406,105.13,23153.7,1.92,Small,1
5,NG_0006,Imo-Cassava Farmers/Union,Anambra,Technology,2022-01-28,-172819,-138255,5,False,False,-34564,20.0,-34563.8,2.75,Micro,0
6,NG_0007,Benue-Yam Gem/Trading Company,Oyo,Services,2021-06-21,724708,207436,12,False,True,517272,71.38,60392.33,3.36,Small,1
7,NG_0008,Anambra-Tech Solutions/Association,Oyo,Technology,2022-07-09,928641,111173,11,False,False,817468,88.03,84421.91,2.31,Small,0
8,NG_0009,Lagos-Digital Marketing Hub/Society,Benue,Manufacturing,2021-02-11,431482,345185,2,True,False,86297,20.0,215741.0,3.71,Micro,1
9,NG_0010,Kano-Leather Works/Services Ltd,Kano,Technology,2021-06-08,549862,293803,3,True,False,256059,46.57,183287.33,3.39,Micro,1


In [58]:
# 1. Extract the location prefix (a state or a city) from the business name
df_nigeria_sample['state'] = df_nigeria_sample['business_name'].str.split("-").str[0]
df_nigeria_sample['state']

0             Lagos
1              Kano
2            Rivers
3            Kaduna
4               Oyo
5               Imo
6             Benue
7           Anambra
8             Lagos
9              Kano
10    Port Harcourt
11            Abuja
12           Ibadan
13            Enugu
14          Calabar
15            Abuja
16        Maiduguri
17           Sokoto
18            Benin
19            Warri
Name: state, dtype: object

In [59]:
# 2. Extract business structure (Ltd, Co, etc.)
df_nigeria_sample['business_structure'] = df_nigeria_sample['business_name'].str.split("/").str[-1]
df_nigeria_sample['business_structure']

0                    Union
1          Cooperative Ltd
2              Association
3             Services Ltd
4         Services Limited
5                    Union
6          Trading Company
7              Association
8                  Society
9             Services Ltd
10                   Union
11        Limited Services
12             Cooperative
13             Association
14                   Union
15             Association
16    Trading Partners Ltd
17        Services Limited
18             Cooperative
19            Services Ltd
Name: business_structure, dtype: object
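
Since every name follows the pattern `location-activity/structure`, the two splits can also be done in one pass. The sketch below uses `str.extract` with named groups. The group names `location`, `activity` and `structure` are illustrative, and it assumes each name contains exactly one hyphen before the slash.

```python
import pandas as pd

# Three names taken from the list assigned above.
names = pd.Series([
    "Lagos-Adunni Fashion House/Union",
    "Port Harcourt-Fish Market/Union",
    "Abuja-Movie Consulting/Limited Services",
])

# One regular expression with named groups yields one column per part.
pattern = r"^(?P<location>[^-]+)-(?P<activity>[^/]+)/(?P<structure>.+)$"
parts = names.str.extract(pattern)

print(parts)
```

Names that do not match the pattern come back as `NaN` in every column, which makes malformed entries easy to spot.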

In [60]:
# 3. Check if business name contains certain keywords
df_nigeria_sample['is_agricultural'] = df_nigeria_sample['business_name'].str.contains('Agro|Farm|Cassava|Groundnut|Yam|Palm|Onion|Livestock', case=False, na=False)
df_nigeria_sample['is_tech_related'] = df_nigeria_sample['business_name'].str.contains('Tech|Digital|Solutions|Hub', case=False, na=False)
df_nigeria_sample['is_manufacturing'] = df_nigeria_sample['business_name'].str.contains('Manufacturing|Processing|Works|Equipment', case=False, na=False)
df_nigeria_sample

Unnamed: 0,business_id,business_name,state,business_type,start_date,monthly_revenue_naira,monthly_costs_naira,employees,has_website,accepts_mobile_money,monthly_profit_naira,profit_margin_percent,revenue_per_employee,business_age_years,business_size,digital_score,business_structure,is_agricultural,is_tech_related,is_manufacturing
0,NG_0001,Lagos-Adunni Fashion House/Union,Lagos,Retail,2023-02-01,417741,406718,8,False,True,11023,2.64,52217.62,1.74,Small,1,Union,False,False,False
1,NG_0002,Kano-Groundnut/Cooperative Ltd,Kano,Manufacturing,2023-11-17,696804,256593,10,True,False,440211,63.18,69680.4,0.95,Small,1,Cooperative Ltd,True,False,False
2,NG_0003,Rivers-Palm Oil Processing/Association,Rivers,Technology,2020-09-02,520713,87719,11,False,True,432994,83.15,47337.55,4.16,Small,1,Association,True,False,True
3,NG_0004,Kaduna-Textile Manufacturing/Services Ltd,Kaduna,Manufacturing,2023-04-14,553527,389310,7,False,True,164217,29.67,79075.29,1.54,Small,1,Services Ltd,False,False,True
4,NG_0005,Oyo-Agro Pastoral/Services Limited,Oyo,Technology,2022-11-28,231537,-11869,10,True,False,243406,105.13,23153.7,1.92,Small,1,Services Limited,True,False,False
5,NG_0006,Imo-Cassava Farmers/Union,Imo,Technology,2022-01-28,-172819,-138255,5,False,False,-34564,20.0,-34563.8,2.75,Micro,0,Union,True,False,False
6,NG_0007,Benue-Yam Gem/Trading Company,Benue,Services,2021-06-21,724708,207436,12,False,True,517272,71.38,60392.33,3.36,Small,1,Trading Company,True,False,False
7,NG_0008,Anambra-Tech Solutions/Association,Anambra,Technology,2022-07-09,928641,111173,11,False,False,817468,88.03,84421.91,2.31,Small,0,Association,False,True,False
8,NG_0009,Lagos-Digital Marketing Hub/Society,Lagos,Manufacturing,2021-02-11,431482,345185,2,True,False,86297,20.0,215741.0,3.71,Micro,1,Society,False,True,False
9,NG_0010,Kano-Leather Works/Services Ltd,Kano,Technology,2021-06-08,549862,293803,3,True,False,256059,46.57,183287.33,3.39,Micro,1,Services Ltd,False,False,True
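
When the keyword lists grow, it may be easier to keep them in a dictionary and build the patterns programmatically. The sketch below reuses the keyword lists from the cell above. The `keyword_groups` dictionary and the `flags` frame are hypothetical names, and `re.escape` is added as a precaution in case a keyword ever contains a regex metacharacter.

```python
import re
import pandas as pd

# Four names taken from the list assigned earlier.
names = pd.Series([
    "Lagos-Adunni Fashion House/Union",
    "Kano-Groundnut/Cooperative Ltd",
    "Anambra-Tech Solutions/Association",
    "Kano-Leather Works/Services Ltd",
])

# Same keywords as in the notebook, grouped by the flag they feed.
keyword_groups = {
    "is_agricultural": ["Agro", "Farm", "Cassava", "Groundnut", "Yam", "Palm", "Onion", "Livestock"],
    "is_tech_related": ["Tech", "Digital", "Solutions", "Hub"],
    "is_manufacturing": ["Manufacturing", "Processing", "Works", "Equipment"],
}

# Build one case-insensitive pattern per group and test every name against it.
flags = pd.DataFrame({
    column: names.str.contains("|".join(map(re.escape, keywords)), case=False, na=False)
    for column, keywords in keyword_groups.items()
})

print(flags)
```

Note that these are substring matches, so `Tech` also matches `Technology` and `Farm` also matches `Farmers`.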


In [61]:
print("\nSample of text-derived features:")
df_nigeria_sample[['business_name', 'state', 'is_agricultural', 
           'is_tech_related', 'business_structure']].head(10)


Sample of text-derived features:


Unnamed: 0,business_name,state,is_agricultural,is_tech_related,business_structure
0,Lagos-Adunni Fashion House/Union,Lagos,False,False,Union
1,Kano-Groundnut/Cooperative Ltd,Kano,True,False,Cooperative Ltd
2,Rivers-Palm Oil Processing/Association,Rivers,True,False,Association
3,Kaduna-Textile Manufacturing/Services Ltd,Kaduna,False,False,Services Ltd
4,Oyo-Agro Pastoral/Services Limited,Oyo,True,False,Services Limited
5,Imo-Cassava Farmers/Union,Imo,True,False,Union
6,Benue-Yam Gem/Trading Company,Benue,True,False,Trading Company
7,Anambra-Tech Solutions/Association,Anambra,False,True,Association
8,Lagos-Digital Marketing Hub/Society,Lagos,False,True,Society
9,Kano-Leather Works/Services Ltd,Kano,False,False,Services Ltd


## 3. Categorical Data Handling

Let's work with categorical variables and create meaningful groupings:

In [62]:
# 1. Create regional groupings
def assign_region(state):
    north = ['Kano', 'Kaduna', 'Benue']
    south_west = ['Lagos', 'Oyo']
    south_east = ['Imo', 'Anambra']
    south_south = ['Rivers']
    
    if state in north:
        return 'North'
    elif state in south_west:
        return 'South-West'
    elif state in south_east:
        return 'South-East'
    elif state in south_south:
        return 'South-South'
    else:
        return 'Other'

df_nigeria_sample['region'] = df_nigeria_sample['state'].apply(assign_region)
df_nigeria_sample['region'] 

0      South-West
1           North
2     South-South
3           North
4      South-West
5      South-East
6           North
7      South-East
8      South-West
9           North
10          Other
11          Other
12          Other
13          Other
14          Other
15          Other
16          Other
17          Other
18          Other
19          Other
Name: region, dtype: object
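
Half of the sample lands in `Other` because several prefixes are cities (for example Port Harcourt or Ibadan) rather than states. One possible refinement, sketched below, is to translate cities to their states first and then look up the region with a dictionary. The `city_to_state` lookup is an addition that is not part of the original notebook, while `state_to_region` repeats the groupings from `assign_region`.

```python
import pandas as pd

# Assumed lookup from city prefixes to their states (not in the original notebook).
city_to_state = {
    "Port Harcourt": "Rivers",
    "Ibadan": "Oyo",
    "Calabar": "Cross River",
    "Maiduguri": "Borno",
    "Benin": "Edo",
    "Warri": "Delta",
    "Abuja": "FCT",
}

# Same state groupings as assign_region, written as a dictionary.
state_to_region = {
    "Kano": "North", "Kaduna": "North", "Benue": "North",
    "Lagos": "South-West", "Oyo": "South-West",
    "Imo": "South-East", "Anambra": "South-East",
    "Rivers": "South-South",
}

locations = pd.Series(["Lagos", "Port Harcourt", "Ibadan", "Enugu"])

# Replace known cities with their state, then map states to regions.
states = locations.replace(city_to_state)
regions = states.map(state_to_region).fillna("Other")

print(pd.DataFrame({"location": locations, "state": states, "region": regions}))
```

States that are missing from `state_to_region`, such as Enugu here, still fall back to `Other` until the dictionary is extended.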

In [65]:
pd.Series.quantile?

Signature:
pd.Series.quantile(
    self,
    q: 'float | Sequence[float] | AnyArrayLike' = 0.5,
    interpolation: 'QuantileInterpolation' = 'linear',
) -> 'float | Series'
Docstring:
Return value at the given quantile.

Parameters
----------
q : float or array-like, default 0.5 (50% quantile)
    The quantile(s) to compute, which can lie in range: 0 <= q <= 1.
interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
    This optional parameter specifies the interpolation method to use,
    when the desired quantile lies between two data points `i` and `j`:

        * linear: `i + (j - i) * (x-i)/(j-i)`, where `(x-i)/(j-i)` is
          the fractional part of the index surrounded by `i > j`.
        * lower: `i`.
        * higher: `j`.
        * nearest: `i` or `j` whichever is nearest.
        * midpoint: (`i` + `j`) / 2.

Returns
-------
float or Series
    If ``q`` is an array, a Series will be returne

In [63]:

# 2. Create performance categories
profit_quartiles = df_nigeria_sample['profit_margin_percent'].quantile([0.25, 0.5, 0.75])
profit_quartiles

0.25    20.0000
0.50    37.6400
0.75    57.7425
Name: profit_margin_percent, dtype: float64

In [66]:

def categorize_performance(profit_margin):
    if profit_margin < profit_quartiles[0.25]:
        return 'Low Performer'
    elif profit_margin < profit_quartiles[0.5]:
        return 'Average Performer'
    elif profit_margin < profit_quartiles[0.75]:
        return 'Good Performer'
    else:
        return 'Top Performer'

df_nigeria_sample['performance_category'] = df_nigeria_sample['profit_margin_percent'].apply(categorize_performance)
df_nigeria_sample

Unnamed: 0,business_id,business_name,state,business_type,start_date,monthly_revenue_naira,monthly_costs_naira,employees,has_website,accepts_mobile_money,monthly_profit_naira,profit_margin_percent,revenue_per_employee,business_age_years,business_size,digital_score,business_structure,is_agricultural,is_tech_related,is_manufacturing,region,performance_category
0,NG_0001,Lagos-Adunni Fashion House/Union,Lagos,Retail,2023-02-01,417741,406718,8,False,True,11023,2.64,52217.62,1.74,Small,1,Union,False,False,False,South-West,Low Performer
1,NG_0002,Kano-Groundnut/Cooperative Ltd,Kano,Manufacturing,2023-11-17,696804,256593,10,True,False,440211,63.18,69680.4,0.95,Small,1,Cooperative Ltd,True,False,False,North,Top Performer
2,NG_0003,Rivers-Palm Oil Processing/Association,Rivers,Technology,2020-09-02,520713,87719,11,False,True,432994,83.15,47337.55,4.16,Small,1,Association,True,False,True,South-South,Top Performer
3,NG_0004,Kaduna-Textile Manufacturing/Services Ltd,Kaduna,Manufacturing,2023-04-14,553527,389310,7,False,True,164217,29.67,79075.29,1.54,Small,1,Services Ltd,False,False,True,North,Average Performer
4,NG_0005,Oyo-Agro Pastoral/Services Limited,Oyo,Technology,2022-11-28,231537,-11869,10,True,False,243406,105.13,23153.7,1.92,Small,1,Services Limited,True,False,False,South-West,Top Performer
5,NG_0006,Imo-Cassava Farmers/Union,Imo,Technology,2022-01-28,-172819,-138255,5,False,False,-34564,20.0,-34563.8,2.75,Micro,0,Union,True,False,False,South-East,Average Performer
6,NG_0007,Benue-Yam Gem/Trading Company,Benue,Services,2021-06-21,724708,207436,12,False,True,517272,71.38,60392.33,3.36,Small,1,Trading Company,True,False,False,North,Top Performer
7,NG_0008,Anambra-Tech Solutions/Association,Anambra,Technology,2022-07-09,928641,111173,11,False,False,817468,88.03,84421.91,2.31,Small,0,Association,False,True,False,South-East,Top Performer
8,NG_0009,Lagos-Digital Marketing Hub/Society,Lagos,Manufacturing,2021-02-11,431482,345185,2,True,False,86297,20.0,215741.0,3.71,Micro,1,Society,False,True,False,South-West,Average Performer
9,NG_0010,Kano-Leather Works/Services Ltd,Kano,Technology,2021-06-08,549862,293803,3,True,False,256059,46.57,183287.33,3.39,Micro,1,Services Ltd,False,False,True,North,Good Performer
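
The quartile-based labels can also be produced in a vectorized way. The sketch below uses `pd.cut` with `right=False`, so each bin includes its lower edge and excludes its upper edge, which should match the `<` comparisons in `categorize_performance`. The margins are the first ten values shown above and the quartile values are copied from the earlier output.

```python
import numpy as np
import pandas as pd

# Profit margins for the first ten businesses shown above.
margins = pd.Series([2.64, 63.18, 83.15, 29.67, 105.13, 20.0, 71.38, 88.03, 20.0, 46.57])

# Quartiles copied from the notebook output for the 20-row sample.
q25, q50, q75 = 20.0, 37.64, 57.7425

# right=False makes the intervals [low, high), mirroring the `<` tests.
performance = pd.cut(
    margins,
    bins=[-np.inf, q25, q50, q75, np.inf],
    right=False,
    labels=["Low Performer", "Average Performer", "Good Performer", "Top Performer"],
)

print(performance.tolist())
```
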


In [69]:

# 3. One-hot encoding for categorical variables
# Create dummy variables for business type
business_type_dummies = pd.get_dummies(df_nigeria_sample['business_type'], prefix='type')
business_type_dummies

Unnamed: 0,type_Agriculture,type_Manufacturing,type_Retail,type_Services,type_Technology
0,False,False,True,False,False
1,False,True,False,False,False
2,False,False,False,False,True
3,False,True,False,False,False
4,False,False,False,False,True
5,False,False,False,False,True
6,False,False,False,True,False
7,False,False,False,False,True
8,False,True,False,False,False
9,False,False,False,False,True


In [71]:
region_dummies = pd.get_dummies(df_nigeria_sample['region'], prefix='region')
region_dummies

Unnamed: 0,region_North,region_Other,region_South-East,region_South-South,region_South-West
0,False,False,False,False,True
1,True,False,False,False,False
2,False,False,False,True,False
3,True,False,False,False,False
4,False,False,False,False,True
5,False,False,True,False,False
6,True,False,False,False,False
7,False,False,True,False,False
8,False,False,False,False,True
9,True,False,False,False,False


In [72]:

# Add dummy variables to main dataframe
df_nigeria_sample = pd.concat([df_nigeria_sample, business_type_dummies, region_dummies], axis=1)
df_nigeria_sample

Unnamed: 0,business_id,business_name,state,business_type,start_date,monthly_revenue_naira,monthly_costs_naira,employees,has_website,accepts_mobile_money,monthly_profit_naira,profit_margin_percent,revenue_per_employee,business_age_years,business_size,digital_score,business_structure,is_agricultural,is_tech_related,is_manufacturing,region,performance_category,type_Agriculture,type_Manufacturing,type_Retail,type_Services,type_Technology,region_North,region_Other,region_South-East,region_South-South,region_South-West
0,NG_0001,Lagos-Adunni Fashion House/Union,Lagos,Retail,2023-02-01,417741,406718,8,False,True,11023,2.64,52217.62,1.74,Small,1,Union,False,False,False,South-West,Low Performer,False,False,True,False,False,False,False,False,False,True
1,NG_0002,Kano-Groundnut/Cooperative Ltd,Kano,Manufacturing,2023-11-17,696804,256593,10,True,False,440211,63.18,69680.4,0.95,Small,1,Cooperative Ltd,True,False,False,North,Top Performer,False,True,False,False,False,True,False,False,False,False
2,NG_0003,Rivers-Palm Oil Processing/Association,Rivers,Technology,2020-09-02,520713,87719,11,False,True,432994,83.15,47337.55,4.16,Small,1,Association,True,False,True,South-South,Top Performer,False,False,False,False,True,False,False,False,True,False
3,NG_0004,Kaduna-Textile Manufacturing/Services Ltd,Kaduna,Manufacturing,2023-04-14,553527,389310,7,False,True,164217,29.67,79075.29,1.54,Small,1,Services Ltd,False,False,True,North,Average Performer,False,True,False,False,False,True,False,False,False,False
4,NG_0005,Oyo-Agro Pastoral/Services Limited,Oyo,Technology,2022-11-28,231537,-11869,10,True,False,243406,105.13,23153.7,1.92,Small,1,Services Limited,True,False,False,South-West,Top Performer,False,False,False,False,True,False,False,False,False,True
5,NG_0006,Imo-Cassava Farmers/Union,Imo,Technology,2022-01-28,-172819,-138255,5,False,False,-34564,20.0,-34563.8,2.75,Micro,0,Union,True,False,False,South-East,Average Performer,False,False,False,False,True,False,False,True,False,False
6,NG_0007,Benue-Yam Gem/Trading Company,Benue,Services,2021-06-21,724708,207436,12,False,True,517272,71.38,60392.33,3.36,Small,1,Trading Company,True,False,False,North,Top Performer,False,False,False,True,False,True,False,False,False,False
7,NG_0008,Anambra-Tech Solutions/Association,Anambra,Technology,2022-07-09,928641,111173,11,False,False,817468,88.03,84421.91,2.31,Small,0,Association,False,True,False,South-East,Top Performer,False,False,False,False,True,False,False,True,False,False
8,NG_0009,Lagos-Digital Marketing Hub/Society,Lagos,Manufacturing,2021-02-11,431482,345185,2,True,False,86297,20.0,215741.0,3.71,Micro,1,Society,False,True,False,South-West,Average Performer,False,True,False,False,False,False,False,False,False,True
9,NG_0010,Kano-Leather Works/Services Ltd,Kano,Technology,2021-06-08,549862,293803,3,True,False,256059,46.57,183287.33,3.39,Micro,1,Services Ltd,False,False,True,North,Good Performer,False,False,False,False,True,True,False,False,False,False
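
As an alternative to building the dummy frames separately and concatenating them, `pd.get_dummies` can encode several columns of a DataFrame in one call. The sketch below uses a hypothetical three-row frame, and `dtype=int` is an optional choice that yields 0/1 columns instead of booleans.

```python
import pandas as pd

# Hypothetical three-row frame with the two categorical columns used above.
toy = pd.DataFrame({
    "business_type": ["Retail", "Manufacturing", "Technology"],
    "region": ["South-West", "North", "South-South"],
    "employees": [8, 10, 11],
})

# Encode both columns at once; the original categorical columns are replaced.
encoded = pd.get_dummies(
    toy,
    columns=["business_type", "region"],
    prefix=["type", "region"],
    dtype=int,
)

print(encoded.columns.tolist())
```

Because the source columns are replaced rather than appended, re-running the cell does not create duplicate dummy columns, which can happen when a `pd.concat` cell is executed more than once.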


In [73]:

print("Categorical data processing completed!")
print("\nRegion distribution:")
print(df_nigeria_sample['region'].value_counts())
print("\nPerformance category distribution:")
print(df_nigeria_sample['performance_category'].value_counts())
print(f"\nNew dummy columns created: {list(business_type_dummies.columns) + list(region_dummies.columns)}")

Categorical data processing completed!

Region distribution:
region
Other          10
North           4
South-West      3
South-East      2
South-South     1
Name: count, dtype: int64

Performance category distribution:
performance_category
Average Performer    8
Top Performer        5
Good Performer       5
Low Performer        2
Name: count, dtype: int64

New dummy columns created: ['type_Agriculture', 'type_Manufacturing', 'type_Retail', 'type_Services', 'type_Technology', 'region_North', 'region_Other', 'region_South-East', 'region_South-South', 'region_South-West']
