## **A Data-Driven Market Intelligence System for Stock Risk and Behavior Analysis at the Nairobi Securities Exchange**

## Problem Statement

The Nairobi Securities Exchange (NSE) has seen a surge in equity turnover, which rose 18% to Ksh 56 billion in 2025 ([NSE Half-Year Results, 2025](https://www.nse.co.ke/wp-content/uploads/NSE-Plc-Unaudited-Group-results-for-the-6-months-ended-30-June-2025.pdf)), yet retail participation remains hampered by a critical information gap. The Institute of Certified Investment and Financial Analysts reports that 77% of Kenyan retail investors rely on "personal research" and social intuition because they lack accessible analytical tools ([ICIFA Annual Report, 2024](https://icifa.co.ke/static/resources/others/annual-report-2024465e3dbed42d.pdf)).

While the market has added over Ksh 1 trillion in capitalization since 2023, most investors exhibit "herding behavior," in which decisions follow the crowd rather than technical data ([USIU-Africa Research, 2025](https://erepo.usiu.ac.ke/xmlui/bitstream/handle/11732/8460/MASILA%20BRIAN%20SALU%20MBA%202024.pdf?sequence=1&isAllowed=y)). This project aims to bridge that gap by converting raw daily prices into behavioral risk clusters, helping investors move from intuition to evidence-based decision-making.

The dataset used for this analysis is publicly available from Mendeley: [Nairobi Securities Exchange (NSE) Kenya](https://data.mendeley.com/research-data/?query=Nairobi%20Securities%20Exchange%20(NSE)%20Kenya%20-%20All%20Stocks%20Prices)

## Objectives

**Main Objective**
- How can a data-driven stock market intelligence system be developed for the Nairobi Securities Exchange to support informed and risk-aware investment decisions?

**Specific Objectives**
1) `Feature Engineering:` How can financial metrics such as Rolling Volatility, Daily Returns, and Maximum Drawdowns be derived from historical stock price data to effectively quantify stock behavior?

2) `Behavioral Segmentation:` How can unsupervised machine learning techniques (e.g., K-Means and DBSCAN) be applied to group NSE-listed stocks into risk-based clusters such as stable, high-volatility, and speculative stocks?

3) `Sector Risk Analysis:` What systemic risks and stability patterns can be identified across different market sectors within the Nairobi Securities Exchange?

4) `Interactive Deployment:` How can the derived insights be interactively deployed through a Streamlit dashboard to enable users to select stocks, assess their risk profiles, and compare them against sector-level benchmarks?

**Solution**

This project aims to provide an interactive dashboard where investors can explore NSE stocks, assess risk clusters, and compare sector performance. Users can make data-driven decisions, spot stable or high-risk stocks, and identify where to invest confidently.

## Libraries & Importations

In [124]:
import pandas as pd

In [125]:
df_2021 = pd.read_csv("../Data/NSE_data_all_stocks_2021_upto_31dec2021.csv")
df_2022 = pd.read_csv("../Data/NSE_data_all_stocks_2022.csv")
df_2023 = pd.read_csv("../Data/NSE_data_all_stocks_2023.csv")
df_2024 = pd.read_csv("../Data/NSE_data_all_stocks_2024.csv")


from IPython.display import display

print("2021 Stocks")
display(df_2021.head())
print("2022 Stocks")
display(df_2022.head())
print("2023 Stocks")
display(df_2023.head())
print("2024 Stocks")
display(df_2024.head())



2021 Stocks


Unnamed: 0,DATE,CODE,NAME,12m Low,12m High,Day Low,Day High,Day Price,Previous,Change,Change%,Volume,Adjust
0,04-Jan-21,EGAD,Eaagads Ltd,8.2,14,12.5,12.5,12.5,12.5,-,-,3200,-
1,04-Jan-21,KUKZ,Kakuzi Plc,300.0,397,365.0,365.0,365.0,365.0,-,-,-,-
2,04-Jan-21,KAPC,Kapchorua Tea Kenya Plc,59.0,90,78.0,78.0,78.0,78.0,-,-,-,-
3,04-Jan-21,LIMT,Limuru Tea Plc,360.0,475,360.0,360.0,360.0,360.0,-,-,100,-
4,04-Jan-21,SASN,Sasini Plc,14.8,20,19.5,19.5,19.5,19.5,-,-,-,-


2022 Stocks


Unnamed: 0,Date,Code,Name,12m Low,12m High,Day Low,Day High,Day Price,Previous,Change,Change%,Volume,Adjusted Price
0,3-Jan-22,EGAD,Eaagads Ltd,10.0,15.0,13.5,13.8,13.5,13.5,-,-,4000,-
1,3-Jan-22,KUKZ,Kakuzi Plc,355.0,427.0,385.0,385.0,385.0,385.0,-,-,-,-
2,3-Jan-22,KAPC,Kapchorua Tea Kenya Plc,80.0,101.0,99.5,99.5,99.5,95.5,4,4.19%,100,-
3,3-Jan-22,LIMT,Limuru Tea Plc,260.0,360.0,320.0,320.0,320.0,320.0,-,-,-,-
4,3-Jan-22,SASN,Sasini Plc,16.75,22.6,18.7,18.7,18.7,18.7,-,-,-,-


2023 Stocks


Unnamed: 0,Date,Code,Name,12m Low,12m High,Day Low,Day High,Day Price,Previous,Change,Change%,Volume,Adjusted Price
0,3-Jan-23,EGAD,Eaagads Ltd,10.35,14.5,10.5,10.5,10.5,10.5,-,-,1900.00,-
1,3-Jan-23,KUKZ,Kakuzi Plc,342.0,440.0,385.0,385.0,385.0,385.0,-,-,-,-
2,3-Jan-23,KAPC,Kapchorua Tea Kenya Plc,207.0,280.0,115.75,115.75,115.75,113.25,2.5,2.21%,100,-
3,3-Jan-23,LIMT,Limuru Tea Plc,365.0,380.0,420.0,420.0,420.0,420.0,-,-,-,-
4,3-Jan-23,SASN,Sasini Plc,15.1,22.0,22.0,22.5,22.45,22.45,-,-,6900.00,-


2024 Stocks


Unnamed: 0,Date,Code,Name,12m Low,12m High,Day Low,Day High,Day Price,Previous,Change,Change%,Volume,Adjusted Price
0,2-Jan-24,EGAD,Eaagads Ltd,10.35,14.5,12.8,12.8,12.8,13.95,-1.15,-8.24%,100,-
1,2-Jan-24,KUKZ,Kakuzi Plc,342.0,440.0,385.0,385.0,385.0,385.0,-,-,-,-
2,2-Jan-24,KAPC,Kapchorua Tea Kenya Plc,207.0,280.0,215.0,215.0,215.0,215.0,-,-,-,-
3,2-Jan-24,LIMT,Limuru Tea Plc,365.0,380.0,380.0,380.0,380.0,380.0,-,-,-,-
4,2-Jan-24,SASN,Sasini Plc,15.1,22.0,20.0,20.0,20.0,20.0,-,-,3300.00,-


In [136]:
df_2021.rename(columns={
    'DATE': 'Date',
    'CODE': 'Code',
    'NAME': 'Name',
    'Adjust': 'Adjusted Price'
}, inplace=True)

In [137]:
df_2021.head()

Unnamed: 0,Date,Code,Name,12m Low,12m High,Day Low,Day High,Day Price,Previous,Change,Change%,Volume,Adjusted Price
0,04-Jan-21,EGAD,Eaagads Ltd,8.2,14,12.5,12.5,12.5,12.5,-,-,3200,-
1,04-Jan-21,KUKZ,Kakuzi Plc,300.0,397,365.0,365.0,365.0,365.0,-,-,-,-
2,04-Jan-21,KAPC,Kapchorua Tea Kenya Plc,59.0,90,78.0,78.0,78.0,78.0,-,-,-,-
3,04-Jan-21,LIMT,Limuru Tea Plc,360.0,475,360.0,360.0,360.0,360.0,-,-,100,-
4,04-Jan-21,SASN,Sasini Plc,14.8,20,19.5,19.5,19.5,19.5,-,-,-,-
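The 2021 file is the only one with upper-case headers and an `Adjust` column, so the rename above is what makes the four yearly frames stackable. As a minimal sketch, the same idea can be wrapped in a helper and applied to every frame, followed by a check that the headers agree. The names `COLUMN_ALIASES` and `harmonize_columns`, and the tiny sample frames, are illustrative assumptions and not part of the project code.

```python
import pandas as pd

# Hypothetical mapping from the 2021 headers to the 2022-2024 headers.
COLUMN_ALIASES = {
    "DATE": "Date",
    "CODE": "Code",
    "NAME": "Name",
    "Adjust": "Adjusted Price",
}


def harmonize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with known alias headers renamed; other columns are untouched."""
    return frame.rename(columns=COLUMN_ALIASES)


# Tiny stand-in frames (not the real files) that mimic the two header styles.
sample_2021 = pd.DataFrame(
    {"DATE": ["04-Jan-21"], "CODE": ["EGAD"], "NAME": ["Eaagads Ltd"], "Adjust": ["-"]}
)
sample_2022 = pd.DataFrame(
    {"Date": ["3-Jan-22"], "Code": ["EGAD"], "Name": ["Eaagads Ltd"], "Adjusted Price": ["-"]}
)

frames = [harmonize_columns(f) for f in (sample_2021, sample_2022)]

# True only if every frame ends up with the same headers in the same order.
headers_match = all(list(f.columns) == list(frames[0].columns) for f in frames)
print(headers_match)
```

Checking the headers before concatenation helps catch a stray column name early, since `pd.concat` would otherwise silently create an extra, mostly empty column.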


## Load Sector Data

In [138]:
import pandas as pd

sectors_2021 = pd.read_csv("../Data/NSE_data_stock_market_sectors_as_at_31dec2021.csv")
sectors_2022 = pd.read_csv("../Data/NSE_data_stock_market_sectors_2022.csv")
sectors_2023_2024 = pd.read_csv("../Data/NSE_data_stock_market_sectors_2023_2024.csv")


In [139]:
print("2021 Sector Data:")
display(sectors_2021.head())
print("\n2022 Sector Data:")
display(sectors_2022.head())
print("\n2023-2024 Sector Data:")
display(sectors_2023_2024.head())

2021 Sector Data:


Unnamed: 0,SECTOR,CODE,NAME
0,Agricultural,EGAD,Eaagads Ltd
1,Agricultural,KUKZ,Kakuzi Plc
2,Agricultural,KAPC,Kapchorua Tea Kenya Plc
3,Agricultural,LIMT,Limuru Tea Plc
4,Agricultural,SASN,Sasini Plc



2022 Sector Data:


Unnamed: 0,Sector,Stock_code,Stock_name
0,Agricultural,EGAD,Eaagads Ltd
1,Agricultural,KUKZ,Kakuzi Plc
2,Agricultural,KAPC,Kapchorua Tea Kenya Plc
3,Agricultural,LIMT,Limuru Tea Plc
4,Agricultural,SASN,Sasini Plc



2023-2024 Sector Data:


Unnamed: 0,Sector,Stock_code,Stock_name
0,Agricultural,EGAD,Eaagads Ltd
1,Agricultural,KUKZ,Kakuzi Plc
2,Agricultural,KAPC,Kapchorua Tea Kenya Plc
3,Agricultural,LIMT,Limuru Tea Plc
4,Agricultural,SASN,Sasini Plc


## Standardize Sector Column Names

In [140]:
# Standardize all sector columns
sectors_2021 = sectors_2021.rename(columns={
    'SECTOR': 'Sector',
    'CODE': 'Code',
    'NAME': 'Name'
})

# The 2022 and 2023-2024 files already use 'Sector'; only the code and name columns change
sectors_2022 = sectors_2022.rename(columns={
    'Stock_code': 'Code',
    'Stock_name': 'Name'
})

sectors_2023_2024 = sectors_2023_2024.rename(columns={
    'Stock_code': 'Code',
    'Stock_name': 'Name'
})

## Merge Each Year with Sector Information

In [141]:
# Merge stock data with sectors
df_2021_final = df_2021.merge(sectors_2021[['Code', 'Sector']], on='Code', how='left')
df_2022_final = df_2022.merge(sectors_2022[['Code', 'Sector']], on='Code', how='left')
df_2023_final = df_2023.merge(sectors_2023_2024[['Code', 'Sector']], on='Code', how='left')
df_2024_final = df_2024.merge(sectors_2023_2024[['Code', 'Sector']], on='Code', how='left')

## Concatenate ALL 4 Years

In [143]:
# Concatenate all years
all_stocks = pd.concat([df_2021_final, df_2022_final, df_2023_final, df_2024_final], ignore_index=True)

print(f"\n All 4 years concatenated!")
print(f"  Total rows: {len(all_stocks):,}")
print(f"  Shape: {all_stocks.shape}")
all_stocks.head()


 All 4 years concatenated!
  Total rows: 69,945
  Shape: (69945, 14)


Unnamed: 0,Date,Code,Name,12m Low,12m High,Day Low,Day High,Day Price,Previous,Change,Change%,Volume,Adjusted Price,Sector
0,04-Jan-21,EGAD,Eaagads Ltd,8.2,14,12.5,12.5,12.5,12.5,-,-,3200,-,Agricultural
1,04-Jan-21,KUKZ,Kakuzi Plc,300.0,397,365.0,365.0,365.0,365.0,-,-,-,-,Agricultural
2,04-Jan-21,KAPC,Kapchorua Tea Kenya Plc,59.0,90,78.0,78.0,78.0,78.0,-,-,-,-,Agricultural
3,04-Jan-21,LIMT,Limuru Tea Plc,360.0,475,360.0,360.0,360.0,360.0,-,-,100,-,Agricultural
4,04-Jan-21,SASN,Sasini Plc,14.8,20,19.5,19.5,19.5,19.5,-,-,-,-,Agricultural


In [144]:
# Checking new structure

print(f"\nMissing values:\n{all_stocks.isna().sum()}")
print(f"\nDuplicates: {all_stocks.duplicated().sum()}")


Missing values:
Date                0
Code                0
Name                0
12m Low             0
12m High            0
Day Low             0
Day High            0
Day Price           0
Previous            0
Change              0
Change%             0
Volume              0
Adjusted Price      0
Sector            191
dtype: int64

Duplicates: 0
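The 191 missing `Sector` values most likely come from price rows whose `Code` has no match in that year's sector table, since a left merge keeps such rows and leaves `Sector` empty (this assumes the sector files themselves have no blank `Sector` cells). A minimal sketch of how those codes could be listed, using made-up sample data rather than the real files:

```python
import pandas as pd

# Stand-in data: "XXXX" is a made-up code with no entry in the sector table.
prices = pd.DataFrame({
    "Code": ["EGAD", "KUKZ", "XXXX"],
    "Day Price": [12.5, 365.0, 1.0],
})
sectors = pd.DataFrame({
    "Code": ["EGAD", "KUKZ"],
    "Sector": ["Agricultural", "Agricultural"],
})

merged = prices.merge(
    sectors[["Code", "Sector"]],
    on="Code",
    how="left",
    validate="many_to_one",  # raises MergeError if a Code repeats in the sector table
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# Codes that appear in the price data but not in the sector table.
unmatched_codes = merged.loc[merged["_merge"] == "left_only", "Code"].unique().tolist()
print(unmatched_codes)
```

The `validate` argument also guards against a duplicated code in a sector file, which would otherwise multiply price rows during the merge.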


### Saving Merged Data

In [146]:
import os
os.makedirs('../data/processed', exist_ok=True)

# Save merged but NOT cleaned data
all_stocks.to_csv('../data/processed/nse_merged_raw.csv', index=False)

print(f"  Rows: {len(all_stocks):,}")
print(f"  Years: 2021-2024")

  Rows: 69,945
  Years: 2021-2024


## Notebook 02

# Data Cleaning

## Objective
Load the merged raw data (all 4 years) and clean it ONCE to create the final df_all dataset.

## Load Merged Raw Data from Notebook 01

In [147]:
# Load the merged raw data
all_stocks = pd.read_csv('../data/processed/nse_merged_raw.csv')

print(f"Loaded merged raw data: {all_stocks.shape}")
all_stocks.head()

Loaded merged raw data: (69945, 14)


Unnamed: 0,Date,Code,Name,12m Low,12m High,Day Low,Day High,Day Price,Previous,Change,Change%,Volume,Adjusted Price,Sector
0,04-Jan-21,EGAD,Eaagads Ltd,8.2,14,12.5,12.5,12.5,12.5,-,-,3200,-,Agricultural
1,04-Jan-21,KUKZ,Kakuzi Plc,300.0,397,365.0,365.0,365.0,365.0,-,-,-,-,Agricultural
2,04-Jan-21,KAPC,Kapchorua Tea Kenya Plc,59.0,90,78.0,78.0,78.0,78.0,-,-,-,-,Agricultural
3,04-Jan-21,LIMT,Limuru Tea Plc,360.0,475,360.0,360.0,360.0,360.0,-,-,100,-,Agricultural
4,04-Jan-21,SASN,Sasini Plc,14.8,20,19.5,19.5,19.5,19.5,-,-,-,-,Agricultural


In [148]:
all_stocks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69945 entries, 0 to 69944
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Date            69945 non-null  object
 1   Code            69945 non-null  object
 2   Name            69945 non-null  object
 3   12m Low         69945 non-null  object
 4   12m High        69945 non-null  object
 5   Day Low         69945 non-null  object
 6   Day High        69945 non-null  object
 7   Day Price       69945 non-null  object
 8   Previous        69945 non-null  object
 9   Change          69945 non-null  object
 10  Change%         69945 non-null  object
 11  Volume          69945 non-null  object
 12  Adjusted Price  69945 non-null  object
 13  Sector          69754 non-null  object
dtypes: object(14)
memory usage: 7.5+ MB


## Clean Numeric Columns

Convert all price, volume, and percentage columns to numeric.

In [149]:
# Columns to clean
cols = [
    "12m Low", "12m High", "Day Low", "Day High",
    "Day Price", "Previous", "Change", "Change%",
    "Volume", "Adjusted Price"
]

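# Remove % sign from Change%
all_stocks["Change%"] = all_stocks["Change%"].astype(str).str.replace("%", "", regex=False)

# Remove thousands separators, then replace cells that contain only '-'
# (the source's placeholder for "no value") with 0 and convert to numeric.
# Negative numbers such as '-1.15' are not affected by the '-' replacement.
all_stocks[cols] = (
    all_stocks[cols]
        .replace(",", "", regex=True)
        .replace("-", 0)
        .apply(pd.to_numeric, errors="coerce")
)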

In [150]:
all_stocks.head()

Unnamed: 0,Date,Code,Name,12m Low,12m High,Day Low,Day High,Day Price,Previous,Change,Change%,Volume,Adjusted Price,Sector
0,04-Jan-21,EGAD,Eaagads Ltd,8.2,14.0,12.5,12.5,12.5,12.5,0.0,0.0,3200.0,0.0,Agricultural
1,04-Jan-21,KUKZ,Kakuzi Plc,300.0,397.0,365.0,365.0,365.0,365.0,0.0,0.0,0.0,0.0,Agricultural
2,04-Jan-21,KAPC,Kapchorua Tea Kenya Plc,59.0,90.0,78.0,78.0,78.0,78.0,0.0,0.0,0.0,0.0,Agricultural
3,04-Jan-21,LIMT,Limuru Tea Plc,360.0,475.0,360.0,360.0,360.0,360.0,0.0,0.0,100.0,0.0,Agricultural
4,04-Jan-21,SASN,Sasini Plc,14.8,20.0,19.5,19.5,19.5,19.5,0.0,0.0,0.0,0.0,Agricultural
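Replacing every `-` with 0 means that "no trade today" and "no value recorded" look the same afterwards, which is why the later missing-value check reports 0 for every numeric column. One alternative, shown here only as a sketch on made-up rows, is to keep the placeholder as `NaN` and fill with 0 only where 0 is a meaningful value (for example `Volume`). The helper name `to_numeric_keep_missing` is hypothetical.

```python
import pandas as pd

# Made-up rows that mimic the raw string formats seen in the source files.
raw = pd.DataFrame({
    "Day Price": ["12.5", "1,365.00", "78.0"],
    "Change%": ["-", "4.19%", "-8.24%"],
    "Volume": ["3,200", "-", "100"],
    "Adjusted Price": ["-", "-", "-"],
})


def to_numeric_keep_missing(column: pd.Series) -> pd.Series:
    """Strip ',' and '%', turn a bare '-' into NaN, then convert to float."""
    text = column.astype(str).str.strip()
    text = text.str.replace(",", "", regex=False).str.replace("%", "", regex=False)
    text = text.mask(text.eq("-"))  # only cells equal to "-"; "-8.24" is kept
    return pd.to_numeric(text, errors="coerce")


cleaned = raw.apply(to_numeric_keep_missing)

# A bare "-" in Volume is assumed to mean no shares traded, so 0 is a fair fill there.
cleaned["Volume"] = cleaned["Volume"].fillna(0)
print(cleaned)
```

With this variant, `Adjusted Price` stays entirely missing instead of becoming a column of zeros, so it cannot be mistaken for a real price series in later steps.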


## Converting Date Column

In [151]:
all_stocks['Date'] = pd.to_datetime(all_stocks['Date'], errors='coerce')

print(f"Date range: {all_stocks['Date'].min()} to {all_stocks['Date'].max()}")

Date range: 2021-01-04 00:00:00 to 2024-12-31 00:00:00
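The `Date` strings shown in the earlier previews follow a day-month-year pattern such as `04-Jan-21` and `3-Jan-22`. As a hedged sketch, passing that pattern explicitly makes the parsing rule visible and lets unparseable rows be counted. It assumes every file uses this same pattern, which has not been verified against the full dataset.

```python
import pandas as pd

# Sample strings in the two styles seen in the previews, plus one bad value.
dates = pd.Series(["04-Jan-21", "3-Jan-22", "2-Jan-24", "not a date"])

# %d = day of month, %b = abbreviated month name, %y = two-digit year.
parsed = pd.to_datetime(dates, format="%d-%b-%y", errors="coerce")

# With errors="coerce", values that do not fit the format become NaT.
n_failed = int(parsed.isna().sum())
print(parsed.tolist(), n_failed)
```

Counting `NaT` values right after parsing is a cheap way to confirm that `errors="coerce"` did not silently drop any dates.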




## Renaming Columns for Consistency

In [153]:
all_stocks = all_stocks.rename(columns={
    'Code': 'Stock_code',
    'Change%': '%Change'
})

print("Columns:", list(all_stocks.columns))

Columns: ['Date', 'Stock_code', 'Name', '12m Low', '12m High', 'Day Low', 'Day High', 'Day Price', 'Previous', 'Change', '%Change', 'Volume', 'Adjusted Price', 'Sector']


## Create df_all - Final Cleaned Dataset

In [154]:
# Create df_all
df_all = all_stocks.copy()

# Sort by Stock_code and Date
df_all = df_all.sort_values(['Stock_code', 'Date']).reset_index(drop=True)

df_all.head()

Unnamed: 0,Date,Stock_code,Name,12m Low,12m High,Day Low,Day High,Day Price,Previous,Change,%Change,Volume,Adjusted Price,Sector
0,2021-01-04,ABSA,ABSA Bank Kenya Plc,8.5,14.2,9.42,9.8,9.52,9.66,-0.14,1.45,18500.0,0.0,Banking
1,2021-01-05,ABSA,ABSA Bank Kenya Plc,8.5,14.2,9.44,9.7,9.44,9.52,-0.08,0.84,1923300.0,0.0,Banking
2,2021-01-06,ABSA,ABSA Bank Kenya Plc,8.5,14.2,9.4,9.68,9.44,9.44,0.0,0.0,233400.0,0.0,Banking
3,2021-01-07,ABSA,ABSA Bank Kenya Plc,8.5,14.2,9.36,9.46,9.4,9.44,-0.04,0.42,194700.0,0.0,Banking
4,2021-01-11,ABSA,ABSA Bank Kenya Plc,8.5,14.2,9.44,9.7,9.46,9.48,-0.02,0.21,77900.0,0.0,Banking


## Data Quality Checks

In [155]:
print("Missing values:")
print(df_all.isna().sum())

print(f"\nDuplicates: {df_all.duplicated().sum()}")
print(f"Date range: {df_all['Date'].min()} to {df_all['Date'].max()}")
print(f"Unique stocks: {df_all['Stock_code'].nunique()}")
print(f"Unique sectors: {df_all['Sector'].nunique()}")

Missing values:
Date                0
Stock_code          0
Name                0
12m Low             0
12m High            0
Day Low             0
Day High            0
Day Price           0
Previous            0
Change              0
%Change             0
Volume              0
Adjusted Price      0
Sector            191
dtype: int64

Duplicates: 0
Date range: 2021-01-04 00:00:00 to 2024-12-31 00:00:00
Unique stocks: 77
Unique sectors: 15
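`duplicated()` with no arguments only flags rows that match in every column. For time-series features, a stricter check is whether any stock has more than one row for the same date. A small sketch on made-up data, assuming `Stock_code` and `Date` together should identify a row:

```python
import pandas as pd

# Made-up sample: ABSA has two rows for 2021-01-05 with different prices.
sample = pd.DataFrame({
    "Stock_code": ["ABSA", "ABSA", "ABSA", "EGAD"],
    "Date": pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-05", "2021-01-04"]),
    "Day Price": [9.52, 9.44, 9.46, 12.5],
})

# Full-row check, as used above: the two ABSA rows differ in price, so none are flagged.
full_row_duplicates = int(sample.duplicated().sum())

# Key-based check: flags every row that shares a (Stock_code, Date) pair.
key_duplicates = sample.duplicated(subset=["Stock_code", "Date"], keep=False)

# Number of distinct trading dates per stock, useful for spotting thinly covered stocks.
dates_per_stock = sample.groupby("Stock_code")["Date"].nunique()
print(full_row_duplicates, key_duplicates.tolist(), dates_per_stock.to_dict())
```

A key-level duplicate would distort daily returns and rolling statistics, so it is worth ruling out before feature engineering.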


In [156]:
df_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69945 entries, 0 to 69944
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Date            69945 non-null  datetime64[ns]
 1   Stock_code      69945 non-null  object        
 2   Name            69945 non-null  object        
 3   12m Low         69945 non-null  float64       
 4   12m High        69945 non-null  float64       
 5   Day Low         69945 non-null  float64       
 6   Day High        69945 non-null  float64       
 7   Day Price       69945 non-null  float64       
 8   Previous        69945 non-null  float64       
 9   Change          69945 non-null  float64       
 10  %Change         69945 non-null  float64       
 11  Volume          69945 non-null  float64       
 12  Adjusted Price  69945 non-null  float64       
 13  Sector          69754 non-null  object        
dtypes: datetime64[ns](1), float64(10), object(3)
memory us

## Save Cleaned Data

In [158]:
# Save cleaned data
df_all.to_csv('../data/processed/nse_all_clean.csv', index=False)

# Save the same frame under the file name that the feature engineering step loads
df_all.to_csv('../data/processed/nse_merged.csv', index=False)

print(f"  nse_all_clean.csv: {df_all.shape}")
print(f"  nse_merged.csv: {df_all.shape}")

  nse_all_clean.csv: (69945, 14)
  nse_merged.csv: (69945, 14)


## Notebook 03

# Feature Engineering

## Objective
Create financial indicators to capture:
- Price volatility
- Returns behavior
- Liquidity patterns
- Trend momentum
- Sector-relative metrics

In [159]:
# Load merged data
df = pd.read_csv('../data/processed/nse_merged.csv', parse_dates=['Date'])
print(f"Loaded {len(df):,} rows spanning {df['Stock_code'].nunique()} stocks")

Loaded 69,945 rows spanning 77 stocks


## 1. Generate Features

We'll use our custom feature engineering module, `features`.

In [161]:
from features import engineer_features

ModuleNotFoundError: No module named 'features'
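The import fails because no `features` module is found on the import path, and the module itself is not shown here. As a rough stand-in, the sketch below computes the three metrics named in the objectives (daily returns, rolling volatility, and drawdown) directly with pandas. The function name `sketch_features`, the 20-day default window, and the output column names are assumptions for illustration and are not the project's `engineer_features`.

```python
import pandas as pd


def sketch_features(df: pd.DataFrame, window: int = 20) -> pd.DataFrame:
    """Hypothetical stand-in for engineer_features; not the project's module."""
    out = df.sort_values(["Stock_code", "Date"]).copy()
    prices = out.groupby("Stock_code")["Day Price"]

    # Daily return: today's price relative to the previous row of the same stock.
    out["daily_return"] = out["Day Price"] / prices.shift(1) - 1.0

    # Rolling volatility: standard deviation of daily returns over `window` rows.
    out["rolling_volatility"] = (
        out.groupby("Stock_code")["daily_return"]
           .transform(lambda r: r.rolling(window, min_periods=2).std())
    )

    # Drawdown: percentage drop from the running peak price of the same stock.
    out["drawdown"] = out["Day Price"] / prices.cummax() - 1.0
    return out


# Made-up prices for one stock to show the calculation.
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07"]),
    "Stock_code": ["ABSA"] * 4,
    "Day Price": [10.0, 11.0, 8.8, 9.9],
})

features = sketch_features(demo, window=3)

# Maximum drawdown per stock is the most negative drawdown value.
max_drawdown = features.groupby("Stock_code")["drawdown"].min()
print(features[["Date", "daily_return", "rolling_volatility", "drawdown"]])
print(max_drawdown)
```

In the demo, the price peaks at 11.0 and falls to 8.8, so the maximum drawdown for the stock is about -20%. Grouping by `Stock_code` before every shift or rolling step keeps one stock's prices from leaking into another's features.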