# 📈 Inflation Predictor

A data science project focused on predicting inflation rates using historical data from the World Bank. This repository contains code and documentation for the full machine learning pipeline, from data acquisition and preprocessing to model deployment via Streamlit.

### 🔬 Key Processes:

1. **Data Collection** – Fetching datasets from the World Bank API or CSV files.
2. **Data Cleaning** – Handling missing values, formatting inconsistencies, and outliers.
3. **Exploratory Data Analysis** – Understanding distributions, trends, and correlations.
4. **Feature Engineering** – Generating relevant features such as lag variables, rolling averages, and economic ratios.
5. **Model Training** – Training and tuning the XGBoost model using scikit-learn pipelines.
6. **Evaluation** – Measuring performance using RMSE, MAE, and R².
7. **Visualization** – Presenting insights and predictions in a user-friendly format.
8. **Deployment** – Deploying the predictive model using Streamlit.
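
To make step 4 more concrete, here is a minimal sketch of lag and rolling-average features on long-format data. The toy values, the `toy` DataFrame, and the feature names `inflation_lag_1` and `inflation_roll_mean_3` are assumptions for illustration, not the features this project necessarily uses.

```python
import pandas as pd

# Toy long-format data; the values are made up for illustration only.
toy = pd.DataFrame({
    "Country Code": ["AAA"] * 5 + ["BBB"] * 5,
    "Year": list(range(2000, 2005)) * 2,
    "Inflation": [2.0, 3.0, 4.0, 5.0, 6.0, 10.0, 8.0, 6.0, 4.0, 2.0],
})

# Sort so that shifting and rolling follow chronological order per country.
toy = toy.sort_values(["Country Code", "Year"]).reset_index(drop=True)
by_country = toy.groupby("Country Code")["Inflation"]

# Lag feature: the previous year's inflation for the same country.
toy["inflation_lag_1"] = by_country.shift(1)

# Rolling average of the three previous years. The shift(1) keeps the
# current year's value out of its own feature.
toy["inflation_roll_mean_3"] = by_country.transform(
    lambda s: s.shift(1).rolling(window=3).mean()
)

print(toy)
```

Grouping by country before shifting prevents one country's last year from leaking into the next country's first row.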

Import the key libraries used in the data workflow.

In [3]:
import seaborn as sns 
import matplotlib.pyplot as plt 
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import pandas as pd 
import numpy as np 
import streamlit as st 
import xgboost as xgb

Load the data.
The data is a CSV file from the World Bank containing annual inflation rates from 1960 to 2023.

In [4]:
# Step 1: Load the data
# skiprows=4 skips the four lines that precede the header row in this file
my_data = r"/Users/d/Desktop/world_inflation_data/Sheet 1-API_FP.CPI.TOTL.ZG_DS2_en_csv_v2_122376.csv"
df = pd.read_csv(my_data, skiprows=4)
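
The effect of `skiprows` can be checked on a small in-memory file. The sketch below assumes a hypothetical export with four preamble lines above the header; the preamble text, the country "Exampleland", and the numbers are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature export: four preamble lines, then the header row.
raw = (
    "Data Source,Example Indicators\n"
    "Last Updated Date,2024-01-01\n"
    "Note,first illustrative preamble line\n"
    "Note,second illustrative preamble line\n"
    "Country Name,Country Code,Indicator Name,Indicator Code,1960,1961\n"
    'Exampleland,EXL,"Inflation, consumer prices (annual %)",FP.CPI.TOTL.ZG,1.5,2.5\n'
)

# Skipping the first four lines makes the fifth line the header.
sample = pd.read_csv(io.StringIO(raw), skiprows=4)

print(sample.columns.tolist())
print(sample)
```

Without `skiprows`, pandas would treat the first preamble line as the header and the column names would be wrong.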

The file is in wide format, with one column per year. We melt it into long format, with one row per country and year, which is easier to filter, group, and plot in pandas.
This also reduces the chance of errors in later steps.

In [5]:
# Step 2: Reshape / melt the wide format into long format
df_long = df.melt(
    id_vars=["Country Name", "Country Code", "Indicator Name", "Indicator Code"],
    var_name="Year",
    value_name="Inflation"
)
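
To see what the melt does, here is a small sketch on a toy wide table. The countries and values are made up, and the two indicator columns are left out to keep the example short.

```python
import pandas as pd

# Toy wide-format table: one row per country, one column per year.
wide = pd.DataFrame({
    "Country Name": ["Exampleland", "Samplestan"],
    "Country Code": ["EXL", "SMP"],
    "1960": [1.5, 4.0],
    "1961": [2.5, 5.0],
})

# Every non-id column becomes a (Year, Inflation) pair on its own row.
long = wide.melt(
    id_vars=["Country Name", "Country Code"],
    var_name="Year",
    value_name="Inflation",
)

print(long)
```

Two countries times two year columns gives four rows. Note that the Year values are still strings here, because they come from the column headers.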

Step 3: Clean the Year column

df_long["Year"] = pd.to_numeric(df_long["Year"], errors="coerce")

What’s happening here?

After melting, the Year values are the original column headers, so they are stored as strings and might not all be clean numbers. For example, the column could contain values like:
"1990", "1991", "N/A", "Year 1992", "missing"

1. **`df_long["Year"]`** – Selects the Year column from our DataFrame `df_long`.
2. **`pd.to_numeric(..., errors="coerce")`** – Tries to convert each value to a number. A value that can be converted (e.g. "1990" → 1990) becomes a number. A value that cannot be converted (e.g. "N/A" or "Year 1992") is replaced with NaN (Not a Number) because of `errors="coerce"`.
3. **Assignment** – The result is assigned back to `df_long["Year"]`, so the column now contains numeric year values, or NaN where conversion failed.

In [6]:
# Step 3: Clean the Year column
df_long["Year"] = pd.to_numeric(df_long["Year"], errors="coerce")
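
The behaviour of `errors="coerce"` can be checked on the example values mentioned above. This is a small sketch on a standalone Series, not on the project's data; the follow-up `dropna().astype(int)` is one possible next step and an assumption, not something the notebook has done at this point.

```python
import pandas as pd

# The example values from the explanation above.
years = pd.Series(["1990", "1991", "N/A", "Year 1992", "missing"])

# Convertible strings become numbers; everything else becomes NaN.
numeric_years = pd.to_numeric(years, errors="coerce")

# Count how many values failed to convert.
n_failed = int(numeric_years.isna().sum())

# Possible follow-up: drop the failed rows and store years as integers.
clean_years = numeric_years.dropna().astype(int)

print(numeric_years)
print(n_failed)
print(clean_years.tolist())
```

Because NaN is a floating-point value, the coerced column has a float dtype until the missing rows are dropped or filled.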