## Script for cleaning the dataset

Good practice in data analysis favors a long ("tidy") layout for a dataset like this one, so the wide table, with one column per year, is melted into one row per region and year.

The final dataset will contain three columns: `region`, `year`, and `total_investment`.
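
As a minimal sketch of the reshaping described above, the snippet below melts a tiny wide table into the three-column layout. The frame `wide` and its values are hypothetical and only illustrate the shape of the data; they are not taken from `data.xlsx`.

```python
import pandas as pd

# Hypothetical wide-format table: one row per region, one column per year.
wide = pd.DataFrame({
    "Region": ["East", "West"],
    "2000": [10.0, 20.0],
    "2001": [30.0, 40.0],
})

# Melt: every year column becomes a (year, total_investment) pair per region.
long = wide.melt(
    id_vars=["Region"],
    var_name="year",
    value_name="total_investment",
).rename(columns={"Region": "region"})

# Year labels come from column headers, so they start out as strings.
long["year"] = long["year"].astype(int)

print(long)
```

Two regions and two year columns give four rows, one per region and year combination.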

In [None]:
## importing libraries

import pandas as pd  # reading .xlsx files requires the openpyxl package to be installed

## importing data

data = pd.read_excel('data.xlsx')
print("Data imported successfully")
print(data.head())


In [None]:
## let's transform the dataset from the wide format into the long format so it is easier to work with.

# Melt the dataframe to convert from wide to long format
# This will create year, region, and total_investment columns
data_cleaned = data.melt(
    id_vars=['Region'],
    var_name='year',
    value_name='total_investment'
)

# Rename 'Region' to 'region' for consistency
data_cleaned = data_cleaned.rename(columns={'Region': 'region'})

# Convert year to integer (it might be read as string)
data_cleaned['year'] = data_cleaned['year'].astype(int)

# Sort by region and year for better readability
data_cleaned = data_cleaned.sort_values(['region', 'year']).reset_index(drop=True)

# Display the transformed dataset
print("Transformed dataset shape:", data_cleaned.shape)
print("\nFirst few rows:")
print(data_cleaned.head(10))
print("\nDataset info:")
# info() prints its summary directly and returns None, so it is not wrapped in print()
data_cleaned.info()


Transformed dataset shape: (286, 3)

First few rows:
  region  year  total_investment
0   East  2000      4.027036e+07
1   East  2001      7.294011e+07
2   East  2002      4.686010e+08
3   East  2003      4.160920e+08
4   East  2004      3.106673e+08
5   East  2005      4.760725e+07
6   East  2006      3.093638e+08
7   East  2007      1.254190e+08
8   East  2008      2.473995e+08
9   East  2009      2.528732e+08

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 286 entries, 0 to 285
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   region            286 non-null    object 
 1   year              286 non-null    int64  
 2   total_investment  286 non-null    float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.8+ KB

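Before saving, it can help to confirm that the long table is well formed. The sketch below is one possible set of checks, not part of the original script: the helper `check_long_format` and the `sample` frame are hypothetical, and the column names are assumed to match the ones produced above.

```python
import pandas as pd

def check_long_format(df: pd.DataFrame) -> dict:
    """Hypothetical helper: summarize basic integrity checks on a long table."""
    return {
        # Each (region, year) pair should appear exactly once.
        "duplicate_keys": int(df.duplicated(subset=["region", "year"]).sum()),
        # Count of missing investment values.
        "missing_values": int(df["total_investment"].isna().sum()),
        # Number of distinct years present for each region.
        "years_per_region": df.groupby("region")["year"].nunique().to_dict(),
    }

# Hypothetical sample with one missing value, for illustration only.
sample = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "year": [2000, 2001, 2000, 2001],
    "total_investment": [10.0, 30.0, 20.0, None],
})

report = check_long_format(sample)
print(report)
```

On the real data the same call would be `check_long_format(data_cleaned)`. Unequal year counts across regions would suggest gaps in the source sheet.
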

In [4]:
## Saving the cleaned data

# Save as CSV (most common format, easy to share and use in other tools)
data_cleaned.to_csv('data_cleaned.csv', index=False)
print("✓ Data saved as 'data_cleaned.csv'")

# Display confirmation
print(f"\nSaved {len(data_cleaned)} rows and {len(data_cleaned.columns)} columns")
print(f"Columns: {', '.join(data_cleaned.columns.tolist())}")

✓ Data saved as 'data_cleaned.csv'

Saved 286 rows and 3 columns
Columns: region, year, total_investment
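
A quick way to confirm the export is to read the file back and compare it with the frame in memory. The following is a minimal sketch under stated assumptions: it writes a hypothetical `frame` to a temporary directory rather than touching the real `data_cleaned.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for data_cleaned, for illustration only.
frame = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "year": [2000, 2001, 2000, 2001],
    "total_investment": [10.5, 30.0, 20.25, 40.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data_cleaned.csv"
    # index=False keeps the row index out of the file, as in the cell above.
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Compare shape, column order, and values after the round trip.
same_shape = reloaded.shape == frame.shape
same_columns = list(reloaded.columns) == list(frame.columns)
same_values = reloaded["total_investment"].tolist() == frame["total_investment"].tolist()
print(same_shape, same_columns, same_values)
```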
