# Database Setup Notebook

This notebook builds the entire MySQL database for the Big Data project.

The steps performed:

- [Step 0: Import Libraries](#step20)
- [Step 1: Define File Paths & Abbreviations](#step21)
- [Step 2: Detect Encodings](#step22)
- [Step 3: Clean CSV Files](#step23)
- [Step 4: Convert to Long Format (`all_long`)](#step24)
- [Step 5: Build `countries_df`, `indicators_df`, `values_df`](#step25)
- [Step 6: Connect to MySQL & Create DB](#step28)
- [Step 7: Reset Schema & Create Tables](#step29)
- [Step 8: Insert Countries & Indicators](#step210)
- [Step 9: Bulk Insert `indicator_values`](#step211)
- [Step 10: Create SQL View `all_data`](#step212)
- [Step 11: Sanity Checks](#step213)

## STEP 0: Import all required libraries


In [1]:
import os
import math
import re

import numpy as np
import pandas as pd
import pymysql

import custom_functions  # cleaning & encoding helpers

pd.options.display.max_columns = 100


## STEP 1: Define file paths and abbreviations

We list the 8 CSV files (health + environmental indicators) using the same paths as in `main.ipynb`.

They are split into:
- **Notation 1:** semicolon-delimited, clean files  
- **Notation 2:** comma-delimited, messy files requiring extra cleaning  


In [2]:
# File paths for notation 1 ('che', 'wr', 'wu', 'sr', 'su', 'gem')
filepaths_1 = [
    'UPDATED CSV DATA - Intro to Big Data/Current health expenditure (% of GDP).csv',
    'UPDATED CSV DATA - Intro to Big Data/People using at least basic drinking water services, rural (% of rural population).csv',
    'UPDATED CSV DATA - Intro to Big Data/People using at least basic drinking water services, urban (% of urban population).csv',
    'UPDATED CSV DATA - Intro to Big Data/People using safely managed sanitation services, rural (% of rural population).csv',
    'UPDATED CSV DATA - Intro to Big Data/People using safely managed sanitation services, urban (% of urban population).csv',
    'UPDATED CSV DATA - Intro to Big Data/Total greenhouse gas emissions including LULUCF (Mt CO2e).csv'
]

# File paths for notation 2 ('pop', 'ren')
filepaths_2 = [
    'UPDATED CSV DATA - Intro to Big Data/Population, total.csv',
    'UPDATED CSV DATA - Intro to Big Data/Renewable energy consumption (% of total final energy consumption).csv'
]

# Abbreviations
abbreviations_1 = ['che', 'wr', 'wu', 'sr', 'su', 'gem']
abbreviations_2 = ['pop', 'ren']


## STEP 2: Detect encodings for all CSV files

We use `custom_functions.detect_encoding()` to ensure each file is read with the correct encoding.


In [3]:
encodings_1 = {}
for abbreviation, filepath in zip(abbreviations_1, filepaths_1):
    encodings_1[abbreviation] = custom_functions.detect_encoding(filepath)

print("Encodings for notation 1 files:\n", encodings_1)

encodings_2 = {}
for abbreviation, filepath in zip(abbreviations_2, filepaths_2):
    encodings_2[abbreviation] = custom_functions.detect_encoding(filepath)

print("Encodings for notation 2 files:\n", encodings_2)


Encodings for notation 1 files:
 {'che': 'UTF-8-SIG', 'wr': 'UTF-8-SIG', 'wu': 'UTF-8-SIG', 'sr': 'UTF-8-SIG', 'su': 'UTF-8-SIG', 'gem': 'UTF-8-SIG'}
Encodings for notation 2 files:
 {'pop': 'UTF-8-SIG', 'ren': 'UTF-8-SIG'}
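
All eight files are reported as `UTF-8-SIG`, that is, UTF-8 with a leading byte-order mark (BOM). Reading with that encoding strips the BOM, so the first column name does not end up prefixed with `\ufeff`. The sketch below shows one way to check for a BOM using only the standard library. The helper `sniff_bom_encoding` is hypothetical and is not the implementation of `custom_functions.detect_encoding()`, which may work differently.

```python
import codecs
import os
import tempfile


def sniff_bom_encoding(filepath, default="utf-8"):
    """Return 'utf-8-sig' if the file starts with a UTF-8 BOM, else `default`.

    Hypothetical helper for illustration only.
    """
    with open(filepath, "rb") as fh:
        head = fh.read(len(codecs.BOM_UTF8))
    return "utf-8-sig" if head == codecs.BOM_UTF8 else default


# Demo on two throwaway files: one written with a BOM, one without.
with tempfile.TemporaryDirectory() as tmp:
    with_bom = os.path.join(tmp, "with_bom.csv")
    without_bom = os.path.join(tmp, "without_bom.csv")
    with open(with_bom, "w", encoding="utf-8-sig") as fh:
        fh.write("Country Name;Country Code\n")
    with open(without_bom, "w", encoding="utf-8") as fh:
        fh.write("Country Name;Country Code\n")

    bom_result = sniff_bom_encoding(with_bom)
    plain_result = sniff_bom_encoding(without_bom)

print(bom_result, plain_result)
```

A BOM check only distinguishes UTF-8 with and without a BOM; files in other encodings would need a statistical detector.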


## STEP 3: Clean all CSV files using the custom functions

Notation 1 uses:
- `;` separator
- rows start at index 3

Notation 2 uses:
- `,` separator
- many quotes, commas, extra characters
- rows start at index 4


In [4]:
# 3.1 Notation 1
dfnames_1 = ['df_che', 'df_wr', 'df_wu', 'df_sr', 'df_su', 'df_gem']
df_dict_1 = {}

for df_name, abbreviation, filepath in zip(dfnames_1, abbreviations_1, filepaths_1):
    df_dict_1[df_name] = custom_functions.clean_csv(
        filepath=filepath,
        encoding=encodings_1[abbreviation],
        separator=';',
        trail1='\n',
        trail2=None,
        trail3=None,
        to_be_replaced='"',
        start_row=3
    )

df_che = df_dict_1['df_che']
df_wr  = df_dict_1['df_wr']
df_wu  = df_dict_1['df_wu']
df_sr  = df_dict_1['df_sr']
df_su  = df_dict_1['df_su']
df_gem = df_dict_1['df_gem']

# 3.2 Notation 2
dfnames_2 = ['df_pop', 'df_ren']
df_dict_2 = {}

for df_name, abbreviation, filepath in zip(dfnames_2, abbreviations_2, filepaths_2):
    df_dict_2[df_name] = custom_functions.clean_csv(
        filepath=filepath,
        encoding=encodings_2[abbreviation],
        separator=',',
        trail1='\n',
        trail2='"',
        trail3=',',
        to_be_replaced='"',
        start_row=4
    )

df_pop = df_dict_2['df_pop']
df_ren = df_dict_2['df_ren']

print("Shapes of cleaned DataFrames:")
for name, df in {
    "che": df_che, "wr": df_wr, "wu": df_wu,
    "sr": df_sr, "su": df_su, "gem": df_gem,
    "pop": df_pop, "ren": df_ren
}.items():
    print(f"{name}: {df.shape}")


Shapes of cleaned DataFrames:
che: (266, 69)
wr: (266, 69)
wu: (266, 69)
sr: (266, 69)
su: (266, 69)
gem: (266, 69)
pop: (266, 69)
ren: (266, 69)
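
To make the two notations concrete, here is a minimal sketch that parses a toy file of each kind with the standard `csv` module. It is not the project's `clean_csv()`; the helper `read_delimited` and the sample rows are made up for illustration, and it assumes that "rows start at index N" means the header is the line with 0-based index N.

```python
import csv


def read_delimited(text, separator, start_row):
    """Skip the first `start_row` lines, then split the rest on `separator`.

    Hypothetical stand-in for illustration; quoted fields may contain
    the separator itself.
    """
    lines = text.splitlines()[start_row:]
    reader = csv.reader(lines, delimiter=separator, quotechar='"')
    parsed = [row for row in reader if row]  # drop empty lines
    return parsed[0], parsed[1:]


# Toy "notation 1" file: semicolon-delimited, header on line index 3.
notation_1 = (
    "metadata line\n"
    "metadata line\n"
    "metadata line\n"
    "Country Name;Country Code;2000\n"
    "Aruba;ABW;1.5\n"
)

# Toy "notation 2" file: comma-delimited with quoted fields, header on line index 4.
notation_2 = (
    "metadata line\n" * 4
    + '"Country Name","Indicator Name","2000"\n'
    + '"Aruba","Population, total","90000"\n'
)

header_1, rows_1 = read_delimited(notation_1, ";", 3)
header_2, rows_2 = read_delimited(notation_2, ",", 4)
print(header_1, rows_1)
print(header_2, rows_2)
```

The second example shows why comma-delimited files need more care: the indicator name `Population, total` contains the separator and survives only because it is quoted.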


## STEP 4: Convert all datasets into a unified long-format table

Each dataset is converted from wide format (one column per year, 1960–2024) into long format, with one row per country, indicator, and year.


In [None]:

# Melt a wide DataFrame (one column per year) into long format

def melt_indicator(df):
    year_cols = [c for c in df.columns if str(c).isdigit()]

    long_df = df.melt(
        id_vars=["Country Name", "Country Code", "Indicator Name", "Indicator Code"],
        value_vars=year_cols,
        var_name="Year",
        value_name="Value"
    )
    long_df["Year"] = long_df["Year"].astype(int)
    long_df["Value"] = pd.to_numeric(long_df["Value"], errors="coerce")
    return long_df

che_long  = melt_indicator(df_che)
wr_long   = melt_indicator(df_wr)
wu_long   = melt_indicator(df_wu)
sr_long   = melt_indicator(df_sr)
su_long   = melt_indicator(df_su)
gem_long  = melt_indicator(df_gem)
pop_long  = melt_indicator(df_pop)
ren_long  = melt_indicator(df_ren)

all_long = pd.concat(
    [che_long, wr_long, wu_long, sr_long, su_long, gem_long, pop_long, ren_long],
    ignore_index=True
)

print("all_long shape:", all_long.shape)
all_long.head()


all_long shape: (138320, 6)


|   | Country Name | Country Code | Indicator Name | Indicator Code | Year | Value |
|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | Current health expenditure (% of GDP) | SH.XPD.CHEX.GD.ZS | 1960 |  |
| 1 | Africa Eastern and Southern | AFE | Current health expenditure (% of GDP) | SH.XPD.CHEX.GD.ZS | 1960 |  |
| 2 | Afghanistan | AFG | Current health expenditure (% of GDP) | SH.XPD.CHEX.GD.ZS | 1960 |  |
| 3 | Africa Western and Central | AFW | Current health expenditure (% of GDP) | SH.XPD.CHEX.GD.ZS | 1960 |  |
| 4 | Angola | AGO | Current health expenditure (% of GDP) | SH.XPD.CHEX.GD.ZS | 1960 |  |
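
The reported shape can be checked by hand. The arithmetic below is a small sketch that assumes every non-identifier column of the cleaned DataFrames is a year column, which is consistent with the shapes printed in Step 3.

```python
n_countries = 266    # rows per cleaned DataFrame (countries and aggregates)
n_columns = 69       # columns per cleaned DataFrame
n_id_columns = 4     # Country Name, Country Code, Indicator Name, Indicator Code
n_indicators = 8     # one cleaned DataFrame per indicator

n_years = n_columns - n_id_columns            # year columns per DataFrame
years = range(1960, 1960 + n_years)           # first year is 1960
rows_per_indicator = n_countries * n_years    # rows after melting one DataFrame
expected_rows = rows_per_indicator * n_indicators

print(n_years, years[-1], rows_per_indicator, expected_rows)
```

This gives 65 year columns (1960 through 2024), 17,290 rows per indicator, and 138,320 rows in total, matching `all_long.shape`.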


## STEP 5: Build countries_df, indicators_df, values_df

These are the **3 tables** that will be inserted into MySQL:
- `countries_df`: unique list of countries  
- `indicators_df`: unique list of indicators  
- `values_df`: all actual data values  


In [6]:
# 5.1 countries_df
countries_df = (
    all_long[["Country Code", "Country Name"]]
    .drop_duplicates()
    .sort_values("Country Code")
    .reset_index(drop=True)
)

countries_df["country_id"] = countries_df.index + 1
countries_df["region"] = None

countries_df = countries_df.rename(columns={
    "Country Code": "country_code",
    "Country Name": "country_name"
})

print("countries_df shape:", countries_df.shape)
countries_df.head()

# 5.2 indicators_df
indicators_df = (
    all_long[["Indicator Code", "Indicator Name"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

indicators_df["indicator_id"] = indicators_df.index + 1

def extract_unit(name):
    m = re.search(r"\((.*?)\)", str(name))
    return m.group(1) if m else "original units"

indicators_df["unit"] = indicators_df["Indicator Name"].apply(extract_unit)

indicators_df = indicators_df.rename(columns={
    "Indicator Code": "indicator_code",
    "Indicator Name": "indicator_name"
})

print("indicators_df shape:", indicators_df.shape)
indicators_df.head()

# 5.3 values_df
country_code_to_id   = dict(zip(countries_df["country_code"],  countries_df["country_id"]))
indicator_code_to_id = dict(zip(indicators_df["indicator_code"], indicators_df["indicator_id"]))

values_df = all_long.copy()
values_df["country_id"]   = values_df["Country Code"].map(country_code_to_id)
values_df["indicator_id"] = values_df["Indicator Code"].map(indicator_code_to_id)

values_df = values_df.rename(columns={"Year": "year", "Value": "value"})

print("values_df shape:", values_df.shape)
values_df.head()


countries_df shape: (266, 4)
indicators_df shape: (8, 4)
values_df shape: (138320, 8)


|   | Country Name | Country Code | Indicator Name | Indicator Code | year | value | country_id | indicator_id |
|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | Current health expenditure (% of GDP) | SH.XPD.CHEX.GD.ZS | 1960 |  | 1 | 1 |
| 1 | Africa Eastern and Southern | AFE | Current health expenditure (% of GDP) | SH.XPD.CHEX.GD.ZS | 1960 |  | 2 | 1 |
| 2 | Afghanistan | AFG | Current health expenditure (% of GDP) | SH.XPD.CHEX.GD.ZS | 1960 |  | 3 | 1 |
| 3 | Africa Western and Central | AFW | Current health expenditure (% of GDP) | SH.XPD.CHEX.GD.ZS | 1960 |  | 4 | 1 |
| 4 | Angola | AGO | Current health expenditure (% of GDP) | SH.XPD.CHEX.GD.ZS | 1960 |  | 5 | 1 |
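
To see what `extract_unit` produces, the sketch below applies the same regular expression to a few indicator names. The names are taken from the file names in Step 1, on the assumption that they match the `Indicator Name` column.

```python
import re


def extract_unit(name):
    # Same logic as in the cell above: take the first parenthesised group.
    m = re.search(r"\((.*?)\)", str(name))
    return m.group(1) if m else "original units"


names = [
    "Current health expenditure (% of GDP)",
    "People using at least basic drinking water services, rural (% of rural population)",
    "Total greenhouse gas emissions including LULUCF (Mt CO2e)",
    "Population, total",
]

units = {name: extract_unit(name) for name in names}
for name, unit in units.items():
    print(f"{unit!r:28} <- {name}")
```

Names without parentheses, such as `Population, total`, fall back to `"original units"`. Because the pattern is non-greedy and `re.search` stops at the first match, a name with several parenthesised groups would yield only the first one.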


## STEP 6: Connect to MySQL and create database


In [None]:
# Connect to MySQL
conn = pymysql.connect(
    host="localhost",
    user="root",
    password="XXX",   # replace with your real password
    autocommit=True
)

cursor = conn.cursor()

cursor.execute("CREATE DATABASE IF NOT EXISTS bigdata_project;")
cursor.execute("USE bigdata_project;")

print("Connected to MySQL and using database bigdata_project.")


✅ Connected to MySQL and using database bigdata_project.
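
Hardcoding the password in the notebook makes it easy to leak when the file is shared. One possible alternative, sketched below, reads the connection settings from environment variables. The variable names `MYSQL_HOST`, `MYSQL_USER`, and `MYSQL_PASSWORD` and the helper `mysql_connection_kwargs` are assumptions for illustration, not part of the project.

```python
import os


def mysql_connection_kwargs(env=None):
    """Build keyword arguments for pymysql.connect() from environment variables.

    The variable names are hypothetical; use whatever fits your setup.
    """
    env = os.environ if env is None else env
    password = env.get("MYSQL_PASSWORD")
    if not password:
        raise RuntimeError("Set MYSQL_PASSWORD before running the notebook.")
    return {
        "host": env.get("MYSQL_HOST", "localhost"),
        "user": env.get("MYSQL_USER", "root"),
        "password": password,
        "autocommit": True,
    }


# Demo with an explicit mapping instead of the real environment.
kwargs = mysql_connection_kwargs({"MYSQL_PASSWORD": "example-only"})
print({key: value for key, value in kwargs.items() if key != "password"})
```

In the notebook this would be used as `conn = pymysql.connect(**mysql_connection_kwargs())`.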


## STEP 7: Reset tables and recreate schema
We drop the three tables, with foreign key checks temporarily disabled:
- `indicator_values`
- `indicators`
- `countries`

Then we recreate all three, parent tables first.


In [None]:
cursor.execute("SET FOREIGN_KEY_CHECKS = 0;")
cursor.execute("DROP TABLE IF EXISTS indicator_values;")
cursor.execute("DROP TABLE IF EXISTS indicators;")
cursor.execute("DROP TABLE IF EXISTS countries;")
cursor.execute("SET FOREIGN_KEY_CHECKS = 1;")

cursor.execute("""
CREATE TABLE countries (
    country_id INT PRIMARY KEY,
    country_code VARCHAR(5),
    country_name VARCHAR(255),
    region VARCHAR(100)
);
""")

cursor.execute("""
CREATE TABLE indicators (
    indicator_id INT PRIMARY KEY,
    indicator_code VARCHAR(255),
    indicator_name VARCHAR(500),
    unit VARCHAR(100)
);
""")

cursor.execute("""
CREATE TABLE indicator_values (
    value_id BIGINT AUTO_INCREMENT PRIMARY KEY,
    country_id INT,
    indicator_id INT,
    year INT,
    value DOUBLE,
    FOREIGN KEY (country_id) REFERENCES countries(country_id),
    FOREIGN KEY (indicator_id) REFERENCES indicators(indicator_id)
);
""")

print("Tables countries, indicators, indicator_values created.")



🔄 Resetting tables in bigdata_project...
Tables countries, indicators, indicator_values created.
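
The foreign keys on `indicator_values` are the reason the parent tables are created first and must be populated before the values. The sketch below illustrates the effect of such a constraint using an in-memory SQLite database as a stand-in for MySQL, with a reduced, hypothetical version of the schema. The SQL dialect differs slightly (for example `AUTOINCREMENT` and the `PRAGMA`), so treat it as an illustration rather than the project's DDL.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn_demo.execute("PRAGMA foreign_keys = ON;")

conn_demo.execute(
    "CREATE TABLE countries (country_id INTEGER PRIMARY KEY, country_code TEXT);"
)
conn_demo.execute(
    """
    CREATE TABLE indicator_values (
        value_id INTEGER PRIMARY KEY AUTOINCREMENT,
        country_id INTEGER,
        year INTEGER,
        value REAL,
        FOREIGN KEY (country_id) REFERENCES countries(country_id)
    );
    """
)

# A value row that points at an existing country is accepted.
conn_demo.execute("INSERT INTO countries VALUES (1, 'ABW');")
conn_demo.execute(
    "INSERT INTO indicator_values (country_id, year, value) VALUES (1, 1960, NULL);"
)

# A value row that points at a missing country is rejected.
try:
    conn_demo.execute(
        "INSERT INTO indicator_values (country_id, year, value) VALUES (999, 1960, NULL);"
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

n_values = conn_demo.execute("SELECT COUNT(*) FROM indicator_values;").fetchone()[0]
print(orphan_rejected, n_values)
```

The same rule explains the insert order in the following steps: countries and indicators go in before any row of `indicator_values`.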


## STEP 8: Insert countries and indicators


In [None]:
insert_countries_sql = """
    INSERT INTO countries (country_id, country_code, country_name, region)
    VALUES (%s, %s, %s, %s);
"""

for _, row in countries_df.iterrows():
    cursor.execute(insert_countries_sql, (int(row.country_id), row["country_code"], row["country_name"], row["region"]))

print("Inserted countries:", len(countries_df))

print("\nüì§ Inserting indicators...")

insert_indicators_sql = """
    INSERT INTO indicators (indicator_id, indicator_code, indicator_name, unit)
    VALUES (%s, %s, %s, %s);
"""

for _, row in indicators_df.iterrows():
    cursor.execute(insert_indicators_sql, (int(row.indicator_id), row["indicator_code"], row["indicator_name"], row["unit"]))

print("Inserted indicators:", len(indicators_df))



📤 Inserting countries...
Inserted countries: 266

📤 Inserting indicators...
Inserted indicators: 8


## STEP 9: Insert the 138,320 indicator_values rows in batches


In [None]:
cursor.execute("TRUNCATE TABLE indicator_values;")

rows = []
for row in values_df.itertuples(index=False):
    val = None if pd.isna(row.value) else row.value
    rows.append((int(row.country_id), int(row.indicator_id), int(row.year), val))

total = len(rows)
print("Total rows to insert into indicator_values:", total)

insert_values_sql = """
    INSERT INTO indicator_values (country_id, indicator_id, year, value)
    VALUES (%s, %s, %s, %s);
"""

conn.autocommit(False)
batch_size = 5000
inserted = 0

for start in range(0, total, batch_size):
    batch = rows[start:start + batch_size]
    cursor.executemany(insert_values_sql, batch)
    conn.commit()
    inserted += len(batch)
    print(f"Inserted {inserted} / {total} rows...", end="\r")

conn.autocommit(True)
print(f"\nFinished inserting {inserted} rows into indicator_values.")



📤 Inserting indicator_values (bulk)...
Total rows to insert into indicator_values: 138320
Inserted 138320 / 138320 rows...
Finished inserting 138320 rows into indicator_values.
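
The batching logic can be tried without a database. The sketch below uses two hypothetical helpers, `iter_batches` and `nan_to_none`, to show how 138,320 rows split into batches of 5,000 and how missing values become `None`, which the driver sends as SQL `NULL`.

```python
import math


def iter_batches(rows, batch_size):
    """Yield consecutive slices of `rows` with at most `batch_size` items each."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def nan_to_none(value):
    """Map a float NaN to None; leave every other value unchanged."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


total = 138320
batch_size = 5000
rows = list(range(total))  # placeholder rows; only the count matters here

batches = list(iter_batches(rows, batch_size))
n_batches = len(batches)
last_batch_size = len(batches[-1])

print(n_batches, math.ceil(total / batch_size), last_batch_size)
print(nan_to_none(float("nan")), nan_to_none(3.2))
```

With these numbers there are 28 commits: 27 full batches of 5,000 rows and a final batch of 3,320 rows.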


## STEP 10: Create view *all_data*

This view joins all three tables into a single logical dataset that you can query directly.


In [None]:
cursor.execute("""
CREATE OR REPLACE VIEW all_data AS
SELECT
    iv.value_id,
    iv.year,
    iv.value,
    c.country_id,
    c.country_code,
    c.country_name,
    c.region,
    i.indicator_id,
    i.indicator_code,
    i.indicator_name,
    i.unit
FROM indicator_values iv
JOIN countries  c ON iv.country_id   = c.country_id
JOIN indicators i ON iv.indicator_id = i.indicator_id;
""")

print("View all_data created.")



🔗 Creating view all_data...
View all_data created.
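
As an illustration of querying the view, the sketch below builds a parameterised query for one country and one indicator. The helper `indicator_series_query` is hypothetical; the `%s` placeholders follow the same style as the insert statements above, so the values are passed separately rather than formatted into the SQL string.

```python
def indicator_series_query(country_code, indicator_code):
    """Return (sql, params) selecting the non-null yearly values for one series."""
    sql = (
        "SELECT year, value "
        "FROM all_data "
        "WHERE country_code = %s AND indicator_code = %s AND value IS NOT NULL "
        "ORDER BY year;"
    )
    return sql, (country_code, indicator_code)


sql, params = indicator_series_query("ABW", "SH.XPD.CHEX.GD.ZS")
print(sql)
print(params)
```

With an open connection this would run as `cursor.execute(sql, params)` followed by `cursor.fetchall()`.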


## STEP 11: Sanity checks
We count rows and preview the joined dataset.


In [None]:
print("\n📊 Sanity checks:")

cursor.execute("SELECT COUNT(*) FROM countries;")
print("countries rows:", cursor.fetchone()[0])

cursor.execute("SELECT COUNT(*) FROM indicators;")
print("indicators rows:", cursor.fetchone()[0])

cursor.execute("SELECT COUNT(*) FROM indicator_values;")
print("indicator_values rows:", cursor.fetchone()[0])

sample_df = pd.read_sql("SELECT * FROM all_data LIMIT 10;", conn)
display(sample_df)

print("\n✅ Database setup completed successfully.")



📊 Sanity checks:
countries rows: 266
indicators rows: 8
indicator_values rows: 138320




|   | value_id | year | value | country_id | country_code | country_name | region | indicator_id | indicator_code | indicator_name | unit |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1960 |  | 1 | ABW | Aruba |  | 1 | SH.XPD.CHEX.GD.ZS | Current health expenditure (% of GDP) | % of GDP |
| 1 | 2 | 1960 |  | 2 | AFE | Africa Eastern and Southern |  | 1 | SH.XPD.CHEX.GD.ZS | Current health expenditure (% of GDP) | % of GDP |
| 2 | 3 | 1960 |  | 3 | AFG | Afghanistan |  | 1 | SH.XPD.CHEX.GD.ZS | Current health expenditure (% of GDP) | % of GDP |
| 3 | 4 | 1960 |  | 4 | AFW | Africa Western and Central |  | 1 | SH.XPD.CHEX.GD.ZS | Current health expenditure (% of GDP) | % of GDP |
| 4 | 5 | 1960 |  | 5 | AGO | Angola |  | 1 | SH.XPD.CHEX.GD.ZS | Current health expenditure (% of GDP) | % of GDP |
| 5 | 6 | 1960 |  | 6 | ALB | Albania |  | 1 | SH.XPD.CHEX.GD.ZS | Current health expenditure (% of GDP) | % of GDP |
| 6 | 7 | 1960 |  | 7 | AND | Andorra |  | 1 | SH.XPD.CHEX.GD.ZS | Current health expenditure (% of GDP) | % of GDP |
| 7 | 8 | 1960 |  | 8 | ARB | Arab World |  | 1 | SH.XPD.CHEX.GD.ZS | Current health expenditure (% of GDP) | % of GDP |
| 8 | 9 | 1960 |  | 9 | ARE | United Arab Emirates |  | 1 | SH.XPD.CHEX.GD.ZS | Current health expenditure (% of GDP) | % of GDP |
| 9 | 10 | 1960 |  | 10 | ARG | Argentina |  | 1 | SH.XPD.CHEX.GD.ZS | Current health expenditure (% of GDP) | % of GDP |
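
`pd.read_sql` is documented for SQLAlchemy connectables, database URI strings, and `sqlite3` connections, and recent pandas versions may emit a `UserWarning` when given another DB-API connection such as one from `pymysql`. One possible way around this is to fetch the rows through the cursor and build the records by hand. The sketch below uses a hypothetical helper, `fetch_as_dicts`, and an in-memory SQLite table as a stand-in for the `all_data` view; it relies only on the DB-API attribute `cursor.description`.

```python
import sqlite3


def fetch_as_dicts(cursor, sql, params=None):
    """Run `sql` and return the result rows as a list of dicts keyed by column name."""
    if params is None:
        cursor.execute(sql)
    else:
        cursor.execute(sql, params)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Stand-in for the real view: a tiny table with the same column names.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE all_data (year INTEGER, value REAL, country_code TEXT);")
demo.execute("INSERT INTO all_data VALUES (1960, NULL, 'ABW');")

records = fetch_as_dicts(demo.cursor(), "SELECT * FROM all_data LIMIT 10;")
print(records)
```

In the notebook, `pd.DataFrame(fetch_as_dicts(cursor, "SELECT * FROM all_data LIMIT 10;"))` would then give a preview table without passing the connection to pandas.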



✅ Database setup completed successfully.
