# creating_database.ipynb
### Author: Alexander X. Gonzalez-Torres
**October 24, 2024**

This Jupyter Notebook documents my process for combining our raw dataset, stored in the `raw_dataset` subdirectory, into a usable database for the project.



In [8]:
## LIBRARIES USED ##

import pandas as pd


**STEP 1: Trim WDICSV.csv down to countries by eliminating regions**


In [54]:
# Load the CSV into a DataFrame
wdi_csv = pd.read_csv("raw_dataset/WDICSV.csv")

# Find the index of the first occurrence of "Afghanistan" in the 'Country Name' column.
# This relies on every regional aggregate row appearing before Afghanistan in the file.
afghanistan_index = wdi_csv[wdi_csv['Country Name'] == "Afghanistan"].index[0]

# Keep all rows from the first "Afghanistan" row onward (countries only)
trimmed_df = wdi_csv.loc[afghanistan_index:]

# Check the result
print("Number of rows expected after trimming: 322896.")
print(f"Number of rows in DataFrame: {len(trimmed_df)}")

# Save DataFrame as .csv
trimmed_df.to_csv("world_bank_processed.csv", index=False)

Number of rows expected after trimming: 322896.
Number of rows in DataFrame: 322896
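
Trimming by position works only as long as the file keeps its current row order. Below is a minimal sketch of an alternative that filters on country codes instead. It assumes a separate metadata table in which aggregates have an empty `Region` value; I believe the `WDICountry.csv` file in the WDI bulk download follows that convention, but this should be checked before relying on it. The function name `drop_aggregates` and the toy data are hypothetical, for illustration only.

```python
import pandas as pd

def drop_aggregates(wdi: pd.DataFrame, country_meta: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose Country Code has a non-empty Region in the metadata."""
    # ASSUMPTION: aggregates (World, Arab World, ...) have no Region value
    is_country = country_meta["Region"].notna()
    country_codes = set(country_meta.loc[is_country, "Country Code"])
    return wdi[wdi["Country Code"].isin(country_codes)].reset_index(drop=True)

# Toy data standing in for WDICSV.csv and the metadata file (illustrative only)
wdi_toy = pd.DataFrame({
    "Country Name": ["World", "Afghanistan", "Arab World", "Albania"],
    "Country Code": ["WLD", "AFG", "ARB", "ALB"],
    "Indicator Code": ["SP.POP.TOTL"] * 4,
})
meta_toy = pd.DataFrame({
    "Country Code": ["WLD", "AFG", "ARB", "ALB"],
    "Region": [None, "South Asia", None, "Europe & Central Asia"],
})

countries_only = drop_aggregates(wdi_toy, meta_toy)
print(countries_only["Country Name"].tolist())
```

Unlike the positional trim, this would still drop an aggregate that happened to be listed after Afghanistan.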


**STEP 2: Standardize dates in EM-DAT csv (YYYY)**

In [59]:
# Load CSV into DataFrame 

emdat_csv = pd.read_csv("raw_dataset/public_emdat_custom_request_2024-10-11_b70b4036-b2bb-48b2-87db-19b38cd8140f.csv")

# Extract the first 4 characters (which correspond to the year)
emdat_csv['Year'] = emdat_csv['DisNo.'].str[:4]

# Convert the extracted year to integer
emdat_csv['Year'] = emdat_csv['Year'].astype(int)

# Place Year column next to DisNo. 
cols = emdat_csv.columns.tolist()  
disno_index = cols.index('DisNo.')  
cols.insert(disno_index + 1, cols.pop(cols.index('Year')))  
emdat_csv = emdat_csv[cols]  

# Save DataFrame as .csv
emdat_csv.to_csv("emdat_processed.csv", index=False)

# Check emdat_csv (last expression in the cell, so the notebook displays it)
emdat_csv.head()
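
The slice-and-cast approach raises an error if any `DisNo.` value does not start with four digits. As a hedged sketch, the year can also be extracted with a pattern so that malformed identifiers become missing values that can be inspected. The helper name `extract_year` and the toy identifiers are made up for illustration, and the `YYYY-NNNN-ISO` shape of `DisNo.` is an assumption.

```python
import pandas as pd

def extract_year(disno: pd.Series) -> pd.Series:
    """Pull a leading four-digit year out of each identifier; non-matches become <NA>."""
    years = disno.astype(str).str.extract(r"^(\d{4})", expand=False)
    # Nullable integer dtype keeps years as integers while allowing missing values
    return pd.to_numeric(years, errors="coerce").astype("Int64")

# Toy identifiers (illustrative only; the last one is deliberately malformed)
toy = pd.DataFrame({"DisNo.": ["1999-0123-USA", "2004-0659-IDN", "bad-id"]})
toy["Year"] = extract_year(toy["DisNo."])

malformed = toy[toy["Year"].isna()]
print(toy)
print(f"Rows with unparseable years: {len(malformed)}")
```

If `malformed` turns out to be empty on the real file, the plain `.str[:4]` version is equivalent and simpler.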

**STEP 3 (TENTATIVE): Turn WDI indicators into columns and years into rows**

In [69]:
# Step 1: Identify the columns that represent years (which should be numeric)
year_columns = [col for col in trimmed_df.columns if col.isdigit()]

# Step 2: Melt the DataFrame, ensuring that only year columns are melted
trimmed_df_melted = pd.melt(
    trimmed_df,
    id_vars=['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code'],  
    value_vars=year_columns,  
    var_name='Year',
    value_name='Value'
)

# Convert the 'Year' column to integers (now we know it only contains year values)
trimmed_df_melted['Year'] = trimmed_df_melted['Year'].astype(int)

# Convert 'Value' column to numeric (this will turn non-numeric values into NaN)
trimmed_df_melted['Value'] = pd.to_numeric(trimmed_df_melted['Value'], errors='coerce')

# Step 3: Pivot the DataFrame so each indicator becomes its own column
trimmed_df_pivoted = trimmed_df_melted.pivot_table(
    index=['Country Name', 'Country Code', 'Year'],
    columns='Indicator Name',
    values='Value',
    aggfunc='first'  
)

# Reset the index to make 'Country Name', 'Country Code', and 'Year' normal columns
trimmed_df_pivoted.reset_index(inplace=True)

# The DataFrame should now have 'Country Name', 'Country Code', 'Year' as the first three columns
# and each 'Indicator Name' as its own column
trimmed_df_pivoted.to_csv("world_bank_long.csv", index=False)

# Preview the first rows
print(trimmed_df_pivoted.head())

Indicator Name Country Name Country Code  Year  \
0               Afghanistan          AFG  1960   
1               Afghanistan          AFG  1961   
2               Afghanistan          AFG  1962   
3               Afghanistan          AFG  1963   
4               Afghanistan          AFG  1964   

Indicator Name  ARI treatment (% of children under 5 taken to a health provider)  \
0                                                             NaN                  
1                                                             NaN                  
2                                                             NaN                  
3                                                             NaN                  
4                                                             NaN                  

Indicator Name  Access to clean fuels and technologies for cooking (% of population)  \
0                                                             NaN                      
1                 
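
To make the reshape easier to follow, here is the same melt-then-pivot sequence on a tiny made-up table. The indicator names and values are invented for illustration and are not WDI data.

```python
import pandas as pd

# Wide layout like WDICSV.csv: one row per country/indicator, one column per year
wide = pd.DataFrame({
    "Country Name": ["Afghanistan", "Afghanistan"],
    "Country Code": ["AFG", "AFG"],
    "Indicator Name": ["Indicator A", "Indicator B"],
    "Indicator Code": ["IND.A", "IND.B"],
    "1960": [1.0, 10.0],
    "1961": [2.0, None],
})

# Melt: year columns become rows
year_cols = [c for c in wide.columns if c.isdigit()]
long = wide.melt(
    id_vars=["Country Name", "Country Code", "Indicator Name", "Indicator Code"],
    value_vars=year_cols,
    var_name="Year",
    value_name="Value",
)
long["Year"] = long["Year"].astype(int)

# Pivot: each indicator becomes its own column, one row per country-year
panel = long.pivot_table(
    index=["Country Name", "Country Code", "Year"],
    columns="Indicator Name",
    values="Value",
    aggfunc="first",
).reset_index()
panel.columns.name = None  # drop the leftover "Indicator Name" axis label
print(panel)
```

Setting `panel.columns.name = None` removes the stray `Indicator Name` label that appears in the header of the output above.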

NOTE: I'm stuck beyond this point. We'll probably want to write a script that automates the time-series analysis, and I don't think a single CSV file would be the most productive format for it.
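
One possible direction, offered only as a sketch: split the pivoted table into one file per country so each time series can be loaded on its own. The function name `write_per_country`, the output folder, and the toy table are all hypothetical, and the project may well settle on a different layout.

```python
import tempfile
from pathlib import Path

import pandas as pd

def write_per_country(panel: pd.DataFrame, out_dir: Path) -> list:
    """Write one CSV per Country Code (sorted by Year) and return the paths created."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for code, group in panel.groupby("Country Code"):
        path = out_dir / f"{code}.csv"
        group.sort_values("Year").to_csv(path, index=False)
        written.append(path)
    return written

# Toy stand-in for the pivoted table (illustrative values only)
toy_panel = pd.DataFrame({
    "Country Name": ["Afghanistan", "Afghanistan", "Albania"],
    "Country Code": ["AFG", "AFG", "ALB"],
    "Year": [1961, 1960, 1960],
    "Indicator A": [2.0, 1.0, 3.0],
})

# A temporary folder is used here so the sketch does not touch the project tree
out_dir = Path(tempfile.mkdtemp()) / "by_country"
paths = write_per_country(toy_panel, out_dir)
print([p.name for p in paths])
```

On the real data, `toy_panel` would be replaced by `trimmed_df_pivoted` and `out_dir` by a folder inside the project.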