# Machine Learning Project

## Week one goals
- Create GitHub repository
- Connect it to Quarto Blog
- Find potential datasets
- Import data
- Preliminary data cleaning and preparing
- Create research questions that can be answered using data analysis, visualizations, and machine learning
- Figure out which tools and packages to use

## GitHub Repository
[Here's a link to the GitHub repo](https://github.com/megaminding/machine_learning_research)



## Dataset
Before choosing a specific topic, I explored several datasets to see which would be the most interesting. I looked at:
- US Census: https://github.com/zykls/folktables
- Covid19: https://www.kaggle.com/imdevskp/covid-19-analysis-visualization-comparisons/data
- Uber: https://www.kaggle.com/datasets/yasserh/uber-fares-dataset
- NYC: https://data.cityofnewyork.us/browse?category=Transportation

## Environmental Data Analysis
- "How do changes in land use, air quality, and climate variables interact to influence greenhouse gas emissions and health outcomes?"
- Created an account with Public EM-DAT to access a global database on natural and technological disasters
- <img src="Screenshot 2025-04-22 at 12.25.46 PM.png" alt="Screenshot of creating an account on the EM-DAT public portal"  height="300">
- Configured parameters to obtain custom dataset
- <img src="Screenshot 2025-04-21 at 6.21.45 PM.png" alt="Screenshot of the EM-DAT parameters configured for the custom dataset"  height="300">




In [24]:
import pandas as pd

df = pd.read_excel("emdat.xlsx", engine="openpyxl")  

print(df.head())


          DisNo. Historic Classification Key Disaster Group Disaster Subgroup  \
0  1999-9388-DJI       No    nat-cli-dro-dro        Natural    Climatological   
1  1999-9388-SDN       No    nat-cli-dro-dro        Natural    Climatological   
2  1999-9388-SOM       No    nat-cli-dro-dro        Natural    Climatological   
3  2000-0001-AGO       No    tec-tra-roa-roa  Technological         Transport   
4  2000-0002-AGO       No    nat-hyd-flo-riv        Natural      Hydrological   

  Disaster Type Disaster Subtype External IDs Event Name  ISO  ...  \
0       Drought          Drought          NaN        NaN  DJI  ...   
1       Drought          Drought          NaN        NaN  SDN  ...   
2       Drought          Drought          NaN        NaN  SOM  ...   
3          Road             Road          NaN        NaN  AGO  ...   
4         Flood   Riverine flood          NaN        NaN  AGO  ...   

  Reconstruction Costs ('000 US$) Reconstruction Costs, Adjusted ('000 US$)  \
0            

Let's inspect the data.

In [25]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16209 entries, 0 to 16208
Data columns (total 46 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   DisNo.                                     16209 non-null  object 
 1   Historic                                   16209 non-null  object 
 2   Classification Key                         16209 non-null  object 
 3   Disaster Group                             16209 non-null  object 
 4   Disaster Subgroup                          16209 non-null  object 
 5   Disaster Type                              16209 non-null  object 
 6   Disaster Subtype                           16209 non-null  object 
 7   External IDs                               2709 non-null   object 
 8   Event Name                                 5105 non-null   object 
 9   ISO                                        16209 non-null  object 
 10  Country               

Unnamed: 0,AID Contribution ('000 US$),Magnitude,Latitude,Longitude,Start Year,Start Month,Start Day,End Year,End Month,End Day,...,No. Affected,No. Homeless,Total Affected,Reconstruction Costs ('000 US$),"Reconstruction Costs, Adjusted ('000 US$)",Insured Damage ('000 US$),"Insured Damage, Adjusted ('000 US$)",Total Damage ('000 US$),"Total Damage, Adjusted ('000 US$)",CPI
count,489.0,3311.0,1816.0,1816.0,16209.0,16140.0,14638.0,16209.0,16046.0,14710.0,...,7503.0,1333.0,12087.0,33.0,33.0,713.0,694.0,3237.0,3110.0,15527.0
mean,28559.16,61225.45,16.415862,42.47762,2011.109816,6.463755,15.352849,2011.141156,6.592858,15.814276,...,628020.5,31627.04,393991.3,5687264.0,6357118.0,1347736.0,1699765.0,1178820.0,1478426.0,72.85861
std,211895.6,748641.5,21.786044,75.523526,7.422845,3.413559,8.973253,7.425922,3.391621,8.891107,...,6649927.0,214353.6,5249462.0,17452320.0,17603430.0,4644761.0,5954430.0,6317104.0,8316465.0,11.582942
min,3.0,-57.0,-72.64,-172.095,2000.0,1.0,1.0,2000.0,1.0,1.0,...,1.0,3.0,1.0,84.0,131.0,34.0,48.0,2.0,3.0,56.514291
25%,166.0,23.5,1.0615,1.6765,2005.0,4.0,7.0,2005.0,4.0,8.0,...,600.0,340.0,42.0,100000.0,100000.0,75000.0,99170.5,16000.0,21323.25,61.989586
50%,765.0,200.0,18.6425,55.5745,2010.0,7.0,15.0,2010.0,7.0,16.0,...,6500.0,1966.0,1000.0,565000.0,702336.0,250000.0,349124.5,100000.0,142028.0,71.563596
75%,4984.0,21737.0,34.78675,103.23525,2018.0,9.0,23.0,2018.0,9.0,24.0,...,60035.0,7000.0,17570.5,3344000.0,4245383.0,800000.0,1117445.0,550000.0,717240.2,80.445779
max,3518530.0,40000000.0,67.93,179.65,2025.0,12.0,31.0,2025.0,12.0,31.0,...,330000000.0,5000000.0,330000000.0,100000000.0,100000000.0,60000000.0,93614350.0,210000000.0,284465200.0,100.0


## Data Cleaning
Drop columns that are irrelevant to the research questions or too sparse to be useful. First, count the missing values in each column.


In [26]:
df.isna().sum().sort_values(ascending=False)


Reconstruction Costs, Adjusted ('000 US$)    16176
Reconstruction Costs ('000 US$)              16176
AID Contribution ('000 US$)                  15720
Insured Damage, Adjusted ('000 US$)          15515
Insured Damage ('000 US$)                    15496
River Basin                                  14976
No. Homeless                                 14876
Longitude                                    14393
Latitude                                     14393
External IDs                                 13500
Total Damage, Adjusted ('000 US$)            13099
Total Damage ('000 US$)                      12972
Magnitude                                    12898
Associated Types                             12771
Origin                                       12133
Event Name                                   11104
No. Injured                                  10190
No. Affected                                  8706
Admin Units                                   7793
Magnitude Scale                
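
The raw counts above are easier to act on as a share of all rows. Here is a minimal sketch, assuming a made-up toy frame in place of the real EM-DAT data and a hypothetical 70% cutoff (neither comes from the notebook), that flags sparse columns automatically.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the EM-DAT frame (values are made up for illustration)
toy = pd.DataFrame({
    "Country": ["Angola", "Brazil", "China", "Iraq"],
    "Total Deaths": [14.0, 10.0, 20.0, np.nan],
    "River Basin": [np.nan, np.nan, np.nan, "Tigris"],
})

# Fraction of missing values per column, most sparse first
missing_share = toy.isna().mean().sort_values(ascending=False)

# Hypothetical cutoff: flag columns with more than 70% missing values
THRESHOLD = 0.70
sparse_columns = missing_share[missing_share > THRESHOLD].index.tolist()

print(missing_share)
print(sparse_columns)
```

A cutoff like this is only a starting point: a sparse column can still be worth keeping if it matters for a research question.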

In [27]:
df = df.drop(
    columns=[
        "DisNo.", "External IDs", "Event Name", "Origin", "Associated Types",
        "Appeal", "Declaration", "OFDA/BHA Response",
        "AID Contribution ('000 US$)",
        "Reconstruction Costs ('000 US$)",
        "Reconstruction Costs, Adjusted ('000 US$)",
        "Insured Damage ('000 US$)",
        "Insured Damage, Adjusted ('000 US$)",
        "River Basin", "Admin Units", "Entry Date", "Last Update",
    ],
    errors="ignore",  # skip any listed column that is not present
)
df

Unnamed: 0,Historic,Classification Key,Disaster Group,Disaster Subgroup,Disaster Type,Disaster Subtype,ISO,Country,Subregion,Region,...,End Month,End Day,Total Deaths,No. Injured,No. Affected,No. Homeless,Total Affected,Total Damage ('000 US$),"Total Damage, Adjusted ('000 US$)",CPI
0,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,DJI,Djibouti,Sub-Saharan Africa,Africa,...,,,,,100000.0,,100000.0,,,58.111474
1,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,SDN,Sudan,Northern Africa,Africa,...,,,,,2000000.0,,2000000.0,,,56.514291
2,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,SOM,Somalia,Sub-Saharan Africa,Africa,...,,,21.0,,1200000.0,,1200000.0,,,56.514291
3,No,tec-tra-roa-roa,Technological,Transport,Road,Road,AGO,Angola,Sub-Saharan Africa,Africa,...,1.0,26.0,14.0,11.0,,,11.0,,,56.514291
4,No,nat-hyd-flo-riv,Natural,Hydrological,Flood,Riverine flood,AGO,Angola,Sub-Saharan Africa,Africa,...,1.0,15.0,31.0,,70000.0,,70000.0,10000.0,17695.0,56.514291
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16204,No,tec-tra-roa-roa,Technological,Transport,Road,Road,BRA,Brazil,Latin America and the Caribbean,Americas,...,4.0,8.0,10.0,18.0,,,18.0,,,
16205,No,tec-mis-fir-fir,Technological,Miscellaneous accident,Fire (Miscellaneous),Fire (Miscellaneous),CHN,China,Eastern Asia,Asia,...,4.0,8.0,20.0,,,,,,,
16206,No,nat-met-sto-san,Natural,Meteorological,Storm,Sand/Dust storm,IRQ,Iraq,Western Asia,Asia,...,4.0,14.0,,2751.0,,,2751.0,,,
16207,No,nat-met-sto-sto,Natural,Meteorological,Storm,Storm (General),SPI,Canary Islands,Northern Africa,Africa,...,4.0,13.0,,,,,,,,


Let's also drop rows that are missing values for the variables we care about most.

In [28]:
df = df.dropna(subset=['Country', 'Disaster Type', 'Start Year'])
df


Unnamed: 0,Historic,Classification Key,Disaster Group,Disaster Subgroup,Disaster Type,Disaster Subtype,ISO,Country,Subregion,Region,...,End Month,End Day,Total Deaths,No. Injured,No. Affected,No. Homeless,Total Affected,Total Damage ('000 US$),"Total Damage, Adjusted ('000 US$)",CPI
0,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,DJI,Djibouti,Sub-Saharan Africa,Africa,...,,,,,100000.0,,100000.0,,,58.111474
1,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,SDN,Sudan,Northern Africa,Africa,...,,,,,2000000.0,,2000000.0,,,56.514291
2,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,SOM,Somalia,Sub-Saharan Africa,Africa,...,,,21.0,,1200000.0,,1200000.0,,,56.514291
3,No,tec-tra-roa-roa,Technological,Transport,Road,Road,AGO,Angola,Sub-Saharan Africa,Africa,...,1.0,26.0,14.0,11.0,,,11.0,,,56.514291
4,No,nat-hyd-flo-riv,Natural,Hydrological,Flood,Riverine flood,AGO,Angola,Sub-Saharan Africa,Africa,...,1.0,15.0,31.0,,70000.0,,70000.0,10000.0,17695.0,56.514291
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16204,No,tec-tra-roa-roa,Technological,Transport,Road,Road,BRA,Brazil,Latin America and the Caribbean,Americas,...,4.0,8.0,10.0,18.0,,,18.0,,,
16205,No,tec-mis-fir-fir,Technological,Miscellaneous accident,Fire (Miscellaneous),Fire (Miscellaneous),CHN,China,Eastern Asia,Asia,...,4.0,8.0,20.0,,,,,,,
16206,No,nat-met-sto-san,Natural,Meteorological,Storm,Sand/Dust storm,IRQ,Iraq,Western Asia,Asia,...,4.0,14.0,,2751.0,,,2751.0,,,
16207,No,nat-met-sto-sto,Natural,Meteorological,Storm,Storm (General),SPI,Canary Islands,Northern Africa,Africa,...,4.0,13.0,,,,,,,,


Remove duplicate rows.

In [29]:
df = df.drop_duplicates()
df


Unnamed: 0,Historic,Classification Key,Disaster Group,Disaster Subgroup,Disaster Type,Disaster Subtype,ISO,Country,Subregion,Region,...,End Month,End Day,Total Deaths,No. Injured,No. Affected,No. Homeless,Total Affected,Total Damage ('000 US$),"Total Damage, Adjusted ('000 US$)",CPI
0,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,DJI,Djibouti,Sub-Saharan Africa,Africa,...,,,,,100000.0,,100000.0,,,58.111474
1,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,SDN,Sudan,Northern Africa,Africa,...,,,,,2000000.0,,2000000.0,,,56.514291
2,No,nat-cli-dro-dro,Natural,Climatological,Drought,Drought,SOM,Somalia,Sub-Saharan Africa,Africa,...,,,21.0,,1200000.0,,1200000.0,,,56.514291
3,No,tec-tra-roa-roa,Technological,Transport,Road,Road,AGO,Angola,Sub-Saharan Africa,Africa,...,1.0,26.0,14.0,11.0,,,11.0,,,56.514291
4,No,nat-hyd-flo-riv,Natural,Hydrological,Flood,Riverine flood,AGO,Angola,Sub-Saharan Africa,Africa,...,1.0,15.0,31.0,,70000.0,,70000.0,10000.0,17695.0,56.514291
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16204,No,tec-tra-roa-roa,Technological,Transport,Road,Road,BRA,Brazil,Latin America and the Caribbean,Americas,...,4.0,8.0,10.0,18.0,,,18.0,,,
16205,No,tec-mis-fir-fir,Technological,Miscellaneous accident,Fire (Miscellaneous),Fire (Miscellaneous),CHN,China,Eastern Asia,Asia,...,4.0,8.0,20.0,,,,,,,
16206,No,nat-met-sto-san,Natural,Meteorological,Storm,Sand/Dust storm,IRQ,Iraq,Western Asia,Asia,...,4.0,14.0,,2751.0,,,2751.0,,,
16207,No,nat-met-sto-sto,Natural,Meteorological,Storm,Storm (General),SPI,Canary Islands,Northern Africa,Africa,...,4.0,13.0,,,,,,,,


## Interesting Variables:
- Disaster Type           
- Location (country, region, longitude, latitude)
- Total Deaths, No. Affected, and No. Injured
- Time (start day, start month, etc.)
- Cost (Total Damage; Reconstruction Costs is missing for 16,176 of 16,209 rows and was dropped above)
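
For the time variables, it may help to combine the separate year, month, and day columns into one date. The sketch below uses made-up rows; the `Start Date` and `Start Date Is Approximate` column names are hypothetical, and filling a missing month or day with 1 is an assumption, not something EM-DAT prescribes.

```python
import numpy as np
import pandas as pd

# Toy rows using the EM-DAT column names (values are made up)
toy = pd.DataFrame({
    "Start Year": [2000, 2000, 1999],
    "Start Month": [1.0, 1.0, np.nan],
    "Start Day": [26.0, np.nan, np.nan],
})

# Remember which rows needed a placeholder month or day
toy["Start Date Is Approximate"] = (
    toy[["Start Month", "Start Day"]].isna().any(axis=1)
)

# pd.to_datetime can assemble dates from columns named year, month, day
parts = pd.DataFrame({
    "year": toy["Start Year"].astype(int),
    "month": toy["Start Month"].fillna(1).astype(int),
    "day": toy["Start Day"].fillna(1).astype(int),
})
toy["Start Date"] = pd.to_datetime(parts, errors="coerce")

print(toy[["Start Date", "Start Date Is Approximate"]])
```

Keeping the flag column makes it possible to exclude approximate dates from any analysis that needs day-level precision.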

## Research Questions:

"Which countries have experienced the highest number of natural disasters in the past 25 years?" 
- We'll need to look at the country, when each disaster occurred, and the type of disaster

"Are certain types of disasters becoming more common over time (e.g., wildfires vs. earthquakes)?" 
- We'll need to look at when each disaster occurred and the type of disaster
- One limitation to keep in mind: this dataset only starts in 2000, so it can't show longer-term trends
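
A possible starting point for this question is a year-by-type count table. This is a sketch on made-up rows, not a result from the EM-DAT data.

```python
import pandas as pd

# Toy rows using the EM-DAT column names (values are made up)
toy = pd.DataFrame({
    "Start Year": [2000, 2000, 2001, 2001, 2001],
    "Disaster Type": ["Flood", "Drought", "Flood", "Flood", "Wildfire"],
})

# One row per year, one column per disaster type, cells hold event counts
counts = (
    toy.groupby(["Start Year", "Disaster Type"])
       .size()
       .unstack(fill_value=0)
)

print(counts)
```

Each column of `counts` is then a yearly series that can be plotted or compared across disaster types.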

"Can we predict the average number of deaths or affected people for an event based on disaster type and context?"
- We'll need to look at Total Deaths, No. Affected, No. Injured, and the type of disaster
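
Before trying a real model, a simple baseline gives something to beat: predict the median deaths for each disaster type. The sketch below uses made-up numbers, and the `Predicted Deaths` column is a hypothetical name. The error is measured on the same rows used to build the baseline, so it is optimistic.

```python
import numpy as np
import pandas as pd

# Toy rows using the EM-DAT column names (values are made up)
toy = pd.DataFrame({
    "Disaster Type": ["Flood", "Flood", "Flood", "Road", "Road"],
    "Total Deaths": [31.0, 10.0, np.nan, 14.0, 20.0],
})

# Median deaths per disaster type (missing values are skipped)
baseline = toy.groupby("Disaster Type")["Total Deaths"].median()

# Use each type's median as the prediction for every event of that type
toy["Predicted Deaths"] = toy["Disaster Type"].map(baseline)

# Mean absolute error over rows where the true value is known
mae = (toy["Total Deaths"] - toy["Predicted Deaths"]).abs().mean()

print(baseline)
print(mae)
```

Any regression model tried later should be compared against this kind of baseline on held-out data.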

"Can we experiment with data forecasting and predict the number of disasters a country might experience based on its historical trends and geography?" 
- We'll need to look at the type of disaster, when it occurred, and location
- We'll need to incorporate another dataset on geography
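
As a first, very rough take on forecasting, a straight-line trend can be fitted to one country's yearly disaster counts. The counts below are made up, and a linear trend is an assumption chosen for simplicity, not the method this project has settled on.

```python
import numpy as np
import pandas as pd

# Made-up yearly disaster counts for a single hypothetical country
yearly = pd.Series(
    [10, 12, 14, 16, 18],
    index=[2018, 2019, 2020, 2021, 2022],
    name="disaster_count",
)

# Shift years so the fit works on small numbers (0, 1, 2, ...)
first_year = yearly.index.min()
x = (yearly.index - first_year).to_numpy(dtype=float)
y = yearly.to_numpy(dtype=float)

# Fit count = slope * years_since_first + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Extrapolate one year past the data
forecast_2023 = slope * (2023 - first_year) + intercept

print(round(forecast_2023, 2))
```

Real counts will be much noisier than this toy series, so a linear fit is better treated as a baseline than as a forecast to rely on.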

"Can we classify the severity of a disaster based on its type, location, and year?"
- We'll need to look at type, location, and time, and decide how to measure severity (for example, from Total Deaths or Total Affected)
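
The dataset has no severity column, so a label has to be defined first. The sketch below assumes hypothetical severity bins on Total Deaths (low up to 20, medium up to 100, high above that) and made-up rows, then fits a scikit-learn decision tree. Both the bins and the model choice are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy rows using the EM-DAT column names (values are made up)
toy = pd.DataFrame({
    "Disaster Type": ["Flood", "Flood", "Road", "Road",
                      "Storm", "Storm", "Drought", "Drought"],
    "Region": ["Africa", "Asia", "Africa", "Americas",
               "Asia", "Asia", "Africa", "Africa"],
    "Start Year": [2000, 2005, 2000, 2010, 2015, 2020, 1999, 2003],
    "Total Deaths": [31, 250, 14, 10, 1200, 40, 21, 5],
})

# Hypothetical severity bins: (0, 20] low, (20, 100] medium, above 100 high
toy["Severity"] = pd.cut(
    toy["Total Deaths"],
    bins=[0, 20, 100, float("inf")],
    labels=["low", "medium", "high"],
)

X = toy[["Disaster Type", "Region", "Start Year"]]
y = toy["Severity"].astype(str)

# One-hot encode the text columns; pass the year through unchanged
features = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["Disaster Type", "Region"])],
    remainder="passthrough",
)
model = make_pipeline(features, DecisionTreeClassifier(random_state=0))
model.fit(X, y)

predictions = model.predict(X)
print(list(predictions))
```

With the real data, the rows should be split into training and test sets before fitting, since predictions on the training rows say little about how the model generalizes.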


## Tools Needed
- For visualizations: matplotlib, geopandas, seaborn, and plotly
- For machine learning, specifically forecasting, we can try scikit-learn for regression or classification models, or an LSTM
- For classification, we can use these models to identify patterns or relationships among disasters

## Let's get started on the first research question
"Which countries have experienced the highest number of natural disasters?" 
- We'll need to look at the country, when each disaster occurred, and the type of disaster

In [30]:
disaster_count_by_country_df = df.groupby(['Country']).agg(
    disaster_count=('Disaster Type', 'count')
).reset_index().sort_values(by='disaster_count', ascending=False)


disaster_count_by_country_df

| | Country | disaster_count |
|---|---|---|
| 39 | China | 1356 |
| 89 | India | 820 |
| 212 | United States of America | 724 |
| 90 | Indonesia | 565 |
| 152 | Philippines | 474 |
| ... | ... | ... |
| 5 | Anguilla | 1 |
| 218 | Wallis and Futuna Islands | 1 |
| 164 | Saint Helena | 1 |
| 135 | Netherlands Antilles | 1 |
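
Note that the count above includes every row in the dataset, and the dataset also holds technological disasters such as road accidents. Since the question asks about natural disasters, one option is to filter on `Disaster Group` first. This is a sketch on made-up rows, so the numbers are not EM-DAT results.

```python
import pandas as pd

# Toy rows using the EM-DAT column names (values are made up)
toy = pd.DataFrame({
    "Country": ["Angola", "Angola", "China", "Iraq", "Angola"],
    "Disaster Group": ["Technological", "Natural", "Technological",
                       "Natural", "Natural"],
    "Disaster Type": ["Road", "Flood", "Fire (Miscellaneous)",
                      "Storm", "Drought"],
})

# Keep only natural disasters before counting
natural = toy[toy["Disaster Group"] == "Natural"]

natural_counts = (
    natural.groupby("Country")
           .size()
           .reset_index(name="disaster_count")
           .sort_values("disaster_count", ascending=False)
)

print(natural_counts)
```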


In [32]:
import plotly.express as px
import plotly.io as pio

# Render figures in an iframe so they show up in the notebook output
pio.renderers.default = "iframe"

fig = px.scatter(
    disaster_count_by_country_df.query('disaster_count>=300'),
    x="Country",
    y="disaster_count",
    size="disaster_count",
    color="Country",
    hover_name="Country",
    log_y=True, 
    size_max=100,
    title="Number of Disasters by Country"
)

fig.show()