In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Predicting Earthquakes That Potentially Trigger Volcanic Eruptions

## Data Preprocessing

The main idea of this project is to predict whether or not an earthquake will trigger a volcanic eruption.<br/>
<br/>
To do this I picked two datasets from kaggle.com:
1. #### [Significant Earthquakes, 1965-2016](https://www.kaggle.com/usgs/earthquake-database) <br/>
Contains the date, time, and location of all earthquakes around the world with a magnitude of 5.5 or higher. <br/>
2. #### [Volcanic Eruptions in the Holocene Period](https://www.kaggle.com/smithsonian/volcanic-eruptions) <br/>
Contains the name, location, and type of volcanoes active in the past 10,000 years.<br/>

### Importing, Reading, Cleaning, and Tidying Dataset
Since there are two datasets, it is best to handle them one at a time. In this part we read and understand each column in both datasets, check for errors and missing values, and decide which columns to use in this project.

In [2]:
# Importing both datasets
eq = pd.read_csv('./data/earthquake.csv')
ve = pd.read_csv('./data/volcanic_eruption.csv')

###  1 Significant Earthquakes, 1965-2016
All of the column descriptions can be found at:<br/>
https://earthquake.usgs.gov/data/comcat/data-eventterms.php<br/>
<br/>
Column Description:
1. <b>Date</b> : Date when the event occurred.
2. <b>Time</b> : Time when the event occurred. 
3. <b>Latitude</b> : Decimal degrees latitude.
4. <b>Longitude</b> : Decimal degrees longitude.
5. <b>Type</b> : Type of seismic event. ('Earthquake','Nuclear Explosion', 'Explosion', 'Rock Burst')
6. <b>Depth</b> : Depth of the event in kilometers.
7. <b>Depth Error</b> : Uncertainty of reported depth of the event in kilometers.
8. <b>Depth Seismic Stations</b> : The total number of seismic stations used to determine earthquake location.
9. <b>Magnitude</b> : The magnitude for the event. 
10. <b>Magnitude Type</b> : The method or algorithm used to calculate the preferred magnitude for the event.
11. <b>Magnitude Error</b> : Uncertainty of reported magnitude of the event. The estimated standard error of the magnitude. The uncertainty corresponds to the specific magnitude type being reported and does not take into account magnitude variations and biases between different magnitude scales. 
12. <b>Magnitude Seismic Stations</b> : The total number of seismic stations used to calculate the magnitude for this earthquake.
13. <b>Azimuthal Gap</b> : The largest azimuthal gap between azimuthally adjacent stations (in degrees). In general, the smaller this number, the more reliable is the calculated horizontal position of the earthquake. Earthquake locations in which the azimuthal gap exceeds 180 degrees typically have large location and depth uncertainties.
14. <b>Horizontal Distance</b> : Horizontal distance from the epicenter to the nearest station (in degrees). 1 degree is approximately 111.2 kilometers. In general, the smaller this number, the more reliable is the calculated depth of the earthquake.
15. <b>Horizontal Error</b> : Uncertainty of reported location of the event in kilometers.
16. <b>Root Mean Square</b> : The root-mean-square (RMS) travel time residual, in sec, using all weights. This parameter provides a measure of the fit of the observed arrival times to the predicted arrival times for this location. Smaller numbers reflect a better fit of the data. The value is dependent on the accuracy of the velocity model used to compute the earthquake location, the quality weights assigned to the arrival time data, and the procedure used to locate the earthquake.<br/>
17. <b>ID</b> : A unique identifier for the event. 
18. <b>Source</b> : List of network contributors.
19. <b>Location Source</b> : Network that originally authored the reported location of this event.
20. <b>Magnitude Source</b> : Network that originally authored the reported magnitude for this event.
21. <b>Status</b> : Indicates whether the event has been reviewed by a human.



In [3]:
eq.head()

Unnamed: 0,Date,Time,Latitude,Longitude,Type,Depth,Depth Error,Depth Seismic Stations,Magnitude,Magnitude Type,Magnitude Error,Magnitude Seismic Stations,Azimuthal Gap,Horizontal Distance,Horizontal Error,Root Mean Square,ID,Source,Location Source,Magnitude Source,Status
0,01/02/1965,13:44:18,19.246,145.616,Earthquake,131.6,,,6.0,MW,,,,,,,ISCGEM860706,ISCGEM,ISCGEM,ISCGEM,Automatic
1,01/04/1965,11:29:49,1.863,127.352,Earthquake,80.0,,,5.8,MW,,,,,,,ISCGEM860737,ISCGEM,ISCGEM,ISCGEM,Automatic
2,01/05/1965,18:05:58,-20.579,-173.972,Earthquake,20.0,,,6.2,MW,,,,,,,ISCGEM860762,ISCGEM,ISCGEM,ISCGEM,Automatic
3,01/08/1965,18:49:43,-59.076,-23.557,Earthquake,15.0,,,5.8,MW,,,,,,,ISCGEM860856,ISCGEM,ISCGEM,ISCGEM,Automatic
4,01/09/1965,13:32:50,11.938,126.427,Earthquake,15.0,,,5.8,MW,,,,,,,ISCGEM860890,ISCGEM,ISCGEM,ISCGEM,Automatic


In [4]:
eq.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23412 entries, 0 to 23411
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Date                        23412 non-null  object 
 1   Time                        23412 non-null  object 
 2   Latitude                    23412 non-null  float64
 3   Longitude                   23412 non-null  float64
 4   Type                        23412 non-null  object 
 5   Depth                       23412 non-null  float64
 6   Depth Error                 4461 non-null   float64
 7   Depth Seismic Stations      7097 non-null   float64
 8   Magnitude                   23412 non-null  float64
 9   Magnitude Type              23409 non-null  object 
 10  Magnitude Error             327 non-null    float64
 11  Magnitude Seismic Stations  2564 non-null   float64
 12  Azimuthal Gap               7299 non-null   float64
 13  Horizontal Distance         160

##### Parsing Dates
The first thing to notice is that the 'Date' and 'Time' columns are stored as strings. We start by combining these two columns and parsing the result to datetime format.<br/>
Note, however, that a small portion of the data is written in a different style and needs separate handling. To find these rows, we search for 'Date' values longer than 10 characters.

In [5]:
eq[eq['Date'].str.len() > 10]

Unnamed: 0,Date,Time,Latitude,Longitude,Type,Depth,Depth Error,Depth Seismic Stations,Magnitude,Magnitude Type,Magnitude Error,Magnitude Seismic Stations,Azimuthal Gap,Horizontal Distance,Horizontal Error,Root Mean Square,ID,Source,Location Source,Magnitude Source,Status
3378,1975-02-23T02:58:41.000Z,1975-02-23T02:58:41.000Z,8.017,124.075,Earthquake,623.0,,,5.6,MB,,,,,,,USP0000A09,US,US,US,Reviewed
7512,1985-04-28T02:53:41.530Z,1985-04-28T02:53:41.530Z,-32.998,-71.766,Earthquake,33.0,,,5.6,MW,,,,,,1.3,USP0002E81,US,US,HRV,Reviewed
20650,2011-03-13T02:23:34.520Z,2011-03-13T02:23:34.520Z,36.344,142.344,Earthquake,10.1,13.9,289.0,5.8,MWC,,,32.3,,,1.06,USP000HWQP,US,US,GCMT,Reviewed


In [6]:
# Now that we know the labels of the malformed rows, fix them in place with .loc.
# Both columns hold the full ISO timestamp: characters 0-9 are the date, 11-18 the time.
bad_rows = [3378, 7512, 20650]
eq.loc[bad_rows, 'Date'] = eq.loc[bad_rows, 'Date'].str[:10]
eq.loc[bad_rows, 'Time'] = eq.loc[bad_rows, 'Time'].str[11:19]


In [7]:
# and then combine both columns and convert the result to datetime format.
eq['Date_Time'] = eq['Date'] + ' ' + eq['Time']
eq['Date_Time'] = pd.to_datetime(eq['Date_Time'])

In [8]:
eq.iloc[[3378,7512,20650]]

Unnamed: 0,Date,Time,Latitude,Longitude,Type,Depth,Depth Error,Depth Seismic Stations,Magnitude,Magnitude Type,Magnitude Error,Magnitude Seismic Stations,Azimuthal Gap,Horizontal Distance,Horizontal Error,Root Mean Square,ID,Source,Location Source,Magnitude Source,Status,Date_Time
3378,1975-02-23,02:58:41,8.017,124.075,Earthquake,623.0,,,5.6,MB,,,,,,,USP0000A09,US,US,US,Reviewed,1975-02-23 02:58:41
7512,1985-04-28,02:53:41,-32.998,-71.766,Earthquake,33.0,,,5.6,MW,,,,,,1.3,USP0002E81,US,US,HRV,Reviewed,1985-04-28 02:53:41
20650,2011-03-13,02:23:34,36.344,142.344,Earthquake,10.1,13.9,289.0,5.8,MWC,,,32.3,,,1.06,USP000HWQP,US,US,GCMT,Reviewed,2011-03-13 02:23:34


In [9]:
# now it should be safe to remove both 'Date' and 'Time' columns
eq = eq.drop(['Date','Time'],axis = 1)
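
The date handling above can also be written without hard-coded row positions. The sketch below is an alternative, not the approach used in this notebook: it parses the regular rows with an explicit month-first format, then parses whatever failed as an ISO timestamp. The function name `parse_mixed_datetimes` and both format strings are assumptions based on the rows shown above. Explicit formats also avoid relying on format inference, which newer pandas versions may apply more strictly to mixed-format columns.

```python
import pandas as pd

def parse_mixed_datetimes(date: pd.Series, time: pd.Series) -> pd.Series:
    """Combine 'Date' and 'Time' strings into one datetime Series.

    Assumes regular rows look like '01/02/1965' + '13:44:18' and that the
    irregular rows carry a full ISO timestamp ('1975-02-23T02:58:41.000Z')
    in the 'Date' column.
    """
    # First pass: month-first date plus time; non-matching rows become NaT.
    regular = pd.to_datetime(date + ' ' + time,
                             format='%m/%d/%Y %H:%M:%S', errors='coerce')
    # Second pass: only the rows that failed, parsed as ISO timestamps.
    leftovers = date.where(regular.isna())
    iso = pd.to_datetime(leftovers,
                         format='%Y-%m-%dT%H:%M:%S.%fZ', errors='coerce')
    return regular.fillna(iso)
```

Any row that matches neither format stays `NaT`, so a quick `isna().sum()` on the result shows whether further styles are hiding in the data.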

In [10]:
cols = eq.columns.tolist()
cols = cols[-1:] + cols[:-1]
cols

['Date_Time',
 'Latitude',
 'Longitude',
 'Type',
 'Depth',
 'Depth Error',
 'Depth Seismic Stations',
 'Magnitude',
 'Magnitude Type',
 'Magnitude Error',
 'Magnitude Seismic Stations',
 'Azimuthal Gap',
 'Horizontal Distance',
 'Horizontal Error',
 'Root Mean Square',
 'ID',
 'Source',
 'Location Source',
 'Magnitude Source',
 'Status']

In [11]:
eq = eq[cols]
eq.head()

Unnamed: 0,Date_Time,Latitude,Longitude,Type,Depth,Depth Error,Depth Seismic Stations,Magnitude,Magnitude Type,Magnitude Error,Magnitude Seismic Stations,Azimuthal Gap,Horizontal Distance,Horizontal Error,Root Mean Square,ID,Source,Location Source,Magnitude Source,Status
0,1965-01-02 13:44:18,19.246,145.616,Earthquake,131.6,,,6.0,MW,,,,,,,ISCGEM860706,ISCGEM,ISCGEM,ISCGEM,Automatic
1,1965-01-04 11:29:49,1.863,127.352,Earthquake,80.0,,,5.8,MW,,,,,,,ISCGEM860737,ISCGEM,ISCGEM,ISCGEM,Automatic
2,1965-01-05 18:05:58,-20.579,-173.972,Earthquake,20.0,,,6.2,MW,,,,,,,ISCGEM860762,ISCGEM,ISCGEM,ISCGEM,Automatic
3,1965-01-08 18:49:43,-59.076,-23.557,Earthquake,15.0,,,5.8,MW,,,,,,,ISCGEM860856,ISCGEM,ISCGEM,ISCGEM,Automatic
4,1965-01-09 13:32:50,11.938,126.427,Earthquake,15.0,,,5.8,MW,,,,,,,ISCGEM860890,ISCGEM,ISCGEM,ISCGEM,Automatic


In [12]:
eq.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23412 entries, 0 to 23411
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   Date_Time                   23412 non-null  datetime64[ns]
 1   Latitude                    23412 non-null  float64       
 2   Longitude                   23412 non-null  float64       
 3   Type                        23412 non-null  object        
 4   Depth                       23412 non-null  float64       
 5   Depth Error                 4461 non-null   float64       
 6   Depth Seismic Stations      7097 non-null   float64       
 7   Magnitude                   23412 non-null  float64       
 8   Magnitude Type              23409 non-null  object        
 9   Magnitude Error             327 non-null    float64       
 10  Magnitude Seismic Stations  2564 non-null   float64       
 11  Azimuthal Gap               7299 non-null   float64   

##### Dropping Columns and Values
There are several things to consider when dropping columns and values.
1. 'Type' column: since the target of this project relates to volcanic activity, it is safe to drop events of type Nuclear Explosion, Explosion, and Rock Burst.
2. 'Depth Error' and 'Magnitude Error' hold the uncertainty of the earthquake features. However, I have no clear idea how to treat these values, since no specific explanation is given, and both are mostly missing (4461 and 327 non-null values out of 23412). Therefore, they are dropped.
3. The same goes for 'Depth Seismic Stations' and 'Magnitude Seismic Stations'.
4. In 'Azimuthal Gap' and 'Horizontal Distance', smaller numbers indicate more reliable data. However, when I checked the rows with high values in both columns, the 'Status' column showed that they had been reviewed by a human expert, so these rows are kept.
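
To back these decisions with numbers, the share of missing values per column can be tabulated. This is a minimal sketch; the helper names `missing_report` and `sparse_columns` and the 0.5 threshold are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Non-null count and share of missing values per column, worst first."""
    report = pd.DataFrame({
        'non_null': frame.notna().sum(),
        'missing_share': frame.isna().mean(),
    })
    return report.sort_values('missing_share', ascending=False)

def sparse_columns(frame: pd.DataFrame, threshold: float = 0.5) -> list:
    """Columns whose share of missing values exceeds the (assumed) threshold."""
    share = frame.isna().mean()
    return share[share > threshold].index.tolist()
```

Calling `sparse_columns(eq)` would list the candidates for dropping; whether a column is actually dropped remains a judgment call.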


In [13]:
# filter the dataset by type
eq = eq[eq['Type'] == 'Earthquake']

In [14]:
# use only relevant columns for this project
eq = eq[['Date_Time','Latitude','Longitude','Depth','Magnitude','Magnitude Type']]

In [15]:
eq.reset_index(drop = True, inplace = True)

In [16]:
# notice that there are missing values in 'Magnitude Type'
eq.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23232 entries, 0 to 23231
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Date_Time       23232 non-null  datetime64[ns]
 1   Latitude        23232 non-null  float64       
 2   Longitude       23232 non-null  float64       
 3   Depth           23232 non-null  float64       
 4   Magnitude       23232 non-null  float64       
 5   Magnitude Type  23229 non-null  object        
dtypes: datetime64[ns](1), float64(4), object(1)
memory usage: 1.1+ MB


In [17]:
eq[eq['Magnitude Type'].isna()]

Unnamed: 0,Date_Time,Latitude,Longitude,Depth,Magnitude,Magnitude Type
6598,1983-08-24 13:36:00,40.3732,-124.9227,11.93,5.7,
7172,1984-11-23 18:08:00,37.46,-118.59,9.0,5.82,
7786,1986-03-31 11:55:00,37.4788,-121.6858,9.17,5.6,


<b>Dealing with Missing Values</b><br/>
Since there are only three missing values, we can infer the Magnitude Type from rows with similar Depth and Magnitude and fill the gaps with the mode of those rows. For the ranges checked below, the mode is MWW.

In [18]:
eq_nan_a = eq.loc[((eq['Depth'].values < 9.5) & (eq['Depth'].values >= 9) & (eq['Magnitude'].values <= 5.9) & (eq['Magnitude'].values >= 5.6))]

In [19]:
eq_nan_a['Magnitude Type'].mode()

0    MWW
dtype: object

In [20]:
eq_nan_b = eq.loc[((eq['Depth'].values < 12.5) & (eq['Depth'].values >= 11.5) & (eq['Magnitude'].values <= 5.8) & (eq['Magnitude'].values >= 5.6))]

In [21]:
eq_nan_b['Magnitude Type'].mode()

0    MWW
dtype: object

In [22]:
eq['Magnitude Type'] = eq['Magnitude Type'].fillna('MWW')
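
The neighbourhood lookup used in the cells above can be wrapped in a small helper, so the same rule applies to every missing row. This is a sketch under assumptions: the name `mode_of_similar` and the tolerance defaults are hypothetical, and the windows are symmetric around the target row, which differs slightly from the hand-picked ranges above.

```python
import pandas as pd

def mode_of_similar(frame, depth, magnitude,
                    depth_tol=0.5, mag_tol=0.15, col='Magnitude Type'):
    """Most frequent `col` value among rows close to the given depth and magnitude.

    Returns None when no comparable row has a known value.
    """
    near = frame[
        (frame['Depth'].sub(depth).abs() <= depth_tol)
        & (frame['Magnitude'].sub(magnitude).abs() <= mag_tol)
    ]
    modes = near[col].dropna().mode()
    # mode() returns sorted values, so ties resolve to the first one.
    return modes.iloc[0] if not modes.empty else None
```

For example, `mode_of_similar(eq, depth=9.17, magnitude=5.6)` would look up the neighbours of the third missing row.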

In [23]:
eq.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23232 entries, 0 to 23231
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Date_Time       23232 non-null  datetime64[ns]
 1   Latitude        23232 non-null  float64       
 2   Longitude       23232 non-null  float64       
 3   Depth           23232 non-null  float64       
 4   Magnitude       23232 non-null  float64       
 5   Magnitude Type  23232 non-null  object        
dtypes: datetime64[ns](1), float64(4), object(1)
memory usage: 1.1+ MB


In [24]:
eq.to_csv('./data/clean_earthquake.csv', index = False)

###  2 Volcanic Eruptions in the Holocene Period

Column Description:
1. <b>Number</b> : Unique ID for each Volcano.
2. <b>Name</b> : Name of the Volcano. 
3. <b>Country</b> : The country where the volcano is located.
4. <b>Region</b> : The region where the volcano is located.
5. <b>Type</b> : Type of the volcano.
6. <b>Activity Evidence</b> : Evidence of eruption activity.
7. <b>Last Known Eruption</b> : The year in which the latest eruption occurred.
8. <b>Latitude</b> : Decimal degrees latitude.
9. <b>Longitude</b> : Decimal degrees longitude.
10. <b>Elevation (Meters)</b> : Elevation of the volcano.
11. <b>Dominant Rock Type</b> : The volcano's dominant rock type.
12. <b>Tectonic Setting</b> : The tectonic setting where the eruption activity occurred.

In [25]:
ve.head()

Unnamed: 0,Number,Name,Country,Region,Type,Activity Evidence,Last Known Eruption,Latitude,Longitude,Elevation (Meters),Dominant Rock Type,Tectonic Setting
0,210010,West Eifel Volcanic Field,Germany,Mediterranean and Western Asia,Maar(s),Eruption Dated,8300 BCE,50.17,6.85,600,Foidite,Rift Zone / Continental Crust (>25 km)
1,210020,Chaine des Puys,France,Mediterranean and Western Asia,Lava dome(s),Eruption Dated,4040 BCE,45.775,2.97,1464,Basalt / Picro-Basalt,Rift Zone / Continental Crust (>25 km)
2,210030,Olot Volcanic Field,Spain,Mediterranean and Western Asia,Pyroclastic cone(s),Evidence Credible,Unknown,42.17,2.53,893,Trachybasalt / Tephrite Basanite,Intraplate / Continental Crust (>25 km)
3,210040,Calatrava Volcanic Field,Spain,Mediterranean and Western Asia,Pyroclastic cone(s),Eruption Dated,3600 BCE,38.87,-4.02,1117,Basalt / Picro-Basalt,Intraplate / Continental Crust (>25 km)
4,211001,Larderello,Italy,Mediterranean and Western Asia,Explosion crater(s),Eruption Observed,1282 CE,43.25,10.87,500,No Data,Subduction Zone / Continental Crust (>25 km)


In [26]:
ve.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1508 entries, 0 to 1507
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Number               1508 non-null   int64  
 1   Name                 1508 non-null   object 
 2   Country              1508 non-null   object 
 3   Region               1508 non-null   object 
 4   Type                 1508 non-null   object 
 5   Activity Evidence    1507 non-null   object 
 6   Last Known Eruption  1508 non-null   object 
 7   Latitude             1508 non-null   float64
 8   Longitude            1508 non-null   float64
 9   Elevation (Meters)   1508 non-null   int64  
 10  Dominant Rock Type   1455 non-null   object 
 11  Tectonic Setting     1501 non-null   object 
dtypes: float64(2), int64(2), object(8)
memory usage: 141.5+ KB


In [27]:
# Since our earthquake dataset is limited to 1965 - 2016 CE, we're going to drop eruptions that happened before it.
ve.drop(ve[ve['Last Known Eruption'].str.contains('BCE')].index,inplace =True)

In [28]:
ve.drop(ve[ve['Last Known Eruption'].str.contains('Unknown')].index,inplace = True)

In [29]:
ve['Last Known Eruption'] = ve['Last Known Eruption'].str.replace(' CE','')

In [30]:
ve['Last Known Eruption'] = pd.to_numeric(ve['Last Known Eruption'])
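
The year clean-up in the cells above (dropping 'BCE' and 'Unknown', then stripping ' CE') can also be expressed as one conversion to signed years. This is a sketch of an alternative, with the hypothetical helper name `eruption_year`. It assumes labels look like '8300 BCE', '1282 CE', or 'Unknown', as in the rows shown earlier.

```python
import pandas as pd

def eruption_year(labels: pd.Series) -> pd.Series:
    """Convert 'Last Known Eruption' labels to signed years.

    'CE' years are positive, 'BCE' years negative, and anything else
    (for example 'Unknown') becomes NaN.
    """
    parts = labels.str.extract(r'^(?P<year>\d+)\s+(?P<era>BCE|CE)$')
    year = pd.to_numeric(parts['year'], errors='coerce')
    sign = parts['era'].map({'CE': 1, 'BCE': -1})
    return year * sign
```

With signed years, the 1965 to 2016 window becomes a single `between(1965, 2016)` filter, and rows with unknown dates fall out because NaN never satisfies it.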

In [31]:
index = ve.loc[((ve['Last Known Eruption'].values <= 2016) & (ve['Last Known Eruption'].values >= 1965))].index
index

Int64Index([  11,   15,   44,   45,   50,   52,   54,   58,   68,   70,
            ...
            1465, 1470, 1472, 1481, 1491, 1496, 1497, 1498, 1499, 1503],
           dtype='int64', length=324)

In [32]:
ve = ve.loc[index]

In [33]:
ve.reset_index(inplace = True)
ve.drop('index',axis = 1,inplace = True)

In [34]:
# There are still missing values in 'Activity Evidence' and 'Dominant Rock Type'.
ve.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 324 entries, 0 to 323
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Number               324 non-null    int64  
 1   Name                 324 non-null    object 
 2   Country              324 non-null    object 
 3   Region               324 non-null    object 
 4   Type                 324 non-null    object 
 5   Activity Evidence    323 non-null    object 
 6   Last Known Eruption  324 non-null    int64  
 7   Latitude             324 non-null    float64
 8   Longitude            324 non-null    float64
 9   Elevation (Meters)   324 non-null    int64  
 10  Dominant Rock Type   322 non-null    object 
 11  Tectonic Setting     324 non-null    object 
dtypes: float64(2), int64(3), object(7)
memory usage: 30.5+ KB


In [35]:
ve[ve['Activity Evidence'].isna()==True]

Unnamed: 0,Number,Name,Country,Region,Type,Activity Evidence,Last Known Eruption,Latitude,Longitude,Elevation (Meters),Dominant Rock Type,Tectonic Setting
162,284305,Mariana Back-Arc Segment at 15.5°N,United States,"Japan, Taiwan, Marianas",Submarine,,2015,15.406,144.506,-4100,,Subduction Zone / Oceanic Crust (< 15 km)


In [36]:
# 'Activity Evidence' will not really be useful for this project, so we fill it with the mode for the time being.
ve['Activity Evidence'] = ve['Activity Evidence'].fillna(ve['Activity Evidence'].mode()[0])

In [37]:
# in 'Dominant Rock Type', some missing values are stored as 'No Data'
# we need to replace these with NaN.
ve[ve['Dominant Rock Type'] == 'No Data']

Unnamed: 0,Number,Name,Country,Region,Type,Activity Evidence,Last Known Eruption,Latitude,Longitude,Elevation (Meters),Dominant Rock Type,Tectonic Setting
4,221041,Dallol,Ethiopia,Africa and Red Sea,Explosion crater(s),Eruption Observed,2011,14.242,40.3,-48,No Data,Rift Zone / Intermediate Crust (15-25 km)
29,243030,Unnamed,Tonga,New Zealand to Fiji,Submarine,Eruption Observed,1999,-20.85,-175.53,-13,No Data,Subduction Zone / Oceanic Crust (< 15 km)
40,250030,Unnamed,Papua New Guinea,Melanesia and Australia,Submarine,Eruption Dated,1972,-3.03,147.78,-1300,No Data,Rift Zone / Continental Crust (>25 km)
158,284193,South Sarigan Seamount,United States,"Japan, Taiwan, Marianas",Submarine,Eruption Observed,2010,16.58,145.78,-184,No Data,Subduction Zone / Crust Thickness Unknown
173,290160,Unnamed,Russia,Kuril Islands,Submarine,Eruption Dated,1972,46.47,151.28,-502,No Data,Subduction Zone / Oceanic Crust (< 15 km)
294,358059,Arenales,Chile,South America,Stratovolcano,Eruption Observed,1979,-47.2,-73.48,3437,No Data,Subduction Zone / Continental Crust (>25 km)
310,377020,East Gakkel Ridge at 85°E,Undersea Features,Iceland and Arctic Ocean,Submarine,Evidence Credible,1999,85.608,85.25,-3800,No Data,Rift Zone / Oceanic Crust (< 15 km)
315,385052,Unnamed,Undersea Features,Atlantic Ocean,Submarine,Eruption Observed,2002,-32.958,-5.22,0,No Data,Intraplate / Oceanic Crust (< 15 km)


In [38]:
ve['Dominant Rock Type'] = ve['Dominant Rock Type'].replace('No Data',np.nan)

In [39]:
ve[ve['Dominant Rock Type'].isna()==True]

Unnamed: 0,Number,Name,Country,Region,Type,Activity Evidence,Last Known Eruption,Latitude,Longitude,Elevation (Meters),Dominant Rock Type,Tectonic Setting
4,221041,Dallol,Ethiopia,Africa and Red Sea,Explosion crater(s),Eruption Observed,2011,14.242,40.3,-48,,Rift Zone / Intermediate Crust (15-25 km)
26,242005,Havre Seamount,New Zealand,New Zealand to Fiji,Submarine,Eruption Observed,2012,-31.08,-179.033,-897,,Subduction Zone / Oceanic Crust (< 15 km)
29,243030,Unnamed,Tonga,New Zealand to Fiji,Submarine,Eruption Observed,1999,-20.85,-175.53,-13,,Subduction Zone / Oceanic Crust (< 15 km)
40,250030,Unnamed,Papua New Guinea,Melanesia and Australia,Submarine,Eruption Dated,1972,-3.03,147.78,-1300,,Rift Zone / Continental Crust (>25 km)
158,284193,South Sarigan Seamount,United States,"Japan, Taiwan, Marianas",Submarine,Eruption Observed,2010,16.58,145.78,-184,,Subduction Zone / Crust Thickness Unknown
162,284305,Mariana Back-Arc Segment at 15.5°N,United States,"Japan, Taiwan, Marianas",Submarine,Eruption Observed,2015,15.406,144.506,-4100,,Subduction Zone / Oceanic Crust (< 15 km)
173,290160,Unnamed,Russia,Kuril Islands,Submarine,Eruption Dated,1972,46.47,151.28,-502,,Subduction Zone / Oceanic Crust (< 15 km)
294,358059,Arenales,Chile,South America,Stratovolcano,Eruption Observed,1979,-47.2,-73.48,3437,,Subduction Zone / Continental Crust (>25 km)
310,377020,East Gakkel Ridge at 85°E,Undersea Features,Iceland and Arctic Ocean,Submarine,Evidence Credible,1999,85.608,85.25,-3800,,Rift Zone / Oceanic Crust (< 15 km)
315,385052,Unnamed,Undersea Features,Atlantic Ocean,Submarine,Eruption Observed,2002,-32.958,-5.22,0,,Intraplate / Oceanic Crust (< 15 km)


In [40]:
# Look for similar volcanoes in the rest of the dataset. By its coordinates, Arenales is closest to Lautaro,
# therefore we fill Arenales' Dominant Rock Type with 'Dacite'.
ve.loc[((ve['Type'] == 'Stratovolcano') & (ve['Tectonic Setting'] == 'Subduction Zone / Continental Crust (>25 km)') & (ve['Country'] == 'Chile'))]

Unnamed: 0,Number,Name,Country,Region,Type,Activity Evidence,Last Known Eruption,Latitude,Longitude,Elevation (Meters),Dominant Rock Type,Tectonic Setting
282,357060,"Azul, Cerro",Chile,South America,Stratovolcano,Eruption Observed,1967,-35.653,-70.761,3788,Dacite,Subduction Zone / Continental Crust (>25 km)
283,357070,"Chillan, Nevados de",Chile,South America,Stratovolcano,Eruption Observed,2016,-36.863,-71.377,3212,Andesite / Basaltic Andesite,Subduction Zone / Continental Crust (>25 km)
285,357091,Callaqui,Chile,South America,Stratovolcano,Eruption Observed,1980,-37.92,-71.45,3164,Andesite / Basaltic Andesite,Subduction Zone / Continental Crust (>25 km)
286,357100,Lonquimay,Chile,South America,Stratovolcano,Eruption Observed,1990,-38.379,-71.586,2832,Andesite / Basaltic Andesite,Subduction Zone / Continental Crust (>25 km)
287,357110,Llaima,Chile,South America,Stratovolcano,Eruption Observed,2009,-38.692,-71.729,3125,Basalt / Picro-Basalt,Subduction Zone / Continental Crust (>25 km)
288,357120,Villarrica,Chile,South America,Stratovolcano,Eruption Observed,2016,-39.42,-71.93,2847,Basalt / Picro-Basalt,Subduction Zone / Continental Crust (>25 km)
290,357150,Puyehue-Cordon Caulle,Chile,South America,Stratovolcano,Eruption Observed,2012,-40.59,-72.117,2236,Andesite / Basaltic Andesite,Subduction Zone / Continental Crust (>25 km)
291,358020,Calbuco,Chile,South America,Stratovolcano,Eruption Observed,2015,-41.33,-72.618,1974,Andesite / Basaltic Andesite,Subduction Zone / Continental Crust (>25 km)
293,358057,"Hudson, Cerro",Chile,South America,Stratovolcano,Eruption Observed,2011,-45.9,-72.97,1905,Basalt / Picro-Basalt,Subduction Zone / Continental Crust (>25 km)
294,358059,Arenales,Chile,South America,Stratovolcano,Eruption Observed,1979,-47.2,-73.48,3437,,Subduction Zone / Continental Crust (>25 km)


In [41]:
ve.loc[294, 'Dominant Rock Type'] = 'Dacite'


In [42]:
# For the missing 'Dominant Rock Type' values of Submarine volcanoes, we assume a basaltic rock type.


# Since the Dallol volcano does not have many similar entries, we use its tectonic setting and take the mode
# of 'Dominant Rock Type' there, which is 'Basalt / Picro-Basalt'.

ve.loc[(ve['Tectonic Setting'] == 'Rift Zone / Intermediate Crust (15-25 km)')]

Unnamed: 0,Number,Name,Country,Region,Type,Activity Evidence,Last Known Eruption,Latitude,Longitude,Elevation (Meters),Dominant Rock Type,Tectonic Setting
4,221041,Dallol,Ethiopia,Africa and Red Sea,Explosion crater(s),Eruption Observed,2011,14.242,40.3,-48,,Rift Zone / Intermediate Crust (15-25 km)
5,221060,Alu-Dalafilla,Ethiopia,Africa and Red Sea,Fissure vent(s),Eruption Observed,2008,13.793,40.553,578,Basalt / Picro-Basalt,Rift Zone / Intermediate Crust (15-25 km)
6,221080,Erta Ale,Ethiopia,Africa and Red Sea,Shield,Eruption Observed,2016,13.6,40.67,613,Basalt / Picro-Basalt,Rift Zone / Intermediate Crust (15-25 km)
7,221101,Nabro,Eritrea,Africa and Red Sea,Stratovolcano,Eruption Observed,2012,13.37,41.7,2218,Trachyte / Trachydacite,Rift Zone / Intermediate Crust (15-25 km)
8,221113,Dabbahu,Ethiopia,Africa and Red Sea,Stratovolcano,Eruption Observed,2005,12.595,40.48,1401,Basalt / Picro-Basalt,Rift Zone / Intermediate Crust (15-25 km)
9,221115,Manda Hararo,Ethiopia,Africa and Red Sea,Shield(s),Eruption Observed,2009,12.17,40.82,600,Basalt / Picro-Basalt,Rift Zone / Intermediate Crust (15-25 km)
10,221126,Ardoukoba,Djibouti,Africa and Red Sea,Fissure vent(s),Eruption Observed,1978,11.58,42.47,298,Basalt / Picro-Basalt,Rift Zone / Intermediate Crust (15-25 km)


In [43]:
ve['Dominant Rock Type'] = ve['Dominant Rock Type'].fillna('Basalt / Picro-Basalt')
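
A less manual variant of this fill takes the mode within each group instead of one global value. The sketch below is not the method used above: the helper name `fill_with_group_mode` is hypothetical, and grouping by 'Tectonic Setting' is only one possible choice.

```python
import pandas as pd

def fill_with_group_mode(frame: pd.DataFrame, target: str, by: str) -> pd.Series:
    """Fill missing `target` values with the most frequent value in the same `by` group.

    Groups with no known value at all are left unfilled.
    """
    def _fill(values: pd.Series) -> pd.Series:
        modes = values.mode()  # ignores NaN; ties resolve to the first sorted value
        return values.fillna(modes.iloc[0]) if not modes.empty else values

    return frame.groupby(by)[target].transform(_fill)
```

A call such as `fill_with_group_mode(ve, 'Dominant Rock Type', 'Tectonic Setting')` returns a new Series and leaves the original column untouched, so the result can be compared with the blanket fill before committing to either.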

In [44]:
ve.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 324 entries, 0 to 323
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Number               324 non-null    int64  
 1   Name                 324 non-null    object 
 2   Country              324 non-null    object 
 3   Region               324 non-null    object 
 4   Type                 324 non-null    object 
 5   Activity Evidence    324 non-null    object 
 6   Last Known Eruption  324 non-null    int64  
 7   Latitude             324 non-null    float64
 8   Longitude            324 non-null    float64
 9   Elevation (Meters)   324 non-null    int64  
 10  Dominant Rock Type   324 non-null    object 
 11  Tectonic Setting     324 non-null    object 
dtypes: float64(2), int64(3), object(7)
memory usage: 30.5+ KB


In [45]:
ve.describe()

Unnamed: 0,Number,Last Known Eruption,Latitude,Longitude,Elevation (Meters)
count,324.0,324.0,324.0,324.0,324.0
mean,297801.348765,2001.762346,10.174343,27.024312,1542.472222
std,45200.35806,15.021419,31.342992,121.486938,1609.867361
min,211040.0,1966.0,-77.53,-179.033,-4100.0
25%,263307.5,1992.0,-8.442,-84.85825,634.5
50%,284258.0,2008.0,10.306,81.5375,1484.5
75%,342042.5,2015.0,36.459,139.95075,2428.25
max,390130.0,2016.0,85.608,179.58,5967.0


In [46]:
ve.to_csv('./data/clean_volcanic_eruption.csv', index = False)

## Create New Dataset
Now that both datasets are clean, we create a new dataset: for each earthquake in the cleaned earthquake dataset we find the nearest volcanic eruption and its distance, then merge the result with the cleaned volcanic dataset.

In [47]:
clean_eq = pd.read_csv('./data/clean_earthquake.csv')
clean_ve = pd.read_csv('./data/clean_volcanic_eruption.csv')

### Getting nearest eruption
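To get the nearest eruption we use cdist from scipy.spatial.distance and look up the name of the nearest volcano. By default cdist computes Euclidean distance, applied here to latitude/longitude pairs in degrees.<br/>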
<br/>
[source](https://stackoverflow.com/a/39318808)

In [48]:
# import cdist
from scipy.spatial.distance import cdist

# work on copies so the main dataframes are not modified along the way.
nv_eq = clean_eq[['Latitude','Longitude']].copy()
nv_ve = clean_ve[['Latitude','Longitude','Name']].copy()

# generate lists of (lat, lon) tuples for both datasets, where x is lat and y is lon.
nv_eq['points_eq'] = [(x, y) for x,y in zip(nv_eq['Latitude'], nv_eq['Longitude'])]
nv_ve['points_ve'] = [(x, y) for x,y in zip(nv_ve['Latitude'], nv_ve['Longitude'])]

# find the nearest eruption site for each earthquake
ve_points = list(nv_ve['points_ve'])
nv_eq['near_volcano_points'] = [ve_points[cdist([x], ve_points).argmin()] for x in nv_eq['points_eq']]

# match coordinate points with the volcano's name
nv_eq['near_volcano_names'] = [nv_ve[nv_ve['points_ve'] == x]['Name'].values[0] for x in nv_eq['near_volcano_points']]

In [49]:
# append the nearest volcano names to the main dataframe
clean_eq['Nearest Volcano'] = nv_eq['near_volcano_names'] 

In [50]:
clean_eq.head()

Unnamed: 0,Date_Time,Latitude,Longitude,Depth,Magnitude,Magnitude Type,Nearest Volcano
0,1965-01-02 13:44:18,19.246,145.616,131.6,6.0,MW,Supply Reef
1,1965-01-04 11:29:49,1.863,127.352,80.0,5.8,MW,Ibu
2,1965-01-05 18:05:58,-20.579,-173.972,20.0,6.2,MW,Tofua
3,1965-01-08 18:49:43,-59.076,-23.557,15.0,5.8,MW,Montagu Island
4,1965-01-09 13:32:50,11.938,126.427,15.0,5.8,MW,Bulusan
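
Euclidean distance on raw latitude/longitude degrees is only an approximation: a degree of longitude shrinks towards the poles, and points on either side of the 180° meridian look far apart. The sketch below swaps in the haversine great-circle distance and also returns the distance itself. It is an alternative under assumptions, not the method used above; the function names and the mean Earth radius constant are illustrative choices.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def nearest_by_great_circle(eq_lat, eq_lon, ve_lat, ve_lon):
    """For each earthquake, return (index of nearest volcano, distance in km)."""
    # Shape (n, 1) against (1, m) broadcasts to an (n, m) distance matrix.
    eq_lat = np.asarray(eq_lat, dtype=float)[:, None]
    eq_lon = np.asarray(eq_lon, dtype=float)[:, None]
    ve_lat = np.asarray(ve_lat, dtype=float)[None, :]
    ve_lon = np.asarray(ve_lon, dtype=float)[None, :]
    dist = haversine_km(eq_lat, eq_lon, ve_lat, ve_lon)
    idx = dist.argmin(axis=1)
    return idx, dist[np.arange(len(idx)), idx]
```

The returned positions can be used to pick rows from the volcano dataframe with `iloc`, which also avoids looking names up by coordinates.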


### Merge Dataset
Merge both datasets on the volcano name. The key column must have the same name in both datasets.

In [51]:
clean_ve = clean_ve.rename(columns = {'Name' : 'Nearest Volcano'})
clean_ve.head()

Unnamed: 0,Number,Nearest Volcano,Country,Region,Type,Activity Evidence,Last Known Eruption,Latitude,Longitude,Elevation (Meters),Dominant Rock Type,Tectonic Setting
0,211040,Stromboli,Italy,Mediterranean and Western Asia,Stratovolcano,Eruption Observed,2016,38.789,15.213,924,Trachyandesite / Basaltic Trachyandesite,Subduction Zone / Continental Crust (>25 km)
1,211060,Etna,Italy,Mediterranean and Western Asia,Stratovolcano(es),Eruption Observed,2016,37.734,15.004,3330,Trachybasalt / Tephrite Basanite,Subduction Zone / Continental Crust (>25 km)
2,221010,"Tair, Jebel at",Yemen,Africa and Red Sea,Stratovolcano,Eruption Observed,2008,15.55,41.83,244,Trachybasalt / Tephrite Basanite,Rift Zone / Oceanic Crust (< 15 km)
3,221020,Zubair Group,Yemen,Africa and Red Sea,Shield,Eruption Observed,2013,15.05,42.18,191,Basalt / Picro-Basalt,Rift Zone / Oceanic Crust (< 15 km)
4,221041,Dallol,Ethiopia,Africa and Red Sea,Explosion crater(s),Eruption Observed,2011,14.242,40.3,-48,Basalt / Picro-Basalt,Rift Zone / Intermediate Crust (15-25 km)


In [52]:
main_df = pd.merge(clean_eq, clean_ve, how = 'left', on = 'Nearest Volcano')

In [53]:
main_df.head()

Unnamed: 0,Date_Time,Latitude_x,Longitude_x,Depth,Magnitude,Magnitude Type,Nearest Volcano,Number,Country,Region,Type,Activity Evidence,Last Known Eruption,Latitude_y,Longitude_y,Elevation (Meters),Dominant Rock Type,Tectonic Setting
0,1965-01-02 13:44:18,19.246,145.616,131.6,6.0,MW,Supply Reef,284142,United States,"Japan, Taiwan, Marianas",Submarine,Eruption Dated,1989,20.13,145.1,-8,Andesite / Basaltic Andesite,Subduction Zone / Crust Thickness Unknown
1,1965-01-04 11:29:49,1.863,127.352,80.0,5.8,MW,Ibu,268030,Indonesia,Indonesia,Stratovolcano,Eruption Observed,2016,1.488,127.63,1325,Andesite / Basaltic Andesite,Subduction Zone / Oceanic Crust (< 15 km)
2,1965-01-05 18:05:58,-20.579,-173.972,20.0,6.2,MW,Tofua,243060,Tonga,New Zealand to Fiji,Caldera,Eruption Observed,2014,-19.75,-175.07,515,Andesite / Basaltic Andesite,Subduction Zone / Oceanic Crust (< 15 km)
3,1965-01-08 18:49:43,-59.076,-23.557,15.0,5.8,MW,Montagu Island,390081,United Kingdom,Antarctica,Shield,Eruption Observed,2007,-58.445,-26.374,1370,Basalt / Picro-Basalt,Subduction Zone / Oceanic Crust (< 15 km)
4,1965-01-09 13:32:50,11.938,126.427,15.0,5.8,MW,Bulusan,273010,Philippines,Philippines and SE Asia,Stratovolcano(es),Eruption Observed,2016,12.77,124.05,1565,Andesite / Basaltic Andesite,Subduction Zone / Continental Crust (>25 km)
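
A left merge on a name column silently duplicates earthquake rows whenever the same name occurs more than once in the volcano table. The sample rows later in this notebook show several different volcanoes that share the name `Unnamed` (their `Number` values differ), so this may be worth checking. The sketch below uses toy dataframes (`eq_demo` and `ve_demo` are illustrative, not the notebook's data) to show how the `validate` and `indicator` arguments of `pd.merge` can surface both duplicated keys and unmatched rows.

```python
import pandas as pd

# Toy stand-ins for clean_eq and clean_ve (illustrative values only)
eq_demo = pd.DataFrame({
    'Magnitude': [5.8, 6.2, 6.0],
    'Nearest Volcano': ['Ibu', 'Tofua', 'Ibu'],
})
ve_demo = pd.DataFrame({
    'Nearest Volcano': ['Ibu', 'Tofua'],
    'Country': ['Indonesia', 'Tonga'],
})

# validate='many_to_one' raises pandas.errors.MergeError if a volcano name
# appears more than once on the right side; indicator=True adds a '_merge'
# column that tells where each row came from
merged_demo = pd.merge(
    eq_demo, ve_demo,
    how='left', on='Nearest Volcano',
    validate='many_to_one', indicator=True,
)

# Earthquakes whose volcano name found no match in the volcano table
unmatched = int((merged_demo['_merge'] == 'left_only').sum())
print(len(eq_demo), len(merged_demo), unmatched)
```

If the validation fails on the real data, merging on a unique key such as the volcano `Number` would be one alternative to consider.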


### Getting Distances
We are going to use the Haversine formula to get the distance between each earthquake and its nearest volcano. <br/>
<br/>
The Haversine formula is a common starting point for calculating great-circle distances on a sphere. The name comes from the haversine function:<br/>
<br/>
haversine(θ) = sin²(θ/2)<br/>
<br/>
For two points A and B, where φ is latitude, λ is longitude, and R is the Earth's radius (mean radius = 6,371 km), the distance d follows from the equations below. Note that angles must be converted to radians before they are passed to trigonometric functions:<br/>
<br/>
a = sin²((φB - φA)/2) + cos φA * cos φB * sin²((λB - λA)/2)<br/>
c = 2 * atan2( √a, √(1−a) )<br/>
d = R ⋅ c<br/>
<br/>

[source](https://community.esri.com/groups/coordinate-reference-systems/blog/2017/10/05/haversine-formula)
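
The three equations above can be wrapped in a small reusable function. This is a minimal sketch rather than the notebook's own cell: the function name `haversine_km` and the constant `EARTH_RADIUS_KM` are made up for illustration, and the `np.clip` call is an added safeguard against floating-point rounding pushing `a` slightly outside [0, 1].

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, as quoted in the text


def haversine_km(lat_a, lon_a, lat_b, lon_b):
    """Great-circle distance in km between points given in decimal degrees.

    Accepts scalars, NumPy arrays, or pandas Series.
    """
    # Trig functions expect radians
    phi_a, lam_a, phi_b, lam_b = map(np.radians, (lat_a, lon_a, lat_b, lon_b))

    # a = sin²((φB - φA)/2) + cos φA * cos φB * sin²((λB - λA)/2)
    a = (np.sin((phi_b - phi_a) / 2) ** 2
         + np.cos(phi_a) * np.cos(phi_b) * np.sin((lam_b - lam_a) / 2) ** 2)
    a = np.clip(a, 0.0, 1.0)  # guard against rounding error

    # c = 2 * atan2(√a, √(1−a)), d = R * c
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return EARTH_RADIUS_KM * c


# Earthquake of 1965-01-04 and the volcano Ibu, coordinates from the merged table
ibu_km = haversine_km(1.863, 127.352, 1.488, 127.63)
print(round(float(ibu_km), 1))
```

For the Ibu row this should give roughly 51.9 km, in line with the distance the notebook reports for that earthquake.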

In [54]:
# step 1 : get lat & lon radians
rad_lat_eq = np.radians(main_df['Latitude_x'])
rad_lon_eq = np.radians(main_df['Longitude_x'])
rad_lat_ve = np.radians(main_df['Latitude_y'])
rad_lon_ve = np.radians(main_df['Longitude_y'])

# step 2 : get the differences from lat and lon radians
d_rad_lat = (rad_lat_ve - rad_lat_eq)
d_rad_lon = (rad_lon_ve - rad_lon_eq)

# step 3 : use the haversine formula on lat and lon
a = np.sin(d_rad_lat/2)**2 + np.cos(rad_lat_eq) * np.cos(rad_lat_ve) * np.sin(d_rad_lon / 2)**2
c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
d = 6371 * c

In [55]:
# append the distances to the main dataframes
main_df['Eq - Ve Distances'] = d

In [56]:
main_df.head()

Unnamed: 0,Date_Time,Latitude_x,Longitude_x,Depth,Magnitude,Magnitude Type,Nearest Volcano,Number,Country,Region,Type,Activity Evidence,Last Known Eruption,Latitude_y,Longitude_y,Elevation (Meters),Dominant Rock Type,Tectonic Setting,Eq - Ve Distances
0,1965-01-02 13:44:18,19.246,145.616,131.6,6.0,MW,Supply Reef,284142,United States,"Japan, Taiwan, Marianas",Submarine,Eruption Dated,1989,20.13,145.1,-8,Andesite / Basaltic Andesite,Subduction Zone / Crust Thickness Unknown,112.162847
1,1965-01-04 11:29:49,1.863,127.352,80.0,5.8,MW,Ibu,268030,Indonesia,Indonesia,Stratovolcano,Eruption Observed,2016,1.488,127.63,1325,Andesite / Basaltic Andesite,Subduction Zone / Oceanic Crust (< 15 km),51.898694
2,1965-01-05 18:05:58,-20.579,-173.972,20.0,6.2,MW,Tofua,243060,Tonga,New Zealand to Fiji,Caldera,Eruption Observed,2014,-19.75,-175.07,515,Andesite / Basaltic Andesite,Subduction Zone / Oceanic Crust (< 15 km),147.078303
3,1965-01-08 18:49:43,-59.076,-23.557,15.0,5.8,MW,Montagu Island,390081,United Kingdom,Antarctica,Shield,Eruption Observed,2007,-58.445,-26.374,1370,Basalt / Picro-Basalt,Subduction Zone / Oceanic Crust (< 15 km),176.936371
4,1965-01-09 13:32:50,11.938,126.427,15.0,5.8,MW,Bulusan,273010,Philippines,Philippines and SE Asia,Stratovolcano(es),Eruption Observed,2016,12.77,124.05,1565,Andesite / Basaltic Andesite,Subduction Zone / Continental Crust (>25 km),274.2612


In [57]:
main_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30230 entries, 0 to 30229
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Date_Time            30230 non-null  object 
 1   Latitude_x           30230 non-null  float64
 2   Longitude_x          30230 non-null  float64
 3   Depth                30230 non-null  float64
 4   Magnitude            30230 non-null  float64
 5   Magnitude Type       30230 non-null  object 
 6   Nearest Volcano      30230 non-null  object 
 7   Number               30230 non-null  int64  
 8   Country              30230 non-null  object 
 9   Region               30230 non-null  object 
 10  Type                 30230 non-null  object 
 11  Activity Evidence    30230 non-null  object 
 12  Last Known Eruption  30230 non-null  int64  
 13  Latitude_y           30230 non-null  float64
 14  Longitude_y          30230 non-null  float64
 15  Elevation (Meters)   30230 non-null 

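### Get Year Differences
Extract the year of each earthquake and subtract it from the last known eruption year of its nearest volcano. A positive value means the last known eruption came after the earthquake.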

In [58]:
main_df['Date_Time'] = pd.to_datetime(main_df['Date_Time'])
main_df['eq_year'] = main_df['Date_Time'].dt.year

In [59]:
main_df['year_diff'] = main_df['Last Known Eruption'] - main_df['eq_year']

In [60]:
main_df.sample(10)

Unnamed: 0,Date_Time,Latitude_x,Longitude_x,Depth,Magnitude,Magnitude Type,Nearest Volcano,Number,Country,Region,Type,Activity Evidence,Last Known Eruption,Latitude_y,Longitude_y,Elevation (Meters),Dominant Rock Type,Tectonic Setting,Eq - Ve Distances,eq_year,year_diff
25833,2010-03-01 02:44:42,-35.039,-72.487,22.9,5.7,MWB,"Azul, Cerro",353060,Ecuador,South America,Shield,Eruption Observed,2008,-0.92,-91.408,1640,Basalt / Picro-Basalt,Rift Zone / Oceanic Crust (< 15 km),4270.437646,2010,-2
7792,1983-01-03 06:04:02,-59.495,-26.262,45.0,5.5,MB,Bristol Island,390080,United Kingdom,Antarctica,Stratovolcano,Eruption Observed,2016,-59.017,-26.533,1100,Basalt / Picro-Basalt,Subduction Zone / Oceanic Crust (< 15 km),55.338348,1983,33
28183,2013-08-04 03:28:51,38.2133,141.8621,56.0,5.8,MWW,Adatarayama,283170,Japan,"Japan, Taiwan, Marianas",Stratovolcano(es),Eruption Observed,1996,37.647,140.281,1728,Andesite / Basaltic Andesite,Subduction Zone / Continental Crust (>25 km),152.296665,2013,-17
5845,1978-05-30 20:17:15,11.05,57.327,33.0,5.5,MS,Ardoukoba,221126,Djibouti,Africa and Red Sea,Fissure vent(s),Eruption Observed,1978,11.58,42.47,298,Basalt / Picro-Basalt,Rift Zone / Intermediate Crust (15-25 km),1620.802656,1978,0
3699,1974-05-10 08:12:05,-4.372,-102.109,33.0,6.1,MB,Unnamed,385052,Undersea Features,Atlantic Ocean,Submarine,Eruption Observed,2002,-32.958,-5.22,0,Basalt / Picro-Basalt,Intraplate / Oceanic Crust (< 15 km),10382.875163,1974,28
13999,1992-08-01 18:19:03,-21.344,-179.189,633.9,5.8,MW,Unnamed,250030,Papua New Guinea,Melanesia and Australia,Submarine,Eruption Dated,1972,-3.03,147.78,-1300,Basalt / Picro-Basalt,Rift Zone / Continental Crust (>25 km),4110.135662,1992,-20
22467,2005-04-11 11:11:01,8.66,-103.564,10.0,5.5,MWC,Unnamed,334040,Undersea Features,Hawaii and Pacific Ocean,Submarine,Eruption Observed,2003,10.73,-103.58,0,Basalt / Picro-Basalt,Intraplate / Oceanic Crust (< 15 km),230.180178,2005,-2
25998,2010-06-26 09:50:43,-8.07,108.089,68.0,5.9,MWC,Galunggung,263140,Indonesia,Indonesia,Stratovolcano,Eruption Observed,1984,-7.25,108.058,2168,Basalt / Picro-Basalt,Subduction Zone / Continental Crust (>25 km),91.243816,2010,-26
24275,2007-11-26 17:41:41,15.28,-93.363,87.2,5.7,MWC,Tacana,341130,Mexico-Guatemala,México and Central America,Stratovolcano,Eruption Observed,1986,15.132,-92.109,4064,Andesite / Basaltic Andesite,Subduction Zone / Continental Crust (>25 km),135.558971,2007,-21
27021,2011-08-21 00:23:40,-18.24,167.867,35.0,5.6,MWB,Kuwae,257070,Vanuatu,Melanesia and Australia,Caldera,Eruption Observed,1974,-16.829,168.536,-2,Basalt / Picro-Basalt,Subduction Zone / Intermediate Crust (15-25 km),172.184541,2011,-37


## Create a Target Column
The target of this project is to predict whether an earthquake is related to a volcanic eruption, which leads to this question:<br/>
#### Can earthquakes trigger volcanic eruptions?
The answer is yes, under certain conditions. Some earthquakes with a large magnitude (greater than 6) are considered to be related to a subsequent eruption or to some type of unrest at a nearby volcano. However, volcanoes can only be triggered into eruption by nearby tectonic earthquakes if they are already poised to erupt.[[1]](https://www.usgs.gov/faqs/can-earthquakes-trigger-volcanic-eruptions?qt-news_science_products=0#qt-news_science_products) Therefore I use the following parameters: <br/>
1. Magnitude greater than 6
2. Distance to the nearest volcano of 500 km or less
3. Earthquake happened before the eruption (the last known eruption year is later than the earthquake year)
<br/>
<br/>
Note that these parameters are not definitive, since the link between earthquakes and eruptions is still debated. However, they should be enough to flag whether or not an earthquake has the potential to trigger a volcanic eruption.


In [61]:
def potential_eruption(row):
    # Flag a row as 1 only when all three criteria hold: magnitude, distance, and timing
    if (row['Magnitude'] > 6) and (row['Eq - Ve Distances'] <= 500) and (row['year_diff'] > 0):
        return 1
    else:
        return 0
main_df['potential_eruption'] = main_df.apply(potential_eruption, axis = 1)
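
The same labelling rule can also be written without a row-wise `apply`, by combining boolean masks. The sketch below is an alternative under the same three thresholds, not the notebook's own cell. `demo` is a toy dataframe that reuses the first three rows of the merged table.

```python
import pandas as pd

# First three rows of the merged table, reduced to the columns the rule needs
demo = pd.DataFrame({
    'Magnitude': [6.0, 5.8, 6.2],
    'Eq - Ve Distances': [112.162847, 51.898694, 147.078303],
    'year_diff': [24, 51, 49],
})

# Each comparison yields a boolean Series; & combines them element-wise
is_potential = (
    (demo['Magnitude'] > 6)
    & (demo['Eq - Ve Distances'] <= 500)
    & (demo['year_diff'] > 0)
)

# Cast True/False to 1/0 to match the target column format
demo['potential_eruption'] = is_potential.astype(int)
print(demo)
```

Only the third row (magnitude 6.2) is labelled 1, because the magnitude threshold is strict and the first row has a magnitude of exactly 6.0.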

In [62]:
main_df.columns

Index(['Date_Time', 'Latitude_x', 'Longitude_x', 'Depth', 'Magnitude',
       'Magnitude Type', 'Nearest Volcano', 'Number', 'Country', 'Region',
       'Type', 'Activity Evidence', 'Last Known Eruption', 'Latitude_y',
       'Longitude_y', 'Elevation (Meters)', 'Dominant Rock Type',
       'Tectonic Setting', 'Eq - Ve Distances', 'eq_year', 'year_diff',
       'potential_eruption'],
      dtype='object')

In [63]:
main_df = main_df[['Date_Time', 'eq_year', 'Latitude_x', 'Longitude_x', 'Depth', 'Magnitude',
       'Magnitude Type', 'Nearest Volcano', 'Number', 'Country', 'Region',
       'Type', 'Activity Evidence', 'Last Known Eruption', 'Latitude_y',
       'Longitude_y', 'Elevation (Meters)', 'Dominant Rock Type',
       'Tectonic Setting', 'Eq - Ve Distances', 'year_diff',
       'potential_eruption']]

In [64]:
main_df = main_df.rename(columns = {'Date_Time' : 'eq_date_time','Latitude_x' : 'eq_lat',
                                   'Longitude_x' : 'eq_lon','Depth' : 'eq_depth','Magnitude' : 'eq_mag',
                                   'Magnitude Type' : 'eq_mag_type','Nearest Volcano' : 've_closest',
                                   'Country' : 've_country','Region' : 've_region','Type' : 've_type',
                                   'Activity Evidence' : 've_evidence','Last Known Eruption':'ve_year',
                                   'Latitude_y' : 've_lat','Longitude_y' : 've_lon','Elevation (Meters)':'ve_ele',
                                   'Dominant Rock Type' : 've_rock','Tectonic Setting':'ve_tect','Eq - Ve Distances':'eq_ve_dist',
                                   'year_diff':'eq_ve_yeardiff'})
main_df = main_df.drop('Number', axis = 1)

In [65]:
main_df.head()

Unnamed: 0,eq_date_time,eq_year,eq_lat,eq_lon,eq_depth,eq_mag,eq_mag_type,ve_closest,ve_country,ve_region,ve_type,ve_evidence,ve_year,ve_lat,ve_lon,ve_ele,ve_rock,ve_tect,eq_ve_dist,eq_ve_yeardiff,potential_eruption
0,1965-01-02 13:44:18,1965,19.246,145.616,131.6,6.0,MW,Supply Reef,United States,"Japan, Taiwan, Marianas",Submarine,Eruption Dated,1989,20.13,145.1,-8,Andesite / Basaltic Andesite,Subduction Zone / Crust Thickness Unknown,112.162847,24,0
1,1965-01-04 11:29:49,1965,1.863,127.352,80.0,5.8,MW,Ibu,Indonesia,Indonesia,Stratovolcano,Eruption Observed,2016,1.488,127.63,1325,Andesite / Basaltic Andesite,Subduction Zone / Oceanic Crust (< 15 km),51.898694,51,0
2,1965-01-05 18:05:58,1965,-20.579,-173.972,20.0,6.2,MW,Tofua,Tonga,New Zealand to Fiji,Caldera,Eruption Observed,2014,-19.75,-175.07,515,Andesite / Basaltic Andesite,Subduction Zone / Oceanic Crust (< 15 km),147.078303,49,1
3,1965-01-08 18:49:43,1965,-59.076,-23.557,15.0,5.8,MW,Montagu Island,United Kingdom,Antarctica,Shield,Eruption Observed,2007,-58.445,-26.374,1370,Basalt / Picro-Basalt,Subduction Zone / Oceanic Crust (< 15 km),176.936371,42,0
4,1965-01-09 13:32:50,1965,11.938,126.427,15.0,5.8,MW,Bulusan,Philippines,Philippines and SE Asia,Stratovolcano(es),Eruption Observed,2016,12.77,124.05,1565,Andesite / Basaltic Andesite,Subduction Zone / Continental Crust (>25 km),274.2612,51,0


In [66]:
main_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30230 entries, 0 to 30229
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   eq_date_time        30230 non-null  datetime64[ns]
 1   eq_year             30230 non-null  int64         
 2   eq_lat              30230 non-null  float64       
 3   eq_lon              30230 non-null  float64       
 4   eq_depth            30230 non-null  float64       
 5   eq_mag              30230 non-null  float64       
 6   eq_mag_type         30230 non-null  object        
 7   ve_closest          30230 non-null  object        
 8   ve_country          30230 non-null  object        
 9   ve_region           30230 non-null  object        
 10  ve_type             30230 non-null  object        
 11  ve_evidence         30230 non-null  object        
 12  ve_year             30230 non-null  int64         
 13  ve_lat              30230 non-null  float64   

In [67]:
main_df.describe()

Unnamed: 0,eq_year,eq_lat,eq_lon,eq_depth,eq_mag,ve_year,ve_lat,ve_lon,ve_ele,eq_ve_dist,eq_ve_yeardiff,potential_eruption
count,30230.0,30230.0,30230.0,30230.0,30230.0,30230.0,30230.0,30230.0,30230.0,30230.0,30230.0,30230.0
mean,1992.971783,-1.449964,7.099548,99.919419,5.867285,1999.30043,0.682569,21.558359,790.439795,2279.512565,6.328647,0.099669
std,14.181014,28.627924,138.221946,167.645712,0.413053,15.705822,27.654128,129.419123,1856.504976,3840.048637,21.377995,0.299563
min,1965.0,-77.08,-179.997,-1.1,5.5,1966.0,-77.53,-179.033,-4100.0,0.702917,-50.0,0.0
25%,1982.0,-21.05275,-157.68025,15.0,5.6,1989.0,-18.325,-104.3,-39.0,127.549367,-8.0,0.0
50%,1994.0,-6.1095,45.269,33.0,5.7,2003.0,-3.52,77.825,516.0,305.715502,7.0,0.0
75%,2005.0,17.01275,143.00675,71.9,6.0,2014.0,14.95,145.8,1728.0,2186.886433,22.0,0.0
max,2016.0,86.005,179.998,700.0,9.1,2016.0,85.608,179.58,5967.0,17882.797816,51.0,1.0


In [68]:
main_df.to_csv('./data/merged_earthquake_volcano.csv', index = False)
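
CSV files do not store column dtypes, so `eq_date_time` comes back as plain text when the file is loaded again unless it is parsed explicitly. The sketch below shows the round trip on a toy dataframe, using an in-memory buffer in place of the real file (`demo` and `buffer` are illustrative names).

```python
import io

import pandas as pd

# Toy frame with a datetime column, mimicking eq_date_time
demo = pd.DataFrame({
    'eq_date_time': pd.to_datetime(['1965-01-02 13:44:18', '1965-01-04 11:29:49']),
    'eq_mag': [6.0, 5.8],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime64 dtype on load
restored = pd.read_csv(buffer, parse_dates=['eq_date_time'])
print(restored.dtypes)
```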