## 10_scraping_earthquake_data.ipynb
<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px"><b>This script scrapes the earthquake data from the USGS database.</b> Main parts of the routines have been developed in previous courses at the University of London by the same author (Mohr, 2021, 2023, 2024a) and have been developed further to fulfil the needs of the scraping procedure for this MSc thesis. However, the code has been modified to fulfil the latest requirements and package inter-dependencies. Some comments will be added in this Jupyter Notebook and the code has several inline comments. For the project/research itself, see the appropriate document.
</p>

#### References (for this script)
*Mohr, S. (2021) Regional Spatial Clusters of Earthquakes at the Pacific Ring of Fire: Analysing Data from the USGS ANSS ComCat and Building Regional Spatial Clusters. DSM020, Python, examined coursework cw1. University of London.*

*Mohr, S. (2023) Clustering of Earthquakes on a Worldwide Scale with the Help of Big Data Machine Learning Methods. DSM010, Big Data, examined coursework cw2. University of London.*

*Mohr, S. (2024a) Comparing Different Tectonic Setups Considering Publicly Available Basic Earthquake’s Data. DSM050, Data Visualisation, examined coursework cw1. University of London.*

#### History
<pre>
241016 Generation from previous courseworks at the UoL, setup basic logging, improve error handling for saving the data
241017 Reformatting names and variables, use library os for path creation, use also time for filename,
       rewrite and add prcoceduresave_dataset, handle verbosity in the same manner, save_dataset: docstring, 
       add logging and docstring to get_data_from_usgs_anss_api, function calling with explicit var names
241018 Add errorhandling & logging, docstring to query_earthquakes, export shared procedures to shared_procedures.py and 
       import them from there as a general solution for sharing identical procedures between scripts for this project,
       re-write query_earthquakes to use get_data_from_web_api, move get_data_from_web_api to shared_procedures.py
241203 Repair query by constructing a combined dictionary with times and params, set parameters and scrape data
250104 Check docstrings
250110 Set Lon=[100;180][-180;-60] and Lat=[-70;70] because of buffer around plate margins of the Pacific-,
       Philippine-, Cocos-, and Nazca-Plate
</pre>

#### Todo
<pre>./.</pre>

## Preparing the environment
### System information

In [1]:
# which python installation and version are we using here?
print('\n******* Python Info ***********')
!which python
!python --version

# show some CPU and RAM info
print('\n******* CPU Info ***********')
!lscpu
print('\n******* RAM Info (in GB) ***********')
!free -g


******* Python Info ***********
/bin/python
Python 3.8.10

******* CPU Info ***********
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      46 bits physical, 48 bits virtual
CPU(s):                             64
On-line CPU(s) list:                0-63
Thread(s) per core:                 2
Core(s) per socket:                 8
Socket(s):                          4
NUMA node(s):                       4
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              85
Model name:                         Intel(R) Xeon(R) Gold 6234 CPU @ 3.30GHz
Stepping:                           7
CPU MHz:                            1200.047
CPU max MHz:                        4000.0000
CPU min MHz:                        1200.0000
BogoMIPS:                           6600.00
Virtualization:                     VT-x
L1

In [2]:
# show installed packages and versions
!pip freeze

absl-py==2.1.0
affine==2.4.0
aggdraw==1.3.16
array-record==0.4.0
asttokens==2.4.1
astunparse==1.6.3
atomicwrites==1.1.5
attrs==19.3.0
Automat==0.8.0
backcall==0.2.0
beautifulsoup4==4.8.2
blinker==1.4
cachetools==5.5.0
certifi==2019.11.28
chardet==3.0.4
click==8.1.7
click-plugins==1.1.1
cligj==0.7.2
cloud-init==24.3.1
colorama==0.4.3
comm==0.2.2
command-not-found==0.3
configobj==5.0.6
confluent-kafka==2.5.3
constantly==15.1.0
contextily==1.5.2
contourpy==1.1.1
cryptography==2.8
cupshelpers==1.0
cycler==0.10.0
dbus-python==1.2.16
debugpy==1.8.7
decorator==4.4.2
defer==1.0.6
distro==1.4.0
distro-info==0.23+ubuntu1.1
dm-tree==0.1.8
entrypoints==0.3
et-xmlfile==1.0.1
etils==1.3.0
executing==2.0.1
fail2ban==0.11.1
fastjsonschema==2.20.0
filelock==3.13.1
fiona==1.9.6
flatbuffers==24.3.25
fonttools==4.53.1
fsspec==2023.12.2
ftfy==6.2.0
gast==0.4.0
geographiclib==2.0
geopandas==0.13.2
geopy==2.4.1
google-auth==2.36.0
google-auth-oauthlib==1.

### Setting PATH correctly

In [3]:
# there is somewhere a PATH-error on LENA for a while
# adding my packages path to the PATH environment

import sys
sys.path.append("/home/smohr001/.local/lib/python3.8/site-packages")
sys.path

['/home/smohr001/thesis',
 '/usr/lib/python38.zip',
 '/usr/lib/python3.8',
 '/usr/lib/python3.8/lib-dynload',
 '',
 '/opt/jupyterhub/lib/python3.8/site-packages',
 '/opt/jupyterhub/lib/python3.8/site-packages/IPython/extensions',
 '/home/smohr001/.ipython',
 '/home/smohr001/.local/lib/python3.8/site-packages']

### Loading libraries

In [4]:
# importing standard libraries
import sys
import os
import warnings
import datetime
import time
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
import logging

# importing shared procedures for this procect (needs to be a simple .py file)
%run shared_procedures.py

# importing additional libraries
import requests
from requests.exceptions import HTTPError
import json

# get info about installed and used versions of some important (deep learning) libraries
print("Some important installed libraries:\n")
print(f"Pandas version: {pd.__version__}")
print(f"Numpy version: {np.__version__}")
print(f"Seaborn version: {sns.__version__}")

Some important installed libraries:

Pandas version: 1.4.1
Numpy version: 1.22.2
Seaborn version: 0.13.2


#### Set up parameters and identification of this script

In [5]:
# show all matplotlib graphs inline
%matplotlib inline

# ignore warnings (low priority)
warnings.filterwarnings('ignore')

# set script (ipynb notebook) name (e.g. for logging)
script_name = "10_scrape_earthquake_data.ipynb"

# start parameterized logging
setup_logging(logfile_dir = "log", 
              logfile_name = "10_data_scraping.log", 
              log_level = logging.INFO, 
              script_name = script_name
             )

# set data directory
data_dir = "data"
logging.info(f"{script_name}: Set data directory to './{data_dir}'.")

2025-01-10 12:57:33,731 - INFO - Starting script '10_scrape_earthquake_data.ipynb'.
2025-01-10 12:57:33,733 - INFO - Set loglevel to INFO.
2025-01-10 12:57:33,734 - INFO - 10_scrape_earthquake_data.ipynb: Set data directory to './data'.


#### Testing the API data availability and showing the geoJSON based result

In [6]:
# geoJSON format of answer
# should result in excatly 1 earthquake
query_parameters = {
    "format": "geojson",
    "starttime": "2014-01-01",
    "endtime": "2014-01-02",
    "minmagnitude": 6
}

query_status, query_answer = get_data_from_web_api(url = "https://earthquake.usgs.gov/fdsnws/event/1/query?",
                                                        query_parameters = query_parameters,
                                                        verbosity = 0)

if(query_status):
    print("geoJSON format\n")
    print(query_answer.text)
else:
    print("\nSome error occured! Nothing to print!")


geoJSON format

{"type":"FeatureCollection","metadata":{"generated":1736513859000,"url":"https://earthquake.usgs.gov/fdsnws/event/1/query?format=geojson&starttime=2014-01-01&endtime=2014-01-02&minmagnitude=6","title":"USGS Earthquakes","status":200,"api":"1.14.1","count":1},"features":[{"type":"Feature","properties":{"mag":6.5,"place":"32 km W of Sola, Vanuatu","time":1388592209000,"updated":1651596180609,"tz":null,"url":"https://earthquake.usgs.gov/earthquakes/eventpage/usc000lvb5","detail":"https://earthquake.usgs.gov/fdsnws/event/1/query?eventid=usc000lvb5&format=geojson","felt":null,"cdi":null,"mmi":4.262,"alert":"green","status":"reviewed","tsunami":1,"sig":650,"net":"us","code":"c000lvb5","ids":",pt14001000,at00myqcls,usc000lvb5,iscgem604060577,","sources":",pt,at,us,iscgem,","types":",cap,impact-link,losspager,moment-tensor,origin,phase-data,shakemap,","nst":null,"dmin":3.997,"rms":0.76,"gap":14,"magType":"mww","type":"earthquake","title":"M 6.5 - 32 km W of Sola, Vanuatu"},"geo

#### Main query method

In [7]:
def query_earthquakes(query_timeframe, api_query_parameters, verbosity=0):

    """
    Queries the United States Geological Survey (USGS) Advanced National Seismic System (ANSS) API for earthquake data 
    and returns a pandas DataFrame containing the results.

    Parameters:
        query_timeframe : list
            A list of timeframes to query within. The list should be in a format that the API endpoint accepts.
        api_query_parameters : dict
            A dictionary of query parameters to be sent to the API endpoint. Parameters include but are not limited to 
            'starttime', 'endtime', 'minmagnitude', and 'maxmagnitude'.
        verbosity : int, optional
            Verbosity level for logging and printing. Default is 0 (no verbose output).

    Returns:
        pandas.DataFrame
            A DataFrame containing the requested earthquake data. The columns typically include 'id', 'lon', 'lat', 'depth',
            'mag', 'time', 'felt', 'cdi', 'mmi', 'alert', 'status', 'tsunami', 'nst', 'net', and 'sig'.
    
    Raises:
        ValueError
            If 'query_timeframe' or 'api_query_parameters' are not set or are invalid.
        RuntimeError
            If the API query fails to retrieve data.
    
    Logs:
        Logs the start and end of the querying process, error encountered during query and data retrieval,
        and information about the number of events queried and parsed.
    
    Notes:
        This docstring was generated with the help of AI and proofread by the author.
    """
    
    # are some parameters set?
    if (query_timeframe and api_query_parameters):
        
        # print & log some information
        print("======================================================================================================")
        print("Querying United States Geological Survey (USGS) Advanced National Seismic System (ANSS)")
        logging.info(f"query_earthquakes: START main query method for earthquakes.")

        # initialize timing information for this routine
        start = time.time()

        # initialize empty dataframe earthquakes
        earthquakes = pd.DataFrame()

        # start  of loop -----------------------------------------------------------------------------------------------
        # simulating a repeat-until-loop structure
        while True:

            # adding starttime and endtime to the query parameters
            api_query_parameters["starttime"] = query_timeframe[0]
            api_query_parameters["endtime"] = query_timeframe[1]

            # show the parameters
            print("\nTimeframe(s): " + str(query_timeframe))
            print("\nQuery parameters: " + str(api_query_parameters))

            # get the number of events to check the threshold of allowed events (show errors, verbosity = 1)
            query_status, events_count = get_data_from_web_api(url = "https://earthquake.usgs.gov/fdsnws/event/1/count?",
                                                                     query_parameters = api_query_parameters,
                                                                     verbosity = verbosity)

            # parse the queried data (to get the number of events)
            if(query_status):
                count = events_count.json()['count']
                maxAllowed = events_count.json()['maxAllowed']

                # set number of events to query per API query well below the API threshold 
                if(maxAllowed > 5000):
                    maxAllowed = 5000

                # print some info for the number of events
                print("Number of events to query: " + str(count) + " (of " + str(maxAllowed) + " allowed)")

                # check the number of events for the given threshold
                if(count <= maxAllowed):
                    # number of events is below threshold
                    print("Number of events is below threshold, getting earthquake data ...")

                    # query the API (show errors, verbosity = 1)
                    query_status_ok, api_response_json = \
                        get_data_from_web_api(url = "https://earthquake.usgs.gov/fdsnws/event/1/query?",
                                                    query_parameters = api_query_parameters,
                                                    verbosity = verbosity)

                    # parse the queried data
                    if(query_status_ok):
                        # query should be okay, go ahead
                        print("Got data, parsing ...")

                        # parse the 'FeatureCollection' from geoJSON request answer
                        feature_collection = api_response_json.json()['metadata']
                        count = feature_collection['count']
                        print("Number of events to parse:", count)

                        # get features branch of JSON
                        features = api_response_json.json()['features']

                        # run through every feature (which is one event = earthquake)
                        for feature in features:
                            # set an empty featurelist for THIS earthquake
                            earthquake = []

                            # get usgs id
                            earthquake.append(feature['id'])

                            # get geometry and coordinates (a list) and lat, lon and depth
                            coordinates = feature['geometry']['coordinates']
                            earthquake.append(coordinates[0])
                            earthquake.append(coordinates[1])
                            earthquake.append(coordinates[2])

                            # get all properties of this earthquake
                            properties = feature['properties']
                            for key in properties:
                                # _drop_ some unwanted properties 
                                if(key not in ('rms','dmin','gap','magType','url','detail','code','ids','sources', \
                                               'types','title','place','type','updated','tz')):
                                    # get the key-value-pair without any conversion or formatting
                                    earthquake.append(properties[key])

                            # append this earhquake to the dataframe of earthquakes
                            earthquake_df = pd.DataFrame([earthquake])
                            earthquakes = pd.concat([earthquakes, earthquake_df])

                    else:
                        # unsuccessful query 
                        logging.error(f"query_earthquakes: Nothing to parse!")

                    # delete first timeframe (to continue with the next one)
                    query_timeframe.pop(0)

                    # set count to 200001 to simply CONTINUE with the main loop
                    count = maxAllowed + 1;


                else:
                    # number of events too high
                    print("Number of events is too high, setting a reduced time frame!")
                    start_time = datetime.datetime.strptime(api_query_parameters['starttime'], '%Y-%m-%d %H:%M:%S')
                    end_time = datetime.datetime.strptime(api_query_parameters['endtime'], '%Y-%m-%d %H:%M:%S')
                    time_diff = (end_time - start_time).total_seconds()
                    new_end_time = start_time + datetime.timedelta(0,int(time_diff / 2))
                    query_timeframe.insert(1, str(new_end_time))
                    print("New timeframe(s): " + str(query_timeframe))

            else:
                # unsuccessful query 
                logging.error(f"query_earthquakes: Bad query, nothing to parse!")

                # set count = 0 and maxAllowed = 1 to simply exit the main loop
                count = 0
                maxAllowed = 1

            # exit the main loop if
            #   (1) less than the maxAllowed objects will be queried or
            #   (2) the last timeframe has been reached
            #   (3) an unsuccessful query has been carried out
            if ((count < maxAllowed) or (len(query_timeframe) < 2)):
                break

        # end of loop -----------------------------------------------------------------------------------------------
        
        # rename the columns of the earthquakes dataframe (if there are any)
        if(len(earthquakes) > 0):
            earthquakes.columns = ["id","lon","lat","depth","mag","time","felt","cdi","mmi","alert","status","tsunami","nst","net","sig"]
        
        # print & log some final information (numer of parsed events and runtime of routine)
        print("------------------------------------------------------------------------------------------------------")
        print("\nTotal number of parsed events in dataframe: " + str(len(earthquakes)))

        end = time.time()
        print("Runtime to query and parse the data: " + str(round(end - start, 1)) + " s")
        print("======================================================================================================")
        time.sleep(0.5)
        logging.info(f"query_earthquakes: END main query method for earthquakes with {len(earthquakes)} earthquakes in {round(end - start, 1)} s.")
        
        # reset index
        earthquakes.reset_index(drop=True, inplace=True)

        # return the dataframe with found earthquakes
        return earthquakes

    # no input parameters evailable
    else:
        logging.error(f"query_earthquakes: No input parameters evailable!")
        return(None)

#### Querying earthquake data

In [9]:
# set the dynamic query timeframe (starttime and endtime in datetime format) and parameters
query_timeframe = ["1970-01-01 00:00:00", "2019-12-31 23:59:59"]
api_query_parameters = {
    "minlatitude": '-70.0', "maxlatitude": '70.0', 
    "minlongitude": '100.0', "maxlongitude": '300.0',
    "format": "geojson",
    "minmagnitude": "5.0",
    "eventtype": "earthquake"
}

# get data for this area, add area information
earthquakes = query_earthquakes(query_timeframe = query_timeframe, 
                                api_query_parameters = api_query_parameters, 
                                verbosity = 0)

# show earthquakes
display(earthquakes)

2025-01-10 12:58:10,402 - INFO - query_earthquakes: START main query method for earthquakes.


Querying United States Geological Survey (USGS) Advanced National Seismic System (ANSS)

Timeframe(s): ['1970-01-01 00:00:00', '2019-12-31 23:59:59']

Query parameters: {'minlatitude': '-70.0', 'maxlatitude': '70.0', 'minlongitude': '100.0', 'maxlongitude': '300.0', 'format': 'geojson', 'minmagnitude': '5.0', 'eventtype': 'earthquake', 'starttime': '1970-01-01 00:00:00', 'endtime': '2019-12-31 23:59:59'}
Number of events to query: 63878 (of 5000 allowed)
Number of events is too high, setting a reduced time frame!
New timeframe(s): ['1970-01-01 00:00:00', '1994-12-31 23:59:59', '2019-12-31 23:59:59']

Timeframe(s): ['1970-01-01 00:00:00', '1994-12-31 23:59:59', '2019-12-31 23:59:59']

Query parameters: {'minlatitude': '-70.0', 'maxlatitude': '70.0', 'minlongitude': '100.0', 'maxlongitude': '300.0', 'format': 'geojson', 'minmagnitude': '5.0', 'eventtype': 'earthquake', 'starttime': '1970-01-01 00:00:00', 'endtime': '1994-12-31 23:59:59'}
Number of events to query: 30622 (of 5000 allowed)

Number of events to query: 4196 (of 5000 allowed)
Number of events is below threshold, getting earthquake data ...
Got data, parsing ...
Number of events to parse: 4196

Timeframe(s): ['1991-11-16 14:59:59', '1994-12-31 23:59:59', '2019-12-31 23:59:59']

Query parameters: {'minlatitude': '-70.0', 'maxlatitude': '70.0', 'minlongitude': '100.0', 'maxlongitude': '300.0', 'format': 'geojson', 'minmagnitude': '5.0', 'eventtype': 'earthquake', 'starttime': '1991-11-16 14:59:59', 'endtime': '1994-12-31 23:59:59'}
Number of events to query: 4325 (of 5000 allowed)
Number of events is below threshold, getting earthquake data ...
Got data, parsing ...
Number of events to parse: 4325

Timeframe(s): ['1994-12-31 23:59:59', '2019-12-31 23:59:59']

Query parameters: {'minlatitude': '-70.0', 'maxlatitude': '70.0', 'minlongitude': '100.0', 'maxlongitude': '300.0', 'format': 'geojson', 'minmagnitude': '5.0', 'eventtype': 'earthquake', 'starttime': '1994-12-31 23:59:59', 'endtime': '2019-12-31 23:59:59'}

Number of events to query: 5313 (of 5000 allowed)
Number of events is too high, setting a reduced time frame!
New timeframe(s): ['2010-08-16 20:59:59', '2012-03-09 13:29:59', '2013-10-01 05:59:59', '2019-12-31 23:59:59']

Timeframe(s): ['2010-08-16 20:59:59', '2012-03-09 13:29:59', '2013-10-01 05:59:59', '2019-12-31 23:59:59']

Query parameters: {'minlatitude': '-70.0', 'maxlatitude': '70.0', 'minlongitude': '100.0', 'maxlongitude': '300.0', 'format': 'geojson', 'minmagnitude': '5.0', 'eventtype': 'earthquake', 'starttime': '2010-08-16 20:59:59', 'endtime': '2012-03-09 13:29:59'}
Number of events to query: 3363 (of 5000 allowed)
Number of events is below threshold, getting earthquake data ...
Got data, parsing ...
Number of events to parse: 3363

Timeframe(s): ['2012-03-09 13:29:59', '2013-10-01 05:59:59', '2019-12-31 23:59:59']

Query parameters: {'minlatitude': '-70.0', 'maxlatitude': '70.0', 'minlongitude': '100.0', 'maxlongitude': '300.0', 'format': 'geojson', 'minmagnitude': '5.0'

2025-01-10 13:03:11,183 - INFO - query_earthquakes: END main query method for earthquakes with 63878 earthquakes in 300.3 s.


Unnamed: 0,id,lon,lat,depth,mag,time,felt,cdi,mmi,alert,status,tsunami,nst,net,sig
0,usp00000hw,-101.0180,-36.1090,33.00,5.3,98610866900,,,,,reviewed,0,432,us,
1,usp00000hv,-82.3720,5.9880,33.00,5.3,98610207400,,,,,reviewed,0,432,us,
2,usp00000hs,103.0900,-4.3400,107.00,5.4,98602279500,,,,,reviewed,0,449,us,
3,usp00000hn,-86.4730,11.5340,33.00,5.1,98583353700,,,,,reviewed,0,400,us,
4,usp00000hk,141.5150,37.0400,56.00,5.1,98577374600,,,,,reviewed,0,400,us,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
63873,us10007a0b,173.2622,-42.5910,6.85,5.3,1479478977140,3,3.4,,,reviewed,0,433,us,
63874,us100079g9,130.4786,-6.2582,112.15,5.5,1479401804290,,,3.25,green,reviewed,0,465,us,
63875,us100078zy,-177.5663,-22.0950,296.01,5.3,1479327974490,,,2.38,green,reviewed,0,432,us,
63876,us100078vh,113.2445,-9.0027,85.00,5.7,1479309011020,243,6.2,3.93,green,reviewed,0,651,us,


#### Save scraped earthquake dataset

In [10]:
# save earthquake dataset
save_dataset(data_file = "earthquakes_scraped.csv", 
             data_dir = data_dir, 
             data_set = earthquakes
            )  

2025-01-10 13:03:48,572 - INFO - save_dataset: Data saved successfully to 'data/earthquakes_scraped_250110-130348.csv'.


#### End of script

In [11]:
# log the end of this script
logging.info(f"End of script '{script_name}'.")

2025-01-10 13:03:51,266 - INFO - End of script '10_scrape_earthquake_data.ipynb'.
