# Analysis of Magnitude 5.0+ Earthquakes in California (DTSA 5304 Final Project)
**2025-04-21**

You can download this reproducible notebook from my GitHub.

## Introduction

This is my final project for DTSA-5304. It examines all recorded earthquakes of magnitude 5.0 and above in California from 1769 to 2015, as cataloged by the California Geological Survey. The visualizations focus on three things: notable patterns over time, how the magnitudes of strong earthquakes are distributed, and how deep in the ground they occur.

## Dataset

The [dataset](https://sandbox.data.ca.gov/dataset/cgs-map-sheet-48-historic-earthquakes-1769-to-2015-california-magnitude-5-0-plus) I used comes from California’s Open Data Portal. It includes all records of magnitude 5.0-plus earthquakes in California from 1769 to 2015, with:

- Date & Time: When each earthquake occurred
- Magnitude (M): Richter scale value, 5.0 or above
- Depth (km): How far below the surface the earthquake originated
- Location: Latitude and longitude of the epicenter

For this project, only the `year`, `month`, `depth`, and `magnitude` columns are needed.

## Goals

- See how the number of big earthquakes changes over time
- Show how earthquake strengths are spread out above M 5.0
- Explore if stronger earthquakes happen when deeper or shallower

## Tools & Libraries

- **Pandas** for data cleaning and grouping
- **Altair** for charting and interactivity

## Data Preparation

We begin by importing the required libraries, loading the dataset (CSV) into pandas, and dropping unused columns. To reproduce this Jupyter Notebook, download the dataset [here](https://sandbox.data.ca.gov/dataset/cgs-map-sheet-48-historic-earthquakes-1769-to-2015-california-magnitude-5-0-plus).

In [1]:
# Install & import
%pip install -q pandas altair

import pandas as pd
import altair as alt

# Load data
df = pd.read_csv("CGS_Map_Sheet_48__Historic_Earthquakes,_1769_to_2015_-_California_(Magnitude_5.0-plus).csv")
df.info()

Note: you may need to restart the kernel to use updated packages.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 783 entries, 0 to 782
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   X           783 non-null    float64
 1   Y           783 non-null    float64
 2   OBJECTID_1  783 non-null    int64  
 3   OBJECTID    783 non-null    int64  
 4   year        783 non-null    int64  
 5   month       783 non-null    int64  
 6   day         783 non-null    int64  
 7   hour        783 non-null    int64  
 8   minute      783 non-null    int64  
 9   second      783 non-null    float64
 10  lat         783 non-null    float64
 11  lon         783 non-null    float64
 12  depth       783 non-null    float64
 13  magnitude   783 non-null    float64
dtypes: float64(7), int64(7)
memory usage: 85.8 KB


In [2]:
# Drop unused columns
df_cleaned = df.drop(columns=["OBJECTID_1", "OBJECTID", "X", "Y", "lat", "lon", "day", "hour", "minute", "second"])
df_cleaned.describe()

|  | year | month | depth | magnitude |
| --- | --- | --- | --- | --- |
| count | 783.0 | 783.0 | 783.0 | 783.0 |
| mean | 1946.335888 | 6.66539 | 7.477088 | 5.612427 |
| std | 46.591201 | 4.751677 | 16.745451 | 0.546978 |
| min | 1769.0 | 1.0 | 0.0 | 5.0 |
| 25% | 1918.5 | 4.0 | 0.0 | 5.18 |
| 50% | 1954.0 | 7.0 | 1.8 | 5.5 |
| 75% | 1984.0 | 9.0 | 8.55 | 5.9 |
| max | 2015.0 | 99.0 | 99.0 | 7.9 |


## Analysis of Earthquake Frequency Over Time

**Why (Goal):** 
To understand how the rate of M ≥ 5.0 earthquakes has changed over the years.

**How (Means):**
Using a line chart of yearly counts and a bar chart of counts per decade to show the change over time.

**What (Characteristics):**
Whether, and by how much, annual counts fluctuate over the years.

### Visualization 1

In [3]:
# Group earthquakes by year
quakes_per_year = df_cleaned.groupby("year").size().reset_index(name="count")

alt.Chart(quakes_per_year).mark_line(color="darkorange").encode(
    x=alt.X("year:Q", title="Year"),
    y=alt.Y("count:Q", title="Number of Earthquakes"),
).properties(title="Number of Earthquakes per Year (M ≥ 5.0)", width=600, height=300)

This visualization shows the number of large earthquakes recorded each year with a line chart. Plotting the number of earthquakes over time highlights trends, such as the noticeable increase in recorded earthquakes after the 1970s.
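One caveat with counting via `groupby`: a year with no qualifying earthquakes produces no row at all, so the line is drawn straight across that gap. Below is a minimal sketch of one way to fill empty years with zeros. The `sample` frame and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical years for illustration only (not rows from the dataset)
sample = pd.DataFrame({"year": [1990, 1990, 1992, 1995]})

# Count events per year; years without events are absent from this result
counts = sample.groupby("year").size()

# Reindex over every year in the observed range so missing years become 0
all_years = range(int(counts.index.min()), int(counts.index.max()) + 1)
per_year = (
    counts.reindex(all_years, fill_value=0)
    .rename_axis("year")
    .reset_index(name="count")
)
print(per_year)
```

The resulting frame has the same `year` and `count` columns as `quakes_per_year`, so it could be passed to the same chart specification.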

### Visualization 2

In [4]:
# Group earthquakes by decade
df_cleaned["decade"] = (df_cleaned["year"] // 10) * 10
quakes_per_decade = df_cleaned.groupby("decade").size().reset_index(name="count")

alt.Chart(quakes_per_decade).mark_bar(color="darkorange").encode(
    x=alt.X("decade:O", title="Decade"),
    y=alt.Y("count:Q", title="Number of Earthquakes"),
    tooltip=[
        alt.Tooltip("decade", title="Decade"),
        alt.Tooltip("count", title="# of Earthquakes"),
    ],
).properties(title="Number of Earthquakes per Decade (M ≥ 5.0)", width=600, height=300)

Grouping the data into decades with a bar chart makes it easier to see broader patterns, such as the steady rise in recorded earthquakes after the 1970s, with the 2010s showing the highest total. The increase likely reflects improvements in monitoring rather than an actual surge in seismic activity. Hovering over a bar shows its exact count.
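Because the records run from 1769 to 2015, the first and last decade bins are only partly covered (the 2010s bin spans six years, 2010 to 2015), so raw decade totals are not strictly comparable. The following is a rough sketch of normalizing each total by the number of years the bin actually covers. The counts are made-up placeholders, not results from the dataset.

```python
import pandas as pd

# First and last years covered by the dataset, as stated in its title
FIRST_YEAR, LAST_YEAR = 1769, 2015

# Placeholder decade totals for illustration only (not real results)
counts = pd.DataFrame({"decade": [1990, 2000, 2010], "count": [40, 30, 24]})

# Clip each decade's start and end to the period the dataset covers
start = counts["decade"].clip(lower=FIRST_YEAR)
end = (counts["decade"] + 9).clip(upper=LAST_YEAR)

counts["years_covered"] = end - start + 1
counts["per_year"] = counts["count"] / counts["years_covered"]
print(counts)
```

With these placeholder numbers, the partial 2010s bin has a lower total than the 1990s but the same rate per year, which is the kind of difference the normalization is meant to surface.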

## Analysis of Magnitude Distribution

**Why (Goal):**  
To see which strength levels are most common among earthquakes of M ≥ 5.0.

**How (Means):**  
Using a histogram and density plot to show the distribution of the data.

**What (Characteristics):**  
The link between magnitude and the number of earthquakes occurring.

### Visualization 1

In [5]:
alt.Chart(df_cleaned).mark_bar(color="darkorange").encode(
    x=alt.X(
        "magnitude:Q",
        bin=alt.Bin(step=0.1),
        title="Magnitude (M ≥ 5.0)",
    ),
    y=alt.Y("count():Q", title="Number of Earthquakes"),
    tooltip=[
        alt.Tooltip("magnitude:Q", bin=alt.Bin(step=0.1), title="Magnitude Range"),
        alt.Tooltip("count():Q", title="# of Earthquakes"),
    ],
).properties(title="Number of Earthquake Magnitudes", width=600, height=300)

This histogram shows how earthquake magnitudes are distributed. Most events cluster just above magnitude 5.0, with fewer earthquakes as the magnitude increases. Hovering over the bars reveals the exact counts for each range.
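For readers who want the per-bin counts without hovering, here is a minimal pandas sketch that groups magnitudes into 0.1-wide bins. It only approximates the chart's binning, since Altair chooses its own bin edges. The `sample` values are hypothetical.

```python
import pandas as pd

# Hypothetical magnitudes for illustration only (not rows from the dataset)
sample = pd.DataFrame({"magnitude": [5.0, 5.04, 5.1, 5.18, 5.5, 5.9, 6.3, 7.9]})

# Work in integer tenths to avoid floating-point surprises at bin edges:
# round away representation error, then floor to the lower 0.1 boundary
tenths = ((sample["magnitude"] * 10).round(6) // 1).astype(int)

bin_counts = (
    sample.assign(bin_start=tenths / 10)
    .groupby("bin_start")
    .size()
    .reset_index(name="count")
)
print(bin_counts)
```

Each `bin_start` value is the lower edge of a bin, so a row with `bin_start` 5.1 counts magnitudes from 5.1 up to, but not including, 5.2.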

### Visualization 2

In [6]:
alt.Chart(df_cleaned).transform_density(
    "magnitude",
    as_=["magnitude", "density"],
    steps=200,
).mark_area(opacity=0.6, color="darkorange").encode(
    x=alt.X("magnitude:Q", title="Magnitude (M ≥ 5.0)"),
    y=alt.Y("density:Q", title="Density"),
).properties(
    title="Density Estimate of Earthquake Magnitudes", width=600, height=300
)

The density plot shows a peak around magnitude 5.1 and how the distribution tapers off as magnitudes increase.
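To make the idea behind a density plot concrete, here is a small pure-Python sketch of a Gaussian kernel density estimate. It illustrates the general technique and is not a reproduction of what `transform_density` does internally. The bandwidth of 0.2, the grid, and the sample magnitudes are all assumptions chosen for illustration.

```python
import math


def gaussian_kde(values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm
        for x in grid
    ]


# Hypothetical magnitudes for illustration only (not rows from the dataset)
magnitudes = [5.0, 5.1, 5.1, 5.2, 5.3, 5.6, 6.0, 6.9]

# Evaluation grid from 4.0 to 9.0 in steps of 0.05
grid = [4.0 + 0.05 * i for i in range(101)]
density = gaussian_kde(magnitudes, grid, bandwidth=0.2)

# Location of the highest density value on the grid
peak = grid[density.index(max(density))]
print(f"peak near magnitude {peak:.2f}")
```

Note that a kernel estimate smooths across the 5.0 cutoff, so some density appears below 5.0 even though the dataset contains no such events. That is worth keeping in mind when reading the left edge of the plot.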

## Analysis of Depth vs. Magnitude

**Why (Goal):**
To explore whether stronger earthquakes tend to occur at deeper or shallower depths.

**How (Means):**
Using interactive exploration techniques such as brushing.

**What (Characteristics):**
Any correlation (or lack thereof) between depth and magnitude, as well as any outliers.

*Note: Depth data is missing for some rows (recorded as a depth of 0). I've dropped those rows for this task.*

In [7]:
df_depth = df_cleaned.query("depth > 0").copy()
df_depth.describe()

|  | year | month | depth | magnitude | decade |
| --- | --- | --- | --- | --- | --- |
| count | 409.0 | 409.0 | 409.0 | 409.0 | 409.0 |
| mean | 1972.012225 | 6.772616 | 14.314328 | 5.504279 | 1967.823961 |
| std | 41.264835 | 5.623797 | 20.960496 | 0.506357 | 41.170477 |
| min | 1769.0 | 1.0 | 0.1 | 5.0 | 1760.0 |
| 25% | 1959.0 | 4.0 | 6.0 | 5.13 | 1950.0 |
| 50% | 1983.0 | 7.0 | 8.1 | 5.37 | 1980.0 |
| 75% | 1994.0 | 9.0 | 13.1 | 5.7 | 1990.0 |
| max | 2015.0 | 99.0 | 99.0 | 7.5 | 2010.0 |
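The summary above still lists 99 as the maximum for both `month` and `depth`. A month of 99 cannot be a real calendar month, so 99 may be a placeholder for unknown values. Whether that also applies to `depth` is an assumption that the dataset documentation would need to confirm. Here is a hedged sketch of how such values could be counted and excluded, using hypothetical rows:

```python
import pandas as pd

# Assumption: 99 marks an unknown value; confirm against the dataset docs
SENTINEL = 99

# Hypothetical rows for illustration only (not rows from the dataset)
sample = pd.DataFrame(
    {
        "month": [1, 7, 99, 12],
        "depth": [6.0, 99.0, 8.1, 0.0],
        "magnitude": [5.1, 5.4, 6.0, 5.2],
    }
)

# Count how many values in each column equal the suspected placeholder
flagged = (sample[["month", "depth"]] == SENTINEL).sum()
print(flagged)

# Keep rows with a positive depth that is not the suspected placeholder
valid_depth = sample[(sample["depth"] > 0) & (sample["depth"] != SENTINEL)]
print(valid_depth)
```

If the placeholder assumption holds, filtering this way would keep such rows from appearing as very deep outliers in the depth charts.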


### Visualization 1

In [8]:
brush = alt.selection_interval(encodings=["x"])

hist_mag = (
    alt.Chart(df_depth)
    .add_params(brush)
    .mark_bar(color="darkorange")
    .encode(
        x=alt.X(
            "magnitude:Q",
            bin=alt.Bin(step=0.1),
            title="Magnitude (M ≥ 5.0)",
        ),
        y=alt.Y("count():Q", title="Number of Earthquakes"),
        tooltip=[
            alt.Tooltip("magnitude:Q", bin=alt.Bin(step=0.1), title="Magnitude Range"),
            alt.Tooltip("count():Q", title="# of Earthquakes"),
        ],
    )
    .properties(title="Number of Earthquake Magnitudes", width=300, height=200)
)

hist_depth = (
    alt.Chart(df_depth)
    .transform_filter(brush)
    .mark_bar()
    .encode(
        x=alt.X("depth:Q", bin=alt.Bin(maxbins=30), title="Depth (km)"),
        y=alt.Y("count():Q", title="Number of Earthquakes"),
        color=alt.Color("count():Q", scale=alt.Scale(scheme="oranges")).legend(None),
        tooltip=[
            alt.Tooltip("depth:Q", bin=alt.Bin(maxbins=30), title="Depth Range (km)"),
            alt.Tooltip("count():Q", title="# of Earthquakes"),
        ],
    )
    .properties(
        width=300, height=200, title="Depth Distribution for Selected Magnitudes"
    )
)


hist_mag | hist_depth

The linked histograms above show the depth distribution for a selected magnitude range. Brushing over a range of magnitudes in the left chart updates the right chart to show only the depths of earthquakes in that range.

### Visualization 2

In [9]:
brush = alt.selection_interval(encodings=["x"])

scatter_mag = (
    alt.Chart(df_depth)
    .add_params(brush)
    .mark_circle()
    .encode(
        x=alt.X(
            "magnitude:Q",
            title="Magnitude (M ≥ 5.0)",
            scale=alt.Scale(domain=[5, df_cleaned["magnitude"].max()]),
        ),
        y=alt.Y("depth:Q", title="Depth (km)"),
        color=alt.Color("depth:Q", scale=alt.Scale(scheme="oranges")),
        tooltip=["year", "magnitude", "depth"],
    )
    .properties(
        width=600,
        height=200,
        title="Depth vs Magnitude",
    )
)

hist_depth = (
    alt.Chart(df_depth)
    .transform_filter(brush)
    .mark_bar()
    .encode(
        x=alt.X("depth:Q", bin=alt.Bin(maxbins=30), title="Depth (km)"),
        y=alt.Y("count():Q", title="Number of Earthquakes"),
        color=alt.Color("count():Q", scale=alt.Scale(scheme="oranges")).legend(None),
        tooltip=[
            alt.Tooltip("depth:Q", bin=alt.Bin(maxbins=30), title="Depth Range (km)"),
            alt.Tooltip("count():Q", title="# of Earthquakes"),
        ],
    )
    .properties(
        width=600, height=200, title="Depth Distribution for Selected Magnitudes"
    )
)

scatter_mag & hist_depth

The scatter plot above shows every quake with a recorded depth. Brushing across a range of magnitudes filters the histogram below it to the depths of the quakes in that range.
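As a numeric companion to the visual check for correlation, here is a minimal sketch that computes a Pearson coefficient and a rank-based (Spearman-style) coefficient with pandas. The rank-based version is less sensitive to outliers. The `sample` values are hypothetical, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical depth/magnitude pairs for illustration only
sample = pd.DataFrame(
    {
        "depth": [5.0, 8.0, 10.0, 12.0, 15.0, 20.0],
        "magnitude": [5.1, 5.0, 5.6, 5.3, 6.2, 5.9],
    }
)

# Pearson correlation measures linear association
pearson = sample["depth"].corr(sample["magnitude"])

# Correlating the ranks gives Spearman's coefficient (monotonic association)
spearman = sample["depth"].rank().corr(sample["magnitude"].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Applied to `df_depth`, the same two calls would give a single-number summary to set beside the scatter plot, with values near 0 indicating little or no relationship.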

## Evaluation

To determine how effectively the visualizations achieve the project's goals, I conducted the following evaluation:

### Participant Recruitment

I recruited friends and family members who were interested in earthquakes to evaluate the visualizations. Most of them are from California.

### Criteria

All participants were given a brief introduction to the project and asked to complete the following tasks while exploring the charts and thinking aloud:

- Identify the decade with the most earthquakes.
- Describe where magnitudes cluster.
- Use the linked scatter/histogram to compare depths for two magnitude ranges.

### Assessment

Participants explored the visualizations, completed the tasks, and offered feedback. I collected and reviewed all of their responses to find areas for improvement, and identified common themes to guide future refinements to the visualizations.

## Conclusion

This analysis of California's historical earthquakes revealed several patterns in seismic activity for events of magnitude 5.0 and above. The results show an increase in recorded earthquakes over time, a strong clustering of magnitudes just above 5.0, and slight shifts in the depth distribution at higher magnitudes. These trends likely reflect both natural seismic behavior and advances in monitoring technology.

The evaluation showed that the interactive features, such as tooltips and brushing, worked well: participants found them intuitive after a brief explanation. Participants preferred the interactive charts over the static ones, noting that the interactivity made it easier to see relationships between two variables. However, some participants found the interactions unclear without guidance and suggested adding labels, such as a short explanation of what brushing is, to make them more immediately understandable.

For future iterations, I would add more context to make the graphs more self-explanatory and the interactive elements (like brushing) more intuitive. I would also change the interactive visualizations to make the brushing bi-directional. Additional visualizations could add geographic context by using the longitude and latitude columns from the dataset.