
# GLY 6739 — Python Diagnostic (Ungraded, No Google/AI)
**Purpose:** This notebook is an ungraded diagnostic to assess your current Python + scientific Python skills (and a bit of command-line literacy). Your results guide what we teach next.

**Time target:** ~60–90 minutes (stop when time is up; partial work is expected).

## Rules
- **Do not use Google, ChatGPT, StackOverflow, or notes.**
- You may use **`help()`**, docstrings, and local built-in documentation.
- You may optionally use **Linux command-line tools** where appropriate (e.g., `bash`, `grep`, `awk`, `sed`, `sort`, `uniq`, `wc`).  
  *Hint:* Some tasks are intentionally easier with the shell.

## Submission workflow (required)
1. Complete this notebook by filling in the code cells marked **TODO**.
2. **Rename** the notebook to:  
   `python_diagnostic_<LASTNAME>_<FIRSTNAME>.ipynb`
3. Commit and push to **your public GitHub repository** for this course.
4. Submit the **GitHub link to the notebook file** (not just the repo).
5. Ensure the repo is **public** so the link works.

---

## Data files (expected folder layout)
Your repo should include a folder structure like:

```
your-repo/
  diagnostics/
    python_diagnostic_<LASTNAME>_<FIRSTNAME>.ipynb
  data/
    events.csv
    stations.txt
    sac/
      *.SAC   (or *.sac)
```

If you do not have `data/` yet, create it and add the provided files.



# 0) Setup (do not edit)
Run the cell below once. You may add imports later if needed, but keep them to a minimum.


In [None]:
import math
import csv
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from obspy import read, Stream, Trace, UTCDateTime

print("Imports OK")



# Academic honesty check (required)
In the cell below, write a short statement confirming you followed the rules (no Google/AI/notes).

Example:

> I completed this diagnostic without Google/AI/notes, using only built-in Python help and (optionally) Linux command-line tools.


In [None]:
# TODO: Write your statement as a Python triple-quoted string.
statement = '''

'''
print(statement.strip())

# Part A — Core Python (basics that show up everywhere)


## A1) Data types: tuples, lists, dicts
1. Create:
   - a **tuple** `station_tuple` with 3 station codes (strings)
   - a **list** `magnitudes` with at least 6 floats
   - a **dict** `event` with keys: `"time"`, `"station"`, `"etype"`, `"magnitude"`
2. Print each object and its `type(...)`.

**Goal:** demonstrate you know the difference between tuple/list/dict and when they’re used.


In [None]:
# TODO



## A2) If / elif / else logic (event labeling)
Write a function:

```python
def label_event(mag):
    '''
    Return one of: "invalid", "micro", "small", "moderate", "large"
    using thresholds:
      mag < 0.0         -> "invalid" (checked first)
      0.0 <= mag < 2.0  -> "micro"
      2.0 <= mag < 4.0  -> "small"
      4.0 <= mag < 6.0  -> "moderate"
      mag >= 6.0        -> "large"
    '''
```
Test it on: `[-0.2, 0.0, 1.9, 2.0, 3.99, 4.0, 5.5, 6.0, 7.1]`

The negative-magnitude check takes priority: return `"invalid"` before testing any other threshold.


In [None]:
# TODO



## A3) For loops: accumulation + indexing
Given a list `x` of numbers, compute:
1. The sum
2. The sum of squares
3. The mean (without using `np.mean`)

Requirements:
- Use a **for loop**
- Use **indexing** at least once (e.g., `x[i]`)

Use `x = [1, 2, 3, 4, 5, 6, 7]`.


In [None]:
# TODO



## A4) While loops: rounding error trap (1/10 problem)
This tests numerical reasoning and loop logic.

Task:
1. Start with `total = 0.0`
2. In a **while loop**, repeatedly add `0.1` and count how many steps it takes to reach 1.
3. Try to stop using `while total != 1.0:` (or, equivalently, `while not total == 1.0:`) and observe what happens.
4. Fix it robustly (e.g., use an epsilon tolerance or count steps).

Print:
- the final `total`
- the number of steps

Also add a Python comment (1–3 sentences) explaining what went wrong and why.
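
As a side illustration of the "epsilon tolerance" idea in step 4, here is a minimal sketch on a different sum (`0.1 + 0.2`), not on the loop you are asked to write. The helper name `nearly_equal` and the tolerance `1e-9` are assumptions chosen for this example only.

```python
import math

# Neither 0.1 nor 0.2 has an exact binary representation,
# so their sum is not exactly the float closest to 0.3.
a = 0.1 + 0.2

print(a == 0.3)                             # exact comparison
print(math.isclose(a, 0.3, rel_tol=1e-9))   # tolerance-based comparison


def nearly_equal(x, y, tol=1e-9):
    """Hypothetical helper: True when x and y differ by at most tol."""
    return abs(x - y) <= tol


print(nearly_equal(a, 0.3))
```

The exact comparison prints `False`, while both tolerance-based checks print `True`. Which tolerance is appropriate depends on the size of the numbers involved.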


In [None]:
# TODO



## A5) Overflow and underflow (integers and floats)
Use small dtypes so the behavior is obvious.

### (a) Integer overflow with `np.int8`
1. Create `a = np.int8(120)`
2. Add 1 repeatedly in a loop 20 times (use `np.int8(1)` so the result stays `int8` on every NumPy version), storing values in a list.
3. Print the list.

Explain in a comment what happened and why.

### (b) Float underflow / overflow with `np.float16`
1. Create `b = np.float16(1.0)`
2. Repeatedly divide by 2 in a loop until it becomes exactly 0.
3. Count steps, and print the last 10 nonzero values you saw (or as many as you have).

Then:
4. Create `c = np.float16(1.0)` and repeatedly multiply by 2 until it becomes `inf`.
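
When interpreting your results, it can help to look up the limits of each dtype. The sketch below only queries those limits with `np.iinfo` and `np.finfo`; it does not perform the loops asked for above. The variable names are chosen for illustration.

```python
import numpy as np

# Integer dtypes: smallest and largest representable values.
int8_info = np.iinfo(np.int8)
print(int8_info.min, int8_info.max)

# Floating-point dtypes: largest finite value and smallest positive normal value.
f16_info = np.finfo(np.float16)
print(f16_info.max)
print(f16_info.tiny)
```

`np.int8` covers -128 to 127. Note that `tiny` is the smallest *normal* value; `float16` can still represent smaller subnormal values before reaching zero.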


In [None]:
# TODO



## A6) Functions: Fibonacci-like sequence (generalized)
Write a function:

```python
def fib_like(a0, a1, N):
    '''
    Return a list (or np.array) of length N, with:
      seq[0] = a0
      seq[1] = a1
      seq[n] = seq[n-1] + seq[n-2]  for n >= 2
    Requirements:
      - N must be > 20 (raise ValueError otherwise)
    '''
```
Tasks:
1. Generate a sequence with `a0=1`, `a1=1`, `N=30`.
2. Compute the ratio `seq[n+1]/seq[n]` for all valid n.
3. Make a matplotlib plot of ratio vs index.
4. Print the last 5 ratios.

**This tests:** functions, error handling, indexing, plotting.


In [None]:
# TODO



## A7) File I/O — text
You are given `data/stations.txt` with one station code per line.

Tasks:
1. Read the file.
2. Strip whitespace and ignore blank lines.
3. Create:
   - a Python list of station codes
   - a dict mapping station code -> integer index (starting at 0)
4. Print the first 5 stations and the dict entry for the first station.

**Optional shell hint:** `wc -l`, `head`, `tail`, `sort`, `uniq`.


In [None]:
# TODO



## A8) File I/O — CSV (no pandas)
You are given `data/events.csv` with columns like:

`time,station,etype,magnitude`

Tasks (use `csv` module, not pandas):
1. Read all rows into a list of dicts.
2. Count events by `etype` (dict: etype -> count).
3. Compute mean magnitude by `etype` (dict: etype -> mean).
4. Find the **largest magnitude event** and print its full row dict.

**Shell hint:** Try using `cut`, `awk`, `sort`, `uniq -c` to sanity-check counts.
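
To show what "a list of dicts" looks like, here is a small sketch using `csv.DictReader` on made-up rows held in memory. The sample rows, station codes, and the `LP` event type are invented for illustration and are not the contents of `data/events.csv`.

```python
import csv
import io

# Made-up sample rows (NOT the contents of data/events.csv).
sample = io.StringIO(
    "time,station,etype,magnitude\n"
    "2025-03-01T12:00:00,STA1,VT,1.5\n"
    "2025-03-01T12:05:00,STA2,LP,2.1\n"
)

# DictReader uses the header row as keys; each data row becomes one dict.
rows = list(csv.DictReader(sample))
print(rows[0])

# Every value is read as a string, so convert before doing arithmetic.
print(type(rows[0]["magnitude"]))
mag = float(rows[0]["magnitude"])
print(mag)
```

For a real file, the same reader can be wrapped around an open file handle instead of `io.StringIO`.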


In [None]:
# TODO



## A9) Conversions between data structures (list, tuple, dict, DataFrame)
Using your events data:
1. Create a **list of tuples**: `(time, station, etype, magnitude)` for each event.
2. Convert that into a **pandas DataFrame** with appropriate column names.
3. Convert the DataFrame back into:
   - list of dicts
   - dict of lists (column -> list)
4. Print the `type(...)` of each object and show one example row/item from each.


In [None]:
# TODO



# Interlude — Command line skills (recommended)
Some data wrangling is faster in the shell than in Python. For the tasks below, you may either:

- use Python **or**
- use shell commands with `!` in a notebook cell (recommended where obvious)

Example:
```bash
!head -n 5 data/events.csv
```

If you use shell commands, paste the command(s) you ran and the output.



## S1) Quick inspection
Without using pandas, answer:
1. How many lines are in `data/events.csv`?
2. What are the unique event types (`etype`) and their counts?

Prefer shell tools (`wc`, `cut`, `sort`, `uniq -c`), but Python is allowed.


In [None]:
# TODO: Put your shell commands (with !) and/or Python here.



## S2) Filter and extract
Create a new file `data/vt_events.csv` containing:
- the header row, plus
- only rows where `etype` is exactly `VT`

Do this with either:
- shell tools (`grep`, `awk`) **or**
- Python

Then report how many VT events were written.


In [None]:
# TODO



## S3) One-liner challenge (optional bonus diagnostic)
In ONE shell pipeline, print the **top 5 stations** by number of events.

(You can assume station is the 2nd column.)


In [None]:
# TODO (optional)


# Part B — Scientific Python (NumPy / pandas / matplotlib / ObsPy)


## B1) NumPy arrays and vectorization
Create a time vector `t` from 0 to 100 seconds with `dt=0.01`.

Create a synthetic signal:

$$x(t) = \sin(2\pi f_1 t) + 0.3\,\sin(2\pi f_2 t)$$

with `f1 = 1.0` Hz and `f2 = 3.0` Hz.

Tasks:
1. Compute mean and standard deviation of `x`.
2. Find the index of the maximum value and report the corresponding time.
3. Create a new array `x2` that is `x` but with values clipped to [-1, 1]. (Do not use a Python loop.)


In [None]:
# TODO



## B2) Boolean masks and threshold picking
Using `x` from B1:
1. Make a boolean mask where `abs(x) > 0.9`.
2. Count how many samples exceed the threshold.
3. Find the **first** time when the threshold is exceeded.
4. Find all **contiguous segments** where the threshold is exceeded and report how many segments there are.

Hint: For segments, you can look at where the mask changes from False->True and True->False
(e.g., using `np.diff` on an integer version of the mask).
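
The hint above can be sketched on a tiny hand-made mask (not the mask from `x`). The names `change`, `starts`, and `ends` are chosen for this example.

```python
import numpy as np

# Toy mask with two runs of True: indices 1-2 and index 5.
mask = np.array([False, True, True, False, False, True, False])

# diff of the 0/1 version is +1 at False->True and -1 at True->False.
change = np.diff(mask.astype(int))

starts = np.flatnonzero(change == 1) + 1   # first index of each True run
ends = np.flatnonzero(change == -1) + 1    # index just past each True run

print(starts, ends)
```

Here `starts` is `[1, 5]` and `ends` is `[3, 6]`. A run that touches the first or last sample of the array has no transition on that side, so `np.diff` alone will not detect that edge; this toy mask avoids that case.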


In [None]:
# TODO



## B3) Matplotlib plotting essentials
Plot `x` vs `t`.

Requirements:
- Labels: "Time (s)", "Amplitude"
- Title
- Grid
- Add a second line on the same plot: the clipped `x2` (use a legend)


In [None]:
# TODO



## B4) pandas: groupby + summary tables
Using `data/events.csv`:
1. Load into a DataFrame.
2. Show the first 5 rows.
3. Compute mean magnitude by `etype`.
4. Compute count by `station` and show the top 10 stations by count.
5. Create a pivot table of mean magnitude with rows=station and columns=etype (leave missing combinations as NaN).


In [None]:
# TODO



## B5) Time handling (UTCDateTime)
1. Convert `"2025-03-01T12:00:00"` into a `UTCDateTime`.
2. Create a list of 10 times separated by 15.0 seconds starting at that time.
3. Print them in ISO format.


In [None]:
# TODO


# Part C — ObsPy and waveform file formats


## C1) Reading and inspecting waveforms
You are given a directory of SAC files in `data/sac/`.

Tasks:
1. List all SAC files in that directory (glob).
2. Read **one** SAC file with ObsPy and store it as `st` (a Stream).
3. Print:
   - number of traces
   - for the first trace: `network.station.location.channel`, starttime, endtime, sampling rate, npts


In [None]:
# TODO



## C2) Convert SAC → MiniSEED with naming convention
Convert **all** SAC files in `data/sac/` to MiniSEED in an output directory `data/mseed/`.

Naming convention for each output MiniSEED file:
```
<NET>.<STA>.<LOC>.<CHA>.<YYYYMMDDTHHMMSS>.mseed
```
Where the timestamp is the trace starttime in UTC, formatted like `20250301T120000`.

Requirements:
- Create the output directory if it doesn't exist.
- Use ObsPy to read and write (no external converters).
- Handle missing LOC codes: if location is empty, use `"--"` as the LOC in the filename.
- Print a summary at the end:
  - number of SAC input files
  - number of MiniSEED files written
  - show the first 5 output filenames

**Note:** If a SAC file reads into multiple traces, write one MiniSEED per trace.
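
To make the naming convention concrete, here is a sketch that builds one filename from plain values using only the standard library. The helper name `mseed_name` and the example codes (`XX`, `STA1`, `HHZ`) are hypothetical; in your solution these values would come from the trace's metadata.

```python
from datetime import datetime, timezone


def mseed_name(net, sta, loc, cha, start):
    """Hypothetical helper: <NET>.<STA>.<LOC>.<CHA>.<YYYYMMDDTHHMMSS>.mseed"""
    loc = loc if loc else "--"                 # empty location code -> "--"
    stamp = start.strftime("%Y%m%dT%H%M%S")    # e.g. 20250301T120000
    return f"{net}.{sta}.{loc}.{cha}.{stamp}.mseed"


# Made-up example values, with an empty location code.
start = datetime(2025, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
name = mseed_name("XX", "STA1", "", "HHZ", start)
print(name)
```

This prints `XX.STA1.--.HHZ.20250301T120000.mseed`. ObsPy's `UTCDateTime` also provides a `strftime` method, so the same format string should carry over.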


In [None]:
# TODO



## C3) Basic processing on one trace
Pick one converted MiniSEED file and:
1. Read it.
2. Detrend (demean + linear).
3. Bandpass 1–10 Hz.
4. Compute RMS amplitude and peak absolute amplitude.
5. Plot the filtered waveform (time on x-axis in seconds from start).

(No need to do instrument correction in this diagnostic.)
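
For reference, RMS amplitude is the square root of the mean of the squared samples, and peak absolute amplitude is the largest absolute sample value. The sketch below applies both definitions to a short made-up list in pure Python; the function names `rms` and `peak_abs` are chosen for illustration, and a NumPy version on trace data would normally be vectorized instead.

```python
import math


def rms(samples):
    """Root-mean-square: sqrt of the mean of the squared samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def peak_abs(samples):
    """Largest absolute sample value."""
    return max(abs(v) for v in samples)


# Made-up samples: squares are 9, 16, 0, 1, so the mean square is 6.5.
data = [3.0, -4.0, 0.0, 1.0]
print(rms(data), peak_abs(data))
```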


In [None]:
# TODO



## C4) Short concept questions (answer in comments)
1. What is the difference between a Python **list** and a NumPy **array**?
2. What is a pandas **DataFrame** conceptually (compared to list/dict)?
3. What is the difference between an ObsPy **Stream** and **Trace**?
4. Why might `total == 1.0` fail in a loop after repeatedly adding 0.1?
5. Define integer overflow and float underflow in your own words.


In [None]:
# TODO: Answer the questions above as comments.



# Wrap-up
1. Restart your kernel and run **all** cells to ensure your notebook runs top-to-bottom.
2. Rename the notebook to `python_diagnostic_<LASTNAME>_<FIRSTNAME>.ipynb`
3. Commit and push to your public GitHub repo.
4. Submit the direct GitHub link to the notebook file.
