# EE 519 — Speech AI
## HW-1 | Notebook 5: Speech Production, Sound Categories, Segmentation, Variability

**Student Name:**  
**USC ID:**  
**Date:**  

---

### Learning Objectives
By completing this notebook, you will:
- Practice recording and analyzing **speech sounds** (vowels, consonants)
- Connect **speech production** (voicing, place/manner) to signal patterns
- Perform **basic segmentation**: words → syllables → phones (approx.)
- Study **variability** across repeated recordings and speakers
- Identify **coarticulation** in real speech using waveform + spectrogram + listening

> ⚠️ **Important**
> - All answers (code + explanations) must be written **inside this notebook**
> - Do **not** delete questions or prompts
> - Clearly label all plots (title, axes, units)
> - Use **relative paths only** and keep audio under `./audio/` for grading


### Grading (Notebook 5 — 20 points)

| Component | Points |
|---|---:|
| Recording/collection of required sounds + clean organization | 4 |
| Variability analysis (same speaker repeats + multi-speaker) | 5 |
| Segmentation (words/syllables/phones) with evidence | 6 |
| Speech production explanations (voiced/voiceless, manner/place cues) | 3 |
| Clarity, structure, reflections | 2 |

> We grade **understanding and reasoning**, not perfection.


---

# 0. Setup (Reproducibility)

This notebook must run **quickly and reproducibly** for grading.

## ✅ Reproducibility requirements
- Put all audio files in `./audio/` (relative paths only)
- No absolute paths, no cloud mounts
- We should be able to download your ZIP and run this notebook top-to-bottom

Recommended structure:
```
HW1/
├── HW1_Notebook5_Speech_Production_Segmentation_Variability.ipynb
└── audio/
    ├── speaker_self/
    ├── speaker_2/           (optional but recommended)
    └── speaker_3/           (optional)
```

## Recording requirements (minimum)
You must collect at least:
### (A) Same-speaker repetition
Record yourself saying the sentence **3 times**:
- Prompt: **“My name is _____. I am taking a speech AI course.”**

Save as:
- `./audio/speaker_self/sent_1.wav`
- `./audio/speaker_self/sent_2.wav`
- `./audio/speaker_self/sent_3.wav`

### (B) Sound categories (self)
Record these **isolated sounds** (≈1–2 sec each):
- Vowels: /a/ and /i/
- Fricatives (unvoiced): /s/ and /f/

Save as:
- `./audio/speaker_self/vowel_a.wav`
- `./audio/speaker_self/vowel_i.wav`
- `./audio/speaker_self/fric_s.wav`
- `./audio/speaker_self/fric_f.wav`

### (C) Multi-speaker (recommended)
Record the same sentence from **at least one other person** (friend/classmate) OR use a public speech clip.
Save as:
- `./audio/speaker_2/sent_1.wav`

> If you cannot record another person, you must explain why and use a public clip as “speaker_2”.


In [None]:
# TODO: Imports
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Audio, display


In [None]:
# TODO: WAV loader (reuse from earlier notebooks)
def load_wav(path):
    """Return mono float signal in [-1, 1] and sample rate fs."""
    raise NotImplementedError
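
One possible shape for the loader is sketched below. It is only a sketch: it assumes the file is uncompressed 16-bit PCM and uses the standard-library `wave` module, and the name `load_wav_pcm16` is hypothetical. If your recorder writes 24-bit or float WAVs, you will need a different reader or a conversion step.

```python
import wave

import numpy as np


def load_wav_pcm16(path):
    """Return (x, fs): mono float32 signal in [-1, 1) and sample rate in Hz.

    Assumption: the file is uncompressed 16-bit PCM.
    """
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        fs = wf.getframerate()
        n_channels = wf.getnchannels()
        raw = wf.readframes(wf.getnframes())

    # Samples are little-endian int16, interleaved by channel.
    x = np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0

    # Average the channels to get a mono signal.
    if n_channels > 1:
        x = x.reshape(-1, n_channels).mean(axis=1)
    return x, fs
```

Dividing by 32768 maps the int16 range onto [-1, 1), which matches the docstring of `load_wav` above.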


In [None]:
# TODO: Helpers
def play_audio(x, fs):
    display(Audio(x, rate=fs))

def plot_waveform(x, fs, title, tlim=None):
    raise NotImplementedError

def plot_spectrogram(x, fs, title, n_fft=1024, win_ms=25, hop_ms=10):
    raise NotImplementedError
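
The numerical part of `plot_spectrogram` could look roughly like the sketch below: a Hann-windowed short-time Fourier transform in plain NumPy. The function name `stft_db` and its return layout are assumptions for illustration. One detail worth noting: a 25 ms window at 44.1 kHz is 1102 samples, which is longer than the default `n_fft=1024`, so the sketch raises `n_fft` to the next power of two when needed instead of silently truncating frames.

```python
import numpy as np


def stft_db(x, fs, n_fft=1024, win_ms=25, hop_ms=10):
    """Return (times_s, freqs_hz, S_db); S_db has shape (n_freqs, n_frames)."""
    x = np.asarray(x, dtype=float)
    win_len = int(round(fs * win_ms / 1000))
    hop = int(round(fs * hop_ms / 1000))
    if len(x) < win_len:
        raise ValueError("signal is shorter than one analysis window")

    # Make sure the FFT is at least as long as the window.
    n_fft = max(n_fft, 1 << (win_len - 1).bit_length())

    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop

    # Slice overlapping frames and apply the window to each one.
    frames = np.stack(
        [x[i * hop : i * hop + win_len] * window for i in range(n_frames)]
    )

    # rfft zero-pads each frame from win_len to n_fft samples.
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    S_db = 20 * np.log10(mag + 1e-10).T

    times = (np.arange(n_frames) * hop + win_len / 2) / fs  # frame centers
    freqs = np.fft.rfftfreq(n_fft, d=1 / fs)
    return times, freqs, S_db
```

The result can be drawn with `plt.pcolormesh(times, freqs, S_db, shading="auto")`, with the x-axis labeled in seconds, the y-axis in Hz, and a colorbar in dB.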


---

# 1. Organize and Load Your Recordings

### Task
Create a Python dictionary that lists your file paths.

**Do not hardcode absolute paths**. Use the provided relative path patterns.

Then load each file and print:
- sampling rate `fs` (Hz)
- duration (s)
- min/max amplitude

> If sampling rates differ, document what you do (resample or keep separate).


In [None]:
# TODO: Provide your paths here
paths = {
    "self_sent_1": "./audio/speaker_self/sent_1.wav",
    "self_sent_2": "./audio/speaker_self/sent_2.wav",
    "self_sent_3": "./audio/speaker_self/sent_3.wav",
    "vowel_a": "./audio/speaker_self/vowel_a.wav",
    "vowel_i": "./audio/speaker_self/vowel_i.wav",
    "fric_s": "./audio/speaker_self/fric_s.wav",
    "fric_f": "./audio/speaker_self/fric_f.wav",
    # Optional / recommended:
    "speaker2_sent_1": "./audio/speaker_2/sent_1.wav",
}


In [None]:
# TODO: Load all recordings into a dict: signals[name] = (x, fs)
signals = {}
# for name, p in paths.items():
#     ...
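
Once `signals` is filled, the per-file summary could be produced along these lines. This is a sketch under the assumption that each entry is an `(x, fs)` pair as described above; the helper names `describe_signal` and `print_summary` are hypothetical, and the RMS column is an extra that the task does not require.

```python
import numpy as np


def describe_signal(x, fs):
    """Return basic statistics for one recording."""
    x = np.asarray(x, dtype=float)
    return {
        "fs_hz": fs,
        "duration_s": len(x) / fs,
        "min": float(x.min()),
        "max": float(x.max()),
        "rms": float(np.sqrt(np.mean(x ** 2))),
    }


def print_summary(signals):
    """Print one aligned row per recording in a {name: (x, fs)} dict."""
    for name, (x, fs) in signals.items():
        s = describe_signal(x, fs)
        print(
            f"{name:>16s}  fs={s['fs_hz']} Hz  dur={s['duration_s']:.2f} s  "
            f"min={s['min']:+.3f}  max={s['max']:+.3f}  rms={s['rms']:.3f}"
        )
```

A min or max pinned at the edge of the range usually points to clipping during recording, which is worth noting in your write-up.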


### Quick sanity check
For 2–3 files:
- Plot waveform
- Plot spectrogram
- Play audio


In [None]:
# TODO: sanity check (choose a few examples)
# name = "self_sent_1"
# x, fs = signals[name]
# plot_waveform(x, fs, f"{name}: waveform")
# plot_spectrogram(x, fs, f"{name}: spectrogram")
# play_audio(x, fs)


---

# 2. Same-Speaker Variability (Repeatability)

You recorded the same sentence 3 times.

### Task
Compare:
- speaking rate (duration)
- amplitude range / energy distribution
- timing differences (pauses)
- spectral patterns

You must include:
- Overlay plots for the three recordings (time-aligned if possible)
- At least one spectrogram comparison


In [None]:
# TODO: Extract the 3 self recordings
# x1, fs1 = signals["self_sent_1"]
# x2, fs2 = signals["self_sent_2"]
# x3, fs3 = signals["self_sent_3"]


In [None]:
# TODO: Handle sampling rate mismatch (if any)
# Decide: resample to common fs or keep separate and explain.
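
If you decide to resample, one very simple option is linear interpolation onto a new time grid, sketched below with the hypothetical name `resample_linear`. It applies no anti-aliasing filter, so treat it as adequate for upsampling or for quick visual comparison only. For downsampling, a filtered method such as `scipy.signal.resample_poly` is the safer choice, assuming SciPy is available in your environment.

```python
import numpy as np


def resample_linear(x, fs_in, fs_out):
    """Resample x from fs_in to fs_out by linear interpolation (no filtering)."""
    x = np.asarray(x, dtype=float)
    if fs_in == fs_out:
        return x
    n_out = int(round(len(x) * fs_out / fs_in))
    t_in = np.arange(len(x)) / fs_in      # original sample times (s)
    t_out = np.arange(n_out) / fs_out     # target sample times (s)
    # np.interp holds the last input value for any t_out past the final t_in.
    return np.interp(t_out, t_in, x)
```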


In [None]:
# TODO: Compare durations and basic stats in a small table (markdown or printed)


In [None]:
# TODO: Overlay waveforms (suggestion: normalize each to same max abs for viewing)
# Tip: plot the first N seconds or align to start-of-speech using a simple energy threshold
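
The energy-threshold alignment suggested in the tip could be sketched as follows. The function names, the 10 ms frame length, and the relative threshold of 0.1 are assumptions chosen for illustration; a noisy recording may need a higher threshold.

```python
import numpy as np


def find_speech_onset(x, fs, frame_ms=10, rel_threshold=0.1):
    """Return the start time (s) of the first frame whose RMS exceeds
    rel_threshold times the largest frame RMS in the recording."""
    x = np.asarray(x, dtype=float)
    frame_len = max(1, int(fs * frame_ms / 1000))
    n_frames = len(x) // frame_len
    if n_frames == 0:
        return 0.0
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    above = np.nonzero(rms > rel_threshold * rms.max())[0]
    return float(above[0] * frame_len / fs) if above.size else 0.0


def align_and_normalize(x, fs, **kwargs):
    """Trim leading silence, then scale so the peak absolute value is 1."""
    x = np.asarray(x, dtype=float)
    onset = int(round(find_speech_onset(x, fs, **kwargs) * fs))
    y = x[onset:]
    peak = np.max(np.abs(y)) if y.size else 0.0
    return y / peak if peak > 0 else y
```

After this step, all three recordings start at roughly the same point in the utterance and share a peak of 1, so an overlay mostly shows timing and envelope differences instead of differences in recording level.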


In [None]:
# TODO: Spectrogram comparison
# (use the common fs instead if you resampled above)
# plot_spectrogram(x1, fs1, "Self sentence 1")
# plot_spectrogram(x2, fs2, "Self sentence 2")
# plot_spectrogram(x3, fs3, "Self sentence 3")


### Observations (Same speaker, repeats)

Answer in 10–14 lines:

- What stays stable across repeats?
- What changes the most (timing, amplitude, articulation clarity)?
- Is variability more obvious in waveform or spectrogram? Why?
- What does this suggest about “speech as a signal”?


---

# 3. Multi-Speaker Variability

Compare your sentence with another speaker (or a public clip).

### Task
Include:
- Waveform and spectrogram comparison
- At least one voiced region comparison (pitch differences)
- One unvoiced region comparison (fricative/noise differences)

> If you use a public clip, specify its source in a markdown cell.


In [None]:
# TODO: Load second speaker sentence
# xs, fss = signals["speaker2_sent_1"]
# plot_waveform(xs, fss, "Speaker 2: waveform")
# plot_spectrogram(xs, fss, "Speaker 2: spectrogram")
# play_audio(xs, fss)


### Observations (Across speakers)

Answer in 10–14 lines:

- Compare pitch range (qualitatively, or by roughly estimating F0 from the spacing between harmonics)
- Compare speaking rate
- Compare spectral envelope (formant regions)
- Which differences are likely anatomical vs habitual?
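
The prompt above suggests reading F0 from the harmonics in the spectrogram. As a cross-check, a rough F0 estimate can also be obtained in the time domain by autocorrelation, which is a different technique from harmonic spacing. The sketch below is illustrative only: the name `estimate_f0_autocorr` and the 60–400 Hz search range are assumptions, and it expects a short, clearly voiced frame (about 40–50 ms taken from a vowel).

```python
import numpy as np


def estimate_f0_autocorr(frame, fs, f0_min=60.0, f0_max=400.0):
    """Rough F0 (Hz) of a voiced frame from its autocorrelation peak."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()          # remove DC offset

    lag_min = int(fs / f0_max)            # shortest period considered
    lag_max = int(fs / f0_min)            # longest period considered
    if len(frame) <= lag_max:
        raise ValueError("frame too short for the requested f0_min")

    # Full autocorrelation; keep non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]

    # The strongest peak inside the allowed lag range gives the period.
    best_lag = lag_min + int(np.argmax(ac[lag_min : lag_max + 1]))
    return fs / best_lag
```

On unvoiced or very noisy frames this still returns a number, but it is not meaningful, so apply it only to regions you have already judged to be voiced.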


---

# 4. Sound Categories: Vowels vs Fricatives (Voiced vs Unvoiced)

You recorded:
- Vowels: /a/, /i/
- Fricatives: /s/, /f/

### Task
For each sound:
- Plot the waveform, zoomed to a 50–100 ms window
- Plot spectrogram
- Describe the key differences

Focus on:
- periodicity vs noise
- energy distribution across frequency
- voicing cues
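
Two simple numbers can back up the visual comparison, although the task does not require them: the zero-crossing rate (high for noise-like sounds, low for periodic ones) and the spectral centroid (where the energy sits in frequency). The sketch below is one way to compute them; both function names are hypothetical.

```python
import numpy as np


def zero_crossing_rate(x):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(np.asarray(x, dtype=float))
    return float(np.mean(signs[1:] != signs[:-1]))


def spectral_centroid_hz(x, fs):
    """Magnitude-weighted mean frequency of a Hann-windowed segment (Hz)."""
    x = np.asarray(x, dtype=float)
    mag = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    total = mag.sum()
    return float((freqs * mag).sum() / total) if total > 0 else 0.0
```

Computed on a steady 50–100 ms slice of each recording, these values would typically be much higher for /s/ and /f/ than for /a/ and /i/, but check this against your own data instead of assuming it.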


In [None]:
# TODO: Analyze each isolated sound
# for name in ["vowel_a", "vowel_i", "fric_s", "fric_f"]:
#     x, fs = signals[name]
#     plot_waveform(x, fs, f"{name}: waveform (zoom)", tlim=(..., ...))
#     plot_spectrogram(x, fs, f"{name}: spectrogram")
#     play_audio(x, fs)


### Conceptual questions (Speech production)

Answer clearly:

1. Why do vowels show a harmonic structure but fricatives often do not?  
2. /s/ and /f/ are both unvoiced fricatives. Why do they sound different?  
3. How do place and manner of articulation relate to what you see in the spectrogram?


---

# 5. Segmentation: Words → Syllables → Phones (Approx.)

### Goal
Take one of your self-sentences and annotate:
- Word boundaries (start/end times)
- Syllable boundaries (approx.)
- Phone-level boundaries for at least **one word** (approx.)

You must provide **evidence**:
- waveform plot with vertical lines
- spectrogram plot with vertical lines
- playback of each segmented region

> You do not need perfect phonetic labeling. We evaluate reasoning and evidence.


In [None]:
# TODO: Choose the recording to segment
seg_name = "self_sent_1"
# x, fs = signals[seg_name]


## 5.1 Define boundaries (in seconds)

### Task
Create lists of boundary times.

Examples:
- `word_bounds = [(t0,t1,"My"), (t1,t2,"name"), ...]`
- `syll_bounds = [(t0,t1,"my"), (t1,t2,"name"), ...]`  (approx.)
- `phone_bounds = [(t0,t1,"m"), (t1,t2,"ay"), ...]`   (approx. for one word)

You may use:
- waveform inspection
- spectrogram inspection
- listening + trial-and-error
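
Given boundary tuples in the `(start, end, "label")` form shown above, cutting out a segment and sanity-checking a boundary list could be sketched as follows. The helper names `extract_segment` and `check_bounds` are hypothetical, and the check allows gaps between segments so that pauses do not have to be labeled.

```python
import numpy as np


def extract_segment(x, fs, t_start, t_end):
    """Return the samples of x between t_start and t_end (seconds)."""
    if not 0 <= t_start < t_end:
        raise ValueError("need 0 <= t_start < t_end")
    i0 = int(round(t_start * fs))
    i1 = min(int(round(t_end * fs)), len(x))  # clip to the signal length
    return x[i0:i1]


def check_bounds(bounds):
    """Raise if segments are empty, out of order, or overlapping."""
    prev_end = 0.0
    for start, end, label in bounds:
        if start < prev_end or end <= start:
            raise ValueError(f"bad boundary for {label!r}: ({start}, {end})")
        prev_end = end
```

With these, playback of every word reduces to a loop such as `for t0, t1, label in word_bounds: play_audio(extract_segment(x, fs, t0, t1), fs)`, using the `play_audio` helper from the setup section.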


In [None]:
# TODO: Fill these with your boundary estimates (seconds)
word_bounds = [
    # (start, end, "label"),
]

syll_bounds = [
    # (start, end, "label"),
]

phone_bounds = [
    # For ONE chosen word only
    # (start, end, "label"),
]


In [None]:
# TODO: Plot waveform with boundaries
# - Overlay word boundaries in one plot
# - Overlay syllable boundaries in another plot
# - Overlay phone boundaries for one word (zoomed)


In [None]:
# TODO: Plot spectrogram with boundaries
# Use vertical lines (plt.axvline) at boundary times


In [None]:
# TODO: Play segments
# For each word in word_bounds:
#   extract and play that segment
# For each syllable in syll_bounds:
#   extract and play
# For phone_bounds (one word):
#   extract and play


### Observations (Segmentation & coarticulation)

Answer in 12–16 lines:

- Which boundaries were easy to identify? Which were hard? Why?
- Where do you see coarticulation (smooth transitions rather than clear breaks)?
- Give one example where the phone boundary is ambiguous, and explain why.


---

# 6. Mini-Analysis: Voiced vs Voiceless Regions in a Sentence

### Task
Pick two short regions from your sentence:
- one clearly voiced (vowel-like)
- one clearly voiceless (fricative-like or unvoiced consonant)

For each region:
- plot waveform (zoom)
- plot spectrum or spectrogram
- explain differences in terms of production and acoustics
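
For the "plot spectrum" option, a single magnitude spectrum of a short region could be computed as in the sketch below. The name `magnitude_spectrum_db` is hypothetical, and normalizing the strongest bin to 0 dB is a presentation choice made here so that voiced and voiceless regions can be compared by shape instead of by level.

```python
import numpy as np


def magnitude_spectrum_db(segment, fs):
    """Return (freqs_hz, mag_db) for a Hann-windowed segment, peak at 0 dB."""
    segment = np.asarray(segment, dtype=float)
    windowed = segment * np.hanning(len(segment))
    mag = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(segment), d=1 / fs)
    # Small constants guard against division by zero and log of zero.
    mag_db = 20 * np.log10(mag / (mag.max() + 1e-12) + 1e-12)
    return freqs, mag_db
```

Plotting `mag_db` against `freqs` for both regions on the same axes (x in Hz, y in dB) makes it easier to see evenly spaced harmonic peaks in the voiced region and a smoother, noise-like spectrum in the voiceless one, if your recordings show that pattern.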


In [None]:
# TODO: Define two regions (seconds)
voiced_region = (None, None)
voiceless_region = (None, None)


In [None]:
# TODO: Extract, plot, and play the two regions


### Conceptual explanation (Voicing)

Write 8–12 lines:

- How does voicing appear in waveform and spectrogram?
- What cues did you use to label a region as voiced vs voiceless?


---

# 7. Summary: Key Takeaways

Write 8–12 lines:

- What did you learn about speech variability (same speaker vs different speaker)?
- What did segmentation teach you about coarticulation?
- How did speech production theory help you interpret your plots?


---

# 8. Reflection (Mandatory)

Write thoughtful answers (be specific):

1. What did you learn about your own speech signal that surprised you?  
2. What was the hardest part: recording, organizing files, variability analysis, or segmentation? Why?  
3. Which visualization helped most (waveform vs spectrogram vs playback)?  
4. If you could redo one part, what would you do differently?  
5. What is one question you now want to explore further (e.g., pitch tracking, formant estimation, phoneme classification)?


---

# 9. AI Use Disclosure (Required)

If you used any AI tools (including ChatGPT), briefly describe:
- What you used them for (e.g., debugging, concept clarification)
- What you wrote/changed yourself

*(If you did not use AI, write “No AI tools used.”)*
