# Predicting Caller Aggressiveness from Speech Features — Linear & Logistic Regression

**Context.** We collected short audio snippets from customer-service-style calls to study sentiment and perceived aggressiveness. The clips were then played to human annotators, who rated **how aggressive** the speaker sounded:

- a **continuous rating** on a 1–10 scale (averaged across multiple raters); and
- a **binary rating** indicating whether the clip was considered **aggressive (1)** or **not aggressive (0)** after aggregating multiple reviews.

You are given a **preprocessed dataset** that includes:

- acoustic features extracted from the audio (e.g., pitch variability, loudness, duration),
- textual features from automatic speech recognition (ASR) using `whisper-large-v3` (e.g., transcript, word counts),
- both aggression targets: `rating` (1–10) and `rating_binary` (0/1).

Your task is to **train and interpret** linear and logistic regression models that predict caller aggressiveness based on the provided features.


## Dataset & Features (what they mean and how to read them)

Below are the key columns you'll find in the dataset `processed_aggression_dataset.csv`:

- **`file_path`**: Filename of the audio clip.
- **`rating`** _(target for linear regression)_: Mean perceived aggressiveness (1–10). Higher means more aggressive.
- **`rating_binary`** _(target for logistic regression)_: 1 if considered aggressive, else 0.

### Acoustic features (from audio)

- **`duration_sec`**: Total length of the audio clip in seconds.
- **`mean_pitch`**: Estimated fundamental frequency (F0) in Hz; higher values generally correspond to higher perceived pitch.
- **`pitch_var`**: Standard deviation (variability) of the pitch; higher variability can indicate more dynamic/prosodic speech.
- **`mean_energy`**: Average root-mean-square (RMS) energy; a proxy for loudness/intensity.
- **`spectral_centroid`**: “Brightness” of the sound (Hz); higher values indicate more energy in higher frequencies.

### Textual features (from ASR transcript)

- **`transcript`**: The recognized text of the audio (string).
- **`word_count`**: Number of tokenized words in the transcript.
- **`avg_word_length`**: Average number of characters per word.
- **`speech_rate`**: Words per second (`word_count / duration_sec`).

> **Interpretation tip:** For regression, we will mainly use numerical columns. Text can be used for qualitative interpretation, but the numerical summaries (counts, rates, durations) are what go into the models.
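
As a quick sanity check on the derived column, `speech_rate` can be recomputed from `word_count` and `duration_sec`. The sketch below uses made-up values and a hypothetical helper, `compute_speech_rate`. It assumes that clips with zero duration should yield `NaN`, which the dataset description does not specify.

```python
import numpy as np

def compute_speech_rate(word_count, duration_sec):
    """Words per second; NaN where duration is zero or missing (an assumption)."""
    wc = np.asarray(word_count, dtype=float)
    dur = np.asarray(duration_sec, dtype=float)
    rate = np.full(wc.shape, np.nan)
    valid = dur > 0  # False for zero, negative, and NaN durations
    rate[valid] = wc[valid] / dur[valid]
    return rate

# Illustrative values only, not taken from the dataset
example_rates = compute_speech_rate([12, 30, 5], [4.0, 10.0, 0.0])
print(example_rates)
```

Comparing such a recomputed column against the stored `speech_rate` is one way to spot rows where the transcript or the duration was parsed incorrectly.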


## Exercise 1 — Load the data from Google Drive

Load the CSV into a pandas DataFrame.

> In Google Colab, Google Drive is mounted under `/content/drive`. Your file is located at:
>
> `/content/drive/MyDrive/[your path on the drive].csv`


In [None]:
# If you are running this in Google Colab, first mount your Drive:
try:
    from google.colab import drive  # type: ignore

    drive.mount("/content/drive")
    print("Drive mounted.")
except Exception as e:
    print("Not running in Colab or Drive mount failed:", e)

import os
import pandas as pd
import numpy as np

# Try both common path variants just in case
paths_to_try = [
    "/content/drive/MyDrive/PhD/voice_sentiment/processed_aggression_dataset.csv",
    "content/drive/MyDrive/PhD/voice_sentiment/processed_aggression_dataset.csv",
]

df = None
for p in paths_to_try:
    if os.path.exists(p):
        df = pd.read_csv(p)
        print(f"Loaded: {p}")
        break

if df is None:
    raise FileNotFoundError(
        "Could not find processed_aggression_dataset.csv in the expected locations."
    )

df.head()


## Exercise 2 — Explore data types

Inspect the structure of the DataFrame to see column names and data types.


In [None]:
df.info()


### Exercise 2.1 — Ensure the data types are appropriate

- `rating` should be `?` (continuous 1–10).
- `rating_binary` should be `?` (0/1).
- Feature columns should be `?`.
- `transcript` is text.

Coerce types and handle any parsing issues.


In [None]:
# Expected numeric columns (you can add/remove based on your dataset)
numeric_cols = [
    "rating",
    "rating_binary",
    "duration_sec",
    "mean_pitch",
    "pitch_var",
    "mean_energy",
    "spectral_centroid",
    "word_count",
    "avg_word_length",
    "speech_rate",
]

# Coerce numeric columns
for c in numeric_cols:
    if c in df.columns:
        df[c] = pd.to_numeric(df[c], errors="coerce")

# Ensure integer type for binary target
if "rating_binary" in df.columns:
    df["rating_binary"] = df["rating_binary"].round().astype("Int64")

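# Quick check after coercion (only columns that actually exist, matching the guard above)
present_numeric_cols = [c for c in numeric_cols if c in df.columns]
df[present_numeric_cols].describe()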


### Exercise 2.2 — Explore the `rating` variable

- Plot a histogram of `rating`.
- Comment on the distribution (e.g., skew, peaks, range).


In [None]:
import matplotlib.pyplot as plt

ratings = df["rating"].dropna()

plt.figure()
plt.hist(ratings, bins=10)
plt.title("Distribution of Aggression Rating (1–10)")
plt.xlabel("Rating")
plt.ylabel("Frequency")
plt.show()

print("Count:", ratings.count())
print("Mean:", ratings.mean())
print("Std:", ratings.std())
print("Min/Max:", ratings.min(), ratings.max())


> **Discussion:**  
> Consider whether the distribution is symmetric or skewed. Are there many low or high ratings? Does the spread suggest that predicting exact values might be challenging? Keep this in mind when interpreting the regression model and the RMSE later.
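
To put a number on "symmetric or skewed", one option is the moment-based skewness (the mean of cubed z-scores). The sketch below is a minimal illustration on invented ratings; `sample_skewness` is a hypothetical helper, and the toy values are not from the dataset.

```python
import numpy as np

def sample_skewness(values):
    """Moment-based skewness: mean of cubed deviations over the cubed population std."""
    v = np.asarray(values, dtype=float)
    v = v[~np.isnan(v)]
    centered = v - v.mean()
    std = v.std()  # population standard deviation (ddof=0)
    if std == 0:
        return 0.0
    return float(np.mean(centered ** 3) / std ** 3)

# Hypothetical ratings: many low values and a few high ones
toy_ratings = [1, 1, 2, 2, 2, 3, 3, 4, 8, 10]
print("Skewness:", sample_skewness(toy_ratings))
```

A positive value points to a long right tail (a few very aggressive clips), a negative value to a long left tail, and a value near zero to a roughly symmetric distribution.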


### Exercise 2.3 — Relationship between `rating` and `pitch_var`

Create a scatter plot of `rating` (y-axis) vs. `pitch_var` (x-axis). Comment on whether there seems to be a relationship, and what other factors might influence this relationship that **aren't** visible in the plot (e.g., loudness, speech rate, semantic content).


In [None]:
x = df["pitch_var"]
y = df["rating"]

mask = x.notna() & y.notna()
x_ = x[mask]
y_ = y[mask]

plt.figure()
plt.scatter(x_, y_, alpha=0.7)
plt.title("Aggression Rating vs Pitch Variability")
plt.xlabel("Pitch variability (std of F0)")
plt.ylabel("Aggression rating (1–10)")
plt.show()

# Correlation (Pearson) as a rough summary
corr = np.corrcoef(x_, y_)[0, 1] if len(x_) > 1 else np.nan
print("Pearson correlation (rating, pitch_var):", corr)


> **Discussion:**  
> Even if a visual trend appears, remember that correlation does not imply causation. Discuss how other factors such as **duration**, **energy (loudness)**, **speech rate**, or even **lexical content** could confound or moderate the relationship.


## Exercise 3 — Train–test split and variable separation

We split the data to evaluate generalization on unseen data. A common split is **80% train / 20% test**.  
Define **dependent (target)** and **independent (feature)** variables for the models.


In [None]:
from sklearn.model_selection import train_test_split

# For linear regression target
target_lin = "rating"
features_lin = ["mean_pitch", "duration_sec", "word_count"]

# Filter rows with complete data for selected columns
needed_lin = [target_lin] + features_lin
df_lin = df[needed_lin].dropna().copy()

X_lin = df_lin[features_lin]
y_lin = df_lin[target_lin]

X_train_lin, X_test_lin, y_train_lin, y_test_lin = train_test_split(
    X_lin, y_lin, test_size=0.2, random_state=42
)

X_train_lin.shape, X_test_lin.shape


## Exercise 4 — Linear Regression (OLS)

We'll model the continuous `rating` from three predictors.


### Exercise 4.1 — Create the model:

**Formula:** `rating ~ mean_pitch + duration_sec + word_count`


In [None]:
import statsmodels.api as sm

X_train_const = sm.add_constant(X_train_lin)  # adds intercept
ols_model = sm.OLS(y_train_lin, X_train_const).fit()
ols_model


### Exercise 4.2 — Return the summary


In [None]:
ols_summary = ols_model.summary()
print(ols_summary)


### Exercise 4.3 — Interpret the model output

- **Coefficients**: The sign tells you the direction (positive/negative) of association with `rating`. Magnitude indicates change in `rating` per unit increase of the predictor (holding others constant).
- **p-values**: Indicate whether each coefficient is statistically distinguishable from zero (given assumptions).
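- **R-squared**: Proportion of variance in `rating` explained by the model (computed on the training set), between 0 and 1.
- **Assumptions**: OLS assumes a linear relationship, independent errors, and constant error variance (homoscedasticity). Normally distributed residuals matter mainly for the p-values and confidence intervals, especially in small samples.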
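
To make the coefficient and R-squared definitions concrete, the following sketch fits ordinary least squares by hand with NumPy on synthetic data. The data, the true coefficients (5, 2, -1), and the variable names are all invented for illustration and are unrelated to the aggression dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Synthetic target with known intercept 5 and slopes 2 and -1, plus noise
y = 5.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with an explicit intercept column (what sm.add_constant does)
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# R-squared = 1 - (residual sum of squares / total sum of squares)
residuals = y - X @ beta
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print("Coefficients (const, x1, x2):", np.round(beta, 2))
print("R-squared:", round(float(r_squared), 3))
```

The recovered slopes should sit close to the values used to generate the data, which is the sense in which each coefficient describes the change in the target per unit change in one predictor with the others held fixed.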


## Exercise 5 — Predict on the test set

Use the fitted linear model to predict `rating` on the test data and evaluate with RMSE.


In [None]:
from sklearn.metrics import mean_squared_error

X_test_const = sm.add_constant(X_test_lin, has_constant="add")
y_pred_lin = ols_model.predict(X_test_const)

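# The `squared=False` argument is deprecated in newer scikit-learn releases,
# so take the square root of the MSE explicitly.
rmse = np.sqrt(mean_squared_error(y_test_lin, y_pred_lin))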
print("Test RMSE:", rmse)

pd.DataFrame({"y_true": y_test_lin.values, "y_pred": y_pred_lin.values}).head()


### Exercise 5.1 — Transform `duration_sec` and `word_count`

Some features are skewed and may benefit from a transformation. A simple, common choice is a **log transform**; `log1p(x) = log(1 + x)` is used here because it stays defined when a value is 0:

- Replace `duration_sec` with `log1p(duration_sec)`
- Replace `word_count` with `log1p(word_count)`

Refit the model and compare results.
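
The effect of `log1p` on a skewed feature can be seen on a handful of values. The word counts below are hypothetical, chosen to include a zero and one large outlier.

```python
import numpy as np

# Hypothetical word counts: mostly short clips plus one long outlier
word_counts = np.array([0, 3, 5, 8, 12, 400])
logged = np.log1p(word_counts)  # log(1 + x), defined at x = 0

print("Raw range:  ", word_counts.min(), "to", word_counts.max())
print("log1p range:", logged.min(), "to", round(float(logged.max()), 2))

# expm1 inverts log1p, so the original scale is recoverable
recovered = np.expm1(logged)
```

After the transform, the outlier no longer dominates the range. Keep in mind that a coefficient on a transformed feature then describes a change per unit of `log1p(x)`, not per second or per word.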


In [None]:
# Create transformed copies
X_train_lin_tf = X_train_lin.copy()
X_test_lin_tf = X_test_lin.copy()

for col in ["duration_sec", "word_count"]:
    X_train_lin_tf[col] = np.log1p(X_train_lin_tf[col])
    X_test_lin_tf[col] = np.log1p(X_test_lin_tf[col])

X_train_tf_const = sm.add_constant(X_train_lin_tf)
ols_model_tf = sm.OLS(y_train_lin, X_train_tf_const).fit()
print(ols_model_tf.summary())


Re-evaluate RMSE on the transformed-feature model.


In [None]:
X_test_tf_const = sm.add_constant(X_test_lin_tf, has_constant="add")
y_pred_lin_tf = ols_model_tf.predict(X_test_tf_const)

# Explicit square root instead of the deprecated `squared=False` argument
rmse_tf = np.sqrt(mean_squared_error(y_test_lin, y_pred_lin_tf))
print("Test RMSE (transformed features):", rmse_tf)


### Exercise 5.2 — Interpret the RMSE

- **RMSE** is in the same units as the target (`rating`, 1–10 scale).
- Compare RMSE to the **spread** (e.g., standard deviation) and the **range** of `rating`.
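
One way to make the comparison with the spread concrete is a baseline that always predicts the mean rating: its RMSE equals the (population) standard deviation of the ratings it is scored on. The sketch below uses invented ratings and predictions; `rmse` here is a small hypothetical helper, not a library function.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical test ratings and model predictions
y_true_demo = np.array([2.0, 3.5, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 3.0, 5.5, 6.0, 8.0])

model_rmse = rmse(y_true_demo, y_pred_demo)
# Baseline: always predict the mean rating (in practice, use the training mean)
baseline_rmse = rmse(y_true_demo, np.full_like(y_true_demo, y_true_demo.mean()))

print("Model RMSE:   ", round(model_rmse, 3))
print("Baseline RMSE:", round(baseline_rmse, 3))
```

A model RMSE close to the baseline suggests the predictors add little beyond the average rating, whereas a clearly smaller value indicates that they carry usable signal.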


### Exercise 5.3 — Extend the model with other variables

- Try to add other variables that might improve the model's performance.
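
A possible workflow for this exercise is to score each candidate feature set on the same held-out rows and compare test RMSE. The sketch below shows the idea with plain NumPy least squares on synthetic stand-in data; `fit_and_score` is a hypothetical helper, and the fixed slice split is a simplification of `train_test_split`.

```python
import numpy as np

def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit least squares with an intercept and return the test RMSE."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.sqrt(np.mean((y_test - A_test @ beta) ** 2)))

# Synthetic stand-in data: the target depends on both columns
rng = np.random.default_rng(1)
X_all = rng.normal(size=(300, 2))
y_all = 4.0 + 1.5 * X_all[:, 0] + 1.0 * X_all[:, 1] + rng.normal(scale=0.5, size=300)

train, test = slice(0, 240), slice(240, 300)
rmse_one = fit_and_score(X_all[train, :1], y_all[train], X_all[test, :1], y_all[test])
rmse_two = fit_and_score(X_all[train], y_all[train], X_all[test], y_all[test])

print("One feature: ", round(rmse_one, 3))
print("Two features:", round(rmse_two, 3))
```

On the real data, candidates such as `pitch_var`, `mean_energy`, or `speech_rate` can be added the same way. Judge them by test RMSE, since training R-squared never decreases when a predictor is added.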


## Exercise 6 — Logistic Regression on `rating_binary`

Now predict the **binary** aggressiveness label using logistic regression. We'll use the same three predictors and the same log transforms for the skewed features.


In [None]:
# Prepare data for logistic regression
target_log = "rating_binary"
features_log = ["mean_pitch", "duration_sec", "word_count"]

needed_log = [target_log] + features_log
df_log = df[needed_log].dropna().copy()

# Ensure binary target is 0/1 int
df_log[target_log] = df_log[target_log].astype(int)

X_log = df_log[features_log].copy()
X_log["duration_sec"] = np.log1p(X_log["duration_sec"])
X_log["word_count"] = np.log1p(X_log["word_count"])

y_log = df_log[target_log]

X_train_log, X_test_log, y_train_log, y_test_log = train_test_split(
    X_log, y_log, test_size=0.2, random_state=42, stratify=y_log
)

import statsmodels.api as sm

X_train_log_const = sm.add_constant(X_train_log)
logit_model = sm.Logit(y_train_log, X_train_log_const).fit(disp=False)


### Exercise 6.1 — Model summary

Print out the summary showing coefficient estimates on the **log-odds** scale. Positive coefficients increase the log-odds (and thus the probability) of `rating_binary = 1` (aggressive), holding other variables constant.
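
The link between log-odds and probability is the logistic (sigmoid) function. The sketch below uses a hypothetical intercept and coefficient, not estimates from this dataset, to show how equal steps in a predictor translate into probabilities.

```python
import math

def sigmoid(log_odds):
    """Convert log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical values, not estimates from this dataset
intercept = -1.0
coef = 0.8

for x in (0.0, 1.0, 2.0):
    z = intercept + coef * x  # linear predictor on the log-odds scale
    print(f"x={x:.0f}  log-odds={z:+.2f}  P(aggressive)={sigmoid(z):.3f}")
```

Each unit increase in `x` adds the same amount to the log-odds, but the resulting change in probability depends on where on the curve you start.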


In [None]:
print(logit_model.summary())

### Exercise 6.2 — Compare logistic vs. linear regression

- **Targets**: Linear regression predicts a **continuous** `rating` (1–10); logistic regression predicts a **probability** for `rating_binary=1`.
- **Coefficients**: Linear coefficients are changes in the expected **rating** per unit change in a predictor; logistic coefficients are changes in **log-odds**. You can exponentiate logistic coefficients to get **odds ratios**.
- **Metrics**: For linear we used **RMSE**; for logistic consider **accuracy**, **precision/recall**, **ROC-AUC**.
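
Exponentiating log-odds coefficients into odds ratios can be sketched as follows. The coefficient values are hypothetical placeholders keyed by the predictor names used in this exercise; with a fitted statsmodels result, the same idea is `np.exp(logit_model.params)`.

```python
import math

# Hypothetical log-odds coefficients, not estimates from this dataset
log_odds_coefs = {"mean_pitch": 0.01, "duration_sec": -0.40, "word_count": 0.70}

odds_ratios = {name: math.exp(b) for name, b in log_odds_coefs.items()}
for name, ratio in odds_ratios.items():
    change = (ratio - 1.0) * 100  # percent change in odds per unit increase
    print(f"{name}: odds ratio {ratio:.3f} ({change:+.1f}% odds per unit increase)")
```

An odds ratio above 1 means the odds of `rating_binary = 1` rise with the predictor, and below 1 means they fall. For the log-transformed features, "per unit" refers to one unit of `log1p(x)`, not one second or one word.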


In [None]:
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    roc_auc_score,
    roc_curve,
)

X_test_log_const = sm.add_constant(X_test_log, has_constant="add")
y_prob = logit_model.predict(X_test_log_const)
y_pred = (y_prob >= 0.5).astype(int)

acc = accuracy_score(y_test_log, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_test_log, y_pred, average="binary", zero_division=0
)
auc = roc_auc_score(y_test_log, y_prob)

print(f"Accuracy: {acc:.3f}")
print(f"Precision: {prec:.3f}, Recall: {rec:.3f}, F1: {f1:.3f}")
print(f"ROC-AUC: {auc:.3f}")
