# Audio Classification: Speaker & Language

This notebook does the following:
1. Use a single **metadata CSV** to label both **speaker** and **language**.
2. Extract **MFCC features** from each audio file.
3. Perform **binary classification** separately for:
   - **Speaker** (Jeevan vs. Not_Jeevan)
   - **Language** (English vs. Not_English)
4. Use **k-Fold Cross-Validation** to evaluate performance (accuracy, precision, recall, F1) and generate confusion matrices.

In [2]:
# Import all libraries
import os
import pandas as pd
import librosa
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Data Collection
__ Add Text Here

* YouTube
* Google Photos
* Facebook and Messenger
* Recorded to balance Data

# Data Pre Processing
__ Add text here

And code if needed. Or link to Github
* Used yt_dlp to download only the audio from YouTube videos (these videos contain just my speech). No copyright infringement intended. The channel used is in the code cell.
* Used __ to convert all video into audio
* Used librosa and ffmpeg to trim the audio into equal 7-second slices (source for why that's suitable), removed silence, and padded with silence whenever a segment would be shorter than 7 seconds.
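
A minimal sketch of the slicing and padding step described above. It assumes the "empty" audio added to short segments is plain zero-valued samples and that silence has already been removed; `slice_fixed_length` is a hypothetical helper name, not the function actually used for this dataset.

```python
import numpy as np


def slice_fixed_length(y, sr, segment_seconds=7.0):
    """Split a mono signal into equal-length segments, zero-padding the last one."""
    y = np.asarray(y)
    seg_len = int(round(segment_seconds * sr))
    if len(y) == 0:
        return np.empty((0, seg_len), dtype=y.dtype)
    # Number of segments needed to hold every sample
    n_segments = int(np.ceil(len(y) / seg_len))
    # Zeros act as the silent padding for the final, shorter segment
    padded = np.zeros(n_segments * seg_len, dtype=y.dtype)
    padded[: len(y)] = y
    return padded.reshape(n_segments, seg_len)


# Synthetic 16-second tone at 16 kHz (illustration only, not project data)
sr = 16000
t = np.arange(16 * sr) / sr
tone = np.sin(2 * np.pi * 220 * t).astype(np.float32)
segments = slice_fixed_length(tone, sr)
print(segments.shape)  # (3, 112000)
```

Each row could then be written out as its own file, which is what gives every clip the same number of samples before feature extraction.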

## What needs to be done:
These factors might reduce accuracy:
* The audio was recorded on a phone, often with wind or people talking in the background, so there is noise that still needs to be cleaned.
* Some segments carry a single label but contain data from both classes.
    * In conversations where both people speak within one segment, I labeled it by whoever speaks for the longer duration.
    * People who speak Nepali also sprinkle in English words; again, I used the majority-time rule to label the data.
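
The majority-time rule can be sketched as below. This assumes each segment has been annotated by hand with `(label, start, end)` intervals in seconds; that annotation format and the `majority_label` name are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def majority_label(intervals):
    """Return the label with the longest total duration in one segment.

    intervals: iterable of (label, start_seconds, end_seconds) tuples.
    Ties go to the label that appears first.
    """
    totals = defaultdict(float)
    for label, start, end in intervals:
        totals[label] += max(0.0, end - start)
    if not totals:
        return None
    return max(totals, key=totals.get)


# Hypothetical annotations for one 7-second segment
speech = [("Nepali", 0.0, 3.0), ("English", 3.0, 4.5), ("Nepali", 4.5, 7.0)]
print(majority_label(speech))  # Nepali
```

The same function works for the speaker label by passing speaker names instead of languages.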

# Data Labeling
The most challenging part after finding the data was labeling it.

__ Add more__

# Data Analysis

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

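# Load the metadata CSV file into a pandas DataFrame.
# header=None assumes the CSV has no header row, so column names are assigned here.
# Note: build_dataset() reads the same file with pandas' default header handling and
# expects 'speaker_label' / 'language_label' columns, so only one of the two readers
# matches a given CSV layout.
df = pd.read_csv('data/metadata.csv', header=None, names=['filename', 'speaker', 'language'])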

# Display basic statistics about the dataframe
print("Total files:", len(df))
print("\nData Sample:")
print(df.head())

# Function to extract category and segment number from the filename.
def extract_info(filename):
    # Matches strings like "Jeevan_Jaycees_Nepali_segment_1.wav"
    # and extracts:
    #   category: "Jeevan_Jaycees_Nepali_segment"
    #   segment: 1
    match = re.search(r'(.*_segment)_(\d+)\.wav', filename)
    if match:
        category = match.group(1)
        segment = int(match.group(2))
        return pd.Series([category, segment])
    else:
        return pd.Series([None, None])

# Apply the extraction function to create two new columns: 'category' and 'segment'
df[['category', 'segment']] = df['filename'].apply(extract_info)

# Print unique categories and their counts
print("\nUnique Categories:")
print(df['category'].unique())

print("\nCategory Counts:")
print(df['category'].value_counts())

# Basic stats on segment numbers
print("\nSegment Number Statistics:")
print(df['segment'].describe())

# Visualization 1: Bar plot of file counts per category
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='category', order=df['category'].value_counts().index)
plt.title("File Count per Category")
plt.xlabel("Category")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Visualization 2: Histogram of segment numbers, colored by category
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='segment', hue='category', multiple='stack', bins=30)
plt.title("Distribution of Segment Numbers by Category")
plt.xlabel("Segment Number")
plt.ylabel("Count")
plt.tight_layout()
plt.show()




In [9]:

def extract_features(file_path, sr=16000, n_mfcc=13):
    """
    Loads the audio file, extracts MFCC features, and returns the averaged MFCCs.
    """
    y, sr = librosa.load(file_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mfcc_mean = np.mean(mfcc, axis=1)
    return mfcc_mean

def build_dataset(metadata_csv, audio_dir):
    """
    Reads metadata from a CSV that has at least 3 columns:
        - filename
        - speaker_label (e.g., "Jeevan" or "Not_Jeevan")
        - language_label (e.g., "English" or "Not_English")
    """
    df = pd.read_csv(metadata_csv)
    X, y_speaker, y_language = [], [], []
    for _, row in df.iterrows():
        file_path = os.path.join(audio_dir, row['filename'])
        X.append(extract_features(file_path))
        y_speaker.append(1 if row['speaker_label'].lower() == 'jeevan' else 0)
        y_language.append(1 if row['language_label'].lower() == 'english' else 0)
    return np.array(X), np.array(y_speaker), np.array(y_language)

def evaluate_classifier(X, y, n_splits=5):
    """
    Performs k-Fold cross-validation, returns classification metrics and a confusion matrix.
    """
    clf = SVC(kernel='linear', probability=True, random_state=42)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    y_pred = cross_val_predict(clf, X, y, cv=skf)
    return classification_report(y, y_pred, target_names=["Class 0", "Class 1"]), confusion_matrix(y, y_pred)
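
MFCC coefficients can differ a lot in scale, and a linear SVM is sensitive to that. One possible variant, sketched here as an assumption and not as what this notebook runs, wraps the classifier in a pipeline so that the scaler is fit only on the training folds. `evaluate_scaled_classifier` is a hypothetical name, and the data below is synthetic.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def evaluate_scaled_classifier(X, y, n_splits=5):
    """Hypothetical variant of evaluate_classifier with per-fold standardization."""
    # The pipeline is cloned and refit for every fold, so the scaler never sees test data
    model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    y_pred = cross_val_predict(model, X, y, cv=skf)
    return confusion_matrix(y, y_pred)


# Synthetic stand-in for 13 averaged MFCCs per clip; NOT real data from this project
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (40, 13)), rng.normal(3.0, 1.0, (40, 13))])
y_demo = np.array([0] * 40 + [1] * 40)
cm_demo = evaluate_scaled_classifier(X_demo, y_demo)
print(cm_demo)
```

Whether scaling actually helps here would need to be checked against the real features.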

In [10]:
# Paths to metadata CSV and audio files folder
metadata_csv_path = "data/metadata.csv"
audio_directory = "data/audio_files"

# Build dataset
X, y_speaker, y_language = build_dataset(metadata_csv_path, audio_directory)

NameError: name 'pd' is not defined

In [11]:
# Speaker Classification
speaker_report, speaker_cm = evaluate_classifier(X, y_speaker, n_splits=5)
print("=== Speaker Classification (Jeevan vs. Not_Jeevan) ===")
print("Classification Report:\n", speaker_report)
print("Confusion Matrix:\n", speaker_cm)


NameError: name 'X' is not defined

In [None]:
# Language Classification
language_report, language_cm = evaluate_classifier(X, y_language, n_splits=5)
print("\n=== Language Classification (English vs. Not_English) ===")
print("Classification Report:\n", language_report)
print("Confusion Matrix:\n", language_cm)
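
To relate the printed confusion matrix to the numbers in the classification report, the sketch below recomputes accuracy, precision, recall, and F1 for class 1 by hand. It relies on scikit-learn's layout (rows are true labels, columns are predicted labels); `binary_metrics` is a hypothetical helper, and the counts are made up for illustration.

```python
def binary_metrics(cm):
    """Compute accuracy, precision, recall and F1 for class 1.

    cm follows scikit-learn's layout: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration only; not results from this notebook
example_cm = [[40, 10], [5, 45]]
metrics = binary_metrics(example_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The function also accepts the NumPy array returned by `confusion_matrix`, for example `binary_metrics(speaker_cm)`.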
