# Grammar Scoring Engine for Spoken Data

## Introduction

This notebook develops a model that scores the grammatical quality of spoken audio samples on a scale from 1 to 5. The model takes an audio file as input and outputs a continuous grammar score.

### Task Description
- Audio files are WAV recordings, each 45-60 seconds long
- Labels are MOS Likert Grammar Scores (1-5 scale)
- Training dataset: 444 samples
- Testing dataset: 195 samples
- Evaluation metric: Pearson Correlation
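
Because the evaluation metric is Pearson correlation, it helps to see what that metric rewards. The sketch below computes it from its definition with NumPy. The helper name `pearson_corr` and the toy scores are illustrative assumptions, not competition data; `scipy.stats.pearsonr` would serve the same purpose in practice.

```python
import numpy as np

def pearson_corr(y_true, y_pred):
    """Pearson correlation computed from its definition (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Center both vectors, then divide their dot product by the product of their norms
    t = y_true - y_true.mean()
    p = y_pred - y_pred.mean()
    return float(np.dot(t, p) / (np.linalg.norm(t) * np.linalg.norm(p)))

# Toy labels and predictions on the 1-5 scale (made-up values for illustration)
labels = np.array([1.0, 2.5, 3.0, 4.0, 5.0])
preds = np.array([1.5, 2.0, 3.5, 3.5, 4.5])

r = pearson_corr(labels, preds)
# A positive affine rescaling of the predictions leaves the correlation unchanged
r_rescaled = pearson_corr(labels, 0.5 * preds + 1.0)
print(f"r = {r:.4f}, after rescaling = {r_rescaled:.4f}")
```

Since the metric is invariant to shifting and positive scaling of the predictions, it rewards a linear relationship with the labels rather than absolute calibration. Error metrics such as MSE or MAE remain useful as a complementary check on how far predictions are from the true scores.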

### Grammar Score Rubric
- **1**: Struggles with proper sentence structure and syntax
- **2**: Limited understanding of sentence structure with basic mistakes
- **3**: Decent grasp of structure but errors in grammar or syntax
- **4**: Strong understanding with occasional minor errors
- **5**: High grammatical accuracy with complex structures
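
A regression model can produce values outside the 1-5 range, and a raw number is harder to interpret than a rubric level. The following is a minimal sketch, assuming predictions are clipped to the valid range and then mapped to the nearest integer level with halves rounded up. The names `clip_score` and `nearest_rubric_level` are hypothetical helpers, and the rounding rule is an assumption made for illustration; the rubric text is copied from the list above.

```python
import math

# Rubric descriptions copied from the task statement
RUBRIC = {
    1: "Struggles with proper sentence structure and syntax",
    2: "Limited understanding of sentence structure with basic mistakes",
    3: "Decent grasp of structure but errors in grammar or syntax",
    4: "Strong understanding with occasional minor errors",
    5: "High grammatical accuracy with complex structures",
}

def clip_score(score, low=1.0, high=5.0):
    """Clamp a raw prediction to the valid score range."""
    return min(max(score, low), high)

def nearest_rubric_level(score):
    """Map a continuous score to the nearest integer rubric level (halves round up)."""
    clipped = clip_score(score)
    # floor(x + 0.5) rounds halves up, unlike the built-in round(), which rounds halves to even
    return int(math.floor(clipped + 0.5))

# Made-up raw predictions, including out-of-range values
for raw in (0.4, 2.5, 3.49, 5.7):
    level = nearest_rubric_level(raw)
    print(f"raw={raw:.2f} -> level {level}: {RUBRIC[level]}")
```

This mapping is only for interpreting predictions. The submitted values would normally stay continuous, since the evaluation metric is computed on the continuous scores.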

## Approach Overview

Our approach will consist of the following steps:

1. **Data Exploration**: Understand the distribution of audio files and labels
2. **Feature Extraction**: Extract relevant features from audio files
   - Audio features (MFCC, spectrograms, etc.)
   - Speech transcription and text features
3. **Model Development**: Build and train models to predict grammar scores
4. **Evaluation**: Evaluate models using cross-validation, with Pearson correlation as the primary metric
5. **Testing**: Generate predictions for the test set
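
The evaluation step above can be sketched as follows: shuffle the samples, split them into folds, fit a model on all but one fold, and record the Pearson correlation on the held-out fold. This is a minimal sketch under stated assumptions, not the notebook's actual model. The feature matrix is synthetic random data standing in for extracted audio and text features, the closed-form ridge regression and the helper names `ridge_fit` and `cross_val_pearson` are illustrative, and in practice scikit-learn's `KFold` and `Ridge` would be the more natural choice.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the weight vector
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def cross_val_pearson(X, y, n_splits=5, alpha=1.0, seed=42):
    """Return the held-out Pearson correlation for each of n_splits folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_splits)
    scores = []
    for k in range(n_splits):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        w, b = ridge_fit(X[train_idx], y[train_idx], alpha)
        preds = X[val_idx] @ w + b
        scores.append(np.corrcoef(y[val_idx], preds)[0, 1])
    return np.array(scores)

# Synthetic stand-in for real features: 444 samples (the training set size), 20 features
rng = np.random.default_rng(0)
X = rng.normal(size=(444, 20))
true_w = rng.normal(size=20)
y = np.clip(3.0 + 0.3 * (X @ true_w) + rng.normal(scale=0.5, size=444), 1.0, 5.0)

fold_scores = cross_val_pearson(X, y)
print(f"Per-fold Pearson r: {np.round(fold_scores, 3)}")
print(f"Mean: {fold_scores.mean():.3f}, Std: {fold_scores.std():.3f}")
```

With only 444 training samples, each fold holds fewer than 100 recordings, so reporting the spread across folds alongside the mean gives a more honest picture than a single train/validation split.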

Let's begin by setting up our environment and exploring the data.

In [None]:
# Import necessary libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
import librosa
import librosa.display
import scipy
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from scipy.stats import pearsonr
import warnings
# Suppress all library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Set paths
DATA_PATH = 'dataset/'
TRAIN_AUDIO_PATH = os.path.join(DATA_PATH, 'audios_train')
TEST_AUDIO_PATH = os.path.join(DATA_PATH, 'audios_test')

## 1. Data Exploration

Let's start by loading the training and testing data and exploring their distributions.

In [None]:
# Load training and testing data
train_df = pd.read_csv(os.path.join(DATA_PATH, 'train.csv'))
test_df = pd.read_csv(os.path.join(DATA_PATH, 'test.csv'))
sample_submission = pd.read_csv(os.path.join(DATA_PATH, 'sample_submission.csv'))

print(f"Training data shape: {train_df.shape}")
print(f"Testing data shape: {test_df.shape}")
print(f"Sample submission shape: {sample_submission.shape}")

# Display sample of training data
train_df.head()

In [None]:
# Explore the distribution of labels in the training set
plt.figure(figsize=(10, 6))
sns.histplot(train_df['label'], bins=9, kde=True)
plt.title('Distribution of Grammar Scores in Training Data')
plt.xlabel('Grammar Score')
plt.ylabel('Count')
plt.grid(True, alpha=0.3)
plt.show()

# Summary statistics
print("Summary statistics of grammar scores:")
print(train_df['label'].describe())

In [None]:
# Function to load and display a sample audio file
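def load_and_display_audio(file_path):
    """Load an audio file, plot its waveform and spectrogram, and return (y, sr, duration)."""
    # sr=None keeps the file's native sampling rate instead of resampling to librosa's 22050 Hz default
    y, sr = librosa.load(file_path, sr=None)
    duration = librosa.get_duration(y=y, sr=sr)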
    
    plt.figure(figsize=(14, 5))
    
    # Plot waveform
    plt.subplot(1, 2, 1)
    librosa.display.waveshow(y, sr=sr)
    plt.title(f'Waveform (Duration: {duration:.2f}s)')
    
    # Plot spectrogram
    plt.subplot(1, 2, 2)
    D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    librosa.display.specshow(D, sr=sr, x_axis='time', y_axis='log')
    plt.colorbar(format='%+2.0f dB')
    plt.title('Spectrogram')
    
    plt.tight_layout()
    plt.show()
    
    return y, sr, duration

# Let's look at some examples from different score categories
# First, sort the dataframe by label
sorted_df = train_df.sort_values('label')

# Pick the lowest-scored, median-scored, and highest-scored rows once
low_row = sorted_df.iloc[0]
medium_row = sorted_df.iloc[len(sorted_df) // 2]
high_row = sorted_df.iloc[-1]

low_score = low_row['filename']
medium_score = medium_row['filename']
high_score = high_row['filename']

print(f"Low score example: {low_score}, Score: {low_row['label']}")
y_low, sr_low, dur_low = load_and_display_audio(os.path.join(TRAIN_AUDIO_PATH, low_score))

print(f"Medium score example: {medium_score}, Score: {medium_row['label']}")
y_med, sr_med, dur_med = load_and_display_audio(os.path.join(TRAIN_AUDIO_PATH, medium_score))

print(f"High score example: {high_score}, Score: {high_row['label']}")
y_high, sr_high, dur_high = load_and_display_audio(os.path.join(TRAIN_AUDIO_PATH, high_score))