# Grammar Scoring Engine for Spoken Data

## Introduction

This notebook builds a regression pipeline that scores the grammatical quality of spoken audio samples on a scale from 1 to 5. The model takes an audio file as input and outputs a continuous grammar score.

### Task Description
- **Audio files**: WAV files, 45-60 seconds long
- **Labels**: MOS Likert Grammar Scores (1-5 scale)
- **Training dataset**: 444 samples
- **Testing dataset**: 195 samples
- **Evaluation metric**: Pearson Correlation
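
For reference, the evaluation metric can be computed as in the minimal sketch below. The helper name `pearson_corr` and the example arrays are illustrative assumptions, not part of the task data.

```python
import numpy as np


def pearson_corr(y_true, y_pred):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    true_centered = y_true - y_true.mean()
    pred_centered = y_pred - y_pred.mean()
    denom = np.sqrt((true_centered ** 2).sum() * (pred_centered ** 2).sum())
    if denom == 0.0:
        return float("nan")  # undefined when either input is constant
    return float((true_centered * pred_centered).sum() / denom)


# Made-up scores for illustration only
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [2.0, 2.5, 3.0, 3.5, 4.0]
r = pearson_corr(y_true, y_pred)
print(f"Pearson r = {r:.3f}")
```

Pearson correlation is unchanged by a positive linear rescaling of the predictions, so it rewards getting the linear relationship with the true scores right rather than matching their absolute values.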

### Grammar Score Rubric
- **1**: The person's speech struggles with proper sentence structure and syntax, displaying limited control over simple grammatical structures and memorized sentence patterns.
- **2**: The person has a limited understanding of sentence structure and syntax. Although they use simple structures, they consistently make basic sentence structure and grammatical mistakes. They might leave sentences incomplete.
- **3**: The person demonstrates a decent grasp of sentence structure but makes errors in grammatical structure, or they show a decent grasp of grammatical structure but make errors in sentence syntax and structure.
- **4**: The person displays a strong understanding of sentence structure and syntax. They consistently show good control of grammar. While occasional errors may occur, they are generally minor and do not lead to misunderstandings; the person can correct most of them.
- **5**: Overall, the person showcases high grammatical accuracy and adept control of complex grammar. They use grammar accurately and effectively, seldom making noticeable mistakes. Additionally, they handle complex language structures well and correct themselves when necessary.

## Solution Approach

The solution proceeds in six steps:

1. **Data Exploration**: Understand the distribution of audio files and their labels
2. **Feature Extraction**: Extract relevant features from audio files
   - Audio features (MFCC, spectrograms, etc.)
   - Prosodic features (tempo, rhythm, etc.)
3. **Model Development**: Build and train different regression models
4. **Model Evaluation**: Evaluate models using cross-validation and Pearson correlation
5. **Model Tuning**: Fine-tune the best performing model
6. **Prediction**: Generate predictions for the test set
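
Step 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the `Ridge` baseline, the hypothetical helper `pearson_cv`, and the synthetic feature matrix are stand-ins chosen for the example, not the models or features from this notebook's `model_building` and `feature_extraction` modules.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold


def pearson_cv(X, y, n_splits=5, seed=42):
    """Return the per-fold Pearson correlation of a Ridge baseline (hypothetical helper)."""
    scores = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(X):
        model = Ridge(alpha=1.0)  # assumed baseline, not the notebook's final model
        model.fit(X[train_idx], y[train_idx])
        # Keep predictions inside the 1-5 score range
        preds = np.clip(model.predict(X[val_idx]), 1.0, 5.0)
        scores.append(np.corrcoef(y[val_idx], preds)[0, 1])
    return np.array(scores)


# Synthetic stand-in for extracted audio features (444 samples, 20 features)
rng = np.random.default_rng(42)
X = rng.normal(size=(444, 20))
y = np.clip(3.0 + X[:, 0] + 0.5 * rng.normal(size=444), 1.0, 5.0)

fold_scores = pearson_cv(X, y)
print(f"Mean Pearson r across folds: {fold_scores.mean():.3f}")
```

Scoring each fold separately shows how much the metric varies across splits, which matters with only a few hundred training samples.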

Let's begin by setting up our environment and exploring the data.

In [None]:
# Import necessary libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
import librosa
import librosa.display
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from scipy.stats import pearsonr

# Import our custom modules
import feature_extraction
import model_building
import predict

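# Set plot style (the seaborn call works across matplotlib versions,
# unlike the 'seaborn-whitegrid' style name, which newer releases dropped)
sns.set_style('whitegrid')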

# Set random seed for reproducibility
np.random.seed(42)

# Set paths
DATA_PATH = 'dataset/'
TRAIN_AUDIO_PATH = os.path.join(DATA_PATH, 'audios_train')
TEST_AUDIO_PATH = os.path.join(DATA_PATH, 'audios_test')

## 1. Data Exploration

Let's start by loading the training and testing data and exploring their distributions.

In [None]:
# Load training and testing data
train_df = pd.read_csv(os.path.join(DATA_PATH, 'train.csv'))
test_df = pd.read_csv(os.path.join(DATA_PATH, 'test.csv'))
sample_submission = pd.read_csv(os.path.join(DATA_PATH, 'sample_submission.csv'))

print(f"Training data shape: {train_df.shape}")
print(f"Testing data shape: {test_df.shape}")
print(f"Sample submission shape: {sample_submission.shape}")

# Display sample of training data
train_df.head()

In [None]:
# Check for missing values
print("Missing values in training data:")
print(train_df.isnull().sum())

print("\nMissing values in testing data:")
print(test_df.isnull().sum())

In [None]:
# Explore the distribution of labels in the training set
plt.figure(figsize=(12, 6))
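# Bin edges centered on the tick positions (assumes labels fall on 0.5 steps)
sns.histplot(train_df['label'], bins=np.arange(0.75, 5.5, 0.5), kde=True, color='darkblue')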
plt.title('Distribution of Grammar Scores in Training Data', fontsize=16)
plt.xlabel('Grammar Score', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.grid(True, alpha=0.3)
plt.xticks(np.arange(1, 5.5, 0.5))
plt.show()

# Summary statistics
print("Summary statistics of grammar scores:")
print(train_df['label'].describe())