# Grammar Scoring Engine using Whisper + Rule-based Features

### 1. Introduction

This project builds a Grammar Scoring Engine for spoken English. The task comes from a Kaggle competition: each input is a 45-60 second WAV file, and the output is a continuous grammar score between 0 and 5.

We transcribe the audio with OpenAI's Whisper model, then extract lightweight rule-based features from the transcript with `nltk` (part-of-speech counts, word count, and average word length). A Random Forest Regressor is then trained on these features.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input/shl-dataset'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
!pip install -q openai-whisper
!pip install -q nltk scikit-learn pandas tqdm

In [None]:
import os
import whisper
import pandas as pd
import nltk
from nltk import word_tokenize, pos_tag
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from tqdm import tqdm

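# Tokenizer and POS-tagger data used by word_tokenize and pos_tag.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Recent NLTK releases look up these resource names instead. An unknown name
# only prints a download error, so requesting both sets is harmless.
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')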

model = whisper.load_model("base")

In [None]:
train_df = pd.read_csv("/kaggle/input/shl-dataset/dataset/train.csv")
test_df = pd.read_csv("/kaggle/input/shl-dataset/dataset/test.csv")
sample_submission = pd.read_csv("/kaggle/input/shl-dataset/dataset/sample_submission.csv")
train_df.head()

In [None]:
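def extract_features(audio_path):
    """Transcribe one audio file and compute simple POS-based features."""
    # fp16=False makes Whisper decode in 32-bit floats, which avoids the
    # half-precision warning when running on CPU.
    result = model.transcribe(audio_path, fp16=False)
    text = result['text'].strip()

    # Tokenize the transcript and tag each token with its part of speech.
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)

    # Note: word_tokenize also emits punctuation marks as tokens, so they are
    # included in total_words and avg_word_len.
    total_words = len(tokens)
    # Penn Treebank tag prefixes: NN* = nouns, VB* = verbs, JJ* = adjectives.
    num_nouns = sum(1 for word, tag in pos_tags if tag.startswith('NN'))
    num_verbs = sum(1 for word, tag in pos_tags if tag.startswith('VB'))
    num_adjs = sum(1 for word, tag in pos_tags if tag.startswith('JJ'))
    avg_word_len = sum(len(word) for word in tokens) / total_words if total_words else 0

    return {
        'text': text,
        'total_words': total_words,
        'num_nouns': num_nouns,
        'num_verbs': num_verbs,
        'num_adjs': num_adjs,
        'avg_word_len': avg_word_len
    }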
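
The counting step in `extract_features` can be tried without Whisper or any NLTK downloads by feeding it hand-tagged tokens. The sketch below is an illustrative variant, not the notebook's function: the name `pos_features` is hypothetical, and unlike the cell above it skips punctuation-only tokens, which is an assumption about what "word count" should mean.

```python
from typing import Dict, List, Tuple


def pos_features(tagged: List[Tuple[str, str]]) -> Dict[str, float]:
    """Count POS-based features from (token, Penn Treebank tag) pairs."""
    # Assumption: keep only tokens with at least one alphanumeric character,
    # so punctuation does not inflate the word count or shrink the average length.
    words = [(tok, tag) for tok, tag in tagged if any(ch.isalnum() for ch in tok)]
    total = len(words)
    return {
        "total_words": total,
        "num_nouns": sum(tag.startswith("NN") for _, tag in words),
        "num_verbs": sum(tag.startswith("VB") for _, tag in words),
        "num_adjs": sum(tag.startswith("JJ") for _, tag in words),
        "avg_word_len": sum(len(tok) for tok, _ in words) / total if total else 0.0,
    }


# Hand-tagged example, so nothing has to be downloaded to try it.
sample = [("She", "PRP"), ("writes", "VBZ"), ("clear", "JJ"), ("reports", "NNS"), (".", ".")]
features = pos_features(sample)
print(features)
```

In the notebook, the same function could be called as `pos_features(pos_tag(word_tokenize(text)))`. Whether dropping punctuation helps the regressor is untested here.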

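In [None]:
train_features = []

for i, row in tqdm(train_df.iterrows(), total=len(train_df)):
    audio_filename = row['filename']
    # Build the full path to the training audio file.
    audio_path = f"/kaggle/input/shl-dataset/dataset/audios_train/{audio_filename}"

    try:
        features = extract_features(audio_path)
        features['label'] = row['label']
        train_features.append(features)
    except Exception as e:
        # Report files that fail to load or transcribe, then skip them.
        print(f"❌ Failed to process {audio_filename}: {e}")

# Collect the per-file feature dicts into the DataFrame used for training.
train_feat_df = pd.DataFrame(train_features)

In [None]:
# Number of training files that were processed successfully.
print(len(train_features))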
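
Transcription is the slow part of this loop, so it can help to cache transcripts on disk and reuse them when the feature code changes. The following is a minimal sketch under stated assumptions: `cached_transcribe`, its JSON file layout, and the stand-in `fake_transcribe` are all hypothetical and not part of the notebook.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def cached_transcribe(audio_path: str, transcribe_fn: Callable[[str], str], cache_dir: str) -> str:
    """Return the transcript for audio_path, calling transcribe_fn only on a cache miss."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Key the cache file on a hash of the path so nested paths map to flat file names.
    key = hashlib.sha1(audio_path.encode("utf-8")).hexdigest()
    cache_file = cache / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))["text"]
    text = transcribe_fn(audio_path)
    cache_file.write_text(json.dumps({"path": audio_path, "text": text}), encoding="utf-8")
    return text


# Stand-in transcriber that records how often it is called.
calls = []


def fake_transcribe(path: str) -> str:
    calls.append(path)
    return f"transcript of {path}"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_transcribe("a.wav", fake_transcribe, tmp)
    second = cached_transcribe("a.wav", fake_transcribe, tmp)  # served from the cache

print(first, len(calls))
```

In this notebook, `transcribe_fn` could be something like `lambda p: model.transcribe(p, fp16=False)['text'].strip()`, with the cache directory placed under `/kaggle/working/`.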

In [None]:
X = train_feat_df[['total_words', 'num_nouns', 'num_verbs', 'num_adjs', 'avg_word_len']]
y = train_feat_df['label']

model_rf = RandomForestRegressor(random_state=42)
model_rf.fit(X, y)

In [None]:
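# Error on the same data the model was fit on, so this is an optimistic
# estimate and not a measure of generalization.
y_pred = model_rf.predict(X)
mse = mean_squared_error(y, y_pred)
print(f"Train MSE: {mse:.4f}")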
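
Because the MSE above is measured on the training rows, a held-out estimate gives a fairer picture. The sketch below shows k-fold cross-validation in plain Python with a pluggable model. The names `kfold_mse`, `fit_predict`, and `mean_baseline` are hypothetical, and the demo data is synthetic, not taken from the competition.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence

Row = Sequence[float]
FitPredict = Callable[[List[Row], List[float], List[Row]], List[float]]


def kfold_mse(X: Sequence[Row], y: Sequence[float], fit_predict: FitPredict,
              k: int = 5, seed: int = 42) -> float:
    """Average held-out mean squared error over k shuffled folds."""
    if not 2 <= k <= len(y):
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint index groups
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        # Fit on the remaining folds, predict only the held-out rows.
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in held_out])
        fold_errors.append(mean((p - y[i]) ** 2 for p, i in zip(preds, held_out)))
    return mean(fold_errors)


def mean_baseline(X_train: List[Row], y_train: List[float], X_val: List[Row]) -> List[float]:
    """Predict the training-fold mean label for every validation row."""
    m = mean(y_train)
    return [m] * len(X_val)


# Synthetic demo: constant labels, so the mean baseline is exact.
X_demo = [[float(i)] for i in range(10)]
y_demo = [3.0] * 10
baseline_mse = kfold_mse(X_demo, y_demo, mean_baseline, k=5)
print(baseline_mse)
```

To score the random forest this way, `fit_predict` could wrap `RandomForestRegressor(random_state=42).fit(X_train, y_train).predict(X_val)`, with the feature frame converted to a list of rows first (for example `X.values.tolist()`). scikit-learn's `cross_val_score` covers the same ground; the plain-Python version is only meant to make the procedure explicit. Comparing against `mean_baseline` shows whether the features add anything over predicting the average score.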

In [None]:
test_features = []

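for i, row in tqdm(test_df.iterrows(), total=len(test_df)):
    # Read test audio from the Kaggle input directory, mirroring the training
    # path (assumed by analogy with audios_train). Note that train.csv uses a
    # 'filename' column; if test.csv does too, rename 'file' here and in the
    # submission cell.
    audio_filename = os.path.basename(row['file'])
    path = f"/kaggle/input/shl-dataset/dataset/audios_test/{audio_filename}"
    features = extract_features(path)
    test_features.append(features)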

test_feat_df = pd.DataFrame(test_features)
X_test = test_feat_df[['total_words', 'num_nouns', 'num_verbs', 'num_adjs', 'avg_word_len']]
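
Unlike the training loop, the test loop has no error handling, and simply skipping a failed file would leave fewer predictions than rows in `test_df`. One way to keep the rows aligned is to fall back to placeholder features and flag the failure. This is a sketch only: `extract_all`, the `failed` flag, and the all-zero fallback are assumptions, not the notebook's behavior.

```python
from typing import Callable, Dict, Iterable, List

FEATURE_COLUMNS = ["total_words", "num_nouns", "num_verbs", "num_adjs", "avg_word_len"]


def extract_all(paths: Iterable[str], extractor: Callable[[str], Dict]) -> List[Dict]:
    """Run extractor on every path, returning exactly one row per path."""
    rows = []
    for path in paths:
        try:
            row = dict(extractor(path))
            row["failed"] = False
        except Exception as exc:
            print(f"Failed to process {path}: {exc}")
            # Assumption: zeros stand in for missing features so the row count
            # still matches the input; the flag lets you inspect these later.
            row = {name: 0 for name in FEATURE_COLUMNS}
            row["failed"] = True
        rows.append(row)
    return rows


# Stand-in extractor that fails on one file.
def fake_extractor(path: str) -> Dict:
    if path == "bad.wav":
        raise RuntimeError("could not decode audio")
    return {"total_words": 10, "num_nouns": 3, "num_verbs": 2, "num_adjs": 1, "avg_word_len": 4.2}


rows = extract_all(["a.wav", "bad.wav", "c.wav"], fake_extractor)
print([r["failed"] for r in rows])
```

In the notebook, `extractor` would be `extract_features`, and the resulting list can be passed to `pd.DataFrame` as before.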

In [None]:
preds = model_rf.predict(X_test)
submission = pd.DataFrame({'file': test_df['file'], 'label': preds})
submission.to_csv("submission.csv", index=False)
submission.head()
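
Before uploading, a quick sanity check on the predictions can catch a mismatched row count or out-of-range scores. A random forest averages training labels, so its outputs should already fall within the label range, but the check becomes useful once the model is swapped for something else. The helper name `check_predictions` is hypothetical; the 0 to 5 bounds come from the task description in the introduction.

```python
import math
from typing import List, Sequence


def check_predictions(preds: Sequence[float], n_expected: int,
                      low: float = 0.0, high: float = 5.0) -> List[float]:
    """Validate the prediction count, reject NaNs, and clip scores to [low, high]."""
    if len(preds) != n_expected:
        raise ValueError(f"expected {n_expected} predictions, got {len(preds)}")
    cleaned = []
    for p in preds:
        p = float(p)
        if math.isnan(p):
            raise ValueError("prediction is NaN")
        cleaned.append(min(max(p, low), high))
    return cleaned


checked = check_predictions([2.5, 5.3, -0.1], n_expected=3)
print(checked)
```

In the notebook this could be called as `check_predictions(preds, len(test_df))` before building the submission frame.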

### 10. Conclusion

We developed a simple, fast Grammar Scoring Engine that combines Whisper transcription with basic rule-based features from the transcript. The only metric reported here is MSE on the training data, which overstates real performance, so the model still needs to be validated on held-out data. It can be improved further by:

- Using more linguistic features like grammatical errors, parse trees, etc.
- Incorporating BERT embeddings or transformer-based text models
- Fine-tuning Whisper on the competition audio
- Using end-to-end audio-to-score models
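
As a small step toward the first item in the list, a few extra transcript-level features can be computed with the standard library alone. The sketch below is illustrative: the three features, the `FILLERS` word list, and the name `fluency_features` are assumptions, and whether they correlate with the grammar score is untested. They also depend on the transcript preserving repetitions and fillers, which a speech recognizer may not do.

```python
import re
from typing import Dict

# Illustrative filler list; extend or replace it for real use.
FILLERS = {"um", "uh", "erm", "hmm"}


def fluency_features(text: str) -> Dict[str, float]:
    """Compute vocabulary diversity, immediate word repetition, and filler usage."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {"type_token_ratio": 0.0, "repeat_ratio": 0.0, "filler_ratio": 0.0}
    # Count positions where a word is immediately repeated ("is is").
    repeats = sum(1 for a, b in zip(words, words[1:]) if a == b)
    fillers = sum(1 for w in words if w in FILLERS)
    return {
        "type_token_ratio": len(set(words)) / total,
        "repeat_ratio": repeats / total,
        "filler_ratio": fillers / total,
    }


demo = fluency_features("I I think um the report is is clear")
print(demo)
```

These values could be merged into the dict returned by `extract_features` and added to the feature column list used for `X` and `X_test`.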

This serves as a baseline to build on for the final Kaggle submission.