In [44]:
!git clone https://github.com/lekshmi-j/grammar-autocorrector.git

Cloning into 'grammar-autocorrector'...
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 24 (delta 4), reused 18 (delta 2), pack-reused 0 (from 0)
Receiving objects: 100% (24/24), 16.78 KiB | 8.39 MiB/s, done.
Resolving deltas: 100% (4/4), done.


In [45]:
%cd grammar-autocorrector

/content/grammar-autocorrector/grammar-autocorrector


# Phase 1: Text & Error Analysis

Objective:
Understand real-world grammatical errors and analyze them using
tokenization, POS tagging, and dependency parsing.


In [46]:
!pip install datasets pandas




In [47]:
import pandas as pd
from datasets import load_dataset


In [48]:
dataset = load_dataset("jfleg", split="validation")


In [49]:
df = pd.DataFrame(dataset)


In [50]:
df.columns


Index(['sentence', 'corrections'], dtype='object')

In [51]:
#Inspect Raw Data
df.head()


Unnamed: 0,sentence,corrections
0,So I think we can not live if old people could...,[So I think we would not be alive if our ances...
1,For not use car .,"[Not for use with a car . , Do not use in the ..."
2,Here was no promise of morning except that we ...,"[Here was no promise of morning , except that ..."
3,Thus even today sex is considered as the least...,"[Thus , even today , sex is considered as the ..."
4,image you salf you are wark in factory just to...,[Imagine yourself you are working in factory j...


In [52]:
for i in range(5):
    print(f"❌ Incorrect: {df.loc[i, 'sentence']}")
    print(f"✔ Corrected: {df.loc[i, 'corrections'][0]}")
    print("-" * 50)


❌ Incorrect: So I think we can not live if old people could not find siences and tecnologies and they did not developped . 
✔ Corrected: So I think we would not be alive if our ancestors did not develop sciences and technologies . 
--------------------------------------------------
❌ Incorrect: For not use car . 
✔ Corrected: Not for use with a car . 
--------------------------------------------------
❌ Incorrect: Here was no promise of morning except that we looked up through the trees we saw how low the forest had swung . 
✔ Corrected: Here was no promise of morning , except that we looked up through the trees , and we saw how low the forest had swung . 
--------------------------------------------------
❌ Incorrect: Thus even today sex is considered as the least important topic in many parts of India . 
✔ Corrected: Thus , even today , sex is considered as the least important topic in may parts of India . 
--------------------------------------------------
❌ Incorrect: image you sal

Analysing sentence lengths

In [53]:
def get_sentence_length(sentence):
    # Split the sentence into words using spaces
    words = sentence.split()

    # Count how many words are there
    return len(words)


In [54]:
df["sentence_length"] = df["sentence"].apply(get_sentence_length)


In [55]:
df["sentence_length"].describe()


Unnamed: 0,sentence_length
count,755.0
mean,18.556291
std,10.142248
min,0.0
25%,12.0
50%,16.0
75%,23.0
max,80.0


In [56]:
#Taking Random Samples
df.sample(5, random_state=42)[["sentence", "corrections"]]


Unnamed: 0,sentence,corrections
291,For example we can see on the discovery channe...,[For example we can see on the Discovery Chann...
536,People use the public water to drink water .,"[People use public water to drink . , People u..."
39,Very soon they will run out at the current rat...,[They will run out very soon at the current ra...
77,It is obvious that after returning i was tired...,[It was obvious after returning that I was tir...
493,And there is a lot of critics concerned that t...,[And there are a lot of critics concerned that...


### Dataset Observations

- Sentences often contain multiple grammatical errors
- Corrections may involve tense, agreement, or article fixes
- Some corrections rewrite phrasing slightly (noise)
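
One way to put a rough number on the "rewrite" noise noted above is a token-level similarity ratio between each sentence and its first correction. The sketch below uses `difflib.SequenceMatcher` from the standard library on whitespace-split tokens; the helper name `token_similarity` is an assumption made for illustration, and the two pairs are taken from the outputs shown earlier.

```python
from difflib import SequenceMatcher


def token_similarity(source: str, correction: str) -> float:
    """Return a 0..1 similarity ratio between two whitespace-tokenized sentences."""
    return SequenceMatcher(None, source.split(), correction.split()).ratio()


# Sentence/correction pairs copied from the notebook outputs above.
pairs = [
    ("People use the public water to drink water .",
     "People use public water to drink ."),
    ("For not use car .",
     "Not for use with a car ."),
]

for src, cor in pairs:
    # A low ratio suggests the correction rewrites the sentence
    # rather than applying a few local fixes.
    print(f"{token_similarity(src, cor):.2f}  {src!r} -> {cor!r}")
```

In the notebook, the same function could be applied row by row to `df["sentence"]` and the first entry of `df["corrections"]`. Any cut-off for "too heavily rewritten" would have to be chosen by inspection.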


In [57]:
SAMPLE_SIZE = 30


In [58]:
analysis_df = df.sample(
    n=SAMPLE_SIZE,      # number of rows to select
    random_state=42     # ensures the same random rows every time
)


In [59]:
analysis_df = analysis_df.reset_index(drop=True)


In [60]:
analysis_df.head()


Unnamed: 0,sentence,corrections,sentence_length
0,For example we can see on the discovery channe...,[For example we can see on the Discovery Chann...,22
1,People use the public water to drink water .,"[People use public water to drink . , People u...",9
2,Very soon they will run out at the current rat...,[They will run out very soon at the current ra...,13
3,It is obvious that after returning i was tired...,[It was obvious after returning that I was tir...,18
4,And there is a lot of critics concerned that t...,[And there are a lot of critics concerned that...,22


In [61]:
for idx, row in analysis_df.iterrows():
    incorrect_sentence = row["sentence"]
    corrected_sentence = row["corrections"][0]

    print(f"{idx + 1}. ❌ {incorrect_sentence}")
    print(f"   ✔ {corrected_sentence}")
    print("-" * 60)


1. ❌ For example we can see on the discovery channel in wild life many people are hobbies to learn from the animals . 
   ✔ For example we can see on the Discovery Channel in the wild many people enjoy learning about the animals . 
------------------------------------------------------------
2. ❌ People use the public water to drink water . 
   ✔ People use public water to drink . 
------------------------------------------------------------
3. ❌ Very soon they will run out at the current rate of utilisation . 
   ✔ They will run out very soon at the current rate of utilization . 
------------------------------------------------------------
4. ❌ It is obvious that after returning i was tired and the night is meant to sleep ! . 
   ✔ It was obvious after returning that I was tired , and the night is meant to sleep ! 
------------------------------------------------------------
5. ❌ And there is a lot of critics concerned that the reqired testing is so long that declined the valueable fo

## Error Categories (Initial)

- SPELL: Incorrect spelling
- SVA: Subject–Verb Agreement errors
- ARTICLE: Incorrect or missing a/an/the
- TENSE: Wrong verb tense
- VERB FORM: Incorrect verb form (gerund, infinitive, auxiliary)


In [62]:
annotations = []


In [63]:
annotations = []

# 1
annotations.append({
    "sentence": analysis_df.loc[0, "sentence"],
    "error_types": ["SPELL", "ARTICLE", "VERB FORM"]
})

# 2
annotations.append({
    "sentence": analysis_df.loc[1, "sentence"],
    "error_types": ["ARTICLE", "VERB FORM"]
})

# 3
annotations.append({
    "sentence": analysis_df.loc[2, "sentence"],
    "error_types": ["SPELL"]
})

# 4
annotations.append({
    "sentence": analysis_df.loc[3, "sentence"],
    "error_types": ["TENSE", "SVA", "ARTICLE"]
})

# 5
annotations.append({
    "sentence": analysis_df.loc[4, "sentence"],
    "error_types": ["SPELL", "SVA", "VERB FORM"]
})

# 6
annotations.append({
    "sentence": analysis_df.loc[5, "sentence"],
    "error_types": ["ARTICLE"]
})

# 7 (no grammatical error – identical)
annotations.append({
    "sentence": analysis_df.loc[6, "sentence"],
    "error_types": []
})

# 8
annotations.append({
    "sentence": analysis_df.loc[7, "sentence"],
    "error_types": ["SPELL"]
})

# 9
annotations.append({
    "sentence": analysis_df.loc[8, "sentence"],
    "error_types": ["VERB FORM"]
})

# 10
annotations.append({
    "sentence": analysis_df.loc[9, "sentence"],
    "error_types": ["ARTICLE"]
})

# 11
annotations.append({
    "sentence": analysis_df.loc[10, "sentence"],
    "error_types": ["SPELL", "VERB FORM"]
})

# 12
annotations.append({
    "sentence": analysis_df.loc[11, "sentence"],
    "error_types": ["VERB FORM"]
})

# 13
annotations.append({
    "sentence": analysis_df.loc[12, "sentence"],
    "error_types": ["SPELL"]
})

# 14 (correct sentence)
annotations.append({
    "sentence": analysis_df.loc[13, "sentence"],
    "error_types": []
})

# 15
annotations.append({
    "sentence": analysis_df.loc[14, "sentence"],
    "error_types": ["VERB FORM", "SPELL"]
})

# 16 (correct sentence)
annotations.append({
    "sentence": analysis_df.loc[15, "sentence"],
    "error_types": []
})

# 17
annotations.append({
    "sentence": analysis_df.loc[16, "sentence"],
    "error_types": ["SVA"]
})

# 18
annotations.append({
    "sentence": analysis_df.loc[17, "sentence"],
    "error_types": ["TENSE", "VERB FORM"]
})

# 19
annotations.append({
    "sentence": analysis_df.loc[18, "sentence"],
    "error_types": ["SPELL", "ARTICLE"]
})

# 20
annotations.append({
    "sentence": analysis_df.loc[19, "sentence"],
    "error_types": ["SPELL", "VERB FORM"]
})

# 21
annotations.append({
    "sentence": analysis_df.loc[20, "sentence"],
    "error_types": ["VERB FORM"]
})

# 22
annotations.append({
    "sentence": analysis_df.loc[21, "sentence"],
    "error_types": ["ARTICLE", "SVA"]
})

# 23
annotations.append({
    "sentence": analysis_df.loc[22, "sentence"],
    "error_types": ["SPELL", "SVA"]
})

# 24
annotations.append({
    "sentence": analysis_df.loc[23, "sentence"],
    "error_types": ["SPELL", "VERB FORM"]
})

# 25
annotations.append({
    "sentence": analysis_df.loc[24, "sentence"],
    "error_types": ["SVA", "ARTICLE"]
})

# 26
annotations.append({
    "sentence": analysis_df.loc[25, "sentence"],
    "error_types": ["ARTICLE"]
})

# 27
annotations.append({
    "sentence": analysis_df.loc[26, "sentence"],
    "error_types": ["VERB FORM"]
})

# 28
annotations.append({
    "sentence": analysis_df.loc[27, "sentence"],
    "error_types": ["VERB FORM"]
})

# 29
annotations.append({
    "sentence": analysis_df.loc[28, "sentence"],
    "error_types": ["VERB FORM", "ARTICLE"]
})

# 30
annotations.append({
    "sentence": analysis_df.loc[29, "sentence"],
    "error_types": ["SPELL", "SVA"]
})


In [64]:
#Convert annotations to dataframe
error_df = pd.DataFrame(annotations)
error_df


Unnamed: 0,sentence,error_types
0,For example we can see on the discovery channe...,"[SPELL, ARTICLE, VERB FORM]"
1,People use the public water to drink water .,"[ARTICLE, VERB FORM]"
2,Very soon they will run out at the current rat...,[SPELL]
3,It is obvious that after returning i was tired...,"[TENSE, SVA, ARTICLE]"
4,And there is a lot of critics concerned that t...,"[SPELL, SVA, VERB FORM]"
5,"In my opinion , this statement is groundless a...",[ARTICLE]
6,It is more exciting and memorable .,[]
7,"In deed , they can be refuced .",[SPELL]
8,Successful people have to do things stablely a...,[VERB FORM]
9,"For example , my parents went to a group tour ...",[ARTICLE]


In [65]:
from collections import Counter

all_errors = []
for errs in error_df["error_types"]:
    all_errors.extend(errs)

Counter(all_errors)


Counter({'SPELL': 12, 'ARTICLE': 10, 'VERB FORM': 14, 'TENSE': 2, 'SVA': 7})

## Observations from Manual Error Analysis

- Many sentences contain multiple grammatical errors
- Tense errors often co-occur with time expressions
- Article errors require noun-level context
- Verb form errors are distinct from tense errors
- Over-correction risk is high if context is ignored
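
The first observation can be checked directly from the annotation labels. The sketch below counts how many labels each sentence carries and which label pairs occur together. It uses a small subset of the labels assigned above (annotations 1 to 4 and 7) so that it runs on its own; in the notebook, `error_df["error_types"]` could be passed in instead.

```python
from collections import Counter
from itertools import combinations

# Subset of the manual annotations above (labels only).
sample_error_types = [
    ["SPELL", "ARTICLE", "VERB FORM"],
    ["ARTICLE", "VERB FORM"],
    ["SPELL"],
    ["TENSE", "SVA", "ARTICLE"],
    [],
]

# How many sentences carry 0, 1, 2, ... error labels.
errors_per_sentence = Counter(len(labels) for labels in sample_error_types)

# How often each unordered pair of labels appears in the same sentence.
pair_counts = Counter(
    pair
    for labels in sample_error_types
    for pair in combinations(sorted(set(labels)), 2)
)

print(errors_per_sentence)
print(pair_counts.most_common(3))
```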


In [66]:
error_df.to_csv("data/processed/manual_error_annotations.csv", index=False)


Tokenization (NLTK)

In [67]:
!pip install nltk




In [68]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [69]:
from nltk.tokenize import sent_tokenize, word_tokenize


In [70]:
#Sentence Tokenization
sample_text = analysis_df.loc[0, "sentence"]
sentences = sent_tokenize(sample_text)
sentences


['For example we can see on the discovery channel in wild life many people are hobbies to learn from the animals .']

In [71]:
#Word Tokenization
tokens = word_tokenize(sample_text)
tokens


['For',
 'example',
 'we',
 'can',
 'see',
 'on',
 'the',
 'discovery',
 'channel',
 'in',
 'wild',
 'life',
 'many',
 'people',
 'are',
 'hobbies',
 'to',
 'learn',
 'from',
 'the',
 'animals',
 '.']

In [72]:
for i in range(5):
    sentence = analysis_df.loc[i, "sentence"]
    tokens = word_tokenize(sentence)
    print(f"Sentence {i+1}: {sentence}")
    print("Tokens:", tokens)
    print("-" * 50)


Sentence 1: For example we can see on the discovery channel in wild life many people are hobbies to learn from the animals . 
Tokens: ['For', 'example', 'we', 'can', 'see', 'on', 'the', 'discovery', 'channel', 'in', 'wild', 'life', 'many', 'people', 'are', 'hobbies', 'to', 'learn', 'from', 'the', 'animals', '.']
--------------------------------------------------
Sentence 2: People use the public water to drink water . 
Tokens: ['People', 'use', 'the', 'public', 'water', 'to', 'drink', 'water', '.']
--------------------------------------------------
Sentence 3: Very soon they will run out at the current rate of utilisation . 
Tokens: ['Very', 'soon', 'they', 'will', 'run', 'out', 'at', 'the', 'current', 'rate', 'of', 'utilisation', '.']
--------------------------------------------------
Sentence 4: It is obvious that after returning i was tired and the night is meant to sleep ! . 
Tokens: ['It', 'is', 'obvious', 'that', 'after', 'returning', 'i', 'was', 'tired', 'and', 'the', 'night', '

In [73]:
for i in range(5):
    incorrect = analysis_df.loc[i, "sentence"]
    corrected = analysis_df.loc[i, "corrections"][0]

    print(f"❌ Incorrect tokens: {word_tokenize(incorrect)}")
    print(f"✔ Corrected tokens: {word_tokenize(corrected)}")
    print("-" * 60)


❌ Incorrect tokens: ['For', 'example', 'we', 'can', 'see', 'on', 'the', 'discovery', 'channel', 'in', 'wild', 'life', 'many', 'people', 'are', 'hobbies', 'to', 'learn', 'from', 'the', 'animals', '.']
✔ Corrected tokens: ['For', 'example', 'we', 'can', 'see', 'on', 'the', 'Discovery', 'Channel', 'in', 'the', 'wild', 'many', 'people', 'enjoy', 'learning', 'about', 'the', 'animals', '.']
------------------------------------------------------------
❌ Incorrect tokens: ['People', 'use', 'the', 'public', 'water', 'to', 'drink', 'water', '.']
✔ Corrected tokens: ['People', 'use', 'public', 'water', 'to', 'drink', '.']
------------------------------------------------------------
❌ Incorrect tokens: ['Very', 'soon', 'they', 'will', 'run', 'out', 'at', 'the', 'current', 'rate', 'of', 'utilisation', '.']
✔ Corrected tokens: ['They', 'will', 'run', 'out', 'very', 'soon', 'at', 'the', 'current', 'rate', 'of', 'utilization', '.']
------------------------------------------------------------
❌ Incorre

## Tokenization Insights

- Tokenization converts text into atomic units (words and punctuation)
- Many grammar errors are not detectable at token level
- Incorrect and correct sentences can have identical token counts (sentence 3 above has 13 tokens in both versions)
- Tokenization is necessary but insufficient for grammar correction


**What tokenization CAN reveal**

- Missing words (e.g., a missing article)
- Extra words
- Spelling errors (sometimes)

**What tokenization CANNOT reveal**

- Tense correctness
- Subject–verb agreement
- Contextual misuse
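
As a small illustration of what token-level comparison can surface, the sketch below subtracts token multisets with `collections.Counter` to list words that were dropped or introduced by a correction. The helper name `token_diff` is hypothetical; the token lists are copied from the output of sentence 2 above.

```python
from collections import Counter


def token_diff(incorrect_tokens, corrected_tokens):
    """Return (removed, added) token multisets between two token lists."""
    before, after = Counter(incorrect_tokens), Counter(corrected_tokens)
    # Counter subtraction keeps only positive counts.
    return before - after, after - before


incorrect_tokens = ['People', 'use', 'the', 'public', 'water', 'to', 'drink', 'water', '.']
corrected_tokens = ['People', 'use', 'public', 'water', 'to', 'drink', '.']

removed, added = token_diff(incorrect_tokens, corrected_tokens)
print("Removed:", dict(removed))
print("Added:", dict(added))
```

This exposes extra or missing words, but a tense fix such as `go` → `went` would only show up as one token swapped for another, with no indication of why.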

In [74]:
analysis_df["tokens"] = analysis_df["sentence"].apply(word_tokenize)
analysis_df[["sentence", "tokens"]].head()


Unnamed: 0,sentence,tokens
0,For example we can see on the discovery channe...,"[For, example, we, can, see, on, the, discover..."
1,People use the public water to drink water .,"[People, use, the, public, water, to, drink, w..."
2,Very soon they will run out at the current rat...,"[Very, soon, they, will, run, out, at, the, cu..."
3,It is obvious that after returning i was tired...,"[It, is, obvious, that, after, returning, i, w..."
4,And there is a lot of critics concerned that t...,"[And, there, is, a, lot, of, critics, concerne..."


Tokenization answers: WHAT words exist?

POS answers: WHAT role do they play?

Dependencies answer: HOW are they related?


POS (Part of Speech Tagging)

Tokenization tells you what words exist.

POS tagging tells you what grammatical role each word plays.

In [75]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [76]:
from nltk import pos_tag


In [77]:
sentence = "She go to school yesterday"
tokens = nltk.word_tokenize(sentence)
pos_tags = pos_tag(tokens)

pos_tags


[('She', 'PRP'),
 ('go', 'VBP'),
 ('to', 'TO'),
 ('school', 'NN'),
 ('yesterday', 'NN')]

In [78]:
corrected = "She went to school yesterday"
pos_tag(nltk.word_tokenize(corrected))


[('She', 'PRP'),
 ('went', 'VBD'),
 ('to', 'TO'),
 ('school', 'NN'),
 ('yesterday', 'NN')]

In [79]:
for i in range(5):
    sentence = analysis_df.loc[i, "sentence"]
    tokens = nltk.word_tokenize(sentence)
    pos_tags = pos_tag(tokens)

    print(f"Sentence {i+1}: {sentence}")
    print("POS tags:", pos_tags)
    print("-" * 50)


Sentence 1: For example we can see on the discovery channel in wild life many people are hobbies to learn from the animals . 
POS tags: [('For', 'IN'), ('example', 'NN'), ('we', 'PRP'), ('can', 'MD'), ('see', 'VB'), ('on', 'IN'), ('the', 'DT'), ('discovery', 'NN'), ('channel', 'NN'), ('in', 'IN'), ('wild', 'JJ'), ('life', 'NN'), ('many', 'JJ'), ('people', 'NNS'), ('are', 'VBP'), ('hobbies', 'NNS'), ('to', 'TO'), ('learn', 'VB'), ('from', 'IN'), ('the', 'DT'), ('animals', 'NNS'), ('.', '.')]
--------------------------------------------------
Sentence 2: People use the public water to drink water . 
POS tags: [('People', 'NNS'), ('use', 'VBP'), ('the', 'DT'), ('public', 'JJ'), ('water', 'NN'), ('to', 'TO'), ('drink', 'VB'), ('water', 'NN'), ('.', '.')]
--------------------------------------------------
Sentence 3: Very soon they will run out at the current rate of utilisation . 
POS tags: [('Very', 'RB'), ('soon', 'RB'), ('they', 'PRP'), ('will', 'MD'), ('run', 'VB'), ('out', 'RP'), ('at

In [80]:
for i in range(5):
    incorrect = analysis_df.loc[i, "sentence"]
    corrected = analysis_df.loc[i, "corrections"][0]

    incorrect_pos = pos_tag(nltk.word_tokenize(incorrect))
    corrected_pos = pos_tag(nltk.word_tokenize(corrected))

    print(f"❌ Incorrect POS: {[tag for _, tag in incorrect_pos]}")
    print(f"✔ Correct POS:   {[tag for _, tag in corrected_pos]}")
    print("-" * 60)


❌ Incorrect POS: ['IN', 'NN', 'PRP', 'MD', 'VB', 'IN', 'DT', 'NN', 'NN', 'IN', 'JJ', 'NN', 'JJ', 'NNS', 'VBP', 'NNS', 'TO', 'VB', 'IN', 'DT', 'NNS', '.']
✔ Correct POS:   ['IN', 'NN', 'PRP', 'MD', 'VB', 'IN', 'DT', 'NNP', 'NNP', 'IN', 'DT', 'JJ', 'JJ', 'NNS', 'VBP', 'VBG', 'IN', 'DT', 'NNS', '.']
------------------------------------------------------------
❌ Incorrect POS: ['NNS', 'VBP', 'DT', 'JJ', 'NN', 'TO', 'VB', 'NN', '.']
✔ Correct POS:   ['NNS', 'VBP', 'JJ', 'NN', 'TO', 'VB', '.']
------------------------------------------------------------
❌ Incorrect POS: ['RB', 'RB', 'PRP', 'MD', 'VB', 'RP', 'IN', 'DT', 'JJ', 'NN', 'IN', 'NN', '.']
✔ Correct POS:   ['PRP', 'MD', 'VB', 'RP', 'RB', 'RB', 'IN', 'DT', 'JJ', 'NN', 'IN', 'NN', '.']
------------------------------------------------------------
❌ Incorrect POS: ['PRP', 'VBZ', 'JJ', 'IN', 'IN', 'VBG', 'NN', 'VBD', 'VBN', 'CC', 'DT', 'NN', 'VBZ', 'VBN', 'TO', 'VB', '.', '.']
✔ Correct POS:   ['PRP', 'VBD', 'JJ', 'IN', 'VBG', 'IN', 'PRP'

## POS-Level Grammar Patterns Observed

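- PRP + VB/VBP + time word → possible tense error (in `She go to school yesterday`, `go` is tagged VBP; `yesterday` is tagged NN, so time words must be matched by word, not by tag)
- Singular subject (NN, or a PRP such as he/she/it) + VB/VBP → subject–verb agreement error
- NN without a preceding DT → possible missing article
- An incorrect verb form shows up as an unexpected verb tag (e.g., VB where VBD or VBG is expected)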
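
The agreement pattern can be sketched as a scan over `(word, tag)` pairs of the kind `pos_tag` returns. This is a minimal illustration, not a complete detector: it only looks at a third-person-singular pronoun immediately followed by a verb, and the names `NON_3SG_VERB_TAGS`, `THIRD_SINGULAR_PRONOUNS` and `flag_pronoun_verb_mismatch` are assumptions introduced here. The tagged inputs are copied from the outputs above so the block runs without NLTK.

```python
# Penn Treebank verb tags that are not third-person-singular present:
# VB (base form) and VBP (non-3sg present).
NON_3SG_VERB_TAGS = {"VB", "VBP"}
THIRD_SINGULAR_PRONOUNS = {"he", "she", "it"}


def flag_pronoun_verb_mismatch(tagged_tokens):
    """Return (pronoun, verb) pairs where a 3sg pronoun is directly
    followed by a verb tagged VB or VBP."""
    flagged = []
    for (word, tag), (next_word, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if (tag == "PRP"
                and word.lower() in THIRD_SINGULAR_PRONOUNS
                and next_tag in NON_3SG_VERB_TAGS):
            flagged.append((word, next_word))
    return flagged


tagged_bad = [('She', 'PRP'), ('go', 'VBP'), ('to', 'TO'),
              ('school', 'NN'), ('yesterday', 'NN')]
tagged_good = [('She', 'PRP'), ('went', 'VBD'), ('to', 'TO'),
               ('school', 'NN'), ('yesterday', 'NN')]

print(flag_pronoun_verb_mismatch(tagged_bad))
print(flag_pronoun_verb_mismatch(tagged_good))
```

Because the check relies on adjacency, it misses subjects separated from the verb by adverbs or clauses, which is one reason the notebook moves on to dependency parsing.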


In [81]:
def get_pos_tags(sentence):
    """
    Takes a sentence as input and returns
    a list of (word, POS_tag) pairs.
    """

    # Break the sentence into individual words
    tokens = word_tokenize(sentence)

    # Assign a Part-Of-Speech tag to each word
    tagged_tokens = pos_tag(tokens)

    return tagged_tokens


In [82]:
analysis_df["pos_tags"] = analysis_df["sentence"].apply(get_pos_tags)


In [83]:
analysis_df[["sentence", "pos_tags"]].head()


Unnamed: 0,sentence,pos_tags
0,For example we can see on the discovery channe...,"[(For, IN), (example, NN), (we, PRP), (can, MD..."
1,People use the public water to drink water .,"[(People, NNS), (use, VBP), (the, DT), (public..."
2,Very soon they will run out at the current rat...,"[(Very, RB), (soon, RB), (they, PRP), (will, M..."
3,It is obvious that after returning i was tired...,"[(It, PRP), (is, VBZ), (obvious, JJ), (that, I..."
4,And there is a lot of critics concerned that t...,"[(And, CC), (there, EX), (is, VBZ), (a, DT), (..."


Dependency Parsing (spaCy)

Many grammar errors are errors in the relationships between words, not in individual tokens

In [84]:
!pip install spacy
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [85]:
import spacy
nlp = spacy.load("en_core_web_sm")


In [86]:
sentence = "She go to school yesterday"
doc = nlp(sentence)

for token in doc:
    print(token.text, token.dep_, token.head.text, token.pos_)


She nsubj go PRON
go ROOT go VERB
to prep go ADP
school pobj to NOUN
yesterday npadvmod go NOUN


In [87]:
for token in doc:
    if token.dep_ == "nsubj":
        subject = token
    if token.dep_ == "ROOT":
        verb = token

print("Subject:", subject.text)
print("Verb:", verb.text)
print("Verb tense:", verb.morph)


Subject: She
Verb: go
Verb tense: Tense=Pres|VerbForm=Fin


In [88]:
incorrect = "She go to school yesterday"
corrected = "She went to school yesterday"

doc_bad = nlp(incorrect)
doc_good = nlp(corrected)

def extract_core(doc):
    subj, verb = None, None
    for token in doc:
        if token.dep_ == "nsubj":
            subj = token
        if token.dep_ == "ROOT":
            verb = token
    return subj, verb

bad_subj, bad_verb = extract_core(doc_bad)
good_subj, good_verb = extract_core(doc_good)

print("❌ Incorrect verb:", bad_verb.text, bad_verb.morph)
print("✔ Correct verb:", good_verb.text, good_verb.morph)


❌ Incorrect verb: go Tense=Pres|VerbForm=Fin
✔ Correct verb: went Tense=Past|VerbForm=Fin


In [89]:
for i in range(5):
    sentence = analysis_df.loc[i, "sentence"]
    doc = nlp(sentence)

    print(f"Sentence: {sentence}")
    for token in doc:
        print(f"{token.text:10} {token.dep_:10} → {token.head.text}")
    print("-" * 60)


Sentence: For example we can see on the discovery channel in wild life many people are hobbies to learn from the animals . 
For        prep       → see
example    pobj       → For
we         nsubj      → see
can        aux        → see
see        ROOT       → see
on         prep       → see
the        det        → channel
discovery  compound   → channel
channel    pobj       → on
in         prep       → channel
wild       amod       → life
life       pobj       → in
many       amod       → people
people     nsubj      → are
are        ccomp      → see
hobbies    attr       → are
to         aux        → learn
learn      xcomp      → hobbies
from       prep       → learn
the        det        → animals
animals    pobj       → from
.          punct      → see
------------------------------------------------------------
Sentence: People use the public water to drink water . 
People     nsubj      → use
use        ROOT       → use
the        det        → water
public     amod       → water


In [90]:
#Subject–Verb Agreement (SVA)
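def detect_sva(doc):
    """Return the first (subject, verb) pair, or (None, None) if none is found."""
    for token in doc:
        if token.dep_ == "nsubj" and token.head.pos_ == "VERB":
            return token, token.head
    # Avoid an unpacking error when the sentence has no nsubj → VERB relation
    return None, None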

doc = nlp("He eat apple")
subj, verb = detect_sva(doc)

print("Subject:", subj.text, subj.morph)
print("Verb:", verb.text, verb.morph)


Subject: He Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs
Verb: eat Tense=Pres|VerbForm=Fin


In [92]:
#Detect missing articles
doc = nlp("He eat apple")

for token in doc:
    if token.pos_ == "NOUN":
        has_det = any(child.dep_ == "det" for child in token.children)
        print(token.text, "has determiner:", has_det)


apple has determiner: False


## Dependency-Level Grammar Patterns

- nsubj → ROOT verb relationship defines agreement
- Singular subject + base verb → SVA error
- Time modifier (e.g., yesterday) + non-past verb → tense error
- Noun without determiner child → article error
- Dependency parsing enables localized corrections


In [93]:
def get_dependency_relations(sentence):
    """
    Takes a sentence as input and returns a list of
    (word, dependency_label, head_word) tuples.
    """

    doc = nlp(sentence)
    dependencies = []

    for token in doc:
        word = token.text
        dependency = token.dep_
        head_word = token.head.text

        dependencies.append((word, dependency, head_word))

    return dependencies


In [94]:
analysis_df["dependencies"] = analysis_df["sentence"].apply(get_dependency_relations)


In [95]:
analysis_df[["sentence", "dependencies"]].head()


Unnamed: 0,sentence,dependencies
0,For example we can see on the discovery channe...,"[(For, prep, see), (example, pobj, For), (we, ..."
1,People use the public water to drink water .,"[(People, nsubj, use), (use, ROOT, use), (the,..."
2,Very soon they will run out at the current rat...,"[(Very, advmod, soon), (soon, advmod, run), (t..."
3,It is obvious that after returning i was tired...,"[(It, nsubj, is), (is, ROOT, is), (obvious, ac..."
4,And there is a lot of critics concerned that t...,"[(And, cc, is), (there, expl, is), (is, ROOT, ..."


## Consolidated Error Patterns

| Error Type | Linguistic Signal | Detection Level |
|-----------|------------------|----------------|
| SPELL | Non-dictionary token | Token |
| SVA | Singular subject + base verb | Dependency |
| ARTICLE | Noun without determiner | Dependency |
| TENSE | Past time word + non-past verb | POS + Dependency |
| VERB FORM | Auxiliary misuse / wrong verb POS | POS |


IF token not found in dictionary
THEN candidate spelling error

IF subject (nsubj) is singular
AND verb is base or non-3rd-person present form (VB/VBP)
THEN subject–verb agreement error

IF noun (NN) has no determiner child (det)
AND noun is countable
THEN missing article error

IF sentence contains past-time modifier (yesterday, last, ago)
AND main verb is not past tense (VBD)
THEN tense error

IF auxiliary verb present
AND main verb POS does not match auxiliary
THEN verb form error
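
Two of the rules above (TENSE and ARTICLE) might be sketched as plain functions over a lightweight token record. Everything here is an assumption made for illustration: the `Tok` dataclass stands in for a parsed token, the example parse of "He eat apple yesterday" is written by hand rather than produced by spaCy, and `UNCOUNTABLE` is a tiny placeholder for a real countability lexicon.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical stand-in for a parsed token."""
    text: str
    tag: str   # fine-grained POS tag, e.g. VBP, NN
    dep: str   # dependency label, e.g. nsubj, ROOT
    head: int  # index of the head token in the sentence


PAST_TIME_WORDS = {"yesterday", "last", "ago"}
# Assumption: placeholder list; real countability needs a lexicon.
UNCOUNTABLE = {"water", "information", "advice"}


def tense_rule(tokens):
    """Past-time modifier present AND main verb not VBD → TENSE."""
    has_past_marker = any(t.text.lower() in PAST_TIME_WORDS for t in tokens)
    root = next((t for t in tokens if t.dep == "ROOT"), None)
    if has_past_marker and root and root.tag.startswith("VB") and root.tag != "VBD":
        return [("TENSE", root.text)]
    return []


def article_rule(tokens):
    """Singular noun (NN) with no det child AND countable → ARTICLE."""
    findings = []
    for i, t in enumerate(tokens):
        # Skip adverbial nouns such as "yesterday" and known uncountables.
        if t.tag != "NN" or t.dep == "npadvmod" or t.text.lower() in UNCOUNTABLE:
            continue
        if not any(c.head == i and c.dep == "det" for c in tokens):
            findings.append(("ARTICLE", t.text))
    return findings


# Hand-written parse of "He eat apple yesterday" (not actual parser output).
sentence = [
    Tok("He", "PRP", "nsubj", 1),
    Tok("eat", "VBP", "ROOT", 1),
    Tok("apple", "NN", "dobj", 1),
    Tok("yesterday", "NN", "npadvmod", 1),
]

print(tense_rule(sentence) + article_rule(sentence))
```

Keeping each rule as a separate function that returns a list of findings makes it easy to enable or disable rules one at a time. The tense rule shown here would misfire on perfect constructions such as "had gone", so it would need auxiliary handling before real use.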


## Rule-based vs ML-based Decisions

| Error Type | Approach | Reason |
|-----------|--------|--------|
| SPELL | Rule-based | Dictionary & edit distance works well |
| SVA | Rule-based | Clear dependency patterns |
| ARTICLE | Hybrid | Simple cases rule-based, others contextual |
| TENSE | Rule-based (initial) | Time-word patterns detectable |
| VERB FORM | ML-based (later) | Context-sensitive, ambiguous |
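
To make the SPELL row concrete, here is a minimal sketch of dictionary lookup plus Levenshtein edit distance. The vocabulary is a hypothetical five-word set chosen to match misspellings seen in the dataset samples above; a real system would load a full word list, and the `max_distance=2` default is an assumption, not a tuned value.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two-row variant)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


# Hypothetical mini-vocabulary for illustration only.
VOCABULARY = {"technologies", "sciences", "developed", "required", "valuable"}


def spelling_candidates(word, vocabulary=VOCABULARY, max_distance=2):
    """Return sorted (distance, candidate) pairs for an out-of-vocabulary word."""
    lowered = word.lower()
    if lowered in vocabulary:
        return []  # known word: nothing to suggest
    scored = ((edit_distance(lowered, v), v) for v in vocabulary)
    return sorted((d, v) for d, v in scored if d <= max_distance)


print(spelling_candidates("tecnologies"))
print(spelling_candidates("reqired"))
```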


## Phase 2 Strategy (Locked)

Phase 2 will implement:
- Rule-based SPELL correction
- Rule-based SVA correction
- Rule-based ARTICLE correction (simple cases only)
- Rule-based TENSE correction using time expressions

ML-based correction is deferred to Phase 4.


## Design Principle: Precision over Recall

Incorrect corrections degrade user trust more than missed corrections.
Therefore, rules will only fire when confidence is high.
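
One possible way to encode this principle is to attach a confidence score to every proposed edit and apply only those above a threshold. The sketch below is an assumption about how that could look: the `Suggestion` record, the `select_confident` helper, the 0.9 threshold and the example scores are all illustrative and are not taken from the project.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    """Hypothetical record for one proposed edit."""
    original: str
    replacement: str
    rule: str
    confidence: float  # assumed to lie in [0, 1]


def select_confident(suggestions, threshold=0.9):
    """Keep only suggestions whose confidence meets the threshold."""
    return [s for s in suggestions if s.confidence >= threshold]


# Made-up scores, purely to show the gating behaviour.
suggestions = [
    Suggestion("go", "went", "TENSE", 0.95),
    Suggestion("apple", "an apple", "ARTICLE", 0.60),
]

for s in select_confident(suggestions):
    print(f"{s.rule}: {s.original} -> {s.replacement}")
```

Raising the threshold trades recall for precision: fewer errors get corrected, but fewer correct sentences get damaged.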
