# Text Preprocessing and N-Gram Language Model

## Tasks:
1. **Data Selection and Preprocessing**: Normalize the text through tokenization, punctuation and number removal, stop-word removal, and lemmatization.
2. **N-Gram Analysis**: Estimate sentence probabilities with an n-gram language model under the Markov assumption.
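
As a reminder of what the Markov assumption means for task 2, the bigram case can be written as follows, where $w_1, \dots, w_n$ are the tokens of a sentence and START/END are the padding symbols. This is a sketch in the same notation as the formula printed later in the notebook, not an additional model:

$$
P(\text{sentence}) \approx P(w_1 \mid \text{START}) \cdot \prod_{i=2}^{n} P(w_i \mid w_{i-1}) \cdot P(\text{END} \mid w_n)
$$

Each factor conditions only on the immediately preceding token instead of the full history. A trigram model would condition on the two preceding tokens in the same way.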

## 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np
import re
import string
from collections import defaultdict
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\CRIZMA\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\CRIZMA\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\CRIZMA\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\CRIZMA\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\CRIZMA\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## 2. Load Dataset

In [2]:
df = pd.read_csv('../data/train.csv')

df.head()

| | id | text | author |
|---|---|---|---|
| 0 | id26305 | This process, however, afforded me no means of... | EAP |
| 1 | id17569 | It never once occurred to me that the fumbling... | HPL |
| 2 | id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| 3 | id27763 | How lovely is spring As we looked from Windsor... | MWS |
| 4 | id12958 | Finding nothing else, not even gold, the Super... | HPL |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19579 entries, 0 to 19578
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      19579 non-null  object
 1   text    19579 non-null  object
 2   author  19579 non-null  object
dtypes: object(3)
memory usage: 459.0+ KB


In [4]:
print(df['author'].value_counts())

author
EAP    7900
MWS    6044
HPL    5635
Name: count, dtype: int64


In [5]:
df.shape

(19579, 3)

## 3. Text Preprocessing Pipeline

### 3.1 Define Preprocessing Functions

In [6]:
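class TextPreprocessor:
    """
    Text preprocessing pipeline. `preprocess` applies, in order:
    1. Tokenization (NLTK `word_tokenize`)
    2. Removal of punctuation tokens and tokens containing digits
    3. Stop-word removal (case-insensitive)
    4. Lowercasing and lemmatization (WordNet, default noun POS)
    """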
    
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))
        
    def tokenize(self, text):
        return word_tokenize(text)
    
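    def remove_punctuation_and_numbers(self, tokens):
        # Drop tokens found in string.punctuation. This is a substring check,
        # so single punctuation marks are removed, while multi-character
        # tokens such as '--' or '...' pass through.
        tokens = [token for token in tokens if token not in string.punctuation]

        # Drop any token that contains at least one digit
        tokens = [token for token in tokens if not any(char.isdigit() for char in token)]

        return tokens
    
    def remove_stopwords(self, tokens):
        # Compare in lowercase so capitalized stop words ('This', 'I') are removed too
        return [token for token in tokens if token.lower() not in self.stop_words]
    
    def lemmatize(self, tokens):
        # Lowercase, then lemmatize; without a POS tag WordNet treats each token as a noun
        return [self.lemmatizer.lemmatize(token.lower()) for token in tokens]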
    
    def preprocess(self, text):
        tokens = self.tokenize(text)
        tokens = self.remove_punctuation_and_numbers(tokens)
        tokens = self.remove_stopwords(tokens)
        tokens = self.lemmatize(tokens)
        
        return tokens
    
    def preprocess_to_text(self, text):
        tokens = self.preprocess(text)
        return ' '.join(tokens)

preprocessor = TextPreprocessor()
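
One detail of `remove_punctuation_and_numbers` is worth illustrating: `token not in string.punctuation` is a substring test against a single string, so tokens made of several punctuation characters are kept. The sketch below uses only the standard library; the token list and the helper `is_punctuation_token` are hypothetical and shown as one possible alternative, not as the notebook's method.

```python
import string

def is_punctuation_token(token):
    """Return True when every character of the token is ASCII punctuation."""
    return len(token) > 0 and all(ch in string.punctuation for ch in token)

# Hypothetical tokens of the kind a tokenizer may emit for quotes, dashes, ellipses
tokens = ['wall', '.', '``', 'Yes', "''", '--', '...']

# Substring check, as in remove_punctuation_and_numbers
kept_substring = [t for t in tokens if t not in string.punctuation]

# Character-wise check (alternative sketch)
kept_charwise = [t for t in tokens if not is_punctuation_token(t)]

print(kept_substring)
print(kept_charwise)
```

Switching to the character-wise check would change the token counts and vocabulary size reported in this notebook, so the recorded outputs correspond to the substring version.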

### 3.2 Demonstrate Preprocessing on Sample Texts

In [8]:
sample_texts = df['text'].head(3).tolist()

for i, text in enumerate(sample_texts, 1):
    print(f"-----\nExample {i}")
    print(f"Original Text:\n{text}\n")
    
    # Show step-by-step preprocessing
    tokens = preprocessor.tokenize(text)
    print(f"After Tokenization ({len(tokens)} tokens):\n{tokens[:20]}...\n")
    
    tokens_no_punct = preprocessor.remove_punctuation_and_numbers(tokens)
    print(f"After Removing Punctuation & Numbers ({len(tokens_no_punct)} tokens):\n{tokens_no_punct[:20]}...\n")
    
    tokens_no_stop = preprocessor.remove_stopwords(tokens_no_punct)
    print(f"After Removing Stop Words ({len(tokens_no_stop)} tokens):\n{tokens_no_stop[:20]}...\n")
    
    tokens_lemmatized = preprocessor.lemmatize(tokens_no_stop)
    print(f"After Lemmatization ({len(tokens_lemmatized)} tokens):\n{tokens_lemmatized[:20]}...\n")
    
    preprocessed_text = ' '.join(tokens_lemmatized)
    print(f"Final Preprocessed Text:\n{preprocessed_text}\n")


-----
Example 1
Original Text:
This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.

After Tokenization (48 tokens):
['This', 'process', ',', 'however', ',', 'afforded', 'me', 'no', 'means', 'of', 'ascertaining', 'the', 'dimensions', 'of', 'my', 'dungeon', ';', 'as', 'I', 'might']...

After Removing Punctuation & Numbers (41 tokens):
['This', 'process', 'however', 'afforded', 'me', 'no', 'means', 'of', 'ascertaining', 'the', 'dimensions', 'of', 'my', 'dungeon', 'as', 'I', 'might', 'make', 'its', 'circuit']...

After Removing Stop Words (21 tokens):
['process', 'however', 'afforded', 'means', 'ascertaining', 'dimensions', 'dungeon', 'might', 'make', 'circuit', 'return', 'point', 'whence', 'set', 'without', 'aware', 'fact', 'perfectly', 'uniform', 'seemed']...

After Lemmatization (21 tokens):
['process', 'howev

### 3.3 Apply Preprocessing to Entire Dataset

In [9]:
# Run the pipeline once per row, then derive the joined string from the token lists
df['preprocessed_tokens'] = df['text'].apply(preprocessor.preprocess)
df['preprocessed_text'] = df['preprocessed_tokens'].apply(' '.join)

df[['text', 'preprocessed_text']].head()

| | text | preprocessed_text |
|---|---|---|
| 0 | This process, however, afforded me no means of... | process however afforded mean ascertaining dim... |
| 1 | It never once occurred to me that the fumbling... | never occurred fumbling might mere mistake |
| 2 | In his left hand was a gold snuff box, from wh... | left hand gold snuff box capered hill cutting ... |
| 3 | How lovely is spring As we looked from Windsor... | lovely spring looked windsor terrace sixteen f... |
| 4 | Finding nothing else, not even gold, the Super... | finding nothing else even gold superintendent ... |


## 4. N-Gram Language Model

### 4.1 Build N-Gram Model Class

In [10]:
class NGramLanguageModel:
    """
    N-Gram Language Model with Markov Assumption
    Supports unigram, bigram, and trigram models with Laplace smoothing
    """
    
    def __init__(self, n=2, smoothing=True):
        """
        Initialize N-gram model
        n: order of n-gram (1=unigram, 2=bigram, 3=trigram)
        smoothing: whether to apply Laplace (add-1) smoothing
        """
        self.n = n
        self.smoothing = smoothing
        self.ngram_counts = defaultdict(int)
        self.context_counts = defaultdict(int)
        self.vocabulary = set()
        self.vocab_size = 0
        
    def train(self, tokens_list):
        """
        Train the model on a list of tokenized sentences
        tokens_list: list of lists, where each inner list is a tokenized sentence
        """
        print(f"Training {self.n}-gram model...")
        
        for tokens in tokens_list:
            # Add start and end tokens
            padded_tokens = ['<START>'] * (self.n - 1) + tokens + ['<END>']
            
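            # Update vocabulary (corpus tokens only; '<START>' and '<END>' are not
            # included, although '<END>' is one of the tokens the model predicts)
            self.vocabulary.update(tokens)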
            
            # Count n-grams and contexts
            for i in range(len(padded_tokens) - self.n + 1):
                ngram = tuple(padded_tokens[i:i + self.n])
                context = tuple(padded_tokens[i:i + self.n - 1])
                
                self.ngram_counts[ngram] += 1
                self.context_counts[context] += 1
        
        self.vocab_size = len(self.vocabulary)
        print("Training complete!")
        print(f"Vocabulary size: {self.vocab_size}")
        print(f"Number of unique {self.n}-grams: {len(self.ngram_counts)}")
        print(f"Number of unique contexts: {len(self.context_counts)}")
        
    def get_probability(self, ngram):
        """
        Conditional probability of the last word of an n-gram given its context:
        P(w_n | w_1, ..., w_{n-1}) = Count(w_1, ..., w_n) / Count(w_1, ..., w_{n-1})
        With Laplace smoothing:
        (Count(w_1, ..., w_n) + 1) / (Count(w_1, ..., w_{n-1}) + vocab_size)
        """
        if isinstance(ngram, str):
            ngram = tuple(ngram.split())
            
        context = ngram[:-1]
        
        # Use .get() so that looking up unseen n-grams does not insert
        # zero-count entries into the defaultdicts
        ngram_count = self.ngram_counts.get(ngram, 0)
        context_count = self.context_counts.get(context, 0)
        
        if self.smoothing:
            # Laplace smoothing: add 1 to numerator and vocab_size to denominator
            numerator = ngram_count + 1
            denominator = context_count + self.vocab_size
        else:
            numerator = ngram_count
            denominator = context_count
            
        if denominator == 0:
            return 0.0
            
        return numerator / denominator
    
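    def sentence_probability(self, tokens):
        """
        Calculate the probability of an entire sentence.
        The chain rule gives P(sentence) = P(w1) * P(w2|w1) * P(w3|w1,w2) * ...;
        under the Markov assumption each factor is conditioned only on the
        previous n-1 words, e.g. P(w3|w2) for a bigram model.
        Returns the probability, the log probability (to avoid underflow),
        the n-grams used, and the perplexity.
        """
        # Pad with start tokens and a single end token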
        padded_tokens = ['<START>'] * (self.n - 1) + tokens + ['<END>']
        
        log_prob = 0.0
        prob = 1.0
        ngrams_used = []
        
        for i in range(len(padded_tokens) - self.n + 1):
            ngram = tuple(padded_tokens[i:i + self.n])
            ngram_prob = self.get_probability(ngram)
            
            if ngram_prob > 0:
                log_prob += np.log(ngram_prob)
                prob *= ngram_prob
                ngrams_used.append((ngram, ngram_prob))
            else:
                # Handle zero probability
                log_prob = float('-inf')
                prob = 0.0
                break
                
        return {
            'probability': prob,
            'log_probability': log_prob,
            'ngrams': ngrams_used,
            'perplexity': np.exp(-log_prob / len(ngrams_used)) if ngrams_used else float('inf')
        }
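
The add-one estimate used in `get_probability` can be checked by hand on a toy corpus. The sketch below is independent of the class above; `bigram_counts`, `laplace_prob`, and the three toy sentences are hypothetical names and data introduced only for illustration. It also compares two choices of vocabulary size: corpus words only (what the class counts) and corpus words plus `<END>`, which the model also predicts.

```python
from collections import Counter

def bigram_counts(sentences):
    """Count bigrams and their contexts over <START>/<END>-padded sentences."""
    ngrams, contexts = Counter(), Counter()
    for tokens in sentences:
        padded = ['<START>'] + tokens + ['<END>']
        for prev, word in zip(padded, padded[1:]):
            ngrams[(prev, word)] += 1
            contexts[prev] += 1
    return ngrams, contexts

def laplace_prob(ngrams, contexts, prev, word, vocab_size):
    """Add-one estimate: (count(prev, word) + 1) / (count(prev) + vocab_size)."""
    return (ngrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)

toy = [['dark', 'night'], ['dark', 'wall'], ['night', 'wall']]
ngrams, contexts = bigram_counts(toy)
words = sorted({w for s in toy for w in s})  # ['dark', 'night', 'wall']
next_tokens = words + ['<END>']              # everything that can follow 'dark'

# V = corpus words only, as in the class above
total_without_end = sum(
    laplace_prob(ngrams, contexts, 'dark', w, len(words)) for w in next_tokens
)

# V = corpus words plus <END>
total_with_end = sum(
    laplace_prob(ngrams, contexts, 'dark', w, len(words) + 1) for w in next_tokens
)

print(total_without_end, total_with_end)
```

With V taken as the corpus words only, the smoothed probabilities of all possible next tokens sum to slightly more than 1; counting `<END>` in V makes them sum to 1. With a vocabulary of tens of thousands of words the difference per context is small, but it is one reason the reported values should be read as approximate.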

### 4.2 Train Bigram Model

In [11]:
bigram_model = NGramLanguageModel(n=2, smoothing=True)
bigram_model.train(df['preprocessed_tokens'].tolist())

Training 2-gram model...
Training complete!
Vocabulary size: 22338
Number of unique 2-grams: 219727
Number of unique contexts: 22339


### 4.3 Calculate Probabilities for 10 Sentences

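We randomly sample 10 sentences from the dataset (with a fixed seed for reproducibility) and calculate their probabilities with the bigram model under the Markov assumption. Because these sentences were also part of the training data, every bigram in them has a count of at least one.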
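
`sentence_probability` returns a log probability alongside the raw product. A minimal sketch of why this matters, assuming a hypothetical 100-bigram sentence in which every bigram has probability 0.000089 (a magnitude typical of a smoothed bigram model with a vocabulary this large):

```python
import math

p = 0.000089      # assumed per-bigram probability (illustrative value)
n_bigrams = 100   # hypothetical sentence length in bigrams

product = 1.0
log_sum = 0.0
for _ in range(n_bigrams):
    product *= p            # shrinks toward zero and eventually underflows
    log_sum += math.log(p)  # stays a finite negative number

print(product)
print(log_sum)
```

The raw product underflows to zero in double precision, while the sum of logs remains finite, so comparisons between long sentences are only meaningful on the log scale.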

In [None]:
np.random.seed(42)
sample_indices = np.random.choice(len(df), size=10, replace=False)
sample_sentences = df.iloc[sample_indices]

print("Formula: P(sentence) = P(w1|START) * P(w2|w1) * P(w3|w2) * ... * P(END|wn)")
print("Using Laplace (Add-1) Smoothing\n")

results = []

for idx, (i, row) in enumerate(sample_sentences.iterrows(), 1):
    original_text = row['text']
    tokens = row['preprocessed_tokens']
    author = row['author']
    
    # Calculate probability
    result = bigram_model.sentence_probability(tokens)
    
    print(f"\nSENTENCE {idx}")
    print(f"Author: {author}")
    print(f"\nOriginal Text:")
    print(f"{original_text[:200]}{'...' if len(original_text) > 200 else ''}")
    print(f"\nPreprocessed Tokens ({len(tokens)} tokens):")
    print(f"{tokens}")
    print(f"\n--- PROBABILITY CALCULATION ---")
    print(f"Sentence Probability: {result['probability']:.2e}")
    print(f"Log Probability: {result['log_probability']:.4f}")
    print(f"Perplexity: {result['perplexity']:.4f}")
    
    print(f"\n--- BIGRAM BREAKDOWN (First 10 bigrams) ---")
    for j, (ngram, prob) in enumerate(result['ngrams'][:10], 1):
        context = ngram[0]
        word = ngram[1]
        print(f"{j}. P({word} | {context}) = {prob:.6f}")
    
    if len(result['ngrams']) > 10:
        print(f"... and {len(result['ngrams']) - 10} more bigrams")
    
    results.append({
        'sentence_num': idx,
        'author': author,
        'original_length': len(original_text),
        'token_count': len(tokens),
        'probability': result['probability'],
        'log_probability': result['log_probability'],
        'perplexity': result['perplexity']
    })


print("\n")

summary_df = pd.DataFrame(results)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.float_format', '{:.2e}'.format)
print(summary_df.to_string(index=False))

Formula: P(sentence) = P(w1|START) * P(w2|w1) * P(w3|w2) * ... * P(END|wn)
Using Laplace (Add-1) Smoothing


SENTENCE 1
Author: EAP

Original Text:
The gigantic magnitude and the immediately available nature of the sum, dazzled and bewildered all who thought upon the topic.

Preprocessed Tokens (11 tokens):
['gigantic', 'magnitude', 'immediately', 'available', 'nature', 'sum', 'dazzled', 'bewildered', 'thought', 'upon', 'topic']

--- PROBABILITY CALCULATION ---
Sentence Probability: 7.58e-48
Log Probability: -108.4982
Perplexity: 8446.5581

--- BIGRAM BREAKDOWN (First 10 bigrams) ---
1. P(gigantic | <START>) = 0.000095
2. P(magnitude | gigantic) = 0.000089
3. P(immediately | magnitude) = 0.000089
4. P(available | immediately) = 0.000089
5. P(nature | available) = 0.000090
6. P(sum | nature) = 0.000088
7. P(dazzled | sum) = 0.000089
8. P(bewildered | dazzled) = 0.000090
9. P(thought | bewildered) = 0.000089
10. P(upon | thought) = 0.000175
... and 2 more bigrams

SENTENCE 2
Author: MWS


### 4.4 Train and Compare Trigram Model

In [None]:
trigram_model = NGramLanguageModel(n=3, smoothing=True)
trigram_model.train(df['preprocessed_tokens'].tolist())

print("\n\n")

trigram_results = []

for idx, (i, row) in enumerate(sample_sentences.iterrows(), 1):
    tokens = row['preprocessed_tokens']
    result = trigram_model.sentence_probability(tokens)
    
    trigram_results.append({
        'sentence_num': idx,
        'probability': result['probability'],
        'log_probability': result['log_probability'],
        'perplexity': result['perplexity']
    })

# Compare bigram vs trigram
comparison_df = pd.DataFrame({
    'Sentence': [r['sentence_num'] for r in results],
    'Bigram_LogProb': [r['log_probability'] for r in results],
    'Trigram_LogProb': [r['log_probability'] for r in trigram_results],
    'Bigram_Perplexity': [r['perplexity'] for r in results],
    'Trigram_Perplexity': [r['perplexity'] for r in trigram_results]
})

print(comparison_df.to_string(index=False))

Training 3-gram model...
Training complete!
Vocabulary size: 22338
Number of unique 3-grams: 261283
Number of unique contexts: 213534



 Sentence  Bigram_LogProb  Trigram_LogProb  Bigram_Perplexity  Trigram_Perplexity
        1       -1.08e+02        -1.11e+02           8.45e+03            1.07e+04
        2       -5.00e+01        -5.29e+01           4.17e+03            6.77e+03
        3       -3.10e+02        -3.15e+02           9.13e+03            1.06e+04
        4       -9.60e+01        -9.94e+01           6.19e+03            8.37e+03
        5       -1.11e+02        -1.11e+02           1.01e+04            1.04e+04
        6       -1.74e+02        -1.75e+02           9.28e+03            1.02e+04
        7       -1.67e+02        -1.67e+02           1.06e+04            1.08e+04
        8       -3.48e+01        -3.70e+01           6.03e+03            1.04e+04
        9       -1.19e+02        -1.21e+02           9.55e+03            1.08e+04
       10       -2.72e+01        -2.79e+01 