# Assignment 2: N-grams and Language Identification
## CNG463 - Introduction to Natural Language Processing
### METU NCC Computer Engineering | Fall 2025-26

**Student Name:**  
**Student ID:**  
**Due Date:** 16 November 2025 (Sunday) before midnight

---

## Overview

This assignment focuses on:
1. Building 2-gram and 3-gram language models with Laplace smoothing
2. Sentence-based language identification using 10-fold cross-validation
3. Evaluation using accuracy, precision, recall, and F1-score
4. Comparison and analysis

**Grading:**
- Written Questions (7 × 4 pts): **28 pts**
- Code Tasks with TODO (11 total): **72 pts** distributed by effort level:
  - Simple tasks: 4 pts each (2 cells)
  - Moderate tasks: 6 pts each (4 cells)
  - Complex tasks: 8 pts each (5 cells)
- **Total: 100 pts**

---

## Pre-Submission Checklist

- [ ] Name and student ID at top
- [ ] No cells are added or removed
- [ ] All TODO sections completed
- [ ] All questions answered
- [ ] Code runs without errors
- [ ] Results tables included
- [ ] Run All before saving

## Setup and Imports

In [None]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict, Counter
from typing import List, Tuple, Dict
import re

# Scikit-learn for cross-validation and metrics; SciPy for the paired t-test
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from scipy.stats import ttest_rel


# Set random seed for reproducibility
np.random.seed(42)

---

# Task 1: Corpus Preparation and Statistics (22 points)

## 1.1: Upload Corpus Files

Prepare your text files in **two different languages** (accepted formats: `.txt`, `.pdf`, or `.docx`). When you run the cell below, you'll be prompted to upload files for each language separately. Make sure your files contain substantial text (reports, essays, or similar content from other courses). Each language requires at least **5000** words in its corpus.

In [10]:
from google.colab import files

print("Upload your ENGLISH corpus file(s):")
english_files = files.upload()

print("\nUpload your SECOND LANGUAGE corpus file(s):")
second_lang_files = files.upload()



## 1.2: Load and Preprocess Data (12 points)

Load your uploaded files, extract text, preprocess, split into sentences, and tokenize. You'll need helper functions to handle different file formats.

**Steps:**
1. Read files based on format (`.txt`, `.pdf`, `.docx`) and combine them into single text for each language

2. Apply preprocessing (e.g., lowercasing, handling punctuation)
3. Split each corpus into individual sentences
4. Tokenize each sentence into a list of words
5. Store the results as two lists of tokenized sentences
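As a rough illustration of steps 2 to 4, the sketch below uses only regular expressions. The helper names `split_sentences_simple` and `tokenize_simple` are hypothetical and not part of the assignment template; `nltk.sent_tokenize` and `nltk.word_tokenize` are alternatives that handle abbreviations and punctuation more carefully.

```python
import re
from typing import List

def split_sentences_simple(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize_simple(sentence: str) -> List[str]:
    """Lowercase, then keep runs of word characters; punctuation is dropped."""
    return re.findall(r"\w+", sentence.lower())

sample = "Hello world! This is a test. Is it working?"
sentences = [tokenize_simple(s) for s in split_sentences_simple(sample)]
print(sentences)
```

Note that `str.lower()` is locale-independent, so language-specific casing rules (for example the Turkish dotted and dotless i) may need separate handling.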

In [11]:
import re
from typing import List

def read_txt_file(filename: str) -> str:
    """Read a .txt file and return its content."""
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read()

def read_pdf_file(filename: str) -> str:
    """Read a .pdf file and return its text content."""
    # TODO: Install and use PyPDF2 or pdfplumber
    # Example: pip install PyPDF2
    pass

def read_docx_file(filename: str) -> str:
    """Read a .docx file and return its text content."""
    # TODO: Install and use python-docx
    # Example: pip install python-docx
    pass

def split_into_sentences(text: str) -> List[str]:
    """Split text into sentences."""
    # TODO: Implement sentence splitting
    # You can use simple regex or nltk.sent_tokenize
    pass

def tokenize_sentence(sentence: str) -> List[str]:
    """Tokenize a sentence into words."""
    # TODO: Implement word tokenization
    # You can use str.split() or nltk.word_tokenize
    pass

# TODO:
#
# 1. Read and combine files for each language
#    - Loop through english_files (language 1) and second_lang_files (language 2)
#    - Use appropriate read function based on file extension
#    - Combine all text into lang1_text and lang2_text
#
# 2. Apply preprocessing to both lang1_text and lang2_text
#    (e.g., lowercasing, removing extra whitespace)
#
# 3. Split each corpus into sentences using split_into_sentences()
#
# 4. Tokenize each sentence using tokenize_sentence()
#
# 5. Store results as:
#    - lang1_sentences: List[List[str]] (list of tokenized sentences)
#    - lang2_sentences: List[List[str]] (list of tokenized sentences)

lang1_text = ""
lang2_text = ""


# At the end, you should have:
# lang1_sentences = [[word1, word2, ...], [word1, word2, ...], ...]
# lang2_sentences = [[word1, word2, ...], [word1, word2, ...], ...]

# [8 pts]

**Question 1.1:** What preprocessing choices did you make and why? (3-5 sentences)

**[YOUR ANSWER HERE]**

## 1.3: Basic Statistics (10 points)

Calculate and display key statistics for both language corpora to understand their characteristics.
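A minimal sketch of the word-level statistics on a toy corpus is shown here. The function name `corpus_stats` and the toy sentences are made up for illustration. Character and punctuation counts are not included because they usually have to be computed on the raw text, before tokenization removes punctuation.

```python
from typing import Dict, List

def corpus_stats(sentences: List[List[str]]) -> Dict[str, float]:
    """Word-level statistics for a list of tokenized sentences (hypothetical helper)."""
    total_words = sum(len(s) for s in sentences)
    vocabulary = {w for s in sentences for w in s}
    return {
        "sentences": len(sentences),
        "words": total_words,
        "vocabulary": len(vocabulary),
        # Guard against an empty corpus to avoid division by zero.
        "avg_sentence_length": total_words / len(sentences) if sentences else 0.0,
    }

toy = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
stats = corpus_stats(toy)
for name, value in stats.items():
    print(f"{name:>20}: {value}")
```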

In [12]:
# TODO: Calculate statistics for BOTH languages
#
# For each language (lang1_sentences and lang2_sentences):
# - Total character count
# - Special character/punctuation count
# - Total word count
# - Vocabulary size (unique words)
# - Sentence count
# - Average sentence length
#
# Example structure:
# lang1_total_words = sum(len(sentence) for sentence in lang1_sentences)
# lang1_vocabulary = len(set(word for sentence in lang1_sentences for word in sentence))
# ...
#
# Create a comparison table or print statistics side by side for the two corpora

# Your code here

# [6 pts]

**Question 1.2:** What are the key differences between your two corpora? (2-3 sentences)

**[YOUR ANSWER HERE]**

---

# Task 2: N-gram Language Identification (58 points)

**Baseline (25 pts):** Implement 2-gram and 3-gram models, run 10-fold CV, report accuracy.  
**Creativity Bonus (10 pts):** Out-of-vocabulary analysis.

## 2.1: Implement N-gram Models (12 points)

Implement the `NgramLanguageModel` class with Laplace smoothing using NLTK's n-gram utilities. The model should count n-grams during training and calculate sentence probabilities with smoothing.
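To make the smoothing formula concrete, here is a pure-Python sketch of add-one bigram estimation, P(w | prev) = (count(prev, w) + 1) / (count(prev) + V). It does not use NLTK, and the helper names are hypothetical. It assumes V is the number of distinct tokens including the padding symbols; NLTK's own vocabulary handling may yield slightly different numbers.

```python
from collections import Counter
from typing import List, Set, Tuple

def train_bigram_counts(sentences: List[List[str]]) -> Tuple[Counter, Counter, Set[str]]:
    """Count bigrams and their left contexts over padded sentences."""
    bigram_counts: Counter = Counter()
    context_counts: Counter = Counter()
    vocab: Set[str] = set()
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        vocab.update(padded)
        for prev, word in zip(padded, padded[1:]):
            bigram_counts[(prev, word)] += 1
            context_counts[prev] += 1
    return bigram_counts, context_counts, vocab

def laplace_prob(prev: str, word: str, bigram_counts: Counter,
                 context_counts: Counter, vocab_size: int) -> float:
    """Add-one smoothed P(word | prev); unseen bigrams get a small non-zero value."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)

toy = [["the", "cat", "sat"], ["the", "dog", "sat"]]
bigrams, contexts, vocab = train_bigram_counts(toy)
V = len(vocab)

p_seen = laplace_prob("the", "cat", bigrams, contexts, V)    # bigram observed once
p_unseen = laplace_prob("cat", "dog", bigrams, contexts, V)  # bigram never observed
print(V, p_seen, p_unseen)
```

The unseen bigram still receives probability mass, which is what keeps a sentence probability from collapsing to zero.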

In [13]:
import nltk
from nltk.util import ngrams, pad_sequence
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline
from typing import List

# Download required NLTK data
nltk.download('punkt', quiet=True)

class NgramLanguageModel:
    """
    N-gram language model with Laplace (add-1) smoothing using NLTK.
    """

    def __init__(self, n: int = 2):
        """
        Initialize the n-gram model.

        Args:
            n: Order of n-gram (2 for bigram, 3 for trigram)
        """
        self.n = n
        self.model = Laplace(n)

    def train(self, sentences: List[List[str]]):
        """
        Train the model on a list of tokenized sentences.

        Args:
            sentences: List of tokenized sentences (each sentence is a list of words)
        """
        # TODO: Your code here
        # Use padded_everygram_pipeline to prepare training data with padding
        # Then fit the model using self.model.fit()
        pass

    def get_probability(self, sentence: List[str]) -> float:
        """
        Calculate the probability of a sentence.

        Args:
            sentence: Tokenized sentence

        Returns:
            Probability of the sentence
        """
        # TODO: Your code here
        # Pad the sentence and generate n-grams
        # For each n-gram, get probability using self.model.score()
        # Multiply probabilities together (or sum log probabilities to avoid underflow)
        pass

# [8 pts]

### Spot Check: Inspect Your N-gram Models

After implementing the model, train sample models on both languages and inspect what they learned.

In [14]:
# TODO: Train sample models and inspect them
#
# 1. Create 2-gram and 3-gram models for both languages
# 2. Train them on your full datasets (lang1_sentences and lang2_sentences)
# 3. Inspect the models to see what n-grams they learned
#
# Example:
# model_2gram_lang1 = NgramLanguageModel(n=2)
# model_2gram_lang1.train(lang1_sentences)
# model_3gram_lang1 = NgramLanguageModel(n=3)
# model_3gram_lang1.train(lang1_sentences)
#
# model_2gram_lang2 = NgramLanguageModel(n=2)
# model_2gram_lang2.train(lang2_sentences)
# model_3gram_lang2 = NgramLanguageModel(n=3)
# model_3gram_lang2.train(lang2_sentences)
#
# Display sample n-grams and their counts from each model
# Check vocabulary size: len(model.model.vocab)
# Show most common n-grams or test probabilities on sample sentences

# Your code here

# [4 pts]

## 2.2: Implement Language Identification (8 points)

Create a function that compares sentence probabilities from two language models and returns the predicted label.
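One practical detail when comparing sentence probabilities: multiplying many small values can underflow to exactly 0.0, so both models tie. The sketch below uses made-up per-n-gram probabilities (not taken from any real model) to show why summing log-probabilities is a safer basis for the comparison.

```python
import math

# Hypothetical per-n-gram probabilities for one long sentence under two models.
probs_lang1 = [1e-3] * 200
probs_lang2 = [1e-4] * 200

# Direct products underflow: both become 0.0 and can no longer be ranked.
product1 = math.prod(probs_lang1)
product2 = math.prod(probs_lang2)
print(product1, product2)

# Sums of logs stay finite and preserve the ordering.
log1 = sum(math.log(p) for p in probs_lang1)
log2 = sum(math.log(p) for p in probs_lang2)
print(log1, log2)

# Label 0 for language 1, label 1 for language 2, as in the docstring convention.
predicted = 0 if log1 > log2 else 1
print(predicted)
```

How ties are broken (here they fall to label 1) is a design choice worth stating in your answer.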

In [15]:
def identify_language(sentence: List[str],
                     model_lang1: NgramLanguageModel,
                     model_lang2: NgramLanguageModel) -> int:
    """
    Identify the language of a sentence using two language models.

    Args:
        sentence: Tokenized sentence
        model_lang1: Language model for language 1 (label 0)
        model_lang2: Language model for language 2 (label 1)

    Returns:
        Predicted label (0 or 1)
    """
    # TODO: Your code here
    #
    # Steps:
    # 1. Calculate probability of sentence using model_lang1
    #    prob1 = model_lang1.get_probability(sentence)
    #
    # 2. Calculate probability of sentence using model_lang2
    #    prob2 = model_lang2.get_probability(sentence)
    #
    # 3. Compare probabilities and return the label of the model with higher probability
    #    if prob1 > prob2: return 0
    #    else: return 1

    pass

# [8 pts]


## 2.3: Implement Evaluation Function (6 points)

Create a function that calculates accuracy, precision, recall, and F1-score given predicted and true labels.
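For intuition about what the four numbers mean, the sketch below computes them by hand for a binary problem, treating label 1 as the positive class. The function name `binary_metrics` is hypothetical. scikit-learn's `precision_recall_fscore_support` takes an `average` parameter, so its output depends on whether you ask for binary, macro, or per-class values.

```python
from typing import Dict, List

def binary_metrics(y_true: List[int], y_pred: List[int], positive: int = 1) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for one positive class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Each ratio falls back to 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1_score": f1}

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```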

In [16]:
def calculate_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """
    Calculate evaluation metrics.

    Args:
        y_true: True labels
        y_pred: Predicted labels

    Returns:
        Dictionary with accuracy, precision, recall, f1_score
    """
    # TODO: Your code here
    # Use sklearn's accuracy_score and precision_recall_fscore_support
    pass

# [6 pts]

## 2.4: 10-Fold Cross-Validation for Language Identification (8 points)

**Overview:** We will compare 2-gram and 3-gram models for sentence-based language identification. In each fold, we split sentences into training and test sets. For each n-gram order (2 and 3), we train two models (one per language) on the training sentences. Then, we use `identify_language()` to classify test sentences by comparing their probabilities under both language models. Finally, we evaluate performance using `calculate_metrics()` and repeat this process across all 10 folds to get robust performance estimates.
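The splitting step can be pictured with the standard-library sketch below. It is not scikit-learn's `KFold` (the shuffling differs, so the indices will not match), and `kfold_indices` is a hypothetical name. It only illustrates that every sentence lands in exactly one test fold.

```python
import random
from typing import Iterator, List, Tuple

def kfold_indices(n_samples: int, n_splits: int = 10,
                  seed: int = 42) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs over shuffled indices (hypothetical helper)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % n_splits) folds get one extra sample.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(25, n_splits=10))
for fold_number, (train_idx, test_idx) in enumerate(folds, 1):
    print(fold_number, len(train_idx), len(test_idx))
```

A plain k-fold split does not balance the two languages within each fold; scikit-learn's `StratifiedKFold` is one option if your corpora differ a lot in sentence count.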

In [17]:
from sklearn.model_selection import KFold

# Prepare dataset: combine sentences from both languages with labels
X = lang1_sentences + lang2_sentences
y = [0] * len(lang1_sentences) + [1] * len(lang2_sentences)

print(f"Dataset prepared:")
print(f"  Total sentences: {len(X)}")
print(f"  Language 1 (label 0): {sum(1 for label in y if label == 0)} sentences")
print(f"  Language 2 (label 1): {sum(1 for label in y if label == 1)} sentences")
print()

# Initialize 10-fold cross-validation
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

# Store results for each fold
results_2gram = {'accuracy': [], 'precision': [], 'recall': [], 'f1': []}
results_3gram = {'accuracy': [], 'precision': [], 'recall': [], 'f1': []}

# For each fold:
#   1. Split data: X_train, X_test, y_train, y_test using train_idx and test_idx
#   2. Separate training sentences by language:
#      - train_lang1 = sentences where y_train == 0
#      - train_lang2 = sentences where y_train == 1
#   3. Train four models:
#      - model_2gram_lang1 and model_2gram_lang2 (both with n=2)
#      - model_3gram_lang1 and model_3gram_lang2 (both with n=3)
#   4. Make predictions on X_test:
#      - pred_2gram = [identify_language(sent, model_2gram_lang1, model_2gram_lang2) for sent in X_test]
#      - pred_3gram = [identify_language(sent, model_3gram_lang1, model_3gram_lang2) for sent in X_test]
#   5. Calculate metrics:
#      - metrics_2gram = calculate_metrics(y_test, pred_2gram)
#      - metrics_3gram = calculate_metrics(y_test, pred_3gram)
#   6. Store results:
#      - results_2gram['accuracy'].append(metrics_2gram['accuracy'])
#      - (repeat for precision, recall, f1 for both models)

# TODO: Complete the cross-validation loop
for fold_idx, (train_idx, test_idx) in enumerate(kfold.split(X), 1):
    print(f"\n{'='*50}")
    print(f"Fold {fold_idx}/10")
    print(f"{'='*50}")

    # Get train and test data for this fold
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]

    # Your implementation here

    # Print fold results (after you calculate them)
    # print(f"Fold {fold_idx} - 2-gram accuracy: {results_2gram['accuracy'][-1]:.4f}")
    # print(f"Fold {fold_idx} - 3-gram accuracy: {results_3gram['accuracy'][-1]:.4f}")

print("\n" + "="*50)
print("Cross-validation completed!")
print("="*50)

# [8 pts]


## 2.5: Display Results (12 points)

Create a table showing, for each model, the mean accuracy, precision, recall, and F1-score (with standard deviation) across the 10 folds.
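A small sketch of the summary step, using made-up fold accuracies purely for illustration (they are not results from any model). It reports the sample standard deviation, which divides by n - 1. `numpy.std` defaults to the population form (`ddof=0`) while pandas `.std()` defaults to `ddof=1`, so state which one you report.

```python
import statistics

# Hypothetical per-fold accuracies, for illustration only.
fold_accuracy = {
    "2-gram": [0.90, 0.92, 0.91, 0.93, 0.89],
    "3-gram": [0.88, 0.91, 0.89, 0.92, 0.88],
}

summary = {}
for model_name, scores in fold_accuracy.items():
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1)
    summary[model_name] = (mean, std)
    print(f"{model_name}: {mean:.4f} +/- {std:.4f}")
```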

In [None]:
# TODO: Calculate and display summary statistics
#
#
# Example:
# results_df = pd.DataFrame({
#     'Model': ['2-gram', '3-gram'],
#     'Accuracy': [...],
#     'Precision': [...],
#     ...
# })

# [4 pts]

**Question 2.1:** Which of your trained models performed best on the validation data, and why? (3-4 sentences)

**[YOUR ANSWER HERE]**

**Question 2.2:** Were the results consistent across different folds of cross-validation? (2-3 sentences)

**[YOUR ANSWER HERE]**

## 2.6: Out-of-Vocabulary Testing (12 points)

Test your models with **five** sentences containing words not in your training corpus.
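Before scoring a test sentence, it can help to confirm which tokens are actually out of vocabulary. The sketch below does this against a toy training set; `oov_tokens`, the toy sentences, and the invented words are all hypothetical.

```python
from typing import List, Set

def oov_tokens(sentence: List[str], vocabulary: Set[str]) -> List[str]:
    """Return the tokens of a sentence that never occurred in training."""
    return [w for w in sentence if w not in vocabulary]

train = [["the", "cat", "sat"], ["the", "dog", "ran"]]
vocabulary = {w for s in train for w in s}

test_sentence = ["the", "blorf", "sat", "zindly"]  # two invented words
missing = oov_tokens(test_sentence, vocabulary)
oov_rate = len(missing) / len(test_sentence)
print(missing, oov_rate)
```

Reporting the OOV rate next to each prediction makes it easier to explain why a model was or was not confident.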

In [None]:
# TODO: Create and test OOV sentences
#
# Ideas:
# - Sentences with made-up words
# - Mix of real and invented words
# - Be creative!

# [8 pts]

**Question 2.3:** How well did your models handle out-of-vocabulary (OOV) samples? Explain briefly. (2-3 sentences)

**[YOUR ANSWER HERE]**

---

# Task 3: Statistical Analysis (20 points)

**Baseline (10 pts):** Statistical significance testing and comparison.  
**Creativity (10 pts):** Advanced analysis (confusion matrices, error analysis, etc.).

## 3.1: Statistical Significance Testing (10 points)

Use a paired t-test to compare the models. A p-value < 0.05 indicates a statistically significant difference.
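For reference, the paired t statistic is the mean of the per-fold differences divided by its standard error, with n - 1 degrees of freedom. The sketch below computes it by hand on made-up fold accuracies (illustrative values only). `paired_t_statistic` is a hypothetical helper; the p-value itself is left to `scipy.stats.ttest_rel`.

```python
import math
import statistics
from typing import List, Tuple

def paired_t_statistic(a: List[float], b: List[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for paired samples (hypothetical helper).

    Raises ZeroDivisionError if all differences are identical (zero variance).
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Hypothetical per-fold accuracies for two models, for illustration only.
acc_model_a = [0.90, 0.92, 0.91, 0.93, 0.89]
acc_model_b = [0.88, 0.91, 0.89, 0.92, 0.88]

t_stat, dof = paired_t_statistic(acc_model_a, acc_model_b)
print(f"t = {t_stat:.4f}, degrees of freedom = {dof}")
```

Because the training sets of different folds overlap, the fold scores are not fully independent, so the resulting p-value is best read as approximate.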

In [None]:
# TODO: Perform paired t-tests
#
# Compare: 2-gram vs 3-gram
#
# Use: t_stat, p_value = ttest_rel(results_1, results_2)

# [6 pts]

**Question 3.1:** Are the performance differences statistically significant? Explain what 'statistical significance' means in this context. (2-3 sentences)

**[YOUR ANSWER HERE]**

## 3.2: Advanced Analysis (10 points)

Perform deeper analysis such as per-language performance, misclassification patterns, etc.
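One possible starting point is a confusion matrix together with the indices of the misclassified sentences, sketched here with toy labels (not real results). The variable names are illustrative only.

```python
from collections import Counter

# Toy labels for illustration: 0 = language 1, 1 = language 2.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# confusion[(true_label, predicted_label)] -> number of sentences
confusion = Counter(zip(y_true, y_pred))

print("rows = true label, columns = predicted label")
for true_label in (0, 1):
    row = [confusion[(true_label, predicted_label)] for predicted_label in (0, 1)]
    print(true_label, row)

# Indices of misclassified sentences, useful for inspecting the actual text.
misclassified = [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]
print(misclassified)
```

Looking up the sentences at the misclassified indices (for example very short sentences, or sentences full of numbers or names) often explains most of the errors.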

In [None]:
# TODO: Your advanced analysis here

# [6 pts]

**Question 3.2:** What interesting patterns or insights did you discover from your results? (4-5 sentences)

**[YOUR ANSWER HERE]**

## Run the cell below to save your notebook as a PDF

In [18]:
# Install required packages (this takes about 30 seconds)
print("Installing PDF converter... please wait...")
!apt-get update -qq
!apt-get install -y texlive-xetex texlive-fonts-recommended texlive-plain-generic pandoc > /dev/null 2>&1
!pip install -q nbconvert

print("\n" + "="*50)
print("COLAB NOTEBOOK TO PDF CONVERTER")
print("="*50)
print("\nSTEP 1: Download your notebook")
print("- Go to File → Download → Download .ipynb")
print("- Save it to your computer")
print("\nSTEP 2: Upload it here")
print("- Click the folder icon on the left (📁)")
print("- Click the upload button and select your .ipynb file")
print("- Wait for upload to complete")
print("\nSTEP 3: Enter the filename below")
print("="*50)

# Get notebook name from user
notebook_name = input("\nEnter your notebook name: ")

# Add .ipynb if missing
if not notebook_name.endswith('.ipynb'):
    notebook_name += '.ipynb'

import os
notebook_path = f'/content/{notebook_name}'

# Check if file exists
if not os.path.exists(notebook_path):
    print(f"\n⚠ Error: '{notebook_name}' not found in /content/")
    print("\nMake sure you uploaded the file using the folder icon (📁) on the left!")
else:
    print(f"\n✓ Found {notebook_name}")
    print("Converting to PDF... this may take 1-2 minutes...\n")

    # Convert the notebook to PDF
    !jupyter nbconvert --to pdf "{notebook_path}"

    # Download the PDF
    from google.colab import files
    pdf_name = notebook_name.replace('.ipynb', '.pdf')
    pdf_path = f'/content/{pdf_name}'

    if os.path.exists(pdf_path):
        print("✓ SUCCESS! Downloading your PDF now...")
        files.download(pdf_path)
        print("\n✓ Done! Check your downloads folder.")
    else:
        print("⚠ Error: Could not create PDF")

