# Assignment 2: N-grams and Language Identification
## CNG463 - Introduction to Natural Language Processing
### METU NCC Computer Engineering | Fall 2025-26

**Student Name:**  
**Student ID:**  
**Due Date:** 16 November 2025 (Sunday) before midnight

---

## Overview

This assignment focuses on:
1. Building **character-based** 2-gram and 3-gram language models with Laplace smoothing
2. Sentence-based language identification using 10-fold cross-validation
3. Evaluation using accuracy, precision, recall, and F1-score
4. Comparison and analysis

**Note:** For language identification, we use **character n-grams** rather than word n-grams because they better capture language-specific patterns like letter combinations, diacritics, and writing systems.
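
To make the idea of character n-grams concrete, here is a minimal sketch in plain Python. The helper name `char_ngrams` is hypothetical and is not part of the assignment template; it only illustrates what "contiguous character sequences of length n" means.

```python
from typing import List

def char_ngrams(text: str, n: int) -> List[str]:
    """Return every contiguous character n-gram of `text` (no padding)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Spaces and diacritics are ordinary characters, so they appear in the n-grams too.
english_bigrams = char_ngrams("the cat", 2)
trigram_example = char_ngrams("hello", 3)
```

Strings shorter than `n` yield an empty list, which is one reason padding is usually added before counting.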

**Grading:**
- Written Questions (7 × 4 pts): **28 pts**
- Code Tasks with TODO (11 total): **72 pts** distributed by effort level:
  - Simple tasks: 4 pts each (2 cells)
  - Moderate tasks: 6 pts each (4 cells)
  - Complex tasks: 8 pts each (5 cells)
- **Total: 100 pts**

---

## Pre-Submission Checklist

- [ ] Name and student ID at top
- [ ] No cells are added or removed
- [ ] All TODO sections completed
- [ ] All questions answered
- [ ] Code runs without errors
- [ ] Results tables included
- [ ] Run All before saving

## Setup and Imports

In [None]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict, Counter
from typing import List, Tuple, Dict
import re

# Scikit-learn for cross-validation and metrics
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from scipy.stats import ttest_rel


# Set random seed for reproducibility
np.random.seed(42)

---

# Task 1: Corpus Preparation and Statistics (22 points)

## 1.1: Upload Corpus Files

Prepare your text files in **two different languages** (accepted formats: `.txt`, `.pdf`, or `.docx`). When you run the cell below, you'll be prompted to upload files for each language separately. Make sure your files contain substantial text (reports, essays, or similar content from other courses). Each language requires at least **5000** words in its corpus.

In [None]:
from google.colab import files

print("Upload your ENGLISH corpus file(s):")
english_files = files.upload()

print("\nUpload your SECOND LANGUAGE corpus file(s):")
second_lang_files = files.upload()


## 1.2: Load and Preprocess Data (12 points)

Load your uploaded files, extract text, preprocess, split into sentences, and tokenize. You'll need helper functions to handle different file formats.

**Steps:**
1. Read files based on format (`.txt`, `.pdf`, `.docx`) and combine them into a single text for each language
2. Apply preprocessing (e.g., lowercasing, handling punctuation)
3. Split each corpus into individual sentences
4. Tokenize each sentence into words (for statistics)
5. Store the results as two lists of tokenized sentences

**Important:** You'll use word tokenization for calculating statistics, but for the n-gram models in Task 2, you'll work with character n-grams directly on the sentence strings.
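
As a rough illustration of steps 3 and 4, the sketch below splits text on sentence-final punctuation and tokenizes with a regular expression. The names `simple_sentence_split` and `simple_word_tokenize` are hypothetical, and the regex rules are an assumption; abbreviations such as "e.g." or decimal numbers would be split incorrectly, so a library tokenizer may suit your corpus better.

```python
import re
from typing import List

def simple_sentence_split(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def simple_word_tokenize(sentence: str) -> List[str]:
    """Lowercase, then keep runs of word characters (Unicode-aware in Python 3)."""
    return re.findall(r'\w+', sentence.lower())

sample = "Hello world. How are you? Fine!"
sample_sentences = simple_sentence_split(sample)
sample_tokens = [simple_word_tokenize(s) for s in sample_sentences]
```

Note that the sentence strings keep their punctuation, which matters later because the character n-gram models work on the strings rather than on the word tokens.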

In [None]:
import re
from typing import List

def read_txt_file(filename: str) -> str:
    """Read a .txt file and return its content."""
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read()

def read_pdf_file(filename: str) -> str:
    """Read a .pdf file and return its text content."""
    # TODO: Install and use PyPDF2 or pdfplumber
    # Example: pip install PyPDF2
    pass

def read_docx_file(filename: str) -> str:
    """Read a .docx file and return its text content."""
    # TODO: Install and use python-docx
    # Example: pip install python-docx
    pass

def split_into_sentences(text: str) -> List[str]:
    """Split text into sentences."""
    # TODO: Implement sentence splitting
    # You can use simple regex or nltk.sent_tokenize
    pass

def tokenize_sentence(sentence: str) -> List[str]:
    """Tokenize a sentence into words."""
    # TODO: Implement word tokenization
    # You can use str.split() or nltk.word_tokenize
    pass

# TODO:
#
# 1. Read and combine files for each language
#    - Loop through english_files and second_lang_files (from the upload cell)
#    - Use appropriate read function based on file extension
#    - Combine all text into lang1_text and lang2_text
#
# 2. Apply preprocessing to both lang1_text and lang2_text
#    (e.g., lowercasing, removing extra whitespace)
#
# 3. Split each corpus into sentences using split_into_sentences()
#
# 4. Tokenize each sentence using tokenize_sentence()
#
# Note: These tokenized sentences will be used for statistics in Task 1.
# In Task 2, you'll work with the raw sentence strings for character n-grams.

# At the end, you should have:
# lang1_sentences = [sent1, sent2, ...]
# lang2_sentences = [sent1, sent2, ...]
# lang1_sentences_tokenized = [[word1, word2, ...], [word1, word2, ...], ...]
# lang2_sentences_tokenized = [[word1, word2, ...], [word1, word2, ...], ...]


# [8 pts]

**Question 1.1:** What preprocessing choices did you make and why? (3-5 sentences)

**[YOUR ANSWER HERE]**

## 1.3: Basic Statistics (10 points)

Calculate and display key statistics for both language corpora to understand their characteristics.

In [None]:
# TODO: Calculate statistics for BOTH languages
#
# For each language (lang1_sentences and lang2_sentences):
# - Total character count
# - Special character/punctuation count
# - Character vocabulary size (unique characters)
# - Total word count
# - Word vocabulary size (unique words)
# - Sentence count
# - Average sentence length (in words)
#
# Example structure:
# lang1_total_characters = sum(len(sentence) for sentence in lang1_sentences)
# ...
# lang1_total_words = sum(len(sentence) for sentence in lang1_sentences_tokenized)
# lang1_word_vocabulary = len(set(word for sentence in lang1_sentences_tokenized for word in sentence))
# ...
#
# Print statistics side by side for the two corpora


# [6 pts]

**Question 1.2:** What are the key differences between your two corpora? (2-3 sentences)

**[YOUR ANSWER HERE]**

---

# Task 2: Character N-gram Language Identification (58 points)

**Baseline (46 pts):** Implement character-based 2-gram and 3-gram models, run 10-fold CV, report accuracy.  
**Creativity (12 pts):** Out-of-vocabulary analysis.

## 2.1: Implement Character N-gram Models (12 points)

Implement the `CharNgramLanguageModel` class with Laplace smoothing using NLTK's n-gram utilities. The model should count **character** n-grams during training and calculate sentence probabilities with smoothing.

**Key difference from word n-grams:** Instead of tokenizing sentences into words, you'll work with individual characters in each sentence.
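
To see what Laplace (add-1) smoothing does to a single bigram estimate, the sketch below computes P(char | prev) = (count(prev, char) + 1) / (count(prev) + V) by hand on a toy string. The function `laplace_bigram_prob` is hypothetical and is not NLTK's implementation; NLTK's vocabulary typically also counts padding symbols and an unknown token, so its values will differ from these.

```python
from collections import Counter

def laplace_bigram_prob(prev: str, char: str,
                        bigram_counts: Counter,
                        context_counts: Counter,
                        vocab_size: int) -> float:
    """Add-1 smoothed estimate of P(char | prev)."""
    return (bigram_counts[(prev, char)] + 1) / (context_counts[prev] + vocab_size)

text = "abab"
bigram_counts = Counter(zip(text, text[1:]))   # ('a','b') twice, ('b','a') once
context_counts = Counter(text[:-1])            # characters that have a successor
vocab_size = len(set(text))                    # V = 2 in this toy example

p_seen = laplace_bigram_prob("a", "b", bigram_counts, context_counts, vocab_size)
p_unseen = laplace_bigram_prob("a", "a", bigram_counts, context_counts, vocab_size)
```

The unseen bigram "aa" receives a small non-zero probability instead of zero, and the two probabilities for the context "a" still sum to 1.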

In [None]:
import nltk
from nltk.util import ngrams, pad_sequence
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline
from typing import List

# Download required NLTK data
nltk.download('punkt', quiet=True)

class CharNgramLanguageModel:
    """
    Character-based N-gram language model with Laplace (add-1) smoothing using NLTK.
    """

    def __init__(self, n: int = 2):
        """
        Initialize the character n-gram model.

        Args:
            n: Order of n-gram (2 for bigram, 3 for trigram)
        """
        self.n = n
        self.model = Laplace(n)

    def train(self, sentences: List[str]):
        """
        Train the model on a list of sentences.

        Args:
            sentences: List of sentences (each sentence is a string)
        """
        # TODO: Your code here
        # Convert each sentence to a list of characters
        # Example: "hello" -> ['h', 'e', 'l', 'l', 'o']
        # Use padded_everygram_pipeline to prepare training data with padding
        # Then fit the model using self.model.fit()
        pass

    def get_probability(self, sentence: str) -> float:
        """
        Calculate the probability of a sentence.

        Args:
            sentence: Sentence string

        Returns:
            Probability of the sentence
        """
        # TODO: Your code here
        # Convert sentence to list of characters
        # Pad the character sequence and generate n-grams
        # For each n-gram, get probability using self.model.score()
        # Multiply probabilities together (or sum log probabilities to avoid underflow)
        pass

# [8 pts]

### Spot Check: Inspect Your N-gram Models

After implementing the model, train sample models on both languages and inspect what they learned.

In [None]:
# TODO: Train sample models and inspect them
#
# 1. Create 2-gram and 3-gram models for both languages
# 2. Train them on your full datasets (lang1_sentences and lang2_sentences)
# 3. Inspect the models to see what n-grams they learned
#
# Example:
# model_2gram_lang1 = CharNgramLanguageModel(n=2)
# model_2gram_lang1.train(lang1_sentences)
# model_3gram_lang1 = CharNgramLanguageModel(n=3)
# model_3gram_lang1.train(lang1_sentences)
#
# model_2gram_lang2 = CharNgramLanguageModel(n=2)
# model_2gram_lang2.train(lang2_sentences)
# model_3gram_lang2 = CharNgramLanguageModel(n=3)
# model_3gram_lang2.train(lang2_sentences)
#
# Display sample n-grams and their counts from each model
# Check vocabulary size: len(model.model.vocab)
# Show most common n-grams or test probabilities on sample sentences

# Your code here

# [4 pts]

## 2.2: Implement Language Identification (8 points)

Create a function that compares sentence probabilities from two language models and returns the predicted label.
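
One practical reason to compare log-probabilities rather than raw probabilities is floating-point underflow. The sketch below uses made-up per-character probabilities (0.01 and 0.02, chosen only for illustration) to show that two different products can both collapse to 0.0 while their log sums remain distinguishable.

```python
import math

def product_of(prob: float, length: int) -> float:
    """Multiply `prob` by itself `length` times, as a naive sentence probability."""
    result = 1.0
    for _ in range(length):
        result *= prob
    return result

length = 200                      # characters in a hypothetical long sentence
prob_a = product_of(0.01, length) # underflows to 0.0
prob_b = product_of(0.02, length) # also underflows to 0.0

log_a = length * math.log(0.01)   # finite, about -921
log_b = length * math.log(0.02)   # finite, about -782

tie_on_raw_probs = (prob_a == prob_b)
winner_on_logs = "b" if log_b > log_a else "a"
```

With raw probabilities the comparison is a tie and the `else` branch would always win; with log-probabilities the more likely model is still identified.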

In [None]:
def identify_language(sentence: str,
                     model_lang1: CharNgramLanguageModel,
                     model_lang2: CharNgramLanguageModel) -> int:
    """
    Identify the language of a sentence using two character-based language models.

    Args:
        sentence: Sentence string
        model_lang1: Language model for language 1 (label 0)
        model_lang2: Language model for language 2 (label 1)

    Returns:
        Predicted label (0 or 1)
    """
    # TODO: Your code here
    #
    # Steps:
    # 1. Calculate probability of sentence using model_lang1
    #    prob1 = model_lang1.get_probability(sentence)
    #
    # 2. Calculate probability of sentence using model_lang2
    #    prob2 = model_lang2.get_probability(sentence)
    #
    # 3. Compare probabilities and return the label of the model with higher probability
    #    if prob1 > prob2: return 0
    #    else: return 1

    pass

# [8 pts]


## 2.3: Implement Evaluation Function (6 points)

Create a function that calculates accuracy, precision, recall, and F1-score given predicted and true labels.
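
As a reminder of what these metrics measure, the sketch below computes them by hand for a binary problem on a toy label list. The function `binary_metrics` is hypothetical, and treating label 1 as the positive class is an assumption. With scikit-learn, the `average` argument of `precision_recall_fscore_support` controls whether you get per-class arrays or a single averaged value.

```python
from typing import Dict, List

def binary_metrics(y_true: List[int], y_pred: List[int], positive: int = 1) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels, not real results: 3 true positives, 1 false positive, 2 false negatives.
toy_true = [0, 0, 0, 1, 1, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 1, 0, 0]
toy_metrics = binary_metrics(toy_true, toy_pred)
```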

In [None]:
def calculate_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """
    Calculate evaluation metrics.

    Args:
        y_true: True labels
        y_pred: Predicted labels

    Returns:
        Dictionary with accuracy, precision, recall, f1_score
    """
    # TODO: Your code here
    # Use sklearn's accuracy_score and precision_recall_fscore_support
    pass

# [6 pts]

## 2.4: 10-Fold Cross-Validation for Language Identification (8 points)

Implement 10-fold cross-validation to evaluate your character-based n-gram models. In each fold, split the data, train separate models for each language and n-gram order, make predictions, and evaluate performance.
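
The splitting logic can be pictured with the hand-rolled sketch below. It is not scikit-learn's `KFold` and will not produce the same partition; the generator `kfold_indices` and the toy data are hypothetical. It shows each sample landing in exactly one test fold, and the training portion being separated by label before training per-language models.

```python
import random
from typing import Iterator, List, Tuple

def kfold_indices(n_samples: int, n_splits: int = 10,
                  seed: int = 42) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs; each index is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

# Toy dataset: 15 sentences of language 0 followed by 10 of language 1.
X_toy = [f"sentence {i}" for i in range(25)]
y_toy = [0] * 15 + [1] * 10

splits = list(kfold_indices(len(X_toy)))
train_idx, test_idx = splits[0]

# Separate the training sentences by language label before training.
train_lang1 = [X_toy[i] for i in train_idx if y_toy[i] == 0]
train_lang2 = [X_toy[i] for i in train_idx if y_toy[i] == 1]
```

Shuffling matters here because the combined dataset lists all sentences of one language before the other.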

In [None]:
from sklearn.model_selection import KFold

# Prepare dataset: combine sentence STRINGS from both languages with labels
X = lang1_sentences + lang2_sentences
y = [0] * len(lang1_sentences) + [1] * len(lang2_sentences)

print(f"Dataset prepared:")
print(f"  Total sentences: {len(X)}")
print(f"  Language 1 (label 0): {sum(1 for label in y if label == 0)} sentences")
print(f"  Language 2 (label 1): {sum(1 for label in y if label == 1)} sentences")
print()

# Initialize 10-fold cross-validation
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

# Store results for each fold
results_2gram = {'accuracy': [], 'precision': [], 'recall': [], 'f1': []}
results_3gram = {'accuracy': [], 'precision': [], 'recall': [], 'f1': []}

# TODO: Implement 10-fold cross-validation
#
# For each fold, you need to:
#   1. Split data into train and test sets using train_idx and test_idx from kfold.split(X)
#   2. Separate training sentences by language based on labels
#   3. Train FOUR models total:
#      - Two 2-gram models (one per language)
#      - Two 3-gram models (one per language)
#   4. Make predictions on test sentences using identify_language()
#   5. Calculate metrics using calculate_metrics()
#   6. Store results in results_2gram and results_3gram dictionaries
#

for fold_idx, (train_idx, test_idx) in enumerate(kfold.split(X), 1):
    print(f"\n{'='*50}")
    print(f"Fold {fold_idx}/10")
    print(f"{'='*50}")

    # TODO: Your implementation here

print("\n" + "="*50)
print("Cross-validation completed!")
print("="*50)

# [8 pts]

## 2.5: Display Results (12 points)

Create a table showing, for each model, the mean accuracy, precision, recall, and F1-score across folds (with standard deviation).
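
A small sketch of the "mean with std" summary for a single metric follows. The five values are placeholders and not real results, and the variable names are hypothetical. Be aware that `statistics.stdev` and pandas' `.std()` use the sample formula (n - 1), whereas `np.std` defaults to the population formula, so state which one you report.

```python
import statistics

# Placeholder per-fold accuracies, for illustration only.
fold_accuracies = [0.95, 0.97, 0.96, 0.94, 0.98]

mean_acc = statistics.mean(fold_accuracies)
std_acc = statistics.stdev(fold_accuracies)   # sample standard deviation (n - 1)

# One table cell, formatted as "mean ± std".
cell = f"{mean_acc:.3f} ± {std_acc:.3f}"
```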

In [None]:
# TODO: Calculate and display summary statistics
#
#
# Example:
# results_df = pd.DataFrame({
#     'Model': ['2-gram', '3-gram'],
#     'Accuracy': [...],
#     'Precision': [...],
#     ...
# })

# [4 pts]

**Question 2.1:** Which of your trained models performed best on the validation data, and why? (3-4 sentences)

**[YOUR ANSWER HERE]**

**Question 2.2:** Were the results consistent across different folds of cross-validation? (2-3 sentences)

**[YOUR ANSWER HERE]**

## 2.6: Out-of-Vocabulary Testing (12 points)

Test your models with **five** sentences containing characters or character combinations not common in your training corpus. For character n-grams, this might include unusual letter combinations, foreign words, or made-up words that still follow language patterns.
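
One way to check whether a candidate sentence really contains unseen material is to list its n-grams that never occur in the training sentences. The helper `unseen_ngrams` and the toy training data below are hypothetical, and padding is ignored for simplicity.

```python
from typing import List

def unseen_ngrams(sentence: str, training_sentences: List[str], n: int = 2) -> List[str]:
    """Return the n-grams of `sentence` that never occur in `training_sentences`."""
    seen = {s[i:i + n] for s in training_sentences for i in range(len(s) - n + 1)}
    candidates = [sentence[i:i + n] for i in range(len(sentence) - n + 1)]
    return [g for g in candidates if g not in seen]

toy_training = ["the cat sat", "a hat"]
novel = unseen_ngrams("the zat", toy_training, n=2)
```

A made-up word such as "zat" introduces unseen bigrams while the rest of the sentence still follows familiar patterns, which is the kind of case this task asks you to probe.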

In [None]:
# TODO: Create and test OOV sentences
#
# - Be creative!

# [8 pts]

**Question 2.3:** How well did your models handle out-of-vocabulary (OOV) samples? (2-3 sentences)

**[YOUR ANSWER HERE]**

---

# Task 3: Statistical Analysis (20 points)

**Baseline (10 pts):** Statistical significance testing and comparison.  
**Creativity (10 pts):** Advanced analysis (confusion matrices, error analysis, etc.).

## 3.1: Statistical Significance Testing (10 points)

Use a paired t-test to compare the two models on their per-fold scores. A p-value below 0.05 indicates a statistically significant difference.
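
For intuition, the paired t statistic is the mean of the per-fold differences divided by its standard error. The sketch below computes it by hand on placeholder scores, which are not real results; `paired_t_statistic` is a hypothetical helper. `scipy.stats.ttest_rel` reports the same statistic together with the p-value, using n - 1 degrees of freedom.

```python
import math
import statistics
from typing import List

def paired_t_statistic(scores_a: List[float], scores_b: List[float]) -> float:
    """t = mean(d) / (stdev(d) / sqrt(n)) for paired differences d = a - b."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)        # raises if n < 2; zero if all diffs are equal
    return statistics.mean(diffs) / (sd / math.sqrt(n))

# Placeholder per-fold accuracies (same folds, same order for both models).
acc_2gram = [0.90, 0.92, 0.91, 0.93, 0.89]
acc_3gram = [0.93, 0.94, 0.95, 0.94, 0.93]

t_stat = paired_t_statistic(acc_3gram, acc_2gram)
```

The pairing is only valid if both models were evaluated on identical folds; if every difference is the same, the standard deviation is zero and the statistic is undefined.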

In [None]:
# TODO: Perform paired t-tests
#
# Compare: 2-gram vs 3-gram
#
# Use: t_stat, p_value = ttest_rel(results_1, results_2)

# [6 pts]

**Question 3.1:** Are the performance differences statistically significant? Explain what 'statistical significance' means in this context. (2-3 sentences)

**[YOUR ANSWER HERE]**

## 3.2: Advanced Analysis (10 points)

Perform deeper analysis such as per-language performance, misclassification patterns, etc.
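
As one possible starting point, the sketch below tallies (true label, predicted label) pairs into a confusion table and derives per-language recall from it. The labels are toy values and the names are hypothetical; your own analysis may go in a different direction.

```python
from collections import Counter
from typing import Dict, List, Tuple

def confusion_counts(y_true: List[int], y_pred: List[int]) -> Counter:
    """Count each (true_label, predicted_label) pair."""
    return Counter(zip(y_true, y_pred))

def per_language_recall(counts: Counter, labels: Tuple[int, ...] = (0, 1)) -> Dict[int, float]:
    """Fraction of each language's sentences that were labelled correctly."""
    recall = {}
    for label in labels:
        total = sum(c for (t, _), c in counts.items() if t == label)
        recall[label] = counts[(label, label)] / total if total else 0.0
    return recall

# Toy labels, for illustration only.
toy_true = [0, 0, 0, 1, 1, 1]
toy_pred = [0, 1, 0, 1, 1, 0]

counts = confusion_counts(toy_true, toy_pred)
recalls = per_language_recall(counts)
```

Keeping the misclassified sentences themselves, and not only the counts, makes it easier to look for patterns such as very short sentences or shared loanwords.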

In [None]:
# TODO: Your advanced analysis here

# [6 pts]

**Question 3.2:** What interesting patterns or insights did you discover from your results? (4-5 sentences)

**[YOUR ANSWER HERE]**

# Convert Your Colab Notebook to PDF

### Step 1: Download Your Notebook
- Go to **File → Download → Download .ipynb**
- Save the file to your computer

### Step 2: Upload to Colab
- Click the **📁 folder icon** on the left sidebar
- Click the **upload button**
- Select your downloaded .ipynb file

### Step 3: Run the Code Below
- **Uncomment the cell below** and run the cell
- This will take about 1-2 minutes to install required packages
- When prompted, type your notebook name (e.g., `gs_000000_as2.ipynb`) and press Enter

### The PDF will be automatically downloaded to your computer


In [None]:
# # Install required packages (this takes about 30 seconds)
# print("Installing PDF converter... please wait...")
# !apt-get update -qq
# !apt-get install -y texlive-xetex texlive-fonts-recommended texlive-plain-generic pandoc > /dev/null 2>&1
# !pip install -q nbconvert

# print("\n" + "="*50)

# # Get notebook name from user
# notebook_name = input("\nEnter your notebook name: ")

# # Add .ipynb if missing
# if not notebook_name.endswith('.ipynb'):
#     notebook_name += '.ipynb'

# import os
# notebook_path = f'/content/{notebook_name}'

# # Check if file exists
# if not os.path.exists(notebook_path):
#     print(f"\n⚠ Error: '{notebook_name}' not found in /content/")
#     print("\nMake sure you uploaded the file using the folder icon (📁) on the left!")
# else:
#     print(f"\n✓ Found {notebook_name}")
#     print("Converting to PDF... this may take 1-2 minutes...\n")

#     # Convert the notebook to PDF
#     !jupyter nbconvert --to pdf "{notebook_path}"

#     # Download the PDF
#     from google.colab import files
#     pdf_name = notebook_name.replace('.ipynb', '.pdf')
#     pdf_path = f'/content/{pdf_name}'

#     if os.path.exists(pdf_path):
#         print("✓ SUCCESS! Downloading your PDF now...")
#         files.download(pdf_path)
#         print("\n✓ Done! Check your downloads folder.")
#     else:
#         print("⚠ Error: Could not create PDF")