DSC140A SuperHW

The problem. Consider the words “meilleur” and “mejor”. In English, both of these words mean “best”
– one in French and the other in Spanish. Even if you don’t know how to speak either of these languages,
you might be able to guess which word is Spanish and which is French based on the spelling. For example,
the word “meilleur” looks more French than Spanish because it contains “ei” and ends
in “eur”. On the other hand, the word “mejor” looks more Spanish than French because it
contains “j” and ends in “or”. This suggests that there is some statistical structure in the words of these
languages that we can use to build a machine learning model capable of distinguishing between French and
Spanish without actually understanding the words themselves.

Your goal in this problem is to build a machine learning model that can take a word as input and predict
whether it is Spanish or French.

The classify() function takes three arguments:

• train_words: a list of n strings, each one of them a word (in either Spanish or French)
• train_labels: a list of n strings, each one of them either "spanish" or "french", indicating the
language of the corresponding word in train_words
• test_words: a list of m strings, each one of them a word (in either Spanish or French)

It should return a list of m strings, each one of them either "spanish" or "french", indicating
your classifier’s prediction for the language of the corresponding word in test_words.
Your classify() function is responsible for training your machine learning model on the training data and
then using that model to make predictions on the test data.
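
To make the expected interface concrete, here is a minimal sketch of a function with the same three arguments and return type. It is only a majority-class baseline, not a recommended solution; the name classify_majority and the toy words are hypothetical and chosen purely for illustration.

```python
from collections import Counter

def classify_majority(train_words, train_labels, test_words):
    """Hypothetical baseline: predict the most common training label for every test word."""
    # "Training": find the label that occurs most often in the training data.
    majority_label = Counter(train_labels).most_common(1)[0][0]
    # "Prediction": return one label per test word, in the same order.
    return [majority_label for _ in test_words]

# Toy data, invented for this example.
preds = classify_majority(
    ["mejor", "meilleur", "casa"],
    ["spanish", "french", "spanish"],
    ["maison", "perro"],
)
print(preds)
```

A baseline like this ignores the spelling of the words entirely, so any real model should beat it on held-out data.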


• A good choice of features is important. You might want to consider using the frequency of different
letters or pairs of letters in each word as features. For example, one of your features might be whether
“el” appears in the word. You do not need to create all of these features by hand – you can use Python
to help you generate them.
• Don’t confuse training accuracy with test accuracy. It is possible to achieve 90%+ training accuracy
on this data set, but that doesn’t mean your model will generalize well to the test set.
• Be careful to avoid overfitting! If you use too many features or too complex a model, you may find
that your model performs well on the training data but poorly on the test data.
• Start with the simplest models first. We have learned some models in this class that can be implemented
using only a couple of lines of code (with numpy).
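
As one possible way to generate letter and letter-pair features automatically, the sketch below builds a vocabulary of character n-grams from the training words and turns each word into a binary presence vector. The helper names (char_ngrams, build_vocabulary, featurize) and the two example words are assumptions made for this illustration, not part of the assignment.

```python
import numpy as np

def char_ngrams(word, n):
    """Return all contiguous character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def build_vocabulary(words, n_values=(1, 2)):
    """Map every n-gram seen in the given words to a column index."""
    grams = set()
    for word in words:
        for n in n_values:
            grams.update(char_ngrams(word.lower(), n))
    return {gram: idx for idx, gram in enumerate(sorted(grams))}

def featurize(words, vocabulary, n_values=(1, 2)):
    """Binary matrix: X[i, j] is 1 if n-gram j appears in word i, else 0."""
    X = np.zeros((len(words), len(vocabulary)), dtype=int)
    for row, word in enumerate(words):
        for n in n_values:
            for gram in char_ngrams(word.lower(), n):
                col = vocabulary.get(gram)
                if col is not None:  # n-grams never seen in training are ignored
                    X[row, col] = 1
    return X

words = ["mejor", "meilleur"]
vocab = build_vocabulary(words)
X = featurize(words, vocab)
print(X.shape)
print(X[:, vocab["ei"]])
```

The vocabulary should be built from the training words only and then reused for the test words. Keeping n small limits the number of columns, which is one simple way to keep overfitting in check.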

In [3]:
import numpy as np
from collections import Counter

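def extract_word_dna(word):
    """Return the list of feature strings used to score a single word.

    Features: single letters, letter pairs, 1-3 character prefixes and
    suffixes, and a coarse length bucket.
    """
    dna = []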
    
    for letter in word:
        dna.append(f"letter_{letter}")
    
    for i in range(len(word) - 1):
        double = word[i:i+2]
        dna.append(f"duo_{double}")
    
    for tail_len in [1, 2, 3]:
        if len(word) >= tail_len:
            tail = word[-tail_len:]
            dna.append(f"tail_{tail}")
    
    for head_len in [1, 2, 3]:
        if len(word) >= head_len:
            head = word[:head_len]
            dna.append(f"head_{head}")
            
    # Length bucket: 0-5, with words of 10 or more letters sharing bucket 5.
    size_bucket = min(len(word) // 2, 5)
    dna.append(f"size_{size_bucket}")
    
    return dna

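def build_language_fingerprints(words_list):
    """Count weighted feature occurrences over all words of one language.

    Note: the trio_, es_ and fr_ keys counted here are never produced by
    extract_word_dna(), so at prediction time they only enlarge the totals
    and the vocabulary size used for smoothing.
    """
    fingerprint = Counter()

    # Substrings assumed to be typical of each language ("ez" appears in both lists).
    spanish_giveaways = ["os", "ar", "er", "ir", "mente", "dad", "cion", "ll", "rr", "ia", "io", "ez", "ito", "ita"]
    french_giveaways = ["eu", "ou", "ai", "ei", "au", "eau", "oi", "ie", "tion", "eux", "aux", "ez", "ais", "ment"]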
    
    for word in words_list:
        for letter in word:
            fingerprint[f"letter_{letter}"] += 1

        for i in range(len(word) - 1):
            double = word[i:i+2]
            fingerprint[f"duo_{double}"] += 1
            
        for i in range(len(word) - 2):
            triple = word[i:i+3]
            fingerprint[f"trio_{triple}"] += 1

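        # Each suffix occurrence is counted with weight 3.
        for tail_len in [1, 2, 3]:
            if len(word) >= tail_len:
                tail = word[-tail_len:]
                fingerprint[f"tail_{tail}"] += 3

        # Each prefix occurrence is counted with weight 2.
        for head_len in [1, 2, 3]:
            if len(word) >= head_len:
                head = word[:head_len]
                fingerprint[f"head_{head}"] += 2

        # Length bucket: 0-5, with words of 10 or more letters sharing bucket 5.
        size_bucket = min(len(word) // 2, 5)
        fingerprint[f"size_{size_bucket}"] += 1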
        
        for pattern in spanish_giveaways:
            if pattern in word:
                fingerprint[f"es_{pattern}"] += 3
                
        for pattern in french_giveaways:
            if pattern in word:
                fingerprint[f"fr_{pattern}"] += 3
    
    return fingerprint

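def classify(train_words, train_labels, test_words):
    """Train a smoothed naive-Bayes-style scorer on the training words and
    return a "spanish"/"french" prediction for each test word."""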
    espanol_words = [w.lower() for w, label in zip(train_words, train_labels) if label == "spanish"]
    francais_words = [w.lower() for w, label in zip(train_words, train_labels) if label == "french"]
    
    word_count = len(train_words)
    espanol_prob = len(espanol_words) / word_count
    francais_prob = len(francais_words) / word_count
    
    espanol_prints = build_language_fingerprints(espanol_words)
    francais_prints = build_language_fingerprints(francais_words)
    
    espanol_total = sum(espanol_prints.values())
    francais_total = sum(francais_prints.values())
    
    # All feature keys seen in either language (used for add-one smoothing).
    vocab = set(espanol_prints) | set(francais_prints)
    vocab_size = len(vocab)
    
    results = []
    
    for mystery_word in test_words:
        mystery_word = mystery_word.lower()
        
        espanol_score = np.log(espanol_prob)
        francais_score = np.log(francais_prob)
        
        word_dna = extract_word_dna(mystery_word)
        
        # Add-one (Laplace) smoothed log-likelihood of each feature under each
        # language. A Counter returns 0 for unseen keys, so no branch is needed.
        for feature in word_dna:
            espanol_score += np.log((espanol_prints[feature] + 1) / (espanol_total + vocab_size))
            francais_score += np.log((francais_prints[feature] + 1) / (francais_total + vocab_size))

        # Constant bias in favor of French, applied to every word.
        francais_score += 0.06
        
        if espanol_score > francais_score:
            results.append("spanish")
        else:
            results.append("french")
    
    return results
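
Read as a formula, the scoring loop in classify() above amounts to the following decision rule. This is a reading of the code rather than a statement from the assignment, and the symbols are introduced here for illustration: P(c) is the fraction of training words labeled c, F(w) is the list of features returned by extract_word_dna(w), N_{c,f} is the weighted count of feature f in the fingerprint of language c, N_c is the sum of all counts in that fingerprint, and |V| is the number of distinct feature keys across both fingerprints.

```latex
\[
\operatorname{score}_c(w) = \log P(c) + \sum_{f \in F(w)} \log \frac{N_{c,f} + 1}{N_c + |V|},
\qquad c \in \{\text{spanish}, \text{french}\}
\]
\[
\hat{y}(w) =
\begin{cases}
\text{spanish} & \text{if } \operatorname{score}_{\text{spanish}}(w) > \operatorname{score}_{\text{french}}(w) + 0.06,\\
\text{french} & \text{otherwise.}
\end{cases}
\]
```

Because the comparison is strict, exact ties are resolved in favor of French.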

In [4]:
def calculate_accuracy(true_labels, predicted_labels):
    correct = sum(1 for true, pred in zip(true_labels, predicted_labels) if true == pred)
    return correct / len(true_labels) * 100  # Return as percentage

# Create a validation split from your training data
import random

def train_val_split(words, labels, val_size=0.2):
    # Create paired data
    data = list(zip(words, labels))
    # Local generator: same shuffle as seeding with 42, without reseeding the global RNG
    random.Random(42).shuffle(data)
    
    # Calculate split point
    split_idx = int(len(data) * (1 - val_size))
    
    # Split data
    train_data = data[:split_idx]
    val_data = data[split_idx:]
    
    # Unzip the data
    train_words, train_labels = zip(*train_data)
    val_words, val_labels = zip(*val_data)
    
    return list(train_words), list(train_labels), list(val_words), list(val_labels)

# Load your training data
import csv

def load_data(file_path):
    words = []
    labels = []
    
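    # newline='' is what the csv module expects; UTF-8 is an assumption about the file
    with open(file_path, 'r', newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        next(reader, None)  # Skip the first row, assumed to be a header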
        for row in reader:
            words.append(row[0])
            labels.append(row[1])
    
    return words, labels

# Example usage:
words, labels = load_data('train.csv')  # Path to the dataset provided in the assignment
train_words, train_labels, val_words, val_labels = train_val_split(words, labels)

# Train on train_words/train_labels and predict on val_words
predictions = classify(train_words, train_labels, val_words)

# Calculate accuracy
accuracy = calculate_accuracy(val_labels, predictions)
print(f"Model accuracy: {accuracy:.2f}%")

Model accuracy: 86.67%


In [8]:
def cross_validate(words, labels, k=5):
    # Create paired data
    data = list(zip(words, labels))
    # Local generator: same shuffle as seeding with 42, without reseeding the global RNG
    random.Random(42).shuffle(data)
    
    # Calculate fold size
    fold_size = len(data) // k
    
    accuracies = []
    
    for i in range(k):
        # Select validation fold
        start = i * fold_size
        end = start + fold_size if i < k-1 else len(data)
        val_data = data[start:end]
        train_data = data[:start] + data[end:]
        
        # Split into words and labels
        train_words, train_labels = zip(*train_data)
        val_words, val_labels = zip(*val_data)
        
        # Train and predict
        predictions = classify(list(train_words), list(train_labels), list(val_words))
        
        # Calculate accuracy
        accuracy = calculate_accuracy(val_labels, predictions)
        accuracies.append(accuracy)
    
    return accuracies

# Run cross-validation
accuracies = cross_validate(words, labels)
print(f"Cross-validation accuracies: {accuracies}")
print(f"Average accuracy: {sum(accuracies)/len(accuracies):.2f}%")

Cross-validation accuracies: [81.66666666666667, 84.16666666666667, 86.25, 87.5, 86.66666666666667]
Average accuracy: 85.25%
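
Since the fold accuracies vary by several points, it can help to report their spread alongside the mean. The sketch below does this for the five values printed above; the variable names are hypothetical, and using the sample standard deviation (ddof=1) is one reasonable choice among several.

```python
import numpy as np

# Fold accuracies copied from the cross-validation output above.
fold_accuracies = np.array(
    [81.66666666666667, 84.16666666666667, 86.25, 87.5, 86.66666666666667]
)

mean_acc = fold_accuracies.mean()
std_acc = fold_accuracies.std(ddof=1)  # sample standard deviation across folds
print(f"Accuracy: {mean_acc:.2f}% +/- {std_acc:.2f}%")
```

With only five folds the spread is a rough estimate, but it gives a sense of how much a single validation score can move from one split to another.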


In [6]:
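def confusion_matrix(train_words, train_labels, test_words, test_labels):
    """Generate a confusion matrix to identify error patterns.

    "spanish" is treated as the positive class. The returned matrix has one
    row per true label and one column per predicted label, both ordered
    [spanish, french].
    """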
    predictions = classify(train_words, train_labels, test_words)
    
    # Initialize counts
    tp = fp = tn = fn = 0
    for pred, true in zip(predictions, test_labels):
        if pred == "spanish" and true == "spanish":
            tp += 1
        elif pred == "spanish" and true == "french":
            fp += 1
        elif pred == "french" and true == "french":
            tn += 1
        elif pred == "french" and true == "spanish":
            fn += 1
    
    # Calculate metrics (accuracy here is a fraction in [0, 1], whereas
    # calculate_accuracy() returns a percentage)
    accuracy = (tp + tn) / len(test_labels)
    precision_spanish = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall_spanish = tp / (tp + fn) if (tp + fn) > 0 else 0
    
    return {
        "accuracy": accuracy,
        "precision_spanish": precision_spanish,
        "recall_spanish": recall_spanish,
        "confusion_matrix": [[tp, fn], [fp, tn]]
    }

In [7]:
def test_edge_cases(train_words, train_labels):
    """Test classifier on specially crafted challenging cases"""
    edge_cases = [
        # Words that could be either language
        "animal", "central", "radio", "normal",
        
        # Very short words
        "no", "si", "la", "le",
        
        # Words with language-specific patterns
        "español", "français", "biblioteca", "bibliothèque",
    ]
    
    predictions = classify(train_words, train_labels, edge_cases)
    for word, pred in zip(edge_cases, predictions):
        print(f"'{word}' classified as {pred}")