DSC140A SuperHW

The problem. Consider the words “meilleur” and “mejor”. In English, both of these words mean “best”
– one in French and the other in Spanish. Even if you don’t know how to speak either of these languages,
you might be able to guess which word is Spanish and which is French based on the spelling. For example,
the word “meilleur” looks more like a French word than a Spanish word due to it containing “ei” and ending
in “eur”. On the other hand, the word “mejor” looks more like a Spanish word than a French word due to it
containing “j” and ending in “or”. This suggests that there is some statistical structure in the words of these
languages that we can use to build a machine learning model capable of distinguishing between French and
Spanish without actually understanding the words themselves.

Your goal in this problem is to build a machine learning model that can take a word as input and predict
whether it is Spanish or French.

Write a function classify(train_words, train_labels, test_words) that takes three arguments:

• train_words: a list of n strings, each one of them a word (in either Spanish or French)
• train_labels: a list of n strings, each one of them either "spanish" or "french", indicating the
language of the corresponding word in train_words
• test_words: a list of m strings, each one of them a word (in either Spanish or French)

This function should return a list of m strings, each one of them either "spanish" or "french", indicating
your classifier’s prediction for the language of the corresponding word in test_words.
Your classify() function is responsible for training your machine learning model on the training data and
then using that model to make predictions on the test data.
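
To make the expected interface concrete, here is a minimal sketch of a function with the same signature. It is not a suggested solution: classify_majority is a hypothetical baseline that ignores spelling entirely and always predicts the most common training label, and the toy word lists are made up for illustration.

```python
from collections import Counter

def classify_majority(train_words, train_labels, test_words):
    """Hypothetical baseline: always predict the most common training label."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    # One prediction per test word, in the same order as test_words
    return [most_common_label for _ in test_words]

# Toy data, made up for illustration
train_words = ["mejor", "hola", "meilleur"]
train_labels = ["spanish", "spanish", "french"]
test_words = ["bonjour", "gracias"]

predictions = classify_majority(train_words, train_labels, test_words)
print(predictions)  # ['spanish', 'spanish']
```

Any real classifier should beat this baseline, which makes it a useful sanity check.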


• A good choice of features is important. You might want to consider using the frequency of different
letters or pairs of letters in each word as features. For example, one of your features might be whether
“el” appears in the word. You do not need to create all of these features by hand – you can use Python
to help you generate them.
• Don’t confuse training accuracy with test accuracy. It is possible to achieve 90%+ training accuracy
on this data set, but that doesn’t mean your model will generalize well to the test set.
• Be careful to avoid overfitting! If you use too many features or too complex a model, you may find
that your model performs well on the training data but poorly on the test data.
• Start with the simplest models. We have learned some models in this class that can be implemented
using only a couple of lines of code (with numpy).
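
As an illustration of generating features with Python rather than by hand, the sketch below builds a 0/1 matrix recording which letter pairs appear in each word. The helper names build_vocabulary and featurize are hypothetical, and using presence (rather than counts) of bigrams is only one possible choice of feature.

```python
import numpy as np

def build_vocabulary(words, n=2):
    """Collect every character n-gram seen in `words`, in sorted order."""
    grams = set()
    for word in words:
        w = word.lower()
        grams.update(w[i:i + n] for i in range(len(w) - n + 1))
    return sorted(grams)

def featurize(words, vocabulary, n=2):
    """Return a (len(words), len(vocabulary)) 0/1 matrix of n-gram presence."""
    index = {gram: j for j, gram in enumerate(vocabulary)}
    X = np.zeros((len(words), len(vocabulary)), dtype=np.int8)
    for i, word in enumerate(words):
        w = word.lower()
        for k in range(len(w) - n + 1):
            j = index.get(w[k:k + n])
            if j is not None:  # n-grams not in the vocabulary are ignored
                X[i, j] = 1
    return X

# Build the vocabulary from "training" words, then featurize other words with it
vocab = build_vocabulary(["mejor", "meilleur"])
X = featurize(["mejor", "el"], vocab)
print(X.shape)  # (2, 10)
```

The vocabulary is built from the training words only, so a test word can contribute nothing but n-grams the model has already seen.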

import numpy as np

def classify(train_words, train_labels, test_words):
    # Extract all unigrams and bigrams from training words
    unigrams = set()
    bigrams = set()
    for word in train_words:
        # Convert to lowercase to ensure uniformity
        lower_word = word.lower()
        # Add unigrams
        for c in lower_word:
            unigrams.add(c)
        # Add bigrams
        for i in range(len(lower_word) - 1):
            bigram = lower_word[i:i+2]
            bigrams.add(bigram)
    # Sort so the feature order (and the floating-point summation order) is reproducible
    unigrams = sorted(unigrams)
    bigrams = sorted(bigrams)
    features = unigrams + bigrams
    
    # Compute prior probabilities
    total = len(train_labels)
    spanish_count = sum(1 for label in train_labels if label == 'spanish')
    french_count = total - spanish_count
    prior_spanish = spanish_count / total if total != 0 else 0.5
    prior_french = french_count / total if total != 0 else 0.5
    
    # Initialize counts for features in each class
    spanish_feature_counts = {feature: 0 for feature in features}
    french_feature_counts = {feature: 0 for feature in features}
    
    # Populate feature counts for Spanish and French
    for word, label in zip(train_words, train_labels):
        lower_word = word.lower()
        # Extract unigrams and bigrams present in the word
        current_unigrams = set(lower_word)
        current_bigrams = set()
        for i in range(len(lower_word) - 1):
            bigram = lower_word[i:i+2]
            current_bigrams.add(bigram)
        # Update counts for the word's class. Every n-gram here was collected into
        # `features` in the first pass, so no membership check is needed.
        counts = spanish_feature_counts if label == 'spanish' else french_feature_counts
        for feature in current_unigrams | current_bigrams:
            counts[feature] += 1
    
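    # Estimate P(feature present | class), as in Bernoulli Naive Bayes, with Laplace
    # (add-one) smoothing: (class words containing the feature + 1) / (class words + 2)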
    prob_spanish = {}
    prob_french = {}
    for feature in features:
        prob_spanish[feature] = (spanish_feature_counts[feature] + 1) / (spanish_count + 2)
        prob_french[feature] = (french_feature_counts[feature] + 1) / (french_count + 2)
    
    # Predict for each test word
    predictions = []
    for word in test_words:
        lower_word = word.lower()
        test_unigrams = set(lower_word)
        test_bigrams = set()
        for i in range(len(lower_word) - 1):
            bigram = lower_word[i:i+2]
            test_bigrams.add(bigram)
        
        # Compute log probabilities
        log_prob_spanish = np.log(prior_spanish) if prior_spanish > 0 else -np.inf
        log_prob_french = np.log(prior_french) if prior_french > 0 else -np.inf
        
        for feature in features:
            # A feature is present if it occurs in the test word. Unigrams (length 1)
            # and bigrams (length 2) can never collide, so checking both sets is safe.
            present = feature in test_unigrams or feature in test_bigrams
            
            # Update log probabilities
            if present:
                log_prob_spanish += np.log(prob_spanish[feature])
                log_prob_french += np.log(prob_french[feature])
            else:
                log_prob_spanish += np.log(1 - prob_spanish[feature])
                log_prob_french += np.log(1 - prob_french[feature])
        
        # Determine the predicted label
        if log_prob_spanish > log_prob_french:
            predictions.append('spanish')
        else:
            predictions.append('french')
    
    return predictions
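
Because training accuracy can be misleading, one way to estimate how a classifier generalizes is to hold out part of the labeled data and score predictions on it. The sketch below assumes nothing about the classifier beyond the classify() signature. The names holdout_accuracy, holdout_fraction, seed, and always_spanish are hypothetical, and the toy data is made up.

```python
import random

def holdout_accuracy(classify_fn, words, labels, holdout_fraction=0.2, seed=0):
    """Train on part of the labeled data and score predictions on the rest."""
    indices = list(range(len(words)))
    random.Random(seed).shuffle(indices)  # fixed seed for a repeatable split
    n_holdout = max(1, int(len(indices) * holdout_fraction))
    held, kept = indices[:n_holdout], indices[n_holdout:]
    predictions = classify_fn(
        [words[i] for i in kept],
        [labels[i] for i in kept],
        [words[i] for i in held],
    )
    correct = sum(p == labels[i] for p, i in zip(predictions, held))
    return correct / len(held)

# Stand-in classifier with the same signature as classify(); made up for illustration
def always_spanish(train_words, train_labels, test_words):
    return ["spanish"] * len(test_words)

toy_words = ["mejor", "hola", "gracias", "casa", "perro"]
toy_labels = ["spanish"] * 5
print(holdout_accuracy(always_spanish, toy_words, toy_labels))  # 1.0
```

Passing classify in place of always_spanish, along with the real train_words and train_labels, would give a held-out accuracy to compare against the training accuracy.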