M2 project : Convolutions and character embeddings
======================

**FLEURY MANON 22410952/ LANGREE TOMMI 22312312**

The project aims to predict the language a character sequence comes from. The sequences are surnames, drawn from 18 languages.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset
import warnings
from random import shuffle

Data download & description
---------------------

In [None]:
from urllib.request import urlretrieve

urlretrieve('http://www.linguist.univ-paris-diderot.fr/~bcrabbe/datasets/name2lang.train','name2lang.train')
urlretrieve('http://www.linguist.univ-paris-diderot.fr/~bcrabbe/datasets/name2lang.valid','name2lang.valid')

#Prints the beginning of the valid set
istream = open('name2lang.valid')
for idx, line in enumerate(istream):
  print(line.strip())
  if idx >=300:
    break
istream.close()


Barros, Portuguese
Campos, Portuguese
D'cruz, Portuguese
Henriques, Portuguese
Machado, Portuguese
Silva, Portuguese
Torres, Portuguese
Ahearn, Irish
Aonghus, Irish
Brady, Irish
Cearbhall, Irish
Flann, Irish
Kavanagh, Irish
Maguire, Irish
Mcmahon, Irish
Mcneil, Irish
Monahan, Irish
Muirchertach, Irish
Mullen, Irish
O'Connell, Irish
O'Grady, Irish
O'Hara, Irish
O'Mahony, Irish
Rory, Irish
Shannon, Irish
Sioda, Irish
Tadhgan, Irish
Abel, Spanish
Agramunt, Spanish
Aldana, Spanish
Alfaro, Spanish
Aquino, Spanish
Arena, Spanish
Blanco, Spanish
Bustos, Spanish
Cardona, Spanish
Castellano, Spanish
Del olmo, Spanish
Etxeberria, Spanish
Garrastazu, Spanish
Hierro, Spanish
Loyola, Spanish
Maradona, Spanish
Mas, Spanish
Nieves, Spanish
Ortega, Spanish
Pelaez, Spanish
Robles, Spanish
Roldan, Spanish
Suero, Spanish
Tomas, Spanish
Torres, Spanish
Tos, Spanish
Ubina, Spanish
Urena, Spanish
Valdez, Spanish
Varela, Spanish
Vasquez, Spanish
Villa, Spanish
Villaverde, Spanish
Zavala, Spanish
Pham, Vietna

In [None]:
istream = open('name2lang.train')
for idx, line in enumerate(istream):
  print(line.strip())
  if idx >=6:
    break
istream.close()

Abreu, Portuguese
Albuquerque, Portuguese
Almeida, Portuguese
Alves, Portuguese
Araujo, Portuguese
Araullo, Portuguese
Basurto, Portuguese


First exercise : data preprocessing (3pts)
---
The first exercise amounts to creating encodings from integers to strings and from strings to integers.

In [None]:
def vocabulary(filename,char_vocab,pad_token='<pad>'):
    """
    Args:
      filename (str)    : the name of the file
      char_vocab (bool) : selects if we extract char symbols or language codes
      pad_token(str)    : the value of the pad symbol
    """
    # char_vocab is a boolean flag: True extracts char symbols, False extracts language codes
    # idx2sym : list mapping an index to its symbol
    # sym2idx : dict mapping a symbol to its index

    # the pad token always gets index 0
    idx2sym = [pad_token]
    sym2idx = {pad_token: 0}

    with open(filename) as istream:
        for line in istream:
            line = line.strip()

            if char_vocab:  # we extract char symbols
                # note: the whole line is scanned, so ',' ' ' and the
                # characters of the language label also enter the vocabulary
                for char in line:
                    if char not in sym2idx:
                        idx2sym.append(char)
                        sym2idx[char] = len(idx2sym) - 1
            else:  # we want to extract language codes
                fields = line.split(', ')
                if len(fields) > 1:
                    code = fields[1]
                    if code not in sym2idx:
                        idx2sym.append(code)
                        sym2idx[code] = len(idx2sym) - 1

    # return the two encoding maps idx2sym and sym2idx as a pair
    return idx2sym, sym2idx


In [None]:
# test with TRUE
filename = 'name2lang.train'
idx2sym, sym2idx = vocabulary(filename, True)
print(idx2sym)
print(sym2idx)
#print(idx2sym[24])
#print(sym2idx['h']) # 24

['<pad>', 'A', 'b', 'r', 'e', 'u', ',', ' ', 'P', 'o', 't', 'g', 's', 'l', 'q', 'm', 'i', 'd', 'a', 'v', 'j', 'B', 'C', 'z', 'h', 'p', 'D', "'", 'c', 'n', 'E', 'F', 'G', 'L', 'M', 'N', 'R', 'S', 'f', 'V', 'I', 'y', 'k', 'H', 'J', 'K', 'O', 'w', 'T', 'Q', 'W', 'x', 'U', 'Y', 'Z', 'X']
{'<pad>': 0, 'A': 1, 'b': 2, 'r': 3, 'e': 4, 'u': 5, ',': 6, ' ': 7, 'P': 8, 'o': 9, 't': 10, 'g': 11, 's': 12, 'l': 13, 'q': 14, 'm': 15, 'i': 16, 'd': 17, 'a': 18, 'v': 19, 'j': 20, 'B': 21, 'C': 22, 'z': 23, 'h': 24, 'p': 25, 'D': 26, "'": 27, 'c': 28, 'n': 29, 'E': 30, 'F': 31, 'G': 32, 'L': 33, 'M': 34, 'N': 35, 'R': 36, 'S': 37, 'f': 38, 'V': 39, 'I': 40, 'y': 41, 'k': 42, 'H': 43, 'J': 44, 'K': 45, 'O': 46, 'w': 47, 'T': 48, 'Q': 49, 'W': 50, 'x': 51, 'U': 52, 'Y': 53, 'Z': 54, 'X': 55}


In [None]:
# test with FALSE
filename = 'name2lang.train'
idx2sym, sym2idx = vocabulary(filename, False)
print(idx2sym)
print(sym2idx)

['<pad>', 'Portuguese', 'Irish', 'Spanish', 'Vietnamese', 'Chinese', 'Greek', 'Czech', 'Dutch', 'Japanese', 'French', 'German', 'Scottish', 'English', 'Russian', 'Polish', 'Arabic', 'Korean', 'Italian']
{'<pad>': 0, 'Portuguese': 1, 'Irish': 2, 'Spanish': 3, 'Vietnamese': 4, 'Chinese': 5, 'Greek': 6, 'Czech': 7, 'Dutch': 8, 'Japanese': 9, 'French': 10, 'German': 11, 'Scottish': 12, 'English': 13, 'Russian': 14, 'Polish': 15, 'Arabic': 16, 'Korean': 17, 'Italian': 18}


In [None]:
def pad_sequence(sequence,pad_size,pad_token):
    '''pad_size : The final length that the sequence should have after adding padding characters.'''
    L_seq_pad = list(sequence)

    if len(sequence) < pad_size:
      for i in range(pad_size-len(sequence)):
        L_seq_pad.append(pad_token)

    if len(sequence) > pad_size:
      L_seq_pad = L_seq_pad[:pad_size]

    return L_seq_pad #returns a list with additional pad tokens to match pad_size if needed


def code_sequence(charseq,encodingmap): # encodingmap = the sym2idx dictionary
  #we ignore chars not seen in train set
  #charseq is a sequence of chars
  return [encodingmap[c] for c in charseq if c in encodingmap]

def decode_sequence(idxseq,decodingmap): #decodingmap = idx2sym
  #idxseq is a list of integers
  return [decodingmap[idx] for idx in idxseq]
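
As a quick sanity check, the helpers above can be chained into a round trip: encode a surname, pad it, then decode it. The sketch below is self-contained, so it restates simplified versions of the helpers; the toy vocabulary `toy_idx2sym` / `toy_sym2idx` and the name `pad_to` are illustrative assumptions, not the vocabulary built from the training file.

```python
# Toy vocabulary (hypothetical): index 0 is reserved for the pad symbol
toy_idx2sym = ['<pad>', 'S', 'i', 'l', 'v', 'a']
toy_sym2idx = {sym: idx for idx, sym in enumerate(toy_idx2sym)}

def code_sequence(charseq, encodingmap):
    # characters missing from the map are silently dropped
    return [encodingmap[c] for c in charseq if c in encodingmap]

def pad_to(sequence, pad_size, pad_token):
    # right-pad with pad_token, or truncate, to reach exactly pad_size items
    padded = list(sequence)[:pad_size]
    return padded + [pad_token] * (pad_size - len(padded))

def decode_sequence(idxseq, decodingmap):
    return [decodingmap[idx] for idx in idxseq]

encoded = code_sequence('Silva', toy_sym2idx)
padded = pad_to(encoded, 8, toy_sym2idx['<pad>'])
decoded = decode_sequence(padded, toy_idx2sym)

print(encoded)  # [1, 2, 3, 4, 5]
print(padded)   # [1, 2, 3, 4, 5, 0, 0, 0]
print(''.join(s for s in decoded if s != '<pad>'))  # Silva
```

Note that unknown characters are dropped at encoding time, so the decoded string can be shorter than the original name.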


Second exercise : data generator (2pt)
------------

The data generator aims to efficiently deliver well-formed batches of data to the model.

In [None]:
def read_dataset(filename,input_symbols):
    '''Reads a raw data file and returns one item per non-empty line:
    if input_symbols is True  => the surname, as a list of characters
    if input_symbols is False => the language label'''
    symbols = []
    istream = open(filename)
    for line in istream:
      if line and not line.isspace():
        word,lang = line.split(',')
        symbol = list(word.strip()) if input_symbols else lang.strip()
        symbols.append(symbol)
    istream.close()
    return symbols

In [None]:
from torch.utils.data import Dataset
from torch.nn.utils.rnn import pad_sequence
import torch

# DataGenerator class for loading and preparing data
class DataGenerator(Dataset):
    def __init__(self, data_path, parentgenerator=None):
        self.data = self.load_data(data_path)
        self.pad_token = '<PAD>'
        if parentgenerator:
            # Reuse mappings from parent generator
            self.input_sym2idx = parentgenerator.input_sym2idx
            self.input_idx2sym = parentgenerator.input_idx2sym
            self.output_sym2idx = parentgenerator.output_sym2idx
            self.output_idx2sym = parentgenerator.output_idx2sym
        else:
            self.create_mappings()

    def load_data(self, data_path):
        # Load data from file (name,label)
        data = []
        with open(data_path, 'r', encoding='utf-8') as f:
            for line in f:
                parts = line.strip().split(',')
                if len(parts) >= 2:
                    name = parts[0]
                    label = ','.join(parts[1:]).strip()  # Handle labels with commas
                    data.append((name, label))
        return data

    def create_mappings(self):
        # Create character and label mappings
        chars = set()
        labels = set()
        for name, label in self.data:
            chars.update(name)
            labels.add(label)
        # Map characters to indices (start from 1)
        self.input_sym2idx = {char: idx + 1 for idx, char in enumerate(sorted(chars))}
        self.input_sym2idx[self.pad_token] = 0  # Padding token at index 0
        self.input_idx2sym = {idx: char for char, idx in self.input_sym2idx.items()}
        # Map labels to indices
        self.output_sym2idx = {label: idx for idx, label in enumerate(sorted(labels))}
        self.output_idx2sym = {idx: label for label, idx in self.output_sym2idx.items()}

    def generate_batches(self, batch_size):
        # Generate batches of data
        data = self.data.copy()
        for i in range(0, len(data), batch_size):
            batch_data = data[i:i + batch_size]
            seqX = []
            seqY = []
            for word, lang in batch_data:
                # Convert characters to indices
                x = [self.input_sym2idx.get(char, self.input_sym2idx[self.pad_token]) for char in word]
                seqX.append(torch.tensor(x, dtype=torch.long))
                # Get label index
                y = self.output_sym2idx[lang]
                seqY.append(y)
            # Pad sequences to the same length
            seqX = pad_sequence(seqX, batch_first=True, padding_value=self.input_sym2idx[self.pad_token])
            yield seqX, seqY
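
To make the batching logic concrete, here is a plain-Python sketch of what `generate_batches` does, without tensors: slice the data into consecutive chunks and right-pad every sequence to the longest one in its own chunk. The function name `make_padded_batches` and the toy index sequences are assumptions made for illustration.

```python
def make_padded_batches(sequences, batch_size, pad_idx=0):
    """Yield consecutive batches, each padded to its own longest sequence."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        max_len = max(len(seq) for seq in batch)
        yield [seq + [pad_idx] * (max_len - len(seq)) for seq in batch]

# Hypothetical encoded names of different lengths
toy = [[5, 2], [7, 1, 3, 4], [9], [6, 6, 6]]
for batch in make_padded_batches(toy, batch_size=2):
    print(batch)
# [[5, 2, 0, 0], [7, 1, 3, 4]]
# [[9, 0, 0], [6, 6, 6]]
```

Because padding is done per batch, the sequence length seen by the model varies from one batch to the next. That is fine here, since the max pooling in the convolution module reduces any length to a fixed-size vector.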


Third exercise : Implement the word embedding submodule (5pts)
-----
This exercise amounts to implementing a PyTorch submodule that takes a sequence of char indexes as input and outputs the corresponding word embedding for the sequence.

The module contains no training method and is meant to be used in a larger network. Its use is quite similar to `nn.Embedding`.




In [None]:
import torch
import torch.nn as nn

class CharConvolution(nn.Module):

      def __init__(self,windowK,chars_vocab_size,input_embedding_size,output_embedding_size,padding_idx = None):
          """A minimalist convnet with max pooling
          Args:
          windowK (int): half-width of the window (the kernel spans 2*windowK+1 characters), positive integer
          chars_vocab_size(int): size of the character vocabulary
          input_embedding_size (int): size of each embedding vector
          output_embedding_size (int): number of output channels (number of kernels), >=1
          padding_idx (int, optional): the index for padding token (0 in our case)"""
          super(CharConvolution, self).__init__()
          self.embedding = nn.Embedding(chars_vocab_size, input_embedding_size, padding_idx=padding_idx)
          M = (2*windowK)+1
          self.convlayer = nn.Conv1d(input_embedding_size, output_embedding_size, M, padding=windowK)

      def forward(self,xinput):
          """Args:
          xinput: a tensor [batch,seq] where seq the sequence length
          Returns:
          A tensor with the max_pooled result along the sequence dimension"""
          #Implement the forward method, taking an input of the form [batch,seq] and return the max pooled result
          x_embed = self.embedding(xinput)  # output shape: [batch, seq, embedding_size]
          # Rearrange dimensions for Conv1d, which expects [batch, embedding_size, seq]
          x_embed = x_embed.permute(0, 2, 1)  # output shape: [batch, embedding_size, seq]
          # Apply convolution layer and max-pooling
          x_conv = self.convlayer(x_embed)  # output shape: [batch, output_embedding_size, seq] (padding=windowK keeps the length)
          x_pooled = nn.MaxPool1d(x_conv.size(2))(x_conv)  # apply max pooling along the sequence dimension
          # Remove the extra dimension added by pooling, resulting in [batch, output_embedding_size]
          return x_pooled.squeeze(-1)


In [None]:
# Example parameters
windowK = 3  # Half-width of the convolutional window (kernel size 2*3+1 = 7)
chars_vocab_size = 100  # Number of unique characters
input_embedding_size = 64  # Size of each embedding
output_embedding_size = 32  # Number of output features
padding_idx = 0  # Index for the padding token

# instance of CharConvolution
char_conv = CharConvolution(windowK, chars_vocab_size, input_embedding_size, output_embedding_size, padding_idx)

# Example input (e.g., [batch_size, seq_length])
example_input = torch.LongTensor([[1, 2, 3, 0, 4], [1, 0, 2, 3, 4]])  # Batch of 2 sequences

# Forward pass through the CharConvolution
output = char_conv(example_input)
print(output.shape)  # Should be [batch_size, output_embedding_size]

torch.Size([2, 32])
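
The output shape above does not depend on the sequence length, because the convolution preserves the length and the pooling then collapses it. As a check, the standard output-length formula for a 1-D convolution (as given in the `nn.Conv1d` documentation) can be evaluated in plain Python; `conv1d_out_len` is a helper name introduced here for illustration.

```python
def conv1d_out_len(seq_len, kernel_size, padding=0, stride=1, dilation=1):
    """Output length of a 1-D convolution (floor formula from the Conv1d docs)."""
    return (seq_len + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

windowK = 3
kernel = 2 * windowK + 1  # M = 7, as in CharConvolution
for seq_len in (1, 5, 12):
    # padding=windowK on both sides keeps the sequence length unchanged
    print(seq_len, conv1d_out_len(seq_len, kernel, padding=windowK))
# 1 1
# 5 5
# 12 12
```

Without padding, the formula gives a non-positive length for a 5-character name and a 7-wide kernel (`conv1d_out_len(5, 7)` is -1), meaning the kernel does not fit. The padding is therefore what lets the module handle names shorter than the window.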


Fourth Exercise : predict the target language (10pts)
-------
In this exercise, we aim to predict the target language from a word's character embedding. You will implement the following for the `LanguageIdentifier` class:
* A forward function: takes a char index tensor as input and returns a vector of predictions (one score per language) for each word
* A train function: trains the model on the full dataset (with early stopping)
* A predict function: takes a test corpus (a list of words) and predicts the language of each word. The function outputs its results in textual form, printing each word on the same line as its predicted class.
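
The early-stopping rule mentioned in the list above can be sketched independently of any model: keep the best validation loss seen so far and stop after `patience` consecutive epochs without improvement. The function name `early_stopping_epoch` and the loss values below are made up for illustration.

```python
def early_stopping_epoch(valid_losses, patience):
    """Return (stop_epoch, best_epoch); stop_epoch is None if training runs to the end.

    Epochs are numbered from 1. An epoch counts as an improvement only if its
    validation loss is strictly lower than the best one seen so far.
    """
    best_loss = float('inf')
    best_epoch = None
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Hypothetical validation losses, one per epoch
losses = [1.00, 0.90, 0.95, 0.96, 0.97]
print(early_stopping_epoch(losses, patience=2))  # (4, 2)
```

In practice the model weights are saved at each improvement, so that the best epoch can be restored after training stops.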

Once implemented you are expected to search for hyperparameters in the main program.






In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset
import warnings

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')



# LanguageIdentifier model
class LanguageIdentifier(nn.Module):
    def __init__(self, datagenerator, window_size, char_embedding_size, word_embedding_size):
        super(LanguageIdentifier, self).__init__()
        invocab_size = len(datagenerator.input_idx2sym)
        outvocab_size = len(datagenerator.output_idx2sym)
        pad_idx = datagenerator.input_sym2idx[datagenerator.pad_token]
        self.charE = CharConvolution(window_size, invocab_size, char_embedding_size, word_embedding_size, padding_idx=pad_idx)
        self.output = nn.Linear(word_embedding_size, outvocab_size)

    def load(self, filename):
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=FutureWarning)
            self.load_state_dict(torch.load(filename))

    def forward(self, xinput):
        word_embeddings = self.charE(xinput)  # [batch_size, word_embedding_size]
        return self.output(word_embeddings)  # [batch_size, outvocab_size]

    def train_model(self, traingenerator, validgenerator, epochs, batch_size, device='cpu', learning_rate=0.001):
        self.to(device)
        optimizer = optim.Adam(self.parameters(), lr=learning_rate)
        scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', patience=2)
        loss_fn = nn.CrossEntropyLoss()
        self.minloss = float('inf')
        patience = 5
        patience_counter = 0

        for epoch in range(1, epochs + 1):
            self.train()
            total_loss = 0
            for seqX, seqY in traingenerator.generate_batches(batch_size):
                X = seqX.to(device)
                Y = torch.tensor(seqY, dtype=torch.long).to(device)

                optimizer.zero_grad()
                Yhat = self.forward(X)
                loss = loss_fn(Yhat, Y)
                loss.backward()
                optimizer.step()
                total_loss += loss.item()

            avg_loss = total_loss / len(traingenerator.data) * batch_size
            valid_loss, valid_acc = self.validate(validgenerator, batch_size, device)
            scheduler.step(valid_loss)
            print(f'Epoch [{epoch}/{epochs}], Loss: {avg_loss:.4f}, Validation Loss: {valid_loss:.4f}, Validation Acc: {valid_acc:.2f}%')

            if valid_loss < self.minloss:
                self.minloss = valid_loss
                patience_counter = 0
                torch.save(self.state_dict(), 'best_model.pth')
            else:
                patience_counter += 1
                if patience_counter >= patience:
                    print("Early stopping triggered.")
                    break

    def predict(self, datagenerator, batch_size, device):
          self.eval()
          predictions = []
          actual_labels = []
          device = torch.device(device)

          for seqX, seqY in datagenerator.generate_batches(batch_size):
              X = seqX.to(device)
              Y = torch.tensor(seqY, dtype=torch.long).to(device)
              with torch.no_grad():
                  logits = self.forward(X)
                  Yhat = torch.argmax(logits, dim=1)
                  predictions.extend(Yhat.cpu().numpy())
                  actual_labels.extend(Y.cpu().numpy())

          # Map indices to class names
          idx2sym = datagenerator.output_idx2sym
          predicted_classes = [idx2sym[idx] for idx in predictions]
          actual_classes = [idx2sym[idx] for idx in actual_labels]

          # Print each word with its predicted and actual class
          for (word, _), predicted_class, actual_class in zip(datagenerator.data, predicted_classes, actual_classes):
              print(f"Name: {word} | Predicted: {predicted_class} | Actual: {actual_class}")

          # Calculate and print overall accuracy
          total = len(actual_labels)
          correct = sum(p == a for p, a in zip(predicted_classes, actual_classes))
          accuracy = 100 * correct / total
          print(f"\nPrediction Accuracy: {accuracy:.2f}%")

          # Per-class accuracy
          from collections import defaultdict
          class_correct = defaultdict(int)
          class_total = defaultdict(int)

          for pred, actual in zip(predicted_classes, actual_classes):
              class_total[actual] += 1
              if pred == actual:
                  class_correct[actual] += 1

          print("\nPer-Class Accuracy:")
          for language in class_total:
              class_accuracy = 100 * class_correct[language] / class_total[language]
              print(f"{language}: {class_accuracy:.2f}%")




    def validate(self, datagenerator, batch_size, device='cpu'):
        self.eval()
        batch_losses, batch_accuracies, batch_sizes = [], [], []

        device = torch.device(device)
        loss_fn = nn.CrossEntropyLoss()

        for seqX, seqY in datagenerator.generate_batches(batch_size):
            X = seqX.to(device)
            Y = torch.tensor(seqY, dtype=torch.long).to(device)

            with torch.no_grad():
                Yhat = self.forward(X)
                loss = loss_fn(Yhat, Y)
                Ypred = torch.argmax(Yhat, dim=1)
                acc = (Ypred == Y).sum().item()

                batch_losses.append(loss.item())
                batch_accuracies.append(acc)
                batch_sizes.append(len(Y))

        valid_loss = sum(batch_losses) / len(batch_losses)
        valid_accuracy = sum(batch_accuracies) / sum(batch_sizes) * 100
        return valid_loss, valid_accuracy
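
One detail of `validate` worth knowing: the validation loss is the plain mean of the per-batch losses, so a smaller final batch weighs as much as a full one. The sketch below contrasts this with a sample-weighted mean; the function names and the numbers are hypothetical.

```python
def mean_of_batch_means(batch_losses):
    # what validate() computes: every batch counts equally
    return sum(batch_losses) / len(batch_losses)

def sample_weighted_mean(batch_losses, batch_sizes):
    # every example counts equally, whatever batch it landed in
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)

# Hypothetical run: two full batches of 256 and a final batch of 8 examples
losses = [1.0, 1.0, 4.0]
sizes = [256, 256, 8]
print(mean_of_batch_means(losses))          # 2.0
print(sample_weighted_mean(losses, sizes))  # about 1.046
```

The two quantities agree when all batches have the same size. The accuracy returned by `validate` is already sample-weighted, since it divides the total number of correct predictions by the total number of examples.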


Main program. You are expected to search for hyperparameters:

In [None]:
train_data_path = 'name2lang.train'
valid_data_path = 'name2lang.valid'

traing = DataGenerator(train_data_path)
validg = DataGenerator(valid_data_path, parentgenerator=traing)

In [None]:
#hyperparameters testing
window_size = 4
char_embedding_size = 64
word_embedding_size = 512

model = LanguageIdentifier(traing, window_size, char_embedding_size, word_embedding_size)
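
The cell above tests a single configuration. A simple way to organise the hyperparameter search is to enumerate a grid of candidate values and train one model per configuration; the sketch below only generates the configurations. The grid values and the helper name `iter_configs` are illustrative assumptions, not settings that were actually evaluated.

```python
from itertools import product

def iter_configs(grid):
    """Yield one dict per combination of the values listed in `grid`."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

# Hypothetical search space
grid = {
    'window_size': [2, 4],
    'char_embedding_size': [32, 64],
    'word_embedding_size': [128, 512],
}

configs = list(iter_configs(grid))
print(len(configs))  # 8
print(configs[0])    # {'window_size': 2, 'char_embedding_size': 32, 'word_embedding_size': 128}
```

Since the keys match the constructor's argument names, each configuration could then be passed as `LanguageIdentifier(traing, **config)`, keeping the one with the lowest validation loss.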

In [None]:
#training param
epochs = 25
batch_size = 256
learning_rate = 0.001
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Train the model
model.train_model(traing, validg, epochs, batch_size, device=device, learning_rate=learning_rate)


Epoch [1/25], Loss: 3.3537, Validation Loss: 2.1296, Validation Acc: 45.96%
Epoch [2/25], Loss: 1.9843, Validation Loss: 1.8318, Validation Acc: 47.24%
Epoch [3/25], Loss: 1.7479, Validation Loss: 1.5089, Validation Acc: 56.26%
Epoch [4/25], Loss: 1.4382, Validation Loss: 1.3985, Validation Acc: 57.96%
Epoch [5/25], Loss: 1.3001, Validation Loss: 1.3371, Validation Acc: 59.35%
Epoch [6/25], Loss: 1.2124, Validation Loss: 1.2717, Validation Acc: 60.59%
Epoch [7/25], Loss: 1.1383, Validation Loss: 1.2096, Validation Acc: 63.21%
Epoch [8/25], Loss: 1.0739, Validation Loss: 1.1476, Validation Acc: 65.12%
Epoch [9/25], Loss: 1.0152, Validation Loss: 1.0922, Validation Acc: 67.34%
Epoch [10/25], Loss: 0.9599, Validation Loss: 1.0428, Validation Acc: 69.29%
Epoch [11/25], Loss: 0.9101, Validation Loss: 1.0014, Validation Acc: 70.53%
Epoch [12/25], Loss: 0.8655, Validation Loss: 0.9651, Validation Acc: 71.92%
Epoch [13/25], Loss: 0.8230, Validation Loss: 0.9304, Validation Acc: 72.85%
Epoch [1

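The model reached a validation accuracy of around 80%, which is a reasonable result for an 18-way classification task.

Early stopping (patience of 5 epochs on the validation loss) guards against overfitting: training halts once the validation loss stops improving, so the model keeps its generalization performance without unnecessary training.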

In [None]:
# Predict on validation data
print("\nPredictions on validation data:")
model.predict(validg, batch_size, device=device)


Predictions on validation data:
Name: Barros | Predicted: Greek | Actual: Portuguese
Name: Campos | Predicted: Greek | Actual: Portuguese
Name: D'cruz | Predicted: Spanish | Actual: Portuguese
Name: Henriques | Predicted: Russian | Actual: Portuguese
Name: Machado | Predicted: Russian | Actual: Portuguese
Name: Silva | Predicted: English | Actual: Portuguese
Name: Torres | Predicted: English | Actual: Portuguese
Name: Ahearn | Predicted: English | Actual: Irish
Name: Aonghus | Predicted: Russian | Actual: Irish
Name: Brady | Predicted: English | Actual: Irish
Name: Cearbhall | Predicted: English | Actual: Irish
Name: Flann | Predicted: English | Actual: Irish
Name: Kavanagh | Predicted: Russian | Actual: Irish
Name: Maguire | Predicted: Russian | Actual: Irish
Name: Mcmahon | Predicted: English | Actual: Irish
Name: Mcneil | Predicted: English | Actual: Irish
Name: Monahan | Predicted: Russian | Actual: Irish
Name: Muirchertach | Predicted: Russian | Actual: Irish
Name: Mullen | Predi

Our model performs very well on Russian, Arabic, Japanese, English and Italian, correctly classifying the vast majority of names. This suggests that it has learned strong features that distinguish these languages from the others. In contrast, it struggles with languages such as French, Spanish, Dutch and Portuguese.

The languages with low accuracy may have fewer samples in the training data, so the model does not learn their features effectively. Some languages also share similar character patterns, which makes them hard to tell apart: Portuguese and Spanish names, for example, may have similar structures.
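
The class-imbalance hypothesis is easy to check by counting training examples per language. Below is a minimal sketch, assuming lines in the `Name, Language` format shown earlier; the sample lines are a handful copied from the data listing, and `label_counts` is a name introduced here.

```python
from collections import Counter

def label_counts(lines):
    """Count examples per language from 'Name, Language' lines."""
    counts = Counter()
    for line in lines:
        parts = line.strip().split(',')
        if len(parts) >= 2:  # skip blank or malformed lines
            counts[parts[-1].strip()] += 1
    return counts

sample = [
    "Barros, Portuguese",
    "Silva, Portuguese",
    "Brady, Irish",
    "Abel, Spanish",
    "Blanco, Spanish",
    "Villa, Spanish",
]
for language, n in label_counts(sample).most_common():
    print(f"{language}: {n}")
# Spanish: 3
# Portuguese: 2
# Irish: 1
```

On the real data, the lines of `name2lang.train` would be passed instead of `sample`, and the counts compared with the per-class accuracies printed by `predict`.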