# Spanish Verbs in Embedding Space

#### Purpose
- Practice matrix math in embedding space
- Practice creating a language model
- Practice using pretrained models


#### Resources

<u>Spanish Corpora and Word Lists</u>

| Type       | Source        | Link |
|------------|---------------|------|
| Corpus     | Kaggle        | [120M Words](https://www.kaggle.com/datasets/rtatman/120-million-word-spanish-corpus) |
| Corpus     | HuggingFace   | [LargeSpanishCorpus](https://huggingface.co/datasets/large_spanish_corpus) |
| Corpus     | HuggingFace   | [SpanishBillionWords](https://huggingface.co/datasets/spanish_billion_words) |
| Words      | Github        | [lorenbrichter](https://raw.githubusercontent.com/lorenbrichter/Words/master/Words/es.txt) |
| Verbs      | Github        | [bretttolbert](https://github.com/bretttolbert/verbecc/blob/main/verbecc/data/verbs-es.xml) |
| Verbs      | Github        | [ghidinelli](https://github.com/ghidinelli/fred-jehle-spanish-verbs/blob/master/jehle_verb_database.csv) |



<u>Parts of Speech Checkers</u>
- Hard coded: [SpanishDict](https://api.spanishdict.com/api/v1/wordoftheday/{word})
- ML: some language model

#### Plan
1. Get word counts from Hugging Face.
2. Build a verb list with the hard-coded SpanishDict method.
3. Build a verb list with an ML model.
4. Compare the two lists:
    - FILL
5. Create baseline distance:
    - Min and max distance between verbs
    - Standard deviation of distance between verbs
    - Average distance to the k-th closest verb for k = 1 to 30
        - For every verb, calculate the distance to its 30 closest verbs, then average over all verbs
    - Sanity check: average distance between a few verbs and their clusters of similar verbs
6. Create tuples of RE verbs and their non-RE counterparts.
7. Tests on RE verbs:
    - 2D feature reduction:
        - plot tuples in 2D space
    - Embedding distance between tuples
        - Min, max, mean, std dev
        - plot distribution
        - plot distance vs length of verb
        - plot distance vs. verb usage counts (how should the two counts in a tuple be combined: average, max, or both?)
    - Kth closest verb for each item of each tuple
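
The baseline metrics in step 5 can be sketched with NumPy as follows. This is only a rough illustration under stated assumptions: the plan does not name a distance metric, so Euclidean distance is assumed here (cosine distance would be an equally reasonable choice), and `baseline_stats` and the random toy embeddings are hypothetical stand-ins for the real verb vectors.

```python
import numpy as np


def baseline_stats(emb: np.ndarray, k_max: int = 30) -> dict:
    """Summary statistics of pairwise distances between rows of emb (n x d)."""
    n = emb.shape[0]

    # Pairwise Euclidean distances via broadcasting.
    # This needs O(n^2 * d) memory, so a large vocabulary would need chunking.
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Count each unordered pair once (upper triangle, diagonal excluded).
    pair = dist[np.triu_indices(n, k=1)]

    # Mask self-distances so a verb is never its own nearest neighbour.
    masked = dist.copy()
    np.fill_diagonal(masked, np.inf)
    k = min(k_max, n - 1)
    nearest = np.sort(masked, axis=1)[:, :k]

    return {
        "min": float(pair.min()),
        "max": float(pair.max()),
        "mean": float(pair.mean()),
        "std": float(pair.std()),
        # Entry j is the distance to the (j + 1)-th closest verb, averaged over all verbs.
        "mean_kth": nearest.mean(axis=0),
    }


# Toy data standing in for real verb embeddings (assumption: 100 verbs, 16 dimensions).
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(100, 16))
stats = baseline_stats(toy_embeddings)
print(stats["min"], stats["max"], stats["std"])
print(stats["mean_kth"][:5])
```

The `mean_kth` curve gives a reference for step 7: if the distance between the two verbs of a tuple is well below the average distance to the first few nearest neighbours, the pair is unusually close.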
    



### Import Data

In [None]:
import xml.etree.ElementTree as ET
import requests
import random

def get_verbs_from_bred() -> list:
    """
    Retrieves a list of verbs from bretttolbert's verbecc project.

    Returns:
    - A list of verbs.
    """
    # Fetch the raw XML file (the GitHub "blob" page would return HTML, not XML)
    url = "https://raw.githubusercontent.com/bretttolbert/verbecc/main/verbecc/data/verbs-es.xml"
    response = requests.get(url)
    response.raise_for_status()

    # Parse the XML content
    xml_content = response.content
    root = ET.fromstring(xml_content)

    # Each <v> element is one verb; the verb itself is the text of its <i> child
    verb_elements = root.findall(".//v")
    verbs = [v.find("i").text for v in verb_elements]
    return verbs


In [None]:
import csv
import requests
from io import StringIO

def get_verbs_from_fred() -> dict:
    """
    Retrieves verbs and their definitions from ghidinelli's
    fred-jehle-spanish-verbs project.

    Returns:
    - A dict mapping each verb to its definition.
    """
    verbs = {}
    url = 'https://raw.githubusercontent.com/ghidinelli/fred-jehle-spanish-verbs/master/jehle_verb_database.csv'
    response = requests.get(url)
    response.raise_for_status()
    file = StringIO(response.text)
    reader = csv.reader(file)
    next(reader)  # Skip the header row
    for row in reader:
        verb = row[0]
        definition = row[1]
        # Keep only the first row seen for each verb
        if verb not in verbs:
            verbs[verb] = definition
    return verbs

In [1]:
from collections import defaultdict
from datasets import load_dataset
import pickle


def get_corpus_as_bow() -> list:
    """
    Parses the HuggingFace large_spanish_corpus dataset into a bag of words.
    Streams the dataset and counts words in the same pass to avoid having
    to save the entire dataset to disk.

    Returns:
    - A list of (word, count) tuples sorted by count, descending.
    """

    dataset = load_dataset(
        'large_spanish_corpus', name='EUBookShop', split='train',
        streaming=True, trust_remote_code=True)

    # A defaultdict(int) returns 0 for keys that don't exist yet
    bow = defaultdict(int)

    # Iterate over each example in the streamed dataset
    for example in dataset:
        # Split the text on whitespace into individual words
        words = example['text'].split()
        # Count each word
        for word in words:
            bow[word] += 1

    sorted_bow = sorted(bow.items(), key=lambda x: x[1], reverse=True)

    # Save the sorted list of (word, count) tuples
    with open('sorted_bow.pkl', 'wb') as f:
        pickle.dump(sorted_bow, f)

    return sorted_bow
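
The counting loop above splits on whitespace only, so `Comer`, `comer,` and `comer` are counted as three different words. A minimal sketch of a normalizing alternative follows, assuming that lowercasing and dropping punctuation is acceptable for this analysis. The `count_words` helper and its regular expression are illustrative and not part of the original notebook.

```python
import re
from collections import Counter
from typing import Iterable

# Runs of lowercase letters, including Spanish accented vowels, ü and ñ.
WORD_RE = re.compile(r"[a-záéíóúüñ]+")


def count_words(texts: Iterable[str]) -> Counter:
    """Counts lowercased words across an iterable of text strings."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return counts


sample = ["Comer es vivir.", "¿Quieres comer? Vivir, reír, comer."]
counts = count_words(sample)
top = counts.most_common(2)  # list of (word, count) tuples, highest count first
print(top)
```

`Counter.most_common()` returns the same list-of-tuples shape as `sorted_bow`, so it could replace the explicit `sorted(...)` call if this variant were adopted.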




In [5]:
def load_bow() -> list:
    """
    Loads the sorted bag of words from disk.

    Returns:
    - A list of (word, count) tuples, as saved by get_corpus_as_bow().
    """
    with open('sorted_bow.pkl', 'rb') as f:
        return pickle.load(f)

In [6]:
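from collections import defaultdict
from transformers import pipeline

def verbs_from_words(bow: list, model: str = 'bert-base-multilingual-cased') -> list:
    """
    Uses a HuggingFace token-classification model for
    parts-of-speech tagging to extract a bag of verbs from
    a bag of words.

    Note: the default bert-base-multilingual-cased checkpoint has no
    fine-tuned tagging head, so its labels are not meaningful. Pass a
    checkpoint fine-tuned for POS tagging instead. Matching on a 'VERB'
    label is an assumption about that checkpoint's tag set.

    Returns:
    - A bag of verbs as a list of [word, count] pairs.
    """

    nlp = pipeline('token-classification', model=model)

    bov = []

    # Keep each word at most once, and only if one of its tokens is tagged as a verb
    for word, count in bow:
        tokens = nlp(word)
        if any('VERB' in token['entity'].upper() for token in tokens):
            bov.append([word, count])

    return bov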




In [7]:
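def endswith(bow: list) -> list:
    """
    Keeps only words ending in 'ar', 'er', or 'ir', a rough heuristic
    for Spanish infinitives.

    Returns:
    - A bag of verbs as a list of [word, count] pairs.
    """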
    bov = []
    for word, count in bow:
        if word.endswith(('ar', 'er', 'ir')):
            bov.append([word, count])
    return bov
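
Step 6 of the plan pairs RE verbs with their non-RE counterparts. One possible sketch is shown below, building on a list of infinitives such as the one the suffix filter produces. It rests on a loud assumption: that "RE verb" means an infinitive with the prefix `re-`, and that its counterpart is the same infinitive without that prefix. The `pair_re_verbs` helper is hypothetical.

```python
from typing import Iterable, List, Tuple


def pair_re_verbs(verbs: Iterable[str]) -> List[Tuple[str, str]]:
    """Pairs each verb starting with 're' with its unprefixed form, if present."""
    verb_set = set(verbs)
    pairs = []
    for verb in sorted(verb_set):
        base = verb[2:]
        # Assumption: a purely orthographic match is good enough for a first pass.
        if verb.startswith("re") and base in verb_set:
            pairs.append((verb, base))
    return pairs


verbs = ["hacer", "rehacer", "leer", "releer", "recordar", "comer", "reír"]
pairs = pair_re_verbs(verbs)
print(pairs)
```

Because the match is orthographic, any word list that is not restricted to verified verbs can produce false pairs, so the resulting tuples probably need a manual review before they are used in step 7.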


In [6]:
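def diff_bov_bred(bov: list, bred: list) -> dict:
    """
    Returns the difference between the bag of verbs and the list of verbs
    from bretttolbert's verbecc project.

    Returns:
    - A dict with two entries: 'bov not bret' maps verbs found only in the
      bag of verbs to their counts, and 'bret not bov' lists verbs found
      only in bretttolbert's list.
    """
    # bov is a list of [word, count] pairs; index it by word for set operations
    bov_counts = {word: count for word, count in bov}
    bred_set = set(bred)
    return {
        'bov not bret': {k: bov_counts[k] for k in set(bov_counts) - bred_set},
        'bret not bov': sorted(bred_set - set(bov_counts))
    }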



In [8]:
bow = load_bow()
bov1 = verbs_from_words(bow)
bov2 = endswith(bow)


All PyTorch model weights were used when initializing TFBertForTokenClassification.

Some weights or buffers of the TF 2.0 model TFBertForTokenClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


AttributeError: 'list' object has no attribute 'keys'

In [12]:
x = defaultdict(int)
x['2'] = 10
x['3'] = 11
y = sorted(x.items(), key=lambda x: x[1], reverse=True)
for word in y.keys():
    print(word)


AttributeError: 'list' object has no attribute 'keys'
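
The traceback above comes from calling `.keys()` on the result of `sorted(x.items(), ...)`, which is a list of `(key, value)` tuples rather than a dict. A minimal sketch of two ways around it: unpack the tuples directly, or rebuild a dict from the sorted pairs, which keeps their order in Python 3.7 and later. The variable names here are illustrative.

```python
from collections import defaultdict

counts = defaultdict(int)
counts['2'] = 10
counts['3'] = 11

# sorted() on dict items returns a list of (key, value) tuples
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Option 1: unpack each tuple while iterating
words = [word for word, count in ranked]

# Option 2: rebuild a dict; insertion order follows the sorted list
ranked_dict = dict(ranked)

for word in ranked_dict.keys():
    print(word)
```

The same applies to `sorted_bow`: anything that loads it from `sorted_bow.pkl` receives a list of tuples and should iterate with `for word, count in bow` rather than through dict methods.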