# Natural language processing
## Challenge 2
### Custom embeddings with Gensim



- Create your own vectors with Gensim, based on what was covered in class, using a different dataset.
- Test terms of interest and explain their similarities in the embedding space (draw conclusions about similarities and differences between words).
- Plot them.
- Draw conclusions.

### Data

The data comes from a free e-book obtained from Project Gutenberg: https://www.gutenberg.org/ebooks/58650

Title: Introduction to the study of the history of language

Author: Herbert A. Strong
        Willem Sijbrand Logeman
        Benjamin Ide Wheeler

Release date: January 8, 2019

Language: English

In [72]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import multiprocessing
from gensim.models import Word2Vec

In [73]:
import os
file_path = os.path.join("docs", "Introduction_to_the_study_of_the_history_of_language.txt")

with open(file_path, "r", encoding="utf-8") as file:
    text = file.read()

print(text[:500])


The Project Gutenberg eBook of Introduction to the study of the history of language
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are l


Because the book contains front and back matter that is identical across all Project Gutenberg e-books, this text is removed from both the beginning and the end. Likewise, only the body of the book is used for the analysis, leaving out the index, the preface and the titles.

The following note was also found in the text, so the delimiters around these words were removed, since they could introduce noise for this purpose.

[Transcriber’s Note:

Text delimited by underscores is italic.

Text delimited by equal signs is bold.]

In [74]:
def remove_previous_text(text, intro_marker):
    intro_index = text.find(intro_marker)
    if intro_index !=-1:
        return text[intro_index + len(intro_marker):].strip()
    return text

In [75]:
def remove_underscores(text):
    # remove italic delimitation
    pattern = r'_([^_]+)_'
    cleaned_text = re.sub(pattern, r'\1', text)
    return cleaned_text

In [76]:
def remove_equal_signs(text):
    # remove bold delimitation
    pattern = r'=([^=]+)='
    cleaned_text = re.sub(pattern, r'\1', text)
    return cleaned_text
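
As a quick check of the two patterns above, the following sketch applies them to a made-up sentence. The sample string and the helper name `strip_delimiters` are illustrative assumptions and are not part of the notebook.

```python
import re

def strip_delimiters(text):
    """Remove _italic_ and =bold= markers, keeping the enclosed text."""
    text = re.sub(r'_([^_]+)_', r'\1', text)  # italic delimitation
    text = re.sub(r'=([^=]+)=', r'\1', text)  # bold delimitation
    return text

# Made-up sample, not taken from the book
sample = "The word _dialect_ is defined in the =Glossary=."
print(strip_delimiters(sample))
```

A single unpaired underscore or equal sign is left untouched, because both patterns require an opening and a closing delimiter.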

In [77]:
def remove_text_below(text, end_marker):
    end_marker_pos = text.find(end_marker)
    if end_marker_pos != -1:
        return text[:end_marker_pos]
    return text
            

In [78]:
def remove_chapter_text(text):
    lines = text.split('\n')
    clean_lines = [line for line in lines if not re.match(r'^CHAPTER [IVXLCDM]+\.$', line.strip())]
    clean_text = '\n'.join(clean_lines)
    return clean_text


In [79]:
def remove_titles(text, titles):
    lines = text.split('\n')
    clean_lines = [line for line in lines if not any(title in line for title in titles)]
    clean_text = '\n'.join(clean_lines)
    return clean_text
    

In [80]:
def remove_empty_lines(text):
    lines = text.split('\n')
    clean_lines = [line for line in lines if line.strip()]
    clean_text = '\n'.join(clean_lines)
    return clean_text

In [81]:
import re
with open(file_path, "r", encoding="utf-8") as file:
    text = file.read()

intro_marker = "*** START OF THE PROJECT GUTENBERG EBOOK INTRODUCTION TO THE STUDY OF THE HISTORY OF LANGUAGE ***"
end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK INTRODUCTION TO THE STUDY OF THE HISTORY OF LANGUAGE ***"
index_marker = "INDEX"
previous_chap_marker = "” 176, line 7, for ‘ðoances’ read ‘ðances.’"
titles = [
    "ON THE DEVELOPMENT OF LANGUAGE", "ON THE DIFFERENTIATION OF LANGUAGE", "ON SOUND-CHANGE", "CHANGE IN WORD-SIGNIFICATION", 
    "ANALOGY", "THE FUNDAMENTAL FACTS OF SYNTAX", "CHANGE OF MEANING IN SYNTAX", "CONTAMINATION", "ORIGINAL CREATION", 
    "ON ISOLATION AND THE REACTION AGAINST IT", "THE FORMATION OF NEW GROUPS", "ON THE INFLUENCE OF CHANGE IN FUNCTION ON ANALOGICAL FORMATION",
    "DISPLACEMENT IN ETYMOLOGICAL GROUPING", "ON THE DIFFERENTIATION OF MEANING", "CATEGORIES: PSYCHOLOGICAL AND GRAMMATICAL", "GENDER", "NUMBER",
    "TENSE", "VOICE", "DISPLACEMENT OF THE SYNTACTICAL DISTRIBUTION", "ON CONCORD", "ECONOMY OF EXPRESSION", "RISE OF WORD-FORMATION AND INFLECTION",
    "THE DIVISION OF THE PARTS OF SPEECH", "LANGUAGE AND WRITING", "ON MIXTURE IN LANGUAGE", "THE STANDARD LANGUAGE" 
]

cleaned_text = remove_previous_text(text, intro_marker)
cleaned_text = remove_underscores(cleaned_text)
cleaned_text = remove_equal_signs(cleaned_text)
cleaned_text = remove_text_below(cleaned_text, end_marker)
cleaned_text = remove_text_below(cleaned_text, index_marker)
cleaned_text = remove_previous_text(cleaned_text, previous_chap_marker)
cleaned_text = remove_chapter_text(cleaned_text)
cleaned_text = remove_titles(cleaned_text, titles)
cleaned_text = remove_empty_lines(cleaned_text)

output_file_path = os.path.join("docs", "cleaned_text.txt") 

with open(output_file_path, "w", encoding="utf-8") as output_file:
    output_file.write(cleaned_text)

print("Cleaned text saved to:", output_file_path)


Cleaned text saved to: docs/cleaned_text.txt


In [82]:
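# sep='/n' is a separator that is not expected to occur in the text,
# so each line of the file is read as a single-column row
df = pd.read_csv(output_file_path, sep='/n', header=None, engine='python')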
df.head()

   0
0  It is the province of the Science of Language ...
1  possible, the processes of the development of ...
2  to its latest stage. The observations made on ...
3  naturally be registered in different historica...
4  definite languages; these grammars would follo...


In [83]:
print("Number of documents:", df.shape[0])

Number of documents: 10588



### Preprocessing

In [84]:
from keras.preprocessing.text import text_to_word_sequence

# Tokenize each line; by default the text is lowercased and ASCII punctuation is removed
sentence_tokens = []
for _, row in df.iterrows():
    sentence_tokens.append(text_to_word_sequence(row[0]))

In [63]:
sentence_tokens[:2]

[['it',
  'is',
  'the',
  'province',
  'of',
  'the',
  'science',
  'of',
  'language',
  'to',
  'explain',
  'as',
  'far',
  'as'],
 ['possible',
  'the',
  'processes',
  'of',
  'the',
  'development',
  'of',
  'language',
  'from',
  'its',
  'earliest']]
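
The default filters of `text_to_word_sequence` cover ASCII punctuation only, so typographic quotes stay attached to words, which is why tokens such as `‘to` and `”` show up in the similarity results further down. The following is a minimal pure-Python sketch of an alternative tokenizer that also strips curly quotes while keeping word-internal apostrophes such as the one in `speaker’s`. The function name `tokenize` and the separator set are assumptions for illustration, and this is not the tokenizer used in the notebook.

```python
import string

# ASCII punctuation (except the apostrophe) plus curly opening/closing quotes.
# The apostrophe-like characters are handled separately so "speaker’s" survives.
_SEPARATORS = string.punctuation.replace("'", "") + "‘“”"
_TABLE = str.maketrans({ch: " " for ch in _SEPARATORS})

def tokenize(line):
    """Lowercase a line and split it into word tokens."""
    tokens = line.lower().translate(_TABLE).split()
    # Strip apostrophes at the edges of a token, keep the ones inside words
    tokens = [tok.strip("’'") for tok in tokens]
    return [tok for tok in tokens if tok]

print(tokenize("‘To be,’ said the speaker’s friend.”"))
```

Retraining with tokens cleaned this way would change the vocabulary, so the counts and similarity scores reported in this notebook would no longer apply.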

### Creating the vectors (word2vec)

In [86]:
from gensim.models.callbacks import CallbackAny2Vec
class callback(CallbackAny2Vec):
    """
    Callback to print loss after each epoch
    """
    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        # get_latest_training_loss() returns the loss accumulated so far,
        # so the per-epoch loss is the difference with the previous value
        loss = model.get_latest_training_loss()
        if self.epoch == 0:
            print('Loss after epoch {}: {}'.format(self.epoch, loss))
        else:
            print('Loss after epoch {}: {}'.format(self.epoch, loss - self.loss_previous_step))
        self.epoch += 1
        self.loss_previous_step = loss
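
The subtraction in the callback can be sketched without Gensim. Assuming the reported loss is a running total over epochs, the per-epoch loss is the difference between consecutive totals. The helper name `per_epoch_losses` and the numbers below are made up for illustration.

```python
def per_epoch_losses(cumulative_losses):
    """Turn running loss totals into per-epoch losses."""
    losses = []
    previous = 0.0
    for total in cumulative_losses:
        losses.append(total - previous)
        previous = total
    return losses

# Illustrative running totals (made-up values, not from the training run)
print(per_epoch_losses([900.0, 1500.0, 2050.0]))
```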

In [102]:
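# skip-gram model
w2v_model = Word2Vec(min_count=5,      # ignore words that appear fewer than 5 times
                     window=2,         # maximum distance between target and context word
                     vector_size=300,  # dimensionality of the embeddings
                     negative=20,      # number of negative samples per positive example
                     workers=1,        # number of worker threads
                     sg=1)             # 1 = skip-gram, 0 = CBOW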

In [103]:
w2v_model.build_vocab(sentence_tokens)

In [104]:
print("Number of docs in the corpus:", w2v_model.corpus_count)

Number of docs in the corpus: 10588


In [105]:
print("Number of distinct words in the corpus:", len(w2v_model.wv.index_to_key))

Number of distinct words in the corpus: 2209
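
The vocabulary size depends directly on `min_count`, since words below that frequency are discarded. The following sketch reproduces that filtering with a plain counter on a toy corpus. The helper name `vocab_size` and the toy sentences are assumptions for illustration.

```python
from collections import Counter

def vocab_size(sentences, min_count):
    """Count distinct tokens whose total frequency is at least min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return sum(1 for freq in counts.values() if freq >= min_count)

# Toy corpus, purely illustrative
toy = [["the", "science", "of", "language"],
       ["the", "development", "of", "language"],
       ["the", "language"]]

for threshold in (1, 2, 3):
    print(threshold, vocab_size(toy, threshold))
```

Applied to `sentence_tokens` with different thresholds, the same function would show how much of the vocabulary is dropped by `min_count=5`.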


### Training the embeddings

In [106]:
w2v_model.train(sentence_tokens,
                 total_examples=w2v_model.corpus_count,
                 epochs=20,
                 compute_loss = True,
                 callbacks=[callback()]
                 )

Loss after epoch 0: 898804.25
Loss after epoch 1: 633934.125
Loss after epoch 2: 604372.125
Loss after epoch 3: 528593.25
Loss after epoch 4: 523853.0
Loss after epoch 5: 521980.0
Loss after epoch 6: 513216.75
Loss after epoch 7: 482796.5
Loss after epoch 8: 480711.0
Loss after epoch 9: 474506.5
Loss after epoch 10: 471428.5
Loss after epoch 11: 468894.0
Loss after epoch 12: 463245.0
Loss after epoch 13: 461778.5
Loss after epoch 14: 459479.5
Loss after epoch 15: 455387.0
Loss after epoch 16: 438607.0
Loss after epoch 17: 438475.0
Loss after epoch 18: 436146.0
Loss after epoch 19: 434495.0


(1442155, 2384600)
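
The tuple returned by `train` can be read with a little arithmetic, assuming it follows Gensim's convention of (effective word count, raw word count), where the effective count leaves out words outside the vocabulary and words dropped by frequent-word subsampling. The variable names below are illustrative.

```python
# Values copied from the training output above
effective_words, raw_words = 1442155, 2384600
epochs = 20

raw_per_epoch = raw_words // epochs          # raw tokens seen in one pass
kept_fraction = effective_words / raw_words  # share of tokens actually trained on

print("Raw tokens per epoch:", raw_per_epoch)
print("Fraction of tokens used for training: {:.3f}".format(kept_fraction))
```

Under that reading, roughly 60% of the raw tokens contributed to training.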

### Testing words of interest

In [109]:
word_list = ["language", "linguistic", "phenomena", "dialect", "imperfect", "word", "sound", "ideas", "mind", "syntax", "similar", "meaning", "teutonic", "vowels", "influence"]

In [110]:
for word in word_list:
    similar_words = w2v_model.wv.most_similar(positive=[word], topn=10)
    print(f"Most similar words for '{word}':")
    for similar_word, similarity_score in similar_words:
        print(f"- {similar_word}: {similarity_score}")
    print()

Most similar words for 'language':
- dialect: 0.640992283821106
- orthography: 0.6233234405517578
- alphabet: 0.6084741950035095
- standard: 0.6042174100875854
- type: 0.5730559229850769
- pre: 0.5708451271057129
- harmony: 0.5628711581230164
- usage: 0.5550401210784912
- adoption: 0.5549219846725464
- idiom: 0.5501529574394226

Most similar words for 'linguistic':
- community: 0.7204412221908569
- human: 0.7185691595077515
- speaker’s: 0.7130988240242004
- proportion: 0.6935109496116638
- etymological: 0.6874521970748901
- factors: 0.6842604279518127
- area: 0.6764870285987854
- affects: 0.6742784380912781
- system: 0.673038125038147
- wider: 0.6715328693389893

Most similar words for 'phenomena':
- society: 0.876911461353302
- levelling: 0.8664689660072327
- code: 0.8662737011909485
- interest: 0.8651589751243591
- pitch: 0.8484036922454834
- quantity: 0.8406350016593933
- roots: 0.8397217392921448
- phonetics: 0.8393345475196838
- tendencies: 0.8360954523086548
- plant: 0.8360311388

In [111]:
for word in word_list:
    similar_words = w2v_model.wv.most_similar(negative=[word], topn=10)
    print(f"Less similar words for '{word}':")
    for similar_word, similarity_score in similar_words:
        print(f"- {similar_word}: {similarity_score}")
    print()

Less similar words for 'language':
- 3: -0.04215028136968613
- b: -0.046031054109334946
- cf: -0.05209769681096077
- ‘to: -0.06652882695198059
- ”: -0.06801827996969223
- with: -0.08108284324407578
- john: -0.09184236824512482
- c: -0.09245195239782333
- 5: -0.0965479239821434
- e: -0.10087797045707703

Less similar words for 'linguistic':
- say: -0.014466356486082077
- je: -0.037316761910915375
- know: -0.0384289026260376
- esse: -0.038538914173841476
- ‘i: -0.039839088916778564
- ‘it: -0.04713747277855873
- o: -0.05176649987697601
- because: -0.05385943129658699
- you: -0.058498725295066833
- feminine: -0.06258956342935562

Less similar words for 'phenomena':
- has: -0.2165675163269043
- been: -0.23093591630458832
- was: -0.2571670413017273
- an: -0.2726992070674896
- be: -0.2746957838535309
- its: -0.2829643785953522
- commonly: -0.28807151317596436
- once: -0.2904875576496124
- very: -0.29094964265823364
- he: -0.2923482060432434

Less similar words for 'dialect':
- cf: -0.10492998

Since the book is about linguistics, the words selected are relevant to that topic. For example, "language" is similar to "dialect", "orthography", "alphabet", "standard", "type", "pre", "harmony", "usage", "adoption" and "idiom". Most of these make sense; "pre" is less obvious, although it could be related in this context (it is probably a fragment of hyphenated words, since the tokenizer splits on hyphens). Interestingly, the most similar words for "dialect" do not include "language", even though "dialect" is the top match for "language". Instead, "dialect" is similar to other words that are better interpreted in the context of the book. Another interesting word is "teutonic", whose most similar words are "romance", "slavonic", "scandinavian", "declensions", "norman", "european", "weak", "ist", "indo" and "modern". The first ones are related because in this context they can denote linguistic origins. As for "declensions", they are an important part of a language, so the relation is expected.
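
The observation about "language" and "dialect" is not a contradiction: cosine similarity is symmetric, but nearest-neighbour lists are not. A word can be the closest neighbour of another while having even closer neighbours of its own. The following sketch shows this with made-up 2D vectors; the labels `w1`, `w2`, `w3` and the helper names are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two 2D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest(word, vectors):
    """Return the other word with the highest cosine similarity."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))

# Made-up vectors, purely illustrative
toy = {"w1": (1.0, 0.0), "w2": (0.8, 0.6), "w3": (0.6, 0.8)}

print(nearest("w1", toy))  # closest word to w1
print(nearest("w2", toy))  # closest word to w2
```

Here `w2` is the nearest neighbour of `w1`, but the nearest neighbour of `w2` is `w3`.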
Most of the words with low similarity to the selected words are adverbs, articles, very general words, or single letters.
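
The negative scores in this output deserve a note. Assuming `most_similar` with only `negative=[word]` ranks the vocabulary by cosine similarity to the negated vector of `word`, each reported score is minus the usual cosine similarity. Under that reading, a score of about -0.04 for "language" corresponds to a cosine similarity of about +0.04, meaning the word is nearly orthogonal rather than opposite. The sketch below reproduces the idea on made-up vectors; the helper names and labels are illustrative, and this is not Gensim's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two 2D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def rank_by_negated(word, vectors):
    """Rank the other words by cosine similarity to the negated vector of word."""
    negated = tuple(-x for x in vectors[word])
    scores = {w: cosine(negated, v) for w, v in vectors.items() if w != word}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up vectors, purely illustrative
toy = {"w1": (1.0, 0.0), "w2": (0.8, 0.6), "w3": (0.6, 0.8)}

for other, score in rank_by_negated("w1", toy):
    print(other, round(score, 3), "cosine with w1:", round(cosine(toy["w1"], toy[other]), 3))
```

The top-ranked word is the one with the lowest ordinary cosine similarity to `w1`, and its score is that similarity with the sign flipped.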

### Plots of vector clusters

In [116]:
from sklearn.decomposition import IncrementalPCA
from sklearn.manifold import TSNE
import numpy as np

def reduce_dimensions(model, num_dimensions=2):
    """Project the word vectors to num_dimensions with t-SNE."""
    vectors = np.asarray(model.wv.vectors)
    labels = np.asarray(model.wv.index_to_key)

    # Fixed random_state so the layout is reproducible
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    return vectors, labels

In [114]:
# Plot the embeddings in 2D
import plotly.graph_objects as go
import plotly.express as px

vecs, labels = reduce_dimensions(w2v_model)

MAX_WORDS=200
fig = px.scatter(x=vecs[:MAX_WORDS,0], y=vecs[:MAX_WORDS,1], text=labels[:MAX_WORDS])
fig.show(renderer="colab") 

In [115]:
# Plot the embeddings in 3D
vecs, labels = reduce_dimensions(w2v_model,3)

fig = px.scatter_3d(x=vecs[:MAX_WORDS,0], y=vecs[:MAX_WORDS,1], z=vecs[:MAX_WORDS,2],text=labels[:MAX_WORDS])
fig.update_traces(marker_size = 2)
fig.show(renderer="colab")

The plots show some interesting patterns. For example, "accusative" and "genitive" appear close to each other, which makes sense because both are grammatical cases present in several languages. Adverbs of time such as "sometimes" and "often" are also close together. Words such as "substantive", "subject", "noun", "object" and "predicate" appear near each other as well; they are essential parts of a language, so this grouping makes sense in this context.
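
The plots above show only the first `MAX_WORDS` rows, which, assuming Gensim's default ordering of `index_to_key` by descending frequency, are the most frequent words. To inspect specific terms instead, the rows for chosen words can be looked up by label. The following is a minimal sketch; the helper name `rows_for_words` and the toy labels are assumptions for illustration.

```python
def rows_for_words(labels, words):
    """Return (row indices, found words) for the words present in labels."""
    position = {label: i for i, label in enumerate(labels)}
    found = [w for w in words if w in position]
    return [position[w] for w in found], found

# Toy labels standing in for model.wv.index_to_key
toy_labels = ["the", "of", "language", "dialect", "sound"]
idx, found = rows_for_words(toy_labels, ["language", "sound", "teutonic"])
print(idx, found)
```

With the arrays returned by `reduce_dimensions`, the selection could then be plotted with `px.scatter(x=vecs[idx, 0], y=vecs[idx, 1], text=found)`; words missing from the vocabulary are skipped.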