<a href="https://www.inove.com.ar"><img src="https://github.com/hernancontigiani/ceia_memorias_especializacion/raw/master/Figures/logoFIUBA.jpg" width="500" align="center"></a>


# Natural language processing
## Custom embeddings with Gensim



### Objective
The goal is to use documents (a corpus) to train word embeddings grounded in that context. We use the corpora of different philosophers to evaluate how the interpretation of the same words differs from one author to another.

In [29]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import multiprocessing
from gensim.models import Word2Vec

### Data
The dataset is the collected text of each author, handled separately. Here we chose Nietzsche, Kant, Hume and Aristotle.

In [30]:
# Download the dataset folder
import os
import gdown
if os.access('./archive', os.F_OK) is False:
    url = 'https://drive.google.com/uc?id=1BL83ikPLHdzorgO5u9XY_PgtIk3TquLv&export=download'
    output = 'archive.zip'
    gdown.download(url, output, quiet=False)
    !unzip -q archive.zip
else:
    print("The dataset has already been downloaded")

Downloading...
From: https://drive.google.com/uc?id=1BL83ikPLHdzorgO5u9XY_PgtIk3TquLv&export=download
To: /content/archive.zip
100%|██████████| 17.0M/17.0M [00:00<00:00, 124MB/s] 


replace aristotle.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: N
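
The `replace aristotle.txt?` prompt above appears because `unzip` was asked to extract files that already exist. A minimal sketch, assuming the same `archive.zip`, of extracting with the standard-library `zipfile` module and skipping files already on disk. `extract_missing` is a hypothetical helper, not part of the original notebook.

```python
import os
import zipfile

def extract_missing(zip_path, dest_dir="."):
    """Extract only the archive members that are not already on disk.

    Hypothetical helper: returns the list of member names it extracted.
    """
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            target = os.path.join(dest_dir, name)
            if os.path.exists(target):
                continue  # already present, so nothing needs to be overwritten
            zf.extract(name, dest_dir)
            extracted.append(name)
    return extracted
```

Calling `extract_missing('archive.zip')` a second time would return an empty list instead of asking whether to replace each file.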


In [31]:
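# '/n' is not expected to occur in the text, so every line is loaded as a single row.
# Multi-character separators are only supported by the python engine; selecting it
# explicitly avoids the ParserWarning about the fallback.
df_hume = pd.read_csv('hume.txt', sep='/n', header=None, engine='python')
df_hume.head()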


Unnamed: 0,0
0,THE victory which the earl of Richmond gained ...
1,"entirely decisive; being attended, as well wit..."
2,"dispersion of the royal army, as with the deat..."
3,for this great success suddenly prompted the s...
4,"battle, to bestow on their victorious general ..."


In [32]:
df_kant = pd.read_csv('kant.txt', sep='/n', header=None, engine='python')
df_kant.head()


Unnamed: 0,0
0,We may call the faculty of cognition from prin...
1,"Reason_, and the inquiry into its possibility ..."
2,"Critique of pure Reason, although by this facu..."
3,"Reason in its theoretical employment, as it ap..."
4,in the former work; without wishing to inquire...


In [33]:

df_aristotle = pd.read_csv('aristotle.txt', sep='/n', header=None, engine='python')
df_aristotle.head()

  


Unnamed: 0,0
0,Things are said to be named 'equivocally' when...
1,"common name, the definition corresponding with..."
2,"each. Thus, a real man and a figure in a pictu..."
3,the name 'animal'; yet these are equivocally s...
4,"have a common name, the definition correspondi..."


In [34]:
# Build the dataset using the line break to separate the sentences/docs
df_nietzsche = pd.read_csv('nietzsche.txt', sep='/n', header=None, engine='python')
df_nietzsche.head()


  


Unnamed: 0,0
0,What I am now going to relate is the history o...
1,"I shall describe what will happen, what must n..."
2,_the triumph of Nihilism._ This history can be...
3,necessity itself is at work in bringing it abo...
4,already proclaimed by a hundred different omen...


In [35]:
print("Number of documents, Nietzsche:", df_nietzsche.shape[0])
print("Number of documents, Hume:", df_hume.shape[0])
print("Number of documents, Kant:", df_kant.shape[0])
print("Number of documents, Aristotle:", df_aristotle.shape[0])

Number of documents, Nietzsche: 54697
Number of documents, Hume: 154175
Number of documents, Kant: 48425
Number of documents, Aristotle: 31977
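
The `read_csv` calls above rely on a separator that is not expected to appear in the text, so each line becomes one document. A minimal sketch of a more explicit alternative that reads the lines directly; `read_documents` is a hypothetical helper, and its counts may differ slightly from the ones printed above because `read_csv` also applies CSV quoting rules.

```python
def read_documents(path, encoding="utf-8"):
    """Return the non-empty lines of a text file, one document per line."""
    documents = []
    with open(path, encoding=encoding) as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines, as read_csv does by default
                documents.append(line)
    return documents
```

Wrapping the result as `pd.DataFrame(read_documents('hume.txt'))` would give a one-column frame with column `0`, the same shape the rest of the notebook expects.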


### 1 - Preprocessing

In [36]:
from gensim.models.callbacks import CallbackAny2Vec

# By default gensim does not report the loss of each epoch during training.
# We subclass the callback to print that information.
class callback(CallbackAny2Vec):
    """
    Callback to print the loss after each epoch
    """
    def __init__(self):
        self.epoch = 0
        self.loss_previous_step = 0.0

    def on_epoch_end(self, model):
        # get_latest_training_loss() returns the loss accumulated since training
        # started, so subtract the previous total to get this epoch's loss
        loss = model.get_latest_training_loss()
        print('Loss after epoch {}: {}'.format(self.epoch, loss - self.loss_previous_step))
        self.epoch += 1
        self.loss_previous_step = loss
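
The callback reports per-epoch values by differencing a running total. The same arithmetic, as a standalone sketch with made-up numbers; `per_epoch_losses` is a hypothetical helper used only for illustration.

```python
def per_epoch_losses(cumulative_losses):
    """Convert running loss totals into the loss added by each epoch."""
    losses = []
    previous = 0.0
    for total in cumulative_losses:
        losses.append(total - previous)
        previous = total
    return losses

# Illustrative totals, not taken from any training run
example = per_epoch_losses([10.0, 18.0, 24.0])
```

Here `example` holds the three increments 10.0, 8.0 and 6.0, which is what the callback would print for those totals.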

In [37]:
from keras.preprocessing.text import text_to_word_sequence

def create_model(df):
    # Tokenize each document (one line of text) into a list of lowercase words
    sentence_tokens = []

    for _, row in df.iterrows():
        sentence_tokens.append(text_to_word_sequence(row[0]))

    w2v_model = Word2Vec(min_count=5,    # minimum word frequency to be included in the vocabulary
                     window=2,       # number of words before and after the predicted one
                     size=300,       # vector dimensionality (renamed vector_size in gensim >= 4.0)
                     negative=20,    # number of negative samples; 0 disables negative sampling
                     workers=1,      # increase this value if more cores are available
                     sg=1)           # model 0:CBOW  1:skip-gram

    w2v_model.build_vocab(sentence_tokens)

    print("Number of docs in the corpus:", w2v_model.corpus_count)
    # wv.vocab is the gensim 3.x attribute (wv.key_to_index in gensim >= 4.0)
    print("Number of distinct words in the corpus:", len(w2v_model.wv.vocab))

    w2v_model.train(sentence_tokens,
                 total_examples=w2v_model.corpus_count,
                 epochs=20,
                 compute_loss=True,
                 callbacks=[callback()]
                 )

    return w2v_model
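
Tokenization decides which strings can become vocabulary entries. A rough standard-library approximation of what `text_to_word_sequence` does with its default settings (lowercase, replace a fixed set of punctuation with spaces, split); `simple_tokenize` and the constant names are hypothetical. The default filter set does not include typographic quotes, which would explain tokens such as `god’s` and `‘i` in the results further down.

```python
# Same characters as the Keras default filter set
KERAS_LIKE_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
# Typographic single and double quotes, not filtered by default
CURLY_QUOTES = "\u2018\u2019\u201c\u201d"

def simple_tokenize(text, extra_filters=""):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in KERAS_LIKE_FILTERS + extra_filters})
    return text.lower().translate(table).split()
```

Passing `extra_filters=CURLY_QUOTES` is one possible way to stop quote marks from sticking to words; whether splitting `god’s` into `god` and `s` is desirable depends on the analysis.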

### 2 - Create the vectors (word2vec)
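
With `sg=1` and `window=2`, the model learns from (center word, context word) pairs drawn from up to two positions on each side. A simplified sketch of that pairing; `skipgram_pairs` is a hypothetical helper, and gensim itself additionally shrinks the window at random and may subsample frequent words, which this sketch leaves out.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs using a fixed symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs
```

A small window such as 2 tends to favour words that can replace each other in a sentence, which is worth keeping in mind when reading the neighbour lists later on.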

### 3 - Train the generator model

In [38]:
# Train the vector generator model

w2v_aristotle = create_model(df_aristotle)

Number of docs in the corpus: 31977
Number of distinct words in the corpus: 4649
Loss after epoch 0: 2403093.75
Loss after epoch 1: 1801277.25
Loss after epoch 2: 1653045.0
Loss after epoch 3: 1635215.0
Loss after epoch 4: 1585498.0
Loss after epoch 5: 1534436.0
Loss after epoch 6: 1517969.0
Loss after epoch 7: 1502319.0
Loss after epoch 8: 1489356.0
Loss after epoch 9: 1484057.0
Loss after epoch 10: 1428536.0
Loss after epoch 11: 1411410.0
Loss after epoch 12: 1404964.0
Loss after epoch 13: 1399986.0
Loss after epoch 14: 1396676.0
Loss after epoch 15: 1392566.0
Loss after epoch 16: 1391966.0
Loss after epoch 17: 1391906.0
Loss after epoch 18: 1397022.0
Loss after epoch 19: 1407960.0
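
The vocabulary size printed above counts only the words that pass `min_count=5`, not every distinct string in the text. A minimal sketch of that filtering step, assuming documents that are already tokenized; `build_vocabulary` is a hypothetical helper, not gensim's implementation.

```python
from collections import Counter

def build_vocabulary(tokenized_docs, min_count=5):
    """Return the set of words that occur at least min_count times."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return {word for word, n in counts.items() if n >= min_count}
```

Words below the threshold get no vector at all, so querying them with `most_similar` would fail rather than return weak matches.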


In [39]:
w2v_kanr = create_model(df_kant)

Number of docs in the corpus: 48425
Number of distinct words in the corpus: 4564
Loss after epoch 0: 3205007.75
Loss after epoch 1: 2438415.75
Loss after epoch 2: 2351424.0
Loss after epoch 3: 2252797.5
Loss after epoch 4: 2216432.0
Loss after epoch 5: 2191654.0
Loss after epoch 6: 2177567.0
Loss after epoch 7: 2105522.0
Loss after epoch 8: 2086442.0
Loss after epoch 9: 2075064.0
Loss after epoch 10: 2058578.0
Loss after epoch 11: 2048620.0
Loss after epoch 12: 2039742.0
Loss after epoch 13: 2030902.0
Loss after epoch 14: 2025888.0
Loss after epoch 15: 2145336.0
Loss after epoch 16: 2161804.0
Loss after epoch 17: 2168832.0
Loss after epoch 18: 2190140.0
Loss after epoch 19: 2210232.0


In [40]:
w2v_nietzsche = create_model(df_nietzsche)

Number of docs in the corpus: 54697
Number of distinct words in the corpus: 7777
Loss after epoch 0: 3866808.5
Loss after epoch 1: 2769714.0
Loss after epoch 2: 2691383.5
Loss after epoch 3: 2609977.0
Loss after epoch 4: 2578673.0
Loss after epoch 5: 2549616.0
Loss after epoch 6: 2475424.0
Loss after epoch 7: 2443630.0
Loss after epoch 8: 2416090.0
Loss after epoch 9: 2389938.0
Loss after epoch 10: 2362490.0
Loss after epoch 11: 2335822.0
Loss after epoch 12: 2343646.0
Loss after epoch 13: 2476588.0
Loss after epoch 14: 2450888.0
Loss after epoch 15: 2434240.0
Loss after epoch 16: 2412660.0
Loss after epoch 17: 2405060.0
Loss after epoch 18: 2400692.0
Loss after epoch 19: 2424288.0


In [41]:
w2v_hume = create_model(df_hume)

Number of docs in the corpus: 154175
Number of distinct words in the corpus: 12089
Loss after epoch 0: 9505246.0
Loss after epoch 1: 7684452.0
Loss after epoch 2: 7301788.0
Loss after epoch 3: 7154556.0
Loss after epoch 4: 7244666.0
Loss after epoch 5: 7196700.0
Loss after epoch 6: 7052732.0
Loss after epoch 7: 6938700.0
Loss after epoch 8: 6819652.0
Loss after epoch 9: 1680540.0
Loss after epoch 10: 1494696.0
Loss after epoch 11: 1461624.0
Loss after epoch 12: 1421904.0
Loss after epoch 13: 1387240.0
Loss after epoch 14: 1349552.0
Loss after epoch 15: 1317720.0
Loss after epoch 16: 1280560.0
Loss after epoch 17: 1246576.0
Loss after epoch 18: 1217440.0
Loss after epoch 19: 1200272.0
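
The sharp drop in the reported Hume loss after epoch 8 should be read with caution. One possible explanation, not verified in this notebook, is precision loss in the running total: the per-epoch values up to that point add up to roughly 6.7e7, close to 2^26, and single-precision floats between 2^26 and 2^27 can only represent multiples of 8. The sketch below uses only the standard library to show how small increments vanish when a large total is rounded to float32. It illustrates the arithmetic only and makes no definitive claim about gensim's internals; both function names are hypothetical.

```python
import struct

def to_float32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate_float32(start, increment, steps):
    """Add `increment` to `start` `steps` times, rounding to float32 each time."""
    total = to_float32(start)
    for _ in range(steps):
        total = to_float32(total + increment)
    return total
```

Starting from `2.0 ** 26`, adding `3.0` any number of times leaves the total unchanged, while the same additions starting from `0.0` accumulate normally. If something similar happens during training, the later per-epoch numbers would understate the real loss.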


### Model evaluation

#### Hume

In [42]:
w2v_hume.wv.most_similar(positive=["god"], topn=5)

[('curse', 0.49214187264442444),
 ('god’s', 0.4783207178115845),
 ('savior', 0.4697226881980896),
 ('forbid', 0.4689366817474365),
 ('heaven', 0.4667375385761261)]
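
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that ranking on plain Python lists, with a made-up toy dictionary standing in for a trained model; `cosine_similarity` and this standalone `most_similar` are hypothetical helpers, not gensim code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=5):
    """Rank the other words in `vectors` by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topn]

# Toy 2-dimensional vectors, invented for illustration only
toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
```

With these toy vectors, `most_similar("a", toy_vectors, topn=1)` ranks `"b"` first because its direction is closest to that of `"a"`. Scores around 0.45 to 0.55, as in the outputs here, indicate fairly loose associations.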

In [43]:
w2v_hume.wv.most_similar(positive=["truth"], topn=5)

[('sweetness', 0.45587295293807983),
 ('assenting', 0.4500746726989746),
 ('astronomy', 0.44134700298309326),
 ('withal', 0.42235422134399414),
 ('reader', 0.42134958505630493)]

In [44]:
w2v_hume.wv.most_similar(positive=["love"], topn=5)

[('pity', 0.48173290491104126),
 ('dislike', 0.4709981679916382),
 ('fumes', 0.4703837037086487),
 ('contempt', 0.46717074513435364),
 ('humility', 0.46627748012542725)]

#### Kant

In [45]:
w2v_kanr.wv.most_similar(positive=["god"], topn=5)

[('divine', 0.530991792678833),
 ('immortality', 0.521170973777771),
 ('physician', 0.4907466173171997),
 ('deity', 0.48998719453811646),
 ('personality', 0.48404788970947266)]

In [46]:
w2v_kanr.wv.most_similar(positive=["truth"], topn=5)

[('falsehood', 0.545657217502594),
 ('falsity', 0.5106203556060791),
 ('incompetent', 0.5029391050338745),
 ('impossibility', 0.4763123691082001),
 ('assertion', 0.4720863401889801)]

In [47]:
w2v_kanr.wv.most_similar(positive=["love"], topn=5)

[('denial', 0.6088111400604248),
 ('approve', 0.5773358345031738),
 ('selfishness', 0.5720782279968262),
 ('prayers', 0.5572217702865601),
 ('propensities', 0.5534285306930542)]

#### Aristotle

In [48]:
w2v_aristotle.wv.most_similar(positive=["god"], topn=5)

[('shoe', 0.7328649163246155),
 ('lemnos', 0.7284876704216003),
 ('expenses', 0.728395938873291),
 ('sophist', 0.72465980052948),
 ('lawgiver', 0.7230583429336548)]

In [49]:
w2v_aristotle.wv.most_similar(positive=["truth"], topn=5)

[('gnomae', 0.6589851379394531),
 ('experience', 0.6587479710578918),
 ('mine', 0.6425724029541016),
 ('phronesis', 0.641693651676178),
 ('tragic', 0.6355767250061035)]

In [50]:
w2v_aristotle.wv.most_similar(positive=["love"], topn=5)

[('envy', 0.6429694294929504),
 ('wants', 0.6347129344940186),
 ('utility', 0.6136761903762817),
 ('values', 0.608590841293335),
 ('dishonour', 0.6044912338256836)]

#### Nietzsche

In [51]:
w2v_nietzsche.wv.most_similar(positive=["god"], topn=5)

[('maid', 0.44852370023727417),
 ('delphic', 0.4399888515472412),
 ('punishes', 0.4374995827674866),
 ('beloved', 0.4363768696784973),
 ('blasphemy', 0.4347548484802246)]

In [52]:
w2v_nietzsche.wv.most_similar(positive=["truth"], topn=5)

[('martyrs', 0.5049238204956055),
 ('fidelity', 0.4889427423477173),
 ('‘i', 0.47673070430755615),
 ('justice', 0.47319066524505615),
 ('socialism', 0.47144877910614014)]

In [53]:
w2v_nietzsche.wv.most_similar(positive=["love"], topn=5)

[('fidelity', 0.497587651014328),
 ('tickle', 0.49569636583328247),
 ('hate', 0.4927929639816284),
 ('charity', 0.4907481074333191),
 ('unlearn', 0.4890514612197876)]

### Analysis of results

| Author | God | Truth | Love |
| :--- | :---: | :---: | :---: |
| Aristotle | Shoe, Lemnos, Expenses, Sophist, Lawgiver | Gnomae, Experience, Mine, Phronesis, Tragic | Envy, Wants, Utility, Values, Dishonour |
| Kant | Divine, Immortality, Physician, Deity, Personality | Falsehood, Falsity, Incompetent, Impossibility, Assertion | Denial, Approve, Selfishness, Prayers, Propensities |
| Hume | Curse, God’s, Savior, Forbid, Heaven | Sweetness, Assenting, Astronomy, Withal, Reader | Pity, Dislike, Fumes, Contempt, Humility |
| Nietzsche | Maid, Delphic, Punishes, Beloved, Blasphemy | Martyrs, Fidelity, ‘I, Justice, Socialism | Fidelity, Tickle, Hate, Charity, Unlearn |

I am far from an expert on the subject, and I would like to be one in order to evaluate the models more reliably. Even so, the results seem to reflect the different interpretations these authors give to some of the topics that have most troubled humankind since its beginnings. **God**, for example, is for Kant an expression of divinity and immortality, whereas for Hume the nearest words are those of cursing and forbidding, although there is also some closeness to the word savior. **Love**, on the other hand, appears close to fidelity and charity in Nietzsche's texts, while in Aristotle it seems to be associated with something that is wanted or that has utility.
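
The comparison above is qualitative. One hedged way to put a number on it is the Jaccard overlap between the neighbour lists of two models. The lists below are copied from the `most_similar` outputs for "love"; `jaccard` and `pairwise_overlap` are hypothetical helpers added for illustration.

```python
from itertools import combinations

# Top-5 neighbours of "love", copied from the most_similar outputs
LOVE_NEIGHBORS = {
    "Aristotle": ["envy", "wants", "utility", "values", "dishonour"],
    "Kant": ["denial", "approve", "selfishness", "prayers", "propensities"],
    "Hume": ["pity", "dislike", "fumes", "contempt", "humility"],
    "Nietzsche": ["fidelity", "tickle", "hate", "charity", "unlearn"],
}

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two word lists."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_overlap(neighbors):
    """Jaccard overlap for every pair of authors."""
    return {(x, y): jaccard(neighbors[x], neighbors[y])
            for x, y in combinations(sorted(neighbors), 2)}
```

For these lists every pair of authors has an overlap of 0, a numeric way of saying that the four models place "love" in different neighbourhoods. With only five neighbours per model this is a coarse signal, and a larger `topn` would give a more stable comparison.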