<a href="https://www.inove.com.ar"><img src="https://github.com/hernancontigiani/ceia_memorias_especializacion/raw/master/Figures/logoFIUBA.jpg" width="500" align="center"></a>


# Natural language processing
## Conversational bot


In [1]:
import json
import string
import random
import re
import urllib.request

import numpy as np
import tensorflow as tf 
from tensorflow.keras import Sequential 
from tensorflow.keras.layers import Dense, Dropout

# To read and parse the HTML text from Wikipedia
import bs4 as bs

import nltk
# Download the tokenizer models and the WordNet dictionary
nltk.download("punkt")
nltk.download("wordnet")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

### 1 - Data
As was done in class, the dataset was obtained from Wikipedia, this time from the article on Isaac Asimov's Foundation trilogy.

In [2]:
raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Foundation_series')
raw_html = raw_html.read()

article_html = bs.BeautifulSoup(raw_html, 'lxml')

article_paragraphs = article_html.find_all('p')

article_text = ''

for para in article_paragraphs:
    article_text += para.text

article_text = article_text.lower()
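
To try the paragraph-extraction step without network access, the following sketch applies the same idea to an inline HTML string. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup's `find_all('p')`; the `ParagraphText` class and the sample HTML are hypothetical and only illustrate the logic.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text inside nested inline tags (e.g. <b>) is still inside the paragraph
        if self.in_paragraph:
            self.chunks.append(data)


# Hypothetical sample page, not taken from Wikipedia
sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<div>skip me</div>"
    "<p>Second one.</p>"
    "</body></html>"
)

parser = ParagraphText()
parser.feed(sample_html)

# Same post-processing as the notebook: concatenate, then lowercase
sample_text = "".join(parser.chunks).lower()
print(sample_text)
```

In this sample the two paragraphs end up joined with no separator between them. Whether that happens with a real page depends on whether each paragraph's text already ends in whitespace, so it is worth checking the extracted text before tokenizing.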

In [3]:
print("Number of characters in the article:", len(article_text))

Number of characters in the article: 44037


### 2 - Preprocessing
- Remove special characters (bracketed numeric citation markers such as `[12]`)
- Collapse repeated spaces and line breaks into a single space

In [4]:
text = re.sub(r'\[[0-9]*\]', ' ', article_text)
text = re.sub(r'\s+', ' ', text)
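
As a quick check of what the two substitutions above do, here is a minimal sketch that applies the same patterns to a short made-up string (the sample text is an assumption, not part of the article):

```python
import re

# Hypothetical sample with citation markers and irregular whitespace
sample = "psychohistory[12] is a   fictional\nscience.[3]\n\nit appears in  foundation."

# Replace bracketed numeric citation markers such as [12] with a space
no_citations = re.sub(r'\[[0-9]*\]', ' ', sample)

# Collapse every run of whitespace (spaces, tabs, newlines) into one space
cleaned = re.sub(r'\s+', ' ', no_citations)

print(cleaned)
```

The order matters: removing the markers first can leave double spaces behind, and the second substitution then collapses them.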

In [5]:
# Inspect the preprocessed text
text

'the foundation series is a science fiction book series written by american author isaac asimov. first published as a series of short stories in 1942–50, and subsequently in three collections in 1951–53, for thirty years the series was a trilogy: foundation, foundation and empire, and second foundation. it won the one-time hugo award for "best all-time series" in 1966. asimov began adding new volumes in 1981, with two sequels: foundation\'s edge and foundation and earth, and two prequels: prelude to foundation and forward the foundation. the additions made reference to events in asimov\'s robot and empire series, indicating that they also were set in the same fictional universe. the premise of the stories is that, in the waning days of a future galactic empire, the mathematician hari seldon spends his life developing a theory of psychohistory, a new and effective mathematical sociology. using statistical laws of mass action, it can predict the future of large populations. seldon forese

In [6]:
print("Number of characters in the text:", len(text))

Number of characters in the text: 43815


### 3 - Split the text into sentences and words

In [7]:
corpus = nltk.sent_tokenize(text)
words = nltk.word_tokenize(text)

In [8]:
# Inspect the first documents in the corpus
corpus[:10]

['the foundation series is a science fiction book series written by american author isaac asimov.',
 'first published as a series of short stories in 1942–50, and subsequently in three collections in 1951–53, for thirty years the series was a trilogy: foundation, foundation and empire, and second foundation.',
 'it won the one-time hugo award for "best all-time series" in 1966. asimov began adding new volumes in 1981, with two sequels: foundation\'s edge and foundation and earth, and two prequels: prelude to foundation and forward the foundation.',
 "the additions made reference to events in asimov's robot and empire series, indicating that they also were set in the same fictional universe.",
 'the premise of the stories is that, in the waning days of a future galactic empire, the mathematician hari seldon spends his life developing a theory of psychohistory, a new and effective mathematical sociology.',
 'using statistical laws of mass action, it can predict the future of large popula

In [9]:
# Inspect the first words of the vocabulary
words[:20]

['the',
 'foundation',
 'series',
 'is',
 'a',
 'science',
 'fiction',
 'book',
 'series',
 'written',
 'by',
 'american',
 'author',
 'isaac',
 'asimov',
 '.',
 'first',
 'published',
 'as',
 'a']
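
One entry in the corpus output above still contains two sentences ("... in 1966. asimov began ..."). A possible explanation, stated here as an assumption and not verified, is that the text was lowercased before `nltk.sent_tokenize`, which removes capitalization cues the tokenizer can use. The sketch below is a naive regex splitter that ignores case; both helper names are hypothetical, and this is not a replacement for NLTK.

```python
import re


def naive_sentence_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows (hypothetical helper)."""
    parts = re.split(r'(?<=[.!?])\s+', text)
    return [part for part in parts if part]


def naive_word_split(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


sample = "it won the hugo award in 1966. asimov began adding new volumes in 1981."
sentences = naive_sentence_split(sample)
tokens = naive_word_split(sentences[0])

print(sentences)
print(tokens)
```

The trade-off is that a rule this simple also splits after abbreviations (for example "dr. seldon"), which is the kind of case a trained sentence tokenizer is meant to handle.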

### 4 - Helper functions to clean and process the user's input
- Lemmatize the tokens of the sentence
- Remove punctuation symbols

In [10]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def perform_lemmatization(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

punctuation_removal = dict((ord(punctuation), None) for punctuation in string.punctuation)

def get_processed_text(document):
    # 1 - convert the text to lowercase
    # 2 - remove the punctuation symbols
    # 3 - tokenize
    # 4 - lemmatize
    return perform_lemmatization(nltk.word_tokenize(document.lower().translate(punctuation_removal)))
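
The punctuation step can be examined on its own. The sketch below rebuilds the same translation table and, as a simplification, replaces `nltk.word_tokenize` with a whitespace split and skips lemmatization; `strip_and_split` is a hypothetical name used only for this illustration.

```python
import string

# Translation table mapping each ASCII punctuation code point to None (deleted)
punctuation_removal = {ord(ch): None for ch in string.punctuation}


def strip_and_split(document):
    """Lowercase, delete ASCII punctuation, then split on whitespace.

    Whitespace splitting stands in for nltk.word_tokenize; no lemmatization.
    """
    return document.lower().translate(punctuation_removal).split()


tokens = strip_and_split("Foundation's Edge, the 1982 sequel!")
print(tokens)

# string.punctuation is ASCII only, so the en dash in "1942–50" is kept
dash_tokens = strip_and_split("1942–50")
print(dash_tokens)
```

Two side effects are visible here: deleting the apostrophe merges "foundation's" into "foundations", and non-ASCII punctuation such as the en dashes in the article text passes through untouched.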

### 5 - Use TF-IDF vectors and cosine similarity built from the Wikipedia corpus

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def generate_response(user_input, corpus):
    response = ''
    # Append the user's question to the corpus so that it is vectorized
    # together with the other documents/sentences
    corpus.append(user_input)

    # Create a TF-IDF vectorizer that removes English stop words and uses
    # our "get_processed_text" function to obtain the lemmatized tokens
    word_vectorizer = TfidfVectorizer(tokenizer=get_processed_text, stop_words='english')

    # Build the vectors from the corpus
    all_word_vectors = word_vectorizer.fit_transform(corpus)

    # Compute the cosine similarity between the user's input (the last vector, "-1")
    # and every document in the corpus, including the input itself
    # NOTE: this similarity matrix is covered in more detail with word embeddings
    similar_vector_values = cosine_similarity(all_word_vectors[-1], all_word_vectors)

    # Get the index of the vector closest to our sentence
    # --> "-2" skips the top score, which is normally the input matched against itself
    similar_sentence_number = similar_vector_values.argsort()[0][-2]
    matched_vector = similar_vector_values.flatten()
    matched_vector.sort()
    vector_matched = matched_vector[-2]

    if vector_matched == 0:
        response = "I am sorry, I could not understand you"
    else:
        response = corpus[similar_sentence_number]

    # Remove the user's input, which was appended as the last element.
    # pop() is used instead of remove(), which would delete the first equal
    # sentence and could drop an original corpus entry identical to the input
    corpus.pop()
    return response
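
To make the matching step more concrete, here is a minimal pure-Python sketch of TF-IDF weighting and cosine similarity on a toy corpus. It uses the smoothed IDF formula and L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults, but it simplifies everything else: tokens come from a plain whitespace split, and there is no stop-word removal or lemmatization. The function names and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, vectors): one L2-normalized TF-IDF vector per document."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocabulary}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = [counts[tok] * idf[tok] for tok in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, vectors


def cosine(u, v):
    """Cosine similarity of unit-length vectors, which reduces to a dot product."""
    return sum(a * b for a, b in zip(u, v))


def best_match(query, sentences):
    """Return the sentence most similar to the query, or None if nothing overlaps."""
    docs = [s.split() for s in sentences] + [query.split()]
    _, vectors = tfidf_vectors(docs)
    query_vec = vectors[-1]
    # Score the query only against the sentences, never against itself
    scores = [cosine(query_vec, vec) for vec in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return sentences[best] if scores[best] > 0 else None


toy_corpus = [
    "seldon develops psychohistory",
    "trantor is the imperial capital",
    "the mule conquers the foundation",
]

print(best_match("where is trantor", toy_corpus))
print(best_match("robots", toy_corpus))
```

Unlike `generate_response`, this sketch leaves the sentence list unmodified and excludes the query from the candidates, so it takes the top score directly instead of the second-highest one. A query with no overlapping terms gets a score of zero against every sentence, which corresponds to the fallback message in the notebook.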

### 6 - Test the system
The system tries to find the sentence of the article that is most closely related to our input text. Suggested inputs to try:
- Author
- Character
- Books
- Hari Seldon
- Trantor
- Galaxy


In [17]:
generate_response("author", corpus)

  'stop_words.' % sorted(inconsistent))


In [20]:
generate_response("character", corpus)

'the end of eternity is vaguely referenced in foundation\'s edge, where a character mentions the eternals, whose "task it was to choose a reality that would be most suitable to humanity".'

In [21]:
generate_response("books", corpus)

'the books also wrestle with the idea of individualism.'

In [22]:
generate_response("hari seldon", corpus)

'asimov estimates that his foundation series takes place nearly 50,000 years into the future, with hari seldon born in 47,000 ce.'

In [23]:
generate_response("trantor", corpus)

'the second foundation itself, however, is finally revealed to be located on the former imperial homeworld of trantor.'

In [24]:
generate_response("the mule", corpus)

'according to lead singer ian gillan, the hard rock band deep purple\'s song the mule is based on the foundation character: "yes, the mule was inspired by asimov.'

In [25]:
generate_response("galaxy", corpus)

'after many attempts to infer the second foundation\'s whereabouts from the few clues available, the foundation is led to believe the second foundation is located on terminus (the "opposite end of the galaxy" for a galaxy with a circular shape).'