<a href="https://colab.research.google.com/github/papillonbee/Papillonbee/blob/master/Papillonbee.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Papillonbee

---


## A chatbot implemented with the latent semantic indexing technique



Author: Papan Yongmalwong

## Motivation
Suppose you have a dataset in which each text maps to its own response, e.g.

'Hey baby' maps to 'yeah?'

'Where are you' maps to 'I’m at my dorm'

It would be easy to write a chatbot that responds according to this data, since it can be done with if-else statements, e.g.

```
# make_response is a placeholder for whatever sends the reply
if INPUT_TEXT == 'Hey baby':
    make_response('yeah?')
```

But if the input text does not exactly match any of the texts in the data, no response is made and the conversation stalls. To overcome this problem, identify the text in the data that is most similar to the input text and reply with its response.

## Goal
To identify the text in your data that is most similar to the input text.

## Solution
Apply a technique in natural language processing called [latent semantic indexing (LSI)](https://en.wikipedia.org/wiki/Latent_semantic_analysis).
1.   Transform the texts into vectors.
2.   Take the cosine of the angle between two vectors (the input text and each text in your data) to measure their similarity.
3.   Choose the vector (the text in your data) with the highest similarity score.
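
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: it uses raw bag-of-words counts instead of LSI topic vectors, whitespace splitting instead of a real tokenizer, and the names `texts`, `replies`, `to_vector`, `cosine` and `best_reply` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy data: each text maps to the reply at the same position.
texts = ['hey baby', 'where are you']
replies = ['yeah?', "I'm at my dorm"]

def to_vector(text):
    # Step 1: bag-of-words counts; whitespace splitting stands in for a real tokenizer.
    return Counter(text.lower().split())

def cosine(u, v):
    # Step 2: cosine of the angle between two sparse count vectors.
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

def best_reply(input_text):
    # Step 3: pick the stored text with the highest similarity score.
    query = to_vector(input_text)
    scores = [cosine(query, to_vector(t)) for t in texts]
    best = max(range(len(texts)), key=scores.__getitem__)
    return replies[best], scores[best]

reply, score = best_reply('Hey baby whats up')
print(reply, round(score, 3))
```

Even though 'Hey baby whats up' is not in the data, it shares two of its four words with 'hey baby', so that entry wins and its reply is returned.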





**The solution can be done in 5 steps:**

**Step 1: Install gensim and pythainlp**
*   Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.
*   PyThaiNLP is a Python package for text processing and linguistic analysis, similar to NLTK but focused on the Thai language.

In [0]:
!pip install gensim

In [0]:
!pip install pythainlp

**Step 2: Load the LINE conversation text files of** *Rabbit* **and** *Papillonbee*

In [0]:
import requests

def load_lines(url):
    # Download a text file and split it into one entry per line.
    # '|' is also treated as a separator, exactly as in a character-by-character split on '|'.
    text = requests.get(url).text
    return text.replace('\n', '|').split('|')

# Rabbit[i] is a message and Papillonbee[i] is the reply paired with it
Rabbit = load_lines('https://gist.githubusercontent.com/papillonbee/227d8a1c26303c815614ade026906b4c/raw/c5b9745ce95d571a40190aa6128ef39973c43dab/rabbit_dictionary.txt')
Papillonbee = load_lines('https://gist.githubusercontent.com/papillonbee/a18a99a59d9372c9b11e2d1828a26c14/raw/862a3cd0560d94fd1b6e00d6bf5490e96529b3e3/ppllnb.txt')

**Step 3: Create corpus from** *Rabbit* **, fit model, and define** *talk_with_Papillonbee*

In [0]:
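from gensim import corpora, models, similarities
import pythainlp as tnlp
# Tokenize each lowercased line of Rabbit, dropping tokens that consist of one to three spaces
my_text = [list(filter(lambda a: a != ' ' and a != '  ' and a != '   ', tnlp.word_tokenize(line.lower()))) for line in Rabbit]
# Map each unique token to an integer id
my_dictionary = corpora.Dictionary(my_text)
# Convert each tokenized line into a bag-of-words vector of (token id, count) pairs
my_corpus = [my_dictionary.doc2bow(text) for text in my_text]
# Fit an LSI model with up to 200 topics on the bag-of-words corpus
my_lsi = models.LsiModel(my_corpus, id2word=my_dictionary, num_topics=200)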
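
For orientation, the usual textbook formulation of LSI can be written as follows. The symbols are not from the notebook and are assumptions for illustration: $X$ is the term-document count matrix built from the corpus, $k$ is the number of topics, $q$ is the bag-of-words vector of the input text, and $d_j$ is the bag-of-words vector of the $j$-th text in the data.

$$
X \approx U_k \Sigma_k V_k^{\top}, \qquad
\hat{q} = U_k^{\top} q, \qquad
\hat{d}_j = U_k^{\top} d_j, \qquad
\mathrm{sim}(\hat{q}, \hat{d}_j) = \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
$$

The truncated singular value decomposition keeps only the $k$ strongest directions of $X$, so texts that use different but co-occurring words can still end up close together. Some formulations additionally scale the projected vectors by $\Sigma_k^{-1}$; whether a given library does so by default should be checked in its documentation.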

In [0]:
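# Build the similarity index once instead of rebuilding it on every call
my_index = similarities.MatrixSimilarity(my_lsi[my_corpus])

def talk_with_Papillonbee(INPUT_TEXT):
    # Tokenize the input the same way as the corpus, then project it into LSI space
    tokens = list(filter(lambda a: a != ' ' and a != '  ' and a != '   ', tnlp.word_tokenize(INPUT_TEXT.lower())))
    vec_lsi = my_lsi[my_dictionary.doc2bow(tokens)]
    # Cosine similarity between the input and every text in Rabbit
    my_sims = my_index[vec_lsi]
    # Keep the (index, score) pairs of the 5 most similar texts, best first
    arr = sorted(enumerate(my_sims), key=lambda item: -item[1])[:5]
    # Reply with the response paired with the most similar text
    output_text = Papillonbee[arr[0][0]]
    lines = ['Top 5 most similar texts to \'' + INPUT_TEXT + '\':']
    # Iterate over arr itself so that data with fewer than 5 texts does not raise an IndexError
    for rank, (idx, score) in enumerate(arr, start=1):
        lines.append(str(rank) + '.)' + str(Rabbit[idx]) + ': ' + str(Papillonbee[idx]) + '\nCosine similarity = ' + str(score))
    return output_text, '\n'.join(lines)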
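
The ranking step inside the function can be seen in isolation with a small sketch. The scores below are made up for illustration, and `top_k` is a hypothetical helper, not part of gensim; it uses `heapq.nlargest` as an alternative to sorting the full list.

```python
import heapq

# Hypothetical data: texts[i] pairs with replies[i], and scores[i] is the
# (made-up) cosine similarity between the input text and texts[i].
texts = ['Hey baby', 'Where are you', 'Baby', 'Hey']
replies = ['yeah?', "I'm at my dorm", 'hey there', 'Huh?']
scores = [0.85, 0.10, 0.60, 0.59]

def top_k(scores, k=5):
    # Return up to k (index, score) pairs, highest score first.
    # Asking for more entries than exist simply returns all of them.
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])

ranked = top_k(scores, k=3)
for rank, (i, score) in enumerate(ranked, start=1):
    print(f'{rank}.) {texts[i]}: {replies[i]} (cosine similarity = {score})')
```

Keeping the index next to each score is what lets the reply be looked up in the parallel list afterwards.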

**Step 4: Input any text to talk with** *Papillonbee*

In [6]:
#Edit text here
INPUT_TEXT = 'Hey baby whats up'
response = talk_with_Papillonbee(INPUT_TEXT)


**Step 5: Print the response from** *Papillonbee*

In [7]:
print(response[0])

yeah?


In [8]:
print(response[1])

Top 5 most similar texts to 'Hey baby whats up':
1.)Hey baby: yeah?
Cosine similarity = 0.84603983
2.)Hey baby: ?
Cosine similarity = 0.84603983
3.)Sticker Hey baby: Huh?
Cosine similarity = 0.6877541
4.)Baby: hey there
Cosine similarity = 0.60257614
5.)Hey: Huh?
Cosine similarity = 0.59982234


## References
Phatthiyaphaibun, W. (2018, June 4). *User manual PyThaiNLP 1.6*. Retrieved from https://github.com/PyThaiNLP/pythainlp/blob/dev/docs/pythainlp-1-6-eng.md

Řehůřek, R. (2017, November 14). *gensim Documentation Release 0.8.6*. Retrieved from https://media.readthedocs.org/pdf/gensim/stable/gensim.pdf

---
**Add Papillonbee on LINE and start to chat [here](https://line.me/R/ti/p/%40ban4934y)**

**Full work is available [here](https://github.com/papillonbee/Papillonbee)**