A chatbot built on word2vec can be a good fit for an e-commerce site or any other institution. This notebook is a toy model of the backend methods of such a chatbot.

In [None]:
import re
import numpy as np
import pandas as pd
from numpy import dot
import tensorflow as tf
from numpy.linalg import norm
from itertools import product
from tensorflow import keras

In [70]:
faq_df = pd.read_csv('FAQs.csv')
faq_df_test = pd.read_csv('FAQs_test.csv')

A chatbot that relies on similarity search needs similar questions or words added for each FAQ entry. Otherwise, any variation of the original question will fail to trigger that entry.
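To see why, consider a lookup keyed on the exact question text. This is a minimal sketch with a hypothetical one-entry `faq` dictionary and a hypothetical `exact_lookup` helper, neither of which is part of the notebook:

```python
# Hypothetical one-entry FAQ used only for illustration.
faq = {"When was Albert Einstein born?": "Albert Einstein was born on 14 March 1879."}

def exact_lookup(question: str):
    """Return the stored answer only if the question matches a key exactly."""
    return faq.get(question)

original = exact_lookup("When was Albert Einstein born?")
rephrased = exact_lookup("What is the date of his birth")  # no matching key

print(original)
print(rephrased)
```

The rephrased question finds nothing, which is the gap that similar questions and a similarity score are meant to close.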

In [71]:
faq_df

Unnamed: 0,Question,Answer
0,When was Albert Einstein born?,Albert Einstein was born on 14 March 1879.
1,Where was he born?,"He was born in Ulm, Germany."
2,When did he die?,"He died 18 April 1955 in Princeton, New Jersey..."
3,Who were his parents?,His father was Hermann Einstein and his mother...
4,Did he have any sisters and brothers?,He had one sister named Maja.
5,Did he marry and have children?,He was married to Mileva Marić between 1903 an...
6,Where did he receive his education?,He received his main education at the followin...
7,When was Albert Einstein awarded the Nobel Pri...,"The Nobel Prize Awarding Institution, the Roya..."
8,Did Albert Einstein attend the Nobel Prize Awa...,The Nobel Prize was announced on 9 November 19...
9,For what did he receive the Nobel Prize?,Einstein was rewarded for his many contributio...


So, we add similar questions (SQs) for each FAQ entry. This requires human intervention. For example, the first question is "When was Albert Einstein born?"; I could also write it as "The date of Albert Einstein's birth", "Albert Einstein's birth date", or "Do you know the day Albert Einstein was born?". Adding these similar questions improves the knowledge base and makes the corpus much more searchable.

In [72]:
faq_df['Answer'][0]

'Albert Einstein was born on 14 March 1879.'

In [73]:
# Object-dtype placeholder column so that each cell can hold a list of similar questions
faq_df['sq_words'] = None

In [74]:
# .at sets a single cell by label, which avoids chained assignment and its SettingWithCopyWarning
faq_df.at[0, 'sq_words'] = ["What is the date of his birth", "Do you know the time of his birth", "Tell me the day Albert Einstein was born"]

faq_df.at[1, 'sq_words'] = ["Where is the place he was born", "Do you know the birthplace of Albert Einstein", "In what country he was born?"]

faq_df.at[2, 'sq_words'] = ["The time of his death", "His date of death", "When wsa the day Albert Einstein died", 'When did Albert Einstein die', 'When dis his life come to an end', 'When did he decease']

faq_df.at[3, 'sq_words'] = ["Who is father", "Who is his mother?", "What is the name of his father and mother"]

faq_df.at[4, 'sq_words'] = ['Who are his siblings', 'Tell me about his sib', 'Who are the kin of Albert Einstein', 'Who are his relatives']

faq_df.at[5, 'sq_words'] = ['The date of his marriage', 'With whom did he have children', 'What is the name of his wife', 'What is the name of his woman']

faq_df.at[6, 'sq_words'] = ['What did he study in his university', ' From which university did he graduate', ' From which university did he receive his certificate in physics']

faq_df.at[7, 'sq_words'] = ['When did he get a nobel prize', 'What was his award for his contribution', 'What the date he received his nobel prize']

faq_df.at[8, 'sq_words'] = ['Why could he not participate in the Nobel Prize ceremony', 'Why could he not come to nobel award show']

faq_df.at[9, 'sq_words'] = ['For what reason he got nobel prize?', 'What was his accomplishment in the field of physics', 'What did he do to get nobel prize', 'What did contributed in physics?']


In [75]:
faq_df['sq_words'][0]

['What is the date of his birth',
 'Do you know the time of his birth',
 'Tell me the day Albert Einstein was born']

In [76]:
faq_df.explode('sq_words')

Unnamed: 0,Question,Answer,sq_words
0,When was Albert Einstein born?,Albert Einstein was born on 14 March 1879.,What is the date of his birth
0,When was Albert Einstein born?,Albert Einstein was born on 14 March 1879.,Do you know the time of his birth
0,When was Albert Einstein born?,Albert Einstein was born on 14 March 1879.,Tell me the day Albert Einstein was born
1,Where was he born?,"He was born in Ulm, Germany.",Where is the place he was born
1,Where was he born?,"He was born in Ulm, Germany.",Do you know the birthplace of Albert Einstein
1,Where was he born?,"He was born in Ulm, Germany.",In what country he was born?
2,When did he die?,"He died 18 April 1955 in Princeton, New Jersey...",The time of his death
2,When did he die?,"He died 18 April 1955 in Princeton, New Jersey...",His date of death
2,When did he die?,"He died 18 April 1955 in Princeton, New Jersey...",When wsa the day Albert Einstein died
2,When did he die?,"He died 18 April 1955 in Princeton, New Jersey...",When did Albert Einstein die


In [77]:
for i in range(len(faq_df)):
    faq_df['sq_words'][i].append(faq_df['Question'][i])

Creating a corpus from the SQs is an ideal way to map the question to the answer.
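The idea can be pictured with plain Python. The sketch below uses a hypothetical two-row `faq_rows` list (not the notebook's DataFrame) and flattens it so that every similar question points back at its answer, which is roughly what `DataFrame.explode` does to the `sq_words` column:

```python
# Hypothetical miniature of the FAQ table: each answer owns a list of SQs.
faq_rows = [
    {"Answer": "He was born in Ulm, Germany.",
     "sq_words": ["Where is the place he was born", "Where was he born?"]},
    {"Answer": "He had one sister named Maja.",
     "sq_words": ["Who are his siblings"]},
]

# One (similar question, answer) pair per list element, similar to DataFrame.explode.
pairs = [(sq, row["Answer"]) for row in faq_rows for sq in row["sq_words"]]

# Lookup table from any known phrasing to its answer.
answer_for = dict(pairs)

print(len(pairs))
print(answer_for["Who are his siblings"])
```

Once the best-matching SQ is found, its answer is one lookup away.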

In [78]:
faq_df_corpus = faq_df.explode('sq_words')

In [79]:
corpus = faq_df_corpus['sq_words'].to_list()

In [80]:
corpus

['What is the date of his birth',
 'Do you know the time of his birth',
 'Tell me the day Albert Einstein was born',
 'When was Albert Einstein born?',
 'Where is the place he was born',
 'Do you know the birthplace of Albert Einstein',
 'In what country he was born?',
 'Where was he born?',
 'The time of his death',
 'His date of death',
 'When wsa the day Albert Einstein died',
 'When did Albert Einstein die',
 'When dis his life come to an end',
 'When did he decease',
 'When did he die?',
 'Who is father',
 'Who is his mother?',
 'What is the name of his father and mother',
 'Who were his parents?',
 'Who are his siblings',
 'Tell me about his sib',
 'Who are the kin of Albert Einstein',
 'Who are his relatives',
 'Did he have any sisters and brothers?',
 'The date of his marriage',
 'With whom did he have children',
 'What is the name of his wife',
 'What is the name of his woman',
 'Did he marry and have children?',
 'What did he study in his university',
 ' From which university

In [82]:
# Hyperparameters
window = 2  # Maximum distance (in words) between the focus word and a context word in a sentence

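Text preprocessing is a crucial step. The function below strips punctuation and digits, lowercases the sentence, splits it into words, and drops stop words.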

In [83]:
def text_preprocessing(
    text: str,
    punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_“~''',
    stop_words=['ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than']
    )->list:
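    """
    Preprocess a sentence: strip punctuation and digits, lowercase it,
    split it into words, and drop stop words. Returns a list of tokens.
    """
    # Removing punctuation characters
    for x in text.lower(): 
        if x in punctuations: 
            text = text.replace(x, "")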

    # Removing words that have numbers in them
    text = re.sub(r'\w*\d\w*', '', text)

    # Removing digits
    text = re.sub(r'[0-9]+', '', text)

    # Cleaning the whitespaces
    text = re.sub(r'\s+', ' ', text).strip()

    # Setting every word to lower
    text = text.lower()

    # Converting the text to a list of words
    text = text.split(' ')

    # Dropping empty strings
    text = [x for x in text if x!='']

    # Dropping stop words
    text = [x for x in text if x not in stop_words]

    return text
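For comparison, here is a compact variant of the same idea. It is a sketch only: `simple_preprocess` and the short `STOP_WORDS` set are hypothetical names introduced here, and the behaviour is not identical to the function above (for instance, it splits on apostrophes and digits instead of deleting whole words that contain digits).

```python
import re

# Small illustrative stop-word set; the notebook's function uses a much longer list.
STOP_WORDS = {"when", "was", "the", "of", "his", "what", "is", "he", "did", "do", "you"}

def simple_preprocess(sentence: str) -> list:
    """Lowercase, keep runs of letters only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = simple_preprocess("When was Albert Einstein born?")
print(tokens)
```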

In [84]:
clean_text = []
for sentence in corpus:
   clean_text.append(text_preprocessing(sentence))

In [85]:
clean_text

[['date', 'birth'],
 ['know', 'time', 'birth'],
 ['tell', 'day', 'albert', 'einstein', 'born'],
 ['albert', 'einstein', 'born'],
 ['place', 'born'],
 ['know', 'birthplace', 'albert', 'einstein'],
 ['country', 'born'],
 ['born'],
 ['time', 'death'],
 ['date', 'death'],
 ['wsa', 'day', 'albert', 'einstein', 'died'],
 ['albert', 'einstein', 'die'],
 ['dis', 'life', 'come', 'end'],
 ['decease'],
 ['die'],
 ['father'],
 ['mother'],
 ['name', 'father', 'mother'],
 ['parents'],
 ['siblings'],
 ['tell', 'sib'],
 ['kin', 'albert', 'einstein'],
 ['relatives'],
 ['sisters', 'brothers'],
 ['date', 'marriage'],
 ['children'],
 ['name', 'wife'],
 ['name', 'woman'],
 ['marry', 'children'],
 ['study', 'university'],
 ['university', 'graduate'],
 ['university', 'receive', 'certificate', 'physics'],
 ['receive', 'education'],
 ['get', 'nobel', 'prize'],
 ['award', 'contribution'],
 ['date', 'received', 'nobel', 'prize'],
 ['albert', 'einstein', 'awarded', 'nobel', 'prize', 'physics'],
 ['could', 'partic

In [86]:
word_list = []
all_text = []

In [87]:
for text in corpus:

    # Cleaning the text
    text = text_preprocessing(text)

    # Appending to the all text list
    all_text += text #This will be used for creating a unique word dictionary

    # Creating [focus word, context word] pairs
    for i, word in enumerate(text):
        for w in range(window):
            # Getting the context that is ahead by *window* words
            if i + 1 + w < len(text): 
                word_list.append([word] + [text[(i + 1 + w)]])
            # Getting the context that is behind by *window* words    
            if i - w - 1 >= 0:
                word_list.append([word] + [text[(i - w - 1)]])
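The pair generation can be isolated into a small function to make the role of `window` easier to see. This is a sketch under the same logic as the loop above; the name `skipgram_pairs` is introduced here for illustration.

```python
def skipgram_pairs(tokens: list, window: int = 2) -> list:
    """Return [focus, context] pairs for words at most `window` positions apart."""
    pairs = []
    for i, focus in enumerate(tokens):
        for offset in range(1, window + 1):
            if i + offset < len(tokens):   # context word ahead of the focus word
                pairs.append([focus, tokens[i + offset]])
            if i - offset >= 0:            # context word behind the focus word
                pairs.append([focus, tokens[i - offset]])
    return pairs

pairs = skipgram_pairs(["know", "time", "birth"], window=2)
print(pairs)
```

With three tokens and `window = 2`, every word is paired with both of the others, giving six pairs.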

In [88]:
word_list

[['date', 'birth'],
 ['birth', 'date'],
 ['know', 'time'],
 ['know', 'birth'],
 ['time', 'birth'],
 ['time', 'know'],
 ['birth', 'time'],
 ['birth', 'know'],
 ['tell', 'day'],
 ['tell', 'albert'],
 ['day', 'albert'],
 ['day', 'tell'],
 ['day', 'einstein'],
 ['albert', 'einstein'],
 ['albert', 'day'],
 ['albert', 'born'],
 ['albert', 'tell'],
 ['einstein', 'born'],
 ['einstein', 'albert'],
 ['einstein', 'day'],
 ['born', 'einstein'],
 ['born', 'albert'],
 ['albert', 'einstein'],
 ['albert', 'born'],
 ['einstein', 'born'],
 ['einstein', 'albert'],
 ['born', 'einstein'],
 ['born', 'albert'],
 ['place', 'born'],
 ['born', 'place'],
 ['know', 'birthplace'],
 ['know', 'albert'],
 ['birthplace', 'albert'],
 ['birthplace', 'know'],
 ['birthplace', 'einstein'],
 ['albert', 'einstein'],
 ['albert', 'birthplace'],
 ['albert', 'know'],
 ['einstein', 'albert'],
 ['einstein', 'birthplace'],
 ['country', 'born'],
 ['born', 'country'],
 ['time', 'death'],
 ['death', 'time'],
 ['date', 'death'],
 ['dea

In [89]:
words = list(set(all_text))

In [90]:
words.sort()

In [91]:
words

['accomplishment',
 'albert',
 'attend',
 'award',
 'awarded',
 'birth',
 'birthplace',
 'born',
 'brothers',
 'ceremony',
 'certificate',
 'children',
 'come',
 'contributed',
 'contribution',
 'could',
 'country',
 'date',
 'day',
 'death',
 'decease',
 'die',
 'died',
 'dis',
 'education',
 'einstein',
 'end',
 'father',
 'field',
 'get',
 'got',
 'graduate',
 'kin',
 'know',
 'life',
 'marriage',
 'marry',
 'mother',
 'name',
 'nobel',
 'parents',
 'participate',
 'physics',
 'place',
 'prize',
 'reason',
 'receive',
 'received',
 'relatives',
 'show',
 'sib',
 'siblings',
 'sisters',
 'study',
 'tell',
 'time',
 'university',
 'wife',
 'woman',
 'wsa']

In [92]:
albert_dictionary = {}

for i, word in enumerate(words):
    albert_dictionary.update({word: i})
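The same word-to-index mapping can be written as a dictionary comprehension, and an inverse mapping is often handy for turning indices back into words. A minimal sketch on a hypothetical token list (the names `word_to_index` and `index_to_word` are not from the notebook):

```python
# Hypothetical tokens; duplicates collapse when the vocabulary is built.
tokens = ["date", "birth", "know", "time", "birth"]

vocab = sorted(set(tokens))                                   # unique words, alphabetical
word_to_index = {word: i for i, word in enumerate(vocab)}     # word -> row index
index_to_word = {i: word for word, i in word_to_index.items()}  # row index -> word

print(word_to_index)
print(index_to_word[2])
```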

In [93]:
albert_dictionary

{'accomplishment': 0,
 'albert': 1,
 'attend': 2,
 'award': 3,
 'awarded': 4,
 'birth': 5,
 'birthplace': 6,
 'born': 7,
 'brothers': 8,
 'ceremony': 9,
 'certificate': 10,
 'children': 11,
 'come': 12,
 'contributed': 13,
 'contribution': 14,
 'could': 15,
 'country': 16,
 'date': 17,
 'day': 18,
 'death': 19,
 'decease': 20,
 'die': 21,
 'died': 22,
 'dis': 23,
 'education': 24,
 'einstein': 25,
 'end': 26,
 'father': 27,
 'field': 28,
 'get': 29,
 'got': 30,
 'graduate': 31,
 'kin': 32,
 'know': 33,
 'life': 34,
 'marriage': 35,
 'marry': 36,
 'mother': 37,
 'name': 38,
 'nobel': 39,
 'parents': 40,
 'participate': 41,
 'physics': 42,
 'place': 43,
 'prize': 44,
 'reason': 45,
 'receive': 46,
 'received': 47,
 'relatives': 48,
 'show': 49,
 'sib': 50,
 'siblings': 51,
 'sisters': 52,
 'study': 53,
 'tell': 54,
 'time': 55,
 'university': 56,
 'wife': 57,
 'woman': 58,
 'wsa': 59}

In [95]:
total_albert_words = len(albert_dictionary)
albert_words = list(albert_dictionary.keys())

total_albert_words, albert_words

In [96]:
n_focused_encoding = []
m_context_encoding = []

X_focused = []
Y_context = []

In [97]:
for focus_word, context_word in word_list:

    # One-hot rows for the focus word (input) and the context word (target)
    focused_row = np.zeros(total_albert_words)
    context_row = np.zeros(total_albert_words)
    
    focus_word_index = albert_dictionary.get(focus_word)
    context_word_index = albert_dictionary.get(context_word)

    focused_row[focus_word_index] = 1
    context_row[context_word_index] = 1

    n_focused_encoding.append(focused_row)
    m_context_encoding.append(context_row)

    
X_focused = np.array(n_focused_encoding)
Y_context = np.array(m_context_encoding)

In [98]:
X_focused

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [136]:
Y_context

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [100]:
X_focused.shape

(232, 60)

In [101]:
Y_context.shape

(232, 60)

In [102]:
#Hyperparameter
embed_size = 3 

In [104]:
input_layer = tf.keras.Input(shape=(X_focused.shape[1],))

x_ = tf.keras.layers.Dense(units=embed_size)(input_layer)
output_layer = tf.keras.layers.Dense(units=Y_context.shape[1], activation=tf.nn.softmax)(x_)

model = tf.keras.Model(inputs=input_layer, outputs=output_layer)
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy') #optimizer and loss function
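Assuming the Keras defaults (a linear hidden layer and bias terms in both `Dense` layers), the network above can be written as follows. Here x is the one-hot focus word of size V = 60, d = 3 is `embed_size`, and y is the one-hot context word. The symbols W_1, b_1, W_2, b_2 are notation introduced here, not names from the notebook.

```latex
\begin{aligned}
h &= x W_1 + b_1, & W_1 &\in \mathbb{R}^{V \times d} \\
\hat{y} &= \operatorname{softmax}(h W_2 + b_2), & W_2 &\in \mathbb{R}^{d \times V} \\
\mathcal{L} &= -\sum_{j=1}^{V} y_j \log \hat{y}_j
\end{aligned}
```

Because x is one-hot, the product x W_1 selects a single row of W_1, so each row of the first weight matrix acts as the embedding of one word.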



Time for some training!!!

In [137]:
model.fit(
    x=X_focused, 
    y=Y_context, 
    batch_size=256,
    epochs=1000 
    )

Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
...

<keras.callbacks.History at 0x7f07673b9600>

The loss seems acceptable given the limited computing power available. It could probably be reduced further by tuning the hyperparameters (for example `embed_size`, `window`, or the number of epochs), which requires more computational power.

In [106]:
albert_weights = model.get_weights()[0]  

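These weights are the word embeddings: row i of the first layer's weight matrix is the embedding of the word with index i in `albert_dictionary`. Training richer embeddings would need more computing power.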
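Why the first weight matrix can be read as an embedding table is easy to check by hand: multiplying a one-hot row vector by a matrix returns exactly one row of that matrix. The sketch below uses plain Python lists in place of NumPy arrays, with made-up weights and hypothetical helper names (`one_hot`, `vec_mat`); bias terms are left out.

```python
# Toy 3-word vocabulary with 2-dimensional embeddings (values are made up).
weights = [
    [0.1, 0.2],  # row 0
    [0.3, 0.4],  # row 1
    [0.5, 0.6],  # row 2
]

def one_hot(index: int, size: int) -> list:
    """Return a list of zeros with a single 1.0 at `index`."""
    row = [0.0] * size
    row[index] = 1.0
    return row

def vec_mat(vector: list, matrix: list) -> list:
    """Multiply a row vector by a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(v * row[c] for v, row in zip(vector, matrix)) for c in range(n_cols)]

hidden = vec_mat(one_hot(1, 3), weights)
print(hidden)
```

The result equals `weights[1]`, the row belonging to the word with index 1.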

In [107]:
albert_weights

array([[-1.1260878 , -1.3396604 , -0.4073514 ],
       [ 1.0015075 ,  0.98034227, -0.27034208],
       [ 1.0469748 ,  0.7737589 ,  1.1676605 ],
       [-0.8662802 , -0.13384867,  1.3177891 ],
       [ 1.210026  ,  0.8899499 ,  1.294925  ],
       [-0.23857994, -1.4409283 , -0.51845074],
       [ 1.0997603 ,  0.47751483, -0.19983272],
       [ 0.6485021 ,  1.3925015 , -0.08019659],
       [-1.2747762 , -0.41251385, -1.0494404 ],
       [-0.11548753,  0.40655464,  0.9723213 ],
       [-1.4166626 , -0.88827187, -0.73772675],
       [-0.33165666, -1.276127  , -1.0351048 ],
       [-1.3020599 ,  1.2463874 ,  0.46835342],
       [-0.50382984, -1.2369312 ,  0.3565933 ],
       [-1.0312641 ,  1.2462982 ,  1.472558  ],
       [-0.62306774,  1.0189959 ,  1.1172502 ],
       [ 0.8858112 ,  1.4347991 , -0.8552127 ],
       [ 1.442468  , -1.3610036 , -0.25990364],
       [ 1.0922296 ,  1.0804216 , -0.03584035],
       [-1.0009443 , -1.1817664 ,  0.03408216],
       [ 0.09640199,  0.13389581,  0.050

In [108]:
#Creating an embedding dictionary for now

word_ebmedding_dict = {}

for word in words:
    word_ebmedding_dict.update({
        word: albert_weights[albert_dictionary.get(word)]
    })

In [109]:
word_ebmedding_dict

{'accomplishment': array([-1.1260878, -1.3396604, -0.4073514], dtype=float32),
 'albert': array([ 1.0015075 ,  0.98034227, -0.27034208], dtype=float32),
 'attend': array([1.0469748, 0.7737589, 1.1676605], dtype=float32),
 'award': array([-0.8662802 , -0.13384867,  1.3177891 ], dtype=float32),
 'awarded': array([1.210026 , 0.8899499, 1.294925 ], dtype=float32),
 'birth': array([-0.23857994, -1.4409283 , -0.51845074], dtype=float32),
 'birthplace': array([ 1.0997603 ,  0.47751483, -0.19983272], dtype=float32),
 'born': array([ 0.6485021 ,  1.3925015 , -0.08019659], dtype=float32),
 'brothers': array([-1.2747762 , -0.41251385, -1.0494404 ], dtype=float32),
 'ceremony': array([-0.11548753,  0.40655464,  0.9723213 ], dtype=float32),
 'certificate': array([-1.4166626 , -0.88827187, -0.73772675], dtype=float32),
 'children': array([-0.33165666, -1.276127  , -1.0351048 ], dtype=float32),
 'come': array([-1.3020599 ,  1.2463874 ,  0.46835342], dtype=float32),
 'contributed': array([-0.50382984,

In [110]:
def combination_of_words(arr1, arr2)->list:
    """This will return a list of the combination of words"""
    return list(product(arr1, arr2))

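def consine_similarity(word_1:str, word_2:str)->float:
    """Return the cosine distance (1 - cosine similarity) between two word embeddings.

    0 means the two vectors point in the same direction; larger values mean less similar.
    """
    word_1_coord = word_ebmedding_dict.get(word_1)
    word_2_coord = word_ebmedding_dict.get(word_2)

    # Despite the variable name, this is a distance: lower means more similar
    similarity_score = 1 - (dot(word_1_coord, word_2_coord)/(norm(word_1_coord) * norm(word_2_coord)))
    return similarity_score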

def sentence_similarity(corpus_sentence:str, test_sentence:str)->float:
    """Return the average pairwise score from consine_similarity between the words of two sentences"""

    similarity_score_total = 0

    corpus_sentence_list = text_preprocessing(corpus_sentence)

    # Keeping only the test words that exist in the vocabulary (others have no embedding)
    test_question_list = [a_word for a_word in text_preprocessing(test_sentence) if a_word in albert_dictionary.keys()]

    
    word_pairs_for_similarity = combination_of_words(corpus_sentence_list, test_question_list)

    # No comparable word pairs: return NaN instead of dividing by zero
    if not word_pairs_for_similarity:
        return float('nan')
    
    for a_pair in word_pairs_for_similarity:
        similarity_score_total += consine_similarity(a_pair[0], a_pair[1])

    return similarity_score_total/len(word_pairs_for_similarity) #Average similarity score
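When ranking candidates it matters whether the score is a similarity or a distance. Plain cosine similarity is highest for the closest vector, so the best match is the maximum; `1 - cosine` is a distance, so the best match is the minimum. The sketch below illustrates this with the standard library only, using made-up 2-dimensional vectors and a hypothetical `cosine` helper:

```python
import math

def cosine(u: list, v: list) -> float:
    """Cosine similarity: 1.0 for the same direction, 0.0 for orthogonal vectors."""
    dot_uv = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_uv / (norm_u * norm_v)

# Made-up embeddings, chosen so that "die" lies close to "death".
toy = {"death": [1.0, 0.0], "die": [0.9, 0.1], "born": [0.0, 1.0]}

similarity = {w: cosine(toy["death"], vec) for w, vec in toy.items() if w != "death"}
distance = {w: 1 - s for w, s in similarity.items()}

best_by_similarity = max(similarity, key=similarity.get)  # highest similarity wins
best_by_distance = min(distance, key=distance.get)        # lowest distance wins
print(best_by_similarity, best_by_distance)
```

Both selections agree on the same word only because the maximum is taken over similarities and the minimum over distances.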

In [129]:
faq_df_corpus['ranking'] = 0
testing_question = faq_df_test['Question'][0]

In [130]:
faq_df_corpus.columns

Index(['Question', 'Answer', 'sq_words', 'ranking'], dtype='object')

In [131]:
faq_df_corpus['ranking'] = faq_df_corpus['sq_words'].apply(lambda a_sentence: sentence_similarity(a_sentence, testing_question))

In [132]:
faq_df_corpus.reset_index(inplace=True)

In [133]:
faq_df_corpus.drop(columns=['index'], axis = 1, inplace=True)

Testing Questions

In [134]:
testing_question

'What is the date of his death?'

Answer to the question

In [135]:
faq_df_corpus.loc[faq_df_corpus['ranking'].idxmax()]['Answer']

'He was born in Ulm, Germany. '

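Currently, the model is not able to give a proper answer (known as wrong-mapping in an e-commerce context). One likely contributor is the ranking step: `consine_similarity` returns 1 - cosine similarity, which is a distance, so selecting the row with `idxmax` picks the least similar sentence. Using `idxmin`, or returning the plain cosine, would rank the closest sentence first. Beyond that, adding more SQs, enlarging the corpus, and retraining the model should help.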

Also, the current model does not yet handle data augmentation or spell correction, neither for English nor for Bangla written in Roman letters. So, there is much more work to be done.
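As a starting point for the spell-correction part, misspelled tokens could be snapped to the nearest vocabulary word before the similarity search. The sketch below is only one possible approach, using `difflib.get_close_matches` from the Python standard library. The `vocabulary` list, the `correct_token` helper, and the 0.8 cutoff are assumptions made for illustration, and the sketch covers English-like spelling slips only, not Roman Bangla.

```python
from difflib import get_close_matches

# Hypothetical vocabulary; in the notebook this role would be played by the keys of albert_dictionary.
vocabulary = ["albert", "einstein", "born", "death", "nobel", "prize"]

def correct_token(token: str, vocab: list, cutoff: float = 0.8) -> str:
    """Return the closest vocabulary word, or the token unchanged if none is close enough."""
    matches = get_close_matches(token, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token

corrected = [correct_token(t, vocabulary) for t in ["einstien", "nobel", "xyz"]]
print(corrected)
```

Tokens with no sufficiently close match are returned unchanged, so they would still be filtered out later as out-of-vocabulary words.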