![EFREI Logo](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fuploads-ssl.webflow.com%2F5dea07d5bb83abf6cdffcf8a%2F5e206b6b8c5a1694d9a3419f_logo-2017-vecto.png&f=1&nofb=1&ipt=433d775bcd4b94706740a20223e10261a41069b267db4346c31d507acba27b2b&ipo=images)

# My First Chatbot

A first look into chatbots

## The principle

This chatbot is based on a TF-IDF algorithm, and provides answers to the questions submitted by computing the most relevant document of a text corpus, and extracting and processing a phrase from this document.

For this program, we focused on user experience, with optimized execution time and good-looking graphics.

### TF-IDF

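TF-IDF (term frequency-inverse document frequency) is a weighting method, widely used by search engines, that measures how relevant a word is to a document within a corpus. A word scores high when it appears often in the document but in few other documents of the corpus.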
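As a point of reference, the score can be written as follows, assuming the conventions used by the code later in this document (raw occurrence counts for TF and a base-10 logarithm for IDF). The symbols $N$ and $\mathrm{df}(w)$ are notation introduced here for illustration:

$$
\text{tf-idf}(w, d) = \mathrm{tf}(w, d) \times \log_{10}\left(\frac{N}{\mathrm{df}(w)}\right)
$$

where $\mathrm{tf}(w, d)$ is the number of occurrences of the word $w$ in the document $d$, $N$ is the number of documents in the corpus, and $\mathrm{df}(w)$ is the number of documents that contain $w$.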

### What were the main objectives of this project?

## The code

### Handling the texts

#### Cleaning the texts

This function “cleans” the text: it lowercases it, strips accents, and replaces every character that isn't in the list of characters we want to keep with a space. Consecutive spaces are collapsed into one, and the result never starts or ends with a space.

In [1]:
from src.lib.utils import *

def clean_text(text: str) -> str:
    """Cleans the text by lowercasing it and replacing any non-latin character or punctuation mark with spaces"""

    # The string to be returned
    cleaned_text = ""
    # The lowered text
    l_text = lower(text)

    # For every character in the lowered text
    for character in l_text:
        # Do we want to keep the character ?
        if(character in LOWERCASE_LETTERS):
            cleaned_text += DIC_UNACCENT[character]
        else:
            # Ensures that the text doesn't start with a space
            if(len(cleaned_text) != 0):
                # If the previous character is not a space (to avoid multiple spaces)
                if(cleaned_text[-1] != " "):
                    cleaned_text += " "
    
    if(len(cleaned_text) == 0): return cleaned_text
    # Removes trailing space if it exists
    if(cleaned_text[-1] == " "):
        cleaned_text = cleaned_text[:-1]

    return cleaned_text

In [2]:
clean_text("J'adore déjà programmer en Python !! 👍👍")

'j adore deja programmer en python'
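The constants `LOWERCASE_LETTERS` and `DIC_UNACCENT` come from `src.lib.utils` and are not shown here. As a minimal sketch of what they might look like, assuming that digits are kept (the later outputs contain words such as `l1` and `fat32`) and that the list of accented letters below is enough for the corpus:

```python
import string
import unicodedata

# Hypothetical stand-ins for the constants imported from src.lib.utils.
# Assumption: unaccented letters and digits are kept as they are.
KEPT_AS_IS = string.ascii_lowercase + string.digits
# Assumption: these are the accented letters the corpus may contain.
ACCENTED_LETTERS = "àâäçéèêëîïôöùûüÿ"

# Maps every kept character to its unaccented form
DIC_UNACCENT = {c: c for c in KEPT_AS_IS}
for c in ACCENTED_LETTERS:
    # NFD splits "é" into "e" followed by a combining accent; we keep the base letter
    DIC_UNACCENT[c] = unicodedata.normalize("NFD", c)[0]

# Every character that clean_text would keep
LOWERCASE_LETTERS = set(DIC_UNACCENT)
```

Building the dictionary once keeps the loop in `clean_text` down to a membership test and a lookup per character.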

#### Converting the texts

This function applies the previous function, `clean_text`, to every file and writes each result to a file of the same name in the destination directory.
Each cleaned file contains only lowercase words separated by single spaces, with no line breaks.

In [3]:
import os
# Used to remove a directory along with all its files
from shutil import rmtree as remove_folder

def convert_texts(files:list[str], destination_directory:str, origin_directory:str) -> None:
    """Cleans the texts of `origin_directory` and stores them into `destination_directory`"""

    # If the cleaned directory exists, removes it and all its content
    if os.path.exists(destination_directory):
        remove_folder(destination_directory)
    # And creates a brand new one !
    os.makedirs(destination_directory)

    # For every file that should be cleaned (t stands for text)
    for t in files:
        # Opens and cleans the text
        with open(f"{origin_directory}/{t}", "r", encoding='utf8') as f_read:
            text = f_read.read()
            cleaned = clean_text(text)

            # And then writes it into a new file
            with open(f"{destination_directory}/{t}", "w", encoding='utf8') as f_write:
                f_write.write(cleaned)
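`convert_texts` expects a list of file names. The helper `list_files`, called later as `list_files(directory, '.txt')`, is not shown in this document; the following is only a sketch of what it could look like, assuming it returns the names (not the full paths) of the files that end with the given extension:

```python
import os

def list_files(directory: str, extension: str) -> list[str]:
    """Hypothetical sketch: names of the files in `directory` ending with `extension`."""
    return sorted(
        name
        for name in os.listdir(directory)
        # Keeps regular files only, so that sub-directories are ignored
        if name.endswith(extension) and os.path.isfile(os.path.join(directory, name))
    )
```

Sorting the names is a choice made here so that the order does not depend on the file system.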

### TF-IDF methods

First, we need the TF method.

#### Term Frequency (TF)

This function counts how often each word appears in a text: it returns a dictionary that associates each word with its number of occurrences in that text. It is simple to write and is essential for what comes next.

In [4]:


def term_frequency(text: str) -> dict[str, int]:
    """Returns a dictionary associating with each word the number of times it appears in the string"""

    # Transforms the text into an array of words
    words = text.split(" ")
    # The variable that will be returned
    res = {}

    # For every word
    for w in words:
        # Is the word already in the result dictionary ?
        if(w in res.keys()):
            # We add 1 to its count
            res[w] += 1
        else:
            # We initialize its count at 1
            res[w] = 1

    return res

In [5]:
term_frequency("je suis en l1 et je mange en meme temps")

{'je': 2,
 'suis': 1,
 'en': 2,
 'l1': 1,
 'et': 1,
 'mange': 1,
 'meme': 1,
 'temps': 1}
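For comparison, the same count can be sketched with `collections.Counter` from the standard library. The name `term_frequency_counter` is made up for this example. Note one difference: `split()` without an argument ignores empty strings, whereas `split(" ")` would count an empty word for an empty text.

```python
from collections import Counter

def term_frequency_counter(text: str) -> dict[str, int]:
    """Hypothetical variant of term_frequency built on collections.Counter."""
    # Counter counts every word of the list in a single pass
    return dict(Counter(text.split()))

print(term_frequency_counter("je suis en l1 et je mange en meme temps"))
```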

#### Inverse Document Frequency (IDF)

This function is the second part of TF-IDF. It computes the importance of each word across all the texts: the fewer texts a word appears in, the higher its score. It complements the TF function, and the resulting score is very helpful when we need, for example, to get the most important words of a text.
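To make the formula concrete, here is a small in-memory sketch that works on a list of already cleaned texts instead of files. The name `idf_from_texts` and the three sample texts are made up for this example:

```python
import math

def idf_from_texts(texts: list[str]) -> dict[str, float]:
    """Hypothetical in-memory variant: IDF of every word across a list of cleaned texts."""
    document_count: dict[str, int] = {}
    for text in texts:
        # A set, so that a word counts once per text
        for word in set(text.split()):
            document_count[word] = document_count.get(word, 0) + 1
    # log10(number of texts / number of texts containing the word)
    return {w: math.log10(len(texts) / n) for w, n in document_count.items()}

scores = idf_from_texts(["le chat dort", "le chien dort", "le poisson nage"])
print(scores["le"])    # the word appears in all 3 texts: log10(3 / 3)
print(scores["dort"])  # the word appears in 2 texts: log10(3 / 2)
print(scores["chat"])  # the word appears in 1 text: log10(3 / 1)
```

A word present in every text gets a score of 0, which is why very common words contribute nothing to the final TF-IDF score.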

In [6]:
import math

def inverse_document_frequency(directory: str) -> dict[str, float]:
    """Returns a dictionary associating the IDF score with each word of each speech file in the directory"""

    # Lists all the files in the directory
    files = list_files(directory, '.txt')
    # The variable that will be returned 🤯🤯🤯
    res = {}

    # For every cleaned text
    for speech in files:
        # Opens it with a nice format string
        # fd stands for file descriptor
        with open(f"{directory}/{speech}", "r", encoding='utf8') as fd:
            text = fd.read()
            # `words` stores the set of all the words of the text
            list_words = text.split(" ")
            words = set(list_words)

            # For every word
            for w in words:
                # If the word has already been encountered in another text
                if(w in res.keys()):
                    # We increment the counter by one
                    res[w] += 1
                # If it is the first time the word is seen, initializes its counter
                else:
                    res[w] = 1
    
    # Applies the formula for every word
    for key in res.keys():
        res[key] = math.log10(len(files) / res[key])
    
    return res

In [7]:
inverse_document_frequency('src/cleaned/Linux_Documentation/')

{'suivantes': 0.8450980400142568,
 'csm': 0.8450980400142568,
 'mais': 0.8450980400142568,
 'sont': 0.8450980400142568,
 'syslinux': 0.8450980400142568,
 'and': 0.8450980400142568,
 'reflector': 0.8450980400142568,
 'outils': 0.8450980400142568,
 'ainsi': 0.5440680443502757,
 'sera': 0.8450980400142568,
 'pages': 0.8450980400142568,
 'partitionnement': 0.8450980400142568,
 'initramfs': 0.8450980400142568,
 'redemarrer': 0.8450980400142568,
 'fat32': 0.8450980400142568,
 'gestionnaire': 0.5440680443502757,
 'installe': 0.8450980400142568,
 'hypothese': 0.8450980400142568,
 'raccourci': 0.8450980400142568,
 '790324': 0.8450980400142568,
 'mmcli': 0.8450980400142568,
 'crees': 0.8450980400142568,
 'raid': 0.8450980400142568,
 'affichera': 0.8450980400142568,
 'heures': 0.8450980400142568,
 'meta': 0.8450980400142568,
 'boot0': 0.8450980400142568,
 'chargeurs': 0.8450980400142568,
 'la': 0.0,
 'reglee': 0.8450980400142568,
 'criteres': 0.8450980400142568,
 'amd': 0.8450980400142568,
 'est'

#### Troubleshooting

The formula we had to use gave us some trouble: at first it returned negative values, which were obviously incorrect. We weren't sure of the cause at the time, so we had to track down the issue in our code and fix it.
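The cause of the negative values isn't recorded here. As one possible illustration only, and not necessarily the bug we ran into, inverting the ratio inside the logarithm flips the sign of every score, because a word can never appear in more files than the corpus contains. The two numbers below are made up:

```python
import math

total_files = 7        # hypothetical corpus size
files_with_word = 2    # hypothetical number of files containing the word

# Ratio >= 1, so the score is >= 0
correct = math.log10(total_files / files_with_word)
# Ratio <= 1, so the score is <= 0
inverted = math.log10(files_with_word / total_files)

print(correct, inverted)
```

Checking that every score is non-negative is a cheap sanity test for this kind of mistake.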

## Providing a great user experience

### The UX menu

We had a lot of fun building the menu for our project. We actually started with a menu built with tkinter and customtkinter, which led to great results. However, partway through, we heard that tkinter might not be allowed.

We then changed our plans: our new goal was to make a console-based chatbot, which we did.

#### Scene class

We used object-oriented programming to make the bubbles that surround our messages, and the scene that stores and updates all bubbles automatically.

The Scene class is used to avoid code repetition and an overflow of parameters and local variables. It also helps with responsiveness.

#### Bubble class

Here you can see what the bubble looks like, and how it adapts to the text being displayed.

In [15]:
from src.lib.ux import Bubble

Bubble(20, 0, "Salut!").draw(consolewidth=20)
Bubble(20, 0, "Je suis une super bulle de texte responsive").draw(consolewidth=20)

[G[32m╭────────╮[39m
[32m│ [39mSalut![32m │
[39m[32m┴────────╯[39m
[G[32m╭──────────────────╮[39m
[32m│ [39mJe suis une     [32m │
[39m[32m│ [39msuper bulle de  [32m │
[39m[32m│ [39mtexte responsive[32m │
[39m[32m┴──────────────────╯[39m


## Generating an answer

### Computing similarities

To find the most relevant document, we compute the similarity between two TF-IDF vectors, which tells us how relevant a document is. The formula used is the cosine similarity:

$$
\frac{\vec{V_1} \cdot \vec{V_2}}{|\vec{V_1}| \times |\vec{V_2}|}
$$

#### Scalar product

The implementation of the scalar product is pretty straightforward. We just have to raise an exception if the vectors don't have the same dimension.

In [17]:
def scalar_product(vector1: list[float], vector2: list[float]) -> float:
    """ Computes the scalar product of two vectors of equal dimension """
    if(len(vector1) != len(vector2)):
        raise IndexError("Must be two vectors of equal length!")
    res = 0
    for i in range(len(vector1)):
        res += vector1[i] * vector2[i]
    return res

In [21]:
scalar_product([0, 1], [1, 0])

0

In [20]:
scalar_product([1, 0.5], [0.5, 1])

1.0

#### Vector norm

In [22]:
from math import sqrt

def vector_norm(vector: list[float]) -> float:
    """ Computes the Euclidean norm of a vector """
    s = 0
    for v in vector:
        s += v ** 2
    return sqrt(s)

#### Similarity

In [23]:
def similarity(vector1:list[float], vector2:list[float]) -> float:
    """ Computes the similarity between two vectors """
    p1 = scalar_product(vector1, vector2)
    p2 = (vector_norm(vector1) * vector_norm(vector2))
    # If at least one of the vectors is the null vector
    if(p2 == 0): return 0
    return p1 / p2

In [24]:
similarity([1, 0.5], [0.5, 1])

0.7999999999999998
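Picking the most relevant document then comes down to keeping the document whose vector has the highest similarity with the question's vector. The sketch below is only an illustration: `most_relevant_document`, the file names and the vectors are made up, and the similarity is restated compactly (without the length check) so that the block runs on its own.

```python
from math import sqrt

def cosine_similarity(v1: list[float], v2: list[float]) -> float:
    """Compact restatement of the similarity above, so that this sketch runs on its own."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2))
    # Returns 0 when at least one of the vectors is the null vector
    return dot / norms if norms else 0.0

def most_relevant_document(question: list[float], documents: dict[str, list[float]]) -> str:
    """Hypothetical helper: name of the document whose vector is closest to the question's."""
    return max(documents, key=lambda name: cosine_similarity(question, documents[name]))

# Made-up TF-IDF vectors over a three-word vocabulary
documents = {
    "install.txt": [0.8, 0.0, 0.1],
    "network.txt": [0.0, 0.5, 0.5],
}
print(most_relevant_document([0.5, 0.0, 0.0], documents))
```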

#### Get the phrase to print

With the most relevant document in hand, we simply select the first phrase that contains the word with the highest TF-IDF score and return it raw ("uncleaned").

In [25]:
def get_phrase(word:str, raw_text:str) -> str:
    """ Returns the first phrase containing the word, in raw str, without the ending '.', or False if no phrase contains it """
    phrases = raw_text.split('.')
    phrases_cleaned = [clean_text(p) for p in phrases]
    res = [phrases[i] for i in range(len(phrases)) if word in phrases_cleaned[i].split(' ')]
    if(len(res)):
        return res[0]
    else:
        return False
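The `word` argument is the word with the highest TF-IDF score. Assuming the scores are stored in a dictionary (the name `question_tfidf` and its values are made up for this example), selecting that word can be sketched as:

```python
# Hypothetical TF-IDF scores of the words of a cleaned question
question_tfidf = {"comment": 0.0, "installer": 0.54, "syslinux": 0.85}

# max() with key=dict.get returns the key whose score is the highest
best_word = max(question_tfidf, key=question_tfidf.get)
print(best_word)
```

If several words share the highest score, `max` keeps the first one in the dictionary's order.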

## What did we learn

We learned how to make a pretty CLI app and the basics of how a chatbot works. We also forced ourselves to use Git, and we even went as far as learning Tkinter to make a graphical UX.