# Homework 11 - LDA and ChatBot
In this assignment, we will be applying LDA and building a simple chatbot using the provided datasets.

Complete the missing parts in this guide.

### Step 1: Load Data
You can load the data from the provided TSV file using `pandas`.

### Step 2: Preprocess
 - Clean the data by removing stop words, punctuation, emoticons, etc.

### Step 3: Apply LDA
 - Find the topics in the dataset using LDA (Latent Dirichlet Allocation).
 - Describe the topics found in the dataset.

### Step 4: Create a simple ChatBot
 - Use `nltk` to create a simple chatbot that responds to user queries based on the similarity between the user input and the sentences in the dataset.

## Dataset Overview
The dataset, originally obtained from https://archive.ics.uci.edu/dataset/331/sentiment+labelled+sentences, contains sentences labeled with sentiment (positive or negative). It is split into three parts, each containing sentences from a different source: Amazon, Yelp, and IMDb.
The score is either 1 (positive) or 0 (negative).


## Submission Guidelines

- Submit your completed notebook as an HTML export or a PDF file.

To export to HTML, if you are on Jupyter, select `File` > `Export Notebook As` > `HTML`.

If you are on VSCode, you can use the `Jupyter: Export to HTML` command.
 - Open the command palette (Ctrl+Shift+P or Cmd+Shift+P on Mac).
 - Search for `Jupyter: Export to HTML`.
 - Save the HTML file to your computer and submit it via Canvas.


In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import string
import random
import nltk
from nltk.corpus import stopwords
import re
import pandas as pd
import numpy as np

Like last week, we need to load the data from the TSV file. This time we will use only the Amazon reviews dataset.

In [2]:
df = pd.read_csv("../Datasets/amazon_cells_labelled.tsv", sep="\t") # adjust the path as needed
df.head()

Unnamed: 0,sentence,score
0,So there is no way for me to plug it in here i...,0
1,"Good case, Excellent value.",1
2,Great for the jawbone.,1
3,Tied to charger for conversations lasting more...,0
4,The mic is great.,1
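
The `Unnamed: 0` column in the preview above is most likely the row index that was written out when the file was saved. The sketch below is only an illustration on a made-up two-row TSV (the rows and variable names are hypothetical, and the file is assumed to start its header with an empty column name). It shows how `index_col=0` tells `pandas` to use that column as the index instead of keeping it as data.

```python
import io
import pandas as pd

# Hypothetical two-row TSV with the same layout as the preview above
sample_tsv = "\tsentence\tscore\n0\tGood case.\t1\n1\tBad mic.\t0\n"

# Default behaviour: the unnamed first column is kept as a data column
with_extra_column = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# index_col=0: the first column becomes the row index
with_index = pd.read_csv(io.StringIO(sample_tsv), sep="\t", index_col=0)

print(list(with_extra_column.columns))
print(list(with_index.columns))
```

Either form works for this assignment, since only the `sentence` and `score` columns are used later.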


With our dataframe made, we now need to clean it before analyzing. Apply the `remove_punctuation()` and `remove_stopwords()` functions on our dataset to clean it. Save the cleaned data to a new column named `cleaned_sentence`.

In [None]:
#Uncomment the line below if you need to download the stopwords
#nltk.download('stopwords')
stop = stopwords.words('english')

def remove_punctuation(text):
    translator = str.maketrans('','', string.punctuation)
    return text.translate(translator)

def remove_stopwords(text):
    text = [word.lower() for word in text.split() if word.lower() not in stop]
    return " ".join(text)

df[_] = df['sentence'].apply(remove_punctuation).apply(remove_stopwords)
df.head()
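
To see what the two cleaning steps do to a single review, here is a small standalone sketch. It uses a tiny hand-written stop-word set instead of the NLTK list (an assumption, so that it runs without any download), and the helper names are chosen for this illustration only.

```python
import string

# Tiny stand-in for NLTK's English stop-word list (assumption for this sketch)
demo_stop = {"the", "is", "for", "to", "it", "in", "me"}

def strip_punctuation(text):
    # Delete every character listed in string.punctuation
    return text.translate(str.maketrans("", "", string.punctuation))

def drop_stopwords(text, stop_words):
    # Lowercase each word and keep it only if it is not a stop word
    return " ".join(w.lower() for w in text.split() if w.lower() not in stop_words)

raw = "Good case, Excellent value."
cleaned = drop_stopwords(strip_punctuation(raw), demo_stop)
print(cleaned)  # good case excellent value
```

Note the order: punctuation is removed first, so a word like `case,` is compared against the stop-word list as `case`.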

We need to convert our text into a matrix of word counts before using LDA. In the cell below, create a `CountVectorizer()`. Then, call its `fit_transform()` method with `df['cleaned_sentence']` as the argument.

In [None]:
vect = _(max_features = 5000, max_df=.15) # Your Code Here
X = vect._(df[_]) # Your Code Here

Using the `LatentDirichletAllocation()` class below, we want to pass it 10 components. You can adjust the max iterations for your local setup, or leave it as 25 if unsure.

In [None]:

lda = _(n_components=10, learning_method="batch", max_iter=25, random_state=0) # Your Code Here
document_topics = lda._(X)
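
It helps to know what shapes to expect from LDA. The following sketch fits a two-topic model on a made-up six-sentence corpus (far too small to yield meaningful topics; all names and settings here are illustrative assumptions). The transformed output has one row per document and one column per topic, and each row is a probability distribution over topics.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus, only meant to show the shapes involved
docs = ["battery life great", "battery died fast", "great sound quality",
        "sound quality poor", "battery charger broken", "great sound"]

counts = CountVectorizer().fit_transform(docs)
demo_lda = LatentDirichletAllocation(n_components=2, learning_method="batch",
                                     max_iter=10, random_state=0)
doc_topics = demo_lda.fit_transform(counts)

print(doc_topics.shape)            # (6, 2): documents x topics
print(doc_topics.sum(axis=1))      # every row sums to 1
print(demo_lda.components_.shape)  # (2, vocabulary size): topics x words
```

The `components_` array is what the next step sorts to find the most important words of each topic.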

And finally, let's see the results! Call the `print_topics()` function below, passing in `feature_names` and `sorting`.

In [None]:
sorting = np.argsort(lda.components_, axis=1)[:, ::-1]
feature_names = np.array(vect.get_feature_names_out())

#slide 27
def print_topics(topics, feature_names, sorting, topics_per_chunk, n_words):
    # Print the topics in chunks so that each row of words fits on screen
    for start in range(0, len(topics), topics_per_chunk):
        these_topics = topics[start: start + topics_per_chunk]

        print(*these_topics)
        print("----------------------")

        # Row j holds the j-th most important word of every topic in this chunk
        for j in range(n_words):
            try:
                print(*feature_names[sorting[these_topics, j]])
            except IndexError as e:
                # Raised when n_words exceeds the number of features
                print(e)
            
            
_(topics=range(10), feature_names=feature_names, sorting=sorting, topics_per_chunk=5, n_words=10) # Your Code Here
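
The `sorting` array is the least obvious part of the cell above. A minimal sketch with hypothetical weights (2 topics, 3 words; the values are invented) shows what `np.argsort(..., axis=1)[:, ::-1]` produces: for each topic, the word indices ordered from the largest weight to the smallest.

```python
import numpy as np

# Hypothetical topic-word weights: 2 topics (rows) x 3 words (columns)
weights = np.array([[0.1, 3.0, 1.2],
                    [2.0, 0.5, 0.7]])
words = np.array(["battery", "great", "phone"])

# argsort sorts ascending, so reverse each row to put the heaviest words first
order = np.argsort(weights, axis=1)[:, ::-1]

print(order)                # rows: [1 2 0] and [0 2 1]
print(words[order[:, 0]])   # top word of each topic: 'great', then 'battery'
```

Indexing `feature_names` with a column of `sorting`, as `print_topics()` does, therefore returns the word at the same rank for every topic at once.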




## ChatBot


Let's build a simple chatbot using rules and sentence similarity. In this particular case we will use the TF-IDF vectorizer to convert our sentences into vectors. Note that modern chatbots typically use deep learning models, but this is a good exercise to understand the basics of how chatbots can work.

We will perform a little more preprocessing this time. In addition to removing punctuation and stop words, we will also lemmatize the words in our dataset. Lemmatization is the process of reducing a word to its dictionary form (its lemma). For example, "phones" becomes "phone", and "running" becomes "run" when it is treated as a verb. Note that NLTK's `WordNetLemmatizer` treats every word as a noun unless a part of speech is given. Lemmatization reduces the size of the vocabulary, which can make it easier to match a query with similar sentences.

In [None]:
import random
import string
from nltk.stem import WordNetLemmatizer
from nltk import sent_tokenize, word_tokenize
sent_tokens = df[_].str.lower().tolist()


Let's define our lemmatization function first. You need to use the `WordNetLemmatizer` from the `nltk` library. Make sure to download the WordNet data if you haven't already.

In [None]:
nltk.download('punkt')    # sentence/token splitter (recent NLTK releases may ask for 'punkt_tab' instead)
nltk.download('wordnet')  # for lemmatization

lemmer = WordNetLemmatizer()

# translation table that deletes every punctuation character;
# LemNormalize below removes punctuation, tokenizes, and lemmatizes in one call
remove_punct = dict((ord(p), None) for p in string.punctuation)
def LemTokens(tokens):
    return [lemmer.lemmatize(t) for t in tokens]

def LemNormalize(text): # Normalize text by removing punctuation, tokenizing, and lemmatizing
    return LemTokens(word_tokenize(text.lower().translate(remove_punct)))
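
The translation-table trick used in `LemNormalize` can be tried on its own. In the sketch below, the lemmatization step is left out and a plain whitespace split stands in for `word_tokenize`, so it runs without downloading any NLTK data. The sample sentence and variable names are made up for illustration.

```python
import string

# Map every punctuation code point to None so str.translate deletes it
punct_table = {ord(p): None for p in string.punctuation}

text = "Great phone!!! Works well, so far..."
normalized = text.lower().translate(punct_table)

print(normalized)          # great phone works well so far
print(normalized.split())  # whitespace split, used here in place of word_tokenize
```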


Let's define some greeting inputs and responses. These will be used to match user inputs with predefined responses. Add your own greetings and responses to the lists below.

In [None]:
GREETING_INPUTS  = {"hello","hi", ...} # Add more greetings as needed
GREETING_RESPONSES = ["hi","hey", "sup", ...]
# Add more greeting responses as needed

def greeting(sentence):
    for word in sentence.split():
        if word in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)
    return None
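
The greeting check is an exact word match, which is worth seeing in action before relying on it. This standalone sketch uses example vocabularies invented for the illustration (they are not the lists you are asked to complete) and a copy of the function under a different name.

```python
import random

# Example vocabularies chosen for this sketch only
demo_inputs = {"hello", "hi", "hey"}
demo_responses = ["hi", "hey", "hello there"]

def demo_greeting(sentence):
    # Return a random greeting if any whitespace-separated word is a known greeting
    for word in sentence.split():
        if word in demo_inputs:
            return random.choice(demo_responses)
    return None

print(demo_greeting("hello robo"))  # one of the demo responses
print(demo_greeting("Hello!"))      # None: "Hello!" is not exactly "hello"
print(demo_greeting("nice phone"))  # None
```

Because the match is exact, input should be lowercased before the check, and a greeting with attached punctuation such as `hello!` will not be recognized unless punctuation is stripped first.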


We now need to define a response function that takes the user input and returns the most similar sentence in our dataset.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise    import cosine_similarity

def response(user_response):
    robo_response = ""
    # temporarily add user query so TF-IDF matrix includes it
    sent_tokens.append(user_response)
    tfidf     = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english').fit_transform(sent_tokens)
    vals      = cosine_similarity(tfidf[-1], tfidf)
    idx       = vals.argsort()[0][-2]    # second-highest similarity
    flat      = vals.flatten()
    flat.sort()
    sim_score = flat[-2]
    sent_tokens.pop()                    # remove user query

    if sim_score == 0: # if no similar sentences found
        robo_response = "I’m sorry, I don’t understand."
    else:
        robo_response = sent_tokens[idx]
    return robo_response
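
The retrieval idea behind `response()` can be checked on a toy corpus. The sketch below makes several assumptions for illustration: three made-up sentences stand in for the dataset, the default tokenizer is used instead of `LemNormalize`, and the query is appended to a copy of the list rather than to the list itself. The query is vectorized together with the corpus, and the sentence with the second-highest cosine similarity is returned, because the highest score is always the query matched with itself.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up review sentences standing in for the dataset
corpus = ["the battery life is great",
          "the screen cracked after a week",
          "great sound quality"]
query = "how is the battery life"

# Vectorize corpus and query together; the query is the last row
tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus + [query])
sims = cosine_similarity(tfidf[-1], tfidf).flatten()

# The top score is the query compared with itself, so take the runner-up
best_idx = sims.argsort()[-2]
print(corpus[best_idx])  # the battery life is great
```

If the runner-up score is 0, no sentence shares a term with the query, which is the case the "I don't understand" branch handles.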

Now let's create the interface.

In [None]:
def chat():
    # Introduce the bot, then keep answering until the user ends the chat
    print("ROBO: My name is Robo. Ask me anything about product reviews. Type 'bye' to exit.")
    while True:
        user_input = input("YOU: ").lower().strip()
        if user_input == 'bye': # exit condition. Important!
            print("ROBO: Goodbye! Take care.")
            break
        if user_input in ('thanks','thank you'):
            print("ROBO: You’re welcome!")
            break
        # greeting?
        greet = greeting(user_input)
        if greet:
            print(f"ROBO: {_}") # YOUR Code Here
        else:
            print(f"ROBO: {response(_)}") # YOUR Code Here

Now test it. See if you can find a few sentences that match the reviews in the dataset.

In [None]:
chat()