# Agenda
1. Chatbot Overview with Langchain
2. Spotify Songs Recommender with Faiss vector database
  - dataset: https://www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs?select=spotify_songs.csv

In [None]:
from google.colab import drive
drive.mount('/content/drive')


# 1.0 Chatbot Overview with Langchain

- source: **langchain/chatbots**, with modifications

## Use case

Chatbots are one of the central LLM use-cases. The core features of chatbots are that they can have long-running conversations and have access to information that users want to know about.

Aside from basic prompting and LLMs, memory and retrieval are the core components of a chatbot. Memory allows a chatbot to remember past interactions, and retrieval provides a chatbot with up-to-date, domain-specific information.
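
To make the role of memory concrete, here is a minimal sketch in plain Python. It is not LangChain code: `BufferMemory`, `chat_turn`, and `fake_model` are hypothetical names invented for this illustration, and `fake_model` only stands in for a real chat model. The point is that the full message history is sent on every turn, which is what lets a chatbot refer back to earlier messages.

```python
# Hypothetical stand-in for a chat model: it only reports how much history it saw.
def fake_model(messages):
    return f"(reply after seeing {len(messages)} messages)"


class BufferMemory:
    """Illustrative buffer: keeps every message of the conversation in order."""

    def __init__(self):
        self.messages = []

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})


def chat_turn(memory, user_text):
    # Record the user message, then send the whole history to the model.
    memory.add("user", user_text)
    reply = fake_model(memory.messages)
    memory.add("assistant", reply)
    return reply


toy_memory = BufferMemory()
first = chat_turn(toy_memory, "my name is niken")
second = chat_turn(toy_memory, "what is my name?")
print(first)
print(second)
```

On the second turn the stand-in model receives three messages (the first question, the first reply, and the new question), so a real model would have the name available in its input.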

## Overview

The chat model interface is based around messages rather than raw text. Several components are important to consider for chat:

* `chat model`: See [here](/docs/integrations/chat) for a list of chat model integrations and [here](/docs/modules/model_io/chat) for documentation on the chat model interface in LangChain. You can use `LLMs` (see [here](/docs/modules/model_io/llms)) for chatbots as well, but chat models have a more conversational tone and natively support a message interface.
* `prompt template`: Prompt templates make it easy to assemble prompts that combine default messages, user input, chat history, and (optionally) additional retrieved context.
* `memory`: [See here](/docs/modules/memory/) for in-depth documentation on memory types
* `retriever` (optional): [See here](/docs/modules/data_connection/retrievers) for in-depth documentation on retrieval systems. These are useful if you want to build a chatbot with domain-specific knowledge.
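
The way these components combine into one model input can be sketched without any library. The helper below is an assumption made for illustration (`build_messages` is not a LangChain function): it places the default system message first, then optional retrieved context, then the chat history, and finally the new user input.

```python
def build_messages(system_text, chat_history, question, context=None):
    """Assemble a chat prompt as a list of (role, content) pairs."""
    messages = [("system", system_text)]
    if context:
        # Optional retrieved context is supplied as an extra system message.
        messages.append(("system", f"Context:\n{context}"))
    # Earlier turns fill the slot that a history placeholder reserves.
    messages.extend(chat_history)
    messages.append(("human", question))
    return messages


history = [("human", "hi how are you"), ("ai", "I am fine, thanks!")]
prompt_messages = build_messages(
    "You are a nice chatbot having a conversation with a human.",
    history,
    "what is my name?",
)
for role, content in prompt_messages:
    print(f"{role}: {content}")
```

Passing `context="..."` adds one more message before the history, which is roughly the position a retriever's output takes in a retrieval-backed chatbot.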

## Quickstart

Here's a quick preview of how we can create chatbot interfaces. First let's install some dependencies and set the required credentials:

In [None]:
!pip install langchain openai tiktoken chromadb

In [None]:
from google.colab import userdata
import os

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

## Regular Conversation

In [None]:
from langchain.prompts import (
    ChatPromptTemplate,
    MessagesPlaceholder,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain, ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

In [None]:
# LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo")

In [None]:
# Prompt
prompt = ChatPromptTemplate(
    messages=[
        SystemMessagePromptTemplate.from_template(
            "You are a nice chatbot having a conversation with a human."
        ),
        # The `variable_name` here is what must align with memory
        MessagesPlaceholder(variable_name="chat_history"),
        HumanMessagePromptTemplate.from_template("{question}"),
    ]
)

In [None]:
# Note that we set `return_messages=True` so the history fits into the MessagesPlaceholder
# Note that `"chat_history"` matches the MessagesPlaceholder variable name
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

In [None]:
conversation = LLMChain(llm=llm, prompt=prompt, verbose=True, memory=memory)

In [None]:
# Notice that we just pass in the `question` variables - `chat_history` gets populated by memory
conversation({"question": "hi how are you"})

In [None]:
conversation(
    {"question": "my name is niken, i want to ask you something"}
)

In [None]:
conversation({"question": "do you know natural language processing (NLP) ?"})

In [None]:
conversation({"question": "what is the difference between natural language understanding and natural language generation?"})

In [None]:
conversation({"question": "what is my name?"})

In [None]:
conversation({"question": "who wins in the battle between gojo satoru and ryomen sukuna?"})

## Retriever Conversation
- add new context about Gojo vs. Sukuna

In [None]:
loader = WebBaseLoader("https://beebom.com/jujutsu-kaisen-gojo-vs-sukuna/")
data = loader.load()

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)

vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
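
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a much simpler splitter than the one used above. The function below is a naive fixed-window splitter written for this illustration (`split_fixed` is a hypothetical name). `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs, newlines, and spaces, which this sketch does not attempt.

```python
def split_fixed(text, chunk_size=500, chunk_overlap=0):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij" * 3  # 30 characters
no_overlap = split_fixed(sample, chunk_size=10, chunk_overlap=0)
with_overlap = split_fixed(sample, chunk_size=10, chunk_overlap=5)
print(no_overlap)
print(with_overlap)
```

With an overlap of zero, as in the cell above, a sentence that straddles a chunk boundary is cut in two. A small overlap repeats the boundary text in both neighbouring chunks at the cost of storing more chunks.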

In [None]:
# standalone embedding model instance (not used by the cells below)
embd = OpenAIEmbeddings()


In [None]:
retriever = vectorstore.as_retriever()
qa_retriever = ConversationalRetrievalChain.from_llm(llm, retriever=retriever, memory=memory)

In [None]:
qa_retriever("who wins in the battle between gojo satoru and ryomen sukuna?")

# 2.0 Similarity Search using Faiss
- https://github.com/facebookresearch/faiss
- https://python.langchain.com/docs/integrations/vectorstores/faiss

In this project, we use Faiss, a library for efficient similarity search over dense vectors, as the vector index behind our song recommendation system. Faiss lets us search a large collection efficiently to find songs that closely match the ones you’ve enjoyed or are interested in.
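
As a rough picture of what an exact ("flat") L2 index computes, the sketch below scores every stored vector against the query and keeps the closest ones. It is a plain-Python illustration, not Faiss code, and `flat_search` is a hypothetical helper. Like `faiss.IndexFlatL2`, it reports squared Euclidean distances.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(vectors, query, k):
    """Exhaustive search: score every stored vector and keep the k closest."""
    scored = sorted((squared_l2(vec, query), i) for i, vec in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_distances, toy_indices = flat_search(stored, [0.9, 0.1], k=2)
print(toy_indices)
print(toy_distances)
```

The returned positions index into the stored vectors, which is why the rows added to the index must stay in the same order as the rows of the DataFrame used to look up song titles.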

In [None]:
!pip install faiss-gpu

In [None]:
import numpy as np
import pandas as pd
import random

import faiss

from sklearn.feature_extraction.text import TfidfVectorizer

import warnings
warnings.filterwarnings('ignore')

In [None]:
data = pd.read_csv(r'/content/drive/MyDrive/Colab Notebooks/NLP_AI_Week_Ruangguru/spotify_songs.csv').dropna().reset_index(drop=True)

In [None]:
data.head(2)

In [None]:
data_clean = data[['track_name', 'track_artist']].drop_duplicates().reset_index(drop=True)

In [None]:
data_clean.shape

In [None]:
#create corpus
data_clean['corpus'] = data_clean['track_artist'] + " " + data_clean['track_name']

song_corpus = data_clean['corpus']

In [None]:
#vectorization
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3,3), min_df=5)
X = vectorizer.fit_transform(song_corpus)
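
To see which features `analyzer='char_wb'` with `ngram_range=(3,3)` produces, the sketch below approximates the analyzer in plain Python: text is lowercased, each word is padded with a space on both sides, and trigrams are taken inside the padded word only. `char_wb_trigrams` is a hypothetical helper that follows the documented behaviour as I understand it. It does not apply TF-IDF weighting or the `min_df=5` filter.

```python
def char_wb_trigrams(text, n=3):
    """Approximate character n-grams taken inside word boundaries."""
    grams = []
    for word in text.lower().split():
        padded = f" {word} "  # pad each word with one space on both sides
        if len(padded) <= n:
            grams.append(padded)  # a very short word contributes a single gram
            continue
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


print(char_wb_trigrams("Coldplay"))
print(char_wb_trigrams("Fix You"))
```

Because features are character fragments rather than whole words, a slightly misspelled query still shares most of its trigrams with the intended artist or title.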

In [None]:
X.shape

In [None]:
#convert sparse matrix to numpy array
X_array = np.float32(X.toarray())

# create vector database index
index = faiss.IndexFlatL2(X_array.shape[1])

# add vectors to the index
index.add(X_array)
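
A note on the distance metric: `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so every non-empty vector has unit length. For unit vectors, squared L2 distance equals `2 - 2 * cosine similarity`, so ranking by L2 distance gives the same order as ranking by cosine similarity. The sketch below checks that identity numerically on two made-up vectors. It assumes non-zero inputs; a query that shares no trigram with the vocabulary becomes an all-zero vector, for which the identity does not apply.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    # A plain dot product is the cosine here because both inputs are unit length.
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([1.0, 2.0, 0.0])
b = l2_normalize([2.0, 1.0, 1.0])
lhs = squared_l2(a, b)
rhs = 2 - 2 * cosine(a, b)
print(round(lhs, 6), round(rhs, 6))
```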

In [None]:
# testing search
search_text = ["coldplay"]
search_text_vector = vectorizer.transform(search_text)
search_text_vector_array = np.float32(search_text_vector.toarray())

distances, indices = index.search(search_text_vector_array, 5)

for song_index in indices[0]:
    print(f"Song Title: {data_clean['track_name'][song_index]} from {data_clean['track_artist'][song_index]}")

In [None]:
# create recommendation function
def recommend_song(title):
    # vectorize the query with the fitted TF-IDF vectorizer
    search_text_vector = vectorizer.transform([title])
    search_text_vector_array = np.float32(search_text_vector.toarray())
    # retrieve the 5 nearest songs, then report the top 3
    distances, indices = index.search(search_text_vector_array, 5)

    return tuple(
        f"Song Title: {data_clean['track_name'][i]} from {data_clean['track_artist'][i]}"
        for i in indices[0][:3]
    )

In [None]:
recommend_song("chainsmokers")

In [None]:
recommend_song("coldplay")

In [None]:
recommend_song("Denny Caknan")

In [None]:
recommend_song("Happy Asmara")