# LLM Project - Dataframe question and answer system for Hotel Reviews using LangChain and Google PaLM

In this project, I build a question-and-answer system with the Google PaLM LLM and LangChain. It lets a user in hotel operations retrieve customer reviews from a dataframe of collected hotel reviews by entering a topic or keyword.

The dataframe is built from the following dataset, available on Kaggle:

https://www.kaggle.com/datasets/michelhatab/hotel-reviews-bookingcom



# Importing libraries

In [1]:
import pandas as pd

Importing the hotel review dataset from Kaggle

In [2]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content/'
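
The Kaggle CLI reads its API token from a `kaggle.json` file, which it looks up in the directory named by `KAGGLE_CONFIG_DIR`. The following is a minimal sketch of a pre-flight check, assuming the token has been uploaded to `/content/`. The helper name `has_kaggle_token` is hypothetical and not part of the Kaggle package.

```python
import os
from pathlib import Path


def has_kaggle_token(config_dir: str) -> bool:
    """Return True if a kaggle.json file exists in config_dir (hypothetical helper)."""
    return (Path(config_dir) / "kaggle.json").is_file()


# Fall back to /content/ (the Colab working directory used in this notebook)
config_dir = os.environ.get("KAGGLE_CONFIG_DIR", "/content/")
if not has_kaggle_token(config_dir):
    print(f"kaggle.json not found in {config_dir}; upload it before running the download cell")
```

Running a check like this before the download cell makes a missing token easier to spot than the CLI's own error.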

In [3]:
!kaggle datasets download -d michelhatab/hotel-reviews-bookingcom

Downloading hotel-reviews-bookingcom.zip to /content
  0% 0.00/83.1k [00:00<?, ?B/s]
100% 83.1k/83.1k [00:00<00:00, 46.5MB/s]


In [4]:
!unzip 'hotel-reviews-bookingcom.zip'

Archive:  hotel-reviews-bookingcom.zip
  inflating: La_Veranda_Reviews-2023-01-16.csv  


In [95]:
df = pd.read_csv('La_Veranda_Reviews-2023-01-16.csv')
df.head()

| | Title | PositiveReview | NegativeReview | Score | GuestName | GuestCountry | RoomType | NumberOfNights | VisitDate | GroupType | PropertyResponse |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wonderful place to stay. | New, comfortable apartments, close to the airp... | Nothing at all. | 10.0 | Olga | Norway | Budget Twin Room | 1 night | June 2022 | Solo traveler | |
| 1 | It was superb | We had a really pleasant stay! The staff was v... | | 10.0 | Iwona | Poland | Double Room | 3 nights | December 2022 | Family | |
| 2 | Very Good | the location is great and near the airport. bu... | | 8.0 | Ruijia | Sweden | Double Room | 1 night | December 2022 | Solo traveler | |
| 3 | Wonderful | Great stuff\nGreat Quality/price\nClean | | 9.0 | Theprincem | United Kingdom | Double Room with Balcony | 2 nights | September 2022 | Solo traveler | |
| 4 | Fantastic value for a new, modern and spotless... | Clean and modern with very comfortable beds, i... | | 10.0 | M | Switzerland | Family Suite with Balcony | 1 night | October 2022 | Family | |


# Explore the dataset

In [96]:
df.shape

(1523, 11)

In [97]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1523 entries, 0 to 1522
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Title             1521 non-null   object 
 1   PositiveReview    775 non-null    object 
 2   NegativeReview    435 non-null    object 
 3   Score             1523 non-null   float64
 4   GuestName         1523 non-null   object 
 5   GuestCountry      1523 non-null   object 
 6   RoomType          1460 non-null   object 
 7   NumberOfNights    1523 non-null   object 
 8   VisitDate         1523 non-null   object 
 9   GroupType         1523 non-null   object 
 10  PropertyResponse  123 non-null    object 
dtypes: float64(1), object(10)
memory usage: 131.0+ KB


I will drop the GuestName and PropertyResponse columns, since the remaining columns matter more when hotel operators look at an overview of their reviews.

In [98]:
df = df.drop(columns=['GuestName', 'PropertyResponse'])
df.head()

| | Title | PositiveReview | NegativeReview | Score | GuestCountry | RoomType | NumberOfNights | VisitDate | GroupType |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Wonderful place to stay. | New, comfortable apartments, close to the airp... | Nothing at all. | 10.0 | Norway | Budget Twin Room | 1 night | June 2022 | Solo traveler |
| 1 | It was superb | We had a really pleasant stay! The staff was v... | | 10.0 | Poland | Double Room | 3 nights | December 2022 | Family |
| 2 | Very Good | the location is great and near the airport. bu... | | 8.0 | Sweden | Double Room | 1 night | December 2022 | Solo traveler |
| 3 | Wonderful | Great stuff\nGreat Quality/price\nClean | | 9.0 | United Kingdom | Double Room with Balcony | 2 nights | September 2022 | Solo traveler |
| 4 | Fantastic value for a new, modern and spotless... | Clean and modern with very comfortable beds, i... | | 10.0 | Switzerland | Family Suite with Balcony | 1 night | October 2022 | Family |


The PositiveReview column is the one that will be embedded (vectorized) and searched by topic or keyword, so I will drop the rows where PositiveReview is missing.

In [99]:
df = df.dropna(subset=['PositiveReview'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 775 entries, 0 to 787
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Title           775 non-null    object 
 1   PositiveReview  775 non-null    object 
 2   NegativeReview  424 non-null    object 
 3   Score           775 non-null    float64
 4   GuestCountry    775 non-null    object 
 5   RoomType        748 non-null    object 
 6   NumberOfNights  775 non-null    object 
 7   VisitDate       775 non-null    object 
 8   GroupType       775 non-null    object 
dtypes: float64(1), object(8)
memory usage: 60.5+ KB


For testing, I will start with the first 300 reviews in the dataframe.

In [100]:
df = df[:300]
len(df)

300

# LangChain and Vertex AI setup

In [None]:
!pip install langchain

In [103]:
from google.colab import auth as google_auth
google_auth.authenticate_user()

In [104]:
PROJECT_ID = 'llmpalm'
LOCATION = 'asia-northeast1'

In [105]:
import vertexai
vertexai.init(project = PROJECT_ID, location = LOCATION)

In [106]:
from langchain.llms import VertexAI

In [107]:
llm = VertexAI(
    model_name="text-bison@001",
    max_output_tokens=512,
    temperature=0.7,
    top_p=0.8,
    top_k=40,
    verbose=True,
)

# Loading dataframe with document_loaders

In [109]:
from langchain.document_loaders import DataFrameLoader

In [110]:
# page_content_column selects the column whose text will be embedded/vectorized;
# all other columns are stored as metadata on each document
loader = DataFrameLoader(df, page_content_column='PositiveReview')
data = loader.load()
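
To make the split between `page_content` and `metadata` concrete, here is a rough, library-free sketch of what the loader does for each row. `ToyDocument` and `rows_to_documents` are hypothetical names used only for illustration. This is not LangChain's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(rows, page_content_column):
    """Turn each row (a dict) into a ToyDocument.

    The chosen column becomes the text to embed; every other column
    is carried along as metadata.
    """
    docs = []
    for row in rows:
        metadata = {k: v for k, v in row.items() if k != page_content_column}
        docs.append(ToyDocument(page_content=row[page_content_column], metadata=metadata))
    return docs


# One row taken from the reviews, reduced to three columns
rows = [
    {"Title": "Very Good", "PositiveReview": "The food was very good,the chef is kind.", "Score": 8.0},
]
docs = rows_to_documents(rows, page_content_column="PositiveReview")
print(docs[0].page_content)
print(docs[0].metadata)
```

Only `page_content` is embedded later, so columns such as Score or RoomType do not affect similarity search, though they are still returned with each hit.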

In [111]:
len(data)

300

In [112]:
data[0]

Document(page_content='New, comfortable apartments, close to the airport, to very clean beach.\nStaff is extremely helpful and easy to communicate with \nTasty food on the first floor, comfortable restaurant for both cozy evenings and calm work to escape the heat in the midday', metadata={'Title': 'Wonderful place to stay.', 'NegativeReview': 'Nothing at all.', 'Score': 10.0, 'GuestCountry': 'Norway', 'RoomType': 'Budget Twin Room', 'NumberOfNights': '1 night', 'VisitDate': 'June 2022', 'GroupType': 'Solo traveler'})

# Embedding and Vector Database

I will use HuggingFace embeddings to vectorize the data and FAISS as the vector database.



In [None]:
!pip install langchain sentence_transformers

In [114]:
from langchain.embeddings import HuggingFaceEmbeddings

In [115]:
embeddings = HuggingFaceEmbeddings()

In [116]:
from langchain.vectorstores import FAISS

In [117]:
!pip install faiss-cpu



In [118]:
vectordb = FAISS.from_documents(documents=data, embedding = embeddings)

In [119]:
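# Create a retriever for querying the vector database.
#
# For each new user question, the retriever embeds the question and then
# returns the documents whose stored vectors are most similar to that embedding.
#
# Note: score_threshold passed directly like this appears to have no effect
# (4 documents, the default k, are returned below). In LangChain a threshold is
# normally set with search_type="similarity_score_threshold" and
# search_kwargs={"score_threshold": 0.7}.

retriever = vectordb.as_retriever(score_threshold = 0.7)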
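
The retrieval step described in the comments above can be sketched without any libraries. This toy version assumes a bag-of-words "embedding" and cosine similarity, which is far cruder than the HuggingFace sentence embeddings and FAISS index used in this notebook. `toy_embed`, `cosine` and `retrieve` are hypothetical helpers, not LangChain functions.

```python
import math
from collections import Counter


def toy_embed(text):
    """Toy embedding: a bag-of-words count vector (NOT a real sentence embedding)."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return Counter(w for w in words if w)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, texts, k=4, score_threshold=0.0):
    """Return up to k (score, text) pairs with score >= score_threshold, best first."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(t)), t) for t in texts]
    scored = [pair for pair in scored if pair[0] >= score_threshold]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


reviews = [
    "The food was very good,the chef is kind.",
    "Friendly, good value, clean",
    "The perfect hotel.",
]
for score, text in retrieve("reviews on food", reviews, k=2, score_threshold=0.1):
    print(round(score, 2), text)
```

With a threshold, a query that matches nothing returns an empty list. Without one, the top `k` documents are always returned, however weak the match.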

Some testing:

In [120]:
rdocs = retriever.get_relevant_documents("Get me reviews on food")
rdocs

[Document(page_content='The food was very good,the chef is kind.', metadata={'Title': 'Very Good', 'NegativeReview': 'In the bathroom smelled bad', 'Score': 8.0, 'GuestCountry': 'Romania', 'RoomType': 'Double Room with Balcony', 'NumberOfNights': '7 nights', 'VisitDate': 'June 2022', 'GroupType': 'Family'}),
 Document(page_content='The stuff was very friendly and helpfull. They exceeded my expectations.  Breakfast for charge same as anywhere else was enormous size and very delicious.', metadata={'Title': 'Great place, super staff, huge delicious breakfast. Recommend', 'NegativeReview': 'Liked everythingz close to airport and town half an hour walking to town', 'Score': 10.0, 'GuestCountry': 'United Kingdom', 'RoomType': 'Double Room', 'NumberOfNights': '2 nights', 'VisitDate': 'June 2022', 'GroupType': 'Couple'}),
 Document(page_content='The food we had was excellent, service was excellent we will go back.', metadata={'Title': 'Wonderful', 'NegativeReview': nan, 'Score': 9.0, 'GuestCou

In [121]:
len(rdocs)

4

In [122]:
type(rdocs)

list

In [123]:
for doc in rdocs:
  print(doc)
  print()

page_content='The food was very good,the chef is kind.' metadata={'Title': 'Very Good', 'NegativeReview': 'In the bathroom smelled bad', 'Score': 8.0, 'GuestCountry': 'Romania', 'RoomType': 'Double Room with Balcony', 'NumberOfNights': '7 nights', 'VisitDate': 'June 2022', 'GroupType': 'Family'}

page_content='The stuff was very friendly and helpfull. They exceeded my expectations.  Breakfast for charge same as anywhere else was enormous size and very delicious.' metadata={'Title': 'Great place, super staff, huge delicious breakfast. Recommend', 'NegativeReview': 'Liked everythingz close to airport and town half an hour walking to town', 'Score': 10.0, 'GuestCountry': 'United Kingdom', 'RoomType': 'Double Room', 'NumberOfNights': '2 nights', 'VisitDate': 'June 2022', 'GroupType': 'Couple'}

page_content='The food we had was excellent, service was excellent we will go back.' metadata={'Title': 'Wonderful', 'NegativeReview': nan, 'Score': 9.0, 'GuestCountry': 'United Kingdom', 'RoomType'

# Prompt Template

In [125]:
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

In [126]:
prompt_template = """Given the following context and a question, generate an answer based on this context only.
In the answer try to provide as much as possible from the source dataframe context without making much changes.
If the answer is not found in the context or the answer to the given question is not relevant to be found from the source,
kindly state "I am sorry I dont know, I can only answer about reviews I have from the data." Don't try to make up an answer.

CONTEXT: {context}

QUESTION: {question}"""


PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
chain_type_kwargs = {"prompt": PROMPT}

In [127]:
chain = RetrievalQA.from_chain_type(llm=llm,
                            chain_type="stuff",
                            retriever=retriever,
                            input_key="query",
                            return_source_documents=True,
                            chain_type_kwargs=chain_type_kwargs)
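
As a rough illustration of what `chain_type="stuff"` means, the sketch below joins all retrieved review texts into a single context block and fills the prompt with it. It assumes an abbreviated version of the template above and a blank-line separator. `build_stuff_prompt` is a hypothetical helper, not the LangChain implementation.

```python
# Abbreviated version of the notebook's prompt template (assumption: shortened for the example)
PROMPT_TEMPLATE = """Given the following context and a question, generate an answer based on this context only.

CONTEXT: {context}

QUESTION: {question}"""


def build_stuff_prompt(question, retrieved_texts, separator="\n\n"):
    """Concatenate all retrieved texts into one context block and fill the template."""
    context = separator.join(retrieved_texts)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Two review texts that appear among the retrieved documents in this notebook
retrieved = [
    "The food was very good,the chef is kind.",
    "The food we had was excellent, service was excellent we will go back.",
]
print(build_stuff_prompt("Get me reviews on food", retrieved))
```

As far as I know, the stuff chain passes only each document's `page_content` into the context by default. If so, metadata such as Score or GuestCountry would not be visible to the LLM unless the document prompt is customized.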

Here are some examples of how reviews are retrieved from the source dataframe.

I have also included some irrelevant questions to check that the chain does not make up answers.

In [128]:
chain('Any reviews about the hotel location?')

{'query': 'Any reviews about the hotel location?',
 'result': 'The location is good. It is near to the airport, bus station and beach.',
 'source_documents': [Document(page_content='location is good you can take bus 425 from the airport to somewhere close enough to the hotel. stuff were super nice! room is spacious and clean, bed is comfy too.', metadata={'Title': 'perfect', 'NegativeReview': nan, 'Score': 10.0, 'GuestCountry': 'United Kingdom', 'RoomType': 'Double Room', 'NumberOfNights': '1 night', 'VisitDate': 'November 2021', 'GroupType': 'Couple'}),
  Document(page_content='Everything was very good. Staff very polite and helpful. Rooms are clean. Breakfast was delicious. Hotel is near to the beach, airport and bus station.', metadata={'Title': 'Excellent 👍', 'NegativeReview': nan, 'Score': 10.0, 'GuestCountry': 'Serbia', 'RoomType': 'Double Room', 'NumberOfNights': '2 nights', 'VisitDate': 'August 2022', 'GroupType': 'Family'}),
  Document(page_content='Our stay was just about per

In [129]:
chain('Pick reviews about service.')

{'query': 'Pick reviews about service.',
 'result': 'Excellent range of food offered and good service. \nVery pleasant, helpful and friendly staff.',
 'source_documents': [Document(page_content='the staff, the cleanness, the price, check out time', metadata={'Title': 'everything was exceptional', 'NegativeReview': 'nothing', 'Score': 10.0, 'GuestCountry': 'Cyprus', 'RoomType': 'Double Room with Balcony', 'NumberOfNights': '1 night', 'VisitDate': 'July 2022', 'GroupType': 'Couple'}),
  Document(page_content='Excellent range of food offered and good service.', metadata={'Title': 'excellent arrangements made to cope with midnight arrival', 'NegativeReview': 'no issues for overnight stay', 'Score': 8.0, 'GuestCountry': 'United Kingdom', 'RoomType': 'Double Room', 'NumberOfNights': '1 night', 'VisitDate': 'October 2022', 'GroupType': 'Couple'}),
  Document(page_content='Friendly, good value, clean', metadata={'Title': 'Exceptional', 'NegativeReview': nan, 'Score': 10.0, 'GuestCountry': 'Uni

In [130]:
chain('How many hotels are there in Paris?')

{'query': 'How many hotels are there in Paris?',
 'result': 'I am sorry I dont know, I can only answer about reviews I have from the data.',
 'source_documents': [Document(page_content='Nice modern and clean rooms. The location is excellent close to Airport, Makenzie  beach and touristic area, and bus stop going from Airport to town is few meters away. Very friendly staff.\nHighly recommended.', metadata={'Title': 'Pleasant stay!', 'NegativeReview': 'Nothing', 'Score': 10.0, 'GuestCountry': 'Lebanon', 'RoomType': 'Double Room', 'NumberOfNights': '1 night', 'VisitDate': 'September 2022', 'GroupType': 'Couple'}),
  Document(page_content='The modern, nice hotel, accommodation for one night before departure. The hotel arranged a taxi to the airport for 15 €', metadata={'Title': 'The modern, nice hotel with friendly staff', 'NegativeReview': nan, 'Score': 9.0, 'GuestCountry': 'Czech Republic', 'RoomType': 'Double Room with Balcony', 'NumberOfNights': '1 night', 'VisitDate': 'October 2021', 

In [131]:
chain('Give me the most famous hotel name in Tokyo.')

{'query': 'Give me the most famous hotel name in Tokyo.',
 'result': 'I am sorry I dont know, I can only answer about reviews I have from the data.',
 'source_documents': [Document(page_content='The perfect hotel.', metadata={'Title': 'Exceptional', 'NegativeReview': 'I liked everything.', 'Score': 10.0, 'GuestCountry': 'New Zealand', 'RoomType': 'Double Room with Balcony', 'NumberOfNights': '1 night', 'VisitDate': 'July 2022', 'GroupType': 'Solo traveler'}),
  Document(page_content='Exceptionally friendly service of hotel and restarant in same building. Attention and service superb.', metadata={'Title': 'Wonderful', 'NegativeReview': nan, 'Score': 9.0, 'GuestCountry': 'Lithuania', 'RoomType': 'Suite with Balcony', 'NumberOfNights': '1 night', 'VisitDate': 'June 2022', 'GroupType': 'Family'}),
  Document(page_content='Nice and simple hotel. Very convenient location close to the airport. The front desk was very friendly and helpful.', metadata={'Title': 'Wonderful', 'NegativeReview': 'T