<a href="https://colab.research.google.com/github/krishanuc/CapstoneSentimentAnalysis/blob/airbnb-dataset/TopicModeling_SentimentAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **IISc CCE AI/ML Capstone Project - Topic modeling and sentiment analysis based on user review data.**

### **Purpose:**
This project takes user reviews as input and aims to:

*   Identify the topics discussed
*   Model the topics
*   Cluster the topics
*   Run unsupervised sentiment analysis on each topic cluster, reporting the Top N positive, negative, and neutral sentiment topics
*   Identify the Top N most impactful topics for improving the overall rating

### **Approach:**

We take a phased approach, organized into the following milestones:

#### **Milestone-1**

*  Cleaning and preparing data
*  Topic modeling
*  Unsupervised topic clustering
*  End of Milestone-1 - Top N trending topics

#### **Milestone-2**

* Unsupervised sentiment analysis on each topic cluster
* End of Milestone-2 - Top N positive, negative, and neutral sentiment topics
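
One simple unsupervised option for this milestone is a lexicon-based score: count positive and negative words in each review and label it by the sign of the difference. The sketch below assumes this approach; the tiny word lists and the `score_review` and `label_review` helpers are hypothetical, and a real run would likely rely on an established lexicon or model instead.

```python
import re

# Tiny hand-made lexicons (hypothetical; for illustration only).
POSITIVE = {"clean", "nice", "great", "friendly", "marvelous", "convenient"}
NEGATIVE = {"dirty", "dirtier", "bad", "noisy", "rude", "broken"}


def score_review(text):
    """Return (#positive words - #negative words) for one review."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def label_review(text):
    """Map the score to 'positive', 'negative', or 'neutral'."""
    score = score_review(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(label_review("The place is clean and the host is very nice"))  # positive
print(label_review("It was much dirtier in person"))                 # negative
print(label_review("We stayed for two nights"))                      # neutral
```

Averaging such scores over the reviews in a topic cluster would give one possible way to rank clusters by sentiment.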

#### **Milestone-3**

* End of Milestone-3 - Top N most impactful topics for improving the overall rating

### **Milestone-1**



1.   Import all the necessary libraries

The dataset comes from Kaggle (https://www.kaggle.com/datasets/rhonarosecortez/new-york-airbnb-open-data/data) and contains 3 files:


*   calendar.csv
*   listings.csv
*   reviews.csv

We will work only with reviews.csv, since the goal is topic modeling on the reviews.



In [1]:
# Importing all necessary libraries

# utilities
import re
import pickle
import numpy as np
import pandas as pd

# plotting
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt


# sklearn
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report


2. Import and view the dataset

In [5]:
# Importing the dataset
#COLUMNS  = ["listing_id", "id", "date", "reviewer_id", "reviewer_name", "comments"]
# ENCODING = "ISO-8859-1"
df = pd.read_csv('/content/reviews.csv', engine='python', on_bad_lines='skip' )

df.head(10) # view the first 10 rows of the dataframe

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2992450,15066586,2014-07-01,16827297,Kristen,Large apartment; nice kitchen and bathroom. Ke...
1,2992450,21810844,2014-10-24,22648856,Christopher,"This may be a little late, but just to say Ken..."
2,2992450,27434334,2015-03-04,45406,Altay,The apartment was very clean and convenient to...
3,2992450,28524578,2015-03-25,5485362,John,Kenneth was ready when I got there and arrange...
4,2992450,35913434,2015-06-23,15772025,Jennifer,We were pleased to see how 2nd Street and the ...
5,2992450,38893053,2015-07-19,11614467,Stephanie,"The flat is not in a good area, while we were ..."
6,2992450,57989144,2015-12-31,28580637,Betty,The apartment was centrally located near all t...
7,2992450,457366954464901293,2021-09-22,413779309,Carolina,The place is clean and the host is very nice
8,2992450,695544085190177036,2022-08-17,19928494,Kyle,It was much dirtier in person and half the fur...
9,3820211,17665203,2014-08-15,11024290,Abigail,We had a marvelous time staying at Terra's bea...
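
Before dropping columns, it can help to check for missing or duplicated reviews. The following is a minimal sketch on a small inline sample; the rows and the names `csv_text`, `sample`, and `cleaned` are made up for illustration, and in the notebook the same calls would apply to `df`.

```python
import io

import pandas as pd

# Small inline stand-in for reviews.csv (hypothetical rows).
csv_text = """listing_id,id,date,reviewer_id,reviewer_name,comments
1,10,2014-07-01,100,Ann,Large apartment; nice kitchen.
1,11,2014-10-24,101,Bob,
2,12,2015-03-04,102,Cy,Very clean and convenient.
2,12,2015-03-04,102,Cy,Very clean and convenient.
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Count reviews with no text.
n_missing = sample["comments"].isna().sum()

# Count rows that exactly repeat an earlier row.
n_duplicates = sample.duplicated().sum()

# Drop duplicate rows and rows without review text.
cleaned = sample.drop_duplicates().dropna(subset=["comments"])

print(n_missing, n_duplicates, len(cleaned))
```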


In [10]:
# Keep only the columns needed for working with the reviews.
# Later on we want to correlate reviews with listings and perhaps with the reviewers.
# .copy() gives an independent DataFrame, avoiding SettingWithCopyWarning on later edits.
data = df[['listing_id', 'reviewer_id', 'comments']].copy()
# Replacing the values to ease understanding.
# dataset['sentiment'] = dataset['sentiment'].replace(4,1)

data.head(10)

Unnamed: 0,listing_id,reviewer_id,comments
0,2992450,16827297,Large apartment; nice kitchen and bathroom. Ke...
1,2992450,22648856,"This may be a little late, but just to say Ken..."
2,2992450,45406,The apartment was very clean and convenient to...
3,2992450,5485362,Kenneth was ready when I got there and arrange...
4,2992450,15772025,We were pleased to see how 2nd Street and the ...
5,2992450,11614467,"The flat is not in a good area, while we were ..."
6,2992450,28580637,The apartment was centrally located near all t...
7,2992450,413779309,The place is clean and the host is very nice
8,2992450,19928494,It was much dirtier in person and half the fur...
9,3820211,11024290,We had a marvelous time staying at Terra's bea...
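
With the columns selected, the next step is cleaning the review text. One way this could look is sketched below; the `clean_review` helper and its specific rules (lowercasing, removing HTML line breaks in case the raw text contains them, keeping letters only) are assumptions, not the project's final pipeline.

```python
import re


def clean_review(text):
    """Lowercase a review and strip HTML line breaks, non-letters, and extra spaces."""
    text = text.lower()
    text = re.sub(r"<br\s*/?>", " ", text)    # replace HTML line breaks with a space
    text = re.sub(r"[^a-z\s]", " ", text)     # keep only ASCII letters and whitespace
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


print(clean_review("Large apartment; nice kitchen and bathroom.<br/>Would stay again!"))
# large apartment nice kitchen and bathroom would stay again
```

In the notebook this could be applied with `data['comments'].dropna().astype(str).apply(clean_review)`. Note that the letters-only rule also removes digits and non-English characters, which may not be desirable for every review.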
