# Airbnb New York Sentiment Analysis
In this project I focus on the overall sentiment of the Airbnb experience in New York during 2023, leaving an inferential analysis of the available features for a future project.
## Dataset
The dataset used is `reviews.csv`, which contains the following variables:
1. listing_id: foreign key for the ID of the Airbnb accommodation.
2. id: primary key, ID corresponding to each individual review.
3. date: date of the review.
4. reviewer_id: foreign key for the ID of the reviewer.
5. reviewer_name: first name of the reviewer (anonymization is guaranteed by the dataset provider).
6. comments: string containing the review.
## Cleaning
The cleaning process consists of the following steps:
1. Handling null values.
2. Ensuring primary key uniqueness.
3. Cleaning the text of each review.
4. Validating review length (more than one character after cleaning).

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

In [2]:
rev = pd.read_csv('data/reviews.csv', parse_dates=['date'])

In [3]:
rev.head(2)

| | listing_id | id | date | reviewer_id | reviewer_name | comments |
|---|---|---|---|---|---|---|
| 0 | 2595 | 17857 | 2009-11-21 | 50679 | Jean | `Notre séjour de trois nuits.\r<br/>Nous avons ...` |
| 1 | 2595 | 19176 | 2009-12-05 | 53267 | Cate | Great experience. |


In [4]:
rev.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 947328 entries, 0 to 947327
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   listing_id     947328 non-null  int64         
 1   id             947328 non-null  int64         
 2   date           947328 non-null  datetime64[ns]
 3   reviewer_id    947328 non-null  int64         
 4   reviewer_name  947325 non-null  object        
 5   comments       947094 non-null  object        
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 43.4+ MB


**Reviews per year:**

In [5]:
rev.date.dt.year.value_counts().sort_index()

date
2009        50
2010       455
2011      1729
2012      3508
2013      6722
2014     12953
2015     26377
2016     45555
2017     62775
2018     89637
2019    116582
2020     45032
2021     93942
2022    165445
2023    188842
2024     87724
Name: count, dtype: int64

Only the reviews from 2023 will be used.

In [6]:
rev23 = rev[rev.date.dt.year == 2023].copy()

#### Null values

In [7]:
rev23.isna().sum()

listing_id        0
id                0
date              0
reviewer_id       0
reviewer_name     0
comments         46
dtype: int64

In [8]:
rev23.dropna(inplace=True)
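
In the 2023 subset only `comments` has missing values, so dropping every row with any null gives the same result as dropping on that column alone. Restricting the drop to `comments` would make the intent explicit and would avoid discarding usable reviews if another column (for example `reviewer_name`, which has nulls in the full dataset) had gaps. A minimal sketch on a made-up toy frame, not on the real data:

```python
import pandas as pd

# Hypothetical toy data for illustration, not taken from reviews.csv
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "reviewer_name": ["Ana", None, "Li"],
    "comments": ["Great stay", "Nice host", None],
})

# Drop only the rows whose review text is missing
kept = toy.dropna(subset=["comments"])
print(kept["id"].tolist())  # [1, 2]

# A plain dropna() would also discard the row with a missing reviewer_name
print(len(toy.dropna()))  # 1
```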

#### Primary key

In [9]:
rev23.info()

<class 'pandas.core.frame.DataFrame'>
Index: 188796 entries, 87 to 933650
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   listing_id     188796 non-null  int64         
 1   id             188796 non-null  int64         
 2   date           188796 non-null  datetime64[ns]
 3   reviewer_id    188796 non-null  int64         
 4   reviewer_name  188796 non-null  object        
 5   comments       188796 non-null  object        
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 10.1+ MB


In [10]:
rev23.nunique()

listing_id        13789
id               188796
date                365
reviewer_id      175035
reviewer_name     38917
comments         182539
dtype: int64

`id` has as many unique values as there are rows (188796), so the primary key is unique.
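
The same check can be expressed directly instead of comparing counts by eye. The sketch below uses a hypothetical helper, `check_primary_key`, and made-up toy frames; it is one possible way to do it, not part of the original notebook.

```python
import pandas as pd

def check_primary_key(df: pd.DataFrame, key: str) -> bool:
    """Return True when `key` has no missing and no duplicated values."""
    return bool(df[key].notna().all() and df[key].is_unique)

# Hypothetical toy data for illustration
toy = pd.DataFrame({"id": [10, 11, 12], "comments": ["a b", "c d", "e f"]})
dupes = pd.DataFrame({"id": [10, 10, 12], "comments": ["a b", "c d", "e f"]})

print(check_primary_key(toy, "id"))    # True
print(check_primary_key(dupes, "id"))  # False

# Rows involved in a collision, useful to inspect before dropping anything
print(dupes[dupes["id"].duplicated(keep=False)])
```

On the real data this would be called as `check_primary_key(rev23, 'id')`.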

#### Text cleaning

In [11]:
def clean_text(text):
    text = re.sub(r'[^\w\s]', '', text)  # remove everything except letters, digits, underscores and whitespace
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace into a single space
    text = text.lower()  # lowercase
    return text
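
Some raw comments contain HTML line breaks such as `<br/>` (see the first row of `rev.head(2)`). With `clean_text`, stripping the punctuation leaves the letters `br` in the text, and leading or trailing spaces are kept. A possible variant, which is not the one used in the rest of this notebook, replaces such tags with a space first and trims the result. The name `clean_text_html`, the tag pattern and the sample string are assumptions for illustration; the counts reported further down come from `clean_text`.

```python
import re

def clean_text_html(text: str) -> str:
    """Variant of clean_text that drops <br> tags and trims whitespace."""
    text = re.sub(r'<br\s*/?>', ' ', text, flags=re.IGNORECASE)  # <br>, <br/>, <br /> -> space
    text = re.sub(r'[^\w\s]', '', text)  # keep only word characters and whitespace
    text = re.sub(r'\s+', ' ', text)     # collapse runs of whitespace
    return text.strip().lower()

# Made-up sample in the style of the raw comments
sample = "Great place!\r<br/>Would stay again."
print(clean_text_html(sample))  # great place would stay again
```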

In [12]:
rev23['comments'] = rev23.comments.apply(clean_text)

#### Invalid length of (cleaned) reviews

In [13]:
print('number of reviews with length 0 and 1:')
print(0, ': ', (rev23.comments.str.len() == 0).sum())
print(1, ': ', (rev23.comments.str.len() == 1).sum())

number of reviews with length 0 and 1:
0 :  548
1 :  84


In [14]:
rev23[rev23.comments.str.len() == 1].head()

| | listing_id | id | date | reviewer_id | reviewer_name | comments |
|---|---|---|---|---|---|---|
| 57159 | 986727 | 913817627344455011 | 2023-06-14 | 147393061 | Alvaro | a |
| 62814 | 860827 | 1006580575381246458 | 2023-10-20 | 55404456 | Peter | a |
| 86317 | 1349658 | 935527963596364388 | 2023-07-14 | 56028511 | Warren | h |
| 140223 | 4162519 | 886916421836727295 | 2023-05-08 | 487599270 | Robin | 好 |
| 166405 | 5094593 | 942021138245117665 | 2023-07-23 | 462585515 | Tarek | n |


In [15]:
rev23.drop(rev23[rev23.comments.str.len() <= 1].index, inplace=True)  # drop empty and single-character reviews

## Review length

We'll create columns with the character length and the number of words of each comment for future analysis.

In [16]:
rev23['rev_len'] = rev23.comments.str.len()
rev23['n_words'] = rev23.comments.str.split(' ').str.len()  # number of space-separated tokens

In [17]:
rev23.head(3)

| | listing_id | id | date | reviewer_id | reviewer_name | comments | rev_len | n_words |
|---|---|---|---|---|---|---|---|---|
| 87 | 5136 | 962324187708959042 | 2023-08-20 | 7246906 | Alix | rebeccas place is very spacious and located on... | 711 | 130 |
| 294 | 230877 | 931221940179651285 | 2023-07-08 | 22401560 | Ruth | this is the best airbnb experience we have eve... | 589 | 107 |
| 699 | 74680 | 795649067652790489 | 2023-01-02 | 43473482 | Michelle | my partner and i stayed at nazleens place for ... | 980 | 187 |
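
One caveat about `n_words`: splitting on a literal space counts an empty string for every leading or trailing space, and `clean_text` does not trim the comments, so the count can be off by one or two for some reviews. The sketch below compares that with splitting on any whitespace run (`str.split()` without an argument) on a made-up series. Whether the difference matters for the later analysis is left open here.

```python
import pandas as pd

# Hypothetical comments for illustration
comments = pd.Series(["great stay", " great stay ", "ok"])

# Splitting on a literal space keeps empty strings at the edges
by_space = comments.str.split(' ').str.len()
# Splitting on any whitespace run ignores leading and trailing spaces
by_whitespace = comments.str.split().str.len()

print(by_space.tolist())       # [2, 4, 1]
print(by_whitespace.tolist())  # [2, 2, 1]
```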


In [None]:
rev23.to_csv('rev23.csv', index=False) # save