# Data Preparation & Text Pre-Processing

## Overview

I have been contracted by a boutique real estate firm out of Manhattan Beach, California to help them optimize the Airbnb branch of their business. With hundreds of properties across Los Angeles, this firm wants to ensure that they are maximizing return on each property by setting an optimal per-night price point. With detailed listing information and numerous written reviews for each of their properties, they wish to uncover whether these written reviews, along with other listing features, can be used to set optimal price points.

## Business Understanding

To perform this analysis, I have chosen to use Natural Language Processing (NLP) to build a classification model that explores whether Airbnb written reviews are reliable predictors of the ‘price per night’ of a given Airbnb listing. The analysis focuses on one-bedroom listings, as they account for a large share of Airbnb listings in greater Los Angeles. Based on the results, this analysis will aim to communicate clear recommendations on how to use this model to optimize Airbnb listing price strategy.

## The Data

The data for this analysis was pulled from Inside Airbnb. The two data sets used include detailed information such as 'price', 'number of bedrooms', and 'neighborhoods' from roughly 44,000 Airbnb listings in greater Los Angeles, as well as their corresponding written reviews. In the "Data Preparation" section below we will merge these two data sets and filter them down to the features of interest.

## Data Preparation

We'll start with importing the relevant packages:

In [1]:
# Imports
import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')


from nltk.tokenize import RegexpTokenizer, word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.probability import FreqDist
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_score, cross_val_predict, cross_validate
from tqdm import tqdm

[nltk_data] Downloading package stopwords to /Users/jf/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Load in Los Angeles 'Listings' & 'Reviews'

In [2]:
# load in reviews and listings data
la_rev = pd.read_csv('../data/la_reviews.csv')
la_list = pd.read_csv('../data/la_listings.csv')

## 1. 'Reviews' Cleaning & EDA

We'll begin by exploring and cleaning the Airbnb reviews data frame to prepare it for Natural Language Processing (NLP).

In [3]:
# take a look at first few rows of data set
la_rev.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,8941071,68391055,2016-04-04,10164333,Smruti,"Danielle was a great host, she was extremely r..."
1,8941071,153719836,2017-05-21,97944097,Rob,The apartment was great for us to spend the we...
2,8941071,147589354,2017-04-27,4123723,Widya,"Danielle is a great host, very concerned with..."
3,8941071,145742425,2017-04-19,1459499,Darian,Great location and spacious. Danielle's place ...
4,8941071,144400833,2017-04-15,98494277,Charlie,"Danielle's place was as expected, really good ..."


**Here we see that each Airbnb listing has its ```reviews``` split up into individual entries (rows). We will have to group all reviews by their respective listings.**

In [4]:
# quick breakdown of the data set
la_rev.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1532925 entries, 0 to 1532924
Data columns (total 6 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   listing_id     1532925 non-null  int64 
 1   id             1532925 non-null  int64 
 2   date           1532925 non-null  object
 3   reviewer_id    1532925 non-null  int64 
 4   reviewer_name  1532925 non-null  object
 5   comments       1532639 non-null  object
dtypes: int64(3), object(3)
memory usage: 70.2+ MB


### Handling null values
We decide to drop all null values as there are only 286 in a data set of over 1.5 million rows.

In [5]:
# checking for nulls
la_rev['comments'].isna().sum()

286

In [6]:
# visualizing nulls
la_rev[la_rev['comments'].isna()]

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
11041,12232198,597570153,2020-01-25,109869097,Larry,
36262,18771571,499592244,2019-07-31,200662007,Lilyan,
38689,19966810,428734333,2019-03-25,248109683,Danielle,
39068,20012997,500736816306141481,2021-11-21,78355861,Jill,
40846,19434972,486204261023070433,2021-11-01,110160469,Farnaz,
...,...,...,...,...,...,...
1511887,16661893,621560353,2020-04-01,342670536,Al,
1514989,46444583,868264023745043602,2023-04-12,23149537,Flannery,
1517560,19332411,222064868,2017-12-28,113707624,Betsy,
1524215,30460122,561530122,2019-11-09,207881679,Bianca,


In [7]:
# dropping null values
la_rev.dropna(subset=['comments'], inplace=True)

In [8]:
# sanity check
la_rev.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1532639 entries, 0 to 1532924
Data columns (total 6 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   listing_id     1532639 non-null  int64 
 1   id             1532639 non-null  int64 
 2   date           1532639 non-null  object
 3   reviewer_id    1532639 non-null  int64 
 4   reviewer_name  1532639 non-null  object
 5   comments       1532639 non-null  object
dtypes: int64(3), object(3)
memory usage: 81.9+ MB


### Grouping reviews by listing_id
Now that null values have been handled, we need to group all the comments by listing_id. Not only is this a necessary step for NLP, but it will also shrink our data set, which will make training models less time consuming.

As a sanity check, we check separate reviews for a specific listing_id before grouping all reviews by listing_id.

In [9]:
# checking separate reviews for a specific 'listing_id' BEFORE grouping
la_rev[la_rev['listing_id'] == 109]

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
497553,109,74506539,2016-05-15,22509885,Jenn,Me and two friends stayed for four and a half ...
497554,109,449036,2011-08-15,927861,Edwin,The host canceled my reservation the day befor...


In [10]:
# grouping all reviews text by 'listing_id'
la_rev_con = la_rev.groupby(['listing_id'], as_index=False).agg({'comments': " ".join})

In [11]:
# visualize the condensed data frame
la_rev_con

Unnamed: 0,listing_id,comments
0,109,Me and two friends stayed for four and a half ...
1,2708,Charles is the man!! Just wrapped up an amazin...
2,2732,"Unfortunately, I was really disappointed with ..."
3,6033,Sarah was a great host. She was always quick t...
4,6931,The best host and best stay I've ever had with...
...,...,...
32956,968513441909611726,BEAUTIFUL HOME. GREAT LOCATION. AWESOME SUPER ...
32957,969535403681694277,We had a perfect time at Sean’s cottage. It wa...
32958,969626715256159808,Kelly was communicative and super responsive. ...
32959,970252209631292696,Such a cute spot in a nice neighborhood… check...


In [12]:
# sanity check: cross-referencing with 'listing_id' 109 above
la_rev_con['comments'][0]

"Me and two friends stayed for four and a half months. It was a great place to stay! The apartment was very comfortable and I really enjoyed having the park with running path across the street. The only downside was it wasn't within walking distance to restaurants, bars, or coffee shops. But they are a short drive away. Overall, great stay! The host canceled my reservation the day before arrival."

Based on the sanity check above we can see that the 'group by' worked as the separate reviews from ```listing_id 109``` have been grouped in the text above.
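
To make the grouping step easier to verify by eye, here is a minimal sketch of the same `groupby` and `" ".join` aggregation on a toy frame. The frame `toy_rev` and its three comments are invented for illustration and are not part of the Airbnb data.

```python
import pandas as pd

# Toy stand-in for la_rev: two reviews for listing 1, one for listing 2 (made-up data)
toy_rev = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "comments": ["Great stay!", "Host canceled.", "Lovely cottage."],
})

# Same aggregation as the notebook cell: join every comment of a listing with a space
toy_con = toy_rev.groupby("listing_id", as_index=False).agg({"comments": " ".join})

print(toy_con)
```

Each listing ends up as a single row, and the comments keep the order in which they appeared in the original frame.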

In [43]:
la_rev_con['comments'].isna().sum()

0

## 2. 'Listings' Cleaning & EDA

With 75 columns in this data frame, we narrow it down to our features of interest: **price, bedrooms, bathrooms, neighborhood, written reviews, ratings and property type.**

Seeing as our analysis is focused on one-bedroom listings only, we will filter our data down to one-bedroom listings.

Let's take a look at all of our features as well as potential null values that need to be handled.

In [13]:
la_list.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44594 entries, 0 to 44593
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            44594 non-null  int64  
 1   listing_url                                   44594 non-null  object 
 2   scrape_id                                     44594 non-null  int64  
 3   last_scraped                                  44594 non-null  object 
 4   source                                        44594 non-null  object 
 5   name                                          44594 non-null  object 
 6   description                                   43937 non-null  object 
 7   neighborhood_overview                         25053 non-null  object 
 8   picture_url                                   44594 non-null  object 
 9   host_id                                       44594 non-null 

In [16]:
# condensing df to potential features of interest for modeling
la_small = la_list[['id', 'price', 'bedrooms', 'neighbourhood_cleansed', 
                    'minimum_nights', 'maximum_nights', 'property_type', 'bathrooms_text', 'review_scores_rating',
                    'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 
                    'review_scores_communication', 'review_scores_location', 'review_scores_value']]

In [17]:
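# Filtering for 1 bedroom properties - Analysis only concerned with 1 bedroom Airbnb listings
# .copy() makes df_la_1bd an independent frame, so later column assignments do not raise SettingWithCopyWarning
df_la_1bd = la_small[la_small['bedrooms'] == 1.0].copy()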

### Cleaning ```price``` feature - target variable

In [34]:
# removing decimals from price points
df_la_1bd['price'] = df_la_1bd['price'].str.split('.').str[0]

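# removing commas from price points
df_la_1bd['price'] = df_la_1bd['price'].str.replace(",", "", regex=False)

# removing $ signs from price points (regex=False so '$' is treated as a literal, not an end-of-string anchor)
df_la_1bd['price'] = df_la_1bd['price'].str.replace('$', '', regex=False)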

# convert price to 'int'
df_la_1bd['price'] = df_la_1bd['price'].astype(int)

df_la_1bd['price']

2         69
3        120
8        201
11        88
15        60
        ... 
44586    175
44587    194
44588    180
44590    168
44593    480
Name: price, Length: 13791, dtype: int64
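
As an alternative to the step-by-step cleaning above, the same result can be sketched as a single chained expression. The sample strings below are hypothetical and assume prices arrive in a `$1,250.00` style format, which is what the cleaning steps above imply.

```python
import pandas as pd

# Hypothetical raw prices in the assumed "$1,250.00" string format
raw_price = pd.Series(["$69.00", "$1,250.00", "$480.00"])

clean_price = (
    raw_price
    .str.replace(r"[$,]", "", regex=True)  # strip dollar signs and thousands separators
    .astype(float)                         # parse "1250.00" as a number
    .astype(int)                           # truncate the cents, like the split('.') step
)

print(clean_price.tolist())
```

Inside a regex character class `$` is a literal character, so the pattern removes the currency symbol rather than matching the end of the string.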

### Cleaning ```bathrooms_text``` column:

We decide to drop text values and implausible entries (for example, 11 bathrooms in a one-bedroom listing), as they account for only a small number of the roughly 13,800 one-bedroom listings.

In [23]:
df_la_1bd['bathrooms_text'].value_counts()

1            13074
1.5            419
2              231
Half-bath       35
2.5             18
11              16
0               16
3               13
6               11
3.5             11
Shared           9
4                6
Private          2
8                1
11.5             1
Name: bathrooms_text, dtype: int64

In [29]:
# select for only numbers in bathrooms column
df_la_1bd['bathrooms_text'] = df_la_1bd['bathrooms_text'].str.split(" ").str[0]

# drop text and incorrect entries
b = ['4', '6', '8', '11', '11.5', 'Half-bath', 'Private', 'Shared']
bathroom_drop = df_la_1bd[df_la_1bd['bathrooms_text'].isin(b)].index
df_la_1bd.drop(bathroom_drop, inplace=True)

# preview 'bathrooms_text' as float (the cast is not assigned back, so the stored column keeps dtype object)
df_la_1bd['bathrooms_text'].astype(float)

2        1.0
3        1.5
8        1.0
11       1.0
15       1.0
        ... 
44586    1.0
44587    1.0
44588    1.0
44590    1.0
44593    1.0
Name: bathrooms_text, Length: 13791, dtype: float64
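
A possible variation on the cleaning above is to let pandas flag unparseable values instead of listing them by hand. This is only a sketch: the sample strings are invented, and it handles the text entries only, so implausible numeric values such as 11 would still need an explicit filter.

```python
import pandas as pd

# Hypothetical raw values resembling the 'bathrooms_text' column
raw_baths = pd.Series(["1 bath", "1.5 baths", "Half-bath", "Shared half-bath", "2 baths"])

# Keep the leading token, then coerce anything non-numeric to NaN
first_token = raw_baths.str.split(" ").str[0]
baths_num = pd.to_numeric(first_token, errors="coerce")

# Rows that failed to parse can then be dropped in one step
baths_clean = baths_num.dropna()
print(baths_clean.tolist())
```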

### Cleaning ```property_type``` column:

In [30]:
# breakdown the various property types
df_la_1bd['property_type'].value_counts()[0:30]

Entire rental unit                    6362
Entire guesthouse                     1522
Entire home                           1392
Entire guest suite                     859
Entire condo                           757
Private room in home                   424
Room in hotel                          307
Entire bungalow                        263
Entire serviced apartment              254
Entire loft                            236
Private room in rental unit            221
Room in boutique hotel                 217
Camper/RV                              113
Private room in guesthouse              93
Tiny home                               87
Entire cottage                          81
Entire townhouse                        75
Private room in guest suite             57
Private room in loft                    52
Private room in serviced apartment      40
Entire place                            39
Room in aparthotel                      37
Private room in condo                   37
Private roo

The property types above can be cleaned up and condensed. Any type of 'private room' can be grouped, any 'Entire' unit can be grouped and any 'suite' can also be grouped.

In [31]:
# create function to condense property types
def transform_accommodation_type(val):
    if "room" in val:
        return "Private Room"
    elif "suite" in val:
        return "Guest Suite"
    elif "Entire" in val:
        return "Entire Unit"
    else:
        return "Other"

df_la_1bd['property_type'] = df_la_1bd['property_type'].apply(transform_accommodation_type)

# sanity check
df_la_1bd['property_type'].value_counts()

Entire Unit     11056
Private Room     1047
Guest Suite       859
Other             829
Name: property_type, dtype: int64
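
Because the function checks its conditions in order and the substring tests are case sensitive, a few sample values help show which bucket each property type lands in. The function is repeated here only so the snippet runs on its own; the sample labels are taken from the value counts above.

```python
def transform_accommodation_type(val):
    # Same rules as the notebook function, repeated so this snippet is self-contained
    if "room" in val:
        return "Private Room"
    elif "suite" in val:
        return "Guest Suite"
    elif "Entire" in val:
        return "Entire Unit"
    else:
        return "Other"

samples = [
    "Private room in home",
    "Private room in guest suite",  # "room" is checked first, so this is a Private Room
    "Entire guest suite",           # "suite" is checked before "Entire"
    "Entire rental unit",
    "Room in hotel",                # capital "R" does not match "room"
    "Camper/RV",
]
for s in samples:
    print(f"{s!r:35} -> {transform_accommodation_type(s)}")
```

Note that hotel rooms labeled with a capital "Room" fall into "Other" rather than "Private Room", which is worth keeping in mind when interpreting the condensed counts.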

## Merge 'Reviews' and 'Listings'
Now that we have cleaned both data frames, it is time to merge them into a single data frame. We use an inner merge, as we only want listings that appear in both data frames, so that every listing in the merged data frame has corresponding reviews.

In [44]:
# merge 'reviews' df with filtered 'listings' df
df_merge = df_la_1bd.merge(la_rev_con, how='inner', left_on='id', right_on='listing_id')
df_merge

Unnamed: 0,id,price,bedrooms,neighbourhood_cleansed,minimum_nights,maximum_nights,property_type,bathrooms_text,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,listing_id,comments
0,41240375,120,1.0,Playa del Rey,30,1125,Entire Unit,1.5,5.00,5.00,5.00,5.00,5.00,5.00,5.00,41240375,Paola is the best host I have ever had. She ha...
1,15239926,201,1.0,Santa Clarita,2,30,Other,1,4.99,4.98,5.00,4.99,5.00,4.97,4.86,15239926,Fantastic super hosts and space . What a beaut...
2,14821183,88,1.0,Diamond Bar,1,1125,Private Room,1,3.00,4.00,2.00,5.00,5.00,4.00,3.00,14821183,Quiet house. Bedroom is enough for sleeping. u...
3,26296415,180,1.0,Torrance,1,1125,Entire Unit,1,5.00,5.00,5.00,5.00,5.00,5.00,5.00,26296415,Amazing experience. The house was also recentl...
4,22746714,35,1.0,North El Monte,1,1125,Private Room,1,4.57,5.00,4.57,4.71,4.43,4.43,4.71,22746714,Nice neighborhood and hosts. Great location. T...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9823,575384126844892676,159,1.0,West Hollywood,31,1125,Entire Unit,1,4.67,4.67,5.00,4.67,5.00,4.00,3.33,575384126844892676,"Super host, thank you! This apartment was perf..."
9824,16072625,177,1.0,East Hollywood,31,92,Entire Unit,1,4.14,4.57,4.29,4.86,4.43,4.43,3.86,16072625,Great location. And overall a decent stay. My ...
9825,924091269757225413,120,1.0,Beverly Hills,1,28,Entire Unit,1,1.00,1.00,1.00,1.00,1.00,5.00,1.00,924091269757225413,"Since they sent me a review, I guess I’m revie..."
9826,720164781296601135,175,1.0,West Hollywood,2,1125,Entire Unit,1,4.75,4.75,4.50,4.50,4.25,5.00,4.75,720164781296601135,"Great location, great host, beautiful apartmen..."
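
One way to see what the inner merge above discards is to run an outer merge with pandas' `indicator` and `validate` options on a small example. The two toy frames below are invented stand-ins for `df_la_1bd` and `la_rev_con`.

```python
import pandas as pd

# Toy stand-ins for the listings and grouped reviews frames (made-up data)
listings = pd.DataFrame({"id": [1, 2, 3], "price": [120, 88, 201]})
reviews = pd.DataFrame({"listing_id": [2, 3, 4], "comments": ["Quiet.", "Fantastic.", "Cute."]})

# how="outer" with indicator=True labels each row as left_only, right_only or both;
# validate="one_to_one" raises an error if either key column contains duplicates
check = listings.merge(
    reviews, how="outer", left_on="id", right_on="listing_id",
    indicator=True, validate="one_to_one",
)
print(check["_merge"].value_counts())

# The inner merge keeps only the rows flagged 'both'
inner = listings.merge(reviews, how="inner", left_on="id", right_on="listing_id")
print(len(inner))
```

Rows flagged `left_only` are listings without any review, and rows flagged `right_only` are reviews whose listing was filtered out earlier.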


Good practice is to check for null values after merging two data frames.

In [45]:
# checking for nulls in every column of the merged data frame
df_merge.isna().sum()

id                              0
price                           0
bedrooms                        0
neighbourhood_cleansed          0
minimum_nights                  0
maximum_nights                  0
property_type                   0
bathrooms_text                  3
review_scores_rating            0
review_scores_accuracy         59
review_scores_cleanliness      58
review_scores_checkin          60
review_scores_communication    58
review_scores_location         61
review_scores_value            61
listing_id                      0
comments                        0
dtype: int64

In [46]:
# dropping rows that contain a null in any column
df_merge.dropna(inplace=True)

# sanity check
df_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9763 entries, 0 to 9827
Data columns (total 17 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   id                           9763 non-null   int64  
 1   price                        9763 non-null   int64  
 2   bedrooms                     9763 non-null   float64
 3   neighbourhood_cleansed       9763 non-null   object 
 4   minimum_nights               9763 non-null   int64  
 5   maximum_nights               9763 non-null   int64  
 6   property_type                9763 non-null   object 
 7   bathrooms_text               9763 non-null   object 
 8   review_scores_rating         9763 non-null   float64
 9   review_scores_accuracy       9763 non-null   float64
 10  review_scores_cleanliness    9763 non-null   float64
 11  review_scores_checkin        9763 non-null   float64
 12  review_scores_communication  9763 non-null   float64
 13  review_scores_loca

### Dropping 'Long-Term' Stays

By looking at the ```minimum_nights``` column, we notice that some listings require a 30-night minimum stay, indicating that these are long-term or monthly rentals. We also notice that the corresponding ```price``` for these rentals is a monthly rate and not a per-night rate. As a result, we drop these 'long-term' listings.

In [47]:
# drop long term stays
longterm_stays = df_merge[df_merge['minimum_nights'] >= 30].index
df_merge.drop(longterm_stays, inplace=True)
df_merge

Unnamed: 0,id,price,bedrooms,neighbourhood_cleansed,minimum_nights,maximum_nights,property_type,bathrooms_text,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,listing_id,comments
1,15239926,201,1.0,Santa Clarita,2,30,Other,1,4.99,4.98,5.00,4.99,5.00,4.97,4.86,15239926,Fantastic super hosts and space . What a beaut...
2,14821183,88,1.0,Diamond Bar,1,1125,Private Room,1,3.00,4.00,2.00,5.00,5.00,4.00,3.00,14821183,Quiet house. Bedroom is enough for sleeping. u...
3,26296415,180,1.0,Torrance,1,1125,Entire Unit,1,5.00,5.00,5.00,5.00,5.00,5.00,5.00,26296415,Amazing experience. The house was also recentl...
4,22746714,35,1.0,North El Monte,1,1125,Private Room,1,4.57,5.00,4.57,4.71,4.43,4.43,4.71,22746714,Nice neighborhood and hosts. Great location. T...
6,52992116,200,1.0,Silver Lake,2,30,Guest Suite,1,5.00,5.00,5.00,4.97,5.00,5.00,5.00,52992116,Everything was great! The room was nice and cl...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9818,593563874468145823,50,1.0,South El Monte,5,365,Private Room,1.5,4.86,4.86,5.00,5.00,4.57,4.57,4.86,593563874468145823,"Great place, good communication and excellent ..."
9819,16134682,186,1.0,West Hollywood,2,10,Entire Unit,1,5.00,5.00,5.00,5.00,5.00,5.00,5.00,16134682,Bella’s place is everything you see on the pic...
9822,45361780,134,1.0,Hawthorne,1,26,Entire Unit,1,4.90,4.89,4.75,4.91,4.95,4.73,4.74,45361780,The place was exactly how the owner described ...
9825,924091269757225413,120,1.0,Beverly Hills,1,28,Entire Unit,1,1.00,1.00,1.00,1.00,1.00,5.00,1.00,924091269757225413,"Since they sent me a review, I guess I’m revie..."


Though we have dropped long-term stays in ```minimum_nights```, we notice that ```price``` carries a few outliers. By digging into these listings on airbnb.com, we find that they are incorrectly priced (monthly rate instead of per night rate). We drop these as well.

In [48]:
df_merge['price'].describe()

count     5258.000000
mean       223.465196
std       1905.944137
min         10.000000
25%        110.000000
50%        149.000000
75%        199.000000
max      99999.000000
Name: price, dtype: float64

In [49]:
df_merge[df_merge['price'] >= 999]

Unnamed: 0,id,price,bedrooms,neighbourhood_cleansed,minimum_nights,maximum_nights,property_type,bathrooms_text,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,listing_id,comments
139,35843831,9999,1.0,Studio City,1,1125,Other,1.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,35843831,Great time here. Thanks guys
143,35843690,9999,1.0,Studio City,1,1125,Other,1.0,1.0,1.0,1.0,1.0,1.0,5.0,1.0,35843690,Overpriced and overrated. If you’re bringing y...
533,37557082,1500,1.0,Malibu,7,1125,Entire Unit,1.0,5.0,5.0,5.0,5.0,5.0,4.71,4.86,37557082,"What a great place! Peaceful, beautiful and ho..."
679,52619433,3130,1.0,Glendale,2,365,Entire Unit,1.0,4.33,4.58,4.5,4.17,4.42,4.92,4.58,52619433,Dymund was great!! Place was nice and clean ev...
713,812386491863587915,9999,1.0,Pasadena,1,1125,Private Room,1.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,812386491863587915,Great stay! Really spacious for my partner and...
1188,812389869434454826,9999,1.0,Pasadena,1,1125,Private Room,1.0,5.0,2.0,5.0,2.0,5.0,5.0,3.0,812389869434454826,"The room is in a hotel, and the photos on AirB..."
1278,35633263,9999,1.0,Studio City,1,1125,Other,1.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,35633263,Wonderful place!
1842,38755677,5000,1.0,Universal City,1,28,Entire Unit,1.0,4.95,4.95,4.84,4.89,5.0,5.0,4.79,38755677,Alex along with his place were very nice! Alex...
1973,18979304,99999,1.0,Beverly Hills,1,1125,Entire Unit,1.0,4.98,5.0,4.88,4.98,5.0,5.0,4.89,18979304,We appreciated the accommodations during our r...
2684,19769097,9999,1.0,Beverly Hills,1,1125,Private Room,1.0,4.95,4.95,4.95,5.0,5.0,5.0,4.9,19769097,Great location and Sugar (the worlds sweetest ...


In [50]:
#drop mispriced listings - any listing priced at $999 or more per night corresponds to a long-term (monthly) rate
longterm_stays = df_merge[df_merge['price'] >= 999].index
df_merge.drop(longterm_stays, inplace=True)

### Converting Target Variable ```Price``` to Discrete
Since this is a classification analysis, we need to convert our ```price``` target variable to a discrete variable. We achieve this by creating classes, each covering a range of price points. Specifically, we use the quartiles of ```price``` (the 25th, 50th and 75th percentiles) as the boundaries between our classes.

In [None]:
# quartiles of the cleaned price column, used as class boundaries below
df_merge['price'].describe()

In [None]:
#create a column that sorts price based on categories of ranges of price
conditions = [
    (df_merge['price'] <= 109),
    (df_merge['price'] > 109) & (df_merge['price'] <= 148),
    (df_merge['price'] > 148) & (df_merge['price'] <= 199),
    (df_merge['price'] > 199)]

#create a list of the values we want to assign for each condition
values = [0, 1, 2, 3]

#create a new column and use np.select to assign values to it using our lists as arguments
df_merge['price_range'] = np.select(conditions, values)
df_merge
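
The thresholds above are typed in by hand from the `describe()` output. As a hedged alternative, `pd.qcut` can compute quartile-based classes directly from the data. The prices below are hypothetical and only stand in for `df_merge['price']`, and the resulting cut points will generally differ slightly from hand-picked ones.

```python
import pandas as pd

# Hypothetical nightly prices standing in for df_merge['price']
prices = pd.Series([35, 88, 110, 120, 149, 159, 180, 199, 201, 480])

# q=4 splits at the 25th, 50th and 75th percentiles; labels=False returns class codes 0-3
price_range = pd.qcut(prices, q=4, labels=False)

print(price_range.value_counts().sort_index())
```

With `qcut` the class boundaries follow the data automatically, which avoids stale thresholds if the filtering steps change.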

Now that we have a merged and cleaned data set let's move on to the fun part...**text processing**!

## Text Pre-Processing

An essential task in Natural Language Processing is to sort through the text, essentially whittling it down to its key components.

Steps:

1. **Special Character Removal**: Eliminates non-alphabetic characters to make the text more uniform.
2. **Case Normalization**: Converts all text to lowercase to neutralize case sensitivity.
3. **Tokenization**: Breaks the text into individual words for easier manipulation.
4. **Word Length Filtering**: Removes words that are too short and are likely to be irrelevant.
5. **Stopword Removal**: Filters out commonly-used words that usually don't contribute to the meaning of a sentence.
6. **POS Tagging and Lemmatization**: Assigns Parts of Speech (POS) tags and reduces words to their base or root form.
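
Steps 1 through 5 above can be sketched in plain Python, without NLTK, to show what each one does to a sentence. The function name `simple_preprocess`, the tiny stopword set and the sample sentence are all illustrative assumptions; the notebook itself uses NLTK's tokenizer and full English stopword list, and step 6 is left out here because it needs NLTK's tagger and WordNet data.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list
STOPWORDS = {"the", "was", "and", "for", "with", "very"}

def simple_preprocess(text, min_len=3):
    text = re.sub(r"[^A-Za-z\s]", " ", text)             # 1. special character removal
    text = text.lower()                                  # 2. case normalization
    tokens = text.split()                                # 3. tokenization on whitespace
    tokens = [t for t in tokens if len(t) >= min_len]    # 4. word length filtering
    tokens = [t for t in tokens if t not in STOPWORDS]   # 5. stopword removal
    return tokens

print(simple_preprocess("The apartment was VERY clean, and 5 min to the beach!!"))
```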


### Pre-Process Function:

In [52]:
# instantiate tokenizer
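# regex pattern keeps tokens of 3 or more word characters, which also drops punctuation and short words
# (the second alternative only matches non-ASCII runs wrapped in literal '/' characters, so non-English words are not filtered out)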
token_pattern = r"(?u)\w{3,}|/[^\x00-\x7F]+/"
tokenizer = RegexpTokenizer(token_pattern)

# create a list of stopwords
stopwords_list = stopwords.words('english')

In [53]:
# create function to pre-process text
def preprocess_text(text, tokenizer, stopwords_list):
    # Standardize case (lowercase the text)
    text_std = text.lower()
    # Tokenize
    token_list = tokenizer.tokenize(text_std)
    # Remove stopwords
    stopwords_removed = [token for token in token_list if token not in stopwords_list]
    return stopwords_removed
   

In [54]:
# pre-process the text data using the 'preprocess_text' function
reviews_proc = df_merge.comments.apply(lambda x: preprocess_text(x, tokenizer, stopwords_list))
reviews_proc

1       [fantastic, super, hosts, space, beautiful, pl...
2       [quiet, house, bedroom, enough, sleeping, uncl...
3       [amazing, experience, house, also, recently, u...
4       [nice, neighborhood, hosts, great, location, h...
6       [everything, great, room, nice, clean, describ...
                              ...                        
9818    [great, place, good, communication, excellent,...
9819    [bella, place, everything, see, pictures, bett...
9822    [place, exactly, owner, described, beautiful, ...
9825    [since, sent, review, guess, reviewing, confir...
9826    [great, location, great, host, beautiful, apar...
Name: comments, Length: 5239, dtype: object

### Tag & Lemmatize:

In [55]:
#Create Lemmatizer
lemmatizer = WordNetLemmatizer()

#Map POS tag to first character for use in WordNetLemmatizer
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # Default to NOUN
    
#POS Tagging
tagged_text = reviews_proc.apply(lambda x: pos_tag(x))
tagged_text

#Lemmatize the processed text
processed_rev = tagged_text.apply(lambda x: [lemmatizer.lemmatize(word, get_wordnet_pos(pos)) for word, pos in x])

#create new column in df with the processed reviews
df_merge['processed_reviews'] = processed_rev
df_merge

Unnamed: 0,id,price,bedrooms,neighbourhood_cleansed,minimum_nights,maximum_nights,property_type,bathrooms_text,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,listing_id,comments,processed_reviews
1,15239926,201,1.0,Santa Clarita,2,30,Other,1,4.99,4.98,5.00,4.99,5.00,4.97,4.86,15239926,Fantastic super hosts and space . What a beaut...,"[fantastic, super, host, space, beautiful, pla..."
2,14821183,88,1.0,Diamond Bar,1,1125,Private Room,1,3.00,4.00,2.00,5.00,5.00,4.00,3.00,14821183,Quiet house. Bedroom is enough for sleeping. u...,"[quiet, house, bedroom, enough, sleep, unclean]"
3,26296415,180,1.0,Torrance,1,1125,Entire Unit,1,5.00,5.00,5.00,5.00,5.00,5.00,5.00,26296415,Amazing experience. The house was also recentl...,"[amaze, experience, house, also, recently, upg..."
4,22746714,35,1.0,North El Monte,1,1125,Private Room,1,4.57,5.00,4.57,4.71,4.43,4.43,4.71,22746714,Nice neighborhood and hosts. Great location. T...,"[nice, neighborhood, host, great, location, ho..."
6,52992116,200,1.0,Silver Lake,2,30,Guest Suite,1,5.00,5.00,5.00,4.97,5.00,5.00,5.00,52992116,Everything was great! The room was nice and cl...,"[everything, great, room, nice, clean, describ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9818,593563874468145823,50,1.0,South El Monte,5,365,Private Room,1.5,4.86,4.86,5.00,5.00,4.57,4.57,4.86,593563874468145823,"Great place, good communication and excellent ...","[great, place, good, communication, excellent,..."
9819,16134682,186,1.0,West Hollywood,2,10,Entire Unit,1,5.00,5.00,5.00,5.00,5.00,5.00,5.00,16134682,Bella’s place is everything you see on the pic...,"[bella, place, everything, see, picture, good,..."
9822,45361780,134,1.0,Hawthorne,1,26,Entire Unit,1,4.90,4.89,4.75,4.91,4.95,4.73,4.74,45361780,The place was exactly how the owner described ...,"[place, exactly, owner, describe, beautiful, l..."
9825,924091269757225413,120,1.0,Beverly Hills,1,28,Entire Unit,1,1.00,1.00,1.00,1.00,1.00,5.00,1.00,924091269757225413,"Since they sent me a review, I guess I’m revie...","[since, send, review, guess, review, confirm, ..."


In [56]:
#convert token lists to strings
df_merge['processed_reviews'] = df_merge['processed_reviews'].str.join(' ')
df_merge

Unnamed: 0,id,price,bedrooms,neighbourhood_cleansed,minimum_nights,maximum_nights,property_type,bathrooms_text,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,listing_id,comments,processed_reviews
1,15239926,201,1.0,Santa Clarita,2,30,Other,1,4.99,4.98,5.00,4.99,5.00,4.97,4.86,15239926,Fantastic super hosts and space . What a beaut...,fantastic super host space beautiful place sta...
2,14821183,88,1.0,Diamond Bar,1,1125,Private Room,1,3.00,4.00,2.00,5.00,5.00,4.00,3.00,14821183,Quiet house. Bedroom is enough for sleeping. u...,quiet house bedroom enough sleep unclean
3,26296415,180,1.0,Torrance,1,1125,Entire Unit,1,5.00,5.00,5.00,5.00,5.00,5.00,5.00,26296415,Amazing experience. The house was also recentl...,amaze experience house also recently upgrade w...
4,22746714,35,1.0,North El Monte,1,1125,Private Room,1,4.57,5.00,4.57,4.71,4.43,4.43,4.71,22746714,Nice neighborhood and hosts. Great location. T...,nice neighborhood host great location host res...
6,52992116,200,1.0,Silver Lake,2,30,Guest Suite,1,5.00,5.00,5.00,4.97,5.00,5.00,5.00,52992116,Everything was great! The room was nice and cl...,everything great room nice clean describe chec...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9818,593563874468145823,50,1.0,South El Monte,5,365,Private Room,1.5,4.86,4.86,5.00,5.00,4.57,4.57,4.86,593563874468145823,"Great place, good communication and excellent ...",great place good communication excellent servi...
9819,16134682,186,1.0,West Hollywood,2,10,Entire Unit,1,5.00,5.00,5.00,5.00,5.00,5.00,5.00,16134682,Bella’s place is everything you see on the pic...,bella place everything see picture good best s...
9822,45361780,134,1.0,Hawthorne,1,26,Entire Unit,1,4.90,4.89,4.75,4.91,4.95,4.73,4.74,45361780,The place was exactly how the owner described ...,place exactly owner describe beautiful locatio...
9825,924091269757225413,120,1.0,Beverly Hills,1,28,Entire Unit,1,1.00,1.00,1.00,1.00,1.00,5.00,1.00,924091269757225413,"Since they sent me a review, I guess I’m revie...",since send review guess review confirm reserva...


In [57]:
#create csv of working data frame for modeling
df_merge.to_csv('../data/processed_final.csv')
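
Since `to_csv` is called without `index=False`, the saved file carries the data frame index as an unnamed first column. The sketch below, using a small invented frame and an in-memory buffer instead of a file on disk, shows how `index_col=0` restores that index when the file is read back for modeling.

```python
import io
import pandas as pd

# Small made-up frame with a non-contiguous index, similar to df_merge after row drops
df = pd.DataFrame(
    {"id": [15239926, 14821183],
     "processed_reviews": ["fantastic super host", "quiet house bedroom"]},
    index=[1, 2],
)

# Write to an in-memory buffer instead of '../data/processed_final.csv'
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# index_col=0 treats the unnamed first column as the index again
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Without `index_col=0`, the index would come back as an extra column named `Unnamed: 0`.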