# Los Angeles AirBnb EDA

Problem Statement: Develop the best possible predictor of Airbnb rental prices so that users of the model can maximize savings on their vacation rental. The model will continuously learn from recent data to keep its predictions up to date.

## Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import random
random.seed(42)

pd.options.mode.chained_assignment = None

Here I read in the data for the first quarter (06 June, 2022). In total there is a year's worth of listings, but for an initial exploration I only need some of the data.

In [None]:
#!unzip ../data/Archive.zip

In [None]:
airbnb = pd.read_csv('./listings_1.csv')

In [None]:
airbnb.shape

In [None]:
airbnb.head()

In [None]:
airbnb.info()

## Feature Selection

Price is the target variable so first I will examine the values in the column.

In [None]:
# https://stackoverflow.com/questions/32464280/converting-currency-with-to-numbers-in-python-pandas
airbnb['price']

The values are being read in as objects because of the "$" symbol and the thousands separators, so I will strip them out to leave only the number and change the type to float.

In [None]:
# Raw string avoids an invalid-escape warning for "\$"
airbnb['price'] = airbnb['price'].replace(r'[\$,]', '', regex=True).astype(float)
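
To make the cleaning step concrete, here is a minimal sketch on a toy Series. The sample strings are made up for illustration and assume the raw column uses the "$1,250.00" style; only the regex comes from the cell above.

```python
import pandas as pd

# Hypothetical price strings, assumed to mirror the raw 'price' column
raw_prices = pd.Series(['$85.00', '$1,250.00', '$99.00'])

# Remove the dollar sign and thousands separators, then cast to float
clean_prices = raw_prices.replace(r'[\$,]', '', regex=True).astype(float)

print(clean_prices.tolist())
```

After the cast, the column is numeric and can be used in correlations and models.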

Now that the target variable is a float I can use the correlation method and identify the numerical values most valuable to the model.

In [None]:
airbnb.corr()[['price']].sort_values(by = 'price', ascending = False)

'bedrooms', 'accommodates', and 'beds' have the highest correlation with the 'price' variable, so they will be selected for initial modeling. Correlation only takes numerical variables into account, so categorical variables will be examined and selected later on.

In [None]:
features = ['bedrooms', 'accommodates', 'beds', 'price']

airbnb1 = airbnb[features]

airbnb1.info()

## Null Values

The chosen variables appear to have a number of null values.

In [None]:
airbnb1.isnull().sum()

In [None]:
airbnb1.describe()

In [None]:
airbnb1['bedrooms'].value_counts()

In [None]:
airbnb1['beds'].value_counts()

Upon further examination, there does not appear to be any option for entering 0 bedrooms (studio apartment) or 0 beds (alternate accommodations), so I will assume that is why those values appear as null and fill them accordingly.

In [None]:
airbnb1 = airbnb1.fillna(0)

In [None]:
airbnb1.isnull().sum()

## EDA

I will perform some data visualization to get a sense of what the variables look like

In [None]:
sns.boxplot(x = airbnb1['bedrooms']);

In [None]:
sns.boxplot(x = airbnb1['accommodates']);

In [None]:
sns.boxplot(x = airbnb1['beds']);

In [None]:
sns.pairplot(airbnb1, y_vars = ['price'], x_vars = ['bedrooms', 'accommodates', 'beds']);

There seem to be some outliers that need further examination.

In [None]:
airbnb1.sort_values(by = ['price'], ascending=False).head(15)

10 of the 40,438 listings are super luxurious, and I need to eliminate them to get a better visualization of the data. Since they are outliers and so rare, I don't foresee a significant impact on the model accuracy.

In [None]:
airbnb1 = airbnb1[airbnb1['price'] < 25000.0]

In [None]:
sns.pairplot(airbnb1, y_vars = ['price'], x_vars = ['bedrooms', 'accommodates', 'beds']);

3 more outliers stand out, specifically listings with 20+ bedrooms or 30+ beds. Given their rarity, I am willing to sacrifice their inclusion in the model data to improve inference and prediction for the other 99% of the listings.

In [None]:
airbnb1 = airbnb1[(airbnb1['bedrooms'] < 20) & (airbnb1['beds'] < 30)]

In [None]:
sns.pairplot(airbnb1, y_vars = ['price'], x_vars = ['bedrooms', 'accommodates', 'beds']);

The cleaned data shows clearer patterns and adds confidence that a linear relationship is present. Now I will re-check the correlation given the changes to the data.

In [None]:
airbnb1.corr()[['price']].sort_values(by = 'price', ascending = False)

It seems the changes to the data have significantly improved the correlation of the chosen variables to the target variable, 'price'

___

# Initial Modeling

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score

In [None]:
X = airbnb1[['bedrooms', 'accommodates', 'beds']]
y = airbnb1['price']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

### Null Model

In [None]:
airbnb1['price'].mean()

The null model in this case predicts that the price of every listing is the average price of all listings, which is $267.18.

In [None]:
# breakfast hour week 4 Transformers Lesson
null = np.full_like(y_test, y_train.mean())

null

In [None]:
r2_score(y_test, null)

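Very poor r2 score for the null model. This is expected by construction: predicting the training mean for every listing explains none of the variance in price, so the score can only be zero or slightly negative. It serves as the baseline to beat.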
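
As a worked illustration of why a mean-only baseline cannot score above zero, here is a small sketch that computes R2 by hand on made-up numbers. The helper `r2` and the toy arrays are hypothetical and not part of the notebook; the formula is the standard 1 - SS_res / SS_tot.

```python
import numpy as np

def r2(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of the test mean
    return 1 - ss_res / ss_tot

# Toy prices: the baseline always predicts the training mean
y_train_toy = np.array([100.0, 200.0, 300.0])
baseline = y_train_toy.mean()  # 200.0

y_test_same_mean = np.array([150.0, 250.0])  # test mean equals train mean
y_test_diff_mean = np.array([100.0, 200.0])  # test mean differs from train mean

r2_same = r2(y_test_same_mean, np.full(2, baseline))
r2_diff = r2(y_test_diff_mean, np.full(2, baseline))
print(r2_same, r2_diff)  # 0.0 -1.0
```

When the train and test means match, the baseline scores exactly 0; any mismatch pushes it below 0.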

### Linear Model

In [None]:
lr = LinearRegression()

lr.fit(X_train, y_train)

In [None]:
print(f'Train Score: {lr.score(X_train, y_train)}')
print(f'Test Score: {lr.score(X_test, y_test)}')
print(f'Cross Val Score: {cross_val_score(lr, X_train, y_train).mean()}')

Results look promising: similar train and test scores suggest the model is not overfit, and a similar cross-validation score suggests the test set is representative. Next I will add more features to see if the R2 can be significantly increased.

# Categorical Features

A lot of variables were left on the table for the initial analysis, most importantly categorical features and features that are numerical but need further cleaning to maximize value.

- Location based features, latitude and longitude had little correlation by themselves but how about a polynomial feature of the two combined?

In [None]:
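# Removing outliers and filling nulls
# airbnb1 is already filtered and null-filled; its masks align with airbnb on index,
# so rows that were dropped from airbnb1 evaluate to False here
airbnb_clean = airbnb[(airbnb['price'] < 25000.0) & (airbnb1['bedrooms'] < 20) & (airbnb1['beds'] < 30)]
airbnb_clean = airbnb_clean.fillna(0)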

# Polynomial Features
airbnb_exp = airbnb_clean[['price', 'latitude', 'longitude']]

In [None]:
airbnb_exp['location'] = airbnb_exp['latitude'] * airbnb_exp['longitude']

In [None]:
airbnb_exp['location'].isnull().sum()

In [None]:
airbnb_exp.corr()[['price']].sort_values(by = 'price', ascending = False)

The correlation proved too small to provide significant value. This is not too surprising given that Airbnb randomizes and obscures the exact location of a listing to provide more security to the host.

In [None]:
plt.figure(figsize = (10, 10))
plt.scatter(airbnb_exp['location'], airbnb_exp['price'])
plt.xlabel('Location = lat * lon')
plt.ylabel('Price')
plt.title('Location and Price');

In [None]:
plt.scatter(airbnb_exp['longitude'], airbnb_exp['latitude'])
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Map of LA listings');

### Neighbourhoods

Inside Airbnb calculated the neighbourhood using coordinates and a map of LA, so I will explore that data next.

In [None]:
airbnb_exp = airbnb_clean[['price', 'neighbourhood_cleansed']]

In [None]:
airbnb_exp['neighbourhood_cleansed'].describe()

In [None]:
airbnb_exp.isnull().sum()

In [None]:
top_50 = airbnb_exp['neighbourhood_cleansed'].value_counts().head(50)

In [None]:
plt.figure(figsize = (10, 10))
top_50.sort_values().plot(kind='barh')
plt.xlabel('Listings')
plt.ylabel('Neighbourhoods')
plt.title('Top 50 most popular neighbourhoods for LA listings');

In [None]:
bottom_50 = airbnb_exp['neighbourhood_cleansed'].value_counts().tail(50)
plt.figure(figsize = (10, 10))
bottom_50.sort_values().plot(kind='barh')
plt.xlabel('Listings')
plt.ylabel('Neighbourhoods')
plt.title('50 least popular neighbourhoods for LA listings');

In [None]:
# One Hot Encode to model neighbourhoods
from sklearn.preprocessing import OneHotEncoder

In [None]:
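# Transformers Lesson
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer versions need sparse_output=False
ohe = OneHotEncoder(drop='if_binary', handle_unknown='ignore', sparse=False)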

In [None]:
# Features and target for the neighbourhood model
features = ['bedrooms', 'accommodates', 'beds', 'neighbourhood_cleansed']

X = airbnb_clean[features]
y = airbnb_clean['price']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [None]:
X_train_ohe = ohe.fit_transform(X_train[['neighbourhood_cleansed']])
X_test_ohe = ohe.transform(X_test[['neighbourhood_cleansed']])

In [None]:
X_train_ohe.shape, X_test_ohe.shape

In [None]:
lr_neighbourhood = LinearRegression()

lr_neighbourhood.fit(X_train_ohe, y_train)

In [None]:
print(f'Train Score: {lr_neighbourhood.score(X_train_ohe, y_train)}')
print(f'Test Score: {lr_neighbourhood.score(X_test_ohe, y_test)}')

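Note that this model is fit on the one-hot encoded neighbourhoods alone, without the numerical features from the initial model. Not only is the train r2 score significantly lower than the initial model's, the test r2 score turns negative. I cannot recommend keeping the new one-hot encoded variables.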
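
If the goal is to add neighbourhoods on top of the numerical features rather than replace them, one option is scikit-learn's `ColumnTransformer`, which encodes the categorical column and passes the numeric columns through. This is a sketch on made-up rows, not the notebook's method: the toy data, the pipeline, and the variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical listings; column names match the notebook, values are invented
toy = pd.DataFrame({
    'bedrooms': [1, 2, 3, 1, 2, 3],
    'accommodates': [2, 4, 6, 2, 4, 6],
    'beds': [1, 2, 3, 1, 3, 4],
    'neighbourhood_cleansed': ['Venice', 'Venice', 'Venice',
                               'Hollywood', 'Hollywood', 'Hollywood'],
    'price': [150.0, 250.0, 400.0, 100.0, 180.0, 300.0],
})
numeric = ['bedrooms', 'accommodates', 'beds']
categorical = ['neighbourhood_cleansed']

# One-hot encode the neighbourhood, keep the numeric columns unchanged
preprocess = ColumnTransformer(
    [('ohe', OneHotEncoder(handle_unknown='ignore'), categorical)],
    remainder='passthrough',
)
model = make_pipeline(preprocess, LinearRegression())

X_toy = toy[numeric + categorical]
model.fit(X_toy, toy['price'])

# 2 one-hot columns + 3 numeric columns
encoded_shape = model.named_steps['columntransformer'].transform(X_toy).shape
predictions = model.predict(X_toy)
print(encoded_shape)
```

Wrapping the encoder in a pipeline also means it is refit inside each fold if the model is later passed to `cross_val_score`.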

### Host_since has the date of creation of the host account

Change the host_since column to datetime in order to graph it.

In [None]:
airbnb_exp['host_since'] = pd.to_datetime(airbnb_clean['host_since'])

In [None]:
airbnb_exp['host_since'].isnull().sum()

In [None]:
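# Noticed some outlier dates and filtering them out; Airbnb was founded in 2008, so any date before that is treated as a clerical error
# (some of these may be the nulls filled with 0 in airbnb_clean, which to_datetime can parse as 1970-01-01)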
airbnb_exp = airbnb_exp[airbnb_exp['host_since'] >= "2008"]

In [None]:
# https://stackoverflow.com/questions/16180946/drawing-average-line-in-histogram-matplotlib
plt.hist(airbnb_exp['host_since'], edgecolor = 'k', bins = 40)
plt.axvline(airbnb_exp['host_since'].mean(), color = 'red')
plt.xlabel('Date')
plt.ylabel('Number of Accounts')
plt.title('Date of Host Account Creation');

Linear models do not accept dates, so I will try to combine host_since and last_review to create a new column that I will title "Host Experience".

In [None]:
airbnb_clean['last_review'].value_counts()

It seems too many of the accounts have never been reviewed. Not worth pursuing this avenue if I have to sacrifice almost 1/4 of the data. I don't see another way of converting this datetime data into length of account given the present data
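
One possible workaround, sketched here under an assumption: if the snapshot date from the introduction (06 June, 2022) is an acceptable reference point, account age can be measured against it without using last_review at all. The toy dates and the `host_days` column name are hypothetical.

```python
import pandas as pd

# Assumption: the listings snapshot date is used as the reference point
snapshot = pd.Timestamp('2022-06-06')

# Hypothetical host_since values, including a missing one
hosts = pd.DataFrame({'host_since': ['2012-06-06', '2021-06-06', None]})
hosts['host_since'] = pd.to_datetime(hosts['host_since'])

# Account age in days; missing dates stay missing instead of becoming 0
hosts['host_days'] = (snapshot - hosts['host_since']).dt.days

print(hosts)
```

The result is a plain numeric column, so it could be passed to a linear model once the missing values are handled.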

### Host Location

Looking at the host locations to see if it's worth keeping as a variable

In [None]:
airbnb['host_location'].value_counts().head(20)

In [None]:
airbnb['host_location'].describe()

There are too many possibilities to one-hot encode or dummify this variable. Using neighbourhoods.csv, I can binarize the column to "In LA or not".

In [None]:
neighbourhoods_df = pd.read_csv('../data/neighbourhoods.csv')

neighbourhoods = neighbourhoods_df['neighbourhood'].tolist()

In [None]:
airbnb['host_location'].head()

In [None]:
# Need to take off the comma and everything after it so it matches the list of neighbourhoods
# https://stackoverflow.com/questions/47024428/replace-column-values-using-regex-in-pandas-data-frame

airbnb_clean['local_host'] = airbnb_clean['host_location'].str.split(',').str[0]

airbnb_clean['local_host'].head()

In [None]:
# Binarizing the column
airbnb_clean['local_host_b'] = airbnb_clean['local_host'].map(lambda location: 1 if location in neighbourhoods else 0)

airbnb_clean['local_host_b'].value_counts()
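
The same binarization can be sketched with the vectorized `Series.isin` instead of a lambda. The host locations and the short neighbourhood list below are made up for illustration; the split-on-comma step mirrors the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the list read from neighbourhoods.csv
neighbourhoods_toy = ['Los Angeles', 'Venice']

# Hypothetical host_location values, including a missing one
host_location = pd.Series([
    'Los Angeles, California, United States',
    'New York, New York, United States',
    'Venice, California, United States',
    None,
])

# Keep only the text before the first comma
city = host_location.str.split(',').str[0]

# 1 if the city is in the neighbourhood list, 0 otherwise (missing counts as 0)
local_host_b = city.isin(neighbourhoods_toy).astype(int)

print(local_host_b.tolist())  # [1, 0, 1, 0]
```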

To decide whether the variable is worth keeping, I want to see if there is a noticeable difference in price between the two conditions.

In [None]:
airbnb_clean.groupby('local_host_b')['price'].describe()

There is enough of a difference so I will keep the new column

#### Testing out new model

In [None]:
features = ['bedrooms', 'accommodates', 'beds', 'local_host_b']

X = airbnb_clean[features]
y = airbnb_clean['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

lr_lhost = LinearRegression()
lr_lhost.fit(X_train, y_train)

print(f'Train Score: {lr_lhost.score(X_train, y_train)}')
print(f'Test Score: {lr_lhost.score(X_test, y_test)}')

No noticeable increase in either r2 score. Not worth keeping in the model.

### Bathrooms Text

Because of a change in the Airbnb website the number of bathrooms are no longer stored as floats, they are strings describing the number of bathrooms. I will explore this data to see if it's worth adding to the model.

In [None]:
airbnb['bathrooms_text'].value_counts()

In [None]:
airbnb['bathrooms_text'].describe()

In [None]:
airbnb['bathrooms_text'].isnull().sum()

The data appears to be made up primarily of two components: the number of bathrooms and whether or not they are private. First I will extract the number of bathrooms.

In [None]:
airbnb_bath = airbnb[['price', 'bathrooms_text']]

airbnb_bath = airbnb_bath.fillna(0)

In [None]:
airbnb_bath.isnull().sum()

In [None]:
# https://www.regular-expressions.info/floatingpoint.html
airbnb_bath['bathrooms'] = airbnb_bath['bathrooms_text'].str.extract(r'(\d+\.?\d?)').astype(float)

airbnb_bath['bathrooms'].value_counts()
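
To show what the regex captures, and how the second component (shared or private) could be pulled out as well, here is a sketch on made-up strings. The sample values and the `is_shared` flag are assumptions for illustration; only the extraction pattern comes from the cell above.

```python
import pandas as pd

# Hypothetical bathrooms_text values, including one without a number and a null
bathrooms_text = pd.Series([
    '1 bath', '1.5 shared baths', '2 private baths', 'Half-bath', None,
])

# Number of bathrooms: digits with an optional single decimal place
bathrooms = bathrooms_text.str.extract(r'(\d+\.?\d?)', expand=False).astype(float)

# Shared flag: 1 when the text mentions "shared", 0 otherwise (nulls count as 0)
is_shared = bathrooms_text.str.contains('shared', case=False, na=False).astype(int)

print(pd.DataFrame({'bathrooms': bathrooms, 'is_shared': is_shared}))
```

Strings with no digits, such as 'Half-bath' here, come out as NaN and would need separate handling before modeling.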

In [None]:
airbnb_bath.corr()[['price']].sort_values(by = 'price', ascending = False)

The correlation between number of bathrooms and price looks promising. I will perform some further cleaning before testing out the variable in a new model.

In [None]:
plt.figure(figsize = (10, 10))
plt.scatter(airbnb_bath['bathrooms'], airbnb_bath['price'])
plt.xlabel('Number of Bathrooms')
plt.ylabel('Price')
plt.title('Number of Bathrooms vs Price');

There seem to be a few outliers that can be eliminated from the data.

In [None]:
airbnb_bath = airbnb_bath[(airbnb_bath['bathrooms'] < 15) & (airbnb_bath['price'] < 25000)]

In [None]:
plt.figure(figsize = (10, 10))
plt.scatter(airbnb_bath['bathrooms'], airbnb_bath['price'])
plt.xlabel('Number of Bathrooms')
plt.ylabel('Price')
plt.title('Number of Bathrooms vs Price');

In [None]:
airbnb_bath.corr()[['price']].sort_values(by = 'price', ascending = False)

With some data cleaning and elimination of outliers, the new "bathrooms" variable is a great candidate for the model.

### Room Type

The last promising variable is room type, which should offer insight into the types of listings offered and will hopefully show a relationship with price.

In [None]:
airbnb['room_type'].value_counts()

In [None]:
airbnb['room_type'].isnull().sum()

In [None]:
plt.figure(figsize = (10, 10))
plt.scatter(airbnb['room_type'], airbnb['price'])
plt.xlabel('Type of Room')
plt.ylabel('Price')
plt.title('Type of Room vs Price');

In [None]:
airbnb_room = airbnb[airbnb['price'] < 25000]

In [None]:
plt.figure(figsize = (10, 10))
plt.scatter(airbnb_room['room_type'], airbnb_room['price'])
plt.xlabel('Type of Room')
plt.ylabel('Price')
plt.title('Type of Room vs Price');

Initial EDA shows a promising pattern, further testing is needed.
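
As a sketch of the further testing, room type can be summarized per category and then dummy-encoded for a linear model. The room type labels and prices below are invented for illustration, and the median comparison is an assumption, not a step taken in this notebook.

```python
import pandas as pd

# Hypothetical listings with assumed room type labels
rooms = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Private room', 'Shared room', 'Private room'],
    'price': [300.0, 90.0, 40.0, 110.0],
})

# Median price per room type, a quick check of whether the categories differ
median_price = rooms.groupby('room_type')['price'].median()

# Dummy-encode; drop_first removes the first category, which becomes the reference level
encoded = pd.get_dummies(rooms, columns=['room_type'], drop_first=True)

print(median_price)
print(encoded.columns.tolist())
```

With `drop_first=True`, each remaining dummy coefficient is read as the price difference relative to the dropped category.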

## Final Model

In [None]:
airbnb_final = airbnb[['bedrooms', 'accommodates', 'beds', 'bathrooms_text', 'room_type', 'price']]

In [None]:
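# Converting all original data to fit Linear Regression
airbnb_final = airbnb_final.fillna(0)
# Extract the numeric bathroom count into a 'bathrooms' column and drop the text column
airbnb_final['bathrooms'] = airbnb_final['bathrooms_text'].str.extract(r'(\d+\.?\d?)', expand=False).astype(float)
airbnb_final = airbnb_final.drop(columns='bathrooms_text')
airbnb_final = pd.get_dummies(columns = ['room_type'], data = airbnb_final, drop_first=True)
# Same outlier filters as before; the airbnb1 masks align on index
airbnb_final = airbnb_final[(airbnb_final['bathrooms'] < 15) & (airbnb_final['price'] < 25000) & (airbnb1['bedrooms'] < 20) & (airbnb1['beds'] < 30)]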

In [None]:
X = airbnb_final.drop(columns='price')
y = airbnb_final['price']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

lr_final = LinearRegression()
lr_final.fit(X_train, y_train)

print(f'Train Score: {lr_final.score(X_train, y_train)}')
print(f'Test Score: {lr_final.score(X_test, y_test)}')

Using only a Linear Regression model, the highest r2 score achieved was 0.25 on both the train and test sets, meaning only 25% of the variability in price for LA listings in the June 2022 quarter can be explained by the model. The next notebook uses more advanced models.