## Seung Jun Choi in Urban Information Lab
## Analyzing AirBnB Prices in New York City

(Artlasky, 2021)
https://jovian.ai/artlasky/nyc-airbnb-dataset

(Jaiman et al., n.d.) https://amanjaiman.github.io/nyc-airbnb-data/

![image.png](attachment:image.png)

![image.png](attachment:image.png)

This winter, we're planning to take a trip to New York City! Everyone knows the cost of living there is sky-high, so naturally, we wanted to see if there was a way to find bargains. One popular option: Airbnb!

Airbnb is a sharing-economy platform where people offer their own housing to travellers. Since 2008, it has grown in popularity and become ubiquitous among travel options, making it a major competitor to the hotel industry. Competing with hotels and other Airbnbs makes pricing challenging for hosts. Many features can factor into the price: proximity to popular locations, amenities, size, etc.

For this project, we analyze a dataset downloaded as a .csv file from Kaggle, "New York City Airbnb Open Data". It describes about 50,000 Airbnb listings in New York City: which boroughs they are located in, whether they cater to single guests or entire families, how much space they offer, and how expensive they are.

The opendatasets Python library will be used to download the data from Kaggle, and the Pandas library will be used to clean and query the data. Matplotlib will give us visualizations of the data as it's analyzed.

In this tutorial, we are aiming to see if there are certain features that contribute to price more than others. We also want to see if we can find outliers for the Airbnbs (bargains or ripoffs). We hope that this exploration can be useful for travelers looking to find a place in New York City, or for homeowners to be able to price their property at a competitive price to make a profit.

## Research Questions
### Q1. How many rooms are available, and what type are they?
### Q2. What is the average price of each type of available room?
### Q3. What is the average affordable price of a midtown Manhattan bnb?
### Q4. If you wanted to rent a whole house, what would be the average price?
### Q5. Plot the locations of all bnbs in Manhattan, color coded by type.

# Airbnb information for touring NYC
(from a Google search): A bed and breakfast (typically shortened to B&B or BnB) is a small lodging establishment that offers overnight accommodation and breakfast. ... Bed and breakfast is also used to describe the level of catering included in a hotel's room prices, as opposed to room only, half-board or full-board.

Airbnb is heavily used in New York City, with some 50,000 available listings. This dataset, downloaded from Kaggle, describes the options for staying at an Airbnb while in New York City: where in the city the listings are located and how the accommodations provide for individuals or families. The next few steps download that data from Kaggle for univariate analysis.

In [5]:
# get a dataset from Kaggle, once it is downloaded we'll have the
# data in a local file
dataset_url = 'https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data' 

In [6]:
# On the first run of this notebook the dataset was downloaded and extracted
# To repeat that operation, in case you're running this notebook in your own environment,
# uncomment and execute the following two lines
#
# NOTE: you will need an account on Kaggle as you'll need to provide your username and key to download
# the dataset. You can leave these commands commented out and just use the file in this project's folder

#import opendatasets as od
#od.download(dataset_url)

In [7]:
# NYC air bnb data
data_dir = './new-york-city-airbnb-open-data'

![image.png](attachment:image.png)

![image.png](attachment:image.png)

## Data Preparation and Cleaning
Before analysis, we will remove any data that is unnecessary. For example, there is no use for the id or host id in the analysis.
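
As a minimal sketch of this cleaning step, the snippet below drops identifier columns from a small made-up table. The toy `listings` frame and its values are assumptions for illustration only; the real dataset has many more columns.

```python
import pandas as pd

# Toy stand-in for the listings table (hypothetical values).
listings = pd.DataFrame({
    "id": [1, 2, 3],
    "host_id": [10, 11, 12],
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
    "price": [80, 200, 45],
})

# Drop identifier columns that carry no information about price.
# errors="ignore" keeps the call safe if a column is already gone.
cleaned = listings.drop(columns=["id", "host_id"], errors="ignore")
print(cleaned.columns.tolist())
```

Returning a new frame instead of using `inplace=True` leaves the original table available if a later step turns out to need one of the dropped columns.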

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

from IPython.display import HTML, display # displaying maps in the notebook
import seaborn as sns; sns.set() # graphing data

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Let's examine some basic stats:

There are close to 50k entries, so we want to sample data for plotting. Data seems to be missing for some fields, most noticeably those relating to reviews.

In [None]:
nyc_air_bnb_df.info()  # info() prints its report directly and returns None

![image.png](attachment:image.png)

Let's examine the categorical variables.

![image.png](attachment:image.png)

We are going to be working with Airbnbs in the five boroughs in New York City. There are three types of rooms: shared rooms, private rooms, or the entire home or apartment. We can expect the rooms to have a significant impact on the price, as people would prefer to have an entire home to themselves rather than a shared room, so homeowners will likely charge more. We will explore this later.

How about price? Let's see what our typical Airbnb is priced at, as well as what the outliers look like.

![image.png](attachment:image.png)

In [None]:
print(main_df['price'].describe(percentiles=[.25, .50, .75, .95]))

![image.png](attachment:image.png)

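We find that there are some very large outliers, so for visualization purposes, we drop the top 5% of listings by price, those at about $400 and above. Strictly speaking this is trimming rather than winsorizing (which would cap the values instead of removing them), but we keep the name winsorized_df in the code.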

![image.png](attachment:image.png)

## Neighbourhood Groups
First we'll look at the neighbourhood groups, and where Airbnbs are most commonly found. We'll also look at how they are priced in each neighbourhood group. Using seaborn, we can make some nice visuals!

In [None]:
# using countplot to visualize the number of Airbnbs in each borough
ax = sns.countplot(x='neighbourhood_group',data=main_df,order=main_df['neighbourhood_group'].value_counts().index)
ax.set_title('Share of Neighborhood')
plt.show()

![image.png](attachment:image.png)

Manhattan and Brooklyn have the highest number of listings, at around 20K each. This can be attributed to the fact that both of those boroughs have more of the tourist attractions, and people typically want to stay close to what they are seeing.

In [None]:
# we can see from our statistical table that we have some extreme values, therefore we need to remove them for the sake of a better visualization

# creating a sub-dataframe with no extreme values / less than 400
winsorized_df=main_df[main_df.price < 400]

# using violinplot to showcase density and distribution of prices
viz_2=sns.violinplot(data=winsorized_df, x='neighbourhood_group', y='price')
viz_2.set_title('Price distribution for each neighbourhood')
plt.show()

![image.png](attachment:image.png)

Here we can see the distribution of property prices, based on which neighbourhood group they belong to. Manhattan seems to have more of the higher-priced properties. The Bronx, Staten Island, and Queens have much more reasonable prices compared to Brooklyn and Manhattan. All distributions have positive skew.

## Room Type
Now that we've looked at the locations for the Airbnbs, let's explore the room types and how the price distribution differs between them.

In [None]:
# using countplot once again
ax = sns.countplot(x='room_type',data=main_df,order=main_df['room_type'].value_counts().index)
ax.set_title('Share of Room Type')
plt.show()

![image.png](attachment:image.png)

We also see that entire homes and private rooms are the most common, which may be because the demand for shared rooms is typically lower.

In [None]:
# make sure to use the winsorized_df to ignore the outliers!
# using violinplot to showcase density and distribution of prices
viz_2=sns.violinplot(data=winsorized_df, x='room_type', y='price')
viz_2.set_title('Price distribution by room type')
plt.show()

![image.png](attachment:image.png)

As expected, shared rooms have the lowest mean price, while entire homes have the highest. All room types seem to have a similar spread; however, private rooms and shared rooms seem to be more centered around their mean. There is more disparity in price among entire homes.

Now let's see how the prices look on a map of New York City, so we can really visualize our data. Using plotly, we can display a map of New York City with the Airbnbs colored in based on price. This way, we can visualize where exactly the more expensive or super cheap Airbnbs are located.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In [None]:
room_graph=px.histogram(bnb_affordable,x="price",title="Affordable room prices", template="plotly_dark")
room_graph.show()

![image.png](attachment:image.png)

In [None]:
# Now, suppose the vacationing couple is not concerned about price but wants to see all room types
# for this, we'll reuse the neighbourhood data frame from before but remove the price restriction

# remake the dataframe for Manhattan, but this time allow any price
bnb_midtown_df = bnb[bnb.neighbourhood_group=='Manhattan']

room_graph=px.histogram(bnb_midtown_df, x="neighbourhood",color="room_type",title="Room types in Manhattan",
                       template="plotly_dark")
room_graph.show()

![image.png](attachment:image.png)

In [None]:
air_bnb_neighborhood = air_bnb_df_copy[['host_name', 'neighbourhood_group', 'neighbourhood', 'availability_365',
                                        'room_type', 'price']]
bnb = air_bnb_neighborhood[air_bnb_neighborhood.neighbourhood_group == 'Manhattan']
bnb_midtown_df = bnb[bnb.neighbourhood=='Midtown']

room_availability_graph = px.histogram(bnb_midtown_df, x="availability_365",title="Availability of apartments",
                                      template="plotly_dark")
room_availability_graph.show()

![image.png](attachment:image.png)

In [None]:
room_availability_scatter = px.scatter(air_bnb_neighborhood, x="neighbourhood", y="price", color = "room_type", 
                                       title="Price by neighbourhood and room type", template="plotly_dark")
room_availability_scatter.show()

![image.png](attachment:image.png)

## Asking and Answering Questions
In this project, we analyzed the Bed and Breakfast facilities available in Manhattan for a couple planning a vacation there.

### Q1: How many rooms are available, and what type are they?

In [None]:
room_availability_scatter = px.scatter(nyc_air_bnb_df, x="neighbourhood_group", color = "room_type", 
                                       title="Apartment types", template="plotly_dark")
room_availability_scatter.show()

![image.png](attachment:image.png)
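
The plot gives a visual answer; a table of counts answers the question more directly. This is a sketch on a made-up frame, assuming only the `neighbourhood_group` and `room_type` column names used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical listings; the column names match the dataset, the rows do not.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens"],
    "room_type": ["Entire home/apt", "Private room", "Private room", "Private room", "Shared room"],
})

# Number of listings per room type across the whole table.
room_counts = toy["room_type"].value_counts()

# Listings per borough and room type; combinations that never occur show as 0.
by_borough = pd.crosstab(toy["neighbourhood_group"], toy["room_type"])

print(room_counts)
print(by_borough)
```

Applied to the real data, the same two calls would run on `nyc_air_bnb_df` in place of `toy`.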

### Q2: What is the average price of each type of available room?

![image.png](attachment:image.png)
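
One way to compute a per-type average is a `groupby` on `room_type`. The sketch below uses made-up prices and is not necessarily how the figure above was produced; reporting the median alongside the mean is our own addition, since the price outliers noted earlier pull the mean upward.

```python
import pandas as pd

# Hypothetical prices per room type.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt",
                  "Private room", "Private room", "Private room",
                  "Shared room"],
    "price": [200, 300, 80, 100, 120, 40],
})

# One row per room type with the listing count, mean price, and median price.
price_by_type = toy.groupby("room_type")["price"].agg(["count", "mean", "median"])
print(price_by_type)
```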

### Q3: What is the average affordable price of a midtown Manhattan bnb?

![image.png](attachment:image.png)

In [None]:
avg_affordable_price = int(affordable_price_sum)/affordable_bnbs
print("The average bnb room price across all affordable bnbs in Midtown Manhattan is ${:.2f}".format(avg_affordable_price))

![image.png](attachment:image.png)

### Q4: If you wanted to rent a whole house, what would be the average price?

In [None]:
# the following is to get the column names and is for reference only
#print(air_bnb_df_copy.columns)

# keep only the listings that offer an entire home or apartment
entire_home_df = air_bnb_df_copy.loc[air_bnb_df_copy.room_type == 'Entire home/apt', ['room_type', 'price']]

homes_count = len(entire_home_df.index)
homes_price_all = entire_home_df.price.sum()

print("The average entire-home bnb is ${:.2f}".format(homes_price_all / homes_count))

![image.png](attachment:image.png)

### Q5: Plot the locations of all bnbs in Manhattan, color coded by type

In [None]:
bnb_manhattan = air_bnb_df_copy[air_bnb_df_copy.neighbourhood_group == 'Manhattan']
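# longitude goes on the x-axis and latitude on the y-axis so the plot is oriented like a map;
# the figure is not named `map`, which would shadow the Python built-in
manhattan_map = px.scatter(bnb_manhattan, x="longitude", y="latitude", color="room_type", template="plotly_dark", title="Room types in Manhattan", width=600, height=400)
manhattan_map.show()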

![image.png](attachment:image.png)

In [None]:
# Geographic plot using Plotly

import plotly.graph_objects as go
import plotly.offline as py
py.init_notebook_mode(connected=False)

# randomly sampling 10000 data points from our dataframe for rendering purposes
sample_df = winsorized_df.sample(10000)

mapbox_access_token = ""  # supply your own Mapbox access token here

# adding points from our sample_df and coloring them based on the price
# the colorscale we use is reversed magma, where higher prices are a darker shade and lower prices are a lighter shade
fig = go.Figure(go.Scattermapbox(
        lon = sample_df['longitude'],
        lat = sample_df['latitude'],
        mode='markers',
        marker=go.scattermapbox.Marker(
            size=5,
            color = sample_df['price'],
            colorscale="Magma",
            reversescale=True,
            colorbar=dict(
                title="Price ($)"
            ),
            opacity=0.7
        ),
    ))

# setting the map title and the center latitude and longitude
fig.update_layout(
    title="Airbnb prices in New York City",
    hovermode='closest',
    mapbox=go.layout.Mapbox(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=go.layout.mapbox.Center(
            lat=40.7,
            lon=-74
        ),
        pitch=0,
        zoom=9
    )
)

py.iplot(fig)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

We see there is definitely clustering of higher prices in downtown Manhattan. There are also noticeable clusters in Upper Brooklyn and Upper Manhattan. Location could provide a good signal of price.

## Natural Language Processing

Natural Language Processing deals with analyzing text to draw insights into how and why humans say the things they do. We can use this to look at the names that hosts gave their Airbnbs, and see how people advertise their homes using text features.

(If you want to learn more about NLP, check out this article by Towards Data Science)

### Word Count
The first thing we want to look at is word frequency in names. This will allow us to see how hosts are naming their property, and what words we can expect to see when browsing for an Airbnb.

We start by getting the names and creating a list of all words appearing in each name.

In [None]:
# getting list of all names
names = main_df['name'].tolist()
name_words = []
for name in names:
    name = str(name).split()
    for word in name:
        name_words.append(word.lower())

Now we have a list of all the words in the names. The problem is a lot of the words that we will see are words that don't have much meaning to us. Take a look at the first 30 we encounter, for example.

In [None]:
print(name_words[:30])

['clean', '&', 'quiet', 'apt', 'home', 'by', 'the', 'park', 'skylit', 'midtown', 'castle', 'the', 'village', 'of', 'harlem....new', 'york', '!', 'cozy', 'entire', 'floor', 'of', 'brownstone', 'entire', 'apt:', 'spacious', 'studio/loft', 'by', 'central', 'park', 'large']

Here we see tokens like "by" or "the" that don't really tell us much. Luckily, the NLTK library for Python has a way for us to get rid of these words (stopwords).

In [None]:
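import nltk # natural language toolkit
from nltk.corpus import stopwords

nltk.download('stopwords') # fetch the stopword list on first use

# build the stopword set once; calling stopwords.words() inside the loop re-reads the list for every word
english_stopwords = set(stopwords.words('english'))
filtered_words = [word for word in name_words if word not in english_stopwords]
print(filtered_words[:30])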

['clean', '&', 'quiet', 'apt', 'home', 'park', 'skylit', 'midtown', 'castle', 'village', 'harlem....new', 'york', '!', 'cozy', 'entire', 'floor', 'brownstone', 'entire', 'apt:', 'spacious', 'studio/loft', 'central', 'park', 'large', 'cozy', '1', 'br', 'apartment', 'midtown', 'east']

We can see that our filtered words no longer contain the stopwords we wanted to avoid.
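
The filtered list still contains tokens such as `'harlem....new'`, `'apt:'`, and `'&'`, where punctuation is attached to a word or stands alone. A possible cleanup, sketched here with only the standard library, splits each token on non-alphanumeric characters. The helper `split_token` and the sample `raw_tokens` are our own illustration, not part of the original analysis, and applying this step would change the word counts shown below.

```python
import re
from collections import Counter

# A few raw tokens of the kind seen in the output above.
raw_tokens = ["clean", "&", "quiet", "apt", "harlem....new", "york", "!", "apt:", "studio/loft"]

def split_token(token):
    """Split a raw token on anything that is not a letter or digit, dropping empty pieces."""
    return [part for part in re.split(r"[^a-z0-9]+", token.lower()) if part]

# Flatten the pieces of every token into one clean list.
clean_tokens = [part for token in raw_tokens for part in split_token(token)]
counts = Counter(clean_tokens)

print(clean_tokens)
print(counts.most_common(3))
```

With this step, `'apt'` and `'apt:'` are counted as the same word instead of as two separate entries.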

Next, we'll use the collections library in Python, which allows us to quickly count the words in our list. We can then add each word and its frequency to a dataframe for visualization.

In [None]:
from collections import Counter # to count words in our list

# counting the top 25 words used by hosts in naming their homes
words_count = Counter(filtered_words).most_common(25)

In [None]:
# converting the data into a dataframe so we can plot it using Seaborn
words_df = pd.DataFrame(words_count)
words_df.head()

![image.png](attachment:image.png)

To make this easier to understand, let's rename our columns. The first column is showing the words from most to least common. The second column shows the count in the entire list of names.

In [None]:
words_df.rename(columns={0:"Words", 1:"Count"}, inplace=True)

In [None]:
# Now let's plot!
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)
fig.suptitle('Top 25 Words Used in NYC Airbnb Names', fontsize=20)
sns.barplot(x='Words', y='Count', data=words_df, ax=ax)
plt.xticks(rotation=80)
plt.show()

![image.png](attachment:image.png)

Looking at the top 25 words, we can see some clear trends. For one, hosts are using simple and specific words that allow users to find their property with a quick search. Words like "bedroom", "private" and "apartment" help hosts get found, and then they can try to pull travelers in with nice pictures and descriptions. Another thing we can see is the use of words that draw users in. "Cozy", "Spacious" and "Beautiful" are all words that appear often, and for good reason!

## Sentiment Analysis
Speaking of nice sounding words, let's take a look at how hosts are naming their property, in particular, the positivity behind the name. If hosts tend to advertise their home better, they might want to charge a higher price for having more amenities, being in a better location, etc., which is all something they can list in the name.

We will use the NLTK library and its VADER lexicon, a built-in dictionary of words and sentiments that we can use to calculate sentiment scores for our text. To use the lexicon, first download it through nltk.

In [None]:
nltk.download('vader_lexicon')

![image.png](attachment:image.png)

Now we can use the SentimentIntensityAnalyzer, which does the work for us! We can give it pieces of text, in our case the names of the Airbnbs, and it will return the sentiment scores to us.

In [None]:
# initialize the analyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

In [None]:
# preparing an array that we will use as data for our dataframe
data = []
for i,row in main_df.iterrows():
    # we want to create a dictionary that stores the name and price of a property, as well as the neutral, positive, and
    # compound scores that were returned by NLTK.
    dic = {}
    ss = sid.polarity_scores(str(row['name']))
    
    dic['name'] = row['name']
    dic['sentiment_neu'] = ss['neu']
    dic['sentiment_pos'] = ss['pos']
    dic['sentiment_compound'] = ss['compound']
    dic['price'] = row['price']
    data.append(dic)

In [None]:
# building our dataframe from the data
sentiment_df = pd.DataFrame(data)
sentiment_df.head()

![image.png](attachment:image.png)

In [None]:
sns.lmplot(x='sentiment_compound', y='price', data=sentiment_df, fit_reg=False, height=8, aspect=2)

![image.png](attachment:image.png)

We look at the compound sentiment score, which is determined by the nltk SentimentIntensityAnalyzer as the normalized aggregated score (more information here), and see how it affects the price that the hosts set for their Airbnb. We actually see something a little interesting: a majority of the Airbnbs that are offered at a higher price seem to have a higher compound score. This could be because hosts make their property seem better in the name (with richer vocabulary), and then charge a higher price.

However, the right side of our plot is dense at every price level, so the score does little to separate cheap listings from expensive ones. We therefore decide not to use it as a feature when regressing for price.

Note: because we are looking at just the names of the property (where we expect not too much extra sentiment information), many of the Airbnb names returned a compound score of 0.0.
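
To see why so many names score exactly 0.0, it helps to look at how a compound score is normalized. As we understand VADER, the summed word valences `x` are squashed into the range (-1, 1) with `x / sqrt(x^2 + alpha)`; the constant `alpha = 15` used below is an assumption that should be checked against the NLTK source. The function is a standalone sketch, not a call into NLTK.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded valence sum into (-1, 1); alpha=15 is an assumed constant."""
    return score / math.sqrt(score * score + alpha)

# A name with no lexicon words has a valence sum of 0, so its compound score is 0.
print(normalize(0))

# Larger valence sums approach, but never reach, +1 or -1.
for total in (1, 2, 4, 8):
    print(total, round(normalize(total), 3))
```

Under this formula, a listing name needs at least one word from the lexicon to move away from 0, which is consistent with the many zero scores noted above.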

## Variable Correlations

We now look at a correlation plot among the numerical variables. We don't see any strong correlations between meaningful variables, except between number_of_reviews and reviews_per_month.

In [None]:
main_df.corr(numeric_only=True).style.background_gradient(cmap='coolwarm')  # numeric_only skips text columns such as name
# plt.show()

![image.png](attachment:image.png)

The task is to regress price on the other features.

Let's try a multiple linear regression on the features. We drop the features name, id, host name, and last review. We transform the categorical variables (neighbourhood_group, neighbourhood, room_type) into integer labels using Scikit-Learn's LabelEncoder.

We use Ordinary Least Squares (OLS) Regression. We hold out 20% of the data for testing.
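
One caveat with integer labels in a linear model is that they impose an arbitrary order on categories (for example, room type 2 is treated as "twice" room type 1). A common alternative is one-hot encoding. The sketch below shows it with `pd.get_dummies` on a made-up frame; it is a possible variation, not the encoding used in this notebook.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room", "Private room"],
    "minimum_nights": [1, 3, 2, 1],
})

# One indicator column per category; drop_first=True removes one level
# so the indicators are not perfectly collinear with the intercept.
encoded = pd.get_dummies(toy, columns=["room_type"], drop_first=True, dtype=int)
print(encoded)
```

The dropped level ("Entire home/apt" here, the first in alphabetical order) becomes the baseline against which the other room types' coefficients are read.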

In [None]:
# Machine learning imports
from sklearn import metrics
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor


# Preparing the main_df: drop text/identifier columns and fill missing review rates with 0
main_df.drop(['name','id','host_name','last_review'],axis=1,inplace=True)
main_df['reviews_per_month']=main_df['reviews_per_month'].fillna(0)

# Encode each categorical column as integer labels between 0 and n_classes-1
for col in ['neighbourhood_group', 'neighbourhood', 'room_type']:
    main_df[col] = LabelEncoder().fit_transform(main_df[col])

main_df.sort_values(by='price',ascending=True,inplace=True)

In [None]:
'''Train LRM'''
lm = LinearRegression()

X = main_df[['neighbourhood_group','neighbourhood','latitude','longitude','room_type','minimum_nights','number_of_reviews','reviews_per_month','calculated_host_listings_count','availability_365']]
y = main_df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

lm.fit(X_train,y_train)

For evaluation, we calculate Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the coefficient of determination (R^2).

MAE is the average of the absolute errors. MSE is the average of the squared errors. This penalizes larger errors by more. Taking the square root to get RMSE returns to our original units.

R^2 is the proportion of the variance in the dependent variable that is predictable from the independent variables.

From scikit-learn's documentation for R^2:

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse).

We use MAE, because we believe price outliers exist and don't want them to dominate the error. According to MAE, our model is off by roughly $72 to $75 on average, depending on the random train/test split. This is well under the standard deviation of price (about $240), which is the typical error of always guessing the mean, but realistically not that great.
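
The three metrics can also be computed by hand, which makes the effect of a single outlier easy to see. The arrays below are made-up values, not model output.

```python
import numpy as np

# Hypothetical actual and predicted prices; the last listing is an outlier.
y_true = np.array([100.0, 150.0, 200.0, 1000.0])
y_pred = np.array([110.0, 140.0, 210.0, 400.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))           # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))    # squaring weights the outlier much more heavily

u = np.sum(errors ** 2)                         # residual sum of squares
v = np.sum((y_true - y_true.mean()) ** 2)       # total sum of squares
r2 = 1 - u / v

print("MAE:", mae)
print("RMSE:", rmse)
print("R^2:", r2)
```

Three of the four predictions are off by only 10, yet the one large miss pushes RMSE to roughly twice the MAE, which is the reason for preferring MAE when outliers are expected.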

In [None]:
'''Get Predictions & Print Metrics'''
predicts = lm.predict(X_test)

print("""
        Mean Absolute Error: {}
        Root Mean Squared Error: {}
        R2 Score: {}
     """.format(
        mean_absolute_error(y_test,predicts),
        np.sqrt(metrics.mean_squared_error(y_test, predicts)),
        r2_score(y_test,predicts),
        ))

  Mean Absolute Error: 75.26679580308493
  
  Root Mean Squared Error: 225.61535777968695
  
  R2 Score: 0.090924502614535

We plot the regressor price predictions against the actual ones. This is to visually check if our regression estimates look good, as well as test if the assumptions of a linear relationship are satisfied. The assumptions are explained in detail here.

Some core assumptions:

- error terms (residuals) are normally distributed around the regression line, and homoscedastic (spread doesn't grow or shrink).
- no multicollinearity (multicollinearity is when a feature is itself linearly dependent on other features).
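
One common screen for multicollinearity is the variance inflation factor (VIF). For a set of features, the VIFs are the diagonal entries of the inverse of their correlation matrix; a value near 1 means a feature is nearly independent of the others, while large values signal redundancy. The sketch below uses synthetic data, and the variable names are placeholders, not columns from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

a = rng.normal(size=n)
b = a + 0.1 * rng.normal(size=n)   # nearly a copy of a
c = rng.normal(size=n)             # generated independently of a and b

# Columns are features, rows are observations.
X = np.column_stack([a, b, c])

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry
# of the inverse of the feature correlation matrix.
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(["a", "b", "c"], vif):
    print(name, round(value, 2))
```

In this notebook, number_of_reviews and reviews_per_month would be the natural pair to check, since they were the only strongly correlated variables in the correlation plot.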

In [None]:
plt.figure(figsize=(16,8))
sns.regplot(x=predicts, y=y_test)
plt.xlabel('Predictions')
plt.ylabel('Actual')
plt.title("Linear Model Predictions")
plt.grid(False)
plt.show()

![image.png](attachment:image.png)

We notice some large outliers in the positive direction. These represent listings that were much more expensive than expected. Perhaps this is an indication that they're a ripoff! Or there are more features that account for their price, such as room quality and amenities.

This plot shows some problems: we notice our regressor is relatively conservative in predicting price (it doesn't go above 400), when there are listings in the thousands. We also notice it predicts negative prices for some listings, which is nonsensical. Unfortunately, the residuals don't seem to be evenly distributed around the regression line - this may indicate an issue with the model assumptions.

## Finding a Better Model
It seems as though there is a significant relationship between Airbnb price and a variety of the features included in the dataset. However, considering the number of features in the present model, it is likely that we could find a more parsimonious model and improve our R^2 value. Unfortunately, scikit-learn makes it difficult to analyze the significance of each predictor in a model. Instead, we can temporarily make use of Python's StatsModels library, which has powerful statistical tools, including detailed model summary information.

### Diagnostic Plots
Below are a few methods to generate diagnostic plots which can be used to check the assumptions of linear regression.

- Plot of residual size against the fitted (predicted) value. We expect an even (homoscedastic), Gaussian distribution around y=0.
- Plot of residuals against the order the data was presented. If we see trends, then that indicates a serious problem.
- Histogram of residual sizes.

In [None]:
# Residuals vs. Fitted
def r_v_fit(m):
    ax = sns.residplot(x=m.fittedvalues, y=m.resid)
    plt.title("Residuals vs. Fitted")
    plt.ylabel("Residuals")
    plt.xlabel("Fitted Values")
    plt.show()
    
# Residuals vs. Order
def r_v_order(m):
    ax = plt.scatter(m.resid.index, m.resid)
    plt.title("Residuals vs. Order")
    plt.ylabel("Residuals")
    plt.xlabel("Order")
    plt.show()

# Histogram
def r_hist(m, binwidth):
    resid = m.resid
    plt.hist(m.resid, bins=np.arange(min(resid), max(resid) + binwidth, binwidth))
    plt.title("Histogram of Residuals")
    plt.show()
    
# Get separate dataframe for statsmodels analysis
sm_df = pd.read_csv('nyc.csv')

# Split data for training and testing
sm_df['logprice'] = np.log(1 + sm_df['price'])
train_data, test_data = train_test_split(sm_df, test_size=0.2)

In [None]:
import statsmodels.formula.api as smf

# Create the model
model = smf.ols(
    'price ~ neighbourhood_group + latitude + longitude \
     + room_type + minimum_nights + number_of_reviews + reviews_per_month \
     + calculated_host_listings_count + availability_365',
    data=train_data).fit()

print("P-Value:\t{}".format(model.pvalues.iloc[0]))  # p-value of the first coefficient (the intercept)
print("R_Squared:\t{}".format(model.rsquared))
print("R_Squared Adj:\t{}".format(model.rsquared_adj))

# Diagnostic Plots for model
r_v_fit(model)
r_v_order(model)
r_hist(model, 100)

P-Value:	2.7230705088860967e-16

R_Squared:	0.11050193903353478

R_Squared Adj:	0.11013023069213623

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

The above plots show a few issues with the current model. The residuals vs. fitted plot shows mostly positive residuals and a number of outliers. It also appears as though the plot has a cone shape, and therefore does not meet the equal spread condition for linear regression. Residuals vs. Order also shows a number of outliers and, like the vs. fitted plot, takes on mostly positive values. Lastly, the histogram of residuals is very skewed. This is likely due to the outliers as seen above. Also, the R-Squared value of the model (listed above the residual plots) is very low.

Because of the skew in distribution of residuals/number of outliers, it may be useful to attempt a log transformation on the response (price). Let's try the new model out to see how it compares:

In [None]:
# Fitting a new model with a log-transformed price
log_model = smf.ols(
    'logprice ~ neighbourhood_group + latitude + longitude \
     + room_type + minimum_nights + number_of_reviews + reviews_per_month \
     + calculated_host_listings_count + availability_365',
    data=train_data).fit()

print("P-Value:\t{}".format(log_model.pvalues.iloc[0]))  # p-value of the first coefficient (the intercept)
print("R_Squared:\t{}".format(log_model.rsquared))
print("R_Squared Adj:\t{}".format(log_model.rsquared_adj))

# Diagnostic Plots for new, transformed model
r_v_fit(log_model)
r_v_order(log_model)
r_hist(log_model, 0.1)

P-Value:	1.6528354271293651e-137

R_Squared:	0.5121880744786157

R_Squared Adj:	0.5119842249485189

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Although there are still some outliers, this model immediately appears to be improved! All three diagnostic plots come much closer to meeting the assumptions necessary for linear regression. There is still a slight cone shape to the residuals vs. fitted plot, but it looks much better. Residuals vs. order has better spread around the x-axis, which is consistent with independence of the data. Lastly, the histogram of residuals is much closer to a normal distribution than in the original model. The R-Squared value is much higher than that of the original model (0.51 versus 0.11), although the two values are measured on different scales (log price versus price), so the comparison is only indicative. Overall, this new model is much better for explaining price.

With this new nonlinear relationship, our interpretation of the model coefficients also changes. If we have a coefficient m, then an increase of X by 1 corresponds to an increase of log(Y) by m; this is equivalent to saying Y is multiplied by a factor of e^m.
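
As a worked example of this interpretation, the helper below converts a coefficient on the log scale into a percentage change. The coefficient values are hypothetical, not taken from the fitted model. Note also that `logprice` was defined as log(1 + price), so strictly the factor applies to price + 1, which is nearly the same thing for typical nightly prices.

```python
import math

def percent_change(coef):
    """Percentage change in the response for a one-unit increase in a predictor,
    given its coefficient in a log-response model."""
    return (math.exp(coef) - 1) * 100

# Hypothetical coefficients for illustration only.
for coef in (0.10, -0.70, 0.005):
    print(coef, round(percent_change(coef), 2))
```

For small coefficients the percentage change is close to 100 times the coefficient, but for larger ones (such as a room-type indicator) the exponential form matters.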

## Reducing the Model
Now that we have a better model, it may be worth examining to see if any predictors may be removed from the model. The Statsmodels library has great summary statistics, so we can look at the p-value of each of the predictors to see how significant they are:

In [None]:
print(log_model.pvalues)

![image.png](attachment:image.png)

With a significance level of α = 0.05, we can see from the above output that only two of the predictors, the Queens borough indicator and reviews per month, are not significant predictors of price. Through backward elimination we can remove these predictors one by one to see if our model improves. We start by eliminating reviews per month because it has the higher p-value.

In [None]:
# same formula as log_model, with reviews_per_month removed
log_model_1 = smf.ols(
    'logprice ~ neighbourhood_group + latitude + longitude \
     + room_type + minimum_nights + number_of_reviews \
     + calculated_host_listings_count + availability_365',
    data=train_data).fit()

print("P-Value:\t{}".format(log_model_1.pvalues.iloc[0]))  # p-value of the first coefficient (the intercept)
print("R_Squared:\t{}".format(log_model_1.rsquared))
print("R_Squared Adj:\t{}".format(log_model_1.rsquared_adj))

![image.png](attachment:image.png)

In [None]:
model.params

![image.png](attachment:image.png)

## Summary

We perform exploratory data analysis, examining listings across boroughs, room types, and location. We also use natural language processing to look at how hosts name their property, looking at the top words as well as sentiment. We then perform ordinary least squares regression on price, and find that there is some predictive power. However, the assumptions of linear regression aren't satisfied, indicating a nonlinear relationship. We perform a log-linear regression on price, and find that the assumptions are much better satisfied and that predictive power improves dramatically. Finally, we analyze the significance of each individual feature.

Recommendation for the future: Try a more complex model (Gradient Boosted Regressor)

# Happy Airbnb hunting!

![image.png](attachment:image.png)

![image.png](attachment:image.png)