
# 0. INTRODUCTION

Welcome! In this notebook, we develop a predictive model to estimate the price of accommodations. Specifically, we want to understand how the number of bedrooms in a listing, together with its review score, influences its price, a critical factor in the real estate and hospitality industries.

Our tool of choice for this endeavor is the powerful [Linear Regression model](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegression.html), a staple in the field of machine learning and statistical modeling. Linear Regression is renowned for its simplicity and effectiveness, particularly when it comes to understanding relationships between numerical variables. In our case, it's the relationship between the number of rooms and accommodation prices.

The data fueling our analysis comes from a robust and public [dataset](https://drive.google.com/uc?id=1ygn8_Gh3wd7XvYebw3S7SV_87dw7BXjh&export=download) provided by Airbnb. This dataset is a rich source of information, encapsulating various attributes of accommodations listed on the platform. We will specifically focus on extracting insights about room numbers and their corresponding prices.

Throughout this notebook, we will guide you through the process of data preparation, model training, and evaluation. Our aim is to not only build a functional model but also to extract meaningful insights that can inform business strategies in the accommodation sector.

So, let's begin our journey into the world of big data and machine learning, utilizing the power of Apache Spark to drive our analysis forward.



# 1. PREPARE THE DOCUMENT AND GET THE DATA


As we move forward in our journey to develop a predictive model for accommodation prices, the first critical step is to set up our environment and gather the data we need. This stage is foundational, ensuring that we have a solid base to build upon.

We start by preparing our notebook for analysis. A Spark session is the entry point for programming Spark applications: it gives us access to Spark's functionality and its distributed data processing capabilities.

In this case, Databricks provides everything we need to start running Spark. We just create a cluster and attach it to the notebook; the session is then available as the `spark` variable, so we do not have to initialize it ourselves.

With our environment ready, the next step is obtaining the data. For our model, we use publicly available Airbnb data. As the file name `sf-airbnb-clean` and the neighbourhood names in the sample below indicate, it covers listings in San Francisco, with details such as location, property type, capacity, review scores, and price.

We download the dataset as a zip archive with `wget`, extract the Parquet files we need, and move them to DBFS. We then read the data into a DataFrame, a fundamental structure of Apache Spark that lets us manipulate and analyze data in a distributed and efficient manner.

Upon successful loading of the data, we'll proceed with an initial exploration. This involves understanding the structure of the data, such as the number of columns and rows, types of variables, and a peek at the first few rows of the dataset. This preliminary exploration is crucial as it gives us insights into the nature of the data we are dealing with and helps us identify any initial cleaning or transformation that might be necessary.

In [0]:
# import some libraries

from pyspark.sql.functions import col, exp, percent_rank, percentile_approx

In [0]:
%sh
# download the data
# (the URL is quoted so the shell does not treat "&" as a background operator)

wget -O /tmp/datasets.zip "https://drive.google.com/uc?id=1ygn8_Gh3wd7XvYebw3S7SV_87dw7BXjh&export=download"

--2023-12-08 17:23:10--  https://drive.google.com/uc?id=1ygn8_Gh3wd7XvYebw3S7SV_87dw7BXjh
Resolving drive.google.com (drive.google.com)... 142.250.217.110, 2607:f8b0:400a:800::200e
Connecting to drive.google.com (drive.google.com)|142.250.217.110|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0o-3c-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/kp9qva0de2hnq80uu2ikn6s07vvovnlj/1702056150000/09343199190924879457/*/1ygn8_Gh3wd7XvYebw3S7SV_87dw7BXjh?uuid=04318700-9550-4f83-b47e-11956a8211ef [following]
--2023-12-08 17:23:11--  https://doc-0o-3c-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/kp9qva0de2hnq80uu2ikn6s07vvovnlj/1702056150000/09343199190924879457/*/1ygn8_Gh3wd7XvYebw3S7SV_87dw7BXjh?uuid=04318700-9550-4f83-b47e-11956a8211ef
Resolving doc-0o-3c-docs.googleusercontent.com (doc-0o-3c-docs.googleusercontent.com)... 142.251.33.65, 2607:f8b0:400a:805::2001
Connecting to doc-0o-3c-docs

In [0]:
%sh
# unzip the data that we want

unzip /tmp/datasets.zip 'Datasets/sf-airbnb-clean.parquet/*' -d /tmp/

Archive:  /tmp/datasets.zip
  inflating: /tmp/Datasets/sf-airbnb-clean.parquet/_SUCCESS  
  inflating: /tmp/Datasets/sf-airbnb-clean.parquet/_started_4320459746949313749  
  inflating: /tmp/Datasets/sf-airbnb-clean.parquet/_committed_4320459746949313749  
  inflating: /tmp/Datasets/sf-airbnb-clean.parquet/part-00000-tid-4320459746949313749-5c3d407c-c844-4016-97ad-2edec446aa62-6688-1-c000.snappy.parquet  


In [0]:
%fs

mv -r file:/tmp/Datasets dbfs:/FileStore



Next, we check that the files were moved to DBFS correctly.

In [0]:
%fs ls /FileStore

path,name,size,modificationTime
dbfs:/FileStore/sf-airbnb-clean.parquet/,sf-airbnb-clean.parquet/,0,0
dbfs:/FileStore/tables/,tables/,0,0


In [0]:
# read the data

path = "dbfs:/FileStore/sf-airbnb-clean.parquet/"

data = spark.read.parquet(path)

In [0]:
# show a sample of our data
 
data.sample(0.0015).display()

host_is_superhost,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,bedrooms_na,bathrooms_na,beds_na,review_scores_rating_na,review_scores_accuracy_na,review_scores_cleanliness_na,review_scores_checkin_na,review_scores_communication_na,review_scores_location_na,review_scores_value_na
f,moderate,f,1.0,Haight Ashbury,37.77575,-122.44397,Apartment,Entire home/apt,2.0,1.0,1.0,1.0,Real Bed,30.0,27.0,97.0,10.0,10.0,10.0,10.0,9.0,9.0,130.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,moderate,f,1.0,Haight Ashbury,37.76351,-122.44584,House,Private room,2.0,1.0,1.0,1.0,Real Bed,2.0,167.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,150.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,strict_14_with_grace_period,f,2.0,South of Market,37.78453,-122.39141,Loft,Entire home/apt,4.0,2.0,2.0,2.0,Real Bed,30.0,24.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,250.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,flexible,f,1.0,Haight Ashbury,37.76637,-122.4467,House,Private room,2.0,1.0,1.0,1.0,Real Bed,7.0,50.0,96.0,10.0,10.0,10.0,10.0,10.0,10.0,80.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,strict_14_with_grace_period,f,1.0,Diamond Heights,37.74132,-122.43705,Townhouse,Entire home/apt,6.0,3.0,3.0,3.0,Real Bed,7.0,5.0,100.0,10.0,10.0,10.0,10.0,9.0,10.0,295.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,strict_14_with_grace_period,f,12.0,Castro/Upper Market,37.75903,-122.43726,Apartment,Private room,1.0,2.0,1.0,1.0,Real Bed,30.0,7.0,100.0,10.0,10.0,10.0,10.0,9.0,10.0,65.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
t,moderate,t,1.0,Potrero Hill,37.75437,-122.39897,House,Entire home/apt,3.0,1.0,2.0,2.0,Real Bed,30.0,12.0,95.0,10.0,10.0,10.0,10.0,10.0,10.0,135.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
f,strict_14_with_grace_period,t,165.0,Financial District,37.79738,-122.39756,Apartment,Entire home/apt,2.0,1.0,1.0,1.0,Real Bed,30.0,0.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,170.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
f,strict_14_with_grace_period,f,1.0,South of Market,37.77552,-122.40926,Apartment,Private room,6.0,1.0,1.0,2.0,Real Bed,30.0,0.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,65.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
f,flexible,t,8.0,Downtown/Civic Center,37.78821,-122.41664,Apartment,Private room,1.0,2.0,1.0,1.0,Real Bed,30.0,0.0,98.0,10.0,10.0,10.0,10.0,10.0,10.0,50.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [0]:
# see the columns that we have

data.columns

Out[6]: ['host_is_superhost',
 'cancellation_policy',
 'instant_bookable',
 'host_total_listings_count',
 'neighbourhood_cleansed',
 'latitude',
 'longitude',
 'property_type',
 'room_type',
 'accommodates',
 'bathrooms',
 'bedrooms',
 'beds',
 'bed_type',
 'minimum_nights',
 'number_of_reviews',
 'review_scores_rating',
 'review_scores_accuracy',
 'review_scores_cleanliness',
 'review_scores_checkin',
 'review_scores_communication',
 'review_scores_location',
 'review_scores_value',
 'price',
 'bedrooms_na',
 'bathrooms_na',
 'beds_na',
 'review_scores_rating_na',
 'review_scores_accuracy_na',
 'review_scores_cleanliness_na',
 'review_scores_checkin_na',
 'review_scores_communication_na',
 'review_scores_location_na',
 'review_scores_value_na']

In [0]:
# summary statistics for the columns we are interested in

data.select('price', 'bedrooms', 'review_scores_rating').summary('count', 'mean','min','max').show()

+-------+------------------+------------------+--------------------+
|summary|             price|          bedrooms|review_scores_rating|
+-------+------------------+------------------+--------------------+
|  count|              7146|              7146|                7146|
|   mean|213.30982367758187|1.3427092079485026|   96.03428491463755|
|    min|              10.0|               0.0|                20.0|
|    max|           10000.0|              14.0|               100.0|
+-------+------------------+------------------+--------------------+




# 2. ALGORITHM CREATION

Continuing with our predictive modeling project for accommodation pricing, the next essential step is the creation of the algorithm. This phase focuses on developing and tuning a machine learning model capable of accurately predicting accommodation prices based on the collected data.

Data Initialization and Preprocessing

Before diving into model building, proper data preprocessing is crucial. This involves cleaning the data, addressing missing values, and possibly performing feature transformation to enhance the data's relevance for our model. For example, we can transform categorical variables into numerical ones using techniques like one-hot encoding.
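
As a minimal illustration of one-hot encoding, the sketch below encodes `room_type` values like those in the data sample of this notebook using plain Python. The helper `one_hot_encode` is hypothetical and written only for this example; in Spark the same idea is usually expressed with `StringIndexer` and `OneHotEncoder` from `pyspark.ml.feature`.

```python
def one_hot_encode(values):
    """Map each category to a list of 0/1 flags, one flag per distinct category."""
    categories = sorted(set(values))  # fixed, reproducible column order
    encoded = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, encoded

# room_type values as they appear in the sample rows of this notebook
room_types = ["Entire home/apt", "Private room", "Entire home/apt"]
categories, encoded = one_hot_encode(room_types)

for room_type, flags in zip(room_types, encoded):
    print(room_type, flags)
```

Each listing gets exactly one flag set to 1, so the model can use the category without assuming any numeric order between room types.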


Once our data is prepared, we'll select an appropriate machine learning model. Given the nature of our problem, which is a regression task (predicting prices, a continuous value), we might consider models like linear regression.

Model Training

The chosen model will be trained using a portion of our dataset. During this process, the algorithm will learn to recognize patterns and relationships between the features of the accommodations and their prices.
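
To make "learning" concrete, here is a small sketch of ordinary least squares with a single feature, fitted in plain Python on the ten sample rows displayed in section 1 (bedrooms and price). The function `fit_simple_ols` is a hypothetical helper for illustration only. Spark's `LinearRegression` solves the same kind of problem on the full training split and with several features, so its coefficients will differ from the ones obtained here.

```python
def fit_simple_ols(xs, ys):
    """Return (slope, intercept) minimizing the squared error of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # covariance-like and variance-like sums (the 1/n factors cancel in the ratio)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# (bedrooms, price) values from the ten sample rows displayed in section 1
bedrooms = [1, 1, 2, 1, 3, 1, 2, 1, 1, 1]
prices = [130, 150, 250, 80, 295, 65, 135, 170, 65, 50]

slope, intercept = fit_simple_ols(bedrooms, prices)
print(f"price = {slope:.2f} * bedrooms + {intercept:.2f}")
```

Ten rows are far too few to draw conclusions; the point is only to show that the fitted line is the one that minimizes the squared differences between predicted and observed prices.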

Model Validation and Tuning

After training, it's crucial to validate the model's performance using a part of the dataset that wasn't used in training (test data). We will evaluate the model based on metrics such as Mean Squared Error (MSE) or the coefficient of determination (R²). Depending on the results, we might adjust the model's parameters or even reconsider our choice of model if necessary.
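
In PySpark, scoring a model on held-out data could look like the sketch below. It assumes a fitted `PipelineModel` whose last stage writes a `prediction` column and a DataFrame with a `price` column. The helper name `rmse_on_split` is hypothetical, and `pyspark` is imported inside the function so that defining it does not require a running cluster.

```python
def rmse_on_split(pipeline_model, df, label_col="price"):
    """Score a fitted pipeline on a DataFrame and return the RMSE."""
    # imported lazily: only needed when the function is actually called
    from pyspark.ml.evaluation import RegressionEvaluator

    predictions = pipeline_model.transform(df)  # adds the 'prediction' column
    evaluator = RegressionEvaluator(
        predictionCol="prediction", labelCol=label_col, metricName="rmse"
    )
    return evaluator.evaluate(predictions)

# Hypothetical usage with the objects created later in this notebook:
# rmse_test = rmse_on_split(pipModel, split_data[1])
```

Comparing the value obtained on `split_data[1]` with the one obtained on `split_data[0]` would show how well the model generalizes to listings it has not seen.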

Implementation and Testing

Finally, once we are satisfied with the model's performance, we will implement it to make predictions on new data. It's essential to conduct thorough testing to ensure that the model behaves as expected in different scenarios and with different types of accommodation data.

This algorithm creation process is iterative and may require several cycles of adjustments and testing to fine-tune the model and ensure it provides accurate and reliable predictions.

In [0]:
# split the data into training (80%) and test (20%) sets

split_data = data.randomSplit([0.8, 0.2], seed = 12)

In [0]:
# select the feature columns and assemble them into a single vector column

from pyspark.ml.feature import VectorAssembler

vecAssembler = VectorAssembler(inputCols = ['bedrooms', 'review_scores_rating'], outputCol = 'features')
trainVec = vecAssembler.transform(split_data[0])

trainVec.select('features').show(5)

+-----------+
|   features|
+-----------+
|[1.0,100.0]|
| [1.0,97.0]|
| [1.0,98.0]|
|[1.0,100.0]|
| [3.0,98.0]|
+-----------+
only showing top 5 rows



In [0]:
# we create the model and train it

from pyspark.ml.regression import LinearRegression

LR = LinearRegression(featuresCol='features', labelCol= 'price')
LRFIT = LR.fit(trainVec)

In [0]:
# print the fitted regression equation (the intercept is negative here, so it is shown as a subtraction)

print(f"Price = {round(LRFIT.coefficients[0], 2)} * bedrooms + {round(LRFIT.coefficients[1], 2)} * review_scores_rating - {round(LRFIT.intercept*-1, 2)}")

Price = 109.56 * bedrooms + 2.7 * review_scores_rating - 195.12
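
As a quick check of how to read this equation, the sketch below plugs in two listings using the rounded coefficients printed above. The function `predict_price` is a hypothetical helper. Because the coefficients are rounded to two decimals, its results differ by a few tenths from the predictions Spark computes with the unrounded values.

```python
def predict_price(bedrooms, review_scores_rating):
    """Apply the fitted equation with the rounded coefficients printed by the notebook."""
    return 109.56 * bedrooms + 2.7 * review_scores_rating - 195.12

# one bedroom with a rating of 100, and three bedrooms with a rating of 98
print(round(predict_price(1, 100), 2))
print(round(predict_price(3, 98), 2))
```

In this fit, each additional bedroom adds about 109.56 to the predicted price, and each additional rating point adds about 2.70.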


In [0]:
# after creating the model, we build a pipeline that chains the vector assembler and the regression

from pyspark.ml import Pipeline

pipeline_1 = Pipeline(stages = [vecAssembler, LR])
pipModel = pipeline_1.fit(split_data[0])

In [0]:
# once the pipeline is fitted, we can use it to transform our data

pipTrainData = pipModel.transform(split_data[0])

pipTrainData.select('bedrooms', 'review_scores_rating', 'price', 'prediction').display(10)

bedrooms,review_scores_rating,price,prediction
1.0,100.0,200.0,184.0989421315413
1.0,97.0,85.0,176.0091804061193
1.0,98.0,95.0,178.70576764792662
1.0,100.0,250.0,184.0989421315413
3.0,98.0,250.0,397.82480531440353
1.0,98.0,45.0,178.70576764792662
1.0,98.0,70.0,178.70576764792662
1.0,96.0,105.0,173.31259316431198
1.0,91.0,86.0,159.8296569552752
1.0,95.0,100.0,170.6160059225046


In [0]:
display(trainVec)


After creating and training the model, it's time to evaluate it.
We will use RMSE (Root Mean Squared Error), the metric most commonly used in these cases.

First, we calculate the RMSE we would have obtained by always predicting the average price instead of using the model. Comparing the two values tells us whether building the model was worthwhile or whether further steps are necessary. Note that both values are computed on the training split (`split_data[0]`).
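
For reference, the two quantities being compared can be written as follows, where $y_i$ is the observed price of listing $i$, $\hat{y}_i$ the model prediction, $\bar{y}$ the average price, and $n$ the number of listings. This notation is introduced here for illustration and does not appear elsewhere in the notebook.

```latex
\mathrm{RMSE}_{\text{model}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{RMSE}_{\text{avg}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```

If the model's RMSE is not clearly below the average-price RMSE, the chosen features add little beyond what the mean already provides.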

In [0]:
# create a column with the average of the price

from pyspark.sql.functions import avg, lit

avg_price = split_data[0].select(avg('price')).first()[0]
data_avgprice = pipTrainData.withColumn('avg_price', lit(avg_price))

In [0]:
data_avgprice.columns

Out[50]: ['host_is_superhost',
 'cancellation_policy',
 'instant_bookable',
 'host_total_listings_count',
 'neighbourhood_cleansed',
 'latitude',
 'longitude',
 'property_type',
 'room_type',
 'accommodates',
 'bathrooms',
 'bedrooms',
 'beds',
 'bed_type',
 'minimum_nights',
 'number_of_reviews',
 'review_scores_rating',
 'review_scores_accuracy',
 'review_scores_cleanliness',
 'review_scores_checkin',
 'review_scores_communication',
 'review_scores_location',
 'review_scores_value',
 'price',
 'bedrooms_na',
 'bathrooms_na',
 'beds_na',
 'review_scores_rating_na',
 'review_scores_accuracy_na',
 'review_scores_cleanliness_na',
 'review_scores_checkin_na',
 'review_scores_communication_na',
 'review_scores_location_na',
 'review_scores_value_na',
 'features',
 'prediction',
 'avg_price']

In [0]:
data_avgprice.select('avg_price').show(10)

+-----------------+
|        avg_price|
+-----------------+
|210.4009458749343|
|210.4009458749343|
|210.4009458749343|
|210.4009458749343|
|210.4009458749343|
|210.4009458749343|
|210.4009458749343|
|210.4009458749343|
|210.4009458749343|
|210.4009458749343|
+-----------------+
only showing top 10 rows



In [0]:
# evaluate the model using the avg

from pyspark.ml.evaluation import RegressionEvaluator

meanEvaluator = RegressionEvaluator(predictionCol = 'avg_price', labelCol = 'price', metricName = 'rmse')

rmse_avg = meanEvaluator.evaluate(data_avgprice)


In [0]:
# evaluate the model using the prediction

Evaluator = RegressionEvaluator(predictionCol = 'prediction', labelCol = 'price', metricName = 'rmse')

rmse_prediction = Evaluator.evaluate(data_avgprice)

In [0]:
print(f"We have a RMSE of: {round(rmse_avg, 2)} using the average as price, and {round(rmse_prediction, 2)} using the model to predict the price.")

We have a RMSE of: 280.6 using the average as price, and 260.44 using the model to predict the price
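
To see what the two evaluators compute, the sketch below reproduces the calculation in plain Python on the ten training rows displayed earlier, using their prices and predictions together with the training average price of about 210.40. The `rmse` helper is hypothetical, and because only ten rows are used, the resulting numbers are not expected to match the full-data values printed above.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# price and prediction columns of the ten training rows displayed above
prices = [200.0, 85.0, 95.0, 250.0, 250.0, 45.0, 70.0, 105.0, 86.0, 100.0]
model_predictions = [
    184.0989421315413, 176.0091804061193, 178.70576764792662,
    184.0989421315413, 397.82480531440353, 178.70576764792662,
    178.70576764792662, 173.31259316431198, 159.8296569552752,
    170.6160059225046,
]

# baseline: the same training-set average price for every row
avg_price = 210.4009458749343
baseline_predictions = [avg_price] * len(prices)

rmse_avg_sample = rmse(prices, baseline_predictions)
rmse_model_sample = rmse(prices, model_predictions)
print(f"baseline: {rmse_avg_sample:.2f}, model: {rmse_model_sample:.2f}")
```

On these ten rows the model also comes out ahead of the baseline, which matches the direction of the full-data result.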



# 3. CONCLUSION

In our endeavor to predict accommodation prices using a Linear Regression model, we have achieved a modest improvement over a baseline approach that uses the average price. Our model yielded a Root Mean Square Error (RMSE) of 260.44, about 7% lower than the RMSE of 280.6 obtained when simply using the average price as a predictor. This indicates that the number of bedrooms and the review score carry some information about price, although most of the variation remains unexplained. Both figures were measured on the training split, so performance on unseen data still needs to be checked with the held-out test split.
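
The size of the improvement can be checked with a little arithmetic on the two RMSE values reported by the notebook. The sketch also derives the share of variance explained, which follows from the ratio of squared errors under the assumption that the baseline is the mean of the same rows used for evaluation, as is the case here.

```python
rmse_avg = 280.6      # RMSE when predicting the average price
rmse_model = 260.44   # RMSE of the linear regression

# relative reduction in RMSE achieved by the model
relative_reduction = (rmse_avg - rmse_model) / rmse_avg

# with a mean baseline, R^2 = 1 - MSE_model / MSE_baseline
r_squared = 1 - (rmse_model / rmse_avg) ** 2

print(f"RMSE reduction: {relative_reduction:.1%}")
print(f"Implied R^2: {r_squared:.3f}")
```

The reduction is roughly 7%, and the implied R² is roughly 0.14, which is why additional features and outlier handling are natural next steps.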

However, it's important to recognize that our model, like any analytical tool, has room for improvement. We have identified several avenues for enhancing the model's performance, such as addressing outliers, normalizing the data, and potentially incorporating additional explanatory variables that could capture more complexity in the data.

As we conclude, we reflect on the potential of machine learning in transforming raw data into actionable insights. The value of our model lies not only in its current predictive power but also in its capacity to evolve and improve. By applying further data preprocessing techniques, experimenting with different model parameters, and exploring more sophisticated algorithms, we can continue to refine our predictions.

This notebook serves as a foundation for such future work. It is a testament to the power of Spark and machine learning in making informed predictions that can guide decision-making in real-world scenarios.