
## Overview

This project uses **linear regression** with the **PySpark framework** on the Databricks platform to build a predictive model for tips in a restaurant setting. The dataset records the total bill, the tip, the party size, and the customer's sex, smoking status, day, and meal time. A reliable model can help restaurant owners and staff understand and forecast the gratuity they can expect from customers, which supports both service quality and financial planning.

The dataset is first uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) (Databricks File System) lets you store data for querying inside Databricks. This notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. You can switch languages in an individual cell with the `%LANGUAGE` syntax (for example, `%sql`). Python, Scala, SQL, and R are all supported.

In [0]:
# File location and type
file_location = "/FileStore/tables/tips-2.csv"
file_type = "csv"

# Read the CSV, treating the first row as the header and letting Spark infer column types
df = spark.read.csv(file_location, header=True, inferSchema=True)
df.show()

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|
|     26.88|3.12|  Male|    No|Sun|Dinner|   4|
|     15.04|1.96|  Male|    No|Sun|Dinner|   2|
|     14.78|3.23|  Male|    No|Sun|Dinner|   2|
|     10.27|1.71|  Male|    No|Sun|Dinner|   2|
|     35.26| 5.0|Female|    No|Sun|Dinner|   4|
|     15.42|1.57|  Male|    No|Sun|Dinner|   2|
|     18.43| 3.0|  Male|    No|Sun|Dinner|   4|
|     14.83|3.02|Female|    No|Sun|Dinner|   2|
|     21.58|3.92|  Male|    No|Sun|Dinner|   2|
|     10.33|1.67|Female|    No|Sun|Dinner|   3|
|     16.29|3.71|  Male|    No|Sun|Dinne

In [0]:
df.printSchema()

root
 |-- total_bill: double (nullable = true)
 |-- tip: double (nullable = true)
 |-- sex: string (nullable = true)
 |-- smoker: string (nullable = true)
 |-- day: string (nullable = true)
 |-- time: string (nullable = true)
 |-- size: integer (nullable = true)



In [0]:
df.columns

Out[4]: ['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size']

In [0]:
# Handle categorical features: StringIndexer maps each string label to a numeric index

from pyspark.ml.feature import StringIndexer


In [0]:
indexer = StringIndexer(inputCol="sex", outputCol="sex_indexed")

df_r = indexer.fit(df).transform(df)
df_r.show()


+----------+----+------+------+---+------+----+-----------+
|total_bill| tip|   sex|smoker|day|  time|size|sex_indexed|
+----------+----+------+------+---+------+----+-----------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|        1.0|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|        0.0|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|        0.0|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|        0.0|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|        1.0|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|        0.0|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|        0.0|
|     26.88|3.12|  Male|    No|Sun|Dinner|   4|        0.0|
|     15.04|1.96|  Male|    No|Sun|Dinner|   2|        0.0|
|     14.78|3.23|  Male|    No|Sun|Dinner|   2|        0.0|
|     10.27|1.71|  Male|    No|Sun|Dinner|   2|        0.0|
|     35.26| 5.0|Female|    No|Sun|Dinner|   4|        1.0|
|     15.42|1.57|  Male|    No|Sun|Dinner|   2|        0.0|
|     18.43| 3.0|  Male|    No|Sun|Dinne

In [0]:
# Index all four categorical columns in one pass
indexer = StringIndexer(
    inputCols=["sex", "smoker", "day", "time"],
    outputCols=["sex_indexed", "smoker_indexed", "day_indexed", "time_indexed"],
)

df_r = indexer.fit(df).transform(df)
df_r.show()

+----------+----+------+------+---+------+----+-----------+--------------+-----------+------------+
|total_bill| tip|   sex|smoker|day|  time|size|sex_indexed|smoker_indexed|day_indexed|time_indexed|
+----------+----+------+------+---+------+----+-----------+--------------+-----------+------------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|        1.0|           0.0|        1.0|         0.0|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|        0.0|           0.0|        1.0|         0.0|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|        0.0|           0.0|        1.0|         0.0|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|        0.0|           0.0|        1.0|         0.0|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|        1.0|           0.0|        1.0|         0.0|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|        0.0|           0.0|        1.0|         0.0|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|        0.0|           0.0|        1.0|         0.0|


In [0]:
df_r.columns

Out[14]: ['total_bill',
 'tip',
 'sex',
 'smoker',
 'day',
 'time',
 'size',
 'sex_indexed',
 'smoker_indexed',
 'day_indexed',
 'time_indexed']
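
The indices above are consistent with StringIndexer's default ordering, in which the most frequent label receives index 0.0 (here `Male` maps to 0.0 and `Female` to 1.0). The following is a minimal pure-Python sketch of that idea, assuming frequency-descending order with alphabetical tie-breaking. `index_labels` is a hypothetical helper for illustration, not the Spark implementation, and the sample is the `sex` value of the first six rows shown earlier.

```python
from collections import Counter


def index_labels(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically to break ties (assumed rule)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# `sex` values from the first six rows of the dataset printed above
sample = ["Female", "Male", "Male", "Male", "Female", "Male"]
mapping = index_labels(sample)
indexed = [mapping[v] for v in sample]
print(mapping)
print(indexed)
```

Because the index reflects only label frequency, the numeric order carries no meaning of its own for multi-valued columns such as `day`.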

In [0]:
# Group all independent features together into a single vector column with VectorAssembler

from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=['tip', 'size', 'sex_indexed', 'smoker_indexed', 'day_indexed', 'time_indexed'], outputCol="Indepenedent features")

output = assembler.transform(df_r)

In [0]:
output.show()

+----------+----+------+------+---+------+----+-----------+--------------+-----------+------------+---------------------+
|total_bill| tip|   sex|smoker|day|  time|size|sex_indexed|smoker_indexed|day_indexed|time_indexed|Indepenedent features|
+----------+----+------+------+---+------+----+-----------+--------------+-----------+------------+---------------------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|        1.0|           0.0|        1.0|         0.0| [1.01,2.0,1.0,0.0...|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|        0.0|           0.0|        1.0|         0.0| [1.66,3.0,0.0,0.0...|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|        0.0|           0.0|        1.0|         0.0| [3.5,3.0,0.0,0.0,...|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|        0.0|           0.0|        1.0|         0.0| [3.31,2.0,0.0,0.0...|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|        1.0|           0.0|        1.0|         0.0| [3.61,4.0,1.0,0.0...|
|     25.29|4.71|  Male|

In [0]:
output.select("Indepenedent features").show()

+---------------------+
|Indepenedent features|
+---------------------+
| [1.01,2.0,1.0,0.0...|
| [1.66,3.0,0.0,0.0...|
| [3.5,3.0,0.0,0.0,...|
| [3.31,2.0,0.0,0.0...|
| [3.61,4.0,1.0,0.0...|
| [4.71,4.0,0.0,0.0...|
| [2.0,2.0,0.0,0.0,...|
| [3.12,4.0,0.0,0.0...|
| [1.96,2.0,0.0,0.0...|
| [3.23,2.0,0.0,0.0...|
| [1.71,2.0,0.0,0.0...|
| [5.0,4.0,1.0,0.0,...|
| [1.57,2.0,0.0,0.0...|
| [3.0,4.0,0.0,0.0,...|
| [3.02,2.0,1.0,0.0...|
| [3.92,2.0,0.0,0.0...|
| [1.67,3.0,1.0,0.0...|
| [3.71,3.0,0.0,0.0...|
| [3.5,3.0,1.0,0.0,...|
| (6,[0,1],[3.35,3.0])|
+---------------------+
only showing top 20 rows
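
The last row above, `(6,[0,1],[3.35,3.0])`, is printed differently because a row with mostly zero entries can be stored as a sparse vector. The notation reads as (vector size, indices of the non-zero entries, values at those indices). The small sketch below expands that notation into a dense list; `to_dense` is a hypothetical helper written for illustration, not a PySpark function.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse vector into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# The sparse row shown above: tip=3.35, size=3.0, every indexed column 0.0
row = to_dense(6, [0, 1], [3.35, 3.0])
print(row)  # [3.35, 3.0, 0.0, 0.0, 0.0, 0.0]
```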



In [0]:
# Keep only the two columns the model needs, the assembled feature vector and the label, in final_data.
# Note: the label selected here is total_bill, and tip is one of the input features.

final_data = output.select("Indepenedent features", "total_bill")

In [0]:
final_data.show()

+---------------------+----------+
|Indepenedent features|total_bill|
+---------------------+----------+
| [1.01,2.0,1.0,0.0...|     16.99|
| [1.66,3.0,0.0,0.0...|     10.34|
| [3.5,3.0,0.0,0.0,...|     21.01|
| [3.31,2.0,0.0,0.0...|     23.68|
| [3.61,4.0,1.0,0.0...|     24.59|
| [4.71,4.0,0.0,0.0...|     25.29|
| [2.0,2.0,0.0,0.0,...|      8.77|
| [3.12,4.0,0.0,0.0...|     26.88|
| [1.96,2.0,0.0,0.0...|     15.04|
| [3.23,2.0,0.0,0.0...|     14.78|
| [1.71,2.0,0.0,0.0...|     10.27|
| [5.0,4.0,1.0,0.0,...|     35.26|
| [1.57,2.0,0.0,0.0...|     15.42|
| [3.0,4.0,0.0,0.0,...|     18.43|
| [3.02,2.0,1.0,0.0...|     14.83|
| [3.92,2.0,0.0,0.0...|     21.58|
| [1.67,3.0,1.0,0.0...|     10.33|
| [3.71,3.0,0.0,0.0...|     16.29|
| [3.5,3.0,1.0,0.0,...|     16.97|
| (6,[0,1],[3.35,3.0])|     20.65|
+---------------------+----------+
only showing top 20 rows



In [0]:
# Fit a linear regression model on a 75/25 train/test split

from pyspark.ml.regression import LinearRegression
train_data, test_data = final_data.randomSplit([0.75, 0.25])
regressor= LinearRegression(featuresCol="Indepenedent features", labelCol="total_bill")
regressor = regressor.fit(train_data)
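
`randomSplit` assigns each row to a split at random according to the weights, so the resulting sizes are only approximately 75% and 25%, and they change between runs unless a seed is supplied. A rough pure-Python sketch of that behavior follows, assuming independent per-row draws. `random_split`, the seed value, and the 200-row input are hypothetical and do not reproduce Spark's implementation.

```python
import random


def random_split(rows, train_weight=0.75, seed=42):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(200))  # hypothetical row ids standing in for DataFrame rows
train, test = random_split(rows)
print(len(train), len(test))  # sizes are close to, but not exactly, 150 and 50
```

Fixing the seed is what makes coefficients and metrics comparable from one run to the next.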

In [0]:
# Linear regression is a statistical technique that estimates the average functional relationship between variables.

# Each coefficient is the weight of one feature: the change in the predicted value for a one-unit change in that feature, with the other features held fixed.

regressor.coefficients

Out[31]: DenseVector([3.3315, 3.103, -1.3708, 2.3859, -0.2787, -0.6884])

In [0]:
# The intercept is the predicted value of the dependent variable when every independent variable is 0 (zero). For example, an intercept can be used to estimate a metal's electrical resistance at 0°C when the data points were taken at room temperature and higher.

regressor.intercept

Out[32]: 1.4926812587381062
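
Together, the coefficients and the intercept define the fitted model: a prediction is the intercept plus the dot product of the coefficients with the feature vector. The sketch below applies the rounded values printed above to a row with tip 1.25, party size 2, and all indexed columns at 0. `predict` is a hypothetical helper, and the feature order is the one passed to VectorAssembler.

```python
# Rounded coefficients and intercept as printed by the notebook
coefficients = [3.3315, 3.103, -1.3708, 2.3859, -0.2787, -0.6884]
intercept = 1.4926812587381062


def predict(features):
    """Linear model: intercept + sum(coefficient_i * feature_i)."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Order: tip, size, sex_indexed, smoker_indexed, day_indexed, time_indexed
features = [1.25, 2.0, 0.0, 0.0, 0.0, 0.0]
estimate = predict(features)
print(round(estimate, 2))  # approximately 11.86
```

This agrees, up to coefficient rounding, with the 11.863... that the notebook's evaluation output reports for the sparse row `(6,[0,1],[1.25,2.0])`.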

In [0]:
# Evaluate the fitted model on the held-out test data

prediction = regressor.evaluate(test_data)

In [0]:
# Show the actual values alongside the model's predicted values

prediction.predictions.show()

+---------------------+----------+------------------+
|Indepenedent features|total_bill|        prediction|
+---------------------+----------+------------------+
| (6,[0,1],[1.25,2.0])|     10.07|11.863047015909341|
| (6,[0,1],[1.25,2.0])|     10.51|11.863047015909341|
|  (6,[0,1],[2.0,3.0])|     16.31| 17.46466997651938|
| (6,[0,1],[2.24,3.0])|     16.04|18.264234934006275|
| (6,[0,1],[2.34,4.0])|     17.81|21.700369468089317|
| (6,[0,1],[3.76,2.0])|     18.24|20.225163862959764|
|  (6,[0,1],[9.0,4.0])|     48.33|  43.8882970383506|
| [1.0,1.0,1.0,0.0,...|      7.25|6.5563429468730465|
| [1.01,2.0,1.0,0.0...|     16.99| 9.413918376545844|
| [1.1,2.0,1.0,1.0,...|      12.9| 12.37835204833388|
| [1.32,2.0,0.0,0.0...|      9.68|11.817531216490366|
| [1.5,2.0,0.0,1.0,...|     15.69| 14.80307950198333|
| [1.5,2.0,1.0,0.0,...|     26.41| 11.32508574343424|
| [1.56,2.0,0.0,0.0...|      9.94| 12.61709617397726|
| [1.61,2.0,1.0,1.0...|     10.59|14.077427582993526|
| [1.63,2.0,1.0,0.0...|     

In [0]:
# r2: the share of the variation in the label that the model explains. An R-squared of 0.4 means about 40% of the variation in the data is explained by the fitted line: it does not match the data perfectly, but it captures some of the pattern. The higher the value, the better the fit.

# meanAbsoluteError: the average absolute difference between the predicted and actual values.

# meanSquaredError: the average squared difference between the predicted and actual values. Large errors are penalized more heavily; the lower the MSE, the better the predictions.

# rootMeanSquaredError: the square root of the MSE, expressed in the same units as the label. The lower the RMSE, the closer the predictions are to the actual values on average.


prediction.r2, prediction.degreesOfFreedom, prediction.meanAbsoluteError, prediction.meanSquaredError, prediction.rootMeanSquaredError

Out[45]: (0.4903137485555916,
 63,
 4.56912345189115,
 43.37505351183223,
 6.585973998721239)
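
These metrics can be reproduced by hand from actual and predicted values. The sketch below applies the standard definitions to the first seven rows of the prediction table shown earlier (predictions rounded to two decimals), so its numbers will differ from the full test-set figures above. `regression_metrics` is a hypothetical helper. The last lines check that the reported RMSE is the square root of the reported MSE.

```python
import math


def regression_metrics(actual, predicted):
    """Compute R-squared, MAE, MSE and RMSE from paired values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    mse = ss_res / n
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
    }


# First seven (total_bill, prediction) pairs from the table above, rounded
actual = [10.07, 10.51, 16.31, 16.04, 17.81, 18.24, 48.33]
predicted = [11.86, 11.86, 17.46, 18.26, 21.70, 20.23, 43.89]
metrics = regression_metrics(actual, predicted)
print(metrics)

# Consistency check on the values reported by the notebook
reported_mse = 43.37505351183223
reported_rmse = 6.585973998721239
rmse_matches = math.isclose(math.sqrt(reported_mse), reported_rmse, rel_tol=1e-9)
print(rmse_matches)
```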