## Overview

This notebook shows how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) is the Databricks File System, which lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. You can run an individual cell in another language by starting it with the `%LANGUAGE` syntax (for example, `%sql`). Python, Scala, SQL, and R are all supported.

In [0]:
# File location and type
file_location = "/FileStore/tables/tips-3.csv"
file_type = "csv"

# Read the CSV, using the first row as column names and letting Spark infer each column's type.
df = spark.read.csv(file_location, header=True, inferSchema=True)

In [0]:
df.printSchema()

root
 |-- total_bill: double (nullable = true)
 |-- tip: double (nullable = true)
 |-- sex: string (nullable = true)
 |-- smoker: string (nullable = true)
 |-- day: string (nullable = true)
 |-- time: string (nullable = true)
 |-- size: integer (nullable = true)



In [0]:
df.head(3)

Out[4]: [Row(total_bill=16.99, tip=1.01, sex='Female', smoker='No', day='Sun', time='Dinner', size=2),
 Row(total_bill=10.34, tip=1.66, sex='Male', smoker='No', day='Sun', time='Dinner', size=3),
 Row(total_bill=21.01, tip=3.5, sex='Male', smoker='No', day='Sun', time='Dinner', size=3)]

In [0]:
df.show()

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|
|     26.88|3.12|  Male|    No|Sun|Dinner|   4|
|     15.04|1.96|  Male|    No|Sun|Dinner|   2|
|     14.78|3.23|  Male|    No|Sun|Dinner|   2|
|     10.27|1.71|  Male|    No|Sun|Dinner|   2|
|     35.26| 5.0|Female|    No|Sun|Dinner|   4|
|     15.42|1.57|  Male|    No|Sun|Dinner|   2|
|     18.43| 3.0|  Male|    No|Sun|Dinner|   4|
|     14.83|3.02|Female|    No|Sun|Dinner|   2|
|     21.58|3.92|  Male|    No|Sun|Dinner|   2|
|     10.33|1.67|Female|    No|Sun|Dinner|   3|
|     16.29|3.71|  Male|    No|Sun|Dinne

In [0]:
df.columns

Out[6]: ['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size']

In [0]:
# Handle categorical features: StringIndexer maps each string label to a numeric index
from pyspark.ml.feature import StringIndexer

In [0]:
# Index the "sex" column; by default the most frequent label gets index 0.0
indexer = StringIndexer(inputCol="sex", outputCol="sex_index")
df1 = indexer.fit(df).transform(df)
df1.show()

+----------+----+------+------+---+------+----+---------+
|total_bill| tip|   sex|smoker|day|  time|size|sex_index|
+----------+----+------+------+---+------+----+---------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|      1.0|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|      0.0|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|      0.0|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|      0.0|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|      1.0|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|      0.0|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|      0.0|
|     26.88|3.12|  Male|    No|Sun|Dinner|   4|      0.0|
|     15.04|1.96|  Male|    No|Sun|Dinner|   2|      0.0|
|     14.78|3.23|  Male|    No|Sun|Dinner|   2|      0.0|
|     10.27|1.71|  Male|    No|Sun|Dinner|   2|      0.0|
|     35.26| 5.0|Female|    No|Sun|Dinner|   4|      1.0|
|     15.42|1.57|  Male|    No|Sun|Dinner|   2|      0.0|
|     18.43| 3.0|  Male|    No|Sun|Dinner|   4|      0.0|
|     14.83|3.
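
The output above gives `Male` the index 0.0 and `Female` the index 1.0. This is consistent with `StringIndexer`'s default ordering, which sorts labels by descending frequency so that the most common label receives index 0. The following is a minimal pure-Python sketch of that ordering rule, not Spark's implementation. `frequency_index` is a hypothetical helper name, and the alphabetical tie-break is an assumption made here only to keep the sketch deterministic.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count; break ties alphabetically (assumption, for determinism).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# The "sex" values of the first five rows shown above.
sample = ["Female", "Male", "Male", "Male", "Female"]
mapping = frequency_index(sample)
encoded = [mapping[v] for v in sample]

print(mapping)
print(encoded)
```

The indices carry no numeric meaning beyond identifying a category, which is worth keeping in mind when they are later fed to a linear model as if they were quantities.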

In [0]:
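# Index the remaining categorical columns in one pass (multi-column inputCols/outputCols need Spark 3.0+)
indexer = StringIndexer(inputCols=["smoker", "day", "time"], outputCols=["smoker_index", "day_index", "time_index"])
df1 = indexer.fit(df1).transform(df1)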
df1.show()

+----------+----+------+------+---+------+----+---------+------------+---------+----------+
|total_bill| tip|   sex|smoker|day|  time|size|sex_index|smoker_index|day_index|time_index|
+----------+----+------+------+---+------+----+---------+------------+---------+----------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|      1.0|         0.0|      1.0|       0.0|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|      0.0|         0.0|      1.0|       0.0|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|      0.0|         0.0|      1.0|       0.0|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|      0.0|         0.0|      1.0|       0.0|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|      1.0|         0.0|      1.0|       0.0|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|      0.0|         0.0|      1.0|       0.0|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|      0.0|         0.0|      1.0|       0.0|
|     26.88|3.12|  Male|    No|Sun|Dinner|   4|      0.0|         0.0|      1.0|

In [0]:
# Assemble the independent (input) features into a single vector column
from pyspark.ml.feature import VectorAssembler

vectorassembler = VectorAssembler(
    inputCols=['total_bill', 'size', 'sex_index', 'smoker_index', 'day_index', 'time_index'],
    outputCol="Independent Features",
)

output = vectorassembler.transform(df1)

In [0]:
output.show()

+----------+----+------+------+---+------+----+---------+------------+---------+----------+--------------------+
|total_bill| tip|   sex|smoker|day|  time|size|sex_index|smoker_index|day_index|time_index|Independent Features|
+----------+----+------+------+---+------+----+---------+------------+---------+----------+--------------------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|      1.0|         0.0|      1.0|       0.0|[16.99,2.0,1.0,0....|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|      0.0|         0.0|      1.0|       0.0|[10.34,3.0,0.0,0....|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|      0.0|         0.0|      1.0|       0.0|[21.01,3.0,0.0,0....|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|      0.0|         0.0|      1.0|       0.0|[23.68,2.0,0.0,0....|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|      1.0|         0.0|      1.0|       0.0|[24.59,4.0,1.0,0....|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|      0.0|         0.0|      1.0|       0.0|[25.2
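
Conceptually, the assembler concatenates the listed columns, in order, into one numeric vector per row. A minimal pure-Python sketch of that idea for the first row of the table is shown here. `assemble` is a hypothetical helper, not part of PySpark, and the row is written as a plain dictionary purely for illustration.

```python
def assemble(row, input_cols):
    """Concatenate the named numeric columns of one row into a list of floats."""
    return [float(row[col]) for col in input_cols]


input_cols = ["total_bill", "size", "sex_index", "smoker_index", "day_index", "time_index"]

# First row of the indexed DataFrame, copied from the output shown earlier.
first_row = {
    "total_bill": 16.99, "tip": 1.01, "size": 2,
    "sex_index": 1.0, "smoker_index": 0.0, "day_index": 1.0, "time_index": 0.0,
}

features = assemble(first_row, input_cols)
print(features)  # [16.99, 2.0, 1.0, 0.0, 1.0, 0.0]
```

The `tip` column is left out on purpose: it is the label the model will predict, so it must not appear among the input features.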

In [0]:
output.select('Independent Features','tip').show()

+--------------------+----+
|Independent Features| tip|
+--------------------+----+
|[16.99,2.0,1.0,0....|1.01|
|[10.34,3.0,0.0,0....|1.66|
|[21.01,3.0,0.0,0....| 3.5|
|[23.68,2.0,0.0,0....|3.31|
|[24.59,4.0,1.0,0....|3.61|
|[25.29,4.0,0.0,0....|4.71|
|[8.77,2.0,0.0,0.0...| 2.0|
|[26.88,4.0,0.0,0....|3.12|
|[15.04,2.0,0.0,0....|1.96|
|[14.78,2.0,0.0,0....|3.23|
|[10.27,2.0,0.0,0....|1.71|
|[35.26,4.0,1.0,0....| 5.0|
|[15.42,2.0,0.0,0....|1.57|
|[18.43,4.0,0.0,0....| 3.0|
|[14.83,2.0,1.0,0....|3.02|
|[21.58,2.0,0.0,0....|3.92|
|[10.33,3.0,1.0,0....|1.67|
|[16.29,3.0,0.0,0....|3.71|
|[16.97,3.0,1.0,0....| 3.5|
|(6,[0,1],[20.65,3...|3.35|
+--------------------+----+
only showing top 20 rows
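
The last row above is printed as `(6,[0,1],[20.65,3...` rather than as a bracketed list. That is the sparse vector notation `(size, indices, values)`: only the non-zero entries are stored, and every other position is zero. The sketch below expands such a triple into a dense list. `sparse_to_dense` is a hypothetical helper, and the example values come from the complete sparse vector `(6,[0,1],[9.55,2.0])` that appears in the predictions output of this notebook, because the row above is truncated.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list of floats."""
    dense = [0.0] * size
    for position, value in zip(indices, values):
        dense[position] = float(value)
    return dense


dense = sparse_to_dense(6, [0, 1], [9.55, 2.0])
print(dense)  # [9.55, 2.0, 0.0, 0.0, 0.0, 0.0]
```

Both notations describe the same kind of 6-element feature vector, so the model treats them identically. On an actual PySpark vector, `toArray()` gives the dense form.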



In [0]:
finaldata = output.select("Independent Features", "tip")

In [0]:
finaldata.show()

+--------------------+----+
|Independent Features| tip|
+--------------------+----+
|[16.99,2.0,1.0,0....|1.01|
|[10.34,3.0,0.0,0....|1.66|
|[21.01,3.0,0.0,0....| 3.5|
|[23.68,2.0,0.0,0....|3.31|
|[24.59,4.0,1.0,0....|3.61|
|[25.29,4.0,0.0,0....|4.71|
|[8.77,2.0,0.0,0.0...| 2.0|
|[26.88,4.0,0.0,0....|3.12|
|[15.04,2.0,0.0,0....|1.96|
|[14.78,2.0,0.0,0....|3.23|
|[10.27,2.0,0.0,0....|1.71|
|[35.26,4.0,1.0,0....| 5.0|
|[15.42,2.0,0.0,0....|1.57|
|[18.43,4.0,0.0,0....| 3.0|
|[14.83,2.0,1.0,0....|3.02|
|[21.58,2.0,0.0,0....|3.92|
|[10.33,3.0,1.0,0....|1.67|
|[16.29,3.0,0.0,0....|3.71|
|[16.97,3.0,1.0,0....| 3.5|
|(6,[0,1],[20.65,3...|3.35|
+--------------------+----+
only showing top 20 rows



In [0]:
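from pyspark.ml.regression import LinearRegression

# Train/test split: roughly 75% of rows for training, 25% for testing.
# No seed is set, so the split differs between runs.
train, test = finaldata.randomSplit([0.75, 0.25])

# Fit a linear regression; note that `regressor` is rebound to the fitted model.
regressor = LinearRegression(featuresCol='Independent Features', labelCol='tip')
regressor = regressor.fit(train)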
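
Because `randomSplit` assigns rows at random, the 0.75/0.25 weights are targets rather than exact sizes, and without a `seed` argument the coefficients and metrics that follow will change from run to run. The pure-Python sketch below illustrates the idea of a seeded, per-row random assignment. It is an illustration only, not Spark's algorithm, and `random_split` and the integer stand-in rows are hypothetical.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split at random, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative upper bound of each split within [0, 1].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for k, upper in enumerate(bounds):
            # The last split also catches any floating-point shortfall in the bounds.
            if draw < upper or k == len(bounds) - 1:
                splits[k].append(row)
                break
    return splits


rows = list(range(100))  # stand-in row ids, not the tips data
train_rows, test_rows = random_split(rows, [0.75, 0.25], seed=42)
print(len(train_rows), len(test_rows))
```

In the notebook itself, passing a seed, as in `finaldata.randomSplit([0.75, 0.25], seed=42)`, is the usual way to make the split repeatable. The value 42 is arbitrary.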

In [0]:
regressor.coefficients

Out[38]: DenseVector([0.0928, 0.1243, -0.0005, -0.1317, 0.0728, -0.1089])

In [0]:
regressor.intercept

Out[39]: 0.8408173581066862

In [0]:
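# evaluate() returns a summary object holding the predictions and error metrics, not a column of predicted values
y_pred = regressor.evaluate(test)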

In [0]:
y_pred.predictions.show()

+--------------------+----+------------------+
|Independent Features| tip|        prediction|
+--------------------+----+------------------+
|(6,[0,1],[9.55,2.0])|1.45|1.9756438681969208|
|(6,[0,1],[10.07,2...|1.25| 2.023902850188402|
|(6,[0,1],[10.51,2...|1.25| 2.064737373411963|
|(6,[0,1],[14.0,2.0])| 3.0|2.3886293871624815|
|(6,[0,1],[16.04,3...|2.24| 2.702218958079327|
|(6,[0,1],[16.31,3...| 2.0| 2.727276506421058|
|(6,[0,1],[17.59,3...|2.64| 2.846067846707781|
|(6,[0,1],[17.78,2...|3.27|2.7394350639467113|
|(6,[0,1],[18.69,3...|2.31|2.9481541547666836|
|(6,[0,1],[20.23,2...|2.01| 2.966809113714267|
|(6,[0,1],[48.33,4...| 9.0| 5.823182000615997|
|[7.56,2.0,0.0,0.0...|1.44|1.8275791893781277|
|[8.35,2.0,1.0,0.0...| 1.5| 1.900405830233789|
|[8.52,2.0,0.0,0.0...|1.48|1.9166726945931702|
|[8.58,1.0,0.0,1.0...|1.92|1.7390640988452535|
|[8.77,2.0,0.0,0.0...| 2.0|1.9760106100192447|
|[9.68,2.0,0.0,0.0...|1.32|2.0604638285043366|
|[10.27,2.0,0.0,0....|1.71|2.1152192119177484|
|[10.34,3.0,0
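
Each value in the `prediction` column is the intercept plus the dot product of the coefficients with the feature vector. As a sanity check, the sketch below recomputes the first row of the table by hand. The coefficients are the rounded values displayed by `regressor.coefficients` above, so the result is expected to agree with Spark's prediction only to a few decimal places. `predict` is a hypothetical helper.

```python
# Values copied from the notebook output above; the coefficients are rounded as displayed.
coefficients = [0.0928, 0.1243, -0.0005, -0.1317, 0.0728, -0.1089]
intercept = 0.8408173581066862


def predict(features):
    """Linear model: intercept + sum(coefficient_i * feature_i)."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# First row of the predictions table: sparse vector (6,[0,1],[9.55,2.0]) written densely.
features = [9.55, 2.0, 0.0, 0.0, 0.0, 0.0]
manual = predict(features)
print(round(manual, 3))  # 1.976
```

Spark reports 1.9756438681969208 for this row. The small remaining gap comes from the four-decimal rounding of the displayed coefficients.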

In [0]:
# Mean absolute error, mean squared error, and R-squared on the test split
y_pred.meanAbsoluteError, y_pred.meanSquaredError, y_pred.r2

Out[44]: (0.7881136629408962, 1.089893106318984, 0.5230109933872209)
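
The three numbers are the mean absolute error, the mean squared error, and the coefficient of determination (R-squared) on the test split. An R-squared of about 0.52 means the model accounts for roughly half of the variance in `tip` on that split. The following sketch shows how these metrics are conventionally defined, using toy values rather than the notebook's data. `regression_metrics` is a hypothetical helper, not a PySpark function.

```python
def regression_metrics(actual, predicted):
    """Return (MAE, MSE, R-squared) for two equal-length lists of numbers."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n

    # R-squared = 1 - (residual sum of squares / total sum of squares).
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    return mae, mse, r2


# Toy example: always predicting the mean gives an R-squared of 0.
actual = [1.0, 2.0, 3.0]
predicted = [2.0, 2.0, 2.0]
mae, mse, r2 = regression_metrics(actual, predicted)
print(mae, mse, r2)
```

The mean absolute error is in the same units as `tip`, which usually makes it the easiest of the three to interpret. The mean squared error penalizes large misses more heavily.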