## Overview

This notebook shows how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) (the Databricks File System) lets you store data for querying inside Databricks. This notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. However, you can switch languages in a cell with the `%LANGUAGE` syntax. Python, Scala, SQL, and R are all supported.

In [0]:
# File location and type
file_location = "/FileStore/tables/Churn_Modelling.csv"
file_type = "csv"

# CSV options (applied in the next cell): the first row is a header, and column types are inferred
# df = spark.read.csv(file_location, header=True, inferSchema=True)

In [0]:
df = spark.read.csv(file_location, header=True, inferSchema=True)

In [0]:
df.printSchema()

In [0]:
# File location and type
file_location = "/FileStore/tables/tips.csv"
file_type = "csv"

# CSV options: the first row is a header, and column types are inferred from the data
tips = spark.read.csv(file_location, header=True, inferSchema=True)

In [0]:
tips.printSchema()

In [0]:
# sex, smoker, day, and time are categorical features
from pyspark.ml.feature import StringIndexer

In [0]:
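# Index each categorical column (the multi-column inputCols/outputCols form requires Spark 3.0+)
stin = StringIndexer(inputCols=['sex', 'smoker', 'day', 'time'], outputCols=['sex_indexed', 'smoker_indexed', 'day_indexed', 'time_indexed'])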

df_new=stin.fit(tips).transform(tips)

In [0]:
df_new.columns

In [0]:
df_new.show()
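
To make the indexed columns easier to read, here is a minimal pure-Python sketch of the default ordering that `StringIndexer` is documented to use (`frequencyDesc`: the most frequent label gets index 0.0, with ties broken alphabetically). The helper `index_labels` and the sample values are hypothetical and are not taken from `tips.csv`; this illustrates the idea and is not Spark's implementation.

```python
from collections import Counter


def index_labels(values):
    """Map each distinct string to a float index, most frequent label first.

    Ties in frequency are broken alphabetically, mirroring the documented
    default ordering ("frequencyDesc") of StringIndexer.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample values, not read from tips.csv
days = ["Sat", "Sun", "Sat", "Thur", "Sat", "Sun", "Fri"]
mapping = index_labels(days)
indexed = [mapping[d] for d in days]
print(mapping)
print(indexed)
```

Because the indices follow label frequency, they carry no natural order, which is worth keeping in mind when they are fed to a linear model as plain numbers.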

In [0]:
from pyspark.ml.feature import VectorAssembler

In [0]:
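# Use the same column-name casing ("Independent features") that later cells select and train on
vect = VectorAssembler(inputCols=['tip', 'size', 'sex_indexed', 'smoker_indexed', 'day_indexed', 'time_indexed'], outputCol="Independent features")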

In [0]:
final_data=vect.transform(df_new)

In [0]:
final_data.show()
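
Conceptually, `VectorAssembler` concatenates the listed input columns of each row, in order, into a single feature vector. The sketch below shows that idea on one row in plain Python. The helper `assemble` and the row values are hypothetical, and Spark itself stores the result as a dense or sparse ML vector rather than a Python list.

```python
def assemble(row, input_cols):
    """Concatenate the chosen columns of one row into a flat list of floats."""
    return [float(row[col]) for col in input_cols]


# Hypothetical row with made-up values, shaped like the indexed tips data
row = {
    "total_bill": 20.0,
    "tip": 3.0,
    "size": 2,
    "sex_indexed": 1.0,
    "smoker_indexed": 0.0,
    "day_indexed": 1.0,
    "time_indexed": 0.0,
}
input_cols = ["tip", "size", "sex_indexed", "smoker_indexed", "day_indexed", "time_indexed"]
features = assemble(row, input_cols)
print(features)
```

The label column (`total_bill` here) is deliberately left out of the vector, since it is what the model predicts.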

In [0]:
final=final_data.select(['total_bill','Independent features'])

In [0]:
final.columns

In [0]:
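# Linear regression: predict total_bill from the assembled feature vector
from pyspark.ml.regression import LinearRegression

# A fixed seed (42 is an arbitrary choice) makes the 80/20 split reproducible
train_data, test_data = final.randomSplit([0.8, 0.2], seed=42)
regre = LinearRegression(featuresCol="Independent features", labelCol="total_bill")
regre = regre.fit(train_data)  # regre now holds the fitted LinearRegressionModel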

In [0]:
regre.coefficients

In [0]:
regre.intercept
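
As a rough guide to reading `coefficients` and `intercept`, the sketch below fits ordinary least squares with a single feature in plain Python. It is an illustration under simplifying assumptions (one feature, no regularization). Spark's `LinearRegression` handles many features with its own solvers, and the helper `fit_simple_ols` and the data are hypothetical.

```python
def fit_simple_ols(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data generated from y = 5x + 2
xs = [1.0, 2.0, 3.0, 4.0]
ys = [7.0, 12.0, 17.0, 22.0]
slope, intercept = fit_simple_ols(xs, ys)
print(slope, intercept)
```

In the fitted Spark model, each entry of `coefficients` plays the role of `slope` for one position in the feature vector, and `intercept` is the prediction when every feature is zero.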


In [0]:
# evaluate() returns a LinearRegressionSummary (predictions plus metrics) for the test set
pred = regre.evaluate(test_data)


In [0]:
pred.predictions.show()

In [0]:
# Performance metrics: R-squared on the test set
pred.r2

In [0]:
pred.meanAbsoluteError,pred.meanSquaredError
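
For reference, the three metrics above follow the usual textbook definitions. The sketch below computes them in plain Python on hypothetical values so the formulas are explicit. The function `regression_metrics` is an illustrative helper, not part of PySpark.

```python
def regression_metrics(actual, predicted):
    """Return R-squared, mean absolute error, and mean squared error."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / n
    ss_res = sum(r * r for r in residuals)
    mse = ss_res / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    return {"r2": r2, "mae": mae, "mse": mse}


# Hypothetical labels and predictions, not output from the model above
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

An R-squared close to 1 means the predictions explain most of the variance in the label. MAE is in the same units as `total_bill`, while MSE is in squared units.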

In [0]:
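# Spark ML saves a directory (metadata plus Parquet data), not a Python pickle; save() fails if the path already exists
regre.save("/FileStore/tables/Linear_regression.pickle")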