# Linear Regression Code Along

This notebook is the reference for the video lecture on the Linear Regression Code Along. We examine a dataset of Ecommerce customer data for a company's website and mobile app, then build a regression model to predict each customer's yearly spend on the company's product.

The first step is to start a Spark session.

In [4]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('lr_example').getOrCreate()

In [5]:
from pyspark.ml.regression import LinearRegression

In [6]:
# Use Spark to read in the Ecommerce Customers csv file.
data = spark.read.csv("/FileStore/tables/Ecommerce_Customers.csv",inferSchema=True,header=True)

In [7]:
# Print the Schema of the DataFrame
data.printSchema()

In [8]:
data.show()

In [9]:
data.head(1)

In [10]:
for item in data.head(1)[0]:
    print(item)

## Setting Up DataFrame for Machine Learning

In [12]:
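# A few things we need to do before Spark's ML library can accept the data!
# It needs two columns: a label column and a features column,
# where the features column holds all inputs as a single vector.
# By default these are named ("label","features")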

# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [13]:
# In this example, we use only the numeric columns as features

# Yearly Amount Spent = dependent variable, the variable we will predict

data.printSchema()

In [14]:
assembler = VectorAssembler(
    inputCols=["Avg Session Length", "Time on App", 
               "Time on Website",'Length of Membership'],
    outputCol="features")

In [15]:
output = assembler.transform(data)
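
As a rough illustration of what the assembler does to each row, the plain-Python sketch below packs the four input columns into one list, in the order given by `inputCols`. The row values and the `assemble` helper are made up for illustration; Spark's actual output is a vector column, not a Python list.

```python
# Hypothetical single row, keyed by the column names used in this notebook.
# The numbers are made up for illustration, not taken from the dataset.
row = {
    "Avg Session Length": 34.5,
    "Time on App": 12.7,
    "Time on Website": 39.6,
    "Length of Membership": 4.1,
    "Yearly Amount Spent": 587.95,
}

input_cols = ["Avg Session Length", "Time on App",
              "Time on Website", "Length of Membership"]

def assemble(row, input_cols):
    """Collect the chosen columns into one feature list, keeping their order."""
    return [float(row[name]) for name in input_cols]

features = assemble(row, input_cols)
print(features)
```

The label column (`Yearly Amount Spent`) is deliberately left out of the feature list, since it is the value the model is meant to predict.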

In [16]:
output.printSchema()

output.select("features").show(5)

In [17]:
output.head(1)

In [18]:
final_data = output.select("features",'Yearly Amount Spent')

final_data.show(2)

In [19]:
train_data,test_data = final_data.randomSplit([0.7,0.3])
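
The 70/30 split is approximate, so the row counts reported by `describe()` will rarely be exactly 70% and 30%. The sketch below shows one way to see why, assuming each row is assigned independently by a single uniform random draw. The `random_split` helper is hypothetical and is not Spark's implementation.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split by one uniform draw (illustrative only)."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point leftover.
            if u < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(500))
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))  # near 350 and 150, but usually not exact
```

In PySpark, `randomSplit` also accepts an optional `seed` argument, which helps make the split repeatable between runs on the same data.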

In [20]:
train_data.describe().show()

In [21]:
test_data.describe().show()

In [22]:
# Create a Linear Regression Model object
lr = LinearRegression(labelCol='Yearly Amount Spent')

In [23]:
# Fit the model to the data and call this model lrModel
lrModel = lr.fit(train_data)

In [24]:
# Print the coefficients and intercept for linear regression
print("Coefficients: {} \n Intercept: {}".format(lrModel.coefficients,lrModel.intercept))
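
To make the meaning of these numbers concrete: a linear regression prediction is the intercept plus the sum of each coefficient times its feature value. The sketch below uses made-up coefficients, intercept, and feature values, not the ones this notebook produces.

```python
# Made-up coefficients and intercept, standing in for lrModel.coefficients
# and lrModel.intercept; the real values come from the fitted model.
coefficients = [25.0, 38.0, 0.5, 61.0]
intercept = -1050.0

# Made-up feature vector, in the same order as the assembler's inputCols.
features = [34.0, 12.0, 37.0, 4.0]

def predict(features, coefficients, intercept):
    """Linear regression prediction: intercept + sum(coef_i * x_i)."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

prediction = predict(features, coefficients, intercept)
print(prediction)
```

Read this way, each coefficient is the change in predicted yearly spend for a one-unit increase in that feature, holding the others fixed.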

In [25]:
test_results = lrModel.evaluate(test_data)

In [26]:
# Residuals: the difference between each actual label and its prediction
test_results.residuals.show()

In [27]:
print("RMSE: {}".format(test_results.rootMeanSquaredError))
print("MSE: {}".format(test_results.meanSquaredError))
print('MAE: {}'.format(test_results.meanAbsoluteError))
print('Adj. R2: {}'.format(test_results.r2adj))
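
For reference, the sketch below computes the same four metrics by hand from a handful of made-up label and prediction pairs, using the usual textbook formulas. The data is invented for illustration, and the adjusted R2 line assumes a model with an intercept and 4 features, as in this notebook.

```python
import math

# Made-up (label, prediction) pairs; real ones would come from the test set.
labels = [500.0, 420.0, 610.0, 550.0, 480.0, 530.0]
predictions = [510.0, 410.0, 600.0, 560.0, 470.0, 520.0]
num_features = 4  # matches the four input columns used in this notebook

# Residual = label - prediction
residuals = [y - p for y, p in zip(labels, predictions)]
n = len(labels)

mse = sum(r ** 2 for r in residuals) / n   # mean squared error
rmse = math.sqrt(mse)                      # root mean squared error
mae = sum(abs(r) for r in residuals) / n   # mean absolute error

mean_label = sum(labels) / n
ss_res = sum(r ** 2 for r in residuals)              # unexplained variation
ss_tot = sum((y - mean_label) ** 2 for y in labels)  # total variation
r2 = 1 - ss_res / ss_tot
# Adjusted R2 penalizes extra features: 1 - (1 - R2) * (n - 1) / (n - k - 1)
r2_adj = 1 - (1 - r2) * (n - 1) / (n - num_features - 1)

print("RMSE:", rmse, "MSE:", mse, "MAE:", mae, "Adj. R2:", r2_adj)
```

RMSE and MAE are in the same units as the label, so they are easiest to judge next to the mean and standard deviation of `Yearly Amount Spent`.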

In [28]:
final_data.describe().show()

In [29]:
unlabeled_data = test_data.select('features')

In [30]:
predictions = lrModel.transform(unlabeled_data)

In [31]:
predictions.show()

Excellent results! Let's see how you handle some more realistically modeled data in the Consulting Project!