<a href="https://colab.research.google.com/github/piyu18/PySpark/blob/main/PySpark_LinearRegression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
! pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

In [4]:
# Create a Spark session
spark = SparkSession.builder.appName('Linear_Regression').getOrCreate()

The input data set contains details about customers of an e-commerce business. Based on the information provided, the goal is to build a model that predicts how much each customer spends per year (the `Yearly Amount Spent` column).

In [5]:
df = spark.read.csv('/content/Ecommerce_Customers.csv',inferSchema=True,header=True)
df.show(5)

+--------------------+--------------------+------------------+-----------+---------------+--------------------+-------------------+
|               Email|             Address|Avg Session Length|Time on App|Time on Website|Length of Membership|Yearly Amount Spent|
+--------------------+--------------------+------------------+-----------+---------------+--------------------+-------------------+
|mstephenson@ferna...|835 Frank TunnelW...|       34.49726773|12.65565115|    39.57766802|         4.082620633|         587.951054|
|   hduke@hotmail.com|4547 Archer Commo...|       31.92627203|11.10946073|    37.26895887|         2.664034182|        392.2049334|
|    pallen@yahoo.com|24645 Valerie Uni...|       33.00091476|11.33027806|    37.11059744|         4.104543202|        487.5475049|
|riverarebecca@gma...|1414 David Throug...|       34.30555663|13.71751367|    36.72128268|         3.120178783|         581.852344|
|mstephens@davidso...|14023 Rodriguez P...|       33.33067252|12.79518855|  

In [6]:
df

DataFrame[Email: string, Address: string, Avg Session Length: double, Time on App: double, Time on Website: double, Length of Membership: double, Yearly Amount Spent: double]

In [7]:
df.printSchema()

root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avg Session Length: double (nullable = true)
 |-- Time on App: double (nullable = true)
 |-- Time on Website: double (nullable = true)
 |-- Length of Membership: double (nullable = true)
 |-- Yearly Amount Spent: double (nullable = true)



In [8]:
df.columns

['Email',
 'Address',
 'Avg Session Length',
 'Time on App',
 'Time on Website',
 'Length of Membership',
 'Yearly Amount Spent']

In [9]:
cols=list(df.columns)

In [10]:
cols

['Email',
 'Address',
 'Avg Session Length',
 'Time on App',
 'Time on Website',
 'Length of Membership',
 'Yearly Amount Spent']

In [11]:
# Features: skip the two string columns (Email, Address) and the label (last column)
independent_cols = cols[2:-1]

In [12]:
independent_cols

['Avg Session Length',
 'Time on App',
 'Time on Website',
 'Length of Membership']
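
Slicing by position works for this file, but it would pick the wrong columns if the column order ever changed. A minimal sketch of an alternative that selects features by name; the variable names `label_col`, `id_cols`, and `feature_cols` are illustrative and not part of the original notebook:

```python
# Column names as returned by df.columns above
columns = ['Email', 'Address', 'Avg Session Length', 'Time on App',
           'Time on Website', 'Length of Membership', 'Yearly Amount Spent']

label_col = 'Yearly Amount Spent'   # target to predict
id_cols = {'Email', 'Address'}      # string identifiers, not numeric features

# Keep every column that is neither the label nor an identifier
feature_cols = [c for c in columns if c != label_col and c not in id_cols]
print(feature_cols)
```

The result is the same four columns as the positional slice, but the intent (exclude identifiers and the label) is explicit.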

#### VectorAssembler
A feature transformer that merges multiple columns into a vector column.
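
To make the idea concrete, here is a plain-Python sketch of what the transformer does conceptually for one row: it takes the values of the input columns, in the order given, and places them in a single vector. This is not Spark code; the helper `assemble` is hypothetical, and the values are taken from the first row shown earlier.

```python
def assemble(row, input_cols):
    """Concatenate the values of input_cols, in order, into one feature list."""
    return [float(row[c]) for c in input_cols]

# First row of the data set, as displayed by df.show(5)
row = {
    'Avg Session Length': 34.49726773,
    'Time on App': 12.65565115,
    'Time on Website': 39.57766802,
    'Length of Membership': 4.082620633,
    'Yearly Amount Spent': 587.951054,
}
input_cols = ['Avg Session Length', 'Time on App',
              'Time on Website', 'Length of Membership']

features = assemble(row, input_cols)
print(features)
```

The label column is deliberately left out of the vector; it stays in its own column so the regression can use it as the target.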

In [13]:
feature = VectorAssembler(inputCols=independent_cols,outputCol='IndependentFeatureVector')

In [14]:
output = feature.transform(df)
output.select("IndependentFeatureVector").show()

+------------------------+
|IndependentFeatureVector|
+------------------------+
|    [34.49726773,12.6...|
|    [31.92627203,11.1...|
|    [33.00091476,11.3...|
|    [34.30555663,13.7...|
|    [33.33067252,12.7...|
|    [33.87103788,12.0...|
|    [32.0215955,11.36...|
|    [32.73914294,12.3...|
|    [33.9877729,13.38...|
|    [31.93654862,11.8...|
|    [33.99257277,13.3...|
|    [33.87936082,11.5...|
|    [29.53242897,10.9...|
|    [33.19033404,12.9...|
|    [32.38797585,13.1...|
|    [30.73772037,12.6...|
|    [32.1253869,11.73...|
|    [32.33889932,12.0...|
|    [32.18781205,14.7...|
|    [32.61785606,13.9...|
+------------------------+
only showing top 20 rows



In [15]:
output=feature.transform(df)
output.show()

+--------------------+--------------------+------------------+-----------+---------------+--------------------+-------------------+------------------------+
|               Email|             Address|Avg Session Length|Time on App|Time on Website|Length of Membership|Yearly Amount Spent|IndependentFeatureVector|
+--------------------+--------------------+------------------+-----------+---------------+--------------------+-------------------+------------------------+
|mstephenson@ferna...|835 Frank TunnelW...|       34.49726773|12.65565115|    39.57766802|         4.082620633|         587.951054|    [34.49726773,12.6...|
|   hduke@hotmail.com|4547 Archer Commo...|       31.92627203|11.10946073|    37.26895887|         2.664034182|        392.2049334|    [31.92627203,11.1...|
|    pallen@yahoo.com|24645 Valerie Uni...|       33.00091476|11.33027806|    37.11059744|         4.104543202|        487.5475049|    [33.00091476,11.3...|
|riverarebecca@gma...|1414 David Throug...|       34.30555

In [16]:
final_data=output.select("IndependentFeatureVector","Yearly Amount Spent")
final_data.show(5)

+------------------------+-------------------+
|IndependentFeatureVector|Yearly Amount Spent|
+------------------------+-------------------+
|    [34.49726773,12.6...|         587.951054|
|    [31.92627203,11.1...|        392.2049334|
|    [33.00091476,11.3...|        487.5475049|
|    [34.30555663,13.7...|         581.852344|
|    [33.33067252,12.7...|         599.406092|
+------------------------+-------------------+
only showing top 5 rows



In [17]:
# Hold out roughly 20% of the rows for testing; the split is random
train_data, test_data = final_data.randomSplit([0.80, 0.20])
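
Because the split is random, the coefficients and metrics below will change from run to run unless a seed is fixed. The following plain-Python sketch illustrates the idea under the assumption that each row is assigned independently by a random draw, which is why the resulting sizes are only approximately 80/20. The function `random_split` and the 100-row stand-in data are hypothetical.

```python
import random

def random_split(rows, train_weight, seed):
    """Assign each row to train or test using an independent seeded random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(100))  # stand-in for the rows of final_data

train_a, test_a = random_split(rows, 0.8, seed=42)
train_b, test_b = random_split(rows, 0.8, seed=42)

print(len(train_a), len(test_a))  # sizes are close to, not exactly, 80/20
print(train_a == train_b)         # the same seed reproduces the same split
```

In PySpark, `randomSplit` accepts an optional `seed` argument for the same purpose, for example `final_data.randomSplit([0.80, 0.20], seed=42)`.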

In [18]:
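# Configure the estimator, then fit it. fit() returns a fitted model,
# which is stored back into lr and used in the cells below.
lr = LinearRegression(featuresCol='IndependentFeatureVector', labelCol='Yearly Amount Spent')
lr = lr.fit(train_data)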

In [19]:
lr.coefficients

DenseVector([25.5475, 38.811, 0.4157, 61.5348])

In [20]:
lr.intercept

-1045.8370114085758
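
The coefficients and intercept define the fitted line: a prediction is the intercept plus the dot product of the coefficients with the feature vector. As a worked example, the sketch below applies the values printed above to the first row of the data set. The coefficients are the rounded values as displayed, so the result is approximate, and this row may have been part of the training split.

```python
# Values printed by lr.coefficients and lr.intercept above (coefficients are rounded)
coefficients = [25.5475, 38.811, 0.4157, 61.5348]
intercept = -1045.8370114085758

# First row: Avg Session Length, Time on App, Time on Website, Length of Membership
features = [34.49726773, 12.65565115, 39.57766802, 4.082620633]
actual = 587.951054  # Yearly Amount Spent for that row

# prediction = intercept + sum of coefficient * feature
prediction = intercept + sum(w * x for w, x in zip(coefficients, features))

print(round(prediction, 2), round(actual - prediction, 2))
```

Read this way, each coefficient is the change in predicted yearly spend for a one-unit increase in that feature with the others held fixed: about 61.5 for Length of Membership and about 0.4 for Time on Website.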

In [21]:
# evaluate() returns a summary object holding the predictions and metrics for test_data
predicted_output = lr.evaluate(test_data)

In [22]:
predicted_output.predictions.show(10)

+------------------------+-------------------+------------------+
|IndependentFeatureVector|Yearly Amount Spent|        prediction|
+------------------------+-------------------+------------------+
|    [30.87948434,13.2...|           490.2066| 494.4515050251505|
|    [31.04722214,11.1...|        392.4973992|388.16295661968434|
|    [31.26064687,13.2...|        421.3266313|422.57596954998417|
|    [31.57020083,13.3...|        545.9454921| 563.8939405564113|
|    [31.6005122,12.22...|        479.1728515| 461.2824154771695|
|    [31.65480968,13.0...|        475.2634237| 468.9115793560511|
|    [31.66104982,11.3...|        416.3583536| 417.4206982769217|
|    [31.72420252,13.1...|        503.3878873| 509.8458289231942|
|    [31.8209982,10.77...|         424.675281| 417.1874012280816|
|    [31.82797906,12.4...|        440.0027475|449.52855962672174|
+------------------------+-------------------+------------------+
only showing top 10 rows



In [23]:
R2_coeff = predicted_output.r2

In [24]:
R2_coeff

0.9856302698526402
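
R-squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean label. The sketch below computes it, together with the root mean squared error, from only the ten prediction rows displayed above, so the numbers will differ from the value reported for the full test set.

```python
import math

# The ten (label, prediction) pairs shown by predictions.show(10)
labels = [490.2066, 392.4973992, 421.3266313, 545.9454921, 479.1728515,
          475.2634237, 416.3583536, 503.3878873, 424.675281, 440.0027475]
predictions = [494.4515050251505, 388.16295661968434, 422.57596954998417,
               563.8939405564113, 461.2824154771695, 468.9115793560511,
               417.4206982769217, 509.8458289231942, 417.1874012280816,
               449.52855962672174]

mean_label = sum(labels) / len(labels)
ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))  # unexplained variation
ss_tot = sum((y - mean_label) ** 2 for y in labels)              # total variation

r2 = 1 - ss_res / ss_tot
rmse = math.sqrt(ss_res / len(labels))

print(round(r2, 4), round(rmse, 2))
```

The RMSE is in the same units as the label, which makes it a useful companion to R-squared; the summary object returned by `evaluate` also exposes it as `rootMeanSquaredError`.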

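Conclusion: The R-squared value of about 0.986 on the test data means the model explains roughly 98.6% of the variance in yearly amount spent. Note that R-squared measures goodness of fit rather than accuracy.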