### Examples of PySpark ML

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.dataframe import DataFrame as pyspark_DataFrame


spark = SparkSession.builder.appName("Pyspark ML").getOrCreate()
spark

In [2]:
# Read the dataset
training: pyspark_DataFrame = spark.read.csv("test1.csv", header=True, inferSchema=True)
training.show()
training.printSchema()
training.columns

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
+---------+---+----------+------+

root
 |-- Name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- Experience: integer (nullable = true)
 |-- Salary: integer (nullable = true)



['Name', 'age', 'Experience', 'Salary']

#### Combine [age, Experience] into a single vector column of independent features

In [3]:
from pyspark.ml.feature import VectorAssembler

In [4]:
independentFeatureAssembler = VectorAssembler(
    inputCols=["age", "Experience"], outputCol="Independent Feature"
)

In [5]:
output: pyspark_DataFrame = independentFeatureAssembler.transform(training)
output.show()

+---------+---+----------+------+-------------------+
|     Name|age|Experience|Salary|Independent Feature|
+---------+---+----------+------+-------------------+
|    Krish| 31|        10| 30000|        [31.0,10.0]|
|Sudhanshu| 30|         8| 25000|         [30.0,8.0]|
|    Sunny| 29|         4| 20000|         [29.0,4.0]|
|     Paul| 24|         3| 20000|         [24.0,3.0]|
|   Harsha| 21|         1| 15000|         [21.0,1.0]|
|  Shubham| 23|         2| 18000|         [23.0,2.0]|
+---------+---+----------+------+-------------------+



In [6]:
# Keep only the label column and the assembled feature vector
final_data: pyspark_DataFrame = output.select("Salary", "Independent Feature")

final_data.show()

+------+-------------------+
|Salary|Independent Feature|
+------+-------------------+
| 30000|        [31.0,10.0]|
| 25000|         [30.0,8.0]|
| 20000|         [29.0,4.0]|
| 20000|         [24.0,3.0]|
| 15000|         [21.0,1.0]|
| 18000|         [23.0,2.0]|
+------+-------------------+



In [15]:
from pyspark.ml.regression import LinearRegression

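# Train/test split; no seed is passed, so the rows in each split vary between runs

train_data, test_data = final_data.randomSplit([0.75, 0.25])
regressor = LinearRegression(featuresCol="Independent Feature", labelCol="Salary")
# fit() returns the trained model, which is rebound to the name `regressor`
regressor = regressor.fit(train_data)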

In [16]:
# Coefficients
regressor.coefficients

DenseVector([109.3058, 1199.4092])

In [17]:
# Intercept
regressor.intercept

12187.592319054227

In [18]:
# Evaluate on the test set and show the predictions
pred_results = regressor.evaluate(test_data)
pred_results.predictions.show()

+------+-------------------+------------------+
|Salary|Independent Feature|        prediction|
+------+-------------------+------------------+
| 20000|         [24.0,3.0]| 18409.15805022155|
| 30000|        [31.0,10.0]|27570.162481536143|
+------+-------------------+------------------+



In [19]:
pred_results.meanSquaredError, pred_results.meanAbsoluteError

(4217444.237654746, 2010.339734121153)

The outputs of the fitted linear regression model can be read as follows:

1. **Coefficients** (regressor.coefficients)
```
DenseVector([109.3058, 1199.4092])
```
- These are the weights the model learned for each independent feature, in the order [age, Experience]
- For age: 109.3058 means that, with experience held constant, each additional year of age raises the predicted salary by approximately 109.31 currency units
- For experience: 1199.4092 means that, with age held constant, each additional year of experience raises the predicted salary by approximately 1,199.41 currency units
- Both features are measured in years, so the coefficients can be compared directly: in this fit, a year of experience moves the prediction roughly 11 times as much as a year of age
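
To make the "held constant" reading concrete, here is a minimal sketch in plain Python (no Spark needed). It assumes the coefficient and intercept values printed in the notebook output; the helper `predict_salary` and the example ages and experience values are hypothetical and are not part of the notebook.

```python
# Values copied from the notebook output (coefficients rounded to 4 decimals).
COEF_AGE = 109.3058
COEF_EXPERIENCE = 1199.4092
INTERCEPT = 12187.592319054227


def predict_salary(age: float, experience: float) -> float:
    """Hypothetical helper: the linear model written out by hand."""
    return INTERCEPT + COEF_AGE * age + COEF_EXPERIENCE * experience


# Change one feature by one year while keeping the other fixed.
base = predict_salary(age=30, experience=5)
more_experience = predict_salary(age=30, experience=6)
older = predict_salary(age=31, experience=5)

# Each difference equals the corresponding coefficient (up to float rounding).
print(round(more_experience - base, 4))
print(round(older - base, 4))
```

The two printed differences should match the experience and age coefficients respectively, which is exactly what "per-year effect with the other feature held constant" means for a linear model.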

2. **Intercept** (regressor.intercept)
```
12187.592319054227
```
- This is the predicted salary when every feature is 0
- An age of 0 with 0 years of experience lies far outside the data (the ages in the dataset run from 21 to 31), so the value of approximately 12,187.59 units is better read as the offset of the fitted line than as a realistic base salary

3. **Predictions** (pred_results.predictions)
```
+------+-------------------+------------------+
|Salary|Independent Feature|        prediction|
+------+-------------------+------------------+
| 20000|         [24.0,3.0]| 18409.15805022155|
| 30000|        [31.0,10.0]|27570.162481536143|
+------+-------------------+------------------+
```
- Shows actual salary vs predicted salary for the test data
- First row: Person with age 24 and 3 years experience
  * Actual salary: 20,000
  * Predicted salary: ~18,409
- Second row: Person with age 31 and 10 years experience
  * Actual salary: 30,000
  * Predicted salary: ~27,570
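
As a sanity check, the two predictions can be recomputed by hand as intercept plus the dot product of the coefficients with each feature vector. The sketch below is plain Python and assumes the rounded coefficients shown in the notebook output, so the results agree with Spark's values only to within about a hundredth of a unit; the function name `predict` is hypothetical.

```python
# Values copied from the notebook output (coefficients rounded to 4 decimals).
coefficients = [109.3058, 1199.4092]  # order: [age, Experience]
intercept = 12187.592319054227

# (actual salary, [age, Experience]) for the two rows in the test split.
test_rows = [
    (20000, [24.0, 3.0]),
    (30000, [31.0, 10.0]),
]


def predict(features):
    """Hypothetical helper: intercept + dot(coefficients, features)."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))


predictions = [predict(features) for _, features in test_rows]

for (actual, features), predicted in zip(test_rows, predictions):
    residual = actual - predicted  # positive means the model under-predicts
    print(features, actual, round(predicted, 2), round(residual, 2))
```

Both residuals come out positive, meaning the model under-predicts the salary for both test rows.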

4. **Error Metrics**
```
(4217444.237654746, 2010.339734121153)
```
- Mean Squared Error (MSE): 4,217,444.24
  * Average squared difference between predicted and actual values
  * Expressed in squared currency units, which is why it is so much larger than the MAE; squaring also weights large errors more heavily
- Mean Absolute Error (MAE): 2,010.34
  * Average absolute difference between predicted and actual values
  * On the two test rows, predictions are off by about 2,010 units on average
  * This is easier to interpret than MSE because it is in the same units as the salary
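
Both metrics can be reproduced from the two test rows with a few lines of plain Python. This sketch assumes the actual and predicted values printed in the predictions table; the root mean squared error (RMSE) at the end is an addition that does not appear in the notebook output, included only because it brings the MSE back into salary units.

```python
import math

# Actual and predicted salaries copied from the predictions table.
actual = [20000.0, 30000.0]
predicted = [18409.15805022155, 27570.162481536143]

errors = [a - p for a, p in zip(actual, predicted)]

mse = sum(e ** 2 for e in errors) / len(errors)    # mean squared error
mae = sum(abs(e) for e in errors) / len(errors)    # mean absolute error
rmse = math.sqrt(mse)                              # not in the notebook output

print(mse, mae, rmse)
```

The computed `mse` and `mae` should agree with the tuple returned by `pred_results.meanSquaredError, pred_results.meanAbsoluteError` up to floating-point rounding.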

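Both test predictions fall below the actual salary, by about 1,591 and 2,430 units, which is roughly 8% of each salary. The dataset has only six rows, two of which ended up in the test set, so these metrics illustrate the workflow rather than show how well the model generalizes. Within this fit, the larger coefficient for experience indicates that experience moves the predicted salary more than age does.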