# PySpark Assignment
## María Ferrero and Lara Monteserín


The main goal of this assignment is to check whether feature selection can improve results by removing
irrelevant variables or, at least, maintain the results while using fewer features.

We will do that with a
LinearRegression algorithm (with no HPO, in order to keep the assignment short), training pipelines with
different feature selection approaches on the training partition and comparing them on a test
set. In any case, the main aim of the assignment is technical (i.e. being able to use PySpark with a dataset).

**WHAT TO HAND IN:** A notebook with some explanations about what you are doing in each step, and some short conclusions at the end of the notebook. Submit the notebook in two formats: ipynb and html. Please also submit a screen capture showing (at least) the last cells of your executed script.

## PART 0: Creating the Spark session, loading the data and preparing the dataframe for ML use.

In Google Colab, it is necessary to install pyspark every time a new session starts. We also upload the data file to Google Colab.

In [18]:
from google.colab import files

# Upload the CSV file
uploaded = files.upload()

Saving wind_available_second.csv to wind_available_second (1).csv


In [19]:
!pip install pyspark



Now we create a Spark session and get its Spark context. Then we read the data with Pandas.

In [20]:
# SPARK CONTEXT INITIALIZATION
from pyspark.sql import SparkSession
import pandas as pd

# Create a Spark session
spark = SparkSession.builder.master("local[*]").appName("App").getOrCreate()

# Get the Spark context
sc = spark.sparkContext

In [21]:
# Read the CSV file into a Pandas Dataframe
wind_ava = pd.read_csv('wind_available_second.csv')

To get a first look at the structure of the dataframe, we display its first rows.



In [22]:
wind_ava.head()


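|   | energy | year | month | day | hour | p54_162_1 | p54_162_2 | p54_162_3 | p54_162_4 | p54_162_5 | ... | v100_16 | v100_17 | v100_18 | v100_19 | v100_20 | v100_21 | v100_22 | v100_23 | v100_24 | v100_25 |
|---|--------|------|-------|-----|------|-----------|-----------|-----------|-----------|-----------|-----|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| 0 | 402.71 | 2005 | 1 | 2 | 18 | 2534970.0 | 2526864.0 | 2518754.0 | 2510648.0 | 2502537.0 | ... | -4.683596 | NaN | -4.407196 | NaN | -4.131295 | -4.669626 | -4.528932 | -4.388736 | -4.24854 | -4.107846 |
| 1 | 696.8 | 2005 | 1 | 3 | 0 | NaN | NaN | 2521184.0 | 2513088.0 | NaN | ... | -3.397886 | -3.257192 | -3.115998 | -2.975304 | -2.834609 | -3.39639 | -3.254198 | -3.112506 | -2.970314 | NaN |
| 2 | 1591.15 | 2005 | 1 | 3 | 6 | 2533727.0 | 2525703.0 | 2517678.0 | 2509654.0 | NaN | ... | -1.454105 | NaN | -1.13829 | NaN | -0.822476 | -1.459094 | -1.302933 | -1.147271 | -0.99111 | -0.834949 |
| 3 | 1338.62 | 2005 | 1 | 3 | 12 | NaN | 2526548.0 | 2518609.0 | 2510670.0 | 2502732.0 | ... | 1.255015 | 1.370265 | 1.485515 | 1.600765 | 1.716015 | 1.210612 | 1.319376 | 1.42814 | 1.536405 | 1.645169 |
| 4 | 562.5 | 2005 | 1 | 3 | 18 | 2529543.0 | NaN | 2513702.0 | 2505782.0 | 2497861.0 | ... | 1.939031 | NaN | NaN | 2.193977 | 2.278793 | 1.873673 | 1.953 | 2.031829 | 2.111157 | 2.189986 |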


Before preparing the dataframe for ML use, it is necessary to handle the missing (null) values, which we do by imputation. As we verified in the first assignment, the Iterative Imputer gave the best results in later predictions on this data in most cases, so it is the one we use here. After the imputation, we convert the Pandas dataframe into a Spark dataframe.


In [23]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Use scikit-learn's IterativeImputer to impute the null values
imputer = IterativeImputer(max_iter=10, random_state=100514164)
wind_ava = pd.DataFrame(imputer.fit_transform(wind_ava), columns=wind_ava.columns)

# Convert the imputed pandas DataFrame into a PySpark DataFrame
wind_ava = spark.createDataFrame(wind_ava)


Finally, we prepare the dataframe for ML use. The algorithms in the Spark ML library need a dataframe with two columns: one (typically named features) that holds, for each row, a vector with the input attributes, and another (typically named label) that holds the output attribute. To build it, we use VectorAssembler to put all the input attributes together.
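As a rough illustration of that target layout, here is a plain-Python sketch (no Spark involved) of what assembling does to a single row. It uses a few columns from the first row shown above; the `assemble` helper is hypothetical and only mimics the idea, not VectorAssembler's implementation.

```python
# Plain-Python sketch: turn one record into a (label, features) pair.
# Values come from the first row of the dataframe, restricted to a few columns.
row = {"energy": 402.71, "year": 2005.0, "month": 1.0, "day": 2.0, "hour": 18.0}

label_col = "energy"
# Every column except the response becomes an input attribute, in column order.
feature_cols = [c for c in row if c != label_col]


def assemble(record, feature_cols, label_col):
    """Return (label, features), where features is an ordered list of inputs."""
    return record[label_col], [record[c] for c in feature_cols]


label, features = assemble(row, feature_cols, label_col)
print(label, features)
```

The order of the input columns matters, because later stages refer to features by their position in the vector.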

In [17]:
#ruta = '/content/guardado.csv'
#wind_ava.write.mode('overwrite').csv(ruta, header=True)
#files.download(ruta)

#wind_ava.write.csv('/content/gdrive/My Drive', header=True)

import pandas as pd

#pandas_df = wind_ava.toPandas()

# Specify the local path where the CSV file should be saved
#ruta_local = 'ruta/del/archivo_local.csv'

# Save the pandas DataFrame as a local CSV file
#pandas_df.to_csv(ruta_local, header=True, index=False)

Py4JJavaError: ignored

In [9]:
wind_ava.show()


In [14]:
#wind_ava = pd.read_csv('wind_ava.csv')
#wind_ava_mierda = spark.read.csv('/content/gdrive/My Drive', inferSchema=True)


In [24]:
from pyspark.ml.feature import VectorAssembler

# In PySpark, the response column is typically called label, so we rename energy
wind_ava = wind_ava.withColumnRenamed("energy", "label")

ignore = ['label']

assembler = VectorAssembler(
    inputCols=[x for x in wind_ava.columns if x not in ignore],
    outputCol='features')

wind_ava = assembler.transform(wind_ava).select(['label', 'features'])

Now the first rows of the dataframe look as follows:

In [25]:
wind_ava.show()

+-------+--------------------+
|  label|            features|
+-------+--------------------+
| 402.71|[2005.0,1.0,2.0,1...|
|  696.8|[2005.0,1.0,3.0,0...|
|1591.15|[2005.0,1.0,3.0,6...|
|1338.62|[2005.0,1.0,3.0,1...|
|  562.5|[2005.0,1.0,3.0,1...|
|  232.3|[2005.0,1.0,4.0,0...|
| 329.95|[2005.0,1.0,4.0,6...|
| 960.51|[2005.0,1.0,4.0,1...|
| 194.62|[2005.0,1.0,4.0,1...|
| 358.51|[2005.0,1.0,5.0,0...|
|  808.8|[2005.0,1.0,5.0,6...|
|  93.36|[2005.0,1.0,5.0,1...|
| 155.94|[2005.0,1.0,5.0,1...|
|   0.01|[2005.0,1.0,6.0,0...|
|   4.85|[2005.0,1.0,6.0,1...|
| 218.76|[2005.0,1.0,7.0,0...|
| 906.21|[2005.0,1.0,7.0,6...|
| 201.42|[2005.0,1.0,7.0,1...|
| 641.34|[2005.0,1.0,7.0,1...|
|1524.05|[2005.0,1.0,8.0,0...|
+-------+--------------------+
only showing top 20 rows



## PART 1: Split data into train and test


In [26]:
(trainingData_sd, testData_sd) = wind_ava.randomSplit([0.7, 0.3])
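Note that randomSplit assigns rows to the partitions at random, so the 70/30 proportions are only approximate and, without a seed, the partitions change on every run (randomSplit accepts a seed as its second argument). The idea can be sketched in plain Python. This only mimics the behaviour conceptually and is not Spark's implementation; the `random_split` helper, the row ids and the seed value are all assumptions made for the example.

```python
import random


def random_split(rows, train_fraction, seed):
    """Send each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))  # hypothetical row ids

train_a, test_a = random_split(rows, 0.7, seed=42)
train_b, test_b = random_split(rows, 0.7, seed=42)

# Sizes are close to 700/300 but not guaranteed to be exact.
print(len(train_a), len(test_a))
# Re-running with the same seed reproduces the same partition.
print(train_a == train_b)
```

Fixing the seed makes the RMSE values of the different pipelines comparable across runs of the notebook.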

## PART 2: Formulate three pipelines, train and evaluate them:
- a. Feature selection with the UnivariateFeatureSelector and the fpr strategy (least conservative)
- b. Same, with the fwe strategy (most conservative).
- c. Same, but doing PCA and using 3 components
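The difference between the two univariate strategies can be sketched in plain Python. Following the Spark documentation, fpr keeps every feature whose p-value is below the threshold, while fwe compares the p-values against the threshold divided by the number of features, which is why it is the more conservative one. The function names and the p-values below are made up for illustration.

```python
def select_fpr(p_values, threshold=0.05):
    """Keep the indices of features whose p-value is below the threshold."""
    return [i for i, p in enumerate(p_values) if p < threshold]


def select_fwe(p_values, threshold=0.05):
    """Keep the indices whose p-value is below threshold / number of features."""
    cutoff = threshold / len(p_values)
    return [i for i, p in enumerate(p_values) if p < cutoff]


# Hypothetical p-values, one per feature (not computed from the wind data).
p_values = [0.001, 0.02, 0.04, 0.30, 0.008]

print(select_fpr(p_values))  # compared against 0.05
print(select_fwe(p_values))  # compared against 0.05 / 5 = 0.01
```

With the same threshold, every feature kept by fwe is also kept by fpr, so pipeline b can never use more features than pipeline a.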

### PIPELINE 1: Feature selection with the UnivariateFeatureSelector and the fpr strategy (least conservative)

In [28]:
from pyspark.ml.feature import UnivariateFeatureSelector
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression  # the algorithm chosen for this assignment
from pyspark.ml.evaluation import RegressionEvaluator

# Step 1: UnivariateFeatureSelector
selector = UnivariateFeatureSelector(
    featuresCol='features',
    outputCol='selected_features',
    labelCol='label',
    selectionMode='fpr',  # False Positive Rate strategy
)
# featureType and labelType must be set so that Spark can pick the score
# function; both are continuous here because this is a regression problem.
selector.setFeatureType('continuous').setLabelType('continuous')
# selectionThreshold is not a constructor argument (passing it there raises an
# error), so it is set through its setter. It is the p-value cut-off.
selector.setSelectionThreshold(0.05)

# Step 2: Linear Regression (no HPO)
lr = LinearRegression(
    labelCol='label',
    featuresCol='selected_features',
    maxIter=10,
)

# Step 3: Create the pipeline
pipeline_fpr = Pipeline(stages=[selector, lr])

# Step 4: Train the pipeline on the training data
model_fpr = pipeline_fpr.fit(trainingData_sd)

# Step 5: Make predictions on the test data
predictions = model_fpr.transform(testData_sd)

# Step 6: Evaluate the model using RMSE
evaluator = RegressionEvaluator(labelCol='label', predictionCol='prediction', metricName='rmse')
rmse_fpr = evaluator.evaluate(predictions)

# Print the RMSE
print(f"Root Mean Squared Error (RMSE): {rmse_fpr}")
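For reference, the RMSE metric used in Step 6 is the square root of the mean squared difference between labels and predictions. A plain-Python sketch, with made-up numbers that are unrelated to the wind data:

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


labels = [3.0, 5.0, 7.0]        # hypothetical true values
predictions = [2.0, 5.0, 9.0]   # hypothetical model outputs

# Squared errors are 1, 0 and 4, so the result is sqrt(5 / 3).
print(rmse(labels, predictions))
```

RMSE is expressed in the same units as the label, so here it can be read directly as a typical error in energy units.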

### PIPELINE 2: Same as PIPELINE 1, with the fwe strategy (most conservative).

In [None]:
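from pyspark.ml.regression import LinearRegression

# Step 1: UnivariateFeatureSelector with FWE strategy
selector_fwe = UnivariateFeatureSelector(
    featuresCol='features',
    outputCol='selected_features',
    labelCol='label',
    selectionMode='fwe',  # Family-Wise Error Rate strategy (most conservative)
)
# The feature/label types and the threshold are set through setters,
# since they are not constructor arguments
selector_fwe.setFeatureType('continuous').setLabelType('continuous')
selector_fwe.setSelectionThreshold(0.05)  # threshold

# Step 2: Linear Regression with the same column settings as before
lr_fwe = LinearRegression(labelCol='label', featuresCol='selected_features', maxIter=10)

# Step 3: Create the pipeline (with the FWE selector, not the FPR one)
pipeline_fwe = Pipeline(stages=[selector_fwe, lr_fwe])

# Step 4: Train the pipeline on the training data
model_fwe = pipeline_fwe.fit(trainingData_sd)

# Step 5: Make predictions on the test data
predictions = model_fwe.transform(testData_sd)

# Step 6: Evaluate the model using RMSE
evaluator = RegressionEvaluator(labelCol='label', predictionCol='prediction', metricName='rmse')
rmse_fwe = evaluator.evaluate(predictions)

# Print the RMSE
print(f"Root Mean Squared Error (RMSE): {rmse_fwe}")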

### PIPELINE 3: Same as PIPELINES 1 AND 2, but doing PCA and using 3 components

In [None]:
# The following stops the cluster. Not needed in Databricks
spark.stop()