## Linear Regression

We examine a dataset of e-commerce customers covering a company's website and mobile app, and build a regression model that predicts each customer's yearly spend on the company's product.

In [0]:
generation = "mod4gen13"

In [0]:
data = spark.read.csv(f"/mnt/{generation}/input/ecommerce_customers.csv", inferSchema=True, header=True)

In [0]:
# Print the Schema of the DataFrame
data.printSchema()

In [0]:
data.display()

In [0]:
data.head()

In [0]:
for item in data.head():
    print(item)

#### VectorAssembler

Some transformations are required before Spark can fit a model. Our DataFrame must end up with two columns of the form `("target", "features")`, where `features` is a single vector column that combines all the input variables.
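As a rough illustration of the target shape, the plain-Python sketch below collects the chosen columns of one row into a single feature list. It does not use Spark; the `assemble` helper and the row values are made up for the example, and only the column names come from this notebook.

```python
# Plain-Python illustration of what assembling features means for one row.
# The row values are hypothetical; only the column names come from the notebook.
FEATURE_COLS = [
    "Avg Session Length",
    "Time on App",
    "Time on Website",
    "Length of Membership",
]
TARGET_COL = "Yearly Amount Spent"


def assemble(row, input_cols):
    """Collect the chosen columns of one row into a single feature list."""
    return [float(row[col]) for col in input_cols]


row = {
    "Avg Session Length": 33.0,
    "Time on App": 12.5,
    "Time on Website": 37.0,
    "Length of Membership": 4.0,
    "Yearly Amount Spent": 500.0,
}

features = assemble(row, FEATURE_COLS)
# The ("target", "features") pair that the model consumes for this row.
final_row = (row[TARGET_COL], features)
print(final_row)
```

`VectorAssembler` plays the same role across the whole DataFrame, writing the combined vector to the column named in `outputCol`.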

In [0]:
from pyspark.ml.feature import VectorAssembler

In [0]:
data.columns

In [0]:
assembler = VectorAssembler(inputCols=["Avg Session Length", "Time on App", "Time on Website", "Length of Membership"], outputCol="features")

In [0]:
type(assembler)

In [0]:
output = assembler.transform(data)

In [0]:
output.select("features").show()

In [0]:
output.display()

In [0]:
final_data = output.select("features", "Yearly Amount Spent")

In [0]:
final_data.show()

In [0]:
train_data, test_data = final_data.randomSplit([0.7, 0.3])
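`randomSplit` accepts an optional `seed` argument; without it, each run produces a different split. The plain-Python sketch below is only an illustration of a seeded, weight-based split, not Spark's implementation. It assumes each row is assigned by an independent random draw, which is why the split sizes come out close to 70/30 rather than exactly 70/30. The `random_split` helper is hypothetical.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split using an independent random draw."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split on [0, 1), e.g. [0.7, 1.0].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(500))
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))  # sizes are approximate, not exactly 350 / 150
```

Fixing the seed makes the split, and therefore the metrics computed later, repeatable between runs.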

In [0]:
train_data.describe().show()

In [0]:
test_data.describe().show()

In [0]:
from pyspark.ml.regression import LinearRegression

In [0]:
# Instantiate the LinearRegression model
lr = LinearRegression(labelCol="Yearly Amount Spent")

In [0]:
type(lr)

In [0]:
# Fit the model on the training data
lr_model = lr.fit(train_data)

In [0]:
# Coefficients and intercept of the linear regression
print(f"Coefficients: {lr_model.coefficients} Intercept: {lr_model.intercept}")
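To make the role of these two values concrete, the sketch below shows how a linear model turns a feature vector into a prediction: the intercept plus the sum of each coefficient times its feature. The numbers are hypothetical, not the fitted values from this notebook, and `predict` is an illustrative helper rather than a Spark function.

```python
def predict(features, coefficients, intercept):
    """Linear model: intercept plus the dot product of coefficients and features."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Hypothetical values, one coefficient per assembled feature:
# Avg Session Length, Time on App, Time on Website, Length of Membership.
coefficients = [25.0, 38.0, 0.5, 61.0]
intercept = -1050.0
features = [33.0, 12.0, 37.0, 4.0]

prediction = predict(features, coefficients, intercept)
print(prediction)
```

Each coefficient can be read as the change in predicted yearly spend for a one-unit increase in that feature, holding the others fixed.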

In [0]:
# Evaluate the model on the test data
test_results = lr_model.evaluate(test_data)

In [0]:
type(test_results)

In [0]:
test_results.residuals.show()

In [0]:
unlabeled_data = test_data.select("features")

In [0]:
predictions = lr_model.transform(unlabeled_data)

In [0]:
predictions.show()

In [0]:
test_data.show()

In [0]:
final_df = test_data.join(predictions, on="features", how="inner")

In [0]:
final_df.select("Yearly Amount Spent", "prediction").display()

In [0]:
print(f"R2: {test_results.r2}")
print(f"MAE: {test_results.meanAbsoluteError}")
print(f"RMSE: {test_results.rootMeanSquaredError}")
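For reference, the three metrics can be written out from their standard definitions. The sketch below is plain Python with made-up labels and predictions, and `regression_metrics` is an illustrative helper; residuals are taken as label minus prediction.

```python
import math


def regression_metrics(labels, predictions):
    """Compute R2, MAE and RMSE from their standard definitions."""
    n = len(labels)
    residuals = [y - p for y, p in zip(labels, predictions)]
    mae = sum(abs(r) for r in residuals) / n
    ss_res = sum(r * r for r in residuals)
    rmse = math.sqrt(ss_res / n)
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    r2 = 1.0 - ss_res / ss_tot
    return {"r2": r2, "mae": mae, "rmse": rmse}


# Hypothetical values, not results from this notebook.
labels = [400.0, 500.0, 600.0]
predictions = [410.0, 490.0, 600.0]

metrics = regression_metrics(labels, predictions)
print(metrics)
```

MAE and RMSE are expressed in the same units as the target (yearly amount spent), so they are easiest to judge against the mean and spread reported by `describe()`; R2 is unitless, with values near 1 indicating that the model explains most of the variance in the test labels.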