# Linear Regression

<font size='5'>
    Linear regression is a technique that models the value of one variable (the <strong>dependent variable</strong>) as a function of one or more other variables (the <strong>independent variables</strong>).<br>
    <br>
It can be expressed as follows: <br>
    <br>
$$Y = \sum_{i} \beta_{i} X_{i} + \varepsilon$$<br>

$X_{i}$: independent variables that determine $Y$.<br>
$\beta_{i}$: coefficient that measures the influence of the independent variable $X_{i}$ on $Y$.<br>
$\varepsilon$: error term, the part of $Y$ that the fit to the data set does not explain.<br>
<br>
The fitted line $Y$ for a data set looks like this:<br>
<img src="rl.png"> <br>
<br>
**Model assumptions**
    <br>
- Linear relationship between the independent variables and the dependent variable.<br><br>
- Homoscedasticity: the dependent variable (Y) has the same variance across the whole range of values of the independent variable (X).<br><br>
- Errors with constant variance, which is the same condition stated in terms of the errors $\varepsilon$.<br><br>
</font>
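
To make the roles of $\beta_{i}$ and $\varepsilon$ concrete, here is a minimal sketch of an ordinary least squares fit with a single independent variable, written in plain Python. The data points and the helper name `fit_simple_ols` are made up for illustration and do not come from this notebook. The intercept can be read as an extra coefficient $\beta_{0}$ attached to a constant $X_{0} = 1$.

```python
# Ordinary least squares for one independent variable, in plain Python.
# The data below is invented for illustration; it is not from cruise_ship.csv.

def fit_simple_ols(xs, ys):
    """Return (beta, intercept) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # beta = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    return beta, intercept

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]   # roughly y = 2x + 1 plus some noise

beta, intercept = fit_simple_ols(xs, ys)

# The residuals play the role of epsilon: what the fitted line does not explain.
residuals = [y - (beta * x + intercept) for x, y in zip(xs, ys)]
print("beta:", beta, "intercept:", intercept)
print("residuals:", residuals)
```

With an intercept in the model, the least squares residuals sum to zero (up to rounding), which is a quick sanity check on any fit of this kind.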

## Example


This example estimates the number of crew members (crew) needed to staff a ship.
The variables used are:<br>
    Age: age of the ship<br>
    Tonnage: tonnage of the ship<br>
    passengers: number of passengers<br>
    length: length of the ship<br>
    cabins: number of cabins<br>
    passenger_density: average passenger density of the ship<br>
        

In [1]:
# Initialize Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('lr_example').getOrCreate()

In [2]:
from pyspark.ml.regression import LinearRegression

In [14]:
# Read cruise_ship.csv with Spark, inferring the column types from the data
data = spark.read.csv("cruise_ship.csv", inferSchema=True, header=True)

In [15]:
# Print the DataFrame schema
data.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [16]:
data.show(10)

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Elation|   Carnival| 15|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Fantasy|   Carnival| 23| 

In [18]:
for item in data.head():
    print(item)

Journey
Azamara
6
30.276999999999997
6.94
5.94
3.55
42.64
3.55


## Setting Up DataFrame for Machine Learning 

In [22]:
# Prepare the data in the format MLlib expects.
# Two columns are needed: one holding the independent variables as a vector
# and one holding the response ("label", "features").

# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer

In [23]:
data.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew']

In [26]:
assembler = VectorAssembler(
    inputCols=['Age', 'Tonnage',
                 'passengers', 'length',
                 'cabins', 'passenger_density'],
    outputCol="features")

In [27]:
output = assembler.transform(data)
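
As a rough mental model of what the assembler does to each row, the plain-Python sketch below gathers the chosen input columns into a single list of floats and keeps the other columns untouched. It is only an analogy, not Spark's implementation: `assemble_features` is a hypothetical helper, and the two rows are copied from the `data.show(10)` output earlier in the notebook.

```python
# Conceptual sketch of VectorAssembler: for every row, collect the selected
# input columns into one 'features' list. Not Spark code.

FEATURE_COLS = ['Age', 'Tonnage', 'passengers', 'length',
                'cabins', 'passenger_density']

# Two rows copied from the data.show(10) output.
rows = [
    {'Ship_name': 'Journey', 'Cruise_line': 'Azamara', 'Age': 6,
     'Tonnage': 30.276999999999997, 'passengers': 6.94, 'length': 5.94,
     'cabins': 3.55, 'passenger_density': 42.64, 'crew': 3.55},
    {'Ship_name': 'Celebration', 'Cruise_line': 'Carnival', 'Age': 26,
     'Tonnage': 47.262, 'passengers': 14.86, 'length': 7.22,
     'cabins': 7.43, 'passenger_density': 31.8, 'crew': 6.7},
]

def assemble_features(row, input_cols):
    """Return a copy of row with an extra 'features' list built from input_cols."""
    assembled = dict(row)
    assembled['features'] = [float(row[col]) for col in input_cols]
    return assembled

output_rows = [assemble_features(r, FEATURE_COLS) for r in rows]
for r in output_rows:
    print(r['Ship_name'], r['features'], r['crew'])
```

Note that the integer column `Age` ends up as a float inside the vector, which matches the `[6.0,30.276999999...` entries printed by Spark.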

In [28]:
output.select("features").show()

+--------------------+
|            features|
+--------------------+
|[6.0,30.276999999...|
|[6.0,30.276999999...|
|[26.0,47.262,14.8...|
|[11.0,110.0,29.74...|
|[17.0,101.353,26....|
|[22.0,70.367,20.5...|
|[15.0,70.367,20.5...|
|[23.0,70.367,20.5...|
|[19.0,70.367,20.5...|
|[6.0,110.23899999...|
|[10.0,110.0,29.74...|
|[28.0,46.052,14.5...|
|[18.0,70.367,20.5...|
|[17.0,70.367,20.5...|
|[11.0,86.0,21.24,...|
|[8.0,110.0,29.74,...|
|[9.0,88.5,21.24,9...|
|[15.0,70.367,20.5...|
|[12.0,88.5,21.24,...|
|[20.0,70.367,20.5...|
+--------------------+
only showing top 20 rows



In [29]:
output.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+--------------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|            features|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+--------------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|[6.0,30.276999999...|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|[6.0,30.276999999...|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|[26.0,47.262,14.8...|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|[11.0,110.0,29.74...|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|[17.0,101.353,26....|
|    Ecstasy|   Carnival| 22|            70.367|     20.

In [30]:
final_data = output.select("features",'crew')

In [31]:
train_data,test_data = final_data.randomSplit([0.7,0.3])

In [32]:
train_data.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|               115|
|   mean|7.3149565217391395|
| stddev|3.1176054615493825|
|    min|              0.59|
|    max|              13.6|
+-------+------------------+



In [33]:
test_data.describe().show()

+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|               43|
|   mean|9.075813953488371|
| stddev|4.146206661377154|
|    min|             0.88|
|    max|             21.0|
+-------+-----------------+
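
The split above produced 115 training rows and 43 test rows, roughly 73% / 27% rather than exactly 70% / 30%. That is expected when each row is assigned to a side at random, so the weights act as probabilities and not as exact quotas. The sketch below is a plain-Python analogy of that behavior, not Spark's implementation; `random_split` is a hypothetical helper. PySpark's `randomSplit` also accepts a `seed` argument, which makes the split repeatable between runs.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test independently, with a fixed seed."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(158))   # stand-ins for the 158 ships (115 + 43)
train, test = random_split(rows, 0.7, seed=42)

# Sizes depend on the seed; they only average out to about 70% / 30%.
print(len(train), len(test))
```

Because the split is random, summary statistics can differ noticeably between the two sides, as they do here: the largest `crew` value in the training set is 13.6, while the test set reaches 21.0.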



In [34]:
# Create a Linear Regression Model object
lr = LinearRegression(labelCol='crew')

In [35]:
# Fit the model to the data and call this model lrModel
lrModel = lr.fit(train_data)

In [36]:
# Print the coefficients and intercept for linear regression
print("Coefficients: {} Intercept: {}".format(lrModel.coefficients,lrModel.intercept))

Coefficients: [-0.006783969593379291,0.01916000252502769,-0.13278714395729793,0.3832086354698286,0.7118431353742843,4.388537381301796e-05] Intercept: -0.5252134065754188
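
Each coefficient is the change in predicted `crew` for a one-unit increase in the corresponding feature with the others held fixed, in the same order as the `inputCols` of the assembler. As a check on how the numbers combine, the sketch below applies the printed coefficients and intercept by hand to the first row of the data set (the Journey, whose `crew` value is 3.55). It is plain Python with a hypothetical `predict` helper; whether that row landed in the training or the test split is not known.

```python
# Coefficients and intercept copied from the printed output, in the same
# order as the VectorAssembler inputCols.
coefficients = [-0.006783969593379291, 0.01916000252502769,
                -0.13278714395729793, 0.3832086354698286,
                0.7118431353742843, 4.388537381301796e-05]
intercept = -0.5252134065754188

# First row of data.show(10): Age, Tonnage, passengers, length, cabins,
# passenger_density for the Journey.
journey = [6.0, 30.276999999999997, 6.94, 5.94, 3.55, 42.64]
actual_crew = 3.55

def predict(features, coefficients, intercept):
    """Linear model: intercept plus the sum of coefficient * feature."""
    return intercept + sum(b * x for b, x in zip(coefficients, features))

predicted_crew = predict(journey, coefficients, intercept)
residual = actual_crew - predicted_crew
print("predicted:", round(predicted_crew, 3), "residual:", round(residual, 3))
```

The prediction lands within about half a unit of the recorded value, slightly above it.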


In [37]:
test_results = lrModel.evaluate(test_data)

In [38]:
# Residuals on the test set (label minus prediction)
test_results.residuals.show()

+--------------------+
|           residuals|
+--------------------+
|  0.7565758866487364|
|  0.5454936201533087|
| -1.2783844072381374|
| -0.5237120747229564|
|  0.6938236872073649|
|   1.129045580855264|
|-0.21007031945157273|
|  0.1360260338643391|
|   0.719297361520951|
|-0.19856456335952366|
|-0.08405846566093622|
|  0.5723092583142826|
|-0.35407803643417957|
|-0.24523855845540155|
| -0.1917805937661452|
|  1.2884928474857276|
| 0.42732658792183464|
|-0.48768183081961247|
|  1.0597277287087437|
|   7.295498985265427|
+--------------------+
only showing top 20 rows
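
The residuals give a first, informal look at the constant-variance assumption. The sketch below pairs the 20 residuals shown here with the 20 predictions printed a few cells further down, assuming both outputs list the test rows in the same order (adding each pair gives round `crew` values such as 21.0 and 13.6, which supports that assumption). It then compares the residual variance for the lower and the higher predictions. Only 20 of the 43 test rows are visible, so this is an illustration and not a test of the assumption.

```python
from statistics import pvariance

# (prediction, residual) pairs copied from the notebook outputs,
# rounded to 4 decimals. Assumes both outputs share the same row order.
pairs = [
    (20.2434, 0.7566), (12.5845, 0.5455), (14.8784, -1.2784), (9.5237, -0.5237),
    (11.3062, 0.6938), (9.8710, 1.1290), (11.8101, -0.2101), (7.2640, 0.1360),
    (9.2807, 0.7193), (10.8786, -0.1986), (11.0841, -0.0841), (8.4277, 0.5723),
    (8.7741, -0.3541), (9.4452, -0.2452), (10.8718, -0.1918), (11.2415, 1.2885),
    (7.1727, 0.4273), (8.9677, -0.4877), (10.9403, 1.0597), (11.8045, 7.2955),
]

# Split the rows into a low-prediction half and a high-prediction half.
ordered = sorted(pairs)
half = len(ordered) // 2
low_var = pvariance([res for _, res in ordered[:half]])
high_var = pvariance([res for _, res in ordered[half:]])

# The row with the largest absolute residual, and the label it implies.
worst_pred, worst_res = max(pairs, key=lambda p: abs(p[1]))
worst_label = worst_pred + worst_res

print("variance (low predictions): ", low_var)
print("variance (high predictions):", high_var)
print("largest residual:", worst_res, "implied crew:", round(worst_label, 1))
```

In these 20 rows a single ship with a residual of about 7.3 dominates the spread, so it is worth inspecting that row before drawing conclusions about the variance in general.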



In [39]:
unlabeled_data = test_data.select('features')

In [40]:
predictions = lrModel.transform(unlabeled_data)

In [41]:
predictions.show()

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|[4.0,220.0,54.0,1...|20.243424113351264|
|[5.0,133.5,39.59,...|12.584506379846692|
|[5.0,160.0,36.34,...|14.878384407238137|
|[6.0,90.0,20.0,9....| 9.523712074722956|
|[6.0,113.0,37.82,...|11.306176312792635|
|[8.0,91.0,22.44,9...| 9.870954419144736|
|[8.0,110.0,29.74,...|11.810070319451572|
|[9.0,59.058,17.0,...| 7.263973966135661|
|[9.0,81.0,21.44,9...| 9.280702638479049|
|[9.0,105.0,27.2,8...|10.878564563359523|
|[9.0,116.0,26.0,9...|11.084058465660936|
|[10.0,77.0,20.16,...| 8.427690741685717|
|[10.0,81.76899999...|  8.77407803643418|
|[10.0,86.0,21.14,...|   9.4452385584554|
|[10.0,105.0,27.2,...|10.871780593766145|
|[10.0,151.4,26.2,...|11.241507152514272|
|[11.0,58.6,15.66,...| 7.172673412078165|
|[11.0,90.09,25.01...| 8.967681830819613|
|[11.0,108.977,26....|10.940272271291256|
|[11.0,110.0,29.74...|11.804501014734575|
+--------------------+------------

In [42]:
print("RMSE: {}".format(test_results.rootMeanSquaredError))
print("MSE: {}".format(test_results.meanSquaredError))

RMSE: 1.5652712492324938
MSE: 2.4500740836738517
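
RMSE is the square root of MSE, so it is expressed in the same units as `crew` and can be compared directly with the test set summary: an error of about 1.57 against a mean of about 9.08 and a standard deviation of about 4.15. The sketch below checks that relationship and derives a coefficient of determination from the printed numbers alone. It assumes that the `stddev` reported by `describe()` is the sample standard deviation (denominator n - 1). Spark's evaluation summary also exposes an `r2` attribute that can be read directly.

```python
import math

# Values copied from the notebook outputs.
rmse = 1.5652712492324938
mse = 2.4500740836738517
n_test = 43
crew_mean_test = 9.075813953488371
crew_stddev_test = 4.146206661377154   # assumed to be the sample std (n - 1)

# RMSE is the square root of MSE.
print("sqrt(MSE) matches RMSE:", math.isclose(math.sqrt(mse), rmse, rel_tol=1e-9))

# Error relative to the typical crew size in the test set.
relative_error = rmse / crew_mean_test
print("RMSE / mean crew:", round(relative_error, 3))

# R^2 = 1 - SS_residual / SS_total, rebuilt from the summary statistics.
ss_residual = mse * n_test
ss_total = crew_stddev_test ** 2 * (n_test - 1)
r2 = 1 - ss_residual / ss_total
print("approximate R^2:", round(r2, 3))
```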
