# Machine Learning
---

* Un sistema inteligente es aquel sistema capaz de resolver problemas complejos y multidisciplinares de una forma automática dando soporte a las
decisiones de un experto
    * Algoritmos simbolistas. Razonamiento inductivo
    * Redes neuronales artificiales
    * Algoritmos genéticos y evolutivos
    * Probabilísticos. Teorema de Bayes.
    * Algoritmos de "similitudes". K-NN, SVM

* Los algoritmos que implementan los sistemas inteligentes son algoritmos iterativos, por lo tanto tienen que dar varias "pasadas" a los datos para llevar a cabo su tarea
* Se los conocen como algoritmos de aprendizaje de máquina (machine learning)
* Deben estar optimizados para un óptimo rendimiento

* Extraer conocimiento desde los datos (aprender a partir de los datos)
* Involucra: estadística + IA (computación)
* Reconocer patrones a partir de datos
* Análisis predictivo o estadístico
* Se busca encontrar patrones ocultos que permitan predecir o tomar decisiones a futuro

* Ejemplo sistema que filtre correos spam
  * Enfoque clásico: crear lista de asuntos con frases a filtrar (escribir reglas)
  * Enfoque ML: entrenar modelo ML (aprende automáticamente qué palabras y frases son buenos predictores de spam, al detectar patrones de palabras inusualmente frecuentes en los ejemplos de spam)

## Clasificación

* **Aprendizaje supervisado**: el conjunto de entrenamiento que se introduce en el algoritmo incluye las soluciones deseadas, llamadas etiquetas (labels)

| Asunto        | Spam |
| --            | --   |
| Gran Oferta   | Si   |
| CV            | No   |
| Proy Inv      | No   |
| Ganó 1 millon | Si   |

* **Aprendizaje no supervisado**: el conjunto de datos de entrenamiento no está etiquetado. Por ejemplo, se tiene un conjunto de datos referido a visitantes de un blog. Con aprendizaje no supervisado podríamos detectar, con un algoritmo de clasificación, los grupos de usuarios que son similares
    * K-Means: algoritmo utilizado para la agrupación en clústeres de datos (segmentación, agrupación)
    * GMM (mixtura gaussiana): son modelos probabilísticos construidos a partir de una suma ponderada de distribuciones de probabilidad Gaussianas. Se pueden emplear como clasificadores   

## MLlib

* Es la librería de algoritmos de machine learning para Spark
* Los algoritmos están diseñados e implementados para ejecutarse de manera eficiente en un ambiente distribuido
* Algoritmos:
    * Logistic regression
    * Naive Bayes
    * Generalized linear regression
    * Survival regression
    * Decision trees
    * Random forests
    * Gradient-boosted trees
    * Alternating least squares (ALS)
    * K-means
    * Gaussian mixtures
    * Latent Dirichlet allocation (LDA)
    * Frequent itemsets
    * Association rules
    * Sequential pattern mining

In [60]:
import os
os.environ['JAVA_HOME'] = '/Library/Java/JavaVirtualMachines/openjdk-17.jdk/Contents/Home'

In [61]:
from pyspark.sql import SparkSession

In [62]:
from IPython.core.display import HTML

In [63]:
display(HTML("<style>pre { white-space: pre !important; }</style>"))

In [64]:
spark = SparkSession.builder.master("local[*]").getOrCreate()

## Regresión Lineal

In [65]:
#datos!! https://www.kaggle.com/prasadperera/the-boston-housing-dataset
df = spark.read.csv('BostonHousing.csv', inferSchema=True, header=True)

In [66]:
df.show(5)

+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
|   crim|  zn|indus|chas|  nox|   rm| age|   dis|rad|tax|ptratio|     b|lstat|medv|
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
|0.00632|18.0| 2.31|   0|0.538|6.575|65.2|  4.09|  1|296|   15.3| 396.9| 4.98|24.0|
|0.02731| 0.0| 7.07|   0|0.469|6.421|78.9|4.9671|  2|242|   17.8| 396.9| 9.14|21.6|
|0.02729| 0.0| 7.07|   0|0.469|7.185|61.1|4.9671|  2|242|   17.8|392.83| 4.03|34.7|
|0.03237| 0.0| 2.18|   0|0.458|6.998|45.8|6.0622|  3|222|   18.7|394.63| 2.94|33.4|
|0.06905| 0.0| 2.18|   0|0.458|7.147|54.2|6.0622|  3|222|   18.7| 396.9| 5.33|36.2|
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
only showing top 5 rows


In [67]:
from pyspark.ml.feature import VectorAssembler

In [68]:
assembler = VectorAssembler(inputCols=["rm", "crim", "lstat"], outputCol="features")

In [69]:
features_df = assembler.transform(df)

In [70]:
features_df.show(5)

+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+--------------------+
|   crim|  zn|indus|chas|  nox|   rm| age|   dis|rad|tax|ptratio|     b|lstat|medv|            features|
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+--------------------+
|0.00632|18.0| 2.31|   0|0.538|6.575|65.2|  4.09|  1|296|   15.3| 396.9| 4.98|24.0|[6.575,0.00632,4.98]|
|0.02731| 0.0| 7.07|   0|0.469|6.421|78.9|4.9671|  2|242|   17.8| 396.9| 9.14|21.6|[6.421,0.02731,9.14]|
|0.02729| 0.0| 7.07|   0|0.469|7.185|61.1|4.9671|  2|242|   17.8|392.83| 4.03|34.7|[7.185,0.02729,4.03]|
|0.03237| 0.0| 2.18|   0|0.458|6.998|45.8|6.0622|  3|222|   18.7|394.63| 2.94|33.4|[6.998,0.03237,2.94]|
|0.06905| 0.0| 2.18|   0|0.458|7.147|54.2|6.0622|  3|222|   18.7| 396.9| 5.33|36.2|[7.147,0.06905,5.33]|
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+--------------------+
only showing top 5 rows


In [71]:
features_df.printSchema()

root
 |-- crim: double (nullable = true)
 |-- zn: double (nullable = true)
 |-- indus: double (nullable = true)
 |-- chas: integer (nullable = true)
 |-- nox: double (nullable = true)
 |-- rm: double (nullable = true)
 |-- age: double (nullable = true)
 |-- dis: double (nullable = true)
 |-- rad: integer (nullable = true)
 |-- tax: integer (nullable = true)
 |-- ptratio: double (nullable = true)
 |-- b: double (nullable = true)
 |-- lstat: double (nullable = true)
 |-- medv: double (nullable = true)
 |-- features: vector (nullable = true)



In [72]:
from pyspark.ml.regression import LinearRegression

In [73]:
lr = LinearRegression(labelCol="medv", featuresCol="features")

In [74]:
lrModel = lr.fit(features_df)

25/08/22 11:47:52 WARN Instrumentation: [cf8c74cc] regParam is zero, which might cause numerical instability and overfitting.


In [75]:
lrModel.coefficients

DenseVector([5.217, -0.1029, -0.5785])

In [76]:
lrModel.intercept

-2.5622510119272093

El anterior resultado se puede interpretar como: predicted home value = (5.2 x number of rooms) - (.1 x crime rate) - (.6 x % lower class) - 2.6

In [77]:
subset = features_df.limit(10).select("features", "medv")
subset.show()

+--------------------+----+
|            features|medv|
+--------------------+----+
|[6.575,0.00632,4.98]|24.0|
|[6.421,0.02731,9.14]|21.6|
|[7.185,0.02729,4.03]|34.7|
|[6.998,0.03237,2.94]|33.4|
|[7.147,0.06905,5.33]|36.2|
| [6.43,0.02985,5.21]|28.7|
|[6.012,0.08829,12...|22.9|
|[6.172,0.14455,19...|27.1|
|[5.631,0.21124,29...|16.5|
|[6.004,0.17004,17.1]|18.9|
+--------------------+----+



In [78]:
prediction = lrModel.transform(subset)

In [79]:
prediction.show()

+--------------------+----+------------------+
|            features|medv|        prediction|
+--------------------+----+------------------+
|[6.575,0.00632,4.98]|24.0|28.857717647784423|
|[6.421,0.02731,9.14]|21.6|25.645644850540144|
|[7.185,0.02729,4.03]|34.7| 32.58746300992212|
|[6.998,0.03237,2.94]|33.4| 32.24191904275643|
|[7.147,0.06905,5.33]|36.2| 31.63288834584224|
| [6.43,0.02985,5.21]|28.7|  27.9657852461671|
|[6.012,0.08829,12...|22.9| 21.60241460459668|
|[6.172,0.14455,19...|27.1|18.543911230275796|
|[5.631,0.21124,29...|16.5| 9.478596352794202|
|[6.004,0.17004,17.1]|18.9| 18.85073477002384|
+--------------------+----+------------------+



In [80]:
from pyspark.ml.linalg import Vectors

In [81]:
prueba = [(Vectors.dense([6.7, 0.2, 2.94]), )]
resultado = spark.createDataFrame(prueba, ["features"])

lrModel.transform(resultado).show()

+--------------+-----------------+
|      features|       prediction|
+--------------+-----------------+
|[6.7,0.2,2.94]|30.67001049444645|
+--------------+-----------------+



## K-MEANS

In [82]:
#set de datos a trabajar  https://archive.ics.uci.edu/ml/datasets/iris
df = spark.read.csv("iris-setosa.csv", inferSchema=True, header=True)

In [83]:
df.show(5)

+---------+--------+---------+--------+-----------+
|sp_length|sp_width|pl_length|pl_width|      class|
+---------+--------+---------+--------+-----------+
|      5.1|     3.5|      1.4|     0.2|Iris-setosa|
|      4.9|     3.0|      1.4|     0.2|Iris-setosa|
|      4.7|     3.2|      1.3|     0.2|Iris-setosa|
|      4.6|     3.1|      1.5|     0.2|Iris-setosa|
|      5.0|     3.6|      1.4|     0.2|Iris-setosa|
+---------+--------+---------+--------+-----------+
only showing top 5 rows


In [84]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [85]:
assembler = VectorAssembler(inputCols = ["sp_length", "sp_width", "pl_length", "pl_width"], outputCol="features")

In [86]:
irisFeatures = assembler.transform(df)

In [87]:
irisFeatures.show(5)

+---------+--------+---------+--------+-----------+-----------------+
|sp_length|sp_width|pl_length|pl_width|      class|         features|
+---------+--------+---------+--------+-----------+-----------------+
|      5.1|     3.5|      1.4|     0.2|Iris-setosa|[5.1,3.5,1.4,0.2]|
|      4.9|     3.0|      1.4|     0.2|Iris-setosa|[4.9,3.0,1.4,0.2]|
|      4.7|     3.2|      1.3|     0.2|Iris-setosa|[4.7,3.2,1.3,0.2]|
|      4.6|     3.1|      1.5|     0.2|Iris-setosa|[4.6,3.1,1.5,0.2]|
|      5.0|     3.6|      1.4|     0.2|Iris-setosa|[5.0,3.6,1.4,0.2]|
+---------+--------+---------+--------+-----------+-----------------+
only showing top 5 rows


In [88]:
from pyspark.ml.clustering import KMeans

In [89]:
(trainingData, testData) = irisFeatures.randomSplit([0.7, 0.3])
kmeans = KMeans().setK(3).setSeed(101010) # KMeans model with 3 clusters. setSeed makes reproducible results.
model = kmeans.fit(trainingData) # train kmeans model

In [90]:
transformed = model.transform(trainingData) # add a new column to the table with predicted results
transformed.show(50)  #150

+---------+--------+---------+--------+---------------+-----------------+----------+
|sp_length|sp_width|pl_length|pl_width|          class|         features|prediction|
+---------+--------+---------+--------+---------------+-----------------+----------+
|      4.3|     3.0|      1.1|     0.1|    Iris-setosa|[4.3,3.0,1.1,0.1]|         1|
|      4.4|     2.9|      1.4|     0.2|    Iris-setosa|[4.4,2.9,1.4,0.2]|         1|
|      4.4|     3.0|      1.3|     0.2|    Iris-setosa|[4.4,3.0,1.3,0.2]|         1|
|      4.4|     3.2|      1.3|     0.2|    Iris-setosa|[4.4,3.2,1.3,0.2]|         1|
|      4.5|     2.3|      1.3|     0.3|    Iris-setosa|[4.5,2.3,1.3,0.3]|         1|
|      4.6|     3.1|      1.5|     0.2|    Iris-setosa|[4.6,3.1,1.5,0.2]|         1|
|      4.6|     3.2|      1.4|     0.2|    Iris-setosa|[4.6,3.2,1.4,0.2]|         1|
|      4.6|     3.4|      1.4|     0.3|    Iris-setosa|[4.6,3.4,1.4,0.3]|         1|
|      4.8|     3.0|      1.4|     0.1|    Iris-setosa|[4.8,3.0,1

In [91]:
predictions = model.transform(testData)
predictions.show(17) #151

+---------+--------+---------+--------+---------------+-----------------+----------+
|sp_length|sp_width|pl_length|pl_width|          class|         features|prediction|
+---------+--------+---------+--------+---------------+-----------------+----------+
|      4.6|     3.6|      1.0|     0.2|    Iris-setosa|[4.6,3.6,1.0,0.2]|         1|
|      4.7|     3.2|      1.3|     0.2|    Iris-setosa|[4.7,3.2,1.3,0.2]|         1|
|      4.7|     3.2|      1.6|     0.2|    Iris-setosa|[4.7,3.2,1.6,0.2]|         1|
|      4.8|     3.0|      1.4|     0.3|    Iris-setosa|[4.8,3.0,1.4,0.3]|         1|
|      4.8|     3.1|      1.6|     0.2|    Iris-setosa|[4.8,3.1,1.6,0.2]|         1|
|      4.8|     3.4|      1.6|     0.2|    Iris-setosa|[4.8,3.4,1.6,0.2]|         1|
|      4.9|     2.5|      4.5|     1.7| Iris-virginica|[4.9,2.5,4.5,1.7]|         0|
|      4.9|     3.1|      1.5|     0.1|    Iris-setosa|[4.9,3.1,1.5,0.1]|         1|
|      5.0|     2.0|      3.5|     1.0|Iris-versicolor|[5.0,2.0,3

## Gaussian Mixture

In [93]:
from pyspark.ml.clustering import GaussianMixture, GaussianMixtureModel

In [94]:
df2 = spark.read.csv("iris-setosa.csv", inferSchema=True, header=True)

In [95]:
assembler = VectorAssembler(inputCols = ["sp_length", "sp_width", "pl_length", "pl_width"], outputCol="features")

In [96]:
irisFeatures = assembler.transform(df2)

In [97]:
irisFeatures.show(5)

+---------+--------+---------+--------+-----------+-----------------+
|sp_length|sp_width|pl_length|pl_width|      class|         features|
+---------+--------+---------+--------+-----------+-----------------+
|      5.1|     3.5|      1.4|     0.2|Iris-setosa|[5.1,3.5,1.4,0.2]|
|      4.9|     3.0|      1.4|     0.2|Iris-setosa|[4.9,3.0,1.4,0.2]|
|      4.7|     3.2|      1.3|     0.2|Iris-setosa|[4.7,3.2,1.3,0.2]|
|      4.6|     3.1|      1.5|     0.2|Iris-setosa|[4.6,3.1,1.5,0.2]|
|      5.0|     3.6|      1.4|     0.2|Iris-setosa|[5.0,3.6,1.4,0.2]|
+---------+--------+---------+--------+-----------+-----------------+
only showing top 5 rows


In [98]:
(trainingData, testData) = irisFeatures.randomSplit([0.7, 0.3])

In [99]:
gmm = GaussianMixture().setK(3).setSeed(42)

In [100]:
model = gmm.fit(testData)

In [101]:
print("Gaussian shown as a DataFrame:")
model.gaussiansDF.show()

Gaussian shown as a DataFrame:
+--------------------+--------------------+
|                mean|                 cov|
+--------------------+--------------------+
|[6.05494725185918...|0.350220628839082...|
|[5.04166917813600...|0.087429600546934...|
|[6.30079556689470...|0.075921237273638...|
+--------------------+--------------------+



In [102]:
model.weights #lamba

[0.43812992882924295, 0.2857129609893477, 0.27615711018140937]

In [103]:
predictions = model.transform(testData)

In [104]:
# Show the predictions including the cluster assignment and probabilities
print("Predictions:")
predictions.select("features", "prediction", "probability").show(truncate=False)

Predictions:
+-----------------+----------+------------------------------------------------------------------+
|features         |prediction|probability                                                       |
+-----------------+----------+------------------------------------------------------------------+
|[4.5,2.3,1.3,0.3]|1         |[5.616644999654675E-5,0.9999438335500017,1.6384909004385688E-15]  |
|[4.6,3.6,1.0,0.2]|1         |[3.642637935550374E-15,0.9999999999999927,3.642637935550374E-15]  |
|[4.7,3.2,1.6,0.2]|1         |[3.0890269091346075E-16,0.9999999999999993,3.0890269091346075E-16]|
|[4.9,2.4,3.3,1.0]|0         |[0.9999999999999988,6.264219911622257E-16,6.299084355053824E-16]  |
|[5.0,3.0,1.6,0.2]|1         |[2.9432758336971997E-16,0.9999999999999994,2.943275634972842E-16] |
|[5.0,3.5,1.3,0.3]|1         |[8.805996534299511E-17,1.0,8.805996534299511E-17]                 |
|[5.1,2.5,3.0,1.1]|0         |[0.9999999999946758,1.8767439106420072E-15,5.322387617530219E-12] |
|[5.1,3

In [105]:
#lambda1*norm(mu1,sigma1) + lambda2*norm(mu2,sigma2) + lambda3*norm(mu3,sigma3) + ... +lambdan*norm(mun,sigman)