<a href="https://colab.research.google.com/github/kennenvi/Challange-Dados/blob/main/Sistema_de_recomendacao_com_PySpark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setting up the environment
---

## Installing PySpark on Google Colab

In [1]:
!pip install pyspark==3.3.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark==3.3.1
  Downloading pyspark-3.3.1.tar.gz (281.4 MB)
     |████████████████████████████████| 281.4 MB 35 kB/s 
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
     |████████████████████████████████| 199 kB 46.1 MB/s 
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.1-py2.py3-none-any.whl size=281845512 sha256=8a3e1751e4982a48e0a3eb67205c066c79bc960eadb52919220ac76420a3416c
  Stored in directory: /root/.cache/pip/wheels/43/dc/11/ec201cd671da62fa9c5cc77078235e40722170ceba231d7598
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.1


## SparkSession

The entry point for programming Spark with the Dataset and DataFrame API.

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .appName("Iniciando com Spark") \
    .config('spark.ui.port', '4050') \
    .getOrCreate()

In [3]:
spark

## Downloading the dataset

In [4]:
!wget 'https://caelum-online-public.s3.amazonaws.com/challenge-spark/semanas-3-e-4.zip' && unzip semanas-3-e-4.zip -d dados/

--2022-12-14 18:54:58--  https://caelum-online-public.s3.amazonaws.com/challenge-spark/semanas-3-e-4.zip
Resolving caelum-online-public.s3.amazonaws.com (caelum-online-public.s3.amazonaws.com)... 52.217.167.249, 52.217.204.89, 52.217.202.185, ...
Connecting to caelum-online-public.s3.amazonaws.com (caelum-online-public.s3.amazonaws.com)|52.217.167.249|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2588308 (2.5M) [application/zip]
Saving to: ‘semanas-3-e-4.zip’


2022-12-14 18:54:59 (5.75 MB/s) - ‘semanas-3-e-4.zip’ saved [2588308/2588308]

Archive:  semanas-3-e-4.zip
   creating: dados/dataset_ml_parquet/
  inflating: dados/dataset_ml_parquet/_SUCCESS  
  inflating: dados/dataset_ml_parquet/._SUCCESS.crc  
  inflating: dados/dataset_ml_parquet/part-00003-a14b227c-f87e-4893-b5f9-4163ed07cb37-c000.snappy.parquet  
  inflating: dados/dataset_ml_parquet/part-00001-a14b227c-f87e-4893-b5f9-4163ed07cb37-c000.snappy.parquet  
  inflating: dados/dataset_ml_parquet/par

# Recommendation system

In [5]:
# Loading the data
dados = spark.read.parquet(
    '/content/dados/dataset_ml_parquet'
)

In [6]:
dados.show()

+--------------------+-----+---------+---------+-------+------+----+--------------------+----------+------+---------+------------+----------+----------+--------+--------+------------------+-------------+------------------+--------+-------+----------+------------+-----------------+---------------+
|                  id|andar|area_util|banheiros|quartos|suites|vaga|              bairro|condominio|  iptu|    valor|Zona Central|Zona Norte|Zona Oeste|Zona Sul|Academia|Animais permitidos|Churrasqueira|Condomínio fechado|Elevador|Piscina|Playground|Portaria 24h|Portão eletrônico|Salão de festas|
+--------------------+-----+---------+---------+-------+------+----+--------------------+----------+------+---------+------------+----------+----------+--------+--------+------------------+-------------+------------------+--------+-------+----------+------------+-----------------+---------------+
|00002dd9-cc74-480...|    2|       35|        1|      1|   0.0| 0.0|        Santo Cristo|     100.0| 100.0

In [7]:
# Checking the data structure
dados.printSchema()

root
 |-- id: string (nullable = true)
 |-- andar: integer (nullable = true)
 |-- area_util: integer (nullable = true)
 |-- banheiros: integer (nullable = true)
 |-- quartos: integer (nullable = true)
 |-- suites: double (nullable = true)
 |-- vaga: double (nullable = true)
 |-- bairro: string (nullable = true)
 |-- condominio: double (nullable = true)
 |-- iptu: double (nullable = true)
 |-- valor: double (nullable = true)
 |-- Zona Central: integer (nullable = true)
 |-- Zona Norte: integer (nullable = true)
 |-- Zona Oeste: integer (nullable = true)
 |-- Zona Sul: integer (nullable = true)
 |-- Academia: integer (nullable = true)
 |-- Animais permitidos: integer (nullable = true)
 |-- Churrasqueira: integer (nullable = true)
 |-- Condomínio fechado: integer (nullable = true)
 |-- Elevador: integer (nullable = true)
 |-- Piscina: integer (nullable = true)
 |-- Playground: integer (nullable = true)
 |-- Portaria 24h: integer (nullable = true)
 |-- Portão eletrônico: integer (nullable 

The `bairro` (neighborhood) column will be dropped because it has too many unique values to be converted from a categorical variable into numeric ones. The cell below counts them.

In [8]:
dados\
    .groupby('bairro')\
    .count()\
    .count()

150
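
To make the cost concrete: one-hot encoding adds one indicator column per distinct value, so 150 neighborhoods would add 150 mostly-zero features next to the 23 that remain once `id` and `bairro` are set aside. The sketch below shows that growth in plain Python. The helper `one_hot` and the names `Bairro B` and `Bairro C` are hypothetical, used only for illustration; this is not how PySpark encodes categories.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 indicator per distinct category."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows

# Hypothetical sample; the real column has 150 distinct values.
sample = ['Santo Cristo', 'Bairro B', 'Bairro C', 'Bairro B']
categories, encoded = one_hot(sample)
print(len(categories))  # number of indicator columns that would be added
print(encoded[0])       # indicators for 'Santo Cristo'
```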

In [9]:
# Dropping the "bairro" column
dados = dados.drop('bairro')

### Preparing the data

In [10]:
from pyspark.ml.feature import VectorAssembler

In [11]:
X = dados.drop('id').columns
X

['andar',
 'area_util',
 'banheiros',
 'quartos',
 'suites',
 'vaga',
 'condominio',
 'iptu',
 'valor',
 'Zona Central',
 'Zona Norte',
 'Zona Oeste',
 'Zona Sul',
 'Academia',
 'Animais permitidos',
 'Churrasqueira',
 'Condomínio fechado',
 'Elevador',
 'Piscina',
 'Playground',
 'Portaria 24h',
 'Portão eletrônico',
 'Salão de festas']

In [12]:
dados_prep = VectorAssembler(inputCols=X, outputCol='features').transform(dados)

In [13]:
dados_prep.select('features').show(truncate=False)

+----------------------------------------------------------------------------------------------------------------+
|features                                                                                                        |
+----------------------------------------------------------------------------------------------------------------+
|[2.0,35.0,1.0,1.0,0.0,0.0,100.0,100.0,245000.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]         |
|(23,[0,1,2,3,5,6,7,8,10,15,17,19,20,22],[1.0,84.0,2.0,2.0,1.0,770.0,105.0,474980.0,1.0,1.0,1.0,1.0,1.0,1.0])    |
|(23,[1,2,3,6,7,8,12,14,17],[85.0,2.0,2.0,460.0,661.0,290000.0,1.0,1.0,1.0])                                     |
|(23,[1,2,3,5,6,7,8,11,18,19],[58.0,1.0,2.0,1.0,550.0,550.0,249000.0,1.0,1.0,1.0])                               |
|(23,[1,2,3,4,5,6,8,10],[64.0,2.0,2.0,1.0,1.0,850.0,530000.0,1.0])                                               |
|[0.0,200.0,6.0,4.0,4.0,2.0,2500.0,420.0,2900000.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1
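
The output above mixes two notations. Rows in square brackets are dense vectors, while rows of the form `(size,[indices],[values])` are sparse vectors that list only the non-zero positions. As far as I know, `VectorAssembler` emits whichever form is more compact for each row, and both carry the same 23 values. A minimal sketch that expands the sparse form, where `sparse_to_dense` is a hypothetical helper and not a PySpark function:

```python
def sparse_to_dense(size, indices, values):
    """Expand Spark's (size, [indices], [values]) notation into a plain list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# Third row of the output above.
row = sparse_to_dense(
    23,
    [1, 2, 3, 6, 7, 8, 12, 14, 17],
    [85.0, 2.0, 2.0, 460.0, 661.0, 290000.0, 1.0, 1.0, 1.0],
)
# Positions follow the order of X: 1 = area_util, 8 = valor, 12 = Zona Sul
print(row[1], row[8], row[12])
```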

In [14]:
from pyspark.ml.feature import StandardScaler

In [15]:
scaler = StandardScaler(inputCol='features', outputCol='scaled_features')
model_scaler = scaler.fit(dados_prep)
dados_prep = model_scaler.transform(dados_prep)
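
With its default parameters (`withStd=True`, `withMean=False`), `StandardScaler` divides each feature by its sample standard deviation and does not subtract the mean, so zeros stay zero and sparse vectors stay sparse. The sketch below reproduces that behavior for a single column in plain Python. The helper name and the short list of `andar` values are illustrative assumptions, so the numbers will not match the notebook's output.

```python
from statistics import stdev

def scale_to_unit_std(column):
    """Divide by the sample standard deviation without subtracting the mean."""
    sd = stdev(column)  # corrected (n - 1) sample standard deviation
    return [x / sd for x in column]

andar = [2.0, 0.0, 5.0, 9.0, 20.0]  # small illustrative sample of floor values
scaled = scale_to_unit_std(andar)
print(scaled)
```

Because the data is not centered, ratios between values are preserved: a floor of 5 is still 2.5 times a floor of 2 after scaling.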

In [16]:
dados_prep.select('features', 'scaled_features').show()

+--------------------+--------------------+
|            features|     scaled_features|
+--------------------+--------------------+
|[2.0,35.0,1.0,1.0...|[0.13607726247524...|
|(23,[0,1,2,3,5,6,...|(23,[0,1,2,3,5,6,...|
|(23,[1,2,3,6,7,8,...|(23,[1,2,3,6,7,8,...|
|(23,[1,2,3,5,6,7,...|(23,[1,2,3,5,6,7,...|
|(23,[1,2,3,4,5,6,...|(23,[1,2,3,4,5,6,...|
|[0.0,200.0,6.0,4....|[0.0,2.2447697820...|
|(23,[1,2,3,6,8,10...|(23,[1,2,3,6,8,10...|
|(23,[0,1,2,3,4,5,...|(23,[0,1,2,3,4,5,...|
|(23,[1,2,3,4,5,8,...|(23,[1,2,3,4,5,8,...|
|[0.0,41.0,1.0,1.0...|[0.0,0.4601778053...|
|[5.0,78.0,1.0,2.0...|[0.34019315618810...|
|(23,[1,2,3,4,5,6,...|(23,[1,2,3,4,5,6,...|
|(23,[1,2,3,4,5,6,...|(23,[1,2,3,4,5,6,...|
|(23,[1,2,3,4,5,6,...|(23,[1,2,3,4,5,6,...|
|[9.0,120.0,2.0,2....|[0.61234768113858...|
|[20.0,341.0,2.0,3...|[1.36077262475241...|
|[0.0,194.0,5.0,4....|[0.0,2.1774266885...|
|(23,[1,2,3,8,10,2...|(23,[1,2,3,8,10,2...|
|(23,[0,1,2,3,8,10...|(23,[0,1,2,3,8,10...|
|(23,[0,1,2,3,5,6,...|(23,[0,1,2

### Dimensionality reduction

In [17]:
from pyspark.ml.feature import PCA

In [18]:
pca_k = len(X)
pca_k

23

In [19]:
pca = PCA(k=pca_k, inputCol='scaled_features', outputCol='pca_features')
model_pca = pca.fit(dados_prep)

Checking the share of variance explained by each principal component

In [20]:
import numpy as np

In [21]:
model_pca.explainedVariance

DenseVector([0.2655, 0.1721, 0.0913, 0.0544, 0.0522, 0.0466, 0.0443, 0.0416, 0.0347, 0.0272, 0.0244, 0.0201, 0.0192, 0.0176, 0.0155, 0.0139, 0.012, 0.0113, 0.0101, 0.0092, 0.0089, 0.0079, 0.0])

In [22]:
exp_var = np.array(model_pca.explainedVariance)

Based on these values, components are kept while the cumulative explained variance stays at or below 80%

In [23]:
# Count the components whose cumulative explained variance is still <= 80%
pca_k = int((exp_var.cumsum() <= .8).sum())
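
The choice of threshold rule matters here. Counting the components whose running total is `<= 0.8` stops just before the 80% mark, whereas taking the smallest number of components that reaches 80% gives one more. A minimal sketch in plain Python, using the rounded ratios printed above (so the totals are approximate):

```python
from itertools import accumulate

# Explained-variance ratios as printed by model_pca.explainedVariance.
exp_var = [0.2655, 0.1721, 0.0913, 0.0544, 0.0522, 0.0466, 0.0443, 0.0416,
           0.0347, 0.0272, 0.0244, 0.0201, 0.0192, 0.0176, 0.0155, 0.0139,
           0.012, 0.0113, 0.0101, 0.0092, 0.0089, 0.0079, 0.0]
cumulative = list(accumulate(exp_var))

# Rule used in this notebook: components whose running total is <= 0.8.
k_at_most = sum(c <= 0.8 for c in cumulative)

# Alternative rule: smallest number of components whose total reaches 0.8.
k_at_least = next(i + 1 for i, c in enumerate(cumulative) if c >= 0.8)

print(k_at_most, round(cumulative[k_at_most - 1], 4))
print(k_at_least, round(cumulative[k_at_least - 1], 4))
```

With these values the notebook's rule keeps 8 components covering roughly 77% of the variance, and the alternative would keep 9 covering roughly 80%.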

In [24]:
model_pca = pca.setK(pca_k).fit(dados_prep)
dados_prep = model_pca.transform(dados_prep)

In [25]:
componentes_pca = len(dados_prep.select('pca_features').take(1)[0][0])
print('Number of components from PCA:', componentes_pca)

Number of components from PCA: 8


### Clustering

In [26]:
from pyspark.ml.clustering import KMeans

In [27]:
SEED = 42

In [28]:
kmeans = KMeans(k=10, featuresCol='pca_features', predictionCol='cluster', seed=SEED)
model_kmeans = kmeans.fit(dados_prep)
dados_cluster = model_kmeans.transform(dados_prep)
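
After fitting, `transform` fills the `cluster` column by assigning each row to its nearest cluster center, using Euclidean distance with the default `distanceMeasure`. The assignment step can be sketched in plain Python as follows. The centers and points are hypothetical 2-D values, while the notebook works with 8-D PCA vectors and centers learned by `KMeans`.

```python
import math

def assign_cluster(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centers]
    return distances.index(min(distances))

# Hypothetical 2-D centers and points, for illustration only.
centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
points = [(1.0, 0.5), (6.0, 4.0), (9.0, 1.0)]
labels = [assign_cluster(p, centers) for p in points]
print(labels)
```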

In [29]:
dados_cluster.show()

+--------------------+-----+---------+---------+-------+------+----+----------+------+---------+------------+----------+----------+--------+--------+------------------+-------------+------------------+--------+-------+----------+------------+-----------------+---------------+--------------------+--------------------+--------------------+-------+
|                  id|andar|area_util|banheiros|quartos|suites|vaga|condominio|  iptu|    valor|Zona Central|Zona Norte|Zona Oeste|Zona Sul|Academia|Animais permitidos|Churrasqueira|Condomínio fechado|Elevador|Piscina|Playground|Portaria 24h|Portão eletrônico|Salão de festas|            features|     scaled_features|        pca_features|cluster|
+--------------------+-----+---------+---------+-------+------+----+----------+------+---------+------------+----------+----------+--------+--------+------------------+-------------+------------------+--------+-------+----------+------------+-----------------+---------------+--------------------+-------

## Building the recommender

In [30]:
dados_recomendacao = dados_cluster\
    .select('id', 'pca_features', 'cluster')

In [31]:
id_movel_base = dados_recomendacao.take(1)[0][0]

In [32]:
from pyspark.sql import functions as f

In [33]:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
from scipy.spatial.distance import euclidean

In [34]:
def recomendador(id_movel):
    # Look up the cluster and PCA components of the requested listing
    cluster, componentes_movel = dados_cluster\
        .where(f.col('id') == id_movel)\
        .select('cluster', 'pca_features')\
        .collect()[0]

    # Candidates are the listings in the same cluster
    moveis_recomendados = dados_cluster\
        .where(f.col('cluster') == cluster)

    @udf(returnType=FloatType())
    def calcula_distancia(valor):
        # Return a plain Python float, which is what FloatType expects
        return float(euclidean(componentes_movel, valor))

    # Keep the 10 listings closest to the requested one
    moveis_recomendados_dist = moveis_recomendados\
        .withColumn('dist', calcula_distancia('pca_features'))\
        .sort('dist')\
        .limit(10)

    moveis_recomendados_dist.show()
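
The logic of the function can be checked without Spark: filter to the base listing's cluster, compute the Euclidean distance to every candidate, and keep the closest ones. The sketch below does this in plain Python. The function `recommend`, the ids, and the 2-D vectors are hypothetical stand-ins for the notebook's DataFrame and 8-D PCA features.

```python
import math

def recommend(base_id, items, top_n=10):
    """items maps id -> (cluster, vector); return the nearest ids in the same cluster."""
    base_cluster, base_vec = items[base_id]
    candidates = [
        (math.dist(base_vec, vec), item_id)
        for item_id, (cluster, vec) in items.items()
        if cluster == base_cluster
    ]
    # Sort by distance and keep the ids of the closest top_n items
    return [item_id for _, item_id in sorted(candidates)[:top_n]]

# Hypothetical listings, for illustration only.
items = {
    'a': (0, (0.0, 0.0)),
    'b': (0, (1.0, 1.0)),
    'c': (0, (0.2, 0.1)),
    'd': (1, (0.1, 0.1)),  # closer than 'b', but in another cluster
}
print(recommend('a', items, top_n=3))
```

As in the Spark version, the base listing itself comes first with distance 0, and a close listing from another cluster is never returned.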

In [35]:
recomendador(id_movel_base)

+--------------------+-----+---------+---------+-------+------+----+----------+-----+--------+------------+----------+----------+--------+--------+------------------+-------------+------------------+--------+-------+----------+------------+-----------------+---------------+--------------------+--------------------+--------------------+-------+------------+
|                  id|andar|area_util|banheiros|quartos|suites|vaga|condominio| iptu|   valor|Zona Central|Zona Norte|Zona Oeste|Zona Sul|Academia|Animais permitidos|Churrasqueira|Condomínio fechado|Elevador|Piscina|Playground|Portaria 24h|Portão eletrônico|Salão de festas|            features|     scaled_features|        pca_features|cluster|        dist|
+--------------------+-----+---------+---------+-------+------+----+----------+-----+--------+------------+----------+----------+--------+--------+------------------+-------------+------------------+--------+-------+----------+------------+-----------------+---------------+--------

## Building a Pipeline

In [36]:
from pyspark.ml import Pipeline

In [37]:
dados.show()

+--------------------+-----+---------+---------+-------+------+----+----------+------+---------+------------+----------+----------+--------+--------+------------------+-------------+------------------+--------+-------+----------+------------+-----------------+---------------+
|                  id|andar|area_util|banheiros|quartos|suites|vaga|condominio|  iptu|    valor|Zona Central|Zona Norte|Zona Oeste|Zona Sul|Academia|Animais permitidos|Churrasqueira|Condomínio fechado|Elevador|Piscina|Playground|Portaria 24h|Portão eletrônico|Salão de festas|
+--------------------+-----+---------+---------+-------+------+----+----------+------+---------+------------+----------+----------+--------+--------+------------------+-------------+------------------+--------+-------+----------+------------+-----------------+---------------+
|00002dd9-cc74-480...|    2|       35|        1|      1|   0.0| 0.0|     100.0| 100.0| 245000.0|           1|         0|         0|       0|       1|                 1| 

In [38]:
va = VectorAssembler(inputCols=X, outputCol='features')
scaler = StandardScaler(inputCol='features', outputCol='scaled_features')
pca = PCA(k=8, inputCol='scaled_features', outputCol='pca_features')
kmeans = KMeans(k=10, featuresCol='pca_features', predictionCol='cluster', seed=SEED)

pipe = Pipeline(stages=[va, scaler, pca, kmeans])
model_pipe = pipe.fit(dados)
dados_prep_pipe = model_pipe.transform(dados)

In [39]:
dados_prep_pipe.show()

+--------------------+-----+---------+---------+-------+------+----+----------+------+---------+------------+----------+----------+--------+--------+------------------+-------------+------------------+--------+-------+----------+------------+-----------------+---------------+--------------------+--------------------+--------------------+-------+
|                  id|andar|area_util|banheiros|quartos|suites|vaga|condominio|  iptu|    valor|Zona Central|Zona Norte|Zona Oeste|Zona Sul|Academia|Animais permitidos|Churrasqueira|Condomínio fechado|Elevador|Piscina|Playground|Portaria 24h|Portão eletrônico|Salão de festas|            features|     scaled_features|        pca_features|cluster|
+--------------------+-----+---------+---------+-------+------+----+----------+------+---------+------------+----------+----------+--------+--------+------------------+-------------+------------------+--------+-------+----------+------------+-----------------+---------------+--------------------+-------

## Cluster information

In [40]:
import plotly.express as px

In [41]:
dados_prep_pipe\
    .select('quartos', 'cluster')\
    .groupby('cluster')\
    .agg(
        f.mean('quartos').alias('quartos')
    )\
    .sort('quartos')\
    .show()

+-------+------------------+
|cluster|           quartos|
+-------+------------------+
|      0|2.1670878079595703|
|      9|2.3416655873591505|
|      3| 2.397264631043257|
|      7|    2.482996611871|
|      5|2.5109358726380004|
|      4| 2.530987446871602|
|      8|  2.57292576419214|
|      2|3.4444444444444446|
|      1|3.8679479231246123|
|      6|3.9322125813449023|
+-------+------------------+



In [42]:
quartos_cluster = dados_prep_pipe\
    .select('quartos', 'cluster')\
    .groupby('cluster')\
    .agg(
        f.mean('quartos').alias('quartos')
    )\
    .sort('quartos')

In [43]:
fig = px.bar(quartos_cluster.toPandas(), y='quartos', color='quartos', hover_data=['quartos', 'cluster'], text='cluster')

fig.show()

In [44]:
import plotly.graph_objects as go

In [45]:
def get_dados_agrupados_cluster(coluna):
    dados_coluna = dados_prep_pipe\
        .select(coluna, 'cluster')\
        .groupby('cluster')\
        .agg(
            f.mean(coluna).alias(coluna)
        )\
        .sort(coluna)
    
    return dados_coluna.toPandas()

In [46]:
fig = go.Figure()

lista_colunas = ['banheiros', 'quartos']  # 'area_util' omitted

for coluna in lista_colunas:
    temp = get_dados_agrupados_cluster(coluna)
    fig.add_trace(go.Bar(x=temp['cluster'].astype(str), y=temp[coluna], name=coluna))

fig.show()