<a href="https://colab.research.google.com/github/luasampaio/data-engineering/blob/main/09_PySparkRdd.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### What are partitions?
A DataFrame or RDD in PySpark is split into chunks called partitions.
Each partition is processed independently by a task running on an executor in the cluster.
A good partition configuration is essential to:
- Balance the load across nodes.
- Improve distributed processing performance.
- Avoid unnecessary shuffles (redistributing data across partitions).
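To make "balancing the load" concrete, here is a minimal sketch in plain Python (no Spark needed) that scores how uneven a set of partitions is. It assumes you already have the per-partition row counts as a list, for example the result of `df.rdd.glom().map(len).collect()`. The helper name `partition_skew` and the max/mean ratio are illustrative choices, not part of the PySpark API.

```python
from statistics import mean


def partition_skew(sizes):
    """Return the largest partition size divided by the mean size.

    A value of 1.0 means perfectly balanced partitions; larger values
    mean one partition holds a disproportionate share of the rows.
    (Hypothetical helper, not a PySpark function.)
    """
    if not sizes:
        raise ValueError("sizes must contain at least one partition")
    avg = mean(sizes)
    if avg == 0:
        return 1.0  # every partition is empty, so there is nothing to balance
    return max(sizes) / avg


balanced = partition_skew([42, 42, 42, 42])  # evenly filled partitions
skewed = partition_skew([400, 6, 6, 6])      # one partition holds most rows
print(balanced, skewed)
```

A ratio well above 1 suggests that a single task will run much longer than the others, which is usually the sign that repartitioning is worth considering.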

In [2]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

Mount Google Drive

In [4]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
spark = (
    SparkSession.builder.appName('PySpark - LucianaSampaio')
    .config('spark.sql.repl.eagerEval.enabled', True)
    .getOrCreate()
)

In [6]:
df = spark.read.csv('/content/drive/MyDrive/.Dataset/test.csv', sep=',', encoding='UTF-8', header=True, inferSchema=True)
df

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. Jame...",female,47.0,1,0,363272,7.0,,S
894,2,"Myles, Mr. Thomas...",male,62.0,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Al...",female,22.0,1,1,3101298,12.2875,,S
897,3,"Svensson, Mr. Joh...",male,14.0,0,0,7538,9.225,,S
898,3,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q
899,2,"Caldwell, Mr. Alb...",male,26.0,1,1,248738,29.0,,S
900,3,"Abrahim, Mrs. Jos...",female,18.0,0,0,2657,7.2292,,C
901,3,"Davies, Mr. John ...",male,21.0,2,0,A/4 48871,24.15,,S


In [7]:
df.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



# Check the number of partitions

In [8]:
print(df.rdd.getNumPartitions())

1


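# Lambda function to get the first row of each partition

In [9]:
# glom() groups each partition's rows into a list; guard against empty partitions
df.rdd.glom().map(lambda x: x[0] if x else None).collect()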

[Row(PassengerId=892, Pclass=3, Name='Kelly, Mr. James', Sex='male', Age=34.5, SibSp=0, Parch=0, Ticket='330911', Fare=7.8292, Cabin=None, Embarked='Q')]

In [10]:
# Show the number of rows and columns in the DataFrame
rows, columns = df.count(), len(df.columns)
print(f"Rows: {rows}\nColumns: {columns}")

Rows: 418
Columns: 11


In [11]:
df.rdd.glom().map(len).collect()

[418]

In [12]:
df_repartitioned = df.repartition(10)  # Split into 10 partitions (triggers a full shuffle)

# Validate the number of partitions

In [14]:
print(df_repartitioned.rdd.getNumPartitions())

10
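As far as I know, `repartition(n)` called without column arguments spreads rows across the new partitions in a round-robin fashion, which is why the result tends to be evenly sized. The sketch below imitates that idea in plain Python on a stand-in list of 418 items. `round_robin_partition` is a hypothetical helper, and the exact sizes Spark reports for `df_repartitioned` may differ slightly from this simplified model.

```python
def round_robin_partition(rows, num_partitions):
    """Deal rows into num_partitions lists, one row at a time.

    Simplified model of round-robin distribution; not Spark's implementation.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


rows = list(range(418))  # stand-in for the 418 rows of the DataFrame
sizes = [len(p) for p in round_robin_partition(rows, 10)]
print(sizes)
```

In this model no two partitions differ by more than one row, which is the balance property that makes a full shuffle worth its cost when the input is badly skewed.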


In [15]:
# coalesce() can only reduce the partition count; df has a single partition, so it stays at 1
df_coalesced = df.coalesce(5)

In [17]:
df_coalesced.rdd.glom().map(len).collect()

[418]
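The `[418]` above shows that `coalesce(5)` left the single partition untouched: coalescing merges existing partitions and never splits one. The plain-Python sketch below models that behaviour under a simplifying assumption, namely that neighbouring partitions are merged in order (Spark chooses the groups itself, taking data locality into account). `coalesce_sketch` is a hypothetical helper, not a PySpark function.

```python
def coalesce_sketch(partitions, target):
    """Merge partitions into at most `target` groups without splitting any.

    Simplified model: neighbouring partitions are merged in order.
    """
    if target < 1:
        raise ValueError("target must be >= 1")
    current = len(partitions)
    if target >= current:
        # Asking for more partitions than exist changes nothing
        return [list(p) for p in partitions]
    merged = [[] for _ in range(target)]
    for i, part in enumerate(partitions):
        merged[i * target // current].extend(part)
    return merged


# One partition with 418 rows, as in the DataFrame read from the CSV
single = [list(range(418))]
single_sizes = [len(p) for p in coalesce_sketch(single, 5)]

# Ten partitions (8 of 42 rows, 2 of 41 rows) reduced to five
ten = [list(range(42)) for _ in range(8)] + [list(range(41)) for _ in range(2)]
ten_sizes = [len(p) for p in coalesce_sketch(ten, 5)]

print(single_sizes, ten_sizes)
```

To actually see a reduction in the notebook, coalesce the DataFrame that already has 10 partitions, for example `df_repartitioned.coalesce(5)`.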