<a href="https://colab.research.google.com/github/pstorniolo/Master2021/blob/main/Spark320_Arrow_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Apache Arrow in PySpark

Apache Arrow is an **in-memory columnar data format** that Spark uses to transfer data efficiently between *JVM* and *Python* processes. At present it mainly benefits Python users who work with *Pandas/NumPy* data. It is not enabled automatically, and it may require minor changes to configuration or code to take full advantage of it and to ensure compatibility.
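
To make "columnar" concrete, here is a minimal sketch in plain Python that contrasts a row-oriented layout with a column-oriented one. It only illustrates the idea and is not Arrow's actual memory format; the names `rows`, `columns`, `raw` and `restored` are made up for this example.

```python
from array import array

# Row-oriented: one Python tuple per record, fields interleaved.
rows = [(1, 0.5), (2, 1.5), (3, 2.5)]

# Column-oriented: one contiguous, typed buffer per field.
columns = {
    "id": array("q", (r[0] for r in rows)),     # signed 64-bit integers
    "value": array("d", (r[1] for r in rows)),  # 64-bit floats
}

# An aggregate over one field only touches that field's buffer.
total = sum(columns["value"])

# A whole column can be handed over as raw bytes and rebuilt on the
# other side without converting records one by one.
raw = columns["value"].tobytes()
restored = array("d")
restored.frombytes(raw)

print(total, len(raw), restored == columns["value"])
```

Moving whole typed buffers instead of individual records is, roughly, why a columnar format suits the JVM-to-Python transfer described above.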

In [None]:
# Install Spark 3.2.0 - JDK11
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
!tar xf spark-3.2.0-bin-hadoop3.2.tgz
!rm -f *.tgz*

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.0-bin-hadoop3.2"

!pip install -q findspark
!pip install -q pyspark==3.2.0

In [None]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row

from datetime import datetime, date
import numpy as np
import pandas as pd

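# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("Spark-Session").getOrCreate()
# Use Arrow for createDataFrame(pandas_df) and toPandas()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# Experimental in Spark 3.2: free Arrow memory while toPandas() builds the result
spark.conf.set("spark.sql.execution.arrow.pyspark.selfDestruct.enabled", "true")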
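
The Arrow-related options can also be kept in one place and applied together. The sketch below assumes a `spark` session like the one created above. The helper name `apply_arrow_settings` and the dictionary `ARROW_SETTINGS` are hypothetical, and the values shown are only a starting point (the fallback and batch-size values match the documented Spark defaults).

```python
# Hypothetical grouping of Arrow-related Spark SQL options.
ARROW_SETTINGS = {
    # Use Arrow for pandas <-> Spark conversions.
    "spark.sql.execution.arrow.pyspark.enabled": "true",
    # Fall back to the non-Arrow path if an Arrow conversion fails.
    "spark.sql.execution.arrow.pyspark.fallback.enabled": "true",
    # Upper bound on the number of rows in each Arrow record batch.
    "spark.sql.execution.arrow.maxRecordsPerBatch": "10000",
}


def apply_arrow_settings(spark, settings):
    """Set each option on the session and return the values read back."""
    for key, value in settings.items():
        spark.conf.set(key, value)
    return {key: spark.conf.get(key) for key in settings}


# Usage in the notebook (requires the `spark` session defined above):
# print(apply_arrow_settings(spark, ARROW_SETTINGS))
```

Reading the values back after setting them is a cheap way to confirm that the session accepted each key.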

In [None]:
# Generate a Pandas DataFrame
pdf = pd.DataFrame(np.random.rand(1000000, 6), columns=["a","b","c","d","e","f"])

In [None]:
# Create a Spark DataFrame from a Pandas DataFrame
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled","false")
df = spark.createDataFrame(pdf)

# Convert the Spark DataFrame back to a Pandas DataFrame
pndf = df.select('*').toPandas()
pndf

,a,b,c,d,e,f
0,0.851704,0.742301,0.844796,0.869055,0.549991,0.087759
1,0.795028,0.289542,0.516365,0.234267,0.203202,0.686972
2,0.576182,0.735758,0.264134,0.722898,0.586625,0.686198
3,0.791706,0.735150,0.863005,0.721994,0.003031,0.153958
4,0.243429,0.490172,0.847106,0.057154,0.056042,0.167847
...,...,...,...,...,...,...
999995,0.948746,0.730308,0.014361,0.888353,0.308251,0.774478
999996,0.227034,0.732276,0.773951,0.429611,0.091561,0.064155
999997,0.655569,0.103088,0.005855,0.977241,0.506079,0.171003
999998,0.971768,0.156542,0.878554,0.442738,0.084821,0.284072


In [None]:
# Create a Spark DataFrame from a Pandas DataFrame using Arrow
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled","true")
df = spark.createDataFrame(pdf)

# Convert the Spark DataFrame back to a Pandas DataFrame using Arrow
pndf = df.select('*').toPandas()
pndf

,a,b,c,d,e,f
0,0.851704,0.742301,0.844796,0.869055,0.549991,0.087759
1,0.795028,0.289542,0.516365,0.234267,0.203202,0.686972
2,0.576182,0.735758,0.264134,0.722898,0.586625,0.686198
3,0.791706,0.735150,0.863005,0.721994,0.003031,0.153958
4,0.243429,0.490172,0.847106,0.057154,0.056042,0.167847
...,...,...,...,...,...,...
999995,0.948746,0.730308,0.014361,0.888353,0.308251,0.774478
999996,0.227034,0.732276,0.773951,0.429611,0.091561,0.064155
999997,0.655569,0.103088,0.005855,0.977241,0.506079,0.171003
999998,0.971768,0.156542,0.878554,0.442738,0.084821,0.284072
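
The two cells above differ only in the Arrow flag and return the same data, so the difference worth looking at is running time. The sketch below is one way to measure it, assuming the `spark` session and the `pdf` DataFrame from the earlier cells. `time_call` and `roundtrip` are hypothetical helper names, and no timings are claimed here because they depend on the machine.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def roundtrip(spark, pdf, use_arrow):
    """Pandas -> Spark -> Pandas with the Arrow flag set explicitly."""
    spark.conf.set(
        "spark.sql.execution.arrow.pyspark.enabled",
        "true" if use_arrow else "false",
    )
    return spark.createDataFrame(pdf).toPandas()


# Usage in the notebook (requires `spark` and `pdf` from the cells above):
# _, t_plain = time_call(roundtrip, spark, pdf, False)
# _, t_arrow = time_call(roundtrip, spark, pdf, True)
# print(f"without Arrow: {t_plain:.2f}s, with Arrow: {t_arrow:.2f}s")
```

A single run is noisy; repeating each measurement a few times and comparing the medians gives a fairer picture.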
