# **Unlock the Potential of PySpark DataFrames: Hands-on Tips and Personalizations**

`University of East London, Docklands Campus, 2023-24`

`module name`: **`Machine Learning on Big Data (CN7030) - MSc AI&DS`**

`Author`: **`Dr Amin Karami (PG Academic Lead in CDT School)`**

`E`: **`a.karami@uel.ac.uk`**

`W`: **`http://www.aminkarami.com/`**

---

**DataFrame (DF)**: Schema (named columns) + declarative language. A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database, and it is very efficient for structured data.

data: https://drive.google.com/file/d/1HiP_TkWYClAmzhhOFzhXdOXTfZb9DoB_/view?usp=drive_link (641MB)

source: https://spark.apache.org/docs/latest/sql-programming-guide.html

source: https://spark.apache.org/docs/latest/api/python/reference/

# **Section 1: Initialize PySpark**

In [1]:
# !pip3 install pyspark






# **System Check**

In [6]:
import psutil

# Check available memory
available_memory = psutil.virtual_memory().available
print(f"Available Memory: {available_memory} bytes")


Available Memory: 18284122112 bytes


In [8]:
# Check available cores
available_cores = psutil.cpu_count()
print(f"Available Cores: {available_cores}")

Available Cores: 12
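The raw byte count is hard to read when deciding how much memory to give Spark. A small sketch of a conversion to gibibytes follows; the helper name `bytes_to_gib` is an assumption introduced here, and the input is the value printed by the memory check above.

```python
def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return n_bytes / 2**30


# Value printed by the memory check above
available_memory = 18284122112
print(f"Available Memory: {bytes_to_gib(available_memory):.2f} GiB")
# Available Memory: 17.03 GiB
```

On a live machine, `psutil.virtual_memory().available` can be passed to the helper directly.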


# **Linking with Spark**

In [3]:
# Linking with Spark
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
                    .appName("Tutorial2_CN7030") \
                    .master("local[*]") \
                    .config("spark.executor.memory", "4g") \
                    .config("spark.driver.memory", "2g") \
                    .config("spark.executor.cores", "2") \
                    .config("spark.sql.inMemoryColumnarStorage.compressed", "true") \
                    .getOrCreate()

spark

# **Connect to the Google Drive**

# **Section 2: Create PySpark DataFrame from CSV**

In [23]:
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# show the first 20 rows (long cell values are truncated)
df.show(truncate=True)

# show the schema
df.printSchema()

# number of rows and number of columns
print(df.count())
print(len(df.columns))

+--------------------+------------------+------+------------------+-----+------+------+------+--------+------+--------+
|                  id|               age|salary|             score|sales|height|weight|gender|category|income|expenses|
+--------------------+------------------+------+------------------+-----+------+------+------+--------+------+--------+
|  0.8017532427858894| 79.71301351894658|     4| 37.10369234532471|   47|    62|     5|Female|       B|     8|      85|
|  0.6565552949992319| 24.22600602366628|     6| 79.75769828959885|    8|    30|    42|  Male|       A|    70|      85|
|  0.2515595782593636| 70.97364852287149|    17| 71.71375356646551|   82|     1|    72|  Male|       A|    66|      53|
|  0.2073428376111074| 45.09378549789149|    32|              NULL|   59|    12|    21|  Male|       B|    56|       2|
|  0.6392921379278927|30.357970527906065|    13|              NULL|   22|    69|    58|Female|       A|     0|      74|
|  0.8505582285081454|              NULL

# How many partitions?

In [24]:
df.rdd.getNumPartitions()

12

# Let's change the number of partitions

In [30]:
# repartition(n) performs a full shuffle and can increase or decrease the partition count
df2 = df.repartition(8)
# coalesce(n) can only reduce the partition count, and it avoids a full shuffle
# df2 = df2.coalesce(1)
df2.rdd.getNumPartitions()

8

In [29]:
df3 = df.repartition('category')
df3.rdd.getNumPartitions()

2

# Write the DF to disk in a partitioned manner

# **Section 3: DataFrame Operations and Transformations**

# **Section 4: Working with Missing Data**