**DATAFRAMES & SparkSQL** 

This notebook demonstrates how to work with Spark DataFrames and SparkSQL in PySpark.

In [1]:
pip install pyspark findspark


Note: you may need to restart the kernel to use updated packages.


In [2]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("MySparkApp") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()

spark

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/12/11 12:16:06 WARN Utils: Your hostname, Ninas-macbook.local, resolves to a loopback address: 127.0.0.1; using 10.10.4.47 instead (on interface en0)
25/12/11 12:16:06 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/11 12:16:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
red_df = spark.read.csv("winequality-red.csv", header=True, inferSchema=True, sep=';')
white_df = spark.read.csv("winequality-white.csv", header=True, inferSchema=True, sep=';')

In [4]:
red_df.show(5)
white_df.show(5)
red_df.printSchema()

+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+
|fixed acidity|volatile acidity|citric acid|residual sugar|chlorides|free sulfur dioxide|total sulfur dioxide|density|  pH|sulphates|alcohol|quality|
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+
|          7.4|             0.7|        0.0|           1.9|    0.076|               11.0|                34.0| 0.9978|3.51|     0.56|    9.4|      5|
|          7.8|            0.88|        0.0|           2.6|    0.098|               25.0|                67.0| 0.9968| 3.2|     0.68|    9.8|      5|
|          7.8|            0.76|       0.04|           2.3|    0.092|               15.0|                54.0|  0.997|3.26|     0.65|    9.8|      5|
|         11.2|            0.28|       0.56|           1.9|    0.075|               17.0|           

In [5]:
from pyspark.sql.functions import lit

red_df = red_df.withColumn("wine_type", lit("red"))
white_df = white_df.withColumn("wine_type", lit("white"))

In [6]:
wine_df = red_df.union(white_df)
wine_df.show(5)

+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+---------+
|fixed acidity|volatile acidity|citric acid|residual sugar|chlorides|free sulfur dioxide|total sulfur dioxide|density|  pH|sulphates|alcohol|quality|wine_type|
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+---------+
|          7.4|             0.7|        0.0|           1.9|    0.076|               11.0|                34.0| 0.9978|3.51|     0.56|    9.4|      5|      red|
|          7.8|            0.88|        0.0|           2.6|    0.098|               25.0|                67.0| 0.9968| 3.2|     0.68|    9.8|      5|      red|
|          7.8|            0.76|       0.04|           2.3|    0.092|               15.0|                54.0|  0.997|3.26|     0.65|    9.8|      5|      red|
|         11.2|            0.28|       0

Basic DataFrame operations:

1. Counting rows:

In [7]:
wine_df.count()

6497

2. Summary statistics

In [8]:
wine_df.describe().show()

25/12/11 12:16:17 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+-------+-----------------+-------------------+-------------------+------------------+-------------------+-------------------+--------------------+--------------------+-------------------+-------------------+------------------+------------------+---------+
|summary|    fixed acidity|   volatile acidity|        citric acid|    residual sugar|          chlorides|free sulfur dioxide|total sulfur dioxide|             density|                 pH|          sulphates|           alcohol|           quality|wine_type|
+-------+-----------------+-------------------+-------------------+------------------+-------------------+-------------------+--------------------+--------------------+-------------------+-------------------+------------------+------------------+---------+
|  count|             6497|               6497|               6497|              6497|               6497|               6497|                6497|                6497|               6497|               6497|              6497|  

3. Checking for Null values

In [9]:
from pyspark.sql.functions import col, when, count, isnan

numeric_cols = [c for c, t in wine_df.dtypes if t in ("double", "int", "float")]
string_cols = [c for c, t in wine_df.dtypes if t not in ("double", "int", "float")]

missing = wine_df.select(
    # numeric columns — check NULL or NaN
    *[
        count(when(col(c).isNull() | isnan(c), c)).alias(c)
        for c in numeric_cols
    ],
    # string columns — only check NULL
    *[
        count(when(col(c).isNull(), c)).alias(c)
        for c in string_cols
    ]
)

missing.show()


+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+---+---------+-------+-------+---------+
|fixed acidity|volatile acidity|citric acid|residual sugar|chlorides|free sulfur dioxide|total sulfur dioxide|density| pH|sulphates|alcohol|quality|wine_type|
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+---+---------+-------+-------+---------+
|            0|               0|          0|             0|        0|                  0|                   0|      0|  0|        0|      0|      0|        0|
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+---+---------+-------+-------+---------+
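The check above treats NULL and NaN as two different kinds of missing value, which is why `isnan` is applied only to the numeric columns. As a rough plain-Python analogue (not Spark code; the `rows` list and the `count_missing` helper are hypothetical and exist only for this illustration), the same per-column count might look like this:

```python
import math


def count_missing(rows, columns):
    """Count entries that are None or NaN, per column, in a list of row dicts."""
    counts = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            value = row.get(c)
            # None plays the role of SQL NULL; NaN only exists for float values
            if value is None or (isinstance(value, float) and math.isnan(value)):
                counts[c] += 1
    return counts


# Hypothetical sample rows, not taken from the wine dataset
rows = [
    {"alcohol": 9.4, "wine_type": "red"},
    {"alcohol": float("nan"), "wine_type": None},
    {"alcohol": None, "wine_type": "white"},
]

missing_counts = count_missing(rows, ["alcohol", "wine_type"])
print(missing_counts)
```

A string column can be NULL but never NaN, so the NaN test is skipped for non-float values here, mirroring the split between `numeric_cols` and `string_cols` in the cell above.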



**SparkSQL**

In [10]:
wine_df.createOrReplaceTempView("wines")

1. Average alcohol content by wine type

In [None]:
spark.sql("""
    SELECT wine_type, AVG(alcohol) AS avg_alcohol
    FROM wines
    GROUP BY wine_type
""").show()


2. Count wines by quality rating

In [11]:
spark.sql("""
    SELECT quality, COUNT(*) AS count
    FROM wines
    GROUP BY quality
    ORDER BY quality
""").show()

+-------+-----+
|quality|count|
+-------+-----+
|      3|   30|
|      4|  216|
|      5| 2138|
|      6| 2836|
|      7| 1079|
|      8|  193|
|      9|    5|
+-------+-----+



3. Highest alcohol red wines

In [12]:
spark.sql("""
    SELECT *
    FROM wines
    WHERE wine_type = 'red'
    ORDER BY alcohol DESC
    LIMIT 5
""").show()

+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+---------+
|fixed acidity|volatile acidity|citric acid|residual sugar|chlorides|free sulfur dioxide|total sulfur dioxide|density|  pH|sulphates|alcohol|quality|wine_type|
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+---------+
|         15.9|            0.36|       0.65|           7.5|    0.096|               22.0|                71.0| 0.9976|2.98|     0.84|   14.9|      5|      red|
|          5.2|            0.34|        0.0|           1.8|     0.05|               27.0|                63.0| 0.9916|3.68|     0.79|   14.0|      6|      red|
|          5.0|            0.42|       0.24|           2.0|     0.06|               19.0|                50.0| 0.9917|3.72|     0.74|   14.0|      8|      red|
|          4.9|            0.42|        

4. Correlation: alcohol vs quality

In [13]:
spark.sql("""
    SELECT corr(alcohol, quality) AS alcohol_quality_corr
    FROM wines;
""").show()

+--------------------+
|alcohol_quality_corr|
+--------------------+
|  0.4443185200076535|
+--------------------+
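Spark SQL's `corr` returns the Pearson correlation coefficient, i.e. the covariance of the two columns divided by the product of their standard deviations. A minimal pure-Python sketch of that formula follows; the `pearson` helper and the sample values are hypothetical and are not taken from the wine data:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values: y is an exact multiple of x, so the correlation is 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
r_perfect = pearson(xs, ys)
r_inverse = pearson(xs, [-y for y in ys])
print(r_perfect, r_inverse)
```

The coefficient ranges from -1 to 1. A value of about 0.44, as in the output above, indicates a moderate positive linear association between alcohol and quality; it does not by itself show that one causes the other.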



**DataFrame API vs. SparkSQL (heart dataset)**

In [21]:
heart = spark.read.csv("heart.csv", header=True, inferSchema=True, sep=',')
heart.createOrReplaceTempView("heart")
heart.printSchema()
spark.sql("SELECT Age, Cholesterol FROM heart WHERE Age > 50").show()

root
 |-- Age: integer (nullable = true)
 |-- Sex: string (nullable = true)
 |-- ChestPainType: string (nullable = true)
 |-- RestingBP: integer (nullable = true)
 |-- Cholesterol: integer (nullable = true)
 |-- FastingBS: integer (nullable = true)
 |-- RestingECG: string (nullable = true)
 |-- MaxHR: integer (nullable = true)
 |-- ExerciseAngina: string (nullable = true)
 |-- Oldpeak: double (nullable = true)
 |-- ST_Slope: string (nullable = true)
 |-- HeartDisease: integer (nullable = true)

+---+-----------+
|Age|Cholesterol|
+---+-----------+
| 54|        195|
| 54|        208|
| 58|        164|
| 54|        273|
| 60|        248|
| 53|        260|
| 52|        284|
| 53|        468|
| 51|        188|
| 53|        518|
| 56|        167|
| 54|        224|
| 65|        306|
| 54|        230|
| 54|        294|
| 52|        259|
| 59|        318|
| 52|        180|
| 51|        194|
| 58|        213|
+---+-----------+
only showing top 20 rows


In [24]:
# Select with the DataFrame API

heart.select("Age", "Sex", "HeartDisease").show(5)

+---+---+------------+
|Age|Sex|HeartDisease|
+---+---+------------+
| 40|  M|           0|
| 49|  F|           1|
| 37|  M|           0|
| 48|  F|           1|
| 54|  M|           0|
+---+---+------------+
only showing top 5 rows


In [26]:
# Select with Spark SQL
# The temp view was registered as "heart" above, so the query must use that name.

spark.sql("""
    SELECT Age, Sex, HeartDisease
    FROM heart
    LIMIT 5
""").show()

# Both approaches select the same columns; SQL is concise for analysts, while the DataFrame API fits naturally into Python code.


In [None]:
# Filtering with the DataFrame API

heart.filter(heart.Age > 50).show(5)

In [None]:
# Filtering with Spark SQL

spark.sql("""
    SELECT *
    FROM heart
    WHERE Age > 50
    LIMIT 5
""").show()

In [None]:
# Aggregation with DataFrames

from pyspark.sql.functions import avg

heart.groupBy("Sex").agg(avg("MaxHR").alias("avg_maxhr")).show()

In [None]:
# Aggregation with Spark SQL

spark.sql("""
    SELECT Sex, AVG(MaxHR) AS avg_maxhr
    FROM heart
    GROUP BY Sex
""").show()


In [None]:
# Self-join with Spark SQL: pair up patients who share the same chest pain type
# (as written, each row also matches itself and every pair appears in both orders)

spark.sql("""
    SELECT a.Age AS Age1, b.Age AS Age2, a.ChestPainType
    FROM heart a
    JOIN heart b ON a.ChestPainType = b.ChestPainType
    LIMIT 5
""").show()
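A self-join on `ChestPainType` alone matches every row with itself and returns each pair twice, once as (a, b) and once as (b, a). The schema printed earlier has no patient identifier, so the sketch below assumes a hypothetical `id` column and uses Python's built-in `sqlite3` module as a stand-in for Spark SQL, with made-up rows, to show how an `a.id < b.id` condition keeps each distinct pair exactly once:

```python
import sqlite3

# In-memory database standing in for the Spark temp view (assumption for illustration)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER, Age INTEGER, ChestPainType TEXT)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    # Made-up rows; the id column is hypothetical and not part of heart.csv
    [(1, 40, "ATA"), (2, 49, "NAP"), (3, 37, "ATA"), (4, 48, "ASY")],
)

# Join on the shared attribute only: includes self-pairs and mirrored pairs
naive_pairs = conn.execute("""
    SELECT a.id, b.id
    FROM patients a
    JOIN patients b ON a.ChestPainType = b.ChestPainType
    ORDER BY a.id, b.id
""").fetchall()

# Extra condition a.id < b.id: drops self-pairs and keeps one ordering per pair
distinct_pairs = conn.execute("""
    SELECT a.id, b.id
    FROM patients a
    JOIN patients b
      ON a.ChestPainType = b.ChestPainType AND a.id < b.id
    ORDER BY a.id, b.id
""").fetchall()
conn.close()

print(naive_pairs)
print(distinct_pairs)
```

In Spark, one option for adding such an identifier is `monotonically_increasing_id()` from `pyspark.sql.functions`; whether that suits a given pipeline is left as an assumption here.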