**DataFrames & SparkSQL**

This notebook demonstrates how to work with Spark DataFrames and SparkSQL, using the red and white wine quality datasets.

In [2]:
pip install pyspark findspark


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [4]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("MySparkApp") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()

spark

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/11/20 16:25:01 WARN Utils: Your hostname, Ninas-macbook.local, resolves to a loopback address: 127.0.0.1; using 10.124.248.129 instead (on interface en0)
25/11/20 16:25:01 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/20 16:25:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/11/20 16:25:02 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [6]:
red_df = spark.read.csv("winequality-red.csv", header=True, inferSchema=True, sep=';')
white_df = spark.read.csv("winequality-white.csv", header=True, inferSchema=True, sep=';')

In [7]:
red_df.show(5)
white_df.show(5)
red_df.printSchema()

+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+
|fixed acidity|volatile acidity|citric acid|residual sugar|chlorides|free sulfur dioxide|total sulfur dioxide|density|  pH|sulphates|alcohol|quality|
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+
|          7.4|             0.7|        0.0|           1.9|    0.076|               11.0|                34.0| 0.9978|3.51|     0.56|    9.4|      5|
|          7.8|            0.88|        0.0|           2.6|    0.098|               25.0|                67.0| 0.9968| 3.2|     0.68|    9.8|      5|
|          7.8|            0.76|       0.04|           2.3|    0.092|               15.0|                54.0|  0.997|3.26|     0.65|    9.8|      5|
|         11.2|            0.28|       0.56|           1.9|    0.075|               17.0|           

In [8]:
from pyspark.sql.functions import lit

red_df = red_df.withColumn("wine_type", lit("red"))
white_df = white_df.withColumn("wine_type", lit("white"))

In [9]:
wine_df = red_df.union(white_df)
wine_df.show(5)

+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+---------+
|fixed acidity|volatile acidity|citric acid|residual sugar|chlorides|free sulfur dioxide|total sulfur dioxide|density|  pH|sulphates|alcohol|quality|wine_type|
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+---------+
|          7.4|             0.7|        0.0|           1.9|    0.076|               11.0|                34.0| 0.9978|3.51|     0.56|    9.4|      5|      red|
|          7.8|            0.88|        0.0|           2.6|    0.098|               25.0|                67.0| 0.9968| 3.2|     0.68|    9.8|      5|      red|
|          7.8|            0.76|       0.04|           2.3|    0.092|               15.0|                54.0|  0.997|3.26|     0.65|    9.8|      5|      red|
|         11.2|            0.28|       0

Basic DataFrame operations:

1. Counting rows:

In [10]:
wine_df.count()

6497

2. Summary statistics

In [11]:
wine_df.describe().show()

25/11/20 16:27:47 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+-------+-----------------+-------------------+-------------------+------------------+-------------------+-------------------+--------------------+--------------------+-------------------+-------------------+------------------+------------------+---------+
|summary|    fixed acidity|   volatile acidity|        citric acid|    residual sugar|          chlorides|free sulfur dioxide|total sulfur dioxide|             density|                 pH|          sulphates|           alcohol|           quality|wine_type|
+-------+-----------------+-------------------+-------------------+------------------+-------------------+-------------------+--------------------+--------------------+-------------------+-------------------+------------------+------------------+---------+
|  count|             6497|               6497|               6497|              6497|               6497|               6497|                6497|                6497|               6497|               6497|              6497|  

3. Checking for Null values

In [13]:
from pyspark.sql.functions import col, when, count, isnan

# NaN can only occur in numeric columns, so split the columns by type first
numeric_types = ("double", "int", "float")
numeric_cols = [c for c, t in wine_df.dtypes if t in numeric_types]
string_cols = [c for c, t in wine_df.dtypes if t not in numeric_types]

missing = wine_df.select(
    # numeric columns — check NULL or NaN
    *[
        count(when(col(c).isNull() | isnan(c), c)).alias(c)
        for c in numeric_cols
    ],
    # string columns — only check NULL
    *[
        count(when(col(c).isNull(), c)).alias(c)
        for c in string_cols
    ]
)

missing.show()


+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+---+---------+-------+-------+---------+
|fixed acidity|volatile acidity|citric acid|residual sugar|chlorides|free sulfur dioxide|total sulfur dioxide|density| pH|sulphates|alcohol|quality|wine_type|
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+---+---------+-------+-------+---------+
|            0|               0|          0|             0|        0|                  0|                   0|      0|  0|        0|      0|      0|        0|
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+---+---------+-------+-------+---------+



**SparkSQL**

In [14]:
wine_df.createOrReplaceTempView("wines")

1. Average alcohol content by wine type

spark.sql("""
    SELECT wine_type, AVG(alcohol) AS avg_alcohol
    FROM wines
    GROUP BY wine_type
""").show()


2. Count wines by quality rating

In [18]:
spark.sql("""
    SELECT quality, COUNT(*) AS count
    FROM wines
    GROUP BY quality
    ORDER BY quality
""").show()

+-------+-----+
|quality|count|
+-------+-----+
|      3|   30|
|      4|  216|
|      5| 2138|
|      6| 2836|
|      7| 1079|
|      8|  193|
|      9|    5|
+-------+-----+



3. Highest alcohol red wines

In [19]:
spark.sql("""
    SELECT *
    FROM wines
    WHERE wine_type = 'red'
    ORDER BY alcohol DESC
    LIMIT 5
""").show()

+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+---------+
|fixed acidity|volatile acidity|citric acid|residual sugar|chlorides|free sulfur dioxide|total sulfur dioxide|density|  pH|sulphates|alcohol|quality|wine_type|
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+---------+
|         15.9|            0.36|       0.65|           7.5|    0.096|               22.0|                71.0| 0.9976|2.98|     0.84|   14.9|      5|      red|
|          5.2|            0.34|        0.0|           1.8|     0.05|               27.0|                63.0| 0.9916|3.68|     0.79|   14.0|      6|      red|
|          5.0|            0.42|       0.24|           2.0|     0.06|               19.0|                50.0| 0.9917|3.72|     0.74|   14.0|      8|      red|
|          4.9|            0.42|        

4. Correlation: alcohol vs quality

In [None]:
spark.sql("""
    SELECT corr(alcohol, quality) AS alcohol_quality_corr
    FROM wines
""").show()
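
Spark SQL's `corr` returns the Pearson correlation coefficient. As a rough illustration of what that number measures, here is a plain-Python sketch of the formula on a handful of made-up values. The `pearson` helper and the sample lists are hypothetical and only meant to show the arithmetic, not to reproduce the result for the wine data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values, not taken from the wine datasets
alcohol = [9.0, 10.0, 11.0, 12.0]
quality = [5, 5, 6, 7]

r = pearson(alcohol, quality)
print(round(r, 3))
```

A value near 1 means the two columns tend to rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.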