# EDA Operations With _DataFrame_

## Overview

In this section we look at several operations we can apply to a _DataFrame_. Specifically, we perform a simple
exploratory data analysis (EDA) on the [Iris dataset](https://archive.ics.uci.edu/dataset/53/iris).

## EDA on the Iris Dataset

The following script shows how to

- rename columns in a _DataFrame_
- group rows by the values of a column
- compute aggregates such as the mean and variance
- select the values of a column that meet a certain condition


```python
from pathlib import Path

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import col

DATA_PATH = Path("/home/alex/qi3/qi3_notes/computational_mathematics/src/datasets/iris.data")
APP_NAME = "EDA Iris data"

if __name__ == '__main__':

    # get a spark session
    spark = SparkSession.builder.appName(APP_NAME).getOrCreate()

    schema = ("`sepal-length` FLOAT, `sepal-width` FLOAT, "
              "`petal-length` FLOAT, `petal-width` FLOAT, `class` STRING")

    # load the data
    df = (spark.read.format("csv")
              .option("header", False)
              .option("inferSchema", False)
              .option("delimiter", ",")
              .schema(schema)
              .load(str(DATA_PATH)))

    df.show(n=10)
    df.printSchema()

    # we want to rename the column class to class-idx
    new_df = df.withColumnRenamed("class", "class-idx")

    new_df.printSchema()

    # compute basic statistics
    # how many observations per class
    # (show() prints the table and returns None, so its result is not assigned)
    (new_df.select("class-idx")
     .groupBy("class-idx")
     .count()
     .orderBy("count")
     .show())

    # compute the mean of each feature per class
    (new_df.groupBy("class-idx").agg(
        F.mean("sepal-length"),
        F.mean("sepal-width"),
        F.mean("petal-length"),
        F.mean("petal-width"),
    )).show()

    # compute the maximum of each feature per class
    (new_df.groupBy("class-idx").agg(
        F.max("sepal-length"),
        F.max("sepal-width"),
        F.max("petal-length"),
        F.max("petal-width"),
    )).show()

    # compute the minimum of each feature per class
    (new_df.groupBy("class-idx").agg(
        F.min("sepal-length"),
        F.min("sepal-width"),
        F.min("petal-length"),
        F.min("petal-width"),
    )).show()

    # compute the population variance of each feature per class
    # (var_pop divides by N rather than N - 1)
    (new_df.groupBy("class-idx").agg(
        F.var_pop("sepal-length"),
        F.var_pop("sepal-width"),
        F.var_pop("petal-length"),
        F.var_pop("petal-width"),
    )).show()
    
    # select the sepal-length values that are greater than 6.58,
    # a hard-coded threshold close to the Iris-virginica mean sepal length
    (new_df.select("sepal-length")
     .where(col("sepal-length") > 6.58)
     .show())

    spark.stop()
```

Running the script prints output similar to the following:

```
+------------+-----------+------------+-----------+-----------+
|sepal-length|sepal-width|petal-length|petal-width|      class|
+------------+-----------+------------+-----------+-----------+
|         5.1|        3.5|         1.4|        0.2|Iris-setosa|
|         4.9|        3.0|         1.4|        0.2|Iris-setosa|
|         4.7|        3.2|         1.3|        0.2|Iris-setosa|
|         4.6|        3.1|         1.5|        0.2|Iris-setosa|
|         5.0|        3.6|         1.4|        0.2|Iris-setosa|
|         5.4|        3.9|         1.7|        0.4|Iris-setosa|
|         4.6|        3.4|         1.4|        0.3|Iris-setosa|
|         5.0|        3.4|         1.5|        0.2|Iris-setosa|
|         4.4|        2.9|         1.4|        0.2|Iris-setosa|
|         4.9|        3.1|         1.5|        0.1|Iris-setosa|
+------------+-----------+------------+-----------+-----------+
only showing top 10 rows

root
 |-- sepal-length: float (nullable = true)
 |-- sepal-width: float (nullable = true)
 |-- petal-length: float (nullable = true)
 |-- petal-width: float (nullable = true)
 |-- class: string (nullable = true)

root
 |-- sepal-length: float (nullable = true)
 |-- sepal-width: float (nullable = true)
 |-- petal-length: float (nullable = true)
 |-- petal-width: float (nullable = true)
 |-- class-idx: string (nullable = true)

+---------------+-----+
|      class-idx|count|
+---------------+-----+
| Iris-virginica|   50|
|    Iris-setosa|   50|
|Iris-versicolor|   50|
+---------------+-----+

+---------------+-----------------+------------------+------------------+-------------------+
|      class-idx|avg(sepal-length)|  avg(sepal-width)| avg(petal-length)|   avg(petal-width)|
+---------------+-----------------+------------------+------------------+-------------------+
| Iris-virginica|6.588000001907349|2.9739999914169313| 5.551999988555909| 2.0259999775886537|
|    Iris-setosa|5.006000003814697|  3.41800000667572|1.4639999961853027|0.24400000482797624|
|Iris-versicolor|5.935999975204468| 2.770000009536743| 4.259999980926514| 1.3259999918937684|
+---------------+-----------------+------------------+------------------+-------------------+

+---------------+-----------------+----------------+-----------------+----------------+
|      class-idx|max(sepal-length)|max(sepal-width)|max(petal-length)|max(petal-width)|
+---------------+-----------------+----------------+-----------------+----------------+
| Iris-virginica|              7.9|             3.8|              6.9|             2.5|
|    Iris-setosa|              5.8|             4.4|              1.9|             0.6|
|Iris-versicolor|              7.0|             3.4|              5.1|             1.8|
+---------------+-----------------+----------------+-----------------+----------------+

+---------------+-----------------+----------------+-----------------+----------------+
|      class-idx|min(sepal-length)|min(sepal-width)|min(petal-length)|min(petal-width)|
+---------------+-----------------+----------------+-----------------+----------------+
| Iris-virginica|              4.9|             2.2|              4.5|             1.4|
|    Iris-setosa|              4.3|             2.3|              1.0|             0.1|
|Iris-versicolor|              4.9|             2.0|              3.0|             1.0|
+---------------+-----------------+----------------+-----------------+----------------+

+---------------+---------------------+--------------------+---------------------+--------------------+
|      class-idx|var_pop(sepal-length)|var_pop(sepal-width)|var_pop(petal-length)|var_pop(petal-width)|
+---------------+---------------------+--------------------+---------------------+--------------------+
| Iris-virginica|  0.39625593862917335| 0.10192399745559985|  0.29849598364259367| 0.07392400431251753|
|    Iris-setosa|  0.12176398698426805| 0.14227600014114894| 0.029504003444672463|0.011264000588417073|
|Iris-versicolor|   0.2611040078888099| 0.09650000076294335|  0.21639999752045924| 0.03832399889564613|
+---------------+---------------------+--------------------+---------------------+--------------------+


+------------+
|sepal-length|
+------------+
|         7.0|
|         6.9|
|         6.6|
|         6.7|
|         6.6|
|         6.8|
|         6.7|
|         6.7|
|         7.1|
|         7.6|
|         7.3|
|         6.7|
|         7.2|
|         6.8|
|         7.7|
|         7.7|
|         6.9|
|         7.7|
|         6.7|
|         7.2|
+------------+
only showing top 20 rows


```
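
The long decimals in the averages table, such as 6.588000001907349, are most likely a consequence of the schema declaring the measurements as `FLOAT`, which Spark stores in 32-bit single precision, while `avg` returns a 64-bit double. The plain-Python sketch below round-trips a few values through 32 bits to show the size of the effect. The helper `to_float32` is a hypothetical name used for illustration, not a Spark API.

```python
import struct


def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", value))[0]


if __name__ == "__main__":
    # Values taken from the first rows of the dataset shown above.
    for x in (5.1, 4.9, 4.7):
        stored = to_float32(x)
        print(f"{x} is stored as {stored!r} (error {abs(stored - x):.2e})")
```

Declaring the columns as `DOUBLE` in the schema would be one way to reduce this noise, at the cost of more memory per value.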

## Summary

This section used a short PySpark script to load the Iris dataset with an explicit schema, rename a column with `withColumnRenamed`, count the observations per class with `groupBy` and `count`, compute per-class aggregates (mean, maximum, minimum and population variance) with `agg`, and filter a column with `where`. The output shows that the dataset is balanced, with 50 observations in each of the three classes.

## References

1. Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee, _Learning Spark: Lightning-Fast Data Analytics_, 2nd Edition, O'Reilly.