# Ex2 - Getting and Knowing your Data

Check out the [Chipotle Exercises Video Tutorial](https://www.youtube.com/watch?v=lpuYZ5EUyS8&list=PLgJhDSE2ZLxaY_DigHeiIDC1cD09rXgJv&index=2) to watch a data scientist go through the exercises.

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
from pyspark.sql import SparkSession

from pyspark import SparkFiles
from pyspark.sql import types as T
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
spark



### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv). 

### Step 3. Assign it to a variable called chipo.

In [2]:
data_path = (
    "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv"
)

spark.sparkContext.addFile(data_path)

In [3]:
chipo = spark.read.csv(
    f"file:///{SparkFiles.get('chipotle.tsv')}",
    inferSchema=True,
    header=True,
    sep="\t",
    nullValue="NULL",
    # schema=T.StructType(
    #     [
    #         T.StructField("order_id", T.IntegerType()),
    #         T.StructField("quantity", T.IntegerType()),
    #         T.StructField("item_name", T.StringType()),
    #         T.StructField("choice_description", T.ArrayType(T.StringType())),
    #         T.StructField("item_price", T.FloatType()),
    #     ]
    # ),
)
print(f"{chipo.count() = }")
chipo.limit(5).show()

chipo.count() = 4622
+--------+--------+--------------------+--------------------+----------+
|order_id|quantity|           item_name|  choice_description|item_price|
+--------+--------+--------------------+--------------------+----------+
|       1|       1|Chips and Fresh T...|                NULL|    $2.39 |
|       1|       1|                Izze|        [Clementine]|    $3.39 |
|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 |
|       1|       1|Chips and Tomatil...|                NULL|    $2.39 |
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |
+--------+--------+--------------------+--------------------+----------+
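The reader options used above (`header`, `sep`, `nullValue`) can be checked without Spark on a small inline sample. The sketch below uses Python's standard `csv` module. The helper name `read_tsv` is illustrative, the two sample rows are copied from the preview output, and mapping the literal string `NULL` to `None` is meant to mirror what `nullValue="NULL"` asks the Spark reader to do.

```python
import csv
import io

# Two rows copied from the preview above, in the file's tab-separated layout.
SAMPLE_TSV = (
    "order_id\tquantity\titem_name\tchoice_description\titem_price\n"
    "1\t1\tChips and Fresh Tomato Salsa\tNULL\t$2.39 \n"
    "1\t1\tIzze\t[Clementine]\t$3.39 \n"
)


def read_tsv(text, null_value="NULL"):
    """Parse tab-separated text with a header row, mapping null_value to None."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {key: (None if value == null_value else value) for key, value in row.items()}
        for row in reader
    ]


records = read_tsv(SAMPLE_TSV)
print(records[0]["choice_description"])  # None
print(records[1]["choice_description"])  # [Clementine]
```

Note that every value stays a string here, including `item_price` with its `$` prefix and trailing space, which is why the price column needs explicit parsing later on.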



### Step 4. See the first 10 entries

In [4]:
chipo.limit(10).show()

+--------+--------+--------------------+--------------------+----------+
|order_id|quantity|           item_name|  choice_description|item_price|
+--------+--------+--------------------+--------------------+----------+
|       1|       1|Chips and Fresh T...|                NULL|    $2.39 |
|       1|       1|                Izze|        [Clementine]|    $3.39 |
|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 |
|       1|       1|Chips and Tomatil...|                NULL|    $2.39 |
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |
|       3|       1|        Chicken Bowl|[Fresh Tomato Sal...|   $10.98 |
|       3|       1|       Side of Chips|                NULL|    $1.69 |
|       4|       1|       Steak Burrito|[Tomatillo Red Ch...|   $11.75 |
|       4|       1|    Steak Soft Tacos|[Tomatillo Green ...|    $9.25 |
|       5|       1|       Steak Burrito|[Fresh Tomato Sal...|    $9.25 |
+--------+--------+--------------------+--------------------+----------+

### Step 5. What is the number of observations in the dataset?

In [5]:
# Solution 1

chipo.count()

4622

In [6]:
# Solution 2

non_null_counts = chipo.agg(
    *[F.count_if(F.col(c).isNotNull()).alias(c) for c in chipo.columns]
)

spark.createDataFrame(
    [
        (c, non_null_counts.select(c).head(1)[0][c], str(chipo.schema[c].dataType))
        for c in chipo.columns
    ],
    schema=["Field", "Non-Null", "Data Type"],
).show()

                                                                                

+------------------+--------+-------------+
|             Field|Non-Null|    Data Type|
+------------------+--------+-------------+
|          order_id|    4622|IntegerType()|
|          quantity|    4622|IntegerType()|
|         item_name|    4622| StringType()|
|choice_description|    3376| StringType()|
|        item_price|    4622| StringType()|
+------------------+--------+-------------+



### Step 6. What is the number of columns in the dataset?

In [7]:
len(chipo.columns)

5

### Step 7. Print the name of all the columns.

In [8]:
chipo.columns

['order_id', 'quantity', 'item_name', 'choice_description', 'item_price']

### Step 8. How is the dataset indexed?

In [9]:
# PySpark DataFrames do not have an index; rows have no guaranteed order unless you sort them.

### Step 9. Which was the most-ordered item? 

In [10]:
(
    chipo.groupBy("item_name")
    .agg(F.sum("quantity").alias("quantity"))
    .orderBy(F.desc("quantity"))
    .limit(1)
).show(truncate=False)

+------------+--------+
|item_name   |quantity|
+------------+--------+
|Chicken Bowl|761     |
+------------+--------+



### Step 10. For the most-ordered item, how many items were ordered?

In [11]:
# Already answered in Step 9: 761 Chicken Bowls were ordered.

### Step 11. What was the most ordered item in the choice_description column?

In [12]:
chipo.limit(3).show(truncate=False)

+--------+--------+----------------------------+------------------+----------+
|order_id|quantity|item_name                   |choice_description|item_price|
+--------+--------+----------------------------+------------------+----------+
|1       |1       |Chips and Fresh Tomato Salsa|NULL              |$2.39     |
|1       |1       |Izze                        |[Clementine]      |$3.39     |
|1       |1       |Nantucket Nectar            |[Apple]           |$3.39     |
+--------+--------+----------------------------+------------------+----------+



In [13]:
chipo.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- item_name: string (nullable = true)
 |-- choice_description: string (nullable = true)
 |-- item_price: string (nullable = true)



In [14]:
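# F.count tallies rows (order lines) per description and skips NULLs;
# it does not weight each row by its quantity.
(
    chipo.groupBy("choice_description")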
    .agg(F.count("choice_description").alias("count"))
    .orderBy(F.desc("count"))
).show(1, truncate=False)

+------------------+-----+
|choice_description|count|
+------------------+-----+
|[Diet Coke]       |134  |
+------------------+-----+
only showing top 1 row
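Counting rows per `choice_description` can differ from counting units ordered whenever a row has a `quantity` greater than 1. A minimal plain-Python sketch of the two tallies follows; the rows are made up for illustration and are not dataset values.

```python
from collections import Counter

# Hypothetical (choice_description, quantity) rows; None stands for a NULL description.
rows = [
    ("[Diet Coke]", 1),
    ("[Diet Coke]", 2),
    ("[Sprite]", 1),
    (None, 4),
]

row_counts = Counter()   # analogous to F.count("choice_description")
unit_totals = Counter()  # analogous to F.sum("quantity")
for description, quantity in rows:
    if description is None:
        continue  # skip NULL descriptions so they cannot win the ranking
    row_counts[description] += 1
    unit_totals[description] += quantity

print(row_counts.most_common(1))   # [('[Diet Coke]', 2)]
print(unit_totals.most_common(1))  # [('[Diet Coke]', 3)]
```

In the Spark version, ranking by summed `quantity` would likewise need a filter on non-null `choice_description`, because `groupBy` keeps NULL as its own group.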



### Step 12. How many items were ordered in total?

In [15]:
chipo.limit(3).show()

+--------+--------+--------------------+------------------+----------+
|order_id|quantity|           item_name|choice_description|item_price|
+--------+--------+--------------------+------------------+----------+
|       1|       1|Chips and Fresh T...|              NULL|    $2.39 |
|       1|       1|                Izze|      [Clementine]|    $3.39 |
|       1|       1|    Nantucket Nectar|           [Apple]|    $3.39 |
+--------+--------+--------------------+------------------+----------+



In [16]:
chipo.agg(F.sum("quantity").alias("total_items")).show()

+-----------+
|total_items|
+-----------+
|       4972|
+-----------+



### Step 13. Turn the item price into a float

#### Step 13.a. Check the item price type

In [17]:
chipo.schema["item_price"].dataType

StringType()

#### Step 13.b. Create a lambda function and change the type of item price

In [18]:
chipo = chipo.withColumn(
    "price", F.regexp_extract("item_price", r"\$(\d+\.\d+)", 1).cast(T.FloatType())
)
chipo.show(5)

+--------+--------+--------------------+--------------------+----------+-----+
|order_id|quantity|           item_name|  choice_description|item_price|price|
+--------+--------+--------------------+--------------------+----------+-----+
|       1|       1|Chips and Fresh T...|                NULL|    $2.39 | 2.39|
|       1|       1|                Izze|        [Clementine]|    $3.39 | 3.39|
|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 | 3.39|
|       1|       1|Chips and Tomatil...|                NULL|    $2.39 | 2.39|
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |16.98|
+--------+--------+--------------------+--------------------+----------+-----+
only showing top 5 rows
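The pattern passed to `F.regexp_extract` can be tried on plain strings first. This sketch applies the same pattern with Python's `re` module. The helper `parse_price` is illustrative, and returning a `Decimal` instead of a float is a choice made for this sketch only (the notebook casts to `FloatType`).

```python
import re
from decimal import Decimal

PRICE_PATTERN = re.compile(r"\$(\d+\.\d+)")  # same pattern as in the cell above


def parse_price(text):
    """Return the dollar amount found in text as a Decimal, or None if nothing matches."""
    match = PRICE_PATTERN.search(text)
    return Decimal(match.group(1)) if match else None


print(parse_price("$2.39 "))   # 2.39
print(parse_price("$16.98 "))  # 16.98
print(parse_price("free"))     # None
```

The pattern requires a decimal point, so a value such as `$5` would not match; that is worth checking before relying on it for other data.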



#### Step 13.c. Check the item price type

In [19]:
chipo.schema["price"]

StructField('price', FloatType(), True)

### Step 14. How much was the revenue for the period in the dataset?

In [20]:
chipo.show(3)

+--------+--------+--------------------+------------------+----------+-----+
|order_id|quantity|           item_name|choice_description|item_price|price|
+--------+--------+--------------------+------------------+----------+-----+
|       1|       1|Chips and Fresh T...|              NULL|    $2.39 | 2.39|
|       1|       1|                Izze|      [Clementine]|    $3.39 | 3.39|
|       1|       1|    Nantucket Nectar|           [Apple]|    $3.39 | 3.39|
+--------+--------+--------------------+------------------+----------+-----+
only showing top 3 rows



In [21]:
chipo.agg(
    F.concat(
        F.lit("$"),
        F.round(F.sum(F.col("price") * F.col("quantity")), 2).cast("STRING"),
    ).alias("revenue")
).show()

+---------+
|  revenue|
+---------+
|$39237.02|
+---------+



### Step 15. How many orders were made in the period?

In [22]:
chipo.show(1)

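# order_id repeats because each row is one line item within an order,
# so the number of orders is the number of distinct order_id values.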
# (
#     chipo.groupBy("order_id")
#     .agg(F.count("order_id").alias("order_count"))
#     .orderBy(F.desc("order_count"))
# ).show()

chipo.select(F.count_if(F.isnull("order_id")).alias("count_null_order_id")).show()

chipo.select(F.count_distinct("order_id").alias("order_count")).show()

+--------+--------+--------------------+------------------+----------+-----+
|order_id|quantity|           item_name|choice_description|item_price|price|
+--------+--------+--------------------+------------------+----------+-----+
|       1|       1|Chips and Fresh T...|              NULL|    $2.39 | 2.39|
+--------+--------+--------------------+------------------+----------+-----+
only showing top 1 row

+-------------------+
|count_null_order_id|
+-------------------+
|                  0|
+-------------------+

+-----------+
|order_count|
+-----------+
|       1834|
+-----------+



### Step 16. What is the average revenue amount per order?

In [23]:
chipo.show(1)

(
    chipo.groupBy("order_id")
    .agg(F.sum((F.col("quantity") * F.col("price"))).alias("revenue_per_order"))
    .select(F.round(F.avg("revenue_per_order"), 2))
).show()

+--------+--------+--------------------+------------------+----------+-----+
|order_id|quantity|           item_name|choice_description|item_price|price|
+--------+--------+--------------------+------------------+----------+-----+
|       1|       1|Chips and Fresh T...|              NULL|    $2.39 | 2.39|
+--------+--------+--------------------+------------------+----------+-----+
only showing top 1 row

+--------------------------------+
|round(avg(revenue_per_order), 2)|
+--------------------------------+
|                           21.39|
+--------------------------------+
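Because every order contributes exactly one per-order sum, the mean of those sums equals total revenue divided by the number of distinct orders. That gives a quick cross-check using the figures reported in Steps 14 and 15:

```python
# Figures reported earlier in this notebook.
total_revenue = 39237.02  # Step 14
order_count = 1834        # Step 15

# Mean of per-order sums == total revenue / number of distinct orders.
average_per_order = round(total_revenue / order_count, 2)
print(average_per_order)  # 21.39
```

The result agrees with the grouped aggregation above.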



### Step 17. How many different items are sold?

In [25]:
chipo.show(1)

chipo.select(F.count_distinct("item_name").alias("item_count")).show()

+--------+--------+--------------------+------------------+----------+-----+
|order_id|quantity|           item_name|choice_description|item_price|price|
+--------+--------+--------------------+------------------+----------+-----+
|       1|       1|Chips and Fresh T...|              NULL|    $2.39 | 2.39|
+--------+--------+--------------------+------------------+----------+-----+
only showing top 1 row

+----------+
|item_count|
+----------+
|        50|
+----------+

