# Ex2 - Getting and Knowing your Data

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [2]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz

In [3]:
!tar xf spark-3.3.0-bin-hadoop3.tgz

In [4]:
!pip install -q findspark

In [5]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.0-bin-hadoop3"

In [6]:
import findspark
findspark.init()

import pandas as pd

In [7]:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv). 

In [8]:
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv"


### Step 3. Assign it to a variable called chipotle.

In [59]:
from pyspark import SparkFiles

# Fetch the file into Spark's working directory, then read it as tab-separated text.
spark.sparkContext.addFile(url)
chipotle = spark.read.csv(SparkFiles.get("chipotle.tsv"), header=True, sep="\t")

### Step 4. See the first 10 entries

In [10]:
chipotle.show(10)

+--------+--------+--------------------+--------------------+----------+
|order_id|quantity|           item_name|  choice_description|item_price|
+--------+--------+--------------------+--------------------+----------+
|       1|       1|Chips and Fresh T...|                NULL|    $2.39 |
|       1|       1|                Izze|        [Clementine]|    $3.39 |
|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 |
|       1|       1|Chips and Tomatil...|                NULL|    $2.39 |
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |
|       3|       1|        Chicken Bowl|[Fresh Tomato Sal...|   $10.98 |
|       3|       1|       Side of Chips|                NULL|    $1.69 |
|       4|       1|       Steak Burrito|[Tomatillo Red Ch...|   $11.75 |
|       4|       1|    Steak Soft Tacos|[Tomatillo Green ...|    $9.25 |
|       5|       1|       Steak Burrito|[Fresh Tomato Sal...|    $9.25 |
+--------+--------+--------------------+--------------------+----------+

### Step 5. What is the number of observations in the dataset?

In [11]:
# Solution 1
chipotle.count()


4622

In [12]:
# Solution 2

chipotle.summary().show()

+-------+-----------------+------------------+-----------------+--------------------+----------+
|summary|         order_id|          quantity|        item_name|  choice_description|item_price|
+-------+-----------------+------------------+-----------------+--------------------+----------+
|  count|             4622|              4622|             4622|                4622|      4622|
|   mean|927.2548680225011|1.0757247944612722|             null|                null|      null|
| stddev|528.8907955866096|0.4101863342575333|             null|                null|      null|
|    min|                1|                 1|6 Pack Soft Drink|                NULL|    $1.09 |
|    25%|            477.0|               1.0|             null|                null|      null|
|    50%|            926.0|               1.0|             null|                null|      null|
|    75%|           1393.0|               1.0|             null|                null|      null|
|    max|              999|   

### Step 6. What is the number of columns in the dataset?

In [13]:
len(chipotle.columns)

5

### Step 7. Print the name of all the columns.

In [14]:
chipotle.columns

['order_id', 'quantity', 'item_name', 'choice_description', 'item_price']

### Step 8. How is the dataset indexed?

### Step 9. Which was the most-ordered item? 

In [15]:
from pyspark.sql.functions import col
from pyspark.sql.types import FloatType

In [16]:
chipotle = chipotle.withColumn("quantity", col("quantity").cast(FloatType()))


In [17]:
chipotle.printSchema()

root
 |-- order_id: string (nullable = true)
 |-- quantity: float (nullable = true)
 |-- item_name: string (nullable = true)
 |-- choice_description: string (nullable = true)
 |-- item_price: string (nullable = true)



In [22]:
chipotle.groupby("item_name").sum("quantity").sort("sum(quantity)", ascending=False).show()

+--------------------+-------------+
|           item_name|sum(quantity)|
+--------------------+-------------+
|        Chicken Bowl|        761.0|
|     Chicken Burrito|        591.0|
| Chips and Guacamole|        506.0|
|       Steak Burrito|        386.0|
|   Canned Soft Drink|        351.0|
|               Chips|        230.0|
|          Steak Bowl|        221.0|
|       Bottled Water|        211.0|
|Chips and Fresh T...|        130.0|
|         Canned Soda|        126.0|
|  Chicken Salad Bowl|        123.0|
|  Chicken Soft Tacos|        120.0|
|       Side of Chips|        110.0|
|      Veggie Burrito|         97.0|
|    Barbacoa Burrito|         91.0|
|         Veggie Bowl|         87.0|
|       Carnitas Bowl|         71.0|
|       Barbacoa Bowl|         66.0|
|    Carnitas Burrito|         60.0|
|    Steak Soft Tacos|         56.0|
+--------------------+-------------+
only showing top 20 rows



### Step 10. For the most-ordered item, how many items were ordered?

The Step 9 output already answers this: Chicken Bowl, the most-ordered item, has a summed quantity of 761.

### Step 11. What was the most ordered item in the choice_description column?

In [25]:
chipotle.groupby("choice_description").sum("quantity").sort("sum(quantity)", ascending=False).show()

+--------------------+-------------+
|  choice_description|sum(quantity)|
+--------------------+-------------+
|                NULL|       1382.0|
|         [Diet Coke]|        159.0|
|              [Coke]|        143.0|
|            [Sprite]|         89.0|
|[Fresh Tomato Sal...|         49.0|
|[Fresh Tomato Sal...|         42.0|
|[Fresh Tomato Sal...|         40.0|
|          [Lemonade]|         36.0|
|[Fresh Tomato Sal...|         36.0|
|         [Coca Cola]|         32.0|
|[Fresh Tomato Sal...|         30.0|
|[Fresh Tomato Sal...|         30.0|
|[Fresh Tomato Sal...|         26.0|
|[Fresh Tomato Sal...|         25.0|
|[Fresh Tomato Sal...|         24.0|
|[Fresh Tomato Sal...|         23.0|
|[Fresh Tomato Sal...|         22.0|
|[Fresh Tomato Sal...|         22.0|
|[Fresh Tomato Sal...|         21.0|
|[Fresh Tomato Sal...|         21.0|
+--------------------+-------------+
only showing top 20 rows

The top row, NULL, is the placeholder for items that have no choice description. Ignoring it, the most-ordered choice is [Diet Coke], with 159.

### Step 12. How many items were ordered in total?

In [32]:
chipotle.agg({"quantity": "sum"}).show()

+-------------+
|sum(quantity)|
+-------------+
|       4972.0|
+-------------+



### Step 13. Turn the item price into a float

#### Step 13.a. Check the item price type

In [34]:
chipotle.describe()

DataFrame[summary: string, order_id: string, quantity: string, item_name: string, choice_description: string, item_price: string]

#### Step 13.b. Create a lambda function and change the type of item price

In [54]:
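import pyspark.sql.functions as F

def subs_index(s):
    # Drop the leading "$" and the trailing space, then parse the number.
    return float(s[1:-1])

# FloatType was imported in Step 9.
udf_substring = F.udf(lambda s: subs_index(s), FloatType())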

In [60]:
chipotle = chipotle.withColumn("item_price", udf_substring(col("item_price")))

In [63]:
chipotle.show()

+--------+--------+--------------------+--------------------+----------+
|order_id|quantity|           item_name|  choice_description|item_price|
+--------+--------+--------------------+--------------------+----------+
|       1|       1|Chips and Fresh T...|                NULL|      2.39|
|       1|       1|                Izze|        [Clementine]|      3.39|
|       1|       1|    Nantucket Nectar|             [Apple]|      3.39|
|       1|       1|Chips and Tomatil...|                NULL|      2.39|
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|     16.98|
|       3|       1|        Chicken Bowl|[Fresh Tomato Sal...|     10.98|
|       3|       1|       Side of Chips|                NULL|      1.69|
|       4|       1|       Steak Burrito|[Tomatillo Red Ch...|     11.75|
|       4|       1|    Steak Soft Tacos|[Tomatillo Green ...|      9.25|
|       5|       1|       Steak Burrito|[Fresh Tomato Sal...|      9.25|
|       5|       1| Chips and Guacamole|           

#### Step 13.c. Check the item price type

In [64]:
chipotle.describe()

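DataFrame[summary: string, order_id: string, quantity: string, item_name: string, choice_description: string, item_price: string]

Note: `describe()` returns a summary DataFrame whose columns are all strings, so this output does not reflect the conversion. `chipotle.printSchema()` or `chipotle.dtypes` reports the actual column types.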

### Step 14. How much was the revenue for the period in the dataset?

In [66]:
chipotle.withColumn("revenue", col("item_price") * col("quantity")).agg({"revenue": "sum"}).show()

+-----------------+
|     sum(revenue)|
+-----------------+
|39237.01973223686|
+-----------------+



In [73]:
chipotle.withColumn("revenue", col("item_price") * col("quantity")).agg({"revenue": "sum"}).collect()[0][0]

39237.01973223686

### Step 15. How many orders were made in the period?

In [70]:
chipotle.select("order_id").distinct().count()

1834

### Step 16. What is the average revenue amount per order?

In [74]:
# Part 1: add a per-row revenue column
chipotle = chipotle.withColumn("revenue", col("item_price") * col("quantity"))
chipotle

DataFrame[order_id: string, quantity: string, item_name: string, choice_description: string, item_price: float, revenue: double]

In [91]:
# Part 2: sum revenue per order, then average the per-order totals
chipotle.groupby("order_id").agg(F.sum("revenue").alias("sum_revenue")).agg({"sum_revenue": "avg"}).collect()[0][0]

21.39423104265914

### Step 17. How many different items are sold?

In [92]:
chipotle.select("item_name").distinct().count()

50