## Overview

This notebook shows how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) (Databricks File System) lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. You can switch languages in an individual cell with the `%LANGUAGE` syntax. Python, Scala, SQL, and R are all supported.

In [0]:
# File location and type
file_location = "/FileStore/tables/chipotle-1.tsv"
file_type = "csv"

# CSV options: keep every column as a string, treat the first row as the header, split on tabs
infer_schema = "false"
first_row_is_header = "true"
delimiter = "\t"

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

#display(df)
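
The `sep` and `header` options correspond to ordinary delimited-text parsing. As a rough illustration outside Spark, the sketch below parses a tab-separated sample with Python's standard `csv` module. The inline string `sample_tsv` is an assumption standing in for the real file, and its single data row is the order line that appears later in this notebook. It also shows why, without schema inference, every field arrives as text.

```python
import csv
import io

# One header line plus one data row, tab-separated. The inline string stands in
# for the real file (an assumption so the sketch runs without DBFS or Spark).
sample_tsv = (
    "order_id\tquantity\titem_name\tchoice_description\titem_price\n"
    "1443\t15\tChips and Fresh Tomato Salsa\t\t$44.25 \n"
)

# delimiter="\t" plays the role of option("sep", "\t"); DictReader takes the
# column names from the first line, much like option("header", "True").
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
rows = list(reader)

# With no schema inference, every field is plain text.
print(rows[0]["quantity"], type(rows[0]["quantity"]).__name__)
```

This is only an analogy for the reader options, not how Spark parses the file internally.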

In [0]:
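# Import only what is used; a wildcard import would shadow Python builtins such as max and sum
from pyspark.sql.functions import col, udf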

In [0]:
df.printSchema()

root
 |-- order_id: string (nullable = true)
 |-- quantity: string (nullable = true)
 |-- item_name: string (nullable = true)
 |-- choice_description: string (nullable = true)
 |-- item_price: string (nullable = true)



In [0]:
# item_price is a string such as "$9.25 "; drop the leading "$" so it can be cast to a number
def strip_col(c):
    return c[1:]

strip_col_udf = udf(strip_col)
df = df.withColumn("price", strip_col_udf(col("item_price")))
df = df.withColumn("price", col("price").cast("float"))
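
The slicing above assumes every value is non-null and starts with `$`. A slightly more defensive version of the same idea, written as plain Python so it can be tried without a Spark session, might look like the sketch below. The name `parse_price` is hypothetical and not part of the original notebook.

```python
from typing import Optional


def parse_price(raw: Optional[str]) -> Optional[float]:
    """Turn a price string such as "$9.25 " into a float, or None if it is not a price."""
    if raw is None:
        return None
    # Remove surrounding whitespace first, then a single leading dollar sign.
    cleaned = raw.strip()
    if cleaned.startswith("$"):
        cleaned = cleaned[1:]
    try:
        return float(cleaned)
    except ValueError:
        # Empty or non-numeric text: report "no value" instead of raising.
        return None


print(parse_price("$44.25 "))
print(parse_price(None))
```

In the notebook, a function like this could presumably be wrapped with `udf(parse_price, FloatType())`, with `FloatType` imported from `pyspark.sql.types`, so that malformed rows become nulls instead of failing the job.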

In [0]:
#df.display()

Q1. How many products cost more than $10.00?

In [0]:
df1 = df.filter(col("price")>10.00)
df1.select('item_name').distinct().count()

Out[65]: 31

Q2. What is the price of each item?
Print a DataFrame with only two columns, `item_name` and `item_price`.

In [0]:
# count() adds a third column: the number of order lines for each (item_name, item_price) pair
df.groupby(col("item_name"), col("item_price")).count().show()


+--------------------+----------+-----+
|           item_name|item_price|count|
+--------------------+----------+-----+
|    Steak Soft Tacos|    $9.25 |   31|
|    Barbacoa Burrito|    $9.25 |   46|
|Chips and Mild Fr...|    $3.00 |    1|
|       Carnitas Bowl|   $23.50 |    1|
|     Chicken Burrito|   $10.58 |    4|
|        Chicken Bowl|    $8.49 |  104|
|  Chicken Salad Bowl|   $17.50 |    3|
|       Bottled Water|    $3.00 |    9|
|        Chicken Bowl|    $8.75 |  313|
|        Chicken Bowl|   $21.96 |    7|
|    Nantucket Nectar|    $6.78 |    2|
|Chicken Crispy Tacos|   $10.98 |    1|
|Chips and Tomatil...|    $2.39 |   17|
|Chips and Fresh T...|   $44.25 |    1|
|       Bottled Water|    $4.50 |    4|
|      Veggie Burrito|   $33.75 |    1|
|     Chicken Burrito|   $16.38 |    1|
|         Veggie Bowl|    $8.75 |   24|
|Chicken Crispy Tacos|   $11.25 |   14|
|Chips and Tomatil...|    $5.90 |    2|
+--------------------+----------+-----+
only showing top 20 rows



Q3. What was the quantity of the most expensive item ordered?

In [0]:
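# Highest price in the data; .first()[0] reads the value out of the one-row aggregate
# (calling count() here would only count that single aggregate row)
df.groupBy().max('price').first()[0]

Out[67]: 44.25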

In [0]:
# Order line(s) at the maximum price; the quantity column answers Q3
df3=df.filter(col("price")== 44.25)
df3.display()

order_id,quantity,item_name,choice_description,item_price,price
1443,15,Chips and Fresh Tomato Salsa,,$44.25,44.25
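
Q3 is a two-step lookup: find the highest price, then read the quantity on the order line(s) charged at that price. A minimal plain-Python sketch of that logic follows. The first tuple is the order line shown above, the other two are made-up rows for illustration only, and the names `order_lines`, `max_price`, and `quantities` are hypothetical.

```python
# Rows as (item_name, quantity, price). The first row is the order line shown
# above; the other two are made-up values for illustration only.
order_lines = [
    ("Chips and Fresh Tomato Salsa", 15, 44.25),
    ("Hypothetical Item A", 1, 8.75),
    ("Hypothetical Item B", 2, 17.50),
]

# Step 1: the highest price. This is a single number, so counting it says nothing.
max_price = max(price for _, _, price in order_lines)

# Step 2: the quantity on every line charged at that price.
quantities = [qty for _, qty, price in order_lines if price == max_price]

print(max_price, quantities)
```

Collecting a list in step 2 rather than a single value allows for ties, where several order lines share the maximum price.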


Q4. How many times was a Veggie Salad Bowl ordered?

In [0]:
df.filter(col("item_name")=='Veggie Salad Bowl').count()

Out[69]: 18

Q5. How many times did someone order more than one Canned Soda?

In [0]:
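# quantity was loaded as a string, so cast it explicitly before the numeric comparison
df.filter((col("item_name") == 'Canned Soda') & (col('quantity').cast('int') > 1)).count()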

Out[70]: 20