# Ex1 - Filtering and Sorting Data

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

spark = SparkSession.builder.appName("Chipotle").getOrCreate()

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv). 

### Step 3. Assign it to a variable called chipo.

In [2]:
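from pyspark import SparkFiles

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'

# Register the remote file with Spark so it can be read from a local path
spark.sparkContext.addFile(url)

# The file is tab-separated and has a header row
chipo = spark.read.csv(SparkFiles.get('chipotle.tsv'), sep='\t', header=True, inferSchema=True)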

### Step 4. How many products cost more than $10.00?

In [4]:
def make_float(string):
    # Strip the dollar sign, e.g. '$2.39' -> 2.39
    string = string.replace('$', '')
    return float(string)

# Wrap the parser in a UDF that returns a 32-bit float column
spark_udf = F.udf(make_float, T.FloatType())
chipo = chipo.withColumn("item_price", spark_udf(F.col("item_price")))
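
The parser above assumes every `item_price` value is a non-null string. As a minimal sketch, a more defensive variant could return `None` for missing values, so that a UDF built from it would yield null instead of raising. The name `parse_price` is hypothetical and not part of the original notebook.

```python
from typing import Optional


def parse_price(raw: Optional[str]) -> Optional[float]:
    """Convert a price string such as '$2.39' to a float.

    Returns None for missing or empty input instead of raising.
    (Hypothetical helper, not part of the original notebook.)
    """
    if raw is None:
        return None
    # Drop the dollar sign and any surrounding whitespace
    cleaned = raw.replace('$', '').strip()
    if not cleaned:
        return None
    return float(cleaned)


print(parse_price('$2.39'))  # 2.39
print(parse_price(None))     # None
```

It could be wrapped the same way as `make_float`, for example with `F.udf(parse_price, T.FloatType())`.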

In [5]:
chipo.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- item_name: string (nullable = true)
 |-- choice_description: string (nullable = true)
 |-- item_price: float (nullable = true)



In [8]:
chipo.where(F.col('item_price') > 10).count()

1130

### Step 5. What is the price of each item? 
###### Print a data frame with only two columns, item_name and item_price (the cell below also keeps the > $10 filter from Step 4)

In [9]:
chipo.select(F.col('item_name'), F.col('item_price')).where(F.col('item_price') > 10).take(10)

[Row(item_name='Chicken Bowl', item_price=16.979999542236328),
 Row(item_name='Chicken Bowl', item_price=10.979999542236328),
 Row(item_name='Steak Burrito', item_price=11.75),
 Row(item_name='Chicken Bowl', item_price=11.25),
 Row(item_name='Chicken Burrito', item_price=10.979999542236328),
 Row(item_name='Barbacoa Bowl', item_price=11.75),
 Row(item_name='Chicken Bowl', item_price=11.25),
 Row(item_name='Steak Burrito', item_price=11.75),
 Row(item_name='Chicken Burrito', item_price=10.979999542236328),
 Row(item_name='Chicken Burrito', item_price=10.979999542236328)]
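
Values such as 16.979999542236328 appear because `item_price` was declared as `T.FloatType()`, a 32-bit float, and 16.98 has no exact 32-bit binary representation. The sketch below reproduces the effect in plain Python with the standard `struct` module; `to_float32` is a hypothetical helper name used only for illustration.

```python
import struct


def to_float32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack('f', struct.pack('f', value))[0]


print(to_float32(16.98))  # 16.979999542236328, matching the output above
print(to_float32(11.75))  # 11.75, which is exactly representable in binary
```

Declaring the column as `T.DoubleType()` instead would be one way to get values closer to the original strings, at the cost of wider storage.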

### Step 6. Sort by the name of the item

In [13]:
chipo.select(F.col('item_name'), F.col('item_price')).orderBy('item_name').take(5)

[Row(item_name='6 Pack Soft Drink', item_price=6.489999771118164),
 Row(item_name='6 Pack Soft Drink', item_price=6.489999771118164),
 Row(item_name='6 Pack Soft Drink', item_price=6.489999771118164),
 Row(item_name='6 Pack Soft Drink', item_price=6.489999771118164),
 Row(item_name='6 Pack Soft Drink', item_price=6.489999771118164)]

### Step 7. What was the quantity of the most expensive item ordered?

In [14]:
chipo.select(F.col('item_name'), F.col('quantity'), F.col('item_price')).orderBy('item_price', ascending=False).take(1)

[Row(item_name='Chips and Fresh Tomato Salsa', quantity=15, item_price=44.25)]
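
The top row has a quantity of 15, which suggests that `item_price` may be the total for the order line rather than a per-unit price. That is an assumption; nothing in this notebook confirms it. Under that assumption, a per-unit price could be sketched as follows, using `Decimal` to avoid float rounding. The helper name `unit_price` is hypothetical.

```python
from decimal import Decimal


def unit_price(line_total: Decimal, quantity: int) -> Decimal:
    """Per-unit price, ASSUMING line_total covers `quantity` units."""
    if quantity <= 0:
        raise ValueError('quantity must be positive')
    return line_total / quantity


# Values taken from the row above: 15 units for a total of 44.25
print(unit_price(Decimal('44.25'), 15))  # 2.95
```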

### Step 8. How many times was a Veggie Salad Bowl ordered?

In [15]:
chipo.where(F.col('item_name') == 'Veggie Salad Bowl').count()

18

### Step 9. How many times did someone order more than one Canned Soda?

In [19]:
chipo.where((F.col('item_name') == 'Canned Soda') & (F.col('quantity') > 1)).count()

20