# Ex2 - Getting and Knowing your Data

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

spark = SparkSession.builder.appName("Chipotle").getOrCreate()


### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv). 

### Step 3. Assign it to a variable called chipo.

In [2]:
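from pyspark import SparkFiles

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
# Download the file so Spark can read it from its local files directory
spark.sparkContext.addFile(url)

# Tab-separated file with a header row; let Spark infer the column types
chipo = spark.read.csv(SparkFiles.get("chipotle.tsv"), sep='\t', header=True, inferSchema=True)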

### Step 4. See the first 10 entries

In [3]:
chipo.show(10)

+--------+--------+--------------------+--------------------+----------+
|order_id|quantity|           item_name|  choice_description|item_price|
+--------+--------+--------------------+--------------------+----------+
|       1|       1|Chips and Fresh T...|                NULL|    $2.39 |
|       1|       1|                Izze|        [Clementine]|    $3.39 |
|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 |
|       1|       1|Chips and Tomatil...|                NULL|    $2.39 |
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |
|       3|       1|        Chicken Bowl|[Fresh Tomato Sal...|   $10.98 |
|       3|       1|       Side of Chips|                NULL|    $1.69 |
|       4|       1|       Steak Burrito|[Tomatillo Red Ch...|   $11.75 |
|       4|       1|    Steak Soft Tacos|[Tomatillo Green ...|    $9.25 |
|       5|       1|       Steak Burrito|[Fresh Tomato Sal...|    $9.25 |
+--------+--------+--------------------+--------------------+----------+

### Step 5. What is the number of observations in the dataset?

In [4]:
# Solution 1

chipo.count()

4622

In [5]:
# Solution 2

chipo.summary().show()

+-------+-----------------+------------------+-----------------+--------------------+----------+
|summary|         order_id|          quantity|        item_name|  choice_description|item_price|
+-------+-----------------+------------------+-----------------+--------------------+----------+
|  count|             4622|              4622|             4622|                4622|      4622|
|   mean|927.2548680225011|1.0757247944612722|             null|                null|      null|
| stddev|528.8907955866096|0.4101863342575333|             null|                null|      null|
|    min|                1|                 1|6 Pack Soft Drink|                NULL|    $1.09 |
|    25%|              477|                 1|             null|                null|      null|
|    50%|              926|                 1|             null|                null|      null|
|    75%|             1393|                 1|             null|                null|      null|
|    max|             1834|   

### Step 6. What is the number of columns in the dataset?

In [6]:
len(chipo.columns)

5

### Step 7. Print the name of all the columns.

In [7]:
chipo.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- item_name: string (nullable = true)
 |-- choice_description: string (nullable = true)
 |-- item_price: string (nullable = true)



### Step 8. How is the dataset indexed?

In [8]:
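# Spark DataFrames have no row index (unlike pandas); rows have no guaranteed order unless sorted explicitly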

### Step 9. Which was the most-ordered item? 

In [9]:
most = chipo.\
        groupby('item_name').\
        sum('quantity').\
        orderBy('sum(quantity)', ascending=False).\
        take(1)[0]
print(most['item_name'])

Chicken Bowl


### Step 10. For the most-ordered item, how many items were ordered?

In [10]:
print(most['sum(quantity)'])

761


### Step 11. What was the most-ordered item in the choice_description column?

In [11]:
most = chipo.\
        where(F.col('choice_description') != 'NULL').\
        groupby('choice_description').\
        count().\
        orderBy('count', ascending=False).\
        take(1)

In [12]:
print(most)

[Row(choice_description='[Diet Coke]', count=134)]


### Step 12. How many items were ordered in total?

In [13]:
chipo.select(F.col('quantity')).groupBy().sum().collect()

[Row(sum(quantity)=4972)]

### Step 13. Turn the item price into a float

In [14]:
def make_float(string):
    # Drop the dollar sign; float() ignores the trailing space
    string = string.replace('$', '')
    return float(string)

In [15]:
spark_udf = F.udf(make_float, T.FloatType())

In [16]:

chipo = chipo.withColumn("item_price", spark_udf(F.col("item_price")))

In [17]:
chipo.select(F.col('item_price')).show()

+----------+
|item_price|
+----------+
|      2.39|
|      3.39|
|      3.39|
|      2.39|
|     16.98|
|     10.98|
|      1.69|
|     11.75|
|      9.25|
|      9.25|
|      4.45|
|      8.75|
|      8.75|
|     11.25|
|      4.45|
|      2.39|
|      8.49|
|      8.49|
|      2.18|
|      8.75|
+----------+
only showing top 20 rows



In [18]:
chipo.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- item_name: string (nullable = true)
 |-- choice_description: string (nullable = true)
 |-- item_price: float (nullable = true)



### Step 14. How much was the revenue for the period in the dataset?

In [19]:
revenue = chipo.select(chipo.quantity * chipo.item_price).groupBy().sum()

In [20]:
revenue.show()

+----------------------------+
|sum((quantity * item_price))|
+----------------------------+
|            39237.0197327137|
+----------------------------+



### Step 15. How many orders were made in the period?

In [21]:
chipo.select(F.countDistinct('order_id')).show()

+------------------------+
|count(DISTINCT order_id)|
+------------------------+
|                    1834|
+------------------------+



### Step 16. What is the average revenue amount per order?

In [58]:
# Solution 1

res = chipo.select(
    (chipo.quantity * chipo.item_price).alias('revenue'),chipo.order_id).\
    groupBy('order_id').sum().groupBy().avg().\
    select(F.col('avg(sum(revenue))')).collect()

In [61]:
print(res)

[Row(avg(sum(revenue))=21.394231042919138)]


### Step 17. How many different items are sold?

In [41]:
chipo.select(F.countDistinct('item_name')).show()

+-------------------------+
|count(DISTINCT item_name)|
+-------------------------+
|                       50|
+-------------------------+

