### <div class="alert alert-success" style="background:#2C3E50;color:white">Data Frame Operations - Basic Transformations such as filtering, aggregations, joins etc</div>

This section will show different transformations on Data Frames such as -
* filtering datasets
* joining datasets 
* aggregating the data as desired after joining them.

All of the above operations will be performed using various Data Frame APIs like -
* Selection or Projection – select
* Filtering data – filter or where
* Joins – join (supports outer join as well)
* Aggregations – groupBy and agg with support of functions such as sum.

<p style="background:#F1C40F"><b>NOTE : </b>The examples below use hypothetical scenarios to demonstrate selecting, filtering, joining, aggregating and sorting Data Frames.</p>

<p style="background:#AED6F1"><b>Selection or Projection of Data in Data Frames</b></p>

Data can be selected and fetched from a DF using native DF style syntax or SQL style syntax. In the native DF approach, we can select and project data using either of the following DF APIs -
* select(*cols) - Projects a set of expressions and returns a new DataFrame.
    * <code>df_name.select(df_name.attribute_name, df_name.attribute_name,...)</code>
    * <code>df_name.select('attribute_name', 'attribute_name',...)</code>
    
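* withColumn(colName, col) - Returns a new DataFrame by adding a column or replacing the existing column that has the same name.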

For selecting using sql style syntax -
* selectExpr(*expr) - Projects a set of SQL expressions and returns a new DataFrame. This is a variant of select() that accepts SQL expressions.


<p style="background:#F1C40F"><b>NOTE : </b> Below are a few important details -</p>

* We can apply functions to manipulate the data while it is being projected.
    * <code>df.select(substring('attribute_name', 1, 5))</code>
* Derived fields can be given aliases using the alias function.
    * <code>df.select(substring('attribute_name', 1, 5).alias('XYZ'))</code>
* Using the withColumn function, we can project additional derived fields along with the existing attributes.
    * <code>df.withColumn('XYZ', substring('attribute_name', 1, 5))</code>
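One detail worth keeping in mind for the `substring` calls above: Spark's `substring(str, pos, len)` counts positions from 1, whereas Python slices count from 0. The sketch below is a plain-Python model of that behaviour for positive positions only. The helper name `spark_like_substring` is an assumption made for illustration and is not part of PySpark; the sample value follows the `yyyy-MM-dd HH:mm:ss.S` layout that the order dates in this data set appear to use.

```python
def spark_like_substring(value, pos, length):
    """Model Spark's substring(str, pos, len) for pos >= 1 (1-based start)."""
    if pos < 1:
        raise ValueError("this sketch only models positive, 1-based positions")
    start = pos - 1  # convert the 1-based position to a 0-based slice index
    return value[start:start + length]


# Hypothetical order_date value, used only for illustration
sample_date = "2013-07-25 00:00:00.0"

order_month = spark_like_substring(sample_date, 1, 7)  # characters 1..7
order_day = spark_like_substring(sample_date, 9, 2)    # characters 9..10
print(order_month, order_day)
```

With this model, `substring('order_date', 1, 7)` corresponds to the Python slice `value[0:7]`, which is the year and month part of the date string.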

<p style="background :#d0d5db"><b>df.select()</b></p>

In [None]:
>>> orders.select('order_id', 'order_date').show(truncate=False)

In [None]:
>>> from pyspark.sql.functions import substring
>>> orders.select('order_id', substring('order_date', 1, 7)).show()


In [None]:
>>> orders.select('order_id', substring('order_date', 1, 7).alias('order_month')).show()


In [None]:
>>> orders.select(orders.order_id, orders.order_date).show(truncate=False)

<p style="background :#d0d5db"><b>df.withColumn()</b></p>

In [None]:
>>> orders.withColumn('order_month', substring('order_date', 1, 7)).show()


<p style="background :#d0d5db"><b>Selection using SQL style syntax - Examples</b> </p>

<p style="background :#d0d5db"><b>df.selectExpr()</b></p>

In [None]:
>>> orders.selectExpr('substring(order_date, 1, 7) as order_month').show()

<p style="background:#AED6F1"><b>Filtering Data from Data Frames</b></p>

Data Frames have two APIs to filter data: filter() and where(). where() is an alias of filter(), so both behave identically.
* filter() or where() - accept either a Column expression (native DF style syntax) or a SQL expression string (SQL style syntax).

* Native DF style syntax -
e.g. <code>orders.where(orders.order_status == 'COMPLETE').show()</code>

* SQL style syntax -
e.g. <code>orders.where('order_status = "COMPLETE"').show()</code>

<p style="background:#F1C40F"><b>NOTE : </b> A few more practice examples for filtering data follow -</p>

<p style="background:#F1C40F">Get Orders which are either COMPLETE or CLOSED.</p>

Native DF style syntax

In [None]:
>>> orders.filter(orders.order_status == 'COMPLETE').show()

In [None]:
# using boolean OR

>>> orders.filter((orders.order_status == 'COMPLETE') | (orders.order_status == 'CLOSED')).show()

In [None]:
# using DF function isin()

>>> orders.filter(orders.order_status.isin('COMPLETE','CLOSED')).show()

In [None]:
>>> orders.filter(orders.order_status.isin('COMPLETE','CLOSED')).count()

SQL style syntax

In [None]:
>>> orders.filter("order_status = 'COMPLETE'").show()

In [None]:
>>> orders.filter("order_status in('CLOSED', 'COMPLETE')").show()

<p style="background:#F1C40F"> Get Orders which are either COMPLETE or CLOSED and placed in month of August 2013.</p>

In [None]:
# df native style statement

>>> orders.filter((orders.order_status.isin('COMPLETE','CLOSED')) &
... (orders.order_date.like('2013-08%'))).show()

In [None]:
# sql Style syntax

>>> orders.filter("order_status in ('COMPLETE', 'CLOSED') and order_date like '2013-08%'").show()

In [None]:
# sql Style syntax

>>> orders.where("order_status in ('CLOSED', 'COMPLETE') and order_date like '2013-08%'").show()

In [None]:
>>> orders.where("order_status in ('CLOSED', 'COMPLETE') and order_date like '2013-08%'").count()

<p style="background:#F1C40F">Get Order Items where order_item_subtotal is not equal to product of item_quantity and product_price.</p>

In [None]:
>>> orderItems.show()

In [None]:
>>> orderItems.select('subtotal', 'qty', 'product_price').show()

In [None]:
# df native style statement

>>> orderItems. \
... select('subtotal', 'qty', 'product_price'). \
... where(orderItems.subtotal != orderItems.qty * orderItems.product_price). \
... show()

In [None]:
>>> from pyspark.sql.functions import round

In [None]:
# df native style statement

>>> orderItems. \
... select('subtotal', 'qty', 'product_price'). \
... where(orderItems.subtotal != round((orderItems.qty * orderItems.product_price), 2)). \
... show()
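The `round` in the second query matters because the columns are floating-point: a product such as `qty * product_price` can differ from the stored subtotal by a tiny representation error, so a plain `!=` can flag rows that are equal for practical purposes. A minimal plain-Python illustration follows, using made-up values rather than rows from `order_items`. Python's built-in `round` and Spark's `round` (which uses HALF_UP) can differ on exact ties, so treat this only as an illustration of the comparison problem.

```python
# Made-up values for illustration; not taken from the order_items data set
qty = 3
product_price = 1.1
subtotal = 3.3

raw_product = qty * product_price        # carries a binary floating-point error
rounded_product = round(raw_product, 2)  # rounded to two decimals before comparing

print(raw_product)                  # 3.3000000000000003
print(subtotal != raw_product)      # True: looks like a mismatch
print(subtotal != rounded_product)  # False: equal after rounding
```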

In [None]:
# sql style syntax

>>> orderItems = spark.read. \
... format('csv'). \
... schema('oi_id int, oi_ordr_id int, oi_prod_id int, oi_qnty int, oi_subtotal float, oi_prod_prce float'). \
... load('/public/retail_db/order_items')

In [None]:
>>> orderItems.show()

In [None]:
>>> orderItems.select('oi_subtotal', 'oi_qnty', 'oi_prod_prce').show()

In [None]:
>>> orderItems.select('oi_subtotal', 'oi_qnty', 'oi_prod_prce'). \
... where('oi_subtotal <> round((oi_qnty * oi_prod_prce),2)'). \
... show()

In [None]:
>>> orderItems.selectExpr('oi_subtotal', 'oi_qnty', 'oi_prod_prce').show()

<p style="background:#F1C40F">Get all Orders which are placed on first of every month</p>

In [None]:
>>> from pyspark.sql.functions import min, max, date_format

In [None]:
>>> orders.select(max('order_date')).show()

In [None]:
>>> orders.select(min('order_date')).show()

In [None]:
>>> orders.select(date_format('order_date', 'yyyy-MM-dd')).show()

In [None]:
>>> orders.select(date_format('order_date', 'yyyy-MM-dd').alias('order_date')).show()

In [None]:
>>> orders.where(date_format('order_date', 'dd') == '01').show()

In [None]:
# there should be 12 unique dates - 1 for each month

>>> orders.where(date_format('order_date', 'dd') == '01').select('order_date').distinct().count()

<p style="background:#F1C40F"><b>NOTE : </b>The same queries can also be run with spark.sql against a temporary view.</p>

In [None]:
>>> orders = spark.read.csv('/public/retail_db/orders', schema='order_id int, order_date string, order_cust_id int, order_status string')

In [None]:
>>> orders.show()

In [None]:
>>> orders.createOrReplaceTempView('ordersvw')

In [None]:
>>> spark.sql("select * from ordersvw").show()

In [None]:
>>> spark.sql("select order_date from ordersvw").show()

In [None]:
>>> spark.sql("select date_format(order_date,'yyyy-MM-dd') from ordersvw").show()

In [None]:
>>> spark.sql("select date_format(order_date,'yyyy-MM') from ordersvw").show()

In [None]:
>>> spark.sql("select date_format(order_date,'dd') from ordersvw").show()

In [None]:
>>> spark.sql("select * from ordersvw where date_format(order_date,'dd') = '01'").show()

In [None]:
>>> spark.sql("select count(distinct order_date) from ordersvw where date_format(order_date,'dd') = '01'").show()

In [None]:
>>> spark.sql("select * from ordersvw where date_format(order_date,'dd') = '01'").distinct().count()

In [None]:
>>> spark.sql("select distinct(order_date) from ordersvw where date_format(order_date,'dd') = '01'")
DataFrame[order_date: string]

In [None]:
>>> spark.sql("select distinct(order_date) from ordersvw where date_format(order_date,'dd') = '01'").show()

<p style="background:#AED6F1"><b>Joining multiple Data Frames</b></p>

<p style="background:#F1C40F"><b>NOTE : Examples to understand Joins in DF </b></p>

In [None]:
>>> orders = spark.read.csv('/public/retail_db/orders',
...             schema='order_id int, order_date string, order_cust_id int, order_status string')

In [None]:
>>> orderItems = spark.read. \
...  format('csv'). \
...  schema('''order_item_id int,
...  oi_order_id int,
...  oi_product_id int,
...  oi_qty int,
...  oi_subtotal float,
...  oi_product_price float'''). \
...  load('/public/retail_db/order_items')

<p style="background:#F1C40F">Get all the order items corresponding to COMPLETE OR CLOSED orders.</p>

In [None]:
>>> ordersFiltered = orders.where("order_status in ('COMPLETE','CLOSED')")

In [None]:
>>> ordersFiltered.show()

In [None]:
>>> ordersJoin = ordersFiltered. \
... join(orderItems, ordersFiltered.order_id == orderItems.oi_order_id, 'inner')

In [None]:
>>> type(ordersJoin)

In [None]:
>>> ordersJoin.show()

<p style="background:#F1C40F">Get all the orders where there are no corresponding order items.</p>

In [None]:
>>> orders.select('order_id').distinct().count()
68883 

In [None]:
>>> orderItems.select('oi_order_id').distinct().count()
57431

In [None]:
>>> ordersLeftOuterJoin = orders.join(orderItems,
... orders.order_id == orderItems.oi_order_id,
... 'left')

In [None]:
>>> ordersLeftOuterJoin.printSchema()

In [None]:
>>> ordersLeftOuterJoin.show()

In [None]:
>>> ordersLeftOuterJoin.count()
183650

In [None]:
>>> ordersLeftOuterJoin.where("oi_order_id is null")

In [None]:
>>> ordersLeftOuterJoin.where("oi_order_id is null").show()

In [None]:
>>> ordersLeftOuterJoin.where("oi_order_id is null").count()
11452
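The pattern used here (left outer join, then keep the rows where the right-hand key is null) can be modelled in plain Python to see why it isolates orders without items. The rows below are invented for illustration and `left_outer_join` is a hypothetical helper, not a Spark API; `None` plays the role of SQL `NULL`.

```python
# Invented sample rows: (order_id, order_status) and (oi_order_id, oi_subtotal)
orders_rows = [(1, "CLOSED"), (2, "PENDING"), (3, "COMPLETE")]
order_item_rows = [(1, 299.98), (1, 129.99), (3, 49.98)]


def left_outer_join(left, right, left_key, right_key):
    """Pair every left row with its matching right rows; pad with None if none match."""
    joined = []
    for l_row in left:
        matches = [r_row for r_row in right if r_row[right_key] == l_row[left_key]]
        if matches:
            joined.extend(l_row + r_row for r_row in matches)
        else:
            # No match: keep the left row and fill the right side with None
            joined.append(l_row + (None,) * len(right[0]))
    return joined


joined_rows = left_outer_join(orders_rows, order_item_rows, left_key=0, right_key=0)

# Equivalent of where("oi_order_id is null"): index 2 holds the oi_order_id column
orders_without_items = [row for row in joined_rows if row[2] is None]
print(len(joined_rows), orders_without_items)
```

The joined result has one row per matching order item plus one padded row for every order that has no items, which is why the joined count is larger than the number of orders.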

<p style="background:#F1C40F">Check if there are any order items where there is no corresponding data in orders data set.</p>

In [None]:
>>> ordersRightOuterJoin = orders.join(orderItems,
... orders.order_id == orderItems.oi_order_id,
... 'right')

In [None]:
>>> ordersRightOuterJoin.printSchema()

In [None]:
>>> ordersRightOuterJoin.show()

In [None]:
>>> ordersRightOuterJoin.where("order_id is null").show()

<p style="background:#AED6F1"><b>Perform Aggregations using Data Frames</b></p>

In [None]:
>>> orders = spark.read. \
... format('csv'). \
... schema('order_id int, order_date string, order_cust_id int, order_status string'). \
... load('/public/retail_db/orders')

In [None]:
>>> orders.printSchema()

In [None]:
>>> orders.select('order_status').count()

In [None]:
>>> orders.select('order_status').distinct().count()
9

In [None]:
>>> from pyspark.sql.functions import countDistinct

In [None]:
>>> orders.select(countDistinct('order_status')).show()

In [None]:
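# alias() here is applied to the DataFrame returned by select(), so the column name does not change

>>> orders.select(countDistinct('order_status')).alias('order_status_count').show()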

In [None]:
>>> orders.select(countDistinct('order_status').alias('order_status_count')).show()

<p style="background:#F1C40F"> Get count by status from orders.</p>

In [None]:
>>> orderItems = spark.read. \
...  format('csv'). \
...  schema('''order_item_id int,
...  oi_order_id int,
...  oi_product_id int,
...  oi_qty int,
...  oi_subtotal float,
...  oi_product_price float'''). \
...  load('/public/retail_db/order_items')

In [None]:
>>> orders = spark.read.csv('/public/retail_db/orders', schema='order_id int, order_date string, order_cust_id int, order_status string') 

In [None]:
>>> orders. \
... groupBy('order_status'). \
... count(). \
... show()

**Solution**

In [None]:
>>> from pyspark.sql.functions import count

In [None]:
>>> orders. \
... groupBy('order_status'). \
... agg(count('order_status').alias('order_status_count')). \
... show()

<p style="background:#F1C40F">Get revenue for each order id from order items.</p>

In [None]:
>>> orderItems = spark.read.csv('/public/retail_db/order_items', schema='oi_item_id int, oi_order_id int, oi_product_id int, oi_qty int, oi_subtotal float, oi_prod_price float')

In [None]:
>>> orderItems.printSchema()

In [None]:
>>> orderItems.where('oi_order_id = 2').show()

In [None]:
>>> from pyspark.sql.functions import round, sum

In [None]:
>>> orderItems.where('oi_order_id = 2'). \
... select(sum('oi_subtotal')).show()

In [None]:
>>> orderItems.where('oi_order_id = 2'). \
... select(round(sum('oi_subtotal'), 2)).show()

**Solution**

In [None]:
>>> orderItems.groupBy('oi_order_id'). \
... sum('oi_subtotal').show()

In [None]:
# fails: round() is not a GroupedData method; the aggregate has to be wrapped in agg()

>>> orderItems.groupBy('oi_order_id'). \
... round(sum('oi_subtotal'),2).show()


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'GroupedData' object has no attribute 'round'

In [None]:
>>> orderItems.groupBy('oi_order_id'). \
... agg(round(sum('oi_subtotal'), 2).alias('order_revenue'))

DataFrame[oi_order_id: int, order_revenue: double]

In [None]:
>>> orderItems.groupBy('oi_order_id'). \
... agg(round(sum('oi_subtotal'), 2).alias('order_revenue')).show()
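As the `AttributeError` above suggests, `groupBy` on its own only defines the groups; the aggregate, and any post-processing such as `round`, has to be expressed inside `agg`. A rough mental model in plain Python, with invented rows, is sketched below. Here `sum` and `round` are Python's built-ins, whereas in the Spark cells the same names refer to the functions imported from `pyspark.sql.functions`.

```python
from collections import defaultdict

# Invented sample rows: (oi_order_id, oi_subtotal)
order_item_rows = [(1, 299.98), (2, 199.99), (2, 250.0), (2, 129.99), (4, 49.98)]

# Step 1, the groupBy part: collect the subtotals that belong to each key
subtotals_by_order = defaultdict(list)
for order_id, subtotal in order_item_rows:
    subtotals_by_order[order_id].append(subtotal)

# Step 2, the agg part: reduce each group to one value and round the result
order_revenue = {
    order_id: round(sum(subtotals), 2)
    for order_id, subtotals in subtotals_by_order.items()
}
print(order_revenue)
```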

<p style="background:#F1C40F">Get daily product revenue (order_date and order_item_product_id are part of keys, order_item_subtotal to be used for aggregation).</p>

In [None]:
>>> ordersJoin = orders.join(orderItems, orders.order_id == orderItems.oi_order_id)

In [None]:
>>> ordersJoin.printSchema()

In [None]:
>>> from pyspark.sql.functions import round, sum

In [None]:
>>> ordersJoin. \
... groupBy('order_date', 'oi_product_id'). \
... agg(round(sum('oi_subtotal'), 2).alias('productRevenue')). \
... show()

In [None]:
>>> ordersJoin. \
... groupBy('order_date', 'oi_product_id'). \
... agg(round(sum('oi_subtotal'), 2).alias('productRevenue')). \
... count()
 

<p style="background:#AED6F1"><b>Sorting Data in Data Frames</b></p>

In [None]:
>>> orders = spark.read.csv('/public/retail_db/orders', schema='order_id int, order_date string, order_cust_id int, order_status string') 

In [None]:
>>> orderItems = spark.read. \
...  format('csv'). \
...  schema('''order_item_id int,
...  oi_order_id int,
...  oi_product_id int,
...  oi_qty int,
...  oi_subtotal float,
...  oi_product_price float'''). \
...  load('/public/retail_db/order_items')

In [None]:
>>> orders.count()
68883

In [None]:
>>> orders.sort('order_date').show()


In [None]:
>>> orders.sort('order_date', 'order_cust_id').show()

In [None]:
>>> orders.sort(['order_date', 'order_cust_id'], ascending=[0, 1]).show()

In [None]:
>>> orders.sort(['order_date', 'order_cust_id'], ascending=[1, 0]).show()

In [None]:
>>> orders.sort('order_date', orders.order_cust_id.desc()).show()

In [None]:
>>> orders.sort(orders.order_date.asc(), orders.order_cust_id.desc()).show()
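To check expectations for a mixed ascending/descending sort without a cluster, the same ordering can be modelled with Python's `sorted`. The rows below are invented for illustration; negating the numeric key is just one simple way to get descending order for the second column in this sketch.

```python
# Invented sample rows: (order_id, order_date, order_cust_id)
orders_rows = [
    (1, "2013-07-25", 11599),
    (2, "2013-07-25", 256),
    (3, "2013-07-26", 12111),
    (4, "2013-07-25", 8827),
]

# Equivalent in spirit to sort(['order_date', 'order_cust_id'], ascending=[1, 0]):
# order_date ascending, then order_cust_id descending within each date.
sorted_rows = sorted(orders_rows, key=lambda row: (row[1], -row[2]))

sorted_ids = [row[0] for row in sorted_rows]
print(sorted_ids)
```

Within 2013-07-25 the customer ids come out as 11599, 8827, 256, followed by the single row for 2013-07-26.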

<p style="background:#F1C40F">Sort orders by order status.</p>

In [None]:
>>> orders.sort('order_status').show()

<p style="background:#F1C40F">Sort orders by date and then status.</p>

In [None]:
>>> orders.sort('order_date', 'order_status').show()

<p style="background:#F1C40F">Sort order items by order id (ascending) and then by subtotal (descending).</p>

In [None]:
>>> orderItems.sort(['oi_order_id', 'oi_subtotal'], ascending=[1,0]).show()

In [None]:
>>> orderItems.sort('oi_order_id', orderItems.oi_subtotal.desc()).show()

<p style="background:#AED6F1"><b>Development Life Cycle using Data Frames</b></p>

<p style="background:#FA8072;border-style:solid;"><b>WILL DO LATER</b></p>

<p style="background:#AED6F1"><b>Run Applications using Spark Submit</b></p>

<p style="background:#FA8072;border-style:solid;"><b>WILL DO LATER</b></p>

<p style="background:#F1C40F"><b>NOTE : </b> The example below combines the DF operations and transformations covered above in a scenario-based problem: get the daily product revenue, with the highest revenue displayed first.</p>

<p style="background:#AED6F1"><b> Problem statement - Get Daily Product Revenue</b></p>

We need to develop code that calculates the revenue for each product on a daily basis.
The datasets required for this problem are -
* Orders Data File (orders.csv)
* Order Items Data File (order_items.csv)
* Products Data File (products.csv)

<p style="background:#AED6F1"><b> Design - Get Daily Product Revenue</b></p>

The design of the problem is as follows -

1. orders.csv is read into the orders data frame.
2. order_items.csv is read into the orderItems data frame.
3. products.csv is read into the products data frame.
4. orders DF (restricted to CLOSED or COMPLETE orders) and orderItems DF are joined on the order id as key column.
5. The joined data is grouped by order_date and product id, and the subtotal column is aggregated using the sum function, giving the ordersJoin DF.
6. ordersJoin DF and products DF are joined on product id as key column, and a result data frame is created to show order_date, product_name and revenue, sorted by date (ascending) and revenue (descending).
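Before running this on the real files, steps 4 to 6 of the design can be traced on a handful of in-memory rows. The sketch below is a plain-Python model, not Spark code: all rows and product names are invented, and the grouping uses both date and product because the revenue is required per day.

```python
from collections import defaultdict

# Invented sample rows standing in for the three data sets
orders_rows = [(1, "2013-07-25"), (2, "2013-07-25"), (3, "2013-07-26")]  # (order_id, order_date)
order_item_rows = [(1, 10, 100.0), (2, 10, 50.0), (2, 20, 300.0), (3, 20, 75.5)]  # (order_id, product_id, subtotal)
product_names = {10: "Product A", 20: "Product B"}  # product_id -> product_name

# Step 4: join orders and order items on order_id
order_dates = dict(orders_rows)
joined = [(order_dates[oid], pid, subtotal)
          for oid, pid, subtotal in order_item_rows if oid in order_dates]

# Step 5: group by (order_date, product_id) and sum the subtotals
revenue = defaultdict(float)
for order_date, pid, subtotal in joined:
    revenue[(order_date, pid)] += subtotal

# Step 6: attach product names, then sort by date ascending, revenue descending
daily_product_revenue = sorted(
    ((order_date, product_names[pid], round(total, 2))
     for (order_date, pid), total in revenue.items()),
    key=lambda row: (row[0], -row[2]),
)
for row in daily_product_revenue:
    print(row)
```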


<p style="background :#d0d5db"><b>Creating DF</b> </p>

In [None]:
>>> orders = spark.read.csv('/public/retail_db/orders', schema='order_id int, order_date string, order_cust_id int, order_status string')

In [None]:
>>> orders.printSchema()

In [None]:
>>> orders.show(5)

In [None]:
>>> orderItems = spark.read. \
... format('csv'). \
... schema('''order_item_id int,
... oi_order_id int,
... oi_product_id int,
... oi_qty int,
... oi_subtotal float,
... oi_product_price float'''). \
... load('/public/retail_db/order_items')

In [None]:
>>> orderItems.printSchema()


In [None]:
>>> orderItems.show(10)


In [None]:
>>> products = spark.read. \
...              format('csv'). \
...              schema('product_id int, product_cat_id int, product_name string, product_description string, product_price float, product_img string'). \
...              load('/public/retail_db/products')

In [None]:
>>> products.printSchema()

In [None]:
>>> productsDF = products.select('product_id', 'product_name')

In [None]:
>>> productsDF.show()

<p style="background :#d0d5db"><b>Filtering, Joining and Aggregating Orders DFs</b> </p>

In [None]:
>>> from pyspark.sql.functions import sum, round

In [None]:
>>> ordersJoin = orders.where('order_status in("CLOSED", "COMPLETE")'). \
...  join(orderItems, orders.order_id == orderItems.oi_order_id). \
...  groupBy('order_date', 'oi_product_id'). \
...  agg(round(sum('oi_subtotal'), 2).alias('product_revenue'))

In [None]:
>>> ordersJoin.show()

<p style="background :#d0d5db"><b>Joining Order and Product DFs and Sorting Revenue DF</b> </p>

In [None]:
>>> revenueDF = ordersJoin. \
...              join(productsDF, ordersJoin.oi_product_id == productsDF.product_id). \
...              select('order_date', 'product_name', 'product_revenue'). \
...              sort(['order_date', 'product_revenue'], ascending=[1,0])

In [None]:
>>> revenueDF.show()

<p style="background:#AED6F1"><b>Exercises</b></p>

<p style="background:#F1C40F">Get number of CLOSED or COMPLETE orders placed by each customer.</p>

In [None]:
>>> orders = spark.read. \
...     format('csv'). \
...     schema('order_id int, order_date string, order_customer_id int, order_status string'). \
...     load('/public/retail_db/orders')

In [None]:
>>> orders.printSchema()

In [None]:
>>> from pyspark.sql.functions import *

In [None]:
>>> orderByCustomer = orders. \
...                     where('order_status in ("CLOSED", "COMPLETE")'). \
...                     groupBy('order_customer_id', 'order_status'). \
...                     agg(count('order_id').alias('order_count'))

In [None]:
>>> orderByCustomer.show()

In [None]:
>>> orderByCustomer.sort('order_customer_id').show()

<p style="background:#F1C40F">Get revenue generated by each customer for January 2014 (consider only CLOSED or COMPLETE orders).</p>

In [None]:
>>> orders = spark.read. \
...     format('csv'). \
...     schema('order_id int, order_date string, order_customer_id int, order_status string'). \
...     load('/public/retail_db/orders')

In [None]:
>>> orderItems = spark.read. \
...  format('csv'). \
...  schema('''order_item_id int,
...  oi_order_id int,
...  oi_product_id int,
...  oi_qty int,
...  oi_subtotal float,
...  oi_product_price float'''). \
...  load('/public/retail_db/order_items')

In [None]:
>>> orders.where('order_status in ("COMPLETE", "CLOSED") and substring(order_date, 1, 7) = "2014-01"').show()

In [None]:
>>> revenueDF = orders.select('order_id', substring(orders.order_date, 1, 7).alias('order_month'), 'order_customer_id', 'order_status')

In [None]:
>>> revenueDF.show()

In [None]:
>>> revenueDF = orders.select('order_id', substring(orders.order_date, 1, 7).alias('order_month'), 'order_customer_id', 'order_status'). \
...             where('order_status in ("COMPLETE", "CLOSED") and order_month = "2014-01"')


In [None]:
>>> revenueDF.show()

In [None]:
>>> revenueDF = orders.select('order_id', substring(orders.order_date, 1, 7).alias('order_month'), 'order_customer_id', 'order_status'). \
...             where('order_status in ("COMPLETE", "CLOSED") and order_month = "2014-01"'). \
...             join(orderItems, orders.order_id == orderItems.oi_order_id)

In [None]:
>>> revenueDF.show()

In [None]:
>>> revenueDF = orders.select('order_id', substring(orders.order_date, 1, 7).alias('order_month'), 'order_customer_id', 'order_status'). \
...             where('order_status in ("COMPLETE", "CLOSED") and order_month = "2014-01"'). \
...             join(orderItems, orders.order_id == orderItems.oi_order_id). \
...             groupBy('order_month', 'order_customer_id'). \
...             agg(round(sum('oi_subtotal'), 2).alias('revenue'))

In [None]:
>>> revenueDF.show()

<p style="background:#F1C40F">Get revenue generated by each product on monthly basis – get product name, month and revenue generated by each product (round off revenue to 2 decimals).</p>

In [None]:
>>> orders = spark.read. \
...      format('csv'). \
...      schema('order_id int, order_date string, order_customer_id int, order_status string'). \
...      load('/public/retail_db/orders')

In [None]:
>>> orderItems = spark.read. \
...  format('csv'). \
...  schema('''order_item_id int,
...  oi_order_id int,
...  oi_product_id int,
...  oi_qty int,
...  oi_subtotal float,
...  oi_product_price float'''). \
...  load('/public/retail_db/order_items')

In [None]:
>>> products = spark.read. \
...              format('csv'). \
...              schema('product_id int, product_cat_id int, product_name string, product_description string, product_price float, product_img string'). \
...              load('/public/retail_db/products')

In [None]:
>>> from pyspark.sql.functions import *

In [None]:
>>> revenueDF = orders.select('order_id', substring(orders.order_date, 1, 7).alias('order_month'), 'order_status')

In [None]:
>>> revenueDF.show()

In [None]:
>>> revenueDF = orders.select('order_id', substring(orders.order_date, 1, 7).alias('order_month'), 'order_status'). \
...             join(orderItems, orders.order_id == orderItems.oi_order_id). \
...             groupBy('order_month', 'oi_product_id'). \
...             agg(round(sum('oi_subtotal'), 2).alias('product_revenue'))

In [None]:
>>> revenueDF.show()

In [None]:
>>> revenueDF = orders.select('order_id', substring(orders.order_date, 1, 7).alias('order_month'), 'order_status'). \
...             join(orderItems, orders.order_id == orderItems.oi_order_id). \
...             groupBy('order_month', 'oi_product_id'). \
...             agg(round(sum('oi_subtotal'), 2).alias('product_revenue')). \
...             join(products, orderItems.oi_product_id == products.product_id). \
...             select('order_month', 'product_name', 'product_revenue')

In [None]:
>>> revenueDF.show()

<p style="background:#F1C40F">Get revenue generated by each product category on daily basis – get category name, date and revenue generated by each category (round off revenue to 2 decimals).</p>

In [None]:
>>> from pyspark.sql.functions import *

In [None]:
>>> orders = spark.read. \
...             format('csv'). \
...             schema('order_id int, order_date string, order_customer_id int, order_status string'). \
...             load('/public/retail_db/orders')

In [None]:
>>> orderItems = spark.read. \
...                     format('csv'). \
...                     schema('oi_item_id int, oi_order_id int, oi_product_id int, oi_qty int, oi_subtotal float, oi_product_price float'). \
...                     load('/public/retail_db/order_items')

In [None]:
>>> products = spark.read. \
...                     format('csv'). \
...                     schema('product_id int, product_category_id int, product_name string, product_description string, product_price float, product_image string'). \
...                     load('/public/retail_db/products')

In [None]:
>>> categories = spark.read. \
...                     format('csv'). \
...                     schema('category_id int, category_department_id int, category_name string'). \
...                     load('/public/retail_db/categories')

In [None]:
>>> categories.show()

In [None]:
>>> ordersJoin = orders.where('order_status in ("CLOSED","COMPLETE")'). \
...                     join(orderItems, orders.order_id == orderItems.oi_order_id)

In [None]:
>>> prodcatJoin = products.select('product_id', 'product_category_id'). \
...                             join(categories, products.product_category_id == categories.category_id). \
...                             select('product_id', 'product_category_id', 'category_name')

In [None]:
>>> prodcatJoin.show(5)

In [None]:
>>> ordersJoin.show(5)

In [None]:
>>> categoryRevenueDF = ordersJoin. \
...                             join(prodcatJoin, ordersJoin.oi_product_id == prodcatJoin.product_id). \
...                             groupBy('order_date', 'product_category_id', 'category_name'). \
...                             agg(round(sum('oi_subtotal'), 2).alias('revenue')). \
...                             select('order_date', 'category_name', 'revenue')

In [None]:
>>> categoryRevenueDF.show()

In [None]:
>>> categoryRevenueDF.where("category_name == 'Golf Balls'").sort('order_date').show()

In [None]:
>>> categoryRevenueDF.where("category_name == 'Golf Balls'").sort('order_date').count()

<p style="background :#d0d5db"><b>Validation Through Hive Query</b> </p>

In [None]:
hive (monahadoop_final)> select order_date, cat_name, round(sum(item_subtotal), 2)
                       > from 
                       > orders, order_items, products, categories
                       > where
                       > orders.order_id = order_items.item_order_id and
                       > order_items.item_product_id = products.product_id and
                       > products.product_category_id = categories.cat_id
                       > and order_status in ("CLOSED", "COMPLETE")
                       > group by order_date, product_category_id, cat_name;

In [None]:
hive (monahadoop_final)> select order_date, cat_name, round(sum(item_subtotal), 2)
                       > from 
                       > orders, order_items, products, categories
                       > where
                       > orders.order_id = order_items.item_order_id and
                       > order_items.item_product_id = products.product_id and
                       > products.product_category_id = categories.cat_id
                       > and order_status in ("CLOSED", "COMPLETE")
                       > group by order_date, product_category_id, cat_name
                       > having cat_name = 'Golf Balls';

In [None]:
2013-07-25 00:00:00.0	Golf Balls	79.96
2013-07-26 00:00:00.0	Golf Balls	249.86
2013-07-27 00:00:00.0	Golf Balls	116.93
2013-07-28 00:00:00.0	Golf Balls	37.98
2013-07-29 00:00:00.0	Golf Balls	278.85
2013-07-30 00:00:00.0	Golf Balls	95.94
2013-07-31 00:00:00.0	Golf Balls	175.89

Time taken: 42.673 seconds, Fetched: 296 row(s)

<p style="background:#F1C40F">Get the details of the customers who never placed any order.</p>

In [None]:
>>> customers = spark.read. \
...                     format('csv'). \
...                     schema('customer_id int, customer_fname string, customer_lname string, customer_email string, customer_password string, customer_street string, customer_city string, customer_state string, customer_zipcode string'). \
...                     load('/public/retail_db/customers')

In [None]:
>>> customers.printSchema()

In [None]:
>>> orders = spark.read. \
...      format('csv'). \
...      schema('order_id int, order_date string, order_customer_id int, order_status string'). \
...      load('/public/retail_db/orders')

In [None]:
>>> CustomerWithNoOrdersDF = customers. \
...                             join(orders, customers.customer_id == orders.order_customer_id, "left"). \
...                             where('order_status is NULL'). \
...                             select('customer_fname', 'customer_lname'). \
...                             sort('customer_lname', 'customer_fname')

<p style="background :#d0d5db"><b>END</b> </p>