## Projecting Data

Let us understand different aspects of projecting data. We primarily use `SELECT` to project the data.

* We can project all columns using `*` or some columns using column names.
* We can provide an alias for a column or expression using `AS` in the `SELECT` clause.
* `DISTINCT` can be used to get the distinct records from the selected columns. We can also use `DISTINCT *` to get unique records using all the columns.
* As of now **Spark SQL** does not support projecting all but one or a few columns, whereas Hive does. The following works in Hive and projects all the columns from `orders` except `order_id`.

```
SET hive.support.quoted.identifiers=none;
SELECT `(order_id)?+.+` FROM orders;
```
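
One possible workaround on the Spark side is to build the column list in the host language and pass the resulting query to `spark.sql`. The sketch below is plain Python and only constructs the query string. The helper name `select_all_except` is hypothetical, and the hard-coded column list is an assumption standing in for what `spark.table("orders").columns` would return in a PySpark session.

```python
def select_all_except(table, columns, excluded):
    """Build a SELECT statement listing every column except those in `excluded`.

    Hypothetical helper for illustration; it only assembles a query string.
    """
    excluded_set = set(excluded)
    unknown = excluded_set - set(columns)
    if unknown:
        raise ValueError(f"Columns not in {table}: {sorted(unknown)}")
    # Keep the original column order, dropping the excluded names.
    kept = [c for c in columns if c not in excluded_set]
    if not kept:
        raise ValueError("At least one column must remain in the projection")
    return f"SELECT {', '.join(kept)} FROM {table}"


# Assumed column names of the orders table (in PySpark these could come
# from spark.table("orders").columns instead of being hard-coded).
orders_columns = ["order_id", "order_date", "order_customer_id", "order_status"]

query = select_all_except("orders", orders_columns, ["order_id"])
print(query)  # SELECT order_date, order_customer_id, order_status FROM orders
```

With a running session, the generated string could then be executed as `spark.sql(query).show()`.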

Let us start spark context for this Notebook so that we can execute the code provided.

In [1]:
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Basic Transformations").
    master("yarn").
    getOrCreate

username = itv002480
spark = org.apache.spark.sql.SparkSession@6341d370


org.apache.spark.sql.SparkSession@6341d370

If you are going to use CLIs, you can use Spark SQL in one of the following 3 ways.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

In [2]:
%%sql
select current_database()

+------------------+
|current_database()|
+------------------+
|           default|
+------------------+



In [3]:
%%sql
use itv002480_retail

++
||
++
++



In [4]:
%%sql

SELECT * FROM orders LIMIT 10

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|
|       7|2013-07-25 00:00:...|             4530|       COMPLETE|
|       8|2013-07-25 00:00:...|             2911|     PROCESSING|
|       9|2013-07-25 00:00:...|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:00:...|             5648|PENDING_PAYMENT|
+--------+--------------------+-----------------+---------------+



In [5]:
%%sql

DESCRIBE orders

+-----------------+---------+-------+
|         col_name|data_type|comment|
+-----------------+---------+-------+
|         order_id|      int|   null|
|       order_date|   string|   null|
|order_customer_id|      int|   null|
|     order_status|   string|   null|
+-----------------+---------+-------+



In [6]:
%%sql

SELECT order_customer_id, order_date, order_status FROM orders LIMIT 10

+-----------------+--------------------+---------------+
|order_customer_id|          order_date|   order_status|
+-----------------+--------------------+---------------+
|            11599|2013-07-25 00:00:...|         CLOSED|
|              256|2013-07-25 00:00:...|PENDING_PAYMENT|
|            12111|2013-07-25 00:00:...|       COMPLETE|
|             8827|2013-07-25 00:00:...|         CLOSED|
|            11318|2013-07-25 00:00:...|       COMPLETE|
|             7130|2013-07-25 00:00:...|       COMPLETE|
|             4530|2013-07-25 00:00:...|       COMPLETE|
|             2911|2013-07-25 00:00:...|     PROCESSING|
|             5657|2013-07-25 00:00:...|PENDING_PAYMENT|
|             5648|2013-07-25 00:00:...|PENDING_PAYMENT|
+-----------------+--------------------+---------------+



In [7]:
%%sql

SELECT order_customer_id, date_format(order_date, 'yyyy-MM'), order_status FROM orders LIMIT 10

+-----------------+---------------------------------------------------+---------------+
|order_customer_id|date_format(CAST(order_date AS TIMESTAMP), yyyy-MM)|   order_status|
+-----------------+---------------------------------------------------+---------------+
|             8702|                                            2014-02|       COMPLETE|
|             3066|                                            2014-02|PENDING_PAYMENT|
|             7314|                                            2014-02|SUSPECTED_FRAUD|
|             1271|                                            2014-02|       COMPLETE|
|            11083|                                            2014-02|       COMPLETE|
|             3159|                                            2014-02|         CLOSED|
|             4551|                                            2014-02|         CLOSED|
|             8135|                                            2014-02|        PENDING|
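|             7497|                                            2014-02|PENDING_PAYMENT|
|             1868|                                            2014-02|        ON_HOLD|
+-----------------+---------------------------------------------------+---------------+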

In [8]:
%%sql

SELECT order_customer_id, 
    date_format(order_date, 'yyyy-MM') AS order_month, 
    order_status 
FROM orders LIMIT 10

+-----------------+-----------+---------------+
|order_customer_id|order_month|   order_status|
+-----------------+-----------+---------------+
|             8702|    2014-02|       COMPLETE|
|             3066|    2014-02|PENDING_PAYMENT|
|             7314|    2014-02|SUSPECTED_FRAUD|
|             1271|    2014-02|       COMPLETE|
|            11083|    2014-02|       COMPLETE|
|             3159|    2014-02|         CLOSED|
|             4551|    2014-02|         CLOSED|
|             8135|    2014-02|        PENDING|
|             7497|    2014-02|PENDING_PAYMENT|
|             1868|    2014-02|        ON_HOLD|
+-----------------+-----------+---------------+



In [9]:
%%sql

SELECT DISTINCT order_status FROM orders

+---------------+
|   order_status|
+---------------+
|PENDING_PAYMENT|
|       COMPLETE|
|        ON_HOLD|
| PAYMENT_REVIEW|
|     PROCESSING|
|         CLOSED|
|SUSPECTED_FRAUD|
|        PENDING|
|       CANCELED|
+---------------+



In [10]:
%%sql

SELECT DISTINCT * FROM orders LIMIT 10

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|   34852|2014-02-25 00:00:...|             8544|        PENDING|
|   35248|2014-02-27 00:00:...|             9108|       COMPLETE|
|   35322|2014-02-27 00:00:...|             5301|PENDING_PAYMENT|
|   35380|2014-02-28 00:00:...|             2966|PENDING_PAYMENT|
|   35543|2014-02-28 00:00:...|             9021|       COMPLETE|
|   35667|2014-03-01 00:00:...|             3780|       COMPLETE|
|   36353|2014-03-05 00:00:...|             2205|        ON_HOLD|
|   36754|2014-03-07 00:00:...|            10160|     PROCESSING|
|   36808|2014-03-07 00:00:...|             4770|       COMPLETE|
|   36809|2014-03-07 00:00:...|               62|       COMPLETE|
+--------+--------------------+-----------------+---------------+
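
Selecting columns, aliasing an expression with `AS`, and `DISTINCT` are standard SQL, so they can also be tried without a cluster. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a local stand-in for Spark SQL, not Spark itself. The sample rows are hypothetical, and SQLite's `strftime` is swapped in for `date_format`, which SQLite does not provide.

```python
import sqlite3

# In-memory SQLite database used as a local stand-in for the Spark SQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER, order_date TEXT, "
    "order_customer_id INTEGER, order_status TEXT)"
)

# Hypothetical sample rows, for illustration only.
sample_orders = [
    (1, "2013-07-25", 101, "CLOSED"),
    (2, "2013-07-25", 102, "PENDING_PAYMENT"),
    (3, "2013-08-01", 103, "COMPLETE"),
    (4, "2013-08-01", 104, "CLOSED"),
]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", sample_orders)

# Project two columns plus an expression; AS names the derived column.
# strftime is SQLite's counterpart to the date_format call used above.
cursor = conn.execute(
    "SELECT order_customer_id, "
    "strftime('%Y-%m', order_date) AS order_month, "
    "order_status "
    "FROM orders ORDER BY order_id"
)
projected_columns = [description[0] for description in cursor.description]
projected_rows = cursor.fetchall()

# DISTINCT collapses repeated values of the projected column.
distinct_statuses = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT order_status FROM orders ORDER BY order_status"
    )
]

print(projected_columns)
print(projected_rows)
print(distinct_statuses)
conn.close()
```

The alias shows up as the column name in the result, just as `order_month` replaces the generated `date_format(...)` header in the Spark output.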



* Using Spark SQL with Python or Scala: the same queries can be passed as strings to `spark.sql`.

In [None]:
spark.sql("SELECT * FROM orders").show()

In [None]:
spark.sql("DESCRIBE orders").show()

In [None]:
spark.sql("SELECT order_customer_id, order_date, order_status FROM orders").show()

In [None]:
spark.sql("""
SELECT order_customer_id, 
    date_format(order_date, 'yyyy-MM'), 
    order_status 
FROM orders""").show()

In [None]:
spark.sql("""
SELECT order_customer_id, 
    date_format(order_date, 'yyyy-MM') AS order_month, 
    order_status 
FROM orders
""").show()

In [None]:
spark.sql("SELECT DISTINCT order_status FROM orders").show()

In [None]:
spark.sql("SELECT DISTINCT * FROM orders").show()