- Use data from /data/nyse_all/nyse_data

- Use database YOUR_OS_USER_NAME_nyse

- Create partitioned table nyse_eod_part

- Field Names: stockticker, tradedate, openprice, highprice, lowprice, closeprice, volume

- Determine correct data types based on the values

- Create a managed table with "," as the delimiter.

- The partition field should be tradeyear, of type INT (one partition per year)

- Insert data into partitioned table using dynamic partition mode.

- Here are the steps to arrive at the solution:

>Review the files under /data/nyse_all/nyse_data and determine the data types (for example, tradedate should be INT and volume should be BIGINT)

>Create database YOUR_OS_USER_NAME_nyse (if it does not exist)

>Create a non-partitioned stage table

>Load data into the non-partitioned stage table

>Validate the count and confirm that the data looks as expected by running a simple SELECT query.

>Create partitioned table

>Set required properties to use dynamic partition

>Insert data into the partitioned table. With tradedate stored as INT, the year can be computed as year(to_date(cast(tradedate AS STRING), 'yyyyMMdd')) AS tradeyear
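The first step, determining column types from the raw values, can be sketched in plain Python. This is only an illustration: the helper name `infer_hive_type` is hypothetical, and the three sample rows are copied from the 1997 file reviewed later in this notebook. A real decision should look at value ranges across all files, which a three-row sample cannot show.

```python
import csv
import io

# A few sample rows in the same comma-delimited layout as the NYSE files.
SAMPLE = """AA,19970101,47.82,47.82,47.82,47.82,0
ACV,19970101,16,16,16,16,0
AGM.A,19970101,26.5,26.5,26.5,26.5,0
"""

FIELDS = ["stockticker", "tradedate", "openprice", "highprice",
          "lowprice", "closeprice", "volume"]


def infer_hive_type(values):
    """Return the first of INT, FLOAT, STRING that fits every value (hypothetical helper)."""
    def fits(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if fits(int):
        return "INT"
    if fits(float):
        return "FLOAT"
    return "STRING"


rows = list(csv.reader(io.StringIO(SAMPLE)))
columns = list(zip(*rows))  # transpose rows into per-column tuples
inferred = {name: infer_hive_type(col) for name, col in zip(FIELDS, columns)}

# The exercise hint asks for BIGINT for volume, so widen it explicitly.
inferred["volume"] = "BIGINT"

for name, hive_type in inferred.items():
    print(f"{name:12s} {hive_type}")
```

Note that a price column containing whole numbers such as 16 is still classified as FLOAT, because other rows in the same column carry decimals.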

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.\
    builder.\
    enableHiveSupport().\
    appName("Spark SQL -Exercise Partitioning").\
    master("yarn").\
    getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [6]:
spark.sql("SET hive.exec.dynamic.partition").show()

+--------------------+-----------+
|                 key|      value|
+--------------------+-----------+
|hive.exec.dynamic...|<undefined>|
+--------------------+-----------+



### Set required properties to use dynamic partition

In [7]:
spark.sql("SET hive.exec.dynamic.partition.mode").show()

+--------------------+-----------+
|                 key|      value|
+--------------------+-----------+
|hive.exec.dynamic...|<undefined>|
+--------------------+-----------+



In [8]:
spark.sql("SET hive.exec.dynamic.partition=true")

DataFrame[key: string, value: string]

In [5]:
spark.sql("create database if not exists exercise")

DataFrame[]

In [9]:
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

DataFrame[key: string, value: string]

In [3]:
spark.sql("show databases").show()

+------------+
|   namespace|
+------------+
|     default|
|    exercise|
|          hr|
|       kevin|
|kevin_retail|
|      retail|
|         sms|
|        test|
+------------+



### Select the database to use

In [10]:
spark.sql("use exercise")

DataFrame[]

### Verify the current database

In [20]:
spark.sql("Select current_database()").show()

+------------------+
|current_database()|
+------------------+
|          exercise|
+------------------+



                                                                                

### Listing the files

In [8]:
#data
! ls /home/hadoop/data/data/nyse_all/nyse_data/


NYSE_1997.txt.gz  NYSE_2003.txt.gz  NYSE_2009.txt.gz  NYSE_2015.txt.gz
NYSE_1998.txt.gz  NYSE_2004.txt.gz  NYSE_2010.txt.gz  NYSE_2016.txt.gz
NYSE_1999.txt.gz  NYSE_2005.txt.gz  NYSE_2011.txt.gz  NYSE_2017.txt.gz
NYSE_2000.txt.gz  NYSE_2006.txt.gz  NYSE_2012.txt.gz
NYSE_2001.txt.gz  NYSE_2007.txt.gz  NYSE_2013.txt.gz
NYSE_2002.txt.gz  NYSE_2008.txt.gz  NYSE_2014.txt.gz


### Review and determine data types

In [None]:
!zless /home/hadoop/data/data/nyse_all/nyse_data/NYSE_1997.txt.gz

AA,19970101,47.82,47.82,47.82,47.82,0
ABC,19970101,6.03,6.03,6.03,6.03,0
ABM,19970101,9.25,9.25,9.25,9.25,0
ABT,19970101,25.37,25.37,25.37,25.37,0
ABX,19970101,28.75,28.75,28.75,28.75,0
ACP,19970101,9.12,9.12,9.12,9.12,0
ACV,19970101,16,16,16,16,0
ADC,19970101,21.37,21.37,21.37,21.37,0
ADM,19970101,17.24,17.24,17.24,17.24,0
ADX,19970101,13.16,13.16,13.16,13.16,0
AED,19970101,31.5,31.5,31.5,31.5,0
AEE,19970101,38.5,38.5,38.5,38.5,0
AEG,19970101,15.2,15.2,15.2,15.2,0
AEM,19970101,14,14,14,14,0
AEP,19970101,41.12,41.12,41.12,41.12,0
AES,19970101,11.62,11.62,11.62,11.62,0
AF,19970101,12.29,12.29,12.29,12.29,0
AFG,19970101,25.179,25.179,25.179,25.179,0
AFL,19970101,10.69,10.69,10.69,10.69,0
AG,19970101,28.62,28.62,28.62,28.62,0
AGCO,19970101,28.625,28.625,28.625,28.625,0
AGM,19970101,10.25,10.25,10.25,10.25,0
AGM.A,19970101,26.5,26.5,26.5,26.5,0

### Create non partitioned stage table

In [29]:
spark.sql("""CREATE TABLE IF NOT EXISTS nyse_staging (
  stockticker STRING,
  tradedate STRING,
  openprice FLOAT,
  highprice FLOAT,
  lowprice FLOAT,
  closeprice FLOAT,
  volume INT
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','""")

DataFrame[]
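The exercise hints suggest INT for tradedate and BIGINT for volume, whereas the cell above uses STRING and INT. As a hedged alternative, the sketch below builds the same kind of DDL with the hinted types. The helper `build_stage_ddl` and the `STAGE_COLUMNS` list are assumptions made for illustration, not part of the original solution.

```python
def build_stage_ddl(table_name, columns, delimiter=","):
    """Build a CREATE TABLE statement for a delimited text table (hypothetical helper)."""
    column_lines = ",\n".join(f"  {name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} (\n"
        f"{column_lines}\n"
        f")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'"
    )


# Column types follow the exercise hints: tradedate INT, volume BIGINT.
STAGE_COLUMNS = [
    ("stockticker", "STRING"),
    ("tradedate", "INT"),
    ("openprice", "FLOAT"),
    ("highprice", "FLOAT"),
    ("lowprice", "FLOAT"),
    ("closeprice", "FLOAT"),
    ("volume", "BIGINT"),
]

ddl = build_stage_ddl("nyse_staging", STAGE_COLUMNS)
print(ddl)
```

In the notebook the resulting string could be passed to spark.sql(ddl). With tradedate declared as INT, later date parsing needs the cast(tradedate AS STRING) step mentioned in the exercise hints.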

#### Loading data into non partitioned stage table

In [30]:
spark.sql("LOAD DATA LOCAL INPATH '/home/hadoop/data/data/nyse_all/nyse_data/*'  INTO TABLE nyse_staging ")

DataFrame[]

### Validate the count and check that the data looks as expected with a select query

In [31]:
spark.sql("SELECT * FROM nyse_staging LIMIT 10").show(5,truncate=False)



+-----------+---------+---------+---------+--------+----------+------+
|stockticker|tradedate|openprice|highprice|lowprice|closeprice|volume|
+-----------+---------+---------+---------+--------+----------+------+
|AA         |19980101 |52.77    |52.77    |52.77   |52.77     |0     |
|ABC        |19980101 |7.28     |7.28     |7.28    |7.28      |0     |
|ABM        |19980101 |15.28    |15.28    |15.28   |15.28     |0     |
|ABT        |19980101 |32.75    |32.75    |32.75   |32.75     |0     |
|ABX        |19980101 |18.62    |18.62    |18.62   |18.62     |0     |
+-----------+---------+---------+---------+--------+----------+------+
only showing top 5 rows



                                                                                

In [None]:
spark.sql("SELECT count(1) FROM nyse_staging ").show()

In [36]:
from pyspark.sql.functions import to_date

In [43]:
spark.sql("select  date_format(to_date(tradedate, 'yyyyMMdd'), 'yyyy') dateyear from nyse_staging limit 2").show()

+--------+
|dateyear|
+--------+
|    1997|
|    1997|
+--------+
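For intuition, the same year extraction can be mirrored in plain Python. The sketch below shows two ways to derive the year from a yyyyMMdd value: parsing it as a date, and integer arithmetic. The function names `trade_year` and `trade_year_fast` are hypothetical, and this is not how Spark evaluates the expression internally.

```python
from datetime import datetime


def trade_year(tradedate):
    """Parse a yyyyMMdd value (int or str) as a date and return its year as an int."""
    return datetime.strptime(str(tradedate), "%Y%m%d").year


def trade_year_fast(tradedate):
    """Return the leading four digits via integer division; does not validate the date."""
    return int(tradedate) // 10000


for value in (19970101, "19980101", 20171229):
    print(value, trade_year(value), trade_year_fast(value))
```

The parsing variant raises ValueError on malformed dates, which can be useful when validating stage data; the arithmetic variant accepts any eight-digit number.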



#### Creating the partitioned table

In [52]:
spark.sql("""CREATE TABLE IF NOT EXISTS nyse_eod_part (
  stockticker STRING,
  tradedate STRING,
  openprice FLOAT,
  highprice FLOAT,
  lowprice FLOAT,
  closeprice FLOAT,
  volume INT
) PARTITIONED BY (tradeyear INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','""")

DataFrame[]

### Describing the created partitioned table

In [53]:
spark.sql("DESCRIBE FORMATTED nyse_eod_part").show(200, truncate=False)

+----------------------------+----------------------------------------------------------------+-------+
|col_name                    |data_type                                                       |comment|
+----------------------------+----------------------------------------------------------------+-------+
|stockticker                 |string                                                          |null   |
|tradedate                   |string                                                          |null   |
|openprice                   |float                                                           |null   |
|highprice                   |float                                                           |null   |
|lowprice                    |float                                                           |null   |
|closeprice                  |float                                                           |null   |
|volume                      |int                               

### Inserting data into partitioned table 

In [54]:
spark.sql("""INSERT INTO TABLE nyse_eod_part PARTITION (tradeyear)
SELECT ns.*, cast(date_format(to_date(tradedate, 'yyyyMMdd'), 'yyyy') AS INT) AS tradeyear
FROM nyse_staging ns""")

                                                                                

DataFrame[]
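To make the effect of the dynamic-partition insert concrete, here is a minimal pure-Python sketch of the routing it performs: the partition value is derived from each row, and the row lands under a directory-style key such as tradeyear=1997. The function `route_to_partitions` and the sample rows are assumptions for illustration; Spark does this in a distributed way and writes files rather than in-memory lists.

```python
from collections import defaultdict


def route_to_partitions(rows, partition_col="tradeyear"):
    """Group rows by the year taken from the tradedate field (index 1)."""
    partitions = defaultdict(list)
    for row in rows:
        year = int(str(row[1])[:4])  # first four characters of yyyyMMdd
        partitions[f"{partition_col}={year}"].append(row)
    return dict(partitions)


# Sample rows in the stage-table layout (values taken from the outputs above).
sample_rows = [
    ("AA", "19970101", 47.82, 47.82, 47.82, 47.82, 0),
    ("AA", "19980101", 52.77, 52.77, 52.77, 52.77, 0),
    ("ABC", "19980101", 7.28, 7.28, 7.28, 7.28, 0),
]

routed = route_to_partitions(sample_rows)
for key in sorted(routed):
    print(key, len(routed[key]))
```

Because no partition value is given statically in the PARTITION (tradeyear) clause, this kind of insert relies on the nonstrict dynamic partition mode set earlier.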

## Validation

In [55]:
spark.sql("show partitions nyse_eod_part").show()

+--------------+
|     partition|
+--------------+
|tradeyear=1997|
|tradeyear=1998|
|tradeyear=1999|
|tradeyear=2000|
|tradeyear=2001|
|tradeyear=2002|
|tradeyear=2003|
|tradeyear=2004|
|tradeyear=2005|
|tradeyear=2006|
|tradeyear=2007|
|tradeyear=2008|
|tradeyear=2009|
|tradeyear=2010|
|tradeyear=2011|
|tradeyear=2012|
|tradeyear=2013|
|tradeyear=2014|
|tradeyear=2015|
|tradeyear=2016|
+--------------+
only showing top 20 rows
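The output above stops at 20 rows because show() displays 20 rows by default, while the source directory holds 21 yearly files (1997 through 2017). A small sketch of how the expected partition list could be derived from the file names and compared with what was displayed; `expected_partitions` is a hypothetical helper, and it assumes each file contains only rows for the year in its name.

```python
import re

# File names as listed in the source directory earlier in this notebook.
FILE_NAMES = [f"NYSE_{year}.txt.gz" for year in range(1997, 2018)]


def expected_partitions(file_names, partition_col="tradeyear"):
    """Derive sorted partition names from NYSE_<year>.txt.gz file names."""
    years = set()
    for name in file_names:
        match = re.fullmatch(r"NYSE_(\d{4})\.txt\.gz", name)
        if match:
            years.add(int(match.group(1)))
    return [f"{partition_col}={year}" for year in sorted(years)]


expected = expected_partitions(FILE_NAMES)

# The 20 partitions displayed by the default show() call above.
shown = [f"tradeyear={year}" for year in range(1997, 2017)]

not_displayed = sorted(set(expected) - set(shown))
print(len(expected), not_displayed)
```

In the notebook, calling show(50, truncate=False) on the same query would display every partition instead of only the first 20.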



In [57]:
!hdfs dfs -ls /user/hive/warehouse/exercise.db/nyse_eod_part

Found 21 items
drwxrwxrwx   - hadoop supergroup          0 2023-03-28 04:14 /user/hive/warehouse/exercise.db/nyse_eod_part/tradeyear=1997
drwxrwxrwx   - hadoop supergroup          0 2023-03-28 04:14 /user/hive/warehouse/exercise.db/nyse_eod_part/tradeyear=1998
drwxrwxrwx   - hadoop supergroup          0 2023-03-28 04:14 /user/hive/warehouse/exercise.db/nyse_eod_part/tradeyear=1999
drwxrwxrwx   - hadoop supergroup          0 2023-03-28 04:14 /user/hive/warehouse/exercise.db/nyse_eod_part/tradeyear=2000
drwxrwxrwx   - hadoop supergroup          0 2023-03-28 04:14 /user/hive/warehouse/exercise.db/nyse_eod_part/tradeyear=2001
drwxrwxrwx   - hadoop supergroup          0 2023-03-28 04:14 /user/hive/warehouse/exercise.db/nyse_eod_part/tradeyear=2002
drwxrwxrwx   - hadoop supergroup          0 2023-03-28 04:14 /user/hive/warehouse/exercise.db/nyse_eod_part/tradeyear=2003
drwxrwxrwx   - hadoop supergroup          0 2023-03-28 04:14 /user/hive/warehouse/exercise.db/nyse_eod_part/tradeyear=2004
d

In [58]:
spark.sql("SELECT count(1) FROM nyse_eod_part").show()



+--------+
|count(1)|
+--------+
| 9384739|
+--------+



                                                                                

In [60]:
import pandas as pd
import glob

In [61]:
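# Cross-check the Spark row count by reading the same source files with pandas.
path = r'/home/hadoop/data/data/nyse_all/nyse_data'
all_files = glob.glob(path + "/*.txt.gz")

# The files have no header row; pandas infers gzip compression from the .gz suffix.
frames = [pd.read_csv(filename, index_col=None, header=None) for filename in all_files]

# Stack the per-year frames; the row count should match count(1) on nyse_eod_part.
frame = pd.concat(frames, axis=0, ignore_index=True)
frame.shape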

(9384739, 7)
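The pandas check loads every row into memory just to obtain a count. A lighter alternative, sketched below with only the standard library, streams each gzip file and counts lines. The helper `count_rows` is hypothetical, and the demo runs on two tiny temporary files so that it is self-contained; it assumes one record per line, as in the NYSE files.

```python
import glob
import gzip
import os
import tempfile


def count_rows(pattern):
    """Count lines across all gzip files matching the glob pattern."""
    total = 0
    for filename in sorted(glob.glob(pattern)):
        with gzip.open(filename, "rt") as handle:
            total += sum(1 for _ in handle)
    return total


# Demo on temporary files that mimic the layout of the source directory.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "NYSE_1997.txt.gz": "AA,19970101,47.82,47.82,47.82,47.82,0\n"
                            "ABC,19970101,6.03,6.03,6.03,6.03,0\n",
        "NYSE_1998.txt.gz": "AA,19980101,52.77,52.77,52.77,52.77,0\n",
    }
    for name, text in samples.items():
        with gzip.open(os.path.join(tmp, name), "wt") as handle:
            handle.write(text)
    demo_total = count_rows(os.path.join(tmp, "*.txt.gz"))

print(demo_total)
```

Pointing the pattern at /home/hadoop/data/data/nyse_all/nyse_data/*.txt.gz would give a figure to compare against the count(1) result from nyse_eod_part.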