The aim of the exercise is to try out:
* how to get data into Hadoop
* different ways to store data in Hive (partitioning, format, compression)
* SQL querying over big data

## Uploading data to Hadoop
We will work with open PID data (routes, trips, stops), described at https://pid.cz/o-systemu/opendata/#h-gtfs. The zip file is downloadable from http://data.pid.cz/PID_GTFS.zip
Store the data on HDFS in `hive_02/PID_GTFS`.
* **How do you create a new folder on HDFS?**
* **How do you copy data into a previously created folder on HDFS?**
* **Is ZIP a suitable format for HDFS?**
* Look at a few rows of the unzipped data and find the total number of rows (and try to answer the question of why this is a good thing to do).


## Starting the Hive console
Start Hive from the command line via the `hive` command.

If you haven't created your Hive database yet, create one (use your username as the database name) and switch to it. Hive commands must be terminated with a semicolon!
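
A minimal sketch of creating and selecting the database, assuming `your_username` stands in for your actual username (it is a placeholder, not a real name):

```sql
-- 'your_username' is a placeholder; replace it with your own username.
CREATE DATABASE IF NOT EXISTS your_username;

-- Make it the current database for all following statements.
USE your_username;

-- List the tables already present (empty for a fresh database).
SHOW TABLES;
```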


## 1. Input data as external table
1.1 From the files uploaded to HDFS in the previous step, choose the file routes.txt and copy it to a new folder /ext_tables/routes/
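
One way to do the copy without leaving the Hive console is its built-in `dfs` command. This is only a sketch: it assumes the unzipped files sit directly in `hive_02/PID_GTFS` under your HDFS home directory, which may not match your layout.

```sql
-- Create the target folder (with parents if needed).
dfs -mkdir -p /ext_tables/routes;

-- Copy routes.txt; the source path is an assumption about where you unzipped the data.
dfs -cp hive_02/PID_GTFS/routes.txt /ext_tables/routes/;

-- Verify that the file arrived.
dfs -ls /ext_tables/routes;
```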

1.2 Create an external table *route_ext*. The external table uses the data structure as is; no format changes are made in this step.
  * CSV (text) format
  * field separator is ","
  * the record separator is the line break character
  * the first line contains the headers and is skipped
  * the file contains the following fields:

| Field | Type | Description |
|-------------|-----------|--------------------------------------------|
| route_id | string | id of the route |
| agency_id | string | id of the agency |
| route_short_name | string | short name of the route |
| route_long_name | string | long name of the route |
| route_type | string | ? |
| route_url | string | url of the route |
| route_color | string | ? |
| route_text_color | string | ? |
| is_night | boolean | if it is a night route |
| is_regional | boolean | if it is a regional route |
| is_substitute_transport | boolean | if it is a substitute route |
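
The specification above could be written roughly as follows. Treat it as a sketch rather than the reference solution: it uses plain delimited text, so commas inside quoted values would not be handled, and if the file stores the flags as 0/1 rather than true/false, BOOLEAN columns in a text table may read as NULL (declaring them as STRING here and converting later is one alternative).

```sql
-- External table over the raw CSV; Hive only records the schema and location.
CREATE EXTERNAL TABLE IF NOT EXISTS route_ext (
  route_id                STRING,
  agency_id               STRING,
  route_short_name        STRING,
  route_long_name         STRING,
  route_type              STRING,
  route_url               STRING,
  route_color             STRING,
  route_text_color        STRING,
  is_night                BOOLEAN,
  is_regional             BOOLEAN,
  is_substitute_transport BOOLEAN
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/ext_tables/routes'
TBLPROPERTIES ('skip.header.line.count' = '1');
```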

1.3 Run SQL queries on the external table to check:
  * a few rows of table *route_ext*, compared with the input data;
  * the total number of rows, compared with the input data (should not be exactly the same; why?);
  * the number of rows with NULL values in the fields (should be zero).
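
These checks might look like the sketch below. Only three columns are tested for NULL here to keep it short; the same pattern extends to the remaining fields.

```sql
-- A few rows to compare by eye with the input file.
SELECT * FROM route_ext LIMIT 5;

-- Total number of rows in the table.
SELECT COUNT(*) AS row_count FROM route_ext;

-- Number of rows with NULL per column (extend with the other columns as needed).
SELECT
  SUM(CASE WHEN route_id IS NULL THEN 1 ELSE 0 END)         AS null_route_id,
  SUM(CASE WHEN route_short_name IS NULL THEN 1 ELSE 0 END) AS null_route_short_name,
  SUM(CASE WHEN is_night IS NULL THEN 1 ELSE 0 END)         AS null_is_night
FROM route_ext;
```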



## 2. Conversion to optimized table

2.1 Create an empty internal (managed) table *route* in which the data will be stored in a more suitable format and compressed:
  * parquet format
  * without partitioning
  * snappy compression (must be specified in uppercase: SNAPPY)
  * all fields will be the same
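
A possible definition of the managed table, sketched under the assumption that the column list matches *route_ext* exactly:

```sql
-- Managed (internal) Parquet table with Snappy compression, no partitions.
CREATE TABLE IF NOT EXISTS route (
  route_id                STRING,
  agency_id               STRING,
  route_short_name        STRING,
  route_long_name         STRING,
  route_type              STRING,
  route_url               STRING,
  route_color             STRING,
  route_text_color        STRING,
  is_night                BOOLEAN,
  is_regional             BOOLEAN,
  is_substitute_transport BOOLEAN
)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');
```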

2.2 Insert the data from the *route_ext* table into the *route* table:
  * convert boolean fields
  * copy the other fields unchanged.
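
The insert could be sketched as below. The right boolean conversion depends on how the flags are stored in your external table, so the `CAST` is an assumption: it is a no-op if the columns are already BOOLEAN, and a comparison such as `is_night = '1'` may be needed if they were loaded as strings.

```sql
-- Copy all rows from the external table into the Parquet table.
INSERT INTO TABLE route
SELECT
  route_id,
  agency_id,
  route_short_name,
  route_long_name,
  route_type,
  route_url,
  route_color,
  route_text_color,
  CAST(is_night AS BOOLEAN),
  CAST(is_regional AS BOOLEAN),
  CAST(is_substitute_transport AS BOOLEAN)
FROM route_ext;
```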

2.3 Check the *route* table:
  * List a few rows.
  * Find the number of records in the table *route* and compare with the number of records in the table *route_ext*.
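
A short sketch of both checks; the `table_name` and `row_count` aliases are only illustrative:

```sql
-- A few rows from the optimized table.
SELECT * FROM route LIMIT 5;

-- Row counts of both tables side by side.
SELECT 'route' AS table_name, COUNT(*) AS row_count FROM route
UNION ALL
SELECT 'route_ext' AS table_name, COUNT(*) AS row_count FROM route_ext;
```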

2.4 The *route* table is internal, and thus owned by Hive.
  * Find it on HDFS under `hdfs://172.16.102.123:8020/user/hive/warehouse/your_database_name.db` and find its size in MB.
  * The IP address should correspond to your cluster.
  * Compare the size with the size of the external table (the data you uploaded to HDFS, see above).
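
From the Hive console, the location and sizes can be inspected roughly like this. The warehouse path assumes the default layout shown above, and `your_database_name` is a placeholder.

```sql
-- Shows the table's HDFS location and storage details.
DESCRIBE FORMATTED route;

-- Human-readable size of the managed table (path is an assumption; adjust to your database).
dfs -du -s -h /user/hive/warehouse/your_database_name.db/route;

-- Size of the raw data behind the external table, for comparison.
dfs -du -s -h /ext_tables/routes;
```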

2.5 In Hive, drop the external table *route_ext* (DROP TABLE). Check that the table is no longer in your database, but the data is still on HDFS.
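
A sketch of the drop and the two checks that follow it:

```sql
-- Dropping an external table removes only the metadata.
DROP TABLE route_ext;

-- route_ext should no longer be listed.
SHOW TABLES;

-- The underlying file should still be present on HDFS.
dfs -ls /ext_tables/routes;
```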

## 3. Table with partitions
3.1 From the files uploaded to HDFS in the first step, choose the file stop_times.txt and copy it to a newly created folder /ext_tables/stop_times/

3.2 Create an external table *stop_times_ext*. The external table uses the data structure as is; no format changes are made in this step.
  * CSV (text) format
  * field separator is ","
  * the record separator is the line break character
  * the first line contains the headers and is skipped
  * determine fields from the file
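
One way to read the field names without leaving Hive is a throwaway single-column table over the same folder. This is a sketch: `stop_times_raw` is a hypothetical helper name, and `LIMIT 1` usually returns the header line of a single file but does not guarantee it.

```sql
-- With the default field delimiter, each whole CSV line lands in the single column.
CREATE EXTERNAL TABLE IF NOT EXISTS stop_times_raw (line STRING)
STORED AS TEXTFILE
LOCATION '/ext_tables/stop_times';

-- Typically shows the header line with the field names.
SELECT line FROM stop_times_raw LIMIT 1;

-- The helper is external, so dropping it leaves the data in place.
DROP TABLE stop_times_raw;
```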

3.3 Create an empty internal (managed) table *stop_times_part* in which the data will be stored in a more suitable format and compressed:
  * parquet format
  * with partitioning; the partitioning column is named route_id
  * snappy compression (must be specified in uppercase: SNAPPY)
  * all the other fields will be the same as in the *stop_times_ext* table
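
A sketch of the partitioned table follows. The column list is an assumption based on the required field names in the GTFS `stop_times.txt` specification and is deliberately incomplete; replace it with the full set of fields you found in your file.

```sql
-- Column list is illustrative only; it must match the columns of stop_times_ext.
-- The partition column route_id is declared in PARTITIONED BY, not in the column list.
CREATE TABLE IF NOT EXISTS stop_times_part (
  trip_id        STRING,
  arrival_time   STRING,
  departure_time STRING,
  stop_id        STRING,
  stop_sequence  INT
)
PARTITIONED BY (route_id STRING)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');
```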

3.4 Copy the data from table *stop_times_ext* into table *stop_times_part*, creating dynamic partitions by route_id when copying. Dynamic partitioning must be enabled in advance using the commands:
```
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
```
The new route_id column is calculated by the expression
```
concat('L',REGEXP_EXTRACT(trip_id, '[0-9]*',0))
```
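
Put together, the copy into *stop_times_part* might look like the sketch below. The selected columns are the same illustrative subset of GTFS fields and must match your table definition; the partition column has to come last in the `SELECT`. Because there are several hundred routes, the default per-node partition limit (100) may be too low, so raising it as shown is a suggestion rather than a requirement.

```sql
-- Optional: raise the dynamic partition limits if the insert fails on too many partitions.
set hive.exec.max.dynamic.partitions=2000;
set hive.exec.max.dynamic.partitions.pernode=2000;

-- The dynamic partition column (route_id) must be the last expression in the SELECT.
INSERT INTO TABLE stop_times_part PARTITION (route_id)
SELECT
  trip_id,
  arrival_time,
  departure_time,
  stop_id,
  stop_sequence,
  concat('L', regexp_extract(trip_id, '[0-9]*', 0)) AS route_id
FROM stop_times_ext;
```
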
3.5 Locate the *stop_times_part* table on HDFS under `hdfs://172.16.102.123:8020/user/hive/warehouse/your_database_name.db` and see how the partitioning is implemented.
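
The partitions can be inspected both from Hive's metadata and directly on HDFS. A sketch, again assuming the default warehouse path and the placeholder `your_database_name`:

```sql
-- One entry per partition, in the form route_id=<value>.
SHOW PARTITIONS stop_times_part;

-- Each partition is a subfolder named route_id=<value> holding Parquet files.
dfs -ls /user/hive/warehouse/your_database_name.db/stop_times_part;
```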

## 4. Querying in Hive
Work with the *route* and *stop_times_part* tables.

4.1 Find out how many unique routes there are. *(816)*

4.2 Find out the lowest and highest route numbers. *(1, 997)*
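
The first two questions could be approached as sketched below. The second query assumes route identifiers consist of a letter prefix followed by a number (such as `L170`), so the numeric part is extracted before comparing.

```sql
-- Number of unique routes.
SELECT COUNT(DISTINCT route_id) AS unique_routes FROM route;

-- Lowest and highest route number; ids without digits become NULL and are ignored.
SELECT
  MIN(CAST(regexp_extract(route_id, '[0-9]+', 0) AS INT)) AS lowest_route_number,
  MAX(CAST(regexp_extract(route_id, '[0-9]+', 0) AS INT)) AS highest_route_number
FROM route;
```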

4.3 Find out the longest and the shortest route. *(tbd)*

4.4 Find out the routes with the most and the fewest stops. *(tbd)*
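
One possible reading of this task counts the distinct stops served by each route in *stop_times_part*. A sketch under that assumption:

```sql
-- Route with the most distinct stops.
SELECT route_id, COUNT(DISTINCT stop_id) AS stop_count
FROM stop_times_part
GROUP BY route_id
ORDER BY stop_count DESC
LIMIT 1;

-- Route with the fewest distinct stops.
SELECT route_id, COUNT(DISTINCT stop_id) AS stop_count
FROM stop_times_part
GROUP BY route_id
ORDER BY stop_count ASC
LIMIT 1;
```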

4.5 Find out the longest regional and night route. *(tbd)*

4.6 Find out the two stops with the maximum and the minimum distance between them. *(tbd)*

4.7 Find out the average speed for routes L1, L170 and L991. *(tbd)*