## Hive Important Points

- Hive was built by Facebook
- Hive is a SQL-like query interface on top of Hadoop, not a database
- Structured data is organized in rows and columns and has a schema
- Hive can handle structured and semi-structured data; it is not suited to unstructured data
- Hive validates the schema at run time, on read (schema on read)
- Schema validation happens when a Hive table is read, not when data is loaded into it
  - The amount of data loaded into a Hive table is typically huge
  - WORM: Write Once, Read Many
- Hive is not recommended for row-level changes (insert, update, delete)
- By default, Hive uses an embedded "Derby" database to store metadata
- Tables in the default database are stored under `/user/hive/warehouse/`; the location is configurable through `hive.metastore.warehouse.dir`
- Tables in a user-defined database are stored under `/user/hive/warehouse/dbname.db/table_name/`
- Hive can be used as a file format converter (for example, text to ORC)
- Hive stores metadata in the metastore (an RDBMS such as Derby) and data on HDFS
- Hive's preferred file format is ORC (Optimized Row Columnar)
- The **TRUNCATE** operation removes only the data, not the metadata, from an internal/managed table
- The **DROP** operation removes both data and metadata from an internal/managed table
- **gparted** can be used to extend the hard disk size on Linux
- TRUNCATE operation
  - Supported by internal/managed tables
  - Not supported by external tables
- DROP operation
  - Removes both metadata and data from an internal/managed table
  - Removes only the metadata from an external table; the data is kept
- You can delete an external table's data with the HDFS command `hdfs dfs -rm -r /user/hive/warehouse/7jan.db/tx_ext`, or convert the external table to an internal one and truncate it
- Multi-level partitioning is supported on multiple columns
- Hive creates a separate directory for each partition
- Hive design-level optimization
     - Partitioning
       - The maximum number of dynamic partitions is limited (1000 by default) and can be changed, e.g. `set hive.exec.max.dynamic.partitions=500;`
       - In partitioning, NULL values are stored in the partition `column=__HIVE_DEFAULT_PARTITION__`
       - Partitioning is used for query optimization and data management
       - Static Partitioning
         - The partition column name is defined together with its value
       - Dynamic Partitioning
         - Only the partition column name is defined; Hive takes the values from the data
     - Bucketing
       - Bucketing on multiple columns treats all the columns together as a single key
       - Bucketing is used for data sampling and join optimization
       - Hive divides data into buckets based on a hash function
         - `f(x) = x % number of buckets`: for int data type
         - `x = column.hashCode()`, then `f(x) = x % number of buckets`: for string data type
- Hive storage-level optimization
  - File formats
    - Text-based (readable with a normal text reader)
    - Binary (not readable with a normal text reader)
  - Compression
    - Cold data (not frequently used): *GZ*
    - Hot data (frequently used): *Snappy*
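
The partitioning and bucketing notes above can be sketched in HiveQL roughly as follows. This is only an illustration: the table names `tx_raw` and `tx_part` and the `country` column are hypothetical, and the final query only approximates bucket assignment, since the exact hashing scheme depends on the Hive version.

```sql
-- Hypothetical staging table holding the raw rows (names are illustrative).
CREATE TABLE IF NOT EXISTS tx_raw (
  cname   STRING,
  revenue DOUBLE,
  country STRING
);

-- Partitioned and bucketed table: one directory per country value,
-- and 4 bucket files inside each partition, hashed on cname.
CREATE TABLE IF NOT EXISTS tx_part (
  cname   STRING,
  revenue DOUBLE
)
PARTITIONED BY (country STRING)
CLUSTERED BY (cname) INTO 4 BUCKETS
STORED AS ORC;

-- Static partitioning: the partition column name AND its value are given.
INSERT INTO TABLE tx_part PARTITION (country = 'IN')
SELECT cname, revenue FROM tx_raw WHERE country = 'IN';

-- Dynamic partitioning: only the partition column name is given;
-- the value is taken from the last column of the SELECT.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO TABLE tx_part PARTITION (country)
SELECT cname, revenue, country FROM tx_raw;

-- List the partition directories; rows with a NULL country land in
-- country=__HIVE_DEFAULT_PARTITION__.
SHOW PARTITIONS tx_part;

-- Rough view of "hash % number of buckets" for a string column.
-- pmod keeps the result non-negative even when hash() is negative.
SELECT cname, pmod(hash(cname), 4) AS bucket_id
FROM tx_raw;
```

Combining both techniques is a design choice, not a requirement: partitioning prunes whole directories at query time, while bucketing splits each partition into a fixed number of files that help with sampling and joins.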

- Hive JOIN optimization
  - Common join
  - Map side join
  - Bucket map join
  - Sort merge bucket join (SMB join)
  - Sort merge bucket Map join (SMBM join)
- Vectorization allows Hive to process a batch of rows (1024 rows) together instead of processing one row at a time
  - Vectorization is supported for the ORC file format
  - Vectorization is not supported for complex data types
  - Data casting is not supported
- The **CBO** (cost-based optimizer) calculates the cost of various plans for a query and selects the cheapest plan.
- The **RBO** (rule-based optimizer) applies fixed rules to a query, aiming to reuse resources and RAM.
- The query plan can be viewed by putting the `explain` keyword before a Hive query
- Create a view in Hive: `create view tx_view as select cname, sum(revenue) from tx_orc group by cname;`
- A **VIEW** stores only the Hive query that was used to create it
- A **Materialized View** stores the actual data output by its Hive query
  - The table behind a materialized view should be internal and transactional
  - The file format for a materialized view should be ORC
- In short, a **VIEW** stores the Hive query, while a **Materialized View** stores the query output
- Create a materialized view: `create materialized view <name> as <query>`; it can then be updated, dropped, etc.
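
As a rough sketch of the view and `explain` notes above, the statements below reuse the `tx_orc` table with its `cname` and `revenue` columns. The names `tx_mv` and `total_revenue` are hypothetical, and the materialized view part assumes Hive 3 or later with `tx_orc` being a managed, transactional ORC table.

```sql
-- View: stores only the query text; the query runs again on every read.
CREATE VIEW IF NOT EXISTS tx_view AS
SELECT cname, SUM(revenue) AS total_revenue
FROM tx_orc
GROUP BY cname;

-- Materialized view: stores the query output as data.
-- Assumption: tx_orc is a managed, transactional ORC table (Hive 3+).
CREATE MATERIALIZED VIEW tx_mv AS
SELECT cname, SUM(revenue) AS total_revenue
FROM tx_orc
GROUP BY cname;

-- Refresh the stored output after tx_orc has changed.
ALTER MATERIALIZED VIEW tx_mv REBUILD;

-- Show the plan Hive would use for a query, without running it.
EXPLAIN
SELECT cname, SUM(revenue) AS total_revenue
FROM tx_orc
GROUP BY cname;

-- Clean up both objects; the data in tx_orc is not affected.
DROP MATERIALIZED VIEW tx_mv;
DROP VIEW tx_view;
```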



- Data is moved from HDFS into the Hive table when we run the `LOAD DATA INPATH` command
- Data is copied from the local file system into Hive when we run the `LOAD DATA LOCAL INPATH` command
- The deserializer converts data from its stored row format into a row object
- The serializer converts a row object back into the stored row format
- **SODI**: Serializer on Output (write), Deserializer on Input (read)
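
The two `LOAD` variants, and the use of Hive as a file format converter, might look like the sketch below. The table names `tx_text` and `tx_orc_copy` and both file paths are hypothetical placeholders.

```sql
-- Hypothetical comma-delimited text table (names are illustrative).
CREATE TABLE IF NOT EXISTS tx_text (
  cname   STRING,
  revenue DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- LOAD from HDFS: the file is MOVED into the table's directory,
-- so it no longer exists at the source path afterwards.
LOAD DATA INPATH '/tmp/tx/part1.csv' INTO TABLE tx_text;

-- LOAD LOCAL: the file is COPIED from the local file system,
-- so the local original stays in place.
LOAD DATA LOCAL INPATH '/home/user/tx/part2.csv' INTO TABLE tx_text;

-- No validation happened during LOAD; the schema is applied only now, on read.
SELECT cname, revenue FROM tx_text LIMIT 5;

-- File format conversion: read the text rows and write them back as ORC.
CREATE TABLE tx_orc_copy STORED AS ORC AS
SELECT cname, revenue FROM tx_text;
```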

- UDF (user-defined function): can be used in Hive; works row by row, like upper-case, lower-case, etc. (one to one)
  - `add jar /home/jar_dir/jar_file.jar;`
  - `CREATE TEMPORARY FUNCTION fun_name AS 'jar_function_name';`
  - `SELECT fun_name(col_name), col2 FROM hivedb.hive_table;`
- UDAF (user-defined aggregate function): works on a column, like max(), min(), sum(), etc. (many to one)
- UDTF (user-defined table-generating function): turns one input row into multiple output rows (one to many)
- **Window functions**
  - Ranking (partition by + order by)
    - Row_number, rank, dense_rank, percent_rank, ntile
  - Analytical (partition by + order by)
    - Cume_dist, lag, lead
  - Aggregation (partition by is commonly used; order by is not required)
    - Min, max, avg, sum
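
The three groups of window functions can be illustrated with the queries below. They are a sketch that assumes the `tx_orc` table holds several rows per `cname`; the output column aliases are hypothetical.

```sql
-- Ranking: partition by + order by.
SELECT cname, revenue,
       ROW_NUMBER() OVER (PARTITION BY cname ORDER BY revenue DESC) AS row_num,
       RANK()       OVER (PARTITION BY cname ORDER BY revenue DESC) AS rnk,
       DENSE_RANK() OVER (PARTITION BY cname ORDER BY revenue DESC) AS dense_rnk
FROM tx_orc;

-- Analytical: partition by + order by.
-- LAG/LEAD return NULL for the first/last row of each partition.
SELECT cname, revenue,
       LAG(revenue, 1)  OVER (PARTITION BY cname ORDER BY revenue) AS prev_revenue,
       LEAD(revenue, 1) OVER (PARTITION BY cname ORDER BY revenue) AS next_revenue,
       CUME_DIST()      OVER (PARTITION BY cname ORDER BY revenue) AS cume_dist_val
FROM tx_orc;

-- Aggregation: no order by needed; every row keeps its detail columns
-- and also carries the aggregate of its whole partition.
SELECT cname, revenue,
       SUM(revenue) OVER (PARTITION BY cname) AS total_revenue,
       AVG(revenue) OVER (PARTITION BY cname) AS avg_revenue
FROM tx_orc;
```

Unlike `group by`, these queries do not collapse rows: each input row appears in the output with the window result attached.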

