initial commit
dbbaskette committed Oct 13, 2015
1 parent 0e2e3e8 commit 52248a1
Showing 6 changed files with 160 additions and 1 deletion.
Binary file added 2008_cms_data.csv.gz
Binary file added Pivotal - GPDB Overview and Demo.pdf
111 changes: 110 additions & 1 deletion README.md
@@ -1 +1,110 @@
# gpdb-sandbox-tutorials
![](https://drive.google.com/uc?export=&id=0B5ncp8FqIy8VOU5MUmh3MzMydlk)

# An Introduction and Greenplum Database Tutorial Using the Greenplum DB Sandbox VM

-----------------------------------

About the Greenplum Architecture
--------------------------------

Pivotal Greenplum Database is a massively parallel processing (MPP) database server with an architecture specially designed to manage large-scale analytic data warehouses and business intelligence workloads.

MPP (also known as a shared-nothing architecture) refers to systems with two or more processors that cooperate to carry out an operation, each with its own memory, operating system, and disks. Greenplum uses this high-performance system architecture to distribute the load of multi-terabyte data warehouses, and it can use all of a system's resources in parallel to process a query.

Greenplum Database is based on PostgreSQL open-source technology. It is essentially several PostgreSQL database instances acting together as one cohesive database management system (DBMS). It is based on PostgreSQL 8.2.15, and in most cases is very similar to PostgreSQL with regard to SQL support, features, configuration options, and end-user functionality. Database users interact with Greenplum Database as they would a regular PostgreSQL DBMS.

The internals of PostgreSQL have been modified or supplemented to support the parallel structure of Greenplum Database. For example, the system catalog, optimizer, query executor, and transaction manager components have been modified and enhanced to be able to execute queries simultaneously across all of the parallel PostgreSQL database instances. The Greenplum interconnect (the networking layer) enables communication between the distinct PostgreSQL instances and allows the system to behave as one logical database.

Greenplum Database also includes features designed to optimize PostgreSQL for business intelligence (BI) workloads. For example, Greenplum has added parallel data loading (external tables), resource management, query optimizations, and storage enhancements, which are not found in standard PostgreSQL. Many features and optimizations developed by Greenplum make their way into the PostgreSQL community. For example, table partitioning is a feature first developed by Greenplum, and it is now in standard PostgreSQL.

Greenplum Database stores and processes large amounts of data by distributing the data and processing workload across several servers or hosts. Greenplum Database is an array of individual databases based upon PostgreSQL 8.2 working together to present a single database image. The master is the entry point to the Greenplum Database system. It is the database instance to which clients connect and submit SQL statements. The master coordinates its work with the other database instances in the system, called segments, which store and process the data.

Figure 1. High-Level Greenplum Database Architecture
![](https://drive.google.com/uc?export=&id=0B5ncp8FqIy8VM2Y2bjh1VUx1c3M)

The following topics describe the components that make up a Greenplum Database system and how they work together.

**Greenplum Master**
The Greenplum Database master is the entry point to the Greenplum Database system. It accepts client connections and SQL queries and distributes work to the segment instances.

Greenplum Database end-users interact with Greenplum Database (through the master) as they would with a typical PostgreSQL database. They connect to the database using client programs such as psql or application programming interfaces (APIs) such as JDBC or ODBC.

The master is where the global system catalog resides. The global system catalog is the set of system tables that contain metadata about the Greenplum Database system itself. The master does not contain any user data; data resides only on the segments. The master authenticates client connections, processes incoming SQL commands, distributes workloads among segments, coordinates the results returned by each segment, and presents the final results to the client program.
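
For example, connecting to the sandbox master with `psql` looks like an ordinary PostgreSQL session. A minimal sketch, assuming the sandbox defaults for host, port, user, and database (substitute the values shown on the VM console):

```shell
# Connect to the Greenplum master (host/port/user/database are sandbox-default assumptions).
psql -h 127.0.0.1 -p 5432 -U gpadmin -d gpadmin

# Inside psql, standard PostgreSQL commands operate on the whole cluster:
#   \l                 -- list databases
#   \dt                -- list tables
#   SELECT version();  -- show the Greenplum/PostgreSQL version string
```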

**Greenplum Segments**
Greenplum Database segment instances are independent PostgreSQL databases that each store a portion of the data and perform the majority of query processing.

When a user connects to the database via the Greenplum master and issues a query, processes are created in each segment database to handle the work of that query. For more information about query processes, see About Greenplum Query Processing.

User-defined tables and their indexes are distributed across the available segments in a Greenplum Database system; each segment contains a distinct portion of data. The database server processes that serve segment data run under the corresponding segment instances. Users interact with segments in a Greenplum Database system through the master.

Segments run on servers called segment hosts. A segment host typically executes from two to eight Greenplum segments, depending on its CPU cores, RAM, storage, network interfaces, and workload. Segment hosts are expected to be identically configured. The key to obtaining the best performance from Greenplum Database is to distribute data and workloads evenly across a large number of equally capable segments so that all segments begin working on a task simultaneously and complete their work at the same time.
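
One illustrative way to see this distribution (a sketch using the `gp_segment_id` system column and the `cms` sample table created later in this tutorial) is to count rows per segment:

```sql
-- Count how many rows of a distributed table landed on each segment.
-- An even spread indicates a well-chosen distribution key.
SELECT gp_segment_id, count(*) AS row_count
FROM   cms
GROUP  BY gp_segment_id
ORDER  BY gp_segment_id;
```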

**Greenplum Interconnect**
The interconnect is the networking layer of the Greenplum Database architecture.

The interconnect refers to the inter-process communication between segments and the network infrastructure on which this communication relies. The Greenplum interconnect uses a standard 10-Gigabit Ethernet switching fabric.

By default, the interconnect uses User Datagram Protocol (UDP) to send messages over the network. The Greenplum software performs packet verification beyond what is provided by UDP. This means the reliability is equivalent to Transmission Control Protocol (TCP), and the performance and scalability exceed those of TCP. If the interconnect used TCP, Greenplum Database would have a scalability limit of 1,000 segment instances. With UDP as the current default protocol for the interconnect, this limit does not apply.
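
To confirm which protocol a cluster is using, one quick check (a sketch assuming the `gp_interconnect_type` server parameter and the `gpconfig` utility shipped with Greenplum) is:

```shell
# Run as gpadmin on the master host; reports the interconnect protocol
# configured for the master and all segment instances.
gpconfig -s gp_interconnect_type
```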

**Pivotal Query Optimizer**
The Pivotal Query Optimizer brings a state-of-the-art query optimization framework to Greenplum Database that is distinguished from other optimizers in several ways:

- Modularity. Pivotal Query Optimizer is not confined inside a single RDBMS. It is currently leveraged in both Greenplum Database and Pivotal HAWQ, but it can also be run as a standalone component to allow greater flexibility in adopting new backend systems and using the optimizer as a service. This also enables elaborate testing of the optimizer without going through the other components of the database stack.

- Extensibility. The Pivotal Query Optimizer has been designed as a collection of independent components that can be replaced, configured, or extended separately. This significantly reduces the development cost of adding new features and allows rapid adoption of emerging technologies. Within the Query Optimizer, the representation of the elements of a query has been separated from how the query is optimized. This lets the optimizer treat all elements equally and avoids the issues with the imposed order of optimization steps in multi-phase optimizers.

- Performance. The Pivotal Query Optimizer leverages a multi-core scheduler that can distribute individual optimization tasks across multiple cores to speed up the optimization process. This allows the Query Optimizer to apply all possible optimizations at the same time, which results in many more plan alternatives and a wider range of queries that can be optimized. For instance, when the Pivotal Query Optimizer was used with TPC-H Query 21, it generated 1.2 billion possible plans in 250 ms. This is especially important in Big Data analytics, where performance challenges are magnified by the volume of data that needs to be processed. A suboptimal optimization choice could very well lead to a query that simply runs forever.
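
As a hedged illustration of experimenting with the optimizer yourself (this assumes the `optimizer` session parameter that recent Greenplum releases use to switch between the Pivotal Query Optimizer and the legacy planner, plus the `cms` sample table from this tutorial):

```sql
-- Toggle the Pivotal Query Optimizer for the current session and
-- compare the plans it produces against the legacy planner.
SET optimizer = on;   -- use the Pivotal Query Optimizer
EXPLAIN SELECT count(*) FROM cms;

SET optimizer = off;  -- fall back to the legacy PostgreSQL-based planner
EXPLAIN SELECT count(*) FROM cms;
```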



Greenplum Database Tutorial
-----------------

This tutorial showcases how GPDB can address day-to-day tasks performed in typical DW/BI environments. It is designed to be used with the Greenplum Database Sandbox VM that is available for download.

The scripts/data for this tutorial are in the gpdb-sandbox virtual machine at /home/gpadmin. The repository is pre-cloned, but will update as the VM boots in order to provide the most recent version of these instructions.

- Import the GPDB Sandbox Virtual Machine
- Start the GPDB Sandbox Virtual Machine. Once the machine starts, you will see the following screen:
![](https://drive.google.com/uc?export=&id=0B5ncp8FqIy8VUUtkUERxbFNZd00)
This screen provides you all the information you need to interact with the VM.
- Username/Password combinations
- Management URLs
- IP address for SSH Connection

Interacting with the Sandbox from a new terminal is preferable, as it makes many of the operations simpler.


----------
Lesson 1: Parallel Data Loading
----------

In a large scale, multi-terabyte data warehouse, large amounts of data must be loaded within a relatively small maintenance window. Greenplum supports fast, parallel data loading with its external tables feature. Administrators can also load external tables in single row error isolation mode to filter bad rows into a separate error table while continuing to load properly formatted rows. Administrators can specify an error threshold for a load operation to control how many improperly formatted rows cause Greenplum to abort the load operation.
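
A sketch of what single-row error isolation looks like in practice (the table and file names mirror the ones used later in this lesson, and the exact `LOG ERRORS` syntax varies slightly between Greenplum releases):

```sql
-- Load through an external table, diverting malformed rows to an error table
-- instead of failing the whole load; abort only if more than 50 rows are bad.
CREATE EXTERNAL TABLE ext_cms_err (LIKE cms)
LOCATION ('gpfdist://localhost:8081/2008_cms_data.csv')
FORMAT 'csv' (header)
LOG ERRORS INTO cms_load_errors
SEGMENT REJECT LIMIT 50 ROWS;
```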

By using external tables in conjunction with Greenplum Database's parallel file server (gpfdist), administrators can achieve maximum parallelism and load bandwidth from their Greenplum Database system.

Figure 1. External Tables Using Greenplum Parallel File Server (gpfdist)
![](https://drive.google.com/uc?export=&id=0B5ncp8FqIy8VME5JMDZCNmE2cGs)
Another Greenplum utility, gpload, runs a load task that you specify in a YAML-formatted control file. You describe the source data locations, format, transformations required, participating hosts, database destinations, and other particulars in the control file and gpload executes the load. This allows you to describe a complex task and execute it in a controlled, repeatable fashion.
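
A sketch of a gpload control file for the sample data used in this tutorial; the connection settings, file path, and port are assumptions based on the sandbox defaults:

```yaml
# cms_load.yml -- hypothetical gpload control file for the sample CMS data
VERSION: 1.0.0.1
DATABASE: ditl
USER: gpadmin
HOST: localhost
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - localhost
         PORT: 8081
         FILE:
           - /home/gpadmin/gpdb-sandbox-dayinthelife/2008_cms_data.csv
    - FORMAT: csv
    - HEADER: true
    - ERROR_LIMIT: 50
   OUTPUT:
    - TABLE: cms
    - MODE: insert
```

Running `gpload -f cms_load.yml` would then execute the whole load, starting and stopping its own gpfdist instances as needed.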

This tutorial will demonstrate loading the sample 2008 CMS data into Greenplum Database using external tables and the gpfdist parallel file server.



1. Open a terminal and ssh into the sandbox machine.
`ssh gpadmin@X.X.X.X`


2. Start the Greenplum Database.
`./start_all.sh`

3. `cd gpdb-sandbox-dayinthelife`
4.


Originally the work of Brad Ganas

BIG Thanks to those who inspired this: Matt Neglay, Austin Rutherford, and others!!
40 changes: 40 additions & 0 deletions create_tables.sql
@@ -0,0 +1,40 @@
--------------------------------------------------------------------------------------
-- PART I - LOADING DATA.
--------------------------------------------------------------------------------------
-- create a database to work in.
create database ditl;

-- Drop these objects if they already exist in the database.
drop table if exists cms;
drop table if exists cms_part;
drop table if exists cms_qlz;
drop table if exists cms_zlib;
drop table if exists cms_zlib9;
drop table if exists wwearthquakes_lastwk;
drop table if exists cms_load_errors;
drop table if exists cms_bad_key;
drop external table if exists cms_backup;
drop external table if exists cms_export;
drop external table if exists ext_cms;
drop external table if exists ext_wwearthquakes_lastwk;
drop table if exists cms_seq;
drop table if exists cms_p0;
drop sequence if exists myseq;

-- Create the table to hold the CMS data from data.gov. We already know the layout.
drop table if exists cms;
CREATE TABLE cms
(
car_line_id character varying(20),
bene_sex_ident_cd numeric(20),
bene_age_cat_cd bigint,
car_line_icd9_dgns_cd character varying(10),
car_line_hcpcs_cd character varying(10),
car_line_betos_cd character varying(5),
car_line_srvc_cnt bigint,
car_line_prvdr_type_cd bigint,
car_line_cms_type_srvc_cd character varying(5),
car_line_place_of_srvc_cd bigint,
car_hcpcs_pmt_amt bigint
)
distributed by (car_line_id);
3 changes: 3 additions & 0 deletions ext_table.sql
@@ -0,0 +1,3 @@
-- Create an external table that 'points' to the source file.
drop external table if exists ext_cms;
create external table ext_cms (like cms) location ('gpfdist://localhost:8081/2008_cms_data.csv') format 'csv' (header);
7 changes: 7 additions & 0 deletions load_data.sh
@@ -0,0 +1,7 @@
# Kill and restart the gpfdist parallel file server, serving the tutorial directory on port 8081.
ps ax | grep gpfdist
pkill -9 gpfdist
gpfdist -d /home/gpadmin/gpdb-sandbox-dayinthelife/ -p 8081 -l /home/gpadmin/gpdb-sandbox-dayinthelife/gpfdist.log &
gunzip 2008_cms_data.csv.gz

