Optimised Hadoop scripts that run data analytics over the latest US Geological Survey data. Written mostly for personal use while learning Hadoop, so the extracted data has no real value in itself; the aim is rather a comparison with complex SQL queries.

jkcso/BigData-Analytics

Summary

Data Set

The tables of the usgs database, provided by the US Geological Survey, are available as a set of TSV files, each containing one table of data.
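For checking a table outside Pig, a TSV file can be read with a few lines of Python. This is only a sketch: the column names (`code`, `name`) and the assumption that the files carry no header row are illustrative and not taken from the data set.

```python
import csv
import io


def read_tsv(handle, columns):
    """Read tab-separated rows from an open text handle into dicts keyed by `columns`.

    `columns` is supplied by the caller because the files are assumed
    (not confirmed) to have no header row.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(columns, row)) for row in reader]


# Tiny in-memory sample standing in for a real table file; the layout is hypothetical.
sample = io.StringIO("1\tALABAMA\n2\tALASKA\n")
states = read_tsv(sample, ["code", "name"])
print(states[0]["name"])  # ALABAMA
```

With a real file, the same function can be called as `read_tsv(open(path, newline=""), columns)`, with `columns` matching that table's actual layout.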

Tables

state

Contains all states and administratively equivalent entities within the USA.

populated_place

Each state has a number of habitations recorded in the populated_place table.

feature

The type column of feature identifies the type of geographic feature, such as forest, dam or lake, and includes some features classified as populated places under type ppl.

Data Cleaning

Note that the data is not very ‘clean’: some foreign keys that might intuitively be expected to be present are missing, and there is a certain amount of inconsistency between the data found in the feature and populated_place tables.
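One way to see the missing foreign keys is a case-insensitive anti-join between two tables. The sketch below does this in plain Python on rows already loaded as dicts; the field names `state_name` (on feature rows) and `name` (on state rows) are assumptions made for illustration.

```python
def missing_states(feature_rows, state_rows):
    """Return upper-cased state names used in feature rows but absent from state rows.

    Comparison ignores case; empty or missing names are skipped.
    """
    known = {s["name"].upper() for s in state_rows}
    found = {f["state_name"].upper() for f in feature_rows if f.get("state_name")}
    return sorted(found - known)


# Hypothetical sample rows, not real survey data.
state_rows = [{"name": "ALABAMA"}, {"name": "ALASKA"}]
feature_rows = [
    {"state_name": "Alabama"},
    {"state_name": "Atlantis"},
    {"state_name": ""},
]
print(missing_states(feature_rows, state_rows))  # ['ATLANTIS']
```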

How to Run a Script

pig -x local q0.pig

Scripts

  1. A Pig script that writes a CSV file with the schema (state name) listing every state name in feature that has no corresponding record in state. Names are compared case-insensitively, on the assumption that all records in state are in upper case, and are returned in upper case, ordered by state name.

  2. A Pig script that writes a CSV file with the schema (state name,population,elevation) giving, for each state, the total population and the average elevation of all its populated places. The result is ordered by state name, and elevation is rounded to the nearest integer.

  3. A Pig script that writes a CSV file with the schema (state name,county,no ppl,no stream) giving the number of populated places and the number of streams recorded in feature for each county. The result is ordered by state name and county.

  4. A Pig script that writes a CSV file with the schema (state name,name,population) containing the state name and place name of each populated place, limited to the five largest populated places in each state. The result is ordered by state name, with the places in each state listed in descending order of population; ties in population are broken by name.
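The grouping and ordering rules of scripts 2 to 4 can be cross-checked on a small sample with a plain-Python reference. This is a sketch and not the Pig scripts themselves: it assumes rows are already joined so that each carries a `state_name`, that `population` and `elevation` are present and numeric, and that streams appear in feature under a type value `stream`; all of these field names and values other than `ppl` are hypothetical.

```python
import math
from collections import defaultdict


def group_by_state(rows):
    """Bucket rows by their (assumed) `state_name` field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["state_name"]].append(row)
    return groups


def population_and_elevation(places):
    """Script 2: (state name, total population, mean elevation rounded half up)."""
    result = []
    for state, rows in sorted(group_by_state(places).items()):
        total = sum(r["population"] for r in rows)
        mean = sum(r["elevation"] for r in rows) / len(rows)
        # floor(x + 0.5) rounds halves up; Python's round() rounds halves to even.
        result.append((state, total, math.floor(mean + 0.5)))
    return result


def county_counts(features):
    """Script 3: (state name, county, number of ppl, number of streams)."""
    counts = defaultdict(lambda: [0, 0])
    for f in features:
        key = (f["state_name"], f["county"])
        if f["type"] == "ppl":
            counts[key][0] += 1
        elif f["type"] == "stream":  # assumed type value for streams
            counts[key][1] += 1
    return [(s, c, ppl, stream) for (s, c), (ppl, stream) in sorted(counts.items())]


def top_places(places, n=5):
    """Script 4: the n largest places per state, ties in population broken by name."""
    result = []
    for state, rows in sorted(group_by_state(places).items()):
        ranked = sorted(rows, key=lambda r: (-r["population"], r["name"]))
        result.extend((state, r["name"], r["population"]) for r in ranked[:n])
    return result
```

Note that `county_counts` only reports counties with at least one populated place or stream; whether counties with neither should appear with zero counts is left open by the task description.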
