gdutils.extract
============

`ExtractTable` is a class in the `extract` module. It provides methods for extracting data from tabular data sources.

---

__Examples Setup__

The following commands are used for setting up the examples below. 

*Note:* The example input files were downloaded and converted from the [GeoJSON file](http://d2ad6b4ur7yvpq.cloudfront.net/naturalearth-3.3.0/ne_110m_land.geojson) referenced in the [geopandas IO docs](https://geopandas.org/io.html).

In [None]:
# Install ``gdutils`` package
!pip install git+https://github.com/KeiferC/gdutils.git > /dev/null

In [None]:
import gdutils.extract as et

import geopandas as gpd
import pandas as pd

---

Example 1. Extract a table 
-------------------------------

*Note*: `extract()` returns a ``geopandas GeoDataFrame``

__Example 1.1.__ Extract a table from a file


- Example 1.1.1. Extract from a shapefile

In [None]:
# Ex. 1.1.1

shp_path = 'example-inputs/example-shp/example.shp' # path to file containing table to extract
shp_et = et.ExtractTable(shp_path) # alternative: et.read_file(filepath)
shp_gdf = shp_et.extract() # extracts table as a geopandas GeoDataFrame

shp_gdf.head() # renders first 5 rows of table

- Example 1.1.2. Extract from a CSV

In [None]:
# Ex. 1.1.2

csv_path = 'example-inputs/example.csv'
csv_et = et.read_file(csv_path) # using alternative
csv_gdf = csv_et.extract()

csv_gdf.head()

- Example 1.1.3. Extract from an Excel file

In [None]:
# Ex. 1.1.3

excel_path = 'example-inputs/example.xlsx'
excel_gdf = et.read_file(excel_path).extract() # shorthand equivalent

excel_gdf.head()

- Example 1.1.4. Extract from a ZIP file

In [None]:
# Ex. 1.1.4

zip_path = 'example-inputs/example.zip'
zip_gdf = et.read_file(zip_path).extract()

zip_gdf.head()

__Example 1.2.__ Extract a table from a URL

In [None]:
# Ex. 1.2

url = 'http://d2ad6b4ur7yvpq.cloudfront.net/naturalearth-3.3.0/ne_110m_land.geojson' 
    # URL copied from https://geopandas.org/io.html
url_gdf = et.ExtractTable(url).extract()

url_gdf.head()

__Example 1.3.__ Extract a table from a `pandas DataFrame`

In [None]:
# Ex. 1.3

pandas_df = pd.read_csv(csv_path)
pandas_gdf = et.ExtractTable(pandas_df).extract()

pandas_gdf.head()

__Example 1.4.__ Extract a table from a `geopandas GeoDataFrame`

In [None]:
# Ex. 1.4

geopandas_gdf = et.ExtractTable(csv_gdf).extract()

geopandas_gdf.head()

Example 2. Extract a table with a selected index
-------------------------------------------------------------------------

__Example 2.1.__ Extract a table with a known column label as the index

In [None]:
# Ex. 2.1

known_column = 'featurecla'
known_column_gdf = et.ExtractTable(shp_path, column=known_column).extract() 
    # alternative: et.read_file(shp_path, column=known_column)

known_column_gdf.head()

__Example 2.2.__ Extract a table without a known column label as the index

In [None]:
# Ex. 2.2

unknown_column_et = et.ExtractTable(shp_path)
columns_list = unknown_column_et.list_columns() # returns a list of columns from which to choose
print(columns_list)

In [None]:
unknown_column_et.column = 'scalerank' # selects the 'scalerank' column as the index
unknown_column_gdf = unknown_column_et.extract()

unknown_column_gdf.head()
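
For readers more familiar with plain pandas, choosing an index column is roughly comparable to inspecting `DataFrame.columns` and then calling `set_index`. The sketch below illustrates that idea on a small made-up table. The rows are invented for illustration, and nothing here is a claim about how `gdutils` implements `column` internally.

```python
import pandas as pd

# Hypothetical stand-in for the example table; the values are invented.
df = pd.DataFrame({
    'scalerank': [1, 1, 0],
    'featurecla': ['Land', 'Land', 'Country'],
})

# Comparable to list_columns(): see which column labels are available.
columns_list = list(df.columns)
print(columns_list)

# Comparable to setting `column`: use one column's values as the row index.
indexed_df = df.set_index('scalerank')
print(indexed_df.head())
```

After `set_index`, the chosen column no longer appears among the regular columns; it labels the rows instead.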

Example 3. Extract a subtable
-----------------------------------

*Note*: The `counties.zip` file contains a shapefile of California county boundaries from the 2010 US decennial census. The shapefile was downloaded from the [NHGIS database](https://data2.nhgis.org).

In [None]:
counties_file = 'example-inputs/counties.zip'

__Example 3.1.__ Extract a subtable without a known column value

In [None]:
# Ex. 3.1

unknown_value_et = et.read_file(counties_file)
print(unknown_value_et.list_columns())

In [None]:
unknown_value_et.column = 'NAME10'
print(unknown_value_et.list_values()) # alternatively, use `list_values(unique=True)` to get unique values

In [None]:
unknown_value_et.value = 'Alameda' # also accepts a list, e.g. ['Alameda', 'Alpine']
unknown_value_gdf = unknown_value_et.extract()

unknown_value_gdf.head()

__Example 3.2.__ Extract a subtable with a known column value

In [None]:
# Ex. 3.2

known_value_gdf = et.read_file(counties_file, column='NAME10', value='Alameda').extract()

known_value_gdf.head()

__Example 3.3.__ Extract a subtable with known column values

In [None]:
# Ex. 3.3

known_values = ['Alameda', 'San Francisco', 'Napa']
known_values_gdf = et.read_file(counties_file, column='NAME10', value=known_values).extract()

known_values_gdf.head()
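
Conceptually, selecting a subtable by one or more values resembles a boolean row filter in pandas. The following is a minimal sketch, assuming a made-up table with a `NAME10` column; the `POP` column and its numbers are invented, and the code illustrates the idea rather than `gdutils` internals.

```python
import pandas as pd

# Hypothetical stand-in for the counties table; the 'POP' values are invented.
counties = pd.DataFrame({
    'NAME10': ['Alameda', 'Alpine', 'Napa', 'San Francisco'],
    'POP': [10, 20, 30, 40],
})

# Comparable to list_values(unique=True): distinct values in the chosen column.
unique_names = counties['NAME10'].unique().tolist()
print(unique_names)

# Comparable to passing a list as `value`: keep rows whose NAME10 matches any entry.
known_values = ['Alameda', 'San Francisco', 'Napa']
subtable = counties[counties['NAME10'].isin(known_values)]
print(subtable)
```

Note that `isin` keeps rows in their original table order, not in the order of the list of values.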

Example 4. Extract to a file
-------------------------------

In [None]:
!mkdir outputs # creates a folder called 'outputs'

output_path = 'outputs/output.shp' 
    # the output filetype depends on the provided extension
    # e.g. 'outputs/output.csv' writes to a CSV file
    # e.g. 'outputs/output.xlsx' writes to an Excel file
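
The extension-based choice of output format described in the comments above can be pictured as a lookup on the file suffix. The helper `guess_output_format` below is hypothetical and only mirrors the three extensions mentioned in this example; `gdutils` may support other formats and may make the decision differently.

```python
from pathlib import Path

# Hypothetical mapping that covers only the extensions mentioned in this example.
FORMATS = {
    '.shp': 'shapefile',
    '.csv': 'CSV',
    '.xlsx': 'Excel',
}

def guess_output_format(path):
    """Return a format name based on the file extension of `path`."""
    suffix = Path(path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f'unsupported extension: {suffix!r}')
    return FORMATS[suffix]

print(guess_output_format('outputs/output.shp'))
print(guess_output_format('outputs/output.csv'))
```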

__Example 4.1.__ Extract to file from a geopandas GeoDataFrame

In [None]:
# Ex. 4.1.

et.ExtractTable(known_values_gdf).extract_to_file(output_path)

In [None]:
# Let's look at the extracted file:
et.ExtractTable(output_path).extract().head()

__Example 4.2.__ Extract to file from input file without known values

In [None]:
# Ex. 4.2.

unknown_to_file_et = et.read_file(counties_file)
print(unknown_to_file_et.list_values(column='NAME10', unique=True))

In [None]:
unknown_to_file_et.column = 'NAME10'
unknown_to_file_et.value = ['Merced', 'Solano', 'Humboldt', 'Kings', 'Santa Cruz']
unknown_to_file_et.outfile = 'outputs/output.csv' # sets output path

unknown_to_file_et.extract_to_file()

In [None]:
# Let's look at the extracted file:
et.read_file('outputs/output.csv').extract().head()

__Example 4.3.__ Extract to file from input file with known columns and values

In [None]:
# Ex. 4.3.

to_file_values = ['Sacramento', 'Yolo', 'San Diego']

et.ExtractTable(counties_file, 'outputs/output.csv', 
                column='NAME10', value=to_file_values).extract_to_file()

In [None]:
# Let's look at the extracted file:
et.read_file('outputs/output.csv').extract().head()
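
Reading the output file back in, as done above, is a simple round-trip check. The same idea can be sketched in plain pandas, assuming a made-up table written to a temporary directory; the `POP` column and its values are invented for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical table standing in for an extracted subtable.
table = pd.DataFrame({
    'NAME10': ['Sacramento', 'Yolo', 'San Diego'],
    'POP': [1, 2, 3],
})

# Write the table to a CSV in a throwaway directory, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / 'output.csv'
    table.to_csv(out, index=False)  # index=False avoids writing an extra column
    round_trip = pd.read_csv(out)

print(round_trip.head())
```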

---

__Examples Cleanup__

The following commands are used to reset and clean up the examples above.

In [None]:
# Remove outputs
!rm -r outputs
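
The `!mkdir` and `!rm -r` shell commands used in these examples assume a Unix-like shell, and `mkdir` fails if the folder already exists. A portable alternative can be sketched with the Python standard library. The sketch works inside a temporary directory so that it never touches a real `outputs` folder.

```python
import shutil
import tempfile
from pathlib import Path

# Work inside a throwaway directory so this sketch never touches real files.
base = Path(tempfile.mkdtemp())
outputs = base / 'outputs'

# Like `mkdir outputs`, but does not fail if the folder already exists.
outputs.mkdir(parents=True, exist_ok=True)
outputs.mkdir(parents=True, exist_ok=True)  # second call is a no-op

# Like `rm -r outputs`, but does not fail if the folder is already gone.
shutil.rmtree(outputs, ignore_errors=True)
shutil.rmtree(outputs, ignore_errors=True)  # second call is a no-op

created_then_removed = not outputs.exists()
shutil.rmtree(base, ignore_errors=True)  # remove the throwaway directory itself
print(created_then_removed)
```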

In [None]:
# Uninstall Package
!pip uninstall -y gdutils

In [None]:
# Reset Jupyter Notebook IPython Kernel
# (relies on the `Jupyter` JavaScript object of the classic Notebook interface)
from IPython.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")