# Importing Data

> Creating and Testing the Python Intake Class with Pandas and Geopandas support.
- toc: true
- prettify: true
- default_exp: intaker
- audio: https://charleskarpati.com/audio/01_download_and_load.mp3

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/bnia/dataplay/main?filepath=%2Fnotebooks%2F01_Download_and_Load.ipynb)
[![Open In Colab](https://pete88b.github.io/fastpages/assets/badges/colab.svg)](https://colab.research.google.com/github/bnia/dataplay/blob/main/notebooks/01_Download_and_Load.ipynb)
[![GitHub](https://pete88b.github.io/fastpages/assets/badges/github.svg)](https://github.com/bnia/dataplay/tree/main/notebooks/01_Download_and_Load.ipynb)
[![Open Source Love svg3](https://badges.frapsoft.com/os/v3/open-source.svg?v=103)](https://github.com/ellerbrock/open-source-badges/)
<br>
[![NPM License](https://img.shields.io/npm/l/all-contributors.svg?style=flat)](https://github.com/bnia/dataplay/blob/main/LICENSE)
[![Active](http://img.shields.io/badge/Status-Active-green.svg)](https://bnia.github.io) 
[![Python Versions](https://img.shields.io/pypi/pyversions/dataplay.svg)](https://pypi.python.org/pypi/dataplay/)
[![GitHub last commit](https://img.shields.io/github/last-commit/bnia/dataplay.svg?style=flat)]()  

<details open>
<summary>

## About: 

</summary>

### What's inside?

In this notebook, we build and test a basic data intaker, the Intake class.

- A CSV file is loaded into pandas.
- Geospatial data is imported from an Esri endpoint and loaded into GeoPandas.

</details>
<details open>
<summary>

## The Function

</summary>

<pre>

In [37]:
#hide_input 
from dataplay import intaker
help(intaker)

Help on module dataplay.intaker in dataplay:

NAME
    dataplay.intaker

CLASSES
    builtins.object
        Intake
    
    class Intake(builtins.object)
     |  # The intaker class retrieves data into a pandas dataframe.
     |  # Can read in a CSV URL but uses dataplay.geom.readInGeometryData() for Geojson endpoints.
     |  # Otherwise this tool assumes shp or pgeojson files have geom='geometry', in_crs=2248. 
     |  # Depending on interactivity the values should be 
     |  # coerce fillna(-1321321321321325)
     |  
     |  Methods defined here:
     |  
     |  checkColumn(dataset, column)
     |      # a2. Returns Bool
     |  
     |  coerce(ds1, ds2, col1, col2, interactive)
     |      # b1. Used by Merge Lib. Returns Both Datasets and Coerce Status
     |  
     |  coerceDtypes(isNum, dt, ds, col, interactive)
     |  
     |  getAndCheck(url, col='geometry', interactive=False)
     |      # a1. Used by Merge Lib. Returns valid (df, column) or (df, False) or (False, False)

</pre>

In [1]:
#hide_input #hide_output
# Run this when in dev for autoreload of imported local module changes
%load_ext autoreload
# Reload all modules every time before executing the Python code typed.
# (The comment must sit on its own line: the magic parses trailing text as arguments.)
%autoreload 2


In [2]:
#hide_input #hide_output
import pandas as pd
# Truncate wide text columns to 20 characters when displaying frames
pd.set_option('display.max_colwidth', 20)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.precision', 2)

In [3]:
#hide_input #hide_output
# configures the shell to display the output of all expressions executed in a cell, not just the last one.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [4]:
#export #hide_input #hide_output
import geopandas as gpd
import numpy as np
import pandas as pd
# conditionally loaded ->  from dataplay import geoms

In [5]:
#export #collapse_input

# The Intake class retrieves data into a pandas DataFrame.
# It reads a CSV URL or path directly and uses dataplay.geoms.readInGeometryData() for GeoJSON endpoints.
# shp and pgeojson sources are assumed to have geom='geometry' and in_crs=2248.
# When merge keys are coerced, missing numeric values are filled with fillna(-1321321321321325).
class Intake:

  # 1. Returns df or False. Calls readInGeometryData or pulls the CSV directly.
  #    In interactive mode, calls itself again until something valid is given.
  def getData(url, interactive=False):
    escapeQuestionFlags = ["no", '', 'none']
    if ( Intake.isPandas(url) ): return url
    if (str(url).lower() in escapeQuestionFlags ): return False
    # if interactive: print('Getting Data From: ', url)
    try:
      isGeom = False
      if 'csv' in url:
        df = pd.read_csv(url)
        # A CSV with a 'geometry' column is treated as geospatial data.
        if Intake.checkColumn(df, 'geometry'):
          isGeom = True
      if isGeom or any(ele in url for ele in ['pgeojson', 'shp', 'geojson']):
        # Imported here so dataplay.geoms is only loaded when it is needed.
        from dataplay import geoms
        df = geoms.readInGeometryData(url=url, porg=False, geom='geometry', lat=False, lng=False, revgeocode=False, save=False, in_crs=2248, out_crs=False)
      return df
    except Exception:
      # Any failure (bad path, unsupported source type, parse error) lands here.
      # Catching Exception rather than everything lets Ctrl+C stop the prompt loop.
      if interactive: return Intake.getData(input("Error: Try Again? ( URL / PATH or 'NO' / <Empty> ) "), interactive)
      return False

  # 1ai. Returns Bool. The name is a misnomer: tuples also pass, not only (Geo)DataFrames.
  def isPandas(df): return isinstance(df, (pd.DataFrame, gpd.GeoDataFrame, tuple))


  # a1. Used by Merge Lib. Returns valid (df, column) or (df, False) or (False, False).
  def getAndCheck(url, col='geometry', interactive=False):
    df = Intake.getData(url, interactive) # Returns False or df
    if not Intake.isPandas(df):
      if interactive: print('No data was retrieved.', df)
      return False, False
    if isinstance(col, list):
      # Every column in the list must exist; stop at the first failure.
      for colm in col:
        if not Intake.getAndCheckColumn(df, colm, interactive):
          if interactive: print('Exiting. Error on the column: ', colm)
          return df, False
      return df, col
    newcol = Intake.getAndCheckColumn(df, col, interactive) # Returns False or col
    if not newcol:
      if interactive: print('Exiting. Error on the column: ', col)
      return df, False
    return df, newcol

  # a2. Returns Bool
  def checkColumn(dataset, column): return {column}.issubset(dataset.columns)

  # b1. Used by Merge Lib. Returns Both Datasets and Coerce Status
  def coerce(ds1, ds2, col1, col2, interactive):
    ds1, ldt, lIsNum = Intake.getdTypeAndFillNum(ds1, col1, interactive)
    ds2, rdt, rIsNum  = Intake.getdTypeAndFillNum(ds2, col2, interactive)

    ds2 = Intake.coerceDtypes(lIsNum, rdt, ds2, col2, interactive)
    ds1 = Intake.coerceDtypes(rIsNum, ldt, ds1, col1, interactive)

    # Return the data and the coerce status
    return ds1, ds2, (ds1[col1].dtype == ds2[col2].dtype)

  # b2. Used by Merge Lib. Fills NaN in a numeric column with a sentinel number.
  def getdTypeAndFillNum(ds, col, interactive):
    dt = ds[col].dtype
    isNum = dt == 'float64' or dt == 'int64'
    if isNum: ds[col] = ds[col].fillna(-1321321321321325)
    return ds, dt, isNum

  # b3. Used by Merge Lib. Converts an object column to float when the other key is numeric.
  def coerceDtypes(isNum, dt, ds, col, interactive):
    if isNum and dt == 'object':
      if(interactive): print('Converting Key from Object to Int' )
      ds[col] = pd.to_numeric(ds[col], errors='coerce')
      if interactive: print('Converting Key from Int to Float' )
      ds[col] = ds[col].astype(float)
    return ds

  # a3. Returns False or col. In interactive mode, calls itself until a valid column is entered.
  def getAndCheckColumn(df, col, interactive=False):
    if Intake.checkColumn(df, col): return col
    if not interactive: return False
    print("Invalid column given: ", col)
    print(df.columns)
    print("Please enter a new column from the list above.")
    col = input("Column Name: ")
    return Intake.getAndCheckColumn(df, col, interactive)
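
The coercion step (b1 to b3) is easier to follow on a small example. The sketch below is not the class itself: it replays the same pandas calls on two made-up frames whose key column is float on one side and text on the other. The frame names, column names, and values are assumptions chosen for illustration.

```python
import pandas as pd

SENTINEL = -1321321321321325  # the fill value used by getdTypeAndFillNum

# Toy frames (made up): the same key stored as floats on the left
# and as strings on the right.
left = pd.DataFrame({"tract": [101.0, 102.0, None], "pop": [10, 20, 30]})
right = pd.DataFrame({"tract": ["101", "102", "x"], "area": [1.5, 2.5, 3.5]})

# b2: a numeric key gets its missing values replaced by the sentinel.
left["tract"] = left["tract"].fillna(SENTINEL)

# b3: the text key is parsed as numbers (unparseable values become NaN)
# and then cast to float so both keys share a dtype.
right["tract"] = pd.to_numeric(right["tract"], errors="coerce").astype(float)

# b1: the "coerce status" is simply whether the two dtypes now agree.
same_dtype = left["tract"].dtype == right["tract"].dtype
merged = left.merge(right, on="tract", how="inner")
print(same_dtype, len(merged))
```

Filling the numeric side with a sentinel keeps its missing keys from lining up with NaN keys on the other side during a merge, since pandas matches NaN keys with each other.
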

</details>
<details open>
<summary>

## Try it

</summary>

In [17]:
#hide_input #hide_output 
%pwd

'c:\\Users\\charl\\Documents\\GitHub\\karpatic\\src\\ipynb\\dataplay'

In [33]:
#hide_input 
url = "C:/Users/charl/Documents/GitHub/karpatic/src/ipynb/dataplay/tracts_data.csv"
print(f"<a href='{url}'>Link</a>")

<a href='C:/Users/charl/Documents/GitHub/karpatic/src/ipynb/dataplay/tracts_data.csv'>Link</a>


In [34]:
df = Intake.getData(url, interactive=True)

The same call works with a URL instead of a local path:

In [35]:
#hide_input 
url = 'https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson'
print(f"<a href='{url}'>Link</a>")

<a href='https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson'>Link</a>


In [10]:
rdf = Intake.getData(url) 
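
The two calls above take different paths inside getData, and the choice is made purely from substrings of the source string. A minimal sketch of that decision follows. The helper name pick_loader and its return labels are hypothetical and not part of dataplay, and the example.com URL is a placeholder.

```python
# Hypothetical helper, not part of dataplay: it only mirrors the substring
# checks that Intake.getData applies to decide how to load a source.
GEOMETRY_MARKERS = ("pgeojson", "shp", "geojson")

def pick_loader(source: str) -> str:
    """Return a label for the loading path a source string would take."""
    if any(marker in source for marker in GEOMETRY_MARKERS):
        return "geometry"     # handed to geoms.readInGeometryData in the notebook
    if "csv" in source:
        return "csv"          # read with pandas.read_csv in the notebook
    return "unsupported"      # getData would end up returning False

print(pick_loader("tracts_data.csv"))
print(pick_loader("https://example.com/FeatureServer/0/query?f=pgeojson"))
print(pick_loader("notes.txt"))
```

Two details are worth keeping in mind. The match is case-sensitive and looks anywhere in the string, so a path such as DATA.CSV would not be recognized. A CSV that contains a geometry column is also routed to the geometry reader, which this string-only sketch cannot detect.
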

Using Esri and the Geoms handler directly:

In [11]:
#hide_input 
import dataplay

In [29]:
#hide_input 
geoloom_gdf_url = "https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Geoloom_Crowd/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson"
print(f"<a href='{geoloom_gdf_url}'>Link</a>")

<a href='https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Geoloom_Crowd/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson'>Link</a>


In [12]:
geoloom_gdf = dataplay.geoms.readInGeometryData(
    url=geoloom_gdf_url, 
    porg=False, 
    geom='geometry', 
    lat=False, 
    lng=False, 
    revgeocode=False,  
    save=False, 
    in_crs=4326, 
    out_crs=False
)

In [13]:
#hide_input 
geoloom_gdf = geoloom_gdf.dropna(subset=['geometry'])
# geoloom_gdf.head(1)

Again but with the Intake class:

In [31]:
#hide_input 
url = 'https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Geoloom_Crowd/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson'
print(f"<a href='{url}'>Link</a>")

<a href='https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Geoloom_Crowd/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson'>Link</a>


In [14]:
Geoloom_Crowd, rcol = Intake.getAndCheck(url)

The getAndCheck function loads the data and checks that a required field is present in a single call.

In [36]:
#hide_input 
url = 'https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson'
print(f"<a href='{url}'>Link</a>")

<a href='https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhpov/FeatureServer/0/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson'>Link</a>


In [24]:
#collapse_output
Hhpov, rcol = Intake.getAndCheck(url, 'hhpov19', True) 
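
The required-field check used above boils down to asking whether a name is among the frame's columns. Here is a small stand-alone sketch of that idea, assuming only pandas. The helper missing_columns is hypothetical (it is not a dataplay function), and the frame holds placeholder values rather than real Hhpov data.

```python
import pandas as pd

def missing_columns(df: pd.DataFrame, required) -> list:
    """Hypothetical helper: list the required column names absent from df."""
    names = [required] if isinstance(required, str) else list(required)
    return [name for name in names if name not in df.columns]

# Placeholder frame with the same column names as the notebook's dataset.
frame = pd.DataFrame({"CSA2010": ["A", "B"], "hhpov19": [1.0, 2.0]})

print(missing_columns(frame, "hhpov19"))
print(missing_columns(frame, ["CSA2010", "hhpov20"]))
```

Returning the list of missing names, rather than a single Bool, makes it easier to report every problem at once when several columns are required.
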

In [None]:
#hide_input 
Hhpov = Hhpov[['CSA2010', 'hhpov15', 'hhpov16', 'hhpov17', 'hhpov18', 'hhpov19']]

In [None]:
# import csv
# Hhpov.to_csv('Hhpov.csv', encoding="utf-8", index=False, quoting=csv.QUOTE_ALL)

We could also retrieve the data from a local file:

In [None]:
# rdf = Intake.getData('Hhpov.csv')
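
The save-then-reload round trip can be tried without touching the disk. In this sketch an in-memory buffer stands in for Hhpov.csv and pandas.read_csv stands in for Intake.getData. The frame holds placeholder values, not real Hhpov data.

```python
import csv
import io

import pandas as pd

# Placeholder frame with two of the notebook's column names.
frame = pd.DataFrame({"CSA2010": ["A", "B"], "hhpov19": [1.5, 2.5]})

# Write with every field quoted, as in the commented to_csv call above.
buffer = io.StringIO()
frame.to_csv(buffer, index=False, quoting=csv.QUOTE_ALL)

# Rewind and read the text back, as reading the saved file would.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(buffer.getvalue().splitlines()[0])
print(restored["hhpov19"].tolist())
```

Passing index=False keeps the row index out of the file, so the reloaded frame has the same columns as the original.
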

</details>