## Background

This notebook reads a data set according to the definitions in its schema file.  Critically, the schema file is the same one that Apache Spark (Scala) uses.  If a team uses a mix of Python, Scala, and Spark for

* data modelling & analysis
* feature engineering
* data architecture engineering
* and more

it helps if a single schema file serves every case.

<br>

For this illustration the data set is

* https://raw.githubusercontent.com/miscellane/hub/develop/data/countries/us/environment/toxins/chemicals/chemicalsEnvirofacts.csv

and its schema is outlined in

* https://raw.githubusercontent.com/miscellane/hub/develop/data/countries/us/environment/toxins/chemicals/chemicalsEnvirofacts.json
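
For orientation, the sketch below parses a small schema written in the JSON layout that Spark's `StructType` serialises to: a top-level `fields` list whose entries carry `name`, `type`, `nullable`, and `metadata`. The field names and values are hypothetical; they are not taken from chemicalsEnvirofacts.json.

```python
import json

# Hypothetical schema text in the Spark StructType JSON layout.
# The column names are illustrative only.
schema_text = """
{
  "type": "struct",
  "fields": [
    {"name": "chem_id", "type": "string", "nullable": true, "metadata": {}},
    {"name": "active_date", "type": "integer", "nullable": false, "metadata": {}},
    {"name": "threshold", "type": "double", "nullable": true, "metadata": {}}
  ]
}
"""

schema = json.loads(schema_text)

# Each entry of `fields` describes one column of the data set
for field in schema['fields']:
    print(field['name'], field['type'], field['nullable'])
```

Because the layout is plain JSON, Python needs no Spark installation to read it.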



<br>
<br>

## Libraries

In [1]:
import pandas as pd
import numpy as np
import requests
import json

<br>
<br>

## Data Sets & Schemata

### Schema

Schema URL

In [2]:
schemaurl = 'https://raw.githubusercontent.com/miscellane/hub/develop/data/countries/'\
'us/environment/toxins/chemicals/chemicalsEnvirofacts.json'

<br>

Schema Reading

In [3]:
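try:
    # Request the schema file; an unsuccessful status code raises an HTTPError
    req = requests.get(url=schemaurl)
    req.raise_for_status()
except requests.exceptions.RequestException:
    # Re-raise the original exception unchanged
    raise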

<br>

Schema Content

In [4]:
content = json.loads(req.content)

<br>

Focus on key `fields`

In [5]:
variables = content['fields']
fields = pd.DataFrame.from_dict(data=variables, orient='columns')
fields.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   name      20 non-null     object
 1   type      20 non-null     object
 2   nullable  20 non-null     bool  
 3   metadata  20 non-null     object
dtypes: bool(1), object(3)
memory usage: 628.0+ bytes


<br>

Mapping `type` to Python types

* http://spark.apache.org/docs/2.4.8/sql-reference.html

In [6]:
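def mappings(k: pd.Series) -> pd.Series:
    # Spark SQL type names and their Python equivalents; a type name
    # outside this dictionary raises a KeyError
    dictionary = {'integer': int, 'string': str, 'double': float}

    return k.apply(lambda x: dictionary[x])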
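
A schema may declare a type that the dictionary does not cover. The sketch below is one possible variant that reports every unsupported type name in a single error; the name `safe_mappings` and the sample type names are hypothetical.

```python
import pandas as pd


def safe_mappings(k: pd.Series) -> pd.Series:
    """Map Spark SQL type names to Python types, rejecting unknown names."""
    dictionary = {'integer': int, 'string': str, 'double': float}

    # Collect the type names that the dictionary does not cover
    unknown = sorted(set(k) - set(dictionary))
    if unknown:
        raise ValueError(f'Unsupported schema types: {unknown}')

    return k.apply(lambda x: dictionary[x])


# Hypothetical type names
print(safe_mappings(pd.Series(['integer', 'string', 'double'])).tolist())
```

Extending the dictionary, e.g. for `long` or `boolean`, is then an explicit decision and not a surprise at read time.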


In [7]:
fields.loc[:, 'localtype'] = mappings(fields['type'])
fields.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   name       20 non-null     object
 1   type       20 non-null     object
 2   nullable   20 non-null     bool  
 3   metadata   20 non-null     object
 4   localtype  20 non-null     object
dtypes: bool(1), object(4)
memory usage: 788.0+ bytes


<br>

Creating the attributes for `pd.read_csv`

In [8]:
usecols = fields['name'].values
dtype = dict(zip(fields['name'], fields['localtype']))
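
To see what these two attributes do, the sketch below applies the same construction to a small in-memory CSV. The schema table, column names, and CSV text are hypothetical.

```python
import io

import pandas as pd

# Hypothetical schema table, shaped like `fields` after the type mapping
example_fields = pd.DataFrame({
    'name': ['chem_id', 'active_date'],
    'localtype': [str, int]
})

example_usecols = example_fields['name'].values
example_dtype = dict(zip(example_fields['name'], example_fields['localtype']))

# Hypothetical CSV text; the `notes` column is absent from the schema
text = 'chem_id,active_date,notes\n0001,1987,a\n0002,1995,b\n'

example = pd.read_csv(io.StringIO(text), header=0,
                      usecols=example_usecols, dtype=example_dtype)

print(example.columns.tolist())
print(example['chem_id'].tolist())
```

Only the columns named in the schema are read, and `chem_id` keeps its leading zeros because it is read as a string instead of being inferred as an integer.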

<br>
<br>

### Data

Hence, the data can be read with the schema's column names and types enforced

In [9]:
dataurl = 'https://raw.githubusercontent.com/miscellane/hub/develop/data/countries/us/environment/toxins/chemicals/chemicalsEnvirofacts.csv'

In [10]:
try:
    data = pd.read_csv(filepath_or_buffer=dataurl, header=0, usecols=usecols, dtype=dtype, encoding='utf-8')
except OSError as err:
    raise Exception(err.strerror) from err
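
The schema also carries a `nullable` flag per field, which `pd.read_csv` does not enforce. A minimal sketch of a follow-up check, assuming a schema table with `name` and `nullable` columns; the function name `nullability_violations` and the sample frames are hypothetical.

```python
import pandas as pd


def nullability_violations(data: pd.DataFrame, fields: pd.DataFrame) -> list:
    """Names of columns declared non-nullable that still hold missing values."""
    required = fields.loc[~fields['nullable'], 'name']
    return [name for name in required if data[name].isna().any()]


# Hypothetical schema table and data
example_fields = pd.DataFrame({'name': ['chem_id', 'active_date'],
                               'nullable': [False, True]})
example_data = pd.DataFrame({'chem_id': ['0001', None],
                             'active_date': [1987, 1995]})

print(nullability_violations(data=example_data, fields=example_fields))
```

An empty list means the data agrees with the schema's nullability declarations.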

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 823 entries, 0 to 822
Data columns (total 20 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   TRI_CHEM_INFO.TRI_CHEM_ID                822 non-null    object 
 1   TRI_CHEM_INFO.CHEM_NAME                  822 non-null    object 
 2   TRI_CHEM_INFO.ACTIVE_DATE                823 non-null    int64  
 3   TRI_CHEM_INFO.INACTIVE_DATE              823 non-null    int64  
 4   TRI_CHEM_INFO.CAAC_IND                   823 non-null    int64  
 5   TRI_CHEM_INFO.CARC_IND                   823 non-null    int64  
 6   TRI_CHEM_INFO.R3350_IND                  823 non-null    int64  
 7   TRI_CHEM_INFO.METAL_IND                  823 non-null    int64  
 8   TRI_CHEM_INFO.FEDS_IND                   823 non-null    int64  
 9   TRI_CHEM_INFO.CLASSIFICATION             823 non-null    int64  
 10  TRI_CHEM_INFO.PBT_START_YEAR             21 non-nu