# Nested CSV 

A simple test exploring how to map nested CSV documents into pandas.

The open question is how this would work with RML[Morph].
Exploding the dataframe, if that is the correct route, adds one row per nested value, so the frame ends up with a large number of extra rows.
As the RML process moves through those rows, it regenerates the triples for the unchanged columns each time, producing many duplicate triples.

This is fine, since duplicate triples are easy to remove, either through string matching or 
by routing the data through a triplestore like Oxigraph, which absorbs the duplicates into 
single representations. We generate extra triples we don't need, but the computational burden 
is light and this is likely an easier route than the alternatives.
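As a rough illustration of the string-matching route, the sketch below builds N-Triples lines from two exploded rows and drops exact duplicates. The `http://example.org/` namespace, the predicate names, and the `row_to_triples` helper are placeholders invented for this example; they are not taken from the RML mapping.

```python
# Hypothetical exploded rows: the Title triple repeats for every Coverage value.
rows = [
    {"ID": 2, "Title": "2019 Florida Census Tracts", "Coverage2": "United States"},
    {"ID": 2, "Title": "2019 Florida Census Tracts", "Coverage2": "Florida"},
]

BASE = "http://example.org/"  # placeholder namespace, not from the project


def row_to_triples(row):
    """Emit one N-Triples line per mapped column, as a row-by-row mapper would."""
    subject = f"<{BASE}dataset/{row['ID']}>"
    return [
        f'{subject} <{BASE}title> "{row["Title"]}" .',
        f'{subject} <{BASE}coverage> "{row["Coverage2"]}" .',
    ]


all_triples = [t for row in rows for t in row_to_triples(row)]

# dict.fromkeys drops exact string duplicates while keeping first-seen order.
unique_triples = list(dict.fromkeys(all_triples))

print(len(all_triples), "generated,", len(unique_triples), "unique")
```

String matching only catches duplicates that serialize identically, so it assumes the generator writes each triple the same way every time.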

Note: an _optimized_ approach would be to copy a subset of columns, including the nested value 
columns, into a new dataframe, generate triples from that, and merge them with the triples from 
a main frame that leaves out the nested value columns. See the df_subset example below.
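One way to picture that split, as a sketch only: keep a main frame without the nested column and a narrow side frame holding the key plus the exploded values. The small frame below is a hypothetical stand-in for the metadata CSV, and the names `df_main`, `df_coverage`, and `nested_cols` are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the metadata frame; only the shape matters here.
df = pd.DataFrame({
    "ID": [1, 2],
    "Title": ["HOLC map", "2019 Florida Census Tracts"],
    "Rights": ["Public Domain", "Public Domain"],
    "Coverage": ["United States", "United States|Florida"],
})

nested_cols = ["Coverage"]

# Main frame: one row per record, nested columns left out.
df_main = df.drop(columns=nested_cols)

# Side frame: just the key plus the nested column, one row per value.
df_coverage = (
    df[["ID", "Coverage"]]
    .assign(Coverage=lambda d: d["Coverage"].str.split("|"))
    .explode("Coverage")
    .reset_index(drop=True)
)

print(df_main.shape, df_coverage.shape)
```

Mapping each frame separately means the Title and Rights triples are generated once per record, and only the Coverage triples scale with the number of nested values.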

In [2]:
import pandas as pd
import json

In [21]:
df = pd.read_csv("../inputs/GDSC_metadata.csv")

In [22]:
df

Unnamed: 0,ID,Title,Creator,Url,DOI,Publisher,Rights,License,Restrictions,Coverage,...,Index Fields,Bash ETL,SQL Transform,Last Updated,Update Frequency,Last Accessed,Columns,ETL Documentation,Notes,TODO
0,1.0,Miami-Dade Home Owner's Loan Corporation (HOLC...,University of Richmond's Digital Scholarship Lab,https://services.arcgis.com/jIL9msH9OI208GCb/a...,,University of Richmond's Digital Scholarship Lab,Public Domain,,Use items owned by Esri in ArcGIS Online in co...,United States,...,--,TBD,TBD,2021-08-17,Never,--,OBJECTID|HOLC_grade|city|HOLC|Code|ST |CitySta...,Direct download as ESRI json from ESRI service...,,update to reflect Miami only\ninclude other ci...
1,2.0,2019 Florida Census Tracts,Department of Commerce|U.S. Census Bureau|Geog...,https://www.census.gov/geographies/mapping-fil...,,Department of Commerce|U.S. Census Bureau|Geog...,Public Domain,This Software was created by U.S. Government e...,,United States|Florida,...,--,TBD,TBD,2021-09-22,Never,--,"STATEFP(string,2)|COUNTYFP(string,3)|TRACTCE(s...",ogr2ogr shapefile to postGIS,,
2,3.0,2019 Miami-Dade ACS 5 Year Estimates - Tract L...,Department of Commerce|U.S. Census Bureau,https://www.census.gov/data/developers/data-se...,,Department of Commerce|U.S. Census Bureau,Public Domain,This Software was created by U.S. Government e...,,United States|Florida,...,--,TBD,TBD,2021-06-02,Never,--,https://api.census.gov/data/2019/acs/acs5/vari...,The custom function acs_customgregate is given...,please use this as a template for all ACS esti...,
3,4.0,2019 Florida ACS 5 Year Estimates - Tract Leve...,Department of Commerce|U.S. Census Bureau,https://www.census.gov/data/developers/data-se...,,Department of Commerce|U.S. Census Bureau,Public Domain,This Software was created by U.S. Government e...,,United States|Florida,...,--,TBD,TBD,2022-11-22,Never,--,https://api.census.gov/data/2019/acs/acs5/vari...,The custom function acs_customgregate is given...,update using 2019 DVMT ACS as model,
4,5.0,2020 Florida Census Tracts - Florida,Department of Commerce|U.S. Census Bureau|Geog...,https://www.census.gov/geographies/mapping-fil...,,Department of Commerce|U.S. Census Bureau|Geog...,Public Domain,This Software was created by U.S. Government e...,,United States|Florida,...,--,TBD,TBD,2021-09-22,Never,--,"STATEFP(string,2)|COUNTYFP(string,3)|TRACTCE(s...",ogr2ogr shapefile to postGIS,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
143,,,,,,,,,,,...,,,,,,,,,,
144,,,,,,,,,,,...,,,,,,,,,,
145,,,,,,,,,,,...,,,,,,,,,,
146,,,,,,,,,,,...,,,,,,,,,,


In [27]:
# Grab a few columns to work with, including one nested column (Coverage)
columns_subset = ['ID', 'Title', 'Rights', 'Coverage']
df_subset = df[columns_subset]


In [24]:
df_subset.head(10)

Unnamed: 0,ID,Title,Rights,Coverage
0,1.0,Miami-Dade Home Owner's Loan Corporation (HOLC...,Public Domain,United States
1,2.0,2019 Florida Census Tracts,Public Domain,United States|Florida
2,3.0,2019 Miami-Dade ACS 5 Year Estimates - Tract L...,Public Domain,United States|Florida
3,4.0,2019 Florida ACS 5 Year Estimates - Tract Leve...,Public Domain,United States|Florida
4,5.0,2020 Florida Census Tracts - Florida,Public Domain,United States|Florida
5,6.0,2020 Florida ACS 5 Year Estimates - Tract Leve...,Public Domain,United States|Florida
6,7.0,2020 Florida ACS 5 Year Estimates - Tract Leve...,Public Domain,United States|Florida
7,8.0,2021 Florida Census Tracts - Florida,Public Domain,United States|Florida
8,9.0,2021 Miami-Dade ACS 5 Year Estimates - Tract L...,Public Domain,United States|Florida|Miami-Dade County
9,10.0,2021 Florida ACS 5 Year Estimates - Block Grou...,Public Domain,United States|Florida


In [25]:
# assign() returns a new frame, which avoids the SettingWithCopyWarning that
# direct column assignment on the df_subset slice raises
df_subset = df_subset.assign(Coverage2=df_subset['Coverage'].str.split('|'))
df_sub_explode = df_subset.explode('Coverage2')


In [26]:
df_sub_explode.head(10)

Unnamed: 0,ID,Title,Rights,Coverage,Coverage2
0,1.0,Miami-Dade Home Owner's Loan Corporation (HOLC...,Public Domain,United States,United States
1,2.0,2019 Florida Census Tracts,Public Domain,United States|Florida,United States
1,2.0,2019 Florida Census Tracts,Public Domain,United States|Florida,Florida
2,3.0,2019 Miami-Dade ACS 5 Year Estimates - Tract L...,Public Domain,United States|Florida,United States
2,3.0,2019 Miami-Dade ACS 5 Year Estimates - Tract L...,Public Domain,United States|Florida,Florida
3,4.0,2019 Florida ACS 5 Year Estimates - Tract Leve...,Public Domain,United States|Florida,United States
3,4.0,2019 Florida ACS 5 Year Estimates - Tract Leve...,Public Domain,United States|Florida,Florida
4,5.0,2020 Florida Census Tracts - Florida,Public Domain,United States|Florida,United States
4,5.0,2020 Florida Census Tracts - Florida,Public Domain,United States|Florida,Florida
5,6.0,2020 Florida ACS 5 Year Estimates - Tract Leve...,Public Domain,United States|Florida,United States
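The full frame ends in rows that are entirely empty, and exploding those would carry empty Coverage2 values into the mapping. A minimal sketch of one way to tidy this up first, assuming `ID` is always present on real records and that values may carry stray whitespace around the `|` delimiter (neither is confirmed for this dataset); `df_demo` is a hypothetical stand-in:

```python
import pandas as pd

# Hypothetical stand-in: two real records plus one all-empty trailing row.
df_demo = pd.DataFrame({
    "ID": [1.0, 2.0, None],
    "Coverage": ["United States", "United States | Florida", None],
})

cleaned = (
    df_demo.dropna(subset=["ID"])  # drop rows with no identifier
    .assign(Coverage2=lambda d: d["Coverage"].str.split("|"))
    .explode("Coverage2")
)

# Trim whitespace left around the delimiter.
cleaned["Coverage2"] = cleaned["Coverage2"].str.strip()

print(cleaned["Coverage2"].tolist())
```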


## Example for JSON with normalize and meta

In [None]:
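# `j` is assumed to be a JSON string defined elsewhere, shaped like:
# {"books": [{"title": ..., "author": {"first_name": ..., "last_name": ...},
#             "publisher": {"name": ..., "location": ...}}]}
# Alternatively, load the data from a file:
# with open('books.json') as f:
#     data = json.load(f)

data = json.loads(j)

# Use pd.json_normalize to flatten the nested author/publisher objects into
# dotted column names. meta only takes effect together with record_path,
# so it is left out of this call.
df = pd.json_normalize(data['books'])

# Rename the columns by name (not by position) for clarity
df = df.rename(columns={
    'title': 'Title',
    'author.first_name': 'Author_First_Name',
    'author.last_name': 'Author_Last_Name',
    'publisher.name': 'Publisher_Name',
    'publisher.location': 'Publisher_Location',
})

# Display the DataFrame
df.head()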
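For reference, `meta` comes into play when `record_path` points at a nested list and fields from the parent object should be repeated on every child row. The sketch below assumes a different, hypothetical layout in which each publisher holds a list of books; the data and the `df_books` name are invented for illustration.

```python
import pandas as pd

# Hypothetical data: each publisher holds a nested list of books.
data = {
    "publishers": [
        {
            "name": "Example Press",
            "location": "Miami",
            "books": [
                {"title": "Book A", "author": {"first_name": "Ada", "last_name": "Lee"}},
                {"title": "Book B", "author": {"first_name": "Bo", "last_name": "Cruz"}},
            ],
        }
    ]
}

# One row per book; the publisher fields listed in meta repeat on each row.
df_books = pd.json_normalize(
    data["publishers"],
    record_path="books",
    meta=["name", "location"],
    meta_prefix="publisher.",
)

print(df_books.columns.tolist())
```

This is the JSON counterpart of exploding a nested CSV column: the parent values repeat once per nested record.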