# Real Example: Housing Data

This example uses housing data to demonstrate the `hepfile.csv_tools` module!

In [1]:
import hepfile as hf
import pandas as pd

Before moving on with the tutorial, download the following datasets with the `wget` command. This only needs to be run once.

Also, review the fundamentals page of the hepfile readthedocs site for some context: https://hepfile.readthedocs.io/en/latest/fundamentals.html

In [2]:
!wget -nc -O 'People.csv' 'https://raw.githubusercontent.com/mattbellis/hepfile/main/docs/example_nb/People.csv'
!wget -nc -O 'Vehicles.csv' 'https://raw.githubusercontent.com/mattbellis/hepfile/main/docs/example_nb/Vehicles.csv'
!wget -nc -O 'Residences.csv' 'https://raw.githubusercontent.com/mattbellis/hepfile/main/docs/example_nb/Residences.csv'

File ‘People.csv’ already there; not retrieving.
File ‘Vehicles.csv’ already there; not retrieving.
File ‘Residences.csv’ already there; not retrieving.
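
If `wget` is not available (for example on some Windows setups), the same "download once" step can be sketched with the Python standard library. This is only an illustration: the helper names `fetch_once` and `fetch_all` are made up for this example and are not part of hepfile. The URLs are the ones used in the cell above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Same location as the wget commands above.
BASE_URL = 'https://raw.githubusercontent.com/mattbellis/hepfile/main/docs/example_nb/'
FILENAMES = ['People.csv', 'Vehicles.csv', 'Residences.csv']

def fetch_once(url, dest):
    """Download url to dest unless dest already exists (like wget -nc).

    Returns True if a download happened, False if the file was already there.
    Hypothetical helper for this tutorial, not a hepfile function.
    """
    if Path(dest).exists():
        return False
    urlretrieve(url, dest)
    return True

def fetch_all():
    """Fetch every tutorial CSV into the current directory, skipping existing files."""
    return {name: fetch_once(BASE_URL + name, name) for name in FILENAMES}
```

Calling `fetch_all()` once in a notebook cell should leave the three CSV files in the working directory; later calls skip files that are already present.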


The next step is to define a list of all of these file paths.

In [3]:
filepaths = ['People.csv', 'Vehicles.csv', 'Residences.csv']

For the sake of completeness, let's take a look at these datasets. Note that `DataFrame.to_markdown` needs the optional `tabulate` package to be installed.

In [4]:
import pandas as pd
for f in filepaths:
    print(f + ':\n')
    print(pd.read_csv(f).to_markdown())
    print()

People.csv:

|    |   Household ID | First name   | Last name   | Gender ID   |   Age |   Height |   Yearly income | Highest degree/grade   |
|---:|---------------:|:-------------|:------------|:------------|------:|---------:|----------------:|:-----------------------|
|  0 |              0 | blah         | blah        | M           |    54 |      159 |           75000 | BS                     |
|  1 |              0 | blah         | blah        | F           |    52 |      140 |           80000 | MS                     |
|  2 |              0 | blah         | blah        | NB          |    18 |      168 |               0 | 12                     |
|  3 |              0 | blah         | blah        | F           |    14 |      150 |               0 | 9                      |
|  4 |              1 | blah         | blah        | M           |    32 |      159 |           49000 | BS                     |
|  5 |              1 | blah         | blah        | M           |    27 |      140 

There are a lot of different columns in these three CSV files, but they are all connected by the common key `Household ID`. This is similar to a database structure, where each table has a different length but all of them are linked by a common ID. This makes these files perfect for being stored in a hepfile!
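
To make the "common key" idea concrete, here is a minimal standard-library sketch that groups the rows of two small tables by `Household ID`. The CSV text is made up for illustration (the `Bedrooms` column in particular is invented and is not in the real `Residences.csv`), and `group_by_key` is a hypothetical helper, not how hepfile does it internally.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up stand-ins for two of the CSV files (not the real data).
people_csv = """Household ID,Age
0,54
0,52
1,32
"""
residences_csv = """Household ID,Bedrooms
0,3
1,2
"""

def group_by_key(csv_text, key):
    """Return {key value: {column: [values, ...]}} for one CSV table.

    Values stay as strings because csv.DictReader does no type conversion.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            grouped[row[key]][column].append(value)
    return grouped

tables = {'People': people_csv, 'Residences': residences_csv}
households = {name: group_by_key(text, 'Household ID') for name, text in tables.items()}

# Household "0" has two people but only one residence row.
print(len(households['People']['0']['Age']))
print(len(households['Residences']['0']['Bedrooms']))
```

Each household ends up with a different number of entries per table, which is the kind of ragged, per-ID structure the tutorial goes on to store.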

First, to view these CSV files as one awkward array in the structure of a hepfile, you can use the `hepfile.csv_tools.csv_to_awkward` function. All we have to do is provide it with the CSV file paths and the common key.

In [5]:
awk = hf.csv_tools.csv_to_awkward(filepaths, common_key='Household ID')
awk.show()

{'People.csv': [{'Household ID': [0, 0, 0, 0], 'First name': [...], ...}, ...],
 'Vehicles.csv': [{'Household ID': [0, 0, ..., 0], ...}, {...}, {...}],
 'Residences.csv': [{'Household ID': [0], ...}, {...}, {...}, {...}]}


Note how the groups are simply named after the files that we used to create the awkward array. This is the default behavior of `csv_to_awkward`, but it usually isn't preferable. To fix this, we can pass the `group_names` argument in the function call.

In [6]:
awk = hf.csv_tools.csv_to_awkward(filepaths, common_key='Household ID', group_names=['People', 'Vehicles', 'Residences'])
awk.show()

{People: [{'Household ID': [0, 0, 0, 0], 'First name': [...], ...}, ..., {...}],
 Vehicles: [{'Household ID': [0, 0, ..., 0, 0], ...}, {...}, {...}],
 Residences: [{'Household ID': [0], 'House/apartment/condo': ..., ...}, ...]}
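
When the desired group names are just the file names without the `.csv` extension, as they are here, the list does not have to be typed by hand. A small sketch using only `pathlib` from the standard library:

```python
from pathlib import Path

filepaths = ['People.csv', 'Vehicles.csv', 'Residences.csv']

# Path.stem drops any directory part and the final extension.
group_names = [Path(f).stem for f in filepaths]
print(group_names)  # ['People', 'Vehicles', 'Residences']
```

The resulting list can then be passed as `group_names=group_names` in the call shown above.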


If we want to go straight to writing a hepfile instead of just creating an awkward array of the data, we can use the `hepfile.csv_tools.csv_to_hepfile` function. It takes many of the same options as `csv_to_awkward`.

In [10]:
outfilename, hepfile = hf.csv_tools.csv_to_hepfile(filepaths, common_key='Household ID', group_names=['People', 'Vehicles', 'Residences'])
print()
print('#########################################')
print(f'Output File Name: {outfilename}')

Adding group People
Adding a counter for People as nPeople
Adding dataset Household ID to the dictionary under group People.
Adding dataset First name to the dictionary under group People.
Adding dataset Last name to the dictionary under group People.
Adding dataset Gender ID to the dictionary under group People.
Adding dataset Age to the dictionary under group People.
Adding dataset Height to the dictionary under group People.
Adding dataset Yearly income to the dictionary under group People.
----------------------------------------------------
Slashes / are not allowed in dataset names
Replacing / with - in dataset name Highest degree/grade
The new name will be Highest degree-grade
----------------------------------------------------
Adding dataset Highest degree-grade to the dictionary under group People.
Adding group Vehicles

Notice how the output file name is the name of the first CSV file with `csv` replaced by `h5`. Sometimes this works, but other times you may want to provide a more specific output file name. Use the `outfile` argument to do this.

In [11]:
outfilename, hepfile = hf.csv_tools.csv_to_hepfile(filepaths, common_key='Household ID', outfile='test.h5', group_names=['People', 'Vehicles', 'Residences'])
print()
print('#########################################')
print(f'Output File Name: {outfilename}')

Adding group People
Adding a counter for People as nPeople
Adding dataset Household ID to the dictionary under group People.
Adding dataset First name to the dictionary under group People.
Adding dataset Last name to the dictionary under group People.
Adding dataset Gender ID to the dictionary under group People.
Adding dataset Age to the dictionary under group People.
Adding dataset Height to the dictionary under group People.
Adding dataset Yearly income to the dictionary under group People.
----------------------------------------------------
Slashes / are not allowed in dataset names
Replacing / with - in dataset name Highest degree/grade
The new name will be Highest degree-grade
----------------------------------------------------
Adding dataset Highest degree-grade to the dictionary under group People.
Adding group Vehicles
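
Two naming rules show up in this section: slashes in dataset names are replaced with `-`, and the default output file name is the first CSV file name with `csv` replaced by `h5`. The sketch below mirrors those two rules in plain Python so you can predict the names ahead of time. Both helpers are hypothetical and written only from the behavior described here; they are not hepfile's own code.

```python
from pathlib import Path

def sanitize_dataset_name(name):
    """Replace slashes with dashes, as the log messages above describe."""
    return name.replace('/', '-')

def default_outfile(filepaths):
    """Name of the first CSV file with its extension swapped for .h5."""
    return str(Path(filepaths[0]).with_suffix('.h5'))

print(sanitize_dataset_name('Highest degree/grade'))
print(default_outfile(['People.csv', 'Vehicles.csv', 'Residences.csv']))
```

This is mainly useful when looking datasets up later: the column `Highest degree/grade` from `People.csv` has to be requested under its renamed form.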