# EX08: Data Wrangling

In this exercise, you will define and use functions that are commonly useful when _wrangling_ data. Analysis frequently requires your data to be organized in specific ways, and that organization is rarely the exact "shape" the data is stored in (such as a CSV table). Data _wrangling_ is the process of loading, converting, and reorganizing data so that you can analyze it.

In [None]:
__author__ = ""

You will implement the utility functions for this exercise in the `data_utils.py` file found in the `exercises/ex08` directory. As you now know, when you import a module in a running Python program, the module is evaluated only once. Since your Jupyter Notebook _kernel_ keeps running the entire time you are working on functions in `data_utils.py`, we will use a special extension to automatically reload any changes you make _and save_ in the modules you import. The special commands in the cell below turn this feature on.

In [1]:
%reload_ext autoreload
%autoreload 2
print("Autoreload of imported modules enabled. Be sure to save your work in other modules!")

Autoreload of imported modules enabled. Be sure to save your work in other modules!


Data files will be stored in the `data` directory of the workspace. This Notebook is located in the `exercises/ex08` directory. To _navigate_ from this directory to the `data` directory, you would need to go "two directories up" and then "into the `data` directory". The constant `DATA_DIRECTORY` defined below uses the convention of two dots (`..`) to refer to "one directory up", so it is a `str` that references the `data` directory _relative_ to this exercise's directory.

Then, another constant is established referencing the path to the data file you will use to test your functions in this exercise.

In [1]:
DATA_DIRECTORY = "../../data"
DATA_FILE_PATH = f"{DATA_DIRECTORY}/nc_durham_2015_march_21_to_26.csv"
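To see how the relative path resolves, here is a minimal sketch. The absolute location `/workspace/exercises/ex08` is a hypothetical stand-in for wherever your workspace lives; only the `../../data` part comes from this exercise.

```python
import posixpath

# Hypothetical absolute location of this exercise's directory (illustration only).
notebook_dir = "/workspace/exercises/ex08"
data_directory = "../../data"

# Joining the two and normalizing collapses each ".." into "one directory up".
resolved = posixpath.normpath(posixpath.join(notebook_dir, data_directory))
print(resolved)  # /workspace/data
```

Each `..` cancels one directory name to its left, which is why two of them are needed to climb out of `exercises/ex08`.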

## Part 0. Reading Data from a Stored CSV File into Memory

In this part of the exercise, you will implement utility functions to read a CSV file from your computer's hard-drive storage into your running program's (Jupyter kernel's) memory. Once in memory, computations over the data set are very fast.

By default, a CSV file is read row by row. Storing these rows as a list of "row" dictionaries is one way of _representing_ tabular data.
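As a small illustration of this row-oriented representation, the sketch below builds a table by hand. The column names and values are made up for the example and are not taken from the exercise's data file.

```python
# A tiny made-up table with two columns, "name" and "age".
# Each row is a dictionary mapping column names to str values.
rows: list[dict[str, str]] = [
    {"name": "Ada", "age": "36"},
    {"name": "Grace", "age": "45"},
]

first_row = rows[0]
print(first_row["name"])  # Ada
print(len(rows))  # 2
```

Notice that every value is a `str`, even the ones that look like numbers. Converting them to other types is a separate wrangling step.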

### 0.0) Implement the `read_csv_rows` Function

Complete the implementation of the `read_csv_rows` function in `data_utils.py` and be sure to save your work when making changes in that file _before_ re-evaluating the cell below to test it.

Purpose: Read an entire CSV of data into a `list` of rows, each row represented as `dict[str, str]`.

* Function Name: `read_csv_rows`
* Parameter: 
    1. `str` path to CSV file
* Return Type: `list[dict[str, str]]` 

Implementation hint: refer back to the code you wrote in lecture on 10/19 for reading a CSV file. We give you the code for this function.
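For reference, one possible shape for this function is sketched below. It assumes the standard library's `csv.DictReader` is used to turn each line into a dictionary keyed by the header row; the parameter name `filename` and the local variable names are illustrative choices and may differ from the version written in lecture.

```python
from csv import DictReader


def read_csv_rows(filename: str) -> list[dict[str, str]]:
    """Read the rows of a CSV file into a list of "row" dictionaries."""
    result: list[dict[str, str]] = []
    # newline="" lets the csv module handle line endings itself.
    with open(filename, "r", encoding="utf8", newline="") as file_handle:
        # DictReader uses the first line of the file as the column names.
        csv_reader = DictReader(file_handle)
        for row in csv_reader:
            result.append(row)
    return result
```

The `with` statement closes the file automatically once all rows have been read.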

There _should be_ 294 rows and 29 columns read from the `nc_durham_2015_march_21_to_26.csv` stops file. Additionally, the column names should print below those stats.

In [1]:
from data_utils import read_csv_rows
data_rows: list[dict[str, str]] = read_csv_rows(DATA_FILE_PATH)

if len(data_rows) == 0:
    print("Go implement read_csv_rows in data_utils.py")
    print("Be sure to save your work before re-evaluating this cell!")
else:
    print(f"Data File Read: {DATA_FILE_PATH}")
    print(f"{len(data_rows)} rows")
    print(f"{len(data_rows[0].keys())} columns")
    print(f"Column names: {data_rows[0].keys()}")

ModuleNotFoundError: No module named 'data_utils'

### 0.2) `columnar` Function

Define and implement this function in `data_utils.py`.

Purpose: _Transform_ a table represented as a list of rows (e.g. `list[dict[str, str]]`) into one represented as a dictionary of columns (e.g. `dict[str, list[str]]`).

Why is this function useful? Many types of analysis are much easier to perform column-wise.

* Function Name: `columnar`
* Parameter: `list[dict[str, str]]` - a "table" organized as a list of rows
* Return Type: `dict[str, list[str]]` - a "table" organized as a dictionary of columns

Implementation strategy: Establish an empty dictionary to hold the column-oriented table you are building up to ultimately return. Loop through each of the column names in the first row of the parameter. Get a list of each column's values via your `column_values` function defined previously. Then, associate the column name with the list of its values in the dictionary you established. After looping through every column name, return the dictionary.
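The strategy above can be sketched as follows. This is one possible approach, not the required solution. The `column_values` helper shown here is an assumed stand-in for the one you defined previously (its parameter names are guesses), and the early return for an empty table is an added assumption that the strategy does not mention.

```python
def column_values(table: list[dict[str, str]], column: str) -> list[str]:
    """Assumed helper: collect every value found in a single column."""
    result: list[str] = []
    for row in table:
        result.append(row[column])
    return result


def columnar(row_table: list[dict[str, str]]) -> dict[str, list[str]]:
    """Transform a row-oriented table into a column-oriented table."""
    result: dict[str, list[str]] = {}
    # Added assumption: an empty table has no first row, so return early.
    if len(row_table) == 0:
        return result
    first_row = row_table[0]
    # Looping over a dict visits its keys, which are the column names here.
    for column in first_row:
        result[column] = column_values(row_table, column)
    return result


table = [{"a": "1", "b": "2"}, {"a": "3", "b": "4"}]
print(columnar(table))  # {'a': ['1', '3'], 'b': ['2', '4']}
```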