# Data Science, Fall 2024


[Acknowledgments Page](https://ds100.org/fa23/acks/)

A demo of data cleaning and exploratory data analysis using the CDC Tuberculosis data and the Mauna Loa CO2 data.

In [213]:
import numpy as np
import pandas as pd

In [214]:
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 9)

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
pd.set_option('display.float_format', '{:.2f}'.format)

# Silence some spurious seaborn warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

## Structure: Different File Formats

There are many file types for storing structured data: CSV, TSV, JSON, XML, ASCII, SAS...
* Documentation will be your best friend to understand how to process many of these file types.
* In lecture, we will cover TSV and JSON since pandas supports them out of the box.

### TSV

**TSV** (Tab-Separated Values) files are very similar to CSVs, but now items are delimited by tabs.

The `pd.read_csv` function also reads in TSVs if we specify the **delimiter** with parameter `sep='\t'` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)).

In [215]:
# Read TSV file (tab-delimited, so we pass sep='\t')
tb = pd.read_csv('./data/cdc_tuberculosis.tsv', sep='\t')
tb.head()

Unnamed: 0.1,Unnamed: 0,No. of TB cases,Unnamed: 2,Unnamed: 3,TB incidence,Unnamed: 5,Unnamed: 6
0,U.S. jurisdiction,2019,2020,2021,2019.0,2020.0,2021.0
1,Total,8900,7173,7860,2.71,2.16,2.37
2,Alabama,87,72,92,1.77,1.43,1.83
3,Alaska,58,58,58,7.91,7.92,7.92
4,Arizona,183,136,129,2.51,1.89,1.77


*Side note*: there was a question last time on how pandas differentiates a comma delimiter vs. a comma within the field itself, e.g., `8,900`. Check out the documentation for the `quotechar` parameter.
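
As a rough illustration of the side note, the sketch below reads a tiny made-up CSV string (not the CDC file) in which `8,900` is wrapped in double quotes. `quotechar` tells pandas which character protects a field, and `thousands` is an additional `read_csv` parameter that parses the protected text as a number. The column names and values are assumptions chosen for the demo.

```python
import io
import pandas as pd

# Hypothetical CSV text: the field "8,900" contains a comma, so it is quoted.
csv_text = 'jurisdiction,cases\nTotal,"8,900"\nAlabama,87\n'

# quotechar='"' is the default: commas inside the quotes are not delimiters.
# thousands=',' then lets pandas parse the quoted text 8,900 as the integer 8900.
quoted = pd.read_csv(io.StringIO(csv_text), quotechar='"', thousands=',')
print(quoted)
```

Without the quotes around `8,900`, the same row would be split into three fields instead of two.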

### JSON
The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date.


We can download this file, saving it as a JSON (note the source URL file type).

In the interest of **reproducible data science** we will download the data programmatically.  We have defined a helper function below.  I can then reuse this helper function in many different notebooks.

In [216]:
import requests
from pathlib import Path
import time
def fetch_and_cache(data_url, file, data_dir="data", force=False):
    """
    Download and cache a url and return the file object.

    data_url: the web address to download
    file: the file in which to save the results.
    data_dir: (default="data") the location to save the data
    force: if true the file is always re-downloaded

    return: The pathlib.Path object representing the file.
    """

    ### BEGIN SOLUTION
    data_dir = Path(data_dir)
    data_dir.mkdir(exist_ok = True)
    file_path = data_dir / Path(file)
    # If the file already exists and we want to force a download then
    # delete the file first so that the creation date is correct.
    if force and file_path.exists():
        file_path.unlink()
    if force or not file_path.exists():
        print('Downloading...', end=' ')
        resp = requests.get(data_url)
        with file_path.open('wb') as f:
            f.write(resp.content)
        print('Done!')
        last_modified_time = time.ctime(file_path.stat().st_mtime)
    else:
        last_modified_time = time.ctime(file_path.stat().st_mtime)
        print("Using cached version that was downloaded (local time):", last_modified_time)
    return file_path
    ### END SOLUTION


In [217]:
covid_file = fetch_and_cache(
    "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
    "confirmed-cases.json",
    force=False)
covid_file          # a file path wrapper object

Using cached version that was downloaded (local time): Sat Oct 26 10:03:30 2024


WindowsPath('data/confirmed-cases.json')

#### File size

Often, I like to start my analysis by getting a rough estimate of the size of the data.  This will help inform the tools I use and how I view the data.  If it is relatively small I might use a text editor or a spreadsheet to look at the data.  If it is larger, I might jump to more programmatic exploration or even use distributed computing tools.

However, here we will use Python tools to probe the file.

Since these seem to be text files I might also want to investigate the number of lines, which often corresponds to the number of records.

In [218]:
# os.path.getsize(covid_file) would return the file size in bytes; this cell only counts lines
import os

# Total Count of lines
with open(covid_file, "r") as f:
    print(covid_file, "is", sum(1 for l in f), "lines.")

data\confirmed-cases.json is 2107 lines.
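
The cell above reports the line count only. A minimal sketch of getting both the size in bytes and the line count is shown below, using `Path.stat().st_size` from the standard library. The helper name `describe_file` and the throwaway demo file are assumptions made for illustration; in the notebook you would pass `covid_file` instead.

```python
from pathlib import Path
import tempfile

def describe_file(path):
    """Return (size in bytes, number of lines) for a text file."""
    path = Path(path)
    size_bytes = path.stat().st_size      # size on disk, in bytes
    with path.open("r") as f:
        n_lines = sum(1 for _ in f)       # count lines without loading the whole file
    return size_bytes, n_lines

# Demonstrate on a small throwaway file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.json"
    demo.write_bytes(b'{"a": 1}\n{"b": 2}\n')
    size_bytes, n_lines = describe_file(demo)
    print(f"{demo.name} is {size_bytes} bytes and {n_lines} lines.")
```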


### EDA: Digging into JSON

Python has relatively good support for JSON data since it closely matches the internal Python object model.  In the following cell we import the entire JSON data file into a Python dictionary using the `json` package.

In [219]:
# load JSON file
import json

with open(covid_file, 'r') as f:
    covid_json = json.load(f)
covid_json

{'meta': {'view': {'id': 'xn6j-b766',
   'name': 'COVID-19 Confirmed Cases',
   'assetType': 'dataset',
   'attribution': 'City of Berkeley',
   'averageRating': 0,
   'category': 'Health',
   'createdAt': 1587074071,
   'description': 'Counts of confirmed COVID-19 cases among Berkeley residents by date. As of 6/21/22, this dataset will be updated weekly instead of daily. As of 11/14/22, this dataset only includes PCR cases.',
   'displayType': 'table',
   'downloadCount': 4524,
   'hideFromCatalog': False,
   'hideFromDataJson': False,
   'locked': False,
   'newBackend': True,
   'numberOfComments': 0,
   'oid': 37306599,
   'provenance': 'official',
   'publicationAppendEnabled': False,
   'publicationDate': 1623695944,
   'publicationGroup': 17032857,
   'publicationStage': 'published',
   'rowsUpdatedAt': 1721778125,
   'rowsUpdatedBy': 'g3qt-vv5v',
   'tableId': 18345932,
   'totalTimesRated': 0,
   'viewCount': 30867,
   'viewLastModified': 1721778124,
   'viewType': 'tabular',


The `covid_json` variable is now a dictionary encoding the data in the file:

In [220]:
# find type of your file
type(covid_json)

dict

#### Examine what keys are in the top level json object

We can list the keys to determine what data is stored in the object.

In [221]:
# identify keys() of JSON file
covid_json.keys()

dict_keys(['meta', 'data'])

**Observation**: The JSON dictionary contains a `meta` key which likely refers to metadata (data about the data).  Metadata is often maintained alongside the data and can be a good source of additional information.

<br/>

We can investigate the metadata further by examining the keys associated with it.

In [222]:
# Further explore meta key()
covid_json['meta'].keys()

dict_keys(['view'])

The `meta` key contains another dictionary called `view`.  This likely refers to metadata about a particular "view" of some underlying database.  We will learn more about views when we study SQL later in the class.

In [223]:
# Further explore view key()
covid_json['meta']['view'].keys()

dict_keys(['id', 'name', 'assetType', 'attribution', 'averageRating', 'category', 'createdAt', 'description', 'displayType', 'downloadCount', 'hideFromCatalog', 'hideFromDataJson', 'locked', 'newBackend', 'numberOfComments', 'oid', 'provenance', 'publicationAppendEnabled', 'publicationDate', 'publicationGroup', 'publicationStage', 'rowsUpdatedAt', 'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount', 'viewLastModified', 'viewType', 'approvals', 'clientContext', 'columns', 'grants', 'metadata', 'owner', 'query', 'rights', 'tableAuthor', 'tags', 'flags'])

Notice that this is a nested/recursive data structure.  As we dig deeper we reveal more and more keys and the corresponding data:

```
covid_json
|-> meta
    |-> view
        | -> id
        | -> name
        | -> attribution
        ...
        | -> description
        ...
        | -> columns
        ...
|-> data
    | ... (haven't explored yet)
```
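
Rather than calling `.keys()` one level at a time, a small recursive helper can print this kind of outline for us. The sketch below is one possible way to do it; the function name `key_tree` and the toy dictionary are assumptions, and in the notebook you would pass `covid_json` instead.

```python
def key_tree(obj, max_depth=2, indent=0):
    """Return one outline line per dictionary key, descending at most max_depth levels."""
    lines = []
    if not isinstance(obj, dict) or max_depth == 0:
        return lines                      # stop at non-dict values or at the depth limit
    for key, value in obj.items():
        lines.append(" " * indent + "|-> " + str(key))
        lines.extend(key_tree(value, max_depth - 1, indent + 4))
    return lines

# Toy stand-in with the same top-level shape as the file explored above.
toy = {
    "meta": {"view": {"id": "xn6j-b766", "name": "COVID-19 Confirmed Cases"}},
    "data": [["row-1", "2019-12-01T00:00:00", "0", "0"]],
}
tree_lines = key_tree(toy, max_depth=2)
print("\n".join(tree_lines))
```

Limiting the depth keeps the printout readable for deeply nested files; lists such as `data` are not descended into.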

There is a key called `description` in the `view` sub-dictionary.  This likely contains a description of the data:

In [224]:
# use description key from view dictionary
covid_json['meta']['view']['description']

'Counts of confirmed COVID-19 cases among Berkeley residents by date. As of 6/21/22, this dataset will be updated weekly instead of daily. As of 11/14/22, this dataset only includes PCR cases.'


#### Examining the Data Field for Records

We can look at a few entries in the `data` field. This is what we'll load into Pandas.


In [225]:
# explore data key and print some entries
for index, row in enumerate(covid_json['data'][:3]):
    print(f"{index}|{row}")

0|['row-rb9f.5nek-8ine', '00000000-0000-0000-4662-FC7183571076', 0, 1721778125, None, 1721778125, None, '{ }', '2019-12-01T00:00:00', '0', '0']
1|['row-vztw-e5xz~k7e6', '00000000-0000-0000-4572-4B7999484114', 0, 1721778125, None, 1721778125, None, '{ }', '2019-12-02T00:00:00', '0', '0']
2|['row-bjvt.2dfe.85rq', '00000000-0000-0000-7CFB-15321FF8BFED', 0, 1721778125, None, 1721778125, None, '{ }', '2019-12-03T00:00:00', '0', '0']


Observations:
* These look like equal-length records, so maybe `data` is a table!
* But what does each of the values in a record mean? Where can we find column headers?

Back to the metadata.

#### Columns Metadata

Another potentially useful key in the metadata dictionary is `columns`.  Its value is a list:

In [226]:
# check type of columns key
type(covid_json['meta']['view']['columns'])

list

Let's go back to the file explorer.

Based on the contents of this key, what are reasonable names for each column in the `data` table?

#### Summary of exploring the JSON file

1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
1. Self-documenting data can be helpful since it maintains its own description, and these descriptions are more likely to be updated as the data changes.

### JSON with pandas

After our above EDA, let's finally go about loading the data (not the metadata) into a pandas dataframe.

In the following block of code we:
1. Translate the JSON records into a dataframe:

    * fields: `covid_json['meta']['view']['columns']`
    * records: `covid_json['data']`
    
1. Remove columns that have no metadata description.  This would be a bad idea in general but here we remove these columns since the above analysis suggests that they are unlikely to contain useful information.
1. Examine the `tail` of the table.

In [227]:
fields = covid_json['meta']['view']['columns']
records = covid_json['data']

In [228]:
# Collect the column names stored in the metadata
column_names = [field['name'] for field in fields]
print(column_names)

['sid', 'id', 'position', 'created_at', 'created_meta', 'updated_at', 'updated_meta', 'meta', 'Date', 'New Cases', 'Cumulative Cases']


In [229]:
covid_data=pd.DataFrame(records)

In [230]:
# Assign column titles taken from the metadata, then inspect the last rows
covid_data.columns = [field['name'] for field in fields]
covid_data.tail()

Unnamed: 0,sid,id,position,created_at,created_meta,updated_at,updated_meta,meta,Date,New Cases,Cumulative Cases
1690,row-d28u_ew2y~ms8f,00000000-0000-0000-7C6A-161A637B9C10,0,1721778125,,1721778125,,{ },2024-07-17T00:00:00,6,25198
1691,row-nmtf.ke2i~yb3k,00000000-0000-0000-04BA-221AC1453106,0,1721778125,,1721778125,,{ },2024-07-18T00:00:00,5,25203
1692,row-4y8w-m68e~d3t5,00000000-0000-0000-F211-E2C7B20961B2,0,1721778125,,1721778125,,{ },2024-07-19T00:00:00,2,25205
1693,row-riyz.7u5e.ivns,00000000-0000-0000-20B4-50F4576EBBF2,0,1721778125,,1721778125,,{ },2024-07-20T00:00:00,1,25206
1694,row-i4id_x6ri~nar5,00000000-0000-0000-9612-3BEFD61FF295,0,1721778125,,1721778125,,{ },2024-07-21T00:00:00,0,25206
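
The cells above name the columns but keep all of them. A minimal sketch of the remaining step from the list (dropping columns that have no metadata description) might look like the following. It assumes that each entry of `columns` carries a `description` key only when the column is documented; that is a guess about the metadata layout, and the toy `fields` and `records` below are made up so the example runs on its own.

```python
import pandas as pd

# Toy stand-ins for covid_json['meta']['view']['columns'] and covid_json['data'].
# Assumption: documented columns have a 'description' key, bookkeeping columns do not.
fields = [
    {"name": "sid"},
    {"name": "Date", "description": "hypothetical description"},
    {"name": "New Cases", "description": "hypothetical description"},
]
records = [
    ["row-a", "2019-12-01T00:00:00", "0"],
    ["row-b", "2019-12-02T00:00:00", "0"],
]

# Build the table with names from the metadata, then drop undocumented columns.
demo = pd.DataFrame(records, columns=[f["name"] for f in fields])
undocumented = [f["name"] for f in fields if "description" not in f]
demo = demo.drop(columns=undocumented)
print(demo)
```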


<br/>

---


## Temporality

Let's briefly look at how we can use pandas `dt` accessors to work with dates/times in a dataset.

We will use the dataset from Lab 3: the Berkeley PD Calls for Service dataset.

In [231]:
# use data/Berkeley_PD_-_Calls_for_Service.csv now
df=pd.read_csv('./data/Berkeley_PD_-_Calls_for_Service.csv')
df.head()

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State
0,21014296,THEFT MISD. (UNDER $950),04/01/2021 12:00:00 AM,10:58,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA
1,21014391,THEFT MISD. (UNDER $950),04/01/2021 12:00:00 AM,10:38,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA
2,21090494,THEFT MISD. (UNDER $950),04/19/2021 12:00:00 AM,12:15,LARCENY,1,06/15/2021 12:00:00 AM,"2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,...",2100 BLOCK HASTE ST,Berkeley,CA
3,21090204,THEFT FELONY (OVER $950),02/13/2021 12:00:00 AM,17:00,LARCENY,6,06/15/2021 12:00:00 AM,"2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393...",2600 BLOCK WARRING ST,Berkeley,CA
4,21090179,BURGLARY AUTO,02/08/2021 12:00:00 AM,6:20,BURGLARY - VEHICLE,1,06/15/2021 12:00:00 AM,"2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,...",2700 BLOCK GARBER ST,Berkeley,CA


Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.

Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call was recorded in the database.

If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.

In [232]:
# Convert all three date/time columns with pd.to_datetime
df['EVENTDT'] = pd.to_datetime(df['EVENTDT'], format='%m/%d/%Y %I:%M:%S %p')
# EVENTTM holds only a time of day, so pandas fills in the default date 1900-01-01
df['EVENTTM'] = pd.to_datetime(df['EVENTTM'], format='%H:%M')
df['InDbDate'] = pd.to_datetime(df['InDbDate'], format='%m/%d/%Y %I:%M:%S %p')
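
Because `EVENTTM` carries only a time of day, parsing it on its own attaches a placeholder date. One alternative, sketched below on a two-row toy frame, is to add the time of day to `EVENTDT` as a timedelta so that each event has a single timestamp. The frame `calls` and the column name `EVENT_TS` are hypothetical and not part of the lab dataset.

```python
import pandas as pd

# Toy frame mimicking the raw string columns before any conversion.
calls = pd.DataFrame({
    "EVENTDT": ["04/01/2021 12:00:00 AM", "02/08/2021 12:00:00 AM"],
    "EVENTTM": ["10:58", "6:20"],
})
print(calls.dtypes)   # both columns start out as strings

# Parse the date, turn "HH:MM" into a duration since midnight, and add them.
event_date = pd.to_datetime(calls["EVENTDT"], format="%m/%d/%Y %I:%M:%S %p")
time_of_day = pd.to_timedelta(calls["EVENTTM"] + ":00")
calls["EVENT_TS"] = event_date + time_of_day
print(calls["EVENT_TS"])
```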

Now we can use the `dt` accessor on these columns.

We can get the month:

In [233]:
# get months from EVENTDT
df['EVENTDT'].dt.month

0        4
1        4
2        4
3        2
4        2
        ..
2627    12
2628     2
2629     3
2630     4
2631     2
Name: EVENTDT, Length: 2632, dtype: int32

Which day of the week the date is on:

In [234]:
# get week days from EVENTDT
df['EVENTDT'].dt.weekday

0       3
1       3
2       0
3       5
4       0
       ..
2627    0
2628    2
2629    2
2630    5
2631    4
Name: EVENTDT, Length: 2632, dtype: int32

Check the minimum values to see if there are any suspicious-looking dates. Here we sample events that fall on the 17th of a month in the earliest year of the data:

In [235]:
# Sample events that fall on the 17th of a month in the earliest year
dates_day_17 = df[df['EVENTDT'].dt.day == 17]
min_year = df['EVENTDT'].dt.year.min()
earliest_year_rows = dates_day_17[dates_day_17['EVENTDT'].dt.year == min_year]
earliest_year_rows.sample(5)

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State
1804,20057585,VEHICLE STOLEN,2020-12-17,1900-01-01 10:00:00,MOTOR VEHICLE THEFT,4,2021-06-15,"800 BLOCK POTTER ST\nBerkeley, CA\n(37.850798,...",800 BLOCK POTTER ST,Berkeley,CA
624,20057207,ASSAULT/BATTERY MISD.,2020-12-17,1900-01-01 16:50:00,ASSAULT,4,2021-06-15,"2100 BLOCK SHATTUCK AVE\nBerkeley, CA\n(37.871...",2100 BLOCK SHATTUCK AVE,Berkeley,CA
1157,20057267,ROBBERY,2020-12-17,1900-01-01 07:33:00,ROBBERY,4,2021-06-15,"1600 BLOCK VIRGINIA ST\nBerkeley, CA\n(37.8751...",1600 BLOCK VIRGINIA ST,Berkeley,CA
154,20092214,THEFT FROM AUTO,2020-12-17,1900-01-01 18:30:00,LARCENY - FROM VEHICLE,4,2021-06-15,"800 BLOCK SHATTUCK AVE\nBerkeley, CA\n(37.8918...",800 BLOCK SHATTUCK AVE,Berkeley,CA
617,20057373,GUN/WEAPON,2020-12-17,1900-01-01 22:18:00,WEAPONS OFFENSE,4,2021-06-15,"6200 BLOCK SAN PABLO AVE\nBerkeley, CA\n(37.84...",6200 BLOCK SAN PABLO AVE,Berkeley,CA


Doesn't look like it! We are good!


We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
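
As a small sketch of those two operations, the example below attaches a time zone to naive timestamps, converts them to UTC, and then expresses them as seconds since the UNIX epoch. The choice of `America/Los_Angeles` is an assumption (the dataset itself does not state a time zone), and the two timestamps are made up for the demo.

```python
import pandas as pd

# Naive timestamps (no time zone attached), as produced by pd.to_datetime above.
naive = pd.Series(pd.to_datetime(["2021-04-01 10:58", "2021-02-13 17:00"]))

# Assumption: the times were recorded in Pacific time.
local = naive.dt.tz_localize("America/Los_Angeles")
utc = local.dt.tz_convert("UTC")

# Seconds elapsed since 1970-01-01 00:00:00 UTC (UNIX/POSIX time).
unix_seconds = (utc - pd.Timestamp("1970-01-01", tz="UTC")) // pd.Timedelta(seconds=1)
print(utc)
print(unix_seconds)
```

Localizing before converting matters: `tz_localize` declares which zone the naive values are in, while `tz_convert` changes how an already zone-aware value is displayed.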


## Data Faithfulness: Mauna Loa CO2 data

CO2 concentrations have been monitored at Mauna Loa Observatory since 1958 ([website link](https://gml.noaa.gov/ccgg/trends/data.html)).




In [236]:
co2_file = "./data/co2_mm_mlo.txt"

Let's do some **EDA**!!

### How do we read the file into Pandas?
Let's instead check out this file with JupyterLab.

* Note it's a `.txt` file.
* Do we trust this file extension?
* What structure is it?


Looking at the first few lines of the data, we spot some relevant characteristics:

- The values are separated by white space, possibly tabs.
- The data line up down the rows. For example, the month appears in 7th to 8th position of each line.
- The 71st and 72nd lines in the file contain column headings split over two lines.
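
Observations like these can also be made from inside the notebook by printing a slice of raw lines. The helper below is a sketch; the name `peek` is an assumption, and the throwaway file stands in for `co2_file` so that the example is self-contained.

```python
from itertools import islice
from pathlib import Path
import tempfile

def peek(path, start=0, stop=5):
    """Return raw lines start..stop-1 (0-indexed) of a text file, without parsing them."""
    with open(path, "r") as f:
        return [line.rstrip("\n") for line in islice(f, start, stop)]

# Demonstrate on a small throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.txt"
    demo_path.write_text("line 1\nline 2\nline 3\nline 4\nline 5\n")
    middle = peek(demo_path, 1, 3)
    print(middle)
```

In the notebook, `peek(co2_file, 70, 73)` would show the 71st through 73rd lines, which is where the headings and the first data row are expected if the file is laid out as described above.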

We can use `read_csv` to read the data into a Pandas data frame, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.

In [237]:
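# use pd.read_csv to read the whitespace-delimited txt file
# (sep=r'\s+' replaces delim_whitespace=True, which is deprecated as of pandas 2.2)
column_names = ['0', '1', '2', '3', '4', '5', '6']
text_file = pd.read_csv("./data/co2_mm_mlo.txt", sep=r'\s+', header=None, names=column_names, skiprows=72)
text_file.head()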

Unnamed: 0,0,1,2,3,4,5,6
0,1958,3,1958.21,315.71,315.71,314.62,-1
1,1958,4,1958.29,317.45,317.45,315.29,-1
2,1958,5,1958.38,317.5,317.5,314.71,-1
3,1958,6,1958.46,-99.99,317.1,314.85,-1
4,1958,7,1958.54,315.86,315.86,314.98,-1


Congratulations! You've wrangled the data!

<br/>

...But our columns aren't named.
**We need to do more EDA.**

### Exploring Variable Feature Types

The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
Let's go back to the raw data file to identify each feature.


We'll assign some **custom column names** to the existing DataFrame; rerunning `pd.read_csv` with `names=column_names` would give the same result.

In [238]:
column_names=['Yr','Mo','DecDate','Avg','Int','Trend','Days']
text_file.columns=column_names
text_file

Unnamed: 0,Yr,Mo,DecDate,Avg,Int,Trend,Days
0,1958,3,1958.21,315.71,315.71,314.62,-1
1,1958,4,1958.29,317.45,317.45,315.29,-1
2,1958,5,1958.38,317.50,317.50,314.71,-1
3,1958,6,1958.46,-99.99,317.10,314.85,-1
4,1958,7,1958.54,315.86,315.86,314.98,-1
...,...,...,...,...,...,...,...
733,2019,4,2019.29,413.32,413.32,410.49,26
734,2019,5,2019.38,414.66,414.66,411.20,28
735,2019,6,2019.46,413.92,413.92,411.58,27
736,2019,7,2019.54,411.77,411.77,411.43,23


Yikes! Looking at the data uncovers a problem. It looks like we have some **missing values**: note the `-99.99` in `Avg` and the `-1` in `Days`. What happened here?

In [239]:
text_file.head()

Unnamed: 0,Yr,Mo,DecDate,Avg,Int,Trend,Days
0,1958,3,1958.21,315.71,315.71,314.62,-1
1,1958,4,1958.29,317.45,317.45,315.29,-1
2,1958,5,1958.38,317.5,317.5,314.71,-1
3,1958,6,1958.46,-99.99,317.1,314.85,-1
4,1958,7,1958.54,315.86,315.86,314.98,-1


In [240]:
text_file.tail()

Unnamed: 0,Yr,Mo,DecDate,Avg,Int,Trend,Days
733,2019,4,2019.29,413.32,413.32,410.49,26
734,2019,5,2019.38,414.66,414.66,411.2,28
735,2019,6,2019.46,413.92,413.92,411.58,27
736,2019,7,2019.54,411.77,411.77,411.43,23
737,2019,8,2019.62,409.95,409.95,411.84,29
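
The `-99.99` entries in `Avg` and the `-1` entries in `Days` look like placeholder codes rather than real measurements. Treating them as missing is an assumption based only on the values printed above, not something confirmed by the file's documentation. Under that assumption, a minimal sketch of recoding them as `NaN` on a toy frame is:

```python
import numpy as np
import pandas as pd

# Toy frame with a few values copied from the output above.
co2 = pd.DataFrame({
    "Avg": [315.71, -99.99, 315.86],
    "Days": [-1, -1, 26],
})

# Assumption: -99.99 (Avg) and -1 (Days) are sentinel codes for "missing".
cleaned = co2.replace({"Avg": {-99.99: np.nan}, "Days": {-1: np.nan}})
n_missing = cleaned.isna().sum()
print(cleaned)
print(n_missing)
```

Recoding per column avoids accidentally turning a legitimate `-1` in some other column into a missing value.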
