# Data Science, Fall 2024


[Acknowledgments Page](https://ds100.org/fa23/acks/)

A demo of data cleaning and exploratory data analysis using the CDC Tuberculosis data and the Mauna Loa CO2 data.

In [23]:
import numpy as np
import pandas as pd

In [24]:
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 9)

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
pd.set_option('display.float_format', '{:.2f}'.format)

# Silence some spurious seaborn warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

## Structure: Different File Formats

There are many file types for storing structured data: CSV, TSV, JSON, XML, ASCII, SAS...
* Documentation will be your best friend to understand how to process many of these file types.
* In lecture, we will cover TSV and JSON since pandas supports them out of the box.

### TSV

**TSV** (Tab-Separated Values) files are very similar to CSVs, but now items are delimited by tabs.

The `pd.read_csv` function also reads in TSVs if we specify the **delimiter** with parameter `sep='\t'` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)).

In [25]:
# Read TSV file


In [26]:
# Read the TSV file (tab-delimited) with pd.read_csv
tsv_file = pd.read_csv(r'C:\Users\Muhammad_Talha\Downloads\COHORT 7\Week 1 + 2\Data Wrangling\cdc_tuberculosis.tsv',sep='\t')
tsv_file

Unnamed: 0.1,Unnamed: 0,No. of TB cases,Unnamed: 2,Unnamed: 3,TB incidence,Unnamed: 5,Unnamed: 6
0,U.S. jurisdiction,2019,2020,2021,2019.00,2020.00,2021.00
1,Total,8900,7173,7860,2.71,2.16,2.37
2,Alabama,87,72,92,1.77,1.43,1.83
3,Alaska,58,58,58,7.91,7.92,7.92
4,Arizona,183,136,129,2.51,1.89,1.77
...,...,...,...,...,...,...,...
48,Virginia,191,169,161,2.23,1.96,1.86
49,Washington,221,163,199,2.90,2.11,2.57
50,West Virginia,9,13,7,0.50,0.73,0.39
51,Wisconsin,51,35,66,0.88,0.59,1.12


*Side note*: there was a question last time on how pandas differentiates a comma delimiter vs. a comma within the field itself, e.g., `8,900`. Check out the documentation for the `quotechar` parameter.
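
As a rough illustration of the side note, the sketch below reads a made-up two-row CSV held in memory (the string `raw` is hypothetical, not the CDC file). It relies on two `pd.read_csv` parameters: `quotechar`, which marks where a field begins and ends so that commas inside the quotes are not treated as delimiters, and `thousands`, which tells pandas to parse `8,900` as a number.

```python
import io
import pandas as pd

# Hypothetical CSV text: the quoted field "8,900" contains a comma.
raw = 'jurisdiction,cases\nTotal,"8,900"\nAlabama,87\n'

# quotechar='"' is the default, so the comma inside the quotes is kept
# as part of the field and the column is read as text.
as_text = pd.read_csv(io.StringIO(raw), quotechar='"')

# thousands=',' additionally parses the quoted "8,900" as the integer 8900.
as_number = pd.read_csv(io.StringIO(raw), thousands=',')

print(as_text['cases'].tolist())
print(as_number['cases'].tolist())
```

Without the quotes around `8,900`, pandas would have no way to tell the two kinds of comma apart, which is why the source file has to quote such fields.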

### JSON
The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date.


We can download this file, saving it as a JSON (note the source URL file type).

In the interest of **reproducible data science**, we will download the data programmatically. We have defined a helper function below, which can then be reused in many different notebooks.

In [3]:
import requests
from pathlib import Path
import time
def fetch_and_cache(data_url, file, data_dir="data", force=False):
    """
    Download and cache a url and return the file object.

    data_url: the web address to download
    file: the file in which to save the results.
    data_dir: (default="data") the location to save the data
    force: if true the file is always re-downloaded

    return: The pathlib.Path object representing the file.
    """

    ### BEGIN SOLUTION
    data_dir = Path(data_dir)
    data_dir.mkdir(exist_ok = True)
    file_path = data_dir / Path(file)
    # If the file already exists and we want to force a download then
    # delete the file first so that the creation date is correct.
    if force and file_path.exists():
        file_path.unlink()
    if force or not file_path.exists():
        print('Downloading...', end=' ')
        resp = requests.get(data_url)
        resp.raise_for_status()  # fail loudly instead of caching an error page
        with file_path.open('wb') as f:
            f.write(resp.content)
        print('Done!')
    else:
        # time.ctime formats the modification time in local time, not UTC
        last_modified_time = time.ctime(file_path.stat().st_mtime)
        print("Using cached version that was downloaded (local time):", last_modified_time)
    return file_path
    ### END SOLUTION


In [40]:
# Read JSON / download
# covid_file = fetch_and_cache(
#     "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
#     "confirmed-cases.json",
#     force=False)
# covid_file          # a file path wrapper object

import json

with open(r'C:\Users\Muhammad_Talha\Downloads\COHORT 7\Week 1 + 2\Data Wrangling\confirmed-cases.json','r' ) as f:
    covid_json_file = json.load(f)

covid_json_file
# json_file = pd.read_csv(r'C:\Users\Muhammad_Talha\Downloads\COHORT 7\Week 1 + 2\Data Wrangling\confirmed-cases.json')

{'meta': {'view': {'id': 'xn6j-b766',
   'name': 'COVID-19 Confirmed Cases',
   'assetType': 'dataset',
   'attribution': 'City of Berkeley',
   'averageRating': 0,
   'category': 'Health',
   'createdAt': 1587074071,
   'description': 'Counts of confirmed COVID-19 cases among Berkeley residents by date. As of 6/21/22, this dataset will be updated weekly instead of daily. As of 11/14/22, this dataset only includes PCR cases.',
   'displayType': 'table',
   'downloadCount': 4524,
   'hideFromCatalog': False,
   'hideFromDataJson': False,
   'locked': False,
   'newBackend': True,
   'numberOfComments': 0,
   'oid': 37306599,
   'provenance': 'official',
   'publicationAppendEnabled': False,
   'publicationDate': 1623695944,
   'publicationGroup': 17032857,
   'publicationStage': 'published',
   'rowsUpdatedAt': 1721778125,
   'rowsUpdatedBy': 'g3qt-vv5v',
   'tableId': 18345932,
   'totalTimesRated': 0,
   'viewCount': 30867,
   'viewLastModified': 1721778124,
   'viewType': 'tabular',


#### File size

Often, I like to start my analysis by getting a rough estimate of the size of the data. This helps inform the tools I use and how I view the data. If it is relatively small, I might use a text editor or a spreadsheet to look at the data. If it is larger, I might jump to more programmatic exploration or even use distributed computing tools.

However here we will use Python tools to probe the file.

Since these seem to be text files I might also want to investigate the number of lines, which often corresponds to the number of records.

In [46]:
# get size of file using getsize function
import os
file_path = r'C:\Users\Muhammad_Talha\Downloads\COHORT 7\Week 1 + 2\Data Wrangling\confirmed-cases.json'
file_size = os.path.getsize(file_path)
print('file size is ',file_size)

# convert bytes to mebibytes (1 MiB = 1024 * 1024 bytes)
file_size_mb = file_size/(1024 * 1024)
print('file size in mbs ',file_size_mb)

with open(file_path, "r") as f:
    line_count = sum(1 for line in f)
    print("\n Count of lines ", line_count)


file size is  265382
file size in mbs  0.25308799743652344

 Count of lines  2107


### EDA: Digging into JSON

Python has relatively good support for JSON data since it closely matches the internal Python object model. In the following cell we import the entire JSON data file into a Python dictionary using the `json` package.

In [47]:
# load JSON file
with open(r'C:\Users\Muhammad_Talha\Downloads\COHORT 7\Week 1 + 2\Data Wrangling\confirmed-cases.json','r' ) as f:
    covid_json_file = json.load(f)

covid_json_file

{'meta': {'view': {'id': 'xn6j-b766',
   'name': 'COVID-19 Confirmed Cases',
   'assetType': 'dataset',
   'attribution': 'City of Berkeley',
   'averageRating': 0,
   'category': 'Health',
   'createdAt': 1587074071,
   'description': 'Counts of confirmed COVID-19 cases among Berkeley residents by date. As of 6/21/22, this dataset will be updated weekly instead of daily. As of 11/14/22, this dataset only includes PCR cases.',
   'displayType': 'table',
   'downloadCount': 4524,
   'hideFromCatalog': False,
   'hideFromDataJson': False,
   'locked': False,
   'newBackend': True,
   'numberOfComments': 0,
   'oid': 37306599,
   'provenance': 'official',
   'publicationAppendEnabled': False,
   'publicationDate': 1623695944,
   'publicationGroup': 17032857,
   'publicationStage': 'published',
   'rowsUpdatedAt': 1721778125,
   'rowsUpdatedBy': 'g3qt-vv5v',
   'tableId': 18345932,
   'totalTimesRated': 0,
   'viewCount': 30867,
   'viewLastModified': 1721778124,
   'viewType': 'tabular',


The `covid_json_file` variable is now a dictionary encoding the data in the file:

In [48]:
# find type of your file
type(covid_json_file)

dict

#### Examine what keys are in the top level json object

We can list the keys to determine what data is stored in the object.

In [49]:
# identify keys() of JSON file
covid_json_file.keys()

dict_keys(['meta', 'data'])

**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained alongside the data and can be a good source of additional information.

<br/>

We can investigate the meta data further by examining the keys associated with the metadata.

In [50]:
# Further explore meta key()
covid_json_file['meta'].keys()

dict_keys(['view'])

The `meta` key contains another dictionary called `view`. This likely refers to metadata about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.

In [52]:
# Further explore view key()
covid_json_file['meta']['view'].keys()

dict_keys(['id', 'name', 'assetType', 'attribution', 'averageRating', 'category', 'createdAt', 'description', 'displayType', 'downloadCount', 'hideFromCatalog', 'hideFromDataJson', 'locked', 'newBackend', 'numberOfComments', 'oid', 'provenance', 'publicationAppendEnabled', 'publicationDate', 'publicationGroup', 'publicationStage', 'rowsUpdatedAt', 'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount', 'viewLastModified', 'viewType', 'approvals', 'clientContext', 'columns', 'grants', 'metadata', 'owner', 'query', 'rights', 'tableAuthor', 'tags', 'flags'])

Notice that this is a nested/recursive data structure. As we dig deeper we reveal more and more keys and the corresponding data:

```
covid_json_file
|-> data
    | ... (haven't explored yet)
|-> meta
    |-> view
        | -> id
        | -> name
        | -> attribution
        ...
        | -> description
        ...
        | -> columns
        ...
```
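
Rather than drilling down one `.keys()` call at a time, we could walk the nesting with a small recursive helper. The function `key_tree` below is a hypothetical helper written for this sketch, and `sample` is a made-up stand-in with the same shape as the file (its values are placeholders, not real data).

```python
def key_tree(obj, depth=0, max_depth=2):
    """Return the keys of a nested dictionary as indented lines.

    Only dictionaries are descended into; recursion stops below max_depth.
    """
    lines = []
    if isinstance(obj, dict) and depth <= max_depth:
        for key, value in obj.items():
            lines.append('    ' * depth + '|-> ' + str(key))
            lines.extend(key_tree(value, depth + 1, max_depth))
    return lines

# Placeholder dictionary mimicking the top-level layout of the JSON file.
sample = {'meta': {'view': {'id': 'x', 'name': 'y', 'columns': []}}, 'data': []}

print('\n'.join(key_tree(sample)))
```

On the real file this would be called as `key_tree(covid_json_file)`. Capping the depth keeps the printout short, since the `view` dictionary holds dozens of keys.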

There is a key called `description` in the `view` sub-dictionary. This likely contains a description of the data:

In [53]:
# use description key from view dictionary
covid_json_file['meta']['view']['description']

'Counts of confirmed COVID-19 cases among Berkeley residents by date. As of 6/21/22, this dataset will be updated weekly instead of daily. As of 11/14/22, this dataset only includes PCR cases.'


#### Examining the Data Field for Records

We can look at a few entries in the `data` field. This is what we'll load into Pandas.


In [64]:
# explore data key and print some entries
for i in covid_json_file['data'][:3]:
    print(i)

['row-rb9f.5nek-8ine', '00000000-0000-0000-4662-FC7183571076', 0, 1721778125, None, 1721778125, None, '{ }', '2019-12-01T00:00:00', '0', '0']
['row-vztw-e5xz~k7e6', '00000000-0000-0000-4572-4B7999484114', 0, 1721778125, None, 1721778125, None, '{ }', '2019-12-02T00:00:00', '0', '0']
['row-bjvt.2dfe.85rq', '00000000-0000-0000-7CFB-15321FF8BFED', 0, 1721778125, None, 1721778125, None, '{ }', '2019-12-03T00:00:00', '0', '0']


Observations:
* These look like equal-length records, so maybe `data` is a table!
* But what does each of the values in a record mean? Where can we find column headers?

Back to the metadata.

#### Columns Metadata

Another potentially useful key in the metadata dictionary is `columns`. Its value is a list:

In [57]:
# check type of columns key
covid_json_file['meta']['view']['columns']
type(covid_json_file['meta']['view']['columns'])

list

Let's go back to the file explorer.

Based on the contents of this key, what are reasonable names for each column in the `data` table?
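
One way to answer this without the file explorer is to loop over the list and pull out each entry's `name`, which is the key used later when building the dataframe. The list `columns_meta` below is a small made-up stand-in for `covid_json_file['meta']['view']['columns']`; the `description` key in it is an assumption about what an entry may contain.

```python
# Hypothetical stand-in for covid_json_file['meta']['view']['columns'].
# Only the 'name' key is known from the notebook; 'description' is assumed.
columns_meta = [
    {'name': 'sid'},
    {'name': 'Date', 'description': 'Date of the report'},
    {'name': 'New Cases', 'description': 'Cases reported on that date'},
]

# Candidate column names, in the same order as the values in each record.
names = [col['name'] for col in columns_meta]

# Which keys appear on the entries at all? Useful for deciding what to read next.
keys_seen = sorted({key for col in columns_meta for key in col})

print(names)
print(keys_seen)
```

Because the entries are in the same order as the values inside each record of `data`, the list of names can be used directly as column headers.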

#### Summary of exploring the JSON file

1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.

### JSON with pandas

After our above EDA, let's finally go about loading the data (not the metadata) into a pandas dataframe.

In the following block of code we:
1. Translate the JSON records into a dataframe:

    * fields: `covid_json_file['meta']['view']['columns']`
    * records: `covid_json_file['data']`
    
1. Remove columns that have no metadata description.  This would be a bad idea in general but here we remove these columns since the above analysis suggests that they are unlikely to contain useful information.
1. Examine the `tail` of the table.

In [65]:
covid_json_file_copy = covid_json_file.copy()
covid_json_file_copy

{'meta': {'view': {'id': 'xn6j-b766',
   'name': 'COVID-19 Confirmed Cases',
   'assetType': 'dataset',
   'attribution': 'City of Berkeley',
   'averageRating': 0,
   'category': 'Health',
   'createdAt': 1587074071,
   'description': 'Counts of confirmed COVID-19 cases among Berkeley residents by date. As of 6/21/22, this dataset will be updated weekly instead of daily. As of 11/14/22, this dataset only includes PCR cases.',
   'displayType': 'table',
   'downloadCount': 4524,
   'hideFromCatalog': False,
   'hideFromDataJson': False,
   'locked': False,
   'newBackend': True,
   'numberOfComments': 0,
   'oid': 37306599,
   'provenance': 'official',
   'publicationAppendEnabled': False,
   'publicationDate': 1623695944,
   'publicationGroup': 17032857,
   'publicationStage': 'published',
   'rowsUpdatedAt': 1721778125,
   'rowsUpdatedBy': 'g3qt-vv5v',
   'tableId': 18345932,
   'totalTimesRated': 0,
   'viewCount': 30867,
   'viewLastModified': 1721778124,
   'viewType': 'tabular',


In [72]:
# Load the data from JSON and assign column titles
column_names = [col['name'] for col in covid_json_file_copy['meta']['view']['columns']]
fields_file = pd.DataFrame(covid_json_file_copy['data'], columns=column_names)
fields_file

Unnamed: 0,sid,id,position,created_at,created_meta,updated_at,updated_meta,meta,Date,New Cases,Cumulative Cases
0,row-rb9f.5nek-8ine,00000000-0000-0000-4662-FC7183571076,0,1721778125,,1721778125,,{ },2019-12-01T00:00:00,0,0
1,row-vztw-e5xz~k7e6,00000000-0000-0000-4572-4B7999484114,0,1721778125,,1721778125,,{ },2019-12-02T00:00:00,0,0
2,row-bjvt.2dfe.85rq,00000000-0000-0000-7CFB-15321FF8BFED,0,1721778125,,1721778125,,{ },2019-12-03T00:00:00,0,0
3,row-u3xy-3xcz_xc3n,00000000-0000-0000-8EB1-2012B4B4B530,0,1721778125,,1721778125,,{ },2019-12-04T00:00:00,0,0
4,row-na7q-wzeq.h924,00000000-0000-0000-2FD9-335951D74781,0,1721778125,,1721778125,,{ },2019-12-05T00:00:00,0,0
...,...,...,...,...,...,...,...,...,...,...,...
1690,row-d28u_ew2y~ms8f,00000000-0000-0000-7C6A-161A637B9C10,0,1721778125,,1721778125,,{ },2024-07-17T00:00:00,6,25198
1691,row-nmtf.ke2i~yb3k,00000000-0000-0000-04BA-221AC1453106,0,1721778125,,1721778125,,{ },2024-07-18T00:00:00,5,25203
1692,row-4y8w-m68e~d3t5,00000000-0000-0000-F211-E2C7B20961B2,0,1721778125,,1721778125,,{ },2024-07-19T00:00:00,2,25205
1693,row-riyz.7u5e.ivns,00000000-0000-0000-20B4-50F4576EBBF2,0,1721778125,,1721778125,,{ },2024-07-20T00:00:00,1,25206
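
The cell above covers only the first of the three steps listed earlier. A minimal sketch of the remaining two (dropping undocumented columns and looking at the `tail`) might look like the following. It runs on a made-up miniature of the file, and it assumes that a documented column carries a `description` key in its metadata entry; that key name is an assumption and should be checked against the real `columns` list.

```python
import pandas as pd

# Hypothetical miniature of the JSON file (values shortened by hand).
sample = {
    'meta': {'view': {'columns': [
        {'name': 'sid'},
        {'name': 'Date', 'description': 'Report date'},
        {'name': 'New Cases', 'description': 'Cases reported on that date'},
    ]}},
    'data': [
        ['row-1', '2019-12-01T00:00:00', '0'],
        ['row-2', '2019-12-02T00:00:00', '3'],
    ],
}

columns_meta = sample['meta']['view']['columns']

# Step 1: records plus column names from the metadata.
covid = pd.DataFrame(sample['data'], columns=[c['name'] for c in columns_meta])

# Step 2: drop columns whose metadata entry has no description (assumed key).
undocumented = [c['name'] for c in columns_meta if 'description' not in c]
covid = covid.drop(columns=undocumented)

# The records store dates and counts as strings, so convert them.
covid['Date'] = pd.to_datetime(covid['Date'])
covid['New Cases'] = covid['New Cases'].astype(int)

# Step 3: examine the end of the table.
print(covid.tail())
```

Dropping by a rule rather than by a hand-typed list keeps the step reproducible if the publisher adds or renames bookkeeping columns.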


<br/>

---


## Temporality

Let's briefly look at how we can use pandas `dt` accessors to work with dates/times in a dataset.

We will use the dataset from Lab 3: the Berkeley PD Calls for Service dataset.

In [100]:
# use data/Berkeley_PD_-_Calls_for_Service.csv now
calls = pd.read_csv(r'C:\Users\Muhammad_Talha\Downloads\COHORT 7\Week 1 + 2\Data Wrangling\Berkeley_PD_-_Calls_for_Service.csv')
calls
# calls = calls.sort(by='CASENO')

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State
0,18022300,DISTURBANCE,04/18/2018 12:00:00 AM,22:17,DISORDERLY CONDUCT,3,09/06/2018 03:30:12 AM,"OREGON STREET &amp; MCGEE AVE\nBerkeley, CA\n(...",OREGON STREET & MCGEE AVE,Berkeley,CA
1,18026683,THEFT MISD. (UNDER $950),05/09/2018 12:00:00 AM,21:25,LARCENY,3,09/06/2018 03:30:13 AM,"200 UNIVERSITY AVE\nBerkeley, CA\n(37.865511, ...",200 UNIVERSITY AVE,Berkeley,CA
2,18038550,THEFT MISD. (UNDER $950),05/18/2018 12:00:00 AM,20:00,LARCENY,5,09/06/2018 03:30:09 AM,"2200 MILVIA ST\nBerkeley, CA\n(37.868574, -122...",2200 MILVIA ST,Berkeley,CA
3,18014810,BURGLARY AUTO,03/13/2018 12:00:00 AM,08:50,BURGLARY - VEHICLE,2,09/06/2018 03:30:08 AM,"1200 SIXTH ST\nBerkeley, CA\n(37.881142, -122....",1200 SIXTH ST,Berkeley,CA
4,18018643,ALCOHOL OFFENSE,03/31/2018 12:00:00 AM,13:29,LIQUOR LAW VIOLATION,6,09/06/2018 03:30:11 AM,"CENTER STREET &amp; SHATTUCK AVE\nBerkeley, CA...",CENTER STREET & SHATTUCK AVE,Berkeley,CA
...,...,...,...,...,...,...,...,...,...,...,...
3783,18045829,THEFT MISD. (UNDER $950),08/15/2018 12:00:00 AM,08:42,LARCENY,3,09/06/2018 03:30:10 AM,"2300 TELEGRAPH AVE\nBerkeley, CA\n(37.868714, ...",2300 TELEGRAPH AVE,Berkeley,CA
3784,18040137,DISTURBANCE,07/17/2018 12:00:00 AM,10:34,DISORDERLY CONDUCT,2,09/06/2018 03:30:13 AM,"1100 UNIVERSITY AVE\nBerkeley, CA\n(37.869067,...",1100 UNIVERSITY AVE,Berkeley,CA
3785,18090816,VANDALISM,05/16/2018 12:00:00 AM,20:00,VANDALISM,3,09/06/2018 03:30:13 AM,"800 VICENTE RD\nBerkeley, CA\n",800 VICENTE RD,Berkeley,CA
3786,18024397,SEXUAL ASSAULT FEL.,04/28/2018 12:00:00 AM,17:00,SEX CRIME,6,09/06/2018 03:30:12 AM,"2700 BANCROFT WAY\nBerkeley, CA\n(37.869312, -...",2700 BANCROFT WAY,Berkeley,CA


Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.

Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date the call was recorded in the database.

If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.

In [101]:
# pd.to_datetime on EVENTDT
print('Before Conversion  : \n',calls[['EVENTDT', 'EVENTTM', 'InDbDate']].dtypes)

calls['EVENTDT'] = pd.to_datetime(calls['EVENTDT'], format='%m/%d/%Y %I:%M:%S %p' )
calls['EVENTTM'] = pd.to_datetime(calls['EVENTTM'], format='%H:%M').dt.time
calls['InDbDate'] = pd.to_datetime(calls['InDbDate'], format='%m/%d/%Y %I:%M:%S %p')
print('After the Conversion : \n',calls[['EVENTDT', 'EVENTTM', 'InDbDate']].dtypes)

calls

Before Conversion  : 
 EVENTDT     object
EVENTTM     object
InDbDate    object
dtype: object
After the Conversion : 
 EVENTDT     datetime64[ns]
EVENTTM             object
InDbDate    datetime64[ns]
dtype: object


Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State
0,18022300,DISTURBANCE,2018-04-18,22:17:00,DISORDERLY CONDUCT,3,2018-09-06 03:30:12,"OREGON STREET &amp; MCGEE AVE\nBerkeley, CA\n(...",OREGON STREET & MCGEE AVE,Berkeley,CA
1,18026683,THEFT MISD. (UNDER $950),2018-05-09,21:25:00,LARCENY,3,2018-09-06 03:30:13,"200 UNIVERSITY AVE\nBerkeley, CA\n(37.865511, ...",200 UNIVERSITY AVE,Berkeley,CA
2,18038550,THEFT MISD. (UNDER $950),2018-05-18,20:00:00,LARCENY,5,2018-09-06 03:30:09,"2200 MILVIA ST\nBerkeley, CA\n(37.868574, -122...",2200 MILVIA ST,Berkeley,CA
3,18014810,BURGLARY AUTO,2018-03-13,08:50:00,BURGLARY - VEHICLE,2,2018-09-06 03:30:08,"1200 SIXTH ST\nBerkeley, CA\n(37.881142, -122....",1200 SIXTH ST,Berkeley,CA
4,18018643,ALCOHOL OFFENSE,2018-03-31,13:29:00,LIQUOR LAW VIOLATION,6,2018-09-06 03:30:11,"CENTER STREET &amp; SHATTUCK AVE\nBerkeley, CA...",CENTER STREET & SHATTUCK AVE,Berkeley,CA
...,...,...,...,...,...,...,...,...,...,...,...
3783,18045829,THEFT MISD. (UNDER $950),2018-08-15,08:42:00,LARCENY,3,2018-09-06 03:30:10,"2300 TELEGRAPH AVE\nBerkeley, CA\n(37.868714, ...",2300 TELEGRAPH AVE,Berkeley,CA
3784,18040137,DISTURBANCE,2018-07-17,10:34:00,DISORDERLY CONDUCT,2,2018-09-06 03:30:13,"1100 UNIVERSITY AVE\nBerkeley, CA\n(37.869067,...",1100 UNIVERSITY AVE,Berkeley,CA
3785,18090816,VANDALISM,2018-05-16,20:00:00,VANDALISM,3,2018-09-06 03:30:13,"800 VICENTE RD\nBerkeley, CA\n",800 VICENTE RD,Berkeley,CA
3786,18024397,SEXUAL ASSAULT FEL.,2018-04-28,17:00:00,SEX CRIME,6,2018-09-06 03:30:12,"2700 BANCROFT WAY\nBerkeley, CA\n(37.869312, -...",2700 BANCROFT WAY,Berkeley,CA


Now we can use the `dt` accessor on this column.

We can get the month:

In [106]:
# get months from EVENTDT (already converted to datetime above)
calls['Month'] = calls['EVENTDT'].dt.month
calls

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State,Month
0,18022300,DISTURBANCE,2018-04-18,22:17:00,DISORDERLY CONDUCT,3,2018-09-06 03:30:12,"OREGON STREET &amp; MCGEE AVE\nBerkeley, CA\n(...",OREGON STREET & MCGEE AVE,Berkeley,CA,4
1,18026683,THEFT MISD. (UNDER $950),2018-05-09,21:25:00,LARCENY,3,2018-09-06 03:30:13,"200 UNIVERSITY AVE\nBerkeley, CA\n(37.865511, ...",200 UNIVERSITY AVE,Berkeley,CA,5
2,18038550,THEFT MISD. (UNDER $950),2018-05-18,20:00:00,LARCENY,5,2018-09-06 03:30:09,"2200 MILVIA ST\nBerkeley, CA\n(37.868574, -122...",2200 MILVIA ST,Berkeley,CA,5
3,18014810,BURGLARY AUTO,2018-03-13,08:50:00,BURGLARY - VEHICLE,2,2018-09-06 03:30:08,"1200 SIXTH ST\nBerkeley, CA\n(37.881142, -122....",1200 SIXTH ST,Berkeley,CA,3
4,18018643,ALCOHOL OFFENSE,2018-03-31,13:29:00,LIQUOR LAW VIOLATION,6,2018-09-06 03:30:11,"CENTER STREET &amp; SHATTUCK AVE\nBerkeley, CA...",CENTER STREET & SHATTUCK AVE,Berkeley,CA,3
...,...,...,...,...,...,...,...,...,...,...,...,...
3783,18045829,THEFT MISD. (UNDER $950),2018-08-15,08:42:00,LARCENY,3,2018-09-06 03:30:10,"2300 TELEGRAPH AVE\nBerkeley, CA\n(37.868714, ...",2300 TELEGRAPH AVE,Berkeley,CA,8
3784,18040137,DISTURBANCE,2018-07-17,10:34:00,DISORDERLY CONDUCT,2,2018-09-06 03:30:13,"1100 UNIVERSITY AVE\nBerkeley, CA\n(37.869067,...",1100 UNIVERSITY AVE,Berkeley,CA,7
3785,18090816,VANDALISM,2018-05-16,20:00:00,VANDALISM,3,2018-09-06 03:30:13,"800 VICENTE RD\nBerkeley, CA\n",800 VICENTE RD,Berkeley,CA,5
3786,18024397,SEXUAL ASSAULT FEL.,2018-04-28,17:00:00,SEX CRIME,6,2018-09-06 03:30:12,"2700 BANCROFT WAY\nBerkeley, CA\n(37.869312, -...",2700 BANCROFT WAY,Berkeley,CA,4


Which day of the week the date is on:

In [107]:
# get weekday names from EVENTDT (already converted to datetime above)
calls['Weekday'] = calls['EVENTDT'].dt.day_name()
calls

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State,Month,Weekday
0,18022300,DISTURBANCE,2018-04-18,22:17:00,DISORDERLY CONDUCT,3,2018-09-06 03:30:12,"OREGON STREET &amp; MCGEE AVE\nBerkeley, CA\n(...",OREGON STREET & MCGEE AVE,Berkeley,CA,4,Wednesday
1,18026683,THEFT MISD. (UNDER $950),2018-05-09,21:25:00,LARCENY,3,2018-09-06 03:30:13,"200 UNIVERSITY AVE\nBerkeley, CA\n(37.865511, ...",200 UNIVERSITY AVE,Berkeley,CA,5,Wednesday
2,18038550,THEFT MISD. (UNDER $950),2018-05-18,20:00:00,LARCENY,5,2018-09-06 03:30:09,"2200 MILVIA ST\nBerkeley, CA\n(37.868574, -122...",2200 MILVIA ST,Berkeley,CA,5,Friday
3,18014810,BURGLARY AUTO,2018-03-13,08:50:00,BURGLARY - VEHICLE,2,2018-09-06 03:30:08,"1200 SIXTH ST\nBerkeley, CA\n(37.881142, -122....",1200 SIXTH ST,Berkeley,CA,3,Tuesday
4,18018643,ALCOHOL OFFENSE,2018-03-31,13:29:00,LIQUOR LAW VIOLATION,6,2018-09-06 03:30:11,"CENTER STREET &amp; SHATTUCK AVE\nBerkeley, CA...",CENTER STREET & SHATTUCK AVE,Berkeley,CA,3,Saturday
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3783,18045829,THEFT MISD. (UNDER $950),2018-08-15,08:42:00,LARCENY,3,2018-09-06 03:30:10,"2300 TELEGRAPH AVE\nBerkeley, CA\n(37.868714, ...",2300 TELEGRAPH AVE,Berkeley,CA,8,Wednesday
3784,18040137,DISTURBANCE,2018-07-17,10:34:00,DISORDERLY CONDUCT,2,2018-09-06 03:30:13,"1100 UNIVERSITY AVE\nBerkeley, CA\n(37.869067,...",1100 UNIVERSITY AVE,Berkeley,CA,7,Tuesday
3785,18090816,VANDALISM,2018-05-16,20:00:00,VANDALISM,3,2018-09-06 03:30:13,"800 VICENTE RD\nBerkeley, CA\n",800 VICENTE RD,Berkeley,CA,5,Wednesday
3786,18024397,SEXUAL ASSAULT FEL.,2018-04-28,17:00:00,SEX CRIME,6,2018-09-06 03:30:12,"2700 BANCROFT WAY\nBerkeley, CA\n(37.869312, -...",2700 BANCROFT WAY,Berkeley,CA,4,Saturday


Check the minimum value to see if there are any suspicious-looking 1970s dates:

In [131]:
# EVENTDT is already datetime64, so no re-conversion is needed here
min_date = calls['EVENTDT'].min()
print("The minimum date is: ",min_date)
dates_70s = calls[(calls['EVENTDT'] >= '1970-01-01') & (calls['EVENTDT'] < '1980-01-01')]

print("All dates in the dataset:")
print(calls['EVENTDT'].sort_values().unique())
dates_70s


The minimum date is:  2018-03-08 00:00:00
All dates in the dataset:
['2018-03-08T00:00:00.000000000' '2018-03-09T00:00:00.000000000'
 '2018-03-10T00:00:00.000000000' ... '2018-08-17T00:00:00.000000000'
 '2018-08-18T00:00:00.000000000' '2018-08-19T00:00:00.000000000']


Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State,Month,Weekday


Doesn't look like it! We are good!


We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
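
As a small sketch of those two operations, the example below uses two timestamps typed in by hand rather than the full `calls` table. The time zone `'America/Los_Angeles'` is an assumption about where the calls were logged; the notebook does not state it.

```python
import pandas as pd

# Two naive (time-zone-free) timestamps, typed in for illustration.
stamps = pd.Series(pd.to_datetime(['2018-04-18 22:17:00', '2018-05-09 21:25:00']))

# Attach a time zone to the naive values, then convert to another zone.
local = stamps.dt.tz_localize('America/Los_Angeles')  # assumed zone
utc = local.dt.tz_convert('UTC')

# UNIX/POSIX time: whole seconds elapsed since 1970-01-01 00:00:00 UTC.
epoch = pd.Timestamp('1970-01-01', tz='UTC')
unix_seconds = (utc - epoch) // pd.Timedelta(seconds=1)

print(utc)
print(unix_seconds)

# Going the other way: a UNIX time of 0 maps to the start of 1970.
print(pd.to_datetime(0, unit='s'))
```

The last line hints at why 1970s dates are worth checking for: a missing or zero timestamp that is interpreted as UNIX time lands on 1 January 1970.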


## Data Faithfulness: Mauna Loa CO2 data

CO2 concentrations have been monitored at Mauna Loa Observatory since 1958 ([website link](https://gml.noaa.gov/ccgg/trends/data.html)).




In [132]:
co2_file = r"C:\Users\Muhammad_Talha\Downloads\COHORT 7\Week 1 + 2\Data Wrangling\co2_mm_mlo.txt"
with open(co2_file, 'r') as file:
    content = file.read()
content


'# --------------------------------------------------------------------\n# USE OF NOAA ESRL DATA\n# \n# These data are made freely available to the public and the\n# scientific community in the belief that their wide dissemination\n# will lead to greater understanding and new scientific insights.\n# The availability of these data does not constitute publication\n# of the data.  NOAA relies on the ethics and integrity of the user to\n# ensure that ESRL receives fair credit for their work.  If the data \n# are obtained for potential use in a publication or presentation, \n# ESRL should be informed at the outset of the nature of this work.  \n# If the ESRL data are essential to the work, or if an important \n# result or conclusion depends on the ESRL data, co-authorship\n# may be appropriate.  This should be discussed at an early stage in\n# the work.  Manuscripts using the ESRL data should be sent to ESRL\n# for review before they are submitted for publication so we can\n# ensure that th

Let's do some **EDA**!!

### How do we read the file into Pandas?
Let's instead check out this file with JupyterLab.

* Note it's a `.txt` file.
* Do we trust this file extension?
* What structure is it?


Looking at the first few lines of the data, we spot some relevant characteristics:

- The values are separated by white space, possibly tabs.
- The data line up down the rows. For example, the month appears in the 7th to 8th positions of each line.
- The 71st and 72nd lines in the file contain column headings split over two lines.
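
To check observations like these without opening the file in an editor, we can print a chosen range of lines. The helper `show_lines` below is a hypothetical function written for this sketch; the demo runs on a tiny temporary file so that it is self-contained.

```python
import os
import tempfile
from itertools import islice

def show_lines(path, start, stop):
    """Return lines start..stop (1-indexed, inclusive) of a text file."""
    with open(path, 'r') as f:
        # islice skips the first start-1 lines without loading the whole file
        return [line.rstrip('\n') for line in islice(f, start - 1, stop)]

# Tiny stand-in file: one comment line, a two-line header, then two data rows.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('# comment\n#  year month\n#             avg\n'
              '1958  3  315.71\n1958  4  317.45\n')
    demo_path = tmp.name

header_lines = show_lines(demo_path, 2, 3)
print(header_lines)
os.remove(demo_path)
```

For the Mauna Loa file, a call such as `show_lines(co2_file, 71, 73)` would show the two header lines followed by the first data row.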

We can use `read_csv` to read the data into a Pandas data frame, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.

In [138]:
# use pd.read_csv to read txt file
df = pd.read_csv(
    co2_file,
    sep=r'\s+',     # any run of whitespace; delim_whitespace is deprecated in pandas 2.2+
    skiprows=72,
    header=None
)
df.head(10)

Unnamed: 0,0,1,2,3,4,5,6
0,1958,3,1958.21,315.71,315.71,314.62,-1
1,1958,4,1958.29,317.45,317.45,315.29,-1
2,1958,5,1958.38,317.5,317.5,314.71,-1
3,1958,6,1958.46,-99.99,317.1,314.85,-1
4,1958,7,1958.54,315.86,315.86,314.98,-1
5,1958,8,1958.62,314.93,314.93,315.94,-1
6,1958,9,1958.71,313.2,313.2,315.91,-1
7,1958,10,1958.79,-99.99,312.66,315.61,-1
8,1958,11,1958.88,313.33,313.33,315.31,-1
9,1958,12,1958.96,314.67,314.67,315.61,-1


Congratulations! You've wrangled the data!

<br/>

...But our columns aren't named.
**We need to do more EDA.**

### Exploring Variable Feature Types

The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
Let's go back to the raw data file to identify each feature.


We'll rerun `pd.read_csv`, but this time with some **custom column names.**

In [150]:
column_names = [ 'Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']

df = pd.read_csv(
    co2_file,
    sep=r'\s+',     # any run of whitespace; delim_whitespace is deprecated in pandas 2.2+
    skiprows=72,
    names=column_names
)
df.head()

Unnamed: 0,Yr,Mo,DecDate,Avg,Int,Trend,Days
0,1958,3,1958.21,315.71,315.71,314.62,-1
1,1958,4,1958.29,317.45,317.45,315.29,-1
2,1958,5,1958.38,317.5,317.5,314.71,-1
3,1958,6,1958.46,-99.99,317.1,314.85,-1
4,1958,7,1958.54,315.86,315.86,314.98,-1


Yikes! Plotting the data uncovered a problem. It looks like we have some **missing values**. What happened here?

In [145]:

# total number of NaN cells across all columns
missing_values = df.isna().sum().sum()
print(missing_values)

0
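
A count of zero does not mean nothing is missing. The rows printed earlier contain `-99.99` in `Avg` and `-1` in `Days`, which look like placeholder codes rather than measurements. The sketch below, run on a few of those rows typed in by hand, counts the codes and swaps them for `NaN`. Treating them as "missing" is an assumption that should be confirmed against the header comments of the data file.

```python
import numpy as np
import pandas as pd

# A few rows copied by hand from the table printed above.
sample = pd.DataFrame({
    'Yr': [1958, 1958, 1958, 1958],
    'Mo': [3, 4, 5, 6],
    'Avg': [315.71, 317.45, 317.50, -99.99],
    'Days': [-1, -1, -1, -1],
})

# isna() only detects NaN, so the placeholder codes go unnoticed.
print(sample.isna().sum().sum())  # → 0

# Count the suspected placeholder codes directly.
n_avg_codes = (sample['Avg'] == -99.99).sum()
n_days_codes = (sample['Days'] == -1).sum()

# Assumption: -99.99 and -1 mean "not recorded"; replace them with NaN.
cleaned = sample.copy()
cleaned['Avg'] = cleaned['Avg'].replace(-99.99, np.nan)
cleaned['Days'] = cleaned['Days'].replace(-1, np.nan)

print(cleaned.isna().sum())
```

Whether to drop, keep, or interpolate such rows is a separate decision; the `Int` column already appears to hold a filled-in value for the months where `Avg` is `-99.99`.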


In [151]:
df.tail()

Unnamed: 0,Yr,Mo,DecDate,Avg,Int,Trend,Days
733,2019,4,2019.29,413.32,413.32,410.49,26
734,2019,5,2019.38,414.66,414.66,411.2,28
735,2019,6,2019.46,413.92,413.92,411.58,27
736,2019,7,2019.54,411.77,411.77,411.43,23
737,2019,8,2019.62,409.95,409.95,411.84,29
