[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github.com/onedataengineer21/notebooks/blob/main/python/Exploring%20Chicago%20Crimes%20Data%202022.ipynb)




*Chicago has always been in the news during election years for the wrong reasons, with politicians raising the number of crimes in the city. This notebook will take a look at the Chicago Crimes dataset for the year 2022.*

In [1]:
!wget https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-2020.csv
!wget https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-2021.csv
!wget https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-2022.csv
!wget https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-2023.csv

--2024-04-07 04:03:59--  https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-2020.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 35871900 (34M) [text/plain]
Saving to: ‘us-counties-2020.csv’


2024-04-07 04:04:00 (140 MB/s) - ‘us-counties-2020.csv’ saved [35871900/35871900]

--2024-04-07 04:04:00--  https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-2021.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 50311433 (48M) [text/plain]
Saving to: ‘us-counties-2021.csv’


2024-04-07 

In [2]:
## importing the packages
import pandas as pd
import os
from pathlib import Path
import duckdb
import numpy as np

# Get the current working directory
current_working_directory = os.getcwd()

# Convert the current working directory to a Path object
script_dir = Path(current_working_directory)

### Extracting the US counties datasets from the New York Times GitHub repo

* We pull the US counties datasets for 2020 through 2023.
* Each dataset is in CSV format.
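
The four yearly files follow a single URL pattern, so the reads can be driven from a list of years. The following is a minimal sketch under that assumption: the helper names `county_csv_url` and `read_county_years` are hypothetical and not part of this notebook, and the reader is passed in as a parameter so it can be swapped for a stub when no network is available.

```python
import pandas as pd

# Base URL taken from the wget commands in the first cell.
BASE_URL = "https://raw.githubusercontent.com/nytimes/covid-19-data/master"


def county_csv_url(year):
    """Build the URL of the county-level file for one year (hypothetical helper)."""
    return f"{BASE_URL}/us-counties-{year}.csv"


def read_county_years(years, reader=pd.read_csv):
    """Return {year: DataFrame}; `reader` is injectable so it can be stubbed."""
    return {year: reader(county_csv_url(year)) for year in years}


# The same four files that the notebook downloads with wget.
urls = [county_csv_url(year) for year in range(2020, 2024)]
```

Calling `read_county_years(range(2020, 2024))` would fetch the files directly from GitHub, which makes the hard-coded `/content/` paths unnecessary outside Colab.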

In [3]:
covid2020 = pd.read_csv("/content/us-counties-2020.csv")
covid2021 = pd.read_csv("/content/us-counties-2021.csv")
covid2022 = pd.read_csv("/content/us-counties-2022.csv")
covid2023 = pd.read_csv("/content/us-counties-2023.csv")

In [4]:
def extract_data(dataset_path):
  """
  Read the CSV file at dataset_path into a DataFrame.
  Returns None if the file is missing, empty, or cannot be parsed.
  """
  try:
    return pd.read_csv(dataset_path)
  except FileNotFoundError:
    print("File not found.")
  except pd.errors.EmptyDataError:
    print("No data")
  except pd.errors.ParserError:
    print("Parse error")
  # Reached only when one of the errors above was caught
  return None

In [5]:
covid2020 = extract_data("/content/us-counties-2020.csv")
covid2021 = extract_data("/content/us-counties-2021.csv")
covid2022 = extract_data("/content/us-counties-2022.csv")
covid2023 = extract_data("/content/us-counties-2023.csv")

### Transform the dataset

* Merge all the datasets from 2020 to 2023.
* Add new columns to the dataset - DailyCases and DailyDeaths
* Filter the data to a given state
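
Because `cases` and `deaths` are cumulative totals, the daily columns come from differencing consecutive rows, and that differencing should restart at each county. The sketch below uses made-up numbers for two hypothetical counties to compare a plain `diff()` with a per-county `groupby(...).diff()`; the `naive` column name is an illustration only.

```python
import pandas as pd

# Toy cumulative counts for two made-up counties (not real data).
toy = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-02", "2020-03-03"] * 2,
    "county": ["A"] * 3 + ["B"] * 3,
    "cases": [10, 12, 15, 1, 1, 4],
}).sort_values(["county", "date"])

# Plain diff() subtracts across the county boundary:
# the first row of county B becomes 1 - 15 = -14.
toy["naive"] = toy["cases"].diff().fillna(0).astype("Int64")

# Grouping by county first restarts the difference at each county's first row.
toy["DailyCases"] = (
    toy.groupby("county")["cases"].diff().fillna(0).astype("Int64")
)

print(toy[["county", "date", "cases", "naive", "DailyCases"]])
```

The two columns differ only on the first row of county B, which is exactly the row where one county's series ends and the next begins.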

In [6]:
def transform(df_list, statename):
  """
  Combine the yearly dataframes, keep the rows for statename,
  and add the DailyCases and DailyDeaths columns.
  """
  ### Concatenating the dataframes into one single dataframe
  covid = pd.concat(df_list)

  ### Filtering the dataframe for the given state
  covid = covid[covid.state == statename].sort_values(by=['county', 'date'])

  ### Adding Daily cases and Daily Deaths
  ### Grouping by county keeps diff() from subtracting across county boundaries
  by_county = covid.groupby('county')
  covid['DailyCases'] = by_county['cases'].diff().fillna(0).astype('Int64')
  covid['DailyDeaths'] = by_county['deaths'].diff().fillna(0).astype('Int64')

  return covid

In [7]:
covid = transform([covid2020, covid2021, covid2022, covid2023], 'Illinois')

In [8]:
covid.shape

(111844, 8)

In [9]:
covid.head()

| | date | county | state | fips | cases | deaths | DailyCases | DailyDeaths |
|---|---|---|---|---|---|---|---|---|
| 5401 | 2020-03-20 | Adams | Illinois | 17001.0 | 1 | 0.0 | 0 | 0 |
| 6335 | 2020-03-21 | Adams | Illinois | 17001.0 | 1 | 0.0 | 0 | 0 |
| 7384 | 2020-03-22 | Adams | Illinois | 17001.0 | 1 | 0.0 | 0 | 0 |
| 8539 | 2020-03-23 | Adams | Illinois | 17001.0 | 1 | 0.0 | 0 | 0 |
| 9802 | 2020-03-24 | Adams | Illinois | 17001.0 | 1 | 0.0 | 0 | 0 |


In [10]:
covid = covid[covid.state == 'Illinois'].sort_values(by=['county', 'date'])

# Group by county so diff() does not subtract across county boundaries
covid['DailyCases'] = covid.groupby('county')['cases'].diff().fillna(0).astype('Int64')
covid['DailyDeaths'] = covid.groupby('county')['deaths'].diff().fillna(0).astype('Int64')

In [11]:
covid[['DailyCases', 'DailyDeaths']]

| | DailyCases | DailyDeaths |
|---|---|---|
| 5401 | 0 | 0 |
| 6335 | 0 | 0 |
| 7384 | 0 | 0 |
| 8539 | 0 | 0 |
| 9802 | 0 | 0 |
| ... | ... | ... |
| 251431 | 0 | 0 |
| 254687 | 0 | 0 |
| 257943 | 0 | 0 |
| 261199 | 0 | 0 |
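
One way to sanity-check derived daily columns is to confirm that, for each county, they add back up to the change in the cumulative total between the first and last date. The sketch below is an assumption-laden illustration: `check_daily_totals` is a hypothetical helper, the data is made up, and the rows are assumed to be sorted by date within each county.

```python
import pandas as pd


def check_daily_totals(df, cumulative="cases", daily="DailyCases"):
    """Return counties whose daily sum differs from (last - first) cumulative value.

    Hypothetical helper; assumes rows are already sorted by date within county.
    """
    grouped = df.groupby("county")
    expected = (grouped[cumulative].last() - grouped[cumulative].first()).astype("float64")
    actual = grouped[daily].sum().astype("float64")
    return expected[expected != actual].index.tolist()


# Made-up data: county B's first daily value is deliberately wrong (-14).
toy = pd.DataFrame({
    "county": ["A", "A", "A", "B", "B", "B"],
    "cases": [10, 12, 15, 1, 1, 4],
    "DailyCases": [0, 2, 3, -14, 0, 3],
})

bad_counties = check_daily_totals(toy)
print(bad_counties)
```

An empty list means every county's daily values are consistent with its cumulative series under this check.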


In [12]:
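def load(data, name):
  """
  Write data to a CSV file called <name>.csv in the working directory.
  """
  # Use the data argument rather than the global covid dataframe
  data.to_csv(name + ".csv")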

In [13]:
load(covid, "Illinois")

In [14]:
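# Re-run the transform so the file holds Wisconsin rows, not the Illinois data
covid = transform([covid2020, covid2021, covid2022, covid2023], 'Wisconsin')
load(covid, "Wisconsin")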

In [15]:
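# Re-run the transform so the file holds Colorado rows
covid = transform([covid2020, covid2021, covid2022, covid2023], 'Colorado')
load(covid, "Colorado")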
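
Writing one file per state only makes sense if the rows are filtered for each state before saving; otherwise every file holds the same data. A minimal sketch of that loop follows, assuming a made-up dataframe and a temporary output directory. `export_states` is a hypothetical helper, and `index=False` is a choice made here, not something the notebook does.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_states(df, states, out_dir):
    """Write one CSV per state and return {state: path} (hypothetical helper)."""
    out_dir = Path(out_dir)
    written = {}
    for state in states:
        # Filter per state so each file contains only that state's rows.
        subset = df[df["state"] == state]
        path = out_dir / f"{state}.csv"
        subset.to_csv(path, index=False)
        written[state] = path
    return written


# Made-up rows standing in for the combined county data.
toy = pd.DataFrame({
    "state": ["Illinois", "Illinois", "Wisconsin", "Colorado"],
    "county": ["Adams", "Cook", "Dane", "Denver"],
    "cases": [1, 5, 2, 3],
})

with tempfile.TemporaryDirectory() as tmp:
    paths = export_states(toy, ["Illinois", "Wisconsin", "Colorado"], tmp)
    # Read each file back while the temporary directory still exists.
    row_counts = {state: len(pd.read_csv(path)) for state, path in paths.items()}

print(row_counts)
```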