# Preparing Chicago Crime Data

Shenyue Jia

## About Chicago Crime Data

- Source: [Chicago Data Portal](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2): Crimes 2001 to Present
    - Data Description
        - All crimes reported in the city of Chicago, with their details
    - Includes
        - Type of crime, exact date/time, lat/long, district/ward, whether an arrest was made, etc.

## Reference

- [Data Processing Helper Notebook](https://github.com/coding-dojo-data-science/preparing-chicago-crime-data)

## Download Chicago Crime Data

### Set the correct `RAW_FILE` path

- The cell below checks your Downloads folder (including subfolders) for any file whose name starts with "Crimes_-_2001_to_Present".
    - If you already know the file path, skip the next cell and set the `RAW_FILE` variable manually in the following code cell.

In [1]:
## Run the cell below to attempt to programmatically find your crime file
import os,glob

## Getting the home folder (expanduser also works on Windows, where HOME may be unset)
home_folder = os.path.expanduser('~')
# print("- Your Home Folder is: " + home_folder)

## Check for downloads folder
if 'Downloads' in os.listdir(home_folder):
    
    
    # Print the Downloads folder path
    dl_folder = os.path.abspath(os.path.join(home_folder,'Downloads'))
    print(f"- Your Downloads folder is '{dl_folder}/'\n")
    
    ## checking for crime files using glob
    crime_files = sorted(glob.glob(dl_folder+'/**/Crimes_-_2001_to_Present*',recursive=True))
    
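    # Use the file directly if exactly one match was found
    if len(crime_files)==1:
        RAW_FILE = crime_files[0]
        
    # If more than one match was found, list them for manual selection
    elif len(crime_files)>1:
        print('[i] The following files were found:')
        
        for i, fname in enumerate(crime_files):
            print(f"\tcrime_files[{i}] = '{fname}'")
        print('\n- Please fill in the RAW_FILE variable in the code cell below with the correct filepath.')

    # No matches: the file may be saved elsewhere or under a different name
    else:
        print('[!] No crime files were found in your Downloads folder.')

else:
    print('[!] Could not programmatically find your downloads folder.')
    print('- Try using Finder (on Mac) or File Explorer (Windows) to navigate to your Downloads folder.')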

- Your Downloads folder is '/Users/Shenyue/Downloads/'
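
The same search can also be sketched with `pathlib`, which resolves the home folder without reading the `HOME` environment variable. This is only an illustrative alternative: the helper name `find_crime_files` and its `pattern` argument are assumptions, not part of the original notebook.

```python
from pathlib import Path


def find_crime_files(folder, pattern="Crimes_-_2001_to_Present*"):
    """Return sorted paths under `folder` whose file names match `pattern`.

    Hypothetical helper for illustration only.
    """
    folder = Path(folder)
    if not folder.is_dir():
        return []
    # rglob searches the folder and all of its subfolders
    return sorted(str(p) for p in folder.rglob(pattern))


# Path.home() resolves the user's home folder on macOS, Linux, and Windows
downloads = Path.home() / "Downloads"
candidates = find_crime_files(downloads)
print(f"Found {len(candidates)} candidate file(s) in {downloads}")
```

If the list holds more than one path, pick the right one by index, as with `crime_files` above.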



In [2]:
## (Required) MAKE SURE TO CHANGE THIS VARIABLE TO MATCH YOUR LOCAL FILE NAME
RAW_FILE = "/Users/Shenyue/Downloads/Crimes_-_2001_to_Present.csv" #(or slice correct index from the crime_files list)

## Stop early if the path does not point to an existing file
if not os.path.isfile(RAW_FILE):
    raise Exception("You must update the RAW_FILE variable to match your local filepath.")

RAW_FILE

'/Users/Shenyue/Downloads/Crimes_-_2001_to_Present.csv'
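
Before loading the full file, it can help to confirm that the path is valid and to preview a few rows. The sketch below is one way to do that. `preview_csv` is a hypothetical helper, and the tiny temporary file holds made-up rows that stand in for the real download.

```python
import os
import tempfile

import pandas as pd


def preview_csv(path, n=5):
    """Raise if `path` is not a file; otherwise return its first `n` rows."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No file found at: {path}")
    # nrows limits how many data rows are read from the file
    return pd.read_csv(path, nrows=n)


# Demo on a tiny stand-in file (made-up rows, not real crime records)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w") as f:
        f.write("ID,Primary Type\n1,THEFT\n2,BATTERY\n3,ASSAULT\n")
    preview = preview_csv(demo_path, n=2)

print(preview)
```

With the real data, the call would be `preview_csv(RAW_FILE)`.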

In [3]:
## (Optional) SET THE FOLDER FOR FINAL FILES
OUTPUT_FOLDER = 'Data/Chicago/'
os.makedirs(OUTPUT_FOLDER, exist_ok=True)

### Load and Format Chicago Crime Data

In [4]:
import pandas as pd
pd.set_option('display.max_columns', 100)
pd.set_option('display.float_format',lambda x: f"{x:,.2f}")

In [5]:
chicago_full = pd.read_csv(RAW_FILE)
chicago_full

,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10224738,HY411648,09/05/2015 01:30:00 PM,043XX S WOOD ST,0486,BATTERY,DOMESTIC BATTERY SIMPLE,RESIDENCE,False,True,924,9.00,12.00,61.00,08B,1165074.00,1875917.00,2015,02/10/2018 03:50:01 PM,41.82,-87.67,"(41.815117282, -87.669999562)"
1,10224739,HY411615,09/04/2015 11:30:00 AM,008XX N CENTRAL AVE,0870,THEFT,POCKET-PICKING,CTA BUS,False,False,1511,15.00,29.00,25.00,06,1138875.00,1904869.00,2015,02/10/2018 03:50:01 PM,41.90,-87.77,"(41.895080471, -87.765400451)"
2,11646166,JC213529,09/01/2018 12:01:00 AM,082XX S INGLESIDE AVE,0810,THEFT,OVER $500,RESIDENCE,False,True,631,6.00,8.00,44.00,06,,,2018,04/06/2019 04:04:43 PM,,,
3,10224740,HY411595,09/05/2015 12:45:00 PM,035XX W BARRY AVE,2023,NARCOTICS,POSS: HEROIN(BRN/TAN),SIDEWALK,True,False,1412,14.00,35.00,21.00,18,1152037.00,1920384.00,2015,02/10/2018 03:50:01 PM,41.94,-87.72,"(41.937405765, -87.716649687)"
4,10224741,HY411610,09/05/2015 01:00:00 PM,0000X N LARAMIE AVE,0560,ASSAULT,SIMPLE,APARTMENT,False,True,1522,15.00,28.00,25.00,08A,1141706.00,1900086.00,2015,02/10/2018 03:50:01 PM,41.88,-87.76,"(41.881903443, -87.755121152)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7765324,12936285,JF526139,06/27/2022 10:05:00 AM,025XX N HALSTED ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,1935,19.00,43.00,7.00,11,1170513.00,1917030.00,2022,01/03/2023 03:46:28 PM,41.93,-87.65,"(41.927817456, -87.648845932)"
7765325,12936301,JF526810,12/22/2022 06:00:00 PM,020XX W CORNELIA AVE,1320,CRIMINAL DAMAGE,TO VEHICLE,STREET,False,False,1921,19.00,32.00,5.00,14,1161968.00,1923233.00,2022,01/03/2023 03:46:28 PM,41.95,-87.68,"(41.945021752, -87.680071764)"
7765326,12936397,JF526745,12/19/2022 02:00:00 PM,044XX N ROCKWELL ST,0620,BURGLARY,UNLAWFUL ENTRY,APARTMENT,False,False,1911,19.00,47.00,4.00,05,1158237.00,1929586.00,2022,01/03/2023 03:46:28 PM,41.96,-87.69,"(41.962531969, -87.693611152)"
7765327,12935341,JF525383,12/20/2022 06:45:00 AM,027XX W ROOSEVELT RD,0810,THEFT,OVER $500,STREET,False,False,1135,11.00,28.00,29.00,06,1158071.00,1894595.00,2022,01/03/2023 03:46:28 PM,41.87,-87.70,"(41.866517317, -87.695178701)"


In [6]:
# explicitly setting the format to speed up pd.to_datetime
date_format = "%m/%d/%Y %I:%M:%S %p"


### Demonstrating/testing date_format
example = chicago_full.loc[0,'Date']
display(example)
pd.to_datetime(example,format=date_format)

'09/05/2015 01:30:00 PM'

Timestamp('2015-09-05 13:30:00')
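
In the format string, `%I` is the hour on a 12-hour clock and `%p` is the AM/PM marker, so `12:00:00 AM` parses to midnight and `01:30:00 PM` to 13:30. A small check on a few sample strings (the 2015 and 2023 values are `Date` strings shown in this notebook's output; the 2001 value combines a date and a time that each appear there):

```python
import pandas as pd

# Same format string as the notebook: month/day/year, 12-hour clock, AM/PM
date_format = "%m/%d/%Y %I:%M:%S %p"

samples = pd.Series(["01/01/2001 12:00:00 AM",   # midnight
                     "09/05/2015 01:30:00 PM",   # early afternoon
                     "03/25/2023 11:58:00 PM"])  # late evening
parsed = pd.to_datetime(samples, format=date_format)
print(parsed)
```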

In [7]:
# this cell can take up to 1 min to run
chicago_full['Datetime'] = pd.to_datetime(chicago_full['Date'], format=date_format)
chicago_full = chicago_full.sort_values('Datetime')
chicago_full = chicago_full.set_index('Datetime')
chicago_full

Datetime,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
2001-01-01 00:00:00,1322043,G003560,01/01/2001 12:00:00 AM,061XX S ARTESIAN AV,1310,CRIMINAL DAMAGE,TO PROPERTY,RESIDENCE,False,False,825,8.00,,,14,1161130.00,1863925.00,2001,08/17/2015 03:03:40 PM,41.78,-87.68,"(41.782292325, -87.684798685)"
2001-01-01 00:00:00,2616775,HJ220522,01/01/2001 12:00:00 AM,071XX N PAULINA ST,1752,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,RESIDENCE,False,False,2423,24.00,49.00,1.00,20,1163844.00,1947579.00,2001,08/17/2015 03:03:40 PM,42.01,-87.67,"(42.011788612, -87.672485973)"
2001-01-01 00:00:00,8146039,HT380969,01/01/2001 12:00:00 AM,047XX W MONTANA ST,0266,CRIM SEXUAL ASSAULT,PREDATORY,RESIDENCE,False,False,2521,25.00,31.00,19.00,02,1144295.00,1915861.00,2001,07/15/2011 12:39:33 AM,41.93,-87.75,"(41.925143516, -87.745217143)"
2001-01-01 00:00:00,9748516,HX397222,01/01/2001 12:00:00 AM,031XX W DOUGLAS BLVD,1562,SEX OFFENSE,AGG CRIMINAL SEXUAL ABUSE,CHURCH/SYNAGOGUE/PLACE OF WORSHIP,False,False,1022,10.00,24.00,29.00,17,,,2001,08/17/2015 03:03:40 PM,,,
2001-01-01 00:00:00,10473864,HZ213356,01/01/2001 12:00:00 AM,012XX S DAMEN AVE,1582,OFFENSE INVOLVING CHILDREN,CHILD PORNOGRAPHY,OTHER,False,False,1233,12.00,2.00,28.00,17,,,2001,04/09/2016 03:47:49 PM,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-03-25 23:50:00,13022210,JG199900,03/25/2023 11:50:00 PM,0000X E WACKER DR,0810,THEFT,OVER $500,HOTEL / MOTEL,False,False,111,1.00,42.00,32.00,06,1177009.00,1902589.00,2023,04/01/2023 04:49:15 PM,41.89,-87.63,"(41.888045958, -87.625413826)"
2023-03-25 23:51:00,13021672,JG199255,03/25/2023 11:51:00 PM,006XX E 79TH ST,2220,LIQUOR LAW VIOLATION,ILLEGAL POSSESSION BY MINOR,STREET,False,False,624,6.00,6.00,44.00,22,1182080.00,1852773.00,2023,04/01/2023 04:49:15 PM,41.75,-87.61,"(41.751230848, -87.608335265)"
2023-03-25 23:57:00,13021664,JG199221,03/25/2023 11:57:00 PM,033XX W BARRY AVE,0486,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,False,True,1412,14.00,35.00,21.00,08B,1153362.00,1920417.00,2023,04/01/2023 04:49:15 PM,41.94,-87.71,"(41.937470062, -87.711779184)"
2023-03-25 23:58:00,13021669,JG199242,03/25/2023 11:58:00 PM,006XX E 79TH ST,0484,BATTERY,"PROTECTED EMPLOYEE - HANDS, FISTS, FEET, NO / ...",STREET,True,False,624,6.00,6.00,44.00,08B,1182080.00,1852773.00,2023,04/01/2023 04:49:15 PM,41.75,-87.61,"(41.751230848, -87.608335265)"


In [8]:
(chicago_full.isna().sum()/len(chicago_full)).round(2)

ID                     0.00
Case Number            0.00
Date                   0.00
Block                  0.00
IUCR                   0.00
Primary Type           0.00
Description            0.00
Location Description   0.00
Arrest                 0.00
Domestic               0.00
Beat                   0.00
District               0.00
Ward                   0.08
Community Area         0.08
FBI Code               0.00
X Coordinate           0.01
Y Coordinate           0.01
Year                   0.00
Updated On             0.00
Latitude               0.01
Longitude              0.01
Location               0.01
dtype: float64
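
The expression in the cell above divides the count of missing values by the number of rows. Because `isna()` returns booleans, taking the mean gives the same fraction. A minimal sketch on made-up values (not real crime records):

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Ward": [12.0, np.nan, 8.0, 35.0],
    "Latitude": [41.82, 41.90, np.nan, np.nan],
    "Arrest": [False, False, True, False],
})

# Two equivalent ways to get the fraction of missing values per column
frac_by_sum = (toy.isna().sum() / len(toy)).round(2)
frac_by_mean = toy.isna().mean().round(2)
print(frac_by_mean)
```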

### Separate the Full Dataset by Years

In [9]:
# store each crime's year as a string, taken from the datetime index
chicago_full["Year"] = chicago_full.index.year
chicago_full["Year"] = chicago_full["Year"].astype(str)
chicago_full["Year"].value_counts()

2002    486801
2001    485877
2003    475979
2004    469420
2005    453770
2006    448174
2007    437083
2008    427165
2009    392818
2010    370494
2011    351960
2012    336261
2013    307464
2014    275731
2016    269783
2017    269060
2018    268764
2015    264751
2019    261232
2022    237688
2020    212066
2021    208520
2023     54468
Name: Year, dtype: int64

In [10]:
## Dropping unneeded columns to reduce file size
drop_cols = ["X Coordinate","Y Coordinate", "Community Area","FBI Code",
             "Case Number","Updated On",'Block','Location','IUCR']

In [11]:
# save final df
chicago_final = chicago_full.drop(columns=drop_cols).sort_index()#.reset_index()
chicago_final

Datetime,ID,Date,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Year,Latitude,Longitude
2001-01-01 00:00:00,1322043,01/01/2001 12:00:00 AM,CRIMINAL DAMAGE,TO PROPERTY,RESIDENCE,False,False,825,8.00,,2001,41.78,-87.68
2001-01-01 00:00:00,2616775,01/01/2001 12:00:00 AM,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,RESIDENCE,False,False,2423,24.00,49.00,2001,42.01,-87.67
2001-01-01 00:00:00,8146039,01/01/2001 12:00:00 AM,CRIM SEXUAL ASSAULT,PREDATORY,RESIDENCE,False,False,2521,25.00,31.00,2001,41.93,-87.75
2001-01-01 00:00:00,9748516,01/01/2001 12:00:00 AM,SEX OFFENSE,AGG CRIMINAL SEXUAL ABUSE,CHURCH/SYNAGOGUE/PLACE OF WORSHIP,False,False,1022,10.00,24.00,2001,,
2001-01-01 00:00:00,10473864,01/01/2001 12:00:00 AM,OFFENSE INVOLVING CHILDREN,CHILD PORNOGRAPHY,OTHER,False,False,1233,12.00,2.00,2001,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-03-25 23:50:00,13022210,03/25/2023 11:50:00 PM,THEFT,OVER $500,HOTEL / MOTEL,False,False,111,1.00,42.00,2023,41.89,-87.63
2023-03-25 23:51:00,13021672,03/25/2023 11:51:00 PM,LIQUOR LAW VIOLATION,ILLEGAL POSSESSION BY MINOR,STREET,False,False,624,6.00,6.00,2023,41.75,-87.61
2023-03-25 23:57:00,13021664,03/25/2023 11:57:00 PM,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,False,True,1412,14.00,35.00,2023,41.94,-87.71
2023-03-25 23:58:00,13021669,03/25/2023 11:58:00 PM,BATTERY,"PROTECTED EMPLOYEE - HANDS, FISTS, FEET, NO / ...",STREET,True,False,624,6.00,6.00,2023,41.75,-87.61


In [13]:
chicago_final.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 7765329 entries, 2001-01-01 00:00:00 to 2023-03-25 23:58:00
Data columns (total 13 columns):
 #   Column                Dtype  
---  ------                -----  
 0   ID                    int64  
 1   Date                  object 
 2   Primary Type          object 
 3   Description           object 
 4   Location Description  object 
 5   Arrest                bool   
 6   Domestic              bool   
 7   Beat                  int64  
 8   District              float64
 9   Ward                  float64
 10  Year                  object 
 11  Latitude              float64
 12  Longitude             float64
dtypes: bool(2), float64(4), int64(2), object(5)
memory usage: 725.7+ MB


In [14]:
chicago_final.memory_usage(deep=True).astype(float)

Index                   62,122,632.00
ID                      62,122,632.00
Date                   613,460,991.00
Primary Type           520,539,068.00
Description            568,349,607.00
Location Description   529,272,903.00
Arrest                   7,765,329.00
Domestic                 7,765,329.00
Beat                    62,122,632.00
District                62,122,632.00
Ward                    62,122,632.00
Year                   473,685,069.00
Latitude                62,122,632.00
Longitude               62,122,632.00
dtype: float64
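
The `deep=True` figures show that the text (`object`) columns account for most of the memory. One common option, which this notebook does not use, is to store repetitive text as the `category` dtype. The sketch below compares the two on a made-up column that stands in for `Primary Type`; the saving on the real data would depend on how many distinct values each column has.

```python
import pandas as pd

# Made-up column with a few repeated labels, stored explicitly as object dtype
labels = pd.Series(["THEFT", "BATTERY", "NARCOTICS", "THEFT"] * 25_000,
                   name="Primary Type", dtype=object)

as_object = labels.memory_usage(deep=True)
as_category = labels.astype("category").memory_usage(deep=True)

print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
```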

In [15]:
# unique year labels, used below to split the data into one file per year
year_bins = chicago_final['Year'].astype(str).unique()
year_bins

array(['2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008',
       '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
       '2017', '2018', '2019', '2020', '2021', '2022', '2023'],
      dtype=object)

In [16]:
FINAL_DROP = ['Date','Year']

In [17]:
## set save location 

os.makedirs(OUTPUT_FOLDER, exist_ok=True)
print(f"[i] Saving .csv's to {OUTPUT_FOLDER}")
## loop through years
for year in year_bins:
    
    ## select this year's rows via partial string indexing on the datetime index
    temp_df = chicago_final.loc[year]
    temp_df = temp_df.reset_index(drop=False)
    temp_df = temp_df.drop(columns=FINAL_DROP)

    # save as csv to output folder
    fname_temp = f"{OUTPUT_FOLDER}Chicago-Crime_{year}.csv"#.gz
    temp_df.to_csv(fname_temp,index=False)

    print(f"- Successfully saved {fname_temp}")

[i] Saving .csv's to Data/Chicago/
- Successfully saved Data/Chicago/Chicago-Crime_2001.csv
- Successfully saved Data/Chicago/Chicago-Crime_2002.csv
- Successfully saved Data/Chicago/Chicago-Crime_2003.csv
- Successfully saved Data/Chicago/Chicago-Crime_2004.csv
- Successfully saved Data/Chicago/Chicago-Crime_2005.csv
- Successfully saved Data/Chicago/Chicago-Crime_2006.csv
- Successfully saved Data/Chicago/Chicago-Crime_2007.csv
- Successfully saved Data/Chicago/Chicago-Crime_2008.csv
- Successfully saved Data/Chicago/Chicago-Crime_2009.csv
- Successfully saved Data/Chicago/Chicago-Crime_2010.csv
- Successfully saved Data/Chicago/Chicago-Crime_2011.csv
- Successfully saved Data/Chicago/Chicago-Crime_2012.csv
- Successfully saved Data/Chicago/Chicago-Crime_2013.csv
- Successfully saved Data/Chicago/Chicago-Crime_2014.csv
- Successfully saved Data/Chicago/Chicago-Crime_2015.csv
- Successfully saved Data/Chicago/Chicago-Crime_2016.csv
- Successfully saved Data/Chicago/Chicago-Crime_2017.csv
- Successfully
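
The loop relies on partial string indexing: passing a year string to `.loc` on a DataFrame with a `DatetimeIndex` selects every row whose timestamp falls in that year. A minimal sketch on made-up rows (the `toy` frame is illustrative, not real crime data):

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame(
    {"Primary Type": ["THEFT", "BATTERY", "ASSAULT", "THEFT"]},
    index=pd.to_datetime(["2001-01-01 00:00:00", "2001-06-15 13:30:00",
                          "2002-03-10 08:00:00", "2002-12-31 23:59:00"]),
)
toy.index.name = "Datetime"

# A year string selects every row whose timestamp falls in that year
rows_2001 = toy.loc["2001"]
print(rows_2001)

# reset_index moves Datetime back into a regular column before saving
flat_2001 = rows_2001.reset_index(drop=False)
print(flat_2001.columns.tolist())
```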

In [18]:
saved_files = sorted(glob.glob(OUTPUT_FOLDER+'*.*csv'))
saved_files

['Data/Chicago/Chicago-Crime_2001.csv',
 'Data/Chicago/Chicago-Crime_2002.csv',
 'Data/Chicago/Chicago-Crime_2003.csv',
 'Data/Chicago/Chicago-Crime_2004.csv',
 'Data/Chicago/Chicago-Crime_2005.csv',
 'Data/Chicago/Chicago-Crime_2006.csv',
 'Data/Chicago/Chicago-Crime_2007.csv',
 'Data/Chicago/Chicago-Crime_2008.csv',
 'Data/Chicago/Chicago-Crime_2009.csv',
 'Data/Chicago/Chicago-Crime_2010.csv',
 'Data/Chicago/Chicago-Crime_2011.csv',
 'Data/Chicago/Chicago-Crime_2012.csv',
 'Data/Chicago/Chicago-Crime_2013.csv',
 'Data/Chicago/Chicago-Crime_2014.csv',
 'Data/Chicago/Chicago-Crime_2015.csv',
 'Data/Chicago/Chicago-Crime_2016.csv',
 'Data/Chicago/Chicago-Crime_2017.csv',
 'Data/Chicago/Chicago-Crime_2018.csv',
 'Data/Chicago/Chicago-Crime_2019.csv',
 'Data/Chicago/Chicago-Crime_2020.csv',
 'Data/Chicago/Chicago-Crime_2021.csv',
 'Data/Chicago/Chicago-Crime_2022.csv',
 'Data/Chicago/Chicago-Crime_2023.csv']

In [19]:
## create a README.txt for the output files
readme = """Source URL: 
- https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2
- Filtered for years 2001-Present.

Downloaded 04/02/2023
- Files are split into 1 year per file.

EXAMPLE USAGE:
>> import glob
>> import pandas as pd
>> folder = "Data/Chicago/"
>> crime_files = sorted(glob.glob(folder+"*.csv"))
>> df = pd.concat([pd.read_csv(f) for f in crime_files])
"""
print(readme)


with open(f"{OUTPUT_FOLDER}README.txt",'w') as f:
    f.write(readme)

Source URL: 
- https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2
- Filtered for years 2001-Present.

Downloaded 04/02/2023
- Files are split into 1 year per file.

EXAMPLE USAGE:
>> import glob
>> import pandas as pd
>> folder = "Data/Chicago/"
>> crime_files = sorted(glob.glob(folder+"*.csv"))
>> df = pd.concat([pd.read_csv(f) for f in crime_files])



### Confirmation

- Follow the example usage above to test if your files were created successfully.

In [20]:
# get list of files from folder
crime_files = sorted(glob.glob(OUTPUT_FOLDER+"*.csv"))
df = pd.concat([pd.read_csv(f) for f in crime_files])
df

,Datetime,ID,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Latitude,Longitude
0,2001-01-01 00:00:00,1322043,CRIMINAL DAMAGE,TO PROPERTY,RESIDENCE,False,False,825,8.00,,41.78,-87.68
1,2001-01-01 00:00:00,2616775,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,RESIDENCE,False,False,2423,24.00,49.00,42.01,-87.67
2,2001-01-01 00:00:00,8146039,CRIM SEXUAL ASSAULT,PREDATORY,RESIDENCE,False,False,2521,25.00,31.00,41.93,-87.75
3,2001-01-01 00:00:00,9748516,SEX OFFENSE,AGG CRIMINAL SEXUAL ABUSE,CHURCH/SYNAGOGUE/PLACE OF WORSHIP,False,False,1022,10.00,24.00,,
4,2001-01-01 00:00:00,10473864,OFFENSE INVOLVING CHILDREN,CHILD PORNOGRAPHY,OTHER,False,False,1233,12.00,2.00,,
...,...,...,...,...,...,...,...,...,...,...,...,...
54463,2023-03-25 23:50:00,13022210,THEFT,OVER $500,HOTEL / MOTEL,False,False,111,1.00,42.00,41.89,-87.63
54464,2023-03-25 23:51:00,13021672,LIQUOR LAW VIOLATION,ILLEGAL POSSESSION BY MINOR,STREET,False,False,624,6.00,6.00,41.75,-87.61
54465,2023-03-25 23:57:00,13021664,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,False,True,1412,14.00,35.00,41.94,-87.71
54466,2023-03-25 23:58:00,13021669,BATTERY,"PROTECTED EMPLOYEE - HANDS, FISTS, FEET, NO / ...",STREET,True,False,624,6.00,6.00,41.75,-87.61


In [23]:
years = df['Datetime'].map(lambda x: x.split()[0].split('-')[0])
years.value_counts().sort_index()

2001    485877
2002    486801
2003    475979
2004    469420
2005    453770
2006    448174
2007    437083
2008    427165
2009    392818
2010    370494
2011    351960
2012    336261
2013    307464
2014    275731
2015    264751
2016    269783
2017    269060
2018    268764
2019    261232
2020    212066
2021    208520
2022    237688
2023     54468
Name: Datetime, dtype: int64
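
The check above extracts the year by splitting the `Datetime` string. An alternative, sketched here on two tiny made-up files written to a temporary folder, is to parse the column while reading with `parse_dates` and then use the `.dt.year` accessor. The folder, rows, and the name `df_check` are illustrative assumptions; only the `Chicago-Crime_{year}.csv` naming follows the notebook.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write two tiny per-year files (made-up rows) in the notebook's naming style
    demo = {"2001": ["2001-01-01 00:00:00", "2001-06-15 13:30:00"],
            "2002": ["2002-03-10 08:00:00"]}
    for year, stamps in demo.items():
        out_path = os.path.join(tmp, f"Chicago-Crime_{year}.csv")
        pd.DataFrame({"Datetime": stamps, "ID": range(len(stamps))}).to_csv(
            out_path, index=False)

    files = sorted(glob.glob(os.path.join(tmp, "*.csv")))
    # parse_dates converts the Datetime column to datetimes while reading
    df_check = pd.concat(
        [pd.read_csv(f, parse_dates=["Datetime"]) for f in files],
        ignore_index=True)

# Count rows per year, in chronological order
year_counts = df_check["Datetime"].dt.year.value_counts().sort_index()
print(year_counts)
```

Passing `ignore_index=True` to `pd.concat` also gives the combined frame one continuous index instead of restarting at 0 for each file.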