# Week 1 - Iterating through data using Python

## A real data challenge – a massive dataset

We want to explore `NYC EMS data since 2011`. We are saddled with a massive file: `6+GB`, with `more than 26 million` rows of data. At this size, even Pandas will slow down dramatically.

Our strategy is to `iterate` or `loop` through the data in smaller chunks and analyze each one. We will still analyze the entire dataset and get complete, reconstituted results that represent all 6+GB of data.
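
As a preview of why chunking still gives a complete answer, here is a minimal sketch that uses a plain Python list in place of the real file. The numbers and the `chunk_size` value are made up for illustration; the only point is that folding per-chunk results together reproduces the result for the whole dataset.

```python
# hypothetical stand-in for a very large dataset
all_values = list(range(1, 101))   # the numbers 1 through 100
chunk_size = 25                    # illustrative chunk size

running_total = 0
for start in range(0, len(all_values), chunk_size):
    chunk = all_values[start:start + chunk_size]   # one small piece
    running_total = running_total + sum(chunk)     # fold the partial result in

print(running_total)      # total built chunk by chunk
print(sum(all_values))    # total computed in one pass, for comparison
```

Both print statements should show the same number, which is the behavior we want from the chunked EMS analysis.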

We need to learn a couple of fundamental Python techniques that will help us extend Pandas' abilities.












### 1. ```for loops```... a data journalist's favorite Python expression

We can use a `for loop` to **iterate** (to do the same series of steps in a process over and over again), including:
* running some calculation on each value stored in a list;
* opening and reading a list of files;
* literally an endless series of important tasks.
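
As a small illustration of the first bullet, this sketch runs one calculation on each value stored in a list. The list `response_minutes` and its values are invented for the example.

```python
# hypothetical response times, in minutes
response_minutes = [4, 7, 12, 9]

for minutes in response_minutes:
    seconds = minutes * 60    # the same calculation runs on every value
    print(f"{minutes} minutes is {seconds} seconds")
```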

In [8]:
## name dog lucy
my_dog = "Lucy"

In [9]:
## upper case the previous variable 
## you can target an individual item
my_dog.upper()

'LUCY'

In [10]:
## make it all lowercase
my_dog.lower()

'lucy'

In [11]:
## run this list
fav_animals = ["dogs", "cats", "birds", "elephants"]

In [12]:
##call our list
fav_animals

['dogs', 'cats', 'birds', 'elephants']

In [13]:
## upper case and print each animal in our list
## this will break
fav_animals.upper()

AttributeError: 'list' object has no attribute 'upper'

In [14]:
## recall that we can slice a list

fav_animals[0]

'dogs'

In [15]:
fav_animals[1:3]

['cats', 'birds']

In [16]:
## we can target an individual item that has been sliced from a list

fav_animals[0].upper()

'DOGS'

In [17]:
# find type
## you can call .upper() on an individual item in a list, but not on the entire list at once. 
### a list is one kind of iterable: an object you can loop through item by item. 

type(fav_animals)

list

**We can't upper-case the whole list at once, but we can iterate through it one item at a time using a `for loop`**

In [18]:
## use a for loop to upper case each animal and print it

for fav_animal in fav_animals:
    print(fav_animal.upper())

DOGS
CATS
BIRDS
ELEPHANTS


### What's happening in a `for loop`:

<img src="https://sandeepmj.github.io/image-host/forloop3.png">


<img src="https://sandeepmj.github.io/image-host/forloop4.png">


<img src="https://sandeepmj.github.io/image-host/forloop5.png">


<img src="https://sandeepmj.github.io/image-host/forloop6.png">



<img src="https://sandeepmj.github.io/image-host/forloop7.png">


<img src="https://sandeepmj.github.io/image-host/forloop8.png">


<img src="https://sandeepmj.github.io/image-host/forloop9.png">


<img src="https://sandeepmj.github.io/image-host/forloop10.png">

# To recap:
<img src="https://sandeepmj.github.io/image-host/forloop6.png">
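
One way to read the diagrams above: the loop is roughly shorthand for assigning each item to the temporary variable in turn and running the indented code each time. The sketch below writes those steps out by hand. It is meant as a mental model, not a description of how Python implements loops internally.

```python
fav_animals = ["dogs", "cats", "birds", "elephants"]

# first pass: the temporary variable holds the first item
fav_animal = fav_animals[0]
print(fav_animal.upper())

# second pass: it now holds the second item
fav_animal = fav_animals[1]
print(fav_animal.upper())

# third pass
fav_animal = fav_animals[2]
print(fav_animal.upper())

# fourth and last pass: the variable keeps this value after the loop ends
fav_animal = fav_animals[3]
print(fav_animal.upper())
```

Writing four passes by hand is manageable; writing 26 million is not, which is why we let the loop do it.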


In [19]:
## rerun a for loop, this time to title case each animal and print it

for little_creature in fav_animals: 
    print(little_creature.title())

Dogs
Cats
Birds
Elephants


In [20]:
## after the loop finishes, the temporary variable still holds the last item in the list because that's the last item it looped through. 
## call little_creature
little_creature

'elephants'

In [21]:
## did our list change? Call the fav_animals list
### no, it stayed the same. 

fav_animals

['dogs', 'cats', 'birds', 'elephants']

### 2.  `append()`

The `append()` method lets us add values to the end of a list. To collect results in a new list, we first declare an empty list and then append to it.

```python
    new_list = []
    new_list.append(some_value)
```


In [22]:
## We save our iterated data by adding to an empty list

upper_animals = []

In [23]:
## call upper animals
upper_animals

[]

In [24]:
## what's the type

type(upper_animals)

list

In [26]:
## do the 'for loop' 

for fav_animal in fav_animals:
    upper_animals.append(fav_animal.upper())
    
## call it

upper_animals

['DOGS', 'CATS', 'BIRDS', 'ELEPHANTS', 'DOGS', 'CATS', 'BIRDS', 'ELEPHANTS']

In [28]:
## find out the length 
len(upper_animals)

8
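
The list above holds 8 items rather than 4, most likely because the loop cell was run twice: `append()` always adds to whatever is already in the list, so a second run appends everything again. A minimal sketch of one way to avoid that, assuming you want a fresh result on every run, is to create the empty list in the same cell as the loop:

```python
fav_animals = ["dogs", "cats", "birds", "elephants"]

# reset the list right before the loop so reruns start from empty
upper_animals = []
for fav_animal in fav_animals:
    upper_animals.append(fav_animal.upper())

print(upper_animals)
print(len(upper_animals))
```

With the reset and the loop in one cell, rerunning it gives the same four items every time.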

## Let's take **For Loops** for a test drive:

### Combine different data points together 

#### You scrape some URLs and place them in a list called myURLS (provided below):

In [27]:
## run this cell to activate the list
myURLS = [
    'great-unique-data-1.html',
    'great-unique-data-2.html',
    'great-unique-data-3.html',
    'great-unique-data-4.html',
    'great-unique-data-5.html',
    'great-unique-data-6.html',
    'great-unique-data-7.html',
    'great-unique-data-8.html',
    'great-unique-data-9.html',
    'great-unique-data-10.html',
    'great-unique-data-11.html',
    'great-unique-data-12.html',
    'great-unique-data-13.html',
    'great-unique-data-14.html',
    'great-unique-data-15.html'
]

myURLS

['great-unique-data-1.html',
 'great-unique-data-2.html',
 'great-unique-data-3.html',
 'great-unique-data-4.html',
 'great-unique-data-5.html',
 'great-unique-data-6.html',
 'great-unique-data-7.html',
 'great-unique-data-8.html',
 'great-unique-data-9.html',
 'great-unique-data-10.html',
 'great-unique-data-11.html',
 'great-unique-data-12.html',
 'great-unique-data-13.html',
 'great-unique-data-14.html',
 'great-unique-data-15.html']

### * You realize that these URLs are missing the base of "http://www.importantsite.com/"
### * Use a ```for loop``` to join the base URL to every partial URL in your list.
### * Print each FULL URL
It should look like: ```"http://www.importantsite.com/great-unique-data-14.html"``` but with unique numbers

In [33]:
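## for loop: join the base URL to each partial URL and print the result
base_url = "http://www.importantsite.com/"

for myURL in myURLS:
    print(base_url + myURL)

http://www.importantsite.com/great-unique-data-1.html
http://www.importantsite.com/great-unique-data-2.html
http://www.importantsite.com/great-unique-data-3.html
http://www.importantsite.com/great-unique-data-4.html
http://www.importantsite.com/great-unique-data-5.html
http://www.importantsite.com/great-unique-data-6.html
http://www.importantsite.com/great-unique-data-7.html
http://www.importantsite.com/great-unique-data-8.html
http://www.importantsite.com/great-unique-data-9.html
http://www.importantsite.com/great-unique-data-10.html
http://www.importantsite.com/great-unique-data-11.html
http://www.importantsite.com/great-unique-data-12.html
http://www.importantsite.com/great-unique-data-13.html
http://www.importantsite.com/great-unique-data-14.html
http://www.importantsite.com/great-unique-data-15.html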


### Update myURLS and store full URLS in a new list

#### Instead of just printing the joined URLs, create a new list called ```full_urls``` that holds the full URLs.

In [34]:
## store the updated values
base_url = "http://www.importantsite.com/"
base_url

full_urls = []
full_urls

[]

In [35]:
## loop through the partial URLs, append each full URL, then call the new list
for myURL in myURLS:
    full_urls.append(base_url + myURL)
    
full_urls

['http://www.importantsite.com/great-unique-data-1.html',
 'http://www.importantsite.com/great-unique-data-2.html',
 'http://www.importantsite.com/great-unique-data-3.html',
 'http://www.importantsite.com/great-unique-data-4.html',
 'http://www.importantsite.com/great-unique-data-5.html',
 'http://www.importantsite.com/great-unique-data-6.html',
 'http://www.importantsite.com/great-unique-data-7.html',
 'http://www.importantsite.com/great-unique-data-8.html',
 'http://www.importantsite.com/great-unique-data-9.html',
 'http://www.importantsite.com/great-unique-data-10.html',
 'http://www.importantsite.com/great-unique-data-11.html',
 'http://www.importantsite.com/great-unique-data-12.html',
 'http://www.importantsite.com/great-unique-data-13.html',
 'http://www.importantsite.com/great-unique-data-14.html',
 'http://www.importantsite.com/great-unique-data-15.html']
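
Joining with `+` works when the base ends in a slash. As an alternative sketch, Python's standard library offers `urllib.parse.urljoin`, which builds a full address from a base URL and a relative path. The shortened list `partial_urls` and the name `joined_urls` are stand-ins invented for this example.

```python
from urllib.parse import urljoin

base_url = "http://www.importantsite.com/"
# hypothetical, shortened version of the scraped list
partial_urls = ["great-unique-data-1.html", "great-unique-data-2.html"]

joined_urls = []
for partial_url in partial_urls:
    joined_urls.append(urljoin(base_url, partial_url))

print(joined_urls)
```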

### 3. Counting while iterating

Often we need to increment a number to track progress of an iteration.

In [36]:
## counter that we increment by 1 on each pass through the loop
my_counter = 1
for animal in fav_animals:
    print(f"Animal {my_counter} is {animal}")
    my_counter = my_counter + 1

Animal 1 is dogs
Animal 2 is cats
Animal 3 is birds
Animal 4 is elephants
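
Two common shorthands for the counter pattern above, shown here as a sketch: `+=` adds to a variable in place, and the built-in `enumerate()` hands you a count alongside each item so you don't have to manage the counter yourself. Passing `start=1` makes the count begin at 1 instead of 0. The variable name `position` is just an illustrative choice.

```python
fav_animals = ["dogs", "cats", "birds", "elephants"]

# same manual counter, using the += shorthand
my_counter = 1
for animal in fav_animals:
    print(f"Animal {my_counter} is {animal}")
    my_counter += 1

# enumerate() supplies the count for us
for position, animal in enumerate(fav_animals, start=1):
    print(f"Animal {position} is {animal}")
```

Both loops print the same four lines. The manual counter is still handy when the thing being counted is not the loop item itself, such as the chunk number while reading a large file.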


In [None]:
## counter that increments


## Back to our EMS data challenge

For those of you who don't have sufficient space, I have created <a href="https://raw.githubusercontent.com/sandeepmj/datasets/main/ems-excerpt.csv">an excerpt</a> of the `6+GB` dataset that is `25MB` and holds 100,000 rows of data instead of millions of rows. If you are using the excerpt, your strategy will be to take the 100,000-row file and `chunk` it into 10K-row pieces. 

With the actual `6+GB` file, we'll break it into 500K-row chunks.
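
To see how chunked results can be recombined, here is a minimal sketch that uses a tiny in-memory CSV in place of the real file so it runs anywhere. The sample rows are made up, the column name `INITIAL_CALL_TYPE` is borrowed from the dataset, and counting call types is just one illustrative analysis, not necessarily the one we will run.

```python
import io
import pandas as pd

# made-up stand-in for the big CSV file
sample_csv = io.StringIO(
    "CAD_INCIDENT_ID,INITIAL_CALL_TYPE\n"
    "1,SICK\n"
    "2,INJURY\n"
    "3,SICK\n"
    "4,EDP\n"
    "5,SICK\n"
    "6,INJURY\n"
)

partial_counts = []   # one small result per chunk
for partial_df in pd.read_csv(sample_csv, chunksize=2):
    # analyze only the rows in this chunk
    partial_counts.append(partial_df["INITIAL_CALL_TYPE"].value_counts())

# stack the partial results, then add up the counts for each call type
total_counts = pd.concat(partial_counts).groupby(level=0).sum()
print(total_counts)
```

Only one chunk of rows is held as a DataFrame at a time; what accumulates in `partial_counts` is the much smaller per-chunk summary.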

In [39]:
## import libraries
import pandas as pd

In [40]:
## take big csv and chunk it
row_size = 500_000
chunk_number = 1
for partial_df in pd.read_csv("EMS_Incident_Dispatch_Data_20240407.csv",
                             chunksize = row_size):
    print(partial_df)

        CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
0             110010790  01/01/2011 02:19:47 AM               UNC   
1             110010791  01/01/2011 02:19:49 AM               EDP   
2             110010792  01/01/2011 02:19:52 AM            UNKNOW   
3             110010793  01/01/2011 02:19:56 AM               UNC   
4             110010794  01/01/2011 02:20:05 AM            INJURY   
...                 ...                     ...               ...   
499995        111210092  05/01/2011 12:30:22 AM            SPEVNT   
499996        111210093  05/01/2011 12:30:24 AM              SICK   
499997        111210094  05/01/2011 12:30:35 AM            INJURY   
499998        111210095  05/01/2011 12:31:00 AM             ABDPN   
499999        111210096  05/01/2011 12:31:09 AM            INJURY   

        INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
0                                 2             UNC   
1                                 7             EDP   
2     

         CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
1000000        112870508  10/14/2011 04:38:04 AM            MVAINJ   
1000001        112870509  10/14/2011 04:40:24 AM              DRUG   
1000002        112870510  10/14/2011 04:40:29 AM              SICK   
1000003        112870511  10/14/2011 04:44:39 AM               EDP   
1000004        112870512  10/14/2011 04:46:56 AM              SICK   
...                  ...                     ...               ...   
1499995        120691391  03/09/2012 11:14:01 AM               EDP   
1499996        120691392  03/09/2012 11:14:04 AM            ALTMEN   
1499997        120691393  03/09/2012 11:14:06 AM            DIFFBR   
1499998        120691394  03/09/2012 11:15:03 AM              SICK   
1499999        120691395  03/09/2012 11:15:24 AM               UNC   

         INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
1000000                            4          MVAINJ   
1000001                            4          A

         CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
2000000        122051983  07/23/2012 02:00:59 PM              SICK   
2000001        122051984  07/23/2012 02:01:53 PM            DIFFBR   
2000002        122051985  07/23/2012 02:02:17 PM            MVAINJ   
2000003        122051986  07/23/2012 02:02:46 PM              SICK   
2000004        122051987  07/23/2012 02:02:51 PM              CARD   
...                  ...                     ...               ...   
2499995        123461909  12/11/2012 01:54:59 PM              SICK   
2499996        123461910  12/11/2012 01:55:12 PM             ABDPN   
2499997        123461911  12/11/2012 01:55:55 PM            INJURY   
2499998        123461913  12/11/2012 01:56:06 PM             ELECT   
2499999        123461914  12/11/2012 01:56:17 PM              SICK   

         INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
2000000                            6            SICK   
2000001                            2          D


         CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
2500000        123461915  12/11/2012 01:56:31 PM             OBMIS   
2500001        123461916  12/11/2012 01:56:34 PM              DRUG   
2500002        123461917  12/11/2012 01:56:36 PM              CARD   
2500003        123461918  12/11/2012 01:57:41 PM             ABDPN   
2500004        123461919  12/11/2012 01:58:08 PM             ABDPN   
...                  ...                     ...               ...   
2999995        131201490  04/30/2013 12:17:59 PM             OBLAB   
2999996        131201491  04/30/2013 12:18:21 PM            INJURY   
2999997        131201492  04/30/2013 12:18:35 PM            INJURY   
2999998        131201493  04/30/2013 12:19:17 PM               EDP   
2999999        131201494  04/30/2013 12:20:02 PM            SICPED   

         INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
2500000                            4           OBMIS   
2500001                            4           


         CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
3500000        132512372  09/08/2013 04:14:43 PM               EDP   
3500001        132512373  09/08/2013 04:16:27 PM             ABDPN   
3500002        132512374  09/08/2013 04:16:34 PM             ABDPN   
3500003        132512375  09/08/2013 04:16:54 PM            INJURY   
3500004        132512376  09/08/2013 04:17:18 PM               UNC   
...                  ...                     ...               ...   
3999995        140250073  01/25/2014 12:26:13 AM            DIFFBR   
3999996        140250074  01/25/2014 12:26:43 AM              SICK   
3999997        140250076  01/25/2014 12:29:17 AM              DRUG   
3999998        190021369  01/02/2019 10:04:11 AM              SICK   
3999999        140250077  01/25/2014 12:29:40 AM            ASTHMB   

         INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
3500000                            7             EDP   
3500001                            5           

         CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
4500000        141603204  06/09/2014 06:47:28 PM               EDP   
4500001        141603205  06/09/2014 06:47:35 PM               EDP   
4500002        141603206  06/09/2014 06:47:36 PM            INJURY   
4500003        141603207  06/09/2014 06:47:50 PM               UNC   
4500004        141603208  06/09/2014 06:47:53 PM              DRUG   
...                  ...                     ...               ...   
4999995        142892740  10/16/2014 04:11:52 PM            ASTHMB   
4999996        142892741  10/16/2014 04:11:57 PM            INJURY   
4999997        142892742  10/16/2014 04:12:19 PM            INJMAJ   
4999998        142892743  10/16/2014 04:12:21 PM             OBLAB   
4999999        142892744  10/16/2014 04:12:43 PM              SICK   

         INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
4500000                            7            CARD   
4500001                            7           

         CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
5500000        150583353  02/27/2015 07:54:21 PM            INJURY   
5500001        150583354  02/27/2015 07:54:32 PM              SICK   
5500002        150583355  02/27/2015 07:54:40 PM               EDP   
5500003        150583356  02/27/2015 07:55:08 PM               UNC   
5500004        150583357  02/27/2015 07:55:10 PM              SICK   
...                  ...                     ...               ...   
5999995        151820884  07/01/2015 08:32:39 AM            INJURY   
5999996        151820885  07/01/2015 08:32:45 AM            INJURY   
5999997        151820886  07/01/2015 08:32:48 AM               UNC   
5999998        151820887  07/01/2015 08:33:15 AM            DIFFBR   
5999999        151820889  07/01/2015 08:34:11 AM             OBMAJ   

         INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
5500000                            5          INJURY   
5500001                            6           

         CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
6500000        153042357  10/31/2015 03:19:10 PM            INJMAJ   
6500001        153042358  10/31/2015 03:19:16 PM            INJURY   
6500002        153042359  10/31/2015 03:19:37 PM             OBMAJ   
6500003        153042360  10/31/2015 03:19:57 PM            MVAINJ   
6500004        153042361  10/31/2015 03:20:15 PM            INJURY   
...                  ...                     ...               ...   
6999995        160401489  02/09/2016 11:36:49 AM            DIFFBR   
6999996        160401490  02/09/2016 11:36:56 AM            DIFFBR   
6999997        160401491  02/09/2016 11:37:23 AM               UNC   
6999998        160401493  02/09/2016 11:37:38 AM              DRUG   
6999999        160401494  02/09/2016 11:37:39 AM             SEIZR   

         INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
6500000                            3          INJMAJ   
6500001                            5          I

         CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
7500000        161960203  07/14/2016 01:23:03 AM               UNC   
7500001        190052353  01/05/2019 02:00:03 PM              SICK   
7500002        161960204  07/14/2016 01:23:38 AM              SICK   
7500003        161960205  07/14/2016 01:23:54 AM               EDP   
7500004        161960206  07/14/2016 01:24:18 AM            GYNHEM   
...                  ...                     ...               ...   
7999995        163200110  11/15/2016 12:57:43 AM              SICK   
7999996        163200111  11/15/2016 12:58:07 AM            DIFFBR   
7999997        163200112  11/15/2016 12:58:14 AM            DIFFBR   
7999998        163200113  11/15/2016 12:58:37 AM            DIFFBR   
7999999        163200114  11/15/2016 12:58:39 AM              SICK   

         INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
7500000                            2             UNC   
7500001                            6           

         CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
8500000         81093104  04/18/2008 07:40:03 PM              SICK   
8500001         81093105  04/18/2008 07:40:09 PM            DIFFBR   
8500002         81093106  04/18/2008 07:40:46 PM              DRUG   
8500003         81093107  04/18/2008 07:40:49 PM            INJURY   
8500004         81093108  04/18/2008 07:42:26 PM             OTHER   
...                  ...                     ...               ...   
8999995         82552405  09/11/2008 05:12:27 PM            DIFFBR   
8999996         82552406  09/11/2008 05:12:45 PM            INJURY   
8999997         82552407  09/11/2008 05:12:56 PM              SICK   
8999998         82552408  09/11/2008 05:13:24 PM              SICK   
8999999         82552409  09/11/2008 05:13:31 PM              DRUG   

         INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
8500000                            4            SICK   
8500001                            2          D

         CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
9500000         90430620  02/12/2009 07:14:00 AM             OBLAB   
9500001         90430621  02/12/2009 07:15:29 AM             ABDPN   
9500002         90430622  02/12/2009 07:15:54 AM              SICK   
9500003         90430623  02/12/2009 07:16:00 AM               EDP   
9500004         90430624  02/12/2009 07:17:18 AM              SICK   
...                  ...                     ...               ...   
9999995         91873264  07/06/2009 07:40:31 PM            ARREST   
9999996         91873267  07/06/2009 07:41:43 PM             ABDPN   
9999997         91873268  07/06/2009 07:41:50 PM             SEIZR   
9999998         91873269  07/06/2009 07:42:18 PM            SICKFC   
9999999         91873270  07/06/2009 07:42:33 PM            STATEP   

         INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
9500000                            5           OBLAB   
9500001                            5           

          CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
10500000         93351865  12/01/2009 02:36:47 PM            ASTHMB   
10500001         93351866  12/01/2009 02:37:02 PM              SICK   
10500002         93351867  12/01/2009 02:37:06 PM             ABDPN   
10500003         93351868  12/01/2009 02:37:19 PM            PEDSTR   
10500004         93351869  12/01/2009 02:37:20 PM            INJMIN   
...                   ...                     ...               ...   
10999995        101211935  05/01/2010 12:55:52 PM            INJURY   
10999996        101211936  05/01/2010 12:55:54 PM            UNKNOW   
10999997        101211937  05/01/2010 12:56:08 PM               UNC   
10999998        101211938  05/01/2010 12:56:14 PM              SICK   
10999999        101211939  05/01/2010 12:56:27 PM               UNC   

          INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
10500000                            2          ASTHMB   
10500001                         

          CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
11500000        102600050  09/17/2010 12:27:51 AM            INJURY   
11500001        102600051  09/17/2010 12:28:03 AM            INBLED   
11500002        102600052  09/17/2010 12:28:03 AM              SICK   
11500003        102600053  09/17/2010 12:28:23 AM              CARD   
11500004        102600054  09/17/2010 12:28:27 AM            INJURY   
...                   ...                     ...               ...   
11999995        170230422  01/23/2017 03:35:25 AM              DRUG   
11999996        170230423  01/23/2017 03:36:33 AM            SICPED   
11999997        170230424  01/23/2017 03:37:16 AM            DIFFBR   
11999998        170230425  01/23/2017 03:38:42 AM            UNKNOW   
11999999        170230426  01/23/2017 03:39:22 AM            UNKNOW   

          INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
11500000                            5          INJURY   
11500001                         

          CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
12500000        171492882  05/29/2017 07:10:49 PM            MVAINJ   
12500001        171492883  05/29/2017 07:11:50 PM              CARD   
12500002        171492884  05/29/2017 07:11:51 PM              SICK   
12500003        171492885  05/29/2017 07:11:54 PM            ALTMEN   
12500004        171492886  05/29/2017 07:12:20 PM               UNC   
...                   ...                     ...               ...   
12999995        172710868  09/28/2017 08:06:59 AM            INBLED   
12999996        172710869  09/28/2017 08:06:59 AM            MVAINJ   
12999997        172710870  09/28/2017 08:07:07 AM            PEDSTR   
12999998        172710871  09/28/2017 08:07:16 AM             ANAPH   
12999999        172710872  09/28/2017 08:07:23 AM               UNC   

          INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
12500000                            4          MVAINJ   
12500001                         

          CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
13500000        180281488  01/28/2018 10:27:10 AM              SICK   
13500001        180281489  01/28/2018 10:27:24 AM              SICK   
13500002        180281490  01/28/2018 10:27:25 AM            INJMAJ   
13500003        180281491  01/28/2018 10:27:47 AM               EDP   
13500004        180281492  01/28/2018 10:28:12 AM            UNKNOW   
...                   ...                     ...               ...   
13999995        181491375  05/29/2018 09:37:46 AM            GYNMAJ   
13999996        181491376  05/29/2018 09:37:48 AM              SICK   
13999997        181491377  05/29/2018 09:37:57 AM               UNC   
13999998        181491378  05/29/2018 09:38:00 AM              STAB   
13999999        181491379  05/29/2018 09:38:24 AM              CARD   

          INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
13500000                            6            SICK   
13500001                         

          CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
14500000        182642615  09/21/2018 02:52:03 PM               EDP   
14500001        182642616  09/21/2018 02:52:29 PM            CARDBR   
14500002        182642618  09/21/2018 02:52:42 PM               EDP   
14500003        182642620  09/21/2018 02:53:05 PM            INJURY   
14500004        182642621  09/21/2018 02:53:10 PM            UNKNOW   
...                   ...                     ...               ...   
14999995        190230819  01/23/2019 07:45:01 AM            MVAINJ   
14999996        190230820  01/23/2019 07:45:35 AM            SICMIN   
14999997        190230821  01/23/2019 07:45:42 AM               UNC   
14999998        190230822  01/23/2019 07:46:00 AM             ABDPN   
14999999        190230823  01/23/2019 07:46:04 AM              SICK   

          INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
14500000                            7             EDP   
14500001                         

          CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
15500000        191443492  05/24/2019 06:00:00 PM               UNC   
15500001        191443493  05/24/2019 06:00:05 PM               EDP   
15500002        191443494  05/24/2019 06:00:28 PM             OTHER   
15500003        191443495  05/24/2019 06:00:40 PM            STATEP   
15500004        191443496  05/24/2019 06:00:41 PM              DRUG   
...                   ...                     ...               ...   
15999995        192590562  09/16/2019 04:40:38 AM            INJURY   
15999996        192590563  09/16/2019 04:41:18 AM              CARD   
15999997        192590564  09/16/2019 04:41:24 AM              SICK   
15999998        192590565  09/16/2019 04:41:59 AM               EDP   
15999999        192590566  09/16/2019 04:42:18 AM            CARDBR   

          INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
15500000                            2             UNC   
15500001                         

          CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
16500000        200130143  01/13/2020 12:54:21 AM              SICK   
16500001        200130146  01/13/2020 12:55:20 AM              SICK   
16500002        200130147  01/13/2020 12:55:55 AM              DRUG   
16500003        200130148  01/13/2020 12:55:56 AM            INJMAJ   
16500004        200130149  01/13/2020 12:56:05 AM            ASTHMB   
...                   ...                     ...               ...   
16999995        200860215  03/26/2020 01:01:54 AM              SICK   
16999996        200860218  03/26/2020 01:02:33 AM             SEIZR   
16999997        200860219  03/26/2020 01:02:36 AM             OBMAJ   
16999998        200860220  03/26/2020 01:02:45 AM            INJURY   
16999999        200860221  03/26/2020 01:02:53 AM            GYNMAJ   

          INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
16500000                            6            SICK   
16500001                         

          CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
17500000         50031264  01/03/2005 10:24:58 AM            INJURY   
17500001         50031265  01/03/2005 10:26:02 AM              SICK   
17500002         50031266  01/03/2005 10:25:58 AM               CVA   
17500003         50031267  01/03/2005 10:25:57 AM            INHALE   
17500004         50031268  01/03/2005 10:26:32 AM            ASTHMA   
...                   ...                     ...               ...   
17999995         51652232  06/14/2005 03:48:03 PM            MVAINJ   
17999996         51652233  06/14/2005 03:49:47 PM             OTHER   
17999997         51652234  06/14/2005 03:49:47 PM              SICK   
17999998         51652235  06/14/2005 03:49:58 PM            PEDSTR   
17999999         51652236  06/14/2005 03:49:58 PM            INJURY   

          INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
17500000                            5          ALTMEN   
17500001                         

          CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
18500000         53240349  11/20/2005 02:35:34 AM            DIFFBR   
18500001         53240350  11/20/2005 02:35:37 AM            ASTHMA   
18500002         53240351  11/20/2005 02:35:54 AM            ASTHMA   
18500003         53240352  11/20/2005 02:35:54 AM            INJMIN   
18500004         53240353  11/20/2005 02:36:10 AM              SICK   
...                   ...                     ...               ...   
18999995         61210373  05/01/2006 04:16:11 AM              SICK   
18999996         61210374  05/01/2006 04:16:48 AM              SICK   
18999997         61210375  05/01/2006 04:17:30 AM              DRUG   
18999998         61210376  05/01/2006 04:19:17 AM            ALTMEN   
18999999         61210377  05/01/2006 04:19:39 AM             OBLAB   

          INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
18500000                            2          DIFFBR   
18500001                         

          CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
19500000         62741902  10/01/2006 03:03:18 PM              SICK   
19500001         62741904  10/01/2006 03:03:51 PM             ABDPN   
19500002         62741905  10/01/2006 03:04:51 PM            INJURY   
19500003         62741906  10/01/2006 03:05:08 PM            DIFFBR   
19500004         62741907  10/01/2006 03:05:23 PM             OTHER   
...                   ...                     ...               ...   
19999995         70700574  03/11/2007 04:30:59 AM               UNC   
19999996         70700578  03/11/2007 04:31:57 AM            INJURY   
19999997         70700579  03/11/2007 04:32:21 AM            INJURY   
19999998         70700580  03/11/2007 04:32:38 AM              SICK   
19999999         70700581  03/11/2007 04:32:43 AM              SICK   

          INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
19500000                            6            SICK   
19500001                         

          CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
20500000         72210171  08/09/2007 01:15:29 AM            PEDSTR   
20500001         72210175  08/09/2007 01:17:04 AM              SICK   
20500002         72210176  08/09/2007 01:17:24 AM             ABDPN   
20500003         72210177  08/09/2007 01:18:00 AM             ABDPN   
20500004         72210178  08/09/2007 01:18:01 AM              SICK   
...                   ...                     ...               ...   
20999995        202732523  09/29/2020 04:03:55 PM            UNKNOW   
20999996        202732524  09/29/2020 04:04:41 PM              CARD   
20999997        202732525  09/29/2020 04:04:43 PM             ABDPN   
20999998        202732526  09/29/2020 04:04:57 PM            INJMAJ   
20999999        202732527  09/29/2020 04:05:05 PM            PEDSTR   

          INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
20500000                            3          PEDSTR   
20500001                         

          CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
21500000        210450942  02/14/2021 07:42:00 AM            DIFFFC   
21500001        210450943  02/14/2021 07:42:00 AM             ABDPN   
21500002        210450944  02/14/2021 07:43:00 AM              SICK   
21500003        210450945  02/14/2021 07:44:00 AM            UNKNOW   
21500004        210450947  02/14/2021 07:45:00 AM              SICK   
...                   ...                     ...               ...   
21999995        211693703  06/18/2021 06:39:41 PM               EDP   
21999996        211693704  06/18/2021 06:39:53 PM            INJURY   
21999997        211693705  06/18/2021 06:39:56 PM            SICMFC   
21999998        211693707  06/18/2021 06:40:45 PM            INJURY   
21999999        211693708  06/18/2021 06:41:11 PM              SICK   

          INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
21500000                            2          DIFFBR   
21500001                         

          CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
22500000        212870633  10/14/2021 05:12:59 AM              DRUG   
22500001        212870634  10/14/2021 05:13:11 AM              DRUG   
22500002        212870635  10/14/2021 05:13:39 AM            DIFFBR   
22500003        212870636  10/14/2021 05:13:48 AM            DIFFFC   
22500004        212870637  10/14/2021 05:14:06 AM            INJURY   
...                   ...                     ...               ...   
22999995        220431486  02/12/2022 10:19:23 AM              CARD   
22999996        220431487  02/12/2022 10:19:58 AM               EDP   
22999997        220431488  02/12/2022 10:20:08 AM             ABDPN   
22999998        220431489  02/12/2022 10:20:31 AM            INJMAJ   
22999999        220431490  02/12/2022 10:21:06 AM            UNKNOW   

          INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
22500000                            7            DRUG   
22500001                         

          CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
23500000        221673230  06/16/2022 04:35:31 PM            UNKNOW   
23500001        221673231  06/16/2022 04:35:33 PM            RESPIR   
23500002        221673232  06/16/2022 04:35:42 PM            UNKNOW   
23500003        221673233  06/16/2022 04:35:49 PM               UNC   
23500004        221673235  06/16/2022 04:36:10 PM               EDP   
...                   ...                     ...               ...   
23999995        222794186  10/06/2022 05:37:29 PM              EDPC   
23999996        222794187  10/06/2022 05:37:39 PM            SICKFC   
23999997        222794188  10/06/2022 05:37:49 PM            ARREST   
23999998        222794190  10/06/2022 05:38:21 PM               UNC   
23999999        222794191  10/06/2022 05:38:32 PM            ABDPFC   

          INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
23500000                            4          UNKNOW   
23500001                         

          CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
24500000        230281394  01/28/2023 09:21:50 AM            DIFFFC   
24500001        230281395  01/28/2023 09:22:34 AM              DRUG   
24500002        230281396  01/28/2023 09:22:57 AM             ABDPN   
24500003        230281398  01/28/2023 09:23:32 AM            SICKFC   
24500004        230281399  01/28/2023 09:23:42 AM            UNKNOW   
...                   ...                     ...               ...   
24999995        231462414  05/26/2023 01:45:14 PM            CARDBR   
24999996        231462417  05/26/2023 01:46:44 PM              STAB   
24999997        231462418  05/26/2023 01:47:00 PM            UNKNOW   
24999998        231462419  05/26/2023 01:47:00 PM            INJURY   
24999999        231462420  05/26/2023 01:47:24 PM            ABDPFC   

          INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
24500000                            2          DIFFFC   
24500001                         

          CAD_INCIDENT_ID       INCIDENT_DATETIME INITIAL_CALL_TYPE  \
25500000        232583671  09/15/2023 06:16:22 PM              EDPC   
25500001        232583672  09/15/2023 06:16:22 PM              DRUG   
25500002        232583674  09/15/2023 06:17:15 PM             ABDPN   
25500003        232583675  09/15/2023 06:17:16 PM            INJURY   
25500004        232583677  09/15/2023 06:17:39 PM              SICK   
...                   ...                     ...               ...   
25984638        213414667  12/07/2021 11:56:43 PM            INJURY   
25984639        213414668  12/07/2021 11:57:21 PM            SICMIN   
25984640        213414669  12/07/2021 11:57:53 PM            SICKFC   
25984641        213414670  12/07/2021 11:59:15 PM            DIFFBR   
25984642        213414671  12/07/2021 11:59:38 PM            DIFFBR   

          INITIAL_SEVERITY_LEVEL_CODE FINAL_CALL_TYPE  \
25500000                            7            EDPC   
25500001                         

In [41]:
## take a look at partial_df

partial_df

Unnamed: 0,CAD_INCIDENT_ID,INCIDENT_DATETIME,INITIAL_CALL_TYPE,INITIAL_SEVERITY_LEVEL_CODE,FINAL_CALL_TYPE,FINAL_SEVERITY_LEVEL_CODE,FIRST_ASSIGNMENT_DATETIME,VALID_DISPATCH_RSPNS_TIME_INDC,DISPATCH_RESPONSE_SECONDS_QY,FIRST_ACTIVATION_DATETIME,...,ZIPCODE,POLICEPRECINCT,CITYCOUNCILDISTRICT,COMMUNITYDISTRICT,COMMUNITYSCHOOLDISTRICT,CONGRESSIONALDISTRICT,REOPEN_INDICATOR,SPECIAL_EVENT_INDICATOR,STANDBY_INDICATOR,TRANSFER_INDICATOR
25500000,232583671,09/15/2023 06:16:22 PM,EDPC,7,EDPC,7,,N,0,,...,11206.0,90.0,34.0,301.0,14.0,7.0,N,N,N,N
25500001,232583672,09/15/2023 06:16:22 PM,DRUG,4,DRUG,4,09/15/2023 06:18:40 PM,Y,138,09/15/2023 06:18:49 PM,...,10036.0,18.0,3.0,105.0,2.0,12.0,N,N,N,N
25500002,232583674,09/15/2023 06:17:15 PM,ABDPN,5,ABDPN,5,09/15/2023 06:17:26 PM,Y,11,09/15/2023 06:17:33 PM,...,10468.0,50.0,14.0,208.0,10.0,13.0,N,N,N,N
25500003,232583675,09/15/2023 06:17:16 PM,INJURY,5,SAFE,6,,N,0,,...,11370.0,114.0,22.0,401.0,7.0,14.0,N,N,N,N
25500004,232583677,09/15/2023 06:17:39 PM,SICK,6,SICK,6,09/15/2023 06:18:40 PM,Y,61,09/15/2023 06:19:03 PM,...,10468.0,52.0,14.0,207.0,10.0,13.0,N,N,N,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25984638,213414667,12/07/2021 11:56:43 PM,INJURY,5,INJURY,5,12/07/2021 11:56:52 PM,Y,9,12/07/2021 11:57:08 PM,...,10002.0,5.0,1.0,103.0,2.0,7.0,N,N,N,N
25984639,213414668,12/07/2021 11:57:21 PM,SICMIN,7,SICMIN,7,12/08/2021 12:01:10 AM,Y,229,12/08/2021 12:01:28 AM,...,11101.0,114.0,26.0,401.0,30.0,12.0,Y,N,N,N
25984640,213414669,12/07/2021 11:57:53 PM,SICKFC,6,SICKFC,6,12/07/2021 11:58:21 PM,Y,28,12/07/2021 11:58:44 PM,...,11434.0,113.0,28.0,412.0,28.0,5.0,N,N,N,N
25984641,213414670,12/07/2021 11:59:15 PM,DIFFBR,2,DIFFBR,2,12/07/2021 11:59:39 PM,Y,24,12/08/2021 12:00:07 AM,...,10466.0,47.0,12.0,212.0,11.0,16.0,N,N,N,N


In [43]:
## take a look at its info

partial_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 484643 entries, 25500000 to 25984642
Data columns (total 31 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   CAD_INCIDENT_ID                 484643 non-null  int64  
 1   INCIDENT_DATETIME               484643 non-null  object 
 2   INITIAL_CALL_TYPE               484643 non-null  object 
 3   INITIAL_SEVERITY_LEVEL_CODE     484643 non-null  int64  
 4   FINAL_CALL_TYPE                 484643 non-null  object 
 5   FINAL_SEVERITY_LEVEL_CODE       484643 non-null  int64  
 6   FIRST_ASSIGNMENT_DATETIME       476522 non-null  object 
 7   VALID_DISPATCH_RSPNS_TIME_INDC  484643 non-null  object 
 8   DISPATCH_RESPONSE_SECONDS_QY    484643 non-null  int64  
 9   FIRST_ACTIVATION_DATETIME       475596 non-null  object 
 10  FIRST_ON_SCENE_DATETIME         459084 non-null  object 
 11  VALID_INCIDENT_RSPNS_TIME_INDC  484643 non-null  object 
 12  INCIDEN

In [44]:
## take big csv and chunk it (this will break: the for loop has no indented body yet)
row_size = 500_000
chunk_number = 1
for partial_df in pd.read_csv("EMS_Incident_Dispatch_Data_20240407.csv",
                             chunksize = row_size):

SyntaxError: incomplete input (508853331.py, line 5)

In [45]:
## call df_segment


In [46]:
## we can save a segment to a more manageable csv
partial_df.to_csv("test.csv",
                 encoding = "UTF-8",
                 index = False)
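
To see what `index = False` changes, here is a minimal sketch that writes a tiny made-up DataFrame twice and reads it back. The column names and the in-memory `io.StringIO` buffers are assumptions for illustration only; they stand in for a real chunk and real files.

```python
import io
import pandas as pd

# Toy stand-in for one chunk; the column names are made up for illustration.
toy_df = pd.DataFrame({"call_type": ["DRUG", "SICK"], "severity": [4, 6]})

# Write the same frame twice: once with the index, once without.
with_index = io.StringIO()
toy_df.to_csv(with_index)                   # index=True is the default
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)

# Rewind both buffers, read them back and compare the columns.
with_index.seek(0)
without_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)     # ['Unnamed: 0', 'call_type', 'severity']
print(cols_without)  # ['call_type', 'severity']
```

The saved index comes back as an extra `Unnamed: 0` column, which is why we leave it out when writing each chunk.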

In [47]:
## we have to save each one as a file
### include index = False; otherwise pandas writes the row index into each file as an extra unnamed column.

row_size = 500_000
chunk_number = 1
for partial_df in pd.read_csv("EMS_Incident_Dispatch_Data_20240407.csv",
                             chunksize = row_size):
    partial_df.to_csv(f"ems-data-{chunk_number}.csv",
                     encoding = "UTF-8",
                     index = False)
    chunk_number = chunk_number + 1



In [48]:
## we have to save each one as a file


In [49]:
## what is the chunk number if you call it now?
chunk_number

53
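
Why 53 and not 52? The counter starts at 1 and is bumped after every chunk is saved, so it ends one higher than the number of files written. Here is a small sketch of the same loop on a ten-row CSV held in memory (the toy data and the chunk size of 4 are assumptions for illustration):

```python
import io
import pandas as pd

# Ten-row toy CSV held in memory; it stands in for the large EMS file.
toy_csv = io.StringIO("value\n" + "\n".join(str(n) for n in range(10)))

chunk_sizes = []
chunk_number = 1
for partial_df in pd.read_csv(toy_csv, chunksize=4):
    chunk_sizes.append(len(partial_df))   # rows in this chunk
    chunk_number = chunk_number + 1       # runs once per chunk, after the work

print(chunk_sizes)    # [4, 4, 2]
print(chunk_number)   # 4, one more than the 3 chunks processed
```

The last chunk is smaller than the others because it holds whatever rows are left over.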

In [None]:
## save files as chunks

row_size = 500_000
chunk_number = 1
for partial_df in pd.read_csv("EMS_Incident_Dispatch_Data_20240407.csv",
                             chunksize = row_size):
    partial_df.to_csv(f"ems-data-{chunk_number}.csv",
                     encoding = "UTF-8",
                     index = False)
    chunk_number = chunk_number + 1

In [50]:
## let's look at one
test_df = pd.read_csv("ems-data-10.csv")
test_df.head()

Unnamed: 0,CAD_INCIDENT_ID,INCIDENT_DATETIME,INITIAL_CALL_TYPE,INITIAL_SEVERITY_LEVEL_CODE,FINAL_CALL_TYPE,FINAL_SEVERITY_LEVEL_CODE,FIRST_ASSIGNMENT_DATETIME,VALID_DISPATCH_RSPNS_TIME_INDC,DISPATCH_RESPONSE_SECONDS_QY,FIRST_ACTIVATION_DATETIME,...,ZIPCODE,POLICEPRECINCT,CITYCOUNCILDISTRICT,COMMUNITYDISTRICT,COMMUNITYSCHOOLDISTRICT,CONGRESSIONALDISTRICT,REOPEN_INDICATOR,SPECIAL_EVENT_INDICATOR,STANDBY_INDICATOR,TRANSFER_INDICATOR
0,141603204,06/09/2014 06:47:28 PM,EDP,7,CARD,3,06/09/2014 06:48:03 PM,True,35,06/09/2014 06:48:13 PM,...,10018.0,14.0,4.0,105.0,2.0,12.0,False,False,False,False
1,141603205,06/09/2014 06:47:35 PM,EDP,7,EDP,7,06/09/2014 06:49:14 PM,True,99,06/09/2014 06:49:30 PM,...,11233.0,73.0,41.0,316.0,17.0,9.0,False,False,False,False
2,141603206,06/09/2014 06:47:36 PM,INJURY,5,INJURY,5,06/09/2014 06:47:45 PM,True,9,06/09/2014 06:48:02 PM,...,11220.0,72.0,38.0,307.0,20.0,7.0,False,False,False,False
3,141603207,06/09/2014 06:47:50 PM,UNC,2,UNC,2,06/09/2014 06:48:03 PM,True,13,06/09/2014 06:48:56 PM,...,10302.0,121.0,49.0,501.0,31.0,11.0,False,False,False,False
4,141603208,06/09/2014 06:47:53 PM,DRUG,4,DRUG,4,06/09/2014 06:48:23 PM,True,30,06/09/2014 06:48:34 PM,...,10458.0,46.0,15.0,205.0,10.0,15.0,False,False,False,False


In [51]:
## get big picture view
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 31 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   CAD_INCIDENT_ID                 500000 non-null  int64  
 1   INCIDENT_DATETIME               500000 non-null  object 
 2   INITIAL_CALL_TYPE               500000 non-null  object 
 3   INITIAL_SEVERITY_LEVEL_CODE     500000 non-null  int64  
 4   FINAL_CALL_TYPE                 500000 non-null  object 
 5   FINAL_SEVERITY_LEVEL_CODE       500000 non-null  int64  
 6   FIRST_ASSIGNMENT_DATETIME       496223 non-null  object 
 7   VALID_DISPATCH_RSPNS_TIME_INDC  500000 non-null  object 
 8   DISPATCH_RESPONSE_SECONDS_QY    500000 non-null  int64  
 9   FIRST_ACTIVATION_DATETIME       495150 non-null  object 
 10  FIRST_ON_SCENE_DATETIME         483230 non-null  object 
 11  VALID_INCIDENT_RSPNS_TIME_INDC  500000 non-null  object 
 12  INCIDENT_RESPONS

In [52]:
## find all unique categories in 'INITIAL_CALL_TYPE'
test_df["INITIAL_CALL_TYPE"].unique()

array(['EDP', 'INJURY', 'UNC', 'DRUG', 'SICK', 'ABDPN', 'UNKNOW',
       'STNDBY', 'DIFFBR', 'STATEP', 'ASTHMB', 'MEDRXN', 'CARD', 'OTHER',
       'CHOKE', 'SHOT', 'PEDSTR', 'RESPIR', 'INJMAJ', 'STAB', 'OBMIS',
       'OBMAJ', 'MVAINJ', 'INJMIN', 'ANAPH', 'ARREST', 'SEIZR', 'CVAC',
       'HYPTN', 'CVA', 'GYNHEM', 'INJALS', 'MCI21P', 'ALTMEN', 'OBLAB',
       'SICMIN', 'INBLED', 'BURNMA', 'SICPED', 'TRAUMA', 'EDPC', 'BURNMI',
       'JUMPDN', 'INHALE', 'CARDBR', 'PEDFC', 'RAPE', 'OBOUT', 'SICKFC',
       'OBCOMP', 'MCI21', 'JUMPUP', 'AMPMAJ', 'CHILDA', 'DROWN', 'PEDRF',
       'MVA', 'PD13', 'ELECT', 'AMPMIN', 'COLD', 'CDBRFC', 'STRANS',
       'GYNMAJ', 'DIFFFC', 'MCI43P', 'DIFFRF', 'ASTHFC', 'ABDPFC', 'HEAT',
       'RESPFC', 'EDPM', 'MCI80P', 'VENOM', 'SICKRF', 'SPEVNT', 'T-SICK',
       'PD13C', 'MEDVAC', 'T-TEXT', 'INBLFC', 'DOA', 'ALTMFC', 'SICMFC',
       'MCI26P', 'MECHE', 'MCI32P', 'DRUGFC', 'SAFE', 'ANAPFC', 'UNCFC',
       'MCI29P', 'CARDFC', 'STATFC', 'MCI77', 'ACC', 'MCI25

In [None]:
## what are the unique injuries

In [None]:
## query and filter for venom and incident dates and boroughs


### 4. `glob`

Our goal is to analyze all the files, not a single subset. 

We need a way to `iterate` through all the files.

A `list` is also known as an `iterable` because it contains items that can be **iterated over**.

We need to take the chunked CSVs and store their file names in a single list that we can then iterate over.

We use a standard-library module called `glob` that gathers every file name matching a pattern into a list.

Let's see how it works:


In [53]:
## import libraries
import glob 

### The power of ```glob``` comes from its ability to gather any target files we want.
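
Here is a minimal sketch of how a pattern picks files. It builds a throwaway folder with a few empty files (all of the file names are made up for illustration) and compares two patterns:

```python
import glob
import os
import tempfile

# Build a throwaway folder with a few empty files (names are made up).
with tempfile.TemporaryDirectory() as folder:
    for name in ["ems-data-1.csv", "ems-data-2.csv", "notes.txt", "other.csv"]:
        open(os.path.join(folder, name), "w").close()

    # "*" matches any run of characters in that position of the file name.
    chunk_files = glob.glob(os.path.join(folder, "ems-data-*.csv"))
    all_csvs = glob.glob(os.path.join(folder, "*.csv"))

    # glob returns full paths in no guaranteed order, so sort the base names.
    chunk_names = sorted(os.path.basename(path) for path in chunk_files)
    csv_names = sorted(os.path.basename(path) for path in all_csvs)

print(chunk_names)  # ['ems-data-1.csv', 'ems-data-2.csv']
print(csv_names)    # ['ems-data-1.csv', 'ems-data-2.csv', 'other.csv']
```

The narrower the pattern, the fewer files end up in the list.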


In [57]:
## grab only the csv files
## the star is a wildcard: whatever value is actually listed there in the file name does not matter.
myfile = glob.glob("ems-data-*.csv")
myfile

['ems-data-40.csv',
 'ems-data-41.csv',
 'ems-data-43.csv',
 'ems-data-8.csv',
 'ems-data-9.csv',
 'ems-data-42.csv',
 'ems-data-52.csv',
 'ems-data-46.csv',
 'ems-data-47.csv',
 'ems-data-45.csv',
 'ems-data-51.csv',
 'ems-data-50.csv',
 'ems-data-44.csv',
 'ems-data-23.csv',
 'ems-data-37.csv',
 'ems-data-36.csv',
 'ems-data-22.csv',
 'ems-data-34.csv',
 'ems-data-20.csv',
 'ems-data-21.csv',
 'ems-data-35.csv',
 'ems-data-19.csv',
 'ems-data-31.csv',
 'ems-data-25.csv',
 'ems-data-24.csv',
 'ems-data-30.csv',
 'ems-data-18.csv',
 'ems-data-26.csv',
 'ems-data-32.csv',
 'ems-data-33.csv',
 'ems-data-27.csv',
 'ems-data-16.csv',
 'ems-data-17.csv',
 'ems-data-15.csv',
 'ems-data-29.csv',
 'ems-data-28.csv',
 'ems-data-14.csv',
 'ems-data-38.csv',
 'ems-data-10.csv',
 'ems-data-11.csv',
 'ems-data-39.csv',
 'ems-data-13.csv',
 'ems-data-12.csv',
 'ems-data-49.csv',
 'ems-data-2.csv',
 'ems-data-3.csv',
 'ems-data-48.csv',
 'ems-data-1.csv',
 'ems-data-4.csv',
 'ems-data-5.csv',
 'ems-d

In [58]:
# ## pip install natsort
!pip install natsort



Collecting natsort
  Obtaining dependency information for natsort from https://files.pythonhosted.org/packages/ef/82/7a9d0550484a62c6da82858ee9419f3dd1ccc9aa1c26a1e43da3ecd20b0d/natsort-8.4.0-py3-none-any.whl.metadata
  Downloading natsort-8.4.0-py3-none-any.whl.metadata (21 kB)
Downloading natsort-8.4.0-py3-none-any.whl (38 kB)
Installing collected packages: natsort
Successfully installed natsort-8.4.0


In [59]:
# ## import library and module
## natsort is the package; natsorted is the individual function being imported from that package.
from natsort import natsorted


In [60]:
## sort in natural order (2 comes before 10), not lexicographical order
myfiles = natsorted(glob.glob("ems-data-*.csv"))
myfiles

['ems-data-1.csv',
 'ems-data-2.csv',
 'ems-data-3.csv',
 'ems-data-4.csv',
 'ems-data-5.csv',
 'ems-data-6.csv',
 'ems-data-7.csv',
 'ems-data-8.csv',
 'ems-data-9.csv',
 'ems-data-10.csv',
 'ems-data-11.csv',
 'ems-data-12.csv',
 'ems-data-13.csv',
 'ems-data-14.csv',
 'ems-data-15.csv',
 'ems-data-16.csv',
 'ems-data-17.csv',
 'ems-data-18.csv',
 'ems-data-19.csv',
 'ems-data-20.csv',
 'ems-data-21.csv',
 'ems-data-22.csv',
 'ems-data-23.csv',
 'ems-data-24.csv',
 'ems-data-25.csv',
 'ems-data-26.csv',
 'ems-data-27.csv',
 'ems-data-28.csv',
 'ems-data-29.csv',
 'ems-data-30.csv',
 'ems-data-31.csv',
 'ems-data-32.csv',
 'ems-data-33.csv',
 'ems-data-34.csv',
 'ems-data-35.csv',
 'ems-data-36.csv',
 'ems-data-37.csv',
 'ems-data-38.csv',
 'ems-data-39.csv',
 'ems-data-40.csv',
 'ems-data-41.csv',
 'ems-data-42.csv',
 'ems-data-43.csv',
 'ems-data-44.csv',
 'ems-data-45.csv',
 'ems-data-46.csv',
 'ems-data-47.csv',
 'ems-data-48.csv',
 'ems-data-49.csv',
 'ems-data-50.csv',
 'ems-dat
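
To see the difference between a plain string sort and a natural sort without installing anything, here is a rough sketch that uses only the standard library. The helper `natural_key` is a hypothetical name, not part of `natsort`, and it assumes every file name follows the same text-number-text layout as our chunk files.

```python
import re

def natural_key(file_name):
    """Split a name into text and number pieces so numbers compare as numbers."""
    pieces = re.split(r"(\d+)", file_name)
    return [int(piece) if piece.isdigit() else piece for piece in pieces]

names = ["ems-data-10.csv", "ems-data-2.csv", "ems-data-1.csv"]

plain_order = sorted(names)                      # compares character by character
natural_order = sorted(names, key=natural_key)   # compares 1, 2, 10 as numbers

print(plain_order)    # ['ems-data-1.csv', 'ems-data-10.csv', 'ems-data-2.csv']
print(natural_order)  # ['ems-data-1.csv', 'ems-data-2.csv', 'ems-data-10.csv']
```

In a plain string sort "10" lands before "2" because the character "1" comes before "2".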

In [63]:
## iterate through all the files and pull out "drug" incidents only
## once you run everything, 'dataframes' will hold one DataFrame of drug-related rows per file
## append adds each filtered DataFrame to the end of the list
dataframes = []
for myfile in myfiles:
    df = pd.read_csv(myfile)
    df["source"] = myfile
    temp_drug_df = df.query("INITIAL_CALL_TYPE == 'DRUG'")
    dataframes.append(temp_drug_df)



In [64]:
## how many?
len(dataframes)

52

In [65]:
## call a single one
dataframes[10]

Unnamed: 0,CAD_INCIDENT_ID,INCIDENT_DATETIME,INITIAL_CALL_TYPE,INITIAL_SEVERITY_LEVEL_CODE,FINAL_CALL_TYPE,FINAL_SEVERITY_LEVEL_CODE,FIRST_ASSIGNMENT_DATETIME,VALID_DISPATCH_RSPNS_TIME_INDC,DISPATCH_RESPONSE_SECONDS_QY,FIRST_ACTIVATION_DATETIME,...,POLICEPRECINCT,CITYCOUNCILDISTRICT,COMMUNITYDISTRICT,COMMUNITYSCHOOLDISTRICT,CONGRESSIONALDISTRICT,REOPEN_INDICATOR,SPECIAL_EVENT_INDICATOR,STANDBY_INDICATOR,TRANSFER_INDICATOR,source
1,142892746,10/16/2014 04:13:04 PM,DRUG,4,DRUG,4,10/16/2014 04:13:19 PM,true,15,10/16/2014 04:13:39 PM,...,113.0,27.0,412.0,29.0,5.0,false,false,false,false,ems-data-11.csv
20,142892769,10/16/2014 04:17:37 PM,DRUG,4,DRUG,4,10/16/2014 04:17:54 PM,true,17,10/16/2014 04:18:02 PM,...,100.0,32.0,414.0,27.0,5.0,false,false,false,false,ems-data-11.csv
65,142892824,10/16/2014 04:32:41 PM,DRUG,4,UNC,2,10/16/2014 04:32:59 PM,true,18,10/16/2014 04:33:16 PM,...,105.0,23.0,413.0,29.0,3.0,false,false,false,false,ems-data-11.csv
80,142892840,10/16/2014 04:37:39 PM,DRUG,4,DRUG,4,10/16/2014 04:42:24 PM,true,285,10/16/2014 04:42:47 PM,...,20.0,6.0,107.0,3.0,10.0,false,false,false,false,ems-data-11.csv
126,142892889,10/16/2014 04:46:44 PM,DRUG,4,DRUG,4,10/16/2014 04:47:07 PM,true,23,10/16/2014 04:47:48 PM,...,32.0,9.0,110.0,5.0,13.0,false,false,false,false,ems-data-11.csv
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499947,150583294,02/27/2015 07:38:53 PM,DRUG,4,DRUG,4,02/27/2015 07:39:39 PM,true,46,02/27/2015 07:40:06 PM,...,20.0,6.0,107.0,3.0,10.0,false,false,false,false,ems-data-11.csv
499952,150583300,02/27/2015 07:40:50 PM,DRUG,4,DRUG,4,02/27/2015 07:41:30 PM,true,40,02/27/2015 07:41:49 PM,...,28.0,9.0,110.0,3.0,13.0,false,false,false,false,ems-data-11.csv
499981,150583334,02/27/2015 07:49:44 PM,DRUG,4,DRUG,4,02/27/2015 07:50:19 PM,true,35,02/27/2015 07:50:46 PM,...,48.0,15.0,206.0,12.0,15.0,false,false,false,false,ems-data-11.csv
499983,150583336,02/27/2015 07:50:49 PM,DRUG,4,DRUG,4,02/27/2015 07:51:00 PM,true,11,02/27/2015 07:51:14 PM,...,115.0,21.0,403.0,30.0,14.0,false,false,false,false,ems-data-11.csv


In [66]:
## concat

df_drugs = pd.concat(dataframes, ignore_index = True)
df_drugs

Unnamed: 0,CAD_INCIDENT_ID,INCIDENT_DATETIME,INITIAL_CALL_TYPE,INITIAL_SEVERITY_LEVEL_CODE,FINAL_CALL_TYPE,FINAL_SEVERITY_LEVEL_CODE,FIRST_ASSIGNMENT_DATETIME,VALID_DISPATCH_RSPNS_TIME_INDC,DISPATCH_RESPONSE_SECONDS_QY,FIRST_ACTIVATION_DATETIME,...,POLICEPRECINCT,CITYCOUNCILDISTRICT,COMMUNITYDISTRICT,COMMUNITYSCHOOLDISTRICT,CONGRESSIONALDISTRICT,REOPEN_INDICATOR,SPECIAL_EVENT_INDICATOR,STANDBY_INDICATOR,TRANSFER_INDICATOR,source
0,110010797,01/01/2011 02:20:17 AM,DRUG,4,DRUG,4,01/01/2011 02:40:16 AM,Y,1199,01/01/2011 02:40:27 AM,...,41.0,17.0,202.0,8.0,15.0,N,N,N,N,ems-data-1.csv
1,110010800,01/01/2011 02:20:34 AM,DRUG,4,DRUG,4,,N,0,,...,18.0,3.0,104.0,2.0,10.0,N,N,N,N,ems-data-1.csv
2,110010806,01/01/2011 02:21:23 AM,DRUG,4,DRUG,4,,N,0,,...,7.0,2.0,103.0,1.0,12.0,N,N,N,N,ems-data-1.csv
3,110010811,01/01/2011 02:21:56 AM,DRUG,4,DRUG,4,01/01/2011 02:22:19 AM,Y,23,01/01/2011 02:22:30 AM,...,122.0,50.0,502.0,31.0,11.0,N,N,N,N,ems-data-1.csv
4,110010813,01/01/2011 02:22:24 AM,DRUG,4,ARREST,1,01/01/2011 02:22:51 AM,Y,27,01/01/2011 02:22:58 AM,...,90.0,34.0,301.0,14.0,7.0,N,N,N,N,ems-data-1.csv
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1478606,213414599,12/07/2021 11:30:56 PM,DRUG,4,ALTMEN,3,12/07/2021 11:31:07 PM,Y,11,12/07/2021 11:31:19 PM,...,81.0,41.0,303.0,16.0,8.0,N,N,N,N,ems-data-52.csv
1478607,213414609,12/07/2021 11:34:51 PM,DRUG,4,DRUG,4,12/07/2021 11:34:58 PM,Y,7,12/07/2021 11:35:07 PM,...,121.0,49.0,501.0,31.0,11.0,N,N,N,N,ems-data-52.csv
1478608,213414632,12/07/2021 11:44:15 PM,DRUG,4,DRUG,4,12/07/2021 11:44:27 PM,Y,12,12/07/2021 11:44:58 PM,...,46.0,14.0,205.0,10.0,15.0,N,N,N,N,ems-data-52.csv
1478609,213414660,12/07/2021 11:53:53 PM,DRUG,4,DRUG,4,12/07/2021 11:54:56 PM,Y,63,12/07/2021 11:57:58 PM,...,14.0,3.0,105.0,2.0,12.0,N,N,N,N,ems-data-52.csv
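
The filter, append and concat pattern is easier to see on a tiny example. This sketch uses two small made-up DataFrames in place of the saved CSV files (the `part-1.csv` and `part-2.csv` names and the rows are assumptions) and shows what `ignore_index = True` does to the row labels:

```python
import pandas as pd

# Two toy "files" already loaded as DataFrames; the values are invented.
pieces = {
    "part-1.csv": pd.DataFrame({"INITIAL_CALL_TYPE": ["DRUG", "SICK", "DRUG"]}),
    "part-2.csv": pd.DataFrame({"INITIAL_CALL_TYPE": ["INJURY", "DRUG"]}),
}

dataframes = []
for source_name, df in pieces.items():
    df["source"] = source_name                                   # remember where each row came from
    dataframes.append(df.query("INITIAL_CALL_TYPE == 'DRUG'"))   # keep only the matching rows

# Without ignore_index each piece keeps its own row labels.
kept_index = pd.concat(dataframes).index.tolist()
# With ignore_index the combined frame is renumbered from 0.
fresh_index = pd.concat(dataframes, ignore_index=True).index.tolist()

print(kept_index)   # [0, 2, 1]
print(fresh_index)  # [0, 1, 2]
```

Renumbering matters here because every chunk file starts counting at 0, so the original labels would repeat across files.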


In [None]:
## see a sample, with random source list


In [None]:
## call a sample of 20


In [67]:
## keep only the incident date, initial severity level code and borough columns of the drug incidents
df_drugs.filter(["INCIDENT_DATETIME", "INITIAL_SEVERITY_LEVEL_CODE","BOROUGH"])

Unnamed: 0,INCIDENT_DATETIME,INITIAL_SEVERITY_LEVEL_CODE,BOROUGH
0,01/01/2011 02:20:17 AM,4,BRONX
1,01/01/2011 02:20:34 AM,4,MANHATTAN
2,01/01/2011 02:21:23 AM,4,MANHATTAN
3,01/01/2011 02:21:56 AM,4,RICHMOND / STATEN ISLAND
4,01/01/2011 02:22:24 AM,4,BROOKLYN
...,...,...,...
1478606,12/07/2021 11:30:56 PM,4,BROOKLYN
1478607,12/07/2021 11:34:51 PM,4,RICHMOND / STATEN ISLAND
1478608,12/07/2021 11:44:15 PM,4,BRONX
1478609,12/07/2021 11:53:53 PM,4,MANHATTAN


In [68]:
## value count by borough
df_drugs.value_counts("BOROUGH")

BOROUGH
MANHATTAN                   502493
BROOKLYN                    378179
QUEENS                      276054
BRONX                       268667
RICHMOND / STATEN ISLAND     53213
UNKNOWN                          5
Name: count, dtype: int64
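
One possible way to turn counts into shares is the `normalize` option of `value_counts`. This is only a sketch on a four-row made-up DataFrame, not the EMS results:

```python
import pandas as pd

# Four invented rows; only the column name matches the real data.
toy = pd.DataFrame({"BOROUGH": ["BRONX", "QUEENS", "BRONX", "BRONX"]})

# normalize=True returns fractions that add up to 1; multiply by 100 for percentages.
shares = toy["BOROUGH"].value_counts(normalize=True) * 100

print(shares.round(1))
```

Here BRONX accounts for 75 percent of the toy rows and QUEENS for 25 percent.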

In [None]:
## percentage by borough


In [None]:
## iterate through all the files and pull out "final severity levels between 6 and 7 inclusive" incidents only


In [None]:
## see a sample, with random source list
