# How many confirmed cases of COVID-19 are there in the states of the US?

**Learning goal:** In this case, you will learn how to write `for` loops and how to read and write plain-text files using Python.

You work as a data analyst at a local think tank that specializes in social policy. Today your boss obtained a nice COVID-19 dataset and wants you to help him with a quick report to include in one of his presentations. He would like you to find how many confirmed cases there were between Jan 22, 2020, and Feb 9, 2021, for each state and save the results to a text file. He also needs to know how many cumulative cases there have been in the entire US.

The data came to you in a folder called `confirmed`, which contains 52 subfolders: one for each US state plus 2 territories. Inside each subfolder, there are plain-text files, one file per county. These files have 385 rows each because there were 385 days between Jan 22, 2020, and Feb 9, 2021.

These data represent cumulative confirmed cases of COVID-19 per county. That means each number counts not only the cases confirmed that day but also all the cases confirmed before it (since Jan 22, 2020), including people who recovered, people who didn't recover, and people who caught the virus more than once.

## Madison, Indiana

To read a file into Python, we use the built-in **`open()`** function with `"r"` ("read") as its second argument. To get a list of all the rows in the file, we call the **`.readlines()`** method on the file object that `open()` returns.

**Note:** Throughout this case, we will truncate the outputs of some of the cells using list slicing to make it easier to navigate the notebook.

In [None]:
file = open("data/confirmed/Indiana/Madison.csv", "r")
madison_indiana = file.readlines()
file.close() # Closing the file once we have its contents
madison_indiana[0:15] # The list has 385 rows, but we only show the first 15
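
As a side note, Python also lets you open a file inside a `with` block, which closes the file for you when the block ends. The sketch below is only an illustration: it first writes a small made-up file (`sample_county.csv`, a hypothetical name) to a temporary folder so that it can run without the case's data.

~~~python
import os
import tempfile

# Hypothetical stand-in for a county file: three days of made-up cumulative counts
sample_path = os.path.join(tempfile.mkdtemp(), "sample_county.csv")
with open(sample_path, "w") as sample_file:
    sample_file.write("0\n2\n5\n")

# The file is closed automatically when the `with` block ends
with open(sample_path, "r") as sample_file:
    sample_rows = sample_file.readlines()

print(sample_rows)         # ['0\n', '2\n', '5\n']
print(sample_file.closed)  # True
~~~

Either style works for this case. The `with` form simply saves you from having to remember to call `.close()`.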

Each row is a string that contains the number of confirmed cases followed by the `\n` character, which signals a new line. We want to turn this list of strings into a list of integers. In order to achieve that, we need to:

1. Iterate through each element in the list
2. Extract the number
3. Store the number in another list

Our best aid for this kind of task is the **`for` loop**. `for` loops are Python blocks that automate repetitive tasks. The syntax is as follows:

~~~python
for item in sequence:
    <do something> # possibly including the `item` variable name
~~~

where `sequence` is a list, tuple, set, dictionary, or string (in the case of a string, `for` loops iterate over all the characters of that string). The `for` loop above "does something" *for each* item in the sequence, so if the sequence has 10 elements, the loop performs the task 10 times. The keyword `for` is short for "for each".
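
To make this concrete, here is a small sketch that loops over a tuple, a string, and a dictionary and records every item it visits. All the names and values (`visited`, `toy_cases`, the numbers) are made up for illustration.

~~~python
visited = []

for state in ("Indiana", "Ohio"):   # tuple: one iteration per element
    visited.append(state)

for char in "IN":                   # string: one iteration per character
    visited.append(char)

toy_cases = {"Indiana": 3, "Ohio": 5}
for key in toy_cases:               # dictionary: one iteration per key
    visited.append(key)

print(visited)       # ['Indiana', 'Ohio', 'I', 'N', 'Indiana', 'Ohio']
print(len(visited))  # 6
~~~

Notice that looping over a dictionary gives you its keys. To get the value for a key, you would write `toy_cases[key]` inside the loop.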

## Practicing with for-loops

### Example 1 

Iterate through a string, s, and save each character from this string in a list.
~~~text
    input:    s = "Correlation1"

    output:   ['C','o','r','r','e','l','a','t','i','o','n','1']
~~~

**Answer.**

In [None]:
#This first block uses print statements to visually display what is happening during each iteration of our loop

s = "Correlation1"

letter_list = []

for letter in s:
    print('--Begin one iteration of loop--')
    print(f'Element of string for this iteration = {letter}')
    print('Add this letter to the letter list')
    letter_list.append(letter)
    print(letter_list)
    print('--Go to next iteration of loop--')
    print()

print(letter_list)

In [None]:
#This block is performing the same function, without the helper print statements

s = "Correlation1"

letter_list = []

for letter in s:
    letter_list.append(letter)
    
print(letter_list)

### Example 2

Make a new list using the first letter of each word from the provided list of words.

~~~text
    input: ['Data', 'Science', '4', 'All']
    output: ['D', 'S', '4', 'A']
~~~

**Answer.**

In [None]:
#This first block uses print statements to visually display what is happening during each iteration of our loop

word_list = ["Data", "Science", "4", "All"]
output = []

for word in word_list:
    print('--Begin one iteration of loop--')
    print(f'Element of word list for this iteration = {word}')
    print(f'Pull first letter from this word: {word[0]}')
    print('Add this letter to our output list')
    output.append(word[0])
    print(output)
    print('--Go to next iteration of loop--')
    print()

print(output)

In [None]:
#This block is performing the same function, without the helper print statements

word_list = ["Data", "Science", "4", "All"]
output = []

for word in word_list:
    output.append(word[0])
    
print(output)

## Writing a `for` loop

Each element of the `madison_indiana` list is a string that contains the number of confirmed cases followed by the `\n` character, which signals a new line, like this: `['25\n', '35\n', ...]`. We want to turn this list of strings into a list of integers (like this: `[25, 35, ...]`). In order to achieve that, we need to:

1. Iterate through each element in the list
2. Extract the number
3. Store the number in another list

What would our `for` loop look like? The first step, "iterate through each element in the list," is easy:

~~~python
for row in madison_indiana:
~~~

For the second step, we need to extract the number. We could certainly write something like this:

~~~python
row = row.replace("\n", "") # Getting rid of the newline
number = int(row) # Converting into an integer
~~~

However, wrapping these two lines into a function might be easier to read:

~~~python
def extract_number(some_text):
    some_text = some_text.replace("\n", "") # Getting rid of the newline
    number = int(some_text) # Converting into an integer
    return number
~~~

We can now include `extract_number(row)` in our `for` loop:

In [None]:
def extract_number(some_text):
    some_text = some_text.replace("\n", "")
    number = int(some_text)
    return number

for row in madison_indiana[0:15]:
    print(extract_number(row))
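
A small aside on cleaning the rows: the string method `.strip()` removes any whitespace at both ends of a string (spaces, tabs, and newlines), so it is a possible alternative to `.replace("\n", "")`. The sketch below uses a hypothetical variant of our function, `extract_number_stripped`, and made-up rows. It is not part of the case's solution, just an illustration.

~~~python
def extract_number_stripped(some_text):
    # Hypothetical variant: strip surrounding whitespace, then convert
    return int(some_text.strip())

sample_rows = ["25\n", "35\n", " 40 \r\n"]  # made-up rows, one of them messy
cleaned = []
for row in sample_rows:
    cleaned.append(extract_number_stripped(row))

print(cleaned)  # [25, 35, 40]
~~~

In fact, `int()` already ignores surrounding whitespace on its own, so `int("25\n")` also returns `25`. Removing the newline explicitly just makes the intent easier to read.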

Now we've successfully extracted the numbers. The last step was "Store the number in another list". This would be the usual way to do that:

In [None]:
madison_indiana_numbers = []

for row in madison_indiana:
    number = extract_number(row)
    madison_indiana_numbers.append(number)

We first created an empty list, `madison_indiana_numbers`, then iterated through the lines of the file with the `for` loop, and for each row, we extracted the number with our function and appended the result to our empty list. So we basically populated our list iteratively (remember from a past case that you can append elements to lists using `my_list.append()`). Let's see what our cleaned data looks like:

In [None]:
madison_indiana_numbers[0:15]

To find the current number of confirmed cases, we use the built-in **`max()`** function (this is a cumulative time series, so the most recent value should also be the maximum):

In [None]:
max(madison_indiana_numbers)
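
The reasoning behind using `max()` can be checked on a toy example. In the sketch below, both lists are made up. The second one imagines a downward correction on the last day, which is an assumption for illustration only and not something we claim happens in this dataset.

~~~python
toy_cumulative = [0, 2, 5, 5, 9]   # made-up cumulative counts
print(max(toy_cumulative))         # 9
print(toy_cumulative[-1])          # 9 (the last element is also the maximum)

toy_revised = [0, 2, 5, 9, 8]      # hypothetical series with a late correction
print(max(toy_revised))            # 9
print(toy_revised[-1])             # 8
~~~

As long as the counts never decrease, `max(some_list)` and `some_list[-1]` give the same answer. If a series were ever revised downward, the two would differ, and you would have to decide which one your report needs.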

### Doing it for the whole of Indiana

There are 94 files in the folder for Indiana alone. If `for` loops weren't a thing, you'd have to read the files one by one and store the results manually, but now you have the knowledge to automate this task.

### Exercise 1

Write a `for` loop to iterate through all the files in the `data/confirmed/Indiana` folder and save each file as an element in a list called `indiana_data`.

**Hint:** This code will give you a list of all the files that are inside the Indiana folder, with their relevant paths:

~~~python
import glob
list_of_files = glob.glob("data/confirmed/Indiana/*.csv")
~~~

Also, remember to use `open()` and `.readlines()`.

**Answer.**

### Exercise 2

Now that you have `indiana_data`, create a `for` loop that iterates through it and extracts the numbers. Call the resulting nested list `indiana_data_clean`.

For this exercise, you'll have to use a **nested `for` loop**, that is, a `for` loop inside another `for` loop. The nested loop should be indented, like in this example:

~~~python
for i in [1,2,3]: # This is the first loop
    for j in [4,5,6]: # This is the nested loop
        print(i+j)
~~~
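
To see the order in which a nested loop runs, the sketch below repeats the example above but stores each sum in a list (the name `sums` is ours) instead of printing it.

~~~python
sums = []
for i in [1, 2, 3]:        # The outer loop runs 3 times
    for j in [4, 5, 6]:    # For each i, the inner loop runs 3 times
        sums.append(i + j)

print(sums)       # [5, 6, 7, 6, 7, 8, 7, 8, 9]
print(len(sums))  # 9
~~~

The inner loop finishes all of its iterations before the outer loop moves on to its next item, so the body runs 3 × 3 = 9 times in total.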

**Hint:** You need two empty lists: One is `indiana_data_clean`, and the other one holds the numbers for a particular county (we give you these lists in the code below). You will append the cleaned numbers to this second list and then iteratively append all the county lists (one per county) to `indiana_data_clean`.

We provide you with a section of the code to help you get started:

~~~python
indiana_data_clean = []

for county in indiana_data:
    county_numbers = []
    # Your nested for loop here
    indiana_data_clean.append(county_numbers)
~~~


**Answer.**

Finally, let's compute the current number of cases for each county and then the state-wide total:

In [None]:
indiana_current = []
for county in indiana_data_clean:
    curr_cases = max(county)
    indiana_current.append(curr_cases)
    
sum(indiana_current)

Now we know there have been 642,071 confirmed cases in Indiana.

## Doing the same for all the states and counties

There are 3,334 files in our dataset. Scaling our analysis to the whole country sounds like a great job for a `for` loop.

For our convenience, let's wrap our cleaning code into a function:

In [None]:
def clean_county_data(path):
    """
    Takes a file path, loads the file into Python
    and returns a list with the cleaned data.
    """
    # Reading in the file
    file = open(path, "r")
    content = file.readlines()
    file.close()
    
    # Cleaning the data and appending it to county_numbers
    county_numbers = []
    for row in content:
        number = extract_number(row)
        county_numbers.append(number)
        
    return county_numbers

To call this function, you would type something like this, and the result will be a list of integers (the numbers that correspond to the county passed as input):

In [None]:
clean_county_data("data/confirmed/Indiana/Madison.csv")

### Exercise 3

This code gives you the list of all the states:

~~~python
import os
list_of_states = os.listdir("data/confirmed")
~~~

And this gives you the list of all the files in `data/confirmed/Indiana`:

~~~python
glob.glob("data/confirmed/Indiana/*.csv")
~~~

You can parameterize this last piece of code like this:

~~~python
state = "Indiana"
glob.glob("data/confirmed/" + state + "/*.csv")
~~~
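
There is more than one way to build such a path. The sketch below compares string concatenation with an f-string and with `os.path.join()`, and then builds one pattern per state inside a `for` loop. The variable names and the two-state list are made up for illustration, and nothing here touches the disk.

~~~python
import os

state = "Indiana"
pattern_concat = "data/confirmed/" + state + "/*.csv"
pattern_fstring = f"data/confirmed/{state}/*.csv"
# os.path.join uses the path separator of your operating system
pattern_join = os.path.join("data", "confirmed", state, "*.csv")

print(pattern_concat == pattern_fstring)  # True

# Building one pattern per state
patterns = []
for toy_state in ["Indiana", "Ohio"]:
    patterns.append(f"data/confirmed/{toy_state}/*.csv")

print(patterns)  # ['data/confirmed/Indiana/*.csv', 'data/confirmed/Ohio/*.csv']
~~~

Any of the three forms can be passed to `glob.glob()`. Pick whichever you find easiest to read.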

With this in mind, write a `for` loop that gives the following output (you will have to nest loops):

![Desired output](data/images/desired_result.png)

The desired result is a list in which each element is a list with two elements: 1) The name of the state and 2) the total number of confirmed cases in that state. Call this list `result`.

**Hint:** We give you part of the code to help you get started (you have to add your `for` loop after the `# Your code here` comment):

~~~python
import glob
import os
list_of_states = os.listdir("data/confirmed")

result = []
for state in list_of_states:
    # A list where we'll store the maximums from each county in this state
    list_of_current_numbers = []
    
    # Cleaning all the corresponding county files and finding their maximums
    list_of_counties = glob.glob("data/confirmed/" + state + "/*.csv")
    # Your code here
    
    # Summing the current numbers of all the counties of this state
    state_total = sum(list_of_current_numbers)
    
    # Appending the results to the result list
    result.append([state, state_total])

result
~~~

**Answer.**

We can finally sum up all the state-level totals to get a grand total for the entire US:

In [None]:
only_totals = []
for state in result:
    only_totals.append(state[1]) # We are interested just in the number, not the name of the state

sum(only_totals)

From Jan 22, 2020, to Feb 9, 2021, the number of cumulative COVID-19 cases in the US was 27,224,664.

Saving your results as a text file is easy. Let's create the file `covid_cases.txt`. For that, we use this code (notice that we use `w` instead of `r` this time because we want to <b>w</b>rite to that file, not just <b>r</b>ead it). Opening a file in `w` mode creates the file if it doesn't exist yet and overwrites its contents if it does:

In [None]:
new_file_to_save = open("covid_cases.txt", "w")
line = "From Jan 22, 2020 to Feb 9, 2021, the number of cumulative COVID-19 cases in the US was 27,224,664."
new_file_to_save.write(line)
new_file_to_save.close()

Go to this case's folder. The new file should be there with the results!
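
Your boss also asked for the per-state results in a text file. One possible way to do that, sketched below under some assumptions, is to call `.write()` once per state inside a `for` loop. The list `toy_result` is made up and only mimics the shape of `result`, and the file goes to a temporary folder so that the sketch doesn't clutter the case's folder.

~~~python
import os
import tempfile

toy_result = [["Indiana", 1200], ["Ohio", 3400]]  # made-up totals
report_path = os.path.join(tempfile.mkdtemp(), "toy_cases_by_state.txt")

report_file = open(report_path, "w")
for state_row in toy_result:
    # .write() does not add a newline, so we add "\n" ourselves.
    # The ":," format adds thousands separators to the number.
    report_file.write(f"{state_row[0]}: {state_row[1]:,}\n")
report_file.close()

# Reading the file back to check what we wrote
check_file = open(report_path, "r")
report_lines = check_file.readlines()
check_file.close()

print(report_lines)  # ['Indiana: 1,200\n', 'Ohio: 3,400\n']
~~~

To use this with the real data, you would loop over `result` instead of `toy_result` and pick a file name in the case's folder.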

## Appendix

Here is a summary of the main points we covered in this case:

* **Opening a text file**: `file = open("path/to/the/file", "r")`. To read the lines of the file and save them as a list, you use `list_of_lines = file.readlines()`.
* **For loops**. This diagram shows the different parts of a `for` loop block:

![For loops](data/images/for_anatomy.png)

* **Populating a list iteratively:** This is a common task, so save the snippet:

~~~python
list_to_populate = []
for item in some_sequence:
    new_item = some_function(item)
    list_to_populate.append(new_item)
~~~

* **Getting a list of all the files with `extension` in a directory (directory is just another name for "folder")**:

~~~python
import glob
list_of_files = glob.glob("path/to/directory/*.extension")
~~~

Two important things to notice here. First, if you are not familiar with the term **extension**, it is simply the text that comes after the dot in file names. Extensions help identify the file type. For instance, `my_excel.xlsx` has the `.xlsx` extension, which identifies Excel files, and `my_image.png` has the `.png` extension, which tells you this is a PNG image file. The second important thing is that this snippet uses an asterisk (`*`) as a **wildcard**. In this context, this symbol tells Python to take into account all the files that are inside that folder ending in `.extension`. So, for instance, this code would catch `hello.extension` and `goodbye.extension` but not `this_is.anotherextension`. Wildcards are placeholders that stand for some undetermined text before or after some other known text.
* **Getting a list of everything inside a directory (subdirectories and files alike)**:

~~~python
import os
list_of_subfolders = os.listdir("path/to/directory")
~~~


* **Nested `for` loops:** To nest a `for` loop inside another `for` loop, you have to indent the inner loop (the statements of the inner loop will therefore be indented twice).
* **Creating a file, writing a line, and saving the file to disk**: This is a generic snippet to do that:

~~~python
new_file_to_save = open("path/to/folder/name_of_new_file.txt", "w")
line = "Some text to write"
new_file_to_save.write(line)
new_file_to_save.close()
~~~
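
The difference between `os.listdir()` and `glob.glob()` is easiest to see on a toy folder. The sketch below builds a small made-up folder structure in a temporary location (all the file names are hypothetical) and lists its contents both ways. We sort the results because neither function promises any particular order.

~~~python
import glob
import os
import tempfile

# Building a toy folder: <temporary folder>/Indiana with three small files
toy_root = tempfile.mkdtemp()
os.mkdir(os.path.join(toy_root, "Indiana"))
for name in ["Madison.csv", "Marion.csv", "notes.txt"]:
    toy_file = open(os.path.join(toy_root, "Indiana", name), "w")
    toy_file.write("0\n")
    toy_file.close()

# os.listdir returns the names of everything inside the folder
entries = sorted(os.listdir(toy_root))

# glob.glob returns the paths that match the wildcard pattern
csv_paths = sorted(glob.glob(os.path.join(toy_root, "Indiana", "*.csv")))
csv_names = []
for path in csv_paths:
    csv_names.append(os.path.basename(path))  # Keeping just the file name

print(entries)    # ['Indiana']
print(csv_names)  # ['Madison.csv', 'Marion.csv']
~~~

Notice that `notes.txt` is left out of `csv_names` because it doesn't match the `*.csv` pattern.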

## Attribution

*JHU CSSE COVID-19 Dataset*. Johns Hopkins University on behalf of its Center for Systems Science and Engineering. February 9, 2021. Creative Commons Attribution 4.0 International. https://github.com/CSSEGISandData/COVID-19. For additional information, please refer to "Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Inf Dis. 20(5):533-534. doi: 10.1016/S1473-3099(20)30120-1"