# Getting the Data

Our data comes directly from the [Johns Hopkins COVID-19 GitHub repository][1], which tracks deaths and cases for each country in the world, as well as for many regions within some countries. All of the data needed for this project is in the [time series][2] directory, which contains the four CSV files we need, summarizing the deaths and cases for the world and the USA. The repository uses the word "confirmed" to refer to cases.

[1]: https://github.com/CSSEGISandData/COVID-19
[2]: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series

## Reading the data into a pandas DataFrame

The pandas `read_csv` function can read remote CSV files when you pass it a URL. Finding the exact URL on GitHub is a bit tricky. You must use the "raw" data file, which you can get by clicking the file name (taking you to the file's page), then right-clicking the "view raw" or "download" button and copying the link. The image below shows the screen you'll see for the first CSV.

![Copying the raw file link on GitHub][3]

[3]: images/url_download.png
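If you would rather build the raw link than copy it by hand, the two URL forms differ in a predictable way: the host changes to `raw.githubusercontent.com` and the `/blob` path segment disappears. The sketch below assumes the file page URL follows the usual `github.com/<owner>/<repo>/blob/<branch>/<path>` layout; `to_raw_url` is a hypothetical helper, not part of pandas or of this project.

```python
def to_raw_url(page_url):
    """Convert a GitHub file-page URL (containing /blob/) to its raw URL."""
    prefix = "https://github.com/"
    if not page_url.startswith(prefix) or "/blob/" not in page_url:
        raise ValueError(f"Not a GitHub file page URL: {page_url}")
    # Swap the host and drop the first "/blob" path segment
    path = page_url[len(prefix):].replace("/blob/", "/", 1)
    return "https://raw.githubusercontent.com/" + path


page_url = (
    "https://github.com/CSSEGISandData/COVID-19/blob/master/"
    "csse_covid_19_data/csse_covid_19_time_series/"
    "time_series_covid19_deaths_global.csv"
)
raw_url = to_raw_url(page_url)
print(raw_url)

# With network access, the raw URL can be passed straight to pandas:
# import pandas as pd
# df = pd.read_csv(raw_url)
```

The resulting link has the same prefix as the `DOWNLOAD_URL` template used in Exercise 1.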

## Naming conventions

Before we write any code, let's cover some naming conventions that we will use throughout the project.

### `group`

We will use the name `group` to refer to the two separate "groups" of data.

* `"world"` - represents all data from each country
* `"usa"` - represents all data from each US state

### `kind`

We will use the name `kind` to refer to the two different kinds of COVID-19 data.

* `"deaths"`
* `"cases"`


### `area`

Occasionally, we will refer to either a specific country or state with the name `area`.

## Downloading the data

Now that we have the URL, we can download the data with pandas. Complete the exercise below to download all four files as DataFrames.

### Exercise 1

<span style="color:green; font-size:16px">Write a function that reads in a single CSV and returns it as a DataFrame. The function accepts a `group` and a `kind`. Use the variable `DOWNLOAD_URL` in your solution. Look at the file names in the repo linked above to determine which values `kind` and `group` must take in the URL. You'll have to reassign their values in the function so that the URL is correct. For example, the function call `download_data("world", "deaths")` should download [one of the files on this page](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series).</span>

In [23]:
import pandas as pd

DOWNLOAD_URL = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/"
    "master/csse_covid_19_data/csse_covid_19_time_series/"
    "time_series_covid19_{kind}_{group}.csv"
)

# Map our names for group and kind to the names used in the repo's file names
GROUP_MAPPING = {
    "usa": "US",        # "usa" -> files ending in "_US.csv"
    "world": "global",  # "world" -> files ending in "_global.csv"
}

KIND_MAPPING = {
    "cases": "confirmed",  # the repo calls cases "confirmed"
    "deaths": "deaths",    # same name in the repo
}

def download_data(group, kind):
    """
    Returns the DataFrame of the dataset from the Johns Hopkins GitHub repo

    Parameters
    ----------
    group : "world" or "usa"

    kind : "cases" or "deaths"

    Returns
    -------
    dataframe
        The dataframe from the url
    """
    # Translate our names to the repo's names; unrecognized values
    # fall back to "global" and "deaths"
    group = GROUP_MAPPING.get(group, "global")
    kind = KIND_MAPPING.get(kind, "deaths")

    url = DOWNLOAD_URL.format(kind=kind, group=group)
    df = pd.read_csv(url)
    return df



In [16]:
df = download_data("usa", "deaths")
df.head()

Unnamed: 0,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,...,2/28/23,3/1/23,3/2/23,3/3/23,3/4/23,3/5/23,3/6/23,3/7/23,3/8/23,3/9/23
0,84001001,US,USA,840,1001.0,Autauga,Alabama,US,32.539527,-86.644082,...,230,232,232,232,232,232,232,232,232,232
1,84001003,US,USA,840,1003.0,Baldwin,Alabama,US,30.72775,-87.722071,...,724,726,726,726,726,726,726,726,727,727
2,84001005,US,USA,840,1005.0,Barbour,Alabama,US,31.868263,-85.387129,...,103,103,103,103,103,103,103,103,103,103
3,84001007,US,USA,840,1007.0,Bibb,Alabama,US,32.996421,-87.125115,...,109,109,109,109,109,109,109,109,109,109
4,84001009,US,USA,840,1009.0,Blount,Alabama,US,33.982109,-86.567906,...,261,261,261,261,261,261,261,261,261,261
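One thing to be aware of in the `download_data` function above: because it uses `dict.get` with a default, an unrecognized value such as `"World"` silently falls back to the global deaths file. The sketch below shows a stricter alternative that only builds the URL and raises on unknown input. `build_url` is a hypothetical helper introduced here for illustration; the constants are copied from the cell above so that the block runs on its own.

```python
DOWNLOAD_URL = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/"
    "master/csse_covid_19_data/csse_covid_19_time_series/"
    "time_series_covid19_{kind}_{group}.csv"
)
GROUP_MAPPING = {"usa": "US", "world": "global"}
KIND_MAPPING = {"cases": "confirmed", "deaths": "deaths"}


def build_url(group, kind):
    """Return the raw CSV URL, rejecting unknown group or kind values."""
    if group not in GROUP_MAPPING:
        raise ValueError(f"group must be one of {sorted(GROUP_MAPPING)}, got {group!r}")
    if kind not in KIND_MAPPING:
        raise ValueError(f"kind must be one of {sorted(KIND_MAPPING)}, got {kind!r}")
    return DOWNLOAD_URL.format(kind=KIND_MAPPING[kind], group=GROUP_MAPPING[group])


url = build_url("usa", "cases")
print(url)
```

Whether to fail loudly or fall back to a default is a design choice; raising an error tends to surface typos earlier.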


### Important information on exercises - please read!

All of the exercises require you to complete the body of a function. All functions end with the `pass` keyword. **Delete** it and write your solution in the body of the function.

Solutions for all exercises are found in the [solutions.py](solutions.py) file in this directory. You can open it up in your favorite editor, or just click the link to open it in your browser.

In the code cell following each exercise, you will see a single line of code that imports the function from the solutions.py file. For example, `from solutions import download_data`. Running this statement will provide you with a version of the function that produces the correct output for the exercise.

**Comment out the import line** if you want to use and test **your version** of the function completed above. I highly recommend completing the exercises on your own. Keep the import line uncommented if you do not attempt the exercise. 

**Always check the solutions!** Make sure to check the [solutions.py](solutions.py) file for each exercise, even if you are sure you answered it correctly. Verifying solutions is one of the best known methods for internalizing new material.

### Verifying the `download_data` function

Let's read in the world deaths file as a DataFrame and output the head to verify that it works. 

In [17]:
# comment out the import line below if you attempted the exercise above
# keep the line below if you did not attempt the exercise
# from solutions import download_data 
df_world_deaths = download_data('world', 'deaths')
df_world_deaths.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,2/28/23,3/1/23,3/2/23,3/3/23,3/4/23,3/5/23,3/6/23,3/7/23,3/8/23,3/9/23
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,7896,7896,7896,7896,7896,7896,7896,7896,7896,7896
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,3598,3598,3598,3598,3598,3598,3598,3598,3598,3598
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,6881,6881,6881,6881,6881,6881,6881,6881,6881,6881
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,165,165,165,165,165,165,165,165,165,165
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,1933,1933,1933,1933,1933,1933,1933,1933,1933,1933


Let's write another function that uses `download_data` to read in all four DataFrames.

### Exercise 2

<span style="color:green; font-size:16px">Write a function that reads in all four CSVs as DataFrames and returns them in a dictionary. Use the group and kind separated by an underscore as the key (e.g. `"world_deaths"`). Use the `GROUPS` and `KINDS` variables in your solution.</span>

In [24]:
GROUPS = ("world", "usa")
KINDS = ("deaths", "cases")

def read_all_data():
    """
    Read in all four CSVs as DataFrames
    
    Returns
    -------
    Dictionary of DataFrames
    """
    data = {}

    for group in GROUPS:
        for kind in KINDS:
            try:
                data[f"{group}_{kind}"] = download_data(group, kind)
            except Exception as e:
                print(f"Error occurred while reading data for {group} {kind}: {str(e)}")

    return data


Let's use this function to read in all of the data and output the head of two of them.

In [25]:
# remember to comment out the following line if you attempt the exercise
# this is the last exercise with this warning
# from solutions import read_all_data
data = read_all_data()
data['world_cases'].head(3)

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,2/28/23,3/1/23,3/2/23,3/3/23,3/4/23,3/5/23,3/6/23,3/7/23,3/8/23,3/9/23
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,209322,209340,209358,209362,209369,209390,209406,209436,209451,209451
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,334391,334408,334408,334427,334427,334427,334427,334427,334443,334457
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,271441,271448,271463,271469,271469,271477,271477,271490,271494,271496


In [26]:
data['usa_cases'].head(3)

Unnamed: 0,UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,...,2/28/23,3/1/23,3/2/23,3/3/23,3/4/23,3/5/23,3/6/23,3/7/23,3/8/23,3/9/23
0,84001001,US,USA,840,1001.0,Autauga,Alabama,US,32.539527,-86.644082,...,19732,19759,19759,19759,19759,19759,19759,19759,19790,19790
1,84001003,US,USA,840,1003.0,Baldwin,Alabama,US,30.72775,-87.722071,...,69641,69767,69767,69767,69767,69767,69767,69767,69860,69860
2,84001005,US,USA,840,1005.0,Barbour,Alabama,US,31.868263,-85.387129,...,7451,7474,7474,7474,7474,7474,7474,7474,7485,7485


## Saving the data locally

Since the raw data must be downloaded from the internet, let's save a copy of our current data to a local folder so that we have access to it immediately at any time.

### Exercise 3

<span style="color:green; font-size:16px">Write a function that accepts a dictionary of DataFrames and a directory name, and writes them to that directory as CSVs using the key as the filename. Pass the `kwargs` to the `to_csv` method.</span>

In [32]:
import os

def write_data(data, directory, **kwargs):
    """
    Writes each raw data DataFrame to a file as a CSV

    Parameters
    ----------
    data : dict
        Dictionary of DataFrames

    directory : str
        Name of directory to save files, e.g., "data/raw"

    kwargs : dict
        Extra keyword arguments for the `to_csv` DataFrame method

    Returns
    -------
    None
    """
    if not os.path.exists(directory):
        os.makedirs(directory)

    for key, df in data.items():
        filename = f"{key}.csv"
        filepath = os.path.join(directory, filename)

        # Check if the file already exists
        if os.path.isfile(filepath):
            # Generate a unique filename by appending a numeric suffix to the previous file
            suffix = 1
            while True:
                suffix_filename = f"{key}_{suffix}.csv"
                suffix_filepath = os.path.join(directory, suffix_filename)
                if not os.path.isfile(suffix_filepath):
                    os.rename(filepath, suffix_filepath)
                    print(f"New file downloaded: {filename} (Old file renamed to: {suffix_filename})")
                    break
                suffix += 1
        else:
            print(f"New file downloaded: {filename}")

        df.to_csv(filepath, **kwargs)

    print(f"Data saved to {directory}")


Let's write those DataFrames as CSVs (without their index) to the "data/raw" directory.

In [33]:
# from solutions import write_data
write_data(data, "data/raw", index=False)

New file downloaded: world_deaths.csv (Old file renamed to: world_deaths_1.csv)
New file downloaded: world_cases.csv (Old file renamed to: world_cases_1.csv)
New file downloaded: usa_deaths.csv (Old file renamed to: usa_deaths_1.csv)
New file downloaded: usa_cases.csv (Old file renamed to: usa_cases_1.csv)
Data saved to data/raw
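The `index=False` argument matters when the files are read back. By default, `to_csv` writes the DataFrame's index as an extra, unnamed first column, and `read_csv` then loads it as a column called `Unnamed: 0`. The small sketch below demonstrates this with a made-up two-row DataFrame written to an in-memory string instead of a file.

```python
import io

import pandas as pd

# Tiny stand-in for one of the downloaded DataFrames
df = pd.DataFrame({"Country/Region": ["Afghanistan", "Albania"], "1/22/20": [0, 0]})

# Default behaviour: the index is written as an unnamed first column
csv_with_index = df.to_csv()
cols_with_index = list(pd.read_csv(io.StringIO(csv_with_index)).columns)

# index=False: only the real columns are written
csv_without_index = df.to_csv(index=False)
cols_without_index = list(pd.read_csv(io.StringIO(csv_without_index)).columns)

print(cols_with_index)
print(cols_without_index)
```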


### Exercise 4

<span style="color:green; font-size:16px">Write a function similar to `download_data`, but have it read in the local data that we just saved. </span>

In [None]:
def read_local_data(group, kind, directory):
    """
    Read in one CSV as a DataFrame from the given directory
    
    Parameters
    ----------
    group : "world" or "usa"
    
    kind : "deaths" or "cases"
    
    directory : string name of directory to read files from, e.g. "data/raw"
    
    Returns
    -------
    DataFrame    
    """
    pass

In [None]:
from solutions import read_local_data
read_local_data('world', 'deaths', 'data/raw').head(3)
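If you get stuck, here is one possible shape for this exercise, not necessarily the version in solutions.py. It assumes the files were saved as `<group>_<kind>.csv`, which is how `write_data` names them. The helper name `read_local_csv` and the temporary directory with its one-row sample file are assumptions made so that the sketch runs without the real data.

```python
import os
import tempfile

import pandas as pd


def read_local_csv(group, kind, directory):
    """Read <directory>/<group>_<kind>.csv into a DataFrame."""
    filepath = os.path.join(directory, f"{group}_{kind}.csv")
    return pd.read_csv(filepath)


# Demonstrate with a tiny stand-in file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    sample = pd.DataFrame({"Country/Region": ["Afghanistan"], "1/22/20": [0]})
    sample.to_csv(os.path.join(tmp, "world_deaths.csv"), index=False)
    df_local = read_local_csv("world", "deaths", tmp)

print(df_local)
```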

### Exercise 5

<span style="color:green; font-size:16px">Write a function similar to `read_all_data`, but have it read in all of the local data that we just saved. The function name is `run` since we will be slowly adding all of our data cleaning and transformation steps to it in the next chapter.</span>

In [None]:
def run():
    """
    Run all cleaning and transformation steps
    
    Returns
    -------
    Dictionary of DataFrames
    """
    pass

Here, we verify that `run` works properly.

In [None]:
from solutions import run
data = run()
data['usa_deaths'].tail(3)
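As a rough sketch of the idea behind this exercise, and not the version in solutions.py, the function can loop over the same group and kind combinations used for downloading and read each file from disk. The name `read_all_local`, the `directory` parameter, and the one-row placeholder files are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

GROUPS = ("world", "usa")
KINDS = ("deaths", "cases")


def read_all_local(directory):
    """Read every <group>_<kind>.csv in directory into a dict of DataFrames."""
    data = {}
    for group in GROUPS:
        for kind in KINDS:
            key = f"{group}_{kind}"
            data[key] = pd.read_csv(os.path.join(directory, f"{key}.csv"))
    return data


# Create four placeholder files, then read them all back
with tempfile.TemporaryDirectory() as tmp:
    for group in GROUPS:
        for kind in KINDS:
            placeholder = pd.DataFrame({"value": [1]})
            placeholder.to_csv(os.path.join(tmp, f"{group}_{kind}.csv"), index=False)
    local_data = read_all_local(tmp)

print(sorted(local_data))
```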

This concludes the section on downloading the data.