<a href="https://colab.research.google.com/github/GuillermoFidalgo/Python-for-STEM-Teachers-Workshop/blob/master/notebooks/7-Introduction_to_Pandas_with_COVID_19_data_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction




Since the beginning of the pandemic caused by the new coronavirus, which causes the disease COVID-19, the general public and especially the scientific community have followed its rate of spread throughout the world. The most closely studied data are the numbers of infections and deaths as a function of time, which provide information about how the pandemic is evolving and how well it is being controlled.
In this notebook we will use the data stored at covidtracking.com, focusing on the monitoring parameters of the pandemic in Puerto Rico. We will manipulate and analyze the data through tables and plots using Python and its powerful tools.

Later we will try to answer questions such as: how many cases and deaths were there up to a certain date, how are they increasing over time, and are we stopping or continuing the spread of the virus?

# Importing Libraries 



This cell only needs to be run once!

Pandas is a Python library used for data manipulation and analysis of numerical tables and time series. Since we will be working with tabulated data we can take advantage of Pandas' extensive modularity and variety of methods to work with tables.



In [None]:
import pandas as pd  # This imports the pandas library and gives it a short name (pd) to access pandas and its methods

## Loading in Data

There are 3 main ways to upload and access data in Google Colab:


1.   From Google Drive (mounting the drive)
2.   Directly from your computer
3.   Store the data somewhere on the internet and specify its URL (**our method**)



We are loading a CSV file with information about the pandemic caused by the new coronavirus.


Note that we specify 

### Method 1: Google Drive

In [None]:
# Data in a Google Drive
from google.colab import drive
drive.mount('/content/drive')
data_in_drive="/content/drive/MyDrive/UPRM/STEM Workshop/Feb 2021/Data/puerto-rico-history.csv"


df=pd.read_csv(data_in_drive,
               index_col=0,
               na_values="-") 
df
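The `read_csv` call above passes two optional arguments, `index_col=0` and `na_values="-"`. As a minimal sketch of what they do, the cell below reads a tiny CSV from a string instead of a file. The rows and numbers are made up for illustration, and `toy` and `csv_text` are hypothetical names.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a real file; "-" marks a missing value
csv_text = """date,positive,death
2020-05-24,100,5
2020-05-25,-,6
"""

# index_col=0 uses the first column (date) as the row labels.
# na_values="-" tells pandas to read "-" as a missing value (NaN).
toy = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values="-")

print(toy)
print(toy["positive"].isna().sum())  # number of missing entries in `positive`
```

Without `na_values="-"`, pandas would most likely treat the `positive` column as text, because `"-"` is not a number.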

### Method 2: Upload from computer

In [None]:
# Or just upload the data from your computer to the session
data = "../data/national-history.csv"

# Optionally pass index_col=0 to use the first column as the index column
df = pd.read_csv(data)
df

### Method 3: Load from Internet

Define where the data is located as a string and store it in a variable.
For example: `data = 'data_location'` (using single or double quotes).


**Important** 

Always look at your data beforehand. Be familiar with the structure and what it contains.

Also, if you want to do this with your students, the easiest way is to store the data somewhere on the internet and share a link that your students can access. In our case the data is stored in a file in a GitHub repository, and we access it from there as follows.

In [None]:
# Our method (From Internet)
data_url="https://raw.githubusercontent.com/GuillermoFidalgo/Python-for-STEM-Teachers-Workshop/master/data/puerto-rico-history.csv"


# Here we load the CSV file and parse the `date` column as dates
# (parse_dates=True only parses the index, and we have not set one yet)
df = pd.read_csv(data_url, parse_dates=["date"])

# For more information about each argument, hover your cursor over the function until a window appears.
# You can also put your cursor INSIDE the parentheses and press Ctrl+Shift+Space.


In [None]:
df.shape #number of rows and columns

## Let's look at the first 5 entries of the data

In [None]:
df.head(5)

Now that we have the data in the appropriate format, we can choose the columns we want to look at. We can also remove the columns that contain no data at all by using the `dropna()` method with `axis=1` and `how='all'`.

In [None]:
df=df.dropna(axis=1,how='all')
df
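To see what `dropna(axis=1, how='all')` keeps and removes, here is a small sketch on a made-up table (the values and the name `toy` are only for illustration). Only the column in which every entry is missing is dropped; a column with some missing entries stays.

```python
import numpy as np
import pandas as pd

# Made-up table: `positive` has one gap, `empty` has no data at all
toy = pd.DataFrame({
    "positive": [10, 12, np.nan],
    "death": [1, 1, 2],
    "empty": [np.nan, np.nan, np.nan],
})

# axis=1 looks at columns; how="all" drops a column only if ALL entries are missing
cleaned = toy.dropna(axis=1, how="all")

print(cleaned.columns.tolist())  # ['positive', 'death']
```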


As one can see, the COVID-19 data progresses with time, so we would like to index the data by the `date` column. We do this with the `set_index()` method as follows.

In [None]:
df=df.set_index("date")
df
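As a small sketch of the effect of `set_index()`, the made-up table below starts with `date` as an ordinary column. After the call, `date` labels the rows, so a row can be looked up by its date. The name `toy` and the numbers are only for illustration.

```python
import pandas as pd

# Made-up table with `date` as a regular column
toy = pd.DataFrame({
    "date": ["2020-05-24", "2020-05-25"],
    "positive": [100, 110],
})

indexed = toy.set_index("date")

print(indexed.columns.tolist())               # ['positive'], `date` is now the index
print(indexed.loc["2020-05-25", "positive"])  # 110
```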

We can filter out some columns. Let's see which ones we are interested in.

In [None]:
# This gives us the names of the columns available
df.columns

This is the way to select specific columns of interest

In [None]:
relevant=df[
    ['death','deathIncrease','positive', 'positiveIncrease','totalTestResults','totalTestResultsIncrease']
       ]
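Note the double brackets in the selection above: the outer pair indexes the dataframe and the inner pair is a Python list of column names. A minimal sketch on a made-up table (hypothetical names and values) shows the difference between selecting with a list and selecting a single name.

```python
import pandas as pd

# Made-up table with three columns
toy = pd.DataFrame({
    "positive": [100, 110],
    "death": [5, 6],
    "hospitalized": [20, 22],
})

subset = toy[["positive", "death"]]  # list of names -> a smaller DataFrame
single = toy["positive"]             # one name -> a Series

print(type(subset).__name__)  # DataFrame
print(type(single).__name__)  # Series
```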

We want to know some info about the data

In [None]:
relevant.info()

In [None]:
relevant.describe()

Pandas can automatically convert the data in each column to the most suitable type by looking at the entries in that column.


In [None]:
relevant=relevant.convert_dtypes() #This converts the data

# Now we can verify and see the information that our dataframe contains
# (info() prints its report directly, so it does not need print())
relevant.info()
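As a small sketch of what `convert_dtypes()` does, consider a made-up column of whole numbers with one gap. Pandas first stores it as floats because of the missing value. After conversion it becomes the nullable integer type `Int64`, and the gap is stored as `pd.NA`. The name `toy` and the values are only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up column: whole numbers with one missing entry
toy = pd.DataFrame({"positive": [10.0, np.nan, 12.0]})
print(toy["positive"].dtype)        # float64

converted = toy.convert_dtypes()
print(converted["positive"].dtype)  # Int64
print(converted["positive"].isna().sum())  # the gap is still there, now as pd.NA
```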


# Exercises


We want to know a few things:
- How many deaths and positives do we have at one specific date?
- How are the cases increasing with time?
- Are we flattening the curve?

## How many deaths and positive cases do we have at one specific date?


Remember what we have so far:
`death` is the total number of probable and confirmed COVID-19 deaths up to a certain date (this means that it is a cumulative value). The same principle applies to `positive`.

The step below is completely optional.
We sort by the index (*remember that the index is the date*).

In [None]:
relevant.sort_index(inplace=True)
relevant

We want to see the number of people that *have ever been diagnosed positive* up to a given day. For this we can just specify the date and look at the entry.

In [None]:
positives=relevant['positive']
date='2020-05-24' 
print("Amount of positives at",date,"is:", positives.loc[date]) 

In [None]:
relevant['death'].loc[date]
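Because these columns are cumulative, looking up two dates and subtracting gives the number of new cases between them. Here is a minimal sketch with made-up numbers (the name `toy` is hypothetical, and the index here holds plain date strings).

```python
import pandas as pd

# Made-up cumulative counts indexed by date
toy = pd.Series(
    [100, 110, 125],
    index=["2020-05-23", "2020-05-24", "2020-05-25"],
    name="positive",
)

print(toy.loc["2020-05-24"])                          # 110, total up to that day
print(toy.loc["2020-05-24"] - toy.loc["2020-05-23"])  # 10, new cases on that day
```

In our dataframe this difference is what the `positiveIncrease` and `deathIncrease` columns are meant to hold.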

This is great!
But we can use a plot to show this information.

### Making a plot 
We will use Matplotlib's `pyplot` module for plots

In [None]:
import matplotlib.pyplot as plt

Let's plot `positives` vs `date`, first with Matplotlib directly and then with pandas.

First we replace any missing values in `positives` with 0 so that every entry can be plotted, as follows

In [None]:
positives=positives.fillna(0)

In [None]:
x=positives.index
y=positives.values
xticks=x[::31]

plt.figure(figsize=(15,8))
plt.plot(x,y,'.')
plt.grid()
plt.xticks(xticks,rotation=59)
plt.show()

Here is a shorthand version of the same.

In [None]:
positives.plot(figsize=(15,8),
               grid=True,
               fontsize=15,
               rot=55,
               style='.'
               )
plt.ylabel('Positives',fontsize=15)
plt.title("Positive Covid cases in PR",fontsize=20)
plt.xlabel('Date',fontsize=15)
plt.show()

## How are the deaths increasing with time?

For this we can plot the `death` and `deathIncrease` columns.

In [None]:
relevant.fillna(0,inplace=True)
ax=relevant[['death','deathIncrease']].plot(subplots=True,
               style='--',
               figsize=(15,10),
               rot=55,
               grid=True,
               sharex=True,
               fontsize=13,)
ax[0].set_title('Cumulative deaths',fontsize=19)
ax[1].set_title("Deaths per day",size=19)

plt.legend()
plt.xlabel('Date',fontsize=15,loc='right')
plt.show()

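If you get an error such as `TypeError: float() argument must be a string or a number, not 'NAType'`,
then you need to fill the null values in the dataframe.
This is what `relevant.fillna(0,inplace=True)` at the top of the cell above does, so go back and make sure that line has run.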
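As a minimal sketch of why the fill helps, the made-up series below has the nullable `Int64` type with one `pd.NA` entry. After `fillna(0)` there are no missing entries left, so the values can be turned into plain floats for plotting. The names `toy`, `filled`, and `as_floats` are only for illustration.

```python
import pandas as pd

# Made-up nullable-integer series with one missing entry
toy = pd.Series([1, pd.NA, 3], dtype="Int64")
print(toy.isna().sum())   # 1

filled = toy.fillna(0)
print(filled.tolist())    # [1, 0, 3]

# With no pd.NA left, converting to ordinary floats works
as_floats = filled.to_numpy(dtype=float)
print(as_floats)
```

Keep in mind that filling with 0 is a convenience for plotting. In a cumulative column a filled 0 looks like a day with no recorded cases at all.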

## Are we flattening the curve?

To show this we follow the tips shown in the video below


In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('54XLXg4fYsc' ,width=1000,height=600)

So we have to plot `positiveIncrease` vs `positive` and both on a log scale. 

In [None]:
relevant[['positive','positiveIncrease']].plot(x='positive',y='positiveIncrease',
                                               figsize=(15,8),fontsize=13,
                                               loglog=True)
# Reference line y = x. We start at 1 because 0 cannot be drawn on a log scale.
# Exponential growth shows up as a straight line on a log-log plot like ours.
plt.plot([1,max(positives)],[1,max(positives)],label='exponential growth')

plt.legend()
plt.show()
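Why does a straight line stand for exponential growth on this kind of plot? When the total grows exponentially, the new cases per day are a fixed fraction of the total, so the logarithm of new cases equals the logarithm of the total plus a constant. The sketch below checks this on synthetic data; the starting value 10 and the growth rate 0.1 are arbitrary assumptions, not values from our dataset.

```python
import numpy as np

# Synthetic exponential growth (arbitrary start and rate, not real data)
days = np.arange(1, 61)
total = 10 * np.exp(0.1 * days)

# New cases per day are the differences between consecutive totals
new_per_day = np.diff(total)

# Fit a straight line to log(new cases) vs log(total cases)
slope, intercept = np.polyfit(np.log(total[1:]), np.log(new_per_day), 1)
print(slope)  # very close to 1
```

A slope of 1 means new cases stay proportional to total cases. When real data bends below that straight line, new cases are growing more slowly than exponential growth would predict.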

This is too noisy. Let's make the same plot but with the 7-day average.
First we add the rolling averages as new columns in the dataframe

In [None]:
relevant['7-day average']=relevant['positiveIncrease'].rolling(window=7).mean()
relevant['14-day average']=relevant['positiveIncrease'].rolling(window=14).mean()
relevant['30-day average']=relevant['positiveIncrease'].rolling(window=30).mean()
relevant.fillna(0,inplace=True)
relevant

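# The first rows of each rolling average do not have a full window and come out as missing;
# fillna(0, inplace=True) above replaces them with 0 so the later plots do not fail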
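To see what `rolling(window=7).mean()` computes, here is a small sketch on eight made-up daily values. The first six results are missing because there are not yet seven days to average. From the seventh entry on, each result is the mean of that day and the six days before it.

```python
import pandas as pd

# Made-up daily counts for eight days
daily = pd.Series([0, 7, 14, 21, 28, 35, 42, 49])

avg7 = daily.rolling(window=7).mean()

print(avg7.isna().sum())  # 6 entries without a full 7-day window
print(avg7.iloc[6])       # mean of 0, 7, ..., 42 -> 21.0
print(avg7.iloc[7])       # mean of 7, 14, ..., 49 -> 28.0
```

If you would rather average over whatever days are available instead of filling with 0, `rolling` also accepts a `min_periods` argument, for example `rolling(window=7, min_periods=1)`.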

Now we plot 

In [None]:
relevant.plot(kind='line',x='positive',y='7-day average',
              figsize=(10,8),fontsize=13,
              grid=True,
              title="Are we below exponential growth?",
              loglog=True
              )
# Reference line y = x, starting at 1 because 0 cannot be drawn on a log scale.
# Exponential growth shows up as a straight line on a log-log plot like ours.
plt.loglog([1,max(positives)],[1,max(positives)],label='exponential growth')
plt.xlabel('Total # of positives')
plt.ylabel('Change of positives per day')
plt.legend()
plt.show()

# Resources

To see information from the source of the data, visit the [Covid Tracking Project](https://covidtracking.com/).

More info and documentation: 


*   [Pandas](https://pandas.pydata.org/)
*   [Matplotlib](https://matplotlib.org/)
*   [NumPy](https://numpy.org/)



# Credits

This material has been made available by [Guillermo Fidalgo](https://github.com/GuillermoFidalgo) for educational purposes.
Please feel free to copy and teach with this material; I only ask for appropriate credit.