# **COVID-19 Modelling: Section 1**

In this first section we will learn how to:
1. Download multiple files from a web page
2. Extract data from a table inside a PDF file

## Modelling the COVID epidemic and learning how to access data
This notebook exercises a number of Python utilities we have already learned, plus some new ones. We will be working with real data, and the goal is to build a useful and timely tool.

## Downloading multiple Files

We will obtain our data from the World Health Organization Situation Reports: https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports/. These reports are updated daily, so we want to write code that is up to date whenever we run it. For this we need a procedure to download multiple PDF files from a web site.

To do this we are going to make use of several libraries:
* `from bs4 import BeautifulSoup`
  Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
  See more at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* `import re`
  Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.
  See more at https://docs.python.org/3/howto/regex.html
* `import os`
  The `os` module provides functions for interacting with the operating system. It is part of Python's standard library and offers a portable way of using operating-system-dependent functionality.
  See more at https://www.geeksforgeeks.org/os-module-python-examples/
* `import urllib`
  `urllib` is Python's URL handling package. It is used to fetch URLs (Uniform Resource Locators) through the `urlopen` function and supports a variety of protocols.
  See more at https://www.geeksforgeeks.org/python-urllib-module/
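
As a small illustration of how `re`, `os` and `urllib` fit together for this task, the sketch below turns relative links into absolute URLs and derives a local file name from each one. It uses only the standard library and runs offline. The two `.pdf` links are of the kind found on the WHO page; the third link and the helper name `local_name` are hypothetical and only there for illustration.

```python
import os
import re
from urllib.parse import urljoin, urlparse

base_url = "https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports/"

hrefs = [
    "/docs/default-source/coronaviruse/situation-reports/20200405-sitrep-76-covid-19.pdf?sfvrsn=6ecf0977_4",
    "/docs/default-source/coronaviruse/situation-reports/20200404-sitrep-75-covid-19.pdf?sfvrsn=99251b2b_4",
    "https://www.who.int/about/contact",  # hypothetical non-PDF link
]

# ".pdf" followed by a query string or by the end of the link
pdf_pattern = re.compile(r"\.pdf(\?|$)")

def local_name(href):
    """Hypothetical helper: file name of a link, without the query string."""
    absolute = urljoin(base_url, href)          # resolves relative links against the page URL
    return os.path.basename(urlparse(absolute).path)

pdf_names = [local_name(h) for h in hrefs if pdf_pattern.search(h)]
print(pdf_names)
```

Using `urlparse(...).path` drops the `?sfvrsn=...` part, so the file name does not need to be trimmed by hand afterwards.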


In [1]:
from urllib import request
from bs4 import BeautifulSoup
import re
import os
import urllib
import numpy as np


# Let's make the directory to store the data
# We use the os library for this
def make1dir(dirname):
    '''parameter: dirname  Name of directory to be created
    '''
    try:
        # Create target directory
        os.mkdir(dirname)
        print("Directory " , dirname ,  " Created ")
    except FileExistsError:
        print("Directory " , dirname ,  " already exists")

newdir = 'PHY546'
make1dir(newdir)
# we need a nested directory 
newdir = 'PHY546/COVID19'
make1dir(newdir)

# Now we use Beautiful Soup, re and urllib

url = "https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports/"

# open the url
response = request.urlopen(url).read()
# soup is now a BeautifulSoup object, which represents the document as a nested data structure
soup = BeautifulSoup(response, "html.parser")
# We are now going to find all the links to a PDF file within the page.
# We use .find_all with re.compile to select every link whose href
# contains the literal text ".pdf" (the dot is escaped so that it does not match any character)
links = soup.find_all('a', href=re.compile(r'\.pdf'))



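# clean the pdf link names
url_list = []
for el in links:
    print(el['href'])
    # relative links need the site prefix to become full URLs
    if el['href'].startswith('http'):
        full_url = el['href']
    else:
        full_url = "https://www.who.int" + el['href']
    # the page lists some reports twice; keep each URL only once
    if full_url not in url_list:
        url_list.append(full_url)

print(url_list)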

# download the pdfs to a specified location,
# in this case newdir
for url in url_list:
    mydir = newdir
    # strip the common URL prefix to obtain the bare file name
    fullfilename = os.path.join(mydir, url.replace("https://www.who.int/docs/default-source/coronaviruse/situation-reports/", ""))
    # drop the query string that follows the "pdf" extension
    fullfilename = fullfilename[:fullfilename.find("pdf")]+'pdf'
    request.urlretrieve(url, fullfilename)


Directory  PHY546  Created 
Directory  PHY546/COVID19  Created 
/docs/default-source/coronaviruse/situation-reports/20200405-sitrep-76-covid-19.pdf?sfvrsn=6ecf0977_4
/docs/default-source/coronaviruse/situation-reports/20200404-sitrep-75-covid-19.pdf?sfvrsn=99251b2b_4
/docs/default-source/coronaviruse/situation-reports/20200403-sitrep-74-covid-19-mp.pdf?sfvrsn=4e043d03_14
/docs/default-source/coronaviruse/situation-reports/20200402-sitrep-73-covid-19.pdf?sfvrsn=5ae25bc7_6
/docs/default-source/coronaviruse/situation-reports/20200401-sitrep-72-covid-19.pdf?sfvrsn=3dd8971b_2
/docs/default-source/coronaviruse/situation-reports/20200331-sitrep-71-covid-19.pdf?sfvrsn=4360e92b_8
/docs/default-source/coronaviruse/situation-reports/20200331-sitrep-71-covid-19.pdf?sfvrsn=4360e92b_8
/docs/default-source/coronaviruse/situation-reports/20200330-sitrep-70-covid-19.pdf?sfvrsn=7e0fe3f8_4
/docs/default-source/coronaviruse/situation-reports/20200329-sitrep-69-covid-19.pdf?sfvrsn=8d6620fa_8
/docs/default-

In [2]:
# We can see all the files downloaded (there might be others)
for root, dirs, files in os.walk("."):
    for filename in files:
        print(filename)



gce
config_sentinel
.last_survey_prompt.yaml
.last_opt_in_prompt.yaml
.last_update_check.json
active_config
.metricsUUID
16.24.26.990500.log
16.23.40.043713.log
16.23.57.403439.log
16.24.13.655529.log
16.24.26.483358.log
16.24.09.722063.log
config_default
20200128-sitrep-8-ncov-cleared.pdf
20200122-sitrep-2-2019-ncov.pdf
20200403-sitrep-74-covid-19-mp.pdf
20200121-sitrep-1-2019-ncov.pdf
20200212-sitrep-23-ncov.pdf
20200326-sitrep-66-covid-19.pdf
20200302-sitrep-42-covid-19.pdf
20200317-sitrep-57-covid-19.pdf
20200125-sitrep-5-2019-ncov.pdf
20200322-sitrep-62-covid-19.pdf
20200324-sitrep-64-covid-19.pdf
20200220-sitrep-31-covid-19.pdf
20200126-sitrep-6-2019--ncov.pdf
20200214-sitrep-25-covid-19.pdf
20200127-sitrep-7-2019--ncov.pdf
20200315-sitrep-55-covid-19.pdf
20200221-sitrep-32-covid-19.pdf
20200329-sitrep-69-covid-19.pdf
20200130-sitrep-10-ncov.pdf
20200318-sitrep-58-covid-19.pdf
20200124-sitrep-4-2019-ncov.pdf
20200405-sitrep-76-covid-19.pdf
20200321-sitrep-61-covid-19.pdf
20200328

## Extracting data from PDF files, in particular from tables

Now that we have downloaded the files, we need to read the tables in them. To do this we will use a library called tabula-py.
tabula-py is a simple Python wrapper of tabula-java, which can read tables from a PDF. It lets you read those tables and convert them into pandas DataFrames, and it can also convert a PDF file into a CSV/TSV/JSON file. Much of what we will use is documented at https://tabula-py.readthedocs.io/en/latest/

### Check Java environment and install tabula-py
tabula-py requires a Java environment, so let's check the Java environment on your machine.
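
The same check can be sketched from within Python, which is handy when the notebook is run outside an environment that supports the `!` shell syntax. This is only a minimal sketch using the standard library's `shutil.which`; it tells us whether a `java` executable is on the PATH, not whether its version is suitable.

```python
import shutil

# shutil.which returns the full path of an executable found on PATH, or None
java_path = shutil.which("java")

if java_path is None:
    print("No java executable found on PATH; tabula-py will probably not be able to run.")
else:
    print("java found at:", java_path)
```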


In [3]:
!java -version
# To be more precise, it's better to use `{sys.executable} -m pip install tabula-py`
!pip install -q tabula-py
import tabula
import pandas as pd

tabula.environment_info()

openjdk version "11.0.6" 2020-01-14
OpenJDK Runtime Environment (build 11.0.6+10-post-Ubuntu-1ubuntu118.04.1)
OpenJDK 64-Bit Server VM (build 11.0.6+10-post-Ubuntu-1ubuntu118.04.1, mixed mode, sharing)
Python version:
    3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0]
Java version:
    openjdk version "11.0.6" 2020-01-14
OpenJDK Runtime Environment (build 11.0.6+10-post-Ubuntu-1ubuntu118.04.1)
OpenJDK 64-Bit Server VM (build 11.0.6+10-post-Ubuntu-1ubuntu118.04.1, mixed mode, sharing)
tabula-py version: 2.1.0
platform: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
uname:
    uname_result(system='Linux', node='8503449ab51d', release='4.19.104+', version='#1 SMP Wed Feb 19 05:26:34 PST 2020', machine='x86_64', processor='x86_64')
linux_distribution: ('Ubuntu', '18.04', 'bionic')
mac_ver: ('', ('', '', ''), '')
    


In [4]:
# Let's look at one of the reports
pdf_path1 = "/content/PHY546/COVID19/20200302-sitrep-42-covid-19.pdf"
# lattice and stream are alternative extraction modes, so only lattice is requested
data = tabula.read_pdf(pdf_path1, pages="3-7", lattice=True, pandas_options={"header": [0, 1]})


Got stderr: Apr 06, 2020 5:12:39 PM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider loadDiskCache
Apr 06, 2020 5:12:39 PM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
Apr 06, 2020 5:12:39 PM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
Apr 06, 2020 5:12:40 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font BCDHEE+Calibri-Bold are not implemented in PDFBox and will be ignored
Apr 06, 2020 5:12:41 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font BCDGEE+Calibri are not implemented in PDFBox and will be ignored
Apr 06, 2020 5:12:41 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font ArialMT are not implemented in PDFBox and will be ignored



In [5]:

# Let's analyze the data. For example, let's look for the row
# that contains the data for a given country (Spain)
for item in data: # data is a list of pandas DataFrames (df)
  print(len(item)) # number of rows in each of these dfs
  for ind in item.index: # runs through the rows of the df
    for col in item.columns: # runs through the columns
      if item[col][ind]=='Spain': # if a cell of the df contains the name Spain
        newdf = (item.iloc[[ind]]) # we save the full row into a df (size 1 x number of columns)

print(newdf)  # prints the row we saved that has the data we are looking for

# Now we encapsulate the previous script into a function
# that searches for the country we want:

def find_country(country, data):
    """Utility to analyze data obtained with tabula-py.

    Given a list of dataframes, returns a dataframe containing
    just the row that matches the search string 'country'.

    Parameters:
    -----------
    country: str  name of the string to find in the list
    data: list of pd DataFrames, the return of reading tables from a pdf
          obtained with tabula-py
    Returns:
    --------
    country_df: one-row dataframe taken from the list,
                with the row that matches the search string 'country'
                (the last match, if there are several)
    Raises:
    -------
    ValueError if no cell in any dataframe equals 'country'
    """
    country_df = None
    for item in data:
        for ind in item.index:
            for col in item.columns:
                if item[col][ind] == country:
                    country_df = (item.iloc[[ind]])

    # Without this check a missing country would fail with an unhelpful UnboundLocalError
    if country_df is None:
        raise ValueError("'" + country + "' was not found in any table")
    return country_df





44
4
3
2
2
2
3
2
53
2
3
3
27
2
2
2
2
2
2
1
1
1
1
1
1
1
2
       0   1  2  3  4                   5  6   ...  8    9   10   11  12   13  14
19  Spain  45  0  0  0  Local transmission  1  ... NaN  NaN NaN  NaN NaN  NaN NaN

[1 rows x 15 columns]


## **Summary so far**

We have now learned two separate things:
* How to batch download all the files of a specific type from a given webpage
* How to 'roughly' browse through all the data obtained from a list of tables returned by tabula-py when reading a PDF file. This list of tables is a list of pandas DataFrames. We wrote a function to locate a target string within the list and return the dataframe row that contains it
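
The search in `find_country` walks through every cell with three nested loops. As a hedged alternative, the sketch below does the same lookup with a boolean mask and also reports which table each match came from. The function name `find_rows` and the filler table are hypothetical; the France and Spain values are the ones from the 2 March report.

```python
import pandas as pd

# Two toy tables standing in for the list that tabula-py returns
tables = [
    pd.DataFrame({0: ["header text", "more text"], 1: ["x", "y"]}),  # hypothetical filler table
    pd.DataFrame({0: ["France", "Spain"], 1: ["100", "45"], 2: ["0", "0"],
                  3: ["2", "0"], 4: ["0", "0"]}),
]

def find_rows(target, frames):
    """Return (table number, row label) for every row containing target."""
    matches = []
    for table_number, frame in enumerate(frames):
        # True for each row in which at least one cell equals the target
        mask = frame.eq(target).any(axis=1)
        for row_label in frame.index[mask]:
            matches.append((table_number, row_label))
    return matches

print(find_rows("Spain", tables))
```

Using `enumerate` gives the position of the table in the list directly, which avoids having to look a dataframe up in the list afterwards.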

### What we need to do next

* Check that all the files we download maintain the same structure
* Identify the values for the columns and remove useless entries
* Locate the date from each PDF file
* Create a new dataframe with the data for a specific country ordered by date
* plot the data for a single country
* plot the data for several countries
* Analyze data?
* Other suggestions?



# **COVID-19 Modelling: Section 2**

I'm keeping everything in the same notebook because otherwise we would need to load the libraries again. Hence this is going to be a very long notebook!

Today we will make more use of pandas DataFrames.


In [6]:
# We can test the function find_country 
name = 'France'
test_df = find_country(name, data) # this returns what we need to obtain the country information
print(test_df) 

        0    1  2  3  4                   5  6   ...  8    9   10   11  12   13  14
18  France  100  0  2  0  Local transmission  1  ... NaN  NaN NaN  NaN NaN  NaN NaN

[1 rows x 15 columns]


One of the problems with the WHO data is that the format of the files changed. Before March 2nd the reports were more China-oriented; from March 2nd onward the relevant columns follow a new structure. We will analyze the data from that date onward.
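
The cut-off can be applied directly to the file names, since each one starts with the report date in `yyyymmdd` form. The sketch below is a minimal illustration of that filter; the helper name `report_date` is hypothetical, and the file names are a few of those in the download directory.

```python
start_date = 20200302  # first report with the new table layout

filenames = [
    "20200128-sitrep-8-ncov-cleared.pdf",
    "20200212-sitrep-23-ncov.pdf",
    "20200302-sitrep-42-covid-19.pdf",
    "20200326-sitrep-66-covid-19.pdf",
]

def report_date(filename):
    """Return the leading yyyymmdd part of a report file name as an int."""
    return int(filename[0:8])

# Keep only the reports on or after the cut-off, in chronological order
recent = sorted(f for f in filenames if report_date(f) >= start_date)
print(recent)
```

Because the date comes first and is zero-padded, sorting the names alphabetically also sorts them chronologically.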

## Indexing and selecting data in a df

We want to extract the numbers in the columns labelled "1", "2", "3" and "4". These
data correspond to the following information:
* "1" →  **Total number of cases**
* "2" →  **Number of new cases**
* "3" →  **Total number of deaths**
* "4" →  **Number of new deaths**



Let's create a Dataframe ourselves and understand it:

In [7]:
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
2000-01-01,-0.16227,-1.710181,0.433421,-1.367815
2000-01-02,1.427242,-0.273578,-0.709708,0.679599
2000-01-03,-0.980891,0.728229,0.849695,-0.596621
2000-01-04,0.915468,-1.270308,-1.453919,0.284929
2000-01-05,1.610733,-0.809616,-1.776276,-1.456572
2000-01-06,1.066221,1.544161,-0.898459,-0.230327
2000-01-07,-0.812182,-0.325296,-1.944508,-0.340699
2000-01-08,0.661417,-0.282542,0.400514,-0.742542


Let's evaluate different ways of slicing the dataframe. We can use the ```[]``` operator or one of the .loc and .iloc accessors.
pandas supports several types of multi-axis indexing:

* .loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found. Allowed inputs are:

  * A single label, e.g. 5 or 'a' (Note that 5 is interpreted as a label of the index. This use is not an integer position along the index.).
  * A list or array of labels ```['a', 'b', 'c']```.
  * A slice object with labels 'a':'f' (Note that, contrary to usual Python slices, both the start and the stop are included when present in the index! See "Slicing with labels" and "Endpoints are inclusive" in the pandas documentation.)
  * A boolean array (any NA values will be treated as False).
  * A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).


* .iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except for slice indexers, which allow out-of-bounds indexing (this conforms with Python/NumPy slice semantics). Allowed inputs are:

  * An integer e.g. 5.
  * A list or array of integers ```[4, 3, 0]```.
  * A slice object with ints 1:7.
  * A boolean array (any NA values will be treated as False).
  * A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).


.loc, .iloc, and also ```[]``` indexing can accept a callable as indexer.
Getting values from an object with multi-axes selection uses the following notation (using .loc as an example, but the following applies to .iloc as well). Any of the axes accessors may be the null slice ```:```. Axes left out of the specification are assumed to be ```:```, e.g. for a DataFrame ```p.loc['a']``` is equivalent to ```p.loc['a', :]```.
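
The rules above can be tried out on a tiny example. This is only a sketch: the frame `p` and its values are made up, and string labels are used so that labels and integer positions cannot be confused.

```python
import pandas as pd

p = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]},
                 index=["a", "b", "c", "d"])

by_label = p.loc["a":"c"]      # label slice: both endpoints are included
by_position = p.iloc[0:2]      # position slice: the stop is excluded
by_callable = p.loc[lambda frame: frame["A"] > 2]  # callable returning a boolean mask

print(len(by_label), len(by_position), list(by_callable.index))  # 3 2 ['c', 'd']

# Leaving out the column axis is the same as passing the null slice
print(p.loc["a"].equals(p.loc["a", :]))  # True
```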




In [8]:
print(df['A'])
# This will print the row in position 1 (second) displayed in column format.
print(df.iloc[1])

2000-01-01   -0.162270
2000-01-02    1.427242
2000-01-03   -0.980891
2000-01-04    0.915468
2000-01-05    1.610733
2000-01-06    1.066221
2000-01-07   -0.812182
2000-01-08    0.661417
Freq: D, Name: A, dtype: float64
A    1.427242
B   -0.273578
C   -0.709708
D    0.679599
Name: 2000-01-02 00:00:00, dtype: float64


Let's now try to work with our own data extracted from the PDF.
We have saved the data for France into test_df. It is a DataFrame made of a single row.

In [9]:
print(test_df)
# If we do the same with our test_df (the one that stores the France data)
print(test_df.iloc[0])
# We can convert the one-row DataFrame into a Series using the squeeze function
newdf = test_df.squeeze()
print(newdf)
# Notice that this is now like a single array, hence we can slice it as we do
# with regular np arrays
print(newdf[0:5])


        0    1  2  3  4                   5  6   ...  8    9   10   11  12   13  14
18  France  100  0  2  0  Local transmission  1  ... NaN  NaN NaN  NaN NaN  NaN NaN

[1 rows x 15 columns]
0                 France
1                    100
2                      0
3                      2
4                      0
5     Local transmission
6                      1
7                    NaN
8                    NaN
9                    NaN
10                   NaN
11                   NaN
12                   NaN
13                   NaN
14                   NaN
Name: 18, dtype: object
0                 France
1                    100
2                      0
3                      2
4                      0
5     Local transmission
6                      1
7                    NaN
8                    NaN
9                    NaN
10                   NaN
11                   NaN
12                   NaN
13                   NaN
14                   NaN
Name: 18, dtype: object
0    France

In [10]:
# We could have done the same using the iloc function
print(test_df.iloc[0][1:5])

1    100
2      0
3      2
4      0
Name: 18, dtype: object


## Organizing data from multiple dataframes into a single one

Now that we have identified how to extract the relevant data, we need to see how we can generate a data structure (a pandas DataFrame) for a chosen country, using all the available files. The structure should satisfy the following:
* Each row of the df corresponds to the date of the report from which the data was extracted, with rows ordered chronologically
* There should be 4 columns: TC (total cases), NC (new daily cases), TD (total deaths), ND (new daily deaths)
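
To make the target concrete, here is a minimal sketch of such a structure built by hand. The three records are values for Spain taken from the WHO reports used in this notebook; the variable names are assumptions, and parsing the date with `pd.to_datetime` is one possible choice rather than a requirement.

```python
import pandas as pd

# (date, TC, NC, TD, ND) records, deliberately out of order
records = [
    (20200326, 47610, 7937, 3434, 738),
    (20200302, 45, 0, 0, 0),
    (20200317, 9191, 1438, 309, 21),
]

country_df = pd.DataFrame(records, columns=["date", "TC", "NC", "TD", "ND"])

# Parse the yyyymmdd integers into real dates, then use them as a sorted index
country_df["date"] = pd.to_datetime(country_df["date"].astype(str), format="%Y%m%d")
country_df = country_df.set_index("date").sort_index()

print(country_df)
```

With a date index the rows sort chronologically, and the frame is ready for plotting against time later on.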

The date is easy to obtain, because it is in the name of the situation report:

In [11]:
# Let's print our report name
print(pdf_path1)

/content/PHY546/COVID19/20200302-sitrep-42-covid-19.pdf


In [12]:
# We can do it by just slicing 
date=pdf_path1[24:32]
print(date)


20200302


Or we can use the partition function.
The partition() method splits the string at the first occurrence of the argument string and returns a tuple containing the part before the separator, the separator itself, and the part after the separator.
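
A short sketch of the full three-part result may help, including what happens when the separator is missing. The last step, converting the text into a `date` object with `datetime.strptime`, is an optional extra that is not required by the rest of the notebook.

```python
from datetime import datetime

path = "/content/PHY546/COVID19/20200302-sitrep-42-covid-19.pdf"

before, separator, after = path.partition("COVID19/")
print(before)     # /content/PHY546/
print(separator)  # COVID19/
print(after)      # 20200302-sitrep-42-covid-19.pdf

# If the separator is absent, partition returns the whole string and two empty strings
missing = "no-separator-here".partition("COVID19/")
print(missing)    # ('no-separator-here', '', '')

# The first eight characters after the separator can be parsed as a date
report_day = datetime.strptime(after[0:8], "%Y%m%d").date()
print(report_day)  # 2020-03-02
```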

In [13]:
string = 'COVID19/'
thedate = pdf_path1.partition(string)[2][0:8]
print(thedate)

20200302


Now we can write a function that obtains the date from the path to a given file

In [0]:
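def date(in_path, separator='COVID19/'):
    """ Given the complete path to the file where the WHO
        situation report was stored, return the date
        of the corresponding report.

        parameters:
        -----------
          in_path: string; full path of the report
          separator: string; the directory name that immediately precedes
                     the file name in the path (default 'COVID19/')

        returns:
        --------
          report_date: string; date in the form of yyyymmdd
    """
    # everything after the separator is the file name, which starts with the date
    report_date = in_path.partition(separator)[2][0:8]
    return report_date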


Let's put it all together.
First let's choose a country:

In [22]:
country_name = 'Spain'
start_date = 20200302 # This is the first date for situation reports with regular world data
rows_list = [] # one single-row df per report
for root, dirs, files in os.walk("."):
    for filename in files:
        if 'sitrep' in filename: # only the situation reports carry the information we need
            # named file_date so that it does not overwrite the date() function defined above
            file_date = filename[0:8]
            # We only read reports dated March 2nd or later
            if int(file_date) >= start_date:
                # We need the full filename with the path
                full_filename = os.path.join(root, filename)
                data = tabula.read_pdf(full_filename, pages="all", lattice=True, pandas_options={"header": [0, 1]}, silent=True)
                test_df = find_country(country_name, data)
                # columns 1 to 4 hold TC, NC, TD and ND
                new_row = pd.DataFrame(test_df.iloc[0][1:5])
                new_row = new_row.T
                new_row['date'] = int(file_date)
                print(new_row)
                rows_list.append(new_row)

    







          




         1     2      3    4      date
29  110238  8102  10003  950  20200403
        1     2     3    4      date
27  47610  7937  3434  738  20200326
     1  2  3  4      date
19  45  0  0  0  20200302
       1     2    3   4      date
23  9191  1438  309  21  20200317
        1     2     3    4      date
26  24926  4946  1326  324  20200322
        1     2     3    4      date
26  33089  4517  2182  462  20200324
       1     2    3   4      date
21  5753  1522  136  16  20200315
        1     2     3    4      date
27  72248  8189  5690  832  20200329
        1     2    3    4      date
23  11178  1987  491  182  20200318
         1     2      3    4      date
28  124736  7026  11744  809  20200405
        1     2     3    4      date
26  19980  2833  1002  235  20200321
        1     2     3    4      date
27  64059  7871  4858  769  20200328
        1     2     3    4      date
29  85195  6398  7340  812  20200331
       1    2   3   4      date
21  1024  435  28  18  20200310
  

In [23]:
rows_list

[         1     2      3    4      date
 29  110238  8102  10003  950  20200403,         1     2     3    4      date
 27  47610  7937  3434  738  20200326,      1  2  3  4      date
 19  45  0  0  0  20200302,        1     2    3   4      date
 23  9191  1438  309  21  20200317,         1     2     3    4      date
 26  24926  4946  1326  324  20200322,         1     2     3    4      date
 26  33089  4517  2182  462  20200324,        1     2    3   4      date
 21  5753  1522  136  16  20200315,         1     2     3    4      date
 27  72248  8189  5690  832  20200329,         1     2    3    4      date
 23  11178  1987  491  182  20200318,          1     2      3    4      date
 28  124736  7026  11744  809  20200405,         1     2     3    4      date
 26  19980  2833  1002  235  20200321,         1     2     3    4      date
 27  64059  7871  4858  769  20200328,         1     2     3    4      date
 29  85195  6398  7340  812  20200331,        1    2   3   4      date
 21  10

In [37]:
# print the date stored in each single-row df
for item in rows_list:
    print(item['date'])

29    20200403
Name: date, dtype: int64
27    20200326
Name: date, dtype: int64
19    20200302
Name: date, dtype: int64
23    20200317
Name: date, dtype: int64
26    20200322
Name: date, dtype: int64
26    20200324
Name: date, dtype: int64
21    20200315
Name: date, dtype: int64
27    20200329
Name: date, dtype: int64
23    20200318
Name: date, dtype: int64
28    20200405
Name: date, dtype: int64
26    20200321
Name: date, dtype: int64
27    20200328
Name: date, dtype: int64
29    20200331
Name: date, dtype: int64
21    20200310
Name: date, dtype: int64
22    20200312
Name: date, dtype: int64
22    20200316
Name: date, dtype: int64
29    20200401
Name: date, dtype: int64
19    20200308
Name: date, dtype: int64
19    20200304
Name: date, dtype: int64
19    20200307
Name: date, dtype: int64
19    20200309
Name: date, dtype: int64
19    20200303
Name: date, dtype: int64
21    20200314
Name: date, dtype: int64
19    20200306
Name: date, dtype: int64
25    20200320
Name: date, dtype: int64
