# Modelling the COVID-19 epidemic and learning how to access data
This notebook will be used to practise a number of Python utilities we have already learned, along with some new ones. We will be working with real data, and the goal is to build a useful and timely tool.

# Downloading multiple files

We will obtain our data from the World Health Organization Situation Reports: https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports/. These reports are updated daily, so we want code that picks up the latest reports whenever we run it. For this we need a procedure to download multiple PDF files from a web site.

To do this we are going to make use of several libraries:
* from bs4 import BeautifulSoup
  Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
  See more at https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* import re
  Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.
  See more at https://docs.python.org/3/howto/regex.html
* import os
  The os module provides functions for interacting with the operating system. It is one of Python’s standard utility modules and offers a portable way of using operating-system-dependent functionality.
  See more at https://www.geeksforgeeks.org/os-module-python-examples/
* import urllib
  The urllib module is Python's URL handling module. It is used to fetch URLs (Uniform Resource Locators) through the urlopen function, and it can fetch URLs using a variety of different protocols.
  See more at https://www.geeksforgeeks.org/python-urllib-module/
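
One detail of the regular-expression language matters when matching file extensions: an unescaped `.` matches any character, while `\.` matches only a literal dot. The short sketch below compares the two patterns on two sample links. The first path is taken from the report URL used later in this notebook; the second is a hypothetical link invented for the comparison.

```python
import re

hrefs = [
    # Path of the situation report read later in this notebook
    "/docs/default-source/coronaviruse/situation-reports/20200324-sitrep-64-covid-19.pdf",
    # Hypothetical link that is not a PDF but contains the letters "pdf"
    "/tools/epdf-viewer",
]

loose = re.compile(r"(.pdf)")   # '.' matches ANY character, so "epdf" also matches
strict = re.compile(r"\.pdf")   # '\.' matches only a literal dot

loose_hits = [h for h in hrefs if loose.search(h)]
strict_hits = [h for h in hrefs if strict.search(h)]

print(len(loose_hits), len(strict_hits))  # prints: 2 1
```

The loose pattern accepts both links, whereas the escaped pattern keeps only the real PDF.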


In [0]:
from urllib import request
from bs4 import BeautifulSoup
import re
import os
import urllib

# Let's make the directory to store the data.
# We use the os library for this.
def make1dir(dirname):
    '''Create a directory and report whether it already existed.

    parameter: dirname  Name of directory to be created
    '''
    try:
        # Create target directory
        os.mkdir(dirname)
        print("Directory", dirname, "created")
    except FileExistsError:
        print("Directory", dirname, "already exists")

newdir = 'PHY546'
make1dir(newdir)
# we need a nested directory 
newdir = 'PHY546/COVID19'
make1dir(newdir)

# Now we use Beautiful Soup, re and urllib

url = "https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports/"

# open the url and read the page contents
response = request.urlopen(url).read()
# soup is now a BeautifulSoup object, which represents the document as a nested data structure
soup = BeautifulSoup(response, "html.parser")
# We are now going to find all the links to a PDF file within the page.
# We use .find_all with re.compile to select every link whose href
# matches the regular expression \.pdf (the dot is escaped so that it
# matches a literal '.' rather than any character)
links = soup.find_all('a', href=re.compile(r'\.pdf'))

# clean the pdf link names
url_list = []
for el in links:
    print(el['href'])
    if(el['href'].startswith('http')):
        url_list.append(el['href'])
    else:
        url_list.append("https://www.who.int" + el['href'])

print(url_list)

# download the pdfs to a specified location
# In this case newdir
for pdf_url in url_list:
    # Strip the common URL prefix to get the file name
    fullfilename = os.path.join(newdir, pdf_url.replace("https://www.who.int/docs/default-source/coronaviruse/situation-reports/", ""))
    # Drop anything that follows the .pdf extension
    fullfilename = fullfilename[:fullfilename.find(".pdf")] + '.pdf'
    request.urlretrieve(pdf_url, fullfilename)
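
The link clean-up and file-name steps in the cell above can also be expressed with the standard library's `urllib.parse`. This is a sketch rather than a drop-in replacement: the helper names `absolute_url` and `local_name` are hypothetical, and the `?sfvrsn=abc123` query string is an invented example of the kind of suffix that may follow the extension.

```python
import os
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports/"

def absolute_url(href, base=BASE_URL):
    """Resolve a (possibly relative) href against the page it was found on."""
    return urljoin(base, href)

def local_name(url):
    """Return the last path component of a URL, ignoring any query string."""
    return os.path.basename(urlparse(url).path)

# Relative link with an invented query string (assumption, for illustration only)
href = "/docs/default-source/coronaviruse/situation-reports/20200324-sitrep-64-covid-19.pdf?sfvrsn=abc123"
full = absolute_url(href)
print(full)
print(local_name(full))
```

`urljoin` leaves links that are already absolute untouched, so the `startswith('http')` test is not needed in this variant.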


In [0]:
# We can see all the files downloaded (there might be others)
for root, dirs, files in os.walk("."):
    for filename in files:
        print(filename)
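
The walk above lists every file under the current directory. To restrict the listing to PDF files in a single directory, one option is `pathlib`. The helper name `list_pdfs` is hypothetical, and the demonstration runs in a temporary directory with empty placeholder files so that it does not depend on the download having succeeded.

```python
import tempfile
from pathlib import Path

def list_pdfs(directory):
    """Return the sorted names of the .pdf files directly inside `directory`."""
    return sorted(p.name for p in Path(directory).glob("*.pdf"))

# Demonstration on a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.pdf", "notes.txt"):
        (Path(tmp) / name).touch()
    print(list_pdfs(tmp))  # prints: ['a.pdf', 'b.pdf']
```

In this notebook the call would be `list_pdfs('PHY546/COVID19')`.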



# Extracting data from PDF files, in particular from tables

Now that we have downloaded the files, we need to read the tables. To do this we will use a library called tabula-py.
tabula-py is a simple Python wrapper of tabula-java, which can read tables from PDF files. You can read tables from a PDF and convert them into pandas DataFrames. tabula-py also lets you convert a PDF file into a CSV/TSV/JSON file. Much of what we will use is documented at https://tabula-py.readthedocs.io/en/latest/

## Check Java environment and install tabula-py
tabula-py requires a Java environment, so let's check the Java environment on your machine.


In [18]:
!java -version
# To be more precise, it's better to use `{sys.executable} -m pip install tabula-py`
!pip install -q tabula-py
import tabula
import pandas as pd

tabula.environment_info()

openjdk version "11.0.6" 2020-01-14
OpenJDK Runtime Environment (build 11.0.6+10-post-Ubuntu-1ubuntu118.04.1)
OpenJDK 64-Bit Server VM (build 11.0.6+10-post-Ubuntu-1ubuntu118.04.1, mixed mode, sharing)
Python version:
    3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0]
Java version:
    openjdk version "11.0.6" 2020-01-14
OpenJDK Runtime Environment (build 11.0.6+10-post-Ubuntu-1ubuntu118.04.1)
OpenJDK 64-Bit Server VM (build 11.0.6+10-post-Ubuntu-1ubuntu118.04.1, mixed mode, sharing)
tabula-py version: 2.1.0
platform: Linux-4.14.137+-x86_64-with-Ubuntu-18.04-bionic
uname:
    uname_result(system='Linux', node='b972a9ad1994', release='4.14.137+', version='#1 SMP Thu Aug 8 02:47:02 PDT 2019', machine='x86_64', processor='x86_64')
linux_distribution: ('Ubuntu', '18.04', 'bionic')
mac_ver: ('', ('', '', ''), '')
    


In [19]:
# Let's look at one of the reports
pdf_path1 = "https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200324-sitrep-64-covid-19.pdf"
data = tabula.read_pdf(pdf_path1, pages="3-7",lattice=True, pandas_options={"header": [0, 1]}, stream=True)


Got stderr: Mar 27, 2020 7:32:56 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font BCDIEE+Calibri-Bold are not implemented in PDFBox and will be ignored
Mar 27, 2020 7:32:58 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font BCDHEE+Calibri are not implemented in PDFBox and will be ignored



In [44]:

# Let's analyze the data: for example, let's look for the row
# that contains the data for a given country (Spain)
#len(data)
for item in data:
  #print(len(item))
  for ind in item.index:
    for col in item.columns:
      if item[col][ind]=='Spain':
        newdf = (item.iloc[[ind]])

print(newdf)  

# We can write a function to search for the country we want:

def find_country(name, data):
    """Utility to analyze data obtained with tabula-py.

    Given a list of dataframes, finds the first row that contains
    a cell equal to the search string 'name'.

    Parameters:
    -----------
    name: str  the string to find in the list
    data: list of pd DataFrames, the result of reading tables from a pdf
          with tabula-py
    Returns:
    --------
    list_idx, pd_idx: the location of the target dataframe in the list
                      and the row index within this dataframe
    Raises:
    -------
    ValueError: if no cell in any dataframe equals 'name'
    """
    # enumerate gives the position in the list directly; data.index(item)
    # would compare dataframes with == and fail for any table but the first
    for list_idx, item in enumerate(data):
        for pd_idx in item.index:
            for col in item.columns:
                if item[col][pd_idx] == name:
                    # Return as soon as a match is found, so that every
                    # table in the list is searched, not only the first
                    return list_idx, pd_idx
    raise ValueError("'{}' not found in any table".format(name))

name = 'Spain'
x,y = find_country(name, data)
print(x,y)


       0      1     2     3    4   ...   9   10   11   12   13
26  Spain  33089  4517  2182  462  ...  NaN NaN  NaN  NaN  NaN

[1 rows x 14 columns]
0 26


# Summary so far

We have now learned two separate things:
* How to batch download all the files of a specific type from a given web page
* How to 'roughly' browse through the data obtained from a list of tables returned by tabula-py when reading a PDF file. This list of tables is a list of pandas DataFrames. We wrote a function to locate the position of a target string within the list and within the dataframe
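
The cell-by-cell search can also be written with pandas' vectorized comparisons. The following is a sketch under stated assumptions, not a replacement for `find_country`: the helper name `find_value` is hypothetical, the toy tables contain placeholder strings rather than real report data, and the function returns `None` when nothing matches.

```python
import pandas as pd

def find_value(name, tables):
    """Return (list_idx, row_label) of the first row containing `name`, or None."""
    for list_idx, table in enumerate(tables):
        # Boolean Series: True for every row in which some cell equals `name`
        row_hits = table.eq(name).any(axis=1)
        if row_hits.any():
            # idxmax on a boolean Series gives the label of the first True
            return list_idx, row_hits.idxmax()
    return None

# Toy tables with placeholder values (not taken from any report)
tables = [
    pd.DataFrame({0: ["Italy", "France"], 1: ["a", "b"]}),
    pd.DataFrame({0: ["Germany", "Spain"], 1: ["c", "d"]}),
]

print(find_value("Spain", tables))
print(find_value("Portugal", tables))
```

Once the location is known, `tables[list_idx].loc[[row_label]]` selects the matching row as a one-row dataframe.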

# What we need to do next

* Check that all the files we download maintain the same structure
* Locate the date in each PDF file
* Create a new dataframe with the data for a specific country, ordered by date
* Plot the data for a single country
* Plot the data for several countries
* Analyze data?
* Other suggestions?
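
As a possible starting point for the date step, the date can be read from the file name instead of from the PDF contents. This sketch assumes that every report follows the `YYYYMMDD-sitrep-N-covid-19.pdf` naming of the report read above, and that the date in the name is the report date; neither has been checked against the full set of downloads. The helper name `parse_sitrep_name` is hypothetical, and only the first file name in the example list appears in this notebook.

```python
import re
from datetime import datetime

# Assumed naming scheme: YYYYMMDD-sitrep-N-covid-19.pdf
SITREP_PATTERN = re.compile(r"^(\d{8})-sitrep-(\d+)-covid-19\.pdf$")

def parse_sitrep_name(filename):
    """Return (date, report_number) for a situation-report file name, or None."""
    match = SITREP_PATTERN.match(filename)
    if match is None:
        return None
    report_date = datetime.strptime(match.group(1), "%Y%m%d").date()
    return report_date, int(match.group(2))

# The first name is the report used above; the others are illustrative
filenames = [
    "20200324-sitrep-64-covid-19.pdf",
    "20200322-sitrep-62-covid-19.pdf",
    "notes.txt",
]

# Keep only the names that parse, then sort them by (date, report number)
parsed = [(parse_sitrep_name(f), f) for f in filenames]
ordered = sorted((p, f) for p, f in parsed if p is not None)

for (report_date, number), f in ordered:
    print(report_date.isoformat(), number, f)
```

File names that do not fit the pattern are skipped rather than raising an error, which also gives a quick way to spot downloads that break the assumed scheme.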

