<span style="font-size:x-large;">Lab 2 (*Fake News!* Case Study)</span>

# Covid Tests per Capita: Is the US Leading the World?

![](trump-testing.jpg)

President Trump has repeatedly said that the US tests more than any other country in the world by far, and sometimes more than all the other countries put together. See for example the official White House transcript: [Remarks by President Trump on Supporting our Nation’s Small Businesses Through the Paycheck Protection Program](https://www.whitehouse.gov/briefings-statements/remarks-president-trump-supporting-nations-small-businesses-paycheck-protection-program/), 28th April, 2020.

In this practical work, as part of our *Fake News!* case study, we'll investigate the veracity of one aspect of these claims, the *per capita* test rates.

## Data Acquisition

#### *Our World in Data*

In the videos for this case study we obtained data, originally from the European CDC, through *Our World in Data*: https://ourworldindata.org/.

* Read the *[About](https://ourworldindata.org/about)* page to find out what this initiative is, what they hope to achieve, and who it is backed by. It is always important to know who is behind a data source in order to make an assessment about what degree of credibility to give the source.

Follow the link to `Health | Coronavirus Pandemic` and then to `Tests`. 

Note that the Data Scientists behind this site have provided a number of different ways of looking at and interpreting test rates. There is also a lot of background provided. 

* Scroll down to the section "Our checklist for COVID-19 testing data" and read through the ten items on the checklist.

This is a terrific example of Data Science done well!




### Per-capita testing

Find the section entitled "How many tests are performed each day".

Have a look at how the Map is presented, with the time slider that allows you to see snapshots over time.

Then look at the Chart view and scroll the cursor over the chart.

Again, these are great examples of interactive data presentation.

Note: Consistent with many of the media sites, we will refer to these data in general terms as the "per capita" data to distinguish them from the totals data. More precisely, however, they are tests per 1000 people. That is, they are exactly 1000 times the per capita rate. The only reason for multiplying by 1000 is that it's easier to read, say, 0.08 than 0.00008.
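
As a quick check of that arithmetic, here is a minimal sketch that converts between the two scales. The variable names are illustrative only and are not used elsewhere in the lab.

```python
# A rate as reported in the data file: tests per 1000 people.
tests_per_thousand = 0.08

# The true per capita rate is 1000 times smaller.
per_capita = tests_per_thousand / 1000

print(tests_per_thousand, "tests per 1000 people")
print(per_capita, "tests per person")
```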

### Downloading and uploading the data

* Go to `DOWNLOAD` and download the csv file `daily-tests-per-thousand-people-smoothed-7-day.csv`.

* Then open the folder icon in CoCalc, make sure you are in the same directory as this lab sheet, and drop (upload) the csv file into the directory.

Click on the file to open it in CoCalc and have a look at the file format.

You should see, as anticipated from the filename extension, that this is a *comma separated values (csv)* file: in each row, the *fields* are separated by commas. There is also a header line, indicating what the data in each field represents.


### Reading in the data

As usual, start by setting up a constant with the path (in this case, it's just the name) of the data file, so you don't need to keep typing it. For this lab we'll use the version of this file from the 29th July, distributed with the lab, so that we are all using the same file. (Feel free to run your code on your own version too - just be aware that the outputs and test results may be different to those in this sheet.)

You can access this file with:

`DATA = "daily-tests-per-thousand-people-smoothed-7-day-20200729.csv"`

Note that in this lab we won't put empty cells for you to complete your code in. You can create cells as you need them using the '+' button.

* Read and print out the first 5 lines of data. The output should start like this:

```
Entity,Code,Date,Daily tests per thousand people (7-day smoothed) (tests per thousand)

Argentina,ARG,"Feb 18, 2020",0

Argentina,ARG,"Feb 19, 2020",0
```


In [1]:
DATA = "daily-tests-per-thousand-people-smoothed-7-day-20200729.csv"

In [2]:
with open(DATA) as file:
    for line in range(5):
        print(file.readline())

Entity,Code,Date,Daily tests per thousand people (7-day smoothed) (tests per thousand)

Argentina,ARG,"Feb 18, 2020",0

Argentina,ARG,"Feb 19, 2020",0

Argentina,ARG,"Feb 20, 2020",0

Argentina,ARG,"Feb 21, 2020",0



## Data Conversion and Cleaning

#### From strings to lists

* Read the first 5 lines again. This time use the `split` method to turn each line into a list before printing it out.

Your output should start like this:
```
['Entity', 'Code', 'Date', 'Daily tests per thousand people (7-day smoothed) (tests per thousand)\n']
['Argentina', 'ARG', '"Feb 18', ' 2020"', '0\n']
['Argentina', 'ARG', '"Feb 19', ' 2020"', '0\n']
```

In [3]:
with open(DATA) as file:
    for i in range(5):
        # Split on the comma delimiter; split() with no argument splits on whitespace instead.
        print(file.readline().split(','))

['Entity', 'Code', 'Date', 'Daily tests per thousand people (7-day smoothed) (tests per thousand)\n']
['Argentina', 'ARG', '"Feb 18', ' 2020"', '0\n']
['Argentina', 'ARG', '"Feb 19', ' 2020"', '0\n']
['Argentina', 'ARG', '"Feb 20', ' 2020"', '0\n']
['Argentina', 'ARG', '"Feb 21', ' 2020"', '0\n']


Notice that we still have the newline character in the last item.

* Use the `strip` method to remove whitespace before splitting the lines.

Your output should now start like this:
```
['Entity', 'Code', 'Date', 'Daily tests per thousand people (7-day smoothed) (tests per thousand)']
['Argentina', 'ARG', '"Feb 18', ' 2020"', '0']
['Argentina', 'ARG', '"Feb 19', ' 2020"', '0']
```

In [4]:
with open(DATA) as file:
    for i in range(5):
        # Strip the trailing newline first, then split on the comma delimiter.
        print(file.readline().strip().split(','))

['Entity', 'Code', 'Date', 'Daily tests per thousand people (7-day smoothed) (tests per thousand)']
['Argentina', 'ARG', '"Feb 18', ' 2020"', '0']
['Argentina', 'ARG', '"Feb 19', ' 2020"', '0']
['Argentina', 'ARG', '"Feb 20', ' 2020"', '0']
['Argentina', 'ARG', '"Feb 21', ' 2020"', '0']


You may have noticed another problem, caused by the fact that the dates include commas. In fact, you may conclude that with this date format, the choice of a comma as the *delimiter* (separator) was not a particularly good one, and an alternative such as tab separated (tsv) would have been a better choice for these data.

Nevertheless, it is not ambiguous, because commas that are not intended as delimiters only appear within double quotes. The 'user' (our code in this case) is expected to take the quotes into account when splitting the lines. 
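
For comparison only (the lab asks you to handle the quotes by hand), here is a minimal sketch of quote-aware splitting using Python's standard `csv` module. The two sample rows are typed in here rather than read from the data file, and the shortened header is an assumption made for brevity.

```python
import csv
import io

# Two sample rows in the same format as the data file (shortened header).
sample = 'Entity,Code,Date,Value\nArgentina,ARG,"Feb 18, 2020",0\n'

# csv.reader treats a comma inside double quotes as part of the field,
# and removes the quotes themselves.
rows = list(csv.reader(io.StringIO(sample)))
for row in rows:
    print(row)
```

Note that the comma inside the date is preserved by `csv.reader`; only the quotes are removed.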

There are many ways to deal with dates, but for now, we can just remove the comma and the quotes, since neither provides us with any information. The fields within the quotes (month, day and year) can be distinguished by their order (the comma is just for human consumption) and the quotes are redundant since it is a text file and all the fields will be read as strings.

Change your code so that it has a *preprocessing* step before it splits the lines. For each line after the header line your preprocessing step should:
  * find the positions of the two double quotes
  * replace the comma between the quotes with a space
  * replace the old date with the new date without a comma
  * raise a `ValueError` if the line doesn't have double quotes
  * check whether there are any empty rows and, if so, delete them

We'll break it down into steps.


In [5]:

        
with open(DATA) as file:
    for line in file:
        lines = line.strip()

        # Collapse doubled commas. Note that this also deletes empty fields
        # (such as a missing Code), so those rows end up with three fields.
        while ',,' in lines:
            lines = lines.replace(',,', ',')

        # Header line: remove everything from the first '(' to the second ')'.
        # Assumes two parenthesised groups, as in the header.
        while '(' in lines:
            first = lines.index('(')
            second = lines.index(')')
            third = lines.index('(', first + 1)
            fourth = lines.index(')', second + 1)
            field = lines[first:fourth + 1]
            lines = lines.replace(field, '')

        # Quoted fields: drop the commas inside the quotes, then the quotes themselves.
        while '"' in lines:
            first = lines.find('"')
            second = lines.find('"', first + 1)
            field = lines[first:second + 1]
            clean_field = field.replace(',', '')[1:-1]
            lines = lines.replace(field, clean_field)

        print(lines)


Entity,Code,Date,Daily tests per thousand people 
Argentina,ARG,Feb 18 2020,0
Argentina,ARG,Feb 19 2020,0
Argentina,ARG,Feb 20 2020,0
Argentina,ARG,Feb 21 2020,0
Argentina,ARG,Feb 22 2020,0
Argentina,ARG,Feb 23 2020,0
Argentina,ARG,Feb 24 2020,0
Argentina,ARG,Feb 25 2020,0
Argentina,ARG,Feb 26 2020,0
Argentina,ARG,Feb 27 2020,0
Argentina,ARG,Feb 28 2020,0
Argentina,ARG,Feb 29 2020,0
Argentina,ARG,Mar 1 2020,0
Argentina,ARG,Mar 2 2020,0
Argentina,ARG,Mar 3 2020,0
Argentina,ARG,Mar 4 2020,0
Argentina,ARG,Mar 5 2020,0
Argentina,ARG,Mar 6 2020,0
Argentina,ARG,Mar 7 2020,0
Argentina,ARG,Mar 8 2020,0
Argentina,ARG,Mar 9 2020,0
Argentina,ARG,Mar 10 2020,0
Argentina,ARG,Mar 11 2020,0
Argentina,ARG,Mar 12 2020,0
Argentina,ARG,Mar 13 2020,0.001
Argentina,ARG,Mar 14 2020,0.001
Argentina,ARG,Mar 15 2020,0.001
Argentina,ARG,Mar 16 2020,0.001
Argentina,ARG,Mar 17 2020,0.002
Argentina,ARG,Mar 18 2020,0.002
Argentina,ARG,Mar 19 2020,0.002
Argentina,ARG,Mar 20 2020,0.003
Argentina,ARG,Mar 21 2020,0.004

Cote d'Ivoire,CIV,Jul 7 2020,0.061
Cote d'Ivoire,CIV,Jul 8 2020,0.061
Cote d'Ivoire,CIV,Jul 9 2020,0.06
Cote d'Ivoire,CIV,Jul 10 2020,0.065
Cote d'Ivoire,CIV,Jul 11 2020,0.069
Cote d'Ivoire,CIV,Jul 12 2020,0.064
Cote d'Ivoire,CIV,Jul 13 2020,0.062
Cote d'Ivoire,CIV,Jul 14 2020,0.059
Cote d'Ivoire,CIV,Jul 15 2020,0.059
Cote d'Ivoire,CIV,Jul 16 2020,0.055
Cote d'Ivoire,CIV,Jul 17 2020,0.048
Cote d'Ivoire,CIV,Jul 18 2020,0.042
Cote d'Ivoire,CIV,Jul 19 2020,0.041
Cote d'Ivoire,CIV,Jul 20 2020,0.043
Cote d'Ivoire,CIV,Jul 21 2020,0.046
Cote d'Ivoire,CIV,Jul 22 2020,0.042
Cote d'Ivoire,CIV,Jul 23 2020,0.05
Cote d'Ivoire,CIV,Jul 24 2020,0.056
Cote d'Ivoire,CIV,Jul 25 2020,0.06
Croatia,HRV,Mar 10 2020,0.004
Croatia,HRV,Mar 11 2020,0.006
Croatia,HRV,Mar 12 2020,0.007
Croatia,HRV,Mar 13 2020,0.009
Croatia,HRV,Mar 14 2020,0.011
Croatia,HRV,Mar 15 2020,0.015
Croatia,HRV,Mar 16 2020,0.019
Croatia,HRV,Mar 17 2020,0.026
Croatia,HRV,Mar 18 2020,0.026
Croatia,HRV,Mar 19 2020,0.027
Croatia,HRV,Mar 20 202

France tests performed,Mar 4 2020,0.01
France tests performed,Mar 5 2020,0.011
France tests performed,Mar 6 2020,0.014
France tests performed,Mar 7 2020,0.017
France tests performed,Mar 8 2020,0.019
France tests performed,Mar 9 2020,0.021
France tests performed,Mar 10 2020,0.025
France tests performed,Mar 11 2020,0.032
France tests performed,Mar 12 2020,0.039
France tests performed,Mar 13 2020,0.045
France tests performed,Mar 14 2020,0.05
France tests performed,Mar 15 2020,0.056
France tests performed,Mar 16 2020,0.07
France tests performed,Mar 17 2020,0.082
France tests performed,Mar 18 2020,0.09
France tests performed,Mar 19 2020,0.097
France tests performed,Mar 20 2020,0.105
France tests performed,Mar 21 2020,0.113
France tests performed,Mar 22 2020,0.121
France tests performed,Mar 23 2020,0.121
France tests performed,Mar 24 2020,0.121
France tests performed,Mar 25 2020,0.14
France tests performed,Mar 26 2020,0.159
France tests performed,Mar 27 2020,0.178
France tests performed,Mar 

Italy,ITA,May 14 2020,1.007
Italy,ITA,May 15 2020,1.017
Italy,ITA,May 16 2020,1.017
Italy,ITA,May 17 2020,1.037
Italy,ITA,May 18 2020,1.027
Italy,ITA,May 19 2020,1.018
Italy,ITA,May 20 2020,1.03
Italy,ITA,May 21 2020,1.03
Italy,ITA,May 22 2020,1.047
Italy,ITA,May 23 2020,1.055
Italy,ITA,May 24 2020,1.044
Italy,ITA,May 25 2020,1.042
Italy,ITA,May 26 2020,1.029
Italy,ITA,May 27 2020,1.029
Italy,ITA,May 28 2020,1.039
Italy,ITA,May 29 2020,1.031
Italy,ITA,May 30 2020,1.024
Italy,ITA,May 31 2020,1.02
Italy,ITA,Jun 1 2020,1.011
Italy,ITA,Jun 2 2020,0.998
Italy,ITA,Jun 3 2020,0.927
Italy,ITA,Jun 4 2020,0.866
Italy,ITA,Jun 5 2020,0.849
Italy,ITA,Jun 6 2020,0.856
Italy,ITA,Jun 7 2020,0.845
Italy,ITA,Jun 8 2020,0.835
Italy,ITA,Jun 9 2020,0.842
Italy,ITA,Jun 10 2020,0.902
Italy,ITA,Jun 11 2020,0.932
Italy,ITA,Jun 12 2020,0.945
Italy,ITA,Jun 13 2020,0.891
Italy,ITA,Jun 14 2020,0.908
Italy,ITA,Jun 15 2020,0.91
Italy,ITA,Jun 16 2020,0.891
Italy,ITA,Jun 17 2020,0.926
Italy,ITA,Jun 18 2020,0.916
Italy

Malta,MLT,Mar 15 2020,0.301
Malta,MLT,Mar 16 2020,0.319
Malta,MLT,Mar 17 2020,0.351
Malta,MLT,Mar 18 2020,0.396
Malta,MLT,Mar 19 2020,0.424
Malta,MLT,Mar 20 2020,0.46
Malta,MLT,Mar 21 2020,0.505
Malta,MLT,Mar 22 2020,0.564
Malta,MLT,Mar 23 2020,0.618
Malta,MLT,Mar 24 2020,0.725
Malta,MLT,Mar 25 2020,0.759
Malta,MLT,Mar 26 2020,0.815
Malta,MLT,Mar 27 2020,0.899
Malta,MLT,Mar 28 2020,0.999
Malta,MLT,Mar 29 2020,1.103
Malta,MLT,Mar 30 2020,1.184
Malta,MLT,Mar 31 2020,1.228
Malta,MLT,Apr 1 2020,1.336
Malta,MLT,Apr 2 2020,1.431
Malta,MLT,Apr 3 2020,1.477
Malta,MLT,Apr 4 2020,1.495
Malta,MLT,Apr 5 2020,1.47
Malta,MLT,Apr 6 2020,1.551
Malta,MLT,Apr 7 2020,1.669
Malta,MLT,Apr 8 2020,1.755
Malta,MLT,Apr 9 2020,1.834
Malta,MLT,Apr 10 2020,2.011
Malta,MLT,Apr 11 2020,2.129
Malta,MLT,Apr 12 2020,2.258
Malta,MLT,Apr 13 2020,2.33
Malta,MLT,Apr 14 2020,2.355
Malta,MLT,Apr 15 2020,2.364
Malta,MLT,Apr 16 2020,2.389
Malta,MLT,Apr 17 2020,2.319
Malta,MLT,Apr 18 2020,2.24
Malta,MLT,Apr 19 2020,2.118
Malta

Paraguay,PRY,Jun 7 2020,0.151
Paraguay,PRY,Jun 8 2020,0.162
Paraguay,PRY,Jun 9 2020,0.159
Paraguay,PRY,Jun 10 2020,0.173
Paraguay,PRY,Jun 11 2020,0.181
Paraguay,PRY,Jun 12 2020,0.198
Paraguay,PRY,Jun 13 2020,0.205
Paraguay,PRY,Jun 14 2020,0.211
Paraguay,PRY,Jun 15 2020,0.196
Paraguay,PRY,Jun 16 2020,0.199
Paraguay,PRY,Jun 17 2020,0.19
Paraguay,PRY,Jun 18 2020,0.187
Paraguay,PRY,Jun 19 2020,0.183
Paraguay,PRY,Jun 20 2020,0.185
Paraguay,PRY,Jun 21 2020,0.178
Paraguay,PRY,Jun 22 2020,0.184
Paraguay,PRY,Jun 23 2020,0.19
Paraguay,PRY,Jun 24 2020,0.193
Paraguay,PRY,Jun 25 2020,0.195
Paraguay,PRY,Jun 26 2020,0.194
Paraguay,PRY,Jun 27 2020,0.196
Paraguay,PRY,Jun 28 2020,0.2
Paraguay,PRY,Jun 29 2020,0.203
Paraguay,PRY,Jun 30 2020,0.198
Paraguay,PRY,Jul 1 2020,0.198
Paraguay,PRY,Jul 2 2020,0.199
Paraguay,PRY,Jul 3 2020,0.203
Paraguay,PRY,Jul 4 2020,0.21
Paraguay,PRY,Jul 5 2020,0.219
Paraguay,PRY,Jul 6 2020,0.227
Paraguay,PRY,Jul 7 2020,0.242
Paraguay,PRY,Jul 8 2020,0.246
Paraguay,PRY,Jul 9 2020,

Slovakia,SVK,Jun 7 2020,0.44
Slovakia,SVK,Jun 8 2020,0.437
Slovakia,SVK,Jun 9 2020,0.291
Slovakia,SVK,Jun 10 2020,0.271
Slovakia,SVK,Jun 11 2020,0.254
Slovakia,SVK,Jun 12 2020,0.239
Slovakia,SVK,Jun 13 2020,0.21
Slovakia,SVK,Jun 14 2020,0.191
Slovakia,SVK,Jun 15 2020,0.188
Slovakia,SVK,Jun 16 2020,0.188
Slovakia,SVK,Jun 17 2020,0.178
Slovakia,SVK,Jun 18 2020,0.16
Slovakia,SVK,Jun 19 2020,0.148
Slovakia,SVK,Jun 20 2020,0.141
Slovakia,SVK,Jun 21 2020,0.137
Slovakia,SVK,Jun 22 2020,0.137
Slovakia,SVK,Jun 23 2020,0.132
Slovakia,SVK,Jun 24 2020,0.134
Slovakia,SVK,Jun 25 2020,0.138
Slovakia,SVK,Jun 26 2020,0.157
Slovakia,SVK,Jun 27 2020,0.165
Slovakia,SVK,Jun 28 2020,0.182
Slovakia,SVK,Jun 29 2020,0.182
Slovakia,SVK,Jun 30 2020,0.186
Slovakia,SVK,Jul 1 2020,0.207
Slovakia,SVK,Jul 2 2020,0.227
Slovakia,SVK,Jul 3 2020,0.234
Slovakia,SVK,Jul 4 2020,0.25
Slovakia,SVK,Jul 5 2020,0.247
Slovakia,SVK,Jul 6 2020,0.247
Slovakia,SVK,Jul 7 2020,0.249
Slovakia,SVK,Jul 8 2020,0.253
Slovakia,SVK,Jul 9 2020

Tunisia,TUN,Jun 22 2020,0.063
Tunisia,TUN,Jun 23 2020,0.057
Tunisia,TUN,Jun 24 2020,0.044
Tunisia,TUN,Jun 25 2020,0.04
Tunisia,TUN,Jun 26 2020,0.025
Tunisia,TUN,Jun 27 2020,0.024
Tunisia,TUN,Jun 28 2020,0.025
Tunisia,TUN,Jun 29 2020,0.023
Tunisia,TUN,Jun 30 2020,0.028
Tunisia,TUN,Jul 1 2020,0.033
Tunisia,TUN,Jul 2 2020,0.036
Tunisia,TUN,Jul 3 2020,0.046
Tunisia,TUN,Jul 4 2020,0.05
Tunisia,TUN,Jul 5 2020,0.05
Tunisia,TUN,Jul 6 2020,0.053
Tunisia,TUN,Jul 7 2020,0.06
Tunisia,TUN,Jul 8 2020,0.063
Tunisia,TUN,Jul 9 2020,0.071
Tunisia,TUN,Jul 10 2020,0.068
Tunisia,TUN,Jul 11 2020,0.07
Tunisia,TUN,Jul 12 2020,0.074
Tunisia,TUN,Jul 13 2020,0.074
Tunisia,TUN,Jul 14 2020,0.071
Tunisia,TUN,Jul 15 2020,0.073
Tunisia,TUN,Jul 16 2020,0.066
Tunisia,TUN,Jul 17 2020,0.064
Tunisia,TUN,Jul 18 2020,0.062
Tunisia,TUN,Jul 19 2020,0.065
Tunisia,TUN,Jul 20 2020,0.065
Tunisia,TUN,Jul 21 2020,0.065
Tunisia,TUN,Jul 22 2020,0.066
Tunisia,TUN,Jul 23 2020,0.076
Turkey,TUR,Mar 25 2020,0.039
Turkey,TUR,Mar 26 2020,0.

In [6]:


with open(DATA) as file:
    for i in range(5):
        # Read the next line; without this, `lines` keeps its value from the previous cell.
        lines = file.readline().strip()

        # Header line: remove the two parenthesised groups and the space before them.
        while '(' in lines:
            first = lines.index('(')
            second = lines.index(')')
            fourth = lines.index(')', second + 1)
            field = lines[first - 1:fourth + 1]
            lines = lines.replace(field, '')

        # Quoted fields: drop the commas inside the quotes, then the quotes themselves.
        while '"' in lines:
            first = lines.find('"')
            second = lines.find('"', first + 1)
            field = lines[first:second + 1]
            clean_field = field.replace(',', '')[1:-1]
            lines = lines.replace(field, clean_field)

        print(lines)

Entity,Code,Date,Daily tests per thousand people
Argentina,ARG,Feb 18 2020,0
Argentina,ARG,Feb 19 2020,0
Argentina,ARG,Feb 20 2020,0
Argentina,ARG,Feb 21 2020,0


In [7]:
with open(DATA) as file:
    for count, line in enumerate(file):
        line = line.strip()

        if count == 0:
            # Header line: remove the parenthesised groups and the space before them.
            while '(' in line:
                first = line.index('(')
                second = line.index(')', first + 1)
                third = line.index('(', second + 1)
                fourth = line.index(')', third + 1)
                field = line[first - 1:fourth + 1]
                line = line.replace(field, '')
            continue

        # Check for quotes before removing them. Searching for '""' (two
        # adjacent quotes) after the clean-up can never succeed.
        if '"' not in line:
            raise ValueError('Double Quote Not Found')

        # Quoted fields: drop the commas inside the quotes, then the quotes themselves.
        while '"' in line:
            first = line.index('"')
            second = line.index('"', first + 1)
            field = line[first:second + 1]
            clean_field = field.replace(',', '')[1:-1]
            line = line.replace(field, clean_field)

In [None]:
def clean(data_row):
    # Draft version: here data_row is the path of the data file, not a single row.
    # Returns a list with one cleaned string per line of the file.
    content = []
    with open(data_row) as file:
        for line in file:
            lines = line.strip()

            # Header line: remove the two parenthesised groups.
            while '(' in lines:
                first = lines.index('(')
                second = lines.index(')')
                fourth = lines.index(')', second + 1)
                field = lines[first:fourth + 1]
                lines = lines.replace(field, '')

            # Quoted fields: drop the commas inside the quotes, then the quotes themselves.
            while '"' in lines:
                first = lines.find('"')
                second = lines.find('"', first + 1)
                field = lines[first:second + 1]
                clean_field = field.replace(',', '')[1:-1]
                lines = lines.replace(field, clean_field)

            content.append(lines)
    return content
    

In [None]:

def clean(data_row):
    # Strip whitespace (including the trailing newline) from both ends.
    line = data_row.strip()

    if '"' not in line:
        # Header line: strip everything from the space before the first parenthesis.
        if '(' in line:
            line = line[:line.index('(')].rstrip()
    else:
        # Quoted fields: drop the commas inside the quotes, then the quotes themselves.
        while '"' in line:
            first = line.find('"')
            second = line.find('"', first + 1)
            if second == -1:
                # An unmatched quote would otherwise loop forever.
                raise ValueError('Unmatched double quote: ' + line)
            field = line[first:second + 1]
            clean_field = field.replace(',', '')[1:-1]
            line = line.replace(field, clean_field)

    return line




In [None]:
with open(DATA) as file:
    for i in range(5):
        print(clean(file.readline()))

* Print each line (of the first 5, other than the header, and with the whitespace removed) followed by the indices of the two double quotes:
```
Argentina,ARG,"Feb 18, 2020",0
14 27
Argentina,ARG,"Feb 19, 2020",0
14 27
Argentina,ARG,"Feb 20, 2020",0
14 27
...
```

Tip: You can get the double quote character by enclosing it in single quotes (`'"'`).

Hint: Compare the `find` and `index` methods. Why would you choose one or the other?
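
To make the hint concrete, here is a small sketch of how the two methods differ when the character is missing. The `header` string is a shortened stand-in for the real header line.

```python
line = 'Argentina,ARG,"Feb 18, 2020",0'
header = 'Entity,Code,Date'   # shortened stand-in: contains no double quotes

# Both methods give the same position when the character is present.
first = line.find('"')
second = line.find('"', first + 1)   # start searching just after the first quote
print(first, second)
print(line.index('"'))

# When the character is absent, find returns -1 ...
missing = header.find('"')
print(missing)

# ... whereas index raises a ValueError.
try:
    header.index('"')
except ValueError as err:
    print("index raised ValueError:", err)
```

`find` suits cases where a missing character is normal and you want to test for it; `index` suits cases where a missing character means something is wrong and the code should stop.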


* Next, print the line followed by the date string:

```
Argentina,ARG,"Feb 18, 2020",0
"Feb 18, 2020"
Argentina,ARG,"Feb 19, 2020",0
"Feb 19, 2020"
Argentina,ARG,"Feb 20, 2020",0
"Feb 20, 2020"
```

* Now do the same, except with the comma removed from the date string:
```
Argentina,ARG,"Feb 18, 2020",0
Feb 18 2020
Argentina,ARG,"Feb 19, 2020",0
Feb 19 2020
Argentina,ARG,"Feb 20, 2020",0
Feb 20 2020
```

* Next, print the original lines with the old date field replaced by the cleaned date field:
```
Argentina,ARG,Feb 18 2020,0
Argentina,ARG,Feb 19 2020,0
Argentina,ARG,Feb 20 2020,0
```

#### Checked Solution [1 lab mark] 
Now that we have this working, we don't want this preprocessing step 'muddying' up our code, so let's put it in a separate function.

* Write a function `clean(data_row)` that takes as its argument a string, and:
  * strips any unnecessary whitespace characters from the ends
  * if it contains a string in double quotes, strips the quotes and the comma between them
  * if it doesn't contain any quotes (this will be the header line), it strips everything in the line from the space before the first parenthesis

If called with the following code:
```
with open(DATA,'r') as file:
    for i in range(5):
        print(clean(file.readline()))
```
the output should start like this:
```
Entity,Code,Date,Daily tests per thousand people
Argentina,ARG,Feb 18 2020,0
Argentina,ARG,Feb 19 2020,0
...
```

In [None]:
from nose.tools import assert_equal
DATA = "daily-tests-per-thousand-people-smoothed-7-day-20200729.csv"
with open(DATA,'r') as file:
    assert_equal(clean(file.readline()), "Entity,Code,Date,Daily tests per thousand people")
    assert_equal(clean(file.readline()), "Argentina,ARG,Feb 18 2020,0")
print("So far, so good on the practice (non-hidden) tests. Remember there will be additional tests applied.")


#### From a file to a list

Next, rather than printing the lines, store them all in a list (of lists).

* Define a variable `data_lists` as an empty list (`[]`). Write code that cleans and splits the (entire) input into lists, and appends them to `data_lists`.

For example, the following should print the first five rows as a list of lists:

```
print(data_lists[:5])

[['Entity', 'Code', 'Date', 'Daily tests per thousand people'], ['Argentina', 'ARG', 'Feb 18 2020', '0'], ['Argentina', 'ARG', 'Feb 19 2020', '0'], ['Argentina', 'ARG', 'Feb 20 2020', '0'], ['Argentina', 'ARG', 'Feb 21 2020', '0']]
```

How many entries (lines of data) are there in the file?

In [8]:
data_lists = []

with open(DATA) as file:
    for line in file:
        lines = line.strip()

        # Header line: remove the two parenthesised groups
        # (this leaves a trailing space after 'people').
        while '(' in lines:
            first = lines.index('(')
            second = lines.index(')')
            third = lines.index('(', first + 1)
            fourth = lines.index(')', second + 1)
            field = lines[first:fourth + 1]
            lines = lines.replace(field, '')

        # Quoted fields: drop the commas inside the quotes, then the quotes themselves.
        while '"' in lines:
            first = lines.find('"')
            second = lines.find('"', first + 1)
            field = lines[first:second + 1]
            clean_field = field.replace(',', '')[1:-1]
            lines = lines.replace(field, clean_field)

        # Empty fields (such as a missing Code) are kept as empty strings.
        data_lists.append(lines.split(','))
data_lists[:5]

[['Entity', 'Code', 'Date', 'Daily tests per thousand people '],
 ['Argentina', 'ARG', 'Feb 18 2020', '0'],
 ['Argentina', 'ARG', 'Feb 19 2020', '0'],
 ['Argentina', 'ARG', 'Feb 20 2020', '0'],
 ['Argentina', 'ARG', 'Feb 21 2020', '0']]

In [9]:
data_lists

[['Entity', 'Code', 'Date', 'Daily tests per thousand people '],
 ['Argentina', 'ARG', 'Feb 18 2020', '0'],
 ['Argentina', 'ARG', 'Feb 19 2020', '0'],
 ['Argentina', 'ARG', 'Feb 20 2020', '0'],
 ['Argentina', 'ARG', 'Feb 21 2020', '0'],
 ['Argentina', 'ARG', 'Feb 22 2020', '0'],
 ['Argentina', 'ARG', 'Feb 23 2020', '0'],
 ['Argentina', 'ARG', 'Feb 24 2020', '0'],
 ['Argentina', 'ARG', 'Feb 25 2020', '0'],
 ['Argentina', 'ARG', 'Feb 26 2020', '0'],
 ['Argentina', 'ARG', 'Feb 27 2020', '0'],
 ['Argentina', 'ARG', 'Feb 28 2020', '0'],
 ['Argentina', 'ARG', 'Feb 29 2020', '0'],
 ['Argentina', 'ARG', 'Mar 1 2020', '0'],
 ['Argentina', 'ARG', 'Mar 2 2020', '0'],
 ['Argentina', 'ARG', 'Mar 3 2020', '0'],
 ['Argentina', 'ARG', 'Mar 4 2020', '0'],
 ['Argentina', 'ARG', 'Mar 5 2020', '0'],
 ['Argentina', 'ARG', 'Mar 6 2020', '0'],
 ['Argentina', 'ARG', 'Mar 7 2020', '0'],
 ['Argentina', 'ARG', 'Mar 8 2020', '0'],
 ['Argentina', 'ARG', 'Mar 9 2020', '0'],
 ['Argentina', 'ARG', 'Mar 10 2020', '0']

#### Checking our cleaning so far

We now have a tidy list of lists, each with the four fields. Or do we?

With big data it may not be possible to manually look at every entry to see if we've accounted for every possibility. We should try to make our cleaning or preprocessing as general as possible so that we catch unexpected variations or bad data. 

In practice, we often have to make some assumptions about the data. However, we should endeavour to test these.

In this case, we've assumed that the patterns we see at the start of the file continue through the file. So let's check that assumption.

* Write code that checks whether all your entries have 4 fields.

If not, why not? What have we missed?

Hint: Print out the first row (if there is one) where this is not true. Print out the number of that row, open the data file in CoCalc, and have a look at the data in that row. What do you find?
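
One way to follow this hint is sketched below on a small made-up sample. The rows in `sample_rows` and the helper name `first_bad_row` are hypothetical; the third row is deliberately malformed so that the check has something to find.

```python
# Hypothetical sample: the last row is deliberately missing a field.
sample_rows = [
    ['Entity', 'Code', 'Date', 'Daily tests per thousand people'],
    ['Argentina', 'ARG', 'Feb 18 2020', '0'],
    ['Argentina', 'Feb 19 2020', '0'],
]

def first_bad_row(rows, expected=4):
    """Return (row number, row) for the first row without `expected` fields, or None."""
    for number, row in enumerate(rows):
        if len(row) != expected:
            return number, row
    return None

result = first_bad_row(sample_rows)
print(result)
```

Returning `None` when every row passes avoids the `ValueError` that `list.index` raises when the value it is looking for is absent.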


In [10]:
# True for each row that does NOT have exactly 4 fields.
not4 = []
for row in data_lists:
    not4.append(len(row) != 4)

# index raises a ValueError when there is no True in the list, so check first.
if True in not4:
    first = not4.index(True)
    print(first, data_lists[first])
else:
    print('Every row has 4 fields')

Every row has 4 fields

In [11]:
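# True for each row that has exactly 4 fields, so the name matches the contents.
has4 = []
for data in data_lists:
    has4.append(len(data) == 4)

# Only look up the first offending row if there is one.
if False in has4:
    first = has4.index(False)
    print(first, data_lists[first])
else:
    print('Every row has 4 fields')

Every row has 4 fields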

#### General vs Specific

The variations you see in the data are not unusual - remember that this is the collated official government data, it's not an 'exercise'. It is *exactly the kind of thing you would deal with as a working Data Scientist*.

We'll need to come back to some differences in the content of the data, but for now let's focus on the formatting. Or, more precisely, *transforming* the data from the format in which it is provided to a format that is suitable for our use.

* Alter your code to do more cleaning as necessary so that it is transformed into a list of lists, each with four fields.

You should try to make your code as **general as possible**. This means that, rather than just adjust for the specific case, think about *patterns*. 

For example, we have seen that the data format uses double quotes around fields containing the delimiter (a comma in this case). Assuming the data is not corrupt, we would therefore expect the double quotes to always occur in pairs.
By focussing only on dates, we have been more *specific* than we need to be, so we may miss other cases. A more *general* solution will assume that the same pattern may occur in other fields.
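
As one possible illustration of the more general pattern, the sketch below walks through a line character by character and tracks whether it is currently inside a pair of double quotes. The function name `remove_quoted_commas` and the second sample line (with the made-up entity `"A, B"` and an empty code field) are assumptions for illustration. It also assumes quotes are never escaped inside a field.

```python
def remove_quoted_commas(line):
    """Drop double quotes, and any commas found between a pair of them."""
    if line.count('"') % 2 != 0:
        raise ValueError("unbalanced double quotes: " + line)
    result = []
    in_quotes = False
    for char in line:
        if char == '"':
            in_quotes = not in_quotes   # toggle on every quote; the quote is dropped
        elif char == ',' and in_quotes:
            continue                    # this comma is data, not a delimiter
        else:
            result.append(char)
    return ''.join(result)

print(remove_quoted_commas('Argentina,ARG,"Feb 18, 2020",0'))

# Hypothetical row with a quoted entity and an empty code field.
general = remove_quoted_commas('"A, B",,"Mar 4, 2020",0.01')
print(general.split(','))
```

Because nothing here refers to dates specifically, any quoted field is handled the same way, and empty fields survive the split as empty strings.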

Re-test your code after any changes you make to ensure it still satisfies the requirements. (You might find it useful to write them all down.)

#### Type casting

Finally, the last field, tests per 1000 people, should be a `float`.

* Alter your code to change the tests ratio to a float.

Your first few lines should now look like this:
```
[['Entity', 'Code', 'Date', 'Daily tests per thousand people'], ['Argentina', 'ARG', 'Feb 18 2020', 0.0], ['Argentina', 'ARG', 'Feb 19 2020', 0.0], ['Argentina', 'ARG', 'Feb 20 2020', 0.0], ['Argentina', 'ARG', 'Feb 21 2020', 0.0]]
```

Again, we could make an assumption that the fourth string is always able to be converted to a float, but it's possible that somewhere in the file that is not true. Later we will deal with this using *Exceptions*. For now, it is good practice to do some checks before attempting to cast. Before casting, check that:
* the fourth field is not an empty string
* the first character in the string is a digit

Tip: You may find string functions like `isnumeric` useful.
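
A minimal sketch of these two checks, wrapped in a hypothetical helper called `to_float_if_numeric`, is shown below. It is only as strong as the two conditions listed above: a string such as `'1abc'` would pass both checks and still fail in `float`, and a negative number such as `'-1'` would be left as a string.

```python
def to_float_if_numeric(text):
    """Return float(text) if text is non-empty and starts with a digit, else text unchanged."""
    if text != '' and text[0].isnumeric():
        return float(text)
    return text

# Sample values: a normal reading, an empty field, and a non-numeric marker.
for value in ['0.001', '', 'n/a']:
    print(repr(to_float_if_numeric(value)))
```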

In [12]:
# Skip the header row (index 0) and convert the last field of every other row.
for line in data_lists[1:]:
    # Only cast when the field is non-empty and starts with a digit.
    if line[3] != '' and line[3][0].isnumeric():
        line[3] = float(line[3])

In [13]:
data_lists

[['Entity', 'Code', 'Date', 'Daily tests per thousand people '],
 ['Argentina', 'ARG', 'Feb 18 2020', 0.0],
 ['Argentina', 'ARG', 'Feb 19 2020', 0.0],
 ['Argentina', 'ARG', 'Feb 20 2020', 0.0],
 ['Argentina', 'ARG', 'Feb 21 2020', 0.0],
 ['Argentina', 'ARG', 'Feb 22 2020', 0.0],
 ['Argentina', 'ARG', 'Feb 23 2020', 0.0],
 ['Argentina', 'ARG', 'Feb 24 2020', 0.0],
 ['Argentina', 'ARG', 'Feb 25 2020', 0.0],
 ['Argentina', 'ARG', 'Feb 26 2020', 0.0],
 ['Argentina', 'ARG', 'Feb 27 2020', 0.0],
 ['Argentina', 'ARG', 'Feb 28 2020', 0.0],
 ['Argentina', 'ARG', 'Feb 29 2020', 0.0],
 ['Argentina', 'ARG', 'Mar 1 2020', 0.0],
 ['Argentina', 'ARG', 'Mar 2 2020', 0.0],
 ['Argentina', 'ARG', 'Mar 3 2020', 0.0],
 ['Argentina', 'ARG', 'Mar 4 2020', 0.0],
 ['Argentina', 'ARG', 'Mar 5 2020', 0.0],
 ['Argentina', 'ARG', 'Mar 6 2020', 0.0],
 ['Argentina', 'ARG', 'Mar 7 2020', 0.0],
 ['Argentina', 'ARG', 'Mar 8 2020', 0.0],
 ['Argentina', 'ARG', 'Mar 9 2020', 0.0],
 ['Argentina', 'ARG', 'Mar 10 2020', 0.0]

In [14]:
with open(DATA) as file:
    for line in file:
        print(line)

Entity,Code,Date,Daily tests per thousand people (7-day smoothed) (tests per thousand)

Argentina,ARG,"Feb 18, 2020",0

Argentina,ARG,"Feb 19, 2020",0

Argentina,ARG,"Feb 20, 2020",0

Argentina,ARG,"Feb 21, 2020",0

Argentina,ARG,"Feb 22, 2020",0

Argentina,ARG,"Feb 23, 2020",0

Argentina,ARG,"Feb 24, 2020",0

Argentina,ARG,"Feb 25, 2020",0

Argentina,ARG,"Feb 26, 2020",0

Argentina,ARG,"Feb 27, 2020",0

Argentina,ARG,"Feb 28, 2020",0

Argentina,ARG,"Feb 29, 2020",0

Argentina,ARG,"Mar 1, 2020",0

Argentina,ARG,"Mar 2, 2020",0

Argentina,ARG,"Mar 3, 2020",0

Argentina,ARG,"Mar 4, 2020",0

Argentina,ARG,"Mar 5, 2020",0

Argentina,ARG,"Mar 6, 2020",0

Argentina,ARG,"Mar 7, 2020",0

Argentina,ARG,"Mar 8, 2020",0

Argentina,ARG,"Mar 9, 2020",0

Argentina,ARG,"Mar 10, 2020",0

Argentina,ARG,"Mar 11, 2020",0

Argentina,ARG,"Mar 12, 2020",0

Argentina,ARG,"Mar 13, 2020",0.001

Argentina,ARG,"Mar 14, 2020",0.001

Argentina,ARG,"Mar 15, 2020",0.001

Argentina,ARG,"Mar 16, 2020",0.001

Argentina

Cote d'Ivoire,CIV,"Apr 28, 2020",0.015

Cote d'Ivoire,CIV,"Apr 29, 2020",0.016

Cote d'Ivoire,CIV,"Apr 30, 2020",0.015

Cote d'Ivoire,CIV,"May 1, 2020",0.014

Cote d'Ivoire,CIV,"May 2, 2020",0.015

Cote d'Ivoire,CIV,"May 3, 2020",0.014

Cote d'Ivoire,CIV,"May 4, 2020",0.015

Cote d'Ivoire,CIV,"May 5, 2020",0.015

Cote d'Ivoire,CIV,"May 6, 2020",0.014

Cote d'Ivoire,CIV,"May 7, 2020",0.014

Cote d'Ivoire,CIV,"May 8, 2020",0.015

Cote d'Ivoire,CIV,"May 9, 2020",0.015

Cote d'Ivoire,CIV,"May 10, 2020",0.016

Cote d'Ivoire,CIV,"May 11, 2020",0.016

Cote d'Ivoire,CIV,"May 12, 2020",0.017

Cote d'Ivoire,CIV,"May 13, 2020",0.018

Cote d'Ivoire,CIV,"May 14, 2020",0.02

Cote d'Ivoire,CIV,"May 15, 2020",0.021

Cote d'Ivoire,CIV,"May 16, 2020",0.023

Cote d'Ivoire,CIV,"May 17, 2020",0.025

Cote d'Ivoire,CIV,"May 18, 2020",0.024

Cote d'Ivoire,CIV,"May 19, 2020",0.024

Cote d'Ivoire,CIV,"May 20, 2020",0.026

Cote d'Ivoire,CIV,"May 21, 2020",0.028

Cote d'Ivoire,CIV,"May 22, 2020",0.03

Cote d'Ivoi


Finland,FIN,"Jul 15, 2020",0.714

Finland,FIN,"Jul 16, 2020",0.716

Finland,FIN,"Jul 17, 2020",0.732

Finland,FIN,"Jul 18, 2020",0.754

Finland,FIN,"Jul 19, 2020",0.748

Finland,FIN,"Jul 20, 2020",0.754

Finland,FIN,"Jul 21, 2020",0.755

Finland,FIN,"Jul 22, 2020",0.747

Finland,FIN,"Jul 23, 2020",0.754

Finland,FIN,"Jul 24, 2020",0.689

France,FRA,"May 24, 2020",0.481

France,FRA,"May 25, 2020",0.487

France,FRA,"May 26, 2020",0.493

France,FRA,"May 27, 2020",0.499

France,FRA,"May 28, 2020",0.505

France,FRA,"May 29, 2020",0.511

France,FRA,"May 30, 2020",0.517

France,FRA,"May 31, 2020",0.504

France,FRA,"Jun 1, 2020",0.49

France,FRA,"Jun 2, 2020",0.477

France,FRA,"Jun 3, 2020",0.464

France,FRA,"Jun 4, 2020",0.451

France,FRA,"Jun 5, 2020",0.438

France,FRA,"Jun 6, 2020",0.425

France,FRA,"Jun 7, 2020",0.431

France,FRA,"Jun 8, 2020",0.437

France,FRA,"Jun 9, 2020",0.444

France,FRA,"Jun 10, 2020",0.45

France,FRA,"Jun 11, 2020",0.456

France,FRA,"Jun 12, 2020",0.463

France,FRA

Japan,JPN,"Feb 23, 2020",0.001

Japan,JPN,"Feb 24, 2020",0.001

Japan,JPN,"Feb 25, 2020",0.001

Japan,JPN,"Feb 26, 2020",0.001

Japan,JPN,"Feb 27, 2020",0.001

Japan,JPN,"Feb 28, 2020",0.001

Japan,JPN,"Feb 29, 2020",0.001

Japan,JPN,"Mar 1, 2020",0.001

Japan,JPN,"Mar 2, 2020",0.001

Japan,JPN,"Mar 3, 2020",0.001

Japan,JPN,"Mar 4, 2020",0.005

Japan,JPN,"Mar 5, 2020",0.005

Japan,JPN,"Mar 6, 2020",0.006

Japan,JPN,"Mar 7, 2020",0.006

Japan,JPN,"Mar 8, 2020",0.006

Japan,JPN,"Mar 9, 2020",0.006

Japan,JPN,"Mar 10, 2020",0.008

Japan,JPN,"Mar 11, 2020",0.004

Japan,JPN,"Mar 12, 2020",0.004

Japan,JPN,"Mar 13, 2020",0.005

Japan,JPN,"Mar 14, 2020",0.006

Japan,JPN,"Mar 15, 2020",0.005

Japan,JPN,"Mar 16, 2020",0.005

Japan,JPN,"Mar 17, 2020",0.006

Japan,JPN,"Mar 18, 2020",0.006

Japan,JPN,"Mar 19, 2020",0.005

Japan,JPN,"Mar 20, 2020",0.008

Japan,JPN,"Mar 21, 2020",0.007

Japan,JPN,"Mar 22, 2020",0.008

Japan,JPN,"Mar 23, 2020",0.008

Japan,JPN,"Mar 24, 2020",0.01

Japan,JPN,"Mar 25,


Morocco,MAR,"Jun 2, 2020",0.284

Morocco,MAR,"Jun 3, 2020",0.298

Morocco,MAR,"Jun 4, 2020",0.313

Morocco,MAR,"Jun 5, 2020",0.33

Morocco,MAR,"Jun 6, 2020",0.346

Morocco,MAR,"Jun 7, 2020",0.375

Morocco,MAR,"Jun 8, 2020",0.391

Morocco,MAR,"Jun 9, 2020",0.406

Morocco,MAR,"Jun 10, 2020",0.419

Morocco,MAR,"Jun 11, 2020",0.432

Morocco,MAR,"Jun 12, 2020",0.44

Morocco,MAR,"Jun 13, 2020",0.451

Morocco,MAR,"Jun 14, 2020",0.451

Morocco,MAR,"Jun 15, 2020",0.451

Morocco,MAR,"Jun 16, 2020",0.45

Morocco,MAR,"Jun 17, 2020",0.448

Morocco,MAR,"Jun 18, 2020",0.45

Morocco,MAR,"Jun 19, 2020",0.451

Morocco,MAR,"Jun 20, 2020",0.45

Morocco,MAR,"Jun 21, 2020",0.448

Morocco,MAR,"Jun 22, 2020",0.454

Morocco,MAR,"Jun 23, 2020",0.462

Morocco,MAR,"Jun 24, 2020",0.465

Morocco,MAR,"Jun 25, 2020",0.453

Morocco,MAR,"Jun 26, 2020",0.444

Morocco,MAR,"Jun 27, 2020",0.431

Morocco,MAR,"Jun 28, 2020",0.418

Morocco,MAR,"Jun 29, 2020",0.411

Morocco,MAR,"Jun 30, 2020",0.41

Morocco,MAR,"Jul 1, 2020",0

Philippines,PHL,"Jul 16, 2020",0.188

Philippines,PHL,"Jul 17, 2020",0.195

Philippines,PHL,"Jul 18, 2020",0.2

Philippines,PHL,"Jul 19, 2020",0.208

Philippines,PHL,"Jul 20, 2020",0.214

Philippines,PHL,"Jul 21, 2020",0.218

Philippines,PHL,"Jul 22, 2020",0.222

Philippines,PHL,"Jul 23, 2020",0.222

Philippines,PHL,"Jul 24, 2020",0.222

Poland,POL,"Mar 13, 2020",0.008

Poland,POL,"Mar 14, 2020",0.013

Poland,POL,"Mar 15, 2020",0.016

Poland,POL,"Mar 16, 2020",0.02

Poland,POL,"Mar 17, 2020",0.024

Poland,POL,"Mar 18, 2020",0.028

Poland,POL,"Mar 19, 2020",0.034

Poland,POL,"Mar 20, 2020",0.038

Poland,POL,"Mar 21, 2020",0.04

Poland,POL,"Mar 22, 2020",0.046

Poland,POL,"Mar 23, 2020",0.051

Poland,POL,"Mar 24, 2020",0.057

Poland,POL,"Mar 25, 2020",0.063

Poland,POL,"Mar 26, 2020",0.069

Poland,POL,"Mar 27, 2020",0.079

Poland,POL,"Mar 28, 2020",0.089

Poland,POL,"Mar 29, 2020",0.095

Poland,POL,"Mar 30, 2020",0.1

Poland,POL,"Mar 31, 2020",0.108

Poland,POL,"Apr 1, 2020",0.112

Polan


South Korea,KOR,"Apr 22, 2020",0.129

South Korea,KOR,"Apr 23, 2020",0.137

South Korea,KOR,"Apr 24, 2020",0.133

South Korea,KOR,"Apr 25, 2020",0.124

South Korea,KOR,"Apr 26, 2020",0.116

South Korea,KOR,"Apr 27, 2020",0.116

South Korea,KOR,"Apr 28, 2020",0.114

South Korea,KOR,"Apr 29, 2020",0.112

South Korea,KOR,"Apr 30, 2020",0.101

South Korea,KOR,"May 1, 2020",0.096

South Korea,KOR,"May 2, 2020",0.097

South Korea,KOR,"May 3, 2020",0.094

South Korea,KOR,"May 4, 2020",0.092

South Korea,KOR,"May 5, 2020",0.089

South Korea,KOR,"May 6, 2020",0.081

South Korea,KOR,"May 7, 2020",0.086

South Korea,KOR,"May 8, 2020",0.088

South Korea,KOR,"May 9, 2020",0.089

South Korea,KOR,"May 10, 2020",0.089

South Korea,KOR,"May 11, 2020",0.089

South Korea,KOR,"May 12, 2020",0.092

South Korea,KOR,"May 13, 2020",0.115

South Korea,KOR,"May 14, 2020",0.139

South Korea,KOR,"May 15, 2020",0.17

South Korea,KOR,"May 16, 2020",0.188

South Korea,KOR,"May 17, 2020",0.211

South Korea,KOR,"May 

United Arab Emirates,ARE,"May 13, 2020",3.363

United Arab Emirates,ARE,"May 14, 2020",3.4

United Arab Emirates,ARE,"May 15, 2020",3.557

United Arab Emirates,ARE,"May 16, 2020",3.717

United Arab Emirates,ARE,"May 17, 2020",3.808

United Arab Emirates,ARE,"May 18, 2020",3.936

United Arab Emirates,ARE,"May 19, 2020",4.036

United Arab Emirates,ARE,"May 20, 2020",4.208

United Arab Emirates,ARE,"May 21, 2020",4.266

United Arab Emirates,ARE,"May 22, 2020",4.193

United Arab Emirates,ARE,"May 23, 2020",4.242

United Arab Emirates,ARE,"May 24, 2020",4.089

United Arab Emirates,ARE,"May 25, 2020",3.855

United Arab Emirates,ARE,"May 26, 2020",3.777

United Arab Emirates,ARE,"May 27, 2020",3.571

United Arab Emirates,ARE,"May 28, 2020",3.612

United Arab Emirates,ARE,"May 29, 2020",3.634

United Arab Emirates,ARE,"May 30, 2020",3.474

United Arab Emirates,ARE,"May 31, 2020",3.571

United Arab Emirates,ARE,"Jun 1, 2020",3.777

United Arab Emirates,ARE,"Jun 2, 2020",4.011

United Arab Emira

In [15]:

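is_first_line = True
for line in data_lists:
    if is_first_line:
        is_first_line = False  # skip the header row
    else:
        # The isinstance check makes this cell safe to re-run: once line[3] has
        # been converted, indexing the float with [0] raises
        # "TypeError: 'float' object is not subscriptable".
        if isinstance(line[3], str) and line[3] != '' and line[3][0].isnumeric():
            line[3] = float(line[3])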

In [None]:
float_lists = []

Remember to run your own tests rather than relying on my tests or my sample output. For example, I've included the first few lines of output (the file is too large to include them all), but as we've seen, those lines may not be indicative of the file as a whole. For one thing, they are all cases where the daily test ratio is zero.

Again, remember that a Data Scientist is like a detective - always thinking about what we could possibly have missed, and testing for it.

#### Checked Solution [2 marks]

* Complete the function `get_cleaned_lists(filename)` so that it returns a list of lists, each containing the fields from the data file, with quotes, commas, and leading/trailing whitespace characters removed, any parenthesised text removed, and the daily tests ratio as a float.

Note: Your checked functions may call preceding functions that you have written - they do not have to all be in one long function. As always, however, you should ensure you validate your function with a "clean" kernel.
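One way to cross-check a hand-written cleaner is to compare it against the standard library. The sketch below is not the method this lab asks for: it swaps in `csv.reader` and `re.sub`, and the helper name `clean_rows` and the three sample rows are made up for illustration (the rows are modelled on the output shown above, not copied from the file).

```python
import csv
import io
import re

def clean_rows(lines):
    """Return cleaned rows from an iterable of CSV text lines (hypothetical helper)."""
    rows = []
    for fields in csv.reader(lines):  # csv.reader handles commas inside quoted fields
        cleaned = []
        for field in fields:
            field = re.sub(r"\([^)]*\)", "", field)  # drop parenthesised text
            field = field.replace(",", "").strip()   # drop commas and outer whitespace
            cleaned.append(field)
        rows.append(cleaned)
    # Convert the fourth column to a float on every row except the header.
    for row in rows[1:]:
        try:
            row[3] = float(row[3])
        except (ValueError, IndexError):
            pass  # leave empty or non-numeric fields untouched
    return rows

# Illustrative sample only; the real file has many more rows.
sample = io.StringIO(
    'Entity,Code,Date,Daily tests per thousand people (7-day smoothed) (tests per thousand)\n'
    'Argentina,ARG,"Feb 22, 2020",0\n'
    '"Argentina, tests performed",,"Mar 10, 2020",0.001\n'
)
rows = clean_rows(sample)
```

If the two approaches disagree on a row, that row is worth inspecting by hand. Note that this sketch can leave a double space where parenthesised text is removed from the middle of a field.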

In [None]:
def get_cleaned_lists(filename):
    data_lists = []
    with open(filename) as file:  # use the argument, not the global DATA
        for line in file:
            # Remove each parenthesised group, along with the character before it.
            while '(' in line:
                first = line.index('(')
                last = line.index(')', first+1)
                field = line[first-1:last+1]
                line = line.replace(field,'')
            # Strip the quotes from each quoted field and drop the commas inside it.
            while '"' in line:
                first = line.index('"')
                last = line.index('"', first+1)
                field = line[first:last+1]
                clean_field = field.replace(',',"")[1:-1]
                line = line.replace(field, clean_field)
            data_lists.append(line.strip().split(','))

    # Convert the daily tests ratio to a float on every row after the header.
    is_first_line = True
    for line in data_lists:
        if is_first_line:
            is_first_line = False
        else:
            if line[3] != '' and line[3][0].isnumeric():
                line[3] = float(line[3])

    return data_lists

In [None]:
from nose.tools import assert_equal
DATA = "daily-tests-per-thousand-people-smoothed-7-day-20200729.csv"
assert_equal(get_cleaned_lists(DATA)[0],['Entity', 'Code', 'Date', 'Daily tests per thousand people'])
assert_equal(get_cleaned_lists(DATA)[5],['Argentina', 'ARG', 'Feb 22 2020', 0.0])
assert_equal(get_cleaned_lists(DATA)[40],['Argentina', 'ARG', 'Mar 28 2020', 0.008])
assert_equal(get_cleaned_lists(DATA)[180],['Argentina tests performed', '', 'Mar 10 2020', 0.0])
print("So far, so good. Remember there will be additional tests applied.")


In [None]:
# For marking use only.


- do not include the first line
- take care of empty space
- take care of first digit being number
- change to float


In [None]:
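first_line = True
for line in data_lists:
    if first_line:
        first_line = False  # skip the header row
    else:
        # Convert only non-empty string fields that start with a digit.
        if isinstance(line[3], str) and line[3] != '' and line[3][0].isnumeric():
            line[3] = float(line[3])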

## Data Presentation - Using Dictionaries

*For this part it is left to you to break down the task into subtasks and test them as you go.*

We want to be able to access the daily test ratio for a country on a given date without using loops.

Dictionaries provide *much* faster access to data by *hashing* the dictionary keys.
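To make the difference concrete, here is a minimal sketch that contrasts a loop-based lookup with a dictionary lookup. The miniature `rows` list and the helper name `lookup_by_scan` are assumptions for illustration. It uses a flat dictionary keyed by a `(country, date)` tuple rather than the nested structure the exercise asks for.

```python
# Hypothetical miniature data set in the cleaned list-of-lists shape.
rows = [
    ['Entity', 'Code', 'Date', 'Daily tests per thousand people'],
    ['Australia', 'AUS', 'Mar 29 2020', 0.382],
    ['Australia', 'AUS', 'Mar 30 2020', 0.396],
]

def lookup_by_scan(rows, country, date):
    """Linear search: examines rows one at a time until a match is found."""
    for entity, _code, row_date, ratio in rows[1:]:
        if entity == country and row_date == date:
            return ratio
    return None

# Build the dictionary once; each later lookup hashes the key instead of scanning.
lookup = {(row[0], row[2]): row[3] for row in rows[1:]}

scan_result = lookup_by_scan(rows, 'Australia', 'Mar 30 2020')
dict_result = lookup[('Australia', 'Mar 30 2020')]
```

Both approaches return the same value. The scan's cost grows with the number of rows, while the dictionary lookup stays roughly constant.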

* Store the daily tests data in a dictionary of dictionaries. The outer dictionary should use the countries as keys. The inner dictionary should use the dates as keys.

For example, if my outer dictionary is called `country_dict`, then evaluating `country_dict["Australia"]` should return:
    
```
{'Mar 29 2020': 0.382,
 'Mar 30 2020': 0.396,
 'Mar 31 2020': 0.41,
 'Apr 1 2020': 0.425,
 'Apr 2 2020': 0.439,
 'Apr 3 2020': 0.453,
 'Apr 4 2020': 0.467,
 ...
```
and `country_dict["Australia"]["Jul 1 2020"]` should return:
```
1.824
```






In [113]:
country_dict = {}
for line in data_lists[1:]:
    country, date, test_ratio = line[0],line[2],line[3]
    if country not in country_dict:
        country_dict[country] = {}
    country_dict[country][date] = test_ratio

country_dict

{'Argentina': {'Feb 18 2020': 0.0,
  'Feb 19 2020': 0.0,
  'Feb 20 2020': 0.0,
  'Feb 21 2020': 0.0,
  'Feb 22 2020': 0.0,
  'Feb 23 2020': 0.0,
  'Feb 24 2020': 0.0,
  'Feb 25 2020': 0.0,
  'Feb 26 2020': 0.0,
  'Feb 27 2020': 0.0,
  'Feb 28 2020': 0.0,
  'Feb 29 2020': 0.0,
  'Mar 1 2020': 0.0,
  'Mar 2 2020': 0.0,
  'Mar 3 2020': 0.0,
  'Mar 4 2020': 0.0,
  'Mar 5 2020': 0.0,
  'Mar 6 2020': 0.0,
  'Mar 7 2020': 0.0,
  'Mar 8 2020': 0.0,
  'Mar 9 2020': 0.0,
  'Mar 10 2020': 0.0,
  'Mar 11 2020': 0.0,
  'Mar 12 2020': 0.0,
  'Mar 13 2020': 0.001,
  'Mar 14 2020': 0.001,
  'Mar 15 2020': 0.001,
  'Mar 16 2020': 0.001,
  'Mar 17 2020': 0.002,
  'Mar 18 2020': 0.002,
  'Mar 19 2020': 0.002,
  'Mar 20 2020': 0.003,
  'Mar 21 2020': 0.004,
  'Mar 22 2020': 0.004,
  'Mar 23 2020': 0.005,
  'Mar 24 2020': 0.005,
  'Mar 25 2020': 0.006,
  'Mar 26 2020': 0.007,
  'Mar 27 2020': 0.008,
  'Mar 28 2020': 0.008,
  'Mar 29 2020': 0.009,
  'Mar 30 2020': 0.01,
  'Mar 31 2020': 0.011,
  'Apr 1 2020

In [24]:
country_dict["Australia"]["Jul 1 2020"]

1.824

### Presenting rankings in a table

* Finally, write a function `print_ranking(date)` that takes a date (as a string) and prints a table of testing results for that date, ranked from highest testing rate to the lowest.

So, for example, `print_ranking("Jul 1 2020")` will start as follows:
    
```
Testing results for Jul 1 2020
8.032 	 Luxembourg
5.653 	 United Arab Emirates
5.055 	 Bahrain
...
```
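The table layout above can be sketched as follows. This is one possible approach under stated assumptions, not a model answer: the name `print_ranking_table` is hypothetical, the dictionary is passed in explicitly instead of being read from a global `country_dict`, and `sample` is a tiny hand-built dictionary using values that appear in this notebook.

```python
def print_ranking_table(date, country_dict):
    """Print ratio/country pairs for `date`, highest ratio first (illustrative helper)."""
    ranked = sorted(
        ((days[date], country)
         for country, days in country_dict.items()
         if date in days),  # skip countries with no entry for this date
        reverse=True,
    )
    print(f"Testing results for {date}")
    for ratio, country in ranked:
        print(f"{ratio} \t {country}")
    return ranked

# Tiny hand-built dictionary; Finland has no entry for Jul 1 2020 here.
sample = {
    'Bahrain': {'Jul 1 2020': 5.055},
    'Luxembourg': {'Jul 1 2020': 8.032},
    'United Arab Emirates': {'Jul 1 2020': 5.653},
    'Finland': {'Jul 15 2020': 0.714},
}
ranked = print_ranking_table('Jul 1 2020', sample)
```

Putting the ratio first in each tuple lets `sorted` rank by ratio without a `key` function. This assumes every stored ratio is a float; a ratio left as a string would make the comparison fail.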

In [124]:
country_ranking_ratio = []

date = "Jul 1 2020"

for country in country_dict:
    if date in country_dict[country]:
        country_ranking_ratio.append((country_dict[country][date], country))
            
    


In [126]:
sorted(country_ranking_ratio)
        
        


[(0.007, 'Taiwan'),
 (0.009, 'Democratic Republic of Congo'),
 (0.013, 'Nigeria'),
 (0.027, 'Zimbabwe'),
 (0.028, 'Myanmar'),
 (0.028, 'Thailand'),
 (0.03, 'Thailand tests performed'),
 (0.033, 'Ethiopia'),
 (0.033, 'Tunisia'),
 (0.041, 'Indonesia'),
 (0.043, 'Japan'),
 (0.045, 'Togo'),
 (0.051, "Cote d'Ivoire"),
 (0.053, 'Uganda'),
 (0.058, 'Kenya'),
 (0.06, 'Japan tests performed'),
 (0.063, 'Senegal'),
 (0.075, 'Fiji'),
 (0.087, 'Mexico'),
 (0.095, 'Ghana'),
 (0.1, 'Pakistan'),
 (0.106, 'Ecuador'),
 (0.107, 'Bangladesh'),
 (0.127, 'Peru'),
 (0.141, 'Philippines'),
 (0.145, 'Costa Rica'),
 (0.153, 'Bolivia'),
 (0.153, 'India'),
 (0.163, 'Argentina'),
 (0.196, 'Nepal'),
 (0.198, 'Paraguay'),
 (0.199, 'Cuba'),
 (0.202, 'Argentina tests performed'),
 (0.207, 'Slovakia'),
 (0.216, 'South Korea'),
 (0.234, 'Croatia'),
 (0.236, 'Ukraine'),
 (0.242, 'Hungary'),
 (0.29, 'Uruguay'),
 (0.291, 'Rwanda'),
 (0.307, 'Malaysia'),
 (0.324, 'Iran'),
 (0.335, 'Greece'),
 (0.357, 'Colombia'),
 (0.361, 

In [95]:
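def print_ranking(date):
    # Collect (ratio, country) pairs for every country reporting on this date.
    country_test_ratio = []
    for country in country_dict:
        if date in country_dict[country]:
            country_test_ratio.append((country_dict[country][date], country))

    # Note: this returns the ranked list rather than printing the table.
    return sorted(country_test_ratio, reverse=True)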

In [97]:
print_ranking("Jul 1 2020")

[(8.032, 'Luxembourg'),
 (5.653, 'United Arab Emirates'),
 (5.055, 'Bahrain'),
 (2.61, 'Denmark'),
 (2.316, 'United States tests performed '),
 (2.038, 'Singapore samples tested'),
 (2.01, 'Russia'),
 (1.97, 'Israel'),
 (1.869, 'United States'),
 (1.824, 'Australia'),
 (1.807, 'Malta'),
 (1.754, 'Belarus'),
 (1.689, 'Maldives'),
 (1.392, 'United Kingdom'),
 (1.355, 'Qatar'),
 (1.28, 'Lithuania'),
 (1.264, 'Portugal'),
 (1.208, 'Saudi Arabia'),
 (1.135, 'Switzerland'),
 (1.093, 'New Zealand'),
 (1.087, 'Canada'),
 (1.078, 'Sweden people tested'),
 (1.066, 'Serbia'),
 (1.026, 'Kazakhstan'),
 (1.024, 'Singapore'),
 (0.949, 'Iceland'),
 (0.864, 'Kuwait'),
 (0.841, 'Chile'),
 (0.838, 'Latvia'),
 (0.823, 'Belgium'),
 (0.817, 'Germany'),
 (0.8, 'Italy'),
 (0.749, 'Ireland'),
 (0.671, 'Austria'),
 (0.657, 'Oman'),
 (0.641, 'Panama'),
 (0.602, 'South Africa'),
 (0.597, 'France'),
 (0.594, 'Turkey'),
 (0.594, 'Norway'),
 (0.554, 'Poland'),
 (0.553, 'Netherlands'),
 (0.535, 'Spain'),
 (0.529, 'Ro

### Checking the news

The White House statement cited in the introduction above was made on April 28th.

What do you make of President Trump's statements *from a per capita perspective*?

How does that compare with more recent data?


In a May 11 [Rose Garden briefing](https://www.whitehouse.gov/briefings-statements/remarks-president-trump-press-briefing-covid-19-testing/) President Trump stated:

> We’re testing more people per capita than South Korea, the United Kingdom, France, Japan, Sweden, Finland, and many other countries — and, in some cases, combined.

The BBC's May 15th article [Coronavirus: President Trump’s testing claims fact-checked](https://www.bbc.com/news/world-us-canada-52493073) "fact-checks" this claim (Claim One).

* Modify your function to the signature `print_ranking(date, countries=[])` so that:
  * if `countries` is omitted, it still prints the table for *all* countries reporting on that day
  * if a list of countries is passed to the function, then it only prints the table for those countries

Print the table for the US and those six countries on 11th May.

Is the BBC's fact check for the 11th May borne out by these data?

Can you think of a possible reason for these discrepancies? (Hint: Should "we're testing" be interpreted as a rate or as a cumulative total?)
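To explore the hint, one rough way to move from a daily rate to a cumulative figure is to add up the daily values up to a given date. The sketch below assumes dates are stored in the `'May 11 2020'` format used in this lab. The helper name `cumulative_per_thousand` is hypothetical, and `sample_days` holds just four of the South Korea values shown earlier. Because the series is 7-day smoothed and starts on different dates for different countries, the sum is only an approximation of the true cumulative total.

```python
from datetime import datetime

DATE_FORMAT = "%b %d %Y"  # e.g. 'May 11 2020'; assumes English month abbreviations

def cumulative_per_thousand(days, up_to):
    """Sum the daily ratios dated on or before `up_to` (illustrative helper)."""
    cutoff = datetime.strptime(up_to, DATE_FORMAT)
    return sum(
        ratio
        for date, ratio in days.items()
        # Skip any ratio that was never converted to a float.
        if isinstance(ratio, float)
        and datetime.strptime(date, DATE_FORMAT) <= cutoff
    )

# Four days of data only; a real comparison would use a country's full series.
sample_days = {
    'May 9 2020': 0.089,
    'May 10 2020': 0.089,
    'May 11 2020': 0.089,
    'May 12 2020': 0.092,
}
total = cumulative_per_thousand(sample_days, 'May 11 2020')
```

Parsing the dates matters here: comparing the date strings directly would order them alphabetically, not chronologically.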

In [16]:
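def print_ranking(date, countries=[]):
    countries_ratio = []

    for country in country_dict:
        # An empty list means "no filter": include every country reporting on that date.
        if (not countries or country in countries) and date in country_dict[country]:
            countries_ratio.append((country_dict[country][date], country))

    countries_ratio.sort(reverse = True)

    return countries_ratio


In [17]:
print_ranking("Jul 1 2020", countries=['Australia', 'Argentina'])

[(1.824, 'Australia'), (0.163, 'Argentina')]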

On 22nd June Newsweek, in [Why Trump Is Both Right and Wrong About U.S. Coronavirus Testing Numbers](https://www.newsweek.com/why-trump-right-wrong-about-us-coronavirus-testing-1512472), compares the US with Russia, Spain, Germany and Portugal on cumulative per capita figures. 

How does this compare with the picture you get for the daily rate at this date?

Congratulations - you can now get a job as a fact-checking journalist!

&copy; Cara MacNish, University of Western Australia