# Practical 4: Data Processing with Python

By the end of this practical you will be able to:

* Perform basic data manipulation using Python generators and iterators.
* Perform basic data manipulation and visualization using pandas DataFrames.


## Getting familiar with Jupyter Notebook

Jupyter Notebook is an interactive execution environment for code, e.g. Python. 
A notebook consists of several cells. There are at least two types of cells. 

1. Markdown cell, e.g. the one you are reading now. It gives you information such as instructions, examples and hints. It does not contain code.
1. Python code cell, e.g. the cell below. It contains the code that we would like to edit and run.

To run a cell, press the "play" icon or press Shift-Enter.

In [None]:
print("This is a Python code cell.")

## Importing the needed libraries

Let's get started with the actual practical. First we would like to import the needed libraries.

Run the following Python code cell to import the libraries.

In [None]:
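from math import *  # a library that provides math functions
from functools import reduce  # folds a sequence into a single value
import pandas  # provides the DataFrame used later as pandas.DataFrame
import datetime  # provides the date type used in data cleaning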

## Data collection

We analyze the COVID-19 cases around the world. 
The dataset is obtained from https://covid.ourworldindata.org/data/full_data.csv

For convenience we have downloaded a copy. You can find it in the same folder as this notebook.


## Exercise 1
In the following cell, use Python to open the file `full_data.csv`, store the content in a string variable called `content` and print out the `content`.

In [None]:
# TODO:


## Using a generator

Instead of reading everything from the file and storing it in a string or list, it can be better to read the content one record at a time using a Python generator.
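
As a minimal sketch of the idea, the generator below produces values one at a time instead of building a whole list first. The function name `squares_up_to` and its values are illustrative only and are not part of the practical's code.

```python
def squares_up_to(limit):
    # Yield 1*1, 2*2, ..., limit*limit lazily, one value per request.
    for n in range(1, limit + 1):
        yield n * n

squares = squares_up_to(4)   # no square has been computed yet
first = next(squares)        # computes and returns the first square
second = next(squares)       # computes and returns the second square
rest = list(squares)         # consumes whatever values are left
print(first, second, rest)
```

Each call to `next` runs the function body only up to the next `yield`, so memory use stays small even when the source is large.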

The following function `data_gen` takes a file path as input and returns a generator. Each element produced by the generator is a dictionary of the form 

`{'date': '2020-02-25', 'location': 'Afghanistan', 'new_cases': '', 'new_deaths': '', 'total_cases': '1', 'total_deaths': ''}` 
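
How a dictionary of this form can be built from a header line and a data line is sketched below. The three-column header is a shortened, hypothetical one chosen for illustration; the real file has more columns.

```python
# Hypothetical, shortened header and data line (the real CSV has more columns).
headers = ["date", "location", "total_cases"]
line = "2020-02-25,Afghanistan,1\n"

# Strip the line ending, split on commas, then pair each header with its value.
fields = line.rstrip("\r\n").split(",")
record = dict(zip(headers, fields))
print(record)
```

Note that every value is still a string at this point, which is why a cleaning step is needed later.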

In [None]:
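def listpair2dict(list_pair):
    # Build a dictionary from an iterable of (key, value) pairs.
    d = {}
    for k, v in list_pair:
        d[k] = v
    return d

def data_gen(filename):
    # Lazily yield one dictionary per data row of a comma-separated file.
    headers = []
    with open(filename, 'r') as f:  # the file is closed once the generator finishes
        for count, l in enumerate(f):
            tabs = l.rstrip('\r\n').split(',')
            if count == 0:
                headers = tabs  # the first line holds the column names
            else:
                yield listpair2dict(zip(headers, tabs))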

To test the generator, we define a `data_stream_test` variable.

In [None]:

data_stream_test = data_gen("./full_data.csv")

Run the following to observe the result.

In [None]:
print(next(data_stream_test))
print(next(data_stream_test))

## Question

What would happen if we re-run the cell above over and over?


## Data cleaning


We can't use the data until it is clean.

There are a few issues here. 

1. The `date` field is a string. We need to convert its value into Python's `date` object. 
2. The `new_cases` field is a string. We need to convert its value to an integer.
3. The `total_cases` field is a string. We need to convert its value to an integer.

For requirements 1 and 2, the code is provided for you below. 
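
To see why the conversion in requirement 1 is useful, here is a small sketch on a single date string. It assumes Python 3.7 or later for `datetime.date.fromisoformat`, which is shown only as an alternative to splitting the string by hand.

```python
import datetime

raw = "2020-02-25"  # a date string in the same format as the dataset

# Manual conversion: split into year, month and day, then build a date object.
y, m, d = raw.split("-")
parsed = datetime.date(int(y), int(m), int(d))

# Alternative for ISO-formatted strings (Python 3.7+).
same = datetime.date.fromisoformat(raw)

# Date objects support arithmetic, which plain strings do not.
days_until_cutoff = (datetime.date(2020, 3, 10) - parsed).days
print(parsed, same, days_until_cutoff)
```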

## Exercise 2

Complete the `format_total_cases()` function to achieve requirement number 3.


In [None]:
def format_date(record):
    date_str = record["date"]
    y,m,d = date_str.split('-')
    date = datetime.date(int(y),int(m),int(d))
    record["date"]= date
    return record

def format_new_cases(record):
    new_cases_str = record["new_cases"]
    new_cases = 0 if new_cases_str == '' else int(new_cases_str)
    record["new_cases"] = new_cases
    return record

def format_total_cases(record):
    return record # TODO: fixme
    
def clean_data(records):
    for record in records:
        record = format_date(record)
        record = format_new_cases(record)
        record = format_total_cases(record)
        yield record

In [None]:
data_stream = data_gen("./full_data.csv")
cleaned_data_stream=clean_data(data_stream)
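
Chaining generators like this does not process any record yet; records are pulled through the chain only when something asks for the next one. The sketch below illustrates this with two made-up stages, `numbers` and `doubled`, which are hypothetical and not part of the practical's code.

```python
log = []  # records which source values have actually been produced

def numbers():
    # Source stage: note each value in the log as it is produced.
    for n in [1, 2, 3]:
        log.append(n)
        yield n

def doubled(items):
    # Processing stage: transform each item from the upstream generator.
    for item in items:
        yield item * 2

pipeline = doubled(numbers())
produced_before = len(log)   # the source has not run at all yet
first = next(pipeline)       # pulls exactly one value through both stages
produced_after = len(log)
print(produced_before, first, produced_after)
```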

## Exercise 3

Create a data iterator which keeps only the records whose `location` equals "Singapore".


In [None]:
def filter_data(records, country):
    for record in records:
        yield record # TODO: FIXME

filtered_cleaned_data_stream = filter_data(cleaned_data_stream, "Singapore")

## Putting the data into a dataframe

The pandas library provides a convenient API for processing and visualizing data. 

First we load the filtered and cleaned data into a dataframe object.

In [None]:
covid19_data = pandas.DataFrame(filtered_cleaned_data_stream)

Let's check the first few rows of the data frame. Does it look similar to an Excel table?

In [None]:
covid19_data.head()
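
The way a stream of dictionaries becomes a table can be sketched with a tiny generator. The records and the name `toy_records` below are made up for illustration and are not taken from `full_data.csv`.

```python
import pandas

def toy_records():
    # Hypothetical records; each dictionary becomes one row.
    yield {"location": "A", "total_cases": 1}
    yield {"location": "B", "total_cases": 5}

toy_frame = pandas.DataFrame(toy_records())
column_names = toy_frame.columns.tolist()  # dictionary keys become column names
row_count = len(toy_frame)
print(column_names, row_count)
```

Building the data frame consumes the generator, so the same stream cannot be loaded a second time without recreating it.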

## Data Visualization

We are now able to plot a line graph of the data. First, we use the `date` column as the index.

In [None]:
covid19_data_indexed = covid19_data.set_index("date")

We use the pandas DataFrame filter syntax to keep only the records after 10th March 2020.

In [None]:
covid19_data_indexed[covid19_data_indexed.index>datetime.date(2020, 3, 10)].plot.line()
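
The filter syntax works by building a sequence of `True`/`False` values, one per row, and keeping the rows marked `True`. Here is a small sketch on a toy data frame; the dates and case counts are made-up values for illustration only.

```python
import datetime
import pandas

# Toy data with illustrative values, not real figures.
toy = pandas.DataFrame({
    "date": [datetime.date(2020, 3, 9),
             datetime.date(2020, 3, 10),
             datetime.date(2020, 3, 11)],
    "total_cases": [150, 160, 166],
})
toy_indexed = toy.set_index("date")

# Comparing the index with a date gives one boolean per row.
mask = toy_indexed.index > datetime.date(2020, 3, 10)

# Indexing with the mask keeps only the rows where it is True.
after_cutoff = toy_indexed[mask]
print(after_cutoff)
```

Because the comparison is strict (`>`), the row for 10th March itself is excluded.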

## Exercise 4

Generate another graph for Malaysia. 

In [None]:
#TODO