Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

# Final Project

This is your final project. It accounts for up to 20% of the final grade.

**Instructions**
* This project can be completed **in pairs (two students)** or **alone**. If you work with a teammate, both team members receive the same evaluation. You are responsible for making sure that both team members contribute to the project.
* This project will be partly **auto-graded** and partly **manually graded**: the auto-grading checks that your answers to the questions are correct (or close to correct), and the manual grading checks your Python coding and visualization style. **If your submission fails the auto-grade, you will get 0.**

**Note**
* Write your code after you see `# YOUR CODE HERE` 
* Read the instructions for each question. You have a **limited time to submit: Month Date Time**. Only your last submission counts.
* Copying the solution of other students is forbidden.
* For each exercise example, the symbol `->` indicates the value the function should return.
* After the deadline, submission is only possible by email attachment (.ipynb file) to xxx and cc xxx. Late submissions will be penalized (up to 100% if more than 72 hours late).

**Project Description**

The file `"HKcovid_individual.csv"` was downloaded (as of 18 April 2022) from the [Hong Kong government website](https://data.gov.hk/en-data/dataset/hk-dh-chpsebcddr-novel-infectious-agent/resource/dc602108-884a-4af3-bdf3-75b32fe8b5b3). It reports details of the individual COVID-19 cases recorded in the city since the onset of the outbreak.

To learn more about the situation in the city, your project focuses on the following two parts:
1. Process and analyze data to answer ten questions 
2. Create three descriptive plots

**Guidelines**

* The data may not be fully clean, that is, it may be inconsistent in types (e.g. Age not always available) or inconsistent in record information (e.g. Asymptomatic cases reported differently in different fields). 
* Read the instructions and hints carefully and apply your best judgement to answer the questions.
* For each question, write Python code that computes the answer and assign the result to a variable.
* For Part 1, you need to store your answers in the 'answers' dictionary with keys "1", "2", "3", ... "10" and corresponding values (the answer to each question).
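
As a minimal sketch of what handling an inconsistently typed column can look like, the example below converts a text column that mixes numbers with markers such as `'<1'` and `'Pending'` into numeric values. The `toy` DataFrame and the `age_num` column name are invented for illustration and are not taken from the real file; whether coercing non-numeric entries to missing values is the right choice depends on what each question asks you to skip.

```python
import pandas as pd

# Toy data invented for illustration; NOT taken from HKcovid_individual.csv.
toy = pd.DataFrame({"Age": ["34", "<1", "71", "Pending", "8"]})

# errors="coerce" turns every entry that cannot be parsed as a number into NaN.
toy["age_num"] = pd.to_numeric(toy["Age"], errors="coerce")

# Series.mean() skips NaN by default, so the two non-numeric rows are ignored.
mean_age = toy["age_num"].mean()
n_skipped = toy["age_num"].isna().sum()
```

Inspecting a column with `value_counts(dropna=False)` before converting it is a simple way to see which non-numeric markers are actually present.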

In [None]:
# You will need to import these two packages.
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# This is your dataset.
data = pd.read_csv("HKcovid_individual.csv")
data.head()

## Questions (40 points)

Run preliminary data analysis on the dataset to answer the following questions:
1. How many unique reported cases are in the dataset?
2. How many of those reported cases are asymptomatic? (see "Case status*" column)
3. What is the ratio of the number of female cases to the number of male cases?
4. How many people are deceased?
5. How many asymptomatic patients (see "Case status*" column) are deceased?
6. What is the average age among all cases? (Skip all cases with age '<1' or 'Pending')
7. What is the average age of deceased people?
8. How many cases are either 'Imported cases' or 'Epidemiologically linked with imported case'?
9. What is the ratio of 'HK resident' cases to 'Non-HK resident' cases?
10. (Advanced) Which year-month has the highest number of cases? (Example: '2021-01')

Write all your answers in a dictionary named `answers`. This dictionary has keys `"1", "2", "3", ... "10"` and values (the answer to each question), respectively.
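
The shape of that dictionary can be sketched as follows. The name `example_answers` is hypothetical and is used only so that this sketch does not overwrite your real `answers` dictionary; the `None` values are placeholders, not results.

```python
# Build the ten string keys "1" ... "10"; note that the keys are strings, not integers.
# example_answers is a hypothetical name; the graded dictionary must be named `answers`.
example_answers = {str(i): None for i in range(1, 11)}

# Each placeholder would then be replaced by the variable holding your result, e.g.
# example_answers["1"] = my_result_for_question_1   (hypothetical variable name)
keys = list(example_answers)
```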

In [None]:
# 1. How many unique reported cases are in the dataset?

# YOUR CODE HERE


In [None]:
# 2. How many of those reported cases are asymptomatic? (see "Case status*" column)

# YOUR CODE HERE


In [None]:
# 3. What is the ratio of the number of female cases to the number of male cases?

# YOUR CODE HERE


In [None]:
# 4. How many people are deceased?

# YOUR CODE HERE


In [None]:
# 5. How many asymptomatic patients (see "Case status*" column) are deceased?

# YOUR CODE HERE


In [None]:
# 6. What is the average age among all cases? (Skip all cases with age '<1' or 'Pending')

# YOUR CODE HERE


In [None]:
# 7. What is the average age of deceased people?

# YOUR CODE HERE


In [None]:
# 8. How many cases are either 'Imported cases' or 'Epidemiologically linked with imported case'?

# YOUR CODE HERE


In [None]:
# 9. What is the ratio of 'HK resident' cases to 'Non-HK resident' cases?

# YOUR CODE HERE


In [None]:
# 10. (Advanced) Which year-month has the highest number of cases? (Example: '2021-01')
# Save the answer as a string. For example, year_month = '2021-01' for January 2021.

# YOUR CODE HERE


In [None]:
# Write all your answers in a dictionary named answers.
# Important: Your dictionary MUST be named 'answers'.
# This dictionary has keys "1", "2", "3", ... "10" and values (the answer to each question), respectively.

# YOUR CODE HERE


In [None]:
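# This cell should run without errors once `answers` is defined.
try:
    answers["1"]
except Exception:
    # Raised when `answers` is missing, is not a dictionary, or has no key "1".
    raise NotImplementedError()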

## Visualization (60 points)

Below are four plots that have been produced using the dataset. You need to replicate **any three of the four plots** to receive full marks. Your Python coding, visualization style, and the appearance of the plots will be manually graded.

Note that the plots below are ordered from easiest to hardest to replicate. Each plot is worth 20 points. This part is worth 60 points in total, so replicating all four plots correctly will not earn extra points.

Read the **hints** carefully to help you get started.

![plot1](plot1.png)

![plot2](plot2.png)

![plot3](plot3.png)

![plot4](plot4.png)

In [None]:
# Hint for plot 1:
# You need to convert the date (stored as a string) into datetime format
data["date_reported"] = pd.to_datetime(data["Report date"], format="%d/%m/%Y")

# You should use the groupby method on the new date column
data.groupby("date_reported").size()

# YOUR CODE HERE
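
To make the hint for plot 1 concrete, here is a minimal sketch on invented data showing how grouping by a parsed date yields one count per day that can be drawn directly. The `toy` DataFrame, its values, and the axis labels are assumptions for illustration only; the target styling of plot 1 is not reproduced here.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data invented for illustration; NOT taken from HKcovid_individual.csv.
toy = pd.DataFrame(
    {"Report date": ["23/01/2020", "23/01/2020", "24/01/2020", "02/02/2020"]}
)

# Parse day/month/year strings into datetime values.
toy["date_reported"] = pd.to_datetime(toy["Report date"], format="%d/%m/%Y")

# One row per distinct date, holding the number of records on that date.
daily = toy.groupby("date_reported").size()

# Draw the daily counts; in a notebook the figure renders at the end of the cell.
fig, ax = plt.subplots()
daily.plot(ax=ax)
ax.set_xlabel("Report date")
ax.set_ylabel("Number of cases (toy data)")
```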


In [None]:
# Hint for plot 2:
# a pandas series has a method .cumsum()

# YOUR CODE HERE
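
As a small illustration of the `.cumsum()` hint, the sketch below turns a series of per-day counts into a running total. The `daily` series is invented for this example and does not come from the dataset.

```python
import pandas as pd

# Toy per-day counts invented for illustration.
daily = pd.Series(
    [2, 1, 4],
    index=pd.to_datetime(["2020-01-23", "2020-01-24", "2020-01-25"]),
)

# cumsum() keeps the same index and replaces each value with the running total.
cumulative = daily.cumsum()
```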


In [None]:
# Hint for plot 3:
# a second y-axis can be added with ax2 = ax1.twinx() 

# YOUR CODE HERE
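
The `twinx()` hint can be sketched as follows on invented data. Which two quantities plot 3 actually combines is not stated in the hint, so pairing daily counts with their running total is an assumption made purely for illustration, as are the colors and labels.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy per-day counts invented for illustration.
daily = pd.Series(
    [2, 1, 4],
    index=pd.to_datetime(["2020-01-23", "2020-01-24", "2020-01-25"]),
)
cumulative = daily.cumsum()

fig, ax1 = plt.subplots()

# Left y-axis: daily counts as bars.
ax1.bar(daily.index, daily.values, color="tab:blue")
ax1.set_ylabel("Daily count (toy data)")

# Right y-axis: a second Axes that shares the same x-axis.
ax2 = ax1.twinx()
ax2.plot(cumulative.index, cumulative.values, color="tab:red")
ax2.set_ylabel("Cumulative count (toy data)")

# Rotate the date labels so they do not overlap.
fig.autofmt_xdate()
```

Because the two axes have independent scales, labelling each y-axis clearly helps the reader tell which series belongs to which side.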


In [None]:
# Hint for plot 4:
# pandas has a crosstab function. It is a good idea to use it before plotting.
pd.crosstab(data["Age"], data["Gender"])

# YOUR CODE HERE
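
For the `crosstab` hint, a minimal sketch on invented data is shown below. Grouping ages into bins with `pd.cut` before cross-tabulating is an assumption on my part (the hint itself cross-tabulates the raw `Age` column), and the bin edges, labels, and `toy` values are hypothetical.

```python
import pandas as pd

# Toy data invented for illustration; NOT taken from HKcovid_individual.csv.
toy = pd.DataFrame(
    {
        "Age": [5, 34, 36, 71, 68, 12],
        "Gender": ["F", "F", "F", "M", "F", "M"],
    }
)

# Hypothetical age bins: (0, 20], (20, 60], (60, 120].
age_group = pd.cut(toy["Age"], bins=[0, 20, 60, 120], labels=["0-20", "21-60", "61+"])

# Rows are age groups, columns are genders, cells are record counts.
table = pd.crosstab(age_group, toy["Gender"])

# A grouped bar chart with one bar per gender within each age group.
ax = table.plot(kind="bar")
ax.set_ylabel("Number of cases (toy data)")
```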
