### Introduction

In [1]:
# CNN just published a news article stating that, as of today, 1 in 4 people across the world are under COVID-19 restrictions.
# This isn't surprising: for almost 2 months now, the news headlines have been about the new virus and how it has
# spread in a way we have never seen before!
# This sparked my curiosity to do a simple analysis of the confirmed cases over the past period to find out where the outbreak
# started and how infections are distributed among continents, and then to create a visualization of that distribution.

### The Dataset

In [2]:
# It wasn't hard to find a COVID-19 dataset to work on because this is a hot topic nowadays. I obtained my dataset from
# the data science competition website 'Kaggle', which is currently organizing a Machine Learning competition to predict
# COVID cases. This is the link to my dataset:

##https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset/version/36#time_series_covid_19_confirmed.csv

# It includes the number of confirmed cases by country since 12-Jan, the date on which hospitals started to record cases.

# I preferred to analyze the data by continent to get a higher-level view of the distribution of confirmed cases around the
# world, so I obtained another dataset from the internet that lists the world's countries and their continents.

# Continent dataset:
## https://datahub.io/JohnSnowLabs/country-and-continent-codes-list#data

# I used both the NumPy and pandas libraries to prepare my data. I merged the two datasets to add the continent data to the
# COVID dataset. Some countries, like Russia and Turkey, are listed in the continent sheet under 2 continents, so
# I assumed each one belongs to the first continent listed, for simplicity.
# As expected, the number of cases in Asia was very high, so I applied a log transformation to be able to visualize the data.

In [3]:
# Now let's start to prepare our data

import numpy as np
import pandas as pd

In [4]:
# Importing the dataset that contains the country name and the continent name.
continent=pd.read_csv('country-and-continent-codes-list-csv_csv.txt')
continent.head(3)

Unnamed: 0,Continent_Name,Continent_Code,Country_Name,Two_Letter_Country_Code,Three_Letter_Country_Code,Country_Number
0,Asia,AS,"Afghanistan, Islamic Republic of",AF,AFG,4.0
1,Europe,EU,"Albania, Republic of",AL,ALB,8.0
2,Antarctica,AN,Antarctica (the territory South of 60 deg S),AQ,ATA,10.0


In [5]:
# Importing the COVID dataset:
covid=pd.read_csv('COVID19_open_line_list.csv')
covid.head(3)

Unnamed: 0,ID,age,sex,city,province,country,wuhan(0)_not_wuhan(1),latitude,longitude,geo_resolution,...,Unnamed: 35,Unnamed: 36,Unnamed: 37,Unnamed: 38,Unnamed: 39,Unnamed: 40,Unnamed: 41,Unnamed: 42,Unnamed: 43,Unnamed: 44
0,1.0,30,male,"Chaohu City, Hefei City",Anhui,China,1.0,31.64696,117.7166,admin3,...,,,,,,,,,,
1,2.0,47,male,"Baohe District, Hefei City",Anhui,China,1.0,31.77863,117.3319,admin3,...,,,,,,,,,,
2,3.0,49,male,"High-Tech Zone, Hefei City",Anhui,China,1.0,31.828313,117.224844,point,...,,,,,,,,,,


In [6]:
# Now I need to reduce each country name in the continent dataset to the part before the ',' so that it matches the covid dataset:

continent['Country_Name']=continent['Country_Name'].apply(lambda x: x.split(',')[0])

# By checking the continent data, I found that the names of some countries are written differently than in the
# covid dataset, so I will modify them. Assigning the result back (instead of inplace=True on a single column)
# makes sure the change is stored in the DataFrame.

continent['Country_Name']=continent['Country_Name'].replace({'United States of America':'United States', 'Korea':'South Korea',
'Russian Federation':'Russia','United Kingdom of Great Britain & Northern Ireland': 'United Kingdom'})

# Now I will drop the repeated countries (simplifying assumption: each one belongs to the first continent listed):
continent=continent.drop_duplicates(subset='Country_Name',keep='first')
continent.head(3)


Unnamed: 0,Continent_Name,Continent_Code,Country_Name,Two_Letter_Country_Code,Three_Letter_Country_Code,Country_Number
0,Asia,AS,Afghanistan,AF,AFG,4.0
1,Europe,EU,Albania,AL,ALB,8.0
2,Antarctica,AN,Antarctica (the territory South of 60 deg S),AQ,ATA,10.0
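The name clean-up in this cell can be tried on a toy table first. The sketch below is only an illustration: the `toy` frame and its three rows are made up, and it assumes country names follow the 'short name, long form' pattern seen in the continent sheet.

```python
import pandas as pd

# Hypothetical miniature of the continent sheet (values invented for illustration).
toy = pd.DataFrame({
    'Continent_Name': ['Europe', 'Asia', 'Asia'],
    'Country_Name': ['Russian Federation', 'Russian Federation', 'Korea, Republic of'],
})

# Keep only the part before the first comma, then map to the names used in the case data.
toy['Country_Name'] = toy['Country_Name'].str.split(',').str[0]
toy['Country_Name'] = toy['Country_Name'].replace(
    {'Russian Federation': 'Russia', 'Korea': 'South Korea'})

# A country listed under two continents keeps only its first row.
toy = toy.drop_duplicates(subset='Country_Name', keep='first').reset_index(drop=True)
print(toy)
```

Because `keep='first'` depends on row order, the continent a two-continent country ends up in is decided by how the sheet happens to be sorted.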


In [7]:
# Then I will merge both datasets so that the covid dataset includes the continent name column:
merged=covid.merge(continent, how='left', left_on='country',right_on='Country_Name')

# and then keep only the columns of interest:
both=merged[['Country_Name','Continent_Name','date_confirmation']]
both.head(3)

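# Some rows don't have a confirmation date (or a matched continent), so I will exclude them by dropping rows with missing values.
# The .copy() makes both1 an independent DataFrame, so adding columns to it later does not raise a SettingWithCopyWarning.

both1=both.dropna(axis=0).copy()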

both1.head(3)

Unnamed: 0,Country_Name,Continent_Name,date_confirmation
0,China,Asia,22.01.2020
1,China,Asia,23.01.2020
2,China,Asia,23.01.2020
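A left merge silently leaves the continent blank for any country whose name has no exact match, and `dropna` then removes those rows. A minimal sketch of how such names could be listed before they disappear is shown here; the `cases` and `lookup` frames and their values are hypothetical stand-ins, not the real datasets.

```python
import pandas as pd

# Hypothetical stand-ins for the two datasets (values invented for illustration).
cases = pd.DataFrame({'country': ['China', 'Russia', 'Iran', 'Iran']})
lookup = pd.DataFrame({
    'Country_Name': ['China', 'Russia', 'Iran (Islamic Republic of)'],
    'Continent_Name': ['Asia', 'Europe', 'Asia'],
})

# indicator=True adds a '_merge' column saying whether each row found a partner.
checked = cases.merge(lookup, how='left', left_on='country',
                      right_on='Country_Name', indicator=True)

# Rows marked 'left_only' have no continent and would be dropped by dropna().
unmatched = checked.loc[checked['_merge'] == 'left_only', 'country'].value_counts()
print(unmatched)
```

Any name that shows up in `unmatched` is a candidate for the renaming dictionary used on the continent sheet.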


In [8]:
# Some values in the date column are date ranges (start date - end date). I'm only interested in the
# start date, so I will store it in a new column and change the data type to datetime.

both1['Date']=both1['date_confirmation'].apply(lambda x: x.split('-')[0])
both1['Date']=both1['Date'].apply(lambda x: pd.to_datetime(x,dayfirst=True))

# Now I will group the data per continent and date, counting the rows (cases) in each group:

grouped=both1.groupby(['Continent_Name',both1.Date]).agg(len)

grouped.head(3)


Unnamed: 0_level_0,Unnamed: 1_level_0,Country_Name,date_confirmation
Continent_Name,Date,Unnamed: 2_level_1,Unnamed: 3_level_1
Africa,2020-02-14,1,1
Africa,2020-02-25,1,1
Africa,2020-02-27,1,1
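The date handling in this cell can also be written with vectorized string methods. This is a sketch under assumptions: the `sample` frame is invented, and it assumes every value uses the dd.mm.yyyy style shown in the output above, with ranges separated by '-'.

```python
import pandas as pd

# Hypothetical sample in the same dd.mm.yyyy style, including one date range.
sample = pd.DataFrame({
    'Continent_Name': ['Asia', 'Asia', 'Europe'],
    'date_confirmation': ['22.01.2020', '22.01.2020 - 24.01.2020', '25.01.2020'],
})

# Keep the text before the first '-' (the start of a range), trim spaces,
# and parse with an explicit day-first format.
start = sample['date_confirmation'].str.split('-').str[0].str.strip()
sample['Date'] = pd.to_datetime(start, format='%d.%m.%Y')

# size() counts rows per (continent, date) group, i.e. confirmed cases per day.
daily = sample.groupby(['Continent_Name', 'Date']).size()
print(daily)
```

An explicit `format` makes the parser fail loudly on an unexpected value instead of guessing between day-first and month-first.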


In [9]:
# Now I will pivot the data so that each continent becomes a separate column, which makes it easier to plot:

grouped.reset_index(inplace=True)
grouped.set_index( 'Date',inplace=True)

grouped_pivot=grouped.pivot(columns='Continent_Name')
grouped_pivot.sum()

# Finally, I will resample the dates to weekly totals:
final=grouped_pivot.resample('W', closed='right').agg(np.sum).stack()
final.reset_index(inplace=True)
final.head(3)

Unnamed: 0,Date,Continent_Name,Country_Name,date_confirmation
0,2020-01-12,Africa,0.0,0.0
1,2020-01-12,Asia,1.0,1.0
2,2020-01-12,Europe,0.0,0.0
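To see what the pivot and the weekly resampling do, here is a small sketch on made-up numbers. The `daily` frame, its `cases` column and all values are hypothetical; `pivot_table` is used here instead of `pivot` only so that missing combinations can be filled with 0 in one step.

```python
import pandas as pd

# Hypothetical daily counts per continent (values invented for illustration).
daily = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-13', '2020-01-15', '2020-01-15', '2020-01-21']),
    'Continent_Name': ['Asia', 'Asia', 'Europe', 'Asia'],
    'cases': [1, 4, 2, 3],
})

# One column per continent; missing (date, continent) pairs become 0.
wide = daily.pivot_table(index='Date', columns='Continent_Name',
                         values='cases', aggfunc='sum', fill_value=0)

# 'W' bins end on Sunday, so each row is labelled with the Sunday that closes the week.
weekly = wide.resample('W').sum()
print(weekly)
```

Note that the resulting dates mark the end of each week (a Sunday), not its start.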


In [10]:
# The data values have outliers (Asia), so I will transform the data to a log scale to improve the visualization. Some
# continent/week combinations have 0 cases, and log(0) is undefined (NumPy returns -inf), so I will add 1 to all values first:

final['Country_log']=final['Country_Name']+1

# Renaming the Country_Name column (which now holds the case counts) for use in the chart:

final.rename(columns={'Country_Name': 'Continent_Cases'}, inplace=True)
final['# Cases']=final['Country_log'].apply(np.log)
final.reset_index(inplace=True)
final.head(3)

Unnamed: 0,index,Date,Continent_Name,Continent_Cases,date_confirmation,Country_log,# Cases
0,0,2020-01-12,Africa,0.0,0.0,1.0,0.0
1,1,2020-01-12,Asia,1.0,1.0,2.0,0.693147
2,2,2020-01-12,Europe,0.0,0.0,1.0,0.0
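The add-one-then-log step has a direct NumPy equivalent, `np.log1p`, and `np.expm1` undoes it. The sketch below uses an invented `counts` series just to show that both routes agree.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly counts, including a zero.
counts = pd.Series([0.0, 1.0, 999.0])

# Adding 1 before the log keeps zero counts finite: log(0 + 1) = 0.
manual = np.log(counts + 1)

# np.log1p computes log(1 + x) directly.
shortcut = np.log1p(counts)

# np.expm1 inverts the transform, recovering the original counts for labels or tooltips.
recovered = np.expm1(shortcut)
print(manual.round(3).tolist())
```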


### The visualization

In [11]:
# In this demonstration, I will create an interactive scatter plot using the Altair library.

# A scatter plot is a simple way to visualize multiple discrete data series at the same time. Each series gets a unique feature
# (e.g. color, marker shape, etc.) so that the series can be told apart at a glance.
# The interactive option I will use makes it easy to select each data series and view it apart from the other data
# for deeper analysis. I will also use a variety of techniques to make the chart more readable, like displaying the data for
# each point in the chart when you hover the mouse over it.

### The circumstances this visualization should be used in

In [12]:
# This type of chart can be used for complicated multivariate data, because it offers multiple channels for
# encoding the data, such as color, marker shape (including custom uploaded images) and marker size,
# and these channels can represent different types of data (nominal, continuous, etc.).

# It is also useful in this case because it has filtering and selection techniques that make it possible to display each aspect
# of the data separately, which helps in forming an informative view of the data from all sides.

### The Visualization Library

In [13]:
# For this chart, I'm going to use the Altair library.

# Altair is an open-source declarative statistical visualization library for Python. It provides an API oriented towards
# data scientists doing exploratory data analysis, and it is based on Vega and Vega-Lite, which are languages for creating, saving,
# and sharing interactive visualization designs as JSON specifications.
# It was developed by Jake Vanderplas and Brian Granger in close collaboration with the UW Interactive Data Lab.

# What makes it special is the wide variety of interactive tools and techniques for creating visualizations
# and for displaying, filtering and changing the way data is represented, which makes it very effective at presenting
# complicated data in a form that is easy to understand and interpret. The coding steps for this library are also simple and
# easy to follow.

# Altair can be installed with the command below:
# $ pip install altair vega_datasets
# and if you are using the conda package manager, you can install it with the command below:
# $ conda install -c conda-forge altair vega_datasets

# The documentation provides full information about the code used to create different visualizations:
# https://altair-viz.github.io/

# Altair also has a tutorial on GitHub that is rich in examples, and the repository accepts users' contributions
# of additional Jupyter notebook-based examples:
# Tutorial: https://github.com/altair-viz/altair_notebooks

### The general approach and limitations of the library

In [14]:
# Altair is a declarative library, which means that when plotting a chart, you only need to declare links
# between data columns and encoding channels, such as x-axis, y-axis, color, etc., and the rest of the plot details are
# handled automatically.

# It integrates with Jupyter and other notebook environments, so long as there is a web connection to load the required
# JavaScript libraries. It can also be used with various IDEs that are able to display Altair charts, and it can be used
# offline on most platforms with an appropriate frontend extension enabled.

# Its API does not do any visualization rendering per se. The Altair API creates a JSON data structure that follows the
# Vega-Lite specification, and that JSON can then be rendered by user interfaces like JupyterLab.

# I decided to use this library because it is brand new to me and I was eager to learn a new library from
# scratch. I was also attracted by the very nice variety of visualizations displayed in its documentation, with brand new
# options and tools.

# I have to say it wasn't as easy to find answers to my questions on the internet as it is with matplotlib; I believe this is
# the case with any new library.

# I also noticed slight differences in the naming of features compared to matplotlib, which can cause some confusion
# when searching for answers. For example, I was searching for marker shape but found out it is named 'mark' in Altair, 'alpha' in
# matplotlib is 'opacity' here, and so on. The good news is that there are ongoing enhancements being made to this library, as shown
# in the GitHub repo.

### The demonstration

In [15]:
# The interactive scatter plot approach gives users tools to select and view certain parts of the data,
# in addition to the normal x & y values. It uses the following:

# 1- Marker shapes and colors for different values. In my chart, I used 2 channels (marker color & shape) to display 1 variable
# (continent name), which is known as redundant encoding. I did so to explore the library features.

# 2- Marker size for representing the numerical values (in my chart, the number of cases).

# 3- A selector (can be a drop-down menu or a slider) to select and view each data series separately, which is a great
# option for multivariate or crowded data because the overlap between points might make it hard to get a clear
# understanding of each category. In my plot, I used a categorical variable in the drop-down menu. The library also
# provides solutions for selecting continuous values, in addition to the option of multiple selections at the same time.
# Additionally, there is an interval selection feature that allows users to click and drag to create a movable box
# over the chart. The selector I used has options like specifying the initial selection (Africa in my case) and choosing
# a selector name.

# The library also makes it easy to adjust other components like the legend (for example, I had 3 chart legends representing
# shape, color and size, and I had the option to change all the features of each one separately).

# Some notes on working on the chart:

# Continent marker color: I chose the colors of the Olympic flag to represent each continent.
# When I first plotted the chart, the default marker shapes used by the library included a diamond, a cross and a
# triangle (the chart included 6 shapes). However, it was hard to see the overlapping data points because the markers
# were small due to their values. I tried different color schemes and different opacity values, but this didn't help in showing
# the overlapping points. I addressed this in 3 ways:

# 1- The tooltip option displays the continent name and the # of cases when you hover the mouse over a data point.
# 2- I changed the marker shapes to include 4 triangles pointing in 4 different directions to reduce the overlap.
# 3- I created 2 plots, one with all the data overlapping and the other with a drop-down menu to
# select the data of each continent separately and view it clearly using the selector option.

# The alt.Order option was used to decide which data points to plot first (equivalent to zorder in matplotlib).
# The 'padding' option was used to position the chart in the notebook.

# There is an additional feature I didn't use because it requires live interaction during the display:
# the '.interactive()' method, which lets you zoom and pan the chart.

In [16]:
# Before starting the visualization, some preparation steps are needed, like extracting the label names.

# Selecting unique dates to visualize, and changing the format to string to be used in axis labels:
dates=final.drop_duplicates(subset='Date',keep='first')['Date']
dt=[str(val) for val in dates]

# Converting the Continent_Cases values to string to be used in the tooltip:

final['Continent_Cases']=final['Continent_Cases'].apply(lambda x: str(x))

# Continent unique values:

domain = list(final['Continent_Name'].unique())

In [17]:
# Now we are ready with the data. So let's start!

import altair as alt

# Below are the values we will need in the chart:

# Continent color:
range_ = ['black','yellow','navy','red','green', 'magenta']

# Mark shape:
mark_shape=["circle","square",  "triangle-up", "triangle-down", "triangle-right", "triangle-left"]
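When a scale is given both a `domain` and a `range`, the two lists pair up by position, so it helps to check which continent ends up with which color before plotting. The sketch below is an illustration only: the continent order in `domain` is hypothetical (in the notebook it comes from `final['Continent_Name'].unique()`), and pairing the shapes in the same order is an assumption, since the shape scale is given a range but no explicit domain.

```python
# Hypothetical continent order; the real one comes from the data.
domain = ['Africa', 'Asia', 'Europe', 'North America', 'Oceania', 'South America']
range_ = ['black', 'yellow', 'navy', 'red', 'green', 'magenta']
mark_shape = ['circle', 'square', 'triangle-up', 'triangle-down',
              'triangle-right', 'triangle-left']

# All three lists must have the same length for a one-to-one pairing.
assert len(domain) == len(range_) == len(mark_shape)

# Position i in the domain is drawn with position i in each range.
mapping = {name: (color, shape)
           for name, color, shape in zip(domain, range_, mark_shape)}
for name, (color, shape) in mapping.items():
    print(f'{name:15s} {color:8s} {shape}')
```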

In [18]:
# Defining the selection variables:

select_continent = alt.selection_single(
    name='Select',
    fields=['Continent_Name'],
    init={'Continent_Name': 'Africa'},
    bind=alt.binding_select(options=domain))
brush = alt.selection(type='interval', encodings=['x'])

# Creating the first chart that includes all data:

chart1=alt.Chart(final, width=700, height=200, title='COVID-19 Weekly Cases by Continent',
                ).mark_point().encode(
    alt.X('Date',axis=alt.Axis(values=dt, format='%B %d', grid=True,gridOpacity=0.2,
                               titleFontSize=5, titleColor='white',labelFontSize=12, labelColor='white')),
    alt.Y('# Cases', axis=alt.Axis( title='# Cases', domainColor='grey', offset=10,titleFontSize=15,labelFontSize=9,
                                   titleColor='black', labelColor='white',tickColor='white')),
    alt.OpacityValue(0.7),
    alt.Color('Continent_Name',scale=alt.Scale(domain=domain, range=range_),legend=alt.Legend(orient='top'),
              title='Continent Colors'),
    alt.Size('# Cases'),
    alt.Shape('Continent_Name:nominal',scale=alt.Scale(range=mark_shape),title='Continent Shapes'),
    alt.Tooltip(['Continent_Name:nominal', 'Continent_Cases:quantitative']),
    alt.Order('# Cases:Q', sort='ascending')).add_selection(select_continent)

# Creating the second chart that includes each continent separately:

chart2=alt.Chart(final, width=700, height=200).mark_point(filled=True).encode(
    alt.X('Date',axis=alt.Axis(title = 'Confirmed Cases by Week Start Date',values=dt, format='%B %d', grid=True,gridOpacity=0.2,
                               titleFontSize=15, titleColor='black',labelFontSize=12,labelAngle=320, labelColor='grey')),
    alt.Y('date_confirmation', axis=alt.Axis( title='# Cases', domainColor='grey', offset=10,titleFontSize=15,labelFontSize=12,
                                   titleColor='black',labelColor='grey')),
    alt.OpacityValue(0.7),
    alt.Color('Continent_Name',scale=alt.Scale(domain=domain, range=range_),legend=alt.Legend(orient='top'),
              title='Continent Colors'),
    alt.Size('# Cases'),
    alt.Shape('Continent_Name:nominal',scale=alt.Scale(range=mark_shape),title='Continent Shapes'),
    alt.Tooltip(['Continent_Name:nominal', 'Continent_Cases:quantitative']),
    alt.Order('# Cases:Q', sort='ascending')).add_selection(select_continent).transform_filter(
    select_continent).add_selection(brush)

# concatenating both charts

alt.vconcat(chart1, chart2, padding={"left": 60, "top": 5,"right": 5,"bottom": 5}).configure_axis(grid=False)

# Notes:
# * I hid the y-axis value labels of the upper chart because they show the log-transformed values; instead, I show the
#   actual values using the tooltip option on the data points.
# * Upper chart markers are unfilled to reduce the overlap confusion, whereas lower chart markers are filled.
# * Note that the lower chart's y-axis scale changes according to the selection.
# * I offset the y-axis from the x-axis starting point to be able to view the data points at the beginning.
# * The chart has a nice 'save as' feature. It can be reached by pressing the small circle with 3 dots on the right side
#   of the plot. This menu also has an option to open the chart code in the Vega editor, in addition to other options.


### Ten simple rules for writing and sharing computational analyses in Jupyter Notebooks.

In [20]:
# Below I mention 4 of the 10 rules I adhered to in this assignment:

# 1) Rule 1: Tell a story for an audience
# I tried to follow this rule in general by explaining why I chose the data, visualization library and chart type.
# Also, while answering question # 7 (The demonstration), I described the problems I faced while completing the chart and how I overcame them.

# 2) Rule 2: Document the process, not just the results.
# I tried to follow this rule by writing down any note or thought that came to my mind while working on the assignment
# so as not to forget it. After I finished, I gathered all the notes, organized them and used them to answer the above
# questions.

# 3) Rule 3: Use cell divisions to make steps clear.
# I followed this rule in preparing the data I used for the visualization, as I separated each group of similar steps
# and wrote a brief description of what I was planning to do.

# 4) Rule 9: Design your notebooks to be read, run, and explored
# I made sure my notebook is organized enough for others to be able to read, run and explore it easily by following the above
# steps; however, I haven't shared it on GitHub yet.