<a href="https://colab.research.google.com/github/hellosmallkat/NSDC-Data-Science-Projects-COVID-19-Data-Dashboard-Project/blob/main/Blank_version_NSDC_Data_Science_Projects_COVID_19_Data_Dashboard_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align="center">
    NSDC Data Science Projects
</h1>
  
<h2 align="center">
    Project: DATA VISUALIZATION PROJECT - COVID-19 ANALYSIS
</h2>

<h3 align="center">
    Name: Insert Your Name Here
</h3>

### **Project Description**
The NSDC DSPs provide virtual collaboration and networking opportunities for data science learners of all ages and skill levels, aiming to develop new data pre-processing, visualization, storytelling, and programming skills. Studying COVID-19 data is crucial as it helps us understand the virus's spread, the effectiveness of interventions, and its overall impact on populations. Data visualization plays a key role in transforming raw data into meaningful insights, making complex information more accessible and understandable.

In this project, we will explore various datasets related to COVID-19, including vaccination rates, infection rates, and death counts. We will employ descriptive statistics to summarize the data, use data visualization techniques to uncover trends and patterns, and interpret these visualizations to derive actionable insights. By the end of this project, students will have a better understanding of how to handle large datasets, apply statistical methods, and create compelling visual narratives.

Our goal is to equip you with the skills needed to analyze and interpret real-world data, developing a deeper appreciation for the power of data science in addressing global challenges.


### **Goals**
- **Data Analysis**: Analyze vaccination rates, infection rates, and death counts.
- **Visualization**: Create visualizations to uncover trends and patterns.
- **Interpretation**: Derive actionable insights from the data.

### **Tools and Libraries**
- **Python**: Programming language
- **Pandas**: Data manipulation
- **Matplotlib/Seaborn**: Data visualization
- **Plotly**: Interactive plots
- **IPython Widgets**: Interactive elements

**Note:** You are free to use other libraries in addition to the ones we have recommended.

### **Please read before you begin your project**

**Instructions: Google Colab Notebooks:**

Google Colab is a free cloud service. It is a hosted Jupyter notebook service that requires no setup to use, while providing free access to computing resources. We will be using Google Colab for this project.

In order to work within the Google Colab Notebook, **please start by clicking on "File" and then "Save a copy in Drive."** This will save a copy of the notebook in your personal Google Drive.

Please rename the file to "Capsule - COVID Data Visualization - Your Full Name." Once this project is completed, you will be prompted to share your file with the National Student Data Corps (NSDC) Project Leaders.

You can now start working on the project. :)

**Additional resources**

We will use the Seaborn library to create visualizations in this notebook. To learn more about Seaborn, visit: https://seaborn.pydata.org/

Please feel free to refer to the Northeast Big Data Innovation Hub [Flashcard series](https://www.youtube.com/playlist?list=PLNs9ZO9jGtUDxKBBZa5ImsV9h9hlLwJWH) to learn about various visualization techniques and understand the ways to analyze the data effectively.


# Milestone #1: Data Preparation & Exploration


**Dataset Details:**
The dataset used in this project includes various aspects of COVID-19 data such as vaccination rates, infection rates, and death counts. The data is sourced from reliable institutions such as the World Health Organization (WHO), Johns Hopkins University, and other governmental health agencies. These datasets provide a comprehensive view of the pandemic's impact across different regions and time periods.



**Step 1:**

Setting up libraries and installing packages

To import a library:
```python
 import <library> as <shortname>
```
We use a *short name* since it is easier to refer to the package to access functions and also to refer to subpackages within the library.
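
As a small illustration of the alias pattern, here is a sketch that uses the standard-library `statistics` module so it runs without installing anything. The alias `st` and the sample numbers are arbitrary choices made for this example. Keep in mind that `import` only loads a library that is already installed; installing is a separate step (in Colab, typically `!pip install <library>`).

```python
# Import a module under a short alias (the alias name is arbitrary).
import statistics as st

sample = [2, 4, 4, 6]  # made-up numbers, for illustration only

# Functions are reached through the alias instead of the full module name.
average = st.mean(sample)
middle = st.median(sample)
print(average, middle)
```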


In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

These are the libraries that will help us throughout this project. We encourage you to read more about the most commonly used packages, such as Pandas, NumPy, Matplotlib, and Seaborn, and write a few lines in your own words about what they do. [You may use the Data Science Resource Repository (DSRR) to find resources to get started!](https://nebigdatahub.org/nsdc/data-science-resource-repository/)



**TO DO:** Write 3-5 sentences about commonly used packages and what they're used for.

>*  

**Step 2:**

Let’s access our data. We will be using the COVID dataset. The dataset contains COVID data from around 187 countries.
Link to the dataset: [Covid-19 dataset](https://www.kaggle.com/datasets/imdevskp/corona-virus-report)


We will use Pandas to read the data from the csv file using the `read_csv` function. This function returns a Pandas dataframe. We will store this dataframe in a variable called `data`.

**Note:** Fill in the blanks to complete the code below.

In [None]:
# TODO: Read the data using pandas read_csv function
data = pd.read_csv(________)

To look at some data values, we can use the `head` function, which returns the first five rows by default.

In [None]:
# TODO: Print the first 5 rows of the data using the head function of Pandas
data._____()

Now, let's take an initial look at the data including some descriptive statistics. Learn more about descriptive statistics with [this NSDC Flashcard video series.](https://www.youtube.com/playlist?list=PLNs9ZO9jGtUDxKBBZa5ImsV9h9hlLwJWH)

We will use the `describe` function to review the data.

In [None]:
# TODO: Describe the data using describe function of pandas
data._____()
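
If you would like to see `read_csv`, `head`, and `describe` working together before filling in the blanks, the sketch below runs them on a tiny made-up table. The CSV text, the column names, and the variable `toy` are assumptions for illustration only; they are not part of the project dataset.

```python
import io
import pandas as pd

# Toy CSV text standing in for a real file (values are made up).
csv_text = """country,confirmed,deaths
A,100,5
B,250,10
C,40,1
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.head(2))     # first two rows (head() with no argument returns five)
print(toy.describe())  # count, mean, std, min, quartiles, max of numeric columns
```

Note that `describe` summarizes only the numeric columns by default, so the text column `country` does not appear in its output.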

**Step 3:** <br>
Let's take a look at how different variables of our data are correlated with each other.
We can use the `corr` function for this.
To plot these relationships, we can use a `pairplot`. You can read more about this method [here](https://seaborn.pydata.org/generated/seaborn.pairplot.html).


In [None]:
# TODO: Plot a correlation/pairplot using Seaborn library in Python
print(data._____)
sns._____(_____)
plt.show()
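
The sketch below shows what a correlation matrix looks like on a small made-up table. The data and column names are assumptions for illustration. It passes `numeric_only=True`, which is available in pandas 1.5 and later, because recent pandas versions raise an error when `corr` meets text columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "confirmed": [100, 200, 300, 400],
    "deaths": [10, 20, 30, 40],      # exactly 10% of confirmed
    "recovered": [90, 50, 120, 60],
})

# numeric_only=True skips text columns such as "country".
corr_matrix = toy.corr(numeric_only=True)
print(corr_matrix.round(2))
```

Values near 1 or -1 indicate a strong linear relationship and values near 0 a weak one. A pairplot shows the same pairs of variables as scatter plots, which helps when a relationship is not linear.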

# Milestone #2: Geographic Analysis


**Step 1:** Let us do some analysis on the [WHO Regions of the Americas](https://www.paho.org/en/countries-and-centers). To do so, we will need to create a subset of our master dataset.
To create a subset of the main dataset, we will filter its rows with a boolean condition (boolean indexing).

**Why focus on the Americas?**

The Americas were chosen for this analysis due to their diverse responses to the COVID-19 pandemic, varying healthcare infrastructure, and different socio-economic impacts. This regional focus allows us to compare and contrast the effectiveness of different strategies and the overall impact of the pandemic in countries with different characteristics.

Let's understand why the Americas are important by visualizing various features from the dataset.

In [None]:
# TODO: Aggregate data by WHO Region
who_region_data = data.groupby(_____).sum(numeric_only=True).reset_index()

# TODO: Plot the trends for confirmed cases, deaths, and recoveries across WHO regions
fig, axes = plt.subplots(3, 1, figsize=(12, 18))

sns.barplot(ax=axes[0], x=_____, y='Confirmed', data=who_region_data)
axes[0].set_title('Confirmed Cases by WHO Region')
axes[0].set_ylabel('_____')

sns.barplot(ax=axes[1], x=_____, y='Deaths', data=who_region_data)
axes[1].set_title('Deaths by WHO Region')
axes[1].set_ylabel('_____')

sns.barplot(ax=axes[2], x=_____, y='Recovered', data=who_region_data)
axes[2].set_title('Recoveries by WHO Region')
axes[2].set_ylabel('_____')

plt.tight_layout()
plt.show()
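
To see what a group-and-sum aggregation produces, here is a minimal sketch on made-up data. The column names `region`, `country`, `confirmed`, and `deaths` are placeholders chosen for this example and are not the column names of the project dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "country": ["A", "B", "C", "D"],
    "confirmed": [100, 200, 50, 25],
    "deaths": [5, 10, 1, 2],
})

# Select the numeric columns before summing so text columns are left out.
region_totals = toy.groupby("region")[["confirmed", "deaths"]].sum().reset_index()
print(region_totals)
```

The result has one row per region, and `reset_index` turns the group label back into an ordinary column, which is the shape that `sns.barplot` expects for its `x` and `y` arguments.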

In [None]:
#TODO: Filter the data for the Americas region
d_america = data[data['WHO Region'] == '______'].sort_values(by='Confirmed', ascending = False)
new_data = d_america.head(__) #top 10 countries
new_data
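
The filter, sort, and `head` steps can be sketched on a toy table as follows. The data and the names `mask`, `north`, and `top_two` are assumptions made for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["North", "South", "North", "North"],
    "country": ["A", "B", "C", "D"],
    "confirmed": [100, 900, 300, 200],
})

# The comparison builds a True/False mask; indexing with it keeps matching rows.
mask = toy["region"] == "North"
north = toy[mask].sort_values(by="confirmed", ascending=False)

top_two = north.head(2)  # the two largest "North" rows
print(top_two)
```

Country B has the most cases overall but is excluded because its region does not match, so the order of operations (filter first, then sort, then take the top rows) matters.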

**Step 2**: Calculate descriptive statistics for our new dataset (the Americas only, i.e., North and South America).

In [None]:
#TODO: Calculate the descriptive statistics for new_data. Review how we did this above if you need a refresher!


# Milestone #3: Data Visualizations

**Step 1:** Let us create some plots and graphs to further visualize our data. We will start with a barplot for which we will use the Seaborn library.

Need a refresher on bar charts? [Review this video for a quick reminder](https://www.youtube.com/watch?v=Y_HCxHOy4Sw&list=PLNs9ZO9jGtUDxKBBZa5ImsV9h9hlLwJWH&index=28&pp=iAQB)!

In [None]:
#TODO: Plot a bar chart for Active cases for each Country/Region for the top 10 countries
active = new_data[['________','________']].sort_values(by='Active',ascending=False)
#creating a dataframe with countries and active cases in decreasing order
sns.barplot(y='Country/Region',x='Active',______)

**TO DO:** Review the above chart and write 2-5 sentences about your interpretation.
<br>Note: When interpreting graphs, look for key trends and patterns. Use specific numbers to back up your observations. For example, if you notice a peak in infection rates, mention the exact date and value. Compare different regions or time periods to highlight significant differences or similarities. Your interpretation should provide a clear narrative that explains what the data shows and why it is important.

>*  


**Step 2:** Now, we will plot a double bar chart to compare Confirmed and Recovered COVID Cases in different countries. You can learn more about grouped bar charts here:
https://www.geeksforgeeks.org/plotting-multiple-bar-charts-using-matplotlib-in-python/.

In [None]:
#TODO: plot a double bar chart
plt.figure(figsize=(10, 5))
X_axis = np.arange(len(new_data['______']))
plt.bar(X_axis - 0.2,new_data['Confirmed'] , 0.4, label = '______')
plt.bar(X_axis + 0.2, new_data['Recovered'], 0.4, label = '______')

plt.xticks(X_axis,new_data['Country/Region'] )
plt.xlabel("________")
plt.ylabel("________")
plt.title("_____________")

plt.legend()
plt.show()
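
The `- 0.2` and `+ 0.2` offsets in the cell above are what place the two bars of each pair side by side. The sketch below isolates that idea on made-up values; the labels, numbers, and series names are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

labels = ["A", "B", "C"]   # made-up categories
first = [10, 20, 15]
second = [8, 12, 14]

width = 0.4
positions = np.arange(len(labels))  # 0, 1, 2: one slot per category

fig, ax = plt.subplots(figsize=(6, 3))
# Shift each series by half a bar width so the pair sits side by side.
ax.bar(positions - width / 2, first, width, label="first")
ax.bar(positions + width / 2, second, width, label="second")
ax.set_xticks(positions)
ax.set_xticklabels(labels)
ax.legend()
# In a notebook, the figure is displayed at the end of the cell.
```

With a bar width of 0.4, shifting by 0.2 in each direction makes the two bars touch at the tick position without overlapping. A third series would need a narrower width and three offsets.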

**TO DO:** Review the above chart and write 2-5 sentences about your interpretation.

>*  


**Step 3:** Now, let's plot a pie chart. [Click here to review when a pie chart should be used!](https://www.youtube.com/watch?v=AqqA3cP1zGo&list=PLNs9ZO9jGtUDxKBBZa5ImsV9h9hlLwJWH&index=30&pp=iAQB)
<br> For plotting a pie chart we will be using the matplotlib package. To learn more, visit: https://matplotlib.org/

In [None]:
#TODO: Plot a pie chart for the top 10 countries to show the Distribution of Recovered COVID Cases
plt.figure(figsize=(8,8))
patches, text, autotexts = plt.pie(new_data['Recovered'], labels = new_data['________'],autopct="%0.2f%%", pctdistance=0.8)
plt.title("____________") #Hint: Check the TODO statement!
plt.axis('equal')
plt.legend(patches,new_data['_________'] )
plt.show()

**TO DO:** Review the above chart and write 2-5 sentences about your interpretation.

>*  


**Step 4:** Let us compare Active cases in the Americas using a Donut Chart.
You can also learn more about donut charts in Python here: https://matplotlib.org/stable/gallery/pie_and_polar_charts/pie_and_donut_labels.html

Feel free to experiment with parameters like colors, explosions, etc. to make it more readable.

In [None]:
# TODO: Plot a donut chart for the top 10 countries
x = new_data['_____'].to_list()
labels = new_data['_____']
# colors - feel free to experiment!
colors = ['#0000FF','#FF0000',  '#FFFF00',
          '#FFA500','#ADFF2F']
# explosion
explode = (0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05)

# Pie Chart
plt.pie(x, colors=colors, labels=labels,
        autopct='%1.1f%%', pctdistance=0.85,
        explode=explode)

# draw circle
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
fig = plt.gcf()

# Adding Circle in Pie chart
fig.gca().add_artist(centre_circle)

# Adding Title of chart
plt.title('_________')

# Displaying Chart
plt.show()
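
The cell above creates the hole by drawing a white circle over a regular pie chart. As an alternative sketch, not the approach used in this notebook, Matplotlib can also draw the ring directly through the `wedgeprops` argument. The values and labels below are made up for illustration.

```python
import matplotlib.pyplot as plt

values = [40, 30, 20, 10]  # made-up shares
labels = ["A", "B", "C", "D"]

fig, ax = plt.subplots(figsize=(4, 4))
# width=0.3 makes each wedge a ring segment 0.3 wide instead of a full slice,
# leaving a hole in the middle without drawing a white circle on top.
wedges, texts, autotexts = ax.pie(
    values,
    labels=labels,
    autopct="%1.1f%%",
    pctdistance=0.85,
    wedgeprops=dict(width=0.3),
)
ax.set_title("Toy donut chart")
```

Whichever approach you use, `explode` needs exactly one entry per value when it is passed, while a shorter `colors` list is simply cycled.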

**TO DO:** Review the above chart and write 2-5 sentences about your interpretation.

>*  


**Step 5:** Our next visualization will be a heatmap ([What is a heatmap?](https://www.youtube.com/watch?v=MuHsSH590WY&list=PLNs9ZO9jGtUDxKBBZa5ImsV9h9hlLwJWH&index=33&pp=iAQB)). One way to show a heatmap on a map is a choropleth, which shades each region according to a data value. We will plot a geospatial heatmap of this kind to show how case counts are distributed across countries. <br>
<br> Choropleth maps can be plotted using the Plotly package in Python. To learn more, visit: https://plotly.com/python/choropleth-maps/

In [None]:
# TODO: Plot a Heatmap below for total confirmed cases in the Americas
fig = px.choropleth(new_data,
                    locations= '________',
                    locationmode='country names',
                    color='Confirmed',
                    hover_name='________',
                    color_continuous_scale='plasma',
                    title='________')

# fig.update_geos(projection_type="natural earth")

fig.show()


**TO DO:** Review the above chart and write 2-5 sentences about your interpretation.

>*  


**Step 6:** Calculate mortality and recovery rates, aggregate your data by WHO region, and then plot these rates using a barplot.

**Hint**: [Read this article to understand the formula we will be using](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9291733/) (we will use the one suggested by Battegay and their team).

In [None]:
# Calculate mortality and recovery rates
data['Mortality Rate'] = (data[_____] / data[_____]) * 100
data['Recovery Rate'] = (data[_____] / data[_____]) * 100

# List the numeric columns so that text columns are not included in the mean calculation
numeric_columns = ['Confirmed', 'Deaths', 'Recovered', 'Active', 'Mortality Rate', 'Recovery Rate']

# Aggregate data by WHO Region
who_region_rates = data.groupby('________')[numeric_columns].mean().reset_index()


# Plot the mortality and recovery rates by WHO region
fig, axes = plt.subplots(2, 1, figsize=(12, 12))

sns.barplot(ax=axes[0], x='WHO Region', y='Mortality Rate', data=who_region_rates)
axes[0].set_title('Mortality Rate by WHO Region')
axes[0].set_ylabel('Mortality Rate (%)')

#Review the code directly above and use it to help you fill in the following code for "Recovery Rate"
sns.barplot(ax=axes[1], x='_______', y='_______', data=who_region_rates)
axes[1].set_title('____________________')
axes[1].set_ylabel('___________')

plt.tight_layout()
plt.show()
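
One design choice in the cell above is worth a closer look: it averages per-country rates within each region. The sketch below compares that with pooling the counts first, on made-up data. It assumes a rate defined as the simple ratio `deaths / confirmed * 100`; check the linked article for the exact formula you are expected to use. All column and variable names here are placeholders.

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["North", "North", "South"],
    "confirmed": [1000, 10, 500],
    "deaths": [10, 5, 25],
})

# Per-row rate, assuming rate = deaths / confirmed * 100.
toy["mortality_rate"] = toy["deaths"] / toy["confirmed"] * 100

# Option 1: average the per-country rates (every country counts equally).
mean_of_rates = toy.groupby("region")["mortality_rate"].mean()

# Option 2: pool the counts first, then take the ratio (weights by case count).
totals = toy.groupby("region")[["confirmed", "deaths"]].sum()
pooled_rate = totals["deaths"] / totals["confirmed"] * 100

print(mean_of_rates.round(2))
print(pooled_rate.round(2))
```

In the "North" group, one small country with a 50% rate pulls the mean of rates up to 25.5%, while the pooled rate stays near 1.5%. Neither is wrong, but they answer different questions, so say which one you are reporting when you interpret the chart.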

**TO DO:** Review the above chart and write 2-5 sentences about your interpretation.

>*  


# Milestone #4: Self-Guided Analysis of a Different WHO Region (Other Than the Americas)

Choose any region within the dataset and create 3 different visualizations below. You may choose to use the same visualizations as above and compare your findings, or you may want to experiment with new visualizations.


**Be sure to give your interpretations for each visualization you create.**

Not sure which region to select? You may want to consider choosing a region that has a high mortality or recovery rate.

**TO DO:** Which region are you choosing and why? Write 2-3 sentences explaining your reasoning below.
>*  


**TO DO:** Begin creating your visualizations below.

In [None]:
#Create Visualization #1 Here

**TO DO:** Review the above visualization and write 2-5 sentences about your interpretation.

>*  


In [None]:
#Create Visualization #2 Here

**TO DO:** Review the above visualization and write 2-5 sentences about your interpretation.

>*  


In [None]:
#Create Visualization #3 Here

**TO DO:** Review the above visualization and write 2-5 sentences about your interpretation.

>*  

**TO DO:** Did the different visualizations help you identify different insights? Share your thoughts below.
>*  

**TO DO:** Did you notice any significant similarities or differences between the regions you analyzed throughout this project? Share your thoughts below.
>*  





---


### **Conclusion**



In this project, we explored various aspects of COVID-19 data through descriptive statistics and data visualization techniques. We analyzed datasets from different regions, with a specific focus on the Americas, to understand the pandemic's impact. By interpreting these visualizations, we derived meaningful insights that highlight the importance of data-driven decision-making in addressing global challenges like COVID-19. This project not only enhances your technical skills, but also underscores the crucial role of data science in tackling real-world issues.



---


## <font color='black'> **Thank you for completing the project!**</font>

In order to receive a certificate of completion, please share this notebook with er3101@columbia.edu. Do reach out to us if you have any questions or concerns. We are here to help you learn and grow.
