# CATCH UP - Week 5 Assignment
######  Paige DeFiori
This notebook applies what I have learned so far this quarter, along with my own research into certain coding topics, like Plotly. Much of it redoes past weeks' assignments, but with COVID datasets that are more directed at the research questions my partner and I are pursuing. The notebook explores COVID datasets and visualizes them in different forms, mainly utilizing .csv files, because all of the COVID data I have found is reported and updated as .csv. I did find a geoJSON file; however, it is just point averages of the countries, so its use is minimal. 
The goal of this notebook is to become more comfortable, not only with coding in Python, but more so with data exploration and what files are capable of transforming into. This should be a baseline for our midterm visualization, which my partner and I will build on for our final project.

## Clean the Data

First and foremost, I need to import all the libraries needed to read datasets and turn them into visualizations:

In [None]:
# for general data wrangling tasks
import pandas as pd

# to read and visualize spatial data
import geopandas as gpd
from geopandas import GeoDataFrame

from shapely.geometry import Point

# for plotting / figures
import matplotlib.pyplot as plt

# this allows .csv files to be turned into visualizations (mainly for choropleth maps)
import plotly.express as px

Since I know the dataset I am importing is *huge* and full of a multitude of variables, I want to see them all so I can decide which are related to the purpose of my project. This is done with `display.max_columns` :

In [None]:
covidData = pd.read_csv("data/coronavirus-data-explorer-2.csv")
# Shows max columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
covidData.head(10)
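
Setting `display.max_columns` and `display.max_rows` to `None` changes the display for every later cell as well. A minimal sketch, assuming a made-up stand-in dataframe (`demo` is hypothetical, not the COVID file), of limiting the change to a single block with `pd.option_context`:

```python
import pandas as pd

# Hypothetical stand-in for the much wider COVID dataframe
demo = pd.DataFrame({f"col_{i}": range(3) for i in range(30)})

# Remember the current setting so we can confirm it is restored afterwards
before = pd.get_option("display.max_columns")

# Inside the block every column is shown; outside it the old setting returns
with pd.option_context("display.max_columns", None):
    inside = pd.get_option("display.max_columns")
    print(demo.head(2))

after = pd.get_option("display.max_columns")
```

This keeps long outputs, such as a 200-row `.head()`, from flooding cells where the full view is not needed.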

This dataset was downloaded from [Our World in Data](https://ourworldindata.org/covid-cases) and, as you can see, is updated on a day-to-day basis. For now, the project will focus on January 2020 to January 2021, though this time frame may change.
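
One way the January 2020 to January 2021 window could be applied is by parsing the dates and filtering on them. This is only a sketch on made-up rows: the column names `location`, `date` and `total_cases` come from the dataset, but the values and the `sample` dataframe are hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for the real file
sample = pd.DataFrame({
    "location": ["A", "A", "A", "A"],
    "date": ["2019-12-31", "2020-01-15", "2021-01-31", "2021-02-01"],
    "total_cases": [0, 5, 100, 110],
})

# Parse the text dates so comparisons are chronological, not alphabetical
sample["date"] = pd.to_datetime(sample["date"])

# Keep January 1, 2020 through January 31, 2021 (both ends inclusive)
start = pd.Timestamp("2020-01-01")
end = pd.Timestamp("2021-01-31")
windowed = sample[sample["date"].between(start, end)]
```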

Similar to the function above, `.info()` lets me see the columns, and `list(covidData)` places them in a list so it is easier to copy and paste the ones I want when shrinking the dataset:

In [None]:
covidData.info()

In [None]:
list(covidData)

Now that all of the COVID variables in this data are clear, I can cut the dataset down by isolating the columns I want in their own list and reassigning the dataframe to this abbreviated list of statistical variables:

In [None]:
columns_to_keep = [
    'iso_code',
    'continent', 
    'location',
    'date',
    'total_cases',
    'total_cases_per_million',
    'total_deaths',
    'total_deaths_per_million',
    'total_vaccinations',
    'total_vaccinations_per_hundred',
    'stringency_index',
    'population',
    'gdp_per_capita'
]

In [None]:
covidData = covidData[columns_to_keep]

I use `.tail()` to confirm that my dataframe is now reassigned to the desired columns.

In [None]:
covidData.tail(10)

Now that the set is cut down, I will rename the columns for aesthetics and readability when the data is transformed into maps and plots, while also printing a random sample of 5 rows to confirm it was done correctly.

In [None]:
# renaming the columns
covidData.columns = [
    'ISO',
    'Continent', 
    'Country',
    'Date',
    'Total Cases',
    'Total Cases per Million',
    'Total Deaths',
    'Total Deaths per Million',
    'Total Vaccines',
    'Total Vaccines per Hundred',
    'Stringency Index',
    'Population',
    'GDP per Capita']

# printing sample
covidData.sample(5)
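
Assigning a full list to `.columns` depends on the new names being in exactly the same order as the old ones. As a hedged alternative, a sketch on a made-up three-column frame (`raw` is hypothetical) that renames by name with a dictionary:

```python
import pandas as pd

# Hypothetical two-row frame using a few of the original column names
raw = pd.DataFrame({
    "iso_code": ["USA", "FRA"],
    "location": ["United States", "France"],
    "total_cases": [100, 50],
})

# Map old name -> new name; columns not listed keep their original name
new_names = {
    "iso_code": "ISO",
    "location": "Country",
    "total_cases": "Total Cases",
}
renamed = raw.rename(columns=new_names)
```

With a mapping, reordering or dropping a column later cannot silently attach a label to the wrong data.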

I chose to keep certain columns that I have no immediate plans to use, as I may want to develop these statistics later. For example, the `stringency_index` is a 0-100 scale that shows the strictness of a country's pandemic shutdowns, 100 being the strictest. This index could be interesting to work with later on once the basics are covered. 

I want to see where the most total cases of COVID are occurring, so I sort the data with `.sort_values()` and set `ascending` to `False` so the largest number comes first. I do this with a new variable name that I will only use in this cell:

In [None]:
covid_sorted = covidData.sort_values(by='Total Cases',ascending = False)
covid_sorted.head()

By viewing this, I notice a problem that will occur in my visualizations: the data includes World and International listed as locations, whereas I only want individual countries' data for mapping. To combat this, I will make a subset of the dataframe that cuts those locations out using: 
<br/> `df[df.COLUMN != 'value']`

In [None]:
covidData = covidData[covidData.Country != 'World']
covidData = covidData[covidData.Country != 'International']
covidData = covidData.sort_values(by='Total Cases',ascending = False)
covidData.head()

Now we can see that World/International is no longer listed as a country in the dataframe, so my visualizations of COVID cases relate only to individual countries. I am hesitant to rely on any specific sort order, as I don't know how it will affect my mapping just yet, so I re-sort the dataframe whenever a later step needs a particular order.
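
The two separate `!=` filters can also be written as one step. The sketch below uses made-up rows; the idea that aggregate rows carry an ISO code starting with `OWID_` is my assumption about the source file and should be checked against the real data before relying on it.

```python
import pandas as pd

# Hypothetical rows; the 'OWID_' prefix for aggregates is an assumption
df = pd.DataFrame({
    "ISO": ["USA", "OWID_WRL", "IND", None],
    "Country": ["United States", "World", "India", "International"],
    "Total Cases": [26_000_000, 100_000_000, 10_000_000, 700],
})

# Option 1: drop an explicit list of non-country labels in one step
not_countries = ["World", "International"]
by_name = df[~df["Country"].isin(not_countries)]

# Option 2: drop rows whose ISO code is missing or starts with 'OWID_'
is_aggregate = df["ISO"].isna() | df["ISO"].str.startswith("OWID_", na=False)
by_iso = df[~is_aggregate]
```

Option 1 is easy to extend if other non-country labels turn up; option 2 would catch them without listing each one, provided the prefix assumption holds.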

Now I will make a copy of the preexisting data set, so I can go back and use all the variables provided. For now, `new_covid` is an even more trimmed version of the data so I can isolate deaths, cases and vaccines ONLY.

In [None]:
new_covid = covidData.copy()
new_covid.head()

Using the `.sort_values()` function, I alphabetize the data to look for possible duplicate countries. I do it this way because deleting rows with NaN values would misreport the variables' totals; this lets me map totals without adding too many values together:
<br/> I used 200 rows in `.head()` to see the most recent date on which all countries have valid data, and I reset the index with `reset_index()` so it is easier to read. 

In [None]:
new_covid = new_covid.sort_values(by='Country', ascending= True)
new_covid = new_covid.reset_index()

In [None]:
new_covid.head(200)

There are duplicates: each country appears once per date, so I have to aggregate these rows, doing so with `.agg` grouped on the Country column. Because the `Total ...` columns are cumulative, I take the `max` of each one rather than the `sum`; summing would count every earlier day's running total again and inflate the result. `max` also skips NaN values, so they do not affect the totals. The aim is the most recent reported figure for each country (through 1-31-21), since some countries lag behind and up-to-date data isn't truly up to date.

In [None]:
# group by country and keep the largest (most recent) cumulative value in each column with .agg:
new_covid = new_covid.groupby(new_covid['Country']).agg({'Total Cases': 'max', 'Total Vaccines': 'max',
                                                               'Total Deaths': 'max','Country': 'max', 
                                                               'ISO': 'max'})
new_covid.head()
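
Since the `Total ...` columns are running totals, the choice of aggregation function matters. A small sketch with made-up numbers (the `daily` dataframe is hypothetical) comparing `sum` with `max` on a cumulative column:

```python
import pandas as pd

# Hypothetical cumulative counts for two countries over three days
daily = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Date": ["2021-01-29", "2021-01-30", "2021-01-31"] * 2,
    "Total Cases": [10, 12, 15, 3, 3, 4],
})

grouped = daily.groupby("Country")["Total Cases"]

# Summing a running total counts earlier days again and again
summed = grouped.sum()      # A -> 37, B -> 10

# The maximum of a running total is the latest cumulative figure
latest = grouped.max()      # A -> 15, B -> 4
```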

Now the `new_covid` dataframe holds the 3 variables I want on COVID, most up to date but not BY DAY. This helps with plotting total values, rather than interactive ones that change over time. Notably, I had to duplicate the country name, or so it appears. However, this is needed, as the **bolded** country is the index and cannot be plotted as a column. Further, I need the ISO to properly plot on world maps!
<br/> `covidData`, however, is still in place and the same:

In [None]:
covidData.head(2)

Before visualizing, I want to re-sort the data by case total rather than alphabetically, so plotting is by highest case total. Redundant, I know, but I want to cover all the bases for accurate COVID data:

In [None]:
new_covid = new_covid.sort_values(by='Total Cases', ascending= False)

new_covid.head(195)

I used 195 in the head, because that is how many countries there are.

## Visualize

Now I have 2 dataframes to work with:
<br/> `covidData` is the full dataframe with all variables
<br/> `new_covid` is a trimmed version with JUST cases, deaths and vaccines totals and is sorted by MOST total cases.

Now I want to make my desired bar graph of the trimmed data set of the top 5 countries with the most cases:

In [None]:
new_covid.head(5).plot.barh( 
    figsize = (10,7),
    colormap = 'tab20c',
    width= 1,
    title = '5 Countries with Most COVID Cases Compared to Vaccines & Deaths')

In a similar comparison, I want to visualize the top 50 countries with the most cases, compared to vaccines and deaths, with a line graph:

In [None]:
ax = new_covid.head(50).plot.line(
    figsize = (8, 8),
    legend= True,
    title = 'Total Cases Compared to Total Vaccines & Total Deaths')

Both of these visuals reveal a GREAT deal to me. The US is no doubt the most affected nation, in all aspects. However, the interesting part is the distribution of vaccines in nations that appear to not be _as_ affected compared to others (proportionately).

Below I will use the `new_covid` dataframe, which again is the totals, to plot the world based on Total cases:

In [None]:
# making a choropleth map via plotly express with a .csv file

list_countries = new_covid['Country'].unique().tolist()

fig = px.choropleth(data_frame = new_covid, 
                    # ISO is necessary in plotly as it has a built-in world map based on ISO codes
                    locations = "ISO",       
                    
                    # the column I want to depict the color
                    color = "Total Cases",
                    
                    # capped the cases at 26 million, as that's the most for any one country
                    range_color=[1,26000000], 
                    
                    # what is shown when hovered over
                    hover_name = "Country",
                    
                    # selects the color for the scale
                    color_continuous_scale = 'sunset',
                    
                    # sets a different view of the world
                    projection = "natural earth")
                   
#creating a title
fig.update_layout(
    title_text='Total COVID-19 Cases by Country, January 2020 - January 2021')
fig.show()

Similarly, I will create a map, but based on total vaccines rather than cases:

In [None]:
# making a choropleth map via plotly express with a .csv file

list_countries = new_covid['Country'].unique().tolist()

fig = px.choropleth(data_frame = new_covid, 
                    # ISO is necessary in plotly as it has a built-in world map based on ISO codes
                    locations = "ISO",       
                    
                    # the column I want to depict the color
                    color = "Total Vaccines",
                    
                    # capped the vaccine total at 33 million, as that's the most for any one country
                    range_color=[1,33000000], 
                    
                    # what is shown when hovered over
                    hover_name = "Country",
                    
                    # selects the color for the scale
                    color_continuous_scale = 'sunset', 
                    
                    # sets a different view of the world
                    projection = "natural earth")
                   
#creating a title
fig.update_layout(
    title_text='Total COVID-19 Vaccinations by Country, January 2020 - January 2021')
fig.show()

Ideally, I would be able to put these two side by side and compare. Work in progress.

In [None]:
fig = px.scatter_geo(new_covid, locations='ISO', color='Total Cases',
                     hover_name= 'Country', size="Total Cases",
                     projection="natural earth")
#creating a title
fig.update_layout(
    title_text='Total COVID Cases (January 2021) By Country')
fig.show()


Next, I need to work on the scale of the legend and on scaling issues in general.

Reverting back to the original dataframe, `covidData`, I want to make an animated map with Plotly express that changes by day. 
<br/>To do so, I will use the `.unique()` command, which keeps only one of each repeated value and so creates a list of the countries without repeats. I made a similar note when making the `new_covid` dataframe. However, I cannot use that trimmed dataframe here because, to make an animated map, I need the repeated countries to show a progression over time.

In [None]:
covidData = covidData.sort_values(by='Date', ascending= True)
covidData.head(10)

This sorting put the dataframe in order, from oldest to newest. I did this so the animation wouldn't have any possible hiccups in hunting for the order of dates.

Now is the fun part: making the map and making it animated!
[This](https://plotly.github.io/plotly.py-docs/generated/plotly.express.choropleth.html) link is where I got the information on plotly express and the functions that work in the `px.choropleth()` figure:

In [None]:
# making an animated map via plotly express with a .csv file

list_countries = covidData['Country'].unique().tolist()

fig = px.choropleth(data_frame = covidData, 
                    # ISO is necessary in plotly in order to depict the map properly
                    locations = "ISO",       
                    color = "Total Cases",
                    # figure out how this works so the scale is consistent?
                    range_color=[1,26000000], 
                    hover_name = "Country",
                    # continuous scale as data is changing
                    color_continuous_scale = 'sunset',
                    animation_frame = "Date")
fig.update_layout(
    title_text='Daily COVID-19 Cases January 2020 - January 2021')
fig.show()

I love it, but I need to figure out the `range_color` ticks so that the fixed scale shows a more drastic change in color over time.
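
One possible approach to the scale problem is to derive the range from the data and, because the totals span several orders of magnitude, to color by a log-scaled column. This is only a sketch on made-up values; the `Log Cases` column and the `cases` dataframe are hypothetical additions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative totals spanning several orders of magnitude
cases = pd.DataFrame({
    "ISO": ["AAA", "BBB", "CCC"],
    "Total Cases": [1_000, 100_000, 26_000_000],
})

# Fixed linear range taken from the data instead of typed in by hand
linear_range = [0, cases["Total Cases"].max()]

# Log-scaled copy: each step of 1 is a tenfold increase, so small and
# large countries spread out more evenly across the color scale
cases["Log Cases"] = np.log10(cases["Total Cases"].clip(lower=1))
log_range = [0, float(np.ceil(cases["Log Cases"].max()))]
```

In `px.choropleth`, `color="Log Cases"` together with `range_color=log_range` would then keep one fixed scale across every animation frame.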

Again using the `.agg` expression grouped by country, I will create a new `test` dataframe (in case it doesn't work) and aggregate the values in the desired columns by their `max` value, so I know I am getting the most recent information (for now, COVID statistics continue to go up). To make certain visualizations clearer, I will first fill NaN values with 0 using `.fillna(0)` so they do not affect future aggregations of the data.

In [None]:
# creating another copy of the OG dataframe. 
test = covidData.copy()

# trimming the dataframe by column name
columns_keeping = ['Continent','Country', 'Population', 'Total Cases', 'GDP per Capita', 'Total Deaths', 'Total Vaccines', 'Date']
test = test[columns_keeping]

# sorting the values in alphabetical order
test = test.sort_values(by='Country', ascending= True)

# filling NaN values with 0
test = test.fillna(0)

# this is the magic of .agg 
test = test.groupby(test['Country']).aggregate({'Continent': 'min',
                                                     'Country': 'min',
                                                     'Total Cases': 'max',
                                                     'Total Vaccines': 'max',
                                                     'Total Deaths': 'max',
                                                     'GDP per Capita': 'max'})
test.head(5)

It is now in alphabetical order, and the maximum of each column was taken, so the values are the most recent totals. I did this because so much sorting was already done with the `new_covid` dataframe that undoing it all would not be worth the effort. It's fine with me to keep another trimmed copy of the dataframe around to work with!

Now I want to see a bar chart comparing the newly added GDP variable with cases, vaccines and deaths:

In [None]:
test = test.sort_values(by='Total Cases', ascending= False)

ax = test.head(10).plot.bar(
    figsize = (8, 8),
    legend= True,
    title = 'Total Cases, Vaccines, Deaths & GDP per Capita: 10 Countries with Most Cases')

This is a basic plot, kind of bland, and it makes me think of elementary school. LET'S GET INTO IT!
Using plotly express, I can pimp the heck out of this and create a stacked bar graph of COVID's effect, separated by continent, to compare the countries to one another based on their respective GDP per capita:

In [None]:
fig = px.bar(test, 
             #to make it horizontal
             orientation="h", 
             
             # creating the x and y axes
             y="Continent", x="Total Cases", 
             
             # I want the color to represent the GDP per capita
             color="GDP per Capita",  
             
             # I want the color to change based on said GDP
             color_continuous_scale='Bluered_r', 
             
             # add the country's details via hover
             hover_name="Country")

# adding a title
fig.update_layout(
    title_text='Continental COVID-19 Cases compared by GDP per Capita')
fig.show()

This shows a pretty clear visual of COVID's impact on each nation. Recall that this `test` dataframe is sorted by total cases, so the largest section of a continent's bar is the country with the most cases along the x axis. Using `hover_name` allows all the statistics to be seen. I decided to separate it by continent so the graph wouldn't need as many bars as there are countries. However, I think it still shows a clear image of the pandemic's effects.

Now I want to create a faceted scatterplot similarly comparing cases, vaccines and GDP by country:

In [None]:
fig = px.scatter( test, 
    # sets y and x axis variables
    y = "GDP per Capita",  x = "Total Cases",  
    
    #what the size of the scattered plots will represent
    size = "Total Vaccines",  
    
    #what the color will represent and the plots hover info
    color = "Continent",  
    hover_name = "Country",  
    
    # the variable that splits the figure into separate side-by-side charts
    facet_col = "Continent", 
    
    # size of the largest point cannot go beyond this, helps with scale
    size_max = 75,
    
    # plots the x axis on a logarithmic scale so widely varying case totals fit
    log_x = True,
    #sets size of figure
    width= 999, height= 400,
) 
# adding a title
fig.update_layout(
    title_text='COVID-19 Vaccines compared to Total Cases & GDP per Capita')
fig.show()

In this scatterplot, the size of the marker represents a country's total vaccines, while the x axis shows its total cases. As you can see, or not, Oceania's countries are mere specks compared to the other continents. Moreover, the comparison is of vaccine distribution and GDP per capita. The correlation between the two is not perfect; however, the comparison is shocking. The countries with more vaccines do seem to have a higher GDP per capita than others. The comparison by continent is equally interesting; look at the size of the North America markers vs. Africa. It's shocking.
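
To put a number on how closely vaccines track GDP per capita, a correlation coefficient could be computed. The values below are made up purely to show the calls (`countries` is a hypothetical dataframe), so the results say nothing about the real data.

```python
import pandas as pd

# Hypothetical per-country values, not real figures
countries = pd.DataFrame({
    "GDP per Capita": [1_000, 5_000, 20_000, 40_000, 60_000],
    "Total Vaccines": [0, 10_000, 500_000, 2_000_000, 1_500_000],
})

# Pearson correlation: +1 is a perfect rising line, 0 is no linear trend
pearson = countries["GDP per Capita"].corr(countries["Total Vaccines"])

# Rank-based version (Spearman), less sensitive to a few huge values
ranks = countries.rank()
spearman = ranks["GDP per Capita"].corr(ranks["Total Vaccines"])
```

The rank-based figure may be the better fit here, since a handful of very large countries can dominate a plain Pearson correlation.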
<br/> Using a stacked histogram, I want to visualize just cases and vaccines:

In [None]:
fig = px.histogram(test.head(100), x= 'Total Vaccines', y= 'Total Cases', color= 'Country')

# adding a title
fig.update_layout(
    title_text='Top 100 Most Infected countries Compared to COVID Vaccines')
fig.show()

This shows A LOT about how the vaccine is distributed globally. I mean, something is up with China: look at their vaccines (24M) versus their total cases (100k).

Where 0 is listed, it was most likely a NaN value in the original `covidData` dataframe: since I had to aggregate the columns to get a total for each row (country), NaN was replaced by 0.
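
If telling "not reported" apart from a true zero matters later, the aggregation can be run without filling first. A minimal sketch on made-up rows (`reports` is hypothetical) showing how the two approaches differ:

```python
import pandas as pd

# Hypothetical data: country B never reported vaccinations
reports = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Total Vaccines": [100.0, None, None, None],
})

# max() skips NaN inside a group; an all-NaN group stays NaN
kept_missing = reports.groupby("Country")["Total Vaccines"].max()

# Filling first turns "not reported" into a real-looking zero
filled_zero = reports.fillna(0).groupby("Country")["Total Vaccines"].max()
```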

## GeoJSON Exploration

In [None]:
# to download osm data
import osmnx as ox

# to provide basemaps
import contextily as ctx

I recently found a geoJSON file on COVID, the only one I could find in fact. So I will import the file and view its columns:

In [None]:
gdf = gpd.read_file('data/global-covid-cases.geojson')
gdf.head(5)

I am not entirely sure what to do with this dataset, as it is huge (60MB, reported by day) and the only item I need is the geometry column, because I want to visualize it. So I'll go with it, renaming the columns and confirming it worked:

In [None]:
gdf.columns = [
    'Category',
    'Date', 
    'Country',
    'Cases',
    'Subzone',
    'geometry']

# Printing sample
gdf.head(5)

Cutting the columns down a bit further to the data I want to use for a map:

In [None]:
columns_to_keep = ['Country', 'Cases', 'Date', 'geometry']
gdf = gdf[columns_to_keep]

# sorting by most up to date which is Jan 29th 2021
gdf = gdf.sort_values(by='Date', ascending= False)
gdf.head()

The date not being the 31st isn't ideal, as it isn't as accurate and doesn't align with my csv file's data. But I assume that if I go and redownload them both, they will have similar dates. Again, this is more for exploration at this point!

<br/> With the dataframe in descending order by date, the top 195 rows (give or take, depending on how up to date each country is) should be each country's most recent total. I will use the `.agg` function again to sum the case totals:

In [None]:
gdf_new = gdf.groupby(gdf['Country']).aggregate({'Cases': 'sum', 'Date': 'max', 'geometry': 'first'})
gdf_new.head()
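
If each row already holds a running total, another way to get one row per country is to keep only the newest row instead of summing. This sketch uses made-up rows in a plain pandas dataframe (`points` is hypothetical); I am assuming the same calls carry over to the GeoDataFrame, since it builds on the pandas DataFrame.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the renamed GeoJSON fields
points = pd.DataFrame({
    "Country": ["A", "B", "A", "B"],
    "Date": ["2021-01-28", "2021-01-28", "2021-01-29", "2021-01-27"],
    "Cases": [90, 40, 100, 35],
})

# Newest first, then keep only the first row seen for each country
newest_first = points.sort_values(by="Date", ascending=False)
most_recent = newest_first.drop_duplicates(subset="Country", keep="first")
```

This also handles countries whose latest report falls on different dates, which a fixed "top 195 rows" cut would not.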

Importing a generic map of the world for a base:

In [None]:
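# note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0,
# so this line needs an older GeoPandas version
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))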

Plotting the world:

In [None]:
base = world.plot(figsize=(20,12),color='white', edgecolor='black')

Plotting the dataframe points:

In [None]:
gdf.plot(figsize=(20,12),color='red')

All points (except for Australia and Italy) shown on the map are based on geographic centroids, and are not representative of a specific address, building or any location at a spatial scale finer than a province/state. Australian and Italian dots are located at the centroid of the largest city in each state.

Combining the two to create a layered map:

In [None]:
base = world.plot(figsize=(20,12),color='white', edgecolor='black')
gdf.plot(ax= base,color='red')

What is interesting about this geoJSON file is that it is just points, based on relative (average) locations for the countries that have given COVID data. 
HOPEFULLY, I can generate a similar map in which the points change size based on the `Cases` value, as a bubble map / point plot of some sort, with help from [here](https://residentmario.github.io/geoplot/gallery/plot_usa_city_elevations.html).

In [None]:
gdf = gdf.dropna()
gdf.head(10)

In [None]:
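base = world.plot(figsize=(20,12),color='white', edgecolor='black')

# size each point by its share of the largest `Cases` value
# (the factor of 500 is an arbitrary choice to keep markers readable)
gdf.plot(ax=base, markersize=gdf['Cases'] / gdf['Cases'].max() * 500, color='red')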

# TO BE CONTINUED ......

##### Below are charts and maps that I will continuously work on for the midterm. I recently found a geoJSON file that I will use for base mapping and to build off of.

## More .csv exploration

Below is a possible different dataframe I found, after the fact of doing everything above:

In [None]:
df = pd.read_csv('data/time_series_covid19_deaths_global.csv')

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
df.sample(20)
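
If this file turns out to be in "wide" form, with one column per date, it would need reshaping before it can be grouped or animated like `covidData`. A sketch under that assumption: the identifier column names and date header format below are guesses about the layout, and `wide` is a made-up miniature, so check both against the real file.

```python
import pandas as pd

# Hypothetical miniature of a "wide" time series: one column per date.
# The identifier column names here are assumptions about the file layout.
wide = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "Lat": [10.0, 20.0],
    "Long": [30.0, 40.0],
    "1/30/21": [5, 1],
    "1/31/21": [7, 2],
})

id_columns = ["Country/Region", "Lat", "Long"]

# Stack the date columns into rows: one row per country per date
long = wide.melt(id_vars=id_columns, var_name="Date", value_name="Total Deaths")

# Convert the month/day/year headers into real dates
long["Date"] = pd.to_datetime(long["Date"], format="%m/%d/%y")
```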