Explore the [GeoPandas mapping example](https://geopandas.org/gallery/create_geopandas_from_pandas.html#sphx-glr-gallery-create-geopandas-from-pandas-py) with novel coronavirus data from [Johns Hopkins via Kaggle](https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset#time_series_covid_19_confirmed_US.csv).

The data must be downloaded manually because it requires a (free) Kaggle account. Download the Kaggle COVID data set and unpack it into the data/ directory next to this notebook.

Make sure to have [geopandas](https://geopandas.org/install.html) installed for base functionality.  The [descartes](https://pypi.python.org/pypi/descartes) package is needed for plotting the map.

Set up the library imports

In [None]:
import numpy as np
import pandas as pd
import geopandas
import matplotlib.pyplot as plt

In [None]:
rona_df = pd.read_csv("data/time_series_covid_19_confirmed_US.csv")

In [None]:
rona_df.head()

We want to limit our view to the continental United States. A [bounding box that selects only the latitudes and longitudes of interest](https://wiki.openstreetmap.org/wiki/Bounding_Box) would be ideal, but a quick search didn't turn up coordinates to use. Instead, we drop the non-state entries in the first six rows, which have a UID below 1000, keep only rows with a positive Lat, and remove Alaska and Hawaii. Syntax help from https://stackoverflow.com/a/17071908/8928529.
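
If approximate limits are acceptable, a bounding-box filter could look like the sketch below. The coordinates are rounded assumptions for the contiguous United States, and `within_bbox` and the toy rows are hypothetical names and data used only for illustration.

```python
import pandas as pd

# Approximate bounding box for the contiguous United States.
# These limits are rounded assumptions, not authoritative values.
CONUS_BBOX = {"min_lat": 24.0, "max_lat": 50.0, "min_lon": -125.0, "max_lon": -66.0}

def within_bbox(df, bbox, lat_col="Lat", lon_col="Long_"):
    """Return the rows whose coordinates fall inside the bounding box."""
    mask = (
        df[lat_col].between(bbox["min_lat"], bbox["max_lat"])
        & df[lon_col].between(bbox["min_lon"], bbox["max_lon"])
    )
    return df.loc[mask].copy()

# Toy rows standing in for the Kaggle data (coordinates are rough).
toy = pd.DataFrame({
    "Admin2": ["Autauga", "Anchorage", "Honolulu", "Unassigned"],
    "Province_State": ["Alabama", "Alaska", "Hawaii", "Alabama"],
    "Lat": [32.5, 61.2, 21.3, 0.0],
    "Long_": [-86.6, -149.9, -157.9, 0.0],
})
conus = within_bbox(toy, CONUS_BBOX)
print(conus["Admin2"].tolist())
```

A box like this also drops rows with a Lat and Long_ of 0, so it could replace both the position check and the Alaska/Hawaii exclusion.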

We can get state-level data for a less granular map by choosing the entries with Lat = 0, but this is more complicated to plot since we lose position information. Go ahead and create the data set, but ignore it for now.

In [None]:
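# .copy() gives an independent frame, so adding columns later does not
# trigger pandas' SettingWithCopyWarning
rona_cus_df = rona_df.loc[(rona_df['UID']>1000) &
                          (rona_df['Lat'] > 0) &
                          (~rona_df['Province_State'].isin(['Alaska', 'Hawaii']))
                         ].copy()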

In [None]:
rona_state_df = rona_df.loc[(rona_df['UID']>1000) & 
                            (rona_df['Lat'] == 0) &
                            (~rona_df['Province_State'].isin(['Alaska', 'Hawaii'])) &
                            (~rona_df['Admin2'].isin(['Unassigned', '']))
                           ]

In [None]:
rona_cus_df.head()

Create the GeoDataFrame for plotting

# We need a key to merge on, so combine county and state into a single County_State column and strip out spaces

In [None]:
rona_cus_df["County_State"] = rona_cus_df["Admin2"] + rona_cus_df["Province_State"]

rona_cus_df['County_State'] = rona_cus_df['County_State'].str.replace(" ","")

rona_cus_df["County_State"]
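
The key construction above can be tried on a few toy rows first. This is a minimal sketch; `make_key` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def make_key(county, state):
    """Concatenate county and state names and drop all spaces."""
    return (county + state).str.replace(" ", "", regex=False)

# Toy rows using the same column names as the Kaggle file.
toy = pd.DataFrame({
    "Admin2": ["Autauga", "St. Clair", "New York"],
    "Province_State": ["Alabama", "Alabama", "New York"],
})
toy["County_State"] = make_key(toy["Admin2"], toy["Province_State"])
print(toy["County_State"].tolist())
```

Passing `regex=False` makes the space replacement a literal string replacement regardless of the pandas version's default.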

# Now read in the population dataset 

In [None]:
#read_csv kept raising an error on import; specifying the encoding solved it
population = pd.read_csv("data/pop.csv",encoding='latin-1')

population.head()

# Now strip out the word County to make it match the structure of the other dataset

In [None]:
population['County_Alone'] = population['CTYNAME'].str.replace('County', '')

In [None]:
population.head()

In [None]:
population["County_State"] = population["County_Alone"] + population["STNAME"]

In [None]:
population.head()

population['County_State'] = population['County_State'].str.replace(" ","")

population.head()
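
Replacing the word "County" removes it wherever it appears in a name. As a hedged alternative, an anchored regular expression removes only a trailing " County". The second name below is made up to show the difference.

```python
import pandas as pd

# The second name is hypothetical and only illustrates the edge case.
names = pd.Series(["Autauga County", "County Line County"])

# Plain replacement removes every occurrence of the word.
everywhere = (names.str.replace("County", "", regex=False)
                   .str.replace(" ", "", regex=False))

# An anchored pattern removes only the trailing " County".
suffix_only = (names.str.replace(r"\s+County$", "", regex=True)
                    .str.replace(" ", "", regex=False))

print(everywhere.tolist())
print(suffix_only.tolist())
```

Whichever form is used, both data sets need the same normalization for the keys to match.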

In [None]:
population = population[['POPESTIMATE2019','County_State']]

In [None]:
pd.set_option('display.max_rows', None)

population

In [None]:
#Left join both datasets on the key
Left_join = pd.merge(population,  
                     rona_cus_df,  
                     on ='County_State',  
                     how ='left') 
Left_join 



gdf = geopandas.GeoDataFrame(Left_join ,
                             geometry=geopandas.points_from_xy(Left_join.Long_, Left_join.Lat))


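# Keep only rows that matched a county in the COVID data; .copy() lets us add columns safely later
gdf_final = gdf[gdf.Province_State.notnull()].copy()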


gdf_final
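
To see which population keys found no partner in the left join, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses toy frames; the key `ExampleNowhere` is invented so that one row stays unmatched.

```python
import pandas as pd

pop = pd.DataFrame({
    "County_State": ["AutaugaAlabama", "BaldwinAlabama", "ExampleNowhere"],
    "POPESTIMATE2019": [100, 200, 300],  # made-up populations
})
cases = pd.DataFrame({
    "County_State": ["AutaugaAlabama", "BaldwinAlabama"],
    "Province_State": ["Alabama", "Alabama"],
})

merged = pd.merge(pop, cases, on="County_State", how="left", indicator=True)

# Keys present only in the population table.
unmatched = merged.loc[merged["_merge"] == "left_only", "County_State"].tolist()

# Same filter as the notebook: keep rows that found a match.
matched = merged[merged["Province_State"].notnull()]
print(unmatched, len(matched))
```

Listing the unmatched keys is a quick way to spot naming differences between the two sources before plotting.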

Now plot the location data on the continental US map. (This requires the [descartes](https://pypi.python.org/pypi/descartes) module installed above.)

* Source https://www.mikulskibartosz.name/how-to-change-plot-size-in-jupyter-notebook/


In [None]:
plt.rcParams["figure.figsize"] = (20,10)

In [None]:
plt.rcParams["figure.figsize"] = (20,20)
plt.rcParams['image.interpolation'] = 'none'
plt.rcParams['image.cmap'] = 'binary'


In [None]:
# Note: geopandas.datasets was removed in GeoPandas 1.0, so this call needs an older release
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))

# Restrict to the United States. The `continent` column holds continent names,
# so filter on the ISO country code instead.
# A black edge keeps the outline visible on a white background.
ax = world[world.iso_a3 == 'USA'].plot(
    color='white', edgecolor='black')

gdf = geopandas.GeoDataFrame(rona_cus_df,
                             geometry=geopandas.points_from_xy(rona_cus_df.Long_, rona_cus_df.Lat))

# We can now plot our ``GeoDataFrame`` on top of the country outline.
gdf.plot(ax=ax, color='red', marker='s', markersize=1)
plt.axis('off')
plt.show()

In [None]:
plt.rcParams["figure.figsize"] = (20,10)

# Generate SIR Dataset

In [None]:
#Make infection data per white paper: confirmed cases / population
first_loc = gdf_final.columns.get_loc("1/22/20")
last_loc = gdf_final.columns.get_loc("6/14/20")
#The slice end is exclusive, so the last column processed is the one before 6/14/20
for col in gdf_final.columns[first_loc:last_loc]:
    gdf_final['Inf' + str(col)] = gdf_final[col] / gdf_final['POPESTIMATE2019']
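
The per-column division can also be written as one vectorized step. A minimal sketch on toy data with made-up counts, assuming the date columns are already known by name:

```python
import pandas as pd

toy = pd.DataFrame({
    "POPESTIMATE2019": [1000, 2000],
    "1/22/20": [0, 10],
    "1/23/20": [5, 20],
})
date_cols = ["1/22/20", "1/23/20"]

# Divide every date column by the population, row by row,
# then prefix the new columns with "Inf" as in the notebook.
fractions = toy[date_cols].div(toy["POPESTIMATE2019"], axis=0).add_prefix("Inf")
print(fractions)
```

`pd.concat([toy, fractions], axis=1)` would then attach all the new columns at once.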

In [None]:
gdf_final

In [None]:
#Make recovery data per white paper
count = 1
first_loc = gdf_final.columns.get_loc("1/22/20")
last_loc = gdf_final.columns.get_loc("6/14/20")

for col in gdf_final.columns[first_loc:last_loc]:
    #Default value: cumulative cases as a fraction of the population
    gdf_final['Recover' + str(col)] = gdf_final[col] / gdf_final['POPESTIMATE2019']
    if count == 1:
        #First column of the current 14-column window
        col_1 = col
    if count == 14:
        #Last column of the window: overwrite with the change across the window
        count = 0
        col_2 = col
        gdf_final['Recover' + str(col)] = (gdf_final[col_2] - gdf_final[col_1]) / gdf_final['POPESTIMATE2019']
    count = count + 1
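
The counter in the loop above splits the date columns into consecutive 14-column windows and, on the last column of each window, stores the change from the window's first column divided by the population. The sketch below shows only that windowing step on toy data. The column names `d1` to `d6` and the window of 3 are assumptions chosen to keep the example short.

```python
import pandas as pd

toy = pd.DataFrame({
    "POPESTIMATE2019": [100],
    "d1": [0], "d2": [1], "d3": [4], "d4": [6], "d5": [9], "d6": [10],
})
date_cols = ["d1", "d2", "d3", "d4", "d5", "d6"]
window = 3  # the notebook uses 14; 3 keeps the toy example short

changes = {}
# Step through complete, non-overlapping windows of date columns.
for start in range(0, len(date_cols) - window + 1, window):
    first, last = date_cols[start], date_cols[start + window - 1]
    # Change across the window as a fraction of the population.
    changes[last] = (toy[last] - toy[first]) / toy["POPESTIMATE2019"]

print({k: v.tolist() for k, v in changes.items()})
```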

In [None]:
gdf_final

In [None]:
#Separate out infection and recovery data in different dataframes for calculations

first_loc = gdf_final.columns.get_loc("Inf1/22/20")
last_loc =  gdf_final.columns.get_loc("Inf6/13/20")


first_rec = gdf_final.columns.get_loc("Recover1/22/20")
last_rec =  gdf_final.columns.get_loc("Recover6/13/20")

#Make Infection Data
infection = gdf_final.iloc[:,first_loc:last_loc]

#The target quantity is 1-I-R, so compute 1-I here and subtract R below
infection2 = 1-infection

#Make recovery data
recovery = gdf_final.iloc[:,first_rec:last_rec]

#Infection recovery subtraction
infrec = infection2 - recovery.values

#Now concatenate all values in a (R,G,B) manner
final_data = pd.DataFrame(np.rec.fromarrays((infection.values, recovery.values,infrec.values)).tolist(), 
                      columns=infection.columns,
                      index=infection.index)
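
The same (I, R, 1-I-R) triples can be checked on small arrays. This is a minimal sketch with made-up fractions, using `np.stack` in place of the record array, to confirm that the three channels sum to 1 in every cell.

```python
import numpy as np

# Made-up fractions: rows are counties, columns are dates.
infected = np.array([[0.10, 0.20], [0.00, 0.05]])
recovered = np.array([[0.05, 0.10], [0.00, 0.01]])
remainder = 1.0 - infected - recovered

# Stack into shape (rows, dates, 3) so each cell holds an (I, R, 1-I-R) triple.
triples = np.stack([infected, recovered, remainder], axis=-1)

print(triples.shape)
print(np.allclose(triples.sum(axis=-1), 1.0))
```

A 3-D array like this can be passed directly to image-style plotting functions, whereas the DataFrame of tuples is easier to join back to the county identifiers.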

# Now we need the latitude, longitude, and population

In [None]:
data_frame_imp = gdf_final[["POPESTIMATE2019","County_State","geometry"]]

In [None]:
#This dataset has the RGB values and the identifiers needed to plot
final = pd.concat([data_frame_imp, final_data], axis=1)

In [None]:
final