<h1>Folium Choropleth maps</h1>
In this assignment we'll examine three different ways of looking at the Covid-19 case data in New York State. All three maps use the New York state counties map as the base map. While we're going to create a few very nice maps, we'll also get a lot of Pandas practice along the way! The three different maps we'll create are:

<span style="color:blue">Problem 1:</span> The first map looks at the total number of positive cases in each county adjusted for population (incidence). The resulting choropleth map will show which counties were the hardest hit by Covid-19 from the start of the pandemic till the last data date. My version of this map (last data date = 02/17/2022) is in the file <span style="color:blue">Problem 1.html</span>

<span style="color:blue">Problem 2:</span> The second map constructs a <a href="https://github.com/python-visualization/folium/blob/master/examples/TimeSliderChoropleth.ipynb">time slider choropleth map</a> using the same data as in problem 1 with the difference that the output is a map for each day between March 1st 2020 and the latest data date. To smooth out noise, rather than plotting the raw incidence for each day, we'll plot the 8 day moving average of the daily incidence (incidence = cases/population). My version of this map (last data date = 02/17/2022) is in the file <span style="color:blue">Problem 2.html</span>

<span style="color:blue">Problem 3:</span> If you slide through the map from Problem 2, you'll notice that there are large chunks of time where the entire map is almost yellow. This is because the choropleth map is using a range from the lowest 8 day moving average to the highest 8 day moving average to figure out how much a county should be shaded. An alternative way of looking at the data would be to construct daily choropleth maps that focus on the relative incidence of Covid-19 across counties on any given day. One way to do this is to rank the counties by the incidence levels for each day separately. In the third choropleth map, we'll construct a time slider choropleth map which uses the 8 day moving average of these daily ranks (highest to lowest). Public health officials will find this more useful than the Problem 2 map because they can move resources (testing kits, hospital supplies, treatments, etc.) to the counties with a higher incidence even when overall cases are quite low. My version of this map (last data date = 02/17/2022) is in the file <span style="color:blue">Problem 3.html</span>


<h2>Getting the data</h2>
<h3>Data sources</h3>
<li>List of New York counties (<a href="https://www.newyork-demographics.com/counties_by_population">https://www.newyork-demographics.com/counties_by_population</a>)</li>
<li>Population of counties - 2020 census (same source as above)</li>
<li>GeoJson file with county boundaries (File cugir-007865-geojson.json)</li>
<li>Covid cases (new cases, cumulative cases) by county (available from <a href="https://dev.socrata.com/foundry/health.data.ny.gov/xdss-u53e">https://dev.socrata.com/foundry/health.data.ny.gov/xdss-u53e</a> using the <a href="https://dev.socrata.com">Download csv file</a> link)</li>

<h3>Data set up</h3>
<p>
<span style="color:green;font-size:20px">Create a population_and_county_df</span>
<li>Read populations and counties into a dataframe population_and_county_df (use <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_html.html">pd.read_html</a>)</li>
<li>Remove the "County" from each county name (this will be necessary later) (use df.apply for this)</li>
<li>Drop any rows that are not useful (there should be 62 counties in total)</li>
    <li>Drop the "County" column (we'll use the "county" - lowercase c - column)</li>
<li>population_and_county_df should look something like:</li>
<pre>
	Rank	Population	county
0	1	2712360	Kings
1	2	2393104	Queens
2	3	1669127	New York
3	4	1522998	Suffolk
4	5	1468262	Bronx
...	...	...	...
57	58	29936	Schoharie
58	59	26681	Lewis
59	60	24808	Yates
60	61	17920	Schuyler
61	62	5068	Hamilton
62 rows × 3 columns
</pre>
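A minimal sketch of the name-cleaning step, using a tiny hand-made frame in place of the scraped table. The variable name `demo_df` and the three rows are illustrative assumptions; the real table comes from pd.read_html and has more rows.

```python
import pandas as pd

# Hypothetical stand-in for the table returned by pd.read_html
demo_df = pd.DataFrame({
    "Rank": [1, 2, 3],
    "County": ["Kings County", "Queens County", "New York County"],
    "Population": [2712360, 2393104, 1669127],
})

# Strip the trailing "County" so names match the cases data ("Kings", "Queens", ...)
demo_df["county"] = demo_df["County"].apply(
    lambda name: name.replace("County", "").strip()
)

# The original "County" column is no longer needed
demo_df = demo_df.drop("County", axis=1)
print(demo_df["county"].tolist())
# ['Kings', 'Queens', 'New York']
```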

<span style="color:green;font-size:20px">Read the geojson file into a python json object</span>
<li>Store this in a variable geojson_data</li>
<li>We can use the file directly for Problem 1 but will need to store it in a variable and modify it for Problem 2</li>

<span style="color:green;font-size:20px">Get all Covid-19 data from New York State into the variable df by downloading the entire dataset and using the pandas read_csv function</span>
<li>Drop the "Geography" and "Test % Positive" columns (we don't need them)</li>

<span style="color:green;font-size:20px">Join population_and_county_df to the cases dataframe</span>
<li>Use pandas <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html">join</a> function to do the join</li>
<li>This will be the base dataset for all three problems</li>
<li>Drop rows with NaN values. The cases data contains aggregated data by region (New York City, Capital Region, etc.) that are not state counties (case data is already included for each county - e.g., Kings, Queens, New York, Bronx, Albany, etc.). These non-counties result in NaNs when joined with the population data since they are not in the population dataframe</li>
<li>You may need to use pandas <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html">DataFrame.set_index</a> to get the join to work (depends on how you do it)</li>
<li>Finally, convert the "Test Date" column from string to datetime</li>
<li>A sample of what df should now contain</li>
<pre>
	Test Date	County	New Positives	Cumulative Number of Positives	Total Number of Tests Performed	Cumulative Number of Tests Performed	Rank	Population
0	02/04/2023	Albany	41	78486	437	1440170	14	314679
1	02/04/2023	Allegany	3	10740	56	253906	52	46654
2	02/04/2023	Bronx	172	509183	3280	9188543	5	1468262
3	02/04/2023	Broome	16	58467	387	1216208	19	198591
5	02/04/2023	Cattaraugus	5	19188	72	314044	35	77211
...	...	...	...	...	...	...	...	...
78177	03/01/2020	Washington	0	0	0	0	41	61504
78178	03/01/2020	Wayne	0	0	0	0	31	91332
78179	03/01/2020	Westchester	0	0	0	0	7	999723
78181	03/01/2020	Wyoming	0	0	0	0	54	40679
78182	03/01/2020	Yates	0	0	0	0	60	24808
66402 rows × 8 columns
</pre>
<pre>
<class 'pandas.core.frame.DataFrame'>
Int64Index: 66402 entries, 0 to 78182
Data columns (total 8 columns):
 #   Column                                Non-Null Count  Dtype         
---  ------                                --------------  -----         
 0   Test Date                             66402 non-null  datetime64[ns]
 1   County                                66402 non-null  object        
 2   New Positives                         66402 non-null  int64         
 3   Cumulative Number of Positives        66402 non-null  int64         
 4   Total Number of Tests Performed       66402 non-null  int64         
 5   Cumulative Number of Tests Performed  66402 non-null  int64         
 6   Rank                                  66402 non-null  object        
 7   Population                            66402 non-null  object        
dtypes: datetime64[ns](1), int64(4), object(3)
memory usage: 4.6+ MB
</pre>
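The join-and-drop logic described above can be sketched on two miniature frames. The names `pop_df`, `cases_df` and `joined` are hypothetical, and the three case rows are made up for illustration; only the pattern (set_index on the lookup frame, join on the county column, dropna) mirrors the steps.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames
pop_df = pd.DataFrame({"county": ["Albany", "Bronx"],
                       "Population": [314679, 1468262]})
cases_df = pd.DataFrame({
    "Test Date": ["03/01/2020", "03/01/2020", "03/01/2020"],
    "County": ["Albany", "Bronx", "Capital Region"],  # last row is a region, not a county
    "New Positives": [0, 0, 0],
})

# join matches cases_df["County"] against the index of the other frame
joined = cases_df.join(pop_df.set_index("county"), on="County")

# The region has no population, so its row contains NaN and is dropped
joined = joined.dropna()

# Parse the month/day/year strings into datetime64 values
joined["Test Date"] = pd.to_datetime(joined["Test Date"], format="%m/%d/%Y")
print(joined["County"].tolist())
# ['Albany', 'Bronx']
```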

In [264]:
import pandas as pd
import numpy as np
import datetime as dt
import json


#Data set up
#See steps above
population_and_county_df = pd.read_html("https://www.newyork-demographics.com/counties_by_population")[0]
population_and_county_df['county']=population_and_county_df['County'].apply(lambda x: x.replace("County","").strip())
population_and_county_df.drop(62,inplace=True)
population_and_county_df.drop("County",axis=1,inplace=True)

# geojson_data
county_geojson_file = "../class-datasets/cugir-007865-geojson.json"
with open(county_geojson_file,'r') as f:
    geojson_data_string = f.read()
geojson_data = json.loads(geojson_data_string)

 
#Get the cases dataframe object 
df = pd.read_csv("../class-datasets/New_York_State_Statewide_COVID-19_Testing.csv")
df.drop(["Geography","Test % Positive"],inplace=True,axis=1)
#Join with population and county data
df = df.join(population_and_county_df.set_index('county'),on="County")
df.dropna(inplace=True)
df["Test Date"] = pd.to_datetime(df["Test Date"],format="%m/%d/%Y")

In [265]:
df.head()

Unnamed: 0,Test Date,County,New Positives,Cumulative Number of Positives,Total Number of Tests Performed,Cumulative Number of Tests Performed,Rank,Population
0,2023-02-11,Albany,30,78729,410,1443546,14,314679
1,2023-02-11,Allegany,5,10770,75,254256,52,46654
2,2023-02-11,Bronx,125,510301,3736,9214453,5,1468262
3,2023-02-11,Broome,21,58635,404,1220039,19,198591
5,2023-02-11,Cattaraugus,5,19249,65,314619,35,77211


In [266]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66836 entries, 0 to 78693
Data columns (total 8 columns):
 #   Column                                Non-Null Count  Dtype         
---  ------                                --------------  -----         
 0   Test Date                             66836 non-null  datetime64[ns]
 1   County                                66836 non-null  object        
 2   New Positives                         66836 non-null  int64         
 3   Cumulative Number of Positives        66836 non-null  int64         
 4   Total Number of Tests Performed       66836 non-null  int64         
 5   Cumulative Number of Tests Performed  66836 non-null  int64         
 6   Rank                                  66836 non-null  object        
 7   Population                            66836 non-null  object        
dtypes: datetime64[ns](1), int64(4), object(3)
memory usage: 4.6+ MB


<h1>Problem 1: Incidence choropleth map</h1>
<li>For this problem, you need to create a choropleth map of NY counties that are colored by the incidence of covid-19 in the county</li>
<li>We'll define <span style="color:blue">incidence</span> as the cumulative number of cases divided by the population</li>
<li>The template for the map is in the cell below. You need to create a new column <span style="color:blue">incidence</span> and fill in the missing attributes in the template</li>
<li>Since we only need the latest value of cumulative cases, find the max test_date in the dataframe and then extract the data for that date (easiest is to set the index to the test_date column and then use loc to get data for that date)</li>
<li>Then extract the cumulative cases column and the population column and do the division and store the result in an incidence column</li>
<li>Choose an appropriate custom color scheme from <a href="https://github.com/python-visualization/folium/blob/v0.2.0/folium/utilities.py#L104">https://github.com/python-visualization/folium/blob/v0.2.0/folium/utilities.py#L104</a></li>
<li>The dataframe lasts should be similar to:</li>

<pre>


County	New Positives	Cumulative Number of Positives	Total Number of Tests Performed	Cumulative Number of Tests Performed	Rank	Population
Test Date							
2023-02-04	Albany	41	78486	437	1440170	14	314679
2023-02-04	Allegany	3	10740	56	253906	52	46654
2023-02-04	Bronx	172	509183	3280	9188543	5	1468262
2023-02-04	Broome	16	58467	387	1216208	19	198591
2023-02-04	Cattaraugus	5	19188	72	314044	35	77211
...	...	...	...	...	...	...	...
2023-02-04	Washington	6	15711	112	299566	41	61504
2023-02-04	Wayne	11	21793	218	397782	31	91332
2023-02-04	Westchester	87	333945	1801	6065694	7	999723
2023-02-04	Wyoming	3	9889	38	162150	54	40679
2023-02-04	Yates	3	4433	95	96151	60	24808
62 rows × 7 columns
</pre>
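A small sketch of the "set the index, then use loc" route to the latest date. The frame `demo` and the result `lasts_demo` are hypothetical names, and the 2023-02-03 figures are invented purely so that there is an older date to filter out.

```python
import pandas as pd

# Hypothetical two-day, two-county sample (the 2023-02-03 numbers are made up)
demo = pd.DataFrame({
    "Test Date": pd.to_datetime(["2023-02-03", "2023-02-03",
                                 "2023-02-04", "2023-02-04"]),
    "County": ["Albany", "Bronx", "Albany", "Bronx"],
    "Cumulative Number of Positives": [78445, 509011, 78486, 509183],
    "Population": [314679, 1468262, 314679, 1468262],
})

latest = demo["Test Date"].max()

# Passing a list to loc keeps the result a DataFrame, one row per county
lasts_demo = demo.set_index("Test Date").loc[[latest]].copy()

# Incidence = cumulative cases divided by population
lasts_demo["incidence"] = (
    lasts_demo["Cumulative Number of Positives"] / lasts_demo["Population"]
)
print(lasts_demo[["County", "incidence"]])
```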

In [267]:
#Extract the most recent test date
df['Population'] = df['Population'].astype('int')
lasts = df[ df['Test Date'] == df['Test Date'].max()].copy()

#Calculate the incidence (cases/population)
lasts['incidence'] = lasts['Cumulative Number of Positives'] / lasts['Population']

#Draw the folium map. I've scaffolded this for you
import folium
m = folium.Map(location=[42.9226618,-75.6051974],zoom_start=6) #Figure out a center and zoom level for the map
#Source for county geojson: https://cugir.library.cornell.edu/catalog/cugir-007865
c = folium.Choropleth(geo_data=geojson_data,  #you can directly use the geodata file here
                      data= lasts, #the dataframe
                     columns=['County','incidence'], #the columns - column 1: matches key in geodata; column 2: the data column from the dataframe
                      key_on='feature.properties.name', #the field in the geojson data that will be used to attach dataframe data to the base map
                      fill_color='OrRd', #experiment with fill colors
                      fill_opacity=1, #and with opacity
                      legend_name="Distribution of incidence",
                     highlight=True) #should display county name when hovering over one in the map
c.add_to(m)
c.geojson.add_child(
    folium.features.GeoJsonTooltip(['name'],labels=False)
)
m

<h1>Problem 2: Create a time slider choropleth map</h1>
<li>In this problem, you'll create a choropleth map object that changes with changes in the date. The map will include a slider that the user can use to "slide" through the dates (from March 1st 2020 to the last data date)</li>
<li>The thing to focus on here is the time slider. Time, as we know (or should know anyway) is an inexorable thing that marches on regardless of our efforts to contain it. What our slider needs is an understanding of where the data points are in the time scale that started with the big bang and is still moving</li>
<li>The way we'll deal with it here is to use the concept of <a href="https://en.wikipedia.org/wiki/Unix_time">Unix Time</a>. Unix time is an integer that starts at 0 at 00:00:00 UTC on 1st January 1970 (think of that date as the unix big bang date) and increases by 1 for each second.</li>
<li>See below for what today is in unix time (or how many seconds since the unix big bang!)</li>
<li>(More on time later) A time slider choropleth map is created using the TimeSliderChoropleth object (see <a href="https://github.com/python-visualization/folium/blob/master/examples/TimeSliderChoropleth.ipynb">https://github.com/python-visualization/folium/blob/master/examples/TimeSliderChoropleth.ipynb</a>. You can download the notebook and play around with it)</li>
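As a quick illustration of Unix time, the sketch below converts one calendar date to seconds since the epoch and back using only the standard library. The variable names are illustrative; the date 2 March 2020 is chosen because its timestamp appears in the sample tables later on.

```python
import datetime as dt

# Midnight UTC on 2 March 2020, as seconds since 1970-01-01 00:00:00 UTC
march_2 = dt.datetime(2020, 3, 2, tzinfo=dt.timezone.utc)
unix_seconds = int(march_2.timestamp())
print(unix_seconds)  # 1583107200

# Going the other way recovers the calendar date
back = dt.datetime.fromtimestamp(unix_seconds, tz=dt.timezone.utc)
print(back.date())  # 2020-03-02

# "Now" in Unix time (the value depends on when you run this)
now_seconds = int(dt.datetime.now(dt.timezone.utc).timestamp())
print(now_seconds)
```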

<span style="color:green;font-size:30px">Rough steps</span>
<p></p>
<span style="color:green;font-size:24px">Calculate 8 day moving average</span>
<li>calculate the incidence (new cases/population) for each row in df</li>
<li>Smooth out incidence by computing an 8-day moving average. Note that each county will have its own moving average series (i.e., you can't construct rolling windows on the entire incidence column but must first group the data by county)</li>
<ol>
    <li>use <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html">DataFrame.groupby</a> to group the data by county</li>
    <li>use <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html">groupby.transform</a> to apply a function on each group individually</li>
    <li>the function should construct a rolling mean (call <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html">rolling</a> and <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html">mean</a> in the function. Also, fill nans with zeros using <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html">fillna</a>). Use the moving average examples we constructed in class as a guide</li>
</ol>
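The grouping matters because a rolling window must not run across county boundaries. A minimal sketch with two made-up counties, "A" and "B", and a window of 2 instead of 8 to keep the output short:

```python
import pandas as pd

# Hypothetical daily incidence for two counties, already sorted by date
demo = pd.DataFrame({
    "County": ["A"] * 4 + ["B"] * 4,
    "incidence": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

# window=2 keeps the example short; the assignment uses a window of 8
demo["ma"] = demo.groupby("County")["incidence"].transform(
    lambda s: s.rolling(2).mean().fillna(0)
)
print(demo["ma"].tolist())
# [0.0, 1.5, 2.5, 3.5, 0.0, 15.0, 25.0, 35.0]
```

Note that the first value for county "B" is 0.0 rather than the average of 4.0 and 10.0: each group starts its own window.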
<p></p>
<span style="color:green;font-size:24px">Convert test date to unix time</span>
<li>I've done this for you but do walk through the code</li>
<li>Convert test_date to pandas datetime</li>
<li>Add one day to each value</li>
<li>Converting the datetime values to 64-bit integers (with astype or view) gives the time in nanoseconds since the Unix epoch</li>
<li>The result is in nanoseconds, we'll divide this by 10 to the power of 9 to get seconds (unix time)</li>
<li><b>Note:</b> Why add one day? Dividing by 10**9 with the integer division (//) operator keeps the result an integer, and because all our times are exactly at 00:00:00.0 the division is exact, so truncation by itself never changes the date. The extra day looks like a display fix instead: each timestamp is midnight UTC, which is still the previous evening in New York, so a slider that labels dates in the viewer's local time zone would show every date one day early. Shifting by a day is a blunt correction (the label can still be off for viewers in other time zones), but nobody said life is perfect!</li>
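The conversion steps above can be sketched on a two-element series. The variable names are illustrative, and the explicit cast to datetime64[ns] is an assumption added here so that the integer values are guaranteed to be nanoseconds.

```python
import pandas as pd

# Pin nanosecond resolution explicitly so the integers below are nanoseconds
dates = pd.Series(pd.to_datetime(["2020-03-01", "2020-03-02"])).astype("datetime64[ns]")

# Add one day, then read the values as nanoseconds since the epoch
nanoseconds = (dates + pd.Timedelta(1, "d")).astype("int64")

# Integer division by 10**9 converts nanoseconds to whole seconds
unix_date = nanoseconds // 10**9
print(unix_date.tolist())
# [1583107200, 1583193600]
```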

<p></p>
<span style="color:green;font-size:24px">Make a colormap</span> 
<li>We need to assign a color and opacity to each county at each point in time based on the value of the moving average of incidence</li>
<li>For opacity, we'll use a constant. 0.3 is probably good enough</li>
<li>For color, we'll choose a color palette (YlOrRd is my choice but choose any you like from the palette link in problem 1 above)</li>
<li>Assign the lowest color (yellow with my choice) to the lowest value in the ma8 column</li>
<li>Assign the highest color (red with my choice) to the highest value in the ma8 column</li>
<li><span style="color:blue">branca</span> is a companion library to folium that contains non geography specific map information (like colors!). We'll use that to create a linear color scale from the lowest ma8 value to the highest ma8 value</li>
<li>Then add two columns to df. The column <b>color</b> with the color (each ma8 value will map to a specific color) and the <b>opacity</b> with the opacity (a constant e.g., 0.3)</li>
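To make the idea of a linear color scale concrete, here is a hand-rolled sketch that interpolates between two RGB colors. It is not branca's implementation: the helper name `linear_color` and the two endpoint colors are assumptions chosen for illustration, and in the assignment branca's colormap does this job (with more than two anchor colors).

```python
def linear_color(value, vmin, vmax, low=(255, 255, 204), high=(128, 0, 38)):
    """Interpolate an RGB hex color between `low` and `high` for value in [vmin, vmax]."""
    if vmax == vmin:
        fraction = 0.0
    else:
        # Clamp so out-of-range values map to the end colors
        fraction = min(max((value - vmin) / (vmax - vmin), 0.0), 1.0)
    channels = [round(lo + fraction * (hi - lo)) for lo, hi in zip(low, high)]
    return "#{:02x}{:02x}{:02x}".format(*channels)

print(linear_color(0.0, 0.0, 1.0))  # #ffffcc (lowest value, pale yellow)
print(linear_color(1.0, 0.0, 1.0))  # #800026 (highest value, dark red)
```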

<p></p>
<span style="color:green;font-size:24px">Replace county names</span>
<li>Since the geojson file needs ids to be strings with no special characters, and we have strings like "New York" and "St. Lawrence" in the data (with a space and a period), we'll replace all county names with "0", "1", "2", ... in the dataframe</li>
<li>We'll have to do this in the geojson object as well (currently, those are even more obscure), so we'll create a dictionary that maps county names to integer strings</li>
<li>Example:</li>
<pre>
{'Kings': '0',
 'Queens': '1',
 'New York': '2',
 'Suffolk': '3',
 'Bronx': '4',
 'Nassau': '5',
 .....
}
</pre>
<li>Make a list of county names (you can get this from population_and_county_df)</li>
<li>Create an empty dictionary (county_mapping)</li>
<li>Iterate through the county names list adding key (county name) and value (county number as str) to the dict</li>
<li>Replace county names in df with county numbers (use apply for this). You might want to keep a separate column for county names, though that's not really necessary</li>
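A compact sketch of the renaming idea on made-up inputs. The short `county_names` list and the `geo` fragment are hypothetical stand-ins; the real list comes from population_and_county_df and the real features come from the county GeoJSON file.

```python
# Hypothetical short list; in the assignment this comes from population_and_county_df
county_names = ["Kings", "Queens", "New York", "St. Lawrence"]

# Map each name to its position, stored as a string
county_mapping = {name: str(i) for i, name in enumerate(county_names)}

# A made-up GeoJSON fragment shaped like a FeatureCollection
geo = {"features": [
    {"properties": {"name": "New York"}},
    {"properties": {"name": "St. Lawrence"}},
]}

# Give every feature an "id" that matches the mapped county number
for feature in geo["features"]:
    feature["id"] = county_mapping[feature["properties"]["name"]]

print(county_mapping["New York"], [f["id"] for f in geo["features"]])
# 2 ['2', '3']
```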

<p></p>
<span style="color:green;font-size:24px">Create a style dictionary</span>
<li><b>Note</b>: At various points, you may end up with data in the index that you want to use as if it was in a column. Pandas provides a handy way of converting the index into a dataframe column. <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html">reset_index</a> does that for you and you can use <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html">set_index</a> to change an existing index or add a new index</li>
<li>The style dictionary contains the elements that go into the map (the test_date - a unix time value; the color - a color that reflects the ma8 value; and the opacity - a constant)</li>
<li>iterate through the counties
    <ol>
        <li>create a df2 for each county that contains the color and opacity and uses the test_date as index</li>
        <li>convert the df2 into a dictionary using <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_dict.html">df.to_dict()</a></li>
        <li>add (county id, df2 as a dictionary) as a key value pair to style_dict</li>
    </ol>
</li>
<li>An example of df2:</li>
<pre>
	           color	opacity
test_date		
1583107200	#ffffccff	0.5
1583193600	#ffffccff	0.5
1583280000	#ffffccff	0.5
1583366400	#ffffccff	0.5
1583452800	#ffffccff	0.5
...	...	...
1634256000	#fff2acff	0.5
1634342400	#ffe996ff	0.5
1634428800	#ffeb9cff	0.5
1634515200	#ffeea1ff	0.5
1634601600	#ffeb9cff	0.5
597 rows × 2 columns
</pre>
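The nested shape of the style dictionary is easiest to see on a tiny example. The frame `demo` below is made up (two counties, two dates), and iterating with groupby is just one possible way to visit each county.

```python
import pandas as pd

# Hypothetical styled rows for two counties over two dates
demo = pd.DataFrame({
    "County": ["0", "0", "1", "1"],
    "unix_date": [1583107200, 1583193600, 1583107200, 1583193600],
    "color": ["#ffffccff", "#ffeb9cff", "#ffffccff", "#fff2acff"],
    "opacity": [0.5, 0.5, 0.5, 0.5],
})

styledict = {}
for county, group in demo.groupby("County"):
    # orient "index" yields {unix_date: {"color": ..., "opacity": ...}}
    styledict[county] = (
        group.set_index("unix_date")[["color", "opacity"]].to_dict("index")
    )

print(styledict["1"][1583193600]["color"])
# #fff2acff
```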

<p></p>
<span style="color:green;font-size:24px">Create the time slider choropleth map</span>
<li>I've done this part for you</li>
<li>Because we're anchoring the data to the maximum incidence since March 2020, you'll notice that the map discriminates between counties the most during the first omicron wave (the number of identified cases was an order of magnitude higher than at other times during the pandemic). We'll, sort of, address this issue in the next problem.</li>

In [268]:
#Make a copy of df. It will make your life easier!
#Keep this in a separate cell. You don't want to accidentally copy a different version of df into df_save
df_save = df.copy()

In [269]:
#Copy the saved dataframe into df
#This way, if your changes don't work, you can start from scratch without having to run all the earlier code

df = df_save.copy() 
#Calculate 8 day moving average of new cases/population
df = df.sort_values(by=['Test Date'])
df['incidence'] = df['New Positives'] / df['Population']
df['ma8'] = df.groupby('County')['incidence'].transform(lambda x: x.rolling(8).mean().fillna(0))

#Convert test_date to unixtime
df['unix_date'] = (pd.to_datetime(df['Test Date']) + pd.Timedelta(1,'d')).astype('int64') // (10 ** 9)

#Make a color map
import branca.colormap as cm
max_value = df['ma8'].max()
min_value = df['ma8'].min()
cmap = cm.linear.YlOrRd_09.scale(min_value, max_value) 
df['color'] = df["ma8"].map(cmap)
df['opacity']=0.5

#Map each county to a unique value "0", "1", "2", etc.
#The values must be str objects, not ints
county_mapping = dict()
for counter, county in enumerate(population_and_county_df['county']):
    county_mapping[county] = str(counter)
#Keep the original names in a separate "county_name" column
df['county_name'] = df['County']
#Replace the names in "County" with the county numbers and make that column the index
df['County'] = df['County'].map(county_mapping)
df = df.set_index('County')
#Update the geojson data: give each feature an "id" of "0", "1", "2", ....
#Each feature stores its county name under properties -> name
for feature in geojson_data['features']:
    name = feature['properties']['name']
    if name in county_mapping:
        feature['id'] = county_mapping[name]

#Create styledict
styledict = dict()
for county in df.index.unique():
    df2 = df[df.index == county][['unix_date','color','opacity']] 
    df2 = df2.set_index('unix_date')
    df2 = df2.to_dict('index')
    styledict[county] = df2

#Render the map
import folium,json
from folium.plugins import TimeSliderChoropleth
m = folium.Map([42, -78],  zoom_start=6)

g = TimeSliderChoropleth(
    json.dumps(geojson_data), #Contains geojson. Features must have a key "id" that contains 
                    #strings without special chars or spaces
                    #In other words, we'll need to do something about 
                    #"New York" and "St. Lawrence" counties
    styledict=styledict, #A dictionary. Keys must be the same as the id key in 
                            #the geojson object. Values must be a dictionary of
                            #{unix_timestamp: {"color":color_code,"opacity":float}}
).add_to(m)


m

#Save the map to an html file that you can open in the browser for a larger look
#To save, remove the comment from the following command
# m.save("problem 2 maps.html")

<h1>Problem 3: Get daily covid incidence ranks and create a timeslider choropleth map</h1>
In problem 2, we saw how the covid case incidence changed over time for each county in New York State. For resource allocation, however, an administrator may want to know which counties have a higher incidence <span style="color:red">relative</span> to other counties. The administrator could then move resources (testing, doctors and nurses, hospital supplies) to the counties that have a relatively higher number of cases.

For this problem, we'll create a time slider choropleth map that, at each point in time, shades the counties with a higher relative incidence darker than the counties with a lower relative incidence. To get relative incidence, we'll rank the counties by their incidence (cases/population) at each point in time. To smooth out noise, we'll use the 8 day moving average of the rank as the data for the choropleth map.

<span style="color:green;font-size:30px">Rough steps</span>
<p></p>
<span style="color:green;font-size:24px">Calculate relative incidences by date</span>
<li>For each date, sum up the values in the incidence column generated in problem 2</li>
<li>Store this in the variable <span style="color:blue">inc_sum_df</span></li>
<li>Then divide each incidence in df by the total incidence for the same date to get what proportion of population adjusted cases are in a county. Write a function to do this so that you can set the result to zero whenever the denominator is zero. inc_sum_df will give the total incidence for a given date (the denominator)</li>
<li>Then use apply to create a new dataframe column "relative_incidence"</li>
<li>A sample of what to expect in inc_sum_df:</li>
<pre>
 	incidence
test_date	
1583107200	0.000000e+00
1583193600	9.955628e-07
1583280000	0.000000e+00
1583366400	9.557185e-07
1583452800	2.063809e-05
...	...
1644796800	1.122646e-02
1644883200	1.057133e-02
1644969600	1.576210e-02
1645056000	1.837740e-02
1645142400	1.755118e-02
719 rows × 1 columns
</pre>
<li>A sample of what to expect in df['relative_incidence']. The NaNs come from dates on which the total incidence is zero: the calculation is then 0/0 on NumPy floats, which yields NaN with a warning instead of raising an exception (so try..except won't catch it, but you can ignore them)</li>
<pre>
0              NaN
1         0.000000
2              NaN
3        38.242138
4         0.000000
           ...    
44573     0.000000
44574     1.852272
44575     2.484564
44576     1.065492
44577     0.000000
Name: relative_incidence, Length: 44578, dtype: float64
</pre>
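As a cross-check on the row-by-row function, the same ratio can be sketched with a grouped transform. This is an alternative formulation, not the approach the assignment asks for, and the frame `demo` with its four rows is made up for illustration.

```python
import pandas as pd

# Hypothetical incidences: nothing reported on the first date
demo = pd.DataFrame({
    "unix_date": [1583107200, 1583107200, 1583193600, 1583193600],
    "County": ["0", "1", "0", "1"],
    "incidence": [0.0, 0.0, 0.001, 0.003],
})

# Total incidence across counties for each date, aligned back to every row
daily_total = demo.groupby("unix_date")["incidence"].transform("sum")

# 0/0 on the first date produces NaN rather than raising an exception
demo["relative_incidence"] = demo["incidence"] / daily_total * 100
print(demo["relative_incidence"].round(6).tolist())
```

On the second date the two counties account for roughly 25% and 75% of the population adjusted cases; on the first date both values are NaN.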

<p></p>
<span style="color:green;font-size:24px">Calculate ranks and moving averages of rank</span>
<li>Create a column, rank, that ranks the relative incidence for each county within each date. Use groupby and <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html">df.rank</a> to do this</li>
<li>For each county, calculate the 8 day moving average of rank. Use groupby, rolling, and mean and fill any NaNs with 0.0 (see <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html">df.fillna</a>). Store this in a column named rma8</li>
<li>Example of df['rank']:</li>
<pre>
0         NaN
1         1.0
2         NaN
3         2.0
4         1.0
         ... 
44573     1.0
44574    42.0
44575    54.0
44576    20.0
44577     1.0
Name: rank, Length: 44578, dtype: float64
</pre>
<li>Example of df['rma8']:</li>
<pre>
county
0      0.000
0      0.000
0      0.000
0      0.000
0      0.000
       ...  
61    32.250
61    34.250
61    34.375
61    36.250
61    31.250
Name: rma8, Length: 44578, dtype: float64
</pre>
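The two grouped operations can be sketched together on a small made-up frame. The dates 1 and 2 and the three county ids are placeholders, and the window is 2 instead of 8 to keep the output short.

```python
import pandas as pd

# Hypothetical relative incidences for three counties on two dates
demo = pd.DataFrame({
    "unix_date": [1, 1, 1, 2, 2, 2],
    "County": ["0", "1", "2", "0", "1", "2"],
    "relative_incidence": [10.0, 30.0, 10.0, 50.0, 20.0, 30.0],
})

# Rank within each date; method="dense" gives ties the same rank with no gaps
demo["rank"] = demo.groupby("unix_date")["relative_incidence"].rank(method="dense")
print(demo["rank"].tolist())
# [1.0, 2.0, 1.0, 3.0, 1.0, 2.0]

# Smooth each county's rank over time (window=2 here; the assignment uses 8)
demo["rma"] = demo.groupby("County")["rank"].transform(
    lambda s: s.rolling(2).mean().fillna(0)
)
print(demo["rma"].tolist())
# [0.0, 0.0, 0.0, 2.0, 1.5, 1.5]
```

With the default ascending order, the county with the highest relative incidence on a date receives the largest rank.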

<p></p>
<span style="color:green;font-size:24px">Create the time slider choropleth map</span>
<li>First use branca to generate the color and the opacity. Note that you can use a global rma8 max and min because the ranks for each day are scaled to the number of counties and the global max and min will, therefore, be almost the same as the max and the min for each day. This is a lot easier than scaling each day separately</li>
<li>Draw the map. This is identical to the map in problem 2</li>



In [270]:
#Make a copy
#Remember to keep this in a separate cell and not run it twice!
df_save2 = df.copy()

In [271]:
df = df_save2.copy()
#Calculate relative incidences by date
df.reset_index(inplace=True) #Restore the default index 0, 1,2,.. We need unix_date as a col

In [278]:
df.head()

Unnamed: 0,County,Test Date,New Positives,Cumulative Number of Positives,Total Number of Tests Performed,Cumulative Number of Tests Performed,Rank,Population,incidence,ma8,unix_date,color,opacity,county_name,relative_incidence
0,59,2020-03-01,0,0,0,0,60,24808,0.0,0.0,1583107200,#ffffccff,0.5,Yates,
1,8,2020-03-01,0,0,0,0,9,757332,0.0,0.0,1583107200,#ffffccff,0.5,Monroe,
2,36,2020-03-01,0,0,0,0,37,68466,0.0,0.0,1583107200,#ffffccff,0.5,Madison,
3,38,2020-03-01,0,0,0,0,39,62253,0.0,0.0,1583107200,#ffffccff,0.5,Livingston,
4,58,2020-03-01,0,0,0,0,59,26681,0.0,0.0,1583107200,#ffffccff,0.5,Lewis,


In [273]:
inc_sum_df = pd.DataFrame(df.groupby('unix_date')['incidence'].sum())
inc_sum_df[1676160000 == inc_sum_df.index].values[0][0]

0.006289894191226189

In [280]:
inc_sum_df = df.sort_values('unix_date').groupby('unix_date')['incidence'].sum()
inc_sum_df

unix_date
1583107200    0.000000e+00
1583193600    1.000277e-06
1583280000    0.000000e+00
1583366400    9.677982e-07
1583452800    2.074080e-05
                  ...     
1675814400    9.335799e-03
1675900800    9.858525e-03
1675987200    9.236923e-03
1676073600    8.247203e-03
1676160000    6.289894e-03
Name: incidence, Length: 1078, dtype: float64

In [299]:
inc_sum_df[1676160000 == inc_sum_df.index].values[0]

0.006289894191226189

In [317]:
df = df_save2.copy()
#Calculate relative incidences by date
df.reset_index(inplace=True) #Restore the default index 0, 1,2,.. We need unix_date as a col

#group the data by unix_date and find the sum of incidence for each group
#inc_sum_df should be a dataframe with one column (the sum -call it incidence) 
#.   and the unix_date as the index
#I've done the group_by - you need to extract and sum incidences for each date
#Each date will have one incidence for each county, add them up!
inc_sum_df = df.sort_values('unix_date').groupby('unix_date')['incidence'].sum()


#A function for calculating the pct_incidence
#Given a row x, extract the incidence and the unix_date associated
# with that row
#Using the unix_date, get the corresponding sum from inc_sum_df
#Divide the incidence of the row by the value from inc_sum_df and multiply by 100
#return this value
#Note that the sum can be zero. With plain Python numbers that raises a
# ZeroDivisionError, so return 0 as the pct_incidence in that case.
# With NumPy floats, 0/0 gives NaN plus a RuntimeWarning instead;
# the rank step below fills those NaNs with 0
def pct_incidence(x):
    try:
        bottom_val = inc_sum_df[x['unix_date'] == inc_sum_df.index].values[0]
        top_val = x['incidence']
        return (top_val / bottom_val) * 100
    except ZeroDivisionError:
        return 0

        
#Calculate relative incidence
df['relative_incidence']= df.apply(pct_incidence, axis=1)

#Calculate rank and 8 day moving average
#groupby test date and use the rank function to assign a rank 
#Use the dense option (see documentation) and use fillna to replace Nans with 0.0
#https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html
#https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
df['rank'] = df.groupby('unix_date')['relative_incidence'].rank(method='dense').fillna(0)

#group df by county and then, on the rank column, apply a function
#the function should calculate the 8 day moving average for each county
#use rolling and mean for this. If you still have Nans, use fillna to replace with 0.0
#https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transform.html
df['rma8'] = df['ma8'] = df.groupby('County')['rank'].transform(lambda x: x.rolling(8).mean().fillna(0))


#Use branca to get colors (constant opacity)
#see previous problem
import branca.colormap as cm
max_value = df['rma8'].max()
min_value = df['rma8'].min()
cmap = cm.linear.YlOrRd_09.scale(min_value, max_value) 
df['color'] = df["rma8"].map(cmap)
df['opacity']=0.5


#Create a styledict
#the styledict is a dictionary with each county as a key and a dataframe (df2) as value
df.set_index("County",inplace=True)
styledict = dict()
for county in df.index.unique():
    df2 = df[df.index == county][['unix_date','color','opacity']] 
    df2 = df2.set_index('unix_date')
    df2 = df2.to_dict('index')
    styledict[county] = df2

#Get the center and the zoom
#Create a TimeSliderChoropleth map
#Render and save it
m = folium.Map([42, -78],  zoom_start=6)

g = TimeSliderChoropleth(
    json.dumps(geojson_data), #Contains geojson. Features must have a key "id" that contains 
                    #strings without special chars or spaces
                    #In other words, we'll need to do something about 
                    #"New York" and "St. Lawrence" counties
    styledict=styledict, #A dictionary. Keys must be the same as the id key in 
                            #the geojson object. Values must be a dictionary of
                            #{unix_timestamp: {"color":color_code,"opacity":float}}
).add_to(m)


# m.save("rincidence.html")

m

#Save the map to an html file that you can open in the browser for a larger look
#To save, remove the comment from the following command
#m.save("problem3.html")

In [323]:
df

County,Test Date,New Positives,Cumulative Number of Positives,Total Number of Tests Performed,Cumulative Number of Tests Performed,Rank,Population,incidence,ma8,unix_date,color,opacity,county_name,relative_incidence,rank,rma8
59,2020-03-01,0,0,0,0,60,24808,0.000000,0.000,1583107200,#ffffccff,0.5,Yates,,0.0,0.000
8,2020-03-01,0,0,0,0,9,757332,0.000000,0.000,1583107200,#ffffccff,0.5,Monroe,,0.0,0.000
36,2020-03-01,0,0,0,0,37,68466,0.000000,0.000,1583107200,#ffffccff,0.5,Madison,,0.0,0.000
38,2020-03-01,0,0,0,0,39,62253,0.000000,0.000,1583107200,#ffffccff,0.5,Livingston,,0.0,0.000
58,2020-03-01,0,0,0,0,59,26681,0.000000,0.000,1583107200,#ffffccff,0.5,Lewis,,0.0,0.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11,2023-02-11,39,139819,984,2086102,12,398277,0.000098,44.625,1676160000,#e9261fff,0.5,Orange,1.556812,33.0,44.625
25,2023-02-11,7,26147,194,499457,26,112060,0.000062,23.875,1676160000,#feaf4aff,0.5,Ontario,0.993125,9.0,23.875
10,2023-02-11,65,142510,945,2877295,11,474621,0.000137,38.125,1676160000,#fd522bff,0.5,Onondaga,2.177324,50.0,38.125
60,2023-02-11,0,4322,54,103472,61,17920,0.000000,17.250,1676160000,#fed06cff,0.5,Schuyler,0.000000,1.0,17.250
