# Stoneburner, Kurt #
- ## DSC 550 - Week 07
- ## Milestone #5

This notebook is roughly organized into:
- Preprocessing and cleaning

- Modeling
    
    Functions of note:
    
    - calculate_coefficients: Generates a regression model that is heavily biased towards the first regression attribute.
    
    - calculate_coefficients_v2: Improves the model results by scaling the inputs to offset the modeling bias.
    
    - coef_to_percent() - This converts the coefficients to percentages


- Graphs

## Some thoughts about the data and processing ##

### **Wrangling Racial Data** ###

As expected, great care and effort were taken to clean and wrangle the data to create both consistency and continuity between the data sets. Aligning the federal Census Bureau racial categories with California’s race data was a challenge. The Census Bureau measures race and ethnicity differently than California. 

California breaks race into 9 categories, counting the statewide Total row that appears in the data alongside the 8 listed here: 

American Indian or Alaska Native, Asian, Black, Latino, Multi-Race, Native Hawaiian and other Pacific Islander, Other, White. 

The Census Bureau breaks race and ethnicity into 35 attributes, each split into male and female values, for a total of 70 attributes. 

WA_MALE,WA_FEMALE,BA_MALE,BA_FEMALE,IA_MALE,IA_FEMALE,AA_MALE,AA_FEMALE,NA_MALE,NA_FEMALE,TOM_MALE,TOM_FEMALE,WAC_MALE,WAC_FEMALE,BAC_MALE,BAC_FEMALE,IAC_MALE,IAC_FEMALE,AAC_MALE,AAC_FEMALE,NAC_MALE,NAC_FEMALE,NH_MALE,NH_FEMALE,NHWA_MALE,NHWA_FEMALE,NHBA_MALE,NHBA_FEMALE,NHIA_MALE,NHIA_FEMALE,NHAA_MALE,NHAA_FEMALE,NHNA_MALE,NHNA_FEMALE,NHTOM_MALE,NHTOM_FEMALE,NHWAC_MALE,NHWAC_FEMALE,NHBAC_MALE,NHBAC_FEMALE,NHIAC_MALE,NHIAC_FEMALE,NHAAC_MALE,NHAAC_FEMALE,NHNAC_MALE,NHNAC_FEMALE,H_MALE,H_FEMALE,HWA_MALE,HWA_FEMALE,HBA_MALE,HBA_FEMALE,HIA_MALE,HIA_FEMALE,HAA_MALE,HAA_FEMALE,HNA_MALE,HNA_FEMALE,HTOM_MALE,HTOM_FEMALE,HWAC_MALE,HWAC_FEMALE,HBAC_MALE,HBAC_FEMALE,HIAC_MALE,HIAC_FEMALE,HAAC_MALE,HAAC_FEMALE,HNAC_MALE,HNAC_FEMALE 

The biggest disparity is the categorization of Latinos. The Census Bureau treats Hispanic as an ethnicity, meaning an individual may be Hispanic White, Hispanic Black, Hispanic Asian, etc. California, by contrast, treats anyone who identifies as Hispanic as a single racial category called Latino. For this project, Latinos were counted using H_MALE and H_FEMALE, which are the total Hispanic numbers. The remaining racial groups were taken from their Non-Hispanic (NH) racial categories.  

The racial data was combined into the following named categories: 

    Latino 

    White 

    Asian 

    Black 

    Native 

    Hawaiian 

    Multiracial 

In hindsight, these categories should be restructured to more appropriately reflect current racial sensibilities. Simply categorizing ‘Hawaiian and Other Pacific Islander’ as Hawaiian reflects a sense of racial appropriation that doesn’t respect the history of Hawaiian natives, and broadly dismisses the diversity of the Pacific Islander population. A more appropriate designation is AAPI (Asian American Pacific Islander), which combines the Asian and ‘Hawaiian and Other Pacific Islander’ categories into a single category. This aligns with the racial categorization that is reflected within my community. Additionally, I would have combined the lower-population demographics of Native and Multiracial with the California Other category. This should reduce some of the oddities that come with modeling races with low populations, although this would eliminate the representation of racial diversity.  

Categorizing individuals appropriately is unexpectedly difficult and comes with the burden of accurate representation. There are no easy answers. 

### **Data Consistency and Transformation** ###

This project involves matching three different sets of data: statewide racial totals (total Asian, total Latino, etc.) per day, total statewide COVID cases, and daily county COVID cases. One significant challenge is that none of the totals matched. The sum of the statewide race totals differs from the reported statewide cases, which in turn differs from the sum of all county COVID cases. 

I surmise the difference is due to inconsistencies in the data reporting process. Although COVID case numbers are reported daily, the final daily COVID numbers may change due to testing delays, which were especially pronounced early in the pandemic through the summer of 2020. California publishes two figures: cases and reported cases. The reported cases represent positive tests reported on a given day. The reported tests generally represent tests performed on an earlier date (unless it was a same-day test). The case count is a retroactively adjusted value that reflects actual case counts. I’m unsure if this retroactive process was applied to all the data fields. I suspect not. For data consistency, all totals are based on county sums.  

Generating daily race totals was more involved. The racial percentages were recalculated daily. The state provided the racial percentages, but recalculated values were generated with a higher precision. The daily racial percentages were applied to the state totals. This preserved the reported ratio of racial cases, but applied them to a consistent case total.  

This balanced the racial totals, state totals, and summed county case totals. 
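
The rebalancing described above can be illustrated with a small, made-up example. The numbers and the names `race_df` and `county_total` are hypothetical, and this is only a sketch of the idea rather than the notebook's exact code: recompute each race's share from the reported race counts, then apply those shares to the total obtained by summing county cases.

```python
import pandas as pd

# Hypothetical reported race counts for a single day (made-up numbers)
race_df = pd.DataFrame({
    "race": ["Latino", "White", "Asian"],
    "cum_cases": [600, 300, 100],
})

# Hypothetical total for the same day, from summing all county cases
county_total = 1100

# Recompute each race's share at full floating point precision
race_df["percent_cases"] = race_df["cum_cases"] / race_df["cum_cases"].sum()

# Apply the shares to the county-based total
race_df["cases"] = race_df["percent_cases"] * county_total

print(race_df)
```

The ratios between races are unchanged, but the rescaled race cases now sum to the county-based total.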

**Reference link for future use**

These links are for future lookup. I find myself returning to past assignments and projects looking for reference code. Keeping good references helps during follow-on classes.

In [1]:
#Look at Support Vector Regression
#https://www.mygreatlearning.com/blog/support-vector-regression/

#//*** GEOPANDAS sources
#https://jcutrer.com/python/learn-geopandas-plotting-usmaps
#https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html
#https://towardsdatascience.com/lets-make-a-map-using-geopandas-pandas-and-matplotlib-to-make-a-chloropleth-map-dddc31c1983d
#https://geopandas.org/docs/user_guide/mapping.html

#//*** Build Custom Color Gradients
#https://coolors.co/gradient-palette/ffffff-e0472b?number=9

**Basic Imports**

In [2]:
import os
import sys
# //*** Imports and Load Data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
#//*** Use the whole window in the IPYNB editor
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

import time 
import random


import geopandas
from matplotlib.colors import ListedColormap, LinearSegmentedColormap

#//*** Maximize columns and rows displayed by pandas
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)

pd.set_option('display.width', 200)

df_list = []

from sklearn import linear_model
from math import sqrt
from sklearn.metrics import mean_squared_error


**Deduplicate Legend**

Primarily scans the handles and labels for multiple instances of the same element. This is very helpful when drawing graphs with loops that may draw multiple legend items.

In [None]:
# //*** Legends automatically generate too many labels based on my looping method.
# //*** Remove the Duplicate Legends. I wrote this for DSC 530 and it keeps on giving.
def deduplicate_legend(input_ax):
    # //**** Get handle and label list for the current legend
    # //**** Use first instance, toss the rest.
    handles, labels = input_ax.get_legend_handles_labels()

    handle_dict = {}

    for x in range(len(labels)):
        if labels[x] not in handle_dict.keys():
            # //*** Label = handle
            handle_dict[labels[x]] = handles[x]

    # //*** Build unique output lists of handles and labels
    out_handles = []
    out_labels = []
    
    for label,handle in handle_dict.items():
        out_handles.append(handle)
        out_labels.append(label)
    
    return out_handles,out_labels
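
As a rough illustration of the "first instance wins" rule, the same logic can be tried on plain lists without drawing a plot. The function name `first_instance_only` and the strings standing in for matplotlib handles are hypothetical.

```python
def first_instance_only(handles, labels):
    """Keep the first handle seen for each label, preserving label order."""
    seen = {}
    for handle, label in zip(handles, labels):
        # setdefault only stores the handle the first time a label appears
        seen.setdefault(label, handle)
    return list(seen.values()), list(seen.keys())

# Stand-in handles; in a real plot these would be matplotlib artists
handles = ["line_a1", "line_b1", "line_a2", "line_b2"]
labels = ["Latino", "White", "Latino", "White"]

out_handles, out_labels = first_instance_only(handles, labels)
print(out_handles, out_labels)
```

With a real axis, the two returned lists would typically be passed to `ax.legend(out_handles, out_labels)`.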

   

In [None]:
#//*** Only download Data if download_data is True.
#//*** Avoids needlessly generating HTTP traffic
download_data = False
demographic_data_filename = "z_ca_covid_demo.csv"
cases_data_filename = "z_ca_covid_cases.csv"

#//***********************************************************************************************
#//*** California COVID Data website:
#//**************************************
#//*** https://data.chhs.ca.gov/dataset/covid-19-time-series-metrics-by-county-and-state
#//***********************************************************************************************

#//*** Download California Current COVID Demographic Data
if download_data:
    try:
        response = requests.get("https://data.chhs.ca.gov/dataset/f333528b-4d38-4814-bebb-12db1f10f535/resource/e2c6a86b-d269-4ce1-b484-570353265183/download/covid19casesdemographics.csv")
        if response.ok:
            print("Demographic Data Downloaded")
            #//*** The with block closes the file even if the write fails
            with open(demographic_data_filename, "w") as f:
                f.write(response.text)
            print("Demographic Data Written to file.")
    except (requests.RequestException, OSError):
        print("Demographic Data: Trouble Downloading From State of CA")

    #//*** Download California Current COVID Case Counts
    try:
        response = requests.get("https://data.chhs.ca.gov/dataset/f333528b-4d38-4814-bebb-12db1f10f535/resource/046cdd2b-31e5-4d34-9ed3-b48cdbc4be7a/download/covid19cases_test.csv")
        if response.ok:
            print("Case Data Downloaded")
            with open(cases_data_filename, "w") as f:
                f.write(response.text)
            print("Case Data Written to file.")
    except (requests.RequestException, OSError):
        print("Ca Case Data: Trouble Downloading From State of CA")

In [None]:
#//*** read Cached Data From Local files

#//*** Statewide per county COVID Cases
ca_covid_df= pd.read_csv(cases_data_filename)

#//*** Racial Totals
ca_race_df = pd.read_csv(demographic_data_filename)


#//*** df_list keeps track of all the dataframe names. The project is getting too big to keep in my head, so I need to keep references
if 'ca_covid_df' not in df_list:
    df_list.append('ca_covid_df')

if 'ca_race_df' not in df_list:
    df_list.append('ca_race_df')

print(ca_race_df.columns)
#//*** Demographics Contain Age Groups, Gender, and Race Ethnicity.

#//*** We'll Focus on just Race Ethnicity
print(f"Demographic Types: {ca_race_df['demographic_category'].unique()}")

#//*** Get Just Race Ethnicity
race_category = ca_race_df['demographic_category'].unique()[2]

#//*** Remove the other demographic Types. If we are cool we'd be factoring in age and gender.
#//*** But not this time
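#//*** Take an explicit copy so later column edits do not act on a slice of the original frame
ca_race_df = ca_race_df[ca_race_df['demographic_category'] == race_category].copy()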


In [None]:
#//*******************************
#//*** Clean Ca_COVID_df
#//*******************************

print(ca_covid_df)

ca_covid_df.rename( columns= {'area':'county'}, inplace=True)
print(f"# of counties before Cleaning: {len(ca_covid_df['county'].unique())}")

#//*** Convert Date Column to Date Type.
ca_covid_df['date'] =  pd.to_datetime(ca_covid_df['date'], infer_datetime_format=True)

#print(ca_covid_df[ ca_covid_df['area_type'] == 'State'])
ca_covid_df = ca_covid_df.sort_values('date')

#//*** DROP ca_covid dates that are not included in ca_race_df




In [None]:
#//*** Remove the 'Out of state', 'Unknown' and 'California' listings
print(f"Length Before removing non-county listings: {len(ca_covid_df)}")
ca_covid_df = ca_covid_df[~ca_covid_df['county'].isin(['Out of state','Unknown','California'])]

print(f"# of counties After Cleaning: {len(ca_covid_df['county'].unique())}")


#//*** Replace NaN values with 0
#//*** Assign the result back; a column-level inplace fillna may not update the frame in recent pandas versions
ca_covid_df = ca_covid_df.fillna(0)
    
    

#//*** Drop Columns
dropcols = ['area_type','population','positive_tests','reported_cases','reported_deaths','reported_tests']
#dropcols = []
ca_covid_orig_df = ca_covid_df.copy()

if 'ca_covid_orig_df' not in df_list:
    df_list.append('ca_covid_orig_df')

for col in dropcols:
    if col in ca_covid_df.columns:
        del ca_covid_df[col]


In [None]:
### //****************************************
#//*** Cleanup ca_race_df  attributes
#//****************************************
print("Before Cleaning:")
print(ca_race_df)
if 'demographic_value' in ca_race_df.columns:
    #//*** Rename the California racial names to match the census derived attribute names in pop_attrib_df
    ca_race_df['demographic_value']=ca_race_df['demographic_value'].str.replace('Native Hawaiian and other Pacific Islander','Hawaiian')
    ca_race_df['demographic_value']=ca_race_df['demographic_value'].str.replace('Multi-Race','Multiracial' )
    ca_race_df['demographic_value']=ca_race_df['demographic_value'].str.replace('American Indian or Alaska Native','Native' ) 

#//*** rename the reported_date column
ca_race_df.rename( columns= {'report_date':'date','total_cases':'cum_cases','deaths':'cum_deaths'}, inplace=True)

#//*** Delete Columns
if 'demographic_category' in ca_race_df.columns:
    del ca_race_df['demographic_category']

if 'demographic_value' in ca_race_df.columns:
    print(ca_race_df['demographic_value'].unique())

#//*************************************
#//*** Cleanup Statewide COVID values
#//*************************************
if 'demographic_value' in ca_race_df.columns:
    ca_race_df.rename( columns= {'demographic_value':'race'}, inplace=True)

#//*** Convert date to datetime format.
ca_race_df['date'] =  pd.to_datetime(ca_race_df['date'], infer_datetime_format=True)

#//*** Remove Total and Other from Race Types. Total is Statewide infections, Other is other racial categories.
#ca_race_df = ca_race_df[~ca_race_df['race'].isin(['Other'])]

#//*** Save temp_race_total_df for later use
#temp_race_total_df = ca_race_df[ca_race_df['race']=='Total']

#//*** Remove values with total (statewide values)
ca_race_df = ca_race_df[~ca_race_df['race'].isin(['Other','Total'])]

print("After Cleaning:")
print(ca_race_df)

In [None]:
#//*** Synchronize dates between ca_covid_df and ca_race_df
#//*** Get first date of Race
#//*** Use min() so the result does not depend on the row order of ca_race_df
first_date = ca_race_df['date'].min()
#//*** remove values from ca_covid before the date
ca_covid_df = ca_covid_df[ca_covid_df['date'] >= first_date].sort_values('date')

In [None]:
#print(ca_race_df)
#print(ca_covid_df)

In [None]:
#print(df_list)
#print(ca_covid_df)

In [None]:

#//**********************************************************************************************************************************
#//*** US Census data on Racial population by County in California
#//**********************************************************************************************************************************
#//*** Data Source
#//**********************************************************************************************************************************
#//*** Census Data: https://www.census.gov/data/datasets/time-series/demo/popest/2010s-counties-detail.html
#//*** Direct Download: https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/asrh/cc-est2019-alldata-06.csv
#//**********************************************************************************************************************************
#//*** Process Flat File: California Ethnicity demographics - cc-est2019-alldata-06.csv

raw_ethnic_pop_df = pd.read_csv("cc-est2019-alldata-06.csv")

#//*** Data includes values for last twelve years. We only want data for the last year.

#//*** Rebuild raw_ethnic_pop_df using only the last year (most recent) data
raw_ethnic_pop_df = raw_ethnic_pop_df[raw_ethnic_pop_df['YEAR']==raw_ethnic_pop_df['YEAR'].max()]
if 'raw_ethnic_pop_df' not in df_list:
    df_list.append('raw_ethnic_pop_df')
#//*** Ethnic data is broken down by age. At this stage we will only use the totals of all ages
#//*** Only use AGEGRP == 0
raw_ethnic_pop_df = raw_ethnic_pop_df[raw_ethnic_pop_df['AGEGRP']==raw_ethnic_pop_df['AGEGRP'].min()]

#//*** Demographics are based on gender as well as Federal Race and Ethnic attributes. These attributes are different than the values reported
#//*** By the State of California. These attributes will require cleaning and transformation.
raw_ethnic_pop_df.head(20)




In [None]:

#//*** Convert Applicable federal based census codes to California Census Codes.
#//*** Description of Federal Column Values
#//*** https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2019/cc-est2019-alldata.pdf
#//*** Census Data: https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/

#//*** Notably, Federal census regards Hispanic as an ethnicity not a race. For Example: People can be Hispanic White,
#//*** Hispanic Black, or Hispanic Asian.
#//*** California treats all hispanics as Latino
#//*** Latino = H_MALE, H_FEMALE Hispanic
#//*** White - NHWA_MALE, NHWA_FEMALE (Not Hispanic White)
#//*** Asian - NHAA_MALE, NHAA_FEMALE (Not Hispanic Asian) 
#//*** Black - NHBA_MALE, NHBA_FEMALE (Not Hispanic Black) 

#//*** Amer Indian - NHIA_MALE, NHIA_FEMALE (Not Hispanic, American Indian) 

#//*** Hawaiian - NHNA_MALE, NHNA_FEMALE (Not Hispanic, Hawaiian) 

#//*** California has the following columns: Multiracial, Other, Multirace. I could not find a good definition of these
#//*** These represent less than 5% of the population. Small but not too small to be ignored. These will combined into
#//*** Single attribute Other and combined with NHTOM_MALE, NHTOM_FEMALE - Not Hispanic Two or more races

#//*** Build a new data frame to hold the sanitized values.
pop_attrib_df = pd.DataFrame()

if 'pop_attrib_df' not in df_list:
    df_list.append('pop_attrib_df')

#//*** The County FIPS code is shared between the federal census data and the Community Resilience Estimate
pop_attrib_df['cty_fibs'] = raw_ethnic_pop_df['COUNTY']

#//*** County Name will be the Common attribute to link to the timeseries Data.
#//*** Standardize the County name. Remove County from the column name 
pop_attrib_df['county'] = raw_ethnic_pop_df['CTYNAME'].str.replace(" County","")
pop_attrib_df['population'] = raw_ethnic_pop_df['TOT_POP']

clean_cols = { 'Latino' : ['H_MALE', 'H_FEMALE'], 
              'White' : ['NHWA_MALE', 'NHWA_FEMALE'],
              'Asian' : ['NHAA_MALE', 'NHAA_FEMALE'],
              'Black' : ['NHBA_MALE', 'NHBA_FEMALE'],
              'Native' : ['NHIA_MALE','NHIA_FEMALE'],
              'Hawaiian' : ['NHNA_MALE', 'NHNA_FEMALE'],
              'Multiracial' : ['NHTOM_MALE', 'NHTOM_FEMALE']
            
            }

#//*** Combine male and female columns and store to column with same name as California Data
#//*** Loop through the clean_cols dictionary, key is California name, value is Federal columns to combine
#//*** These are the easy 1:1 columns
#//*** Hawaiian and Other will need adjustment on the California side of the dataset.


#//*** California Column name = Federal category male + Federal Category female
for ca_name,fed_names in clean_cols.items():
    pop_attrib_df[ca_name] = raw_ethnic_pop_df[fed_names[0]] + raw_ethnic_pop_df[fed_names[1]] 

#              'Native Hawaiian or Pacific Islander' :
#              'Native Hawaiian and other Pacific Islander'
#            'Other'

#//*** Assign the index to the county fibs number
pop_attrib_df = pop_attrib_df.set_index('cty_fibs')



In [None]:
#//*** Merge Population Attributes with COVID County info
#//*** Only Merge if we haven't merged yet. I got 99 iPython problems but this aint one.
if "Latino" not in ca_covid_df.columns:
    ca_covid_df = pd.merge(ca_covid_df,pop_attrib_df,how="left",on=['county'])


#//*** Build per 100k Stats
ca_100k_df = ca_covid_df.copy()
if 'ca_100k_df' not in df_list:
    df_list.append('ca_100k_df')

#//*** Define Population Columns to convert to 100k. These Columns shouldn't change. Trying to setup a flexible
#//*** Systems where I can add other attributes later if needed
population_cols = [ 'population','Latino', 'White', 'Asian', 'Black', 'Native', 'Hawaiian','Multiracial' ]

#//*** Convert Population values to 100k units. ie divide by 100,000
for col in population_cols:
    ca_100k_df[col] = ca_100k_df[col]/100000



#//*** Convert cases, deaths, test to per 100k units
attrib_cols = ['date','county']

#//*** Ignore values in attrib_cols, and population_cols
#//*** Convert remaining attributes to values per 100,000.
#//*** This method makes it easier to change the 100k attributes later.
for col in ca_100k_df.columns:
    if col not in attrib_cols and col not in population_cols:
        #//*** Convert column to per 100k value. Which is Columns value divided population per 100k
        ca_100k_df[col] = ca_100k_df[col]/ca_100k_df['population'] 
"""
plt.rcParams['figure.figsize'] = [50,20]
#//*** Check our Work.
#//*** Cases per 100k should be relatively similar in values.
display_size = 40
fig,ax = plt.subplots()

for county in ca_100k_df['county'].unique():
    
    loop_df = ca_100k_df[ca_100k_df['county'] ==  county]
    ax.plot(loop_df['date'],loop_df['cases'].rolling(5).mean(),label=county)


    plt.xticks(rotation=30,fontsize=display_size)
    plt.yticks(fontsize=display_size)
handles,labels = deduplicate_legend(ax)
plt.legend(fontsize=display_size*.25,loc='upper left')
plt.title(f"Scaled County Data (per 100k)",fontsize=display_size)
#plt.ylabel("Total Cases by County (millions)",fontsize=display_size)
plt.show()

"""
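
A small worked example of the per 100k conversion may help, using two made-up counties. The county names and numbers are hypothetical; the point is only that dividing raw cases by population expressed in units of 100,000 makes counties of very different sizes comparable.

```python
import pandas as pd

# Two hypothetical counties with very different populations
df = pd.DataFrame({
    "county": ["Big", "Small"],
    "population": [2_000_000, 50_000],
    "cases": [4_000, 100],
})

# Express population in units of 100,000 people
df["population"] = df["population"] / 100_000

# Cases per 100k = raw cases / (population in 100k units)
df["cases"] = df["cases"] / df["population"]

print(df)
```

Both counties end up with the same per 100k rate even though their raw case counts differ by a factor of 40.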

In [None]:
#//*** Build a list of counties ordered by total COVID prevalence (most Cases per 100k) 

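#//*** Get the Statewide 100k value. 
#//*** Get total Case Count from orig_df, divided by total population / 100000
#//*** Take one population value per county, so counties that happen to share a population value are still counted
state_100k = ca_covid_orig_df['cases'].sum()/(ca_covid_orig_df.groupby('county')['population'].max().sum()/100000)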

if 'state_100k' not in df_list:
    df_list.append('state_100k')

county_list = ca_100k_df['county'].unique()

county_100k = []

#//*** Get a list of counties with population greater than 100,000
for county in county_list:
    #if ca_100k_df[ca_100k_df['county']==county].iloc[0]['population'] > 1:
    county_100k.append(county)
case_totals = []

#//*** Get the total Cases for each county per 100k
for county in county_100k:
    case_totals.append(ca_100k_df[ca_100k_df['county']==county]['cases'].sum())

#//*** Build a series with the county name and the total per100k value for each county. Sort by the Prevalence value.
ts = pd.Series(index = county_100k, data=case_totals).sort_values(ascending=False)
if 'ts' not in df_list:
    df_list.append('ts')

In [None]:
print(ts.index)

In [None]:

#//******************************************************************************************************************************************************************************
#//*** Build: ca_cases_broad_df
#//*** Counties are converted to per day attributes. The Values are a single CA_COVID_value for the whole table.
#//*** The Broad tables are needed for building linear regressions across the whole State. The Coefficients are used to generate granular data about each county per day
#//*** This DF looks at building COVID cases per day, using ca_covid_df and ca_race_df
#//******************************************************************************************************************************************************************************

def build_broad_county_attribute(input_df,input_col):

    #//*** Temp dataframe to hold the output
    output_df = pd.DataFrame()
    for group in input_df[['date','county',input_col]].groupby('date'):

        #//*** Get County and Values. Each county is its own value. Use transpose() to make the counties attribute
        loop_df = group[1][['county',input_col]].transpose().copy()
        #//*** Set the Column names to the counties which are in the first line
        loop_df.columns = list(loop_df.iloc[0])

        #//*** Remove the first line, since the counties are now the columns/attributes
        loop_df = loop_df.drop('county')

        #//*** Change the index from Cases to 0. This is mostly cosmetic
        loop_df.index = [0]

        #//*** Add Date Column
        loop_df['date'] = group[0]

        #//*** Add Total Column which is a sum() of all values
        #//*** Use the input_col argument rather than a global variable
        loop_df['total'] = group[1][[input_col]].values.sum()

        #//*** Build a new column list that moves date and total to the first two columns
        #cols = ['date','total']+list(ts.index)

        cols = ['date','total']+ list(input_df['county'].unique())

        #//*** Save the df with the reordered columns
        loop_df = loop_df[cols]

        #//*** concat/append this row to the temp_df/output_df
        output_df = pd.concat([output_df,loop_df])

    return output_df




if 'ca_cases_broad_df' not in df_list:
    df_list.append('ca_cases_broad_df')

#//*** The column used to build broad statistic
case_col_col = 'cases'

ca_cases_broad_df = build_broad_county_attribute(ca_covid_df,'cases')

print(ca_cases_broad_df)
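
For comparison, a similar wide layout can usually be produced with pandas' built-in `pivot`. This is an alternative sketch on made-up data, not the method used above; the frame `long_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: one row per date and county
long_df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"]),
    "county": ["Alameda", "Kern", "Alameda", "Kern"],
    "cases": [10, 5, 12, 7],
})

# One row per date, one column per county
wide_df = long_df.pivot(index="date", columns="county", values="cases")

# Statewide total for each date, placed as the first column
wide_df.insert(0, "total", wide_df.sum(axis=1))

# Move date from the index back to an ordinary column
wide_df = wide_df.reset_index()
print(wide_df)
```

The loop-based function keeps the county columns in the order they appear in the input, while `pivot` sorts them, so the column order may differ between the two approaches.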


In [None]:
#//*** State Race numbers don't quite add up with the State COVID totals.
#//*** We'll re-normalize the values by
#//*** 1.) Recalculate the percent of cases to a higher degree of accuracy
#//*** 2.) Recalculate the Race Cases based on updated percentages

#//*** Temp variables
t_date = []
t_race = []
t_percent_cases = []
tdf = pd.DataFrame()

#//*** Convert Date Column to Date Type.
#ca_covid_df['date'] =  pd.to_datetime(ca_covid_df['date'], infer_datetime_format=True)

#//*******************

#//*** iPython Trap
if 'cum_cases' in ca_race_df:
    for group in ca_race_df.groupby('date'):
        
        
        
        #//*** Get the Total Cases for the day, from the Race labeled Total
        #total_cases = temp_race_total_df[temp_race_total_df['date']==group[0]]['cum_cases'].values[0]
        #total_deaths = temp_race_total_df[temp_race_total_df['date']==group[0]]['cum_deaths'].values[0]
        
        total_cases = group[1]['cum_cases'].sum()
        total_deaths = group[1]['cum_deaths'].sum()
        loop_df = group[1].copy()
        #//*** Recalculate percent_cases
        loop_df['percent_cases'] = loop_df['cum_cases'] / total_cases

        #//*** Recalculate percent_deaths
        loop_df['percent_deaths'] = loop_df['cum_deaths'] / total_deaths

        #//*** Generate the Daily Race Cases based of total reported COVID cases
        
        
        #//*** Cross Check
        #print("Cross Check Cases: ",ca_covid_df[ca_covid_df['date']==group[0]]['cases'].sum())
        #print("Cross Check Deaths: ",ca_covid_df[ca_covid_df['date']==group[0]]['deaths'].sum())
        state_cases = ca_cases_broad_df[ca_cases_broad_df['date']==group[0]]['total'].values[0]
        state_deaths = ca_covid_df[ca_covid_df['date']==group[0]]['deaths'].sum()
        print("State Cases: ",state_cases)
        #print("State Deaths: ",state_deaths)

        #print(ca_covid_df[ca_covid_df['date']==group[0]])
        #//*** Convert cum_cases to daily cases
        #print(loop_df)
        loop_df['cum_cases'] = loop_df['percent_cases'] * state_cases
        loop_df['cum_deaths'] = loop_df['percent_deaths'] * state_deaths
        
        #//*** Checking Work
        #print("[1 == ",loop_df['percent_cases'].sum(), " ] ",loop_df['cum_cases'].sum()," == ", state_cases)
        
        #print(loop_df)
        #print(state_cases," ",state_deaths)
        tdf = pd.concat([tdf,loop_df])
    
        #print()
        #//*** Get the Percentage of cumulative Cases per day. This is cum_cases / total_cases
        #daily_percent = group[1][group[1]['race']!='Total']['cum_cases']/total_cases
        #group[1][group[1]['race']!='Total'] = daily_percent
        #print(group[1])

        

    print(tdf[ ['date','race','cum_cases','cum_deaths','percent_cases','percent_deaths','percent_of_ca_population'] ])
    ca_race_df = tdf[ ['date','race','cum_cases','cum_deaths','percent_cases','percent_deaths','percent_of_ca_population'] ].copy()

    if 'percent_of_ca_population' in ca_race_df.columns:
        del ca_race_df['percent_of_ca_population']

    ca_race_df.columns = ['date','race','cases','deaths','percent_cases','percent_deaths']

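del tdf
del t_date
del t_race
del t_percent_cases
#//*** loop_df only exists if the loop above ran; avoid a NameError when this cell is re-run
if 'loop_df' in globals():
    del loop_df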

In [None]:
#//*** Verified ca_race_df and ca_covid_df ca_cases_broad_df are all balanced. Daily totals are based on the total county cases

#print(df_list)
#//*** Check our Work use Random to pick a random day
#//*** randrange excludes the upper bound, so the index is always valid
t_date = ca_cases_broad_df.iloc[random.randrange(len(ca_cases_broad_df))]['date']


print("Should all be the same for this Date: ",t_date)
print("ca_covid_df Cases:        ",ca_covid_df[ca_covid_df['date']==t_date]['cases'].sum())
print("ca_cases_broad_df Cases: ",ca_cases_broad_df[ca_cases_broad_df['date']==t_date]['total'].values[0])
print("ca_race_df Cases:        ",ca_race_df[ca_race_df['date']==t_date]['cases'].sum())

#del race_tdf
del t_date
print(ca_race_df)
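
Because the race totals are produced by multiplying floating point percentages, the three printed values may differ in the last decimal places even when the tables are balanced. A tolerance-based comparison is one way to check this; the numbers below are made up for illustration.

```python
import math

# Hypothetical daily totals from the three tables (made-up numbers)
county_sum = 1100
broad_total = 1100
race_sum = 0.1 * 1100 + 0.2 * 1100 + 0.7 * 1100  # shares applied to the total

# Exact equality can fail by a tiny rounding error, so compare with a tolerance
balanced = math.isclose(county_sum, broad_total) and math.isclose(county_sum, race_sum)
print(balanced)
```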

In [None]:
#//********************************************************************
#//*** Add State Wide Race per 100k values.
#//********************************************************************

state_pop = {}
print(ca_race_df['race'].unique())
for col in pop_attrib_df.columns[1:]:
    state_pop[col] = pop_attrib_df[col].sum()/100000

print(ca_race_df['race'].apply(lambda x : state_pop[x]))

ca_race_df['cases_100k'] = ca_race_df['cases'] / ca_race_df['race'].apply(lambda x : state_pop[x])
ca_race_df['deaths_100k'] = ca_race_df['deaths'] / ca_race_df['race'].apply(lambda x : state_pop[x])

#//*** temp list to hold per 100k values
#tl_case = []
#t1_death = []
#print(ca_race_df)
#for index,row in ca_race_df.iterrows():
    
    #//*** Get State population based on race column
#    state_pop[ row['race'] ]

for date in ca_race_df['date'].unique():
    print(ca_race_df[ ca_race_df['date'] == date ])
    break
#print(ca_race_df)

In [None]:
#print(df_list)
#print(ca_100k_df)

In [None]:
#//******************************************************************************************************************************************************************************
#//*** Build: ca_races_broad_df
#//*** Daily Race Values are converted to per day attributes. The Values are a single race value for the whole table.
#//*** The Broad tables are needed for building linear regressions across the whole State. The Coefficients are used to generate granular data about each county per day
#//*** This DF looks at building COVID cases per day, using ca_broad_df['total'] which is the same as ca_covid_df.sum()
#//******************************************************************************************************************************************************************************
def build_broad_race_attribute(input_df,input_cols):

    #//*** Temp dataframe to hold the output
    output_df = pd.DataFrame()
    for group in input_df[['date','race',input_cols]].groupby('date'):


        #//*** Get Race and Values. Each race is its own value. Use transpose() to make the races attributes
        loop_df = group[1][['race',input_cols]].transpose().copy()
        #//*** Set the Column names to the races which are in the first line
        loop_df.columns = list(loop_df.iloc[0])

        #//*** Remove the first line, since the races are now the columns/attributes
        loop_df = loop_df.drop('race')
        #print(loop_df)
        #//*** Change the index from Cases to 0. This is mostly cosmetic
        loop_df.index = [0]

        #//*** Add Date Column
        loop_df['date'] = group[0]



        #//*** Build a new column list that moves date to the first column
        cols = ['date']+list(loop_df.columns[:-1])

        #//*** Save the df with the reordered columns
        loop_df = loop_df[cols]

        #//*** concat/append this row to the temp_df/output_df
        output_df = pd.concat([output_df,loop_df])

    return output_df

    
    



    

In [None]:
#//*** The column used to build broad statistic
race_col_col = 'cases'

#//*** Assign output_df to ca_races_broad_df 
ca_races_broad_df = build_broad_race_attribute(ca_race_df,'cases')

#ca_rac= build_broad_attribute(ca_100k_df,'cases')

if 'ca_races_broad_df' not in df_list:
    df_list.append('ca_races_broad_df')
    
print(ca_races_broad_df)


In [None]:
#//*** Check our work. Use random to pick a random day
#//*** The case totals should be close or identical
#//*** randrange() excludes the upper bound, so the row position is always valid
t_date = ca_cases_broad_df.iloc[random.randrange(len(ca_cases_broad_df))]['date']


print("Should all be the same for this Date:\n",t_date)
print("ca_covid_df Cases:       ",ca_covid_df[ca_covid_df['date']==t_date]['cases'].sum())
print("ca_cases_broad_df Cases: ",ca_cases_broad_df[ca_cases_broad_df['date']==t_date]['total'].values[0])
print("ca_race_df Cases:        ",ca_race_df[ca_race_df['date']==t_date]['cases'].sum())

print("ca_races_broad_df Cases  ",ca_races_broad_df[ca_races_broad_df['date']==t_date].transpose()[1:].sum().values[0])
del t_date
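
A note on picking the random day for the spot check: `random.randint(a, b)` includes both endpoints, while `random.randrange(n)` stops at `n - 1`. The sketch below uses a hypothetical list of dates (`rows`) to show a draw that always lands on a valid position.

```python
import random

# Hypothetical list standing in for the rows of a dataframe
rows = ["2020-05-01", "2020-05-02", "2020-05-03"]

# randint(0, len(rows)) can return len(rows), one past the last valid position.
# randrange(len(rows)) only returns 0 .. len(rows) - 1.
random.seed(0)
positions = [random.randrange(len(rows)) for _ in range(1000)]

picked = rows[positions[0]]
print(picked)
```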

In [None]:
#//*** Check our work with a selection of ilocs. Sum of the Coefficients + intercept should be close to the total values.
#//*** It's very close, very low error

#for x in [50,100,150,200,250,300]:
#    print(f"{state_coef_df.iloc[x]['date'].values[0]} - {state_coef_df[x_column].iloc[x].sum() + state_coef_df.iloc[x]['intercept'].values[0]} == {state_coef_df.iloc[x]['total'].values[0]}")
 

In [None]:
ordered_counties = (list(pop_attrib_df[['county','population']].sort_values('population',ascending=False)['county']))
race = 'Latino'
#print(list(pop_attrib_df[['county',race]].sort_values(race,ascending=False)[race]))

In [None]:
#print(ca_races_broad_df)
#print(ca_cases_broad_df)

In [None]:
#print(ca_100k_df)

**Calculate Coefficients**

Runs a linear regression between Statewide Racial numbers and Daily COVID Cases by County.

This is the unscaled model, which assigns disproportionately high values to the first county in the regression. It remains for reference, and to run a separate version for comparison against the scaled version (calculate_coefficients_v2).
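
The trailing 30-day window idea can be sketched on synthetic data. This is a simplified illustration, not the notebook's function: it uses `numpy.linalg.lstsq` in place of scikit-learn's `LinearRegression`, three made-up counties instead of 58, and fits every window the same way rather than treating the first 30 days as one block. All names (`county_cases`, `state_cases`, `fit_window`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

days, window = 40, 30

# Hypothetical daily case counts for three counties (one column each)
county_cases = rng.integers(0, 100, size=(days, 3)).astype(float)

# Hypothetical statewide series built from the counties, so the fit is exact
state_cases = county_cases @ np.array([0.5, 0.3, 0.2]) + 4.0


def fit_window(x, y):
    """Least-squares fit of y on x with an intercept; returns (coef, intercept)."""
    design = np.column_stack([x, np.ones(len(x))])
    solution, *_ = np.linalg.lstsq(design, y, rcond=None)
    return solution[:-1], solution[-1]


contributions = []
for end in range(window, days + 1):
    x = county_cases[end - window:end]
    y = state_cases[end - window:end]
    coef, intercept = fit_window(x, y)
    # Keep only the last day of each window: coefficient * that day's cases
    contributions.append((x[-1] * coef, intercept, y[-1]))

last_contrib, last_intercept, last_total = contributions[-1]

# The summed contributions plus the intercept should reproduce the total
print(last_contrib.sum() + last_intercept, "==", last_total)
```

With 58 county columns and only 30 rows per window, the real regression has more unknowns than observations, which is why it reproduces the window's totals so closely.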

In [None]:
#//***************************************************************************************************************************************************************************
#//*** Generate individual race coefficients for each county per day.
#//*** The sum of racial coefficients should equal the state coefficient for the county. 
#//*** Build the coefficients for the entire data set. Each day will calculate the coefficients from the previous 30 days. The first 30 days will use one set of coefficients. The rest will use
#//*** the current day - 30 to generate the coefficients. This will be an overfitted solution, which is exactly what we are going for.
#//***************************************************************************************************************************************************************************
def calculate_coefficients(left_df, right_df, x_col_index, y_col_index):
    print("Calculating racial coefficients...")
    start_time = time.time()
    
    #//*** Initialize the output dataframe
    output_df = pd.DataFrame()
    
    #//*** Combining with car_race_df and Latino value. It's not strictly needed, but the additional column will make combining the dataframes later, easier.
    #//*** Reusing Code: This loop only needs to run once
    for race in left_df.columns[1:]:

        #//*** Build model for the first 30 days, combines the race from ca_race (which is only needed as an extra field, to evenly space the columms)
        model_df = left_df[['date',race]].merge(right_df, left_on='date', right_on='date')[:30]

        #//*************************
        #//*** BEGIN regression
        #//*************************


        #//**** Build the x Values - Independent Variables. These will be all the counties which start at the column index 
        #//*** The X Value is the index where the attributes start, there are 58 of them :)
        #x_column = model_df.columns[x_col_index:]

        #//*** Generate ordered list of Counties by current race population.
        #//*** The assumption is that counties with higher populations will exert a greater weight on the model.
        #//*** Otherwise the tiny county of Alpine gets way overly represented
        race_columns = (pop_attrib_df[['county',race]].sort_values(race,ascending=False)['county'])

        #print(model_df[race_columns])
        #//*** Build the X attributes using the x_column. These are separated for readability and modularity
        x_model = model_df[race_columns]

        #print(x_model)
        #//*** Build the dependent variable using the Index Column defined above as y_col_index.
        y_column = model_df.columns[y_col_index]

        #//*** Build the Y model using the y_column attribute. This is less readable and intuitive, but it lets the columns be 
        #//*** easily assigned at the top of this section
        y_model = model_df[y_column]

        #//*** Define the Linear Model
        regr = linear_model.LinearRegression(n_jobs=-1)

        #//*** Make Regression Magic
        regr.fit(x_model, y_model)

        #//*** Apply the regression coefficients
        #//*** v1 Change: Apply coef to actual values
        model_df[race_columns] = (model_df[race_columns])*regr.coef_
        
        
        #//*** Dead End: v2 Change: Try just the coefficients 
        #model_df[race_columns] = regr.coef_



        #//*** Replace the Statewide Total Column. With the Statewide Race Totals
        model_df['total'] = model_df[race]

        #//*** Change The race column to hold the race name
        model_df[race] = race

        #//*** Rename the race (Latino, Native, etc) to 'race'
        cols = list(model_df.columns)
        cols[1] = 'race'
        model_df.columns = cols

        model_df['intercept'] = regr.intercept_

        #//*** Move intercept to be the column after Total
        #//*** Gets Columns as a list, removes intercept of end and inserts into position
        #//*** Model_df is saved with ordered list of columns.
        #//*** Kinda Cool
        model_df = model_df[ list(model_df.columns[:-1].insert(3,'intercept')) ]

        #//*** Reorder Counties
        #//*** Keep the First four Columns, then use ordered_counties
        model_df = model_df[list(model_df.columns[:4])+ordered_counties]

        #print(model_df)
        #//*** Add the First 30 days Model to the output_df dataframe
        output_df = pd.concat([output_df,model_df])

        #//*** Checking our work. The sum of the coefficients * cases + intercept should be close to the dependent value in Total Cases.
        print("Checking our Work. These values should be close:")
        print(model_df.iloc[1]['total'], " == ", model_df[race_columns].iloc[1].sum()+regr.intercept_)    

        #print(model_df)
        
        #break
        #//*** Build each day individually, based on the previous 30 days
        #//*** Start at index 31 
        for index in range(31,len(left_df)+1):

            #//*** Define the start and indexes for linear modeling. This is the row_index - 30
            min_index = index-30

            #//*** Build model_df using min_index and index as a 30 day range)
            model_df = left_df[['date',race]].merge(right_df, left_on='date', right_on='date')[min_index:index]
            
            #//*** Build the X attributes using the x_column. These are separated for readability and modularity
            x_model = model_df[race_columns]

            #//*** Build the Y model using the y_column attribute. This is less readable and intuitive, but it lets the columns be 
            #//*** easily assigned at the top of this section
            y_model = model_df[y_column]

            #//*** Build a New the Linear Model
            regr = linear_model.LinearRegression(n_jobs=-1)

            #//*** Make Regression Magic
            regr.fit(x_model, y_model)

            #//*** Replace the Statewide Total Column. With the Statewide Race Totals
            model_df['total'] = model_df[race]

            #//*** Change The race column to hold the race name
            model_df[race] = race

            #//*** Rename the race (Latino, Native, etc) to 'race'
            cols = list(model_df.columns)
            cols[1] = 'race'
            model_df.columns = cols

            #//*** Apply the regression coefficients to all columns, even though we only need the last one
            #//*** v1 Change: Apply coef to actual values
            model_df[race_columns] = (model_df[race_columns])*regr.coef_

            #//*** Dead End: v2 Change: Try just the coefficients
            #model_df[race_columns] = regr.coef_

            model_df['intercept'] = regr.intercept_

            #//*** Move intercept to be the column after Total
            #//*** Gets Columns as a list, removes intercept of end and inserts into position
            #//*** Model_df is saved with ordered list of columns.
            #//*** Kinda Cool
            model_df = model_df[ list(model_df.columns[:-1].insert(3,'intercept')) ]

            #//*** Reorder Counties
            #//*** Keep the First four Columns, then use ordered_counties
            model_df = model_df[list(model_df.columns[:4])+ordered_counties]

            #//*** Add the last day of model_df to output_df. 
            #//*** It's not exactly efficient, but it is functional
            #//*** DataFrame.append() was removed in pandas 2.0. iloc[[-1]] keeps the last day as a one-row dataframe
            output_df = pd.concat([output_df,model_df.iloc[[-1]]])
    
    print(f"CoEfficients Calculated: {round(time.time()-start_time,0)}s")
    return output_df


**Calculate Coefficients v2**

Runs a linear regression between Statewide Racial numbers and Daily COVID Cases by County.

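This version scales each county's inputs by a power-transformed, inverted population percentage to offset the bias toward the first regression attribute.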
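
The scaling used in calculate_coefficients_v2 is easier to read as a function of a county's share of the state population. The sketch below reproduces the same arithmetic (`1 - share`, then `pow(pow(1-(x*2),2)*1.5,2.1)`) on hypothetical shares. The function name `county_weight` and the example shares are assumptions for illustration, not actual county figures.

```python
def county_weight(population_share):
    """Weight applied to a county's cases, given its share of the state population."""
    # Invert the share: small counties end up close to 1, large ones further below 1
    inverted = 1 - population_share
    # Same power transform as the notebook: pow(pow(1-(x*2),2)*1.5,2.1)
    return ((1 - inverted * 2) ** 2 * 1.5) ** 2.1


# Hypothetical population shares, from a very large county to a very small one
for share in (0.25, 0.05, 0.001):
    print(share, round(county_weight(share), 4))
```

The weight shrinks as the population share grows, so the largest counties, which come first in the regression, are scaled down the most.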

In [None]:
#//***************************************************************************************************************************************************************************
#//*** Generate individual race coefficients for each county per day.
#//*** The sum of racial coefficients should equal the state coefficient for the county. 
#//*** Build the coefficients for the entire data set. Each day will calculate the coefficients from the previous 30 days. The first 30 days will use one set of coefficients. The rest will use
#//*** the current day - 30 to generate the coefficients. This will be an overfitted solution, which is exactly what we are going for.
#//***************************************************************************************************************************************************************************
def calculate_coefficients_v2(left_df, right_df, x_col_index, y_col_index):
    
    #//*** build county population percentages
    
    #//*** tdf is a placeholder for Temporary DataFrame
    #//*** pop_attrib_df holds the county populations which will be converted to percentages
    tdf = pop_attrib_df.copy()
    
    #//*** Convert to percentages. I'm doing a 1- population percentage, to invert the values
    #//*** which Makes the small numbers big and the big one small.
    tdf['percent'] = (1-(tdf['population']/tdf['population'].sum()))
    
    tdf = tdf.sort_values('percent',ascending=False)[['county','percent']]
    
    #//*** convert From a single attribute of counties and percentages to broad format 
    #//*** Where each county is an attribute, like the broad formatted dataframes
    tdf = tdf.transpose()
    #//*** Set columns to the county row
    tdf.columns = tdf.loc['county']
    #//*** Drop the county row
    tdf = tdf.drop(tdf.index[0])
    #//*** Make doubly sure the county columns are ordered the same as the dataframes below
    tdf = tdf[ordered_counties]
    #tdf = pow(1-(tdf*2),2)
    
    #//*** Apply a power transformation, to scale the first counties to a much smaller weighted
    #//*** value, and progressively add weight to the other counties.
    #//*** Or at least that is what I was trying to do. This is achieved through trial and error,
    #//*** and achieves a cumulative result similar to the actual COVID counts. I could probably
    #//*** iteratively tweak this value for a better result. But since rescaling the data is not a
    #//*** great approach in general, I'll leave it like it is.
    tdf = pow(pow(1-(tdf*2),2)*1.5,2.1)
    print(tdf)
    
    
    print("Calculating racial coefficients...")
    start_time = time.time()
    
    #//*** Initialize the output dataframe
    output_df = pd.DataFrame()
    
    #//*** Combining with car_race_df and Latino value. It's not strictly needed, but the additional column will make combining the dataframes later, easier.
    #//*** Reusing Code: This loop only needs to run once
    for race in left_df.columns[1:]:

        #//*** Build model for the first 30 days, combines the race from ca_race (which is only needed as an extra field, to evenly space the columms)
        model_df = left_df[['date',race]].merge(right_df, left_on='date', right_on='date')[:30]

        #//*************************
        #//*** BEGIN regression
        #//*************************


        #//**** Build the x Values - Independent Variables. These will be all the counties which start at the column index 
        #//*** The X Value is the index where the attributes start, there are 58 of them :)
        #x_column = model_df.columns[x_col_index:]

        #//*** Generate ordered list of Counties by current race population.
        #//*** The assumption is that counties with higher populations will exert a greater weight on the model.
        #//*** Otherwise the tiny county of Alpine gets way overly represented
        race_columns = (pop_attrib_df[['county',race]].sort_values(race,ascending=False)['county'])

        #print(model_df[race_columns])
        #//*** Build the X attributes using the x_column. These are separated for readability and modularity
        x_model = model_df[race_columns].copy()
        
        #//*** Weight by population percentage
        for col in race_columns:
            x_model[col] = x_model[col]*tdf[col].values
        
        #print(x_model)
        #//*** Build the dependent variable using the Index Column defined above as y_col_index.
        y_column = model_df.columns[y_col_index]

        #//*** Build the Y model using the y_column attribute. This is less readable and intuitive, but it lets the columns be 
        #//*** easily assigned at the top of this section
        y_model = model_df[y_column]

        #//*** Define the Linear Model
        regr = linear_model.LinearRegression(n_jobs=-1)

        #//*** Make Regression Magic
        regr.fit(x_model, y_model)

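        #//*** Apply the regression coefficients
        #//*** v1 Change: Apply coef to actual values
        model_df[race_columns] = (model_df[race_columns])*regr.coef_

        #//*** Weight by population percentage, matching the per-day loop below.
        #//*** The model was fit on weighted inputs, so the same weights are applied here
        for col in race_columns:
            model_df[col] = model_df[col]*tdf[col].values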
        
        
        #//*** Replace the Statewide Total Column. With the Statewide Race Totals
        model_df['total'] = model_df[race]

        #//*** Change The race column to hold the race name
        model_df[race] = race

        #//*** Rename the race (Latino, Native, etc) to 'race'
        cols = list(model_df.columns)
        cols[1] = 'race'
        model_df.columns = cols

        model_df['intercept'] = regr.intercept_

        #//*** Move intercept to be the column after Total
        #//*** Gets Columns as a list, removes intercept of end and inserts into position
        #//*** Model_df is saved with ordered list of columns.
        #//*** Kinda Cool
        model_df = model_df[ list(model_df.columns[:-1].insert(3,'intercept')) ]

        #//*** Reorder Counties
        #//*** Keep the First four Columns, then use ordered_counties
        model_df = model_df[list(model_df.columns[:4])+ordered_counties]

        #print(model_df)
        
        #//*** Add the First 30 days Model to the output_df dataframe
        output_df = pd.concat([output_df,model_df])

        #//*** Checking our work. The sum of the coefficients * cases + intercept should be close to the dependent value in Total Cases.
        print("Checking our Work. These values should be close:")
        print(model_df.iloc[1]['total'], " == ", model_df[race_columns].iloc[1].sum()+regr.intercept_)    

        #print(model_df[race_columns].iloc[0])
        
        #print(model_df[race_columns].iloc[0]*tdf[race_columns])
        
        #break
        #//*** Build each day individually, based on the previous 30 days
        #//*** Start at index 31 
        for index in range(31,len(left_df)+1):

            #//*** Define the start and indexes for linear modeling. This is the row_index - 30
            min_index = index-30

            #//*** Build model_df using min_index and index as a 30 day range)
            model_df = left_df[['date',race]].merge(right_df, left_on='date', right_on='date')[min_index:index]
            
            #//*** Build the X attributes using the x_column. These are separated for readability and modularity
            x_model = model_df[race_columns].copy()

            #//*** Weight by population percentage
            for col in race_columns:
                x_model[col] = x_model[col]*tdf[col].values

            #//*** Build the Y model using the y_column attribute. This is less readable and intuitive, but it lets the columns be 
            #//*** easily assigned at the top of this section
            y_model = model_df[y_column]

            #//*** Build a New the Linear Model
            regr = linear_model.LinearRegression(n_jobs=-1)

            #//*** Make Regression Magic
            regr.fit(x_model, y_model)

            #//*** Replace the Statewide Total Column. With the Statewide Race Totals
            model_df['total'] = model_df[race]

            #//*** Change The race column to hold the race name
            model_df[race] = race

            #//*** Rename the race (Latino, Native, etc) to 'race'
            cols = list(model_df.columns)
            cols[1] = 'race'
            model_df.columns = cols

            #//*** Apply the regression coefficients to all columns, even though we only need the last one
            #//*** v1 Change: Apply coef to actual values
            model_df[race_columns] = (model_df[race_columns])*regr.coef_
            
            #//*** Weight by population percentage
            for col in race_columns:
                model_df[col] = model_df[col]*tdf[col].values


            #//*** Dead End: v2 Change: Try just the coefficients
            #model_df[race_columns] = regr.coef_

            model_df['intercept'] = regr.intercept_

            #//*** Move intercept to be the column after Total
            #//*** Gets Columns as a list, removes intercept of end and inserts into position
            #//*** Model_df is saved with ordered list of columns.
            #//*** Kinda Cool
            model_df = model_df[ list(model_df.columns[:-1].insert(3,'intercept')) ]

            #//*** Reorder Counties
            #//*** Keep the First four Columns, then use ordered_counties
            model_df = model_df[list(model_df.columns[:4])+ordered_counties]

            #//*** Add the last day of model_df to output_df. 
            #//*** It's not exactly efficient, but it is functional
            #//*** DataFrame.append() was removed in pandas 2.0. iloc[[-1]] keeps the last day as a one-row dataframe
            output_df = pd.concat([output_df,model_df.iloc[[-1]]])
        #print(model_df)
    print(f"CoEfficients Calculated: {round(time.time()-start_time,0)}s")
    return output_df


**coef_to_percent:**

Converts regression coefficients to percentages. This assumes the sum of the weighted coefficients, plus the intercept, equals the predicted value. This is not how coefficients are normally used, but the assumption appears to hold for this purpose.
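
The arithmetic behind the conversion is: square each weighted value to remove negatives, divide by the sum of the squares to get relative shares, then multiply by the row total. A minimal sketch on made-up numbers follows. The helper name `shares_from_values` is hypothetical and works on a plain list rather than a dataframe row.

```python
def shares_from_values(values, total):
    """Turn signed weighted values into non-negative portions that sum to total."""
    # Square to remove negative values
    squared = [v ** 2 for v in values]
    denom = sum(squared)
    # Avoid dividing by zero: an all-zero row stays all zero
    if denom == 0:
        return [0.0 for _ in values]
    # Relative share of each value, scaled by the row total
    return [total * s / denom for s in squared]


# Made-up weighted values for three counties and a made-up statewide total
row = [3.0, -4.0, 0.0]
result = shares_from_values(row, 50)
print(result)
```

Squaring keeps every portion non-negative, but it also exaggerates the gap between large and small values compared with taking absolute values.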


In [None]:

#//************************************************
#//*** Convert Coefficients to percentages
#//************************************************
#//*** Racial Coefficients are built for each race by day and county.
#//*** Group output_df by date, to calculate the race percent for each county by day.
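#//*** The Racial Percentage is each county's squared value divided by the row's sum of squared values, multiplied by the row total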

def coef_to_percent(input_df,x_col_index):

    start_time = time.time()

    #//*** Work from input_df rather than the global processing_df, so the function
    #//*** only depends on its arguments. Both calls below pass processing_df as
    #//*** input_df, so the results are the same.

    #//*** Initialize the output dataframe
    output_df = pd.DataFrame()

    #//*** Process each row of the input dataframe
    for row in input_df.iterrows():

        #//*** Each Row. Copy it so the input dataframe is left untouched
        loop_row = row[1].copy()


        #//*** Square the results to remove negative values
        loop_row.iloc[x_col_index:] = loop_row.iloc[x_col_index:]**2
        
        #//*** Avoid Divide by Zero issues
        if loop_row.iloc[x_col_index:].sum() > 0:
            #//*** Divide by the sum to get relative percentages
            loop_row.iloc[x_col_index:] = loop_row.iloc[x_col_index:]/loop_row.iloc[x_col_index:].sum()
        
        #//*** Multiply by the total to get the weighted percentage of racial values
        loop_row.iloc[x_col_index:] = loop_row.iloc[x_col_index:]*loop_row['total']
        
        #//*** Add the row to the output_df
        output_df = pd.concat([output_df,pd.DataFrame(loop_row).transpose()])

    #//*** Reindex with an integer. Keeps it clean
    output_df.index = np.arange(0,len(output_df))
    
    #//*** Zero values generate Divide by Zero NaNs. Replace these with Zero
    output_df = output_df.fillna(0)

    print(f'Racial Percentages Calculated: {round(time.time()-start_time,0)}s')
    
    return output_df





Calculate the biased (population-scaled) coefficients with calculate_coefficients_v2 and derive the racial percentages.

In [None]:
#//************************************************************************************************************************************************************************
#//*** Calculate the coefficients for racial data for each county per day.
#//*** This uses LinearRegression to generate coefficients which are weighted with the county cases of the day.
#//*** Coefficients derive each county's effect on the whole of racial COVID cases. The idea is to estimate every county's portion of the racial cases.
#//*** Each race is modeled separately. The coefficients are converted to percentages in the following step
#//************************************************************************************************************************************************************************


#//*** left_df contains the broad racial categories
left_df = ca_races_broad_df

#//*** Right df contains each county as a column, with one value aggregated. Such as raw COVID cases.
right_df = ca_cases_broad_df

#//*** County Columns start at Index 3.
#//*** These values are a little unintuitive, since they are relevant during the function operation. They are used to indicate which columns are used via the first element's index position
x_col_index = 3

#//*** Target dependent column index. Statewide race numbers begin at column 1
#//*** These values are a little unintuitive. They are used to indicate which columns are used via the first element's index position
y_col_index = 1


processing_df = calculate_coefficients_v2(left_df,right_df,x_col_index,y_col_index)

#//**** County Column index start
x_col_index = 4

#//*** Convert coefficients to percentages,

print("BEGIN PERCENT")
ca_case_race_total_df =coef_to_percent(processing_df,x_col_index)

print(ca_case_race_total_df)
if 'ca_case_race_total_df' not in df_list:
    df_list.append('ca_case_race_total_df')

print("Done")

Run the model again without the population scaling (calculate_coefficients) for reference.

The result is stored in no_transform_df.

In [None]:
#//*** Get Sample model without using power transform


#//************************************************************************************************************************************************************************
#//*** Calculate the coefficients for racial data for each county per day.
#//*** This uses LinearRegression to generate coefficients which are weighted with the county cases of the day.
#//*** Coefficients derive each county's effect on the whole of racial COVID cases. The idea is to estimate every county's portion of the racial cases.
#//*** Each race is modeled separately. The coefficients are converted to percentages in the following step
#//************************************************************************************************************************************************************************


#//*** left_df contains the broad racial categories
left_df = ca_races_broad_df

#//*** Right df contains each county as a column, with one value aggregated. Such as raw COVID cases.
right_df = ca_cases_broad_df

#//*** County Columns start at Index 3.
#//*** These values are a little unintuitive, since they are relevant during the function operation. They are used to indicate which columns are used via the first element's index position
x_col_index = 3

#//*** Target dependent column index. Statewide race numbers begin at column 1
#//*** These values are a little unintuitive. They are used to indicate which columns are used via the first element's index position
y_col_index = 1


processing_df = calculate_coefficients(left_df,right_df,x_col_index,y_col_index)

#//**** County Column index start
x_col_index = 4

#//*** Convert coefficients to percentages,

print("BEGIN PERCENT")
no_transform_df =coef_to_percent(processing_df,x_col_index)

print(no_transform_df)
if 'no_transform_df' not in df_list:
    df_list.append('no_transform_df')

print("Done")

In [None]:
"""
#//***************************************************************************************************************************************************************************
#//*** Generatate a Individual Race Coeffecients for each county per day.
#//*** The the sum of racial coefficients should equal the state coefficent for the county. 
#//*** Build the coefficients for the entire data set. Each day will calculate the coefficients from the previous 30 days. The First 30 days will use one set of coeffients. The rest will use
#//*** The current day: -30 to generate the coefficients. This will be an overfitted solution which is exactly what we are going for.
#//***************************************************************************************************************************************************************************
def calculate_coefficients_v2(left_df, right_df, x_col_index, y_col_index):
    print("Calculating racial coefficients...")
    start_time = time.time()
    
    #//*** Initialize the output dataframe
    output_df = pd.DataFrame()
    
    
    
    #//*** Combining with car_race_df and Latino value. It's not strictly needed, but the additional column will make combining the dataframes later, easier.
    #//*** Reusing Code: This loop only needs to run once
    for race in left_df.columns[1:]:

        #//*** Build model for the first 30 days, combines the race from ca_race (which is only needed as an extra field, to evenly space the columms)
        model_df = left_df[['date',race]].merge(right_df, left_on='date', right_on='date')
        
        loop_df = model_df.copy()
        
        #print(model_df)
        #//*************************
        #//*** BEGIN regression
        #//*************************


        #//**** Build the x Values - Dependent Variables. These will be all the counties which start at the column index 
        #//*** The X Value is the index where the attributes start, there are 58 of them :)
        #x_column = model_df.columns[x_col_index:]

        #//*** Generate ordered list of Counties by current race population.
        #//*** The assumptions is the counties with higher populations will exert a greater weight on the model.
        #//*** Otherwise the tiny county of Alpine gets way overly represented
        race_columns = list(pop_attrib_df[['county',race]].sort_values(race,ascending=False)['county'])
        
        for index in range(0,len(race_columns)):
            loop_cols = (race_columns[index:] + race_columns[:index])
            
            #print(model_df[race_columns])
            #//*** Build the X attributes using the x_column. These are separated for readability and modularity
            x_model = model_df[loop_cols]
            
            #//*** Build the independent variable using the Index Column defined above as y_col_index.
            y_column = model_df.columns[y_col_index]

            #//*** Build the Y model using the y_column attribute. This is less readable and intuitive, but it lets the columns be 
            #//*** easily assigned at the top of this section
            y_model = model_df[y_column]

            #//*** Define the Linear Model
            regr = linear_model.LinearRegression(n_jobs=-1)

            #//*** Make Regression Magic
            regr.fit(x_model, y_model)

            #//*** Apply the regression coefficients
            #//*** v1 Change: Apply coef to actual values
            model_df[loop_cols] = (model_df[race_columns])*regr.coef_
            
            #//*** Only keep the coef_ for the first county
            loop_df[loop_cols[0]] = model_df[loop_cols[0]]
        #print(loop_df)
        #break

        
        #print(x_model)
        
        
        #//*** Dead End: v2 Change: Try just the coefficients 
        #model_df[race_columns] = regr.coef_



        #//*** Replace the Statewide Total Column. With the Statewide Race Totals
        loop_df['total'] = loop_df[race]

        #//*** Change The race column to hold the race name
        loop_df[race] = race

        #//*** Rename the race (Latino, Native, etc) to 'race'
        cols = list(loop_df.columns)
        cols[1] = 'race'
        loop_df.columns = cols

        loop_df['intercept'] = regr.intercept_

        #//*** Move intercept to be the column after Total
        #//*** Gets Columns as a list, removes intercept of end and inserts into position
        #//*** Model_df is saved with ordered list of columns.
        #//*** Kinda Cool
        loop_df = loop_df[ list(loop_df.columns[:-1].insert(3,'intercept')) ]

        #//*** Reorder Counties
        #//*** Keep the First four Columns, then use ordered_counties
        loop_df = loop_df[list(loop_df.columns[:4])+ordered_counties]

        #print(model_df)
        #//*** Add the First 30 days Model to the output_df dataframe
        output_df = pd.concat([output_df,loop_df])

        #//*** Checking our work. The sum of the coefficients * cases + intercept should be close the independent value in Total Cases.
        print("Checking our Work. These values should be close:")
        print(model_df.iloc[1]['total'], " == ", model_df[race_columns].iloc[1].sum()+regr.intercept_)    

        #print(output_df)
        
        

    
    print(f"CoEfficients Calculated: {round(time.time()-start_time,0)}s")
    return output_df


#//************************************************************************************************************************************************************************
#//*** Calculate the coefficients for Racial data for each county day.
#//*** This uses LinearRegression to generate coefficients which are weighted with the county cases of the day.
#//*** Coefficients derive individual counties affect on the whole of racial COVID cases. The idea is to estimate every counties given portion of the racial cases.
#//*** Each race is modeled separately. The coefficients are converted to percentages in the following step
#//************************************************************************************************************************************************************************


#//*** left_df contains the broad racial categories
left_df = ca_races_broad_df

#//*** Right df contains each county as a column, with one value aggregated. Such as raw COVID cases.
right_df = ca_cases_broad_df

#//*** County Columns start at Index 3.
#//*** These values are a little unintuitive, since they are relevant during the function operation. They indicate which columns are used via the first element's index position
x_col_index = 3

#//*** Target (dependent) Column Index. Statewide race numbers begin at column 1
#//*** These values are a little unintuitive. They are used to indicate which columns are used via the first elements index position
y_col_index = 1


processing_df = calculate_coefficients_v2(left_df,right_df,x_col_index,y_col_index)

#//**** County Column index start
x_col_index = 4

#//*** Convert coefficients to percentages,

print("BEGIN PERCENT")
v2_ca_case_race_total_df =coef_to_percent(processing_df,x_col_index)


print(v2_ca_case_race_total_df)
if 'v2_ca_case_race_total_df' not in df_list:
    df_list.append('v2_ca_case_race_total_df')

print("Done")
"""
print()

Code to check our work. The summed racial totals should equal the summed COVID cases for Imperial county
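
A minimal sketch of this check on toy data. The frames and the `county_gap` helper are hypothetical stand-ins for `ca_cases_broad_df` and `ca_case_race_total_df`, assuming one row per date in the county frame and one row per date and race in the modeled frame.

```python
import pandas as pd

# Hypothetical toy data: two dates, two counties (not the notebook's real frames).
county_cases = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02"],
    "Imperial": [10, 20],
    "Kern": [30, 40],
})

# Modeled cases: one row per date and race, one column per county.
modeled = pd.DataFrame({
    "date": ["2021-01-01"] * 2 + ["2021-01-02"] * 2,
    "race": ["Latino", "White"] * 2,
    "Imperial": [7.0, 3.0, 15.0, 5.0],
    "Kern": [18.0, 12.0, 22.0, 18.0],
})

def county_gap(date, county):
    """Return modeled-minus-actual cases for one county on one date."""
    actual = county_cases.loc[county_cases["date"] == date, county].iloc[0]
    model_sum = modeled.loc[modeled["date"] == date, county].sum()
    return model_sum - actual

# A gap near zero means the racial split adds back up to the county total.
print(county_gap("2021-01-02", "Imperial"))
```

In the real data the gap will rarely be exactly zero, so comparing it against a small tolerance is probably more useful than testing for equality.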

In [None]:
#//*** Checking our work. These numbers should be close. Predicted values are converted to integers so there will be no 'partial' cases. If modeled a


#//*** random.randint is inclusive on both ends, so subtract 1 to stay inside the dataframe
t_date = ca_case_race_total_df.iloc[random.randint(0,len(ca_case_race_total_df)-1)]['date']
t_total = ca_case_race_total_df[ca_case_race_total_df['date']==t_date]['total']
#print(t_date, " " , t_total)
#print(ca_case_race_total_df.iloc[250][4:].sum())

#print(ca_case_race_total_df.iloc[250])
print(t_date)
tdf1 = ca_cases_broad_df[ca_cases_broad_df['date']== t_date]
print(tdf1[tdf1.columns[2:]].transpose().sum())
print("===================")
print(ca_case_race_total_df[ca_case_race_total_df['date']==t_date].iloc[3][2])
print(ca_case_race_total_df[ca_case_race_total_df['date']==t_date].iloc[3][4:].sum())
print("===================")
print(tdf1)
print("===================")
print("Imperial Broad: ",tdf1['Imperial'].values[0])
print("Imperial Race:  ", ca_case_race_total_df[ca_case_race_total_df['date']==t_date]['Imperial'].sum())
print("Race Total   :  ", ca_case_race_total_df[ca_case_race_total_df['date']==t_date]['total'].sum())
print("===================")
print("Total COVID cases by Race on this day: ",ca_races_broad_df[ca_races_broad_df['date']==t_date].transpose()[1:].sum().values[0])
print("Total COVID cases by County:           ",tdf1['total'].values[0])



del t_date
del tdf1
del t_total

Skipping my original rescaling solution, which was to multiply each county by its missing values. This step is not needed when the values are scaled before they are applied to the model.

This was added last. Instead of changing code, I copied ca_case_race_total_df to modified_race_total to keep all the graphs working
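
For reference, the skipped rescaling idea can be sketched in a vectorized form. This is an illustration on made-up numbers, not the notebook's code; `actual` and `modeled` are hypothetical frames with one column per county, and it assumes no county has a modeled total of zero.

```python
import pandas as pd

# Hypothetical daily cases per county (rows are days).
actual = pd.DataFrame({"Imperial": [10, 20], "Kern": [30, 40]})
modeled = pd.DataFrame({"Imperial": [12.0, 28.0], "Kern": [20.0, 15.0]})

# One factor per county: actual total / modeled total.
# Assumes modeled totals are nonzero; a zero total would give inf.
scale = actual.sum() / modeled.sum()

# Multiplying a DataFrame by a Series aligns on the column labels,
# so each county column is scaled by its own factor.
rescaled = modeled * scale

print(rescaled.sum())
```

After rescaling, each county's modeled total matches its actual total, whether the model originally ran over or under.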

In [None]:
"""
modified_race_total = ca_case_race_total_df.copy()
redistrib = 0
for county in ordered_counties:
    #//*** Get the Actual Summed Totals of cases for the Counties 
    actual = ca_cases_broad_df[county].sum()
    
    modeled = modified_race_total[county].sum()
    if modeled > actual:
        #//*** Modeled is Bigger, need to redistribute the overage
        print(actual/modeled)
        redistrib = modified_race_total[county].sum()
        print("Redistrib: ",redistrib)
        modified_race_total[county] = modified_race_total[county]*(actual/modeled)
    else:
        county_target =  modified_race_total[county].sum()/(modeled/actual)
        redistrib -=  county_target
        #print(actual,"-",int(modeled)," Under: ", modeled/actual, " ", county, " ", county_target, " == ", modified_race_total[county].sum()/(modeled/actual)," -- ",int(redistrib))
        modified_race_total[county] = modified_race_total[county]/(modeled/actual)
    #print(actual, " / ",modeled)
"""
print()
modified_race_total = ca_case_race_total_df.copy()

This section checks the accuracy of the model for each county. It's OK. Given more time, I could reduce the error.
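
One way to make the comparison easier to read is a percent-error table. This is a sketch with invented totals; `actual_totals` and `modeled_totals` are hypothetical stand-ins for the per-county sums printed in the next cell.

```python
import pandas as pd

# Hypothetical per-county totals (not real data).
actual_totals = pd.Series({"Imperial": 30.0, "Kern": 70.0, "Alpine": 5.0})
modeled_totals = pd.Series({"Imperial": 33.0, "Kern": 56.0, "Alpine": 5.0})

report = pd.DataFrame({"actual": actual_totals, "modeled": modeled_totals})

# Signed percent error: positive means the model over-counts the county.
report["pct_error"] = (report["modeled"] - report["actual"]) / report["actual"] * 100

# Put the worst-fitting counties first, regardless of sign.
order = report["pct_error"].abs().sort_values(ascending=False).index
report = report.loc[order]

print(report.round(1))
```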

In [None]:
#//*** The Rescaled numbers match. But it's kind of hokum
#//*** Compare the Modeled County Totals to the Actual Totals. 
for county in ordered_counties:
    #print(county, " Actual: ",ca_cases_broad_df[county].sum(), " - Modeled: ",ca_case_race_total_df[county].sum()   )
    print("A: ",ca_cases_broad_df[county].sum(), " - M: ",round(modified_race_total[county].sum(),0), " ", county   )

Plot the COVID distribution between the Unbiased model, the Biased model and the actual cases.

In [None]:
#print(ca_cases_broad_df[ordered_counties].sum())
display_size=15
bottom = -1



fig,ax = plt.subplots(figsize=(display_size,display_size/2))

    
ax.bar(no_transform_df[ordered_counties].sum().index,no_transform_df[ordered_counties].sum())
        
plt.xticks(rotation=90,fontsize=display_size*.8)
plt.yticks(fontsize=display_size)

handles,labels = deduplicate_legend(ax)

plt.title(f"Raw Model Distribution",fontsize=display_size)
plt.show()

graph_df = ca_case_race_total_df.copy()

fig,ax = plt.subplots(figsize=(display_size,display_size/2))

    
ax.bar(graph_df[ordered_counties].sum().index,graph_df[ordered_counties].sum())
        
plt.xticks(rotation=90,fontsize=display_size*.8)
plt.yticks(fontsize=display_size)

handles,labels = deduplicate_legend(ax)

plt.title(f"Transformed Modeled Distribution",fontsize=display_size)
plt.show()
#del graph_df

graph_df = ca_cases_broad_df.copy()

fig,ax = plt.subplots(figsize=(display_size,display_size/2))

ax.bar(graph_df[ordered_counties].sum().index,graph_df[ordered_counties].sum())
    
#    bottom += graph_df[graph_df.columns[x]]

#//*** Draw horizontal line. Draw it twice to get the yellow and black effect.
#//*** This technique looks visually good, but I can't get the legend to draw appropriately.
#ax.axhline(state_100k,color = "black", label="Statewide Cases 100k", linestyle = "-", lw=2)
#ax.axhline(state_100k,color = "yellow", linestyle = "--", lw=2)
        
plt.xticks(rotation=90,fontsize=display_size*.8)
plt.yticks(fontsize=display_size)

handles,labels = deduplicate_legend(ax)

#plt.legend(fontsize=display_size*.5)
plt.title(f"Actual COVID Distribution",fontsize=display_size)
#plt.ylabel("Total Cases by County (per 100k)",fontsize=display_size)
plt.show()

del graph_df


In [None]:
"""
#//*** This is the modeled county Data (such as it is) I'm working on a visualization for this


#//*** Set Display to 5 decimal places globally
pd.set_option('display.float_format', lambda x: '%.5f' % x)

for group in modified_race_total.groupby('race'):
    x_column = modified_race_total.columns[4:]
    print(group[0])
    #print(group[1])
    loop_series = group[1][x_column].sum().sort_values(ascending=False)
    
    loop_df = pd.DataFrame(loop_series) 
    loop_df.columns = ['cases']
    loop_df['county'] = loop_df.index
    loop_df = loop_df.merge(pop_attrib_df,on='county')
    
    #loop_df['population'] = loop_df['population']/100000
    #loop_df['percent'] = loop_df['cases'] / loop_df[group[0]]
    loop_df['percent'] = loop_df['cases'] / (loop_df[group[0]])
    loop_df['pop_percent'] = loop_df[group[0]]/loop_df['population']
    
    print(loop_df.sort_values('percent',ascending=False)[['county','cases','percent','pop_percent']])
    print("====")

#//*** Reset the notation 
pd.reset_option('display.float_format')
"""
print()

This section looks at the differences between the modeled values and the expected COVID values based on a racial groups portion of the population.
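
A small worked example of the differential used in the next cell, `(actual - expected) / actual`, where the expected cases are the total cases times a group's share of the population. The numbers are invented and the variable names are placeholders for illustration only.

```python
import pandas as pd

# Hypothetical statewide figures for a single day.
population = pd.Series({"Latino": 400, "White": 600})
actual_cases = pd.Series({"Latino": 70, "White": 30})
total_cases = actual_cases.sum()

# Share of the population held by each group.
pop_share = population / population.sum()

# Cases each group would have if cases followed population share.
expected = total_cases * pop_share

# Differential relative to actual cases. Positive means the group has
# more cases than its population share predicts, negative means fewer.
diff = (actual_cases - expected) / actual_cases

print(diff)
```

Because the denominator is the actual count, a day with zero actual cases would divide by zero, which is presumably why the cell below swaps zeros for ones before dividing.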

In [None]:
#//*********************************************************************
#//*** Compare the Regression Model with the expected values
#//*********************************************************************
#//*** The expected racial values are the total COVID cases by county * racial percentage of the county
#//*** This would mean an even distribution of cases
#//*********************************************************************

#//*** Build Statewide racial differential percentages

ca_race_diff = pd.DataFrame()

#ca_race_diff['date'] = ca_total_df['date'].copy()
ca_race_diff['date'] = ca_cases_broad_df['date'].copy()

#print(ca_race_diff)

for race in ca_race_df['race'].unique():

    race_percent = pop_attrib_df[race].sum() / pop_attrib_df['population'].sum()
    #/  ca_total_df['cases']*race_percent
    #//**** actual Value
    
    #//*** Avoid Divide by 0 warnings. Convert Zeros to 1
    actual = ca_race_df[ca_race_df['race']==race]['cases'].replace(0,1)
#    expected = ca_total_df['cases']*race_percent
    expected = ca_cases_broad_df['total']*race_percent
    
    #print(expected)
    
    #ca_race_diff['actual'] = actual.values
    #ca_race_diff['expected'] = expected.values
    ca_race_diff[race] = ( actual.values - expected.values ) / actual.values
    
    #print(pd.DataFrame([actual,expected]))
    #ax.plot(ca_race_df['date'].unique(),ca_race_df[ca_race_df['race']==race]['cases'].rolling(5).mean(),label='actual')
    #ax.plot(ca_race_df['date'].unique(),ca_total_df['cases'].rolling(5).mean()*race_percent,label='expected')
#ca_race_diff['Native'] = ca_race_diff['Native'].replace(np.inf,0)
ca_race_diff.replace([np.inf, -np.inf], 0, inplace=True)
#//*** Replace 1 with Zero. These were Zero value before. Avoids Divide by Zero issues.
ca_race_diff.replace(1, 0, inplace=True)
ca_race_diff = ca_race_diff.fillna(0)

#//***********************************************************************************************************************
#//*** Build the Model for all Counties.
#//***********************************************************************************************************************
#//*** Model takes the daily cases, and estimates the racial cases based on the portion of the population.
#//*** Example: 100 cases and Latino is .55. Latinos would be assigned 55 cases.
#//***********************************************************************************************************************
#//*** The expected racial value is adjusted by the statewide racial differential (ca_race_diff)
#//*** Example: If the Latino differential on a given day is .25, the modeled Latino cases would be 68.75 (55 + 55 * .25)
#//***********************************************************************************************************************

#//*** Build a dataframe to hold the expected cases
ca_covid_est_case_df = ca_covid_df.copy()


#//*** Process each race
for race in ca_race_df['race'].unique():
    #ca_covid_est_case_df[race] =  ( (ca_covid_est_case_df[race] / ca_covid_est_case_df['population'])*ca_covid_est_case_df['cases'] ) + ( ( (ca_covid_est_case_df[race] / loop_df['population'])*ca_covid_est_case_df['cases'] ) * ca_race_diff[race].values )
    
    #//*** Build the racial population percentage for each county
    #//*** Racial Population / total Population = County racial portion
    ca_covid_est_case_df[race] = (ca_covid_est_case_df[race] / ca_covid_est_case_df['population']) 
    
    #//*** Build the Expected COVID cases based on population percentage
    #//*** Case_[race] columns. County COVID cases * Racial portion 
    ca_covid_est_case_df[f'case_{race}'] = ca_covid_est_case_df[race] * ca_covid_est_case_df['cases']
    

#//*** Build model Dataframe. This holds the adjusted cases
#//*** This workflow is awkward because I'm building it for each date. I can probably do this better by merging the ca_race_diff dataframe.
ca_model_cases_df = pd.DataFrame()

#//*** Process each Date 
for group in ca_covid_est_case_df.groupby('date'):
    loop_date = group[0]
    loop_df = group[1].copy()
    
    #//*** Get the race differential values on the given date
    loop_race_diff = ca_race_diff[ ca_race_diff['date']==loop_date] 
    
    #//*** Process the race data for each racial attribute.
    #//*** Each loop processes a racial attribute and adds an adjusted_[race] value.
    for race in ca_race_df['race'].unique():
        #//*** Get the modifier for that race on the given day
        race_modifier = loop_race_diff[race].values[0]
        
        #//*** expected value + (expected value * modifier)
        #//*** Modifier can be positive or negative.
        loop_df[f'adj_{race}'] =  loop_df[f'case_{race}'].values + (loop_df[f'case_{race}'].values * race_modifier) 
    
    #//*** Add the results into ca_model_cases_df. Awkward and inefficient.
    ca_model_cases_df = pd.concat([ ca_model_cases_df, loop_df ] ) 
    

#//*** Temp variable cleanup
del loop_df
del loop_date
del loop_race_diff
del race_modifier

In [None]:
#//*** Get the Statewide 100k value. 
#//*** Get total Case Count from orig_df, divided by total population / 100000
state_100k = ca_covid_orig_df['cases'].sum()/(ca_covid_orig_df['population'].unique().sum()/100000)

county_list = ca_100k_df['county'].unique()

county_100k = []

for county in county_list:
    #if ca_100k_df[ca_100k_df['county']==county].iloc[0]['population'] > 1:
    county_100k.append(county)
case_totals = []

for county in county_100k:
    case_totals.append(ca_100k_df[ca_100k_df['county']==county]['cases'].sum())

#//*** Temp Series
ts = pd.Series(index = county_100k, data=case_totals).sort_values(ascending=False)

#print(ts)

high_covid_counties = list(ts[ts > state_100k].index)



Examine the correlation between the modeled and the expected values. I'm not sure this is helpful: if the model simply reflected the expected values, it wouldn't detect racial disparities. It's interesting, but I'm unsure whether it's good information.
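
A quick illustration of what the Pearson coefficient does and does not capture. This sketch swaps in pandas' `Series.corr` (Pearson by default) for `scipy.stats.pearsonr`, and the three series are made up.

```python
import pandas as pd

# Hypothetical daily values.
modeled = pd.Series([1.0, 2.0, 3.0, 4.0])
expected_double = pd.Series([2.0, 4.0, 6.0, 8.0])    # same shape, twice the level
expected_opposite = pd.Series([8.0, 6.0, 4.0, 2.0])  # mirrored trend

# Pearson's r ignores scale: it only measures whether the series move together.
r_same = modeled.corr(expected_double)
r_opp = modeled.corr(expected_opposite)

print(r_same, r_opp)
```

Since the coefficient is insensitive to level, a group whose modeled cases run at double the expected count can still correlate perfectly with it. That fits the doubt above: a high correlation says the curves have the same shape, not that the case counts agree.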

In [None]:
#//*** Build correlation table. How does the correlation compare between the modeled and Expected Values?
#//*** Result: Unsure. Probably not helpful
from scipy.stats import pearsonr

county = 'Los Angeles'
loop_dict = {}

for county in ca_100k_df['county'].unique():   
    #//*** Compare the expected vs scaled modeled values
    expected_la = ca_model_cases_df[ca_model_cases_df['county']== county]
    
    modeled_la = modified_race_total[['date','race',county]]
    #modeled_la = ca_case_race_total_df[['date','race',county]]
    for race in modeled_la['race'].unique():
        loop_model_race = modeled_la[modeled_la['race']==race]
#
        loop_model_expected = expected_la[['date','county',race,f"case_{race}",f"adj_{race}","cases"]]
        #print(loop_model_race)
        #print(loop_model_expected)
        corr,_ = pearsonr(loop_model_race[county],loop_model_expected['cases'])
        #print("Correlation: ", county, " ",race, " ",corr)
        
        if county not in loop_dict.keys():
            loop_dict[county] = {}
            
        loop_dict[county][race] = (corr)
cors_df = pd.DataFrame(loop_dict)
cor_index = list(cors_df.index)
cor_col = list(cors_df.columns)
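#//*** DataFrame.append was removed in pandas 2.0. pd.concat adds the same summed row.
cors_df = pd.concat([cors_df, cors_df.sum().to_frame().T], ignore_index=True)
cors_df.index = cor_index + ['total']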
cors_df = cors_df.transpose()
cors_df['county'] = cors_df.index
cors_df.index=range(0,len(cors_df))
cors_df.columns = ['Native','Asian','Black','Latino','Multiracial', 'Hawaiian','White','total','county']
cors_df = cors_df.sort_values('total',ascending=False)
print("===============================================")
print("Correlation between Expected and Modeled Values")
print("===============================================")
print(cors_df)        

Build modeled case percentages. These are each race's modeled share of a county's COVID-19 cases. For example, a value of 0.62 for Latino in Los Angeles means the model attributes 62% of the Los Angeles cases to Latinos.
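
A compact sketch of the computation in the next cell on toy data: sum the modeled cases per race and county, then divide each county column by its total so every value is a race's share of that county's modeled cases. The `modeled` frame is a hypothetical stand-in for `modified_race_total`.

```python
import pandas as pd

# Hypothetical modeled cases: one row per date and race, one column per county.
modeled = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2"],
    "race": ["Latino", "White", "Latino", "White"],
    "Los Angeles": [6.0, 4.0, 9.0, 1.0],
    "Kern": [1.0, 1.0, 2.0, 4.0],
})

# Total modeled cases per race (rows) and county (columns).
counts = modeled.groupby("race")[["Los Angeles", "Kern"]].sum()

# Divide each county column by its own total; the Series aligns on columns.
shares = counts / counts.sum()

# Transpose so counties are rows, matching the layout printed below.
shares = shares.T

print(shares)
```

Each row of `shares` sums to 1, which is a handy sanity check on the real output as well.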

In [None]:
pd_dict = {}

input_df=modified_race_total



for group in input_df.groupby('race'):

    #//*** Initialize Race as a key
    pd_dict[group[0]] ={}

    #//*** Add the values for each county
    for county in group[1].columns[4:]:
        
        #//*** Assign the total to [race] [county]
        pd_dict[group[0]][county] = group[1][county].sum()
    

#//*** Let's build cumulative Racial Totals
#print(modified_race_total)

#//*** Convert to data frame
modeled_total_counts_df = pd.DataFrame(pd_dict).transpose()

#//*** Build a new Dataframe to hold the percentages
modeled_percents_df = modeled_total_counts_df.copy()

for county in modeled_percents_df.columns:
    
    modeled_percents_df[county] = modeled_percents_df[county] / modeled_percents_df[county].sum()
    

#print(modeled_percents_df.transpose().sort_values("Latino",ascending=False))

#modeled_percents_df = modeled_percents_df.sort_values("Los Angeles",ascending=False).transpose().sort_values("Latino",ascending=False)
modeled_percents_df = modeled_percents_df.sort_values("Los Angeles",ascending=False).transpose()

print(modeled_percents_df)
#//*** Temp variable cleanup
del pd_dict

Build a dataframe with the actual population percentages.
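
The same step can be sketched without a loop. This assumes a hypothetical `pop` frame laid out like `pop_attrib_df` (county, population, then one column per race) and a hypothetical `modeled_shares` frame whose row and column order the result should follow.

```python
import pandas as pd

# Hypothetical population table.
pop = pd.DataFrame({
    "county": ["Kern", "Los Angeles"],
    "population": [100, 1000],
    "Latino": [50, 500],
    "White": [50, 500],
})

# Hypothetical modeled shares, deliberately in a different row/column order.
modeled_shares = pd.DataFrame(
    {"White": [0.25, 0.625], "Latino": [0.75, 0.375]},
    index=["Los Angeles", "Kern"],
)

by_county = pop.set_index("county")
race_cols = ["Latino", "White"]

# Divide every race column by the county population, row by row.
pop_shares = by_county[race_cols].div(by_county["population"], axis=0)

# Align rows and columns with the modeled table so the two can be compared.
pop_shares = pop_shares.reindex(index=modeled_shares.index,
                                columns=modeled_shares.columns)

# Modeled share minus population share.
disparity = modeled_shares - pop_shares
print(disparity)
```

Aligning with `reindex` before subtracting is a convenience here, since pandas would align the labels anyway; it mainly keeps the printed tables in the same order.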

In [None]:
#//*** Build the Racial percentages for each county
pop_percent_df = pop_attrib_df.copy()

#//*** Divide each race column by the population to get percentage
for col in pop_percent_df.columns[2:]:
    pop_percent_df[col] = pop_percent_df[col]/pop_percent_df['population'] 

#//*** Set index to county
pop_percent_df.index=pop_percent_df['county']
#//*** Delete county column
del pop_percent_df['county']

del pop_percent_df['population']

#//*** Set index to modeled_percents_df.index. This aligns the county values.
pop_percent_df = pop_percent_df.reindex(modeled_percents_df.index)

#//*** Align Columns with Modeled Percents for consistent looping
pop_percent_df = pop_percent_df[list(modeled_percents_df.columns)]

print(pop_percent_df)

In [None]:
#//*** Assigns a color from a palette list to a county. 
def assign_color(input_item, input_cd,input_palette):
    #//*** Check if item already exists, if so, return input_cd
    if input_item in input_cd.keys():
        return input_cd
    
    #//*** input_item needs a Color. Walk down the input_palette until an unused color is found
    for color in input_palette:
        if color not in input_cd.values():
            input_cd[input_item] = color
            return input_cd
    print("UH OH ran out of colors!!!")
    print(f"Item: {input_item}")
    print(input_cd)
    return input_cd

#//*** Color Choices: Tucking these aside for later use
#//*** Combine these with a dictionary to create color continuity across multiple visualizations.
color_palette = ["#c6eaff","#caa669","#14bae2","#f7cd89","#98a9e7","#e2ffb7","#cb9ec2","#77dcb5","#ffc5b7","#40bdba","#fff4b0","#74d0ff","#e4da8d","#7ceeff","#d0e195","#b7ab8c","#fcffdb","#83b88d","#ffe2c0","#abc37a"]
color_palette = ["#557788","#e12925","#44af0e","#7834c0","#726d00","#130c6d","#004e12","#f7007d","#017878","#950089","#00a3d7","#4b000e","#0063c2","#f07478","#013b75","#cf81b8","#212238","#af87e7","#320f49","#9c91db"]
county_color_palette = ["#b4a23b","#4457ca","#9ec535","#a651cb","#59ce59","#6a77f0","#52a325","#b93d9b","#36b25c","#e374d4","#c1c035","#7452af","#96ae3a","#a484e2","#89c466","#e54790","#57c888","#dd3d60","#5bd6c4","#dd4e2d","#45ccdf","#bd3738","#4cb998","#b13a6c","#368433","#588feb","#dcad3d","#4763af","#e49132","#4aa5d4","#c86321","#7695d3","#769233","#925898","#54701c","#c893d6","#3d7b44","#e084ac","#65a76b","#965179","#296437","#e57f5f","#31a8ad","#a44b2c","#368d71","#df7f81","#226a4d","#96465f","#b5b671","#68649c","#ad772e","#a34f52","#758348","#d8a06e","#505e25","#8e5e31","#8e8033","#695f1b"]
county_color_palette = ["#96465f","#dd3d60","#df7f81","#a34f52","#bd3738","#dd4e2d","#e57f5f","#a44b2c","#c86321","#8e5e31","#d8a06e","#e49132","#ad772e","#dcad3d","#b4a23b","#8e8033","#695f1b","#c1c035","#b5b671","#96ae3a","#505e25","#758348","#9ec535","#769233","#54701c","#52a325","#89c466","#368433","#59ce59","#3d7b44","#65a76b","#296437","#36b25c","#57c888","#226a4d","#368d71","#4cb998","#5bd6c4","#31a8ad","#45ccdf","#4aa5d4","#7695d3","#588feb","#4763af","#6a77f0","#4457ca","#68649c","#a484e2","#7452af","#a651cb","#c893d6","#925898","#e374d4","#b93d9b","#965179","#e084ac","#e54790","#b13a6c"]
county_color_palette = ["#226a4d","#31a8ad","#68649c","#758348","#505e25","#368d71","#4aa5d4","#965179","#7695d3","#45ccdf","#296437","#96465f","#8e5e31","#b5b671","#d8a06e","#a34f52","#5bd6c4","#695f1b","#4cb998","#df7f81","#3d7b44","#e084ac","#c893d6","#65a76b","#8e8033","#925898","#4763af","#54701c","#ad772e","#a44b2c","#e57f5f","#769233","#57c888","#b13a6c","#588feb","#a484e2","#b4a23b","#368433","#89c466","#7452af","#96ae3a","#dcad3d","#bd3738","#36b25c","#e374d4","#c86321","#b93d9b","#e49132","#dd3d60","#e54790","#c1c035","#4457ca","#6a77f0","#52a325","#9ec535","#dd4e2d","#a651cb","#59ce59"]
color_palette = ["#e0472b","#ffac4b","#469100","#02c1d7","#66acff","#906fee","#fcaee0"]
race_color = {}

for race in pop_percent_df.columns:
    race_color = assign_color(race,race_color,color_palette)
    
print(race_color)

First attempt at displaying racial distributions across the state

In [None]:
display_size = 40

graph_df = modeled_percents_df.copy()

bottom = -1

fig,ax = plt.subplots(figsize=(display_size,display_size/2))

ax.bar(graph_df.index,graph_df[graph_df.columns[0]],label=graph_df.columns[0])

bottom = graph_df[graph_df.columns[0]]

for x in range(1, (len(graph_df.columns))):
    
    ax.bar(graph_df.index,graph_df[graph_df.columns[x]],bottom=bottom,label=graph_df.columns[x])
    
    
    bottom += graph_df[graph_df.columns[x]]

#//*** Draw horizontal line. Draw it twice to get the yellow and black effect.
#//*** This technique looks visually good, but I can't get the legend to draw appropriately.
#ax.axhline(state_100k,color = "black", label="Statewide Cases 100k", linestyle = "-", lw=2)
#ax.axhline(state_100k,color = "yellow", linestyle = "--", lw=2)
        
plt.xticks(rotation=90,fontsize=display_size*.8)
plt.yticks(fontsize=display_size)

handles,labels = deduplicate_legend(ax)

plt.legend(fontsize=display_size*.5)
plt.title(f"Racial COVID Distribution by county",fontsize=display_size)
#plt.ylabel("Total Cases by County (per 100k)",fontsize=display_size)
plt.show()

del graph_df

Second pass at modeling State and expected values across the state. To get the Y axis to draw correctly, I'll have to move to subplots.
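
The banding used in this pass can be sketched on its own: each race gets a horizontal band tall enough for its largest bar (modeled or population) plus some padding, and the next race starts where that band ends. The `band_bottoms` helper and the toy frames are hypothetical, written only to show the offset arithmetic.

```python
import pandas as pd

# Hypothetical shares: counties as rows, races as columns.
modeled = pd.DataFrame({"Latino": [0.75, 0.375], "White": [0.25, 0.625]},
                       index=["Los Angeles", "Kern"])
population = pd.DataFrame({"Latino": [0.5, 0.5], "White": [0.5, 0.5]},
                          index=["Los Angeles", "Kern"])

def band_bottoms(modeled, population, pad=0.05):
    """Return the y offset where each race's band of bars should start."""
    # Band height: the taller of the two column maxima, plus padding.
    heights = pd.concat([modeled.max(), population.max()], axis=1).max(axis=1) + pad
    # Each band starts at the sum of the heights of the bands before it.
    return heights.cumsum().shift(1, fill_value=0.0)

bottoms = band_bottoms(modeled, population)
print(bottoms)
```

Because every band shares one Y axis, the tick labels only make sense for the bottom band, which is the limitation that pushes the final version toward subplots.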

In [None]:
display_size = 40

graph_df = modeled_percents_df.copy()

bottom = -1

fig,ax = plt.subplots(figsize=(display_size,display_size/2))

ax.bar(graph_df.index,graph_df[graph_df.columns[0]],label=graph_df.columns[0],color=race_color[graph_df.columns[0]])

ax.bar(pop_percent_df.index,pop_percent_df[graph_df.columns[0]],color='black',width=.25)

#//*** Set the graph bottom  to the greater of graph_df and pop_percent_df max
if graph_df[graph_df.columns[0]].max() > pop_percent_df[graph_df.columns[0]].max():
    
    bottom = graph_df[graph_df.columns[0]].max()+.05
    
else:
    bottom = pop_percent_df[graph_df.columns[0]].max()+.05
        

for x in range(1, (len(graph_df.columns))):
    
    ax.bar(graph_df.index,graph_df[graph_df.columns[x]],bottom=bottom,label=graph_df.columns[x],color=race_color[graph_df.columns[x]])
    ax.bar(graph_df.index,pop_percent_df[graph_df.columns[x]],color='black',bottom=bottom,width=.25)
    
    if graph_df[graph_df.columns[x]].max() > pop_percent_df[graph_df.columns[x]].max():
        bottom += graph_df[graph_df.columns[x]].max()+.05
    else:
        bottom += pop_percent_df[graph_df.columns[x]].max()+.05
        
ax.bar(pop_percent_df.index,pop_percent_df[graph_df.columns[0]],label='Population',color='black',width=.25)
    
#//*** Draw horizontal line. Draw it twice to get the yellow and black effect.
#//*** This technique looks visually good, but I can't get the legend to draw appropriately.
#ax.axhline(state_100k,color = "black", label="Statewide Cases 100k", linestyle = "-", lw=2)
#ax.axhline(state_100k,color = "yellow", linestyle = "--", lw=2)
        
plt.xticks(rotation=90,fontsize=display_size*.8)
plt.yticks(fontsize=display_size)

handles,labels = deduplicate_legend(ax)

plt.legend(fontsize=display_size*.5,loc="upper right")
plt.title(f"Racial COVID Distribution by county",fontsize=display_size)
#plt.ylabel("Total Cases by County (per 100k)",fontsize=display_size)
plt.show()

del graph_df

Final attempt at using a bar chart to represent modeled cases vs expected values. This took a long time to finally generate, and I'm proud of the work. I wish I had figured out Geopandas before I started this, though. The 

In [None]:
display_size = 40

graph_df = modeled_percents_df.copy()

#state_cases = modeled_total_counts_df.transpose().sum().sum()
#race_percent = modeled_total_counts_df.transpose().sum().loc[this_race] / state_cases

bottom = -1
col_count = len(graph_df.columns)
fig,axs = plt.subplots(col_count,figsize=(display_size,display_size/2),sharex=True)
for x in range(0,col_count):
    
    this_race = graph_df.columns[x]

    this_color = race_color[this_race]
    this_ax = col_count-1 - x
    axs[this_ax].bar(graph_df.index,graph_df[this_race],color=this_color,label=this_race)
    if x == col_count-1:
        axs[this_ax].set_title(f"Racial COVID Distribution by county\nModeled Cases vs Expected Cases",fontsize=display_size)

    axs[this_ax].bar(pop_percent_df.index,pop_percent_df[this_race],color='black',width=.15,label='Expected')
    

    #axs[this_ax].axhline(race_percent,color = "white", linestyle = "-", lw=2, alpha=.5)
    #axs[this_ax].axhline(race_percent,color = "black", label="State Avg", linestyle = "--", lw=2, alpha=.15)
    
    axs[this_ax].legend(fontsize=display_size*.3,loc="upper right")
    
    axs[this_ax].set_ylabel(this_race)
    axs[this_ax].yaxis.label.set_size(display_size*.5)
    
    #axes.xaxis.label.set_size(16)

        
plt.xticks(rotation=90,fontsize=display_size*.5)
#plt.yticks(fontsize=display_size)

#handles,labels = deduplicate_legend(ax)

#plt.legend(fontsize=display_size*.5,loc="upper right")
#plt.title(f"Racial COVID Distribution by county",fontsize=display_size)
#plt.ylabel("Total Cases by County (per 100k)",loc="center",fontsize=display_size)
fig.text(.085, 0.5, 'Modeled Cases', va='center', rotation='vertical',fontsize=display_size)
#fig.text(.5, -0.05, 'California Counties', va='center',fontsize=display_size)
fig.text(.17, -0.055, '(Most Populous Counties)', va='center',fontsize=display_size *.5)
fig.text(.265, -0.055, '---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------   (Least Populous Counties)', va='center',fontsize=display_size *.5)
fig.text(.267, -0.055, '--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------', va='center',fontsize=display_size *.5)

plt.show()

del graph_df

In [None]:
race_diff_percent = modeled_percents_df- pop_percent_df
race_diff_percent['NAME'] = list(race_diff_percent.index)
print(race_diff_percent.min())

In [None]:
print(race_color)
race_color_map = {}
#//*** Build handcrafted Color Maps
negative_colors = ["#8C8C8C","#9A9A9A","#A9A9A9","#B7B7B7","#C6C6C6","#D4D4D4","#E2E2E2","#F1F1F1"]
race_color_map['Latino'] =      ListedColormap( negative_colors + [ "#FFFFFF","#FBE8E5","#F7D1CA","#F3BAB0","#F0A395","#EC8C7B","#E87560","#E45E46","#E0472B"])
race_color_map['White'] =       ListedColormap( negative_colors + ["#FFFFFF","#FFF5E9","#FFEAD2","#FFE0BC","#FFD6A5","#FFCB8F","#FFC178","#FFB662","#FFAC4B"])
race_color_map['Asian'] =       ListedColormap( negative_colors + ["#FFFFFF","#E8F1DF","#D1E4BF","#BAD69F","#A3C880","#8BBA60","#74AD40","#5D9F20","#469100"])
race_color_map['Black'] =       ListedColormap( negative_colors + ["#FFFFFF","#DFF7FA","#C0F0F5","#A0E8F0","#81E0EB","#61D8E6","#41D1E1","#22C9DC","#02C1D7"])
race_color_map['Multiracial'] = ListedColormap( negative_colors + ["#FFFFFF","#ECF5FF","#D9EAFF","#C6E0FF","#B3D6FF","#9FCBFF","#8CC1FF","#79B6FF","#66ACFF"])
race_color_map['Hawaiian'] =    ListedColormap( negative_colors + ["#FFFFFF","#F1EDFD","#E3DBFB","#D5C9F9","#C8B7F7","#BAA5F4","#AC93F2","#9E81F0","#906FEE"])
race_color_map['Native'] =      ListedColormap( negative_colors + ["#FFFFFF","#FFF5FB","#FEEBF7","#FEE1F3","#FED7F0","#FDCCEC","#FDC2E8","#FCB8E4","#FCAEE0"])



In [None]:
#https://jcutrer.com/python/learn-geopandas-plotting-usmaps
#https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html
#https://towardsdatascience.com/lets-make-a-map-using-geopandas-pandas-and-matplotlib-to-make-a-chloropleth-map-dddc31c1983d
#https://geopandas.org/docs/user_guide/mapping.html

display_size = 20

#//**** Read the US county Shape File (covers every state)
#//**** Forward slash avoids the invalid '\c' escape and works on any OS
states = geopandas.read_file('maps/cb_2018_us_county_20m.shp')

#//*** Get just the California Shapes (these are all the counties)
#//*** A quick investigation of the dataframes revealed a STATEFP value of 06 is California
calif_geo = states[states['STATEFP'] == "06"]

#//*** Merge with the Race_diff_percent dataframe. These percentages are used to color the counties
calif_geo = calif_geo.merge(race_diff_percent,on='NAME')

#//*** Set Max and Min Scale for all gradients
#//*** Fixing to the greatest and minimum values, creates a standard scale across the graphs
vmin,vmax = calif_geo['White'].min(),calif_geo['Latino'].max()

plt.rc('axes', labelsize=display_size*2)
plt.rc('axes', titlesize=display_size*2)
race_axis = {
    'Latino' :   (0,0),
    'White' :    (0,1),
    'Asian' :    (0,2),
    'Black' :    (0,3),
    'Multiracial' : (1,0),
    'Hawaiian' : (1,1),
    'Native' :   (1,2)
}
fig, ax = plt.subplots(2,4, figsize=(display_size, display_size))

#fig,axs = plt.subplots(col_count,figsize=(display_size,display_size/2),sharex=True)
#axs[this_ax]
#//*** Draw a graph for each Race
for race in race_color_map.keys():
    ax[race_axis[race]].axis('off')
    ax[race_axis[race]].set_title(f"{race}",fontsize=display_size*.5)
    
    # add the colorbar to the figure
    #//*** The Colormaps are hand coded and stored in a dictionary
    cmap = race_color_map[race]
    
    
    # Create colorbar as a legend
    sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(vmin=vmin, vmax=vmax))

    # empty array for the data range
    sm._A = []


    cbar = fig.colorbar(sm,ax=ax[race_axis[race]],shrink=1)

    calif_geo.plot(column=race,cmap=cmap, ax=ax[race_axis[race]],linewidth=0.8,edgecolor='0.8')
    
fig.delaxes(ax[1,3]) #The indexing is zero-based here
fig.tight_layout()
plt.subplots_adjust(left=0,
                    bottom=.65, 
                    top=1, 
                    wspace=-.5, 
                    hspace=.1)
#plt.title("",fontdict={'verticalalignment': 'baseline'})
fig.text(.5, 1.025, 'Racial Disparity (by County)', va='center',fontsize=display_size)
plt.show()

plt.rcParams.update(plt.rcParamsDefault)

In [None]:
#https://jcutrer.com/python/learn-geopandas-plotting-usmaps
#https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html
#https://towardsdatascience.com/lets-make-a-map-using-geopandas-pandas-and-matplotlib-to-make-a-chloropleth-map-dddc31c1983d
#https://geopandas.org/docs/user_guide/mapping.html

display_size = 20

#//**** Read the US county Shape File (covers every state)
#//**** Forward slash avoids the invalid '\c' escape and works on any OS
states = geopandas.read_file('maps/cb_2018_us_county_20m.shp')

#//*** Get just the California Shapes (these are all the counties)
#//*** A quick investigation of the dataframes revealed a STATEFP value of 06 is California
calif_geo = states[states['STATEFP'] == "06"]

#//*** Merge with the Race_diff_percent dataframe. These percentages are used to color the counties
calif_geo = calif_geo.merge(race_diff_percent,on='NAME')

#//*** Set Max and Min Scale for all gradients
#//*** Fixing to the greatest and minimum values, creates a standard scale across the graphs
vmin,vmax = calif_geo['White'].min(),calif_geo['Latino'].max()

plt.rc('axes', labelsize=display_size*2)
plt.rc('axes', titlesize=display_size*2)
    
#//*** Draw a graph for each Race
for race in race_color_map.keys():
    
    fig, ax = plt.subplots(1, figsize=(display_size/2, display_size/2))

    ax.axis('off')
    ax.set_title(f"{race} Disparity (by County)",fontsize=display_size*.5)
    
    # add the colorbar to the figure
    #//*** The Colormaps are hand coded and stored in a dictionary
    cmap = race_color_map[race]
    
    if race == 'White':
        cmap = ListedColormap( negative_colors + ["#FFFFFF"])

    if race == 'Asian':
        cmap = ListedColormap( negative_colors + ["#FFFFFF","#E8F1DF","#D1E4BF"])
    
    vmin,vmax = calif_geo[race].min(),calif_geo[race].max()

    # Create colorbar as a legend
    sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(vmin=vmin, vmax=vmax))

    # empty array for the data range
    sm.set_array([])


    #//*** Pass ax explicitly so the colorbar knows which axes to take space from
    cbar = fig.colorbar(sm,ax=ax,shrink=.8)

    calif_geo.plot(column=race,cmap=cmap, ax=ax,linewidth=0.8,edgecolor='0.8')

plt.rcParams.update(plt.rcParamsDefault)

In [None]:


#https://jcutrer.com/python/learn-geopandas-plotting-usmaps
#https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html
#https://towardsdatascience.com/lets-make-a-map-using-geopandas-pandas-and-matplotlib-to-make-a-chloropleth-map-dddc31c1983d
#https://geopandas.org/docs/user_guide/mapping.html


display_size = 20

tdf =  pd.DataFrame()
tdf['NAME']=ts.index
tdf['total_100k'] = ts.values 

#//**** Read the county-level Shape File for the whole US
#//*** Forward slashes avoid the invalid "\c" escape sequence and work on every OS
states = geopandas.read_file('maps/cb_2018_us_county_20m.shp')


#//*** Get just the California Shapes (these are all the counties)
#//*** A quick investigation of the dataframes revealed a STATEFP value of 06 is California
calif_geo = states[states['STATEFP'] == "06"]


#//*** Merge with the cumulative cases (per 100k) dataframe. These values are used to color the counties
calif_geo = calif_geo.merge(tdf,on='NAME')
#//*** Sanity check: the row count shows how many counties survived the merge on NAME
print(len(calif_geo))
#//*** Set Max and Min Scale for the gradient
#//*** Assumes ts is sorted in descending order: the first value is the max and the second-to-last value is used as the min
vmin,vmax = ts.values[len(ts.values)-2],ts.values[0]

#plt.rc('axes', labelsize=display_size*2)
#plt.rc('axes', titlesize=display_size*2)
    
    
fig, ax = plt.subplots(1, figsize=(display_size/2, display_size/2))

ax.axis('off')
#ax.set_title(f"{race} Disparity (by County)",fontsize=display_size*.5)

# add the colorbar to the figure
#//*** The Colormap is hand coded. Only the second (20-step) assignment below takes effect
cmap =  ListedColormap( ["#FFFFFF","#FBE8E5","#F7D1CA","#F3BAB0","#F0A395","#EC8C7B","#E87560","#E45E46","#E0472B"])
cmap =  ListedColormap( ["#FFFFFF","#FDF5F4","#FCECE9","#FAE2DE","#F8D8D2","#F7CFC7","#F5C5BC","#F4BBB1","#F2B2A6","#F0A89B","#EF9E8F","#ED9484","#EB8B79","#EA816E","#E87763","#E76E58","#E5644C","#E35A41","#E25136","#E0472B"])


# Create colorbar as a legend
sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(vmin=vmin, vmax=vmax))

# empty array for the data range
sm.set_array([])


#//*** Pass ax explicitly so the colorbar knows which axes to take space from
cbar = fig.colorbar(sm,ax=ax,shrink=.7)

plt.title("Cumulative COVID Prevalence",fontsize=display_size)
fig.text(.78, .2, 'Cumulative Cases (per 100k)', va='center' ,fontsize=display_size/2)
calif_geo.plot(column='total_100k',cmap=cmap, ax=ax,linewidth=0.8,edgecolor='.8')

plt.rcParams.update(plt.rcParamsDefault)
del tdf
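
The `print(len(calif_geo))` in the cell above is a quick way to see whether the merge on `NAME` kept every county. A slightly more explicit check is sketched below. It uses plain pandas frames as stand-ins for the shapefile and `tdf`, and the county rows and values are made up for illustration. The `indicator=True` option of `merge` lists the names that failed to match on either side.

```python
import pandas as pd

# Hypothetical stand-ins: `shapes` plays the role of calif_geo, `stats` the role of tdf.
shapes = pd.DataFrame({"NAME": ["Alpine", "Los Angeles", "Madera"]})
stats = pd.DataFrame({"NAME": ["Los Angeles", "Madera", "Modoc"],
                      "total_100k": [3.0, 2.0, 1.0]})

# An outer merge with indicator=True adds a `_merge` column recording whether
# each row came from the left frame only, the right frame only, or both.
checked = shapes.merge(stats, on="NAME", how="outer", indicator=True)

missing_stats = sorted(checked.loc[checked["_merge"] == "left_only", "NAME"])
missing_shapes = sorted(checked.loc[checked["_merge"] == "right_only", "NAME"])

print("Counties with a shape but no data:", missing_stats)
print("Counties with data but no shape:", missing_shapes)
```

If both lists come back empty, every county name lined up between the two sources.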

In [None]:
race_solid_color_map = {}
#//*** Build handcrafted Color Maps
negative_colors = ["#8C8C8C","#9A9A9A","#A9A9A9","#B7B7B7","#C6C6C6","#D4D4D4","#E2E2E2","#F1F1F1"]
race_solid_color_map['Latino'] =      ListedColormap( ["#FFFFFF","#FDF5F4","#FCECE9","#FAE2DE","#F8D8D2","#F7CFC7","#F5C5BC","#F4BBB1","#F2B2A6","#F0A89B","#EF9E8F","#ED9484","#EB8B79","#EA816E","#E87763","#E76E58","#E5644C","#E35A41","#E25136","#E0472B"])
race_solid_color_map['White'] =       ListedColormap( ["#FFFFFF","#FFFBF6","#FFF6EC","#FFF2E3","#FFEED9","#FFE9D0","#FFE5C6","#FFE0BD","#FFDCB3","#FFD8AA","#FFD3A0","#FFCF97","#FFCB8D","#FFC684","#FFC27A","#FFBD71","#FFB967","#FFB55E","#FFB054","#FFAC4B"])
race_solid_color_map['Asian'] =       ListedColormap( ["#FFFFFF","#F5F9F2","#ECF3E4","#E2EED7","#D8E8C9","#CEE2BC","#C5DCAE","#BBD6A1","#B1D194","#A7CB86","#9EC579","#94BF6B","#8ABA5E","#80B451","#77AE43","#6DA836","#63A228","#599D1B","#50970D","#469100"])
race_solid_color_map['Black'] =       ListedColormap( ["#FFFFFF","#F2FCFD","#E4F8FB","#D7F5F9","#CAF2F7","#BCEFF4","#AFEBF2","#A2E8F0","#94E5EE","#87E2EC","#7ADEEA","#6DDBE8","#5FD8E6","#52D5E4","#45D1E2","#37CEDF","#2ACBDD","#1DC8DB","#0FC4D9","#02C1D7"])
race_solid_color_map['Multiracial'] = ListedColormap( ["#FFFFFF","#F7FBFF","#EFF6FF","#E7F2FF","#DFEEFF","#D7E9FF","#CFE5FF","#C7E0FF","#BFDCFF","#B7D8FF","#AED3FF","#A6CFFF","#9ECBFF","#96C6FF","#8EC2FF","#86BDFF","#7EB9FF","#76B5FF","#6EB0FF","#66ACFF"])
race_solid_color_map['Hawaiian'] =    ListedColormap( ["#FFFFFF","#F9F7FE","#F3F0FD","#EDE8FC","#E8E1FB","#E2D9FB","#DCD2FA","#D6CAF9","#D0C2F8","#CABBF7","#C5B3F6","#BFACF5","#B9A4F4","#B39CF3","#AD95F2","#A78DF2","#A286F1","#9C7EF0","#9677EF","#906FEE"])
race_solid_color_map['Native'] =      ListedColormap( ["#FFFFFF","#FFFBFD","#FFF6FC","#FFF2FA","#FEEEF8","#FEEAF7","#FEE5F5","#FEE1F4","#FEDDF2","#FED9F0","#FDD4EF","#FDD0ED","#FDCCEB","#FDC8EA","#FDC3E8","#FDBFE7","#FCBBE5","#FCB7E3","#FCB2E2","#FCAEE0"])
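
The 20-step lists above appear to be evenly spaced blends from white to a single target color. As a sketch of how such a list could be generated rather than typed by hand, the helper below linearly interpolates each RGB channel. The name `hex_gradient` is hypothetical and not part of this notebook.

```python
def hex_gradient(start_hex, end_hex, steps):
    """Return `steps` hex colors evenly spaced from start_hex to end_hex (inclusive)."""
    def to_rgb(hex_color):
        hex_color = hex_color.lstrip("#")
        return [int(hex_color[i:i + 2], 16) for i in (0, 2, 4)]

    start, end = to_rgb(start_hex), to_rgb(end_hex)
    colors = []
    for step in range(steps):
        # fraction runs from 0.0 (start color) to 1.0 (end color); steps must be >= 2
        fraction = step / (steps - 1)
        channels = [round(s + (e - s) * fraction) for s, e in zip(start, end)]
        colors.append("#" + "".join(f"{c:02X}" for c in channels))
    return colors

# White to the final Latino color, in 20 steps
latino_gradient = hex_gradient("#FFFFFF", "#E0472B", 20)
print(latino_gradient[:3])
```

The first few generated entries match the start of the hand-coded Latino list, so the result could presumably be passed to `ListedColormap` in the same way.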

In [None]:
print(modeled_percents_df)

In [None]:
#https://jcutrer.com/python/learn-geopandas-plotting-usmaps
#https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html
#https://towardsdatascience.com/lets-make-a-map-using-geopandas-pandas-and-matplotlib-to-make-a-chloropleth-map-dddc31c1983d
#https://geopandas.org/docs/user_guide/mapping.html

display_size = 20


tdf = modeled_percents_df.copy()
tdf['NAME'] = tdf.index

#//**** Read the county-level Shape File for the whole US
#//*** Forward slashes avoid the invalid "\c" escape sequence and work on every OS
states = geopandas.read_file('maps/cb_2018_us_county_20m.shp')

#//*** Get just the California Shapes (these are all the counties)
#//*** A quick investigation of the dataframes revealed a STATEFP value of 06 is California
calif_geo = states[states['STATEFP'] == "06"]



#//*** Merge with the modeled percentages dataframe. These percentages are used to color the counties
calif_geo = calif_geo.merge(tdf,on='NAME')

#//*** Set Max and Min Scale for all gradients
#//*** Fixing to the greatest and minimum values, creates a standard scale across the graphs
vmin,vmax = modeled_percents_df.min().min(),modeled_percents_df.max().max()

plt.rc('axes', labelsize=display_size*2)
plt.rc('axes', titlesize=display_size*2)
race_axis = {
    'Latino' :   (0,0),
    'White' :    (0,1),
    'Asian' :    (0,2),
    'Black' :    (0,3),
    'Multiracial' : (1,0),
    'Hawaiian' : (1,1),
    'Native' :   (1,2)
}
fig, ax = plt.subplots(2,4, figsize=(display_size, display_size))

#fig,axs = plt.subplots(col_count,figsize=(display_size,display_size/2),sharex=True)
#axs[this_ax]
#//*** Draw a graph for each Race
for race in race_color_map.keys():
    ax[race_axis[race]].axis('off')
    ax[race_axis[race]].set_title(f"{race}",fontsize=display_size*.5)
    
    # add the colorbar to the figure
    #//*** The Colormaps are hand coded and stored in a dictionary
    cmap = race_solid_color_map[race]
    
    #//*** Compute this race's range before building the colorbar so the legend matches the map
    vmin,vmax = calif_geo[race].min(),calif_geo[race].max()
    
    # Create colorbar as a legend
    sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(vmin=vmin, vmax=vmax))

    # empty array for the data range
    sm.set_array([])


    cbar = fig.colorbar(sm,ax=ax[race_axis[race]],shrink=1)
    
    calif_geo.plot(column=race,cmap=cmap, ax=ax[race_axis[race]],linewidth=0.8,edgecolor='0.8')
    
fig.delaxes(ax[1,3]) #The indexing is zero-based here
fig.tight_layout()
plt.subplots_adjust(left=0,
                    bottom=.65, 
                    top=1, 
                    wspace=-.5, 
                    hspace=.1)
#plt.title("",fontdict={'verticalalignment': 'baseline'})
fig.text(.5, 1.015, 'Modeled COVID Racial Prevalence', va='center',fontsize=display_size)
#fig.text(.5, 1.025, '(% of county racial population)', va='center',fontsize=display_size*.75)
plt.show()

plt.rcParams.update(plt.rcParamsDefault)

In [None]:
#https://jcutrer.com/python/learn-geopandas-plotting-usmaps
#https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html
#https://towardsdatascience.com/lets-make-a-map-using-geopandas-pandas-and-matplotlib-to-make-a-chloropleth-map-dddc31c1983d
#https://geopandas.org/docs/user_guide/mapping.html

display_size = 20


tdf = pop_attrib_df.copy()

for col in tdf.columns[2:]:
    tdf[col] = tdf[col]/tdf['population']

tdf['NAME'] = tdf['county']

#//**** Read the county-level Shape File for the whole US
#//*** Forward slashes avoid the invalid "\c" escape sequence and work on every OS
states = geopandas.read_file('maps/cb_2018_us_county_20m.shp')

#//*** Get just the California Shapes (these are all the counties)
#//*** A quick investigation of the dataframes revealed a STATEFP value of 06 is California
calif_geo = states[states['STATEFP'] == "06"]



#//*** Merge with the population share dataframe. These percentages are used to color the counties
calif_geo = calif_geo.merge(tdf,on='NAME')

#//*** Set Max and Min Scale for all gradients
#//*** Fixing to the greatest and minimum values, creates a standard scale across the graphs

plt.rc('axes', labelsize=display_size*2)
plt.rc('axes', titlesize=display_size*2)
race_axis = {
    'Latino' :   (0,0),
    'White' :    (0,1),
    'Asian' :    (0,2),
    'Black' :    (0,3),
    'Multiracial' : (1,0),
    'Hawaiian' : (1,1),
    'Native' :   (1,2)
}
fig, ax = plt.subplots(2,4, figsize=(display_size, display_size))

#fig,axs = plt.subplots(col_count,figsize=(display_size,display_size/2),sharex=True)
#axs[this_ax]
#//*** Draw a graph for each Race
for race in race_color_map.keys():
    ax[race_axis[race]].axis('off')
    ax[race_axis[race]].set_title(f"{race}",fontsize=display_size*.5)
    
    # add the colorbar to the figure
    #//*** The Colormaps are hand coded and stored in a dictionary
    cmap = race_solid_color_map[race]
    
    vmin,vmax = tdf[race].min(),tdf[race].max()
    
    # Create colorbar as a legend
    sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(vmin=vmin, vmax=vmax))

    # empty array for the data range
    sm._A = []


    cbar = fig.colorbar(sm,ax=ax[race_axis[race]],shrink=1)

    calif_geo.plot(column=race,cmap=cmap, ax=ax[race_axis[race]],linewidth=0.8,edgecolor='0.8')
    
fig.delaxes(ax[1,3]) #The indexing is zero-based here
fig.tight_layout()
plt.subplots_adjust(left=0,
                    bottom=.65, 
                    top=1, 
                    wspace=-.5, 
                    hspace=.1)
#plt.title("",fontdict={'verticalalignment': 'baseline'})
fig.text(.5, 1.025, 'Population Distribution by Race', va='center',fontsize=display_size)
plt.show()

plt.rcParams.update(plt.rcParamsDefault)

del tdf

In [None]:
#https://jcutrer.com/python/learn-geopandas-plotting-usmaps
#https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html
#https://towardsdatascience.com/lets-make-a-map-using-geopandas-pandas-and-matplotlib-to-make-a-chloropleth-map-dddc31c1983d
#https://geopandas.org/docs/user_guide/mapping.html

display_size = 20


tdf = pop_attrib_df.copy()

for col in tdf.columns[2:]:
    tdf[col] = tdf[col]/tdf['population']

tdf['NAME'] = tdf['county']

#//**** Read the county-level Shape File for the whole US
#//*** Forward slashes avoid the invalid "\c" escape sequence and work on every OS
states = geopandas.read_file('maps/cb_2018_us_county_20m.shp')

#//*** Get just the California Shapes (these are all the counties)
#//*** A quick investigation of the dataframes revealed a STATEFP value of 06 is California
calif_geo = states[states['STATEFP'] == "06"]



#//*** Merge with the population share dataframe. These percentages are used to color the counties
calif_geo = calif_geo.merge(tdf,on='NAME')

#//*** Set Max and Min Scale for all gradients
#//*** Fixing to the greatest and minimum values, creates a standard scale across the graphs

plt.rc('axes', labelsize=display_size*2)
plt.rc('axes', titlesize=display_size*2)
race_axis = {
    'Latino' :   (0,0),
    'White' :    (0,1),
    'Asian' :    (0,2),
    'Black' :    (0,3),
    'Multiracial' : (1,0),
    'Hawaiian' : (1,1),
    'Native' :   (1,2)
}
fig, ax = plt.subplots(figsize=(display_size, display_size))

#fig,axs = plt.subplots(col_count,figsize=(display_size,display_size/2),sharex=True)
#axs[this_ax]
#//*** Draw a graph for each Race
for race in ['Latino']:
    ax.axis('off')
    #ax.set_title(f"{race}",fontsize=display_size*.5)
    
    # add the colorbar to the figure
    #//*** The Colormaps are hand coded and stored in a dictionary
    cmap = race_solid_color_map[race]
    
    vmin,vmax = tdf[race].min(),tdf[race].max()
    
    # Create colorbar as a legend
    sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(vmin=vmin, vmax=vmax))

    # empty array for the data range
    sm._A = []


    cbar = fig.colorbar(sm,ax=ax,shrink=1)

    calif_geo.plot(column=race,cmap=cmap, ax=ax,linewidth=0.8,edgecolor='0.8')
    

fig.tight_layout()
plt.subplots_adjust(left=0,
                    bottom=.65, 
                    top=1, 
                    wspace=-.5, 
                    hspace=.1)
#plt.title("",fontdict={'verticalalignment': 'baseline'})
fig.text(.5, 1.025, 'Latino Population Distribution', va='center',fontsize=display_size)
plt.show()

plt.rcParams.update(plt.rcParamsDefault)

del tdf

In [None]:
"""
county = 'Los Angeles'
for county in cors_df['county']:   
    
    if county not in high_covid_counties:
        continue
    #//*** Compare the expected vs scaled modeled values
    expected_la = ca_model_cases_df[ca_model_cases_df['county']== county]
    
    
    modeled_la = modified_race_total[['date','race',county]]
    #modeled_la = v2_ca_case_race_total_df[['date','race',county]]
    
    modeled_la = ca_case_race_total_df[['date','race',county]]

    for race in ['Latino','White',"Asian","Black"]:
        print("=======================================")
        print("TESTING visualization Model vs Expected")
        print("=======================================")

        
        loop_model_race = modeled_la[modeled_la['race']==race]

        loop_model_expected = expected_la[['date','county',race,f"case_{race}",f"adj_{race}","cases"]]
        
        #//*** Skip Columns where the Max cases is less than 20. These values are a bit unreliable
        if loop_model_expected[f"case_{race}"].max() < 20:
            continue
        #loop_model_race = modeled_la[modeled_la['race']==race]

        #loop_model_expected = expected_la[['date','county',race,f"case_{race}",f"adj_{race}","cases"]]
        #print("Correlation: ", county, " ",race, " ",pearsonr(loop_model_race[county],loop_model_expected['cases']))
        
        #//*** Cases per 100k should be relatively similar in values.
        display_size = 20
        fig,ax = plt.subplots(figsize=(20,10))

        ax.plot(loop_model_race['date'],loop_model_race[county].rolling(5).mean(),label=f"{race}_modeled")
        ax.plot(loop_model_expected['date'],loop_model_expected[f"case_{race}"].rolling(5).mean(),label=f"{race}_expected")
        ax.plot(loop_model_expected['date'],loop_model_expected[f"cases"].rolling(5).mean(),label=f"total")

        plt.xticks(rotation=30,fontsize=display_size)
        plt.yticks(fontsize=display_size)
        handles,labels = deduplicate_legend(ax)
        plt.legend(fontsize=display_size,loc='upper left')
        plt.title(f"{county}\n Modeled/Expected Correlation: {(pearsonr(loop_model_race[county],loop_model_expected['cases']))}",fontsize=display_size)
        plt.ylabel("New cases per day",fontsize=display_size)
        plt.show()

        

#//*** Temp variable cleanup
del expected_la
del modeled_la
del loop_model_expected
del loop_model_race
"""
print()

In [None]:
#//*** Merge Population Attributes with COVID County info
#//*** Only Merge if we haven't merged yet. I got 99 iPython problems but this aint one.
if "Latino" not in ca_covid_df.columns:
    ca_covid_df = pd.merge(ca_covid_df,pop_attrib_df,how="left",on=['county'])


#//*** Build per 100k Stats
ca_100k_df = ca_covid_df.copy()

#//*** Define Population Columns to convert to 100k. These Columns shouldn't change. Trying to set up a flexible
#//*** system where I can add other attributes later if needed
population_cols = [ 'population','Latino', 'White', 'Asian', 'Black', 'Native', 'Hawaiian','Multiracial' ]

#//*** Convert Population values to 100k units, i.e. divide by 100,000
for col in population_cols:
    ca_100k_df[col] = ca_100k_df[col]/100000



#//*** Convert cases, deaths, test to per 100k units
attrib_cols = ['date','county']

#//*** Ignore values in attrib_cols, and population_cols
#//*** Convert remaining attributes to values per 100,000.
#//*** This method makes it easier to change the 100k attributes later.
for col in ca_100k_df.columns:
    if col not in attrib_cols and col not in population_cols:
        #//*** Convert column to per 100k value, i.e. the column's value divided by the population in 100k units
        ca_100k_df[col] = ca_100k_df[col]/ca_100k_df['population'] 


#//*** Check our Work.
#//*** Cases per 100k should be relatively similar in values.
display_size = 40
fig,ax = plt.subplots(figsize=(40,20))

for county in ca_100k_df['county'].unique():
    
    loop_df = ca_100k_df[ca_100k_df['county'] ==  county]
    ax.plot(loop_df['date'],loop_df['cases'].rolling(5).mean(),label=county)


#//*** Tick styling only needs to be set once, after all counties are drawn
plt.xticks(rotation=30,fontsize=display_size)
plt.yticks(fontsize=display_size)
handles,labels = deduplicate_legend(ax)
plt.legend(fontsize=display_size*.25,loc='upper left')
plt.title("Scaled County Data (per 100k)",fontsize=display_size)
#plt.ylabel("Total Cases by County (millions)",fontsize=display_size)
plt.show()
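
The scaling in the cell above divides each count by the county population expressed in units of 100,000. A minimal worked example follows, with made-up numbers: the two counties and their figures are illustrative and not taken from the data set.

```python
import pandas as pd

# Illustrative values only: a large and a small county with the same infection rate.
demo_df = pd.DataFrame({
    "county": ["Big", "Small"],
    "population": [10_000_000, 50_000],
    "cases": [20_000, 100],
})

# Express population in units of 100k, then divide the raw counts by it.
demo_df["population_100k"] = demo_df["population"] / 100_000
demo_df["cases_100k"] = demo_df["cases"] / demo_df["population_100k"]

print(demo_df[["county", "cases", "cases_100k"]])
```

Both rows come out at 200 cases per 100k even though the raw counts differ by a factor of 200, which is why the per-100k curves should sit in a similar range across counties.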



In [None]:
#//*** Scale the State Numbers


display_size = 15
fig,ax = plt.subplots(figsize=(display_size,display_size/2))
for race in ['Latino',"White","Asian","Black","Native","Multiracial"]:
    if race == 'Total':
        continue
    
    loop_df = ca_race_df[ca_race_df['race'] ==  race]
    
    if race in ['Latino']:
        ax.plot(loop_df['date'],loop_df['cases_100k'].rolling(7).mean(),linewidth=5,label=race,color=race_color[race])
    else:
        ax.plot(loop_df['date'],loop_df['cases_100k'].rolling(7).mean(),label=race,color=race_color[race])

#//*** Tick styling only needs to be set once, after all races are drawn
plt.xticks(rotation=30,fontsize=display_size)
plt.yticks(fontsize=display_size)
handles,labels = deduplicate_legend(ax)
plt.legend(fontsize=display_size*.5,loc='upper left')
plt.title("Statewide Scaled Race Data\n New Cases per Day (per 100k)",fontsize=display_size)
#plt.ylabel("Total Cases by County (millions)",fontsize=display_size)
plt.show()

In [None]:
print(ca_cases_broad_df['Madera'].sum())

#//*** Spot check the modeled totals for two counties, by race
for race, race_df in ca_case_race_total_df.groupby('race'):
    print(race)
    print(race_df[['Los Angeles','Madera']])


In [None]:
#//*** Assigns a color from a palette list to an item (e.g. a county).
#//*** Returns the updated item -> color dictionary.
def assign_color(input_item, input_cd,input_palette):
    #//*** Check if item already exists, if so, return input_cd unchanged
    if input_item in input_cd.keys():
        return input_cd
    
    #//*** input_item needs a Color. Walk down the input_palette until an unused color is found
    for color in input_palette:
        if color not in input_cd.values():
            input_cd[input_item] = color
            return input_cd
    #//*** Every palette color is already taken: warn and return the dictionary unchanged
    print("UH OH ran out of colors!!!")
    print(f"Item: {input_item}")
    print(input_cd)
    return input_cd
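
As a usage sketch, the helper can be called once per county so that a county keeps the same color across plots. The function is repeated below only so the example runs on its own. The three-color palette and the list of counties are placeholders.

```python
def assign_color(input_item, input_cd, input_palette):
    # Same logic as the helper above: keep an existing color, else take the first unused one.
    if input_item in input_cd:
        return input_cd
    for color in input_palette:
        if color not in input_cd.values():
            input_cd[input_item] = color
            return input_cd
    print("UH OH ran out of colors!!!")
    return input_cd

demo_palette = ["#557788", "#e12925", "#44af0e"]  # placeholder palette
county_colors = {}

# "Los Angeles" appears twice on purpose: the second call must not change its color.
for county in ["Los Angeles", "Madera", "Los Angeles", "Alpine"]:
    county_colors = assign_color(county, county_colors, demo_palette)

print(county_colors)
```

Because the dictionary is filled in first-come order, the mapping depends on the order in which counties are first seen.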

In [None]:
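#//*** Color Choices: Tucking these aside for later use
#//*** Combine these with a dictionary to create color continuity across multiple visualizations.
#//*** Each name is assigned more than once below; only the last assignment of color_palette and county_color_palette takes effect.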
color_palette = ["#c6eaff","#caa669","#14bae2","#f7cd89","#98a9e7","#e2ffb7","#cb9ec2","#77dcb5","#ffc5b7","#40bdba","#fff4b0","#74d0ff","#e4da8d","#7ceeff","#d0e195","#b7ab8c","#fcffdb","#83b88d","#ffe2c0","#abc37a"]
color_palette = ["#557788","#e12925","#44af0e","#7834c0","#726d00","#130c6d","#004e12","#f7007d","#017878","#950089","#00a3d7","#4b000e","#0063c2","#f07478","#013b75","#cf81b8","#212238","#af87e7","#320f49","#9c91db"]
county_color_palette = ["#b4a23b","#4457ca","#9ec535","#a651cb","#59ce59","#6a77f0","#52a325","#b93d9b","#36b25c","#e374d4","#c1c035","#7452af","#96ae3a","#a484e2","#89c466","#e54790","#57c888","#dd3d60","#5bd6c4","#dd4e2d","#45ccdf","#bd3738","#4cb998","#b13a6c","#368433","#588feb","#dcad3d","#4763af","#e49132","#4aa5d4","#c86321","#7695d3","#769233","#925898","#54701c","#c893d6","#3d7b44","#e084ac","#65a76b","#965179","#296437","#e57f5f","#31a8ad","#a44b2c","#368d71","#df7f81","#226a4d","#96465f","#b5b671","#68649c","#ad772e","#a34f52","#758348","#d8a06e","#505e25","#8e5e31","#8e8033","#695f1b"]
county_color_palette = ["#96465f","#dd3d60","#df7f81","#a34f52","#bd3738","#dd4e2d","#e57f5f","#a44b2c","#c86321","#8e5e31","#d8a06e","#e49132","#ad772e","#dcad3d","#b4a23b","#8e8033","#695f1b","#c1c035","#b5b671","#96ae3a","#505e25","#758348","#9ec535","#769233","#54701c","#52a325","#89c466","#368433","#59ce59","#3d7b44","#65a76b","#296437","#36b25c","#57c888","#226a4d","#368d71","#4cb998","#5bd6c4","#31a8ad","#45ccdf","#4aa5d4","#7695d3","#588feb","#4763af","#6a77f0","#4457ca","#68649c","#a484e2","#7452af","#a651cb","#c893d6","#925898","#e374d4","#b93d9b","#965179","#e084ac","#e54790","#b13a6c"]
county_color_palette = ["#226a4d","#31a8ad","#68649c","#758348","#505e25","#368d71","#4aa5d4","#965179","#7695d3","#45ccdf","#296437","#96465f","#8e5e31","#b5b671","#d8a06e","#a34f52","#5bd6c4","#695f1b","#4cb998","#df7f81","#3d7b44","#e084ac","#c893d6","#65a76b","#8e8033","#925898","#4763af","#54701c","#ad772e","#a44b2c","#e57f5f","#769233","#57c888","#b13a6c","#588feb","#a484e2","#b4a23b","#368433","#89c466","#7452af","#96ae3a","#dcad3d","#bd3738","#36b25c","#e374d4","#c86321","#b93d9b","#e49132","#dd3d60","#e54790","#c1c035","#4457ca","#6a77f0","#52a325","#9ec535","#dd4e2d","#a651cb","#59ce59"]

In [None]:


display_size = 40

fig,ax = plt.subplots(figsize=(display_size,display_size/2))

ax.bar(ts.index,ts)

#//*** Draw horizontal line. Draw it twice to get the yellow and black effect. 
#//*** This technique looks visually good, but I can't get the legend to draw appropriately.
ax.axhline(state_100k,color = "black", label="Statewide Cases 100k", linestyle = "-", lw=2)
ax.axhline(state_100k,color = "yellow", linestyle = "--", lw=2)
        
plt.xticks(rotation=90,fontsize=display_size*.75)
plt.yticks(fontsize=display_size*.75)

handles,labels = deduplicate_legend(ax)

plt.legend(fontsize=display_size,loc='upper right')
plt.title(f"Total Covid cases for all counties.\nper 100k",fontsize=display_size)
plt.ylabel("Total Cases by County (per 100k)",fontsize=display_size)
plt.show()

In [None]:
#//*** Look at total County COVID numbers by county rates per 100k.


#//*** Get the last data


#for county in






#last_day_df = rd[race_list[0]][rd[race_list[0]]['date'] == last_date]

In [None]:
"""
#//*** Convert Racial percentages into estimated COVID cases per county
ca_case_race_total_df = pd.DataFrame()

if 'ca_case_race_total_df' not in df_list:
    df_list.append('ca_case_race_total_df')
    
case_county_index = 4
x_column = ca_case_race_percent_df.columns[case_county_index:]
    
#//*** Loop through each day in ca_case_race_percent_df
for group in ca_case_race_percent_df.groupby('date'):
    
    loop_df = group[1].copy()
    #print(loop_df[x_column])
    
    this_day = ca_cases_broad_df[ca_cases_broad_df['date']==group[0]].copy()
    
    loop_df['intercept'] = this_day['total'].values[0]
    
    #//*** Keep just the county columns
    this_day = this_day[this_day.columns[2:]]
    
    loop_df[x_column]= (loop_df[x_column]*this_day.values).astype(int)
    
    
    ca_case_race_total_df = pd.concat([ca_case_race_total_df,loop_df])

print(ca_case_race_total_df)
#//*** Temp Variable Cleanup
del loop_df
del this_day
"""
print()


In [None]:
"""
#//*** Version 1: Generate Coefficients from the model

#//*** This works on one variable at a time. Keeping this for reference. 
#//*** Will use builder function.

#//***************************************************************************************************************************************************************************
#//*** Generate Individual Race Coefficients for each county per day.
#//*** The sum of racial coefficients should equal the state coefficient for the county. 
#//***************************************************************************************************************************************************************************
from sklearn.preprocessing import StandardScaler

#//*** Build the coefficients for the entire data set. Each day will calculate the coefficients from the previous 30 days. The first 30 days will use one set of coefficients. The rest will use
#//*** the 30 days preceding the current day to generate the coefficients. This will be an overfitted solution, which is exactly what we are going for.
start_time = time.time()
#//*** Abstract Dataframes to Left and Right for reusability
left_df = ca_races_broad_df
right_df = ca_cases_broad_df

#//*** Get Coefficients for counties predicting the State total Case values
output_df = pd.DataFrame()

#//*** County Columns start at Index 3
x_col_index = 3

#//*** Target (dependent variable) Column Index. Statewide race numbers begin at column 1
y_col_index = 1

#//*** Sample size
#modeling_days = 30


print("Calculating racial coefficients...")
#//*** Combining with car_race_df and Latino value. It's not strictly needed, but the additional column will make combining the dataframes later, easier.
#//*** Reusing Code: This loop only needs to run once
for race in left_df.columns[1:]:
    
    #//*** Build model for the first 30 days, combines the race from ca_race (which is only needed as an extra field, to evenly space the columms)
    model_df = left_df[['date',race]].merge(right_df, left_on='date', right_on='date')[:30]
    
    #//*************************
    #//*** BEGIN regression
    #//*************************
    
    
    #//**** Build the x Values - Independent Variables. These will be all the counties which start at the column index 
    #//*** The X Value is the index where the attributes start, there are 58 of them :)
    #x_column = model_df.columns[x_col_index:]
    
    #//*** Generate ordered list of Counties by current race population.
    #//*** The assumption is the counties with higher populations will exert a greater weight on the model.
    #//*** Otherwise the tiny county of Alpine gets way overly represented
    race_columns = (pop_attrib_df[['county',race]].sort_values(race,ascending=False)['county'])

    #print(model_df[race_columns])
    #//*** Build the X attributes using the x_column. These are separated for readability and modularity
    x_model = model_df[race_columns]

    #print(x_model)
    #//*** Build the dependent variable using the Index Column defined above as y_col_index.
    y_column = model_df.columns[y_col_index]
    
    #//*** Build the Y model using the y_column attribute. This is less readable and intuitive, but it lets the columns be 
    #//*** easily assigned at the top of this section
    y_model = model_df[y_column]
    
    #//*** Define the Linear Model
    regr = linear_model.LinearRegression(n_jobs=-1)
    
    #//*** Make Regression Magic
    regr.fit(x_model, y_model)
    
    #//*** Apply the regression coefficients
    model_df[race_columns] = (model_df[race_columns])*regr.coef_
    
    

    #//*** Replace the Statewide Total Column. With the Statewide Race Totals
    model_df['total'] = model_df[race]
    
    #//*** Change The race column to hold the race name
    model_df[race] = race
    
    #//*** Rename the race (Latino, Native, etc) to 'race'
    cols = list(model_df.columns)
    cols[1] = 'race'
    model_df.columns = cols
    
    model_df['intercept'] = regr.intercept_

    #//*** Move intercept to be the column after Total
    #//*** Gets Columns as a list, removes intercept from the end and inserts it into position
    #//*** Model_df is saved with ordered list of columns.
    #//*** Kinda Cool
    model_df = model_df[ list(model_df.columns[:-1].insert(3,'intercept')) ]

    #//*** Reorder Counties
    #//*** Keep the First four Columns, then use ordered_counties
    model_df = model_df[list(model_df.columns[:4])+ordered_counties]
    
    #print(model_df)
    #//*** Add the First 30 days Model to the output_df dataframe
    output_df = pd.concat([output_df,model_df])

    #//*** Checking our work. The sum of the coefficients * cases + intercept should be close to the dependent value in Total Cases.
    print("Checking our Work. These values should be close:")
    print(model_df.iloc[1]['total'], " == ", model_df[x_column].iloc[1].sum()+regr.intercept_)    
     
   
    #//*** Build each day individually, based on the previous 30 days
    #//*** Start at index 31 
    for index in range(31,len(left_df)):
        
        #//*** Define the start index for linear modeling. This is the row_index - 30
        min_index = index-30
        
        #//*** Build model_df using min_index and index as a 30 day range
        model_df = left_df[['date',race]].merge(right_df, left_on='date', right_on='date')[min_index:index]
        
        #//*** Build the X attributes using the x_column. These are separated for readability and modularity
        x_model = model_df[race_columns]

        #//*** Build the Y model using the y_column attribute. This is less readable and intuitive, but it lets the columns be 
        #//*** easily assigned at the top of this section
        y_model = model_df[y_column]

        #//*** Build a new Linear Model
        regr = linear_model.LinearRegression(n_jobs=-1)

        #//*** Make Regression Magic
        regr.fit(x_model, y_model)

        #//*** Replace the Statewide Total Column. With the Statewide Race Totals
        model_df['total'] = model_df[race]

        #//*** Change The race column to hold the race name
        model_df[race] = race

        #//*** Rename the race (Latino, Native, etc) to 'race'
        cols = list(model_df.columns)
        cols[1] = 'race'
        model_df.columns = cols

        #//*** Apply the regression coefficients to all columns, even though we only need the last one
        model_df[race_columns] = (model_df[race_columns])*regr.coef_
        model_df['intercept'] = regr.intercept_

        #//*** Move intercept to be the column after Total
        #//*** Gets Columns as a list, removes intercept from the end and inserts it into position
        #//*** Model_df is saved with ordered list of columns.
        #//*** Kinda Cool
        model_df = model_df[ list(model_df.columns[:-1].insert(3,'intercept')) ]

        #//*** Reorder Counties
        #//*** Keep the First four Columns, then use ordered_counties
        model_df = model_df[list(model_df.columns[:4])+ordered_counties]

        #//*** Add the last day of model_df to output_df. 
        #//*** It's not exactly efficient, but it is functional
        #//*** DataFrame.append was removed in pandas 2.0, so concat a one-row dataframe instead
        output_df = pd.concat([output_df,model_df.iloc[[-1]]])

processing_df = output_df.copy()

del output_df
print("Done")
"""
print()
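
The archived cell above fits a fresh linear regression for each day using the preceding 30 rows. The example below is a stripped-down sketch of that rolling-window idea. It uses `numpy.linalg.lstsq` in place of scikit-learn's `LinearRegression`, a 5-row window instead of 30, two counties instead of 58, and synthetic data. All of these are assumptions made to keep it short.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, n_counties, window = 12, 2, 5

# Synthetic daily county counts and a statewide series built from known weights.
x = rng.uniform(10, 100, size=(n_days, n_counties))
true_coef = np.array([0.3, 0.7])
true_intercept = 5.0
y = x @ true_coef + true_intercept

daily_coefs = []
for end in range(window, n_days + 1):
    # Slice the `window` rows that end just before row index `end`.
    x_win = x[end - window:end]
    y_win = y[end - window:end]
    # Append a column of ones so the last fitted weight is the intercept.
    design = np.column_stack([x_win, np.ones(window)])
    weights, *_ = np.linalg.lstsq(design, y_win, rcond=None)
    daily_coefs.append(weights)

daily_coefs = np.array(daily_coefs)
print(daily_coefs.round(3))
```

Here every window recovers the known weights because there are more rows than unknowns. With 58 county columns and a 30-row window there are more unknowns than rows, which is consistent with the comment above about the solution being overfitted.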


In [None]:
"""
#//**** Archived Reference in case I muck it up

#//************************************************
#//*** Convert Coefficients to percentages
#//************************************************
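#//*** Racial Coefficients are built for each race by day and county.
#//*** Group output_df by date, to calculate the race percent for each county by day.
#//*** The Racial Percentage is the absolute value of a coefficient divided by the sum of the absolute values of that county's coefficients for the day.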

print(f"CoEfficients Calculated: {round(time.time()-start_time,0)}s")
start_time = time.time()

x_col_index = 4
x_column = model_df.columns[x_col_index:]
output2_df = pd.DataFrame()
for group in output_df.groupby('date'):
    
    #//*** Make a copy of the sliced dataframe. This prevents index errors
    loop_df = group[1].copy()
    
    #//*** Avoid Divide by zero errors and replace 0 with 1. A column that was all 0's becomes all 1's, so each of the 7 races gets 1/7 = 0.142857, which can be replaced with 0 again
    loop_df = loop_df.replace(0,1)
    
    #//*** Each racial row of coefficients summed + intercept will equal the racial total for the day.
    #//*** The relative weights of each racial coefficient need to be converted into percentages.
    #//*** Coefficients are positive and negative, which adds a challenge to calculating weights, usually computed as value/sum. 
    #//*** If values are negative, it disrupts the sum. The solution is to sum the absolute value of the coefficients. Percentages are calculated as abs(value) / abs().sum()
    
    
    
    #print((abs(loop_df[x_column])/max_val).sum())
    loop_df[x_column] = (abs(loop_df[x_column])/abs(loop_df[x_column]).sum())
    
    #//*** Replace the exactly even percentage distributions with 0's. These were the initial 0's that were replaced with 1's
    loop_df = loop_df.replace(.14285714285714285,0)
    output2_df = pd.concat([output2_df,loop_df])
    #print(loop_df)

    
ca_case_race_percent_df = output2_df.copy().fillna(0)

ca_case_race_percent_df.index = np.arange(0,len(ca_case_race_percent_df))

print(f'Racial Percentages Calculated: {round(time.time()-start_time,0)}s')
print(ca_case_race_percent_df)
if 'ca_case_race_percent_df' not in df_list:
    df_list.append('ca_case_race_percent_df')
#//*** Temp Variable Cleanup
#del output_df
#del output2_df
#del loop_df
#del model_df
"""
print()
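
The absolute-value normalization described in the archived cell above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the toy `coef_df`, the county names `Alpha` and `Beta`, and the helper `abs_share_by_date` are all hypothetical. Instead of replacing 0 with 1 and later matching the float 0.142857..., the sketch masks zero totals directly, which avoids relying on an exact floating-point comparison.

```python
import pandas as pd

# Hypothetical toy data: two races on one date, two made-up counties.
coef_df = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01"],
    "race": ["Latino", "White"],
    "Alpha": [3.0, -1.0],
    "Beta": [0.0, 0.0],
})
county_cols = ["Alpha", "Beta"]


def abs_share_by_date(df, cols):
    """Return |coefficient| as a share of each date's column-wise total of |coefficient|."""
    out = df.copy()
    abs_vals = out[cols].abs()
    # Column-wise sum of absolute values within each date group, broadcast back to every row
    totals = abs_vals.groupby(out["date"]).transform("sum")
    # A zero total makes the share undefined; mask it to NaN, then report 0
    out[cols] = abs_vals.div(totals.where(totals != 0)).fillna(0.0)
    return out


share_df = abs_share_by_date(coef_df, county_cols)
print(share_df)
```

Within each date, every county column with a non-zero total sums to 1 across the races, and all-zero columns stay at 0.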


In [None]:
tdf = pop_attrib_df.copy()
tdf['population'] = tdf['population']/tdf['population'].sum()
print(tdf.sort_values('population',ascending=False))
del tdf

In [None]:
"""
if 'state_coef_df' not in df_list:
    df_list.append('state_coef_df')
    
#//*** Generate Coefficients for the State - Orphaned Code
#//***************************************************************************************************************************************************************************
#//*** Generate a State Coefficient. These are best used to check our work.
#//*** The State Coefficients are useful for comparing the sum of racial coefficients. If the regressions are consistent, the state coefficient should be close to the 
#//*** sum of the racial coefficients
#//***************************************************************************************************************************************************************************

#//*** Build the coefficients for the entire data set. Each day will calculate the coefficients from the previous 30 days. The first 30 days will use one set of coefficients. The rest will use
#//*** the current day - 30 to generate the coefficients. This will be an overfitted solution, which is exactly what we are going for.

#//*** Abstract Dataframes to Left and Right for reusability
left_df = ca_races_broad_df
right_df = ca_cases_broad_df

#//*** Get Coefficients for counties predicting the State total Case values
output_df = pd.DataFrame()

#//*** County Columns start at Index 3
x_col_index = 3

#//*** Target Dependent (y) Column Index. Statewide numbers begin at column 2
y_col_index = 2

#//*** Sample size
#modeling_days = 30

#//*** Combining with ca_race_df and the Latino value. It's not strictly needed, but the additional column will make combining the dataframes later easier.
#//*** Reusing Code: This loop only needs to run once
for race in ['Latino']:
    
    #//*** Build model for the first 30 days, combines the race from ca_race (which is only needed as an extra field, to evenly space the columns)
    model_df = left_df[['date',race]].merge(right_df, left_on='date', right_on='date')[:30]
    #print(model_df)
    
    #//*************************
    #//*** BEGIN regression
    #//*************************
    
    #//*** Build the X values - independent variables. These will be all the counties, which start at the column index x_col_index
    #//*** The X value is the index where the attributes start, there are 58 of them :)
    x_column = model_df.columns[x_col_index:]

    #//*** Build the X attributes using the x_column. These are separated for readability and modularity
    x_model = model_df[x_column]
    

    #//*** Build the dependent variable using the Index Column defined above as y_col_index.
    y_column = model_df.columns[y_col_index]
    
    #//*** Build the Y model using the y_column attribute. This is less readable and intuitive, but it lets the columns be 
    #//*** easily assigned at the top of this section
    y_model = model_df[y_column]
    
    #//*** Define the Linear Model
    regr = linear_model.LinearRegression()
    
    #//*** Make Regression Magic
    regr.fit(x_model, y_model)

    #//*** Apply the regression coefficients
    model_df[x_column] = (model_df[x_column])*regr.coef_
    
    #//*** Add the First 30 days Model to the output_df dataframe
    output_df = pd.concat([output_df,model_df])
    
    #//*** Checking our work. The sum of the coefficients * cases + intercept should be close to the dependent value in Total Cases.
    print("Checking our Work. These values should be close:")
    print(model_df.iloc[1]['total'], " == ", model_df[x_column].iloc[1].sum()+regr.intercept_)    

    #//*** Build each day individually, based on the previous 30 days
    #//*** Start at index 31 
    for index in range(31,len(left_df)):
        
        #//*** Define the start index for linear modeling. This is the row_index - 30
        min_index = index-30
        
        #//*** Build model_df using min_index and index as a 30 day range
        model_df = left_df[['date',race]].merge(right_df, left_on='date', right_on='date')[min_index:index]
        
        #//*** Build the X attributes using the x_column. These are separated for readability and modularity
        x_model = model_df[x_column]

        #//*** Build the Y model using the y_column attribute. This is less readable and intuitive, but it lets the columns be 
        #//*** easily assigned at the top of this section
        y_model = model_df[y_column]

        #//*** Build a new Linear Model
        regr = linear_model.LinearRegression()

        #//*** Make Regression Magic
        regr.fit(x_model, y_model)

        #//*** Apply the regression coefficients to all columns, even though we only need the last one
        model_df[x_column] = (model_df[x_column])*regr.coef_
        model_df[model_df.columns[1]] = regr.intercept_
        #//*** Add the last day of model_df to output_df.
        #//*** It's not exactly efficient, but it is functional.
        #//*** DataFrame.append was removed in pandas 2.0; concat the last row as a one-row DataFrame instead.
        output_df = pd.concat([output_df, model_df.iloc[[-1]]])
        
        
state_coef_df = output_df.copy()
#//*** Assign a flat list of names; a nested list would create a MultiIndex
state_coef_df.columns = ['date','intercept','total'] + list(x_column)
print("State Coef")
print(state_coef_df)
if 'state_coef_df' not in df_list:
    df_list.append('state_coef_df')
#//*** Eliminate lingering temp variables
del output_df
del model_df
"""
print()
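
The rolling-window idea in the archived cell above (refit a regression on a sliding window of days and keep only the last day's weighted contributions) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the notebook's implementation: it swaps scikit-learn's `LinearRegression` for `numpy.linalg.lstsq` to stay dependency-light, uses a short window, and the function name `rolling_contributions` and the synthetic data are hypothetical.

```python
import numpy as np


def rolling_contributions(X, y, window):
    """For each window of `window` consecutive rows, fit y ~ X + intercept and
    return the last row's per-column contribution (x * coef) plus the intercept."""
    rows = []
    for end in range(window, len(X) + 1):
        Xw, yw = X[end - window:end], y[end - window:end]
        # Append a column of ones so the last fitted parameter is the intercept
        A = np.column_stack([Xw, np.ones(len(Xw))])
        params, *_ = np.linalg.lstsq(A, yw, rcond=None)
        coef, intercept = params[:-1], params[-1]
        # Keep only the final day of the window, as the archived loop does
        rows.append(np.append(Xw[-1] * coef, intercept))
    return np.array(rows)


# Hypothetical data: 3 "counties" over 10 days; y is an exact linear combination
rng = np.random.default_rng(0)
X = rng.uniform(1, 100, size=(10, 3))
y = X @ np.array([0.5, 0.3, 0.2]) + 4.0

contrib = rolling_contributions(X, y, window=6)
# Same check as the notebook's "Checking our work": contributions + intercept vs. the target
print(np.allclose(contrib.sum(axis=1), y[5:]))
```

Each output row holds the weighted county values followed by the intercept, so summing a row should reproduce that day's target when the fit is good.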


In [None]:
"""
#//*** Version 1: Convert Coefficients to percentages

#//************************************************
#//*** Convert Coefficients to percentages
#//************************************************
#//*** Racial Coefficients are built for each race by day and county.
#//*** Group output_df by date, to calculate the race percent for each county by day.
#//*** The Racial Percentage is value**2 / (values**2).sum() for each row, which is then multiplied by the race total.

start_time = time.time()



x_col_index = 4
x_column = model_df.columns[x_col_index:]
output2_df = pd.DataFrame()

break_counter = 0
for row in processing_df.iterrows():
    

    loop_row = row[1]
    
    
    total = row[1][2]
    #print("Total: ",total)
    intercept = row[1][3]
    #print("Intercept: ",total-intercept)
    loop_row[4:] = loop_row[4:]
    #x_range = loop_row[4:].max() - loop_row[4:].min()
    #x_range = total - loop_row[4:].min()
    #print("x_norm: ", x_norm)
    #print((loop_row[4:]-min_val)/x_range)
    
    #loop_row[4:] = loop_row[4:].replace(0,1)
    #print(loop_row[4:].sum())
    
    #//*** ABSOLUTE Value method
    #loop_row[4:] = abs(loop_row[4:])/abs(loop_row[4:]).sum()
    
    #//*** Find double the minimum value. This is the offset value.
    #//*** Used to make negative values positive
    offset_value = abs(loop_row[4:].min()*2)
    
    #//*** The total offset which is added to all the values.
    #//*** The Offset_total has to be added to the maximum value in order to keep the proportions correct
    offset_total = offset_value*len(loop_row[4:])
    
    #print("Offset: ",offset_total)

    
    #//*** If all values were positive and there was no need for adjustment, the max value would be the loop_row['total'] 
    #//*** Max_value is calculated as Race Total (loop_row['total']) + offset_total - intercept.
    #//*** We can verify this value by comparing it to the sum() of all the values.
    max_value = (loop_row['total'] + offset_total) - loop_row['intercept']
    #print(loop_row['date']," ", loop_row['race'],"- [",max_value, " == ",(loop_row[4:]+offset_value).sum(), "] [ 1 == ", ((loop_row[4:]+offset_value)/max_value).sum(),"]")
    
    
    
    #//*** Calculate the modeled percentage of each county's contribution
    #//*** Percentages: value + offset / max_value
    #//*** Case contribution: race total * value + offset / max_value
    #//*** Original formula
    #loop_row[4:] = ((loop_row[4:]+offset_value)/max_value)*loop_row['total']
    
    #//*** Reformulated
    #//*** Square the results to remove negative values
    loop_row[4:] = loop_row[4:]**2
    
    #//*** Divide by the sum to get relative percentages
    loop_row[4:] = loop_row[4:]/loop_row[4:].sum()
    
    #//*** Multiply by the total to get the weighted percentage of racial values
    loop_row[4:] = loop_row[4:]*loop_row['total']
    
    #print((((loop_row[4:]+offset_value)/max_value)*loop_row['total']).sum())
    #print( (loop_row[4:]*loop_row['total']))
    #print(pd.DataFrame(loop_row).transpose())
    #loop_row[4:] = ( (loop_row[4:].pow(2)/ (total+intercept)**2 ).pow(1./2) )
    #print( abs(loop_row[4:]).sum())  
    #print( loop_row[4:])  
    #print(abs((loop_row[4:]/(total))).sum())    
    #min_val = loop_row[4:].min()
    #print("min: ", loop_row[4:].min())
    #//*** Make a copy of the sliced dataframe. This prevents index errors
    #loop_df = group[1].copy()
    

    output2_df = pd.concat([output2_df,pd.DataFrame(loop_row).transpose()])
    #print(loop_row['total'])
    #print(loop_row)
    break_counter = break_counter + 1
    
#    if break_counter > 250:
#        print(loop_row[4:].sum())
#        print(loop_row)
#        break
    
    
ca_case_race_total_df = output2_df.copy().fillna(0)

ca_case_race_total_df.index = np.arange(0,len(ca_case_race_total_df))

print(f'Racial Percentages Calculated: {round(time.time()-start_time,0)}s')
print(ca_case_race_total_df)
if 'ca_case_race_total_df' not in df_list:
    df_list.append('ca_case_race_total_df')
#//*** Temp Variable Cleanup
#del output_df
#del output2_df
#del loop_df
#del model_df
"""
print()
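
The "Reformulated" steps in the archived cell above (square, normalize, scale by the total) amount to a small per-row transform. The sketch below is a hypothetical stand-alone version: the helper name `squared_share` and the example values are assumptions for illustration, and the zero-sum guard is an addition that the archived code does not have.

```python
import numpy as np


def squared_share(contributions, total):
    """Square each contribution to drop its sign, normalize so the shares
    sum to 1, then scale by the race total for the day."""
    squared = np.asarray(contributions, dtype=float) ** 2
    denom = squared.sum()
    if denom == 0:
        # Added guard (not in the archived code): nothing to allocate
        return np.zeros_like(squared)
    return squared / denom * total


row = [3.0, -4.0, 0.0]  # hypothetical county contributions, one of them negative
allocated = squared_share(row, total=100.0)
print(allocated)
```

Squaring removes negative values but also weights large coefficients more heavily than the absolute-value method: here the shares are 9/25 and 16/25 rather than 3/7 and 4/7.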

In [None]:
"""
100k Modeling is a fail

#ca_broad_race_100k_df
#ca_broad_cases_100k_df
#ca_case_race_100k_df

#//*** random.randint is inclusive on both ends, so the upper bound must be len - 1
t_date = ca_case_race_100k_df.iloc[random.randint(0,len(ca_case_race_100k_df)-1)]['date']
t_total = ca_case_race_100k_df[ca_case_race_100k_df['date']==t_date]['total']

print(t_date)

tdf1 = ca_broad_cases_100k_df[ca_broad_cases_100k_df['date']== t_date]
print(tdf1[tdf1.columns[2:]].transpose().sum())
print("===================")
print(ca_case_race_100k_df[ca_case_race_100k_df['date']==t_date].iloc[3][2])
print(ca_case_race_100k_df[ca_case_race_100k_df['date']==t_date].iloc[3][4:].sum())
print("===================")
print(tdf1)
print("===================")
print("Total COVID cases by Race on this day: ",ca_broad_race_100k_df[ca_broad_race_100k_df['date']==t_date].transpose()[1:].sum().values[0])
print("Total COVID cases by County:           ",tdf1['total'].values[0])
"""
print()

In [None]:
"""
100k Modeling is a fail

#//*** Compare the Modeled County Totals to the Actual Totals. 
for county in ordered_counties:
    #print(county, " Actual: ",ca_cases_broad_df[county].sum(), " - Modeled: ",ca_case_race_total_df[county].sum()   )
    print("A: ",ca_cases_broad_df[county].sum(), " - M: ",round(ca_case_race_100k_df[county].sum(),0), " ", county   )
print(ca_cases_broad_df['Los Angeles'].sum())

print(ca_case_race_total_df['Los Angeles'].sum())
"""
print()

In [None]:
"""
100k Linear Regression is not the way to go.

ca_broad_cases_100k_df = build_broad_county_attribute(ca_100k_df,'cases')

ca_broad_race_100k_df = build_broad_race_attribute(ca_race_df,'cases_100k')
#build_broad_race_attribute

#print(ca_100k_cases_df)

#//************************************************************************************************************************************************************************
#//*** Calculate the coefficients for Racial data for each county day.
#//*** This uses LinearRegression to generate coefficients which are weighted with the county cases of the day.
#//*** Coefficients derive each county's effect on the whole of racial COVID cases. The idea is to estimate every county's portion of the racial cases.
#//*** Each race is modeled separately. The coefficients are converted to percentages in the following step
#//************************************************************************************************************************************************************************


#//*** left_df contains the broad racial categories
left_df = ca_broad_race_100k_df

#//*** Right df contains each county as a column, with one value aggregated. Such as raw COVID cases.
right_df = ca_broad_cases_100k_df

#//*** County Columns start at Index 3.
#//*** These values are a little unintuitive, since they are relevant during the function operation. They are used to indicate which columns are used via the first element's index position
x_col_index = 3

#//*** Target Dependent (y) Column Index. Statewide race numbers begin at column 1
#//*** These values are a little unintuitive. They are used to indicate which columns are used via the first element's index position
y_col_index = 1

processing_df = calculate_coefficients(left_df,right_df,x_col_index,y_col_index)

#//**** County Column index start
x_col_index = 4

#//*** Convert coefficients to percentages,

ca_case_race_100k_df = coef_to_percent(processing_df,x_col_index)


print(ca_case_race_100k_df)
if 'ca_case_race_100k_df' not in df_list:
    df_list.append('ca_case_race_100k_df')

print("Done")
"""
print()