# Stoneburner, Kurt #
- ## DSC 550 - Week 07
- ## Milestone #4

### Current Project Goals: ###
**Primary Goal**: Model Racial COVID numbers at the county level. California provides County COVID data in terms of total case data. I want to model the racial makeup of those case. Example: If there were 100 COVID cases, I want to estimate the Latino, White, Black and Asian cases for the county. 

I thought this task would be fairly easy. After a very intense week I kind of have a path on a model. I had some success by running a manual model. First, I calculated the statewide racial portion of all COVID cases. Example: Latino: .45, White: .22, Black: .05 etc. This value is calculated for each day across the data set.

The model takes the expected race cases for each county/day and adjusts each value by the statewide racial percentage: Expected Cases + (Expected cases * state racial modifier). To score the model, I've generated an 'est' attribute than is a sum() of the adjusted racial cases. This value is compared the to actual values. I have not generated any numeric error scores yet. I would use a sum of mean squares method to score the model error. Towards the bottom of this document is a graph for each county of actual vs modeled values ordered by COVID prevalence. 

The model appears to work pretty well in counties with a high COVID prevalence. It tends to be much less accurate in less COVID prevalent counties. This could be an error in my math, scaling, or these counties are less influenced by racial effects.

### Note: ###
This is very much a work in progress. I left a lot of dead end code in here. Most the graphs demonstrate that I was on the wrong path. There is a lot of trial and error here. I did a fair bit of Linear Regression that simply failed. 

I'm chasing the COVID data trying to make the best sense of it that I can. I've learned quite a bit about scaling and the racial inequity of COVID. There is a graph below referencing scaling. Unscaled COVID cases by race show White people as having the second highest COVID counts. When scaled to cases per 100k, Whites have the lowest prevalence of COVID cases. That hit me full stop and inspired a conversation with a newscast Producer.

I'm going to try and run this model on scaled data, then look at maybe using a linear regression or forest regression model to see if I can predict actual COVID cases based on the generated adjusted COVID values.

I feel like I'm chasing something worthy, although I may be veering away from the intent of this assignment. I could really use some advice if I'm heading in the right direction.

In [1]:
import os
import sys
# //*** Imports and Load Data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
#//*** Use the whole window in the IPYNB editor
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

import time 

#//*** Maximize columns and rows displayed by pandas
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)

pd.set_option('display.width', 200)

df_list = []

In [2]:
from sklearn import linear_model
from math import sqrt
from sklearn.metrics import mean_squared_error

In [3]:
# //*** Legends automatically generate too many labels based on my looping method.
# //*** Remove the Duplicate Legends. I wrote this for DSC 530 and it keeps on giving.
def deduplicate_legend(input_ax):
    # //**** Get handle and label list for the current legend
    # //**** Use first instance, toss the rest.
    handles, labels = input_ax.get_legend_handles_labels()

    handle_dict = {}

    for x in range(len(labels)):
        if labels[x] not in handle_dict.keys():
            # //*** Label = handle
            handle_dict[labels[x]] = handles[x]

    # //*** Build unique output ists and handles
    out_handles = []
    out_labels = []
    
    for label,handle in handle_dict.items():
        out_handles.append(handle)
        out_labels.append(label)
    
    return out_handles,out_labels

   

In [4]:
#//*** Only download Data if download_data is True.
#//*** Avoids needlessly generating HTTP traffic
download_data = False
demographic_data_filename = "z_ca_covid_demo.csv"
cases_data_filename = "z_ca_covid_cases.csv"

#//***********************************************************************************************
#//*** California COVID Data website:
#//**************************************
#//*** https://data.chhs.ca.gov/dataset/covid-19-time-series-metrics-by-county-and-state
#//***********************************************************************************************

#//*** Download California Current COVID Demograohic Data
if download_data:
    try:
        response = requests.get("https://data.chhs.ca.gov/dataset/f333528b-4d38-4814-bebb-12db1f10f535/resource/e2c6a86b-d269-4ce1-b484-570353265183/download/covid19casesdemographics.csv")
        if response.ok:
            print("Demographic Data Downloaded")
            f = open(demographic_data_filename, "w")
            f.write(response.text)
            f.close()
            print("Demographic Data Written to file.")
    except:
        print("Demographic Data: Trouble Downloading From State of CA")

#//*** Download California Current COVID Case Counts
    try:
        response = requests.get("https://data.chhs.ca.gov/dataset/f333528b-4d38-4814-bebb-12db1f10f535/resource/046cdd2b-31e5-4d34-9ed3-b48cdbc4be7a/download/covid19cases_test.csv")
        if response.ok:
            print("Case Data Downloaded")
            f = open(cases_data_filename, "w")
            f.write(response.text)
            f.close()
            print("Case Data Written to file.")
    except:
        print("Ca Case Data: Trouble Downloading From State of CA")

In [5]:
ca_covid_df= pd.read_csv(cases_data_filename)
ca_race_df = pd.read_csv(demographic_data_filename)
df_list.append('ca_covid_df')
df_list.append('ca_race_df')

print(ca_race_df.columns)
#//*** Demographics Contain Age Groups, Gender, and Race Ethnicity.

#//*** We'll Focus on just Race Ethnicicty
print(f"Demographic Types: {ca_race_df['demographic_category'].unique()}")

#//*** Get Just Race Ethnicity
race_category = ca_race_df['demographic_category'].unique()[2]
#print(ca_race_df)
ca_race_df = ca_race_df[ca_race_df['demographic_category'] == race_category]


Index(['demographic_category', 'demographic_value', 'total_cases', 'percent_cases', 'deaths', 'percent_deaths', 'percent_of_ca_population', 'report_date'], dtype='object')
Demographic Types: ['Age Group' 'Gender' 'Race Ethnicity']


In [6]:
#//****************************************
#//*** Cleanup ca_race_df  attributes
#//****************************************
print(ca_race_df)
if 'demographic_value' in ca_race_df.columns:
    #//*** Rename the California Racial names to matches the census derived attribute names in pop_attrib_df
    ca_race_df['demographic_value']=ca_race_df['demographic_value'].str.replace('Native Hawaiian and other Pacific Islander','Hawaiian')
    ca_race_df['demographic_value']=ca_race_df['demographic_value'].str.replace('Multi-Race','Multiracial' )
    ca_race_df['demographic_value']=ca_race_df['demographic_value'].str.replace('American Indian or Alaska Native','Native' ) 

#//*** rename the reported_date column
ca_race_df.rename( columns= {'report_date':'date','total_cases':'cum_cases','deaths':'cum_deaths'}, inplace=True)

if 'demographic_category' in ca_race_df.columns:
    del ca_race_df['demographic_category']

if 'demographic_value' in ca_race_df.columns:
    print(ca_race_df['demographic_value'].unique())

#//*************************************
#//*** Cleanup Statewide COVID values
#//*************************************
if 'demographic_value' in ca_race_df.columns:
    ca_race_df.rename( columns= {'demographic_value':'race'}, inplace=True)

#if 'total_cases' in ca_race_df.columns:
#    ca_race_df.rename( columns= {'total_cases':'cases'}, inplace=True)

#remove_cols = ['percent_cases','percent_deaths','percent_of_ca_population']

#//*** Convert date to datetime format.
ca_race_df['date'] =  pd.to_datetime(ca_race_df['date'], infer_datetime_format=True)

#for col in remove_cols:
#    if col in ca_race_df.columns:
#        del ca_race_df[col]

#ca_total_df = ca_race_df[ca_race_df['race']=="Total"].copy()

#if 'race' in ca_total_df.columns:
#    del ca_total_df['race']


#//*** Remove Total and Other from Race Types. Total is Statewide infections, Other is other racial categories.
#ca_race_df = ca_race_df[~ca_race_df['race'].isin(['Other','Total'])]
ca_race_df = ca_race_df[~ca_race_df['race'].isin(['Other'])]

temp_race_total_df = ca_race_df[ca_race_df['race']=='Total']
ca_race_df = ca_race_df[~ca_race_df['race'].isin(['Other','Total'])]

print(ca_race_df)

     demographic_category                 demographic_value  total_cases  percent_cases  deaths  percent_deaths  percent_of_ca_population report_date
3910       Race Ethnicity  American Indian or Alaska Native           33            0.2       3             0.5                       0.5  2020-04-13
3911       Race Ethnicity  American Indian or Alaska Native           33            0.2       3             0.4                       0.5  2020-04-14
3912       Race Ethnicity  American Indian or Alaska Native           32            0.2       3             0.4                       0.5  2020-04-15
3913       Race Ethnicity  American Indian or Alaska Native           34            0.2       4             0.5                       0.5  2020-04-16
3914       Race Ethnicity  American Indian or Alaska Native           38            0.2       4             0.4                       0.5  2020-04-17
...                   ...                               ...          ...            ...     ...     

In [7]:
#//**********************************
#//*** ca_total_df - This one is Temp
#//**********************************
#//*** Build Total State Numbers by extracting area_type (county) == 'State'
#//**********************************
ca_total_df = ca_covid_df[ca_covid_df['area_type']=='State'].copy()
ca_total_df['date'] =  pd.to_datetime(ca_total_df['date'], infer_datetime_format=True)

In [8]:
print(ca_total_df)

            date        area area_type  population  cases  deaths  total_tests  positive_tests  reported_cases  reported_deaths  reported_tests
5     2021-05-17  California     State  40129160.0    0.0     0.0          NaN             NaN           687.0              3.0        129983.0
66    2021-05-16  California     State  40129160.0   84.0     0.0       4890.0            99.0           995.0             11.0        177223.0
127   2021-05-15  California     State  40129160.0  342.0     0.0      26911.0           431.0          1370.0             55.0        229177.0
188   2021-05-14  California     State  40129160.0  884.0     0.0     130077.0          1224.0          1864.0             27.0        251185.0
249   2021-05-13  California     State  40129160.0  951.0     8.0     178100.0          1472.0          2034.0             66.0        219570.0
...          ...         ...       ...         ...    ...     ...          ...             ...             ...              ...         

In [9]:
#//*** State Race numbers don't quite add up with the State COVID totals.
#//*** We'll re-normalize the values by
#//*** 1.) Recalculate the percent of cases to a higher degree of accuracy
#//*** 2.) Recalculate the Race Cases based on updated percentages

#//*** Temp variables
t_date = []
t_race = []
t_percent_cases = []
tdf = pd.DataFrame()
#//*******************

#//*** iPython Trap
if 'cum_cases' in ca_race_df:
    for group in ca_race_df.groupby('date'):

        #//*** Get the Total Cases for the day, from the Race labeled Total
        total_cases = temp_race_total_df[temp_race_total_df['date']==group[0]]['cum_cases'].values[0]
        total_deaths = temp_race_total_df[temp_race_total_df['date']==group[0]]['cum_deaths'].values[0]

        #//*** Recalculate percent_cases
        group[1]['percent_cases'] = group[1]['cum_cases'] / total_cases

        #//*** Recalculate percent_deaths
        group[1]['percent_deaths'] = group[1]['cum_deaths'] / total_deaths

        #//*** Generate the Daily Race Cases based of total reported COVID cases

        #//*** Get State cases and deaths from ca_total_df (originally CA_COVID_DF)
        state_cases = ca_total_df[ca_total_df['date']==group[0]]['cases'].values[0]
        state_deaths = ca_total_df[ca_total_df['date']==group[0]]['deaths'].values[0]

        #//*** Convert cum_cases to daily cases
        group[1]['cum_cases'] = group[1]['percent_cases'] * state_cases
        group[1]['cum_deaths'] = group[1]['percent_deaths'] * state_deaths

        #print(state_cases," ",state_deaths)
        tdf = pd.concat([tdf,group[1]])

        #print()
        #//*** Get the Percentage of cumulative Cases per day. This is cum_cases / total_cases
        #daily_percent = group[1][group[1]['race']!='Total']['cum_cases']/total_cases
        #group[1][group[1]['race']!='Total'] = daily_percent
        #print(group[1])



    #print(tdf[ ['date','race','cum_cases','cum_deaths','percent_cases','percent_deaths','percent_of_ca_population'] ])
    ca_race_df = tdf[ ['date','race','cum_cases','cum_deaths','percent_cases','percent_deaths','percent_of_ca_population'] ].copy()

    if 'percent_of_ca_population' in ca_race_df.columns:
        del ca_race_df['percent_of_ca_population']

    ca_race_df.columns = ['date','race','cases','deaths','percent_cases','percent_deaths']

del tdf
del t_date
del t_race
del t_percent_cases

In [10]:
print(df_list)
print(ca_race_df)

['ca_covid_df', 'ca_race_df']
           date         race       cases     deaths  percent_cases  percent_deaths
3910 2020-04-13       Native    3.401296   0.395659       0.002251        0.005008
4310 2020-04-13        Asian  196.038336  12.924875       0.129741        0.163606
4710 2020-04-13        Black  106.161664   8.045075       0.070259        0.101836
5110 2020-04-13       Latino  543.795089  22.420701       0.359891        0.283806
5510 2020-04-13  Multiracial   27.622647   1.055092       0.018281        0.013356
...         ...          ...         ...        ...            ...             ...
5109 2021-05-17        Black    0.000000   0.000000       0.042464        0.063297
5509 2021-05-17       Latino    0.000000   0.000000       0.558986        0.465177
5909 2021-05-17  Multiracial    0.000000   0.000000       0.017548        0.014101
6309 2021-05-17     Hawaiian    0.000000   0.000000       0.005557        0.005261
7509 2021-05-17        White    0.000000   0.000000      

In [11]:
#//*******************************
#//*** Clean Ca_COVID_df
#//*******************************

ca_covid_df.rename( columns= {'area':'county'}, inplace=True)
print(f"# of counties before Cleaning: {len(ca_covid_df['county'].unique())}")

#//*** Convert Date Column to Date Type.
ca_covid_df['date'] =  pd.to_datetime(ca_covid_df['date'], infer_datetime_format=True)

#print(ca_covid_df[ ca_covid_df['area_type'] == 'State'])
ca_covid_df = ca_covid_df.sort_values('date')

#\\*** DROP ca_covd dates that are not included in ca_race_df

#//*** Get first date of Race
first_date = ca_race_df.iloc[0]['date']
#//*** remove values from ca_covid before the date
ca_covid_df = ca_covid_df[ca_covid_df['date'] >= first_date].sort_values('date')


# of counties before Cleaning: 61


In [12]:
#//**********************************
#//*** ca_total_df 
#//**********************************
#//*** Build Total State Numbers by extracting area_type (county) == 'State'
#//**********************************
df_list.append('ca_total_df')
ca_total_df = ca_covid_df[ca_covid_df['area_type']=='State'].copy()


In [13]:
#//*** Remove the 'Out Of State, Unknown and California listings
print(f"Length Before removing Out Of Country County: {len(ca_covid_df)}")
ca_covid_df = ca_covid_df[~ca_covid_df['county'].isin(['Out of state','Unknown','California'])]

print(f"# of counties After Cleaning: {len(ca_covid_df['county'].unique())}")

#//*** Replace NaN values with 0
for col in ca_total_df.columns:
    ca_total_df[col].fillna(0,inplace=True)


#//*** Replace NaN values with 0
for col in ca_covid_df.columns:
    ca_covid_df[col].fillna(0,inplace=True)
    
    

#//*** Drop Columns
dropcols = ['area_type','population','positive_tests','reported_cases','reported_deaths','reported_tests']
#dropcols = []
ca_covid_orig_df = ca_covid_df.copy()
df_list.append('ca_covid_orig_df')

for col in dropcols:
    if col in ca_covid_df.columns:
        del ca_covid_df[col]

for col in dropcols:
    if col in ca_total_df.columns:
        del ca_total_df[col]
if 'county' in ca_total_df.columns:
    del ca_total_df['county']

Length Before removing Out Of Country County: 24400
# of counties After Cleaning: 58


In [14]:
print(df_list)
print(ca_covid_df)

['ca_covid_df', 'ca_race_df', 'ca_total_df', 'ca_covid_orig_df']
            date    county  cases  deaths  total_tests
24355 2020-04-13     Kings    2.0     0.0         48.0
24397 2020-04-13   Ventura   10.0     0.0        365.0
24395 2020-04-13  Tuolumne    0.0     0.0         29.0
24394 2020-04-13    Tulare   10.0     0.0        153.0
24393 2020-04-13   Trinity    0.0     0.0          2.0
...          ...       ...    ...     ...          ...
2     2021-05-17    Amador    0.0     0.0          0.0
28    2021-05-17      Napa    0.0     0.0          0.0
14    2021-05-17      Inyo    0.0     0.0          0.0
1     2021-05-17    Alpine    0.0     0.0          0.0
0     2021-05-17   Alameda    0.0     0.0          0.0

[23200 rows x 5 columns]


In [15]:
last_day = ca_total_df['date'].unique()[-30]
second_day = ca_total_df['date'].unique()[-31]
print(last_day)
print(ca_total_df[ca_total_df['date']==last_day])
print(ca_covid_df[ca_covid_df['date']==last_day]['cases'].sum())
print(ca_race_df[ca_race_df['date']==last_day]['cases'].sum())
print(ca_race_df[ca_race_df['date']==last_day])
tdf=ca_race_df[ca_race_df['date']==last_day]

print(((1392*tdf['percent_cases'])/100).sum())

2021-04-18T00:00:00.000000000
           date   cases  deaths  total_tests
1774 2021-04-18  1392.0    24.0      70709.0
1392.0
1247.061231715329
           date         race       cases     deaths  percent_cases  percent_deaths
4280 2021-04-18       Native    4.734258   0.088393       0.003401        0.003683
4680 2021-04-18        Asian   96.188358   2.911658       0.069101        0.121319
5080 2021-04-18        Black   58.061101   1.519781       0.041711        0.063324
5480 2021-04-18       Latino  774.199002  11.228296       0.556177        0.467846
5880 2021-04-18  Multiracial   23.721152   0.317317       0.017041        0.013222
6280 2021-04-18     Hawaiian    7.762943   0.127497       0.005577        0.005312
7480 2021-04-18        White  282.394418   7.483223       0.202870        0.311801
12.470612317153293


In [16]:

#//**********************************************************************************************************************************
#//*** US Census data on Racial population by County in California
#//**********************************************************************************************************************************
#//*** Data Source
#//**********************************************************************************************************************************
#//*** Census Data: https://www.census.gov/data/datasets/time-series/demo/popest/2010s-counties-detail.html
#//*** Direct Download: https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/asrh/cc-est2019-alldata-06.csv
#//**********************************************************************************************************************************
#//*** Process Flat File: California Ethnicity demographics - cc-est2019-alldata-06.csv

raw_ethnic_pop_df = pd.read_csv("cc-est2019-alldata-06.csv")

#//*** Data includes values for last twelve years. We only want data for the last year.

#//*** Rebuild raw_ethnic_pop_df using only the last year (most recent) data
raw_ethnic_pop_df = raw_ethnic_pop_df[raw_ethnic_pop_df['YEAR']==raw_ethnic_pop_df['YEAR'].max()]
df_list.append('raw_ethnic_pop_df')
#//*** Ethnic data is broken down by age. At this stage we will only use the totals of all ages
#//*** Only use AGEGRP == 0
raw_ethnic_pop_df = raw_ethnic_pop_df[raw_ethnic_pop_df['AGEGRP']==raw_ethnic_pop_df['AGEGRP'].min()]

#//*** Demographics are based on gender as well as Federal Race and Ethnic attributes. These attributes are different than the values reported
#//*** By the State of California. These attributes will require cleaning and transformation.
raw_ethnic_pop_df.head(20)


Unnamed: 0,SUMLEV,STATE,COUNTY,STNAME,CTYNAME,YEAR,AGEGRP,TOT_POP,TOT_MALE,TOT_FEMALE,WA_MALE,WA_FEMALE,BA_MALE,BA_FEMALE,IA_MALE,IA_FEMALE,AA_MALE,AA_FEMALE,NA_MALE,NA_FEMALE,TOM_MALE,TOM_FEMALE,WAC_MALE,WAC_FEMALE,BAC_MALE,BAC_FEMALE,IAC_MALE,IAC_FEMALE,AAC_MALE,AAC_FEMALE,NAC_MALE,NAC_FEMALE,NH_MALE,NH_FEMALE,NHWA_MALE,NHWA_FEMALE,NHBA_MALE,NHBA_FEMALE,NHIA_MALE,NHIA_FEMALE,NHAA_MALE,NHAA_FEMALE,NHNA_MALE,NHNA_FEMALE,NHTOM_MALE,NHTOM_FEMALE,NHWAC_MALE,NHWAC_FEMALE,NHBAC_MALE,NHBAC_FEMALE,NHIAC_MALE,NHIAC_FEMALE,NHAAC_MALE,NHAAC_FEMALE,NHNAC_MALE,NHNAC_FEMALE,H_MALE,H_FEMALE,HWA_MALE,HWA_FEMALE,HBA_MALE,HBA_FEMALE,HIA_MALE,HIA_FEMALE,HAA_MALE,HAA_FEMALE,HNA_MALE,HNA_FEMALE,HTOM_MALE,HTOM_FEMALE,HWAC_MALE,HWAC_FEMALE,HBAC_MALE,HBAC_FEMALE,HIAC_MALE,HIAC_FEMALE,HAAC_MALE,HAAC_FEMALE,HNAC_MALE,HNAC_FEMALE
209,50,6,1,California,Alameda County,12,0,1671329,823247,848082,414416,409177,88167,96201,9048,8749,259991,280400,7534,8227,44091,45328,451943,447434,102712,111680,18534,19263,286818,307183,13053,14053,634636,663638,256400,255734,81150,88804,1957,2200,254719,274979,6423,7051,33987,34870,284971,284737,92291,100719,6123,7255,277718,297837,10757,11661,188611,184444,158016,153443,7017,7397,7091,6549,5272,5421,1111,1176,10104,10458,166972,162697,10421,10961,12411,12008,9100,9346,2296,2392
437,50,6,3,California,Alpine County,12,0,1129,609,520,424,343,0,4,146,144,11,7,0,0,28,22,450,363,15,11,154,156,18,9,0,3,534,456,384,308,0,4,121,122,11,7,0,0,18,15,400,322,11,9,126,130,15,8,0,2,75,64,40,35,0,0,25,22,0,0,0,0,10,7,50,41,4,2,28,26,3,1,0,1
665,50,6,5,California,Amador County,12,0,39752,21638,18114,19053,16583,955,111,554,371,318,347,61,55,697,647,19671,17173,1124,250,917,727,515,528,130,107,18021,15978,15959,14783,902,92,362,244,259,316,39,44,500,499,16396,15234,1023,201,599,502,421,475,95,86,3617,2136,3094,1800,53,19,192,127,59,31,22,11,197,148,3275,1939,101,49,318,225,94,53,35,21
893,50,6,7,California,Butte County,12,0,219186,108473,110713,92754,94988,2316,1842,2730,2817,5461,5523,316,312,4896,5231,97324,99889,3738,3349,4806,5119,7157,7246,785,752,89311,92144,76229,79186,2011,1515,1657,1733,5246,5327,229,236,3939,4147,79924,83077,3236,2795,3104,3339,6751,6847,618,593,19162,18569,16525,15802,305,327,1073,1084,215,196,87,76,957,1084,17400,16812,502,554,1702,1780,406,399,167,159
1121,50,6,9,California,Calaveras County,12,0,45905,22847,23058,20794,20958,303,195,428,477,390,469,58,64,874,895,21574,21770,472,364,881,977,675,724,136,132,19821,20117,18213,18459,261,159,271,291,310,409,44,54,722,745,18854,19126,407,299,620,691,569,648,110,112,3026,2941,2581,2499,42,36,157,186,80,60,14,10,152,150,2720,2644,65,65,261,286,106,76,26,20
1349,50,6,11,California,Colusa County,12,0,21547,10975,10572,10011,9619,164,118,312,282,152,185,63,67,273,301,10249,9884,247,211,430,425,232,260,95,96,4210,4319,3649,3695,127,93,142,151,112,154,34,39,146,187,3772,3860,182,157,188,223,164,212,53,57,6765,6253,6362,5924,37,25,170,131,40,31,29,28,127,114,6477,6024,65,54,242,202,68,48,42,39
1577,50,6,13,California,Contra Costa County,12,0,1153526,564187,589339,372197,379086,52267,57797,5984,5725,99630,111488,3416,3455,30693,31788,399287,407030,61879,67818,12804,13023,118061,130291,6633,6803,414051,439055,242832,249561,47774,53024,1571,1555,96077,107968,2626,2753,23171,24194,263136,270682,55040,60702,5366,5845,111101,123341,5063,5312,150136,150284,129365,129525,4493,4773,4413,4170,3553,3520,790,702,7522,7594,136151,136348,6839,7116,7438,7178,6960,6950,1570,1491
1805,50,6,15,California,Del Norte County,12,0,27812,15186,12626,11674,10046,884,96,1415,1277,410,451,23,31,780,725,12387,10722,1033,229,1869,1719,620,617,75,70,11581,10635,8690,8546,844,73,1064,995,381,421,19,23,583,577,9220,9080,959,177,1370,1327,564,577,64,57,3605,1991,2984,1500,40,23,351,282,29,30,4,8,197,148,3167,1642,74,52,499,392,56,40,11,13
2033,50,6,17,California,El Dorado County,12,0,192843,96158,96685,85557,85379,1142,858,1340,1234,4128,5189,223,213,3768,3812,89067,88956,1824,1551,2772,2682,5914,6965,566,555,83368,84097,74547,74356,967,729,766,734,3956,5018,167,161,2965,3099,77327,77277,1511,1310,1707,1746,5524,6611,453,448,12790,12588,11010,11023,175,129,574,500,172,171,56,52,803,713,11740,11679,313,241,1065,936,390,354,113,107
2261,50,6,19,California,Fresno County,12,0,999101,498648,500453,382292,382910,29301,28675,14838,14887,54900,56453,1422,1393,15895,16135,396322,397119,34389,33833,21051,21307,61661,63154,2722,2718,227406,234515,139958,146091,23641,22633,2891,3076,50943,52487,714,723,9259,9505,148081,154409,26887,25942,5393,5820,55612,57118,1535,1563,271242,265938,242334,236819,5660,6042,11947,11811,3957,3966,708,670,6636,6630,248241,242710,7502,7891,15658,15487,6049,6036,1187,1155


In [17]:

#//*** Convert Applicable federal based census codes to California Census Codes.
#//*** Description of Federal Column Values
#//*** https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2019/cc-est2019-alldata.pdf
#//*** Census Data: https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/

#//*** Notably, Federal census regards Hispanic as an ethnicity not a race. For Example: People can be Hispanic White,
#//*** Hispanic Black, or Hispanic Asian.
#//*** California treats all hispanics as Latino
#//*** Latino = H_MALE, H_FEMALE Hispanic
#//*** White - NHWA_MALE, NHWA_FEMALE (Not Hispanic White)
#//*** Asian - NHAA_MALE, NHAA_FEMALE (Not Hispanic Asian) 
#//*** Black - NHBA_MALE, NHBA_FEMALE (Not Hispanic Black) 

#//*** Amer Indian - NHIA_MALE, NHIA_FEMALE (Not Hispanic, American Indian) 

#//*** Hawaiian - NHNA_MALE, NHNA_FEMALE (Not Hispanic, Hawaiian) 

#//*** California has the following columns: Multiracial, Other, Multirace. I could not find a good definition of these
#//*** These represent less than 5% of the population. Small but not too small to be ignored. These will combined into
#//*** Single attribute Other and combined with NHTOM_MALE, NHTOM_FEMALE - Not Hispanic Two or more races

#//*** Build a new data frame to hold the sanitized values.
pop_attrib_df = pd.DataFrame()
df_list.append(pop_attrib_df)
#//*** The County Fibs code is shared between the federal census data and the Community Resilliance Estimate
pop_attrib_df['cty_fibs'] = raw_ethnic_pop_df['COUNTY']

#//*** County Name will be the Common attribute to link to the timeseries Data.
#//*** Standardize the County name. Remove County from the column name 
pop_attrib_df['county'] = raw_ethnic_pop_df['CTYNAME'].str.replace(" County","")
pop_attrib_df['population'] = raw_ethnic_pop_df['TOT_POP']

clean_cols = { 'Latino' : ['H_MALE', 'H_FEMALE'], 
              'White' : ['NHWA_MALE', 'NHWA_FEMALE'],
              'Asian' : ['NHAA_MALE', 'NHAA_FEMALE'],
              'Black' : ['NHBA_MALE', 'NHBA_FEMALE'],
              'Native' : ['NHIA_MALE','NHIA_FEMALE'],
              'Hawaiian' : ['NHNA_MALE', 'NHNA_FEMALE'],
              'Multiracial' : ['NHTOM_MALE', 'NHTOM_FEMALE']
            
            }

#//*** Combine male and female columns and store to column with same name as California Data
#//*** Loop through the clean_cols dictionary, key is California name, value is Federal columns to combine
#//*** These are the easy 1:1 columns
#//*** Hawaiian and Other will need adjustment in the Califnornia Side of the Dataset.


#//*** California Column name = Federal category male + Federal Category female
for ca_name,fed_names in clean_cols.items():
    pop_attrib_df[ca_name] = raw_ethnic_pop_df[fed_names[0]] + raw_ethnic_pop_df[fed_names[1]] 

#              'Native Hawaiian or Pacific Islander' :
#              'Native Hawaiian and other Pacific Islander'
#            'Other'

#//*** Assign the index to the county fibs number
pop_attrib_df = pop_attrib_df.set_index('cty_fibs')



In [18]:
#//*** Merge Population Attributes with COVID County info
#//*** Only Merge if we haven't merged yet. I got 99 iPython problems but this aint one.
if "Latino" not in ca_covid_df.columns:
    ca_covid_df = pd.merge(ca_covid_df,pop_attrib_df,how="left",on=['county'])


#//*** Build per 100k Stats
ca_100k_df = ca_covid_df.copy()
df_list.append(ca_100k_df)

#//*** Define Population Columns to convert to 100k. These Columns shouldn't change. Trying to setup a flexible
#//*** Systems where I can add other attributes later if needed
population_cols = [ 'population','Latino', 'White', 'Asian', 'Black', 'Native', 'Hawaiian','Multiracial' ]

#//*** Convert Popultion values to 100k units. ie divide by 100,000
for col in population_cols:
    ca_100k_df[col] = ca_100k_df[col]/100000



#//*** Convert cases, deaths, test to per 100k units
attrib_cols = ['date','county']

#//*** Ignore values in attrib_cols, and population_cols
#//*** Convert remianing attributes to values per 100,000.
#//*** This method makes it easier to change the 100k attributes later.
for col in ca_100k_df.columns:
    if col not in attrib_cols and col not in population_cols:
        #//*** Convert column to per 100k value. Which is Columns value divided population per 100k
        ca_100k_df[col] = ca_100k_df[col]/ca_100k_df['population'] 
"""
plt.rcParams['figure.figsize'] = [50,20]
#//*** Check our Work.
#//*** Cases per 100k should be relatively similar in values.
display_size = 40
fig,ax = plt.subplots()

for county in ca_100k_df['county'].unique():
    
    loop_df = ca_100k_df[ca_100k_df['county'] ==  county]
    ax.plot(loop_df['date'],loop_df['cases'].rolling(5).mean(),label=county)


    plt.xticks(rotation=30,fontsize=display_size)
    plt.yticks(fontsize=display_size)
handles,labels = deduplicate_legend(ax)
plt.legend(fontsize=display_size*.25,loc='upper left')
plt.title(f"Scaled County Data (per 100k)",fontsize=display_size)
#plt.ylabel("Total Cases by County (millions)",fontsize=display_size)
plt.show()

"""

'\nplt.rcParams[\'figure.figsize\'] = [50,20]\n#//*** Check our Work.\n#//*** Cases per 100k should be relatively similar in values.\ndisplay_size = 40\nfig,ax = plt.subplots()\n\nfor county in ca_100k_df[\'county\'].unique():\n    \n    loop_df = ca_100k_df[ca_100k_df[\'county\'] ==  county]\n    ax.plot(loop_df[\'date\'],loop_df[\'cases\'].rolling(5).mean(),label=county)\n\n\n    plt.xticks(rotation=30,fontsize=display_size)\n    plt.yticks(fontsize=display_size)\nhandles,labels = deduplicate_legend(ax)\nplt.legend(fontsize=display_size*.25,loc=\'upper left\')\nplt.title(f"Scaled County Data (per 100k)",fontsize=display_size)\n#plt.ylabel("Total Cases by County (millions)",fontsize=display_size)\nplt.show()\n\n'

In [19]:
#//*** Build a list of counties ordered by total COVID prevalence (most Cases per 100k) 

#//*** Get the Statewide 100k value. 
#//*** Get total Case Count from orig_df, dvided by total population / 100000
state_100k = ca_covid_orig_df['cases'].sum()/(ca_covid_orig_df['population'].unique().sum()/100000)
df_list.append('state_100k')

county_list = ca_100k_df['county'].unique()

county_100k = []

#//*** Get a list of counties with population greater than 100,000
for county in county_list:
    if ca_100k_df[ca_100k_df['county']==county].iloc[0]['population'] > 1:
        county_100k.append(county)
case_totals = []

#//*** Get the total Cases for each county per 100k
for county in county_100k:
    case_totals.append(ca_100k_df[ca_100k_df['county']==county]['cases'].sum())

#//*** Build a series with the county name and the total per100k value for each county. Sort by the Prevalence value.
ts = pd.Series(index = county_100k, data=case_totals).sort_values(ascending=False)
df_list.append(ts)

In [20]:
#//******************************************************************************************************************************************************************************
#//*** Build: ca_cases_broad_df
#//*** Counties are converted to per day attributes. The Values are a single CA_COVID_value for the whole table.
#//*** The Broad tables are needed for building linear regressions across the whole State. The Coefficients are used to generate granualar data about each county per day
#//*** This DF looks at building COVID cases per day, using ca_covid_df and ca_race_df
#//******************************************************************************************************************************************************************************
#//*** this_df is a placeholder to ease changing dataframes as needed
this_df = ca_covid_df

#//*** The column used to build broad statistic
case_col_col = 'cases'

#//*** The column used to build broad statistic
race_col_col = 'cases'

broad_case_dict = {}
for group in this_df[['date','county',case_col_col]].groupby('date'):
    
    #//*** Initialize the dict
    if len(broad_case_dict.keys()) == 0:
        broad_case_dict['date'] = []
        broad_case_dict['total'] = []
        newcols = list(group[1][['county',case_col_col]].transpose().iloc[0])
        for col in newcols:
            broad_case_dict[col] = []
           
    #//*** Add Date
    broad_case_dict['date'].append(group[0])
    broad_case_dict['total'].append(group[1][case_col_col].sum())
    loop_df = group[1][['county',case_col_col]].transpose() 
    loop_df.columns = newcols
    
    for col in newcols:
        broad_case_dict[col].append((loop_df[col].iloc[1]))

ca_cases_broad_df = pd.DataFrame()

ca_cases_broad_df['date'] = broad_case_dict['date']
ca_cases_broad_df['total'] = broad_case_dict['total']
for col in ts.index.values:
    ca_cases_broad_df[col] = broad_case_dict[col]

print(ca_cases_broad_df[-30:])



#//*** Build: ca_cases_broad_df
#//*** Each Statewide Race as a column with COVID Cases by date
broad_race_dict = {}
this_df = ca_race_df
for group in this_df[['date','race',race_col_col]].groupby('date'):
    #print(group[1])
    
    #//*** Initialize the dict
    if len(broad_race_dict.keys()) == 0:
        broad_race_dict['date'] = []
        newcols = list(group[1].transpose().iloc[1])
        for col in newcols:
            broad_race_dict[col] = []

    
    #//*** Add Date
    broad_race_dict['date'].append(group[0])
    
    loop_df = group[1].transpose()
    loop_df.columns = newcols
    #print(loop_df)
    for col in newcols:
        broad_race_dict[col].append((loop_df[col].loc[race_col_col]))

ca_races_broad_df = pd.DataFrame()

for key,value in broad_race_dict.items():
    ca_races_broad_df[key] = value

print(ca_races_broad_df[-30:])
    

    

          date   total  Imperial  Kings  San Bernardino  Los Angeles  Riverside  Merced  Kern  Tulare  Madera  Fresno  Stanislaus  Monterey  Ventura  San Joaquin  San Diego  Orange  Santa Barbara  \
370 2021-04-18  1392.0      23.0    3.0            52.0        251.0      110.0     0.0   5.0    36.0     3.0     0.0        11.0       4.0     31.0         41.0       86.0     1.0           31.0   
371 2021-04-19  2366.0     105.0   81.0           173.0          3.0       12.0    36.0  31.0    51.0   447.0    57.0        13.0      11.0      0.0         21.0      137.0    10.0          281.0   
372 2021-04-20  2110.0      21.0    0.0            82.0          1.0       91.0     2.0   1.0    19.0   372.0    68.0         0.0      11.0     10.0         77.0        7.0    15.0           15.0   
373 2021-04-21  1948.0       1.0    0.0            76.0          6.0       90.0     1.0  29.0    10.0     2.0    57.0        20.0       0.0      8.0         24.0       94.0    17.0           88.0   
374 2

          date    Native       Asian      Black       Latino  Multiracial   Hawaiian       White
370 2021-04-18  4.734258   96.188358  58.061101   774.199002    23.721152   7.762943  282.394418
371 2021-04-19  8.046881  163.509721  98.753957  1315.853293    40.338011  13.200141  480.062297
372 2021-04-20  7.177374  145.834923  88.136045  1173.464040    35.979189  11.769839  428.232651
373 2021-04-21  6.632428  134.645853  81.410079  1083.531681    33.291869  10.857425  395.440266
374 2021-04-22  6.226908  126.481454  76.524121  1017.982138    31.296270  10.199115  371.515171
375 2021-04-23  6.112580  124.159952  75.191513   999.012762    30.745151  10.005796  364.680982
376 2021-04-24  4.525936   91.886029  55.690036   739.250386    22.770413   7.399430  269.883507
377 2021-04-25  3.960584   80.407768  48.758480   646.950947    19.941625   6.472987  236.171561
378 2021-04-26  7.014829  142.435127  86.443166  1145.846809    35.323228  11.465549  418.363376
379 2021-04-27  6.220941  126.

In [None]:
df_list.append('state_coef_df')

In [110]:
#//***************************************************************************************************************************************************************************
#//*** Generatate a State Coeffecient. These are best used to check our work.
#//*** The State Coeefficients are useful for comparing the sum of racial coefficients. If the regressions are consistent, the state coefficient should be close to the 
#//*** sum of the racial coeffificents
#//***************************************************************************************************************************************************************************

#//*** Build the coefficients for the entire data set. Each day will calculate the coefficients from the previous 30 days. The First 30 days will use one set of coeffients. The rest will use
#//*** The current day: -30 to generate the coefficients. This will be an overfitted solution which is exactly what we are going for.

#//*** Abstract Dataframes to Left and Right for reusability
left_df = ca_races_broad_df
right_df = ca_cases_broad_df

#//*** Get Coefficients for counties predicting the State total Case values
output_df = pd.DataFrame()

#//*** County Columns start at Index 3
x_col_index = 3

#//*** Target Independent Column Index. Statewide numbers begin at column 2
y_col_index = 2

#//*** Sample size
#modeling_days = 30

#//*** Combining with car_race_df and Latino value. It's not strictly needed, but the additional column will make combining the dataframes later, easier.
#//*** Reusing Code: This loop only needs to run once
for race in ['Latino']:
    
    #//*** Build model for the first 30 days, combines the race from ca_race (which is only needed as an extra field, to evenly space the columms)
    model_df = left_df[['date',race]].merge(right_df, left_on='date', right_on='date')[:30]
    #print(model_df)
    
    #//*************************
    #//*** BEGIN regression
    #//*************************
    
    #//**** Build the x Values - Dependent Variables. These will be all the counties which start at the column index 
    #//*** The X Value is the index where the attributes start, there are 58 of them :)
    x_column = model_df.columns[x_col_index:]

    #//*** Build the X attributes using the x_column. These are separated for readability and modularity
    x_model = model_df[x_column]
    

    #//*** Build the independent variable using the Index Column defined above as y_col_index.
    y_column = model_df.columns[y_col_index]
    
    #//*** Build the Y model using the y_column attribute. This is less readable and intuitive, but it lets the columns be 
    #//*** easily assigned at the top of this section
    y_model = model_df[y_column]
    
    #//*** Define the Linear Model
    regr = linear_model.LinearRegression()
    
    #//*** Make Regression Magic
    regr.fit(x_model, y_model)

    #//*** Apply the regression coefficients
    model_df[x_column] = (model_df[x_column])*regr.coef_
    
    #//*** Add the First 30 days Model to the output_df dataframe
    output_df = pd.concat([output_df,model_df])
    
    #//*** Checking our work. The sum of the coefficients * cases + intercept should be close the independent value in Total Cases.
    print("Checking our Work. These values should be close:")
    print(model_df.iloc[1]['total'], " == ", model_df[x_column].iloc[1].sum()+regr.intercept_)    

    #//*** Build each day individually, based on the previous 30 days
    #//*** Start at index 31 
    for index in range(31,len(left_df)):
        
        #//*** Define the start and indexes for linear modeling. This is the row_index - 30
        min_index = index-30
        
        #//*** Build model_df using min_index and index as a 30 day range)
        model_df = left_df[['date',race]].merge(right_df, left_on='date', right_on='date')[min_index:index]
        
        #//*** Build the X attributes using the x_column. These are separated for readability and modularity
        x_model = model_df[x_column]

        #//*** Build the Y model using the y_column attribute. This is less readable and intuitive, but it lets the columns be 
        #//*** easily assigned at the top of this section
        y_model = model_df[y_column]

        #//*** Build a New the Linear Model
        regr = linear_model.LinearRegression()

        #//*** Make Regression Magic
        regr.fit(x_model, y_model)

        #//*** Apply the regression coefficients to all columns, even though we only need the last one
        model_df[x_column] = (model_df[x_column])*regr.coef_
        model_df[model_df.columns[1]] = regr.intercept_
        #//*** Add the last day of model_df to output_df. 
        #//*** It's not exactly efficient, but it is functional
        #output_df = pd.concat([output_df,model_df.iloc[-1]])
        output_df = output_df.append(model_df.iloc[-1])
        
        
state_coef_df = output_df.copy()
state_coef_df.columns = [ ['date','intercept','total'] + list(x_column) ]
print("State Coef")
print(state_coef_df)
#//*** Eliminate lingering temp variables
del output_df
del model_df

Checking our Work. These values should be close:
1480.0  ==  1479.9999999999995
State Coef
          date   intercept   total   Imperial      Kings San Bernardino Los Angeles   Riverside     Merced        Kern      Tulare      Madera      Fresno  Stanislaus    Monterey    Ventura  \
0   2020-04-13  543.795089  1511.0 -65.166825   2.567514     311.589582  299.939454  -51.186829   9.827685  147.453608  -35.188892    0.819586  -68.644198  -10.778890   14.916608   3.099949   
1   2020-04-14  550.286254  1480.0  -0.000000   3.851271     303.389856  300.784354   -3.393602   0.000000    0.000000   -0.000000    3.278344  -52.074908   -0.000000    6.392832   2.169964   
2   2020-04-15  627.914027  1661.0 -59.242568   3.851271     389.486977    0.000000   -0.848400   0.000000    4.607925   -3.518889  732.709784   -0.000000   -0.000000    1.065472   1.239979   
3   2020-04-16  636.565208  1653.0 -71.091082   0.000000     282.890542    0.000000   -2.545201   0.000000   96.766430   -3.518889  670.4

In [111]:
#//*** Check our work with a selection of ilocs. Sum of the Coefficients + intercept should be close to the total values.
#//*** It's very close, very low error

for x in [50,100,150,200,250,300]:
    print(f"{state_coef_df.iloc[x]['date'].values[0]} - {state_coef_df[x_column].iloc[x].sum() + state_coef_df.iloc[x]['intercept'].values[0]} == {state_coef_df.iloc[x]['total'].values[0]}")
 

2020-06-02T00:00:00.000000000 - 3895.000000000001 == 3895.0
2020-07-22T00:00:00.000000000 - 9359.0 == 9359.0
2020-09-10T00:00:00.000000000 - 3249.000000000002 == 3249.0
2020-10-30T00:00:00.000000000 - 5514.000000000001 == 5514.0
2020-12-19T00:00:00.000000000 - 29476.99999999998 == 29477.0
2021-02-07T00:00:00.000000000 - 5121.000000000025 == 5121.0


In [26]:
"""
#// State Coefficient Original / single day sample code.
#//***************************************************************************************************************************************************************************
#//*** Generatate a State Coeffecient. These are best used to check our work.
#//*** The State Coeefficients are useful for comparing the sum of racial coefficients. If the regressions are consistent, the state coefficient should be close to the 
#//*** sum of the racial coeffificents
#//***************************************************************************************************************************************************************************

#//*** Get Coefficients for counties predicting the State total Case values
state_coef_df = pd.DataFrame()


df_list.append('state_coef_df')

#//*** Reusing Code: This loop only needs to run once
for race in ['Latino']:
    
    #//*** County Columns start at Index 4
    x_col_index = 3
    
    #//*** Target Independent Column Index. Statewide numbers begin at column 3
    y_col_index = 2
    
    #//*** Sample size
    modeling_days = -30
    
    #//*** Set True if running a regression on a single column of index value
    single_col = False
    

    
    model_df = ca_races_broad_df[['date',race]].merge(ca_cases_broad_df, left_on='date', right_on='date')[modeling_days:]
    print(model_df.head(1))
    #//*** Get the Columns for the X dataframe.
    if single_col:
        #//*** The X Value is the index where the attributes start, there are 58 of them :)
        x_column = model_df.columns[x_col_index]

        #//*** Define the X attributes and 
        x_model = np.array(model_df[x_column]).reshape(-1, 1)
        #x_model = preprocessing.scale(x_model) 
    else:
        #//*** The X Value is the index where the attributes start, there are 58 of them :)
        x_column = model_df.columns[x_col_index:]

        #//*** Define the X attributes and 
        x_model = model_df[x_column]
        #x_model = preprocessing.scale(x_model) 
        
    
    y_column = model_df.columns[y_col_index]
    y_model = model_df[y_column]
    
    regr = linear_model.LinearRegression()
    regr.fit(x_model, y_model)

    y_predict = regr.predict(x_model)

    pred_df = pd.DataFrame()
    
    pred_df['actual'] = list(y_model)
    pred_df['predict'] = y_predict
    
    
    #calculate RMSE
    rmse = sqrt(mean_squared_error(pred_df['actual'], pred_df['predict'])) 
    if single_col:
        index = ['race','intercept'] + list([model_df.columns[x_col_index]])
    else:
        index = ['race','intercept'] + list(model_df.columns[x_col_index:])
    data = [race,regr.intercept_] + list(regr.coef_)
    
    #print(index)
    #print(data)
    #calculate RMSE
    rmse = sqrt(mean_squared_error(pred_df['actual'], pred_df['predict'])) 
    state_coef_df = pd.concat([state_coef_df,pd.Series(data, index=index).to_frame().transpose()])
    
    print("RMSE: ",race, " - ", rmse )
#print(pred_df)
#print(model_df)
#print(ca_races_broad_df[-30:-29])
#print(ca_cases_broad_df[-30:-29])

print(state_coef_df)
"""
print()




In [27]:
print(state_coef_df)
race_coef_df = pd.DataFrame()


#//*** Build a table of Modeled Racial co-efficients for the past 30 days
races = list(ca_races_broad_df.columns[1:])

for race in races:
#for race in ['Latino']:

    
    #//*** Fit a Random forest Regression for the past 30 days
    #https://medium.datadriveninvestor.com/random-forest-regression-9871bc9a25eb
    # Fitting Random Forest Regression to the dataset
    # import the regressor


    x_col_index = 3
    y_col_index = 1
    modeling_days = -30
    single_col = False
    

    
    model_df = ca_races_broad_df[['date',race]].merge(ca_cases_broad_df, left_on='date', right_on='date')[modeling_days:]
    
    #print(model_df.head(1))
    
    #//*** Get the Columns for the X dataframe.
    if single_col:
        #//*** The X Value is the index where the attributes start, there are 58 of them :)
        x_column = model_df.columns[x_col_index]

        #//*** Define the X attributes and 
        x_model = np.array(model_df[x_column]).reshape(-1, 1)
        #x_model = preprocessing.scale(x_model) 
    else:
        #//*** The X Value is the index where the attributes start, there are 58 of them :)
        x_column = model_df.columns[x_col_index:]

        #//*** Define the X attributes and 
        x_model = model_df[x_column]
        #x_model = preprocessing.scale(x_model) 
        
    
    y_column = model_df.columns[y_col_index]
    y_model = model_df[y_column]
    
    #print(y_column)
    #print(y_model)
    
    regr = linear_model.LinearRegression()
    regr.fit(x_model, y_model)

    y_predict = regr.predict(x_model)

    pred_df = pd.DataFrame()
    
    pred_df['actual'] = list(y_model)
    pred_df['predict'] = y_predict
    
    
    #calculate RMSE
    rmse = sqrt(mean_squared_error(pred_df['actual'], pred_df['predict'])) 
    if single_col:
        index = ['race','intercept'] + list([model_df.columns[x_col_index]])
    else:
        index = ['race','intercept'] + list(model_df.columns[x_col_index:])
    data = [race,regr.intercept_] + list(regr.coef_)
    
    #print(index)
    #print(data)
    #calculate RMSE
    rmse = sqrt(mean_squared_error(pred_df['actual'], pred_df['predict'])) 
    race_coef_df = pd.concat([race_coef_df,pd.Series(data, index=index).to_frame().transpose()])
    
    #print("RMSE: ",race, " - ", rmse )
    
print(ca_races_broad_df[-30:-29])
print(ca_cases_broad_df[-30:-29])
print(state_coef_df)
print(race_coef_df)


     race    intercept Imperial    Kings San Bernardino Los Angeles Riverside   Merced      Kern   Tulare  Madera   Fresno Stanislaus  Monterey   Ventura San Joaquin San Diego   Orange  \
0  Latino -1.13687e-12  1.12441  3.79273        3.90709    -1.41156   1.05243 -4.87727 -0.218208  3.50032 -1.3224  2.43792    4.09191 -0.727124 -0.197336     1.86216  0.928327 -1.13058   

  Santa Barbara San Luis Obispo   Solano      Napa   Shasta Sacramento     Yolo Santa Clara   Sonoma Contra Costa Santa Cruz   Placer    Butte San Mateo    Marin El Dorado  Alameda San Francisco  \
0       2.41512        -1.75541  1.19327 -0.462217  3.64995    2.27829  1.23148     2.64486  3.73973      1.93951   -2.67122 -4.18906  1.84201   5.88375  7.45894   5.33358 -1.26266       4.57431   

   Humboldt  
0  0.307763  
          date    Native      Asian      Black      Latino  Multiracial  Hawaiian       White
370 2021-04-18  4.734258  96.188358  58.061101  774.199002    23.721152  7.762943  282.394418
          

In [28]:
#print(ca_cases_broad_df[-30:-29].transpose()[2:])
sample_cases = ca_cases_broad_df[-30:-29].transpose()[2:]
latino_coef = race_coef_df.iloc[3].transpose()[2:]
coef_df = pd.concat([latino_coef,sample_cases],axis=1)
coef_df.columns = ['coef','cases']
coef_df['product'] = (coef_df['coef'] * coef_df['cases']) + 0.278053
print(coef_df['product'].sum())
print(coef_df['cases'].sum())

#print(state_coef_df['Los Angeles']*507)
#print(race_coef_df['Los Angeles']*507)

783.9308574661849
1058.0


In [29]:
print(ca_cases_broad_df.iloc[1][2:].sum())
print(ca_cases_broad_df.iloc[1][2:])

1203.0
Imperial             0
Kings                3
San Bernardino      74
Los Angeles        712
Riverside           12
Merced               0
Kern                 0
Tulare               0
Madera               4
Fresno              22
Stanislaus           0
Monterey             6
Ventura              7
San Joaquin          5
San Diego           83
Orange               0
Santa Barbara       42
San Luis Obispo     41
Solano               6
Napa                 0
Shasta              39
Sacramento          38
Yolo                 6
Santa Clara          1
Sonoma              21
Contra Costa        18
Santa Cruz           1
Placer               0
Butte                0
San Mateo           19
Marin               25
El Dorado            2
Alameda              2
San Francisco       14
Humboldt             0
Name: 1, dtype: object


In [30]:
#//*** Sample Target Day For testing
t_day_cases_df =  ca_cases_broad_df[-30:-29]
t_day_coef_df = state_coef_df

cases_cols = t_day_cases_df.columns[2:]

day_total = t_day_cases_df['total'].values[0]
day_intercept = t_day_coef_df['intercept'].values[0]

print(day_total," ", day_intercept)
#print(t_day_cases_df[cases_cols])
#print(t_day_coef_df[cases_cols])
t_state_df = pd.concat([t_day_cases_df[cases_cols],t_day_coef_df[cases_cols]])
#print(t_state_df)
results = np.array( (t_day_cases_df[cases_cols])+day_intercept) *np.array(t_day_coef_df[cases_cols])
results = results[0]
#print(pd.DataFrame(data=results, index = cases_cols).transpose())
print(results.sum())
#results2 = results/day_total
#print(results2.sum())
#t_state_df = pd.concat([t_state_df,pd.DataFrame(data=results, index = cases_cols).transpose(),pd.DataFrame(data=results2, index = cases_cols).transpose()])
t_state_df = pd.concat([t_state_df,pd.DataFrame(data=results, index = cases_cols).transpose()])
t_state_df.index = ['cases','state_coef','state_impact']
print(t_state_df)
#print(t_state_df.append((pd.Series(results, index = cases_cols))))
#print(np.array(t_day_cases_df[cases_cols])*np.array(t_day_coef_df[cases_cols])[0])

print(1392 * .0185787)
print(1392 * -0.254527)
print(251 *  -0.254527)

1392.0   -1.1368683772161603e-12
1391.9999999999475
             Imperial    Kings San Bernardino Los Angeles Riverside       Merced      Kern   Tulare   Madera      Fresno Stanislaus  Monterey   Ventura San Joaquin San Diego   Orange Santa Barbara  \
cases              23        3             52         251       110            0         5       36        3           0         11         4        31          41        86        1            31   
state_coef    1.12441  3.79273        3.90709    -1.41156   1.05243     -4.87727 -0.218208  3.50032  -1.3224     2.43792    4.09191 -0.727124 -0.197336     1.86216  0.928327 -1.13058       2.41512   
state_impact  25.8615  11.3782        203.168    -354.301   115.767  5.54482e-12  -1.09104  126.011 -3.96719 -2.7716e-12     45.011   -2.9085  -6.11742     76.3485   79.8361 -1.13058       74.8686   

             San Luis Obispo   Solano      Napa   Shasta Sacramento     Yolo Santa Clara   Sonoma Contra Costa Santa Cruz   Placer    Butte San Mat

In [31]:
race_impact_df = pd.DataFrame()
print(t_day_cases_df[cases_cols].values[0])
for row in race_coef_df.iterrows():
    race = row[1][0] 
    intercept = row[1][1]
    counties = row[1][2:]
    #print(race)
    #print(intercept)
    #print((counties))
    #print((counties.values+intercept))
    tdf = pd.DataFrame((counties.values+intercept)*t_state_df.iloc[0].values,index=counties.index).transpose()
    tdf.index=[f"impact_{race}"]
    race_impact_df = pd.concat([race_impact_df,tdf])
    
    #break
#print(race_impact_df['Imperial'].sum())
#print(race_impact_df['Kings'].sum())
#print(race_impact_df.sum())
#print(race_impact_df)

race_percent_df = race_impact_df.copy()

for col in race_percent_df.columns:
        race_percent_df[col] = race_percent_df[col] / race_percent_df[col].sum()
print(race_percent_df)


[ 23.   3.  52. 251. 110.   0.   5.  36.   3.   0.  11.   4.  31.  41.
  86.   1.  31.  10.  11.   4. 148.   6.   6.   5.  20.  71.  21.   1.
  17.  25.   1.   0.   0.  21.   3.]
                      Imperial       Kings San Bernardino Los Angeles   Riverside Merced         Kern      Tulare      Madera Fresno  Stanislaus    Monterey    Ventura San Joaquin   San Diego  \
impact_Native       0.00399307  0.00354923     0.00365312  0.00588249  0.00119888    NaN  -0.00882572  0.00541838  0.00499831    NaN  0.00334706  0.00331459  0.0307644  0.00916233  0.00262859   
impact_Asian         0.0772954   0.0749235      0.0768915   0.0821815   0.0702959    NaN    0.0514357   0.0817292   0.0795861    NaN   0.0732283   0.0741238   0.130284   0.0941001   0.0719863   
impact_Black         0.0461221   0.0521497      0.0462937   0.0456072   0.0549788    NaN    0.0277877   0.0451342     0.04588    NaN   0.0520644   0.0503784  0.0851803   0.0294112   0.0514653   
impact_Latino         0.620911    0.62191

In [32]:
for x in dir(regr):
    print(regr)

LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegression()
LinearRegres

In [33]:
#//*** Coefficients predicting race percentage
print(race_coef_df)

          race    intercept    Imperial      Kings San Bernardino Los Angeles   Riverside     Merced         Kern     Tulare      Madera     Fresno Stanislaus    Monterey     Ventura San Joaquin  \
0       Native -3.55271e-15  0.00413561  0.0134564      0.0127787 -0.00751304  0.00116524 -0.0282926   0.00201367   0.016977 -0.00622244  0.0114267  0.0131443 -0.00241485 -0.00621493    0.012928   
0        Asian -2.84217e-14   0.0800546   0.284063       0.268969   -0.104961   0.0683234  -0.365799   -0.0117355   0.256077  -0.0990775   0.174308   0.287577  -0.0540031  -0.0263195    0.132775   
0        Black -1.42109e-14   0.0477685   0.197719       0.161937  -0.0582489   0.0534361  -0.181285     -0.00634   0.141416  -0.0571164   0.101047   0.204463  -0.0367033  -0.0172079    0.041499   
0       Latino            0    0.643076    2.35791        2.18161   -0.738153    0.650172   -2.44264    -0.203273    1.83972   -0.743398    1.26951    2.47481    -0.45929 -0.00875005    0.729332   
0  Multira

In [34]:
print(race_coef_df['Imperial']*3)

#Target Deaths: 2351
#LA Deaths: 507
#LA COEF:  -1.86145819
#inter: 2024

print("Native: ",0.668577 + 12)
print("Native: ",(0.668577 + 12)/12)

print("Latino: ",-25.8728 + 2024)
print("Latino: ",(-25.8728 + 2024)/2024)

0    0.0124068
0     0.240164
0     0.143305
0      1.92923
0    0.0706852
0    0.0195156
0     0.691787
Name: Imperial, dtype: object
Native:  12.668576999999999
Native:  1.05571475
Latino:  1998.1272
Latino:  0.9872169960474307


In [35]:
#//***********************
#//*** Working Data Sets
#//*********************************
#//*** County Population by Race
#//*********************************

#print(pop_attrib_df.head(20))




In [36]:
#//***********************************************************************************************************************
#//*** Build the Model for all Counties.
#//***********************************************************************************************************************
#//*** Model takes the daily cases, and estimates the racial cases based on the portion of the population.
#//*** Example: 100 cases and Latino is .55. Latinos would be assigned 55 cases.
#//***********************************************************************************************************************
#//*** The expected racial value is adjusted by the statewide racial percentage
#//*** Example: If Latinos comprised .25 of the total State Cases, The modeled Latino cases would be 68.75 (55 *.25)
#//***********************************************************************************************************************

#//*** Build a dataframe to hold the expected cases
ca_covid_est_case_df = ca_covid_df.copy()


#//*** Process each race
for race in ca_race_df['race'].unique():
    #ca_covid_est_case_df[race] =  ( (ca_covid_est_case_df[race] / ca_covid_est_case_df['population'])*ca_covid_est_case_df['cases'] ) + ( ( (ca_covid_est_case_df[race] / loop_df['population'])*ca_covid_est_case_df['cases'] ) * ca_race_diff[race].values )
    
    #//*** Build the racial population percentage for each county
    #//*** Racial Population / total Population = County racial portion
    ca_covid_est_case_df[race] = (ca_covid_est_case_df[race] / ca_covid_est_case_df['population']) 
    
    #//*** Build the Expected COVID cases based on population percentage
    #//*** Case_[race] columns. County COVID cases * Racial portion 
    ca_covid_est_case_df[f'case_{race}'] = ca_covid_est_case_df[race] * ca_covid_est_case_df['cases']
 

#//*** Build model Dataframe. This holds the adjusted cases
#//*** This workflow is awkward because I"m building it for each date. I can probably do this better by merging the ca_race_diff dataframe.
ca_model_cases_df = pd.DataFrame()

#//*** Process each Date 
for group in ca_covid_est_case_df.groupby('date'):
    loop_date = group[0]
    loop_df = group[1]
    
    #//*** Get the race differential values on the given date
    loop_race_diff = ca_race_diff[ ca_race_diff['date']==loop_date] 
    
    #//*** Process the race data for each racial attribute.
    #//*** Each loop processes a racial attribute and adds an adjusted_[race] value.
    for race in ca_race_df['race'].unique():
        #//*** Get the modifier that race on the given day
        race_modifier = loop_race_diff[race].values[0]
        
        #//*** expected value + (expected value * modifier)
        #//*** Modifier can be positive or negative.
        loop_df[f'adj_{race}'] =  loop_df[f'case_{race}'].values + (loop_df[f'case_{race}'].values * race_modifier) 
    
    #//*** Add the results into ca_model_cases_df. Awkward an inefficient.
    ca_model_cases_df = pd.concat([ ca_model_cases_df, loop_df ] ) 
    


print(ca_model_cases_df)

NameError: name 'ca_race_diff' is not defined

In [None]:
"""
#//********************************************************************
#//*** Convert State Race Totals to State Wide per 100k values.
#//********************************************************************

state_pop = {}
print(ca_race_df['race'].unique())
for col in pop_attrib_df.columns[1:]:
    state_pop[col] = pop_attrib_df[col].sum()/100000

print(ca_race_df['race'].apply(lambda x : state_pop[x]))

ca_race_df['cases_100k'] = ca_race_df['cases'] / ca_race_df['race'].apply(lambda x : state_pop[x])
ca_race_df['deaths_100k'] = ca_race_df['deaths'] / ca_race_df['race'].apply(lambda x : state_pop[x])

#//*** temp list to hold per 100k values
#tl_case = []
#t1_death = []
#print(ca_race_df)
#for index,row in ca_race_df.iterrows():
    
    #//*** Get State population based on race column
#    state_pop[ row['race'] ]

for date in ca_race_df['date'].unique():
    print(ca_race_df[ ca_race_df['date'] == date ])
    break
"""

In [None]:
"""
#//*** Build State wide new cases

#//********************************************************************
#//*** Convert State COVID Totals to State Wide per 100k values.
#//********************************************************************
ca_total_pop = pop_attrib_df['population'].sum()
ca_total_pop_100k = ca_total_pop/100000
ca_total_df['cases'] = ca_total_df['cum_cases'].rolling(window=2).apply(lambda x: x.iloc[1] - x.iloc[0])
ca_total_df['deaths'] = ca_total_df['cum_deaths'].rolling(window=2).apply(lambda x: x.iloc[1] - x.iloc[0])
#    death_list = death_list + list(loop_race_df['deaths'].rolling(window=2).apply(lambda x: x.iloc[1] - x.iloc[0]))    

ca_total_df['cases_100k'] = ca_total_df['cases']/ca_total_pop_100k
ca_total_df['deaths_100k'] = ca_total_df['deaths']/ca_total_pop_100k
print(ca_total_df)
"""

In [None]:
#//*** Merge Population Attributes with COVID County info
#//*** Only Merge if we haven't merged yet. I got 99 iPython problems but this aint one.
if "Latino" not in ca_covid_df.columns:
    ca_covid_df = pd.merge(ca_covid_df,pop_attrib_df,how="left",on=['county'])


#//*** Build per 100k Stats
ca_100k_df = ca_covid_df.copy()

#//*** Define Population Columns to convert to 100k. These Columns shouldn't change. Trying to setup a flexible
#//*** Systems where I can add other attributes later if needed
population_cols = [ 'population','Latino', 'White', 'Asian', 'Black', 'Native', 'Hawaiian','Multiracial' ]

#//*** Convert Popultion values to 100k units. ie divide by 100,000
for col in population_cols:
    ca_100k_df[col] = ca_100k_df[col]/100000



#//*** Convert cases, deaths, test to per 100k units
attrib_cols = ['date','county']

#//*** Ignore values in attrib_cols, and population_cols
#//*** Convert remianing attributes to values per 100,000.
#//*** This method makes it easier to change the 100k attributes later.
for col in ca_100k_df.columns:
    if col not in attrib_cols and col not in population_cols:
        #//*** Convert column to per 100k value. Which is Columns value divided population per 100k
        ca_100k_df[col] = ca_100k_df[col]/ca_100k_df['population'] 

plt.rcParams['figure.figsize'] = [50,20]
#//*** Check our Work.
#//*** Cases per 100k should be relatively similar in values.
display_size = 40
fig,ax = plt.subplots()

for county in ca_100k_df['county'].unique():
    
    loop_df = ca_100k_df[ca_100k_df['county'] ==  county]
    ax.plot(loop_df['date'],loop_df['cases'].rolling(5).mean(),label=county)


    plt.xticks(rotation=30,fontsize=display_size)
    plt.yticks(fontsize=display_size)
handles,labels = deduplicate_legend(ax)
plt.legend(fontsize=display_size*.25,loc='upper left')
plt.title(f"Scaled County Data (per 100k)",fontsize=display_size)
#plt.ylabel("Total Cases by County (millions)",fontsize=display_size)
plt.show()



In [None]:
ca_race_diff = pd.DataFrame()

ca_race_diff['date'] = ca_total_df['date'].copy()
print(ca_race_diff)

for race in ca_race_df['race'].unique():

    race_percent = pop_attrib_df[race].sum() / pop_attrib_df['population'].sum()
    #/  ca_total_df['cases']*race_percent
    #//**** actual Value
    
    actual = ca_race_df[ca_race_df['race']==race]['cases']
    expected = ca_total_df['cases']*race_percent
    
    #ca_race_diff['actual'] = actual.values
    #ca_race_diff['expected'] = expected.values
    ca_race_diff[race] = ( actual.values - expected.values ) / actual.values
    
    #print(pd.DataFrame([actual,expected]))
    #ax.plot(ca_race_df['date'].unique(),ca_race_df[ca_race_df['race']==race]['cases'].rolling(5).mean(),label='actual')
    #ax.plot(ca_race_df['date'].unique(),ca_total_df['cases'].rolling(5).mean()*race_percent,label='expected')
#ca_race_diff['Native'] = ca_race_diff['Native'].replace(np.inf,0)
ca_race_diff.replace([np.inf, -np.inf], 0, inplace=True)
print(ca_race_diff)

In [None]:
loop_df = ca_covid_df[ca_covid_df['county'] == 'Imperial'].copy()
print(ca_race_df['race'].unique())

print(loop_df.columns[4:])
for race in ca_race_df['race'].unique():
    loop_df[race] =  ( (loop_df[race] / loop_df['population'])*loop_df['cases'] ) + ( ( (loop_df[race] / loop_df['population'])*loop_df['cases'] ) * ca_race_diff[race].values )
    #loop_df[race] =  ca_race_diff[race].values
dates = ca_race_df['date'].unique()

loop_df['est'] = loop_df['Latino'] + loop_df['White'] + loop_df['Asian'] + loop_df['Black'] + loop_df['Native'] + loop_df['Hawaiian'] + loop_df['Multiracial']  
     

print(loop_df)

fig,ax = plt.subplots()    

ax.plot(dates,loop_df['cases'],label='total_actual_cases')
ax.plot(dates,loop_df['est'],label='total_estimated_cases')
#ax.plot(dates,ca_total_df['cases'].rolling(5).mean()*race_percent,label='expected')

plt.title(f'Model Testing on Imperial County\nThis Model looks promising')

plt.legend(fontsize=10)
plt.show()

In [None]:
print(ca_model_cases_df['date'].unique()[-2])

print(ca_model_cases_df[ca_model_cases_df['date']==ca_model_cases_df['date'].unique()[-2]]['cases'].sum())

In [None]:
print(ca_race_diff)
print(ca_race_diff['Latino']-(ca_race_diff['Latino']*0.061724))

In [None]:
#//*** Build the estimated case values to check the model
#//*** Sum() each Adjusted Racial attribute.
#//*** The difference between the sum() value and the est value is the model error.

#//*** There is a lot of extra data here to help me double check the maths. Most of these columns will be removed.
ca_model_cases_df['est'] = ca_model_cases_df['adj_Native'] + ca_model_cases_df['adj_Asian'] + ca_model_cases_df['adj_Black'] + ca_model_cases_df['adj_Latino'] + ca_model_cases_df['adj_Multiracial'] + ca_model_cases_df['adj_Hawaiian'] + ca_model_cases_df['White'] 

print(ca_model_cases_df)

In [None]:
#//*** Build a list of counties ordered by total COVID prevalence (most Cases per 100k) 

#//*** Get the Statewide 100k value. 
#//*** Get total Case Count from orig_df, dvided by total population / 100000
state_100k = ca_covid_orig_df['cases'].sum()/(ca_covid_orig_df['population'].unique().sum()/100000)

county_list = ca_100k_df['county'].unique()

county_100k = []

#//*** Get a list of counties with population greater than 100,000
for county in county_list:
    if ca_100k_df[ca_100k_df['county']==county].iloc[0]['population'] > 1:
        county_100k.append(county)
case_totals = []

#//*** Get the total Cases for each county per 100k
for county in county_100k:
    case_totals.append(ca_100k_df[ca_100k_df['county']==county]['cases'].sum())

#//*** Build a series with the county name and the total per100k value for each county. Sort by the Prevalence value.
ts = pd.Series(index = county_100k, data=case_totals).sort_values(ascending=False)


In [None]:
#//*** Display the actual values vs the modeled values.

dates = ca_model_cases_df['date'].unique()
counties = ts.index.values

plt.rcParams['figure.figsize'] = [20,10]
fontsize=20
for county in counties:
    
    
    
    loop_df = ca_model_cases_df[ ca_model_cases_df['county'] == county]
    
    if loop_df['population'].iloc[0] < 100000:
        continue
    
    fig,ax = plt.subplots()    

    ax.plot(dates,loop_df['cases'],label='actual')
    ax.plot(dates,loop_df['est'],label='modeled')
    
    plt.title(f'{county} County Actual Values vs Modeled Values', fontsize=fontsize)
    plt.xticks(fontsize=fontsize)
    plt.yticks(fontsize=fontsize)
    plt.legend(fontsize=fontsize)
    plt.show()

In [None]:
"""
#//***********************
#//*** Working Data Sets
#//****************************************
#//*** Statewide COVID cases by Ethnicity
#//****************************************
#print(ca_race_df)

ca_race_df['date'] = pd.to_datetime( ca_race_df['date'])
imperial_df = ca_covid_df[ca_covid_df['county'] == 'Imperial']
latino_df = ca_race_df[ca_race_df['race'] == 'Latino'].copy()


latino_df['cases'] = latino_df['cases'].rolling(window=2).apply(lambda x: x.iloc[1] - x.iloc[0])
latino_df['deaths'] = latino_df['deaths'].rolling(window=2).apply(lambda x: x.iloc[1] - x.iloc[0])

latino_df.rename(columns={'cases':'state_latino_cases','deaths':'state_latino_deaths'},inplace=True)
#ca_race_df.rename( columns= {'report_date':'date'}, inplace=True)

imperial_df = pd.merge(imperial_df,latino_df,on='date')

del imperial_df['race']

imperial_df = imperial_df[1:]

print(imperial_df)
#print(latino_df)
"""
"""
fig,ax = plt.subplots()
ax.plot(latino_df['date'],latino_df['new_cases'],label='cases')
ax.plot(latino_df['date'],latino_df['new_deaths'],label='deaths')
ax.plot(imperial_df['date'],imperial_df['cases'],label='deaths')
plt.legend()
plt.show()
"""
print()
#//*** Daily COVID Cases per 100k population
#print(ca_100k_df)

In [None]:
#//*** Generate total covid prevalence by county.
#//*** Sort by prevalence

#//*** Get the Statewide 100k value. 
#//*** Get total Case Count from orig_df, dvided by total population / 100000
state_100k = ca_covid_orig_df['cases'].sum()/(ca_covid_orig_df['population'].unique().sum()/100000)

county_list = ca_100k_df['county'].unique()

county_100k = []

for county in county_list:
    if ca_100k_df[ca_100k_df['county']==county].iloc[0]['population'] > 1:
        county_100k.append(county)
case_totals = []

for county in county_100k:
    case_totals.append(ca_100k_df[ca_100k_df['county']==county]['cases'].sum())

#//*** Temp Series
ts = pd.Series(index = county_100k, data=case_totals).sort_values(ascending=False)

counties = ts.index.values
print(counties)

In [None]:
"""#//*** Modeling
#//*** Predict total Statewide cases from County Cases. This gives a county COVID weight.
#//*** Predict total Racial Cases from county Cases. This gives a county racial weight. The difference between the total prediction
#//*** And the racial prediction whould be the racial case count for the county. May have to the regression in reverse with the county weights.
inter_dict = {}

#//*** Assemble Each County cases_100k
x = []
print(counties)
for county in counties:
    x.append(ca_100k_df[ca_100k_df['county']==county]['cases'].values)


x = np.array(x).reshape(-1,len(counties))    
#//**** Predict Statewide Cases
y = ca_total_df['cases_100k']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

print(len(x))
print(len(y))

model = LinearRegression().fit(x_train,y_train)
lasso = Lasso().fit(x_train,y_train)
print(model.score(x_test,y_test))
print(lasso.score(x_test,y_test))
models_df = pd.DataFrame()
models_df['county'] = counties
models_df['state_100k'] = model.coef_
inter_dict['state'] = model.intercept_
#//*** Modeling
#//*** Predict total Statewide cases from County Cases. This gives a county COVID weight.
#//*** Predict total Racial Cases from county Cases. This gives a county racial weight. The difference between the total prediction
#//*** And the racial prediction whould be the racial case count for the county. May have to the regression in reverse with the county weights.
#//*** Assemble Each County cases_100k
#x = []
#for county in counties:
#    x.append(ca_100k_df[ca_100k_df['county']==county]['cases'].values)


x = np.array(x).reshape(-1,len(counties))    

for race in ca_race_df['race'].unique():
    
    #//**** Predict Statewide Cases
    y = ca_race_df[ ca_race_df['race'] == race ]['cases_100k']

    model = LinearRegression().fit(x,y)
    
    inter_dict[race] = model.intercept_
    
    #print(model.score(x,y))

    
    #print(pd.DataFrame([ca_100k_df['county'].unique(), model.coef_]).transpose())
    models_df[f"{race}_100k"] = model.coef_

#print(inter_dict)
#print(models_df.sort_values('state_100k',ascending=False))
#print(models_df)
"""

In [None]:
"""
#//*** Modeling
#//*** Predict total Statewide cases from County Cases. This gives a county COVID weight.
#//*** Predict total Racial Cases from county Cases. This gives a county racial weight. The difference between the total prediction
#//*** And the racial prediction whould be the racial case count for the county. May have to the regression in reverse with the county weights.

#//*** Assemble Each County cases_100k
x = []
for county in ca_covid_df['county'].unique():
    x.append(ca_covid_df[ca_covid_df['county']==county]['cases'][:-30].values)

x = np.array(x).reshape(-1,58)    
#//**** Predict Statewide Cases
y = ca_total_df['cases'][:-30]

#model = LinearRegression().fit(x,y,sample_weight=np.array(weights).reshape(-1,58))
model = LinearRegression().fit(x,y)
print(model.score(x,y))

print(model.intercept_)
print(pd.DataFrame([ca_covid_df['county'].unique(), model.coef_]).transpose())
"""

In [None]:
plain_covid_df= pd.read_csv(cases_data_filename)

print(plain_covid_df)

print(ca_race_df['cases'])
print(ca_covid_df['cases'])

In [None]:
print(ca_100k_df)
print(ca_total_df)
print(ca_race_df)

print(ca_covid_df)

#//**** Predict Statewide Latino cases
#imperial_df = ca_100k_df[ca_100k_df['county'] == 'Imperial']
#print(len(ca_race_df[ca_race_df['race'] == 'Latino']['cases'].values.reshape(-1, 1)))
#print(len(ca_100k_df[ca_100k_df['county'] == 'Imperial']['cases'].values))
#print(ca_race_df[ca_race_df['race'] == 'Latino']['cases'])
#print(ca_100k_df[ca_100k_df['county'] == 'Imperial']['cases'])

In [None]:
"""

x = imperial_df['state_latino_cases'].values.reshape(-1, 1)
y = imperial_df['cases'].values


x = ca_race_df[ca_race_df['race'] == 'Latino']['cases_100k'].values.reshape(-1, 1)
y = ca_100k_df[ca_100k_df['county'] == 'Imperial']['cases'].values


x = ca_100k_df[ca_100k_df['county'] == 'Imperial']['cases'].values.reshape(-1, 1)
y = ca_race_df[ca_race_df['race'] == 'Latino']['cases_100k'].values

plt.rcParams['figure.figsize'] = [20,10]
#model = LinearRegression().fit([ imperial_df['state_latino_cases'] ], [ imperial_df['cases'] ])
model = LinearRegression().fit(x,y)

print(model.score(x,y))


#model.predict([ imperial_df['state_latino_cases'] ])

#print(model.score([ imperial_df['state_latino_cases'] ], [ imperial_df['cases'] ]))

#print((model.predict([ imperial_df['state_latino_cases'] ])[0]))
print(model.coef_)
print(model.intercept_)


fig,ax = plt.subplots()
ax.plot(ca_100k_df['date'].unique(),y,label='actual')
ax.plot(ca_100k_df['date'].unique(),model.predict(x),label='predict')
#ax.plot(ca_100k_df['date'].unique(),y,label='state')
#ax.plot(ca_100k_df['date'].unique(),x,label='county')
plt.title("Linear Regression Test. This is not a good path.")
plt.legend()
plt.show()
"""

In [None]:
#print(ca_100k_df)
print(ca_covid_df)
print(ca_race_df)

In [None]:
print(ca_model_cases_df)

In [None]:
print(ca_race_df)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler


from sklearn.inspection import permutation_importance

def rfr_model(X, y):# Perform Grid-Search
    gsc = GridSearchCV(
        estimator=RandomForestRegressor(),
        param_grid={
            'max_depth': range(3,10),
            'n_estimators': (10, 50, 100, 200, 300, 400),
        },
        cv=5, scoring='neg_mean_squared_error', verbose=0, n_jobs=-1)

    grid_result = gsc.fit(X, y)
    best_params = grid_result.best_params_

    print(best_params)

    #//*** Return the model with best parameters
    return RandomForestRegressor(max_depth=best_params["max_depth"], n_estimators=best_params["n_estimators"],random_state=False, verbose=False)

if False:
    race_coef_df = pd.DataFrame()

    #//*** Build a table of Modeled Racial co-efficients for the past 30 days
    races = list(ca_races_broad_df.columns[1:])

    #for race in races:
    for race in races:


        #//*** Fit a Random forest Regression for the past 30 days
        #https://medium.datadriveninvestor.com/random-forest-regression-9871bc9a25eb
        # Fitting Random Forest Regression to the dataset
        # import the regressor


        x_col_index = 4
        modeling_days = -30


        model_df = ca_races_broad_df[['date',race]].merge(ca_race_diff[['date',race]],left_on='date', right_on='date').merge(ca_cases_broad_df, left_on='date', right_on='date')[modeling_days:]
        #print(model_df.head(1))
        #//*** Get the Columns for the X dataframe.
        #//*** The X Value is the index where the attributes start, there are 58 of them :)
        x_column = model_df.columns[x_col_index:]

        #//*** Define the X attributes and 
        x_model = model_df[x_column]
        #x_model = preprocessing.scale(x_model) 
        y_column = model_df.columns[1]
        y_model = model_df[y_column]




        regressor =rfr_model(x_model,y_model)

        # Perform K-Fold CV
        scores = cross_val_score(regressor, x_model, y_model, cv=10, scoring='neg_mean_squared_error')

        # fit the regressor with x and y data
        regressor.fit(x_model, y_model)  


        y_predict = regressor.predict(x_model)

        pred_df = pd.DataFrame()

        pred_df['actual'] = list(y_model)
        pred_df['predict'] = y_predict



        start_time = time.time()
        result = permutation_importance(
            regressor, x_model, y_model, n_repeats=10, random_state=0, n_jobs=-1)
        elapsed_time = time.time() - start_time
        print(f"Elapsed time to compute the importances: "
              f"{elapsed_time:.3f} seconds")
        index = ['race'] + list(model_df.columns[x_col_index:])
        data = [race] + list(result.importances_mean)
        forest_importances = pd.Series(data, index=index).to_frame().transpose()

        #calculate RMSE
        rmse = sqrt(mean_squared_error(pred_df['actual'], pred_df['predict'])) 
        #print(pred_df)
        print(race, " - RMSE: ",rmse)

        #print("Forest:",forest_importances)
        #print(pred_df)
        race_coef_df = pd.concat([race_coef_df,forest_importances])



    print(race_coef_df)


In [None]:
"""#//*** Modeling
#//*** Predict total Statewide cases from County Cases. This gives a county COVID weight.
#//*** Predict total Racial Cases from county Cases. This gives a county racial weight. The difference between the total prediction
#//*** And the racial prediction whould be the racial case count for the county. May have to the regression in reverse with the county weights.

#//*** Assemble Each County cases_100k
x = []
for county in ca_100k_df['county'].unique():
    x.append(ca_100k_df[ca_100k_df['county']==county]['cases'].values)


x = np.array(x).reshape(-1,58)    
#//**** Predict Statewide Cases
y = ca_total_df['cases_100k']

print(len(x))
print(len(y))

model = LinearRegression().fit(x,y)

print(model.score(x,y))

models_df = pd.DataFrame()
models_df['county'] = ca_100k_df['county'].unique()
models_df['state_100k'] = model.coef_
print(models_df)
print(model.intercept_)

#x = np.array(x).reshape(-1,58)    
#//**** Predict Statewide Cases
#y = ca_total_df['cases_100k']

#print(len(x))
#print(len(y))
print(x_model.shape)
print(x_model)
model = LinearRegression().fit(np.array(x_model).reshape(-1,1)  ,np.array(y_model).reshape(-1,1))

print(model.score(x_model,x_model))
"""

In [None]:

#//**** Let's try modeling just Latinos and see what happens
broad_latino_df = ca_races_broad_df[['date','Latino']].merge(ca_race_diff[['date','Latino']],left_on='date', right_on='date').merge(ca_cases_broad_df, left_on='date', right_on='date')
#print(broad_latino_df.shape)

#print( broad_latino_df.columns[2:])
#print(y_model)
print(broad_latino_df)


In [None]:
"""
#https://medium.datadriveninvestor.com/random-forest-regression-9871bc9a25eb
# Fitting Random Forest Regression to the dataset
# import the regressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.model_selection import GridSearchCV
from sklearn.inspection import permutation_importance

x_col_index = 4

model_df = broad_latino_df

x_column = model_df.columns[x_col_index:]
x_model = model_df[x_column]/model_df[x_column].max()
x_model = model_df[x_column][-30:]


#x_model = preprocessing.scale(x_model) 
y_column = model_df.columns[1]
y_model = model_df[y_column][-30:]


def rfr_model(X, y):# Perform Grid-Search
    gsc = GridSearchCV(
        estimator=RandomForestRegressor(),
        param_grid={
            'max_depth': range(3,10),
            'n_estimators': (10, 50, 100, 200, 300, 400),
        },
        cv=5, scoring='neg_mean_squared_error', verbose=0,                         n_jobs=-1)
    
    grid_result = gsc.fit(X, y)
    best_params = grid_result.best_params_
    
    print(best_params)
    
    #//*** Return the model with best parameters
    return RandomForestRegressor(max_depth=best_params["max_depth"], n_estimators=best_params["n_estimators"],random_state=False, verbose=False)

    
regressor =rfr_model(x_model,y_model)

# Perform K-Fold CV
scores = cross_val_score(regressor, x_model, y_model, cv=10, scoring='neg_mean_squared_error')

#print("Scores: ",scores)
  
# fit the regressor with x and y data
regressor.fit(x_model, y_model)  


y_predict = regressor.predict(x_model)
pred_df = pd.DataFrame()
pred_df['actual'] = y_model
pred_df['predict'] = y_predict

start_time = time.time()
result = permutation_importance(
    regressor, x_model, y_model, n_repeats=10, random_state=0, n_jobs=-1)
elapsed_time = time.time() - start_time
print(f"Elapsed time to compute the importances: "
      f"{elapsed_time:.3f} seconds")

forest_importances = pd.Series(result.importances_mean, index=model_df.columns[x_col_index:])
print(forest_importances.transpose())

#print(x_model)

#calculate RMSE
rmse = sqrt(mean_squared_error(pred_df['actual'], pred_df['predict'])) 
print("RMSE: ",rmse)
print(pred_df)

"""

In [None]:
"""
#https://medium.datadriveninvestor.com/random-forest-regression-9871bc9a25eb
# Fitting Random Forest Regression to the dataset
# import the regressor
from sklearn.ensemble import RandomForestRegressor
  


model_df = broad_latino_df

x_column = model_df.columns[3:]
x_model = model_df[x_column]/model_df[x_column].max()
x_model = model_df[x_column]
#x_model = preprocessing.scale(x_model) 
y_column = model_df.columns[1]
y_model = model_df[y_column]


def rfr_model(X, y):# Perform Grid-Search
    gsc = GridSearchCV(
        estimator=RandomForestRegressor(),
        param_grid={
            'max_depth': range(3,10),
            'n_estimators': (10, 50, 100, 1000),
        },
        cv=5, scoring='neg_mean_squared_error', verbose=0,                         n_jobs=-1)
    
    grid_result = gsc.fit(X, y)
    best_params = grid_result.best_params_
    
    print(best_params)
    
    #//*** Return the model with best parameters
    return RandomForestRegressor(max_depth=best_params["max_depth"], n_estimators=best_params["n_estimators"],random_state=False, verbose=False)

    
regressor =rfr_model(x_model,y_model)

# Perform K-Fold CV
scores = cross_val_score(regressor, x_model, y_model, cv=10, scoring='neg_mean_squared_error')

print("Scores: ",scores)
  
# fit the regressor with x and y data
regressor.fit(x_model, y_model)  

y_predict = regressor.predict(x_model)

pred_df = pd.DataFrame([y_model,y_predict]).transpose()
pred_df.columns = ['actual','predict']


start_time = time.time()
result = permutation_importance(
    regressor, x_model, y_model, n_repeats=10, random_state=0, n_jobs=-1)
elapsed_time = time.time() - start_time
print(f"Elapsed time to compute the importances: "
      f"{elapsed_time:.3f} seconds")

forest_importances = pd.Series(result.importances_mean, index=model_df.columns[3:])
print(forest_importances)

from sklearn.metrics import mean_squared_error
from math import sqrt

#calculate RMSE
rmse = sqrt(mean_squared_error(pred_df['actual'], pred_df['predict'])) 

print(pred_df)
print("RMSE: ",rmse)
"""

In [None]:
#https://oindrilasen.com/2021/02/how-to-install-and-import-keras-in-anaconda-jupyter-notebooks/

from keras.layers import Dense
from keras.models import Sequential
from keras.wrappers.scikit_learn import KerasRegressor

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

from sklearn import preprocessing 
from sklearn.preprocessing import scale

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score, precision_score, accuracy_score, recall_score, confusion_matrix

import tensorflow as tf
import time

In [None]:
#//*** Assigns a color from a palette list to a county. 
def assign_color(input_item, input_cd,input_palette):
    #//*** Check if item already exists, if so, return input_cd
    if input_item in input_cd.keys():
        return input_cd
    
    #//*** input_item needs a Color. Walk down the input_palette till one is not found
    for color in input_palette:
        if color not in input_cd.values():
            input_cd[input_item] = color
            return input_cd
    print("UH OH ran out of colors!!!")
    print(f"Item: {input_item}")
    print(input_cd)
    return input_cd

In [None]:
#//*** Color Choices: Tucking these aside for later use
#//*** Combine these with a dictionary to create color continuity across multiple visualizations.
color_palette = ["#c6eaff","#caa669","#14bae2","#f7cd89","#98a9e7","#e2ffb7","#cb9ec2","#77dcb5","#ffc5b7","#40bdba","#fff4b0","#74d0ff","#e4da8d","#7ceeff","#d0e195","#b7ab8c","#fcffdb","#83b88d","#ffe2c0","#abc37a"]
color_palette = ["#557788","#e12925","#44af0e","#7834c0","#726d00","#130c6d","#004e12","#f7007d","#017878","#950089","#00a3d7","#4b000e","#0063c2","#f07478","#013b75","#cf81b8","#212238","#af87e7","#320f49","#9c91db"]
county_color_palette = ["#b4a23b","#4457ca","#9ec535","#a651cb","#59ce59","#6a77f0","#52a325","#b93d9b","#36b25c","#e374d4","#c1c035","#7452af","#96ae3a","#a484e2","#89c466","#e54790","#57c888","#dd3d60","#5bd6c4","#dd4e2d","#45ccdf","#bd3738","#4cb998","#b13a6c","#368433","#588feb","#dcad3d","#4763af","#e49132","#4aa5d4","#c86321","#7695d3","#769233","#925898","#54701c","#c893d6","#3d7b44","#e084ac","#65a76b","#965179","#296437","#e57f5f","#31a8ad","#a44b2c","#368d71","#df7f81","#226a4d","#96465f","#b5b671","#68649c","#ad772e","#a34f52","#758348","#d8a06e","#505e25","#8e5e31","#8e8033","#695f1b"]
county_color_palette = ["#96465f","#dd3d60","#df7f81","#a34f52","#bd3738","#dd4e2d","#e57f5f","#a44b2c","#c86321","#8e5e31","#d8a06e","#e49132","#ad772e","#dcad3d","#b4a23b","#8e8033","#695f1b","#c1c035","#b5b671","#96ae3a","#505e25","#758348","#9ec535","#769233","#54701c","#52a325","#89c466","#368433","#59ce59","#3d7b44","#65a76b","#296437","#36b25c","#57c888","#226a4d","#368d71","#4cb998","#5bd6c4","#31a8ad","#45ccdf","#4aa5d4","#7695d3","#588feb","#4763af","#6a77f0","#4457ca","#68649c","#a484e2","#7452af","#a651cb","#c893d6","#925898","#e374d4","#b93d9b","#965179","#e084ac","#e54790","#b13a6c"]
county_color_palette = ["#226a4d","#31a8ad","#68649c","#758348","#505e25","#368d71","#4aa5d4","#965179","#7695d3","#45ccdf","#296437","#96465f","#8e5e31","#b5b671","#d8a06e","#a34f52","#5bd6c4","#695f1b","#4cb998","#df7f81","#3d7b44","#e084ac","#c893d6","#65a76b","#8e8033","#925898","#4763af","#54701c","#ad772e","#a44b2c","#e57f5f","#769233","#57c888","#b13a6c","#588feb","#a484e2","#b4a23b","#368433","#89c466","#7452af","#96ae3a","#dcad3d","#bd3738","#36b25c","#e374d4","#c86321","#b93d9b","#e49132","#dd3d60","#e54790","#c1c035","#4457ca","#6a77f0","#52a325","#9ec535","#dd4e2d","#a651cb","#59ce59"]

In [None]:
"""#print(ca_covid_df[ca_covid_df['county']=='Imperial'])

print(ca_covid_df.columns[4:])
plt.rcParams['figure.figsize'] = [50,20]
plt.rcParams.update({'figure.max_open_warning': 0})



loop_df = ca_covid_orig_df[ca_covid_orig_df['county']=='Imperial']
display_size = 40
fig,ax = plt.subplots()

for col in ['cases','reported_cases']:

    
    ax.plot(loop_df['date'],loop_df[col].rolling(5).mean(),label=col)


    plt.xticks(rotation=30,fontsize=display_size)
    plt.yticks(fontsize=display_size)
handles,labels = deduplicate_legend(ax)
plt.legend(fontsize=display_size,loc='upper left')
#plt.title(f"Total Covid cases Time Series for all counties.\nLos Angeles County dominates with 29% of the state population (2.5% of the national population)",fontsize=display_size)
#plt.ylabel("Total Cases by County (millions)",fontsize=display_size)
plt.show()

for col in ca_covid_df.columns[4:]:
    display_size = 40
    fig,ax = plt.subplots()
    
    ax.plot(loop_df['date'],loop_df[col],label=col)


    plt.xticks(rotation=30,fontsize=display_size)
    plt.yticks(fontsize=display_size)
    handles,labels = deduplicate_legend(ax)
    plt.legend(fontsize=display_size,loc='upper left')
    #plt.title(f"Total Covid cases Time Series for all counties.\nLos Angeles County dominates with 29% of the state population (2.5% of the national population)",fontsize=display_size)
    #plt.ylabel("Total Cases by County (millions)",fontsize=display_size)
    plt.show()

fig,ax = plt.subplots()
for col in ['deaths','reported_deaths']:

    
    ax.plot(loop_df['date'],loop_df[col].rolling(5).mean(),label=col)


    plt.xticks(rotation=30,fontsize=display_size)
    plt.yticks(fontsize=display_size)
handles,labels = deduplicate_legend(ax)
plt.legend(fontsize=display_size,loc='upper left')
#plt.title(f"Total Covid cases Time Series for all counties.\nLos Angeles County dominates with 29% of the state population (2.5% of the national population)",fontsize=display_size)
#plt.ylabel("Total Cases by County (millions)",fontsize=display_size)
plt.show()
print(ca_covid_df)
"""
print()

In [None]:
#//*** Get the Statewide 100k value. 
#//*** Get total Case Count from orig_df, dvided by total population / 100000
state_100k = ca_covid_orig_df['cases'].sum()/(ca_covid_orig_df['population'].unique().sum()/100000)

county_list = ca_100k_df['county'].unique()

county_100k = []

for county in county_list:
    if ca_100k_df[ca_100k_df['county']==county].iloc[0]['population'] > 1:
        county_100k.append(county)
case_totals = []

for county in county_100k:
    case_totals.append(ca_100k_df[ca_100k_df['county']==county]['cases'].sum())

#//*** Temp Series
ts = pd.Series(index = county_100k, data=case_totals).sort_values(ascending=False)

print(ts)

display_size = 40
fig,ax = plt.subplots()

ax.bar(ts.index,ts)

#//*** Draw horizontal line. Draw it twice to get the yellow and back effect. 
#//*** This technique looks viusually good, but I can't get the legend to draw approrpriately.
ax.axhline(state_100k,color = "black", label="Statewide Cases 100k", linestyle = "-", lw=2)
ax.axhline(state_100k,color = "yellow", linestyle = "--", lw=2)
        
plt.xticks(rotation=90,fontsize=display_size)
plt.yticks(fontsize=display_size)

#handles,labels = deduplicate_legend(ax)
plt.legend(fontsize=display_size,loc='upper right')
plt.title(f"Total Covid cases for all counties.\nper 100k",fontsize=display_size)
plt.ylabel("Total Cases by County (per 100k)",fontsize=display_size)
plt.show()

In [None]:
#//*** Look at total County COVID numbers by county rates per 100k.


#//*** Get the last data


#for county in






#last_day_df = rd[race_list[0]][rd[race_list[0]]['date'] == last_date]