# Incidence/New Case counts using two different methods of Urban/Rural groupings

### July 18, 2023

This script is the first of two that calculate new case counts and county populations for two methods of urban/rural grouping. It reads a population dataset with 2020 population estimates for all counties in Wisconsin from the Census Bureau: https://www.census.gov/data/datasets/time-series/demo/popest/2020s-counties-total.html#par_textimage_70769902. 
The counties are grouped using the WISH method of urban/rural grouping, referred to hereafter as "m1"; "m2" refers to the urban/rural grouping from the NCHS Urban-Rural Classification Scheme for Counties. The total populations for these categories ("m1_urban_pop", "m1_rural_pop", "m2_urban_pop", and "m2_rural_pop") are summed and stored for incidence calculations that use the COVID-19 historical dataset from the Wisconsin Department of Health Services (DHS): https://data.dhsgis.wi.gov/datasets/wi-dhs::covid-19-historical-data-by-county-v2/about

Before incidence can be calculated, the COVID-19 historical dataset is used to sum the new case counts for each county by month and year from the column "POS_NEW_CP". This column holds the new confirmed positive COVID-19 cases together with the new probable cases. Each county is assigned to its m1 and m2 group, and the new case counts are then summed within each group. Finally, these totals are divided by the corresponding m1 and m2 populations and multiplied by 10,000 to give the incidence per 10,000 people.
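
The incidence calculation described above can be sketched as a small function. This is a minimal illustration rather than the notebook's own code: the function name `incidence_per_10k` is hypothetical, and the example uses the April 2020 m1 urban case count (5753) and the m1 urban population (4364282) printed later in this notebook.

```python
def incidence_per_10k(new_cases: int, population: int) -> float:
    """Return new cases per 10,000 people for one group and one month."""
    if population <= 0:
        raise ValueError("population must be positive")
    return (new_cases / population) * 10000


# Values taken from the outputs further down in this notebook:
# 5753 new cases (m1 urban, Apr-20) and an m1 urban population of 4364282.
apr20_m1_urban = incidence_per_10k(5753, 4364282)
print(round(apr20_m1_urban, 2))
```

For these inputs the result is roughly 13 cases per 10,000 people.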

In [268]:
import pandas as pd
import csv
import calendar
import datetime

In [269]:
# Define the "method 1" or "m1" urban and rural WI counties as 2013 NCHSUR Codes 1-4 being Metro or "urban"
# and 5-6 being Nonmetro or "rural"
m1_urban = ['Brown', 'Calumet', 'Chippewa', 'Columbia', 'Dane', 'Douglas', 'Eau Claire', 'Fond du Lac', 
            'Green', 'Iowa', 'Kenosha', 'Kewaunee', 'La Crosse', 'Marathon', 'Milwaukee', 'Oconto', 
            'Outagamie', 'Ozaukee', 'Pierce', 'Racine', 'Rock', 'St.Croix', 'Sheboygan', 'Washington', 
            'Waukesha', 'Winnebago']

m1_rural = ['Adams', 'Ashland', 'Barron', 'Bayfield', 'Buffalo', 'Burnett', 'Clark', 'Crawford', 
            'Dodge', 'Door', 'Dunn', 'Florence', 'Forest', 'Grant', 'Green Lake', 'Iron', 
            'Jackson', 'Jefferson', 'Juneau', 'Lafayette', 'Langlade', 'Lincoln', 'Manitowoc', 
            'Marinette', 'Marquette', 'Menominee', 'Monroe', 'Oneida', 'Pepin', 'Polk', 'Portage', 
            'Price', 'Richland', 'Rusk', 'Sauk', 'Sawyer', 'Shawano', 'Taylor', 'Trempealeau', 
            'Vernon', 'Vilas', 'Walworth', 'Washburn', 'Waupaca', 'Waushara', 'Wood']

In [270]:
# Define the "method 2" or "m2" urban and rural WI counties as 2013 NCHSUR Codes 1-3 being Metro or "urban"
# and 4-6 being Nonmetro or "rural"


m2_urban= ['Milwaukee',
'Kenosha',
'Ozaukee',
'Pierce',
'St.Croix', 
'Washington',
'Waukesha',
'Brown',
'Columbia',
'Dane',
'Douglas',
'Green',
'Iowa',
'Kewaunee',
'Oconto']

m2_rural = ["Calumet", "Chippewa", "Eau Claire", "Fond du Lac", "La Crosse", 
            "Marathon", "Outagamie", "Racine", "Rock", "Sheboygan", "Winnebago", 
            "Dodge", "Dunn", "Florence", "Grant", "Jefferson", "Lincoln", 
            "Manitowoc", "Marinette", "Menominee", "Portage", "Sauk", "Shawano", 
            "Walworth", "Wood", "Adams", "Ashland", "Barron", "Bayfield", "Buffalo", 
            "Burnett", "Clark", "Crawford", "Door", "Forest", "Green Lake", "Iron", 
            "Jackson", "Juneau", "Lafayette", "Langlade", "Marquette", "Monroe", 
            "Oneida", "Pepin", "Polk", "Price", 
            "Richland", "Rusk", "Sawyer", "Taylor", "Trempealeau", "Vernon", 
            "Vilas", "Washburn", "Waupaca", "Waushara"]

In [271]:
# store county population sizes in a dictionary and make a dataframe from the dictionary
popinput_file = '/Users/mavoeg/Desktop/SARS/Wisconsin/WI_Data_Counties/data/population/co-est2022_census_2023_07.csv'
pop_outputdict ={}
with open(popinput_file, "r") as popfile:
    for popline in popfile:
        
        popsplit_lines = popline.split(',')
        
        population = popsplit_lines[2].replace("\"","") + popsplit_lines[3].replace("\"","")
        

        popcounty_names = popsplit_lines[0].replace('\"','')
        popcounty = popcounty_names.split(' County')
    
        
        
        if len(popcounty[0]) == 0:
            pass
        if "St Croix" in popcounty[0]:
            popcounty_firstname = popcounty[0].replace('\"St Croix','St. Croix')
        else: 
            popcounty_firstname = popcounty[0].replace("\"", "")
            
        # store the population as a one-element list so the DataFrame
        # built below gets a single column named 0
        pop_outputdict[popcounty_firstname] = [population]
            

pop_df = pd.DataFrame.from_dict(pop_outputdict, orient="index") # make a df from the dictionary
pop_df = pop_df.sort_index()  # sort rows by county name
pop_df = pop_df.drop(["County"])
pop_df = pop_df.rename(index={'St Croix': 'St.Croix'})

print(pop_df)


                0
Adams       20675
Ashland     16018
Barron      46714
Bayfield    16233
Brown      268921
...           ...
Waukesha   407467
Waupaca     51791
Waushara    24549
Winnebago  171800
Wood        74197

[72 rows x 1 columns]
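
Splitting each line on commas also splits quoted fields that contain commas themselves, which appears to be why the cell above concatenates two adjacent fields to rebuild the population. As a hedged alternative, Python's `csv` module handles quoted fields directly. The sketch below assumes a hypothetical two-column layout (county name, then a quoted population with a thousands separator); the real Census file is laid out differently, so the column positions would need adjusting. The helper name `parse_population_rows` and the sample rows are made up for illustration.

```python
import csv
import io


def parse_population_rows(text: str) -> dict:
    """Map county name -> population from CSV text with quoted numbers."""
    populations = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        county = row[0].replace(" County", "").strip()
        populations[county] = int(row[1].replace(",", ""))
    return populations


# Made-up sample rows, not taken from the Census file
sample = '"Example County","12,345"\n"Two Words County","6,789"\n'
print(parse_population_rows(sample))
```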


In [272]:
# assign urban and rural via the lists created earlier
pop_df['m1_urb_rur'] = pop_df.index.map(lambda popcounty: 'urban' if popcounty in m1_urban else 'rural')

pop_df['m2_urb_rur'] = pop_df.index.map(lambda popcounty: 'urban' if popcounty in m2_urban else 'rural')

pop_df = pop_df.reset_index().rename(columns={"index": "county", 0: "population"})
#pop_df = pop_df.drop([0])

print(pop_df['county'])



0         Adams
1       Ashland
2        Barron
3      Bayfield
4         Brown
        ...    
67     Waukesha
68      Waupaca
69     Waushara
70    Winnebago
71         Wood
Name: county, Length: 72, dtype: object
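
Because both lambdas fall back to 'rural' for any name that is not in the urban list, a misspelled or differently formatted county name would be counted as rural without any warning. A small check like the following can catch that. It is a sketch with a hypothetical helper name (`unmatched_counties`) and toy lists, not part of the original analysis.

```python
def unmatched_counties(counties, urban, rural):
    """Return names found in neither list, and names found in both lists."""
    urban_set, rural_set = set(urban), set(rural)
    missing = sorted(c for c in counties if c not in urban_set and c not in rural_set)
    overlap = sorted(urban_set & rural_set)
    return missing, overlap


# Toy data: "St. Croix" is spelled differently from the list entry "St.Croix"
toy_urban = ["Dane", "St.Croix"]
toy_rural = ["Adams"]
missing, overlap = unmatched_counties(["Adams", "Dane", "St. Croix"], toy_urban, toy_rural)
print(missing, overlap)
```

With the real data this could be called as `unmatched_counties(pop_df['county'], m1_urban, m1_rural)`, and again with the m2 lists; both returned lists should be empty.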


The next chunk sums the urban and rural populations for both methods of urban and rural groupings.

In [273]:
# sum the population variables for each group
m1_rural_pop = pd.to_numeric(pop_df.loc[pop_df['m1_urb_rur']=='rural', 'population'], errors='coerce').fillna(0).sum()
m1_urban_pop = pd.to_numeric(pop_df.loc[pop_df['m1_urb_rur']=='urban', 'population'], errors='coerce').fillna(0).sum()
m2_rural_pop = pd.to_numeric(pop_df.loc[pop_df['m2_urb_rur']=='rural', 'population'], errors='coerce').fillna(0).sum()
m2_urban_pop = pd.to_numeric(pop_df.loc[pop_df['m2_urb_rur']=='urban', 'population'], errors='coerce').fillna(0).sum()

print(m1_rural_pop, m1_urban_pop, m2_rural_pop, m2_urban_pop)

1531989 4364282 2961867 2934404
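
The four sums above can also be produced with `groupby`, which keeps the conversion to numbers in one place and makes unparseable values visible instead of replacing them with 0. A minimal sketch on a toy frame (the county names and populations are invented; the column names follow `pop_df`):

```python
import pandas as pd

# Toy stand-in for pop_df; populations are strings, as when read from the file
toy_pop = pd.DataFrame({
    "county": ["A", "B", "C"],
    "population": ["100", "250", "50"],
    "m1_urb_rur": ["urban", "rural", "rural"],
    "m2_urb_rur": ["urban", "urban", "rural"],
})

# Convert once; unparseable values become NaN and can be inspected
toy_pop["population"] = pd.to_numeric(toy_pop["population"], errors="coerce")
assert not toy_pop["population"].isna().any(), "some populations failed to parse"

# One sum per group label for each grouping method
m1_totals = toy_pop.groupby("m1_urb_rur")["population"].sum()
m2_totals = toy_pop.groupby("m2_urb_rur")["population"].sum()
print(m1_totals["rural"], m1_totals["urban"], m2_totals["rural"], m2_totals["urban"])
```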


In [274]:
# read in the COVID-19 historical data from DHS, using the "POS_NEW_CP" column for new case counts
# This column is the tabulation of new confirmed and probable cases of COVID-19
input_file = '/Users/mavoeg/Desktop/SARS/Wisconsin/WI_Data_Counties/data/COVID-19_Historical_Data_by_County_DATA.csv'
output_dict  = {}
with open(input_file, "r") as infile: 
    for line in infile:
        if "GEOID" in line:
            pass
        else:
            fields = line.split(",")
            rpt_date = fields[0]
            location = fields[2]
            pos = fields[3]

            # str.split always returns at least one element, so the last
            # "/"-separated part is taken as the county name (not used below;
            # the dictionary is keyed by location)
            county = location.split("/")[-1].strip()
          
            year = int(rpt_date.split("/")[-1])
            month = int(rpt_date.split("/")[0])
            month_abbr = calendar.month_abbr[month]
            year_month = month_abbr + "-" + str(year) 
            

            # make the dictionary 
            if location not in output_dict: 
                output_dict[location] = {}
            if year_month not in output_dict[location]:
                output_dict[location][year_month]= int(pos)
            else:
                output_dict[location][year_month] += int(pos)
                
            
            
          

inc_df = pd.DataFrame.from_dict(output_dict, orient="index") # make a df from the dictionary
inc_df = inc_df.sort_index().sort_index(axis=1)  # Sort the index (county) and column names 
inc_df = inc_df.rename(index={'St Croix': 'St.Croix'})
print(inc_df)

           Apr-20  Apr-21  Apr-22  Aug-20  Aug-21  Aug-22  Dec-20  Dec-21  \
Adams           5      91      31      45     189     147     310     328   
Ashland         2      63      39      22      64      78     351     316   
Barron          6     198     178     151     347     460     880    1013   
Bayfield        3      73      89      26      83     102     247     255   
Brown        1086    1097     961    1474    2100    2892    4534    7681   
...           ...     ...     ...     ...     ...     ...     ...     ...   
Waukesha      350    1811    2134    2170    3707    2877    8965   10007   
Waupaca         9     152      92     342     382     371     608    1043   
Waushara        2      50      76      58     139     191     236     491   
Winnebago      52     663     850     532    1283    1794    2611    4504   
Wood            0     227     289     250     593     802    1543    2311   

           Dec-22  Feb-20  ...  May-22  Nov-20  Nov-21  Nov-22  Oct-20  \
A
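
Sorting the column labels as strings puts them in alphabetical order (Apr, Aug, Dec, ...), as the output above shows. If chronological order is wanted, the labels can be sorted by the date they encode. A minimal sketch, assuming labels in the `Mon-YY` form produced above and an English locale for month abbreviations:

```python
from datetime import datetime

labels = ["Apr-20", "Apr-21", "Aug-20", "Dec-20", "Feb-20"]

# Parse each label as a date for the sort key only; the labels stay unchanged
chronological = sorted(labels, key=lambda label: datetime.strptime(label, "%b-%y"))
print(chronological)
```

The sorted list could then be used to reorder the month columns, for example with `inc_df.reindex(columns=...)`.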

In [275]:
# Add columns m1_urb_rur and m2_urb_rur for "urban", "rural" by the new urban rural definition using the lists above
# to the inc_df dataframe

inc_df['m1_urb_rur'] = inc_df.index.map(lambda county: 'urban' if county in m1_urban else 'rural')

inc_df['m2_urb_rur'] = inc_df.index.map(lambda county: 'urban' if county in m2_urban else 'rural')
print(inc_df)

           Apr-20  Apr-21  Apr-22  Aug-20  Aug-21  Aug-22  Dec-20  Dec-21  \
Adams           5      91      31      45     189     147     310     328   
Ashland         2      63      39      22      64      78     351     316   
Barron          6     198     178     151     347     460     880    1013   
Bayfield        3      73      89      26      83     102     247     255   
Brown        1086    1097     961    1474    2100    2892    4534    7681   
...           ...     ...     ...     ...     ...     ...     ...     ...   
Waukesha      350    1811    2134    2170    3707    2877    8965   10007   
Waupaca         9     152      92     342     382     371     608    1043   
Waushara        2      50      76      58     139     191     236     491   
Winnebago      52     663     850     532    1283    1794    2611    4504   
Wood            0     227     289     250     593     802    1543    2311   

           Dec-22  Feb-20  ...  Nov-21  Nov-22  Oct-20  Oct-21  Oct-22  \
A

In [276]:
# Create a new DataFrame from the old df with desired columns to plot
newinc_df = pd.DataFrame(columns=["date", "m1urban_posnewcp", "m1rural_posnewcp", "m2urban_posnewcp", "m2rural_posnewcp"])


# iterate over columns to add up the counts for urban and rural and populate the new df
for column in inc_df.columns:

    m1urban_posnewcp = inc_df[inc_df["m1_urb_rur"] == "urban"][column].sum()
    m1rural_posnewcp = inc_df[inc_df["m1_urb_rur"] == "rural"][column].sum()
    m2urban_posnewcp = inc_df[inc_df["m2_urb_rur"] == "urban"][column].sum()
    m2rural_posnewcp = inc_df[inc_df["m2_urb_rur"] == "rural"][column].sum()
    newinc_df.loc[column] = [column, m1urban_posnewcp,  m1rural_posnewcp,  m2urban_posnewcp, m2rural_posnewcp]
    

# reset the index and rename the index column to "date"
newinc_df = newinc_df.reset_index().rename(columns={"index": "date"})

# remove the duplicate "date" column
newinc_df = newinc_df.loc[:, ~newinc_df.columns.duplicated()]

# drop the two rows produced by the m1_urb_rur and m2_urb_rur label columns,
# whose "sums" are concatenated strings rather than case counts
newinc_df = newinc_df[~newinc_df["date"].isin(["m1_urb_rur", "m2_urb_rur"])]
print(newinc_df)


      date m1urban_posnewcp m1rural_posnewcp m2urban_posnewcp m2rural_posnewcp
0   Apr-20             5753              573             4842             1484
1   Apr-21            18803             6317            12763            12357
2   Apr-22            23265             5427            17167            11525
3   Aug-20            18080             5471            12952            10599
4   Aug-21            34840            10939            23648            22131
5   Aug-22            37132            14156            24633            26655
6   Dec-20            77513            29819            51945            55387
7   Dec-21           112758            36261            75932            73087
8   Dec-22            27102             8398            18721            16779
9   Feb-20                1                0                1                0
10  Feb-21            18774             6649            12522            12901
11  Feb-22            39544            16740        
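
The loop above can also be expressed as a `groupby` over the label column followed by a transpose, which avoids summing the label columns themselves. A sketch on toy counts (invented numbers; the column names follow `inc_df` and `newinc_df`):

```python
import pandas as pd

toy_inc = pd.DataFrame(
    {"Apr-20": [5, 2, 10], "May-20": [7, 1, 20]},
    index=["Adams", "Ashland", "Brown"],
)
toy_inc["m1_urb_rur"] = ["rural", "rural", "urban"]

# Only the month columns hold counts; the label column is excluded
month_cols = [c for c in toy_inc.columns if c != "m1_urb_rur"]

# Sum each month within each group, then flip so months become rows
m1_monthly = (
    toy_inc.groupby("m1_urb_rur")[month_cols]
    .sum()
    .T
    .rename(columns={"urban": "m1urban_posnewcp", "rural": "m1rural_posnewcp"})
)
m1_monthly.index.name = "date"
print(m1_monthly)
```

The same pattern applied to `m2_urb_rur` would give the two m2 columns.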

In [277]:
# Divide each group's new cases by that group's population and multiply by 10000
# to get incidence per 10,000 people; store the results in new columns
newinc_df['incidence_m1urb'] = (newinc_df['m1urban_posnewcp'] / m1_urban_pop) * 10000
newinc_df['incidence_m1rur'] = (newinc_df['m1rural_posnewcp'] / m1_rural_pop) * 10000
newinc_df['incidence_m2urb'] = (newinc_df['m2urban_posnewcp'] / m2_urban_pop) * 10000
newinc_df['incidence_m2rur'] = (newinc_df['m2rural_posnewcp'] / m2_rural_pop) * 10000
print(newinc_df)


      date m1urban_posnewcp m1rural_posnewcp m2urban_posnewcp  \
0   Apr-20             5753              573             4842   
1   Apr-21            18803             6317            12763   
2   Apr-22            23265             5427            17167   
3   Aug-20            18080             5471            12952   
4   Aug-21            34840            10939            23648   
5   Aug-22            37132            14156            24633   
6   Dec-20            77513            29819            51945   
7   Dec-21           112758            36261            75932   
8   Dec-22            27102             8398            18721   
9   Feb-20                1                0                1   
10  Feb-21            18774             6649            12522   
11  Feb-22            39544            16740            24682   
12  Feb-23              612              185              433   
13  Jan-20                0                0                0   
14  Jan-21            533

In [278]:
newinc_df.to_csv('/Users/mavoeg/Desktop/SARS/Wisconsin/WI_Data_Counties/VaxData/Vax_frompy/check_calcs2.csv', index=True, header=True)



In [221]:
# count the number of sequences available using the GISAID dataset

input_file = '/Users/mavoeg/computational_folder/gh_folder/ncov/data/clean_ready_for_nextstrain/nextstrain_may_2023/06_01_2023urbrur_hilomid.tsv' # Gisaid dataset
#input_file = '/Users/mavoeg/Desktop/SARS/Wisconsin/WI_Data_Counties/genbank/2023-06-30-CDC_Contract_Seq_NoNAs.csv' # CDC dataset
output_dict = {}
with open(input_file, "r") as infile:
    tsv_reader = csv.reader(infile, delimiter='\t')  # Use csv.reader with tab delimiter
    for line in tsv_reader:

        if "Accession ID" in line:
            pass
        else:
            collection_date = line[4]  # Access the column by index 
            location = line[8]

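            # str.split always returns at least one element, so no "unknown"
            # fallback is needed; keep the part of location before "County"
            counties = location.split("County")
            county = counties[0].strip()  # Extract the first element from the list
            # Replace spaces with underscores. Multi-word names (e.g. "Eau Claire")
            # then no longer match the m1/m2 lists above, which use spaces.
            county = county.replace(" ", "_")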

            date_parts = collection_date.split("-")

            if len(date_parts) >= 2:
                year = int(date_parts[0])
                month = int(date_parts[1])
                month_abbr = calendar.month_abbr[month]
                    
                
                year_month =  month_abbr + '-' + str(year) 

                if '0 County' in location:
                    pass
                else:
                    # Make the dictionary
                    if county not in output_dict:
                        output_dict[county] = {}
                    if year_month not in output_dict[county]:
                        output_dict[county][year_month] = 1
                    else:
                        output_dict[county][year_month] += 1

                        
                        

seq_df = pd.DataFrame.from_dict(output_dict, orient="index") # make a df from the dictionary
seq_df = seq_df.sort_index().sort_index(axis=1)  # Sort the index (county) and column names 


# Fill NaN values with 0
seq_df = seq_df.fillna(0)
print(seq_df)

# to create series visualizations (using all counties) uncomment the line below
# seq_df.to_csv('/Users/mavoeg/Desktop/SARS/Wisconsin/WI_Data_Counties/input_files/sequence_counts/SERIES_urb-rur_CDC_2023-07-17.csv', header=True, index=True)


           Apr-2020  Apr-2021  Apr-2022  Aug-2020  Aug-2021  Aug-2022  \
Adams           0.0      10.0       0.0       2.0      10.0       1.0   
Ashland         0.0       3.0       0.0       0.0       2.0       3.0   
Barron          0.0      16.0      14.0       0.0      46.0       3.0   
Bayfield        0.0       3.0       2.0       0.0       3.0       0.0   
Brown          93.0     330.0       1.0       3.0      64.0       5.0   
...             ...       ...       ...       ...       ...       ...   
Waukesha        9.0      46.0      68.0       1.0     105.0     104.0   
Waupaca         0.0       1.0       1.0       0.0       5.0       3.0   
Waushara        0.0       0.0       1.0       0.0       3.0       0.0   
Winnebago       4.0      42.0      12.0       0.0      16.0      23.0   
Wood            0.0       9.0      63.0       0.0      47.0       8.0   

           Dec-2020  Dec-2021  Dec-2022  Feb-2021  ...  May-2022  Nov-2020  \
Adams           3.0       8.0       0.0      
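
Counting records per county and month is a cross-tabulation, so the nested-dictionary loop above could be replaced by `pandas.crosstab`. A sketch with invented records; the column names `county` and `collection_date` are assumptions, not the headers of the GISAID file:

```python
import pandas as pd

# Invented records standing in for rows of the sequence metadata
records = pd.DataFrame({
    "county": ["Dane", "Dane", "Adams"],
    "collection_date": ["2021-04-03", "2021-04-19", "2021-05-02"],
})

# Build the same "Mon-YYYY" label used above
records["year_month"] = pd.to_datetime(records["collection_date"]).dt.strftime("%b-%Y")

# Rows are counties, columns are months, cells are record counts (0 if none)
toy_seq = pd.crosstab(records["county"], records["year_month"])
print(toy_seq)
```

Collection dates that lack a day or month would need extra handling before the conversion to dates.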

In [222]:
# Add columns m1_urb_rur and m2_urb_rur "urban", "rural" by the new urban rural definition using the lists above
seq_df['m1_urb_rur'] = seq_df.index.map(lambda county: 'urban' if county in m1_urban else 'rural')

seq_df['m2_urb_rur'] = seq_df.index.map(lambda county: 'urban' if county in m2_urban else 'rural')


print(seq_df)

           Apr-2020  Apr-2021  Apr-2022  Aug-2020  Aug-2021  Aug-2022  \
Adams           0.0      10.0       0.0       2.0      10.0       1.0   
Ashland         0.0       3.0       0.0       0.0       2.0       3.0   
Barron          0.0      16.0      14.0       0.0      46.0       3.0   
Bayfield        0.0       3.0       2.0       0.0       3.0       0.0   
Brown          93.0     330.0       1.0       3.0      64.0       5.0   
...             ...       ...       ...       ...       ...       ...   
Waukesha        9.0      46.0      68.0       1.0     105.0     104.0   
Waupaca         0.0       1.0       1.0       0.0       5.0       3.0   
Waushara        0.0       0.0       1.0       0.0       3.0       0.0   
Winnebago       4.0      42.0      12.0       0.0      16.0      23.0   
Wood            0.0       9.0      63.0       0.0      47.0       8.0   

           Dec-2020  Dec-2021  Dec-2022  Feb-2021  ...  Nov-2021  Nov-2022  \
Adams           3.0       8.0       0.0      
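
Since the sequence-count cell replaces spaces with underscores, multi-word county names need to be converted back before they are compared with `m1_urban` and `m2_urban`; otherwise they fall through to 'rural'. A sketch of such a normalisation step, with a hypothetical helper name and assuming the list spelling `St.Croix` used at the top of this notebook:

```python
def normalize_county(name: str) -> str:
    """Convert names like 'Eau_Claire' or 'St._Croix' to the list spelling."""
    cleaned = name.replace("_", " ").strip()
    # The lists above spell this county without a space after the period
    if cleaned in ("St Croix", "St. Croix"):
        return "St.Croix"
    return cleaned


# Toy list using the same spellings as m1_urban
toy_urban = ["Eau Claire", "St.Croix", "Dane"]
for raw in ["Eau_Claire", "St._Croix", "Dane"]:
    print(raw, "->", normalize_county(raw), normalize_county(raw) in toy_urban)
```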

In [None]:
# Create a new DataFrame from the old df with desired columns to plot
newseq_df = pd.DataFrame(columns=["date", "m1_urban", "m1_rural", "m2_urban", "m2_rural"])

# iterate over columns to add up the counts for urban and rural and populate the new df
for column in seq_df.columns:
    
    m1urban_sum = seq_df[seq_df["m1_urb_rur"] == "urban"][column].sum()
    m1rural_sum = seq_df[seq_df["m1_urb_rur"] == "rural"][column].sum()
    m2urban_sum = seq_df[seq_df["m2_urb_rur"] == "urban"][column].sum()
    m2rural_sum = seq_df[seq_df["m2_urb_rur"] == "rural"][column].sum()
    newseq_df.loc[column] = [column, m1urban_sum, m1rural_sum, m2urban_sum, m2rural_sum]

# reset the index and rename the index column to "date"
newseq_df = newseq_df.reset_index().rename(columns={"index": "date"})

# remove the duplicate "date" column
newseq_df = newseq_df.loc[:, ~newseq_df.columns.duplicated()]

# drop the two rows produced by the m1_urb_rur and m2_urb_rur label columns,
# whose "sums" are concatenated strings rather than sequence counts
newseq_df = newseq_df[~newseq_df["date"].isin(["m1_urb_rur", "m2_urb_rur"])]
