# Counting sequences using two different methods of Urban/Rural groupings

## July 13, 2023

This script reads a metadata file downloaded from GISAID (formatted as augur input) and counts the sequences available in GISAID during the COVID-19 pandemic for each Wisconsin county in the file. I then categorize the counties using a second definition of "urban" and "rural" counties, which I call "Method Two" ("m2" for short).

The original method of classification (`m1` in the code below) was taken directly from WISH (Wisconsin Interactive Statistics on Health, https://www.dhs.wisconsin.gov/wish/urban-rural.htm) and the Rural Health report published by the Wisconsin Office of Rural Health.

These definitions of urban and rural are based on the 2013 NCHSUR codes, which score counties from 1 to 6, with 1 being the most metropolitan and 6 being the least metropolitan.
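As a minimal sketch of how a 1-6 code can be turned into a label, the function below uses the Method Two cutoff stated in the code comment further down (codes 1-3 "urban", codes 4-6 "rural"). The function name `classify_nchs` and the `urban_max` parameter are hypothetical and do not appear in the notebook, which hard-codes county lists instead.

```python
def classify_nchs(code, urban_max=3):
    """Label a 1-6 urban-rural code as 'urban' or 'rural'.

    urban_max=3 mirrors the Method Two cutoff (1-3 urban, 4-6 rural);
    the name and parameter are illustrative assumptions.
    """
    if code not in range(1, 7):
        raise ValueError(f"expected a code from 1 to 6, got {code!r}")
    return "urban" if code <= urban_max else "rural"


# Label every possible code under the Method Two cutoff
labels = {code: classify_nchs(code) for code in range(1, 7)}
print(labels)
```

Keeping the cutoff as a parameter makes it easy to see how a different threshold would shift counties between the two groups.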


<b>Note:</b> <i>It is strongly advised to use a "cleaned" metadata file from GISAID or GenBank before running this script. Use the Python script "clean_wicounties_metadata_tsv_2023_05_03.py" to accomplish this task.</i>

In [82]:
m1_urban = ['Brown', 'Calumet', 'Chippewa', 'Columbia', 'Dane', 'Douglas', 'Eau Claire', 'Fond du Lac', 
            'Green', 'Iowa', 'Kenosha', 'Kewaunee', 'La Crosse', 'Marathon', 'Milwaukee', 'Oconto', 
            'Outagamie', 'Ozaukee', 'Pierce', 'Racine', 'Rock', 'St. Croix', 'Sheboygan', 'Washington', 
            'Waukesha', 'Winnebago']

m1_rural = ['Adams', 'Ashland', 'Barron', 'Bayfield', 'Buffalo', 'Burnett', 'Clark', 'Crawford', 
            'Dodge', 'Door', 'Dunn', 'Florence', 'Forest', 'Grant', 'Green Lake', 'Iron', 
            'Jackson', 'Jefferson', 'Juneau', 'Lafayette', 'Langlade', 'Lincoln', 'Manitowoc', 
            'Marinette', 'Marquette', 'Menominee', 'Monroe', 'Oneida', 'Pepin', 'Polk', 'Portage', 
            'Price', 'Richland', 'Rusk', 'Sauk', 'Sawyer', 'Shawano', 'Taylor', 'Trempealeau', 
            'Vernon', 'Vilas', 'Walworth', 'Washburn', 'Waupaca', 'Waushara', 'Wood']
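Because the urban and rural lists are typed by hand, a quick consistency check can catch a county that was entered twice or left out. The sketch below is an optional addition, not part of the original workflow; `check_groupings` is a hypothetical helper name, and the short lists are toy inputs standing in for the real ones.

```python
from collections import Counter


def check_groupings(urban, rural):
    """Summarize an urban/rural split of county names.

    Returns the number of distinct counties and any name that appears
    more than once across the two lists.
    """
    counts = Counter(list(urban) + list(rural))
    repeated = sorted(name for name, n in counts.items() if n > 1)
    return {"total": len(counts), "repeated": repeated}


# Toy inputs for illustration; 'Dane' is deliberately listed in both groups
print(check_groupings(["Brown", "Dane"], ["Adams", "Dane"]))
```

In the notebook this could be called as `check_groupings(m1_urban, m1_rural)`, and in the same way on the Method Two lists, to confirm that both methods cover the same set of counties with no repeats.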

In [83]:
# Define the new urban and rural WI counties as 2013 NCHSUR Codes 1-3 being Metro or "urban"
# and 4-6 being Nonmetro or "rural"


m2_urban = ['Milwaukee', 'Kenosha', 'Ozaukee', 'Pierce', 'St. Croix', 
            'Washington', 'Waukesha', 'Brown', 'Columbia', 'Dane', 
            'Douglas', 'Green', 'Iowa', 'Kewaunee', 'Oconto']

m2_rural = ["Calumet", "Chippewa", "Eau Claire", "Fond du Lac", "La Crosse", 
            "Marathon", "Outagamie", "Racine", "Rock", "Sheboygan", "Winnebago", 
            "Dodge", "Dunn", "Florence", "Grant", "Jefferson", "Lincoln", 
            "Manitowoc", "Marinette", "Menominee", "Portage", "Sauk", "Shawano", 
            "Walworth", "Wood", "Adams", "Ashland", "Barron", "Bayfield", "Buffalo", 
            "Burnett", "Clark", "Crawford", "Door", "Forest", "Green Lake", "Iron", 
            "Jackson", "Juneau", "Lafayette", "Langlade", "Marquette", "Monroe", 
            "Oneida", "Pepin", "Polk", "Price", 
            "Richland", "Rusk", "Sawyer", "Taylor", "Trempealeau", "Vernon", 
            "Vilas", "Washburn", "Waupaca", "Waushara"]

The next block of code reads in the input metadata file and counts the number of sequences per month and year for each county using a dictionary to store the values. The dictionary is then converted to a dataframe which can be printed for viewing and then exported to an output file. 

In [84]:
import pandas as pd
import csv
import calendar


input_file = '/Users/mavoeg/computational_folder/gh_folder/ncov/data/clean_ready_for_nextstrain/nextstrain_may_2023/06_01_2023urbrur_hilomid.tsv' # Gisaid dataset
#input_file = '/Users/mavoeg/Desktop/SARS/Wisconsin/WI_Data_Counties/genbank/2023-06-30-CDC_Contract_Seq_NoNAs.csv' # CDC dataset
output_dict = {}
with open(input_file, "r") as infile:
    tsv_reader = csv.reader(infile, delimiter='\t')  # Use csv.reader with tab delimiter
    for line in tsv_reader:

        if "Accession ID" in line:
            pass
        else:
            collection_date = line[4]  # Access the column by index 
            location = line[8]

            if not location.strip():  # an empty location has no county to extract
                county = "unknown"
            else:
                counties = location.split("County")
                county = counties[0].strip()  # Keep the text before "County"
                # Replace spaces with underscores; note that the m1/m2 lists above
                # spell multi-word counties with spaces (e.g. 'Eau Claire')
                county = county.replace(" ", "_")

            date_parts = collection_date.split("-")

            if len(date_parts) >= 2:
                year = int(date_parts[0])
                month = int(date_parts[1])
                month_abbr = calendar.month_abbr[month]
                    
                
                year_month =  month_abbr + '-' + str(year) 

                if '0 County' in location:
                    pass
                else:
                    # Make the dictionary
                    if county not in output_dict:
                        output_dict[county] = {}
                    if year_month not in output_dict[county]:
                        output_dict[county][year_month] = 1
                    else:
                        output_dict[county][year_month] += 1
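The per-row logic above can also be sketched as a small function, which makes it easier to test on a few values before running the whole file. This is an illustrative variant under stated assumptions, not the notebook's code: `parse_record` is a hypothetical name, the location is assumed to look like "Dane County", and spaces in county names are kept (no underscore replacement) so that the names match the spelling used in the urban/rural lists.

```python
import calendar
from collections import Counter


def parse_record(collection_date, location):
    """Return (county, 'Mon-YYYY'), or None when the date has no usable month."""
    county = location.split("County")[0].strip() or "unknown"
    parts = collection_date.split("-")
    if len(parts) < 2 or not parts[0].isdigit() or not parts[1].isdigit():
        return None
    month = int(parts[1])
    if not 1 <= month <= 12:
        return None
    return county, f"{calendar.month_abbr[month]}-{int(parts[0])}"


# Made-up rows for illustration: (collection date, location)
rows = [("2021-04-17", "Dane County"), ("2021-04-02", "Dane County"), ("2021", "Brown County")]
tally = Counter(key for key in (parse_record(d, loc) for d, loc in rows) if key is not None)
print(tally)
```

A `Counter` keyed by `(county, year_month)` holds the same information as the nested dictionary; rows with a year-only date are skipped, as in the loop above.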


In [85]:

df = pd.DataFrame.from_dict(output_dict, orient="index") # make a df from the dictionary
df = df.sort_index().sort_index(axis=1)  # Sort the index (county) and column names 


# Fill NaN values with 0
df = df.fillna(0)
print(df)

# to create series visualizations (using all counties) uncomment the line below
# df.to_csv('/Users/mavoeg/Desktop/SARS/Wisconsin/WI_Data_Counties/input_files/sequence_counts/SERIES_urb-rur_CDC_2023-07-17.csv', header=True, index=True)

           Apr-2020  Apr-2021  Apr-2022  Aug-2020  Aug-2021  Aug-2022  \
Adams           0.0      10.0       0.0       2.0      10.0       1.0   
Ashland         0.0       3.0       0.0       0.0       2.0       3.0   
Barron          0.0      16.0      14.0       0.0      46.0       3.0   
Bayfield        0.0       3.0       2.0       0.0       3.0       0.0   
Brown          93.0     330.0       1.0       3.0      64.0       5.0   
...             ...       ...       ...       ...       ...       ...   
Waukesha        9.0      46.0      68.0       1.0     105.0     104.0   
Waupaca         0.0       1.0       1.0       0.0       5.0       3.0   
Waushara        0.0       0.0       1.0       0.0       3.0       0.0   
Winnebago       4.0      42.0      12.0       0.0      16.0      23.0   
Wood            0.0       9.0      63.0       0.0      47.0       8.0   

           Dec-2020  Dec-2021  Dec-2022  Feb-2021  ...  May-2022  Nov-2020  \
Adams           3.0       8.0       0.0      
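Note that `sort_index(axis=1)` orders the month columns alphabetically (Apr-2020, Apr-2021, Apr-2022, Aug-2020, ...), as the output above shows. If a chronological order is preferred, one possible approach is to sort the labels by their parsed date. The helper name `chronological` is hypothetical, and the sketch assumes every label follows the "Mon-YYYY" pattern built earlier.

```python
from datetime import datetime


def chronological(labels):
    """Sort 'Mon-YYYY' labels by calendar date instead of alphabetically."""
    return sorted(labels, key=lambda label: datetime.strptime(label, "%b-%Y"))


# A few of the column labels seen in the output above
ordered = chronological(["Apr-2020", "Apr-2021", "Apr-2022", "Aug-2020", "Dec-2020"])
print(ordered)
```

With the dataframe from the cell above, `df = df[chronological(df.columns)]` would reorder the columns, provided it runs before any non-date columns are added.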

In [95]:
# Add the columns m1_urb_rur and m2_urb_rur, labeling each county "urban" or "rural" under each method using the lists above
df['m1_urb_rur'] = df.index.map(lambda county: 'urban' if county in m1_urban else 'rural')

df['m2_urb_rur'] = df.index.map(lambda county: 'urban' if county in m2_urban else 'rural')


print(df)

           Apr-2020  Apr-2021  Apr-2022  Aug-2020  Aug-2021  Aug-2022  \
Adams           0.0      10.0       0.0       2.0      10.0       1.0   
Ashland         0.0       3.0       0.0       0.0       2.0       3.0   
Barron          0.0      16.0      14.0       0.0      46.0       3.0   
Bayfield        0.0       3.0       2.0       0.0       3.0       0.0   
Brown          93.0     330.0       1.0       3.0      64.0       5.0   
...             ...       ...       ...       ...       ...       ...   
Waukesha        9.0      46.0      68.0       1.0     105.0     104.0   
Waupaca         0.0       1.0       1.0       0.0       5.0       3.0   
Waushara        0.0       0.0       1.0       0.0       3.0       0.0   
Winnebago       4.0      42.0      12.0       0.0      16.0      23.0   
Wood            0.0       9.0      63.0       0.0      47.0       8.0   

           Dec-2020  Dec-2021  Dec-2022  Feb-2021  ...  Nov-2021  Nov-2022  \
Adams           3.0       8.0       0.0      
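The two lambdas above label every county that is not in the urban list as "rural", so a name that is spelled differently from the lists (for example with an underscore in place of a space, or "unknown") is silently counted as rural. A small check along these lines can surface such names; `unmatched_counties` is a hypothetical helper and the inputs below are toy values.

```python
def unmatched_counties(index_names, *groupings):
    """Return the names that do not appear in any of the given county lists."""
    known = set().union(*groupings)
    return sorted(name for name in index_names if name not in known)


# Toy example: 'Eau_Claire' does not match the list spelling 'Eau Claire'
missing = unmatched_counties(
    ["Dane", "Eau_Claire", "unknown"],
    ["Dane", "Eau Claire"],  # stand-in for an urban list
    ["Adams"],               # stand-in for a rural list
)
print(missing)
```

In the notebook, `unmatched_counties(df.index, m1_urban, m1_rural)` should return an empty list if every county in the data was matched.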

In [96]:
# Create a new DataFrame from the old df with desired columns to plot
new_df = pd.DataFrame(columns=["date", "m1_urban", "m1_rural", "m2_urban", "m2_rural"])

# iterate over columns to add up the counts for urban and rural and populate the new df
for column in df.columns:
    
    m1urban_sum = df[df["m1_urb_rur"] == "urban"][column].sum()
    m1rural_sum = df[df["m1_urb_rur"] == "rural"][column].sum()
    m2urban_sum = df[df["m2_urb_rur"] == "urban"][column].sum()
    m2rural_sum = df[df["m2_urb_rur"] == "rural"][column].sum()
    new_df.loc[column] = [column, m1urban_sum, m1rural_sum, m2urban_sum, m2rural_sum]

# reset the index and rename the index column to "date"
new_df = new_df.reset_index().rename(columns={"index": "date"})

# remove the duplicate "date" column
new_df = new_df.loc[:, ~new_df.columns.duplicated()]

# drop the m1_urb_rur and m2_urb_rur rows, which hold concatenated label strings
# (a side effect of iterating over every column above) rather than counts
new_df = new_df[~new_df["date"].isin(["m1_urb_rur", "m2_urb_rur"])]
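An alternative to looping over the columns is to let pandas group the county rows by their label and sum each month in one step. The sketch below is an assumption about how this could be done, not the notebook's method; `totals_by_group` is a hypothetical name, and the small frame reuses three county rows from the printed output above.

```python
import pandas as pd

# Three counties and two months taken from the output shown earlier
counts = pd.DataFrame(
    {"Apr-2020": [0.0, 93.0, 9.0], "Apr-2021": [10.0, 330.0, 46.0]},
    index=["Adams", "Brown", "Waukesha"],
)
# Method One labels for these counties, per the m1 lists
labels = pd.Series({"Adams": "rural", "Brown": "urban", "Waukesha": "urban"})


def totals_by_group(counts, labels, prefix):
    """Sum each month's counts per label; rows are months, columns are groups."""
    grouped = counts.groupby(labels).sum().T
    return grouped.add_prefix(prefix + "_")


m1_totals = totals_by_group(counts, labels, "m1")
print(m1_totals)
```

Because only the month columns are passed in, no label rows need to be dropped afterwards, and the same function can be called again with the Method Two labels and the prefix "m2".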


In [92]:
# print the df, then export it to a csv for plotting in R
print(new_df)

# uncomment to save the GISAID sequence dataset
# new_df.to_csv('/Users/mavoeg/Desktop/SARS/Wisconsin/WI_Data_Counties/input_files/sequence_counts/m1m2_urb-rur_GISAID_2023-07-18.csv', header=True, index=True)
# uncomment to save the CDC sequence dataset
# new_df.to_csv('/Users/mavoeg/Desktop/SARS/Wisconsin/WI_Data_Counties/input_files/sequence_counts/m2_urb-rur_CDC_2023-07-12.csv', header=True, index=True)

        date m1_urban m1_rural m2_urban m2_rural
0   Apr-2020    264.0     59.0    242.0     81.0
1   Apr-2021    926.0    397.0    725.0    598.0
2   Apr-2022    904.0    253.0    797.0    360.0
3   Aug-2020    200.0    122.0    181.0    141.0
4   Aug-2021    932.0    781.0    660.0   1053.0
5   Aug-2022   1086.0    136.0    971.0    251.0
6   Dec-2020    427.0    586.0    272.0    741.0
7   Dec-2021   2032.0    374.0   1827.0    579.0
8   Dec-2022    441.0    179.0    393.0    227.0
9   Feb-2021    639.0    205.0    565.0    279.0
10  Feb-2022    566.0     83.0    507.0    142.0
11  Feb-2023    126.0     82.0    104.0    104.0
12  Jan-2021    747.0    345.0    678.0    414.0
13  Jan-2022   1182.0     66.0   1116.0    132.0
14  Jan-2023    312.0     87.0    277.0    122.0
15  Jul-2020    275.0    109.0    267.0    117.0
16  Jul-2021   1009.0    454.0    779.0    684.0
17  Jul-2022    941.0    245.0    819.0    367.0
18  Jun-2020    301.0     81.0    269.0    113.0
19  Jun-2021    135.