Read the consolidated CDC data into a dataframe and set up some preliminary variables.

In [1]:
import pandas as pd
cdc = pd.read_csv("2013-2018_consolidated_flu_data.csv")

In [2]:
state_names = ["Alabama","Alaska","Arizona","Arkansas","California","Colorado",
  "Connecticut","Delaware","Florida","Georgia","Hawaii","Idaho","Illinois",
  "Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland",
  "Massachusetts","Michigan","Minnesota","Mississippi","Missouri","Montana",
  "Nebraska","Nevada","New Hampshire","New Jersey","New Mexico","New York",
  "North Carolina","North Dakota","Ohio","Oklahoma","Oregon","Pennsylvania",
  "Rhode Island","South Carolina","South Dakota","Tennessee","Texas","Utah",
  "Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming"]

As the sample shows, the dataset covers several statistics (avg, LL, UL, CI), several demographic groups, and both states and multi-state regions.

In [3]:
cdc.sample(40)

Unnamed: 0.1,Unnamed: 0,month,stats,race_or_age,state_or_region,value,year
205212,18102,Aug,UL,≥18 years,Louisiana,2.0,2014-2015
132411,7671,Dec,LL,≥65 years,Delaware,57.0,2017-2018
75661,13291,Feb,LL,18-64 years at high risk,Illinois,29.8,2016-2017
241667,54557,May,UL,6 months - 17 years,Region 4,56.6,2014-2015
146178,21438,Mar,CI,18-49 years not at high risk,Massachusetts,(±3.7),2017-2018
251854,2374,Sep,CI,18-64 years at high risk,Arizona,(±2.7),2013-2014
226385,39275,Aug,avg,50-64 years,Rhode Island,NR †,2014-2015
120048,57678,Feb,CI,13-17 years,Region 7,(±3.4),2016-2017
221386,34276,Sep,LL,18-49 years not at high risk,North Dakota,2.8,2014-2015
59182,59182,Jul,UL,"White only, non-Hispanic",Region 8,0.3,2015-2016


In [4]:
cdc["state_or_region"].isin(state_names)

0          True
1          True
2          True
3          True
4          True
5          True
6          True
7          True
8          True
9          True
10         True
11         True
12         True
13         True
14         True
15         True
16         True
17         True
18         True
19         True
20         True
21         True
22         True
23         True
24         True
25         True
26         True
27         True
28         True
29         True
          ...  
311694    False
311695    False
311696    False
311697    False
311698    False
311699    False
311700    False
311701    False
311702    False
311703    False
311704    False
311705    False
311706    False
311707    False
311708    False
311709    False
311710    False
311711    False
311712    False
311713    False
311714    False
311715    False
311716    False
311717    False
311718    False
311719    False
311720    False
311721    False
311722    False
311723    False
Name: state_or_region, L

For our purposes, we need only the "avg" statistic, which indicates the percentage of patients covered by that date in the season, and only the state-level rows (not the "Region N" aggregates):

In [5]:
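# Keep state-level rows carrying the "avg" statistic.
# The comparison is parenthesized because & binds more tightly than ==.
cdc_avg = cdc[cdc["state_or_region"].isin(state_names) & (cdc["stats"] == "avg")]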
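
When two conditions are combined in one filter, operator precedence matters: `&` and `*` both bind more tightly than `==`, so each comparison needs its own parentheses or its own named mask. A minimal sketch on a made-up toy frame (the `toy`, `states`, `is_state`, and `is_avg` names are illustrative and not part of the notebook):

```python
import pandas as pd

# Toy stand-in for the CDC frame; the values are invented for illustration.
toy = pd.DataFrame({
    "state_or_region": ["Alabama", "Region 4", "Alaska", "Alabama"],
    "stats": ["avg", "avg", "UL", "LL"],
})
states = ["Alabama", "Alaska"]

# Build each boolean mask separately, then combine them with &.
is_state = toy["state_or_region"].isin(states)
is_avg = toy["stats"] == "avg"
toy_avg = toy[is_state & is_avg]

print(toy_avg)
```

Naming the masks keeps the filter readable and avoids relying on how Python groups the operators.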

In [6]:
cdc_avg.sample(20)

Unnamed: 0.1,Unnamed: 0,month,stats,race_or_age,state_or_region,value,year
254350,4870,Mar,avg,Hispanic,California,NR,2013-2014
276616,27136,Nov,avg,18-64 years not at high risk,Nebraska,33.1,2013-2014
208640,21530,Dec,avg,≥65 years,Massachusetts,59.6,2014-2015
9750,9750,Oct,avg,"Black only, non-Hispanic",Florida,18.3,2015-2016
78400,16030,Dec,avg,5-12 years,Kansas,49.0,2016-2017
251921,2441,Dec,avg,18-64 years not at high risk,Arizona,21.3,2013-2014
2540,2540,Sep,avg,18-49 years at high risk,Arizona,7.4,2015-2016
259735,10255,Apr,avg,18-64 years,Georgia,30.9,2013-2014
37435,37435,Feb,avg,"White only, non-Hispanic",Oregon,41.0,2015-2016
98690,36320,Nov,avg,50-64 years,Oklahoma,42.4,2016-2017


In [7]:
cdc_avg.shape

(49500, 7)

We need to remove the non-reported values (marked "NR", "NR †", or "NR *") and convert the rest to float.

In [8]:
# Remove non-reported ("NR") values
nr_markers = ["NR †", "NR *", "NR"]
# .copy() gives an independent frame, which avoids SettingWithCopyWarning below
cdc_avg_cleaned = cdc_avg[~cdc_avg["value"].isin(nr_markers)].copy()
cdc_avg_cleaned["value_flt"] = cdc_avg_cleaned["value"].astype(float)
cdc_avg_cleaned = cdc_avg_cleaned.drop(columns=["Unnamed: 0"])
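
As an alternative to listing every "NR" marker, the conversion itself can flag the non-numeric entries. The sketch below is an assumption about how one might do this, not the notebook's method: it uses `pd.to_numeric` with `errors="coerce"` on a made-up toy column (`toy` and `toy_cleaned` are illustrative names), then drops the rows that failed to parse.

```python
import pandas as pd

# Toy column mixing numeric strings with the "NR" markers seen in the data.
toy = pd.DataFrame({"value": ["46.5", "NR †", "NR *", "NR", "35.0"]})

# Anything that cannot be parsed as a number becomes NaN.
toy["value_flt"] = pd.to_numeric(toy["value"], errors="coerce")

# Drop the unparsed rows; .copy() yields an independent frame,
# so later column assignments do not trigger SettingWithCopyWarning.
toy_cleaned = toy.dropna(subset=["value_flt"]).copy()

print(toy_cleaned)
```

The trade-off is that coercion also silently discards any other malformed value, so the explicit marker list is safer when unexpected strings should raise an error.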


In [9]:
cdc_avg_cleaned.sample(20)

Unnamed: 0,month,stats,race_or_age,state_or_region,value,year,value_flt
190880,Jan,avg,"White only, non-Hispanic",Arkansas,46.5,2014-2015,46.5
141695,Oct,avg,6 months - 4 years,Kentucky,35.0,2017-2018,35.0
283047,Jan,avg,"Other or multiple races, non-Hispanic",North Carolina,43.5,2013-2014,43.5
75710,Jan,avg,18-64 years not at high risk,Illinois,29.5,2016-2017,29.5
202885,Apr,avg,Hispanic,Iowa,56.6,2014-2015,56.6
138915,Mar,avg,≥18 years,Indiana,32.2,2017-2018,32.2
152315,Nov,avg,"Black only, non-Hispanic",Nebraska,36.3,2017-2018,36.3
173115,Jan,avg,"Black only, non-Hispanic",West Virginia,43.6,2017-2018,43.6
277219,Nov,avg,6 months - 17 years,Nevada,39.0,2013-2014,39.0
263352,Mar,avg,≥6 months,Indiana,40.9,2013-2014,40.9


We now keep only the rows for the full population (the "≥6 months" group):

In [10]:
cdc_allpopulation = cdc_avg_cleaned[cdc_avg_cleaned.race_or_age == "≥6 months"]

Now compare the number of rows with the number of possible data points: 5 seasons × 12 months × 50 states. We expect fewer rows because some values were not reported.

In [11]:
cdc_allpopulation.shape

(2665, 7)

In [12]:
5*12*50

3000
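
To see which combinations are absent rather than only how many, the full grid of (state, season, month) can be compared with the rows actually present. This is a sketch on made-up toy data, assuming the column names used at this point in the notebook; `toy`, `full_index`, `present`, and `missing` are illustrative names.

```python
import pandas as pd

# Toy frame: Alaska has no row for August.
toy = pd.DataFrame({
    "state_or_region": ["Alabama", "Alabama", "Alaska"],
    "year": ["2013-2014", "2013-2014", "2013-2014"],
    "month": ["Jul", "Aug", "Jul"],
})

# Every combination we would expect if nothing were missing.
full_index = pd.MultiIndex.from_product(
    [["Alabama", "Alaska"], ["2013-2014"], ["Jul", "Aug"]],
    names=["state_or_region", "year", "month"],
)

# Combinations actually present in the data.
present = pd.MultiIndex.from_frame(toy[["state_or_region", "year", "month"]])

# Expected combinations that have no row.
missing = full_index.difference(present)
print(list(missing))
```

On the real data, the length of `missing` should equal the gap between the expected and observed row counts, provided there are no duplicate rows.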

In [13]:
cdc_allpopulation.sample(20)

Unnamed: 0,month,stats,race_or_age,state_or_region,value,year,value_flt
156420,Jul,avg,≥6 months,New York,0.7,2017-2018,0.7
10915,Dec,avg,≥6 months,Hawaii,40.7,2015-2016,40.7
45575,Feb,avg,≥6 months,Virginia,47.6,2015-2016,47.6
81210,Jan,avg,≥6 months,Maine,46.3,2016-2017,46.3
93085,Dec,avg,≥6 months,New Mexico,41.6,2016-2017,41.6
151490,Nov,avg,≥6 months,Nebraska,38.8,2017-2018,38.8
218790,Jul,avg,≥6 months,New York,0.7,2014-2015,0.7
109925,Feb,avg,≥6 months,West Virginia,48.2,2016-2017,48.2
66370,Mar,avg,≥6 months,California,47.7,2016-2017,47.7
67365,Apr,avg,≥6 months,Colorado,49.7,2016-2017,49.7


We can now convert the dates to a more convenient timestamp object, taking into account the way a 'season' is defined by the CDC:

In [14]:
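def convertToTimestamp(monthCol, yearCol):
    """Map each (month, season) pair to a timestamp on the 15th of that month."""
    dateCol = []
    for (month, year) in zip(monthCol, yearCol):
        # In a season label such as "2014-2015", Jul-Dec fall in the first
        # year and Jan-Jun in the second
        if month in ["Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]:
            yr = year[0:4]
        else:
            yr = year[-4:]
        # pd.Timestamp.strptime raises NotImplementedError in recent pandas
        # versions, so parse with pd.to_datetime instead
        dateCol.append(pd.to_datetime(month + " 15 " + yr, format="%b %d %Y"))

    return dateCol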
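
The season rule can be checked on single values. The sketch below restates the same logic as a scalar helper; `season_month_to_timestamp` and `FIRST_HALF` are hypothetical names introduced only for this illustration, and the day of the month is fixed at 15 as in the function above.

```python
import pandas as pd

# Months that belong to the first calendar year of a season label.
FIRST_HALF = {"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}

def season_month_to_timestamp(month, season):
    """Return the 15th of `month` in the calendar year implied by `season`."""
    year = season[:4] if month in FIRST_HALF else season[-4:]
    return pd.to_datetime(f"{month} 15 {year}", format="%b %d %Y")

aug = season_month_to_timestamp("Aug", "2014-2015")
mar = season_month_to_timestamp("Mar", "2017-2018")
print(aug)  # 2014-08-15 00:00:00
print(mar)  # 2018-03-15 00:00:00
```

Note that `%b` parses month abbreviations according to the current locale, so this assumes an English locale.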

In [15]:
# Work on an explicit copy so the new column is not assigned to a slice of another frame
cdc_allpopulation = cdc_allpopulation.copy()
cdc_allpopulation["time"] = convertToTimestamp(cdc_allpopulation["month"].values, cdc_allpopulation["year"].values)


Rename some columns and save the final dataframe.

In [16]:
# Reassign instead of using inplace=True, which warns when the frame is a slice
cdc_allpopulation = cdc_allpopulation.rename(columns={'value_flt': 'mean_pct', 'state_or_region': 'state'})


In [17]:
cdc_allpopulation.sample(20)

Unnamed: 0,month,stats,race_or_age,state,value,year,mean_pct,time
207905,Aug,avg,≥6 months,Massachusetts,2.4,2014-2015,2.4,2014-08-15
193055,Aug,avg,≥6 months,Connecticut,1.6,2014-2015,1.6,2014-08-15
230675,Aug,avg,≥6 months,Utah,1.4,2014-2015,1.4,2014-08-15
173290,Mar,avg,≥6 months,Wisconsin,39.3,2017-2018,39.3,2018-03-15
196030,Sep,avg,≥6 months,Florida,9.1,2014-2015,9.1,2014-09-15
149505,Oct,avg,≥6 months,Missouri,26.6,2017-2018,26.6,2017-10-15
90100,Sep,avg,≥6 months,Nevada,9.1,2016-2017,9.1,2016-09-15
221770,Sep,avg,≥6 months,Ohio,9.2,2014-2015,9.2,2014-09-15
94095,Apr,avg,≥6 months,New York,49.2,2016-2017,49.2,2017-04-15
47565,Apr,avg,≥6 months,West Virginia,49.2,2015-2016,49.2,2016-04-15


In [18]:
cdc_allpopulation.to_csv("cdc_average_bystate_2013-2017.csv")