# Delivery 4 - Part 1

## Adding influenza type counts to subsets

This is where I start the statistical analysis. We want to look at flu types A and B, but first we need to think about how to approach this. Here I load the `flucases2.csv` file, set up a subset as before, and look at the layout of the `Disease` column. This will help you see what the challenge is.

In [1]:
import pandas as pd
import numpy as np
cols = ["Season","Region","County","CDC_Week","Week_Ending_Date","Disease",
        "Count","County_Centroid","FIPS"]
path = "../Datasets/"
df = pd.read_csv(path+"flucases2.csv", usecols=cols)
df.head()

Unnamed: 0,Season,Region,County,CDC_Week,Week_Ending_Date,Disease,Count,County_Centroid,FIPS
0,2009-2010,CAPITAL DISTRICT,WASHINGTON,40,10/10/2009,INFLUENZA_B,0,"(43.3123766, -73.4394282)",36115
1,2009-2010,CAPITAL DISTRICT,OTSEGO,4,01/30/2010,INFLUENZA_UNSPECIFIED,0,"(42.6297762, -75.028841)",36077
2,2009-2010,WESTERN,ALLEGANY,12,03/27/2010,INFLUENZA_UNSPECIFIED,0,"(42.2478938, -78.0261758)",36003
3,2009-2010,WESTERN,ERIE,16,04/24/2010,INFLUENZA_B,0,"(42.752759, -78.7781922)",36029
4,2009-2010,WESTERN,CHAUTAUQUA,52,01/02/2010,INFLUENZA_B,0,"(42.3042159, -79.4075949)",36013


In [2]:
df.isna().sum()

Season              0
Region              0
County              0
CDC_Week            0
Week_Ending_Date    0
Disease             0
Count               0
County_Centroid     0
FIPS                0
dtype: int64

We will use the 2009-2010 season subset as an example, built with the same subset methods as before. We cannot group by county yet, because the grouped sum would drop our `Disease` column (which identifies the type of influenza). So we need a way to keep that information first!

In [3]:
# 2009 - 2010 season
df_09 = df.loc[(df["Season"]=="2009-2010")]
df_09.head()

Unnamed: 0,Season,Region,County,CDC_Week,Week_Ending_Date,Disease,Count,County_Centroid,FIPS
0,2009-2010,CAPITAL DISTRICT,WASHINGTON,40,10/10/2009,INFLUENZA_B,0,"(43.3123766, -73.4394282)",36115
1,2009-2010,CAPITAL DISTRICT,OTSEGO,4,01/30/2010,INFLUENZA_UNSPECIFIED,0,"(42.6297762, -75.028841)",36077
2,2009-2010,WESTERN,ALLEGANY,12,03/27/2010,INFLUENZA_UNSPECIFIED,0,"(42.2478938, -78.0261758)",36003
3,2009-2010,WESTERN,ERIE,16,04/24/2010,INFLUENZA_B,0,"(42.752759, -78.7781922)",36029
4,2009-2010,WESTERN,CHAUTAUQUA,52,01/02/2010,INFLUENZA_B,0,"(42.3042159, -79.4075949)",36013


As you can see, the `Disease` column holds several categories. We need to give each of them (`INFLUENZA_A`, `INFLUENZA_B`, and `INFLUENZA_UNSPECIFIED`) its own column, holding that type's count for each county!

**Challenge:** Multiple variables stored in _one column_.


In [4]:
disease_split = df_09["Disease"].str.split('_')
disease_split.head()

0              [INFLUENZA, B]
1    [INFLUENZA, UNSPECIFIED]
2    [INFLUENZA, UNSPECIFIED]
3              [INFLUENZA, B]
4              [INFLUENZA, B]
Name: Disease, dtype: object

In [5]:
type(disease_split)

pandas.core.series.Series

In [6]:
# Think of a series as a column of data
print(disease_split[1])
print(type(disease_split[1]))

['INFLUENZA', 'UNSPECIFIED']
<class 'list'>


In [7]:
disease_split.str.get(1)

0                 B
1       UNSPECIFIED
2       UNSPECIFIED
3                 B
4                 B
           ...     
6133              B
6134              B
6135    UNSPECIFIED
6136              B
6137              A
Name: Disease, Length: 6138, dtype: object
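Here is a minimal sketch of the same split on a toy Series. The three values are made up to mirror the `Disease` labels, and `n=1` (my addition, not used in the cells above) limits the split to the first underscore so that anything after it stays in one piece.

```python
import pandas as pd

# Toy stand-in for df_09["Disease"]; the values mirror the labels in the dataset.
disease = pd.Series(["INFLUENZA_B", "INFLUENZA_UNSPECIFIED", "INFLUENZA_A"])

# Split once on the underscore, then take the second piece of each list.
flu_type = disease.str.split("_", n=1).str.get(1)
print(flu_type.tolist())
```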

In [8]:
# Create new column in dataset:
df_09['Type'] = disease_split.str.get(1)
df_09.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,Season,Region,County,CDC_Week,Week_Ending_Date,Disease,Count,County_Centroid,FIPS,Type
0,2009-2010,CAPITAL DISTRICT,WASHINGTON,40,10/10/2009,INFLUENZA_B,0,"(43.3123766, -73.4394282)",36115,B
1,2009-2010,CAPITAL DISTRICT,OTSEGO,4,01/30/2010,INFLUENZA_UNSPECIFIED,0,"(42.6297762, -75.028841)",36077,UNSPECIFIED
2,2009-2010,WESTERN,ALLEGANY,12,03/27/2010,INFLUENZA_UNSPECIFIED,0,"(42.2478938, -78.0261758)",36003,UNSPECIFIED
3,2009-2010,WESTERN,ERIE,16,04/24/2010,INFLUENZA_B,0,"(42.752759, -78.7781922)",36029,B
4,2009-2010,WESTERN,CHAUTAUQUA,52,01/02/2010,INFLUENZA_B,0,"(42.3042159, -79.4075949)",36013,B
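The warning printed above appears because `df_09` was created by filtering `df`, so pandas cannot tell whether the new column is meant for the subset or for the original. One common way to avoid it is to take an explicit copy when subsetting. The sketch below uses a small made-up frame in place of the real `df`.

```python
import pandas as pd

# Toy frame standing in for `df`; the rows are made up.
df = pd.DataFrame({
    "Season": ["2009-2010", "2009-2010", "2010-2011"],
    "Disease": ["INFLUENZA_A", "INFLUENZA_B", "INFLUENZA_A"],
    "Count": [3, 1, 5],
})

# .copy() makes the subset an independent DataFrame, so adding a column
# to it is unambiguous and the original `df` is left untouched.
df_09 = df.loc[df["Season"] == "2009-2010"].copy()
df_09["Type"] = df_09["Disease"].str.split("_").str.get(1)
print(df_09)
```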


In [9]:
df_09

Unnamed: 0,Season,Region,County,CDC_Week,Week_Ending_Date,Disease,Count,County_Centroid,FIPS,Type
0,2009-2010,CAPITAL DISTRICT,WASHINGTON,40,10/10/2009,INFLUENZA_B,0,"(43.3123766, -73.4394282)",36115,B
1,2009-2010,CAPITAL DISTRICT,OTSEGO,4,01/30/2010,INFLUENZA_UNSPECIFIED,0,"(42.6297762, -75.028841)",36077,UNSPECIFIED
2,2009-2010,WESTERN,ALLEGANY,12,03/27/2010,INFLUENZA_UNSPECIFIED,0,"(42.2478938, -78.0261758)",36003,UNSPECIFIED
3,2009-2010,WESTERN,ERIE,16,04/24/2010,INFLUENZA_B,0,"(42.752759, -78.7781922)",36029,B
4,2009-2010,WESTERN,CHAUTAUQUA,52,01/02/2010,INFLUENZA_B,0,"(42.3042159, -79.4075949)",36013,B
...,...,...,...,...,...,...,...,...,...,...
6133,2009-2010,NYC,NEW YORK,15,04/17/2010,INFLUENZA_B,0,"(40.7831, -73.9712)",36061,B
6134,2009-2010,METRO,WESTCHESTER,7,02/20/2010,INFLUENZA_B,1,"(41.1527698, -73.745912)",36119,B
6135,2009-2010,CENTRAL,JEFFERSON,40,10/10/2009,INFLUENZA_UNSPECIFIED,0,"(44.0607, -75.9928)",36045,UNSPECIFIED
6136,2009-2010,METRO,WESTCHESTER,46,11/21/2009,INFLUENZA_B,2,"(41.1527698, -73.745912)",36119,B


In [10]:
# Sort by Region, then County, then CDC week, and drop the columns we no longer need
df_09 = df_09.sort_values(by = ['Region','County','CDC_Week'])
df_09 = df_09.drop(['Region', "Week_Ending_Date", "CDC_Week", "Disease",
                   "County_Centroid"], axis=1)
df_09 = df_09.reset_index(drop = True)
df_09.head()

Unnamed: 0,Season,County,Count,FIPS,Type
0,2009-2010,ALBANY,0,36001,UNSPECIFIED
1,2009-2010,ALBANY,0,36001,B
2,2009-2010,ALBANY,0,36001,A
3,2009-2010,ALBANY,0,36001,B
4,2009-2010,ALBANY,0,36001,A


In [11]:
df_09.tail()

Unnamed: 0,Season,County,Count,FIPS,Type
6133,2009-2010,YATES,1,36123,A
6134,2009-2010,YATES,0,36123,B
6135,2009-2010,YATES,0,36123,UNSPECIFIED
6136,2009-2010,YATES,0,36123,B
6137,2009-2010,YATES,0,36123,A


In [12]:
df_09.shape

(6138, 5)

Next, I split the data into one subset per influenza type.

In [13]:
# 2009 - 2010 season Influenza Type A
df_09_A = df_09.loc[(df_09["Type"]=="A")]
df_09_A.head()

Unnamed: 0,Season,County,Count,FIPS,Type
2,2009-2010,ALBANY,0,36001,A
4,2009-2010,ALBANY,0,36001,A
6,2009-2010,ALBANY,0,36001,A
9,2009-2010,ALBANY,2,36001,A
13,2009-2010,ALBANY,1,36001,A


In [14]:
df_09_A.shape

(2046, 5)

In [15]:
A_09 = df_09_A.groupby("County").sum().reset_index()
A_09.head()

Unnamed: 0,County,Count,FIPS
0,ALBANY,331,1188033
1,ALLEGANY,222,1188099
2,BRONX,1063,1188165
3,BROOME,626,1188231
4,CATTARAUGUS,171,1188297


In [16]:
df_09_B = df_09.loc[(df_09["Type"]=="B")]
df_09_B.head()

Unnamed: 0,Season,County,Count,FIPS,Type
1,2009-2010,ALBANY,0,36001,B
3,2009-2010,ALBANY,0,36001,B
8,2009-2010,ALBANY,0,36001,B
11,2009-2010,ALBANY,0,36001,B
12,2009-2010,ALBANY,0,36001,B


In [17]:
B_09 = df_09_B.groupby("County").sum().reset_index()
B_09.head()


Unnamed: 0,County,Count,FIPS
0,ALBANY,2,1188033
1,ALLEGANY,0,1188099
2,BRONX,33,1188165
3,BROOME,2,1188231
4,CATTARAUGUS,1,1188297


In [18]:
df_09_U = df_09.loc[(df_09["Type"]=="UNSPECIFIED")]
df_09_U.head()

Unnamed: 0,Season,County,Count,FIPS,Type
0,2009-2010,ALBANY,0,36001,UNSPECIFIED
5,2009-2010,ALBANY,0,36001,UNSPECIFIED
7,2009-2010,ALBANY,0,36001,UNSPECIFIED
10,2009-2010,ALBANY,0,36001,UNSPECIFIED
14,2009-2010,ALBANY,0,36001,UNSPECIFIED


In [19]:
U_09 = df_09_U.groupby("County").sum().reset_index()
U_09.head()

Unnamed: 0,County,Count,FIPS
0,ALBANY,0,1188033
1,ALLEGANY,0,1188099
2,BRONX,20,1188165
3,BROOME,14,1188231
4,CATTARAUGUS,0,1188297


In [20]:
df_09 = df_09.groupby("County").sum().reset_index()
df_09.head()

Unnamed: 0,County,Count,FIPS
0,ALBANY,333,3564099
1,ALLEGANY,222,3564297
2,BRONX,1116,3564495
3,BROOME,642,3564693
4,CATTARAUGUS,172,3564891


In [21]:
# Add each type's county totals as its own column
df_09['Influenza_A'] = A_09['Count']
df_09['Influenza_B'] = B_09['Count']
df_09['Influenza_Unspecified'] = U_09['Count']
df_09.head()

Unnamed: 0,County,Count,FIPS,Influenza_A,Influenza_B,Influenza_Unspecified
0,ALBANY,333,3564099,331,2,0
1,ALLEGANY,222,3564297,222,0,0
2,BRONX,1116,3564495,1063,33,20
3,BROOME,642,3564693,626,2,14
4,CATTARAUGUS,172,3564891,171,1,0
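The column assignments above line rows up by position, which works here because every county appears in all three type subsets. As an alternative sketch (not the method used in this notebook), `pivot_table` can spread the `Type` values into columns in one step and match rows by county name. The toy data below is made up.

```python
import pandas as pd

# Toy long-format data; counties and counts are made up.
long_df = pd.DataFrame({
    "County": ["ALBANY", "ALBANY", "ALBANY", "BRONX", "BRONX"],
    "Type":   ["A", "B", "UNSPECIFIED", "A", "B"],
    "Count":  [3, 1, 0, 7, 2],
})

# One row per county, one column per type; absent county/type pairs become 0.
wide = long_df.pivot_table(index="County", columns="Type", values="Count",
                           aggfunc="sum", fill_value=0)
wide = wide.rename(columns={"A": "Influenza_A", "B": "Influenza_B",
                            "UNSPECIFIED": "Influenza_Unspecified"})
wide["Count"] = wide.sum(axis=1)  # total confirmed cases per county
wide = wide.reset_index()
wide.columns.name = None          # drop the leftover "Type" label
print(wide)
```

Because the pivot aligns on `County`, a county with no rows for one type still gets a 0 in that column instead of shifting the rows below it.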


In [22]:
cols = ['FIPS']
FIPS = pd.read_csv(path+'FIPS', usecols=cols)
FIPS

Unnamed: 0,FIPS
0,36001.0
1,36003.0
2,36005.0
3,36007.0
4,36009.0
...,...
57,36115.0
58,36117.0
59,36119.0
60,36121.0


In [23]:
df_09 = df_09.drop(['FIPS'], axis=1)
df_09

Unnamed: 0,County,Count,Influenza_A,Influenza_B,Influenza_Unspecified
0,ALBANY,333,331,2,0
1,ALLEGANY,222,222,0,0
2,BRONX,1116,1063,33,20
3,BROOME,642,626,2,14
4,CATTARAUGUS,172,171,1,0
...,...,...,...,...,...
57,WASHINGTON,32,32,0,0
58,WAYNE,298,298,0,0
59,WESTCHESTER,1149,1135,7,7
60,WYOMING,49,48,1,0


In [24]:
df_09['FIPS'] = FIPS

In [25]:
df_09

Unnamed: 0,County,Count,Influenza_A,Influenza_B,Influenza_Unspecified,FIPS
0,ALBANY,333,331,2,0,36001.0
1,ALLEGANY,222,222,0,0,36003.0
2,BRONX,1116,1063,33,20,36005.0
3,BROOME,642,626,2,14,36007.0
4,CATTARAUGUS,172,171,1,0,36009.0
...,...,...,...,...,...,...
57,WASHINGTON,32,32,0,0,36115.0
58,WAYNE,298,298,0,0,36117.0
59,WESTCHESTER,1149,1135,7,7,36119.0
60,WYOMING,49,48,1,0,36121.0
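The `FIPS` values had to be replaced because the grouped sum added the codes together (3564099 for ALBANY is 36001 added 99 times). Reattaching them from a separate file relies on both tables listing counties in the same order. A possible alternative, sketched on made-up rows, is to sum only `Count` and keep the first `FIPS` code seen for each county.

```python
import pandas as pd

# Toy rows; every row of a county carries the same FIPS code.
rows = pd.DataFrame({
    "County": ["ALBANY", "ALBANY", "ALLEGANY", "ALLEGANY"],
    "Count":  [2, 1, 0, 4],
    "FIPS":   [36001, 36001, 36003, 36003],
})

# Sum the counts, but keep the first FIPS code for each county instead of summing it.
by_county = (rows.groupby("County")
                 .agg(Count=("Count", "sum"), FIPS=("FIPS", "first"))
                 .reset_index())
print(by_county)
```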


In [26]:
# add population data and prevalence rates
cols = ['Population', 'Rate']
prev_rate_09 = pd.read_csv(path+'df_09', usecols=cols)
prev_rate_09

Unnamed: 0,Population,Rate
0,304733,10.927599
1,48969,45.334804
2,1376261,8.108927
3,200935,31.950631
4,80491,21.368849
...,...,...
57,63077,5.073165
58,93643,31.822987
59,944201,12.169019
60,42236,11.601477


In [27]:
df_09['Population'] = prev_rate_09['Population']
df_09['Prevalence_Rate'] = prev_rate_09['Rate']
df_09

Unnamed: 0,County,Count,Influenza_A,Influenza_B,Influenza_Unspecified,FIPS,Population,Prevalence_Rate
0,ALBANY,333,331,2,0,36001.0,304733,10.927599
1,ALLEGANY,222,222,0,0,36003.0,48969,45.334804
2,BRONX,1116,1063,33,20,36005.0,1376261,8.108927
3,BROOME,642,626,2,14,36007.0,200935,31.950631
4,CATTARAUGUS,172,171,1,0,36009.0,80491,21.368849
...,...,...,...,...,...,...,...,...
57,WASHINGTON,32,32,0,0,36115.0,63077,5.073165
58,WAYNE,298,298,0,0,36117.0,93643,31.822987
59,WESTCHESTER,1149,1135,7,7,36119.0,944201,12.169019
60,WYOMING,49,48,1,0,36121.0,42236,11.601477
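The `Prevalence_Rate` values appear to be confirmed cases per 10,000 residents: for ALBANY, 333 / 304733 × 10,000 ≈ 10.9276, which matches the table. That definition is my inference from the numbers, not something stated in the data files. The sketch below recomputes the rate for two rows copied from the output above.

```python
import pandas as pd

# Two rows copied from the table above.
check = pd.DataFrame({
    "County": ["ALBANY", "ALLEGANY"],
    "Count": [333, 222],
    "Population": [304733, 48969],
    "Prevalence_Rate": [10.927599, 45.334804],
})

# Assumed definition: confirmed cases per 10,000 residents.
check["Recomputed"] = check["Count"] / check["Population"] * 10_000
print(check[["County", "Prevalence_Rate", "Recomputed"]])
```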


In [28]:
# Save as a new dataset.
# This line is commented out because the file is already saved. If you cloned
# my GitHub repository, it should already be in the datasets folder, so you only
# need this step when running everything from scratch.
#df_09.to_csv(path+'flu_09.csv')  # named for the flu types in the 2009-2010 season

Now we will do the same for the remaining seasons.
Each one follows this code pattern:
```python
# season
df_ = df.loc[(df["Season"]=="seasonrange")]
#df_.head()
disease_split = df_["Disease"].str.split('_')
#disease_split.head()
# Create new column in dataset:
df_['Type'] = disease_split.str.get(1)
#df_.head()
# Sort by Region

df_ = df_.sort_values(by = ['Region','County','CDC_Week'])
df_ = df_.drop(['Region', "Week_Ending_Date", "CDC_Week", "Disease",
                   "County_Centroid"], axis=1)
df_ = df_.reset_index(drop = True)
#df_.head()

# Specified season Influenza Type A
df__A = df_.loc[(df_["Type"]=="A")]
#df__A.head()

A_ = df__A.groupby("County").sum().reset_index()
#A_.head()

# Type B
df__B = df_.loc[(df_["Type"]=="B")]
#df__B.head()
B_ = df__B.groupby("County").sum().reset_index()
#B_.head()

# Unspecified
df__U = df_.loc[(df_["Type"]=="UNSPECIFIED")]
#df__U.head()
U_ = df__U.groupby("County").sum().reset_index()
#U_.head()

# Group DF by County
df_ = df_.groupby("County").sum().reset_index()
#df_.head()

# Add type counts as different columns
count_A = A_['Count']
df_['Influenza_A'] = count_A
count_B = B_['Count']
df_['Influenza_B'] = count_B
#df_.head()
count_U = U_['Count']
df_['Influenza_Unspecified'] = count_U
#df_.head()

# Drop incorrect FIPS column and replace with correct one
df_ = df_.drop(['FIPS'], axis=1)
df_['FIPS'] = FIPS

# add population data and prevalence rates
cols = ['Population', 'Rate']
prev_rate_ = pd.read_csv(path+'df_', usecols=cols)
prev_rate_

df_['Population'] = prev_rate_['Population']
df_['Prevalence_Rate'] = prev_rate_['Rate']
df_

#df_.to_csv(path+'flu_.csv')
```
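Since the pattern is repeated once per season, it could also be wrapped in a function and called in a loop. The sketch below is one possible shape, not the code used in this notebook: `flu_type_counts` is a hypothetical helper, it uses a pivot instead of three separate subsets, it leaves out the FIPS and population steps, and the toy rows are made up.

```python
import pandas as pd

def flu_type_counts(df, season):
    """Return one row per county with total and per-type counts for `season`."""
    sub = df.loc[df["Season"] == season].copy()
    sub["Type"] = sub["Disease"].str.split("_").str.get(1)
    wide = sub.pivot_table(index="County", columns="Type", values="Count",
                           aggfunc="sum", fill_value=0)
    # Guarantee all three columns even if a type never occurs in a season.
    wide = wide.reindex(columns=["A", "B", "UNSPECIFIED"], fill_value=0)
    wide.columns = ["Influenza_A", "Influenza_B", "Influenza_Unspecified"]
    wide.insert(0, "Count", wide.sum(axis=1))  # total across the three types
    return wide.reset_index()

# Toy data standing in for the full `df`.
toy = pd.DataFrame({
    "Season": ["2009-2010"] * 3 + ["2010-2011"] * 2,
    "County": ["ALBANY", "ALBANY", "BRONX", "ALBANY", "BRONX"],
    "Disease": ["INFLUENZA_A", "INFLUENZA_B", "INFLUENZA_A",
                "INFLUENZA_A", "INFLUENZA_UNSPECIFIED"],
    "Count": [3, 1, 7, 2, 5],
})

# One wide table per season, keyed by the season label.
seasons = {s: flu_type_counts(toy, s) for s in ["2009-2010", "2010-2011"]}
print(seasons["2009-2010"])
```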

In [29]:
# 2010-2011 Season
df_10 = df.loc[(df["Season"]=="2010-2011")]
#df_10.head()
disease_split = df_10["Disease"].str.split('_')
#disease_split.head()

# Create new column in dataset:
df_10['Type'] = disease_split.str.get(1)
#df_10.head()
# Sort by Region

df_10 = df_10.sort_values(by = ['Region','County','CDC_Week'])
df_10 = df_10.drop(['Region', "Week_Ending_Date", "CDC_Week", "Disease",
                   "County_Centroid"], axis=1)
df_10 = df_10.reset_index(drop = True)
#df_10.head()

# Specified season Influenza Type A
df_10_A = df_10.loc[(df_10["Type"]=="A")]
#df_10_A.head()

A_10 = df_10_A.groupby("County").sum().reset_index()
#A_10.head()

# Type B
df_10_B = df_10.loc[(df_10["Type"]=="B")]
#df_10_B.head()
B_10 = df_10_B.groupby("County").sum().reset_index()
#B_10.head()

# Unspecified
df_10_U = df_10.loc[(df_10["Type"]=="UNSPECIFIED")]
#df_10_U.head()
U_10 = df_10_U.groupby("County").sum().reset_index()
#U_10.head()

# Group DF by County
df_10 = df_10.groupby("County").sum().reset_index()
#df_10.head()

# Add type counts as different columns
count_A = A_10['Count']
df_10['Influenza_A'] = count_A
count_B = B_10['Count']
df_10['Influenza_B'] = count_B
#df_10.head()
count_U = U_10['Count']
df_10['Influenza_Unspecified'] = count_U
#df_10.head()

# Drop incorrect FIPS column and replace with correct one
df_10 = df_10.drop(['FIPS'], axis=1)
df_10['FIPS'] = FIPS

# add population data and prevalence rates
cols = ['Population', 'Rate']
prev_rate_10 = pd.read_csv(path+'df_10', usecols=cols)
prev_rate_10

df_10['Population'] = prev_rate_10['Population']
df_10['Prevalence_Rate'] = prev_rate_10['Rate']
df_10

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,County,Count,Influenza_A,Influenza_B,Influenza_Unspecified,FIPS,Population,Prevalence_Rate
0,ALBANY,144,111,30,3,36001.0,304596,4.727574
1,ALLEGANY,117,110,6,1,36003.0,48800,23.975410
2,BRONX,1830,1642,175,13,36005.0,1397335,13.096358
3,BROOME,391,224,159,8,36007.0,199363,19.612466
4,CATTARAUGUS,115,112,3,0,36009.0,79815,14.408319
...,...,...,...,...,...,...,...,...
57,WASHINGTON,13,11,2,0,36115.0,63091,2.060516
58,WAYNE,260,210,50,0,36117.0,93256,27.880244
59,WESTCHESTER,1200,912,284,4,36119.0,956262,12.548862
60,WYOMING,29,25,4,0,36121.0,41849,6.929676


`Count` is the total number of confirmed flu cases. Keeping it is helpful for two reasons: (1) we can check that the three type columns add up to it, and (2) it will be useful in future visualizations!
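A quick way to run that check is to compare the row-wise sum of the three type columns with `Count`. The sketch below uses three rows copied from the 2010-2011 table above; on the full data, the same comparison could be run on `df_10` directly.

```python
import pandas as pd

# Three rows copied from the 2010-2011 table above.
check = pd.DataFrame({
    "County": ["ALBANY", "ALLEGANY", "BRONX"],
    "Count": [144, 117, 1830],
    "Influenza_A": [111, 110, 1642],
    "Influenza_B": [30, 6, 175],
    "Influenza_Unspecified": [3, 1, 13],
})

type_cols = ["Influenza_A", "Influenza_B", "Influenza_Unspecified"]

# Keep only the rows where the type columns do not add up to Count.
mismatch = check[check[type_cols].sum(axis=1) != check["Count"]]
print(mismatch)  # an empty frame means every row adds up
```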

In [30]:
#df_10.to_csv(path+'flu_10.csv')

### 2011-2012 Season

In [31]:
# 2011-2012 season
df_11 = df.loc[(df["Season"]=="2011-2012")]
#df_11.head()
disease_split = df_11["Disease"].str.split('_')
#disease_split.head()
# Create new column in dataset:
df_11['Type'] = disease_split.str.get(1)
#df_11.head()
# Sort by Region

df_11 = df_11.sort_values(by = ['Region','County','CDC_Week'])
df_11 = df_11.drop(['Region', "Week_Ending_Date", "CDC_Week", "Disease",
                   "County_Centroid"], axis=1)
df_11 = df_11.reset_index(drop = True)
#df_11.head()

# Specified season Influenza Type A
df_11_A = df_11.loc[(df_11["Type"]=="A")]
#df_11_A.head()

A_11 = df_11_A.groupby("County").sum().reset_index()
#A_11.head()

# Type B
df_11_B = df_11.loc[(df_11["Type"]=="B")]
#df_11_B.head()
B_11 = df_11_B.groupby("County").sum().reset_index()
#B_11.head()

# Unspecified
df_11_U = df_11.loc[(df_11["Type"]=="UNSPECIFIED")]
#df_11_U.head()
U_11 = df_11_U.groupby("County").sum().reset_index()
#U_11.head()

# Group DF by County
df_11 = df_11.groupby("County").sum().reset_index()
#df_11.head()

# Add type counts as different columns
count_A = A_11['Count']
df_11['Influenza_A'] = count_A
count_B = B_11['Count']
df_11['Influenza_B'] = count_B
#df_11.head()
count_U = U_11['Count']
df_11['Influenza_Unspecified'] = count_U
#df_11.head()

# Drop incorrect FIPS column and replace with correct one
df_11 = df_11.drop(['FIPS'], axis=1)
df_11['FIPS'] = FIPS

# add population data and prevalence rates
cols = ['Population', 'Rate']
prev_rate_11 = pd.read_csv(path+'df_11', usecols=cols)
prev_rate_11

df_11['Population'] = prev_rate_11['Population']
df_11['Prevalence_Rate'] = prev_rate_11['Rate']
df_11

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys


Unnamed: 0,County,Count,Influenza_A,Influenza_B,Influenza_Unspecified,FIPS,Population,Prevalence_Rate
0,ALBANY,39,26,13,0,36001.0,304596,1.280385
1,ALLEGANY,4,3,1,0,36003.0,48800,0.819672
2,BRONX,491,445,37,9,36005.0,1397335,3.513832
3,BROOME,98,90,7,1,36007.0,199363,4.915656
4,CATTARAUGUS,9,4,5,0,36009.0,79815,1.127608
...,...,...,...,...,...,...,...,...
57,WASHINGTON,30,29,1,0,36115.0,63091,4.755036
58,WAYNE,21,14,7,0,36117.0,93256,2.251866
59,WESTCHESTER,229,184,45,0,36119.0,956262,2.394741
60,WYOMING,1,0,1,0,36121.0,41849,0.238954


In [32]:
#df_11.to_csv(path+'flu_11.csv')

In [33]:
df_11.isna().sum()

County                   0
Count                    0
Influenza_A              0
Influenza_B              0
Influenza_Unspecified    0
FIPS                     0
Population               0
Prevalence_Rate          0
dtype: int64

### 2012-2013 Season

In [35]:
# 2012-2013 season
df_12 = df.loc[(df["Season"]=="2012-2013")]
#df_12.head()
disease_split = df_12["Disease"].str.split('_')
#disease_split.head()
# Create new column in dataset:
df_12['Type'] = disease_split.str.get(1)
#df_12.head()
# Sort by Region

df_12 = df_12.sort_values(by = ['Region','County','CDC_Week'])
df_12 = df_12.drop(['Region', "Week_Ending_Date", "CDC_Week", "Disease",
                   "County_Centroid"], axis=1)
df_12 = df_12.reset_index(drop = True)
#df_12.head()

# Specified season Influenza Type A
df_12_A = df_12.loc[(df_12["Type"]=="A")]
#df_12_A.head()

A_12 = df_12_A.groupby("County").sum().reset_index()
#A_12.head()

# Type B
df_12_B = df_12.loc[(df_12["Type"]=="B")]
#df_12_B.head()
B_12 = df_12_B.groupby("County").sum().reset_index()
#B_12.head()

# Unspecified
df_12_U = df_12.loc[(df_12["Type"]=="UNSPECIFIED")]
#df_12_U.head()
U_12 = df_12_U.groupby("County").sum().reset_index()
#U_12.head()

# Group DF by County
df_12 = df_12.groupby("County").sum().reset_index()
#df_12.head()

# Add type counts as different columns
count_A = A_12['Count']
df_12['Influenza_A'] = count_A
count_B = B_12['Count']
df_12['Influenza_B'] = count_B
#df_12.head()
count_U = U_12['Count']
df_12['Influenza_Unspecified'] = count_U
#df_12.head()

# Drop incorrect FIPS column and replace with correct one
df_12 = df_12.drop(['FIPS'], axis=1)
df_12['FIPS'] = FIPS

# add population data and prevalence rates
cols = ['Population', 'Rate']
prev_rate_12 = pd.read_csv(path+'df_12', usecols=cols)
prev_rate_12

df_12['Population'] = prev_rate_12['Population']
df_12['Prevalence_Rate'] = prev_rate_12['Rate']
df_12



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys


Unnamed: 0,County,Count,Influenza_A,Influenza_B,Influenza_Unspecified,FIPS,Population,Prevalence_Rate
0,ALBANY,521,426,93,2,36001.0,304596,17.104624
1,ALLEGANY,123,105,18,0,36003.0,48800,25.204918
2,BRONX,3954,2686,1220,48,36005.0,1397335,28.296722
3,BROOME,822,689,130,3,36007.0,199363,41.231322
4,CATTARAUGUS,108,79,29,0,36009.0,79815,13.531291
...,...,...,...,...,...,...,...,...
57,WASHINGTON,114,99,15,0,36115.0,63091,18.069138
58,WAYNE,628,510,117,1,36117.0,93256,67.341512
59,WESTCHESTER,2459,1717,736,6,36119.0,956262,25.714710
60,WYOMING,86,69,16,1,36121.0,41849,20.550073


In [36]:
df_12.to_csv(path+'flu_12.csv')

In [37]:
df_12.isna().sum()

County                   0
Count                    0
Influenza_A              0
Influenza_B              0
Influenza_Unspecified    0
FIPS                     0
Population               0
Prevalence_Rate          0
dtype: int64

### 2013-2014 Season

In [38]:
# 2013-2014 season
df_13 = df.loc[(df["Season"]=="2013-2014")]
#df_13.head()
disease_split = df_13["Disease"].str.split('_')
#disease_split.head()
# Create new column in dataset:
df_13['Type'] = disease_split.str.get(1)
#df_13.head()
# Sort by Region

df_13 = df_13.sort_values(by = ['Region','County','CDC_Week'])
df_13 = df_13.drop(['Region', "Week_Ending_Date", "CDC_Week", "Disease",
                   "County_Centroid"], axis=1)
df_13 = df_13.reset_index(drop = True)
#df_13.head()

# Specified season Influenza Type A
df_13_A = df_13.loc[(df_13["Type"]=="A")]
#df_13_A.head()

A_13 = df_13_A.groupby("County").sum().reset_index()
#A_13.head()

# Type B
df_13_B = df_13.loc[(df_13["Type"]=="B")]
#df_13_B.head()
B_13 = df_13_B.groupby("County").sum().reset_index()
#B_13.head()

# Unspecified
df_13_U = df_13.loc[(df_13["Type"]=="UNSPECIFIED")]
#df_13_U.head()
U_13 = df_13_U.groupby("County").sum().reset_index()
#U_13.head()

# Group DF by County
df_13 = df_13.groupby("County").sum().reset_index()
#df_13.head()

# Add type counts as different columns
count_A = A_13['Count']
df_13['Influenza_A'] = count_A
count_B = B_13['Count']
df_13['Influenza_B'] = count_B
#df_13.head()
count_U = U_13['Count']
df_13['Influenza_Unspecified'] = count_U
#df_13.head()

# Drop incorrect FIPS column and replace with correct one
df_13 = df_13.drop(['FIPS'], axis=1)
df_13['FIPS'] = FIPS

# add population data and prevalence rates
cols = ['Population', 'Rate']
prev_rate_13 = pd.read_csv(path+'df_13', usecols=cols)
prev_rate_13

df_13['Population'] = prev_rate_13['Population']
df_13['Prevalence_Rate'] = prev_rate_13['Rate']
df_13



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys


Unnamed: 0,County,Count,Influenza_A,Influenza_B,Influenza_Unspecified,FIPS,Population,Prevalence_Rate
0,ALBANY,567,409,158,0,36001.0,304596,18.614821
1,ALLEGANY,160,155,5,0,36003.0,48800,32.786885
2,BRONX,4242,2836,1347,59,36005.0,1397335,30.357788
3,BROOME,549,503,46,0,36007.0,199363,27.537708
4,CATTARAUGUS,35,33,2,0,36009.0,79815,4.385141
...,...,...,...,...,...,...,...,...
57,WASHINGTON,81,69,12,0,36115.0,63091,12.838598
58,WAYNE,197,182,15,0,36117.0,93256,21.124646
59,WESTCHESTER,2738,1646,1087,5,36119.0,956262,28.632320
60,WYOMING,14,13,1,0,36121.0,41849,3.345361


In [41]:
#df_13.to_csv(path+'flu_13.csv')

### 2015-2016 Season

In [42]:
# 2015-2016 season
df_15 = df.loc[(df["Season"]=="2015-2016")]
#df_15.head()
disease_split = df_15["Disease"].str.split('_')
#disease_split.head()
# Create new column in dataset:
df_15['Type'] = disease_split.str.get(1)
#df_15.head()
# Sort by Region

df_15 = df_15.sort_values(by = ['Region','County','CDC_Week'])
df_15 = df_15.drop(['Region', "Week_Ending_Date", "CDC_Week", "Disease",
                   "County_Centroid"], axis=1)
df_15 = df_15.reset_index(drop = True)
#df_15.head()

# Specified season Influenza Type A
df_15_A = df_15.loc[(df_15["Type"]=="A")]
#df_15_A.head()

A_15 = df_15_A.groupby("County").sum().reset_index()
#A_15.head()

# Type B
df_15_B = df_15.loc[(df_15["Type"]=="B")]
#df_15_B.head()
B_15 = df_15_B.groupby("County").sum().reset_index()
#B_15.head()

# Unspecified
df_15_U = df_15.loc[(df_15["Type"]=="UNSPECIFIED")]
#df_15_U.head()
U_15 = df_15_U.groupby("County").sum().reset_index()
#U_15.head()

# Group DF by County
df_15 = df_15.groupby("County").sum().reset_index()
#df_15.head()

# Add type counts as different columns
count_A = A_15['Count']
df_15['Influenza_A'] = count_A
count_B = B_15['Count']
df_15['Influenza_B'] = count_B
#df_15.head()
count_U = U_15['Count']
df_15['Influenza_Unspecified'] = count_U
#df_15.head()

# Drop incorrect FIPS column and replace with correct one
df_15 = df_15.drop(['FIPS'], axis=1)
df_15['FIPS'] = FIPS

# add population data and prevalence rates
cols = ['Population', 'Rate']
prev_rate_15 = pd.read_csv(path+'df_15', usecols=cols)
prev_rate_15

df_15['Population'] = prev_rate_15['Population']
df_15['Prevalence_Rate'] = prev_rate_15['Rate']
df_15


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys


Unnamed: 0,County,Count,Influenza_A,Influenza_B,Influenza_Unspecified,FIPS,Population,Prevalence_Rate
0,ALBANY,713,494,218,1,36001.0,304596,23.408055
1,ALLEGANY,87,75,12,0,36003.0,48800,17.827869
2,BRONX,4982,3473,1456,53,36005.0,1397335,35.653583
3,BROOME,653,572,79,2,36007.0,199363,32.754323
4,CATTARAUGUS,87,79,8,0,36009.0,79815,10.900207
...,...,...,...,...,...,...,...,...
57,WASHINGTON,121,94,21,6,36115.0,63091,19.178647
58,WAYNE,375,331,44,0,36117.0,93256,40.211890
59,WESTCHESTER,3610,2431,1166,13,36119.0,956262,37.751160
60,WYOMING,54,43,11,0,36121.0,41849,12.903534


In [43]:
#df_15.to_csv(path+'flu_15.csv')

### 2016-2017 Season

In [44]:
# 2016-2017 season
df_16 = df.loc[(df["Season"]=="2016-2017")]
#df_16.head()
disease_split = df_16["Disease"].str.split('_')
#disease_split.head()
# Create new column in dataset:
df_16['Type'] = disease_split.str.get(1)
#df_16.head()
# Sort by Region

df_16 = df_16.sort_values(by = ['Region','County','CDC_Week'])
df_16 = df_16.drop(['Region', "Week_Ending_Date", "CDC_Week", "Disease",
                   "County_Centroid"], axis=1)
df_16 = df_16.reset_index(drop = True)
#df_16.head()

# Specified season Influenza Type A
df_16_A = df_16.loc[(df_16["Type"]=="A")]
#df_16_A.head()

A_16 = df_16_A.groupby("County").sum().reset_index()
#A_16.head()

# Type B
df_16_B = df_16.loc[(df_16["Type"]=="B")]
#df_16_B.head()
B_16 = df_16_B.groupby("County").sum().reset_index()
#B_16.head()

# Unspecified
df_16_U = df_16.loc[(df_16["Type"]=="UNSPECIFIED")]
#df_16_U.head()
U_16 = df_16_U.groupby("County").sum().reset_index()
#U_16.head()

# Group DF by County
df_16 = df_16.groupby("County").sum().reset_index()
#df_16.head()

# Add type counts as different columns
count_A = A_16['Count']
df_16['Influenza_A'] = count_A
count_B = B_16['Count']
df_16['Influenza_B'] = count_B
#df_16.head()
count_U = U_16['Count']
df_16['Influenza_Unspecified'] = count_U
#df_16.head()

# Drop incorrect FIPS column and replace with correct one
df_16 = df_16.drop(['FIPS'], axis=1)
df_16['FIPS'] = FIPS

# add population data and prevalence rates
cols = ['Population', 'Rate']
prev_rate_16 = pd.read_csv(path+'df_16', usecols=cols)
prev_rate_16

df_16['Population'] = prev_rate_16['Population']
df_16['Prevalence_Rate'] = prev_rate_16['Rate']
df_16



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys


Unnamed: 0,County,Count,Influenza_A,Influenza_B,Influenza_Unspecified,FIPS,Population,Prevalence_Rate
0,ALBANY,701,503,197,1,36001.0,304596,23.014091
1,ALLEGANY,211,163,48,0,36003.0,48800,43.237705
2,BRONX,6096,4913,1054,129,36005.0,1397335,43.625902
3,BROOME,1352,1008,344,0,36007.0,199363,67.815994
4,CATTARAUGUS,105,84,17,4,36009.0,79815,13.155422
...,...,...,...,...,...,...,...,...
57,WASHINGTON,123,82,38,3,36115.0,63091,19.495649
58,WAYNE,463,350,112,1,36117.0,93256,49.648280
59,WESTCHESTER,5128,4105,995,28,36119.0,956262,53.625471
60,WYOMING,63,43,20,0,36121.0,41849,15.054123


In [45]:
# df_16.to_csv(path+'flu_16.csv')

### 2017-2018 Season

In [46]:
# 2017-2018 season
df_17 = df.loc[(df["Season"]=="2017-2018")]
#df_17.head()
disease_split = df_17["Disease"].str.split('_')
#disease_split.head()
# Create new column in dataset:
df_17['Type'] = disease_split.str.get(1)
#df_17.head()
# Sort by Region

df_17 = df_17.sort_values(by = ['Region','County','CDC_Week'])
df_17 = df_17.drop(['Region', "Week_Ending_Date", "CDC_Week", "Disease",
                   "County_Centroid"], axis=1)
df_17 = df_17.reset_index(drop = True)
#df_17.head()

# Specified season Influenza Type A
df_17_A = df_17.loc[(df_17["Type"]=="A")]
#df_17_A.head()

A_17 = df_17_A.groupby("County").sum().reset_index()
#A_17.head()

# Type B
df_17_B = df_17.loc[(df_17["Type"]=="B")]
#df_17_B.head()
B_17 = df_17_B.groupby("County").sum().reset_index()
#B_17.head()

# Unspecified
df_17_U = df_17.loc[(df_17["Type"]=="UNSPECIFIED")]
#df_17_U.head()
U_17 = df_17_U.groupby("County").sum().reset_index()
#U_17.head()

# Group DF by County
df_17 = df_17.groupby("County").sum().reset_index()
#df_17.head()

# Add type counts as different columns
count_A = A_17['Count']
df_17['Influenza_A'] = count_A
count_B = B_17['Count']
df_17['Influenza_B'] = count_B
#df_17.head()
count_U = U_17['Count']
df_17['Influenza_Unspecified'] = count_U
#df_17.head()

# Drop incorrect FIPS column and replace with correct one
df_17 = df_17.drop(['FIPS'], axis=1)
df_17['FIPS'] = FIPS

# add population data and prevalence rates
cols = ['Population', 'Rate']
prev_rate_17 = pd.read_csv(path+'df_17', usecols=cols)
prev_rate_17

df_17['Population'] = prev_rate_17['Population']
df_17['Prevalence_Rate'] = prev_rate_17['Rate']
df_17



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys


Unnamed: 0,County,Count,Influenza_A,Influenza_B,Influenza_Unspecified,FIPS,Population,Prevalence_Rate
0,ALBANY,1708,1234,456,18,36001.0,304596,56.074275
1,ALLEGANY,205,141,58,6,36003.0,48800,42.008197
2,BRONX,11749,6784,4740,225,36005.0,1397335,84.081484
3,BROOME,2214,1618,584,12,36007.0,199363,111.053706
4,CATTARAUGUS,492,235,255,2,36009.0,79815,61.642548
...,...,...,...,...,...,...,...,...
57,WASHINGTON,278,191,86,1,36115.0,63091,44.063337
58,WAYNE,1413,906,505,2,36117.0,93256,151.518401
59,WESTCHESTER,8639,4767,3841,31,36119.0,956262,90.341350
60,WYOMING,284,167,117,0,36121.0,41849,67.863031


In [47]:
#df_17.to_csv(path+'flu_17.csv')

### 2018-2019 Season

In [48]:
# 2018-2019 season
df_18 = df.loc[(df["Season"]=="2018-2019")]
#df_18.head()
disease_split = df_18["Disease"].str.split('_')
#disease_split.head()
# Create new column in dataset:
df_18['Type'] = disease_split.str.get(1)
#df_18.head()
# Sort by Region

df_18 = df_18.sort_values(by = ['Region','County','CDC_Week'])
df_18 = df_18.drop(['Region', "Week_Ending_Date", "CDC_Week", "Disease",
                   "County_Centroid"], axis=1)
df_18 = df_18.reset_index(drop = True)
#df_18.head()

# Specified season Influenza Type A
df_18_A = df_18.loc[(df_18["Type"]=="A")]
#df_18_A.head()

A_18 = df_18_A.groupby("County").sum().reset_index()
#A_18.head()

# Type B
df_18_B = df_18.loc[(df_18["Type"]=="B")]
#df_18_B.head()
B_18 = df_18_B.groupby("County").sum().reset_index()
#B_18.head()

# Unspecified
df_18_U = df_18.loc[(df_18["Type"]=="UNSPECIFIED")]
#df_18_U.head()
U_18 = df_18_U.groupby("County").sum().reset_index()
#U_18.head()

# Group DF by County
df_18 = df_18.groupby("County").sum().reset_index()
#df_18.head()

# Add type counts as different columns
count_A = A_18['Count']
df_18['Influenza_A'] = count_A
count_B = B_18['Count']
df_18['Influenza_B'] = count_B
#df_18.head()
count_U = U_18['Count']
df_18['Influenza_Unspecified'] = count_U
#df_18.head()

# Drop incorrect FIPS column and replace with correct one
df_18 = df_18.drop(['FIPS'], axis=1)
df_18['FIPS'] = FIPS

# add population data and prevalence rates
cols = ['Population', 'Rate']
prev_rate_18 = pd.read_csv(path+'df_18', usecols=cols)
prev_rate_18

df_18['Population'] = prev_rate_18['Population']
df_18['Prevalence_Rate'] = prev_rate_18['Rate']
df_18

#df_18.to_csv(path+'flu_18.csv')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys


Unnamed: 0,County,Count,Influenza_A,Influenza_B,Influenza_Unspecified,FIPS,Population,Prevalence_Rate
0,ALBANY,1352,1095,133,124,36001.0,304596,44.386663
1,ALLEGANY,364,348,15,1,36003.0,48800,74.590164
2,BRONX,10902,10504,387,11,36005.0,1397335,78.019945
3,BROOME,1503,1474,28,1,36007.0,199363,75.390118
4,CATTARAUGUS,413,401,12,0,36009.0,79815,51.744660
...,...,...,...,...,...,...,...,...
57,WASHINGTON,241,231,10,0,36115.0,63091,38.198792
58,WAYNE,1104,1100,4,0,36117.0,93256,118.383804
59,WESTCHESTER,5872,5552,314,6,36119.0,956262,61.405765
60,WYOMING,288,287,1,0,36121.0,41849,68.818849


In [49]:
#df_18.to_csv(path+'flu_18.csv')

Now we are done with our datasets! Check out the next notebook for analysis of these subsets!