# Lecture 6

Data Science, Fall 2023



A demonstration of exploratory data analysis to accompany Lecture 6 (originally planned for Lecture 5).

In [1]:
import numpy as np
import pandas as pd

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 9)

sns.set()
sns.set_context('talk')

np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
pd.set_option('display.float_format', '{:.2f}'.format)


These options customize how NumPy arrays and pandas DataFrames are displayed, which makes the output below easier to read: at most 30 rows are shown, no columns are hidden, and floats are printed with two decimal places instead of scientific notation.

# Tuberculosis in the United States

What can we say about the presence of Tuberculosis in the United States?

Let's look at the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down), published in 2022 and reporting data through 2021.

<br/>

---

# CSV and Nice Field Names
Suppose Table 1 was saved as a CSV file located in `datafiles/cdc_tuberculosis.csv`.

We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
1. Using a text editor like the one in DataHub (right-click on the file and use `Open->Editor`), emacs, vim, VSCode, etc.
2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
3. The Python file object
4. pandas, using `pd.read_csv()`

1, 2. Let's start with the first two so we really solidify the idea of a CSV as **rectangular data (i.e., tabular data) stored as comma-separated values**.
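For option 3, here is a minimal sketch of reading a CSV with a plain Python file object. To stay self-contained it first writes a small, hypothetical excerpt to a temporary file; with the real data you would open `datafiles/cdc_tuberculosis.csv` directly. The excerpt's contents and the variable names are assumptions made for illustration.

```python
import os
import tempfile

# Hypothetical stand-in for the first lines of the CSV; the real file
# has many more rows and may differ in detail.
sample_text = (
    ',No. of TB cases,,,TB incidence,,\n'
    'U.S. jurisdiction,2019,2020,2021,2019,2020,2021\n'
    'Total,"8,900","7,173","7,860",2.71,2.16,2.37\n'
    'Alabama,87,72,92,1.77,1.43,1.83\n'
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'sample_tb.csv')
    with open(csv_path, 'w') as f:
        f.write(sample_text)

    # A file object is iterable: each iteration yields one line of text.
    with open(csv_path, 'r') as f:
        first_lines = [line for _, line in zip(range(3), f)]

for line in first_lines:
    print(repr(line))  # repr() makes the trailing newline visible
```

Seeing the raw lines makes it clear that a CSV is just text: the commas separate fields, and numbers such as `"8,900"` are quoted because they contain a comma themselves.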



4. Finally, let's see the tried-and-true Data Science approach: pandas.

In [3]:
# from google.colab import drive
# drive.mount('/content/drive')

In [4]:
# Read tuberculosis data
# Importing the data
tb_df = pd.read_csv('datafiles/cdc_tuberculosis.csv')
tb_df

Unnamed: 0.1,Unnamed: 0,No. of TB cases,Unnamed: 2,Unnamed: 3,TB incidence,Unnamed: 5,Unnamed: 6
0,U.S. jurisdiction,2019,2020,2021,2019.00,2020.00,2021.00
1,Total,8900,7173,7860,2.71,2.16,2.37
2,Alabama,87,72,92,1.77,1.43,1.83
3,Alaska,58,58,58,7.91,7.92,7.92
4,Arizona,183,136,129,2.51,1.89,1.77
...,...,...,...,...,...,...,...
48,Virginia,191,169,161,2.23,1.96,1.86
49,Washington,221,163,199,2.90,2.11,2.57
50,West Virginia,9,13,7,0.50,0.73,0.39
51,Wisconsin,51,35,66,0.88,0.59,1.12


Wait, what's up with the "Unnamed" column names? And the first row, for that matter?

Congratulations -- you're ready to wrangle your data. Because of how things are stored, we'll need to clean the data a bit to name our columns better.

A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter:

In [5]:
tb_df = pd.read_csv('datafiles/cdc_tuberculosis.csv', header= 1)
tb_df

Unnamed: 0,U.S. jurisdiction,2019,2020,2021,2019.1,2020.1,2021.1
0,Total,8900,7173,7860,2.71,2.16,2.37
1,Alabama,87,72,92,1.77,1.43,1.83
2,Alaska,58,58,58,7.91,7.92,7.92
3,Arizona,183,136,129,2.51,1.89,1.77
4,Arkansas,64,59,69,2.12,1.96,2.28
...,...,...,...,...,...,...,...
47,Virginia,191,169,161,2.23,1.96,1.86
48,Washington,221,163,199,2.90,2.11,2.57
49,West Virginia,9,13,7,0.50,0.73,0.39
50,Wisconsin,51,35,66,0.88,0.59,1.12


Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. pandas has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us as humans understand the data.

We can rename the columns manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):

In [6]:
# Rename column List
tb_df = tb_df.rename(columns={'2019' : 'NO of Cases 2019',
                  '2020' : 'NO of Cases 2020',
                  '2021' : 'NO of Cases 2021',
                  '2019.1' : 'TB incidence 2019',
                  '2020.1' : 'TB incidence 2020',
                  '2021.1' : 'TB incidence 2021'
                  })
tb_df


Unnamed: 0,U.S. jurisdiction,NO of Cases 2019,NO of Cases 2020,NO of Cases 2021,TB incidence 2019,TB incidence 2020,TB incidence 2021
0,Total,8900,7173,7860,2.71,2.16,2.37
1,Alabama,87,72,92,1.77,1.43,1.83
2,Alaska,58,58,58,7.91,7.92,7.92
3,Arizona,183,136,129,2.51,1.89,1.77
4,Arkansas,64,59,69,2.12,1.96,2.28
...,...,...,...,...,...,...,...
47,Virginia,191,169,161,2.23,1.96,1.86
48,Washington,221,163,199,2.90,2.11,2.57
49,West Virginia,9,13,7,0.50,0.73,0.39
50,Wisconsin,51,35,66,0.88,0.59,1.12


<br/><br/>

---

# Record Granularity

You might already be wondering: What's up with that first record?

Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.

<br/>

Okay, EDA step two. How was the rollup record aggregated?

Let's check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of our TB cases by year (why?).

In [7]:
tb_df.sum()

U.S. jurisdiction    TotalAlabamaAlaskaArizonaArkansasCaliforniaCol...
NO of Cases 2019     8,9008758183642,111666718245583029973261085237...
NO of Cases 2020     7,1737258136591,706525417194122219282169239376...
NO of Cases 2021     7,8609258129691,750585443194992281064255127494...
TB incidence 2019                                               109.94
TB incidence 2020                                                93.09
TB incidence 2021                                               102.94
dtype: object

<br/>

Whoa, what's going on? Check out the column types:

In [8]:
# Find datatype for each column
tb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   U.S. jurisdiction  52 non-null     object 
 1   NO of Cases 2019   52 non-null     object 
 2   NO of Cases 2020   52 non-null     object 
 3   NO of Cases 2021   52 non-null     object 
 4   TB incidence 2019  52 non-null     float64
 5   TB incidence 2020  52 non-null     float64
 6   TB incidence 2021  52 non-null     float64
dtypes: float64(3), object(4)
memory usage: 3.0+ KB


In [9]:
# gives error because of commas
# tb_df.iloc[:, 1:4] = tb_df.iloc[:, 1:4].astype('int64')
# tb_df

In [10]:
tb_df.iloc[:, 1:4] = tb_df.iloc[:, 1:4].replace({',': ''}, regex=True)
tb_df

Unnamed: 0,U.S. jurisdiction,NO of Cases 2019,NO of Cases 2020,NO of Cases 2021,TB incidence 2019,TB incidence 2020,TB incidence 2021
0,Total,8900,7173,7860,2.71,2.16,2.37
1,Alabama,87,72,92,1.77,1.43,1.83
2,Alaska,58,58,58,7.91,7.92,7.92
3,Arizona,183,136,129,2.51,1.89,1.77
4,Arkansas,64,59,69,2.12,1.96,2.28
...,...,...,...,...,...,...,...
47,Virginia,191,169,161,2.23,1.96,1.86
48,Washington,221,163,199,2.90,2.11,2.57
49,West Virginia,9,13,7,0.50,0.73,0.39
50,Wisconsin,51,35,66,0.88,0.59,1.12


In [11]:
tb_df.iloc[:, 1:4] = tb_df.iloc[:, 1:4].astype('int64')
tb_df

Unnamed: 0,U.S. jurisdiction,NO of Cases 2019,NO of Cases 2020,NO of Cases 2021,TB incidence 2019,TB incidence 2020,TB incidence 2021
0,Total,8900,7173,7860,2.71,2.16,2.37
1,Alabama,87,72,92,1.77,1.43,1.83
2,Alaska,58,58,58,7.91,7.92,7.92
3,Arizona,183,136,129,2.51,1.89,1.77
4,Arkansas,64,59,69,2.12,1.96,2.28
...,...,...,...,...,...,...,...
47,Virginia,191,169,161,2.23,1.96,1.86
48,Washington,221,163,199,2.90,2.11,2.57
49,West Virginia,9,13,7,0.50,0.73,0.39
50,Wisconsin,51,35,66,0.88,0.59,1.12


<br/>

Looks like those commas are causing all TB cases to be read as the `object` datatype, or **storage type** (close to the Python string datatype), so pandas is concatenating strings instead of adding integers.
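One caveat: depending on the pandas version, assigning converted values back through `.iloc` may leave the columns with `object` dtype even after `astype('int64')`. A minimal sketch of an alternative, assuming a small stand-in DataFrame (`demo_df` and `case_columns` are hypothetical names), is to call `astype` with a column-to-dtype mapping and rebind the result:

```python
import pandas as pd

# Small stand-in for tb_df after the commas were stripped (values are strings).
demo_df = pd.DataFrame({
    'U.S. jurisdiction': ['Total', 'Alabama', 'Alaska'],
    'NO of Cases 2019': ['8900', '87', '58'],
    'TB incidence 2019': [2.71, 1.77, 7.91],
})

case_columns = ['NO of Cases 2019']

# astype with a {column: dtype} mapping returns a new DataFrame in which
# the listed columns have the requested dtype.
demo_df = demo_df.astype({col: 'int64' for col in case_columns})

print(demo_df.dtypes)
print(demo_df['NO of Cases 2019'].iloc[1:].sum())  # state-level rows only
```

With a true integer dtype, `sum()` adds numbers instead of concatenating strings, and `info()` reports `int64` for the case columns.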

<br/>

Fortunately `read_csv` also has a `thousands` parameter (for what it's worth, I didn't know this beforehand--I [googled](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) this):
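A minimal sketch of that call, using a small in-memory stand-in for the CSV so it runs on its own (the sample text and the name `tb_sample_df` are assumptions; with the real file you would pass `'datafiles/cdc_tuberculosis.csv'` instead):

```python
import io
import pandas as pd

# Hypothetical stand-in for the top of the tuberculosis CSV.
sample_csv = io.StringIO(
    ',No. of TB cases,,,TB incidence,,\n'
    'U.S. jurisdiction,2019,2020,2021,2019,2020,2021\n'
    'Total,"8,900","7,173","7,860",2.71,2.16,2.37\n'
    'Alabama,87,72,92,1.77,1.43,1.83\n'
)

# Chain the calls inside outer parentheses so each step sits on its own line.
tb_sample_df = (
    pd.read_csv(sample_csv, header=1, thousands=',')
    .rename(columns={'2019': 'NO of Cases 2019',
                     '2019.1': 'TB incidence 2019'})
)

print(tb_sample_df.dtypes)
```

Because `thousands=','` tells the parser to treat the comma as a digit separator, the case counts are read as integers from the start and no separate comma-stripping step is needed.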

In [12]:
# improve readability: chaining method calls with outer parentheses/line breaks

In [13]:
tb_df

Unnamed: 0,U.S. jurisdiction,NO of Cases 2019,NO of Cases 2020,NO of Cases 2021,TB incidence 2019,TB incidence 2020,TB incidence 2021
0,Total,8900,7173,7860,2.71,2.16,2.37
1,Alabama,87,72,92,1.77,1.43,1.83
2,Alaska,58,58,58,7.91,7.92,7.92
3,Arizona,183,136,129,2.51,1.89,1.77
4,Arkansas,64,59,69,2.12,1.96,2.28
...,...,...,...,...,...,...,...
47,Virginia,191,169,161,2.23,1.96,1.86
48,Washington,221,163,199,2.90,2.11,2.57
49,West Virginia,9,13,7,0.50,0.73,0.39
50,Wisconsin,51,35,66,0.88,0.59,1.12


In [14]:
# now apply sum
# tb_df.iloc.sum()
tb_df.iloc[1:, 1:].sum()


NO of Cases 2019      8900
NO of Cases 2020      7173
NO of Cases 2021      7860
TB incidence 2019   107.23
TB incidence 2020    90.93
TB incidence 2021   100.57
dtype: object

The Total TB cases look right. Phew!

(We'll leave it to your own EDA to figure out how the TB incidence "Totals" were aggregated.)

Let's just look at the records with **state-level granularity**:

In [15]:
# Answer Here
tb_df['U.S. jurisdiction']


0             Total
1           Alabama
2            Alaska
3           Arizona
4          Arkansas
          ...      
47         Virginia
48       Washington
49    West Virginia
50        Wisconsin
51          Wyoming
Name: U.S. jurisdiction, Length: 52, dtype: object

What does each of these values represent? Why?
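The cell above lists every jurisdiction, rollup record included. To keep only state-level records, one option is a boolean mask on the jurisdiction column. This is a sketch on a small stand-in DataFrame (`demo_df` and `state_df` are hypothetical names):

```python
import pandas as pd

demo_df = pd.DataFrame({
    'U.S. jurisdiction': ['Total', 'Alabama', 'Alaska'],
    'NO of Cases 2019': [8900, 87, 58],
})

# Boolean mask: keep every row whose jurisdiction is not the rollup label.
state_df = demo_df[demo_df['U.S. jurisdiction'] != 'Total']
print(state_df)
```

Filtering by label rather than by position (for example `demo_df[1:]`) keeps working even if the rollup row is not the first row.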

To the lecture!


# Gather Census Data

U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).

Running the below cells cleans the data. We encourage you to closely explore the CSV and study these lines after lecture...

There are a few new methods here:
* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) converts each column to the best-fitting nullable dtype (for example, floats that hold only whole numbers become integers) and is out of scope for the class.
* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
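To make the two methods concrete, here is a minimal sketch on a hypothetical miniature of the census table. The last row mimics a footnote line, which has text in the first column and NaN everywhere else; the DataFrame names are placeholders.

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame({
    'Geographic Area': ['.Alabama', '.Alaska', 'Suggested Citation:'],
    '2019': [4903185.0, 731545.0, np.nan],
})

cleaned_df = demo_df.dropna()               # drops any row containing NaN
converted_df = cleaned_df.convert_dtypes()  # picks a nullable dtype per column

print(converted_df.dtypes)
```

After `dropna()` removes the footnote row, the remaining floats are all whole numbers, so `convert_dtypes()` can store them as the nullable integer dtype `Int64`.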

In [16]:
# Load 2010s census data
census_2019_df = pd.read_csv('datafiles/nst-est2019-01.csv', header=3)
census_2019_df

Unnamed: 0.1,Unnamed: 0,Census,Estimates Base,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,United States,308745538,308758105,309321666,311556874,313830990,315993715,318301008,320635163,322941311,324985539,326687501,328239523
1,Northeast,55317240,55318443,55380134,55604223,55775216,55901806,56006011,56034684,56042330,56059240,56046620,55982803
2,Midwest,66927001,66929725,66974416,67157800,67336743,67560379,67745167,67860583,67987540,68126781,68236628,68329004
3,South,114555744,114563030,114866680,116006522,117241208,118364400,119624037,120997341,122351760,123542189,124569433,125580448
4,West,71945553,71946907,72100436,72788329,73477823,74167130,74925793,75742555,76559681,77257329,77834820,78347268
...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,Note: The estimates are based on the 2010 Cens...,,,,,,,,,,,,
59,Suggested Citation:,,,,,,,,,,,,
60,Table 1. Annual Estimates of the Resident Popu...,,,,,,,,,,,,
61,"Source: U.S. Census Bureau, Population Division",,,,,,,,,,,,


# Apply some EDA

* Drop the `Estimates Base` column.
* Rename `Unnamed: 0` to `Geographic Area`.
* Convert column dtypes with `df.convert_dtypes()` ("smart" conversion, use at your own risk).
* Use `.dropna()` to drop records with NaN.

You can also suggest any other change that would be helpful for EDA.



In [17]:
with pd.option_context('display.min_rows', 30): # shows more rows
    display(census_2019_df)

Unnamed: 0.1,Unnamed: 0,Census,Estimates Base,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,United States,308745538,308758105,309321666,311556874,313830990,315993715,318301008,320635163,322941311,324985539,326687501,328239523
1,Northeast,55317240,55318443,55380134,55604223,55775216,55901806,56006011,56034684,56042330,56059240,56046620,55982803
2,Midwest,66927001,66929725,66974416,67157800,67336743,67560379,67745167,67860583,67987540,68126781,68236628,68329004
3,South,114555744,114563030,114866680,116006522,117241208,118364400,119624037,120997341,122351760,123542189,124569433,125580448
4,West,71945553,71946907,72100436,72788329,73477823,74167130,74925793,75742555,76559681,77257329,77834820,78347268
5,.Alabama,4779736,4780125,4785437,4799069,4815588,4830081,4841799,4852347,4863525,4874486,4887681,4903185
6,.Alaska,710231,710249,713910,722128,730443,737068,736283,737498,741456,739700,735139,731545
7,.Arizona,6392017,6392288,6407172,6472643,6554978,6632764,6730413,6829676,6941072,7044008,7158024,7278717
8,.Arkansas,2915918,2916031,2921964,2940667,2952164,2959400,2967392,2978048,2989918,3001345,3009733,3017804
9,.California,37253956,37254519,37319502,37638369,37948800,38260787,38596972,38918045,39167117,39358497,39461588,39512223


In [18]:
census_2019_df = census_2019_df.dropna()

with pd.option_context('display.min_rows', 30): # shows more rows
    display(census_2019_df)

Unnamed: 0.1,Unnamed: 0,Census,Estimates Base,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,United States,308745538,308758105,309321666,311556874,313830990,315993715,318301008,320635163,322941311,324985539,326687501,328239523
1,Northeast,55317240,55318443,55380134,55604223,55775216,55901806,56006011,56034684,56042330,56059240,56046620,55982803
2,Midwest,66927001,66929725,66974416,67157800,67336743,67560379,67745167,67860583,67987540,68126781,68236628,68329004
3,South,114555744,114563030,114866680,116006522,117241208,118364400,119624037,120997341,122351760,123542189,124569433,125580448
4,West,71945553,71946907,72100436,72788329,73477823,74167130,74925793,75742555,76559681,77257329,77834820,78347268
5,.Alabama,4779736,4780125,4785437,4799069,4815588,4830081,4841799,4852347,4863525,4874486,4887681,4903185
6,.Alaska,710231,710249,713910,722128,730443,737068,736283,737498,741456,739700,735139,731545
7,.Arizona,6392017,6392288,6407172,6472643,6554978,6632764,6730413,6829676,6941072,7044008,7158024,7278717
8,.Arkansas,2915918,2916031,2921964,2940667,2952164,2959400,2967392,2978048,2989918,3001345,3009733,3017804
9,.California,37253956,37254519,37319502,37638369,37948800,38260787,38596972,38918045,39167117,39358497,39461588,39512223


In [20]:
# renaming the unnamed
census_2019_df = census_2019_df.rename(columns={'Unnamed: 0' : 'Geographic Area'})
census_2019_df

Unnamed: 0,Geographic Area,Census,Estimates Base,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,United States,308745538,308758105,309321666,311556874,313830990,315993715,318301008,320635163,322941311,324985539,326687501,328239523
1,Northeast,55317240,55318443,55380134,55604223,55775216,55901806,56006011,56034684,56042330,56059240,56046620,55982803
2,Midwest,66927001,66929725,66974416,67157800,67336743,67560379,67745167,67860583,67987540,68126781,68236628,68329004
3,South,114555744,114563030,114866680,116006522,117241208,118364400,119624037,120997341,122351760,123542189,124569433,125580448
4,West,71945553,71946907,72100436,72788329,73477823,74167130,74925793,75742555,76559681,77257329,77834820,78347268
...,...,...,...,...,...,...,...,...,...,...,...,...,...
52,.Washington,6724540,6724540,6742830,6826627,6897058,6963985,7054655,7163657,7294771,7423362,7523869,7614893
53,.West Virginia,1852994,1853018,1854239,1856301,1856872,1853914,1849489,1842050,1831023,1817004,1804291,1792147
54,.Wisconsin,5686986,5687285,5690475,5705288,5719960,5736754,5751525,5760940,5772628,5790186,5807406,5822434
55,.Wyoming,563626,563775,564487,567299,576305,582122,582531,585613,584215,578931,577601,578759


In [21]:
# removing the commas
census_2019_df = census_2019_df.replace({',': '',}, regex= True)
census_2019_df

Unnamed: 0,Geographic Area,Census,Estimates Base,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,United States,308745538,308758105,309321666,311556874,313830990,315993715,318301008,320635163,322941311,324985539,326687501,328239523
1,Northeast,55317240,55318443,55380134,55604223,55775216,55901806,56006011,56034684,56042330,56059240,56046620,55982803
2,Midwest,66927001,66929725,66974416,67157800,67336743,67560379,67745167,67860583,67987540,68126781,68236628,68329004
3,South,114555744,114563030,114866680,116006522,117241208,118364400,119624037,120997341,122351760,123542189,124569433,125580448
4,West,71945553,71946907,72100436,72788329,73477823,74167130,74925793,75742555,76559681,77257329,77834820,78347268
...,...,...,...,...,...,...,...,...,...,...,...,...,...
52,.Washington,6724540,6724540,6742830,6826627,6897058,6963985,7054655,7163657,7294771,7423362,7523869,7614893
53,.West Virginia,1852994,1853018,1854239,1856301,1856872,1853914,1849489,1842050,1831023,1817004,1804291,1792147
54,.Wisconsin,5686986,5687285,5690475,5705288,5719960,5736754,5751525,5760940,5772628,5790186,5807406,5822434
55,.Wyoming,563626,563775,564487,567299,576305,582122,582531,585613,584215,578931,577601,578759


In [22]:
# checking whether the column sums match row 0
census_2019_df.iloc[:, 1:].sum()


Census            3087455385531724066927001114555744719455534779...
Estimates Base    3087581055531844366929725114563030719469074780...
2010              3093216665538013466974416114866680721004364785...
2011              3115568745560422367157800116006522727883294799...
2012              3138309905577521667336743117241208734778234815...
2013              3159937155590180667560379118364400741671304830...
2014              3183010085600601167745167119624037749257934841...
2015              3206351635603468467860583120997341757425554852...
2016              3229413115604233067987540122351760765596814863...
2017              3249855395605924068126781123542189772573294874...
2018              3266875015604662068236628124569433778348204887...
2019              3282395235598280368329004125580448783472684903...
dtype: object

In [23]:
census_2019_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 57 entries, 0 to 57
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Geographic Area  57 non-null     object
 1   Census           57 non-null     object
 2   Estimates Base   57 non-null     object
 3   2010             57 non-null     object
 4   2011             57 non-null     object
 5   2012             57 non-null     object
 6   2013             57 non-null     object
 7   2014             57 non-null     object
 8   2015             57 non-null     object
 9   2016             57 non-null     object
 10  2017             57 non-null     object
 11  2018             57 non-null     object
 12  2019             57 non-null     object
dtypes: object(13)
memory usage: 6.2+ KB


In [24]:
census_2019_df.iloc[:, 1:] = census_2019_df.iloc[:, 1:].astype('int64')
census_2019_df

Unnamed: 0,Geographic Area,Census,Estimates Base,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,United States,308745538,308758105,309321666,311556874,313830990,315993715,318301008,320635163,322941311,324985539,326687501,328239523
1,Northeast,55317240,55318443,55380134,55604223,55775216,55901806,56006011,56034684,56042330,56059240,56046620,55982803
2,Midwest,66927001,66929725,66974416,67157800,67336743,67560379,67745167,67860583,67987540,68126781,68236628,68329004
3,South,114555744,114563030,114866680,116006522,117241208,118364400,119624037,120997341,122351760,123542189,124569433,125580448
4,West,71945553,71946907,72100436,72788329,73477823,74167130,74925793,75742555,76559681,77257329,77834820,78347268
...,...,...,...,...,...,...,...,...,...,...,...,...,...
52,.Washington,6724540,6724540,6742830,6826627,6897058,6963985,7054655,7163657,7294771,7423362,7523869,7614893
53,.West Virginia,1852994,1853018,1854239,1856301,1856872,1853914,1849489,1842050,1831023,1817004,1804291,1792147
54,.Wisconsin,5686986,5687285,5690475,5705288,5719960,5736754,5751525,5760940,5772628,5790186,5807406,5822434
55,.Wyoming,563626,563775,564487,567299,576305,582122,582531,585613,584215,578931,577601,578759


In [25]:
# now checking the sum
# census_2019_df.iloc[1:, 1:].sum() # this sum does not equal row 0
# Why? Is row 0 the sum of the states or of the regions?
census_2019_df.iloc[1:5, 1:].sum()

Census            308745538
Estimates Base    308758105
2010              309321666
2011              311556874
2012              313830990
2013              315993715
2014              318301008
2015              320635163
2016              322941311
2017              324985539
2018              326687501
2019              328239523
dtype: object

In [26]:
# Row 0 is the sum of the regions, but we don't want regions, so we drop the first 5 rows.
# .copy() gives an independent DataFrame that is safe to modify later.
states_census_2019_df = census_2019_df.iloc[5:,:].copy()
states_census_2019_df

Unnamed: 0,Geographic Area,Census,Estimates Base,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
5,.Alabama,4779736,4780125,4785437,4799069,4815588,4830081,4841799,4852347,4863525,4874486,4887681,4903185
6,.Alaska,710231,710249,713910,722128,730443,737068,736283,737498,741456,739700,735139,731545
7,.Arizona,6392017,6392288,6407172,6472643,6554978,6632764,6730413,6829676,6941072,7044008,7158024,7278717
8,.Arkansas,2915918,2916031,2921964,2940667,2952164,2959400,2967392,2978048,2989918,3001345,3009733,3017804
9,.California,37253956,37254519,37319502,37638369,37948800,38260787,38596972,38918045,39167117,39358497,39461588,39512223
...,...,...,...,...,...,...,...,...,...,...,...,...,...
52,.Washington,6724540,6724540,6742830,6826627,6897058,6963985,7054655,7163657,7294771,7423362,7523869,7614893
53,.West Virginia,1852994,1853018,1854239,1856301,1856872,1853914,1849489,1842050,1831023,1817004,1804291,1792147
54,.Wisconsin,5686986,5687285,5690475,5705288,5719960,5736754,5751525,5760940,5772628,5790186,5807406,5822434
55,.Wyoming,563626,563775,564487,567299,576305,582122,582531,585613,584215,578931,577601,578759


In [27]:
# removing the leading dot from the geographic area names
states_census_2019_df.iloc[:, 0] = states_census_2019_df.iloc[:, 0].str.removeprefix('.')
states_census_2019_df


Unnamed: 0,Geographic Area,Census,Estimates Base,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
5,Alabama,4779736,4780125,4785437,4799069,4815588,4830081,4841799,4852347,4863525,4874486,4887681,4903185
6,Alaska,710231,710249,713910,722128,730443,737068,736283,737498,741456,739700,735139,731545
7,Arizona,6392017,6392288,6407172,6472643,6554978,6632764,6730413,6829676,6941072,7044008,7158024,7278717
8,Arkansas,2915918,2916031,2921964,2940667,2952164,2959400,2967392,2978048,2989918,3001345,3009733,3017804
9,California,37253956,37254519,37319502,37638369,37948800,38260787,38596972,38918045,39167117,39358497,39461588,39512223
...,...,...,...,...,...,...,...,...,...,...,...,...,...
52,Washington,6724540,6724540,6742830,6826627,6897058,6963985,7054655,7163657,7294771,7423362,7523869,7614893
53,West Virginia,1852994,1853018,1854239,1856301,1856872,1853914,1849489,1842050,1831023,1817004,1804291,1792147
54,Wisconsin,5686986,5687285,5690475,5705288,5719960,5736754,5751525,5760940,5772628,5790186,5807406,5822434
55,Wyoming,563626,563775,564487,567299,576305,582122,582531,585613,584215,578931,577601,578759


### Similarly for Census data of 2020 and 2021

In [28]:
census_2020_df = pd.read_csv('datafiles/NST-EST2022-POP.csv', header=3)
# dropping null values
census_2020_df = census_2020_df.dropna()
# renaming the unnamed columns
census_2020_df = census_2020_df.rename(columns={'Unnamed: 0' : 'Geographic Area',
                                                'Unnamed: 1': 'Estimates Base'})
# removing the commas
census_2020_df = census_2020_df.replace({',': '',}, regex= True)
# changing the type
census_2020_df.iloc[:, 1:] = census_2020_df.iloc[:, 1:].astype('int64')
# checking the sum of the regions (not displayed: only a cell's last expression is shown)
census_2020_df.iloc[1:5, 1:].sum()
# the sum is of regions, but we don't want regions, so we drop the first 5 rows
states_census_2020_df = census_2020_df.iloc[5:,:].copy()
# removing the leading dot from the geographic area names
states_census_2020_df.iloc[:, 0] = states_census_2020_df.iloc[:, 0].str.removeprefix('.')
states_census_2020_df

Unnamed: 0,Geographic Area,Estimates Base,2020,2021,2022
5,Alabama,5024356,5031362,5049846,5074296
6,Alaska,733378,732923,734182,733583
7,Arizona,7151507,7179943,7264877,7359197
8,Arkansas,3011555,3014195,3028122,3045637
9,California,39538245,39501653,39142991,39029342
...,...,...,...,...,...
52,Washington,7705247,7724031,7740745,7785786
53,West Virginia,1793755,1791420,1785526,1775156
54,Wisconsin,5893725,5896271,5880101,5892539
55,Wyoming,576837,577605,579483,581381


<br/><br/>

---

# Join Data (Merge DataFrames)

Time to `merge`! Here I use the DataFrame method `df1.merge(right=df2, ...)` on DataFrame `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.

Merging the full census DataFrames produces a wide, unwieldy table. We could either merge on smaller census DataFrames, or merge everything and drop the unneeded columns afterward. Let's do the latter.
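By default `merge` performs an inner join, so rows whose key appears in only one table (such as the `Total` rollup record) silently disappear. A minimal sketch of checking for such rows with `how='outer'` and `indicator=True`, on small stand-in DataFrames (`census_demo`, `tb_demo`, and the result names are hypothetical):

```python
import pandas as pd

census_demo = pd.DataFrame({
    'Geographic Area': ['Alabama', 'Alaska', 'Arizona'],
    '2019': [4903185, 731545, 7278717],
})
tb_demo = pd.DataFrame({
    'U.S. jurisdiction': ['Total', 'Alabama', 'Alaska'],
    'NO of Cases 2019': [8900, 87, 58],
})

# Default inner join: only keys present in both tables survive.
inner_df = census_demo.merge(right=tb_demo,
                             left_on='Geographic Area',
                             right_on='U.S. jurisdiction')

# Outer join with indicator=True adds a '_merge' column recording
# whether each row came from the left table, the right table, or both.
check_df = census_demo.merge(right=tb_demo,
                             left_on='Geographic Area',
                             right_on='U.S. jurisdiction',
                             how='outer',
                             indicator=True)
unmatched = check_df[check_df['_merge'] != 'both']
print(unmatched)
```

Inspecting the unmatched rows before settling on an inner join is a cheap way to confirm that only records you meant to lose were dropped.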

In [29]:
# merging the census data for 2019, 2020 and 2021
census_merged_df = pd.merge(left= states_census_2019_df,
                     right= states_census_2020_df,
                     left_on= 'Geographic Area',
                     right_on= 'Geographic Area'
                     )
census_merged_df


Unnamed: 0,Geographic Area,Census,Estimates Base_x,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,Estimates Base_y,2020,2021,2022
0,Alabama,4779736,4780125,4785437,4799069,4815588,4830081,4841799,4852347,4863525,4874486,4887681,4903185,5024356,5031362,5049846,5074296
1,Alaska,710231,710249,713910,722128,730443,737068,736283,737498,741456,739700,735139,731545,733378,732923,734182,733583
2,Arizona,6392017,6392288,6407172,6472643,6554978,6632764,6730413,6829676,6941072,7044008,7158024,7278717,7151507,7179943,7264877,7359197
3,Arkansas,2915918,2916031,2921964,2940667,2952164,2959400,2967392,2978048,2989918,3001345,3009733,3017804,3011555,3014195,3028122,3045637
4,California,37253956,37254519,37319502,37638369,37948800,38260787,38596972,38918045,39167117,39358497,39461588,39512223,39538245,39501653,39142991,39029342
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47,Washington,6724540,6724540,6742830,6826627,6897058,6963985,7054655,7163657,7294771,7423362,7523869,7614893,7705247,7724031,7740745,7785786
48,West Virginia,1852994,1853018,1854239,1856301,1856872,1853914,1849489,1842050,1831023,1817004,1804291,1792147,1793755,1791420,1785526,1775156
49,Wisconsin,5686986,5687285,5690475,5705288,5719960,5736754,5751525,5760940,5772628,5790186,5807406,5822434,5893725,5896271,5880101,5892539
50,Wyoming,563626,563775,564487,567299,576305,582122,582531,585613,584215,578931,577601,578759,576837,577605,579483,581381


In [30]:
merged_df = pd.merge(left= census_merged_df,
                     right= tb_df,
                     left_on= 'Geographic Area',
                     right_on= 'U.S. jurisdiction')
merged_df

Unnamed: 0,Geographic Area,Census,Estimates Base_x,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,Estimates Base_y,2020,2021,2022,U.S. jurisdiction,NO of Cases 2019,NO of Cases 2020,NO of Cases 2021,TB incidence 2019,TB incidence 2020,TB incidence 2021
0,Alabama,4779736,4780125,4785437,4799069,4815588,4830081,4841799,4852347,4863525,4874486,4887681,4903185,5024356,5031362,5049846,5074296,Alabama,87,72,92,1.77,1.43,1.83
1,Alaska,710231,710249,713910,722128,730443,737068,736283,737498,741456,739700,735139,731545,733378,732923,734182,733583,Alaska,58,58,58,7.91,7.92,7.92
2,Arizona,6392017,6392288,6407172,6472643,6554978,6632764,6730413,6829676,6941072,7044008,7158024,7278717,7151507,7179943,7264877,7359197,Arizona,183,136,129,2.51,1.89,1.77
3,Arkansas,2915918,2916031,2921964,2940667,2952164,2959400,2967392,2978048,2989918,3001345,3009733,3017804,3011555,3014195,3028122,3045637,Arkansas,64,59,69,2.12,1.96,2.28
4,California,37253956,37254519,37319502,37638369,37948800,38260787,38596972,38918045,39167117,39358497,39461588,39512223,39538245,39501653,39142991,39029342,California,2111,1706,1750,5.35,4.32,4.46
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46,Virginia,8001024,8001049,8023699,8101155,8185080,8252427,8310993,8361808,8410106,8463587,8501286,8535519,8631384,8636471,8657365,8683619,Virginia,191,169,161,2.23,1.96,1.86
47,Washington,6724540,6724540,6742830,6826627,6897058,6963985,7054655,7163657,7294771,7423362,7523869,7614893,7705247,7724031,7740745,7785786,Washington,221,163,199,2.90,2.11,2.57
48,West Virginia,1852994,1853018,1854239,1856301,1856872,1853914,1849489,1842050,1831023,1817004,1804291,1792147,1793755,1791420,1785526,1775156,West Virginia,9,13,7,0.50,0.73,0.39
49,Wisconsin,5686986,5687285,5690475,5705288,5719960,5736754,5751525,5760940,5772628,5790186,5807406,5822434,5893725,5896271,5880101,5892539,Wisconsin,51,35,66,0.88,0.59,1.12


In [31]:
merged_df.columns

Index(['Geographic Area', 'Census', 'Estimates Base_x', '2010', '2011', '2012',
       '2013', '2014', '2015', '2016', '2017', '2018', '2019',
       'Estimates Base_y', '2020', '2021', '2022', 'U.S. jurisdiction',
       'NO of Cases 2019', 'NO of Cases 2020', 'NO of Cases 2021',
       'TB incidence 2019', 'TB incidence 2020', 'TB incidence 2021'],
      dtype='object')

In [32]:
# columns to drop:
columns = merged_df.columns.to_list()
dropcolumns = columns[1:12]
dropcolumns.append(columns[13])
dropcolumns.append(columns[16])
dropcolumns.append(columns[17])
dropcolumns

['Census',
 'Estimates Base_x',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015',
 '2016',
 '2017',
 '2018',
 'Estimates Base_y',
 '2022',
 'U.S. jurisdiction']

In [33]:
merged_df = merged_df.drop(columns=dropcolumns)
merged_df
# all neat and tidy ;)
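Selecting columns by position works here but breaks if the column order ever changes. A sketch of the alternative, keeping columns by name, on a hypothetical one-row stand-in for the merged table (`demo_df`, `keep_columns`, and `tidy_df` are placeholder names):

```python
import pandas as pd

# One-row stand-in with a few of the merged table's columns.
demo_df = pd.DataFrame({
    'Geographic Area': ['Alabama'],
    'Census': [4779736],
    '2019': [4903185],
    'U.S. jurisdiction': ['Alabama'],
    'NO of Cases 2019': [87],
    'TB incidence 2019': [1.77],
})

# List the columns to keep explicitly; everything else is left behind.
keep_columns = ['Geographic Area', '2019',
                'NO of Cases 2019', 'TB incidence 2019']
tidy_df = demo_df[keep_columns]
print(tidy_df.columns.to_list())
```

Naming the columns also documents the intent: a reader can see which fields the analysis depends on without counting positions.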

Unnamed: 0,Geographic Area,2019,2020,2021,NO of Cases 2019,NO of Cases 2020,NO of Cases 2021,TB incidence 2019,TB incidence 2020,TB incidence 2021
0,Alabama,4903185,5031362,5049846,87,72,92,1.77,1.43,1.83
1,Alaska,731545,732923,734182,58,58,58,7.91,7.92,7.92
2,Arizona,7278717,7179943,7264877,183,136,129,2.51,1.89,1.77
3,Arkansas,3017804,3014195,3028122,64,59,69,2.12,1.96,2.28
4,California,39512223,39501653,39142991,2111,1706,1750,5.35,4.32,4.46
...,...,...,...,...,...,...,...,...,...,...
46,Virginia,8535519,8636471,8657365,191,169,161,2.23,1.96,1.86
47,Washington,7614893,7724031,7740745,221,163,199,2.90,2.11,2.57
48,West Virginia,1792147,1791420,1785526,9,13,7,0.50,0.73,0.39
49,Wisconsin,5822434,5896271,5880101,51,35,66,0.88,0.59,1.12



## Reproduce incidence

Let's recompute incidence to make sure we know where the original CDC numbers came from.

From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”

If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as

$$\text{TB incidence} = \frac{\text{# TB cases in population}}{\text{# groups in population}} = \frac{\text{# TB cases in population}}{\text{population}/100000} $$

$$= \frac{\text{# TB cases in population}}{\text{population}} \times 100000$$
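As a quick sanity check of the formula, here is the arithmetic for a single state, using Alabama's 2019 population and case count from the merged table above (the variable names are placeholders):

```python
# Values taken from the merged table: Alabama, 2019.
cases = 87
population = 4903185

groups = population / 100000   # number of 100,000-person groups
incidence = cases / groups     # cases per 100,000 people

print(round(incidence, 2))
```

Rounded to two decimal places, the result agrees with the CDC's reported 2019 incidence for Alabama.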

Let's try this for 2019:

In [34]:
reproduce_incidence_df = merged_df.copy()
# drop the three CDC-reported incidence columns so we can recompute them
reproduce_incidence_df = reproduce_incidence_df.drop(columns=reproduce_incidence_df.columns[-3:])

In [35]:
reproduce_incidence_df

Unnamed: 0,Geographic Area,2019,2020,2021,NO of Cases 2019,NO of Cases 2020,NO of Cases 2021
0,Alabama,4903185,5031362,5049846,87,72,92
1,Alaska,731545,732923,734182,58,58,58
2,Arizona,7278717,7179943,7264877,183,136,129
3,Arkansas,3017804,3014195,3028122,64,59,69
4,California,39512223,39501653,39142991,2111,1706,1750
...,...,...,...,...,...,...,...
46,Virginia,8535519,8636471,8657365,191,169,161
47,Washington,7614893,7724031,7740745,221,163,199
48,West Virginia,1792147,1791420,1785526,9,13,7
49,Wisconsin,5822434,5896271,5880101,51,35,66


In [36]:
# Compute TB incidence: number of cases per 100,000 people
def incidence_counter_perlac(cases, population):
    incidence = cases / (population / 100000)
    return incidence

In [37]:
# Compute TB incidence for 2019
reproduce_incidence_df['TB incidence 2019'] = incidence_counter_perlac(
    reproduce_incidence_df['NO of Cases 2019'],
    reproduce_incidence_df['2019'],
)

In [38]:
reproduce_incidence_df

Unnamed: 0,Geographic Area,2019,2020,2021,NO of Cases 2019,NO of Cases 2020,NO of Cases 2021,TB incidence 2019
0,Alabama,4903185,5031362,5049846,87,72,92,1.77
1,Alaska,731545,732923,734182,58,58,58,7.93
2,Arizona,7278717,7179943,7264877,183,136,129,2.51
3,Arkansas,3017804,3014195,3028122,64,59,69,2.12
4,California,39512223,39501653,39142991,2111,1706,1750,5.34
...,...,...,...,...,...,...,...,...
46,Virginia,8535519,8636471,8657365,191,169,161,2.24
47,Washington,7614893,7724031,7740745,221,163,199,2.90
48,West Virginia,1792147,1791420,1785526,9,13,7,0.50
49,Wisconsin,5822434,5896271,5880101,51,35,66,0.88


Awesome!!!

Let's use a for-loop and Python format strings to compute TB incidence for all years. Python f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([Python documentation](https://docs.python.org/3/tutorial/inputoutput.html)).

In [41]:
# Recompute incidence for all years, building each column name with an f-string
for year in [2019, 2020, 2021]:
    reproduce_incidence_df[f'TB incidence {year}'] = incidence_counter_perlac(
        reproduce_incidence_df[f'NO of Cases {year}'],
        reproduce_incidence_df[f'{year}'],
    )


In [42]:
reproduce_incidence_df

Unnamed: 0,Geographic Area,2019,2020,2021,NO of Cases 2019,NO of Cases 2020,NO of Cases 2021,TB incidence 2019,TB incidence 2020,TB incidence 2021
0,Alabama,4903185,5031362,5049846,87,72,92,1.77,1.43,1.82
1,Alaska,731545,732923,734182,58,58,58,7.93,7.91,7.90
2,Arizona,7278717,7179943,7264877,183,136,129,2.51,1.89,1.78
3,Arkansas,3017804,3014195,3028122,64,59,69,2.12,1.96,2.28
4,California,39512223,39501653,39142991,2111,1706,1750,5.34,4.32,4.47
...,...,...,...,...,...,...,...,...,...,...
46,Virginia,8535519,8636471,8657365,191,169,161,2.24,1.96,1.86
47,Washington,7614893,7724031,7740745,221,163,199,2.90,2.11,2.57
48,West Virginia,1792147,1791420,1785526,9,13,7,0.50,0.73,0.39
49,Wisconsin,5822434,5896271,5880101,51,35,66,0.88,0.59,1.12


In [43]:
# Compare with the CDC-reported incidence values in merged_df
merged_df

Unnamed: 0,Geographic Area,2019,2020,2021,NO of Cases 2019,NO of Cases 2020,NO of Cases 2021,TB incidence 2019,TB incidence 2020,TB incidence 2021
0,Alabama,4903185,5031362,5049846,87,72,92,1.77,1.43,1.83
1,Alaska,731545,732923,734182,58,58,58,7.91,7.92,7.92
2,Arizona,7278717,7179943,7264877,183,136,129,2.51,1.89,1.77
3,Arkansas,3017804,3014195,3028122,64,59,69,2.12,1.96,2.28
4,California,39512223,39501653,39142991,2111,1706,1750,5.35,4.32,4.46
...,...,...,...,...,...,...,...,...,...,...
46,Virginia,8535519,8636471,8657365,191,169,161,2.23,1.96,1.86
47,Washington,7614893,7724031,7740745,221,163,199,2.90,2.11,2.57
48,West Virginia,1792147,1791420,1785526,9,13,7,0.50,0.73,0.39
49,Wisconsin,5822434,5896271,5880101,51,35,66,0.88,0.59,1.12


These numbers look pretty close!!! There are a few discrepancies in the hundredths place, particularly in 2021. For example, Alaska's 2021 incidence is reported as 7.92 but recomputes to 7.90. It may be useful to further explore the reasons behind these discrepancies. We'll leave it to you!
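
One way to begin that exploration is to list the rows where the recomputed value, rounded to two decimals, differs from the reported one. The sketch below is not part of the original notebook: it uses a hypothetical four-row frame, `check_df`, with 2021 values copied from the tables above rather than the full `merged_df`, and the `recomputed` and `difference` column names are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature frame; the 2021 values are copied from the tables above.
check_df = pd.DataFrame({
    'Geographic Area': ['Alabama', 'Alaska', 'Arizona', 'Arkansas'],
    '2021': [5049846, 734182, 7264877, 3028122],
    'NO of Cases 2021': [92, 58, 129, 69],
    'TB incidence 2021': [1.83, 7.92, 1.77, 2.28],  # as reported in merged_df
})

# Recompute incidence per 100,000 people and round to two decimals.
check_df['recomputed'] = (
    check_df['NO of Cases 2021'] / check_df['2021'] * 100000
).round(2)

# Signed difference between the reported and recomputed values.
check_df['difference'] = (
    check_df['TB incidence 2021'] - check_df['recomputed']
).round(2)

# Keep only the rows where the two values disagree.
mismatches = check_df[check_df['difference'] != 0]
print(mismatches[['Geographic Area', 'TB incidence 2021', 'recomputed', 'difference']])
```

On the full data, the same comparison could be run for each year to see whether the mismatches cluster in particular states or years.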

In [44]:
reproduce_incidence_df.describe()

Unnamed: 0,Geographic Area,2019,2020,2021,NO of Cases 2019,NO of Cases 2020,NO of Cases 2021,TB incidence 2019,TB incidence 2020,TB incidence 2021
count,51,51,51,51,51,51,51,51.0,51.0,51.0
unique,51,51,51,51,46,45,45,51.0,51.0,51.0
top,Alabama,4903185,5031362,5049846,18,67,58,1.77,1.43,1.82
freq,1,1,1,1,3,4,2,1.0,1.0,1.0
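
The `count`, `unique`, `top`, and `freq` rows are what `describe()` reports for object-dtype columns, which suggests the numeric columns here are not stored with a numeric dtype. The following is a minimal sketch, assuming that is the cause; `demo_df` is a hypothetical three-row frame, not the notebook's data. It shows how converting a column with `pd.to_numeric` changes the summary to numeric statistics such as the mean.

```python
import pandas as pd

# Hypothetical miniature frame: the numbers are held in an object-dtype column.
demo_df = pd.DataFrame({
    'Geographic Area': ['Alabama', 'Alaska', 'Arizona'],
    'NO of Cases 2019': pd.Series([87, 58, 183], dtype=object),
})

# With only object columns, describe() reports count/unique/top/freq.
object_summary = demo_df.describe()

# Convert the case counts to a numeric dtype.
demo_df['NO of Cases 2019'] = pd.to_numeric(demo_df['NO of Cases 2019'])

# describe() now summarizes the numeric column (count, mean, std, quartiles, ...).
numeric_summary = demo_df.describe()
print(numeric_summary)
```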
