
# Data Cleaning & Preprocessing (2)

In this notebook, we clean and preprocess the datasets in preparation for building the relational database. This involves renaming columns, checking for null and duplicate values, converting datatypes, and restructuring rows.

In [2]:
# SETTING UP COLAB AND MODULES
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

import os
os.chdir("/content/drive/My Drive/ADS_Maya_Reddy/projects/disease_project_1/disease_datasets_1/")

Mounted at /content/drive


### Main Dataset

This dataset (`cervical_cancer_USA.csv`, loaded as `data`) was sourced from IHME VizHub's search tool. It reports cervical cancer deaths and incidence as both counts and rates, each with the upper and lower bounds of a 95% confidence interval (CI). Because this analysis focuses on cervical cancer in the United States, the data is broken down by state. It spans 1980 to 2021, allowing us to track cervical cancer rates over four decades.

In [3]:
# LOADING MAIN DATASET
data = pd.read_csv('cervical_cancer_USA.csv').drop(['sex', 'age', 'cause'], axis=1)
data.head()

Unnamed: 0,measure,location,metric,year,val,upper,lower
0,Deaths,Massachusetts,Number,1980,156.027002,167.409189,144.935259
1,Deaths,Massachusetts,Rate,1980,5.121602,5.495223,4.757514
2,Deaths,Oregon,Number,1980,57.602448,62.026101,52.861243
3,Deaths,Oregon,Rate,1980,4.256527,4.583412,3.906176
4,Deaths,Michigan,Number,1980,229.941567,244.664251,214.484653


In [4]:
# CHECKING NULL / DUPLICATE VALUES
print(data.isnull().sum())
print(data.duplicated().sum())

measure     0
location    0
metric      0
year        0
val         0
upper       0
lower       0
dtype: int64
0
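
Beyond nulls and duplicates, it may also be worth confirming that every estimate lies inside its 95% CI. The following is a minimal sketch, assuming the `val`, `lower`, and `upper` columns shown above; the helper name `find_bound_violations` and the sample values are illustrative and not part of the original notebook.

```python
import pandas as pd

def find_bound_violations(df, val="val", lower="lower", upper="upper"):
    """Return the rows whose estimate falls outside [lower, upper]."""
    outside = (df[val] < df[lower]) | (df[val] > df[upper])
    return df[outside]

# Illustrative sample: the second row has its bounds swapped on purpose.
sample = pd.DataFrame({
    "val":   [5.12, 4.26],
    "lower": [4.76, 4.58],
    "upper": [5.50, 3.91],
})
violations = find_bound_violations(sample)
print(len(violations))
```

On the real data this would be called as `find_bound_violations(data)`; an empty result suggests the bounds are consistent.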


In [5]:
# CHECKING DATATYPES
print(data.dtypes)

measure      object
location     object
metric       object
year          int64
val         float64
upper       float64
lower       float64
dtype: object


In [6]:
# LOADING MODIFIED DATASET INTO FOLDER
data.to_csv("/content/drive/My Drive/ADS_Maya_Reddy/projects/disease_project_1/prepped_datasets/data.csv", index=False)

### Pap Smear Test Datasets

These datasets were sourced from [data.cdc.gov](https://data.cdc.gov/500-Cities-Places/500-Cities-Census-Tract-level-Data-GIS-Friendly-Fo/5mtz-k78d/about_data). They include estimates of Pap smear test prevalence in various US cities, along with 95% CIs. The data spans 2016 to 2019. Cleaning these datasets involves splitting the `PAPTEST_Crude95CI` column into lower and upper bounds, dropping rows with null values, aggregating by city (the data is currently reported at the census-tract level), adding a `year` column, and merging the datasets.



In [7]:
# LOADING PAP SMEAR TEST DATASETS
pap19 = pd.read_csv('pap_smear/pap_smear_2019.csv')[['StateAbbr', 'PlaceName', 'Population2010', 'PAPTEST_CrudePrev', 'PAPTEST_Crude95CI']]
pap18 = pd.read_csv('pap_smear/pap_smear_2018.csv')[['StateAbbr', 'PlaceName', 'Population2010', 'PAPTEST_CrudePrev', 'PAPTEST_Crude95CI']]
pap17 = pd.read_csv('pap_smear/pap_smear_2017.csv')[['StateAbbr', 'PlaceName', 'population_count', 'PAPTEST_CrudePrev', 'PAPTEST_Crude95CI']].rename(columns={'population_count': 'Population2010'})
pap16 = pd.read_csv('pap_smear/pap_smear_2016.csv')[['StateAbbr', 'PlaceName', 'Population2010', 'PAPTEST_CrudePrev', 'PAPTEST_Crude95CI']]

In [8]:
# CHECKING NULL / DUPLICATE VALUES FOR PAP SMEAR TESTS
print(pap19.isnull().sum())
print(pap19.duplicated().sum())
print()
print(pap18.isnull().sum())
print(pap18.duplicated().sum())
print()
print(pap17.isnull().sum())
print(pap17.duplicated().sum())
print()
print(pap16.isnull().sum())
print(pap16.duplicated().sum())

pap19 = pap19.dropna()
pap18 = pap18.dropna()
pap17 = pap17.dropna()
pap16 = pap16.dropna()

StateAbbr               0
PlaceName               0
Population2010          0
PAPTEST_CrudePrev    2202
PAPTEST_Crude95CI    2202
dtype: int64
28

StateAbbr               0
PlaceName               0
Population2010          0
PAPTEST_CrudePrev    2202
PAPTEST_Crude95CI    2202
dtype: int64
28

StateAbbr            0
PlaceName            0
Population2010       0
PAPTEST_CrudePrev    7
PAPTEST_Crude95CI    7
dtype: int64
0

StateAbbr            0
PlaceName            0
Population2010       0
PAPTEST_CrudePrev    7
PAPTEST_Crude95CI    7
dtype: int64
0
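
The four repeated checks above could be collapsed into a single summary table. This is a sketch only: `summarize_quality` is a hypothetical helper, and the demo frames stand in for `pap16` through `pap19`.

```python
import pandas as pd

def summarize_quality(frames):
    """Build one table of null and duplicate counts for several DataFrames.

    `frames` maps a label (e.g. "pap19") to a DataFrame.
    """
    rows = []
    for name, df in frames.items():
        rows.append({
            "dataset": name,
            "rows": len(df),
            "rows_with_nulls": int(df.isnull().any(axis=1).sum()),
            "duplicate_rows": int(df.duplicated().sum()),
        })
    return pd.DataFrame(rows).set_index("dataset")

# Small stand-in frames; "a" has one row with a null and one duplicate row.
demo = {
    "a": pd.DataFrame({"x": [1, 1, None], "y": ["p", "p", "q"]}),
    "b": pd.DataFrame({"x": [1, 2], "y": ["p", "q"]}),
}
report = summarize_quality(demo)
print(report)
```

With the real data, the call would look like `summarize_quality({"pap19": pap19, "pap18": pap18, "pap17": pap17, "pap16": pap16})`.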


In [9]:
# MODIFYING 95% CI COLUMN
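from ast import literal_eval

def create_95CI_columns(pap):
  # Each entry is a string that evaluates to a (lower, upper) pair.
  # literal_eval parses it without executing arbitrary code, unlike eval.
  bounds = pap['PAPTEST_Crude95CI'].apply(literal_eval)
  # Dropping first returns a new DataFrame, so the caller's frame is not mutated.
  pap = pap.drop(['PAPTEST_Crude95CI'], axis=1)
  pap['lower'] = bounds.apply(lambda b: float(b[0]))
  pap['upper'] = bounds.apply(lambda b: float(b[1]))
  return pap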

pap19 = create_95CI_columns(pap19)
pap18 = create_95CI_columns(pap18)
pap17 = create_95CI_columns(pap17)
pap16 = create_95CI_columns(pap16)
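
The same split can also be done with pandas string methods instead of a Python-level loop. The sketch below assumes the CI strings look like `"(80.1, 85.2)"`, which is inferred from how the cell above indexes the parsed value and is not confirmed by the source; `split_ci` and `CI_PATTERN` are illustrative names.

```python
import pandas as pd

# Two non-negative decimal numbers inside parentheses, separated by a comma.
CI_PATTERN = r"\(\s*([0-9.]+)\s*,\s*([0-9.]+)\s*\)"

def split_ci(series):
    """Split strings like "(80.1, 85.2)" into float `lower`/`upper` columns."""
    bounds = series.str.extract(CI_PATTERN).astype(float)
    bounds.columns = ["lower", "upper"]
    return bounds

sample = pd.Series(["(80.1, 85.2)", "( 77.0,79.5 )"])
bounds = split_ci(sample)
print(bounds)
```

Entries that do not match the pattern come back as `NaN` rather than raising, so a follow-up null check on the new columns is advisable.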

In [10]:
# AGGREGATING BY CITY
pap19 = pap19.groupby(['StateAbbr', 'PlaceName']).agg('sum').reset_index()
pap18 = pap18.groupby(['StateAbbr', 'PlaceName']).agg('sum').reset_index()
pap17 = pap17.groupby(['StateAbbr', 'PlaceName']).agg('sum').reset_index()
pap16 = pap16.groupby(['StateAbbr', 'PlaceName']).agg('sum').reset_index()
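
Note that `agg('sum')` adds up the prevalence and CI columns across all rows of a city, so the resulting `PAPTEST_CrudePrev` values are sums rather than percentages. If a city-level figure on the original scale is needed, one alternative is a population-weighted mean. The following is a sketch of that alternative, not what the notebook does; `weighted_city_prevalence` is a hypothetical helper, and averaging CI bounds this way is only a rough approximation of a true city-level interval.

```python
import pandas as pd

def weighted_city_prevalence(df,
                             value_cols=("PAPTEST_CrudePrev", "lower", "upper"),
                             weight_col="Population2010",
                             keys=("StateAbbr", "PlaceName")):
    """Population-weighted mean of each value column, grouped by city."""
    keys, value_cols = list(keys), list(value_cols)
    # Multiply each value by its row's population, then sum within each city.
    weighted = df[keys + [weight_col]].join(
        df[value_cols].multiply(df[weight_col], axis=0))
    totals = weighted.groupby(keys).sum()
    # Divide the weighted sums by the total population of the city.
    result = totals[value_cols].div(totals[weight_col], axis=0)
    result[weight_col] = totals[weight_col]
    return result.reset_index()

# Illustrative rows: two areas of one city with different populations.
sample = pd.DataFrame({
    "StateAbbr": ["AK", "AK"],
    "PlaceName": ["Anchorage", "Anchorage"],
    "Population2010": [100, 300],
    "PAPTEST_CrudePrev": [80.0, 90.0],
    "lower": [70.0, 80.0],
    "upper": [90.0, 100.0],
})
result = weighted_city_prevalence(sample)
print(result)
```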

In [11]:
# ADDING YEAR COLUMN
pap19['year'] = 2019
pap18['year'] = 2018
pap17['year'] = 2017
pap16['year'] = 2016

In [12]:
# MERGING INTO ONE DATASET
pap = pap19.merge(pap16, how='outer', on=list(pap19.columns)).merge(pap17, how='outer', on=list(pap19.columns)).merge(pap18, how='outer', on=list(pap19.columns))
pap.head(4)

Unnamed: 0,StateAbbr,PlaceName,Population2010,PAPTEST_CrudePrev,lower,upper,year
0,AK,Anchorage,291826,4293.9,4161.1,4412.9,2016
1,AK,Anchorage,291826,4293.9,4161.1,4412.9,2017
2,AK,Anchorage,291826,4339.7,4208.5,4461.3,2018
3,AK,Anchorage,291826,4339.7,4208.5,4461.3,2019
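
Because the four frames share the same columns, stacking them with `pd.concat` is a common alternative to chaining outer merges. A minimal sketch follows, with `stack_years` as an illustrative helper. One difference to keep in mind: `concat` keeps rows that are identical across frames, whereas an outer merge on all columns matches them up; since `year` differs between frames here, no such rows should exist.

```python
import pandas as pd

def stack_years(frames, sort_keys):
    """Concatenate same-schema DataFrames and sort for a stable row order."""
    combined = pd.concat(frames, ignore_index=True)
    return combined.sort_values(sort_keys).reset_index(drop=True)

# Illustrative one-row frames standing in for two yearly datasets.
a = pd.DataFrame({"state": ["AK"], "city": ["Anchorage"], "val": [1.0], "year": [2019]})
b = pd.DataFrame({"state": ["AK"], "city": ["Anchorage"], "val": [2.0], "year": [2016]})
stacked = stack_years([a, b], sort_keys=["state", "city", "year"])
print(stacked)
```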


In [13]:
# CHECKING DATATYPES
print(pap.dtypes)

StateAbbr             object
PlaceName             object
Population2010         int64
PAPTEST_CrudePrev    float64
lower                float64
upper                float64
year                   int64
dtype: object


In [14]:
# STANDARDIZING COLUMN NAMES
pap = pap.rename(columns={'StateAbbr':'state', 'PlaceName':'city', 'Population2010':'pop_2010', 'PAPTEST_CrudePrev':'val'})

In [15]:
# LOADING MODIFIED DATASET INTO FOLDER
pap.to_csv("/content/drive/My Drive/ADS_Maya_Reddy/projects/disease_project_1/prepped_datasets/pap.csv", index=False)

### HPV Vaccination Datasets

These datasets are sourced from [kaggle.com](https://www.kaggle.com/datasets/joebeachcapital/cervical-cancer-and-hpv-vaccines). Currently, the datasets include all countries, but we will narrow down these dataframes to include only the United States. The datasets span the years 2010 to 2020, and include detailed information on vaccination estimates, cervical cancer cases/deaths prevented, and costs prevented. Cleaning these datasets will involve adding a `year` column, focusing the datasets on the United States, and merging the datasets.

In [16]:
# LOADING HPV VACCINATION DATASETS
hpv10 = pd.read_csv('hpv_vaccines/hpv_2010.csv').drop('Unnamed: 0', axis=1)
hpv11 = pd.read_csv('hpv_vaccines/hpv_2011.csv').drop('Unnamed: 0', axis=1)
hpv12 = pd.read_csv('hpv_vaccines/hpv_2012.csv').drop('Unnamed: 0', axis=1)
hpv13 = pd.read_csv('hpv_vaccines/hpv_2013.csv').drop('Unnamed: 0', axis=1)
hpv14 = pd.read_csv('hpv_vaccines/hpv_2014.csv').drop('Unnamed: 0', axis=1)
hpv15 = pd.read_csv('hpv_vaccines/hpv_2015.csv').drop('Unnamed: 0', axis=1)
hpv16 = pd.read_csv('hpv_vaccines/hpv_2016.csv').drop('Unnamed: 0', axis=1)
hpv17 = pd.read_csv('hpv_vaccines/hpv_2017.csv').drop('Unnamed: 0', axis=1)
hpv18 = pd.read_csv('hpv_vaccines/hpv_2018.csv').drop('Unnamed: 0', axis=1)
hpv19 = pd.read_csv('hpv_vaccines/hpv_2019.csv').drop('Unnamed: 0', axis=1)
hpv20 = pd.read_csv('hpv_vaccines/hpv_2020.csv').drop('Unnamed: 0', axis=1)
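
The eleven near-identical reads above could also be written as a loop that tags each frame with its year as it is loaded. This is a sketch under the assumption that the files follow the `hpv_<year>.csv` naming seen above; `load_yearly` is a hypothetical helper, and the demo writes throwaway files so it runs without the project data.

```python
import os
import tempfile
import pandas as pd

def load_yearly(directory, prefix, years):
    """Read `<prefix>_<year>.csv` for each year and tag rows with that year."""
    frames = {}
    for year in years:
        path = os.path.join(directory, f"{prefix}_{year}.csv")
        frames[year] = pd.read_csv(path).assign(year=year)
    return frames

# Demo with temporary files (the real files live under 'hpv_vaccines/').
with tempfile.TemporaryDirectory() as tmp:
    for y in (2010, 2011):
        pd.DataFrame({"country": ["USA", "CAN"]}).to_csv(
            os.path.join(tmp, f"hpv_{y}.csv"), index=False)
    frames = load_yearly(tmp, "hpv", range(2010, 2012))

print(frames[2011])
```

For the real files, a call such as `load_yearly('hpv_vaccines', 'hpv', range(2010, 2021))` would return a dictionary keyed by year; the `Unnamed: 0` column would still need to be dropped from each frame.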

In [17]:
# ADDING YEAR COLUMN
hpv10['year'] = 2010
hpv11['year'] = 2011
hpv12['year'] = 2012
hpv13['year'] = 2013
hpv14['year'] = 2014
hpv15['year'] = 2015
hpv16['year'] = 2016
hpv17['year'] = 2017
hpv18['year'] = 2018
hpv19['year'] = 2019
hpv20['year'] = 2020

In [18]:
# MERGING DATASETS
hpv = hpv10[hpv10['country'] == 'USA'].merge(hpv11[hpv11['country'] == 'USA'], how='outer', on=list(hpv10.columns))
for df in [hpv12, hpv13, hpv14, hpv15, hpv16, hpv17, hpv18, hpv19, hpv20]:
  hpv = hpv.merge(df[df['country'] == 'USA'], how='outer', on=list(hpv.columns))
hpv.head()

Unnamed: 0,country,cohort_size,current_cov,curr_vacc_cohort_size,future_cov,future_vacc_cohort_size,curr_cc_prev,curr_mort_prev,curr_cost,curr_cost_prev,proj_cc_prev,proj_mort_prev,proj_cost,proj_cost_prev,year,current_net_cost,country_name,region,income_group
0,USA,2034208.78,0.3,1816028.1,0.9,1816028.1,3309.71,1678.42,172522700.0,6618281.07,9929.14,5035.26,517568008.5,19854843.21,2015,165904400.0,United States,North America,High income
1,USA,2039953.92,0.26,1821093.3,0.9,1821093.3,2872.74,1454.98,149936700.0,5744490.73,9944.11,5036.47,519011590.5,19884775.62,2014,144192200.0,United States,North America,High income
2,USA,2043656.05,0.23,1823372.1,0.9,1823372.1,2534.9,1278.94,132802300.0,5068932.99,9919.19,5004.54,519661048.5,19834955.16,2011,127733300.0,United States,North America,High income
3,USA,2043845.6,0.23,1823406.3,0.9,1823406.3,2531.52,1275.39,132804800.0,5062159.5,9905.94,4990.64,519670795.5,19808450.23,2010,127742600.0,United States,North America,High income
4,USA,2048158.45,0.26,1828353.6,0.9,1828353.6,2880.77,1457.32,150534400.0,5760547.53,9971.9,5044.58,521080776.0,19940356.85,2013,144773900.0,United States,North America,High income


In [19]:
# REMOVING UNNECESSARY COLUMNS
hpv = hpv.drop(['country', 'region', 'income_group'], axis=1).rename(columns={'country_name':'country'})
hpv.head(2)

Unnamed: 0,cohort_size,current_cov,curr_vacc_cohort_size,future_cov,future_vacc_cohort_size,curr_cc_prev,curr_mort_prev,curr_cost,curr_cost_prev,proj_cc_prev,proj_mort_prev,proj_cost,proj_cost_prev,year,current_net_cost,country
0,2034208.78,0.3,1816028.1,0.9,1816028.1,3309.71,1678.42,172522669.5,6618281.07,9929.14,5035.26,517568008.5,19854843.21,2015,165904400.0,United States
1,2039953.92,0.26,1821093.3,0.9,1821093.3,2872.74,1454.98,149936681.7,5744490.73,9944.11,5036.47,519011590.5,19884775.62,2014,144192200.0,United States


In [20]:
# CHECKING NULL / DUPLICATE VALUES
print(hpv.isnull().sum())
print()
print(hpv.duplicated().sum())

cohort_size                0
current_cov                0
curr_vacc_cohort_size      0
future_cov                 0
future_vacc_cohort_size    0
curr_cc_prev               0
curr_mort_prev             0
curr_cost                  0
curr_cost_prev             0
proj_cc_prev               0
proj_mort_prev             0
proj_cost                  0
proj_cost_prev             0
year                       0
current_net_cost           0
country                    0
dtype: int64

0


In [21]:
# CHECKING DATATYPES
print(hpv.dtypes)

cohort_size                float64
current_cov                float64
curr_vacc_cohort_size      float64
future_cov                 float64
future_vacc_cohort_size    float64
curr_cc_prev               float64
curr_mort_prev             float64
curr_cost                  float64
curr_cost_prev             float64
proj_cc_prev               float64
proj_mort_prev             float64
proj_cost                  float64
proj_cost_prev             float64
year                         int64
current_net_cost           float64
country                     object
dtype: object


In [22]:
# LOADING MODIFIED DATASET INTO FOLDER
hpv.to_csv("/content/drive/My Drive/ADS_Maya_Reddy/projects/disease_project_1/prepped_datasets/hpv.csv", index=False)

### Extra Datasets

These datasets (`screening_program`, `medicaid_chip`, and `adolescent`) provide additional support and direction for the data analysis. They are sourced from [kaggle.com](https://www.kaggle.com/datasets/willianoliveiragibin/hpv-vaccination) and [healthdata.gov](https://healthdata.gov/dataset/Vaccination-Coverage-among-Adolescents-13-17-Years/47pk-jpce/about_data). They cover the existence of national cervical cancer screening programs in different countries, as well as HPV vaccination coverage throughout the United States (supplementing `hpv.csv` in the analysis).

In [23]:
# LOADING EXTRA DATASETS
screening_program = pd.read_csv('countries_with_screening_programs.csv')
medicaid_chip = pd.read_csv('medicaid_chip_vaccine_coverage.csv').drop(['DataQuality'], axis=1)
adolescent = pd.read_csv('adolescent_vaccine_coverage.csv')

In [24]:
# NARROWING DATASETS TO HPV VACCINE RECORDS
medicaid_chip = medicaid_chip[medicaid_chip['VaccineType'] == 'HPV'].drop(['VaccineType'], axis=1)
adolescent = adolescent[adolescent['Vaccine/Sample'] == 'HPV'].drop(['Vaccine/Sample'], axis=1)

In [25]:
# ADJUSTING COLUMN VALUES
adolescent[['lower', 'upper']] = adolescent['95% CI (%)'].str.split(' ', expand=True).drop([1], axis=1).rename(columns={0:'lower', 2:'upper'})
adolescent['lower'] = adolescent['lower'].astype(float)
adolescent['upper'] = adolescent['upper'].astype(float)
adolescent = adolescent.drop(['95% CI (%)'], axis=1)

medicaid_chip = medicaid_chip.reset_index(drop=True)
# 'Month' holds YYYYMM values; keep only the month number.
# Assigning the whole column avoids the chained assignment medicaid_chip['Month'][i] = ...,
# which can silently fail to update the DataFrame.
medicaid_chip['Month'] = pd.to_datetime(medicaid_chip['Month'].astype(str), format='%Y%m').dt.month.astype('int64')

medicaid_chip['ServiceCount'] = pd.Series([x.replace(',', '') for x in medicaid_chip['ServiceCount']]).replace({' -   ': None}).replace({' DS ': None}).astype('float')
medicaid_chip['RatePer1000Beneficiaries'] = medicaid_chip['RatePer1000Beneficiaries'].replace({'DS': None}).astype('float')
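
An alternative to listing each placeholder string is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into `NaN`. The sketch below uses the placeholder strings from the cell above as sample input; `to_number` is an illustrative helper. The trade-off is that coercion also hides unexpected strings, whereas the explicit replacements above would raise on them.

```python
import pandas as pd

def to_number(series):
    """Convert count-like strings to floats; unparseable entries become NaN."""
    cleaned = series.astype(str).str.replace(",", "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")

# Sample values, including the ' -   ' and ' DS ' placeholders.
raw = pd.Series(["1,234", " 56 ", " -   ", " DS ", "7.5"])
numbers = to_number(raw)
print(numbers.tolist())
```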

In [26]:
# CHECKING DATATYPES
print(screening_program.dtypes)
print()
print(medicaid_chip.dtypes)
print()
print(adolescent.dtypes)

Entity                                                         object
Code                                                           object
Year                                                            int64
Existence of national screening program for cervical cancer    object
dtype: object

State                        object
Year                          int64
Month                         int64
ServiceCount                float64
RatePer1000Beneficiaries    float64
dtype: object

Dose               object
Geography Type     object
Geography          object
Survey Year        object
Dimension Type     object
Dimension          object
Estimate (%)      float64
Sample Size       float64
lower             float64
upper             float64
dtype: object


In [27]:
# STANDARDIZING COLUMN NAMES
screening_program = screening_program.rename(columns={'Entity':'country', 'Code':'code', 'Year':'year', 'Existence of national screening program for cervical cancer':'screening_program'})
medicaid_chip = medicaid_chip.rename(columns={'State':'state', 'Month':'month', 'Year':'year', 'ServiceCount':'service_count', 'RatePer1000Beneficiaries':'rate_per_1000'})
adolescent = adolescent.rename(columns={'Dose':'dose', 'Geography Type':'location_type', 'Geography':'location', 'Survey Year':'year', 'Dimension Type':'dimension_type', 'Dimension':'dimension_val', 'Estimate (%)':'val', 'Sample Size':'sample_size'})

In [28]:
# LOADING MODIFIED DATASET INTO FOLDER
screening_program.to_csv("/content/drive/My Drive/ADS_Maya_Reddy/projects/disease_project_1/prepped_datasets/screening_program.csv", index=False)
medicaid_chip.to_csv("/content/drive/My Drive/ADS_Maya_Reddy/projects/disease_project_1/prepped_datasets/medicaid_chip.csv", index=False)
adolescent.to_csv("/content/drive/My Drive/ADS_Maya_Reddy/projects/disease_project_1/prepped_datasets/adolescent.csv", index=False)