<a href="https://colab.research.google.com/github/nhuyen183/LungCancerSupportSystem/blob/master/BRFSSfinal_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#@title Installing Spark and dependencies
# Install Java 8 (OpenJDK), Apache Spark 3.1.1 built for Hadoop 3.2,
# and findspark (used to locate the Spark installation on the system)
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark

#Set Environment Variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"
import findspark
findspark.init()


# Step 1: Define the problem
Which kinds of people are most likely to have lung cancer?

# Step 2: Gather the data
The datasets can be found here:
* https://www.kaggle.com/datasets/aemreusta/brfss-2020-survey-data
* https://www.kaggle.com/datasets/sakinak/behavioral-risk-factor-surveillance-survey-201619

In [2]:
#@title Create Spark entry points
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

In [3]:
sc = SparkContext(conf=SparkConf())
spark = SparkSession(sparkContext=sc)

In [4]:
#@title Import Spark MLlib libraries
from pyspark.ml.classification import LinearSVC
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.classification import OneVsRest

# Step 3: Prepare data for consumption

In [5]:
#@title Mount Google Drive to access the Kaggle API token
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
!pip install kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [7]:
! mkdir ~/.kaggle

In [8]:
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/kaggle.json

In [9]:
! chmod 600 ~/.kaggle/kaggle.json

In [10]:
! kaggle datasets download aemreusta/brfss-2020-survey-data

Downloading brfss-2020-survey-data.zip to /content
 85% 41.0M/48.3M [00:00<00:00, 48.6MB/s]
100% 48.3M/48.3M [00:01<00:00, 49.1MB/s]


In [11]:
! kaggle datasets download sakinak/behavioral-risk-factor-surveillance-survey-201619

Downloading behavioral-risk-factor-surveillance-survey-201619.zip to /content
 99% 232M/234M [00:02<00:00, 157MB/s]
100% 234M/234M [00:02<00:00, 85.1MB/s]


In [12]:
!ls

behavioral-risk-factor-surveillance-survey-201619.zip
brfss-2020-survey-data.zip
drive
sample_data
spark-3.1.1-bin-hadoop3.2
spark-3.1.1-bin-hadoop3.2.tgz


In [13]:
!unzip brfss-2020-survey-data.zip

Archive:  brfss-2020-survey-data.zip
  inflating: brfss2020.csv           


In [14]:
!unzip behavioral-risk-factor-surveillance-survey-201619.zip

Archive:  behavioral-risk-factor-surveillance-survey-201619.zip
  inflating: 2016.csv                
  inflating: 2017.csv                
  inflating: 2018.csv                
  inflating: 2019.csv                


In [15]:
from subprocess import check_output
print('-'*10, 'Files', '-'*10)
print(check_output(['ls', './']).decode('utf8'))

---------- Files ----------
2016.csv
2017.csv
2018.csv
2019.csv
behavioral-risk-factor-surveillance-survey-201619.zip
brfss2020.csv
brfss-2020-survey-data.zip
drive
sample_data
spark-3.1.1-bin-hadoop3.2
spark-3.1.1-bin-hadoop3.2.tgz



## About the BRFSS dataset and Prediction task

The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States and participating US territories and the Centers for Disease Control and Prevention (CDC).

BRFSS’s objective is to collect uniform state-specific data on health risk behaviors, chronic diseases and conditions, access to health care, and use of preventive health services related to the leading causes of death and disability in the United States. BRFSS conducts both landline and mobile phone-based surveys with individuals over the age of 18. General factors assessed by the BRFSS in 2020 included health status and healthy days, exercise, insufficient sleep, chronic health conditions, oral health, tobacco use, cancer screenings, and access to healthcare.

The aim of this project is to build a model with relatively high accuracy and AUC that could serve as a decision aid for people at high risk of developing lung cancer.

The 2020 data contains responses from 401,958 survey participants and a total of 279 different features. From those features, I selected the ones relevant to the prediction task. Each example in the dataset contains the following fields for one respondent.

### Categorical Features
*   `_AGE65YR`: Age of the respondent in two categories: `18 <= AGE <= 64`: `1` and `65 <= AGE <= 99`: `2`
*   `SEXVAR`: Sex of respondent: `Male: 1` and `Female: 2`
*   `_BMI5CAT`: Four categories of Body Mass Index (BMI): `_BMI5 < 1850: Underweight`; `1850 <= _BMI5 < 2500: Normal`; `2500 <= _BMI5 < 3000: Overweight`; `3000 <= _BMI5 < 9999: Obese`
*   `GENHLTH`: Health status: Would you say that in general your health is: `1: Excellent`; `2: Very good` ; `3: Good` ; `4: Fair` ; `5: Poor`
*   `SMOKE100`: Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] `1: Yes` ; `2: No`
*   `_SMOKER3`: Four-level smoker status: Everyday smoker: `1`, Someday smoker: `2`, Former smoker: `3`, Non-smoker: `4`
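
The `_BMI5CAT` thresholds listed above can be written as a small function. This is a minimal sketch, assuming `_BMI5` stores BMI multiplied by 100 (so `1850` means 18.5) and that the categories are coded `1` to `4` in the order listed. The function name `bmi_category` is hypothetical and not part of the notebook.

```python
from typing import Optional


def bmi_category(bmi5: Optional[int]) -> Optional[int]:
    """Map a _BMI5 value to a four-level category code.

    Assumed coding (hypothetical helper, not from the notebook):
    1 = Underweight, 2 = Normal, 3 = Overweight, 4 = Obese.
    """
    if bmi5 is None or bmi5 >= 9999:
        return None  # missing or out-of-range value
    if bmi5 < 1850:
        return 1
    if bmi5 < 2500:
        return 2
    if bmi5 < 3000:
        return 3
    return 4


print(bmi_category(2310))
```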

### Lung Cancer (Features) Screening Section
*   `LCSFIRST`: How old were you when you first started to smoke cigarettes regularly? `Value 1-100 in years`
*   `LCSLAST`: How old were you when you last smoked cigarettes regularly? `Value 1-100 in years`
*   `LCSNUMCG`: On average, when you smoke/smoked regularly, about how many cigarettes do/did you usually smoke each day? `Value 1-300 in number of cigarettes`
*   `LCSCTSCN`: In the last 12 months, did you have a CT or CAT scan? Examples include: `Yes, to check for lung cancer`, `No (did not have a CT scan)`, `Had a CT scan, but for another reason`.
*   `CNCRTYP1`: What type of cancer was it? (If Response = 2 (Two) or 3 (Three or more), ask: “With your most recent diagnoses of cancer, what type of cancer was it?”). Examples include: `Lung cancer: 24`, `Others: 1-30`
*   `STOPSMK2`: During the past 12 months, have you stopped smoking for one day or longer because you were trying to quit smoking? `Yes: 1` or `No: 2`.
*   `ECIGARET`: Have you ever used an e-cigarette or other electronic vaping product, even just one time, in your entire life? `Yes: 1` or `No: 2`.
*   `ECIGNOW`: Do you now use e-cigarettes or other electronic vaping products every day, some days, or not at all? `Every day: 1`; `Some days: 2` or `Not at all: 3`
*   `ASTHMA3`: (Ever told) (you had) asthma? `Yes: 1` or `No: 2`.

### Prediction Task
The prediction task is to **predict early whether a person is at high risk of lung cancer.**

### Label
*   `CNCRTYP1`: What type of cancer (lung cancer = 24)
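
As a minimal sketch of how this label could be derived from `CNCRTYP1`, the function below maps code `24` to `1` and everything else to `0`. The name `lung_cancer_label` is hypothetical. It also makes one assumption explicit: a missing answer and a different cancer type both end up in the negative class.

```python
from typing import Optional

LUNG_CANCER_CODE = 24  # CNCRTYP1 code for lung cancer, as listed above


def lung_cancer_label(cncrtyp1: Optional[int]) -> int:
    """Return 1 for lung cancer, 0 for any other code or a missing answer."""
    return 1 if cncrtyp1 == LUNG_CANCER_CODE else 0


codes = [24, 5, None, 24, 30]
labels = [lung_cancer_label(c) for c in codes]
print(labels)
```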





In [16]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers
from matplotlib import pyplot as plt
from matplotlib import rcParams
from sklearn.model_selection import train_test_split
import seaborn as sns

# The following lines adjust the granularity of reporting. 
pd.options.display.max_rows = 10
pd.options.display.float_format = "{:.1f}".format

from google.colab import widgets
# For facets
from IPython.core.display import display, HTML
import base64
!pip install facets-overview==1.0.0
from facets_overview.feature_statistics_generator import FeatureStatisticsGenerator

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting facets-overview==1.0.0
  Downloading facets_overview-1.0.0-py2.py3-none-any.whl (24 kB)
Installing collected packages: facets-overview
Successfully installed facets-overview-1.0.0


In [17]:
# load packages
import sys
print('Python version: {}'. format(sys.version))

import pandas as pd
print('pandas version: {}'. format(pd.__version__))

import matplotlib
print('matplotlib version: {}'. format(matplotlib.__version__))

import numpy as np
print('numpy version: {}'. format(np.__version__))

import scipy as sp
print('scipy version: {}'. format(sp.__version__))

import IPython
from IPython import display # pretty printing of dataframe in Jupyter notebook
print('IPython version: {}'. format(IPython.__version__))

import pyspark
print('Apache Spark Pyspark version: {}'. format(pyspark.__version__)) # pyspark version

# misc libraries
import random
import time

# ignore warnings
import warnings
warnings.filterwarnings('ignore')
print('-'*25)

Python version: 3.8.16 (default, Dec  7 2022, 01:12:13) 
[GCC 7.5.0]
pandas version: 1.3.5
matplotlib version: 3.2.2
numpy version: 1.21.6
scipy version: 1.7.3
IPython version: 7.9.0
Apache Spark Pyspark version: 3.1.1
-------------------------


In [295]:
#@title Data Integration
from pyspark.sql.types import *

data_2020 = spark.read.csv('./brfss2020.csv', inferSchema=True, header=True)

# preview the data
# data type
print('-'*10, 'data types', '-'*10)
pd.DataFrame(data_2020.dtypes)

---------- data types ----------


Unnamed: 0,0,1
0,_STATE,double
1,FMONTH,double
2,IDATE,int
3,IMONTH,int
4,IDAY,int
...,...,...
274,_STOLDNA,double
275,_VIRCOLN,double
276,_SBONTIM,double
277,_CRCREC1,double


In [296]:
data_2020F = data_2020.select('SEXVAR', '_AGE65YR', '_BMI5CAT', 'GENHLTH', 'SMOKE100', '_SMOKER3',
                  'LCSFIRST', 'LCSLAST', 'LCSNUMCG', 'LCSCTSCN', 'CNCRTYP1',
                  'STOPSMK2', 'ASTHMA3', 'CHCCOPD2') #'ECIGARET',  'ECIGNOW'
data_2020F.show()

+------+--------+--------+-------+--------+--------+--------+-------+--------+--------+--------+--------+-------+--------+
|SEXVAR|_AGE65YR|_BMI5CAT|GENHLTH|SMOKE100|_SMOKER3|LCSFIRST|LCSLAST|LCSNUMCG|LCSCTSCN|CNCRTYP1|STOPSMK2|ASTHMA3|CHCCOPD2|
+------+--------+--------+-------+--------+--------+--------+-------+--------+--------+--------+--------+-------+--------+
|   2.0|     1.0|     1.0|    2.0|     1.0|     1.0|    null|   null|    null|    null|    null|     2.0|    1.0|     1.0|
|   2.0|     2.0|     3.0|    3.0|    null|     9.0|    null|   null|    null|    null|    null|    null|    1.0|     2.0|
|   2.0|     2.0|    null|    3.0|     2.0|     4.0|    null|   null|    null|    null|    null|    null|    2.0|     2.0|
|   2.0|     2.0|    null|    1.0|     2.0|     4.0|    null|   null|    null|    null|    null|    null|    2.0|     2.0|
|   2.0|     2.0|     2.0|    2.0|     2.0|     4.0|    null|   null|    null|    null|    null|    null|    2.0|     2.0|
|   1.0|     2.0

In [127]:
from pyspark.sql.functions import col

print('Columns with null values:')
print('-'*25)
# Flag nulls as 0/1 per column, then sum the flags to get the null count per column.
data_2020F.select([col(c).isNull().cast("int").alias(c) for c in data_2020F.columns]).\
    groupBy().sum().toPandas()

Columns with null values:
-------------------------


Unnamed: 0,sum(SEXVAR),sum(_AGE65YR),sum(_BMI5CAT),sum(GENHLTH),sum(SMOKE100),sum(_SMOKER3),sum(LCSFIRST),sum(LCSLAST),sum(LCSNUMCG),sum(LCSCTSCN),sum(CNCRTYP1),sum(STOPSMK2),sum(ASTHMA3),sum(CHCCOPD2)
0,0,0,41357,8,17860,0,387914,388332,388351,370711,379282,349535,3,5
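
Raw null counts are easier to judge as fractions of the table. The sketch below reuses the counts printed above and the 401,958 rows of the 2020 file; the dictionary and variable names are illustrative only.

```python
TOTAL_ROWS = 401_958  # rows in the 2020 file

# Null counts copied from the output above.
null_counts = {
    "SEXVAR": 0, "_AGE65YR": 0, "_BMI5CAT": 41357, "GENHLTH": 8,
    "SMOKE100": 17860, "_SMOKER3": 0, "LCSFIRST": 387914,
    "LCSLAST": 388332, "LCSNUMCG": 388351, "LCSCTSCN": 370711,
    "CNCRTYP1": 379282, "STOPSMK2": 349535, "ASTHMA3": 3, "CHCCOPD2": 5,
}

null_fraction = {name: n / TOTAL_ROWS for name, n in null_counts.items()}

# List the columns from most to least sparse.
for name, frac in sorted(null_fraction.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<10} {frac:6.1%}")
```

With these numbers, each of the four `LCS*` columns is missing for more than 90% of respondents, which is why dropping every incomplete row removes almost all of the data.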


In [64]:
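# Dropping every row that contains a null would leave only 236 of the 401,958 rows,
# so the drop is disabled here.
#data_2020F = data_2020F.na.drop(how="any")
#data_2020F.count()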

236

In [299]:
data_2020F.show(6)

+------+--------+--------+-------+--------+--------+--------+-------+--------+--------+--------+--------+-------+--------+
|SEXVAR|_AGE65YR|_BMI5CAT|GENHLTH|SMOKE100|_SMOKER3|LCSFIRST|LCSLAST|LCSNUMCG|LCSCTSCN|CNCRTYP1|STOPSMK2|ASTHMA3|CHCCOPD2|
+------+--------+--------+-------+--------+--------+--------+-------+--------+--------+--------+--------+-------+--------+
|   2.0|     1.0|     1.0|    2.0|     1.0|     1.0|    null|   null|    null|    null|    null|     2.0|    1.0|     1.0|
|   2.0|     2.0|     3.0|    3.0|    null|     9.0|    null|   null|    null|    null|    null|    null|    1.0|     2.0|
|   2.0|     2.0|    null|    3.0|     2.0|     4.0|    null|   null|    null|    null|    null|    null|    2.0|     2.0|
|   2.0|     2.0|    null|    1.0|     2.0|     4.0|    null|   null|    null|    null|    null|    null|    2.0|     2.0|
|   2.0|     2.0|     2.0|    2.0|     2.0|     4.0|    null|   null|    null|    null|    null|    null|    2.0|     2.0|
|   1.0|     2.0

In [306]:
from pyspark.sql.functions import col
# Cast every selected column from double to integer, keeping the column order.
data_2020F = data_2020F.select(
    [col(c).cast(IntegerType()).alias(c) for c in data_2020F.columns]
)

In [307]:
data_2020F.printSchema()

root
 |-- SEXVAR: integer (nullable = true)
 |-- _AGE65YR: integer (nullable = true)
 |-- _BMI5CAT: integer (nullable = true)
 |-- GENHLTH: integer (nullable = true)
 |-- SMOKE100: integer (nullable = true)
 |-- _SMOKER3: integer (nullable = true)
 |-- LCSFIRST: integer (nullable = true)
 |-- LCSLAST: integer (nullable = true)
 |-- LCSNUMCG: integer (nullable = true)
 |-- LCSCTSCN: integer (nullable = true)
 |-- CNCRTYP1: integer (nullable = true)
 |-- STOPSMK2: integer (nullable = true)
 |-- ASTHMA3: integer (nullable = true)
 |-- CHCCOPD2: integer (nullable = true)



In [308]:
data_2020F = data_2020F.fillna(0)
data_2020F.show(6)

+------+--------+--------+-------+--------+--------+--------+-------+--------+--------+--------+--------+-------+--------+
|SEXVAR|_AGE65YR|_BMI5CAT|GENHLTH|SMOKE100|_SMOKER3|LCSFIRST|LCSLAST|LCSNUMCG|LCSCTSCN|CNCRTYP1|STOPSMK2|ASTHMA3|CHCCOPD2|
+------+--------+--------+-------+--------+--------+--------+-------+--------+--------+--------+--------+-------+--------+
|     2|       1|       1|      2|       1|       1|       0|      0|       0|       0|       0|       2|      1|       1|
|     2|       2|       3|      3|       0|       9|       0|      0|       0|       0|       0|       0|      1|       2|
|     2|       2|       0|      3|       2|       4|       0|      0|       0|       0|       0|       0|      2|       2|
|     2|       2|       0|      1|       2|       4|       0|      0|       0|       0|       0|       0|      2|       2|
|     2|       2|       2|      2|       2|       4|       0|      0|       0|       0|       0|       0|      2|       2|
|     1|       2

In [314]:
#@title User Defined Functions (UDF) for prediction label

from pyspark.sql.functions import udf
from pyspark.sql.types import *
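# Label: 1 if the reported cancer type is lung cancer (code 24), otherwise 0.
y_udf = udf(lambda y: 1 if y == 24 else 0, StringType())
# Replace the survey's special codes (7, 9, 77, 99, 777, 888, 999) with 0.
# Note: StringType() makes every output column a string, even though the lambdas return integers.
x_udf = udf(lambda x: 0 if (x==7 or x==77 or x==777 or x==9 or x==99 or x==888 or x==999) else x, StringType())
x1_udf = udf(lambda x: 0 if (x==7 or x==77 or x==777 or x==9 or x==99 or x==888 or x==999) else x, StringType())
# _AGE65YR code 3 is treated as missing and mapped to 0.
x_age = udf(lambda x: 0 if x==3 else x, StringType())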

processed_2020 = data_2020F.withColumn("Gender", x_udf('SEXVAR')).drop("SEXVAR")\
                    .withColumn("Age65", x_age('_AGE65YR')).drop("_AGE65YR")\
                    .withColumn("BMI", x_udf('_BMI5CAT')).drop("_BMI5CAT")\
                    .withColumn("GeneralHealth", x_udf('GENHLTH')).drop("GENHLTH")\
                    .withColumn("Smoked100", x_udf('SMOKE100')).drop("SMOKE100")\
                    .withColumn("SmokerStatus", x_udf('_SMOKER3')).drop("_SMOKER3")\
                    .withColumn("FirstSmokedAge", x_udf('LCSFIRST')).drop("LCSFIRST")\
                    .withColumn("LastSmokedAge", x_udf('LCSLAST')).drop("LCSLAST")\
                    .withColumn("AvgNumCigADay", x_udf('LCSNUMCG')).drop("LCSNUMCG")\
                    .withColumn("HasCTScan", x_udf('LCSCTSCN')).drop("LCSCTSCN")\
                    .withColumn("StopSmoking", x_udf('STOPSMK2')).drop("STOPSMK2")\
                    .withColumn("HasAsthma", x_udf('ASTHMA3')).drop("ASTHMA3")\
                    .withColumn("HasChronicDisease", x_udf('CHCCOPD2')).drop("CHCCOPD2")\
                    .withColumn("HasLungCancer", y_udf('CNCRTYP1')).drop("CNCRTYP1")
                    

In [315]:
processed_2020.printSchema()

root
 |-- Gender: string (nullable = true)
 |-- Age65: string (nullable = true)
 |-- BMI: string (nullable = true)
 |-- GeneralHealth: string (nullable = true)
 |-- Smoked100: string (nullable = true)
 |-- SmokerStatus: string (nullable = true)
 |-- FirstSmokedAge: string (nullable = true)
 |-- LastSmokedAge: string (nullable = true)
 |-- AvgNumCigADay: string (nullable = true)
 |-- HasCTScan: string (nullable = true)
 |-- StopSmoking: string (nullable = true)
 |-- HasAsthma: string (nullable = true)
 |-- HasChronicDisease: string (nullable = true)
 |-- HasLungCancer: string (nullable = true)
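
The cleaning UDFs above apply one list of special codes to every column. A possible refinement, sketched below in plain Python, keeps a separate set of codes per column so that legitimate values are not zeroed; for example, an age of `77` in `LCSFIRST` would otherwise be treated as a special code. The per-column sets here are assumptions for illustration and should be checked against the BRFSS codebook, and `clean_value` is a hypothetical helper.

```python
from typing import Dict, Optional, Set

# Assumed special codes per column (illustrative, verify against the codebook).
SPECIAL_CODES: Dict[str, Set[int]] = {
    "GENHLTH": {7, 9},
    "SMOKE100": {7, 9},
    "_SMOKER3": {9},
    "LCSFIRST": {777, 888, 999},
    "LCSLAST": {777, 999},
    "LCSNUMCG": {777, 999},
}


def clean_value(column: str, value: Optional[int]) -> int:
    """Return 0 for missing values or special codes of this column, else the value."""
    if value is None or value in SPECIAL_CODES.get(column, set()):
        return 0
    return value


print(clean_value("GENHLTH", 7), clean_value("LCSFIRST", 77))
```

Such a function could be wrapped in a Spark UDF per column, or rewritten with built-in column expressions; either way the code lists stay in one place.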



In [316]:
processed_2020.show(5)

+------+-----+---+-------------+---------+------------+--------------+-------------+-------------+---------+-----------+---------+-----------------+-------------+
|Gender|Age65|BMI|GeneralHealth|Smoked100|SmokerStatus|FirstSmokedAge|LastSmokedAge|AvgNumCigADay|HasCTScan|StopSmoking|HasAsthma|HasChronicDisease|HasLungCancer|
+------+-----+---+-------------+---------+------------+--------------+-------------+-------------+---------+-----------+---------+-----------------+-------------+
|     2|    1|  1|            2|        1|           1|             0|            0|            0|        0|          2|        1|                1|            0|
|     2|    2|  3|            3|        0|           0|             0|            0|            0|        0|          0|        1|                2|            0|
|     2|    2|  0|            3|        2|           4|             0|            0|            0|        0|          0|        2|                2|            0|
|     2|    2|  0|    

In [317]:
processed_2020.groupBy(processed_2020.HasLungCancer).count().show()
# The classes are heavily imbalanced: only 440 of 401,958 rows are positive.

+-------------+------+
|HasLungCancer| count|
+-------------+------+
|            0|401518|
|            1|   440|
+-------------+------+
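
One common way to compensate for this imbalance is to weight each class inversely to its frequency. The sketch below computes such weights from the counts shown above. It illustrates the arithmetic only and is not a step the notebook performs; the variable names are hypothetical.

```python
# Class counts copied from the groupBy output above.
class_counts = {0: 401_518, 1: 440}

total = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" weighting: total / (n_classes * count), so that each class
# contributes the same total weight.
class_weights = {
    label: total / (n_classes * count)
    for label, count in class_counts.items()
}

positive_rate = class_counts[1] / total
print(f"positive rate: {positive_rate:.4%}")
for label, weight in class_weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

Some Spark ML classifiers, for example `LogisticRegression`, accept a per-row weight column through a `weightCol` parameter, which is one place such weights could be used.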



In [211]:
#@title Data Integration 2
data_2017 = spark.read.csv('./2017.csv', inferSchema=True, header=True)

# preview the data
# data type
print('-'*10, 'data types', '-'*10)
pd.DataFrame(data_2017.dtypes)

---------- data types ----------


Unnamed: 0,0,1
0,_c0,int
1,_STATE,int
2,FMONTH,int
3,IDATE,int
4,IMONTH,int
...,...,...
354,_RFSEAT2,int
355,_RFSEAT3,int
356,_FLSHOT6,string
357,_PNEUMO2,string


In [212]:
data_2017F = data_2017.select('SEX', '_AGE65YR', '_BMI5CAT', 'GENHLTH', 'SMOKE100', '_SMOKER3',
                  'LCSFIRST', 'LCSLAST', 'LCSNUMCG', 'LCSCTSCN', 'CNCRTYP1',
                  'STOPSMK2', 'ASTHMA3', 'CHCCOPD1') #'ECIGARET',  'ECIGNOW'
data_2017F.show(6)

+---+--------+--------+-------+--------+--------+--------+-------+--------+--------+--------+--------+-------+--------+
|SEX|_AGE65YR|_BMI5CAT|GENHLTH|SMOKE100|_SMOKER3|LCSFIRST|LCSLAST|LCSNUMCG|LCSCTSCN|CNCRTYP1|STOPSMK2|ASTHMA3|CHCCOPD1|
+---+--------+--------+-------+--------+--------+--------+-------+--------+--------+--------+--------+-------+--------+
|  2|       2|       3|      2|       2|       4|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|  1|       2|       3|      2|       2|       4|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|  1|       2|       3|      3|       2|       4|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|  2|       2|       3|      4|       2|       4|      NA|     NA|      NA|      NA|      NA|      NA|      2|       1|
|  2|       2|       2|      4|       1|       3|      NA|     NA|      NA|      NA|      NA|      NA|      1|       1|
|  1|       2|       3|      3|       1|

In [136]:
from pyspark.sql.functions import col

print('Columns with null values:')
print('-'*25)
# Flag nulls as 0/1 per column, then sum the flags to get the null count per column.
data_2017F.select([col(c).isNull().cast("int").alias(c) for c in data_2017F.columns]).\
    groupBy().sum().toPandas()

Columns with null values:
-------------------------


Unnamed: 0,sum(SEX),sum(_AGE65YR),sum(_BMI5CAT),sum(GENHLTH),sum(SMOKE100),sum(_SMOKER3),sum(LCSFIRST),sum(LCSLAST),sum(LCSNUMCG),sum(LCSCTSCN),sum(CNCRTYP1),sum(STOPSMK2),sum(ASTHMA3),sum(CHCCOPD1)
0,0,1,1,0,0,1,1,1,1,1,1,0,0,0


In [213]:
# Drop rows with a null in any selected column. The text 'NA' is not a null, so those rows are kept.
data_2017F = data_2017F.na.drop(how="any")
data_2017F.count()

450015
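
In the 2017 file, missing answers appear as the text `NA`, which is not a null, so `na.drop` leaves those rows in place. The sketch below shows the idea of converting the placeholder to a real missing value while parsing, using only the standard library on a tiny made-up sample. The sample rows and the `parse_code` helper are hypothetical.

```python
import csv
import io
from typing import Optional


def parse_code(text: str) -> Optional[int]:
    """Turn a raw CSV field into an int, or None for the 'NA' placeholder."""
    text = text.strip()
    if text == "" or text == "NA":
        return None
    return int(float(text))  # also accepts values written like '2.0'


# Made-up sample in the same shape as a few of the selected columns.
sample = "SEX,_SMOKER3,CNCRTYP1\n2,4,NA\n1,3,24\n"

rows = [
    {name: parse_code(value) for name, value in record.items()}
    for record in csv.DictReader(io.StringIO(sample))
]
print(rows)
```

Spark's CSV reader has a `nullValue` option that serves a similar purpose at load time; the notebook does not use it.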

In [74]:
data_2017F.show(6)

+---+--------+--------+-------+--------+--------+--------+-------+--------+--------+--------+--------+-------+--------+
|SEX|_AGE65YR|_BMI5CAT|GENHLTH|SMOKE100|_SMOKER3|LCSFIRST|LCSLAST|LCSNUMCG|LCSCTSCN|CNCRTYP1|STOPSMK2|ASTHMA3|CHCCOPD1|
+---+--------+--------+-------+--------+--------+--------+-------+--------+--------+--------+--------+-------+--------+
|  2|       2|       3|      2|       2|       4|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|  1|       2|       3|      2|       2|       4|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|  1|       2|       3|      3|       2|       4|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|  2|       2|       3|      4|       2|       4|      NA|     NA|      NA|      NA|      NA|      NA|      2|       1|
|  2|       2|       2|      4|       1|       3|      NA|     NA|      NA|      NA|      NA|      NA|      1|       1|
|  1|       2|       3|      3|       1|

In [75]:
data_2017F.filter(data_2017F.CNCRTYP1 != 'NA').count()

7976

In [185]:
data_2017F.printSchema()

root
 |-- SEX: integer (nullable = true)
 |-- _AGE65YR: integer (nullable = true)
 |-- _BMI5CAT: string (nullable = true)
 |-- GENHLTH: string (nullable = true)
 |-- SMOKE100: string (nullable = true)
 |-- _SMOKER3: integer (nullable = true)
 |-- LCSFIRST: string (nullable = true)
 |-- LCSLAST: string (nullable = true)
 |-- LCSNUMCG: string (nullable = true)
 |-- LCSCTSCN: string (nullable = true)
 |-- CNCRTYP1: string (nullable = true)
 |-- STOPSMK2: string (nullable = true)
 |-- ASTHMA3: string (nullable = true)
 |-- CHCCOPD1: string (nullable = true)



In [214]:
data_2017F = data_2017F.fillna(0)

In [215]:
#@title User Defined Functions (UDF) for prediction label

from pyspark.sql.functions import udf
from pyspark.sql.types import *
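# The 2017 file stores most codes as strings and missing answers as 'NA',
# so the label and cleaning UDFs compare against strings here.
y_udf = udf(lambda y: '1' if y == '24' else '0', StringType())
x_udf = udf(lambda x: '0' if (x=='NA' or x=='7' or x=='77' or x=='777' or x=='9' or x=='99' or x=='888' or x=='999') else x, StringType())
# Integer version for the columns that were read as integers (SEX, _SMOKER3).
x1_udf = udf(lambda x: 0 if (x==7 or x==77 or x==777 or x==9 or x==99 or x==888 or x==999) else x, StringType())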

processed_2017 = data_2017F.withColumn("Gender", x1_udf('SEX')).drop("SEX")\
                    .withColumn("Age65", x_age('_AGE65YR')).drop("_AGE65YR")\
                    .withColumn("BMI", x_udf('_BMI5CAT')).drop("_BMI5CAT")\
                    .withColumn("GeneralHealth", x_udf('GENHLTH')).drop("GENHLTH")\
                    .withColumn("Smoked100", x_udf('SMOKE100')).drop("SMOKE100")\
                    .withColumn("SmokerStatus", x1_udf('_SMOKER3')).drop("_SMOKER3")\
                    .withColumn("FirstSmokedAge", x_udf('LCSFIRST')).drop("LCSFIRST")\
                    .withColumn("LastSmokedAge", x_udf('LCSLAST')).drop("LCSLAST")\
                    .withColumn("AvgNumCigADay", x_udf('LCSNUMCG')).drop("LCSNUMCG")\
                    .withColumn("HasCTScan", x_udf('LCSCTSCN')).drop("LCSCTSCN")\
                    .withColumn("StopSmoking", x_udf('STOPSMK2')).drop("STOPSMK2")\
                    .withColumn("HasAsthma", x_udf('ASTHMA3')).drop("ASTHMA3")\
                    .withColumn("HasChronicDisease", x_udf('CHCCOPD1')).drop("CHCCOPD1")\
                    .withColumn("HasLungCancer", y_udf('CNCRTYP1')).drop("CNCRTYP1")
                    

In [189]:
processed_2017.printSchema()

root
 |-- Gender: string (nullable = true)
 |-- Age65: string (nullable = true)
 |-- BMI: string (nullable = true)
 |-- GeneralHealth: string (nullable = true)
 |-- Smoked100: string (nullable = true)
 |-- SmokerStatus: string (nullable = true)
 |-- FirstSmokedAge: string (nullable = true)
 |-- LastSmokedAge: string (nullable = true)
 |-- AvgNumCigADay: string (nullable = true)
 |-- HasCTScan: string (nullable = true)
 |-- StopSmoking: string (nullable = true)
 |-- HasAsthma: string (nullable = true)
 |-- HasChronicDisease: string (nullable = true)
 |-- HasLungCancer: string (nullable = true)



In [216]:
processed_2017.groupBy(processed_2017.HasLungCancer).count().show()
# The classes are heavily imbalanced: only 156 of 450,015 rows are positive.

+-------------+------+
|HasLungCancer| count|
+-------------+------+
|            0|449859|
|            1|   156|
+-------------+------+



In [217]:
#@title Data Integration 3
data_2018 = spark.read.csv('./2018.csv', inferSchema=True, header=True)

# preview the data
# data type
print('-'*10, 'data types', '-'*10)
pd.DataFrame(data_2018.dtypes)

---------- data types ----------


Unnamed: 0,0,1
0,_c0,int
1,_STATE,int
2,FMONTH,int
3,IDATE,int
4,IMONTH,int
...,...,...
271,_HFOB3YR,string
272,_FS5YR,string
273,_FOBTFS,string
274,_CRCREC,string


In [218]:
data_2018F = data_2018.select('SEX1', '_AGE65YR', '_BMI5CAT', 'GENHLTH', 'SMOKE100', '_SMOKER3',
                  'LCSFIRST', 'LCSLAST', 'LCSNUMCG', 'LCSCTSCN', 'CNCRTYP1',
                  'STOPSMK2', 'ASTHMA3', 'CHCCOPD1') #'ECIGARET',  'ECIGNOW'
data_2018F.show(6)

+----+--------+--------+-------+--------+--------+--------+-------+--------+--------+--------+--------+-------+--------+
|SEX1|_AGE65YR|_BMI5CAT|GENHLTH|SMOKE100|_SMOKER3|LCSFIRST|LCSLAST|LCSNUMCG|LCSCTSCN|CNCRTYP1|STOPSMK2|ASTHMA3|CHCCOPD1|
+----+--------+--------+-------+--------+--------+--------+-------+--------+--------+--------+--------+-------+--------+
|   2|       2|       2|      2|       2|       4|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|   2|       1|       4|      3|       1|       1|      NA|     NA|      NA|      NA|      NA|       1|      2|       2|
|   2|       2|       3|      5|       2|       4|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|   1|       2|       3|      1|       2|       4|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|   2|       1|      NA|      2|       2|       4|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|   2|       2|       4|      2|

In [85]:
print('Columns with null values:')
print('-'*25)
# Flag nulls as 0/1 per column, then sum the flags to get the null count per column.
from pyspark.sql.functions import col
data_2018F.select([col(c).isNull().cast("int").alias(c) for c in data_2018F.columns]).\
    groupBy().sum().toPandas()

Columns with null values:
-------------------------


Unnamed: 0,sum(SEX1),sum(_AGE65YR),sum(_BMI5CAT),sum(GENHLTH),sum(SMOKE100),sum(_SMOKER3),sum(LCSFIRST),sum(LCSLAST),sum(LCSNUMCG),sum(LCSCTSCN),sum(CNCRTYP1),sum(STOPSMK2),sum(ASTHMA3),sum(CHCCOPD1)
0,0,1,1,0,0,1,1,1,1,1,1,0,0,0


In [219]:
# Drop rows with a null in any selected column. The text 'NA' is not a null, so those rows are kept.
data_2018F = data_2018F.na.drop(how="any")
data_2018F.count()

437435

In [87]:
data_2018F.show(6)

+----+--------+--------+-------+--------+--------+--------+-------+--------+--------+--------+--------+-------+--------+
|SEX1|_AGE65YR|_BMI5CAT|GENHLTH|SMOKE100|_SMOKER3|LCSFIRST|LCSLAST|LCSNUMCG|LCSCTSCN|CNCRTYP1|STOPSMK2|ASTHMA3|CHCCOPD1|
+----+--------+--------+-------+--------+--------+--------+-------+--------+--------+--------+--------+-------+--------+
|   2|       2|       2|      2|       2|       4|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|   2|       1|       4|      3|       1|       1|      NA|     NA|      NA|      NA|      NA|       1|      2|       2|
|   2|       2|       3|      5|       2|       4|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|   1|       2|       3|      1|       2|       4|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|   2|       1|      NA|      2|       2|       4|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|   2|       2|       4|      2|

In [88]:
data_2018F.filter(data_2018F.CNCRTYP1 != 'NA').count()

6187

In [89]:
data_2018F.printSchema()

root
 |-- SEX1: integer (nullable = true)
 |-- _AGE65YR: integer (nullable = true)
 |-- _BMI5CAT: string (nullable = true)
 |-- GENHLTH: string (nullable = true)
 |-- SMOKE100: string (nullable = true)
 |-- _SMOKER3: integer (nullable = true)
 |-- LCSFIRST: string (nullable = true)
 |-- LCSLAST: string (nullable = true)
 |-- LCSNUMCG: string (nullable = true)
 |-- LCSCTSCN: string (nullable = true)
 |-- CNCRTYP1: string (nullable = true)
 |-- STOPSMK2: string (nullable = true)
 |-- ASTHMA3: string (nullable = true)
 |-- CHCCOPD1: string (nullable = true)



In [220]:
data_2018F = data_2018F.fillna(0)

In [221]:
#@title User Defined Functions (UDF) for prediction label

from pyspark.sql.functions import udf
from pyspark.sql.types import *
# Same string-based UDFs as for 2017; x1_udf and x_age are reused from the earlier cells.
y_udf = udf(lambda y: '1' if y == '24' else '0', StringType())
x_udf = udf(lambda x: '0' if (x=='NA' or x=='7' or x=='77' or x=='777' or x=='9' or x=='99' or x=='888' or x=='999') else x, StringType())

processed_2018 = data_2018F.withColumn("Gender", x1_udf('SEX1')).drop("SEX1")\
                    .withColumn("Age65", x_age('_AGE65YR')).drop("_AGE65YR")\
                    .withColumn("BMI", x_udf('_BMI5CAT')).drop("_BMI5CAT")\
                    .withColumn("GeneralHealth", x_udf('GENHLTH')).drop("GENHLTH")\
                    .withColumn("Smoked100", x_udf('SMOKE100')).drop("SMOKE100")\
                    .withColumn("SmokerStatus", x1_udf('_SMOKER3')).drop("_SMOKER3")\
                    .withColumn("FirstSmokedAge", x_udf('LCSFIRST')).drop("LCSFIRST")\
                    .withColumn("LastSmokedAge", x_udf('LCSLAST')).drop("LCSLAST")\
                    .withColumn("AvgNumCigADay", x_udf('LCSNUMCG')).drop("LCSNUMCG")\
                    .withColumn("HasCTScan", x_udf('LCSCTSCN')).drop("LCSCTSCN")\
                    .withColumn("StopSmoking", x_udf('STOPSMK2')).drop("STOPSMK2")\
                    .withColumn("HasAsthma", x_udf('ASTHMA3')).drop("ASTHMA3")\
                    .withColumn("HasChronicDisease", x_udf('CHCCOPD1')).drop("CHCCOPD1")\
                    .withColumn("HasLungCancer", y_udf('CNCRTYP1')).drop("CNCRTYP1")
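
The recoding applied by `x_udf` and `y_udf` can be checked without a Spark session. The sketch below mirrors the two lambdas as plain Python functions. The names `MISSING_CODES`, `recode_feature` and `recode_label` are hypothetical helpers introduced only for illustration; they are not part of the notebook.

```python
# Codes that x_udf maps to '0' (copied from the lambda above).
MISSING_CODES = {'NA', '7', '77', '777', '9', '99', '888', '999'}

def recode_feature(value):
    """Mirror of x_udf: the listed codes become '0', everything else passes through."""
    return '0' if value in MISSING_CODES else value

def recode_label(value):
    """Mirror of y_udf: '1' only when the cancer-type code is '24'."""
    return '1' if value == '24' else '0'

print([recode_feature(v) for v in ['NA', '3', '99', '15']])  # ['0', '3', '0', '15']
print([recode_label(v) for v in ['24', 'NA', '5']])          # ['1', '0', '0']

# An integer 9 never equals the string '9', so it passes through unchanged.
print(recode_feature(9))  # 9
```

The last line shows why the string comparisons do not catch codes stored in integer columns such as `SEX1`, `_AGE65YR` and `_SMOKER3`: those columns need a variant that compares against integers.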
                    

In [196]:
processed_2018.printSchema()

root
 |-- Gender: string (nullable = true)
 |-- Age65: string (nullable = true)
 |-- BMI: string (nullable = true)
 |-- GeneralHealth: string (nullable = true)
 |-- Smoked100: string (nullable = true)
 |-- SmokerStatus: string (nullable = true)
 |-- FirstSmokedAge: string (nullable = true)
 |-- LastSmokedAge: string (nullable = true)
 |-- AvgNumCigADay: string (nullable = true)
 |-- HasCTScan: string (nullable = true)
 |-- StopSmoking: string (nullable = true)
 |-- HasAsthma: string (nullable = true)
 |-- HasChronicDisease: string (nullable = true)
 |-- HasLungCancer: string (nullable = true)



In [222]:
processed_2018.groupBy(processed_2018.HasLungCancer).count().show()
# Classes are heavily imbalanced: only 151 of 437,435 rows are positive

+-------------+------+
|HasLungCancer| count|
+-------------+------+
|            0|437284|
|            1|   151|
+-------------+------+



In [95]:
#@title Data Integration 4
data_2019 = spark.read.csv('./2019.csv', inferSchema=True, header=True)

# preview the data
# data type
print('-'*10, 'data types', '-'*10)
pd.DataFrame(data_2019.dtypes)

---------- data types ----------


Unnamed: 0,0,1
0,_c0,int
1,_STATE,int
2,FMONTH,int
3,IDATE,int
4,IMONTH,int
...,...,...
338,_FRUITE1,int
339,_VEGETE1,int
340,_FLSHOT7,string
341,_PNEUMO3,string


In [100]:
data_2019F = data_2019.select('SEXVAR', '_AGE65YR', '_BMI5CAT', 'GENHLTH', 'SMOKE100', '_SMOKER3',
                  'LCSFIRST', 'LCSLAST', 'LCSNUMCG', 'LCSCTSCN', 'CNCRTYP1',
                  'STOPSMK2', 'ASTHMA3', 'CHCCOPD2') #'ECIGARET',  'ECIGNOW'
data_2019F.show(6)

+------+--------+--------+-------+--------+--------+--------+-------+--------+--------+--------+--------+-------+--------+
|SEXVAR|_AGE65YR|_BMI5CAT|GENHLTH|SMOKE100|_SMOKER3|LCSFIRST|LCSLAST|LCSNUMCG|LCSCTSCN|CNCRTYP1|STOPSMK2|ASTHMA3|CHCCOPD2|
+------+--------+--------+-------+--------+--------+--------+-------+--------+--------+--------+--------+-------+--------+
|     2|       2|       3|      3|       1|       3|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|     2|       2|       2|      4|       2|       4|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|     2|       2|       4|      3|       2|       4|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|     2|       2|       2|      4|      NA|       9|      NA|     NA|      NA|      NA|      NA|      NA|      2|       1|
|     2|       2|       2|      2|       1|       3|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|     2|       3

In [101]:
print('Columns with null values:')
print('-'*25)
from pyspark.sql.functions import col
# Count nulls per column without eval()
data_2019F.select([col(c).isNull().cast("int").alias(c) for c in data_2019F.columns]).\
    groupBy().sum().toPandas()

Columns with null values:
-------------------------


Unnamed: 0,sum(SEXVAR),sum(_AGE65YR),sum(_BMI5CAT),sum(GENHLTH),sum(SMOKE100),sum(_SMOKER3),sum(LCSFIRST),sum(LCSLAST),sum(LCSNUMCG),sum(LCSCTSCN),sum(CNCRTYP1),sum(STOPSMK2),sum(ASTHMA3),sum(CHCCOPD2)
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [102]:
# Missing answers are stored as the string 'NA', not as nulls, so this only removes true nulls
data_2019F = data_2019F.na.drop(how="any")
data_2019F.count()

418268

In [103]:
data_2019F.show(6)

+------+--------+--------+-------+--------+--------+--------+-------+--------+--------+--------+--------+-------+--------+
|SEXVAR|_AGE65YR|_BMI5CAT|GENHLTH|SMOKE100|_SMOKER3|LCSFIRST|LCSLAST|LCSNUMCG|LCSCTSCN|CNCRTYP1|STOPSMK2|ASTHMA3|CHCCOPD2|
+------+--------+--------+-------+--------+--------+--------+-------+--------+--------+--------+--------+-------+--------+
|     2|       2|       3|      3|       1|       3|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|     2|       2|       2|      4|       2|       4|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|     2|       2|       4|      3|       2|       4|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|     2|       2|       2|      4|      NA|       9|      NA|     NA|      NA|      NA|      NA|      NA|      2|       1|
|     2|       2|       2|      2|       1|       3|      NA|     NA|      NA|      NA|      NA|      NA|      2|       2|
|     2|       3

In [104]:
data_2019F.filter(data_2019F.CNCRTYP1 != 'NA').count()

888

In [105]:
data_2019F.printSchema()

root
 |-- SEXVAR: integer (nullable = true)
 |-- _AGE65YR: integer (nullable = true)
 |-- _BMI5CAT: string (nullable = true)
 |-- GENHLTH: string (nullable = true)
 |-- SMOKE100: string (nullable = true)
 |-- _SMOKER3: integer (nullable = true)
 |-- LCSFIRST: string (nullable = true)
 |-- LCSLAST: string (nullable = true)
 |-- LCSNUMCG: string (nullable = true)
 |-- LCSCTSCN: string (nullable = true)
 |-- CNCRTYP1: string (nullable = true)
 |-- STOPSMK2: string (nullable = true)
 |-- ASTHMA3: string (nullable = true)
 |-- CHCCOPD2: string (nullable = true)



In [107]:
#@title User Defined Functions (UDF) for prediction label

from pyspark.sql.functions import udf
from pyspark.sql.types import *
y_udf = udf(lambda y: '1' if y == '24' else '0', StringType())
x_udf = udf(lambda x: '0' if (x=='NA' or x=='7' or x=='77' or x=='777' or x=='9' or x=='99' or x=='888' or x=='999') else x, StringType())

processed_2019 = data_2019F.withColumn("Gender", x_udf('SEXVAR')).drop("SEXVAR")\
                    .withColumn("Age65", x_udf('_AGE65YR')).drop("_AGE65YR")\
                    .withColumn("BMI", x_udf('_BMI5CAT')).drop("_BMI5CAT")\
                    .withColumn("GeneralHealth", x_udf('GENHLTH')).drop("GENHLTH")\
                    .withColumn("Smoked100", x_udf('SMOKE100')).drop("SMOKE100")\
                    .withColumn("SmokerStatus", x_udf('_SMOKER3')).drop("_SMOKER3")\
                    .withColumn("FirstSmokedAge", x_udf('LCSFIRST')).drop("LCSFIRST")\
                    .withColumn("LastSmokedAge", x_udf('LCSLAST')).drop("LCSLAST")\
                    .withColumn("AvgNumCigADay", x_udf('LCSNUMCG')).drop("LCSNUMCG")\
                    .withColumn("HasCTScan", x_udf('LCSCTSCN')).drop("LCSCTSCN")\
                    .withColumn("StopSmoking", x_udf('STOPSMK2')).drop("STOPSMK2")\
                    .withColumn("HasAsthma", x_udf('ASTHMA3')).drop("ASTHMA3")\
                    .withColumn("HasChronicDisease", x_udf('CHCCOPD2')).drop("CHCCOPD2")\
                    .withColumn("HasLungCancer", y_udf('CNCRTYP1')).drop("CNCRTYP1")
                    

In [108]:
processed_2019.printSchema()

root
 |-- Gender: string (nullable = true)
 |-- Age65: string (nullable = true)
 |-- BMI: string (nullable = true)
 |-- GeneralHealth: string (nullable = true)
 |-- Smoked100: string (nullable = true)
 |-- SmokerStatus: string (nullable = true)
 |-- FirstSmokedAge: string (nullable = true)
 |-- LastSmokedAge: string (nullable = true)
 |-- AvgNumCigADay: string (nullable = true)
 |-- HasCTScan: string (nullable = true)
 |-- StopSmoking: string (nullable = true)
 |-- HasAsthma: string (nullable = true)
 |-- HasChronicDisease: string (nullable = true)
 |-- HasLungCancer: string (nullable = true)



In [109]:
processed_2019.groupBy(processed_2019.HasLungCancer).count().show()
# Classes are heavily imbalanced: only 15 of 418,268 rows are positive

+-------------+------+
|HasLungCancer| count|
+-------------+------+
|            0|418253|
|            1|    15|
+-------------+------+



In [318]:
#@title Data Preprocessing
data = processed_2020.unionByName(processed_2017)

# preview the data
# data type
print('-'*10, 'data types', '-'*10)
pd.DataFrame(data.dtypes)

---------- data types ----------


Unnamed: 0,0,1
0,Gender,string
1,Age65,string
2,BMI,string
3,GeneralHealth,string
4,Smoked100,string
...,...,...
9,HasCTScan,string
10,StopSmoking,string
11,HasAsthma,string
12,HasChronicDisease,string


In [319]:
data_raw = data.unionByName(processed_2018)

# preview the data
# data type
print('-'*10, 'data types', '-'*10)
pd.DataFrame(data_raw.dtypes)

---------- data types ----------


Unnamed: 0,0,1
0,Gender,string
1,Age65,string
2,BMI,string
3,GeneralHealth,string
4,Smoked100,string
...,...,...
9,HasCTScan,string
10,StopSmoking,string
11,HasAsthma,string
12,HasChronicDisease,string


In [165]:
# Define schema explicitly
from pyspark.sql.types import *
data_raw.columns

['Gender',
 'Age65',
 'BMI',
 'GeneralHealth',
 'Smoked100',
 'SmokerStatus',
 'FirstSmokedAge',
 'LastSmokedAge',
 'AvgNumCigADay',
 'HasCTScan',
 'StopSmoking',
 'HasAsthma',
 'HasChronicDisease',
 'HasLungCancer']

In [225]:
data_raw.groupBy(data_raw.HasLungCancer).count().show()
# Classes are heavily imbalanced: only 747 of 1,289,408 rows are positive

+-------------+-------+
|HasLungCancer|  count|
+-------------+-------+
|            0|1288661|
|            1|    747|
+-------------+-------+
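
To put the imbalance in numbers, here is a short calculation using the counts printed above. The variable names are illustrative only.

```python
# Counts copied from the groupBy output above.
negatives = 1288661
positives = 747

total = negatives + positives
positive_rate = positives / total
print(f"total rows: {total}")  # total rows: 1289408
print(f"positive rate: {positive_rate:.4%}")

# A model that always predicts '0' would already reach this accuracy,
# which is why plain accuracy says little on this dataset.
baseline_accuracy = negatives / total
print(f"majority-class accuracy: {baseline_accuracy:.4%}")
```

With fewer than one positive per thousand rows, a metric such as area under the ROC curve, or precision and recall on the positive class, is more informative than accuracy.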



In [320]:
# data summary
print('-'*10, 'data summary', '-'*10)
data_raw.describe().toPandas()

---------- data summary ----------


Unnamed: 0,summary,Gender,Age65,BMI,GeneralHealth,Smoked100,SmokerStatus,FirstSmokedAge,LastSmokedAge,AvgNumCigADay,HasCTScan,StopSmoking,HasAsthma,HasChronicDisease,HasLungCancer
0,count,1289408.0,1289408.0,1289408.0,1289408.0,1289408.0,1289408.0,1289408.0,1289408.0,1289408.0,1289408.0,1289408.0,1289408.0,1289408.0,1289408.0
1,mean,1.5479530140963915,1.328006340894426,2.7138973854668187,2.5374691331215566,1.5086341949173574,3.1843357571846926,0.7682711756092718,1.7161759505137242,0.687531797538095,0.2033491338660842,0.1985857075495111,1.8550823323571748,1.907504839430188,0.0005793356330967389
2,stddev,0.4998675705749727,0.5057963813415817,1.1568094469630912,1.082762391633306,0.5795246875466048,1.1595266441303385,3.7821849709013184,8.946524041357442,4.270878895331059,0.6421514139918514,0.5316875116963501,0.3612492982590254,0.3065137425244624,0.0240624282308456
3,min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,max,2.0,2.0,4.0,5.0,2.0,4.0,8.0,94.0,94.0,3.0,2.0,2.0,2.0,1.0


In [235]:
processed_2020 = processed_2020.filter(processed_2020.Age65 != '0')

In [None]:
processed_2020.filter(processed_2020.Age65 == '0').count()

In [239]:
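#Imputation
from pyspark.ml.feature import Imputer

# Note: Imputer only accepts numeric input columns, so these string columns
# have to be cast (for example to double) before imputer.fit() is called.
smoking_columns = ['FirstSmokedAge', 'LastSmokedAge', 'AvgNumCigADay']
imputer = Imputer(
    inputCols = smoking_columns,
    outputCols = ["{}_imputed".format(a) for a in smoking_columns]
).setStrategy("mode")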

In [None]:
# Show the value counts of every column, most frequent first.
# (Comparing the whole dtypes list with "string" is always True, so no column is filtered out.)
column_subset = list(data_raw.columns)
for col_ in column_subset:
    temp_col = data_raw.groupBy(col_).count()
    temp_col = temp_col.dropna(subset=col_)
    temp_col.orderBy(temp_col['count'].desc()).show()

column_subset

+------+------+
|Gender| count|
+------+------+
|     2|707932|
|     1|580079|
|     0|  1397|
+------+------+

+-----+------+
|Age65| count|
+-----+------+
|    1|820814|
|    2|445764|
|    0| 22830|
+-----+------+

+---+------+
|BMI| count|
+---+------+
|  3|421971|
|  4|372179|
|  2|362499|
|  0|113065|
|  1| 19694|
+---+------+

+-------------+------+
|GeneralHealth| count|
+-------------+------+
|            2|425815|
|            3|400324|
|            1|227909|
|            4|169213|
|            5| 62894|
|            0|  3253|
+-------------+------+

+---------+------+
|Smoked100| count|
+---------+------+
|        2|711232|
|        1|522781|
|        0| 55395|
+---------+------+

+------------+------+
|SmokerStatus| count|
+------------+------+
|           4|711232|
|           3|344310|
|           1|126402|
|           0| 56640|
|           2| 50824|
+------------+------+

+--------------+-------+
|FirstSmokedAge|  count|
+--------------+-------+
|             0|1234016|

In [243]:
# Mode imputation for the non-string columns.
# Assumption: 0 is the missing-value marker left by the recoding step.
column_subset = [name for name, dtype in data_raw.dtypes if dtype != "string"]
for col_ in column_subset:
    temp_col = data_raw.groupBy(col_).count()
    temp_col = temp_col.dropna(subset=col_)
    frequent_category=temp_col.orderBy(
                     temp_col['count'].desc()).collect()[0][0]
    # replace() needs both the value to replace and its replacement
    data_raw = data_raw.replace(0, frequent_category, subset=[col_])
data_raw.show()

+------+-----+---+-------------+---------+------------+--------------+-------------+-------------+---------+-----------+---------+-----------------+-------------+
|Gender|Age65|BMI|GeneralHealth|Smoked100|SmokerStatus|FirstSmokedAge|LastSmokedAge|AvgNumCigADay|HasCTScan|StopSmoking|HasAsthma|HasChronicDisease|HasLungCancer|
+------+-----+---+-------------+---------+------------+--------------+-------------+-------------+---------+-----------+---------+-----------------+-------------+
|   2.0|  1.0|1.0|          2.0|      1.0|         1.0|           0.0|          0.0|          0.0|      0.0|        2.0|      1.0|              1.0|            0|
|   2.0|  2.0|3.0|          3.0|      0.0|           0|           0.0|          0.0|          0.0|      0.0|        0.0|      1.0|              2.0|            0|
|   2.0|  2.0|0.0|          3.0|      2.0|         4.0|           0.0|          0.0|          0.0|      0.0|        0.0|      2.0|              2.0|            0|
|   2.0|  2.0|0.0|    
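
The idea behind the loop above is mode imputation: fill a missing answer with the most frequent category of its column. Below is a minimal plain-Python sketch of that idea, assuming `'0'` marks a missing answer (as produced by `x_udf`). The function `impute_with_mode` and the sample list are hypothetical and not part of the notebook.

```python
from collections import Counter

def impute_with_mode(values, missing='0'):
    """Replace `missing` entries with the most frequent non-missing value."""
    observed = [v for v in values if v != missing]
    if not observed:
        return list(values)  # nothing to learn the mode from
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v == missing else v for v in values]

# Made-up BMI categories, for illustration only.
bmi = ['3', '4', '0', '3', '2', '0']
print(impute_with_mode(bmi))  # ['3', '4', '3', '3', '2', '3']
```

Unlike the loop above, this sketch excludes the missing marker when it looks for the mode. Otherwise a column such as `FirstSmokedAge`, where `0` is by far the most common value, would be "imputed" with the marker itself.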

In [None]:
data_raw.write.option("header",True).csv("final_data")

# Step 4: Decision Tree Classification with PySpark

In [None]:
#@title Process categorical columns
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, Imputer, BucketedRandomProjectionLSH,VectorSlicer
from pyspark.sql.window import Window
from pyspark.ml.linalg import Vectors,VectorUDT
from pyspark.sql.functions import array, create_map, struct
from pyspark.ml import Pipeline

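# categorical columns
# Note: the slice takes the first 12 columns, so HasChronicDisease (index 12) is not used as a feature
categorical_columns = data_raw.columns[0:12]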

In [None]:
#@title Build StringIndexer stages
stringindexer_stages = [StringIndexer(inputCol=c, outputCol='strindexed_' + c) for c in categorical_columns]
# encode label column and add it to stringindexer_stages
stringindexer_stages += [StringIndexer(inputCol='HasLungCancer', outputCol='label')]

In [None]:
#@title Build OneHotEncoder stages
onehotencoder_stages = [OneHotEncoder(inputCol='strindexed_' + c, outputCol='onehot_' + c) for c in categorical_columns]
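
As a rough illustration of what the two stage types do to a single column, here is a plain-Python sketch. It assumes the default settings: `StringIndexer` orders categories by descending frequency, and `OneHotEncoder` drops the last category. The helper names and the sample values are hypothetical.

```python
from collections import Counter

def fit_string_index(values):
    """Map each category to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {category: idx for idx, category in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Return a 0/1 list; with drop_last the last category becomes all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    return [1 if i == index else 0 for i in range(size)]

# Made-up SmokerStatus values, for illustration only.
smoker_status = ['4', '3', '4', '1', '4', '3']
mapping = fit_string_index(smoker_status)
print(mapping)  # {'4': 0, '3': 1, '1': 2}
print([one_hot(mapping[v], len(mapping)) for v in ['4', '3', '1']])
# [[1, 0], [0, 1], [0, 0]]
```

Spark stores the encoded column as a sparse vector rather than a Python list, but the positions follow the same logic.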

In [None]:
#@title Build VectorAssembler stage
feature_columns = ['onehot_' + c for c in categorical_columns]
vectorassembler_stage = VectorAssembler(inputCols=feature_columns, outputCol='features') 

In [None]:
#@title Build Pipeline model
# all stages
all_stages = stringindexer_stages + onehotencoder_stages + [vectorassembler_stage]
pipeline = Pipeline(stages=all_stages)

In [None]:
#@title Fit pipeline model
pipeline_model = pipeline.fit(data_raw)

In [None]:
#@title Transform data
final_columns = feature_columns + ['features', 'label']
df_raw = pipeline_model.transform(data_raw).\
            select(final_columns)
            
df_raw.show(5)

In [None]:
#@title Split data into training and test sets
training, test = df_raw.randomSplit([0.8, 0.2], seed=1234)


In [None]:
training.printSchema()  # expect the 12 onehot_* columns plus 'features' and 'label'

In [None]:
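#@title Data balancing using SMOTE
# SMOTE uses the k nearest minority-class neighbours to synthesise new minority samples
import numpy as np
from imblearn.over_sampling import SMOTE

# Collect the training split and expand each Spark vector into a dense NumPy row
train_pdf = training.select('features', 'label').toPandas()
features = np.vstack([v.toArray() for v in train_pdf['features']])
labels = train_pdf['label'].values

In [None]:
sm = SMOTE(sampling_strategy = 'not majority', k_neighbors = 50, random_state = 42)

features, labels = sm.fit_resample(features, labels)

In [None]:
# Convert the resampled arrays back into a Spark DataFrame of (features, label) rows
features = spark.createDataFrame(
    [(Vectors.dense(x), float(y)) for x, y in zip(features, labels)],
    ['features', 'label'])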
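
For intuition, SMOTE builds each synthetic minority row by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step and skips the neighbour search. `smote_like_sample` and the two sample rows are hypothetical; this is not how `imblearn` is implemented internally.

```python
import random

def smote_like_sample(point, neighbour, rng):
    """Create one synthetic sample on the line segment between two minority samples."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbour)]

rng = random.Random(42)
# Made-up minority rows, for illustration only.
minority_a = [1.0, 0.0, 2.0]
minority_b = [3.0, 1.0, 2.0]
synthetic = smote_like_sample(minority_a, minority_b, rng)
print(synthetic)
```

Coordinates where both rows agree stay unchanged, while the others land somewhere in between. On one-hot features this produces fractional values, which is worth keeping in mind when interpreting the resampled data.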

In [None]:
#@title Build decision tree classifier
from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(featuresCol='features', labelCol='label')

In [None]:
#@title Parameter grid
from pyspark.ml.tuning import ParamGridBuilder
param_grid = ParamGridBuilder().\
    addGrid(dt.maxDepth, [2,3,4,5]).\
    build()

In [None]:
#@title Evaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", metricName="areaUnderROC")

In [None]:
#@title Cross-validation model
from pyspark.ml.tuning import CrossValidator
cv = CrossValidator(estimator=dt, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=4)
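
A quick count of how much training this setup implies, using the grid values and fold count from the cells above. The variable names are illustrative.

```python
from itertools import product

# Values taken from the parameter grid and CrossValidator cells above.
max_depths = [2, 3, 4, 5]
num_folds = 4

# One fit per (parameter setting, fold) pair ...
cv_fits = list(product(max_depths, range(num_folds)))
# ... plus one final refit of the best setting on the full input.
total_fits = len(cv_fits) + 1
print(len(cv_fits), total_fits)  # 16 17
```

Each extra grid value therefore adds `num_folds` more fits, which matters on a dataset of this size.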

In [None]:
#@title Fit cross validation model
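# Fit on the training split only so the test split stays unseen
# (pass `features` instead to train on the SMOTE-balanced data)
cv_model = cv.fit(training)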

In [None]:
show_columns = ['features', 'label', 'prediction', 'rawPrediction', 'probability']

In [None]:
#@title Prediction on training data
pred_training_cv = cv_model.transform(training)
pred_training_cv.select(show_columns).show(5, truncate=False)

In [None]:
#@title Prediction on test data
pred_test_cv = cv_model.transform(test)
pred_test_cv.select(show_columns).show(5, truncate=False)

In [None]:
#@title Confusion matrix
# Use the test split; countByKey() below tallies each (label, prediction) Row
label_and_pred = cv_model.transform(test).select('label', 'prediction')
label_and_pred.rdd.zipWithIndex().countByKey()
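
The tally of `(label, prediction)` pairs is enough to derive precision and recall for the positive class, which are more telling than accuracy on data this imbalanced. Below is a plain-Python sketch with made-up pairs. It assumes the positive class is encoded as `1.0`, which is what `StringIndexer` assigns to the less frequent label. The function names are hypothetical.

```python
from collections import Counter

def confusion_counts(pairs):
    """Count each (label, prediction) pair."""
    return Counter(pairs)

def precision_recall(counts, positive=1.0):
    """Precision and recall of the positive class from a pair tally."""
    tp = sum(n for (y, p), n in counts.items() if y == positive and p == positive)
    fp = sum(n for (y, p), n in counts.items() if y != positive and p == positive)
    fn = sum(n for (y, p), n in counts.items() if y == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up (label, prediction) pairs, for illustration only.
pairs = [(0.0, 0.0)] * 95 + [(0.0, 1.0)] * 2 + [(1.0, 0.0)] * 2 + [(1.0, 1.0)] * 1
counts = confusion_counts(pairs)
print(counts[(1.0, 1.0)], counts[(0.0, 1.0)], counts[(1.0, 0.0)])  # 1 2 2
print(precision_recall(counts))
```

In this toy example 96 of 100 predictions are correct, yet only one of three positives is found.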

In [None]:
print('The best MaxDepth is:', cv_model.bestModel.getMaxDepth())