### Covid prediction

Today we are going to predict whether a person will be admitted to the ICU. 
The data that we use can be found at:
https://www.kaggle.com/tanmoyx/covid19-patient-precondition-dataset?select=covid.csv

We will perform several steps:
1) Clean the data, as it contains a lot of NA values represented as numbers. <br>
2) Add a weight column, as the data set contains far more people who were not admitted to the ICU than people who were.<br>
3) Train a logistic regression model. <br>
4) Extra: analyse the results and check what is most predictive of being admitted to the ICU.<br>




In [14]:
# Start by creating a Spark session

from pyspark.sql import SparkSession

### Run locally on all available cores ('local[*]') and name the app. getOrCreate() makes sure that we do not initialize the same session twice
spark = SparkSession.builder.master('local[*]').appName('covid prediction').getOrCreate()

In [15]:
# Next, load the data.

covid = spark.read.csv('./data/covid.csv',
                         sep=',',
                         header=True,
                         inferSchema=True,
                         nullValue="NA")



In [16]:
# Next, look at the first rows, count the entries and check the schema

covid.show(5)  # show() prints the table itself and returns None, so it needs no print()
print(covid.count())
print(covid.schema)

+------+---+------------+----------+-------------+----------+-------+---------+---+---------+--------+----+------+-------+------------+-------------+--------------+-------+-------------+-------+-------------------+---------+---+
|    id|sex|patient_type|entry_date|date_symptoms| date_died|intubed|pneumonia|age|pregnancy|diabetes|copd|asthma|inmsupr|hypertension|other_disease|cardiovascular|obesity|renal_chronic|tobacco|contact_other_covid|covid_res|icu|
+------+---+------------+----------+-------------+----------+-------+---------+---+---------+--------+----+------+-------+------------+-------------+--------------+-------+-------------+-------+-------------------+---------+---+
|16169f|  2|           1|04-05-2020|   02-05-2020|9999-99-99|     97|        2| 27|       97|       2|   2|     2|      2|           2|            2|             2|      2|            2|      2|                  2|        1| 97|
|1009bf|  2|           1|19-03-2020|   17-03-2020|9999-99-99|     97|        2| 24| 

In [17]:
### First we select all the columns that we want

# Hint: these are 'sex', 'pneumonia', 'age', 'pregnancy', 'diabetes', 'copd', 'inmsupr', 'hypertension', 'other_disease', 'cardiovascular', 'obesity', 'renal_chronic', 'tobacco', 'icu'


covid = covid.select('sex', 'pneumonia', 'age', 'pregnancy', 'diabetes', 'copd', 'inmsupr', 'hypertension', 'other_disease', 'cardiovascular', 'obesity', 'renal_chronic', 'tobacco', 'icu' )



In [18]:
# Keep only the patients for whom we know whether they were admitted (icu is 1 or 2 in the raw data).
# Subtract 1 so that 0 = yes and 1 = no

covid_interest = covid.filter((covid.icu == 1) | (covid.icu == 2)).withColumn('icu', covid.icu - 1)
covid_interest.show(5)

+---+---------+---+---------+--------+----+-------+------------+-------------+--------------+-------+-------------+-------+---+
|sex|pneumonia|age|pregnancy|diabetes|copd|inmsupr|hypertension|other_disease|cardiovascular|obesity|renal_chronic|tobacco|icu|
+---+---------+---+---------+--------+----+-------+------------+-------------+--------------+-------+-------------+-------+---+
|  1|        2| 54|        2|       2|   2|      2|           2|            2|             2|      1|            2|      2|  1|
|  2|        1| 30|       97|       2|   2|      2|           2|            2|             2|      2|            2|      2|  1|
|  1|        2| 60|        2|       1|   2|      2|           1|            2|             1|      2|            2|      2|  1|
|  2|        1| 47|       97|       1|   2|      2|           2|            2|             2|      2|            2|      2|  0|
|  2|        2| 63|       97|       2|   2|      2|           1|            2|             2|      2|   

In [19]:
### How many patients are left?
covid_interest.count()

121788

In [20]:
### How many patients were admitted to the ICU relative to those who were not?
# What should the weight be for the patients who were admitted?
import numpy as np

# Count each class once instead of re-running the same filter
n_admitted = covid_interest.filter(covid_interest.icu == 0).count()
n_not_admitted = covid_interest.filter(covid_interest.icu == 1).count()

# Note: this is the ratio admitted / not admitted, not the share of all patients
icu_percentage = np.round(n_admitted / n_not_admitted, 2)
icu_weight = np.round(n_not_admitted / n_admitted, 2)
print(icu_percentage)
print(icu_weight)

0.09
11.04
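
To make the weighting explicit, here is a minimal plain-Python sketch of the arithmetic behind `icu_weight`. The counts are hypothetical round numbers chosen for illustration, not the actual class counts of this data set, and `minority_weight` is a helper name introduced only for this example. It shows that weighting each admitted row by the ratio of the class sizes gives both classes roughly the same total weight.

```python
def minority_weight(n_minority: int, n_majority: int) -> float:
    """Weight per minority-class row so both classes carry about equal total weight."""
    if n_minority <= 0:
        raise ValueError("need at least one minority-class row")
    return round(n_majority / n_minority, 2)


# Hypothetical counts, for illustration only
n_admitted = 1_000        # rows with icu == 0
n_not_admitted = 11_040   # rows with icu == 1

w = minority_weight(n_admitted, n_not_admitted)

# Total weight per class after weighting: admitted rows get w, the rest get 1
total_admitted_weight = n_admitted * w
total_not_admitted_weight = n_not_admitted * 1.0

print(w)
print(total_admitted_weight, total_not_admitted_weight)
```

With these weights the logistic regression can no longer reach a low loss simply by predicting "not admitted" for everyone.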


In [21]:
### Next add a column that gives an appropriate weight to each row; this is used for weighted logistic regression. See https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
from pyspark.sql.functions import when

# float() turns the NumPy scalar into a plain Python float before it becomes a Spark literal
covid_interest = covid_interest.withColumn("weight", when(covid_interest.icu == 0, float(icu_weight)).otherwise(1))

In [22]:
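# Next we make sure that every entry of the columns we are interested in is mapped to (0, 1, 2) (yes, no, unknown/NA)
# Hint: look at which values are used for unknown/NA


# Import only what is needed; a wildcard import would shadow Python built-ins such as sum and max
from pyspark.sql.functions import col, when
columns_to_map = [
    'sex', 'pneumonia', 'age', 'pregnancy', 'diabetes', 'copd', 'inmsupr', 'hypertension', 'other_disease', 'cardiovascular', 'obesity', 'renal_chronic', 'tobacco' 
]

# Note: 'age' is in this list too, so every age of 3 or more is mapped to 2 (see the output below)

for column in columns_to_map:
    # Codes 1 and 2 become 0 and 1; every other code is treated as unknown/NA (2)
    covid_interest = covid_interest.withColumn(column, when(col(column) < 3, col(column) - 1).otherwise(2))

covid_interest.show()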



+---+---------+---+---------+--------+----+-------+------------+-------------+--------------+-------+-------------+-------+---+------+
|sex|pneumonia|age|pregnancy|diabetes|copd|inmsupr|hypertension|other_disease|cardiovascular|obesity|renal_chronic|tobacco|icu|weight|
+---+---------+---+---------+--------+----+-------+------------+-------------+--------------+-------+-------------+-------+---+------+
|  0|        1|  2|        1|       1|   1|      1|           1|            1|             1|      0|            1|      1|  1|   1.0|
|  1|        0|  2|        2|       1|   1|      1|           1|            1|             1|      1|            1|      1|  1|   1.0|
|  0|        1|  2|        1|       0|   1|      1|           0|            1|             0|      1|            1|      1|  1|   1.0|
|  1|        0|  2|        2|       0|   1|      1|           1|            1|             1|      1|            1|      1|  0| 11.04|
|  1|        1|  2|        2|       1|   1|      1|    
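
The recoding rule applied in the loop can be checked on single values with a small plain-Python sketch. The function name `recode` is introduced here for illustration only; it mirrors the `when(... < 3, ... - 1).otherwise(2)` expression and treats any code of 3 or more, such as the 97 visible in the raw rows, as unknown.

```python
def recode(value: int) -> int:
    """Mirror of the Spark expression: codes below 3 shift down by one, the rest become 2."""
    return value - 1 if value < 3 else 2


# 1 = yes and 2 = no in the raw data; 97 appears in the raw rows for missing answers
examples = {raw: recode(raw) for raw in (1, 2, 97)}
print(examples)

# The same rule applied to an age collapses it to the "unknown" bucket
print(recode(54))
```

This also shows why the `age` column is 2 in every row of the output: because `age` is in `columns_to_map`, almost all ages fall into the unknown bucket, so age carries next to no information for the model as written.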

In [23]:
## Split the data into train and test sets; use 80% as training data and 17 as the seed
covid_train, covid_test = covid_interest.randomSplit([0.8, 0.2], seed=17)




In [24]:
### Next create the pipeline that does these steps: StringIndexer, OneHotEncoder, VectorAssembler and lastly LogisticRegression

from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline


index_cols = ["index_" + col for col in columns_to_map]
one_hot_cols = ['one_hot_' + col for col in columns_to_map]

indexer = StringIndexer(inputCols=columns_to_map, outputCols=index_cols)
one_hot = OneHotEncoder(inputCols=index_cols, outputCols=one_hot_cols)

assembler = VectorAssembler(inputCols=one_hot_cols, outputCol="features")
logic_regress = LogisticRegression(labelCol="icu", weightCol="weight")


pipeline_regression = Pipeline(stages=[indexer, one_hot, assembler, logic_regress])
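
To see what the first two pipeline stages do to a single column, here is a rough plain-Python sketch. It assumes the default settings as I understand them: `StringIndexer` assigns index 0 to the most frequent value, and `OneHotEncoder` drops the last category so that it is encoded as all zeros. The helper names and the tiny example column are made up for illustration; the real stages return sparse Spark vectors rather than Python lists.

```python
from collections import Counter


def fit_string_index(values):
    """Assign index 0 to the most frequent value, 1 to the next, and so on."""
    counts = Counter(values)
    # Sort by descending frequency; break ties on the string form for determinism
    ordered = sorted(counts, key=lambda v: (-counts[v], str(v)))
    return {value: index for index, value in enumerate(ordered)}


def one_hot_drop_last(index, n_categories):
    """One-hot vector of length n_categories - 1; the last category is all zeros."""
    vector = [0.0] * (n_categories - 1)
    if index < n_categories - 1:
        vector[index] = 1.0
    return vector


# Hypothetical recoded column: 0 = yes, 1 = no, 2 = unknown
column = [1, 1, 1, 0, 0, 2]

mapping = fit_string_index(column)
encoded = [one_hot_drop_last(mapping[v], len(mapping)) for v in column]

print(mapping)
print(encoded)
```

Each three-valued column therefore contributes two entries to the `features` vector, which matters when reading the fitted coefficients back against the original column names.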




In [25]:
### Fit the pipeline

# Train the pipeline on the training data
pipeline_regression = pipeline_regression.fit(covid_train)




In [26]:
# Make predictions on the testing data and show the first 5
predictions = pipeline_regression.transform(covid_test)
predictions.select('prediction', 'icu').show(5)

+----------+---+
|prediction|icu|
+----------+---+
|       0.0|  1|
|       0.0|  1|
|       0.0|  1|
|       0.0|  1|
|       0.0|  0|
+----------+---+
only showing top 5 rows



In [28]:
# Evaluate with the BinaryClassificationEvaluator; use 'weight' as the weightCol.

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Calculate the area under the ROC curve (the evaluator's default metric) on the testing data
BinaryClassificationEvaluator(labelCol='icu', weightCol="weight").evaluate(predictions)

0.6713172251477334
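
The number above is an area under the ROC curve, not an error. As a reminder of what that metric measures, here is a minimal plain-Python sketch that computes an unweighted AUC as the fraction of (positive, negative) pairs in which the positive example gets the higher score. The labels and scores are invented for illustration, and the sketch ignores the row weights that the Spark evaluator above takes into account.

```python
def roc_auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count as 1/2)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up labels and model scores
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]

auc = roc_auc(labels, scores)
print(auc)  # 5 of the 6 positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so a score of about 0.67 leaves clear room for improvement.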

Now comes an open part.
Several things that you can do:
1) Improve upon the model by using grid search. <br>
2) Use another model, e.g. a support vector classifier (`LinearSVC` in `pyspark.ml.classification`). <br>
3) Explore which features are useful by checking the weight (coefficient) of each feature. <br>
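
As a starting point for the grid search, the sketch below shows one possible setup with `ParamGridBuilder` and `CrossValidator` from `pyspark.ml.tuning`. The candidate values, the number of folds and the helper name `build_cross_validator` are assumptions made for illustration, not tuned recommendations. The Spark imports sit inside the function so that the grid itself can be inspected without a running Spark installation.

```python
from itertools import product

# Hypothetical candidate values; adjust them to your own time budget
REG_PARAMS = [0.0, 0.01, 0.1]
ELASTIC_NET_PARAMS = [0.0, 0.5, 1.0]


def build_cross_validator(pipeline, lr, num_folds=3, seed=17):
    """Wrap an unfitted Pipeline in a CrossValidator over the grid above."""
    # Imported here so the grid can be inspected without Spark installed
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    grid = (
        ParamGridBuilder()
        .addGrid(lr.regParam, REG_PARAMS)
        .addGrid(lr.elasticNetParam, ELASTIC_NET_PARAMS)
        .build()
    )
    # Same metric as in the evaluation step: weighted area under the ROC curve
    evaluator = BinaryClassificationEvaluator(labelCol="icu", weightCol="weight")
    return CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=grid,
        evaluator=evaluator,
        numFolds=num_folds,
        seed=seed,
    )


# Every combination is fitted once per fold, so the cost grows quickly
candidates = list(product(REG_PARAMS, ELASTIC_NET_PARAMS))
print(len(candidates))
```

To use it, pass a fresh `Pipeline(stages=[indexer, one_hot, assembler, logic_regress])` together with `logic_regress`, then call `.fit(covid_train)` on the result. Note that the notebook rebinds `pipeline_regression` to the fitted model, so that variable cannot be reused as the estimator here.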