Installing the PySpark environment

In [None]:
!pip install -q pyspark



In [None]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.functions import col

conf = SparkConf().setAppName('FinalProject3200')
sc = SparkContext.getOrCreate(conf = conf)
spark = SparkSession.builder.getOrCreate()

sqlContext = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

Importing libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numpy import polyfit

Uploading the data

In [None]:
train_data = pd.read_csv("train.csv")

**Cleaning Data**

Dropping columns that will not be relevant in predicting disorders, as well as protected information such as the father's name, patient first name, and institute name. We will also drop "Genetic Disorder" to focus on the subclass of the disorders.

In [None]:
drop_columns = ['Patient Id', 
                'Patient First Name', 
                'Family Name', 
                "Father's name", 
                "Institute Name", 
                "Location of Institute", 
                "Status", 
                "Parental consent",
                "Genetic Disorder"]

train_data.drop(columns = drop_columns, inplace=True)

Checking the data types of the columns.
The object type represents strings.
The float64 type represents numeric values.

In [None]:
train_data.dtypes

Patient Age                                         float64
Genes in mother's side                               object
Inherited from father                                object
Maternal gene                                        object
Paternal gene                                        object
Blood cell count (mcL)                              float64
Mother's age                                        float64
Father's age                                        float64
Respiratory Rate (breaths/min)                       object
Heart Rate (rates/min                                object
Test 1                                              float64
Test 2                                              float64
Test 3                                              float64
Test 4                                              float64
Test 5                                              float64
Follow-up                                            object
Gender                                  

Counting NA in Disorder Subclass

In [None]:
subclass_na_count = train_data["Disorder Subclass"].isna().sum()
print(subclass_na_count)


2168


We will drop the NA values for Disorder Subclass since we need the genetic disorder to build a proper model.

In [None]:
train_data = train_data.dropna(subset = ['Disorder Subclass'])

Checking that the NAs have been dropped

In [None]:
subclass_na = train_data["Disorder Subclass"].isna().sum()
print(subclass_na)

0


Replacing 'NA' values in the numeric variables with the median, since the median is less affected by extreme values than the mean and filling the gaps allows the rows to remain usable.

Also, converting some of the float types to int types.

In [None]:
# Assigning each result back to the column (rather than calling fillna with
# inplace=True on a selected column) avoids pandas chained-assignment warnings.

# Replacing NA in No. of previous abortion with the median of the variable
train_data["No. of previous abortion"] = train_data["No. of previous abortion"].fillna(train_data["No. of previous abortion"].median())
train_data["No. of previous abortion"] = train_data["No. of previous abortion"].astype("int64")

# Replacing NA in White Blood cell count (thousand per microliter) with the median of the variable
train_data["White Blood cell count (thousand per microliter)"] = train_data["White Blood cell count (thousand per microliter)"].fillna(train_data["White Blood cell count (thousand per microliter)"].median())

# Replacing NA in Patient Age with the median of the variable
train_data["Patient Age"] = train_data["Patient Age"].fillna(train_data["Patient Age"].median())
train_data["Patient Age"] = train_data["Patient Age"].astype("int64")

# Replacing NA in Blood cell count with the median of the variable
train_data["Blood cell count (mcL)"] = train_data["Blood cell count (mcL)"].fillna(train_data["Blood cell count (mcL)"].median())

# Replacing NA in Mother's age with the median of the variable
train_data["Mother's age"] = train_data["Mother's age"].fillna(train_data["Mother's age"].median())
train_data["Mother's age"] = train_data["Mother's age"].astype("int64")

# Replacing NA in Father's age with the median of the variable
train_data["Father's age"] = train_data["Father's age"].fillna(train_data["Father's age"].median())
train_data["Father's age"] = train_data["Father's age"].astype("int64")
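
As a small illustration of why the median is the safer fill value, the sketch below uses a made-up list of ages (not taken from the dataset) with one missing entry and one extreme value. The names `ages`, `fill_with_median`, and `fill_with_mean` are hypothetical.

```python
from statistics import mean, median

# Hypothetical ages: one missing value (None) and one extreme value (60)
ages = [2, 4, 5, 7, None, 9, 60]

observed = [a for a in ages if a is not None]
fill_with_median = median(observed)  # middle of the observed values
fill_with_mean = mean(observed)      # pulled upward by the extreme value

# Fill the gap with the median, leaving observed values untouched
imputed = [fill_with_median if a is None else a for a in ages]

print(fill_with_median, fill_with_mean)
print(imputed)
```

Here the median stays close to the typical ages while the mean is dragged toward the outlier, which is the behaviour the imputation above relies on.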


Replacing 'NA' values in the categorical variables with an "Unknown" value.

In [None]:
train_data[["Blood test result", "Birth defects", "Gender", "Heart Rate (rates/min", "Respiratory Rate (breaths/min)", "Follow-up", "Place of birth"]] = train_data[["Blood test result", "Birth defects", "Gender", "Heart Rate (rates/min", "Respiratory Rate (breaths/min)", "Follow-up", "Place of birth"]].fillna('Unknown')

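Viewing the categorical variables cast from object type to categorical type. The result of astype is only displayed here and is not assigned back, so train_data itself keeps its original column types.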

In [None]:
train_data[["Blood test result", "Birth defects", "Gender", "Heart Rate (rates/min", "Respiratory Rate (breaths/min)", "Follow-up", "Place of birth", "Disorder Subclass"]].astype("category")

Unnamed: 0,Blood test result,Birth defects,Gender,Heart Rate (rates/min,Respiratory Rate (breaths/min),Follow-up,Place of birth,Disorder Subclass
0,Unknown,Unknown,Unknown,Normal,Normal (30-60),High,Institute,Leber's hereditary optic neuropathy
1,normal,Multiple,Unknown,Normal,Tachypnea,High,Unknown,Cystic fibrosis
2,normal,Singular,Unknown,Tachycardia,Normal (30-60),Low,Unknown,Diabetes
3,inconclusive,Singular,Male,Normal,Tachypnea,High,Institute,Leigh syndrome
4,Unknown,Multiple,Male,Tachycardia,Tachypnea,Low,Institute,Cancer
...,...,...,...,...,...,...,...,...
22078,inconclusive,Multiple,Female,Tachycardia,Normal (30-60),High,Institute,Leigh syndrome
22079,inconclusive,Multiple,Ambiguous,Normal,Normal (30-60),High,Institute,Diabetes
22080,normal,Singular,Male,Normal,Tachypnea,High,Home,Mitochondrial myopathy
22081,abnormal,Multiple,Male,Tachycardia,Tachypnea,High,Home,Leigh syndrome
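
The difference between displaying a cast and storing it can be sketched on a toy frame. This is only an illustration under the assumption that a single column is converted; `toy` and the other names are hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for train_data
toy = pd.DataFrame({"Gender": ["Male", "Female", "Unknown"]})

# astype returns a converted copy; the original frame is left untouched
converted = toy[["Gender"]].astype("category")
dtype_without_assignment = str(toy["Gender"].dtype)

# Assigning the converted column back is what changes the frame itself
toy["Gender"] = toy["Gender"].astype("category")
dtype_after_assignment = str(toy["Gender"].dtype)

print(dtype_without_assignment, dtype_after_assignment)
```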


Formatting all "-", "Not applicable", "Not available", "None", and "No record" values into "No" to create a uniform response.

In [None]:
train_data = train_data.replace(["-", "Not applicable", "Not available", "None", "No record"], "No")

Converting the binary columns from "Yes" and "No" to binary (1, 0)

In [None]:
binary_var = ["Maternal gene", "Genes in mother's side",
              "Inherited from father", "Paternal gene",
              "Birth asphyxia", 
              "Folic acid details (peri-conceptional)",
              "H/O serious maternal illness", 
              "H/O radiation exposure (x-ray)",
              "H/O substance abuse", "Assisted conception IVF/ART",
              "History of anomalies in previous pregnancies",
              "Autopsy shows birth defect (if applicable)",
              ]

train_data[binary_var] = train_data[binary_var].replace({"Yes": 1, "No": 0})

Replacing 'NA' values in all binary variables with 0 since we are assuming they didn't have the gene, symptom, test, etc.

Transforming binary variables to type "int64".

In [None]:
binary = ["Maternal gene", "Genes in mother's side",
          "Inherited from father", "Paternal gene",
          "Birth asphyxia", 
          "Folic acid details (peri-conceptional)",
          "H/O serious maternal illness", 
          "H/O radiation exposure (x-ray)",
          "H/O substance abuse", "Assisted conception IVF/ART",
          "History of anomalies in previous pregnancies",
          "Autopsy shows birth defect (if applicable)",
          "Test 1", "Test 2", "Test 3", "Test 4", "Test 5",
          "Symptom 1", "Symptom 2", "Symptom 3", "Symptom 4", "Symptom 5"
          ]

# Replacing NAs with 0          
train_data[binary] = train_data[binary].fillna(0)

# Transforming into type int
train_data[binary] = train_data[binary].astype("int64")

**Exploring the Data**

Summary of statistics from all the numeric variables in the dataframe.

In [None]:
train_data.describe()

Unnamed: 0,Patient Age,Genes in mother's side,Inherited from father,Maternal gene,Paternal gene,Blood cell count (mcL),Mother's age,Father's age,Test 1,Test 2,...,H/O substance abuse,Assisted conception IVF/ART,History of anomalies in previous pregnancies,No. of previous abortion,White Blood cell count (thousand per microliter),Symptom 1,Symptom 2,Symptom 3,Symptom 4,Symptom 5
count,19915.0,19915.0,19915.0,19915.0,19915.0,19915.0,19915.0,19915.0,19915.0,19915.0,...,19915.0,19915.0,19915.0,19915.0,19915.0,19915.0,19915.0,19915.0,19915.0,19915.0
mean,6.952599,0.595933,0.391162,0.484158,0.433191,4.898917,34.663219,41.972282,0.0,0.0,...,0.227115,0.455385,0.459854,2.002712,7.481795,0.538288,0.497615,0.48948,0.453628,0.41893
std,4.178497,0.490723,0.488023,0.499762,0.495529,0.199735,8.461725,11.25931,0.0,0.0,...,0.418978,0.498018,0.498398,1.343635,2.526877,0.498544,0.500007,0.499902,0.497857,0.493396
min,0.0,0.0,0.0,0.0,0.0,4.092727,18.0,20.0,0.0,0.0,...,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0
25%,3.0,0.0,0.0,0.0,0.0,4.763367,29.0,34.0,0.0,0.0,...,0.0,0.0,0.0,1.0,5.640557,0.0,0.0,0.0,0.0,0.0
50%,7.0,1.0,0.0,0.0,0.0,4.899456,35.0,42.0,0.0,0.0,...,0.0,0.0,0.0,2.0,7.47806,1.0,0.0,0.0,0.0,0.0
75%,10.0,1.0,1.0,1.0,1.0,5.033677,40.0,49.0,0.0,0.0,...,0.0,1.0,1.0,3.0,9.287531,1.0,1.0,1.0,1.0,1.0
max,14.0,1.0,1.0,1.0,1.0,5.609829,51.0,64.0,0.0,0.0,...,1.0,1.0,1.0,4.0,12.0,1.0,1.0,1.0,1.0,1.0


Counting the different genetic disorders

In [None]:
train_data["Disorder Subclass"].value_counts()

Leigh syndrome                         5160
Mitochondrial myopathy                 4405
Cystic fibrosis                        3448
Tay-Sachs                              2833
Diabetes                               1817
Hemochromatosis                        1355
Leber's hereditary optic neuropathy     648
Alzheimer's                             152
Cancer                                   97
Name: Disorder Subclass, dtype: int64

We can see that Leigh syndrome has the most instances with 5160, while Cancer has the fewest with only 97.

Creating new columns to represent if the patient had a certain genetic disorder.

In [None]:
# New flag column name -> the 'Disorder Subclass' value it represents
disorder_flags = {
    'leigh_syndrome': "Leigh syndrome",
    'mitochondrial_myopathy': "Mitochondrial myopathy",
    'cystic_fibrosis': "Cystic fibrosis",
    'tay_sachs': "Tay-Sachs",
    'diabetes': "Diabetes",
    'hemochromatosis': "Hemochromatosis",
    'lebers': "Leber's hereditary optic neuropathy",
    'alzheimers': "Alzheimer's",
    'cancer': "Cancer",
}

# Each flag is 1 if the patient has that disorder and 0 otherwise
for flag, disorder in disorder_flags.items():
    train_data[flag] = np.where(train_data['Disorder Subclass'] == disorder, 1, 0)


Saving edited dataframe to a CSV file

In [None]:
train_data.to_csv('train_data.csv')

**Creating a Recommendation System using Collaborative Filtering with Binary Data**

Steps from: https://medium.com/radon-dev/item-item-collaborative-filtering-with-binary-or-unary-data-e8f0b465b2c3

We will use an item-item collaborative filtering recommendation system rather than a user-item one, since our focus is on predicting the diseases from the symptoms and variables that are most similar to them.

Creating a new data frame with just binary variables.

In [None]:
df = train_data[["Maternal gene", "Genes in mother's side",
          "Inherited from father", "Paternal gene",
          "Birth asphyxia", 
          "Folic acid details (peri-conceptional)",
          "H/O serious maternal illness", 
          "H/O radiation exposure (x-ray)",
          "H/O substance abuse", "Assisted conception IVF/ART",
          "History of anomalies in previous pregnancies",
          "Autopsy shows birth defect (if applicable)",
          "Test 1", "Test 2", "Test 3", "Test 4", "Test 5",
          "Symptom 1", "Symptom 2", "Symptom 3", "Symptom 4", "Symptom 5",
          "leigh_syndrome", "mitochondrial_myopathy", "cystic_fibrosis", "tay_sachs",
          "diabetes", "hemochromatosis", "lebers", "alzheimers", "cancer"]].copy()


Importing cosine_similarity and sparse libraries

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse

**Item-Item Calculations**

Normalizing the user vectors to unit vectors

In [None]:
# magnitude = sqrt(x^2 + y^2 + z^2 + ...)
magnitude = np.sqrt(np.square(df).sum(axis=1))

# unit vector = (x/ magnitude, y/magnitude, z/magnitude, ...)
df = df.divide(magnitude, axis='index')

Function to calculate the column-wise cosine similarity for a sparse matrix and return a new dataframe with the similarities.


In [None]:
def calculate_sim(df):
  df_sparse = sparse.csr_matrix(df) # Dealing with the sparse data, CSR is to compress the sparse rows.
  similarities = cosine_similarity(df_sparse.transpose())
  sim = pd.DataFrame(data=similarities, index = df.columns, columns= df.columns)
  return sim
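
To make the two steps concrete (row normalization, then column-wise cosine similarity), here is a minimal pure-Python sketch on a made-up binary matrix. It does not use scikit-learn or scipy, and the column names and values are hypothetical; it is meant only to show what the calculation does, not to replace `calculate_sim`.

```python
from math import sqrt

# Hypothetical binary data: rows are patients, columns are items
columns = ["symptom_a", "symptom_b", "disorder_x"]
rows = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
    [1, 0, 0],
]

def unit_rows(matrix):
    """Scale each row to length 1, leaving all-zero rows unchanged."""
    out = []
    for row in matrix:
        norm = sqrt(sum(v * v for v in row))
        out.append([v / norm for v in row] if norm else list(row))
    return out

def column_cosine(matrix):
    """Return the cosine similarity between every pair of columns."""
    cols = list(zip(*matrix))

    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sqrt(sum(x * x for x in a))
        nb = sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    return [[cos(a, b) for b in cols] for a in cols]

sim = column_cosine(unit_rows(rows))
for name, row in zip(columns, sim):
    print(name, [round(v, 3) for v in row])
```

In this toy matrix, `symptom_a` co-occurs with `disorder_x` more often than `symptom_b` does, so its similarity to `disorder_x` comes out higher.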

Creating a similarity matrix

In [None]:
df_matrix = calculate_sim(df)

df_matrix


Unnamed: 0,Maternal gene,Genes in mother's side,Inherited from father,Paternal gene,Birth asphyxia,Folic acid details (peri-conceptional),H/O serious maternal illness,H/O radiation exposure (x-ray),H/O substance abuse,Assisted conception IVF/ART,...,Symptom 5,leigh_syndrome,mitochondrial_myopathy,cystic_fibrosis,tay_sachs,diabetes,hemochromatosis,lebers,alzheimers,cancer
Maternal gene,1.0,0.544954,0.408035,0.431565,0.313005,0.443936,0.441118,0.306848,0.302404,0.433301,...,0.427997,0.351247,0.294685,0.312431,0.224881,0.235604,0.140833,0.147631,0.073429,0.021371
Genes in mother's side,0.544954,1.0,0.455306,0.468736,0.349362,0.495236,0.486044,0.339181,0.347767,0.492372,...,0.471463,0.385431,0.336859,0.347553,0.254621,0.26619,0.156987,0.161596,0.085705,0.033303
Inherited from father,0.408035,0.455306,1.0,0.439668,0.278515,0.393258,0.38896,0.274305,0.279321,0.38682,...,0.384088,0.317376,0.259842,0.291578,0.180661,0.233789,0.11373,0.141806,0.079112,0.012929
Paternal gene,0.431565,0.468736,0.439668,1.0,0.290778,0.419432,0.411178,0.286643,0.289867,0.410081,...,0.397869,0.323275,0.276435,0.302784,0.215191,0.225567,0.132811,0.150531,0.080418,0.019098
Birth asphyxia,0.313005,0.349362,0.278515,0.290778,1.0,0.306895,0.307389,0.214844,0.207855,0.301196,...,0.285232,0.223461,0.229873,0.1758,0.187163,0.129177,0.130664,0.073345,0.034934,0.043379
Folic acid details (peri-conceptional),0.443936,0.495236,0.393258,0.419432,0.306895,1.0,0.42422,0.296805,0.292076,0.419156,...,0.418309,0.32807,0.316817,0.255116,0.263376,0.180631,0.184548,0.111223,0.043378,0.050234
H/O serious maternal illness,0.441118,0.486044,0.38896,0.411178,0.307389,0.42422,1.0,0.303305,0.294501,0.422245,...,0.407473,0.328906,0.312678,0.25362,0.261685,0.177741,0.179651,0.102999,0.045901,0.049024
H/O radiation exposure (x-ray),0.306848,0.339181,0.274305,0.286643,0.214844,0.296805,0.303305,1.0,0.208747,0.299256,...,0.282393,0.226083,0.221035,0.168557,0.179374,0.127565,0.142589,0.071947,0.036027,0.043806
H/O substance abuse,0.302404,0.347767,0.279321,0.289867,0.207855,0.292076,0.294501,0.208747,1.0,0.294138,...,0.2769,0.226747,0.220406,0.185479,0.180445,0.124145,0.121758,0.072746,0.03245,0.05155
Assisted conception IVF/ART,0.433301,0.492372,0.38682,0.410081,0.301196,0.419156,0.422245,0.299256,0.294138,1.0,...,0.407796,0.333825,0.305088,0.26,0.264062,0.180004,0.184406,0.100028,0.051,0.050598


We can now look at the top 5 variables most similar or related to each disease.

We request 6 values with nlargest since the most similar entry will always be the disease itself, with a similarity of 1.0.
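
An alternative that avoids the extra row is to drop the item itself before ranking. The helper below is a hypothetical sketch (the name `top_similar` and the toy similarity values are assumptions, not part of the notebook) using pandas `Series.drop` and `Series.nlargest`.

```python
import pandas as pd

def top_similar(sim_matrix, item, n=5):
    """Return the n entries most similar to `item`, excluding `item` itself."""
    return sim_matrix.loc[item].drop(item).nlargest(n)

# Hypothetical 3x3 similarity matrix for illustration only
names = ["disorder_x", "test_a", "symptom_b"]
toy_sim = pd.DataFrame(
    [[1.0, 0.4, 0.7],
     [0.4, 1.0, 0.2],
     [0.7, 0.2, 1.0]],
    index=names,
    columns=names,
)

result = top_similar(toy_sim, "disorder_x", n=2)
print(result)
```

With the notebook's matrix, the equivalent call would be `top_similar(df_matrix, 'leigh_syndrome')`.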

*Leigh Syndrome's Top 5 Most Similar Variables*

In [None]:
df_matrix.loc['leigh_syndrome'].nlargest(6)

leigh_syndrome            1.000000
Test 4                    0.475404
Genes in mother's side    0.385431
Symptom 1                 0.361922
Symptom 2                 0.360921
Symptom 3                 0.360833
Name: leigh_syndrome, dtype: float64

We can see that Test 4, Genes in mother's side, Symptom 1, Symptom 2, and Symptom 3 are most related to Leigh syndrome. A patient with these results could therefore possibly be diagnosed with Leigh syndrome.

Some forms of Leigh syndrome are caused by changes in mitochondrial DNA, which is inherited through the mother, so this may partly explain the importance of "genes in mother's side" as an indicator (https://medlineplus.gov/genetics/condition/leigh-syndrome/#inheritance).

*Mitochondrial Myopathy's Top 5 Most Similar Variables*

In [None]:
df_matrix.loc['mitochondrial_myopathy'].nlargest(6)

mitochondrial_myopathy                          1.000000
Test 4                                          0.453863
Genes in mother's side                          0.336859
Symptom 1                                       0.320359
Folic acid details (peri-conceptional)          0.316817
History of anomalies in previous pregnancies    0.315198
Name: mitochondrial_myopathy, dtype: float64

Similar to Leigh syndrome, Mitochondrial myopathy is most similar to Test 4 and Genes in mother's side. This makes sense since both are mitochondrial diseases (https://www.chop.edu/conditions-diseases/mitochondrial-myopathy#:~:text=Mitochondrial%20myopathies%20may%20be%20caused,found%20in%20cells'%20mitochondria).

Mitochondrial DNA is passed down through the mother, which may help explain why "Genes in mother's side" is so prevalent in both diseases.


*Cystic Fibrosis' Top 5 Most Similar Variables*

In [None]:
df_matrix.loc['cystic_fibrosis'].nlargest(6)

cystic_fibrosis    1.000000
Symptom 5          0.400131
Symptom 4          0.381251
Test 4             0.369959
Symptom 3          0.365101
Symptom 2          0.351498
Name: cystic_fibrosis, dtype: float64

With Cystic Fibrosis, we see that having symptoms 2, 3, 4, and 5 and testing positive for Test 4 could help lead to a diagnosis. Cystic Fibrosis impacts the lungs, pancreas, and other organs, so there are very specific symptoms, which could explain the high similarity values (https://www.cff.org/intro-cf/about-cystic-fibrosis).

*Tay-Sachs' Top 5 Most Similar Variables*

In [None]:
df_matrix.loc['tay_sachs'].nlargest(6)

tay_sachs                                       1.000000
Test 4                                          0.381479
Assisted conception IVF/ART                     0.264062
Folic acid details (peri-conceptional)          0.263376
H/O serious maternal illness                    0.261685
History of anomalies in previous pregnancies    0.261094
Name: tay_sachs, dtype: float64

We see that a positive result on Test 4 could be a good indicator of whether a patient has Tay-Sachs.

It is also interesting to see that it seems to relate more to events that happened while the patient was in the womb or being conceived. 

Tay-Sachs must be passed down through both parents, and can have a delayed onset of symptoms which may be why the test result and events of the pregnancy are more prevalent. (https://www.betterhealth.vic.gov.au/health/conditionsandtreatments/tay-sachs-disease)

*Diabetes' Top 5 Most Similar Variables*

In [None]:
df_matrix.loc['diabetes'].nlargest(6)

diabetes                  1.000000
Symptom 5                 0.326222
Symptom 4                 0.300202
Symptom 3                 0.297604
Symptom 2                 0.273577
Genes in mother's side    0.266190
Name: diabetes, dtype: float64

Symptoms 2, 3, 4, and 5 are important indicators when it comes to diagnosing diabetes.

There isn't an exact known cause of childhood diabetes, but genetics and family history do play a role in a child's chances of developing diabetes, which could explain "Genes in mother's side". It is therefore important for patients and their parents to be aware of the symptoms of diabetes (https://www.mayoclinic.org/diseases-conditions/type-1-diabetes-in-children/symptoms-causes/syc-20355306).

*Hemochromatosis' Top 5 Most Similar Variables*

In [None]:
df_matrix.loc['hemochromatosis'].nlargest(6)

hemochromatosis                                 1.000000
Test 4                                          0.277900
History of anomalies in previous pregnancies    0.185508
Folic acid details (peri-conceptional)          0.184548
Assisted conception IVF/ART                     0.184406
H/O serious maternal illness                    0.179651
Name: hemochromatosis, dtype: float64

Test 4 seems to be the strongest indicator of Hemochromatosis. Usually, hemochromatosis doesn't develop until later in life and some people may never have symptoms of it at all, which could explain the small number of instances in the dataset (https://www.mayoclinic.org/diseases-conditions/hemochromatosis/symptoms-causes/syc-20351443). We also see more similarities to variables regarding previous pregnancies or events that could happen while the patient was in the womb.

*Leber's Top 5 Most Similar Variables*

In [None]:
df_matrix.loc['lebers'].nlargest(6)

lebers       1.000000
Symptom 5    0.212385
Symptom 4    0.200106
Symptom 3    0.188436
Symptom 2    0.178209
Symptom 1    0.166691
Name: lebers, dtype: float64

As we head into the diseases with the fewest instances, we can see that the similarity values drop significantly compared to the diseases with more instances.

In Leber's, symptoms 1, 2, 3, 4, and 5 have the most similarity, which is similar to diabetes. 

Leber's hereditary optic neuropathy is also a mitochondrial disease so it must be inherited through the mother. Usually onset of the disease is not until 15 years old which could explain the small number of instances in the dataset. (https://my.clevelandclinic.org/health/diseases/15620-leber-hereditary-optic-neuropathy-sudden-vision-loss)

*Alzheimer's Top 5 Most Similar Variables*

In [None]:
df_matrix.loc['alzheimers'].nlargest(6)

alzheimers                1.000000
Symptom 5                 0.107727
Symptom 4                 0.102340
Symptom 3                 0.100828
Symptom 2                 0.090605
Genes in mother's side    0.085705
Name: alzheimers, dtype: float64

Alzheimer's is most similar to symptoms 2, 3, 4, and 5, like Leber's and diabetes. There is also similarity with genes in the mother's side. The similarity values have become much smaller, since the number of instances has greatly decreased.

Alzheimer's usually occurs later in life and is thought to be caused by proteins building up in the brain. This could explain the small number of instances in the dataset. (https://www.nhs.uk/conditions/alzheimers-disease/causes/#:~:text=Alzheimer's%20disease%20is%20thought%20to,form%20tangles%20within%20brain%20cells.)

*Cancer's Top 5 Most Similar Variables*

In [None]:
df_matrix.loc['cancer'].nlargest(6)

cancer                                          1.000000
Test 4                                          0.082640
H/O substance abuse                             0.051550
History of anomalies in previous pregnancies    0.051171
Assisted conception IVF/ART                     0.050598
Folic acid details (peri-conceptional)          0.050234
Name: cancer, dtype: float64

Lastly, we look at cancer, which has the fewest instances in the entire dataset with only 97, so its similarity values are the smallest.

We see that Test 4, history of substance abuse in the mother, history of anomalies in previous pregnancies, assisted conception, and folic acid details could play a part in a diagnosis of cancer. The dataset doesn't specify which type of cancer, but in general cancer in childhood is rare, although it is the leading cause of death by disease past infancy in the United States (https://www.cancer.gov/types/childhood-cancers/child-adolescent-cancers-fact-sheet#how-common-is-cancer-in-children-and-adolescents).

**Decision Tree Classification**

Implementing a decision tree using PySpark methods.

Steps from: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.DecisionTreeClassifier.html

Loading libraries

In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from sklearn.metrics import confusion_matrix

Importing edited CSV into Spark.

In [None]:
data = spark.read.csv("train_data.csv", inferSchema=True, header=True)
# data.show()

Counting the Genetic Disorders.

In [None]:
data.groupBy('Disorder Subclass').count().show()

+--------------------+-----+
|   Disorder Subclass|count|
+--------------------+-----+
|      Leigh syndrome| 5160|
|            Diabetes| 1817|
|              Cancer|   97|
|Leber's hereditar...|  648|
|     Hemochromatosis| 1355|
|         Alzheimer's|  152|
|Mitochondrial myo...| 4405|
|           Tay-Sachs| 2833|
|     Cystic fibrosis| 3448|
+--------------------+-----+



Showing the columns of the dataset.

In [None]:
data.columns

['_c0',
 'Patient Age',
 "Genes in mother's side",
 'Inherited from father',
 'Maternal gene',
 'Paternal gene',
 'Blood cell count (mcL)',
 "Mother's age",
 "Father's age",
 'Respiratory Rate (breaths/min)',
 'Heart Rate (rates/min',
 'Test 1',
 'Test 2',
 'Test 3',
 'Test 4',
 'Test 5',
 'Follow-up',
 'Gender',
 'Birth asphyxia',
 'Autopsy shows birth defect (if applicable)',
 'Place of birth',
 'Folic acid details (peri-conceptional)',
 'H/O serious maternal illness',
 'H/O radiation exposure (x-ray)',
 'H/O substance abuse',
 'Assisted conception IVF/ART',
 'History of anomalies in previous pregnancies',
 'No. of previous abortion',
 'Birth defects',
 'White Blood cell count (thousand per microliter)',
 'Blood test result',
 'Symptom 1',
 'Symptom 2',
 'Symptom 3',
 'Symptom 4',
 'Symptom 5',
 'Disorder Subclass',
 'leigh_syndrome',
 'mitochondrial_myopathy',
 'cystic_fibrosis',
 'tay_sachs',
 'diabetes',
 'hemochromatosis',
 'lebers',
 'alzheimers',
 'cancer']

Creating an assembler using VectorAssembler, which combines the independent variables into a single new vector column named "features".

In [None]:
assembler = VectorAssembler(inputCols= ['Patient Age',
 "Genes in mother's side",
 'Inherited from father',
 'Maternal gene',
 'Paternal gene',
 'Blood cell count (mcL)',
 "Mother's age",
 "Father's age",
 'Test 1',
 'Test 2',
 'Test 3',
 'Test 4',
 'Test 5',
 'Birth asphyxia',
 'Autopsy shows birth defect (if applicable)',
 'Folic acid details (peri-conceptional)',
 'H/O serious maternal illness',
 'H/O radiation exposure (x-ray)',
 'H/O substance abuse',
 'Assisted conception IVF/ART',
 'History of anomalies in previous pregnancies',
 'White Blood cell count (thousand per microliter)',
 'Symptom 1',
 'Symptom 2',
 'Symptom 3',
 'Symptom 4',
 'Symptom 5'], outputCol= "features")

assembler

VectorAssembler_caffce28aa47

Applying the vector assembler to the data.

In [None]:
output = assembler.transform(data)
output.show()

+---+-----------+----------------------+---------------------+-------------+-------------+----------------------+------------+------------+------------------------------+---------------------+------+------+------+------+------+---------+---------+--------------+------------------------------------------+--------------+--------------------------------------+----------------------------+------------------------------+-------------------+---------------------------+--------------------------------------------+------------------------+-------------+------------------------------------------------+-----------------+---------+---------+---------+---------+---------+--------------------+--------------+----------------------+---------------+---------+--------+---------------+------+----------+------+--------------------+
|_c0|Patient Age|Genes in mother's side|Inherited from father|Maternal gene|Paternal gene|Blood cell count (mcL)|Mother's age|Father's age|Respiratory Rate (breaths/min)|Heart

**Leigh Syndrome Decision Tree**

We will focus on creating a decision tree for Leigh Syndrome.

In [None]:
output.select("features", "leigh_syndrome").show()

+--------------------+--------------+
|            features|leigh_syndrome|
+--------------------+--------------+
|(27,[0,1,3,5,6,7,...|             0|
|(27,[0,1,2,5,6,7,...|             0|
|(27,[0,1,5,6,7,11...|             0|
|(27,[0,1,3,5,6,7,...|             1|
|(27,[0,1,4,5,6,7,...|             0|
|(27,[0,1,3,5,6,7,...|             0|
|(27,[0,1,3,4,5,6,...|             0|
|(27,[0,3,4,5,6,7,...|             0|
|(27,[0,3,5,6,7,11...|             1|
|(27,[0,2,3,4,5,6,...|             0|
|(27,[0,1,5,6,7,16...|             0|
|(27,[0,4,5,6,7,11...|             0|
|(27,[0,1,2,5,6,7,...|             1|
|(27,[1,4,5,6,7,11...|             1|
|(27,[0,1,3,5,6,7,...|             1|
|(27,[5,6,7,15,20,...|             0|
|(27,[1,2,5,6,7,11...|             0|
|(27,[0,5,6,7,11,1...|             0|
|(27,[0,1,2,3,5,6,...|             0|
|(27,[0,2,3,4,5,6,...|             0|
+--------------------+--------------+
only showing top 20 rows



We will create the model dataframe from two columns: "features", which holds all the independent variables, and "leigh_syndrome", which is the label.

In [None]:
model_df = output.select("features", "leigh_syndrome")

In [None]:
# model_df.show()

Splitting edited data into training and testing data.

70% of the original data goes into the training dataframe and 30% into the testing dataframe. We will also set the seed to 1000 to have consistent results.

In [None]:
training_df, test_df = model_df.randomSplit([0.7, 0.3], 1000)

Count of training data.

In [None]:
training_df.count()

13973

Count of test data. 

In [None]:
test_df.count()

5942

We will build the Decision Tree Classifier using the previously imported libraries and the training dataframe.

In [None]:
df_classifier = DecisionTreeClassifier(labelCol = "leigh_syndrome").fit(training_df)

We will also create predictions using the test dataframe we created.

In [None]:
df_predictions = df_classifier.transform(test_df)

Showing the predictions within the dataframe.

In [None]:
df_predictions.show()

+--------------------+--------------+---------------+--------------------+----------+
|            features|leigh_syndrome|  rawPrediction|         probability|prediction|
+--------------------+--------------+---------------+--------------------+----------+
|(27,[0,1,2,3,4,5,...|             1|[9366.0,3249.0]|[0.74244946492271...|       0.0|
|(27,[0,1,2,3,4,5,...|             0|[9366.0,3249.0]|[0.74244946492271...|       0.0|
|(27,[0,1,2,3,4,5,...|             0|  [549.0,195.0]|[0.73790322580645...|       0.0|
|(27,[0,1,2,3,4,5,...|             0|[9366.0,3249.0]|[0.74244946492271...|       0.0|
|(27,[0,1,2,3,4,5,...|             0|[9366.0,3249.0]|[0.74244946492271...|       0.0|
|(27,[0,1,2,3,4,5,...|             0|[9366.0,3249.0]|[0.74244946492271...|       0.0|
|(27,[0,1,2,3,4,5,...|             0|[9366.0,3249.0]|[0.74244946492271...|       0.0|
|(27,[0,1,2,3,4,5,...|             0|[9366.0,3249.0]|[0.74244946492271...|       0.0|
|(27,[0,1,2,3,4,5,...|             0|[9366.0,3249.0]|[

*Evaluating the model*

First, we look at the accuracy of the model using libraries previously imported.

In [None]:
df_accuracy = MulticlassClassificationEvaluator(labelCol="leigh_syndrome",
                                                metricName= "accuracy").evaluate(df_predictions)
df_accuracy                      

0.7440255806125884

The accuracy of the model is the number of correct predictions over the total number of predictions: (TP + TN) / (TP + TN + FP + FN).

This model is about 74% accurate for predicting Leigh syndrome.
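As a quick illustration of the formula, accuracy can be computed directly from the four confusion-matrix counts. The counts below are made up for the example and are not taken from this dataset.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts, chosen only to illustrate the arithmetic.
acc = accuracy(tp=50, tn=700, fp=100, fn=150)
print(acc)
```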

Now, the precision of the model:

In [None]:
df_precision = MulticlassClassificationEvaluator(labelCol = "leigh_syndrome",
                                                 metricName = "weightedPrecision").evaluate(df_predictions)

df_precision

0.6232965419750084

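Precision for a single class is TP / (TP + FP): of the patients predicted to have Leigh syndrome, the share who actually have it. The "weightedPrecision" metric computes this for each class (0 and 1) and averages the two results, weighting each class by its share of the true labels, so it is not the precision of the positive class alone.

The weighted precision of the model is about 62%, which is not very high.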
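A small sketch of how a class-weighted precision can be computed from confusion-matrix counts for a binary label. The counts are hypothetical, and the function illustrates the definition rather than reproducing Spark's implementation.

```python
def weighted_precision(tp, tn, fp, fn):
    """Average the per-class precisions, weighting each class by its share of true labels."""
    total = tp + tn + fp + fn
    # Precision of class 1: of the rows predicted 1, how many are truly 1.
    prec_pos = tp / (tp + fp) if (tp + fp) else 0.0
    # Precision of class 0: of the rows predicted 0, how many are truly 0.
    prec_neg = tn / (tn + fn) if (tn + fn) else 0.0
    weight_pos = (tp + fn) / total   # share of rows whose true label is 1
    weight_neg = (tn + fp) / total   # share of rows whose true label is 0
    return weight_pos * prec_pos + weight_neg * prec_neg

# Hypothetical counts, not taken from this dataset.
wp = weighted_precision(tp=50, tn=700, fp=100, fn=150)
print(round(wp, 4))
```

Because the negative class carries most of the weight when the disorder is rare, this metric can stay fairly high even when few positive cases are identified.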

We will look at the important variables that go into this prediction.

In [None]:
df_classifier.featureImportances

SparseVector(27, {0: 0.2124, 2: 0.167, 5: 0.1897, 6: 0.2386, 16: 0.1924})

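The output shows non-zero importance for only five of the 27 features, at indices 0, 2, 5, 6, and 16 of the feature vector, each contributing between about 0.17 and 0.24. Which columns these are depends on the order in which the columns were assembled into "features". The features we identified as important are the user ID, patient age, paternal gene, blood cell count, mother's age, test 4, follow up, Folic acid details (peri-conceptional), H/O substance abuse, and Assisted conception IVF/ART; since that is ten names for five indices, the list should be checked against the column order.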
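To read the featureImportances output, each index in the SparseVector has to be mapped back to the column at that position in the feature vector. A sketch of that lookup follows, using placeholder column names: `feature_names` is hypothetical and would need to be replaced with the real input columns, in the order used to build "features".

```python
# Non-zero importances copied from the SparseVector output above.
importances = {0: 0.2124, 2: 0.167, 5: 0.1897, 6: 0.2386, 16: 0.1924}

# Placeholder names: replace with the real input columns, in the same order.
feature_names = [f"feature_{i}" for i in range(27)]

# Sort features from most to least important.
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
for index, score in ranked:
    print(f"{feature_names[index]:<12} {score:.4f}")
```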

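Lastly, we will look at one more summary score for our model.

The area under the ROC curve (AUC) measures performance across all possible classification thresholds. Note, however, that the cell below does not compute it: MulticlassClassificationEvaluator has no AUC metric, and when no metricName is given it uses its default, "f1" (weighted F1).

In [None]:
# No metricName is set, so this evaluates the default "f1" (weighted F1), not AUC.
df_auc = MulticlassClassificationEvaluator(labelCol = "leigh_syndrome").evaluate(df_predictions)
df_auc

0.6370771199003631

The weighted F1 of this model is only about 64%, which is also not very high. Using different features or a larger number of instances may improve this model.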
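For reference, AUC can be read as the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative one. The pure-Python sketch below implements that definition on made-up scores. In PySpark the corresponding metric is exposed by BinaryClassificationEvaluator with metricName="areaUnderROC", which is not what the cell above calls.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities of class 1.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.2, 0.3]
print(auc(labels, scores))

# A model that gives every row the same score cannot rank anything.
print(auc(labels, [0.25] * 6))
```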

Following the same steps used for the Leigh syndrome decision tree, we will build a tree for each of the other genetic disorder subclasses.

**Mitochondrial Myopathy Decision Tree**

In [None]:
model_df1 = output.select("features", "mitochondrial_myopathy")

In [None]:
training_df1, test_df1 = model_df1.randomSplit([0.7, 0.3], 1000)

In [None]:
df_classifier1 = DecisionTreeClassifier(labelCol = "mitochondrial_myopathy").fit(training_df1)

In [None]:
df_predictions1 = df_classifier1.transform(test_df1)

In [None]:
df_accuracy1 = MulticlassClassificationEvaluator(labelCol="mitochondrial_myopathy",
                                                metricName= "accuracy").evaluate(df_predictions1)

df_accuracy1

0.776001346348031

This model is about 78% accurate at predicting Mitochondrial Myopathy.

In [None]:
df_precision1 = MulticlassClassificationEvaluator(labelCol = "mitochondrial_myopathy",
                                                 metricName = "weightedPrecision").evaluate(df_predictions1)

df_precision1

0.6021780895339567

The weighted precision of this model is about 60%. 

In [None]:
df_classifier1.featureImportances

SparseVector(27, {})

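No feature has non-zero importance in this model. An empty importance vector indicates that the tree made no splits, so it most likely predicts the majority class (no disorder) for every patient. This could change with more instances of the disease in the dataset.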
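One way to check whether a tree is predicting a single class for every row: if it always predicts the majority class, and that class makes up a share p of the test labels, then accuracy is p, weighted precision is p², and weighted F1 is 2p²/(1 + p). The sketch below (the function name is illustrative) applies these formulas to the accuracy reported above for this model.

```python
def majority_baseline_metrics(p):
    """Metrics for a model that always predicts the majority class (share p of labels)."""
    accuracy = p
    # Majority class has precision p and weight p; the other class contributes 0.
    weighted_precision = p * p
    # Majority class has precision p and recall 1, so its F1 is 2p / (1 + p).
    weighted_f1 = p * (2 * p / (1 + p))
    return accuracy, weighted_precision, weighted_f1

# Accuracy reported above for the Mitochondrial Myopathy tree.
acc, wp, f1 = majority_baseline_metrics(0.776001346348031)
print(round(wp, 4), round(f1, 4))
```

The results agree with the weighted precision (0.6022) and the evaluator's default score (0.6781) reported for this model, which is consistent with a tree that predicts "no disorder" for every patient.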

In [None]:
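# No metricName is set, so this evaluates the default "f1" (weighted F1), not AUC.
df_auc1 = MulticlassClassificationEvaluator(labelCol = "mitochondrial_myopathy").evaluate(df_predictions1)
df_auc1

0.6781279651304407

Despite the variable name, this value is the weighted F1 score (the evaluator's default metric), not AUC. It is about 68% for this model.

This model needs improvement, either through a larger number of positive instances or by focusing on different features.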

**Cystic Fibrosis Decision Tree**

In [None]:
model_df2 = output.select("features", "cystic_fibrosis")

In [None]:
training_df2, test_df2 = model_df2.randomSplit([0.7, 0.3], 1000)

In [None]:
df_classifier2 = DecisionTreeClassifier(labelCol = "cystic_fibrosis").fit(training_df2)

In [None]:
df_predictions2 = df_classifier2.transform(test_df2)

In [None]:
df_accuracy2 = MulticlassClassificationEvaluator(labelCol="cystic_fibrosis",
                                                metricName= "accuracy").evaluate(df_predictions2)

df_accuracy2

0.8197576573544261

This model is about 82% accurate at predicting patients with Cystic Fibrosis.

In [None]:
df_precision2 = MulticlassClassificationEvaluator(labelCol = "cystic_fibrosis",
                                                 metricName = "weightedPrecision").evaluate(df_predictions2)

df_precision2

0.6720026167912166

The weighted precision of this model is about 67%. 

In [None]:
df_classifier2.featureImportances

SparseVector(27, {})

No feature has non-zero importance in this model, which indicates that the tree made no splits. This could change with more instances of the disease in the dataset.

In [None]:
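# No metricName is set, so this evaluates the default "f1" (weighted F1), not AUC.
df_auc2 = MulticlassClassificationEvaluator(labelCol = "cystic_fibrosis").evaluate(df_predictions2)
df_auc2

0.7385627576016665

Despite the variable name, this value is the weighted F1 score (the evaluator's default metric), not AUC. It is about 74% for this model.

The headline numbers look acceptable, but the model could be improved with a larger sample size or additional features.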

**Tay-Sachs Decision Tree**

In [None]:
model_df3 = output.select("features", "tay_sachs")

In [None]:
training_df3, test_df3 = model_df3.randomSplit([0.7, 0.3], 1000)

In [None]:
df_classifier3 = DecisionTreeClassifier(labelCol = "tay_sachs").fit(training_df3)

In [None]:
df_predictions3 = df_classifier3.transform(test_df3)

In [None]:
df_accuracy3 = MulticlassClassificationEvaluator(labelCol="tay_sachs",
                                                metricName= "accuracy").evaluate(df_predictions3)

df_accuracy3

0.8608212722988893

The accuracy of this model is 86% for predicting Tay-Sachs.

In [None]:
df_precision3 = MulticlassClassificationEvaluator(labelCol = "tay_sachs",
                                                 metricName = "weightedPrecision").evaluate(df_predictions3)

df_precision3

0.7410132628422785

The weighted precision of this model is 74%.

In [None]:
df_classifier3.featureImportances

SparseVector(27, {})

No feature has non-zero importance in this model, which indicates that the tree made no splits.

In [None]:
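# No metricName is set, so this evaluates the default "f1" (weighted F1), not AUC.
df_auc3 = MulticlassClassificationEvaluator(labelCol = "tay_sachs").evaluate(df_predictions3)
df_auc3

0.7964367925854786

Despite the variable name, this value is the weighted F1 score, about 80%. The score looks good, but because the tree made no splits (see the empty feature importances above), it mainly reflects how rare Tay-Sachs is in the data rather than an ability to identify affected patients.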

**Diabetes Decision Tree**

In [None]:
model_df4 = output.select("features", "diabetes")

In [None]:
training_df4, test_df4 = model_df4.randomSplit([0.7, 0.3], 1000)

In [None]:
df_classifier4 = DecisionTreeClassifier(labelCol = "diabetes").fit(training_df4)

In [None]:
df_predictions4 = df_classifier4.transform(test_df4)

In [None]:
df_accuracy4 = MulticlassClassificationEvaluator(labelCol="diabetes",
                                                metricName= "accuracy").evaluate(df_predictions4)

df_accuracy4

0.9097946819252777

The accuracy of this model is about 91%, which is very high.

In [None]:
df_precision4 = MulticlassClassificationEvaluator(labelCol = "diabetes",
                                                 metricName = "weightedPrecision").evaluate(df_predictions4)

df_precision4

0.8277263632595172

The weighted precision is about 83%. 

In [None]:
df_classifier4.featureImportances

SparseVector(27, {})

No feature has non-zero importance in this model, which indicates that the tree made no splits.

In [None]:
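# No metricName is set, so this evaluates the default "f1" (weighted F1), not AUC.
df_auc4 = MulticlassClassificationEvaluator(labelCol = "diabetes").evaluate(df_predictions4)
df_auc4

0.86682235644837

Despite the variable name, this value is the weighted F1 score, about 87%. It is high largely because diabetes is rare in the data.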

**Hemochromatosis Decision Tree**

In [None]:
model_df5 = output.select("features", "hemochromatosis")

In [None]:
training_df5, test_df5 = model_df5.randomSplit([0.7, 0.3], 1000)

In [None]:
df_classifier5 = DecisionTreeClassifier(labelCol = "hemochromatosis").fit(training_df5)

In [None]:
df_predictions5 = df_classifier5.transform(test_df5)

In [None]:
df_accuracy5 = MulticlassClassificationEvaluator(labelCol="hemochromatosis",
                                                metricName= "accuracy").evaluate(df_predictions5)

df_accuracy5

0.9348704140020195

The accuracy of this model is 93% when predicting Hemochromatosis.

In [None]:
df_precision5 = MulticlassClassificationEvaluator(labelCol = "hemochromatosis",
                                                 metricName = "weightedPrecision").evaluate(df_predictions5)

df_precision5

0.8739826909763073

The weighted precision of this model is 87%. 

In [None]:
df_classifier5.featureImportances

SparseVector(27, {})

No feature has non-zero importance in this model, which indicates that the tree made no splits.

In [None]:
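# No metricName is set, so this evaluates the default "f1" (weighted F1), not AUC.
df_auc5 = MulticlassClassificationEvaluator(labelCol = "hemochromatosis").evaluate(df_predictions5)
df_auc5

0.9034017830357864

Despite the variable name, this value is the weighted F1 score, about 90%. It is high largely because Hemochromatosis is rare in the data.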

**Leber's Hereditary Optic Neuropathy Decision Tree**

In [None]:
model_df6 = output.select("features", "lebers")

In [None]:
training_df6, test_df6 = model_df6.randomSplit([0.7, 0.3], 1000)

In [None]:
df_classifier6 = DecisionTreeClassifier(labelCol = "lebers").fit(training_df6)

In [None]:
df_predictions6 = df_classifier6.transform(test_df6)

In [None]:
df_accuracy6 = MulticlassClassificationEvaluator(labelCol="lebers",
                                                metricName= "accuracy").evaluate(df_predictions6)

df_accuracy6

0.9678559407606866

The accuracy of this model is 97% when predicting Leber's.

In [None]:
df_precision6 = MulticlassClassificationEvaluator(labelCol = "lebers",
                                                 metricName = "weightedPrecision").evaluate(df_predictions6)

df_precision6

0.9367451220657538

The weighted precision of this model is about 94%.

In [None]:
df_classifier6.featureImportances

SparseVector(27, {})

No feature has non-zero importance in this model, which indicates that the tree made no splits.

In [None]:
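# No metricName is set, so this evaluates the default "f1" (weighted F1), not AUC.
df_auc6 = MulticlassClassificationEvaluator(labelCol = "lebers").evaluate(df_predictions6)
df_auc6

0.952046440659319

Despite the variable name, this value is the weighted F1 score, about 95%. As the number of positive instances decreases, the evaluation measures increase, because a model that predicts "no disorder" for every patient is right more often when the disorder is rarer.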

**Alzheimer's Decision Tree**

In [None]:
model_df7 = output.select("features", "alzheimers")

In [None]:
training_df7, test_df7 = model_df7.randomSplit([0.7, 0.3], 1000)

In [None]:
df_classifier7 = DecisionTreeClassifier(labelCol = "alzheimers").fit(training_df7)

In [None]:
df_predictions7 = df_classifier7.transform(test_df7)

In [None]:
df_accuracy7 = MulticlassClassificationEvaluator(labelCol="alzheimers",
                                                metricName= "accuracy").evaluate(df_predictions7)

df_accuracy7

0.9909121507909795

The accuracy of this model is 99%. This reflects the small number of Alzheimer's instances in the dataset rather than predictive skill.

In [None]:
df_precision7 = MulticlassClassificationEvaluator(labelCol = "alzheimers",
                                                 metricName = "weightedPrecision").evaluate(df_predictions7)

df_precision7

0.9819068905852049

The weighted precision of the model is 98%. 

In [None]:
df_classifier7.featureImportances

SparseVector(27, {})

No feature has non-zero importance in this model, which indicates that the tree made no splits.

In [None]:
# No metricName is set, so this evaluates the default "f1" (weighted F1), not AUC.
df_auc7 = MulticlassClassificationEvaluator(labelCol = "alzheimers").evaluate(df_predictions7)
df_auc7

0.9863889676850865

Despite the variable name, this value is the weighted F1 score, about 99%. As previously mentioned, the evaluation measures have increased as the number of positive instances has decreased. We would most likely get more meaningful models with more instances of these diseases.

**Cancer Decision Tree**

In [None]:
model_df8 = output.select("features", "cancer")

In [None]:
training_df8, test_df8 = model_df8.randomSplit([0.7, 0.3], 1000)

In [None]:
df_classifier8 = DecisionTreeClassifier(labelCol = "cancer").fit(training_df8)

In [None]:
df_predictions8 = df_classifier8.transform(test_df8)

In [None]:
df_accuracy8 = MulticlassClassificationEvaluator(labelCol="cancer",
                                                metricName= "accuracy").evaluate(df_predictions8)

df_accuracy8

0.9947829013800067

The accuracy of the model is 99% for predicting Cancer.

In [None]:
df_precision8 = MulticlassClassificationEvaluator(labelCol = "cancer",
                                                 metricName = "weightedPrecision").evaluate(df_predictions8)

df_precision8

0.9895930208780241

The weighted precision of the model is also 99%. 

In [None]:
df_classifier8.featureImportances

SparseVector(27, {})

No feature has non-zero importance in this model, which indicates that the tree made no splits.

In [None]:
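# No metricName is set, so this evaluates the default "f1" (weighted F1), not AUC.
df_auc8 = MulticlassClassificationEvaluator(labelCol = "cancer").evaluate(df_predictions8)
df_auc8

0.9921811743958862

Despite the variable name, this value is the weighted F1 score, about 99%.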

With more instances of each disease, we are more likely to get a model that learns real splits instead of defaulting to the majority class.

Overall, the models could be improved with more instances of the diseases, which could come from further data collection or from simulated instances.
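One simple reading of "simulated instances" is random oversampling, where existing positive rows are duplicated in the training set until the two classes are the same size. The sketch below is an illustration of that technique on made-up rows, not something this notebook does; the function name `oversample_minority` and the toy data are hypothetical.

```python
import random

def oversample_minority(rows, label_key, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    if not minority:
        return list(rows)  # nothing to resample from
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced

# Toy training rows: 95 negatives and 5 positives (made-up data).
train_rows = [{"id": i, "label": 0} for i in range(95)]
train_rows += [{"id": 95 + i, "label": 1} for i in range(5)]

balanced_rows = oversample_minority(train_rows, "label", seed=1000)
n_pos = sum(r["label"] for r in balanced_rows)
print(len(balanced_rows), n_pos)
```

Resampling like this would normally be applied to the training split only, so that the test set still reflects the real class balance.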

**Resources:**

https://medium.com/radon-dev/item-item-collaborative-filtering-with-binary-or-unary-data-e8f0b465b2c3

https://academic.oup.com/bioinformatics/article/34/22/3907/5026663

https://towardsdatascience.com/recommender-systems-item-customer-collaborative-filtering-ff0c8f41ae8a


https://medium.com/analytics-vidhya/linear-regression-and-decision-tree-implementation-using-pyspark-bfcd93dee86

https://python.plainenglish.io/decision-trees-random-forests-in-pyspark-d07546e4fa7d

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.DecisionTreeClassifier.html

https://www.datatechnotes.com/2021/06/pyspark-decision-tree-classification.html

https://towardsdatascience.com/choosing-performance-metrics-61b40819eae1

