# Cardiotocographic Classification using Spark
##### Classification of cardiotocograms (fetal heart rate and uterine contractions), implemented with a PySpark Pipeline

Dataset from the UCI data repository: https://archive.ics.uci.edu/ml/datasets/cardiotocography

In [42]:
#Pipeline dependencies
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession

#Data manipulation, analysis and plotting tools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#Machine Learning libraries
#import sklearn as sk
#import tensorflow as tf

In [18]:
#Define the session and cluster
spark = SparkSession.builder \
                    .master('local[4]') \
                    .appName('cardiotocography_pipeline') \
                    .getOrCreate()

In [28]:
#Read the CSV dataset file and print the schema
df = spark.read.options(header='True', #Specify that headers exist in dataset
                        inferSchema='True', 
                        delimiter=',' #Comma delimited
                       ).csv("CTG_data.csv") #Source file

#Drop missing values
df = df.dropna()

#Drop attributes that are unlikely to contribute to the model's learning:
#FileName, Date and SegFile
df = df.drop("FileName","Date","SegFile")

#The dataset already one-hot encodes the multiclass target in the columns
# A,B,C,D,E,AD,DE,LD,FS,SUSP
#These columns duplicate the target, so drop them and keep the CLASS column,
#which holds the class number (1-10).
#B and E are loaded as B29 and E32, presumably because Spark suffixes
#duplicate (case-insensitive) column names with their position.
df = df.drop("A","B29","C","D","E32","AD","DE","LD","FS","SUSP")
df.printSchema()

root
 |-- b3: integer (nullable = true)
 |-- e4: integer (nullable = true)
 |-- LBE: integer (nullable = true)
 |-- LB: integer (nullable = true)
 |-- AC: integer (nullable = true)
 |-- FM: integer (nullable = true)
 |-- UC: integer (nullable = true)
 |-- ASTV: integer (nullable = true)
 |-- MSTV: double (nullable = true)
 |-- ALTV: integer (nullable = true)
 |-- MLTV: double (nullable = true)
 |-- DL: integer (nullable = true)
 |-- DS: integer (nullable = true)
 |-- DP: integer (nullable = true)
 |-- DR: integer (nullable = true)
 |-- Width: integer (nullable = true)
 |-- Min: integer (nullable = true)
 |-- Max: integer (nullable = true)
 |-- Nmax: integer (nullable = true)
 |-- Nzeros: integer (nullable = true)
 |-- Mode: integer (nullable = true)
 |-- Mean: integer (nullable = true)
 |-- Median: integer (nullable = true)
 |-- Variance: integer (nullable = true)
 |-- Tendency: integer (nullable = true)
 |-- CLASS: integer (nullable = true)
 |-- NSP: integer (nullable = true)
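
The dropped indicator columns and `CLASS` carry the same information, which is why only `CLASS` is kept. As a small illustration in plain Python (not Spark), a one-hot row can be mapped back to its class number as sketched below. The sketch assumes that the ten indicators are ordered as listed in the code comment above and that position i corresponds to class number i + 1; the example row is made up.

```python
# Indicator names in the order given in the notebook comment.
# Assumption: this order matches the class numbers 1-10 stored in CLASS.
PATTERNS = ["A", "B", "C", "D", "E", "AD", "DE", "LD", "FS", "SUSP"]


def class_from_one_hot(row):
    """Return the 1-based class number encoded by a one-hot row."""
    if len(row) != len(PATTERNS) or sum(row) != 1:
        raise ValueError("expected exactly one active indicator out of ten")
    return row.index(1) + 1


# Hypothetical record with only the DE indicator set.
example_row = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
class_number = class_from_one_hot(example_row)
print(class_number, PATTERNS[class_number - 1])
```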



### Feature descriptions:
- b: Start instant
- e: End instant
- LBE: Baseline value (medical expert)
- LB: Baseline value (SisPorto)
- AC: Accelerations (SisPorto)
- FM: Foetal movement (SisPorto)
- UC: Uterine contractions (SisPorto)
- ASTV: Percentage of time with abnormal short-term variability (SisPorto)
- mSTV: Mean value of short-term variability (SisPorto)
- ALTV: Percentage of time with abnormal long-term variability (SisPorto)
- mLTV: Mean value of long-term variability (SisPorto)
- DL: Light decelerations
- DS: Severe decelerations
- DP: Prolonged decelerations
- DR: Repetitive decelerations
- Width: Histogram width
- Min: Low frequency of the histogram
- Max: High frequency of the histogram
- Nmax: Number of histogram peaks
- Nzeros: Number of histogram zeros
- Mode: Histogram mode
- Mean: Histogram mean
- Median: Histogram median
- Variance: Histogram variance
- Tendency: Histogram tendency: [-1 = left asymmetric; 0 = symmetric; 1 = right asymmetric]

### Classes
- A: Calm sleep
- B: REM sleep
- C: Calm vigilance
- D: Active vigilance
- E: Shift pattern (A or Susp with shifts)
- AD: Accelerative/Decelerative pattern (stress situation)
- DE: Decelerative pattern (vagal stimulation)
- LD: Largely decelerative pattern
- FS: Flat-sinusoidal pattern (pathological state)
- SUSP: Suspect pattern

#### Higher-level classification (NSP):
- 1: Normal
- 2: Suspect
- 3: Pathologic

In [62]:
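#Use every column except the two target columns as a feature
featuresToScale = [c for c in df.columns if c not in ["CLASS", "NSP"]]

#Combine the feature columns into a single vector column
assembler = VectorAssembler(inputCols=featuresToScale, outputCol="x_vec")
temp_train = assembler.transform(df)

#Rescale each feature to the [0, 1] range
scaler = MinMaxScaler(inputCol="x_vec", outputCol="x_scaled")
scaledData = scaler.fit(temp_train).transform(temp_train)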
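
`MinMaxScaler` rescales each feature independently to a target range, [0, 1] by default, using the column minimum and maximum found during `fit`. The sketch below reproduces that arithmetic in plain Python for a single column. It is an illustration of the formula rather than Spark's implementation. The sample values are made up, and the rule for constant columns (map to the midpoint of the range) follows the behaviour described in the Spark documentation.

```python
def min_max_scale(values, lower=0.0, upper=1.0):
    """Rescale a list of numbers linearly so they span [lower, upper]."""
    col_min, col_max = min(values), max(values)
    if col_max == col_min:
        # A constant column has no spread, so map it to the midpoint.
        return [0.5 * (lower + upper) for _ in values]
    span = col_max - col_min
    return [(v - col_min) / span * (upper - lower) + lower for v in values]


# Illustrative baseline heart-rate values, not taken from the dataset.
baseline = [120, 132, 133, 134, 160]
scaled = min_max_scale(baseline)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between, so features measured on very different scales become comparable.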

In [67]:
scaledData.show(1, truncate=False)

+---+---+---+---+---+---+---+----+----+----+----+---+---+---+---+-----+---+---+----+------+----+----+------+--------+--------+-----+---+--------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|b3 |e4 |LBE|LB |AC |FM |UC |ASTV|MSTV|ALTV|MLTV|DL |DS |DP |DR |Width|Min|Max|Nmax|Nzeros|Mode|Mean|Median|Variance|Tendency|CLASS|NSP|x_vec                                                                                                                     |x_scaled                                                                                                                     

In [94]:
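#Split into training and test sets with a fixed seed.
#randomSplit normalizes the weights, so [0.75, 0.35] gives roughly a 68/32 split
train_data, test_data = scaledData.randomSplit([0.75, 0.35], seed=23)
#Assumption: the original right-hand side is missing; the output below
#suggests that the CLASS column was selected and shown
labels = scaledData.select("CLASS")
labels.show()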

+-----+
|CLASS|
+-----+
|    7|
|    2|
|   10|
|    2|
|    2|
|    2|
|    7|
|    1|
|    2|
|    5|
|    7|
|    4|
|   10|
|    6|
|   10|
|    2|
|    1|
|    2|
|    8|
|    6|
+-----+
only showing top 20 rows
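
`randomSplit` normalizes its weights when they do not sum to 1, so the weights `[0.75, 0.35]` used above do not produce a 75/35 split. A quick check of the effective fractions in plain Python (the helper name `normalized_weights` is only for illustration):

```python
def normalized_weights(weights):
    """Divide each weight by the total so the fractions sum to 1."""
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative with a positive sum")
    return [w / total for w in weights]


train_frac, test_frac = normalized_weights([0.75, 0.35])
print(f"train ~ {train_frac:.4f}, test ~ {test_frac:.4f}")
```

About 68% of the rows are therefore expected in the training set and 32% in the test set. For a 75/25 split, weights such as `[0.75, 0.25]` would be passed instead.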



In [100]:
#Keep the scaled feature vector together with its label.
#Calling withColumn("CLASS", scaledData.CLASS) on a selection that no longer
#contains CLASS raises an AnalysisException, because a column cannot be
#attached from a different DataFrame. Select both columns from the same
#split instead.
training = train_data.select("x_scaled", "CLASS")
testing = test_data.select("x_scaled", "CLASS")
