# CalCOFI
### Over 60 years of oceanographic data
#### Dataset: https://www.kaggle.com/datasets/sohier/calcofi
#### Table Info: https://calcofi.org/data/oceanographic-data/bottle-database/

## EDA and Data Preparation

In [1]:
from pyspark.sql.functions import *

#### Load Data

In [2]:
''' Loading in spark '''
import pyspark
from pyspark.sql import SparkSession

conf = pyspark.SparkConf().setAll([
    ('spark.master', 'local[1]'), 
    ('spark.app.name', 'App Name')])
    
spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark.version

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-05-21 22:38:47,372 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


'3.2.1'

In [3]:
''' Read the data '''
bottle = spark.read.csv("hdfs:///bottle.csv", header=True, inferSchema=True).cache() ##leslie and katie's path
#bottle = spark.read.csv("file:///home/work/Final/bottle.csv", header=True, inferSchema=True).cache() ##karina's path
maxRows = bottle.count()
print("There are", maxRows, "rows in the initial dataframe")

2022-05-21 22:39:06,687 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

There are 864863 rows in the initial dataframe

In [4]:
''' View the inferred schema '''
print("There are", len(bottle.columns), "columns in the initial dataframe")
bottle.printSchema()

There are 74 columns in the initial dataframe
root
 |-- Cst_Cnt: integer (nullable = true)
 |-- Btl_Cnt: integer (nullable = true)
 |-- Sta_ID: string (nullable = true)
 |-- Depth_ID: string (nullable = true)
 |-- Depthm: integer (nullable = true)
 |-- T_degC: double (nullable = true)
 |-- Salnty: double (nullable = true)
 |-- O2ml_L: double (nullable = true)
 |-- STheta: double (nullable = true)
 |-- O2Sat: double (nullable = true)
 |-- Oxy_µmol/Kg: double (nullable = true)
 |-- BtlNum: integer (nullable = true)
 |-- RecInd: integer (nullable = true)
 |-- T_prec: integer (nullable = true)
 |-- T_qual: integer (nullable = true)
 |-- S_prec: integer (nullable = true)
 |-- S_qual: integer (nullable = true)
 |-- P_qual: integer (nullable = true)
 |-- O_qual: integer (nullable = true)
 |-- SThtaq: integer (nullable = true)
 |-- O2Satq: integer (nullable = true)
 |-- ChlorA: double (nullable = true)
 |-- Chlqua: integer (nullable = true)
 |-- Phaeop: double (nullable = true)
 |-- Phaqua: in

In [5]:
''' See a snippet of what this dataframe looks like '''
bottle.show(2)

+-------+-------+-----------+--------------------+------+------+------+------+------+-----+-----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+-----+----+------+------+-----+----+-----+----+-----+----+------+------+------+------+------+------+------+------+------+------+------+------+------+------+-------+------+--------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+-----+------+-------+------+------+----+----+----+----+----+----+-------------------+
|Cst_Cnt|Btl_Cnt|     Sta_ID|            Depth_ID|Depthm|T_degC|Salnty|O2ml_L|STheta|O2Sat|Oxy_µmol/Kg|BtlNum|RecInd|T_prec|T_qual|S_prec|S_qual|P_qual|O_qual|SThtaq|O2Satq|ChlorA|Chlqua|Phaeop|Phaqua|PO4uM|PO4q|SiO3uM|SiO3qu|NO2uM|NO2q|NO3uM|NO3q|NH3uM|NH3q|C14As1|C14A1p|C14A1q|C14As2|C14A2p|C14A2q|DarkAs|DarkAp|DarkAq|MeanAs|MeanAp|MeanAq|IncTim|LightP|R_Depth|R_TEMP|R_POTEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3|R_PO4|R_NO3|R_NO2|R_NH4|R_CH

#### Remove columns that are not needed

Removing the four string columns because they aren't useful for our purposes
* Sta_ID: Line and Station
* Depth_ID: Uses the Cast_ID prefix ([Century]-[Year][Month][ShipCode]-[CastType][Julian Day]-[CastTime]-[Line][Sta]) but adds three additional variables: [Depth][Bottle]-[Rec_Ind]
* IncTim: Elapsed incubation time of the primary productivity experiment
* DIC Quality Comment: Quality Comment

Also removing the Cast and Bottle counts, which are essentially indexes (identifiers)
* 'Cst_Cnt': Auto-numbered Cast Count - all casts consecutively numbered. 1 is first station done
* 'Btl_Cnt': Auto-numbered Bottle count- all bottles ever sampled, consecutively numbered
* 'BtlNum': Bottle Number

In [6]:
''' Dropping unneeded columns and viewing two rows of the resulting dataframe '''
deleteList1 = ["Sta_ID","Depth_ID","IncTim","DIC Quality Comment","Cst_Cnt","Btl_Cnt","BtlNum"]
bottle = bottle.drop(*deleteList1)

print("There are now", len(bottle.columns), "columns and", bottle.count(), "rows")
bottle.show(2, truncate=False)

There are now 67 columns and 864863 rows
+------+------+------+------+------+-----+-----------+------+------+------+------+------+------+------+------+------+------+------+------+------+-----+----+------+------+-----+----+-----+----+-----+----+------+------+------+------+------+------+------+------+------+------+------+------+------+-------+------+--------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+-----+------+-------+------+------+----+----+----+----+----+----+
|Depthm|T_degC|Salnty|O2ml_L|STheta|O2Sat|Oxy_µmol/Kg|RecInd|T_prec|T_qual|S_prec|S_qual|P_qual|O_qual|SThtaq|O2Satq|ChlorA|Chlqua|Phaeop|Phaqua|PO4uM|PO4q|SiO3uM|SiO3qu|NO2uM|NO2q|NO3uM|NO3q|NH3uM|NH3q|C14As1|C14A1p|C14A1q|C14As2|C14A2p|C14A2q|DarkAs|DarkAp|DarkAq|MeanAs|MeanAp|MeanAq|LightP|R_Depth|R_TEMP|R_POTEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3|R_PO4|R_NO3|R_NO2|R_NH4|R_CHLA|R_PHAEO|R_PRES|R_SAMP|DIC1|DIC2|TA1 |TA2 |pH2 |pH1 |
+------+------+------+------+------+-----+-------

#### Handle quality values

Removing the quality-code columns because we're using the measurements themselves instead
* T_qual: Temperature Quality Code
* S_qual: Salinity Quality Code
* P_qual: Pressure Quality Code
* O_qual: Oxygen Quality Code
* 'O2Satq': Oxygen Saturation Quality Code
* 'Chlqua': Chlorophyll-a Quality Code
* 'Phaqua': Phaeophytin Quality Code
* 'PO4q': Phosphate Quality Code
* 'SiO3qu': Silicate Quality Code
* 'NO2q': Nitrite Quality Code
* 'NO3q': Nitrate Quality Code
* 'NH3q': Ammonium Quality Code
* 'C14A1q': 14C As1 Quality Code
* 'C14A2q': 14C As2 Quality Code
* 'DarkAq': 14C Assimilation Dark Bottle Quality Code
* 'MeanAq': Mean 14C Assimilation Quality Code

The same list also drops two measurement columns whose reported counterparts (`R_PHAEO`, `R_PO4`) stay in the dataframe
* 'Phaeop': Phaeophytin concentration (a measurement, not a quality code)
* 'PO4uM': Micromoles Phosphate per liter of seawater

In [7]:
''' Dropping quality/irrelevant columns and viewing two rows of the resulting dataframe '''
deleteList2 = ['T_qual','S_qual','P_qual','O_qual','O2Satq','Chlqua','Phaeop','Phaqua','PO4uM','PO4q','SiO3qu','NO2q','NO3q','NH3q','C14A1q','C14A2q','DarkAq','MeanAq']
bottle = bottle.drop(*deleteList2)

print("There are now", len(bottle.columns), "columns and", bottle.count(), "rows")
bottle.show(2, truncate=False)

There are now 49 columns and 864863 rows
+------+------+------+------+------+-----+-----------+------+------+------+------+------+------+-----+-----+-----+------+------+------+------+------+------+------+------+------+-------+------+--------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+-----+------+-------+------+------+----+----+----+----+----+----+
|Depthm|T_degC|Salnty|O2ml_L|STheta|O2Sat|Oxy_µmol/Kg|RecInd|T_prec|S_prec|SThtaq|ChlorA|SiO3uM|NO2uM|NO3uM|NH3uM|C14As1|C14A1p|C14As2|C14A2p|DarkAs|DarkAp|MeanAs|MeanAp|LightP|R_Depth|R_TEMP|R_POTEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3|R_PO4|R_NO3|R_NO2|R_NH4|R_CHLA|R_PHAEO|R_PRES|R_SAMP|DIC1|DIC2|TA1 |TA2 |pH2 |pH1 |
+------+------+------+------+------+-----+-----------+------+------+------+------+------+------+-----+-----+-----+------+------+------+------+------+------+------+------+------+-------+------+--------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+-----+-----

#### Remove columns with low data count

We start by counting the number of NaNs and nulls in each column and reporting them in a new dataframe. Then we delete the columns with fewer than 200,000 non-null values.
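
The threshold rule can be sketched in plain Python, without Spark. This is only an illustration of the arithmetic: `columns_below_threshold` is a hypothetical helper, the two chlorophyll null counts are the ones this notebook reports for those columns, and `sparse_col` is a made-up column added to show a drop.

```python
def columns_below_threshold(null_counts, total_rows, min_non_null):
    """Return the columns whose non-null count is below min_non_null."""
    # A column has fewer than min_non_null values exactly when its
    # null count exceeds total_rows - min_non_null.
    max_nulls = total_rows - min_non_null
    return [name for name, nulls in null_counts.items() if nulls > max_nulls]


# ChlorA and R_CHLA counts come from the notebook output;
# "sparse_col" is hypothetical and exists only for illustration.
null_counts = {"ChlorA": 639591, "R_CHLA": 639587, "sparse_col": 800000}

to_drop = columns_below_threshold(null_counts, total_rows=864863, min_non_null=200000)
print(to_drop)
```

Both chlorophyll columns survive because each still has a little over 225,000 non-null rows, which is above the 200,000 cutoff.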

In [8]:
''' Counting the number of null/NaN rows per column and outputting that in a new dataframe '''

def getNullCounts(df):
    # Use the columns of the dataframe passed in, not the global `bottle`
    return df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns])

nullCounter = getNullCounts(bottle)
nullCounter.show()



+------+------+------+------+------+------+-----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+-------+------+--------+----------+-------+-----+-------+------+-------+------+------+------+------+------+------+-------+------+------+------+------+------+------+------+------+
|Depthm|T_degC|Salnty|O2ml_L|STheta| O2Sat|Oxy_µmol/Kg|RecInd|T_prec|S_prec|SThtaq|ChlorA|SiO3uM| NO2uM| NO3uM| NH3uM|C14As1|C14A1p|C14As2|C14A2p|DarkAs|DarkAp|MeanAs|MeanAp|LightP|R_Depth|R_TEMP|R_POTEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|  R_O2|R_O2Sat|R_SIO3| R_PO4| R_NO3| R_NO2| R_NH4|R_CHLA|R_PHAEO|R_PRES|R_SAMP|  DIC1|  DIC2|   TA1|   TA2|   pH2|   pH1|
+------+------+------+------+------+------+-----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+-------+------+--------+----------+-------+-----+-------+------+-------+------+------+------+------

In [9]:
''' Deleting columns with fewer than 200000 non-nulls '''
thresh = maxRows - 200000
# nullCounter has a single row, so collect it once instead of running one Spark job per column
nullCounts = nullCounter.first().asDict()
deleteList3 = [name for name, nulls in nullCounts.items() if nulls > thresh]
bottle = bottle.drop(*deleteList3)

print("There are now", len(bottle.columns), "columns and", bottle.count(), "rows")
bottle.show(2, truncate=False)

There are now 30 columns and 864863 rows
+------+------+------+------+------+-----+-----------+------+------+------+------+------+-----+-----+-------+------+--------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|Depthm|T_degC|Salnty|O2ml_L|STheta|O2Sat|Oxy_µmol/Kg|RecInd|T_prec|S_prec|ChlorA|SiO3uM|NO2uM|NO3uM|R_Depth|R_TEMP|R_POTEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3|R_PO4|R_NO3|R_NO2|R_CHLA|R_PHAEO|R_PRES|
+------+------+------+------+------+-----+-----------+------+------+------+------+------+-----+-----+-------+------+--------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|0     |10.5  |33.44 |null  |25.649|null |null       |3     |1     |2     |null  |null  |null |null |0.0    |10.5  |10.5    |33.44     |25.64  |233.0|0.0    |null|null   |null  |null |null |null |null  |null   |0     |
|8     |10.46 |33.44 |null  |25.656|null |null       |3     |2     |2     |null  |n

#### Remove null values from chlorophyll column

Because this is the target column, we can only use the non-null rows. We also get rid of the duplicate chlorophyll column `ChlorA`, keeping `R_CHLA` because it has fewer null rows (though only by a few).

In [10]:
''' Illustrating that we have two target columns that are essentially duplicates of each other '''
bottle.corr("ChlorA","R_CHLA")

0.9999995087108701

Because we only need one of these columns, we can delete the other. Then we drop all null rows of the R_CHLA column since it's our target.

In [11]:
''' Looking at which of the two target columns has more NaNs in order to select which to delete '''
nullCounter.select(*["ChlorA","R_CHLA"]).show()

+------+------+
|ChlorA|R_CHLA|
+------+------+
|639591|639587|
+------+------+



In [12]:
''' Delete the ChlorA column and drop all NaN rows in the R_CHLA column '''
bottle = bottle.drop("ChlorA")
bottle = bottle.dropna(subset="R_CHLA")

print("There are now", len(bottle.columns), "columns and", bottle.count(), "rows")
bottle.show(2, truncate=False)

There are now 29 columns and 225276 rows
+------+------+------+------+------+-----+-----------+------+------+------+------+-----+-----+-------+------+--------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|Depthm|T_degC|Salnty|O2ml_L|STheta|O2Sat|Oxy_µmol/Kg|RecInd|T_prec|S_prec|SiO3uM|NO2uM|NO3uM|R_Depth|R_TEMP|R_POTEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3|R_PO4|R_NO3|R_NO2|R_CHLA|R_PHAEO|R_PRES|
+------+------+------+------+------+-----+-----------+------+------+------+------+-----+-----+-------+------+--------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|0     |19.23 |34.491|5.46  |24.575|103.9|237.9949   |3     |2     |3     |null  |null |1.3  |0.0    |19.23 |19.23   |34.491    |24.57  |335.3|0.0    |5.46|103.9  |null  |null |1.3  |null |0.64  |0.47   |0     |
|10    |19.22 |34.492|5.46  |24.578|103.9|237.9942   |3     |2     |3     |null  |null |1.9  |10.0   |19.22 |19

#### Remove "duplicate" columns

As the output above shows, some columns are essentially duplicates of each other, just in different units of measurement. The descriptions of the remaining columns, listed below, help us decide which features to compare for potential deletion:

In [13]:
''' See what columns we still have '''
print(bottle.columns)

['Depthm', 'T_degC', 'Salnty', 'O2ml_L', 'STheta', 'O2Sat', 'Oxy_µmol/Kg', 'RecInd', 'T_prec', 'S_prec', 'SiO3uM', 'NO2uM', 'NO3uM', 'R_Depth', 'R_TEMP', 'R_POTEMP', 'R_SALINITY', 'R_SIGMA', 'R_SVA', 'R_DYNHT', 'R_O2', 'R_O2Sat', 'R_SIO3', 'R_PO4', 'R_NO3', 'R_NO2', 'R_CHLA', 'R_PHAEO', 'R_PRES']


Chlorophyll
* 'ChlorA': Acetone extracted chlorophyll-a measured fluorometrically
* 'R_CHLA': Reported Chlorophyll-a (micrograms per liter)

Depth
* 'Depthm': Depth in meters
* 'R_Depth': Reported Depth (from pressure) in meters

Water density
* 'STheta': Potential Density of Water
* 'R_SIGMA': Reported Potential Density of water

Silicate
* 'SiO3uM': Micromoles Silicate per liter of seawater
* 'R_SIO3': Reported Silicate Concentration

Nitrite
* 'NO2uM': Micromoles Nitrite per liter of seawater
* 'R_NO2': Reported Nitrite Concentration

Nitrate
* 'NO3uM': Micromoles Nitrate per liter of seawater
* 'R_NO3': Reported Nitrate Concentration

Salinity
* 'Salnty': Practical Salinity Scale, 1978 (UNESCO, 1981a); Salinity of water
* 'R_SALINITY': Reported Salinity (from Specific Volume Anomaly, M³/Kg)

O2 saturation
* 'O2Sat': Percent Saturation; Oxygen Saturation
* 'R_O2Sat': Reported Oxygen Saturation (percent)

Oxygen
* 'O2ml_L': Oxygen in mL/L; Milliliters of dissolved oxygen per Liter seawater
* 'Oxy_µmol/Kg': Oxygen in micro moles per kilogram of seawater
* 'R_O2': Reported milliliters of oxygen per liter of seawater

Temperature
* 'T_degC': Temperature of Water
* 'R_TEMP': Reported Temperature (Celsius)
* 'R_POTEMP': Reported Potential Temperature (Celsius)

Other
* 'S_prec': Salinity Units of Precision
* 'T_prec': Temperature Units of Precision
* 'RecInd': Record Indicator
* 'R_SVA': Reported Specific Volume Anomaly
* 'R_DYNHT': Reported Dynamic Height
* 'R_PO4': Reported Phosphate Concentration
* 'R_PHAEO': Reported Phaeophytin
* 'R_PRES': Pressure in decibars
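
Several of the groups above describe the same quantity in different units. As a minimal plain-Python sketch of why correlation is a reasonable duplicate check in that case: Pearson correlation is unchanged by a constant unit conversion, so two such columns correlate at essentially 1. The readings and the conversion factor below are made up for illustration (a real mL/L to µmol/kg conversion also depends on density), and `pearson` is a hypothetical helper, not the `bottle.corr` call used in this notebook.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical oxygen readings in mL/L.
oxygen_ml_per_l = [5.46, 5.40, 4.90, 3.20, 1.10]

# The same readings under an illustrative constant conversion factor.
ILLUSTRATIVE_FACTOR = 43.6
oxygen_converted = [ILLUSTRATIVE_FACTOR * value for value in oxygen_ml_per_l]

r = pearson(oxygen_ml_per_l, oxygen_converted)
print(r)
```

A correlation that is high but visibly below 1 suggests the two columns are derived differently rather than merely rescaled, so it is worth a second look before deleting one.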

In [14]:
''' Setting up a function that will streamline the comparing process '''
nullCounter = getNullCounts(bottle)

def psudoDuplicateCheck(features):
    nullCounter.select(*features).show()

    if len(features)==2:
        print("correlation:", bottle.corr(features[0],features[1]))
    elif len(features)==3:
        print("1&2 correlation:", bottle.corr(features[0],features[1]))
        print("2&3 correlation:", bottle.corr(features[1],features[2]))
        print("1&3 correlation:", bottle.corr(features[0],features[2]))

In [15]:
nullCounter.show()



+------+------+------+------+------+-----+-----------+------+------+------+------+-----+-----+-------+------+--------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|Depthm|T_degC|Salnty|O2ml_L|STheta|O2Sat|Oxy_µmol/Kg|RecInd|T_prec|S_prec|SiO3uM|NO2uM|NO3uM|R_Depth|R_TEMP|R_POTEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3|R_PO4|R_NO3|R_NO2|R_CHLA|R_PHAEO|R_PRES|
+------+------+------+------+------+-----+-----------+------+------+------+------+-----+-----+-------+------+--------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|     0|  3088|  3433|  3610|  3781| 4500|       4502|     0|  3088|  3433|  8947|14951| 9573|      0|  3088|    3508|      3433|   3655| 3630|   3527|3610|   4352|  8941|10322| 9567|14945|     0|      5|     0|
+------+------+------+------+------+-----+-----------+------+------+------+------+-----+-----+-------+------+--------+----------+-------+-----+-------+-

In [16]:
Depths = ["Depthm","R_Depth"]
print("Depths NaN count:")
psudoDuplicateCheck(Depths)

WD = ["STheta","R_SIGMA"]
print("\nWater density NaN count:")
psudoDuplicateCheck(WD)

Silicate = ["SiO3uM","R_SIO3"]
print("\nSilicate NaN count:")
psudoDuplicateCheck(Silicate)

Nitrite = ["NO2uM","R_NO2"]
print("\nNitrite NaN count:")
psudoDuplicateCheck(Nitrite)

Nitrate = ["NO3uM","R_NO3"]
print("\nNitrate NaN count:")
psudoDuplicateCheck(Nitrate)

Salinity = ["Salnty","R_SALINITY"]
print("\nSalinity NaN count:")
psudoDuplicateCheck(Salinity)

Saturation = ["O2Sat","R_O2Sat"]
print("\nO2 Saturation NaN count:")
psudoDuplicateCheck(Saturation)

Oxygen = ["O2ml_L","Oxy_µmol/Kg","R_O2"]
print("\nOxygen NaN count:")
psudoDuplicateCheck(Oxygen)

Temperature = ["T_degC","R_TEMP","R_POTEMP"]
print("\nTemperature NaN count:")
psudoDuplicateCheck(Temperature)

Depths NaN count:
+------+-------+
|Depthm|R_Depth|
+------+-------+
|     0|      0|
+------+-------+

correlation: 0.9999999949168985

Water density NaN count:
+------+-------+
|STheta|R_SIGMA|
+------+-------+
|  3781|   3655|
+------+-------+

correlation: 0.9775619506024495

Silicate NaN count:
+------+------+
|SiO3uM|R_SIO3|
+------+------+
|  8947|  8941|
+------+------+

correlation: 0.9999991048864016

Nitrite NaN count:
+-----+-----+
|NO2uM|R_NO2|
+-----+-----+
|14951|14945|
+-----+-----+

correlation: 0.9999753963583337

Nitrate NaN count:
+-----+-----+
|NO3uM|R_NO3|
+-----+-----+
| 9573| 9567|
+-----+-----+

correlation: 0.9999998732099501

Salinity NaN count:
+------+----------+
|Salnty|R_SALINITY|
+------+----------+
|  3433|      3433|
+------+----------+

correlation: 0.9999999893049997

O2 Saturation NaN count:
+-----+-------+
|O2Sat|R_O2Sat|
+-----+-------+
| 4500|   4352|
+-----+-------+

correlation: 0.9903641742840068

Oxygen NaN count:
+------+-----------+----+
|O

For each group whose correlation coefficient is 0.97 or higher, we now delete the column with more nulls.

In the case of a tie, we follow the pattern that is apparent in the no-tie cases: the "reported" columns (those starting with `R_`) have the higher non-null count. Therefore, if two columns have equal null counts and a high enough correlation, we delete the non-"reported" column.
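
The selection rule can be written down as a small plain-Python function. This is a sketch of the rule as described here, not code from the notebook: `column_to_drop` is a hypothetical helper, and the null counts and correlations are copied from the outputs above.

```python
def column_to_drop(name_a, name_b, nulls, corr, min_corr=0.97):
    """Pick which column of a near-duplicate pair to delete, or None to keep both."""
    if corr < min_corr:
        return None  # not correlated enough to count as duplicates
    if nulls[name_a] != nulls[name_b]:
        # Delete the column with more nulls.
        return name_a if nulls[name_a] > nulls[name_b] else name_b
    # Tie on nulls: delete the column that is not a "reported" (R_) column.
    non_reported = [name for name in (name_a, name_b) if not name.startswith("R_")]
    return non_reported[0] if non_reported else None


# Null counts and correlations taken from the notebook output.
nulls = {"STheta": 3781, "R_SIGMA": 3655, "Salnty": 3433, "R_SALINITY": 3433}

density_choice = column_to_drop("STheta", "R_SIGMA", nulls, corr=0.9775619506024495)
salinity_choice = column_to_drop("Salnty", "R_SALINITY", nulls, corr=0.9999999893049997)
print(density_choice, salinity_choice)
```

The water density pair is decided by null counts and the salinity pair by the tie-break, which matches the columns deleted in the next cell.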

In [17]:
''' Deleting columns that are duplicates of some other column '''
deleteList4 = ["Depthm","STheta","SiO3uM","NO2uM","NO3uM","Salnty","O2Sat","O2ml_L","Oxy_µmol/Kg","T_degC","R_POTEMP"]
bottle = bottle.drop(*deleteList4)

print("There are now", len(bottle.columns), "columns and", bottle.count(), "rows")
bottle.show(2, truncate=False)

There are now 18 columns and 225276 rows
+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|RecInd|T_prec|S_prec|R_Depth|R_TEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3|R_PO4|R_NO3|R_NO2|R_CHLA|R_PHAEO|R_PRES|
+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|3     |2     |3     |0.0    |19.23 |34.491    |24.57  |335.3|0.0    |5.46|103.9  |null  |null |1.3  |null |0.64  |0.47   |0     |
|3     |2     |3     |10.0   |19.22 |34.492    |24.57  |335.3|0.03   |5.46|103.9  |null  |null |1.9  |null |0.66  |0.38   |10    |
+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
only showing top 2 rows



In [18]:
''' Checking the null counts that remain after removing the duplicate columns '''
getNullCounts(bottle).show()

+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|RecInd|T_prec|S_prec|R_Depth|R_TEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3|R_PO4|R_NO3|R_NO2|R_CHLA|R_PHAEO|R_PRES|
+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|     0|  3088|  3433|      0|  3088|      3433|   3655| 3630|   3527|3610|   4352|  8941|10322| 9567|14945|     0|      5|     0|
+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+

#### Fill null values

In [19]:
''' Using the function from PA3 '''
from pyspark.ml.feature import Imputer

def fill_na(df, strategy):    
    imputer = Imputer(
        strategy=strategy,
        inputCols=df.columns, 
        outputCols=["{}_imputed".format(c) for c in df.columns]
    )
    
    new_df = imputer.fit(df).transform(df)
    
    ''' Select the newly created columns with all filled values '''
    new_df = new_df.select([c for c in new_df.columns if "imputed" in c])
    
    # Loop variable is named `name` so it does not shadow pyspark's `col` function
    for name in new_df.columns:
        new_df = new_df.withColumnRenamed(name, name.split("_imputed")[0])
        
    return new_df

In [20]:
''' Filling in the remaining null rows with the mean of the column and saving this newly filled in dataframe as a new variable '''
bottleTest = fill_na(bottle, 'mean')

print("There are still", len(bottleTest.columns), "columns and", bottleTest.count(), "rows")
bottleTest.show(2, truncate=False)

There are still 18 columns and 225276 rows
+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------------------+------------------+-----+-------------------+------+-------+------+
|RecInd|T_prec|S_prec|R_Depth|R_TEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3            |R_PO4             |R_NO3|R_NO2              |R_CHLA|R_PHAEO|R_PRES|
+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------------------+------------------+-----+-------------------+------+-------+------+
|3     |2     |3     |0.0    |19.23 |34.491    |24.57  |335.3|0.0    |5.46|103.9  |11.529812096979274|0.9506145966112527|1.3  |0.05539378408325813|0.64  |0.47   |0     |
|3     |2     |3     |10.0   |19.22 |34.492    |24.57  |335.3|0.03   |5.46|103.9  |11.529812096979274|0.9506145966112527|1.9  |0.05539378408325813|0.66  |0.38   |10    |
+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------------------+-----

In [21]:
''' Double checking that all null values are filled '''
getNullCounts(bottleTest).show()

+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|RecInd|T_prec|S_prec|R_Depth|R_TEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3|R_PO4|R_NO3|R_NO2|R_CHLA|R_PHAEO|R_PRES|
+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|     0|     0|     0|      0|     0|         0|      0|    0|      0|   0|      0|     0|    0|    0|    0|     0|      0|     0|
+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
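
For readers unfamiliar with mean imputation, here is a minimal plain-Python sketch of what the `'mean'` strategy does to a single column: each missing entry is replaced by the mean of the entries that are present. `fill_with_mean` is a hypothetical helper and the values are made up; the notebook itself relies on Spark's `Imputer`.

```python
def fill_with_mean(values):
    """Replace None entries with the mean of the non-missing entries."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to average, leave the column unchanged
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


# Hypothetical nitrate readings with one missing value.
readings = [1.3, 1.9, None, 0.1]
filled = fill_with_mean(readings)
print(filled)
```

This is why the filled rows shown earlier repeat the same long decimal in `R_SIO3`, `R_PO4` and `R_NO2`: every null in a column receives that column's mean.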



#### Lasso regression

> (my personal notes will be indented)
> 1. tried to use Lasso in spark, turns out it's deprecated, use LinearRegression instead
> 2. LinearRegression needs a label column (easy)
> 3. LinearRegression also needs a feature column (need help determining if this needs to be scaled before using lasso)
> 
> spark lasso: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.mllib.regression.LassoWithSGD.html<br>
Warning: **"Use [pyspark.ml.regression.LinearRegression](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.regression.LinearRegression.html) with elasticNetParam = 1.0. Note the default regParam is 0.01 for LassoWithSGD, but is 0.0 for LinearRegression."**
> 
> need to specify a label column in order to use. default is to use a column *named* "label" but it *is* possible to have the model look for a specific name. however, renaming columns is something we've done in past PAs and i wanna use it to be like "HEY-YO WE'RE USING WHAT YOU SHOWED US :D"
> 
> **(need help here:)** need to create "features" column first (did this in PA3, filled NaNs with mean before creating "features" column), but not sure if we need to scale it before feeding to lasso AND not sure what alpha "regParam" to use (i thiiiiink regParam == alpha ? 😅 i'm actually not sure, but see the raw-cell below for more about which regParam to use)


To better estimate which columns are most important to our model, we use a lasso regression. Because Spark's `LassoWithSGD` is deprecated, we use `LinearRegression` with `elasticNetParam = 1.0`, which is the lasso equivalent.

In order to use this, `label` and `features` columns need to be specified. We rename `R_CHLA` to `label` and create a features column with VectorAssembler.
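
On the question of how `regParam` relates to the `alpha` of a non-Spark lasso: the objective documented for `pyspark.ml.regression.LinearRegression` with squared-error loss can be written as follows, where $\lambda$ stands for `regParam` and $\alpha$ for `elasticNetParam` (the intercept is omitted for brevity, and the symbols are notation chosen for this note):

$$
\min_{w}\; \frac{1}{2n}\sum_{i=1}^{n}\left(x_i^{\top} w - y_i\right)^2 + \lambda\left[\frac{1-\alpha}{2}\,\lVert w\rVert_2^2 + \alpha\,\lVert w\rVert_1\right]
$$

Setting $\alpha = 1$ removes the ridge term and leaves the lasso objective:

$$
\min_{w}\; \frac{1}{2n}\sum_{i=1}^{n}\left(x_i^{\top} w - y_i\right)^2 + \lambda\,\lVert w\rVert_1
$$

This has the same form as scikit-learn's `Lasso` objective with `alpha` $= \lambda$, so `regParam` plays the role of that `alpha`, while Spark's `elasticNetParam` is a different quantity that merely shares the Greek letter. One caveat, stated as an assumption rather than something verified in this notebook: when the model is fit with `standardization=True`, the penalty acts on standardized features, so a given `regParam` may not select the same columns as the same `alpha` applied to unscaled data.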

In [22]:
''' Rename the target chlorophyll column to 'label' '''
bottleTest = bottleTest.withColumnRenamed('R_CHLA', 'label')

In [23]:
''' Creating 'features' column '''
from pyspark.ml.feature import VectorAssembler

''' (interim step) make a list of the column names other than 'label', AKA a list of the features '''
features = bottleTest.columns
features.remove('label')

# bottleTest = VectorAssembler(outputCol="features_unscaled").setInputCols(features).transform(bottleTest)
bottleTest = VectorAssembler(outputCol="features").setInputCols(features).transform(bottleTest)

bottleTest.show(2, truncate=False)

+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------------------+------------------+-----+-------------------+-----+-------+------+-----------------------------------------------------------------------------------------------------------------------------------+
|RecInd|T_prec|S_prec|R_Depth|R_TEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3            |R_PO4             |R_NO3|R_NO2              |label|R_PHAEO|R_PRES|features                                                                                                                           |
+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------------------+------------------+-----+-------------------+-----+-------+------+-----------------------------------------------------------------------------------------------------------------------------------+
|3     |2     |3     |0.0    |19.23 |34.491    |24.57  |335.3|0.0    |5.46|103.9  |11.52981209697

In [24]:
''' Applying lasso regression where regParam= 0.0025 in place of alpha (?) '''
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(elasticNetParam = 1.0, regParam=0.0025, solver="normal", maxIter=1000, standardization=True)
model = lr.fit(bottleTest)

2022-05-21 22:39:51,810 WARN netlib.InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
2022-05-21 22:39:51,815 WARN netlib.InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS

##### regParam (alpha?) tinkering

In [25]:
print("when regParams=" + str(lr.getRegParam()) + ", model.coefficients=")
model.coefficients

when regParams=0.0025, model.coefficients=


DenseVector([0.0, 0.0, 0.0201, 0.0, -0.0718, 0.9188, 0.0, 0.0, -0.9898, 0.3511, 0.0118, 0.0177, 0.0, 0.0082, -0.6885, 2.2156, 0.0])

> okay i'm not sure how to adjust alpha like we would with a regular non-spark LassoRegression, but changing regParam did change which columns had nonzero coefficients which is what we did for regular Lasso
> 
> every regParam setting gives nonzero coefficients for the six columns `'R_SALINITY','R_DYNHT','R_O2','R_SIO3','R_NO2','R_PHAEO'`, but each setting also includes at least one more column that doesn't appear in all the other regParam tests
>
> | regParam | nonzero coeffs | S_prec? | R_TEMP? | R_SVA? | R_O2Sat? | R_NO3? |
> |----------|----------------|---------|---------|--------|----------|--------|
> | 0.0      |       17       |   Yes   |   Yes   |   Yes  |    Yes   |  Yes  |
> | 0.001    |       13       |   Yes   |   Yes   |        |    Yes   |  Yes  |
> | 0.002    |       10       |   Yes   |   Yes   |        |    Yes   |  Yes  |
> | 0.003    |       10       |   Yes   |   Yes   |        |    Yes   |  Yes  |
> | 0.005    |        9       |   Yes   |   Yes   |        |          |       |
> | 0.008    |        8       |         |   Yes   |        |          |       |
> | 0.010    |        7       |         |   Yes   |        |          |       |
> | 0.013    |        8       |         |   Yes   |   Yes  |          |       |
> | 0.015    |        8       |         |   Yes   |   Yes  |          |       |
> | 0.018    |        7       |         |         |   Yes  |          |       |
> | 0.020    |        7       |         |         |   Yes  |          |       |
>
> also it looks like scaling doesn't actually change much other than the absolute value of the coefficients
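
One way to see why raising `regParam` zeroes out more coefficients: in the idealized case of uncorrelated, unit-variance features, the lasso solution is the unpenalized coefficient shrunk toward zero by the penalty and clipped at zero (soft-thresholding). The sketch below is plain Python with made-up coefficients and does not reproduce Spark's solver.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized coefficients, for illustration only.
unpenalized = [2.2, -0.9, 0.35, 0.02, -0.008]

nonzero_counts = {}
for lam in (0.0, 0.01, 0.05, 0.5):
    shrunk = [soft_threshold(z, lam) for z in unpenalized]
    nonzero_counts[lam] = sum(1 for value in shrunk if value != 0.0)

print(nonzero_counts)
```

With correlated features, as in this dataset, the count does not have to fall strictly with the penalty, which may be why the table shows 7 nonzero coefficients at 0.010 but 8 at 0.013.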

##### using the coeff list

> here are the "most useful" features. i'm not sure if i should be deleting the "useless" ones or just dropping nulls from the other columns

In [26]:
''' Find the "most useful" and "not useful" features, according to the lasso regression when regParam=0.0025 '''

coeff = model.coefficients
usefulFeatures = []
deleteList5 = []

print("(For-loop of", len(coeff), "iterations)")
for i in range(len(coeff)):
    if coeff[i] != 0:
        usefulFeatures.append(features[i])
    else:
        deleteList5.append(features[i])

print("When regParam="+str(lr.getRegParam())+", this lasso model indicates the most useful columns to predicting chlorophyll are:", usefulFeatures)

(For-loop of 17 iterations)
When regParam=0.0025, this lasso model indicates the most useful columns to predicting chlorophyll are: ['S_prec', 'R_TEMP', 'R_SALINITY', 'R_DYNHT', 'R_O2', 'R_O2Sat', 'R_SIO3', 'R_NO3', 'R_NO2', 'R_PHAEO']
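
The same split can be checked by hand in plain Python by pairing each feature name with its coefficient. The names follow the column order of the dataframe (minus the label) and the coefficients are the rounded values printed above, so this is a sketch on copied output rather than on the live model.

```python
feature_names = ["RecInd", "T_prec", "S_prec", "R_Depth", "R_TEMP", "R_SALINITY",
                 "R_SIGMA", "R_SVA", "R_DYNHT", "R_O2", "R_O2Sat", "R_SIO3",
                 "R_PO4", "R_NO3", "R_NO2", "R_PHAEO", "R_PRES"]

# Rounded coefficients as printed for regParam=0.0025.
coefficients = [0.0, 0.0, 0.0201, 0.0, -0.0718, 0.9188, 0.0, 0.0, -0.9898,
                0.3511, 0.0118, 0.0177, 0.0, 0.0082, -0.6885, 2.2156, 0.0]

paired = list(zip(feature_names, coefficients))
useful = [name for name, weight in paired if weight != 0]
dropped = [name for name, weight in paired if weight == 0]

# Sort the surviving features by absolute coefficient, largest first.
by_magnitude = sorted(((name, weight) for name, weight in paired if weight != 0),
                      key=lambda pair: abs(pair[1]), reverse=True)

print(useful)
print(dropped)
print(by_magnitude[0])
```

The magnitudes should be read with care: each coefficient is expressed per unit of its own feature, so a larger value does not by itself mean a more important feature.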


In [27]:
''' Moving back to dataframe before nulls were filled with means, drop the null rows of the "most useful" features '''
bottle = bottle.dropna(subset=usefulFeatures)

In [28]:
''' Moving back to dataframe before nulls were filled with means, drop the "not useful" features '''
bottle = bottle.drop(*deleteList5)

In [29]:
''' View the dimensions and two rows of the resulting dataframe '''
print("Moving back to dataframe before nulls were filled with means, here are now", len(bottle.columns), "columns and", bottle.count(), "rows")
bottle.show(2, truncate=False)

Moving back to dataframe before nulls were filled with means, here are now 11 columns and 208035 rows
+------+------+----------+-------+----+-------+------+-----+-----+------+-------+
|S_prec|R_TEMP|R_SALINITY|R_DYNHT|R_O2|R_O2Sat|R_SIO3|R_NO3|R_NO2|R_CHLA|R_PHAEO|
+------+------+----------+-------+----+-------+------+-----+-----+------+-------+
|3     |15.34 |33.492    |0.0    |5.86|102.8  |2.0   |0.1  |0.0  |0.85  |0.19   |
|3     |15.34 |33.492    |0.003  |5.86|102.8  |2.0   |0.1  |0.0  |0.85  |0.19   |
+------+------+----------+-------+----+-------+------+-----+-----+------+-------+
only showing top 2 rows



#### Make sure column values are the right type

In [30]:
''' Check the dtypes of each column '''
bottle.printSchema()

root
 |-- S_prec: integer (nullable = true)
 |-- R_TEMP: double (nullable = true)
 |-- R_SALINITY: double (nullable = true)
 |-- R_DYNHT: double (nullable = true)
 |-- R_O2: double (nullable = true)
 |-- R_O2Sat: double (nullable = true)
 |-- R_SIO3: double (nullable = true)
 |-- R_NO3: double (nullable = true)
 |-- R_NO2: double (nullable = true)
 |-- R_CHLA: double (nullable = true)
 |-- R_PHAEO: double (nullable = true)



In [31]:
''' Row count, and comparison to original row count '''
rowCount = bottle.count()  # count once; every count() call triggers a Spark job
print("There are now", len(bottle.columns), "columns and", rowCount, "rows (which is", maxRows-rowCount, "fewer than the original row count)")

There are now 11 columns and 208035 rows (which is 656828 fewer than the original row count)


#### Summary Statistics and graphs

In [32]:
''' See the detailed numerical description of the current dataframe '''
bottle.describe().show()



+-------+--------------------+------------------+-------------------+-------------------+------------------+-----------------+------------------+------------------+-------------------+-------------------+-------------------+
|summary|              S_prec|            R_TEMP|         R_SALINITY|            R_DYNHT|              R_O2|          R_O2Sat|            R_SIO3|             R_NO3|              R_NO2|             R_CHLA|            R_PHAEO|
+-------+--------------------+------------------+-------------------+-------------------+------------------+-----------------+------------------+------------------+-------------------+-------------------+-------------------+
|  count|              208035|            208035|             208035|             208035|            208035|           208035|            208035|            208035|             208035|             208035|             208035|
|   mean|    2.99938471891749|12.935006878650077| 33.500370831831134|0.20956609705099236| 4.80645381

                                                                                

In [33]:
getNullCounts(bottle).show()

+------+------+----------+-------+----+-------+------+-----+-----+------+-------+
|S_prec|R_TEMP|R_SALINITY|R_DYNHT|R_O2|R_O2Sat|R_SIO3|R_NO3|R_NO2|R_CHLA|R_PHAEO|
+------+------+----------+-------+----+-------+------+-----+-----+------+-------+
|     0|     0|         0|      0|   0|      0|     0|    0|    0|     0|      0|
+------+------+----------+-------+----+-------+------+-----+-----+------+-------+



## Questions

#### Which factors are chlorophyll levels most dependent on?
#### Can we use these factors as features in a model that accurately predict chlorophyll levels in a marine ecosystem?
#### Can these few, simple, measurable factors we found indicate whether an ocean ecosystem is healthy and sustainable based on the chlorophyll abundance measured?
#### Can we predict chlorophyll abundance without having to measure it?

## Feature Selection and Target Cleaning

#### Distribution of data for chlorophyll levels, summary stats

In [34]:
bottle.agg(min('R_CHLA'), max('R_CHLA'), mean('R_CHLA'), stddev('R_CHLA'), count('R_CHLA'), skewness('R_CHLA')).show()

+-----------+-----------+-------------------+-------------------+-------------+------------------+
|min(R_CHLA)|max(R_CHLA)|        avg(R_CHLA)|stddev_samp(R_CHLA)|count(R_CHLA)|  skewness(R_CHLA)|
+-----------+-----------+-------------------+-------------------+-------------+------------------+
|      -0.01|      66.11|0.43074977768152034| 1.1391470120828653|       208035|10.820339155374741|
+-----------+-----------+-------------------+-------------------+-------------+------------------+



##### The summary statistics show that our chlorophyll data is highly positively skewed: the skewness is about 10.8, and the maximum (66.11) is far above the mean (0.43).
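To make the skewness figure concrete, here is a minimal plain-Python sketch of the moment-based skewness, m3 / m2^1.5, which we assume is the formula behind Spark's `skewness` function. The sample values are invented for illustration and are not taken from the dataset.

```python
def skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical samples: one symmetric, one with a single very large value
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
long_right_tail = [0.1, 0.2, 0.3, 0.4, 66.0]

print("symmetric:", skewness(symmetric))
print("long right tail:", skewness(long_right_tail))
```

A few very large values pull the third moment strongly positive, which is the pattern the summary statistics suggest for R_CHLA.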

#### Define chlorophyll category levels: low, medium, high?

##### As displayed in Figure 4 of this article, https://www.nature.com/scitable/knowledge/library/the-biological-productivity-of-the-ocean-70631104/, chlorophyll concentrations in the ocean can be split into three category levels: low, medium, and high. Our low category is x < 0.1 ug/l, our medium category is 0.1 <= x <= 1 ug/l, and our high category is x > 1 ug/l.
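A minimal plain-Python sketch of this binning, assuming the same boundary handling as the Spark `when` chain in the cell further down, where `between` is inclusive at both ends (so values of exactly 0.1 and 1 count as medium). The function name `chla_category` is made up for this illustration.

```python
LOW_THRESHOLD = 0.1
MED_THRESHOLD = 1.0


def chla_category(chla):
    """Return 0 (low), 1 (medium) or 2 (high) for a chlorophyll value in ug/l."""
    if chla < LOW_THRESHOLD:
        return 0
    if chla <= MED_THRESHOLD:
        # Reached only when chla >= LOW_THRESHOLD, so this is the inclusive middle band
        return 1
    return 2


for value in [0.05, 0.1, 0.85, 1.0, 1.5]:
    print(value, "->", chla_category(value))
```

Writing the rule as a plain function makes the boundary cases easy to check before applying the same thresholds to the whole dataframe.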

#### Map chlorophyll levels to the defined categories, add a target column with the category label to the dataframe, and output summary stats

In [35]:
lowThreshold = .1
medThreshold = 1

In [36]:
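# 0 = low, 1 = medium, 2 = high; between() is inclusive at both ends,
# so values of exactly 0.1 and 1 are labelled medium
bottle = bottle.withColumn(
    'Target',
     when((col("R_CHLA") < lowThreshold), 0)\
    .when((col("R_CHLA").between(lowThreshold, medThreshold)), 1)\
    .when((col("R_CHLA") > medThreshold), 2)\
    .otherwise(10)  # sentinel for rows that match none of the rules
)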

In [37]:
bottle.groupby('Target').count().show()

+------+------+
|Target| count|
+------+------+
|     1|115096|
|     2| 17386|
|     0| 75553|
+------+------+



#### Find highly correlated (near-duplicate) columns and analyse whether we can drop them without losing much variance

In [38]:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

In [39]:
columns = ['col1', 'col2', 'correlation']
# Correlations are floating-point values, so the column is a DoubleType
schema = StructType([
  StructField('col1', StringType(), False),
  StructField('col2', StringType(), False),
  StructField('correlation', DoubleType(), False)
  ])
highCorrs = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
rangeColumns = range(len(bottle.columns))
# Every ordered pair of columns is compared, so each pair is reported in both directions
for i in rangeColumns:
    for j in rangeColumns:
        if(i != j):
            correlation = bottle.corr(bottle.columns[i], bottle.columns[j])
            if(correlation > .9):
                newCorr = spark.createDataFrame([(bottle.columns[i], bottle.columns[j], correlation)], columns)
                highCorrs = highCorrs.union(newCorr)

In [40]:
highCorrs.show()

+-------+-------+------------------+
|   col1|   col2|       correlation|
+-------+-------+------------------+
|   R_O2|R_O2Sat|0.9878867685488074|
|R_O2Sat|   R_O2|0.9878867685488074|
| R_SIO3|  R_NO3|0.9755082756124436|
|  R_NO3| R_SIO3|0.9755082756124436|
+-------+-------+------------------+



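R_O2 and R_O2Sat (correlation of about 0.988) and R_SIO3 and R_NO3 (about 0.976) are highly correlated, so we keep R_O2Sat and R_NO3 and drop R_O2 and R_SIO3 with little reduction in variance. The same reasoning applies to R_PO4, which has 998 nulls and is highly correlated with R_NO3 and R_SIO3, and to R_SVA, which also contains nulls and is highly correlated with R_TEMP; neither column is in the current dataframe.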
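The correlation table lists each pair twice because the loop visits both (A, B) and (B, A). A minimal plain-Python sketch of the same search that visits each unordered pair once is shown here; the helper names `pearson` and `high_corr_pairs` and the toy columns are made up for this illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def high_corr_pairs(table, threshold=0.9):
    """Return (col1, col2, r) for every unordered column pair with r > threshold."""
    pairs = []
    for a, b in combinations(table, 2):  # each pair is visited exactly once
        r = pearson(table[a], table[b])
        if r > threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical toy columns, not CalCOFI data
toy = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],  # roughly 2 * a
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],  # not linearly related to a
}
pairs = high_corr_pairs(toy)
print(pairs)
```

Visiting unordered pairs halves the number of correlations to compute, which matters when each one is a separate Spark job.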

In [41]:
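# Drop one column from each highly correlated pair; R_O2Sat and R_NO3 keep their
# original names because later cells refer to them
bottle = bottle.drop('R_O2')
bottle = bottle.drop('R_SIO3')
# Drop the raw chlorophyll value so it cannot leak the target into the features
bottle = bottle.drop('R_CHLA')
bottle.show(2)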

+------+------+----------+-------+-------+-----+-----+-------+------+
|S_prec|R_TEMP|R_SALINITY|R_DYNHT|R_O2Sat|R_NO3|R_NO2|R_PHAEO|Target|
+------+------+----------+-------+-------+-----+-----+-------+------+
|     3| 15.34|    33.492|    0.0|  102.8|  0.1|  0.0|   0.19|     1|
|     3| 15.34|    33.492|  0.003|  102.8|  0.1|  0.0|   0.19|     1|
+------+------+----------+-------+-------+-----+-----+-------+------+
only showing top 2 rows



## Models

#### Split data into 80/10/10 train, test, and validation sets

In [45]:
# Check the schema again before assembling the feature vector; the dataframe has no null values at this point.
bottle.printSchema()

root
 |-- S_prec: integer (nullable = true)
 |-- R_TEMP: double (nullable = true)
 |-- R_SALINITY: double (nullable = true)
 |-- R_DYNHT: double (nullable = true)
 |-- R_O2Sat: double (nullable = true)
 |-- R_NO3: double (nullable = true)
 |-- R_NO2: double (nullable = true)
 |-- R_PHAEO: double (nullable = true)
 |-- Target: integer (nullable = false)



In [48]:
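from pyspark.ml.feature import VectorAssembler

# Every column except the label goes into the feature vector
features = bottle.columns
features.remove('Target')
bottle = VectorAssembler(outputCol="features_unscaled").setInputCols(features).transform(bottle)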

In [49]:
bottle.show(2, truncate=False)

+------+------+----------+-------+-------+-----+-----+-------+------+-------------------------------------------+
|S_prec|R_TEMP|R_SALINITY|R_DYNHT|R_O2Sat|R_NO3|R_NO2|R_PHAEO|Target|features_unscaled                          |
+------+------+----------+-------+-------+-----+-----+-------+------+-------------------------------------------+
|3     |15.34 |33.492    |0.0    |102.8  |0.1  |0.0  |0.19   |1     |[3.0,15.34,33.492,0.0,102.8,0.1,0.0,0.19]  |
|3     |15.34 |33.492    |0.003  |102.8  |0.1  |0.0  |0.19   |1     |[3.0,15.34,33.492,0.003,102.8,0.1,0.0,0.19]|
+------+------+----------+-------+-------+-----+-----+-------+------+-------------------------------------------+
only showing top 2 rows



In [50]:
seed = 42
train, test, validation = bottle.randomSplit([0.80, 0.10, 0.10], seed=seed)
print('Train dataset count:', train.count())
print('Test dataset count:', test.count())
print('Validation dataset count:', validation.count())

                                                                                

Train dataset count: 166435
Test dataset count: 20671
Validation dataset count: 20929
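The three counts are close to, but not exactly, 80/10/10 of 208035 rows. To our understanding, a weighted random split assigns each row independently, so the sizes are only approximately proportional to the weights. The sketch below illustrates that idea in plain Python; `random_split` is a made-up helper, it does not use Spark's random number generator, and it will not reproduce the counts above.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)  # cumulative upper bound of each split's interval
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for k, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding at the top end
            if u < bound or k == len(bounds) - 1:
                splits[k].append(row)
                break
    return splits


train_rows, test_rows, validation_rows = random_split(range(10000), [0.80, 0.10, 0.10], seed=42)
print(len(train_rows), len(test_rows), len(validation_rows))
```

Fixing the seed makes the assignment repeatable, which is why the notebook passes `seed=seed` to `randomSplit`.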


                                                                                

#### Scale Data

In [51]:
from pyspark.ml.feature import StandardScaler

standardScaler = StandardScaler(withMean=True, withStd=True, inputCol='features_unscaled', outputCol='features')
ss = standardScaler.fit(train)

                                                                                

In [52]:
print('StandardScaler Means:', ss.mean)
print('StandardScaler StDevs:', ss.std)

StandardScaler Means: [2.9993631147295092,12.929605960284789,33.50024505662886,0.20992421065280162,81.25461651695906,9.56258299035666,0.055488388860515936,0.18715780935500081]
StandardScaler StDevs: [0.025228624058689544,3.0537191316097294,0.3046101763905132,0.1554892607565311,25.244894938029052,10.300224447461177,0.0984950925137126,0.277056368386772]


In [53]:
trainscaled = ss.transform(train)
testscaled = ss.transform(test)
valscaled = ss.transform(validation)
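The scaler is fitted on the training set only and then applied unchanged to the test and validation sets, so no information from those sets leaks into the scaling. A minimal plain-Python sketch of that pattern, assuming standardization with the sample standard deviation; the helper names and toy rows are made up for this illustration.

```python
from statistics import mean, stdev


def fit_standardizer(train_rows):
    """Compute per-column mean and sample standard deviation from the training rows."""
    columns = list(zip(*train_rows))
    return [mean(c) for c in columns], [stdev(c) for c in columns]


def standardize(rows, means, stds):
    """Apply (value - mean) / std column by column, using statistics fitted elsewhere."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical toy rows, not CalCOFI data
toy_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
toy_test = [[4.0, 30.0]]

means, stds = fit_standardizer(toy_train)
toy_train_scaled = standardize(toy_train, means, stds)
toy_test_scaled = standardize(toy_test, means, stds)  # reuses the training statistics
print(means, stds)
print(toy_test_scaled)
```

After scaling, the training columns have mean 0 and standard deviation 1, while the other sets are only approximately centred because they reuse the training statistics.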

#### Check for Imbalance in datasets

In [54]:
trainscaled.groupby('target').count().show()



+------+-----+
|target|count|
+------+-----+
|     1|92089|
|     2|13801|
|     0|60545|
+------+-----+



                                                                                

In [55]:
testscaled.groupby('target').count().show()



+------+-----+
|target|count|
+------+-----+
|     1|11467|
|     2| 1763|
|     0| 7441|
+------+-----+



                                                                                

In [56]:
valscaled.groupby('target').count().show()



+------+-----+
|target|count|
+------+-----+
|     1|11540|
|     2| 1822|
|     0| 7567|
+------+-----+



                                                                                

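The three sets have nearly the same class proportions (roughly 36% low, 55% medium, and 8% to 9% high), so each split is representative of the whole. The classes themselves are not evenly sized: the high class is by far the smallest.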
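The per-class counts printed above are easier to compare across sets as proportions. A small plain-Python sketch using the counts from the three outputs; the helper name `class_proportions` is made up for this illustration.

```python
# Counts copied from the three groupby outputs above (Target label -> rows)
split_counts = {
    "train": {0: 60545, 1: 92089, 2: 13801},
    "test": {0: 7441, 1: 11467, 2: 1763},
    "validation": {0: 7567, 1: 11540, 2: 1822},
}


def class_proportions(counts):
    """Convert label counts into label proportions that sum to 1."""
    total = sum(counts.values())
    return {label: n / total for label, n in sorted(counts.items())}


for name, counts in split_counts.items():
    proportions = class_proportions(counts)
    print(name, {label: round(p, 3) for label, p in proportions.items()})
```

Comparing proportions rather than raw counts shows at a glance whether a split over- or under-represents a class.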

### Logistic Regression

#### Cross Validation, Check to see if data can be used to predict labels, Train and Validation Accuracy Scores

In [57]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [59]:
regParams = [0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007]
elasticNetParams = [0.90, 0.95, 0.99, 1.0]

# Fit one model per (regParam, elasticNetParam) pair.
# lrModel.summary describes the training set, so this is training accuracy.
for r in regParams:
    for e in elasticNetParams:
        lr = LogisticRegression(featuresCol='features', labelCol='Target', predictionCol='prediction', maxIter=10, regParam=r, elasticNetParam=e)
        lrModel = lr.fit(trainscaled)
        print('regParam', r, 'elasticNetParam', e, 'accuracy', lrModel.summary.accuracy)

                                                                                

regParam 0.001 elasticNetParam 0.9 accuracy 0.880517919908673
regParam 0.001 elasticNetParam 0.95 accuracy 0.8827470183555142
regParam 0.001 elasticNetParam 0.99 accuracy 0.8828251269264278
regParam 0.001 elasticNetParam 1.0 accuracy 0.8828491603328626
regParam 0.002 elasticNetParam 0.9 accuracy 0.8818337489109863
regParam 0.002 elasticNetParam 0.95 accuracy 0.8819238741851173
regParam 0.002 elasticNetParam 0.99 accuracy 0.8820620662721183
regParam 0.002 elasticNetParam 1.0 accuracy 0.8821041247333794
regParam 0.003 elasticNetParam 0.9 accuracy 0.8749121278577222
regParam 0.003 elasticNetParam 0.95 accuracy 0.8748220025835912
regParam 0.003 elasticNetParam 0.99 accuracy 0.8748400276384174
regParam 0.003 elasticNetParam 1.0 accuracy 0.8792922161804909
regParam 0.004 elasticNetParam 0.9 accuracy 0.8760537146633821
regParam 0.004 elasticNetParam 0.95 accuracy 0.8762219485084267
regParam 0.004 elasticNetParam 0.99 accuracy 0.8762399735632529
regParam 0.004 elasticNetParam 1.0 accuracy 0.8763661489470363
regParam 0.005 elasticNetParam 0.9 accuracy 0.8747078439030253
regParam 0.005 elasticNetParam 0.95 accuracy 0.8767026166371256
regParam 0.005 elasticNetParam 0.99 accuracy 0.8769850091627363
regParam 0.005 elasticNetParam 1.0 accuracy 0.8770751344368672
regParam 0.006 elasticNetParam 0.9 accuracy 0.8764682909243849
regParam 0.006 elasticNetParam 0.95 accuracy 0.8764142157599063
regParam 0.006 elasticNetParam 0.99 accuracy 0.8732538228137111
regParam 0.006 elasticNetParam 1.0 accuracy 0.8792201159611861
regParam 0.007 elasticNetParam 0.9 accuracy 0.8759095142247725
regParam 0.007 elasticNetParam 0.95 accuracy 0.8770571093820411
regParam 0.007 elasticNetParam 0.99 accuracy 0.8770090425691711
regParam 0.007 elasticNetParam 1.0 accuracy 0.8770991678433022


                                                                                

In [61]:
lr = LogisticRegression(featuresCol='features', labelCol='Target', predictionCol='prediction', maxIter=10, regParam=0.005, elasticNetParam=1)
lrModel = lr.fit(trainscaled)

# Print the coefficients and intercept for multinomial logistic regression
print("Coefficients: \n" + str(lrModel.coefficientMatrix))
print("Intercept: " + str(lrModel.interceptVector))

trainingSummary = lrModel.summary

# for multiclass, we can inspect metrics on a per-label basis
print("False positive rate by label:")
for i, rate in enumerate(trainingSummary.falsePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print("True positive rate by label:")
for i, rate in enumerate(trainingSummary.truePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print("Precision by label:")
for i, prec in enumerate(trainingSummary.precisionByLabel):
    print("label %d: %s" % (i, prec))

print("Recall by label:")
for i, rec in enumerate(trainingSummary.recallByLabel):
    print("label %d: %s" % (i, rec))

print("F-measure by label:")
for i, f in enumerate(trainingSummary.fMeasureByLabel()):
    print("label %d: %s" % (i, f))
    
print('-------------------------------------------')

accuracy = trainingSummary.accuracy
falsePositiveRate = trainingSummary.weightedFalsePositiveRate
truePositiveRate = trainingSummary.weightedTruePositiveRate
fMeasure = trainingSummary.weightedFMeasure()
precision = trainingSummary.weightedPrecision
recall = trainingSummary.weightedRecall
print("Accuracy: %s\nFalse Positive Rate: %s\nTrue Positive Rate: %s\nF-measure: %s\nPrecision: %s\nRecall: %s"
      % (accuracy, falsePositiveRate, truePositiveRate, fMeasure, precision, recall))

                                                                                

Coefficients: 
DenseMatrix([[ 0.        ,  0.        ,  0.        ,  0.58721588, -0.65784343,
               0.96010674, -1.05787805, -3.68190649],
             [ 0.        ,  0.        , -0.20279745,  0.11296316,  0.        ,
              -0.38580527,  0.21844675,  0.45753901],
             [ 0.        ,  0.        ,  0.        , -1.31031281,  0.92087076,
              -0.01774288,  0.14474382,  2.52421528]])
Intercept: [0.11355503078084468,1.9840294420973472,-2.097584472878192]
False positive rate by label:
label 0: 0.03509302105959014
label 1: 0.20954725203776936
label 2: 0.007626085931050749
True positive rate by label:
label 0: 0.8362540259311256
label 1: 0.9472141080910857
label 2: 0.5881457865372074
Precision by label:
label 0: 0.9316245606933226
label 1: 0.8484636260176837
label 2: 0.8745824803361707
Recall by label:
label 0: 0.8362540259311256
label 1: 0.9472141080910857
label 2: 0.5881457865372074
F-measure by label:
label 0: 0.8813668488667619
label 1: 0.8951235530744603
label 2: 0.7033186032406205
-------------------------------------------
Accuracy: 0.8770751344368672
False Positive Rate: 0.12934149346527749
True Positive Rate: 0.8770751344368674
F-measure: 0.8742144908645169
Precision: 0.8808813572544187
Recall: 0.8770751344368674
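To make the coefficient matrix concrete, the sketch below scores the first row shown earlier by hand. It assumes that multinomial logistic regression predicts with a softmax over `coefficientMatrix · x + interceptVector`, where `x` is the standardized feature vector. The numbers are copied from the printed scaler statistics, coefficients, and intercepts above; the coefficients are printed rounded, so the probabilities are approximate.

```python
from math import exp

# Values copied from the StandardScaler and LogisticRegression outputs above
means = [2.9993631147295092, 12.929605960284789, 33.50024505662886, 0.20992421065280162,
         81.25461651695906, 9.56258299035666, 0.055488388860515936, 0.18715780935500081]
stds = [0.025228624058689544, 3.0537191316097294, 0.3046101763905132, 0.1554892607565311,
        25.244894938029052, 10.300224447461177, 0.0984950925137126, 0.277056368386772]
coefficients = [
    [0.0, 0.0, 0.0, 0.58721588, -0.65784343, 0.96010674, -1.05787805, -3.68190649],
    [0.0, 0.0, -0.20279745, 0.11296316, 0.0, -0.38580527, 0.21844675, 0.45753901],
    [0.0, 0.0, 0.0, -1.31031281, 0.92087076, -0.01774288, 0.14474382, 2.52421528],
]
intercepts = [0.11355503078084468, 1.9840294420973472, -2.097584472878192]

# First row of features_unscaled shown earlier (its Target is 1)
first_row = [3.0, 15.34, 33.492, 0.0, 102.8, 0.1, 0.0, 0.19]


def predict_proba(raw_row):
    """Standardize the row, compute one margin per class, and apply a softmax."""
    x = [(v - m) / s for v, m, s in zip(raw_row, means, stds)]
    margins = [sum(w * xi for w, xi in zip(ws, x)) + b
               for ws, b in zip(coefficients, intercepts)]
    top = max(margins)  # subtract the maximum for numerical stability
    exps = [exp(m - top) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = predict_proba(first_row)
predicted_label = probabilities.index(max(probabilities))
print(probabilities, predicted_label)
```

Under these assumptions the largest probability falls on label 1, which agrees with the Target value of that row. The zero columns for S_prec and R_TEMP show that the L1 penalty removed those two features from every class.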


                                                                                

In [62]:
predictVal = lrModel.evaluate(valscaled)
accuracy = predictVal.accuracy
falsePositiveRate = predictVal.weightedFalsePositiveRate
truePositiveRate = predictVal.weightedTruePositiveRate
fMeasure = predictVal.weightedFMeasure()
precision = predictVal.weightedPrecision
recall = predictVal.weightedRecall

print("Accuracy: %s\nFalse Positive Rate: %s\nTrue Positive Rate: %s\nF-measure: %s\nPrecision: %s\nRecall: %s"
      % (accuracy, falsePositiveRate, truePositiveRate, fMeasure, precision, recall))



Accuracy: 0.8753404367146065
False Positive Rate: 0.13074069930030524
True Positive Rate: 0.8753404367146065
F-measure: 0.8727169562572579
Precision: 0.8798413340614577
Recall: 0.8753404367146065
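In both outputs, Accuracy, True Positive Rate, and Recall are the same number. That is expected: recall weighted by class size equals overall accuracy. The sketch below derives the per-label and weighted metrics from a confusion matrix to show this; the matrix is hypothetical and the helper names are made up for the illustration.

```python
def per_label_metrics(confusion):
    """confusion[i][j] = number of rows with true label i that were predicted as j."""
    n_labels = len(confusion)
    metrics = []
    for k in range(n_labels):
        true_positives = confusion[k][k]
        actual = sum(confusion[k])  # rows whose true label is k
        predicted = sum(confusion[i][k] for i in range(n_labels))  # rows predicted as k
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1, "support": actual})
    return metrics


def weighted(metrics, key):
    """Average a per-label metric, weighting each label by its number of true rows."""
    total = sum(m["support"] for m in metrics)
    return sum(m[key] * m["support"] for m in metrics) / total


# Hypothetical confusion matrix for labels 0, 1, 2 (not the model's actual results)
confusion = [
    [80, 20, 0],
    [10, 85, 5],
    [0, 10, 15],
]
label_metrics = per_label_metrics(confusion)
overall_accuracy = sum(confusion[k][k] for k in range(3)) / sum(sum(r) for r in confusion)
print("accuracy:", overall_accuracy)
print("weighted recall:", weighted(label_metrics, "recall"))
print("weighted precision:", weighted(label_metrics, "precision"))
```

Weighted precision and weighted F-measure do not collapse to accuracy in the same way, which is why they differ slightly in the outputs above.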


                                                                                

In [84]:
# Logistic Regression with 5-fold Cross Validation (not working yet, left commented out).
# Likely cause: the evaluator defaults to labelCol='label', but the label column here is 'Target'.

# lr = LogisticRegression(featuresCol='features', labelCol='Target', predictionCol='Prediction')

# # lrModel = lr.fit(trainscaled)

# paramGrid = (ParamGridBuilder()
#              .addGrid(lr.regParam, [0.001, 0.01, 0.1, 1.0, 10.0])
#              .addGrid(lr.elasticNetParam, [0.0, 0.25, 0.5, 0.75, 1.0])
#              .addGrid(lr.maxIter, [1, 5, 10, 20, 50])
#              .build())

# # MulticlassClassificationEvaluator reports F1 by default, not area under ROC
# evaluator = MulticlassClassificationEvaluator(labelCol='Target', predictionCol='Prediction', metricName='f1')

# cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5, parallelism=2)
# cvmodel = cv.fit(trainscaled)

# predictTrain=cvmodel.transform(trainscaled)
# predictVal=cvmodel.transform(valscaled)

# print("The F1 score for train set is {}".format(evaluator.evaluate(predictTrain)))
# print("The F1 score for validation set is {}".format(evaluator.evaluate(predictVal)))
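Before running the cross-validation cell, it helps to estimate its cost. The sketch below counts the model fits implied by the parameter grid above, assuming the cross-validator fits one model per parameter combination per fold and then refits the best combination once on the full training set. The variable names are made up for the illustration.

```python
from itertools import product

# Grid values copied from the commented-out ParamGridBuilder call above
reg_params = [0.001, 0.01, 0.1, 1.0, 10.0]
elastic_net_params = [0.0, 0.25, 0.5, 0.75, 1.0]
max_iters = [1, 5, 10, 20, 50]
num_folds = 5

# Every combination of the three parameter lists is one candidate model
grid = list(product(reg_params, elastic_net_params, max_iters))
fits_during_cv = len(grid) * num_folds
total_fits = fits_during_cv + 1  # plus the final refit of the best combination

print("parameter combinations:", len(grid))
print("fits during cross-validation:", fits_during_cv)
print("total fits:", total_fits)
```

With several hundred fits on roughly 166,000 training rows, a smaller grid (for example, the narrow regParam range explored in the manual loop earlier) may be a more practical starting point.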

### AdaBoost Decision Tree

#### Can we increase accuracy from previous steps? Can we figure out which features matter the most?

#### Train and Test Accuracy Scores

#### Kernel SVM

## Data Visualization

#### SVM Decision Boundary Graph

#### AdaBoost Decision Tree Graph
https://www.ashishmenkudale.com/spark-tree-plotting/

# Stop spark session

In [49]:
#spark.stop()