# CalCOFI
### Over 60 years of oceanographic data
#### Dataset: https://www.kaggle.com/datasets/sohier/calcofi
#### Table Info: https://calcofi.org/data/oceanographic-data/bottle-database/

## EDA and Data Preparation

#### Load Data

In [1]:
# Loading in spark
import pyspark
from pyspark.sql import SparkSession

conf = pyspark.SparkConf().setAll([
    ('spark.master', 'local[1]'), 
    ('spark.app.name', 'App Name')])
    
spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark.version

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-05-19 11:52:03,387 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


'3.2.1'

In [2]:
# Read the data
bottle = spark.read.csv("file:///home/work/DSE230_scalable/group_project/bottle.csv", header=True, inferSchema=True).cache() ##leslie's path
# bottle = spark.read.csv("file:///home/work/Final/data/bottle.csv", header=True).cache() ##karina's path
maxRows = bottle.count()
maxRows

2022-05-19 11:52:17,054 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

864863

In [3]:
# View the inferred schema
bottle.printSchema()

root
 |-- Cst_Cnt: integer (nullable = true)
 |-- Btl_Cnt: integer (nullable = true)
 |-- Sta_ID: string (nullable = true)
 |-- Depth_ID: string (nullable = true)
 |-- Depthm: integer (nullable = true)
 |-- T_degC: double (nullable = true)
 |-- Salnty: double (nullable = true)
 |-- O2ml_L: double (nullable = true)
 |-- STheta: double (nullable = true)
 |-- O2Sat: double (nullable = true)
 |-- Oxy_µmol/Kg: double (nullable = true)
 |-- BtlNum: integer (nullable = true)
 |-- RecInd: integer (nullable = true)
 |-- T_prec: integer (nullable = true)
 |-- T_qual: integer (nullable = true)
 |-- S_prec: integer (nullable = true)
 |-- S_qual: integer (nullable = true)
 |-- P_qual: integer (nullable = true)
 |-- O_qual: integer (nullable = true)
 |-- SThtaq: integer (nullable = true)
 |-- O2Satq: integer (nullable = true)
 |-- ChlorA: double (nullable = true)
 |-- Chlqua: integer (nullable = true)
 |-- Phaeop: double (nullable = true)
 |-- Phaqua: integer (nullable = true)
 |-- PO4uM: double (nu

In [4]:
# See a snippet of what this dataframe looks like
bottle.show(2)

+-------+-------+-----------+--------------------+------+------+------+------+------+-----+-----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+-----+----+------+------+-----+----+-----+----+-----+----+------+------+------+------+------+------+------+------+------+------+------+------+------+------+-------+------+--------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+-----+------+-------+------+------+----+----+----+----+----+----+-------------------+
|Cst_Cnt|Btl_Cnt|     Sta_ID|            Depth_ID|Depthm|T_degC|Salnty|O2ml_L|STheta|O2Sat|Oxy_µmol/Kg|BtlNum|RecInd|T_prec|T_qual|S_prec|S_qual|P_qual|O_qual|SThtaq|O2Satq|ChlorA|Chlqua|Phaeop|Phaqua|PO4uM|PO4q|SiO3uM|SiO3qu|NO2uM|NO2q|NO3uM|NO3q|NH3uM|NH3q|C14As1|C14A1p|C14A1q|C14As2|C14A2p|C14A2q|DarkAs|DarkAp|DarkAq|MeanAs|MeanAp|MeanAq|IncTim|LightP|R_Depth|R_TEMP|R_POTEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3|R_PO4|R_NO3|R_NO2|R_NH4|R_CH

#### Remove columns that are not needed

Removing the four string columns because they aren't useful for our purposes
* Sta_ID: Line and Station
* Depth_ID: Uses the Cast_ID prefix ([Century]-[Year][Month][ShipCode]-[CastType][Julian Day]-[CastTime]-[Line][Sta]) but adds three additional variables: [Depth][Bottle]-[Rec_Ind]
* IncTim: Elapsed incubation time of the primary productivity experiment
* DIC Quality Comment: Quality Comment

Also removing the Cast and Bottle counts, which are essentially indexes (identifiers)
* 'Cst_Cnt': Auto-numbered Cast Count - all casts consecutively numbered. 1 is first station done
* 'Btl_Cnt': Auto-numbered Bottle count- all bottles ever sampled, consecutively numbered
* 'BtlNum': Bottle Number

In [5]:
# Dropping unneeded columns and viewing two rows of the resulting dataframe
bottle = bottle.drop(*["Sta_ID","Depth_ID","IncTim","DIC Quality Comment","Cst_Cnt","Btl_Cnt","BtlNum"])
bottle.show(2, truncate=False)

+------+------+------+------+------+-----+-----------+------+------+------+------+------+------+------+------+------+------+------+------+------+-----+----+------+------+-----+----+-----+----+-----+----+------+------+------+------+------+------+------+------+------+------+------+------+------+-------+------+--------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+-----+------+-------+------+------+----+----+----+----+----+----+
|Depthm|T_degC|Salnty|O2ml_L|STheta|O2Sat|Oxy_µmol/Kg|RecInd|T_prec|T_qual|S_prec|S_qual|P_qual|O_qual|SThtaq|O2Satq|ChlorA|Chlqua|Phaeop|Phaqua|PO4uM|PO4q|SiO3uM|SiO3qu|NO2uM|NO2q|NO3uM|NO3q|NH3uM|NH3q|C14As1|C14A1p|C14A1q|C14As2|C14A2p|C14A2q|DarkAs|DarkAp|DarkAq|MeanAs|MeanAp|MeanAq|LightP|R_Depth|R_TEMP|R_POTEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3|R_PO4|R_NO3|R_NO2|R_NH4|R_CHLA|R_PHAEO|R_PRES|R_SAMP|DIC1|DIC2|TA1 |TA2 |pH2 |pH1 |
+------+------+------+------+------+-----+-----------+------+------+------+------+------+-

#### Handle quality values

Removing the quality-code columns because we're using the quantity measurements instead. Two measurement columns, 'Phaeop' and 'PO4uM', are dropped here as well; their reported counterparts R_PHAEO and R_PO4 are kept.
* T_qual: Temperature Quality Code
* S_qual: Salinity Quality Code
* P_qual: Pressure Quality Code
* O_qual: Oxygen Quality Code
* 'O2Satq': Oxygen Saturation Quality Code
* 'Chlqua': Chlorophyll-a Quality Code
* 'Phaeop': Phaeophytin concentration
* 'Phaqua': Phaeophytin Quality Code
* 'PO4uM': Micromoles Phosphate per liter of seawater
* 'PO4q': Phosphate Quality Code
* 'SiO3qu': Silicate Quality Code
* 'NO2q': Nitrite Quality Code
* 'NO3q': Nitrate Quality Code
* 'NH3q': Ammonium Quality Code
* 'C14A1q': 14C As1 Quality Code
* 'C14A2q': 14C As2 Quality Code
* 'DarkAq': 14C Assimilation Dark Bottle Quality Code
* 'MeanAq': Mean 14C Assimilation Quality Code

In [6]:
# Dropping quality/irrelevant columns and viewing two rows of the resulting dataframe
bottle = bottle.drop(*['T_qual','S_qual','P_qual','O_qual','O2Satq','Chlqua','Phaeop','Phaqua','PO4uM','PO4q','SiO3qu','NO2q','NO3q','NH3q','C14A1q','C14A2q','DarkAq','MeanAq'])
bottle.show(2, truncate=False)

+------+------+------+------+------+-----+-----------+------+------+------+------+------+------+-----+-----+-----+------+------+------+------+------+------+------+------+------+-------+------+--------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+-----+------+-------+------+------+----+----+----+----+----+----+
|Depthm|T_degC|Salnty|O2ml_L|STheta|O2Sat|Oxy_µmol/Kg|RecInd|T_prec|S_prec|SThtaq|ChlorA|SiO3uM|NO2uM|NO3uM|NH3uM|C14As1|C14A1p|C14As2|C14A2p|DarkAs|DarkAp|MeanAs|MeanAp|LightP|R_Depth|R_TEMP|R_POTEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3|R_PO4|R_NO3|R_NO2|R_NH4|R_CHLA|R_PHAEO|R_PRES|R_SAMP|DIC1|DIC2|TA1 |TA2 |pH2 |pH1 |
+------+------+------+------+------+-----+-----------+------+------+------+------+------+------+-----+-----+-----+------+------+------+------+------+------+------+------+------+-------+------+--------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+-----+------+-------+------+------+----+----+----+--

#### Remove columns with low data count

In [7]:
# Counting the number of null/NaN rows per column and outputting that in a new dataframe
from pyspark.sql.functions import col, isnan, when, count

def getNullCounts(df):
    # Iterate over the columns of the dataframe passed in, not the global `bottle`
    return df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns])

nullCounter = getNullCounts(bottle)
nullCounter.show()



+------+------+------+------+------+------+-----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+-------+------+--------+----------+-------+-----+-------+------+-------+------+------+------+------+------+------+-------+------+------+------+------+------+------+------+------+
|Depthm|T_degC|Salnty|O2ml_L|STheta| O2Sat|Oxy_µmol/Kg|RecInd|T_prec|S_prec|SThtaq|ChlorA|SiO3uM| NO2uM| NO3uM| NH3uM|C14As1|C14A1p|C14As2|C14A2p|DarkAs|DarkAp|MeanAs|MeanAp|LightP|R_Depth|R_TEMP|R_POTEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|  R_O2|R_O2Sat|R_SIO3| R_PO4| R_NO3| R_NO2| R_NH4|R_CHLA|R_PHAEO|R_PRES|R_SAMP|  DIC1|  DIC2|   TA1|   TA2|   pH2|   pH1|
+------+------+------+------+------+------+-----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+-------+------+--------+----------+-------+-----+-------+------+-------+------+------+------+------

                                                                                

In [8]:
# Deleting columns with fewer than 200000 non-nulls
thresh = maxRows - 200000
# Collect the single row of null counts once instead of running one job per column
nullCounts = nullCounter.first().asDict()
sparseCols = [name for name, nulls in nullCounts.items() if nulls > thresh]
bottle = bottle.drop(*sparseCols)

bottle.show(2, truncate=False)

+------+------+------+------+------+-----+-----------+------+------+------+------+------+-----+-----+-------+------+--------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|Depthm|T_degC|Salnty|O2ml_L|STheta|O2Sat|Oxy_µmol/Kg|RecInd|T_prec|S_prec|ChlorA|SiO3uM|NO2uM|NO3uM|R_Depth|R_TEMP|R_POTEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3|R_PO4|R_NO3|R_NO2|R_CHLA|R_PHAEO|R_PRES|
+------+------+------+------+------+-----+-----------+------+------+------+------+------+-----+-----+-------+------+--------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|0     |10.5  |33.44 |null  |25.649|null |null       |3     |1     |2     |null  |null  |null |null |0.0    |10.5  |10.5    |33.44     |25.64  |233.0|0.0    |null|null   |null  |null |null |null |null  |null   |0     |
|8     |10.46 |33.44 |null  |25.656|null |null       |3     |2     |2     |null  |null  |null |null |8.0    |10.46 |10.46   
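
To make the threshold concrete, here is a small pure-Python sketch of the same rule. It uses the row count reported above (864863) and the null counts that the notebook reports for the two chlorophyll columns; `columns_to_drop` and the `Sparse_Example` entry are hypothetical names added only for illustration.

```python
# Worked example of the sparsity rule: a column is dropped when its null
# count exceeds maxRows - 200000, i.e. it has fewer than 200000 non-nulls.
MAX_ROWS = 864863        # row count reported by bottle.count()
MIN_NON_NULL = 200000    # minimum number of usable values per column


def columns_to_drop(null_counts, max_rows=MAX_ROWS, min_non_null=MIN_NON_NULL):
    """Return the columns whose null count is above the threshold."""
    thresh = max_rows - min_non_null  # 664863 for this dataset
    return [name for name, nulls in null_counts.items() if nulls > thresh]


null_counts = {
    "ChlorA": 639591,          # from the notebook output
    "R_CHLA": 639587,          # from the notebook output
    "Sparse_Example": 700000,  # hypothetical column, for illustration only
}

print(columns_to_drop(null_counts))  # ['Sparse_Example']
```

Both chlorophyll columns survive the cut: R_CHLA has 864863 - 639587 = 225276 non-null values, just above the 200000 minimum.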

#### Remove null values from the chlorophyll column

We also get rid of the redundant second chlorophyll column.

In [9]:
# Showing that the two candidate target columns are almost perfectly correlated
bottle.corr("ChlorA","R_CHLA")

0.9999995087108701

Because the two columns are almost perfectly correlated, we only need one of them and can delete the other. We then drop every row where R_CHLA is null, since it's our target.

In [10]:
# Looking at which of the two target columns has more NaNs in order to select which to delete
nullCounter.select(*["ChlorA","R_CHLA"]).show()

+------+------+
|ChlorA|R_CHLA|
+------+------+
|639591|639587|
+------+------+



In [11]:
# Delete the ChlorA column and drop all NaN rows in the R_CHLA column
bottle = bottle.drop("ChlorA")
bottle = bottle.dropna(subset="R_CHLA")

bottle.show(2, truncate=False)

+------+------+------+------+------+-----+-----------+------+------+------+------+-----+-----+-------+------+--------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|Depthm|T_degC|Salnty|O2ml_L|STheta|O2Sat|Oxy_µmol/Kg|RecInd|T_prec|S_prec|SiO3uM|NO2uM|NO3uM|R_Depth|R_TEMP|R_POTEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3|R_PO4|R_NO3|R_NO2|R_CHLA|R_PHAEO|R_PRES|
+------+------+------+------+------+-----+-----------+------+------+------+------+-----+-----+-------+------+--------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|0     |19.23 |34.491|5.46  |24.575|103.9|237.9949   |3     |2     |3     |null  |null |1.3  |0.0    |19.23 |19.23   |34.491    |24.57  |335.3|0.0    |5.46|103.9  |null  |null |1.3  |null |0.64  |0.47   |0     |
|10    |19.22 |34.492|5.46  |24.578|103.9|237.9942   |3     |2     |3     |null  |null |1.9  |10.0   |19.22 |19.22   |34.492    |24.57  |335.3|0.03   |5

#### Remove "duplicate" columns

As we can see from the columns above, some columns are essentially duplicates of each other, differing only in units of measurement or in how they were reported. Below are descriptions of the remaining columns, grouped to show which attributes to compare for potential deletion:

In [12]:
# See what columns we still have
print(bottle.columns)

['Depthm', 'T_degC', 'Salnty', 'O2ml_L', 'STheta', 'O2Sat', 'Oxy_µmol/Kg', 'RecInd', 'T_prec', 'S_prec', 'SiO3uM', 'NO2uM', 'NO3uM', 'R_Depth', 'R_TEMP', 'R_POTEMP', 'R_SALINITY', 'R_SIGMA', 'R_SVA', 'R_DYNHT', 'R_O2', 'R_O2Sat', 'R_SIO3', 'R_PO4', 'R_NO3', 'R_NO2', 'R_CHLA', 'R_PHAEO', 'R_PRES']


Chlorophyll
* 'ChlorA': Acetone extracted chlorophyll-a measured fluorometrically
* 'R_CHLA': Reported Chlorophyll-a

Depth
* 'Depthm': Depth in meters
* 'R_Depth': Reported Depth (from pressure) in meters

Water density
* 'STheta': Potential Density of Water
* 'R_SIGMA': Reported Potential Density of water

Silicate
* 'SiO3uM': Micromoles Silicate per liter of seawater
* 'R_SIO3': Reported Silicate Concentration

Nitrite
* 'NO2uM': Micromoles Nitrite per liter of seawater
* 'R_NO2': Reported Nitrite Concentration

Nitrate
* 'NO3uM': Micromoles Nitrate per liter of seawater
* 'R_NO3': Reported Nitrate Concentration

Salinity
* 'Salnty': Practical Salinity Scale, 1978 (UNESCO, 1981a); Salinity of water
* 'R_SALINITY': Reported Salinity (from Specific Volume Anomaly, M³/Kg)

O2 saturation
* 'O2Sat': Percent Saturation; Oxygen Saturation
* 'R_O2Sat': Reported Oxygen Saturation (percent)

Oxygen
* 'O2ml_L': Oxygen in mL/L; Milliliters of dissolved oxygen per Liter seawater
* 'Oxy_µmol/Kg': Oxygen in micro moles per kilogram of seawater
* 'R_O2': Reported milliliters of oxygen per liter of seawater

Temperature
* 'T_degC': Temperature of Water
* 'R_TEMP': Reported Temperature (Celsius)
* 'R_POTEMP': Reported Potential Temperature (Celsius)

Etc.
* 'S_prec': Salinity Units of Precision
* 'T_prec': Temperature Units of Precision
* 'RecInd': Record Indicator
* 'R_SVA': Reported Specific Volume Anomaly
* 'R_DYNHT': Reported Dynamic Height
* 'R_PO4': Reported Phosphate Concentration
* 'R_PHAEO': Reported Phaeophytin
* 'R_PRES': Pressure in decibars

In [13]:
# Setting up a function that will streamline the comparing process
import matplotlib.pyplot as plt

nullCounter = getNullCounts(bottle)

def psudoDuplicateCheck(attributes):
    nullCounter.select(*attributes).show()

    if len(attributes)==2:
        print("correlation:", bottle.corr(attributes[0],attributes[1]))
    elif len(attributes)==3:
        print("1&2 correlation:", bottle.corr(attributes[0],attributes[1]))
        print("2&3 correlation:", bottle.corr(attributes[1],attributes[2]))
        print("1&3 correlation:", bottle.corr(attributes[0],attributes[2]))

In [14]:
nullCounter.show()

+------+------+------+------+------+-----+-----------+------+------+------+------+-----+-----+-------+------+--------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|Depthm|T_degC|Salnty|O2ml_L|STheta|O2Sat|Oxy_µmol/Kg|RecInd|T_prec|S_prec|SiO3uM|NO2uM|NO3uM|R_Depth|R_TEMP|R_POTEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3|R_PO4|R_NO3|R_NO2|R_CHLA|R_PHAEO|R_PRES|
+------+------+------+------+------+-----+-----------+------+------+------+------+-----+-----+-------+------+--------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|     0|  3088|  3433|  3610|  3781| 4500|       4502|     0|  3088|  3433|  8947|14951| 9573|      0|  3088|    3508|      3433|   3655| 3630|   3527|3610|   4352|  8941|10322| 9567|14945|     0|      5|     0|
+------+------+------+------+------+-----+-----------+------+------+------+------+-----+-----+-------+------+--------+----------+-------+-----+-------+-

In [15]:
Depths = ["Depthm","R_Depth"]
print("Depths NaN count:")
psudoDuplicateCheck(Depths)

WD = ["STheta","R_SIGMA"]
print("\nWater density NaN count:")
psudoDuplicateCheck(WD)

Silicate = ["SiO3uM","R_SIO3"]
print("\nSilicate NaN count:")
psudoDuplicateCheck(Silicate)

Nitrite = ["NO2uM","R_NO2"]
print("\nNitrite NaN count:")
psudoDuplicateCheck(Nitrite)

Nitrate = ["NO3uM","R_NO3"]
print("\nNitrate NaN count:")
psudoDuplicateCheck(Nitrate)

Salinity = ["Salnty","R_SALINITY"]
print("\nSalinity NaN count:")
psudoDuplicateCheck(Salinity)

Saturation = ["O2Sat","R_O2Sat"]
print("\nO2 Saturation NaN count:")
psudoDuplicateCheck(Saturation)

Depths NaN count:
+------+-------+
|Depthm|R_Depth|
+------+-------+
|     0|      0|
+------+-------+

correlation: 0.9999999949168985

Water density NaN count:
+------+-------+
|STheta|R_SIGMA|
+------+-------+
|  3781|   3655|
+------+-------+

correlation: 0.9775619506024495

Silicate NaN count:
+------+------+
|SiO3uM|R_SIO3|
+------+------+
|  8947|  8941|
+------+------+

correlation: 0.9999991048864016

Nitrite NaN count:
+-----+-----+
|NO2uM|R_NO2|
+-----+-----+
|14951|14945|
+-----+-----+

correlation: 0.9999753963583337

Nitrate NaN count:
+-----+-----+
|NO3uM|R_NO3|
+-----+-----+
| 9573| 9567|
+-----+-----+

correlation: 0.9999998732099501

Salinity NaN count:
+------+----------+
|Salnty|R_SALINITY|
+------+----------+
|  3433|      3433|
+------+----------+

correlation: 0.9999999893049997

O2 Saturation NaN count:
+-----+-------+
|O2Sat|R_O2Sat|
+-----+-------+
| 4500|   4352|
+-----+-------+

correlation: 0.9903641742840068


In [16]:
Oxygen = ["O2ml_L","Oxy_µmol/Kg","R_O2"]
print("Oxygen NaN count:")
psudoDuplicateCheck(Oxygen)

Temperature = ["T_degC","R_TEMP","R_POTEMP"]
print("Temperature:")
psudoDuplicateCheck(Temperature)

Oxygen NaN count:
+------+-----------+----+
|O2ml_L|Oxy_µmol/Kg|R_O2|
+------+-----------+----+
|  3610|       4502|3610|
+------+-----------+----+

1&2 correlation: 0.9727492394078366
2&3 correlation: 0.9727398909191203
1&3 correlation: 0.9999894839294193
Temperature:
+------+------+--------+
|T_degC|R_TEMP|R_POTEMP|
+------+------+--------+
|  3088|  3088|    3508|
+------+------+--------+

1&2 correlation: 0.9999999632664396
2&3 correlation: 0.9788144336830903
1&3 correlation: 0.9788143110817276


For each pair with a correlation coefficient of 0.99 or higher, we now delete the column with more NaNs (on a tie, we keep the reported `R_` column).
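
The drop list in the next cell is consistent with a simple rule: for each pair at or above 0.99, drop the column with more NaNs, and on a tie keep the reported `R_` column. A minimal sketch of that rule follows, using the NaN counts and correlations printed above; `PAIRS` and `pick_drops` are hypothetical names, and the tie-breaking choice is inferred from the drop list, not stated by the notebook.

```python
# (first column, second column): (NaNs in first, NaNs in second, Spark correlation)
# All numbers are copied from the outputs above.
PAIRS = {
    ("Depthm", "R_Depth"): (0, 0, 0.9999999949168985),
    ("STheta", "R_SIGMA"): (3781, 3655, 0.9775619506024495),
    ("SiO3uM", "R_SIO3"): (8947, 8941, 0.9999991048864016),
    ("NO2uM", "R_NO2"): (14951, 14945, 0.9999753963583337),
    ("NO3uM", "R_NO3"): (9573, 9567, 0.9999998732099501),
    ("Salnty", "R_SALINITY"): (3433, 3433, 0.9999999893049997),
    ("O2Sat", "R_O2Sat"): (4500, 4352, 0.9903641742840068),
}


def pick_drops(pairs, threshold=0.99):
    """Return one column to drop for every pair at or above the threshold."""
    drops = []
    for (first, second), (nulls_first, nulls_second, corr) in pairs.items():
        if corr < threshold:
            continue  # not similar enough to treat as duplicates
        # Drop the column with more NaNs; on a tie, keep the reported (second) one.
        drops.append(first if nulls_first >= nulls_second else second)
    return drops


print(pick_drops(PAIRS))
# ['Depthm', 'SiO3uM', 'NO2uM', 'NO3uM', 'Salnty', 'O2Sat']
```

The water density pair is skipped by this rule because its Spark correlation (0.9776) is below the threshold.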

("STheta", "T_degC", "R_POTEMP", "O2ml_L", and "Oxy_µmol/Kg" should also be deleted, so they are included in the drop below, even though Spark's corr gives some of these pairs lower scores than pandas' corr did and we haven't been able to plot them to confirm.)
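
One possible explanation for the gap between the two libraries (a hypothesis, not something verified in this notebook) is null handling. pandas' `corr` skips rows where either value is missing, and the Spark scores above fall noticeably below 1 only for pairs whose NaN counts differ by more than a handful. The toy sketch below uses hypothetical data and helper names to show how far a single missing value can move a Pearson correlation when it is counted as zero instead of being skipped.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def pearson_pairwise_complete(xs, ys):
    """Skip every row where either value is missing (pandas-style)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    return pearson([p[0] for p in pairs], [p[1] for p in pairs])


def pearson_nulls_as_zero(xs, ys):
    """Treat missing values as 0.0 (the hypothesised failure mode)."""
    fill = lambda values: [0.0 if v is None else v for v in values]
    return pearson(fill(xs), fill(ys))


# Hypothetical toy columns: identical except for one missing value in col_a
col_a = [24.5, 25.0, None, 26.0, 26.5]
col_b = [24.5, 25.0, 25.5, 26.0, 26.5]

print(pearson_pairwise_complete(col_a, col_b))  # essentially 1.0
print(pearson_nulls_as_zero(col_a, col_b))      # far below 1.0
```

If that is the cause, computing `bottle.dropna(subset=[a, b]).corr(a, b)` for a pair of column names `a` and `b` would give a like-for-like comparison with pandas.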

In [17]:
# Deleting columns that are duplicates of some other column
bottle = bottle.drop(*["Depthm","SiO3uM","NO2uM","NO3uM","Salnty","O2Sat", "STheta","T_degC","R_POTEMP","O2ml_L","Oxy_µmol/Kg"])
bottle.show(2, truncate=False)

+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|RecInd|T_prec|S_prec|R_Depth|R_TEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3|R_PO4|R_NO3|R_NO2|R_CHLA|R_PHAEO|R_PRES|
+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|3     |2     |3     |0.0    |19.23 |34.491    |24.57  |335.3|0.0    |5.46|103.9  |null  |null |1.3  |null |0.64  |0.47   |0     |
|3     |2     |3     |10.0   |19.22 |34.492    |24.57  |335.3|0.03   |5.46|103.9  |null  |null |1.9  |null |0.66  |0.38   |10    |
+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
only showing top 2 rows



In [18]:
getNullCounts(bottle).show()

+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|RecInd|T_prec|S_prec|R_Depth|R_TEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3|R_PO4|R_NO3|R_NO2|R_CHLA|R_PHAEO|R_PRES|
+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|     0|  3088|  3433|      0|  3088|      3433|   3655| 3630|   3527|3610|   4352|  8941|10322| 9567|14945|     0|      5|     0|
+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+



#### Fill null values

In [19]:
# Using the function from PA3
from pyspark.ml.feature import Imputer

def fill_na(df, strategy):    
    imputer = Imputer(
        strategy=strategy,
        inputCols=df.columns, 
        outputCols=["{}_imputed".format(c) for c in df.columns]
    )
    
    new_df = imputer.fit(df).transform(df)
    
    # Select the newly created columns with all filled values
    new_df = new_df.select([c for c in new_df.columns if "imputed" in c])
    
    for col in new_df.columns:
        new_df = new_df.withColumnRenamed(col, col.split("_imputed")[0])
        
    return new_df

In [20]:
# Filling in the remaining NaN rows with the mean of the columns
bottle_full = fill_na(bottle, 'mean')
bottle_full.show(2, truncate=False)

+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------------------+------------------+-----+-------------------+------+-------+------+
|RecInd|T_prec|S_prec|R_Depth|R_TEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3            |R_PO4             |R_NO3|R_NO2              |R_CHLA|R_PHAEO|R_PRES|
+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------------------+------------------+-----+-------------------+------+-------+------+
|3     |2     |3     |0.0    |19.23 |34.491    |24.57  |335.3|0.0    |5.46|103.9  |11.529812096979274|0.9506145966112527|1.3  |0.05539378408325813|0.64  |0.47   |0     |
|3     |2     |3     |10.0   |19.22 |34.492    |24.57  |335.3|0.03   |5.46|103.9  |11.529812096979274|0.9506145966112527|1.9  |0.05539378408325813|0.66  |0.38   |10    |
+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------------------+------------------+-----+-------------------+---

In [21]:
# Double checking that all NaN values are filled
getNullCounts(bottle_full).show()

+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|RecInd|T_prec|S_prec|R_Depth|R_TEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3|R_PO4|R_NO3|R_NO2|R_CHLA|R_PHAEO|R_PRES|
+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+
|     0|     0|     0|      0|     0|         0|      0|    0|      0|   0|      0|     0|    0|    0|    0|     0|      0|     0|
+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------+-----+-----+-----+------+-------+------+



#### Lasso

(incomplete)

spark lasso: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.mllib.regression.LassoWithSGD.html<br>
Warning: **"Use [pyspark.ml.regression.LinearRegression](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.regression.LinearRegression.html) with elasticNetParam = 1.0. Note the default regParam is 0.01 for LassoWithSGD, but is 0.0 for LinearRegression."**

Use LinearRegression: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.regression.LinearRegression.html

We need to create the "features" column first (as in PA3, where NaNs were filled with the mean before creating it), but we're not sure whether it needs to be scaled before being fed to lasso.

In [22]:
# Rename the target chlorophyll column to 'label'
bottle_full = bottle_full.withColumnRenamed('R_CHLA', 'label')

In [23]:
# Creating 'features' column
from pyspark.ml.feature import VectorAssembler

attributes = bottle_full.columns
attributes.remove('label')
# bottle_full = VectorAssembler(outputCol="features_unscaled").setInputCols(attributes).transform(bottle_full)
bottle_full = VectorAssembler(outputCol="features").setInputCols(attributes).transform(bottle_full)
bottle_full.show(2, truncate=False)

+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------------------+------------------+-----+-------------------+-----+-------+------+-----------------------------------------------------------------------------------------------------------------------------------+
|RecInd|T_prec|S_prec|R_Depth|R_TEMP|R_SALINITY|R_SIGMA|R_SVA|R_DYNHT|R_O2|R_O2Sat|R_SIO3            |R_PO4             |R_NO3|R_NO2              |label|R_PHAEO|R_PRES|features                                                                                                                           |
+------+------+------+-------+------+----------+-------+-----+-------+----+-------+------------------+------------------+-----+-------------------+-----+-------+------+-----------------------------------------------------------------------------------------------------------------------------------+
|3     |2     |3     |0.0    |19.23 |34.491    |24.57  |335.3|0.0    |5.46|103.9  |11.52981209697

                                                                                

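(I'm unsure whether we need to scale this ourselves. LinearRegression is called below with standardization=True, which standardizes the training features internally before fitting.)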

In [27]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(elasticNetParam = 1.0, regParam=0.01, solver="normal", maxIter=1000, standardization=True)
                     # defaults are used: featuresCol="features", labelCol="label" (R_CHLA was renamed to 'label')
model = lr.fit(bottle_full)

                                                                                

In [25]:
print("when regParams=0.00, coeffs are", model.coefficients)
# DenseVector([-0.0029, 0.0189, 0.0311, -0.2704, -0.1755, 1.2258, 2.1681, 0.0271, -1.0676, 0.2813, 0.0188, 0.0153, 0.0011, 0.0111, -0.6229, 2.2373, 0.2685])

when regParams=0.00, coeffs are [-0.0028849503754668883,0.018886041823642955,0.0311333206218953,-0.2704313289772785,-0.17550356685046917,1.225787321677819,2.16812545405269,0.027052162668178018,-1.0676198887434958,0.2813083291742565,0.018804132634568323,0.01530331973991504,0.001051673679584693,0.01107009558676124,-0.6229432964290584,2.2373433348073,0.26853185179367056]


In [28]:
print("when regParams=0.01, coeffs are", model.coefficients)
# when regParams=0.01, coeffs are [0.0,0.0,0.0,0.0,-0.04478949599997118,0.8326928677863764,0.0,0.0,-0.8604647411995379,0.4312145597551872,0.0,0.014621547833334166,0.0,0.0,-0.5361579956416976,2.2177784829776312,0.0]

when regParams=0.01, coeffs are [0.0,0.0,0.0,0.0,-0.04478949599997118,0.8326928677863764,0.0,0.0,-0.8604647411995379,0.4312145597551872,0.0,0.014621547833334166,0.0,0.0,-0.5361579956416976,2.2177784829776312,0.0]


In [29]:
# need to figure out a non-numpy way of getting these 7 nonzero columns, but by hand:
usefulFeatures = ['R_TEMP','R_SALINITY','R_DYNHT','R_O2','R_SIO3','R_NO2','R_PHAEO']

print("When regParam=0.01, this lasso model indicates the most useful columns to predicting chlorophyll are:", usefulFeatures)

When regParam=0.01, this lasso model indicates the most useful columns to predicting chlorophyll are: ['R_TEMP', 'R_SALINITY', 'R_DYNHT', 'R_O2', 'R_SIO3', 'R_NO2', 'R_PHAEO']


Here are the 7 "most useful" features, but I haven't pruned the "useless" ones from the dataset yet.

Please feel free to experiment with the regParam value: it sets the strength of the L1 penalty, so larger values drive more coefficients to exactly zero and leave fewer columns marked as useful.
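
The hand-written list above can also be derived without importing numpy. The sketch below pairs each feature name with its coefficient and keeps the nonzero ones; `nonzero_features` is a hypothetical helper, and the name order assumes the features were assembled in the column order shown in the dataframe output above (every column except 'label').

```python
def nonzero_features(names, coefficients, tol=0.0):
    """Return the feature names whose coefficient magnitude exceeds tol."""
    return [name for name, coef in zip(names, coefficients) if abs(coef) > tol]


# Feature order as passed to VectorAssembler (all columns except 'label')
attributes = ['RecInd', 'T_prec', 'S_prec', 'R_Depth', 'R_TEMP', 'R_SALINITY',
              'R_SIGMA', 'R_SVA', 'R_DYNHT', 'R_O2', 'R_O2Sat', 'R_SIO3',
              'R_PO4', 'R_NO3', 'R_NO2', 'R_PHAEO', 'R_PRES']

# Coefficients printed above for regParam=0.01
coefficients = [0.0, 0.0, 0.0, 0.0, -0.04478949599997118, 0.8326928677863764,
                0.0, 0.0, -0.8604647411995379, 0.4312145597551872, 0.0,
                0.014621547833334166, 0.0, 0.0, -0.5361579956416976,
                2.2177784829776312, 0.0]

usefulFeatures = nonzero_features(attributes, coefficients)
print(usefulFeatures)
# ['R_TEMP', 'R_SALINITY', 'R_DYNHT', 'R_O2', 'R_SIO3', 'R_NO2', 'R_PHAEO']
```

In the notebook itself, the second argument could be `model.coefficients.toArray().tolist()` in place of the pasted list, so the selection updates automatically whenever regParam changes.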

#### Make sure column values are the right type

In [30]:
# Check the dtypes of each column
bottle_full.printSchema()

root
 |-- RecInd: integer (nullable = true)
 |-- T_prec: integer (nullable = true)
 |-- S_prec: integer (nullable = true)
 |-- R_Depth: double (nullable = true)
 |-- R_TEMP: double (nullable = true)
 |-- R_SALINITY: double (nullable = true)
 |-- R_SIGMA: double (nullable = true)
 |-- R_SVA: double (nullable = true)
 |-- R_DYNHT: double (nullable = true)
 |-- R_O2: double (nullable = true)
 |-- R_O2Sat: double (nullable = true)
 |-- R_SIO3: double (nullable = true)
 |-- R_PO4: double (nullable = true)
 |-- R_NO3: double (nullable = true)
 |-- R_NO2: double (nullable = true)
 |-- label: double (nullable = true)
 |-- R_PHAEO: double (nullable = true)
 |-- R_PRES: integer (nullable = true)
 |-- features: vector (nullable = true)



In [31]:
# Row count, and comparison to original row count
usableRows = bottle_full.count()
print(usableRows, "are usable, which is", maxRows-usableRows, "less than the original row count.")

225276 are usable, which is 639587 less than the original row count.


#### Summary Statistics and graphs

In [32]:
# See the detailed numerical description of the current dataframe
bottle_full.describe().show()



+-------+------------------+-------------------+-----------------+-----------------+------------------+------------------+------------------+-----------------+-------------------+------------------+-----------------+------------------+------------------+------------------+-------------------+-------------------+-------------------+-----------------+
|summary|            RecInd|             T_prec|           S_prec|          R_Depth|            R_TEMP|        R_SALINITY|           R_SIGMA|            R_SVA|            R_DYNHT|              R_O2|          R_O2Sat|            R_SIO3|             R_PO4|             R_NO3|              R_NO2|              label|            R_PHAEO|           R_PRES|
+-------+------------------+-------------------+-----------------+-----------------+------------------+------------------+------------------+-----------------+-------------------+------------------+-----------------+------------------+------------------+------------------+-------------------+---

                                                                                

## Questions

#### Which factors are chlorophyll levels most dependent on?
#### Can we use these factors as features in a model that accurately predicts chlorophyll levels in a marine ecosystem?
#### Can these few, simple, measurable factors we found indicate whether an ocean ecosystem is healthy and sustainable based on the chlorophyll abundance measured?
#### Can we predict chlorophyll abundance without having to measure it?

## Feature Selection and Target Cleaning

#### Distribution of data for chlorophyll levels, summary stats/graphs

#### Define chlorophyll category levels: low, medium, high?

#### Map chlorophyll levels to defined categories and output graphs/summary stats

#### Add target column with category label to dataframe

#### PCA columns that are duplicates

#### Heat Map to see any less obvious columns that are highly correlated

#### If necessary, lasso to find the most important 10 features (columns)

#### Select features to be used for training

## Models

#### Scale Data

#### Split data into 80/10/10 train, test, validation set

### Logistic Regression

#### Cross Validation

#### Check to see if data can be used to predict the labels

#### Train and Test Accuracy Scores

### AdaBoost Decision Tree

#### Can we increase accuracy from previous steps? Can we figure out which features matter the most?

#### Train and Test Accuracy Scores

### Kernel SVM

## Data Visualization

#### SVM Decision Boundary Graph

#### AdaBoost Decision Tree Graph
https://www.ashishmenkudale.com/spark-tree-plotting/

# Stop spark session

In [33]:
spark.stop()