# Soy blossom period prediction model

## Setup

### Configure environment

In [1]:
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa > /dev/null 2>&1
!apt-get update -qq > /dev/null 2>&1
!apt-get -y install -qq google-drive-ocamlfuse fuse
from google.colab import auth
auth.authenticate_user()
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}

Preconfiguring packages ...
Selecting previously unselected package cron.
(Reading database ... 18408 files and directories currently installed.)
Preparing to unpack .../00-cron_3.0pl1-128ubuntu5_amd64.deb ...
Unpacking cron (3.0pl1-128ubuntu5) ...
Selecting previously unselected package libapparmor1:amd64.
Preparing to unpack .../01-libapparmor1_2.11.0-2ubuntu17.1_amd64.deb ...
Unpacking libapparmor1:amd64 (2.11.0-2ubuntu17.1) ...
Selecting previously unselected package libdbus-1-3:amd64.
Preparing to unpack .../02-libdbus-1-3_1.10.22-1ubuntu1_amd64.deb ...
Unpacking libdbus-1-3:amd64 (1.10.22-1ubuntu1) ...
Selecting previously unselected package dbus.
Preparing to unpack .../03-dbus_1.10.22-1ubuntu1_amd64.deb ...
Unpacking dbus (1.10.22-1ubuntu1) ...
Selecting previously unselected package dirmngr.
Preparing to unpack .../04-dirmngr_2.1.15-1ubuntu8.1_amd64.deb ...
Unpacking dirmngr (2.1.15-1ubuntu8.1) ...
Selecting previously unselected package distro-info-data.
Preparing to unpack .

In [0]:
!cd sample_data && rm * && cd .. && rmdir sample_data

In [0]:
!mkdir -p drive
!google-drive-ocamlfuse drive

### Show the machine details the environment is running on

Show CPU info

In [4]:
!cat /proc/cpuinfo

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 63
model name	: Intel(R) Xeon(R) CPU @ 2.30GHz
stepping	: 0
microcode	: 0x1
cpu MHz		: 2300.000
cache size	: 46080 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms xsaveopt
bugs		: cpu_meltdown spectre_v1 spectre_v2 l1tf
bogomips	: 4600.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 63
model name	: Intel(R) Xeon(R) CPU @

Show RAM info

In [5]:
!cat /proc/meminfo

MemTotal:       13335196 kB
MemFree:         2355964 kB
MemAvailable:   12373028 kB
Buffers:          170988 kB
Cached:          9767516 kB
SwapCached:            0 kB
Active:           699756 kB
Inactive:        9522276 kB
Active(anon):     325968 kB
Inactive(anon):   186080 kB
Active(file):     373788 kB
Inactive(file):  9336196 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:               516 kB
Writeback:             0 kB
AnonPages:        283620 kB
Mapped:           166064 kB
Shmem:            254972 kB
Slab:             671700 kB
SReclaimable:     633220 kB
SUnreclaim:        38480 kB
KernelStack:        3680 kB
PageTables:         5092 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     6667596 kB
Committed_AS:    2049796 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
AnonHugePag

Show GPU info

In [6]:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 257620791599394521, name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 11281989632
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 5242550902197982300
 physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7"]

### Import modules

In [0]:
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as K
from sklearn.metrics import r2_score as r2_metric

### Set up the dataset

Download the dataset using PyDrive, a wrapper around the Google Drive API. First we authenticate to Google Drive, then we download the files, which are stored on the machine the notebook is running on.

In [0]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# 2. Download the dataset and the translator module.
downloaded = drive.CreateFile({'id':'15x-A5dxw7VXflscEoz4YiJaHgXWFUd-8'})
downloaded.GetContentFile('soydata.csv')

downloaded = drive.CreateFile({'id':'1HxgKwE-cgQJu0jUdfo67OfbnWOy-XDtO'})
downloaded.GetContentFile('translator_module.py')

Import the dataset into a pandas DataFrame. The dataset has a missing value in one of the key features the model will train on, so we explicitly drop the rows that contain missing values.

In [0]:
df = pd.read_csv('soydata.csv', header=0, index_col=0)
df.dropna(inplace=True)
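
Before dropping rows, it can help to see how many values are actually missing in each column. The following is a minimal sketch on a made-up frame; the column names `temp` and `days` are hypothetical and are not the ones used in `soydata.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real dataset has different columns.
toy = pd.DataFrame({"temp": [21.9, np.nan, 11.4],
                    "days": [58.0, 38.0, 42.0]})

# Count missing values per column before removing anything.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value.
clean = toy.dropna()
print(len(toy), "->", len(clean), "rows")
```

Running the same `isna().sum()` check on `df` before `dropna` would show which feature is affected and how many rows are lost.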

The data is now loaded, but we will tweak the header so that the feature names are more readable. The file "translator_module.py" contains a dictionary with both short and long names for each label in the initial dataset. We attach the short names to the columns for use in code, and use the long names to show the columns in plain English when printing.

In [0]:
from collections import defaultdict
from translator_module import ConstantsTranslator as translator
from translator_module import labels_translator as get_labels

verbose_names = translator.explanation_dict
def verbose_print(frame):
    # Print the frame with the long, human-readable column names.
    print(frame.rename(columns=verbose_names))
df.columns = get_labels(df.columns.values.tolist())

Now the dataframe looks like this:

In [0]:
verbose_print(df)

### Feature extraction

Shuffle the data

In [0]:
df = df.sample(frac=1).reset_index(drop=True)
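
`df.sample(frac=1)` produces a different row order on every run, so the later train/test split, and with it the reported metrics, will vary between runs. If reproducibility matters, `DataFrame.sample` accepts a `random_state` seed. The sketch below shows this on a toy frame; the seed value 42 and the column name `value` are arbitrary choices for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"value": range(10)})

# The same seed yields the same permutation on every run.
first = toy.sample(frac=1, random_state=42).reset_index(drop=True)
second = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(first.equals(second))
```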

Select features

In [0]:
features = [translator.AVG_DLEN_SPR20,
            translator.AVG_TEMP_SOWSPR,
            translator.TIME_SOWSPR,
            translator.AVG_RAIN_SOW10]
data = df[features].values
target = df[translator.TIME_SOWBLOS].values

Here, data is a NumPy matrix containing only the features the model needs, and target is a NumPy vector with the output values.

In [42]:
#@title Select number of rows to display { run: "auto", form-width: "30%" }
num_entries = 3 #@param {type:"slider", min:0, max:5, step:1}
print("Data:  ", data.shape, "Showing only", num_entries, "\n", data[0:num_entries], "\n")
print("Target:", target.shape, "Showing only", num_entries, "\n", target[0:num_entries], "\n")

Data:   (394, 4) Showing only 3 
 [[ 0.718      21.98333333  8.         15.1       ]
 [ 0.76527083 14.5969697  11.         18.2       ]
 [ 0.76775    11.46363636 11.         34.1       ]] 

Target: (394,) Showing only 3 
 [58. 38. 42.] 



Split the data for training and testing

In [68]:
#@title Set the desired partition of test and train samples { run: "auto", form-width: "30%", display-mode: "both" }
border = 157 #@param {type:"slider", min:2, max:393, step:1}
train_input, train_output = data[:border], target[:border]
test_input, test_output = data[border:], target[border:]

print("Train set contains", len(train_input))
print("Test set contains", len(test_input))

Train set contains 157
Test set contains 237
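
With `border = 157`, about 40% of the 394 rows are used for training and about 60% for testing, which is the reverse of the more usual arrangement where most rows go to training. The sketch below derives the border from a desired training fraction instead. The helper name `split_by_fraction` and the 0.8 fraction are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def split_by_fraction(inputs, outputs, train_fraction):
    """Split already-shuffled arrays into a train part and a test part."""
    border = int(len(inputs) * train_fraction)
    return (inputs[:border], outputs[:border]), (inputs[border:], outputs[border:])

# Stand-in arrays with the same shapes as `data` and `target`.
toy_data = np.zeros((394, 4))
toy_target = np.zeros(394)

(train_x, train_y), (test_x, test_y) = split_by_fraction(toy_data, toy_target, 0.8)
print("Train set contains", len(train_x))
print("Test set contains", len(test_x))
```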


## Model

We will use a simple fully connected model: an input layer of 4 neurons (one per selected feature), a single hidden layer of 20 neurons, and 1 output neuron.

In [0]:
tf.enable_eager_execution()

### Keras model definition

In [0]:
model = keras.Sequential()
model.add(keras.layers.Dense(20, input_dim=4, activation=tf.nn.sigmoid))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

### Keras model compile

In [0]:
model.compile(loss='mse',
              optimizer=tf.train.AdamOptimizer(learning_rate=0.6),
              metrics=['mse', 'mae'])

The model summary shows the architecture of the Keras model:

In [52]:
# summary() prints the table itself and returns None, so no print() is needed.
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 20)                100       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 21        
Total params: 121
Trainable params: 121
Non-trainable params: 0
_________________________________________________________________
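
The parameter counts in the summary can be checked by hand: a dense layer has one weight for every input-output pair plus one bias per output unit. A short sketch follows; the helper `dense_params` is a name made up for this example.

```python
def dense_params(n_inputs, n_outputs):
    """Weights (n_inputs * n_outputs) plus one bias per output unit."""
    return n_inputs * n_outputs + n_outputs

hidden = dense_params(4, 20)   # 4 features -> 20 hidden neurons
output = dense_params(20, 1)   # 20 hidden neurons -> 1 output neuron
print(hidden, output, hidden + output)  # 100 21 121
```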


## Results

In [69]:
#@title Hyperparameters { run: "auto", form-width: "30%" }
epochs = 5 #@param {type:"integer"}
results = model.fit(
    train_input, train_output,
    epochs = epochs)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [70]:
loss, mse, mae = model.evaluate(test_input, test_output)

print("MSE - ", mse)
print("MAE - ", mae)

MSE -  1534.5295379252373
MAE -  38.59493677324384
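
The MAE of about 38.6 days is close to the size of the targets themselves (the samples printed earlier are 58, 38 and 42 days). One likely contributor, offered here as an assumption rather than a verified diagnosis, is the sigmoid activation on the output layer: a sigmoid only produces values between 0 and 1, so the network cannot output a day count such as 42. The sketch below uses plain Python to show the smallest error such a bounded output could reach on those three sample targets.

```python
import math

def sigmoid(x):
    """Logistic function; mathematically its output lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

# The three target values printed earlier in the notebook (days).
sample_targets = [58.0, 38.0, 42.0]

# Even a very large pre-activation gives a prediction of at most about 1.
best_prediction = sigmoid(50.0)

# Mean absolute error when every prediction sits at the sigmoid's upper bound.
mae = sum(abs(t - best_prediction) for t in sample_targets) / len(sample_targets)
print(round(best_prediction, 6), round(mae, 2))
```

If this is indeed the cause, a linear output layer (for example `keras.layers.Dense(1)` with no activation) would let the model predict values on the scale of the target.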


## Conclusion