# Killer shrimp challenge
The goal of the challenge is to predict the presence of _Dikerogammarus villosus_ and its spread in the Baltic Sea. _D. villosus_, known as the "killer shrimp", is an invasive species.

### Dataset
The dataset contains the following variables:
- Presence = 0 or 1
- Salinity_today = water salinity at surface (0-2 meters, mean value over winter months, in parts per 1000)
- Temperature_today = water temperature at surface (mean value over winter months, in °C)
- Substrate = substrate type (1 = sand, 0 = no sand)
- Depth = depth of the ocean
- Exposure = Wave exposure index at surface

### Output
The output dataset must contain pointid and predicted presence.

## Approach
Since this is a classification problem, I intend to use it as a learning exercise and implement a Support Vector Machine model. I have read about SVMs but never had the time to dive into them. It should be fun.

In [41]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import svm

### Data exploration and visualization

In [94]:
train_data_raw = pd.read_csv('../Datasets/killer-shrimp-invasion/train.csv')
test_data_raw = pd.read_csv('../Datasets/killer-shrimp-invasion/test.csv')

The dataset contains some NAs. Where are they?

In [110]:
train_data_raw[train_data_raw.isna().any(axis=1)]

Unnamed: 0,pointid,Salinity_today,Temperature_today,Substrate,Depth,Exposure,Presence
21,2703707,,,1.0,-1.659212,5374.0000,0
123,1406673,,,0.0,-8.157890,5057.2600,0
179,342018,,,1.0,-8.849793,4107.5410,0
192,91518,,,1.0,,12528.6045,0
361,1729550,,,1.0,-3.481059,652.6757,0
...,...,...,...,...,...,...,...
2625807,2693786,,,1.0,-1.900000,3600.0000,0
2625834,345209,,,1.0,-25.670000,8440.2340,0
2625941,908607,,,1.0,-11.050000,9651.9790,0
2625972,1507699,,,1.0,-13.273676,7509.1094,0


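There are more than 50,000 rows with missing data! Most of the gaps seem to be in Salinity_today and Temperature_today, with some in Depth and Exposure. Salinity and temperature also appear to be missing together. As a first check, the number of rows missing temperature equals the number of rows missing salinity: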

In [96]:
train_data_raw[train_data_raw['Temperature_today'].isna()].shape == train_data_raw[train_data_raw['Salinity_today'].isna()].shape

True
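Comparing shapes only shows that the two columns have the same number of NAs, not that the NAs sit in the same rows. A stricter, row-wise check could look like the sketch below. It runs on a small made-up frame (`toy` and its values are assumptions for illustration, not the real data); on the notebook's data the same two expressions would be applied to `train_data_raw`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data_raw (values are made up for illustration).
toy = pd.DataFrame({
    'Salinity_today': [6.1, np.nan, 5.8, np.nan],
    'Temperature_today': [3.2, np.nan, 2.9, np.nan],
    'Depth': [-1.6, -8.1, np.nan, np.nan],
    'Exposure': [5374.0, 5057.26, 4107.541, 12528.6045],
})

# Number of missing values in each column.
na_counts = toy.isna().sum()

# Row-wise check: True only if both columns are missing in exactly the same rows.
same_rows = bool(
    (toy['Temperature_today'].isna() == toy['Salinity_today'].isna()).all()
)

print(na_counts)
print(same_rows)
```

The per-column counts also show directly how many NAs fall in Depth and Exposure, instead of guessing from the truncated table.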

There are also many NAs in the test dataset (5622 rows). Here too they seem to be mainly in salinity and temperature, with a few in depth and exposure.

In [97]:
test_data_raw[test_data_raw.isna().any(axis=1)]

Unnamed: 0,pointid,Salinity_today,Temperature_today,Substrate,Depth,Exposure
30,2688649,,,1.0,-5.891769,1976.18040
65,868085,,,1.0,-3.890000,995.55756
72,2697497,,,1.0,,6249.00000
75,1174557,,,1.0,-7.780000,14857.21700
117,2731018,,,1.0,-6.271195,2202.01590
...,...,...,...,...,...,...
291454,916104,,,1.0,-6.409005,5613.11100
291680,1057900,,,1.0,-30.150000,8575.10700
291729,1623400,,,1.0,,2816.18850
291760,2486640,,,0.0,-3.950000,80636.98000


Masking the NAs seems to be the better option. It would be wise to normalize the values first and then mask. So let us separate the dataset into features and labels:

In [98]:
features = ['Salinity_today', 'Temperature_today', 'Substrate', 'Depth', 'Exposure']
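# .copy() gives independent frames, so adding columns later does not
# trigger pandas' SettingWithCopyWarning.
train_features = train_data_raw[features].copy()
test_features = test_data_raw[features].copy()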
train_labels = train_data_raw['Presence']

My current plan for the NAs is to encode whether each value is NA or not in a separate column, then mask all NAs with a fixed value. So let's start:

In [103]:
train_features['Temperature_NA'] = np.where(train_features['Temperature_today'].isna(), 1, 0)
train_features['Salinity_NA'] = np.where(train_features['Salinity_today'].isna(), 1, 0)
train_features['Exposure_NA'] = np.where(train_features['Exposure'].isna(), 1, 0)
train_features['Depth_NA'] = np.where(train_features['Depth'].isna(), 1, 0)

Let's check whether this worked...

In [104]:
train_features[train_features.isna().any(axis=1)]

Unnamed: 0,Salinity_today,Temperature_today,Substrate,Depth,Exposure,Temperature_NA,Salinity_NA,Exposure_NA,Depth_NA
21,,,1.0,-1.659212,5374.0000,1,1,0,0
123,,,0.0,-8.157890,5057.2600,1,1,0,0
179,,,1.0,-8.849793,4107.5410,1,1,0,0
192,,,1.0,,12528.6045,1,1,0,1
361,,,1.0,-3.481059,652.6757,1,1,0,0
...,...,...,...,...,...,...,...,...,...
2625807,,,1.0,-1.900000,3600.0000,1,1,0,0
2625834,,,1.0,-25.670000,8440.2340,1,1,0,0
2625941,,,1.0,-11.050000,9651.9790,1,1,0,0
2625972,,,1.0,-13.273676,7509.1094,1,1,0,0


Now let's mask the NAs. I reckon the best guess is something along the lines of a Z-normalization followed by setting the missing Salinity_today and Temperature_today values to 0. That amounts to setting the NAs to the average value of their column. Let's use .fillna() for this:

In [117]:
train_features = train_features.fillna(train_features.mean())
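The claim that "Z-normalize, then set NAs to 0" and "fill NAs with the column mean" lead to the same place can be checked on a small example. This is only a sketch on made-up numbers (`toy` is hypothetical); it assumes the mean and standard deviation are computed from the non-missing values, which is what pandas does by default.

```python
import numpy as np
import pandas as pd

# Made-up values, one row with both columns missing.
toy = pd.DataFrame({
    'Salinity_today': [6.0, np.nan, 4.0, 8.0],
    'Temperature_today': [3.0, np.nan, 1.0, 2.0],
})

# Column statistics; pandas skips NaN by default.
mean = toy.mean()
std = toy.std()

# Route 1: z-normalize first, then set the NAs to 0.
z_then_zero = ((toy - mean) / std).fillna(0)

# Route 2: fill the NAs with the column mean, then z-normalize
# using the same statistics.
mean_then_z = (toy.fillna(mean) - mean) / std

print(np.allclose(z_then_zero, mean_then_z))
```

Note that the cell above only does the mean fill; the normalization itself is still left to do.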

From a first visualization of the data, it seems that some values have many more entries than others. As a start, I will try the support vector machine blindly.

In [47]:
#model = svm.SVC()
#model.fit(train_features, train_labels)

And then we produce the prediction:

In [48]:
#predictions = model.predict(test_features.dropna())

The problem with this is that I had to drop the rows containing NaNs in order to do it, but the submission has to cover every row. Another solution will be needed...
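One possible way out, sketched here on made-up toy data, is to give the test set the same treatment as the training set: add the NA indicator columns, then fill the gaps with the training means, so no row has to be dropped. The helper `prepare`, the generator `make_toy`, the toy labels and the output column name `Presence` are all assumptions for illustration, not part of the challenge material.

```python
import numpy as np
import pandas as pd
from sklearn import svm

features = ['Salinity_today', 'Temperature_today', 'Substrate', 'Depth', 'Exposure']
# Indicator column names follow the ones used earlier in the notebook.
na_flags = {
    'Temperature_today': 'Temperature_NA',
    'Salinity_today': 'Salinity_NA',
    'Exposure': 'Exposure_NA',
    'Depth': 'Depth_NA',
}

def prepare(df, fill_values):
    """Hypothetical helper: add NA indicator columns, then fill the NAs."""
    out = df[features].copy()
    for col, flag in na_flags.items():
        out[flag] = out[col].isna().astype(int)
    return out.fillna(fill_values)

def make_toy(n, seed):
    """Made-up data with the same column names as the real files."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        'pointid': np.arange(n),
        'Salinity_today': rng.uniform(0, 10, n),
        'Temperature_today': rng.uniform(0, 5, n),
        'Substrate': rng.integers(0, 2, n).astype(float),
        'Depth': -rng.uniform(0, 30, n),
        'Exposure': rng.uniform(0, 10000, n),
    })
    # Knock out salinity and temperature together in every fifth row.
    df.loc[df.index % 5 == 0, ['Salinity_today', 'Temperature_today']] = np.nan
    return df

toy_train = make_toy(50, seed=0)
toy_train['Presence'] = (toy_train['Depth'] > -15).astype(int)  # arbitrary toy label
toy_test = make_toy(20, seed=1)
toy_test.loc[3, 'Depth'] = np.nan  # a gap that only occurs in the test set

# Fill values come from the training set only and are reused for the test set.
train_means = toy_train[features].mean()
X_train = prepare(toy_train, train_means)
X_test = prepare(toy_test, train_means)

model = svm.SVC()
model.fit(X_train, toy_train['Presence'])
predictions = model.predict(X_test)

# One prediction per test row, keyed by pointid.
submission = pd.DataFrame({'pointid': toy_test['pointid'], 'Presence': predictions})
print(submission.shape)
```

Reusing the training means for the test set keeps both sets on the same footing, and the submission ends up with exactly as many rows as the test file.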