Statoil Iceberg Classification challenge

My solution

It should be noted that there is a data leak in the inc_angle feature: rows with identical inc_angle values (up to 4 decimal places) carry the same label most of the time.

(Figure: angle plot)

We should also avoid extreme prediction values, since log loss penalizes confident mistakes heavily.

The fix is to clip predictions: values below 0.001 are raised to 0.01, and values above 0.999 are lowered to 0.99.
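
A minimal sketch of that clipping (the helper name clip_extremes is hypothetical; leak.py may implement it differently):

    import numpy as np

    def clip_extremes(preds):
        # Soften overconfident predictions: raise values below 0.001
        # to 0.01 and lower values above 0.999 to 0.99
        preds = np.asarray(preds, dtype=float)
        preds = np.where(preds < 0.001, 0.01, preds)
        preds = np.where(preds > 0.999, 0.99, preds)
        return preds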

Leak usage is implemented in leak.py
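
For illustration, here is a hedged sketch of how such a leak can be exploited, assuming the Kaggle column names (inc_angle, is_iceberg) and that the submission rows are aligned with the test rows; apply_leak is a hypothetical helper and the actual leak.py may differ:

    import pandas as pd

    def apply_leak(train, test, sub):
        # Mean label per inc_angle rounded to 4 decimal places,
        # skipping the 'na' angles present in the training set
        train = train[pd.to_numeric(train['inc_angle'], errors='coerce').notna()]
        label_by_angle = (train.assign(angle=train['inc_angle'].astype(float).round(4))
                               .groupby('angle')['is_iceberg'].mean())
        matched = test['inc_angle'].astype(float).round(4).map(label_by_angle)
        # Overwrite predictions only where the matched labels are unanimous
        mask = matched.isin([0.0, 1.0]).values
        sub = sub.copy()
        sub.loc[mask, 'is_iceberg'] = matched[mask].values
        return sub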

As for the model, I used the best CNN I tested, found through the search in ice.py.

My model:

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
    from keras.optimizers import Adam

    def build_model():  # wrapper added so the trailing `return model` is valid
        model = Sequential()

        # Three stacked 3x3 convs cover the same receptive field as one
        # 7x7 conv, but with fewer parameters

        # Conv block 1
        model.add(Conv2D(64, kernel_size=(3, 3), activation='relu',
                         input_shape=(75, 75, 3)))
        model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
        model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
        model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))

        # Conv block 2
        model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
        model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
        model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
        model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

        # Conv block 3
        model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
        model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

        # Conv block 4
        model.add(Conv2D(256, kernel_size=(3, 3), activation='relu'))
        model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

        # Flatten before the dense head
        model.add(Flatten())

        # Dense 1
        model.add(Dense(1024, activation='relu'))
        model.add(Dropout(0.4))

        # Dense 2
        model.add(Dense(512, activation='relu'))
        model.add(Dropout(0.2))

        # Output: one sigmoid unit for binary classification
        model.add(Dense(1, activation='sigmoid'))

        optimizer = Adam(lr=0.0001, decay=0.0)
        model.compile(loss='binary_crossentropy', optimizer=optimizer,
                      metrics=['accuracy'])

        return model

It overfits a bit, but a decent loss can still be achieved.

The data is normalized and augmented with horizontal and vertical flips.
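
A minimal sketch of that augmentation, assuming the images arrive as a numpy array of shape (N, 75, 75, 3); the exact preprocessing in ice.py may differ:

    import numpy as np

    def augment_flips(images, labels):
        # Horizontal and vertical flips triple the training set
        flipped_lr = images[:, :, ::-1, :]  # left-right flip
        flipped_ud = images[:, ::-1, :, :]  # up-down flip
        aug_images = np.concatenate([images, flipped_lr, flipped_ud])
        aug_labels = np.concatenate([labels, labels, labels])
        return aug_images, aug_labels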

Ensembling with K-fold cross-validation is a good idea, so the final submission uses a 10-fold ensemble.
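
A sketch of that 10-fold ensemble, reusing build_model from the snippet above; X_train, y_train and X_test are assumed numpy arrays, and the epoch/batch-size values are illustrative rather than taken from ice.py:

    import numpy as np
    from sklearn.model_selection import KFold

    fold_preds = []
    kfold = KFold(n_splits=10, shuffle=True, random_state=42)
    for train_idx, val_idx in kfold.split(X_train):
        model = build_model()
        model.fit(X_train[train_idx], y_train[train_idx],
                  validation_data=(X_train[val_idx], y_train[val_idx]),
                  epochs=50, batch_size=32)
        fold_preds.append(model.predict(X_test).ravel())

    # Average the per-fold predictions for the final submission
    final_preds = np.mean(fold_preds, axis=0)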

After the submission file is generated, it is post-processed with the leak.py script mentioned earlier.

My stats:

(Figure: stats)

A complete log of the CNN training is available in ice.ipynb.

Possible improvements

  • Use different augmentations (shifts, rotations)
  • Add XGBoost or LightGBM on additional metadata
  • Pseudo-labeling
  • KNN/SVM on top of the CNN
  • Use different models for the icebergs in group 1 and group 2 (Plot 1)
  • Train many weak CNNs and ensemble them, e.g. the top 100
  • Use a pretrained VGG16, ResNet, etc.
  • Use different data preparation, e.g. FFT or sqrmean

Papers and links used
