# TGS - Salt identification challenge
First exploratory data analysis.  
2018.07.20  

The purpose of this kernel is to carry out an initial analysis of the images made available by TGS.  

## Summary:
* [A. Initial statements](#secA)  

* [B. Understanding the data](#secB)
  * [B.1. Background](#secB.1)
  * [B.2. Available data](#secB.2)  
  * [B.3. Loading the data](#secB.3)
  * [B.4. Choosing a random sample](#secB.4)
  
* [C. Exploratory Data Analysis](#secC)
  * [C.1. Creating a feature dataset](#secC.1)
  * [C.2. Taking a look at depth](#secC.2)
  * [C.3. Target defining](#secC.3)

<a name='secA'></a>
## A. Initial statements

In [None]:
## This kernel must be run on Python=3.6
import numpy as np 
import pandas as pd  #Python Data Analysis Library
import random

import scipy.ndimage as scipyImg
import scipy.misc as misc

import matplotlib.pyplot as plt 
import seaborn as sns

import os

%matplotlib inline

In [None]:
## Disabling filter warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
## Defining basic functions
## Note: scipy's imread/imsave are deprecated since SciPy 1.0; this kernel relies on an older SciPy.
def basic_readImg(directory, filename):
    '''Reads an image as RGB through the scipy library and returns it as an array.
    Usage: basic_readImg(directory, filename).'''
    sample = scipyImg.imread(os.path.join(directory, filename), mode='RGB')
    if sample.ndim != 3 or sample.shape[2] != 3:
        raise ValueError('The input must be an RGB image.')
    return sample

def basic_showImg(img, size=4):
    '''Displays the image at the chosen size. The image (img) should be read through basic_readImg().
    Usage: basic_showImg(img, size=4).'''
    plt.figure(figsize=(size,size))
    plt.imshow(img)
    plt.show()
    
def basic_writeImg(directory, filename, img):
    '''Saves the image array (img) to directory/filename.
    Usage: basic_writeImg(directory, filename, img).'''
    misc.imsave(os.path.join(directory, filename), img)

<a name='secB'></a>
## B. Understanding the data
<a name='secB.1'></a>
### B.1. Background  
*The information in this section is taken from the [TGS Salt Challenge page](https://www.kaggle.com/c/tgs-salt-identification-challenge/data)*.  

Seismic data is collected using reflection seismology, or seismic reflection. The method requires a controlled seismic source of energy, such as compressed air or a seismic vibrator, and sensors record the reflection from rock interfaces within the subsurface. The recorded data is then processed to create a 3D view of earth’s interior. Reflection seismology is similar to X-ray, sonar and echolocation.

A seismic image is produced from imaging the reflection coming from rock boundaries. The seismic image shows the boundaries between different rock types. In theory, the strength of reflection is directly proportional to the difference in the physical properties on either side of the interface. While seismic images show rock boundaries, they don't say much about the rocks themselves; some rocks are easy to identify while some are difficult.

There are several areas of the world where there are vast quantities of salt in the subsurface. One of the challenges of seismic imaging is to identify the part of the subsurface which is salt. Salt has characteristics that make it both simple and hard to identify. Salt density is usually 2.14 g/cc, which is lower than most surrounding rocks. The seismic velocity of salt is 4.5 km/sec, which is usually faster than its surrounding rocks. This difference creates a sharp reflection at the salt-sediment interface. Usually salt is an amorphous rock without much internal structure. This means that there is typically not much reflectivity inside the salt, unless there are sediments trapped inside it. The unusually high seismic velocity of salt can create problems with seismic imaging.
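
One common way to quantify the "difference in physical properties" at an interface is the normal-incidence reflection coefficient computed from acoustic impedance (density times velocity). The sketch below is only an illustration: the salt values are the ones quoted above, while the sediment density and velocity are assumed, hypothetical numbers, and the helper names are not part of the competition material.

```python
# Illustrative sketch: normal-incidence reflection coefficient at a
# sediment-to-salt interface. Salt values come from the text above;
# the sediment values are assumptions chosen only for illustration.

def acoustic_impedance(density_g_cc, velocity_km_s):
    """Acoustic impedance Z = density * velocity."""
    return density_g_cc * velocity_km_s

def reflection_coefficient(z_upper, z_lower):
    """Normal-incidence reflection coefficient R = (Z2 - Z1) / (Z2 + Z1)."""
    return (z_lower - z_upper) / (z_lower + z_upper)

z_salt = acoustic_impedance(2.14, 4.5)      # values quoted in the text
z_sediment = acoustic_impedance(2.4, 3.0)   # assumed, hypothetical sediment
r = reflection_coefficient(z_sediment, z_salt)
print('Reflection coefficient: {0:.3f}'.format(r))
```

Under these assumed values the salt has the higher impedance even though it is less dense, because its velocity is much higher.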


<a name='secB.2'></a>
### B.2. Available data
The data is a set of images taken at various locations chosen at random in the subsurface. The images are 101 x 101 pixels and each pixel is classified as either salt or sediment. In addition to the seismic images, the depth of the imaged location is provided for each image. The goal of the competition is to segment regions that contain salt.

<a name='secB.3'></a>
### B.3. Loading the data
#### Depths database reading  
The depth underground of each image is available in the *depths.csv* file, where the attribute *id* is the unique image identifier and *z* is its depth in feet (1 foot equals 0.3048 meters).

In [None]:
## Loading the dataset and showing the first rows:
depths = pd.read_csv('../input/depths.csv')
depths.head(2)
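
If metric units are preferred, the conversion factor quoted above can be applied directly to the *z* column. This is a minimal sketch on a made-up table: the ids, the depth values and the column name `z_m` are hypothetical and only stand in for the real *depths.csv* contents.

```python
import pandas as pd

FEET_TO_METERS = 0.3048  # conversion factor quoted in the text

# Hypothetical rows standing in for depths.csv (ids and depths are made up).
depths_demo = pd.DataFrame({'id': ['img_a', 'img_b'], 'z': [1000, 250]})

# Adding a depth column in meters next to the original one in feet.
depths_demo['z_m'] = depths_demo['z'] * FEET_TO_METERS
print(depths_demo)
```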

#### Training set masks in RLE
TGS has also made available some [Run-length encoding (RLE)](https://en.wikipedia.org/wiki/Run-length_encoding) masks with the salt portion of the images already identified.

In [None]:
train_masks = pd.read_csv('../input/train.csv')
train_masks.head(2)

In order to read these encoded masks, I will use [the function created by Robert](https://www.kaggle.com/robertkag/rle-to-mask-converter) on Kaggle, which reads the RLE string and generates an image array corresponding to the mask:  

In [None]:
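def rleToMask(rleString,height,width):
    '''Decodes an RLE string into a (height, width) uint8 mask with salt pixels set to 255.'''
    rows,cols = height,width
    try:
        rleNumbers = [int(numstring) for numstring in rleString.split(' ')]
        rlePairs = np.array(rleNumbers).reshape(-1,2)
        img = np.zeros(rows*cols,dtype=np.uint8)
        for index,length in rlePairs:
            ## RLE start positions are 1-indexed
            index -= 1
            img[index:index+length] = 255
        ## Pixels are numbered top to bottom, then left to right (column-major)
        img = img.reshape(cols,rows)
        img = img.T
    except (AttributeError, ValueError):
        ## Missing (NaN) or malformed RLE string: return an empty mask
        img = np.zeros((rows,cols),dtype=np.uint8)
    return img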
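
To make the encoding concrete, here is a minimal sketch that decodes a tiny RLE string by hand. It assumes the usual Kaggle convention of (start, length) pairs with 1-indexed starts and pixels numbered top to bottom, then left to right; `decode_rle_demo` is a hypothetical helper written for this illustration, and the 2 x 3 mask is made up.

```python
import numpy as np

def decode_rle_demo(rle_string, height, width):
    """Decode 'start length start length ...' into a (height, width) 0/1 mask."""
    numbers = [int(n) for n in rle_string.split()]
    flat = np.zeros(height * width, dtype=np.uint8)
    # Pair up the starts (even positions) with the run lengths (odd positions).
    for start, length in zip(numbers[0::2], numbers[1::2]):
        flat[start - 1:start - 1 + length] = 1  # starts are 1-indexed
    # Pixels run down each column first, so reshape column-wise and transpose.
    return flat.reshape(width, height).T

# Pixel 1 is set by the first run, pixels 4 and 5 by the second.
mask_demo = decode_rle_demo('1 1 4 2', height=2, width=3)
print(mask_demo)
```

In a 2 x 3 mask numbered column by column, pixels 1, 4 and 5 fall at (row 0, column 0), (row 1, column 1) and (row 0, column 2).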

In the training data, each subsurface image has a corresponding mask file. The number of files found in each folder is shown below:

In [None]:
file_imgs = os.listdir(path='../input/train/images/')
file_masks = os.listdir(path='../input/train/masks/')
print('Images found: {0}\nCorresponding masks: {1}'.format(len(file_imgs), len(file_masks)))
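
Equal counts alone do not guarantee that every image has a mask with the same file name. A minimal sketch of a stricter check using set differences is shown here; the two lists are hypothetical placeholders for the `os.listdir` results, not the real file names.

```python
# Hypothetical file listings standing in for the os.listdir() output.
image_files_demo = ['a1.png', 'b2.png', 'c3.png']
mask_files_demo = ['a1.png', 'c3.png', 'd4.png']

# File names present in one folder but not in the other.
images_without_mask = sorted(set(image_files_demo) - set(mask_files_demo))
masks_without_image = sorted(set(mask_files_demo) - set(image_files_demo))

print('Images without mask: {0}'.format(images_without_mask))
print('Masks without image: {0}'.format(masks_without_image))
```

Both lists being empty would indicate a one-to-one match between images and masks.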

<a name='secB.4'></a>
### B.4. Choosing a random sample

In [None]:
## Defining a function, since some samples have no valid RLE.
def choose_sample(data=train_masks):
    ## Choosing a random image from train dataset:
    sample = random.choice(range(len(data)))

    ## Parsing the sample information:
    sample_id = data['id'][sample]
    sample_depth = depths[depths['id'] == sample_id]['z'].values[0]
    sample_RLEstring = data['rle_mask'][sample]
    try: 
        sample_RLE = rleToMask(sample_RLEstring, 101,101)
    except: 
        sample_RLE = np.zeros((101,101))
    file_name = sample_id + '.png'
    sample_img = basic_readImg('../input/train/images/',file_name)
    sample_mask = basic_readImg('../input/train/masks/',file_name)
    
    fig1, axes = plt.subplots(1,3, figsize=(10,4))
    axes[0].imshow(sample_img)
    axes[0].set_xlabel('Subsurface image')
    axes[1].imshow(sample_mask)
    axes[1].set_xlabel('Provided mask')
    axes[2].imshow(sample_RLE)
    axes[2].set_xlabel('Decoded RLE mask')
    fig1.suptitle('Image ID = {0}\nDepth = {1} ft.'.format(sample_id, sample_depth));
    return

In [None]:
choose_sample()

<a name='secC'></a>
## C. Exploratory Data Analysis
<a name='secC.1'></a>
### C.1. Creating a feature dataset
Putting together all the training information we have.


In [None]:
df1 = depths.set_index('id')
df2 = train_masks.set_index('id')
dataset = pd.concat([df1, df2], axis=1, join='inner')
dataset = dataset.reset_index()

In [None]:
dataset['mask'] = dataset['rle_mask'].apply(lambda x: rleToMask(x, 101,101))

In [None]:
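def salt_proportion(imgArray):
    '''Returns the fraction of salt (non-zero) pixels in a mask.'''
    ## Counting non-zero pixels also handles masks that are entirely salt,
    ## for which np.unique returns a single value and counts[1] does not exist.
    ## For a 101 x 101 mask, imgArray.size is 10,201.
    return np.count_nonzero(imgArray) / float(imgArray.size)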

In [None]:
dataset['salt_proportion'] = dataset['mask'].apply(lambda x: salt_proportion(x))
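
As a sanity check on the salt proportion, it helps to try the computation on masks whose answer is known in advance, including the two edge cases (no salt and all salt). The sketch below uses synthetic 101 x 101 masks and a hypothetical helper, `salt_fraction_demo`, that counts non-zero pixels; it is an illustration, not the function used above.

```python
import numpy as np

def salt_fraction_demo(mask):
    """Fraction of non-zero (salt) pixels in a mask array."""
    return np.count_nonzero(mask) / mask.size

# Synthetic masks with known answers.
empty_mask = np.zeros((101, 101), dtype=np.uint8)       # no salt at all
full_mask = np.full((101, 101), 255, dtype=np.uint8)    # salt everywhere
half_mask = np.zeros((101, 101), dtype=np.uint8)
half_mask[:, :50] = 255                                  # 50 of 101 columns are salt

for name, mask in [('empty', empty_mask), ('full', full_mask), ('half', half_mask)]:
    print('{0}: {1:.4f}'.format(name, salt_fraction_demo(mask)))
```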

In [None]:
dataset.head()

<a name='secC.2'></a>
### C.2. Taking a look at depth
How is depth distributed across the dataset?


In [None]:
sns.set();
sns.distplot(dataset['z'], bins=20)

In [None]:
sns.pairplot(dataset, vars=['z', 'salt_proportion'])

The pairplot results above are expected, since the salt proportion in each image is related to the region from which the data were collected. The next step is to extract some features from the *salty* regions in order to correlate them with depth as well as with other attributes.
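
A correlation coefficient can complement the visual impression from the pairplot. The sketch below shows the call on a synthetic table: the column names mirror the ones used in this kernel, but the values are random placeholders, so the resulting number says nothing about the competition data.

```python
import numpy as np
import pandas as pd

# Synthetic placeholder data (NOT the competition data).
rng = np.random.RandomState(0)
demo = pd.DataFrame({
    'z': rng.uniform(50, 950, size=200),
    'salt_proportion': rng.uniform(0, 1, size=200),
})

# Pearson correlation between depth and salt proportion.
corr = demo['z'].corr(demo['salt_proportion'])
print('Pearson correlation: {0:.3f}'.format(corr))
```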

<a name='secC.3'></a>
### C.3. Target defining
In this first approach I will consider the salt proportion as the target feature. To do so, I'll divide the dataset into six salt categories, with bounds that follow the bins used in the code:  
i. No salt [0% - 0.1%]  
ii. Very low (0.1% - 10%]  
iii. Low (10% - 40%]  
iv. Medium (40% - 60%]  
v. High (60% - 90%]  
vi. Very high (90% - 100%]

In [None]:
dataset['target'] = pd.cut(dataset['salt_proportion'], bins=[0, 0.001, 0.1, 0.4, 0.6, 0.9, 1.0], 
       include_lowest=True, labels=['No salt', 'Very low', 'Low', 'Medium', 'High', 'Very high'])
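
To see how `pd.cut` assigns the labels, here is a minimal sketch with the same bins and labels applied to a handful of hand-picked proportions (the values are made up for illustration). Intervals are closed on the right, and `include_lowest=True` makes the first bin also include 0.

```python
import pandas as pd

bins = [0, 0.001, 0.1, 0.4, 0.6, 0.9, 1.0]
labels = ['No salt', 'Very low', 'Low', 'Medium', 'High', 'Very high']

# Hand-picked salt proportions, including values that sit on bin edges.
proportions_demo = pd.Series([0.0, 0.0005, 0.05, 0.4, 0.5, 0.75, 1.0])
categories_demo = pd.cut(proportions_demo, bins=bins,
                         include_lowest=True, labels=labels)

for value, label in zip(proportions_demo, categories_demo):
    print('{0:.4f} -> {1}'.format(value, label))
```

Note that a proportion of exactly 0.4 falls in the lower category, because each bin includes its right edge, and that any proportion up to 0.001 is labelled 'No salt'.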

In [None]:
dataset.tail()