## Real Micro Crystals - Data Engineering & Exploration

Michael Janus, June 2018

Apply the imgutils functions to a real (small) data set.

For an explanation of the functions and how to use them, see the notebook **imgutils_test_and_explain.ipynb**

## 0. Import the required modules, including the one with test functions

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

import matplotlib.pyplot as plt
import pandas as pd

import imgutils
import imgutils_test as tst

In [None]:
# Re-run this cell if you altered imgutils or imgutils_test
import importlib
importlib.reload(imgutils)
importlib.reload(tst)

## 1. Get image files

In [None]:
df_imgfiles = imgutils.scanimgdir('../data/Crystals_Apr_12/Tileset7', '.tif')
print(df_imgfiles)

## 2. Get Image Slice Statistics
This set contains 6 images. Slicing each image into a 4 by 4 grid gives a total of 6 x 4 x 4 = 96 slices.
The statistics functions are then applied to each slice.
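
As a rough illustration of what slicing and per-slice statistics involve, here is a minimal NumPy sketch. It is not the imgutils implementation: `slice_image` is a hypothetical helper, the random array stands in for a real image, and dropping remainder pixels is an assumption.

```python
import numpy as np

def slice_image(img, rows, cols):
    """Split a 2-D array into rows x cols equally sized tiles (row-major order)."""
    h, w = img.shape
    tile_h, tile_w = h // rows, w // cols  # assumption: remainder pixels are dropped
    return [
        img[r * tile_h:(r + 1) * tile_h, c * tile_w:(c + 1) * tile_w]
        for r in range(rows)
        for c in range(cols)
    ]

# Stand-in for one grayscale image (the real data are .tif files)
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)

tiles = slice_image(img, 4, 4)

# One record of statistics per tile, mirroring the six statistics used here
stats = [
    {
        "min": int(t.min()),
        "max": int(t.max()),
        "range": int(t.max()) - int(t.min()),
        "mean": float(t.mean()),
        "std": float(t.std()),
        "median": float(np.median(t)),
    }
    for t in tiles
]
print(len(tiles), tiles[0].shape)
```

With 6 images, 16 tiles per image gives the 96 records mentioned above.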

In [None]:
statfuncs = [imgutils.img_min, imgutils.img_max, imgutils.img_range, imgutils.img_mean, imgutils.img_std, imgutils.img_median]
df = imgutils.slicestats(list(df_imgfiles['filename']), 4, 4, statfuncs)
print("records: ", df.shape[0])
df.head()

**Normalize** the statistics using 'standardization'
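
For reference, standardization usually means subtracting the mean and dividing by the standard deviation (a z-score). The sketch below assumes that definition and assumes the normalized columns are named `|name|`, as the column names used later in this notebook suggest. `standardize` is a hypothetical helper, not the imgutils function, and the choice of `ddof=0` is also an assumption.

```python
import pandas as pd

def standardize(df, columns):
    """Add a z-scored copy of each column, named '|column|', in place."""
    for col in columns:
        # Assumption: population standard deviation (ddof=0)
        df[f"|{col}|"] = (df[col] - df[col].mean()) / df[col].std(ddof=0)
    return df

# Made-up values, for illustration only
demo = pd.DataFrame({
    "img_mean": [10.0, 12.0, 14.0, 20.0],
    "img_std": [1.0, 1.5, 4.0, 9.5],
})
standardize(demo, ["img_mean", "img_std"])
print(demo)
```

After this transformation each normalized column has mean 0 and standard deviation 1, so a negative value simply means "below the average of this data set".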

In [None]:
stat_names = imgutils.stat_names(statfuncs)
print(stat_names)

In [None]:
df.isnull().values.any()

In [None]:
imgutils.normalize(df, stat_names)
df.head()

In [None]:
stat_normnames = imgutils.normalized_names(stat_names)
print(stat_normnames)

## 3. Check some combinations for patterns
(using the seaborn pairplot)

In [None]:
import seaborn as sb

In [None]:
%matplotlib inline
sb.pairplot(df, vars=stat_normnames)
plt.show()

## 4. Inspect interactively
Let's inspect some combinations that have 'signs of clustering' in the interactive graph

In [None]:
%matplotlib notebook

In [None]:
imgutils.plotwithimg(df, '|img_mean|', '|img_range|', imgutils.highlightimgslice)

Looks like the sort-of cluster in the lower right consists of points without a crystal.

In [None]:
imgutils.plotwithimg(df, '|img_mean|', '|img_median|', imgutils.highlightimgslice)

The separation is not representative: the group at the top left contains slices both with and without micro crystals.

In [None]:
imgutils.plotwithimg(df, '|img_range|', '|img_std|', imgutils.highlightimgslice)

This looks better: the bottom-left points are empty regions, the top-left points have crystals.

## 5. Heatmaps

Let's attempt to create a score for a heatmap. It looks like |img_std| is the most informative statistic.
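
To make the idea of a heatmap score concrete, the sketch below arranges one score per slice into a 4 x 4 grid and rescales it to [0, 1] so it could drive a colormap or overlay opacity. Both helpers are hypothetical, the scores are made up, and the row-major slice order is an assumption; the imgutils functions may work differently.

```python
import numpy as np

def scores_to_grid(scores, rows, cols):
    """Arrange per-slice scores into a rows x cols grid (assumes row-major order)."""
    return np.asarray(scores, dtype=float).reshape(rows, cols)

def to_unit_range(grid):
    """Rescale a grid linearly to [0, 1]; a constant grid maps to all zeros."""
    lo, hi = grid.min(), grid.max()
    if hi == lo:
        return np.zeros_like(grid)
    return (grid - lo) / (hi - lo)

# Made-up normalized scores for the 16 slices of one image
scores = np.linspace(-1.5, 2.0, 16)
heat = to_unit_range(scores_to_grid(scores, 4, 4))
print(heat.round(2))
```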

In [None]:
imgname = df_imgfiles.iloc[3]['filename']
print(imgname)

In [None]:
imgs, heats = imgutils.getimgslices_fromdf(df, imgname, '|img_std|')

In [None]:
%matplotlib inline

In [None]:
imgutils.showheatmap(imgs, heats)
print(heats)

Yes, looks great! Let's check some other images as well.

In [None]:
imgname = df_imgfiles.iloc[0]['filename']
imgs, heats = imgutils.getimgslices_fromdf(df, imgname, '|img_std|')
imgutils.showheatmap(imgs, heats, opacity=0.7)
print(heats)

In [None]:
imgname = df_imgfiles.iloc[1]['filename']
imgs, heats = imgutils.getimgslices_fromdf(df, imgname, '|img_std|')
imgutils.showheatmap(imgs, heats, opacity=0.7)
print(heats)

In [None]:
imgname = df_imgfiles.iloc[2]['filename']
imgs, heats = imgutils.getimgslices_fromdf(df, imgname, '|img_std|')
imgutils.showheatmap(imgs, heats, opacity=0.7)
print(heats)

In [None]:
imgname = df_imgfiles.iloc[4]['filename']
imgs, heats = imgutils.getimgslices_fromdf(df, imgname, '|img_std|')
imgutils.showheatmap(imgs, heats, opacity=0.7)
print(heats)

In [None]:
imgname = df_imgfiles.iloc[5]['filename']
imgs, heats = imgutils.getimgslices_fromdf(df, imgname, '|img_std|')
imgutils.showheatmap(imgs, heats, opacity=0.7)
print(heats)

## 6. Conclusions & Remarks
- The visualization and heatmap concept looks nice.
- No real clustering was used; based on the data exploration, the normalized standard deviation alone served as the indicator.
- For larger or different sets (with outliers), a combination of statistics will probably be needed (which was the original idea: let ML figure out which one).


## 7. Next steps
- Export this data set and label it based on std-dev (e.g. 3 categories: none, some, full)
- Export this data set for unsupervised learning
- Repeat on a bigger and more varied set



Michael Janus, 15 June 2018

<hr>


# Update 5 July 2018
## 8. Assign labels
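Inspecting the heats, define 3 categories. Here |img_std| is the normalized standard deviation (not an absolute value), so it can be negative:
* |img_std| < 0 = A (no particle)
* 0 <= |img_std| < 1 = B (partly)
* |img_std| >= 1 = C (fully)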
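
The same thresholds can also be expressed as bins. The sketch below uses `pandas.cut` with left-closed intervals, which matches the boundary handling of the `assign_label` function in this section (a score of exactly 0 is B, a score of exactly 1 is C). The scores are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up normalized scores, including both boundary values
scores = pd.Series([-0.8, 0.0, 0.4, 1.0, 2.3])

# right=False makes the bins [-inf, 0), [0, 1), [1, inf)
labels = pd.cut(
    scores,
    bins=[-np.inf, 0, 1, np.inf],
    right=False,
    labels=["A", "B", "C"],
)
print(labels.tolist())  # ['A', 'B', 'B', 'C', 'C']
```

Calling `labels.value_counts()` afterwards gives a quick view of how balanced the three classes are.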


In [None]:
def assign_label(score):
    """Map a normalized std-dev score to a class label."""
    if score < 0:
        return 'A'   # no particle
    if score >= 1:
        return 'C'   # fully covered
    return 'B'       # partly covered

df['class'] = df['|img_std|'].apply(assign_label)

In [None]:
df.head()

In [None]:
df2 = df[df['class']=='C']

In [None]:
print(len(df2))

In [None]:
%matplotlib inline
# check class C images
for i in range(0,len(df2)):
    img = imgutils.getimgslice(df2, i)
    imgutils.showimg(img)

In [None]:
# also plot img_std vs img_range, colored by class label
color_index = {'A': 0, 'B': 1, 'C': 2}
colors = df['class'].map(color_index)
plt.scatter(df['|img_range|'], df['|img_std|'], c=colors)
plt.show()

Ideally this plot should be interactive with the images, so the infrastructure was extended
(done; this required changing to an interactive scatter plot instead of a line plot).

In [None]:
%matplotlib notebook
imgutils.plotwithimg(df, '|img_range|', '|img_std|', imgutils.highlightimgslice, 'class')


## 9. Export as csv


In [None]:
df.to_csv('../data/Crystals_Apr_12/Tileset7.csv', sep=';')
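
Because the file is written with `;` as separator and includes the index as the first column, reading it back needs matching arguments. A minimal round-trip sketch, using an in-memory buffer and made-up rows instead of the real file:

```python
import io
import pandas as pd

# Made-up rows standing in for the real slice statistics
demo = pd.DataFrame({
    "filename": ["a.tif", "b.tif"],
    "|img_std|": [-0.5, 1.2],
    "class": ["A", "C"],
})

buffer = io.StringIO()
demo.to_csv(buffer, sep=";")   # same separator as the export above
buffer.seek(0)

# index_col=0 restores the index that to_csv wrote as the first column
restored = pd.read_csv(buffer, sep=";", index_col=0)
print(restored)
```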

## 10. Also other stats
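
For readers unfamiliar with the extra statistics, the sketch below shows common NumPy definitions of skewness, kurtosis and mode. These are assumptions for illustration, not the imgutils implementations: in particular, whether `imgutils.img_kurtosis` reports excess kurtosis (minus 3) is not known here, and the function names are hypothetical.

```python
import numpy as np

def skewness(img):
    """Third standardized moment; positive when the tail is on the bright side."""
    x = np.asarray(img, dtype=float).ravel()
    z = (x - x.mean()) / x.std()  # undefined for a constant image (std == 0)
    return float(np.mean(z ** 3))

def excess_kurtosis(img):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    x = np.asarray(img, dtype=float).ravel()
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

def mode(img):
    """Most frequent pixel value; assumes non-negative integers, smallest value wins ties."""
    counts = np.bincount(np.asarray(img).ravel())
    return int(counts.argmax())

# Tiny made-up tile with one bright outlier
tile = np.array([[1, 2, 2], [2, 3, 9]], dtype=np.uint8)
print(skewness(tile), excess_kurtosis(tile), mode(tile))
```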

In [None]:
statfuncs = [imgutils.img_mean, imgutils.img_std, imgutils.img_kurtosis, imgutils.img_skewness, imgutils.img_mode]
df2 = imgutils.slicestats(list(df_imgfiles['filename']), 4, 4, statfuncs)
print("records: ", df2.shape[0])


In [None]:
# relative spread: std divided by mean, both taken from df2
df2['img_std2'] = df2['img_std'] / df2['img_mean']

In [None]:
stat_names = imgutils.stat_names(statfuncs) + ['img_std2']
imgutils.normalize(df2, stat_names)

In [None]:
stat_normnames = imgutils.normalized_names(stat_names)

%matplotlib inline
sb.pairplot(df2, vars=stat_normnames)
plt.show()

In [None]:
# label them based on normalized std (first experiment)
df2['class'] = df2['|img_std|'].apply(assign_label)

In [None]:
df2.to_csv('../data/Crystals_Apr_12/Tileset7-2.csv', sep=';')