# Real Micro Crystals -  Data Engineering & Exploration 2
_playing with different statistics_

Michael Janus, June 2018

Use more functions on a real (small) data set.

For an explanation of the functions and how to use them, see the notebook **imgutils_test_and_explain.ipynb**

## 0. Import the modules used, including the one with test functions

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt

import imgutils
import imgutils_test as tst

In [None]:
# Re-run this cell if you altered imgutils or imgutils_test
import importlib
importlib.reload(imgutils)
importlib.reload(tst)

## 1. Get image files

In [None]:
df_imgfiles = imgutils.scanimgdir('../data/Crystals_Apr_12/Tileset7', '.tif')
print(df_imgfiles)

## 2. Get Image Slice Statistics
This set contains 6 images. Let's slice each of them into 4 by 4 tiles, which gives a total of 6 x 4 x 4 = 96 slices,
and compute the statistics on each slice.
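
To make the slicing step concrete, here is a minimal NumPy sketch of what a 4 by 4 slicing with per-slice statistics could look like. It is not the `imgutils.slicestats` implementation: the helper name `slice_stats`, the record keys and the synthetic images are assumptions for illustration only.

```python
import numpy as np

def slice_stats(img, n_rows, n_cols):
    """Split a 2-D image into n_rows x n_cols slices and return one
    dict of statistics per slice (hypothetical helper, not imgutils)."""
    records = []
    # array_split tolerates image sizes that do not divide evenly
    for r, band in enumerate(np.array_split(img, n_rows, axis=0)):
        for c, tile in enumerate(np.array_split(band, n_cols, axis=1)):
            records.append({
                "row": r,
                "col": c,
                "img_mean": float(tile.mean()),
                "img_std": float(tile.std()),
                "img_median": float(np.median(tile)),
            })
    return records

# Six synthetic 64x64 "images" stand in for the real .tif files
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(64, 64)) for _ in range(6)]

all_records = [rec for img in images for rec in slice_stats(img, 4, 4)]
print(len(all_records))  # 6 images x 4 x 4 slices = 96
```

Each record corresponds to one row of a slice-statistics table, so the record count should match the number of slices computed above.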

In [None]:
statfuncs = imgutils.statfuncs_common_ext()
stat_names = imgutils.stat_names(statfuncs)
print(stat_names)

In [None]:
df = imgutils.slicestats(list(df_imgfiles['filename']), 4, 4, statfuncs)
print("records: ", df.shape[0])
df.head()

**Normalize** the statistics using 'standardization'
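
As a reminder of what standardization does, the sketch below rescales a column to zero mean and unit standard deviation. It is a generic illustration, not the code behind `imgutils.normalize`. The helper name `standardize` is hypothetical, and the choice of the population standard deviation (rather than the sample one) is an assumption.

```python
import numpy as np

def standardize(values):
    """Return (values - mean) / std; hypothetical helper, not imgutils."""
    values = np.asarray(values, dtype=float)
    std = values.std()  # population standard deviation (assumption)
    if std == 0:
        # A constant column carries no information; map it to zeros
        return np.zeros_like(values)
    return (values - values.mean()) / std

raw = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # mean 5, std 2
z = standardize(raw)
print(z)  # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]
```

After this step every statistic is on the same scale, which keeps one statistic from dominating a scatter plot or a combined score merely because of its units.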

In [None]:
imgutils.normalize(df, stat_names)
df.head()

In [None]:
stat_normnames = imgutils.normalized_names(stat_names)
print(stat_normnames)

## 3. Check some combinations for patterns
(using the seaborn pairplot)

In [None]:
import seaborn as sb

In [None]:
%matplotlib inline
sb.pairplot(df, vars=stat_normnames)
plt.show()

## 4. Inspect interactively
Let's inspect some combinations that have 'signs of clustering' in the interactive graph

In [None]:
df.head(3)

In [None]:
%matplotlib notebook

In [None]:
imgutils.plotwithimg(df, '|img_mean|', '|img_std|', imgutils.highlightimgslice, thumbnails=True)

It looks like the sort-of cluster in the lower right consists of points without a crystal.

In [None]:
imgutils.plotwithimg(df, '|img_mean|', '|img_median|', imgutils.highlightimgslice, thumbnails=True)

The separation is not representative: the group at the top left contains slices both with and without micro crystals.

In [None]:
imgutils.plotwithimg(df, '|img_median|', '|img_std|', imgutils.highlightimgslice, thumbnails=True)

This looks better: the bottom right holds empty regions, the top left holds crystals.

## 5. Heatmaps

Let's attempt to create a score for a heatmap. It looks like |img_std| is the most informative statistic.
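
The idea of turning one score per slice into a heatmap can be sketched as follows. This is only an illustration under assumptions: the helper `scores_to_heatmap` is hypothetical, the scores are assumed to be in row-major slice order, and nothing here reflects how `imgutils.showheatmap` works internally.

```python
import numpy as np

def scores_to_heatmap(scores, n_rows, n_cols, tile_shape):
    """Expand one score per slice into a pixel-sized heat layer.
    Assumes scores are in row-major slice order (hypothetical helper)."""
    grid = np.asarray(scores, dtype=float).reshape(n_rows, n_cols)
    # Rescale to [0, 1] so the layer can be used directly as an intensity
    span = grid.max() - grid.min()
    grid = (grid - grid.min()) / span if span > 0 else np.zeros_like(grid)
    # np.kron repeats every grid cell over a block of tile_shape pixels
    return np.kron(grid, np.ones(tile_shape))

scores = np.arange(16)  # stand-in for 16 per-slice |img_std| values
heat = scores_to_heatmap(scores, 4, 4, tile_shape=(8, 8))
print(heat.shape)  # (32, 32)
```

A layer like `heat` could then be drawn semi-transparently over the original image, for example with matplotlib's `imshow` and its `alpha` argument.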

In [None]:
imgname = df_imgfiles.iloc[3]['filename']
print(imgname)

In [None]:
imgs, heats = imgutils.getimgslices_fromdf(df, imgname, '|img_std|')

In [None]:
imgutils.showheatmap(imgs, heats)

Yes, looks great! Let's check some other images as well.

In [None]:
imgname = df_imgfiles.iloc[0]['filename']
imgs, heats = imgutils.getimgslices_fromdf(df, imgname, '|img_std|')
imgutils.showheatmap(imgs, heats, opacity=0.7)

In [None]:
imgname = df_imgfiles.iloc[1]['filename']
imgs, heats = imgutils.getimgslices_fromdf(df, imgname, '|img_std|')
imgutils.showheatmap(imgs, heats, opacity=0.7)

In [None]:
imgname = df_imgfiles.iloc[2]['filename']
imgs, heats = imgutils.getimgslices_fromdf(df, imgname, '|img_std|')
imgutils.showheatmap(imgs, heats, opacity=0.7)

In [None]:
imgname = df_imgfiles.iloc[4]['filename']
imgs, heats = imgutils.getimgslices_fromdf(df, imgname, '|img_std|')
imgutils.showheatmap(imgs, heats, opacity=0.7)

In [None]:
imgname = df_imgfiles.iloc[5]['filename']
imgs, heats = imgutils.getimgslices_fromdf(df, imgname, '|img_std|')
imgutils.showheatmap(imgs, heats, opacity=0.7)

## [So far, this was a repeat of the previous session of June 15]



## 6. Try some more stats (June 19)

**The '5 number statistics'**
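
For reference, a five-number summary consists of the minimum, first quartile, median, third quartile and maximum. The sketch below computes it with `np.percentile`. The helper name and the dictionary keys are chosen for illustration and are not necessarily what `imgutils.statfuncs_5numsummary` produces.

```python
import numpy as np

def five_number_summary(pixels):
    """Minimum, first quartile, median, third quartile, maximum."""
    flat = np.asarray(pixels, dtype=float).ravel()
    q = np.percentile(flat, [0, 25, 50, 75, 100])
    names = ["min", "quartile1", "median", "quartile3", "max"]
    return dict(zip(names, (float(v) for v in q)))

tile = np.arange(1, 10).reshape(3, 3)  # values 1..9 as a tiny stand-in slice
summary = five_number_summary(tile)
print(summary)
# {'min': 1.0, 'quartile1': 3.0, 'median': 5.0, 'quartile3': 7.0, 'max': 9.0}
```

The difference `quartile3 - quartile1` is the interquartile range, a spread measure that is less sensitive to a few extreme pixels than the standard deviation.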

In [None]:
%matplotlib inline
statfuncs = imgutils.statfuncs_5numsummary()
stat_names = imgutils.stat_names(statfuncs)
stat_normnames = imgutils.normalized_names(stat_names)
df = imgutils.slicestats(list(df_imgfiles['filename']), 4, 4, statfuncs)
imgutils.normalize(df, stat_names)

sb.pairplot(df, vars=stat_normnames)

In [None]:
%matplotlib notebook
imgutils.plotwithimg(df, '|img_quartile1|', '|img_quartile3|', imgutils.highlightimgslice, thumbnails=True)

In [None]:
# Here it looks like quartile 2 needs to be > 0 and quartile 1 < -1;
# use the spread between quartile 3 and quartile 1 as the score
df['score'] = df['|img_quartile3|'] - df['|img_quartile1|']
df['|score|'] = imgutils.norm_standardize(df, 'score')

In [None]:
imgname = df_imgfiles.iloc[0]['filename']
imgs, heats = imgutils.getimgslices_fromdf(df, imgname, '|score|')
imgutils.showheatmap(imgs, heats)

**The '7 number statistics'**
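
The exact seven values returned by `statfuncs_7numsummary` are not shown in this notebook, but the column names used further on suggest that quintiles are among them. As a hedged illustration, the sketch below computes quintile boundaries, assuming that 'quintile k' means the (20 x k)th percentile. The helper name `quintiles` is hypothetical.

```python
import numpy as np

def quintiles(pixels):
    """Return the 20th, 40th, 60th and 80th percentiles of the pixel values."""
    flat = np.asarray(pixels, dtype=float).ravel()
    q = np.percentile(flat, [20, 40, 60, 80])
    return {f"quintile{k}": float(v) for k, v in enumerate(q, start=1)}

tile = np.arange(0, 101)  # values 0..100, so each percentile is close to its own rank
result = quintiles(tile)
print(result)
```

Under this assumption, quintile 1 is the value below which the darkest 20% of the pixels in a slice fall.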

In [None]:
%matplotlib inline
statfuncs = imgutils.statfuncs_7numsummary()
stat_names = imgutils.stat_names(statfuncs)
stat_normnames = imgutils.normalized_names(stat_names)
df = imgutils.slicestats(list(df_imgfiles['filename']), 4, 4, statfuncs)
imgutils.normalize(df, stat_names)

sb.pairplot(df, vars=stat_normnames)

In [None]:
%matplotlib notebook
imgutils.plotwithimg(df, '|img_quintile1|', '|img_quintile2|', imgutils.highlightimgslice, thumbnails=True)

Here, quintile 1 looks like a pretty good separating statistic.

In [None]:
# Quintile 1 separates the groups, so use its negation as the score
df['score'] = -df['|img_quintile1|']
df['|score|'] = imgutils.norm_minmax(df, 'score')
imgname = df_imgfiles.iloc[0]['filename']
# Note: the heatmap uses the raw 'score' column, not the min-max normalized '|score|'
imgs, heats = imgutils.getimgslices_fromdf(df, imgname, 'score')
imgutils.showheatmap(imgs, heats)

Hmm, one obvious one was missed, so we need clustering rather than a single statistic!
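
To illustrate what clustering on several statistics at once could look like, here is a minimal k-means sketch in plain NumPy. It is a generic technique swapped in for illustration, not something this notebook or `imgutils` provides. The function `kmeans`, its deterministic seeding and the synthetic two-feature data are all assumptions.

```python
import numpy as np

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    points = np.asarray(points, dtype=float)
    # Deterministic seeding: k points spread evenly along the first feature
    order = np.argsort(points[:, 0])
    centroids = points[order[np.linspace(0, len(points) - 1, k).astype(int)]]
    for _ in range(n_iter):
        # Distance of every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it if it has none
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two synthetic groups in a (median, std)-like normalized feature space
rng = np.random.default_rng(1)
empty = rng.normal(loc=[1.0, -1.0], scale=0.2, size=(30, 2))
crystal = rng.normal(loc=[-1.0, 1.0], scale=0.2, size=(30, 2))
labels, centroids = kmeans(np.vstack([empty, crystal]), k=2)
print(np.bincount(labels))
```

Because the assignment uses the distance in all features together, a slice that looks ordinary in one statistic can still end up in the crystal group if the other statistic sets it apart.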

**Box-and-whisker stats**
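
As background, a box-and-whisker description of a slice typically consists of the median, the interquartile range and the whisker ends. The sketch below uses the common Tukey convention of whiskers reaching to the most extreme data points within 1.5 times the interquartile range of the box. Whether `statfuncs_boxandwhisker` follows the same convention is not shown here, so treat the helper and its keys as hypothetical.

```python
import numpy as np

def box_and_whisker(pixels, whisker=1.5):
    """Median, interquartile range and Tukey-style whisker ends."""
    x = np.asarray(pixels, dtype=float).ravel()
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    # Whiskers reach to the most extreme data points inside the fences
    inside = x[(x >= q1 - whisker * iqr) & (x <= q3 + whisker * iqr)]
    return {
        "median": float(median),
        "interquartilerange": float(iqr),
        "whisker_low": float(inside.min()),
        "whisker_high": float(inside.max()),
    }

tile = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])  # 50 is an outlier
stats = box_and_whisker(tile)
print(stats)
# {'median': 5.5, 'interquartilerange': 4.5, 'whisker_low': 1.0, 'whisker_high': 9.0}
```

Note how the outlier value 50 affects neither the interquartile range nor the upper whisker, which is what makes these statistics robust against a few very bright pixels.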

In [None]:
statfuncs = imgutils.statfuncs_boxandwhisker()
stat_names = imgutils.stat_names(statfuncs)
stat_normnames = imgutils.normalized_names(stat_names)
df = imgutils.slicestats(list(df_imgfiles['filename']), 4, 4, statfuncs)
imgutils.normalize(df, stat_names)


%matplotlib inline
sb.pairplot(df, vars=stat_names)


In [None]:
# check one interactively
%matplotlib notebook
imgutils.plotwithimg(df, '|img_interquartilerange|', '|img_median|', imgutils.highlightimgslice, thumbnails=True)

The separation between the clusters lies just at the top of the dense area in the top left (where it becomes more sparse).

In [None]:
statfuncs = imgutils.statfuncs_boxandwhisker_ext()
stat_names = imgutils.stat_names(statfuncs)
stat_normnames = imgutils.normalized_names(stat_names)
df = imgutils.slicestats(list(df_imgfiles['filename']), 4, 4, statfuncs)
imgutils.normalize(df, stat_names)


%matplotlib inline
sb.pairplot(df, vars=stat_names)

In [None]:
print(stat_normnames)

In [None]:
%matplotlib notebook
imgutils.plotwithimg(df, '|img_interquartilerange_low|', '|img_interquartilerange_high|', imgutils.highlightimgslice, True)

In [None]:
# The lower-left cluster is non-particles; let's try to separate them:
df['score'] = df['|img_interquartilerange_low|'] + df['|img_interquartilerange_high|']
#df['|score|'] = imgutils.norm_standardize(df, 'score')
#df['|score|'] = imgutils.norm_minmax(df, 'score')

In [None]:
imgname = df_imgfiles.iloc[0]['filename']
imgs, heats = imgutils.getimgslices_fromdf(df, imgname, 'score')
imgutils.showheatmap(imgs, heats)

In [None]:
# meh
