# Real Micro Crystals - Data Engineering & Exploration 3
_explore a larger data set_

Michael Janus, June 2018

Use the functions on a real (small) data set.

For an explanation of the functions and how to use them, see the notebook **imgutils_test_and_explain.ipynb**

## 0. Import the required modules, including the one with test functions

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

import matplotlib.pyplot as plt
from scipy import stats

import imgutils
import imgutils_test as tst

In [None]:
# Re-run this cell if you altered imgutils or imgutils_test
import importlib
importlib.reload(imgutils)
importlib.reload(tst)

## 1. Get image files

In [None]:
df_imgfiles = imgutils.scanimgdir('../data/Crystals_Apr_12/Tileset6', '.tif')
print(df_imgfiles)

## 2. Get Image Slice Statistics
This set contains many images. Let's slice each of them into a 10 by 10 grid

and compute the statistics on each slice.

In [None]:
statfuncs = imgutils.statfuncs_5numsummary()
df = imgutils.slicestats(list(df_imgfiles['filename']), 10, 10, statfuncs)
print("records: ", df.shape[0])
df.head()
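
As a rough illustration of what slicing an image into a grid and computing per-slice statistics involves, here is a minimal NumPy sketch. It is not the `imgutils` implementation: `slice_grid` and `five_number_summary` are hypothetical helpers, the image is synthetic, and it assumes the five-number summary means min, first quartile, median, third quartile and max.

```python
import numpy as np

def slice_grid(img, nrows, ncols):
    """Split a 2-D array into nrows x ncols tiles (edge remainders are dropped)."""
    h, w = img.shape
    th, tw = h // nrows, w // ncols
    return [img[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
            for r in range(nrows) for c in range(ncols)]

def five_number_summary(tile):
    """Return min, first quartile, median, third quartile and max of a tile."""
    q = np.percentile(tile, [0, 25, 50, 75, 100])
    return dict(zip(["min", "quartile1", "median", "quartile3", "max"], q))

# synthetic 100 x 100 "image" standing in for a real .tif (assumption)
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(100, 100))

tiles = slice_grid(img, 10, 10)
stats_per_tile = [five_number_summary(t) for t in tiles]
print(len(tiles), tiles[0].shape)
```

Each tile then becomes one record, which is why the number of records grows with the number of images times the grid size.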

visualize some images

In [None]:
#get single slice:
sliceimg = imgutils.getimgslice(df, 8)
imgutils.showimg(sliceimg)

In [None]:
# show first image sliced up:
imgname = df_imgfiles.iloc[0]['filename']
imgs, dummy = imgutils.getimgslices_fromdf(df, imgname)
imgutils.showimgs(imgs)

In [None]:
df.isnull().values.any()

**Normalize** the statistics using 'standardization'

In [None]:
stat_names = imgutils.stat_names(statfuncs)
print(stat_names)

In [None]:
imgutils.normalize(df, stat_names)
df.head()

In [None]:
df.isnull().values.any()

In [None]:
stat_normnames = imgutils.normalized_names(stat_names)
print(stat_normnames)
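
For reference, the standardization used above can be sketched in plain pandas. This is an assumption about what such a step typically does (subtract the mean, divide by the standard deviation), not a copy of `imgutils.normalize`; the helper `standardize` and the demo values are made up, and only the `|name|` column convention is taken from this notebook.

```python
import pandas as pd

def standardize(df, columns):
    """Add a z-scored copy of each column, named '|col|' (modifies df in place)."""
    for col in columns:
        df["|" + col + "|"] = (df[col] - df[col].mean()) / df[col].std()
    return df

# tiny made-up table standing in for the slice statistics
demo = pd.DataFrame({"img_mean": [10.0, 20.0, 30.0],
                     "img_std": [1.0, 2.0, 6.0]})
standardize(demo, ["img_mean", "img_std"])
print(demo["|img_mean|"].tolist())  # [-1.0, 0.0, 1.0]
```

A column with zero variance would turn into NaN here (0 divided by 0), which is the kind of problem a check like `df.isnull().values.any()` would reveal.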

## 3. Check some combinations for patterns
(using the seaborn pairplot)

In [None]:
import seaborn as sb

In [None]:
%matplotlib inline
sb.pairplot(df, vars=stat_normnames)
#sb.pairplot(df, vars=['|img_mean|','|img_min|', '|img_std|'])
plt.show()

## 4. Inspect interactively
Let's inspect some combinations that have 'signs of clustering' in the interactive graph

In [None]:
%matplotlib notebook

In [None]:
imgutils.plotwithimg(df, '|img_quartile1|', '|img_quartile3|', imgutils.highlightimgslice, interactive=True)

The black parts are easily identified, but the crystals are harder to separate out

In [None]:
#Try other stats:
statfuncs = imgutils.statfuncs_boxandwhisker()
stat_names = imgutils.stat_names(statfuncs)
stat_normnames = imgutils.normalized_names(stat_names)
df = imgutils.slicestats(list(df_imgfiles['filename']), 10, 10, statfuncs)
imgutils.normalize(df, stat_names)


%matplotlib inline
sb.pairplot(df, vars=stat_normnames)

and check interactively

In [None]:
%matplotlib notebook
imgutils.plotwithimg(df, '|img_interquartilerange|', '|img_median|', imgutils.highlightimgslice)

Looks like 4 nice clusters, but all of them seem to contain the black grid, so these stats are too coarse

In [None]:
#Try other stats:
%matplotlib inline
statfuncs = imgutils.statfuncs_7numsummary()
stat_names = imgutils.stat_names(statfuncs)
stat_normnames = imgutils.normalized_names(stat_names)
df = imgutils.slicestats(list(df_imgfiles['filename']), 10, 10, statfuncs)
imgutils.normalize(df, stat_names)

sb.pairplot(df, vars=stat_normnames)

In [None]:
%matplotlib notebook
imgutils.plotwithimg(df, '|img_quintile2|', '|img_quintile3|', imgutils.highlightimgslice)

Hmmm, looks like mean and standard deviation are the missing pieces

In [None]:
# added one that is a mix of quartile stats and common stats
%matplotlib inline
statfuncs = imgutils.statfuncs_selection1()
stat_names = imgutils.stat_names(statfuncs)
stat_normnames = imgutils.normalized_names(stat_names)
df = imgutils.slicestats(list(df_imgfiles['filename']), 10, 10, statfuncs)
imgutils.normalize(df, stat_names)

sb.pairplot(df, vars=stat_normnames)

In [None]:
# try some combinations to find a separable one
%matplotlib notebook
# imgutils.plotwithimg(df, '|img_std|', '|img_interquartilerange|', imgutils.highlightimgslice)
imgutils.plotwithimg(df, '|img_std|', '|img_quartile1|', imgutils.highlightimgslice)

quartile 1 can separate the black bars from the others. Let's try some more and then plot some heatmaps

In [None]:
%matplotlib notebook
#imgutils.plotwithimg(df, '|img_interquartilerange|', '|img_quartile3|', imgutils.highlightimgslice)
imgutils.plotwithimg(df, '|img_std|', '|img_mean|', imgutils.highlightimgslice)

The real particles are 'hidden' in the top-left cluster, i.e. high mean but also some variance. As a 'separator' I can try the sum of both.

## 5. Heatmaps

Let's attempt to create a score for a heatmap. It looks like |img_std| is the most informative

In [None]:
imgname = df_imgfiles.iloc[0]['filename']
print(imgname)

In [None]:
# it looks like quartile 2 needs to be > 0 and quartile 1 < -1; start with quartile 1 alone as the score
df['score'] = df['img_quartile1']
df['|score|'] = imgutils.norm_minmax(df, 'score')
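
Min-max normalization rescales the score to the range 0 to 1, which is convenient for mapping it onto a colormap. A minimal sketch, assuming the usual definition (value minus minimum, divided by the range); the helper `minmax` is hypothetical and is not the code behind `imgutils.norm_minmax`:

```python
import numpy as np

def minmax(values):
    """Rescale values linearly to [0, 1]; a constant input maps to all zeros."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span

# made-up scores, e.g. standardized quartile values of four slices
scores = np.array([-2.0, 0.0, 1.0, 2.0])
heat = minmax(scores)
print(heat)  # [0.   0.5  0.75 1.  ]
```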

In [None]:
imgname = df_imgfiles.iloc[0]['filename']
imgs, heats = imgutils.getimgslices_fromdf(df, imgname, '|score|')
imgutils.showheatmap(imgs, heats, cmapname='RdYlGn', opacity=0.5, heatdepend_opacity = False)

In [None]:
imgname = df_imgfiles.iloc[2]['filename']
imgs, heats = imgutils.getimgslices_fromdf(df, imgname, '|score|')
imgutils.showheatmap(imgs, heats, cmapname='RdYlGn', opacity=0.5, heatdepend_opacity = False)

In [None]:
imgname = df_imgfiles.iloc[1]['filename']
imgs, heats = imgutils.getimgslices_fromdf(df, imgname, '|score|')
imgutils.showheatmap(imgs, heats, cmapname='RdYlGn', opacity=0.5, heatdepend_opacity = False)

Hmm

In [None]:
# second attempt: use the sum of mean and standard deviation as the score
df['score2'] = df['img_mean'] + df['img_std']
df['|score2|'] = imgutils.norm_standardize(df, 'score2')

In [None]:
imgname = df_imgfiles.iloc[0]['filename']
imgs, heats = imgutils.getimgslices_fromdf(df, imgname, '|score2|')
imgutils.showheatmap(imgs, heats, cmapname='RdYlGn', opacity=0.5, heatdepend_opacity = False)

In [None]:
imgname = df_imgfiles.iloc[1]['filename']
imgs, heats = imgutils.getimgslices_fromdf(df, imgname, '|score2|')
imgutils.showheatmap(imgs, heats, cmapname='RdYlGn', opacity=0.5, heatdepend_opacity = False)

In [None]:
from mpl_toolkits.mplot3d import Axes3D
%matplotlib notebook

In [None]:
fig = plt.figure()
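ax = fig.add_subplot(111, projection='3d')  # Axes3D(fig) is no longer added to the figure automatically in recent matplotlib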

ax.scatter(df['|img_quartile1|'], df['|img_mean|'], df['|img_std|'])
plt.show()

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(df['|img_quartile1|'], df['|img_quartile3|'], df['|img_median|'])
plt.show()

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(df['|img_interquartilerange|'], df['|img_mean|'], df['|img_std|'])
plt.show()

## 6. Conclusions & Remarks
- need multi-dimensional analysis, e.g. PCA; let's export these values for that!
- consider pre-filtering the image to take out the noise (99% range)
- consider re-scaling images to 2k x 2k or 1k x 1k for performance
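
The pre-filtering idea could look roughly like the sketch below. It reads "99% range" as the central 99% of pixel intensities (0.5th to 99.5th percentile), which is an assumption, and clips everything outside that range. The helper `clip_to_central_range` is hypothetical and the image is synthetic.

```python
import numpy as np

def clip_to_central_range(img, coverage=99.0):
    """Clip pixel values to the central `coverage` percent of the intensities."""
    tail = (100.0 - coverage) / 2.0
    lo, hi = np.percentile(img, [tail, 100.0 - tail])
    return np.clip(img, lo, hi)

# synthetic image with one extreme outlier pixel (assumption)
rng = np.random.default_rng(1)
img = rng.normal(loc=100.0, scale=10.0, size=(200, 200))
img[0, 0] = 10_000.0

clipped = clip_to_central_range(img)
print(img.max(), clipped.max())
```

Clipping before computing slice statistics would keep a few hot or dead pixels from dominating min, max and standard deviation.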



## 7. Next steps
- Export this data set for multi-dimensional visualization
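
One possible way to look at more than three statistics at once is the PCA mentioned in the conclusions. The sketch below projects a feature matrix onto its first two principal components using NumPy's SVD. It is only an illustration on random data; `pca_project` is a hypothetical helper, and in practice the input would be the normalized statistic columns.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto the first principal components.

    Returns the projected coordinates and the fraction of variance
    explained by each kept component.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)        # variance ratio per component
    return Xc @ Vt[:n_components].T, explained[:n_components]

# random stand-in for 500 slices with 5 statistics each (assumption)
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
X[:, 1] = 2.0 * X[:, 0] + 0.01 * X[:, 1]         # make two features correlated

coords, explained = pca_project(X)
print(coords.shape, explained)
```

The two columns of `coords` can then be scatter-plotted like any other pair of statistics.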


Michael Janus, 19 June 2018