In [None]:
# Custom libraries
from datascienceutils import plotter
from datascienceutils import analyze

# Standard libraries
import json
%matplotlib inline
import datetime
import numpy as np
import pandas as pd
import random

from sklearn import model_selection
from sklearn import metrics

from bokeh.plotting import figure, show, output_file, output_notebook, ColumnDataSource
from bokeh.resources import INLINE
output_notebook(INLINE)

from sqlalchemy import create_engine

## Data Source/Background: We have measurements of flowers from species belonging to a particular genus.

In [None]:
headers = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species']

irisDf = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', names=headers)


## OK, the data is loaded. Let's assume we know nothing about the patterns in it and take a peek.

In [None]:
irisDf.describe()

In [None]:
irisDf.head()

## Ok we have 4 numerical columns and a categorical column. 
## Can we see any correlation among the numericals?

In [None]:
# Correlate only the four numeric columns (Species is categorical)
irisDf[headers[:-1]].corr()
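
For reference, the coefficient pandas reports by default is Pearson's r. Below is a minimal sketch of that calculation in plain Python. The numbers are hypothetical toy measurements chosen for illustration, not rows from the iris data, and `pearson_r` is a helper name introduced here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations, and each variable's own sum of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy measurements (NOT taken from the iris data).
petal_length = [1.0, 2.0, 3.0, 4.0, 5.0]
petal_width = [0.2, 0.4, 0.6, 0.8, 1.0]  # exactly proportional to length
sepal_width = [3.0, 2.0, 3.0, 2.0, 3.0]  # no linear trend with length

r_strong = pearson_r(petal_length, petal_width)
r_none = pearson_r(petal_length, sepal_width)
print(r_strong, r_none)
```

A value near +1 means the two columns rise together along a line, and a value near 0 means there is no linear trend, which is not the same thing as independence.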

## Hmm... We have a few positive correlations, such as PetalLength vs SepalLength and PetalWidth vs PetalLength.

The within-petal correlation makes sense, since petal length and petal width both scale with the size of the flower. (One would expect the same of sepal length and width.)

However, the PetalLength vs SepalLength correlation is new. Come to think of it, what counts as a petal and what is a sepal?

[Petal](https://en.wikipedia.org/wiki/Petal)
[Sepal](https://en.wikipedia.org/wiki/Sepal)

## That makes sense: the sepal is a leaf-like part that supports the flower, so if the flower is bigger the support has to be bigger too. Let's move on and see more.

In [None]:
analyze.correlation_analyze(irisDf, exclude_columns='Id', 
                                categories=['Species'], 
                                measure=['SepalLengthCm','SepalWidthCm',
                                           'PetalLengthCm', 'PetalWidthCm'])

## Independence Assumptions:
  One interesting advantage of using scatter plots here is that we can make judgements about the [Independence](https://www.quora.com/Why-is-the-assumption-of-independence-so-important-for-statistical-analysis) of the variables involved.
  For example, if you look at the PetalWidth vs SepalWidth plot, you'll see anything but a linear correlation between them. The data points almost seem to be avoiding a positive correlation. But if you conclude from this that the two variables are completely independent, you'd be wrong.
  Within the clusters that are obvious in the plots, there is some correlation between the two variables, which agrees with common sense. Here, clustering is confusing the interpretation of independence.
  
  As an exercise for the reader, find the other two variables that display a similar pattern.
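
The effect described above, where pooling clusters hides the correlation inside each cluster, can be reproduced with a few made-up numbers. This is only a sketch: the two groups below are hypothetical values, not iris measurements, and `pearson_r` is a helper defined here for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical cluster A: narrow petals, wide sepals.
petal_w_a = [0.1, 0.2, 0.3, 0.4]
sepal_w_a = [3.0, 3.2, 3.4, 3.6]

# Hypothetical cluster B: wide petals, narrow sepals.
petal_w_b = [1.3, 1.4, 1.5, 1.6]
sepal_w_b = [2.2, 2.4, 2.6, 2.8]

# Within each cluster the two widths rise together.
r_a = pearson_r(petal_w_a, sepal_w_a)
r_b = pearson_r(petal_w_b, sepal_w_b)

# Pooling the clusters flips the sign, because the offset between
# the clusters dominates the trend inside them.
r_pooled = pearson_r(petal_w_a + petal_w_b, sepal_w_a + sepal_w_b)

print(r_a, r_b, r_pooled)
```

The within-cluster coefficients are positive while the pooled one is negative, so a pooled coefficient on its own says little about how the variables relate inside each group.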

## That PetalLength vs SepalLength plot looks interesting. They correlate linearly above a certain threshold, but not below it. It might be data from several species mixed together, but it also makes sense that the sepals, serving only a support role, might not grow until the flower reaches a threshold size.

## The PetalLength vs SepalWidth plot suggests there are at least two species. So does PetalWidth vs PetalLength.

## Let's look at the distributions


In [None]:
analyze.dist_analyze(irisDf, 'PetalLengthCm')

## Aha! There are clearly at least two different clusters/species based on PetalLength alone. Let's look at SepalWidth next.
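
What "two clusters from one variable" looks like can be sketched with a plain text histogram: a run of empty bins between two occupied regions. The values are hypothetical, not read from the data set, and `bin_counts` is a helper introduced here.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Count values per bin; bin k covers [k*bin_width, (k+1)*bin_width)."""
    return Counter(int(v // bin_width) for v in values)

# Hypothetical petal lengths in cm (NOT taken from the iris data).
petal_length = [1.3, 1.4, 1.5, 1.4, 1.6, 4.5, 4.7, 5.1, 5.6, 4.9]

counts = bin_counts(petal_length, 1.0)

# Print one row per bin, including the empty ones, so the gap is visible.
for k in range(min(counts), max(counts) + 1):
    print(f"{k}-{k + 1} cm | " + "#" * counts[k])
```

The empty bins between the two groups are what suggest more than one population.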

In [None]:
analyze.dist_analyze(irisDf, 'SepalWidthCm')

## Hmm. How about SepalLength?

In [None]:
analyze.dist_analyze(irisDf, 'SepalLengthCm')

## So we can't distinguish the clusters/species by sepal size. Let's do some cluster analysis.

In [None]:
## First see how many species are labeled
irisDf.Species.unique()

In [None]:
## There are 3. Let's see if our algorithms can find them.
tempDf = irisDf.copy(deep=True)
tempDf.drop(['Species'], axis=1, inplace=True)
analyze.silhouette_analyze(tempDf, cluster_type='KMeans')

## Hmm, that's interesting. The silhouette score keeps falling all the way past 4 clusters, so we should settle on fewer than 4 clusters.
## From the scatter plot both 2 and 4 clusters look believable, so let's try again with 3 clusters.
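
To make the silhouette score less of a black box, here is a minimal sketch of the standard definition for 1-D points. It is not necessarily how `analyze.silhouette_analyze` computes it, since that helper's internals are not shown here. The points and labelings are made up, and the sketch assumes at least two clusters and not all points identical.

```python
def silhouette_score(points, labels):
    """Mean silhouette coefficient for 1-D points with given cluster labels."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(abs(p - q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical 1-D data with two obvious groups.
points = [1.0, 1.2, 1.4, 5.0, 5.2, 5.4]

score_two = silhouette_score(points, [0, 0, 0, 1, 1, 1])
# Splitting the second group further makes its points sit close to a
# "foreign" cluster, which drags the average down.
score_three = silhouette_score(points, [0, 0, 0, 1, 1, 2])

print(score_two, score_three)
```

Scores close to 1 mean points are much nearer to their own cluster than to any other. Over-splitting a natural group lowers the score, which is the pattern to watch for as the cluster count grows.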


In [None]:
analyze.silhouette_analyze(tempDf, cluster_type='dbscan', n_clusters=range(2,4))

In [None]:
analyze.silhouette_analyze(tempDf, cluster_type='birch', n_clusters=range(2,4))

## Purely by the algorithms and the silhouette score (higher means better clustering), there should be 2 clusters. However, the Species labels say there are 3.

So we'll have to conclude one of the following:

* the data samples are unevenly distributed across the clusters
* algorithmic issues/inefficiency (try dbscan or another clustering algorithm)
* two of the species are simply too close on the flower-based measures, and differ by other plant characteristics that are not captured in this data
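
One way the third explanation could be checked is to cross-tabulate cluster assignments against the species labels and see which species end up sharing a cluster. The sketch below uses the label strings from the data file, but the cluster ids are a hypothetical assignment invented for illustration, not the output of any run above.

```python
from collections import Counter

species = (['Iris-setosa'] * 3
           + ['Iris-versicolor'] * 3
           + ['Iris-virginica'] * 3)
# Hypothetical 2-cluster assignment (made up for this sketch).
cluster = [0, 0, 0, 1, 1, 1, 1, 1, 1]

# Count how often each (species, cluster) pair occurs.
table = Counter(zip(species, cluster))
for (name, cid), n in sorted(table.items()):
    print(f"{name:16s} cluster {cid}: {n}")

# Collect the species found in each cluster, then keep only the
# clusters that contain more than one species.
species_per_cluster = {}
for name, cid in table:
    species_per_cluster.setdefault(cid, set()).add(name)
merged = {cid: names for cid, names in species_per_cluster.items()
          if len(names) > 1}
print(merged)
```

If two species consistently land in the same cluster across algorithms, that points toward the species being hard to separate on these four measurements rather than toward a flaw in one algorithm.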

In [None]:
# Testing the first case
irisDf.groupby('Species').count()

In [None]:
## Testing the second case: DBSCAN again, over a wider range of cluster counts
analyze.silhouette_analyze(tempDf, cluster_type='dbscan', n_clusters=range(2,5))

In [None]:
# Testing the second case
analyze.silhouette_analyze(tempDf, cluster_type='spectral', n_clusters=range(2,4))

## Now let's return to the Regression we saw between PetalLength vs SepalLength

In [None]:
analyze.regression_analyze(irisDf, 'SepalLengthCm', 'PetalLengthCm')

## OK, there's a clear distinction between the two clusters, and in one of them we can simply predict PetalLength from SepalLength.
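
As a rough illustration of what "predict PetalLength from SepalLength" means within one cluster, here is an ordinary least-squares line fitted in plain Python. The data points are hypothetical and perfectly linear on purpose, and `fit_line` is a helper introduced here. It is not the implementation behind `analyze.regression_analyze`.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical measurements for a single cluster (NOT iris rows).
sepal_length = [5.0, 6.0, 7.0, 8.0]
petal_length = [3.0, 4.5, 6.0, 7.5]

slope, intercept = fit_line(sepal_length, petal_length)

# Predict the petal length of a flower whose sepal length is 6.5 cm.
predicted = slope * 6.5 + intercept
print(slope, intercept, predicted)
```

On real data the fit would be done per cluster, since a single line through both clusters would describe neither well.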

## Unfortunately, this is one of those insights/patterns that's not likely to be useful for this data set.

## I mean, who wants to predict a flower's sepal length from its petal length? We can just measure the sepal length too.

## Ah, perhaps if we were a species millions of times smaller than the flower, and measuring were very costly...

## In other words, if these were galaxies instead of flowers, then predicting sepal length would be useful.