## Objectives
* Practice k-means clustering
* Gain experience moving between pandas and Spark (both ways)

### * Name: Prabhjot Singh
### * I worked by myself.

## Background

#### Refer to Read.me

### As a first step, let's load the libraries we need:

In [4]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from PIL import Image
import seaborn as sns
import matplotlib.cm as cm
import re
import os.path

### Getting the data from Amazon Web Services S3

We need the CSV file with data (bob_ross.csv), as well as a collection of images to complete this assignment.
One way for us to share those with you is to put them in an AWS S3 bucket and get you to "mount" that bucket
as a directory that's accessible via this notebook.

The following code block does exactly that, making the bucket containing those files available to this notebook.  To Spark, it will look like the files live in a directory called ```/mnt/umsi-data-science```.  
To pandas, which we will use to read the data, the files will live in ```/dbfs/mnt/umsi-data-science```.  Note the use of ```/dbfs``` as a prefix in the pandas version.
At the end of the code block is a command to list the contents of the ```si618wn2017``` directory in the mounted S3 bucket.

In [6]:
ACCESS_KEY = "AKIAIPKMRL4G3IEVQ7FQ"
SECRET_KEY = "bkG5SUmSc+S8bQseSo8SaBAHQtt3xGUfRlOojUrW"
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")
AWS_BUCKET_NAME = "umsi-data-science-west"
MOUNT_NAME = "umsi-data-science"
try:
  # unmount first so that re-running this cell does not fail on an existing mount
  dbutils.fs.unmount("/mnt/%s/" % MOUNT_NAME)
except Exception:
  print("Could not unmount %s, but that's ok." % MOUNT_NAME)
dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
display(dbutils.fs.ls("/mnt/umsi-data-science/si618wn2017"))

### Defining a helper function to simplify the color space of images
The next code block sets up a utility function (```getColors```), which takes an image and figures out which colors are used.
It reduces the color space to about 85 colors (from an original space of 16,777,216 colors for 8-bit RGB) and returns the normalized count of 
each color's appearance in the image.

The function takes as input the filename of an image file.  It opens the file and sets up a numpy array of zeros for each of the
85 output colors.  The function then goes through all of the pixels in the image and calculates the red, green and blue 
color values in the reduced space (that's why we divide each of the values for red, green and blue by 63).  We then put
the red, green and blue values back together again by bit-shifting the green and blue values and then using a bitwise 'or'.
Let's say we had a color of 126,189,252 (which is a pleasant blue color).  Dividing those values by 63 and dropping the remainder, we get 2,3,4.
Bit-shifting 3 << 2 gives us 12, 4 << 4 gives us 64.  We don't bit-shift the red values, so we just keep the 2.  Adding those
together (which here gives the same result as a bitwise "or", because the shifted values share no set bits) gives us 78, so we would increment the count of color 78.

Finally, we convert all counts to proportions and return the proportions of each color as a numpy array.
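
The worked example above can be checked with a few lines of plain Python. This is only a sketch of the index arithmetic: `color_index` is a hypothetical helper name, not part of the notebook, and it uses integer division (`//`), which gives the same result as `int(value/63)` for non-negative pixel values.

```python
def color_index(red, green, blue):
    """Map an 8-bit RGB triple to the reduced color index described above."""
    r = red // 63             # 0..4, not shifted
    g = (green // 63) << 2    # 0, 4, 8, 12 or 16
    b = (blue // 63) << 4     # 0, 16, 32, 48 or 64
    return r | g | b

print(color_index(126, 189, 252))   # 78, the example from the text
print(color_index(255, 255, 255))   # 84, the largest index, hence 85 counters

# Adding and OR-ing agree for 2, 12 and 64 because they share no set bits.
print(2 + 12 + 64 == (2 | 12 | 64))  # True

# They do not agree in general: here r = 4 and g = 1 << 2 = 4 overlap.
print(color_index(252, 63, 0))       # 4
```

The last line shows that some distinct reduced colors share an index, which is consistent with the text describing the result as "about" 85 colors.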

In [8]:
def getColors(img):
    # open the image and force RGB so that every pixel is an (r, g, b) tuple
    im = Image.open(img, 'r').convert('RGB')
    pixel_values = list(im.getdata())
    # one counter per reduced color (indices 0..84)
    cnt = np.zeros(85,dtype=int)
    for i in pixel_values:
        # reduce each channel to 0..4, then pack the three values into one index
        r = int(i[0]/63)
        g = int(i[1]/63)<<2
        b = int(i[2]/63)<<4
        x = r | g | b
        cnt[x] = cnt[x] + 1
    # convert the counts to proportions
    cnt = cnt/float(sum(cnt))
    return(cnt)

### Loading the "tags" file into a pandas DataFrame
First, we're going to load the CSV file of the human-assigned tags for each of Bob's paintings into a **pandas** DataFrame.  Remember that we mounted the AWS S3 bucket containing the data as ```/mnt/umsi-data-science/si618wn2017``` and the CSV file is named ```bob_ross.csv```.  We can read the file using the (hopefully)
familiar ```read_csv()``` function in pandas, with the first column as the index (the older ```DataFrame.from_csv()``` method has been removed from current pandas releases):

In [10]:
bob_ross = pd.read_csv("/dbfs/mnt/umsi-data-science/si618wn2017/bob_ross.csv", index_col=0)

Let's take a look at the contents:

In [12]:
bob_ross.head()

The above command should show you the first 5 rows of a pandas DataFrame with 68 columns.  These are the "tags" for each of the images that
we will load.  The tags were generated by people, and indicate the presence or absence of various features (e.g. "BEACH"), which is set to 1 if the 
feature is present or 0 if the feature is not present.

## NOTE: The next code block takes a very long time (about 5 minutes) to complete.  Wait for it!

In [15]:
bob_ross['image'] = ""
# create a column for each of the 85 colors (these will be c0...c84)
# we'll do this in a separate table for now and then merge
cols = ['c%s'%i for i in np.arange(0,85)]
colors = pd.DataFrame(columns=cols)
colors['EPISODE'] = bob_ross.index.values
colors = colors.set_index('EPISODE')

# figure out if we have the image or not, we don't have a complete set
image_dir = "/dbfs/mnt/umsi-data-science/si618wn2017/images/"
for s in bob_ross.index.values:
    # build the expected file name from the title:
    # lowercase, drop punctuation, whitespace -> underscore
    b = bob_ross.loc[s, 'TITLE']
    b = b.lower()
    b = re.sub(r'[^a-z0-9\s]', '',b)
    b = re.sub(r'\s', '_',b)
    img = b+".png"
    if os.path.exists(image_dir+img):
        # .at replaces the deprecated set_value()
        bob_ross.at[s,"image"] = image_dir+img
        t = getColors(image_dir+img)
        colors.loc[s] = t
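
The loop above derives each image file name from the painting's title. A minimal sketch of that step on its own is shown here. The helper name `title_to_filename` and the two sample titles are illustrative assumptions, not taken from the CSV.

```python
import re

def title_to_filename(title):
    """Turn a painting title into a file name, following the steps in the loop above."""
    name = title.lower()
    name = re.sub(r'[^a-z0-9\s]', '', name)   # drop punctuation and other symbols
    name = re.sub(r'\s', '_', name)           # each whitespace character becomes "_"
    return name + ".png"

# Illustrative titles only
print(title_to_filename("A Walk in the Woods"))   # a_walk_in_the_woods.png
print(title_to_filename("Winter's Peace"))        # winters_peace.png
```

Because punctuation is removed before whitespace is replaced, an apostrophe or quotation mark in a title never ends up in the file name.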


In [16]:
# join the colors and tag database and toss the rows where we don't have an image
bob_ross = bob_ross.join(colors)
bob_ross = bob_ross[bob_ross.image != ""] 

In [17]:
# these are masks you might find handy to only get the colors, the tags, or both (as well as the image path)
color_columns = ['c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10',
               'c11', 'c12', 'c13', 'c14', 'c15', 'c16', 'c17', 'c18', 'c19', 'c20',
               'c21', 'c22', 'c23', 'c24', 'c25', 'c26', 'c27', 'c28', 'c29', 'c30',
               'c31', 'c32', 'c33', 'c34', 'c35', 'c36', 'c37', 'c38', 'c39', 'c40',
               'c41', 'c42', 'c43', 'c44', 'c45', 'c46', 'c47', 'c48', 'c49', 'c50',
               'c51', 'c52', 'c53', 'c54', 'c55', 'c56', 'c57', 'c58', 'c59', 'c60',
               'c61', 'c62', 'c63', 'c64', 'c65', 'c66', 'c67', 'c68', 'c69', 'c70',
               'c71', 'c72', 'c73', 'c74', 'c75', 'c76', 'c77', 'c78', 'c79', 'c80',
               'c81', 'c82', 'c83', 'c84']
tag_columns = ['APPLE_FRAME', 'AURORA_BOREALIS', 'BARN', 'BEACH', 'BOAT',
       'BRIDGE', 'BUILDING', 'BUSHES', 'CABIN', 'CACTUS', 'CIRCLE_FRAME',
       'CIRRUS', 'CLIFF', 'CLOUDS', 'CONIFER', 'CUMULUS', 'DECIDUOUS',
       'DIANE_ANDRE', 'DOCK', 'DOUBLE_OVAL_FRAME', 'FARM', 'FENCE', 'FIRE',
       'FLORIDA_FRAME', 'FLOWERS', 'FOG', 'FRAMED', 'GRASS', 'GUEST',
       'HALF_CIRCLE_FRAME', 'HALF_OVAL_FRAME', 'HILLS', 'LAKE', 'LAKES',
       'LIGHTHOUSE', 'MILL', 'MOON', 'MOUNTAIN', 'MOUNTAINS', 'NIGHT', 'OCEAN',
       'OVAL_FRAME', 'PALM_TREES', 'PATH', 'PERSON', 'PORTRAIT',
       'RECTANGLE_3D_FRAME', 'RECTANGULAR_FRAME', 'RIVER', 'ROCKS',
       'SEASHELL_FRAME', 'SNOW', 'SNOWY_MOUNTAIN', 'SPLIT_FRAME', 'STEVE_ROSS',
       'STRUCTURE', 'SUN', 'TOMB_FRAME', 'TREE', 'TREES', 'TRIPLE_FRAME',
       'WATERFALL', 'WAVES', 'WINDMILL', 'WINDOW_FRAME', 'WINTER',
       'WOOD_FRAMED']
all_columns = color_columns + tag_columns + ['image']
color_columns = color_columns + ['image']
tag_columns = tag_columns + ['image']

In [18]:
# this is a utility function for displaying a grid of images (7 per row), with an optional heading
def display_images(imagelist,cluster_title=None):
    # keep only the file name (e.g. "some_title.png") from each path
    a = imagelist.apply(lambda x: re.search(r'(\w+\.png)', x).group(1)).tolist()
    # split the file names into rows of 7 (the last row may be shorter)
    grid = [a[i:i+7] for i in range(0, len(a), 7)]
    text = ""
    if cluster_title is not None:
       text = "<h1>"+cluster_title+"</h1>\n" 
    text = text + "<table>"
    for row in grid:
        line = ''.join( ["\n<TD><img style='width: 120px; margin: 0px; float: left; border: 1px solid black;' src='https://s3.amazonaws.com/si618image/images/%s' /></TD>" % str(s) for s in row])
        text = text + "<TR>"+line+"</TR>\n"
    text = text +"</table>"
    displayHTML(text)

In [19]:
# for example, we can display every image we loaded (pass bob_ross.image.head(12) to show only the first 12)
display_images(bob_ross.image,"sample images")

## K-means
### 1) K-Means on tags (2 clusters)

We're going to start by replicating the FiveThirtyEight article a bit.  Using *only* the tags, perform a k-means clustering with 2 clusters. Use ```display_images``` to show the images from each cluster.

**We are going to move our data into Spark for this analysis.**

In [21]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.feature import VectorAssembler

tags = spark.createDataFrame(bob_ross[tag_columns[:-1]])

assembler = VectorAssembler(
    inputCols=tag_columns[:-1],
    outputCol="features")

tags_assembled = assembler.transform(tags)


# create a k-means model, k=2, and fit the data
kmeans = KMeans().setK(2).setSeed(1)
kmeans_model = kmeans.fit(tags_assembled.select("features"))

# Make predictions
tags_predictions = kmeans_model.transform(tags_assembled)

### Now move back into pandas...

In [23]:
bob_ross["prediction"] = tags_predictions.select("prediction").toPandas().set_index(bob_ross.index)

df_0 = bob_ross[bob_ross["prediction"] == 0]
display_images(df_0['image']," Cluster 1 Images")

In [24]:
df_1 = bob_ross[bob_ross["prediction"] == 1]
display_images(df_1['image'],"Cluster 2 Images")

### 2) Describe the differences

Without any further analysis, is there something obviously different about what's in the images?

**Your answer**:
In cluster 1, all images contain mountains as well as trees, whereas in cluster 2 the majority of images contain just trees.

### 3) Calculate the differences between clusters

One thing we can do to compare the clusters is to determine which tags show up more in the first cluster and which ones appear more in the second. Write code to determine which tags are maximally different between the two clusters.  You should get output that looks like:
```
MOUNTAIN              0.967647
SNOWY_MOUNTAIN        0.681513
MOUNTAINS             0.638655
CONIFER               0.515126
LAKE                  0.294958
```

Hint: you can do this with some combination of masks, .mean() and .sort_values() all in one line (but feel free to write a loop if it's easier to think about)

In [27]:
# YOUR CODE HERE

# drop the trailing 'image' column, which holds file paths rather than numbers
df_means = df_1[tag_columns[:-1]].mean() - df_0[tag_columns[:-1]].mean()
df_sorted = df_means.abs().sort_values(ascending=False)
df_sorted
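
Taking absolute values, as the cell above does, ranks the tags but hides which cluster each tag favors. The plain-Python sketch below shows one way to sort by magnitude while keeping the sign. The tag frequencies are made-up numbers used only for illustration, not results from this data set.

```python
# Hypothetical per-cluster tag frequencies (fraction of paintings carrying each tag)
cluster_0_means = {"MOUNTAIN": 0.95, "LAKE": 0.40, "CABIN": 0.10}
cluster_1_means = {"MOUNTAIN": 0.05, "LAKE": 0.25, "CABIN": 0.30}

# Signed difference: positive means the tag is more common in cluster 0
differences = {tag: cluster_0_means[tag] - cluster_1_means[tag]
               for tag in cluster_0_means}

# Sort by magnitude, largest first, but keep the sign for interpretation
ranked = sorted(differences.items(), key=lambda item: abs(item[1]), reverse=True)

for tag, diff in ranked:
    print("%-10s %+.2f" % (tag, diff))
```

In pandas the same idea can be expressed by sorting the signed Series with a key based on its absolute value, if the installed version supports the `key` argument of `sort_values`.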


### 4) Find a better value of k

Determine a better value for k (you can use the "rule of thumb" approach, silhouette scores, or scree plots... though as a warning, some of these may not be as "clear" as the examples in class).  

**Use display_images to show the different clusters, pick the best value of k, and describe your clusters qualitatively.**
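
As a quick illustration of the "rule of thumb" mentioned above, the estimate is the square root of half the number of rows. The sketch below uses a hypothetical helper name and a made-up row count. It gives only a starting point for k, not a definitive answer.

```python
import math

def rule_of_thumb_k(n_rows):
    """Rule-of-thumb starting guess for k: sqrt(n / 2)."""
    return math.sqrt(n_rows / 2)

# 400 is a made-up row count, used only to show the arithmetic
guess = rule_of_thumb_k(400)
print(round(guess))   # 14
```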

In [29]:
# method 1: "Rule of Thumb"
guess = np.sqrt(tags_predictions.count()/2)
# print the results
print("Rule-of-thumb estimate of k = " + str(guess))

In [30]:
# method 2: Scree plot
cost = list()
for k in range(2,11):
    bkm = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
    bkm_model = bkm.fit(tags_assembled)
    cost.append(bkm_model.computeCost(tags_assembled)/k)
print(cost)
# print the results

In [31]:
fig, ax = plt.subplots()
plt.plot(range(2,11), cost, 'b*-')
plt.xlabel('Number of clusters');
plt.ylabel('Within Set Sum of Squared Errors / k');
plt.title('Elbow for K-Means clustering');
# Uncomment the next line
display(fig)
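
To make the quantity behind the scree plot concrete, here is a small plain-Python sketch of the within-set sum of squared errors on toy one-dimensional data. The function name `wssse` and the hand-assigned labels are assumptions for illustration. No clustering algorithm is run here, and the notebook cell above additionally divides the cost by k.

```python
def wssse(points, labels):
    """Sum of squared distances from each 1-D point to the mean of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        center = sum(members) / len(members)
        total += sum((p - center) ** 2 for p in members)
    return total

points = [0.0, 1.0, 10.0, 11.0]        # toy data with two obvious groups
print(wssse(points, [0, 0, 0, 0]))     # k = 1 -> 101.0
print(wssse(points, [0, 0, 1, 1]))     # k = 2 -> 1.0
print(wssse(points, [0, 1, 2, 3]))     # k = 4 -> 0.0
```

The error drops sharply from k = 1 to k = 2 and only slightly after that, which is the "elbow" shape the scree plot is meant to reveal.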

In [32]:
# method 3: Silhouette scores
cost = list()
evaluator = ClusteringEvaluator()
for k in range(2,11):
    bkm = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
    bkm_model = bkm.fit(tags_assembled)
    tags_predictions = bkm_model.transform(tags_assembled)
    silhouette = evaluator.evaluate(tags_predictions)
    cost.append(silhouette)
    
kIdx = np.argmax(cost)

fig, ax = plt.subplots()
plt.plot(range(2,11), cost, 'b*-')
plt.plot(range(2,11)[kIdx], cost[kIdx], marker='o', markersize=12, 
         markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.xlim(1, plt.xlim()[1])
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Coefficient')
plt.title('Silhouette Scores for k-means clustering')
# Uncomment the next line
display(fig)
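
For intuition about what the silhouette coefficient measures, the sketch below computes it by hand for toy one-dimensional data. It is an illustration under stated assumptions: `silhouette_score` is a hypothetical helper, it uses plain absolute distance (Spark's `ClusteringEvaluator` defaults to squared Euclidean distance), and it expects at least two clusters.

```python
def silhouette_score(points, labels):
    """Mean silhouette coefficient for 1-D points (needs at least two clusters)."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the same cluster
        same = [abs(p - q) for j, (q, l) in enumerate(zip(points, labels))
                if l == label and j != i]
        if not same:
            scores.append(0.0)   # convention: a singleton cluster scores 0
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(p - q) for q, l in zip(points, labels) if l == other)
            / sum(1 for l in labels if l == other)
            for other in set(labels) if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [0.0, 1.0, 10.0, 11.0]
good = silhouette_score(points, [0, 0, 1, 1])   # natural grouping
bad = silhouette_score(points, [0, 1, 0, 1])    # deliberately poor grouping
print(round(good, 3), round(bad, 3))
```

A score near 1 means points sit much closer to their own cluster than to the nearest other cluster; a negative score means the assignment is worse than the alternative.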

### 5) k-means based on colors
Perform k-means clustering on the paintings using *only* the color columns. Decide a good value for k, execute the clustering, display the images in each cluster, and describe the resulting clusters.

#### 5.1) For k = 4

In [35]:
colors = spark.createDataFrame(bob_ross[color_columns[:-1]])

assembler = VectorAssembler(
    inputCols=color_columns[:-1],
    outputCol="features")

colors_assembled = assembler.transform(colors)


# create a k-means model, k=4, and fit the data
kmeans = KMeans().setK(4).setSeed(1)
kmeans_model = kmeans.fit(colors_assembled.select("features"))

# Make predictions
colors_predictions = kmeans_model.transform(colors_assembled)

In [36]:
#clustering column - prediction
bob_ross["prediction"] = colors_predictions.select("prediction").toPandas().set_index(bob_ross.index)

df_c0 = bob_ross[bob_ross["prediction"] == 0]
display_images(df_c0['image']," Cluster 1 Images (By color)")

#### Analysis : On observing the above cluster, we find that all images are 'very bright' in color (whitish).

In [38]:
df_c1 = bob_ross[bob_ross["prediction"] == 1]
display_images(df_c1['image']," Cluster 2 Images (By color)")

#### Analysis : On observing the above cluster, we find that all images are 'pale' in color (yellowish).

In [40]:
df_c2 = bob_ross[bob_ross["prediction"] == 2]
display_images(df_c2['image']," Cluster 3 Images (By color)")

#### Analysis : On observing the above cluster, we find that all images are 'very dark' in color (blackish).

In [42]:
df_c3 = bob_ross[bob_ross["prediction"] == 3]
display_images(df_c3['image']," Cluster 4 Images (By color)")

#### Analysis : On observing the above cluster, we find that all images are again 'pale blue' in color (bluish).

#### Evaluating K Value

In [45]:
cost = list()
for k in range(2,11):
    bkm = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
    bkm_model = bkm.fit(colors_assembled)
    cost.append(bkm_model.computeCost(colors_assembled)/k)

fig, ax = plt.subplots()
plt.plot(range(2,11), cost, 'b*-')
plt.xlabel('Number of clusters');
plt.ylabel('Within Set Sum of Squared Errors / k');
plt.title('Elbow for K-Means clustering');
# Uncomment the next line
display(fig)

### **Best k value**
##### From the above scree plot, k=4 is the elbow point, which suggests it is the best value of k.

### 6) Use both tags and colors for k-means clustering

Perform k-means clustering on the paintings using *both* tag and color columns. Decide a good value for k, execute the clustering, display the images in each cluster, and describe the resulting clusters.

In [48]:
ct = spark.createDataFrame(bob_ross[all_columns[:-1]])

assembler = VectorAssembler(
    inputCols=all_columns[:-1],
    outputCol="features")

all_assembled = assembler.transform(ct)


# create a k-means model, k=4, and fit the data
kmeans = KMeans().setK(4).setSeed(1)
kmeans_model = kmeans.fit(all_assembled.select("features"))

# Make predictions
all_predictions = kmeans_model.transform(all_assembled)

In [49]:
#clustering column - prediction
bob_ross["prediction"] = all_predictions.select("prediction").toPandas().set_index(bob_ross.index)

df_all0 = bob_ross[bob_ross["prediction"] == 0]
display_images(df_all0['image']," Cluster 1 Images (Both by tags & colors)")

#### Analysis : On observing the above cluster, we find that all images have an element of Mountains, Lake, Summer Evenings (Tags) + bright and blue color (Color)

In [51]:
df_all1 = bob_ross[bob_ross["prediction"] == 1]
display_images(df_all1['image']," Cluster 2 Images (Both by tags & colors)")

#### Analysis : On observing the above cluster, we find that all images have elements of Snow Mountains and Winter (Tags) + Pale Blue shade (Color)

In [53]:
df_all2 = bob_ross[bob_ross["prediction"] == 2]
display_images(df_all2['image']," Cluster 3 Images (Both by tags & colors)")

#### Analysis : On observing the above cluster, we find that all images have an element of Lakes, Spring, Waterfalls (Tags) + green/nature color (Color)

In [55]:
df_all3 = bob_ross[bob_ross["prediction"] == 3]
display_images(df_all3['image']," Cluster 4 Images (Both by tags & colors)")

#### Analysis : On observing the above cluster, we find that all images have an element of Ocean View, Waves, Steve Ross (Tags) + Dark color (Color)

### **Best k value**

In [58]:
cost = list()
for k in range(2,11):
    bkm = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
    bkm_model = bkm.fit(all_assembled)
    cost.append(bkm_model.computeCost(all_assembled)/k)

fig, ax = plt.subplots()
plt.plot(range(2,11), cost, 'b*-')
plt.xlabel('Number of clusters');
plt.ylabel('Within Set Sum of Squared Errors / k');
plt.title('Elbow for K-Means clustering');
# Uncomment the next line
display(fig)

### Silhouette Plots

In [60]:
cost = list()
evaluator = ClusteringEvaluator()
for k in range(2,11):
    bkm = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
    bkm_model = bkm.fit(all_assembled)
    all_predictions = bkm_model.transform(all_assembled)
    silhouette = evaluator.evaluate(all_predictions)
    cost.append(silhouette)
    
kIdx = np.argmax(cost)

fig, ax = plt.subplots()
plt.plot(range(2,11), cost, 'b*-')
plt.plot(range(2,11)[kIdx], cost[kIdx], marker='o', markersize=12, 
         markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.xlim(1, plt.xlim()[1])
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Coefficient')
plt.title('Silhouette Scores for k-means clustering')
# Uncomment the next line
display(fig)

### **Best k value**
##### From the scree plot and the silhouette plot, k=4 is both the elbow point and a peak in the silhouette plot, which suggests it is the best value of k, so the clustering should be run with k=4.

## Above and Beyond

#### Repeating the analysis for Step 6 (both tags and colors) using bisecting k-means **and comparing the results to k-means**.

In [63]:
from pyspark.ml.clustering import BisectingKMeans


ct_bisect = spark.createDataFrame(bob_ross[all_columns[:-1]])

assembler = VectorAssembler(
    inputCols=all_columns[:-1],
    outputCol="features")

all_assembled = assembler.transform(ct_bisect)


# Trains a bisecting k-means model.
bkm = BisectingKMeans().setK(3).setSeed(1)
bkm_model = bkm.fit(all_assembled.select("features"))

# Make predictions
all_predictions = bkm_model.transform(all_assembled)

# Evaluate clustering by computing Within Set Sum of Squared Errors.
cost = bkm_model.computeCost(all_assembled)
print("Within Set Sum of Squared Errors = " + str(cost))


In [64]:
bob_ross["prediction"] = all_predictions.select("prediction").toPandas().set_index(bob_ross.index)

df_all0_bkm = bob_ross[bob_ross["prediction"] == 0]
display_images(df_all0_bkm['image']," Cluster 1 Images (Both by tags & colors) - BKM Model")

In [65]:
df_all1_bkm = bob_ross[bob_ross["prediction"] == 1]
display_images(df_all1_bkm['image']," Cluster 2 Images (Both by tags & colors) - BKM Model")

In [66]:
df_all2_bkm = bob_ross[bob_ross["prediction"] == 2]
display_images(df_all2_bkm['image']," Cluster 3 Images (Both by tags & colors) - BKM Model")

In [67]:
np.sqrt(all_assembled.count()/2)

In [68]:
cost = list()
for k in range(2,11):
    bkm = BisectingKMeans().setK(k).setSeed(1).setFeaturesCol("features")
    bkm_model = bkm.fit(all_assembled)
    cost.append(bkm_model.computeCost(all_assembled)/k)

fig, ax = plt.subplots()
plt.plot(range(2,11), cost, 'b*-')
plt.xlabel('Number of clusters');
plt.ylabel('Within Set Sum of Squared Errors / k');
plt.title('Elbow for bisecting K-Means clustering');
# Uncomment the next line
display(fig)

In [69]:
cost = list()
evaluator = ClusteringEvaluator()
for k in range(2,11):
    bkm = BisectingKMeans().setK(k).setSeed(1).setFeaturesCol("features")
    bkm_model = bkm.fit(all_assembled)
    all_predictions = bkm_model.transform(all_assembled)
    silhouette = evaluator.evaluate(all_predictions)
    cost.append(silhouette)
    
kIdx = np.argmax(cost)

fig, ax = plt.subplots()
plt.plot(range(2,11), cost, 'b*-')
plt.plot(range(2,11)[kIdx], cost[kIdx], marker='o', markersize=12, 
         markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.xlim(1, plt.xlim()[1])
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Coefficient')
plt.title('Silhouette Scores for bisect k-means clustering')
# Uncomment the next line
display(fig)

## Differences between k-means and bisecting k-means clustering

The cluster images differ visibly between the two methods. Each k-means cluster contains a mixture of different kinds of images, whereas the images within each bisecting k-means cluster are much more similar to one another and less ambiguous.
The bisecting k-means clusters are easy to tell apart:
Cluster 1 is ocean views only, Cluster 2 is lake views without mountains, and Cluster 3 is mountains and trees.

The two silhouette plots also differ clearly. The k-means silhouette plot suggests k = 2 (not a good result), whereas the bisecting k-means silhouette plot suggests k = 3. Furthermore, for k-means the best value of k suggested by the silhouette plot differs from the one suggested by the scree plot, while for bisecting k-means the two plots roughly agree on k = 3.

Hence, bisecting k-means produces better-looking clusters than k-means on this data.