<a href="https://colab.research.google.com/github/lakhanrajpatlolla/aiml-learning/blob/master/U2W7_23_Hierarchical_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint
## Not for Grading

At the end of the experiment, you will be able to:

*  find groups or clusters using the hierarchical clustering algorithm
*  visualize the clusters using a dendrogram


In [None]:
#@title Experiment Walkthrough Video
from IPython.display import HTML

HTML("""<video width="854" height="480" controls>
  <source src="https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/Walkthrough/Hierarchical_Clustering.mp4" type="video/mp4">
</video>
""")

## Dataset

### Description

The dataset consists of the following 7 columns (culmen length and culmen depth are two separate columns):

- **species:** penguin species (Chinstrap, Adélie, or Gentoo)
- **culmen length & depth:** The culmen is the upper ridge of a bird's beak
- **flipper_length_mm:** flipper length
- **body_mass_g:** body mass
- **island:** island name (Dream, Torgersen, or Biscoe)
- **sex:** penguin sex

## AI/ML Technique

### Hierarchical Clustering

Hierarchical clustering is an algorithm that builds a hierarchy of clusters. It starts with every data point assigned to a cluster of its own. At each step, the two nearest clusters are merged into one. The algorithm terminates when only a single cluster is left.

Why use hierarchical clustering instead of K-means? K-means works well when the clusters are hyper-spherical (circular in 2 dimensions). If the clusters in the dataset are non-spherical, K-means is probably not a good choice.

K-means starts with a random choice of cluster centers, so different runs can produce different clusterings and the algorithm often has to be run several times. Its results may therefore not be repeatable. Hierarchical clustering, by contrast, is deterministic: for the same data, distance metric, and linkage method, it produces the same hierarchy on every run.

K-means requires prior knowledge of K (the number of clusters), whereas in hierarchical clustering we can stop at any level of the hierarchy, that is, at any number of clusters we wish.
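The point about stopping at any level can be sketched with SciPy's `linkage` and `fcluster`. This is a minimal illustration on made-up toy blobs (not the penguin data); the names `toy`, `Z`, and `labels_by_k` are assumptions introduced only for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# three well-separated toy blobs (illustrative data, not the penguin dataset)
toy = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(10, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(10, 2)),
    rng.normal(loc=(10, 0), scale=0.3, size=(10, 2)),
])

Z = linkage(toy, method="ward")  # the hierarchy is built only once

# the same hierarchy can be cut into different numbers of clusters afterwards
labels_by_k = {}
for k in (2, 3, 4):
    labels_by_k[k] = fcluster(Z, t=k, criterion="maxclust")
    print(k, np.bincount(labels_by_k[k])[1:])  # cluster sizes for this cut
```

Note that K never has to be fixed before building the hierarchy; it is only chosen when the tree is cut.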


Hierarchical clustering is of two types:

**Agglomerative**: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

**Divisive**: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In this experiment we will use Agglomerative Clustering.
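To make the "bottom up" idea concrete, here is a deliberately naive sketch of agglomerative clustering in plain Python. It assumes single linkage (cluster distance = distance between the closest pair of points) and Euclidean distance, which is a simplification chosen for readability; it is not how scikit-learn implements the algorithm, and the helper names are hypothetical.

```python
from itertools import combinations
import math

def single_linkage_distance(a, b, points):
    """Smallest Euclidean distance between a point in cluster a and one in b."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)

def agglomerative(points, n_clusters=1):
    """Naive bottom-up clustering; returns the clusters and the merge history."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > n_clusters:
        # find the pair of clusters that are closest to each other
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage_distance(
                clusters[pair[0]], clusters[pair[1]], points
            ),
        )
        dist = single_linkage_distance(clusters[a], clusters[b], points)
        merges.append((clusters[a], clusters[b], dist))
        merged = clusters[a] + clusters[b]
        # replace the two merged clusters with their union
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges

toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
toy_clusters, toy_merges = agglomerative(toy_points, n_clusters=2)
print(toy_clusters)
```

Running it with `n_clusters=1` instead carries the merging all the way to a single cluster, and `toy_merges` then records the full hierarchy.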

A dendrogram is a tree-like diagram that shows the hierarchical relationship between objects. It is most commonly produced as the output of hierarchical clustering. Its main use is to help decide how to allocate objects to clusters.


Hierarchical clustering gives insight into every step at which clusters are merged, and the resulting dendrogram helps you figure out which combinations of clusters make sense and where you want to stop.
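One common way to decide where to stop is to look for the largest jump in merge height, which corresponds to the longest vertical stretch in the dendrogram. The sketch below applies that heuristic to made-up toy data; it is only a rule of thumb, and the variable names are assumptions for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
# two well-separated toy blobs (illustrative data only)
toy = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(15, 2)),
    rng.normal(loc=(6, 6), scale=0.3, size=(15, 2)),
])

Z = linkage(toy, method="ward")
heights = Z[:, 2]        # merge distances, one per merge, in merge order
gaps = np.diff(heights)  # jump in height between consecutive merges

# stopping just before the merge that follows the largest jump
# leaves this many clusters
suggested_k = len(toy) - (int(np.argmax(gaps)) + 1)
print(suggested_k)
```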

## Setup Steps

In [None]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "2418775" #@param {type:"string"}


In [None]:
#@title Please enter your password (normally your phone number) to continue: { run: "auto", display-mode: "form" }
password = "9959000490" #@param {type:"string"}


In [None]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()

notebook= "U2W7_23_Hierarchical_Clustering" #name of the notebook
Answer = "Ungraded"
def setup():
#  ipython.magic("sx pip3 install torch")
    from IPython.display import HTML, display
    ipython.magic("sx wget -qq https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/Penguin.csv")
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():

    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getComplexity() and getAdditional() and getConcepts() and getComments():
      # Read the notebook file and close it automatically when done
      with open(notebook + ".ipynb", "rb") as f:
        file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
              "id" : Id, "file_hash" : file_hash,
              "feedback_experiments_input" : Comments, "notebook" : notebook}

      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://learn-iiith.talentsprint.com/notebook_submissions")
        # print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
      return submission_id
    else:
      # Feedback is incomplete: keep the existing submission id
      return submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None
def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None

def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()

else:
  print ("Please complete Id and Password cells before running setup")


## Import Required Packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

## Load the data

In [None]:
df = pd.read_csv('Penguin.csv')
df.head()

In [None]:
# Count NaN values in each column of the dataframe
df.isna().sum()

In [None]:
# Drop the records where sex column has NaN values
df.dropna(subset = ['sex'], inplace = True)

# Print the unique() elements from the sex column after dropping
print("Unique values after dropping NA values : ",df.sex.unique())

## Convert categorical values to numerical

In [None]:
LE = preprocessing.LabelEncoder()

In [None]:
df['island'] = LE.fit_transform(df['island'])
df['sex'] = LE.fit_transform(df['sex'])
df['species'] = LE.fit_transform(df['species'])
df.head()
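Reusing a single `LabelEncoder` works for encoding, but each `fit_transform` call overwrites the previous mapping, so only the last column's classes remain available. If the original labels are needed later, one option is to keep one encoder per column. The sketch below uses a small made-up frame, and `toy_df` and `encoders` are names introduced only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# small illustrative frame; the values mirror categories described earlier
toy_df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "island": ["Torgersen", "Biscoe", "Dream", "Dream"],
})

encoders = {}
for col in ["species", "island"]:
    encoders[col] = LabelEncoder()  # one encoder per column
    toy_df[col] = encoders[col].fit_transform(toy_df[col])

# classes_ holds the original labels, indexed by their integer code
species_mapping = dict(enumerate(encoders["species"].classes_))
print(species_mapping)
print(encoders["species"].inverse_transform([2]))
```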

## Store the data and labels

In [None]:
X = df.drop(['species'], axis=1)
y = df['species']

In [None]:
# Selecting first 100 rows and 2 columns from the data
X1 = X.iloc[:100,1:3].values

In [None]:
X1.shape

## Apply Agglomerative Clustering

**Note:** Refer to the scikit-learn documentation for [AgglomerativeClustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html)

In [None]:
# Create an AgglomerativeClustering estimator from scikit-learn
# n_clusters : the number of clusters to find
clustering = AgglomerativeClustering(n_clusters = 3)

# Fit Hierarchical Clustering to the data
Y_preds = clustering.fit_predict(X1)

# Plot the results
plt.figure(figsize = (8,5))
plt.scatter(X1[Y_preds == 0 , 0] , X1[Y_preds == 0 , 1] , c = 'red')
plt.scatter(X1[Y_preds == 1 , 0] , X1[Y_preds == 1 , 1] , c = 'blue')
plt.scatter(X1[Y_preds == 2 , 0] , X1[Y_preds == 2 , 1] , c = 'green')
plt.show()
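`AgglomerativeClustering` uses Ward linkage by default, and its `linkage` parameter also accepts `"complete"`, `"average"`, and `"single"`. As a rough sketch, the loop below runs each option on made-up, well-separated blobs, where all four are expected to agree; on real data such as the penguin features they can differ. The names `toy` and `results` are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(42)
# three tight, well-separated toy blobs (illustrative data only)
toy = np.vstack([
    rng.normal(loc=(0, 0), scale=0.2, size=(8, 2)),
    rng.normal(loc=(4, 4), scale=0.2, size=(8, 2)),
    rng.normal(loc=(8, 0), scale=0.2, size=(8, 2)),
])

results = {}
for method in ("ward", "complete", "average", "single"):
    model = AgglomerativeClustering(n_clusters=3, linkage=method)
    results[method] = model.fit_predict(toy)
    print(method, np.bincount(results[method]))  # cluster sizes per linkage
```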

## Visualize the dendrogram

**Note:** Refer to [scipy.cluster.hierarchy.dendrogram](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html)

In [None]:
# With method 'ward', linkage merges at each step the pair of clusters
# whose merge gives the smallest increase in total within-cluster variance
clusters = linkage(X1, 'ward')
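The array returned by `linkage` has one row per merge, so `n - 1` rows for `n` points. Each row holds the two cluster ids that were merged, the distance at which they merged, and the size of the new cluster; ids `0` to `n - 1` are the original points, and ids from `n` upward are clusters created by earlier merges. A tiny sketch on four hand-picked points (illustrative values only):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

pts = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.5]])
Z = linkage(pts, method="ward")

# each row: id of first cluster, id of second cluster, merge distance, new size
for left, right, dist, size in Z:
    print(int(left), int(right), round(float(dist), 3), int(size))
```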

In [None]:
plt.figure(figsize=(20,20))
dendrogram(clusters)
plt.show()
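With many leaves a full dendrogram can be hard to read. SciPy's `dendrogram` accepts `truncate_mode="lastp"` together with `p` to show only the last `p` merged clusters as leaves. The sketch below uses made-up toy data and passes `no_plot=True` so that it only returns the layout data; dropping that argument would draw the plot with matplotlib.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(7)
# three toy blobs of 20 points each (illustrative data only)
toy = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(20, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(20, 2)),
    rng.normal(loc=(10, 0), scale=0.3, size=(20, 2)),
])
Z = linkage(toy, method="ward")

# keep only the last p merged clusters as leaves of the condensed tree
info = dendrogram(Z, truncate_mode="lastp", p=3, no_plot=True)
print(info["ivl"])  # leaf labels of the truncated dendrogram
```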

## Please answer the questions below to complete the experiment:

In [None]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "Good, But Not Challenging for me" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]


In [None]:
#@title If it was very easy, what more you would have liked to have been added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "Good" #@param {type:"string"}

In [None]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "Yes" #@param ["","Yes", "No"]

In [None]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "Very Useful" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title Run this cell to submit your notebook  { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id =return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")