# Data Mining / Prospecção de Dados

## Sara C. Madeira and André Falcão, 2019/20

# Project 2 - Clustering

## Logistics

**In a "normal" scenario students should work in teams of 2 people. Due to the social distancing imposed by the current public health situation, students were allowed to work in groups of 1 and 3. In this context, the amount of work was adapted to the number of students in each group, as described below.**

* Tasks **1 to 5** should be done by **all** groups
* Task **6** should be done only by **groups of 2 and 3** students
* Task **7** should be done only by **groups of 3** students

The quality of the project will then dictate its grade.

**The project's solution should be uploaded to Moodle before the end of May 17th, 2020 (23:59).**

**It is mandatory to produce a Jupyter notebook containing code and text/images/tables/etc. describing the solution and the results. Projects not delivered in this format will not be graded. Note that you can use `PD_201920_Project2.ipynb` as a template.**

Students should **upload a `.zip` file** containing all the files necessary for project evaluation. 

**Decisions should be justified and results should be critically discussed.**

## Dataset and Tools

In this project you should use [Python 3](https://www.python.org), [Jupyter Notebook](http://jupyter.org) and **[Scikit-learn](http://scikit-learn.org/stable/)**.

The dataset to be analysed is **`AML_ALL_PATIENTS_GENES_EXTENDED.csv`**. This is an extended version of the widely studied **Leukemia dataset**, originally published by Golub et al. (1999) ["Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene
Expression Monitoring"](http://archive.broadinstitute.org/mpr/publications/projects/Leukemia/Golub_et_al_1999.pdf.) 

**This dataset studies patients with leukemia. At disease onset, clinicians diagnosed them with one of two types of leukemia: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL).** Some of these diagnoses were later confirmed, while others turned out to be wrong. The data analyzed here contains the expression levels of 5147 human genes (features/columns) measured in 110 patients (rows): 70 ALL and 40 AML.
Each row identifies a patient: the first column, `ID`, contains the patients' IDs, the second column, `DIAGNOSIS`, contains the initial diagnosis as performed by clinicians (ground truth), and the remaining 5147 columns contain the expression levels of the 5147 genes analysed.

**The goal is to cluster patients and (ideally) find AML groups and ALL groups.**


<img src="AML_ALL_PATIENTS_GENES_EXTENDED.jpg" alt="AML_ALL_PATIENTS_GENES_EXTENDED.csv" style="width: 1000px;"/>

## 1. Load and Preprocess Dataset

At the end of this step you should have:
* a 110 rows × 5147 columns matrix, **X**, containing the values of the 5147 features for each of the 110 patients.
* a vector, **y**, with the 110 diagnoses, which you can use later to evaluate clustering quality.

In [None]:
import plotly.graph_objs as go
import plotly.offline as offline


import pandas as pd

from sklearn.decomposition import PCA

import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas as pd

df = pd.read_csv('AML_ALL_PATIENTS_GENES_EXTENDED.csv')
X = df.iloc[:,2:]
y = df.iloc[:,1]
y = y.to_frame()

In [None]:
X.shape

### Checking for columns with ``null`` values

In [None]:
X.columns[X.isna().any()].tolist()

No `null` values were found.

### Checking for columns with ``0`` values

In [None]:
zeros_columns = (X == 0).astype(int).sum(axis=0)
zeros_columns = zeros_columns.to_frame()
zeros_columns = zeros_columns[zeros_columns[0]>0]
zeros_columns.sort_values(by = [0])


In [None]:
#for row in zeros_columns.index: 
    #print(row, end = ", ") 

The genes ``AC002073_cds1_at, M16801_at, M25756_at, U07418_at, U13044_at, U27459_at, U69141_at, X51405_at, X66079_at, U10690_f_at`` contain values equal to zero, which may correspond to data that were never filled in.

If these zeros are in fact missing values, they pull the data away from the true values and shift both the mean and the median.
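To make the effect on the mean and the median concrete, here is a minimal sketch on made-up numbers. The `gene` series is hypothetical and is not taken from the dataset; it only illustrates how treating zeros as real measurements differs from treating them as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical expression values for one gene; the two zeros stand in for unfilled entries
gene = pd.Series([120.0, 0.0, 135.0, 128.0, 0.0, 131.0])

# Statistics computed with the zeros treated as real measurements
mean_with_zeros = gene.mean()
median_with_zeros = gene.median()

# Statistics computed after marking the zeros as missing (NaN is skipped by default)
gene_missing = gene.replace(0, np.nan)
mean_without_zeros = gene_missing.mean()
median_without_zeros = gene_missing.median()

print("with zeros:   ", mean_with_zeros, median_with_zeros)
print("without zeros:", mean_without_zeros, median_without_zeros)
```

Whether the zeros in the real data are missing values or genuine measurements cannot be decided from the numbers alone, so this is only an illustration of the potential bias.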


### Checking which rows contain ``0`` values

In [None]:
zeros_rows = (X == 0).astype(int).sum(axis=1)
zeros_rows = zeros_rows.to_frame()
zeros_rows = zeros_rows[zeros_rows[0]>0]
zeros_rows.sort_values(by = [0])

Values equal to zero were found in the following patients:
    6, 7, 17, 19, 38, 45, 47, 60, 61, 66, 73, 77, 85

Where patients:
- 7, 66 have **3** zero values,
- 45, 47 have **2** zero values,
- 6, 17, 19, 38, 60, 61, 73, 77, 85 have **1** zero value.

## 2. Dimensionality Reduction

As you already noticed, the number of features (genes) is extremely high when compared to the number of objects to cluster (patients). In this context, you should perform dimensionality reduction, that is, reduce the number of features, in two ways:

* [**Removing features with low variance**](http://scikit-learn.org/stable/modules/feature_selection.html)

* [**Using Principal Component Analysis**](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)

At the end of this step you should have two new matrices with the same number of rows, each with a different number of columns (features): **X_variance** and **X_PCA**. 

**Don't change X; you will need it!**

### Low variance


In [None]:
from sklearn.feature_selection import VarianceThreshold
import numpy as np
from sklearn import preprocessing


def variance_threshold_selector(data, threshold=0.5):
    selector = VarianceThreshold(threshold)
    selector.fit(data)
    return data[data.columns[selector.get_support(indices=True)]]

X_variance = variance_threshold_selector(data = X, threshold=1.2)



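print("Reduced to %d features" % X_variance.shape[1])
print()

# Genes with zero values that survived the variance filter
# (the original `list in list` test compared a whole list to single names and was always False)
[gene for gene in zeros_columns.index if gene in X_variance.columns]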

### PCA

'PCA is affected by scale so you need to scale the features in your data before applying PCA.'

Since all features are human genes, they should in principle be on the same scale, so StandardScaler would not be strictly necessary. Nevertheless, to look for better algorithm performance, we also apply this preprocessing step to obtain mean = 0 and variance = 1, which are the values that tend to give the best results with most algorithms.
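The quoted remark about scale can be checked on a toy example. The sketch below uses two synthetic, independent features on very different scales (hypothetical data, not the gene expression matrix) and compares the explained variance ratio of PCA before and after `StandardScaler`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent synthetic features on very different scales
small = rng.normal(scale=1.0, size=200)
large = rng.normal(scale=1000.0, size=200)
X_demo = np.column_stack([small, large])

# Without scaling, the large-scale feature dominates the first component
ratio_raw = PCA(n_components=2).fit(X_demo).explained_variance_ratio_

# After standardization both features contribute comparably
X_scaled = StandardScaler().fit_transform(X_demo)
ratio_scaled = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("raw:   ", ratio_raw)
print("scaled:", ratio_scaled)
```

If the genes really are on comparable scales, the difference on the real data may be much smaller than in this deliberately extreme example.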


#### Component analysis

In [None]:
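pca = PCA()
pca = pca.fit(X)

# The eigenvalues of the covariance matrix are explained_variance_ (not singular_values_).
# Note: the Kaiser rule (eigenvalue > 1) is usually stated for standardized features.
eigenvalues = pca.explained_variance_

kaiser_pca = int((eigenvalues > 1).sum())

print("Kaiser Rule: nº PC = ", kaiser_pca)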

In [None]:
var_exp = pca.explained_variance_ratio_
cum_var_exp = np.cumsum(var_exp)

# Plot only the first 10 components so that x and y have the same length
n_shown = 10
xi = ["PC%s" %i for i in range(1, n_shown + 1)]
trace1 = go.Scatter(
    x=xi,
    y=list(var_exp[:n_shown]),
    name="Explained Variance")

trace2 = go.Scatter(
    x=xi,
    y=list(cum_var_exp[:n_shown]),
    name="Cumulative Variance")

layout = go.Layout(
    title='Explained variance',
    xaxis=dict(title='Principal Components', tickmode='linear'))

data = [trace2, trace1]
fig = go.Figure(data=data, layout=layout)
offline.iplot(fig)

The plot shows that after principal component 2 the explained variance changes very little. According to the "elbow" rule, 2 components would be ideal, even though they explain only 27% of the variance, as the "Cumulative Variance" curve shows.

For a more rigorous choice we can look at the "Cumulative Variance" and pick a more satisfactory percentage, for example 9 principal components, which account for 50% of the variance.
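Picking the number of components from a cumulative-variance target can also be done programmatically. The sketch below uses synthetic data and a hypothetical 50% target; it finds the smallest number of components whose cumulative explained variance exceeds the target and compares it with what scikit-learn selects when `n_components` is given as a fraction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for X: 50 samples, 10 features with decreasing scale
X_demo = rng.normal(size=(50, 10)) * np.arange(10, 0, -1)

pca_demo = PCA().fit(X_demo)
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative variance exceeds the target
target = 0.50  # hypothetical threshold
n_components = int(np.argmax(cum_var > target)) + 1

# scikit-learn makes the same kind of choice when n_components is a float in (0, 1)
pca_auto = PCA(n_components=target, svd_solver="full").fit(X_demo)

print(n_components, pca_auto.n_components_)
```

The target itself remains a judgment call; the code only removes the need to read the value off the plot.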

####  PCA (n_components = 2)

##### Without StandardScaler

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

In [None]:
pca = PCA(n_components=2)

principalComponents = pca.fit_transform(X)


X_PCA = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2'])

X_PCA = pd.concat([X_PCA, y], axis = 1)

X_PCA

In [None]:


fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1) 
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)
targets = ['ALL', 'AML']
colors = ['r', 'g']
for DIAGNOSIS, color in zip(targets,colors):
    indicesToKeep = X_PCA['DIAGNOSIS'] == DIAGNOSIS
    ax.scatter(X_PCA.loc[indicesToKeep, 'principal component 1']
               , X_PCA.loc[indicesToKeep, 'principal component 2']
               , c = color
               , s = 50)
ax.legend(targets)
ax.grid()

##### With StandardScaler

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize the features
x_scaler = StandardScaler().fit_transform(X)

# Same process as before
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x_scaler)
X_PCA_scaler = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2'])

X_PCA_scaler_y = pd.concat([X_PCA_scaler, y], axis = 1)

X_PCA_scaler

The plot shows that, although the points still appear to be close together, their positions have changed and the scale has decreased considerably.

In [None]:
import matplotlib.pyplot as plt

fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1) 
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)
targets = ['ALL', 'AML']
colors = ['r', 'g']
for DIAGNOSIS, color in zip(targets,colors):
    indicesToKeep = X_PCA_scaler_y['DIAGNOSIS'] == DIAGNOSIS
    ax.scatter(X_PCA_scaler.loc[indicesToKeep, 'principal component 1']
               , X_PCA_scaler.loc[indicesToKeep, 'principal component 2']
               , c = color
               , s = 50)
ax.legend(targets)
ax.grid()

Write text in cells like this ...

#### PCA Kaiser Rule

In [None]:
pca_Kaiser = PCA(n_components=kaiser_pca)

# Fit on X and project it onto the components kept by the Kaiser rule
X_PCA_Kaiser = pca_Kaiser.fit_transform(X)

## 3. Clustering Patients using Partitional Clustering

Use **`K`-means** to cluster the patients:

* Cluster the original data (5147 features): **X**.
    * Use different values of `K`.
    * For each value of `K` present the clustering by specifying how many ALL and AML patients are in each cluster.  
    For instance, `{0: {'ALL': 70, 'AML': 0}, 1: {'ALL': 0, 'AML': 40}}` is the ideal clustering that we aim to obtain with K-means when `K=2`, where the first cluster has 70 ALL patients and 0 AML patients and the second cluster has 0 ALL patients and 40 AML patients. 
    You can choose how to output this information.  
    * What is the best value of `K` ? Justify using the clustering results and the [Silhouette score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html).

* Cluster the data obtained after removing features with low variance: **X_variance**.
    * Study different values of `K` as above.

* Cluster the data obtained after applying PCA: **X_PCA**.
    * Study different values of `K` as above.

* Compare the results obtained in the three datasets above for the best `K`. Discuss.
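The output format requested above can be produced with a contingency table. The following is a minimal sketch on synthetic data; `cluster_composition`, `X_demo` and `y_demo` are hypothetical names used only for illustration, and the two blobs are far better separated than real patients are likely to be.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in: two well-separated groups labelled ALL and AML
X_demo = np.vstack([rng.normal(0, 1, size=(30, 5)), rng.normal(8, 1, size=(20, 5))])
y_demo = pd.Series(["ALL"] * 30 + ["AML"] * 20, name="DIAGNOSIS")

def cluster_composition(labels, diagnosis):
    """Return {cluster: {diagnosis: count}} (hypothetical helper)."""
    table = pd.crosstab(pd.Series(labels, name="cluster"), diagnosis)
    return {int(c): {d: int(n) for d, n in row.items()} for c, row in table.iterrows()}

results = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    # Store the ALL/AML counts per cluster together with the silhouette score
    results[k] = (cluster_composition(labels, y_demo), silhouette_score(X_demo, labels))
    print(k, results[k])
```

Keeping the composition and the silhouette score side by side for each `K` makes it easier to justify the choice of `K` with both kinds of evidence.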

We implemented the "Elbow" method to get an initial idea of the best values for k. 
We stress "initial", since the best value of k will be decided from the performance obtained with the different values as well as from the silhouette score.

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

In [None]:
%%javascript  
IPython.OutputArea.auto_scroll_threshold = 9999;


### Functions

In [None]:
def f_kmeans(data, range_n_clusters):
    # Work on a NumPy array whether a DataFrame or an array is passed
    if isinstance(data, pd.DataFrame):
        data = data.to_numpy()
    
    list_silhouette_avg = []
    list_cluster_labels = []
    
    for n_clusters in range_n_clusters:


        clusterer = KMeans(n_clusters=n_clusters, init = 'random', max_iter=10000, n_init=10,random_state=0)
        cluster_labels = clusterer.fit_predict(data)
        list_cluster_labels.append(cluster_labels)
        
        silhouette_avg = silhouette_score(data, cluster_labels)
        list_silhouette_avg.append(silhouette_avg)
        print("For n_clusters =", n_clusters,
              "The average silhouette_score is :", silhouette_avg)

        if data.shape[1] == 2 :

            fig, (ax1, ax2) = plt.subplots(1, 2)
            fig.set_size_inches(18, 7)

            ax1.set_xlim([-0.1, 1])
            ax1.set_ylim([0, len(data) + (n_clusters + 1) * 10])

            sample_silhouette_values = silhouette_samples(data, cluster_labels)

            y_lower = 10
            for i in range(n_clusters):
                # Aggregate the silhouette scores for samples belonging to
                # cluster i, and sort them
                ith_cluster_silhouette_values = \
                    sample_silhouette_values[cluster_labels == i]

                ith_cluster_silhouette_values.sort()

                size_cluster_i = ith_cluster_silhouette_values.shape[0]
                y_upper = y_lower + size_cluster_i

                color = cm.nipy_spectral(float(i) / n_clusters)
                ax1.fill_betweenx(np.arange(y_lower, y_upper),
                                  0, ith_cluster_silhouette_values,
                                  facecolor=color, edgecolor=color, alpha=0.7)

                ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

                y_lower = y_upper + 10  # 10 for the 0 samples

            ax1.set_title("The silhouette plot for the various clusters.")
            ax1.set_xlabel("The silhouette coefficient values")
            ax1.set_ylabel("Cluster label")

            ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

            ax1.set_yticks([])  # Clear the yaxis labels / ticks
            ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

            # second plot
            colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
            ax2.scatter(data[:, 0], data[:, 1], marker='.', s=30, lw=0, alpha=0.7, c=colors, edgecolor='k')

            centers = clusterer.cluster_centers_

            ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                        c="white", alpha=1, s=200, edgecolor='k')

            for i, c in enumerate(centers):
                ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                            s=50, edgecolor='k')

            ax2.set_title("The visualization of the clustered data.")
            ax2.set_xlabel("Feature space for the 1st feature")
            ax2.set_ylabel("Feature space for the 2nd feature")

            plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                          "with n_clusters = %d" % n_clusters),
                         fontsize=14, fontweight='bold')

            plt.show()
    return list_silhouette_avg, list_cluster_labels, range_n_clusters





def result_kmeans(data, y, range_n_cluster = (2, 3, 4, 5, 6)):
    """Run K-means for each K and print the ALL/AML counts and purity of each cluster."""
    
    list_silhouette_avg, list_cluster_labels, range_n_clusters = f_kmeans(data, range_n_cluster)
    
    # Read the diagnoses without modifying y (the original version added columns to it)
    diagnosis = y["DIAGNOSIS"].to_numpy()
    for k, cluster_labels in zip(range_n_clusters, list_cluster_labels):
        print("\n\nK = ", k, "\n")
        score = 0
        for j in range(k):
            in_cluster = diagnosis[cluster_labels == j]
            all_count = int((in_cluster == 'ALL').sum())
            aml_count = int((in_cluster == 'AML').sum())
            total = all_count + aml_count
            # Size of the majority class in cluster j
            aux = max(all_count, aml_count)
            if total > 0:
                print(j, ": {ALL =",all_count, ", AML = ",aml_count,"}", "score:", aux/total)
                score += aux
        # Fraction of patients that belong to the majority class of their cluster
        print("Final score: ", score/len(y))
            

### All Features Dataset

In [None]:
result_kmeans(X, y, range(2,21))

### PCA(n_components = 2)

In [None]:
X_PCA = X_PCA.drop(["DIAGNOSIS"],axis=1)

In [None]:
result_kmeans(X_PCA, y)

### PCA Kaiser Rule

In [None]:
result_kmeans(X_PCA_Kaiser, y, list(range(2,21)))

### High-variance features (low-variance features removed)

In [None]:
result_kmeans(X_variance, y, range(2,21))

### Ivo Things

In [None]:
from sklearn.cluster import KMeans
wcss=[]

for i in range(1,71):
    kmeans = KMeans(n_clusters=i, init ='k-means++', max_iter=300,  n_init=10,random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

    
plt.plot(range(1,71),wcss)
plt.title('The Elbow Method Graph')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()


In [None]:
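# R (factoextra) version of the elbow plot above; it is not valid Python, so it is kept only as a comment.
# fviz_nbclust(X, kmeans, method = "wss", k.max = 110) + theme_minimal() + ggtitle("the Elbow Method")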

In [None]:
X[X.columns[[0,1]]]

#### k = 2

In [None]:
x_teste = StandardScaler().fit_transform(X)
colors = ['red', 'blue', 'cyan', 'orange', 'green']

def k_means(clusters, data):
    # Accept both DataFrames and NumPy arrays
    points = np.asarray(data)
    
    kmeans = KMeans(n_clusters=clusters, init = 'random', max_iter=1000, n_init=10,random_state=0 )
    y_kmeans = kmeans.fit_predict(points)
    
    # Plot the first two features, one colour and one label per cluster
    for i in range(0,clusters):
        plt.scatter(points[y_kmeans==i, 0], points[y_kmeans==i, 1], s=50, c=colors[i], label ='Cluster %d' % (i + 1))

    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='black', marker='x', label = 'Centroids')

    plt.legend()
    plt.show()
    return(y_kmeans)
#k_means(2, x_teste)
y2_kmeans = k_means(2, X)

In [None]:
x_teste

In [None]:
#from collections import Counter

#counter_k_2 = Counter(y_kmeans).values()


def renomear (row):
   if row['DIAGNOSIS'] == 'ALL' :
      return 1
   return 0

labels = y.apply (lambda row: renomear(row), axis=1)

ALL=0
AML=0
ALL_1=0
AML_1=0

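# `and` is required here: `&` binds more tightly than `==`, which silently breaks the comparisons
for i in range(len(labels)):
    if y2_kmeans[i] == 0 and labels[i] == 1:
        ALL = ALL + 1
    elif y2_kmeans[i] == 0 and labels[i] == 0:
        AML = AML + 1
    elif y2_kmeans[i] == 1 and labels[i] == 1:
        ALL_1 = ALL_1 + 1
    else:
        AML_1 = AML_1 + 1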

#{0: {'ALL': 70, 'AML': 0}, 1: {'ALL': 0, 'AML': 40}}
print('K = 2')
print('0: {ALL: %s, AML: %s}, 1: {ALL: %s, AML: %s}' % (ALL,AML,ALL_1,AML_1))


y2_kmeans

#### k = 3

In [None]:
'''kmeans = KMeans(n_clusters=3, init = 'random', max_iter=1000, n_init=10,random_state=0 )
y_kmeans = kmeans.fit_predict(x_teste)

plt.scatter(x_teste[y_kmeans==0, 0], x_teste[y_kmeans==0, 1], s=100, c='red', label ='Cluster 1')
plt.scatter(x_teste[y_kmeans==1, 0], x_teste[y_kmeans==1, 1], s=100, c='blue', label ='Cluster 2')
plt.scatter(x_teste[y_kmeans==2, 0], x_teste[y_kmeans==2, 1], s=100, c='cyan', label ='Cluster 3')

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label = 'Centroids')

plt.show()'''

k_means(3, X)

#### k = 4

In [None]:
'''kmeans = KMeans(n_clusters=4, init = 'random', max_iter=1000, n_init=10,random_state=0 )
y_kmeans = kmeans.fit_predict(x_teste)

plt.scatter(x_teste[y_kmeans==0, 0], x_teste[y_kmeans==0, 1], s=100, c='red', label ='Cluster 1')
plt.scatter(x_teste[y_kmeans==1, 0], x_teste[y_kmeans==1, 1], s=100, c='blue', label ='Cluster 2')
plt.scatter(x_teste[y_kmeans==2, 0], x_teste[y_kmeans==2, 1], s=100, c='cyan', label ='Cluster 3')
plt.scatter(x_teste[y_kmeans==3, 0], x_teste[y_kmeans==3, 1], s=100, c='green', label ='Cluster 4')

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label = 'Centroids')

plt.show()'''
k_means(4, X)

#### k = 5

In [None]:
'''kmeans = KMeans(n_clusters=5, init = 'random', max_iter=1000, n_init=10,random_state=0 )
y_kmeans = kmeans.fit_predict(x_teste)

plt.scatter(x_teste[y_kmeans==0, 0], x_teste[y_kmeans==0, 1], s=100, c='red', label ='Cluster 1')
plt.scatter(x_teste[y_kmeans==1, 0], x_teste[y_kmeans==1, 1], s=100, c='blue', label ='Cluster 2')
plt.scatter(x_teste[y_kmeans==2, 0], x_teste[y_kmeans==2, 1], s=100, c='cyan', label ='Cluster 3')
plt.scatter(x_teste[y_kmeans==3, 0], x_teste[y_kmeans==3, 1], s=100, c='green', label ='Cluster 4')
plt.scatter(x_teste[y_kmeans==4, 0], x_teste[y_kmeans==4, 1], s=100, c='orange', label ='Cluster 5')

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label = 'Centroids')

plt.show()'''
k_means(5, X)

### Cluster -> X_variance
#### k = 2

In [None]:
#X_variance = StandardScaler().fit_transform(X_variance) // makes no difference, since the variance preprocessing has already been applied

kmeans = KMeans(n_clusters=2, init = 'random', max_iter=1000, n_init=10,random_state=0 )
y_kmeans = kmeans.fit_predict(X_variance)

# X_variance is still a DataFrame here, so use a NumPy array for positional indexing
X_variance_values = np.asarray(X_variance)
for i, color in enumerate(['red', 'blue']):
    plt.scatter(X_variance_values[y_kmeans==i, 0], X_variance_values[y_kmeans==i, 1], s=100, c=color, label ='Cluster %d' % (i + 1))

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label = 'Centroids')

plt.show()

#### K = 3

In [None]:
kmeans = KMeans(n_clusters=3, init = 'random', max_iter=1000, n_init=10,random_state=0 )
y_kmeans = kmeans.fit_predict(X_variance)

# X_variance is still a DataFrame here, so use a NumPy array for positional indexing
X_variance_values = np.asarray(X_variance)
for i, color in enumerate(['red', 'blue', 'green']):
    plt.scatter(X_variance_values[y_kmeans==i, 0], X_variance_values[y_kmeans==i, 1], s=100, c=color, label ='Cluster %d' % (i + 1))

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label = 'Centroids')

plt.show()

#### K = 4

In [None]:
kmeans = KMeans(n_clusters=4, init = 'random', max_iter=1000, n_init=10,random_state=0 )
y_kmeans = kmeans.fit_predict(X_variance)

# X_variance is still a DataFrame here, so use a NumPy array for positional indexing
X_variance_values = np.asarray(X_variance)
for i, color in enumerate(['red', 'blue', 'green', 'orange']):
    plt.scatter(X_variance_values[y_kmeans==i, 0], X_variance_values[y_kmeans==i, 1], s=100, c=color, label ='Cluster %d' % (i + 1))

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label = 'Centroids')

plt.show()

#### K = 5

In [None]:
kmeans = KMeans(n_clusters=5, init = 'random', max_iter=1000, n_init=10,random_state=0 )
y_kmeans = kmeans.fit_predict(X_variance)

# X_variance is still a DataFrame here, so use a NumPy array for positional indexing
X_variance_values = np.asarray(X_variance)
for i, color in enumerate(['red', 'blue', 'green', 'orange', 'cyan']):
    plt.scatter(X_variance_values[y_kmeans==i, 0], X_variance_values[y_kmeans==i, 1], s=100, c=color, label ='Cluster %d' % (i + 1))

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label = 'Centroids')

plt.show()

### X_PCA
#### k = 2

In [None]:
k_means(2, X_PCA_scaler)

#### k = 3

In [None]:
k_means(3, X_PCA_scaler)

#### k = 4

In [None]:
k_means(4, X_PCA_scaler)

#### k = 5

In [None]:
k_means(5, X_PCA_scaler)

## 4. Clustering Patients using Hierarchical Clustering

Use a **Hierarchical Clustering Algorithm (HCA)** to cluster the patients: 

* Cluster the data in **X_variance**.
    * Use **different linkage metrics** (single, complete, average, ward; this list is my own addition).
    * Use different values of `K`.
    * For each linkage metric and value of `K` present the clustering by specifying how many ALL and AML patients are in each cluster as you did before. 
    * What is the best linkage metric and the best value of `K`? Justify using the clustering results and the [Silhouette score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html).

* Cluster the data in **X_PCA**.
    * Study different linkage metrics and different values of `K` as above.

* Compare the results obtained in the two datasets above for the best linkage metric and the best `K`. Discuss.
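The comparison of linkage metrics and values of `K` described above can be organised as a small grid search. This is only a sketch on synthetic data (`X_demo` is a hypothetical stand-in for **X_variance** or **X_PCA**), scoring each combination with the silhouette score.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in: two well-separated groups of points
X_demo = np.vstack([rng.normal(0, 1, size=(30, 5)), rng.normal(8, 1, size=(20, 5))])

scores = {}
for linkage in ("single", "complete", "average", "ward"):
    for k in (2, 3, 4):
        # Euclidean distance is the default, which is also what "ward" requires
        labels = AgglomerativeClustering(n_clusters=k, linkage=linkage).fit_predict(X_demo)
        scores[(linkage, k)] = silhouette_score(X_demo, labels)

# Combination with the highest silhouette score (ties go to the first one tried)
best_linkage, best_k = max(scores, key=scores.get)
print(best_linkage, best_k, scores[(best_linkage, best_k)])
```

On real data the silhouette score should be read together with the ALL/AML counts per cluster, since a high score does not guarantee that clusters match the diagnoses.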

### X_variance - Hierarchical Clustering (cluster counts and discussion still missing)

In [None]:
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as shc

#PLot
plt.figure(figsize=(10, 7))  
plt.title("Dendrograms")  
dend = shc.dendrogram(shc.linkage(X_variance, method='ward'))
plt.show()

In [None]:
# Convert to a NumPy array (np.asarray also works if the cell is run twice)
X_variance = np.asarray(X_variance)

#### k = 2

In [None]:
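def hc(clusters, data, metric):
    # Accept both DataFrames and NumPy arrays
    points = np.asarray(data)
    
    # Euclidean distance is the default, so it is not passed explicitly
    # (newer scikit-learn versions renamed the `affinity` keyword to `metric`)
    cluster = AgglomerativeClustering(n_clusters=clusters, linkage=metric)
    y_hc = cluster.fit_predict(points)
    
    # Plot the first two features of the data that was actually clustered
    for i in range(0,clusters):
        plt.scatter(points[y_hc==i, 0], points[y_hc==i, 1], s=100, c=colors[i], label ='Cluster %d' % (i + 1))

    plt.show()
    return(y_hc)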

hc2 = hc(2, X_variance,'ward')

#### k = 3

In [None]:
hc3 = hc(3, X_variance, 'ward')

#### k = 4


In [None]:
hc4 = hc(4, X_variance, 'ward')

#### k = 5

In [None]:
hc5 = hc(5, X_variance, 'ward')

### X_PCA
#### k = 2


In [None]:
pca_hc2 = hc(2, X_PCA_scaler, 'ward')

#### k = 3

In [None]:
pca_hc3 = hc(3, X_PCA_scaler, 'ward')

#### k = 4

In [None]:
pca_hc4 = hc(4, X_PCA_scaler, 'ward')

#### k = 5

In [None]:
pca_hc5 = hc(5, X_PCA_scaler, 'ward')

Write text in cells like this ...

## 5. Evaluating Clustering Results

In this task you should compare the best results obtained using `K`-means and HCA 
1. **Without using ground truth**
2. **Using ground truth (`DIAGNOSIS`)**.

### 5.1. Without Using Ground Truth

**Choose one adequate measure** from those available in Scikit-learn (https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation) to evaluate the different clusterings. 

Discuss the results.
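As one possible way to approach this task, the sketch below scores a K-means and a Ward clustering with two internal measures from scikit-learn, the Calinski-Harabasz index and the Davies-Bouldin index. The data is synthetic and the choice of measures is an assumption for illustration, not the measure selected in this project.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

rng = np.random.default_rng(0)
# Synthetic stand-in: two well-separated groups of points
X_demo = np.vstack([rng.normal(0, 1, size=(30, 5)), rng.normal(8, 1, size=(20, 5))])

clusterings = {
    "kmeans": KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo),
    "hca_ward": AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X_demo),
}

internal_scores = {}
for name, labels in clusterings.items():
    internal_scores[name] = {
        # Higher is better
        "calinski_harabasz": calinski_harabasz_score(X_demo, labels),
        # Lower is better
        "davies_bouldin": davies_bouldin_score(X_demo, labels),
    }
    print(name, internal_scores[name])
```

Both measures use only the data and the cluster labels, so they can be compared across algorithms without looking at `DIAGNOSIS`.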

In [None]:
# Write code in cells like this
# ....

Write text in cells like this ...

### 5.2. Using Ground Truth

**Choose one adequate measure** from those available in Scikit-learn (https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation) to evaluate the different clusterings. 

Discuss the results.
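For evaluation against the diagnoses, one candidate is the adjusted Rand index, which does not depend on how cluster ids are numbered. The sketch below uses hypothetical labels for ten patients; the choice of measure is an assumption for illustration only.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical ground truth and two candidate clusterings for ten patients
truth = np.array(["ALL"] * 6 + ["AML"] * 4)
perfect = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])    # same partition, cluster ids swapped
imperfect = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # two ALL patients grouped with AML

# The score depends on the partition only, not on the numeric cluster ids
ari_perfect = adjusted_rand_score(truth, perfect)
ari_imperfect = adjusted_rand_score(truth, imperfect)

print(ari_perfect, ari_imperfect)
```

Because some initial diagnoses turned out to be wrong, a score below 1 on the real data does not necessarily mean the clustering is at fault.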

In [None]:
# Write code in cells like this
# ....

Write text in cells like this ...

## 6. Clustering Patients using Density-based Clustering

Use DBSCAN (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) or OPTICS (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html) to cluster the patients.

Compare the results with those of K-means and HCA.
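DBSCAN results depend strongly on `eps`, and points labelled `-1` are noise rather than a cluster. The sketch below shows one rough heuristic on synthetic data: take a high percentile of the distance to the `min_samples`-th nearest neighbour as `eps`. The percentile, `min_samples` and the data are all hypothetical choices, not values tuned for this dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Synthetic stand-in: two well-separated 2-D groups
X_demo = np.vstack([rng.normal(0, 1, size=(30, 2)), rng.normal(10, 1, size=(20, 2))])

min_samples = 4  # hypothetical choice
# Distance from each point to its min_samples-th nearest neighbour (the point itself included)
distances, _ = NearestNeighbors(n_neighbors=min_samples).fit(X_demo).kneighbors(X_demo)
k_distances = np.sort(distances[:, -1])

# A high percentile of the k-distances is one rough heuristic for eps
eps = float(np.percentile(k_distances, 90))

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_demo)
# Label -1 marks noise points and is not counted as a cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(eps, n_clusters, n_noise)
```

Unlike K-means and HCA, the number of clusters is an outcome here rather than an input, which is worth keeping in mind when comparing the three methods.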

In [None]:
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=5, min_samples = 2)
clusters = dbscan.fit_predict(X_variance)

plt.scatter(X_variance[clusters==0, 0], X_variance[clusters==0, 1], s=100, c='red', label ='Cluster 1')
plt.scatter(X_variance[clusters==1, 0], X_variance[clusters==1, 1], s=100, c='blue', label ='Cluster 2')

Write text in cells like this ...

## 7. Choose a Different Clustering Algorithm to Group the Patients

Choose **a clustering algorithm** besides `K`-means, HCA and DBSCAN/OPTICS to cluster the patients. 

Justify your choice and compare the results with those of `K`-means, HCA and DBSCAN/OPTICS.

### Mean Shift

Write text in cells like this ...

In [None]:
from sklearn.cluster import MeanShift

ms = MeanShift()
clusters = ms.fit_predict(X_PCA_scaler)

# Plot the same 2-D PCA data that was clustered (the original plotted x_teste instead)
points = np.asarray(X_PCA_scaler)
plt.scatter(points[clusters==0, 0], points[clusters==0, 1], s=100, c='red', label ='Cluster 1')
plt.scatter(points[clusters==1, 0], points[clusters==1, 1], s=100, c='blue', label ='Cluster 2')