# Hierarchical Clustering

**Hierarchical clustering** refers to a class of clustering methods that seek to build a **hierarchy** of clusters, in which some clusters contain others. In this assignment, we will explore a top-down approach, recursively bipartitioning the data using k-means.

**Note to Amazon EC2 users**: To conserve memory, make sure to stop all the other notebooks before running this notebook.

## Import packages

In [1]:
from __future__ import print_function # to conform python 2.x print to python 3.x
import turicreate as tc
import matplotlib.pyplot as plt
import numpy as np
import sys
import os
import time
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances
%matplotlib inline

## Load the Wikipedia dataset

In [2]:
wiki = tc.SFrame('people_wiki.sframe/')

In [3]:
wiki

URI,name,text
<http://dbpedia.org/resou rce/Digby_Morrell> ...,Digby Morrell,digby morrell born 10 october 1979 is a former ...
<http://dbpedia.org/resou rce/Alfred_J._Lewy> ...,Alfred J. Lewy,alfred j lewy aka sandy lewy graduated from ...
<http://dbpedia.org/resou rce/Harpdog_Brown> ...,Harpdog Brown,harpdog brown is a singer and harmonica player who ...
<http://dbpedia.org/resou rce/Franz_Rottensteiner> ...,Franz Rottensteiner,franz rottensteiner born in waidmannsfeld lower ...
<http://dbpedia.org/resou rce/G-Enka> ...,G-Enka,henry krvits born 30 december 1974 in tallinn ...
<http://dbpedia.org/resou rce/Sam_Henderson> ...,Sam Henderson,sam henderson born october 18 1969 is an ...
<http://dbpedia.org/resou rce/Aaron_LaCrate> ...,Aaron LaCrate,aaron lacrate is an american music producer ...
<http://dbpedia.org/resou rce/Trevor_Ferguson> ...,Trevor Ferguson,trevor ferguson aka john farrow born 11 november ...
<http://dbpedia.org/resou rce/Grant_Nelson> ...,Grant Nelson,grant nelson born 27 april 1971 in london ...
<http://dbpedia.org/resou rce/Cathy_Caruth> ...,Cathy Caruth,cathy caruth born 1955 is frank h t rhodes ...


As we did in previous assignments, let's extract the TF-IDF features:

In [4]:
wiki['tf_idf'] = tc.text_analytics.tf_idf(wiki['text'])

To run k-means on this dataset, we should convert the data matrix into a sparse matrix.

In [5]:
from em_utilities import sframe_to_scipy # converter

# This will take about a minute or two.
wiki = wiki.add_row_number()
tf_idf, map_word_to_index = sframe_to_scipy(wiki, 'tf_idf')

To be consistent with the k-means assignment, let's normalize all vectors to have unit norm.

In [6]:
from sklearn.preprocessing import normalize
tf_idf = normalize(tf_idf)

## Bipartition the Wikipedia dataset using k-means

Recall our workflow for clustering text data with k-means:

1. Load the dataframe containing a dataset, such as the Wikipedia text dataset.
2. Extract the data matrix from the dataframe.
3. Run k-means on the data matrix with some value of k.
4. Visualize the clustering results using the centroids, cluster assignments, and the original dataframe. We keep the original dataframe around because the data matrix does not keep auxiliary information (in the case of the text dataset, the title of each article).

Recuerde nuestro flujo de trabajo para agrupar datos de texto con k-means:

1. Cargue el dataframe que contiene un conjunto de datos, como el conjunto de datos de texto de Wikipedia.
2. Extraiga la matriz de datos del dataframe.
3. Ejecute k-means en la matriz de datos con algún valor de k.
4. Visualice los resultados del clustering utilizando los centroides, las asignaciones de cluster y el dataframe original. Mantenemos el dataframe original porque la matriz de datos no guarda información auxiliar (en el caso del conjunto de datos de texto, el título de cada artículo).

Let us modify the workflow to perform bipartitioning:

1. Load the dataframe containing a dataset, such as the Wikipedia text dataset.
2. Extract the data matrix from the dataframe.
3. Run k-means on the data matrix with k=2.
4. Divide the data matrix into two parts using the cluster assignments.
5. Divide the dataframe into two parts, again using the cluster assignments. This step is necessary to allow for visualization.
6. Visualize the bipartition of data.

Modifiquemos el flujo de trabajo para realizar biparticiones:

1. Cargue el dataframe que contiene un conjunto de datos, como el conjunto de datos de texto de Wikipedia.
2. Extraiga la matriz de datos del dataframe.
3. Ejecute k-means en la matriz de datos con k=2.
4. Divida la matriz de datos en dos partes utilizando las asignaciones de cluster.
5. Divida el dataframe en dos partes, nuevamente utilizando las asignaciones de clúster. Este paso es necesario para permitir la visualización.
6. Visualice la bipartición de datos.

We'd like to be able to repeat Steps 3-6 multiple times to produce a **hierarchy** of clusters such as the following:
```
                      (root)
                         |
            +------------+-------------+
            |                          |
         Cluster                    Cluster
     +------+-----+             +------+-----+
     |            |             |            |
   Cluster     Cluster       Cluster      Cluster
```
Each **parent cluster** is bipartitioned to produce two **child clusters**. At the very top is the **root cluster**, which consists of the entire dataset.

Now we write a wrapper function to bipartition a given cluster using k-means. There are three variables that together comprise the cluster:

* `dataframe`: a subset of the original dataframe that correspond to member rows of the cluster
* `matrix`: same set of rows, stored in sparse matrix format
* `centroid`: the centroid of the cluster (not applicable for the root cluster)

Rather than passing around the three variables separately, we package them into a Python dictionary. The wrapper function takes a single dictionary (representing a parent cluster) and returns two dictionaries (representing the child clusters).

In [9]:
def bipartition(cluster, maxiter=400, num_runs=4, seed=None):
    '''cluster: should be a dictionary containing the following keys
                * dataframe: original dataframe
                * matrix:    same data, in matrix format
                * centroid:  centroid for this particular cluster'''
    
    data_matrix = cluster['matrix']
    dataframe   = cluster['dataframe']
    
    # Run k-means on the data matrix with k=2. We use scikit-learn here to simplify workflow.
    kmeans_model = KMeans(n_clusters=2, max_iter=maxiter, n_init=num_runs, random_state=seed, verbose=0)
    kmeans_model.fit(data_matrix)
    centroids, cluster_assignment = kmeans_model.cluster_centers_, kmeans_model.labels_
    
    # Divide the data matrix into two parts using the cluster assignments.
    data_matrix_left_child, data_matrix_right_child = data_matrix[cluster_assignment==0], \
                                                      data_matrix[cluster_assignment==1]
    
    # Divide the dataframe into two parts, again using the cluster assignments.
    cluster_assignment_sa = tc.SArray(cluster_assignment) # minor format conversion
    dataframe_left_child, dataframe_right_child     = dataframe[cluster_assignment_sa==0], \
                                                      dataframe[cluster_assignment_sa==1]
        
    
    # Package relevant variables for the child clusters
    cluster_left_child  = {'matrix': data_matrix_left_child,
                           'dataframe': dataframe_left_child,
                           'centroid': centroids[0]}
    cluster_right_child = {'matrix': data_matrix_right_child,
                           'dataframe': dataframe_right_child,
                           'centroid': centroids[1]}
    
    return (cluster_left_child, cluster_right_child)

The following cell performs bipartitioning of the Wikipedia dataset. Allow 2+ minutes to finish.

Note. For the purpose of the assignment, we set an explicit seed (`seed=1`) to produce identical outputs for every run. In pratical applications, you might want to use different random seeds for all runs.

In [10]:
%%time
wiki_data = {'matrix': tf_idf, 'dataframe': wiki} # no 'centroid' for the root cluster
left_child, right_child = bipartition(wiki_data, maxiter=100, num_runs=1, seed=0)

CPU times: user 18.2 s, sys: 255 ms, total: 18.4 s
Wall time: 4.28 s


Let's examine the contents of one of the two clusters, which we call the `left_child`, referring to the tree visualization above.

In [11]:
left_child

{'matrix': <30219x547979 sparse matrix of type '<class 'numpy.float64'>'
 	with 5282514 stored elements in Compressed Sparse Row format>,
 'dataframe': Columns:
 	id	int
 	URI	str
 	name	str
 	text	str
 	tf_idf	dict
 
 Rows: Unknown
 
 Data:
 +----+-------------------------------+-------------------------------+
 | id |              URI              |              name             |
 +----+-------------------------------+-------------------------------+
 | 1  | <http://dbpedia.org/resour... |         Alfred J. Lewy        |
 | 3  | <http://dbpedia.org/resour... |      Franz Rottensteiner      |
 | 5  | <http://dbpedia.org/resour... |         Sam Henderson         |
 | 7  | <http://dbpedia.org/resour... |        Trevor Ferguson        |
 | 9  | <http://dbpedia.org/resour... |          Cathy Caruth         |
 | 10 | <http://dbpedia.org/resour... |          Sophie Crumb         |
 | 11 | <http://dbpedia.org/resour... |         Jenn Ashworth         |
 | 12 | <http://dbpedia.org/resour... 

And here is the content of the other cluster we named `right_child`.

In [12]:
right_child

{'matrix': <28852x547979 sparse matrix of type '<class 'numpy.float64'>'
 	with 5096769 stored elements in Compressed Sparse Row format>,
 'dataframe': Columns:
 	id	int
 	URI	str
 	name	str
 	text	str
 	tf_idf	dict
 
 Rows: Unknown
 
 Data:
 +----+-------------------------------+-------------------------------+
 | id |              URI              |              name             |
 +----+-------------------------------+-------------------------------+
 | 0  | <http://dbpedia.org/resour... |         Digby Morrell         |
 | 2  | <http://dbpedia.org/resour... |         Harpdog Brown         |
 | 4  | <http://dbpedia.org/resour... |             G-Enka            |
 | 6  | <http://dbpedia.org/resour... |         Aaron LaCrate         |
 | 8  | <http://dbpedia.org/resour... |          Grant Nelson         |
 | 15 | <http://dbpedia.org/resour... |         Joerg Steineck        |
 | 17 | <http://dbpedia.org/resour... | Paddy Dunne (Gaelic footba... |
 | 18 | <http://dbpedia.org/resour... 

## Visualize the bipartition

We provide you with a modified version of the visualization function from the k-means assignment. For each cluster, we print the top 5 words with highest TF-IDF weights in the centroid and display excerpts for the 8 nearest neighbors of the centroid.

In [13]:
def display_single_tf_idf_cluster(cluster, map_index_to_word):
    '''map_index_to_word: SFrame specifying the mapping betweeen words and column indices'''
    
    wiki_subset   = cluster['dataframe']
    tf_idf_subset = cluster['matrix']
    centroid      = cluster['centroid']
    
    # Print top 5 words with largest TF-IDF weights in the cluster
    idx = centroid.argsort()[::-1]
    for i in range(5):
        print('{0}:{1:.3f}'.format(map_index_to_word['category'], centroid[idx[i]])),
    print('')
    
    # Compute distances from the centroid to all data points in the cluster.
    distances = pairwise_distances(tf_idf_subset, [centroid], metric='euclidean').flatten()
    # compute nearest neighbors of the centroid within the cluster.
    nearest_neighbors = distances.argsort()
    # For 8 nearest neighbors, print the title as well as first 180 characters of text.
    # Wrap the text at 80-character mark.
    for i in range(8):
        text = ' '.join(wiki_subset[nearest_neighbors[i]]['text'].split(None, 25)[0:25])
        print('* {0:50s} {1:.5f}\n  {2:s}\n  {3:s}'.format(wiki_subset[nearest_neighbors[i]]['name'],
              distances[nearest_neighbors[i]], text[:90], text[90:180] if len(text) > 90 else ''))
    print('')

Let's visualize the two child clusters:

In [14]:
display_single_tf_idf_cluster(left_child, map_word_to_index)

113949:0.021
113949:0.015
113949:0.013
113949:0.012
113949:0.010

* Kayee Griffin                                      0.97358
  kayee frances griffin born 6 february 1950 is an australian politician and former australi
  an labor party member of the new south wales legislative council serving
* %C3%81ine Hyland                                   0.97370
  ine hyland ne donlon is emeritus professor of education and former vicepresident of univer
  sity college cork ireland she was born in 1942 in athboy co
* Christine Robertson                                0.97373
  christine mary robertson born 5 october 1948 is an australian politician and former austra
  lian labor party member of the new south wales legislative council serving
* Anita Kunz                                         0.97471
  anita e kunz oc born 1956 is a canadianborn artist and illustratorkunz has lived in london
   new york and toronto contributing to magazines and working
* Barry Sullivan (lawyer)                 

In [15]:
display_single_tf_idf_cluster(right_child, map_word_to_index)

113949:0.023
113949:0.017
113949:0.017
113949:0.016
113949:0.016

* Patricia Scott                                     0.97143
  patricia scott pat born july 14 1929 is a former pitcher who played in the allamerican gir
  ls professional baseball league for parts of four seasons
* Madonna (entertainer)                              0.97181
  madonna louise ciccone tkoni born august 16 1958 is an american singer songwriter actress 
  and businesswoman she achieved popularity by pushing the boundaries of lyrical
* Janet Jackson                                      0.97257
  janet damita jo jackson born may 16 1966 is an american singer songwriter and actress know
  n for a series of sonically innovative socially conscious and
* Natashia Williams                                  0.97343
  natashia williamsblach born august 2 1978 is an american actress and former wonderbra camp
  aign model who is perhaps best known for her role as shane
* Todd Williams                                     

The right cluster consists of athletes and artists (singers and actors/actresses), whereas the left cluster consists of non-athletes and non-artists. So far, we have a single-level hierarchy consisting of two clusters, as follows:

```
                                           Wikipedia
                                               +
                                               |
                    +--------------------------+--------------------+
                    |                                               |
                    +                                               +
         Non-athletes/artists                                Athletes/artists
```

Is this hierarchy good enough? **When building a hierarchy of clusters, we must keep our particular application in mind.** For instance, we might want to build a **directory** for Wikipedia articles. A good directory would let you quickly narrow down your search to a small set of related articles. The categories of athletes and non-athletes are too general to facilitate efficient search. For this reason, we decide to build another level into our hierarchy of clusters with the goal of getting more specific cluster structure at the lower level. To that end, we subdivide both the `athletes/artists` and `non-athletes/artists` clusters.

¿Es esta jerarquía lo suficientemente buena? **Al crear una jerarquía de clústeres, debemos tener en cuenta nuestra aplicación particular.** Por ejemplo, es posible que deseemos crear un **directorio** para los artículos de Wikipedia. Un buen directorio le permitiría limitar rápidamente su búsqueda a un pequeño conjunto de artículos relacionados. Las categorías de atletas y no atletas son demasiado generales para facilitar una búsqueda eficiente. Por este motivo, decidimos crear otro nivel en nuestra jerarquía de clústeres con el objetivo de obtener una estructura de clústeres más específica en el nivel inferior. Con ese fin, subdividimos los grupos "atletas/artistas" y "no atletas/artistas".

## Perform recursive bipartitioning

### Cluster of athletes and artists

To help identify the clusters we've built so far, let's give them easy-to-read aliases:

In [16]:
non_athletes_artists   = left_child
athletes_artists       = right_child

Using the bipartition function, we produce two child clusters of the athlete cluster:

In [17]:
# Bipartition the cluster of athletes and artists
left_child_athletes_artists, right_child_athletes_artists = bipartition(athletes_artists, maxiter=100, num_runs=6, seed=1)

The left child cluster mainly consists of athletes:

In [18]:
display_single_tf_idf_cluster(left_child_athletes_artists, map_word_to_index)

113949:0.036
113949:0.032
113949:0.027
113949:0.026
113949:0.025

* Todd Williams                                      0.95702
  todd michael williams born february 13 1971 in syracuse new york is a former major league 
  baseball relief pitcher he attended east syracuseminoa high school
* Gord Sherven                                       0.95840
  gordon r sherven born august 21 1963 in gravelbourg saskatchewan and raised in mankota sas
  katchewan is a retired canadian professional ice hockey forward who played
* Justin Knoedler                                    0.95907
  justin joseph knoedler born july 17 1980 in springfield illinois is a former major league 
  baseball catcherknoedler was originally drafted by the st louis cardinals
* Chris Day                                          0.95918
  christopher nicholas chris day born 28 july 1975 is an english professional footballer who
   plays as a goalkeeper for stevenageday started his career at tottenham
* Tony Smith (football

On the other hand, the right child cluster consists mainly of artists (singers and actors/actresses):

In [19]:
display_single_tf_idf_cluster(right_child_athletes_artists, map_word_to_index)

113949:0.033
113949:0.031
113949:0.026
113949:0.026
113949:0.021

* Madonna (entertainer)                              0.96003
  madonna louise ciccone tkoni born august 16 1958 is an american singer songwriter actress 
  and businesswoman she achieved popularity by pushing the boundaries of lyrical
* Janet Jackson                                      0.96110
  janet damita jo jackson born may 16 1966 is an american singer songwriter and actress know
  n for a series of sonically innovative socially conscious and
* Cher                                               0.96531
  cher r born cherilyn sarkisian may 20 1946 is an american singer actress and television ho
  st described as embodying female autonomy in a maledominated industry
* Laura Smith                                        0.96572
  laura smith is a canadian folk singersongwriter she is best known for her 1995 single shad
  e of your love one of the years biggest hits
* Lizzie West                                        0

Our hierarchy of clusters now looks like this:
```
                                           Wikipedia
                                               +
                                               |
                    +--------------------------+--------------------+
                    |                                               |
                    +                                               +
         Non-athletes/artists                                Athletes/artists
                                                                    +
                                                                    |
                                                         +----------+----------+
                                                         |                     |
                                                         |                     |
                                                         +                     |
                                                     athletes               artists
```

Should we keep subdividing the clusters? If so, which cluster should we subdivide? To answer this question, we again think about our application. Since we organize our directory by topics, it would be nice to have topics that are about as coarse as each other. For instance, if one cluster is about baseball, we expect some other clusters about football, basketball, volleyball, and so forth. That is, **we would like to achieve similar level of granularity for all clusters.**

Both the athletes and artists node can be subdivided more, as each one can be divided into more descriptive professions (singer/actress/painter/director, or baseball/football/basketball, etc.). Let's explore subdividing the athletes cluster further to produce finer child clusters.

¿Deberíamos seguir subdividiendo los grupos? Si es así, ¿qué grupo deberíamos subdividir? Para responder a esta pregunta, volvemos a pensar en nuestra aplicación. Dado que organizamos nuestro directorio por temas, sería bueno tener temas que sean tan burdos como los demás. Por ejemplo, si un grupo es sobre béisbol, esperamos otros grupos sobre fútbol, baloncesto, voleibol, etc. Es decir, **nos gustaría lograr un nivel similar de granularidad para todos los clústeres.**

Tanto el nodo de atletas como el de artistas se pueden subdividir más, ya que cada uno se puede dividir en profesiones más descriptivas (cantante/actriz/pintora/directora, o béisbol/fútbol/baloncesto, etc.). Exploremos subdividiendo aún más el grupo de atletas para producir grupos de niños más finos.

Let's give the clusters aliases as well:

In [20]:
athletes    = left_child_athletes_artists
artists     = right_child_athletes_artists

### Cluster of athletes

In answering the following quiz question, take a look at the topics represented in the top documents (those closest to the centroid), as well as the list of words with highest TF-IDF weights.

Let us bipartition the cluster of athletes.

Al responder la siguiente pregunta del cuestionario, eche un vistazo a los temas representados en los documentos principales (los más cercanos al centroide), así como a la lista de palabras con los pesos TF-IDF más altos.

Dividamos en dos el grupo de atletas.

In [21]:
left_child_athletes, right_child_athletes = bipartition(athletes, maxiter=100, num_runs=6, seed=1)

In [22]:
display_single_tf_idf_cluster(left_child_athletes, map_word_to_index)
display_single_tf_idf_cluster(right_child_athletes, map_word_to_index)

113949:0.110
113949:0.103
113949:0.051
113949:0.046
113949:0.045

* Steve Springer                                     0.89327
  steven michael springer born february 11 1961 is an american former professional baseball 
  player who appeared in major league baseball as a third baseman and
* Dave Ford                                          0.89574
  david alan ford born december 29 1956 is a former major league baseball pitcher for the ba
  ltimore orioles born in cleveland ohio ford attended lincolnwest
* Todd Williams                                      0.89823
  todd michael williams born february 13 1971 in syracuse new york is a former major league 
  baseball relief pitcher he attended east syracuseminoa high school
* Justin Knoedler                                    0.90084
  justin joseph knoedler born july 17 1980 in springfield illinois is a former major league 
  baseball catcherknoedler was originally drafted by the st louis cardinals
* Kevin Nicholson (baseball)        

**Quiz Question**. Which diagram best describes the hierarchy right after splitting the `athletes` cluster? Refer to the quiz form for the diagrams.

**Pregunta de prueba**. ¿Qué diagrama describe mejor la jerarquía justo después de dividir el grupo de "atletas"? Consulte el formulario de prueba para ver los diagramas.

**Caution**. The granularity criteria is an imperfect heuristic and must be taken with a grain of salt. It takes a lot of manual intervention to obtain a good hierarchy of clusters.

* **If a cluster is highly mixed, the top articles and words may not convey the full picture of the cluster.** Thus, we may be misled if we judge the purity of clusters solely by their top documents and words. 
* **Many interesting topics are hidden somewhere inside the clusters but do not appear in the visualization.** We may need to subdivide further to discover new topics. For instance, subdividing the `ice_hockey_football` cluster led to the appearance of runners and golfers.

**Precaución**. El criterio de granularidad es una heurística imperfecta y debe tomarse con pinzas. Se necesita mucha intervención manual para obtener una buena jerarquía de clústeres.

* **Si un grupo está muy mezclado, es posible que los artículos y las palabras principales no transmitan la imagen completa del grupo.** Por lo tanto, podemos equivocarnos si juzgamos la pureza de los grupos únicamente por sus documentos y palabras principales.
* **Muchos temas interesantes están ocultos en algún lugar dentro de los grupos pero no aparecen en la visualización.** Es posible que tengamos que subdividir más para descubrir nuevos temas. Por ejemplo, la subdivisión del grupo `ice_hockey_football` condujo a la aparición de corredores y golfistas.

### Cluster of non-athletes

Now let us subdivide the cluster of non-athletes.

In [23]:
%%time 
# Bipartition the cluster of non-athletes
left_child_non_athletes_artists, right_child_non_athletes_artists = bipartition(non_athletes_artists,
                                                                                maxiter=100, num_runs=3, seed=1)

CPU times: user 1min 7s, sys: 278 ms, total: 1min 7s
Wall time: 12.1 s


In [24]:
display_single_tf_idf_cluster(left_child_non_athletes_artists, map_word_to_index)

113949:0.021
113949:0.017
113949:0.015
113949:0.014
113949:0.014

* Anita Kunz                                         0.97141
  anita e kunz oc born 1956 is a canadianborn artist and illustratorkunz has lived in london
   new york and toronto contributing to magazines and working
* %C3%81ine Hyland                                   0.97487
  ine hyland ne donlon is emeritus professor of education and former vicepresident of univer
  sity college cork ireland she was born in 1942 in athboy co
* Ruth Rosen                                         0.97515
  ruth rosen born 1956 is a pioneering historian of gender and society an awardwinning journ
  alist and a professor emerita at university of california davisshe is
* Catherine Hakim                                    0.97532
  catherine hakim born 30 may 1948 is a british sociologist who specialises in womens employ
  ment and womens issues she is currently a professorial research fellow
* Ren%C3%A9e Fox                                 

In [25]:
display_single_tf_idf_cluster(right_child_non_athletes_artists, map_word_to_index)

113949:0.030
113949:0.027
113949:0.027
113949:0.025
113949:0.023

* Kayee Griffin                                      0.95724
  kayee frances griffin born 6 february 1950 is an australian politician and former australi
  an labor party member of the new south wales legislative council serving
* Lucienne Robillard                                 0.96152
  lucienne robillard pc born june 16 1945 is a canadian politician and a member of the liber
  al party of canada she sat in the house
* Marcelle Mersereau                                 0.96243
  marcelle mersereau born february 14 1942 in pointeverte new brunswick is a canadian politi
  cian a civil servant for most of her career she also served
* Maureen Lyster                                     0.96244
  maureen anne lyster born 10 september 1943 is an australian politician she was an australi
  an labor party member of the victorian legislative assembly from 1985
* Carol Skelton                                      0.96349
  caro

The clusters are not as clear, but the left cluster has a tendency to show important female figures, and the right one to show politicians and government officials.

Let's divide them further.

Los grupos no son tan claros, pero el grupo de la izquierda tiende a mostrar figuras femeninas importantes y el de la derecha a políticos y funcionarios gubernamentales.

Vamos a dividirlos aún más.

In [26]:
female_figures = left_child_non_athletes_artists
politicians_etc = right_child_non_athletes_artists

**Quiz Question**. Let us bipartition the clusters `female_figures` and `politicians`. Which diagram best describes the resulting hierarchy of clusters for the non-athletes? Refer to the quiz for the diagrams.

**Note**. Use `maxiter=100, num_runs=6, seed=1` for consistency of output.

In [27]:
%%time 
# Bipartition female_figures
left_child_female_figures, right_child_female_figures = bipartition(female_figures,
                                                                                maxiter=100, num_runs=6, seed=1)
display_single_tf_idf_cluster(left_child_female_figures, map_word_to_index)
display_single_tf_idf_cluster(right_child_female_figures, map_word_to_index)

113949:0.017
113949:0.015
113949:0.013
113949:0.013
113949:0.011

* Archie Brown                                       0.97612
  archibald haworth brown cmg fba commonly known as archie brown born 10 may 1938 is a briti
  sh political scientist and historian in 2005 he became
* Timothy Luke                                       0.97651
  timothy w luke is university distinguished professor of political science in the college o
  f liberal arts and human sciences as well as program chair of
* Lawrence W. Green                                  0.97701
  lawrence w green is best known by health education researchers as the originator of the pr
  ecede model and codeveloper of the precedeproceed model which has
* Jerry L. Martin                                    0.97726
  jerry l martin is chairman emeritus of the american council of trustees and alumni he serv
  ed as president of acta from its founding in 1995
* Loren Graham                                       0.97799
  loren r graham

In [28]:
%%time 
# Bipartition the cluster of politicians_etc
left_child_politicians_etc, right_child_politicians_etc = bipartition(politicians_etc,
                                                                                maxiter=100, num_runs=6, seed=1)
display_single_tf_idf_cluster(left_child_politicians_etc, map_word_to_index)
display_single_tf_idf_cluster(right_child_politicians_etc, map_word_to_index)

113949:0.103
113949:0.072
113949:0.054
113949:0.039
113949:0.035

* William G. Young                                   0.90773
  william glover young born 1940 is a united states federal judge for the district of massac
  husetts young was born in huntington new york he attended
* George B. Daniels                                  0.90920
  george benjamin daniels born 1953 is a united states federal judge for the united states d
  istrict court for the southern district of new yorkdaniels was
* Barry Sullivan (lawyer)                            0.91213
  barry sullivan is a chicago lawyer and as of july 1 2009 the cooney conway chair in advoca
  cy at loyola university chicago school of law
* James G. Carr                                      0.91469
  james g carr born july 7 1940 is a federal district judge for the united states district c
  ourt for the northern district of ohiocarr was
* Jean Constance Hamilton                            0.91538
  jean constance hamilton born 1945