Hierarchical clustering is a method used in machine learning and data mining to group similar data points into clusters based on their characteristics or features. Unlike flat clustering algorithms, which produce a single partition of the data, hierarchical clustering creates a tree-like hierarchical structure of clusters.

The most common (agglomerative) form of hierarchical clustering starts by treating each data point as its own cluster. Pairs of clusters are then merged iteratively based on their similarity until all data points belong to a single cluster or a specified stopping criterion is met.

There are two main types of hierarchical clustering:

Agglomerative Hierarchical Clustering: This is the most common type of hierarchical clustering. It begins by considering each data point as a separate cluster and then iteratively merges the closest pair of clusters according to a distance metric. Merging continues until all data points belong to a single cluster or until a stopping criterion, such as a predefined number of clusters or a threshold distance, is reached.

Divisive Hierarchical Clustering: Here the process starts with all data points in a single cluster and recursively divides clusters into smaller ones based on some dissimilarity criterion until each data point is in its own cluster. Divisive clustering is less common than agglomerative clustering and can be computationally expensive, especially for large datasets: a cluster of n points can be split into two non-empty groups in 2^(n-1) - 1 ways, so practical implementations rely on heuristics rather than exhaustive search.
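
The agglomerative procedure described above can be sketched in a few lines of plain Python. This is a deliberately naive illustration, not how SciPy implements it: the function names, the one-dimensional toy points, and the choice of single linkage (distance between the closest members of two clusters) are all assumptions made for this example.

```python
from itertools import combinations

def single_linkage_distance(a, b, points):
    """Smallest pairwise distance between members of clusters a and b."""
    return min(abs(points[i] - points[j]) for i in a for j in b)

def agglomerate(points, n_clusters):
    """Naively merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest single-linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage_distance(
                clusters[pair[0]], clusters[pair[1]], points),
        )
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into a
        del clusters[b]                          # a < b, so index a is unaffected
    return clusters

# Made-up 1-D data: two tight pairs and one isolated point.
points = [1.0, 1.2, 5.0, 5.1, 9.0]
result = agglomerate(points, n_clusters=3)
print(result)  # clusters are lists of point indices
```

Here the stopping criterion is a predefined number of clusters; replacing the `while` condition with a check on the smallest distance would give a threshold-distance criterion instead.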

Hierarchical clustering produces a dendrogram, which is a tree-like diagram that illustrates the merging process and shows the hierarchical relationships between clusters. The dendrogram can be cut at different levels to obtain different numbers of clusters, allowing users to explore the data at different granularities.
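
As a small sketch of cutting one tree at different levels, the snippet below builds a Ward linkage on made-up 2-D points and extracts two and three flat clusters from the same linkage matrix with `fcluster`. The data and variable names are assumptions chosen for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import ward, fcluster

# Toy data (made up for illustration): three tight groups in 2-D.
X = np.array([[0.0, 0.0], [0.1, 0.0],
              [5.0, 5.0], [5.1, 5.0],
              [10.0, 0.0], [10.1, 0.0]])

Z = ward(X)  # linkage matrix: one row per merge, n - 1 rows in total

# Cut the same tree at two different granularities.
labels_2 = fcluster(Z, t=2, criterion='maxclust')
labels_3 = fcluster(Z, t=3, criterion='maxclust')

print(labels_2)
print(labels_3)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the corresponding tree; the linkage matrix is computed once and reused for every cut.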

Hierarchical clustering does not require specifying the number of clusters beforehand, making it suitable for exploratory analysis and for visualizing the structure of the data. However, standard agglomerative algorithms work from the pairwise distances between all points, so memory use grows quadratically with the number of points and the method becomes expensive for large datasets. The choice of distance metric and linkage criterion can also significantly affect the resulting clusters.
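
To illustrate how much the linkage criterion matters, the following sketch clusters the same made-up 1-D points with single and complete linkage and then cuts both trees at the same distance threshold. The data and the threshold of 10 are assumptions chosen so that the difference is easy to see.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up data: two groups of three points, separated by a gap.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])

Z_single = linkage(X, method='single')      # distance between closest members
Z_complete = linkage(X, method='complete')  # distance between farthest members

# Height of the final merge (third column of the last row).
final_single = Z_single[-1, 2]
final_complete = Z_complete[-1, 2]

# Cut both trees at the same distance threshold.
n_single = len(set(fcluster(Z_single, t=10, criterion='distance')))
n_complete = len(set(fcluster(Z_complete, t=10, criterion='distance')))

print(final_single, final_complete)
print(n_single, n_complete)
```

With single linkage the two groups are joined at the gap between their nearest points, while complete linkage joins them at the distance between their farthest points, so the same threshold yields a different number of clusters.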

Overall, hierarchical clustering is a flexible and powerful technique for grouping similar data points into clusters and is widely used in various fields, including biology, image analysis, and social sciences.

In [1]:
from sklearn.datasets import fetch_20newsgroups
from scipy.cluster.hierarchy import ward, dendrogram
import matplotlib as mpl
from scipy.cluster.hierarchy import fcluster
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import re
import string
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
from pylab import *
import nltk
import warnings
warnings.filterwarnings('ignore')

1. `fetch_20newsgroups` is a function from sklearn.datasets to load the '20 newsgroups' text dataset.

2. The hierarchical clustering itself uses `ward`, `dendrogram`, and `fcluster` from scipy.cluster.hierarchy: `ward` builds the hierarchy of clusters, `dendrogram` plots it, and `fcluster` cuts it into flat clusters.

3. For measuring the similarity between documents, it uses `cosine_similarity` from sklearn.metrics.pairwise.

4. It uses `TfidfVectorizer` from sklearn.feature_extraction.text to convert the raw documents into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features.

5. For text preprocessing, it uses `WordNetLemmatizer` to lemmatize words (bring them to their base form), `stopwords` to remove common words such as 'the', 'is', 'in', etc, and `word_tokenize` to split text into words.

6. The `Counter` class from the `collections` module is used to count the frequency of elements.

7. matplotlib and pylab are used for plotting, and numpy and pandas for data manipulation.

8. `warnings.filterwarnings('ignore')` suppresses all Python warnings. This keeps the output tidy but can also hide deprecation notices.

9. `%matplotlib inline` is an IPython magic command that renders plots in the notebook itself, directly below the cell that produces them.
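
A minimal sketch of how several of these imports can fit together is shown below. It uses four made-up documents rather than the newsgroup data, and how the rest of the notebook combines these functions is not shown in this section, so treat the exact arrangement as an assumption. Because `TfidfVectorizer` L2-normalizes each row by default, Euclidean distances between rows (which Ward linkage uses) are a monotone function of cosine similarity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import ward, fcluster

# Made-up mini corpus: two "for sale" posts and two "religion" posts.
docs = [
    "selling used game console cheap",
    "used game console for sale cheap",
    "god faith religion belief",
    "religion and faith in god",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # sparse matrix, one row per document
similarity = cosine_similarity(tfidf)          # 4 x 4 matrix of pairwise similarities

# Ward linkage on the dense TF-IDF rows (treated as observation vectors).
Z = ward(tfidf.toarray())
labels = fcluster(Z, t=2, criterion='maxclust')

print(similarity.round(2))
print(labels)
```

Documents that share no vocabulary have a cosine similarity of zero, so the two topics end up in separate flat clusters.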

In [2]:
# Downloading a list of stop words and the Wordnet corpus from nltk

nltk.download('stopwords')
stop_words=stopwords.words('english')
stop_words=stop_words+list(string.printable)
nltk.download('wordnet')
lemmatizer=WordNetLemmatizer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


1. `nltk.download('stopwords')` will download the NLTK package 'stopwords'. Stopwords are words that you want to ignore, so you filter them out when you’re processing natural language text. Standard stop words in English might be 'is', 'and', 'the', etc.

2. `stop_words=stopwords.words('english')` will declare a variable 'stop_words' that is a list of stopwords in English which is available after you've downloaded the stopwords package.

3. `stop_words=stop_words+list(string.printable)` extends the `stop_words` list with every printable ASCII character as a separate single-character entry: digits, lowercase and uppercase letters, punctuation, and whitespace. This removes single-character tokens only; longer tokens such as '25' are unaffected.

4. `nltk.download('wordnet')` will download the 'wordnet' NLTK package. WordNet is a lexical database of English. It groups words into sets of synonyms called synsets, provides short definitions and usage examples, and records semantic relations between those synsets.

5. `lemmatizer=WordNetLemmatizer()` creates an instance of the WordNetLemmatizer class. Lemmatization reduces a word to its dictionary base form (lemma); for example, 'cars' is lemmatized to 'car'. The result depends on the part of speech passed to `lemmatize`, which defaults to noun: 'running' becomes 'run' only with `pos='v'`, and 'better' becomes 'good' only with `pos='a'`. Creating the instance once allows it to be reused for every token.
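
The effect of adding `string.printable` to the stop-word list can be checked without downloading anything. In this sketch, `base_stop_words` is a hypothetical three-word stand-in for `stopwords.words('english')`, and the tokens are made up.

```python
import string

# Hypothetical miniature stop-word list standing in for stopwords.words('english').
base_stop_words = ['is', 'and', 'the']

# Same construction as in the notebook: one extra entry per printable character.
stop_words = base_stop_words + list(string.printable)

tokens = ['the', 'price', 'is', '$', '25', 'a', 'x', 'and', 'new']
kept = [t for t in tokens if t.lower() not in stop_words]

print(len(string.printable))  # number of single-character entries added
print(kept)
```

Single characters such as '$', 'a', and 'x' are filtered out, while the two-character token '25' survives because only individual characters were added to the list.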

In [3]:
# Specify the categories of news articles we want to fetch

categories = ['misc.forsale', 'sci.electronics', 'talk.religion.misc']

In [4]:
# fetch the dataset
news_data = fetch_20newsgroups(subset='train', categories=categories,
                               shuffle=True, random_state=42,
                               download_if_missing=True)

The above code fetches the training subset of the 20 Newsgroups dataset. The arguments passed to the function determine what is returned:

- `subset='train'` returns the training subset. Other possible values are 'test' for the testing subset and 'all' for both subsets combined.

- `categories` is the list of news categories to include. Here it is the list defined in the previous cell: 'misc.forsale', 'sci.electronics', and 'talk.religion.misc'.

- `shuffle=True` means that the returned dataset will be shuffled. Shuffling a dataset is often useful in machine learning to ensure that the order of the data does not affect the performance of the model.

- `random_state=42` sets the seed for the random number generator used to shuffle the dataset. Using this parameter ensures that the same order of shuffling will be used each time the code is run, which is useful for consistent results in testing.

- `download_if_missing=True` means that the dataset will be downloaded if it is not found in the sklearn data directory. If this parameter is set to False and the dataset is not available locally, an `OSError` is raised.
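
The role of `shuffle` and `random_state` can be illustrated on its own with NumPy. This sketch does not call `fetch_20newsgroups` and is not a description of its internals; it only shows, on ten made-up indices, that the same seed reproduces the same shuffled order.

```python
import numpy as np

indices = np.arange(10)  # stand-in for the positions of ten documents

# Two generators seeded identically produce the same permutation.
first = np.random.RandomState(42).permutation(indices)
second = np.random.RandomState(42).permutation(indices)

# A different seed is used here only for comparison.
other = np.random.RandomState(0).permutation(indices)

print(first)
print(second)
print(other)
```

Shuffling changes only the order of the items, never their content, which is why a fixed seed is enough to make later steps reproducible.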



In [5]:
# To view the data of the fetched content
news_data['data'][:5]

['From: Steve@Busop.cit.wayne.edu (Steve Teolis)\nSubject: Re: *** TurboGrafx System For SALE ***\nOrganization: Wayne State University\nLines: 38\nDistribution: na\nNNTP-Posting-Host: 141.217.75.24\n\n>TurboGrafx-16 Base Unit (works like new) with:\n>       1 Controller\n>       AC Adapter\n>       Antenna hookup\n>     * Games:\n>         Kieth Courage\n>         Victory Run\n>         Fantasy Zone\n>         Military Madness\n>         Battle Royal\n>         Legendary Axe\n>         Blazing Lasers\n>         Bloody Wolf\n>\n>  --------------------------------------\n>* Will sell games separatley at $25 each\n>  --------------------------------------\n\nYour kidding, $210.00, man o man, you can buy the system new for $49.00 at \nElectronic Boutique and those games are only about $15 - $20.00 brand new.  \nMaybe you should think about that price again if you REALLY need the money.\n\n\n\n\n\n\n                        \n                        \n                        -=-=-=-=-=-=-=-

In [6]:
# Check the category label of each news article
news_data.target

array([0, 0, 1, ..., 0, 1, 0])

0 refers to 'misc.forsale', 1 refers to 'sci.electronics', and 2 refers to 'talk.religion.misc'.

In [7]:
# Check the categories we are dealing with
news_data.target_names

['misc.forsale', 'sci.electronics', 'talk.religion.misc']
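
Since the integer labels are simply positions in `target_names`, they can be translated back into readable names by indexing. The sketch below uses a short made-up `target` list in place of `news_data.target`.

```python
from collections import Counter

target_names = ['misc.forsale', 'sci.electronics', 'talk.religion.misc']
target = [0, 0, 1, 2, 1, 0]  # made-up labels standing in for news_data.target

# Each integer label is an index into target_names.
named = [target_names[i] for i in target]
counts = Counter(named)

print(named)
print(counts)
```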

In [8]:
# Store the texts and their category labels in a pandas DataFrame and view the first rows
news_data_df = pd.DataFrame({'text' : news_data['data'], 'category': news_data.target})
news_data_df.head()

                                                text  category
0  From: Steve@Busop.cit.wayne.edu (Steve Teolis)...         0
1  From: jks2x@holmes.acc.Virginia.EDU (Jason K. ...         0
2  From: wayne@uva386.schools.virginia.edu (Tony ...         1
3  From: lihan@ccwf.cc.utexas.edu (Bruce G. Bostw...         1
4  From: myoakam@cis.ohio-state.edu (micah r yoak...         0


In [10]:
# Count the occurrences of each category
news_data_df['category'].value_counts()


1    591
0    585
2    377
Name: category, dtype: int64

We will use a lambda function to clean each entry in the 'text' column of the news_data_df DataFrame. A regular expression (re) first replaces every run of characters other than letters, digits, and whitespace (underscores are replaced too) with a single space. The result is split into tokens, tokens that are stop words are dropped, and the remaining tokens are lowercased and lemmatized. Finally, the join function concatenates the list of words into a single space-separated string.

In [12]:
import nltk
nltk.download('punkt')
news_data_df['cleaned_text'] = news_data_df['text'].apply(\
lambda x : ' '.join([lemmatizer.lemmatize(word.lower()) \
  for word in word_tokenize(re.sub(r'([^\s\w]|_)+', ' ', str(x))) if word.lower() not in stop_words]))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


`news_data_df['cleaned_text'] =` This creates a new column in the DataFrame called 'cleaned_text'.

`news_data_df['text'].apply(\` will apply a function to each entry in the 'text' column of the DataFrame.

`lambda x : ' '.join([lemmatizer.lemmatize(word.lower()) \` defines a lambda function that lowercases each remaining word, lemmatizes it (reduces it to its base form), and then joins the lemmatized words back into a single string.

`for word in word_tokenize(re.sub(r'([^\s\w]|_)+', ' ', str(x)))` first uses a regular expression (regex) to replace punctuation, other special characters, and underscores in the whole text with spaces, and then tokenizes the result (splits it into individual words).

`if word.lower() not in stop_words]))` keeps a word only if its lowercase form is not in the list of stop words. Stop words are common words like 'is', 'the', and 'and' that usually contribute little to the meaning of a sentence.

In summary, the above code strips punctuation, removes stop words, and normalizes word forms, which reduces noise in the text before it is analyzed or passed to a machine learning model.
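
The regex and stop-word steps can be tried on a single string without any NLTK downloads. This sketch is a simplified stand-in for the cell above, not a replacement for it: it assumes `str.split` instead of `word_tokenize`, skips lemmatization entirely, and uses a hypothetical three-word stop list plus the printable characters. The function name `clean_text` and the sample string are made up for illustration.

```python
import re
import string

# Hypothetical small stop list; the notebook uses NLTK's English list instead.
stop_words = ['for', 'the', 'is'] + list(string.printable)

def clean_text(text):
    """Replace punctuation and underscores with spaces, then drop stop words."""
    no_punct = re.sub(r'([^\s\w]|_)+', ' ', str(text))  # same pattern as above
    return ' '.join(word.lower() for word in no_punct.split()
                    if word.lower() not in stop_words)

sample = 'Re: *** TurboGrafx_System For SALE ***!'
cleaned = clean_text(sample)
print(cleaned)
```

Note that the underscore is handled by the `|_` branch of the pattern, because `\w` would otherwise treat it as a word character and leave 'TurboGrafx_System' as one token.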