## NLP Use Case - An Example

In [9]:
import pandas as pd
from nltk.stem import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from scipy.cluster.hierarchy import fcluster, single
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import CountVectorizer

We created a list of news article titles, which were scraped from several online sources of interest. In the following example, we will read the set of titles and group them into topics according to the words they contain.
Let us first read the titles.

In [4]:
article_titles = pd.read_csv('../Data/article_titles.csv')
print(article_titles.head(20))
print(article_titles.shape)
article_titles = article_titles.head(1000).copy() # Work on the first 1000 titles only, so that the example runs quickly; copy() lets us add columns safely later

                                                Title
0   Data from 800,000 user accounts stolen in Oran...
1   China's web giants unite to defuse Windows XP ...
2   Mt. Gox, once the world’s largest Bitcoin exch...
3   Netflix is paying Comcast for direct connectio...
4   UPDATE 1-Netflix may need to pay AT&T, Verizon...
5   Suspected Hacker Group Creates Network of Fake...
6   Zeus variant targets Salesforce.com accounts, ...
7   Wall St. Is Told to Tighten Digital Security o...
8   MasterCard program will protect credit card pu...
9   Researchers blow past all protections in Micro...
10  Security Alert: New and Cheap Stampado Ransomw...
11  Report: Verizon Uncovers Two More Retail Breac...
12  360 million newly stolen credentials on black ...
13  Oklahoma DPS and Bank Security Exposure - Blog...
14  Data Breach Cuts Into Target's 4Q Profit | Fox...
15  Blue Coat to Deliver Transformative Security S...
16  Google will start teaching people how to build...
17  CrowdStrike Inc. Partner

The titles are not "clean":
"Data" and "data" are the same word, but the initial capital makes them look different.
"China's" should be treated as "China".
...
We will therefore run some standard text pre-processing operations.
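
As a rough illustration of the first two steps (lowercasing and tokenization), here is a minimal sketch that uses only the standard library. It assumes that splitting on the pattern \w+ with re.findall approximates what a word-character tokenizer does, and the stop-word set is a hypothetical stand-in, not NLTK's list. Stemming is left out because it needs a stemmer library. The sample text is the visible part of title 1 above.

```python
import re

# Hypothetical stop-word set for illustration only (not NLTK's English list).
STOPWORDS = {"the", "to", "in", "of", "is", "for", "s"}

def tokenize(title):
    """Lowercase the title and split it into runs of word characters."""
    return re.findall(r"\w+", title.lower())

def remove_stopwords(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenize("China's web giants unite to defuse Windows XP")
content_tokens = remove_stopwords(tokens)
print(tokens)          # note that "China's" splits into "china" and "s"
print(content_tokens)
```

The apostrophe is not a word character, so "China's" becomes two tokens; removing the stray "s" is one reason stop-word filtering helps here.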

In [5]:
import nltk
nltk.download('stopwords', quiet=True) # One-time download of the stop-word lists that the stemmer needs

# Stemming = reducing a word to its stem form
stemmer = SnowballStemmer("english", ignore_stopwords=True)
# Tokenization = breaking a text into a list of words
tokenizer = RegexpTokenizer(r'\w+')
for i, row in article_titles.iterrows(): # Go over all the titles
    tokens = tokenizer.tokenize(row['Title'].lower()) # Tokenize the lowercase version of the title
    clean_words = [stemmer.stem(word) for word in tokens if word not in stemmer.stopwords] # Remove stopwords and stem the remaining words
    sentence = " ".join(clean_words) # Re-create the title from the cleaned words
    article_titles.loc[i, 'clean_title'] = sentence
article_titles.head(10)

Note: this cell needs the NLTK "stopwords" corpus. If the corpus is missing, NLTK raises "LookupError: Resource stopwords not found"; the nltk.download('stopwords') call at the top of the cell fetches it.

In order to run any machine learning algorithm, we need to encode the titles as vectors.
We will use the simplest encoding method, known as "Bag of Words". Each title is represented by a row, and each column is a binary indicator of whether a given word appears in that title.
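
To make the encoding concrete, here is a minimal pure-Python sketch of a binary bag of words. The three short titles are made up for illustration, and splitting on whitespace is a simplification of what a real vectorizer does with its token pattern.

```python
def binary_bag_of_words(titles):
    """Return (vocabulary, matrix): one row per title, one 0/1 column per word."""
    vocabulary = sorted({word for title in titles for word in title.split()})
    column = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for title in titles:
        row = [0] * len(vocabulary)
        for word in title.split():
            row[column[word]] = 1  # binary: repeated words still count once
        matrix.append(row)
    return vocabulary, matrix

# Hypothetical, already-cleaned titles
titles = ["data breach target", "target data stolen", "netflix pay comcast"]
vocab, rows = binary_bag_of_words(titles)
print(vocab)
for row in rows:
    print(row)
```

Each row has as many entries as there are distinct words across all titles, which is why the real matrix is large and mostly zeros.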

In [None]:
vectorizer = CountVectorizer(strip_accents='unicode', binary=True)
weighted_terms = vectorizer.fit_transform(article_titles['clean_title'])
# weighted_terms is a SciPy sparse matrix; convert it to a dense array so that we can inspect it
dtm = weighted_terms.toarray()
print(pd.DataFrame(dtm).head())
print(vectorizer.vocabulary_)

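We can now run hierarchical clustering on the titles. To do so, we need a distance metric that describes how far apart two titles are.
The Jaccard similarity of two titles is the number of words that appear in both, divided by the number of distinct words that appear in either. The Jaccard distance, which is what pdist computes, is one minus this similarity.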
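
As a small worked example, the sketch below computes the Jaccard similarity and distance for two made-up titles. It assumes the common definition (shared words divided by distinct words in either title, with distance equal to one minus similarity); treating two empty titles as identical is a convention chosen here.

```python
def jaccard_similarity(a, b):
    """Shared words divided by distinct words in either title."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty titles count as identical
    return len(a & b) / len(union)

def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)

# Hypothetical, already-cleaned titles
t1 = "data breach target".split()
t2 = "target data stolen".split()
t3 = "netflix pay comcast".split()

sim_12 = jaccard_similarity(t1, t2)   # 2 shared words out of 4 distinct words
dist_12 = jaccard_distance(t1, t2)
dist_13 = jaccard_distance(t1, t3)    # no shared words
print(sim_12, dist_12, dist_13)
```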

In [None]:
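dist = pdist(dtm.astype(bool), 'jaccard') # Condensed vector of pairwise Jaccard distances (the metric is defined on boolean vectors)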

In [None]:
print(dist)

In [None]:
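linkage_matrix = single(dist) # Single-linkage (nearest-neighbour) hierarchical clustering
# Note: fcluster's default criterion is 'inconsistent', so 0.5 is an inconsistency threshold here.
# Pass criterion='distance' to cut the tree at a Jaccard distance of 0.5 instead.
raw_clusters = fcluster(linkage_matrix, 0.5)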
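
To build intuition for what a flat cut of a single-linkage tree produces, here is a minimal pure-Python sketch. It assumes the cut is made at a distance threshold (the behaviour of fcluster with criterion='distance', which is not the default criterion). Under that assumption, two titles share a cluster whenever a chain of titles connects them with every step at or below the threshold. The four titles and the function names are hypothetical.

```python
def jaccard_distance(a, b):
    """One minus the share of common words (both sets assumed non-empty)."""
    return 1.0 - len(a & b) / len(a | b)

def single_linkage_clusters(items, distance, threshold):
    """Label items so that any chain of steps <= threshold joins a cluster."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps lookups short
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if distance(items[i], items[j]) <= threshold:
                parent[find(i)] = find(j)  # merge the two groups

    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1) for i in range(len(items))]

# Hypothetical, already-cleaned titles represented as word sets
titles = [
    {"data", "breach", "target"},
    {"target", "data", "stolen"},
    {"netflix", "pay", "comcast"},
    {"netflix", "pay", "verizon"},
]
clusters_loose = single_linkage_clusters(titles, jaccard_distance, 0.5)
clusters_tight = single_linkage_clusters(titles, jaccard_distance, 0.4)
print(clusters_loose)
print(clusters_tight)
```

A lower threshold yields more, smaller clusters; single linkage can also chain loosely related titles together, which is worth keeping in mind when reading the results.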

In [6]:
print(raw_clusters)

In [7]:
article_titles['cluster'] = raw_clusters

In [8]:
article_titles.to_csv('../Data/clustered_article_titles.csv')