## NLP Use Case - An Example

In [1]:
import re

import numpy as np
import pandas as pd
from nltk.stem import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from scipy.cluster.hierarchy import fcluster, single  # explicit names instead of a wildcard import
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import pairwise_distances

We created a list of news article titles, which were scraped from several online sources of interest. In the following example, we will read the set of titles and group them into topics according to the words they contain.
Let us first read the titles.

In [2]:
article_titles = pd.read_csv('../Data/article_titles.csv')
print(article_titles.head(20))
print(article_titles.shape)
# Work only on the first 1000 articles so the example runs quickly.
# .copy() gives an independent DataFrame, so adding columns later does not trigger a SettingWithCopyWarning.
article_titles = article_titles.head(1000).copy()

                                                Title
0   Data from 800,000 user accounts stolen in Oran...
1   China's web giants unite to defuse Windows XP ...
2   Mt. Gox, once the world’s largest Bitcoin exch...
3   Netflix is paying Comcast for direct connectio...
4   UPDATE 1-Netflix may need to pay AT&T, Verizon...
5   Suspected Hacker Group Creates Network of Fake...
6   Zeus variant targets Salesforce.com accounts, ...
7   Wall St. Is Told to Tighten Digital Security o...
8   MasterCard program will protect credit card pu...
9   Researchers blow past all protections in Micro...
10  Security Alert: New and Cheap Stampado Ransomw...
11  Report: Verizon Uncovers Two More Retail Breac...
12  360 million newly stolen credentials on black ...
13  Oklahoma DPS and Bank Security Exposure - Blog...
14  Data Breach Cuts Into Target's 4Q Profit | Fox...
15  Blue Coat to Deliver Transformative Security S...
16  Google will start teaching people how to build...
17  CrowdStrike Inc. Partner

The titles are not "clean":
"Data" and "data" are the same word, but the capital initial makes them look like two different words.
"China's" should be treated as "China".
...
We will run some standard text pre-processing operations.

In [3]:
# Stemming = reducing a word to its stem form.
# ignore_stopwords=True loads the NLTK stopwords corpus (nltk.download('stopwords')) and leaves stopwords unstemmed.
stemmer = SnowballStemmer("english", ignore_stopwords=True)
# Tokenization = breaking a text into a list of words.
tokenizer = RegexpTokenizer(r'\w+')
for i, row in article_titles.iterrows():  # Go over all the titles
    tokens = tokenizer.tokenize(row['Title'].lower())  # Tokenize the lowercase version of the title
    clean_words = [stemmer.stem(word) for word in tokens if word not in stemmer.stopwords]  # Remove stopwords, then stem
    sentence = " ".join(clean_words)  # Re-create the title from the cleaned words
    article_titles.loc[i, 'clean_title'] = sentence
article_titles.head(10)

Unnamed: 0,Title,clean_title
0,"Data from 800,000 user accounts stolen in Oran...",data 800 000 user account stolen orang fr hack
1,China's web giants unite to defuse Windows XP ...,china web giant unit defus window xp bombshel ...
2,"Mt. Gox, once the world’s largest Bitcoin exch...",mt gox world largest bitcoin exchang shut ar t...
3,Netflix is paying Comcast for direct connectio...,netflix pay comcast direct connect network ar ...
4,"UPDATE 1-Netflix may need to pay AT&T, Verizon...",updat 1 netflix may need pay verizon faster speed
5,Suspected Hacker Group Creates Network of Fake...,suspect hacker group creat network fake linked...
6,"Zeus variant targets Salesforce.com accounts, ...",zeus variant target salesforc com account saa ...
7,Wall St. Is Told to Tighten Digital Security o...,wall st told tighten digit secur partner new y...
8,MasterCard program will protect credit card pu...,mastercard program protect credit card purchas...
9,Researchers blow past all protections in Micro...,research blow past protect microsoft emet anti...


In order to run any machine learning algorithm, we need to encode the titles as vectors.
We will use the simplest encoding method, known as "Bag of Words". Each title is represented by a row, and each column is a binary indicator of whether a given word exists in the title.
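
As a small illustration of the encoding described above, here is a sketch on two made-up, already-cleaned titles (the titles and the `toy_` names are hypothetical and are not part of the dataset). It assumes the default `CountVectorizer` tokenization, which keeps tokens of two or more characters and orders the columns alphabetically.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical, already-cleaned titles (not taken from the dataset)
toy_titles = ["netflix pay comcast", "netflix pay verizon speed"]

# binary=True records presence/absence of a word instead of its count
toy_vectorizer = CountVectorizer(binary=True)
toy_dtm = toy_vectorizer.fit_transform(toy_titles).toarray()

# Recover the word behind each column by sorting the vocabulary on its column index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocab)
print(toy_dtm)
```

Each row has a 1 in the columns of the words that appear in that title and a 0 everywhere else, which is the same layout as the full matrix built in the next cell.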

In [4]:
vectorizer = CountVectorizer(strip_accents='unicode', binary=True)
weighted_terms = vectorizer.fit_transform(article_titles['clean_title'])
# weighted_terms is stored as a SciPy sparse matrix; convert it to a dense array so it is easier to inspect.
dtm = weighted_terms.toarray()
print(pd.DataFrame(dtm).head())
print(vectorizer.vocabulary_)

   0     1     2     3     4     5     6     7     8     9     ...   2424  \
0     1     0     0     0     0     0     0     0     0     0  ...      0   
1     0     0     0     0     0     0     0     0     0     0  ...      0   
2     0     0     0     0     0     0     0     0     0     0  ...      0   
3     0     0     0     0     0     0     0     0     0     0  ...      0   
4     0     0     0     0     0     0     0     0     0     0  ...      0   

   2425  2426  2427  2428  2429  2430  2431  2432  2433  
0     0     0     0     0     0     0     0     0     0  
1     0     0     0     0     0     0     0     0     0  
2     0     0     0     0     0     0     0     0     0  
3     0     0     0     0     0     0     0     0     0  
4     0     0     0     0     0     0     0     0     0  

[5 rows x 2434 columns]
{'data': 582, '800': 70, '000': 0, 'user': 2301, 'account': 85, 'stolen': 2070, 'orang': 1530, 'fr': 911, 'hack': 1007, 'china': 411, 'web': 2363, 'giant': 962, 'un

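We can now run hierarchical clustering on the titles. We need to define a distance metric that describes how far apart two titles are.
The Jaccard similarity of two titles is the number of words that appear in both titles divided by the number of distinct words that appear in at least one of them. The 'jaccard' metric in pdist returns the corresponding distance, which is 1 minus this similarity, so 1.0 means the two titles share no words.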
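
To make the metric concrete, the following sketch computes the Jaccard distance between two hypothetical binary rows by hand and compares it with `pdist`. The vectors `a` and `b` are invented for illustration. By the same arithmetic, a value such as 0.9167 would arise, for instance, when two titles share one word out of twelve distinct words.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Binary bag-of-words rows for two hypothetical titles over a 5-word vocabulary
a = np.array([1, 1, 1, 0, 0], dtype=bool)
b = np.array([0, 1, 1, 1, 1], dtype=bool)

shared = np.logical_and(a, b).sum()  # words present in both titles
total = np.logical_or(a, b).sum()    # words present in at least one title
similarity = shared / total          # Jaccard similarity
manual_distance = 1 - similarity     # Jaccard distance

# pdist returns one value per pair of rows; with two rows there is a single pair
scipy_distance = pdist(np.vstack([a, b]), 'jaccard')[0]
print(similarity, manual_distance, scipy_distance)
```

Here the titles share 2 of 5 distinct words, so the similarity is 0.4 and the distance is 0.6.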

In [5]:
dist = pdist(dtm, 'jaccard')

In [8]:
print(dist)

[1.         1.         1.         ... 0.91666667 1.         1.        ]


In [6]:
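# Single linkage: the distance between two clusters is the distance between their two closest titles.
linkage_matrix = single(dist)
# Note: fcluster's default criterion is 'inconsistent', so 0.5 is a threshold on the inconsistency coefficient.
# Pass criterion='distance' to cut the tree at a Jaccard distance of 0.5 instead.
raw_clusters = fcluster(linkage_matrix, 0.5)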

In [7]:
print(raw_clusters)

[437 643 541 179 427 499 753 329 542  33 104 371 725 567 326 493 630 195
 195 243 308 352 236 309 256 677 682 683 494 568 218 569 125 305 162 521
 481 226 296 317  92 349 344  47 438 124 684 359 754 242  19  19 220 202
 508 360 755 570 196 325  94 328 726 203 571 376 685 727 345 152 356 756
 123 543 809 324  92 816 572 412 495 810 795 544 165 111 686 631 496 687
 350 317 443 811 178 796 357 632 213 728 573 386 574  10 522 688  10 471
   9 678 249 113  28 155 156 633  76 729 254 377 364 146 146 391 365 545
 301 489 523  75  75 575 167 163  91 634 546 145 676 730 635 497 337 141
 141 111 689 731  74  74 690 576 362 392 439   4   4 228 113 691 199 199
 440 547 190 190 431 441 692  28 402 636 757 693 442 484 148 679  82 637
 232 638 786 524 758 577 372 444 694 103  73 435  95   7 759 776 732 230
 235 578 423 445 241 579 118 314 580 498 581 797 353 733 120 582 525 446
 246 734 447 448 187 612 449 347 157 355 181  72 145 583 230 248 310 268
 526 639 147 147 548 394 284 735 234 222 140 473 14
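
The way the threshold turns a linkage matrix into flat cluster labels can be seen on a toy example. The sketch below is not the notebook's exact call: it uses four hypothetical binary rows and explicitly passes `criterion='distance'`, so the threshold is read as a Jaccard distance rather than through `fcluster`'s default `'inconsistent'` criterion.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, single
from scipy.spatial.distance import pdist

# Four hypothetical binary title vectors:
# rows 0 and 1 share most of their words, as do rows 2 and 3
toy = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
], dtype=bool)

toy_linkage = single(pdist(toy, 'jaccard'))

# Keep together only titles that were merged at a Jaccard distance of at most 0.5
labels = fcluster(toy_linkage, t=0.5, criterion='distance')
print(labels)
```

Rows 0 and 1 are at distance 1/3 from each other and at distance 1 from rows 2 and 3, so a threshold of 0.5 yields two clusters of two titles each.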

In [8]:
article_titles['cluster'] = raw_clusters

In [9]:
article_titles.to_csv('../Data/clustered_article_titles.csv')
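
One possible way to inspect the result is to list only the clusters that contain more than one title. The sketch below uses a small hypothetical DataFrame, `toy_df`, as a stand-in for `article_titles` with its `cluster` column; the titles and labels are invented.

```python
import pandas as pd

# Hypothetical stand-in for article_titles after the 'cluster' column is added
toy_df = pd.DataFrame({
    'Title': [
        'Netflix pays Comcast',
        'Netflix may pay Verizon',
        'Zeus variant targets accounts',
    ],
    'cluster': [1, 1, 2],
})

# Number of titles in each cluster
cluster_sizes = toy_df['cluster'].value_counts()

# Cluster ids that group at least two titles
multi = cluster_sizes[cluster_sizes > 1].index

for cluster_id, group in toy_df[toy_df['cluster'].isin(multi)].groupby('cluster'):
    print(cluster_id, list(group['Title']))
```

Applied to the real data, the same pattern would show which titles were judged similar enough to share a cluster, which is a quick sanity check on the chosen threshold.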