## Anatomy of Natural Language Processing

As Wikipedia says 'Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages and, in particular, concerned with programming computers to fruitfully process large natural language corpora'.


### Why is NLP required?

There are several reasons that made NLP popular within the Artificial Intelligence and Machine Learning community, especially the huge opportunity that NLP provided for everyone to work on large sets of textual data.

Few of the domains, where NLP plays a crucial role is:

* Auto Summarization
* Chatbots
* Machine Translation
* Text Classification
* Sentimental Analysis
* Speech Recognition

### Steps taken to solve a NLP problem

In this notebook, let's try to classify news's articles to different categories based on the article content and in doing so let's look at the various steps that we take to solve a NLP problem.

#### 1. Collection of the Dataset

The first and foremost step in any Machine Learning or NLP based problem is to collect a dataset. For our problem, let's try to collect news articles from [Doxydonkey](http://www.doxydonkey.blogspot.in).

Let's scrape all the articles from the above mentioned website.


In [167]:
# imports
import urllib2
from bs4 import BeautifulSoup

In [168]:
# URL to collect the data
url = "http://doxydonkey.blogspot.in"

In [169]:
# Method to request, response and soup the data
def beautifySoup(url):
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    soup = BeautifulSoup(response)
    return soup

In [170]:
# Collecting all articles hyer links
def collectArticles(links):
    articles = []
    for link in links:
        content = []
        soup = beautifySoup(link)
        divs = soup.find_all('div', {'class': 'post-body'})
        for div in divs:
            content += map(lambda p: p.text.encode("ascii", errors="replace").replace("?", ""), div.find_all("li"))
        articles += content
    return articles

In [171]:
def collectData(url, links):
    soup = beautifySoup(url)
    for a in soup.find_all("a"):
        try:
            title = a["title"]
            url = a["href"]
            if title == "Older Posts":
                links.append(url)
                getAllLinks(url, links)
        except:
            url = ""
    return

In [172]:
links = []
def collectParameters():
    links.append(url)
    collectData(url, links)
    print(len(links))

In [173]:
collectParameters()

('Older Posts', 'http://doxydonkey.blogspot.in/search?updated-max=2017-02-20T18:07:00-08:00&max-results=7&start=7&by-date=false')
('Older Posts', 'http://doxydonkey.blogspot.in/search?updated-max=2017-02-09T18:22:00-08:00&max-results=7&start=14&by-date=false')
('Older Posts', 'http://doxydonkey.blogspot.in/search?updated-max=2017-01-31T17:57:00-08:00&max-results=7&start=21&by-date=false')
('Older Posts', 'http://doxydonkey.blogspot.in/search?updated-max=2017-01-19T17:44:00-08:00&max-results=7&start=28&by-date=false')
('Older Posts', 'http://doxydonkey.blogspot.in/search?updated-max=2017-01-10T18:06:00-08:00&max-results=7&start=35&by-date=false')
('Older Posts', 'http://doxydonkey.blogspot.in/search?updated-max=2017-01-01T18:20:00-08:00&max-results=7&start=42&by-date=false')
('Older Posts', 'http://doxydonkey.blogspot.in/search?updated-max=2016-12-21T18:55:00-08:00&max-results=7&start=49&by-date=false')
('Older Posts', 'http://doxydonkey.blogspot.in/search?updated-max=2016-12-12T19:04:0

In [174]:
articles = collectArticles(links)
print(len(articles))
articles[0]

2673


'Airbnb raises $1 billion inlatestround of funding:Onlineroom rentingservice Airbnb Inc said on Thursday it had raised $1 billion in its latest round of funding, valuing the company at $31 billion.The company turned in a profit on an EBITDA basis in the second half of 2016 and expects to continue to be profitable this year, the source said, adding that Airbnb had no plans to go public anytime soon.The company is locked in an intensifying global battle with regulators who say the service takesaffordablehousing off the market and drives up rental prices.Airbnb raised $447.85 million as part of the funding, a source close to the company told Reuters. The company said in September it had raised about $555 million as part of the same round of funding.Airbnb, which operates in more than 65,000 cities, has enjoyed tremendous growth as it pushes ahead with its plansofglobal expansion.'

#### 2. Term Frequency - Inverse Document Frequency

TF-IDF weight is often used in information retrieval and text mining. The weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.

TF-IDF is made up of two components: 
* TF - Term Frequency
        
        Term Frequency is the measure of how frequently a word appears in a document.
        TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
        
* IDF - Inverse Document Frequency
        Inverse Document Frequency the measure of how important a term is.
        IDF(t) = log_e(Total number of documents / Number of documents with term t in it)

In [175]:
# importing feature extractor TF-IDF vectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

In [176]:
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words="english")

In [177]:
# Applying tf-idf transformation on the articles 

articles_tfidf = vectorizer.fit_transform(articles)

# Let's see the tf-idf for the first article
print(articles_tfidf[0])

  (0, 4527)	0.093459986294
  (0, 880)	0.0909888475678
  (0, 9245)	0.125506358268
  (0, 5484)	0.0593929321553
  (0, 12111)	0.136331194112
  (0, 4293)	0.136331194112
  (0, 2522)	0.0853580208378
  (0, 1)	0.065168725165
  (0, 470)	0.109464227828
  (0, 8136)	0.103883585377
  (0, 427)	0.158782106766
  (0, 10394)	0.0896669240819
  (0, 9883)	0.0904124688247
  (0, 11941)	0.0768705699236
  (0, 2603)	0.0766677306981
  (0, 7515)	0.0847935427048
  (0, 539)	0.113991816532
  (0, 363)	0.169920689887
  (0, 8980)	0.0869161877993
  (0, 9696)	0.121217078597
  (0, 4015)	0.123688217323
  (0, 7265)	0.0517256017034
  (0, 10412)	0.0532921082588
  (0, 10199)	0.0728780620055
  (0, 9604)	0.100322272811
  :	:
  (0, 9194)	0.0657727820619
  (0, 8711)	0.0676923544855
  (0, 735)	0.0880771876065
  (0, 10875)	0.177538939725
  (0, 13072)	0.0378691826488
  (0, 9066)	0.0955164362715
  (0, 2980)	0.0875720285004
  (0, 4534)	0.102520325273
  (0, 180)	0.0853580208378
  (0, 5550)	0.0794643241154
  (0, 10304)	0.0705467582402
  (

#### 3. Clustering the articles

Next, let's cluster the above articles in order to predict the category of future articles.

Let's use KMeans algorithm to cluster the articles. 

In [178]:
# importing KMeans algorithm from sklearn

from sklearn.cluster import KMeans

In [229]:
# instance of KMeans

km = KMeans(n_clusters=10, init='k-means++', n_init=2, max_iter=100, verbose=True)

In [230]:
# fitting the model with the articles dataset

km.fit(articles_tfidf)

Initialization complete
Iteration  0, inertia 4819.099
Iteration  1, inertia 2499.371
Iteration  2, inertia 2485.310
Iteration  3, inertia 2478.664
Iteration  4, inertia 2474.163
Iteration  5, inertia 2470.683
Iteration  6, inertia 2468.460
Iteration  7, inertia 2467.534
Iteration  8, inertia 2467.211
Iteration  9, inertia 2467.074
Iteration 10, inertia 2467.046
Iteration 11, inertia 2467.038
Iteration 12, inertia 2467.030
Iteration 13, inertia 2467.024
Converged at iteration 13
Initialization complete
Iteration  0, inertia 2648.611
Iteration  1, inertia 2563.164
Iteration  2, inertia 2541.464
Iteration  3, inertia 2533.711
Iteration  4, inertia 2532.416
Iteration  5, inertia 2528.497
Iteration  6, inertia 2525.316
Iteration  7, inertia 2524.910
Iteration  8, inertia 2524.788
Iteration  9, inertia 2524.737
Iteration 10, inertia 2524.723
Converged at iteration 10


KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=10, n_init=2,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=True)

So, we have fitted the algorithm with the dataset. Now, let's check the type of articles that are clustered. 

In [231]:
# Label that is assigned to each articles 

print("Label of each article: {}".format(km.labels_))

Label of each article: [0 5 2 ..., 8 8 8]


In [232]:
# Total number of articles in each cluster

import numpy as np

cluster_count = np.unique(km.labels_, return_counts=True)
print("Number of articles in each cluster: {}".format(cluster_count))

Number of articles in each cluster: (array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([ 224,  269,  155,  109,  253,  179,  147,  168, 1002,  167], dtype=int64))


Now, let's segregate the articles into their respective clusters.

In [233]:
# Segregating the documents according to the clusters

article_clusters = {}
for i, cluster in enumerate(km.labels_):
    document = articles[i]
    if cluster not in article_clusters.keys():
        article_clusters[cluster] = document
    else:
        article_clusters[cluster] += document

#### 4. Tokenizing the collected data

The step in the process is to tokenizing the collected data. What do you mean by vectorizing the data?

Tokenization is the process of breaking down the complex sentences to individual words.

> "I like dogs"

So, the above statement can be tokenized to get ("I", "like", "dogs"), this is similar to using **split()** method to tokenize a sentence.

In terms of NLP, tokens better known as **n-grams**, where one word is called **unigram**, 2 words are called as **bigram** and multiple words are called a **n-grams**.

In [234]:
# import statements

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.probability import FreqDist
from collections import defaultdict
from string import punctuation
from heapq import nlargest
import nltk

In [235]:
# Let's declare a list of some not so important words that might appear in the articles

special_words = ["billion", "million", "billions", "millions", "$",  "'s", "'d", "\n", "\t", "''", "``", "`", "*"]
stop_words = list(set(stopwords.words("english"))) + list(punctuation) + special_words

print("Stop words are: {}".format(stop_words))

Stop words are: [u'all', u'just', u'being', u'over', u'both', u'through', u'yourselves', u'its', u'before', u'o', u'hadn', u'herself', u'll', u'had', u'should', u'to', u'only', u'won', u'under', u'ours', u'has', u'do', u'them', u'his', u'very', u'they', u'not', u'during', u'now', u'him', u'nor', u'd', u'did', u'didn', u'this', u'she', u'each', u'further', u'where', u'few', u'because', u'doing', u'some', u'hasn', u'are', u'our', u'ourselves', u'out', u'what', u'for', u'while', u're', u'does', u'above', u'between', u'mustn', u't', u'be', u'we', u'who', u'were', u'here', u'shouldn', u'hers', u'by', u'on', u'about', u'couldn', u'of', u'against', u's', u'isn', u'or', u'own', u'into', u'yourself', u'down', u'mightn', u'wasn', u'your', u'from', u'her', u'their', u'aren', u'there', u'been', u'whom', u'too', u'wouldn', u'themselves', u'weren', u'was', u'until', u'more', u'himself', u'that', u'but', u'don', u'with', u'than', u'those', u'he', u'me', u'myself', u'ma', u'these', u'up', u'will', u'b

In the next step, we will see the top words from each cluster, this should give us an idea of theme of each cluster.

In [236]:
keywords = {}
counts = {}

for cluster in range(km.n_clusters):
    word_set = word_tokenize(article_clusters[cluster].lower())
    word_set = [word for word in word_set if word not in stop_words and word != ""]
    freq_distribution = FreqDist(word_set)
    keywords[cluster] = nlargest(100, freq_distribution, key=freq_distribution.get)
    counts[cluster] = freq_distribution

# printing the top 100 words from each cluster
for key, value in keywords.iteritems():
    print(key, value[:20])

(0, ['company', 'uber', 'investors', 'said', 'companies', 'valuation', 'year', 'private', 'round', 'public', 'last', 'according', 'percent', 'new', 'raised', 'market', 'funding', 'lyft', 'people', 'also'])
(1, ['percent', 'revenue', 'company', 'quarter', 'year', 'said', 'shares', 'share', 'sales', 'profit', 'rose', 'business', 'cents', 'analysts', 'earnings', 'net', 'growth', 'per', 'fell', 'reported'])
(2, ['alibaba', 'percent', 'china', 'company', 'said', 'chinese', 'e-commerce', 'new', 'online', 'market', 'year', 'shares', 'group', 'also', 'would', 'stake', 'chinas', 'companies', 'internet', 'u.s.'])
(3, ['twitter', 'company', 'users', 'said', 'new', 'percent', 'social', 'facebook', 'tweets', 'twitters', 'also', 'people', 'stock', 'like', 'user', 'year', 'app', 'data', 'dorsey', 'product'])
(4, ['facebook', 'users', 'ads', 'video', 'new', 'said', 'ad', 'app', 'company', 'people', 'google', 'mobile', 'like', 'percent', 'snapchat', 'also', 'youtube', 'service', 'data', 'instagram'])
(

Here, let's find the unique words in each cluster 

In [237]:
unique_words = {}
for cluster in range(km.n_clusters):
    temp_cluster_1 = list(set(range(km.n_clusters)) - set([cluster]))
    temp_cluster_2 = set(keywords[temp_cluster_1[0]]).union(set(keywords[temp_cluster_1[1]]))
    unique_word = set(keywords[cluster]) - temp_cluster_2
    unique_words[cluster] = nlargest(10, unique_word, key=counts[cluster].get)

In [238]:
# printing unique words to a cluster
for i, words in enumerate(unique_words):
    print(i, unique_words[words])

(0, ['uber', 'valuation', 'private', 'round', 'raised', 'lyft', 'funding', 'people', 'capital', 'venture'])
(1, ['profit', 'rose', 'cents', 'analysts', 'earnings', 'net', 'per', 'reported', 'trading', 'forecast'])
(2, ['alibaba', 'e-commerce', 'group', 'stake', 'chinas', 'internet', 'yahoo', 'alibabas', 'holding', 'tencent'])
(3, ['twitter', 'social', 'tweets', 'twitters', 'user', 'dorsey', 'product', 'media', 'ads', 'buy'])
(4, ['ads', 'video', 'ad', 'youtube', 'instagram', 'social', 'facebooks', 'advertisers', 'content', 'news'])
(5, ['prime', 'delivery', 'amazons', 'products', 'items', 'product', 'amazon.com', 'jet', 'shipping', 'retailer'])
(6, ['cars', 'car', 'drivers', 'tesla', 'vehicles', 'model', 'driver', 'self-driving', 'musk', 'india'])
(7, ['apple', 'iphone', 'pay', 'watch', 'apples', 'apps', 'iphones', 'store', 'music', 'samsung'])
(8, ['internet', 'make', 'use', 'search', 'mr.', 'way', 'products', 'information', 'big', 'made'])
(9, ['india', 'rs', 'crore', 'snapdeal', 'e-

By now, we know what's the theme of each cluster

* Cluster 0 - Amazon
* Cluster 1 - Ads and Social Media
* Cluster 2 - Stock Market
* Cluster 3 - Apple
* Cluster 4 - Investment

Disclaimer - The clusters might not be accurately created, this is done just to give an overview on how the articles will be clustered.

In [251]:
categories = {0: "Cab Aggregator", 1: "Stocks", 2: "Alibaba", 3: "Twitter", 4: "Ads", 5: "Amazon", 6: "Self Driving", 7: "Apple", 8: "Internet", 9: "Ecommerce in India"}

#### 5. Predicting the label for an article

In [252]:
# We are using KNeighborsClassifier for predicting the class for an article

from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()

In [253]:
# Fitting the previously modelled data to KNeighborsClassifier

clf.fit(articles_tfidf, km.labels_)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [254]:
predict_airbnb = " Online room renting service Airbnb Inc said on Thursday it had raised $1 billion in its latest round of funding, valuing the company at $31 billion.The company turned in a profit on an EBITDA basis in the second half of 2016 and expects to continue to be profitable this year, the source said, adding that Airbnb had no plans to go public anytime soon. The company is locked in an intensifying global battle with regulators who say the service takes affordable housing off the market and drives up rental prices.Airbnb raised $447.85 million as part of the funding, a source close to the company told Reuters. The company said in September it had raised about $555 million as part of the same round of funding. Airbnb, which operates in more than 65,000 cities, has enjoyed tremendous growth as it pushes ahead with its plans of global expansion"

In [255]:
airbnb_transform = vectorizer.transform([predict_airbnb.decode('utf8').encode('ascii', errors='ignore')])

In [256]:
label = clf.predict(airbnb_transform)[0]

if label in categories:
    print("Category belongs to the category: {}".format(categories[label]))
else:
    print("Unknown category")

Category belongs to the category: Internet


Hope you understood the basics of a Natural Language Processing problem.