##CATEGORIZATION OF COLLECTION OF NEWS ARTICLES USING NLP



In this project, I use Python to process the text of a large collection of news articles, with the goal of accurately predicting the category of every article whose category is currently unknown.

Term Frequency-Inverse Document Frequency (TF-IDF) scores for each news article were used for the categorization.

###Methodology

There are multiple approaches to categorizing text. Here, I compute the average TF-IDF feature vector for each article category from the articles whose categories are known. These average vectors then serve as reference points: each article with an unknown category is assigned to the category whose average vector is closest to the article's own TF-IDF vector.
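
To make the idea concrete, here is a minimal sketch of the approach on a toy corpus. The counts, the labels 'A' and 'B', and all variable names are made up for illustration and are not taken from the dataset; the sketch assumes TF is a word's count divided by the document's total word count, and IDF is the natural log of the number of documents divided by the number of documents containing the word.

```python
import numpy as np

# Toy corpus (illustrative numbers only): rows are documents,
# columns are raw counts for a 3-word vocabulary.
counts = np.array([
    [2, 0, 1],  # labeled 'A'
    [3, 0, 0],  # labeled 'A'
    [0, 2, 1],  # labeled 'B'
    [0, 3, 1],  # labeled 'B'
    [1, 0, 0],  # category unknown
], dtype=float)
labels = ['A', 'A', 'B', 'B', None]

# term frequency: each count divided by the document's total word count
tf = counts / counts.sum(axis=1, keepdims=True)
# inverse document frequency: log(number of documents / documents containing the word)
doc_freq = (counts > 0).sum(axis=0)
idf = np.log(len(counts) / doc_freq)
tfidf = tf * idf

# average TF-IDF vector for each known category
centroids = {}
for label in ('A', 'B'):
    rows = [i for i, known_label in enumerate(labels) if known_label == label]
    centroids[label] = tfidf[rows].mean(axis=0)

# assign the unknown document to the category with the closest average vector
unknown_vector = tfidf[4]
predicted = min(centroids, key=lambda c: np.linalg.norm(unknown_vector - centroids[c]))
print(predicted)
```

The unknown document only uses the first word, which only appears in the 'A' documents, so it ends up closest to the 'A' average.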

###Import Libraries

In [1]:
#import libraries
import numpy as np #used to quickly perform mathematical calculations on vectors
import re #regular expressions; used to clean the text data
import sqlite3 #used to interact with the database
import pandas as pd #allows us to work with data using Pandas dataframes
from collections import Counter #used to quickly count letters and words

###Load Data

In [2]:
#open a connection to the database
conn = sqlite3.connect('Dataset.db')

#load all documents into a Pandas dataframe named 'df', and use the id column as the index
sql = 'SELECT * FROM Article'
df = pd.read_sql_query(sql, conn, index_col='id')

#close database connection
conn.close()

###Preview Data

In [3]:
#show the number of articles in each category
df.groupby(['category']).count()

category,raw_text
Business,270
Entertainment,197
Politics,239
Sports,294
Technology,225
Unknown,1000


The dataset contains 1,225 articles labeled with one of 5 categories, plus 1,000 articles tagged 'Unknown' whose correct categories are to be predicted.

In [4]:
df.head()

id,category,raw_text
6347,Politics,Hiding women away in the home hidden behind ve...
13840,Sports,Celtic brushed aside Clyde to secure their pla...
14775,Unknown,"If you have finished Doom 3, Half Life 2 and H..."
16641,Unknown,Controversial new UK casinos will be banned fr...
17511,Unknown,Justine Henin-Hardenne lost to Elena Dementiev...


###Prepare Data for Analysis

In [5]:
#define a function that will clean the raw input text in preparation for analysis
def clean_text(raw_text):
  #convert the raw text to lowercase
  text = raw_text.lower()
  #remove all numbers from the text using a regular expression
  text = re.sub(r'[0-9]', ' ', text)
  #remove all underscores from the text
  text = re.sub(r'\_', ' ', text)
  #remove anything else in the text that isn't a word character or a space (e.g., punctuation, special symbols, etc.)
  text = re.sub(r'[^\w\s]', ' ', text)
  #collapse any run of whitespace into a single space
  text = re.sub(r'\s+', ' ', text)
  #remove any leading or trailing space characters
  text = text.strip()
  #return the clean text
  return text

#clean the raw text of each article, and store the resulting clean text in a new column 
df['clean_text'] = [clean_text(raw_text) for raw_text in df.raw_text]

#show the cleaned text of the first article
df.iloc[0]['clean_text']

'hiding women away in the home hidden behind veils is a backward view of islam president musharraf of pakistan has said during a visit to britain he was speaking to the bbc s newsnight programme a few hours before visiting the pakistani community in manchester my wife is travelling around she is very religious but she is very moderate said general musharraf it comes after pakistan s high commissioner to britain said some pakistanis should integrate more dr maleeha lodhi said people could not expect others to listen to their grievances if they isolated themselves gen musharraf told the bbc some people think that the women should be confined to their houses and put veils on and all that and they should not move out absolutely wrong the pakistani president was also asked whether he thought the war on terror had made the world less safe yes absolutely and i would add that unfortunately we are not addressing the core problems so therefore we can never address it in its totality he said we a

###Compute Raw Letter Frequencies for the Corpus

In [6]:
#define a function that will compute the raw letter frequencies for the input texts as well as the total number of letters appearing in the input texts
def letter_counts(input_texts):
  all_text = ' '.join(input_texts) #join all of the input texts into one big string
  letter_counts = Counter(all_text.replace(' ', '')) #count all letters in the text (excluding spaces)
  #return letter counts (sorted from most common to least common), and the total number of letters
  return letter_counts.most_common(), sum(letter_counts.values()) 


#get letter counts for the whole corpus
df_letter_counts, total_letters = letter_counts(df.clean_text)

In [7]:
df_letter_counts

[('e', 476074),
 ('t', 347366),
 ('a', 327937),
 ('o', 291803),
 ('i', 289103),
 ('n', 277156),
 ('s', 268268),
 ('r', 248520),
 ('h', 186120),
 ('l', 167314),
 ('d', 154621),
 ('c', 125245),
 ('u', 106896),
 ('m', 105290),
 ('p', 85065),
 ('g', 83122),
 ('f', 82762),
 ('w', 75363),
 ('b', 70700),
 ('y', 65325),
 ('v', 42983),
 ('k', 31394),
 ('x', 8246),
 ('j', 7868),
 ('q', 3261),
 ('z', 2761)]

###Build a vocabulary for all of the news articles

In [8]:
#build a vocabulary of words
all_text = ' '.join(df.clean_text) #join all of the texts into one big string
words = all_text.split() #split the text into words
word_frequencies = Counter(words) #count all words in the text
vocabulary = list(word_frequencies.keys()) #get a list of all unique words

In [9]:
len(vocabulary)

27762

In [10]:
#define a class that we can use to hold information about each document
class Document:
  def __init__(self, id, category, word_frequencies, total_words):
    self.id = id #the document's unique ID number
    self.category = category #the document's category
    self.predicted_category = None
    self.total_words = total_words #the total number of words in the document
    self.word_frequencies = word_frequencies #holds raw frequencies for each word in the vocabulary
    self.term_frequencies = None
    self.tfidf_scores = None

In [11]:
#sort the vocabulary to ensure that we all get consistent results!
vocabulary.sort()
#define a collection (list) to hold our Document objects
documents = []
#create a Document object for each document in the corpus
for row in df.itertuples(): #for each row in the dataframe
  words = row.clean_text.split() #split the (clean) text into words
  document_word_frequencies = Counter(words) #count all words in the document's (clean) text
  total_words = sum(document_word_frequencies.values()) #compute the total number of words in the document
  
  vocabulary_word_frequencies = []
  for vocabulary_word in vocabulary:
    #if this vocabulary word exists in the document
    if vocabulary_word in document_word_frequencies:
      #add the raw document frequency for this vocabulary word to the collection
      vocabulary_word_frequencies.append(document_word_frequencies[vocabulary_word])
    else: #if this vocabulary word doesn't exist in the document
      #add a value of zero for this vocabulary word to the collection
      vocabulary_word_frequencies.append(0)      
  #add a new Document object for this document to the collection
  documents.append(Document(row.Index, row.category, vocabulary_word_frequencies, total_words))
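
The loop above checks every vocabulary word against every document. As an alternative, here is a minimal sketch that looks up each document word's position in the vocabulary through a dictionary, so only the words that actually occur in a document are visited. The names `vocabulary_demo`, `word_index`, and `count_vector` are hypothetical and not part of the notebook.

```python
from collections import Counter
import numpy as np

# hypothetical miniature vocabulary, sorted as in the notebook
vocabulary_demo = sorted({'ball', 'goal', 'market', 'shares'})
# map each vocabulary word to its position in the sorted vocabulary
word_index = {word: i for i, word in enumerate(vocabulary_demo)}

def count_vector(clean_text, word_index):
    """Return raw word counts for clean_text, aligned with the vocabulary order."""
    vector = np.zeros(len(word_index), dtype=int)
    for word, count in Counter(clean_text.split()).items():
        if word in word_index:  # skip words outside the vocabulary
            vector[word_index[word]] = count
    return vector

vec = count_vector('goal ball goal', word_index)
print(vec)
```

The resulting vector has the same layout as `vocabulary_word_frequencies`: one entry per vocabulary word, zero where the word is absent.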

In [12]:
documents[9].category

'Technology'

In [13]:
#for each document in the 'documents' collection
for document in documents:
  #compute the term frequencies (raw word counts divided by the document's total word count) for this document
  document.term_frequencies = np.array(document.word_frequencies) / document.total_words

In [14]:
#Calculate an IDF Score for each word in the vocabulary
idf_scores = []
for i in range(len(vocabulary)):
  number_of_documents = 0
  for document in documents:
    if document.word_frequencies[i] > 0:
      number_of_documents +=1
  idf = np.log(len(documents) / number_of_documents)
  idf_scores.append(idf)
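
The nested loop above can be slow for a vocabulary of this size. Below is a minimal sketch of the same IDF formula computed with NumPy in one pass, assuming the per-document count vectors fit in memory as a single matrix. The matrix `count_matrix` is a hypothetical stand-in for the notebook's data. Because the vocabulary is built from the same articles, every word appears in at least one document, so the division never hits zero.

```python
import numpy as np

# hypothetical stand-in: rows are documents, columns are vocabulary words
count_matrix = np.array([
    [1, 0, 2],
    [1, 1, 0],
    [1, 0, 0],
    [1, 3, 0],
])

n_docs = count_matrix.shape[0]
# number of documents in which each word appears at least once
doc_freq = np.count_nonzero(count_matrix, axis=0)
# same formula as the loop: log(total documents / documents containing the word)
idf_vectorized = np.log(n_docs / doc_freq)
print(idf_vectorized)
```

A word that appears in every document (the first column here) gets an IDF of zero, which matches the near-zero score for 'a' in the notebook output.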


In [15]:
len(idf_scores)

27762

In [16]:
vocabulary[:15]

['a',
 'aa',
 'aaa',
 'aaas',
 'aac',
 'aadc',
 'aaliyah',
 'aaltra',
 'aamir',
 'aan',
 'aara',
 'aarhus',
 'aaron',
 'abacus',
 'abandon']

In [17]:
idf_scores[:15]

[0.007216991180223392,
 7.707512194600341,
 5.6280706529205045,
 5.761602045545027,
 7.014365014040395,
 7.707512194600341,
 7.707512194600341,
 7.707512194600341,
 7.707512194600341,
 7.707512194600341,
 7.707512194600341,
 7.707512194600341,
 6.09807428216624,
 7.014365014040395,
 6.608899905932231]

###Calculate TF-IDF Scores

In [18]:
#Calculate TF-IDF Scores
idf_scores = np.array(idf_scores)
for document in documents:
  document.tfidf_scores = np.array(document.term_frequencies) * idf_scores

In [19]:
#define a dictionary that holds each category's name (keys) and average TF-IDF vector (values).
#The vectors are all numpy arrays of the same size as the vocabulary. All elements of each vector are initialized to zero.
category_tfidf_scores = {'Business': np.zeros(len(vocabulary)), 'Entertainment': np.zeros(len(vocabulary)), 'Politics': np.zeros(len(vocabulary)), 'Sports': np.zeros(len(vocabulary)), 'Technology': np.zeros(len(vocabulary))}
#define a dictionary to hold the number of documents for each category
document_counts = {'Business': 0, 'Entertainment': 0, 'Politics': 0, 'Sports': 0, 'Technology': 0}
#for each document in the corpus
for document in documents:
  #if the category of this document is known
  if document.category != 'Unknown':
    #increment the document count for this category
    document_counts[document.category] += 1
    #add this document's TF-IDF scores to the running sum for the document's category
    category_tfidf_scores[document.category] += document.tfidf_scores
#compute the average TF-IDF vector for each category by dividing the summed scores
#by the number of documents for each category
for category in category_tfidf_scores:
  category_tfidf_scores[category] /= document_counts[category]

In [20]:
category_tfidf_scores['Entertainment']

array([0.00015555, 0.        , 0.        , ..., 0.00018992, 0.        ,
       0.00017036])

In [21]:
#define a function to compute the Euclidean distance between two points 
#(where each point is defined as a vector)
def get_distance(point1, point2):
  return np.sqrt(np.sum(np.square(point1 - point2)))
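
For reference, the same Euclidean distance can be computed with `np.linalg.norm`. The sketch below also shows how to get the distance from one point to several reference vectors at once; the function name `get_distance_norm` and the sample points are made up for illustration.

```python
import numpy as np

def get_distance_norm(point1, point2):
    # Euclidean (L2) distance, equivalent to sqrt(sum((point1 - point2)^2))
    return np.linalg.norm(point1 - point2)

point_a = np.array([0.0, 3.0, 0.0])
point_b = np.array([4.0, 0.0, 0.0])
distance = get_distance_norm(point_a, point_b)  # legs of 3 and 4, so the distance is 5

# distance from point_a to each row of a matrix of reference vectors
reference_vectors = np.array([
    [0.0, 3.0, 0.0],
    [4.0, 0.0, 0.0],
])
distances = np.linalg.norm(reference_vectors - point_a, axis=1)
print(distance, distances)
```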

In [22]:
#predict a category for every document, and track accuracy on the documents whose category is known
number_of_accurate_predictions = 0
number_of_known_articles = 0
for document in documents:
  min_distance = np.inf
  best_category = None
  for category in category_tfidf_scores:
    distance = get_distance(document.tfidf_scores, category_tfidf_scores[category])
    if distance < min_distance:
      min_distance = distance
      best_category = category
  document.predicted_category = best_category
  if document.category != 'Unknown':
    number_of_known_articles += 1
    if document.category == document.predicted_category:
      number_of_accurate_predictions +=1

print('Overall accuracy', number_of_accurate_predictions / number_of_known_articles)

Overall accuracy 0.9844897959183674
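
The loop above also fills in `predicted_category` for the articles tagged 'Unknown'. Here is a minimal sketch of how those predictions could be collected and summarized. The objects in `documents_demo` are hypothetical stand-ins for the notebook's `Document` objects, and their IDs and predicted categories are invented for illustration; they are not actual results.

```python
from collections import Counter
from types import SimpleNamespace

# hypothetical stand-ins for Document objects after prediction (not real results)
documents_demo = [
    SimpleNamespace(id=1, category='Politics', predicted_category='Politics'),
    SimpleNamespace(id=2, category='Unknown', predicted_category='Technology'),
    SimpleNamespace(id=3, category='Unknown', predicted_category='Sports'),
    SimpleNamespace(id=4, category='Unknown', predicted_category='Sports'),
]

# keep only the articles whose category was unknown, keyed by document ID
unknown_predictions = {
    d.id: d.predicted_category for d in documents_demo if d.category == 'Unknown'
}
# count how many unknown articles were assigned to each category
prediction_counts = Counter(unknown_predictions.values())
print(unknown_predictions)
print(prediction_counts.most_common())
```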


###Conclusion and Further Development



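The overall accuracy on the 1,225 articles with known categories is 98.45% using TF-IDF scores. Note that these same articles were used to compute the category averages, so this figure measures fit to the labeled data rather than performance on unseen articles.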

Google has open-sourced a neural network-based technique for natural language processing (NLP) pre-training called [Bidirectional Encoder Representations from Transformers - BERT](https://blog.google/products/search/search-language-understanding-bert/). The categorization above could be re-evaluated using BERT, and the resulting scores compared with the TF-IDF results.