# Text Analysis 

### Objective
- Classify news articles
- Learn the basics of natural language processing
- Build models using sklearn and choose the best one
- Use sklearn’s Pipeline class

In this post we’ll classify news articles into different categories. First, download the dataset from http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip and extract it. The dataset consists of 2225 documents in 5 categories: business, entertainment, politics, sport, and tech.

## Step 1
Load the libraries and set the path to the extracted data.

In [14]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

 
DATA_DIR = "./bbc/"
print('done')


done


## Preprocess the data

We’ll use the load_files function, which loads text files from a directory whose subfolder names are the categories. Our dataset is already organized this way, with one folder per category. After loading the data, we’ll also check how many articles there are per category.

In [5]:
data = load_files(DATA_DIR, encoding="utf-8", decode_error="replace")
# calculate count of each category
labels, counts = np.unique(data.target, return_counts=True)
labels, counts

(array([0, 1, 2, 3, 4]), array([510, 386, 417, 511, 401]))

In [6]:
# convert data.target_names to np array for fancy indexing
labels_str = np.array(data.target_names)[labels]
print(dict(zip(labels_str, counts)))

{'business': 510, 'entertainment': 386, 'politics': 417, 'sport': 511, 'tech': 401}
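
To make the expected folder layout concrete, here is a minimal sketch that builds a throwaway directory with one subfolder per category and loads it with load_files. The category names, file names, and texts are made up for illustration; only the layout mirrors the BBC dataset.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical toy data: one subfolder per category, one text file per article.
samples = {
    "sport": ["The team won the final match."],
    "tech": ["The new phone has a faster chip.", "Broadband speeds are rising."],
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for category, texts in samples.items():
        (root / category).mkdir()
        for i, text in enumerate(texts):
            (root / category / f"{i:03d}.txt").write_text(text, encoding="utf-8")

    # shuffle=False keeps the documents in folder order so the labels are easy to read
    toy = load_files(str(root), encoding="utf-8", shuffle=False)

toy_target_names = list(toy.target_names)  # folder names, sorted alphabetically
toy_targets = toy.target.tolist()          # one integer label per document
print(toy_target_names, toy_targets)
```

Each integer in the target array is an index into target_names, which is why the notebook maps labels back to names with fancy indexing.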


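## Data breakdown
Each category has a different number of articles. However, the largest category (sport, 511) is only about 1.3 times the size of the smallest (entertainment, 386), so the classes are not badly imbalanced and the model should be able to learn properly.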
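
One quick way to check the balance is to turn the counts into shares and a largest-to-smallest ratio. This is a small sketch that hard-codes the counts printed above; what ratio counts as "too imbalanced" is a judgment call, not a fixed rule.

```python
import numpy as np

# Counts per category, copied from the output above.
names = ["business", "entertainment", "politics", "sport", "tech"]
counts = np.array([510, 386, 417, 511, 401])

shares = counts / counts.sum()
imbalance_ratio = counts.max() / counts.min()

for name, share in zip(names, shares):
    print(f"{name:14s} {share:.1%}")
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```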

## Data preparation
Now we’ll split the data into training and test sets and then print the first 80 characters of the first 10 training samples.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)
# show the first 80 characters of the first 10 training articles
[t[:80] for t in X_train[:10]]

["'Ultimate game' award for Doom 3\n\nSci-fi shooter Doom 3 has blasted away the com",
 ...]
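
The call above uses the defaults of train_test_split, which hold out 25% of the data and shuffle differently on every run. As a hedged variation, the sketch below fixes random_state for reproducibility and passes stratify so that each class keeps its share in both splits. The toy documents and labels are invented stand-ins for data.data and data.target, and the seed 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for data.data and data.target.
docs = [f"document {i}" for i in range(100)]
labels = np.array([0] * 40 + [1] * 40 + [2] * 20)

tr_docs, te_docs, tr_labels, te_labels = train_test_split(
    docs,
    labels,
    test_size=0.25,     # same fraction as the default
    stratify=labels,    # keep class proportions in both splits
    random_state=42,    # arbitrary seed, only for reproducibility
)

print(len(tr_docs), len(te_docs))
print(np.bincount(tr_labels), np.bincount(te_labels))
```

With mildly imbalanced classes such as ours, stratification mostly guards against an unlucky split rather than changing the results much.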

## Process the Text Data

Before we go further, let’s quickly go through the steps of a common natural language processing pipeline:

1. Tokenize, i.e. split the text into words
2. Convert the letters to a single case, either upper or lower
3. Remove stopwords, e.g. “the”, “an”, “with”
4. Perform stemming or lemmatization to reduce inflected words to their stem, e.g. transportation -> transport, transported -> transport
5. Vectorize the text (count, binary, or TF-IDF)

Many libraries already exist to perform all of the steps mentioned above.
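
To show what the first four steps do to a sentence, here is a deliberately simplified sketch in plain Python. The stopword set and the naive_stem helper are toy assumptions written for this example; a real project would use a library stemmer or lemmatizer, and sklearn’s vectorizers handle tokenizing, lowercasing, and stopword removal on their own.

```python
import re

# Tiny illustrative stopword list; real lists contain a few hundred words.
STOPWORDS = {"the", "an", "a", "with", "was", "by", "to", "of"}


def naive_stem(word):
    """Crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ation", "ed", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())         # steps 1 and 2
    tokens = [t for t in tokens if t not in STOPWORDS]   # step 3
    return [naive_stem(t) for t in tokens]               # step 4


sentence = "The goods were transported with an improved transportation network."
print(preprocess(sentence))
# -> ['good', 'were', 'transport', 'improv', 'transport', 'network']
```

Note how the crude rule turns "improved" into "improv": stems do not have to be real words, they only need to map related forms to the same token.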

The data is text, and we cannot feed it to a model as it is; we first need to convert it to a numerical format. A very common method is to compute a TF-IDF matrix. TF stands for term frequency: how many times a term (word) appears in a document. IDF stands for inverse document frequency, which measures how informative a word is across the whole collection: it gives more weight to rare words than to common ones. Once we have both TF and IDF, we multiply them together to obtain the TF-IDF value.

tfidf(t, d, D) = tf(t, d) * idf(t, D)

where

- t is a term
- d is a document
- D is the set of all documents

For details about TF-IDF, see:

- http://www.tfidf.com/
- https://en.wikipedia.org/wiki/Tf%E2%80%93idf
- http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
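
The formula can be worked through by hand on a tiny corpus. The sketch below assumes the textbook definition idf(t, D) = log(N / df), where df is the number of documents containing t, and it assumes the term occurs in at least one document. sklearn’s TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Made-up three-document corpus, split on whitespace.
docs = [
    "the match ended in a draw",
    "the market ended higher",
    "the new phone launch",
]
tokenized = [d.split() for d in docs]


def tf(term, doc_tokens):
    # raw count of the term in one document
    return Counter(doc_tokens)[term]


def idf(term, all_docs):
    # textbook form: log(N / number of documents containing the term)
    df = sum(1 for tokens in all_docs if term in tokens)
    return math.log(len(all_docs) / df)


def tfidf(term, doc_tokens, all_docs):
    return tf(term, doc_tokens) * idf(term, all_docs)


score_the = tfidf("the", tokenized[0], tokenized)      # in every document
score_ended = tfidf("ended", tokenized[0], tokenized)  # in 2 of 3 documents
score_match = tfidf("match", tokenized[0], tokenized)  # in 1 of 3 documents
print(score_the, score_ended, score_match)
```

A word that appears in every document gets a score of zero, while the rarest word gets the highest score, which is the weighting behavior described above.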

In [12]:
vectorizer = TfidfVectorizer(stop_words="english", max_features=1000, decode_error="ignore")
vectorizer.fit(X_train)
vectorizer

TfidfVectorizer(analyzer='word', binary=False, decode_error='ignore',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=1000,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words='english', strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

## Training the Vectorizer

We used TfidfVectorizer to calculate TF-IDF. When initializing the vectorizer, we passed stop_words as “english”, which tells sklearn to discard commonly occurring English words. We also set max_features to 1000, so the vectorizer builds a vocabulary of the 1000 most frequent terms in the training texts. This means that each text in our dataset will be converted to a vector of size 1000.
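
The effect of these two parameters is easier to see on a toy corpus. The sentences below are invented for this sketch, and max_features is set to 3 instead of 1000 so that the resulting vocabulary can be read at a glance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "striker" occurs 3 times, "shares" and "market" twice each,
# every other non-stopword once.
toy_corpus = [
    "the striker scored and the striker celebrated",
    "the striker missed the penalty",
    "shares fell as the market closed",
    "the market rallied and shares rose",
]

toy_vec = TfidfVectorizer(stop_words="english", max_features=3)
toy_matrix = toy_vec.fit_transform(toy_corpus)

print(sorted(toy_vec.vocabulary_))  # the 3 most frequent non-stopword terms
print(toy_matrix.shape)             # one row per text, one column per term
```

Stopwords such as "the" never reach the vocabulary even though they are the most frequent tokens, and every text becomes a vector whose length equals max_features.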

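Next, we call the fit function to “train” the vectorizer, i.e. learn the vocabulary and IDF weights, and then the transform function to convert the list of texts into a TF-IDF matrix. We could also use fit_transform, which is equivalent to calling fit followed by transform on the same data.

Important: we should use only the training data to fit the vectorizer. Fitting on the test data as well would leak information from the test set into the model, which is cheating.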

In [13]:
vectorizer.fit(X_train)
X_train_vectorized = vectorizer.transform(X_train)
X_train_vectorized

<1668x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 112494 stored elements in Compressed Sparse Row format>
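
The relationship between fit, transform, and fit_transform can be checked on a few made-up sentences. The sketch also shows what happens to test texts: they are transformed with the vocabulary learned from the training texts, so words that were never seen during fitting are simply ignored.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example texts.
train_texts = ["cheap flights to rome", "rome hotel deals", "football results today"]
test_texts = ["football in rome", "unseen words only"]

# Two-step version: learn vocabulary and IDF weights, then transform.
vec_a = TfidfVectorizer()
vec_a.fit(train_texts)
two_step = vec_a.transform(train_texts)

# One-step version on the same data.
vec_b = TfidfVectorizer()
one_step = vec_b.fit_transform(train_texts)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)

# Test texts are only transformed, never used for fitting.
test_matrix = vec_a.transform(test_texts)
unseen_row_nonzeros = int(np.count_nonzero(test_matrix.toarray()[1]))
print(test_matrix.shape, unseen_row_nonzeros)
```

The second test text contains no word from the training vocabulary, so its row is all zeros.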


## Build Model

We’ll create a simple naive Bayes model.

In [15]:
cls = MultinomialNB()
# train on the TF-IDF matrix computed from the training texts above
cls.fit(X_train_vectorized, y_train)
# the test texts are only transformed, never used to fit the vectorizer
y_pred = cls.predict(vectorizer.transform(X_test))

# print overall accuracy and per-class precision, recall, and F1
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.9766606822262118
              precision    recall  f1-score   support

           0       0.98      0.98      0.98       130
           1       0.97      0.97      0.97        86
           2       0.95      0.97      0.96       104
           3       0.99      1.00      1.00       125
           4       0.99      0.96      0.98       112

    accuracy                           0.98       557
   macro avg       0.98      0.98      0.98       557
weighted avg       0.98      0.98      0.98       557
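
The objectives mention sklearn’s Pipeline class. As a rough sketch of how the vectorizer and the classifier above could be chained into one estimator, the example below trains on a handful of invented sentences. The step names "tfidf" and "nb" and the toy data are assumptions made for this illustration, not the setup used for the results above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data with two classes.
toy_train = [
    "the striker scored a late goal",
    "the goalkeeper saved the penalty",
    "the team won the league match",
    "the new phone has a faster processor",
    "the software update fixes the browser",
    "the laptop ships with a new processor",
]
toy_labels = ["sport", "sport", "sport", "tech", "tech", "tech"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),  # text -> TF-IDF vectors
    ("nb", MultinomialNB()),                           # vectors -> class
])

# fit() fits the vectorizer and the classifier on the training texts only;
# predict() transforms new texts with the fitted vectorizer before classifying.
pipe.fit(toy_train, toy_labels)
toy_pred = pipe.predict(["a late goal won the match", "a faster laptop processor"])
print(list(toy_pred))
```

Because the pipeline takes raw text, there is no separate vectorizer.transform call to forget, which also makes it harder to fit the vectorizer on test data by accident.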

