<a href="https://colab.research.google.com/github/marimcmurtrie/NLP/blob/main/NLP_Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Natural Language Processing

NLP frames language and linguistic interactions within a computational perspective. This makes possible the development of algorithms and models capable of natural language understanding and natural language generation.

The applications include document proofreading (spelling & grammar), word prediction, information retrieval, text classification, text summarization, question answering, information extraction, machine translation, sentiment analysis, optical character recognition, speech recognition etc.

We will explore Text Classification, one of the core concepts in NLP.

## Import basic packages

In [None]:
import pickle # Needed to load and save data
import numpy as np # Needed for numerical computations
import pandas as pd # We will use this for formatting and displaying data
import matplotlib.pyplot as plt # Needed for generating graphics
import nltk # Natural Language Tool Kit
nltk.download('punkt') #One time download
nltk.download('stopwords') #One time download

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

***What's with all these imports?***

Good Python practice favors code reuse. Writing software entirely from scratch, without building on code written by others, is very tedious (impossible, I would say). Without packages to import, we could only dream of writing programs that do complex tasks, and we would be constantly reinventing the wheel. We must embrace the **DRY** principle ("Don't repeat yourself") of software development, which aims to reduce repetition of software patterns. Violating this principle leads to **WET** solutions ("write every time", "write everything twice", "we enjoy typing" or "waste everyone's time").

## Working with text data

### Import Data

We have about 5000 pieces of text data available for training. The data is "pickled" in the file NLP.pkl. (Make sure you place this file in the same directory as this notebook.) Each piece of text data is associated with a Category. Take a look at a few random samples of the data:

<pre>
Commerce    London School of Commerce is an associate college of the University of Wales Trinity Saint David...
Commerce    The Protection of Lawful Commerce in Arms Act (PLCAA) is a United States law that protects firea...
Astronomy   Heliocentrism is the astronomical model in which the Earth and planets revolve around the Sun at...
Religion    Greco-Roman religion may refer to:\r\n\r\nAncient Greek religion\r\nHellenistic religion\r\nMyst...
Literature  Iranian literature, or Iranic literature, refers to the literary traditions of the Iranian langu...
Literature                                  This is a list of some of the standards of concert band repertoire.
Cosmology   The Centre for Theoretical Cosmology is a research centre within the Department of Applied Mathe...
Art         Indigenous Australian art includes art made by Aboriginal Australian and Torres Strait Islander ...
Religion    The status of religious freedom around the world varies from country to country. States can diff...
Evolution   Evolution is the fourth studio album by American R&B quartet Boyz II Men, released in September ...
</pre>

In [None]:
data_file = 'NLP.pkl'

In [None]:
with open(data_file, 'rb') as f:
    data = pickle.load(f)
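
If pickle is new to you, the round trip below may help. It is only a sketch: the two-row `toy_data` list and the temporary file name are made up for illustration and have nothing to do with the real data file.

```python
import os
import pickle
import tempfile

# Hypothetical miniature dataset in the same [Category, Text] layout
toy_data = [
    ["Art", "Scythian art is the art associated with Scythian cultures."],
    ["Biology", "In biology, a substrate is the surface on which an organism lives."],
]

path = os.path.join(tempfile.mkdtemp(), "toy.pkl")

with open(path, "wb") as f:  # 'wb': pickle writes bytes
    pickle.dump(toy_data, f)

with open(path, "rb") as f:  # 'rb': and reads the bytes back
    restored = pickle.load(f)

print(restored == toy_data)
```

Keep in mind that unpickling can execute arbitrary code, so only load pickle files from sources you trust.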

In [None]:
data.shape # There are 4991 samples of data. Each sample is of the form [Category, Text]

(4991, 2)

In [None]:
np.random.seed(1) # Try changing the random seed to get a different selection
# Note: randint's upper bound is exclusive, so with len(data)-1 the last row is never selected
random_selection = np.random.randint(0, len(data)-1, 10)

with pd.option_context('display.max_colwidth', 100):
    df = pd.DataFrame(data[random_selection,1], index=data[random_selection,0])
    print(df)

                                                                                                              0
Biology     In silico (Pseudo-Latin for "in silicon", alluding to the mass use of silicon for computer chips...
Philosophy  Dualism may refer to:\r\n\r\nDualism (cybernetics), systems or problems in which an intelligent ...
Evolution   Odor molecules are detected by the olfactory receptors (hereafter OR) in the olfactory epitheliu...
Commerce    The Paris Île-de-France Regional Chamber of Commerce and Industry (French: Chambre de commerce e...
Commerce    The Superintendency of Industry and Commerce (SIC) is a competitiveness regulatory agency of the...
Biology     In biology, a substrate is the surface on which an organism (such as a plant, fungus, or animal)...
Art         Scythian art is the art associated with Scythian cultures, primarily decorative objects, such as...
Commerce    Carousell is a smartphone and web-based consumer to consumer and business to consumer market

***How many distinct categories are there? How many entries in each category? Let's find out!***

In [None]:
(unique_categories, category_counts) = np.unique(data[:,0], return_counts=True)

In [None]:
pd.DataFrame({'Category':unique_categories, 'Count':category_counts})

Unnamed: 0,Category,Count
0,Art,500
1,Astronomy,498
2,Biology,500
3,Commerce,500
4,Cosmology,500
5,Economics,497
6,Evolution,500
7,Literature,500
8,Philosophy,498
9,Religion,498


This data will be used to demonstrate the following:
* Text Classification: We will build a classifier that can accept a piece of text and determine the category
* Information Retrieval: We will see how to retrieve articles similar to a given article (or given set of keywords)

### Preprocessing

In [None]:
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

#### Tokenize, filter stopwords

Tokenization simply means converting this:
<pre>
In biological classification, class (Latin: classis) is  a taxonomic rank, as well as a taxonomic unit, a taxon, in that rank. Other well-known ranks in descending order of size are life, domain, kingdom, phylum, order, family, genus, and species, with class fitting between phylum and order.
</pre>
to this:
<pre>
['In', 'biological', 'classification', ',', 'class', '(', 'Latin', ':', 'classis', ')', 'is', 'a', 'taxonomic', 'rank', ',', 'as', 'well', 'as', 'a', 'taxonomic', 'unit', ',', 'a', 'taxon', ',', 'in', 'that', 'rank', '.', 'Other', 'well-known', 'ranks', 'in', 'descending', 'order', 'of', 'size', 'are', 'life', ',', 'domain', ',', 'kingdom', ',', 'phylum', ',', 'order', ',', 'family', ',', 'genus', ',', 'and', 'species', ',', 'with', 'class', 'fitting', 'between', 'phylum', 'and', 'order', '.']
</pre>
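
To get a feel for what a tokenizer does, here is a rough stand-in built from a single regular expression. It is not NLTK's algorithm (`word_tokenize` handles contractions, abbreviations and many other cases); `simple_tokenize` is a hypothetical helper written only for this illustration.

```python
import re

def simple_tokenize(text):
    # Match runs of word characters (optionally joined by hyphens),
    # or any single character that is neither a word character nor whitespace
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

tokens = simple_tokenize("Other well-known ranks are life, domain, and species.")
print(tokens)
```

Notice that punctuation marks become tokens of their own, just as in the NLTK output above.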

In [None]:
documents_tokenized = []
for doc in data[:,1]:
    documents_tokenized.append(word_tokenize(doc))

Let's check

In [None]:
print(data[15,1])

In biological classification, class (Latin: classis) is  a taxonomic rank, as well as a taxonomic unit, a taxon, in that rank. Other well-known ranks in descending order of size are life, domain, kingdom, phylum, order, family, genus, and species, with class fitting between phylum and order.


In [None]:
print(documents_tokenized[15])

['In', 'biological', 'classification', ',', 'class', '(', 'Latin', ':', 'classis', ')', 'is', 'a', 'taxonomic', 'rank', ',', 'as', 'well', 'as', 'a', 'taxonomic', 'unit', ',', 'a', 'taxon', ',', 'in', 'that', 'rank', '.', 'Other', 'well-known', 'ranks', 'in', 'descending', 'order', 'of', 'size', 'are', 'life', ',', 'domain', ',', 'kingdom', ',', 'phylum', ',', 'order', ',', 'family', ',', 'genus', ',', 'and', 'species', ',', 'with', 'class', 'fitting', 'between', 'phylum', 'and', 'order', '.']


Can we find out how many unique tokens there are? Yes we can!

In [None]:
unique_tokens = set()

In [None]:
for tokens in documents_tokenized:
    unique_tokens.update(tokens)

In [None]:
len(unique_tokens)

55258

An important step in the preprocessing chain is to remove all tokens that contain non-alphabetic characters, e.g. punctuation and numbers. (This also drops hyphenated words such as "well-known".) We will also simplify the problem by converting all text to lowercase.

In [None]:
documents_alpha_lower = [[word.lower() for word in words if word.isalpha()] for words in documents_tokenized]
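
The effect of `isalpha()` is easiest to see on a handful of tokens. The list below is hand-picked for illustration, not taken from a particular document.

```python
tokens = ['In', 'well-known', 'ranks', ',', '2016', 'Île-de-France', 'Île']

# Keep a token only if every character is a letter, then lowercase it
kept = [t.lower() for t in tokens if t.isalpha()]
print(kept)
```

Accented letters count as alphabetic, so 'Île' survives, while anything containing a hyphen, digit or punctuation mark is discarded.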

Next we remove all stop words: noninformative words that occur too commonly across documents, e.g. the, a, an, of ... We also drop any word of one or two letters.

In [None]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
documents_stop_filtered = [[word for word in words if (len(word)>2) and (word not in stop_words)] for words in documents_alpha_lower]

In [None]:
print(documents_stop_filtered[15])

['biological', 'classification', 'class', 'latin', 'classis', 'taxonomic', 'rank', 'well', 'taxonomic', 'unit', 'taxon', 'rank', 'ranks', 'descending', 'order', 'size', 'life', 'domain', 'kingdom', 'phylum', 'order', 'family', 'genus', 'species', 'class', 'fitting', 'phylum', 'order']
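
The same filter can be tried on a short word list. The stop list here, `toy_stop_words`, is a tiny hypothetical one; NLTK's English list is much longer.

```python
toy_stop_words = {"the", "and", "that", "with", "are", "other"}  # hypothetical stop list

words = ["in", "that", "rank", "other", "ranks", "are", "life", "and", "species", "with", "class"]

# Drop very short words and anything on the stop list
filtered = [w for w in words if len(w) > 2 and w not in toy_stop_words]
print(filtered)
```

Here "in" is removed by the length test alone, even though it is not on the toy stop list.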


In [None]:
print(data[15,1])

In biological classification, class (Latin: classis) is  a taxonomic rank, as well as a taxonomic unit, a taxon, in that rank. Other well-known ranks in descending order of size are life, domain, kingdom, phylum, order, family, genus, and species, with class fitting between phylum and order.


***Now how many unique tokens remain?***

In [None]:
unique_tokens = set()

In [None]:
for tokens in documents_stop_filtered:
    unique_tokens.update(tokens)

In [None]:
len(unique_tokens)

38033

#### Stemming

Stemming reduces a word to its stem by stripping known affixes. The Snowball stemmer used here works mainly by removing suffixes.

In [None]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

In [None]:
stemmer.stem('studying')

'studi'

In [None]:
print([stemmer.stem(word) for word in documents_stop_filtered[15]])

['biolog', 'classif', 'class', 'latin', 'classi', 'taxonom', 'rank', 'well', 'taxonom', 'unit', 'taxon', 'rank', 'rank', 'descend', 'order', 'size', 'life', 'domain', 'kingdom', 'phylum', 'order', 'famili', 'genus', 'speci', 'class', 'fit', 'phylum', 'order']


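Stemming often returns truncated words. We do not want our vocabulary to contain truncated words. Here is a way to modify the stemming process to ensure that only valid words are returned: each stem is represented by the first full word encountered that reduces to it, and every later word with the same stem is replaced by that representative. The goal is to ensure that the end result is still readable. Note that the chosen representative depends on the order in which words are processed.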

In [None]:
stem_dict = {}

In [None]:
def modified_stem(word):
    stemmed_word = stemmer.stem(word)
    if stemmed_word in stem_dict:
        return stem_dict[stemmed_word]
    else:
        stem_dict[stemmed_word] = word
        return word

In [None]:
documents_stemmed = [[modified_stem(word) for word in words] for words in documents_stop_filtered]

In [None]:
print(documents_stemmed[15])

['biology', 'classification', 'classed', 'latin', 'classis', 'taxonomic', 'rank', 'well', 'taxonomic', 'unit', 'taxon', 'rank', 'rank', 'descendants', 'order', 'size', 'life', 'domain', 'kingdom', 'phylum', 'order', 'family', 'genus', 'species', 'classed', 'fitting', 'phylum', 'order']
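
This explains why "class" shows up as "classed" above: some earlier document evidently contained "classed" before any other word with the same stem was seen. The sketch below reproduces the effect with a deliberately crude stemmer; `toy_stem` and `first_seen_stem` are hypothetical helpers, not part of NLTK.

```python
def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strip one common suffix
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def first_seen_stem(word, seen):
    # setdefault stores `word` only if this stem has not been seen before,
    # then returns whichever word is stored for the stem
    return seen.setdefault(toy_stem(word), word)

seen_a, seen_b = {}, {}
result_a = [first_seen_stem(w, seen_a) for w in ["ranked", "rank", "ranks"]]
result_b = [first_seen_stem(w, seen_b) for w in ["rank", "ranked", "ranks"]]
print(result_a)
print(result_b)
```

The two runs contain the same words in a different order and end up with different representatives, so the vocabulary is readable but not canonical.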


In [None]:
print(data[15,1])

In biological classification, class (Latin: classis) is  a taxonomic rank, as well as a taxonomic unit, a taxon, in that rank. Other well-known ranks in descending order of size are life, domain, kingdom, phylum, order, family, genus, and species, with class fitting between phylum and order.


After all this preprocessing, let's check the word counts, broken down by total number of words and total number of unique words.

In [None]:
unique_tokens = set()
total_words = 0
for tokens in documents_stemmed:
    total_words+=len(tokens)
    unique_tokens.update(tokens)

In [None]:
total_words

432826

In [None]:
len(unique_tokens)

27486

#### Detokenize

We need to detokenize because sklearn's TFIDF vectorizer (to be explained soon), with its default settings, expects each document as a single string rather than a list of tokens.

In [None]:
def detokenize(words):
    return ' '.join(words)

In [None]:
documents_detokenized = [detokenize(words) for words in documents_stemmed]

In [None]:
documents_detokenized[15] # Example detokenized document

'biology classification classed latin classis taxonomic rank well taxonomic unit taxon rank rank descendants order size life domain kingdom phylum order family genus species classed fitting phylum order'

In [None]:
data[15,1] # The original version

'In biological classification, class (Latin: classis) is  a taxonomic rank, as well as a taxonomic unit, a taxon, in that rank. Other well-known ranks in descending order of size are life, domain, kingdom, phylum, order, family, genus, and species, with class fitting between phylum and order.'

#### Function to do all preprocessing

In [None]:
# Utility function to preprocess a text passage
def preprocess(query):
    words = word_tokenize(query) # Tokenize
    words_alpha = [word.lower() for word in words if word.isalpha()] # Remove non-alphabets
    words_filtered = [word for word in words_alpha if (len(word)>2) and (word not in stop_words)] # Remove stop words
    words_stemmed = [modified_stem(word) for word in words_filtered] # Stem
    words_detokenized = detokenize(words_stemmed) # Detokenize
    return words_detokenized

In [None]:
preprocess(data[15, 1])

'biology classification classed latin classis taxonomic rank well taxonomic unit taxon rank rank descendants order size life domain kingdom phylum order family genus species classed fitting phylum order'

Note: The preprocessing steps finish with the creation of <code>documents_detokenized</code>, which is the starting point for further analysis.

## Math-ifying text with Word Vectors

<ul>
<li>Q: Why is knowing about math-ifying text important?</li>
<li>A: ML algorithms expect numbers as input. Text has to be converted to numbers</li>
<li>We will understand how to:
<ul>
<li>Count words and term frequencies in text data</li>
<li>Represent words/documents as points in a vector space</li>
<li>Solve NLP problems using those vectors</li>
</ul>
</li>
</ul>

There are two basic ways of creating document vectors: using raw word counts and using TFIDF.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### Term Frequency

In [None]:
vectorizer_counts = CountVectorizer(max_features= 15)
tf = vectorizer_counts.fit_transform(documents_detokenized)

In [None]:
type(tf)

In [None]:
terms = np.array(vectorizer_counts.get_feature_names_out())

In [None]:
pd.DataFrame(tf.todense()[:10], columns=terms)

Unnamed: 0,also,art,biology,century,economically,evolution,including,literature,one,philosophy,religion,states,universally,uses,work
0,0,0,6,0,0,2,1,0,0,0,0,0,0,1,0
1,0,0,3,0,0,0,1,0,1,0,0,1,0,0,0
2,0,0,6,0,0,0,1,0,0,0,0,0,0,0,0
3,0,0,3,0,0,0,0,0,0,0,0,0,0,2,0
4,2,0,6,0,0,0,1,0,0,0,0,0,0,2,1
5,1,0,4,0,1,0,5,0,0,0,0,0,0,0,0
6,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0
7,0,0,1,0,0,0,0,0,0,0,0,1,0,2,0
8,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0
9,0,0,6,0,0,2,0,0,0,0,0,0,0,0,0
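
To see what such a count matrix contains, here is a hand-rolled version on three made-up documents. It is only a sketch of the idea: it keeps every term, whereas `CountVectorizer(max_features=15)` keeps just the 15 terms that are most frequent across the corpus.

```python
from collections import Counter

# Toy documents, assumed to be already preprocessed and detokenized
docs = ["biology evolution biology", "art century art art", "biology art"]

# Vocabulary: every distinct term, sorted alphabetically (the column order)
vocab = sorted({term for doc in docs for term in doc.split()})

# One row per document, one column per vocabulary term
counts = [Counter(doc.split()) for doc in docs]
tf_rows = [[c[term] for term in vocab] for c in counts]

print(vocab)
for row in tf_rows:
    print(row)
```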


In [None]:
data[-5]

array(['Literature',
       'The Society for Literature, Science, and the Arts (SLSA) is a United States-based academic organization whose members "share an interest in problems of science and representation, and in the cultural and social dimensions of science, technology, and medicine."The SLSA publishes the journal Configurations, published by Johns Hopkins University Press, and a members\' newsletter Decodings. It holds an annual conference that "attracts hundreds of participants from many different disciplines, including the history, sociology, anthropology, rhetoric, and philosophy of science, technology and medicine; literary history and criticism; art history and media studies; the cognitive sciences; and all areas of science, technology, engineering, and medicine" (the 30th being in 2016).The European Society for Literature, Science and the Arts (SLSAeu) is a sister body.The Society\'s own web page is inconsistent as to whether there is a comma after "Science" in its name.'],


The *vocabulary* consists of the following 15 (max_features) most frequently occurring words:

In [None]:
print(vectorizer_counts.vocabulary_)

{'biology': np.int64(2), 'including': np.int64(6), 'evolution': np.int64(5), 'uses': np.int64(13), 'states': np.int64(11), 'one': np.int64(8), 'also': np.int64(0), 'work': np.int64(14), 'economically': np.int64(4), 'universally': np.int64(12), 'century': np.int64(3), 'philosophy': np.int64(9), 'literature': np.int64(7), 'art': np.int64(1), 'religion': np.int64(10)}


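What are the words that got excluded due to low frequency of occurrence? Be careful here: `stop_words` is the constructor argument, which we left at its default, so the cell below prints `None` rather than a list of excluded terms. (scikit-learn versions that provide the fitted attribute `stop_words_`, with a trailing underscore, record the excluded terms there.)

In [None]:
print(vectorizer_counts.stop_words) # Constructor argument, hence None. Not to be confused with NLTK's stop words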

None


### Document frequency

Document frequency = The number of documents a given term appears in. A term should be considered less important if it occurs in too many documents.

In [None]:
# For each term (column), count the documents (rows) in which it occurs at least once
doc_freq = np.asarray((tf > 0).sum(axis=0)).ravel()

In [None]:
pd.DataFrame([doc_freq], index=None, columns=terms)

Unnamed: 0,also,art,biology,century,economically,evolution,including,literature,one,philosophy,religion,states,universally,uses,work
0,1521,643,566,743,604,535,1465,588,1190,605,616,814,850,1236,861
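
The distinction between term frequency and document frequency is easy to blur, so here is a small check on a made-up count matrix (the numbers are hypothetical, not taken from the data above).

```python
vocab = ["art", "biology", "century", "evolution"]
tf_rows = [
    [0, 2, 0, 1],
    [3, 0, 1, 0],
    [1, 1, 0, 0],
]

# Document frequency: in how many documents does each term appear at all?
doc_freq = [sum(1 for row in tf_rows if row[j] > 0) for j in range(len(vocab))]

# Total occurrences, for contrast: how many times does each term appear overall?
total_counts = [sum(row[j] for row in tf_rows) for j in range(len(vocab))]

print(dict(zip(vocab, doc_freq)))
print(dict(zip(vocab, total_counts)))
```

"art" occurs four times in total but in only two documents; document frequency ignores how often a term repeats within a document.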


### Inverse document frequency

Naively, inverse document frequency would simply be the reciprocal of the document frequency:

In [None]:
pd.DataFrame([1/doc_freq], index=None, columns=terms)

Unnamed: 0,also,art,biology,century,economically,evolution,including,literature,one,philosophy,religion,states,universally,uses,work
0,0.000657,0.001555,0.001767,0.001346,0.001656,0.001869,0.000683,0.001701,0.00084,0.001653,0.001623,0.001229,0.001176,0.000809,0.001161


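The inverse document frequencies may span a wide range, so IDF is more commonly defined on a logarithmic scale. The version used here is `idf = ln(N / df) + 1`, where `N` is the number of documents and `df` is the document frequency; this is the formula `TfidfVectorizer` applies when `smooth_idf=False`.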

In [None]:
idf = np.log(len(documents_detokenized)/doc_freq)+1

In [None]:
pd.DataFrame([idf], index=None, columns=terms)

Unnamed: 0,also,art,biology,century,economically,evolution,including,literature,one,philosophy,religion,states,universally,uses,work
0,2.188268,3.049247,3.176797,2.904696,3.111817,3.233125,2.225781,3.138665,2.433683,3.110163,3.092145,2.813431,2.770155,2.395756,2.757297
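
As a quick check of the formula, the snippet below recomputes two of the values by hand, using the document count (4991) and two document frequencies copied from the tables above.

```python
import math

n_docs = 4991
doc_freq = {"also": 1521, "evolution": 535}  # copied from the document-frequency table

# idf = ln(N / df) + 1
idf_check = {term: math.log(n_docs / df) + 1 for term, df in doc_freq.items()}

for term, value in idf_check.items():
    print(term, round(value, 6))
```

The common word "also" gets a lower weight than the rarer "evolution", which is exactly the intent.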


### TFIDF

In [None]:
pd.DataFrame(np.array(tf.todense())[:10]*idf, columns=terms)

Unnamed: 0,also,art,biology,century,economically,evolution,including,literature,one,philosophy,religion,states,universally,uses,work
0,0.0,0.0,19.060785,0.0,0.0,6.46625,2.225781,0.0,0.0,0.0,0.0,0.0,0.0,2.395756,0.0
1,0.0,0.0,9.530392,0.0,0.0,0.0,2.225781,0.0,2.433683,0.0,0.0,2.813431,0.0,0.0,0.0
2,0.0,0.0,19.060785,0.0,0.0,0.0,2.225781,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,9.530392,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.791512,0.0
4,4.376537,0.0,19.060785,0.0,0.0,0.0,2.225781,0.0,0.0,0.0,0.0,0.0,0.0,4.791512,2.757297
5,2.188268,0.0,12.70719,0.0,3.111817,0.0,11.128905,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,3.176797,0.0,0.0,0.0,2.225781,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,3.176797,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.813431,0.0,4.791512,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.433683,0.0,0.0,0.0,0.0,2.395756,0.0
9,0.0,0.0,19.060785,0.0,0.0,6.46625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Normalization

The rows of the above matrix must be normalized to ensure that the total number of words in a document does not skew the term frequencies. We divide each row by its Euclidean (L2) norm, so that every document vector has length 1.

In [None]:
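def normalize_row(array):
    # Scale each row to unit Euclidean (L2) length.
    # All-zero rows have norm 0; divide those by 1 so they stay zero instead of becoming NaN.
    norms = np.linalg.norm(array, axis=1, keepdims=True)
    return array / np.where(norms == 0, 1, norms)

In [None]:
tfidf = normalize_row(np.array(tf.todense())*idf)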


In [None]:
pd.DataFrame(tfidf[:10], columns=terms)

Unnamed: 0,also,art,biology,century,economically,evolution,including,literature,one,philosophy,religion,states,universally,uses,work
0,0.0,0.0,0.934735,0.0,0.0,0.317103,0.109152,0.0,0.0,0.0,0.0,0.0,0.0,0.117487,0.0
1,0.0,0.0,0.910258,0.0,0.0,0.0,0.212587,0.0,0.232444,0.0,0.0,0.268714,0.0,0.0,0.0
2,0.0,0.0,0.993251,0.0,0.0,0.0,0.115985,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.893438,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.449186,0.0
4,0.214068,0.0,0.932312,0.0,0.0,0.0,0.108869,0.0,0.0,0.0,0.0,0.0,0.0,0.234365,0.134867
5,0.126382,0.0,0.733898,0.0,0.179722,0.0,0.642745,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.818987,0.0,0.0,0.0,0.573812,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.496338,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.439566,0.0,0.748619,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.712638,0.0,0.0,0.0,0.0,0.701532,0.0
9,0.0,0.0,0.946991,0.0,0.0,0.321261,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
print([np.linalg.norm(row) for row in tfidf[:10]])

[np.float64(1.0), np.float64(0.9999999999999999), np.float64(1.0), np.float64(1.0), np.float64(1.0), np.float64(1.0), np.float64(1.0), np.float64(1.0), np.float64(1.0), np.float64(1.0)]
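
To make the normalization concrete, here is row 8 worked by hand. Its only nonzero TFIDF entries, copied from the unnormalized table above, are "one" and "uses".

```python
import math

row = {"one": 2.433683, "uses": 2.395756}  # nonzero entries of row 8 before normalization

# Euclidean (L2) norm of the row
norm = math.sqrt(sum(v * v for v in row.values()))

# Divide every entry by the norm to get a unit-length vector
unit = {term: v / norm for term, v in row.items()}

print(round(norm, 6))
print({term: round(v, 6) for term, v in unit.items()})
```

The results agree with row 8 of the normalized table, and the squares of the two entries sum to 1.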


### TfidfVectorizer

In practice (let's keep things DRY), the above result can be obtained using `TfidfVectorizer`. We set `smooth_idf=False` so that it uses the same IDF formula as our manual computation.

In [None]:
vectorizer_tfidf = TfidfVectorizer(max_features= 15, use_idf=True, smooth_idf=False)
X = vectorizer_tfidf.fit_transform(documents_detokenized)

In [None]:
terms_tfidf = np.array(vectorizer_tfidf.get_feature_names_out())

In [None]:
pd.DataFrame(X.todense()[:10], columns=terms_tfidf) # Using TfidfVectorizer

Unnamed: 0,also,art,biology,century,economically,evolution,including,literature,one,philosophy,religion,states,universally,uses,work
0,0.0,0.0,0.934735,0.0,0.0,0.317103,0.109152,0.0,0.0,0.0,0.0,0.0,0.0,0.117487,0.0
1,0.0,0.0,0.910258,0.0,0.0,0.0,0.212587,0.0,0.232444,0.0,0.0,0.268714,0.0,0.0,0.0
2,0.0,0.0,0.993251,0.0,0.0,0.0,0.115985,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.893438,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.449186,0.0
4,0.214068,0.0,0.932312,0.0,0.0,0.0,0.108869,0.0,0.0,0.0,0.0,0.0,0.0,0.234365,0.134867
5,0.126382,0.0,0.733898,0.0,0.179722,0.0,0.642745,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.818987,0.0,0.0,0.0,0.573812,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.496338,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.439566,0.0,0.748619,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.712638,0.0,0.0,0.0,0.0,0.701532,0.0
9,0.0,0.0,0.946991,0.0,0.0,0.321261,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
pd.DataFrame(tfidf[:10], columns=terms) # WET

Unnamed: 0,also,art,biology,century,economically,evolution,including,literature,one,philosophy,religion,states,universally,uses,work
0,0.0,0.0,0.934735,0.0,0.0,0.317103,0.109152,0.0,0.0,0.0,0.0,0.0,0.0,0.117487,0.0
1,0.0,0.0,0.910258,0.0,0.0,0.0,0.212587,0.0,0.232444,0.0,0.0,0.268714,0.0,0.0,0.0
2,0.0,0.0,0.993251,0.0,0.0,0.0,0.115985,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.893438,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.449186,0.0
4,0.214068,0.0,0.932312,0.0,0.0,0.0,0.108869,0.0,0.0,0.0,0.0,0.0,0.0,0.234365,0.134867
5,0.126382,0.0,0.733898,0.0,0.179722,0.0,0.642745,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.818987,0.0,0.0,0.0,0.573812,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.496338,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.439566,0.0,0.748619,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.712638,0.0,0.0,0.0,0.0,0.701532,0.0
9,0.0,0.0,0.946991,0.0,0.0,0.321261,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Vectorization

What we did above was just for explaining TFIDF. Now let's get real and repeat the computation using more realistic options.

In [None]:
vectorizer_tfidf = TfidfVectorizer(max_features= 1000, use_idf=True, sublinear_tf=True, smooth_idf=True, min_df=0.013, max_df= 0.95)
X = vectorizer_tfidf.fit_transform(documents_detokenized).toarray() # Setting min_df and max_df is an "art"

In [None]:
X.shape # This is the document-term matrix i.e. the starting point for NLP

(4991, 1000)
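
Two of these options change the formulas used earlier. As documented by scikit-learn, `sublinear_tf=True` replaces a raw count `tf` with `1 + ln(tf)`, and `smooth_idf=True` computes `ln((1 + N) / (1 + df)) + 1`, as if one extra document contained every term. A float `min_df` or `max_df` is read as a proportion of documents, so `min_df=0.013` drops terms found in fewer than about 1.3% of them. The helper names below are hypothetical; the sketch only illustrates the two formulas.

```python
import math

def sublinear_tf(count):
    # Dampen repeated occurrences: 1 + ln(count) instead of the raw count
    return 1 + math.log(count) if count > 0 else 0.0

def smoothed_idf(n_docs, df):
    # Add one to numerator and denominator, then take ln and add 1
    return math.log((1 + n_docs) / (1 + df)) + 1

def plain_idf(n_docs, df):
    # The unsmoothed version used in the manual computation
    return math.log(n_docs / df) + 1

print(sublinear_tf(1), sublinear_tf(6))
print(smoothed_idf(4991, 1521), plain_idf(4991, 1521))
```

A term that occurs six times now counts a little under three times as much as a term that occurs once, rather than six times as much.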

In [None]:
# The terms are arranged alphabetically. Take a look at some of the last few (the slice -10:-1 stops just before the final term).
terms = np.array(vectorizer_tfidf.get_feature_names_out()); terms[-10:-1]

array(['worldwide', 'worshipped', 'would', 'writer', 'writing', 'written',
       'wrote', 'years', 'york'], dtype=object)

Note: Processing a new query (i.e. a passage of text) calls for converting the query to be compatible with the vectors in `X`. This conversion can be accomplished by computing `vectorizer_tfidf.transform([preprocess(query)])`. The utility function defined below does just that.

In [None]:
def query2vector(query, vectorizer):
    preprocessed = [preprocess(query)]
    return vectorizer.transform(preprocessed).toarray()[0]

In [None]:
query = data[15, 1]; query

'In biological classification, class (Latin: classis) is  a taxonomic rank, as well as a taxonomic unit, a taxon, in that rank. Other well-known ranks in descending order of size are life, domain, kingdom, phylum, order, family, genus, and species, with class fitting between phylum and order.'

In [None]:
np.allclose(X[15], query2vector(query, vectorizer_tfidf)) # Just checking

True

**Once every document has been converted to a vector, all the usual machine learning methods can be brought to bear on the data. Prediction, classification and clustering all become possible.**

## Text Classification

### Split into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
[X_train, X_test, indices_train, indices_test] = train_test_split(X, np.arange(len(X)), test_size=0.25, random_state=0)

In [None]:
y_train = data[indices_train, 0]
y_test = data[indices_test, 0]

In [None]:
X_train.shape

(3743, 1000)

In [None]:
X_test.shape

(1248, 1000)

In [None]:
indices_train.shape

(3743,)

In [None]:
indices_test.shape

(1248,)

In [None]:
X.shape

(4991, 1000)

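It is more correct to split into training and testing sets ***right after importing the data, before any other processing***. Otherwise the vocabulary and IDF weights are computed from documents that later end up in the test set, so information about the test data leaks into the features. Do as I say, don't do as I do.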
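
Here is a minimal sketch of the cleaner ordering, using made-up documents and plain Python in place of the vectorizer: split first, then build the vocabulary from the training documents only. With scikit-learn this amounts to calling `fit_transform` on the training documents and `transform` on the test documents.

```python
import random

# Toy corpus, assumed to be already preprocessed
docs = [
    "biology cell", "art painting", "biology gene",
    "art sculpture", "cosmology universe", "art museum",
]

# 1. Split the document indices before any fitting
rng = random.Random(0)
indices = list(range(len(docs)))
rng.shuffle(indices)
n_test = 2
test_idx, train_idx = indices[:n_test], indices[n_test:]

# 2. Learn the vocabulary from the training documents only
train_vocab = sorted({term for i in train_idx for term in docs[i].split()})

# 3. Test-only terms are simply unknown to the model, as they would be in production
unseen = sorted({term for i in test_idx for term in docs[i].split()} - set(train_vocab))

print(train_vocab)
print(unseen)
```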

### Build and try out a classifier

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
clf = LogisticRegression(random_state=0).fit(X_train, y_train) # Build a Logistic Regression classifier
clf.score(X_test, y_test) # Test the classifier. Yay!

0.9022435897435898

In [None]:
y_pred = clf.predict(X_test)

In [None]:
y_pred

array(['Commerce', 'Astronomy', 'Art', ..., 'Evolution', 'Astronomy',
       'Commerce'], dtype=object)

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

In [None]:
df = pd.DataFrame(cm, index=unique_categories, columns=unique_categories)

In [None]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df)

Unnamed: 0,Art,Astronomy,Biology,Commerce,Cosmology,Economics,Evolution,Literature,Philosophy,Religion
Art,124,0,1,0,0,1,0,0,0,0
Astronomy,0,113,1,0,10,0,0,0,0,0
Biology,0,0,99,0,1,1,16,0,1,0
Commerce,1,1,0,127,0,2,1,0,1,0
Cosmology,1,7,2,1,94,2,1,3,9,6
Economics,0,0,2,7,1,111,0,0,0,0
Evolution,1,1,9,0,4,0,108,0,2,0
Literature,1,0,0,0,2,1,0,126,0,0
Philosophy,1,0,1,0,6,1,1,0,117,5
Religion,0,0,0,0,6,0,0,0,0,107
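
In scikit-learn's `confusion_matrix`, rows are true categories and columns are predicted categories. The sketch below copies the matrix above by hand and derives the overall accuracy plus per-category recall and precision; it is one way to read the table, assuming the numbers were transcribed correctly.

```python
labels = ["Art", "Astronomy", "Biology", "Commerce", "Cosmology",
          "Economics", "Evolution", "Literature", "Philosophy", "Religion"]

cm = [
    [124, 0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 113, 1, 0, 10, 0, 0, 0, 0, 0],
    [0, 0, 99, 0, 1, 1, 16, 0, 1, 0],
    [1, 1, 0, 127, 0, 2, 1, 0, 1, 0],
    [1, 7, 2, 1, 94, 2, 1, 3, 9, 6],
    [0, 0, 2, 7, 1, 111, 0, 0, 0, 0],
    [1, 1, 9, 0, 4, 0, 108, 0, 2, 0],
    [1, 0, 0, 0, 2, 1, 0, 126, 0, 0],
    [1, 0, 1, 0, 6, 1, 1, 0, 117, 5],
    [0, 0, 0, 0, 6, 0, 0, 0, 0, 107],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

# Recall: correct predictions divided by the row total (all true members of the category)
recall = {lab: cm[i][i] / sum(cm[i]) for i, lab in enumerate(labels)}

# Precision: correct predictions divided by the column total (everything predicted as the category)
precision = {lab: cm[i][i] / sum(row[i] for row in cm) for i, lab in enumerate(labels)}

print(round(accuracy, 4))
for lab in labels:
    print(f"{lab:<11} recall={recall[lab]:.3f} precision={precision[lab]:.3f}")
```

The diagonal sums to the same accuracy that `clf.score` reported. The largest confusions are Biology with Evolution and Astronomy with Cosmology, which are closely related topics.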
