# Text Analytics | BAIS:6100
# Module 7: Document-Term Representation

Instructor: Kang-Pyo Lee

In [1]:
# ! pip install --user --upgrade scikit-learn

## What Is a Corpus?

A corpus, or text corpus, is a large and structured set of texts. In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, such as checking occurrences or validating linguistic rules within a specific language territory.

Text Corpus: https://en.wikipedia.org/wiki/Text_corpus

In [2]:
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

This corpus contains four documents and nine unique words, or terms. 
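As a quick check of these counts, the sketch below tokenizes the corpus with a simple regular expression and counts the distinct lowercase tokens. The pattern (runs of two or more word characters) is an assumption chosen to resemble scikit-learn's default tokenization; a different tokenizer could give a different vocabulary.

```python
import re

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Assumed tokenization: lowercase, then keep runs of 2+ word characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]

# The vocabulary is the set of distinct tokens across all documents.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

print(len(corpus))      # number of documents
print(len(vocabulary))  # number of unique terms
print(vocabulary)
```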

## What Is a Document-Term Matrix?

A document-term matrix (DTM) is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. There are various schemes for determining the value that each entry in the matrix should take.

Document-term matrix: https://en.wikipedia.org/wiki/Document-term_matrix

A DTM is based on the "bag-of-words" model, in which a text is represented simply as the bag of its words: grammar and word order are disregarded, and only multiplicity is kept.
- Pros: Simple and easy to analyze.
- Cons: Grammar and order are lost.

Bag-of-words model: https://en.wikipedia.org/wiki/Bag-of-words_model
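To make the loss of word order concrete, here is a minimal sketch that represents two of the example sentences as bags of words using `collections.Counter`. The helper `bag_of_words` and its regular-expression tokenizer are illustrative assumptions, not part of any library.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Return term counts for a text, ignoring case, punctuation, and order."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

bag_a = bag_of_words('This is the first document.')
bag_b = bag_of_words('Is this the first document?')

print(bag_a)
# Under this representation the statement and the question
# contain exactly the same terms with the same counts.
print(bag_a == bag_b)
```

The two bags compare equal, which is the "Cons" item above in practice: the model cannot tell the statement from the question.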

## What Is TF-IDF?

Term frequency (TF) is the number of times a term occurs in a document. Adjustments are often made to simple term frequency when the length of documents varies greatly. In that case, we typically divide the raw term frequencies by the length of the document, i.e., the number of all terms in the document.
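The length adjustment can be sketched in a few lines. The example below uses the second document of the corpus and the same assumed regular-expression tokenizer as a stand-in for a real tokenizer.

```python
from collections import Counter
import re

doc = 'This document is the second document.'

# Assumed tokenization: lowercase, then keep runs of 2+ word characters.
tokens = re.findall(r"\b\w\w+\b", doc.lower())

raw_tf = Counter(tokens)   # raw term frequencies
doc_length = len(tokens)   # number of all terms in the document

# Length-adjusted term frequency: raw count divided by document length.
relative_tf = {term: count / doc_length for term, count in raw_tf.items()}

print(raw_tf['document'])       # raw count of "document"
print(relative_tf['document'])  # share of the document's terms
```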

Inverse document frequency (IDF) is an inverse function of the number of documents in which a term occurs. For example, because the term *the* is so common in English, raw term frequency tends to overemphasize documents that happen to use *the* more often, without giving enough weight to more meaningful terms. The term *the* is therefore a poor keyword for distinguishing relevant documents from non-relevant ones. Hence, an inverse document frequency factor is incorporated, which diminishes the weight of terms that occur in many documents of the collection and increases the weight of terms that occur in few.

TF-IDF is the product of two statistics, term frequency and inverse document frequency. There are various ways of determining the exact values of both statistics.

tf–idf: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
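As an illustration of the inverse relationship, the sketch below computes document frequencies for the example corpus and one common textbook form of IDF, log(N / df). This particular formula is an assumption made for illustration; libraries such as scikit-learn use smoothed variants, so their numbers differ. The helper names are hypothetical.

```python
import math
import re

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Set of distinct terms per document (assumed regex tokenization).
doc_term_sets = [set(re.findall(r"\b\w\w+\b", doc.lower())) for doc in corpus]
n_docs = len(corpus)

def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(term in terms for terms in doc_term_sets)

def idf(term):
    """Textbook IDF variant: natural log of N / df (an illustrative choice)."""
    return math.log(n_docs / document_frequency(term))

for term in ['the', 'document', 'first', 'second']:
    print(term, document_frequency(term), round(idf(term), 3))
```

A term that appears in every document, such as *the*, gets an IDF of zero under this variant, while a term that appears in a single document gets the largest weight.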

## Building a DTM with Term Frequencies

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [4]:
vectorizer = TfidfVectorizer(use_idf=False, norm=None)

sklearn.feature_extraction.text.TfidfVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

- `lowercase` (bool, default=True): Convert all characters to lowercase before tokenizing.
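To see what `lowercase` changes, the following sketch fits two vectorizers on the same corpus, one with the default and one with `lowercase=False`, and compares their vocabularies. The helper `vocabulary_of` is a hypothetical convenience function; it reads the fitted `vocabulary_` attribute, which maps each term to its column index.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

def vocabulary_of(vectorizer):
    """Fit on the corpus and return its terms in column order."""
    vectorizer.fit(corpus)
    return sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

lowercased = vocabulary_of(TfidfVectorizer(use_idf=False, norm=None))
case_sensitive = vocabulary_of(
    TfidfVectorizer(use_idf=False, norm=None, lowercase=False))

print(len(lowercased), lowercased)
print(len(case_sensitive), case_sensitive)
```

Without lowercasing, *This* and *this* (and *Is* and *is*) become separate columns, so the vocabulary grows.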

In [5]:
X = vectorizer.fit_transform(corpus)

sklearn.feature_extraction.text.TfidfVectorizer.fit_transform: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit_transform

From this point on, you can consider X the document-term matrix for `corpus`. 

In [6]:
X

<4x9 sparse matrix of type '<class 'numpy.float64'>'
	with 21 stored elements in Compressed Sparse Row format>

In [7]:
type(X)

scipy.sparse.csr.csr_matrix

In [8]:
X.shape

(4, 9)

X has four rows, or documents, and nine columns, or terms.
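The matrix is stored in a sparse format because, in realistic corpora, most documents contain only a small fraction of the vocabulary. A minimal sketch of how one might inspect this, using the `nnz` attribute of SciPy sparse matrices (the variable name `dtm` is chosen here to avoid overwriting `X`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

dtm = TfidfVectorizer(use_idf=False, norm=None).fit_transform(corpus)

n_cells = dtm.shape[0] * dtm.shape[1]  # total cells: documents x terms
print(dtm.nnz)                         # number of stored (non-zero) entries
print(dtm.nnz / n_cells)               # fraction of cells that are non-zero
```

Only the non-zero entries are stored, which matters little for a 4x9 toy matrix but a great deal for thousands of documents and terms.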

In [9]:
vectorizer.get_feature_names_out().tolist()

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

sklearn.feature_extraction.text.TfidfVectorizer.get_feature_names_out: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.get_feature_names_out

In [10]:
X.toarray()

array([[0., 1., 1., 1., 0., 0., 1., 0., 1.],
       [0., 2., 0., 1., 0., 1., 1., 0., 1.],
       [1., 0., 0., 1., 1., 0., 1., 1., 1.],
       [0., 1., 1., 1., 0., 0., 1., 0., 1.]])

In [11]:
import pandas as pd

pd.DataFrame(data=X.toarray(), columns=vectorizer.get_feature_names_out(), 
             index=["doc{}".format(i) for i in range(X.shape[0])])

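|      | and | document | first | is  | one | second | the | third | this |
|------|-----|----------|-------|-----|-----|--------|-----|-------|------|
| doc0 | 0.0 | 1.0      | 1.0   | 1.0 | 0.0 | 0.0    | 1.0 | 0.0   | 1.0  |
| doc1 | 0.0 | 2.0      | 0.0   | 1.0 | 0.0 | 1.0    | 1.0 | 0.0   | 1.0  |
| doc2 | 1.0 | 0.0      | 0.0   | 1.0 | 1.0 | 0.0    | 1.0 | 1.0   | 1.0  |
| doc3 | 0.0 | 1.0      | 1.0   | 1.0 | 0.0 | 0.0    | 1.0 | 0.0   | 1.0  |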


This document-term matrix is based on the bag-of-words model, so we have lost all the grammar and order of words. Only how many times each term occurs matters in this scheme.  
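With IDF weighting and normalization both switched off, the values in the matrix are plain term counts. As a sanity check, the sketch below compares the result with scikit-learn's `CountVectorizer`, which produces raw counts directly; the variable names are chosen for this example only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Term frequencies from TfidfVectorizer with IDF and normalization disabled.
tf_counts = TfidfVectorizer(use_idf=False, norm=None).fit_transform(corpus).toarray()

# Raw integer counts from CountVectorizer.
raw_counts = CountVectorizer().fit_transform(corpus).toarray()

# The two matrices hold the same values (floats vs. integers).
print(np.array_equal(tf_counts, raw_counts))
```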

## Building a DTM with Binary Term Frequencies

In [12]:
vectorizer = TfidfVectorizer(binary=True, use_idf=False, norm=None)
X = vectorizer.fit_transform(corpus)

- `binary` (bool, default=False): If True, all non-zero term counts are set to 1.

In [13]:
pd.DataFrame(data=X.toarray(), columns=vectorizer.get_feature_names_out(), 
             index=["doc{}".format(i) for i in range(X.shape[0])])

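|      | and | document | first | is  | one | second | the | third | this |
|------|-----|----------|-------|-----|-----|--------|-----|-------|------|
| doc0 | 0.0 | 1.0      | 1.0   | 1.0 | 0.0 | 0.0    | 1.0 | 0.0   | 1.0  |
| doc1 | 0.0 | 1.0      | 0.0   | 1.0 | 0.0 | 1.0    | 1.0 | 0.0   | 1.0  |
| doc2 | 1.0 | 0.0      | 0.0   | 1.0 | 1.0 | 0.0    | 1.0 | 1.0   | 1.0  |
| doc3 | 0.0 | 1.0      | 1.0   | 1.0 | 0.0 | 0.0    | 1.0 | 0.0   | 1.0  |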


In this scheme, only whether a word appears in a document matters, not how many times it appears.
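One way to confirm this reading of `binary=True` is to threshold the raw counts by hand and compare. The sketch below does so for the example corpus; the only entry that differs between the two matrices is the count of 2 for *document* in the second document.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

counts = TfidfVectorizer(use_idf=False, norm=None).fit_transform(corpus).toarray()
binary = TfidfVectorizer(binary=True, use_idf=False,
                         norm=None).fit_transform(corpus).toarray()

# Binary term frequency is the same as "count > 0" expressed as 0.0 / 1.0.
print(np.array_equal(binary, (counts > 0).astype(float)))

# Row for the second document: the count of 2 collapses to 1.
print(counts[1])
print(binary[1])
```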

## Building a DTM with Normalized Term Frequencies

Longer documents tend to have higher raw term counts simply because they contain more terms, and it would be unfair to give them more credit for that. Normalization removes this length advantage so that documents of different lengths can be compared fairly.

In [14]:
vectorizer = TfidfVectorizer(use_idf=False, norm="l2")
X = vectorizer.fit_transform(corpus)

- `norm` ('l1', 'l2' or None, optional, default='l2') 
    - 'l2': Sum of squares of vector elements is 1
    - 'l1': Sum of absolute values of vector elements is 1

In [15]:
pd.DataFrame(data=X.toarray(), columns=vectorizer.get_feature_names_out(), 
             index=["doc{}".format(i) for i in range(X.shape[0])])

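|      | and      | document | first    | is       | one      | second   | the      | third    | this     |
|------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| doc0 | 0.0      | 0.447214 | 0.447214 | 0.447214 | 0.0      | 0.0      | 0.447214 | 0.0      | 0.447214 |
| doc1 | 0.0      | 0.707107 | 0.0      | 0.353553 | 0.0      | 0.353553 | 0.353553 | 0.0      | 0.353553 |
| doc2 | 0.408248 | 0.0      | 0.0      | 0.408248 | 0.408248 | 0.0      | 0.408248 | 0.408248 | 0.408248 |
| doc3 | 0.0      | 0.447214 | 0.447214 | 0.447214 | 0.0      | 0.0      | 0.447214 | 0.0      | 0.447214 |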
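The two `norm` options can be checked numerically. The sketch below builds both an L2-normalized and an L1-normalized matrix and verifies the row-wise property stated for each; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

l2 = TfidfVectorizer(use_idf=False, norm="l2").fit_transform(corpus).toarray()
l1 = TfidfVectorizer(use_idf=False, norm="l1").fit_transform(corpus).toarray()

# 'l2': the squares of each row's elements sum to 1.
print((l2 ** 2).sum(axis=1))

# 'l1': the absolute values of each row's elements sum to 1.
print(np.abs(l1).sum(axis=1))
```

With L1 normalization and no IDF, each entry is the term's share of the document's terms, which matches the length-adjusted term frequency described earlier.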


## Building a DTM with TF-IDF

In [16]:
vectorizer = TfidfVectorizer(use_idf=True, norm="l2")
X = vectorizer.fit_transform(corpus)

- `use_idf` (bool, default=True): Enable inverse-document-frequency reweighting.

In [17]:
pd.DataFrame(data=X.toarray(), columns=vectorizer.get_feature_names_out(), 
             index=["doc{}".format(i) for i in range(X.shape[0])])

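|      | and      | document | first    | is       | one      | second   | the      | third    | this     |
|------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| doc0 | 0.0      | 0.469791 | 0.580286 | 0.384085 | 0.0      | 0.0      | 0.384085 | 0.0      | 0.384085 |
| doc1 | 0.0      | 0.687624 | 0.0      | 0.281089 | 0.0      | 0.538648 | 0.281089 | 0.0      | 0.281089 |
| doc2 | 0.511849 | 0.0      | 0.0      | 0.267104 | 0.511849 | 0.0      | 0.267104 | 0.511849 | 0.267104 |
| doc3 | 0.0      | 0.469791 | 0.580286 | 0.384085 | 0.0      | 0.0      | 0.384085 | 0.0      | 0.384085 |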
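These values can be reproduced by hand, which helps show where each number comes from. The sketch below assumes scikit-learn's default smoothed IDF (`smooth_idf=True`), idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies it by the raw counts, and then L2-normalizes each row. The variable names are chosen for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Step 1: raw term counts (documents x terms).
counts = CountVectorizer().fit_transform(corpus).toarray().astype(float)

# Step 2: document frequency and smoothed IDF (assumed default variant).
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Step 3: weight the counts by IDF, then L2-normalize each row.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with the library result.
reference = TfidfVectorizer(use_idf=True, norm="l2").fit_transform(corpus).toarray()
print(np.allclose(manual, reference))
print(manual[0].round(6))
```

Terms that occur in all four documents (*is*, *the*, *this*) end up with the smallest weights in each row, while rarer terms such as *first* and *second* are weighted up.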


## Building a DTM with TF-IDF Removing English Stopwords

You may want to exclude common English stopwords from the DTM.

In [18]:
vectorizer = TfidfVectorizer(use_idf=True, norm="l2", stop_words="english")
X = vectorizer.fit_transform(corpus)

- `stop_words` (str {'english'}, list, or None, default=None)

In [19]:
pd.DataFrame(data=X.toarray(), columns=vectorizer.get_feature_names_out(), 
             index=["doc{}".format(i) for i in range(X.shape[0])])

|      | document | second   |
|------|----------|----------|
| doc0 | 1.0      | 0.0      |
| doc1 | 0.787223 | 0.616668 |
| doc2 | 0.0      | 0.0      |
| doc3 | 1.0      | 0.0      |


By removing English stopwords, the terms <i>and</i>, <i>first</i>, <i>is</i>, <i>one</i>, <i>the</i>, <i>third</i>, and <i>this</i> have disappeared from the DTM.
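To see which terms the built-in list covers, one can check the vocabulary against `ENGLISH_STOP_WORDS`, the set that scikit-learn exposes in `sklearn.feature_extraction.text`. The sketch below is illustrative; the list is fixed by the library and may not suit every corpus.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Vocabulary of the example corpus before stopword removal.
vocabulary = ['and', 'document', 'first', 'is', 'one',
              'second', 'the', 'third', 'this']

removed = [term for term in vocabulary if term in ENGLISH_STOP_WORDS]
kept = [term for term in vocabulary if term not in ENGLISH_STOP_WORDS]

print(removed)  # terms filtered out as English stopwords
print(kept)     # terms that remain as columns of the DTM
```

Because every term in the third document is on the stopword list, its row in the DTM above is all zeros.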

## Building a DTM with TF-IDF Removing Corpus-Specific Stopwords

Besides universal stopwords, there can be corpus-specific stopwords: terms that occur in a large share of the documents in a specific corpus and therefore do little to distinguish one document from another.

In [20]:
vectorizer = TfidfVectorizer(use_idf=True, norm="l2", max_df=0.7)
X = vectorizer.fit_transform(corpus)

- `max_df` (float in range [0.0, 1.0] or int, default=1.0): When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold. A float is interpreted as a proportion of documents and an int as an absolute document count. For example, if `max_df` is set to 0.7, all terms that appear in over 70% of the documents will be excluded.

In [21]:
pd.DataFrame(data=X.toarray(), columns=vectorizer.get_feature_names_out(), 
             index=["doc{}".format(i) for i in range(X.shape[0])])

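|      | and     | first | one     | second | third   |
|------|---------|-------|---------|--------|---------|
| doc0 | 0.0     | 1.0   | 0.0     | 0.0    | 0.0     |
| doc1 | 0.0     | 0.0   | 0.0     | 1.0    | 0.0     |
| doc2 | 0.57735 | 0.0   | 0.57735 | 0.0    | 0.57735 |
| doc3 | 0.0     | 1.0   | 0.0     | 0.0    | 0.0     |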


By removing corpus-specific stopwords, the terms <i>document</i>, <i>is</i>, <i>the</i>, and <i>this</i> have disappeared from the DTM.
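The effect of `max_df=0.7` can be traced by computing, for each term, the share of documents that contain it. The sketch below does this with a binary `CountVectorizer`; the names `doc_share`, `dropped`, and `kept` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Binary presence matrix: 1 if the term occurs in the document, else 0.
vectorizer = CountVectorizer(binary=True)
presence = vectorizer.fit_transform(corpus).toarray()
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Share of documents containing each term (document frequency / n_docs).
doc_share = presence.mean(axis=0)

max_df = 0.7
dropped = [t for t, share in zip(terms, doc_share) if share > max_df]
kept = [t for t, share in zip(terms, doc_share) if share <= max_df]

for term, share in zip(terms, doc_share):
    print(term, share)
print(dropped)
print(kept)
```

The term *document* appears in three of the four documents (75%), which is just above the 70% threshold, so it is dropped along with the terms that appear in every document.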

#### The choice of which scheme to use to fill the document-term matrix depends on the data.