A simple real-world dataset for this demonstration is obtained from the movie review corpus provided by nltk (Pang & Lee, 2004). The first two reviews from the positive set and from the negative set are selected, and then the first sentence of each of these four reviews is taken. We can first define 4 documents in Python as:

In [1]:
import sys
sys.path.append('..')

from Truth import Page

d1 = Page('Athens').pagecontent
d2 = Page('Greece').pagecontent
d3 = Page('Greece').pagecontent
d4 = Page('Philippines').pagecontent
documents = [d1, d2]  # only d1 and d2 are used in the steps below

FileNotFoundError: [Errno 2] No such file or directory: 'D:\\Files\\Desktop\\thesis-clean\\thesis-computational-fact-checking\\ipython\\logs\\wikilog.log'

## Preprocessing with nltk

By default, CountVectorizer and TfidfVectorizer in scikit-learn detect word boundaries and remove punctuation automatically. However, if we want to do stemming or lemmatization, we need to customize certain parameters of CountVectorizer and TfidfVectorizer. Doing this overrides the default tokenization, which means that we have to handle tokenization, punctuation removal, and lower-casing ourselves.
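To make concrete what a replacement tokenizer has to cover, here is a minimal standard-library sketch of a callable that lower-cases, deletes punctuation, and splits the text into tokens. The name `simple_normalize` is hypothetical, and whitespace splitting is only a stand-in assumption for `nltk.word_tokenize`; no stemming or lemmatization is applied here.

```python
import string

# Map every ASCII punctuation code point to None so str.translate deletes it.
REMOVE_PUNCT = {ord(ch): None for ch in string.punctuation}

def simple_normalize(text):
    """Lower-case, delete ASCII punctuation, then split on whitespace."""
    return text.lower().translate(REMOVE_PUNCT).split()

print(simple_normalize("Athens is the capital of Greece, and its largest city."))
```

Because the hyphen is deleted rather than replaced by a space, a word such as "city-state" collapses into the single token "citystate".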

**Normalize by stemming**:

In [38]:
import nltk, string, numpy
#nltk.download('punkt') # first-time use only
stemmer = nltk.stem.porter.PorterStemmer()
def StemTokens(tokens):
    return [stemmer.stem(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def StemNormalize(text):
    return StemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

**Normalize by lemmatization:**

In [39]:
#nltk.download('wordnet') # first-time use only
lemmer = nltk.stem.WordNetLemmatizer()
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

If we want more meaningful terms in their dictionary forms, lemmatization is preferred.

## Turn text into vectors of term frequency:

In [40]:
from sklearn.feature_extraction.text import CountVectorizer
LemVectorizer = CountVectorizer(tokenizer=LemNormalize, stop_words='english')
LemVectorizer.fit_transform(documents)

<2x5226 sparse matrix of type '<class 'numpy.int64'>'
	with 6329 stored elements in Compressed Sparse Row format>

The normalized (lemmatized) text of the documents is tokenized, and each term is assigned an index:

In [41]:
print (LemVectorizer.vocabulary_)

{'athens': 848, 'greek': 2340, 'αθήνα': 5113, 'athína': 858, 'aˈθina': 912, 'ancient': 721, 'ἀθῆναι': 5211, 'athênai': 857, 'atʰɛ̂ːnai̯': 887, 'capital': 1091, 'largest': 2843, 'city': 1206, 'greece': 2338, 'dominates': 1699, 'attica': 877, 'region': 3978, 'world': 5057, 'oldest': 3384, 'recorded': 3950, 'history': 2458, 'spanning': 4409, '3400': 356, 'year': 5080, 'earliest': 1745, 'human': 2493, 'presence': 3774, 'starting': 4471, '11th': 30, '7th': 495, 'millennium': 3123, 'bcclassical': 944, 'wa': 4985, 'powerful': 3753, 'citystate': 1208, 'emerged': 1823, 'conjunction': 1319, 'seagoing': 4218, 'development': 1605, 'port': 3734, 'piraeus': 3659, 'distinct': 1678, 'prior': 3796, '5th': 453, 'century': 1142, 'bc': 943, 'incorporation': 2562, 'centre': 1139, 'art': 818, 'learning': 2872, 'philosophy': 3640, 'home': 2466, 'plato': 3677, 'academy': 556, 'aristotle': 806, 'lyceum': 2965, 'widely': 5023, 'referred': 3963, 'cradle': 1421, 'western': 5013, 'civilization': 1214, 'birthplace'
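The indexing step can be sketched in plain Python on a toy corpus. The token lists and the names `terms`, `vocabulary`, and `count_row` are hypothetical and are not the pages used above; the sketch assumes terms are indexed in sorted order, which is consistent with the indices printed above.

```python
from collections import Counter

# Toy tokenized documents (hypothetical, not the pages used above).
docs = [
    ["athens", "capital", "greece"],
    ["greece", "country", "greece"],
]

# Assign each distinct term an index in sorted order.
terms = sorted({token for doc in docs for token in doc})
vocabulary = {term: index for index, term in enumerate(terms)}

def count_row(tokens):
    """Return the term counts of one document, ordered by term index."""
    counts = Counter(tokens)
    return [counts[term] for term in terms]

tf_rows = [count_row(doc) for doc in docs]
print(vocabulary)
print(tf_rows)
```

Each row of `tf_rows` has one entry per vocabulary term, so the result has as many rows as documents and as many columns as terms.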

And we have the tf matrix:

In [42]:
tf_matrix = LemVectorizer.transform(documents).toarray()
print (tf_matrix)

[[0 0 0 ... 0 0 0]
 [1 1 1 ... 1 1 1]]


This should be a **2**  *(# of documents)* by **5226** *(# of terms in the corpus)* matrix. Check its shape:


In [43]:
tf_matrix.shape

(2, 5226)

## Calculate idf and turn tf matrix to tf-idf matrix:
Get idf:

In [44]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidfTran = TfidfTransformer(norm="l2")
tfidfTran.fit(tf_matrix)
print (tfidfTran.idf_)

[1.40546511 1.40546511 1.40546511 ... 1.40546511 1.40546511 1.40546511]


Now we have a vector in which each component is the idf of one term. The values displayed above are identical because each of those terms appears in only one of the two documents; a term that appears in both documents gets a smaller idf. We can corroborate the result.

In [45]:
import math
def idf(n,df):
    result = math.log((n+1.0)/(df+1.0)) + 1
    return result
print ("The idf for terms that appear in one document: " + str(idf(4,1)))
print ("The idf for terms that appear in two documents: " + str(idf(4,2)))

The idf for terms that appear in one document: 1.916290731874155
The idf for terms that appear in two documents: 1.5108256237659907


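These two values are for a corpus of n = 4 documents. With the n = 2 documents used here, the same formula gives ln(3/2) + 1 ≈ 1.4055 for a term that appears in one document, which matches the result from TfidfTransformer. Also, the idf is indeed smaller when the document frequency df(d, t) is larger.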
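As a small illustration of how document frequency feeds into this formula, the sketch below derives df and the smoothed idf from a toy term-frequency matrix. The matrix values and the names `tf`, `df`, and `idf_values` are hypothetical; the formula is the same one used in `idf(n, df)` above.

```python
import math

# Toy term-frequency matrix: 2 documents by 4 terms (hypothetical values).
tf = [
    [1, 1, 0, 1],
    [0, 0, 1, 2],
]

n_docs = len(tf)
n_terms = len(tf[0])

# Document frequency: the number of documents in which each term occurs.
df = [sum(1 for row in tf if row[j] > 0) for j in range(n_terms)]

# Smoothed idf, the same formula as idf(n, df) above.
idf_values = [math.log((n_docs + 1.0) / (d + 1.0)) + 1 for d in df]

print(df)
print([round(value, 4) for value in idf_values])
```

Note that document frequency counts documents, not occurrences: the last term occurs twice in the second document but still contributes only one to its df.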

## Get the tf-idf matrix (2 by 5226)

Here the transform method multiplies the tf matrix (2 by 5226) by the diagonal idf matrix (5226 by 5226, with the idf of each term on the main diagonal) and then divides each row of the result by its Euclidean norm.

In [46]:
tfidf_matrix = tfidfTran.transform(tf_matrix)
print (tfidf_matrix.toarray())

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.00253189 0.00253189 0.00253189 ... 0.00253189 0.00253189 0.00253189]]
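The two operations, weighting by idf and then normalizing each row, can be sketched with plain lists. This is an illustrative sketch under stated assumptions, not how scikit-learn implements it internally (it works on sparse matrices); the toy values and the name `tfidf_row` are hypothetical.

```python
import math

# Toy inputs (hypothetical): 2 documents by 4 terms, with one idf per term.
tf = [
    [1, 1, 0, 1],
    [0, 0, 1, 2],
]
idf_values = [1.4055, 1.4055, 1.4055, 1.0]

def tfidf_row(tf_row, idf_values):
    """Weight each count by its idf, then scale the row to unit Euclidean length."""
    weighted = [count * weight for count, weight in zip(tf_row, idf_values)]
    norm = math.sqrt(sum(x * x for x in weighted))
    # An all-zero row has no direction, so it is returned unchanged.
    return [x / norm for x in weighted] if norm > 0 else weighted

tfidf = [tfidf_row(row, idf_values) for row in tf]
for row in tfidf:
    print([round(x, 4) for x in row])
```

Multiplying element by element with the idf list is equivalent to multiplying by a diagonal matrix that has the idf values on its main diagonal.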


## Get the pairwise similarity matrix (n by n):

In [47]:
cos_similarity_matrix = (tfidf_matrix * tfidf_matrix.T).toarray()
print (cos_similarity_matrix)

[[1.         0.41204267]
 [0.41204267 1.        ]]


The tf-idf matrix obtained in the last step is multiplied by its transpose. Because every row has unit Euclidean length, the result is the cosine similarity matrix: the diagonal entries are 1, and the off-diagonal entry shows that d1 and d2 have a similarity of about 0.41.
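The reason a plain matrix product yields cosine similarities can be checked on a small example: once the rows are scaled to unit length, their dot product equals the cosine of the angle between the original rows. The vectors and helper names below are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(row):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(dot(row, row))
    return [x / norm for x in row]

def cosine(u, v):
    """Cosine similarity computed directly from its definition."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Toy weighted rows (hypothetical values).
rows = [[1.0, 2.0, 0.0], [2.0, 1.0, 1.0]]
unit_rows = [l2_normalize(row) for row in rows]

# Product of the normalized matrix with its transpose.
similarity = [[dot(u, v) for v in unit_rows] for u in unit_rows]
for line in similarity:
    print([round(x, 4) for x in line])
```

The resulting matrix is symmetric with ones on the diagonal, the same pattern as the output above.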

## Use TfidfVectorizer instead

Scikit-learn also provides the class TfidfVectorizer, which combines the work of CountVectorizer and TfidfTransformer in a single step and so streamlines the process.

In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
def cos_similarity(textlist):
    tfidf = TfidfVec.fit_transform(textlist)
    return (tfidf * tfidf.T).toarray()
cos_similarity(documents)
#greece usa philippines

array([[1.        , 0.41204267],
       [0.41204267, 1.        ]])

which returns the same result.