In [54]:
import math
import numpy as np

# TF-IDF

Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. 

The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

In [None]:
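import nltk

nltk.download('stopwords')
# Tokenizer models used by word_tokenize below
# (newer NLTK releases may ask for 'punkt_tab' instead).
nltk.download('punkt')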

### 1. Term Frequency

**TF: Term Frequency** measures how frequently a term occurs in a document. Because documents differ in length, a term is likely to appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (i.e., the total number of terms in the document) as a way of normalizing:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
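
To make the formula concrete before turning to the real corpus, here is a minimal sketch on a made-up six-token document. The helper name `toy_term_frequency` and the sample sentence are assumptions for illustration only, and tokens are split on whitespace rather than with NLTK.

```python
from collections import Counter

def toy_term_frequency(tokens, term):
    """Return count(term) / len(tokens); 0.0 for an empty token list."""
    if not tokens:
        return 0.0
    # Compare case-insensitively, as the notebook does with word.lower().
    counts = Counter(token.lower() for token in tokens)
    return counts[term.lower()] / len(tokens)

# Hypothetical document, split on whitespace for simplicity.
toy_tokens = "the cat sat on the mat".split()
tf_the = toy_term_frequency(toy_tokens, "the")
print(tf_the)  # "the" occurs 2 times out of 6 tokens
```

Dividing by the token count is what keeps a long document from dominating simply because it contains more words.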

In [7]:
document1 = open("corpus/story1.txt").read()

a) Print the contents of document1 to see what it contains.

In [None]:
document1

b) What is the Term Frequency of the word **allegations**?

In [33]:
from nltk import word_tokenize

words = word_tokenize(document1)
number_of_words = len(words)
print("Total number of words in document 1 is", number_of_words)

frequency_of_word = sum([1 for word in words if word.lower() == "allegations"])
print("Frequency of the word Allegations is",frequency_of_word)

term_frequency = frequency_of_word / number_of_words
term_frequency

Total number of words in document 1 is 905
Frequency of the word Allegations is 10


0.011049723756906077

c) What is the Term Frequency of the word **allegations** after you remove the stop words?

In [35]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

words = word_tokenize(document1)
words = [w for w in words if w not in stop_words]
number_of_words = len(words)
print("Total number of words in document 1 is", number_of_words)

frequency_of_word = sum([1 for word in words if word.lower() == "allegations"])
print("Frequency of the word Allegations is", frequency_of_word)

term_frequency = frequency_of_word / number_of_words
term_frequency

Total number of words in document 1 is 593
Frequency of the word Allegations is 10


0.016863406408094434
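
One caveat about the filter above: NLTK's English stop-word list is all lowercase, and `word_tokenize` emits punctuation as separate tokens, so a capitalized "The" and every comma still count toward the document length. The sketch below shows the difference on a toy token list. The tiny stop list and the tokens are hypothetical, and whether to normalize this way is a modelling choice, not something the exercise prescribes.

```python
# Hypothetical, tiny stop list used only for this sketch; the notebook
# uses NLTK's English list instead.
toy_stop_words = {"the", "of", "is", "a"}

toy_tokens = ["The", "allegations", "of", "fraud", ",", "the", "report", "."]

# Case-sensitive membership test, as in the cell above: "The" survives,
# and so do the punctuation tokens.
case_sensitive = [t for t in toy_tokens if t not in toy_stop_words]

# Lowercase before the membership test and keep only alphabetic tokens.
normalized = [t.lower() for t in toy_tokens
              if t.isalpha() and t.lower() not in toy_stop_words]

print(len(case_sensitive), case_sensitive)
print(len(normalized), normalized)
```

With the normalized list the denominator shrinks from 6 to 3, so the term frequency of "allegations" in this toy example doubles.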

d) What is the Term Frequency of the word **allegations** in the other two documents?

In [38]:
document2 = open("corpus/story2.txt").read()
document3 = open("corpus/story3.txt").read()

In [39]:
def term_frequency(document, term):
    stop_words = set(stopwords.words('english'))

    words = word_tokenize(document)
    words = [w for w in words if w not in stop_words]
    number_of_words = len(words)
    print("Total number of words in document 1 is", number_of_words)

    frequency_of_word = sum([1 for word in words if word.lower() == term])
    print("Frequency of the word", term, "is", frequency_of_word)

    # Normalize the raw count by the document length.
    return frequency_of_word / number_of_words

In [40]:
term_frequency(document2, "allegations")

Total number of words in document 1 is 290
Frequency of the word allegations is 1


0.0034482758620689655

In [41]:
term_frequency(document3, "allegations")

Total number of words in document 1 is 247
Frequency of the word allegations is 0


0.0

e) What do you learn from the Term Frequencies of the other two documents?

### Inverse Document Frequency
**IDF: Inverse Document Frequency** measures how important a term is. When computing TF, all terms are treated as equally important. However, certain terms such as "is", "of", and "that" may appear many times yet carry little importance. We therefore need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

IDF(t) = log(Total number of documents / Number of documents with term t in it).
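
As a quick illustration of the formula, the sketch below computes IDF over a hypothetical three-document corpus in which each document is represented as a set of tokens. The helper `toy_idf` and the corpus are assumptions made up for this example. Note that the plain formula is undefined when no document contains the term, so the helper raises an error in that case.

```python
import math

def toy_idf(documents, term):
    """Return log(N / df), where df is the number of documents with the term."""
    n_containing = sum(1 for tokens in documents if term in tokens)
    if n_containing == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(len(documents) / n_containing)

# Hypothetical corpus: each document is a set of its tokens.
toy_corpus = [
    {"the", "allegations", "were", "denied"},
    {"the", "allegations", "resurfaced"},
    {"the", "weather", "was", "fine"},
]

print(toy_idf(toy_corpus, "the"))          # in all 3 documents: log(3/3)
print(toy_idf(toy_corpus, "allegations"))  # in 2 documents: log(3/2)
print(toy_idf(toy_corpus, "weather"))      # in 1 document: log(3/1)
```

A term found in every document scores zero, and the score grows as the term becomes rarer across the corpus.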

a) What is the IDF of the term **allegations**?

In [47]:
# How many documents are there in total?
all_docs = 3
# How many documents contain the word allegations? (document1 and document2)
docs_with_the_term_allegations = 2

idf_of_the_term_allegations = math.log(all_docs / docs_with_the_term_allegations)
idf_of_the_term_allegations

0.4054651081081644

### TF-IDF

Now calculate the TF-IDF weight for the term **allegations** with respect to document1.

TF-IDF is the product of tf and idf.

In [49]:
tfidf = term_frequency(document1, "allegations") * idf_of_the_term_allegations
round(tfidf, 6)

Total number of words in document 1 is 593
Frequency of the word allegations is 10


0.006838
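
To see how the product behaves across documents, here is a self-contained sketch on a hypothetical three-document corpus. The function `toy_tf_idf` and the sample sentences are assumptions for illustration; the function uses the plain TF and IDF definitions given earlier and returns 0.0 when a term occurs nowhere.

```python
import math

def toy_tf_idf(documents, doc_index, term):
    """TF of `term` in one document times its IDF over all documents."""
    tokens = documents[doc_index]
    tf = tokens.count(term) / len(tokens)
    n_containing = sum(1 for doc in documents if term in doc)
    # A term absent from the whole corpus gets weight 0.0 by convention here.
    idf = math.log(len(documents) / n_containing) if n_containing else 0.0
    return tf * idf

# Hypothetical corpus: each document is a list of whitespace-split tokens.
toy_docs = [
    "the allegations were serious allegations".split(),
    "the allegations were denied".split(),
    "the weather was fine".split(),
]

for i in range(len(toy_docs)):
    print(i,
          toy_tf_idf(toy_docs, i, "allegations"),
          toy_tf_idf(toy_docs, i, "the"))
```

The word "the" has a non-zero term frequency in every document but an IDF of zero, so its weight vanishes everywhere, while "allegations" scores highest in the document that uses it most.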

What are the top 5 distinguishing words in the corpus?

In [59]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpora = [document1, document2, document3]

vectorizer = TfidfVectorizer()
vectorizer.fit_transform(corpora)
indices = np.argsort(vectorizer.idf_)[::-1]
# get_feature_names() was removed in scikit-learn 1.2.
features = vectorizer.get_feature_names_out()


top_n = 5
top_features = [features[i] for i in indices[:top_n]]
print(top_features)

['zone', 'georges', 'follow', 'following', 'fondling']
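
A note on reading this result: sorting by `idf_` alone ranks terms only by how few documents contain them, so in a three-document corpus every word that appears in exactly one document ties for first place and the order among them is arbitrary. The dependency-free sketch below illustrates the tie and one possible alternative, ranking terms inside each document by TF times IDF. The toy corpus and the helper `top_terms` are hypothetical. The IDF uses the smoothed form documented for scikit-learn's defaults, but the sketch uses length-normalized TF and skips the L2 normalization that `TfidfVectorizer` applies, so the numbers will not match scikit-learn's.

```python
import math
from collections import Counter

# Hypothetical corpus of three tokenized documents.
toy_docs = [
    "court hears allegations allegations against minister".split(),
    "minister denies allegations in court".split(),
    "storm closes northern roads".split(),
]

n_docs = len(toy_docs)
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

# Every term that occurs in exactly one document ties for the highest IDF.
max_idf = max(idf.values())
tied = sorted(t for t, value in idf.items() if value == max_idf)
print("tied for highest IDF:", tied)

def top_terms(doc, k):
    """Rank the terms of one document by TF * IDF, ties broken alphabetically."""
    counts = Counter(doc)
    scores = {t: (c / len(doc)) * idf[t] for t, c in counts.items()}
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

for i, doc in enumerate(toy_docs):
    print(i, top_terms(doc, 3))
```

In the first toy document "allegations" ranks first even though its IDF is lower than that of "hears" or "against", because it is repeated within that document.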


## TOPIC MODELLING


In [53]:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import NMF, LatentDirichletAllocation

def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]])
            )

dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

no_features = 1000

# NMF is able to use tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2.
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

# LDA is a probabilistic graphical model over word counts, so it is fed raw term counts
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tf = tf_vectorizer.fit_transform(documents)
tf_feature_names = tf_vectorizer.get_feature_names_out()

no_topics = 20

# Run NMF
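# scikit-learn 1.0 replaced `alpha` with `alpha_W`/`alpha_H`; the regularization
# behaves differently, so the topics may not match the output shown below.
nmf = NMF(n_components=no_topics, random_state=1, alpha_W=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)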

# Run LDA
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0).fit(tf)

no_top_words = 10
display_topics(nmf, tfidf_feature_names, no_top_words)
display_topics(lda, tf_feature_names, no_top_words)



Topic 0:
people time right did good said say make way government
Topic 1:
window problem using server application screen display motif manager running
Topic 2:
god jesus bible christ faith believe christian christians sin church
Topic 3:
game team year games season players play hockey win league
Topic 4:
new 00 sale 10 price offer shipping condition 20 15
Topic 5:
thanks mail advance hi looking info help information address appreciated
Topic 6:
windows file files dos program version ftp ms directory running
Topic 7:
edu soon cs university ftp internet article email pub david
Topic 8:
key chip clipper encryption keys escrow government public algorithm nsa
Topic 9:
drive scsi drives hard disk ide floppy controller cd mac
Topic 10:
just ll thought tell oh little fine work wanted mean
Topic 11:
does know anybody mean work say doesn help exist program
Topic 12:
card video monitor cards drivers bus vga driver color memory
Topic 13:
like sounds looks look bike sound lot things really thing
To