# Instruction 11 - Text Mining - Solution

Natural language processing (NLP) is a field that focuses on making natural human language usable by computer programs. NLTK, or Natural Language Toolkit, is a Python package that you can use for NLP.

In [1]:
import nltk

## Basic preprocessing steps

In [2]:
#for sentence tokenization, word tokenization
from nltk.tokenize import sent_tokenize, word_tokenize

#for tokenization and punctuation removal 
from nltk.tokenize import RegexpTokenizer
nltk.download('punkt')

#to filter out stop words
nltk.download("stopwords")
from nltk.corpus import stopwords

#for stemming
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

#for lemmatization
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4') # omw=open multilingual wordnet

#to compute frequency of text units
from nltk.probability import FreqDist

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\bakullari\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bakullari\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\bakullari\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\bakullari\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Open the "sagan.txt" file and display the content.

In [3]:
f = open("sagan.txt")
text = f.read()
print(text)

Look again at that dot. That's here. That's home. That's us. On it everyone you love, everyone you know, everyone you ever heard of, every human being who ever was, lived out their lives. The aggregate of our joy and suffering, thousands of confident religions, ideologies, and economic doctrines, every hunter and forager, every hero and coward, every creator and destroyer of civilization, every king and peasant, every young couple in love, every mother and father, hopeful child, inventor and explorer, every teacher of morals, every corrupt politician, every "superstar," every "supreme leader," every saint and sinner in the history of our species lived there-on a mote of dust suspended in a sunbeam.

The Earth is a very small stage in a vast cosmic arena. Think of the endless cruelties visited by the inhabitants of one corner of this pixel on the scarcely distinguishable inhabitants of some other corner, how frequent their misunderstandings, how eager they are to kill one another, how f

Obtain the list of sentences contained in the text file and save them using variable `sentences`. Display the first five sentences.

In [4]:
sentences = sent_tokenize(text)
print(sentences[:5])

['Look again at that dot.', "That's here.", "That's home.", "That's us.", 'On it everyone you love, everyone you know, everyone you ever heard of, every human being who ever was, lived out their lives.']


Obtain the list of tokens contained in the first sentence.

In [5]:
s1 = sentences[0]
words_s1 = word_tokenize(s1)
print(s1)
print(words_s1)

Look again at that dot.
['Look', 'again', 'at', 'that', 'dot', '.']


Apply tokenization (into words), stopword and punctuation removal, and stemming to the corpus (here: a set of sentences). As a result, the new preprocessed corpus must contain a list of documents, where each document (here: sentence) is a list of tokens.
*Hint: The stop words provided from nltk are all lowercase.*

In [6]:
tokenizer = nltk.RegexpTokenizer(r"\w+")
stop_list = stopwords.words("english")
stemmer = PorterStemmer()

corpus_stem = []

for sentence in sentences:
    tokenized = tokenizer.tokenize(sentence)
    filtered = [word.lower() for word in tokenized if word.lower() not in stop_list]
    stemmed = [stemmer.stem(word) for word in filtered]
    if stemmed != []:
        corpus_stem.append(stemmed)

In [7]:
print(corpus_stem[:5])

[['look', 'dot'], ['home'], ['us'], ['everyon', 'love', 'everyon', 'know', 'everyon', 'ever', 'heard', 'everi', 'human', 'ever', 'live', 'live'], ['aggreg', 'joy', 'suffer', 'thousand', 'confid', 'religion', 'ideolog', 'econom', 'doctrin', 'everi', 'hunter', 'forag', 'everi', 'hero', 'coward', 'everi', 'creator', 'destroy', 'civil', 'everi', 'king', 'peasant', 'everi', 'young', 'coupl', 'love', 'everi', 'mother', 'father', 'hope', 'child', 'inventor', 'explor', 'everi', 'teacher', 'moral', 'everi', 'corrupt', 'politician', 'everi', 'superstar', 'everi', 'suprem', 'leader', 'everi', 'saint', 'sinner', 'histori', 'speci', 'live', 'mote', 'dust', 'suspend', 'sunbeam']]


Repeat the same task, but now use lemmatization instead of stemming.

(WordNet is a lexical database of semantic relations between words in more than 200 languages. WordNet links words into semantic relations including synonyms, hyponyms, and meronyms. More info: https://wordnet.princeton.edu/)

In [8]:
lemmatizer = WordNetLemmatizer()

corpus_lemma = []

for sentence in sentences:
    tokenized = tokenizer.tokenize(sentence)
    filtered = [word.lower() for word in tokenized if word.lower() not in stop_list]
    lemmatized = [lemmatizer.lemmatize(word) for word in filtered]
    if lemmatized != []:
        corpus_lemma.append(lemmatized)

In [9]:
print(corpus_lemma[:5])

[['look', 'dot'], ['home'], ['u'], ['everyone', 'love', 'everyone', 'know', 'everyone', 'ever', 'heard', 'every', 'human', 'ever', 'lived', 'life'], ['aggregate', 'joy', 'suffering', 'thousand', 'confident', 'religion', 'ideology', 'economic', 'doctrine', 'every', 'hunter', 'forager', 'every', 'hero', 'coward', 'every', 'creator', 'destroyer', 'civilization', 'every', 'king', 'peasant', 'every', 'young', 'couple', 'love', 'every', 'mother', 'father', 'hopeful', 'child', 'inventor', 'explorer', 'every', 'teacher', 'moral', 'every', 'corrupt', 'politician', 'every', 'superstar', 'every', 'supreme', 'leader', 'every', 'saint', 'sinner', 'history', 'specie', 'lived', 'mote', 'dust', 'suspended', 'sunbeam']]


Using the new (lemmatized) corpus, show the ten most frequent words.

In [10]:
words = []
for word_list in corpus_lemma:
    w = [word for word in word_list]
    words.extend(w)

fdist = FreqDist(words)
fdist.most_common(10)

[('every', 12),
 ('dot', 3),
 ('everyone', 3),
 ('ever', 3),
 ('earth', 3),
 ('one', 3),
 ('home', 2),
 ('u', 2),
 ('love', 2),
 ('human', 2)]

### Part-of-Speech (tagging)

Part of speech is a grammatical term that deals with the roles words play when you use them together in sentences. Tagging parts of speech, or POS tagging, is the task of labeling the words in your text according to their part of speech.

Here’s how to get a list of tags and their meanings:

In [11]:
nltk.download('tagsets')
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\bakullari\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


Given the quote below, label the words according to their part of speech.

In [12]:
sagan_quote = "If you wish to make an apple pie from scratch, you must first invent the universe."

In [13]:
nltk.download('averaged_perceptron_tagger')

words = word_tokenize(sagan_quote)
pos_tagged = nltk.pos_tag(words)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\bakullari\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [14]:
pos_tagged

[('If', 'IN'),
 ('you', 'PRP'),
 ('wish', 'VBP'),
 ('to', 'TO'),
 ('make', 'VB'),
 ('an', 'DT'),
 ('apple', 'NN'),
 ('pie', 'NN'),
 ('from', 'IN'),
 ('scratch', 'NN'),
 (',', ','),
 ('you', 'PRP'),
 ('must', 'MD'),
 ('first', 'VB'),
 ('invent', 'VB'),
 ('the', 'DT'),
 ('universe', 'NN'),
 ('.', '.')]

Let's try to generate some sentences using the lemmatized corpus we obtained previously as our training data. For this task, we will use the language modeling module of nltk (detailed description: https://www.nltk.org/api/nltk.lm.html)

Build a unigram and a trigram language model using MLE.  For each of the language models, generate a sentence of 7 words.

In [15]:
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE

In [16]:
# the padded_everygram_pipeline goes through the corpus, applies left and right padding to the sentences (adding <s> and </s>)
# and obtains the tuples of a given order together with the vocabulary
padded_tuples, vocab = padded_everygram_pipeline(1, corpus_lemma)
# generate (an empty) ngram language model for some n>0
lm_unigram = MLE(1)
# generate probabilities (model) given the list of n-grams and the vocabulary
lm_unigram.fit(padded_tuples, vocab)

padded_tuples, vocab = padded_everygram_pipeline(3, corpus_lemma)
lm_trigram = MLE(3)
lm_trigram.fit(padded_tuples, vocab)

In [17]:
print('Unigram sentence')
print(lm_unigram.generate(7))

print('Trigram sentence')
print(lm_trigram.generate(7))

Unigram sentence
['ideology', 'deal', 'triumph', 'ever', 'mote', 'love', 'like']
Trigram sentence
['</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']


## Feature Extraction (Bag of Words, Set of Words, Tfidf, Word2vec, Doc2vec)

Feature extraction is about extracting/deriving information from the original features set to create a new features subspace. An example for this from text mining is encoding a piece of text as a numerical vector.

*Not to be confused with feature selection! Feature selection is about selecting a subset of features out of the original features in order to reduce model complexity and reduce generalization error introduced due to noise by irrelevant features. An example for this from text mining is the removal of stop words.*

### Feature extraction method: Bag of Words (using CountVectorizer)

The CountVectorizer provided by sklearn converts a collection of text documents to a matrix of token counts (equivalent to BoW model). Each row is a document and each column correspond to a word in the dictionary. By default, the number of features (dimensions) will be equal to the vocabulary size. The CountVectorizer enables one to apply casefolding, tokenizing and stopword removal at the same time. 

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

Apply the CountVectorizer to the original corpus (here: `sentences`) with the default parameters. Show the shape of the matrix. Interpret each of the tuple entries.

In [19]:
bow = CountVectorizer()
count_matrix = bow.fit_transform(sentences)
count_matrix.shape

(20, 204)

There are 20 sentences and 204 extracted features (words in the vocabulary).

In [20]:
print(count_matrix)

  (0, 108)	1
  (0, 0)	1
  (0, 8)	1
  (0, 171)	1
  (0, 41)	1
  (1, 171)	1
  (1, 75)	1
  (2, 171)	1
  (2, 79)	1
  (3, 171)	1
  (3, 185)	1
  (4, 127)	1
  (4, 93)	1
  (4, 53)	3
  (4, 202)	3
  (4, 109)	1
  (4, 98)	1
  (4, 51)	2
  (4, 73)	1
  (4, 126)	1
  (4, 52)	1
  (4, 82)	1
  (4, 11)	1
  (4, 196)	1
  (4, 192)	1
  :	:
  (19, 41)	1
  (19, 79)	1
  (19, 93)	1
  (19, 51)	1
  (19, 172)	2
  (19, 132)	1
  (19, 3)	2
  (19, 128)	1
  (19, 181)	3
  (19, 4)	1
  (19, 193)	1
  (19, 135)	1
  (19, 129)	1
  (19, 99)	1
  (19, 112)	1
  (19, 183)	1
  (19, 147)	1
  (19, 34)	1
  (19, 118)	1
  (19, 96)	1
  (19, 198)	1
  (19, 144)	1
  (19, 19)	1
  (19, 14)	1
  (19, 188)	1


Obtain the vocabulary of your matrix.

In [21]:
bow.vocabulary_

{'look': 108,
 'again': 0,
 'at': 8,
 'that': 171,
 'dot': 41,
 'here': 75,
 'home': 79,
 'us': 185,
 'on': 127,
 'it': 93,
 'everyone': 53,
 'you': 202,
 'love': 109,
 'know': 98,
 'ever': 51,
 'heard': 73,
 'of': 126,
 'every': 52,
 'human': 82,
 'being': 11,
 'who': 196,
 'was': 192,
 'lived': 105,
 'out': 134,
 'their': 173,
 'lives': 106,
 'the': 172,
 'aggregate': 1,
 'our': 132,
 'joy': 94,
 'and': 3,
 'suffering': 164,
 'thousands': 179,
 'confident': 24,
 'religions': 146,
 'ideologies': 85,
 'economic': 45,
 'doctrines': 40,
 'hunter': 84,
 'forager': 61,
 'hero': 76,
 'coward': 30,
 'creator': 31,
 'destroyer': 37,
 'civilization': 21,
 'king': 97,
 'peasant': 136,
 'young': 203,
 'couple': 29,
 'in': 89,
 'mother': 120,
 'father': 57,
 'hopeful': 80,
 'child': 20,
 'inventor': 91,
 'explorer': 55,
 'teacher': 169,
 'morals': 117,
 'corrupt': 26,
 'politician': 141,
 'superstar': 166,
 'supreme': 167,
 'leader': 100,
 'saint': 150,
 'sinner': 155,
 'history': 78,
 'species':

In [22]:
len(bow.vocabulary_)

204

If we want the CountVectorizer to build a vocabulary (here: features/columns) based only on lemmatized tokens that are not stopwords, we need to specify what kind of preprocessing we want (by deafult, the CountVectorizer method does not apply stemming/lemmatization). Write a function called `my_preprocessor` which given a string, returns another string after tokenization, stopword removal and lemmatization (case for stemming is equivalent) has been applied.

In [23]:
def my_preprocessor(text):
    
    tokenized = tokenizer.tokenize(text)
    filtered = [word.lower() for word in tokenized if word.lower() not in stop_list]
    lemmatized = [lemmatizer.lemmatize(word) for word in filtered]
    return ' '.join(lemmatized)

Apply the CountVectorizer to the original corpus (here: `sentences`) with using your own preprocessor `my_preprocessor`. I.e., set the parameter value `preprocessor=my_preprocessor`.

In [24]:
bow_2 = CountVectorizer(preprocessor=my_preprocessor)
corpus_vectors = bow_2.fit(sentences) 

Extract the vocabularies/features of the two BoW models you created (one with the deafult parameters and one using your own preprocessor). What do you notice?

In [25]:
# use method get_feature_names_out() on the countVectorizer model to access the feature set.

In [26]:
print(bow.get_feature_names_out())
print(bow_2.get_feature_names_out())

['again' 'aggregate' 'all' 'and' 'another' 'are' 'arena' 'astronomy' 'at'
 'become' 'been' 'being' 'better' 'blood' 'blue' 'building' 'by'
 'challenged' 'character' 'cherish' 'child' 'civilization' 'come'
 'conceits' 'confident' 'corner' 'corrupt' 'cosmic' 'could' 'couple'
 'coward' 'creator' 'cruelties' 'dark' 'deal' 'delusion' 'demonstration'
 'destroyer' 'distant' 'distinguishable' 'doctrines' 'dot' 'dust' 'eager'
 'earth' 'economic' 'else' 'elsewhere' 'emperors' 'endless' 'enveloping'
 'ever' 'every' 'everyone' 'experience' 'explorer' 'far' 'father'
 'fervent' 'folly' 'for' 'forager' 'fraction' 'frequent' 'from' 'future'
 'generals' 'glory' 'great' 'harbor' 'has' 'hatreds' 'have' 'heard' 'help'
 'here' 'hero' 'hint' 'history' 'home' 'hopeful' 'how' 'human' 'humbling'
 'hunter' 'ideologies' 'image' 'imagined' 'importance' 'in' 'inhabitants'
 'inventor' 'is' 'it' 'joy' 'kill' 'kindly' 'king' 'know' 'known' 'leader'
 'least' 'life' 'light' 'like' 'lived' 'lives' 'lonely' 'look' 'love'

In [27]:
print(len(bow.get_feature_names_out()))
print(len(bow_2.get_feature_names_out()))

204
152


The encoding using our own preprocessing steps has a smaller vocabulary / fewer dimensions. This makes sense since our own preprocessing method additionally applies lemmatization, which may map different raw words onto the same lemma.

Use either each the two BoW models to show the encoding of the fifth sentence.

In [28]:
s = sentences[4]
print('Original sentence: ', s)
print('BoW encoding without lemmatization:', bow.transform([s]).toarray())
print('BoW encoding with lemmatization:', bow_2.transform([s]).toarray())

Original sentence:  On it everyone you love, everyone you know, everyone you ever heard of, every human being who ever was, lived out their lives.
BoW encoding without lemmatization: [[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 0
  0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 3 0]]
BoW encoding with lemmatization: [[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 2 1 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
  0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0]]


### Feature extraction method: Set of Words (using CountVectorizor)

We want to use the CountVectorizer method to encode our text. This time, for each word in the vocabulary, we are only interested in whether the word appears or not in a given document. The frequency is not important. This encoding is also called Set of Words. Find the parameter in the CountVectorizer method that produces such a 0/1 encoding. Set the preprocessor parameter to the preprocessor defined above.

In [29]:
sow = CountVectorizer(binary=True, preprocessor=my_preprocessor)
corpus_vectors = sow.fit(sentences)

Show the encoding of the fifth sentence in the text using the Set of Words model. What do you notice?

In [30]:
s = sentences[4]
print('Original sentence: ', s)
print('SoW encoding:', sow.transform([s]).toarray())

Original sentence:  On it everyone you love, everyone you know, everyone you ever heard of, every human being who ever was, lived out their lives.
SoW encoding: [[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
  0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0]]


All numbers are 0/1. 

### Feature extraction method: Tf-idf scores (using TfidfVectorizer)

We want to use the TfidfVectorizer method to encode our text. This time, for each word in the vocabulary, we are interested in the tf-idf score of the word in a given document. Set the preprocessor parameter to the preprocessor defined before.

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [32]:
tfidf = TfidfVectorizer(preprocessor=my_preprocessor)
corpus_vectors = tfidf.fit_transform(sentences)

Show the encoding of the fifth sentence is the text using the TfIdf model. What do you notice?

In [33]:
s = sentences[4]
print('Original sentence: ', s)
print('SoW encoding:', tfidf.transform([s]).toarray())

Original sentence:  On it everyone you love, everyone you know, everyone you ever heard of, every human being who ever was, lived out their lives.
SoW encoding: [[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.41490215 0.20745108 0.70801182 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.23600394 0.         0.         0.         0.
  0.         0.         0.20745108 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.23600394 0.         0.

All numbers are between 0 and 1, as expected for the range of the tf-idf scores.

### Feature extraction method: Word2Vec

Word2vec is one of the most popular techniques to learn word embeddings using a two-layer neural network. The input layer contains the individual words which we want to compress, the output layer can be its surrounding context. The embedding maps the words into a vector space which has a dimension smaller than the size of the vocabulary. A well-trained set of word vectors will place similar words close to each other in that (vector) space. 

The input of the Gensim word2vec is a text corpus and its output is a set of vectors. More specifically, genism word2vec requires a format of ‘list of lists’ for training where every document is contained in a list and every inner list contains lists of tokens of that document. 

Create a word2vec embedding for the lemmatized corpus such that the compressed vector has a dimension of 20 and every word that appears at least once in the corpus must be considered. Use the skip-gram algorithm and consider all surrounding words of distance less or equal 3.

In [34]:
# run this cell to install the gensim package
!pip install gensim



In [35]:
from gensim.models import Word2Vec

In [36]:
# vector_size: The number of dimensions of the embeddings and the default is 100.
# window: The maximum distance between a target word and words around the target word. The default window is 5.
# min_count: The minimum count of words to consider when training the model; words with occurrence less than this count will be ignored. The default for min_count is 5.
# sg: The training algorithm, either CBOW(0) or skip gram(1). The default training algorithm is CBOW.

w2v = Word2Vec(corpus_lemma, min_count=1, vector_size=20, window=3, sg=1)

Obtain the embeddings of the words: future, blue, universe.

In [37]:
word_vectors = w2v.wv
print('future:', word_vectors['future'])
print('blue:', word_vectors['blue'])
print('universe:', word_vectors['universe'])

future: [-0.01134635 -0.02719816  0.03814759  0.03324739 -0.02231137  0.01154214
 -0.02944485  0.00161118  0.04687099 -0.0130652  -0.02552955 -0.03749329
 -0.01477422 -0.00420845  0.01746522  0.0486251  -0.01622649  0.00980703
  0.04809051  0.00712067]
blue: [-0.02614041 -0.03331963 -0.03823137  0.04199896 -0.01039955 -0.03447679
 -0.02071675  0.02617425 -0.01466679 -0.01884807  0.00858427 -0.01437992
 -0.00827299  0.00570001 -0.01502686  0.04251807  0.01988973 -0.0494634
  0.03117746 -0.03388793]
universe: [-0.02200732 -0.04642254 -0.0087069  -0.01776312  0.04506843  0.01419324
 -0.02991859 -0.01620416 -0.0500894   0.00952429 -0.01948267 -0.01419335
  0.02536843  0.03787052  0.02165071 -0.03425599  0.03424682 -0.04766918
 -0.03521582 -0.04028212]


Which words are the most similar ones to the words future, blue and universe?

In [38]:
future_similar = word_vectors.similar_by_word("future")
most_similar_key, similarity = future_similar[0]  # look at the first match
print("for future:", most_similar_key)

blue_similar = word_vectors.similar_by_word("blue")
most_similar_key, similarity = blue_similar[0]  # look at the first match
print("for blue:", most_similar_key)

universe_similar = word_vectors.similar_by_word("universe")
most_similar_key, similarity = universe_similar[0]  # look at the first match
print("for universe:", most_similar_key)

for future: astronomy
for blue: pixel
for universe: corner


### Feature extraction method: Doc2Vec

Import the Fake.csv and the True.csv files into two separate dataframes. Those files contain news that were published that were either fake or real. Display the first few lines of those dataframes.

In [39]:
import pandas as pd

fake_news = pd.read_csv("Fake.csv")
real_news = pd.read_csv("True.csv")

fake_news.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [40]:
real_news.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


For each of those dataframes, remove the "subject" and the "date" columns. For each of the two dataframes, add a new column named "label", and set its value to "real" for the real news dataframe, and to "fake" for the fake news dataframe. Display the first few lines of the modified dataframes.

In [41]:
fake_news.drop(['subject', 'date'], axis=1, inplace=True)
fake_news['label'] = 'fake'
real_news.drop(['subject', 'date'], axis=1, inplace=True)
real_news['label'] = 'real'

In [42]:
fake_news.head()

Unnamed: 0,title,text,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,fake
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,fake
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",fake
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",fake
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,fake


In [43]:
real_news.head()

Unnamed: 0,title,text,label
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,real
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,real
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,real
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,real
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,real


Concatenate (merge) the two dataframes into a single dataframe named `news_df`.

In [44]:
frames = [fake_news, real_news]
news_df = pd.concat(frames)

news_df

Unnamed: 0,title,text,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,fake
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,fake
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",fake
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",fake
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,fake
...,...,...,...
21412,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,real
21413,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",real
21414,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,real
21415,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,real


Create a corpus named `corpus` containing a list of all documents. Each document is a text field under "text" from the dataframe `news_df`.

In [45]:
news_df.head()

Unnamed: 0,title,text,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,fake
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,fake
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",fake
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",fake
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,fake


In [46]:
corpus =  news_df['text'].tolist()

Create a preprocessed version of the corpus and save it under `corpus_p`. This corpus should be a list of documents. Each document is a list of terms. You must get to this list of terms after applying your previously defined preprocessor function to the document. Display some document contained in the preprocessed corpus.

In [47]:
corpus_p = []
for doc in corpus:
    doc_p = my_preprocessor(doc)
    corpus_p.append(doc_p)

In [48]:
corpus_p[0]

'donald trump wish american happy new year leave instead give shout enemy hater dishonest fake news medium former reality show star one job country rapidly grows stronger smarter want wish friend supporter enemy hater even dishonest fake news medium happy healthy new year president angry pant tweeted 2018 great year america country rapidly grows stronger smarter want wish friend supporter enemy hater even dishonest fake news medium happy healthy new year 2018 great year america donald j trump realdonaldtrump december 31 2017trump tweet went welll expect kind president sends new year greeting like despicable petty infantile gibberish trump lack decency even allow rise gutter long enough wish american citizen happy new year bishop talbert swan talbertswan december 31 2017no one like calvin calvinstowell december 31 2017your impeachment would make 2018 great year america also accept regaining control congress miranda yaver mirandayaver december 31 2017do hear talk include many people hate

Create a doc2vec embedding of the documents in news_df with a vector size of 50. Consider all words that appear at least 10 times in the corpus.

In [49]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [50]:
# create TaggedDocument
# note how the tag of each document is some unique identifier for the document, 
# for simplicity, we set it to the doc's position in the corpus
docs = [TaggedDocument(doc, [i]) for i, doc in enumerate(corpus_p)]

import multiprocessing
cores = multiprocessing.cpu_count()

# determining parameters of the model   
doc2vec = Doc2Vec(vector_size=50, min_count=10, workers=cores)
# building the vocabulary    
doc2vec.build_vocab(docs)


# document embedding, create the embedding based on all documents in the corpus
doc2vec.train(corpus_iterable=docs, total_examples=doc2vec.corpus_count, epochs=100)

Display the vector corresponding to the first document. Find its most similar document w.r.t. the cosine similarity and display the original text corresponding to those two docs.

In [51]:
d1= corpus_p[0]
# infer_vector requires the sentence (=document) to be passed as a list of tokens
d1_tokens = tokenizer.tokenize(d1)
# print(d1_tokens)
d1_embedding = doc2vec.infer_vector(d1_tokens)
most_similar_docs = doc2vec.dv.most_similar(d1_embedding) #gives you top 10 document tags and their cosine similarity
tag = most_similar_docs[0][0] #obtain the tag of the first document'

In [52]:
d1_original = news_df.iloc[0]['text']
d_similar_original = news_df.iloc[tag]['text']

In [53]:
print(d1_original)
print("\n")
print(d_similar_original)

Donald Trump just couldn t wish all Americans a Happy New Year and leave it at that. Instead, he had to give a shout out to his enemies, haters and  the very dishonest fake news media.  The former reality show star had just one job to do and he couldn t do it. As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year,  President Angry Pants tweeted.  2018 will be a great year for America! As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year. 2018 will be a great year for America!  Donald J. Trump (@realDonaldTrump) December 31, 2017Trump s tweet went down about as welll as you d expect.What kind of president sends a New Year s greeting like this despicable, petty, infantile gibberish? Only Trump! His lack of decency won t eve

## Classification Tasks

Split the corpus from the previous task into training (80%) and test (20%) data preserving the distribution based on the "label". 

In [54]:
from sklearn.model_selection import train_test_split

In [55]:
target = news_df['label'].tolist()

In [56]:
corpus_train, corpus_test, y_train, y_test = train_test_split(corpus, target, test_size=0.2, stratify=target)

### Based on Doc2Vec embedding 

For each of the corpora, create their preprocessed versions using your preprocessor defined above. Save them under `corpus_train_p` and `corpus_test_p`. Create a doc2vec model based only on the documents in `corpus_train_p`. Set the vector dimension to 50 and min_count to 10. 

In [57]:
corpus_train_p = []
for doc in corpus_train:
    doc_p = my_preprocessor(doc)
    corpus_train_p.append(doc_p)

In [58]:
corpus_test_p = []
for doc in corpus_test:
    doc_p = my_preprocessor(doc)
    corpus_test_p.append(doc_p)

In [59]:
# create TaggedDocument
docs_train = [TaggedDocument(doc, [i]) for i, doc in enumerate(corpus_train_p)]

import multiprocessing
cores = multiprocessing.cpu_count()

# determining parameters of the model   
doc2vec_from_train = Doc2Vec(vector_size=50, min_count=10, workers=cores)
# building the vocabulary    
doc2vec_from_train.build_vocab(docs_train)


# document embedding, train the model on the training examples
doc2vec_from_train.train(corpus_iterable=docs_train, total_examples=doc2vec_from_train.corpus_count, epochs=100)

Use the embedding of the training corpus to train a logistic regression classifier with the label as target feature. Use the classifier to predict the label of the documents in the test corpus and show its accuracy both for the training corpus and the test corpus. 
<i>Hint: You can assess the encoding of each document using the "infer_vector" method. <b> Don't forget to train and test your classifier on the embeddings of the preprocessed documents (and not on the corpus itself). </b> </i>

In [60]:
X_train = [doc2vec_from_train.infer_vector(tokenizer.tokenize(doc)) for doc in corpus_train_p]
X_test = [doc2vec_from_train.infer_vector(tokenizer.tokenize(doc)) for doc in corpus_test_p]

In [61]:
#classifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

clf_doc2vec = LogisticRegression(max_iter=1000)
clf_doc2vec.fit(X_train, y_train)
y_train_pred = clf_doc2vec.predict(X_train)
y_test_pred = clf_doc2vec.predict(X_test)

accuracy_doc2vec_train = accuracy_score(y_train, y_train_pred, normalize=True)
print('Classifier accuracy on training set:', accuracy_doc2vec_train)

accuracy_doc2vec_test = accuracy_score(y_test, y_test_pred, normalize=True)
print('Classifier accuracy on test set:', accuracy_doc2vec_test)

Classifier accuracy on training set: 0.667993763572582
Classifier accuracy on test set: 0.6760579064587974


### Based on BoW encoding

Create a Bag of Words model based only on the documents in the training data. Use the previously defined preprocessor as preprocessor. 

In [62]:
bow_from_train = CountVectorizer(preprocessor=my_preprocessor)
bow_from_train.fit(corpus_train) 

X_train = bow_from_train.transform(corpus_train)
X_test = bow_from_train.transform(corpus_test)

Use the BoW encoding of the training corpus to train a logistic regression classifier with the label as target feature. Use the classifier to predict the label of the documents in the test corpus and show its accuracy both for the training corpus and the test corpus. 

In [63]:
clf_bow = LogisticRegression(max_iter=1000)
clf_bow.fit(X_train, y_train)
y_train_pred = clf_bow.predict(X_train)
y_test_pred = clf_bow.predict(X_test)

accuracy_bow_train = accuracy_score(y_train, y_train_pred, normalize=True)
print('Classifier accuracy on training set:', accuracy_bow_train)

accuracy_bow_test = accuracy_score(y_test, y_test_pred, normalize=True)
print('Classifier accuracy on test set:', accuracy_bow_test)

Classifier accuracy on training set: 0.9999721588061696
Classifier accuracy on test set: 0.9967706013363029


<b>Important:</b>
- We split the data into training and test data once and used that same partition for both the Doc2Vec based classifier and BoW based classifier. This is important if we want to make claims comparing how the classifier performs on one feature extraction method compared to the other.
- The feature extraction was computed based only on the training corpus. This is important if we want to really test the classifier on independent data and reason about how "well" the embedding generalizes even for unseen text. 