1. Read *Wikipedia.csv* as written by the Wikipedia class.
2. Tokenize the text, and save tokens into *df_token*.
3. Build similarity matrix and write this to file *Wikipedia-sims.csv* with columns:  
    id : *article id*  
    similar_articles : *list of article ids*
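
The two-column output format from step 3 could be written along these lines. This is only a minimal sketch: the helper name `write_sims_csv`, the made-up similarity lists, and the choice to store each id list as a JSON string are assumptions, not the code that produced *Wikipedia-sims.csv*.

```python
import csv
import io
import json

def write_sims_csv(sims, handle):
    """Write one row per article: its id and the ids of its most similar articles."""
    writer = csv.writer(handle)
    writer.writerow(["id", "similar_articles"])
    for article_id, similar_ids in sims.items():
        # Store the list as a JSON string so it survives the round trip through CSV.
        writer.writerow([article_id, json.dumps(similar_ids)])

# Hypothetical similarity lists keyed by article id (illustration only).
example_sims = {43590845: [42614454, 49735416], 42614454: [43590845]}

buffer = io.StringIO()  # stands in for open('Wikipedia-sims.csv', 'w', newline='')
write_sims_csv(example_sims, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Reading the file back then only needs `json.loads` on the second column to recover the list.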


In [1]:
import sys, os
import pandas as pd

# Read data

Read .csv files and make a pandas dataframe.

In [10]:
actual_dir=os.getcwd()
os.chdir('../data')
csvfile1 = 'Wikipedia-dog.csv'
csvfile2 = 'Wikipedia-fish.csv'

df=pd.read_csv(csvfile1,index_col=0)
df2=pd.read_csv(csvfile2,index_col=0)
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df=pd.concat([df,df2])
del df2

df=df[df['page_type']=='article']

df.tail(5)


Unnamed: 0,website,title,page_type,text_raw
43590845,https://en.wikipedia.org/wiki/Whale_feces,Whale feces,article,"[[File:WhalePump.jpg|thumb|400px|right|""Whale ..."
42614454,https://en.wikipedia.org/wiki/Whale_watching_i...,Whale watching in New Zealand,article,{{Use dmy dates|date=May 2017}}\n[[File:Whale ...
49735416,https://en.wikipedia.org/wiki/Tail_sailing,Tail sailing,article,[[File:Southern right whale4.jpg|thumb|upright...
52243894,https://en.wikipedia.org/wiki/Bubble_net_feeding,Bubble net feeding,article,{{main|Cetacean surfacing behaviour}}\n'''Bubb...
53720250,https://en.wikipedia.org/wiki/Ethelbert_(whale),Ethelbert (whale),article,{{no footnotes|date=June 2017}}\n\n'''Ethelber...


Now let's use nltk to tokenize and clean up the text.

# Tokenizer

Find a list of tokens from raw text for each article

In [3]:
import nltk
tokenizer = nltk.RegexpTokenizer(r'\w+')

# Convert to lower case (.str.lower() returns a new Series, so df itself is not modified):
df_token=df['text_raw'].str.lower()

# Tokenize
df_token=df_token.apply(tokenizer.tokenize)
df_token.head()

1467938     [about, shelter, for, dogs, and, cats, for, th...
275388      [cynology, ipac, en, s, ᵻ, ˈ, n, ɒ, l, ə, dʒ, ...
2352562     [other, uses, wolfpack, disambiguation, image,...
17021807    [for, a, list, of, rare, dog, breeds, category...
20777185    [refimprove, date, december, 2008, originalres...
Name: text_raw, dtype: object
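
To see what the `\w+` pattern does to wiki markup, here is a small sketch that uses only the standard library's `re` module instead of `nltk.RegexpTokenizer`. The helper name `tokenize` and the sample string are assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lower-case the text and return every run of word characters."""
    return re.findall(r"\w+", text.lower())

# Illustrative snippet in the style of the raw article text above.
sample = "[[File:WhalePump.jpg|thumb|400px|right]] '''Whale feces''' is ..."
tokens = tokenize(sample)
print(tokens)
```

Punctuation and markup brackets disappear, but markup words such as `file`, `thumb` and `400px` survive as tokens, which is why the token lists above contain entries like `refimprove` and `date`.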

In [4]:
from gensim import corpora, models

# gensim dictionary
https://radimrehurek.com/gensim/corpora/dictionary.html

* compactify()


Assign new word ids to all words.

This is done to make the ids more compact, e.g. after some tokens have been removed via filter_tokens() and there are gaps in the id series. Calling this method will remove the gaps.

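* dfs

Document frequencies: a dict that maps each token id to the number of documents containing that token. Note that dfs is an attribute, not a method.

* Download the stop words by running:

nltk.download('stopwords')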
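
The document-frequency filter and the id compaction described above can be mimicked on a toy corpus. This is a plain-Python sketch of the idea only, not gensim's implementation; the three toy documents and the variable names are assumptions.

```python
from collections import Counter

docs = [["dog", "barks", "loudly"], ["fish", "swims"], ["dog", "swims", "loudly"]]

# Assign ids in order of first appearance (a simplification).
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# Document frequency: count each token once per document it appears in.
dfs = Counter(token2id[token] for doc in docs for token in set(doc))

# Drop tokens seen in fewer than min_count documents; surviving ids now have gaps.
min_count = 2
kept = {tok: i for tok, i in token2id.items() if dfs[i] >= min_count}

# "Compactify": renumber the surviving ids 0..n-1, preserving their order.
ordered = sorted(kept.items(), key=lambda item: item[1])
compact = {tok: new_id for new_id, (tok, _) in enumerate(ordered)}

print(kept)
print(compact)
```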

In [5]:
# make gensim dictionary

line_list = df_token.values
dictionary = corpora.Dictionary(line_list)
# dictionary: 0: "about", 1:"shelter",...

# filter dictionary to remove stopwords and words that appear in fewer than min_count documents
# needs the nltk stop word list: nltk.download('stopwords') is enough; a full nltk.download() put 3 GB into C:\Users\melanie\AppData\Roaming\nltk_data
stop_words = nltk.corpus.stopwords.words('english')
print("Stop words: {}\n".format(stop_words[:5]))

stop_ids = [dictionary.token2id[word] for word in stop_words
            if word in dictionary.token2id]
min_count = 2
# dictionary.dfs maps token id -> number of documents containing the token
rare_ids = [token_id for token_id, freq in dictionary.dfs.items()
            if freq < min_count]
dictionary.filter_tokens(stop_ids + rare_ids)
print("Dictionary after filtering:")
print([(key,dictionary[key]) for key in dictionary.keys()[1:5]])
dictionary.compactify()


Stop words: ['i', 'me', 'my', 'myself', 'we']

Dictionary after filtering:
[(1, 'dogs'), (2, 'cats'), (3, 'article'), (4, 'shed')]


## doc2bow()

1. counts the number of occurrences of each distinct word
2. converts the word to its integer word id 
3. returns the result as a sparse vector. 

The sparse vector [(0, 1), (1, 1)] therefore reads: in the document “Human computer interaction”, the words computer (id 0) and human (id 1) appear once; the other ten dictionary words appear (implicitly) zero times.
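
The three steps can be sketched without gensim. The function below is a simplified stand-in for `Dictionary.doc2bow`, and the three-word `token2id` mapping is a toy assumption (smaller than the twelve-word dictionary the example sentence refers to).

```python
from collections import Counter

def doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for the tokens found in token2id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

token2id = {"computer": 0, "human": 1, "interface": 2}  # toy dictionary
bow = doc2bow("human computer interaction".split(), token2id)
print(bow)  # [(0, 1), (1, 1)]
```

Words that are missing from the dictionary ("interaction" here) are silently skipped, and dictionary words that do not occur in the document are simply left out of the sparse vector.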

In [6]:
# each document becomes a sparse list of (word id, count) tuples
corpus = [dictionary.doc2bow(words) for words in line_list]
print("Corpus contains tuples of word lists and its frequency")
print("Corpus: {}".format(corpus[1][1:5]))



Corpus contains tuples of word lists and its frequency
Corpus: [(6, 36), (9, 2), (17, 1), (18, 1)]


# Model transformations in gensim

https://radimrehurek.com/gensim/tut2.html#available-transformations

1. **Latent Semantic Indexing, LSI (or LSA)** transforms documents from a tfidf-weighted space into a latent space of lower dimensionality. On real corpora, a target dimensionality of 200–500 is recommended as a “golden standard”.
2. **Random Projections, RP** aim to reduce vector space dimensionality. This is a very efficient (both memory- and CPU-friendly) approach to approximating tfidf distances between documents, by throwing in a little randomness. Recommended target dimensionality is again in the hundreds/thousands, depending on your dataset.
3. **Latent Dirichlet Allocation, LDA** is yet another transformation from bag-of-words counts into a topic space of lower dimensionality. LDA is a probabilistic extension of LSA (also called multinomial PCA), so LDA's topics can be interpreted as probability distributions over words. These distributions are, just like with LSA, inferred automatically from a training corpus. Documents are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA).
4. **Hierarchical Dirichlet Process, HDP** is a non-parametric Bayesian method (note the missing number of requested topics).
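
To give a feel for the "little randomness" in random projections, here is a generic plain-Python sketch of the idea. It is not gensim's `RpModel` code; the ±1/sqrt(k) matrix entries, the function names, and the two toy documents are all assumptions chosen for illustration.

```python
import math
import random

def random_projection_matrix(num_terms, num_topics, seed=0):
    """Return a num_topics x num_terms matrix with entries +/- 1/sqrt(num_topics)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(num_topics)
    return [[rng.choice((-scale, scale)) for _ in range(num_terms)]
            for _ in range(num_topics)]

def project(sparse_vec, matrix):
    """Map a sparse [(term_id, weight), ...] vector to a dense low-dimensional one."""
    return [sum(row[term_id] * weight for term_id, weight in sparse_vec)
            for row in matrix]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

matrix = random_projection_matrix(num_terms=20, num_topics=50)
doc_a = [(3, 0.8), (10, 0.6)]  # toy tfidf-style vectors
doc_b = [(3, 0.6), (10, 0.8)]
proj_a = project(doc_a, matrix)
proj_b = project(doc_b, matrix)

original_sq = squared_distance([0.8, 0.6], [0.6, 0.8])
projected_sq = squared_distance(proj_a, proj_b)
print(original_sq, projected_sq)
```

The projected distance is only an approximation of the original one; it gets closer on average as `num_topics` grows.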

In [7]:
os.chdir(actual_dir)
from similarity_matrix import get_similarity_matrix
from matsim2np import matsim2np

max_posts=len(df_token)
num_best=max_posts+1
article_ids=df_token.index
num_sims=len(df_token) #Keep all similar articles (set to 10 to save only the top ten)

# TFIDF
tfidf = models.TfidfModel(corpus)
tfidf_corpus=tfidf[corpus]
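
As a rough picture of what the TFIDF step computes, the sketch below weights each count by log2(N/df) and then scales every document to unit length. That weighting is an assumption about the default settings; `TfidfModel` is configurable, so treat this as an illustration and not as a reimplementation. The function name and toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_transform(corpus):
    """corpus: list of bag-of-words docs [(term_id, count), ...]; returns weighted docs."""
    num_docs = len(corpus)
    # Number of documents each term id appears in.
    dfs = Counter(term_id for doc in corpus for term_id, _ in doc)
    result = []
    for doc in corpus:
        weights = [(term_id, count * math.log2(num_docs / dfs[term_id]))
                   for term_id, count in doc]
        # Terms that occur in every document get weight 0 and are dropped.
        weights = [(t, w) for t, w in weights if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        result.append([(t, w / norm) for t, w in weights] if norm else [])
    return result

toy_corpus = [[(0, 2), (1, 1)], [(0, 1), (2, 3)]]
weighted = tfidf_transform(toy_corpus)
print(weighted)
```

Term 0 appears in both toy documents, so it carries no information for telling them apart and vanishes from the weighted vectors.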



In [11]:
# LSI (Latent semantic indexing)
run_lsi=True
if (run_lsi):
    num_topics=200 #dimensionality of model
    topic_model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=num_topics)
    similarity_matrix,matsim=get_similarity_matrix(article_ids,topic_model[tfidf_corpus],num_best,num_sims)
    npmatrix=matsim2np(matsim)
    npmatrix.dump('sim_matrix_lsi.pickle')

# RP (Random projections)
run_rp=True
if ( run_rp ):
    num_topics=200
    topic_model=models.RpModel(tfidf_corpus,id2word=dictionary,num_topics=num_topics)
    similarity_matrix_rp,matsim=get_similarity_matrix(article_ids,topic_model[tfidf_corpus],num_best,num_sims)
    npmatrix=matsim2np(matsim)
    npmatrix.dump('sim_matrix_rp.pickle')

In [12]:
# from similarity_matrix import get_similarity_matrix_sparse

# LDA (Latent Dirichlet Allocation)
run_lda=True
if run_lda:
    num_topics=200
    # LDA works on bag-of-words counts, so train on the same corpus that is transformed below
    topic_model=models.LdaModel(corpus,id2word=dictionary,num_topics=num_topics)
#    topic_model=models.LdaModel(tfidf_corpus,id2word=dictionary,num_topics=num_topics)
    similarity_matrix_lda,matsim=get_similarity_matrix(article_ids,topic_model[corpus],num_best,num_sims)
    npmatrix=matsim2np(matsim)
    npmatrix.dump('sim_matrix_lda.pickle')

# HDP (Hierarchical Dirichlet Process)
run_hdp=True
if run_hdp:
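    # HDP infers the number of topics itself, so no num_topics is passed;
    # train on the same bag-of-words corpus that is transformed below
    topic_model=models.HdpModel(corpus,id2word=dictionary)
#    topic_model=models.HdpModel(tfidf_corpus,id2word=dictionary)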
    similarity_matrix_hdp,matsim=get_similarity_matrix(article_ids,topic_model[corpus],num_best,num_sims)
    npmatrix=matsim2np(matsim)
    npmatrix.dump('sim_matrix_hdp.pickle')
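
The helper `get_similarity_matrix` lives in a local module that is not shown here. As a hedged illustration of the kind of ranking it presumably performs, the sketch below ranks articles by cosine similarity of their topic vectors. The function `top_similar`, its signature, and the toy vectors are hypothetical and not taken from that module.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_similar(vectors, num_sims):
    """vectors: {article_id: dense topic vector}; returns {article_id: [most similar ids]}."""
    result = {}
    for article_id, vec in vectors.items():
        scored = [(cosine(vec, other_vec), other_id)
                  for other_id, other_vec in vectors.items()
                  if other_id != article_id]
        # Highest similarity first; keep only the num_sims best matches.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        result[article_id] = [other_id for _, other_id in scored[:num_sims]]
    return result

toy_vectors = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0]}
similar = top_similar(toy_vectors, num_sims=2)
print(similar)
```

This brute-force loop compares every pair of articles, which is fine for a toy example but scales quadratically with the number of articles.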

