1. Read *Wikipedia.csv*, as written by the Wikipedia class.
2. Tokenize the text and save the tokens into *df_token*.
3. Build the similarity matrix and write it to the file *Wikipedia-sims.csv* with columns:  
    id : *article id*  
    similar_articles : *list of article ids*


In [1]:
import sys, os
import pandas as pd

In [2]:
os.chdir('../data')
csvfile = 'Wikipedia.csv'

df=pd.read_csv(csvfile,index_col=0)
df.head(5)

Unnamed: 0,website,title,page_type,text_raw
970284,https://en.wikipedia.org/wiki/Category:Dog_sho...,Category:Dog shows and showing,category,{{Cat main|Conformation show|Show dog}}\n{{por...
972913,https://en.wikipedia.org/wiki/Category:Dog_health,Category:Dog health,category,This is a collection of articles about the hea...
970251,https://en.wikipedia.org/wiki/Category:Dog_org...,Category:Dog organizations,category,This is an automatically collected list of art...
729436,https://en.wikipedia.org/wiki/Category:Dog_sports,Category:Dog sports,category,This is an automatically accumulated list of a...
978163,https://en.wikipedia.org/wiki/Category:Dogs_as...,Category:Dogs as pets,category,[[Category:Dogs|Pets]]\n[[Category:Mammals as ...


Now let's use nltk to tokenize and clean up the text.

In [3]:
import nltk

In [4]:
tokenizer = nltk.RegexpTokenizer(r'\w+')

# Convert to lower case on a new Series, so the raw text in df is left untouched
df_token = df['text_raw'].str.lower()

# Tokenize
df_token=df_token.apply(tokenizer.tokenize)
df_token.head()

970284    [cat, main, conformation, show, show, dog, por...
972913    [this, is, a, collection, of, articles, about,...
970251    [this, is, an, automatically, collected, list,...
729436    [this, is, an, automatically, accumulated, lis...
978163    [category, dogs, pets, category, mammals, as, ...
Name: text_raw, dtype: object

In [5]:
from collections import defaultdict
from gensim import corpora, models, similarities

# gensim dictionary
https://radimrehurek.com/gensim/corpora/dictionary.html

* compactify()


Assign new word ids to all words.

This is done to make the ids more compact, e.g. after some tokens have been removed via filter_tokens() and there are gaps in the id series. Calling this method will remove the gaps.

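* dfs

Document frequencies: a mapping from token id to the number of documents that contain the token. This is an attribute, not a method.

* Download the stop words by running:

nltk.download('stopwords')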
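
The effect of filter_tokens() followed by compactify() can be pictured with a plain dict. This is only a sketch of the idea, not gensim's implementation, and the toy vocabulary and the variable names are made up for illustration:

```python
# Hypothetical toy vocabulary; the ids mimic a dictionary before filtering.
token2id = {"about": 0, "the": 1, "shelter": 2, "a": 3, "dog": 4}

# Remove two ids, as filter_tokens() would. This leaves gaps: 0, 2, 4.
removed_ids = {1, 3}
filtered = {tok: i for tok, i in token2id.items() if i not in removed_ids}

# Reassign consecutive ids in the order of the old ids, closing the gaps.
ordered = sorted(filtered.items(), key=lambda item: item[1])
compact = {tok: new_id for new_id, (tok, _) in enumerate(ordered)}

print(filtered)  # ids with gaps
print(compact)   # consecutive ids starting at 0
```

Any id stored before compacting (for example in an older bag-of-words corpus) no longer points at the same word afterwards, which is why the corpus is built only after the dictionary is final.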

In [6]:
# make gensim dictionary

line_list = df_token.values
dictionary = corpora.Dictionary(line_list)
# dictionary: 0: "about", 1:"shelter",...

# filter dictionary to remove stopwords and words that appear in fewer than min_count documents
# needs the NLTK stop word list: nltk.download('stopwords') is enough
# (a full nltk.download() put 3 GB into C:\Users\melanie\AppData\Roaming\nltk_data)
stop_words = nltk.corpus.stopwords.words('english') 
print("Stop words: {}\n".format(stop_words[:5]))

stop_ids = [dictionary.token2id[word] for word in stop_words
            if word in dictionary.token2id]
min_count = 2
# dictionary.dfs maps token id -> number of documents containing the token
rare_ids = [token_id for token_id, doc_freq in dictionary.dfs.items()
            if doc_freq < min_count]
dictionary.filter_tokens(stop_ids + rare_ids)
print("Dictionary after filtering:")
print([(key,dictionary[key]) for key in dictionary.keys()[1:5]])
dictionary.compactify()


Stop words: ['i', 'me', 'my', 'myself', 'we']

Dictionary after filtering:
[(1, 'main'), (2, 'conformation'), (3, 'show'), (4, 'dog')]


## doc2bow()

1. counts the number of occurrences of each distinct word
2. converts the word to its integer word id 
3. returns the result as a sparse vector. 

The sparse vector [(0, 1), (1, 1)] therefore reads: in the document “Human computer interaction”, the words computer (id 0) and human (id 1) appear once; the other ten dictionary words appear (implicitly) zero times.
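
The three steps can be sketched in plain Python. This is an illustration of the idea rather than gensim's code; the function name `doc2bow_sketch` and the three-word vocabulary are assumptions made for the example:

```python
from collections import Counter

# Hypothetical vocabulary: token -> integer id.
token2id = {"computer": 0, "human": 1, "interface": 2}

def doc2bow_sketch(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    # Tokens missing from the vocabulary are silently skipped.
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

bow = doc2bow_sketch(["human", "computer", "interaction"], token2id)
print(bow)
```

Only ids with a non-zero count appear in the result, which is what makes the vector sparse.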

https://radimrehurek.com/gensim/models/tfidfmodel.html

**TF-IDF model**

https://en.wikipedia.org/wiki/Tf%E2%80%93idf

*TF-IDF* model, **term frequency–inverse document frequency**, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. Nowadays, tf-idf is one of the most popular term-weighting schemes. For instance, 83% of text-based recommender systems in the domain of digital libraries use tf-idf.

**Term frequency**. 
The number of times a term occurs in a document is called its term frequency.

The first form of term weighting is due to Hans Peter Luhn (1957) and is based on the Luhn Assumption:
the weight of a term that occurs in a document is simply proportional to the term frequency.

**Inverse document frequency**. An inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set (e.g. "the", "a") and increases the weight of terms that occur rarely.

Karen Spärck Jones (1972) conceived a statistical interpretation of term specificity called Inverse Document Frequency (IDF), which became a cornerstone of term weighting:
The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.

tf–idf is the product of two statistics, term frequency and inverse document frequency.

A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf-idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf-idf closer to 0.
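
A small worked example makes the bounds concrete. The sketch below uses the textbook form tf × log(N / df) with a natural logarithm on a made-up three-document corpus; gensim's TfidfModel may use a different log base and normalization, so the numbers are not expected to match its output:

```python
import math

# Hypothetical tokenized corpus.
docs = [
    ["dog", "show", "dog"],
    ["cat", "show"],
    ["dog", "health", "show"],
]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
doc_freq = {}
for doc in docs:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

def idf(term):
    # The ratio N / df is >= 1, so the logarithm is >= 0.
    return math.log(n_docs / doc_freq[term])

def tfidf(term, doc):
    # Raw count of the term in the document times its idf.
    return doc.count(term) * idf(term)

print(idf("show"))                # term occurs in every document
print(tfidf("dog", docs[0]))      # frequent in this document, 2 of 3 documents
print(tfidf("health", docs[2]))   # occurs in a single document
```

"show" appears in every document, so the ratio inside the logarithm is 1 and its weight drops to 0, while "health", which occurs in only one document, gets the largest idf.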







In [7]:
corpus = [dictionary.doc2bow(words) for words in line_list]
print("Corpus: {}".format(corpus[1][1:5]))
tfidf = models.TfidfModel(corpus)


Corpus: [(1, 1), (4, 2), (6, 3), (10, 1)]


In [12]:
model=models.LsiModel
max_posts=len(df_token.values)

topic_model = model(tfidf[corpus], id2word=dictionary, num_topics=5)
for topic in topic_model.print_topics(5):
    print('Topic: {}'.format(topic[0]))
    # Put each weighted term on its own line
    print(str(topic).replace(' + ', '\n'))
    print('')


        


Topic: 0
(0, '0.309*"ref"
0.266*"cat"
0.227*"cats"
0.200*"dog"
0.168*"name"
0.156*"journal"
0.127*"cite"
0.123*"title"
0.102*"meat"
0.099*"url"')

Topic: 1
(1, '-0.632*"bailys"
-0.277*"hunt"
-0.258*"foxhounds"
-0.241*"england"
-0.228*"directory"
-0.171*"harriers"
-0.167*"name"
-0.166*"beagles"
-0.156*"hunting"
-0.152*"packs"')

Topic: 2
(2, '0.384*"breeds"
0.243*"breed"
0.241*"list"
0.194*"fictional"
0.171*"pets"
0.155*"types"
0.151*"dog"
0.149*"dogs"
0.145*"commons"
0.135*"individual"')

Topic: 3
(3, '0.421*"cats"
-0.357*"dog"
0.349*"cat"
0.216*"pets"
-0.201*"breeds"
-0.159*"breed"
0.155*"mammals"
0.131*"country"
0.113*"equipment"
0.112*"cafe"')

Topic: 4
(4, '-0.420*"pets"
0.359*"fictional"
-0.304*"mammals"
0.202*"individual"
0.201*"popular"
0.187*"culture"
0.186*"meat"
-0.185*"lists"
0.181*"canines"
-0.175*"country"')



In [14]:
matsim=similarities.MatrixSimilarity(topic_model[tfidf[corpus]],num_best=max_posts+1)
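
MatrixSimilarity compares documents by the cosine similarity of their vectors. The ranking it produces can be sketched in plain Python; the two-dimensional vectors and the helper names `cosine` and `rank` below are made up for illustration and stand in for the LSI topic vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical topic-space vectors, one per document.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

def rank(query_index):
    """Return (document index, score) pairs, best match first."""
    scores = [(j, cosine(vectors[query_index], v)) for j, v in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranked = rank(0)
print(ranked)
```

The query document itself comes out on top with a score of 1, which is why the cells below skip the first entry of each result list.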

In [22]:
article_ids=df_token.index
titles=df['title'].values
similarity=defaultdict(list)
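# Keep only the ten most similar articles
sim_top=10

# This is from Melanie, to debug and check if article similarities make sense.
# It assumes each article is its own best match, so sims[0] is the article itself;
# sims[1:sim_top] then prints the next nine neighbours.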
for sims in list(matsim)[:sim_top]:
    article_id = sims[0][0]
    print('\033[1m'+titles[article_id]+'\033[0m') 
    for other_id, score in sims[1:sim_top]:
        print('\t', titles[other_id], ' ', score) 

# Same ranking as above, but without printing: the ids of the sim_top most
# similar articles are stored in the "similarity" dict, keyed by article id
for article_id, sims in zip(article_ids, matsim):
    similarity[article_id].append([])
    for other_id, score in sims[1:sim_top+1]:
        similarity[article_id][0].append(article_ids[other_id])

# Sanity check: list the stored neighbours of one article
# (index 1 is an arbitrary pick)
titles=df['title']
index=1
article_id=article_ids[index]
title=titles[article_id]
print("Similar pages for article {}".format(title))
# similarity[article_id] wraps the id list in an outer list, hence the [0]
for other_id in similarity[article_id][0]:
    print("{}".format(titles[other_id]))

Category:Dog shows and showing
	 Category:Dog-related professions and professionals   0.987384080887
	 Rolfi   0.986163973808
	 Bog Dog   0.980988621712
	 Category:Dog law   0.974566400051
	 Category:Dog sports   0.96657794714
	 Category:Dog training and behavior   0.964447975159
	 Category:Dog organizations   0.96400731802
	 Lists of dogs   0.953864693642
	 Category:Dog breeds   0.952302157879
Category:Dog health
	 Category:Cat behavior   0.932681262493
	 Category:Cat health   0.93171864748
	 Category:Cat breeds   0.914559006691
	 Category:Cat fancy   0.900419652462
	 Portal:Cats   0.886702656746
	 Portal:Dogs   0.878064990044
	 Category:Dog breeding   0.876295745373
	 Category:Dog training and behavior   0.861733615398
	 Cattery   0.860115885735
Category:Dog organizations
	 Category:Dog sports   0.965003609657
	 Category:Dog shows and showing   0.96400731802
	 Lists of dogs   0.962345182896
	 Category:Dog-related professions and professionals   0.960405230522


In [10]:
#Save similarity into file:
df_similarity=pd.DataFrame.from_dict(similarity, orient='index')
df_similarity.columns=['similar_articles']

df_similarity.to_csv('Wikipedia-sims.csv')
df_similarity.head()

# Note that the list is printed as a string in the Wikipedia-sims.csv file.
# use literal_eval to convert back to list 
#>>> from ast import literal_eval
#>>> literal_eval('[1.23, 2.34]')
#[1.23, 2.34]


Unnamed: 0,similar_articles
970284,"[1777080, 42498898, 53824388, 1968414, 729436,..."
972913,"[977022, 2558456, 710421, 16566346, 1765233, 7..."
970251,"[37331361, 970284, 729436, 1777080, 691500, 97..."
729436,"[1765458, 970360, 970284, 970251, 32122845, 53..."
978163,"[37104213, 978165, 34561692, 43220839, 3243721..."
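
Reading the file back requires turning each string into a list again. A minimal sketch with the standard library follows; the two-row excerpt is made up, and the header name `id` for the first column is an assumption (pandas leaves that header empty unless the index is named):

```python
import csv
import io
from ast import literal_eval

# Hypothetical excerpt in the same shape as Wikipedia-sims.csv.
csv_text = (
    'id,similar_articles\n'
    '970284,"[1777080, 42498898]"\n'
    '972913,"[977022, 2558456]"\n'
)

reader = csv.DictReader(io.StringIO(csv_text))
# literal_eval safely parses the quoted list literal back into a Python list.
similar = {int(row["id"]): literal_eval(row["similar_articles"]) for row in reader}

print(similar[970284])
```

literal_eval only accepts Python literals, so it is a safer choice than eval for data read from a file.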


In [11]:
# If interested in the similarity scores, uncomment the next lines:
# similarity_scores = defaultdict(list)
# for article_id, sims in zip(article_ids, matsim):
#    for other_id, score in sims[1:sim_top+1]:
#        similarity_scores[article_id].append((article_ids[other_id], score))
# article_id=article_ids[0]
# print("Similarity score for article ID {}".format(article_id))
# print(similarity_scores[article_id][:5])