1. Read *Wikipedia.csv*, as written by the Wikipedia class.
2. Tokenize the text, and save the tokens into *df_token*.
3. Build the similarity matrix and write it to the file *Wikipedia-sims.csv* with columns:  
    id : *article id*  
    similar_articles : *list of article ids*


In [1]:
import sys, os
import pandas as pd

In [2]:
os.chdir('../data')
csvfile = 'Wikipedia.csv'

df=pd.read_csv(csvfile,index_col=0)
df.head(5)

Unnamed: 0,website,title,page_type,text_raw
970284,https://en.wikipedia.org/wiki/Category:Dog_sho...,Category:Dog shows and showing,category,{{Cat main|Conformation show|Show dog}}\n{{por...
972913,https://en.wikipedia.org/wiki/Category:Dog_health,Category:Dog health,category,This is a collection of articles about the hea...
970251,https://en.wikipedia.org/wiki/Category:Dog_org...,Category:Dog organizations,category,This is an automatically collected list of art...
729436,https://en.wikipedia.org/wiki/Category:Dog_sports,Category:Dog sports,category,This is an automatically accumulated list of a...
978163,https://en.wikipedia.org/wiki/Category:Dogs_as...,Category:Dogs as pets,category,[[Category:Dogs|Pets]]\n[[Category:Mammals as ...


Now let's use nltk to tokenize and clean up the text

In [3]:
import nltk

In [4]:
tokenizer = nltk.RegexpTokenizer(r'\w+')

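# Convert to lower case on a copy, so the raw text in df is left untouched
# (assigning element by element into df['text_raw'] modifies df and can
# trigger pandas chained-assignment warnings)
df_token = df['text_raw'].str.lower()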

# Tokenize
df_token=df_token.apply(tokenizer.tokenize)
df_token.head()

970284    [cat, main, conformation, show, show, dog, por...
972913    [this, is, a, collection, of, articles, about,...
970251    [this, is, an, automatically, collected, list,...
729436    [this, is, an, automatically, accumulated, lis...
978163    [category, dogs, pets, category, mammals, as, ...
Name: text_raw, dtype: object
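
As a rough illustration of what the lower-casing and the `\w+` pattern do together, here is a minimal sketch that uses only the standard library `re` module instead of nltk. The helper name `tokenize` and the sample string are made up for this example; the sample is a shortened form of the first article's raw text.

```python
import re

def tokenize(text):
    """Lower-case the text and return every run of word characters."""
    return re.findall(r'\w+', text.lower())

# Shortened sample in the style of the first article's wiki markup
sample = "{{Cat main|Conformation show|Show dog}}"
tokens = tokenize(sample)
print(tokens)
```

Punctuation and wiki markup such as `{{`, `|` and `}}` never match `\w+`, so they simply disappear from the token list.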

In [5]:
from collections import defaultdict
from gensim import corpora, models, similarities

# gensim dictionary
https://radimrehurek.com/gensim/corpora/dictionary.html

* compactify()


Assign new word ids to all words.

This is done to make the ids more compact, e.g. after some tokens have been removed via filter_tokens() and there are gaps in the id series. Calling this method will remove the gaps.

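* dfs

Document frequencies (an attribute, not a method): maps each token id to the number of documents that contain the token.

* Download the stop words by running: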

nltk.download()
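
To make document frequencies and id compaction concrete, here is a small pure-Python sketch. It is not gensim's implementation; the toy documents and the names `token2id`, `dfs`, `kept` and `compact` are assumptions chosen for illustration.

```python
# Three toy documents, already tokenized
docs = [
    ["cat", "dog", "show", "dog"],
    ["dog", "show"],
    ["health", "show"],
]

# Assign ids in order of first appearance (an arbitrary choice for this sketch)
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# Document frequency: in how many documents does each token id occur?
# "dog" occurs twice in the first document but is counted once for it.
dfs = {}
for doc in docs:
    for token in set(doc):
        token_id = token2id[token]
        dfs[token_id] = dfs.get(token_id, 0) + 1

# Drop tokens found in fewer than 2 documents; the surviving ids have gaps
kept = {tok: tid for tok, tid in token2id.items() if dfs[tid] >= 2}

# "Compactify": renumber the survivors 0..n-1, keeping their relative order
ordered = sorted(kept.items(), key=lambda item: item[1])
compact = {tok: new_id for new_id, (tok, _) in enumerate(ordered)}

print(kept)     # ids 1 and 2 survive, id 0 is gone
print(compact)  # ids renumbered without gaps
```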

In [6]:
# make gensim dictionary

line_list = df_token.values
dictionary = corpora.Dictionary(line_list)
# dictionary: 0: "about", 1:"shelter",...

# filter dictionary to remove stopwords and words that appear in fewer than min_count documents
# need to run nltk.download() -> 3 GB downloaded into C:\Users\melanie\AppData\Roaming\nltk_data
stop_words = nltk.corpus.stopwords.words('english') 
print("Stop words: {}\n".format(stop_words[:5]))

stop_ids = [dictionary.token2id[word] for word in stop_words
            if word in dictionary.token2id]
min_count = 2
rare_ids = [token_id for token_id, freq in dictionary.dfs.items()
            if freq < min_count]
dictionary.filter_tokens(stop_ids + rare_ids)
print("Dictionary after filtering:")
print([(key,dictionary[key]) for key in dictionary.keys()[1:5]])
dictionary.compactify()


Stop words: ['i', 'me', 'my', 'myself', 'we']

Dictionary after filtering:
[(1, 'main'), (2, 'conformation'), (3, 'show'), (4, 'dog')]


## doc2bow()

1. counts the number of occurrences of each distinct word
2. converts the word to its integer word id 
3. returns the result as a sparse vector. 

The sparse vector [(0, 1), (1, 1)] therefore reads: in the document “Human computer interaction”, the words computer (id 0) and human (id 1) appear once; the other ten dictionary words appear (implicitly) zero times.
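
The three steps can be sketched in plain Python as follows. This is an illustration only, not gensim's code: the function merely mimics the behaviour described above, and the three-word `token2id` mapping is a hypothetical stand-in for the twelve-word dictionary of the example.

```python
from collections import Counter

# Hypothetical excerpt of a dictionary: token -> integer id
token2id = {"computer": 0, "human": 1, "interface": 2}

def doc2bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

bow = doc2bow("human computer interaction".split(), token2id)
print(bow)
```

Words missing from the dictionary ("interaction" here) are skipped, and ids with a zero count are simply not stored, which is what makes the vector sparse.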

https://radimrehurek.com/gensim/models/tfidfmodel.html

**TF-IDF model**

https://en.wikipedia.org/wiki/Tf%E2%80%93idf

*TF-IDF* model, **term frequency–inverse document frequency**, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. Nowadays, tf-idf is one of the most popular term-weighting schemes. For instance, 83% of text-based recommender systems in the domain of digital libraries use tf-idf.

**Term frequency**.
The number of times a term occurs in a document is called its term frequency.

The first form of term weighting is due to Hans Peter Luhn (1957) and is based on the Luhn Assumption:
The weight of a term that occurs in a document is simply proportional to the term frequency.[3]

**Inverse document frequency**. An inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set (e.g. "the", "a") and increases the weight of terms that occur rarely.

Karen Spärck Jones (1972) conceived a statistical interpretation of term specificity called Inverse Document Frequency (IDF), which became a cornerstone of term weighting:
The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.

tf–idf is the product of two statistics, term frequency and inverse document frequency.

A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf-idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf-idf closer to 0.
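
A small worked example may help here. The sketch below assumes the simplest variant: raw counts for the term frequency and a base-2 logarithm of (number of documents / document frequency) for the idf. Libraries may use other weighting variants and may normalize the resulting vectors, so the numbers are illustrative only; the toy documents and function names are made up.

```python
import math

# Three toy documents, already tokenized
docs = [
    ["dog", "show", "dog"],
    ["dog", "health"],
    ["dog", "sports", "show"],
]
n_docs = len(docs)

def idf(term):
    """log2(N / df), where df is the number of documents containing the term."""
    df = sum(1 for doc in docs if term in doc)
    return math.log2(n_docs / df)

def tfidf(term, doc):
    """Raw term count in the document times the idf of the term."""
    return doc.count(term) * idf(term)

for term in ("dog", "show", "health"):
    print(term, round(idf(term), 3))
```

"dog" occurs in every document, so the ratio inside the logarithm is 1 and its idf is 0; even though it appears twice in the first document, its tf-idf weight there is 0. "health" occurs in a single document and receives the largest idf of the three.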







In [7]:
corpus = [dictionary.doc2bow(words) for words in line_list]
print("Corpus: {}".format(corpus[1][1:5]))
tfidf = models.TfidfModel(corpus)


Corpus: [(1, 1), (4, 2), (6, 3), (10, 1)]


In [8]:
model=models.LsiModel
max_posts=len(df_token.values)

topic_model = model(tfidf[corpus], id2word=dictionary, num_topics=10)
for topic in topic_model.print_topics(5):
    print('Topic: {}'.format(topic[0]))
    print(str(topic).replace(' + ', '\n')) 
    print('') 
    #print ('\n' + str(topic))


        


Topic: 0
(0, '0.309*"ref"
0.266*"cat"
0.227*"cats"
0.200*"dog"
0.167*"name"
0.155*"journal"
0.127*"cite"
0.123*"title"
0.102*"meat"
0.099*"url"')

Topic: 1
(1, '-0.633*"bailys"
-0.277*"hunt"
-0.257*"foxhounds"
-0.242*"england"
-0.228*"directory"
-0.170*"harriers"
-0.166*"name"
-0.166*"beagles"
-0.156*"hunting"
-0.153*"packs"')

Topic: 2
(2, '-0.383*"breeds"
-0.243*"breed"
-0.240*"list"
-0.193*"fictional"
-0.171*"pets"
-0.156*"types"
-0.152*"dog"
-0.148*"dogs"
-0.146*"commons"
-0.136*"individual"')

Topic: 3
(3, '0.422*"cats"
-0.357*"dog"
0.349*"cat"
0.216*"pets"
-0.202*"breeds"
-0.159*"breed"
0.155*"mammals"
0.130*"country"
0.112*"equipment"
0.112*"cafe"')

Topic: 4
(4, '-0.421*"pets"
0.357*"fictional"
-0.305*"mammals"
0.204*"popular"
0.199*"individual"
0.190*"culture"
0.189*"meat"
-0.186*"lists"
0.180*"canines"
-0.175*"country"')



In [9]:
matsim=similarities.MatrixSimilarity(topic_model[tfidf[corpus]],num_best=max_posts+1)
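
MatrixSimilarity compares documents by cosine similarity. The following pure-Python sketch shows the idea on made-up two-dimensional vectors; the vectors and the helper names `cosine` and `most_similar` are assumptions for illustration and are not part of gensim.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical documents expressed in a 2-topic space
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]

def most_similar(query_index, vectors):
    """Return (index, score) pairs for all documents, best match first."""
    scores = [(i, cosine(vectors[query_index], v)) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = most_similar(0, vectors)
print(ranking)
```

A document always has similarity 1 with itself, so it normally comes first in its own ranking. That is why code consuming such a ranking usually skips the first entry.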

In [10]:
article_ids=df_token.index
titles=df['title'].values
similarity=defaultdict(list)
# Save only the ten most similar articles
sim_top=10

# This is from Melanie, to debug and check if article similarities make sense.
for sims in list(matsim)[:sim_top]:
    article_id = sims[0][0]
    print('\033[1m'+titles[article_id]+'\033[0m') 
    for other_id, score in sims[1:sim_top]:
        print('\t', titles[other_id], ' ', score) 

# Same ranking as above, but stored in the "similarity" dict instead of printed:
# one list with the sim_top most similar article ids per article
for article_id, sims in zip(article_ids, matsim):
    similarity[article_id].append([])
    for other_id, score in sims[1:sim_top+1]:
        similarity[article_id][0].append(article_ids[other_id])

# Debug check: list the pages stored as similar to one article
titles=df['title']
index=1
article_id=article_ids[index]
title=titles[article_id]
print("Similar pages for article {}".format(title))
# similarity[article_id] wraps the id list in an outer list, so iterate over element 0
for other_id in similarity[article_id][0]:
    print("{}".format(titles[other_id]))

Category:Dog shows and showing
	 Category:Dog-related professions and professionals   0.940749764442
	 Category:Dog sports   0.914433181286
	 Category:Dog types   0.862453222275
	 Bog Dog   0.859058082104
	 Category:Dog breeds   0.831410706043
	 Lists of dogs   0.824149012566
	 Category:Dog training and behavior   0.820615828037
	 Category:Dog stubs   0.803065598011
	 Category:Dog organizations   0.802525043488
Category:Dog health
	 Category:Cat health   0.985054016113
	 Category:Dog law   0.915589809418
	 Category:Dog training and behavior   0.807522475719
	 Category:Dog equipment   0.776322245598
	 Category:Cat equipment   0.770564198494
	 Category:Cat behavior   0.739643633366
	 Category:Dog sports   0.66308504343
	 Category:Dog meat   0.64880001545
	 Cat training   0.58797454834
Category:Dog organizations
	 Category:Dog-related professions and professionals   0.849314749241
	 Category:Dog shows and showing   0.802525043488
	 Category:Dog training and behavio

In [11]:
#Save similarity into file:
df_similarity=pd.DataFrame.from_dict(similarity, orient='index')
df_similarity.columns=['similar_articles']

df_similarity.to_csv('Wikipedia-sims.csv')
df_similarity.head()

# Note that the list is printed as a string in the Wikipedia-sims.csv file.
# use literal_eval to convert back to list 
#>>> from ast import literal_eval
#>>> literal_eval('[1.23, 2.34]')
#[1.23, 2.34]


Unnamed: 0,similar_articles
970284,"[1777080, 729436, 704388, 42498898, 691500, 37..."
972913,"[977022, 1968414, 970360, 1764821, 8174570, 25..."
970251,"[1777080, 970284, 970360, 39353382, 729436, 42..."
729436,"[970284, 1777080, 970360, 42498898, 1968414, 1..."
978163,"[978165, 43220839, 32437215, 37104213, 3456169..."
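
Because each list is stored as a string in the CSV file, it has to be parsed when the file is read back. Here is a minimal sketch using only the standard library. The in-memory text `csv_text` is a shortened, hypothetical excerpt in the layout of *Wikipedia-sims.csv* (an unnamed index column followed by `similar_articles`), so adapt it to the real file as needed.

```python
import csv
import io
from ast import literal_eval

# Shortened stand-in for the contents of Wikipedia-sims.csv
csv_text = (
    ',similar_articles\n'
    '970284,"[1777080, 729436]"\n'
    '972913,"[977022, 1968414]"\n'
)

reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row

# Map each article id to its list of similar ids, parsed back from the string
similar = {int(article_id): literal_eval(id_list)
           for article_id, id_list in reader}

print(similar[970284])
```

With a real file, `io.StringIO(csv_text)` would be replaced by an open file handle.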
