1. Read the *Wikipedia-dog.csv* and *Wikipedia-fish.csv* files written by the Wikipedia class.
2. Tokenize the text and save the tokens into *df_token*.
3. Build the similarity matrix and write it to the file *Wikipedia-sims.csv* with columns:  
    id : *article id*  
    similar_articles : *list of article ids*  
    scores : *list of similarity scores*


In [1]:
import sys, os
import pandas as pd

# Read data

Read .csv files and make a pandas dataframe.

In [2]:
actual_dir=os.getcwd()
os.chdir('../data')
csvfile1 = 'Wikipedia-dog.csv'
csvfile2 = 'Wikipedia-fish.csv'


df=pd.read_csv(csvfile1,index_col=0)
df2=pd.read_csv(csvfile2,index_col=0)
df=pd.concat([df,df2])  # DataFrame.append was removed in pandas 2.0
del df2
df.tail(5)


Unnamed: 0,website,title,page_type,text_raw
49475203,https://en.wikipedia.org/wiki/Category:Whales_...,Category:Whales in fiction,category,{{popcat}}\n[[Whale]]s in fiction\n\n[[Categor...
49735416,https://en.wikipedia.org/wiki/Tail_sailing,Tail sailing,article,[[File:Southern right whale4.jpg|thumb|upright...
52243894,https://en.wikipedia.org/wiki/Bubble_net_feeding,Bubble net feeding,article,{{main|Cetacean surfacing behaviour}}\n'''Bubb...
31823392,https://en.wikipedia.org/wiki/Category:Individ...,Category:Individual cetaceans,category,{{commons cat|Individual cetaceans}}\nThis cat...
53720250,https://en.wikipedia.org/wiki/Ethelbert_(whale),Ethelbert (whale),article,{{no footnotes|date=June 2017}}\n\n'''Ethelber...


Now let's use nltk to tokenize and clean up the text.

# Tokenizer

Find a list of tokens from raw text for each article
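
As a minimal sketch of what this step does, the snippet below lower-cases a string and splits it into runs of word characters. It uses the standard-library `re` module, which should give the same tokens as `nltk.RegexpTokenizer(r'\w+')`; the helper name `tokenize` and the sample string are made up for illustration.

```python
import re

def tokenize(text):
    """Lower-case the text and return its runs of word characters."""
    return re.findall(r"\w+", text.lower())

# Hypothetical snippet of wiki markup, not taken from the data set
sample = "'''Bubble net feeding''' is a [[Whale]] behaviour."
tokens = tokenize(sample)
print(tokens)
# ['bubble', 'net', 'feeding', 'is', 'a', 'whale', 'behaviour']
```

Note that the wiki markup characters (quotes, brackets) disappear, but markup keywords such as `category` or `file` would survive as ordinary tokens.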

In [3]:
import nltk
tokenizer = nltk.RegexpTokenizer(r'\w+')

# Convert to lower case (this returns a new Series, so df['text_raw'] is left untouched):
df_token=df['text_raw'].str.lower()

# Tokenize
df_token=df_token.apply(tokenizer.tokenize)
df_token.head()

970284    [cat, main, conformation, show, show, dog, por...
972913    [this, is, a, collection, of, articles, about,...
970251    [this, is, an, automatically, collected, list,...
729436    [this, is, an, automatically, accumulated, lis...
978163    [category, dogs, pets, category, mammals, as, ...
Name: text_raw, dtype: object

In [4]:
from gensim import corpora, models

# gensim dictionary
https://radimrehurek.com/gensim/corpora/dictionary.html

* compactify()


Assign new word ids to all words.

This is done to make the ids more compact, e.g. after some tokens have been removed via filter_tokens() and there are gaps in the id series. Calling this method will remove the gaps.

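* dfs

Document frequencies: a mapping from token id to the number of documents that contain the token.

* Download the stop words by running:

nltk.download()

(nltk.download('stopwords') fetches only the stop word lists.)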
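
To make the two ideas above concrete, here is a plain-Python sketch of document frequencies and id compaction. It is not gensim's implementation: the toy documents and the variable names (`token2id`, `kept`, `compact`) are assumptions chosen for illustration, and ids are simply assigned in order of first appearance.

```python
# Three toy documents, already tokenized (hypothetical data)
docs = [["dog", "breed", "show"], ["dog", "show"], ["fish", "show"]]

# Assign an integer id to every distinct token, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# Document frequency: in how many documents does each token id occur?
dfs = {}
for doc in docs:
    for token in set(doc):  # set() so repeats within a document count once
        token_id = token2id[token]
        dfs[token_id] = dfs.get(token_id, 0) + 1

# Drop tokens that occur in fewer than min_count documents
min_count = 2
rare_ids = {token_id for token_id, freq in dfs.items() if freq < min_count}
kept = {tok: tid for tok, tid in token2id.items() if tid not in rare_ids}
print(kept)     # ids now have a gap where 'breed' used to be

# "Compactify": renumber the surviving tokens 0..n-1, preserving their order
ordered = sorted(kept.items(), key=lambda item: item[1])
compact = {tok: new_id for new_id, (tok, _) in enumerate(ordered)}
print(compact)
```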

In [5]:
# make gensim dictionary

line_list = df_token.values
dictionary = corpora.Dictionary(line_list)
# dictionary: 0: "about", 1:"shelter",...

# filter dictionary to remove stopwords and words that occur in fewer than min_count documents
# need to run nltk.download() -> 3 GB downloaded into C:\Users\melanie\AppData\Roaming\nltk_data
stop_words = nltk.corpus.stopwords.words('english')
print("Stop words: {}\n".format(stop_words[:5]))

stop_ids = [dictionary.token2id[word] for word in stop_words
            if word in dictionary.token2id]
min_count = 2
# dictionary.dfs maps token id -> number of documents containing the token
rare_ids = [token_id for token_id, freq in dictionary.dfs.items()
            if freq < min_count]
dictionary.filter_tokens(stop_ids + rare_ids)
print("Dictionary after filtering:")
print([(key,dictionary[key]) for key in dictionary.keys()[1:5]])
dictionary.compactify()


Stop words: ['i', 'me', 'my', 'myself', 'we']

Dictionary after filtering:
[(1, 'main'), (2, 'conformation'), (3, 'show'), (4, 'dog')]


## doc2bow()

1. counts the number of occurrences of each distinct word
2. converts the word to its integer word id 
3. returns the result as a sparse vector. 

The sparse vector [(0, 1), (1, 1)] therefore reads: in the document “Human computer interaction”, the words computer (id 0) and human (id 1) appear once; the other ten dictionary words appear (implicitly) zero times.
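
The three steps can be sketched in a few lines of plain Python. This is an illustration of the idea, not gensim's code: the function below only mimics `doc2bow`, and the small `token2id` mapping is a made-up stand-in for the twelve-word dictionary of the example.

```python
from collections import Counter

def doc2bow(tokens, token2id):
    """Return a sorted list of (word id, count) pairs for the known tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical dictionary; "interaction" is deliberately missing from it
token2id = {"computer": 0, "human": 1, "interface": 2}
bow = doc2bow("human computer interaction".split(), token2id)
print(bow)
# [(0, 1), (1, 1)]
```

Words that are not in the dictionary are silently dropped, and words that do not occur in the document are simply absent from the result.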

In [6]:
corpus = [dictionary.doc2bow(words) for words in line_list]
print("Corpus contains (word id, count) tuples for each document")
print("Corpus: {}".format(corpus[1][1:5]))



Corpus contains (word id, count) tuples for each document
Corpus: [(1, 1), (4, 2), (6, 3), (10, 1)]


# TF-IDF model

https://radimrehurek.com/gensim/models/tfidfmodel.html

**TF-IDF model**

https://en.wikipedia.org/wiki/Tf%E2%80%93idf

*TF-IDF* model, **term frequency–inverse document frequency**, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. Nowadays, tf-idf is one of the most popular term-weighting schemes. For instance, 83% of text-based recommender systems in the domain of digital libraries use tf-idf.

**Term frequency**. 
The number of times a term occurs in a document is called its term frequency.

The first form of term weighting is due to Hans Peter Luhn (1957) and is based on the Luhn Assumption:
The weight of a term that occurs in a document is simply proportional to the term frequency.

**Inverse document frequency**. An inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set, e.g. "the", "a", etc., and increases the weight of terms that occur rarely.

Karen Spärck Jones (1972) conceived a statistical interpretation of term specificity called Inverse Document Frequency (IDF), which became a cornerstone of term weighting:
The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.

tf–idf is the product of two statistics, term frequency and inverse document frequency.

A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf-idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf-idf closer to 0.
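
A small worked example of the product described above, using the raw count as tf and log2(N / df) as idf. The choice of log base and the toy numbers are assumptions made for this sketch; gensim's `TfidfModel` applies its own default weighting and normalisation, so its values will generally differ.

```python
import math

def tfidf_weights(bow, dfs, num_docs):
    """Weight each (word id, count) pair by count * log2(num_docs / doc frequency)."""
    return [(token_id, tf * math.log2(num_docs / dfs[token_id]))
            for token_id, tf in bow]

# Hypothetical corpus statistics: 4 documents, document frequency per word id
num_docs = 4
dfs = {0: 4, 1: 1, 2: 2}

# One document as (word id, count) pairs
bow = [(0, 3), (1, 1), (2, 2)]
weights = tfidf_weights(bow, dfs, num_docs)
print(weights)
# [(0, 0.0), (1, 2.0), (2, 2.0)]
```

Word 0 appears in every document, so its idf is log2(4/4) = 0 and it gets zero weight despite the highest count; word 1 appears only once but in a single document, so it scores as high as word 2.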





In [7]:
tfidf = models.TfidfModel(corpus)

In [8]:
os.chdir(actual_dir)
from topic_model import get_topic_model

# LSI
num_topics=100 #dimensionality of model
model=models.LsiModel
topic_model=get_topic_model(tfidf,corpus,dictionary,num_topics,model)


In [12]:
from similarity_matrix import get_similarity_matrix

max_posts=len(df_token)
num_best=max_posts
#matsim=similarities.MatrixSimilarity(topic_model[tfidf[corpus]],num_best=max_posts+1)
article_ids=df_token.index
num_sims=5 #Save only the top five similar articles

similarity_matrix,matsim=get_similarity_matrix(article_ids,topic_model[tfidf[corpus]],num_best,num_sims)
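
The source of `get_similarity_matrix` is not shown here, so the following is only a guess at the idea behind it: rank every other article by cosine similarity and keep the top `num_sims`. The helper names `cosine` and `top_similar`, the toy vectors, and the returned shape (a list of ids and a list of scores per article) are all assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_similar(vectors, num_sims):
    """Map each id to (ids of the most similar other items, their scores)."""
    result = {}
    for item_id, vec in vectors.items():
        scored = [(cosine(vec, other), other_id)
                  for other_id, other in vectors.items() if other_id != item_id]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        best = scored[:num_sims]
        result[item_id] = ([other_id for _, other_id in best],
                           [score for score, _ in best])
    return result

# Hypothetical two-dimensional topic vectors keyed by article id
vectors = {101: [1.0, 0.0], 102: [0.9, 0.1], 103: [0.0, 1.0]}
sims = top_similar(vectors, num_sims=2)
print(sims[101][0])  # ids ordered from most to least similar
```

Comparing every pair this way scales quadratically with the number of articles, which is why an index such as gensim's `MatrixSimilarity` (referenced in the commented-out line above) is the usual choice for larger corpora.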

In [13]:
# This shows the list of similar articles for a few random articles
import random
list_of_random_items = random.sample(list(matsim), 10)

titles=df['title']
for sims in list_of_random_items:
    sims_id = sims[0][0]
    article_id=article_ids[sims_id]
    print('\033[1m'+titles[article_id]+'\033[0m') 
    similar_article_ids= similarity_matrix[article_id][0]
    for other_id in similar_article_ids:
        print("\t{}".format(titles[other_id]))
    #for other_id, score in sims[1:sim_top]:
    #    print('\t', titles[other_id], ' ', score) 
        

Category:Deaths due to dog attacks
	Dog bite
	Body language of dogs
	Dog bite prevention
	Tail wagging by dogs
	Tail sailing
List of foxhound packs of the United Kingdom
	List of hound packs of New Zealand
	List of hound packs of Australia
	List of minkhound packs of the United Kingdom
	List of hound packs of Ireland
	List of draghound packs of the United Kingdom
Category:Dog breeding
	Category:Cat fancy
	Category:Fish reproduction
	Category:Dog monuments
	Breed type (dog)
	Canine reproduction
Category:Dog training and behavior
	Cat training
	Category:Cat behavior
	Category:Dog-related professions and professionals
	Cynology
	Category:Dog types
Category:Prehistoric whales
	Category:Saber-toothed cats
	Category:Blue whales
	Category:Whales in art
	Cetacean stranding
	Whales in Ghanaian waters
Category:Fish anatomy
	Category:Mythological dogs
	Outline of fish
	Fish
	Category:Fish by location
	Category:Fish in heraldry
File:Basset hound 

In [11]:
os.chdir('../data')

#Save similarity into file:
df_similarity=pd.DataFrame.from_dict(similarity_matrix, orient='index')
df_similarity.columns=['similar_articles','scores']

df_similarity.to_csv('Wikipedia-sims.csv')
df_similarity.head()

# Note that the list is printed as a string in the Wikipedia-sims.csv file.
# use literal_eval to convert back to list 
#>>> from ast import literal_eval
#>>> literal_eval('[1.23, 2.34]')
#[1.23, 2.34]


Unnamed: 0,similar_articles,scores
970284,"[22345050, 17430047, 51676708, 729436, 970251,...","[0.367689877748, 0.360884338617, 0.34451672434..."
972913,"[977022, 11037163, 50899941, 1968414, 53984662...","[0.96357101202, 0.818602263927, 0.205982163548..."
970251,"[39353382, 34921437, 970284, 704388, 970360, 3...","[0.887428402901, 0.318142652512, 0.29628589749..."
729436,"[970284, 704388, 50899941, 1968414, 970251, 97...","[0.325745344162, 0.209249347448, 0.19172243773..."
978163,"[978165, 32437215, 43220839, 51511707, 3710421...","[0.963033974171, 0.350213766098, 0.19761410355..."
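
As the comment in the cell notes, the lists end up as strings in the CSV file. Below is a minimal sketch of reading them back with `literal_eval`, using only the standard library. The in-memory CSV text and its ids and scores are made up for illustration; with the real file you would open *Wikipedia-sims.csv* instead.

```python
import csv
import io
from ast import literal_eval

# Hypothetical file content in the same layout as Wikipedia-sims.csv
csv_text = ',similar_articles,scores\n1,"[2, 3]","[0.9, 0.5]"\n'

similar = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    article_id = int(row[""])  # the index column is written with an empty header
    similar[article_id] = (literal_eval(row["similar_articles"]),
                           literal_eval(row["scores"]))

print(similar)
# {1: ([2, 3], [0.9, 0.5])}
```

`literal_eval` only parses Python literals, so it is a safer choice than `eval` for text read from a file.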
