Using minidom to parse XML

ref: https://www.tutorialspoint.com/python/python_xml_processing.htm

In [1]:
import sys, os
from xml.dom import minidom

In [2]:
os.chdir('../data')
xmlfile = 'Wikipedia-20170717213140.xml'

In [3]:
xmldoc = minidom.parse(xmlfile)
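# collect every <title>, <id> and <text> element in the document
# note: getElementsByTagName searches all descendants, so idlist may also
# include <id> elements nested under <revision> and <contributor>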
titlelist = xmldoc.getElementsByTagName('title')
idlist = xmldoc.getElementsByTagName('id')
textlist = xmldoc.getElementsByTagName('text')

In [4]:
# number of wiki articles in our file
print(len(titlelist))

147


In [5]:
# print article titles
for item in titlelist:
    print(item.childNodes[0].data)

Category:Dog shows and showing
Category:Dog health
Category:Dog organizations
Category:Dog sports
Category:Dogs as pets
Category:Dog equipment
Category:Dog breeding
Category:Dog monuments
Kennel
Cynology
Category:Dog training and behavior
Category:Dog law
Category:Dog-related professions and professionals
Category:Dogs in popular culture
Category:Dog breeds
Pack (canine)
Rare breed (dog)
Category:Dog types
Category:Deaths due to dog attacks
Dogs in ancient China
Dog biscuit
Category:Wikipedia books on dogs
Breed type (dog)
Category:Mythological dogs
Canid hybrid
Canine physical therapy
Dogs in Mesoamerica
Interbreeding of dingoes with other domestic dogs
Canine reproduction
PDSA Certificate for Animal Bravery or Devotion
Category:Robotic dogs
Cropping (animal)
Pet Check Technology
Dog daycare
Dogs in religion
The Ten Commandments of Dog Ownership
Roan (color)
List of beagle, harrier and basset packs of the United Kingdom
Category:Dog stubs
Category:Individual dogs
Panhu
The Dog Pillow


In [6]:
# print article IDs
for item in idlist:
    print(item.childNodes[0].data)

970284
732620169
16621713
972913
748117961
25152539
970251
641034775
5957048
729436
547353011
6569922
978163
599235326
1215485
1764821
547372987
6569922
1765233
547375119
6569922
1765458
747942371
48734
1467938
772563016
13286072
275388
773697329
13286072
970360
693299905
25152539
1968414
144853393
1777080
740619279
278097
2460099
746201717
22066013
691500
790195106
1879566
2352562
785809441
17021807
762623658
704388
548518791
259798
22806874
604361169
4635357
20777185
759352260
236191
4020758
752689112
27545231
365072956
348521
17430047
750324682
27823944
2460103
548871376
259798
2676271
788078379
22044074
746844710
15420856
19282291
783071363
23416702
786918856
30707369
5740890
786723340
30707369
31752133
691460390
1398
32122845
547755696
6569922
33069847
789344300
94794
32912733
531505194
9380977
24243398
789207125
28903366
17862013
786780180
30707369
34442389
789597244
1258165
178449
771661662
279219
34949251
737565386
27738727
1396693
762378101
25152539
31823444
715433902
203786
3
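
A note on the ID list above: in a MediaWiki export each `<page>` carries its own `<id>`, and the `<revision>` and `<contributor>` elements nested inside it typically carry `<id>` elements as well. `getElementsByTagName('id')` returns all of them, which would explain why the ID list is longer than the title list. The sketch below shows one possible way to read only the page-level id alongside the title and text. The sample XML and the helper names `direct_child_text` and `page_records` are made up for illustration; the real export may contain cases this sketch does not handle.

```python
from xml.dom import minidom

# Tiny stand-in for a MediaWiki export (hypothetical data, same nesting idea).
SAMPLE = """<mediawiki>
  <page>
    <title>Kennel</title>
    <id>101</id>
    <revision>
      <id>102</id>
      <contributor><username>ExampleUser</username><id>103</id></contributor>
      <text>A kennel is a shelter for dogs.</text>
    </revision>
  </page>
  <page>
    <title>Cynology</title>
    <id>104</id>
    <revision>
      <id>105</id>
      <contributor><username>OtherUser</username><id>106</id></contributor>
      <text>Cynology is the study of dogs.</text>
    </revision>
  </page>
</mediawiki>"""


def direct_child_text(parent, tag):
    """Return the text of the first *direct* child element named `tag`."""
    for node in parent.childNodes:
        if node.nodeType == node.ELEMENT_NODE and node.tagName == tag:
            return node.firstChild.data if node.firstChild else ''
    return None


def page_records(doc):
    """Build one (page_id, title, text) tuple per <page> element."""
    records = []
    for page in doc.getElementsByTagName('page'):
        title = direct_child_text(page, 'title')
        page_id = direct_child_text(page, 'id')  # ignores revision/contributor ids
        text_nodes = page.getElementsByTagName('text')
        has_text = bool(text_nodes) and text_nodes[0].firstChild is not None
        text = text_nodes[0].firstChild.data if has_text else ''
        records.append((page_id, title, text))
    return records


doc = minidom.parseString(SAMPLE)
records = page_records(doc)
print(len(doc.getElementsByTagName('id')))  # all ids, nested ones included
print(records)
```

Working page by page keeps each id, title and text together, so nothing depends on three separate lists happening to line up by position.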

In [7]:
# print text of first article
print(textlist[0].childNodes[0].data)

{{Cat main|Conformation show|Show dog}}
{{portal|Dogs}}
This is an automatically collected list of articles related to showing dogs for their appearance in [[conformation show]]s.

[[Category:Dogs|Shows and showing]]
[[Category:Competitions]]
[[Category:Animal shows| Dog]]


Compare this with the original XML:

![Capture.PNG](attachment:Capture.PNG)

Now let's use nltk to tokenize and clean up the text

In [8]:
import nltk

In [9]:
tokenizer = nltk.RegexpTokenizer(r'\w+')
text = textlist[0].childNodes[0].data.lower()
ttext = tokenizer.tokenize(text)
print(ttext)

['cat', 'main', 'conformation', 'show', 'show', 'dog', 'portal', 'dogs', 'this', 'is', 'an', 'automatically', 'collected', 'list', 'of', 'articles', 'related', 'to', 'showing', 'dogs', 'for', 'their', 'appearance', 'in', 'conformation', 'show', 's', 'category', 'dogs', 'shows', 'and', 'showing', 'category', 'competitions', 'category', 'animal', 'shows', 'dog']
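
The stray `'s'` token in the output above comes from the wiki link `[[conformation show]]s`: the pattern `\w+` keeps runs of word characters and treats everything else, including the closing brackets, as a separator. The sketch below reproduces this with `re.findall`, used here as a stand-in for `RegexpTokenizer` so the example runs without NLTK (an assumption made for illustration; `word_tokens` is a hypothetical helper name).

```python
import re


def word_tokens(text):
    # Lowercase, then keep only runs of word characters (letters, digits, underscore).
    return re.findall(r'\w+', text.lower())


link_tokens = word_tokens("[[conformation show]]s.")
template_tokens = word_tokens("{{Cat main|Conformation show|Show dog}}")
print(link_tokens)
print(template_tokens)
```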


The article above was a category page. Let's see what happens for a true article with links, etc. in the text. Looking at the list of titles, the 9th title, 'Kennel', looks like an article.

In [10]:
# print text of 9th article
text = textlist[8].childNodes[0].data.lower()
print(text)
ttext = tokenizer.tokenize(text)
print(ttext)

{{about|shelter for dogs and cats}}
{{for|the article about a shed built to shelter a dog|doghouse}}
[[image:dog kennel mason.jpg|thumb|a dog sits in front of a typical kennel panel]]
a '''kennel''' is a structure or shelter for [[dog]]s or [[cat]]s. used in the plural, ''the kennels'', the term means any building, collection of buildings or a property in which dogs or cats are housed, maintained, and (though not in all cases) bred. a kennel can be made out of various materials, the most popular being wood and canvas. 

==breeding kennels==
this is a formal establishment for the propagation of animals, whether or not they are actually housed in a separate shed, the garage, a state-of-the-art facility, or the family dwelling. licensed breeding kennels are heavily regulated and must follow relevant government legislation. [[breed club (dog)|breed club]] members are expected to comply with general code of ethics and guidelines applicable to the breed concerned. [[kennel club|kennel counci

Use topic modeling to see how similar articles are:

refs: http://www.datasciencebytes.com/bytes/2014/12/30/topic-modeling-of-shakespeare-characters/
    http://www.datasciencebytes.com/bytes/2014/11/20/using-topic-modeling-to-find-related-blog-posts/

In [11]:
import pandas as pd
from collections import defaultdict
from gensim import corpora, models, similarities

In [12]:
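# organize articles into a dataframe, keeping only those with at least min_length words
# (open question: do short articles need to be excluded at all?)
# caution: zip(idlist, textlist) pairs elements by position, so the 'id' column
# is only the page id if the file has exactly one <id> per <text>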
min_length = 50  # minimum word count
word_data = [(item.childNodes[0].data, tokenizer.tokenize(text.childNodes[0].data.lower()))
             for item, text in zip(idlist, textlist)
             if len(tokenizer.tokenize(text.childNodes[0].data.lower())) >= min_length]
word_data_df = pd.DataFrame(word_data, columns=['id', 'words'])
word_data_df.to_csv('word_data_df.csv')
word_data_df.head()

Unnamed: 0,id,words
0,5957048,"[about, shelter, for, dogs, and, cats, for, th..."
1,729436,"[cynology, ipac, en, s, ᵻ, ˈ, n, ɒ, l, ə, dʒ, ..."
2,1215485,"[category, diffuse, cat, main, dog, breed, lis..."
3,1764821,"[other, uses, wolfpack, disambiguation, image,..."
4,547372987,"[for, a, list, of, rare, dog, breeds, category..."


In [13]:
word_data_df.tail()

Unnamed: 0,id,words
104,787653468,"[about, the, cat, species, that, is, commonly,..."
105,41493138,"[infobox, disease, name, cat, bite, image, cat..."
106,701109732,"[italic, title, taxobox, image, name, ancylost..."
107,754619,"[further, feline, zoonosis, see, cat, health, ..."
108,849619,"[infobox, holiday, holiday_name, international..."


# gensim dictionary
https://radimrehurek.com/gensim/corpora/dictionary.html

* compactify()


Assign new word ids to all words.

This is done to make the ids more compact, e.g. after some tokens have been removed via filter_tokens() and there are gaps in the id series. Calling this method will remove the gaps.

* dfs

Document frequencies: a dict mapping each token id to the number of documents that contain that token.

* Download the stop words by running:

nltk.download('stopwords')
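
To make the two ideas above concrete, here is a plain-Python sketch of document frequencies and id compaction. It is an illustration on a made-up three-document corpus, not gensim's implementation, and the variable names (`token2id`, `dfs`, `kept`, `compact`) are chosen only to echo the gensim terms.

```python
from collections import Counter

# Made-up corpus: each document is a list of tokens.
docs = [
    ['dog', 'shed', 'kennel'],
    ['cat', 'kennel'],
    ['dog', 'breed'],
]

# Assign integer ids in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# Document frequency: in how many documents does each token id occur?
dfs = Counter(token2id[token] for doc in docs for token in set(doc))

# Drop tokens seen in fewer than 2 documents; the surviving ids now have gaps.
kept = {token: tid for token, tid in token2id.items() if dfs[tid] >= 2}

# "Compactify": renumber the survivors 0..n-1, preserving their relative order.
by_old_id = sorted(kept.items(), key=lambda pair: pair[1])
compact = {token: new_id for new_id, (token, _) in enumerate(by_old_id)}

print(kept)
print(compact)
```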

In [14]:
# make gensim dictionary

line_list = word_data_df['words'].values
dictionary = corpora.Dictionary(line_list)
# dictionary: 0: "about", 1:"shelter",...

# filter dictionary to remove stopwords and words that occur in fewer than min_count documents
# needs the NLTK stopwords corpus: nltk.download('stopwords') fetches just that corpus
# (a full nltk.download() put 3 GB into C:\Users\melanie\AppData\Roaming\nltk_data)
stop_words = nltk.corpus.stopwords.words('english')
print("Stop words: {}\n".format(stop_words[:5]))

stop_ids = [dictionary.token2id[word] for word in stop_words
            if word in dictionary.token2id]
min_count = 2
# dictionary.dfs maps token id -> number of documents containing the token
rare_ids = [token_id for token_id, freq in dictionary.dfs.items()
            if freq < min_count]
dictionary.filter_tokens(stop_ids + rare_ids)
print("Dictionary after filtering:")
print([(key,dictionary[key]) for key in dictionary.keys()[1:5]])
dictionary.compactify()


Stop words: ['i', 'me', 'my', 'myself', 'we']

Dictionary after filtering:
[(1, 'dogs'), (2, 'cats'), (3, 'article'), (4, 'shed')]


## doc2bow()

1. counts the number of occurrences of each distinct word
2. converts the word to its integer word id 
3. returns the result as a sparse vector. 

For example, in the gensim tutorial (whose dictionary has twelve words), the sparse vector [(0, 1), (1, 1)] reads: in the document “Human computer interaction”, the words computer (id 0) and human (id 1) appear once; the other ten dictionary words appear (implicitly) zero times.
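
The three steps can be sketched in plain Python as follows. This is a simplified illustration rather than gensim's own code; `doc2bow_sketch` and the three-word `token2id` mapping are hypothetical, and words missing from the mapping are simply skipped.

```python
from collections import Counter


def doc2bow_sketch(tokens, token2id):
    # 1. count occurrences, 2. map each word to its integer id,
    # 3. return a sparse list of (id, count) pairs sorted by id.
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())


token2id = {'computer': 0, 'human': 1, 'interface': 2}
bow = doc2bow_sketch(['human', 'computer', 'interaction'], token2id)
print(bow)  # [(0, 1), (1, 1)]
```

Here 'interaction' is not in the mapping, so it contributes nothing, and 'interface' is absent from the document, so it is left out of the sparse vector.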

https://radimrehurek.com/gensim/models/tfidfmodel.html

**TF-IDF model**

https://en.wikipedia.org/wiki/Tf%E2%80%93idf

*TF-IDF*, short for **term frequency–inverse document frequency**, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. Nowadays, tf-idf is one of the most popular term-weighting schemes. For instance, 83% of text-based recommender systems in the domain of digital libraries use tf-idf.

**Term frequency**. The number of times a term occurs in a document is called its term frequency.

The first form of term weighting is due to Hans Peter Luhn (1957) and is based on the Luhn Assumption:
The weight of a term that occurs in a document is simply proportional to the term frequency.

**Inverse document frequency**. An inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set (e.g. "the", "a") and increases the weight of terms that occur rarely.

Karen Spärck Jones (1972) conceived a statistical interpretation of term specificity called Inverse Document Frequency (IDF), which became a cornerstone of term weighting:
The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.

tf–idf is the product of two statistics, term frequency and inverse document frequency.

A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf-idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf-idf closer to 0.
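
A small worked example may help. The sketch below uses the textbook form tfidf(t, d) = tf(t, d) · log(N / df(t)), with raw counts for tf and the natural logarithm; both choices are assumptions made for illustration. gensim's `TfidfModel` applies its own weighting and normalisation, so its numbers will differ. The corpus and the name `tfidf_weights` are made up.

```python
import math
from collections import Counter

# Made-up corpus of three tokenized documents.
docs = [
    ['dog', 'kennel', 'dog'],
    ['cat', 'kennel'],
    ['dog', 'cat', 'kennel', 'rabies'],
]
n_docs = len(docs)

# Document frequency: number of documents in which each term occurs.
df = Counter(term for doc in docs for term in set(doc))


def tfidf_weights(doc):
    # Raw term count times log(N / df); assumes every term was counted in df.
    tf = Counter(doc)
    return {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}


first = tfidf_weights(docs[0])
last = tfidf_weights(docs[2])
print(first)
print(last)
```

'kennel' occurs in every document, so the ratio inside the logarithm is 1 and its weight is 0, while 'rabies' occurs in only one document and receives the largest idf, log(3).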







In [15]:
corpus = [dictionary.doc2bow(words) for words in line_list]
print("Corpus: {}".format(corpus[1][1:5]))
tfidf = models.TfidfModel(corpus)


Corpus: [(6, 36), (9, 2), (17, 1), (18, 1)]


In [21]:
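model = models.LsiModel
max_posts = len(word_data_df['words'].values)

# fit a 5-topic LSI model on the tf-idf weighted corpus
topic_model = model(tfidf[corpus], id2word=dictionary, num_topics=5)
for topic in topic_model.print_topics(5):
    print('\n' + str(topic))

# index every article in topic space; with num_best set, each query
# returns (document index, score) pairs, best match first
index = similarities.MatrixSimilarity(topic_model[tfidf[corpus]], num_best=max_posts + 1)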



(0, '-0.320*"cat" + -0.234*"cats" + -0.187*"dog" + -0.158*"journal" + -0.122*"ref" + -0.112*"breeds" + -0.105*"meat" + -0.104*"name" + -0.087*"doi" + -0.086*"cite"')

(1, '-0.669*"bailys" + -0.285*"hunt" + -0.278*"foxhounds" + -0.235*"directory" + -0.233*"england" + -0.181*"harriers" + -0.173*"beagles" + -0.154*"packs" + -0.145*"hunting" + -0.138*"liams"')

(2, '-0.358*"cat" + 0.354*"dog" + 0.320*"breeds" + -0.255*"cats" + 0.238*"breed" + -0.214*"cafe" + 0.180*"list" + 0.143*"kennel" + 0.119*"dogs" + 0.111*"types"')

(3, '-0.392*"breeds" + -0.316*"cat" + -0.279*"cafe" + -0.233*"breed" + -0.190*"list" + -0.174*"types" + 0.123*"bites" + 0.121*"bite" + -0.112*"café" + -0.109*"kennel"')

(4, '-0.340*"meat" + 0.236*"bites" + 0.210*"bite" + -0.172*"festival" + -0.169*"china" + 0.163*"breeds" + 0.153*"rabies" + 0.146*"cdc" + -0.120*"cafe" + 0.115*"infection"')


In [26]:
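article_ids = word_data_df['id']

# for each article, collect (other article id, score) pairs, most similar first
similarity_scores = defaultdict(list)
for article_id, sims in zip(article_ids, index):
    # skip the first hit, which is assumed to be the article itself
    for idx, score in sims[1:]:
        similarity_scores[article_id].append((article_ids[idx], score))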


In [38]:
article_id = article_ids.iloc[0]
print("Similarity scores for article ID {}".format(article_id))
print(similarity_scores[article_id][:5])

Similarity scores for article ID 5957048
[('547372987', 0.95294737815856934), ('747942371', 0.95260119438171387), ('604361169', 0.94745504856109619), ('30707369', 0.94608569145202637), ('2460099', 0.93998193740844727)]
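
The scores returned by `MatrixSimilarity` are cosine similarities between the articles' topic vectors. As a rough sketch of what that means, the code below ranks three made-up two-dimensional vectors against the first one. The vectors and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and this is not how gensim computes the index internally.

```python
import math


def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, vectors):
    # Score every vector against the query and sort, best match first.
    scores = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up topic-space vectors for three articles.
topic_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
ranking = rank_by_similarity(topic_vectors[0], topic_vectors)
print(ranking)
```

The query article matches itself with a score of about 1.0 and therefore comes first, which is why the loop in the notebook drops the first hit before storing the scores.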
