# NLP Model for determining the similarity/relation of Wiki Articles
### Takes two Wikipedia article titles as input and outputs a similarity percentage
![Final Results](dataframe_related.png) <br><br>
Postprocessing is required to load the results into a dataframe; this is shown in <b>TEST. NLTK Similarity Function (pandas).ipynb </b>

### How to use:
e.g. test = similar('Amsterdam','Van Gogh'), or comparewiki.similar('Amsterdam','Van Gogh') when importing the comparewiki.py file. This prints: <br><br>
Probability of topics being related is 63% <br>
Count is 240 and sum is 151.61918693607896 <br><br>
For the larger set of Wikipedia articles that I tested and organized in a dataframe, please refer to <b>TEST. NLTK Similarity Function (pandas).ipynb </b><br><br>

### Background information for developing NLP Model
The NLP model is inspired by the Cosine Similarity Formula, which determines the similarity of two documents. From two text documents (or corpora), create two n-dimensional vectors, where n is the total number of unique words across both documents and each dimension (vector cell) stores the frequency of one word in that document. <br><br>

In the numerator, find the dot product of the two vectors; in the denominator, find the product of the magnitudes (lengths) of the vectors. The resulting value is cos θ, which lies between 0 and 1 because word counts are never negative. If this value is 1 (θ = 0 degrees), the documents are treated as fully identical, while if the value is 0 (θ = 90 degrees), they have no words in common. <br><br>

![Similarity Cosine Formula](similarity_cosine.png) <br><br>

For example, consider the passages 'I love cats and love dogs' and 'I love dogs and love rabbits'. The first passage has the counter {love: 2, and: 1, I: 1, cats: 1, dogs: 1, rabbits: 0}, which is [2,1,1,1,1,0] vectorized. The second passage has the counter {love: 2, and: 1, I: 1, cats: 0, dogs: 1, rabbits: 1}, which is [2,1,1,0,1,1]. <br><br>

The numerator (dot product) is 2x2 + 1x1 + 1x1 + 1x0 + 1x1 + 0x1 = 7, while the denominator is sqrt(2^2+1^2+1^2+1^2+1^2+0^2) * sqrt(2^2+1^2+1^2+0^2+1^2+1^2) = sqrt(8) * sqrt(8) = 8. <br><br>

#### cos θ = 7/8 = 0.875. Hence the documents have 87.5% similarity, as they share all words except for cats and rabbits.
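
The worked example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `cosine_similarity` and the whitespace tokenisation are assumptions made for illustration and are not part of comparewiki.py.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts using raw word counts (whitespace tokens)."""
    counts_a = Counter(text_a.split())
    counts_b = Counter(text_b.split())
    vocabulary = set(counts_a) | set(counts_b)
    # Counter returns 0 for missing words, so both vectors share one vocabulary
    dot = sum(counts_a[w] * counts_b[w] for w in vocabulary)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)

similarity = cosine_similarity('I love cats and love dogs',
                               'I love dogs and love rabbits')
print(round(similarity, 3))  # 0.875
```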

### Issues of cosine similarity algorithm
1. In reality, documents have a large number of unique words (1,000s-10,000s) and it is computationally demanding to determine the similarity with high dimensional word vectors. 
2. The cosine similarity equation gives equal weight to all unique words in the similarity calculation, but Wikipedia articles (and documents in general) contain many uncommon words that should not penalize the final score. For example, Beyonce and Rihanna have different album and track names, but these should not penalize the score too heavily as both artists produce tracks and release albums.
3. Furthermore, documents with a lot of text tend to have a lot of common English words (I, is, am, and, but, for etc.) which are more for grammar and sentence structure purposes. Hence, these words have to be removed, or the similarity score would be skewed upwards.
4. Given the different writing styles and focus of different Wikipedia articles, there will be many differing words, which causes the cosine formula to give a low similarity percentage. However, such articles can still be related when they share many high-frequency words (top 40 words), so the top 40 words should carry much more weight than the other words. <br>
For instance, the Wikipedia articles on Amsterdam and Van Gogh have fundamentally different content, but they are related as Van Gogh lived in the Netherlands and has a posthumous museum in Amsterdam, and Amsterdam is well known for museums. <br>
5. Many Wiki articles (and documents in general) only have a few high frequency words in common that signify that they are related, while the other words are different.
6. Some words are synonyms or near-synonyms (e.g. 'America' vs 'US') and should contribute to the similarity score instead of reducing it.
7. Hence, the idea was to develop a new NLP model to calculate the similarity of Wikipedia articles.
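
Issue 3 can be seen on a toy example. The sketch below compares two unrelated sentences with and without a stoplist; the sentences, the five-word `STOPLIST`, and the helper names are all made up for illustration and do not come from wikionly.py.

```python
import math
from collections import Counter

# Assumed mini stoplist for this example only
STOPLIST = {'the', 'is', 'on', 'and', 'it'}

def word_counts(text, stoplist=frozenset()):
    """Count lowercase whitespace tokens, skipping anything in the stoplist."""
    return Counter(w for w in text.lower().split() if w not in stoplist)

def cosine(counts_a, counts_b):
    """Cosine similarity of two word counters."""
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc_a = 'the cat is on the mat and it is happy'
doc_b = 'the stock is on the rise and it is volatile'

with_stopwords = cosine(word_counts(doc_a), word_counts(doc_b))
without_stopwords = cosine(word_counts(doc_a, STOPLIST),
                           word_counts(doc_b, STOPLIST))
print(round(with_stopwords, 3), without_stopwords)
```

With the grammar words left in, the two sentences score about 0.79 even though their topics differ; once the stoplist is applied they share no words and the score drops to 0.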

### How does the NLP Model work?
1. It uses the Wikipedia web scraper that I built, which produces a word counter of the most common words and their frequencies. After the user enters two Wikipedia article titles, the web scraper (wikionly.py) runs for both articles and gets the word counter for the 40 most common words in each.
2. NLTK's WordNet interface provides a method word1.path_similarity(word2) that calculates the similarity between two words (synsets), including synonyms. The 40 words from one article are compared with the 40 words from the other as a Cartesian product, and each pair gets a similarity score between 0 and 1.
3. As a word can have many definitions, the top two definitions (e.g. dog.n.01 and dog.n.02) are compared with the top two of the other word (e.g. cat.n.01 and cat.n.02) and the maximum similarity score is kept. A try/except block is used to check whether the word has a valid synset for the path similarity comparison.
### A unique mathematical formula is used: Score = Similarity x e^(Factor x Similarity) + 10 x log(0.89 + Similarity)<br><br>
4. Most word pairs have an NLTK similarity value between 0.07 and 0.15. This value is mapped through the formula, much like how the sigmoid function maps values in logistic regression. For a similarity value below 0.11, the log term becomes negative (log(0.89 + 0.11) = log 1 = 0) and begins to penalize the total score. On the other hand, for a similarity value above 0.15, the exponential term amplifies the total score. If the two words are exactly the same, the score increase will be very high.
5. Word pairs that rank highly in frequency in both Wiki articles get a higher Factor. A Factor of 4.5 is given if both words are among the Top 10 most frequent words of their articles, 2 if both are within the Top 20 (but not both in the Top 10), and 1.5 for all remaining pairs within the Top 40.<br><br>
![NLP Model Formula](NLP_Equation2.png) <br><br>
6. The sum of all the scores is divided by the number of comparisons between words of the two documents. The total number of possible comparisons between the 40 words of each document is 40x40=1600, but the actual number is about 300-700 as many words do not exist in the NLTK WordNet corpus. However, if the two words are identical and cannot be looked up in WordNet, both the score and the comparison count are computed outside of NLTK.
7. The sum and count will be printed and shown for each comparison of Wiki articles.
8. Use the <b>words()</b> function to view the 40 most common words of both Wiki articles and <b>ans()</b> to output the answer as a list that can be combined into a Pandas dataframe.
9. The final similarity score has a ceiling of 1 (100% similar/related) and a floor of 0.
<br><br>
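
The scoring steps above can be sketched without NLTK by treating the word-pair similarity as a plain number. The function names `factor_for_ranks`, `pair_score` and `related_percent` are hypothetical and chosen for this example; the constants (4.5, 2, 1.5, 0.89) and the rank thresholds are the ones described in this section.

```python
import math

def factor_for_ranks(rank1, rank2):
    """Factor for a word pair, given each word's 1-based frequency rank."""
    if rank1 <= 10 and rank2 <= 10:
        return 4.5
    if rank1 <= 20 and rank2 <= 20:
        return 2.0
    return 1.5

def pair_score(similarity, factor):
    """Score = Similarity x e^(Factor x Similarity) + 10 x log(0.89 + Similarity)."""
    return (similarity * math.exp(factor * similarity)
            + 10 * math.log(0.89 + similarity))

def related_percent(total_score, count):
    """Average score per comparison as a percentage, clamped to 0-100."""
    if count == 0:
        return None  # no word pair could be compared
    return max(0, min(100, round(total_score / count * 100)))

print(pair_score(0.05, factor_for_ranks(3, 7)) < 0)   # low similarity is penalized
print(pair_score(0.20, factor_for_ranks(3, 7)) > 0)   # higher similarity is rewarded
print(related_percent(151.61918693607896, 240))       # sum and count from the Amsterdam/Van Gogh run
```

The last line feeds in the sum and count printed for the Amsterdam/Van Gogh comparison and arrives at the same 63%.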

### Advantage of NLP Model
1. It only takes the 40 most common words from both Wikipedia articles for computation, reducing the time taken to get results. With a good internet connection and sufficient RAM, the web scraping, data cleaning, and NLP model take less than 10 seconds to determine if two Wikipedia articles are related.
2. It has a penalty for words that are not similar and a reward (score increase) for similar words (NLTK similarity > 0.11), much like how the cosine similarity model works.
3. It scales up the score for words that are similar and appear frequently in both Wikipedia articles. In many articles, only a few frequent words will hint that the articles are actually related, while the other content is different.
4. It has a stoplist to remove common English words, and the stoplist is customizable according to the needs of the developer.
5. It is able to detect synonyms and award scores for them.
6. With small modifications, you can input two lists of the most frequent words from two documents to run this model instead of scraping Wikipedia articles. (Within the percent(self,input1,input2) function, modify the line self.wiki1slice = list(key) to use your own word list.)
7. <b>The difference between NLP model and cosine similarity is that the former is trying to find if two documents/articles are related/similar, while cosine similarity is a 'hard match' of how much the contents of both documents tally.</b>
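
For point 6, a list of frequent words can be built from any text with the standard library. This is only a sketch under stated assumptions: `top_words`, the tiny `STOPLIST`, and the sample sentence are invented for illustration, and the real scraper in wikionly.py may clean and rank words differently.

```python
import re
from collections import Counter

# Assumed tiny stoplist for illustration; a real one would be much longer
STOPLIST = {'the', 'a', 'an', 'and', 'is', 'in', 'of', 'to', 'for'}

def top_words(text, n=40, stoplist=STOPLIST):
    """Return the n most frequent words in text, most frequent first."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stoplist)
    return [word for word, _ in counts.most_common(n)]

sample = 'The museum in Amsterdam is a museum of art, and the art is for the city.'
print(top_words(sample, n=3))
```

A list produced this way could then stand in for the words that the scraper returns for each article.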

### Dependencies and libraries used:
wikionly.py (adapted from wikiscrape.py with fewer features; please see my other project https://github.com/kohjiaxuan/Wikipedia-Article-Scraper) <br>
nltk, re, math
<br>
### Libraries used in wikionly/wikiscrape.py:
requests, bs4, collections, matplotlib, re, os, nltk (optional, only if using stoplist)

### Feedback/Suggestions/Bugs:
Please contact me on LinkedIn: https://www.linkedin.com/in/kohjiaxuan/

In [1]:
import wikionly #script name is wikionly (no summary), class name is wiki
import re
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet
import math

#Input two Wikipedia articles to compute similarity percentage
class similar:
    def __init__(self,text1,text2):

        self.wn = nltk.corpus.wordnet #the corpus reader

        #Error handling: check that both arguments are strings before any scraping is done
        if not isinstance(text1, str):
            raise TypeError('Error! The first argument is not a string format!')
        if not isinstance(text2, str):
            raise TypeError('Error! The second argument is not a string format!')
        self.text1 = text1
        self.text2 = text2

        #Run internal wikipedia python file for processing for both wiki titles
        self.wiki1 = wikionly.wiki(text1)
        self.wiki2 = wikionly.wiki(text2)

        #Call the function that calculates percentage
        self.percent(self.wiki1,self.wiki2)
        
        #call the function that shows list of words for both Wiki sites, disabled
        #self.words()
        
    #Retrieve top 40 common words from wiki page, slice up and append .n.01 for NLTK usage
    def percent(self,input1,input2):
        self.dotn01 = ('.','n','.','0','1')
        self.wiki1list = []
        for key in self.wiki1.commonwords(40):
            self.wiki1slice = list(key)
            for letter in self.dotn01:
                self.wiki1slice.append(letter)
            self.wiki1slice = ''.join(self.wiki1slice)
            self.wiki1list.append(self.wiki1slice)

        self.wiki2list = []
        for key in self.wiki2.commonwords(40):
            self.wiki2slice = list(key)
            for letter in self.dotn01:
                self.wiki2slice.append(letter)
            self.wiki2slice = ''.join(self.wiki2slice)
            self.wiki2list.append(self.wiki2slice)
        
        #count and sum for calculating similarity
        self.count = 0
        self.sum = 0
        #A count for the ranking of the word (how often it appears in both wiki passages)
        self.topten1 = 0
        self.topten2 = 0

        #For words that are 1st-10th and 11th-20th in popularity, if both wiki pages have the word, they get more points
        for word1 in self.wiki1list:
            #Reset self.topten2
            self.topten2 = 0
            self.topten1 += 1
            for word2 in self.wiki2list:
                self.topten2 += 1
                #reinitialize to zero to prevent old sums from going into maxsum
                self.sum1 = 0
                self.sum2 = 0
                self.sum3 = 0
                self.sum4 = 0
                self.maxsum = 0
                
                if self.topten1 < 11 and self.topten2 < 11:
                    self.expvalue = 4.5 #3.5
                elif self.topten1 < 21 and self.topten2 < 21:
                    self.expvalue = 2
                else:
                    self.expvalue = 1.5
                
                #Main algorithm for calculating score of words
                try:
                    if re.findall(r"\d+.n.01", word1) == [] and re.findall(r"\d+.n.01", word2) == []: #check both words not numbers
                        #since words have many meanings, for every pair of words, use top two meanings n.01 and n.02 for comparison
                        #two for loops will check every permutation pair of words between wiki pages, two meanings for each word, 
                        #Take the max similarity value taken for computation of similarity index
                        #e.g. money.n.01 may have highest value with value.n.02 because value.n.01 has the obvious meaning of worth/significance and secondary for money
                        word11 = word1.replace('n.01','n.02')
                        word22 = word2.replace('n.01','n.02')
                        #print(word11,word22)
                        self.x = self.wn.synset(word1)
                        self.y = self.wn.synset(word2)
                        #get default similarity value of 1st definitions of word
                        self.sum1 = self.x.path_similarity(self.y) * math.exp(self.expvalue * self.x.path_similarity(self.y)) + 10 * math.log(0.885+self.x.path_similarity(self.y))
                        try: #get 2nd definitions of words and their similarity values, if it exist
                            self.xx = self.wn.synset(word11)
                            self.yy = self.wn.synset(word22)
                            self.sum2 = self.xx.path_similarity(self.y) * math.exp(self.expvalue * self.xx.path_similarity(self.y)) + 10 * math.log(0.89+self.xx.path_similarity(self.y))
                            self.sum3 = self.x.path_similarity(self.yy) * math.exp(self.expvalue * self.x.path_similarity(self.yy)) + 10 * math.log(0.89+self.x.path_similarity(self.yy))
                            self.sum4 = self.xx.path_similarity(self.yy) * math.exp(self.expvalue * self.xx.path_similarity(self.yy)) + 10 * math.log(0.89+self.xx.path_similarity(self.yy))
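                        except:
                            #If a second definition cannot be compared, this word pair is skipped entirely (sum1 is not counted)
                            continue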
                        self.maxsum = max(self.sum1,self.sum2,self.sum3,self.sum4) #get the max similarity value between 2 words x 2 meanings = 4 comparisons
                        #print(word1, word2, self.maxsum)
                        self.sum += self.maxsum
                        self.count += 1
                except:
                    if word1 == word2 and re.findall(r"\d+.n.01", word1) == []: #remove years/numbers being counted as match yyyy.n.01
                        self.sum += math.exp(self.expvalue) + 10 * math.log(1.89)
                        self.count += 1
                    else:
                        continue

        #Print the results and clamp the percent to the 0-100 range
        #Note: assigning to self.percent replaces the percent() method on this instance with the numeric result
        if self.count != 0:
            self.percent = round(self.sum/self.count*100)
            if self.percent > 100:
                self.percent = 100
            elif self.percent < 0:
                self.percent = 0
            print('Probability of topics being related is ' + str(self.percent) + '%')
            print('Count is ' + str(self.count) + ' and sum is ' + str(self.sum))
            print('\n')
        else:
            print('No relation index can be calculated as words are all foreign')
            self.percent = 0 #fall back to 0 so that ans() still has a number to work with
            
        return self.percent
        
    #Print out list of common words for both Wiki articles
    def words(self):
        print(self.wiki1list)
        print('\n')
        print(self.wiki2list)
        
    #Outputs list of results [Article 1, Article 2, Percentage, Yes/No] that can be put into a dataframe
    def ans(self):
        self.listans = [self.text1,self.text2,self.percent]
        if self.percent > 49:
            self.listans.append('Yes')
        else:
            self.listans.append('No')
        return self.listans
    
    def help(self):
        print("To start, assign var = comparewiki.similar('arg1','arg2'). To get values in a list for storage, use .ans(). To get the 40 common words for comparison, use .words()")
            

            

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\jx\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
test = similar('Amsterdam','Van Gogh')

Probability of topics being related is 63%
Count is 240 and sum is 151.61918693607896


