<h1>Document Similarity using LSI</h1>

<ol>
<li>From Wikipedia’s Lists of musicians page (https://en.wikipedia.org/wiki/Lists_of_musicians), pick five lists of
musicians (e.g., List of big band musicians).
<li>Collect the URLs of all the musicians on those five pages and place them in a list.
<li>Fetch the page content for each musician in that list and place the texts in a list of documents.
<li>Build an LSI model using this data. This is the "reference" data set.
<li>Grab another list of musicians from Wikipedia and create a new list of documents using the text from each musician's page. This is the "musician" data set.
<li>For each musician in the new list, find the musician in the reference data set that is closest in similarity.
<li>Print a table that contains each musician from the musician data set and the most similar musician from the reference data set.
</ol>

<p><span style="color:blue">get_musicians</span>: A function that, given the URL of a "list of musicians" page, returns a list of (name, url) pairs containing each musician's name and the URL of their Wikipedia page.
<p><span style="color:blue">non_musician_finder</span>: A helper that returns False for links that do not point to a musician's page (category, template, and list pages, for example). The filter is not perfect, but it is good enough for this exercise.

In [None]:
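from bs4 import BeautifulSoup
import requests

def get_musicians(url):
    """Return a list of (name, url) pairs for the musicians linked from a Wikipedia list page."""
    page_soup = BeautifulSoup(requests.get(url).content, 'lxml')
    all_musicians = list()
    for tag in page_soup.find_all('li'):
        # Skip list items that carry an id attribute
        if tag.get('id'):
            continue
        # Use the first link in the list item; skip items without a usable link
        a_tag = tag.find('a')
        if a_tag is None or not a_tag.get('href'):
            continue
        link = a_tag.get('href')
        name = a_tag.get_text()
        # Only site-relative article links can safely be prefixed with the Wikipedia host
        if link.startswith('/wiki/') and non_musician_finder(link):
            all_musicians.append((name, "https://en.wikipedia.org" + link))
    return all_musicians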

def non_musician_finder(link):
    """Return False if the link contains a word that marks a non-musician page."""
    non_musician_words = ['Category', 'Template', 'Portal', 'List', 'File', 'Special', 'Main', 'Help', 'User']
    for word in non_musician_words:
        if word in link:
            return False
    return True
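
<p>The substring test in non_musician_finder also rejects any musician whose page title happens to contain one of the words (for example, a title beginning with "Listener"). The following is a minimal sketch of a stricter alternative that checks namespace prefixes instead. The name <code>is_article_link</code> and the prefix list are assumptions made for illustration; they are not part of the original notebook.

```python
from urllib.parse import unquote

# Namespace prefixes that mark non-article pages (adapted from the word list above)
NON_ARTICLE_PREFIXES = ('Category:', 'Template:', 'Portal:', 'File:',
                        'Special:', 'Help:', 'User:')

def is_article_link(link):
    """Hypothetical helper: True if a site-relative link looks like an article page."""
    if not link.startswith('/wiki/'):
        return False
    # Decode percent-escapes so an encoded colon is still recognized
    title = unquote(link[len('/wiki/'):])
    if title.startswith(NON_ARTICLE_PREFIXES):
        return False
    # List pages and the main page are articles in form, but not musicians
    if title.startswith(('List_of_', 'Lists_of_')) or title == 'Main_Page':
        return False
    return True

# Example paths (chosen for illustration)
for path in ['/wiki/G-Dragon', '/wiki/Listener_(band)',
             '/wiki/Category:K-pop', '/wiki/List_of_K-pop_musicians']:
    print(path, is_article_link(path))
```

<p>Like the original filter, this is a heuristic: it still accepts ordinary articles that are not about musicians.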

In [None]:
url = "https://en.wikipedia.org/wiki/List_of_K-pop_musicians"
get_musicians(url)

<h4>get_musician_text(url): returns the text of the Wikipedia page associated with a musician, or None if the page cannot be retrieved</h4>

In [None]:
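from bs4 import BeautifulSoup
import requests

def get_musician_text(url):
    """Return the paragraph text of a page, or None if the request fails."""
    try:
        response = requests.get(url)
        # Treat HTTP error responses (e.g., 404) as failures instead of parsing the error page
        response.raise_for_status()
    except requests.RequestException:
        return None
    page_soup = BeautifulSoup(response.content, 'lxml')
    # Join paragraphs with a space so words at paragraph boundaries are not fused together
    return ' '.join(p_tag.get_text() for p_tag in page_soup.find_all('p'))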
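
<p>To see which text the function keeps without making a network request, the sketch below applies the same idea (collect only the text inside <code>&lt;p&gt;</code> elements) to a small HTML string. It uses the standard library's <code>html.parser</code> as a stand-in for BeautifulSoup; the class <code>ParagraphTextParser</code>, the function <code>extract_paragraph_text</code>, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser

class ParagraphTextParser(HTMLParser):
    """Hypothetical helper: collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False   # True while inside a <p> element
        self.paragraphs = []        # one string per paragraph

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_paragraph = True
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_paragraph = False

    def handle_data(self, data):
        # Text of nested inline tags (e.g., <b>) is still inside the paragraph
        if self.in_paragraph:
            self.paragraphs[-1] += data

def extract_paragraph_text(html):
    parser = ParagraphTextParser()
    parser.feed(html)
    return ' '.join(parser.paragraphs)

sample_html = ("<html><body><h1>Title</h1><p>First <b>paragraph</b>.</p>"
               "<div>skip me</div><p>Second paragraph.</p></body></html>")
print(extract_paragraph_text(sample_html))
```

<p>Headings, navigation, and other non-paragraph text are dropped, which is why the documents fed to the model contain mostly article prose.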


<h4>Testing get_musician_text</h4>

In [None]:
url = "https://en.wikipedia.org/wiki/G-Dragon"
get_musician_text(url)

<p><span style="color:blue">get_all_musicians</span>: A function that, given a list of genres, returns the names of the musicians on the corresponding list pages and the URLs of their Wikipedia pages. Each genre is the part of the list page's title that follows "List_of_" (e.g., 'jazz_musicians').
<p>The function returns a single list of (name, url) pairs for all the musicians across the genres in the list.

In [None]:
def get_all_musicians(genre_list):
    all_musicians = list()
    for genre in genre_list:
        url = 'https://en.wikipedia.org/wiki/List_of_' + genre
        all_musicians += get_musicians(url)
    
    return all_musicians
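
<p>Because a musician can appear on more than one list page, the combined list may contain the same page several times. The following is a minimal sketch of one way to remove such repeats before fetching any documents. The helper name <code>dedupe_musicians</code> and the sample pairs are assumptions for illustration and are not part of the original notebook.

```python
def dedupe_musicians(pairs):
    """Hypothetical helper: drop repeated (name, url) pairs by URL.

    Keeps the first occurrence of each URL and preserves the original order.
    """
    seen_urls = set()
    unique = []
    for name, url in pairs:
        if url not in seen_urls:
            seen_urls.add(url)
            unique.append((name, url))
    return unique

# Made-up pairs: the same page reached from two different lists
sample_pairs = [
    ("Artist A", "https://en.wikipedia.org/wiki/Artist_A"),
    ("Artist B", "https://en.wikipedia.org/wiki/Artist_B"),
    ("Artist A", "https://en.wikipedia.org/wiki/Artist_A"),
]
unique_pairs = dedupe_musicians(sample_pairs)
print(len(sample_pairs), "->", len(unique_pairs))
```

<p>Deduplicating by URL rather than by name keeps two different musicians who share a name.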

<h4>Example of how to use get_all_musicians</h4>

In [None]:
genre_list = ['bluegrass_musicians','British_blues_musicians','country_blues_musicians','jazz_blues_musicians','jazz_musicians']
all_musicians = get_all_musicians(genre_list)
all_musicians

<p><span style="color:blue">get_all_musician_docs</span>: A function that, given the list of (musician, url) pairs, returns two parallel lists of the same size: a list of musician names and a list of documents. Musicians whose page text cannot be retrieved are skipped.

In [None]:
def get_all_musician_docs(all_musicians):
    musician_names = list()
    musician_texts = list()
    for name, url in all_musicians:
        # Fetch each page only once; skip musicians whose page could not be retrieved
        text = get_musician_text(url)
        if text is None:
            continue
        musician_names.append(name)
        musician_texts.append(text)
    return musician_names, musician_texts
        

<h4>Example of how to use get_all_musician_docs</h4>

In [None]:
reference_names,reference_docs = get_all_musician_docs(all_musicians)

In [None]:
reference_docs

<h3>Set up the LSI model</h3>
<ul>
<li>reference_docs is the list of documents
<li>Construct texts, dictionary, and corpus
<li>Construct an LSI model. Use 5 topics initially
</ul>
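
<p>The three intermediate objects can be illustrated in plain Python on a toy example. This is only a sketch of the idea: the documents, the stop-word set, and the names prefixed with <code>toy_</code> are made up, and gensim's Dictionary may assign different integer ids than the simple scheme used here.

```python
from collections import Counter

# Toy documents standing in for reference_docs (made up for illustration)
toy_docs = [
    "The band played jazz and blues",
    "Jazz musicians played jazz with the band",
]
# A tiny stand-in for gensim's STOPWORDS set
toy_stopwords = {"the", "and", "with"}

# texts: one list of lowercase tokens per document, stop words removed
toy_texts = [[word for word in doc.lower().split()
              if word not in toy_stopwords and word.isalnum()]
             for doc in toy_docs]

# dictionary: map each distinct token to an integer id (first-seen order)
token2id = {}
for text in toy_texts:
    for word in text:
        token2id.setdefault(word, len(token2id))

# corpus: each document as a sorted list of (token_id, count) pairs
toy_corpus = [sorted(Counter(token2id[word] for word in text).items())
              for text in toy_texts]

print(toy_texts)
print(token2id)
print(toy_corpus)
```

<p>The LSI model is then fitted on the bag-of-words corpus, which is why every new document must be converted with the same dictionary before it can be compared.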

In [None]:

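from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora, models, similarities

documents = reference_docs
# Tokenize on whitespace, lowercase, and drop stop words.
# Note: isalnum() also drops any token with attached punctuation (e.g., "jazz,").
texts = [[word for word in document.lower().split()
        if word not in STOPWORDS and word.isalnum()]
        for document in documents]
# Map each distinct token to an integer id
dictionary = corpora.Dictionary(texts)
# Convert each document to a bag-of-words vector of (token_id, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]
# Fit the LSI model on the reference corpus
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=5)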

<h3>Construct the "musician" data set</h3>
<h4>Example</h4>

In [None]:
musician_genre_list = ['hip_hop_musicians']
all_musicians = get_all_musicians(musician_genre_list)
musician_names,musician_docs = get_all_musician_docs(all_musicians)

<h4>Find the most similar musician in the reference data set for each new musician</h4>
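
<p>MatrixSimilarity, used in the cell below, scores a query against every indexed document by cosine similarity. The following plain-Python sketch shows that computation and the "pick the best match" step on toy two-dimensional vectors. The function names and the vectors are made up for illustration and stand in for the LSI topic vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(query, reference_vectors):
    """Return the index of the reference vector with the highest cosine similarity."""
    scores = [cosine_similarity(query, ref) for ref in reference_vectors]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy topic-space vectors (made up): three reference documents and one query
toy_reference = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_query = [0.5, 0.9]
best = most_similar(toy_query, toy_reference)
print(best)
```

<p>Cosine similarity depends only on the direction of the vectors, so a long page and a short page about similar topics can still score as close matches.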

In [None]:
import pprint

# Build the similarity index over the reference corpus once, outside the loop
sims_index = similarities.MatrixSimilarity(lsi[corpus])

table_data = list()
for index, musician in enumerate(musician_docs):
    # Project the new document into the topic space of the reference model
    vec_bow = dictionary.doc2bow(musician.lower().split())
    vec_lsi = lsi[vec_bow]
    sims = sims_index[vec_lsi]
    # Sort by descending similarity and keep the best-matching reference document
    sims = sorted(enumerate(sims), key=lambda item: -item[1])
    most_similar_musician = sims[0][0]
    table_data.append((musician_names[index], reference_names[most_similar_musician]))

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(table_data)
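
<p>pprint shows the pairs as a Python list. If an aligned two-column table is preferred, something like the following sketch could be used. The helper <code>format_table</code>, its default headers, and the sample rows are assumptions for illustration; in the notebook the rows would come from table_data.

```python
def format_table(rows, headers=("Musician", "Most similar reference musician")):
    """Hypothetical helper: render (name, match) pairs as an aligned text table."""
    all_rows = [tuple(headers)] + [tuple(row) for row in rows]
    # Pad the first column to the width of its longest entry
    width = max(len(str(row[0])) for row in all_rows)
    lines = ["{:<{w}}  {}".format(row[0], row[1], w=width) for row in all_rows]
    # Separator line under the header
    lines.insert(1, "-" * len(lines[0]))
    return "\n".join(lines)

# Made-up rows standing in for table_data
toy_table = [("Artist A", "Reference X"), ("B", "Reference Y")]
print(format_table(toy_table))
```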