<h1>Document Similarity using LSI</h1>

<h4>In this assignment we’re going to practice document similarity. Here’s
what you need to do:</h4>
<ol>
<li>From Wikipedia’s Lists of musicians page (https://en.wikipedia.org/wiki/Lists_of_musicians), pick five lists of
musicians (e.g., List of big band musicians). You can pick any five
you like, but make sure that each list has the word “musicians” in
its name and contains at least 30 musicians.
<li>Collect the URLs of all the musicians on those five pages and place them in a list
<li>Grab the page content for each musician in the list and place it in a list (of documents)
<li>Build an LSI model using this data. This is your "reference" data set
<li>Now grab another list of musicians from Wikipedia and create a new list of documents using the text of each musician's page. This is your "musician" data set
<li>For each musician in the new list, find the musician in the reference data set that is the closest in similarity
<li>Print a table that contains each musician from the musician data set and the most similar musician from the reference data set
</ol>
<h4>Use the code below to build your solution</h4>

<p><span style="color:blue">get_musicians</span>: A function that, given a "list of musicians" URL, returns a list of (name, URL) pairs, one for each musician's Wikipedia page
<p><span style="color:blue">non_musician_finder</span>: A helper that tries its best to filter out links that are not musician links (not perfect, but good enough!)

In [1]:
def get_musicians(url):
    """Return (name, url) pairs for the musician links on a "list of musicians" page."""
    from bs4 import BeautifulSoup
    import requests
    page_soup = BeautifulSoup(requests.get(url).content,'lxml')
    all_musicians = list()
    for tag in page_soup.find_all('li'):
        if tag.get('id'): #### to ignore the last few rows
            continue

        anchor = tag.find('a')  # the first link in the item is taken to be the musician
        if anchor is None:      # list item without a link: nothing to collect
            continue
        link = anchor.get('href')
        name = anchor.get_text()
        # "/wiki/" ignores the first few rows; non_musician_finder drops other non-musician links
        if link and "/wiki/" in link and non_musician_finder(link):
            all_musicians.append((name,"https://en.wikipedia.org" + link))
    return all_musicians

def non_musician_finder(link):
    non_musician_words = ['Category','Template','Portal','List','File','Special','Main','Help','User']
    for word in non_musician_words:
        if word in link:
            return False
    return True
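
<p>Because non_musician_finder matches substrings, it also rejects any article whose title merely contains one of the words (for example "Main" or "List"). A minimal sketch of a stricter check, assuming Wikipedia's "Namespace:Title" link convention; the helper name, the namespace set, and the example titles are illustrative and not part of the assignment code:

```python
from urllib.parse import unquote

# Namespace prefixes that mark non-article pages (an assumed, non-exhaustive set).
NON_ARTICLE_NAMESPACES = {
    "Category", "Template", "Portal", "File", "Special", "Help", "User", "Wikipedia", "Talk",
}

def looks_like_article(link):
    """Hypothetical stricter filter: reject by namespace prefix or list-page title."""
    if not link.startswith("/wiki/"):
        return False
    title = unquote(link[len("/wiki/"):])
    namespace, colon, _ = title.partition(":")
    if colon and namespace in NON_ARTICLE_NAMESPACES:
        return False
    # Index pages rather than musicians.
    return not title.startswith(("List_of_", "Lists_of_", "Main_Page"))

print(looks_like_article("/wiki/Ryan_Adams"))
print(looks_like_article("/wiki/Category:Bluegrass_musicians"))
print(looks_like_article("/wiki/Example_Listeners"))  # made-up title containing "List"
```

<p>This is still a heuristic: it cannot tell a musician's page from any other article, so links to non-musician articles will still get through.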

<h4>Testing the function</h4>
<li>Note that Wikipedia has no standard layout for its list pages, so this code may not work with every list

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_alternative_country_musicians"
get_musicians(url)

[('16 Horsepower', 'https://en.wikipedia.org/wiki/16_Horsepower'),
 ('Ryan Adams', 'https://en.wikipedia.org/wiki/Ryan_Adams'),
 ('Jill Andrews', 'https://en.wikipedia.org/wiki/Jill_Andrews'),
 ('The Autumn Defense', 'https://en.wikipedia.org/wiki/The_Autumn_Defense'),
 ('Backyard Tire Fire', 'https://en.wikipedia.org/wiki/Backyard_Tire_Fire'),
 ('Del Barber', 'https://en.wikipedia.org/wiki/Del_Barber'),
 ('Eef Barzelay', 'https://en.wikipedia.org/wiki/Eef_Barzelay'),
 ("Bear's Den", 'https://en.wikipedia.org/wiki/Bear%27s_Den_(band)'),
 ('Rico Bell', 'https://en.wikipedia.org/wiki/Rico_Bell'),
 ('Blitzen Trapper', 'https://en.wikipedia.org/wiki/Blitzen_Trapper'),
 ('Blue Rodeo', 'https://en.wikipedia.org/wiki/Blue_Rodeo'),
 ('Bosque Brown', 'https://en.wikipedia.org/wiki/Bosque_Brown'),
 ('The Bottle Rockets', 'https://en.wikipedia.org/wiki/The_Bottle_Rockets'),
 ('BR549', 'https://en.wikipedia.org/wiki/BR549'),
 ('Jim Bryson', 'https://en.wikipedia.org/wiki/Jim_Bryson'),
 ('Richard B

<h4>get_musician_text(url): returns the page text of the Wikipedia page associated with a musician</h4>
<li>Since we're not sure this will always work, we use a try ... except to catch exceptions
<li>If it doesn't work, the function returns None
<li>In that case we need to drop that (musician, url) pair from our musicians list

In [3]:
def get_musician_text(url):
    from bs4 import BeautifulSoup
    import requests
    all_text = ''
    try:
        page_soup = BeautifulSoup(requests.get(url).content,'lxml')
        for p_tag in page_soup.find_all('p'):
            all_text += p_tag.get_text()
    except Exception:  # network or parsing failure: report it to the caller as None
        return None
    return all_text
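
<p>The paragraph text returned by get_musician_text keeps Wikipedia's citation markers such as [2]. An optional cleanup step, sketched here with the standard re module; the helper name is hypothetical and the assignment does not require this step:

```python
import re

# Matches bracketed numeric citation markers such as "[2]" or "[15]".
CITATION_MARKER = re.compile(r"\[\d+\]")

def strip_citation_markers(text):
    """Hypothetical helper: drop "[n]" markers and collapse runs of whitespace."""
    without_markers = CITATION_MARKER.sub("", text)
    return re.sub(r"\s+", " ", without_markers).strip()

sample = "\nAdams played with Rhonda Vincent & The Rage in 2000,[2] He recorded an album.[3]\n"
cleaned = strip_citation_markers(sample)
print(cleaned)
```

<p>Collapsing whitespace also removes the newlines between paragraphs, which does not matter for a bag-of-words model.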


<h4>testing get_musician_text</h4>

In [4]:
url = "https://en.wikipedia.org/wiki/Jim_Morrison"
get_musician_text(url)

'\nJames "Jim" Douglas Morrison (December 8, 1943 – July 3, 1971) was an American singer, songwriter and poet, who was the lead vocalist of the rock band the Doors. Due to his wild personality, poetic lyrics, distinctive voice, unpredictable and erratic performances, and the dramatic circumstances surrounding his life and early death, Morrison is regarded by music critics and fans as one of the most iconic and influential frontmen in rock history. Since his death, his fame has endured as one of popular culture\'s most rebellious and oft-displayed icons, representing the generation gap and youth counterculture.[2]\nTogether with pianist Ray Manzarek, Morrison co-founded the Doors in July 1965 in Venice, California. The band spent two years in obscurity until shooting to prominence with their number-one single in the United States, "Light My Fire", taken from their self-titled debut album. Morrison recorded a total of six studio albums with the Doors, all of which sold well and received 

<p><span style="color:blue">get_all_musicians</span>: A function that, given a list of genres, returns the names of the musicians in those genres and the URLs of their Wikipedia pages
<p>The function should return a list of (name,url) pairs for all the musicians in the list of genres
<p>You need to:
<ol>
<li>initialize a list "all_musicians"
<li>iterate through the list of genres
<li>construct a url for the list of musicians (I've done these first three steps for you)
<li>call get_musicians for that url
<li>extend all_musicians by what get_musicians returns
</ol>

In [5]:
def get_all_musicians(genre_list):
    all_musicians = list()
    for genre in genre_list:
        url = 'https://en.wikipedia.org/wiki/List_of_' + genre
        all_musicians.extend(get_musicians(url))  # add this genre's (name, url) pairs
    return all_musicians
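
<p>A musician can appear on more than one genre list, so the combined list may contain repeated pairs. A small sketch of order-preserving de-duplication by URL, assuming duplicates are unwanted in the reference set; the helper name is hypothetical:

```python
def dedupe_musicians(pairs):
    """Hypothetical helper: drop repeated (name, url) pairs, keeping first-seen order."""
    seen_urls = set()
    unique = []
    for name, url in pairs:
        if url not in seen_urls:
            seen_urls.add(url)
            unique.append((name, url))
    return unique

pairs = [
    ("Ryan Adams", "https://en.wikipedia.org/wiki/Ryan_Adams"),
    ("Blue Rodeo", "https://en.wikipedia.org/wiki/Blue_Rodeo"),
    ("Ryan Adams", "https://en.wikipedia.org/wiki/Ryan_Adams"),
]
unique_pairs = dedupe_musicians(pairs)
print(len(unique_pairs))
```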

<h4>Example of how to use get_all_musicians</h4>

In [6]:
genre_list = ['bluegrass_musicians#G','British_blues_musicians','country_blues_musicians','emo_artists']
all_musicians = get_all_musicians(genre_list)
all_musicians

[('Tom Adams', 'https://en.wikipedia.org/wiki/Tom_Adams_(bluegrass_musician)'),
 ('Eddie Adcock', 'https://en.wikipedia.org/wiki/Eddie_Adcock'),
 ('David "Stringbean" Akeman',
  'https://en.wikipedia.org/wiki/David_%22Stringbean%22_Akeman'),
 ('Red Allen', 'https://en.wikipedia.org/wiki/Red_Allen_(bluegrass)'),
 ('Darol Anger', 'https://en.wikipedia.org/wiki/Darol_Anger'),
 ('Mike Auldridge', 'https://en.wikipedia.org/wiki/Mike_Auldridge'),
 ('Kenny Baker (fiddler)',
  'https://en.wikipedia.org/wiki/Kenny_Baker_(fiddler)'),
 ('Jessie Baker', 'https://en.wikipedia.org/wiki/Jessie_Baker'),
 ('Butch Baldassari', 'https://en.wikipedia.org/wiki/Butch_Baldassari'),
 ('Russ Barenberg', 'https://en.wikipedia.org/wiki/Russ_Barenberg'),
 ('Byron Berline', 'https://en.wikipedia.org/wiki/Byron_Berline'),
 ('Norman Blake',
  'https://en.wikipedia.org/wiki/Norman_Blake_(American_musician)'),
 ('Kathy Boyd', 'https://en.wikipedia.org/wiki/Kathy_Boyd_and_Phoenix_Rising'),
 ('Dale Ann Bradley', 'https:

<p><span style="color:blue">get_all_musician_docs</span>: A function that, given the list of (musician,url) pairs, returns two lists, a list of musicians and a parallel (same size) list of documents. 

<p>You need to:

<ol>
<li>initialize the two lists

<li>iterate through the all_musicians list
<li>extract the name and the url of the musician
<li>get the text using the get_musician_text() function
<li>if the function returns None, ignore it and move to the next musician
<li>otherwise, append the name to the musician_names list and the text to the musician_texts list
<li>return musician_names and musician_texts
</ol>


In [7]:
def get_all_musician_docs(all_musicians):
    musician_names = list()
    musician_texts = list()
    for name, url in all_musicians:
        all_text = get_musician_text(url)  # fetch once; None means the page could not be read
        if all_text:  # skip failed or empty pages so both lists stay parallel
            musician_names.append(name)
            musician_texts.append(all_text)
    return musician_names,musician_texts
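
<p>The steps above can be checked without touching the network by passing the text fetcher in as an argument. This is a sketch under that assumption; collect_docs, the fake URLs, and the fake page texts are all made up for illustration:

```python
def collect_docs(all_musicians, fetch_text):
    """Same steps as listed above, with the fetcher injected so it can be faked."""
    names, texts = [], []
    for name, url in all_musicians:
        text = fetch_text(url)
        if text:  # skip None (failed fetch) and empty pages
            names.append(name)
            texts.append(text)
    return names, texts

# Stand-in for get_musician_text: the lookup for "url-b" returns None, like a failed fetch.
fake_pages = {"url-a": "plays banjo", "url-b": None, "url-c": "plays guitar"}
names, texts = collect_docs(
    [("A", "url-a"), ("B", "url-b"), ("C", "url-c")],
    fake_pages.get,
)
print(names, texts)
```

<p>The two returned lists stay the same length, so the name at position i always belongs to the document at position i.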
        

<h4>Example of how to use get_all_musician_docs</h4>

In [8]:
reference_names,reference_docs = get_all_musician_docs(all_musicians)

In [9]:
reference_docs

['Tom Adams (born 1958) is an American bluegrass guitarist and banjo player.\nAdams began his career in 1969, playing banjo in his family bluegrass band in Gettysburg, Pennsylvania. In 1983 he joined Jimmy Martin\'s Sunny Mountain Boys, and then became one of the Johnson Mountain Boys in 1986.[1]\nAdams played with Rhonda Vincent & The Rage in 2000,[2] He recorded an album, Live - At the Ragged Edge, with Michael Cleveland in 2004, and the album was awarded "Bluegrass Instrumental Album of The Year 2004" by the International Bluegrass Music Association.[3]\nAdams has also played with Dale Ann Bradley, in the Lynn Morris band, and later with  Bill Emerson & Sweet Dixie.[4]  In 2009 Adams joined Michael Cleaveland and Flamekeeper as a singer and guitar player.[2]\n',
 'Eddie Adcock (born June 21, 1938 in Scottsville, Virginia)[1] is an American banjoist and guitarist.\nHis professional career as a 5-string banjoist began in 1953 when he joined Smokey Graves & His Blue Star Boys, who had 

<h3>Set up the LSI model</h3>
<li>reference_docs is the list of documents
<li>construct texts, dictionary, and corpus (see class iPython notebook)
<li>construct an LSI model. Use 5 topics initially but you should play around with this number
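
<p>For intuition about what the LSI model does, here is a toy sketch in plain NumPy: build a term-document count matrix, keep the top k left singular vectors as "topics", project documents and a query onto them, and compare by cosine similarity. It is a stand-in for illustration, not gensim's implementation, and the documents are made up:

```python
import numpy as np

# Toy documents (made up for illustration), tokenized by whitespace.
docs = [
    "banjo bluegrass fiddle banjo",
    "bluegrass fiddle mandolin",
    "guitar rock psychedelic guitar",
    "rock psychedelic organ organ",
]
vocab = sorted({word for doc in docs for word in doc.split()})
word_id = {word: i for i, word in enumerate(vocab)}

def bow_vector(text):
    """Term-count vector over the toy vocabulary; unknown words are ignored."""
    vec = np.zeros(len(vocab))
    for word in text.split():
        if word in word_id:
            vec[word_id[word]] += 1
    return vec

# Term-document matrix: one column per document.
A = np.column_stack([bow_vector(doc) for doc in docs])

# Truncated SVD: keep the k strongest left singular vectors as "topics".
k = 2
U, singular_values, Vt = np.linalg.svd(A, full_matrices=False)
topics = U[:, :k]

def to_topic_space(text):
    """Project a bag-of-words vector onto the k topic directions."""
    return topics.T @ bow_vector(text)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

doc_vectors = [to_topic_space(doc) for doc in docs]
query = to_topic_space("banjo mandolin")
scores = [cosine(query, vec) for vec in doc_vectors]
best = int(np.argmax(scores))
print(docs[best])
```

<p>The query shares no words with the last two documents, so its similarity to them is essentially zero, while both bluegrass documents score close to 1 even though neither contains both query words.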

In [15]:
#Code for LSI model goes here
from gensim import corpora, models, similarities
from gensim.parsing.preprocessing import STOPWORDS
import pandas as pd

# Tokenize: lowercase, split on whitespace, drop stopwords and any token that is not purely alphanumeric
texts = [[word for word in document.lower().split()
        if word not in STOPWORDS and word.isalnum()]
        for document in reference_docs]
dictionary = corpora.Dictionary(texts)  # maps each distinct token to an integer id
corpus = [dictionary.doc2bow(text) for text in texts]  # one bag-of-words vector per document

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=5)
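
<p>One thing to be aware of in the tokenization above: word.isalnum() discards every token that has punctuation attached, such as "player." or "(born". The sketch below compares it with a regex-based alternative that strips the punctuation instead. The stopword set here is a tiny illustrative one, not gensim's STOPWORDS, and both function names are hypothetical:

```python
import re

# A tiny illustrative stopword set; the notebook itself uses gensim's STOPWORDS.
TOY_STOPWORDS = {"is", "an", "and", "the", "in", "his"}

def split_tokens(text):
    """Approach used in the cell above: whitespace split, keep alphanumeric tokens only."""
    return [w for w in text.lower().split() if w not in TOY_STOPWORDS and w.isalnum()]

def regex_tokens(text):
    """Alternative sketch: extract runs of letters and digits, dropping punctuation."""
    return [w for w in re.findall(r"[^\W_]+", text.lower()) if w not in TOY_STOPWORDS]

sentence = "Tom Adams (born 1958) is an American bluegrass guitarist and banjo player."
print(split_tokens(sentence))
print(regex_tokens(sentence))
```

<p>Whichever tokenizer is used for the reference documents should also be used for the new documents, so that both are mapped onto the same dictionary.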

<h3>Construct the "musician" data set</h3>
<h4>Example</h4>

In [11]:
musician_genre_list = ['acid_rock_artists']
all_musicians = get_all_musicians(musician_genre_list)
musician_names,musician_docs = get_all_musician_docs(all_musicians)

<h4>Find, for each new musician, the most similar musician in our reference data set</h4>

In [16]:
table_data = list()
# Build the similarity index over the reference corpus once, outside the loop
similar = similarities.MatrixSimilarity(lsi[corpus])
for index,musician in enumerate(musician_docs):
    
    #Your similarity code here. Use the in-class notebook as a reference
    vec_bow = dictionary.doc2bow(musician.lower().split())  # tokens not in the dictionary are ignored
    vec_lsi = lsi[vec_bow]  # project the new document into LSI topic space
    sims = similar[vec_lsi]  # cosine similarity to every reference document
    most_similar_musician = int(sims.argmax())  # position of the closest reference document
    table_data.append((musician_names[index],reference_names[most_similar_musician]))
    
#Write code to print table_data after the for loop ends
match = pd.DataFrame(table_data, columns=['musician', 'most similar reference musician'])
match

,musician,most similar reference musician
0,The 13th Floor Elevators,Fragile Rock
1,Alice Cooper,Jawbreaker
2,The Amboy Dukes,The Anniversary
3,Amon Düül,Secondhand Serenade
4,Big Brother and the Holding Company,The Pretty Things
...,...,...
98,Psychedelic drug,Midwestern United States
99,Psychedelic era,Walter Roland
100,Psychedelic experience,Midwestern United States
101,Psychedelic literature,Bo Carter
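
<p>If pandas is not available, the same table can be printed as aligned plain text. A minimal sketch using the first three rows of the output above; format_table and the header labels are illustrative choices:

```python
table_data = [
    ("The 13th Floor Elevators", "Fragile Rock"),
    ("Alice Cooper", "Jawbreaker"),
    ("The Amboy Dukes", "The Anniversary"),
]
headers = ("musician", "most similar reference musician")

def format_table(rows, headers):
    """Hypothetical helper: left-align each column to its widest cell."""
    widths = [max(len(str(cell)) for cell in column) for column in zip(headers, *rows)]

    def fmt(row):
        return " | ".join(str(cell).ljust(w) for cell, w in zip(row, widths)).rstrip()

    lines = [fmt(headers), "-+-".join("-" * w for w in widths)]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)

print(format_table(table_data, headers))
```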
