## Text classification model

Build a text classification model to predict the artist from a piece of text.

- Download HTML pages
- Get a list of song URLs
- Extract lyrics from the song URLs
- Convert text to numbers by applying the Bag of Words method
- Build and train a Naive Bayes classifier
- Balance out your dataset
- Write a command-line interface
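As a rough illustration of the Bag of Words step listed above, the sketch below turns two short texts into word-count vectors using only the standard library. The sample texts and the helper name `bag_of_words` are made up for this example, and a library vectorizer may tokenize differently.

```python
import re
from collections import Counter

def bag_of_words(texts):
    """Return (vocabulary, count vectors) for a list of strings."""
    # Lowercase each text and split it into simple word tokens.
    tokenized = [re.findall(r"[a-z']+", t.lower()) for t in texts]
    # The vocabulary is the sorted set of all tokens seen in any text.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # One count per vocabulary word; words absent from a text count as 0.
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

# Hypothetical sample texts, not taken from the scraped lyrics.
vocab, vectors = bag_of_words(["She moves in her own way", "Her way, her rules"])
print(vocab)
print(vectors)
```

Each row has one column per vocabulary word, which is the numeric form a Naive Bayes classifier can be trained on.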

In [1]:
import requests

## Find Song Links
- Choose 2 artists you want to work with this week
- Request their webpages
- Save each page as an HTML file on your computer
- Using your browser's developer tools and a text editor, look for patterns in the HTML file that would let you extract the song names and the links to the song pages
- Extract all links using regular expressions
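The pattern-hunting step above could look roughly like this once a candidate pattern is found. The HTML fragment, artist name, and IDs below are invented for illustration, and the real page markup may differ, so treat the pattern as a starting point rather than the site's actual structure.

```python
import re

# Hypothetical HTML fragment; the real artist page markup may differ.
sample_html = '''
<td><a href="/lyric/12345678/Some+Artist/First+Song">First Song</a></td>
<td><a href="/lyric/87654321/Some+Artist/Second+Song">Second Song</a></td>
'''

# Group 1 captures the link, group 2 the visible song name.
link_pattern = r'<a href="(/lyric/\d+/[^"]+)">([^<]+)</a>'

# With two groups, findall returns a list of (link, name) tuples.
songs = re.findall(link_pattern, sample_html)
for link, name in songs:
    print(name, '->', link)
```

Capturing groups make it possible to pull out the song name and its link in a single pass.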

In [2]:
kooks = requests.get('https://www.lyrics.com/artist/The-Kooks/762797')
alicia = requests.get('https://www.lyrics.com/artist/Alicia-Keys/469431')
type(alicia)

requests.models.Response

In [3]:
# response.text (e.g. kooks.text) returns the HTML of the page as a string

In [4]:
# save the files as html files
with open('data/kooks.html', 'w', encoding='utf-8') as file:
    file.write(kooks.text)
    
with open('data/alicia.html', 'w', encoding='utf-8') as file:
    file.write(alicia.text)

## Extract all links using Regex

In [5]:
import re

In [6]:
text = '''thyme <a href="coriander99"> <a href="rosemary"> cinnamon pepper tarragon basil salvia cumin'''
# match all words starting with a "c" (\b anchors the match at a word boundary):
pattern = r"\bc\w*"
re.findall(pattern, text)

['coriander99', 'cinnamon', 'cumin']
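The word-boundary anchor `\b` is what restricts a match to the start of a word. Without it, `c\w*` also matches from a "c" in the middle of a word. A small comparison on a made-up string:

```python
import re

# Made-up sample: "bacon" contains a "c" that is not at the start of the word.
sample = "cinnamon bacon cumin"

# Without a boundary, the match can begin at any "c".
anywhere = re.findall(r"c\w*", sample)

# With \b, the match must begin where a word begins.
word_start = re.findall(r"\bc\w*", sample)

print(anywhere)    # ['cinnamon', 'con', 'cumin']
print(word_start)  # ['cinnamon', 'cumin']
```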

In [7]:
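# song links look like /lyric/<8-digit id>/<Artist>/<Title>
# (raw string; "/" needs no escaping in Python regular expressions)
pattern = r"/lyric/\d{8}/[0-9A-Za-z+/]+"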

kooks_links = re.findall(pattern, kooks.text)
alicia_links = re.findall(pattern, alicia.text)

In [8]:
kooks_links

['/lyric/34760352/The+Kooks/She+Moves+in+Her+Own+Way',
 '/lyric/34760337/The+Kooks/Seaside',
 '/lyric/34855550/The+Kooks/Naive',
 '/lyric/35284886/The+Kooks/Intro',
 '/lyric/35284900/The+Kooks/Kids',
 '/lyric/35284899/The+Kooks/All+the+Time',
 '/lyric/35284898/The+Kooks/Believe',
 '/lyric/35284882/The+Kooks/Fractured+and+Dazed',
 '/lyric/35284881/The+Kooks/Chicken+Bone',
 '/lyric/35284880/The+Kooks/Four+Leaf+Clover',
 '/lyric/35284879/The+Kooks/Tesco+Disco',
 '/lyric/35284878/The+Kooks/Honey+Bee',
 '/lyric/35284877/The+Kooks/Initials+for+Gainsbourg',
 '/lyric/35284876/The+Kooks/Pamela',
 '/lyric/35284875/The+Kooks/Picture+Frame',
 '/lyric/35284874/The+Kooks/Swing+Low',
 '/lyric/35284873/The+Kooks/Weight+of+the+World',
 '/lyric/35284872/The+Kooks/No+Pressure',
 '/lyric/33888738/The+Kooks/Be+Who+You+Are',
 '/lyric/33905847/The+Kooks/Na',
 '/lyric/33905846/The+Kooks/Always+Where+I+Need+to+Be',
 '/lyric/33905845/The+Kooks/Junk+of+the+Heart+',
 '/lyric/33905880/The+Kooks/Bad+Habit',
 '/lyri

In [9]:
len(kooks_links), len(alicia_links)

(330, 775)
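These counts are raw matches, so the same title may appear more than once under different IDs (an assumption about the page, not something the output above confirms). A minimal sketch of counting distinct titles, using two links from the output above plus one made-up duplicate:

```python
links = [
    '/lyric/34760337/The+Kooks/Seaside',
    '/lyric/34855550/The+Kooks/Naive',
    '/lyric/00000000/The+Kooks/Seaside',  # made-up ID, same title as the first link
]

# Key each link by its 'Artist/Title' part; setdefault keeps the first link seen.
first_link_by_title = {}
for link in links:
    title = link.split('/', 3)[3].lower()
    first_link_by_title.setdefault(title, link)

print(len(links), len(first_link_by_title))  # 3 2
```

Comparing distinct-title counts per artist gives an early hint of how unbalanced the dataset will be.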

## Download songs
- Write a loop that goes through all song URLs that you collected previously
- Construct a complete URL
- Test the URL in a browser manually
- Generate a unique file name (using the song name or a number)
- Download each song
- Save each song to a unique file
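The URL and file-name steps above can be sketched as one small helper. The function name `song_url_and_filename` and the underscore naming scheme are assumptions made for this example. The notebook's own loop builds names differently.

```python
import re
from urllib.parse import urljoin, unquote_plus

BASE_URL = 'https://www.lyrics.com'

def song_url_and_filename(link):
    """Build the full URL and a file-system-friendly name for one song link."""
    # Join the site root with the relative link.
    url = urljoin(BASE_URL, link)
    # Keep only the 'Artist/Title' part after '/lyric/<id>/'.
    artist_title = link.split('/', 3)[3]
    # Decode '+' to spaces, then collapse every non-alphanumeric run to '_'.
    readable = unquote_plus(artist_title).lower()
    name = re.sub(r'[^a-z0-9]+', '_', readable).strip('_')
    return url, name + '.html'

result = song_url_and_filename('/lyric/34760337/The+Kooks/Seaside')
print(result)
```

Keeping this logic in a function makes it easy to test the URL in a browser and inspect the file name before any download runs.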

In [25]:
linklist = [*kooks_links, *alicia_links]

In [24]:
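# Loop over all song links, skipping titles that were already downloaded
song_titles = []
for link in linklist:
    # '/lyric/<id>/<Artist>/<Title>' -> '<Artist>/<Title>'
    artist_title = link.split('/', 3)[3]
    # Build a file-safe name such as 'the+kooks_seaside'; a trailing '+' is kept as-is
    song_clean = artist_title.replace('/', '_').lower()
    if song_clean not in song_titles:
        print('Downloading ', song_clean)
        song_titles.append(song_clean)
        # Construct the complete URL from the site root and the relative link
        URL_complete = 'https://www.lyrics.com' + link
        request_response = requests.get(URL_complete)
        # Save the page under a unique name derived from the song title
        with open('data/songs' + song_clean + '.html', 'w', encoding='utf-8') as file:
            file.write(request_response.text)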

Downloading  the+kooks_she+moves+in+her+own+way
Downloading  the+kooks_seaside
Downloading  the+kooks_naive
Downloading  the+kooks_intro
Downloading  the+kooks_kids
Downloading  the+kooks_all+the+time
Downloading  the+kooks_believe
Downloading  the+kooks_fractured+and+dazed
Downloading  the+kooks_chicken+bone
Downloading  the+kooks_four+leaf+clover
Downloading  the+kooks_tesco+disco
Downloading  the+kooks_honey+bee
Downloading  the+kooks_initials+for+gainsbourg
Downloading  the+kooks_pamela
Downloading  the+kooks_picture+frame
Downloading  the+kooks_swing+low
Downloading  the+kooks_weight+of+the+world
Downloading  the+kooks_no+pressure
Downloading  the+kooks_be+who+you+are
Downloading  the+kooks_na
Downloading  the+kooks_always+where+i+need+to+be
Downloading  the+kooks_junk+of+the+heart+
Downloading  the+kooks_bad+habit
Downloading  the+kooks_bad+habits
Downloading  the+kooks_shine+on
Downloading  the+kooks_sofa+song
Downloading  the+kooks_down
Downloading  the+kooks_is+it+me
Downloadi

Downloading  alicia+keys_dah+dee+dah+
Downloading  alicia+keys_karma+
Downloading  alicia+keys_little+drummer+girl
Downloading  alicia+keys_superwoman+
Downloading  alicia+keys_thing+about+love
Downloading  alicia+keys_doncha+know+
Downloading  alicia+keys_saviour
Downloading  alicia+keys_djin+djin
Downloading  alicia+keys_unbreakable
Downloading  alicia+keys_my+boo
Downloading  alicia+keys_ghetto+story+chapter+2
Downloading  alicia+keys_ghetto+story
Downloading  alicia+keys_never+felt+this+way
Downloading  alicia+keys_nobody+not+really+
Downloading  alicia+keys_america+the+beautiful
Downloading  alicia+keys_if+this+world+were+mine
Downloading  alicia+keys_intro+alicia
Downloading  alicia+keys_diary+
Downloading  alicia+keys_girlfriend+
Downloading  alicia+keys_butterflyz+
Downloading  alicia+keys_i+got+a+little+something+for+you
Downloading  alicia+keys_someday+we
Downloading  alicia+keys_feeling+u+feeling+me
Downloading  alicia+keys_streets+of+new+york
Downloading  alicia+keys_brotha