# NLP Data Cleaning

As a follow-up to the earlier work on this project, and to practise NLP on a mostly ready dataset that I'm already familiar with, I've started this mini-project.

# Objectives

- Use NLP techniques to preprocess song titles
- Use NLP techniques to visualise song titles

# Libraries

In [53]:
import pandas as pd
import nltk 
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# download the NLTK stopword list
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yousefnami/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [54]:
df = pd.read_csv('../Data/cleaned_data.csv')
df.head()

| Unnamed: 0 | Name | Total Time | Year | Date Added | Play Count | Skip Count | Artist | Genre | Cleaned up time | Relative Plays | Relative Skips |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ether | 277968 | 2001 | 2016-02-12 | 36 | 11 | Nas | Hip Hop/Rap | 4:37 | 0.14234 | 0.200364 |
| 1 | Happy | 247066 | 2014 | 2016-02-12 | 33 | 16 | Pharrell Williams | Pop | 4:7 | 0.130478 | 0.291439 |
| 2 | Immigrant song | 148662 | 1970 | 2016-02-12 | 80 | 13 | Led Zeppelin | Rock | 2:28 | 0.316311 | 0.236794 |
| 3 | Violent Pornography | 211408 | 2005 | 2016-02-12 | 30 | 12 | System Of A Down | Metal | 3:31 | 0.118617 | 0.218579 |
| 4 | Psycho | 235415 | 2001 | 2016-02-12 | 51 | 6 | System Of A Down | Metal | 3:55 | 0.201648 | 0.10929 |


In [131]:
# extract text
text = ' '.join(df.Name)
text

'Ether Happy Immigrant song Violent Pornography Psycho Gold Digger Come As You Are Lithium Here Comes The Sun HiiiPoWeR I Love Rock \'n\' Roll \u200bm.A.A.d city Follow The Leader The Man Who Sold The World The Man Who Sold The World Ramble On Crazy Little Thing Called Love The Prophets Song Fight The Power Raindrops Keep Fallin\' on My Head Mama Used To Say T.N.T. Le Freak Three Little Birds Kashmir Jamming Walking On Sunshine Heartbreak Hotel "Heroes" Moonage Daydream Life On Mars? Ashes To Ashes Modern Love Fame Golden Years Eye of the Tiger Innuendo The Show Must Go On Dont Stop Me Now I Want To Break Free Radio Ga Ga Killer Queen We Will Rock You Changes Space Oddity \'Till I Collapse (Remix) \'Till I Collapse Stan Guilty Concience God Gave Me Everything Gimme Shelter Back in Black Abracadabra Highway to Hell Paul Revere The Power Of Love Django Gangsta\'s Paradise (NickT Remix) Gangsta\'s Paradise Ambitionz Az a Ridah All Eyez on Me Stayin\' Alive Hypnotize Stronger Another One B

In [145]:
# remove characters that aren't of interest, such as punctuation
import re
import string

def clean(text, numbers=True, non_ascii=True, stop=True):
    """
    Clean a string for NLP use.

    Dependencies:
    -------------

    import re
    import string
    from nltk.corpus import stopwords

    Parameters:
    -----------

    text : str
        text to be cleaned

    numbers (True) : bool
        whether to remove tokens containing digits (True) or not (False)

    non_ascii (True) : bool
        whether to replace non-ascii characters with a space (True) or not (False)

    stop (True) : bool
        whether to remove English stopwords (True) or not (False)

    """
    text = text.lower()  # make text lower case

    if numbers:
        text = re.sub(r'\w*\d\w*', '', text)  # remove any token containing a digit

    if non_ascii:
        text = re.sub(r'[^\x00-\x7F]+', ' ', text)  # replace non-ascii chars with a space

    if stop:
        # remove stopwords before punctuation so contractions such as "don't" still match
        stop_words = set(stopwords.words('english'))
        text = ' '.join(w for w in text.split() if w not in stop_words)

    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # remove punctuation

    # collapse whitespace last, since the removals above can leave gaps
    text = re.sub(r'\s+', ' ', text).strip()

    return text

In [146]:
text = clean(text)

In [147]:
text

'ether happy immigrant song violent pornography psycho gold digger come lithium comes sun hiiipower love rock n roll maad city follow leader man sold world man sold world ramble crazy little thing called love prophets song fight power raindrops keep fallin head mama used say tnt le freak three little birds kashmir jamming walking sunshine heartbreak hotel heroes moonage daydream life mars ashes ashes modern love fame golden years eye tiger innuendo show must go dont stop want break free radio ga ga killer queen rock changes space oddity till collapse remix till collapse stan guilty concience god gave everything gimme shelter back black abracadabra highway hell paul revere power love django gangstas paradise nickt remix gangstas paradise ambitionz az ridah eyez stayin alive hypnotize stronger another one bites dust bohemian rhapsody live blitzkrieg bop toxicity chop suey lets get started jailhouse rock im believer sympathy devil house rising sun smells like teen spirit heartshaped box h

Visual inspection looks good. The `clean` function will be kept for reuse in other projects (for instance, using NLP to find novelty in research papers).
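
To back up the visual check with something countable, here is a minimal sketch that tallies token frequencies with `collections.Counter`. The `sample` string is a short excerpt of the cleaned output above, standing in for the full text, and `token_counts` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
from collections import Counter

# Short excerpt of the cleaned output shown above, used as a stand-in for the full string.
sample = ("ether happy immigrant song violent pornography psycho gold digger "
          "come lithium comes sun hiiipower love rock n roll maad city follow "
          "leader man sold world man sold world")

def token_counts(cleaned_text):
    """Count how often each whitespace-separated token appears (hypothetical helper)."""
    return Counter(cleaned_text.split())

counts = token_counts(sample)

# The most frequent tokens give a quick feel for what survived cleaning.
print(counts.most_common(5))

# Single-character tokens are usually leftovers, e.g. the "n" from "rock 'n' roll".
print([token for token in counts if len(token) == 1])
```

A frequency table like this could also feed a bar chart or word cloud later on, which ties in with the visualisation objective.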

*Update:* the function above now also has a feature for removing stopwords.
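
The order of the steps matters here: stopwords are dropped before punctuation is stripped. The sketch below illustrates that order on two titles from the dataset. It is a simplified stand-in, not the `clean` function itself: `SAMPLE_STOPWORDS` is a tiny hand-picked set used in place of the NLTK list (an assumption for illustration), and `drop_stopwords_then_punct` is a hypothetical name.

```python
import re
import string

# Tiny hand-picked stand-in for stopwords.words('english'); an assumption for illustration.
SAMPLE_STOPWORDS = {"i", "on", "the", "here"}

def drop_stopwords_then_punct(text, stop_words=SAMPLE_STOPWORDS):
    """Lowercase, drop stopwords, then strip punctuation (same order as clean)."""
    text = text.lower()
    text = " ".join(w for w in text.split() if w not in stop_words)
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)

# "'till" still carries its apostrophe when stopwords are checked, so it is kept,
# and only afterwards reduced to "till".
print(drop_stopwords_then_punct("'Till I Collapse"))
print(drop_stopwords_then_punct("Life On Mars?"))
```

Both results match what appears in the cleaned output above ("till collapse" and "life mars").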

Thoughts:

- Can you pick up on songs that mention cities, even when the city is not written out in full, for instance NY, NYC, New York, etc.?
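
One possible starting point for the city idea is a lookup of aliases per city, matched as whole words. This is only a sketch under assumptions: `CITY_ALIASES` is a hypothetical, hand-written table, `find_cities` is a hypothetical helper, and the titles passed to it below are illustrative strings rather than rows from the dataset.

```python
import re

# Hypothetical alias table: canonical city name -> lowercase aliases.
CITY_ALIASES = {
    "new york": ["new york", "ny", "nyc", "big apple"],
    "los angeles": ["los angeles", "city of angels"],
}

def find_cities(title, aliases=CITY_ALIASES):
    """Return the sorted canonical cities whose aliases appear as whole words in title."""
    lowered = title.lower()
    found = set()
    for city, names in aliases.items():
        for name in names:
            # \b keeps "ny" from matching inside longer words such as "many"
            if re.search(r"\b" + re.escape(name) + r"\b", lowered):
                found.add(city)
                break
    return sorted(found)

# Illustrative titles, not taken from the dataset.
print(find_cities("NY State of Mind"))
print(find_cities("Many Rivers"))
```

Matching would probably work best on the raw titles rather than the cleaned text, since stopword removal would break multi-word aliases such as "city of angels". Short aliases remain ambiguous, so results would still need a manual check.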
    