# 50 Years of Pop Music: A Lyrical Analysis

- Class activity 3, 9/12/2017 
- Data courtesy of Kaylin Walker: http://kaylinwalker.com/50-years-of-pop-music/

## (1) Read in the spreadsheet file as a DataFrame
- Create a DataFrame called `songs` from the spreadsheet. 
- You will likely run into an encoding error. Who can figure out the correct encoding? 
- Once successfully loaded, explore the DF. Use the following methods: `.head()`, `.tail()`, `.info()`, `.describe()`. 
- What are the column labels and row indexes?
- How to improve organization?

In [1]:
import pandas as pd
songs = pd.read_csv('billboard_lyrics_1964-2015.csv', encoding='ISO8859')
    # Python treats 'ISO8859' as an alias for ISO-8859-1 (Latin-1), a single-byte superset of ASCII
    # can find the encoding using `file [file name]` on the command line,
        # or by opening the file in a text editor and trying to change the encoding

songs.head()     # first five entries (default)

Unnamed: 0,Rank,Song,Artist,Year,Lyrics,Source
0,1,wooly bully,sam the sham and the pharaohs,1965,sam the sham miscellaneous wooly bully wooly b...,3.0
1,2,i cant help myself sugar pie honey bunch,four tops,1965,sugar pie honey bunch you know that i love yo...,1.0
2,3,i cant get no satisfaction,the rolling stones,1965,,1.0
3,4,you were on my mind,we five,1965,when i woke up this morning you were on my mi...,1.0
4,5,youve lost that lovin feelin,the righteous brothers,1965,you never close your eyes anymore when i kiss...,1.0
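
One way to approach the encoding question in (1) is to try candidate encodings in turn on the raw bytes. The sketch below is only an illustration: `first_working_encoding` is a hypothetical helper (not part of pandas), and the short byte string stands in for the contents of the CSV file. ISO-8859-1 can decode any byte sequence, so it is tried last, and a successful decode does not prove the guess is right.

```python
def first_working_encoding(raw_bytes, candidates=("utf-8", "iso-8859-1")):
    """Return the first candidate that decodes raw_bytes without error, else None."""
    for enc in candidates:
        try:
            raw_bytes.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

# Stand-in for bytes read with open(path, 'rb').read();
# 0xE9 is an accented e in ISO-8859-1 but is invalid here as UTF-8
sample = b"beyonc\xe9"
print(first_working_encoding(sample))
```

The encoding found this way can then be passed to `pd.read_csv(..., encoding=...)`.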


In [2]:
songs.tail()     # last five entries (default)

Unnamed: 0,Rank,Song,Artist,Year,Lyrics,Source
5095,96,el perdon,nicky jam and enrique iglesias,2015,enrique iglesias dime si es verdad me dijeron ...,3.0
5096,97,she knows,neyo featuring juicy j,2015,,
5097,98,night changes,one direction,2015,going out tonight changes into something red ...,1.0
5098,99,back to back,drake,2015,oh man oh man oh man not againyeah i learned ...,1.0
5099,100,how deep is your love,calvin harris and disciples,2015,i want you to breathe me in let me be your ai...,1.0


In [3]:
songs.info()     # info about the data itself

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5100 entries, 0 to 5099
Data columns (total 6 columns):
Rank      5100 non-null int64
Song      5100 non-null object
Artist    5100 non-null object
Year      5100 non-null int64
Lyrics    4913 non-null object
Source    4913 non-null float64
dtypes: float64(1), int64(2), object(3)
memory usage: 239.1+ KB


In [4]:
songs.describe()     # basic statistics on numeric columns

Unnamed: 0,Rank,Year,Source
count,5100.0,5100.0,4913.0
mean,50.5,1990.0,1.400977
std,28.8689,14.721045,0.890375
min,1.0,1965.0,1.0
25%,25.75,1977.0,1.0
50%,50.5,1990.0,1.0
75%,75.25,2003.0,1.0
max,100.0,2015.0,5.0


## (2) Reorganize and clean up
- Rearrange the columns so that they are ordered: Year - Rank - Artist - Song - Lyrics. 
- Get rid of Source column. 
- What about the index? 

In [5]:
songs2 = songs[['Year', 'Rank', 'Artist', 'Song', 'Lyrics']]
# create new DataFrame with Year, Rank, Artist, Song, Lyrics data in that order
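
The two ways of getting rid of a column can be sketched on a toy frame. The rows below are shortened copies of the first two rows shown earlier, and `toy`, `reordered` and `no_source` are illustrative names only. Selecting a list of columns reorders and drops in one step, while `drop(columns=...)` removes a column and leaves the order alone.

```python
import pandas as pd

# Toy stand-in for `songs`
toy = pd.DataFrame({
    "Rank": [1, 2],
    "Song": ["wooly bully", "i cant help myself sugar pie honey bunch"],
    "Artist": ["sam the sham and the pharaohs", "four tops"],
    "Year": [1965, 1965],
    "Lyrics": ["sam the sham miscellaneous wooly bully", "sugar pie honey bunch"],
    "Source": [3.0, 1.0],
})

# Column selection: new order, and Source is left out
reordered = toy[["Year", "Rank", "Artist", "Song", "Lyrics"]]

# Explicit drop: original order, minus Source
no_source = toy.drop(columns=["Source"])

print(list(reordered.columns))
print(list(no_source.columns))
```

Both calls return a new DataFrame and leave `toy` unchanged. The default integer index is kept here; section (10) considers a year-and-rank index instead.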

## (3) Sanity check
- Is the data complete? 
  - Are there really 50 years represented?
      - not quite: 51 years, 1965 through 2015
  - Does every year have all 100 entries? 
      - yes
- Any missing or anomalous data? How to address them?
    - 187 songs are missing lyrics (only 4913 of 5100 are non-null)

In [6]:
# len(set(songs['Year']))      # works, but not the pandas way
print("# years:", songs2['Year'].unique().size)
songs2['Year'].unique()

# years: 51


array([1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975,
       1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986,
       1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997,
       1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008,
       2009, 2010, 2011, 2012, 2013, 2014, 2015])

In [7]:
songs['Rank'].value_counts().sort_index()

1      51
2      51
3      51
4      51
5      51
6      51
7      51
8      51
9      51
10     51
11     51
12     51
13     51
14     51
15     51
16     51
17     51
18     51
19     51
20     51
21     51
22     51
23     51
24     51
25     51
26     51
27     51
28     51
29     51
30     51
       ..
71     51
72     51
73     51
74     51
75     51
76     51
77     51
78     51
79     51
80     51
81     51
82     51
83     51
84     51
85     51
86     51
87     51
88     51
89     51
90     51
91     51
92     51
93     51
94     51
95     51
96     51
97     51
98     51
99     51
100    51
Name: Rank, Length: 100, dtype: int64
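
The completeness questions in (3) can also be checked directly by counting rows per year and counting missing or blank lyrics. This is a minimal sketch on made-up toy rows with two entries per year; on the real data the expected count per year would be 100. All variable names are illustrative.

```python
import pandas as pd

# Toy rows: one lyric is missing (None) and one is only whitespace
toy = pd.DataFrame({
    "Year": [1965, 1965, 2015, 2015],
    "Rank": [1, 2, 1, 2],
    "Lyrics": ["wooly bully", None, "oh man", " "],
})

# Rows per year, and whether every year has the expected number
per_year = toy.groupby("Year").size()
all_complete = bool((per_year == 2).all())

# Missing (NaN) lyrics versus lyrics that are empty after stripping whitespace
n_missing = int(toy["Lyrics"].isna().sum())
n_blank = int((toy["Lyrics"].fillna("").str.strip() == "").sum())

print(per_year.to_dict(), all_complete, n_missing, n_blank)
```

Counting blank strings separately matters because `.info()` reports only NaN values as null.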

## (4) Accessing routines
- How to get songs from year 1965 vs. songs from year 2015?
- How to get number-1 hits only?
- How to get Madonna songs only?
- Madonna songs that are top 10? Year 2000 or later? 

In [8]:
# METHOD 1
# oldies = songs.loc[songs['Year']==1965,:]
# newsies = songs.loc[songs['Year']==2015,:]

# METHOD 2
newfilter = songs['Year'] == 2015
newsies = songs.loc[newfilter,:]

oldfilter = songs['Year'] == 1965
oldies = songs.loc[oldfilter,:]

newsies.head()     # not displayed: a notebook cell shows only its last expression
oldies.head()

# METHOD 3
# newsies = songs.iloc[-100:, :]
# newsies
# oldies = songs.iloc[:100, :]
# oldies

Unnamed: 0,Rank,Song,Artist,Year,Lyrics,Source
0,1,wooly bully,sam the sham and the pharaohs,1965,sam the sham miscellaneous wooly bully wooly b...,3.0
1,2,i cant help myself sugar pie honey bunch,four tops,1965,sugar pie honey bunch you know that i love yo...,1.0
2,3,i cant get no satisfaction,the rolling stones,1965,,1.0
3,4,you were on my mind,we five,1965,when i woke up this morning you were on my mi...,1.0
4,5,youve lost that lovin feelin,the righteous brothers,1965,you never close your eyes anymore when i kiss...,1.0


In [9]:
num1 = songs2[songs2['Rank'] == 1]
num1

Unnamed: 0,Year,Rank,Artist,Song,Lyrics
0,1965,1,sam the sham and the pharaohs,wooly bully,sam the sham miscellaneous wooly bully wooly b...
100,1966,1,ssgt barry sadler,ballad of the green berets,
200,1967,1,lulu,to sir with love,those school girl days of telling tales and b...
300,1968,1,the beatles,hey jude,hey jude dont make it bad take a sad song and ...
400,1969,1,the archies,sugar sugar,sugar honey honey you are my candy girl and y...
500,1970,1,simon garfunkel,bridge over troubled water,when youre weary feeling small when tears are ...
600,1971,1,three dog night,joy to the world,jeremiah was a bullfrog was a good friend of ...
700,1972,1,roberta flack,the first time ever i saw your face,the first time ever i saw your face i thought...
800,1973,1,tony orlando and dawn,tie a yellow ribbon round the ole oak tree,im comin home ive done my time now ive got to ...
900,1974,1,barbra streisand,the way we were,memories light the corners of my mind misty w...


In [10]:
songs2.loc[songs2['Artist']=='madonna',:]

Unnamed: 0,Year,Rank,Artist,Song,Lyrics
1934,1984,35,madonna,borderline,something in the way you love me wont let me ...
1965,1984,66,madonna,lucky star,you must be my lucky star cause you shine on ...
1978,1984,79,madonna,holiday,holiday celebrate holiday celebrateif we took...
2001,1985,2,madonna,like a virgin,i made it through the wilderness somehow i ma...
2008,1985,9,madonna,crazy for you,swaying room as the music starts strangers ma...
2057,1985,58,madonna,material girl,some boys kiss me some boys hug me i think th...
2080,1985,81,madonna,angel,why am i standing on a cloud every time youre...
2097,1985,98,madonna,dress you up,youve got style thats what all the girls say ...
2128,1986,29,madonna,papa dont preach,
2134,1986,35,madonna,live to tell,i have a tale to tell sometimes it gets so ha...


In [22]:
songs2.loc[(songs2['Artist']=='madonna') & (songs2['Year']>=2000),:]     # Madonna songs year 2000 or later

Unnamed: 0,Year,Rank,Artist,Song,Lyrics
3516,2000,17,madonna,music,hey mister dj put a record on i want to dance...
3633,2001,34,madonna,dont tell me,dont tell me to stop tell the rain not to dro...
4190,2006,91,madonna,hung up,time goes by so slowly time goes by so slowly...
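
The remaining question in (4), Madonna songs that are top 10, follows the same pattern of combining boolean filters. A minimal sketch on a toy frame built from a few rows shown above; `toy`, `is_madonna` and `is_top10` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1968, 1985, 1985, 2000, 2006],
    "Rank": [1, 2, 9, 17, 91],
    "Artist": ["the beatles", "madonna", "madonna", "madonna", "madonna"],
    "Song": ["hey jude", "like a virgin", "crazy for you", "music", "hung up"],
})

# Name each filter, then combine with & (element-wise "and")
is_madonna = toy["Artist"] == "madonna"
is_top10 = toy["Rank"] <= 10

madonna_top10 = toy.loc[is_madonna & is_top10, :]
print(madonna_top10["Song"].tolist())

# The same selection written with DataFrame.query
same = toy.query("Artist == 'madonna' and Rank <= 10")
print(same.equals(madonna_top10))
```

When the comparisons are written inline rather than named, each one needs its own parentheses because `&` binds more tightly than `==` and `<=`.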


## (5) Who are top artists?
- How many artists are represented?
- Which artists produced most hits?
- How many artists are one-hit wonders?

In [33]:
songs2['Artist'].unique().size     # how many artists are represented?

2473

In [38]:
songs2['Artist'].value_counts()     # which artists produced most hits? also, how many hits per artist?

madonna                                            35
elton john                                         26
mariah carey                                       25
michael jackson                                    22
stevie wonder                                      22
janet jackson                                      22
rihanna                                            19
taylor swift                                       19
whitney houston                                    19
pink                                               17
kelly clarkson                                     17
the beatles                                        17
britney spears                                     16
the black eyed peas                                16
chicago                                            15
katy perry                                         14
usher                                              14
rod stewart                                        14
aretha franklin             

In [43]:
hits = songs2['Artist'].value_counts()     # compute the per-artist counts once
hits[hits == 1].size     # how many artists only had one hit?

1628
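
The three questions in (5) can be answered from a single `value_counts()` result. A small sketch on a toy Series, with illustrative variable names:

```python
import pandas as pd

artists = pd.Series(
    ["madonna", "madonna", "the beatles", "lulu", "madonna", "the beatles"],
    name="Artist",
)

n_artists = artists.nunique()       # how many distinct artists
hits = artists.value_counts()       # hits per artist, most frequent first
top_artist = hits.idxmax()          # artist with the most hits
one_hit = int((hits == 1).sum())    # artists with exactly one hit

print(n_artists, top_artist, one_hit)
```

Summing a boolean Series counts its `True` values, which avoids filtering and then measuring the size.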

## (6) Create another column: lyric length in word count
- All symbols have been removed from lyrics
- So, `.split()` works as a rudimentary tokenizer

In [75]:
songs.fillna('', inplace=True)     # fills NaN in every column (Source too), not just Lyrics
songs['Lyric Length'] = [len(x.split()) for x in songs['Lyrics']]
songs.head()

Unnamed: 0,Rank,Song,Artist,Year,Lyrics,Source,Lyric Length,Tokens,Toks_2
0,1,wooly bully,sam the sham and the pharaohs,1965,sam the sham miscellaneous wooly bully wooly b...,3,125,"[sam, the, sham, miscellaneous, wooly, bully, ...","[sam, the, sham, miscellaneous, wooly, bully, ..."
1,2,i cant help myself sugar pie honey bunch,four tops,1965,sugar pie honey bunch you know that i love yo...,1,204,"[sugar, pie, honey, bunch, you, know, that, i,...","[sugar, pie, honey, bunch, you, know, that, i,..."
2,3,i cant get no satisfaction,the rolling stones,1965,,1,0,[],[]
3,4,you were on my mind,we five,1965,when i woke up this morning you were on my mi...,1,152,"[when, i, woke, up, this, morning, you, were, ...","[when, i, woke, up, this, morning, you, were, ..."
4,5,youve lost that lovin feelin,the righteous brothers,1965,you never close your eyes anymore when i kiss...,1,232,"[you, never, close, your, eyes, anymore, when,...","[you, never, close, your, eyes, anymore, when,..."


In [74]:
def tokenize(txt):
    return txt.split()

songs['Tokens'] = songs['Lyrics'].map(tokenize)
songs.head()

Unnamed: 0,Rank,Song,Artist,Year,Lyrics,Source,Lyric Length,Tokens,Toks_2
0,1,wooly bully,sam the sham and the pharaohs,1965,sam the sham miscellaneous wooly bully wooly b...,3,125,"[sam, the, sham, miscellaneous, wooly, bully, ...","[sam, the, sham, miscellaneous, wooly, bully, ..."
1,2,i cant help myself sugar pie honey bunch,four tops,1965,sugar pie honey bunch you know that i love yo...,1,204,"[sugar, pie, honey, bunch, you, know, that, i,...","[sugar, pie, honey, bunch, you, know, that, i,..."
2,3,i cant get no satisfaction,the rolling stones,1965,,1,0,[],[]
3,4,you were on my mind,we five,1965,when i woke up this morning you were on my mi...,1,152,"[when, i, woke, up, this, morning, you, were, ...","[when, i, woke, up, this, morning, you, were, ..."
4,5,youve lost that lovin feelin,the righteous brothers,1965,you never close your eyes anymore when i kiss...,1,232,"[you, never, close, your, eyes, anymore, when,...","[you, never, close, your, eyes, anymore, when,..."


In [73]:
songs['Toks_2'] = songs['Lyrics'].map(lambda x: x.split())
songs.head()

Unnamed: 0,Rank,Song,Artist,Year,Lyrics,Source,Lyric Length,Tokens,Toks_2
0,1,wooly bully,sam the sham and the pharaohs,1965,sam the sham miscellaneous wooly bully wooly b...,3,125,"[sam, the, sham, miscellaneous, wooly, bully, ...","[sam, the, sham, miscellaneous, wooly, bully, ..."
1,2,i cant help myself sugar pie honey bunch,four tops,1965,sugar pie honey bunch you know that i love yo...,1,204,"[sugar, pie, honey, bunch, you, know, that, i,...","[sugar, pie, honey, bunch, you, know, that, i,..."
2,3,i cant get no satisfaction,the rolling stones,1965,,1,0,[],[]
3,4,you were on my mind,we five,1965,when i woke up this morning you were on my mi...,1,152,"[when, i, woke, up, this, morning, you, were, ...","[when, i, woke, up, this, morning, you, were, ..."
4,5,youve lost that lovin feelin,the righteous brothers,1965,you never close your eyes anymore when i kiss...,1,232,"[you, never, close, your, eyes, anymore, when,...","[you, never, close, your, eyes, anymore, when,..."


In [72]:
del songs['Toks_2']
songs.head()

Unnamed: 0,Rank,Song,Artist,Year,Lyrics,Source,Lyric Length,Tokens
0,1,wooly bully,sam the sham and the pharaohs,1965,sam the sham miscellaneous wooly bully wooly b...,3,125,"[sam, the, sham, miscellaneous, wooly, bully, ..."
1,2,i cant help myself sugar pie honey bunch,four tops,1965,sugar pie honey bunch you know that i love yo...,1,204,"[sugar, pie, honey, bunch, you, know, that, i,..."
2,3,i cant get no satisfaction,the rolling stones,1965,,1,0,[]
3,4,you were on my mind,we five,1965,when i woke up this morning you were on my mi...,1,152,"[when, i, woke, up, this, morning, you, were, ..."
4,5,youve lost that lovin feelin,the righteous brothers,1965,you never close your eyes anymore when i kiss...,1,232,"[you, never, close, your, eyes, anymore, when,..."


In [79]:
def get_avg_len(toks):
    # average word length in characters; 0 for songs with no tokens
    if len(toks) == 0:
        return 0
    lens = [len(t) for t in toks]
    return sum(lens)/len(lens)

songs['Avg Word Len'] = songs['Tokens'].map(get_avg_len)
songs.head()

Unnamed: 0,Rank,Song,Artist,Year,Lyrics,Source,Lyric Length,Tokens,Toks_2,Avg Word Len
0,1,wooly bully,sam the sham and the pharaohs,1965,sam the sham miscellaneous wooly bully wooly b...,3,125,"[sam, the, sham, miscellaneous, wooly, bully, ...","[sam, the, sham, miscellaneous, wooly, bully, ...",4.28
1,2,i cant help myself sugar pie honey bunch,four tops,1965,sugar pie honey bunch you know that i love yo...,1,204,"[sugar, pie, honey, bunch, you, know, that, i,...","[sugar, pie, honey, bunch, you, know, that, i,...",3.872549
2,3,i cant get no satisfaction,the rolling stones,1965,,1,0,[],[],0.0
3,4,you were on my mind,we five,1965,when i woke up this morning you were on my mi...,1,152,"[when, i, woke, up, this, morning, you, were, ...","[when, i, woke, up, this, morning, you, were, ...",3.546053
4,5,youve lost that lovin feelin,the righteous brothers,1965,you never close your eyes anymore when i kiss...,1,232,"[you, never, close, your, eyes, anymore, when,...","[you, never, close, your, eyes, anymore, when,...",4.051724
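
The word count in (6) can also be computed with pandas string methods instead of a list comprehension or `.map()`. A minimal sketch on a toy Series (the lyric fragments are shortened from rows shown earlier, and the variable names are illustrative):

```python
import pandas as pd

lyrics = pd.Series(["sugar pie honey bunch", None, "hey jude dont make it bad"])
filled = lyrics.fillna("")

# List-comprehension version, as in the cell above
by_loop = [len(x.split()) for x in filled]

# Vectorized version: split each string on whitespace, then count the pieces
by_str = filled.str.split().str.len()

print(by_loop)
print(by_str.tolist())
```

Both give the same counts; the `.str` version returns a Series that keeps the original index.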


## (7) Filling in missing values
- We really need to take care of the missing lyric values. 
- What are good strategies? Considerations?

In [80]:
# see above
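
One consideration when choosing a strategy is how the fill value affects later statistics. The sketch below uses the five word counts from the `songs.head()` output in (6), where row 2 has no lyrics, and compares treating that row as 0 words with leaving it as NaN. The variable names are illustrative.

```python
import pandas as pd

# Strategy A: missing lyrics become '' and therefore count as 0 words
as_zero = pd.Series([125, 204, 0, 152, 232])

# Strategy B: missing lyrics stay NaN and are skipped by mean()
as_nan = pd.Series([125, 204, float("nan"), 152, 232])

print(as_zero.mean())
print(as_nan.mean())
```

Filling with an empty string keeps `.split()` from failing, but it pulls averages down; keeping NaN preserves the distinction between "no words" and "unknown".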

## (8) A detour: text processing basics
- NLTK functions
- How to word-tokenize
- How to sentence-tokenize
- Frequency distribution
- Type-token ratio

In [81]:
import nltk
nltk.corpus.brown.words()

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
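
The ideas of frequency distribution and type-token ratio can be illustrated without NLTK. This is a standard-library sketch, with `collections.Counter` standing in for `nltk.FreqDist` and `.split()` standing in for a real word tokenizer; the text is a fragment of the "hung up" lyrics shown above.

```python
from collections import Counter

text = "time goes by so slowly time goes by so slowly"
tokens = text.split()            # rudimentary word tokenization

freq = Counter(tokens)           # frequency distribution: word -> count
types = set(tokens)              # distinct word types
ttr = len(types) / len(tokens)   # type-token ratio

print(freq.most_common(2))
print(len(tokens), len(types), ttr)
```

A lower type-token ratio means more repetition. The ratio depends on text length, so it is best compared across samples of similar size.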

## (9) 50 years apart: songs from 1965 vs. 2015
- Let's create two corpora of 1965 lyrics vs. 2015 lyrics
- Good ways to highlight differences? 

In [15]:
# Your code here
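
One possible way to highlight differences, offered as a sketch rather than the intended solution, is to compare relative word frequencies between the two corpora. The token lists below are short fragments from one 1965 and one 2015 song shown above, standing in for the full corpora; `relative_freq` is a hypothetical helper.

```python
from collections import Counter

def relative_freq(tokens):
    """Map each word to its share of all tokens in the corpus."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

old_tokens = "when i woke up this morning you were on my".split()   # 1965 fragment
new_tokens = "i want you to breathe me in let me be your".split()   # 2015 fragment

rel_old = relative_freq(old_tokens)
rel_new = relative_freq(new_tokens)

# Positive values: relatively more frequent in 2015; negative: in 1965
vocab = set(rel_old) | set(rel_new)
diff = {w: rel_new.get(w, 0) - rel_old.get(w, 0) for w in vocab}

print(max(diff, key=diff.get))
```

Relative rather than raw counts are used so that corpora of different sizes can be compared.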

## (10) A different take: multi-level index
- Alternative organization: use year-and-rank combination as index 
- Create a copy of `songs` as `songs2` DataFrame
- How does accessing work in this DF?

In [16]:
# Your code here
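
A sketch of how a year-and-rank index might look, using a toy frame built from four rows shown earlier (`toy` and `indexed` are illustrative names). `set_index` returns a new DataFrame, so the original is left untouched.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1965, 1965, 2015, 2015],
    "Rank": [1, 2, 99, 100],
    "Artist": ["sam the sham and the pharaohs", "four tops",
               "drake", "calvin harris and disciples"],
    "Song": ["wooly bully", "i cant help myself sugar pie honey bunch",
             "back to back", "how deep is your love"],
})

# Two-level index: Year is the outer level, Rank the inner level
indexed = toy.set_index(["Year", "Rank"]).sort_index()

print(indexed.loc[(1965, 1), "Song"])    # one song by (year, rank)
print(indexed.loc[2015])                 # all songs from one year
print(indexed.xs(1, level="Rank"))       # rank-1 songs across all years
```

Selecting on the outer level works with plain `.loc`, while selecting on the inner level alone needs `.xs()` or a slicer.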

## (11) Discussion: data curation method, presentation
- How did Walker collect her song lyrics data?
- What tools did she use?
- How successful was her effort?
- How did she present her work, and what are your thoughts?

Your notes here in markdown