# 50 Years of Pop Music: A Lyrical Analysis

- Class activity 3, 9/12/2017 
- Data courtesy of Kaylin Walker: http://kaylinwalker.com/50-years-of-pop-music/

## (1) Read in the spreadsheet file as a DataFrame
- Create a DataFrame called `songs` from the spreadsheet (the code cells below name it `pop`).
- You will likely run into an encoding error. Who can figure out the correct encoding? 
- Once successfully loaded, explore the DF. Use the following methods: `.head()`, `.tail()`, `.info()`, `.describe()`. 
- What are the column labels and row indexes?
- How to improve organization?

In [1]:
import pandas as pd

pop = pd.read_csv('/Users/Paige/Desktop/billboard_lyrics_1964-2015.csv', encoding='ISO8859')

#How to find the encoding of a file?
#Run `file billboard_lyrics_1964-2015.csv` on the command line

pop.head() #.head shows you the first 5 entries

pop.head(20) #shows you the first 20

pop.tail() #.tail shows you the last 5 entries

pop.info() #Note: pandas stores strings as separate Python objects and keeps pointers to them in the column, hence dtype "object"

pop.describe() #displays count, mean, std, min, 25%, 50%, 75%, max for columns with numerical data

pop.head(20) #Indices are default integers

pop.loc[pop.Artist == 'one direction'] #Here are 5 One Direction songs in the top 100 because we are all really big 1D fans.



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5100 entries, 0 to 5099
Data columns (total 6 columns):
Rank      5100 non-null int64
Song      5100 non-null object
Artist    5100 non-null object
Year      5100 non-null int64
Lyrics    4913 non-null object
Source    4913 non-null float64
dtypes: float64(1), int64(2), object(3)
memory usage: 239.1+ KB


Unnamed: 0,Rank,Song,Artist,Year,Lyrics,Source
4709,10,what makes you beautiful,one direction,2012,youre insecure dont know what for youre turni...,1.0
4873,74,best song ever,one direction,2013,maybe its the way she walked wow straight int...,1.0
4923,24,story of my life,one direction,2014,verse 1 harry written in these walls are the...,1.0
5064,65,drag me down,one direction,2015,ive got fire for a heart im not scared of the...,1.0
5097,98,night changes,one direction,2015,going out tonight changes into something red ...,1.0
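
One way to approach the encoding question above is to try a few candidate encodings until one decodes without error. The sketch below is only an illustration and not part of the original activity: the helper name `first_working_encoding`, the candidate list, and the sample bytes are all assumptions.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes `raw` without an error, else None."""
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None


# Made-up sample: byte 0xE9 is "é" in Latin-1 but an incomplete sequence in UTF-8.
sample = "beyoncé".encode("latin-1")
print(first_working_encoding(sample))          # cp1252
print(first_working_encoding(b"plain ascii"))  # utf-8
```

A clean decode is evidence rather than proof: `latin-1` accepts every byte sequence, so it never fails. Whichever name works can then be passed as the `encoding` argument of `pd.read_csv`.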


## (2) Reorganize and clean up
- Rearrange the columns so that they are ordered: Year - Rank - Artist - Song - Lyrics. 
- Get rid of Source column. 
- What about the index? 

In [2]:
pop = pop[['Year', 'Rank', 'Artist', 'Song', 'Lyrics']] #Double brackets: a list of column names returns a DataFrame; single brackets with one name return a Series
type(pop)
pop.head()

Unnamed: 0,Year,Rank,Artist,Song,Lyrics
0,1965,1,sam the sham and the pharaohs,wooly bully,sam the sham miscellaneous wooly bully wooly b...
1,1965,2,four tops,i cant help myself sugar pie honey bunch,sugar pie honey bunch you know that i love yo...
2,1965,3,the rolling stones,i cant get no satisfaction,
3,1965,4,we five,you were on my mind,when i woke up this morning you were on my mi...
4,1965,5,the righteous brothers,youve lost that lovin feelin,you never close your eyes anymore when i kiss...
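
The difference between single and double brackets can be checked on a small frame. This is a minimal sketch with toy rows (the values, including the `Source` numbers, are illustrative and not read from the real file); it also shows that leaving a column out of the selection list is enough to drop it.

```python
import pandas as pd

# Toy rows; values are illustrative only.
toy = pd.DataFrame({
    "Rank": [1, 2],
    "Song": ["wooly bully", "i cant help myself"],
    "Artist": ["sam the sham", "four tops"],
    "Year": [1965, 1965],
    "Lyrics": ["wooly bully wooly bully", "sugar pie honey bunch"],
    "Source": [1.0, 1.0],
})

# A list of labels selects and reorders columns in one step;
# "Source" is dropped simply by not listing it.
tidy = toy[["Year", "Rank", "Artist", "Song", "Lyrics"]]

print(type(toy["Artist"]).__name__)    # Series
print(type(toy[["Artist"]]).__name__)  # DataFrame
print(list(tidy.columns))
```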


In [3]:
type(pop['Artist']) #A single column is a series!
pop.loc[2]
type(pop['Lyrics'][2]) #str
len(pop['Lyrics'][2]) #2
type(pop['Lyrics'][45]) #float! Missing lyrics are stored as NaN, which is a float

#ways to look up

pop.loc[48].iloc[4] #row 48, then the 5th column by position (Lyrics); position 2 would be Artist
pop.loc[48, 'Lyrics']

'well east coast girls are hip i really dig those styles they wear and the southern girls with the way they talk they knock me out when im down there the midwest farmers daughter really makes you feel alright and the northern girls with the way they kiss they keep their boyfriends warm at night i wish they all could be california wish they all could be california i wish they all could be california girls the west coast has the sunshine and the girls all get so tan i dig a french bikini on hawiian island dollsby a palm tree in the sand i been all round this great big world and i seen all kinds of girls yeah but i couldnt wait to get back in the states back to the cutest girls in the world i wish they all could be california wish they all could be california i wish they all could be california girls i wish they all could be california wish they all could be california i wish they all could be california girls'

In [4]:
%pprint

Pretty printing has been turned OFF


## (3) Sanity check
- Is the data complete? 
  - Are there really 50 years represented?
  - Does every year have all 100 entries? 
- Any missing or anomalous data? How to address them?

In [5]:
#Are there 50 years?
set(pop['Year'])
len(set(pop['Year']))
#also pop['Year']

#There's a better way!

pop['Year'] #This is a series
len(pop['Year'].drop_duplicates())
pop['Year'].unique()
pop['Year'].min()
pop['Year'].max()
pop['Year'].unique().size

pop['Rank'].unique() #100!
#Each rank should be represented 51 times (1965 through 2015 inclusive is 51 years)

pop['Year'].value_counts() #or
pop['Rank'].value_counts().sort_index()

1      51
2      51
3      51
4      51
5      51
6      51
7      51
8      51
9      51
10     51
11     51
12     51
13     51
14     51
15     51
16     51
17     51
18     51
19     51
20     51
21     51
22     51
23     51
24     51
25     51
26     51
27     51
28     51
29     51
30     51
       ..
71     51
72     51
73     51
74     51
75     51
76     51
77     51
78     51
79     51
80     51
81     51
82     51
83     51
84     51
85     51
86     51
87     51
88     51
89     51
90     51
91     51
92     51
93     51
94     51
95     51
96     51
97     51
98     51
99     51
100    51
Name: Rank, Length: 100, dtype: int64
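
The completeness questions above can also be answered per year with `groupby`. The following is a rough sketch on made-up data, assuming three entries per year instead of 100; the variable names are hypothetical.

```python
import pandas as pd

# Toy chart: 1966 is deliberately missing one entry, and one lyric is missing.
toy = pd.DataFrame({
    "Year": [1965, 1965, 1965, 1966, 1966],
    "Rank": [1, 2, 3, 1, 2],
    "Lyrics": ["la la", None, "hey hey", "oh oh", "na na"],
})
expected_per_year = 3  # would be 100 for the real chart data

# Count entries per year and keep only the years that fall short.
entries_per_year = toy.groupby("Year")["Rank"].count()
incomplete_years = entries_per_year[entries_per_year != expected_per_year]

# Count missing values in every column.
missing_per_column = toy.isna().sum()

print(incomplete_years.index.tolist())    # [1966]
print(int(missing_per_column["Lyrics"]))  # 1
```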

## (4) Accessing routines
- How to get songs from year 1965 vs. songs from year 2015?
- How to get number-1 hits only?
- How to get Madonna songs only?
- Madonna songs that are top 10? Year 2000 or later? 

In [6]:
newfilter = pop['Year'] == 1965
oldies = pop.loc[newfilter, :]
df15 = pop.loc[pop['Year']==2015, :]
df15
#or!
#df15 = pop.iloc[-100:, :] #HAVE to use iloc with negative index
number1 = pop.loc[pop['Rank']==1, :]
number1
oldies
df15
pop.loc[pop['Artist']=='madonna', :]

# control + / comments out a highlighted section

Unnamed: 0,Year,Rank,Artist,Song,Lyrics
1934,1984,35,madonna,borderline,something in the way you love me wont let me ...
1965,1984,66,madonna,lucky star,you must be my lucky star cause you shine on ...
1978,1984,79,madonna,holiday,holiday celebrate holiday celebrateif we took...
2001,1985,2,madonna,like a virgin,i made it through the wilderness somehow i ma...
2008,1985,9,madonna,crazy for you,swaying room as the music starts strangers ma...
2057,1985,58,madonna,material girl,some boys kiss me some boys hug me i think th...
2080,1985,81,madonna,angel,why am i standing on a cloud every time youre...
2097,1985,98,madonna,dress you up,youve got style thats what all the girls say ...
2128,1986,29,madonna,papa dont preach,
2134,1986,35,madonna,live to tell,i have a tale to tell sometimes it gets so ha...


In [18]:
#Madonna songs only
pop.loc[pop['Artist'] == 'madonna', :]

pop.loc[(pop['Artist'] == 'madonna') & (pop['Year']>=2000), :]

Unnamed: 0,Year,Rank,Artist,Song,Lyrics
3516,2000,17,madonna,music,hey mister dj put a record on i want to dance...
3633,2001,34,madonna,dont tell me,dont tell me to stop tell the rain not to dro...
4190,2006,91,madonna,hung up,time goes by so slowly time goes by so slowly...
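
The remaining question in the list (Madonna songs that are top 10, and from 2000 or later) needs more than two conditions. A possible sketch, using a handful of rows copied from the outputs above as a toy frame; the mask names are assumptions:

```python
import pandas as pd

# A few rows copied from the outputs above; not the full data set.
toy = pd.DataFrame({
    "Year": [1984, 1985, 1985, 2000, 2006, 2012],
    "Rank": [35, 2, 9, 17, 91, 10],
    "Artist": ["madonna", "madonna", "madonna", "madonna", "madonna", "one direction"],
    "Song": ["borderline", "like a virgin", "crazy for you", "music", "hung up",
             "what makes you beautiful"],
})

# Name each boolean mask, then combine with & (and); each comparison needs
# its own parentheses when written inline.
is_madonna = toy["Artist"] == "madonna"
is_top10 = toy["Rank"] <= 10
is_recent = toy["Year"] >= 2000

top10 = toy.loc[is_madonna & is_top10]
top10_recent = toy.loc[is_madonna & is_top10 & is_recent]

print(top10["Song"].tolist())  # ['like a virgin', 'crazy for you']
print(len(top10_recent))       # 0 in this toy subset
```

Naming the masks keeps long filters readable and lets them be reused across cells.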


## (5) Who are top artists?
- How many artists are represented?
- Which artists produced most hits?
- How many artists are one-hit wonders?

In [41]:
dir(pop['Artist'])

pop['Artist'].unique().size

pop['Artist'].value_counts()[:20]

counts = pop['Artist'].value_counts() #a Series: artist -> number of hits
onehit = counts[counts == 1] #we are providing a 1D boolean filter for that Series

onehit.size

1628
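
The three questions in this section map onto `nunique()` and `value_counts()`. Here is a small sketch on a made-up artist column (the names are reused from the data, but the counts are invented):

```python
import pandas as pd

# Made-up column of artists, one entry per charting song.
artists = pd.Series(
    ["madonna", "madonna", "four tops", "we five", "madonna", "four tops"]
)

n_artists = artists.nunique()    # how many distinct artists
counts = artists.value_counts()  # hits per artist, sorted descending
one_hit = counts[counts == 1]    # artists that appear exactly once

print(n_artists)                # 3
print(counts.index[0])          # madonna
print(one_hit.index.tolist())   # ['we five']
```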

## (6) Create another column: lyric length in word count
- All symbols have been removed from lyrics
- So, `.split()` works as a rudimentary tokenizer

In [56]:
len(pop.loc[0, 'Lyrics'].split())

foo = pop[:50]
foo

#NaN lyrics are floats and have no .split(), so run the fillna('') cell in section (7) first
for x in foo['Lyrics']:
    print(len(x.split()))

125
204
0
152
232
239
228
215
148
153
299
169
204
241
134
222
170
1
198
233
176
155
166
148
138
85
184
137
291
154
273
139
113
260
287
280
235
144
217
134
391
0
123
271
0
0
161
160
180
178


In [73]:
foo
[x for x in [1,2,3,4]]
song_length = [len(x.split()) for x in pop['Lyrics']]
song_length[:100]

pop['Length'] = pd.Series(song_length)
pop.head()

#Other ways!

def tokenize(txt):
    return txt.split()

pop['Tokens'] = pop['Lyrics'].map(tokenize)

pop.head()

pop['tokens2'] = pop['Lyrics'].map(lambda x: x.split())
pop.head()

del pop['tokens2']
pop.head()

Unnamed: 0,Year,Rank,Artist,Song,Lyrics,Length,Tokens
0,1965,1,sam the sham and the pharaohs,wooly bully,sam the sham miscellaneous wooly bully wooly b...,125,"[sam, the, sham, miscellaneous, wooly, bully, ..."
1,1965,2,four tops,i cant help myself sugar pie honey bunch,sugar pie honey bunch you know that i love yo...,204,"[sugar, pie, honey, bunch, you, know, that, i,..."
2,1965,3,the rolling stones,i cant get no satisfaction,,0,[]
3,1965,4,we five,you were on my mind,when i woke up this morning you were on my mi...,152,"[when, i, woke, up, this, morning, you, were, ..."
4,1965,5,the righteous brothers,youve lost that lovin feelin,you never close your eyes anymore when i kiss...,232,"[you, never, close, your, eyes, anymore, when,..."


In [77]:
def get_avg_length(tokens):
    #Average word length in characters; 0 for an empty token list
    if len(tokens) == 0:
        return 0
    length = [len(x) for x in tokens]
    return sum(length)/len(tokens)

get_avg_length(pop.loc[0, 'Tokens'])

pop['avg_wd_len'] = pop['Tokens'].map(get_avg_length)
pop.head()

Unnamed: 0,Year,Rank,Artist,Song,Lyrics,Length,Tokens,avg_wd_len
0,1965,1,sam the sham and the pharaohs,wooly bully,sam the sham miscellaneous wooly bully wooly b...,125,"[sam, the, sham, miscellaneous, wooly, bully, ...",4.28
1,1965,2,four tops,i cant help myself sugar pie honey bunch,sugar pie honey bunch you know that i love yo...,204,"[sugar, pie, honey, bunch, you, know, that, i,...",3.872549
2,1965,3,the rolling stones,i cant get no satisfaction,,0,[],0.0
3,1965,4,we five,you were on my mind,when i woke up this morning you were on my mi...,152,"[when, i, woke, up, this, morning, you, were, ...",3.546053
4,1965,5,the righteous brothers,youve lost that lovin feelin,you never close your eyes anymore when i kiss...,232,"[you, never, close, your, eyes, anymore, when,...",4.051724
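
Besides list comprehensions and `.map()`, the `.str` accessor offers a vectorized route to the same word counts. A minimal sketch on a made-up lyrics column, filling missing values first so that every entry is a string:

```python
import pandas as pd

# Made-up lyrics column with one empty and one missing entry.
lyrics = pd.Series(["wooly bully wooly bully", "", None])

# .str.split() tokenizes every row; .str.len() then counts the tokens.
tokens = lyrics.fillna("").str.split()
length = tokens.str.len()

print(length.tolist())  # [4, 0, 0]
```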


## (7) Filling in missing values
- We really need to take care of the missing lyric values. 
- What are good strategies? Considerations?

In [54]:
#pop = pop.fillna('')

pop.fillna('') #doesn't change the original
pop.fillna('', inplace=True) #changes the original!
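
One consideration when filling: once missing lyrics become empty strings, they look like real songs with zero words and can drag averages down. A hedged sketch of one possible strategy, on toy data, is to record which rows had lyrics before filling; the `has_lyrics` column name is an assumption, not something from the original notebook.

```python
import pandas as pd

# Toy frame with one missing lyric.
df = pd.DataFrame({
    "Song": ["wooly bully", "i cant get no satisfaction"],
    "Lyrics": ["wooly bully wooly bully", None],
})

# Remember which rows really had lyrics before anything is filled in.
df["has_lyrics"] = df["Lyrics"].notna()

filled = df.fillna("")             # returns a new DataFrame; df is untouched
print(df["Lyrics"].isna().sum())   # 1

# Fill only the column that needs it and assign the result back.
df["Lyrics"] = df["Lyrics"].fillna("")
print(df["Lyrics"].isna().sum())   # 0
print(df["has_lyrics"].tolist())   # [True, False]
```

Later statistics can then be restricted to `df[df["has_lyrics"]]` so that empty lyrics do not count as zero-length songs.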

## (8) A detour: text processing basics
- NLTK functions
- How to word-tokenize
- How to sentence-tokenize
- Frequency distribution
- Type-token ratio

In [81]:
#TTR
def ttr(tokens):
    if len(tokens) == 0:
        return 0
    else:
        return len(set(tokens))/len(tokens)

pop['TTR'] = pop.Tokens.map(ttr)
pop.head()

Unnamed: 0,Year,Rank,Artist,Song,Lyrics,Length,Tokens,avg_wd_len,TTR
0,1965,1,sam the sham and the pharaohs,wooly bully,sam the sham miscellaneous wooly bully wooly b...,125,"[sam, the, sham, miscellaneous, wooly, bully, ...",4.28,0.512
1,1965,2,four tops,i cant help myself sugar pie honey bunch,sugar pie honey bunch you know that i love yo...,204,"[sugar, pie, honey, bunch, you, know, that, i,...",3.872549,0.460784
2,1965,3,the rolling stones,i cant get no satisfaction,,0,[],0.0,0.0
3,1965,4,we five,you were on my mind,when i woke up this morning you were on my mi...,152,"[when, i, woke, up, this, morning, you, were, ...",3.546053,0.289474
4,1965,5,the righteous brothers,youve lost that lovin feelin,you never close your eyes anymore when i kiss...,232,"[you, never, close, your, eyes, anymore, when,...",4.051724,0.37931
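
A frequency distribution and the type-token ratio can be tried out on a short token list. The sketch below uses `collections.Counter` from the standard library as a stand-in for NLTK's frequency distribution, so it runs without NLTK installed; the token list is made up.

```python
from collections import Counter


def ttr(tokens):
    """Type-token ratio: distinct words divided by total words (0 if empty)."""
    return len(set(tokens)) / len(tokens) if tokens else 0


# Made-up lyric; .split() is the rudimentary tokenizer used earlier.
tokens = "la la la hey hey you".split()

freq = Counter(tokens)      # word -> count
print(freq.most_common(2))  # [('la', 3), ('hey', 2)]
print(ttr(tokens))          # 0.5
```

Three distinct words over six tokens gives 0.5; highly repetitive lyrics push the ratio toward 0.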


## (9) 50 years apart: songs from 1965 vs. 2015
- Let's create two corpora of 1965 lyrics vs. 2015 lyrics
- Good ways to highlight differences? 

In [11]:
# Your code here
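
One possible starting point, offered as a sketch rather than the intended solution: pool the tokens of each year into a corpus, convert counts to relative frequencies, and list the words whose share differs most. The two tiny corpora and the helper names `relative_freq` and `biggest_gaps` are made up for illustration.

```python
from collections import Counter


def relative_freq(lyrics_list):
    """Map each word to its share of all tokens in the corpus."""
    tokens = [tok for lyric in lyrics_list for tok in lyric.split()]
    counts = Counter(tokens)
    return {word: n / len(tokens) for word, n in counts.items()}


def biggest_gaps(freq_a, freq_b, n=3):
    """Words with the largest absolute difference in relative frequency."""
    vocab = set(freq_a) | set(freq_b)
    gaps = {w: freq_a.get(w, 0) - freq_b.get(w, 0) for w in vocab}
    # Sort by gap size (largest first), then alphabetically to break ties.
    return sorted(gaps.items(), key=lambda kv: (-abs(kv[1]), kv[0]))[:n]


# Made-up stand-ins for the 1965 and 2015 lyrics.
old_corpus = ["baby baby love", "love me do"]
new_corpus = ["party party tonight", "love tonight"]

gaps = biggest_gaps(relative_freq(old_corpus), relative_freq(new_corpus))
print([word for word, _ in gaps])  # ['party', 'tonight', 'baby']
```

With the real data, the two lists could come from `pop.loc[pop['Year']==1965, 'Lyrics']` and `pop.loc[pop['Year']==2015, 'Lyrics']` once missing lyrics have been filled.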

## (10) A different take: multi-level index
- Alternative organization: use year-and-rank combination as index 
- Create a copy of `songs` as `songs2` DataFrame
- How does accessing work in this DF?

In [12]:
# Your code here
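
A possible way in, sketched on four rows copied from the outputs above: `set_index` with a list of columns builds the year-and-rank index, and `.loc` then accepts either a full `(year, rank)` tuple or just a year. The toy frame stands in for the full DataFrame.

```python
import pandas as pd

# Four rows copied from the outputs above.
toy = pd.DataFrame({
    "Year": [1965, 1965, 2015, 2015],
    "Rank": [1, 2, 65, 98],
    "Artist": ["sam the sham and the pharaohs", "four tops",
               "one direction", "one direction"],
    "Song": ["wooly bully", "i cant help myself sugar pie honey bunch",
             "drag me down", "night changes"],
})

# Copy, then use (Year, Rank) as a two-level index; sorting keeps lookups fast.
songs2 = toy.copy().set_index(["Year", "Rank"]).sort_index()

print(songs2.loc[(1965, 1), "Song"])                   # wooly bully
print(songs2.loc[2015, "Song"].tolist())               # ['drag me down', 'night changes']
print(songs2.xs(2, level="Rank")["Artist"].tolist())   # ['four tops']
```

Selecting on the outer level (`.loc[2015]`) is direct, while selecting on the inner level alone goes through `.xs(..., level="Rank")`.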

## (11) Discussion: data curation method, presentation
- How did Walker collect her song lyrics data?
- What tools did she use?
- How successful was her effort?
- How did she present her work, and what are your thoughts?

Your notes here in markdown