# 50 Years of Pop Music: A Lyrical Analysis

- Class activity 3, 9/12/2017 
- Data courtesy of Kaylin Walker: http://kaylinwalker.com/50-years-of-pop-music/

## (1) Read in the spreadsheet file as a DataFrame
- Create a DataFrame called `songs` from the spreadsheet. 
- You will likely run into an encoding error. Who can figure out the correct encoding? 
- Once successfully loaded, explore the DF. Use the following methods: `.head()`, `.tail()`, `.info()`, `.describe()`. 
- What are the column labels and row indexes?
- How to improve organization?

In [1]:
import pandas as pd
songs = pd.read_csv('billboard_lyrics_1964-2015.csv', encoding="ISO8859") 
# encoding="cp1252" also works
# The Unix `file` command reports a file's encoding (not always correctly, but a useful first guess)

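songs.info()      # prints a summary: dtypes and non-null counts
songs.describe()  # a notebook cell displays only its last expression,
songs.tail()      # so run these in separate cells to see their output
songs.head(20)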

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5100 entries, 0 to 5099
Data columns (total 6 columns):
Rank      5100 non-null int64
Song      5100 non-null object
Artist    5100 non-null object
Year      5100 non-null int64
Lyrics    4913 non-null object
Source    4913 non-null float64
dtypes: float64(1), int64(2), object(3)
memory usage: 239.1+ KB


|  | Rank | Song | Artist | Year | Lyrics | Source |
|---|---|---|---|---|---|---|
| 0 | 1 | wooly bully | sam the sham and the pharaohs | 1965 | sam the sham miscellaneous wooly bully wooly b... | 3.0 |
| 1 | 2 | i cant help myself sugar pie honey bunch | four tops | 1965 | sugar pie honey bunch you know that i love yo... | 1.0 |
| 2 | 3 | i cant get no satisfaction | the rolling stones | 1965 |  | 1.0 |
| 3 | 4 | you were on my mind | we five | 1965 | when i woke up this morning you were on my mi... | 1.0 |
| 4 | 5 | youve lost that lovin feelin | the righteous brothers | 1965 | you never close your eyes anymore when i kiss... | 1.0 |
| 5 | 6 | downtown | petula clark | 1965 | when youre alone and life is making you lonel... | 1.0 |
| 6 | 7 | help | the beatles | 1965 | help i need somebody help not just anybody hel... | 3.0 |
| 7 | 8 | cant you hear my heart beat | hermans hermits | 1965 | carterlewis every time i see you lookin my way... | 5.0 |
| 8 | 9 | crying in the chapel | elvis presley | 1965 | you saw me crying in the chapel the tears i s... | 1.0 |
| 9 | 10 | my girl | the temptations | 1965 | ive got sunshine on a cloudy day when its cold... | 3.0 |


## (2) Reorganize and clean up
- Rearrange the columns so that they are ordered: Year - Rank - Artist - Song - Lyrics. 
- Get rid of Source column. 
- What about the index? 

In [2]:
songs[['Artist']].head()
type(songs[['Artist']])    # double brackets return a DataFrame; cf. songs['Artist'], which is a Series

pandas.core.frame.DataFrame

In [3]:
songs = songs[['Year', 'Rank', 'Artist', 'Song', 'Lyrics']]
type(songs)
songs.head(50)

|  | Year | Rank | Artist | Song | Lyrics |
|---|---|---|---|---|---|
| 0 | 1965 | 1 | sam the sham and the pharaohs | wooly bully | sam the sham miscellaneous wooly bully wooly b... |
| 1 | 1965 | 2 | four tops | i cant help myself sugar pie honey bunch | sugar pie honey bunch you know that i love yo... |
| 2 | 1965 | 3 | the rolling stones | i cant get no satisfaction |  |
| 3 | 1965 | 4 | we five | you were on my mind | when i woke up this morning you were on my mi... |
| 4 | 1965 | 5 | the righteous brothers | youve lost that lovin feelin | you never close your eyes anymore when i kiss... |
| 5 | 1965 | 6 | petula clark | downtown | when youre alone and life is making you lonel... |
| 6 | 1965 | 7 | the beatles | help | help i need somebody help not just anybody hel... |
| 7 | 1965 | 8 | hermans hermits | cant you hear my heart beat | carterlewis every time i see you lookin my way... |
| 8 | 1965 | 9 | elvis presley | crying in the chapel | you saw me crying in the chapel the tears i s... |
| 9 | 1965 | 10 | the temptations | my girl | ive got sunshine on a cloudy day when its cold... |


In [4]:
songs['Lyrics'][2]   # this one is '  ' (two spaces)
songs['Lyrics'][45]  # this one is NaN ("not a number"), pandas' marker for a missing value
songs.iloc[48, 4]        # iloc: using integer indexes
songs.loc[48, 'Lyrics']  # loc: using labels

'well east coast girls are hip i really dig those styles they wear and the southern girls with the way they talk they knock me out when im down there the midwest farmers daughter really makes you feel alright and the northern girls with the way they kiss they keep their boyfriends warm at night i wish they all could be california wish they all could be california i wish they all could be california girls the west coast has the sunshine and the girls all get so tan i dig a french bikini on hawiian island dollsby a palm tree in the sand i been all round this great big world and i seen all kinds of girls yeah but i couldnt wait to get back in the states back to the cutest girls in the world i wish they all could be california wish they all could be california i wish they all could be california girls i wish they all could be california wish they all could be california i wish they all could be california girls'

## (3) Sanity check
- Is the data complete? 
  - Are there really 50 years represented?
  - Does every year have all 100 entries? 
- Any missing or anomalous data? How to address them?

In [5]:
%pprint  # toggle on/off pretty printing 

Pretty printing has been turned OFF


In [6]:
# len(set(songs['Year']))      # works, but inefficient and not the pandas way
# dir(songs['Year'])           # lists all attributes and methods available on this Series
# help(songs['Year'].unique)   # look up the help description for .unique
songs['Year'].unique()         # unique values in the 'Year' Series
songs['Year'].unique().size    # 51 distinct years (1965 through 2015), not 50
songs['Rank'].value_counts().sort_index()   # count occurrences of each rank, then sort by rank

1      51
2      51
3      51
4      51
5      51
6      51
7      51
8      51
9      51
10     51
11     51
12     51
13     51
14     51
15     51
16     51
17     51
18     51
19     51
20     51
21     51
22     51
23     51
24     51
25     51
26     51
27     51
28     51
29     51
30     51
       ..
71     51
72     51
73     51
74     51
75     51
76     51
77     51
78     51
79     51
80     51
81     51
82     51
83     51
84     51
85     51
86     51
87     51
88     51
89     51
90     51
91     51
92     51
93     51
94     51
95     51
96     51
97     51
98     51
99     51
100    51
Name: Rank, Length: 100, dtype: int64
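
The rank counts above show that each rank appears 51 times, but they do not directly show that every year has all 100 entries. One possible check, sketched here on a small invented stand-in for `songs` (the `toy` DataFrame and the `expected` count are assumptions for illustration), is to count rows per year and look for duplicate year-and-rank pairs:

```python
import pandas as pd

# Toy stand-in for `songs` (invented rows; the real DataFrame has 5100).
toy = pd.DataFrame({
    "Year": [1965, 1965, 1965, 2015, 2015],
    "Rank": [1, 2, 3, 1, 2],
})

per_year = toy["Year"].value_counts().sort_index()   # number of rows for each year
expected = 3                                         # would be 100 for the real data
incomplete = per_year[per_year != expected]          # years with the wrong number of rows
print(incomplete)

# A repeated (Year, Rank) pair would also signal a problem.
n_dupes = toy.duplicated(subset=["Year", "Rank"]).sum()
print(n_dupes)
```

On the real data, an empty `incomplete` Series and zero duplicates would suggest that every year is fully represented.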

## (4) Accessing routines
- How to get songs from year 1965 vs. songs from year 2015?
- How to get number-1 hits only?
- How to get Madonna songs only?
- Madonna songs that are top 10? Year 2000 or later? 

In [7]:
newfilter = songs['Year'] == 2015     # a series of True/False boolean values
newsongs = songs.loc[newfilter, :]    # use it as a filter. Year 2015 songs only. 
newsongs.info()

newsongs = songs.iloc[-100: , :]     # last 100 rows are all 2015 songs 
newsongs.head()

oldies = songs.loc[songs['Year'] == 1965, :]     # one-liner 

millennial = songs.loc[songs['Year'] >= 2000, :]  # songs from 2000 onward
millennial.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 5000 to 5099
Data columns (total 5 columns):
Year      100 non-null int64
Rank      100 non-null int64
Artist    100 non-null object
Song      100 non-null object
Lyrics    98 non-null object
dtypes: int64(2), object(3)
memory usage: 4.7+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1600 entries, 3500 to 5099
Data columns (total 5 columns):
Year      1600 non-null int64
Rank      1600 non-null int64
Artist    1600 non-null object
Song      1600 non-null object
Lyrics    1567 non-null object
dtypes: int64(2), object(3)
memory usage: 75.0+ KB


In [8]:
songs.loc[songs['Artist']=='madonna', :]   # Madonna songs only! 

|  | Year | Rank | Artist | Song | Lyrics |
|---|---|---|---|---|---|
| 1934 | 1984 | 35 | madonna | borderline | something in the way you love me wont let me ... |
| 1965 | 1984 | 66 | madonna | lucky star | you must be my lucky star cause you shine on ... |
| 1978 | 1984 | 79 | madonna | holiday | holiday celebrate holiday celebrateif we took... |
| 2001 | 1985 | 2 | madonna | like a virgin | i made it through the wilderness somehow i ma... |
| 2008 | 1985 | 9 | madonna | crazy for you | swaying room as the music starts strangers ma... |
| 2057 | 1985 | 58 | madonna | material girl | some boys kiss me some boys hug me i think th... |
| 2080 | 1985 | 81 | madonna | angel | why am i standing on a cloud every time youre... |
| 2097 | 1985 | 98 | madonna | dress you up | youve got style thats what all the girls say ... |
| 2128 | 1986 | 29 | madonna | papa dont preach |  |
| 2134 | 1986 | 35 | madonna | live to tell | i have a tale to tell sometimes it gets so ha... |
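
The remaining questions in this section (number-1 hits, Madonna songs in the top 10, Madonna songs in the top 10 from 2000 onward) can be approached by combining boolean filters with `&`. The following is a minimal sketch on invented rows; the `toy` DataFrame, its song titles, and the variable names are placeholders, not real chart data:

```python
import pandas as pd

# Invented rows standing in for `songs`; titles and the second artist are placeholders.
toy = pd.DataFrame({
    "Year":   [1985, 1985, 2000, 2001, 2000],
    "Rank":   [2, 58, 7, 35, 1],
    "Artist": ["madonna", "madonna", "madonna", "madonna", "other artist"],
    "Song":   ["song a", "song b", "song c", "song d", "song e"],
})

# Each comparison yields a boolean Series aligned with the rows.
is_madonna = toy["Artist"] == "madonna"
top10 = toy["Rank"] <= 10
recent = toy["Year"] >= 2000

number_ones = toy.loc[toy["Rank"] == 1, :]                       # number-1 hits only
madonna_top10 = toy.loc[is_madonna & top10, :]                   # both conditions must hold
madonna_top10_recent = toy.loc[is_madonna & top10 & recent, :]   # all three must hold
print(madonna_top10_recent)
```

When the comparisons are written inline rather than stored in variables, each one needs its own parentheses, because `&` binds more tightly than `==` and `<=`.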


## (5) Who are top artists?
- How many artists are represented?
- Which artists produced most hits?
- How many artists are one-hit wonders?

In [None]:
# Your code here
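
One possible starting point, sketched on an invented `toy` column of artist labels: `.nunique()` counts distinct artists, and `.value_counts()` gives the number of chart entries per artist, from which one-hit wonders can be counted. This treats each distinct string as one artist, so collaborations credited under a combined name would presumably count separately.

```python
import pandas as pd

# Invented artist labels standing in for songs['Artist'].
toy = pd.DataFrame({"Artist": ["a", "b", "a", "c", "a", "b"]})

n_artists = toy["Artist"].nunique()    # how many distinct artists
hits = toy["Artist"].value_counts()    # entries per artist, most frequent first
top = hits.head(2)                     # the artists with the most entries
one_hit = (hits == 1).sum()            # artists that appear exactly once

print(n_artists, one_hit)
print(top)
```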

## (6) Create another column: lyric length in word count
- All punctuation and symbols have been removed from the lyrics.
- So `.split()` works as a rudimentary whitespace tokenizer.

In [None]:
# Your code here
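
A minimal sketch of one way to do this, using the vectorized `.str` accessor on an invented `toy` DataFrame. The column name `Wordcount` is an assumption, not part of the original data:

```python
import numpy as np
import pandas as pd

# Invented rows: a normal lyric, a whitespace-only lyric, and a missing one.
toy = pd.DataFrame({"Lyrics": ["help i need somebody", "  ", np.nan]})

# .str.split() turns each lyric into a list of words; .str.len() measures each list.
# Missing lyrics stay missing rather than becoming 0.
toy["Wordcount"] = toy["Lyrics"].str.split().str.len()
print(toy)
```

Note that whitespace-only lyrics get a count of 0 while NaN lyrics get NaN, so the two kinds of missing data remain distinguishable here.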

## (7) Filling in missing values
- We really need to take care of the missing lyric values. 
- What are good strategies? Considerations?

In [None]:
# Your code here
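
As seen in section (2), missing lyrics show up in two forms: whitespace-only strings and NaN. One possible strategy, sketched on an invented `toy` DataFrame, is to normalize both to NaN first and then decide whether to drop those rows or keep them; which choice is right depends on the analysis.

```python
import numpy as np
import pandas as pd

# Invented rows: a normal lyric, a whitespace-only lyric, and a NaN lyric.
toy = pd.DataFrame({
    "Song":   ["s1", "s2", "s3"],
    "Lyrics": ["some words", "  ", np.nan],
})

# Flag whitespace-only lyrics; NaN values compare as False here.
blank = toy["Lyrics"].str.strip() == ""
toy.loc[blank, "Lyrics"] = np.nan          # treat them as missing too

n_missing = toy["Lyrics"].isna().sum()     # total missing after normalizing
with_lyrics = toy.dropna(subset=["Lyrics"])  # one option: keep only rows that have lyrics
print(n_missing, len(with_lyrics))
```

Dropping rows is simple but would break the "100 entries per year" property checked earlier, so keeping the rows with NaN and skipping them during text analysis may be preferable.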

## (8) A detour: text processing basics
- NLTK functions
- How to word-tokenize
- How to sentence-tokenize
- Frequency distribution
- Type-token ratio

In [None]:
# Your code here
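
The sketch below illustrates frequency distribution and type-token ratio using only the standard library, as a stand-in for the NLTK versions. NLTK's `FreqDist` is a subclass of `collections.Counter`, so it should behave similarly here; `nltk.word_tokenize` and `nltk.sent_tokenize` additionally require NLTK's tokenizer models to be downloaded. Because the lyrics contain no punctuation, sentence tokenization is unlikely to be informative for this dataset.

```python
from collections import Counter

# The visible start of one lyric from the table in section (1).
text = "help i need somebody help not just anybody"

tokens = text.split()            # whitespace tokenization (word tokens)
freq = Counter(tokens)           # frequency distribution: word -> count
types = set(tokens)              # distinct word types
ttr = len(types) / len(tokens)   # type-token ratio: types divided by tokens

print(freq.most_common(3))
print(ttr)
```

Type-token ratio is sensitive to text length, so comparing it across lyrics of very different lengths calls for some caution.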

## (9) 50 years apart: songs from 1965 vs. 2015
- Let's create two corpora of 1965 lyrics vs. 2015 lyrics
- Good ways to highlight differences? 

In [None]:
# Your code here
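
One way to start, sketched on invented lyrics: pool all lyrics from a given year into a word-count table, then compare relative frequencies and look for words that occur in only one of the two years. The helper names `corpus_counts` and `rel_freq` are hypothetical, and this is only one of many possible ways to highlight differences.

```python
from collections import Counter

import pandas as pd

# Invented lyrics standing in for the real 1965 and 2015 rows.
toy = pd.DataFrame({
    "Year":   [1965, 1965, 2015, 2015],
    "Lyrics": ["baby i love you", "love love me", "baby we party tonight", None],
})

def corpus_counts(df, year):
    """Pool the non-missing lyrics of one year and count word tokens."""
    lyrics = df.loc[df["Year"] == year, "Lyrics"].dropna()
    return Counter(" ".join(lyrics).split())

def rel_freq(counts, word):
    """Share of all tokens in the corpus that are `word`."""
    return counts[word] / sum(counts.values())

old = corpus_counts(toy, 1965)
new = corpus_counts(toy, 2015)

only_new = set(new) - set(old)    # words seen in 2015 but not in 1965
print(only_new)
print(rel_freq(old, "baby"), rel_freq(new, "baby"))
```

Relative rather than raw frequencies are used because the two corpora may differ considerably in total word count.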

## (10) A different take: multi-level index
- Alternative organization: use year-and-rank combination as index 
- Create a copy of `songs` as `songs2` DataFrame
- How does accessing work in this DF?

In [None]:
# Your code here
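
A minimal sketch of the multi-level index, built on an invented `toy` DataFrame rather than the real `songs`. `set_index` with a list of columns creates the year-and-rank MultiIndex, after which rows can be addressed by tuple, by outer level, or by inner level:

```python
import pandas as pd

# Invented rows standing in for `songs`.
toy = pd.DataFrame({
    "Year":   [1965, 1965, 2015, 2015],
    "Rank":   [1, 2, 1, 2],
    "Artist": ["a1", "a2", "a3", "a4"],
    "Song":   ["s1", "s2", "s3", "s4"],
})

# Copy, then move Year and Rank from columns into a two-level index.
songs2 = toy.copy().set_index(["Year", "Rank"]).sort_index()

one = songs2.loc[(2015, 1), "Song"]        # a single cell, addressed by a (Year, Rank) tuple
year_1965 = songs2.loc[1965]               # all 1965 rows, now indexed by Rank alone
number_ones = songs2.xs(1, level="Rank")   # rank-1 rows across years, indexed by Year

print(one)
print(number_ones)
```

With this layout, the year filters from section (4) become plain label lookups, while filtering on `Artist` still needs a boolean mask.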

## (11) Discussion: data curation method, presentation
- How did Walker collect her song lyrics data?
- What tools did she use?
- How successful was her effort?
- How did she present her work, and what are your thoughts?

Your notes here in markdown