# 50 Years of Pop Music: A Lyrical Analysis

- Class activity 3, 9/12/2017 
- Data courtesy of Kaylin Walker: http://kaylinwalker.com/50-years-of-pop-music/

## (1) Read in the spreadsheet file as a DataFrame
- Create a DataFrame called `songs` from the spreadsheet. 
- You will likely run into an encoding error. Who can figure out the correct encoding? 
- Once successfully loaded, explore the DF. Use the following methods: `.head()`, `.tail()`, `.info()`, `.describe()`. 
- What are the column labels and row indexes?
- How to improve organization?

In [1]:
%pprint  # toggle pretty printing off so long lists print compactly instead of one item per line


import numpy as np
import pandas as pd

songs = pd.read_csv('billboard_lyrics_1964-2015.csv', encoding="cp1252")  # encoding="ISO-8859-1" also works

songs.head()      # first 5 rows; pass a number in the parentheses for a different count
songs.tail()      # last 5 rows
songs.info()      # column names, non-null counts, and dtypes
songs.describe()  # summary statistics for the numeric columns


Pretty printing has been turned OFF
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5100 entries, 0 to 5099
Data columns (total 6 columns):
Rank      5100 non-null int64
Song      5100 non-null object
Artist    5100 non-null object
Year      5100 non-null int64
Lyrics    4913 non-null object
Source    4913 non-null float64
dtypes: float64(1), int64(2), object(3)
memory usage: 239.1+ KB


| | Rank | Year | Source |
|---|---|---|---|
| count | 5100.0 | 5100.0 | 4913.0 |
| mean | 50.5 | 1990.0 | 1.400977 |
| std | 28.8689 | 14.721045 | 0.890375 |
| min | 1.0 | 1965.0 | 1.0 |
| 25% | 25.75 | 1977.0 | 1.0 |
| 50% | 50.5 | 1990.0 | 1.0 |
| 75% | 75.25 | 2003.0 | 1.0 |
| max | 100.0 | 2015.0 | 5.0 |
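
One way to approach the encoding question is to try candidate encodings in order and keep the first one that decodes cleanly. The sketch below is only an illustration: `read_with_fallback` is a hypothetical helper, the candidate list is an assumption, and the one-row CSV is made up so the block runs without the real file.

```python
import io
import pandas as pd

# Made-up one-row CSV standing in for the real file. The curly apostrophe
# becomes byte 0x92 in cp1252, which is not valid UTF-8.
raw = "Rank,Song\n1,don\u2019t stop\n".encode("cp1252")

def read_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Hypothetical helper: try each encoding, return (DataFrame, encoding used)."""
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(data), encoding=enc), enc
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")

demo, used = read_with_fallback(raw)
print(used)
```

Note that `ISO-8859-1` never raises a decode error because every byte maps to some character, so it belongs last in any candidate list: it will "work" even when it is the wrong choice.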


## (2) Reorganize and clean up
- Rearrange the columns so that they are ordered: Year - Rank - Artist - Song - Lyrics. 
- Get rid of Source column. 
- What about the index? 

In [2]:
songs = songs[['Year', 'Rank', 'Artist', 'Song', 'Lyrics']]  # reorder columns; leaving out 'Source' drops it
type(songs)  # still a pandas DataFrame
songs.head()

| | Year | Rank | Artist | Song | Lyrics |
|---|---|---|---|---|---|
| 0 | 1965 | 1 | sam the sham and the pharaohs | wooly bully | sam the sham miscellaneous wooly bully wooly b... |
| 1 | 1965 | 2 | four tops | i cant help myself sugar pie honey bunch | sugar pie honey bunch you know that i love yo... |
| 2 | 1965 | 3 | the rolling stones | i cant get no satisfaction | |
| 3 | 1965 | 4 | we five | you were on my mind | when i woke up this morning you were on my mi... |
| 4 | 1965 | 5 | the righteous brothers | youve lost that lovin feelin | you never close your eyes anymore when i kiss... |


## (3) Sanity check
- Is the data complete? 
  - Are there really 50 years represented?
  - Does every year have all 100 entries? 
- Any missing or anomalous data? How to address them?

In [3]:
len(set(songs['Year']))  # plain-Python approach: build a set of the years, then count it

songs['Year'].unique()  # pandas approach: array of the distinct values

songs['Year'].unique().size  # number of distinct values, equivalent to len() here

51

In [4]:
songs['Rank'].value_counts().sort_index()

1      51
2      51
3      51
4      51
5      51
6      51
7      51
8      51
9      51
10     51
11     51
12     51
13     51
14     51
15     51
16     51
17     51
18     51
19     51
20     51
21     51
22     51
23     51
24     51
25     51
26     51
27     51
28     51
29     51
30     51
       ..
71     51
72     51
73     51
74     51
75     51
76     51
77     51
78     51
79     51
80     51
81     51
82     51
83     51
84     51
85     51
86     51
87     51
88     51
89     51
90     51
91     51
92     51
93     51
94     51
95     51
96     51
97     51
98     51
99     51
100    51
Name: Rank, Length: 100, dtype: int64
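
The count above confirms that each rank appears once per year. The remaining sanity-check questions (entries per year, missing values) could be approached along the following lines. This is a sketch on a made-up toy frame (2 years, 3 ranks each) rather than the real `songs` data; on the real data the comparison would be against 100.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `songs`: 2 years x 3 ranks, with one lyric missing
toy = pd.DataFrame({
    "Year":   [1965, 1965, 1965, 2015, 2015, 2015],
    "Rank":   [1, 2, 3, 1, 2, 3],
    "Lyrics": ["la la", np.nan, "oh oh", "yeah", "hey hey", "na na"],
})

# Entries per year, and whether every year has the expected number
per_year = toy["Year"].value_counts().sort_index()
all_complete = (per_year == 3).all()

# Missing values per column, and the rows where lyrics are missing
missing_by_col = toy.isnull().sum()
missing_rows = toy[toy["Lyrics"].isnull()]

print(per_year)
print(missing_by_col)
```

Keep in mind that `isnull()` only catches true missing values; a lyric stored as an empty or whitespace-only string would need a separate check.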

## (4) Accessing routines
- How to get songs from year 1965 vs. songs from year 2015?
- How to get number-1 hits only?
- How to get Madonna songs only?
- Madonna songs that are top 10? Year 2000 or later? 

In [5]:
newfilter = songs['Year'] == 2015
oldfilter = songs['Year'] == 1965

newsongs = songs.loc[newfilter, :]
oldsongs = songs.loc[oldfilter, :]

newsongs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 5000 to 5099
Data columns (total 5 columns):
Year      100 non-null int64
Rank      100 non-null int64
Artist    100 non-null object
Song      100 non-null object
Lyrics    98 non-null object
dtypes: int64(2), object(3)
memory usage: 4.7+ KB


In [12]:
songs.loc[songs['Artist']=='madonna',:]
songs.loc[(songs['Artist']=='madonna') & (songs['Year'] >= 2000),:]

| | Year | Rank | Artist | Song | Lyrics |
|---|---|---|---|---|---|
| 3516 | 2000 | 17 | madonna | music | hey mister dj put a record on i want to dance... |
| 3633 | 2001 | 34 | madonna | dont tell me | dont tell me to stop tell the rain not to dro... |
| 4190 | 2006 | 91 | madonna | hung up | time goes by so slowly time goes by so slowly... |
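
The number-1 and top-10 questions follow the same boolean-filter pattern. Here is a minimal sketch on made-up rows (the artist and song names are placeholders, not taken from the data set), showing how conditions combine with `&`. Each condition needs its own parentheses.

```python
import pandas as pd

# Made-up rows standing in for `songs`; names are placeholders
toy = pd.DataFrame({
    "Year":   [1985, 1999, 2000, 2006, 2006],
    "Rank":   [2, 1, 8, 91, 1],
    "Artist": ["artist a", "artist b", "artist a", "artist a", "artist c"],
    "Song":   ["song 1", "song 2", "song 3", "song 4", "song 5"],
})

# Number-1 hits only
number_ones = toy.loc[toy["Rank"] == 1, :]

# One artist's top-10 songs, then narrowed to year 2000 or later
is_a = toy["Artist"] == "artist a"
a_top10 = toy.loc[is_a & (toy["Rank"] <= 10), :]
a_top10_recent = toy.loc[is_a & (toy["Rank"] <= 10) & (toy["Year"] >= 2000), :]

print(a_top10_recent)
```

Storing a filter in a variable such as `is_a` keeps longer combinations readable and lets the same condition be reused.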


## (5) Who are top artists?
- How many artists are represented?
- Which artists produced most hits?
- How many artists are one-hit wonders?

In [17]:
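songs['Artist'].unique().size  # number of distinct artists
songs['Artist'].value_counts() == 1  # assumed source of the output below: True would mark a one-hit wonder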

madonna                                            False
elton john                                         False
mariah carey                                       False
michael jackson                                    False
stevie wonder                                      False
janet jackson                                      False
whitney houston                                    False
rihanna                                            False
taylor swift                                       False
pink                                               False
kelly clarkson                                     False
the beatles                                        False
britney spears                                     False
the black eyed peas                                False
chicago                                            False
aretha franklin                                    False
katy perry                                         False
usher                          
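
The three questions in this section can all be answered from `value_counts()`. The sketch below uses a made-up toy column with placeholder artist names. It illustrates one possible approach and does not reproduce the original notebook cell.

```python
import pandas as pd

# Made-up artist column standing in for songs['Artist']
toy = pd.DataFrame({
    "Artist": ["artist a", "artist b", "artist a",
               "artist c", "artist a", "artist b"],
})

n_artists = toy["Artist"].nunique()   # how many distinct artists
hits = toy["Artist"].value_counts()   # hits per artist, largest first
top = hits.head(2)                    # the most prolific artists

one_hit = hits[hits == 1]             # artists with exactly one entry
n_one_hit = (hits == 1).sum()         # True counts as 1 when summed

print(top)
print(n_one_hit)
```

A caveat for the real data: collaborations are stored under a combined artist string, so an artist who appears only in "featuring" credits would be counted separately here.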

## (6) Create another column: lyric length in word count
- All symbols have been removed from lyrics
- So, `.split()` works as a rudimentary tokenizer
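
A minimal sketch of the word-count column, assuming the new column is called `WordCount` (a name chosen here for illustration) and using made-up lyrics. The `.str` accessor applies `.split()` to each lyric and then takes the length of each resulting list.

```python
import numpy as np
import pandas as pd

# Made-up lyrics, including one missing value
toy = pd.DataFrame({"Lyrics": ["la la la la", np.nan, "hey"]})

# Split on whitespace, then count the tokens in each list
toy["WordCount"] = toy["Lyrics"].str.split().str.len()

print(toy)
```

Missing lyrics stay missing: the count for that row is `NaN`, which also turns the whole column into floats.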

## (7) Filling in missing values
- We really need to take care of the missing lyric values. 
- What are good strategies? Considerations?

In [8]:
# Your code here
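
Two common strategies, sketched on made-up rows: fill the gaps with an empty string, or drop the rows altogether. Neither is presented as the intended answer; which one fits depends on what gets computed afterwards.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing lyric
toy = pd.DataFrame({
    "Song":   ["song 1", "song 2", "song 3"],
    "Lyrics": ["la la", np.nan, "hey"],
})

# Strategy 1: keep every row, replace missing lyrics with an empty string
filled = toy.copy()
filled["Lyrics"] = filled["Lyrics"].fillna("")

# Strategy 2: drop rows whose lyrics are missing
dropped = toy.dropna(subset=["Lyrics"])

print(len(filled), len(dropped))
```

Filling keeps all 100 entries per year and lets string methods run without errors, but an empty lyric yields a word count of 0, which pulls averages down. Dropping gives cleaner statistics at the cost of uneven year sizes.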

## (8) A detour: text processing basics
- NLTK functions
- How to word-tokenize
- How to sentence-tokenize
- Frequency distribution
- Type-token ratio

In [9]:
# Your code here
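
As a starting point, frequency distribution and type-token ratio can be sketched in plain Python on a made-up lyric. `collections.Counter` is used here instead of NLTK so the block runs without extra downloads; `nltk.FreqDist` is a `Counter` subclass, so the same calls apply to it. NLTK's `word_tokenize` and `sent_tokenize` need the Punkt models downloaded first, and since the lyrics have had all symbols removed, sentence tokenization has little to work with on this data.

```python
from collections import Counter

# Made-up lyric; real lyrics in the data set are lowercase with no symbols
text = "la la la hey hey you"

tokens = text.split()          # rudimentary word tokenization
freq = Counter(tokens)         # frequency distribution: word -> count

n_tokens = len(tokens)         # total number of words
n_types = len(freq)            # number of distinct words
ttr = n_types / n_tokens       # type-token ratio

print(freq.most_common(2))
print(ttr)
```

Type-token ratio is sensitive to text length, so comparing it across songs of very different lengths can be misleading.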

## (9) 50 years apart: songs from 1965 vs. 2015
- Let's create two corpora of 1965 lyrics vs. 2015 lyrics
- Good ways to highlight differences? 

In [10]:
# Your code here
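
One possible way to build and compare the two corpora, sketched on made-up lyrics: pool each year's lyrics into one token list, count words, then look at words unique to each year and at relative frequencies. `year_tokens` and `rel_freq` are hypothetical helpers introduced for this illustration.

```python
from collections import Counter
import pandas as pd

# Made-up rows standing in for `songs`, with one missing lyric
toy = pd.DataFrame({
    "Year":   [1965, 1965, 2015, 2015],
    "Lyrics": ["love love baby", None, "love money", "money party party"],
})

def year_tokens(df, year):
    """Hypothetical helper: all words from one year's lyrics, skipping missing ones."""
    lyrics = df.loc[df["Year"] == year, "Lyrics"].dropna()
    return " ".join(lyrics).split()

def rel_freq(freq, word):
    """Hypothetical helper: share of the corpus taken up by `word`."""
    return freq[word] / sum(freq.values())

old_freq = Counter(year_tokens(toy, 1965))
new_freq = Counter(year_tokens(toy, 2015))

# Words that appear in one corpus but not the other
only_old = set(old_freq) - set(new_freq)
only_new = set(new_freq) - set(old_freq)

print(only_old, only_new)
print(rel_freq(old_freq, "love"), rel_freq(new_freq, "love"))
```

Relative frequencies are used instead of raw counts because the two corpora will generally differ in total size.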

## (10) A different take: multi-level index
- Alternative organization: use year-and-rank combination as index 
- Create a copy of `songs` as `songs2` DataFrame
- How does accessing work in this DF?

In [11]:
# Your code here
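
A minimal sketch of the multi-level index on made-up rows, with `toy2` standing in for `songs2`. With `Year` and `Rank` as the index, rows are addressed by year, by a `(year, rank)` tuple, or by rank across all years.

```python
import pandas as pd

# Made-up rows standing in for `songs`
toy = pd.DataFrame({
    "Year": [1965, 1965, 2015, 2015],
    "Rank": [1, 2, 1, 2],
    "Song": ["song 1", "song 2", "song 3", "song 4"],
})

# Copy, then move Year and Rank into a two-level index
toy2 = toy.copy().set_index(["Year", "Rank"]).sort_index()

year_1965 = toy2.loc[1965]                # all 1965 rows, indexed by Rank
one_row = toy2.loc[(2015, 1)]             # a single row, returned as a Series
number_ones = toy2.xs(1, level="Rank")    # rank 1 in every year, indexed by Year

print(one_row["Song"])
print(number_ones)
```

Selecting on the outer level (`Year`) works directly with `.loc`, while selecting on the inner level (`Rank`) alone needs `.xs()` or a similar cross-section method.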

## (11) Discussion: data curation method, presentation
- How did Walker collect her song lyrics data?
- What tools did she use?
- How successful was her effort?
- How did she present her work, and what are your thoughts?

Your notes here in markdown