## Keyness Analysis

### In this notebook, you will find:
- Corpora loaded from JSON files of various song dictionaries
- Detailed text analysis of lyrics, separated by section headers
- Keyness analysis, which evaluates whether a particular word occurs more frequently in one corpus than in another
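
The comparison described in the last bullet can be sketched with the log-likelihood (G2) statistic, a common keyness measure in corpus linguistics. This is a minimal sketch under assumptions: `calculate_keyness` is defined in `functions.ipynb` and is not shown in this notebook, so the names `log_likelihood_keyness` and `keyness_table`, the `min_freq` cutoff, and the toy counts are all hypothetical and may not match the actual implementation.

```python
import math
from collections import Counter


def log_likelihood_keyness(freq_a, freq_b, size_a, size_b):
    """Log-likelihood (G2) score for one word across two corpora.

    freq_a / freq_b: occurrences of the word in corpus A / corpus B
    size_a / size_b: total token counts of corpus A / corpus B
    """
    total = freq_a + freq_b
    # Expected counts if the word were spread evenly by corpus size
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score


def keyness_table(counts_a, counts_b, min_freq=5):
    """Rank words that are relatively more frequent in corpus A than in B."""
    size_a = sum(counts_a.values())
    size_b = sum(counts_b.values())
    rows = []
    for word, freq_a in counts_a.items():
        freq_b = counts_b.get(word, 0)
        if freq_a < min_freq:
            continue  # hypothetical cutoff: skip rare words
        if freq_a / size_a <= freq_b / size_b:
            continue  # keep only words overrepresented in corpus A
        score = log_likelihood_keyness(freq_a, freq_b, size_a, size_b)
        rows.append((word, freq_a, freq_b, score))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Toy counts for illustration only (not taken from the song corpora)
toy_a = Counter({'love': 10, 'road': 2, 'the': 88})
toy_b = Counter({'love': 2, 'road': 10, 'the': 88})
for word, freq_a, freq_b, score in keyness_table(toy_a, toy_b):
    print(f'{word:<25}{freq_a:<10}{freq_b:<10}{score:.3f}')
```

Because the score depends on the total size of each corpus as well as on the raw counts, the same pair of frequencies can yield different keyness values in different comparisons.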

In [9]:
%run functions.ipynb

In [10]:
%run frequency_ngram_analysis.ipynb

[nltk_data] Downloading package stopwords to /Commjhub/jupyterhub/comm
[nltk_data]     318_fall2019/jpasik123/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Top 50 words in your `all_90s` corpus
[('love', 108), ("i'm", 75), ('know', 74), ('oh', 70), ('yeah', 56), ('like', 55), ('got', 50), ('get', 47), ('let', 47), ('little', 45), ('wanna', 44), ("ain't", 42), ('one', 38), ('never', 38), ('way', 37), ('tell', 36), ('girl', 35), ("i've", 35), ('take', 34), ('ya', 34), ('come', 33), ('say', 33), ('go', 33), ('boy', 32), ('said', 32), ('make', 32), ('baby', 30), ('heart', 30), ('away', 29), ('gonna', 28), ('think', 28), ('well', 27), ('right', 27), ('maria', 27), ('time', 26), ('man', 25), ('ever', 25), ('feel', 25), ('want', 24), ('night', 22), ('would', 22), ('knows', 22), ('world', 21), ('hey', 21), ("can't", 21), ('kiss', 21), ('maybe', 19), ('really', 19), ('passionate', 19), ('kisses', 19)]
Top 50 # of songs each type occurs in your `all_90s` corpus
[('know', 22), ("i'm", 21), ('like', 20), ('love', 19), ('oh', 19), ('night', 19), ('got', 18), ('one', 18), ('said', 17), ('never', 17), ('get', 16), ("ain't", 15), ('right', 15), ("i've", 

Top 50 words in your `all_male` corpus
[("i'm", 114), ('like', 98), ('love', 78), ('know', 70), ('yeah', 60), ('got', 57), ('wanna', 57), ('get', 55), ("ain't", 55), ('oh', 51), ('baby', 51), ('ya', 48), ('back', 46), ('make', 44), ('little', 43), ('girl', 40), ('never', 39), ('think', 38), ('take', 37), ('right', 37), ('oooh', 36), ('one', 35), ('see', 34), ('heart', 33), ('go', 33), ('gonna', 33), ('come', 32), ('tell', 31), ('rock', 30), ('night', 29), ("can't", 29), ("i've", 28), ('way', 28), ('mama', 28), ("'em", 28), ('need', 28), ('maria', 27), ('hey', 26), ('say', 25), ('man', 25), ('around', 25), ("i'll", 24), ('whiskey', 23), ('world', 23), ('beautiful', 23), ('let', 22), ('away', 21), ('time', 21), ('good', 21), ('would', 20)]
Top 50 # of songs each type occurs in your `all_male` corpus
[('know', 24), ("i'm", 24), ('like', 22), ('night', 19), ('get', 19), ('got', 19), ('go', 18), ('never', 17), ('yeah', 17), ("ain't", 17), ('right', 16), ('love', 15), ('oh', 14), ('take', 14

Top 50 words in your `male_2010s` corpus
[('like', 72), ("i'm", 57), ('back', 44), ('baby', 40), ('oooh', 36), ('yeah', 31), ('need', 28), ('rock', 28), ('got', 27), ('right', 27), ("ain't", 27), ('little', 26), ("'em", 26), ('mama', 26), ('make', 25), ('think', 25), ('know', 24), ('get', 24), ('gonna', 23), ('go', 23), ('wanna', 23), ('see', 21), ('hey', 20), ('dirt', 19), ('one', 19), ('good', 18), ('way', 17), ('whiskey', 17), ('used', 16), ('road', 16), ("i'ma", 16), ('tequila', 16), ('around', 15), ('night', 15), ('hell', 15), ('free', 15), ('drink', 14), ('take', 14), ("'cause", 14), ('man', 14), ('ya', 14), ('crazy', 14), ('always', 14), ('country', 13), ('never', 13), ('glasses', 13), ('drunk', 13), ("can't", 12), ('feel', 12), ('shine', 12)]
Top 50 # of songs each type occurs in your `male_2010s` corpus
[('go', 12), ('like', 12), ('back', 12), ("i'm", 11), ('yeah', 11), ('know', 11), ('get', 11), ("ain't", 10), ('got', 9), ('baby', 9), ('right', 8), ('take', 8), ('way', 8), ('

## Additional Modules

In [11]:
# Additional modules
import os
import pandas as pd
import re
import json
import requests
from bs4 import BeautifulSoup
import lyricsgenius
from collections import Counter
import nltk
from nltk import Text
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
sect_stoppers = ['pre-chorus','refrain','chorus','verse','intro','outro','bridge','verse 1','verse 2','verse 3','verse 4','1','2','3','4','Tim McGraw','Faith Hill','Tim McGraw & Faith Hill']
# Add section labels and artist names to the stop list
stop_words.extend(sect_stoppers)
# POS tagging
from nltk import pos_tag, pos_tag_sents, FreqDist, ConditionalFreqDist


In [12]:
char_to_strip = '.,!][?;$"-()'
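
A strip string and a stop list like the ones above are typically applied while tokenizing. The following is a minimal sketch, assuming whitespace tokenization and `str.strip`; the notebook's real tokenizer lives in `functions.ipynb` and may work differently. The `tokenize` function, `demo_stop_words`, and the sample line are hypothetical.

```python
from collections import Counter

char_to_strip = '.,!][?;$"-()'  # same value as in the cell above
# Hypothetical mini stop list standing in for the full `stop_words` list
demo_stop_words = {'the', 'a', 'chorus', 'verse'}


def tokenize(lyrics, strip_chars, stop_words):
    """Lowercase, split on whitespace, trim punctuation, drop stop words."""
    tokens = []
    for raw in lyrics.lower().split():
        # str.strip removes the characters only from the ends of the token
        token = raw.strip(strip_chars)
        if token and token not in stop_words:
            tokens.append(token)
    return tokens


# Made-up sample line for illustration only
sample = '[Chorus] Oh, the whiskey! (Yeah) a pre-chorus line...'
tokens = tokenize(sample, char_to_strip, demo_stop_words)
word_freq = Counter(tokens)
print(tokens)
```

Two properties of this sketch are worth keeping in mind: `str.strip` leaves interior characters such as the hyphen in `pre-chorus` untouched, and stop-list entries that contain spaces or capital letters (for example `'verse 1'` or `'Tim McGraw'`) can never match a single lowercased, whitespace-split token.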

In [13]:
# Use a context manager so the file handle is closed after loading
with open('../data/charts/all_charts.json') as f:
    all_charts = json.load(f)

## Keyness Analysis

Below, I have printed keyness charts that compare key words across decades and genders. In each chart, Corpus A is the first argument passed to `calculate_keyness` and Corpus B is the second.

## 1990s vs 2010s

In [19]:
## keyness analysis: key words in the 90s subset vs those in 2010s
calculate_keyness(word_freq_90s, word_freq_2010s, top = 50)

WORD                     Corpus A  Corpus B  Keyness
                         Freq.     Freq.
love                     108       19        77.052
boy                      32        8         17.639
let                      47        18        16.121
tell                     36        14        12.080
want                     24        7         11.434
i've                     35        15        10.111
really                   19        5         10.003
oh                       70        42        9.794
know                     74        46        9.329
say                      33        15        8.618
maybe                    19        6         8.312
said                     32        16        6.959
would                    22        9         6.844


### Observations:

- No clear narrative emerges from this keyness analysis
- "Love" is by far the most key word in the `all_90s` chart relative to `all_2010s` (108 vs. 19 occurrences; keyness 77.052)
- "Boy" is another key word that appears and may have more contextual, lyrical variety than "love"
- "Really" is the only evaluative word in this chart
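
How strongly a word counts as "key" can be made explicit by comparing its score against critical values. This is a sketch under one loud assumption: that the keyness scores are log-likelihood values, which are conventionally compared against the chi-square distribution with one degree of freedom. The helper `significance_label` is hypothetical; the scores are copied from the table above.

```python
# Chi-square critical values for 1 degree of freedom, highest first.
# Assumption: keyness scores are log-likelihood (G2) values.
CRITICAL_VALUES = [
    (15.13, 'p < 0.0001'),
    (10.83, 'p < 0.001'),
    (6.63, 'p < 0.01'),
    (3.84, 'p < 0.05'),
]


def significance_label(score):
    """Return the strictest significance level that the score reaches."""
    for threshold, label in CRITICAL_VALUES:
        if score >= threshold:
            return label
    return 'not significant'


# Scores taken from the 1990s vs 2010s chart
rows_90s_vs_2010s = [('love', 77.052), ('boy', 17.639),
                     ('really', 10.003), ('said', 6.959)]
for word, score in rows_90s_vs_2010s:
    print(f'{word:<10}{score:<10}{significance_label(score)}')
```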

## 2010s vs 1990s

In [21]:
## keyness analysis: key words in the 2010s subset vs those in 90s
calculate_keyness(word_freq_2010s, word_freq_90s, top = 50)

WORD                     Corpus A  Corpus B  Keyness
                         Freq.     Freq.
back                     80        15        43.221
every                    51        10        26.601
'em                      39        6         24.190
road                     34        5         21.687
like                     108       55        13.149
need                     33        10        11.004
good                     33        12        8.408
free                     22        6         8.349
always                   19        5         7.507
whiskey                  20        6         6.757


### Observations:

- "Back" is the most key word in the `all_2010s` chart relative to `all_90s` (keyness 43.221)
- "Every", a determiner, is also key in `all_2010s`
- "Like" is more frequent in country songs of the 2010s than in those of the 1990s; it can be used in a variety of ways
- "Whiskey" and "free" are more key in the 2010s chart than in the 1990s chart

## All Female vs All Male

In [27]:
calculate_keyness(f_word_freq, m_word_freq, top = 50)

WORD                     Corpus A  Corpus B  Keyness
                         Freq.     Freq.
boy                      34        6         23.204
every                    46        15        18.263
kiss                     27        9         10.435
said                     34        14        9.728
prechorus                24        8         9.275
thing                    25        9         8.740
ever                     37        17        8.726
let                      43        22        8.111
name                     22        8         7.586
without                  18        6         6.956


## All Male vs All Female

In [28]:
calculate_keyness(m_word_freq, f_word_freq, top = 50)

WORD                     Corpus A  Corpus B  Keyness
                         Freq.     Freq.
wanna                    57        14        25.619
i'm                      114       58        15.622
girl                     40        17        8.330


### Observations:
 
- "Boy" is a key word in the `all_female` chart, while "girl" is key in the `all_male` chart
- The `all_male` keyness chart shows no clear storyline
- "Kiss" in the female chart points to the theme of 'love'
- "Wanna" and "I'm" in the male chart offer little to analyze out of context
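
One way to put words such as "wanna" back into context is a keyword-in-context (KWIC) listing. The sketch below is a plain-Python illustration; the `kwic` function, its `window` parameter, and the sample tokens are hypothetical and are not taken from the corpora.

```python
def kwic(tokens, keyword, window=3):
    """List each occurrence of `keyword` with up to `window` tokens per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            lines.append(f'{left} [{token}] {right}'.strip())
    return lines


# Made-up token list for illustration only
sample_tokens = 'i wanna go back home and i wanna stay'.split()
for line in kwic(sample_tokens, 'wanna', window=2):
    print(line)
```

Running the same function over the token list of a real corpus would show which verbs and objects follow each key word.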

## Male - 1990s vs Female - 1990s

In [24]:
calculate_keyness(m_word_freq_90s, f_word_freq_90s, top = 50)

WORD                     Corpus A  Corpus B  Keyness
                         Freq.     Freq.
i'm                      57        18        20.078
heart                    25        5         13.921
wanna                    34        10        13.074
love                     71        37        9.825


## Female - 1990s vs Male - 1990s

In [23]:
calculate_keyness(f_word_freq_90s, m_word_freq_90s, top = 50)

WORD                     Corpus A  Corpus B  Keyness
                         Freq.     Freq.
boy                      26        6         14.126
way                      26        11        6.751


### Observations:

Looking at the gender-filtered charts from the 1990s, "heart" and "love" are key in `male_90s` relative to `female_90s`, although "i'm" has the highest keyness score there (20.078); "boy" is the most key word in the `female_90s` chart relative to `male_90s`.

## Male - 2010s vs Female - 2010s 

In [25]:
calculate_keyness(m_word_freq_2010s, f_word_freq_2010s, top = 50)

WORD                     Corpus A  Corpus B  Keyness
                         Freq.     Freq.
need                     28        5         15.979
baby                     40        12        13.857
like                     72        36        9.657
think                    25        7         9.422
right                    27        9         8.105


## Female - 2010s vs Male - 2010s 

In [26]:
calculate_keyness(f_word_freq_2010s, m_word_freq_2010s, top = 50)

WORD                     Corpus A  Corpus B  Keyness
                         Freq.     Freq.
every                    40        11        19.792
could                    21        5         11.838
name                     21        6         10.002
never                    32        13        9.785
oh                       30        12        9.395
yeah                     56        31        9.308


### Observations:

- "Need" and "baby" are the most key words in `male_2010s` relative to `female_2010s`
- Again, no clear theme is evident from these keyness charts
- Filler words such as "yeah" and "oh" are more key in `female_2010s` than in `male_2010s`

## Male - 2010s vs Female - 1990s

In [29]:
calculate_keyness(m_word_freq_2010s, f_word_freq_90s, top = 50)

WORD                     Corpus A  Corpus B  Keyness
                         Freq.     Freq.
i'm                      57        18        16.132
back                     44        13        13.679
like                     72        29        13.288
see                      21        5         8.438


## Female - 1990s vs Male - 2010s

In [30]:
calculate_keyness(f_word_freq_90s, m_word_freq_2010s, top = 50)

WORD                     Corpus A  Corpus B  Keyness
                         Freq.     Freq.
love                     37        7         26.938
oh                       31        12        11.619
kiss                     19        7         7.596


### Observations:

- No substantial takeaways from the `male_2010s` chart
- "Kiss" alludes to the theme of 'love' in `female_90s`, where "love" itself is the most key word (keyness 26.938)

## Male - 1990s vs Female - 2010s

In [31]:
calculate_keyness(m_word_freq_90s, f_word_freq_2010s, top = 50)

WORD                     Corpus A  Corpus B  Keyness
                         Freq.     Freq.
love                     71        12        48.539
tell                     24        7         10.455
know                     46        22        9.506
girl                     31        13        8.220


## Female - 2010s vs Male - 1990s

In [32]:
calculate_keyness(f_word_freq_2010s, m_word_freq_90s, top = 50)

WORD                     Corpus A  Corpus B  Keyness
                         Freq.     Freq.
yeah                     56        29        7.818


### Observations:

- "Love" and "girl" suggest a romantic narrative in the `male_90s` chart
- "Yeah" is the lone key word in `female_2010s` relative to `male_90s`

## Female - 1990s vs Female - 2010s

In [35]:
calculate_keyness(f_word_freq_90s, f_word_freq_2010s, top = 50)

WORD                     Corpus A  Corpus B  Keyness
                         Freq.     Freq.
love                     37        12        15.097
boy                      26        8         11.272


## Female - 2010s vs Female - 1990s

In [36]:
calculate_keyness(f_word_freq_2010s, f_word_freq_90s, top = 50)

WORD                     Corpus A  Corpus B  Keyness
                         Freq.     Freq.
every                    40        6         25.930
back                     36        13        9.753
thing                    20        5         8.665
yeah                     56        27        8.508
never                    32        12        8.148
heart                    19        5         7.801
i'm                      40        18        7.155
get                      36        16        6.618


### Observations:

- "Love" and "boy" in `female_90s` link to the theme of 'love'
- "Heart" seems to be the only thematically suggestive key word in the `female_2010s` chart relative to `female_90s`, although "every" has the highest keyness score (25.930)

## Male - 1990s vs Male - 2010s

In [37]:
calculate_keyness(m_word_freq_90s, m_word_freq_2010s, top = 50)

WORD                     Corpus A  Corpus B  Keyness
                         Freq.     Freq.
love                     71        7         68.334
oh                       39        12        18.188
girl                     31        9         15.350
tell                     24        7         11.829
heart                    25        8         11.171
ya                       34        14        10.948
know                     46        24        9.676
say                      19        6         8.615
time                     16        5         7.338


## Male - 2010s vs Male - 1990s

In [38]:
calculate_keyness(m_word_freq_2010s, m_word_freq_90s, top = 50)

WORD                     Corpus A  Corpus B  Keyness
                         Freq.     Freq.
like                     72        26        17.682
baby                     40        11        14.471


### Observations:

- No substantive key words appear in `male_2010s` relative to `male_90s`; only "like" and "baby" are listed
- A narrative of falling in love or singing about love is present in `male_90s` relative to `male_2010s`, with key words such as "love", "girl", and "heart"