## Frequency Analyses, N-gram Lists, Word Comparisons

### In this notebook, you will find:
- Loaded corpora from JSON files of various song dictionaries 
- Detailed text analysis of lyrics, separated by section headers

In [None]:
pwd

In [2]:
%run functions.ipynb

## Additional Modules

In [3]:
#Additional modules
import os
import pandas as pd
import re
import json
import requests
from bs4 import BeautifulSoup
import lyricsgenius
from collections import Counter
import nltk
from nltk import Text
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
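# Section labels and artist credits from the lyric pages, treated as extra stop words.
# Note: tokens are lowercased later, so capitalized entries such as 'Tim McGraw' will not match.
sect_stoppers = ['pre-chorus','refrain','chorus','verse','intro','outro','bridge','verse 1','verse 2','verse 3','verse 4','1','2','3','4','Tim McGraw','Faith Hill','Tim McGraw & Faith Hill']
stop_words.extend(sect_stoppers)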
# pos tagging
from nltk import pos_tag, pos_tag_sents, FreqDist, ConditionalFreqDist

[nltk_data] Downloading package stopwords to /Commjhub/jupyterhub/comm
[nltk_data]     318_fall2019/jpasik123/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
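
Some of the extra stop words above are capitalized or hyphenated, while the tokens are lowercased before filtering; "prechorus" still shows up in the printed frequency lists later in this notebook. Here is a minimal sketch of one way to normalize the extra entries, assuming tokens are single lowercased words. `build_stop_set` is a hypothetical helper, not something defined in `functions.ipynb`.

```python
def build_stop_set(base_words, extra_phrases, strip_chars=''):
    """Return a set of lowercased, single-word stop words.

    Hypothetical helper: splits multi-word phrases on whitespace and removes
    `strip_chars` from each piece so the entries resemble lowercased tokens.
    """
    stop_set = {w.lower() for w in base_words}
    for phrase in extra_phrases:
        for piece in phrase.lower().split():
            cleaned = ''.join(ch for ch in piece if ch not in strip_chars)
            if cleaned:
                stop_set.add(cleaned)
    return stop_set


# Toy inputs; the real lists would be `stop_words` and `sect_stoppers`.
demo_stops = build_stop_set(
    ['i', 'me'],
    ['pre-chorus', 'Verse 1', 'Tim McGraw & Faith Hill'],
    strip_chars='.,!][?;$"-()',
)
print(sorted(demo_stops))
```

One trade-off to keep in mind: splitting artist names adds ordinary words such as "faith" and "hill" to the stop set, which would also remove them from real lyrics.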


In [4]:
char_to_strip = '.,!][?;$"-()'

In [5]:
all_charts = json.load(open('../data/charts/all_charts.json'))

In [6]:
all_charts['all_90s'][0].keys()

dict_keys(['Decade', 'Title', 'Artist', 'Gender', 'lyrics', 'tokens'])

In [7]:
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

## Word and Song Frequencies; N-gram lists (bigrams and trigrams) 

## All 1990s

In [8]:
word_freq_90s = Counter() ##tally of # of times each type occurs in TOTAL 
song_freq_90s = Counter() ## tally of # of songs each type occurs in (each type only counted once for each song it occurs in)
bigrams_90s_dist = Counter()
trigrams_90s_dist = Counter()

# 1. loop over each song in chart
for song in all_charts['all_90s']:
    raw_lyrics = song['lyrics']
    song_toks = []
    temp = tokenize(raw_lyrics, lowercase=True,strip_chars=char_to_strip)
    for tok in temp:
        if tok not in stop_words:
            song_toks.append(tok)
    word_freq_90s.update(song_toks)
    unique_type = set(song_toks) 
    song_freq_90s.update(unique_type) # update song_freq_90s with the types (i.e. unique values in tokens)
    bigram_90s = get_ngram_tokens(song_toks, n=2)
    bigrams_90s_dist.update(bigram_90s)
    trigrams_90s = get_ngram_tokens(song_toks, n=3)
    trigrams_90s_dist.update(trigrams_90s)

print('Top 50 words in your `all_90s` corpus\n', '='*34, sep='')
print(word_freq_90s.most_common(50))

print('Top 50 # of songs each type occurs in your `all_90s` corpus\n', '='*34, sep='')
print(song_freq_90s.most_common(50))

print('Top 50 bigrams in your `all_90s` corpus\n', '='*34, sep='')
print(bigrams_90s_dist.most_common(50))

print('Top 50 trigrams in your `all_90s` corpus\n', '='*34, sep='')
print(trigrams_90s_dist.most_common(50))

Top 50 words in your `all_90s` corpus
[('love', 108), ("i'm", 75), ('know', 74), ('oh', 70), ('yeah', 56), ('like', 55), ('got', 50), ('get', 47), ('let', 47), ('little', 45), ('wanna', 44), ("ain't", 42), ('one', 38), ('never', 38), ('way', 37), ('tell', 36), ('girl', 35), ("i've", 35), ('take', 34), ('ya', 34), ('come', 33), ('say', 33), ('go', 33), ('boy', 32), ('said', 32), ('make', 32), ('baby', 30), ('heart', 30), ('away', 29), ('gonna', 28), ('think', 28), ('well', 27), ('right', 27), ('maria', 27), ('time', 26), ('man', 25), ('ever', 25), ('feel', 25), ('want', 24), ('night', 22), ('would', 22), ('knows', 22), ('world', 21), ('hey', 21), ("can't", 21), ('kiss', 21), ('maybe', 19), ('really', 19), ('passionate', 19), ('kisses', 19)]
Top 50 # of songs each type occurs in your `all_90s` corpus
[('know', 22), ("i'm", 21), ('like', 20), ('love', 19), ('night', 19), ('oh', 19), ('got', 18), ('one', 18), ('said', 17), ('never', 17), ('get', 16), ("ain't", 15), ('right', 15), ("i've", 

### Observations 

As the printed results above show, many of the most frequent tokens are filler words such as "oh", "yeah", and "really". However, some substantive words stand out and will be used in later, more detailed analysis: "love", "little", "girl", "boy", "man", "world", "passionate", and "kiss(es)".

Additionally, raw frequency counts can be inflated when a word is repeated many times within one song, for example in a chorus or bridge. For that reason, it helps to pay close attention to the `song_freq` results, which count the number of songs each type occurs in (each type is counted only once per song).

Many of the bigrams and trigrams are overlapping word combinations from the same song (e.g. "achy breaky" and "breaky heart", or "got friends low" and "friends low places"). It will be interesting to analyze some of these tokens and n-grams further throughout this project!
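
The counting loop above is repeated for every corpus in this notebook, so it could be wrapped in one function. Below is a rough sketch of that idea. Since `tokenize` and `get_ngram_tokens` live in `functions.ipynb` and are not shown here, `simple_tokenize` and `simple_ngrams` are stand-ins I made up for illustration, and `corpus_counts` is a hypothetical name.

```python
from collections import Counter


def simple_tokenize(text, strip_chars=''):
    """Stand-in tokenizer (assumption): lowercase, split on whitespace, strip edge punctuation."""
    tokens = (tok.strip(strip_chars) for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def simple_ngrams(tokens, n):
    """Stand-in n-gram builder (assumption): space-joined runs of n consecutive tokens."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def corpus_counts(songs, stop_words, strip_chars=''):
    """Return (word_freq, song_freq, bigram_freq, trigram_freq) for a list of song dicts."""
    stop_set = set(stop_words)
    word_freq, song_freq = Counter(), Counter()
    bigram_freq, trigram_freq = Counter(), Counter()
    for song in songs:
        toks = [t for t in simple_tokenize(song['lyrics'], strip_chars)
                if t not in stop_set]
        word_freq.update(toks)        # every occurrence counts
        song_freq.update(set(toks))   # each type counts once per song
        bigram_freq.update(simple_ngrams(toks, 2))
        trigram_freq.update(simple_ngrams(toks, 3))
    return word_freq, song_freq, bigram_freq, trigram_freq


# Toy data, not taken from the charts.
toy_songs = [
    {'lyrics': 'Love, love me again'},
    {'lyrics': 'I love the road'},
]
toy_word_freq, toy_song_freq, toy_bigram_freq, toy_trigram_freq = corpus_counts(
    toy_songs, stop_words=['i', 'me', 'the'], strip_chars='.,!?')
print(toy_word_freq.most_common(3))
print(toy_song_freq['love'])
```

In the notebook itself, the two stand-ins would be swapped for `tokenize` and `get_ngram_tokens`, and the function would be called once per key of `all_charts`.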

In [9]:
## Calculating type-token ratio for 90s
# Finding amount of unique words across corpora

ttr_90s = []

for song in all_charts['all_90s']:
    toks_90s = song['tokens']
    ttr = len(set(toks_90s)) / len(toks_90s) * 100
    ttr_90s.append(ttr)

print('the average type-token ratio among `all_90s` songs is {}'.format(sum(ttr_90s) / len(ttr_90s)))


the average type-token ratio among `all_90s` songs is 40.612394807300554
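
The type-token ratio used here is the number of distinct types divided by the number of tokens, times 100, averaged over songs. The sketch below wraps that calculation in two small functions and skips songs with empty token lists, which would otherwise raise a `ZeroDivisionError`. The names `type_token_ratio` and `mean_ttr` are my own, and the token lists are toy data.

```python
def type_token_ratio(tokens):
    """Return distinct types as a percentage of all tokens (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens) * 100


def mean_ttr(token_lists):
    """Average the per-song ratios, skipping songs with no tokens."""
    ratios = [type_token_ratio(toks) for toks in token_lists if toks]
    return sum(ratios) / len(ratios) if ratios else 0.0


# Toy token lists: 3 types / 4 tokens, 2 types / 2 tokens, and one empty song.
toy_token_lists = [['oh', 'oh', 'love', 'you'], ['road', 'dirt'], []]
print(mean_ttr(toy_token_lists))
```

Because longer songs tend to repeat more words, this ratio usually drops as song length grows, so the decade averages are best read alongside the song lengths.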


## All 2010s 

In [10]:
word_freq_2010s = Counter() ##tally of # of times each type occurs in TOTAL 
song_freq_2010s = Counter() ## tally of # of songs each type occurs in (each type only counted once for each song it occurs in)
bigrams_2010s_dist = Counter()
trigrams_2010s_dist = Counter()

# 1. loop over each song in chart
for song in all_charts['all_2010s']:
    raw_lyrics = song['lyrics']
    song_toks = []
    temp = tokenize(raw_lyrics, lowercase=True,strip_chars=char_to_strip)
    for tok in temp:
        if tok not in stop_words:
            song_toks.append(tok)
    word_freq_2010s.update(song_toks)
    unique_type = set(song_toks) 
    song_freq_2010s.update(unique_type) 
    bigram_2010s = get_ngram_tokens(song_toks, n=2)
    bigrams_2010s_dist.update(bigram_2010s)
    trigrams_2010s = get_ngram_tokens(song_toks, n=3)
    trigrams_2010s_dist.update(trigrams_2010s)

print('Top 50 words in your `all_2010s` corpus\n', '='*34, sep='')
print(word_freq_2010s.most_common(50))

print('Top 50 # of songs each type occurs in your `all_2010s` corpus\n', '='*34, sep='')
print(song_freq_2010s.most_common(50))

print('Top 50 bigrams in your `all_2010s` corpus\n', '='*34, sep='')
print(bigrams_2010s_dist.most_common(50))

print('Top 50 trigrams in your `all_2010s` corpus\n', '='*34, sep='')
print(trigrams_2010s_dist.most_common(50))

Top 50 words in your `all_2010s` corpus
[('like', 108), ("i'm", 97), ('yeah', 87), ('back', 80), ('little', 62), ('get', 60), ('got', 58), ("ain't", 53), ('baby', 52), ('every', 51), ('know', 46), ('never', 45), ('gonna', 43), ('oh', 42), ('take', 40), ("'em", 39), ('make', 38), ('right', 36), ('one', 36), ('oooh', 36), ('see', 35), ('go', 35), ('road', 34), ('good', 33), ('need', 33), ('hope', 33), ('think', 32), ('way', 31), ('rock', 31), ('away', 30), ('ever', 29), ('mama', 29), ('around', 28), ('hey', 28), ('heart', 27), ('name', 27), ('wanna', 27), ('could', 26), ('changed', 25), ("'cause", 24), ('night', 24), ("i'll", 23), ('thing', 22), ('dirt', 22), ('girl', 22), ('song', 22), ('free', 22), ('man', 21), ('come', 21), ("can't", 21)]
Top 50 # of songs each type occurs in your `all_2010s` corpus
[('like', 23), ('back', 22), ('get', 22), ('yeah', 21), ("i'm", 19), ('know', 18), ('go', 18), ('got', 17), ("ain't", 17), ('take', 15), ('one', 15), ('good', 14), ('right', 14), ('baby', 

### Observations:

With the `all_2010s` data, the substantive words that stand out among the most frequent are "little", "baby", "road", "mama", "name", "dirt", "man", "girl", and "hope". In addition, the bigram list contains some distinctive pairs, such as "whiskey glasses", "dirt boots", "taste tequila", "back road", and "drunk plane". Lastly, being more familiar with these 2010s songs myself, I recognize that most of the trigrams come from song titles or key phrases in the choruses, which makes it easy to tell which songs they belong to.

In [25]:
## Calculating type-token ratio for 2010s
# Finding amount of unique words across corpora


ttr_2010s = []

for song in all_charts['all_2010s']:
    toks_2010s = song['tokens']
    ttr_10s = len(set(toks_2010s)) / len(toks_2010s) * 100
    ttr_2010s.append(ttr_10s)

print('the average type-token ratio among 2010s songs is {}'.format(sum(ttr_2010s) / len(ttr_2010s)))


the average type-token ratio among 2010s songs is 37.27572078056541


## All Females

In [12]:
f_word_freq = Counter() ##tally of # of times each type occurs in TOTAL 
f_song_freq = Counter() ## tally of # of songs each type occurs in (each type only counted once for each song it occurs in)
f_bigrams_dist = Counter()
f_trigrams_dist = Counter()

# 1. loop over each song in chart
for song in all_charts['all_female']:
    raw_lyrics = song['lyrics']
    song_toks = []
    temp = tokenize(raw_lyrics, lowercase=True,strip_chars=char_to_strip)
    for tok in temp:
        if tok not in stop_words:
            song_toks.append(tok)
    f_word_freq.update(song_toks)
    unique_type = set(song_toks) 
    f_song_freq.update(unique_type) 
    f_bigrams = get_ngram_tokens(song_toks, n=2)
    f_bigrams_dist.update(f_bigrams)
    f_trigrams = get_ngram_tokens(song_toks, n=3)
    f_trigrams_dist.update(f_trigrams)

print('Top 50 words in your `all_female` corpus\n', '='*34, sep='')
print(f_word_freq.most_common(50))

print('Top 50 # of songs each type occurs in your `all_female` corpus\n', '='*34, sep='')
print(f_song_freq.most_common(50))

print('Top 50 bigrams in your `all_female` corpus\n', '='*34, sep='')
print(f_bigrams_dist.most_common(50))

print('Top 50 trigrams in your `all_female` corpus\n', '='*34, sep='')
print(f_trigrams_dist.most_common(50))

Top 50 words in your `all_female` corpus
[('yeah', 83), ('like', 65), ('little', 64), ('oh', 61), ("i'm", 58), ('get', 52), ('got', 51), ('know', 50), ('love', 49), ('back', 49), ('every', 46), ('never', 44), ('let', 43), ("ain't", 40), ('way', 40), ('one', 39), ('away', 38), ('gonna', 38), ('ever', 37), ('take', 37), ('go', 35), ('hope', 35), ('boy', 34), ('said', 34), ('baby', 31), ('well', 27), ('kiss', 27), ('right', 26), ('make', 26), ('time', 25), ('thing', 25), ('feel', 25), ('could', 25), ('changed', 25), ('good', 24), ('prechorus', 24), ('heart', 24), ('say', 23), ('hey', 23), ('come', 22), ('name', 22), ('day', 22), ('put', 22), ('maybe', 22), ('knows', 22), ("i've", 22), ('think', 22), ('road', 21), ('man', 21), ('left', 21)]
Top 50 # of songs each type occurs in your `all_female` corpus
[('like', 21), ('one', 20), ('get', 19), ('back', 19), ('oh', 18), ('got', 16), ('said', 16), ("i'm", 16), ('good', 16), ('know', 16), ('yeah', 16), ("ain't", 15), ('time', 15), ('take', 15)

### Observations:

Within the `all_female` charts, the words that stand out to me are: "little", "love", "hope", "boy", "baby", "kiss", "heart", and "road". Looking at these words out of context, one might infer that country songs by female artists tend to share narratives about love. It will be interesting to examine their context further with KWIC (keyword-in-context) concordance analyses.

Again, the bigrams and trigrams hint at the titles or main messages of the songs.
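
For reference, a KWIC concordance lists each occurrence of a keyword together with a few words on either side. Here is a bare-bones sketch of the idea in plain Python; `kwic` is a hypothetical helper written for illustration, and the sentence is made up rather than taken from any song.

```python
def kwic(tokens, keyword, window=3):
    """Return (left, keyword, right) context tuples for each occurrence of `keyword`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Invented example sentence, already lowercased and split into tokens.
toy_tokens = 'take the long road home and love the road you take'.split()
for left, word, right in kwic(toy_tokens, 'road', window=2):
    print('{:>12} [{}] {}'.format(left, word, right))
```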

## All Males

In [13]:
m_word_freq = Counter() ##tally of # of times each type occurs in TOTAL 
m_song_freq = Counter() ## tally of # of songs each type occurs in (each type only counted once for each song it occurs in)
m_bigrams_dist = Counter()
m_trigrams_dist = Counter()

# 1. loop over each song in chart
for song in all_charts['all_male']:
    raw_lyrics = song['lyrics']
    song_toks = []
    temp = tokenize(raw_lyrics, lowercase=True,strip_chars=char_to_strip)
    for tok in temp:
        if tok not in stop_words:
            song_toks.append(tok)
    m_word_freq.update(song_toks)
    unique_type = set(song_toks) 
    m_song_freq.update(unique_type) 
    m_bigrams = get_ngram_tokens(song_toks, n=2)
    m_bigrams_dist.update(m_bigrams)
    m_trigrams = get_ngram_tokens(song_toks, n=3)
    m_trigrams_dist.update(m_trigrams)

print('Top 50 words in your `all_male` corpus\n', '='*34, sep='')
print(m_word_freq.most_common(50))

print('Top 50 # of songs each type occurs in your `all_male` corpus\n', '='*34, sep='')
print(m_song_freq.most_common(50))

print('Top 50 bigrams in your `all_male` corpus\n', '='*34, sep='')
print(m_bigrams_dist.most_common(50))

print('Top 50 trigrams in your `all_male` corpus\n', '='*34, sep='')
print(m_trigrams_dist.most_common(50))

Top 50 words in your `all_male` corpus
[("i'm", 114), ('like', 98), ('love', 78), ('know', 70), ('yeah', 60), ('got', 57), ('wanna', 57), ('get', 55), ("ain't", 55), ('oh', 51), ('baby', 51), ('ya', 48), ('back', 46), ('make', 44), ('little', 43), ('girl', 40), ('never', 39), ('think', 38), ('take', 37), ('right', 37), ('oooh', 36), ('one', 35), ('see', 34), ('heart', 33), ('go', 33), ('gonna', 33), ('come', 32), ('tell', 31), ('rock', 30), ('night', 29), ("can't", 29), ("i've", 28), ('way', 28), ('mama', 28), ("'em", 28), ('need', 28), ('maria', 27), ('hey', 26), ('say', 25), ('man', 25), ('around', 25), ("i'll", 24), ('whiskey', 23), ('world', 23), ('beautiful', 23), ('let', 22), ('away', 21), ('time', 21), ('good', 21), ('would', 20)]
Top 50 # of songs each type occurs in your `all_male` corpus
[('know', 24), ("i'm", 24), ('like', 22), ('night', 19), ('get', 19), ('got', 19), ('go', 18), ('never', 17), ('yeah', 17), ("ain't", 17), ('right', 16), ('love', 15), ('oh', 14), ('take', 14

### Observations:

Looking at these results, the words "love", "baby", "girl", "rock", "mama", "whiskey", "world", and "beautiful" stand out the most to me because they can be used in a variety of contexts. In this `all_male` chart, the bigram and trigram lists reveal the spread of topics sung about by male artists, be it drinking, heartbreak, or love.

## Female Artists - 1990s

In [14]:
f_word_freq_90s = Counter() ##tally of # of times each type occurs in TOTAL 
f_song_freq_90s = Counter() ## tally of # of songs each type occurs in (each type only counted once for each song it occurs in)
f_bigrams_dist_90s = Counter()
f_trigrams_dist_90s = Counter()

# 1. loop over each song in chart
for song in all_charts['female_90s']:
    raw_lyrics = song['lyrics']
    song_toks = []
    temp = tokenize(raw_lyrics, lowercase=True,strip_chars=char_to_strip)
    for tok in temp:
        if tok not in stop_words:
            song_toks.append(tok)
    f_word_freq_90s.update(song_toks)
    unique_type = set(song_toks) 
    f_song_freq_90s.update(unique_type) # update f_song_freq_90s with the types (i.e. unique values in tokens)
    f_bigrams_90s = get_ngram_tokens(song_toks, n=2)
    f_bigrams_dist_90s.update(f_bigrams_90s)
    f_trigrams_90s = get_ngram_tokens(song_toks, n=3)
    f_trigrams_dist_90s.update(f_trigrams_90s)

print('Top 50 words in your `female_90s` corpus\n', '='*34, sep='')
print(f_word_freq_90s.most_common(50))

print('Top 50 # of songs each type occurs in your `female_90s` corpus\n', '='*34, sep='')
print(f_song_freq_90s.most_common(50))

print('Top 50 bigrams in your `female_90s` corpus\n', '='*34, sep='')
print(f_bigrams_dist_90s.most_common(50))

print('Top 50 trigrams in your `female_90s` corpus\n', '='*34, sep='')
print(f_trigrams_dist_90s.most_common(50))

Top 50 words in your `female_90s` corpus
[('love', 37), ('oh', 31), ('like', 29), ('let', 29), ('little', 28), ('know', 28), ('yeah', 27), ('boy', 26), ('way', 26), ('go', 23), ('one', 22), ('said', 22), ('knows', 21), ('got', 20), ('baby', 19), ('maybe', 19), ('passionate', 19), ('kisses', 19), ('kiss', 19), ('gonna', 18), ('well', 18), ("i'm", 18), ('ever', 18), ('feel', 18), ('right', 17), ('get', 16), ('goodbyes', 16), ('hey', 15), ('think', 15), ("ain't", 14), ('man', 14), ('say', 14), ('without', 14), ('back', 13), ('took', 13), ('day', 13), ('make', 13), ('earl', 13), ('away', 12), ('come', 12), ('mama', 12), ('want', 12), ('left', 12), ('prechorus', 12), ('tell', 12), ('ohohoh', 12), ('never', 12), ('long', 11), ('take', 11), ('somewhere', 11)]
Top 50 # of songs each type occurs in your `female_90s` corpus
[('oh', 11), ('like', 10), ('one', 10), ('said', 10), ('back', 9), ('know', 9), ('got', 8), ('gonna', 8), ('come', 8), ("ain't", 8), ('man', 8), ('get', 8), ('love', 8), ("i'

### Observations:

Looking at the songs by female artists in the 1990s (`female_90s`), there is again a theme of love and love-related words ("love", "passionate", "kisses"). An interesting distinction in this data set is that, apart from the allusions to love, there is a theme of seasons and places ("carolina", "california", "summer", "memphis").

## Male Artists - 1990s

In [15]:
m_word_freq_90s = Counter() ##tally of # of times each type occurs in TOTAL 
m_song_freq_90s = Counter() ## tally of # of songs each type occurs in (each type only counted once for each song it occurs in)
m_bigrams_dist_90s = Counter()
m_trigrams_dist_90s = Counter()

# 1. loop over each song in chart
for song in all_charts['male_90s']:
    raw_lyrics = song['lyrics']
    song_toks = []
    temp = tokenize(raw_lyrics, lowercase=True,strip_chars=char_to_strip)
    for tok in temp:
        if tok not in stop_words:
            song_toks.append(tok)
    m_word_freq_90s.update(song_toks)
    unique_type = set(song_toks) 
    m_song_freq_90s.update(unique_type) 
    m_bigrams_90s = get_ngram_tokens(song_toks, n=2)
    m_bigrams_dist_90s.update(m_bigrams_90s)
    m_trigrams_90s = get_ngram_tokens(song_toks, n=3)
    m_trigrams_dist_90s.update(m_trigrams_90s)

print('Top 50 words in your `male_90s` corpus\n', '='*34, sep='')
print(m_word_freq_90s.most_common(50))

print('Top 50 # of songs each type occurs in your `male_90s` corpus\n', '='*34, sep='')
print(m_song_freq_90s.most_common(50))

print('Top 50 bigrams in your `male_90s` corpus\n', '='*34, sep='')
print(m_bigrams_dist_90s.most_common(50))

print('Top 50 trigrams in your `male_90s` corpus\n', '='*34, sep='')
print(m_trigrams_dist_90s.most_common(50))

Top 50 words in your `male_90s` corpus
[('love', 71), ("i'm", 57), ('know', 46), ('oh', 39), ('wanna', 34), ('ya', 34), ('girl', 31), ('get', 31), ('got', 30), ('yeah', 29), ("ain't", 28), ('maria', 27), ('never', 26), ('like', 26), ('heart', 25), ("i've", 25), ('tell', 24), ('take', 23), ('come', 21), ('say', 19), ('make', 19), ('let', 18), ('huh', 18), ('away', 17), ('little', 17), ("can't", 17), ('would', 16), ('one', 16), ('time', 16), ("i'll", 16), ('beautiful', 16), ('ay', 16), ('world', 15), ('start', 15), ('night', 14), ('done', 14), ('hurry', 14), ('rush', 14), ("slippin'", 14), ('see', 13), ('think', 13), ('gotta', 13), ('dance', 13), ('want', 12), ('marie', 12), ('keep', 12), ('man', 11), ('way', 11), ('friends', 11), ('baby', 11)]
Top 50 # of songs each type occurs in your `male_90s` corpus
[('know', 13), ("i'm", 13), ('night', 12), ('never', 11), ('love', 11), ('like', 10), ("i've", 10), ('got', 10), ('oh', 8), ('one', 8), ('girl', 8), ('right', 8), ('get', 8), ('said', 7)

### Observations:

Given the theme of "love" that I've been picking up on in the earlier analysis, it is interesting that '90s male artists use the word "love" 71 times while their female counterparts from the '90s use it only 37 times; male artists mention "love" nearly twice as often! These are raw counts, though: "love" appears in 11 of the `male_90s` songs and 8 of the `female_90s` songs, so part of the gap comes from repetition within individual songs. I am excited to explore this disparity more throughout this project!
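
One quick way to separate "how many songs use the word" from "how often each song repeats it" is to divide the total count by the song count. The sketch below does this with the "love" numbers copied from the printed `male_90s` and `female_90s` results above; `mentions_per_song` is a hypothetical helper name.

```python
# Counts copied from the printed `male_90s` and `female_90s` results above.
love_counts = {
    'male_90s': {'total': 71, 'songs': 11},
    'female_90s': {'total': 37, 'songs': 8},
}


def mentions_per_song(total, songs):
    """Average occurrences per song, among the songs that use the word at all."""
    return total / songs if songs else 0.0


for corpus, counts in love_counts.items():
    rate = mentions_per_song(counts['total'], counts['songs'])
    print('{:<12} {:.2f} mentions per song containing "love"'.format(corpus, rate))
```

This is only a rough normalization: a single song with a very repetitive chorus can still dominate the average.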

## Female Artists - 2010s

In [16]:
f_word_freq_2010s = Counter() ##tally of # of times each type occurs in TOTAL 
f_song_freq_2010s = Counter() ## tally of # of songs each type occurs in (each type only counted once for each song it occurs in)
f_bigrams_dist_2010s = Counter()
f_trigrams_dist_2010s = Counter()

# 1. loop over each song in chart
for song in all_charts['female_2010s']:
    raw_lyrics = song['lyrics']
    song_toks = []
    temp = tokenize(raw_lyrics, lowercase=True,strip_chars=char_to_strip)
    for tok in temp:
        if tok not in stop_words:
            song_toks.append(tok)
    f_word_freq_2010s.update(song_toks)
    unique_type = set(song_toks) 
    f_song_freq_2010s.update(unique_type) 
    f_bigrams_2010s = get_ngram_tokens(song_toks, n=2)
    f_bigrams_dist_2010s.update(f_bigrams_2010s)
    f_trigrams_2010s = get_ngram_tokens(song_toks, n=3)
    f_trigrams_dist_2010s.update(f_trigrams_2010s)

print('Top 50 words in your `female_2010s` corpus\n', '='*34, sep='')
print(f_word_freq_2010s.most_common(50))

print('Top 50 # of songs each type occurs in your `female_2010s` corpus\n', '='*34, sep='')
print(f_song_freq_2010s.most_common(50))

print('Top 50 bigrams in your `female_2010s` corpus\n', '='*34, sep='')
print(f_bigrams_dist_2010s.most_common(50))

print('Top 50 trigrams in your `female_2010s` corpus\n', '='*34, sep='')
print(f_trigrams_dist_2010s.most_common(50))

Top 50 words in your `female_2010s` corpus
[('yeah', 56), ('every', 40), ("i'm", 40), ('back', 36), ('like', 36), ('get', 36), ('little', 36), ('hope', 33), ('never', 32), ('got', 31), ('oh', 30), ('away', 26), ('take', 26), ("ain't", 26), ('changed', 25), ('know', 22), ('could', 21), ('name', 21), ('gonna', 20), ('thing', 20), ('heart', 19), ('ever', 19), ('road', 18), ('one', 17), ('blown', 16), ('stronger', 16), ('good', 15), ('time', 15), ("i'll", 15), ('traveled', 15), ('let', 14), ('place', 14), ('see', 14), ('put', 14), ('along', 14), ('way', 14), ('less', 14), ('house', 13), ('remember', 13), ('water', 13), ('around', 13), ('make', 13), ("'em", 13), ('girl', 13), ('said', 12), ('prechorus', 12), ('love', 12), ("i've", 12), ('go', 12), ('find', 12)]
Top 50 # of songs each type occurs in your `female_2010s` corpus
[('like', 11), ('get', 11), ('yeah', 10), ('back', 10), ('one', 10), ('heart', 9), ('good', 8), ('got', 8), ("i'm", 8), ('time', 8), ('little', 8), ('know', 7), ('every

### Observations:

Looking at the most common tokens, there do not seem to be any obvious connections among the words that would point to a larger theme in songs by female artists of the 2010s. By contrast, songs by '90s female artists contained many lyrics relating to "love"; I am interested to see whether any hidden messages or themes emerge from this corpus in my other analysis notebooks.

## Male Artists - 2010s

In [17]:
m_word_freq_2010s = Counter() ##tally of # of times each type occurs in TOTAL 
m_song_freq_2010s = Counter() ## tally of # of songs each type occurs in (each type only counted once for each song it occurs in)
m_bigrams_dist_2010s = Counter()
m_trigrams_dist_2010s = Counter()

# 1. loop over each song in chart
for song in all_charts['male_2010s']:
    raw_lyrics = song['lyrics']
    song_toks = []
    temp = tokenize(raw_lyrics, lowercase=True,strip_chars=char_to_strip)
    for tok in temp:
        if tok not in stop_words:
            song_toks.append(tok)
    m_word_freq_2010s.update(song_toks)
    unique_type = set(song_toks) 
    m_song_freq_2010s.update(unique_type) 
    m_bigrams_2010s = get_ngram_tokens(song_toks, n=2)
    m_bigrams_dist_2010s.update(m_bigrams_2010s)
    m_trigrams_2010s = get_ngram_tokens(song_toks, n=3)
    m_trigrams_dist_2010s.update(m_trigrams_2010s)

print('Top 50 words in your `male_2010s` corpus\n', '='*34, sep='')
print(m_word_freq_2010s.most_common(50))

print('Top 50 # of songs each type occurs in your `male_2010s` corpus\n', '='*34, sep='')
print(m_song_freq_2010s.most_common(50))

print('Top 50 bigrams in your `male_2010s` corpus\n', '='*34, sep='')
print(m_bigrams_dist_2010s.most_common(50))

print('Top 50 trigrams in your `male_2010s` corpus\n', '='*34, sep='')
print(m_trigrams_dist_2010s.most_common(50))

Top 50 words in your `male_2010s` corpus
[('like', 72), ("i'm", 57), ('back', 44), ('baby', 40), ('oooh', 36), ('yeah', 31), ('need', 28), ('rock', 28), ('got', 27), ('right', 27), ("ain't", 27), ('little', 26), ("'em", 26), ('mama', 26), ('make', 25), ('think', 25), ('know', 24), ('get', 24), ('gonna', 23), ('go', 23), ('wanna', 23), ('see', 21), ('hey', 20), ('dirt', 19), ('one', 19), ('good', 18), ('way', 17), ('whiskey', 17), ('used', 16), ('road', 16), ("i'ma", 16), ('tequila', 16), ('around', 15), ('night', 15), ('hell', 15), ('free', 15), ('drink', 14), ('take', 14), ("'cause", 14), ('man', 14), ('ya', 14), ('crazy', 14), ('always', 14), ('country', 13), ('never', 13), ('glasses', 13), ('drunk', 13), ("can't", 12), ('feel', 12), ('shine', 12)]
Top 50 # of songs each type occurs in your `male_2010s` corpus
[('like', 12), ('go', 12), ('back', 12), ("i'm", 11), ('yeah', 11), ('know', 11), ('get', 11), ("ain't", 10), ('got', 9), ('baby', 9), ('right', 8), ('take', 8), ('way', 8), ('

### Observations:

Unlike the `female_2010s` data, the `male_2010s` chart reveals some possible themes with words such as "baby", "mama", "road", "whiskey", "tequila", and "drunk". Again, with the bigrams and trigrams, I can pick up on which songs the word pairs belong to.

## Word Frequency Comparisons

In [18]:
## comparing some key words present in BOTH the 90s and 2010s lyrics - key words found on various websites

words_to_compare = ['beer','boots','drink','drinking','drinks','truck','road','yeah','baby','girl','love','little','drunk','whiskey']
print("{:<10} {:<10} {:<10}".format("word", "1990s", "2010s"))
print("="*30)

for word in words_to_compare:
    print("{:<10} {:<10} {:<10}".format(word,
                                   word_freq_90s.get(word,0),
                                   word_freq_2010s.get(word,0)))

word       1990s      2010s     
beer       5          9         
boots      1          17        
drink      3          16        
drinking   0          0         
drinks     0          6         
truck      9          5         
road       5          34        
yeah       56         87        
baby       30         52        
girl       35         22        
love       108        19        
little     45         62        
drunk      0          14        
whiskey    6          20        


In the chart above, songs from the 2010s reference alcohol and drinking noticeably more often than songs from the 1990s (for example, "drunk" 0 vs. 14, "drink" 3 vs. 16, and "whiskey" 6 vs. 20). The difference that stands out most to me is that the word "love" was used almost six times as often in the '90s (108) as in the 2010s (19).

In [19]:
## Comparing some key words present in BOTH the 90s and 2010s lyrics
## Key words chosen from printed results earlier in this notebook 

words_to_compare = ['heart','love','baby','girl','yeah','man','little']
print("{:<10} {:<10} {:<10}".format("word", "1990s", "2010s"))
print("="*30)

for word in words_to_compare:
    print("{:<10} {:<10} {:<10}".format(word,
                                   word_freq_90s.get(word,0),
                                   word_freq_2010s.get(word,0)))

word       1990s      2010s     
heart      30         27        
love       108        19        
baby       30         52        
girl       35         22        
yeah       56         87        
man        25         21        
little     45         62        


After putting words that appeared in both corpora into this chart, I found it easier to make comparisons. The words "heart" (30 vs. 27) and "man" (25 vs. 21) were used almost equally. "Love" shows by far the biggest disparity (108 vs. 19), followed by "baby" (30 vs. 52), "girl" (35 vs. 22), "yeah" (56 vs. 87), and "little" (45 vs. 62).
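
To rank the disparities instead of eyeballing them, each pair of counts can be turned into a ratio of the larger count to the smaller one. The sketch below uses the counts copied from the table above; `decade_ratio` is a hypothetical helper, and it returns `None` when either count is zero since the ratio is undefined there.

```python
# Counts copied from the comparison table above: (1990s, 2010s).
table_counts = {
    'heart': (30, 27), 'love': (108, 19), 'baby': (30, 52), 'girl': (35, 22),
    'yeah': (56, 87), 'man': (25, 21), 'little': (45, 62),
}


def decade_ratio(count_90s, count_2010s):
    """Return larger count / smaller count, or None if either count is zero."""
    low, high = sorted((count_90s, count_2010s))
    return high / low if low else None


ratios = {word: decade_ratio(*counts) for word, counts in table_counts.items()}
defined = {word: r for word, r in ratios.items() if r is not None}
for word, ratio in sorted(defined.items(), key=lambda kv: kv[1], reverse=True):
    print('{:<10} {:.2f}'.format(word, ratio))
```

Note that these ratios compare raw counts, so they do not account for any difference in the total number of tokens between the two decades.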

## Related Words Proportions between Different Corpora

In [20]:
love_rel_words = ['love','heart','happy','romance','lover','smile','baby']

for word in love_rel_words:
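    # Note: len(word_freq_90s) is the number of distinct types (not total tokens),
    # so this value is the word's count per 100 types in the corpus
    lovecnt_proportion90s = word_freq_90s[word]/len(word_freq_90s)*100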
    lovecnt_percentage90s = round(lovecnt_proportion90s, 2)
    print(word, word_freq_90s[word],lovecnt_percentage90s)

love 108 6.81
heart 30 1.89
happy 1 0.06
romance 4 0.25
lover 1 0.06
smile 11 0.69
baby 30 1.89


In [21]:
love_rel_words = ['love','heart','happy','romance','lover','smile','baby']

for word in love_rel_words:
    lovecnt_proportion2010s = word_freq_2010s[word]/len(word_freq_2010s)*100
    lovecnt_percentage2010s = round(lovecnt_proportion2010s, 2)
    print(word, word_freq_2010s[word],lovecnt_percentage2010s)

love 19 1.36
heart 27 1.94
happy 3 0.22
romance 0 0.0
lover 0 0.0
smile 11 0.79
baby 52 3.73


## Observations: 

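- Keep in mind that each value is the word's count divided by the number of distinct word types in the corpus (filler words such as "i'm", "get", and "every" included), times 100. In the 1990s lyrics, "love" scores far above the other love-related words (6.81), with "heart" and "baby" next (1.89 each). In the 2010s lyrics, "baby" scores highest (3.73), followed by "heart" (1.94) and "love" (1.36).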
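
Since `len()` of a `Counter` gives the number of distinct types, the values above are not shares of all tokens. As a point of comparison, here is a small sketch of both measures side by side on a toy `Counter`; `share_of_tokens` and `per_hundred_types` are hypothetical names, and the toy counts are invented.

```python
from collections import Counter


def share_of_tokens(freq, word):
    """Percentage of all counted tokens that are `word`."""
    total = sum(freq.values())
    return freq[word] / total * 100 if total else 0.0


def per_hundred_types(freq, word):
    """The measure used in the cells above: count per 100 distinct types."""
    return freq[word] / len(freq) * 100 if freq else 0.0


# Invented counts: 10 tokens in total, 3 distinct types.
toy_freq = Counter({'love': 6, 'road': 3, 'yeah': 1})
print(share_of_tokens(toy_freq, 'love'))
print(per_hundred_types(toy_freq, 'love'))
```

The second measure can exceed 100 when a word is very frequent relative to the vocabulary size, which is one reason the token-based share may be easier to interpret as a percentage.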

In [22]:
drinking_rel_words = ['whiskey','beer','drunk','tequila','drink']

for word in drinking_rel_words:
    drinkcnt_proportion2010s = word_freq_2010s[word]/len(word_freq_2010s)*100
    drinkcnt_percentage2010s = round(drinkcnt_proportion2010s, 2)
    print(word, word_freq_2010s[word],drinkcnt_percentage2010s)

whiskey 20 1.43
beer 9 0.65
drunk 14 1.0
tequila 16 1.15
drink 16 1.15


In [23]:
drinking_rel_words = ['whiskey','beer','drunk','tequila','drink']

for word in drinking_rel_words:
    drinkcnt_proportion90s = word_freq_90s[word]/len(word_freq_90s)*100
    drinkcnt_percentage90s = round(drinkcnt_proportion90s, 2)
    print(word, word_freq_90s[word],drinkcnt_percentage90s)

whiskey 6 0.38
beer 5 0.32
drunk 0 0.0
tequila 0 0.0
drink 3 0.19


## Observations:

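- There are relatively few references to drinking in either corpus
- Drinking references are more frequent in the 2010s lyrics than in the 1990s lyrics (every word in the list has a higher count in the 2010s)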
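
To put a single number on this comparison, the counts for the five drinking-related words can be summed per decade. The sketch below uses the counts copied from the two printed outputs above; the variable names are my own.

```python
# Counts copied from the printed outputs above: (1990s, 2010s).
drink_counts = {
    'whiskey': (6, 20),
    'beer': (5, 9),
    'drunk': (0, 14),
    'tequila': (0, 16),
    'drink': (3, 16),
}

total_90s = sum(pair[0] for pair in drink_counts.values())
total_2010s = sum(pair[1] for pair in drink_counts.values())
print('1990s total:', total_90s)    # 14
print('2010s total:', total_2010s)  # 75
```

These totals are raw counts for this particular word list, so they depend on which drinking words were chosen and on the size of each corpus.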