# Assignment 6.2: Preparing Data for Final Team Project

## ADS509 Summer 2023
### Team members: Hunter Blum, Nicholas Lee, Kyle Esteban Dalope

All code for Assignment 6.2 comes from "03_EDA.ipynb" in Team 12's repository. For details on how the team pulled and preprocessed the data, see "02_PreProcessing.ipynb".

In [4]:
# Import packages

# NOTE: Python 3.9.5 was used so that wordcloud works
from collections import Counter
# Pandas version 1.4.4
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from ast import literal_eval

In [8]:
# Read in preprocessed data
preproc_df = pd.read_csv("../data/genre_prepped.csv.gz", compression = "gzip",
                         converters = {"tokens": literal_eval,
                                       "genre": literal_eval})

# drop unnecessary columns (index and unnamed index columns)
preproc_df = preproc_df.drop(preproc_df.columns[0:2], axis = 1)

# Sample Table
preproc_df.head(5)

| Unnamed: 0 | artist | title | lyrics | genre | tokens | lyrics_clean |
|---|---|---|---|---|---|---|
| 0 | Taylor Swift | ​betty | Betty, I won't make assumptions\nAbout why you... | [country] | [betty, make, assumptions, switched, homeroom,... | betty make assumptions switched homeroom think... |
| 1 | John Denver | Take Me Home, Country Roads | Almost Heaven, West Virginia\nBlue Ridge Mount... | [country] | [almost, heaven, west, virginia, blue, ridge, ... | almost heaven west virginia blue ridge mountai... |
| 2 | Post Malone | Feeling Whitney | I've been looking for someone...\nOoh, ooh, oo... | [country] | [looking, someone, ooh, ooh, ooh, ooh, ooh, oo... | looking someone ooh ooh ooh ooh ooh oohooh ooh... |
| 3 | Cam | Burning House | \n[Verse 1]\nI had a dream about a burning hou... | [country] | [dream, burning, house, stuck, inside, get, la... | dream burning house stuck inside get laid besi... |
| 4 | Johnny Cash | Folsom Prison Blues | I hear the train a-comin', it's rolling 'round... | [country] | [hear, train, acomin, rolling, round, bend, ai... | hear train acomin rolling round bend aint seen... |
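
The `converters` argument in the read step matters because list-valued columns are stored in a CSV as their string representation. The sketch below illustrates this on a hypothetical two-row CSV held in memory. The song titles, genres, and tokens are made up for illustration and are not taken from `genre_prepped.csv.gz`.

```python
from ast import literal_eval
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the real file: list columns saved as strings
csv_text = """title,genre,tokens
song_a,"['country']","['blue', 'ridge']"
song_b,"['pop', 'rock']","['love']"
"""

# Without converters, the list columns come back as plain strings
raw_df = pd.read_csv(StringIO(csv_text))

# With literal_eval converters, each cell is parsed back into a Python list
parsed_df = pd.read_csv(StringIO(csv_text),
                        converters={"genre": literal_eval,
                                    "tokens": literal_eval})

print(type(raw_df.loc[0, "genre"]))
print(type(parsed_df.loc[0, "genre"]))
```

Parsing at read time keeps later steps, such as exploding genres or flattening tokens, working on real lists instead of strings.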


In [9]:
# Explode the list-valued genre column so each song has one row per genre
preproc_df_ungrouped = preproc_df.explode('genre')
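
To make the effect of `explode` concrete, here is a small sketch on a hypothetical toy frame. The titles and genres are invented for illustration. A song tagged with two genres becomes two rows, so it is counted once in each genre's statistics.

```python
import pandas as pd

# Hypothetical toy data: song_b carries two genre labels
toy_df = pd.DataFrame({
    "title": ["song_a", "song_b"],
    "genre": [["country"], ["pop", "rock"]],
})

# One output row per (song, genre) pair; the original index is repeated
toy_ungrouped = toy_df.explode("genre")
print(toy_ungrouped)
```

Because the index is repeated for multi-genre songs, call `reset_index(drop=True)` on the result if unique row labels are needed later.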

## Descriptive Statistics

In [10]:
def descriptive_stats(tokens, num_top_tokens = 5, verbose = True):
    """Print and return basic descriptive statistics for a list of tokens."""

    num_tokens = len(tokens)  # total number of tokens
    num_unique_tokens = len(set(tokens))  # number of unique tokens (types)
    # Lexical diversity: ratio of unique tokens (types) to total tokens;
    # guard against division by zero for an empty token list
    lexical_diversity = num_unique_tokens / num_tokens if num_tokens else 0.0
    num_characters = sum(len(token) for token in tokens)  # total characters across all tokens

    if verbose:
        print(f"There are {num_tokens} tokens in the data.")
        print(f"There are {num_unique_tokens} unique tokens in the data.")
        print(f"There are {num_characters} characters in the data.")
        print(f"The lexical diversity is {lexical_diversity:.3f} in the data.")

        # Print the most common tokens (five by default)
        print(Counter(tokens).most_common(num_top_tokens))  # p. 16 of textbook

    return [num_tokens, num_unique_tokens,
            lexical_diversity,
            num_characters]
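
The statistics computed by `descriptive_stats` can be checked by hand on a small token list. The sketch below repeats the same calculations outside the function on a hypothetical six-token list, which is not drawn from the lyrics data.

```python
from collections import Counter

# Hypothetical token list for illustration only
toy_tokens = ["love", "know", "love", "got", "love", "know"]

num_tokens = len(toy_tokens)                         # 6 tokens in total
num_unique_tokens = len(set(toy_tokens))             # 3 distinct tokens
lexical_diversity = num_unique_tokens / num_tokens   # 3 / 6 = 0.5
num_characters = sum(len(t) for t in toy_tokens)     # 4+4+4+3+4+4 = 23

# Two most frequent tokens with their counts
top_tokens = Counter(toy_tokens).most_common(2)

print(num_tokens, num_unique_tokens, lexical_diversity, num_characters)
print(top_tokens)
```

A lexical diversity near 1 means almost every token is distinct, while values near 0 indicate heavy repetition, which is typical of song lyrics with repeated choruses.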

In [11]:
# Descriptive stats calculated for each genre

# Iterate over each genre and calculate descriptive statistics
for genre in preproc_df_ungrouped["genre"].unique():
    genre_tokens = [token for tokens in preproc_df_ungrouped.loc[preproc_df_ungrouped["genre"] == genre, "tokens"] for token in tokens]
    print(f"Genre: {genre}")
    descriptive_stats(genre_tokens)
    print()

Genre: country
There are 156743 tokens in the data.
There are 16760 unique tokens in the data.
There are 778715 characters in the data.
The lexical diversity is 0.107 in the data.
[('know', 1660), ('****', 1450), ('love', 1316), ('got', 1196), ('might', 1193)]

Genre: rock
There are 149352 tokens in the data.
There are 15431 unique tokens in the data.
There are 751756 characters in the data.
The lexical diversity is 0.103 in the data.
[('know', 1800), ('****', 1787), ('love', 1452), ('might', 1139), ('got', 1122)]

Genre: pop
There are 194736 tokens in the data.
There are 19505 unique tokens in the data.
There are 953951 characters in the data.
The lexical diversity is 0.100 in the data.
[('****', 3460), ('know', 2659), ('love', 2434), ('got', 1831), ('get', 1540)]

Genre: r-b
There are 213327 tokens in the data.
There are 17143 unique tokens in the data.
There are 1031393 characters in the data.
The lexical diversity is 0.080 in the data.
[('****', 6146), ('know', 3949), ('love', 3174