# ADS 509 Module 3: Group Comparison 

The task of comparing two groups of text is fundamental to textual analysis. There are innumerable applications: survey respondents from different segments of customers, speeches by different political parties, words used in Tweets by different constituencies, etc. In this assignment you will build code to effect comparisons between groups of text data, using the ideas learned in reading and lecture.

This assignment asks you to analyze the lyrics and Twitter descriptions for the two artists you selected in Module 1. If the results from that pull were not to your liking, you are welcome to use the zipped data from the “Assignment Materials” section. Specifically, you are asked to do the following: 

* Read in the data, normalize the text, and tokenize it. When you tokenize your Twitter descriptions, keep hashtags and emojis in your token set. 
* Calculate descriptive statistics on the two sets of lyrics and compare the results. 
* For each of the four corpora, find the words that are unique to that corpus. 
* Build word clouds for all four corpora. 

Each one of the analyses has a section dedicated to it below. Before beginning the analysis there is a section for you to read in the data and do your cleaning (tokenization and normalization). 


## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it. 

One sign of mature code is conforming to a style guide. We recommend the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html). If you use a different style guide, please include a cell with a link. 

Your code should be relatively easy-to-read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential `import` statements and make sure that all such statements are moved into the designated cell. 

Make use of non-code cells for written commentary. These cells should be grammatical and clearly written. In some of these cells you will have questions to answer. The questions will be marked by a "Q:" and will have a corresponding "A:" spot for you. *Make sure to answer every question marked with a `Q:` for full credit.* 


In [1]:
import os
import re
import emoji
import pandas as pd

from collections import Counter, defaultdict
from nltk.corpus import stopwords
from string import punctuation
from wordcloud import WordCloud 

from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer




In [57]:
import string

import nltk
from nltk.tokenize import word_tokenize
from matplotlib import pyplot as plt

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/landonpadgett/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [51]:
# Place any additional functions or constants you need here. 

# Some punctuation variations
punctuation = set(punctuation) # speeds up comparison
tw_punct = punctuation - {"#"}

# Stopwords
sw = stopwords.words("english")

# Two useful regex
whitespace_pattern = re.compile(r"\s+")
hashtag_pattern = re.compile(r"^#[0-9a-zA-Z]+")

# It's handy to have a full set of emojis. In emoji >= 2.0, `EMOJI_DATA` is a
# dict keyed by the emoji characters themselves, so its keys are the set we want.
all_language_emojis = set(emoji.EMOJI_DATA)

# and now our functions
def descriptive_stats(tokens, num_tokens = 5, verbose=True) :
    """
        Given a list of tokens, print number of tokens, number of unique tokens, 
        number of characters, lexical diversity, and num_tokens most common
        tokens. Return a list with the number of tokens, number of unique
        tokens, lexical diversity, and number of characters.
    """

    # Place your Module 2 solution here
    
    return(0)


    
def contains_emoji(s):
    
    s = str(s)
    emojis = [ch for ch in s if emoji.is_emoji(ch)]

    return(len(emojis) > 0)


def remove_stop(tokens) :
    """ Drop any token that appears in the stopword list `sw`. """
    return([token for token in tokens if token not in sw])
 
def remove_punctuation(text, punct_set=tw_punct) : 
    return("".join([ch for ch in text if ch not in punct_set]))

def tokenize(text) : 
    """ Splitting on whitespace rather than the book's tokenize function. That 
        function will drop tokens like '#hashtag' or '2A', which we need for Twitter. """
    
    return(text.split())

def prepare(text, pipeline) : 
    tokens = str(text)
    
    for transform in pipeline : 
        tokens = transform(tokens)
        
    return(tokens)


## Data Ingestion

Use this section to ingest your data into the data structures you plan to use. Typically this will be a dictionary or a pandas DataFrame.
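
As one possible dictionary layout, the sketch below reads lyrics into a nested dictionary keyed by artist and then by file name. It assumes one subfolder per artist with one `.txt` file per song (as in paths such as `lyrics/robyn/robyn_includemeout.txt`); the helper name `read_lyrics_folder` and the demo file are hypothetical.

```python
import os
import tempfile


def read_lyrics_folder(lyrics_folder):
    """Return {artist: {file_name: text}} for every .txt file found.

    Assumes one subfolder per artist, each holding one .txt file per song.
    """
    lyrics = {}
    for artist in sorted(os.listdir(lyrics_folder)):
        artist_path = os.path.join(lyrics_folder, artist)
        if not os.path.isdir(artist_path):
            continue  # skip stray files that sit next to the artist folders
        lyrics[artist] = {}
        for file_name in sorted(os.listdir(artist_path)):
            if file_name.endswith(".txt"):
                file_path = os.path.join(artist_path, file_name)
                with open(file_path, encoding="utf-8") as handle:
                    lyrics[artist][file_name] = handle.read()
    return lyrics


# Tiny demonstration on a throwaway folder (made-up text, not course data).
with tempfile.TemporaryDirectory() as demo_folder:
    os.makedirs(os.path.join(demo_folder, "robyn"))
    demo_file = os.path.join(demo_folder, "robyn", "robyn_demo.txt")
    with open(demo_file, "w", encoding="utf-8") as handle:
        handle.write("Demo Title\n\nfirst line of lyrics\n")
    demo_lyrics = read_lyrics_folder(demo_folder)

print(sorted(demo_lyrics["robyn"]))
```

Keeping each song as its own entry makes it easy to drop the title line later and to count songs per artist.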

In [5]:
# Feel free to use the below cells as an example or read in the data in a way you prefer

# change `data_location` to the location of the folder on your machine.
data_location = "/users/landonpadgett/Desktop/M1 Results/" 
twitter_folder = "/users/landonpadgett/Desktop/M1 Results/twitter/"
lyrics_folder = "/users/landonpadgett/Desktop/M1 Results/lyrics/"

artist_files = {'cher':'cher_followers_data.txt',
                'robyn':'robynkonichiwa_followers_data.txt'}


In [9]:
# Define the artist files
artist_files = {
    'cher': 'cher_followers_data.txt',
    'robyn': 'robynkonichiwa_followers_data.txt'
}

# Read the CSV file for Cher
twitter_data = pd.read_csv(twitter_folder + artist_files['cher'],
                           sep="\t",
                           quoting=3)

# Add the artist column for Cher
twitter_data['artist'] = "cher"

# Read the CSV file for Robyn
twitter_data_2 = pd.read_csv(twitter_folder + artist_files['robyn'],
                             sep="\t",
                             quoting=3)

# Add the artist column for Robyn
twitter_data_2['artist'] = "robyn"

# Concatenate the two dataframes
twitter_data = pd.concat([twitter_data, twitter_data_2])

# Delete the second dataframe to free memory
del twitter_data_2

In [13]:
# Read Twitter data for Cher
twitter_data = pd.read_csv(twitter_folder + artist_files['cher'],
                           sep="\t",
                           quoting=3)
twitter_data['artist'] = "cher"

# Read Twitter data for Robyn
twitter_data_2 = pd.read_csv(twitter_folder + artist_files['robyn'],
                             sep="\t",
                             quoting=3)
twitter_data_2['artist'] = "robyn"

# Concatenate Twitter data
twitter_data = pd.concat([twitter_data, twitter_data_2])
del twitter_data_2

# Function to read all lyrics files for a given artist
def read_lyrics(artist, folder_path):
    lyrics_data_list = []
    # The lyrics files sit in per-artist subfolders (e.g. lyrics/robyn/), so
    # walk the whole tree rather than listing only the top-level folder.
    for root, _, file_names in os.walk(folder_path):
        for file_name in sorted(file_names):
            if file_name.startswith(artist) and file_name.endswith('.txt'):
                file_path = os.path.join(root, file_name)
                lyrics_data = pd.read_csv(file_path, 
                                          sep="\t", 
                                          quoting=3, 
                                          header=None, 
                                          names=["line"])  # Adjust as per the file structure
                lyrics_data['artist'] = artist
                lyrics_data_list.append(lyrics_data)
    
    # Check if there is data to concatenate
    if lyrics_data_list:
        return pd.concat(lyrics_data_list, ignore_index=True)
    else:
        print(f"No lyrics files found for {artist}.")
        return pd.DataFrame(columns=["line", "artist"])  # Return an empty DataFrame with the correct columns

# Read all lyrics files for Cher
cher_lyrics_data = read_lyrics('cher', lyrics_folder)

# Read all lyrics files for Robyn
robyn_lyrics_data = read_lyrics('robyn', lyrics_folder)

# Concatenate all lyrics data
if not cher_lyrics_data.empty or not robyn_lyrics_data.empty:
    lyrics_data = pd.concat([cher_lyrics_data, robyn_lyrics_data], ignore_index=True)
else:
    lyrics_data = pd.DataFrame(columns=["line", "artist"])  # Return an empty DataFrame with the correct columns

# Display final DataFrames
print(twitter_data.head())
print(lyrics_data.head())

    screen_name          name                   id        location  \
0        hsmcnp  Country Girl             35152213             NaN   
1    horrormomy          Jeny   742153090850164742           Earth   
2  anju79990584          anju  1496463006451974150             NaN   
3  gallionjenna             J           3366479914             NaN   
4       bcscomm       bcscomm             83915043  Washington, DC   

   followers_count  friends_count  \
0             1302           1014   
1               81            514   
2               13            140   
3              752            556   
4              888           2891   

                                         description artist  
0                                                NaN   cher  
1           𝙿𝚛𝚘𝚞𝚍 𝚜𝚞𝚙𝚙𝚘𝚛𝚝𝚎𝚛 𝚘𝚏 𝚖𝚎𝚜𝚜𝚢 𝚋𝚞𝚗𝚜 & 𝚕𝚎𝚐𝚐𝚒𝚗𝚐𝚜   cher  
2          163㎝／愛かっぷ💜26歳🍒 工〇好きな女の子💓 フォローしてくれたらDMします🧡   cher  
3                                          

## Tokenization and Normalization

In this next section, tokenize and normalize your data. We recommend the following cleaning. 

**Lyrics** 

* Remove song titles
* Casefold to lowercase
* Remove stopwords (optional)
* Remove punctuation
* Split on whitespace

Removal of stopwords is up to you. Your descriptive statistic comparison will be different if you include stopwords, though TF-IDF should still find interesting features for you. Note that we remove stopwords before removing punctuation because the stopword set includes punctuation.

**Twitter Descriptions** 

* Casefold to lowercase
* Remove stopwords
* Remove punctuation other than emojis or hashtags
* Split on whitespace

Removing stopwords seems sensible for the Twitter description data. Remember to leave in emojis and hashtags, since you analyze those. 
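
The two recommended cleaning recipes can be sketched as pipelines of small functions applied in order. This is a minimal sketch under stated assumptions: `DEMO_STOPWORDS` is a tiny stand-in for the NLTK stopword list so the example runs without downloads, `drop_title` assumes the song title is the first line of each lyrics file, and all helper names are illustrative. Stopwords are removed before punctuation, following the note above.

```python
from string import punctuation

# Stand-in stopword list so the sketch runs without NLTK data (assumption).
DEMO_STOPWORDS = {"i", "the", "a", "and", "of", "to", "don't"}
TW_PUNCT = set(punctuation) - {"#"}  # keep '#' so hashtags survive


def drop_title(text):
    """Drop the first line, assuming it holds the song title."""
    return text.split("\n", 1)[1] if "\n" in text else ""


def tokenize(text):
    """Split on whitespace so tokens like '#hashtag' stay intact."""
    return text.split()


def remove_stop(tokens, stop_set=DEMO_STOPWORDS):
    return [tok for tok in tokens if tok not in stop_set]


def strip_punct(tokens, punct_set=frozenset(punctuation)):
    """Remove punctuation characters inside each token and drop empty tokens."""
    cleaned = ("".join(ch for ch in tok if ch not in punct_set) for tok in tokens)
    return [tok for tok in cleaned if tok]


def strip_tw_punct(tokens):
    """Same as strip_punct, but keeps '#'; emojis are not punctuation."""
    return strip_punct(tokens, punct_set=TW_PUNCT)


def prepare(text, pipeline):
    result = str(text)
    for transform in pipeline:
        result = transform(result)
    return result


lyrics_pipeline = [drop_title, str.lower, tokenize, remove_stop, strip_punct]
twitter_pipeline = [str.lower, tokenize, remove_stop, strip_tw_punct]

lyric_tokens = prepare("Demo Title\nI don't believe, the end!", lyrics_pipeline)
tweet_tokens = prepare("Proud #Cher fan 💜 and a dancer.", twitter_pipeline)
print(lyric_tokens)
print(tweet_tokens)
```

Tokenizing before stopword removal lets contractions such as "don't" match the stopword list before their apostrophes are stripped.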

In [38]:
data_location = "/users/landonpadgett/Desktop/M1 Results/" 
twitter_folder = "/users/landonpadgett/Desktop/M1 Results/twitter/"
lyrics_folder = "/users/landonpadgett/Desktop/M1 Results/lyrics/"

artist_files = {'cher':'cher_followers_data.txt',
                'robyn':'robynkonichiwa_followers_data.txt'}


def read_file(file_path):
    """Reads the content of a file and returns it as a string."""
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

def process_folder(folder_path):
    """
    Processes all .txt files in a folder and its subfolders.
    Returns tokenized text as a list of words.
    """
    all_data = ""
    # Walk through all subfolders and files
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.endswith(".txt"):  # Only process .txt files
                file_path = os.path.join(root, file)
                print(f"Processing file: {file_path}")  # Debugging output
                all_data += read_file(file_path) + " "
    return all_data.split()  # Tokenize by splitting on whitespace

def descriptive_stats(tokens, num_tokens=5, verbose=True):
    """
    Given a list of tokens, print number of tokens, number of unique tokens, 
    number of characters, lexical diversity, and num_tokens most common tokens. 
    Return a list with the number of tokens, number of unique tokens, 
    lexical diversity, and number of characters. 
    """
    # Total number of tokens
    total_tokens = len(tokens)
    
    # Number of unique tokens
    num_unique_tokens = len(set(tokens))
    
    # Lexical diversity
    lexical_diversity = num_unique_tokens / total_tokens if total_tokens > 0 else 0
    
    # Number of characters
    num_characters = sum(len(token) for token in tokens)
    
    # Find the most common tokens
    token_counts = Counter(tokens)
    most_common_tokens = token_counts.most_common(num_tokens)
    
    if verbose:
        print(f"There are {total_tokens} tokens in the data.")
        print(f"There are {num_unique_tokens} unique tokens in the data.")
        print(f"There are {num_characters} characters in the data.")
        print(f"The lexical diversity is {lexical_diversity:.3f} in the data.")
        print(f"The {num_tokens} most common tokens are:")
        for token, count in most_common_tokens:
            print(f"{token}: {count}")
    
    return [total_tokens, num_unique_tokens, lexical_diversity, num_characters]

# Paths to Twitter and Lyrics folders
data_location = "/users/landonpadgett/Desktop/M1 Results/"
twitter_folder = os.path.join(data_location, "twitter/")
lyrics_folder = os.path.join(data_location, "lyrics/")

# Process and analyze Twitter data
print("Twitter Data Stats:")
twitter_tokens = process_folder(twitter_folder)
twitter_stats = descriptive_stats(twitter_tokens, verbose=True)

# Process and analyze Lyrics data (including subfolders like cher/ and robyn/)
print("\nLyrics Data Stats:")
lyrics_tokens = process_folder(lyrics_folder)
lyrics_stats = descriptive_stats(lyrics_tokens, verbose=True)

Twitter Data Stats:
Processing file: /users/landonpadgett/Desktop/M1 Results/twitter/cher_followers_data.txt
Processing file: /users/landonpadgett/Desktop/M1 Results/twitter/robynkonichiwa_followers_data.txt
Processing file: /users/landonpadgett/Desktop/M1 Results/twitter/cher_followers.txt
Processing file: /users/landonpadgett/Desktop/M1 Results/twitter/robynkonichiwa_followers.txt
There are 58532931 tokens in the data.
There are 12777699 unique tokens in the data.
There are 368240283 characters in the data.
The lexical diversity is 0.218 in the data.
The 5 most common tokens are:
and: 598725
a: 409768
the: 400525
I: 393219
of: 365284

Lyrics Data Stats:
Processing file: /users/landonpadgett/Desktop/M1 Results/lyrics/robyn/robyn_includemeout.txt
Processing file: /users/landonpadgett/Desktop/M1 Results/lyrics/robyn/robyn_electric.txt
Processing file: /users/landonpadgett/Desktop/M1 Results/lyrics/robyn/robyn_beach2k20.txt
Processing file: /users/landonpadgett/Desktop/M1 Results/lyrics/

With the data processed, we can now start work on the assignment questions. 

Q: What is one area of improvement to your tokenization that you could theoretically carry out? (No need to actually do it; let's not make perfect the enemy of good enough.)

A: 
An area of improvement for the tokenization process could be more effective handling of hashtags and emojis. Instead of treating them as single tokens, we could split hashtags into meaningful words and convert emojis into descriptive words (e.g., 😊 → "smiley_face"). This would enhance feature extraction and improve the contextual understanding of the text data.
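
A rough sketch of that idea, using only the standard library, is shown below. It assumes hashtags are written in CamelCase (so it must run before casefolding) and uses official Unicode character names rather than the `emoji` package's aliases; `split_hashtag` and `describe_symbol` are hypothetical helper names, and multi-character emoji sequences are not handled.

```python
import re
import unicodedata

# Matches capitalized words, all-caps runs, and digit runs inside a hashtag.
HASHTAG_PART = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")


def split_hashtag(token):
    """Split a CamelCase hashtag such as '#BlackLivesMatter' into lowercase words."""
    return [part.lower() for part in HASHTAG_PART.findall(token.lstrip("#"))]


def describe_symbol(ch):
    """Return the Unicode name of a single character as a snake_case token."""
    name = unicodedata.name(ch, "")
    return name.lower().replace(" ", "_").replace("-", "_") if name else ch


print(split_hashtag("#BlackLivesMatter"))
print(describe_symbol("😊"))
```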

## Calculate descriptive statistics on the two sets of lyrics and compare the results. 


In [40]:
# Create a DataFrame for comparison
comparison_df = pd.DataFrame({
    "Metric": ["Total Tokens", "Unique Tokens", "Lexical Diversity", "Number of Characters"],
    "Twitter Data": twitter_stats,
    "Lyrics Data": lyrics_stats
})

# Calculate differences between the two sets
comparison_df["Difference"] = comparison_df["Twitter Data"] - comparison_df["Lyrics Data"]

# Display the comparison DataFrame
print("\nComparison of Descriptive Statistics Between Twitter and Lyrics Data:")
print(comparison_df)


Comparison of Descriptive Statistics Between Twitter and Lyrics Data:
                 Metric  Twitter Data    Lyrics Data    Difference
0          Total Tokens  5.853293e+07   99415.000000  5.843352e+07
1         Unique Tokens  1.277770e+07    7674.000000  1.277002e+07
2     Lexical Diversity  2.182993e-01       0.077192  1.411077e-01
3  Number of Characters  3.682403e+08  391566.000000  3.678487e+08


Q: What observations do you make about these data? 

A: The Twitter data contains far more tokens (about 58.5 million) than the lyrics data (99,415 tokens), a difference of nearly 600 times. The gap reflects the sheer number of follower records relative to a limited collection of song lyrics. The Twitter data also has many more unique tokens (about 12.8 million versus 7,674) and a higher lexical diversity (0.218 versus 0.077), meaning it has more distinct tokens relative to its total token count. One caveat is that the Twitter tokens were taken from the full follower files, so screen names, IDs, and locations are counted alongside the descriptions and inflate the unique-token count. Finally, the Twitter data contains about 368 million characters against about 392,000 for the lyrics, which is consistent with the overall size of the dataset and with the more repetitive, structured nature of song lyrics.
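
The section heading asks for a comparison of the two sets of lyrics, so the same statistics can also be computed per artist. The sketch below is one way to do that, assuming the cleaned tokens are held in a dict keyed by artist; `corpus_stats` is a hypothetical helper and the toy tokens are invented.

```python
from collections import Counter


def corpus_stats(tokens, num_tokens=5):
    """Return the descriptive statistics for one corpus as a dict."""
    total = len(tokens)
    unique = len(set(tokens))
    return {
        "tokens": total,
        "unique_tokens": unique,
        "characters": sum(len(tok) for tok in tokens),
        "lexical_diversity": unique / total if total else 0.0,
        "most_common": Counter(tokens).most_common(num_tokens),
    }


# Invented toy tokens standing in for the two cleaned lyrics corpora.
lyrics_tokens_by_artist = {
    "cher": "night heart night road heart night".split(),
    "robyn": "beat floor beat floor beat floor beat floor".split(),
}

stats_by_artist = {artist: corpus_stats(tokens)
                   for artist, tokens in lyrics_tokens_by_artist.items()}

for artist, stats in stats_by_artist.items():
    print(f"{artist}: {stats['tokens']} tokens, "
          f"{stats['unique_tokens']} unique, "
          f"lexical diversity {stats['lexical_diversity']:.3f}")
```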


## Find tokens uniquely related to a corpus

Typically we would use TF-IDF to find unique tokens in documents. Unfortunately, we either have too few documents (if we view each data source as a single document) or too many (if we view each description as a separate document). In the latter case, our problem will be that descriptions tend to be short, so our matrix would be too sparse to support analysis. 

To avoid these problems, we will create a custom statistic to identify words that are uniquely related to each corpus. The idea is to find words that occur often in one corpus and infrequently in the other(s). Since corpora can be of different lengths, we will focus on the _concentration_ of tokens within a corpus. "Concentration" is simply the count of the token divided by the total corpus length. For instance, if a corpus had length 100,000 and a word appeared 1,000 times, then the concentration would be $\frac{1000}{100000} = 0.01$. If the same token had a concentration of $0.005$ in another corpus, then the concentration ratio would be $\frac{0.01}{0.005} = 2$. Very rare words can easily create infinite ratios, so you will also add a cutoff to your code so that a token must appear at least $n$ times for you to return it. 

An example of these calculations can be found in [this spreadsheet](https://docs.google.com/spreadsheets/d/1P87fkyslJhqXFnfYezNYrDrXp_GS8gwSATsZymv-9ms). Please don't hesitate to ask questions if this is confusing. 
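
The arithmetic in the worked example can be checked in a few lines. In this sketch the second corpus (200,000 tokens with 1,000 occurrences) is an assumption chosen only to give the concentration of 0.005 used above.

```python
from collections import Counter


def concentration(token, counts, corpus_length):
    """Count of `token` divided by the total number of tokens in the corpus."""
    return counts[token] / corpus_length


corpus_a_counts = Counter({"word": 1000})  # corpus A has 100,000 tokens
corpus_b_counts = Counter({"word": 1000})  # corpus B has 200,000 tokens (assumed)

conc_a = concentration("word", corpus_a_counts, 100_000)
conc_b = concentration("word", corpus_b_counts, 200_000)
ratio = conc_a / conc_b

print(conc_a, conc_b, ratio)
```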

In this section find 10 tokens for each of your four corpora that meet the following criteria: 

1. The token appears at least `n` times in all corpora
1. The tokens are in the top 10 for the highest ratio of appearances in a given corpus vs appearances in other corpora.

You will choose a cutoff for yourself based on the size of the corpus you're working with. If you're working with the Robyn-Cher corpora provided, `n=5` seems to perform reasonably well.
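
With more than two corpora, one reading of these criteria is to compare each corpus against all of the others pooled together. The sketch below takes that approach; the pooling choice, the function name `top_ratio_tokens`, and the toy corpora are assumptions for illustration, not the assignment's prescribed method.

```python
from collections import Counter


def top_ratio_tokens(corpora, target, min_count=5, top_n=10):
    """Rank tokens by concentration in `target` versus all other corpora pooled.

    `corpora` maps corpus name -> list of tokens. A token is kept only if it
    appears at least `min_count` times in every corpus.
    """
    if min_count < 1:
        raise ValueError("min_count must be at least 1 to avoid dividing by zero")

    counts = {name: Counter(tokens) for name, tokens in corpora.items()}
    lengths = {name: len(tokens) for name, tokens in corpora.items()}
    other_names = [name for name in corpora if name != target]
    other_length = sum(lengths[name] for name in other_names)

    ratios = {}
    for token, count in counts[target].items():
        if any(counts[name][token] < min_count for name in corpora):
            continue  # token is too rare in at least one corpus
        other_count = sum(counts[name][token] for name in other_names)
        target_conc = count / lengths[target]
        other_conc = other_count / other_length
        ratios[token] = target_conc / other_conc

    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Toy corpora (invented) with a low cutoff so every token qualifies.
toy_corpora = {
    "a": ["x"] * 3 + ["y"],
    "b": ["x"] + ["y"] * 3,
    "c": ["x"] + ["y"] * 3,
}
top_for_a = top_ratio_tokens(toy_corpora, "a", min_count=1)
print(top_for_a)
```

Calling the function once per corpus name yields the four top-10 lists the section asks for.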

In [41]:
def calculate_concentration(tokens):
    """
    Calculate the concentration of each token in the corpus.
    Returns a dictionary with tokens as keys and their concentration as values.
    """
    total_tokens = len(tokens)
    token_counts = Counter(tokens)
    concentration = {token: count / total_tokens for token, count in token_counts.items()}
    return concentration

def calculate_concentration_ratios(conc1, conc2, min_count, token_counts1, token_counts2):
    """
    Calculate the concentration ratios of tokens between two corpora.
    Returns a dictionary with tokens as keys and their concentration ratio as values.
    """
    ratios = {}
    for token in conc1:
        # Check if the token appears at least min_count times in both corpora
        if token_counts1.get(token, 0) >= min_count and token_counts2.get(token, 0) >= min_count:
            # Calculate the concentration ratio
            ratio = conc1[token] / conc2[token] if conc2[token] > 0 else float('inf')
            ratios[token] = ratio
    return ratios

def top_tokens_by_ratio(conc1, conc2, token_counts1, token_counts2, min_count, top_n=10):
    """
    Find the top `top_n` tokens with the highest concentration ratio for each corpus.
    """
    # Calculate concentration ratios
    ratios1_to_2 = calculate_concentration_ratios(conc1, conc2, min_count, token_counts1, token_counts2)
    ratios2_to_1 = calculate_concentration_ratios(conc2, conc1, min_count, token_counts2, token_counts1)
    
    # Sort tokens by concentration ratio and get the top_n
    top_tokens_corp1 = sorted(ratios1_to_2.items(), key=lambda x: x[1], reverse=True)[:top_n]
    top_tokens_corp2 = sorted(ratios2_to_1.items(), key=lambda x: x[1], reverse=True)[:top_n]
    
    return top_tokens_corp1, top_tokens_corp2

# Calculate token concentrations for both corpora
twitter_concentration = calculate_concentration(twitter_tokens)
lyrics_concentration = calculate_concentration(lyrics_tokens)

# Count the token frequencies in both corpora
twitter_token_counts = Counter(twitter_tokens)
lyrics_token_counts = Counter(lyrics_tokens)

# Define minimum token appearance count cutoff
min_count = 5  # Adjust this based on the size of your corpora

# Get the top 10 tokens for each corpus based on concentration ratios
top_twitter_tokens, top_lyrics_tokens = top_tokens_by_ratio(
    twitter_concentration, 
    lyrics_concentration, 
    twitter_token_counts, 
    lyrics_token_counts, 
    min_count, 
    top_n=10
)

# Display the results
print("Top 10 tokens unique to Twitter data based on concentration ratio:")
for token, ratio in top_twitter_tokens:
    print(f"Token: {token}, Concentration Ratio: {ratio:.2f}")

print("\nTop 10 tokens unique to Lyrics data based on concentration ratio:")
for token, ratio in top_lyrics_tokens:
    print(f"Token: {token}, Concentration Ratio: {ratio:.2f}")

Top 10 tokens unique to Twitter data based on concentration ratio:
Token: i, Concentration Ratio: 30.82
Token: 10, Concentration Ratio: 19.27
Token: 22, Concentration Ratio: 14.47
Token: Music, Concentration Ratio: 9.78
Token: que, Concentration Ratio: 7.29
Token: family, Concentration Ratio: 5.75
Token: Follow, Concentration Ratio: 4.35
Token: James, Concentration Ratio: 4.24
Token: follow, Concentration Ratio: 4.24
Token: lover, Concentration Ratio: 3.99

Top 10 tokens unique to Lyrics data based on concentration ratio:
Token: cryin', Concentration Ratio: 3414.89
Token: Ooh,, Concentration Ratio: 3336.38
Token: digi, Concentration Ratio: 1712.80
Token: (Let's, Concentration Ratio: 1345.77
Token: Ohh,, Concentration Ratio: 1177.55
Token: Taxi,, Concentration Ratio: 981.29
Token: splinters, Concentration Ratio: 883.16
Token: conceal, Concentration Ratio: 856.40
Token: indestructible, Concentration Ratio: 856.40
Token: tellin', Concentration Ratio: 824.28


Q: What are some observations about the top tokens? Do you notice any interesting items on the list? 

A: The Twitter list is dominated by everyday, self-descriptive language: the lowercase pronoun "i", numbers such as "10" and "22", and words like "family", "follow", and "lover", all typical of self-expression on social media. The lyrics list, in contrast, is full of expressive words such as "cryin'", "Ooh,", and "indestructible", which you'd expect in a more poetic context. Several lyrics tokens still carry punctuation ("Ooh,", "(Let's", "Taxi,"), and "Follow" and "follow" appear as separate Twitter tokens, which shows that the text was not normalized before counting. The lyrics ratios are also far larger (roughly 800 to 3,400, versus at most about 31 for Twitter), because a token that barely clears the cutoff in the much larger Twitter corpus has a tiny concentration there.

## Build word clouds for all four corpora. 

For building wordclouds, we'll follow exactly the code of the text. The code in this section can be found [here](https://github.com/blueprints-for-text-analytics-python/blueprints-text/blob/master/ch01/First_Insights.ipynb). If you haven't already, you should absolutely clone the repository that accompanies the book. 


In [62]:
nltk.download('punkt')
nltk.download('stopwords')

nltk_data_path = "/users/landonpadgett/Desktop/M1 Results/nltk_data/"

nltk.data.path.append(nltk_data_path)

# Download the 'punkt' tokenizer to the specified directory
nltk.download('punkt', download_dir=nltk_data_path)
nltk.download('stopwords', download_dir=nltk_data_path)


# Define stopwords
stop_words = set(stopwords.words('english'))

# Utility function to preprocess text
def preprocess_text(text):
    if not isinstance(text, str):
        return []  # Return an empty list for non-string values (e.g., NaN, float)
    text = text.lower()  # Convert to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    tokens = word_tokenize(text)  # Tokenize text
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return tokens

# Function to calculate word frequencies
def calculate_word_frequencies(tokens_list):
    all_tokens = [word for tokens in tokens_list for word in tokens]  # Flatten list of lists
    return Counter(all_tokens)

# Function to generate word cloud
def wordcloud(word_freq, title=None, max_words=200, stopwords=None):
    wc = WordCloud(width=800, height=400, 
                   background_color="black", colormap="Paired", 
                   max_font_size=150, max_words=max_words)
    
    # Convert a pandas Series of frequencies into a Counter
    if isinstance(word_freq, pd.Series):
        counter = Counter(word_freq.fillna(0).to_dict())
    else:
        counter = word_freq

    # Filter stopwords in frequency counter if provided
    if stopwords is not None:
        counter = {token: freq for (token, freq) in counter.items() 
                   if token not in stopwords}
    
    wc.generate_from_frequencies(counter)
    
    plt.figure(figsize=(10, 5))  # Add figure size to display it clearly
    plt.title(title) 
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.show()  # This is necessary to display the plot

# `lyrics_data` and `twitter_data` come from the Data Ingestion section above
# Preprocess the data
lyrics_data['tokens'] = lyrics_data['line'].apply(preprocess_text)
twitter_data['tokens'] = twitter_data['description'].apply(preprocess_text)

# Calculate word frequencies for both datasets
lyrics_word_freq = calculate_word_frequencies(lyrics_data['tokens'])
twitter_word_freq = calculate_word_frequencies(twitter_data['tokens'])

# Generate word clouds for both datasets
wordcloud(lyrics_word_freq, title="Lyrics Data Word Cloud")
wordcloud(twitter_word_freq, title="Twitter Data Word Cloud")

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/landonpadgett/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/landonpadgett/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /users/landonpadgett/Desktop/M1 Results/nltk_data/...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /users/landonpadgett/Desktop/M1 Results/nltk_data/...
[nltk_data]   Unzipping corpora/stopwords.zip.


LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/Users/landonpadgett/nltk_data'
    - '/Users/landonpadgett/miniconda3/envs/myenv/nltk_data'
    - '/Users/landonpadgett/miniconda3/envs/myenv/share/nltk_data'
    - '/Users/landonpadgett/miniconda3/envs/myenv/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/users/landonpadgett/Desktop/M1 Results/nltk_data/'
**********************************************************************
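
The traceback above reports that `word_tokenize` could not find the `punkt_tab` resource and suggests running `nltk.download('punkt_tab')`. An alternative that avoids tokenizer models altogether is to split on whitespace, as the cleaning recommendations call for. The sketch below is a possible drop-in for `preprocess_text` under that assumption; `preprocess_text_simple` is a hypothetical name and `DEMO_STOP_WORDS` is a small stand-in for the NLTK English stopword list.

```python
import string

# Stand-in stopword set; in the notebook this would be the NLTK English list.
DEMO_STOP_WORDS = {"the", "a", "and", "of", "i", "my"}

# Keep '#' so hashtags survive; emojis are untouched because they are not
# part of string.punctuation.
PUNCT_TO_DROP = "".join(ch for ch in string.punctuation if ch != "#")
PUNCT_TABLE = str.maketrans("", "", PUNCT_TO_DROP)


def preprocess_text_simple(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, strip punctuation except '#', split on whitespace, drop stopwords."""
    if not isinstance(text, str):
        return []  # e.g. NaN descriptions
    tokens = text.lower().translate(PUNCT_TABLE).split()
    return [tok for tok in tokens if tok not in stop_words]


print(preprocess_text_simple("Lover of #music, my dogs & the 80s! 🎶"))
print(preprocess_text_simple(float("nan")))
```

Unlike the `string.punctuation` translation in the cell above, this version keeps hashtags, which the assignment asks to retain for the Twitter descriptions.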


Q: What observations do you have about these (relatively straightforward) wordclouds? 

A: 