# Preprocessing

Mounting Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Installing, Importing and Preprocessing

In [None]:
# Install spaCy, plotly, and nbformat (the imports happen in the next cell).
%pip install spacy
%pip install plotly
%pip install nbformat --upgrade



In [250]:
# Import spacy
import spacy

# Import os to upload documents and metadata
import os

# Load spaCy visualizer
from spacy import displacy

# Import pandas DataFrame packages
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

# Import graphing package
import plotly.graph_objects as go
import plotly.express as px

In [251]:
# Create empty lists for file names and contents
texts = []
file_names = []

# Folder containing the lyric text files
folder = '/content/drive/MyDrive/collecting data assignment4/txt_files'

# Iterate through each file in the folder
for _file_name in os.listdir(folder):
    # Look for only text files
    if _file_name.endswith('.txt'):
        # Append contents of each text file to text list (the file is closed automatically)
        with open(os.path.join(folder, _file_name), 'r', encoding='utf-8') as f:
            texts.append(f.read())
        # Append name of each file to file name list
        file_names.append(_file_name)
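
The same loading step can also be sketched with pathlib. This is only an illustration: the helper name read_text_files and the throwaway demo folder are assumptions, not part of the original notebook. Sorting the paths makes the row order reproducible, since os.listdir returns files in arbitrary order.

```python
import tempfile
from pathlib import Path

def read_text_files(folder):
    """Return (file_names, texts) for every .txt file in folder, sorted by name."""
    file_names, texts = [], []
    for path in sorted(Path(folder).glob('*.txt')):
        texts.append(path.read_text(encoding='utf-8'))
        file_names.append(path.name)
    return file_names, texts

# Demonstrate on a temporary folder with made-up files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'b.txt').write_text('second', encoding='utf-8')
    Path(tmp, 'a.txt').write_text('first', encoding='utf-8')
    Path(tmp, 'notes.csv').write_text('ignored', encoding='utf-8')
    demo_names, demo_texts = read_text_files(tmp)
```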

In [252]:
# Create dictionary object associating each file name with its text
d = {'Filename':file_names,'Text':texts}

In [253]:
# Turn dictionary into a dataframe
lyric_df = pd.DataFrame(d)

In [254]:
lyric_df.head()


Unnamed: 0,Filename,Text
0,Roseblood.txt,Everyday you can see\nChanges in her hair and ...
1,I'm Sailin'.txt,"I'm sailin', sailin'\n(To a) place I've never ..."
2,Still Cold.txt,You really had a million hearts to break\nAll ...
3,"Quiet, The Winter Harbor.txt",Well you're still walking\nAround the block\nY...
4,Flowers In December.txt,Before I let you down again\nI just want to se...


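The texts contain line breaks and may contain other extra whitespace (shown as \n or \t in the table above). Each run of whitespace can be replaced by a single space using the str.replace() method with a regular expression, and str.strip() removes any whitespace left at the start or end of a text.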

In [255]:
# Remove extra whitespace from lyrics (raw string avoids an invalid escape sequence warning)
lyric_df['Text'] = lyric_df['Text'].str.replace(r'\s+', ' ', regex=True).str.strip()
lyric_df.head()


Unnamed: 0,Filename,Text
0,Roseblood.txt,Everyday you can see Changes in her hair and s...
1,I'm Sailin'.txt,"I'm sailin', sailin' (To a) place I've never s..."
2,Still Cold.txt,You really had a million hearts to break All t...
3,"Quiet, The Winter Harbor.txt",Well you're still walking Around the block You...
4,Flowers In December.txt,Before I let you down again I just want to see...
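
To see what the pattern does on a single string, here is a minimal sketch using Python's re module. The function name normalize_whitespace and the sample string are made up for illustration.

```python
import re

def normalize_whitespace(text):
    """Collapse any run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample with a leading space, a newline, a tab, and trailing whitespace
sample = "  Everyday you can see\nChanges in her hair\t and skin \n"
cleaned = normalize_whitespace(sample)
print(cleaned)
```

Note that this step discards line breaks, so the original line structure of each lyric is no longer available afterwards.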


In [256]:
# Load metadata.
metadata_df = pd.read_csv('/content/drive/MyDrive/collecting data assignment4/metadata.csv')
metadata_df.head()


Unnamed: 0,TITLE,ALBUM,YEAR,TYPE
0,Be My Angel,She Hangs Brightly,1990,ALBUM
1,Before I Sleep,She Hangs Brightly,1990,ALBUM
2,Blue Flower,She Hangs Brightly,1990,ALBUM
3,Free,She Hangs Brightly,1990,ALBUM
4,Ghost Highway,She Hangs Brightly,1990,ALBUM


In [257]:
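# Remove the .txt extension from each file name
# (the dot is escaped and anchored so only a trailing ".txt" is removed)
lyric_df['Filename'] = lyric_df['Filename'].str.replace(r'\.txt$', '', regex=True)

# Rename the TITLE column to Filename so both DataFrames share a column to merge on
metadata_df.rename(columns={"TITLE": "Filename"}, inplace=True)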

In [258]:
# Merge metadata and lyrics into new DataFrame
# Will only keep rows where both lyric text and metadata are present
final_lyric_df = metadata_df.merge(lyric_df,on='Filename')
final_lyric_df = final_lyric_df.rename(columns={'Filename': 'TITLE'})
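
Because merge performs an inner join by default, any title that is spelled differently in the two tables (for example, a difference in capitalization) is dropped without warning. A minimal sketch of how unmatched rows could be listed, using two small made-up tables in place of metadata_df and lyric_df:

```python
import pandas as pd

# Hypothetical miniature versions of the metadata and lyric tables
meta = pd.DataFrame({'Filename': ['Blue Flower', 'Free', 'Flowers in December']})
lyrics = pd.DataFrame({'Filename': ['Blue Flower', 'Free', 'Flowers In December'],
                       'Text': ['a', 'b', 'c']})

# An outer merge with indicator=True records which table each row came from
check = meta.merge(lyrics, on='Filename', how='outer', indicator=True)

# Rows not marked "both" would be lost in the default inner merge
unmatched = check[check['_merge'] != 'both']
unmatched_titles = sorted(unmatched['Filename'])
print(unmatched_titles)
```

If the list is empty, every lyric found its metadata row; otherwise the listed titles need to be reconciled before merging.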

Let's check the head of the DataFrame again to confirm everything has worked well. Check the first five rows to make sure each has a title, album, year, type, and text (the full lyric).

In [259]:
# Print DataFrame
final_lyric_df.head()


Unnamed: 0,TITLE,ALBUM,YEAR,TYPE,Text
0,Be My Angel,She Hangs Brightly,1990,ALBUM,They say it's me that makes you do things you ...
1,Before I Sleep,She Hangs Brightly,1990,ALBUM,If it's The truth That's all I know I look for...
2,Blue Flower,She Hangs Brightly,1990,ALBUM,Waitin' for a sign from you Waitin' for a sign...
3,Free,She Hangs Brightly,1990,ALBUM,I fell asleep in the silence Before the street...
4,Ghost Highway,She Hangs Brightly,1990,ALBUM,You're a ghost on the highway And I'll love yo...


The resulting DataFrame is now ready for analysis.


## Text Enrichment with spaCy

### Creating Doc Objects

To use spaCy, the first step is to load one of spaCy’s Trained Models and Pipelines, which will be used to perform tokenization, part-of-speech tagging, and other text enrichment tasks. A wide range of options is available (see the full list on spaCy’s website), and they vary based on size and language.

We’ll use en_core_web_sm, which has been trained on written web texts. It may not perform as accurately as the medium and large English models, but it will deliver results most efficiently. Once we’ve loaded en_core_web_sm, we can check what actions it performs: the parser, tagger, lemmatizer, and NER should be among those listed.

In [262]:
# Load nlp pipeline
nlp = spacy.load('en_core_web_sm')

# Check what functions it performs
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


Now that the nlp function is loaded, let’s test out its capacities on a single sentence. Calling the nlp function on a single sentence yields a Doc object. This object stores not only the original text, but also all of the linguistic annotations obtained when spaCy processed the text.

In [263]:
#Define example sentence
sentence = "This is 'an' example? sentence"

# Call the nlp model on the sentence
doc = nlp(sentence)

Next we can call on the Doc object to get the information we’re interested in. The command below loops through each token in a Doc object and prints each word in the text along with its corresponding part-of-speech:

In [267]:
# Loop through each token in doc object
for token in doc:
    # Print text and part of speech for each
    print(token.text, token.pos_)

This PRON
is AUX
' PUNCT
an DET
' PUNCT
example NOUN
? PUNCT
sentence NOUN


Let’s try the same process on the lyrics. As we’ll be calling the nlp function on every text in the DataFrame, we should first define a function that runs nlp on whatever input text is given. Functions are a useful way to store operations that will be run multiple times, reducing duplication and improving code readability.

In [268]:
# Define a function that runs the nlp pipeline on any given input text
def process_text(text):
    return nlp(text)

After the function is defined, use .apply() to apply it to every cell in a given DataFrame column. In this case, nlp will run on each cell in the Text column of the final_lyric_df DataFrame, creating a Doc object from every lyric text. These Doc objects will be stored in a new column of the DataFrame called Doc.

Running this function takes several minutes because spaCy is performing all the parsing and tagging tasks on each text. However, when it is complete, we can simply call on the resulting Doc objects to get parts-of-speech, named entities, and other information of interest, just as in the example of the sentence above.

In [269]:
# Apply the function to the "Text" column, so that the nlp pipeline is called on each lyric
final_lyric_df['Doc'] = final_lyric_df['Text'].apply(process_text)

## Text Reduction

### Tokenization

A critical first step spaCy performs is tokenization, or the segmentation of strings into individual words and punctuation markers. Tokenization enables spaCy to parse the grammatical structures of a text and identify characteristics of each word, such as its part-of-speech.

To retrieve a tokenized version of each text in the DataFrame, we’ll write a function that iterates through any given Doc object and returns all tokens found within it.

In [270]:
# Define a function to retrieve tokens from a doc object
def get_token(doc):
    return [(token.text) for token in doc]

As with the function used to create Doc objects, the token function can be applied to the DataFrame. In this case, we will call the function on the Doc column, since this is the column which stores the results from the processing done by spaCy.

In [271]:
# Run the token retrieval function on the doc objects in the dataframe
final_lyric_df['Tokens'] = final_lyric_df['Doc'].apply(get_token)
final_lyric_df.head()

Unnamed: 0,TITLE,ALBUM,YEAR,TYPE,Text,Doc,Tokens
0,Be My Angel,She Hangs Brightly,1990,ALBUM,They say it's me that makes you do things you ...,"(They, say, it, 's, me, that, makes, you, do, ...","[They, say, it, 's, me, that, makes, you, do, ..."
1,Before I Sleep,She Hangs Brightly,1990,ALBUM,If it's The truth That's all I know I look for...,"(If, it, 's, The, truth, That, 's, all, I, kno...","[If, it, 's, The, truth, That, 's, all, I, kno..."
2,Blue Flower,She Hangs Brightly,1990,ALBUM,Waitin' for a sign from you Waitin' for a sign...,"(Waitin, ', for, a, sign, from, you, Waitin, '...","[Waitin, ', for, a, sign, from, you, Waitin, '..."
3,Free,She Hangs Brightly,1990,ALBUM,I fell asleep in the silence Before the street...,"(I, fell, asleep, in, the, silence, Before, th...","[I, fell, asleep, in, the, silence, Before, th..."
4,Ghost Highway,She Hangs Brightly,1990,ALBUM,You're a ghost on the highway And I'll love yo...,"(You, 're, a, ghost, on, the, highway, And, I,...","[You, 're, a, ghost, on, the, highway, And, I,..."


If we compare the Text and Tokens columns, we find a couple of differences. Most importantly, in the table below the words and punctuation markers in the Tokens column are separated by commas, indicating that each has been parsed as an individual token. The text in the Tokens column is also bracketed; this indicates that the tokens have been generated as a list.

In [272]:
tokens = final_lyric_df[['Text', 'Tokens']].copy()
tokens.head()



Unnamed: 0,Text,Tokens
0,They say it's me that makes you do things you ...,"[They, say, it, 's, me, that, makes, you, do, ..."
1,If it's The truth That's all I know I look for...,"[If, it, 's, The, truth, That, 's, all, I, kno..."
2,Waitin' for a sign from you Waitin' for a sign...,"[Waitin, ', for, a, sign, from, you, Waitin, '..."
3,I fell asleep in the silence Before the street...,"[I, fell, asleep, in, the, silence, Before, th..."
4,You're a ghost on the highway And I'll love yo...,"[You, 're, a, ghost, on, the, highway, And, I,..."


### Lemmatization

Another process performed by spaCy is lemmatization, or the retrieval of the dictionary root word of each word (for example “brighten” for “brightening”). We’ll perform a similar set of steps to those above to create a function to call the lemmas from the Doc object, then apply it to the DataFrame.

In [273]:
# Define a function to retrieve lemmas from a doc object
def get_lemma(doc):
    return [(token.lemma_) for token in doc]

# Run the lemma retrieval function on the doc objects in the dataframe
final_lyric_df['Lemmas'] = final_lyric_df['Doc'].apply(get_lemma)

Lemmatization can help reduce noise and refine results for researchers who are conducting keyword searches. For example, let’s compare counts of the word “she” in the original Tokens column and in the lemmatized Lemmas column.

In [274]:
# Count exact matches of "she" in each column
token_count = final_lyric_df['Tokens'].apply(lambda x: x.count('she')).sum()
lemma_count = final_lyric_df['Lemmas'].apply(lambda x: x.count('she')).sum()
print(f'"she" appears in the text tokens column {token_count} times.')
print(f'"she" appears in the lemmas column {lemma_count} times.')

"she" appears in the text tokens column 19 times.
"she" appears in the lemmas column 51 times.


As expected, there are more instances of “she” in the Lemmas column, as the lemmatization process has grouped inflected word forms (such as “her” and “hers”) into the base word “she.”
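
One detail to keep in mind when reading these counts: list.count() matches exact strings only, so a capitalized “She” in a token list is not counted as “she”. The sketch below, which uses a made-up token list rather than the real DataFrame, shows one way to check how much of a difference comes from capitalization alone.

```python
# Made-up token list for illustration
tokens = ['She', 'said', 'she', 'saw', 'her']

# Exact, case-sensitive count (what list.count does)
exact = tokens.count('she')

# Case-insensitive count: compare lowercased tokens instead
case_insensitive = sum(token.lower() == 'she' for token in tokens)

print(exact, case_insensitive)
```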

## Text Annotation

### Part of Speech Tagging

spaCy facilitates two levels of part-of-speech tagging: coarse-grained tagging, which predicts the simple universal part-of-speech of each token in a text (such as noun, verb, adjective, adverb), and detailed tagging, which uses a larger, more fine-grained set of part-of-speech tags (for example 3rd person singular present verb). The part-of-speech tags used are determined by the English language model we use. In this case, we’re using the small English model, and you can explore the differences between the models on spaCy’s website.

We can call the part-of-speech tags in the same way as the lemmas. Create a function to extract them from any given Doc object and apply the function to each Doc object in the DataFrame. The function we’ll create will extract both the coarse- and fine-grained part-of-speech for each token (token.pos_ and token.tag_, respectively).

In [277]:
# Define a function to retrieve parts of speech from a doc object
def get_pos(doc):
    # Return the coarse- and fine-grained part of speech text for each token in the doc
    return [(token.pos_, token.tag_) for token in doc]

# Apply the function to the Doc column and store the resulting tags in a new column
final_lyric_df['POS'] = final_lyric_df['Doc'].apply(get_pos)

We can create a list from the part-of-speech column to review the tags further. The first (coarse-grained) tag corresponds to a generally recognizable part-of-speech such as a noun, adjective, or punctuation mark, while the second (fine-grained) tags are a bit more difficult to decipher.

In [278]:
# Create a list of part of speech tags
list(final_lyric_df['POS'])


[[('PRON', 'PRP'),
  ('VERB', 'VBP'),
  ('PRON', 'PRP'),
  ('AUX', 'VBZ'),
  ('PRON', 'PRP'),
  ('PRON', 'WDT'),
  ('VERB', 'VBZ'),
  ('PRON', 'PRP'),
  ('VERB', 'VB'),
  ('NOUN', 'NNS'),
  ('PRON', 'PRP'),
  ('AUX', 'MD'),
  ('PART', 'RB'),
  ('AUX', 'VB'),
  ('VERB', 'VBN'),
  ('SCONJ', 'IN'),
  ('PRON', 'PRP'),
  ('AUX', 'VBD'),
  ('ADV', 'RB'),
  ('CCONJ', 'CC'),
  ('SCONJ', 'IN'),
  ('PRON', 'PRP'),
  ('AUX', 'VBZ'),
  ('PRON', 'PRP'),
  ('PRON', 'WDT'),
  ('VERB', 'VBZ'),
  ('PART', 'TO'),
  ('VERB', 'VB'),
  ('ADP', 'IN'),
  ('PRON', 'PRP'),
  ('CCONJ', 'CC'),
  ('VERB', 'VBZ'),
  ('PRON', 'PRP'),
  ('SCONJ', 'IN'),
  ('PRON', 'PRP'),
  ('VERB', 'VBP'),
  ('ADV', 'RB'),
  ('AUX', 'VBP'),
  ('PART', 'RB'),
  ('VERB', 'VB'),
  ('PRON', 'PRP'),
  ('AUX', 'VBZ'),
  ('ADJ', 'JJ'),
  ('AUX', 'VBP'),
  ('PART', 'RB'),
  ('VERB', 'VB'),
  ('VERB', 'VB'),
  ('PRON', 'PRP'),
  ('AUX', 'VBP'),
  ('PART', 'RB'),
  ('VERB', 'VB'),
  ('PRON', 'PRP'),
  ('NOUN', 'NNS'),
  ('ADP', 'IN'),
  ('AD

Fortunately, spaCy has a built-in function called explain that can provide a short description of any tag of interest. If we try it on the tag IN using spacy.explain("IN"), the output reads conjunction, subordinating or preposition.

In [279]:
spacy.explain("IN")

'conjunction, subordinating or preposition'

In some cases, you may want to get only a subset of part-of-speech tags for further analysis, like all of the proper nouns. A function can be written to perform this task, extracting only words which have been tagged as proper nouns.

In [280]:
# Define function to extract proper nouns from Doc object
def extract_proper_nouns(doc):
    return [token.text for token in doc if token.pos_ == 'PROPN']

# Apply function to Doc column and store resulting proper nouns in new column
final_lyric_df['Proper_Nouns'] = final_lyric_df['Doc'].apply(extract_proper_nouns)

Listing the proper nouns in each text can help us ascertain the texts’ subjects. Let’s list the proper nouns in two different texts: the text located in row 3 of the DataFrame and the text located in row 40.

In [281]:
list(final_lyric_df.loc[[3, 40], 'Proper_Nouns'])

[['Lay', 'Afraid', 'Free', 'Plea'], ['London', 'Seldom']]

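Most of the words in the first list (“Lay”, “Afraid”, “Free”, “Plea”) are not proper nouns in the usual sense; they appear to be ordinary words that the model tagged as PROPN, possibly because they are capitalized at the start of a lyric line. The second list contains one clear proper noun, “London”, alongside another apparent mistagging (“Seldom”). Results like these suggest that proper noun tags in lyrics should be checked by hand before drawing conclusions from them.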

### Named Entity Recognition

spaCy can tag named entities in the text, such as names, dates, organizations, and locations. Call the full list of named entities and their descriptions using this code:

In [282]:
# Get all NE labels and assign to variable
labels = nlp.get_pipe("ner").labels

# Print each label and its description
for label in labels:
    print(label + ' : ' + spacy.explain(label))

CARDINAL : Numerals that do not fall under another type
DATE : Absolute or relative dates or periods
EVENT : Named hurricanes, battles, wars, sports events, etc.
FAC : Buildings, airports, highways, bridges, etc.
GPE : Countries, cities, states
LANGUAGE : Any named language
LAW : Named documents made into laws.
LOC : Non-GPE locations, mountain ranges, bodies of water
MONEY : Monetary values, including unit
NORP : Nationalities or religious or political groups
ORDINAL : "first", "second", etc.
ORG : Companies, agencies, institutions, etc.
PERCENT : Percentage, including "%"
PERSON : People, including fictional
PRODUCT : Objects, vehicles, foods, etc. (not services)
QUANTITY : Measurements, as of weight or distance
TIME : Times smaller than a day
WORK_OF_ART : Titles of books, songs, etc.


We’ll create a function to extract the named entity tags from each Doc object and apply it to the Doc objects in the DataFrame, storing the named entities in a new column:

In [283]:
# Define function to extract named entities from doc objects
def extract_named_entities(doc):
    return [ent.label_ for ent in doc.ents]

# Apply function to Doc column and store resulting named entities in new column
final_lyric_df['Named_Entities'] = final_lyric_df['Doc'].apply(extract_named_entities)
final_lyric_df['Named_Entities']

0                                          [TIME, DATE]
1                                                    []
2     [ORG, PERSON, TIME, PERSON, LANGUAGE, TIME, PE...
3                                         [PERSON, ORG]
4                                      [PERSON, PERSON]
5                                   [DATE, PERSON, ORG]
6                                                    []
7                                    [GPE, GPE, PERSON]
8                                                    []
9                                         [PERSON, ORG]
10                                                   []
11                                               [TIME]
12                                     [PERSON, PERSON]
13                                                   []
14                                 [CARDINAL, CARDINAL]
15       [PERSON, CARDINAL, CARDINAL, PERSON, CARDINAL]
16    [PERSON, PERSON, PERSON, ORDINAL, PERSON, PERS...
17    [DATE, GPE, DATE, DATE, DATE, ORG, GPE, DA

We can add another column with the words and phrases identified as named entities:

In [284]:
# Define function to extract text tagged with named entities from doc objects
# (named differently from the label function above so it does not overwrite it)
def extract_named_entity_words(doc):
    return [ent for ent in doc.ents]

# Apply function to Doc column and store resulting text in new column
final_lyric_df['NE_Words'] = final_lyric_df['Doc'].apply(extract_named_entity_words)
final_lyric_df['NE_Words']


0                            [(the, night), (the, day)]
1                                                    []
2     [(Waitin), (Waitin), (morning), (Superstar), (...
3                                    [(Afraid), (Plea)]
4                                    [(Ghost), (Ghost)]
5                  [(Seven, days), (Man), (Discomfort)]
6                                                    []
7               [(Sweet), (New, Orleans), (Uncle, Sam)]
8                                                    []
9                                      [(Sour), (Lies)]
10                                                   []
11                                 [(the, night, Like)]
12                        [(Shine, There, 's), (Miles)]
13                                                   []
14                                     [(five), (five)]
15    [(Breathless), (two), (two), (Breathless), (two)]
16    [(Mary, Come), (Mary), (Mary), (first), (Mary)...
17    [(Yesterday), (feelin), (today), (Yesterda

Let’s visualize the words and their named entity tags in a single text. Call the Doc object of the text in row 40 and use displacy.render to visualize the text with the named entities highlighted and tagged:

In [285]:
# Extract the Doc object in row 40
doc = final_lyric_df['Doc'][40]

# Visualize named entity tagging in a single lyric
displacy.render(doc, style='ent', jupyter=True)

# Analysis

Why are spaCy’s linguistic annotations useful to researchers? Below are two examples of how the data spaCy produced about the lyric corpus can be used to compare albums. We will use the enriched dataset generated with spaCy for these examples.

### Part of Speech Analysis

In this section, we’ll analyze the part-of-speech tags extracted by spaCy to answer the first research question: are certain parts of speech used more frequently on some albums than on others, and does this point to differences in lyrical style between albums?

spaCy counts the number of each part-of-speech tag that appears in each document (for example the number of times the NOUN tag appears in a document). This is called using doc.count_by(spacy.attrs.POS). Here’s how it works on a single sentence:

In [295]:
# Create doc object from single sentence
doc = nlp("This is 'an' example? sentence")

# Print counts of each part of speech in sentence
print(doc.count_by(spacy.attrs.POS))

{95: 1, 87: 1, 97: 3, 90: 1, 92: 2}


spaCy generates a dictionary where the values represent the counts of each part-of-speech term found in the text. The keys in the dictionary correspond to numerical indices associated with each part-of-speech tag. To make the dictionary more legible, let’s associate the numerical index values with their corresponding part of speech tags. In the example below, it’s now possible to see which parts-of-speech tags correspond to which counts:

In [296]:
# Store dictionary with indexes and POS counts in a variable
num_pos = doc.count_by(spacy.attrs.POS)

dictionary = {}

# Create a new dictionary which replaces the index of each part of speech with its label (NOUN, VERB, ADJ)
for k, v in sorted(num_pos.items()):
    dictionary[doc.vocab[k].text] = v

dictionary

{'AUX': 1, 'DET': 1, 'NOUN': 2, 'PRON': 1, 'PUNCT': 3}
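
Raw counts depend on how long a text is, so it can help to express each count as a share of all tokens before comparing texts of different lengths. A minimal sketch using the dictionary printed above; the variable names are chosen for illustration only:

```python
# Part-of-speech counts for the example sentence, copied from the output above
pos_counts_example = {'AUX': 1, 'DET': 1, 'NOUN': 2, 'PRON': 1, 'PUNCT': 3}

# Total number of tokens in the sentence
total_tokens = sum(pos_counts_example.values())

# Share of tokens carrying each part-of-speech tag
pos_proportions = {tag: count / total_tokens
                   for tag, count in pos_counts_example.items()}

print(total_tokens, pos_proportions)
```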

To get the same type of dictionary for each text in a DataFrame, a function can be created to nest the above for loop. First, we’ll create a new DataFrame for the purposes of part-of-speech analysis, containing the lyric titles, albums, and Doc objects. We can then apply the function to each Doc object in the new DataFrame. In this case (and above), we are interested in the simpler, coarse-grained parts of speech.

In [297]:
# Create new DataFrame for analysis purposes
pos_analysis_df = final_lyric_df[['TITLE', 'ALBUM', 'Doc']].copy()

# Define a function that returns a dictionary of part of speech tags and their counts
def get_pos_tags(doc):
    dictionary = {}
    num_pos = doc.count_by(spacy.attrs.POS)
    for k, v in sorted(num_pos.items()):
        dictionary[doc.vocab[k].text] = v
    return dictionary

# Apply function to each doc object and collect the resulting dictionaries in a list
# (assigning to .loc['C_POS'] would have added a stray row instead of a column)
num_list = pos_analysis_df['Doc'].apply(get_pos_tags).tolist()

From here, we’ll take the part-of-speech counts and put them into a new DataFrame where we can calculate the frequency of each part-of-speech per document. In the new DataFrame, if a lyric does not contain a particular part-of-speech, the cell will read NaN (Not a Number).

In [298]:
# Create new dataframe with part of speech counts
pos_counts = pd.DataFrame(num_list)
columns = list(pos_counts.columns)

# Add album of each lyric as new column to dataframe
idx = 0
new_col = pos_analysis_df['ALBUM']
pos_counts.insert(loc=idx, column='ALBUM', value=new_col)

pos_counts

Unnamed: 0,ALBUM,ADJ,ADP,ADV,AUX,CCONJ,DET,NOUN,PART,PRON,PUNCT,SCONJ,VERB,PROPN,INTJ,NUM,X,SYM
0,She Hangs Brightly,11.0,17,13,30,8.0,5,24,15.0,61,1.0,10.0,45,,,,,
1,She Hangs Brightly,1.0,3,3,5,2.0,4,5,,17,,2.0,12,,,,,
2,She Hangs Brightly,14.0,17,8,14,3.0,13,27,4.0,32,9.0,1.0,13,6.0,,,,
3,She Hangs Brightly,10.0,22,9,9,,26,30,7.0,30,,3.0,25,4.0,,,,
4,She Hangs Brightly,,4,18,10,15.0,8,10,,16,,,6,2.0,,,,
5,She Hangs Brightly,11.0,8,10,23,3.0,5,12,6.0,51,5.0,8.0,34,2.0,2.0,2.0,,
6,She Hangs Brightly,1.0,8,20,17,6.0,9,18,11.0,57,10.0,8.0,39,4.0,1.0,,,
7,She Hangs Brightly,7.0,5,9,13,1.0,6,18,7.0,29,13.0,5.0,19,14.0,1.0,,,
8,She Hangs Brightly,2.0,19,17,20,8.0,6,20,9.0,52,3.0,11.0,36,1.0,4.0,,,
9,She Hangs Brightly,7.0,14,4,4,5.0,5,10,1.0,19,3.0,,19,2.0,,,1.0,


Now we can calculate how many times, on average, each part-of-speech appears in the lyrics of each album. To do so, we use the .groupby() and .mean() functions to group the part-of-speech counts by album and calculate the mean count of each part-of-speech per lyric. The following code also rounds the averages to the nearest whole number:

In [302]:
# Get average part of speech counts used in lyrics of each album
average_pos_df = pos_counts.groupby(['ALBUM']).mean()

# Round calculations to the nearest whole number
average_pos_df = average_pos_df.round(0)

# Reset index to improve DataFrame readability
average_pos_df = average_pos_df.reset_index()

# Show dataframe
average_pos_df

Unnamed: 0,ALBUM,ADJ,ADP,ADV,AUX,CCONJ,DET,NOUN,PART,PRON,PUNCT,SCONJ,VERB,PROPN,INTJ,NUM,X,SYM
0,Among My Swan,6.0,11.0,10.0,9.0,4.0,7.0,14.0,4.0,27.0,9.0,4.0,19.0,3.0,2.0,1.0,,1.0
1,Five String Serenade / Under My Car,2.0,9.0,13.0,11.0,5.0,1.0,11.0,5.0,46.0,14.0,2.0,32.0,3.0,2.0,,,
2,Flowers in December,6.0,4.0,9.0,12.0,5.0,2.0,11.0,4.0,30.0,5.0,2.0,18.0,1.0,,,,
3,I'm Less Here,4.0,20.0,10.0,11.0,4.0,16.0,28.0,1.0,40.0,6.0,4.0,30.0,,,,6.0,
4,Seasons Of Your Day,5.0,8.0,7.0,12.0,4.0,8.0,13.0,6.0,27.0,3.0,3.0,20.0,3.0,6.0,1.0,,
5,She Hangs Brightly,7.0,12.0,11.0,14.0,5.0,9.0,18.0,7.0,34.0,6.0,6.0,23.0,4.0,2.0,2.0,1.0,
6,So Tonight That I Might See,8.0,17.0,7.0,7.0,4.0,8.0,21.0,5.0,30.0,3.0,4.0,24.0,4.0,4.0,2.0,,
7,Still,3.0,11.0,11.0,8.0,4.0,8.0,12.0,4.0,24.0,2.0,3.0,19.0,1.0,2.0,,,
8,Fade Into You,4.0,4.0,4.0,32.0,4.0,11.0,33.0,35.0,60.0,10.0,4.0,64.0,1.0,3.0,1.0,,


Here we can examine the differences between average part-of-speech usage per album.
We can visualize these differences using a bar graph:

In [303]:
# Use plotly to plot average part-of-speech use per album
fig = px.bar(average_pos_df, x="ALBUM", y=["ADJ", 'VERB', "NUM"], title="Average Part-of-Speech Use per Album", barmode='group')
fig.show()

### Named Entity Analysis

In this section, we’ll use the named entity tags extracted by spaCy to compare named entity usage across albums.

To start, we’ll create a new DataFrame with the title, album, and named entity words and tags:

In [304]:
# Create new DataFrame for analysis purposes (copied so later edits do not affect final_lyric_df)
ner_analysis_df = final_lyric_df[['TITLE','ALBUM', 'Named_Entities', 'NE_Words']].copy()

Using the str.count method, we can get counts of a specific named entity used in each text. Let’s get the counts of the named entities of interest here (PERSON, TIME, DATE, and CARDINAL) and add them as new columns of the DataFrame.

In [None]:
# Convert named entity lists to strings so we can count specific entities
ner_analysis_df['Named_Entities'] = ner_analysis_df['Named_Entities'].apply(lambda x: ' '.join(x))

# Get the number of each type of entity in each lyric
person_counts = ner_analysis_df['Named_Entities'].str.count('PERSON')
time_counts = ner_analysis_df['Named_Entities'].str.count('TIME')
date_counts = ner_analysis_df['Named_Entities'].str.count('DATE')
cardi_counts = ner_analysis_df['Named_Entities'].str.count('CARDINAL')


# Append named entity counts to new DataFrame
ner_counts_df = pd.DataFrame()
ner_counts_df['Album'] = ner_analysis_df["ALBUM"]
ner_counts_df['PERSON_Counts'] = person_counts
ner_counts_df['TIME_Counts'] = time_counts
ner_counts_df['DATE_Counts'] = date_counts
ner_counts_df['CARDI_Counts'] = cardi_counts

ner_counts_df

Unnamed: 0,Album,PERSON_Counts,TIME_Counts,DATE_Counts,CARDI_Counts
0,She Hangs Brightly,0,1,1,0
1,She Hangs Brightly,0,0,0,0
2,She Hangs Brightly,3,2,0,0
3,She Hangs Brightly,1,0,0,0
4,She Hangs Brightly,2,0,0,0
5,She Hangs Brightly,1,0,1,0
6,She Hangs Brightly,0,0,0,0
7,She Hangs Brightly,1,0,0,0
8,She Hangs Brightly,0,0,0,0
9,She Hangs Brightly,1,0,0,0
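
Counting substrings in a joined string works for the four labels used here, but it would miscount if one label name were contained inside another. A sketch of a more robust alternative, assuming the labels are still stored as a list; the example list mirrors row 15 of the Named_Entities output shown earlier:

```python
from collections import Counter

# Entity labels for one lyric, as a list rather than a joined string
labels = ['PERSON', 'CARDINAL', 'CARDINAL', 'PERSON', 'CARDINAL']

# Counter tallies whole list items, so partial matches are impossible
label_counts = Counter(labels)

# Labels that never occur simply return 0
person_count = label_counts['PERSON']
cardinal_count = label_counts['CARDINAL']
time_count = label_counts['TIME']

print(person_count, cardinal_count, time_count)
```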


From here, we can compare the total usage of each named entity and plot across albums.

In [305]:
# Calculate total usage of each named entity type
total_ner_df = ner_counts_df.groupby(['Album']).sum()
total_ner_df = total_ner_df.reset_index()

# Use plotly to plot total named entity use per album
fig = px.bar(total_ner_df, x="Album", y=["PERSON_Counts", 'TIME_Counts', "DATE_Counts", "CARDI_Counts"], title="Total Named Entity Usage Across Albums", barmode='group')
fig.show()


## Analysis of PERSON Named Entities

Let’s explore the usage of one of these entities (PERSON) further by retrieving the words most frequently tagged as persons on a particular album. We’ll do this by first creating a function to extract the words tagged as PERSON entities in each document and adding the words to a new DataFrame column:

In [306]:
# Define function to extract words tagged as "PERSON" named entities from doc objects
def extract_person_named_entities(doc):
    return [ent for ent in doc.ents if ent.label_ == 'PERSON']

# Get all person entity words and store them in a new column of the DataFrame
ner_analysis_df['Person_Named_Entities'] = final_lyric_df['Doc'].apply(extract_person_named_entities)


# Make list of person entities a string so we can count their frequencies
ner_analysis_df['Person_Named_Entities'] = [', '.join(map(str, l)) for l in ner_analysis_df['Person_Named_Entities']]

Now we can retrieve only the subset of lyrics from the album She Hangs Brightly and count the words tagged as “PERSON” in these lyrics. The code below outputs the 10 words most frequently labeled with the PERSON named entity tag on that album. Because the entity strings are split on whitespace, multi-word entities such as “Uncle Sam” are counted word by word, and some words keep a trailing comma:

In [308]:
# Select only the lyrics from the album She Hangs Brightly
person_word_counts_df = ner_analysis_df[ner_analysis_df['ALBUM'] == 'She Hangs Brightly']

# Count the frequency of each word tagged as PERSON in these lyrics
person_word_frequencies = person_word_counts_df.Person_Named_Entities.str.split(expand=True).stack().value_counts()

# Get top 10 most common words and their frequencies
person_word_frequencies[:10]

Waitin,       1
Superstar,    1
Superstar     1
Afraid        1
Ghost,        1
Ghost         1
Man           1
Uncle         1
Sam           1
Sour          1
dtype: int64
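
In the output above, “Ghost,” and “Ghost” appear as separate entries, and “Uncle Sam” is split into two words, because the joined strings were split on whitespace. One possible alternative is to count whole entity strings with collections.Counter. The sketch below uses small hand-written lists in place of the real column; the values are taken from the NE_Words output shown earlier.

```python
from collections import Counter

# PERSON entity texts per lyric, as lists of strings (illustrative values)
person_entities = [
    ['Ghost', 'Ghost'],   # one lyric mentioning "Ghost" twice
    ['Uncle Sam'],        # a multi-word entity kept whole
    [],                   # a lyric with no PERSON entities
]

# Flatten the lists and count each whole entity string
entity_counts = Counter(ent for lyric in person_entities for ent in lyric)

# The most frequent entities, most common first
top_entities = entity_counts.most_common(10)
print(top_entities)
```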

# Download Enriched Dataset

To save the dataset of doc objects, text reductions and linguistic annotations generated with spaCy, download the final_lyric_df DataFrame to your local computer as a .csv file:

In [309]:
final_lyric_df.to_csv('MazzyStar_lyrics_with_spaCy_tags.csv')
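
One caveat worth knowing: a .csv file stores every cell as text, so list-valued columns such as Tokens come back as strings when the file is read again. A minimal sketch, using the standard library's csv module and a made-up single row rather than the real DataFrame, of how a stringified list can be turned back into a list with ast.literal_eval:

```python
import ast
import csv
import io

# A made-up row with a list-valued column
tokens = ['You', "'re", 'a', 'ghost']

# Write the row to an in-memory CSV; str() mimics how a list ends up as text
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['TITLE', 'Tokens'])
writer.writerow(['Ghost Highway', str(tokens)])

# Read the row back: the Tokens cell is now a plain string
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]['Tokens']

# Parse the string back into a Python list
restored = ast.literal_eval(raw)
print(type(raw).__name__, restored)
```

This only recovers plain Python values such as lists of strings; Doc objects would need to be recreated by running the nlp pipeline again.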