# The Beatles: Discography Lyrics Analysis

## DESCRIPTION:
This Jupyter Notebook, associated with "The Beatles: Discography Lyrics Analysis", explains how the data was processed for this project.

### IMPORT
First, we import all the libraries we need and download the spaCy English model.

In [12]:
import pandas as pd
import re
import os
import csv
import string
import seaborn as sns
import matplotlib.pyplot as plt
from textblob import TextBlob
import spacy
!spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### CONVERSION

The main corpus is The Beatles' discography: 13 folders (one per album), each containing .txt files (one per track). The Python code below converts these files into a single .csv file.

In [29]:
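def read_text_files(folder):
    """Read every .txt file in an album folder into a list of row dicts."""
    data = []
    for filename in os.listdir(folder):
        if filename.endswith(".txt"):
            with open(os.path.join(folder, filename), 'r', encoding='utf-8') as file:
                # The file name without its extension is the track title
                track = os.path.splitext(filename)[0]
                lyrics = file.read()
                # "id" counts tracks within this folder, so it restarts at 1 for each album
                data.append({"id": len(data) + 1, "album": os.path.basename(folder), "track": track, "lyrics": lyrics})
    return data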

def convert_to_csv(input_folders, output_csv):
    all_data = []
    for f in input_folders:
        folder_data = read_text_files(f)
        all_data.extend(folder_data)

    with open(output_csv, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ["id", "album", "track", "lyrics"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()

        for row in all_data:
            writer.writerow(row)

input_folders = [
    "texts/Please Please Me",
    "texts/Let It Be",
    "texts/Abbey Road",
    "texts/Yellow Submarine",
    "texts/The Beatles",
    "texts/Magical Mystery Tour",
    "texts/Sgt. Pepper's Lonely Hearts Club Band",
    "texts/Revolver",
    "texts/Rubber Soul",
    "texts/Help!",
    "texts/Beatles For Sale",
    "texts/A Hard Day's Night",
    "texts/With The Beatles"
]
output_csv = "clean_TheBeatles.csv"

convert_to_csv(input_folders, output_csv)
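
A quick way to sanity-check the converted file is to count the rows per album. The sketch below is only an illustration: `tracks_per_album` is a hypothetical helper (not part of the original notebook), and the demo rows are made up so the example runs without the real corpus.

```python
import csv
import os
import tempfile
from collections import Counter


def tracks_per_album(csv_path):
    """Count rows per album in a CSV that has an 'album' column."""
    with open(csv_path, newline='', encoding='utf-8') as csvfile:
        return Counter(row["album"] for row in csv.DictReader(csvfile))


# Demo on a throwaway file with made-up rows (not the real corpus)
with tempfile.TemporaryDirectory() as tmp:
    demo_csv = os.path.join(tmp, "demo.csv")
    with open(demo_csv, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=["id", "album", "track", "lyrics"])
        writer.writeheader()
        writer.writerow({"id": 1, "album": "Album A", "track": "Song 1", "lyrics": "la la\nla"})
        writer.writerow({"id": 2, "album": "Album A", "track": "Song 2", "lyrics": "na na"})
        writer.writerow({"id": 1, "album": "Album B", "track": "Song 3", "lyrics": "hey"})
    demo_counts = tracks_per_album(demo_csv)

print(demo_counts)
```

Run against `clean_TheBeatles.csv`, the same helper should report 13 albums if every folder was read.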

### SPACY ANNOTATION
First, spaCy processes the lyrics of each song, and the resulting `Doc` object (spaCy's container for an annotated text) is stored in a new 'Doc' column.

In [31]:
# Initialize spaCy
nlp = spacy.load('en_core_web_sm')
print(nlp.pipe_names)

df = pd.read_csv('clean_TheBeatles.csv')

# Use spaCy
def process_text(text):
    return nlp(text)

df['Doc'] = df['lyrics'].apply(process_text)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
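
For larger corpora, spaCy's `nlp.pipe` processes texts as a batch, which is generally faster than calling `nlp` on one row at a time. The following is a minimal sketch, not the notebook's method: it assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) and two made-up lyric strings so that it runs without the downloaded model. The names `nlp_demo` and `lyrics_demo` are hypothetical.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) keeps the sketch
# runnable without en_core_web_sm; the notebook would use its loaded `nlp`.
nlp_demo = spacy.blank("en")

# Made-up stand-ins for df['lyrics']
lyrics_demo = ["Love me do", "Let it be"]

# nlp.pipe yields one Doc per input text, in the same order
docs = list(nlp_demo.pipe(lyrics_demo))

for doc in docs:
    print([token.text for token in doc])
```

In the notebook, the equivalent would be something along the lines of `df['Doc'] = list(nlp.pipe(df['lyrics']))`.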


Next, the processed text is tokenized, which adds a 'tokens' column holding the list of tokens for each song's lyrics. The token lists are kept in the DataFrame and later saved to the CSV file.

In [33]:
def get_token(doc):
    return [token.text for token in doc]

df['tokens'] = df['Doc'].apply(get_token)

tokens = df[['lyrics', 'tokens']].copy()
tokens.head()

|   | lyrics | tokens |
|---|--------|--------|
| 0 | Last night I said these words to my girl, I kn... | [Last, night, I, said, these, words, to, my, g... |
| 1 | Sha la la la la la la la Sha la la la la la la... | [Sha, la, la, la, la, la, la, la, Sha, la, la,... |
| 2 | There, there's a place where I can go When I f... | [There, ,, there, 's, a, place, where, I, can,... |
| 3 | Well, shake it up, baby, now (shake it up, bab... | [Well, ,, shake, it, up, ,, baby, ,, now, (, s... |
| 4 | I been told when a boy kiss a girl Take a trip... | [I, been, told, when, a, boy, kiss, a, girl, T... |


The next code cell uses the processed text in the 'Doc' column to create columns with the lemmas, part-of-speech tags, named-entity labels, and named-entity texts. Each row corresponds to one song.

In [34]:
# Lemmatization
def get_lemma(doc):
    return [token.lemma_ for token in doc]

df['lemmas'] = df['Doc'].apply(get_lemma)

# Part-of-Speech: (coarse-grained tag, fine-grained tag) for each token
def get_pos(doc):
    return [(token.pos_, token.tag_) for token in doc]

df['POS'] = df['Doc'].apply(get_pos)

# Named Entities
def extract_named_entities(doc):
    return [ent.label_ for ent in doc.ents]

df['Named_Entities'] = df['Doc'].apply(extract_named_entities)

# Named Entities Words
def extract_ne_words(doc):
    return [ent for ent in doc.ents]

df['NE_Words'] = df['Doc'].apply(extract_ne_words)
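
Once the entity labels are available as one list per song, they can be tallied across the whole corpus. The sketch below is an illustration under stated assumptions: `named_entities_demo` is a made-up stand-in for `df['Named_Entities']`, and `count_entity_labels` is a hypothetical helper that is not part of the original notebook.

```python
from collections import Counter

# Hypothetical stand-in for df['Named_Entities']: one list of labels per song
named_entities_demo = [["PERSON", "DATE"], [], ["PERSON", "GPE", "PERSON"]]


def count_entity_labels(rows):
    """Tally entity labels across all songs."""
    counts = Counter()
    for labels in rows:
        counts.update(labels)
    return counts


label_counts = count_entity_labels(named_entities_demo)
print(label_counts.most_common())
```

In the notebook, `count_entity_labels(df['Named_Entities'])` would give the same kind of tally for the real lyrics.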

The last step saves the final DataFrame to a new CSV file and displays its first rows.

In [35]:
# Save into a new .csv file
df.to_csv('TheBeatles_annotation.csv')

# Show the result (must be the last expression in the cell to be displayed)
df.head()
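
One thing to keep in mind when reloading the saved file: a CSV cell can only hold text, so list-valued columns such as 'tokens' come back as strings rather than lists. The sketch below illustrates the round trip with the standard `csv` module and a made-up row. pandas' `to_csv` likewise writes a list cell as its string form, and `ast.literal_eval` is one way to parse it back.

```python
import ast
import csv
import io

# Made-up row standing in for one song; the list mimics the 'tokens' column
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["track", "tokens"])
writer.writerow(["Demo Song", ["Well", ",", "shake", "it", "up"]])

# Read the row back, as a later analysis step would
buffer.seek(0)
row = next(csv.DictReader(buffer))

raw_tokens = row["tokens"]              # a plain string after the round trip
tokens = ast.literal_eval(raw_tokens)   # parsed back into a list of strings

print(type(raw_tokens).__name__, type(tokens).__name__)
```

With pandas, the same parsing can be applied on load via `pd.read_csv('TheBeatles_annotation.csv', converters={'tokens': ast.literal_eval})`. The 'Doc' and 'NE_Words' columns hold spaCy objects, which are saved only as their text and cannot be restored this way.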