## Who Said It - Taylor Swift or William Shakespeare? 
A fun challenge for Taylor Swift fans is to see if they can identify whether a line of text is a Taylor Swift lyric or a line from Shakespeare. It is surprisingly difficult, so I want to see if this task is hard for AI as well. 
The following notebook implements a basic machine learning project to determine if the words were written by Taylor Swift or William Shakespeare. 

Time to find out if the AI is a Swiftie!

In [1]:
import pandas as pd
import numpy as np
import spacy
from string import punctuation

from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

nlp = spacy.load('en_core_web_sm')

### Pre-Process the Data

[Taylor Lyrics Dataset - Full Corpus](https://github.com/sagesolar/Corpus-of-Taylor-Swift/blob/main/lyrics/flat-song-lyrics.json)

[Shakespeare Lines Dataset - All Plays](https://www.kaggle.com/datasets/kingburrito666/shakespeare-plays)

Taylor will be class 1 and Shakespeare will be class 0 since the models prefer numerical targets.

In [2]:
stopwords = nlp.Defaults.stop_words
# define a quick preprocessing function
# remove any tokens that are stopwords, punctuation, digits, or 3 characters or fewer
def spacy_preprocessing(text):
    text = text.lower()

    tokens = [token.text for token in nlp(text)]
    tokens = [t for t in tokens if
              t not in stopwords and 
              t not in punctuation and 
              len(t) > 3]
    tokens = [t for t in tokens if not t.isdigit()]

    return " ".join(tokens)
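
To get a feel for what this filter keeps without loading spaCy, here is a rough stand-in. It is only a sketch: the regex tokenizer, the name `simple_preprocessing`, and the tiny `TOY_STOPWORDS` set are my own assumptions, not what the notebook runs (spaCy's tokenizer and stopword list are much more thorough). The sample sentence is made up.

```python
import re
from string import punctuation

# Hypothetical mini stopword list; the notebook uses spaCy's much larger set.
TOY_STOPWORDS = {"the", "and", "that", "then", "until", "with"}

def simple_preprocessing(text, stopwords=TOY_STOPWORDS):
    """Rough stand-in for spacy_preprocessing using a regex tokenizer."""
    # \w+ grabs runs of word characters; [^\w\s] grabs single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # same filters as the notebook: stopwords, punctuation, 3 chars or fewer
    tokens = [t for t in tokens
              if t not in stopwords
              and t not in punctuation
              and len(t) > 3]
    # drop pure-digit tokens such as years
    tokens = [t for t in tokens if not t.isdigit()]
    return " ".join(tokens)

print(simple_preprocessing("We danced until 1989, then the rain fell!"))
# danced rain fell
```

Note that the `len(t) > 3` check also throws away every three-letter word, so short but meaningful words like "not" or "sun" never reach the model.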

In [3]:
# read in Taylor lyrics
tay_df = pd.read_json('flat-song-lyrics.json', orient = 'index')
tay_df = tay_df.reset_index()

# rename the columns for clarity
tay_df.columns = ['key', 'text']
# class label
tay_df['source'] = 1

# drop column, remove duplicate lyrics
tay_df = tay_df.drop('key', axis=1)
tay_df = tay_df.drop_duplicates(subset='text')

# preprocess lyrics
tay_df['clean_text'] = tay_df['text'].apply(spacy_preprocessing)
# remove records with 2 or fewer words
tay_df = tay_df[tay_df['clean_text'].str.split().str.len() > 2]

In [4]:
# read in shakespeare lines
shakespeare_df = pd.read_csv('Shakespeare_data.csv')

# keep only lines longer than 40 characters to filter out headers like ACT and SCENE
# not a perfect heuristic, but it should suffice
shakespeare_df['text_len'] = shakespeare_df['PlayerLine'].str.len()
shakespeare_df = shakespeare_df[shakespeare_df['text_len'] > 40]

# randomly sample 5k records... this dataset is huge and we don't want imbalanced classes
# or to waste time pre-processing records we won't use
sampled_df = shakespeare_df.sample(n=5000, random_state = 42)
# class label
sampled_df['source'] = 0
# rename columns 
sampled_df.rename(columns={'PlayerLine': 'text'}, inplace=True)
sampled_df = sampled_df[['text', 'source']]

# preprocess lines
sampled_df['clean_text'] = sampled_df['text'].apply(spacy_preprocessing)
# remove records with 2 or fewer words
sampled_df = sampled_df[sampled_df['clean_text'].str.split().str.len() > 2]

In [5]:
# handle potential imbalance 
min_class_size = min(len(tay_df), len(sampled_df))

# resample both datasets to have the same number of records as the smaller class
tay_df_balanced = tay_df.sample(n=min_class_size, random_state=42)
shakespeare_df_balanced = sampled_df.sample(n=min_class_size, random_state=42)

# concat datasets
df = pd.concat([tay_df_balanced, shakespeare_df_balanced])
df['source'].value_counts()

source
1    3824
0    3824
Name: count, dtype: int64

### Training Some Classifiers
#### Split Data

In [6]:
# shuffle and 80-20 train-test split
# stratified train-test split to ensure class balance
dataset = df[['clean_text', 'source']].sample(frac=1, random_state=42)

X = dataset['clean_text'].to_list()
y = dataset['source'].to_list()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# convert the label lists to numpy arrays
y_train = np.array(y_train)
y_test = np.array(y_test)

# sanity check 
np.unique(y_train, return_counts=True)

(array([0, 1]), array([3059, 3059]))

#### Multinomial Naive Bayes

In [7]:
# vectorize inside the pipeline so the vectorizer is re-fit on each
# cross-validation training fold during the grid search
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])


param_grid = {
    'classifier__alpha': [0.01, 0.1, 0.5, 1, 2, 5, 10],
    'classifier__fit_prior': [True, False],
    'vectorizer__use_idf': [True, False],
    'vectorizer__ngram_range': [(1, 1), (1, 2)],  # unigrams and bigrams
    'vectorizer__sublinear_tf': [True, False],   #  sublinear tf scaling
}

# grid search with cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1')
grid_search.fit(X_train, y_train)

# the best parameters and the best score
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation score: ", grid_search.best_score_)

# eval. on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))

Best parameters found:  {'classifier__alpha': 0.01, 'classifier__fit_prior': True, 'vectorizer__ngram_range': (1, 2), 'vectorizer__sublinear_tf': True, 'vectorizer__use_idf': False}
Best cross-validation score:  0.8738351614375544
              precision    recall  f1-score   support

           0       0.90      0.85      0.88       765
           1       0.86      0.90      0.88       765

    accuracy                           0.88      1530
   macro avg       0.88      0.88      0.88      1530
weighted avg       0.88      0.88      0.88      1530
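
To actually play the game with a fitted pipeline, a small helper along these lines could work. This is a sketch under a few assumptions: `who_said_it` and `LABELS` are names I made up, the training lines are invented toy data (not from either dataset), and the default `preprocess` is just lowercasing. In the notebook you would pass `best_model` and `spacy_preprocessing` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# same class encoding as the notebook: 0 = Shakespeare, 1 = Taylor Swift
LABELS = {0: "William Shakespeare", 1: "Taylor Swift"}

def who_said_it(model, line, preprocess=str.lower):
    """Clean a raw line the same way as the training data, then classify it."""
    cleaned = preprocess(line)
    return LABELS[int(model.predict([cleaned])[0])]

# invented toy lines, only here so the example runs on its own
toy_texts = [
    "thou art fair noble lord",
    "wherefore dost thou weep",
    "dancing kitchen radio tonight",
    "drive downtown friday night",
]
toy_labels = [0, 0, 1, 1]

toy_model = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", MultinomialNB()),
]).fit(toy_texts, toy_labels)

print(who_said_it(toy_model, "Thou art noble"))
print(who_said_it(toy_model, "radio tonight"))
```

With the real model the call would look like `who_said_it(best_model, some_line, preprocess=spacy_preprocessing)`, so new text goes through the same cleaning as the training data.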



#### Support Vector Machine

In [8]:
# same steps as above, just for SVM
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LinearSVC()) 
])

param_grid_svm = {
    'classifier__C': [0.01, 0.1, 1, 10],  # Regularization parameter
    'classifier__max_iter': [1000, 5000, 10000],  # iterations for convergence
    'vectorizer__use_idf': [True, False],
    'vectorizer__ngram_range': [(1, 1), (1, 2)],  # unigrams and bigrams
    'vectorizer__min_df': [1, 5, 10],
    'vectorizer__sublinear_tf': [True, False]
}


grid_search_svm = GridSearchCV(pipeline, param_grid_svm, cv=5, scoring='f1')
grid_search_svm.fit(X_train, y_train)

print("Best parameters for Linear SVM: ", grid_search_svm.best_params_)
print("Best cross-validation score (LinearSVM): ", grid_search_svm.best_score_)


best_model_svm = grid_search_svm.best_estimator_
y_pred_svm = best_model_svm.predict(X_test)
print("LinearSVC classification report:")
print(classification_report(y_test, y_pred_svm))


Best parameters for Linear SVM:  {'classifier__C': 10, 'classifier__max_iter': 1000, 'vectorizer__min_df': 1, 'vectorizer__ngram_range': (1, 2), 'vectorizer__sublinear_tf': False, 'vectorizer__use_idf': True}
Best cross-validation score (LinearSVM):  0.8595364111159437
LinearSVC classification report:
              precision    recall  f1-score   support

           0       0.87      0.83      0.85       765
           1       0.84      0.88      0.86       765

    accuracy                           0.85      1530
   macro avg       0.85      0.85      0.85      1530
weighted avg       0.85      0.85      0.85      1530
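
A quick aside on cost: the number of models a grid search trains is the product of the option counts times the number of folds. The sketch below just does that arithmetic for the two grids above (the helper name `n_fits` is mine, and the count ignores the one extra refit on the full training set that `GridSearchCV` does by default).

```python
from math import prod

# the two grids from the notebook, copied as plain dicts
nb_grid = {
    'classifier__alpha': [0.01, 0.1, 0.5, 1, 2, 5, 10],
    'classifier__fit_prior': [True, False],
    'vectorizer__use_idf': [True, False],
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'vectorizer__sublinear_tf': [True, False],
}
svm_grid = {
    'classifier__C': [0.01, 0.1, 1, 10],
    'classifier__max_iter': [1000, 5000, 10000],
    'vectorizer__use_idf': [True, False],
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'vectorizer__min_df': [1, 5, 10],
    'vectorizer__sublinear_tf': [True, False],
}

def n_fits(param_grid, cv=5):
    """Return (parameter combinations, cross-validation fits)."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * cv

print(n_fits(nb_grid))   # (112, 560)
print(n_fits(svm_grid))  # (288, 1440)
```

Since `max_iter` is only a cap on solver iterations, the three values likely give identical models whenever the solver converges early. Fixing it at one generous value would probably cut the SVM search to a third of the fits without changing the result.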



### Results 
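Turns out it's even somewhat difficult for AI to distinguish Taylor Swift lyrics from Shakespeare lines: the best model (Multinomial Naive Bayes) reaches 88% accuracy on the test set and the Linear SVM reaches 85%, which definitely leaves room for improvement! Both models have higher recall on Taylor Swift lyrics (0.90 and 0.88) than on Shakespeare lines (0.85 and 0.83), but lower precision (0.86 and 0.84 versus 0.90 and 0.87). In other words, they lean slightly toward guessing Taylor Swift rather than being clearly better at one class.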
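
One way to check which direction the errors go is a confusion matrix. The snippet below is a minimal sketch on invented labels (not the notebook's predictions), just to show how the four counts map to precision and recall for the Taylor Swift class.

```python
from sklearn.metrics import confusion_matrix

# invented labels for illustration only: 0 = Shakespeare, 1 = Taylor Swift
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 1, 1]

# rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision_1 = tp / (tp + fp)  # of the lines predicted Taylor, how many were Taylor
recall_1 = tp / (tp + fn)     # of the real Taylor lines, how many were found

print(tn, fp, fn, tp)         # 3 1 0 4
print(precision_1, recall_1)  # 0.8 1.0
```

In this toy case one Shakespeare line is mistaken for Taylor Swift and nothing goes the other way, which is the "high recall, lower precision" pattern. Running `confusion_matrix(y_test, y_pred)` in the notebook would give the real counts.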


The accuracy of these models could be improved by trying a number of different things: 
* Other classifiers 
* Different sampling of the data
* Additional pre-processing techniques, like lemmatization
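
As a starting point for the first idea, other classifiers can be dropped into the same pipeline shape. This is only a sketch: the two candidates (`LogisticRegression` and `ComplementNB`) are my picks rather than anything tested in this notebook, and the training lines are invented toy data so the snippet runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline

# candidate classifiers to try (an assumption, not results from this notebook)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "complement_nb": ComplementNB(),
}

# invented toy lines: 0 = Shakespeare, 1 = Taylor Swift
toy_texts = [
    "thou art fair noble lord",
    "wherefore dost thou weep",
    "dancing kitchen radio tonight",
    "drive downtown friday night",
]
toy_labels = [0, 0, 1, 1]

fitted = {}
for name, clf in candidates.items():
    # same two-step pipeline as above, only the classifier changes
    pipe = Pipeline([
        ("vectorizer", TfidfVectorizer()),
        ("classifier", clf),
    ])
    fitted[name] = pipe.fit(toy_texts, toy_labels)
    print(name, pipe.predict(["thou art noble"]))
```

For a real comparison, each pipeline would be fit on `X_train` and `y_train` (ideally inside a `GridSearchCV` with its own parameter grid) and scored on the same test split.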

However, I'll leave it here for now to keep this notebook relatively short. 