The songwriting collaboration between John Lennon and Paul McCartney was one of the most productive in music history.  Unlike many partnerships in which one person wrote the lyrics and the other wrote the music, Lennon and McCartney each composed both, and they agreed that any song either of them wrote would be credited to both.  Early in their partnership, many of their songs were true collaborations.  Later on, however, they often worked separately, with little to no input from the other.

Because the Beatles have been reported on so extensively over the years, it is generally known whether a given Lennon-McCartney song was a true collaboration, primarily (or entirely) written by Lennon, or primarily (or entirely) written by McCartney.

However, there are several disputed songs where both Lennon and McCartney at times claimed to be the sole (or primary) composer.

We will use cosine similarity to determine if *In My Life* (disputed) is most similar to *From Me to You* (collaboration, not disputed), *Strawberry Fields* (Lennon, not disputed) or *Penny Lane* (McCartney, not disputed).

I started by looking at the text of Strawberry Fields, which we know was written by John Lennon.  We can actually copy the lyrics to the entire song (removing punctuation and capitals from the first words of sentences) as a string and then convert that string into a data frame.

---



In [None]:
import pandas as pd

#Strawberry Fields - John Lennon (not disputed)

Strawberry_ = "let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever living is easy with eyes closed misunderstanding all you see its getting hard to be someone but it all works out it doesnt matter much to me let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever no one I think is in my tree I mean it must be high or low that is you cant you know tune in but its all right that is I think its not too bad let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever always no sometimes think but you know I know when it's a dream I think er no I mean er yes but its all wrong that is I think I disagree let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever Strawberry Fields forever Strawberry Fields forever"
Strawberry_df = pd.DataFrame({'words' : Strawberry_.split()})  # split() breaks the string on whitespace into individual words
Strawberry_df.head()
# Each word in the lyrics is now one row of the data frame

Unnamed: 0,words
0,let
1,me
2,take
3,you
4,down
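
The manual cleanup described above (removing punctuation and sentence-initial capitals) could also be scripted. The following is a minimal sketch, not the method used in this notebook: `normalize_lyrics` is a hypothetical helper, and it assumes that lowercasing every word is acceptable. That differs from the hand-cleaned strings here, where words such as `Strawberry` and `I` keep their capitals, so the resulting counts would not match the tables in this notebook exactly.

```python
import string


def normalize_lyrics(raw):
    """Lowercase the text, drop ASCII punctuation, and split on whitespace.

    Hypothetical helper for illustration; it lowercases everything,
    unlike the hand-cleaned lyric strings used in this notebook.
    """
    # With three arguments, str.maketrans maps every character in the
    # third argument to None, so translate() deletes it.
    table = str.maketrans("", "", string.punctuation)
    return raw.lower().translate(table).split()


sample = "Let me take you down, 'cause I'm going to Strawberry Fields."
tokens = normalize_lyrics(sample)
print(tokens)
```

Applying one scripted rule to every song would keep the cleaning consistent across songs, which matters because the similarity score depends on exact string matches between words.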


The way we are going to determine if two songs are similar is by comparing how frequently words appear in each song. We can make a frequency table to determine how many times each word appears in Strawberry Fields.

In [None]:
# Let's build a term-frequency table
straw_freq = pd.crosstab(index=Strawberry_df['words'], columns='count')  # one row per unique word; the single 'count' column holds its frequency
straw_freq
# This shows how many times each word appears in the lyrics

col_0,count
words,Unnamed: 1_level_1
Fields,10
I,8
Im,4
Strawberry,10
a,1
...,...
with,1
works,1
wrong,1
yes,1
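
As a cross-check on the `crosstab` result, the same kind of count can be produced in other ways. This sketch uses a short toy word list rather than the full lyrics and compares `collections.Counter` with the pandas `value_counts` method. The variable names (`toy_words`, `counter_freq`, `pandas_freq`) are illustrative only and do not appear elsewhere in the notebook.

```python
from collections import Counter

import pandas as pd

# Toy word list (an assumption for illustration, not the full lyrics).
toy_words = "let me take you down cause Im going to Strawberry Fields".split()
toy_words += ["Strawberry", "Fields", "forever"]

# Option 1: count with the standard library, then wrap in a Series.
counter_freq = pd.Series(Counter(toy_words), name="count").sort_index()

# Option 2: let pandas do the counting directly.
pandas_freq = pd.Series(toy_words).value_counts().sort_index()

print(counter_freq)
```

Both options treat `Strawberry` and `strawberry` as different words, just as `crosstab` does, so consistent capitalization in the input strings still matters.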


Now we do the same with Penny Lane, a song we know was written by McCartney.

In [None]:
import pandas as pd

#Penny Lane - Paul McCartney (not disputed)

Lane_ = "in Penny Lane there is a barber showing photographs of every head hes had the pleasure to know and all the people that come and go stop and say hello on the corner is a banker with a motorcar and little children laugh at him behind his back and the banker never wears a mac in the pouring rain very strange Penny Lane is in my ears and in my eyes there beneath the blue suburban skies I sit and meanwhile back in Penny Lane there is a fireman with an hourglass and in his pocket is a portrait of the Queen he likes to keep his fire engine clean its a clean machine Penny Lane is in my ears and in my eyes a four of fish and finger pies in summer meanwhile back behind the shelter in the middle of the roundabout the pretty nurse is selling poppies from a tray and though she feels as if shes in a play ahe is anyway in Penny Lane the barber shaves another customer we see the banker sitting waiting for a trim and then the fireman rushes in from the pouring rain very strange Penny Lane is in my ears and in my eyes there beneath the blue suburban skies I sit and meanwhile back Penny Lane is in my ears and in my eyes there beneath the blue suburban skies Penny Lane"
Lane_df = pd.DataFrame({'words' : Lane_.split()})
Lane_df
Lane_freq = pd.crosstab(index = Lane_df['words'], columns = 'count')
Lane_freq

col_0,count
words,Unnamed: 1_level_1
I,2
Lane,8
Penny,8
Queen,1
a,11
...,...
very,2
waiting,1
we,1
wears,1


Here I concatenate the two data sets so that there is one row for each word that appears in either song and one column for each song that counts how many times that word appears in the song's lyrics.

In [None]:
# Compare Strawberry Fields to Penny Lane

from numpy import dot
from numpy.linalg import norm

dfs = [straw_freq, Lane_freq]
all_words = pd.concat(dfs, axis= 1)
all_words[:50]

# This shows every word and how often it appears in each song
# For example, the word "I" appears 8 times in Strawberry Fields and 2 times in Penny Lane


col_0,count,count
words,Unnamed: 1_level_1,Unnamed: 2_level_1
Fields,10.0,
I,8.0,2.0
Im,4.0,
Strawberry,10.0,
a,1.0,11.0
about,4.0,
all,4.0,1.0
always,1.0,
and,4.0,15.0
bad,1.0,


I renamed the first column so we know it is the word count from Strawberry Fields and the second column so we know it is the word count from Penny Lane.  

Also, I changed the NaNs present to 0s 

In [None]:
all_words = all_words.fillna(0)
all_words.columns = ['Strawberry', 'Penny_Lane']
all_words

Unnamed: 0_level_0,Strawberry,Penny_Lane
words,Unnamed: 1_level_1,Unnamed: 2_level_1
Fields,10.0,0.0
I,8.0,2.0
Im,4.0,0.0
Strawberry,10.0,0.0
a,1.0,11.0
...,...,...
trim,0.0,1.0
very,0.0,2.0
waiting,0.0,1.0
we,0.0,1.0
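
The `NaN` values arise because `pd.concat(..., axis=1)` aligns the two frequency tables on their word index and leaves a gap wherever a word occurs in only one song. The toy example below sketches that behavior with two made-up word counts (`song_a` and `song_b` are hypothetical and not taken from any lyric in this notebook).

```python
import pandas as pd

# Made-up word counts for two imaginary songs (illustration only).
song_a = pd.Series({"love": 3, "you": 2, "me": 1}, name="song_a")
song_b = pd.Series({"you": 4, "rain": 1}, name="song_b")

# Column-wise concat takes the union of the two indexes;
# a word missing from one song becomes NaN in that song's column.
table = pd.concat([song_a, song_b], axis=1)

# A missing word means the word occurred zero times, so 0 is the right fill.
filled = table.fillna(0).astype(int)
print(filled)
```

Filling with 0 is what makes the two columns usable as vectors of equal length, one component per word in the combined vocabulary.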


Now we have two numeric vectors that represent the word frequencies of each song, and we can compare them using cosine similarity.

In [None]:
# cos_sim = dot(Strawberry Fields, Penny Lane) / (norm(Strawberry Fields) * norm(Penny Lane))
dot(all_words['Strawberry'], all_words["Penny_Lane"]) / (norm(all_words['Strawberry']) * norm(all_words["Penny_Lane"]))

0.21590157172853788

I used this value (cosine similarity = 0.22) as a baseline.  It is the similarity between two songs that were written by close collaborators but that we know were not written by the same individual.
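
To make the calculation explicit: cosine similarity is the dot product of the two count vectors divided by the product of their lengths (Euclidean norms). With non-negative word counts it ranges from 0 (no words in common) to 1 (identical word proportions). The sketch below reimplements it with only the standard library; `cosine_similarity` is a hypothetical helper written for illustration, not the NumPy code used in this notebook, and the test strings are toy inputs.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two strings."""
    counts_a = Counter(text_a.split())
    counts_b = Counter(text_b.split())

    # Only words present in both texts contribute to the dot product;
    # every other term is multiplied by zero.
    shared = counts_a.keys() & counts_b.keys()
    dot_product = sum(counts_a[w] * counts_b[w] for w in shared)

    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty input
    return dot_product / (norm_a * norm_b)


same = cosine_similarity("love me do", "love me do")      # identical texts
disjoint = cosine_similarity("love me do", "penny lane")  # no shared words
partial = cosine_similarity("a a b", "a b b")             # 4 / (sqrt(5) * sqrt(5))
print(same, disjoint, partial)
```

Because the score divides by both norms, a longer song is not automatically "more similar" to everything; only the relative proportions of words matter.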

Let's load in two more songs: From Me to You (collaboration, not disputed) and In My Life (the disputed song).

In [None]:
#From Me to You - Lennon and McCartney (not disputed)

Me_ = "if there's anything that you want if there's anything I can do just call on me and Ill send it along with love from me to you Ive got everything that you want like a heart thats oh so true just call on me and Ill send it along with love from me to you Ive got arms that long to hold you and keep you by my side Ive got lips that long to kiss you and keep you satisfied oh if theres anything that you want if theres anything I can do just call on me and Ill send it along with love from me to you from me to you just call on me and Ill send it along with love from me to you Ive got arms that long to hold you and keep you by my side Ive got lips that long to kiss you and keep you satisfied oh if theres anything that you want if theres anything I can do just call on me and Ill send it along with love from me to you to you to you to you"

Me_df = pd.DataFrame({"Words": Me_.split()})

Me_df_freq = pd.DataFrame(pd.crosstab(index=Me_df['Words'],columns='count'))

Me_df_freq[0:50]

col_0,count
Words,Unnamed: 1_level_1
I,3
Ill,5
Ive,5
a,1
along,5
and,9
anything,6
arms,2
by,2
call,5


In [None]:

#In My Life - Lennon or McCartney (disputed)

Life_ = "there are places Ill remember all my life though some have changed some forever, not for better some have gone and some remain all these places had their moments with lovers and friends I still can recall some are dead and some are living in my life Ive loved them all but of all these friends and lovers there is no one compares with you and these memories lose their meaning when I think of love as something new though I know Ill never lose affection for people and things that went before I know Ill often stop and think about them in my life Ill love you more though I know Ill never lose affection for people and things that went before I know Ill often stop and think about them in my life Ill love you more in my life Ill love you more"

Life_df = pd.DataFrame({"Words": Life_.split()})

Life_df_freq = pd.DataFrame(pd.crosstab(index=Life_df['Words'],columns='count'))

Life_df_freq[0:50]



col_0,count
Words,Unnamed: 1_level_1
I,6
Ill,8
Ive,1
about,2
affection,2
all,4
and,9
are,3
as,1
before,2


In [None]:
# Compare Penny Lane(McCartney) to In My Life(Dispute)

from numpy import dot
from numpy.linalg import norm

dfs = [Lane_freq, Life_df_freq]

all_words = pd.concat(dfs, axis=1)
all_words = all_words.fillna(0)
all_words.columns = ["Penny Lane", "In My Life"]
all_words[0:50]


cos_sim = dot(all_words["Penny Lane"], all_words["In My Life"])/(norm(all_words["Penny Lane"])*norm(all_words["In My Life"]))

print(cos_sim)

0.35376193243365495


In [None]:
# Compare Strawberry Fields(Lennon) to In My Life(Dispute)

from numpy import dot
from numpy.linalg import norm

dfs = [straw_freq, Life_df_freq]

all_words = pd.concat(dfs, axis=1)
all_words = all_words.fillna(0)
all_words.columns = ["Strawberry Fields", "In My Life"]
all_words[0:50]


cos_sim = dot(all_words["Strawberry Fields"], all_words["In My Life"])/(norm(all_words["Strawberry Fields"])*norm(all_words["In My Life"]))

print(cos_sim)





0.29313341167173035


In [None]:
# Compare In My Life(Dispute) to From Me to You(Collab)

from numpy import dot
from numpy.linalg import norm

dfs = [Life_df_freq, Me_df_freq]

all_words = pd.concat(dfs, axis=1)
all_words = all_words.fillna(0)
all_words.columns = ["In My Life", "From Me to You"]
all_words[0:50]


cos_sim = dot(all_words["In My Life"], all_words["From Me to You"])/(norm(all_words["In My Life"])*norm(all_words["From Me to You"]))

print(cos_sim)

0.33677745450264224
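
The three comparison cells above repeat the same steps with different inputs. One possible refactoring, sketched here under assumptions, wraps those steps in a function and loops over every pair of songs. The function name `lyric_similarity` and the toy lyric strings are hypothetical; swapping in the real lyric strings would be needed to compare against the numbers printed above.

```python
from itertools import combinations

import pandas as pd
from numpy import dot
from numpy.linalg import norm


def lyric_similarity(lyrics_a, lyrics_b):
    """Cosine similarity of two lyric strings, following the notebook's steps."""
    freq_a = pd.Series(lyrics_a.split()).value_counts()
    freq_b = pd.Series(lyrics_b.split()).value_counts()
    # Align on the word index and treat missing words as zero counts.
    table = pd.concat([freq_a, freq_b], axis=1).fillna(0)
    a, b = table.iloc[:, 0], table.iloc[:, 1]
    return float(dot(a, b) / (norm(a) * norm(b)))


# Toy stand-ins for the real lyrics (illustration only).
songs = {
    "toy_song_1": "la la la love you",
    "toy_song_2": "la la love me do",
    "toy_song_3": "rain rain go away",
}

# Score every unordered pair of songs once.
scores = {
    (x, y): lyric_similarity(songs[x], songs[y])
    for x, y in combinations(songs, 2)
}
for pair, score in scores.items():
    print(pair, round(score, 2))
```

Keeping the logic in one function avoids reusing the `all_words` and `cos_sim` names across cells, which makes it harder to read a stale value by accident.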


**Compared:** Strawberry Fields (Lennon) to Penny Lane (McCartney) = 22%

**Compared:** In My Life (Dispute) to Penny Lane (McCartney) = 35%

**Compared:** In My Life (Dispute) to Strawberry Fields (Lennon) = 29%

**Compared:** In My Life (Dispute) to From Me to You (Collab) = 34%

The cosine similarity between In My Life and each of the three other songs is higher than the cosine similarity between Strawberry Fields and Penny Lane.

It is highest between In My Life and Penny Lane, followed by From Me to You.  In My Life is least similar to Strawberry Fields.

From the Wikipedia article about the Lennon-McCartney collaboration: In 1977, when shown a list of songs Lennon claimed writing on for the magazine Hit Parader, McCartney disputed only "In My Life". Lennon said that McCartney helped only with "the middle eight" (a short section) of the song. McCartney said that he wrote the entire melody, taking inspiration from Smokey Robinson songs.

Perhaps this analysis gives additional evidence that McCartney really did write all of In My Life.