
## Objectives:
- define a vector and calculate a vector length and dot product
- define a matrix and calculate a matrix dot product, transpose, and inverse
- explain cosine similarity and compute the similarity between two vectors
- use linear algebra to solve for linear regression coefficients

# Use the following information to answer the assignment questions 1) - 11).

### Is head size related to brain weight in healthy adult humans?

The Brainhead.csv dataset provides information on 237 individuals who were subject to post-mortem examination at the Middlesex Hospital in London around the turn of the 20th century. The study authors used cadavers to see whether a relationship could be determined between brain weight and other, more easily measured physiological characteristics such as age, sex, and head size. The end goal was to develop a way to estimate a person’s brain size while they were still alive (as the living aren’t keen on having their brains taken out and weighed).

**We wish to determine if we can improve on our model of the linear relationship between head size and brain weight in healthy human adults.**

Source: R.J. Gladstone (1905). "A Study of the Relations of the Brain to the Size of the Head", Biometrika, Vol. 4, pp105-123.

In [1]:
#Import the Brainhead.csv dataset from a URL and print the first few rows

import pandas as pd
import numpy as np


data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Brainhead/Brainhead.csv'

df = pd.read_csv(data_url, skipinitialspace=True, header=0)

df.head()

Unnamed: 0,Gender,Age,Head,Brain
0,1,1,4512,1530
1,1,1,3738,1297
2,1,1,4261,1335
3,1,1,3777,1282
4,1,1,4177,1590


1) Store the response variable - brain weight - as a matrix called Y.

In [None]:
### YOUR CODE HERE ###

Y = np.array(df['Brain']).reshape(-1, 1)
print(Y)

2) Store the explanatory variable - head size - as a matrix called X.  Don't forget to include the column of 1s for the intercept term.

In [None]:

### YOUR CODE HERE ###

ones= np.repeat(1, len(df)).reshape(-1,1)

head= np.array(df['Head']).reshape(-1, 1)

X = np.concatenate((ones, head), axis= 1)

print(X)



3) Calculate $X^T$.  Explain what the transpose of a matrix is.

In [None]:
### YOUR CODE HERE ###
print(X)
print()
print()
X_T = np.transpose(X)
print(X_T)

Answer: 
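The transpose of a matrix swaps its rows and columns: row i of $X$ becomes column i of $X^T$, so an m x n matrix becomes an n x m matrix. Equivalently, the main diagonal stays fixed and every other entry is reflected across it, so the entry in row i, column j moves to row j, column i. Here $X$ is 237 x 2, so $X^T$ is 2 x 237.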
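
As a small illustration of the definition above, the sketch below transposes a toy 3 x 2 matrix. The name `X_demo` and its values are made up for this example and are not part of the Brainhead data.

```python
import numpy as np

# Toy design matrix (hypothetical values): a column of 1s plus one predictor column.
X_demo = np.array([[1, 10],
                   [1, 20],
                   [1, 30]])

# .T gives the same result as np.transpose(X_demo).
X_demo_T = X_demo.T

print(X_demo.shape)    # (3, 2)
print(X_demo_T.shape)  # (2, 3)

# The entry in row i, column j of the original sits in row j, column i of the transpose.
print(X_demo[2, 1] == X_demo_T[1, 2])  # True
```

The shapes swap from (3, 2) to (2, 3), which is the same pattern as the 237 x 2 matrix and its 2 x 237 transpose in this question.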

4) Use matrix multiplication to calculate $X^TX$

In [18]:
# checking the 'shape' of the arrays
print('X_T:')
print(len(X_T))
print(len(X_T[0]))
print()
print('X:')
print(len(X))
print(len(X[0]))

X_T:
2
237

X:
237
2


In [12]:
### YOUR CODE HERE ###

X_T_X = np.matmul(X_T, X)
X_T_X

array([[       237,     861256],
       [    861256, 3161283190]])

5) Calculate $(X^TX)^{-1}$.  Explain what the inverse of a matrix is.

In [20]:
### YOUR CODE HERE ###

X_T_X_inv= np.linalg.inv(X_T_X)
print(X_T_X_inv)

[[ 4.23638519e-01 -1.15415543e-04]
 [-1.15415543e-04  3.17599920e-08]]


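Answer: The inverse of a square matrix $A$ is the matrix $A^{-1}$ that satisfies $AA^{-1} = A^{-1}A = I$, where $I$ is the identity matrix. It plays the role that the reciprocal 1/a plays for an ordinary number, but it is not computed by taking the reciprocal of each entry, and it exists only when $A$ is square and its determinant is nonzero.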
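
The defining property of the inverse can be checked numerically. The sketch below uses a made-up 2 x 2 matrix `A` (not taken from the assignment data) and also shows that the element-wise reciprocal is a different thing.

```python
import numpy as np

# Hypothetical invertible matrix (determinant = 2*3 - 1*1 = 5).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

A_inv = np.linalg.inv(A)

# A times its inverse is the identity matrix, up to floating-point rounding.
print(np.allclose(A @ A_inv, np.eye(2)))  # True

# The element-wise reciprocal 1 / A is not the matrix inverse.
print(np.allclose(1 / A, A_inv))  # False
```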

6) Use matrix multiplication to calculate $X^TY$.

In [21]:
### YOUR CODE HERE ###

X_T_Y = np.matmul(X_T, Y)
X_T_Y

array([[    304041],
       [1113176805]])

7) Use your previous results to calculate the values of the slope and intercept using the formula $$ B = (X^TX)^{-1}X^TY$$

In [29]:
### YOUR CODE HERE ###

# B = inverse(X_Transpose * X) * (X_Transpose * Y)

B= np.matmul(X_T_X_inv, X_T_Y)
print(B)

[[3.25573421e+02]
 [2.63429339e-01]]
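
The formula can be sanity-checked on data where the answer is known in advance. The sketch below uses made-up points that lie exactly on y = 2 + 3x, so the coefficients should come out as intercept 2 and slope 3. The names `X_toy` and `Y_toy` are hypothetical. `np.linalg.solve` and `np.linalg.lstsq` are shown only as alternatives that avoid forming the inverse explicitly; the assignment itself uses the explicit inverse.

```python
import numpy as np

# Made-up data lying exactly on the line y = 2 + 3x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x

# Design matrix with a column of 1s for the intercept, as in question 2.
X_toy = np.column_stack((np.ones_like(x), x))
Y_toy = y.reshape(-1, 1)

# The formula from the assignment: B = (X^T X)^{-1} X^T Y.
B_inv = np.linalg.inv(X_toy.T @ X_toy) @ (X_toy.T @ Y_toy)

# Alternative 1: solve the system (X^T X) B = X^T Y without an explicit inverse.
B_solve = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ Y_toy)

# Alternative 2: NumPy's least-squares routine applied to X and Y directly.
B_lstsq, *_ = np.linalg.lstsq(X_toy, Y_toy, rcond=None)

print(B_inv.ravel())
print(np.allclose(B_inv, B_solve), np.allclose(B_inv, B_lstsq))
```

All three approaches should agree on this toy problem. The first entry of the result is the intercept and the second is the slope, matching the column order of the design matrix.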


8) Use the OLS function to calculate the slope and intercept and compare your answers.

In [26]:
### YOUR CODE HERE ###
from statsmodels.formula.api import ols
model1= ols('Brain ~ Head', data= df).fit()
print(model1.params)

Intercept    325.573421
Head           0.263429
dtype: float64


9) Create a new X matrix that includes columns for both head size and age group.

In [30]:
### YOUR CODE HERE ###

ones= np.repeat(1, len(df)).reshape(-1, 1)
Head= np.array(df['Head']).reshape(-1, 1)
Age= np.array(df['Age']).reshape(-1, 1)

X= np.concatenate((ones, Head, Age), axis= 1)
Y= np.array(df['Brain']).reshape(-1, 1)


10) Calculate the values of the intercept and slope terms for head size and age using the formula $$ B = (X^TX)^{-1}X^TY$$

In [39]:
### YOUR CODE HERE ###

# X transposed
X_T_2 = np.transpose(X)

# X_transpose * X
X_T_X_2 = np.matmul(X_T_2, X)
print('X_transpose times X', X_T_X_2)
print()

# inverse of X_Transpose*X
X_T_X_2_inv= np.linalg.inv(X_T_X_2)
print('X_Transpose times X, inverse:', X_T_X_2_inv)
print()

# (X_Transpose) times Y

X_T_Y_2= np.matmul(X_T_2, Y)
print('X_transpose times Y:', X_T_Y_2)
print()

# put it all together

B2= np.matmul(X_T_X_2_inv, X_T_Y_2)
print('Intercepts:')
print(B2)
print()

X_transpose times X [[       237     861256        364]
 [    861256 3161283190    1318231]
 [       364    1318231        618]]

X_Transpose times X, inverse: [[ 4.96445307e-01 -1.20513659e-04 -3.53418297e-02]
 [-1.20513659e-04  3.21169750e-08  2.47472443e-06]
 [-3.53418297e-02  2.47472443e-06  1.71556109e-02]]

X_transpose times Y: [[    304041]
 [1113176805]
 [    464561]]

Intercepts:
[[ 3.68282145e+02]
 [ 2.60438766e-01]
 [-2.07316446e+01]]



11) Use the OLS function to confirm your answer in 10).

In [40]:
### YOUR CODE HERE ###

model2= ols('Brain ~ Head + Age', data= df).fit()
print(model2.params)

Intercept    368.282145
Head           0.260439
Age          -20.731645
dtype: float64


In [38]:
print(model2.summary())

                            OLS Regression Results                            
Dep. Variable:                  Brain   R-squared:                       0.647
Model:                            OLS   Adj. R-squared:                  0.644
Method:                 Least Squares   F-statistic:                     214.1
Date:                Fri, 11 Dec 2020   Prob (F-statistic):           1.38e-53
Time:                        04:06:36   Log-Likelihood:                -1347.8
No. Observations:                 237   AIC:                             2702.
Df Residuals:                     234   BIC:                             2712.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    368.2821     50.618      7.276      0.0

# Use the following information to answer the assignment questions 12) - 16).

The songwriting collaboration between John Lennon and Paul McCartney was one of the most productive in music history.  Unlike many other partnerships where one individual wrote lyrics and one wrote music, Lennon and McCartney composed both, and it was decided that any song that was written would be credited to both.  In the beginning of their relationship, many of their songs were truly collaborative.  However, later on, they often worked separately with little to no input from the other.

Because of extensive reporting on the Beatles over the years, it is generally known whether a Lennon-McCartney song was a true collaboration, primarily (or totally) written by Lennon, or primarily (or totally) written by McCartney.

However, there are several disputed songs where both Lennon and McCartney at times claimed to be the sole (or primary) composer.

We will now use cosine similarity to determine if *Ticket to Ride* (disputed) is most similar to *From Me to You* (collaborative, not disputed) or *Strawberry Fields* (Lennon, not disputed).

From the Wikipedia article on the Lennon-McCartney Partnership: Lennon said that McCartney's contribution was limited to "the way Ringo played the drums". In *Many Years from Now*, McCartney said "we sat down and wrote it together ... give him 60 percent of it."

12) Import the text of Strawberry Fields and calculate the frequency of song lyrics using the code below.

In [None]:
### YOUR CODE HERE ###

sf= "let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever living is easy with eyes closed misunderstanding all you see its getting hard to be someone but it all works out it doesnt matter much to me let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever no one I think is in my tree I mean it must be high or low that is you cant you know tune in but its all right that is I think its not too bad let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever always no sometimes think but you know I know when it's a dream I think er no I mean er yes but its all wrong that is I think I disagree let me take you down cause Im going to Strawberry Fields nothing is real and nothing to get hung about Strawberry Fields forever Strawberry Fields forever Strawberry Fields forever"
sf_df= pd.DataFrame({'words': sf.split()})

sf_freq= pd.DataFrame(pd.crosstab(index= sf_df['words'], columns= 'count'))

sf_freq[0:50]


13) Import the text of From Me to You and calculate the frequency of song lyrics using the code below.

In [None]:
### YOUR CODE HERE ###

me2u = "if there's anything that you want if there's anything I can do just call on me and Ill send it along with love from me to you Ive got everything that you want like a heart thats oh so true just call on me and Ill send it along with love from me to you Ive got arms that long to hold you and keep you by my side Ive got lips that long to kiss you and keep you satisfied oh if theres anything that you want if theres anything I can do just call on me and Ill send it along with love from me to you from me to you just call on me and Ill send it along with love from me to you Ive got arms that long to hold you and keep you by my side Ive got lips that long to kiss you and keep you satisfied oh if theres anything that you want if theres anything I can do just call on me and Ill send it along with love from me to you to you to you to you"

me2u_df= pd.DataFrame({'words': me2u.split()})

me2u_df_freq= pd.DataFrame(pd.crosstab(index= me2u_df['words'], columns= 'count'))
me2u_df_freq[0:50]

13) Import the text of Ticket to Ride using the code below.

In [None]:
### YOUR CODE HERE ###

tick2ride= "I think Im gonna be sad I think its today yeah the girl thats driving me mad is going away shes got a ticket to ride shes got a ticket to ride shes got a ticket to ride but she dont care she said that living with me is bringing her down yeah for she would never be free when I was around shes got a ticket to ride shes got a ticket to ride shes got a ticket to ride but she dont care I dont know why shes ridin so high she ought to think twice she ought to do right by me before she gets to saying goodbye she ought to think twice she ought to do right by me I think Im gonna be sad I think its today yeah the girl thats driving me mad is going away yeah shes got a ticket to ride shes got a ticket to ride shes got a ticket to ride but she dont care I dont know why shes ridin so high she ought to think twice she ought to do right by me before she gets to saying goodbye she ought to think twice she ought to do right by me she said that living with me is bringing her down yeah for she would never be free when I was around ah shes got a ticket to ride shes got a ticket to ride shes got a ticket to ride but she dont care my baby dont care my baby dont care my baby dont care my baby dont care my baby dont care my baby dont care"
tick2ride_df = pd.DataFrame({'words': tick2ride.split()})

tick2ride_df_freq= pd.DataFrame(pd.crosstab(index= tick2ride_df['words'], columns= 'count'))
tick2ride_df_freq[0:50]


14) Concatenate Ticket to Ride and Strawberry Fields and calculate the cosine similarity.

In [None]:
### YOUR CODE HERE ###

from numpy import dot
from numpy.linalg import norm

df_tickstraw= [sf_freq, tick2ride_df_freq]

all_words_1 = pd.concat(df_tickstraw, axis= 1)
all_words_1[0:50]


In [None]:

all_words_1= all_words_1.fillna(0)
all_words_1.columns = ['Strawberry Fields', 'Ticket to Ride']
all_words_1[0:50]
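
The concatenate-then-fill step matters because the two songs do not share the same vocabulary. The sketch below shows the idea on two made-up word strings, not real lyrics. It uses `value_counts` instead of `pd.crosstab` purely for brevity, which is a swapped-in choice and not what the cells above do.

```python
import pandas as pd

# Two made-up "songs" with partly overlapping vocabularies (hypothetical text).
song_a = "red blue red green"
song_b = "blue yellow blue"

# Word counts for each song, indexed by word.
counts_a = pd.Series(song_a.split()).value_counts()
counts_b = pd.Series(song_b.split()).value_counts()

# Concatenating along axis=1 aligns the counts on the word index.
# A word missing from one song shows up as NaN, which fillna(0) turns into a count of 0.
both = pd.concat([counts_a, counts_b], axis=1).fillna(0)
both.columns = ['song_a', 'song_b']

print(both)
```

After this step both columns have one entry per word in the combined vocabulary, so they can be treated as two vectors of the same length.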

In [57]:
# cosine_similarity = dot product(Strawberry Fields, Ticket to Ride) / (norm(Strawberry Fields) * norm(Ticket to Ride))

cos_sim_1 = dot(all_words_1['Strawberry Fields'], all_words_1['Ticket to Ride'])/ (norm(all_words_1['Strawberry Fields'])*norm(all_words_1['Ticket to Ride']))
print(cos_sim_1)

0.324035859004908
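
To make the scale of this number easier to read, the sketch below wraps the same formula in a small helper and applies it to made-up count vectors. The function name `cosine_similarity` and the vectors are assumptions for illustration only. For non-negative word counts the value falls between 0 (no words in common) and 1 (identical word proportions).

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths (norms)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Same direction: the second vector is the first one doubled, so the result is approximately 1.
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])

# No overlap: the vectors never have nonzero counts in the same position, so the result is 0.
no_overlap = cosine_similarity([1, 0, 0], [0, 3, 0])

print(same_direction, no_overlap)
```

Because the formula divides by both vector lengths, a longer song with the same word proportions scores the same as a shorter one.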


15) Concatenate Ticket to Ride and From Me to You and calculate the cosine similarity.

In [None]:
### YOUR CODE HERE ###

df_tickme2u = [tick2ride_df_freq, me2u_df_freq]

all_words_2= pd.concat(df_tickme2u, axis= 1)
all_words_2 = all_words_2.fillna(0)
all_words_2.columns= ['Ticket to Ride', 'From Me to You']
all_words_2[0:50]

In [59]:
cos_sim_2 = dot(all_words_2['Ticket to Ride'], all_words_2['From Me to You']) / (norm(all_words_2['Ticket to Ride']) * norm(all_words_2['From Me to You']))
cos_sim_2

0.2882268853551227

16) What is your conclusion about Ticket to Ride?  Does it appear most similar to Strawberry Fields (Lennon) or From Me to You (collaborative)?

Answer: 

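Ticket to Ride has a higher cosine similarity with Strawberry Fields (about 0.324) than with From Me to You (about 0.288), so by this measure it looks more like a Lennon song than a Lennon-McCartney collaboration. The gap is small and rests on word counts from only two reference songs, so it is weak evidence for attributing the song mainly to John Lennon.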