# MTHM503 Assignment
#### K. Donkers - 700063874

## A. Gym exercise and physiology data

### 6. Regression analysis

#### (i) State the author’s last name and year of the study in which that data first appeared.

Load the dataset from scikit-learn.

In [1]:
from sklearn.datasets import load_linnerud
linnerud = load_linnerud()

What exactly is `linnerud`?

In [2]:
print(type(linnerud))
print(linnerud.keys())

<class 'sklearn.utils.Bunch'>
dict_keys(['data', 'feature_names', 'target', 'target_names', 'frame', 'DESCR', 'data_filename', 'target_filename'])


`DESCR` looks like it describes the dataset

In [3]:
print(linnerud["DESCR"])

.. _linnerrud_dataset:

Linnerrud dataset
-----------------

**Data Set Characteristics:**

    :Number of Instances: 20
    :Number of Attributes: 3
    :Missing Attribute Values: None

The Linnerud dataset is a multi-output regression dataset. It consists of three
excercise (data) and three physiological (target) variables collected from
twenty middle-aged men in a fitness club:

- *physiological* - CSV containing 20 observations on 3 physiological variables:
   Weight, Waist and Pulse.
- *exercise* - CSV containing 20 observations on 3 exercise variables:
   Chins, Situps and Jumps.

.. topic:: References

  * Tenenhaus, M. (1998). La regression PLS: theorie et pratique. Paris:
    Editions Technic.



According to the reference listed in `DESCR`, the author's last name is **Tenenhaus** and the year of the study is **1998**.

#### (ii) Using the appropriate function from the scikit-learn package, fit a simple linear regression model with number of chinups (Chins) as the target variable, and “Weight” as the covariate. <br>Report the fitted regression coefficients, and interpret the slope coefficient.

We can use the `LinearRegression` model from `sklearn` to regress the number of chin-ups (`Chins`) on `Weight`, both taken from `linnerud`.

In [4]:
from sklearn.linear_model import LinearRegression

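First let's look at the data. `Weight` is the first column of `linnerud['target']` and `Chins` is the first column of `linnerud['data']`.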

In [5]:
weight = linnerud['target'][:,0]
chins = linnerud['data'][:,0]

In [6]:
print(weight)
print(chins)

[191. 189. 193. 162. 189. 182. 211. 167. 176. 154. 169. 166. 154. 247.
 193. 202. 176. 157. 156. 138.]
[ 5.  2. 12. 12. 13.  4.  8.  6. 15. 17. 17. 13. 14.  1.  6. 12.  4. 11.
 15.  2.]
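As a small sketch of an alternative to hard-coding index `0`, the column positions can be looked up from the name lists that ship with the dataset. The variable names `weight_idx` and `chins_idx` are my own and are not part of the original notebook.

```python
from sklearn.datasets import load_linnerud

linnerud = load_linnerud()

# Look up column positions by name instead of relying on a magic index.
weight_idx = linnerud.target_names.index("Weight")   # physiological variables
chins_idx = linnerud.feature_names.index("Chins")    # exercise variables

weight = linnerud.target[:, weight_idx]
chins = linnerud.data[:, chins_idx]

# Show the first few values of each column as a sanity check.
print(weight[:3], chins[:3])
```

This gives the same arrays as the cell above, but it keeps working if the columns are ever reordered.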


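Now that we have the data, we can instantiate a `LinearRegression` object and fit it with `weight` as the covariate and `chins` as the target. scikit-learn expects a 2-D feature array, hence the `reshape(-1, 1)`.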

In [7]:
lr = LinearRegression()

lr_fit = lr.fit(weight.reshape(-1,1), chins)

In [8]:
print(lr_fit.intercept_)
print(lr_fit.coef_)

24.351322650827086
[-0.08343406]


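The fitted intercept is about 24.35 and the slope is about -0.083. The slope is negative: according to this model, each additional pound of body weight is associated with roughly 0.083 fewer chin-ups on average, i.e. about one fewer chin-up for every 12 pounds. This describes an association in a sample of 20 men, not a causal effect.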
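To make the slope less of a black box, here is a minimal sketch that cross-checks it against the closed-form ordinary least squares estimates for a single covariate, $\hat\beta_1 = \mathrm{cov}(W, \text{Chins}) / \mathrm{var}(W)$ and $\hat\beta_0 = \overline{\text{Chins}} - \hat\beta_1 \bar{W}$. The names `model`, `slope` and `intercept` are illustrative and not taken from the cells above.

```python
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.linear_model import LinearRegression

linnerud = load_linnerud()
weight = linnerud.target[:, 0]   # Weight in pounds
chins = linnerud.data[:, 0]      # number of chin-ups

model = LinearRegression().fit(weight.reshape(-1, 1), chins)

# Closed-form OLS estimates for simple linear regression.
slope = np.cov(weight, chins)[0, 1] / np.var(weight, ddof=1)
intercept = chins.mean() - slope * weight.mean()

print(slope, model.coef_[0])          # the two slopes should agree
print(intercept, model.intercept_)    # the two intercepts should agree
print(10 * slope)                     # expected change in chin-ups per +10 lb
```

The last line rescales the slope to a 10-pound difference, which is often easier to interpret than a per-pound change.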

#### (iii) Your lecturer is a middle-aged male, 170 pounds (“Weight”), 32 inch waist size (“Waist”), and a resting heart rate of 70 (“Pulse”)? How many chin-ups (“Chins”) do you think he can do? <br>(Use linear regression in scikit-learn on the Linnerud data to answer the question.)

First we need the physiological data (Weight, Waist and Pulse), which is stored in `linnerud.target`.

In [9]:
male = linnerud['target']
male

array([[191.,  36.,  50.],
       [189.,  37.,  52.],
       [193.,  38.,  58.],
       [162.,  35.,  62.],
       [189.,  35.,  46.],
       [182.,  36.,  56.],
       [211.,  38.,  56.],
       [167.,  34.,  60.],
       [176.,  31.,  74.],
       [154.,  33.,  56.],
       [169.,  34.,  50.],
       [166.,  33.,  52.],
       [154.,  34.,  64.],
       [247.,  46.,  50.],
       [193.,  36.,  46.],
       [202.,  37.,  62.],
       [176.,  37.,  54.],
       [157.,  32.,  52.],
       [156.,  33.,  54.],
       [138.,  33.,  68.]])

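Reusing the `lr` instance from question **(ii)**, we can refit the model on all three physiological variables, again with the number of chin-ups as the target. Calling `fit` again overwrites the coefficients from question **(ii)**.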

In [10]:
lr_fit = lr.fit(male, chins)

print(lr_fit.intercept_)
print(lr_fit.coef_)

47.96841290822685
[ 0.07884384 -1.45584256 -0.01895002]


With this linear model we can now predict the number of chinups our example lecturer can do

In [11]:
example = [[170, 32, 70]]
lr_fit.predict(example)

array([13.45840241])

The model predicts about 13.46, so the lecturer can be expected to do roughly 13 chin-ups. This is a point estimate, not a lower bound.
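The prediction is just the intercept plus the dot product of the coefficients with the lecturer's measurements. The sketch below reproduces it by hand; `model`, `lecturer`, `manual` and `predicted` are illustrative names, not ones used in the cells above.

```python
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.linear_model import LinearRegression

linnerud = load_linnerud()
X = linnerud.target          # columns: Weight, Waist, Pulse
y = linnerud.data[:, 0]      # Chins

model = LinearRegression().fit(X, y)

# Lecturer: 170 lb, 32 inch waist, resting pulse of 70.
lecturer = np.array([170.0, 32.0, 70.0])

# Manual prediction: intercept + sum of coefficient * value.
manual = model.intercept_ + model.coef_ @ lecturer
predicted = model.predict(lecturer.reshape(1, -1))[0]

print(manual, predicted)
```

Both numbers should match the `predict` output shown above.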

### 7. Dimensionality reduction

#### (i) What linear combination $α_w W +α_c C+α_p P$ of the physiological variables weight W, waist circumference C and pulse P has the highest possible variance among all possible linear transformations?
**State the weights $α_w$, $α_c$ and $α_p$ rounded to 2 decimal places. <br>
(The variance maximisation is subject to the constraint $α^2_w + α^2_c + α^2_p = 1$ and you may assume that the appropriate scikit-learn function adheres to this constraint.)**

The unit-norm linear combination with the "highest possible variance among all possible linear transformations" is, by definition, the first principal component found by principal component analysis (PCA). We can use `sklearn` to perform PCA on the physiological variables in `linnerud.target`.

In [12]:
from sklearn import decomposition

For this we only need PCA with one component

In [13]:
pca1 = decomposition.PCA(n_components=1)

We can fit this PCA model to the physiology data and view the components

In [14]:
bmi_pca1 = pca1.fit(linnerud.target)

bmi_pca1.components_

array([[ 0.98717201,  0.11200059, -0.1137862 ]])

These are the coefficients of the unit-norm linear combination of the physiological variables W, C and P that has the maximum possible variance.

Let's make sure we line them up correctly before reporting the answer...

In [15]:
linnerud.target_names

['Weight', 'Waist', 'Pulse']

Therefore the coefficients, rounded to 2 decimal places, are:
- $\alpha_w = 0.99$
- $\alpha_c = 0.11$
- $\alpha_p = -0.11$
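As a sanity check, the sketch below verifies two properties that the question asks us to assume: the weights satisfy $α^2_w + α^2_c + α^2_p = 1$, and they match the eigenvector of the sample covariance matrix with the largest eigenvalue. The overall sign of a principal component is arbitrary, so the comparison uses an absolute value. The names `pc1`, `cov` and `top` are my own.

```python
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.decomposition import PCA

X = load_linnerud().target   # columns: Weight, Waist, Pulse

# First principal component as reported by scikit-learn.
pc1 = PCA(n_components=1).fit(X).components_[0]

# Sample covariance matrix (variables in columns) and its eigenvectors.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
top = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue

print(np.sum(pc1 ** 2))   # squared weights should sum to 1
print(abs(pc1 @ top))     # should be 1 if both point along the same axis
```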

#### (ii) What might the interpretation of this one-dimensional representation of a person’s physiological factors be, and in what context could it be useful?

I don't know how to interpret these coefficients

#### (iii) What is the variance of the linear combination with $α_w = α_c = α_p = \sqrt{1/3}$ and how does this variance compare to the variance of the linear combination calculated in question 7(i)?

<img src="https://media.giphy.com/media/13jghlUIB6FHZm/giphy.gif" alt="*shrug*" style="width: 300px;"/>

I don't understand PCA enough...
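One way to approach this question, offered as a sketch and not a definitive answer: the variance of a linear combination $a^\top x$ is $a^\top \Sigma a$, where $\Sigma$ is the covariance matrix of the physiological variables. The code below evaluates this for equal weights $\sqrt{1/3}$ and compares it with the variance of the first principal component, which scikit-learn exposes as `explained_variance_`. Both use the $n-1$ denominator. The variable names are my own.

```python
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.decomposition import PCA

X = load_linnerud().target          # columns: Weight, Waist, Pulse
cov = np.cov(X, rowvar=False)       # sample covariance (n - 1 denominator)

# Equal weights; their squares sum to 1, so the constraint is satisfied.
a = np.full(3, np.sqrt(1 / 3))

var_equal = a @ cov @ a                 # variance via a^T Sigma a
var_direct = np.var(X @ a, ddof=1)      # same quantity from the projected data

# Variance along the first principal component.
var_pc1 = PCA(n_components=1).fit(X).explained_variance_[0]

print(var_equal, var_direct, var_pc1)
```

Because the first principal component maximises variance over all unit-norm weight vectors, the equal-weight variance cannot exceed `var_pc1`; the printed values show by how much it falls short.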