# Exercise 8: Clustering stocks using KMeans

In this exercise, you'll cluster companies using their daily stock price movements (i.e. the dollar difference between the closing and opening prices for each trading day).  You are given a NumPy array `movements` of daily price movements from 2010 to 2015, where each row corresponds to a company, and each column corresponds to a trading day.

Some stocks are more expensive than others, so their dollar movements are larger. To account for this, include a `Normalizer` at the beginning of your pipeline. The `Normalizer` rescales each company's row of price movements separately (to unit length by default), putting all companies on a relative scale before the clustering begins.
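
As a rough illustration of why this helps, the sketch below uses two made-up stocks (hypothetical numbers, not taken from the dataset) whose movements follow the same pattern at different dollar scales. With the default settings of `Normalizer`, both rows should end up identical.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical daily movements (in dollars) for two made-up stocks.
# The second stock moves exactly ten times as much as the first.
toy_movements = np.array([
    [0.5, -1.0, 0.25, 2.0],
    [5.0, -10.0, 2.5, 20.0],
])

# Normalizer rescales each row on its own; no statistics are shared across rows.
normalized = Normalizer().fit_transform(toy_movements)

# After rescaling, the two rows coincide, so the price level no longer matters.
rows_match = np.allclose(normalized[0], normalized[1])
print(rows_match)
```

Real stocks never match this exactly, but the same effect stops expensive stocks from dominating the distance calculations in KMeans.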

## Normalizer vs StandardScaler
Note that `Normalizer()` is different from `StandardScaler()`, which you used in the previous exercise. While `StandardScaler()` standardizes **features** (such as the features of the fish data from the previous exercise) by removing the mean and scaling to unit variance, `Normalizer()` rescales **each sample** (here, each company's row of price movements) independently of the others.
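
The difference is easiest to see on a small array. The following is a minimal sketch on hypothetical data (the values in `X` are invented for illustration): `StandardScaler` works down the columns, while `Normalizer` works across each row.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical data: 3 samples (rows) by 2 features (columns).
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 60.0],
])

X_scaled = StandardScaler().fit_transform(X)  # standardizes each column
X_normed = Normalizer().fit_transform(X)      # rescales each row

col_means = X_scaled.mean(axis=0)             # close to 0 for every feature
col_stds = X_scaled.std(axis=0)               # close to 1 for every feature
row_norms = np.linalg.norm(X_normed, axis=1)  # 1 for every sample (L2 norm)

print(col_means, col_stds, row_norms)
```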

This dataset was obtained from the Yahoo! Finance API.

**Step 1:** Load the data _(written for you)_

In [2]:
import pandas as pd
import numpy as np

fn = '../datasets/company-stock-movements-2010-2015-incl.csv'
stocks_df = pd.read_csv(fn, index_col=0)

**Step 2:** Inspect the first few rows of the DataFrame `stocks_df` by calling its `head()` method.

In [6]:
stocks_df.head()


**Step 3:** Extract the NumPy array `movements` and the list of company names `companies` from the DataFrame _(written for you)_

In [3]:
companies = list(stocks_df.index)
movements = stocks_df.values

**Step 4:** Make the necessary imports:

- `Normalizer` from `sklearn.preprocessing`.
- `KMeans` from `sklearn.cluster`.
- `make_pipeline` from `sklearn.pipeline`.

In [4]:
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

**Step 5:** Create an instance of `Normalizer` called `normalizer`.

In [5]:
normalizer = Normalizer()

**Step 6:** Create an instance of `KMeans` called `kmeans` with `14` clusters.

In [6]:
kmeans = KMeans(n_clusters=14)
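
KMeans starts from randomly chosen centroids, so the cluster assignments can change from one run to the next. If you want repeatable results, one option is to pass `random_state`. The sketch below demonstrates this on synthetic stand-in data; the seed value `42`, the cluster count and the data itself are arbitrary assumptions, not part of the exercise.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data, not the stock movements.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))

# random_state=42 is an arbitrary seed. n_init is set explicitly because
# its default value differs between scikit-learn versions.
labels_a = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# With the same seed and data, both runs assign identical labels.
same = np.array_equal(labels_a, labels_b)
print(same)
```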


**Step 7:** Using `make_pipeline()`, create a pipeline called `pipeline` that chains `normalizer` and `kmeans`.

In [7]:
pipeline = make_pipeline(normalizer, kmeans)
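
`make_pipeline()` names each step automatically after its class, in lowercase. As a small sketch (rebuilding the same pipeline so that the snippet runs on its own), you can look up the steps by name through `named_steps`:

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

pipeline = make_pipeline(Normalizer(), KMeans(n_clusters=14))

# Step names are generated from the lowercased class names.
step_names = list(pipeline.named_steps)
print(step_names)

# Individual steps can be retrieved by name, e.g. to check a parameter.
n_clusters = pipeline.named_steps["kmeans"].n_clusters
```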

**Step 8:** Fit the pipeline to the `movements` array.

In [8]:
pipeline.fit(movements)

Pipeline(steps=[('normalizer', Normalizer()),
                ('kmeans', KMeans(n_clusters=14))])
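
Fitting the pipeline normalizes each row and then runs KMeans on the result. To get a feel for what a fitted pipeline contains, here is a minimal sketch on random stand-in data. The names `fake_movements` and `toy_pipeline`, the array shape, and the choice of 4 clusters are all assumptions made for illustration; the real `movements` array is not used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Random stand-in for the movements array: 20 "companies" by 30 "days".
rng = np.random.default_rng(1)
fake_movements = rng.normal(size=(20, 30))

toy_pipeline = make_pipeline(
    Normalizer(),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)
toy_pipeline.fit(fake_movements)

# The fitted KMeans step stores one label per row and one centroid per cluster.
fitted_kmeans = toy_pipeline.named_steps["kmeans"]
labels = fitted_kmeans.labels_
centers = fitted_kmeans.cluster_centers_

# predict() applies the normalizer first, then assigns each row to a cluster.
predicted = toy_pipeline.predict(fake_movements)
print(labels.shape, centers.shape, predicted.shape)
```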

**In the next exercise:** Let's check out your clustering!