<img src="https://i.imgur.com/FoKB5Z5.png" align="left" width="300" height="250" title="source: imgur.com" />

## Program Code: J620-002-4:2020 

## Program Name: FRONT-END SOFTWARE DEVELOPMENT

## Title : Case Study - Clustering Stocks using k-Means

#### Name: Chong Mun Chen

#### IC Number: 960327-07-5097

#### Date : 26/7/2023

#### Introduction : This case study practises the k-means clustering method by grouping companies according to their daily stock price movements.



#### Conclusion : The `Normalizer` was successfully chained with k-means clustering in a pipeline, producing the desired clusters.





# Clustering stocks using KMeans

In this exercise, you'll cluster companies using their daily stock price movements (i.e. the dollar difference between the closing and opening prices for each trading day).  You are given a NumPy array `movements` of daily price movements from 2010 to 2015, where each row corresponds to a company, and each column corresponds to a trading day.
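As a small illustration of the definition above, the sketch below computes daily movements for one hypothetical stock. The opening and closing prices are made-up numbers, not values from the dataset.

```python
import numpy as np

# Hypothetical opening and closing prices for three trading days (not real data).
open_prices = np.array([10.0, 10.5, 11.0])
close_prices = np.array([10.5, 10.25, 11.75])

# Daily movement = closing price minus opening price, in dollars.
daily_movement = close_prices - open_prices
print(daily_movement)
```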

Some stocks are more expensive than others.  To account for this, include a `Normalizer` at the beginning of your pipeline.  The Normalizer will separately transform each company's stock price to a relative scale before the clustering begins.

## Normalizer vs StandardScaler
Note that `Normalizer()` is different from `StandardScaler()`, which you used in the previous exercise. While `StandardScaler()` standardizes **features** (such as the features of the fish data from the previous exercise) by removing the mean and scaling to unit variance, `Normalizer()` rescales **each sample** (here, each company's row of stock price movements) independently of the others, so that each row has unit norm.
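The difference can be checked on a toy array. The sketch below is only an illustration: the 3 x 4 array `X` is hypothetical, with rows standing in for companies and columns for trading days.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical toy data: 3 "companies" (rows) x 4 "trading days" (columns).
X = np.array([
    [1.0, -2.0, 2.0, 0.0],
    [10.0, -20.0, 20.0, 0.0],  # same pattern as row 0, ten times larger
    [0.5, 0.5, -0.5, 0.5],
])

# Normalizer works row by row: each sample is divided by its own L2 norm.
X_norm = Normalizer().fit_transform(X)
row_norms = np.linalg.norm(X_norm, axis=1)

# StandardScaler works column by column: zero mean, unit variance per feature.
X_std = StandardScaler().fit_transform(X)
col_means = X_std.mean(axis=0)

print(row_norms)                          # length of each normalized row
print(np.allclose(X_norm[0], X_norm[1]))  # do rows 0 and 1 coincide after normalizing?
print(np.round(col_means, 6))             # mean of each standardized column
```

After normalization, the cheap stock (row 0) and the expensive stock (row 1) look the same because only the shape of their movements is kept, which is the reason for putting `Normalizer` in front of the clustering step.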

This dataset was obtained from the Yahoo! Finance API.

**Step 1:** Load the data _(written for you)_

In [1]:
import pandas as pd

import warnings

warnings.filterwarnings('ignore')

fn = '../../data_samples2/company-stock-movements-2010-2015-incl.csv'
stocks_df = pd.read_csv(fn, index_col=0)

**Step 2:** Inspect the first few rows of the DataFrame `stocks_df` by calling its `head()` function.

In [2]:
stocks_df.head()

| | 2010-01-04 | 2010-01-05 | 2010-01-06 | 2010-01-07 | 2010-01-08 | 2010-01-11 | 2010-01-12 | 2010-01-13 | 2010-01-14 | 2010-01-15 | ... | 2013-10-16 | 2013-10-17 | 2013-10-18 | 2013-10-21 | 2013-10-22 | 2013-10-23 | 2013-10-24 | 2013-10-25 | 2013-10-28 | 2013-10-29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Apple | 0.58 | -0.220005 | -3.409998 | -1.17 | 1.680011 | -2.689994 | -1.469994 | 2.779997 | -0.680003 | -4.999995 | ... | 0.320008 | 4.519997 | 2.899987 | 9.590019 | -6.540016 | 5.959976 | 6.910011 | -5.359962 | 0.840019 | -19.589981 |
| AIG | -0.640002 | -0.65 | -0.210001 | -0.42 | 0.710001 | -0.200001 | -1.130001 | 0.069999 | -0.119999 | -0.5 | ... | 0.919998 | 0.709999 | 0.119999 | -0.48 | 0.010002 | -0.279998 | -0.190003 | -0.040001 | -0.400002 | 0.66 |
| Amazon | -2.350006 | 1.260009 | -2.350006 | -2.009995 | 2.960006 | -2.309997 | -1.640007 | 1.209999 | -1.790001 | -2.039994 | ... | 2.109985 | 3.699982 | 9.570008 | -3.450013 | 4.820008 | -4.079986 | 2.579986 | 4.790009 | -1.760009 | 3.740021 |
| American express | 0.109997 | 0.0 | 0.260002 | 0.720002 | 0.190003 | -0.270001 | 0.75 | 0.300004 | 0.639999 | -0.130001 | ... | 0.680001 | 2.290001 | 0.409996 | -0.069999 | 0.100006 | 0.069999 | 0.130005 | 1.849999 | 0.040001 | 0.540001 |
| Boeing | 0.459999 | 1.77 | 1.549999 | 2.690003 | 0.059997 | -1.080002 | 0.36 | 0.549999 | 0.530002 | -0.709999 | ... | 1.559997 | 2.480003 | 0.019997 | -1.220001 | 0.480003 | 3.020004 | -0.029999 | 1.940002 | 1.130005 | 0.309998 |


**Step 3:** Extract the NumPy array `movements` from the DataFrame and the list of company names (_written for you_)

In [3]:
movements = stocks_df.values
companies = stocks_df.index

**Step 4:** Make the necessary imports:

- `Normalizer` from `sklearn.preprocessing`.
- `KMeans` from `sklearn.cluster`.
- `make_pipeline` from `sklearn.pipeline`.

In [4]:
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

**Step 5:** Create an instance of `Normalizer` called `normalizer`.

In [5]:
normalizer = Normalizer()

**Step 6:** Create an instance of `KMeans` called `kmeans` with `14` clusters.

In [6]:
kmeans = KMeans(n_clusters=14)
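k-means starts from randomly chosen initial centroids, so the cluster numbers (and sometimes the clusters themselves) can differ between runs. The sketch below, on synthetic data rather than the stock data, suggests one way to make a run repeatable by passing `random_state`. The seed `42`, the explicit `n_init=10`, and the 3 clusters are illustrative choices, not part of the exercise.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: 60 samples x 5 features (not the stock movements).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))

# Two separate KMeans instances with the same seed (42 is an arbitrary choice).
labels_a = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# With a fixed seed, both runs should assign identical labels.
print(np.array_equal(labels_a, labels_b))
```

Without a fixed seed, the label values printed later in this notebook may not be reproduced exactly when the cells are re-run.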

**Step 7:** Using `make_pipeline()`, create a pipeline called `pipeline` that chains `normalizer` and `kmeans`.

In [7]:
pipeline = make_pipeline(normalizer, kmeans)
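`make_pipeline()` names each step automatically after the lowercased class name, which is how individual steps can be retrieved later. A minimal sketch, rebuilding the same pipeline in a self-contained form:

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Same two steps as in the exercise, built inline for a self-contained example.
pipeline = make_pipeline(Normalizer(), KMeans(n_clusters=14))

# Each step is stored as a (name, estimator) pair.
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# A step can be looked up by its generated name.
kmeans_step = pipeline.named_steps['kmeans']
print(kmeans_step.n_clusters)
```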

**Step 8:** Fit the pipeline to the `movements` array.

In [8]:
pipeline.fit(movements)
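Fitting the pipeline should be equivalent to normalizing the data first and then fitting k-means on the result. The sketch below compares both routes on synthetic data. The array `X`, the 3 clusters, and the fixed `random_state=0` are assumptions made only so the two runs are comparable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Synthetic stand-in for `movements`: 30 samples x 8 features (not real data).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 8))

# Route 1: let the pipeline normalize and cluster in one call.
pipeline = make_pipeline(
    Normalizer(),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
pipeline.fit(X)
pipeline_labels = pipeline.named_steps['kmeans'].labels_

# Route 2: perform the same two steps by hand.
X_normalized = Normalizer().fit_transform(X)
manual_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
manual_labels = manual_kmeans.fit(X_normalized).labels_

# Compare the cluster assignments produced by the two routes.
same = np.array_equal(pipeline_labels, manual_labels)
print(same)
```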

So which companies have stock prices that tend to change in the same way?  Now inspect the cluster labels from your clustering to find out.

**Step 9:** Predict the cluster labels for `movements` using the pipeline's `.predict()` method.


In [9]:
labels = pipeline.predict(movements)
labels

array([13,  1,  6,  9,  3,  1,  0,  8,  2,  4,  5,  7,  5,  2,  7,  8,  1,
       13,  1,  5,  9,  8,  7,  5,  7,  4,  1,  4, 10,  3,  9, 11,  2,  7,
        8,  2,  3,  5, 10, 12,  4,  0,  5,  5,  2,  8,  5,  5,  8,  5,  2,
        2,  5,  2,  3,  1,  4,  5,  8,  6])
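The label values are arbitrary cluster ids, so a quick sanity check is to count how many companies landed in each cluster. The sketch below copies the label array printed above and counts the members with `np.bincount`. A different run of k-means may give different labels.

```python
import numpy as np

# Labels copied from the output above (one label per company, 60 in total).
labels = np.array([
    13, 1, 6, 9, 3, 1, 0, 8, 2, 4, 5, 7, 5, 2, 7, 8, 1,
    13, 1, 5, 9, 8, 7, 5, 7, 4, 1, 4, 10, 3, 9, 11, 2, 7,
    8, 2, 3, 5, 10, 12, 4, 0, 5, 5, 2, 8, 5, 5, 8, 5, 2,
    2, 5, 2, 3, 1, 4, 5, 8, 6,
])

# counts[k] is the number of companies assigned to cluster k.
counts = np.bincount(labels)
for cluster_id, size in enumerate(counts):
    print(cluster_id, size)
```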

**Step 10:** Align the cluster labels with the list of company names `companies` by creating a DataFrame `companies_df` with `labels` and `companies` as columns.

In [10]:
companies_df = pd.DataFrame({'labels': labels, 'companies': companies})
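The alignment is purely positional: the i-th label belongs to the i-th company. To see this on a small scale, the sketch below uses only the first five company names and the first five labels printed earlier in this notebook. The variable names `first_companies`, `first_labels`, and `small_df` are introduced here for illustration.

```python
import pandas as pd

# First five companies and their labels, taken from the outputs shown earlier.
first_companies = ['Apple', 'AIG', 'Amazon', 'American express', 'Boeing']
first_labels = [13, 1, 6, 9, 3]

# Position i of one list is paired with position i of the other.
small_df = pd.DataFrame({'labels': first_labels, 'companies': first_companies})

# Sorting by label keeps each company attached to its own label.
sorted_df = small_df.sort_values('labels')
print(sorted_df)
```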

**Step 11:** Now display the DataFrame, sorted by cluster label.  To do this, use the `.sort_values()` method of `companies_df` to sort the DataFrame by the `'labels'` column.

In [11]:
companies_df.sort_values('labels')

Unnamed: 0,labels,companies
41,0,Philip Morris
6,0,British American Tobacco
18,1,Goldman Sachs
1,1,AIG
5,1,Bank of America
55,1,Wells Fargo
26,1,JPMorgan Chase
16,1,General Electrics
32,2,3M
44,2,Schlumberger
