# Exercise 6: Scaling fish data for clustering

You are given an array `samples` of fish measurements. Each row represents a single fish. The measurements, such as weight in grams, length in centimeters, and the percentage ratio of height to length, are on very different scales. To cluster this data effectively, you need to standardize these features first. In this exercise, you'll build a pipeline that standardizes and clusters the data.
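
To see why the differing scales matter, here is a minimal sketch using three illustrative rows (weight, one length, and the height ratio). The array name `toy` and the choice of columns are assumptions made for this illustration; it is not part of the exercise itself. It shows how much of the squared Euclidean distance between two fish comes from each feature when nothing is scaled.

```python
import numpy as np

# Illustrative rows: [weight in g, length in cm, height/length ratio in %]
toy = np.array([
    [242.0, 23.2, 38.4],
    [290.0, 24.0, 40.0],
    [700.0, 30.4, 38.8],
])

# Per-feature spread: weight varies far more than the other columns
feature_std = toy.std(axis=0)
print(feature_std)

# Share of the squared Euclidean distance between the first and last fish
# that each feature contributes
sq_diff = (toy[0] - toy[2]) ** 2
share = sq_diff / sq_diff.sum()
print(share.round(4))
```

Because k-means relies on Euclidean distances, an unscaled weight column would dominate the clustering almost entirely.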

This great dataset was derived from the one [here](http://svitsrv25.epfl.ch/R-doc/library/rrcov/html/fish.html), where you can see a description of each measurement.

From the course _Transition to Data Science_.

**Step 1:** Load the dataset _(this bit is written for you)_.

In [24]:
import pandas as pd

df = pd.read_csv('../datasets/fish.csv')

# forget the species column for now - we'll use it later!
del df['species']
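
Since the species labels are needed again later, one alternative is to keep them in a variable while removing them from the frame. The sketch below uses a small made-up frame (`toy_df` and its values are hypothetical stand-ins, not the real file) to show `DataFrame.pop`, which drops a column and returns it in one step.

```python
import pandas as pd

# Hypothetical stand-in for the fish table; the real file has more rows and columns
toy_df = pd.DataFrame({
    'species': ['Bream', 'Bream', 'Pike'],
    'weight': [242.0, 290.0, 200.0],
    'length1': [23.2, 24.0, 30.0],
})

# pop() removes the column from the frame and returns it as a Series
species = toy_df.pop('species')

print(list(toy_df.columns))
print(species.tolist())
```

With `del`, the column is simply discarded and would have to be reloaded from the CSV when it is needed.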

**Step 2:** Call `df.head()` to inspect the dataset (passing `10` shows the first ten rows instead of the default five):

In [25]:
df.head(10)

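|   | weight | length1 | length2 | length3 | height | width |
|---|--------|---------|---------|---------|--------|-------|
| 0 | 242.0 | 23.2 | 25.4 | 30.0 | 38.4 | 13.4 |
| 1 | 290.0 | 24.0 | 26.3 | 31.2 | 40.0 | 13.8 |
| 2 | 340.0 | 23.9 | 26.5 | 31.1 | 39.8 | 15.1 |
| 3 | 363.0 | 26.3 | 29.0 | 33.5 | 38.0 | 13.3 |
| 4 | 430.0 | 26.5 | 29.0 | 34.0 | 36.6 | 15.1 |
| 5 | 450.0 | 26.8 | 29.7 | 34.7 | 39.2 | 14.2 |
| 6 | 500.0 | 26.8 | 29.7 | 34.5 | 41.1 | 15.3 |
| 7 | 390.0 | 27.6 | 30.0 | 35.0 | 36.2 | 13.4 |
| 8 | 450.0 | 27.6 | 30.0 | 35.1 | 39.9 | 13.8 |
| 9 | 500.0 | 28.5 | 30.7 | 36.2 | 39.3 | 13.7 |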


**Step 3:** Extract all the measurements as a 2D NumPy array and assign it to `samples` (hint: use the `.values` attribute of `df`).

In [26]:
samples = df.values
samples

array([[ 242. ,   23.2,   25.4,   30. ,   38.4,   13.4],
       [ 290. ,   24. ,   26.3,   31.2,   40. ,   13.8],
       [ 340. ,   23.9,   26.5,   31.1,   39.8,   15.1],
       [ 363. ,   26.3,   29. ,   33.5,   38. ,   13.3],
       [ 430. ,   26.5,   29. ,   34. ,   36.6,   15.1],
       [ 450. ,   26.8,   29.7,   34.7,   39.2,   14.2],
       [ 500. ,   26.8,   29.7,   34.5,   41.1,   15.3],
       [ 390. ,   27.6,   30. ,   35. ,   36.2,   13.4],
       [ 450. ,   27.6,   30. ,   35.1,   39.9,   13.8],
       [ 500. ,   28.5,   30.7,   36.2,   39.3,   13.7],
       [ 475. ,   28.4,   31. ,   36.2,   39.4,   14.1],
       [ 500. ,   28.7,   31. ,   36.2,   39.7,   13.3],
       [ 500. ,   29.1,   31.5,   36.4,   37.8,   12. ],
       [ 600. ,   29.4,   32. ,   37.2,   40.2,   13.9],
       [ 600. ,   29.4,   32. ,   37.2,   41.5,   15. ],
       [ 700. ,   30.4,   33. ,   38.3,   38.8,   13.8],
       [ 700. ,   30.4,   33. ,   38.5,   38.8,   13.5],
       [ 610. ,   30.9,   33.5,
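
As a side note, newer pandas versions also offer `DataFrame.to_numpy()` for the same extraction. The sketch below uses a tiny made-up frame (`toy_df` is a hypothetical name) to check that both approaches give the same 2D array for purely numeric data.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with numeric columns only
toy_df = pd.DataFrame({'weight': [242.0, 290.0], 'length1': [23.2, 24.0]})

via_values = toy_df.values        # attribute used in this exercise
via_to_numpy = toy_df.to_numpy()  # method-based alternative

print(via_to_numpy.shape)  # (rows, columns)
print(np.array_equal(via_values, via_to_numpy))
```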

**Step 4:** Perform the necessary imports:

- `make_pipeline` from `sklearn.pipeline`.
- `StandardScaler` from `sklearn.preprocessing`.
- `KMeans` from `sklearn.cluster`.


In [27]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

**Step 5:** Create an instance of `StandardScaler` called `scaler`.

In [28]:
scaler = StandardScaler()
scaler

StandardScaler(copy=True, with_mean=True, with_std=True)
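
To make the effect of the scaler concrete, here is a minimal sketch on a small illustrative array (`toy` is an assumed name, and the three columns are just a subset of the measurements). It compares `StandardScaler` with the manual calculation of subtracting each column's mean and dividing by its standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative rows: [weight, one length, height ratio]
toy = np.array([
    [242.0, 23.2, 38.4],
    [290.0, 24.0, 40.0],
    [700.0, 30.4, 38.8],
])

# fit_transform learns each column's mean and standard deviation, then rescales
scaled = StandardScaler().fit_transform(toy)

# The same calculation by hand (population standard deviation, ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the distance calculation.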

**Step 6:** Create an instance of `KMeans` with `4` clusters called `kmeans`.

In [29]:
# random_state fixes the seed so repeated runs give the same clusters
kmeans = KMeans(n_clusters=4, random_state=0)
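
The `random_state` argument is not required by the step itself; it is there so that results are reproducible. The sketch below illustrates this on synthetic data. The two blobs, the variable names, and the explicit `n_init=10` are assumptions chosen for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two well-separated blobs of 20 points each
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal(0.0, 0.5, size=(20, 2)),
    rng.normal(10.0, 0.5, size=(20, 2)),
])

# Two separate models with the same seed produce identical labels
labels_a = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(toy)
labels_b = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(toy)

print(np.array_equal(labels_a, labels_b))
```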


**Step 7:** Create a pipeline called `pipeline` that chains `scaler` and `kmeans`. To do this, you just need to pass them in as arguments to `make_pipeline()`.

In [30]:
# make_pipeline takes the steps as positional arguments and names them automatically
pipeline = make_pipeline(scaler, kmeans)
pipeline

Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('kmeans',
                 KMeans(algorithm='auto', copy_x=True, init='k-means++',
                        max_iter=300, n_clusters=4, n_init=10, n_jobs=None,
                        precompute_distances='auto', random_state=0, tol=0.0001,
                        verbose=0))],
         verbose=False)
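
As the output shows, `make_pipeline` derives the step names from the lowercased class names. If explicit step names are preferred, the `Pipeline` class accepts a list of `(name, estimator)` tuples. The following is a minimal sketch; the names `'scaler'` and `'kmeans'` and the explicit `n_init=10` are choices made for this illustration.

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline: step names are generated from the class names
auto_named = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)
auto_names = [name for name, _ in auto_named.steps]
print(auto_names)

# Pipeline: step names are given explicitly as (name, estimator) tuples
explicit = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('kmeans', KMeans(n_clusters=4, n_init=10, random_state=0)),
])
explicit_names = [name for name, _ in explicit.steps]
print(explicit_names)

# Individual steps can be looked up by name
print(explicit.named_steps['kmeans'].n_clusters)
```

Both objects behave the same way when fitted; the only difference is how the steps are labeled.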

**Great job!** Now you're all set to transform the fish measurements and perform the clustering.  Let's get to it in the next exercise!