# Fish dataset for clustering analysis

The original dataset comes from the [Journal of Statistics Education](http://ww2.amstat.org/publications/jse/jse_data_archive.htm) data archive.
DataCamp reconfigured it and published it as [Fish Measurements](https://assets.datacamp.com/production/repositories/655/datasets/fee715f8cf2e7aad9308462fea5a26b791eb96c4/fish.csv).

In [6]:
# Get dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

In [7]:
# Column names derived from the JSE website documentation
cols = ['species','weight_grams','length1','length2','length3','height','width']

In [9]:
df = pd.read_csv('https://assets.datacamp.com/production/repositories/655/datasets/fee715f8cf2e7aad9308462fea5a26b791eb96c4/fish.csv', names=cols, header=None)

In [10]:
df.head()

,species,weight_grams,length1,length2,length3,height,width
0,Bream,242.0,23.2,25.4,30.0,38.4,13.4
1,Bream,290.0,24.0,26.3,31.2,40.0,13.8
2,Bream,340.0,23.9,26.5,31.1,39.8,15.1
3,Bream,363.0,26.3,29.0,33.5,38.0,13.3
4,Bream,430.0,26.5,29.0,34.0,36.6,15.1


In [11]:
samples = df.drop('species', axis=1).values

In [13]:
species = df['species'].values

In [17]:
df.species.value_counts()

Bream    34
Roach    20
Pike     17
Smelt    14
Name: species, dtype: int64

## Cluster fish data after scaling using a pipeline

In [15]:
# import dependencies
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [16]:
# Instantiate scaler
scaler = StandardScaler()
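
To see why scaling comes first, the sketch below applies `StandardScaler` to the five Bream rows shown in the `df.head()` output above. The array name `head_samples` is introduced here for illustration only. It suggests that, unscaled, the weight column has a far larger spread than the other measurements, so it would dominate the Euclidean distances that KMeans relies on.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five rows from df.head() above, with the species column dropped
# (head_samples is an illustrative name, not part of the notebook)
head_samples = np.array([
    [242.0, 23.2, 25.4, 30.0, 38.4, 13.4],
    [290.0, 24.0, 26.3, 31.2, 40.0, 13.8],
    [340.0, 23.9, 26.5, 31.1, 39.8, 15.1],
    [363.0, 26.3, 29.0, 33.5, 38.0, 13.3],
    [430.0, 26.5, 29.0, 34.0, 36.6, 15.1],
])

# Standardize each column to zero mean and unit (population) variance
scaled = StandardScaler().fit_transform(head_samples)

# Per-column spread before scaling: weight_grams is by far the largest
print(head_samples.std(axis=0).round(2))

# After scaling, every column has mean ~0 and standard deviation ~1
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

Five rows of a single species are only enough to show the mechanics; in the notebook the scaler is fitted on all samples inside the pipeline.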

In [18]:
# Instantiate KMeans with 4 clusters, since value_counts above shows 4 species
kmeans = KMeans(n_clusters=4)

In [20]:
# Create pipeline using 'make_pipeline'
pipeline = make_pipeline(scaler, kmeans)

In [21]:
# Fit the pipeline to samples
pipeline.fit(samples)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('kmeans', KMeans(n_clusters=4))])
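
As a rough sketch of what the fitted pipeline does, the example below compares a pipeline with the same two steps run by hand: fit the scaler, transform the samples, then fit KMeans on the scaled values. It uses synthetic data from `make_blobs` as a stand-in for the fish measurements, and it fixes `random_state` and `n_init` so the two runs are comparable. Both choices are assumptions made for this illustration and are not part of the notebook above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the fish samples (assumption: 85 rows, 6 features)
X, _ = make_blobs(n_samples=85, centers=4, n_features=6, random_state=0)

# Pipeline version: scaling and clustering happen in one fit/predict call
pipe = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)
pipe_labels = pipe.fit(X).predict(X)

# Manual version: the same two steps written out explicitly
manual_scaler = StandardScaler()
X_scaled = manual_scaler.fit_transform(X)
manual_kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
manual_labels = manual_kmeans.fit(X_scaled).predict(X_scaled)

# With identical settings, both routes should assign the same labels
print(np.array_equal(pipe_labels, manual_labels))
```

The pipeline form is mainly a convenience: it keeps the scaler and the clusterer together, so new samples passed to `predict` are scaled with the statistics learned during `fit`.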

In [22]:
# Calculate the cluster labels: labels
labels = pipeline.predict(samples)

In [23]:
# Create a DataFrame with labels and species as columns: df
df = pd.DataFrame({'labels':labels,'species':species})

# Create crosstab: ct
ct = pd.crosstab(df['labels'],df['species'])

In [24]:
ct

labels,Bream,Pike,Roach,Smelt
0,1,0,19,1
1,33,0,1,0
2,0,17,0,0
3,0,0,0,13
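
One way to summarize the crosstab is cluster purity: the share of samples that fall in the dominant species of their cluster. The sketch below rebuilds the table from the counts shown above rather than rerunning the clustering. Note that the label numbers are arbitrary and can differ between runs, because `KMeans` was created without a fixed `random_state`.

```python
import pandas as pd

# Counts copied from the crosstab output above
ct = pd.DataFrame(
    {
        "Bream": [1, 33, 0, 0],
        "Pike": [0, 0, 17, 0],
        "Roach": [19, 1, 0, 0],
        "Smelt": [1, 0, 0, 13],
    },
    index=pd.Index([0, 1, 2, 3], name="labels"),
)
ct.columns.name = "species"

# Dominant species per cluster label
dominant = ct.idxmax(axis=1)

# Purity: samples in their cluster's dominant species / all samples
purity = ct.max(axis=1).sum() / ct.values.sum()

print(dominant.to_dict())  # {0: 'Roach', 1: 'Bream', 2: 'Pike', 3: 'Smelt'}
print(round(purity, 3))    # 0.965  (82 of 85 samples)
```

Here only three fish sit outside their cluster's dominant species: one Bream and one Smelt in the Roach cluster, and one Roach in the Bream cluster.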
