# Clustering

## About the module:

Clustering is an unsupervised machine learning methodology for grouping and identifying similar objects, people, or observations.

Clustering is often used as a preprocessing or exploratory step in the data science pipeline. As a preprocessing step, the cluster that each item is assigned to can become a feature for a supervised model.

In this module, you will be introduced to various clustering algorithms and learn why and when to use them. You will learn how to use clustering methods to identify similar groups using Python and Scikit-Learn. You will learn how to apply these clusters further down the pipeline.

## Learning

1. Goals

    * Understand how clustering is used across multiple industries and use cases.
    * Understand how the k-means clustering algorithm works.
    * Understand how to use k-means clustering in Python.

2. Skills

    * Cluster observations of a dataset into meaningful groups.
    * Estimate parameters for the clustering algorithm.
    * Use clustering as a preprocessing step: create a feature that labels each observation with its cluster. These clusters may turn out to be significant drivers of the target outcome you wish to predict.

3. Methods/Tools

    * sklearn.metrics.pairwise_distances
    * sklearn.cluster.KMeans
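
As a rough illustration of how the two tools listed above can fit together, the sketch below clusters a small made-up dataset with `KMeans`, stores the cluster label as a new column, and uses `pairwise_distances` to measure how far each observation is from each cluster center. The data, the column names, and the choice of `n_clusters=2` are assumptions made for this example only.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

# Hypothetical toy data: two well-separated groups of observations.
df = pd.DataFrame({
    "x": [1.0, 1.2, 0.8, 8.0, 8.2, 7.9],
    "y": [1.1, 0.9, 1.0, 8.1, 7.8, 8.3],
})
features = df[["x", "y"]]

# Fit k-means; n_clusters=2 is an assumption that suits this toy data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)

# The cluster label becomes a feature that a later model could use.
df["cluster"] = kmeans.fit_predict(features)

# Euclidean distance from every observation to every cluster center.
distances = pairwise_distances(features, kmeans.cluster_centers_)

print(df)
print(distances.round(2))
```

On real data the number of clusters is rarely this obvious, which is why estimating that parameter is listed as a separate skill.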


# About Clustering

# Data Wrangling

### Learning Goals

* Acquire a sample of data from SQL.
* Identify null values and decide which nulls are 'deal-breakers' (rows to remove), which should be represented by 0, and which should be replaced by a value from another method, such as the mean.
* Identify outliers and decide what to do with them, if anything (remove, keep as-is, replace).
* Data structure: aggregate as needed so that every row is an observation and every column is a variable (a single variable, not a measurement).
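
The three ways of treating nulls described above could look something like the following in pandas. This is a minimal sketch: the DataFrame, its column names, and the decision about which column gets which treatment are all hypothetical.

```python
import pandas as pd

# Hypothetical sample; column names and values are made up for illustration.
df = pd.DataFrame({
    "customer_id": [1, 2, None, 4],
    "num_purchases": [3, None, 2, 5],
    "age": [30.0, None, 45.0, 25.0],
})

# How many nulls does each column have?
null_counts = df.isnull().sum()
print(null_counts)

# 'Deal-breaker' nulls: drop rows that are missing the identifier.
cleaned = df.dropna(subset=["customer_id"]).copy()

# Nulls that are assumed to mean "none": represent them as 0.
cleaned["num_purchases"] = cleaned["num_purchases"].fillna(0)

# Nulls replaced by a computed value, here the mean of the remaining rows.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].mean())

print(cleaned)
```

Which treatment is appropriate depends on what a null means in each column, so the mapping used here should not be read as a general rule.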


# Class Discussion

## Plan for data prep:

1. Acquire and Summarize the data
    * How many records do we have? 
    * What data types do we have? 
    * Are there columns that should become multiple columns?
        * Is a column a string with delimiters?
    * Are there many columns that should become two columns (`.melt`)?
    * What exactly is an observation?
    * What does each row represent?
1. Handle nulls
    * When to remove?
        * The row (individual observation)
        * The column (the entire feature)
    * When to replace nulls?
    * What to replace nulls with? 
1. Handle Anomalies/Outliers
    * Our first anomaly filter is the IQR rule
    * We'll do a much deeper dive into outliers/anomalies in a later module
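
The IQR rule mentioned in the plan is commonly written as: flag any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The sketch below assumes that common form; the helper name `iqr_bounds`, the multiplier default `k=1.5`, and the sample values are illustrative choices, not something prescribed by this module.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences for the IQR rule with multiplier k."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical values with one obvious anomaly.
values = pd.Series([10, 12, 11, 13, 12, 11, 100], name="amount")

lower, upper = iqr_bounds(values)
is_outlier = (values < lower) | (values > upper)

print(f"fences: {lower} to {upper}")
print(values[is_outlier])
```

Flagging is kept separate from removal here so that the decision to remove, keep, or replace an outlier can be made afterwards.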

```python
import pandas as pd
import numpy as np
```
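
With pandas imported, the "acquire and summarize" questions from the plan can be explored along these lines. The DataFrame below is a made-up stand-in for data acquired from SQL, and every column name in it is hypothetical.

```python
import pandas as pd

# Hypothetical wide-format sample; names and values are made up.
df = pd.DataFrame({
    "student": ["ana", "ben"],
    "name_city": ["Ana Lopez|Austin", "Ben Kim|Dallas"],
    "math": [90, 85],
    "reading": [88, 92],
})

# How many records do we have, and what data types?
n_records = len(df)
print(n_records)
print(df.dtypes)

# A delimited string column can become multiple columns.
df[["full_name", "city"]] = df["name_city"].str.split("|", expand=True)

# Many columns can become two columns (variable and value) with melt.
long_df = df.melt(
    id_vars=["student"],
    value_vars=["math", "reading"],
    var_name="subject",
    value_name="score",
)
print(long_df)
```

After the melt, each row represents one student's score in one subject, which is one possible answer to "what exactly is an observation?" for this toy data.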

# Exploration



Exploratory data analysis is important because it gives us the insights we need to build better models.