# Clustering Overview
Unsupervised Learning

## Machine Learning:
   - Supervised
       - Classification
       - Regression
   - Unsupervised
       - Clustering
       - Dimensionality Reduction

- Intelligence can be viewed as the capability of grouping similar objects
- Clustering groups "unlabeled" data into "clusters" (groups) of similar observations

### Clustering Methodologies:
- Partition-based clustering
    - K-Means is most common
- Hierarchical clustering
- Density-based clustering
    - Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

<hr style="border:1px solid black"> </hr>

### Example:
- Goal: Target customers using 3 different marketing strategies
- Data we have: 1- age, 2- engagement score

<br>

Customers:
   - 1- age: 21, eng: 15 --|
   - 2- age: 22, eng: 10 --| Group 1 
   - 3- age: 19, eng: 8  --| 
    ___________________
   - 4- age: 44, eng: 28  --|
   - 5- age: 40, eng: 25  --| Group 2
   - 6- age: 38, eng: 30  --|
    ____________________
   - 7- age: 56, eng: 14 --|
   - 8- age: 54, eng: 12 --| Group 3
 
<br> 
- clustering groups the 8 customers into 3 groups of like items
- in K-Means, you tell the model how many groups to create
- in some other models (such as DBSCAN), the algorithm determines the number of groups from the data

- K-Means is distance-based, so you must scale your data first
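
Why scaling matters: K-Means compares points by distance, so a feature with a larger numeric range can dominate the result. A minimal sketch, assuming min-max scaling (one common choice; these notes do not say which scaler to use), applied to the eight customers above. The helper name `min_max_scale` is made up for illustration.

```python
# Min-max scaling sketch: rescale each feature to the range [0, 1].
# The scaler choice is an assumption; the notes only say the data must be scaled.
customers = [
    (21, 15), (22, 10), (19, 8),     # Group 1 (age, engagement score)
    (44, 28), (40, 25), (38, 30),    # Group 2
    (56, 14), (54, 12),              # Group 3
]

def min_max_scale(rows):
    """Scale every column of `rows` to [0, 1] independently.

    Assumes each column has at least two distinct values
    (otherwise max - min would be zero).
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((value - lo) / (hi - lo) for value, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

scaled = min_max_scale(customers)
for original, row in zip(customers, scaled):
    print(original, "->", tuple(round(v, 2) for v in row))
```

After scaling, the youngest customer has age 0.0 and the oldest has age 1.0, so age and engagement score contribute on the same footing.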


<hr style="border:1px solid black"> </hr>

### Clustering: Use Cases

- **Customer Segmentation**:
- **Recommendation engine**: like Netflix movie recommendation
- **Document Classification**: you have a collection of articles and want to divide them into groups, for example based on how frequently each word appears
- **Anomaly detection**: the algorithm forms clusters and the outliers are your anomalies. If a point is not part of a cluster, it is an anomaly
- **Patient clustering**: demographic info from patients in epidemiology. Same symptoms for several patients? What features do they have in common? (same food, same exercise, etc.)
- **Survey response clustering**: for polling. A political party might want to do targeted marketing. 5-point scale - 1: strongly disagree to 5: strongly agree.
- **Financial sector**: looking at companies' revenue and profits to group them into growth, value, and dividend companies

<hr style="border:2px solid black"> </hr>

## Centroid Clustering: KMeans

- have to begin with knowing how many clusters you want
    - (ex): trying to find a species of bacteria, you already know there are only 5 species.

#### Example:
- you're a pet store owner and want two clusters of customers based on which pet they like
##### Step 1:
- place two X's (centroids) at random positions (dogs vs cats)
- determine which X each point is closest to
- partition the data into 2 clusters
##### Step 2:
- take the mean of the two data points in cluster 1, then move that cluster's X to the mean
- take the mean of the four data points in cluster 2, then move that cluster's X to the mean
- reassign each point to its closest X
- partition the data again into 2 clusters
##### Step 3:
- repeat the process until the X's stop moving and the points no longer change clusters
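
The three steps can be sketched in plain Python. This is an illustrative sketch, not a library implementation: the toy points, the fixed starting X's, and the helper names (`closest`, `mean_point`, `kmeans`) are all assumptions made for the example. In practice the starting X's are chosen at random.

```python
import math

def closest(point, centroids):
    """Index of the centroid nearest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def kmeans(points, centroids, max_iter=100):
    """Alternate between assigning points and moving centroids until stable."""
    for _ in range(max_iter):
        # Step 1: assign each point to its closest centroid
        labels = [closest(p, centroids) for p in points]
        # Step 2: move each centroid to the mean of its assigned points
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            new_centroids.append(mean_point(members) if members else old)
        # Step 3: stop once the centroids no longer move
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two points in one group and four in the other
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0), (9.5, 8.0)]
# Fixed starting centroids so the run is repeatable
labels, centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(labels)     # [0, 0, 1, 1, 1, 1]
print(centroids)  # [(1.25, 1.5), (8.75, 8.625)]
```

With these starting positions the assignments settle after the first pass, and the second pass only confirms that the X's have stopped moving.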

#### Pros:
- simple algorithm

#### Cons:
- difficult to choose what K to use (how many clusters)

<hr style="border:1px solid black"> </hr>

## Density Clustering: DBSCAN

- do not need to specify the number of clusters
- works well when clusters are dense regions separated by sparser areas
<br>

- starts with any random data point
- set the hyperparameters
    - epsilon: the distance (radius) to search around a point
    - minPoints: the minimum number of data points required within that distance
- ask whether at least minPoints points are within epsilon of the point; if yes, you have a cluster
- will likely have outliers (AKA anomalies, AKA noise)

#### Pros:
- identifies outliers
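
A rough sketch of the DBSCAN idea in plain Python, to show how epsilon and minPoints interact. The toy points and the helper names (`region_query`, `dbscan`) are assumptions for illustration, and this sketch counts a point as its own neighbor when checking minPoints.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within `eps` of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_points):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)   # None = not visited yet
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_points:
            labels[i] = NOISE       # may still be claimed later as a border point
            continue
        cluster += 1                # points[i] is dense enough: start a cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense itself
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # j is dense too: keep expanding
    return labels

# Hypothetical toy data: two tight groups and one isolated point
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
          (9.0, 1.0)]
labels = dbscan(points, eps=0.5, min_points=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point at (9.0, 1.0) never has enough neighbors within epsilon, so it stays labeled as noise, which is how the outliers get identified.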


<hr style="border:1px solid black"> </hr>

## Hierarchical Clustering

- Each data point starts as a cluster in itself
- The most similar clusters get merged together
    - groups of similar clusters then get clustered together
- Hyperparameters:
    - Linkage: determines which distance to use between sets of observations
        - single linkage: uses minimum distance
        - complete or maximum linkage: uses maximum distance
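
To make the linkage choice concrete, here is a small sketch that measures the distance between two clusters both ways. The two toy clusters and the function names are assumptions for illustration; Euclidean distance is assumed as the underlying metric.

```python
import math
from itertools import product

def single_linkage(cluster_a, cluster_b):
    """Distance between the two closest points, one from each cluster."""
    return min(math.dist(a, b) for a, b in product(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    """Distance between the two farthest points, one from each cluster."""
    return max(math.dist(a, b) for a, b in product(cluster_a, cluster_b))

# Hypothetical toy clusters
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

print(single_linkage(cluster_a, cluster_b))    # 3.0
print(complete_linkage(cluster_a, cluster_b))  # about 4.12, i.e. sqrt(17)
```

The same pair of clusters looks closer under single linkage than under complete linkage, so the linkage setting can change which clusters get merged first.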

<hr style="border:3px solid black"> </hr>

# Data Wrangling

In [None]:
#acquire and summarize

In [10]:
import acquire
import pandas as pd

In [8]:
df = acquire.get_mallcustomer_data()

In [11]:
#get nulls by column
def nulls_by_col(df):
    num_missing = df.isnull().sum()
    rows = df.shape[0]
    prcnt_miss = num_missing / rows * 100
    cols_missing = pd.DataFrame({'num_rows_missing': num_missing, 'percent_rows_missing': prcnt_miss})
    return cols_missing

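#get nulls by row
def nulls_by_row(df):
    #count missing columns per row, then count how many rows share each total
    num_missing = df.isnull().sum(axis=1)
    prcnt_miss = num_missing / df.shape[1] * 100
    rows_missing = pd.DataFrame({'num_cols_missing': num_missing, 'percent_cols_missing': prcnt_miss})\
    .reset_index()\
    .groupby(['num_cols_missing', 'percent_cols_missing']).count()\
    .rename(index=str, columns={'index':'num_rows'}).reset_index()
    #note: the rename to 'num_rows' only applies when the df index is unnamed;
    #a named index (such as customer_id) keeps its own name as the count column
    return rows_missing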


In [12]:
nulls_by_row(df)

   num_cols_missing  percent_cols_missing  customer_id
0                 0                   0.0          200


In [18]:
#summarize data in the df
#head, info, describe, value counts, nulls

def summarize(df):
    '''
    this function will take in a single argument (a pandas df) 
    output to console various statistics on said dataframe, including:
    #.head()
    #.info()
    #.describe()
    #.value_counts()
    #observation of nulls in the dataframe
    '''
    #print head
    print('=================================================')
    print('Dataframe head: ')
    print(df.head(3))
    
    #print info
    print('=================================================')
    print('Dataframe info: ')
    print(df.info())
    
    #print descriptive stats
    print('=================================================')
    print('Dataframe Description: ')
    print(df.describe())
    num_cols = [col for col in df.columns if df[col].dtype != 'O']
    cat_cols = [col for col in df.columns if col not in num_cols]
    
    #print value counts
    print('=================================================')
    print('Dataframe value counts: ')
    for col in df.columns:
        if col in cat_cols:
            print(df[col].value_counts())
        else:
            print(df[col].value_counts(bins=10, sort = False))
    
    #print nulls by column
    print('=================================================')
    print('nulls in dataframe by column: ')
    print(nulls_by_col(df))
    
    #print nulls by row
    print('=================================================')
    print('nulls in dataframe by row: ')
    print(nulls_by_row(df))
    print('=================================================')


In [19]:
summarize(df)

Dataframe head: 
             gender  age  annual_income  spending_score
customer_id                                            
1              Male   19             15              39
2              Male   21             15              81
3            Female   20             16               6
Dataframe info: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 1 to 200
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   gender          200 non-null    object
 1   age             200 non-null    int64 
 2   annual_income   200 non-null    int64 
 3   spending_score  200 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 7.8+ KB
None
Dataframe Description: 
              age  annual_income  spending_score
count  200.000000     200.000000      200.000000
mean    38.850000      60.560000       50.200000
std     13.969007      26.264721       25.823522
min     18.000000      15.000000        1.0