# Regression
- Can we predict one variable from other variables?
- Regression predicts a continuous value
- Dependent variable (the target we want to predict, Y); independent variables (the predictors or features, X)

## Applications
- Sales forecasting
- Satisfaction Analysis 
- Price estimation
- Employment Income

## Simple Regression
  Using one variable to predict the value of another variable
1. Simple Linear Regression
2. Simple Non-linear Regression

## Multiple Regression
  Using more than one variable to predict the value of another variable
1. Multiple Linear Regression
2. Multiple Non-Linear Regression 

# Simple Linear Regression
- Predict CO2 emissions (a continuous value) from a single feature such as engine size (using engine size, cylinders and fuel consumption together is multiple linear regression)
- The dependent variable must be continuous; independent variables can be continuous, discrete, or categorical

## Changes in one variable should explain other variable
- $\hat{y} = \theta_0 + \theta_1 x_1$
1. $\hat{y}$ is the predicted value of the response (dependent) variable
2. $x_1$ is a single predictor
3. $\theta_0$ is the intercept
4. $\theta_1$ is the slope
5. $\theta_0$ and $\theta_1$ are also called the coefficients of the linear equation

## Finding line of best fit:
- The most important part of linear regression is finding $\theta_0$ and $\theta_1$
- How can we adjust the parameters so that the line best fits the data?
### How to find best fit ?
1. Take a data point with independent variable $x_1$ = 5.4 and actual dependent variable y = 250
2. The line predicts $\hat{y}$ = 340 (i.e. after applying $\hat{y} = \theta_0 + \theta_1 x_1$)
3. Find Residual Error:
   - *Distance from data point to fitted regression line*
   - Error = $y - \hat{y}$ = 250 - 340 = -90
4. Find Mean Squared Error:
   - *Mean of all squared residual errors; shows how poorly the line fits the data set*
   - MSE = $\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$

#### Note: Objective of Linear Regression is to minimize the MSE equation 

### Mathematical Approach: 
1. Calculate $\bar{x}$ (average value of the independent variable)
2. Calculate $\bar{y}$ (average value of the dependent variable)
3. Plug $\bar{x}$ and $\bar{y}$ into the slope equation to find $\theta_1$: $\theta_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$
4. Find the intercept from the slope and the two averages: $\theta_0 = \bar{y} - \theta_1 \bar{x}$
#### Note: $\theta_0$ is also called the bias coefficient and $\theta_1$ is the coefficient for the independent variable column
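
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `fit_simple_linear_regression` and the engine size / CO2 numbers are made up for the example, and the data is chosen to lie exactly on a line so the result is easy to check by hand.

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Return (theta0, theta1) minimizing the MSE of y ~ theta0 + theta1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()          # steps 1 and 2: the two averages
    # Step 3: slope from the deviations around the averages
    theta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Step 4: intercept so that the line passes through (x_bar, y_bar)
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1

# Hypothetical data lying exactly on y = 100 + 40 * x
engine_size = [1.0, 2.0, 3.0, 4.0]
co2 = [140.0, 180.0, 220.0, 260.0]

theta0, theta1 = fit_simple_linear_regression(engine_size, co2)
y_hat = theta0 + theta1 * np.asarray(engine_size)
mse = np.mean((np.asarray(co2) - y_hat) ** 2)   # 0 here because the fit is exact
print(theta0, theta1, mse)
```

On real, noisy data the MSE would not reach zero; the closed-form solution only guarantees that no other straight line has a lower MSE.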

### Advantage of Linear Regression
   - Very Fast
   - No Parameter Tuning Needed
   - Easy to understand and highly interpretable 

# Model Evaluation in Regression Models
## Accuracy of Model
### Train and test on same dataset
- The model is trained on the entire dataset
- The test set is a small portion of that same dataset, passed to the model without its labels (Note: the test set does have labels, but they are not used for prediction)
- Labels are used only as ground truth
- The labels are called the "actual values" of the test set, and the values the model predicts for the test set are called the "predicted values"

#### Measuring model accuracy
- Difference between actual and predicted values
- Error = $\frac{1}{n} \sum_{j=1}^n |y_j - \hat{y}_j|$

##### Common Pitfalls to Note:
- **High "Training accuracy"**
  - High training accuracy is not necessarily a good thing
  - May result in overfitting
    - Overfit: the model is overly trained to the dataset, so it may capture noise and produce a non-generalized model
- **Low "out-of-sample accuracy"**
  - It is important that our models have high out-of-sample accuracy (because our goal is to make accurate predictions on unseen data)
  - How can we improve out-of-sample accuracy? **Use Train/Test Split**

### Train/Test Split
- The train set is one portion of the dataset, used to build the model
- The test set is the other portion of the dataset, passed to the model for prediction
- Compare the predicted values (from the model, for the test set) with the actual values (the test set labels)
- **Mutually Exclusive**

##### Note:
- More accurate evaluation of out-of-sample accuracy
- Train your model with the testing set afterwards, as you don't want to lose potentially valuable data
- The outcome of Train/Test Split is highly dependent on which portions of the data are used for training and testing

### K-fold cross-validation
- If k = 4, we split the dataset into 4 folds
- 1st fold - we use the first 25% of the dataset for testing and the rest for training (the model is built on the training set and evaluated on the test set; accuracy is calculated)
- 2nd fold - the second 25% of the dataset is used for testing and the rest for training (accuracy calculated)
- 3rd fold - the third 25% is used for testing and the rest for training
- 4th fold - the fourth 25% is used for testing and the rest for training

#### Accuracy = average of all four model accuracies
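
A rough sketch of the 4-fold procedure, assuming a straight-line model fitted with `np.polyfit` and mean absolute error as the score. The helper names `k_fold_indices` and `cross_val_mae` are invented for this example, and the folds are taken in order (first 25%, second 25%, ...) as in the description above; in practice the rows are usually shuffled first.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

def cross_val_mae(x, y, k=4):
    """Average out-of-sample MAE of a straight-line fit over k folds."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = []
    for train_idx, test_idx in k_fold_indices(len(x), k):
        # np.polyfit returns the highest-degree coefficient first: [slope, intercept]
        theta1, theta0 = np.polyfit(x[train_idx], y[train_idx], deg=1)
        y_hat = theta0 + theta1 * x[test_idx]
        scores.append(np.mean(np.abs(y[test_idx] - y_hat)))
    return float(np.mean(scores))   # average of the k per-fold scores

# Noiseless toy data, so every fold should score (almost) zero error
x = np.arange(8, dtype=float)
y = 3.0 + 2.0 * x
splits = list(k_fold_indices(len(x), 4))
score = cross_val_mae(x, y, k=4)
print(score)
```

Because each sample lands in exactly one test fold, the averaged score is less sensitive to one lucky or unlucky split than a single train/test split.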

# Evaluation Metrics in Regression Models
- Used to explain performance of a model
1. Mean Absolute Error (MAE)
   - Mean of the absolute value of the errors
2. Mean Squared Error (MSE)
   - Mean of the squared errors
   - Popular because squaring gives large errors more weight
3. Root Mean Squared Error (RMSE)
   - Square root of the mean squared error; expressed in the same units as y
4. Relative Absolute Error (RAE)
   - Total absolute error, normalized by the total absolute error of always predicting $\bar{y}$
5. Relative Squared Error (RSE)
   - Used for the R-Squared value; $R^2$ = (1 - RSE)
   - $R^2$ is a popular metric for measuring the accuracy of the model
   - $R^2$ shows how close the data values are to the fitted regression line
   - A higher $R^2$ means the model fits the data better
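
As an illustration, the five metrics can be computed directly from their definitions. The function name `regression_metrics` and the four toy values are assumptions made for this sketch; the last prediction is deliberately off by 2 so the numbers are easy to verify by hand.

```python
import numpy as np

def regression_metrics(y, y_hat):
    """Compute MAE, MSE, RMSE, RAE, RSE and R^2 from actual and predicted values."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    err = y - y_hat
    y_bar = y.mean()
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    # Relative errors compare the model against always predicting the mean y_bar
    rae = np.sum(np.abs(err)) / np.sum(np.abs(y - y_bar))
    rse = np.sum(err ** 2) / np.sum((y - y_bar) ** 2)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse,
            "RAE": rae, "RSE": rse, "R2": 1.0 - rse}

# Toy values: three perfect predictions and one that is off by 2
m = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(m)
```

For these values MAE is 0.5 and MSE is 1.0, which shows how squaring makes the single large error dominate.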
  

# Multiple Linear Regression
- Predicting CO2 emission based on the engine size and number of cylinders of all cars
### Usage: 
1. To identify the strength of the effect that the independent variables have on the dependent variable
   - Eg: Do lecture attendance and gender have any effect on the exam performance of students?
2. To predict impacts of changes
   - To identify how the dependent variable changes when we change the independent variables
   - Eg: How much a patient's blood pressure increases/decreases for every unit increase/decrease in BMI (holding other factors constant)

#### Note: 
1. In multiple linear regression the dependent variable (Y) is a linear combination of the independent variables (X)
    - $\hat{y} = \theta_0 + \theta_1 \cdot \text{Engine size} + \theta_2 \cdot \text{Cylinders} + \dots$
2. You can identify 
   - which variables are significant to the outcome variable
   - how each feature impacts the outcome variable
3. Predict unknown value 

### Mathematical Notation:
- $\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
- $\hat{y} = \theta^T X$ (i.e. theta transposed times X)
  - $\theta$ is the vector of coefficients in multidimensional space (also called the parameters / weights of the regression equation); with $n$ features it has $n + 1$ entries, $\theta_0, \dots, \theta_n$
  - X is the vector of feature sets 
    - The first value of the feature set is set to 1 to accommodate the bias parameter $\theta_0$

### Important Note:
- Because there are multiple independent variables, the fit is no longer a line, so we call it a plane / hyperplane
- Our goal is to find the best-fit hyperplane for the data

### How to find optimized parameters: 
- Optimized parameters are the ones that minimize the error of the model's hyperplane
##### How it works ?
- $y$ (actual) = 196
- $\hat{y}$ (predicted) = 140
- residual error = $y - \hat{y}$ = 196 - 140 = 56

### Errors:
- Mean Squared Error (MSE): most popular
   - The mean of the squared residual errors over all data points
   - Need to minimize the MSE equation to best fit the model
     - Solution: Find the best parameters (theta)

#### How to estimate theta (parameter):
1. Ordinary Least Squares:
   - Estimates the values of the coefficients by minimizing the mean squared error
   - Treats the data as a matrix and applies linear algebra operations to estimate the optimal values of theta
   - **Downside:** the matrix operations can take a very long time on large datasets
2. Optimization Approach:
   - Iteratively minimizes the error of the model
   - Gradient Descent
     - Starts optimization with random values for each coefficient 
     - Calculates the error and tries to minimize it by changing the values of the coefficients over multiple iterations
     - **Proper approach** for large datasets
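
The two approaches can be compared on a tiny, made-up dataset. This is only a sketch: `fit_ols` relies on `np.linalg.lstsq` as one way of doing the linear algebra, `fit_gradient_descent` uses a fixed learning rate and iteration count that happen to suit this data, and all names and numbers are hypothetical.

```python
import numpy as np

def add_bias(X):
    """Prepend a column of ones so theta[0] acts as the intercept theta_0."""
    return np.column_stack([np.ones(len(X)), X])

def fit_ols(X, y):
    """Ordinary least squares: solve for theta with linear algebra."""
    theta, *_ = np.linalg.lstsq(add_bias(X), y, rcond=None)
    return theta

def fit_gradient_descent(X, y, learning_rate=0.02, n_iters=5000, seed=0):
    """Start from random coefficients and repeatedly step against the MSE gradient."""
    Xb = add_bias(X)
    theta = np.random.default_rng(seed).normal(size=Xb.shape[1])
    m = len(y)
    for _ in range(n_iters):
        gradient = (2.0 / m) * Xb.T @ (Xb @ theta - y)   # derivative of the MSE
        theta = theta - learning_rate * gradient
    return theta

# Hypothetical data generated from y = 1 + 2 * x1 + 3 * x2 (no noise)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

theta_ols = fit_ols(X, y)
theta_gd = fit_gradient_descent(X, y)
print(theta_ols, theta_gd)
```

Both routes should land on roughly the same coefficients here; the difference is cost, since the iterative version never has to factor a large matrix.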

### Steps in prediction making:
1. Find the parameters ($\theta$)
2. Plug them into the linear model equation ($\hat{y} = \theta^T X$)
- Eg. CO2 Emission = 125 + 6.2 * Engine size + 14 * Cylinders + ... = 125 + 6.2 * 2.4 + 14 * 4 + ... = 214.1

### Caution:
1. Adding too many independent variables can result in **overfitting**
2. Should independent variables be continuous?
   - Not necessarily: categorical variables can be coded as numerical dummy values (eg: a binary variable coded as 0 for male and 1 for female)
3. Is there a linear relationship between the dependent variable and each independent variable?
   - Use a scatter plot to visually check the linearity
   - If the relationship is not linear, you should use non-linear regression

# Non Linear Regression
- If the scatter plot of the data shows a curved trend, linear regression may not produce predictions as accurate as non-linear regression
- Eg: There is a strong relationship between the independent variable (Year) and the dependent variable (GDP), but the relationship is non-linear

## What is non linear regression:
- Models a non-linear relationship between the dependent variable and a set of independent variables
- $\hat{y}$ must be a non-linear function of the parameters $\theta$, not necessarily of the features x
- It can be exponential, logistic etc. 
- The change in $\hat{y}$ depends on changes in the parameters $\theta$

### Note: 
- We Cannot use Ordinary Least Squares method to fit data 
- Estimation of parameters is not easy

## Estimation method: 
- Use exponential functions 
#### Polynomial Regressions
- Fits a curved line to the data
- Eg: Third degree polynomial equation ($\hat{y} = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$)
- A polynomial regression model can be expressed/transformed into a linear regression model
  - Eg: $x_1 = x, x_2 = x^2, x_3 = x^3$
    - makes the model into $\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3$
- The degree of the polynomial gives the model its name:
1. Linear Regression
2. Quadratic (parabolic) Regression
3. Cubic Regression
4. etc.
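
To make the transformation concrete, the sketch below builds the columns $1, x, x^2, x^3$ and then fits them with an ordinary linear least-squares solve. The helper names and the cubic used to generate the data are assumptions for illustration only.

```python
import numpy as np

def polynomial_features(x, degree):
    """Build the columns [1, x, x^2, ..., x^degree], i.e. x_1 = x, x_2 = x^2, ..."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** d for d in range(degree + 1)])

def fit_polynomial(x, y, degree):
    """Fit a polynomial by solving a *linear* least-squares problem in theta."""
    X = polynomial_features(x, degree)
    theta, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return theta

# Hypothetical data generated from y = 1 - 2x + 0.5x^3 (no x^2 term, no noise)
x = np.linspace(-2.0, 2.0, 9)
y = 1.0 - 2.0 * x + 0.5 * x ** 3

theta = fit_polynomial(x, y, degree=3)   # expected close to [1, -2, 0, 0.5]
print(theta)
```

The model is a curve in x but still linear in the parameters, which is why the same least-squares machinery applies.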

## How to determine if the problem is linear or non-linear 
- Inspect visually 
  - Calculate the correlation coefficient between the independent and dependent variables; if the value is 0.7 or higher, there is a linear tendency
- Based on accuracy
   - When we cannot accurately model the relationship with linear parameters

## How should I model my data, if it displays non-linear on a scatter plot ?
- Polynomial regression
- Non-linear regression model
- Transform your data

# Classification
- Supervised learning model
- Categorizing unknown items into a discrete set of categories or "classes"
- The target attribute is a categorical variable

- Classification determines label for a test case
  - Eg: 1. Loan default classification (which customers will have problems repaying loans)
    - Build a classifier 
    - Pass data to the model
    - Classify each customer as a defaulter or not a defaulter (binary classifier)
  - Eg: 2. To predict the category to which a customer belongs
  - Eg: 3. Churn detection: to predict whether a customer switches to another product/brand
  - Eg: 4. Email / document classification

## Types:
- Binary classification
- Multi-class classification 

## Algorithms
1. Decision Trees (ID3, C4.5, C5.0)
2. Naive Bayes
3. Linear Discriminant Analysis
4. k-Nearest Neighbor
5. Logistic Regression
6. Neural Networks
7. Support Vector Machines (SVM)

# k-Nearest Neighbors (KNN)
- Given the labelled data, we need to predict the label of unknown case

## Build a classifier
- Eg: Age and Income are used as predictors of customer category

### Note:
  1. Poor Judgement: find the single closest case and assign its class to the new case (1st nearest neighbor); that one neighbor may be an outlier
  2. Good Judgement: look at several nearest neighbors and assign the class that appears most often in the neighborhood

### Definition:
- A method for classifying cases based on their similarity to other cases
- Cases nearer to each other are "neighbors"
- Distance between two cases is the measure of "dissimilarity"/ "similarity"
- Distance is commonly measured using Euclidean distance

### Usage:
1. Pick a value for k
   - How to choose
     - A low value of k (eg: k = 1) produces a highly complex model that captures noise (overfitting)
     - A high value of k produces an overly generalized model (underfitting)
     - *Solution:* Reserve part of the data for testing the accuracy of the model, and pick the k value with the highest accuracy

2. Calculate the distance of the unknown case from all cases
   - Need to normalize the feature set to get an accurate dissimilarity measure
   - The right distance measure is highly dependent on data type and domain knowledge
3. Select the k observations in the training data that are "nearest" to the unknown data point
4. Predict the response of the unknown data point using the most popular response value from the k nearest neighbors
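
The four steps above might look like this in NumPy. It is a sketch under several assumptions: the features are min-max normalized, distance is Euclidean, ties are not handled specially, and the ages, incomes and category names are invented for the example.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training cases."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Normalize every feature to [0, 1] so no single feature dominates the distance
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    X_norm = (X_train - lo) / (hi - lo)
    x_norm = (x_new - lo) / (hi - lo)
    # Step 2: Euclidean distance from the unknown case to every training case
    distances = np.sqrt(np.sum((X_norm - x_norm) ** 2, axis=1))
    # Step 3: indices of the k nearest observations
    nearest = np.argsort(distances)[:k]
    # Step 4: most popular label among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical customers: [age, income in thousands] with a made-up category label
X_train = [[25, 30], [27, 32], [45, 90], [50, 95], [48, 85]]
y_train = ["basic", "basic", "premium", "premium", "premium"]

label_a = knn_predict(X_train, y_train, [47, 88], k=3)
label_b = knn_predict(X_train, y_train, [26, 31], k=3)
print(label_a, label_b)
```

To pick k as described in step 1, one could run this for several values of k on a held-out portion of the data and keep the value with the best accuracy.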

# Evaluation metrics for classification
## Classification accuracy Methods:
- Compare actual labels in test set vs predicted labels by model

1. Jaccard Index / Jaccard similarity coefficient:
- If the model predicts 8 values correctly out of 10, the Jaccard index is $J(y,\hat{y}) = \frac{|y \cap \hat{y}|}{|y| + |\hat{y}| - |y \cap \hat{y}|}$ = 8 / (10 + 10 - 8) = 0.67
- Highest accuracy is 1.0 ; lowest accuracy is 0.0

2. F1-score: (Confusion Matrix):

| Confusion Matrix  | Predicted Label 1 | Predicted Label 0 |
|-------------------| ------------------| ------------------|
|  Actual Label 1   |        6          |        9          |
|  Actual Label 0   |        1          |        24         |

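- Precision = $\frac{TP}{TP + FP}$
- Recall = $\frac{TP}{TP + FN}$
- For the table above, taking label 1 as the positive class: TP = 6, FN = 9, FP = 1, TN = 24, so precision = 6/7 ≈ 0.86, recall = 6/15 = 0.40 and F1 ≈ 0.55

**F1 Score (harmonic mean of precision and recall) = $2 \cdot \frac{precision \cdot recall}{precision + recall}$**
- Highest score is 1.0 (shows the model has perfect precision and recall)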

3. Log loss:
- Sometimes the output of a classifier is the probability of a class label (a value between 0 and 1) instead of the label itself
- Measure the log loss for each row using the log loss equation $-(y \log(\hat{y}) + (1-y) \log(1-\hat{y}))$
- Calculate the average log loss using $-\frac{1}{n} \sum (y \log(\hat{y}) + (1-y) \log(1-\hat{y}))$
- Ideal classifiers have lower log loss (a classifier with lower log loss has better accuracy)
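
The three metrics can be checked numerically with a short sketch. The confusion-matrix counts come from the table above with label 1 treated as the positive class, the Jaccard line reuses the 8-out-of-10 example, and the probabilities passed to `log_loss` are made up. The natural logarithm is assumed, and the `eps` clipping is only there to avoid `log(0)`.

```python
import math

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def log_loss(y_true, y_prob, eps=1e-15):
    """Average log loss for binary labels and predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)          # keep log() away from 0
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Counts read off the confusion matrix, with label 1 as the positive class
tp, fn, fp, tn = 6, 9, 1, 24
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

# Jaccard index for 8 correct predictions out of 10
jaccard = 8 / (10 + 10 - 8)

# Hypothetical probabilities: a confident classifier vs. an unsure one
loss_confident = log_loss([1, 0], [0.9, 0.1])
loss_unsure = log_loss([1, 0], [0.6, 0.4])
print(precision, recall, f1, jaccard, loss_confident, loss_unsure)
```

Both hypothetical classifiers would predict the same labels at a 0.5 threshold, yet the confident one has the lower log loss, which is the extra information this metric captures.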

# Decision Trees
- Eg: Which drug is appropriate for a new patient, based on other patients' data?
- Building decision tree with training set
- Testing an attribute and branching the cases
### Interpretation
1. Each internal node corresponds to a test (Eg: Sex)
2. Each branch corresponds to a result of the test (Eg: Male)
3. Each leaf node assigns a patient to a class (Eg: Drug B)

### Building a decision tree algorithm
1. Choose an attribute from your dataset
2. Calculate the significance of the attribute in splitting the data
   - Recursive partitioning
      - Which attribute gives the best prediction?
      - The selected attribute should have a defined outcome
      - Results in the nodes should be mostly pure (a pure node means 100% of the cases in the node fall into one specific category)
      - Key factors (more predictiveness, less impurity in nodes, lower entropy)
      - If the outcome after splitting on an attribute is still uncertain, split that branch further into a "sub-tree"
    - Impurity of a node:
      - Must be decreased by each sub-tree
      - Calculated by entropy of the data
    - Entropy:
      - Amount of information disorder or amount of randomness in data
      - Used to calculate the homogeneity of the samples in a node
        - If the samples are completely homogeneous, the entropy is 0
        - If the samples are equally divided, the entropy is 1
        - 1 Drug A 7 Drug B - low entropy
        - 3 Drug A 5 Drug B - High entropy
        - 0 Drug A 8 Drug B - 0 entropy
        - 4 Drug A 4 Drug B - 1 entropy
      - Entropy is measured with the formula $-p(A)\log_2(p(A)) - p(B)\log_2(p(B))$

 Note: When choosing which attribute to split on, use "information gain"
 - Information gain is the information that can increase the level of certainty after splitting
 - Information gain = (Entropy before split) - (weighted entropy after split)
   - $Weighted~entropy~after~split = \frac {no.~of~females}{total~patients} * entropy~of~female~node + \frac {no.~of~males} {total~patients} * entropy~of~male~node$
 - As weighted entropy decreases, information gain increases

3. Split data based on the value of the best attribute
4. Go back to step 1 and repeat for each branch with the remaining attributes
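
A small sketch of the entropy and information gain calculations, using base-2 logarithms so that an evenly split two-class node has entropy 1. The function names are made up, the class counts echo the Drug A / Drug B examples above, and the two candidate splits at the end are hypothetical.

```python
import math

def entropy(counts):
    """Entropy (base 2) of a node, given the number of cases in each class."""
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:                      # a class with no cases contributes 0
            p = c / total
            result -= p * math.log2(p)
    return result

def information_gain(parent_counts, children_counts):
    """Entropy before the split minus the weighted entropy after the split."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Node examples as [Drug A, Drug B] counts
print(entropy([1, 7]))   # low entropy
print(entropy([3, 5]))   # high entropy
print(entropy([0, 8]))   # pure node
print(entropy([4, 4]))   # evenly divided node

# Hypothetical splits of a [4, 4] parent node
gain_perfect = information_gain([4, 4], [[4, 0], [0, 4]])   # children are pure
gain_useless = information_gain([4, 4], [[2, 2], [2, 2]])   # children as mixed as parent
```

An attribute whose split gives pure children recovers all of the parent's entropy as gain, while a split that leaves the children as mixed as the parent gains nothing.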

# Logistic Regression
- Used for classification, where the target is a categorical variable
- Uses one or more independent variables to predict the dependent variable
- Independent variables should be continuous (categorical variables need to be converted to dummy or indicator values)
- Used for both binary and multi-class classification

### Applications:
1. Predict probability of heart attack based on BMI and age etc.
2. Predict default on a mortgage

### When to use:
1. Target field is categorical / binary
2. If you need probability results (eg: probability of buying a product)
3. When you need a linear decision boundary
4. If you need to understand the impact of a feature

### Mathematical representation
- $\hat{y} = P(y = 1 | x)$
- $P(y = 0 | x) = 1- P(y = 1 |x)$

# Logistic regression vs linear regression
- In logistic regression  (predict categorical value)
  - $\hat{y}$ = P(y = 1 | x)
  - $\hat{y}$ is the predicted label of our model
  - y is the actual label in dataset

- If we tried to use linear regression for the same problem, we would compute $\theta^T X = \theta_0 + \theta_1 x_1$ and apply a threshold

### Threshold: (Step function)
 - $\hat{y}$ = 0 if $\theta^T X$ < 0.5
 - $\hat{y}$ = 1 if $\theta^T X \ge$ 0.5

 ### Limitations:
 1. Doesn't distinguish between a customer with a $\theta^T X$ value (x-axis) of 1 and one with a value of 1000; both are simply assigned class 1
 2. Not accurate

 ### Sigmoid function:
 - The sigmoid / logistic function is similar to the step function, but smooth; it is the function used in logistic regression
 - $\sigma(\theta^T X) = \frac{1}{1 + e^{-\theta^T X}}$
 - When $\theta^T X$ gets very big, the value of $\sigma(\theta^T X)$ gets closer to 1
 - When $\theta^T X$ gets very small (large negative), the value of $\sigma(\theta^T X)$ gets closer to 0

 - When the sigmoid value gets closer to 1, the probability P(y = 1|x) goes up
 - When the sigmoid value gets closer to 0, the probability P(y = 1|x) goes down

 - Gives the probability of a point belonging to a class, instead of the value of y directly
 - $\sigma(\theta^T X) = \sigma(\theta_0 + \theta_1 x_1 + \dots)$
 - It always returns a value between 0 and 1

 - The new model with the sigmoid function is
 - $\hat{y} = \sigma(\theta^T X)$
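
A quick numerical look at the sigmoid's behavior. The parameter and feature values are hypothetical, and the first entry of `x` is set to 1 so that the first entry of `theta` acts as the bias.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    """Model output y_hat = sigmoid(theta^T x), read as P(y = 1 | x)."""
    return sigmoid(np.dot(theta, x))

# Large positive inputs approach 1, large negative inputs approach 0
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(z, sigmoid(z))

# Hypothetical parameters and features (x[0] = 1 accommodates the bias theta[0])
theta = np.array([-1.0, 0.5])
x = np.array([1.0, 4.0])
p = predict_proba(theta, x)      # theta^T x = -1 + 0.5 * 4 = 1
print(p)
```

Unlike the step function, the output changes smoothly with $\theta^T X$, so a customer far from the boundary gets a probability much closer to 0 or 1 than one near it.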

### Training process
1. Initialize $\theta$ with random values
   - Example: pick $\theta$ = [-1,2]
2. Calculate the model output $\hat{y} = \sigma(\theta^T X)$ for a customer in the training set
   - X holds the feature values, for example the age and income of the customer (for example X = [2,5])
   - With these example values, $\theta^T X$ = -1 * 2 + 2 * 5 = 8 and $\hat{y} = \sigma(8) \approx 0.9997$
3. Compare the output value $\hat{y}$ with the actual value y and record the error
4. Calculate the error for all customers and add up these errors
   - Total error is calculated by the cost function; Cost = J($\theta$)
   - The cost function shows how poorly the model is estimating labels
   - The lower the cost, the better the model is at estimating labels correctly
5. Minimize the cost function
   - Change the $\theta$ values to reduce the cost
   - This creates $\theta_{new}$
6. Go to Step 2 (another iteration) and continue until the cost is low enough

### How to identify $\Theta_{new}$ and when to stop the iteration ?
Most popular solution - Gradient Descent

# Training logistic regression model

- General cost function
  - change the weights to reduce the cost
  - find the relationship between the cost function and $\theta$
  - formula:
    - $Cost(\hat{y}, y) = \frac{1}{2}(\sigma(\theta^T X) - y)^2$
    - Interpretation of above equation
      - captures predicted value - actual value
      - the square is used to remove the possibility of negative results
    - Finally, the mean over all m training samples (mean squared error)
      - $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} Cost(\hat{y}^i, y^i)$
 - Need to find the global minimum of this function; with the sigmoid inside the square this is hard to do directly, so a different cost function is used for logistic regression

### Plotting the cost function of the model

- Actual value of y = 1 or 0
- If y = 1 and $\hat{y}$ = 1 then Cost = 0
- If y = 1 and $\hat{y}$ = 0 then cost = large

- Cost($\hat{y}$, y) = -log($\hat{y}$) ;  if y = 1
- Cost($\hat{y}$, y) = -log(1-$\hat{y}$) ; if y = 0

Logistic Regression Cost function:

$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^i \log(\hat{y}^i) + (1-y^i) \log(1-\hat{y}^i) \right]$

### Gradient Descent:
- Iterative approach to find the minimum of J($\theta$)
- A technique that uses the derivative of a cost function to change the parameter values, in order to minimize the cost
- How it works ?
  - Plot the cost (error) as a surface over the parameter values; this surface is also called the error bowl
  - Aim is to find the best parameters to minimize the cost value
  - Select a random point on the bowl
  - As long as we are going downward we can take one more step (the steeper the slope, the bigger the step we can take)
  - As we approach the minimum the slope diminishes, so we take smaller steps until we reach a flat surface
- How would you decide how big a step to take, and in which direction ?
  - By calculating the gradient of the cost function at that point
  - Gradient is the slope of the surface at every point
  - Direction of the gradient is the direction of the greatest uphill
- How do we calculate gradient of a cost function at a point ?
  - If you select a random point on the error bowl and take the partial derivative of J($\theta$) with respect to each parameter at that point, you get the slope of the surface in each direction at that point
  - If we move in the opposite direction of the slope, we move downhill on the error curve
  - For example, suppose we measure $\frac{\partial J}{\partial \theta_1}$ and find that it is a positive number
    - This indicates that the function is increasing as $\theta_1$ increases
    - So, to decrease J, we need to move in the opposite direction (which means we should decrease $\theta_1$)
  - How big a step to take ?
    - The gradient value indicates how big a step to take
    - If the slope is large, we should take a large step because we are far from the minimum
  - Partial derivative of J is calculated using this expression
    - $\frac{\partial J}{\partial \theta_1} = -\frac{1}{m} \sum_{i=1}^m (y^i - \hat{y}^i) x_1^i$
  - A vector of all the slopes calculated using partial derivatives is called the "gradient vector"
    - The gradient vector is used to update/change all the parameters
    - Take the previous value of the parameters and subtract the error derivative, scaled by the learning rate
      - $\theta_{new} = \theta_{old} - \mu \nabla J$
        - $\mu$ is the learning rate

### Overall steps:
1. Initialize the parameters randomly
2. Feed the cost function with training set, and calculate the error
3. Calculate the gradient of the cost function
4. Update weights with new values
5. Go to step 2 until cost is small enough
6. Predict the new customer X
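
The overall steps can be put together in a short training loop. This is a sketch under stated assumptions: a single made-up feature, a fixed learning rate and iteration count chosen for this toy data, and "cost is small enough" replaced by simply running a fixed number of iterations. The function names are invented for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, Xb, y, eps=1e-12):
    """Log-loss cost J(theta); eps keeps log() away from 0."""
    y_hat = np.clip(sigmoid(Xb @ theta), eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def train_logistic_regression(X, y, learning_rate=0.3, n_iters=5000, seed=0):
    Xb = np.column_stack([np.ones(len(X)), X])          # column of 1s for theta_0
    rng = np.random.default_rng(seed)
    theta = rng.normal(scale=0.01, size=Xb.shape[1])    # step 1: random initialization
    history = []
    for _ in range(n_iters):
        y_hat = sigmoid(Xb @ theta)                     # step 2: model output
        gradient = -(Xb.T @ (y - y_hat)) / len(y)       # step 3: gradient of the cost
        theta = theta - learning_rate * gradient        # step 4: update the weights
        history.append(cost(theta, Xb, y))              # track the cost (step 5)
    return theta, history

def predict(theta, X, threshold=0.5):
    """Step 6: label new cases by thresholding the predicted probability."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return (sigmoid(Xb @ theta) >= threshold).astype(int)

# Hypothetical one-feature data: small values belong to class 0, large ones to class 1
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

theta, history = train_logistic_regression(X, y)
train_pred = predict(theta, X)
new_pred = predict(theta, np.array([[0.0], [5.0]]))
print(history[0], history[-1], train_pred, new_pred)
```

The gradient line is the vector form of the partial derivative expression given above, applied to every parameter at once.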

# Support Vector Machines
- SVM is a supervised algorithm that classifies cases by finding a separator
1. Mapping data to a high dimensional feature space (so that data points can be categorized even when they are not otherwise linearly separable)
2. Finding a separator (a separator is estimated for the data as a hyperplane)

### Data Transformation:
1. The goal is to make the data separable
2. In one-dimensional space all points lie on a single line, and the classes may not be linearly separable there, so the data is mapped to a two-dimensional space
   - Map data using a function $\phi(x) = [x, x^2]$
   - In 2D a hyperplane is a line dividing data into two parts
3. Mapping data into higher dimensions is called "kernelling"
   - Types of kernel functions
     - Linear
     - Polynomial
     - RBF (Radial Basis Function)
     - Sigmoid
   - There is no single best kernel function; usually we try several in turn and compare the results to choose the one that fits the data
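
The effect of the mapping $\phi(x) = [x, x^2]$ can be seen on a toy 1D dataset. The points, labels and the threshold of 2.5 are all hand-picked for illustration; a real SVM would learn the separator rather than have it chosen by eye.

```python
import numpy as np

def phi(x):
    """Map 1-D points to 2-D: phi(x) = [x, x^2]."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x, x ** 2])

# Hypothetical 1-D data: class 1 sits on both sides of class 0,
# so no single cut point on the line separates the classes
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
labels = np.array([1, 1, 0, 0, 0, 1, 1])

mapped = phi(x)

# In the mapped 2-D space the horizontal line x^2 = 2.5 (hand-picked) separates them
predicted = (mapped[:, 1] > 2.5).astype(int)
print(mapped)
print(predicted)
```

After the mapping, a straight line in 2D does the job that no single cut point on the original line could do.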

### How to find hyperplane ?
- In 2D, the hyperplane is a line that linearly separates the two classes of data
- Best way to find hyperplane:
  - The line that creates the largest separation, i.e. has the highest margin, between the two classes
  - Goal is to choose the hyperplane with the highest margin
- Examples closest to the margin are called "Support Vectors"
  - Only the support vectors matter for achieving our goal
- Find a hyperplane in a way that it has the maximum distance from the Support Vectors
  - The hyperplane and the decision boundary lines have their own equations
    - Decision boundaries: $w^T x + b = 1$ and $w^T x + b = -1$
    - Hyperplane: $w^T x + b = 0$
    - Find the values of w and b such that $\phi(w) = \frac{1}{2} w^T w$ is minimized,
    - subject to $y_i(w^T x_i + b) \ge 1$ for all $\{(x_i, y_i)\}$
- Finding the correct margin is an optimization problem
  - It can be solved with optimization techniques, including gradient descent

### Pros and Cons of SVM:
#### Pros:
1. Accurate in high dimensional spaces
2. Memory efficient, because the decision function uses only a subset of the training data (the support vectors)
#### Cons:
1. Prone to overfitting
2. SVM does not provide probability estimates directly, which are desirable in most classification problems
3. Not computationally efficient on large datasets (as a rule of thumb, more than about 1,000 rows)

### SVM Applications:
1. Good for Image Classification
2. Effective in text mining tasks (Detecting spam, sentiment analysis)
3. Gene expression classification
4. Regression, outlier detection and clustering

# Clustering
- Unsupervised
- Definition: A cluster is a group of objects that are similar to other objects in the same cluster, and dissimilar to data points in other clusters.
## Introduction
- To apply customer segmentation on historical data
- To identify similar customers
- Partitions customers into groups that are mutually exclusive

## Clustering vs classification
- Classification should have a labeled data set as training data
  - Predicts data using
    - decision tree
    - logistic regression
    - SVM
- Clustering works with unlabeled dataset
  - Group similar customers
    - K means
## Applications of Clustering
- Retail/ marketing
  - Identifying buying pattern of customers
  - Recommending new books or movies to new customers
- Banking
  - Fraud detection in credit card use
  - Identifying clusters of customers (Eg. loyal)
- Insurance
  - Fraud detection and claims analysis
- Publication
  - Auto-categorizing news based on their content
  - Recommending similar news articles
- Medicine
  - Characterize patient behaviour
- Biology
  - Group genes with similar expression patterns and similar markers

## Why Clustering?
- Exploratory data analysis
- Summary generation
- Outlier detection
- Finding duplicates
- Pre-processing step


# k-Means Clustering
- partitioning clustering
- partition customer base into similar groups
- k-means divides the data into non-overlapping subsets (clusters) without any cluster internal structure
  - Unsupervised learning

## Determine similarity or dissimilarity
- k means
  - Minimizes Intra-cluster distances
  - Maximizes Inter-cluster distances

## How to calculate dissimilarity between two cases
- 1D (dimensional) similarity / distance (where only one feature is present eg: Age)
- 2D (dimensional) space ... Multi-dimensional space
- Minkowski distance (the Euclidean distance shown here is a special case):
  - $Dis(x_1, x_2) = \sqrt{\sum_{i=1}^n (x_{1i} - x_{2i})^2}$
- The data needs to be normalized before computing distances
### Possible approaches:
1. Euclidean distance
2. cosine similarity
3. Average distance

## How does k-means work ?
- For example: take a 2D customer segmentation based on age and height
### Steps:
1. Initialize k = 3 centroids
   - How to choose ?
     - Pick three observations from the dataset as the centroids, or
     - Create three random points
   - Each centroid should have the same feature size as the customer feature set
2. Calculate the distance of each data point from each centroid
   - Each row represents the distances of one data point from each centroid (this is called the distance matrix)
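
The two steps described so far can be sketched as follows. Only initialization and the distance matrix are shown, since that is as far as the steps above go. The customer values are invented, z-score normalization is one possible choice of normalization, and the centroids are picked from the dataset itself (the first option listed above).

```python
import numpy as np

def distance_matrix(points, centroids):
    """Euclidean distance from every point (row) to every centroid (column)."""
    diff = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))

# Hypothetical customers: [age, height in cm]
customers = np.array([
    [25.0, 160.0], [30.0, 165.0], [45.0, 175.0],
    [50.0, 180.0], [35.0, 170.0], [60.0, 172.0],
])

# Normalize each feature (z-score) so age and height contribute comparably
normalized = (customers - customers.mean(axis=0)) / customers.std(axis=0)

# Step 1: initialize k = 3 centroids by picking three observations at random
k = 3
rng = np.random.default_rng(0)
centroids = normalized[rng.choice(len(normalized), size=k, replace=False)]

# Step 2: distance matrix with one row per customer and one column per centroid
D = distance_matrix(normalized, centroids)
nearest = D.argmin(axis=1)       # index of the closest centroid for each customer
print(D.round(2))
print(nearest)
```

Each centroid has the same number of features as a customer row, which is what makes the row-by-centroid subtraction well defined.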
