#### Unsupervised learning algorithms 
- Unsupervised learning algorithms in machine learning analyze unlabeled data to discover hidden patterns, structures, and relationships without predefined outputs.
- These algorithms are used for tasks like clustering, dimensionality reduction, and anomaly detection.
- Key unsupervised learning algorithms and techniques:
  - Clustering - This technique groups similar data points together based on their characteristics. Common clustering algorithms include:
    - K-Means: Partitions data into K clusters, minimizing the within-cluster sum of squares.
    - Hierarchical Clustering: Builds a hierarchy of clusters, either by merging smaller clusters or dividing larger ones.
    - DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups together data points that are closely packed, marking as outliers the data points that lie alone in low-density regions.
  - Dimensionality Reduction - This method reduces the number of variables in a dataset while preserving important information.
    - Principal Component Analysis (PCA): Transforms data into a new coordinate system where most of the variance is captured by the first few components.
    - t-distributed Stochastic Neighbor Embedding (t-SNE): Reduces dimensionality while preserving local structure, making it useful for visualization.
  - Anomaly Detection - Algorithms that identify unusual data points that deviate significantly from the norm.
    - One-Class SVM: Learns a boundary around the normal data points and flags anything outside as anomalous.
    - Isolation Forest: Isolates anomalies by randomly partitioning data, with anomalies requiring fewer partitions to be isolated.
  - Association Rule Mining - Discovers relationships and associations between items in a dataset.
    - Apriori Algorithm: Identifies frequent itemsets and generates association rules.
    - ECLAT (Equivalence Class Clustering and bottom-up Lattice Traversal): Another algorithm for association rule mining, often faster than Apriori.
    - FP-Growth (Frequent Pattern Growth): An efficient algorithm for mining frequent patterns in large datasets.

#### Centroid
- In statistics, a centroid represents the "average" or "center" of a set of data points, particularly in multidimensional spaces.
- Each coordinate of the centroid is the mean of that variable (or dimension) across all the data points.
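
As a minimal sketch of the definition above, a centroid can be computed as the coordinate-wise mean. The function name `centroid` and the sample points are made up for illustration.

```python
def centroid(points):
    """Return the coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    dims = len(points[0])
    # Average each dimension separately across all points.
    return tuple(sum(p[d] for p in points) / n for d in range(dims))


points = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]
print(centroid(points))  # (3.0, 5.0)
```

Note that the centroid (3.0, 5.0) is not one of the input points; it is an "imaginary" point in the middle of them.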

#### Euclidean Distance Between 2 points
- The Euclidean distance between two points is the length of the straight line segment connecting them.
- In a 2-dimensional plane, the Euclidean distance between points (x₁, y₁) and (x₂, y₂) is calculated using the
  formula: √((x₂ - x₁)² + (y₂ - y₁)²). 
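
The formula above can be sketched in Python as follows. The helper name `euclidean_distance` is an illustrative choice; the standard library's `math.dist` (Python 3.8+) computes the same quantity and works for any number of dimensions.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of dimensions."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))


# (4 - 1)^2 + (6 - 2)^2 = 9 + 16 = 25, and sqrt(25) = 5
print(euclidean_distance((1, 2), (4, 6)))  # 5.0
print(math.dist((1, 2), (4, 6)))           # 5.0
```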

#### Inertia = WCSS score as calculated in Python
- Measures how well a dataset was clustered by K-Means.
- It is calculated by measuring the distance between each data point and its centroid, squaring this distance, and summing these squares across all clusters.
- Syntax in Python (scikit-learn), where `my_model` is a fitted K-Means model:
  - `my_model.inertia_`
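
To make the calculation concrete, here is a hand-rolled sketch of the same quantity in plain Python. It is not scikit-learn's implementation; the function name `inertia`, the points, and the centroids are illustrative assumptions.

```python
def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        # Squared Euclidean distance from p to every centroid; keep the smallest,
        # because each point belongs to the cluster of its nearest centroid.
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids
        )
    return total


points = [(1, 1), (1, 3), (9, 9), (11, 9)]
centroids = [(1, 2), (10, 9)]
print(inertia(points, centroids))  # 4.0 (each point is at distance 1 from its centroid)
```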

#### K Means Clustering Algorithm

Step1:- Tell the algorithm the value of k (k = number of clusters). Start with some value, e.g. k = 2 or k = 3.

Step2:- Assuming k = 2, the algorithm takes any 2 random points as centroids. They can be anywhere in the data space: a centroid can coincide with one of the input data points or it can be an imaginary point.

![image.png](attachment:a9dd42e6-09a3-4af4-a79b-410dae00e326.png)

Step3:- Calculate the distance between Centroid1 and each of the data points, and between Centroid2 and each of the data points, using the Euclidean distance formula.

Step4:- Now create/adjust 2 groups (clusters) based on the minimum distances: group all the data points that are closer to Centroid1 as one cluster and all the data points that are closer to Centroid2 as another cluster.

![image.png](attachment:3c8f9235-d45f-4ef8-90ce-aecf029e3706.png)

Step5:- Since the clusters are still imperfect, adjust the centroids by moving each one to the center of gravity (center point) of its cluster.

![image.png](attachment:ec3549e7-f25f-499b-a5c9-349067ae7436.png)

Step6:- Repeat Step3 (calculate distances), Step4 (reassign each data point to its nearest centroid) and Step5 (adjust the centroids after the reassignment) until none of the data points change cluster.

Datapoints adjusted/grouped as per distance
![image.png](attachment:9ca9199c-ebb9-42d7-b00c-48ae1f798ed6.png)

Centroids adjusted
![image.png](attachment:85e525c8-8b3d-4991-ac41-807013fcd1e1.png)

Again, Datapoints adjusted/grouped as per distance
![image.png](attachment:0eb681d1-22a7-432e-8a47-6ba7b54cd5ce.png)

Adjust the centroids and recalculate the distances; when no data point changes cluster, finalize the clusters.

![image.png](attachment:6f367680-13d5-4f99-8b20-f19184290b5e.png)
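
The six steps above can be sketched in plain Python as follows. This is a teaching sketch rather than a production implementation: the function name `k_means`, the `seed` and `max_iter` parameters, the choice to start from randomly picked data points, and the sample data are all assumptions made for illustration.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Step 2: pick k of the data points as the starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3-4: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 6: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 5: move each centroid to the mean of its cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels)     # the first three points share one label, the last three the other
print(centroids)  # one centroid near (1.17, 1.5), the other at (8.5, 8.5)
```

Which cluster gets label 0 and which gets label 1 depends on the random starting centroids, so only the grouping is meaningful, not the label numbers.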

#### Summary:-
- What should the initial value of k be? Start with the minimum value.
- What should the final value of k (the number of clusters) be? The two points below decide it:
  1. The Sum of Squared Errors (SSE) or WCSS method produces an elbow chart. Choose the k at the elbow of the chart.
  2. The customer's requirement.

#### WCSS and Sum of Squared Error Method
- WCSS (Within-Cluster Sum of Squares) and Sum of Squared Errors (SSE) are both sums of squared deviations, but the two terms are used in different contexts.
- SSE is a broader term used in regression and other models to measure the difference between predicted and actual values.
- WCSS is specifically used in K-means clustering to assess the compactness of clusters.
- In K-means, the "error" of a data point is its distance to its cluster centroid, so the SSE of a clustering is the same quantity as its WCSS.
- Key Differences:
  - Context: WCSS is specifically for clustering, while SSE is a more general term used in various models.
  - Focus: WCSS focuses on the compactness of clusters, while SSE focuses on the overall error of a model.
  - Calculation: WCSS calculates distances within clusters, while SSE calculates differences between predicted and actual values.

#### Method1 - Sum of Squared Errors (SSE)
- SSE is the sum of the squared differences between predicted values and actual values in a dataset.
- Purpose:
  - It measures the total error or the unexplained variance in a model. A lower SSE indicates a better model fit, as it suggests that the model's predictions are closer to the actual values.
- Application:
  - SSE is a general term applicable to various models, including regression, neural networks, and other machine learning models.
- Formula:
  - SSE = Σ (yᵢ - f(xᵢ))², where yᵢ is the actual value and f(xᵢ) is the predicted value for the i-th data point.
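
As a small worked example of the general formula, the sketch below computes SSE for three actual/predicted pairs. The function name `sse` and the numbers are made up for illustration.

```python
def sse(actual, predicted):
    """Sum of squared differences between actual and predicted values."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))


actual = [3.0, 5.0, 7.0]
predicted = [2.5, 5.0, 8.0]
# (0.5)^2 + (0.0)^2 + (-1.0)^2 = 0.25 + 0 + 1 = 1.25
print(sse(actual, predicted))  # 1.25
```

The negative error (-1.0) contributes a positive amount after squaring, so errors in opposite directions do not cancel out.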

Applied to K-Means: for each cluster, compute the distance of each data point from its centroid and square it (squaring removes the sign of the coordinate differences, so positive and negative deviations do not cancel).

SSE = SSE1 + SSE2 (when k = 2), where SSE1 is the sum of the squared distances between the Cluster1 centroid and all its data points, and SSE2 is the same sum for Cluster2.

SSE = SSE1 + SSE2 + ..... + SSEn (k = n)

![image.png](attachment:07b00c3e-f22c-45dc-bf73-2b65eae4f704.png)

- Repeat the calculation for increasing values of k:
  - k = 1: SSE = x
  - k = 2: SSE = y
  - k = 3: calculate its SSE
  - ...
  - k = n: calculate its SSE

#### Elbow Graph
- Now draw the elbow graph with k on the x-axis and the SSE values on the y-axis.
- Looking at the elbow chart, decide the final k value and then rerun the algorithm with that k (k = 4 in the diagram below).

![image.png](attachment:0d44e72b-5fc8-4be6-91c1-9ee32f4aea21.png)
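
The elbow is usually read off the chart by eye. As a rough sketch, one possible heuristic (an assumption of this example, not something the notes prescribe) is to pick the k where the drop in SSE slows down the most, i.e. where the difference between the drop before k and the drop after k is largest. The function name `elbow_k` and the SSE values are made up for illustration.

```python
def elbow_k(sse_by_k):
    """Return the k with the sharpest bend in the SSE curve, or None if fewer than 3 values."""
    ks = sorted(sse_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = sse_by_k[prev_k] - sse_by_k[k]  # improvement gained by reaching k
        drop_after = sse_by_k[k] - sse_by_k[next_k]   # improvement gained by going past k
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Hypothetical SSE values for k = 1..6 (not measured from real data).
sse_by_k = {1: 1000.0, 2: 600.0, 3: 300.0, 4: 100.0, 5: 80.0, 6: 70.0}
print(elbow_k(sse_by_k))  # 4
```

On noisy curves this heuristic can disagree with a visual reading, so it is best treated as a starting point.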

#### Method2 - Within-Cluster Sum of Squares (WCSS) - Used especially for K Means Clustering Algorithm
- WCSS is the sum of the squared distances between each point in a cluster and the centroid of that cluster.
- Purpose:
  - It measures the compactness or cohesion of clusters. A lower WCSS indicates that the data points within each cluster are closer to their respective centroids, suggesting better-defined clusters.
- Application:
  - Commonly used in the Elbow method to determine the optimal number of clusters in K-means.
  - In the Elbow method, WCSS is calculated for different numbers of clusters, and the "elbow" point (where the rate of decrease slows down) is chosen as the optimal number of clusters.
- Formula:
  - For each cluster, compute the Euclidean distance of each data point from its centroid, square it, and sum the squares cluster-wise.
  - WCSS = WCSS1 + WCSS2 (when k = 2), where WCSS1 is the sum of the squared distances between the Cluster1 centroid and all its data points, and WCSS2 is the same sum for Cluster2.
  - WCSS = WCSS1 + WCSS2 + ..... + WCSSn (k = n)
- Repeat the calculation for increasing values of k:
  - k = 1: WCSS of the single cluster
  - k = 2: WCSS = WCSS1 + WCSS2
  - k = 3: WCSS = WCSS1 + WCSS2 + WCSS3
  - ...
  - k = n: WCSS = WCSS1 + WCSS2 + WCSS3 + ... + WCSSn
- When k = 3, see below

![image.png](attachment:1caa7630-0c11-4859-b346-f67ff6e30d0c.png)
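
The verbal definition of WCSS can also be written compactly. The symbols are introduced here for illustration and are not from the notes: $C_j$ is the set of points in cluster $j$, $\mu_j$ is its centroid, and $k$ is the number of clusters.

$$
\mathrm{WCSS} = \sum_{j=1}^{k} \mathrm{WCSS}_j = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
$$

The inner sum is the per-cluster term (WCSS1, WCSS2, ...), and the outer sum adds the clusters together.
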
#### Elbow Graph
- Now draw the elbow graph with k on the x-axis and the WCSS values on the y-axis.
- Looking at the elbow chart, decide the final k value and then rerun the algorithm with that k.

![image.png](attachment:a850648a-b272-4dde-b316-17d068bdd4b3.png)

- The WCSS score becomes zero when k = total number of data points/records, because every data point is then its own cluster and its own centroid. This doesn't mean the model is good.
- The number of clusters (k) should be decided at the threshold point (elbow point) so that the data is grouped into meaningful clusters.
- In the graph above, the elbow suggests taking k = 3.
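
The sketch below illustrates why a WCSS of zero is not a sign of a good model: it computes WCSS for the same four points grouped into 1, 2, and 4 clusters. The function name `wcss`, the points, and the groupings are made up for illustration; the groupings are chosen by hand rather than by running K-Means.

```python
def wcss(clusters):
    """clusters: a list of clusters, each a list of points (tuples)."""
    total = 0.0
    for members in clusters:
        # Centroid of this cluster: the coordinate-wise mean of its members.
        center = [sum(c) / len(members) for c in zip(*members)]
        # Add this cluster's sum of squared distances to its centroid.
        total += sum(
            sum((a - b) ** 2 for a, b in zip(p, center)) for p in members
        )
    return total


points = [(1, 1), (2, 1), (8, 8), (9, 8)]
print(wcss([points]))                  # k = 1 -> 99.0
print(wcss([points[:2], points[2:]]))  # k = 2 -> 1.0
print(wcss([[p] for p in points]))     # k = 4 -> 0.0
```

Going from k = 1 to k = 2 removes almost all of the WCSS, while going on to k = 4 only removes the last 1.0 and leaves every point in a cluster of its own, which says nothing about the structure of the data.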

#### Cohesion and Separation
- Cohesion means sticking together: how close the data points within the same cluster are to each other.
- Separation: how far apart the data points of different clusters are from each other.

#### Silhouette Score
- The silhouette score compares, for each data point, the mean distance to the points in its own cluster (cohesion) with the mean distance to the points in the nearest other cluster (separation).

- Like the R2 score, MSE, etc. used in Linear Regression, we have the Silhouette Score to measure the quality of a K Means clustering.
- There is no actual output column to compare the predicted values against, so a direct accuracy measure is not available; we use this method instead.
- The score lies between -1 (bad) and +1 (good).
- A silhouette coefficient is calculated for each data point.
- The silhouette score for the entire clustering is the mean of all the silhouette coefficients.

![image.png](attachment:5447d632-2fd2-46bb-b370-6ce5826f3095.png)

- To compute the silhouette coefficient for one data point (d0):
  1. Calculate the cohesion.
     - Compute the distances from the selected data point to all the other data points within its cluster.
     - Distance = distance(d0,d1) + distance(d0,d2)
     - Mean of total distances = (distance(d0,d1) + distance(d0,d2))/2. Let's call this Wd (within distance).
  2. Calculate the separation.
     - Compute the distances from the selected data point to all the data points in the other cluster.
     - Distance = distance(d0,d3) + distance(d0,d4) + distance(d0,d5)
     - Mean of total distance = (distance(d0,d3) + distance(d0,d4) + distance(d0,d5))/3. Let's call this Od1 (outside cluster distance 1).
     - Here there are only two clusters, so there is only one other cluster to measure. If there are more, compute the mean distance from the selected data point to each of the other clusters one by one, giving Od2, Od3, etc.
  3. Take the minimum of the average distances calculated for the outside clusters.
     - Od (outside distance) = Minimum of (Od1, Od2, Od3, ...)
     - Ex: Let's consider Od1 is the minimum.
     - Steps 2 and 3 find the cluster nearest to the selected data point other than its own, so Od measures how close the selected data point is to the closest neighbouring cluster.
  4. <b>Silhouette = (Od - Wd) / Max(Od,Wd)</b>
     - Understanding the silhouette coefficient value:
     - Typically the cohesion (within) distance should be less than the separation (outside) distance, because a point should be closer to the points in its own cluster than to the points in another cluster.
     - Ex1:- Wd = 0.2 and Od = 0.8 >>> (0.8 - 0.2)/Max(0.8,0.2) = 0.6/0.8 = 0.75 >>> Good for that data point
     - Ex2:- Wd = 0.8 and Od = 0.2 >>> (0.2 - 0.8)/Max(0.8,0.2) = -0.6/0.8 = -0.75 >>> Bad for that data point
  5. Now do the same for all the data points in the dataset and take the mean of all the scores.
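
The five steps above can be sketched in plain Python as follows. The function names `silhouette` and `mean_silhouette` and the coordinates chosen for d0..d5 are illustrative assumptions. Returning 0 for a point that is alone in its cluster is a common convention, not something stated in the notes.

```python
import math


def silhouette(point, same_cluster_rest, other_clusters):
    """Silhouette coefficient of one point.

    same_cluster_rest: the other points of the point's own cluster.
    other_clusters: a list of the remaining clusters (each a list of points).
    """
    if not same_cluster_rest:
        return 0.0  # convention (assumption) for a cluster with a single point
    # Step 1: cohesion Wd = mean distance to the other points of the same cluster.
    wd = sum(math.dist(point, p) for p in same_cluster_rest) / len(same_cluster_rest)
    # Steps 2-3: separation Od = smallest mean distance to any other cluster.
    od = min(
        sum(math.dist(point, p) for p in cluster) / len(cluster)
        for cluster in other_clusters
    )
    # Step 4: combine the two.
    return (od - wd) / max(od, wd)


def mean_silhouette(clusters):
    """Step 5: mean silhouette coefficient over every point in every cluster."""
    scores = []
    for i, cluster in enumerate(clusters):
        others = clusters[:i] + clusters[i + 1:]
        for j, point in enumerate(cluster):
            rest = cluster[:j] + cluster[j + 1:]
            scores.append(silhouette(point, rest, others))
    return sum(scores) / len(scores)


cluster_a = [(0.0, 0.0), (0.0, 3.0), (4.0, 0.0)]     # d0, d1, d2
cluster_b = [(12.0, 0.0), (12.0, 5.0), (12.0, 9.0)]  # d3, d4, d5

# For d0: Wd = (3 + 4) / 2 = 3.5 and Od = (12 + 13 + 15) / 3 = 40/3,
# so the coefficient is (40/3 - 3.5) / (40/3) = 0.7375.
print(round(silhouette(cluster_a[0], cluster_a[1:], [cluster_b]), 4))  # 0.7375
print(mean_silhouette([cluster_a, cluster_b]))  # overall score, between -1 and +1
```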

![image.png](attachment:d8a76c24-4288-4e20-9a1e-50ebfd7cc536.png)

https://youtu.be/_jg1UFoef1c?si=h-TeuIShN_w3D9Ck

https://www.youtube.com/watch?v=a2Kg2_l3L8M&list=PLTCVN6Wwg-pB7RXD5mYdP_Ve8S7z9GvxX&index=21