### **Introduction to Unsupervised Learning**

- **"After supervised learning, the most widely used form of machine learning is unsupervised learning."**  
  This means that after supervised learning (where we teach the algorithm using labeled data), unsupervised learning is the next most popular method.

- **"Let's take a look at what that means."**  
  Now, we'll explore what "unsupervised learning" is and how it works.

---

### **Comparing Supervised and Unsupervised Learning**

- **"When we were looking at supervised learning in the last lecture, recall it looks something like this in the case of a classification problem."**  
  In supervised learning, we have input data (like tumor size and patient age) and output labels (like whether a tumor is benign or malignant).

- **"Each example was associated with an output label y such as benign or malignant, designated by the circles and crosses."**  
  Supervised learning uses examples with clear answers or "labels," like 'X' for malignant and 'O' for benign.

- **"In unsupervised learning, we're given data that isn't associated with any output labels y."**  
  In unsupervised learning, we only have input data and **no labels** telling us the "right" answer.
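
To make the contrast concrete, here is a minimal sketch of how the two kinds of dataset might be stored. The patient values, labels, and variable names are made up for illustration and do not come from the lecture.

```python
# Hypothetical patient records: (age in years, tumor size in cm).
features = [(40, 2.0), (60, 4.0), (35, 1.5), (65, 4.5)]

# Supervised learning: every example comes with an output label y.
# These labels are invented purely for this illustration.
labels = ["benign", "malignant", "benign", "malignant"]
supervised_dataset = list(zip(features, labels))

# Unsupervised learning: the same inputs, but no labels at all.
unsupervised_dataset = features

for x, y in supervised_dataset:
    print(f"input={x}, label={y}")

for x in unsupervised_dataset:
    print(f"input={x}, label=?")
```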

---

### **An Example of Unsupervised Learning**

- **"Say you're given data on patients and their tumor size and the patient's age, but not whether the tumor was benign or malignant."**  
  Imagine you have data like:

  - Patient A: Age 40, Tumor size 2 cm
  - Patient B: Age 60, Tumor size 4 cm

  But you don’t know if their tumors are benign (harmless) or malignant (dangerous).

- **"The dataset looks like this on the right."**  
  This refers to the chart or graph where the data points (patients) are plotted based on their age and tumor size.

  ![Example Image](images/Unsupervised.png)

- **"We're not asked to diagnose whether the tumor is benign or malignant, because we're not given any labels y in the dataset."**  
  Unlike supervised learning, we don’t have answers (labels) like "benign" or "malignant" to guide us.

- **"Instead, our job is to find some structure or some pattern or just find something interesting in the data."**  
  Here, the goal is to figure out patterns in the data without having labels to guide us.

- **"This is unsupervised learning."**  
  This is what unsupervised learning is all about: discovering patterns or relationships in unlabeled data.

---

### **Clustering Example**

- **"An unsupervised learning algorithm might decide that the data can be assigned to two different groups or two different clusters."**  
  The algorithm might notice that the data naturally forms two groups (clusters) based on similarities.

- **"And so it might decide that there's one cluster or group over here, and there's another cluster or group over here."**  
  For example:

  - Cluster 1: Patients with small tumors and younger ages.
  - Cluster 2: Patients with large tumors and older ages.

- **"This is a particular type of unsupervised learning called a clustering algorithm."**  
  Clustering is a type of unsupervised learning that groups similar data points together into clusters.
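
The lecture does not name a specific clustering algorithm, so the sketch below uses k-means, one common choice, as an assumption. The patient values, the `kmeans` function, and the decision to start from the first `k` points are all illustrative, not taken from the source.

```python
import math

def kmeans(points, k, iterations=10):
    """Very small k-means: returns a cluster index for each point."""
    # Assumption: use the first k points as starting centroids (real
    # implementations usually pick random or smarter starting points).
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Step 1: assign each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 2: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignments

# Hypothetical (age, tumor size in cm) pairs with no labels.
patients = [(35, 1.5), (62, 4.2), (40, 2.0), (60, 4.0), (38, 1.8), (65, 4.5)]
clusters = kmeans(patients, k=2)
print(clusters)  # [0, 1, 0, 1, 0, 1]
```

Note that the algorithm only returns group numbers. It never says what the groups mean; interpreting cluster 0 as "younger patients with smaller tumors" is left to the person reading the result.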

---

### **Clustering in Real Life: Google News Example**

![Example Image](images/GoogleNews.png)

- **"For example, clustering is used in Google News."**  
  Clustering is used by Google News to group related news stories together.

- **"What Google News does is every day it looks at hundreds of thousands of news articles on the internet and groups related stories together."**  
  Google’s algorithm scans a very large number of news articles and identifies which ones cover the same story.

- **"For example, here is a sample from Google News, where the headline of the top article is 'Giant panda gives birth to rare twin cubs at Japan's oldest zoo.'"**  
  Let’s say this is one article. The algorithm looks at all articles and finds others with similar words or topics.

- **"The algorithm notices repeated words like 'panda,' 'twins,' and 'zoo' and groups articles with these common words together."**

- **"The clustering algorithm is finding articles that mention similar words and grouping them into clusters."**  
  The algorithm groups related articles into one cluster based on shared words.

- **"There isn’t an employee at Google News who’s telling the algorithm to find articles with these specific words."**  
  No one manually tells the algorithm what to group. The algorithm figures it out on its own. This is why it’s "unsupervised."
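
The source does not describe how Google News actually works, so the following is only a toy sketch of the idea of grouping by shared words. It assumes a simple word-overlap score (Jaccard similarity) and a greedy grouping rule. Only the first headline comes from the lecture; the others, the function names, and the `threshold` value are invented for illustration.

```python
def words(headline):
    """Lowercase a headline and return its words, dropping very short ones."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in headline.lower())
    return {w for w in cleaned.split() if len(w) > 2}

def jaccard(a, b):
    """Share of words two sets have in common (0 = none, 1 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def group_headlines(headlines, threshold=0.2):
    """Greedy grouping: add a headline to the first cluster that
    already contains a similar-enough headline, else start a new one."""
    clusters = []  # each cluster is a list of headlines
    for h in headlines:
        for cluster in clusters:
            if any(jaccard(words(h), words(other)) >= threshold for other in cluster):
                cluster.append(h)
                break
        else:
            clusters.append([h])
    return clusters

headlines = [
    "Giant panda gives birth to rare twin cubs at Japan's oldest zoo",
    "Twin panda cubs born at Tokyo zoo",                     # made up
    "Central bank raises interest rates again",              # made up
    "Zoo welcomes rare panda twins",                         # made up
    "Interest rates rise as central bank fights inflation",  # made up
]

groups = group_headlines(headlines)
for number, group in enumerate(groups, start=1):
    print(f"Cluster {number}: {group}")
```

Nobody told the code to look for "panda" or "zoo". The panda stories end up together only because they happen to share words, which is the point the lecture is making.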

---

### **Clustering Genetic Data Example**

![Example Image](images/DNA.png)

- **"This image shows a picture of DNA microarray data."**  
  A DNA microarray measures how active (how strongly expressed) many genes are in each individual, and the results can be laid out like a spreadsheet.

- **"Each column represents one person, and each row represents a gene."**  
  Think of it like this:

  - Each column is a person (Person A, Person B, etc.).
  - Each row is a gene (e.g., eye color gene, height gene).

- **"What you can do is run a clustering algorithm to group individuals into categories or types of people."**  
  The algorithm looks at patterns in the data and groups similar people together based on their genetic traits.

---

Imagine we have data from a **DNA microarray**, which is a tool used to analyze genetic information. This data is visually represented as a grid, somewhat like a **spreadsheet**.

1. **Columns Represent Individuals:**

   - Each column in this grid corresponds to the **genetic information** of one person.
   - For instance:
     - Column 1 might represent "Person A."
     - Column 2 might represent "Person B."
     - Each column contains data about that person’s DNA activity.

2. **Rows Represent Genes:**

   - Each row in the grid corresponds to a **specific gene**.
   - For example:
     - Row 1 could represent a gene that influences **eye color**.
     - Row 2 might represent a gene that affects **height**.
     - Another row might represent a gene linked to whether someone likes or dislikes certain foods, such as broccoli, brussels sprouts, or asparagus. (Fun fact: researchers have found gene variants associated with how bitter some vegetables taste to different people.)

3. **Color Coding of Data:**

   - The grid is color-coded to show how active each gene is for each individual.
   - For example:
     - **Red** might indicate a highly active gene.
     - **Green** might show a less active gene.
     - **Gray** could represent a gene that isn’t active at all.

   These colors allow researchers to quickly visualize differences and similarities in genetic activity across individuals.

4. **The Clustering Algorithm:**

   - Now comes the role of the **unsupervised learning algorithm**, specifically a **clustering algorithm**.
   - The algorithm takes this genetic data (columns and rows) and analyzes it to identify **patterns**.
   - It groups individuals into clusters based on similarities in their genetic profiles.
     - For example:
       - People with similar DNA activity might be grouped together into **Cluster 1**.
       - Another set of individuals with a different pattern might form **Cluster 2**.
       - Yet another group might be categorized as **Cluster 3**.

5. **No Predefined Labels:**

   - Here’s what makes this unsupervised learning:
     - We don’t tell the algorithm, “This is what makes a person belong to Type 1, Type 2, or Type 3.”
     - Instead, we give it all the data and say, “Figure out the groups or clusters yourself.”
   - The algorithm works independently to find **structure** in the data and identify patterns that group similar individuals together.

6. **Practical Use:**

   - Researchers use these clusters to study **genetic traits** and how they relate to health, behavior, or other characteristics.
   - For example:
     - Cluster 1 might include people prone to certain diseases.
     - Cluster 2 could represent individuals with specific dietary preferences or tolerances.
     - Cluster 3 might highlight genetic markers linked to height or physical characteristics.

7. **Why It’s Unsupervised Learning:**
   - Unlike supervised learning, we don’t provide the algorithm with a predefined “right answer” (like labeling people as “tall” or “short” based on a specific gene).
   - Instead, the algorithm learns to group individuals based on their **own patterns** and relationships within the data.
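
The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the expression values are invented, and the grouping uses simple agglomerative (bottom-up) clustering with single linkage, which the source does not specify. The names `expression`, `profiles`, and `agglomerate` are hypothetical.

```python
import math

# Hypothetical expression levels: rows are genes, columns are people.
expression = [
    # A    B    C    D    E    F
    [0.9, 0.8, 0.1, 0.2, 0.5, 0.5],  # gene 1
    [0.8, 0.9, 0.2, 0.1, 0.5, 0.4],  # gene 2
    [0.1, 0.2, 0.9, 0.8, 0.5, 0.6],  # gene 3
]
people = ["A", "B", "C", "D", "E", "F"]

# One profile per person: read the grid column by column.
profiles = dict(zip(people, zip(*expression)))

def cluster_distance(c1, c2):
    """Single linkage: distance between the closest pair of members."""
    return min(math.dist(profiles[p], profiles[q]) for p in c1 for q in c2)

def agglomerate(names, k):
    """Start with one cluster per person; merge the closest pair until k remain."""
    clusters = [[n] for n in names]
    while len(clusters) > k:
        pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        i, j = min(
            pairs,
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters

result = agglomerate(people, k=3)
print(result)  # [['A', 'B'], ['C', 'D'], ['E', 'F']]
```

The code is never told which people belong to which type. It only receives the grid and the number of groups to produce, and the three groups emerge from the similarity of the columns.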

---

### **Clustering Customers Example**

![Example Image](images/Customer.png)

- **"Many companies have huge databases of customer information."**  
  Companies collect a lot of data about their customers, like age, shopping habits, or spending patterns.

- **"Can you automatically group your customers into different market segments?"**  
  Businesses use clustering to identify customer groups (e.g., young shoppers, budget-conscious buyers, etc.).

---

### **Summary**

- **"To summarize, a clustering algorithm, which is a type of unsupervised learning algorithm, takes data without labels and tries to automatically group them into clusters."**  
  Clustering groups unlabeled data into clusters based on similarities.

- **"So maybe the next time you see or think of a panda, maybe you think of clustering as well."**  
  A fun way to remember clustering: the panda stories ended up in one group because the algorithm found the words they share.
