# Machine Learning:

## Where is it used?
 - Translation apps
 - Suggestions
 - Autonomous vehicles
 - Weather prediction
 - Travel-time estimation
 - Song recommendation
 - Sentence auto-completion
 - Article summarization
 - Generation of never-seen-before images

## What is it? 
- **ML** is:
  - The process of **training a piece of software**, called a **model**, to make useful predictions or generate content (such as text, images, audio, or video) from data.

- **Example:** Create an app that predicts rainfall.
    - To implement it, we could use either a **traditional approach** or an **ML approach**:
      1. **Traditional approach:**
            - We'd create a physics-based representation of the Earth's atmosphere and surface, computing massive amounts of fluid dynamics equations.

            - This is incredibly difficult, because we would have to write code for all of those equations ourselves.

      2. **ML approach:**
            - We would give an ML model `enormous amounts of weather data` until the model eventually learned the `mathematical relationship between weather patterns` that produce differing amounts of rain.

            - We would then give the model the current weather data, and it would predict the amount of rain.
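
The ML approach above can be sketched in a few lines. This is a minimal illustration only: it assumes a single input feature (relative humidity), a handful of made-up observations, and a straight-line relationship fitted by least squares. The names `fit_line` and `predict_rain` are hypothetical, and a real weather model would use far more data and features.

```python
# Toy illustration: learn rainfall (mm) from humidity (%) with least squares.
# The observations are invented for this example, not real weather data.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_rain(model, humidity):
    """Apply the learned relationship; rainfall cannot be negative."""
    slope, intercept = model
    return max(0.0, slope * humidity + intercept)

# "Training": the model learns the relationship from historical data.
humidity = [40, 50, 60, 70, 80, 90]
rain_mm = [0, 1, 2, 3, 4, 5]
model = fit_line(humidity, rain_mm)

# "Prediction": give the model current weather data.
print(f"{predict_rain(model, 75):.1f} mm")
```

Nothing about the weather is hard-coded here: the slope and intercept come entirely from the data, which is the key difference from the traditional approach.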
    

## Types of ML Systems
1. **Supervised learning** 

2. **Unsupervised learning**

3. **Reinforcement learning**

4. **Generative AI**



![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

# Supervised learning
- Supervised learning models can make predictions after seeing lots of data with *correct answers* or *output labels*.

- During training, the model discovers the patterns or connections between data points that produce the correct answers.


## Types :
1. **Regression :** Predicts a numeric value.
2. **Classification :** Predicts a category.


### Hierarchy:

<pre>
Supervised Learning
|
├── 1. Regression (predict continuous values)
│           ├── Linear Regression
│           ├── Polynomial Regression
│           └── Ridge/Lasso Regression
│
└── 2. Classification (predict discrete categories)
            ├── Binary Classification (2 classes)
            │   ├── Logistic Regression
            │   ├── Support Vector Machine (SVM)
            │   └── Naive Bayes
            │
            └── Multi-class Classification (more than 2 classes)
                ├── Decision Trees
                ├── Random Forest
                └── Neural Networks (Softmax Output)
</pre>

**Note**: The algorithms above are grouped by typical use, not exclusively. Decision trees, random forests, and neural networks also handle binary classification, and logistic regression extends to more than two classes (multinomial logistic regression).

## 1. Regression

  - **Definition**: A regression model predicts a numeric value as its output.

  - **Example**: A weather model that predicts the amount of rain in inches or millimeters.

  - **Common scenarios and applications**:

    | Scenario | Possible input data | Numeric prediction |
    |----------|--------------------|--------------------|
    | Future house price | Square footage, zip code, number of bedrooms and bathrooms, lot size, mortgage interest rate, property tax rate, construction costs, and number of homes for sale in the area | The price of the home |
    | Future ride time | Historical traffic conditions (gathered from smartphones, traffic sensors, ride-hailing and other navigation applications), distance from destination, and weather conditions | The time in minutes and seconds to arrive at a destination |

  - **Key characteristic**: Output is always a continuous numerical value.
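
As a rough sketch of the ride-time scenario in the table, the snippet below uses k-nearest-neighbours regression, which is just one of many possible regression techniques. The trips, the two features (distance and a 0-10 traffic level), and the function name `predict_ride_time` are all assumptions made up for illustration.

```python
import math

# Toy historical trips: (distance_km, traffic_level 0-10) -> ride time in minutes.
# All numbers are invented for this example.
trips = [
    ((2.0, 1.0), 6.0),
    ((2.5, 8.0), 14.0),
    ((5.0, 2.0), 12.0),
    ((5.5, 7.0), 22.0),
    ((10.0, 3.0), 25.0),
    ((10.5, 9.0), 45.0),
]

def predict_ride_time(features, history, k=2):
    """Average the ride times of the k most similar past trips."""
    by_similarity = sorted(history, key=lambda trip: math.dist(trip[0], features))
    nearest = by_similarity[:k]
    return sum(minutes for _, minutes in nearest) / k

# A 5.2 km trip in fairly heavy traffic (level 6).
estimate = predict_ride_time((5.2, 6.0), trips)
print(estimate)  # 18.0
```

The output is a number on a continuous scale (minutes), not a category. The two features happen to be on similar scales here; with real data you would normally rescale features before measuring distances.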

## 2. Classification

  - **Definition**: Classification models predict the likelihood that something belongs to a specific category or class.

  - **Key difference from regression**: Unlike regression models, which output a number, classification models output a value that states whether or not something belongs to a particular category.

  - **Common examples**: Predicting if an email is spam, determining if a photo contains a cat.

### Types of Classification Models

#### 1. Binary Classification
  - **Definition**: Models that output a value from a class containing only two possible values.
  - **Example**: A model that outputs either "rain" or "no rain."
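
A minimal sketch of a binary classifier for the "rain" / "no rain" example, assuming logistic regression trained by plain gradient descent on one made-up feature (relative humidity). The data, the hyperparameters, and the 0.5 threshold are illustrative assumptions, not recommended settings.

```python
import math

def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, learning_rate=1.0, epochs=5000):
    """Fit a weight and bias by batch gradient descent on the log loss."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(weight * x + bias) - y for x, y in zip(xs, ys)]
        weight -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        bias -= learning_rate * sum(errors) / n
    return weight, bias

def classify(model, humidity, threshold=0.5):
    """Convert the predicted probability of rain into one of two labels."""
    weight, bias = model
    probability_of_rain = sigmoid(weight * humidity + bias)
    return "rain" if probability_of_rain >= threshold else "no rain"

# Made-up training data: relative humidity (0-1) and whether it rained (1) or not (0).
humidity = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
rained = [0, 0, 0, 0, 1, 1, 1, 1]
model = train_logistic(humidity, rained)

print(classify(model, 0.85), "/", classify(model, 0.25))
```

The model first estimates a likelihood and only then turns it into a category, so moving the threshold changes how readily it answers "rain".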

#### 2. Multiclass Classification
  - **Definition**: Models that output a value from a class containing more than two possible values.
  - **Example**: A weather model that can output "rain," "hail," "snow," or "sleet."

  - **Key characteristic**: Output is always a discrete categorical value rather than a continuous number.
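
For the multiclass case, one common pattern is to turn one raw score per class into probabilities with softmax and pick the most likely class. The sketch below shows only that final step: the scores are hypothetical values standing in for the output of some trained model, and `classify` is an illustrative name.

```python
import math

CLASSES = ["rain", "hail", "snow", "sleet"]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores):
    """Return the most likely class and the full probability list."""
    probabilities = softmax(scores)
    best = max(range(len(CLASSES)), key=lambda i: probabilities[i])
    return CLASSES[best], probabilities

# Hypothetical raw scores a trained model might emit for one observation.
label, probs = classify([2.0, 0.5, 1.0, -1.0])
print(label)  # rain
```

Whatever the probabilities are, the final answer is always exactly one of the four predefined labels.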

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

# Unsupervised Learning
- Unsupervised learning models make predictions from data that does not contain any ***correct answers*** or ***output labels***.
- The model must **infer patterns and structure** on its own.
- There are **no predefined categories or outputs**.


* **Key Idea**: The model discovers **hidden structures** or **relationships** in data without any external supervision.
* **Common Technique: Clustering**
   * Groups data points into **natural groupings** or **clusters** based on similarity.
   * The model finds patterns such as:
      * Which items are similar
      * How data points are grouped
   * Example: Customer segmentation in marketing.
* **Applications**:
   * Market segmentation
   * Anomaly detection
   * Social network analysis
   * Organizing large document collections
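
To make clustering concrete, here is a small sketch of k-means on one-dimensional data, assuming made-up monthly spend figures for six customers. K-means is only one clustering algorithm, and the deterministic way the starting centroids are chosen below is a simplification for this example.

```python
def k_means_1d(values, k, iterations=10):
    """Cluster numbers into k groups; returns (centroids, assignments)."""
    ordered = sorted(values)
    # Deterministic start: take the middle element of each of k equal slices.
    centroids = [float(ordered[(2 * i + 1) * len(ordered) // (2 * k)]) for i in range(k)]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(values, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = sum(members) / len(members)
    return centroids, assignments

# Made-up monthly spend per customer; no labels are provided.
spend = [12, 15, 14, 95, 102, 98]
centroids, assignments = k_means_1d(spend, k=2)
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

The cluster ids 0 and 1 carry no meaning by themselves. Deciding that they mean something like "low spenders" and "high spenders" is an interpretation added afterwards by a person.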

#### NOTE : `Clustering` differs from `classification` because the categories aren't defined by you. 

# Reinforcement Learning

* **Definition**: Reinforcement learning models make predictions by getting **rewards or penalties** based on actions performed within an environment.
   * The system learns through **trial and error** interactions with its environment.
   * Actions that lead to positive outcomes are reinforced, while negative outcomes are discouraged.

* **Key Concept: Policy Generation**
   * A reinforcement learning system generates a **policy** that defines the best strategy for getting the most rewards.
   * The policy guides decision-making to maximize cumulative rewards over time.

* **Learning Process**:
   * Agent takes actions in an environment
   * Environment provides feedback through rewards or penalties
   * Agent adjusts behavior to maximize future rewards

* **Applications**:
   * **Robotics**: Training robots to perform tasks like walking around a room
   * **Game AI**: Training programs such as AlphaGo to play complex games like Go
   * Autonomous vehicles, recommendation systems, trading algorithms
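
The agent / environment / reward loop can be sketched with tabular Q-learning, one classic reinforcement learning algorithm. Everything here is a toy assumption: the environment is a five-cell corridor with a reward only at the right end, the agent explores by acting at random, and the hyperparameters are arbitrary.

```python
import random

N_STATES = 5                  # positions 0..4 in a corridor; position 4 is the goal
ACTIONS = ["left", "right"]

def step(state, action):
    """Environment: move one cell; reward 1.0 only when the goal is reached."""
    move = -1 if action == "left" else 1
    next_state = min(max(state + move, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def train(episodes=300, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values by trial and error."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = rng.choice(ACTIONS)  # explore by acting randomly
            next_state, reward = step(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Q-learning update: move the estimate toward
            # reward + discounted value of the best next action.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

def greedy_policy(q):
    """The policy: the highest-valued action for each non-goal state."""
    return [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]

policy = greedy_policy(train())
print(policy)
```

After training, the policy should prefer "right" in every cell, because that is the shortest route to the reward. Nobody told the agent this rule; it emerged from the rewards alone.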


# Generative AI

* **Definition**: Generative AI is a class of models that **creates content** from user input.
   * Takes existing data and generates new, original content based on learned patterns.
   * Can produce human-like creative outputs across multiple media types.

* **Capabilities**: Generative AI can:
   * Create unique images, music compositions, and jokes
   * Write article summaries and explanations
   * Produce task instructions and tutorials
   * Edit and enhance photos

* **Input-Output Flexibility**:
   * Can take a **variety of inputs** and create a **variety of outputs**
   * Supports text, images, audio, and video in various combinations
   * Can handle **multimodal** inputs and outputs simultaneously

* **Common Input-Output Types**:
   * **Text-to-text**: Language translation, content writing, summarization
   * **Text-to-image**: Image generation from descriptions
   * **Text-to-video**: Video creation from scripts or descriptions
   * **Text-to-code**: Programming code generation
   * **Text-to-speech**: Audio synthesis from written text
   * **Image and text-to-image**: Enhanced image editing with text prompts

* **Applications**:
   * Content creation and marketing
   * Software development assistance
   * Creative arts and design
   * Educational tools and tutoring
   * Data augmentation and synthetic data generation
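
Modern generative models are large neural networks, which are far beyond a short snippet. As a deliberately tiny stand-in, the sketch below uses a word-level Markov chain to show the same basic idea of learning patterns from existing data and then sampling new content from them. The corpus and function names are made up for illustration, and this is not how production generative AI systems are built.

```python
import random
from collections import defaultdict

def learn_transitions(text):
    """'Training': record which words follow which in the example text."""
    words = text.split()
    transitions = defaultdict(list)
    for current, following in zip(words, words[1:]):
        transitions[current].append(following)
    return transitions

def generate(transitions, start, length, seed=0):
    """'Generation': repeatedly sample a plausible next word."""
    rng = random.Random(seed)
    output = [start]
    # Stop at the requested length, or earlier if the last word has no known follower.
    while len(output) < length and transitions.get(output[-1]):
        output.append(rng.choice(transitions[output[-1]]))
    return " ".join(output)

corpus = "the rain falls and the snow falls and the wind blows"
model = learn_transitions(corpus)
sentence = generate(model, "the", 6)
print(sentence)
```

Every generated word pair was seen somewhere in the corpus, but the sentence as a whole may never have appeared in it.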



![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

# Fundamentals of supervised learning: 

# Data:
  - Words
  - Images (values of pixels)
  - Audio (waveforms)


## Dataset:
  - Labeled dataset: every example includes its correct answer (the label).
  - Unlabeled dataset: examples come without labels.

### Dataset characteristics : 

A dataset is characterized by its **size**, **number of features**, and **diversity**.

### 1. Size
- **Definition**: The number of records or rows. E.g. 10 million transaction records.

### 2. Features
- **Definition**: The number of variables/columns that describe each example in your dataset.
- **Examples**:
  - **Weather Dataset (High Features)**: Hundreds of features including satellite imagery, cloud coverage, wind speed, humidity, pressure, temperature, UV index, precipitation history, etc.
  - **Simple Weather Dataset (Low Features)**: Only 3-4 features like humidity, atmospheric pressure, temperature

### 3. Diversity
- **Definition**: The range and variety of conditions/scenarios your examples cover
- **Real-world examples**:
  - **Medical Dataset**: Patients from different age groups, ethnicities, genders, geographic locations, and medical conditions.
  - **E-commerce Dataset**: Customers from various income levels, countries, shopping behaviors, and purchase seasons.
  - **Image Recognition Dataset**: Photos taken in different lighting conditions, angles, weather, indoor/outdoor settings
  - **Financial Dataset**: Transactions from different economic periods (bull markets, bear markets, recessions, growth periods)
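
The three characteristics can be measured directly on a small table. The sketch below assumes the dataset is a list of dictionaries and uses the number of distinct values per feature as a crude proxy for diversity. Both the records and that proxy are illustrative assumptions.

```python
# Made-up customer records for illustration.
records = [
    {"age_group": "18-25", "country": "IN", "spend": 120.0},
    {"age_group": "26-35", "country": "US", "spend": 340.5},
    {"age_group": "18-25", "country": "IN", "spend": 99.9},
    {"age_group": "36-50", "country": "DE", "spend": 210.0},
]

def describe(rows):
    """Summarize size, number of features, and a simple diversity proxy."""
    size = len(rows)
    features = sorted(rows[0].keys()) if rows else []
    # Diversity proxy: how many distinct values each feature takes.
    distinct = {name: len({row[name] for row in rows}) for name in features}
    return {"size": size, "num_features": len(features), "distinct_values": distinct}

summary = describe(records)
print(summary)
```

Counting distinct values is only a first look. It shows that three countries are present, for example, but not whether each one has enough examples.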

## The Four Dataset Types
1. **Large and Diverse (Ideal):**
   - Example: 1 million customer records from 50 countries over 10 years.

2. **Large but Not Diverse:**
   - Example: 100,000 customer records but only from urban millennials in one city

3. **Small but Diverse:**
   - Example: 500 customer records representing all demographics but too few per group

4. **Small and Not Diverse (Worst):**
   - Example: 100 customer records from one demographic in one location
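
A quick way to spot the "diverse but too few per group" problem is to count examples per group. In this sketch the group labels and the `min_per_group` threshold are arbitrary assumptions; a sensible minimum depends on the problem.

```python
from collections import Counter

def underrepresented(groups, min_per_group):
    """Return the groups that have fewer than min_per_group examples."""
    counts = Counter(groups)
    return sorted(group for group, n in counts.items() if n < min_per_group)

# Made-up group label for each of 12 customer records.
customer_groups = ["urban"] * 6 + ["suburban"] * 5 + ["rural"] * 1
print(underrepresented(customer_groups, min_per_group=3))  # ['rural']
```

All three groups appear in this dataset, so it looks diverse, yet one of them has too few examples for a model to learn from.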

## Key Takeaways
   - **Size ≠ Quality**: More data doesn't automatically mean better predictions
   - **Diversity ≠ Sufficient**: Wide coverage doesn't guarantee enough examples per scenario
   - **Balance is crucial**: Need both sufficient examples AND broad representation
   - **Context matters**: Requirements vary by problem complexity and real-world variability

## Practical Implications
   - **Before collecting data**: Plan for both volume and variety
   - **During analysis**: Check if your data represents the full problem space
   - **Model deployment**: Consider if training conditions match real-world usage
   - **Ongoing monitoring**: Track if new scenarios emerge that weren't in training data
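
For the ongoing-monitoring point, a simple check is to compare categorical values seen in live data against those seen during training. This is a minimal sketch with made-up rows and a hypothetical helper name, `unseen_values`. Real monitoring would also track shifts in numeric features.

```python
def unseen_values(training_rows, live_rows, feature):
    """Return values of `feature` seen in live data but never during training."""
    seen_in_training = {row[feature] for row in training_rows}
    seen_live = {row[feature] for row in live_rows}
    return sorted(seen_live - seen_in_training)

# Made-up rows: the model was trained on two countries only.
training = [{"country": "US"}, {"country": "IN"}, {"country": "US"}]
live = [{"country": "US"}, {"country": "BR"}, {"country": "IN"}, {"country": "JP"}]

new_scenarios = unseen_values(training, live, "country")
print(new_scenarios)  # ['BR', 'JP']
```

A non-empty result suggests that the model is being asked about scenarios it never saw, and its predictions there deserve extra scrutiny.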

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)