# Introduction to ML

### What is Machine Learning?
Machine learning is a way of building software in which computers learn patterns from data, instead of following hand-written rules, and use those patterns to make predictions or decisions.

### Types of Machine Learning
1. **Supervised Learning**: Learning from labeled data. It's like studying from question banks with answers provided, then applying that knowledge to new questions. When testing the model, the test data must not have been seen during training; otherwise the test score is overly optimistic and can hide overfitting. There are two main types:

    i. Classification (Defined Labels): Predicting categories. Here, the model learns to classify input data into predefined categories based on the labeled training data. Like sorting emails into "spam" or "not spam" by learning from examples.

    ii. Regression (Continuous Labels): Predicting continuous values. Here, the model learns to predict continuous outcomes based on the labeled training data. Like estimating house prices based on features such as size and location. Why is it continuous? Because house prices can take any value within a range, rather than being limited to specific categories.

> Input Raw Data (Features & Labels) -> Algorithm -> Processing -> Output
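
To make the two supervised tasks concrete, here is a minimal sketch in plain Python. The toy data, the nearest-neighbour rule, and the least-squares line are illustrative assumptions, not the only way to do classification or regression.

```python
# Classification: predict a category with a 1-nearest-neighbour rule.
# Hypothetical toy data: (suspicious words, links) -> label
emails = [
    ((8, 5), "spam"),
    ((7, 6), "spam"),
    ((1, 0), "not spam"),
    ((0, 1), "not spam"),
]

def classify(features):
    """Return the label of the closest training example."""
    def squared_distance(example):
        point, _label = example
        return sum((a - b) ** 2 for a, b in zip(point, features))
    _point, label = min(emails, key=squared_distance)
    return label

# Regression: predict a continuous value with a least-squares line.
# Hypothetical toy data: house size (m^2) -> price (in thousands)
sizes = [50, 80, 100, 120]
prices = [150, 240, 300, 360]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(sizes, prices)

def predict_price(size):
    return slope * size + intercept

print(classify((6, 4)))    # a category: "spam"
print(predict_price(90))   # a number on a continuous scale: 270.0
```

The classifier can only ever answer with one of the labels it was shown, while the regressor can return any number along the fitted line.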

2. **Unsupervised Learning**: Finding patterns in unlabeled data. It's like exploring a new city without a map, discovering interesting places on your own. Common techniques include clustering and dimensionality reduction.

> Input Raw Data (Unlabeled Data) -> Interpretation -> Algorithm -> Processing -> Output
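
Clustering can be sketched with a small one-dimensional k-means. The spending figures and the starting centers below are made up for illustration, and real projects would normally use a library implementation.

```python
def k_means_1d(values, centers, iterations=10):
    """Group 1-D values around the nearest of the given starting centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical customer spending amounts, with no labels attached.
spending = [10, 12, 11, 95, 100, 98]
centers, clusters = k_means_1d(spending, centers=[0, 50])
print(clusters)  # [[10, 12, 11], [95, 100, 98]]
```

No one told the algorithm which customers are "low" or "high" spenders; the two groups emerge from the data alone.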

3. **Reinforcement Learning**: Learning by trial and error through rewards and penalties. It's like training a pet with treats and scoldings until it learns a trick: the pet tries an action, gets feedback, and gradually favors the actions that earn treats. Pavlov's dog experiment, where a dog learned to salivate at the sound of a bell, is a related but different idea. It is classical conditioning, in which the food arrives regardless of what the dog does, whereas reinforcement learning depends on rewards that follow the learner's own actions.

> Agent -> Action -> Environment -> Reward + New State -> Repeat
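
The reward loop can be illustrated with an epsilon-greedy agent on a two-armed bandit. This is a deliberately simplified sketch: the environment has a single state, so there are no state transitions, and the reward probabilities, `epsilon`, and `seed` are assumed values chosen for the example.

```python
import random

def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a multi-armed bandit (a one-state environment)."""
    rng = random.Random(seed)
    estimates = [0.0] * len(reward_probs)  # estimated value of each action
    counts = [0] * len(reward_probs)       # how often each action was tried

    for _ in range(steps):
        # The agent picks an action: explore sometimes, otherwise exploit.
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))
        else:
            action = max(range(len(estimates)), key=lambda a: estimates[a])

        # The environment responds with a reward of 1 or 0.
        reward = 1 if rng.random() < reward_probs[action] else 0

        # The agent updates its estimate with a running average.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]

    return estimates, counts

estimates, counts = run_bandit([0.2, 0.8])
print(estimates, counts)
```

After enough trials the agent should pull the second arm far more often, because its estimated reward is higher.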

Besides these three main types, modern machine learning also includes two other important approaches: Self-Supervised Learning and Semi-Supervised Learning.

### What Are Features?

Features are the "clues" or "characteristics" that help us understand and describe something. In machine learning, features are the specific pieces of information a computer uses to learn. For example, if we're trying to teach a computer to recognize different types of fruits, the features might include color, size, shape, and texture. These features help the computer figure out what makes an apple different from a banana or a cherry.
* Useful Clues: Good features are like clear directions. They highlight important patterns, helping the model learn quickly and make accurate predictions.
* Bad Clues: Bad features are distracting or irrelevant. They confuse the model, slow down learning, and lead to poor results.

### What Are Labels?

If features are the clues, then labels are the answers we want our model to predict. For example, in a fruit recognition task, the label would be the type of fruit (like "apple," "banana," or "cherry"). Labels are essential for supervised learning because they provide the correct answers that the model uses to learn from the features.
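
In code, features and labels are usually kept as two aligned collections. The sketch below uses a made-up fruit dataset; the names `X` and `y` follow a common convention and are not required by anything.

```python
# Hypothetical fruit dataset: each row holds the features for one fruit.
fruits = [
    {"color": "red",    "weight_g": 150, "shape": "round"},
    {"color": "yellow", "weight_g": 120, "shape": "long"},
    {"color": "red",    "weight_g": 8,   "shape": "round"},
]
# One label per row, in the same order: the answer the model should predict.
labels = ["apple", "banana", "cherry"]

feature_names = sorted(fruits[0])
# X is the feature table, y is the label column.
X = [[row[name] for name in feature_names] for row in fruits]
y = labels

for features, label in zip(X, y):
    print(features, "->", label)
```

Note that the first and third fruits share a color and a shape, so weight is the feature that separates an apple from a cherry here.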

### How Does Training Work?

1. Features: The model takes the features (clues) as input.
2. ML Model: The model processes the features and produces a prediction.
3. Labels: The model compares its predictions to the actual labels (answers) to see how well it's doing.
4. Learning Patterns: The model adjusts itself to reduce its mistakes, and the cycle repeats. Over many rounds it finds the patterns that link features to labels, like connecting the dots between clues and answers.
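
The training steps above can be sketched as a small gradient-descent loop that fits a straight line. The data, learning rate, and number of epochs are assumptions picked so the example converges; other models train differently in detail but follow the same predict, compare, adjust cycle.

```python
def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Predict from the features with the current parameters.
        predictions = [w * x + b for x in xs]
        # Compare the predictions with the labels.
        errors = [p - y for p, y in zip(predictions, ys)]
        # Adjust the parameters in the direction that reduces the error.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical data generated from y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
w, b = train(xs, ys)
print(round(w, 3), round(b, 3))  # 2.0 1.0
```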

### What Is Testing?

After the model has "studied" during training, it's time for its exam! The testing phase is crucial to see how well it learned. The model is tested on data it has never seen before. During testing, the model gets no "answer key": it must make predictions on its own, and only afterwards are those predictions compared with the true answers to score it.
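
One common way to get unseen data is to hold back part of the dataset before training. The helper below is a minimal sketch; the function name `train_test_split`, the 25% test fraction, and the seed are illustrative choices.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle the rows and hold out a fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset of (feature, label) pairs.
data = [(x, "big" if x >= 5 else "small") for x in range(12)]
train_rows, test_rows = train_test_split(data)

# The model only ever sees train_rows while learning.
# At test time it receives the features alone; the held-out labels
# are used afterwards to score its predictions.
test_features = [x for x, _ in test_rows]
test_labels = [label for _, label in test_rows]
print(len(train_rows), len(test_rows))  # 9 3
```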

### What is Generalization?
Generalization is the model's ability to apply what it has learned during training to new, unseen data. A model that generalizes well can make accurate predictions on data it hasn't encountered before, which is the ultimate goal of machine learning.

* Good Generalization: recognizes new examples correctly, has learned essential patterns, adapts to variations

* Poor Generalization: fails with new examples, has memorized specific cases, can't adapt to changes

> Features -> Labels -> Training -> Testing -> Generalization
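
The difference between memorizing and generalizing shows up when the same two models are scored on training data and on new data. Both "models" below are hypothetical toys built for this comparison.

```python
# Hypothetical task: label a number "big" if it is at least 50.
train_set = [(10, "small"), (30, "small"), (60, "big"), (90, "big")]
test_set = [(20, "small"), (40, "small"), (55, "big"), (80, "big")]

# Model A memorizes the training examples and guesses "small" otherwise.
memory = dict(train_set)
def memorizer(x):
    return memory.get(x, "small")

# Model B learns a threshold halfway between the two classes it saw.
largest_small = max(x for x, label in train_set if label == "small")
smallest_big = min(x for x, label in train_set if label == "big")
threshold = (largest_small + smallest_big) / 2
def threshold_model(x):
    return "big" if x >= threshold else "small"

def accuracy(model, dataset):
    return sum(model(x) == label for x, label in dataset) / len(dataset)

for name, model in [("memorizer", memorizer), ("threshold", threshold_model)]:
    print(name, accuracy(model, train_set), accuracy(model, test_set))
# memorizer 1.0 0.5
# threshold 1.0 1.0
```

Both models look perfect on the training set; only the test set reveals that the memorizer has not learned the underlying pattern.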

### Machine Learning Pipeline

The machine learning pipeline is a step-by-step workflow that turns messy real-world data into useful predictions. Think of it as a recipe: you can't just throw ingredients in a pot and expect a perfect meal. Each step matters, and skipping one usually causes problems later on.

##### Step 1: Data Collection
Gather raw data from various sources like databases, files, or web scraping. The quality and variety of data collected here will significantly impact the model's performance.

##### Step 2: Data Preprocessing & Feature Engineering
Clean and prepare the data for analysis. This includes handling missing values, removing duplicates or noise, normalizing data, and converting categorical variables into numerical formats. Additionally, create new features from existing data to enhance the model's learning capability. This step is crucial as it directly affects the model's ability to learn effectively.
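
A few of these cleaning steps can be sketched in plain Python. The records are invented, and filling missing values with the mean is only one of several possible strategies; feature engineering is left out to keep the example short.

```python
# Hypothetical raw records with a missing value and a duplicate row.
raw = [
    {"size": 50.0, "city": "Paris"},
    {"size": None, "city": "Rome"},
    {"size": 100.0, "city": "Paris"},
    {"size": 100.0, "city": "Paris"},  # duplicate
]

# 1. Remove exact duplicates while keeping the original order.
seen, rows = set(), []
for r in raw:
    key = (r["size"], r["city"])
    if key not in seen:
        seen.add(key)
        rows.append(dict(r))

# 2. Fill missing sizes with the mean of the known sizes.
known = [r["size"] for r in rows if r["size"] is not None]
mean_size = sum(known) / len(known)
for r in rows:
    if r["size"] is None:
        r["size"] = mean_size

# 3. Min-max normalize size into the range [0, 1].
low = min(r["size"] for r in rows)
high = max(r["size"] for r in rows)
for r in rows:
    r["size"] = (r["size"] - low) / (high - low)

# 4. One-hot encode the categorical "city" column.
cities = sorted({r["city"] for r in rows})
processed = [
    [r["size"]] + [1 if r["city"] == c else 0 for c in cities]
    for r in rows
]
print(processed)  # [[0.0, 1, 0], [0.5, 0, 1], [1.0, 1, 0]]
```

Each output row is now purely numeric: the normalized size followed by one column per city.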

##### Step 3: Training
Feed the preprocessed data into a machine learning algorithm. The model learns patterns from the features and labels in the training dataset.

##### Step 4: Evaluation
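Test the trained model on a separate dataset that was held out from training (a validation or test set) to assess its performance. For classification, use metrics like accuracy, precision, recall, and F1-score; for regression, use error measures such as mean absolute error or mean squared error.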
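
The classification metrics named above can be computed from counts of true positives, false positives, and false negatives. The function name and the example predictions below are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    # Precision: of everything flagged as positive, how much was right?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of all real positives, how many were found?
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

y_true = ["spam", "spam", "spam", "not spam", "not spam",
          "not spam", "not spam", "not spam"]
y_pred = ["spam", "spam", "not spam", "spam", "not spam",
          "not spam", "not spam", "not spam"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is 0.75, while precision, recall, and F1 are each 2/3, which shows how accuracy alone can look better than the model's handling of the positive class.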

#### Common Mistakes to Avoid
1. Skipping data cleaning
2. Collecting biased or unrepresentative data
3. Trusting the model without testing it
4. Overfitting/Underfitting
5. Using too little data
6. Ignoring feature importance

### Common Machine Learning Applications and Their Challenges

| Application            | Description                                      | Challenges                                      |
|------------------------|--------------------------------------------------|-------------------------------------------------|
| Image Recognition      | Identifying objects in images                     | Variability in lighting, angles, occlusions     |
| Natural Language Processing | Understanding and generating human language      | Ambiguity, context understanding                     |
| Fraud Detection        | Identifying fraudulent transactions               | Evolving tactics, class imbalance                |
| Recommendation Systems | Suggesting products or content to users          | Personalization, data sparsity                   |
| Autonomous Vehicles    | Self-driving cars and drones                      | Safety, real-time decision making                     |
| Healthcare Diagnostics | Assisting in medical diagnoses                    | Data privacy, variability in medical data        |
| Predictive Maintenance | Forecasting equipment failures                    | Sensor data quality, varying operating conditions |
| Smart Assistants       | Virtual assistants like Siri, Alexa               | Understanding context, speech recognition        |
| Monitoring and Surveillance | Analyzing video feeds for security purposes       | Privacy concerns, real-time processing           |


##### Challenges 
1. Bias in Data: Unfair data can lead to biased models. Use diverse and representative datasets to minimize this risk.
2. Overfitting: Models that fit the training data too closely, including its noise and quirks, may not perform well on new data. Like memorizing answers without understanding concepts. Use more varied training data and validation techniques to prevent this.
3. Underfitting: Models that don't learn enough from training data may miss important patterns. Like predicting house prices based only on size, ignoring location and condition. Use more complex models that capture key relationships and additional features to improve learning.
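
Underfitting can be seen by comparing a model that is too simple with one that is flexible enough for the data. The sketch below uses invented data with a clear trend and measures error on the training data itself, which is where underfitting already shows.

```python
def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data with a clear upward trend (y = 3x).
xs = [1, 2, 3, 4, 5]
ys = [3, 6, 9, 12, 15]

# Too simple: always predict the average label, ignoring the feature.
mean_y = sum(ys) / len(ys)
def constant_model(x):
    return mean_y

# Flexible enough: a least-squares line through the data.
mean_x = sum(xs) / len(xs)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
def linear_model(x):
    return slope * x + intercept

print(mean_squared_error(constant_model, xs, ys))  # 18.0
print(mean_squared_error(linear_model, xs, ys))    # 0.0
```

The constant model has a large error even on data it has already seen, which is the typical sign of underfitting.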

### Why Does Preprocessing Matter?

Machine learning models depend entirely on the quality of their input data. Poor, inconsistent, or biased data will lead to unreliable predictions, regardless of the model's sophistication. Preprocessing ensures clean, well-prepared data for better model performance and trustworthy results. A commonly quoted rule of thumb is that data preparation takes up around 80% of the work in a machine learning project, and careful preprocessing can improve a model's accuracy substantially, although the size of the gain depends on the dataset and the task.
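
One small illustration of how data preparation changes results: with unscaled features, a distance-based method is dominated by whichever feature has the largest numbers. The houses and helper names below are hypothetical.

```python
def nearest(query, candidates):
    """Return the name of the candidate closest to the query (Euclidean)."""
    def dist(item):
        _name, point = item
        return sum((a - b) ** 2 for a, b in zip(query, point)) ** 0.5
    return min(candidates.items(), key=dist)[0]

def min_max_scale(points):
    """Rescale every column of a list of points into [0, 1]."""
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]

# Hypothetical houses: (size in square feet, number of bedrooms).
names = ["query", "A", "B", "C"]
points = [(1000, 1), (1010, 5), (1100, 1), (2000, 3)]

raw = dict(zip(names, points))
scaled = dict(zip(names, min_max_scale(points)))

def closest_to_query(table):
    others = {n: p for n, p in table.items() if n != "query"}
    return nearest(table["query"], others)

print(closest_to_query(raw))     # A: size dominates the raw distance
print(closest_to_query(scaled))  # B: both features count after scaling
```

On the raw numbers, house A looks most similar to the query even though it has four more bedrooms; after scaling, house B, which matches on bedrooms and is close in size, becomes the nearest.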

Quality Data = Quality Predictions

The Golden Rule: Garbage in → Garbage out (if the input data is bad, the output will also be bad).