# IBM Machine Learning Online Course

### Course Outline:

Module 1: Introduction to Machine Learning

You will learn foundational machine learning concepts and be introduced to various open-source tools for machine learning, including the popular Python package scikit-learn.

Module 2: Linear and Logistic Regression

You will learn two classical statistical methods that are foundational to machine learning: linear and logistic regression.

Module 3: Building Supervised Learning Models

You will start by understanding how binary classification works and discover how to construct a multiclass classifier from binary classification components. You’ll learn what decision trees are, how they learn, and how to build them. Decision trees, which are used to solve classification problems, have a natural extension called regression trees, which can handle regression problems. You’ll learn about other supervised learning models, like KNN and SVM. You’ll learn what bias and variance are in model fitting and the tradeoff between bias and variance that is inherent to all learning models in various degrees. You’ll learn strategies for mitigating this tradeoff and work with models that do a very good job accomplishing that goal.

Module 4: Building Unsupervised Learning Models

In this module, you’ll dive into unsupervised learning, where algorithms uncover patterns in data without labeled examples. You’ll explore clustering strategies and real-world applications, focusing on techniques like hierarchical clustering, k-means, and advanced methods such as DBSCAN and HDBSCAN. Through practical labs, you’ll gain a deeper understanding of how to compare and implement these algorithms effectively. Additionally, you’ll delve into dimension reduction algorithms like PCA (Principal Component Analysis), t-SNE, and UMAP to reduce dataset features and simplify other modeling tasks. Using Python, you’ll implement these clustering and dimensionality reduction techniques, learning how to integrate them with feature engineering to prepare data for machine learning models.

Module 5: Evaluating and Validating Machine Learning Models

This module covers how to assess model performance on unseen data, starting with key evaluation metrics for classification and regression. You’ll also explore hyperparameter tuning to optimize models while avoiding overfitting using cross-validation. Special techniques, such as regularization in linear regression, will be introduced to handle overfitting due to outliers. Hands-on exercises in Python will guide you through model fine-tuning and cross-validation for reliable model evaluation.

Module 6: Final Exam and Project

In this concluding module, you’ll review the course content, complete a final exam, and work on a hands-on project. You’ll receive a course summary cheat sheet, apply your skills in a project on Rain Prediction in Australia, and participate in peer reviews to share feedback. The module wraps up with guidance on next steps in your learning journey.

• Data processing and analytics: PostgreSQL, Hadoop, Spark, Apache Kafka, pandas, and NumPy

• Data visualization: Matplotlib, Seaborn, ggplot2, and Tableau

• Machine learning: NumPy, Pandas, SciPy, and scikit-learn

• Deep learning: TensorFlow, Keras, Theano, and PyTorch

• Computer vision: OpenCV, scikit-image, and TorchVision

• NLP: NLTK, TextBlob, and Stanza

• Generative AI: Hugging Face Transformers, ChatGPT, DALL-E, and PyTorch


### Ch1.Intro

**Machine learning answers questions using data.**

Common techniques: classification, regression, clustering, association, anomaly detection, sequence mining, dimension reduction, and recommendation systems.

Model lifecycle: problem definition, data collection, data preparation, model development, and model deployment.

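Example code:

    import numpy as np
    from sklearn import preprocessing, svm
    from sklearn.model_selection import train_test_split

    # X (feature matrix) and y (labels) are assumed to be loaded already.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

    # Fit the scaler on the training split only, then apply it to both splits,
    # so no information from the test set leaks into preprocessing.
    scaler = preprocessing.StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    # Train a support vector classifier on the scaled training data.
    clf = svm.SVC(gamma=0.001, C=100.0)
    clf.fit(X_train, y_train)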

Artificial intelligence (AI) simulates human cognition, while **machine learning (ML) uses algorithms and requires feature engineering to learn from data**.

Machine learning includes different types of models: **supervised learning**, which uses labeled data to make predictions; **unsupervised learning**, which finds patterns in unlabeled data; and **semi-supervised learning**, which trains on a small set of labeled data combined with a larger pool of unlabeled data.

Key factors for choosing a machine learning technique include the type of problem to be solved, the available data, available resources, and the desired outcome.

Machine learning techniques include **anomaly detection** for identifying unusual cases like fraud, **classification** for categorizing new data, **regression** for **predicting** continuous values, and **clustering** for grouping similar data points without labels.

Machine learning tools support pipelines with modules for data preprocessing, model building, evaluation, optimization, and deployment.

R is commonly used in machine learning for statistical analysis and data exploration, while Python offers a vast array of libraries for different machine learning tasks. Other programming languages used in ML include Julia, Scala, Java, and JavaScript, each suited to specific applications like high-performance computing and web-based ML models.

Data visualization tools such as **Matplotlib** and **Seaborn** create customizable plots, **ggplot2** enables building graphics in layers, and Tableau provides interactive data dashboards.

Python libraries commonly used in machine learning include NumPy for numerical computations, Pandas for data analysis and preparation, SciPy for scientific computing, and Scikit-learn for building traditional machine learning models.

Deep learning frameworks such as TensorFlow, Keras, Theano, and PyTorch support the design, training, and testing of neural networks used in areas like computer vision and natural language processing.

Computer vision tools enable applications like object detection, image classification, and facial recognition, while natural language processing (NLP) tools like **NLTK, TextBlob, and Stanza** facilitate text processing, sentiment analysis, and language parsing.

Generative AI tools use artificial intelligence to create new content, including text, images, music, and other media, based on input data or prompts.

Scikit-learn provides a range of functions, including classification, regression, clustering, data preprocessing, model evaluation, and exporting models for production use.

The machine learning ecosystem includes a network of tools, frameworks, libraries, platforms, and processes that collectively support the development and management of machine learning models.

### Ch2.Regression

[SimpleLinearRegression](Simple-Linear-Regression.ipynb)

[MultipleLinearRegression](Mulitple-Linear-Regression.ipynb)

[LogisticRegression](Logistic_Regression.ipynb)

Logistic regression predicts the probability that an event will occur by passing a linear combination of the features through the sigmoid function. **It is both a probability predictor and a binary classifier.**

To train a logistic regression model, first choose an initial set of parameters, then predict the probability that class = 1 for each observation. Next, calculate the prediction error with a cost function (the **log-loss function**), and finally update the parameters to reduce that error. Repeat until the loss stops improving.
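The training loop described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the course lab's implementation: the toy data, the learning rate `lr`, the iteration count `n_iter`, and the helper names `sigmoid`, `log_loss`, and `train_logistic` are all hypothetical choices made for the example.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    """Average log-loss; probabilities are clipped to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def train_logistic(X, y, lr=0.1, n_iter=2000):
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    theta = np.zeros(Xb.shape[1])               # 1. choose initial parameters
    history = []
    for _ in range(n_iter):
        p = sigmoid(Xb @ theta)                 # 2. predict P(class = 1)
        history.append(log_loss(y, p))          # 3. measure the prediction error
        grad = Xb.T @ (p - y) / len(y)          # gradient of the log-loss
        theta -= lr * grad                      # 4. update the parameters
    return theta, history

# Hypothetical toy data: one feature, noisy binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)

theta, history = train_logistic(X, y)
print(f"first loss {history[0]:.4f}, last loss {history[-1]:.4f}")
```

With all parameters at zero every predicted probability is 0.5, so the first recorded loss equals ln 2; the loss should fall from there as the parameters are updated.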

Simple regression uses a single independent variable to estimate a dependent variable, while multiple regression involves more than one independent variable.

**In simple linear regression, a best-fit line minimizes errors, measured by Mean Squared Error (MSE); this approach is known as Ordinary Least Squares (OLS).**
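As a small worked example of OLS, the slope and intercept of the best-fit line have a closed form: the slope is the covariance of x and y divided by the variance of x. The five data points below are made up for illustration.

```python
import numpy as np

# Hypothetical data that roughly follows y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.2, 7.9, 10.1])

# Closed-form OLS estimates for a single predictor.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Mean Squared Error of the fitted line on the same data.
y_hat = intercept + slope * x
mse = np.mean((y - y_hat) ** 2)

print(f"slope={slope:.3f}, intercept={intercept:.3f}, MSE={mse:.4f}")
```

For these points the estimates come out at a slope of about 2.0, an intercept of about 0.06, and an MSE of about 0.0104. `np.polyfit(x, y, 1)` should return the same line.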

OLS regression is easy to interpret but **sensitive to outliers**, which can impact accuracy.

Multiple linear regression extends simple linear regression by using multiple variables to predict outcomes and analyze variable relationships.

Adding too many variables can lead to **overfitting**, so careful variable selection is necessary to build a balanced model.

**Nonlinear regression** models complex relationships using **polynomial, exponential, or logarithmic** functions when data does not fit a straight line.

Polynomial regression can fit curved data but may overfit by capturing random noise rather than the underlying pattern.
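One way to see this effect is to fit polynomials of increasing degree to a small noisy sample and compare training error with error on held-out points. The sketch below assumes a sine-shaped ground truth and degrees 1, 3, and 10; both are arbitrary choices for illustration, and `fit_and_score` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small noisy training sample drawn from a smooth curve.
x_train = np.linspace(-1, 1, 15)
y_train = np.sin(3 * x_train) + 0.2 * rng.normal(size=x_train.size)

# Noise-free reference points used to judge generalization.
x_test = np.linspace(-0.95, 0.95, 50)
y_test = np.sin(3 * x_test)

def fit_and_score(degree):
    """Fit a polynomial by least squares and return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

results = {degree: fit_and_score(degree) for degree in (1, 3, 10)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. A test error that stops improving or rises while training error keeps falling is the usual sign of overfitting.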

Logistic regression is a probability predictor and binary classifier, suitable for **binary targets and assessing feature impact**.

Logistic regression **minimizes errors using log-loss** and optimizes with gradient descent or stochastic gradient descent for efficiency.

**Gradient descent** is an iterative process to minimize the cost function, which is crucial for training logistic regression models.



**Reference (Linear and Logistic Regression Cheatsheet.pdf)**

### Ch3. Classification

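**One-vs-one**: "Is it this or is it that?" One binary classifier is trained for each pair of classes, and the class with the most votes wins. **One-vs-all**: one independent binary classifier is trained for each class against all the others.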
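The difference between the two strategies shows up in how many binary classifiers get trained. A short sketch with scikit-learn's `OneVsRestClassifier` and `OneVsOneClassifier` wrappers, assuming a synthetic four-class dataset and logistic regression as the binary component (both are illustrative choices):

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic data with k = 4 classes (an assumption for the example).
X, y = make_blobs(n_samples=400, centers=4, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# One-vs-all trains k classifiers; one-vs-one trains k * (k - 1) / 2.
print("one-vs-all classifiers:", len(ovr.estimators_))
print("one-vs-one classifiers:", len(ovo.estimators_))
```

With four classes this gives 4 one-vs-all classifiers and 6 one-vs-one classifiers, so one-vs-one grows faster as the number of classes increases.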

**Decision Tree** [Decision Tree](Decision-tree-classifier-drug-pred-v1.ipynb) A decision tree classifies data by considering the features one by one.

In a decision tree, **each internal node corresponds to a test, each branch corresponds to the result of the test, and each terminal (leaf) node assigns its data to a class**. To build one, start with a seed (root) node and labeled training data, find the feature that best splits the data, and partition the node's input data according to that split. Repeat the process for each new node, in this simple scheme using each feature only once.

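**Tree pruning** limits tree growth to avoid overfitting. Typical stopping criteria:

1. The maximum depth is reached.
2. A node holds fewer than the minimum number of data points needed to split it.
3. A split would leave fewer than the minimum number of samples in a leaf.
4. The tree has reached the maximum number of leaf nodes.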

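Split quality is measured with one of two criteria: information gain (reduction in entropy) or Gini impurity. The best split is the one that gives the highest information gain or the lowest weighted Gini impurity.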
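A rough sketch of how these split measures can be computed with NumPy. The function names `entropy`, `gini`, and `information_gain` and the eight-label toy node are assumptions for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of the class labels in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini impurity of the class labels in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

# Toy node with four samples of each class.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A split that separates the classes perfectly.
perfect = information_gain(parent, parent[:4], parent[4:])
# A split whose children have the same class mix as the parent.
useless = information_gain(parent, parent[::2], parent[1::2])

print(f"perfect split gain: {perfect:.2f}, useless split gain: {useless:.2f}")
print(f"parent Gini impurity: {gini(parent):.2f}")
```

The perfect split recovers the full 1 bit of parent entropy, while the uninformative split gains nothing. The balanced parent node has a Gini impurity of 0.5.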

**Regression Tree** [Regression Tree](Regression-Trees-Taxi-Tip-v1.ipynb) target is continuous

Objective: predict a **continuous** target variable, using **variance reduction (MSE) as the splitting criterion**. The best split is the one with the LOWEST weighted average of the MSEs of its child nodes.
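The splitting rule can be illustrated for a single numeric feature: try each midpoint between consecutive sorted values as a threshold and keep the one with the lowest weighted MSE. This is a simplified sketch, not how any particular library implements it; `weighted_mse`, `best_threshold`, and the data are hypothetical, and the code assumes all feature values are distinct.

```python
import numpy as np

def weighted_mse(y_left, y_right):
    """Size-weighted average of the MSE around each child's mean prediction."""
    n = len(y_left) + len(y_right)
    mse_left = np.mean((y_left - y_left.mean()) ** 2)
    mse_right = np.mean((y_right - y_right.mean()) ** 2)
    return (len(y_left) * mse_left + len(y_right) * mse_right) / n

def best_threshold(x, y):
    """Return the threshold with the lowest weighted MSE and that score.

    Assumes the values in x are distinct, so both children are non-empty.
    """
    order = np.argsort(x)
    x, y = x[order], y[order]
    candidates = (x[:-1] + x[1:]) / 2  # midpoints between neighboring values
    scores = [weighted_mse(y[x <= t], y[x > t]) for t in candidates]
    best = int(np.argmin(scores))
    return candidates[best], scores[best]

# Toy data with an obvious jump in the target between x = 3 and x = 10.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.5])

threshold, score = best_threshold(x, y)
print(f"best threshold: {threshold}, weighted MSE: {score:.4f}")
```

For this data the chosen threshold is 6.5, the midpoint of the gap where the target jumps. A full regression tree would apply the same search recursively to each child node.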

**Support Vector Machines** are used to build classification and regression models. Each input row is treated as a point in feature space, and the model classifies inputs by finding the hyperplane that separates the classes with the largest margin.

Parameter C controls the balance between maximizing the margin and minimizing misclassifications. A smaller C allows more misclassifications (soft margin); a larger C enforces a harder margin. The basic SVM is a linear classifier, but different kernels let it fit non-linear boundaries: **Linear, Polynomial, RBF, Sigmoid**.
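To get a feel for C and the kernel choice, the sketch below trains scikit-learn's `SVC` with each kernel at a small and a large C on a synthetic two-moons dataset. The dataset, the two C values, and the `scores` dictionary are assumptions for illustration; results on real data will differ.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, non-linearly separable data (an assumption for the example).
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    for C in (0.1, 100.0):
        clf = SVC(kernel=kernel, C=C).fit(X_train, y_train)
        scores[(kernel, C)] = clf.score(X_test, y_test)
        print(f"{kernel:8s} C={C:<6} test accuracy={scores[(kernel, C)]:.2f} "
              f"support vectors={clf.n_support_.sum()}")
```

A smaller C usually keeps more training points as support vectors because the margin is wider; comparing the printed counts for the same kernel is a quick way to check this on a given dataset.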

 [Support Vector Machine](decision-tree-svm-ccFraud-v1.ipynb)

**KNN**  [K Nearest Neighbor](KNN-lab-v1.ipynb) For each new point, find the K nearest training points and use their majority category as the prediction. KNN is a lazy learner: it builds no explicit model, memorizes the training data, and makes predictions based on distance to the training points.
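Because KNN has no training step beyond storing the data, a basic classifier fits in a few lines. The helper `knn_predict` and the six labeled points below are hypothetical; the sketch uses Euclidean distance and a simple majority vote.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Two small, well-separated groups of labeled points.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])

pred_low = knn_predict(X_train, y_train, np.array([1.8, 1.6]))
pred_high = knn_predict(X_train, y_train, np.array([6.2, 6.4]))
print(pred_low, pred_high)
```

The first query point sits among the "a" points and the second among the "b" points, so the predictions are "a" and "b". In practice features are usually scaled first, since distances are sensitive to feature units.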

**Bagging and Boosting** Bagging mitigates overfitting. It trains high-variance, low-bias base learners in parallel on bootstrapped data and combines their predictions, which reduces overall variance.

Boosting mitigates underfitting. It trains low-variance, high-bias base learners sequentially, each building on the previous results and putting more weight on wrong predictions, which gradually reduces bias.
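A hedged comparison of the two ideas using scikit-learn: a random forest as the bagging example (deep trees on bootstrapped samples) and AdaBoost as the boosting example (its default base learner is a depth-1 decision stump). AdaBoost is a stand-in chosen here to keep the example within scikit-learn; XGBoost is a separate library. The synthetic dataset and the `models` and `scores` dictionaries are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

models = {
    # High-variance, low-bias learner and its bagged ensemble.
    "single deep tree": DecisionTreeClassifier(random_state=0),
    "random forest (bagging)": RandomForestClassifier(n_estimators=50, random_state=0),
    # High-bias, low-variance learner and its boosted ensemble.
    "single stump": DecisionTreeClassifier(max_depth=1, random_state=0),
    "AdaBoost (boosting)": AdaBoostClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}
for name, score in scores.items():
    print(f"{name:26s} {score:.3f}")
```

On many datasets each ensemble scores higher than its single base learner, but this is a tendency rather than a guarantee, so the printed scores are worth checking for the data at hand.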

[Random Forests and XGBoost](Random-Forests-XGBoost-v1.ipynb)

### Ch3 Summary

Classification is a supervised machine learning method used to predict labels on new data with applications in churn prediction, customer segmentation, loan default prediction, and multiclass drug prescriptions.

Binary classifiers can be extended to multiclass classification using one-versus-all or one-versus-one strategies.

A decision tree classifies data by testing features at each node, branching based on test results, and assigning classes at leaf nodes.

Decision tree training involves selecting features that best split the data and pruning the tree to avoid overfitting.

Information gain and Gini impurity are used to measure the quality of splits in decision trees.

Regression trees are similar to decision trees but predict continuous values by recursively splitting data to maximize variance reduction.

Mean Squared Error (MSE) is used to measure split quality in regression trees.

K-Nearest Neighbors (k-NN) is a supervised algorithm used for classification and regression by assigning labels based on the closest labeled data points.

To optimize k-NN, test various k values and measure accuracy, considering class distribution and feature relevance.

Support Vector Machines (SVM) build classifiers by finding a hyperplane that maximizes the margin between two classes, effective in high-dimensional spaces but sensitive to noise and large datasets.

The bias-variance tradeoff affects model accuracy, and methods such as bagging, boosting, and random forests help manage bias and variance to improve model performance.

Random forests use bagging to train multiple decision trees on bootstrapped data, improving accuracy by reducing variance.

### CH4. Clustering and dimension reduction

- **Partition-based (k-means):** k clusters with low within-cluster variance
- **Density-based (DBSCAN):** suitable for irregularly shaped clusters
- **Hierarchical clustering:** tree-like structure, either divisive (top-down) or agglomerative (bottom-up merging)


**K-means**: initialize k centroids; iteratively assign each point to its nearest centroid and update the centroids; repeat until the centroids stop moving. It performs poorly when the data contain noise or outliers. **Use the elbow method or the Davies-Bouldin (DB) index to choose K.**
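The assign-and-update loop can be written directly in NumPy. This is a bare-bones sketch: the function `kmeans` is hypothetical, and the initial centroids are picked by hand (one from each blob) to keep the example reproducible, whereas real implementations typically use random or k-means++ initialization.

```python
import numpy as np

def kmeans(X, initial_centroids, n_iter=100):
    """Plain k-means: assign points to the nearest centroid, then update centroids."""
    centroids = np.asarray(initial_centroids, dtype=float).copy()
    k = len(centroids)
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs of 50 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
               rng.normal(loc=(5, 5), scale=0.3, size=(50, 2))])

labels, centroids = kmeans(X, initial_centroids=X[[0, -1]])
print(centroids.round(2))
```

The two centroids settle near (0, 0) and (5, 5), the centers the blobs were drawn around. Repeating the run for several values of k and plotting the within-cluster sum of squares gives the curve used by the elbow method.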

[K means](K-Means-Customer-Seg-v1.ipynb)

**DBSCAN (Density-Based Spatial Clustering of Applications with Noise):**

It creates clusters based on a user-defined density value.
It identifies core points (with enough neighbors), border points (near core points but not dense enough), and noise points (isolated).
DBSCAN can discover clusters of various shapes and sizes and is effective in handling noise and outliers.

Parameters: N (`min_samples`, the minimum number of points required to form a dense region) and epsilon (`eps`, the neighborhood radius)

How It Works:
1. Finds core points with at least min_samples neighbors within eps.
2. Expands clusters from core points, adding directly reachable points.
3. Labels points as core, border, or noise.
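The core, border, and noise labels can be recovered from a fitted scikit-learn `DBSCAN` model. The sketch below assumes a synthetic two-moons dataset with two hand-placed outliers; the `eps` and `min_samples` values are illustrative, not tuned recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters plus two far-away points added as outliers.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = np.vstack([X, [[3.0, 3.0], [-3.0, -3.0]]])

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_  # cluster index per point; -1 marks noise

# Core points are listed explicitly; border points are clustered but not core.
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True
noise_mask = labels == -1
border_mask = ~core_mask & ~noise_mask

n_clusters = len(set(labels.tolist())) - (1 if noise_mask.any() else 0)
print(f"clusters: {n_clusters}, core: {core_mask.sum()}, "
      f"border: {border_mask.sum()}, noise: {noise_mask.sum()}")
```

The two added points have no neighbors within `eps`, so they are labeled as noise (-1) rather than being forced into a cluster.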
   
**HDBSCAN** builds a hierarchical tree of clusters and selects clusters by stability, which is the persistence of a cluster over a range of distance thresholds. It does not need a manually chosen eps, although settings such as the minimum cluster size can still be tuned. This makes it more robust to parameter choices than DBSCAN.

[Comparing-DBScan-HDBScan](Comparing-DBScan-HDBScan-v1.ipynb)

**Dimension Reduction Algorithm**

UMAP (Uniform Manifold Approximation and Projection) and t-SNE (t-Distributed Stochastic Neighbor Embedding) are both non-linear dimensionality reduction techniques, primarily used for visualizing high-dimensional data (e.g., reducing from 100D to 2D or 3D). **UMAP is usually faster, but its math is more complex.**

UMAP is designed to preserve local and global data structures, making it suitable for clustering applications.

**PCA: Step-by-Step Linear Algebra Operations**

**Step 1: Center Data**
$$ X' = X - \bar{X} $$

**Step 2: Compute Covariance Matrix**
$$ \Sigma = \frac{1}{m} X'^T X' $$

**Step 3: Find Eigenvalues & Eigenvectors**
$$ \Sigma v = \lambda v $$

**Step 4: Select Top k Components**
Sort the eigenvectors by decreasing eigenvalue \( \lambda \) and keep the first \( k \).

**Step 5: Project Data**
Project the centered data onto the top \( k \) principal components:

$$
Z = X' W
$$

where:  

- \( X' \) is the **mean-centered data matrix** of shape \( m \times n \), where:
  - \( m \) is the number of samples.
  - \( n \) is the original feature dimension.
- \( W \) is the **projection matrix**, consisting of the top \( k \) eigenvectors:
  
  $$
  W = [v_1, v_2, ..., v_k], \quad W \in \mathbb{R}^{n \times k}
  $$

- \( Z \) is the **transformed dataset** in the reduced-dimensional space, with shape \( m \times k \).
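The five steps translate almost line for line into NumPy. This is a sketch on made-up data (200 samples, 3 correlated features, `k = 2`), using the same \( \frac{1}{m} \) covariance scaling as Step 2; library implementations may scale by \( \frac{1}{m-1} \) instead.

```python
import numpy as np

# Hypothetical data: 200 samples, 3 features with very different variances.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.1]])
k = 2

X_centered = X - X.mean(axis=0)            # Step 1: center the data
cov = X_centered.T @ X_centered / len(X)   # Step 2: covariance matrix (1/m scaling)
eigvals, eigvecs = np.linalg.eigh(cov)     # Step 3: eigh suits symmetric matrices
order = np.argsort(eigvals)[::-1]          # Step 4: sort by largest eigenvalue
W = eigvecs[:, order[:k]]                  #         keep the top k eigenvectors (n x k)
Z = X_centered @ W                         # Step 5: project the data (m x k)

explained = eigvals[order[:k]].sum() / eigvals.sum()
print(f"Z shape: {Z.shape}, variance explained by {k} components: {explained:.3f}")
```

A useful sanity check is that the variance of each column of `Z` equals the corresponding eigenvalue, which is why the eigenvalues are read as "variance explained".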


[PCA](PCA-v1.ipynb)

[tSNE and UMAP](tSNE-UMAP-v1.ipynb)

### CH4 Summary

**Clustering** is a machine learning technique used to group data based on similarity, with applications in customer segmentation and anomaly detection.

**K-means** clustering **partitions** data into clusters **based on the distance** between data points and centroids but struggles with imbalanced or non-convex clusters.

Heuristic methods such as silhouette analysis, the elbow method, and the Davies-Bouldin Index help assess k-means performance.

**DBSCAN is a density-based algorithm** that creates clusters based on density and works well with natural, irregular patterns.

**HDBSCAN is a variant of DBSCAN** that does not require a preset eps and uses cluster stability to find clusters.

Hierarchical clustering can be divisive (top-down) or agglomerative (bottom-up) and produces a dendrogram to visualize the cluster hierarchy.

Dimension reduction simplifies data structure, improves clustering outcomes, and is useful in tasks such as face recognition (using eigenfaces).

Clustering and dimension reduction work together to improve model performance by reducing noise and simplifying feature selection.

PCA, a linear dimensionality reduction method, minimizes information loss while reducing dimensionality and noise in data.

t-SNE and UMAP are other dimensionality reduction techniques that map high-dimensional data into lower-dimensional spaces for visualization and analysis.

### Ch5. Metrics and Evaluation

1. Accuracy: Ratio of correctly predicted instances to the total number of instances
2. Confusion matrix: Breaks down ground truth instances of a class against the number of predicted class instances.
3. Precision: Measures how many predicted positive instances are actually positive
4. Recall: How many actual positive instances are correctly predicted
5. F1 score: Harmonic mean of precision and recall
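A small worked example shows how the metrics above follow from the four confusion-matrix counts. The ten labels and predictions are made up for illustration.

```python
import numpy as np

# Hypothetical ground truth and predictions for a binary problem.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# Confusion-matrix counts.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the predicted positives, how many are real
recall = tp / (tp + fn)      # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} F1={f1:.2f}")
```

Here the counts are TP=3, FN=1, FP=2, TN=4, which gives accuracy 0.70, precision 0.60, recall 0.75, and an F1 score of about 0.67.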
   
[Metrics](EvaluatingClassificationModels-v1.ipynb)

Key regression metrics:

1. MAE (Mean Absolute Error): Average absolute difference between predicted and observed values.

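2. MSE (Mean Squared Error): Average of the squared differences between predicted and observed values, i.e. their sum divided by the number of observations (some statistical treatments divide by the number of observations minus the number of fitted parameters).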

3. RMSE (Root Mean Squared Error): Square root of MSE, easier to interpret as it shares the same units as the target variable.

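4. R-squared: Measures the proportion of variance in the dependent variable explained by the independent variables, typically ranging from 0 (poor fit) to 1 (perfect fit). On held-out data it can be negative when the model fits worse than always predicting the mean.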
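The four regression metrics can be computed by hand on a tiny example. The values below are invented, and MSE is computed here as a plain average over the observations.

```python
import numpy as np

# Hypothetical observed and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))    # average absolute error
mse = np.mean(errors ** 2)       # average squared error
rmse = np.sqrt(mse)              # same units as the target

ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R^2={r2:.3f}")
```

For these values MAE is 1.0, MSE is 1.5, RMSE is about 1.225, and R-squared is 0.7. The single error of 2 dominates the MSE, which illustrates why squared-error metrics are more sensitive to large misses than MAE.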

[Evaluating-random-forest](Evaluating-random-forest-v1.ipynb)

[Evaluating-Clustering](Evaluating-k-means-clustering-v1.ipynb)

The Silhouette score measures how similar each point is to its own cluster compared to other clusters, providing an evaluation of the quality and separation of clusters. This makes it most suitable for assessing how well the clusters are formed in terms of separation and cohesion.
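One common use of the silhouette score is to compare candidate values of k for k-means. The sketch below assumes three synthetic, well-separated blobs with hand-picked centers; the range of k values and the `scores` dictionary are illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs (centers chosen by hand for the example).
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.6,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all points
    print(f"k={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("k with the highest silhouette score:", best_k)
```

Scores range from -1 to 1, with higher values indicating tighter, better-separated clusters. On data this clean the highest score lands on k = 3; on real data the peak is often less clear-cut and is best read alongside other heuristics.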

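**K-fold cross-validation**: split the data into k folds; for each fold, train on the remaining k-1 folds, test on the held-out fold, and store the model score. Aggregate the scores (for example, by averaging) and use the result to select the best model.

For imbalanced datasets, use **stratified cross-validation**, which keeps the class proportions roughly the same in every fold.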
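The fold loop can be written out explicitly with scikit-learn's `StratifiedKFold`. The sketch assumes a synthetic dataset with roughly 10% positives and logistic regression as the model; both are placeholders for whatever data and estimator are being validated.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data: about 90% class 0 and 10% class 1.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = []
positives_per_fold = []
for train_idx, test_idx in cv.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                  # train on k-1 folds
    scores.append(model.score(X[test_idx], y[test_idx]))   # test on the held-out fold
    positives_per_fold.append(int(y[test_idx].sum()))

mean_score = np.mean(scores)  # aggregated score
print("fold accuracies:", np.round(scores, 3))
print("mean accuracy:", round(float(mean_score), 3))
print("positives in each test fold:", positives_per_fold)
```

Stratification keeps the number of minority-class samples nearly identical across test folds, so no fold ends up with too few positives to evaluate on. `cross_val_score` wraps the same loop in a single call.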

**Regularization** regularized cost function = MSE + lambda * penalty  
1. **Ridge L2, lambda*sum(parameters^2)**
2. **LASSO L1, lambda*sum(abs(parameter))** sparse coefficients; best for low SNR conditions
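The effect of the two penalties on the coefficients can be seen with scikit-learn's `Ridge` and `Lasso`, where the `alpha` argument plays the role of lambda. The sketch assumes synthetic data in which only 2 of 10 features matter; the `alpha` values are arbitrary, and scikit-learn scales the squared-error term differently in the two estimators, so the two alphas are not directly comparable.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: only the first two of ten features influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_coef = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)   # no penalty
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can set coefficients to exactly zero

n_zero = int(np.sum(lasso.coef_ == 0))
print("OLS coefficient norm:  ", round(float(np.linalg.norm(ols.coef_)), 3))
print("Ridge coefficient norm:", round(float(np.linalg.norm(ridge.coef_)), 3))
print("Lasso coefficients set to zero:", n_zero, "of", len(lasso.coef_))
```

Ridge shrinks the coefficient vector relative to the unpenalized fit but rarely produces exact zeros, while lasso tends to zero out the irrelevant features, which is what "sparse coefficients" refers to.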

### CH5 Summary

Supervised learning evaluation assesses a model's ability to predict outcomes for unseen data, often using a train/test split to estimate performance.

Key metrics for classification evaluation include accuracy, confusion matrix, precision, recall, and the F1 score, which balances precision and recall.

Regression model evaluation metrics include MAE, MSE, RMSE, R-squared, and explained variance to measure prediction accuracy.

Unsupervised learning models are evaluated for pattern quality and consistency using metrics like Silhouette Score, Davies-Bouldin Index, and Adjusted Rand Index.

Dimensionality reduction evaluation involves Explained Variance Ratio, Reconstruction Error, and Neighborhood Preservation to assess data structure retention.

Model validation, including dividing data into training, validation, and test sets, helps prevent overfitting by tuning hyperparameters carefully.

Cross-validation methods, especially K-fold and stratified cross-validation, support robust model validation without overfitting to test data.

Regularization techniques, such as ridge (L2) and lasso (L1) regression, help prevent overfitting by adding penalty terms to linear regression models.

Data leakage occurs when training data includes information unavailable in real-world data. It can be prevented by separating data properly and selecting features mindfully.

Common modeling pitfalls include misinterpreting feature importance, ignoring class imbalance, and relying excessively on automated processes without causal analysis.

Feature importance assessments should account for redundancy and scale sensitivity, and should avoid misinterpretation and inappropriate assumptions about causation.