# IBM Machine Learning Online Course

**Course Outline:**

Module 1: Introduction to Machine Learning

You will learn foundational machine learning concepts and be introduced to various open-source tools for machine learning, including the popular Python package scikit-learn.

Module 2: Linear and Logistic Regression

This module covers two classical statistical methods foundational to machine learning: linear and logistic regression.

Module 3: Building Supervised Learning Models

You will start by understanding how binary classification works and discover how to construct a multiclass classifier from binary classification components. You’ll learn what decision trees are, how they learn, and how to build them. Decision trees, which are used to solve classification problems, have a natural extension called regression trees, which can handle regression problems. You’ll learn about other supervised learning models, like KNN and SVM. You’ll learn what bias and variance are in model fitting and the tradeoff between bias and variance that is inherent to all learning models in various degrees. You’ll learn strategies for mitigating this tradeoff and work with models that do a very good job accomplishing that goal.

Module 4: Building Unsupervised Learning Models

In this module, you’ll dive into unsupervised learning, where algorithms uncover patterns in data without labeled examples. You’ll explore clustering strategies and real-world applications, focusing on techniques like hierarchical clustering, k-means, and advanced methods such as DBSCAN and HDBSCAN. Through practical labs, you’ll gain a deeper understanding of how to compare and implement these algorithms effectively. Additionally, you’ll delve into dimension reduction algorithms like PCA (Principal Component Analysis), t-SNE, and UMAP to reduce dataset features and simplify other modeling tasks. Using Python, you’ll implement these clustering and dimensionality reduction techniques, learning how to integrate them with feature engineering to prepare data for machine learning models.

Module 5: Evaluating and Validating Machine Learning Models

This module covers how to assess model performance on unseen data, starting with key evaluation metrics for classification and regression. You’ll also explore hyperparameter tuning to optimize models while avoiding overfitting using cross-validation. Special techniques, such as regularization in linear regression, will be introduced to handle overfitting due to outliers. Hands-on exercises in Python will guide you through model fine-tuning and cross-validation for reliable model evaluation.

Module 6: Final Exam and Project

In this concluding module, you’ll review the course content, complete a final exam, and work on a hands-on project. You’ll receive a course summary cheat sheet, apply your skills in a project on Rain Prediction in Australia, and participate in peer reviews to share feedback. The module wraps up with guidance on next steps in your learning journey.

• Data processing and analytics: PostgreSQL, Hadoop, Spark, Apache Kafka, pandas, and NumPy

• Data visualization: Matplotlib, Seaborn, ggplot2, and Tableau

• Machine learning: NumPy, Pandas, SciPy, and scikit-learn

• Deep learning: TensorFlow, Keras, Theano, and PyTorch

• Computer vision: OpenCV, scikit-image, and TorchVision

• NLP: NLTK, TextBlob, and Stanza

• Generative AI: Hugging Face Transformers, ChatGPT, DALL-E, and PyTorch


### Ch1.Intro

**Machine learning answers questions using data.**

Techniques: classification, regression, clustering, association, anomaly detection, sequence mining, dimension reduction, and recommendation systems.

Model lifecycle: problem definition, data collection, data preparation, model development, and model deployment.

import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import svm

Example code:

X = preprocessing.StandardScaler().fit(X).transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33)

clf = svm.SVC(gamma = 0.001, C = 100.)

clf.fit(X_train, y_train)
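
The fragment above leaves `X` and `y` undefined. A minimal runnable sketch follows, assuming the digits dataset bundled with scikit-learn as stand-in data (the notes do not say which dataset was used) and a fixed `random_state` for repeatability. Unlike the fragment, it fits the scaler on the training split only, so that no test-set statistics leak into preprocessing.

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled digits dataset stands in for the unspecified X and y.
X, y = load_digits(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Same hyperparameters as in the fragment above.
clf = svm.SVC(gamma=0.001, C=100.0)
clf.fit(X_train_scaled, y_train)

accuracy = clf.score(X_test_scaled, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

`clf.score` reports plain accuracy on the held-out third of the data, which is the quantity the train/test split is meant to estimate.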

Artificial intelligence (AI) simulates human cognition, while **machine learning (ML) uses algorithms and requires feature engineering to learn from data**.

Machine learning includes different types of models: **supervised learning**, which uses labeled data to make predictions; **unsupervised learning**, which finds patterns in unlabeled data; and **semi-supervised learning**, which trains on a small subset of labeled data together with a larger pool of unlabeled data.
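
To make the supervised/unsupervised distinction concrete, here is a small sketch, assuming synthetic 2-D blobs as toy data. The choice of `LogisticRegression` and `KMeans` is illustrative only; any classifier or clustering algorithm would show the same contrast.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Assumption: synthetic, fairly well-separated 2-D blobs serve as toy data.
X, y = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# Supervised: the labels y are part of training.
classifier = LogisticRegression(max_iter=1000).fit(X, y)
train_accuracy = classifier.score(X, y)

# Unsupervised: only X is given; the algorithm proposes its own grouping.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_

print("Supervised training accuracy:", train_accuracy)
print("Cluster sizes:", sorted(int((cluster_ids == k).sum()) for k in range(3)))
```

Note that the cluster ids returned by `KMeans` are arbitrary integers; they need not match the numbering of the true labels even when the grouping itself is the same.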

Key factors for choosing a machine learning technique include the type of problem to be solved, the available data, available resources, and the desired outcome.

Machine learning techniques include **anomaly detection** for identifying unusual cases like fraud, **classification** for categorizing new data, **regression** for **predicting** continuous values, and **clustering** for grouping similar data points without labels.

Machine learning tools support pipelines with modules for data preprocessing, model building, evaluation, optimization, and deployment.

R is commonly used in machine learning for statistical analysis and data exploration, while Python offers a vast array of libraries for different machine learning tasks. Other programming languages used in ML include Julia, Scala, Java, and JavaScript, each suited to specific applications like high-performance computing and web-based ML models.

Data visualization tools such as **Matplotlib** and **Seaborn** create customizable plots, **ggplot2** enables building graphics in layers, and Tableau provides interactive data dashboards.

Python libraries commonly used in machine learning include NumPy for numerical computations, Pandas for data analysis and preparation, SciPy for scientific computing, and Scikit-learn for building traditional machine learning models.

Deep learning frameworks such as TensorFlow, Keras, Theano, and PyTorch support the design, training, and testing of neural networks used in areas like computer vision and natural language processing.

Computer vision tools enable applications like object detection, image classification, and facial recognition, while natural language processing (NLP) tools like **NLTK, TextBlob, and Stanza** facilitate text processing, sentiment analysis, and language parsing.

Generative AI tools use artificial intelligence to create new content, including text, images, music, and other media, based on input data or prompts.

Scikit-learn provides a range of functions, including classification, regression, clustering, data preprocessing, model evaluation, and exporting models for production use.
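
As a sketch of the evaluation and export steps, the example below cross-validates a model and then round-trips it through Python's standard `pickle` module, which is one common way to persist a fitted scikit-learn estimator. It assumes the bundled iris dataset as toy data and that the model is reloaded with the same scikit-learn version that saved it.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumption: the bundled iris dataset serves as toy data.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Model evaluation: 5-fold cross-validated accuracy.
scores = cross_val_score(model, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())

# Export: serialize the fitted model to bytes, then restore it.
model.fit(X, y)
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# The restored model should predict exactly like the original.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("Restored model matches:", same_predictions)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.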

The machine learning ecosystem includes a network of tools, frameworks, libraries, platforms, and processes that collectively support the development and management of machine learning models.

### Ch2.Regression

[SimpleLinearRegression](Simple-Linear-Regression.ipynb)

[MultipleLinearRegression](Mulitple-Linear-Regression.ipynb)

[LogisticRegression](Logistic_Regression.ipynb)

Logistic regression predicts the probability that an event will happen. It passes a linear combination of the features through the sigmoid function, which makes it both a **probability predictor and a binary classifier**.
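
A minimal sketch of the sigmoid function and how it turns a score into a class label. The 0.5 decision threshold and the helper name `classify` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def classify(z, threshold=0.5):
    """Turn probabilities into 0/1 labels (threshold is an assumed default)."""
    return (sigmoid(z) >= threshold).astype(int)

scores = np.array([-4.0, 0.0, 4.0])
probabilities = sigmoid(scores)
labels = classify(scores)

print(probabilities)  # the middle value is exactly 0.5
print(labels)         # [0 1 1]
```

Large negative scores map close to 0, large positive scores close to 1, and a score of 0 maps to exactly 0.5.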

To train a logistic regression model, we first choose a set of parameters and predict the probability that class = 1. We then calculate the prediction error with a cost function, the **log-loss function**, and finally update the parameters to reduce that error, repeating these steps until the error is small enough.
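
The training steps can be sketched from scratch with NumPy and batch gradient descent. The toy data, the learning rate, and the number of iterations are all assumptions chosen for illustration, not values from the course.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    """Mean binary cross-entropy; eps guards against log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Assumption: one feature, labelled 1 when the noisy feature is positive.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + 0.3 * rng.normal(size=200) > 0).astype(float)
X = np.column_stack([np.ones_like(x), x])  # column of ones for the intercept

theta = np.zeros(2)      # step 1: choose initial parameters
learning_rate = 0.5      # hypothetical value
losses = []
for _ in range(200):
    p = sigmoid(X @ theta)               # step 2: predict P(class = 1)
    losses.append(log_loss(y, p))        # step 3: measure the error
    gradient = X.T @ (p - y) / len(y)    # gradient of the mean log-loss
    theta -= learning_rate * gradient    # step 4: update the parameters

print(f"log-loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

With all parameters at zero, every prediction is 0.5, so the first recorded loss equals ln 2 (about 0.693); each update then moves the parameters against the gradient to lower it.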

Simple regression uses a single independent variable to estimate a dependent variable, while multiple regression involves more than one independent variable.

**In simple linear regression, a best-fit line minimizes errors, measured by Mean Squared Error (MSE); this approach is known as Ordinary Least Squares (OLS).**
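
A short sketch of OLS for simple linear regression, assuming synthetic data scattered around the line y = 2x + 1. It uses the standard closed-form estimates for slope and intercept and checks that nudging the line away from them increases the MSE.

```python
import numpy as np

# Assumption: synthetic data around the line y = 2x + 1 with Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=50)

# Closed-form OLS estimates for simple linear regression.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

def mse(a, b):
    """Mean squared error of the line y = a*x + b on the data."""
    return np.mean((y - (a * x + b)) ** 2)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, MSE={mse(slope, intercept):.2f}")

# Any other line gives a larger MSE on the same data.
print(mse(slope + 0.1, intercept) > mse(slope, intercept))
```

The same fit can be obtained with `np.polyfit(x, y, 1)`, which returns the slope and intercept in that order.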

OLS regression is easy to interpret but **sensitive to outliers**, which can impact accuracy.

Multiple linear regression extends simple linear regression by using multiple variables to predict outcomes and analyze variable relationships.
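
For illustration, a minimal multiple regression with scikit-learn's `LinearRegression`, assuming a synthetic target built from two features with known coefficients so the fitted values can be compared against the truth.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumption: synthetic target y = 3*x1 - 2*x2 + 5 plus small Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# Each coefficient estimates the effect of one variable with the other held fixed.
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
print("R^2 on training data:", model.score(X, y))
```

The fitted coefficients should land close to 3 and -2, which is how the model supports analyzing the relationship between each variable and the outcome.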

Adding too many variables can lead to **overfitting**, so careful variable selection is necessary to build a balanced model.

**Nonlinear regression** models complex relationships using **polynomial, exponential, or logarithmic** functions when data does not fit a straight line.

Polynomial regression can fit data but may overfit by capturing random noise rather than underlying patterns.
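
A hedged sketch of this effect, assuming 15 noisy samples of a quadratic curve and comparing a degree-2 fit with a degree-10 fit. The degrees, sample size, and noise level are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Assumption: noisy training samples of the curve y = x^2.
rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=15).reshape(-1, 1)
y_train = x_train.ravel() ** 2 + rng.normal(scale=1.0, size=15)

# Noise-free points on the same curve act as the test set.
x_test = np.linspace(-3, 3, 100).reshape(-1, 1)
y_test = x_test.ravel() ** 2

results = {}
for degree in (2, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train))
    test_mse = mean_squared_error(y_test, model.predict(x_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

The higher-degree model always matches the training points at least as closely, because it contains the quadratic as a special case. Its test error is usually worse, since the extra flexibility is spent fitting the noise.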

Logistic regression is a probability predictor and binary classifier, suitable for **binary targets and assessing feature impact**.

Logistic regression **minimizes errors using log-loss** and optimizes with gradient descent or stochastic gradient descent for efficiency.

**Gradient descent** is an iterative process to minimize the cost function, which is crucial for training logistic regression models.



**Reference (Linear and Logistic Regression Cheatsheet.pdf)**

### Ch3. Classification