<div style="background-color: #7B5AB1; padding: 15px; border-radius: 8px; box-shadow: 2px 2px 8px #aaa;">
    <h2 style="color: white; text-align: center; text-shadow: 2px 2px 4px #000000;">
        The Machine Learning Toolkit: Algorithms and Techniques -PART I-
    </h2>
</div>

In [1]:
from IPython.display import YouTubeVideo, HTML, display

<div style="background-color: #f2f2f2; padding: 15px; border: 1px solid #ccc; border-radius: 8px;">
    <h3 style="color: blue;">📚 Table of Contents</h3>
    <ul style="list-style: none;">
        <li><a href="#linear_regression" style="text-decoration: none; font-size: 16px;">📈 Linear Regression</a></li>
        <li><a href="#logistic_regression" style="text-decoration: none; font-size: 16px;">📊 Logistic Regression</a></li>
        <li><a href="#decision_trees" style="text-decoration: none; font-size: 16px;">🌳 Decision Trees</a></li>
        <li><a href="#random_forest" style="text-decoration: none; font-size: 16px;">🌲 Random Forest</a></li>
        <li><a href="#support_vector_machines" style="text-decoration: none; font-size: 16px;">🔍 Support Vector Machines</a></li>
        <li><a href="#k_nearest_neighbors" style="text-decoration: none; font-size: 16px;">🏠 k-Nearest Neighbors</a></li>
        <li><a href="#naive_bayes" style="text-decoration: none; font-size: 16px;">📩 Naive Bayes</a></li>
    </ul>
</div>

<div style="border:2px solid black; padding:15px; margin:10px; background-color:#f9f9f9;">
    <h4 style="color: darkred; margin-bottom: 10px;">Disclaimer:</h4>
    <ul>
        <li><p style="margin-bottom: 10px;">The YouTube videos embedded in this notebook are just my suggestions for educational purposes. I find them useful, and I hope you do too! Already familiar with the concept? Feel free to watch them on 2x speed. Note: I am not affiliated with any of these channels.</p></li>
        <li><p>The images in this notebook are generated using <a href="https://leonardo.ai/" target="_blank">Leonardo.ai's</a> Dreamshaper V7 and Leonardo Diffusion models. These images are not intended to serve as direct illustrations or explanations of the concepts discussed. Instead, they are included for their artistic and engaging qualities, offering an abstract representation of the text.</p></li>
    </ul>
</div>

<a id='linear_regression'></a>
<h3 style="color:blue">1. Linear Regression</h3>
<p>Linear Regression is a fundamental algorithm in machine learning and statistics, used to model the relationship between a dependent variable and one or more independent variables. The main goal is to find the best-fit straight line that predicts the output values as accurately as possible within the range of the observed data.</p>


<div style="text-align: center;">
  <div style="display: inline-block; border: 4px solid black; padding: 10px; border-radius: 15px;">
    <img src="https://github.com/nurtanbilmis/other_projects/blob/main/The-Machine-Learning-Toolkit/Images/linear_regression.jpg?raw=true" alt="Linear Regression" width="600"/>
  </div>
</div>

##### Ideal Use-Case
Linear Regression is especially useful when the outcome or target variable is continuous and depends approximately linearly on the independent variables. Tasks such as predicting real estate prices, salaries, and academic performance are well suited to Linear Regression.

##### Real-World Example and Explanation

*Energy Consumption Prediction in Households*

Suppose you're a data scientist working with a utility company to optimize energy consumption in households. You have data on various household attributes, including square meters of living space, number of residents, types of electrical appliances, and outdoor temperature. Using Linear Regression, you can predict the monthly energy consumption (in kWh) for a household based on these features.

##### Math Behind It
The equation for Linear Regression is $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n + \epsilon$, where $y$ is the dependent variable you're trying to predict, $x_1, x_2, \ldots, x_n$ are the independent variables, $\beta_0, \beta_1, \ldots, \beta_n$ are the coefficients that we're trying to optimize, and $\epsilon$ is the error term. The objective is to find values for the $\beta$ coefficients that minimize the sum of the squared differences between the predicted $y$ values and the actual $y$ values.
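
As a minimal sketch of the least-squares objective, the single-feature case ($n = 1$) has a closed-form solution: the slope is the covariance of $x$ and $y$ divided by the variance of $x$. The function name `fit_simple_linear_regression` and the toy data are illustrative assumptions, not part of the original notebook.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (beta_0, beta_1) minimizing squared error for y = beta_0 + beta_1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta_1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    beta_0 = mean_y - beta_1 * mean_x
    return beta_0, beta_1


# Hypothetical toy data: living space (m^2) vs. monthly consumption (kWh),
# generated exactly from y = 50 + 2 * x so the fit is easy to check by hand.
areas = [50, 80, 100, 120]
consumption = [150, 210, 250, 290]
beta_0, beta_1 = fit_simple_linear_regression(areas, consumption)
print(beta_0, beta_1)
```

With several features, the same objective is usually solved with a linear algebra routine or a library estimator rather than by hand.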

##### Limitations
*Sensitive to Outliers*

One of the key limitations of Linear Regression is its sensitivity to outliers. Outliers can disproportionately affect the slope and intercept of the regression line, leading to inaccurate or misleading predictions. In practice, it's important to identify and address outliers in your data before running a Linear Regression analysis to ensure more reliable results.

##### Extend Your Understanding with This Video

In [2]:
YouTubeVideo('nk2CQITm_eo')

---
<a id='logistic_regression'></a>
<h3 style="color:blue">2. Logistic Regression</h3>
<p>Contrary to what its name suggests, logistic regression is used for classification tasks rather than regression. It passes a linear combination of the features through the logistic function so that the output is a probability between 0 and 1, whereas the output of linear regression is unbounded.</p>

<div style="text-align: center;">
  <div style="display: inline-block; border: 4px solid black; padding: 10px; border-radius: 15px;">
    <img src="https://github.com/nurtanbilmis/other_projects/blob/main/The-Machine-Learning-Toolkit/Images/logistic_regression.jpg?raw=true" alt="Logistic Regression" width="600"/>
  </div>
</div>

##### Ideal Use-Case

Logistic Regression excels when you have a dependent variable that can fall into one of two categories, like "yes/no" or "true/false". This algorithm outputs probabilities that can be easily converted to class labels by applying a threshold value.

##### Real-World Example and Explanation

*Spam email filtering*

Imagine you are building a spam filter for an email service. Your training dataset contains thousands of emails, each labeled as either "spam" or "not spam." Logistic Regression can take multiple features, like the frequency of certain keywords ("money," "prize," etc.), sender reputation, etc., and calculate the probability of each email being spam.

For instance, if an email contains the words "lottery" and "prize" frequently, the model might output a high probability, say 0.9, indicating it is likely to be spam. You can set a threshold, such as 0.5, and classify emails with a probability higher than this threshold as spam.


##### Math Behind It
The logistic function, often represented as $\sigma(z)$, is defined as: $\sigma(z) = \frac{1}{1 + e^{-z}}$

Here, $z$ is the weighted sum of the input features $\mathbf{x}$ and model parameters $\boldsymbol{\beta}$: $z = \beta_0 + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 + \ldots + \beta_n \cdot x_n$
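
The two formulas can be sketched in a few lines of plain Python. The intercept and weights below are hand-picked, hypothetical values (a real model would learn them from labeled emails), and the function names are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_proba(features, intercept, weights):
    """Probability of the positive class for one example."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def classify(features, intercept, weights, threshold=0.5):
    """Label the example positive when its probability exceeds the threshold."""
    return predict_proba(features, intercept, weights) > threshold


# Hypothetical parameters: the two features are the counts of
# "lottery" and "prize" in an email.
intercept = -2.0
weights = [1.2, 0.8]

spam_like = [2, 2]  # z = -2.0 + 2.4 + 1.6 = 2.0
plain = [0, 0]      # z = -2.0
print(round(predict_proba(spam_like, intercept, weights), 3))
print(classify(spam_like, intercept, weights), classify(plain, intercept, weights))
```

Raising the threshold above 0.5 trades missed spam for fewer false alarms; the right value depends on which error is more costly.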

##### Limitations
*Linear Decision Boundaries*

Logistic regression assumes that the decision boundary is linear. This could be a limitation if the true decision boundary is highly non-linear. This limitation can often be a deal-breaker when dealing with real-world data that rarely adheres to such a simplified assumption. Therefore, understanding the data's underlying structure is crucial when deciding if Logistic Regression is an appropriate choice for a particular problem.

##### Extend Your Understanding with This Video

In [3]:
YouTubeVideo('yIYKR4sgzI8')

---
<a id='decision_trees'></a>
<h3 style="color:blue">3. Decision Trees</h3>
<p>Decision Trees are best suited for classification and regression problems where the feature space is divided into regions with similar characteristics. They are particularly useful when:

- You have both categorical and numerical features.
- You need an interpretable model.
- You have a dataset with a mix of simple and complex relationships between variables.</p>

<img src="https://github.com/nurtanbilmis/other_projects/blob/main/The-Machine-Learning-Toolkit/Images/decision_tree.jpg?raw=true" alt="Decision Trees" width="400"/>


##### Ideal Use-Case

For instance, in healthcare, you could use Decision Trees to determine the likelihood of a patient having a particular disease based on symptoms and test results.

##### Real-World Example and Explanation
*Creditworthiness of a Loan Applicant*

In this scenario, the tree might start by asking: "Is the applicant's income above $50,000?" Depending on the answer, it might further ask about the applicant's credit score, employment history, and so on. At the end of this tree, leaves will represent the decision—either "Grant Loan" or "Don't Grant Loan."

##### Math Behind It
Decision Trees use metrics like Entropy, Gini Impurity, or Information Gain to decide how to split a node.

1. **Entropy**: Measures the uncertainty or randomness in the dataset.

$E(S) = - p_+ \log_2(p_+) - p_- \log_2(p_-)$

Here, $p_+$ and $p_-$ are the proportions of positive and negative examples in the set $S$.


2. **Information Gain**: Measures how much uncertainty is reduced after a particular attribute is chosen for splitting.

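$IG(S, A) = E(S) - \sum_{t \in T} \frac{|S_t|}{|S|} E(S_t)$

Here, $T$ is the set of subsets produced by splitting $S$ on attribute $A$, and $S_t$ is one of those subsets.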


3. **Gini Impurity**: Measures the impurity or the likelihood of incorrect classification.

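$G(S) = 1 - \sum_{i=1}^{n} p_i^2$

Here, $p_i$ is the proportion of examples in $S$ that belong to class $i$, and $n$ is the number of classes.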


To split a node, you calculate these metrics for each attribute and choose the one that gives the best result (highest Information Gain, lowest Gini Impurity, etc.).
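
A minimal sketch of the three metrics, assuming class labels are given as plain Python lists. The entropy function here is the multi-class generalization of the two-class formula, and the function names and toy labels are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def information_gain(parent, subsets):
    """Entropy of the parent minus the size-weighted entropy of its subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted


# Hypothetical loan decisions before and after a candidate split.
parent = ["grant", "grant", "deny", "deny"]
perfect_split = [["grant", "grant"], ["deny", "deny"]]
useless_split = [["grant", "deny"], ["grant", "deny"]]

print(entropy(parent), gini(parent))
print(information_gain(parent, perfect_split), information_gain(parent, useless_split))
```

A split that separates the classes completely recovers all of the parent's entropy, while a split that leaves each subset as mixed as the parent gains nothing.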

##### Limitations
- Overfitting: If the tree is too deep, it can memorize the data, causing overfitting.
- Sensitivity to Data: Small changes in the data can result in a different tree.
- Biased to Dominant Classes: Trees can be biased if one class heavily outnumbers the other.
- Not Suitable for Unstructured Data: Decision Trees are not well-suited for text, images, or time-series data.
- Complex Trees: Trees can become very complex, making them difficult to interpret.

##### Extend Your Understanding with This Video

In [4]:
YouTubeVideo('_L39rN6gz7Y')

---
<a id='random_forest'></a>
<h3 style="color:blue">4. Random Forest</h3>
<p>Random Forests are an ensemble learning method ideal for both classification and regression tasks. They are particularly useful when:

- Your dataset has high dimensionality or many features.
- You need a model with high accuracy.
- You require both robustness and stability. </p>

<div style="text-align: center;">
  <div style="display: inline-block; border: 4px solid black; padding: 10px; border-radius: 15px;">
    <img src="https://github.com/nurtanbilmis/other_projects/blob/main/The-Machine-Learning-Toolkit/Images/random_forest.jpg?raw=true" alt="Random Forest" width="400"/>
  </div>
</div>

##### Ideal Use-Case
*Fraud Detection*

Random Forests can combine multiple decision trees to identify various kinds of fraudulent activities more accurately than a single tree.

##### Real-World Example and Explanation

Imagine you're working in a customer service department, and your company is receiving thousands of emails from customers daily. You're tasked with sorting these emails into different categories like 'Billing', 'Technical Support', 'General Inquiry', etc., so they can be efficiently routed to the right team.

In this scenario, a Random Forest algorithm can analyze the text and other meta-information of these emails to accurately categorize them. Unlike a single decision tree, which might get confused by the variety of phrases and terms used, the Random Forest combines the insights of multiple trees to make a more robust decision.

##### Math Behind It


1. **Random Feature Selection**: For each split in each tree, a subset of features is randomly selected.

    $F = \text{Randomly selected features from } \\{ f_1, f_2, \ldots, f_p \\}$

2. **Ensemble Prediction**: The final prediction is determined by majority voting (for classification) or averaging (for regression).
  - **Classification**: 

  $P = \text{mode}(p_1, p_2, \ldots, p_n)$
  
  - **Regression**: 
 
  $P = \frac{1}{n} \sum_{i=1}^{n} p_i$
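
The two ingredients above can be sketched without training any trees. This assumes the per-tree predictions are already available as a list; the function names, feature names, and example values are hypothetical.

```python
import random
from collections import Counter
from statistics import mean


def random_feature_subset(features, k, rng):
    """Pick k distinct features at random, as done for each split."""
    return rng.sample(features, k)


def majority_vote(predictions):
    """Classification: the most common label among the trees.

    Ties are resolved in favor of the label that appears first.
    """
    return Counter(predictions).most_common(1)[0][0]


def average_prediction(predictions):
    """Regression: the mean of the trees' numeric predictions."""
    return mean(predictions)


# Hypothetical feature names and per-tree outputs.
features = ["subject_length", "sender_domain", "keyword_count", "has_attachment"]
rng = random.Random(0)  # seeded only so the sketch is repeatable
print(random_feature_subset(features, 2, rng))

tree_labels = ["Billing", "Technical Support", "Billing"]
tree_values = [2.0, 4.0, 6.0]
print(majority_vote(tree_labels), average_prediction(tree_values))
```

Library implementations typically also train each tree on a bootstrap sample of the rows, which this sketch leaves out.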
  





##### Limitations
- Computationally Intensive: Random Forests require more computational power and resources compared to individual Decision Trees.
- Longer Training Time: Due to multiple trees, it takes longer to train.
- Less Interpretable: While individual trees are easy to interpret, a forest as a whole is not.
- Overfitting: Although less prone than Decision Trees, they can still overfit on noisy data.

##### Extend Your Understanding with This Video

In [5]:
video_html = '''
<div style="display: flex; justify-content: space-between;">
    <iframe width="560" height="315" src="https://www.youtube.com/embed/J4Wdy0Wc_xQ" frameborder="0" allowfullscreen></iframe>
    <iframe width="560" height="315" src="https://www.youtube.com/embed/sQ870aTKqiM" frameborder="0" allowfullscreen></iframe>
</div>
'''
display(HTML(video_html))

---
<a id='support_vector_machines'></a>
<h3 style="color:blue">5. Support Vector Machines</h3>
<p>Support Vector Machines (SVM) is a supervised machine learning algorithm primarily used for classification tasks, although it can also be used for regression. The main idea behind SVM is to find a hyperplane in an N-dimensional space (N being the number of features) that best separates different classes of data points.</p>

<div style="text-align: center;">
  <div style="display: inline-block; border: 4px solid black; padding: 10px; border-radius: 15px;">
    <img src="https://github.com/nurtanbilmis/other_projects/blob/main/The-Machine-Learning-Toolkit/Images/svm.jpg?raw=true" alt="SVMs" width="400"/>
  </div>
</div>

In a two-dimensional space, a hyperplane is simply a line. For a three-dimensional space, it would be a plane, and in higher dimensions, it's a hyperplane. This hyperplane should be positioned in such a way that it maximizes the margin between different classes. The margin is the distance from the hyperplane to the nearest data points of each class; those nearest points are called support vectors.

##### Ideal Use-Case
- Binary Classification: SVM is excellent for binary classification problems where you have two classes to separate.
- High-Dimensional Data: SVM performs well when the feature set is large, and the dataset is not too large (thousands of samples).
- Text Classification: Because text data is typically high-dimensional, SVM is widely used in text classification problems.
- Non-linear Problems: With the use of kernel functions, SVM can solve non-linear classification problems.

##### Real-World Example and Explanation

Imagine you're a Data Scientist specializing in cybersecurity, particularly in intrusion detection. You're faced with a high-dimensional dataset that includes various features like packet length, frequency, and IP addresses. Your goal is to distinguish between normal network traffic and malicious intrusions. Here, SVM becomes your go-to algorithm for several reasons:

First, the high dimensionality of the data fits well with SVM's ability to handle many features effectively. Second, network intrusion patterns can be complex and not easily separable by a simple linear model. SVM, with its kernel trick, can implicitly map the data into a higher-dimensional space, allowing for more complex decision boundaries that can separate traffic types more accurately. In essence, SVM provides the flexibility this problem needs in a high-dimensional feature space, although training time grows quickly on very large datasets, which matters for real-time intrusion detection.


##### Math Behind It

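The basic idea is to find a hyperplane that best divides a dataset into classes. Mathematically, the equation of a hyperplane is:
$w \cdot x - b = 0$

- $w$ is the weight vector (perpendicular to the hyperplane).
- $x$ is the input feature vector.
- $b$ is the bias term.

With class labels $y_i = +1$ or $y_i = -1$ and $w$ scaled so that $\lVert w \rVert = 1$, the objective is to maximize the margin $M$ subject to:
- $w \cdot x_i - b \ge M$ for every point with $y_i = +1$
- $w \cdot x_i - b \le -M$ for every point with $y_i = -1$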
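
The sketch below does not train an SVM; it only evaluates the margin of a given hyperplane so that two candidates can be compared. It assumes labels are encoded as +1 and -1 and divides by the length of $w$ so that distances are geometric. The function names and toy points are illustrative.

```python
import math


def signed_distance(w, b, x):
    """Signed geometric distance from point x to the hyperplane w.x - b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) - b) / norm


def margin(w, b, points, labels):
    """Smallest label-signed distance; negative if any point is misclassified."""
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))


# Hypothetical two-dimensional data with labels +1 and -1.
points = [(2.0, 1.0), (3.0, -1.0), (-2.0, 0.0), (-4.0, 2.0)]
labels = [1, 1, -1, -1]

vertical = margin([1.0, 0.0], 0.0, points, labels)   # the line x1 = 0
diagonal = margin([1.0, 1.0], 0.0, points, labels)   # the line x1 + x2 = 0
print(vertical, diagonal)
```

Both lines separate the classes, but the vertical one leaves a wider gap, which is the property a maximum-margin classifier optimizes for.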

##### Limitations
- Computationally Intensive: For large datasets, SVM can be extremely slow.
- Kernel Choice: Choosing an appropriate kernel function can be tricky.
- Binary Classification: Native SVM is designed for binary classification. For multi-class classification, multiple SVMs need to be run.
- Sensitive to Outliers: SVM is sensitive to the presence of outliers, which can skew the optimal hyperplane.

##### Extend Your Understanding with This Video

In [6]:
YouTubeVideo('efR1C6CvhmE')

---
<a id='k_nearest_neighbors'></a>
<h3 style="color:blue">6. k-Nearest Neighbors</h3>
<p>The k-Nearest Neighbors (k-NN) algorithm is a type of supervised machine learning algorithm used for classification and regression tasks. The core idea is simple: to predict the label of a new data point, you look at the 'k' closest data points (neighbors) in the training set and take a "majority vote" for classification or an average for regression.</p>

<div style="text-align: center;">
  <div style="display: inline-block; border: 4px solid black; padding: 10px; border-radius: 15px;">
    <img src="https://github.com/nurtanbilmis/other_projects/blob/main/The-Machine-Learning-Toolkit/Images/k_nearest_neighbors.jpg?raw=true" alt="k-NN" width="400"/>
  </div>
</div>

##### Ideal Use-Case
k-NN is best suited for problems where the relationship between the features and the output label is non-linear and complex. It is often used when:

- The dataset is small to moderately sized.
- The data dimensions are not too high (to avoid the "curse of dimensionality").
- Quick prototyping is needed.

##### Real-World Example and Explanation
Consider a smart agriculture system that collects data on various crops. The system has a database with information like soil pH, moisture levels, and temperature. The goal is to predict whether a certain crop is at risk of developing a specific disease.

The k-NN algorithm can be employed by this system to make quick and effective predictions. By comparing the current crop's conditions to past data, and selecting the 'k' closest instances, the system can classify the crop as either 'at-risk' or 'not at-risk' for disease based on the majority label of its nearest neighbors.


##### Math Behind It
For calculating the distance between two points $x$ and $y$ with $n$ features, the Euclidean distance formula is:

- $\text{Euclidean Distance} = \sqrt{\sum_{i=1}^{n} (x_{i} - y_{i})^2}$

For classification problems, the majority vote among the 'k' nearest neighbors is taken as the prediction. The formula to find the predicted class is:

- $\text{Predicted Class} = \text{mode}(\text{Classes of k Nearest Neighbors})$

For regression problems, the prediction is the average of the 'k' nearest points:

- $\text{Predicted Value} = \frac{1}{k} \sum_{i=1}^{k} \text{Value of i-th Nearest Neighbor}$
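
A brute-force sketch of these three formulas in plain Python, assuming the training data fits in memory as lists of tuples. The function names and the toy crop readings are hypothetical.

```python
import math
from collections import Counter


def euclidean_distance(x, y):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def k_nearest(train_points, train_targets, query, k):
    """Targets of the k training points closest to the query."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: euclidean_distance(train_points[i], query),
    )
    return [train_targets[i] for i in order[:k]]


def knn_classify(train_points, train_labels, query, k):
    """Majority label among the k nearest neighbors."""
    neighbors = k_nearest(train_points, train_labels, query, k)
    return Counter(neighbors).most_common(1)[0][0]


def knn_regress(train_points, train_values, query, k):
    """Mean target value of the k nearest neighbors."""
    neighbors = k_nearest(train_points, train_values, query, k)
    return sum(neighbors) / len(neighbors)


# Hypothetical readings: (scaled soil moisture, scaled temperature).
points = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5)]
risk = ["not at-risk", "not at-risk", "at-risk", "at-risk"]
yield_loss = [1.0, 2.0, 10.0, 12.0]

print(knn_classify(points, risk, (5.5, 5.0), k=3))
print(knn_regress(points, yield_loss, (5.5, 5.0), k=2))
```

Because the distance mixes all features, putting them on comparable scales before calling these functions usually matters.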

##### Limitations
- Sensitive to outliers: A single outlier can significantly influence the prediction.
- High Computational Cost: Requires calculating the distance to every point in the dataset for each prediction.
- Curse of Dimensionality: Performance degrades as the number of features increases.
- Choice of 'k' and distance metric: The algorithm's effectiveness can be sensitive to these choices.

##### Extend Your Understanding with This Video

---
<a id='naive_bayes'></a>
<h3 style="color:blue">7. Naive Bayes</h3>
<p>Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem. It's called "naive" because it makes the strong assumption that all features are independent of each other given the class label. Despite this simplification, it performs surprisingly well for various tasks and is particularly popular in text classification problems like spam filtering and sentiment analysis.</p>

<div style="text-align: center;">
  <div style="display: inline-block; border: 4px solid black; padding: 10px; border-radius: 15px;">
    <img src="https://github.com/nurtanbilmis/other_projects/blob/main/The-Machine-Learning-Toolkit/Images/naive_bayes.jpg?raw=true" alt="naive bayes" width="400"/>
  </div>
</div>

##### Ideal Use-Case
Naive Bayes is ideal for:

- High-dimensional datasets, particularly text data.
- Quick prototyping of classification models.
- Situations where interpretability is more critical than predictive power.

##### Real-World Example and Explanation

Imagine a mobile app that suggests outdoor activities like hiking, swimming, or picnicking based on current weather conditions. The app has a dataset of past weather conditions and the most popular activities on those days.

Naive Bayes can quickly analyze the current weather attributes such as temperature, humidity, and wind speed to recommend the most suitable outdoor activity. It's a good fit because it can handle multiple features efficiently and provide a probability-based recommendation, allowing the app to suggest the most likely enjoyable activity given the weather conditions.

##### Math Behind It

The Naive Bayes algorithm is based on Bayes' theorem, which helps us find the probability of a label given some features, denoted as $ P(C_k | x_1, x_2, \ldots, x_n) $.

In a standard application of Bayes' theorem, the formula would involve conditional probabilities for all combinations of features. However, Naive Bayes simplifies this by making a 'naive' assumption that all features are independent of each other given the label.

Given this assumption, we can rewrite the formula for the posterior probability of a class $ C_k $ given features $( x_1, x_2, \ldots, x_n)$ as:

- $P(C_k | x_1, x_2, \ldots, x_n) \propto P(C_k) \times \prod_{i=1}^{n} P(x_i | C_k)$



- $P(C_k)$ is the prior probability of the class.
- $P(x_i | C_k)$ is the likelihood: how often the feature value $x_i$ occurs among the training examples of class $C_k$.
- The symbol $\propto$ means 'proportional to,' and it indicates that we can ignore the denominator $P(x_1, x_2, \ldots, x_n)$, as it remains constant when we compare different classes.

The class with the highest posterior probability is the prediction for the new data point.
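
As a minimal sketch of this rule for categorical features, the code below estimates priors and likelihoods by counting and then multiplies them. It applies no smoothing, and the function names and weather records are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count classes and per-class feature values."""
    class_counts = Counter(labels)
    # (class label, feature index) -> Counter of observed values
    feature_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts


def posterior_scores(row, class_counts, feature_counts):
    """Unnormalized posterior P(C_k) * prod_i P(x_i | C_k) for each class."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior
        for i, value in enumerate(row):
            score *= feature_counts[(label, i)][value] / count  # likelihood
        scores[label] = score
    return scores


def predict(row, class_counts, feature_counts):
    """Class with the highest unnormalized posterior."""
    scores = posterior_scores(row, class_counts, feature_counts)
    return max(scores, key=scores.get)


# Hypothetical past days: (sky, temperature) and the popular activity.
rows = [
    ("sunny", "hot"), ("sunny", "hot"), ("cloudy", "hot"), ("sunny", "mild"),
    ("sunny", "mild"), ("cloudy", "mild"), ("cloudy", "hot"), ("cloudy", "mild"),
]
labels = ["swimming"] * 4 + ["hiking"] * 4

class_counts, feature_counts = train_naive_bayes(rows, labels)
scores = posterior_scores(("sunny", "hot"), class_counts, feature_counts)
print(scores, predict(("sunny", "hot"), class_counts, feature_counts))
```

Dividing each score by the sum of all scores would turn them into proper probabilities, but the ranking of the classes stays the same.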

##### Limitations
- Independence Assumption: The biggest limitation is the assumption that all features are independent, which is rarely true in real-world applications.
- Zero Frequency: If a feature-label combination is not present in the training data, the probability estimate for that combination will be zero, affecting the model's performance.
- Overfitting: For highly dimensional data, Naive Bayes can suffer from overfitting unless proper smoothing techniques are applied.
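
One common remedy for the zero-frequency problem is additive (Laplace) smoothing, which adds a small pseudo-count `alpha` to every feature value. The sketch below shows only the smoothed likelihood estimate; the function name and the counts are illustrative assumptions.

```python
def smoothed_likelihood(value_count, class_count, n_values, alpha=1.0):
    """Estimate P(x_i | C_k) with additive smoothing.

    value_count: times this feature value was seen in the class
    class_count: number of training examples in the class
    n_values:    number of distinct values the feature can take
    alpha:       pseudo-count added to every value (0 disables smoothing)
    """
    return (value_count + alpha) / (class_count + alpha * n_values)


# Hypothetical case: a value never observed among 3 examples of a class,
# for a feature with 2 possible values.
unsmoothed = smoothed_likelihood(0, 3, 2, alpha=0.0)
smoothed = smoothed_likelihood(0, 3, 2, alpha=1.0)
print(unsmoothed, smoothed)
```

Without smoothing, a single unseen value forces the whole product of likelihoods to zero; with it, the class remains a candidate with a reduced score.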

##### Extend Your Understanding with This Video

In [7]:
YouTubeVideo('O2L2Uv9pdDA')

<div style="text-align: center; font-size: 15px; font-weight: bold; color: purple; border: 3px solid black; padding: 10px; background-color: #f9f9f9;">
  Closing Part I with a glimpse of ARIMA in its <span style="text-decoration: underline;">final boss form</span>! <br>
  <span style="font-size: 15px; color: blue;">Ready to face its true essence?</span> <br>
  <span style="font-size: 15px; color: black;">Join me in Part II!</span>
</div>


<div style="text-align: center;">
  <div style="display: inline-block; border: 4px solid black; padding: 10px; border-radius: 15px;">
    <img src="https://github.com/nurtanbilmis/other_projects/blob/main/The-Machine-Learning-Toolkit/Images/bonus_arima.jpg?raw=true" alt="ARIMA" width="400"/>
  </div>
</div>



<div style="text-align: center; font-size: 14px; color: #666666; padding-top: 10px;">
  <em>Prompt details: Generate an abstract, high-quality image with good lighting of the concept: ARIMA</em><br>
  <em>Sampler: Leonardo, Finetuned model: DreamShaper v7</em>
</div>