Machine learning is a branch of artificial intelligence (AI) that develops algorithms and statistical models that let computers perform tasks without being explicitly programmed for each one. Instead, these algorithms learn from data and use what they learn to make predictions or decisions. The main ideas are outlined below:

1. **Definition**:
   - **Machine Learning (ML)**: A field of AI focused on creating systems that learn from data, identify patterns, and make decisions with minimal human intervention.

2. **Types of Machine Learning**:
   - **Supervised Learning**: The algorithm learns from labeled data. The training dataset includes input-output pairs, and the model makes predictions based on this data.
     - Examples: Regression, Classification.
   - **Unsupervised Learning**: The algorithm learns from unlabeled data. It identifies patterns and relationships in the data without predefined labels.
     - Examples: Clustering, Association.
   - **Semi-Supervised Learning**: Combines labeled and unlabeled data to improve learning accuracy. This approach is useful when obtaining a fully labeled dataset is challenging.
   - **Reinforcement Learning**: The algorithm learns by interacting with an environment, receiving feedback through rewards or penalties, and aims to maximize cumulative rewards.
     - Examples: Game playing, Robotics.

3. **Key Concepts**:
   - **Algorithms**: Procedures or formulas for solving problems. Examples include decision trees, neural networks, and support vector machines.
   - **Training**: The process of teaching a model by feeding it data and allowing it to adjust its parameters.
   - **Model**: The output of the training process, which can be used to make predictions or decisions.
   - **Features**: Individual measurable properties or characteristics of the data.
   - **Labels**: The output or target value in supervised learning.
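To make these terms concrete, here is a minimal sketch in plain Python. The data (hours studied and exam scores) and the names `fit_line` and `predict` are made up for illustration. Training here is ordinary least squares for a straight line, chosen only because it fits in a few lines; it is not the only way to train a model.

```python
# Features (inputs) and labels (targets) for a tiny, made-up supervised task.
features = [1.0, 2.0, 3.0, 4.0, 5.0]       # e.g. hours studied (hypothetical)
labels = [52.0, 54.0, 56.0, 58.0, 60.0]    # e.g. exam score (hypothetical)


def fit_line(xs, ys):
    """Training: find the slope and intercept that minimize squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the trained model (slope, intercept) to a new feature value."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(features, labels)   # the model is the output of training
print(model)                         # (2.0, 50.0)
print(predict(model, 6.0))           # 62.0
```

Here the algorithm is least squares, the parameters adjusted during training are the slope and intercept, and the resulting pair is the model used for prediction.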

4. **Applications**:
   - **Image and Speech Recognition**: Identifying objects in images or transcribing spoken words.
   - **Natural Language Processing (NLP)**: Understanding and generating human language, such as in chatbots or translation services.
   - **Recommendation Systems**: Suggesting products or content based on user behavior.
   - **Predictive Analytics**: Forecasting future trends based on historical data, used in finance, healthcare, and more.
   - **Autonomous Systems**: Enabling vehicles or robots to navigate and make decisions without human intervention.

5. **Challenges**:
   - **Data Quality and Quantity**: High-quality, relevant data is essential for training effective models.
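   - **Overfitting and Underfitting**: An overfit model memorizes the training data and performs poorly on new data, while an underfit model is too simple to capture the underlying pattern. The goal is a model that generalizes well to new, unseen data.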
   - **Interpretability**: Understanding and explaining how models make decisions, which is particularly important in critical applications like healthcare and finance.
   - **Ethics and Bias**: Ensuring models do not perpetuate or amplify biases present in the training data.

Machine learning is a rapidly evolving field with a wide range of techniques and applications, and it plays a crucial role in the development of intelligent systems.

## Data Set
In the mind of a computer, a data set is any collection of data. It can be anything from an array to a complete database.

## Data Types

To analyze data, it is important to know what type of data we are dealing with.

We can split the data types into three main categories:

- Numerical
- Categorical
- Ordinal

Numerical data are numbers, and can be split into two subcategories:

- **Discrete data**: counted data that are limited to integers. Example: the number of cars passing by.
- **Continuous data**: measured data that can take any value. Example: the price of an item, or the size of an item.

Categorical data are values that cannot be ranked against each other. Example: a color value, or any yes/no value.

Ordinal data are like categorical data, but can be ranked against each other. Example: school grades, where A is better than B, and so on.

Knowing the data type of your data source tells you which techniques are appropriate for analyzing it.
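As a rough illustration of why the data type matters, the sketch below treats the three kinds of values differently. The variable names, the sample values, and the `GRADE_RANK` mapping are assumptions made up for this example.

```python
from collections import Counter
from statistics import mean

cars_per_hour = [12, 15, 9, 15]            # numerical (discrete): arithmetic is meaningful
colors = ["red", "blue", "red", "green"]   # categorical: only counting and equality make sense
grades = ["B", "A", "C", "A"]              # ordinal: categories with an order

# Numerical data can be averaged.
print(mean(cars_per_hour))                 # 12.75

# Categorical data can be counted, but not ranked or averaged.
print(Counter(colors).most_common(1))      # [('red', 2)]

# Ordinal data can be ranked once the order is made explicit (assumed mapping).
GRADE_RANK = {"A": 1, "B": 2, "C": 3}
sorted_grades = sorted(grades, key=GRADE_RANK.get)
print(sorted_grades)                       # ['A', 'A', 'B', 'C']
```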

## Machine Learning - Mean Median Mode

In machine learning and statistics, mean, median, and mode are basic measures of central tendency that are used to summarize and understand datasets. Here’s a detailed explanation of each:

### Mean
The mean (or average) is the sum of all values in a dataset divided by the number of values.

- **Formula**:
  \[
  \text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n}
  \]
  where \( x_i \) represents each value in the dataset, and \( n \) is the total number of values.

- **Use Case**: The mean is useful when you want a quick sense of the average value in your dataset. It is sensitive to outliers, meaning that extreme values can significantly affect the mean.

### Median
The median is the middle value of a dataset when it is ordered from least to greatest. If the dataset has an even number of values, the median is the average of the two middle values.

- **Steps to Find Median**:
  1. Sort the dataset in ascending order.
  2. If the number of observations (n) is odd, the median is the middle value.
  3. If \( n \) is even, the median is the average of the two middle values.

- **Use Case**: The median is useful for understanding the central tendency of a dataset without being affected by outliers or skewed data.
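The difference in outlier sensitivity between the mean and the median is easy to see on a small example. The sketch below uses made-up salary values and the standard-library `statistics` module.

```python
from statistics import mean, median

salaries = [30, 32, 35, 36, 40]          # made-up values, in thousands
with_outlier = salaries + [400]          # the same data plus one extreme value

print(mean(salaries), median(salaries))            # 34.6 35
print(mean(with_outlier), median(with_outlier))    # 95.5 35.5
```

A single extreme value almost triples the mean, while the median barely moves.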

### Mode
The mode is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal), more than one mode (bimodal or multimodal), or no mode at all if no value repeats.

- **Use Case**: The mode is useful for categorical data where you want to know which is the most common category. It is less informative for continuous data unless you are interested in the most frequent values.
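The unimodal, bimodal, and "no repeats" cases can be checked with the standard-library `statistics` module (`multimode` requires Python 3.8 or later). The input lists are made up for illustration.

```python
from statistics import mode, multimode

print(mode([1, 2, 2, 3]))            # 2 (unimodal)
print(multimode([1, 1, 2, 2, 3]))    # [1, 2] (bimodal)
print(multimode([1, 2, 3]))          # [1, 2, 3] (no value repeats)
```

Note that `multimode` returns every value when none repeats; whether to call that "no mode" is a matter of convention.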

### Examples
Let's consider a dataset: [3, 7, 8, 5, 12, 14, 21, 13, 18, 14]

- **Mean**:
  \[
  \text{Mean} = \frac{3 + 7 + 8 + 5 + 12 + 14 + 21 + 13 + 18 + 14}{10} = \frac{115}{10} = 11.5
  \]

- **Median**:
  - Sorted dataset: [3, 5, 7, 8, 12, 13, 14, 14, 18, 21]
  - Number of values (n) = 10 (even)
  - Median = \(\frac{12 + 13}{2} = 12.5\)

- **Mode**:
  - The value 14 appears most frequently (twice).
  - Mode = 14
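The hand calculation above can be cross-checked with Python's standard-library `statistics` module:

```python
import statistics

data = [3, 7, 8, 5, 12, 14, 21, 13, 18, 14]

print(statistics.mean(data))     # 11.5
print(statistics.median(data))   # 12.5
print(statistics.mode(data))     # 14
```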

### Application in Machine Learning
- **Data Preprocessing**: Mean, median, and mode are often used in preprocessing steps, such as imputing missing values. For example, you might fill missing values with the median if the data is skewed.
- **Feature Engineering**: These statistics can help in creating new features that might improve model performance.
- **Model Evaluation**: Comparing the mean or median of a regression model's predictions (or of its errors) with the observed values can reveal systematic over- or under-prediction.

By using these measures of central tendency, you can gain insights into your dataset, detect anomalies, and make informed decisions during the data analysis and preprocessing stages in machine learning projects.
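As a small sketch of the imputation idea mentioned above, the code below fills missing entries with the median of the observed values. The `ages` column and the use of `None` as the missing-value marker are assumptions for this example; real pipelines often use a library helper instead.

```python
from statistics import median

# Made-up feature column; None marks a missing value.
ages = [22, 25, None, 31, 95, None, 28]

observed = [a for a in ages if a is not None]
fill_value = median(observed)                        # 28, unaffected by the extreme 95
imputed = [fill_value if a is None else a for a in ages]

print(imputed)   # [22, 25, 28, 31, 95, 28, 28]
```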

### Mean

In [1]:
import numpy

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = numpy.mean(speed)

print(x)

89.76923076923077


In [4]:
import numpy as np

# Example dataset
data = np.array([4, 8, 15, 16, 23, 42])

# Calculate the mean
mean = np.mean(data)
print(mean)


18.0


### Median

In [5]:
import numpy

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = numpy.median(speed)

print(x)

87.0


In [7]:
# Example dataset
data_odd = [3, 1, 4, 1, 5]
data_even = [7, 3, 1, 5]

# Calculate the median for odd number of values
data_odd_sorted = sorted(data_odd)
n_odd = len(data_odd_sorted)
median_odd = data_odd_sorted[n_odd // 2]

# Calculate the median for even number of values
data_even_sorted = sorted(data_even)
n_even = len(data_even_sorted)
median_even = (data_even_sorted[n_even // 2 - 1] + data_even_sorted[n_even // 2]) / 2

print("Median (odd):", median_odd)
print("Median (even):", median_even)


Median (odd): 3
Median (even): 4.0


In [8]:
import numpy

speed = [99,86,87,88,86,103,87,94,78,77,85,86]

x = numpy.median(speed)

print(x)

86.5


### Mode
- The mode is the value that appears the most times.
- The SciPy `stats` module has a `mode()` function for this:

In [14]:
from scipy import stats

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = stats.mode(speed)

print(x)

ModeResult(mode=86, count=3)


In [16]:
from scipy import stats

data = [1, 2, 3, 4, 4, 5, 5, 5, 6]

mode_result = stats.mode(data)
print(mode_result)


ModeResult(mode=5, count=3)


## Machine Learning - Standard Deviation

In machine learning, standard deviation is a measure of the dispersion or spread of a dataset. It indicates how much individual data points differ from the mean of the dataset. A low standard deviation suggests that the data points tend to be close to the mean, while a high standard deviation indicates that the data points are spread out over a wider range of values.

### Calculating Standard Deviation
The population standard deviation (σ) of a dataset, which divides by the number of values \(n\), is calculated using the following formula:

\[
\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}
\]

Where:
- \(x_i\) represents each value in the dataset.
- \(\mu\) is the mean of the dataset.
- \(n\) is the total number of values.

### Steps to Calculate Standard Deviation
1. **Calculate the Mean**: Find the mean (average) of the dataset.
2. **Calculate the Deviation**: Subtract the mean from each value to find the deviation from the mean.
3. **Square the Deviation**: Square each deviation to eliminate negative values.
4. **Calculate the Mean of Squared Deviations**: Find the mean of the squared deviations.
5. **Take the Square Root**: Finally, take the square root of the mean of squared deviations to obtain the standard deviation.
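The five steps above can be sketched directly in plain Python. The function name `population_std` is made up for this example; it computes the population standard deviation (dividing by n), as in the formula.

```python
import math


def population_std(values):
    """Population standard deviation, following the five steps literally."""
    mean = sum(values) / len(values)                 # step 1: mean
    deviations = [x - mean for x in values]          # step 2: deviations from the mean
    squared = [d ** 2 for d in deviations]           # step 3: squared deviations
    mean_squared = sum(squared) / len(squared)       # step 4: their mean (the variance)
    return math.sqrt(mean_squared)                   # step 5: square root


print(population_std([2, 4, 4, 4, 5, 5, 7, 9]))      # 2.0
```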

### Example Calculation

Consider the dataset: [2, 4, 4, 4, 5, 5, 7, 9].

1. **Calculate the Mean**:
   \[
   \text{Mean} = \frac{2 + 4 + 4 + 4 + 5 + 5 + 7 + 9}{8} = \frac{40}{8} = 5
   \]

2. **Calculate the Deviation**:
   Deviations: [-3, -1, -1, -1, 0, 0, 2, 4]

3. **Square the Deviation**:
   Squared deviations: [9, 1, 1, 1, 0, 0, 4, 16]

4. **Calculate the Mean of Squared Deviations**:
   \[
   \text{Mean of Squared Deviations} = \frac{9 + 1 + 1 + 1 + 0 + 0 + 4 + 16}{8} = \frac{32}{8} = 4
   \]

5. **Take the Square Root**:
   \[
   \text{Standard Deviation} = \sqrt{4} = 2
   \]

### Applications in Machine Learning

- **Data Understanding**: Standard deviation provides insights into the variability of data points in a dataset.
- **Feature Selection**: Standard deviation can be used as a feature selection criterion. Features with low variability may be less informative for predictive modeling.
- **Model Evaluation**: Standard deviation is used in various evaluation metrics, such as standard deviation of residuals in regression analysis or standard deviation of accuracy scores in cross-validation.

### Example in Python

Here's how you can calculate the standard deviation using Python with a dataset stored in a list:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

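# pstdev is the population standard deviation (divides by n), matching the
# formula above; statistics.stdev would give the sample version (divides by n - 1).
std_dev = statistics.pstdev(data)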
print("Standard Deviation:", std_dev)
```

Alternatively, you can use NumPy, which is more efficient on large arrays. By default `np.std` uses `ddof=0`, so it also returns the population standard deviation:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

std_dev = np.std(data)
print("Standard Deviation:", std_dev)
```
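One detail worth checking is which denominator a function uses. The formula in this section divides by n (population standard deviation), while the sample standard deviation divides by n - 1. The sketch below compares the two with the standard-library `statistics` module; in NumPy the same choice is controlled by the `ddof` argument of `np.std`.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

population = statistics.pstdev(data)   # divides by n
sample = statistics.stdev(data)        # divides by n - 1

print(population)          # 2.0
print(round(sample, 4))    # 2.1381
```

The gap between the two shrinks as the dataset grows, but on small datasets it is large enough to matter.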

### Importance in Machine Learning

- **Data Quality**: Understanding the variability of features helps in assessing data quality and identifying potential issues.
- **Feature Engineering**: Standard deviation can be used to create new features or transform existing features to improve model performance.
- **Model Training**: Standard deviation is used in various algorithms and techniques, such as outlier detection, normalization, and scaling.
- **Model Evaluation**: Standard deviation-based metrics provide insights into the variability of model predictions and can help in comparing different models' performance.

By effectively understanding and utilizing standard deviation, machine learning practitioners can enhance their data analysis, preprocessing, and modeling processes, leading to more accurate and robust machine learning models.
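One common use of the standard deviation in scaling is standardization (z-scores): subtract the mean and divide by the standard deviation. The sketch below uses made-up height values and the population standard deviation; it illustrates the idea and is not a replacement for a library scaler.

```python
from statistics import mean, pstdev

heights_cm = [150, 160, 170, 180, 190]     # made-up feature values

mu = mean(heights_cm)                      # 170
sigma = pstdev(heights_cm)                 # about 14.14

z_scores = [(x - mu) / sigma for x in heights_cm]
print([round(z, 3) for z in z_scores])     # [-1.414, -0.707, 0.0, 0.707, 1.414]
```

After this transformation the feature has mean 0 and standard deviation 1, which puts features measured in different units on a comparable scale.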

In [17]:
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

std_dev = np.std(data)
print(std_dev)

2.0


In [18]:
import numpy

speed = [86,87,88,86,87,85,86]

x = numpy.std(speed)

print(x)

0.9035079029052513


In [19]:
import numpy

speed = [32,111,138,28,59,77,97]

x = numpy.std(speed)

print(x)

37.84501153334721


- A low standard deviation means that most of the numbers are close to the mean (average) value.

- A high standard deviation means that the values are spread out over a wider range.

## Variance

- Variance is another number that indicates how spread out the values are.

- In fact, if you take the square root of the variance, you get the standard deviation!

- Or the other way around, if you multiply the standard deviation by itself, you get the variance!

To calculate the variance, proceed as follows:

1. Find the mean:

   (32+111+138+28+59+77+97) / 7 ≈ 77.4

2. For each value, find the difference from the mean:

   -45.4, 33.6, 60.6, -49.4, -18.4, -0.4, 19.6

3. For each difference, find the square:

   2061.16, 1128.96, 3672.36, 2440.36, 338.56, 0.16, 384.16

4. The variance is the average of these squared differences:

   (2061.16+1128.96+3672.36+2440.36+338.56+0.16+384.16) / 7 ≈ 1432.2

Luckily, NumPy has a method to calculate the variance:

In [20]:
import numpy

speed = [32,111,138,28,59,77,97]

x = numpy.var(speed)

print(x)

1432.2448979591834


In [21]:
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

variance = np.var(data)
print("Variance:", variance)


Variance: 4.0
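The relationship between variance and standard deviation can be checked without NumPy. This sketch repeats the manual variance steps for the speed values in plain Python, using the unrounded mean, which is why the result is 1432.24 rather than the hand-calculated 1432.2.

```python
import math

speed = [32, 111, 138, 28, 59, 77, 97]

mean = sum(speed) / len(speed)                      # step 1: mean
squared_diffs = [(x - mean) ** 2 for x in speed]    # steps 2 and 3: squared differences
variance = sum(squared_diffs) / len(speed)          # step 4: average of the squares
std_dev = math.sqrt(variance)                       # square root of the variance

print(round(variance, 2))   # 1432.24
print(round(std_dev, 2))    # 37.85
```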


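## Standard Deviation

The standard deviation is the square root of the variance, so for the speed values above it is √1432.24 ≈ 37.85. NumPy's `std()` returns it directly: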

In [22]:
import numpy

speed = [32,111,138,28,59,77,97]

x = numpy.std(speed)

print(x)

37.84501153334721


### Symbols

- Standard Deviation is often represented by the symbol Sigma: σ

- Variance is often represented by the symbol Sigma Squared: σ²

## Machine Learning - Percentiles

In machine learning, percentiles are statistical measures used to describe the distribution of a dataset by dividing it into one hundred equal parts. Percentiles are particularly useful for understanding how individual data points rank relative to the entire dataset. They are commonly used in exploratory data analysis, data preprocessing, and model evaluation.

### Calculating Percentiles
The \(p\)-th percentile of a dataset is the value below which \(p\)% of the observations fall. For example, the 50th percentile (also known as the median) represents the value below which 50% of the observations fall.

One common convention for calculating the \(p\)-th percentile is as follows:

\[
\text{Percentile}_p = \text{the } K\text{-th value in the sorted dataset}
\]

Where:
- \(p\) is the desired percentile (e.g., 50 for the median, 25 for the first quartile, 75 for the third quartile, etc.).
- \(K\) is the 1-based rank, calculated as \(K = \frac{p \times (n + 1)}{100}\), where \(n\) is the number of observations in the dataset. If \(K\) is not an integer, it is common to interpolate linearly between the two closest values. Other conventions exist, and libraries such as NumPy use a different default rule, so results for the same data can differ slightly.

### Example Calculation

Consider the dataset: [2, 4, 4, 4, 5, 5, 7, 9].

1. **Sort the Dataset**: [2, 4, 4, 4, 5, 5, 7, 9]
2. **Calculate the Index**: For the 50th percentile (median), \(K = \frac{50 \times (8 + 1)}{100} = 4.5\).
3. **Interpolate**: Since \(K\) is not an integer, we interpolate between the 4th and 5th values in the sorted dataset.
4. **Calculate the Percentile**: The 50th percentile is the average of the 4th and 5th values, which is \((4 + 5) / 2 = 4.5\).

So, the 50th percentile (median) of the dataset is 4.5.
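The rank-and-interpolate procedure above can be sketched in plain Python. The function name `percentile_n_plus_1` is made up, and clamping ranks that fall outside 1..n is an assumption of this sketch, since the text does not say how to handle them.

```python
def percentile_n_plus_1(values, p):
    """p-th percentile using the rank K = p * (n + 1) / 100 with linear interpolation."""
    data = sorted(values)
    n = len(data)
    k = p * (n + 1) / 100            # 1-based, possibly fractional rank
    k = min(max(k, 1), n)            # assumption: clamp ranks that fall outside 1..n
    lower = int(k)                   # rank of the value at or just below K
    fraction = k - lower
    if lower == n:                   # K is exactly the last rank
        return float(data[-1])
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(percentile_n_plus_1(data, 50))   # 4.5
print(percentile_n_plus_1(data, 25))   # 4.0
print(percentile_n_plus_1(data, 75))   # 6.5
```

For the 75th percentile this convention gives 6.5, whereas `np.percentile` with its default settings returns 5.5 for the same data, because NumPy's default interpolation rule uses a different rank formula.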

### Applications in Machine Learning

- **Data Understanding**: Percentiles provide insights into the distribution and spread of data points in a dataset.
- **Outlier Detection**: Percentiles are used to identify outliers by comparing individual data points to specific percentiles.
- **Feature Engineering**: Percentiles can be used to create new features or transform existing features, particularly in skewed distributions.
- **Model Evaluation**: Percentile-based metrics are used to evaluate model performance and assess predictions across different percentiles of the dataset.
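As one example of percentile-based outlier detection, the sketch below applies the widely used 1.5 × IQR rule, where the interquartile range (IQR) is the distance between the 25th and 75th percentiles. The rule, the threshold of 1.5, and the data are assumptions for this illustration; the text above does not prescribe a specific method. Quartiles come from the standard-library `statistics.quantiles` (Python 3.8 or later) with its default method, which may differ slightly from NumPy's default.

```python
from statistics import quantiles

values = [2, 4, 4, 4, 5, 5, 7, 9, 40]     # made-up data with one extreme value

q1, _, q3 = quantiles(values, n=4)         # 25th, 50th, and 75th percentiles
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]
print(q1, q3, outliers)                    # 4.0 8.0 [40]
```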

### Example in Python

Here's how you can calculate percentiles with NumPy for a dataset stored in a list:

```python
import numpy as np

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Calculate the 50th percentile (median)
median = np.percentile(data, 50)
print("Median:", median)

# Calculate the 25th and 75th percentiles (first and third quartiles)
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
print("First Quartile (Q1):", q1)
print("Third Quartile (Q3):", q3)
```

### Importance in Machine Learning

- **Data Understanding**: Percentiles provide insights into the distribution of data points and help in identifying skewness and outliers.
- **Preprocessing**: Percentile-based transformations can be applied to features to make them more suitable for certain algorithms or to handle skewed distributions.
- **Model Training**: Percentiles are used in various techniques and algorithms to handle data variability and make models more robust.
- **Model Evaluation**: Percentile-based metrics provide a comprehensive evaluation of model performance across different segments of the dataset.

By effectively understanding and utilizing percentiles, machine learning practitioners can gain valuable insights into their data, improve model performance, and make informed decisions throughout the machine learning pipeline.

In [23]:
import numpy as np

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Calculate the 50th percentile (median)
median = np.percentile(data, 50)
print("Median:", median)

# Calculate the 25th and 75th percentiles (first and third quartiles)
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
print("First Quartile (Q1):", q1)
print("Third Quartile (Q3):", q3)

Median: 4.5
First Quartile (Q1): 4.0
Third Quartile (Q3): 5.5


In [24]:
import numpy

ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]

x = numpy.percentile(ages, 75)

print(x)

43.0


In [25]:
import numpy

ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]

x = numpy.percentile(ages, 25)

print(x)

11.0


In [26]:
import numpy

ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]

x = numpy.percentile(ages, 50)

print(x)

31.0


In [28]:
import numpy

ages = [5,31,43,48,50,41,7,11,15,39,80,82,32,2,8,6,25,36,27,61,31]

x = numpy.percentile(ages, 90)

print(x)


61.0
