# **Machine Learning (Getting Started)**

## **Machine Learning**

  - Machine Learning means making a computer learn by studying data and statistics.
  - Machine Learning is a step in the direction of artificial intelligence (AI).
  - A Machine Learning program analyzes data and learns to predict an outcome.

## **Data Sets**
In the mind of a computer, a data set is any collection of data. It can be anything from an array to a complete database.

Example of an array:

    [99,86,87,88,111,86,103,87,94,78,77,85,86]

Example of a database:
  

| ID | First Name | Last Name | Email | Age | Program |
|----|------------|-----------|-------|-----|---------|
| 1 | John | Smith | john.smith@email.com | 32 | Masters |
| 2 | Sarah | Johnson | sarah.j@email.com | 28 | Masters  |
| 3 | Pradip | Puri | impradp@email.com | 45 | PhD |
| 4 | Emma | Brown | emma.b@email.com | 35 | PhD |
| 5 | Michael | Williams | mike.w@email.com | 22 | Bachelors |


By looking at the array, we can guess that the average value is probably around 80 or 90, and we are also able to determine the highest value and the lowest value, but what else can we do?

And by looking at the database we can see that Masters and PhD are the most common programs (two students each) and that the youngest student is 22 years old. But what if we could predict whether a student is in a PhD program just by looking at the other values?

That is what Machine Learning is for! Analyzing data and predicting the outcome!
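As a rough illustration, the sketch below reads the simple facts mentioned above out of the two data sets using plain Python. The variable names (`values`, `students`) and the reduction of the table to `(age, program)` pairs are assumptions made for this example only.

```python
from collections import Counter

# The array from the example above
values = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

lowest = min(values)
highest = max(values)
average = sum(values) / len(values)
print(f"lowest={lowest}, highest={highest}, average={average:.2f}")

# The table from the example above, reduced to (age, program) pairs
students = [
    (32, "Masters"),
    (28, "Masters"),
    (45, "PhD"),
    (35, "PhD"),
    (22, "Bachelors"),
]

# Count how many students are enrolled in each program
program_counts = Counter(program for _, program in students)
youngest_age = min(age for age, _ in students)
print(program_counts)
print(f"youngest student: {youngest_age}")
```

Summaries like these describe the data we already have; predicting a value we have not seen yet is the part that needs a learned model.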

## **Data Types**

The following data types are commonly encountered in machine learning.

Numerical Data:
- Continuous: Values that can take any number within a range (e.g., height, weight, temperature, prices)
- Discrete: Count values or whole numbers (e.g., number of rooms, count of items)

Categorical Data:
- Nominal: Categories with no inherent order (e.g., colors, product names, cities)
- Ordinal: Categories with a meaningful order (e.g., education level, customer satisfaction ratings)
- Binary: Data with only two possible values (e.g., yes/no, true/false)

Time Series Data:
Data points collected over time at regular intervals (e.g., stock prices, weather measurements, sales figures)

Text Data:
Unstructured text that requires natural language processing (e.g., customer reviews, social media posts, articles)

Image Data:
Digital images represented as matrices of pixel values, commonly used in computer vision tasks

Audio Data:
Sound recordings represented as waveforms or spectrograms

Video Data:
Sequences of images with temporal information

Each data type requires specific preprocessing techniques:
- Numerical data often needs scaling or normalization
- Categorical data typically requires encoding (one-hot, label, or ordinal encoding)
- Text data needs tokenization, vectorization, or embedding
- Image and audio data usually require normalization and may need resizing or feature extraction
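To make two of these steps concrete, here is a minimal sketch of min-max scaling for numerical data and one-hot encoding for categorical data, written in plain Python. The helper names `min_max_scale` and `one_hot_encode` are hypothetical and exist only for this example; in practice a preprocessing library would usually do this work.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(labels):
    """Turn each label into a 0/1 vector with one position per distinct category."""
    categories = sorted(set(labels))
    encoded = [[1 if label == cat else 0 for cat in categories] for label in labels]
    return categories, encoded


# Age and Program columns from the student table
ages = [32, 28, 45, 35, 22]
programs = ["Masters", "Masters", "PhD", "PhD", "Bachelors"]

scaled_ages = min_max_scale(ages)
categories, encoded_programs = one_hot_encode(programs)

print(scaled_ages)
print(categories)
print(encoded_programs)
```

One-hot encoding suits nominal categories because it does not imply any order between them; for ordinal data an ordered integer encoding may be the better fit.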

## **Mean**
The mean is the average of all the values in a data set.

To calculate the mean, find the sum of all values, and divide the sum by the number of values.

    (99+86+87+88+111+86+103+87+94+78+77+85+86) / 13 = 89.77
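The same arithmetic can be sketched in plain Python without any library, which makes the two steps (sum, then divide) explicit. The variable names `total` and `count` are chosen for this example.

```python
speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

total = sum(speed)    # step 1: sum of all values
count = len(speed)    # number of values
mean = total / count  # step 2: divide the sum by the count

print(total, count)
print(round(mean, 2))
```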



In [None]:
# Calculating the mean using numpy library/module
import numpy

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = numpy.mean(speed)

print(x)

## **Median**
The median is the middle value once all the values have been sorted. The list below has 13 values, so the median is the 7th one, 87.

    77, 78, 85, 86, 86, 86, 87, 87, 88, 94, 99, 103, 111
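When the number of values is even there is no single middle value, and the usual convention is to average the two middle ones. The following is a minimal sketch of that rule in plain Python; the function name `median` is defined here for illustration and is not a library call.

```python
def median(values):
    """Return the middle value of the sorted data (mean of the two middle values if the count is even)."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

odd_median = median(speed)                # 13 values: the 7th sorted value
even_median = median([77, 78, 85, 86])    # 4 values: average of 78 and 85

print(odd_median)
print(even_median)
```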

In [None]:
# Calculating median
import numpy

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = numpy.median(speed)

print(x)

## **Mode**
The mode is the value that appears most often.

      99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86 = 86
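Counting the occurrences shows why 86 is the mode here. The sketch below uses `collections.Counter` from the standard library and, as a design choice for this example, returns every value that shares the highest count, since a data set can have more than one mode.

```python
from collections import Counter

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

counts = Counter(speed)               # value -> number of occurrences
highest_count = max(counts.values())  # how often the most frequent value appears

# Collect every value that reaches the highest count (handles ties)
modes = sorted(value for value, count in counts.items() if count == highest_count)

print(modes, highest_count)
```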



In [None]:
# Calculating mode from scipy
from scipy import stats

speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]

x = stats.mode(speed)

print(x)

## **Standard Deviation**

Standard deviation is a statistical measure that quantifies the amount of variation or dispersion in a dataset.

In simpler terms, it tells you how spread out your numbers are from their average (mean) value.

- A low standard deviation means the values tend to be close to the mean
- A high standard deviation means the values are spread out over a wider range


Some important notes about NumPy's standard deviation functions:

- *np.std()* by default uses *ddof=0* (population standard deviation)
- Use *ddof=1* when working with a sample of a larger population (sample standard deviation)
- NumPy has both *np.std()* and *array.std()* methods that work the same way
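The two variants differ only in the denominator. Written out for $N$ values $x_i$ with mean $\bar{x}$ (the symbols are chosen here for illustration), the population form on the left corresponds to *ddof=0* and the sample form on the right to *ddof=1*:

```latex
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}
\qquad
s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}
```

More generally, the divisor is $N - \text{ddof}$, so the sample value is always slightly larger than the population value for the same data.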

Mathematical steps:

    # Manual calculation example
    differences = scores - mean
    squared_diff = differences ** 2
    variance = np.mean(squared_diff)
    manual_std = np.sqrt(variance)



In [None]:
# Standard Deviation Calculation
import numpy as np

# Sample dataset: test scores of students
scores = np.array([85, 90, 72, 95, 88, 82, 78, 85, 92, 89])

# Calculate mean
mean = np.mean(scores)
print(f"Mean score: {mean}")

# Calculate standard deviation
std_dev = np.std(scores)
print(f"Standard deviation: {std_dev}")

# You can also specify ddof (delta degrees of freedom)
# ddof=1 gives sample standard deviation (commonly used in statistics)
sample_std = np.std(scores, ddof=1)
print(f"Sample standard deviation: {sample_std}")

## **Variance**
Variance is a measure of spread that shows how far a set of numbers lies from its mean. It is the average of the squared differences from the mean, which makes it the square of the standard deviation.

Key Takeaways:

- Variance is never negative (because the differences are squared); it is zero only when all values are identical
- Larger variance indicates more spread out data
- Units are squared (if your data is in meters, variance is in square meters)
- Like standard deviation, you can calculate population variance (ddof=0) or sample variance (ddof=1)
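The difference between the two denominators can be checked by hand. The sketch below works in plain Python and cross-checks the results against the standard library's `statistics.pvariance` and `statistics.variance`; the variable names are chosen for this example.

```python
import math
import statistics

data = [4, 8, 6, 2, 9, 5, 7, 3]

n = len(data)
mean = sum(data) / n
sum_squared_diff = sum((x - mean) ** 2 for x in data)

population_variance = sum_squared_diff / n    # same divisor as ddof=0
sample_variance = sum_squared_diff / (n - 1)  # same divisor as ddof=1

# The standard deviation is the square root of the variance
population_std = math.sqrt(population_variance)

print(f"mean={mean}, population variance={population_variance}, sample variance={sample_variance}")
print(f"population standard deviation={population_std:.4f}")

# Cross-check against the standard library
print(math.isclose(population_variance, statistics.pvariance(data)))
print(math.isclose(sample_variance, statistics.variance(data)))
```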

In [None]:
# Variance Calculation
import numpy as np

# Sample dataset
data = np.array([4, 8, 6, 2, 9, 5, 7, 3])

# Calculate variance using NumPy
variance = np.var(data)
print(f"Variance using np.var(): {variance}")

# Calculate sample variance (n-1 denominator)
sample_variance = np.var(data, ddof=1)
print(f"Sample variance: {sample_variance}")

# Let's break down how variance is calculated manually:
# 1. Calculate the mean
mean = np.mean(data)

# 2. Calculate squared differences from mean
squared_diff = (data - mean) ** 2

# 3. Calculate variance (mean of squared differences)
manual_variance = np.mean(squared_diff)
print(f"Manual variance calculation: {manual_variance}")

In [None]:
# Optional
# Variance along specific axis of multi-dimensional array
array_2d = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

# Variance of each column
col_variance = np.var(array_2d, axis=0)
print(f"Column variances: {col_variance}")

# Variance of each row
row_variance = np.var(array_2d, axis=1)
print(f"Row variances: {row_variance}")

# Calculate covariance between two variables
# Note: np.cov() divides by n-1 (sample covariance) by default
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
covariance = np.cov(x, y)[0, 1]  # Off-diagonal entry of the 2x2 covariance matrix
print(f"Covariance between x and y: {covariance}")

# Standard deviation is the square root of variance
# (`variance` is the population variance of `data` from the previous cell)
std_dev = np.sqrt(variance)
print(f"Variance: {variance}")
print(f"Standard deviation: {std_dev}")

## **Percentiles**

A percentile indicates the value below which a given percentage of observations falls.

For example, if you score in the 90th percentile, you performed better than 90% of the group.

  1.  25th percentile (Q1): First quartile
  2.  50th percentile (Q2): Median
  3.  75th percentile (Q3): Third quartile
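A percentile rarely lands exactly on one data point, so some interpolation rule is needed. The sketch below implements one common rule, linear interpolation between the two nearest sorted values, in plain Python. The function name `percentile_linear` is hypothetical, and the rule is written to follow what NumPy documents as its default *linear* method.

```python
import math


def percentile_linear(values, q):
    """Percentile q (0-100) by linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100  # fractional index into the sorted data
    below = math.floor(position)
    above = math.ceil(position)
    fraction = position - below
    return ordered[below] + (ordered[above] - ordered[below]) * fraction


scores = [55, 62, 68, 74, 77, 82, 84, 85, 87, 91, 93, 95, 98]

q1 = percentile_linear(scores, 25)
q2 = percentile_linear(scores, 50)
q3 = percentile_linear(scores, 75)
p90 = percentile_linear(scores, 90)

print(q1, q2, q3, round(p90, 2))
```

With 13 values, the 25th, 50th and 75th percentiles fall exactly on the 4th, 7th and 10th sorted values, while the 90th falls between 93 and 95 and has to be interpolated.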

In [None]:
import numpy as np

# Sample dataset - let's say these are test scores
scores = np.array([55, 62, 68, 74, 77, 82, 84, 85, 87, 91, 93, 95, 98])

# Calculate various percentiles
median = np.percentile(scores, 50)  # 50th percentile is the median
quartile_25 = np.percentile(scores, 25)  # First quartile
quartile_75 = np.percentile(scores, 75)  # Third quartile
percentile_90 = np.percentile(scores, 90)  # 90th percentile

print(f"Median (50th percentile): {median}")
print(f"25th percentile: {quartile_25}")
print(f"75th percentile: {quartile_75}")
print(f"90th percentile: {percentile_90}")

# Calculate multiple percentiles at once
percentiles = np.percentile(scores, [25, 50, 75, 90])
print(f"25th, 50th, 75th and 90th percentiles: {percentiles}")

### **Interpolation Methods in Percentile**
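NumPy offers different interpolation methods through the *method* parameter of *np.percentile()*. This parameter is available from NumPy 1.22 onward; earlier releases used a parameter named *interpolation* instead.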
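The methods only disagree when the requested percentile falls between two data points. As a rough sketch in plain Python (not NumPy's own implementation), the following shows how each rule picks a value for the 90th percentile of the test scores, which lies between 93 and 95. Ties at a fraction of exactly 0.5 are not modelled for the *nearest* rule here.

```python
import math

# Test scores, already sorted
scores = [55, 62, 68, 74, 77, 82, 84, 85, 87, 91, 93, 95, 98]

position = (len(scores) - 1) * 90 / 100  # fractional index, about 10.8
below = scores[math.floor(position)]     # neighbouring value below
above = scores[math.ceil(position)]      # neighbouring value above
fraction = position - math.floor(position)

results = {
    "linear": below + (above - below) * fraction,   # weighted between the neighbours
    "lower": below,                                 # always the lower neighbour
    "higher": above,                                # always the higher neighbour
    "midpoint": (below + above) / 2,                # halfway between the neighbours
    "nearest": above if fraction > 0.5 else below,  # whichever neighbour is closer
}

for name, value in results.items():
    print(f"{name}: {value}")
```

For the 75th percentile of the same scores, the position is exactly index 9, so all five rules return the same value, 91.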

Common Applications:

  - Performance Assessment (test scores, athletic performance)
  - Income Distribution Analysis
  - Growth Charts (height, weight percentiles)
  - Outlier Detection using IQR method

In [None]:
# Different interpolation methods
linear = np.percentile(scores, 75, method='linear')  # Default
nearest = np.percentile(scores, 75, method='nearest')
lower = np.percentile(scores, 75, method='lower')
higher = np.percentile(scores, 75, method='higher')
midpoint = np.percentile(scores, 75, method='midpoint')

# Real-world example: Calculate percentile rank of a score
def percentile_rank(data, score):
    """Calculate the percentile rank of a score within a dataset"""
    return len(data[data <= score]) / len(data) * 100

score_to_check = 85
rank = percentile_rank(scores, score_to_check)
print(f"A score of {score_to_check} is in the {rank:.1f}th percentile")

# Interquartile Range (IQR) - common measure of spread
iqr = np.percentile(scores, 75) - np.percentile(scores, 25)
print(f"Interquartile Range: {iqr}")

# Using nanpercentile for data with missing values (NaN entries are ignored)
data_with_nan = np.array([1, 2, np.nan, 4, 5])
percentile_with_nan = np.nanpercentile(data_with_nan, 50)
print(f"Median ignoring NaN: {percentile_with_nan}")

### **Outlier Detection using Percentiles**

In [None]:

# Detect outliers using IQR method
q1 = np.percentile(scores, 25)
q3 = np.percentile(scores, 75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = scores[(scores < lower_bound) | (scores > upper_bound)]
print(f"Outliers: {outliers}")