# Introduction to Machine Learning and Supervised Learning Basics

- Machine Learning (ML) is a field of artificial intelligence (AI) that uses statistical techniques to give computer systems the ability to **learn** from data, without being explicitly programmed.


- Machine learning involves training algorithms to recognize patterns in data and make predictions or decisions based on that data.
- It is used in a variety of applications, from recommendation systems to image recognition and natural language processing.

### Machine Learning Types

- There are three main types of machine learning:
1. Supervised Learning: The algorithm learns from labeled training data, and makes predictions based on that data. Examples include classification and regression.

2. Unsupervised Learning: The algorithm learns from unlabeled data, finding hidden patterns or intrinsic structures. Examples include clustering and association.

3. Reinforcement Learning: The algorithm learns by interacting with an environment, receiving rewards or penalties for actions. Examples include game playing and robotics.

### Common Machine Learning Algorithms
- Linear Regression: Used for predicting numerical outcomes (e.g., house prices).
- Logistic Regression: Used for classification problems (e.g., predicting whether an email is spam or not).
- Decision Trees: Models that split data into branches to make predictions based on different decision rules.
- K-Nearest Neighbors (KNN): A simple algorithm that classifies new cases based on the majority vote of its k-nearest neighbors.
- Support Vector Machines (SVM): Creates a decision boundary that best separates classes of data.
- Neural Networks: Algorithms inspired by the human brain, particularly useful for tasks like image recognition and natural language processing.
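
To make one of the algorithms listed above concrete, here is a minimal sketch of the K-Nearest Neighbors majority vote in plain Python. The function name `knn_predict` and the toy data are hypothetical and chosen only for illustration. The features are not scaled, so house size dominates the distance.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point, paired with that point's label
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours (sorted by distance only)
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (size in sq ft, bedrooms) -> price category
train_points = [(1500, 2), (1600, 3), (1700, 2), (2600, 4), (2800, 5), (2900, 6)]
train_labels = ["Affordable", "Affordable", "Affordable",
                "Expensive", "Expensive", "Expensive"]

small_house = knn_predict(train_points, train_labels, (1650, 3))
large_house = knn_predict(train_points, train_labels, (2700, 5))
print(small_house, large_house)
```

In practice, scikit-learn's `KNeighborsClassifier` would normally be used instead of a hand-written loop.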

### Workflow in Machine Learning
- Data Collection: Gather data from various sources (e.g., CSV files, databases, APIs).
- Data Preprocessing: Clean, normalize, and transform the data to make it usable for the model.
- Feature Selection: Choose the most relevant features from the dataset that will have the most impact on the model's predictions.
- Model Selection: Choose an appropriate algorithm based on the problem (e.g., linear regression, decision trees).
- Training: Fit the model to the data by feeding it the training dataset.
- Evaluation: Assess the model using a testing dataset and performance metrics.
- Deployment: Once the model is performing well, it can be deployed to make predictions on new data.
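
As a small illustration of the "normalize" part of the preprocessing step above, the sketch below rescales a feature to the range 0 to 1 (min-max scaling). The helper name `min_max_scale` and the sample sizes are assumptions made for this example, and min-max scaling is only one of several common normalization choices.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        # All values identical: there is no spread to rescale, so map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lowest) / span for v in values]

sizes_sq_ft = [1400, 1800, 2200, 3000]  # hypothetical house sizes
scaled_sizes = min_max_scale(sizes_sq_ft)
print(scaled_sizes)  # [0.0, 0.25, 0.5, 1.0]
```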

### Machine Learning Libraries

Popular ML libraries in Python include:
- scikit-learn: A library for classical machine learning algorithms.
- TensorFlow: An open-source library for deep learning developed by Google.
- Keras: A high-level neural networks API, running on top of TensorFlow.
- PyTorch: An open-source deep learning library originally developed by Facebook (now Meta).

### Supervised Learning

- In supervised learning, you train a model using labeled data. Each input data point is matched with its correct output. The model learns from these pairs to figure out how to predict outputs for new data it hasn't seen before.

- Comparison with Unsupervised Learning:

    - In supervised learning, the model is trained on labeled data, whereas in unsupervised learning, the model is trained on unlabeled data and must find patterns and relationships in the data without guidance.


**Types of Supervised Learning Problems**
- Classification: Predicting a categorical label (e.g., spam detection, image classification).
- Regression: Predicting a continuous value (e.g., house price prediction, temperature forecasting).

## Example: Predicting House Prices (Regression)

In [None]:
# To silence warnings

import warnings
warnings.filterwarnings("ignore")

### Step 1: Import the necessary libraries

In [None]:
import numpy as np  # For numerical operations
import pandas as pd  # For data manipulation using DataFrames

# Import functions from scikit-learn for model building and evaluation
from sklearn.model_selection import train_test_split  # To split the dataset into training and testing sets
from sklearn.linear_model import LinearRegression  # To create and train a linear regression model
from sklearn.metrics import mean_squared_error  # To evaluate the model using Mean Squared Error (MSE)
from sklearn.metrics import r2_score  # To evaluate the model using R-squared (R²)

### Step 2: Generate a simple dataset

In [None]:
# Set random seed for reproducibility
np.random.seed(42) 
# By setting the seed with np.random.seed(42), you ensure that anyone running the code will get the same random numbers, 
# making your results reproducible. 
# The number 42 is often used as a default seed in examples. However, you can set the seed to any integer value.  
# The reason 42 is commonly chosen is not technical, but rather cultural. 
# It's a reference to the book "The Hitchhiker's Guide to the Galaxy" by Douglas Adams, 
# where 42 is humorously declared as the "Answer to the Ultimate Question of Life, the Universe, and Everything."


# Number of samples
num_samples = 100


# Generate synthetic data based on ranges and patterns from the original data
sizes = np.random.randint(1400, 3000, num_samples)  # Random sizes between 1400 and 3000 sq ft
bedrooms = np.random.choice([2, 3, 4, 5, 6], num_samples)  # Random number of bedrooms between 2 and 6

# Generate prices based on a linear relationship with size and bedrooms, and add some noise
base_price = 200000 + sizes * 100 + bedrooms * 50000
prices = base_price + np.random.normal(0, 25000, num_samples)  # Add noise

In [None]:
# Create DataFrame
data = {
    'Size (sq ft)': sizes,
    'Bedrooms': bedrooms,
    'Price ($)': prices.astype(int)  # Convert to integer
}

df = pd.DataFrame(data)

# Display the first few rows of the generated data
df.head()

### Step 3: Separate features and target variable

In [None]:
X = df[['Size (sq ft)', 'Bedrooms']]
y = df['Price ($)']

In [None]:
X

### Step 4: Split the data into training and testing sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# X: Features (independent variables) in the dataset
# y: Target (dependent variable) in the dataset

# test_size=0.2: Specifies that 20% of the data should be used for the test set, 
# and the remaining 80% for the training set.

# random_state=42: Ensures that the data is split in the same way every time you run the code, 
# providing reproducibility. The number 42 is arbitrary and can be any integer.

# X_train: Training data for the features
# X_test: Testing data for the features

# y_train: Training data for the target
# y_test: Testing data for the target

In [None]:
X_train
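
To see roughly what a train/test split does, here is a plain-Python sketch: shuffle the row indices with a fixed seed, then hold out a fraction of them for testing. The function `simple_train_test_split` is hypothetical and is not scikit-learn's implementation, so it will not reproduce the exact rows that `train_test_split` selects.

```python
import random

def simple_train_test_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then hold out a `test_size` fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded shuffle: same split on every run
    n_test = int(round(len(rows) * test_size))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(100))  # stand-in for 100 records
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))  # 80 20
```

Every record ends up in exactly one of the two sets, which is what keeps the test set unseen during training.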

### Step 5: Choose and train the model

In [None]:
model = LinearRegression()  # Initialize a Linear Regression model
model.fit(X_train, y_train)  # Train the model using the training data (X_train) and the corresponding target values (y_train)

#### Linear Regression Algorithm
Model Formula:

- `𝑦 = 𝜃0 + 𝜃1𝑥1 + 𝜃2𝑥2`

where:
- `𝜃0` is the intercept
- `𝜃1` is the coefficient for house size
- `𝜃2` is the coefficient for number of bedrooms
- `𝑥1` is house size
- `𝑥2` is number of bedrooms
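
The coefficients in this formula are chosen to minimise the sum of squared errors (ordinary least squares). The sketch below shows the idea with NumPy's `np.linalg.lstsq` on hypothetical, noise-free data, so the known coefficients are recovered almost exactly. The variable names and data are assumptions for illustration, and this is not necessarily how scikit-learn computes the fit internally.

```python
import numpy as np

# Hypothetical, noise-free data generated from price = 200000 + 100*size + 50000*beds
size = np.array([1500.0, 2000.0, 2500.0, 1800.0, 2200.0])  # x1: house size
beds = np.array([2.0, 3.0, 4.0, 5.0, 3.0])                  # x2: bedrooms
price = 200000 + 100 * size + 50000 * beds

# Design matrix with a leading column of ones so that theta0 acts as the intercept
A = np.column_stack([np.ones_like(size), size, beds])

# Least-squares solution: the theta that minimises the sum of squared errors
theta, *_ = np.linalg.lstsq(A, price, rcond=None)
theta0, theta1, theta2 = theta
print(theta0, theta1, theta2)  # approximately 200000, 100, 50000
```

With noisy data, as in this notebook, the fitted values land close to the true coefficients but not exactly on them.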

In [None]:
# Print the fitted intercept and coefficients
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")

##### The linear regression equation based on these results would be:

`𝑦 = 195203.3704513209 + 100.32761232 * (House Size (sq ft)) + 51483.21560206 * (Bedrooms)`

### Step 6: Make predictions

In [None]:
X_test

In [None]:
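# Use the formula with the fitted coefficients and intercept to predict one record from the test data.
# The result gets its own name so that the target Series `y` is not overwritten.
manual_prediction = 195203.3704513209 + 100.32761232 * 1606 + 51483.21560206 * 5
manual_prediction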

In [None]:
y_pred = model.predict(X_test) # Use the trained model to make predictions on the test data (X_test)
y_pred

In [None]:
y_test

### Step 7: Evaluate the model

In [None]:
mse = mean_squared_error(y_test, y_pred)  # Calculate the Mean Squared Error (MSE) between the actual values (y_test) and the predicted values (y_pred)
rmse = np.sqrt(mse)  # Calculate the Root Mean Squared Error (RMSE) by taking the square root of the MSE, which provides the error in the same units as the target variable

Read the name backwards (E, S, M, R) to get the steps:

- **E**: find the **errors** (subtract each predicted value from the actual value)
- **S**: **square** these errors
- **M**: find the **mean** (average) of the squared errors
- **R**: take the square **root** of that mean
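
The four steps can be traced by hand on a few numbers. The prices below are hypothetical and chosen so the arithmetic is easy to follow.

```python
import math

actual = [300000, 450000, 500000]      # hypothetical true prices
predicted = [310000, 440000, 520000]   # hypothetical model predictions

errors = [a - p for a, p in zip(actual, predicted)]  # E: errors (actual - predicted)
squared = [e ** 2 for e in errors]                    # S: square each error
mean_squared = sum(squared) / len(squared)            # M: mean of the squares (the MSE)
rmse = math.sqrt(mean_squared)                        # R: square root of the mean

print(errors)          # [-10000, 10000, -20000]
print(round(rmse, 2))  # 14142.14
```

Squaring removes the sign of each error, and the final square root brings the result back to the same units as the prices.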

In [None]:
# print("Predictions:", y_pred)
# print("Actual values:", y_test.values)
print("Root Mean Squared Error (RMSE):", rmse)

In [None]:
# Calculate the R-squared value
r2 = r2_score(y_test, y_pred)

# Print the R-squared value
print(f"R-squared: {r2:.2f}") # ".2f" output in 2 decimal places

### What do these metrics mean?


- **RMSE (Root Mean Squared Error)** tries to answer the question: "How large, on average, are the errors in my model's predictions?"
    - **Optimal Value**: As close to 0 as possible.
    - Lower RMSE indicates that the predictions are closer to the actual values, meaning smaller errors.
    - Whether an RMSE is acceptable depends on the scale of the target. For prices around 500k, an RMSE of 20k is roughly a 4% error and may be acceptable. For prices below 100k, the same 20k would be a large error, and an RMSE of around 2k or less would be a better target.

- **R-squared (`R²`)** tries to answer the question: "How much of the variability in the actual data (its spread around the mean) do my model's predictions explain?"
    - In essence: how much better is the model than always predicting the average?
    - **Optimal Value**: As close to 1 as possible.
    - R² = 1 means the model predicted everything perfectly.
    - R² = 0 means the predictions were no better than just using the average price of all the houses every time.
    - R² less than 0 means the predictions were worse than just using the average every time.
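
R² can be computed directly from its definition, 1 minus the ratio of the squared prediction errors to the squared deviations from the mean. The sketch below uses a hypothetical helper `r_squared` and made-up prices to show three reference cases: perfect predictions, always predicting the mean, and predictions worse than the mean.

```python
def r_squared(actual, predicted):
    """R² = 1 - (sum of squared residuals) / (total sum of squares around the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [200000, 300000, 400000]  # hypothetical prices, mean = 300000

perfect = r_squared(actual, actual)                      # every prediction exact
mean_only = r_squared(actual, [300000, 300000, 300000])  # always predict the mean
worse = r_squared(actual, [400000, 300000, 200000])      # predictions reversed

print(perfect, mean_only, worse)  # 1.0 0.0 -3.0
```

Always predicting the mean scores exactly 0, and a model that does worse than that scores below 0.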
   

### Predict the price of a new house

In [None]:
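# Size: 2500 sq ft, 4 bedrooms.
# Column names match the training data, so scikit-learn does not warn about missing feature names.
new_house = pd.DataFrame([[2500, 4]], columns=['Size (sq ft)', 'Bedrooms'])
predicted_price = model.predict(new_house)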
print(f"Predicted price for the new house: ${predicted_price[0]:,.2f}")

In [None]:
predicted_price

In [None]:
# Using the formula with the known coefficients and intercept

predicted_price2 = 195203.3704513209 + 100.32761232 * 2500 + 51483.21560206 * 4
print(f"Predicted price for the new house: ${predicted_price2:,.2f}")

In [None]:
# Practice Question:
# Train a model to predict house prices using only the House Size feature.

## Predict whether a house is "Expensive" or "Affordable" (Classification)

### Step 1: Import necessary libraries

In [None]:
# Import the LogisticRegression model from scikit-learn's linear_model module
from sklearn.linear_model import LogisticRegression

# Import metrics for evaluating classification models from scikit-learn's metrics module
from sklearn.metrics import accuracy_score  # To calculate the accuracy of the model
from sklearn.metrics import confusion_matrix  # To create a confusion matrix for the model's predictions
from sklearn.metrics import classification_report  # To generate a detailed classification report including precision, recall, F1-score, and support

### Step 2: Generate a simple dataset

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Number of samples
num_samples = 100

# Generate synthetic data based on the given ranges and patterns
sizes = np.random.randint(1400, 3000, num_samples)  # Random sizes between 1400 and 3000 sq ft
bedrooms = np.random.choice([2, 3, 4, 5, 6], num_samples)  # Random number of bedrooms between 2 and 6

# Generate price categories based on a simple rule:
# - 'Affordable' if size is less than 1900 sq ft and bedrooms are less than 5
# - 'Expensive' otherwise
price_category = np.where((sizes < 1900) & (bedrooms < 5), 'Affordable', 'Expensive')

# Create the DataFrame
data = {
    'Size (sq ft)': sizes,
    'Bedrooms': bedrooms,
    'Price Category': price_category
}

df = pd.DataFrame(data)

# Display the first few rows of the generated data
df.head()

### Step 3: Encode the target variable (convert categories to numbers)

In [None]:
df['Price Category'] = df['Price Category'].map({'Affordable': 0, 'Expensive': 1})

In [None]:
df

### Step 4: Separate features and target variable

In [None]:
X = df[['Size (sq ft)', 'Bedrooms']]
y = df['Price Category']

### Step 5: Split the data into training and testing sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Step 6: Choose and train the Logistic Regression model

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)

### What happens in the "backend"?

In Logistic Regression, the process is a bit different from Linear Regression because instead of predicting a continuous outcome (like price), you're predicting a **probability that an instance belongs to a particular class** (like "spam" or "not spam" or in our case, "expensive" or "affordable").

The steps involve:

- calculating the linear combination of the features, as in linear regression,
- applying the sigmoid function to turn that value into a probability,
- and comparing the probability with a threshold (usually 0.5) to pick a class: is the sigmoid output at or above the threshold, or below it?

**Steps to Make a Prediction with Logistic Regression**

1. **Linear Combination**: Compute the linear combination of the features using the coefficients and intercept.

   $$
   z = \text{intercept} + \text{coef}_1 \times \text{feature}_1 + \text{coef}_2 \times \text{feature}_2 + \dots
   $$

    or

    `z = 𝜃0 + 𝜃1𝑥1 + 𝜃2𝑥2 + ...`
    
2. **Apply Sigmoid Function**: Use the sigmoid function to convert the linear combination $(z)$ into a probability.  

   $$
   \sigma(z) = \frac{1}{1 + e^{-z}}
   $$


3. **Thresholding**: If the resulting probability is greater than or equal to 0.5, predict the class as 1 (e.g., "expensive"), otherwise, predict 0 (e.g., "affordable").
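
The sigmoid and thresholding steps can be sketched in a few lines of plain Python. The function names `sigmoid` and `predict_class` and the sample `z` values are assumptions for illustration only.

```python
import math

def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1 / (1 + math.exp(-z))

def predict_class(z, threshold=0.5):
    """Return 1 if the sigmoid probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-3.0, 0.0, 3.0):  # hypothetical linear-combination values
    print(z, round(sigmoid(z), 3), predict_class(z))
```

Because `sigmoid(0)` is exactly 0.5, a threshold of 0.5 amounts to checking whether `z` is negative or not: a negative `z` gives class 0, and zero or a positive `z` gives class 1.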

### Step 7: Make predictions

In [None]:
y_pred = model.predict(X_test)

In [None]:
y_pred

# Note that 1 means Expensive and 0 means Affordable

In [None]:
y_test

In [None]:
# Get the coefficients for each feature
coefficients = model.coef_
print("Coefficients:", coefficients)

# Get the intercept
intercept = model.intercept_
print("Intercept:", intercept)

In [None]:
X_test.iloc[[0]]

In [None]:
# Use the formula with the fitted coefficients and intercept to make a prediction using a record in the test data

#### Step 1: Calculate the Linear Combination (z)

# Given coefficients and intercept
intercept = -20.53181245
coef_house_size = 0.00887793
coef_bedrooms = 1.48555931

# Input features from the record
house_size = 2529
bedrooms = 4

# Calculate the linear combination (z)
z = intercept + (coef_house_size * house_size) + (coef_bedrooms * bedrooms)
print("z: ", z)

#### Step 2: Apply the Sigmoid Function to Get the Probability

probability = 1 / (1 + np.exp(-z))
print("probability: ", probability)

#### Step 3: Make a Prediction Based on the Probability

# Predict class based on a threshold of 0.5
prediction = 1 if probability >= 0.5 else 0
print("prediction: ", prediction)

In [None]:
# Compare with y_pred
y_pred[0]

### Step 8: Evaluate the model

In [None]:
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
# 'accuracy_score' compares the predicted labels 'y_pred' with the true labels 'y_test' and calculates the proportion of correctly classified instances.

# Generate the confusion matrix for the model's predictions
conf_matrix = confusion_matrix(y_test, y_pred)
# 'confusion_matrix' provides a matrix showing the number of true positives, false positives, true negatives, and false negatives.

# Create a detailed classification report
class_report = classification_report(y_test, y_pred)
# 'classification_report' generates a report including metrics like precision, recall, F1-score, and support for each class.

### Step 9: Output the results

In [None]:
print("Predictions:", y_pred)
print("Actual values:", y_test.values)
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)

### Predict the category of a new house

In [None]:
# Size: 2500 sq ft, 4 bedrooms (column names match the training data)
new_house = pd.DataFrame([[2500, 4]], columns=['Size (sq ft)', 'Bedrooms'])
predicted_category = model.predict(new_house)
category = "Expensive" if predicted_category[0] == 1 else "Affordable"
print(f"The new house is predicted to be: {category}")

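**Confusion Matrix**
- **TP (4)**: True positives, i.e. positives correctly predicted as positive.
- **FP (0)**: False positives, i.e. negatives incorrectly predicted as positive.
- **FN (0)**: False negatives, i.e. positives missed (predicted as negative).
- **TN (16)**: True negatives, i.e. negatives correctly predicted as negative.

For 0/1 labels, scikit-learn's `confusion_matrix` puts the actual classes on the rows and the predicted classes on the columns, so the printed matrix reads `[[TN, FP], [FN, TP]]` when class 1 (here, Expensive) is taken as the positive class.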
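
The four counts can be tallied by hand from the true and predicted labels. The sketch below uses a hypothetical helper `confusion_counts` and made-up labels, and it also shows how the counts change depending on which class is treated as positive.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for binary labels, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

# Hypothetical labels: 1 = Expensive, 0 = Affordable
y_true_demo = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred_demo = [1, 1, 0, 0, 1, 1, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true_demo, y_pred_demo)
print(tp, fp, fn, tn)  # 4 1 1 2  (Expensive as the positive class)

# Treating Affordable (0) as the positive class swaps the roles of TP and TN
print(confusion_counts(y_true_demo, y_pred_demo, positive=0))  # (2, 1, 1, 4)
```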

### What do these metrics mean?

- **Accuracy** tries to answer the question: "What proportion of my model's predictions were correct?"
    - **Optimal Value**: As close to 100% or 1.0 as possible.
    - For instance, if your accuracy is `0.85`, it means the model correctly predicted the outcome 85% of the time.
    - **Note** that in **imbalanced datasets**, a high accuracy might be **misleading**, so precision, recall, or F1 score might be more appropriate.
    - For instance, suppose a model labels every record as a boy and reaches an accuracy of 0.9. If 10 of the 100 records are actually girls and the rest are boys, the model got only 10 wrong, but those 10 were all the girls in the dataset.
---
- **Precision** tries to answer the question: "Of the instances predicted as positive by my model, how many were actually positive? (Roughly, how many false positives are there?)"
    - **Optimal Value:** As close to 1.0 (or 100%) as possible.
    - In our case, precision for the Affordable class is about being careful when you say a house is affordable.
    - For instance: If the model says 10 houses are affordable, and 8 of them are actually affordable, the precision is 8 out of 10 (which is 80%).
---
- **Recall** tries to answer the question: "Of the actual positive instances in the data, how many did my model correctly identify as positive? (Roughly, how many false negatives are there?)"
    - **Optimal Value:** As close to 1.0 (or 100%) as possible.
    - In our case, **recall** for the Affordable class is about being thorough in finding all the affordable houses.
    - For instance: If there are 10 affordable houses in the dataset, and the model correctly predicts 8 of them, the recall is 8 out of 10 (which is 80%).

---
- **F1 Score** tries to answer the question: "What is the balance between precision and recall in my model's performance?"
    - It is the harmonic mean of the two: F1 = 2 × (precision × recall) / (precision + recall).
    - **Optimal Value:** As close to 1.0 (or 100%) as possible.
    - A higher F1 score indicates a good balance between precision and recall, meaning the model does well at avoiding both false positives and false negatives.
---
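
All four metrics above can be computed from the confusion-matrix counts. The sketch below uses a hypothetical helper `classification_metrics` and replays the imbalanced boys/girls case, with "girl" treated as the positive class. Returning 0.0 when a ratio is undefined (no predicted or actual positives) is a convention assumed here.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Assumed convention: report 0.0 when a ratio would divide by zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Imbalanced case: 90 boys, 10 girls, and the model predicts "boy" for everyone.
# With "girl" as the positive class, no positive is ever predicted.
acc, prec, rec, f1 = classification_metrics(tp=0, fp=0, fn=10, tn=90)
print(acc, prec, rec, f1)  # 0.9 0.0 0.0 0.0

# A more balanced case: 8 of 10 predicted positives are right, 8 of 10 actual positives found
print(classification_metrics(tp=8, fp=2, fn=2, tn=88))
```

The first case shows why accuracy alone can mislead: 90% of predictions are correct, yet recall for the minority class is 0.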

In [None]:
# Practice Question:
# Train a model to predict the price category using only the House Size feature.

---
_**Your Dataness**_,  
`Obinna Oliseneku` (_**Hybraid**_)  
**[LinkedIn](https://www.linkedin.com/in/obinnao/)** | **[GitHub](https://github.com/hybraid6)**  