
# Machine Learning Basics
=====================================


## 1.1 What is Machine Learning?
-------------------------------

### Features and Target

*   **Features**: Information about objects (e.g., car year, make, mileage)
*   **Target**: Property to predict (e.g., price)
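
As a minimal sketch of this distinction, assuming a few made-up car records (the field names and every value below are invented for illustration), the data can be separated into features and a target like this:

```python
# Toy records; every value here is invented for illustration.
cars = [
    {"year": 2015, "make": "Toyota", "mileage": 60000, "price": 12000},
    {"year": 2018, "make": "Honda", "mileage": 30000, "price": 17500},
    {"year": 2012, "make": "Ford", "mileage": 95000, "price": 7000},
]

FEATURE_NAMES = ["year", "make", "mileage"]  # what we know about each car
TARGET_NAME = "price"                        # what we want to predict

# One list of feature values per car.
features = [[car[name] for name in FEATURE_NAMES] for car in cars]
# One target value per car, in the same order as the features.
target = [car[TARGET_NAME] for car in cars]

print(features[0])  # [2015, 'Toyota', 60000]
print(target)       # [12000, 17500, 7000]
```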

### Machine Learning Process

1.  Collect data
2.  Define and calculate features
3.  Train model
4.  Make predictions


## 1.2 Machine Learning vs Rule-Based Systems
------------------------------------------


### Machine Learning System

1.  **Get Data**: Emails from user's spam folder and inbox
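2.  **Define and Calculate Features**: Use the characteristics behind hand-written rules (e.g., keywords, email length) as features and encode each email with them
3.  **Train and Use Model**: Apply an ML algorithm to the encoded emails to train a model, then use the model to classify new emails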
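
The feature step above can be sketched roughly as follows. The function `extract_features`, the three characteristics it checks, and the two sample emails are all hypothetical; a real system would choose its own features.

```python
def extract_features(email: dict) -> list:
    """Encode one email as a list of 0/1 features (hypothetical choices)."""
    subject = email["subject"].lower()
    body = email["body"].lower()
    return [
        int(len(subject) > 10),                         # long subject line
        int("deposit" in body),                         # keyword in the body
        int(email["sender"].endswith("@example.com")),  # assumed suspicious domain
    ]

# Invented sample emails: one from the spam folder, one from the inbox.
emails = [
    {"sender": "promo@example.com", "subject": "You won!",
     "body": "Make a deposit to claim your prize.", "is_spam": 1},
    {"sender": "alice@example.org", "subject": "Meeting notes from Tuesday",
     "body": "Hi, the notes are attached.", "is_spam": 0},
]

X = [extract_features(email) for email in emails]  # encoded emails
y = [email["is_spam"] for email in emails]         # labels from the folders

print(X)  # [[0, 1, 1], [1, 0, 0]]
print(y)  # [1, 0]
```

An ML algorithm would then be trained on `X` and `y` instead of having the decision logic written by hand.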


### Rule-Based System

*   Classifies emails with hand-written rules based on fixed characteristics (e.g., keywords, email length)
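
For contrast, a rule-based filter might look like the sketch below. The specific rules and thresholds are hypothetical; the point is that every rule is written and maintained by hand.

```python
def is_spam_rule_based(email: dict) -> bool:
    """Classify an email with hand-written rules (all rules are hypothetical)."""
    subject = email["subject"].lower()
    body = email["body"].lower()
    if "deposit" in body:      # keyword rule
        return True
    if "you won" in subject:   # keyword rule
        return True
    if len(body) > 2000:       # length rule
        return True
    return False

# Invented sample emails.
spam = {"subject": "You won!", "body": "Make a deposit to claim your prize."}
ham = {"subject": "Meeting notes", "body": "Hi, the notes are attached."}

print(is_spam_rule_based(spam))  # True
print(is_spam_rule_based(ham))   # False
```

When spammers change their wording, someone has to add or adjust a rule, which is the maintenance burden that a trained model is meant to reduce.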


## 1.3 Supervised Machine Learning
--------------------------------


### Formula

g(X) ≈ y

*   g: Model
*   X: Feature matrix (2D array, one row per object and one column per feature)
*   y: Target variable (vector, one value per row of X)
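
A small sketch of these shapes, assuming invented car data and a hypothetical model `g` whose weights are picked by hand rather than learned:

```python
# Feature matrix X: one row per car, columns are (year, mileage).
X = [
    [2015, 60000],
    [2018, 30000],
    [2012, 95000],
]
# Target vector y: one (invented) price per row of X.
y = [12000, 17500, 7000]

def g(X):
    """Hypothetical model with hand-picked weights, not a trained one."""
    return [1000 * (year - 2000) - mileage // 20 for year, mileage in X]

predictions = g(X)
print(predictions)  # [12000, 16500, 7250]
print(y)            # [12000, 17500, 7000]
```

The predictions are close to `y` but not identical, which is why the relationship is an approximation rather than an equality. Training is the process of finding a `g` that makes this approximation as good as possible.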


### Types of Supervised Learning


*   **Regression**: Predict continuous values (e.g., house price)
*   **Classification**: Predict categories (e.g., spam/not spam)
    *   **Binary Classification**: Predict one of two categories (e.g., spam/not spam)
    *   **Multiclass Classification**: Predict one of more than two categories (e.g., cat, dog, car)
*   **Ranking**: Predict the order or relevance of items (e.g., recommendation systems)


## 1.4 CRISP-DM 
----------------


### Six Steps


1.  **Business Understanding**: Identify problem, define goal
2.  **Data Understanding**: Identify data sources, assess quality
3.  **Data Preparation**: Clean, transform data
4.  **Modeling**: Train models, select best
5.  **Evaluation**: Measure performance, check whether the goal was achieved
6.  **Deployment**: Roll out model, monitor performance


### Example: Spam Detection


## 1.5 Model Selection Process
----------------------------


### Steps to Evaluate Model Performance


1.  Split data into training (80%) and validation (20%) sets
2.  Train model using training data
3.  Apply trained model to validation data
4.  Compare predicted values with actual values
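
The first step can be sketched in plain Python as below. The helper `train_validation_split` is a hypothetical name, and the integers stand in for labeled examples; in practice a library routine such as scikit-learn's `train_test_split` does the same job.

```python
import random

def train_validation_split(rows, validation_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_val = round(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]

rows = list(range(10))  # stand-in for 10 labeled examples
train, val = train_validation_split(rows)

print(len(train), len(val))  # 8 2
```

Shuffling before splitting matters when the data is stored in some meaningful order, since otherwise the validation set may not resemble the training set.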


### Example


| Predicted | Actual |
| --- | --- |
| 0.8 | 1 |
| 0.7 | 0 |
| 0.6 | 1 |
| 0.1 | 0 |
| 0.9 | 1 |
| 0.6 | 0 |
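
To turn this table into a single score, one option is to convert each predicted probability into a 0/1 decision and count the matches. The 0.5 cutoff below is an assumption; the table itself does not state one.

```python
predicted = [0.8, 0.7, 0.6, 0.1, 0.9, 0.6]  # values from the table
actual = [1, 0, 1, 0, 1, 0]

THRESHOLD = 0.5  # assumed cutoff, not given in the table

# Probability at or above the threshold counts as a prediction of 1.
decisions = [int(p >= THRESHOLD) for p in predicted]
correct = sum(d == a for d, a in zip(decisions, actual))
accuracy = correct / len(actual)

print(decisions)  # [1, 1, 1, 0, 1, 1]
print(f"{correct}/{len(actual)} correct, accuracy = {accuracy:.2f}")  # 4/6 correct, accuracy = 0.67
```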


### Comparing Multiple Models


Try different models (e.g., logistic regression, decision tree, random forest, neural network), evaluate each on the same validation set, and select the one with the highest validation accuracy.


### Problem: Overfitting or Luck


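When many models are compared on the same validation set, one of them can score well purely by chance, and choosing it overfits the selection to that particular validation set. A training-validation-test split guards against this: hold out a test set, use it only for the selected model, and check that its test performance is close to its validation performance.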
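
The role of luck can be illustrated with a small simulation. This is a hypothetical setup, not part of the original material: every "model" guesses at random, so none of them has learned anything, yet the best validation score still tends to look good.

```python
import random

rng = random.Random(0)

def random_guesses(n):
    """A 'model' that guesses 0 or 1 at random; it has learned nothing."""
    return [rng.randint(0, 1) for _ in range(n)]

def accuracy(pred, actual):
    return sum(p == a for p, a in zip(pred, actual)) / len(actual)

y_val = random_guesses(20)   # labels of a small validation set
y_test = random_guesses(20)  # labels of a held-out test set

# Each candidate has fixed guesses for the validation and the test rows.
candidates = [(random_guesses(20), random_guesses(20)) for _ in range(100)]
val_scores = [accuracy(val_pred, y_val) for val_pred, _ in candidates]

# Select the candidate with the highest validation accuracy.
best = max(range(len(candidates)), key=lambda i: val_scores[i])
best_val = val_scores[best]
best_test = accuracy(candidates[best][1], y_test)

print(f"best validation accuracy: {best_val:.2f}")
print(f"same model on test data:  {best_test:.2f}")
```

With 100 guessing models and only 20 validation rows, the winner's validation accuracy is typically well above 0.5, while its test accuracy usually falls back toward 0.5. The test set exposes that the validation result was luck.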


### Summary


The model selection process involves:


1.  Splitting data
2.  Training models
3.  Validating models
4.  Repeating steps 2-3 for each candidate model
5.  Selecting the best model
6.  Testing the model on unseen data
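
A three-way split for this process might be sketched as follows. The 60/20/20 proportions and the helper name `train_val_test_split` are assumptions for illustration, and the integers stand in for labeled examples.

```python
import random

def train_val_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, validation, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    n_test = round(len(shuffled) * test_fraction)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test

rows = list(range(100))  # stand-in for 100 labeled examples
train, val, test = train_val_test_split(rows)

print(len(train), len(val), len(test))  # 60 20 20
```

The training set is used in step 2, the validation set in steps 3 to 5, and the test set only once in step 6.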


### Alternative Approach


After selecting the best model, combine the training and validation data, retrain that model on the combined set, and then evaluate it on the test set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample data (toy values, far too small for a meaningful evaluation)
data = {'Feature1': [1, 2, 3, 4, 5],
        'Feature2': [6, 7, 8, 9, 10],
        'Target': [0, 1, 0, 1, 0]}
df = pd.DataFrame(data)

# Split data: hold out 20% of the rows (1 of 5 here) for evaluation
X = df[['Feature1', 'Feature2']]
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model on the training rows only
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions for the held-out rows
y_pred = model.predict(X_test)

# Evaluate model: fraction of held-out rows predicted correctly
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```