# Introduction to Machine Learning

Machine Learning (ML) is a branch of artificial intelligence that allows computers to learn from data without being explicitly programmed. It focuses on building models that can identify patterns and make predictions or decisions based on data. Machine learning is at the core of many applications we use today, such as recommendation systems, voice recognition, and fraud detection.


Key Terminology:
- **Model**: A system that makes predictions or decisions based on data.
- **Training**: The process of teaching a model by feeding it data so it can learn to recognize patterns.
- **Features**: The measurable attributes or properties of the data used to train the model.
- **Labels**: The actual output or result that we want the model to predict.

### Types of Machine Learning

Machine Learning can be broadly classified into three main types:

1. **Supervised Learning**:
   - The model is trained on **labeled data** (data with input-output pairs).
   - It learns to map inputs to outputs based on the given labels.
   - Examples: **Regression** (predicting prices), **Classification** (email spam detection).

2. **Unsupervised Learning**:
   - The model is trained on **unlabeled data** (data without explicit labels).
   - It learns to find **patterns or groupings** in the data.
   - Example: **Clustering** (grouping customers by behavior).

3. **Reinforcement Learning**:
   - The model learns through **trial and error** by interacting with an environment.
   - It receives **rewards or penalties** based on the actions it takes and aims to maximize cumulative rewards.
   - Examples: **Robotics**, **Game AI** (training agents to play games).
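As a rough illustration of the unsupervised case described above, the sketch below groups a handful of made-up 2-D points with scikit-learn's KMeans. The points, the choice of two clusters, and all variable names are assumptions chosen for demonstration; they do not come from a real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two visibly separated groups of 2-D points, with no labels.
points = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # group near (1, 1)
    [8.0, 8.2], [8.1, 7.9], [7.8, 8.0],   # group near (8, 8)
])

# Ask for two clusters; n_init and random_state are fixed only for repeatability.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(points)

print("Cluster assignment per point:", cluster_ids)
```

The cluster numbers themselves are arbitrary. What matters is that points in the same group receive the same number, even though the model never saw a label.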

The general workflow of a machine learning project involves the following steps:

1. **Data Collection**: Gather data from various sources to be used for training and testing.
2. **Data Preprocessing**: Clean and preprocess data (handling missing values, feature scaling, encoding categorical variables).
3. **Feature Selection**: Identify important features that significantly impact the outcome.
4. **Model Selection**: Choose an appropriate model (e.g., Linear Regression, Decision Tree).
5. **Training**: Train the model using labeled data.
6. **Evaluation**: Evaluate model performance using metrics like **accuracy**, **precision**, and **recall** for classification, or **RMSE** for regression.
7. **Prediction**: Use the trained model to make predictions on new data.

This process is iterative, meaning each step can be revisited to improve the model.
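The preprocessing step in the workflow above (handling missing values, feature scaling, encoding categorical variables) can be sketched with pandas alone. The small table, its column names, and the choice of median imputation are assumptions made for this illustration; a real project would pick strategies that suit its data.

```python
import pandas as pd

# Hypothetical raw data with one missing value and one categorical column.
raw = pd.DataFrame({
    "experience_years": [1.0, 3.0, None, 7.0],
    "department": ["sales", "engineering", "sales", "support"],
})

clean = raw.copy()

# 1. Handle missing values: fill the gap with the column median.
median_experience = clean["experience_years"].median()
clean["experience_years"] = clean["experience_years"].fillna(median_experience)

# 2. Feature scaling: standardize to zero mean and unit variance.
col = clean["experience_years"]
clean["experience_years"] = (col - col.mean()) / col.std(ddof=0)

# 3. Encode the categorical column as one indicator column per category.
clean = pd.get_dummies(clean, columns=["department"])

print(clean)
```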

### Simple Example of Machine Learning using **Linear Regression** (Regression Problem)

In [20]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [21]:
# Creating a simple dataset
# Data: Experience vs. Salary
data = {
    'Experience (Years)': [1, 3, 5, 7, 9, 11],
    'Salary (in $1000)': [30, 40, 50, 60, 80, 100]
}
df = pd.DataFrame(data)

In [22]:
# Splitting the data into features and labels
X = df[['Experience (Years)']]
y = df['Salary (in $1000)']

The train-test split is a method used to evaluate the performance of a machine learning model by dividing the dataset into two distinct parts: the training set and the testing set.

- X, y: The features (X) and target labels (y) of the dataset.
- test_size=0.2: The proportion of the dataset to include in the testing set, in this case 20%. With only 6 rows, scikit-learn rounds the test set up to 2 rows, leaving 4 for training.
- random_state=42: Ensures reproducibility. It controls the random splitting of the data. Using the same random state will always generate the same split, making it easier to compare results.
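The reproducibility point can be checked with a short sketch. The toy arrays and the names X_demo, y_demo, split_a, and split_b are assumptions introduced for this illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with a single feature.
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10)

# Split twice with the same random_state.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Each split is [X_train, X_test, y_train, y_test]; compare them piece by piece.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
n_test = len(split_a[1])

print("Identical splits:", same_split)
print("Test set size:", n_test)
```

Changing random_state in one of the two calls would generally produce a different partition, which is why fixing it makes results easier to compare across runs.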

In [23]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [24]:
X_train

   Experience (Years)
5                  11
2                   5
4                   9
3                   7


In [25]:
y_train

5    100
2     50
4     80
3     60
Name: Salary (in $1000), dtype: int64

In [26]:
X_test

   Experience (Years)
0                   1
1                   3


In [27]:
y_test

0    30
1    40
Name: Salary (in $1000), dtype: int64

- LinearRegression is used to model a linear relationship between input features and a target variable.
- It can be used for both simple (one feature) and multiple (multiple features) regression.
- It fits a straight line (or hyperplane in the case of multiple features) that minimizes the sum of squared differences between the predicted and actual values.
- After training the model, predictions can be made on unseen data, and the performance of the model can be evaluated using error metrics like MSE.
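For a single feature, the line that minimizes the sum of squared differences has a closed-form solution. The sketch below applies it with NumPy to the four training rows shown in the X_train and y_train output. It is meant as an illustration of what the fit computes, not as a description of how scikit-learn implements it internally; the variable names are chosen here for clarity.

```python
import numpy as np

# Training rows taken from the X_train / y_train output shown earlier.
x = np.array([11.0, 5.0, 9.0, 7.0])      # experience in years
y = np.array([100.0, 50.0, 80.0, 60.0])  # salary in $1000

# Closed-form least-squares estimates for a single feature.
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Apply the fitted line to the two test inputs (1 and 3 years).
x_new = np.array([1.0, 3.0])
manual_pred = intercept + slope * x_new

print("slope:", slope, "intercept:", intercept)
print("manual predictions:", manual_pred)
```

On this data the slope works out to 8.5 and the intercept to 4.5, so the line predicts about 13 and 30 for 1 and 3 years of experience.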

In [28]:
# Creating a Linear Regression model and training it
model = LinearRegression()
model.fit(X_train, y_train)

In [29]:
# Making predictions
y_pred = model.predict(X_test)


In [31]:
y_pred

array([13., 30.])

Mean squared error (MSE), computed here with scikit-learn's mean_squared_error function, is a metric used to evaluate regression models. It measures the average squared difference between the actual and predicted values, so it indicates how close the model’s predictions are to the actual outputs.

The MSE is always non-negative, and **values closer to 0 represent a better fit of the model**, as the model predictions are closer to the actual values.
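To make the definition concrete, the sketch below computes the MSE by hand for the two test rows, using the actual salaries (30 and 40) and the predictions reported above (13 and 30). The variable names are chosen for this illustration. RMSE is included because it returns the error to the original units.

```python
import numpy as np

y_true = np.array([30.0, 40.0])  # actual salaries in the test set ($1000)
y_hat = np.array([13.0, 30.0])   # predictions reported for the test set

# Square each error, then average.
squared_errors = (y_true - y_hat) ** 2   # [289., 100.]
mse_manual = squared_errors.mean()

# RMSE is the square root of the MSE, expressed in the same units as y.
rmse_manual = np.sqrt(mse_manual)

print("MSE:", mse_manual)
print("RMSE:", rmse_manual)
```

The MSE of 194.5 corresponds to an RMSE of roughly 13.9, meaning the predictions are off by about $13,900 on average. That is a large error relative to salaries of $30,000 to $40,000, which is not surprising given that the model was trained on only four rows.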

In [32]:
# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
print("Predictions for Test Set:", y_pred)

Mean Squared Error: 194.49999999999972
Predictions for Test Set: [13. 30.]


### Sample Code for Iris Dataset using **Logistic Regression** (Classification Problem)

In [33]:
# Importing necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [34]:
# Step 1: Data Collection - Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target


In [35]:
# Step 2: Data Preprocessing - No preprocessing needed as dataset is clean


In [36]:
# Step 3: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Logistic Regression is a supervised learning algorithm used for classification tasks. Despite its name, logistic regression is not used for regression problems (predicting continuous values). Instead, it is a method for predicting a binary or multiclass outcome based on one or more predictor variables.

Applications of Logistic Regression:

- Binary Classification: Classifying emails as spam or not spam.
- Multiclass Classification: Classifying species of flowers in the Iris dataset.
- Medical Diagnosis: Determining whether a patient has a disease (yes/no).
- Fraud Detection: Predicting if a transaction is fraudulent or not.
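Logistic regression arrives at a class by first estimating a probability for each class and then picking the most likely one. The sketch below shows this with predict_proba on the Iris data. Fitting on the full dataset and the names iris_demo and clf are simplifications assumed here to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris_demo = load_iris()

# Fit on all rows purely for illustration (no train/test split here).
clf = LogisticRegression(max_iter=200)
clf.fit(iris_demo.data, iris_demo.target)

# One row of class probabilities per sample; each row sums to 1.
sample = iris_demo.data[:3]
probabilities = clf.predict_proba(sample)
predicted_classes = clf.predict(sample)

print("Class probabilities:\n", probabilities.round(3))
print("Predicted classes:", predicted_classes)
```

The predicted class for each sample is the column with the highest probability, which is what predict returns.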

In [37]:
# Step 4: Model Selection - Using Logistic Regression
model = LogisticRegression(max_iter=200)

In [38]:
# Step 5: Training the Model
model.fit(X_train, y_train)

In [39]:
# Step 6: Making Predictions
y_pred = model.predict(X_test)

$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$

Accuracy is the proportion of correct predictions made by the model out of all predictions. It is a straightforward metric for classification tasks, where the goal is to assign the correct label or category to each sample.
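The accuracy calculation can be sketched by hand on a few made-up labels. The arrays below are hypothetical and are not taken from the Iris results.

```python
import numpy as np

# Hypothetical true labels and model predictions for five samples.
actual = np.array([0, 1, 2, 2, 1])
predicted = np.array([0, 1, 2, 1, 1])

# Count the matches and divide by the total number of predictions.
correct = int(np.sum(actual == predicted))
accuracy_manual = correct / len(actual)

print("Correct predictions:", correct, "of", len(actual))
print("Accuracy:", accuracy_manual)
```

Here four of the five predictions match, so the accuracy is 0.8.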

In [40]:
# Step 7: Evaluation
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the Logistic Regression model on Iris dataset:", accuracy)

Accuracy of the Logistic Regression model on Iris dataset: 1.0


### Task: Machine Learning using **Linear Regression** (Regression Problem)

In [45]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error