MACHINE LEARNING ASSIGNMENT  


Question 1: Explain the differences between AI, ML, Deep Learning (DL), and Data
Science (DS).

->1. Artificial Intelligence (AI)

Definition: The broadest field. AI is about creating machines or systems that can perform tasks that typically require human intelligence.

Examples: Chatbots, recommendation systems, self-driving cars, speech recognition (like Siri, Alexa).

Scope: AI is the umbrella concept → ML, DL, etc. are part of AI.

2. Machine Learning (ML)

Definition: A subset of AI where machines learn from data and improve their performance over time without being explicitly programmed.

Focus: Algorithms that find patterns and make predictions.

Examples: Spam email detection, fraud detection, stock price prediction.

Key Idea: Feed data → algorithm learns → makes predictions.

3. Deep Learning (DL)

Definition: A subset of ML that uses artificial neural networks (inspired by the human brain) with many layers (“deep”).

Focus: Works best with very large datasets and high computational power.

Examples: Face recognition, natural language processing (like ChatGPT), autonomous driving.

Difference from ML: Traditional ML often relies on manual feature engineering, while DL learns useful features directly from raw data.

4. Data Science (DS)

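Definition: A multidisciplinary field that combines statistics, programming, domain knowledge, and data analysis to extract insights and make decisions from data.

Scope: DS overlaps with AI and ML but is not a subset of them. It uses ML as one tool among many and also covers data collection, cleaning, visualization, and communicating results.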


Question 2: What are the types of machine learning? Describe each with one
real-world example.

->1. Supervised Learning

Definition: In supervised learning, the model is trained on a dataset that contains both inputs (features) and their correct outputs (labels). The algorithm learns to map inputs to outputs.

Goal: Predict outcomes for new, unseen data.

Example:

Spam Email Detection – A model is trained on emails labeled as “spam” or “not spam”. When a new email arrives, the model predicts if it’s spam.
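
The spam example can be sketched in a few lines with scikit-learn. This is a minimal illustration, not a production filter: the four emails and their labels are made up, and the choice of a bag-of-words model with Naive Bayes is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy emails and labels (1 = spam, 0 = not spam)
emails = [
    "win a free prize now",
    "free money claim your prize",
    "meeting agenda for monday",
    "lunch with the project team",
]
labels = [1, 1, 0, 0]

# Bag-of-words counts feeding a Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

# Predict the label of a new, unseen email
prediction = model.predict(["claim your free prize"])[0]
print("spam" if prediction == 1 else "not spam")
```

The model only sees inputs paired with correct labels during training, which is what makes this supervised.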

2. Unsupervised Learning

Definition: In unsupervised learning, the model is given only inputs (no labels). It tries to find hidden patterns, structures, or relationships in the data.

Goal: Discover patterns or groupings within the data.

Example:

Customer Segmentation – E-commerce websites group customers into clusters (e.g., “bargain shoppers,” “loyal customers”) based on purchase behavior, without predefined labels.
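
A rough sketch of customer segmentation with k-means clustering is shown below. The six customers, the two features, and the choice of two clusters are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [orders per year, average discount used (%)]
customers = np.array([
    [2, 40], [3, 45], [1, 50],   # few orders, heavy discount use
    [20, 5], [22, 3], [25, 4],   # many orders, little discount use
])

# No labels are passed in; k-means groups the rows by similarity alone
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(customers)
print(cluster_ids)
```

The cluster numbers themselves carry no meaning. Names such as "bargain shoppers" are assigned by a person after inspecting each group.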

3. Semi-Supervised Learning

Definition: A mix of supervised and unsupervised learning. The model is trained on a small amount of labeled data and a large amount of unlabeled data.

Goal: Improve learning accuracy when labeling data is expensive or time-consuming.

Example:

Medical Diagnosis – Only a few medical images are labeled by doctors (due to high cost and effort), but many unlabeled images exist. Semi-supervised models use both to detect diseases.

4. Reinforcement Learning (RL)

Definition: In RL, an agent learns by interacting with an environment. It receives rewards or penalties based on its actions and aims to maximize the cumulative reward.

Goal: Learn the best sequence of actions to achieve a goal.

Example:

Self-Driving Cars – The car (agent) learns to drive safely by making decisions (accelerate, brake, turn) and receiving feedback (reward for avoiding accidents, penalty for collisions).
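
A real driving agent is far beyond a short example, so the sketch below uses a toy stand-in: an agent in a five-cell corridor that is rewarded only for reaching the right end. The environment, the reward values, and the use of tabular Q-learning are all assumptions chosen to show the reward-driven learning loop.

```python
import random

random.seed(0)
n_states, goal = 5, 4            # corridor positions 0..4, goal at the right end
moves = [-1, +1]                 # action 0 = step left, action 1 = step right
q = [[0.0, 0.0] for _ in range(n_states)]   # Q-table: value of each action in each state
alpha, gamma = 0.5, 0.9          # learning rate and discount factor

for _ in range(500):             # episodes
    state = 0
    while state != goal:
        action = random.randrange(2)    # explore by acting randomly
        next_state = min(max(state + moves[action], 0), goal)
        reward = 1.0 if next_state == goal else 0.0
        # Q-learning update: move the estimate toward reward + discounted best future value
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state

# Greedy policy read from the learned Q-table for the non-goal states
policy = ["left" if q[s][0] > q[s][1] else "right" for s in range(goal)]
print(policy)
```

The agent is never told that moving right is correct. It infers this from the rewards it collects.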


Question 3: Define overfitting, underfitting, and the bias-variance tradeoff in machine
learning

->1. Overfitting

Definition: Overfitting happens when a machine learning model learns the training data too well, capturing noise and random fluctuations instead of just the underlying pattern.

Result: The model performs very well on training data but poorly on unseen test data.

Example: A decision tree that keeps splitting until every training sample is perfectly classified but fails on new data.
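
The decision tree example can be reproduced roughly as follows. The synthetic dataset, the 20% label noise (flip_y), and the depth limit of 3 are assumptions chosen to make the effect visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels so there is noise to memorize
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Unrestricted tree: keeps splitting until the training set is fit perfectly
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Depth-limited tree: less able to memorize individual samples
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

deep_train_acc = deep.score(X_train, y_train)
deep_test_acc = deep.score(X_test, y_test)
print("deep    train/test:", deep_train_acc, deep_test_acc)
print("shallow train/test:", shallow.score(X_train, y_train),
      shallow.score(X_test, y_test))
```

A large gap between training and test accuracy for the unrestricted tree is the typical sign of overfitting.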

2. Underfitting

Definition: Underfitting occurs when a model is too simple to capture the underlying structure of the data.

Result: The model performs poorly on both training and test data.

Example: Using a straight line (linear regression) to model data that follows a curved (nonlinear) relationship.
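
The straight-line example can be checked with a small experiment. The quadratic data below is synthetic, and the noise level is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(0, 0.1, size=50)   # curved relationship plus small noise

# Straight line: too simple for this data, so it fits poorly even on the training set
linear_r2 = LinearRegression().fit(x, y).score(x, y)

# Adding a squared term gives the model enough flexibility
x_quad = PolynomialFeatures(degree=2).fit_transform(x)
quad_r2 = LinearRegression().fit(x_quad, y).score(x_quad, y)

print("linear R^2:   ", round(linear_r2, 3))
print("quadratic R^2:", round(quad_r2, 3))
```

Both scores are computed on the training data itself, which shows that an underfit model struggles before any test data is involved.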

3. Bias-Variance Tradeoff

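Bias: Error from making overly simplistic assumptions about the data. (High bias → underfitting.)

Variance: Error from the model being too sensitive to small changes in the training data. (High variance → overfitting.)

Tradeoff: Making a model more complex usually lowers bias but raises variance, and simplifying it does the reverse. The goal is a level of complexity that minimizes total error on unseen data.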
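
For squared-error loss, the two error sources can be written as a standard decomposition. The notation is an assumption introduced here: y = f(x) + ε with noise of mean 0 and variance σ², and \hat{f} is the model fitted on a randomly drawn training set.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Only the first two terms depend on the model, so reducing one at the expense of the other is what the tradeoff refers to.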


Question 4: What are outliers in a dataset, and list three common techniques for
handling them.

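->Outliers are data points that lie far from the majority of the values in a dataset. They can come from measurement or data-entry errors or from genuinely rare events, and they can distort statistics such as the mean and degrade model performance. Three common techniques for handling them:

1.Removal of Outliers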

Directly eliminate outlier values if they are due to errors or irrelevant to analysis.

Example: Removing negative ages in a dataset of people.
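
One common way to decide which values to remove is the 1.5 × IQR rule, sketched below. The ages are made up, and the IQR rule is an assumption here; it is one convention among several.

```python
import pandas as pd

# Hypothetical ages; -4 and 240 are data-entry errors
ages = pd.Series([25, 31, 28, -4, 35, 29, 240, 33])

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values inside the fences
cleaned = ages[(ages >= lower) & (ages <= upper)]
print(cleaned.tolist())
```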

2.Transformation of Data

Apply mathematical transformations (e.g., log, square root, or Box-Cox) to reduce the effect of extreme values.

This compresses the scale and makes outliers less influential.
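
The effect of a log transformation can be seen on a small made-up income column. The values are hypothetical, and np.log1p (log of 1 + x) is used so that zeros would also be handled.

```python
import numpy as np

# Hypothetical incomes with one extreme value
incomes = np.array([30_000, 42_000, 55_000, 61_000, 2_500_000], dtype=float)
log_incomes = np.log1p(incomes)

# Ratio of largest to smallest value before and after the transform
ratio_before = incomes.max() / incomes.min()
ratio_after = log_incomes.max() / log_incomes.min()
print(ratio_before, ratio_after)
```

The transform is reversible with np.expm1, so predictions made on the log scale can be converted back to the original units.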

3.Capping (Winsorization)

Replace extreme values beyond a threshold with the nearest acceptable value.

Example: Capping all values above the 95th percentile to the 95th percentile value.
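
Capping at the 95th percentile can be sketched with NumPy as follows. The sample values are made up for illustration.

```python
import numpy as np

# Hypothetical measurements with one extreme value
values = np.array([12, 15, 14, 13, 16, 15, 14, 13, 12, 200], dtype=float)

# Threshold: the 95th percentile of the data
upper = np.percentile(values, 95)

# Values above the threshold are replaced by the threshold itself
capped = np.minimum(values, upper)
print(capped)
```

Unlike removal, capping keeps every row, which helps when the dataset is small.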



Question 5: Explain the process of handling missing values and mention one
imputation technique for numerical and one for categorical data.

->Handling Missing Values

Missing values occur when no data is stored for a variable in an observation. Handling them is crucial because they can affect the accuracy of analysis and machine learning models.

Steps in handling missing values:

Identify missing data → Use pandas methods such as .isnull().sum() or .info() to detect them.

Analyze the pattern → Check if data is Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR).

Decide the strategy → Options include:

Remove → Drop rows or columns with too many missing values.

Impute (fill in) → Replace missing values with estimated ones.

Use models → Predict missing values using regression or machine learning models.
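
The identify and remove steps above can be sketched with pandas. The small table and the 50% threshold for dropping a column are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical table with missing entries
df = pd.DataFrame({
    "Age": [25, np.nan, 31, 40],
    "Salary": [50000, 62000, np.nan, 58000],
    "Notes": [np.nan, np.nan, np.nan, "ok"],
})

# Identify: count and share of missing values per column
missing_per_column = df.isnull().sum()
missing_share = df.isnull().mean()
print(missing_per_column)

# Remove: drop columns that are more than half empty, then drop incomplete rows
reduced = df.loc[:, missing_share <= 0.5]
complete_rows = reduced.dropna()
print(complete_rows)
```

Dropping rows is simple but discards data, so it is usually reserved for cases where only a small share of rows is affected.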

Imputation Techniques

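Numerical Data → Mean Imputation: Replace missing values with the mean of the column.
Example: If a "Salary" column has missing values, fill them with the average salary.

Categorical Data → Mode Imputation: Replace missing values with the most frequent category in the column.
Example: If a "City" column has missing values, fill them with the most common city.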
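
A minimal sketch of imputation with pandas follows, using the mean for a numerical column and the mode (most frequent value) for a categorical one. The "City" column and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame({
    "Salary": [50000.0, np.nan, 70000.0, 60000.0],
    "City": ["Pune", "Delhi", None, "Pune"],
})

# Numerical column: fill with the column mean
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())

# Categorical column: fill with the most frequent category
df["City"] = df["City"].fillna(df["City"].mode()[0])

print(df)
```

In a modeling workflow, the mean and mode would normally be computed on the training split only and then reused on the test split.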


Question 6: Write a Python program that:
● Creates a synthetic imbalanced dataset with make_classification() from
sklearn.datasets.
● Prints the class distribution.


-># Import required libraries
from sklearn.datasets import make_classification
from collections import Counter

# Create synthetic imbalanced dataset
X, y = make_classification(n_samples=1000,     # total samples
                           n_features=10,      # number of features
                           n_informative=5,    # number of informative features
                           n_redundant=2,      # number of redundant features
                           n_clusters_per_class=1,
                           weights=[0.9, 0.1], # imbalance: roughly 90% class 0, 10% class 1
                           random_state=42)

# Print class distribution
print("Class distribution:", Counter(y))


Question 7: Implement one-hot encoding using pandas for the following list of colors:
['Red', 'Green', 'Blue', 'Green', 'Red']. Print the resulting dataframe.
(Include your Python code and output in the code box below.)

->import pandas as pd

# Input list of colors
colors = ['Red', 'Green', 'Blue', 'Green', 'Red']

# Create a DataFrame
df = pd.DataFrame(colors, columns=['Color'])

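# Apply one-hot encoding (dtype=int gives 0/1 columns; recent pandas versions default to True/False)
one_hot = pd.get_dummies(df['Color'], dtype=int)

# Print result
print(one_hot)

Output:

   Blue  Green  Red
0     0      0    1
1     0      1    0
2     1      0    0
3     0      1    0
4     0      0    1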
