### **Machine Learning - Assignment Questions & Answers**

**Q.1. Explain the differences between AI, ML, Deep Learning (DL), and Data Science (DS).**
  - While AI, ML, DL, and DS are all interconnected, they refer to distinct concepts. AI is the broad goal of creating intelligent machines, ML is a way to achieve AI, DL is a specific and advanced type of ML, and Data Science is an interdisciplinary field that uses these tools to extract insights from data.
  
  **AI (Artificial Intelligence)**
  AI is the umbrella term for creating systems that can perform tasks requiring human-like intelligence, such as reasoning, problem-solving, and understanding language. It is the overarching goal of making machines behave intelligently. AI can be achieved through various methods, including traditional rule-based systems (e.g., "if-then" statements) or, more commonly today, through machine learning.
    * Example: A self-driving car that perceives its environment, makes decisions, and navigates roads without human intervention.

  **ML (Machine Learning)**
  ML is a subset of AI that focuses on enabling machines to learn from data without being explicitly programmed. Instead of hard-coding rules for every possible scenario, ML algorithms are trained on large datasets to identify patterns and make predictions. The more data an ML model is exposed to, the better it typically performs.
    * Example: An email spam filter that learns to identify and block new spam messages by analyzing patterns in a large collection of previously labeled spam and non-spam emails.

  **DL (Deep Learning)**
  DL is a subset of ML that uses multi-layered neural networks, loosely inspired by the structure of the human brain. Unlike traditional ML, which often relies on hand-engineered features, deep learning models can automatically extract and learn features from raw data, which makes them highly effective for complex problems involving large, unstructured datasets like images, video, and audio.
    * Example: A system that can recognize objects in an image by processing the raw pixel data through its many layers to identify edges, shapes, and textures, which it then uses to classify the object.

  **DS (Data Science)**
  Data Science is an interdisciplinary field that uses scientific methods, processes, and algorithms to extract knowledge and insights from data. A data scientist uses tools from various fields, including statistics, mathematics, and computer science, to analyze data. Machine learning and deep learning are key tools in a data scientist's toolkit, but data science also involves other aspects like data cleaning, visualization, and communication of results to stakeholders.
    * Example: A retail company uses data science to analyze customer purchasing habits, demographics, and trends to predict future sales, optimize inventory, and personalize marketing campaigns. This often involves using ML models as part of the analysis.


**Q.2. What are the types of machine learning? Describe each with one real-world example.**
  - Machine learning can be categorized into four primary types based on how they learn from data.
  
  **1. Supervised Learning**
  Supervised learning models are trained on labeled data, meaning the training data includes both the input and the corresponding correct output. The model learns to map the input to the output, and its performance is evaluated by how well it predicts the correct labels for new, unseen data. It's like a student learning with flashcards that have the question on one side and the answer on the other.
    * Example: An email spam filter. The model is trained on a dataset of emails, each labeled as either "spam" or "not spam." It learns to identify patterns in the content, sender, and subject line of spam emails to accurately classify new incoming emails.
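
The train-then-evaluate loop described above can be sketched as follows. This is a minimal illustration, not a real spam filter: the data is synthetic (generated with `make_classification`), and the choice of logistic regression is an assumption made only to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic labeled data standing in for (email features, spam label) pairs
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out data the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                   # learn the input -> label mapping
test_accuracy = model.score(X_test, y_test)   # accuracy on unseen data
print(f"Test accuracy: {test_accuracy:.2f}")
```

The key point is that the score is computed on the held-out split, which is what "new, unseen data" means in practice.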

  **2. Unsupervised Learning**
  Unsupervised learning models are trained on unlabeled data without any predefined correct outputs. The goal is for the algorithm to find hidden patterns, structures, or groupings within the data on its own. It's like giving a student a pile of mixed-up objects and asking them to sort them into groups without telling them what the groups are.
    * Example: A customer segmentation system for a retail business. The model analyzes a large dataset of customer purchase history, demographics, and browsing behavior to identify distinct groups of customers with similar traits (e.g., "discount shoppers," "luxury buyers," "tech enthusiasts") to inform marketing strategies.
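
A minimal sketch of this kind of grouping, assuming two hypothetical customer features (annual spend and visits per month) and using k-means as one possible clustering algorithm. The data is synthetic and the number of clusters is chosen by hand.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical features per customer: [annual spend, visits per month]
low_spenders = rng.normal(loc=[200, 2], scale=[30, 0.5], size=(50, 2))
high_spenders = rng.normal(loc=[2000, 10], scale=[200, 1.5], size=(50, 2))
customers = np.vstack([low_spenders, high_spenders])

# No labels are passed: the algorithm groups customers by similarity alone
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)

print("Customers per segment:", np.bincount(segments))
```

With real data the features would usually be scaled first, since k-means relies on distances and would otherwise be dominated by the feature with the largest range.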

  **3. Semi-Supervised Learning**
  Semi-supervised learning combines elements of both supervised and unsupervised learning. It uses a small amount of labeled data and a large amount of unlabeled data for training. This approach is useful when obtaining a large, fully labeled dataset is difficult or expensive. The model first learns from the labeled data and then uses that knowledge to make predictions on the unlabeled data, effectively labeling it and using it for further training.
    * Example: A text document classifier. It's often impractical to manually label millions of documents. A semi-supervised model can be initially trained on a small, labeled set of documents (e.g., a few hundred news articles categorized by topic). It then uses this initial understanding to categorize a much larger, unlabeled set of articles, refining its classifications as it goes.
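
The "label a little, pseudo-label the rest" idea can be sketched with scikit-learn's `SelfTrainingClassifier`. This is one way to do it, under the assumption of synthetic numeric features instead of real documents; only 30 of 300 labels are kept, and the rest are marked with `-1`, which is how scikit-learn denotes unlabeled samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Keep only the first 30 labels; -1 marks a sample as unlabeled
y_partial = np.full_like(y, -1)
y_partial[:30] = y[:30]

# The base classifier is fit on the labeled rows, then its confident
# predictions on unlabeled rows are added as pseudo-labels and it is refit
self_training = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
self_training.fit(X, y_partial)

# labeled_iter_ is 0 for originally labeled rows, >0 for pseudo-labeled rows
n_pseudo_labeled = int((self_training.labeled_iter_ > 0).sum())
print("Pseudo-labeled samples:", n_pseudo_labeled)
```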
  
  **4. Reinforcement Learning**
  Reinforcement learning involves an agent that learns to make decisions by interacting with an environment to achieve a specific goal. The agent receives a reward for desired actions and a penalty for undesired ones. Through a process of trial and error, the agent learns the best sequence of actions to maximize its total reward.
    * Example: An AI playing a game like chess or Go. The AI agent learns by playing against itself or a human. It receives a reward for winning the game and a penalty for losing, which helps it learn and refine the optimal strategies for making moves.
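
Chess and Go are far too large for a short example, so the sketch below uses a hypothetical five-cell corridor instead: the agent starts in cell 0 and is rewarded for reaching cell 4. It uses tabular Q-learning, which is one common reinforcement learning algorithm; the reward values and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2            # cells 0..4; actions: 0 = left, 1 = right
goal = n_states - 1
q_table = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration rate

for episode in range(200):
    state = 0
    while state != goal:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(q_table[state]))
        next_state = max(state - 1, 0) if action == 0 else state + 1
        reward = 1.0 if next_state == goal else -0.01  # small penalty per step
        # Move the estimate toward reward plus discounted best future value
        q_table[state, action] += alpha * (
            reward + gamma * q_table[next_state].max() - q_table[state, action]
        )
        state = next_state

# Greedy action in each non-goal cell (1 means "move right")
policy = q_table[:goal].argmax(axis=1)
print("Learned policy:", policy)
```

After training, the greedy policy moves right in every cell, which is the shortest path to the reward.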


**Q.3. Define overfitting, underfitting, and the bias-variance tradeoff in machine learning.**
  - Overfitting, underfitting, and the bias-variance tradeoff are core concepts in machine learning that describe the common challenges of building a model that performs well on both the data it was trained on and new, unseen data.

  **Overfitting**
  Overfitting occurs when a model learns the training data too well, including its noise and random fluctuations. This makes the model highly accurate on the training set but causes it to perform poorly on new data. It's like a student who has memorized all the answers from a practice exam but hasn't learned the underlying concepts, so they fail the actual test. An overfit model has low bias but high variance.

  **Underfitting**
  Underfitting is the opposite problem, where the model is too simple and fails to capture the fundamental patterns in the training data. An underfit model performs poorly on both the training data and new data because it doesn't have enough complexity to represent the data's underlying relationships. It's like a student who barely studies and therefore does poorly on both practice exams and the real one. An underfit model has high bias but low variance.

  **Bias-Variance Tradeoff**
  The bias-variance tradeoff is a central problem in supervised machine learning. It describes how bias and variance respond in opposite directions to model complexity: making a model more complex typically lowers its bias but raises its variance.
    * Bias is the error caused by a model's simplistic assumptions about the data. A high-bias model is too simple and underfits.
    * Variance is the model's sensitivity to small changes or noise in the training data. A high-variance model is too complex and overfits.
  
  The goal is to find the "sweet spot" of model complexity that minimizes the total expected error, which decomposes into squared bias, variance, and irreducible noise that no model can remove. As you increase model complexity, bias decreases but variance increases, and vice versa. The tradeoff is finding the right balance: a model complex enough to learn the patterns in the data but simple enough to generalize well to unseen data.
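
The effect of model complexity can be seen in a small experiment. The sketch below fits polynomials of degree 1, 4, and 15 to noisy samples of a sine curve; the target function, noise level, and degrees are all illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def noisy_sine(n):
    """Draw n points from sin(2*pi*x) on [0, 1] with Gaussian noise."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

x_train, y_train = noisy_sine(30)
x_test, y_test = noisy_sine(200)

errors = {}
for degree in (1, 4, 15):
    model = Polynomial.fit(x_train, y_train, degree)  # least-squares fit
    train_mse = np.mean((model(x_train) - y_train) ** 2)
    test_mse = np.mean((model(x_test) - y_test) ** 2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Training error can only fall as the degree rises, because each higher-degree model contains the lower-degree ones. Test error is the quantity to watch: a straight line is too rigid for a sine curve (high bias), while a degree-15 polynomial fitted to 30 points has enough freedom to chase the noise (high variance).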

**Q.4. What are outliers in a dataset? List three common techniques for handling them.**
  - Outliers are data points that are significantly different from the majority of the other data points in a dataset. They can be caused by measurement errors, data entry mistakes, or genuine but rare events. Outliers can skew statistical analyses and negatively impact the performance of machine learning models.

  **1. Removal/Trimming**
  This is the simplest method: you remove the outlier data points from the dataset. This technique is effective when you're confident that the outliers are the result of errors and not true, meaningful data points.
    
    * When to use: When the dataset is large and the number of outliers is small. This method is risky if the outliers contain valuable information, and it should be avoided if the dataset is small, as it could lead to a significant loss of data.
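
A minimal sketch of removal, assuming the widely used 1.5 × IQR rule to decide what counts as an outlier. The rule and the sample values are illustrative choices, not part of the answer above.

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 95, 10, 12, 13])

# Fences at 1.5 * IQR beyond the first and third quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the points inside the fences
trimmed = values[values.between(lower, upper)]
print(trimmed.tolist())
```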

  **2. Transformation**
  This technique involves applying a mathematical function to the data to reduce the impact of outliers. Common transformations include the logarithmic transformation or the square root transformation. These functions compress the range of the data, bringing the outliers closer to the other data points.
    
    * When to use: When you want to keep the outliers but need to reduce their influence on the model. This is particularly useful for highly skewed datasets where the outliers are part of a natural, albeit rare, distribution.
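
A short sketch of the logarithmic transformation on hypothetical transaction amounts. `np.log1p` computes log(1 + x), which is an assumption made here so that zero values would also be handled.

```python
import numpy as np

amounts = np.array([20.0, 35.0, 50.0, 80.0, 120.0, 10_000.0])
log_amounts = np.log1p(amounts)  # compresses large values far more than small ones

# How far the largest value sits from the median, before and after
ratio_raw = amounts.max() / np.median(amounts)
ratio_log = log_amounts.max() / np.median(log_amounts)
print(f"max/median before: {ratio_raw:.1f}, after: {ratio_log:.1f}")

# The transformation is reversible, so no information is lost
restored = np.expm1(log_amounts)
```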
  
  **3. Imputation/Capping**
  This method involves replacing the outlier values with less extreme, more representative values. A common approach is capping, where you set all values beyond a certain threshold to that threshold value (e.g., all values above the 99th percentile are set to the value of the 99th percentile).
    * When to use: When removing the outliers is not an option and transforming them isn't effective. It helps to preserve the data while mitigating the extreme values, making it a good balance between removal and transformation.
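
Capping at the 99th percentile, as in the example above, can be sketched like this. The data is synthetic, with three extreme values injected by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(100, 15, 1000)
values[:3] = [900, 1200, 1500]      # injected extreme values

cap = np.percentile(values, 99)     # threshold: the 99th percentile
capped = np.minimum(values, cap)    # anything above the cap is set to the cap

print(f"max before: {values.max():.1f}, max after: {capped.max():.1f}")
```

No rows are dropped: the array keeps its length, and only the values above the threshold change.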


**Q.5. Explain the process of handling missing values and mention one imputation technique for numerical and one for categorical data.**
  - Handling missing values is a crucial step in data preprocessing to ensure that a dataset is complete and ready for analysis or model training. The overall process involves first identifying the missing data, then deciding on a strategy to handle it, and finally implementing that strategy. The choice of technique depends on the nature of the data and the percentage of missing values.
  
  **Process of Handling Missing Values**
  1. Identify Missing Values: The first step is to quantify and locate the missing values. This can be done by counting the number of null, NaN, or other placeholder values in each column of the dataset.
  
  2. Determine the Cause: It's important to understand why the data is missing. Is it a random error, or is there a systematic reason? For example, a "Years of Experience" field might be missing for all new graduates, which is a pattern, not a random occurrence.
  
  3. Choose a Strategy: Based on the nature and quantity of missing data, you can choose from a few strategies:
    * Deletion: Remove rows or columns with missing data. This is simple but can lead to a significant loss of information, especially if many rows contain missing values.
    * Imputation: Replace the missing values with a substituted value. This is the most common approach as it preserves the dataset size.
    * Ignoring: Some advanced machine learning algorithms (like certain tree-based models) can handle missing values on their own, making imputation unnecessary.
  4. Implement the Chosen Strategy: Apply the selected technique to the dataset to fill in or remove the missing values, preparing the data for the next steps in the machine learning pipeline.
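
Steps 1 and 3 can be sketched with pandas on a small hypothetical table (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31, np.nan],
    "city": ["Pune", "Delhi", None, "Mumbai", "Delhi"],
    "income": [50_000, 62_000, 58_000, 45_000, 70_000],
})

# Step 1: count and quantify missing values per column
missing_counts = df.isna().sum()
missing_share = df.isna().mean()
print(missing_counts)

# Step 3 (deletion): drop every row that has at least one missing value
complete_rows = df.dropna()
print(f"Rows kept after deletion: {len(complete_rows)} of {len(df)}")
```

Here deletion discards three of five rows, which illustrates why it is risky on small datasets.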

  **Imputation Techniques**
  1. For Numerical Data: Median Imputation 🔢
    * Technique: Median imputation involves replacing missing numerical values with the median of that column. The median is the middle value in a sorted list of numbers and is less sensitive to outliers than the mean.
    * When to use: This technique is a robust choice when the numerical data is skewed or contains outliers. Using the mean in such cases could introduce a significant bias.

  2. For Categorical Data: Mode Imputation 🔡
    * Technique: Mode imputation involves replacing missing categorical values with the mode of that column. The mode is the most frequently occurring value.
    * When to use: This is a simple technique that works well when only a small share of values is missing. Because every gap is filled with the most common category, it inflates that category's frequency, so it can distort the distribution when many values are missing.
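
Both techniques can be sketched in pandas. The column names and values below are hypothetical; the income column includes one extreme value to show why the median is preferred over the mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [30_000, 32_000, np.nan, 35_000, 500_000],
    "payment": ["Card", "Cash", "Card", None, "Card"],
})

# Numerical column: fill with the median of the observed values
df["income"] = df["income"].fillna(df["income"].median())

# Categorical column: fill with the most frequent category
df["payment"] = df["payment"].fillna(df["payment"].mode()[0])

print(df)
```

The median of the observed incomes is 33,500, whereas the mean would be pulled to roughly 149,000 by the single extreme value.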

**Q.6. Write a Python program that:**

**● Creates a synthetic imbalanced dataset with make_classification() from sklearn.datasets.**

**● Prints the class distribution.**
**(Include your Python code and output in the code box below.)**
  - The program below creates a synthetic imbalanced dataset with sklearn.datasets.make_classification() and prints its class distribution. The output follows the code.
  Python Code

In [None]:
import numpy as np
from sklearn.datasets import make_classification

# Set a random seed for reproducibility
np.random.seed(42)

# Create an imbalanced dataset
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_informative=2,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],  # Specify the class distribution
    flip_y=0,
    random_state=42
)

# Print the class distribution
unique, counts = np.unique(y, return_counts=True)
class_distribution = dict(zip(unique, counts))

print("Class Distribution:")
for class_label, count in class_distribution.items():
    print(f"Class {class_label}: {count} samples")

Class Distribution:
Class 0: 900 samples
Class 1: 100 samples


**Q.7. Implement one-hot encoding using pandas for the following list of colors: ['Red', 'Green', 'Blue', 'Green', 'Red']. Print the resulting dataframe.**

**(Include your Python code and output in the code box below.)**
  - The program below uses pandas to one-hot encode the list of colors with pd.get_dummies() and prints the resulting dataframe. The output follows the code.
  Python Code

In [None]:
import pandas as pd

# The list of colors provided
colors = ['Red', 'Green', 'Blue', 'Green', 'Red']

# Create a DataFrame from the list
df = pd.DataFrame({'color': colors})

# Implement one-hot encoding using pd.get_dummies()
encoded_df = pd.get_dummies(df, columns=['color'], dtype=int)

# Print the resulting DataFrame
print("Original DataFrame:")
print(df)
print("\nOne-hot Encoded DataFrame:")
print(encoded_df)

Original DataFrame:
   color
0    Red
1  Green
2   Blue
3  Green
4    Red

One-hot Encoded DataFrame:
   color_Blue  color_Green  color_Red
0           0            0          1
1           0            1          0
2           1            0          0
3           0            1          0
4           0            0          1


**Q.8. Write a Python script to:**

**● Generate 1000 samples from a normal distribution.**

**● Introduce 50 random missing values.**

**● Fill missing values with the column mean.**

**● Plot a histogram before and after imputation.**

**(Include your Python code and output in the code box below.)**
  - The script below generates 1000 samples from a normal distribution (mean 50, standard deviation 10), sets 50 randomly chosen values to NaN, fills them with the column mean, and plots the histograms before and after imputation side by side in one figure.
  Python Code


  Histograms
  The saved figure shows both histograms side by side. The "before" histogram is drawn from the 950 observed values only. The "after" histogram on the right has a sharp peak at the mean, because all 50 missing values are replaced with that single value.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generate data and introduce missing values
np.random.seed(42)
df = pd.DataFrame(np.random.normal(50, 10, 1000), columns=['value'])
df.loc[np.random.choice(df.index, 50, replace=False), 'value'] = np.nan

# Plot before and after imputation on a single figure
plt.figure(figsize=(12, 6))

# Before Imputation
plt.subplot(1, 2, 1)
plt.hist(df['value'].dropna(), bins=30, edgecolor='black')
plt.title('Before Imputation')
plt.xlabel('Value')
plt.ylabel('Frequency')

# After Imputation
df['value'] = df['value'].fillna(df['value'].mean())
plt.subplot(1, 2, 2)
plt.hist(df['value'], bins=30, edgecolor='black')
plt.title('After Imputation')
plt.xlabel('Value')
plt.ylabel('Frequency')

plt.tight_layout()
plt.savefig('histograms_imputation.png')
plt.close()

print("Histograms saved as 'histograms_imputation.png'")
print("Number of missing values after imputation:", df['value'].isnull().sum())

Histograms saved as 'histograms_imputation.png'
Number of missing values after imputation: 0


**Q.9. Implement Min-Max scaling on the following list of numbers [2, 5, 10, 15, 20] using sklearn.preprocessing.MinMaxScaler. Print the scaled array.**

**(Include your Python code and output in the code box below.)**
  - Here is a Python program that uses sklearn.preprocessing.MinMaxScaler to implement Min-Max scaling on the list of numbers and prints the scaled array.
  Python Code

In [None]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The list of numbers provided
data = np.array([2, 5, 10, 15, 20]).reshape(-1, 1)

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Perform Min-Max scaling
scaled_data = scaler.fit_transform(data)

# Print the scaled array
print("Original data:")
print(data.flatten())
print("\nScaled data:")
print(scaled_data.flatten())

Original data:
[ 2  5 10 15 20]

Scaled data:
[0.         0.16666667 0.44444444 0.72222222 1.        ]
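
As a check on the output above, the same values follow from the min-max formula x' = (x - min) / (max - min). A minimal NumPy sketch of that arithmetic, without scikit-learn:

```python
import numpy as np

values = np.array([2, 5, 10, 15, 20], dtype=float)

# Min-max formula: shift by the minimum, divide by the range (20 - 2 = 18)
scaled = (values - values.min()) / (values.max() - values.min())
print(scaled)
```

For example, 5 maps to (5 - 2) / 18 ≈ 0.1667, matching the second entry of the scaled array.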


**Q.10. You are working as a data scientist for a retail company. You receive a customer transaction dataset that contains:**

**● Missing ages,**

**● Outliers in transaction amount,**

**● A highly imbalanced target (fraud vs. non-fraud),**

**● Categorical variables like payment method.**

**Explain the step-by-step data preparation plan you'd follow before training a machine learning model. Include how you'd address missing data, outliers, imbalance, and encoding.**

**(Include your Python code and output in the code box below.)**
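  - **Step-by-step plan**
  1. Missing ages: impute with the median, which is robust to skew and extreme values.
  2. Outliers in transaction amount: cap values at the 5th and 95th percentiles rather than dropping rows, so that no transactions are discarded.
  3. Categorical variables: one-hot encode the payment method.
  4. Class imbalance: oversample the minority (fraud) class with SMOTE.

  **Python Code**
  The following Python script demonstrates this pipeline on a synthetic dataset and prints the class distribution before and after SMOTE. For brevity it resamples the full dataset; in a real project, split off a test set first and apply SMOTE to the training split only, so that synthetic samples do not leak into the evaluation.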

In [None]:
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE

# 1. Create a synthetic dataset
np.random.seed(42)
data = {
    'Age': np.random.randint(18, 90, size=1000),
    'Transaction_Amount': np.random.lognormal(mean=7, sigma=1, size=1000),
    'Payment_Method': np.random.choice(['CreditCard', 'DebitCard', 'PayPal'], size=1000, p=[0.5, 0.3, 0.2]),
    'Is_Fraud': np.random.choice([0, 1], size=1000, p=[0.95, 0.05])
}
df = pd.DataFrame(data)

# Introduce missing ages and outliers
df.loc[df.sample(20).index, 'Age'] = np.nan
df.loc[df.sample(5).index, 'Transaction_Amount'] = 50000

print("--- Initial Dataset Snapshot ---")
print(df.head())
print("\nInitial class distribution (Is_Fraud):")
print(df['Is_Fraud'].value_counts())

# 2. Step-by-step data preparation
# A) Missing Data Imputation (Median)
df['Age'] = df['Age'].fillna(df['Age'].median())

# B) Outlier Treatment (Capping)
q_low = df['Transaction_Amount'].quantile(0.05)
q_high = df['Transaction_Amount'].quantile(0.95)
df['Transaction_Amount'] = df['Transaction_Amount'].clip(lower=q_low, upper=q_high)

# C) Categorical Variable Encoding (One-hot)
df = pd.get_dummies(df, columns=['Payment_Method'], dtype=int)

# D) Class Imbalance Handling (SMOTE)
X = df.drop('Is_Fraud', axis=1)
y = df['Is_Fraud']
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X, y)

print("\n--- After Data Preparation ---")
print("Resampled data shape:", X_resampled.shape)
print("Final class distribution (Is_Fraud):")
print(y_resampled.value_counts())

--- Initial Dataset Snapshot ---
    Age  Transaction_Amount Payment_Method  Is_Fraud
0  69.0          848.256420     CreditCard         0
1   NaN         3054.241569      DebitCard         0
2  89.0          216.475282     CreditCard         0
3  78.0         1542.235633      DebitCard         0
4   NaN         1125.037150      DebitCard         0

Initial class distribution (Is_Fraud):
Is_Fraud
0    967
1     33
Name: count, dtype: int64

--- After Data Preparation ---
Resampled data shape: (1934, 5)
Final class distribution (Is_Fraud):
Is_Fraud
0    967
1    967
Name: count, dtype: int64
