# Foundations of Machine Learning and EDA – Assignment Solutions


This notebook contains detailed answers and code for Questions 1–10. Question numbers and wording follow the original assignment exactly.


## Question 1
**What is the difference between AI, ML, DL, and Data Science? Provide a brief explanation of each.**  
_(Hint: Compare their scope, techniques, and applications for each.)_

### Answer
AI, ML, and DL nest like concentric circles (DL inside ML, ML inside AI), while Data Science is a broader discipline that overlaps with all three:

1. **Artificial Intelligence (AI)**  
- **Scope:** The broadest field. AI is about making machines behave intelligently – able to **perceive, reason, learn, and act**.  
- **Techniques:** Rule‑based systems, search algorithms, planning, expert systems, logic, as well as machine learning and deep learning.  
- **Applications:** Game playing (e.g., chess engines), recommendation systems, chatbots, robots, self‑driving cars, fraud detection, etc.  
- **Real‑life example:** A customer support chatbot that understands queries, asks follow‑up questions, and decides whether to escalate to a human.

2. **Machine Learning (ML)**  
- **Scope:** A **subset of AI** focused specifically on algorithms that **learn patterns from data** instead of relying only on explicit rules.  
- **Idea:** "Learn from data" → given examples \((x, y)\), learn a function \(f(x)\) that predicts \(y\) for new inputs.  
- **Techniques:** Supervised learning (regression, classification), unsupervised learning (clustering, dimensionality reduction), reinforcement learning, etc.  
- **Applications:** Email spam detection, credit‑card fraud detection, demand forecasting, house price prediction.  
- **Example:** A model that predicts whether a loan applicant will default based on their past financial history. The rules are not hard‑coded; they are *learned* from historical data.

3. **Deep Learning (DL)**  
- **Scope:** A **subset of machine learning** that uses **deep neural networks** (many layers of non‑linear transformations).  
- **Idea:** Automatically learn **hierarchical representations** (e.g., edges → shapes → objects in images).  
- **Techniques:** Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), LSTMs, Transformers, autoencoders, GANs, etc.  
- **Applications:** Image recognition (face unlock on phones), speech recognition (voice assistants), machine translation, large language models (e.g., ChatGPT‑like systems).  
- **Example:** A CNN trained to classify X‑ray images into "normal" and "pneumonia" using a large set of labeled images.

4. **Data Science**  
- **Scope:** A broader **end‑to‑end discipline** that involves **collecting, cleaning, exploring, modeling, and communicating insights from data**. ML/DL are tools within data science.  
- **Tasks:** Data collection, data cleaning, Exploratory Data Analysis (EDA), feature engineering, model building (using ML/DL), model evaluation, visualization, storytelling, and deployment support.  
- **Techniques:** Statistics, ML/DL, data visualization, SQL, programming (Python/R), dashboards, A/B testing, etc.  
- **Applications:** Business analytics, product analytics, user behavior analysis, marketing campaigns, operations optimization.  
- **Example:** A data scientist at an e‑commerce company uses SQL to pull user data, cleans it, explores patterns in purchases, builds a churn prediction model using ML, and then creates a dashboard for business stakeholders.

### Summary of differences
- **AI** – Big umbrella: making machines intelligent (can include both rule‑based and learning‑based systems).
- **ML** – Subset of AI: algorithms that *learn from data* (e.g., decision trees, SVM, linear regression).
- **DL** – Subset of ML: uses deep neural networks; typically needs large amounts of data and works especially well on unstructured inputs (images, text, audio).
- **Data Science** – Practical discipline that uses statistics, ML/DL, and domain knowledge to solve real‑world data problems and communicate insights.


## Question 2
**Explain overfitting and underfitting in ML. How can you detect and prevent them?**  
_Hint: Discuss bias‑variance tradeoff, cross‑validation, and regularization techniques._

### Answer
When we train a model, we want it to perform well not only on the **training data** but also on **unseen test data**. Overfitting and underfitting are two common problems:

#### Underfitting
- A model is **too simple** to capture the underlying pattern in the data.
- It performs **poorly on both training and test data** (high error everywhere).
- Example: Fitting a straight line to data that clearly follows a curved (quadratic) pattern.


#### Overfitting
- A model is **too complex** and **memorizes** noise and random fluctuations in the training data.
- It shows **very low training error but high test error**.
- Example: A decision tree that grows very deep and perfectly classifies each training sample but fails badly on new samples.

#### Bias‑Variance Trade‑off
- **Bias**: Error due to overly simplified assumptions in the model (e.g., assuming linear relationship when it's not). High bias → underfitting.  
- **Variance**: Error due to model's sensitivity to small fluctuations in the training data (too complex, changes a lot when data changes). High variance → overfitting.  
- We want a balance: **not too simple, not too complex**.

#### Detecting Overfitting and Underfitting
1. **Train–Test (or Train–Validation) Performance Comparison**  
- Underfitting: Training accuracy is low, validation/test accuracy is also low.  
- Overfitting: Training accuracy is very high, but validation/test accuracy is much lower.

2. **Learning Curves** (plot error vs. training set size)
- Underfitting: Both training and validation errors are high and close to each other.  
- Overfitting: Training error is low and validation error is much higher; the gap between the two curves usually narrows as more training data is added.

3. **Cross‑Validation**  
- Use **k‑fold cross‑validation** to estimate how the model performs on unseen data.  
- If performance varies a lot across folds or is much worse than training performance → likely overfitting.
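
The train-versus-validation comparison described above can be sketched with plain NumPy. This is a minimal illustration under assumed conditions: the data is synthetic (a noisy quadratic), and the degrees 1, 2, and 15 are arbitrary choices standing in for a too-simple, a well-matched, and a too-flexible model.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic data (assumption): y = 3x^2 plus Gaussian noise."""
    x = rng.uniform(-1, 1, size=n)
    y = 3 * x**2 + rng.normal(0, 0.1, size=n)
    return x, y

x_train, y_train = make_data(20)    # small training set
x_val, y_val = make_data(200)       # held-out validation set

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

train_mse, val_mse = {}, {}
for degree in (1, 2, 15):
    # Least-squares polynomial fit of the given degree
    model = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse[degree] = mse(model, x_train, y_train)
    val_mse[degree] = mse(model, x_val, y_val)
    print(f"degree={degree:2d}  train MSE={train_mse[degree]:.4f}  "
          f"val MSE={val_mse[degree]:.4f}")
```

Reading the output: the degree‑1 fit should show high error on both sets (underfitting), while the degree‑15 fit should show a training error well below its validation error (the overfitting signature).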

#### Preventing Overfitting
1. **Simplify the model**  
- Use fewer features or a simpler algorithm (e.g., shallow tree instead of very deep tree).

2. **Regularization**  
- Add a **penalty** on large weights in linear/logistic regression, neural networks, etc.  
- **L2 (Ridge)** regularization: adds \( \lambda \sum w_i^2 \).  
- **L1 (Lasso)** regularization: adds \( \lambda \sum |w_i| \), can drive some weights to zero (feature selection).

3. **Early Stopping**  
- In iterative training (like gradient descent, neural networks), monitor validation loss and stop when it starts increasing.

4. **More Training Data**  
- With more diverse data, the model is less likely to memorize noise.

5. **Data Augmentation** (for images/text/audio)
- E.g., rotate/flip images, add noise, etc., to increase data variability.
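
As a rough illustration of the regularization item above, the sketch below fits ordinary least squares, Ridge (L2), and Lasso (L1) on synthetic data in which only the first two of five features matter. The data, the `alpha` values, and the variable names are assumptions chosen for the demo, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(42)

# Synthetic data (assumption): 5 features, only the first two influence y
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=200)

ols = LinearRegression().fit(X, y)     # no penalty
ridge = Ridge(alpha=10.0).fit(X, y)    # L2 penalty shrinks all weights
lasso = Lasso(alpha=0.1).fit(X, y)     # L1 penalty can zero out weights

np.set_printoptions(precision=3, suppress=True)
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)
```

Ridge shrinks the whole weight vector toward zero, while Lasso sets the weights of the three irrelevant features exactly to zero, which is the feature-selection effect mentioned in the list.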

**Real‑life example:**  
Imagine predicting house prices. If we fit a model using only one feature (e.g., number of bedrooms), it may **underfit** because it ignores other important features (location, area). If we build a very complex model that tries to perfectly match every training house price including outliers, it may **overfit** and fail on new houses.

#### Preventing Underfitting
- Use a **more expressive model** (e.g., move from linear to polynomial regression, increase tree depth slightly).  
- Add **relevant features** or better feature engineering.  
- Reduce regularization strength if it is too high.


## Question 3
**How would you handle missing values in a dataset? Explain at least three methods with examples.**  
_Hint: Consider deletion, mean/median imputation, and predictive modeling._

### Answer
Real‑world datasets (customer data, medical records, sensor logs, surveys, etc.) almost always have **missing values**. Handling them properly is important because many ML algorithms cannot work directly with missing entries.

Let us consider a simple example dataset of customers:

| Customer | Age | Income | City     |
|----------|-----|--------|----------|
| A        | 25  | 50k    | Delhi    |
| B        | NaN | 60k    | Mumbai   |
| C        | 35  | NaN    | Delhi    |
| D        | 40  | 80k    | NaN      |

We will discuss three common strategies:

#### 1. Deletion (Dropping Rows or Columns)
- **Listwise deletion (drop rows):** Remove rows with missing values.  
- **Column deletion:** Remove columns that have too many missing values.

**When to use:**
- When the amount of missing data is **small** (e.g., < 5%).  
- When removed rows are unlikely to introduce bias.

**Example:**  
In the above table, if only customer B has a missing Age, we might drop that row if the dataset is large and one record does not matter much.

In code (pandas):


In [None]:
# Example: Deletion of rows with missing values
import pandas as pd

# Tiny demo dataset
customers = pd.DataFrame({
    'Customer': ['A', 'B', 'C', 'D'],
    'Age': [25, None, 35, 40],
    'Income': [50000, 60000, None, 80000],
    'City': ['Delhi', 'Mumbai', 'Delhi', None]
})

print("Original DataFrame:\n", customers)

# Drop rows with any missing value
customers_dropped = customers.dropna()
print("\nAfter dropping rows with missing values:\n", customers_dropped)


#### 2. Mean/Median/Mode Imputation
- For **numerical features**, we can fill missing values with:  
  - **Mean** (average) – sensitive to outliers.  
  - **Median** – more robust when data is skewed.  
- For **categorical features**, we often use **mode** (most frequent category).

**Example (Age):**  
Suppose Ages are [25, 30, 35, 40] and one Age is missing.  
- Mean Age = (25 + 30 + 35 + 40) / 4 = 32.5 → use 32.5.  
- Median Age = (30 + 35) / 2 = 32.5 (in this particular case).  
In skewed income distributions, median is usually better than mean.

In code (pandas):


In [None]:
# Mean/median/mode imputation example
import pandas as pd

customers = pd.DataFrame({
    'Customer': ['A', 'B', 'C', 'D'],
    'Age': [25, None, 35, 40],
    'Income': [50000, 60000, None, 80000],
    'City': ['Delhi', 'Mumbai', 'Delhi', None]
})

# Fill numerical columns
customers['Age'] = customers['Age'].fillna(customers['Age'].mean())
customers['Income'] = customers['Income'].fillna(customers['Income'].median())

# Fill categorical column with mode
customers['City'] = customers['City'].fillna(customers['City'].mode()[0])

print(customers)
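
To make the point about outliers concrete, here is a small sketch with made-up income values (an assumption for illustration), one of which is an extreme outlier. It compares what mean and median imputation would fill in for the missing entry.

```python
import pandas as pd

# Hypothetical incomes: one missing value and one extreme outlier
incomes = pd.Series([30000, 35000, 40000, None, 1000000], dtype=float)

mean_fill = incomes.fillna(incomes.mean())      # mean is pulled up by the outlier
median_fill = incomes.fillna(incomes.median())  # median stays near typical values

print("Filled with mean:  ", mean_fill[3])
print("Filled with median:", median_fill[3])
```

The mean-based fill lands far above every typical income in the column, whereas the median-based fill stays close to the bulk of the data.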


#### 3. Predictive Modeling (Model‑based Imputation)
- Here we **train a model** to predict missing values based on other features.  
- For example, if Age is missing but we have Income, City, and Spending Score, we can train a regression model to predict Age.

**Steps:**
1. Separate rows where the target feature (e.g., Age) is **known**.  
2. Train a model (e.g., linear regression, decision tree, KNN) to predict Age from other features.  
3. Use this model to predict missing Age values.

**Real‑life example:**  
In a hospital database, some patients' blood pressure readings might be missing. Instead of dropping patients or filling with a simple average, we can use other variables (age, weight, previous readings, medications) to predict blood pressure more accurately.

In sklearn, we can use `KNNImputer` or `IterativeImputer` for more advanced imputation.


In [None]:
# Skeleton example using KNNImputer (note: small demo, not real data)
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [25, 50000],
    [np.nan, 60000],
    [35, np.nan],
    [40, 80000]
])

imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
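
The three steps listed earlier (separate known rows, train a model, predict the missing values) can also be written out by hand. The sketch below uses a linear regression on a tiny made-up table where Age is an exact linear function of Income; the data and the column choice are assumptions for illustration only.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: Age is missing for one person
people = pd.DataFrame({
    "Income": [40000, 50000, 60000, 70000, 80000],
    "Age":    [30, 35, None, 45, 50],
})

# Step 1: separate rows where Age is known from rows where it is missing
known = people[people["Age"].notna()]
missing = people[people["Age"].isna()]

# Step 2: train a model to predict Age from the other feature(s)
model = LinearRegression().fit(known[["Income"]], known["Age"])

# Step 3: fill the missing Age values with the model's predictions
people.loc[missing.index, "Age"] = model.predict(missing[["Income"]])
print(people)
```

On real data the relationship will be noisy, so the imputed values are estimates; it is usually worth checking the model's error on the known rows before trusting them.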


Other methods include **forward/backward fill** for time series, and **using special categories** like "Unknown" for some categorical variables. Choice of method depends on data size, missingness pattern, and domain knowledge.


## Question 4
**What is an imbalanced dataset? Describe two techniques to handle it (theoretical + practical).**  
_Hint: Discuss SMOTE, Random Under/Oversampling, and class weights in models._

### Answer
#### What is an imbalanced dataset?
- A dataset is **imbalanced** when the **class distribution is highly skewed** – one class has many more examples than another.  
- Example: In fraud detection, 99.5% transactions are normal, 0.5% are fraudulent.  
- If we train a normal classifier, it might predict **"not fraud" for everything** and still get 99.5% accuracy → but it is useless for detecting fraud.

Because **accuracy** can be misleading on imbalanced data, we rely on metrics such as **precision, recall, F1‑score, and ROC‑AUC** instead.

We describe two techniques in detail and briefly mention a third:

### Technique 1: Random Over/Under Sampling
1. **Random Oversampling**  
- **Idea:** Increase the number of minority class samples by **duplicating** existing ones at random (generating new synthetic samples is a separate technique, covered under SMOTE).  
- Pros: Balances classes; simple.  
- Cons: Can lead to overfitting on duplicated examples.

2. **Random Undersampling**  
- **Idea:** Reduce the majority class by **removing some examples**.  
- Pros: Smaller dataset, faster training.  
- Cons: May lose useful information (if you drop informative majority samples).

**Real‑life example:**  
In an email spam dataset with 90% non‑spam and 10% spam, we can oversample spam emails (by duplicating them) so the classifier pays more attention to the spam class.

In code (using `sklearn.utils.resample` for simple oversampling):


In [None]:
# Simple demo of random oversampling for a binary class
import pandas as pd
from sklearn.utils import resample

# Dummy imbalanced dataset
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5, 6],
    'label':    [0, 0, 0, 0, 1, 1]
})

print("Original class distribution:\n", data['label'].value_counts())

majority = data[data['label'] == 0]
minority = data[data['label'] == 1]

minority_upsampled = resample(minority,
                              replace=True,
                              n_samples=len(majority),
                              random_state=42)

upsampled = pd.concat([majority, minority_upsampled])
print("\nAfter oversampling:\n", upsampled['label'].value_counts())
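
Random undersampling can be sketched the same way, again with `sklearn.utils.resample`. The only changes are that the majority class is sampled without replacement and shrunk to the minority size; the dummy data repeats the demo values used for oversampling.

```python
import pandas as pd
from sklearn.utils import resample

# Same dummy imbalanced dataset as in the oversampling demo
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5, 6],
    'label':    [0, 0, 0, 0, 1, 1]
})

majority = data[data['label'] == 0]
minority = data[data['label'] == 1]

# Sample the majority class WITHOUT replacement down to the minority size
majority_downsampled = resample(majority,
                                replace=False,
                                n_samples=len(minority),
                                random_state=42)

downsampled = pd.concat([majority_downsampled, minority])
print("After undersampling:\n", downsampled['label'].value_counts())
```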


### Technique 2: SMOTE (Synthetic Minority Over‑sampling Technique)
- **Idea:** Instead of simply duplicating minority samples, SMOTE **creates synthetic samples** by interpolating between existing minority samples.  
- For a minority point, SMOTE picks one of its k nearest minority neighbors and generates a new sample along the line joining them.  
- This helps the classifier learn a **smoother decision boundary**.

**Theoretical intuition:**  
By creating new, slightly varied minority points, the decision region for the minority class is expanded in feature space, making it easier for the model to learn.
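
The interpolation step can be sketched in plain NumPy. This is a simplified illustration of the idea only, not the implementation used by `imbalanced-learn`; the function name `smote_like_samples`, its parameters, and the sample points are all hypothetical.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Simplified SMOTE-style interpolation between minority points."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority points
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))          # pick a minority point
        nb = rng.choice(neighbours[base])        # pick one of its k nearest neighbours
        lam = rng.random()                       # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + lam * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical minority-class points in a 2-D feature space
X_minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])
X_new = smote_like_samples(X_minority, n_new=6)
print(X_new)
```

Every synthetic point lies on a line segment between two real minority points, so it stays inside the region the minority class already occupies.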

In practice, we can use `imblearn.over_sampling.SMOTE` (from the `imbalanced-learn` library):


In [None]:
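# SMOTE sketch (needs the imbalanced-learn package; prints a message without it)
import numpy as np

# Tiny demo data: 8 majority samples (label 0), 4 minority samples (label 1)
X = np.arange(24, dtype=float).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)
try:
    from imblearn.over_sampling import SMOTE
    # k_neighbors must be smaller than the number of minority samples
    X_resampled, y_resampled = SMOTE(k_neighbors=2, random_state=42).fit_resample(X, y)
    print("New class counts [class 0, class 1]:", np.bincount(y_resampled))
except ImportError:
    print("SMOTE example skipped: imbalanced-learn is not installed.")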


### Technique 3 (brief mention): Class Weights
- Many models (e.g., logistic regression, SVM, tree‑based models) allow you to specify **class weights**.  
- Assign higher weight to the minority class so that **misclassifying it is penalized more** in the loss function.  
- In scikit‑learn, `class_weight='balanced'` automatically adjusts weights inversely proportional to class frequencies.

**Real‑life example:**  
In medical diagnosis for a rare disease, classifying a sick patient as healthy is very costly. By increasing the weight of the "disease" class in the loss, the model focuses more on correctly identifying diseased patients.
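
To see what "inversely proportional to class frequencies" means numerically, the sketch below computes the balanced weights for a made-up label vector with a 90/10 split and checks them against the formula `n_samples / (n_classes * count_per_class)`. The label vector is an assumption for the demo.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90 samples of class 0, 10 samples of class 1
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# Weights scikit-learn derives for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same weights from the formula n_samples / (n_classes * count_per_class)
manual = len(y) / (len(classes) * np.bincount(y))

print("Balanced weights:", dict(zip(classes.tolist(), weights.round(3).tolist())))
```

Here the minority class receives a weight nine times larger than the majority class, so each minority misclassification contributes nine times as much to the loss.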


## Question 5
**Why is feature scaling important in ML? Compare Min‑Max scaling and Standardization.**  
_Hint: Explain impact on distance‑based algorithms (e.g., KNN, SVM) and gradient descent._

### Answer
Different features in a dataset can be on very different scales. For example:

- Age: 18–70  
- Salary: 20,000–2,00,000  
- Number of purchases: 0–100

Many ML algorithms are **sensitive to the scale** of features. If we do not scale them, features with **larger numeric ranges dominate** distance calculations and gradients.

#### Why feature scaling is important
1. **Distance‑based algorithms (KNN, K‑Means, SVM with RBF kernel)**  
- These use Euclidean distance or similar metrics.  
- If one feature has a much larger range than others, it will **dominate the distance** measure.  
- Example: In KNN, if we use Age and Salary, and Salary is in lakhs, then salary differences overshadow age differences unless we scale.

2. **Gradient Descent‑based algorithms (Linear Regression, Logistic Regression, Neural Networks)**  
- Features on different scales can cause the **loss surface to be very elongated**, leading to slow or unstable convergence.  
- Scaling makes gradient descent **more stable and faster**.

3. **Regularization (L1/L2)**  
- When features are on similar scales, regularization treats them more fairly.
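
The distance-domination effect can be checked with a few lines of NumPy. The three people below are made up for illustration; the scaling bounds reuse the Age (18–70) and Salary (20,000–2,00,000) ranges given above.

```python
import numpy as np

# Hypothetical people as [Age, Salary]
people = np.array([
    [25, 50000],   # A
    [65, 51000],   # B: 40 years older than A, similar salary
    [26, 60000],   # C: about the same age as A, salary higher by 10,000
], dtype=float)

def dist(p, q):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(p - q))

raw_ab, raw_ac = dist(people[0], people[1]), dist(people[0], people[2])

# Min-max scale each feature using the ranges from the text
lo = np.array([18.0, 20000.0])
hi = np.array([70.0, 200000.0])
scaled = (people - lo) / (hi - lo)
scaled_ab, scaled_ac = dist(scaled[0], scaled[1]), dist(scaled[0], scaled[2])

print(f"Raw:    d(A,B)={raw_ab:.1f}  d(A,C)={raw_ac:.1f}")
print(f"Scaled: d(A,B)={scaled_ab:.3f}  d(A,C)={scaled_ac:.3f}")
```

On the raw values, B looks closer to A because the 40-year age gap is tiny next to salary differences. After scaling, C becomes the nearer neighbour, which changes what an algorithm like KNN would predict.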

#### Min‑Max Scaling (Normalization)
- **Formula:**  
  \[
  x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}
  \]
- Transforms values to a fixed range, usually **[0, 1]**.  
- Useful when we **know the min and max** and want a bounded scale.

**Example:**  
If Age ranges from 18 to 60, and a person is 30 years old:
\( x_{scaled} = (30 - 18) / (60 - 18) = 12 / 42 \approx 0.2857. \)

#### Standardization (Z‑score Scaling)
- **Formula:**  
  \[
  x_{scaled} = \frac{x - \mu}{\sigma}
  \]
  where \(\mu\) is the mean and \(\sigma\) is the standard deviation.
- Resulting feature typically has **mean 0** and **standard deviation 1**.  
- Values are not bounded to [0, 1]; they can be negative or greater than 1.

**Example:**  
If mean Age is 35 and standard deviation is 10, then Age = 30 →  
\( x_{scaled} = (30 - 35)/10 = -0.5. \)

#### When to use which?
- **Min‑Max Scaling**  
  - Useful for algorithms that expect inputs in a **bounded range**, e.g., neural networks with sigmoid activation, or when you want to preserve **exact 0–1 ranges**.  
  - Can be sensitive to **outliers** (min and max can be skewed).

- **Standardization**  
  - Preferred in many ML algorithms (SVM, logistic regression, linear regression, PCA).  
  - Works well when data approximately follows a **Gaussian distribution**.  
  - Outliers still shift the mean and standard deviation, but they do not squeeze the remaining values into a narrow band the way min‑max scaling does.

In scikit‑learn, we use `MinMaxScaler` and `StandardScaler`.


In [None]:
# Example: Min-Max scaling vs Standardization
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = pd.DataFrame({
    'Age': [18, 25, 40, 60],
    'Salary': [20000, 50000, 120000, 200000]
})

mms = MinMaxScaler()
ssc = StandardScaler()

X_minmax = mms.fit_transform(X)
X_std = ssc.fit_transform(X)

print("Original:\n", X)
print("\nMin-Max scaled:\n", X_minmax)
print("\nStandardized:\n", X_std)


## Question 6
**Compare Label Encoding and One‑Hot Encoding. When would you prefer one over the other?**  
_Hint: Consider categorical variables with ordinal vs. nominal relationships._

### Answer
Categorical variables (e.g., City, Color, Education Level) must be converted into numerical form before feeding into most ML algorithms.

#### 1. Label Encoding
- **Idea:** Assign an **integer label** to each category.
- Example: City = {Delhi, Mumbai, Chennai}  
  - Delhi → 0  
  - Mumbai → 1  
  - Chennai → 2

- Pros: Simple, uses only **one column**, and often works acceptably with **tree‑based models** (Decision Trees, Random Forests, XGBoost), which split on thresholds and can isolate individual codes through repeated splits.
- Cons: Imposes an **artificial ordering** (0 < 1 < 2), which may not make sense for **nominal** categories. Algorithms that rely on numerical distance or magnitude (e.g., KNN, linear models) may misinterpret the labels.

#### 2. One‑Hot Encoding
- **Idea:** Create a **binary column** for each category.  
- Example: City = {Delhi, Mumbai, Chennai} → three columns:  
  - City_Delhi, City_Mumbai, City_Chennai  
  - (1, 0, 0) for Delhi, (0, 1, 0) for Mumbai, etc.
- Pros: Does **not impose any ordering**, works well for **nominal** (unordered) categories.  
- Cons: Increases dimensionality (can be large when there are many categories → "curse of dimensionality").

#### Ordinal vs Nominal
- **Ordinal categories:** Have a meaningful order (e.g., Education Level: Primary < Secondary < Graduate < Postgraduate).  
- **Nominal categories:** No natural order (e.g., City names, Colors, Product IDs).

#### When to prefer which?
- Use **Label Encoding** when:
  - The categories are **ordinal** (order matters) and the integer codes follow that order; codes assigned alphabetically can scramble it.  
  - Or when using **tree‑based models**, which can handle label‑encoded nominal categorical variables reasonably well.

- Use **One‑Hot Encoding** when:
  - The categories are **nominal** and will be used in **distance‑based or linear models** (KNN, SVM, linear/logistic regression).  
  - You want to avoid misleading the model with artificial ordinal relationships.

**Real‑life examples:**
- Customer "Membership Level" (Bronze, Silver, Gold, Platinum) → **label encoding** with appropriate ordering is fine.  
- Customer "Country" or "City" → **one‑hot encoding** is safer.


In [None]:
# Example: Label Encoding vs One-Hot Encoding
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cities = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Chennai', 'Delhi']})

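# Label Encoding
# Note: LabelEncoder assigns codes in sorted order (Chennai=0, Delhi=1, Mumbai=2).
# scikit-learn intends it for target labels; OrdinalEncoder is the feature-side equivalent.
le = LabelEncoder()
cities['City_Label'] = le.fit_transform(cities['City'])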

# One-Hot Encoding
cities_ohe = pd.get_dummies(cities['City'], prefix='City')

print("With label encoding:\n", cities)
print("\nWith one-hot encoding:\n", cities_ohe)
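
For ordinal variables such as the Membership Level example, the integer codes must follow the real order. The sketch below contrasts `LabelEncoder` (which sorts category names alphabetically) with an explicitly ordered pandas categorical; the sample values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

levels = pd.Series(['Bronze', 'Silver', 'Gold', 'Platinum', 'Gold'])

# LabelEncoder sorts category names alphabetically:
# Bronze=0, Gold=1, Platinum=2, Silver=3 (not the membership order)
alphabetical = LabelEncoder().fit_transform(levels)

# An explicit order keeps Bronze < Silver < Gold < Platinum
order = ['Bronze', 'Silver', 'Gold', 'Platinum']
ordered = pd.Categorical(levels, categories=order, ordered=True).codes

print("Alphabetical codes:", list(alphabetical))
print("Ordered codes:     ", list(ordered))
```

With the alphabetical codes, Silver (3) would rank above Platinum (2), so a model that reads the codes as magnitudes would learn the wrong ordering.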


## Question 7 – Google Play Store Dataset
**(a) Analyze the relationship between app categories and ratings. Which categories have the highest/lowest average ratings, and what could be the possible reasons?**  
Dataset: `https://github.com/MasteriNeuron/datasets.git`  
_(Include your Python code and output in the code box below.)_

### Plan
1. Load the Google Play Store dataset (after downloading/cloning it locally).  
2. Clean the data: remove missing or invalid ratings/categories.  
3. Group by `Category` and compute the **average rating** per category.  
4. Sort categories by their mean rating to find the highest/lowest ones.  
5. Interpret possible reasons (e.g., game apps may get more extreme ratings, utility apps may have more stable moderate ratings, etc.).

> **Note:** In this notebook, we show code using placeholder file paths. When running on your machine, replace the path with the actual CSV path from the cloned repository.


In [None]:
# Google Play Store analysis – skeleton code
import pandas as pd

# TODO: Replace this path with the actual file path after cloning the repo
# Example: df = pd.read_csv('datasets/googleplaystore.csv')

try:
    df = pd.read_csv('googleplaystore.csv')  # Adjust path as needed
    print("Loaded googleplaystore.csv successfully")
except FileNotFoundError:
    print("googleplaystore.csv not found. Please update the file path.")
    # Create a small dummy DataFrame to demonstrate the analysis steps
    df = pd.DataFrame({
        'Category': ['GAME', 'GAME', 'TOOLS', 'TOOLS', 'EDUCATION', 'EDUCATION'],
        'Rating':   [4.5, 4.7, 4.0, 3.8, 4.8, 4.6]
    })

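# Basic cleaning: coerce Rating to numeric, drop rows with missing Category or Rating,
# and keep only ratings in the valid 1–5 range
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
df_clean = df.dropna(subset=['Category', 'Rating'])
df_clean = df_clean[df_clean['Rating'].between(1, 5)]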

# Group by category and compute mean rating
category_ratings = df_clean.groupby('Category')['Rating'].mean().sort_values(ascending=False)

print("Average rating by category (descending):\n", category_ratings.head(10))
print("\nLowest-rated categories:\n", category_ratings.tail(10))


### Interpretation (example reasoning)
- Categories with **higher average ratings** (e.g., EDUCATION, HEALTH & FITNESS, PRODUCTIVITY) often provide **clear utility or learning value**, so satisfied users tend to rate them highly.  
- Categories with **lower average ratings** (e.g., some GAME subcategories or COMMUNICATION) may suffer from **bugs, ads, battery drain, or network issues**, leading to user frustration.  
- Games often receive **polarized ratings** – some users love them, others dislike ads/in‑app purchases.


## Question 8 – Titanic Dataset
**(a)** Compare the survival rates based on passenger class (`Pclass`). Which class had the highest survival rate, and why do you think that happened?  
**(b)** Analyze how age (`Age`) affected survival. Group passengers into children (Age < 18) and adults (Age ≥ 18). Did children have a better chance of survival?  
Dataset: `https://github.com/MasteriNeuron/datasets.git`  
_(Include your Python code and output in the code box below.)_

### Plan
1. Load the Titanic dataset (e.g., `titanic.csv`).  
2. Clean data (remove or impute missing Age).  
3. For (a): group by `Pclass` and compute survival rate = mean of `Survived`.  
4. For (b): create a new column `AgeGroup` = 'Child' if Age < 18 else 'Adult', then compute survival rates by `AgeGroup`.  
5. Interpret results: historically, higher‑class passengers and children often had better access to lifeboats.


In [None]:
# Titanic analysis – skeleton code
import pandas as pd

# TODO: Replace with actual path, e.g., 'datasets/titanic.csv'
try:
    titanic = pd.read_csv('titanic.csv')
    print("Loaded titanic.csv successfully")
except FileNotFoundError:
    print("titanic.csv not found. Using dummy data for demonstration.")
    titanic = pd.DataFrame({
        'Pclass':   [1, 1, 2, 2, 3, 3],
        'Survived': [1, 0, 1, 0, 0, 1],
        'Age':      [5, 38, 17, 25, 30, 10]
    })

# (a) Survival rate by Pclass, computed on all passengers (Age is not needed here)
survival_by_class = titanic.groupby('Pclass')['Survived'].mean()
print("Survival rate by Pclass:\n", survival_by_class)

# (b) Children vs Adults: drop rows with missing Age for this part only.
# .copy() gives an independent DataFrame, so adding a column does not trigger SettingWithCopyWarning.
titanic_age = titanic.dropna(subset=['Age']).copy()
titanic_age['AgeGroup'] = titanic_age['Age'].apply(lambda x: 'Child' if x < 18 else 'Adult')

survival_by_agegroup = titanic_age.groupby('AgeGroup')['Survived'].mean()
print("\nSurvival rate by AgeGroup:\n", survival_by_agegroup)


### Interpretation (typical result)
- Historically, **1st class** passengers had the highest survival rate, followed by 2nd class, then 3rd class. Reasons include:  
  - First‑class cabins were closer to lifeboats.  
  - Social status and crew prioritization.  
  - Better access to information about the emergency.
- For age groups, **children generally had higher survival rates** than adults due to the "women and children first" policy during evacuation.


## Question 9 – Flight Price Prediction Dataset
**(a)** How do flight prices vary with the days left until departure? Identify any exponential price surges and recommend the best booking window.  
**(b)** Compare prices across airlines for the same route (e.g., Delhi‑Mumbai). Which airlines are consistently cheaper/premium, and why?  
Dataset: `https://github.com/MasteriNeuron/datasets.git`  
_(Include your Python code and output in the code box below.)_

### Plan
- Assume the dataset has columns like `price`, `days_left`, `airline`, `source`, `destination`, etc.  
- (a) Plot `price` vs. `days_left` and look for a pattern (often prices are lower when booked several days/weeks in advance, then surge close to departure).  
- (b) Filter for a specific route (e.g., Delhi → Mumbai) and compare average prices per airline.


In [None]:
# Flight price analysis – skeleton code
import pandas as pd

# TODO: Replace with real path, e.g., 'datasets/flight_price.csv'
try:
    flights = pd.read_csv('flight_price.csv')
    print("Loaded flight_price.csv successfully")
except FileNotFoundError:
    print("flight_price.csv not found. Using dummy data for demonstration.")
    flights = pd.DataFrame({
        'days_left':   [30, 20, 10, 5, 2, 1],
        'price':       [3000, 3200, 4000, 5500, 7000, 9000],
        'airline':     ['AirA', 'AirA', 'AirA', 'AirB', 'AirB', 'AirC'],
        'source':      ['Delhi']*6,
        'destination': ['Mumbai']*6
    })

# (a) Relationship between days_left and price
price_by_days = flights.groupby('days_left')['price'].mean().sort_index()
print("Average price by days_left:\n", price_by_days)

# (b) Compare prices across airlines for Delhi-Mumbai route
route_filter = (flights['source'] == 'Delhi') & (flights['destination'] == 'Mumbai')
route_data = flights[route_filter]

avg_price_by_airline = route_data.groupby('airline')['price'].mean().sort_values()
print("\nAverage price for Delhi-Mumbai by airline:\n", avg_price_by_airline)
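
One rough way to check whether the price surge looks exponential is to fit a straight line to `log(price)` against `days_left`: if prices grow roughly exponentially as departure approaches, the log-prices fall roughly linearly with `days_left`. The sketch below reuses the dummy values from the fallback data above, so the numbers are illustrative only and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Dummy average prices indexed by days_left (same values as the fallback data)
price_by_days = pd.Series(
    [9000, 7000, 5500, 4000, 3200, 3000],
    index=[1, 2, 5, 10, 20, 30],
    dtype=float,
)

days = price_by_days.index.to_numpy(dtype=float)
log_price = np.log(price_by_days.to_numpy())

# Fit log(price) = intercept + slope * days_left
slope, intercept = np.polyfit(days, log_price, deg=1)

# Under this model, each day closer to departure multiplies price by exp(-slope)
daily_change = np.exp(-slope) - 1
print(f"slope={slope:.4f}, approx. price change per day closer: {daily_change:.1%}")
```

A clearly negative slope supports the "prices rise as departure nears" pattern; how well a single exponential fits the real data should be judged from a plot of the residuals or of price against `days_left`.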


### Interpretation (typical pattern)
- **Days left vs. price:**  
  - Prices often **start moderate**, may drop a bit in a sweet spot (e.g., 20–30 days before travel), and then **increase sharply** as the departure date approaches.  
  - This can look almost **exponential** in the last few days.
- **Best booking window:**  
  - In many markets, booking roughly **2–4 weeks in advance** often gives a good trade‑off between price and flexibility (exact window depends on airline and route).
- **Airlines comparison:**  
  - Some airlines position themselves as **low‑cost carriers** (cheaper prices, fewer amenities).  
  - Others are **premium airlines** (higher price, more legroom, free meals, better service).  
  - We identify them by comparing **average route‑wise prices**.


## Question 10 – HR Analytics Dataset
**(a)** What factors most strongly correlate with employee attrition? Use visualizations to show key drivers (e.g., satisfaction, overtime, salary).  
**(b)** Are employees with more projects more likely to leave?  
Dataset: `hr_analytics`  
_(Include your Python code and output in the code box below.)_

### Plan
- Assume the dataset has columns like: `Attrition` (Yes/No or 1/0), `JobSatisfaction`, `MonthlyIncome`, `OverTime`, `NumProjects` (or similar).  
- (a) Compute correlations between numeric features and Attrition (encoded as 0/1). Plot bar charts or boxplots.  
- (b) Group by number of projects and compute attrition rates.


In [None]:
# HR Analytics – skeleton code
import pandas as pd

# TODO: Replace with actual path, e.g., 'hr_analytics.csv'
try:
    hr = pd.read_csv('hr_analytics.csv')
    print("Loaded hr_analytics.csv successfully")
except FileNotFoundError:
    print("hr_analytics.csv not found. Using dummy data for demonstration.")
    hr = pd.DataFrame({
        'Attrition':        [1, 0, 1, 0, 1, 0],  # 1 = left, 0 = stayed
        'JobSatisfaction':  [2, 4, 1, 3, 2, 4],
        'MonthlyIncome':    [3000, 7000, 3500, 8000, 2800, 9000],
        'OverTime':         ['Yes', 'No', 'Yes', 'No', 'Yes', 'No'],
        'NumProjects':      [5, 3, 6, 2, 7, 3]
    })

# Convert Attrition to numeric if it is stored as text ('Yes'/'No')
if not pd.api.types.is_numeric_dtype(hr['Attrition']):
    hr['Attrition'] = hr['Attrition'].map({'Yes': 1, 'No': 0})

# (a) Correlation of numeric features with Attrition
# (text columns such as OverTime are skipped unless they are encoded first)
numeric_cols = hr.select_dtypes(include='number').columns

correlations = hr[numeric_cols].corr()['Attrition'].drop('Attrition').sort_values(ascending=False)
print("Correlation of numeric features with Attrition:\n", correlations)

# (b) Attrition vs number of projects
attrition_by_projects = hr.groupby('NumProjects')['Attrition'].mean()
print("\nAttrition rate by NumProjects:\n", attrition_by_projects)
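
Text columns such as `OverTime` do not appear in a numeric correlation table, so they need to be encoded or grouped separately. The sketch below does both on a dummy frame that repeats the fallback values above; the name `hr_demo` is a hypothetical stand-in for the real data.

```python
import pandas as pd

# Dummy data (same values as the fallback frame above)
hr_demo = pd.DataFrame({
    'Attrition': [1, 0, 1, 0, 1, 0],   # 1 = left, 0 = stayed
    'OverTime':  ['Yes', 'No', 'Yes', 'No', 'Yes', 'No'],
})

# Attrition rate within each OverTime group
attrition_by_overtime = hr_demo.groupby('OverTime')['Attrition'].mean()
print("Attrition rate by OverTime:\n", attrition_by_overtime)

# Encode OverTime as 0/1 so it can be correlated with Attrition
overtime_flag = hr_demo['OverTime'].map({'Yes': 1, 'No': 0})
corr_overtime = overtime_flag.corr(hr_demo['Attrition'])
print("\nCorrelation between OverTime and Attrition:", corr_overtime)
```

If matplotlib is installed, `attrition_by_overtime.plot(kind='bar')` turns the grouped rates into a simple bar chart for part (a).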


### Interpretation (example reasoning)
- Factors often associated with **higher attrition** include:  
  - **Low job satisfaction** (employees who are unhappy leave more).  
  - **OverTime = Yes** (overworked employees may burn out).  
  - **Very low or very high workloads** (too few or too many projects) depending on context.  
  - **Lower salaries** compared to peers in similar roles.
- For part (b), if we see that attrition rate increases with `NumProjects`, we can hypothesize that **excessive workload** is pushing people to leave. HR could respond by rebalancing workloads, hiring more staff, or improving work‑life balance.

---
This completes the detailed answers and example code for **all 10 questions** in a single notebook. Before submission, make sure to:
1. Update file paths for the datasets on your machine.
2. Re‑run the analysis cells so that outputs reflect the **real datasets** from the assignment.
