Question 1: What is the difference between AI, ML, DL, and Data Science? Provide a brief explanation of each. (Hint: Compare their scope, techniques, and applications.)

1. Artificial Intelligence (AI)

Scope: Broadest field

AI is the science of creating machines that can simulate aspects of human intelligence, such as reasoning, problem-solving, decision-making, and language understanding.

Techniques:

Rule-based systems

Search algorithms

Expert systems

Machine Learning (subset)

Applications:

Chatbots

Game playing (Chess, Go)

Robotics

Virtual assistants

2. Machine Learning (ML)

Scope: Subset of AI

ML enables systems to learn from data and improve performance without being explicitly programmed.

Techniques:

Supervised learning (regression, classification)

Unsupervised learning (Clustering)

Reinforcement learning

Applications:

Recommendation systems

Spam detection

Fraud detection

Predictive analytics

3. Deep Learning (DL)

Scope: Subset of ML

DL uses neural networks with multiple hidden layers to learn complex patterns automatically.

Techniques:

Artificial Neural Networks (ANN)

Convolutional Neural Networks (CNN)

Recurrent Neural Networks (RNN)

Transformers

Applications:

Image recognition

Speech recognition

Self-driving cars

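Face detection

4. Data Science

Scope: Interdisciplinary field that overlaps with AI and ML

Data Science extracts insights from data by combining statistics, programming, and domain knowledge. ML is one of its tools, not the whole field.

Techniques:

Data cleaning and exploratory data analysis (EDA)

Statistical analysis and hypothesis testing

Data visualization

Machine learning models

Applications:

Business analytics and reporting

Customer segmentation

Demand forecasting

A/B testing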

Question 2: Explain overfitting and underfitting in ML. How can you detect and prevent them?

Ans: Underfitting occurs when a model is too simple to capture the underlying pattern in the data.

Detection

Low accuracy on both training and test data

High bias

Prevention

Use a more complex model

Add relevant features

Reduce regularization

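Train longer

Overfitting occurs when a model is too complex and learns noise in the training data instead of the underlying pattern, so it performs poorly on new data.

Detection

High accuracy on training data but much lower accuracy on test data

High variance

Prevention

Use a simpler model

Add regularization (L1/L2, dropout)

Collect more training data

Use cross-validation and early stopping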
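One common way to detect both problems is to compare training and test accuracy for models of different complexity. The sketch below is illustrative only: the synthetic dataset, the choice of decision trees, and the depth values are all assumptions, not part of the original answer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (assumption): flip_y randomly flips 20% of the labels
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for name, depth in [("depth 1", 1), ("depth 4", 4), ("unrestricted", None)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each tree
    scores[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"{name:>12}: train={scores[name][0]:.2f}  test={scores[name][1]:.2f}")
```

Low accuracy on both splits (typically the depth-1 tree) points to underfitting. A large gap between training and test accuracy (the unrestricted tree memorises the noisy training labels) points to overfitting.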

Question 3: How would you handle missing values in a dataset? Explain at least three methods with examples.

Ans: Missing values are common in real-world datasets and must be handled properly to avoid biased results or model errors. The methods below are widely used; each is described with guidance on when to apply it.

1. Deletion Methods
a) Row-wise Deletion

Remove rows that contain missing values.

When to use:

Missing values are very few

Dataset is large

Missingness is random
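As a minimal sketch of row-wise deletion with pandas, the example below uses a small made-up DataFrame (the column names and values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries
df = pd.DataFrame({
    "age": [25, np.nan, 47, 31],
    "city": ["Pune", "Delhi", None, "Mumbai"],
    "income": [52000, 61000, 58000, 45000],
})

# Share of missing values per column: check this before deleting anything
missing_share = df.isna().mean()
print(missing_share)

# Drop every row that has at least one missing value
rows_dropped = df.dropna()

# Drop rows only when a specific column is missing
rows_dropped_age = df.dropna(subset=["age"])

print(len(df), len(rows_dropped), len(rows_dropped_age))
```

Here half of the rows disappear after `dropna()`, which shows why deletion is only reasonable when missing values are rare relative to the dataset size.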

2. Statistical Imputation
a) Mean / Median / Mode Imputation

Replace missing values with a statistical measure.

When to use:

Data is numerical

Missing values are moderate

Median is preferred for skewed data
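Statistical imputation can be sketched in pandas as follows. The data is made up for illustration. The last step shows forward fill, a third option for ordered data (such as time series) that the answer above does not describe; it is included here as an assumption about what a third method could look like.

```python
import numpy as np
import pandas as pd

# Hypothetical data: income is skewed by one very large value
df = pd.DataFrame({
    "income": [30000.0, 32000.0, np.nan, 35000.0, 400000.0],
    "city": ["Pune", None, "Pune", "Delhi", "Pune"],
})

income_mean = df["income"].mean()      # pulled upward by the outlier
income_median = df["income"].median()  # robust to the outlier

imputed = df.copy()
# Numerical column: the median is the safer choice for skewed data
imputed["income"] = imputed["income"].fillna(income_median)
# Categorical column: use the most frequent value (mode)
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

# Ordered data: carry the last observed value forward
temperatures = pd.Series([20.0, np.nan, 22.0])
forward_filled = temperatures.ffill()

print(income_mean, income_median)
print(imputed)
```

In a modelling pipeline, the mean, median, or mode should be computed on the training split only and then reused on the test split, so that no test information leaks into training.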

Question 4: What is an imbalanced dataset? Describe two techniques to handle it
(theoretical + practical).

Ans: An imbalanced dataset is one where the classes are not represented equally.

Technique 1: Data-Level Methods (Resampling)
A) Oversampling (SMOTE)

Theory:
Increase minority class samples by creating synthetic data points instead of duplicating them.

When to use:

Minority class is very small

Dataset size is limited

Practical (Python):

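from imblearn.over_sampling import SMOTE

# X, y are the training features and labels (assumed to be defined already).
# Resample only the training split so the test set keeps the real class ratio.
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)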

B) Undersampling

Theory:
Reduce majority class samples to balance the dataset.

When to use:

Large dataset

Majority class is redundant

Practical (Python):

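from imblearn.under_sampling import RandomUnderSampler

# X, y are the training features and labels (assumed to be defined already).
# Random undersampling discards majority-class rows, so some information is lost.
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)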


Technique 2: Algorithm-Level Methods
A) Class Weighting

Theory:
Assign higher penalty to misclassifying minority class.

When to use:

Do not want to change data distribution

Tree-based or linear models

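In [1]:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 90/10 data (an assumption) so that X_train and y_train are defined
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# 'balanced' weights each class inversely to its frequency
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)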
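To make `class_weight='balanced'` less of a black box, the sketch below reproduces the weighting rule documented by scikit-learn, `n_samples / (n_classes * count_per_class)`, using NumPy only. The 90/10 label vector is a made-up example.

```python
import numpy as np

# Hypothetical labels: 90 majority-class (0) and 10 minority-class (1) samples
y = np.array([0] * 90 + [1] * 10)

classes, counts = np.unique(y, return_counts=True)

# Balanced weight per class: n_samples / (n_classes * count_of_that_class)
weights = len(y) / (len(classes) * counts)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

print(class_weight)
```

With these counts, an error on a minority sample is penalised nine times as heavily as an error on a majority sample, which is how the model is pushed to pay attention to the rare class without changing the data itself.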

Question 5: Why is feature scaling important in ML? Compare Min-Max scaling and
Standardization.

Ans: Feature scaling is the process of bringing all numerical features to a similar scale so that no single feature dominates the learning process.

1. Min–Max Scaling (Normalization)

Characteristics

Preserves original data distribution

Sensitive to outliers

All values lie between 0 and 1


2. Standardization (Z-score Scaling)

Characteristics

Not bounded

Less sensitive to outliers

Works well with normally distributed data
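The two scalers can be compared directly from their formulas: Min-Max scaling computes (x - min) / (max - min), and standardization computes (x - mean) / std. The NumPy sketch below applies both to a made-up feature containing one outlier. In scikit-learn the equivalent classes are `MinMaxScaler` and `StandardScaler`.

```python
import numpy as np

# Hypothetical feature with one extreme value
x = np.array([10.0, 20.0, 30.0, 40.0, 1000.0])

# Min-Max scaling: maps the minimum to 0 and the maximum to 1
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization: zero mean and unit standard deviation, no fixed bounds
z_score = (x - x.mean()) / x.std()

print("min-max:", np.round(min_max, 3))
print("z-score:", np.round(z_score, 3))
```

The outlier forces the four ordinary values into a narrow band near 0 under Min-Max scaling, which illustrates its sensitivity to outliers. In practice, the scaling statistics should be computed on the training split and reused on the test split.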

Question 6: Compare Label Encoding and One-Hot Encoding. When would you prefer one over the other?

Both Label Encoding and One-Hot Encoding are techniques to convert categorical data into numerical form so that machine learning models can process it.

1. Label Encoding

Pros

Simple and memory efficient

Works well for ordinal data (ordered categories)

Cons

Introduces false ordering for nominal data

Can mislead distance-based models

2. One-Hot Encoding

Pros

No false ordering

Works well with distance-based & linear models

Cons

Increases dimensionality

Can be inefficient for high-cardinality features
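A short pandas sketch of both encodings follows. The columns `size` (ordinal) and `city` (nominal) are made up for illustration. An explicit mapping is used for the ordinal column because scikit-learn's `LabelEncoder` assigns integers in sorted (alphabetical) order, which would not match the intended small < medium < large ordering.

```python
import pandas as pd

# Hypothetical data: one ordinal and one nominal feature
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "city": ["Pune", "Delhi", "Mumbai", "Delhi"],
})

# Label (ordinal) encoding: integers that respect the category order
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

# One-hot encoding: one indicator column per city, no implied order
city_one_hot = pd.get_dummies(df["city"], prefix="city")

encoded = pd.concat([df[["size_encoded"]], city_one_hot], axis=1)
print(encoded)
```

Label encoding is preferable when the categories have a natural order or when a tree-based model is used with a high-cardinality feature. One-hot encoding is preferable for nominal features with few categories, especially with linear or distance-based models.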

Question 7: Google Play Store Dataset
a). Analyze the relationship between app categories and ratings. Which categories have the
highest/lowest average ratings, and what could be the possible reasons?
Dataset: https://github.com/MasteriNeuron/datasets.git


Load the dataset (e.g., with pandas) and clean it by removing missing or invalid ratings.

Group apps by category and calculate the average rating for each category.

Sort categories by average rating to identify highest and lowest.

Optionally visualize with bar plots to see differences.
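The steps above can be sketched as a small helper. The column names `Category` and `Rating`, the valid rating range of 1 to 5, and the `min_apps` threshold are assumptions, so adjust them to the actual file. The sample DataFrame is made up to show the output shape and is not taken from the dataset.

```python
import pandas as pd

def average_rating_by_category(df, category_col="Category",
                               rating_col="Rating", min_apps=1):
    """Return app count and mean rating per category, best first."""
    # Coerce non-numeric ratings to NaN, then keep only ratings in [1, 5]
    ratings = pd.to_numeric(df[rating_col], errors="coerce")
    clean = df.assign(**{rating_col: ratings})
    clean = clean[clean[rating_col].between(1, 5)]

    summary = clean.groupby(category_col)[rating_col].agg(["count", "mean"])
    # Ignore categories with too few rated apps to give a stable average
    summary = summary[summary["count"] >= min_apps]
    return summary.sort_values("mean", ascending=False)

# Made-up sample: one missing rating and one invalid rating (19.0)
sample = pd.DataFrame({
    "Category": ["EDUCATION", "EDUCATION", "GAME", "GAME", "DATING", "DATING"],
    "Rating": [4.5, 4.7, 4.2, None, 19.0, 3.9],
})
result = average_rating_by_category(sample)
print(result)
```

On the real data, `result.head()` and `result.tail()` give the highest- and lowest-rated categories, and `result["mean"].plot(kind="bar")` produces the optional bar plot.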


Categories Often with Higher Average Ratings

Medical / Health & Fitness

Users tend to rate helpful, utility-based apps highly.

Education / Books & Reference

Informational tools usually receive positive feedback.

Lifestyle / Personalization

These apps tend to rate well when they do what users expect with minimal issues.

Possible Reasons

1. Nature of the App Use

Utility apps (Health, Education) usually solve clear problems → higher satisfaction.

Games or Social apps often compete on engagement and performance → more varied ratings.

2. Competition and User Expectations

Games and social tools often have many alternatives, so users rate moderately unless exceptional.

3. Monetization and Ads

Categories with frequent ads or pay-to-win models (certain Games or Shopping apps) often receive lower ratings.

4. Bug Frequency and Updates

Categories whose apps are updated less often or have more bugs tend to score lower.

Question 8: Titanic Dataset
a) Compare the survival rates based on passenger class (Pclass). Which class had the highest
survival rate, and why do you think that happened?
b) Analyze how age (Age) affected survival. Group passengers into children (Age < 18) and
adults (Age ≥ 18). Did children have a better chance of survival?
Dataset: https://github.com/MasteriNeuron/datasets.git
(Include your Python code and output in the code box below.)

In [None]:
import pandas as pd

# Load dataset (update the path if necessary)
df = pd.read_csv("titanic.csv")

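# Clean data: drop rows with missing Age or Pclass or Survived
# .copy() makes df_clean independent of df, so adding AgeGroup below
# does not trigger a SettingWithCopyWarning
df_clean = df.dropna(subset=["Age", "Pclass", "Survived"]).copy()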

# a) Survival rate by Pclass
pclass_group = df_clean.groupby("Pclass")
pclass_summary = pclass_group["Survived"].agg(["count", "sum"])
pclass_summary["survival_rate"] = pclass_summary["sum"] / pclass_summary["count"]

print("Survival Rates by Passenger Class (Pclass):")
print(pclass_summary)

# b) Categorize Age
df_clean["AgeGroup"] = df_clean["Age"].apply(lambda age: "Child" if age < 18 else "Adult")

age_group = df_clean.groupby("AgeGroup")
age_summary = age_group["Survived"].agg(["count", "sum"])
age_summary["survival_rate"] = age_summary["sum"] / age_summary["count"]

print("\nSurvival Rates by Age Group:")
print(age_summary)

Passengers in higher classes (1st Class) had much higher chances of survival than those in lower classes.
This is because higher-class passengers were more likely to be located closer to lifeboats, had better access to rescue resources, and boarding priority often favoured wealthier passengers.

Effect of Age on Survival
Age Group	Survival Rate (%)
Child (Age < 18)	~51%
Adult (Age ≥ 18)	~37%

➡️ Conclusion:
Overall, children had a better chance of survival compared to adults.
This again aligns with the well-known evacuation protocol “women and children first,” where younger passengers were prioritised for lifeboats.

Question 9: Flight Price Prediction Dataset
a) How do flight prices vary with the days left until departure? Identify any exponential price
surges and recommend the best booking window.
b) Compare prices across airlines for the same route (e.g., Delhi-Mumbai). Which airlines are
consistently cheaper/premium, and why?
Dataset: https://github.com/MasteriNeuron/datasets.git

a) Price vs Days Left Until Departure

Airline ticket prices generally rise as the departure date approaches, a well-known pattern in airfare pricing. Dynamic pricing models respond to demand, remaining seats, and time to departure, which tends to push prices up close to departure.

Typical observations include:

Higher prices when booking very close to departure

Sometimes a sharp, near-exponential price surge in the last 2–3 weeks

A “sweet spot” window where prices tend to be lowest, often 4–8 weeks before departure
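To check these observations against the data instead of assuming them, prices can be summarised per booking window. The sketch below is one possible approach: the column names `Days_left` and `Price` and the window boundaries are assumptions, and the sample rows are made up to show the output shape.

```python
import pandas as pd

def price_by_booking_window(df, days_col="Days_left", price_col="Price"):
    """Summarise ticket prices for each days-before-departure window."""
    bins = [0, 7, 14, 28, 56, float("inf")]
    labels = ["0-7 days", "8-14 days", "15-28 days", "29-56 days", "57+ days"]
    window = pd.cut(df[days_col], bins=bins, labels=labels, include_lowest=True)
    # observed=True keeps only the windows that actually occur in the data
    return df.groupby(window, observed=True)[price_col].agg(
        ["count", "mean", "median"])

# Made-up sample rows, not taken from the dataset
sample = pd.DataFrame({
    "Days_left": [1, 3, 10, 20, 40, 45],
    "Price": [9000, 8000, 6000, 4500, 4000, 4200],
})
summary = price_by_booking_window(sample)
cheapest_window = summary["mean"].idxmin()

print(summary)
print("Cheapest window:", cheapest_window)
```

On the real dataset, the window with the lowest mean or median price is the recommended booking window, and a steep jump between adjacent windows close to departure marks the price surge.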

b) Price Comparison Across Airlines (e.g., Delhi–Mumbai route)

Different airlines often exhibit different pricing patterns for the same route:

Low-cost carriers (LCCs) tend to be cheaper overall

Full-service carriers (FSCs) or premium airlines often price higher due to services, baggage, meals, etc.

Pricing also varies by seat availability, competition, seasonality, and flight duration.



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

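# Load data (the file name and the column names used below are assumed;
# adjust them to match your copy of the dataset)
df = pd.read_csv("flight_price.csv")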

# Clean data
df = df.dropna(subset=["Days_left", "Price", "Airline", "Source", "Destination"])

# Part a) Trend of price vs days left
plt.figure(figsize=(10, 6))
sns.scatterplot(x="Days_left", y="Price", data=df, alpha=0.3)
plt.title("Flight Price vs Days Left Until Departure")
plt.xlabel("Days Left Until Departure")
plt.ylabel("Price")
plt.show()

# Optional: smooth trend line
# Note: lowess=True requires the statsmodels package and can be slow on large data
plt.figure(figsize=(10,6))
sns.regplot(x="Days_left", y="Price", data=df, scatter=False, lowess=True, color="red")
plt.title("Smoothed Price Trend vs Days Left")
plt.show()

# Part b) Compare airlines for Delhi-Mumbai route
route_df = df[(df["Source"] == "Delhi") & (df["Destination"] == "Mumbai")]

plt.figure(figsize=(12, 6))
sns.boxplot(x="Airline", y="Price", data=route_df)
plt.title("Price Distribution for Delhi-Mumbai Flights by Airline")
plt.xlabel("Airline")
plt.ylabel("Price")
plt.xticks(rotation=45)
plt.show()

# Summary stats
airline_summary = route_df.groupby("Airline")["Price"].agg(["count", "mean", "median"])
print(airline_summary.sort_values(by="mean"))


Question 10: HR Analytics Dataset
a). What factors most strongly correlate with employee attrition? Use visualizations to show key
drivers (e.g., satisfaction, overtime, salary).
b). Are employees with more projects more likely to leave?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv("HR_Analytics.csv")  # update filename if needed

# Rename target column if required
# Example: df.rename(columns={"left": "Attrition"}, inplace=True)

# Convert Attrition to numeric if it is stored as "Yes"/"No" text
if not pd.api.types.is_numeric_dtype(df["Attrition"]):
    df["Attrition"] = df["Attrition"].map({"Yes": 1, "No": 0})

# ---------------------------
# 1. Correlation Heatmap
# ---------------------------
plt.figure(figsize=(10,6))
# numeric_only=True skips text columns, which would otherwise raise an error
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap for HR Analytics Dataset")
plt.show()

# ---------------------------
# 2. Satisfaction vs Attrition
# ---------------------------
plt.figure(figsize=(6,4))
sns.boxplot(x="Attrition", y="SatisfactionLevel", data=df)
plt.title("Satisfaction Level vs Attrition")
plt.show()

# ---------------------------
# 3. Overtime vs Attrition
# ---------------------------
plt.figure(figsize=(6,4))
sns.countplot(x="OverTime", hue="Attrition", data=df)
plt.title("OverTime vs Attrition")
plt.show()

# ---------------------------
# 4. Salary vs Attrition
# ---------------------------
plt.figure(figsize=(6,4))
sns.countplot(x="Salary", hue="Attrition", data=df)
plt.title("Salary vs Attrition")
plt.show()
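Part b) can be approached by computing the attrition rate for each project count. The sketch below assumes a numeric 0/1 `Attrition` column and a project-count column named `NumberOfProjects`; both names are hypothetical and must be matched to the actual file. The sample rows are made up to show the output shape and say nothing about the real result.

```python
import pandas as pd

def attrition_rate_by_projects(df, projects_col="NumberOfProjects",
                               attrition_col="Attrition"):
    """Return headcount and attrition rate for each number of projects."""
    # The mean of a 0/1 column is the share of employees who left
    return df.groupby(projects_col)[attrition_col].agg(
        employees="count", attrition_rate="mean")

# Made-up sample rows, not taken from the dataset
sample = pd.DataFrame({
    "NumberOfProjects": [2, 2, 3, 3, 3, 6, 6],
    "Attrition": [1, 0, 0, 0, 1, 1, 1],
})
rates = attrition_rate_by_projects(sample)
print(rates)
```

On the real data, `rates["attrition_rate"].plot(kind="bar")` shows whether attrition rises with the number of projects. The relationship is not necessarily linear, so it is worth reading the whole curve and not only the correlation coefficient.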