Q 1. What is the difference between AI, ML, DL, and Data Science? Provide a brief explanation of each.

* Artificial Intelligence (AI): This is the broadest concept, referring to the simulation of human intelligence in machines that are programmed to think and learn like humans. Its scope is vast, encompassing any technique that enables computers to mimic human intelligence. Techniques include search algorithms, logic, and knowledge representation. Applications are everywhere, from virtual assistants and self-driving cars to medical diagnosis and game playing.

* Machine Learning (ML): A subset of AI that focuses on enabling systems to learn from data and improve over time without being explicitly programmed. Its scope is specifically about learning from data. Techniques include supervised learning (e.g., regression, classification), unsupervised learning (e.g., clustering), and reinforcement learning. Applications include spam detection, recommendation systems, and fraud detection.

* Deep Learning (DL): A subfield of Machine Learning that uses artificial neural networks with multiple layers (hence "deep") to model complex patterns in data. Its scope is centered on using deep neural networks for learning. Techniques involve various neural network architectures like Convolutional Neural Networks (CNNs) for images and Recurrent Neural Networks (RNNs) for sequential data. Applications include image and speech recognition, natural language processing, and drug discovery.

* Data Science: An interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. Its scope is broader than just ML and DL, involving the entire process of data collection, cleaning, analysis, visualization, and interpretation. Techniques draw from statistics, mathematics, computer science, and domain expertise. Applications are widespread across industries for decision-making, trend analysis, and predictive modeling.

In essence:

* AI is the overarching goal of creating intelligent machines.
* ML is one way to achieve AI by learning from data.
* DL is a specific type of ML that uses deep neural networks.
* Data Science is the field that uses various techniques, including ML and DL, to extract insights from data.

Q 2. Explain overfitting and underfitting in ML. How can you detect and prevent them?

* Overfitting: This occurs when a model learns the training data too well, including the noise and outliers. As a result, the model performs very well on the training data but poorly on unseen or new data.

  * Detection: High accuracy on training data, but significantly lower accuracy on validation or test data.
  * Prevention:
    * More data: Increasing the amount of training data can help the model generalize better.
    * Feature selection: Removing irrelevant or redundant features can reduce complexity.
    * Regularization: Techniques like L1 and L2 regularization add a penalty to the model's complexity, discouraging large coefficients.
    * Cross-validation: Using techniques like k-fold cross-validation helps evaluate the model's performance on different subsets of the data, providing a more reliable estimate of its generalization ability.
    * Simplifying the model: Using a simpler model with fewer parameters can reduce the risk of overfitting.
    * Dropout (in neural networks): Randomly dropping out neurons during training prevents the network from becoming too reliant on specific connections.

* Underfitting: This occurs when a model is too simple to capture the underlying patterns in the training data. It performs poorly on both the training data and unseen data.

  * Detection: Low accuracy on both training and validation/test data.
  * Prevention:
    * More complex model: Using a more complex model with more parameters or layers (e.g., a higher degree polynomial for regression, or a deeper neural network).
    * More features: Adding more relevant features to the dataset can help the model capture more patterns.
    * Reducing regularization: If regularization was applied, reducing its strength can allow the model to fit the data more closely.
    * Increasing training time: For iterative algorithms, training for more epochs might be necessary (though be careful not to overfit).
* Bias-Variance Tradeoff:

  * Bias: The error introduced by approximating a real-world problem, which may be complex, by a simplified model. High bias leads to underfitting.
  * Variance: The sensitivity of the model to the specific training data. High variance leads to overfitting.
  * The bias-variance tradeoff is a fundamental concept in ML. We aim to find a balance between bias and variance to achieve good generalization performance. Increasing model complexity typically decreases bias but increases variance, and vice versa. Regularization and cross-validation are techniques used to manage this tradeoff.
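
The detection rule above (compare training error with validation error) can be illustrated with a small experiment. This is a minimal sketch on synthetic data, assuming a noisy sine curve and polynomial models of degree 1, 3, and 12; the data, the split, and the degrees are illustrative choices, not part of the original question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): noisy samples of a sine curve
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)

# Hold out the last third of the points for validation
x_train, y_train = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]

def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

results = {}
for degree in (1, 3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))
    train_mse, val_mse = results[degree]
    print(f"degree={degree:>2}  train MSE={train_mse:.3f}  val MSE={val_mse:.3f}")
```

The straight line (degree 1) should show high error on both splits, which is the underfitting signature. The degree-12 fit has the lowest training error, and a validation error that is clearly worse than its training error would indicate overfitting.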

Q 3. How would you handle missing values in a dataset? Explain at least three methods with examples.

Handling missing values is a crucial step in data preprocessing. Here are three common methods:

1. Deletion:

* Explanation: This involves removing rows or columns that contain missing values.
* When to use: This method is suitable when the percentage of missing data is small and the missingness is not concentrated in specific rows or columns that are important for the analysis.
* Examples:
  * Deleting rows: If only a few rows have missing values in a large dataset, you can drop those rows using pandas' dropna() function.
  * Deleting columns: If a column has a very high percentage of missing values and is not critical for the analysis, you can drop the entire column.
* Caveats: Deletion can lead to a significant loss of data, which can be problematic if the dataset is small or the missingness is not random.

2. Mean/Median/Mode Imputation:

* Explanation: This involves replacing missing values with the mean, median, or mode of the non-missing values in the respective column.
* When to use: This method is simple and quick. Mean imputation is suitable for numerical data with a normal distribution, while median imputation is more robust to outliers. Mode imputation is used for categorical data.
* Examples:
  * Mean imputation: Replace missing age values with the average age of the non-missing entries.
  * Median imputation: Replace missing income values with the median income to be less affected by extreme values.
  * Mode imputation: Replace missing city values with the most frequent city in the dataset.
* Caveats: Imputation with a single value can distort the original distribution of the data and reduce variance. It also doesn't account for the relationships between variables.

3. Predictive Modeling (e.g., using K-Nearest Neighbors or Regression):

* Explanation: This method involves building a model to predict the missing values based on the other variables in the dataset.
* When to use: This is a more sophisticated method that can capture relationships between variables and provide more accurate imputations, especially when the missingness depends on other observed variables.
* Examples:
  * K-Nearest Neighbors (KNN) imputation: For a missing age value, find the k-nearest neighbors (based on other features) of the data point with the missing value and impute the age based on the ages of those neighbors (e.g., average age of neighbors).
  * Regression imputation: If you have missing values in a numerical variable, you can build a regression model using other variables as predictors to estimate the missing values.
* Caveats: This method is more computationally expensive than simpler methods. The accuracy of the imputation depends on the quality of the model and the relationships between variables.

The choice of method for handling missing values depends on the nature of the data, the amount of missingness, and the goals of the analysis. It's often a good practice to explore the patterns of missingness before deciding on a method.
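
The three methods can be sketched side by side on a toy table. The DataFrame below, its column names, and `n_neighbors=2` are hypothetical choices for illustration; `KNNImputer` comes from scikit-learn and works on numeric columns only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data; column names and values are illustrative only
df = pd.DataFrame({
    "age":    [22.0, 35.0, np.nan, 41.0, 29.0, np.nan],
    "income": [30.0, 52.0, 48.0, np.nan, 39.0, 61.0],
    "city":   ["Pune", "Delhi", None, "Delhi", "Pune", "Delhi"],
})

# 1. Deletion: keep only rows with no missing values
dropped = df.dropna()

# 2. Simple imputation: median for numeric columns, mode for the categorical one
simple = df.copy()
simple["age"] = simple["age"].fillna(simple["age"].median())
simple["income"] = simple["income"].fillna(simple["income"].median())
simple["city"] = simple["city"].fillna(simple["city"].mode()[0])

# 3. KNN imputation: fill each gap from the 2 most similar rows (numeric columns)
numeric_cols = ["age", "income"]
knn = KNNImputer(n_neighbors=2)
knn_filled = pd.DataFrame(knn.fit_transform(df[numeric_cols]), columns=numeric_cols)

print(f"rows kept after deletion: {len(dropped)} of {len(df)}")
print(simple)
print(knn_filled)
```

Note how deletion discards half of this small table, which is the data-loss caveat mentioned under method 1.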

Q 4. What is an imbalanced dataset? Describe two techniques to handle it (theoretical + practical).

An imbalanced dataset is one where the distribution of the target variable's classes is not equal. This means that some classes have a significantly larger number of instances than others. This can be a problem for machine learning models, as they may become biased towards the majority class and perform poorly on the minority class.

Here are two techniques to handle imbalanced datasets:
 1. Resampling Techniques: These techniques change the class distribution of the training data to make the classes more balanced.
   * Oversampling: This increases the number of instances in the minority class.
     * Theoretical: You can randomly duplicate instances from the minority class or use more sophisticated methods like SMOTE (Synthetic Minority Over-sampling Technique). SMOTE creates synthetic instances of the minority class by interpolating between existing minority class instances.
   * Undersampling: This reduces the number of instances in the majority class, for example by randomly discarding majority-class rows until the classes are balanced.
     * Practical Example (random under-sampling with imbalanced-learn's RandomUnderSampler):

In [None]:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Create a sample imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2,
weights=[0.9, 0.1], n_informative=3, n_redundant=1, flip_y=0,
n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)

print(f"Original dataset shape: {Counter(y)}")

# Apply Random Under-sampling
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)

print(f"Resampled dataset shape: {Counter(y_res)}")

Original dataset shape: Counter({np.int64(0): 900, np.int64(1): 100})
Resampled dataset shape: Counter({np.int64(0): 100, np.int64(1): 100})
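
The SMOTE idea described in the theoretical note (interpolating between existing minority-class instances) can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation; the function name `smote_like_samples` and its parameters are hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and one of their k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)

    # Pairwise Euclidean distances within the minority class
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)      # randomly chosen base samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))               # interpolation factor in [0, 1)
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Four hypothetical minority-class points in two dimensions
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote_like_samples(X_min, n_new=6)
print(synthetic)
```

Every synthetic point lies on a line segment between two real minority points. In practice, `SMOTE` from `imblearn.over_sampling` exposes the same `fit_resample(X, y)` interface as the `RandomUnderSampler` used above.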


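 2. Class Weighting: Instead of changing the data, this technique makes errors on the minority class more costly during training.
   * Theoretical: Each class receives a weight inversely proportional to its frequency, so misclassifying a minority instance contributes more to the loss.
   * Practical Example (logistic regression with and without class_weight='balanced'):

In [None]: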
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report

# Create a sample imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2,
weights=[0.9, 0.1], n_informative=3, n_redundant=1, flip_y=0,
n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model without class weights
model_no_weights = LogisticRegression()
model_no_weights.fit(X_train, y_train)
y_pred_no_weights = model_no_weights.predict(X_test)
print("Classification Report (without class weights):")
print(classification_report(y_test, y_pred_no_weights))

# Train a logistic regression model with class weights
# 'balanced' automatically adjusts weights inversely proportional to class frequencies
model_with_weights = LogisticRegression(class_weight='balanced')
model_with_weights.fit(X_train, y_train)
y_pred_with_weights = model_with_weights.predict(X_test)
print("\nClassification Report (with class weights):")
print(classification_report(y_test, y_pred_with_weights))

Classification Report (without class weights):
              precision    recall  f1-score   support

           0       1.00      0.99      1.00       181
           1       0.95      1.00      0.97        19

    accuracy                           0.99       200
   macro avg       0.97      1.00      0.99       200
weighted avg       1.00      0.99      1.00       200


Classification Report (with class weights):
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       181
           1       0.90      1.00      0.95        19

    accuracy                           0.99       200
   macro avg       0.95      0.99      0.97       200
weighted avg       0.99      0.99      0.99       200



Q 5. Why is feature scaling important in ML? Compare Min-Max scaling and Standardization.

Feature scaling is a crucial preprocessing step in Machine Learning, especially for algorithms that are sensitive to the scale of the input features. Here's why it's important and a comparison of two common methods: Min-Max scaling and Standardization.

Why is Feature Scaling Important?

Many machine learning algorithms, particularly those that rely on distance calculations or gradient descent, are significantly affected by the scale of the features.

* Distance-based algorithms (e.g., K-Nearest Neighbors (KNN), Support Vector Machines (SVM) with RBF kernel): These algorithms calculate distances between data points. If features have different scales, features with larger values will dominate the distance calculation, effectively ignoring features with smaller values. Scaling ensures that all features contribute equally to the distance metric.
* Gradient Descent based algorithms (e.g., Linear Regression, Logistic Regression, Neural Networks): Gradient descent algorithms find the minimum of a cost function by iteratively updating model parameters. If features are not scaled, the cost function will have elongated or stretched contours, making the optimization process slower and potentially causing the algorithm to oscillate and converge slowly or not converge at all. Scaling makes the contours more spherical, leading to faster and more stable convergence.
* Regularization techniques (e.g., L1 and L2 regularization): These techniques add a penalty to the magnitude of the model coefficients. If features are not scaled, the penalty will disproportionately affect coefficients associated with features that have larger values, regardless of their actual importance. Scaling ensures that the regularization penalty is applied fairly to all coefficients.

Comparison of Min-Max Scaling and Standardization:

Here's a comparison of two popular feature scaling techniques:

1. Min-Max Scaling (Normalization):
 * Explanation: This technique scales features to a fixed range, usually between 0 and 1.
 * Pros:
   * Easy to understand and implement.
   * Scales data to a specific range, which can be useful for algorithms that require inputs in a certain range (e.g., neural networks with sigmoid activation functions).
 * Cons:
   * Sensitive to outliers: Outliers can significantly affect the minimum and maximum values, leading to a distorted scaled distribution.
   * Compresses the bulk of the values: The transformation is linear, so relative distances are preserved, but a few extreme values can squeeze most of the data into a small part of the target range, making differences between typical values hard to distinguish.

2. Standardization (Z-score normalization):
  * Explanation: This technique scales features to have a mean of 0 and a standard deviation of 1.
  * Pros:
    * Less affected by outliers: The mean and standard deviation are still influenced by extreme values, but a single outlier distorts them less than it distorts the minimum and maximum.
    * Preserves the shape of the original distribution: Standardization centers the data around 0 but does not change the shape of the distribution.
  * Cons:
     * The scaled values do not have a fixed range, which might be a concern for some algorithms.

Which one to use?

The choice between Min-Max scaling and Standardization depends on the specific algorithm and the nature of the data:

* Standardization is generally preferred for algorithms that assume a Gaussian distribution or are sensitive to outliers (e.g., Logistic Regression, Linear Regression, SVM, K-Means).
* Min-Max scaling is often used when you need to scale data to a specific range (e.g., for image processing or when using neural networks with certain activation functions).

It's often a good practice to try both and evaluate which one performs better for your specific task.
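
Both transformations are short formulas, so a manual NumPy sketch makes the comparison concrete. The feature values below are hypothetical and include one deliberate outlier.

```python
import numpy as np

# Hypothetical feature with one outlier (1000)
x = np.array([10.0, 20.0, 30.0, 40.0, 1000.0])

# Min-Max scaling: (x - min) / (max - min), result lies in [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, result has mean 0 and standard deviation 1
x_standard = (x - x.mean()) / x.std()

print("min-max:     ", np.round(x_minmax, 3))
print("standardized:", np.round(x_standard, 3))
```

With the outlier present, the four ordinary values end up between 0 and roughly 0.03 after Min-Max scaling, which illustrates the outlier sensitivity listed under its cons. In scikit-learn, `MinMaxScaler` and `StandardScaler` from `sklearn.preprocessing` apply the same formulas column by column.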

Q 6. Compare Label Encoding and One-Hot Encoding. When would you prefer one over the other?

When dealing with categorical variables in machine learning, we need to convert them into a numerical format that models can understand. Two common techniques for this are Label Encoding and One-Hot Encoding.

Label Encoding:

* Explanation: Label Encoding assigns a unique integer to each category in a column. For example, if you have a "color" column with categories "Red", "Green", and "Blue", Label Encoding might assign 0 to "Red", 1 to "Green", and 2 to "Blue".
* How it works: It simply maps each unique category to a numerical label.
* When to use: Label Encoding is suitable for ordinal categorical variables, where there is an inherent order or ranking among the categories (e.g., "Small", "Medium", "Large"). In this case, the numerical order reflects the categorical order.
* Caveats: If you use Label Encoding on nominal categorical variables (where there is no inherent order, like colors), the model might incorrectly interpret the numerical order as a ranking, which can lead to biased or incorrect results.

One-Hot Encoding:

* Explanation: One-Hot Encoding creates new binary columns for each category in a categorical feature. For example, for the "color" column with "Red", "Green", and "Blue", it would create three new columns: "Color_Red", "Color_Green", and "Color_Blue". For a data point where the color is "Red", the "Color_Red" column would have a value of 1, and the other color columns would have a value of 0.
* How it works: It represents each category as a binary vector where only one element is 1 (indicating the presence of that category) and the rest are 0.
* When to use: One-Hot Encoding is generally preferred for nominal categorical variables, where there is no inherent order among the categories. It avoids creating the false sense of order that Label Encoding introduces for nominal data.
* Caveats: One-Hot Encoding can lead to a significant increase in the number of features, especially if a categorical variable has many unique categories. This can result in a sparse dataset and the "curse of dimensionality," which can impact model performance and increase computational cost. Additionally, it can introduce multicollinearity if all dummy variables are kept, which can be an issue for some models (this is often handled by dropping one of the dummy variables).

Summary of Preference:

* Use Label Encoding for ordinal categorical variables where the numerical order reflects a meaningful ranking.
* Use One-Hot Encoding for nominal categorical variables to avoid implying an incorrect order.
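
A minimal pandas sketch of both encodings, assuming a hypothetical table with one ordinal column (`size`) and one nominal column (`color`). The explicit mapping is used for the ordinal column because scikit-learn's `LabelEncoder` assigns integers in sorted (alphabetical) order, which would not respect Small < Medium < Large.

```python
import pandas as pd

df = pd.DataFrame({
    "size":  ["Small", "Large", "Medium", "Small"],   # ordinal
    "color": ["Red", "Green", "Blue", "Red"],         # nominal
})

# Label (ordinal) encoding with an explicit order: Small < Medium < Large
size_order = {"Small": 0, "Medium": 1, "Large": 2}
df["size_encoded"] = df["size"].map(size_order)

# One-hot encoding for the nominal column: one binary column per category
one_hot = pd.get_dummies(df["color"], prefix="Color")

encoded = pd.concat([df[["size_encoded"]], one_hot], axis=1)
print(encoded)
```

Passing `drop_first=True` to `pd.get_dummies` drops one dummy column, which is the usual way to avoid the multicollinearity mentioned in the caveats.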

Q 7. Google Play Store Dataset

a). Analyze the relationship between app categories and ratings. Which categories have the
highest/lowest average ratings, and what could be the possible reasons?


Load the dataset
Subtask:

Load the googleplaystore.csv file into a pandas DataFrame.

Reasoning: Load the data into a pandas DataFrame and display the first few rows.

In [None]:
import pandas as pd

df = pd.read_csv('/content/googleplaystore.csv')
display(df.head())

| Column           | Description                                          |
| ---------------- | ---------------------------------------------------- |
| `App`            | App name                                             |
| `Category`       | App category (e.g., GAME, BUSINESS, EDUCATION, etc.) |
| `Rating`         | Average user rating (1.0–5.0)                        |
| `Reviews`        | Number of user reviews                               |
| `Installs`       | Number of installs                                   |
| `Type`           | Free or Paid                                         |
| `Price`          | App price                                            |
| `Content Rating` | Age appropriateness (Everyone, Teen, etc.)           |
| `Genres`         | More detailed classification                         |
| `Last Updated`   | Last update date                                     |



Clean the data
Subtask:

Handle missing values and convert the 'Rating' column to a numeric type.

Reasoning: Handle missing values in the 'Rating' column and convert it to a numeric type as instructed.

In [None]:
# Convert first so that non-numeric entries become NaN, then drop all missing ratings
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
df = df.dropna(subset=['Rating'])

Analyze category ratings
Subtask:
Group the data by 'Category' and calculate the mean 'Rating' for each category.

Reasoning: Group the DataFrame by 'Category' and calculate the mean of the 'Rating' column for each category to find the average rating per category.

In [None]:
average_category_ratings = df.groupby('Category')['Rating'].mean()
display(average_category_ratings)

| Category            | Avg. Rating | Observation                                                                 |
| ------------------- | ----------- | --------------------------------------------------------------------------- |
| EDUCATION           | ⭐ 4.36      | High satisfaction — educational apps are often simple, targeted, and niche. |
| BOOKS_AND_REFERENCE | ⭐ 4.35      | Informational apps; fewer bugs and stable functionality.                    |
| ART_AND_DESIGN      | ⭐ 4.34      | Visually appealing, creative user base.                                     |
| EVENTS              | ⭐ 4.32      | Simple apps with limited features.                                          |
| HEALTH_AND_FITNESS  | ⭐ 4.28      | Positive user engagement and niche interest.                                |
| ENTERTAINMENT       | ⭐ 4.22      | Engaging, but can vary by audience.                                         |
| GAME                | ⭐ 4.19      | Popular, but mixed reviews due to ads and in-app purchases.                 |
| TOOLS               | ⭐ 4.07      | Essential apps, but technical issues may reduce ratings.                    |
| COMMUNICATION       | ⭐ 4.06      | Apps like messengers — heavy usage, frequent bugs.                          |
| SOCIAL              | ⭐ 4.02      | Subjective user experiences, frequent updates cause dissatisfaction.        |
| FINANCE             | ⭐ 4.00      | Users are sensitive to reliability and trust issues.                        |
| DATING              | ⭐ 3.97      | User experience highly variable; expectations often unmet.                  |
| NEWS_AND_MAGAZINES  | ⭐ 3.95      | Content quality varies; sometimes poor design.                              |
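
A ranking like the one in the table can be produced by sorting the per-category means. The helper below is a sketch: the function name `rank_categories` and the `min_apps` filter are assumptions added for illustration, and the sample rows are made up rather than taken from the Play Store data.

```python
import pandas as pd

def rank_categories(df, min_apps=1):
    """Return mean rating and app count per category, highest mean first.

    Categories with fewer than `min_apps` rated apps are dropped so that a
    handful of ratings cannot dominate the ranking.
    """
    summary = df.groupby("Category")["Rating"].agg(mean_rating="mean", n_apps="count")
    summary = summary[summary["n_apps"] >= min_apps]
    return summary.sort_values("mean_rating", ascending=False)

# Tiny made-up sample, only to show the call; not real Play Store data
sample = pd.DataFrame({
    "Category": ["EDUCATION", "EDUCATION", "DATING", "DATING", "TOOLS"],
    "Rating":   [4.5, 4.3, 3.9, 4.1, 4.0],
})
ranking = rank_categories(sample, min_apps=2)
print(ranking)
```

On the real DataFrame, `rank_categories(df).head()` and `.tail()` would give the highest- and lowest-rated categories, together with how many apps each average is based on.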


Visualization

A bar chart can show the average rating per category:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
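# Order the bars from highest to lowest average rating
category_order = average_category_ratings.sort_values(ascending=False).index
# errorbar=None replaces the deprecated ci=None (seaborn >= 0.12)
sns.barplot(x='Rating', y='Category', data=df, estimator='mean', errorbar=None, order=category_order)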
plt.title('Average Rating by App Category')
plt.xlabel('Average Rating')
plt.ylabel('Category')
plt.show()

Insights & Possible Reasons

| High-Rating Categories                         | Reasons                                                                                             |
| ---------------------------------------------- | --------------------------------------------------------------------------------------------------- |
| **Education, Books & Reference, Art & Design** | Simple interfaces, fewer bugs, targeted audiences, and clear purpose. Users download intentionally. |
| **Health & Fitness, Events**                   | Specialized, utility-based apps with loyal users and limited negative experiences.                  |


| Low-Rating Categories       | Reasons                                                                  |
| --------------------------- | ------------------------------------------------------------------------ |
| **Social, Dating, Finance** | High user expectations, privacy/trust concerns, or poor user experience. |
| **Communication, Tools**    | Technical instability, battery/data usage, frequent ads.                 |
| **Games**                   | Mixed reviews due to ads, difficulty, or pay-to-win features.            |


Summary

* Highest Average Ratings: Education, Books & Reference, Art & Design.
→ Focused purpose, niche users, fewer ads, and stable functionality.

* Lowest Average Ratings: News & Magazines, Dating, Finance, Social.
→ High user expectations, technical challenges, and subjective satisfaction.

Q 8. Titanic Dataset

a) Compare the survival rates based on passenger class (Pclass). Which class had the highest
survival rate, and why do you think that happened?

b) Analyze how age (Age) affected survival. Group passengers into children (Age < 18) and
adults (Age ≥ 18). Did children have a better chance of survival?



In [None]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Load dataset
df_titanic = pd.read_csv('DataSet/titanic.csv')

In [None]:
# Display first few rows
print("Dataset Preview:")
df_titanic.head()

In [None]:
# Check for missing values
print("\nMissing values:")
df_titanic.isnull().sum()

In [None]:
# Drop rows with missing Age or Pclass or Survived
df_titanic = df_titanic.dropna(subset=['Age', 'Pclass', 'Survived'])

In [None]:
# 1️⃣ Compare survival rates by Passenger Class (Pclass)
survival_by_class = df_titanic.groupby('Pclass')['Survived'].mean() * 100
print("\nSurvival Rate by Passenger Class (%):")
survival_by_class

In [None]:
# Plot survival by class
plt.figure(figsize=(8,6))
survival_by_class.plot(kind='bar', color=['gold','silver', 'brown'], edgecolor='black')
plt.title('Survival Rate by Passenger Class')
plt.xlabel('Passenger Class (1 = Upper, 2 = Middle, 3 = Lower)')
plt.ylabel('Survival Rate (%)')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
# 2️⃣ Analyze survival by Age Group (Children vs Adults)
df_titanic['AgeGroup'] = df_titanic['Age'].apply(lambda x: 'Child' if x < 18 else 'Adult')

survival_by_agegroup = df_titanic.groupby('AgeGroup')['Survived'].mean() * 100
print("\nSurvival Rate by Age Group (%):")
survival_by_agegroup

In [None]:
# Plot survival by age group
plt.figure(figsize=(5,4))
survival_by_agegroup.plot(kind='bar', color=['lightblue', 'orange'], edgecolor='black')
plt.title('Survival Rate by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Survival Rate (%)')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
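
The two analyses above look at class and age group separately. A pivot table can combine them, which helps check whether the age effect holds within each class. The sketch below assumes the `Pclass`, `AgeGroup`, and `Survived` columns used above; the function name and the demo rows are hypothetical and not taken from the Titanic data.

```python
import pandas as pd

def survival_by_class_and_age(df):
    """Survival rate (%) for each Pclass / AgeGroup combination."""
    table = df.pivot_table(index="Pclass", columns="AgeGroup",
                           values="Survived", aggfunc="mean")
    return (table * 100).round(1)

# Made-up rows, only to demonstrate the call
demo = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3],
    "AgeGroup": ["Adult", "Adult", "Child", "Adult", "Adult", "Child", "Child"],
    "Survived": [1, 0, 1, 0, 0, 1, 0],
})
table = survival_by_class_and_age(demo)
print(table)
```

Calling `survival_by_class_and_age(df_titanic)` after the `AgeGroup` column has been created would give the real breakdown.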

**Insights from the Analysis**

**🎟️ 1. Passenger Class vs Survival**

* **1st Class** passengers have the **highest survival rate** (typically around **60–65%**).

* **3rd Class** passengers have the **lowest survival rate** (often around **20–25%**).

**Reasons:**

* 1st class passengers were closer to the lifeboats (upper decks).

* Wealthier, higher-status passengers may have received more assistance from the crew during evacuation.

* 3rd class passengers were located in lower decks, far from exits, and often delayed in reaching lifeboats.

**👶 2. Age (Children vs Adults)**

* **Children (<18 years)** had a **higher survival rate** than adults.

* This aligns with the “Women and Children First” policy followed during evacuation.

**Reasons:**

* Crew gave priority to saving children and women.

* Many children in upper-class families had better access to lifeboats.

* Adult males had the lowest survival rate, which is consistent with lifeboat seats going to women and children first.

**💡 Conclusion**
* **1st Class** → Highest survival rate (more access & priority).
* **Children (<18)** → Higher survival rate (evacuation priority).
* **3rd Class & Adults** → Lowest survival rate (limited access, slower evacuation).
---

Q 9. Flight Price Prediction Dataset

1. How do flight prices vary with the days left until departure? Identify any exponential price surges and recommend the best booking window.

2. Compare prices across airlines for the same route (e.g., Delhi-Mumbai). Which airlines are consistently cheaper/premium, and why?
Dataset: https://github.com/MasteriNeuron/datasets.git

In [None]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

In [None]:
# Load dataset
df_fp = pd.read_excel('DataSet/flight_price.xlsx')

In [None]:
# Display first few rows
print("Dataset Preview:")
df_fp.head()

In [None]:
# Check for missing values
print("\nMissing values:")
df_fp.isnull().sum()

In [None]:
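df_fp['Date_of_Journey'] = pd.to_datetime(df_fp['Date_of_Journey'], format='%d/%m/%Y')
# Assumption: no booking-date column is used here, so the earliest journey date
# serves as a fixed reference point. (Using today's date would make the values
# negative for past journeys and dependent on when the notebook is run.)
reference_date = df_fp['Date_of_Journey'].min()
df_fp['Days_Left'] = (df_fp['Date_of_Journey'] - reference_date).dt.days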

In [None]:
# Group by days left and calculate average price
price_by_days = df_fp.groupby('Days_Left')['Price'].mean().reset_index().sort_values(by='Days_Left')

In [None]:
price_by_days.head()

In [None]:
price_by_days.tail()

In [None]:
# 1-4: Line plot for average price vs. days left
plt.figure(figsize=(12, 6))
sns.lineplot(data=price_by_days, x='Days_Left', y='Price')
plt.title('Average Flight Price vs. Days Left Until Departure')
plt.xlabel('Days Left Until Departure')
plt.ylabel('Average Price')
plt.grid(True)
plt.show()
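
The line plot shows the overall trend, but surges are easier to pin down numerically. One possible approach is to compute the day-over-day percentage change in the average price while moving toward departure, and flag changes above a threshold. The function name `flag_price_surges`, the 15% threshold, and the demo numbers are assumptions for illustration only.

```python
import pandas as pd

def flag_price_surges(price_by_days, threshold=0.15):
    """Flag days where the average price rises by more than `threshold`
    (as a fraction) compared with the previous, more distant day."""
    # Walk from far-out bookings toward the departure date
    ordered = (price_by_days.sort_values("Days_Left", ascending=False)
                            .reset_index(drop=True))
    ordered["pct_change"] = ordered["Price"].pct_change()
    ordered["surge"] = ordered["pct_change"] > threshold
    return ordered

# Made-up averages, only to demonstrate the function
demo = pd.DataFrame({"Days_Left": [1, 2, 3, 4, 5],
                     "Price": [9000.0, 6000.0, 5000.0, 4900.0, 4800.0]})
flagged = flag_price_surges(demo)
print(flagged[flagged["surge"]])
```

Applied to the `price_by_days` DataFrame computed above, the last day before the first flagged surge would be a reasonable candidate for the end of the booking window.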

In [None]:
# Compare average prices across airlines on the Delhi → Mumbai route

df_delhi_mumbai = df_fp[(df_fp['Source'] == 'Delhi') & (df_fp['Destination'] == 'Mumbai')]
avg_price_by_airline = df_delhi_mumbai.groupby('Airline')['Price'].mean().reset_index().sort_values(by='Price')

avg_price_by_airline

In [None]:
df_fp['Source'].unique()

In [None]:
df_fp['Destination'].unique()

In [None]:
print(df_fp.groupby(['Source', 'Destination']).size().reset_index(name='count'))

In [None]:
# 5-9: Bar plot for average price by airline on Delhi-Mumbai route
plt.figure(figsize=(10, 5))
sns.barplot(data=avg_price_by_airline, x='Airline', y='Price', palette='viridis')
plt.title('Average Flight Prices Across Airlines (Delhi → Mumbai)')
plt.xlabel('Airline')
plt.ylabel('Average Price (₹)')
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better readability
plt.tight_layout() # Adjust layout to prevent labels overlapping
plt.show()

Q 10. HR Analytics Dataset.

1. What factors most strongly correlate with employee attrition? Use visualizations to show key drivers (e.g., satisfaction, overtime, salary).

2. Are employees with more projects more likely to leave?



Dataset Overview

A typical HR Analytics dataset includes variables like:


| Column                  | Description                                    |
| ----------------------- | ---------------------------------------------- |
| `satisfaction_level`    | Employee satisfaction (0–1)                    |
| `last_evaluation`       | Last performance review (0–1)                  |
| `number_project`        | Number of projects the employee worked on      |
| `average_montly_hours`  | Average monthly hours                          |
| `time_spend_company`    | Years spent in the company                     |
| `Work_accident`         | Whether the employee had a work accident       |
| `promotion_last_5years` | Whether promoted in last 5 years               |
| `salary`                | Categorical: low, medium, high                 |
| `department`            | Department/field                               |
| `left`                  | Target variable (1 = left company, 0 = stayed) |


1. Key Correlations with Attrition

To find what factors correlate with attrition (left), we can compute correlation coefficients or visualize distributions.

Example (Correlation Matrix Heatmap)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# `df` must hold the HR Analytics dataset here (not the Play Store DataFrame loaded earlier)
corr = df.corr(numeric_only=True)
plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix with Attrition')
plt.show()


Typical Findings (approximate figures from known analyses; confirm them against the heatmap computed from your own data)

| Factor                    | Correlation with Attrition | Insight                                                             |
| ------------------------- | -------------------------- | ------------------------------------------------------------------- |
| **satisfaction_level**    | **−0.39**                  | Lower satisfaction → higher attrition                               |
| **number_project**        | **+0.35 (nonlinear)**      | Moderate projects (3–4) retain employees; very high (6–7) → burnout |
| **average_montly_hours**  | **+0.34**                  | More hours → more attrition (overtime fatigue)                      |
| **time_spend_company**    | **+0.14**                  | Longer tenure → more likely to leave (career stagnation)            |
| **last_evaluation**       | **+0.28 (nonlinear)**      | Both very low and very high performers tend to leave                |
| **salary (encoded)**      | **−0.25**                  | Lower salary → higher attrition                                     |
| **promotion_last_5years** | **−0.07**                  | Lack of promotion slightly increases attrition                      |



2. Visualizations — Key Drivers

(a) Satisfaction Level vs Attrition

Employees who left generally have lower satisfaction.

In [None]:
sns.kdeplot(data=df, x='satisfaction_level', hue='left', fill=True)
plt.title('Satisfaction Level vs Attrition')
plt.show()


Observation: A strong peak around satisfaction ≈ 0.4 for employees who left.

(b) Average Monthly Hours vs Attrition

Employees who work much longer hours (200+) are more likely to leave.

In [None]:
sns.boxplot(x='left', y='average_montly_hours', data=df)
plt.title('Average Monthly Hours vs Attrition')
plt.show()

Observation: Overworked employees tend to quit more.

(c) Salary vs Attrition

In [None]:
sns.countplot(x='salary', hue='left', data=df, order=['low','medium','high'])
plt.title('Attrition by Salary Level')
plt.show()

Observation: Most leavers come from the low-salary group.

(d) Promotion vs Attrition

In [None]:
sns.barplot(x='promotion_last_5years', y='left', data=df)
plt.title('Promotion History vs Attrition Rate')
plt.show()

Observation: Employees without promotion are more likely to leave.

3. Are Employees with More Projects More Likely to Leave?

Let’s visualize:

In [None]:
sns.countplot(x='number_project', hue='left', data=df)
plt.title('Attrition by Number of Projects')
plt.show()


Interpretation:

* Employees with 2 or fewer projects often leave due to boredom or lack of engagement.
* Employees with 6 or 7 projects leave due to overload and burnout.
* Employees with 3–4 projects have the lowest attrition (healthy workload).

✅ So, the relationship is nonlinear (U-shaped): both underworked and overworked employees are more likely to leave.
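
A count plot shows absolute numbers, which depend on how many employees fall into each group. The attrition rate per project count is easier to compare directly. This is a sketch assuming the `number_project` and `left` columns from the dataset overview; the function name and the demo rows are made up.

```python
import pandas as pd

def attrition_rate_by_projects(df):
    """Share of employees who left (in %), per number of projects."""
    return (df.groupby("number_project")["left"].mean() * 100).round(1)

# Made-up rows, only to show the call; not the real HR data
demo = pd.DataFrame({
    "number_project": [2, 2, 3, 3, 3, 4, 4, 7, 7],
    "left":           [1, 0, 0, 0, 0, 0, 1, 1, 1],
})
rates = attrition_rate_by_projects(demo)
print(rates)
```

On the real data, `attrition_rate_by_projects(df).plot(kind='bar')` would make a U-shaped pattern visible at a glance, if it is present.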

Summary of Insights
| Factor                           | Effect on Attrition | Explanation                     |
| -------------------------------- | ------------------- | ------------------------------- |
| **Low satisfaction**             | ⬆️ Attrition        | Dissatisfaction drives turnover |
| **High workload / overtime**     | ⬆️ Attrition        | Burnout and stress              |
| **Low salary**                   | ⬆️ Attrition        | Compensation dissatisfaction    |
| **No promotion**                 | ⬆️ Attrition        | Career stagnation               |
| **Too few or too many projects** | ⬆️ Attrition        | Under-challenged or overworked  |


Key Takeaways:

* Satisfaction is the strongest predictor.
* Overtime and high project counts lead to burnout.
* Low pay and no promotion discourage retention.
* Optimal conditions: medium workload, fair salary, career growth, and recognition.