# Exam Cheat Sheet: Basic Python for DSAI

This Jupyter notebook is a comprehensive cheat sheet for your mid-semester exam in *Basic Python for Data Science and Artificial Intelligence*. It consolidates the lecture material, lab exercises and the graded assignment into concise explanations with runnable code examples.

## Table of Contents

1. [Python Basics](#Python-Basics)
2. [Object-Oriented Programming](#Object-Oriented-Programming)
3. [NumPy](#NumPy)
4. [Pandas](#Pandas)
5. [Visualization with Matplotlib](#Visualization-with-Matplotlib)
6. [Visualization with Seaborn](#Visualization-with-Seaborn)
7. [Practice Lab Problems](#Practice-Lab-Problems)
8. [Assignment‑1: Student, Department & Institute](#Assignment-1-Student-Department-and-Institute)


## Python Basics

Python is a dynamically typed, interpreted language. The following points cover the core syntax and language constructs you need to know for the exam:

- **Variables and Data Types**: Python variables are created when first assigned. Common built‑in types include integers (`int`), floating‑point numbers (`float`), strings (`str`), booleans (`bool`), lists (`list`), tuples (`tuple`), dictionaries (`dict`) and sets (`set`).
- **Comments**: Use the `#` symbol for single‑line comments. Triple‑quoted strings (`""" ... """`) are string literals rather than comments; placed as the first statement of a module, function or class, they become its docstring.
- **Operators**: Python supports arithmetic (`+`, `-`, `*`, `/`, `//`, `%`, `**`), comparison (`==`, `!=`, `<`, `>`, `<=`, `>=`) and logical operators (`and`, `or`, `not`).
- **Control Flow**: Use `if`, `elif`, `else` for conditional logic. Loop constructs include `for` (iterate over iterables) and `while` (repeat until condition is false). Use `break` and `continue` to control loop execution.
- **Comprehensions**: List and dictionary comprehensions provide concise ways to create new collections from existing ones. For example, `[x**2 for x in range(5)]` produces a list of squares.
- **Functions**: Defined with the `def` keyword. Functions can take positional, default and keyword arguments and return values with `return`. Docstrings describe what a function does.
- **Modules**: Code can be organised into modules (`.py` files). Import modules using `import module` or `from module import name`.

Common pitfalls include forgetting the colon (`:`) after control structures, mixing tabs and spaces (Python uses indentation to define blocks) and confusing assignment (`=`) with equality comparison (`==`).
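
As a small complement to the bullets above, the sketch below exercises a few constructs that are easy to mix up: the division operators, a dictionary comprehension, `break`/`continue`, and keyword arguments. The helper `describe` and its parameters are made up purely for illustration.

```python
# Division operators, modulo and exponentiation
print(7 / 2)    # true division -> 3.5
print(7 // 2)   # floor division -> 3
print(7 % 2)    # remainder -> 1
print(2 ** 10)  # power -> 1024

# Dictionary comprehension: map each number to its cube
cubes = {n: n**3 for n in range(4)}
print(cubes)  # {0: 0, 1: 1, 2: 8, 3: 27}

# break and continue inside a while loop
found = []
n = 0
while True:
    n += 1
    if n % 3 != 0:
        continue      # skip numbers that are not multiples of 3
    found.append(n)
    if len(found) == 3:
        break         # stop once three multiples are collected
print(found)  # [3, 6, 9]

# Positional, default and keyword arguments (hypothetical helper)
def describe(item, quantity=1, unit="pcs"):
    """Return a short description such as '3 kg of rice'."""
    return f"{quantity} {unit} of {item}"

print(describe("rice"))                 # 1 pcs of rice
print(describe("rice", 3, unit="kg"))   # 3 kg of rice
print(describe(item="milk", unit="l"))  # 1 l of milk
```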


In [None]:
# Variables and data types
integer_var = 42              # an integer
float_var = 3.14159           # a floating-point number
string_var = "DSAI"           # a string
bool_var = True              # a boolean

print(type(integer_var), type(float_var), type(string_var), type(bool_var))

# Collections: list, tuple, dict and set
my_list = [1, 2, 3, 2]                # mutable ordered collection
my_tuple = (1, 2, 3)                 # immutable ordered collection
my_dict = {"a": 1, "b": 2, "c": 3}  # key‑value mapping
my_set = {1, 2, 2, 3}                # unordered collection of unique elements

print("List:", my_list)
print("Tuple:", my_tuple)
print("Dictionary:", my_dict)
print("Set:", my_set)

# Control flow: for and if/else
for i in range(5):
    if i % 2 == 0:
        print(f"{i} is even")
    else:
        print(f"{i} is odd")

# List comprehension: squares of even numbers from 0 to 9
squares = [i**2 for i in range(10) if i % 2 == 0]
print("Even squares:", squares)

# Function definition with default parameter and docstring
def greet(name="World"):
    """Return a greeting message for the given name."""
    return f"Hello, {name}!"

# Call the function
print(greet("DSAI"))
print(greet())


## Object‑Oriented Programming

Object‑Oriented Programming (OOP) allows you to model real‑world entities as **objects** with attributes (state) and methods (behaviour). Classes provide blueprints for objects:

- **Defining a class**: Use the `class` keyword. The `__init__` method is a special constructor used to initialise instance attributes. Each method must explicitly accept `self` as the first parameter.
- **Attributes**: Variables that belong to an object. Assign them within `__init__` using `self.attribute_name = value`.
- **Methods**: Functions defined within a class. They operate on the instance via `self`.
- **Inheritance**: One class (child) can inherit attributes and methods from another class (parent) to promote reuse. Use `class Child(Parent):` to declare inheritance. Override methods to customise behaviour and call the parent method via `super().method()`.
- **Special methods**: Double‑underscore (__) methods such as `__str__`, `__repr__`, `__len__`, `__eq__` and `__gt__` provide custom behaviour for built‑in operations (string representation, length, comparisons, etc.).
- **Encapsulation and information hiding**: Prefix attribute names with an underscore (`_attribute`) to indicate they should be treated as non‑public. Use getters and setters when you need validation or computed attributes.

Common mistakes include forgetting to include `self` in method definitions, accidentally creating class variables (shared across instances) instead of instance variables, and not calling `super().__init__()` in subclasses.
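
The example cell for this section covers inheritance and `__str__`. The following sketch, assuming a hypothetical `Account` class invented for this cheat sheet, illustrates the remaining bullets: a class variable versus instance variables, a property with validation, and the special methods `__repr__`, `__eq__` and `__gt__`.

```python
class Account:
    """Hypothetical bank account used only to illustrate OOP features."""

    interest_rate = 0.02  # class variable: shared by every instance

    def __init__(self, owner, balance=0.0):
        self.owner = owner        # instance variable
        self._balance = balance   # non-public by convention

    @property
    def balance(self):
        """Getter: expose the non-public attribute for reading."""
        return self._balance

    @balance.setter
    def balance(self, value):
        """Setter: validate before storing."""
        if value < 0:
            raise ValueError("balance cannot be negative")
        self._balance = value

    def __repr__(self):
        return f"Account(owner={self.owner!r}, balance={self._balance})"

    def __eq__(self, other):
        if not isinstance(other, Account):
            return NotImplemented
        return (self.owner, self._balance) == (other.owner, other._balance)

    def __gt__(self, other):
        return self._balance > other._balance


a = Account("Asha", 100.0)
b = Account("Ravi", 250.0)

print(repr(a))
print(a == Account("Asha", 100.0))  # True: __eq__ compares owner and balance
print(b > a)                        # True: __gt__ compares balances

a.balance = 150.0                   # goes through the setter
try:
    a.balance = -5                  # rejected by the validation
except ValueError as err:
    print("Rejected:", err)

# Changing the class variable affects every instance that has not shadowed it
Account.interest_rate = 0.03
print(a.interest_rate, b.interest_rate)  # 0.03 0.03
```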


In [None]:
# Define a base class representing a person
class Person:
    def __init__(self, name, age):
        self.name = name  # instance attribute
        self.age = age
    
    def speak(self):
        return f"Hi, I'm {self.name} and I'm {self.age} years old."
    
    def __str__(self):
        return f"Person(name={self.name}, age={self.age})"

# Define a subclass representing a student inheriting from Person
class Student(Person):
    def __init__(self, name, age, scores):
        super().__init__(name, age)  # initialise the base class
        self.scores = scores  # list of scores
    
    def average_score(self):
        return sum(self.scores) / len(self.scores)
    
    # override the speak method
    def speak(self):
        return f"I'm {self.name}, my average score is {self.average_score():.2f}."

# Create and use objects
alice = Person("Alice", 30)
print(alice.speak())
print(alice)  # uses __str__

bob = Student("Bob", 20, [88, 92, 76, 95])
print(bob.speak())
print("Bob's scores:", bob.scores)
print("Bob's average:", bob.average_score())


## NumPy

NumPy provides a powerful N‑dimensional array object (`ndarray`) and functions for fast vectorised operations. Key concepts:

- **Creating arrays**: Use `np.array()` to convert Python lists to arrays. Use `np.zeros()`, `np.ones()`, `np.eye()` and `np.arange()`/`np.linspace()` to create arrays with specific patterns. Random numbers come from `np.random` (e.g., `np.random.rand`, `np.random.randn`).
- **Array attributes**: `shape` (dimensions), `dtype` (data type), `size` (number of elements) and `ndim` (number of axes).
- **Indexing and slicing**: Similar to Python lists but can specify indices per axis using tuples; supports boolean masking and fancy indexing (arrays of indices).
- **Reshaping and transposing**: Use `.reshape(new_shape)`, `np.reshape()` and `.T` (transpose). Use `np.concatenate`, `np.hstack` and `np.vstack` to combine arrays.
- **Broadcasting**: Automatic expansion of smaller arrays to match shapes of larger arrays in arithmetic operations (e.g., adding a vector to each row of a matrix).
- **Universal functions (ufuncs)**: Fast element‑wise operations like `np.add`, `np.subtract`, `np.multiply`, `np.exp`, `np.sqrt`. Aggregate functions include `np.sum`, `np.mean`, `np.std`, `np.min` and `np.max` with optional `axis` parameter.

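Be careful with **copy vs view**: basic slicing (e.g. `a[1:4]`) returns a view into the original array, so writing to the slice changes the original, whereas boolean masking and fancy indexing return copies. Use `.copy()` when you need an independent copy.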
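
A minimal sketch of the copy-versus-view behaviour and of the `axis` parameter, using small made-up arrays (`base` and `grid` are illustrative names only):

```python
import numpy as np

base = np.arange(6)            # [0 1 2 3 4 5]

view = base[1:4]               # basic slice: a view onto base
view[0] = 99                   # writes through to base
print(base)                    # [ 0 99  2  3  4  5]

independent = base[1:4].copy() # explicit copy
independent[0] = -1            # base is not affected
print(base[1])                 # 99

masked = base[base > 2]        # boolean masking returns a copy
masked[:] = 0                  # base is not affected
print(base)                    # [ 0 99  2  3  4  5]

print(np.shares_memory(base, view))         # True
print(np.shares_memory(base, independent))  # False

# Aggregating along an axis of a 2x3 array
grid = np.arange(1, 7).reshape(2, 3)  # [[1 2 3], [4 5 6]]
print(grid.sum(axis=0))  # column sums: [5 7 9]
print(grid.sum(axis=1))  # row sums:    [ 6 15]
```

A handy way to remember `axis`: the axis you pass is the one that disappears from the result's shape.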


In [None]:
import numpy as np

# Creating arrays
arr = np.array([1, 2, 3, 4])
print("Array:", arr)

zeros = np.zeros((2, 3))  # 2x3 array of zeros
ones = np.ones((2, 3))    # 2x3 array of ones
rand = np.random.rand(2, 3)  # 2x3 array of random numbers from [0, 1)

print("Zeros:\n", zeros)
print("Ones:\n", ones)
print("Random:\n", rand)

# Array attributes
print("Shape:", rand.shape)
print("Data type:", rand.dtype)
print("Number of elements:", rand.size)

# Reshaping and transposing
b = np.arange(1, 13).reshape((3, 4))  # reshape a 1D array into 3x4
print("Original b:\n", b)
print("Transposed b:\n", b.T)

# Indexing and slicing
print("Element at row 1, col 2:", b[1, 2])
print("Second row:", b[1])
print("First two rows and last two columns:\n", b[:2, -2:])

# Broadcasting: adding a vector to each row
vec = np.array([1, 2, 3, 4])
print("b + vec:\n", b + vec)

# Example: computing Body Mass Index (BMI) using NumPy
heights = np.array([1.70, 1.82, 1.60, 1.75])
weights = np.array([70, 80, 55, 68])

bmi = weights / (heights ** 2)
print("BMI values:", bmi)

# Boolean masking: identify overweight (BMI >= 25)
overweight = bmi >= 25
print("Overweight flags:", overweight)
print("Heights of overweight individuals:", heights[overweight])


## Pandas

Pandas offers high‑level data structures (Series and DataFrame) and tools for data cleaning, manipulation and analysis.

- **Series**: A one‑dimensional labelled array. Create with `pd.Series(data, index=...)`.
- **DataFrame**: A two‑dimensional labelled table with columns of potentially different types. Create from dictionaries, lists of dictionaries, or using `pd.read_csv()` / `pd.read_excel()` to load data.
- **Inspecting data**: Use `head()`, `tail()`, `shape`, `info()` and `describe()` to understand your dataset.
- **Indexing and selection**: Use `.loc[rows, columns]` (label‑based) and `.iloc[rows, columns]` (integer position‑based). Boolean indexing lets you filter rows based on conditions.
- **Missing data**: Pandas uses `NaN` for missing values. Detect missing values with `isnull()`/`notnull()`. Handle missing data using `dropna()` to remove rows/columns or `fillna()` to impute with a constant or statistic (mean, median, etc.).
- **Outliers**: Detect outliers by computing Z‑scores (`(x - mean)/std`) or the Interquartile Range (IQR). Remove or cap outliers as appropriate.
- **Groupby and aggregation**: Use `df.groupby(column).agg({'col': 'function', ...})` to compute statistics per group.
- **Pipelines**: Chain operations together using method chaining for readable data transformations.

Avoid common mistakes such as chained indexing (e.g., `df[df['col']>0]['other'] = ...`), which can lead to warnings or bugs. Use `.loc` for assignment.
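
The selection, assignment and aggregation patterns listed above can be sketched on a tiny made-up table (the column names `team`, `score` and `hours` and the row labels are assumptions chosen for illustration, not part of any lab dataset):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "team": ["A", "A", "B", "B", "B"],
        "score": [72, 85, 90, 64, 78],
        "hours": [5.0, 7.5, 8.0, 3.0, 6.0],
    },
    index=["r1", "r2", "r3", "r4", "r5"],
)

# Label-based vs position-based selection of the same cell
print(df.loc["r2", "score"])  # 85
print(df.iloc[1, 1])          # 85

# Conditional assignment: one .loc call instead of chained indexing
df.loc[df["score"] < 70, "hours"] = 0.0

# Group-wise statistics with a dict of column -> function
summary = df.groupby("team").agg({"score": "mean", "hours": "sum"})
print(summary)

# Method chaining: filter, sort and renumber in one readable expression
top = (
    df[df["score"] >= 75]
    .sort_values("score", ascending=False)
    .reset_index(drop=True)
)
print(top)
```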


In [None]:
import pandas as pd

# Creating a DataFrame from a dictionary
students_data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [20, 21, 19, 22],
    'Score': [85, 92, 78, 90]
}
students_df = pd.DataFrame(students_data)
print("Students DataFrame:\n", students_df)

# Reading the Titanic dataset (Lab4) from CSV
file_path = '/home/oai/share/Lab4_Titanic.csv'
titanic = pd.read_csv(file_path)

print("\nTitanic dataset shape:", titanic.shape)
print("Columns:", titanic.columns.tolist())
print(titanic.head())

# Handling missing values: fill 'Age' with median and drop 'Cabin'
median_age = titanic['Age'].median()
titanic['Age'] = titanic['Age'].fillna(median_age)

if 'Cabin' in titanic.columns:
    titanic = titanic.drop(columns=['Cabin'])

# Fill missing 'Embarked' values with the most frequent value (mode)
mode_embarked = titanic['Embarked'].mode()[0]
titanic['Embarked'] = titanic['Embarked'].fillna(mode_embarked)

print("\nAfter cleaning, missing values per column:\n", titanic.isnull().sum())

# Example: compute average fare by passenger class
avg_fare_by_class = titanic.groupby('Pclass')['Fare'].mean()
print("\nAverage fare by class:\n", avg_fare_by_class)

# Outlier detection using IQR on 'Fare'
Q1 = titanic['Fare'].quantile(0.25)
Q3 = titanic['Fare'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = titanic[(titanic['Fare'] < lower_bound) | (titanic['Fare'] > upper_bound)]
print("Number of fare outliers:", len(outliers))


## Visualization with Matplotlib

Matplotlib is a plotting library that provides fine‑grained control over visualisation. Key points:

- **Import convention**: `import matplotlib.pyplot as plt`.
- **Figure and axes**: A figure is a container for one or more axes (plots). Use `plt.figure()` to create a figure and `fig, ax = plt.subplots()` for one or multiple axes.
- **Line plot**: Use `ax.plot(x, y)` for continuous data.
- **Bar chart**: Use `ax.bar(labels, values)` for categorical comparisons.
- **Scatter plot**: Use `ax.scatter(x, y)` to show relationships between two variables.
- **Pie chart**: Use `ax.pie(sizes, labels=labels)` to show part‑whole relationships.
- **Customising plots**: Add titles (`ax.set_title()`), axis labels (`ax.set_xlabel()`, `ax.set_ylabel()`), legends (`ax.legend()`), grid lines (`ax.grid(True)`) and annotations (`ax.annotate()`).
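- **Subplots**: Use `fig, axs = plt.subplots(nrows, ncols)` to create multiple plots in a grid. When both `nrows` and `ncols` exceed 1, `axs` is a 2D array and each `axs[i, j]` is its own `Axes`; with a single row or column it is 1D and indexed as `axs[i]`.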

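Avoid specifying colours explicitly unless required, and keep each chart on its own axes for clarity. Call `plt.show()` to display the figure: scripts need it, and in a notebook it renders the figure at that point and suppresses the text output of the last plotting call.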
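
The customisation and subplot bullets can be sketched without any lab file by plotting synthetic data. Everything below (the data, the labels and the file name `example_subplots.png`) is invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data, invented for illustration only
x = np.linspace(0, 2 * np.pi, 50)

# One row, two columns: axs is a 1D array of Axes
fig, axs = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: two labelled lines, a legend and a grid
axs[0].plot(x, np.sin(x), label="sin(x)")
axs[0].plot(x, np.cos(x), label="cos(x)")
axs[0].set_title("Line plot with legend")
axs[0].set_xlabel("x")
axs[0].set_ylabel("value")
axs[0].legend()
axs[0].grid(True)

# Right panel: a simple bar chart
axs[1].bar(["Mon", "Tue", "Wed"], [3, 5, 2])
axs[1].set_title("Bar chart")
axs[1].set_ylabel("count")

fig.tight_layout()                            # avoid overlapping labels
fig.savefig("example_subplots.png", dpi=150)  # hypothetical file name
plt.show()
```

Saving before `plt.show()` is the safer order, since some backends clear the figure once it has been shown.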


In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Load health data (Lab5)
health = pd.read_csv('/home/oai/share/health_data.csv')

# Line chart: Steps over months
fig, ax = plt.subplots()
ax.plot(health['Month'], health['Steps'])
ax.set_xlabel('Month')
ax.set_ylabel('Steps')
ax.set_title('Monthly Steps')
plt.show()

# Bar chart: Sleep hours over months
fig, ax = plt.subplots()
ax.bar(health['Month'], health['Sleep_Hours'])
ax.set_xlabel('Month')
ax.set_ylabel('Sleep Hours')
ax.set_title('Average Sleep Hours per Month')
plt.show()

# Scatter plot: Sleep hours vs Average heart rate
fig, ax = plt.subplots()
ax.scatter(health['Sleep_Hours'], health['Avg_Heart_Rate'])
ax.set_xlabel('Sleep Hours')
ax.set_ylabel('Average Heart Rate')
ax.set_title('Sleep Hours vs Average Heart Rate')
plt.show()

# Pie chart: Total steps vs total calories burned
total_steps = health['Steps'].sum()
total_calories = health['Calories_Burned'].sum()
fig, ax = plt.subplots()
ax.pie([total_steps, total_calories], labels=['Steps', 'Calories'], autopct='%1.1f%%')
ax.set_title('Proportion of Steps vs Calories Burned')
plt.show()


## Visualization with Seaborn

Seaborn builds on top of Matplotlib and provides high‑level functions for creating attractive statistical graphics with minimal code. It integrates well with pandas DataFrames.

- **Import**: `import seaborn as sns`; set styles with `sns.set()` if desired.
- **scatterplot**: `sns.scatterplot(x='col1', y='col2', hue='col3', data=df)` adds colour encoding.
- **boxplot**: `sns.boxplot(x='category', y='value', data=df)` shows distributions and outliers.
- **violinplot**: `sns.violinplot(...)` combines a boxplot with a kernel density estimate to visualise distribution shape.
- **pairplot**: `sns.pairplot(df[cols])` plots pairwise relationships for a set of variables, optionally coloured by a class column via `hue`.
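- **heatmap**: `sns.heatmap(df.corr(), annot=True)` displays a correlation matrix. If the DataFrame contains non‑numeric columns, call `df.corr(numeric_only=True)` so that only numeric columns are correlated.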
- **jointplot**: `sns.jointplot(x='col1', y='col2', data=df, kind='scatter')` shows both scatter and marginal distributions; use `kind='kde'` for density contours.

Seaborn chooses default colours automatically, so you usually do not need to specify them. Remember to call `plt.show()` when mixing with Matplotlib.


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load heart disease dataset (Lab6)
heart = pd.read_csv('/home/oai/share/heart.csv')

# Pairplot of selected numerical features coloured by target
sns.pairplot(heart[['age','chol','thalach','trestbps','target']], hue='target')
plt.suptitle('Pairplot of Heart Disease Features', y=1.02)
plt.show()

# Boxplot: resting blood pressure (trestbps) by disease presence
plt.figure()
sns.boxplot(x='target', y='trestbps', data=heart)
plt.xlabel('Target (0=No Disease, 1=Has Disease)')
plt.ylabel('Resting Blood Pressure (trestbps)')
plt.title('Resting Blood Pressure by Heart Disease Presence')
plt.show()

# Violin plot: maximum heart rate (thalach) by sex
plt.figure()
sns.violinplot(x='sex', y='thalach', data=heart)
plt.xlabel('Sex (0=Female, 1=Male)')
plt.ylabel('Maximum Heart Rate (thalach)')
plt.title('Distribution of Max Heart Rate by Sex')
plt.show()

# Heatmap of correlation matrix
plt.figure()
corr = heart.corr()
sns.heatmap(corr, annot=True, fmt='.2f')
plt.title('Correlation Matrix of Heart Dataset')
plt.show()

# Jointplot: age vs maximum heart rate with density contours
sns.jointplot(x='age', y='thalach', data=heart, kind='kde')
plt.suptitle('Age vs Maximum Heart Rate (Density)', y=1.02)
plt.show()


## Practice Lab Problems

This section presents example solutions to the lab exercises. Each problem illustrates important concepts from the course.

### Lab‑1: Iteration, Search and Matrices

1. **Combining lists** – iterate over two lists simultaneously (e.g. with `zip`), traversing the second list in reverse.
2. **Word search in a 2D grid** – search for a given word horizontally or vertically in a grid of characters.
3. **Rotate a matrix 90° clockwise** – modify a 2D matrix in place.
4. **Validate a Sudoku board** – ensure no duplicates across rows, columns or 3x3 sub-grids (ignoring empty cells).
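
For contrast with the in-place rotation the lab asks for, a rotated *copy* can be built in one line with `zip`. This is an alternative sketch rather than the lab's required approach; `rotated_copy` is a hypothetical helper name. Unlike the in-place version, it also handles non-square matrices.

```python
def rotated_copy(mat):
    """Return a new matrix equal to `mat` rotated 90 degrees clockwise."""
    # Reverse the row order, then read off the columns with zip(*...)
    return [list(row) for row in zip(*mat[::-1])]

rect = [
    [1, 2, 3],
    [4, 5, 6],
]
print(rotated_copy(rect))  # [[4, 1], [5, 2], [6, 3]]
print(rect)                # [[1, 2, 3], [4, 5, 6]] (original unchanged)
```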

### Lab‑2: Shopping Cart (OOP)

Define classes for products, cart items, a shopping cart and a customer. Enforce encapsulation by keeping internal attributes non‑public and expose operations via methods.

### Lab‑3: NumPy Calculations

Convert lists to arrays, convert units (inches/feet to metres; pounds to kilograms), compute BMI and identify overweight individuals. Work with 2D arrays and aggregate statistics (sum, mean, median, argmax, argsort).
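
A short sketch of the pieces mentioned here that are easy to forget: converting feet to metres, the median, and the difference between `argmax` and `argsort`. The arrays are made-up values, not lab data.

```python
import numpy as np

# Unit conversion: 1 ft = 0.3048 m
heights_ft = np.array([5.5, 6.0, 5.0])   # made-up values
heights_m = heights_ft * 0.3048
print(heights_m)

scores = np.array([[70, 85, 90],
                   [60, 75, 80],
                   [95, 65, 55]])

print(np.median(scores))          # median of all nine values: 75.0
print(np.median(scores, axis=0))  # median of each column

# argmax returns a flat index; unravel_index turns it into (row, column)
row, col = np.unravel_index(np.argmax(scores), scores.shape)
print(int(row), int(col))         # 2 0

# argsort returns the order of indices, here sorting rows by first column
order = np.argsort(scores[:, 0])
print(order)                      # [1 0 2]
print(scores[order])              # rows rearranged in that order
```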

### Lab‑4: Titanic Data (Pandas)

Load the Titanic dataset, handle missing values (fill or drop), detect outliers using Z‑score or IQR and organise steps into a data processing pipeline function.
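
The Z-score approach mentioned here can be sketched as follows. The helper `zscore_outliers`, its `threshold` parameter and the `fares` values are assumptions made for illustration; they are not taken from the lab file.

```python
import pandas as pd

def zscore_outliers(series, threshold=3.0):
    """Return the values whose absolute Z-score exceeds `threshold`."""
    # pandas' std() uses the sample standard deviation (ddof=1) by default
    z = (series - series.mean()) / series.std()
    return series[z.abs() > threshold]

# Made-up fares with one obviously extreme value
fares = pd.Series([7.25, 8.05, 7.9, 13.0, 26.0, 8.5, 10.5, 7.75, 9.0, 512.0])

print(zscore_outliers(fares, threshold=2.0))  # flags only 512.0
print(zscore_outliers(fares, threshold=3.0))  # empty for this small sample
```

With only ten values and the sample standard deviation, no point can reach a Z-score of 3, which is why the usual threshold finds nothing here. On small samples the IQR rule is often the more practical choice.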

### Lab‑5: Health Data Visualisation

Plot multiple charts from a CSV file: line chart, bar chart, scatter plot and pie chart. Illustrate use of axes labels, titles, legends and saving figures.

### Lab‑6: Heart Data Visualisation

Use Seaborn to create pairplots, boxplots, violin plots, heatmaps, jointplots and barplots with the heart disease dataset.
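
A barplot is the one chart type in this list that is not otherwise shown in this notebook. Below is a minimal sketch on a made-up table (the `demo` frame and its `group` and `value` columns are invented here, so it runs without the heart dataset).

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Tiny made-up table: three observations per group
demo = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [120, 130, 125, 140, 150, 145],
})

fig, ax = plt.subplots()
# By default each bar shows the mean of `value` within the group
sns.barplot(x="group", y="value", data=demo, ax=ax)
ax.set_xlabel("Group")
ax.set_ylabel("Mean value")
ax.set_title("Mean value per group")
plt.show()
```

Passing `ax=ax` keeps the Seaborn plot on an explicitly created Matplotlib axes, which makes it easier to set labels and titles afterwards.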

The following cells provide implementations of these tasks.


In [None]:
# Lab‑1 Problem 1: Combine two lists, printing pairs
list1 = [1, 2, 3, 4]
list2 = ['a', 'b', 'c', 'd']

# Iterate over list1 normally and list2 reversed
for x, y in zip(list1, reversed(list2)):
    print(x, y)

# Lab‑1 Problem 2: Word search in a 2D grid

def search_word(grid, word):
    """Return True if the word exists horizontally or vertically in the grid."""
    rows = len(grid)
    cols = len(grid[0])
    word_len = len(word)

    # Check horizontally
    for r in range(rows):
        for c in range(cols - word_len + 1):
            if ''.join(grid[r][c:c+word_len]) == word:
                return True
    # Check vertically
    for c in range(cols):
        for r in range(rows - word_len + 1):
            if ''.join(grid[r+i][c] for i in range(word_len)) == word:
                return True
    return False

sample_grid = [
    ['C', 'A', 'T', 'F'],
    ['B', 'G', 'E', 'S'],
    ['I', 'T', 'A', 'E'],
    ['S', 'O', 'N', 'G']
]

print(search_word(sample_grid, 'CAT'))  # True
print(search_word(sample_grid, 'DOG'))  # False

# Lab‑1 Problem 3: Rotate matrix 90 degrees clockwise in place

def rotate_matrix_clockwise(mat):
    """Rotate a square matrix 90 degrees clockwise in place."""
    n = len(mat)
    # transpose
    for i in range(n):
        for j in range(i+1, n):
            mat[i][j], mat[j][i] = mat[j][i], mat[i][j]
    # reverse each row
    for i in range(n):
        mat[i].reverse()
    return mat

matrix = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
]
rotated = rotate_matrix_clockwise(matrix)
print("Rotated matrix:\n", rotated)

# Lab‑1 Problem 4: Validate Sudoku board

def is_valid_sudoku(board):
    """Check whether a 9x9 Sudoku board is valid."""
    def is_unit_valid(unit):
        unit = [i for i in unit if i != '.']
        return len(unit) == len(set(unit))

    # check rows and columns
    for i in range(9):
        if not is_unit_valid(board[i]) or not is_unit_valid([board[r][i] for r in range(9)]):
            return False
    # check 3x3 sub-grids
    for i in [0, 3, 6]:
        for j in [0, 3, 6]:
            grid = [board[x][y] for x in range(i, i+3) for y in range(j, j+3)]
            if not is_unit_valid(grid):
                return False
    return True

sample_board = [
    ['5','3','.','.','7','.','.','.','.'],
    ['6','.','.','1','9','5','.','.','.'],
    ['.','9','8','.','.','.','.','6','.'],
    ['8','.','.','.','6','.','.','.','3'],
    ['4','.','.','8','.','3','.','.','1'],
    ['7','.','.','.','2','.','.','.','6'],
    ['.','6','.','.','.','.','2','8','.'],
    ['.','.','.','4','1','9','.','.','5'],
    ['.','.','.','.','8','.','.','7','9']
]
print(is_valid_sudoku(sample_board))  # True


In [None]:
# Lab‑2: Shopping Cart Implementation
class Product:
    def __init__(self, product_id, name, price):
        self.product_id = product_id
        self.name = name
        self.price = price

class CartItem:
    def __init__(self, product, quantity=1):
        self.product = product
        self.quantity = quantity

    def total_price(self):
        return self.product.price * self.quantity

class ShoppingCart:
    def __init__(self):
        self._items = []  # use underscore to indicate non-public

    def add_item(self, product, quantity=1):
        for item in self._items:
            if item.product.product_id == product.product_id:
                item.quantity += quantity
                return
        self._items.append(CartItem(product, quantity))

    def remove_item(self, product_id):
        self._items = [item for item in self._items if item.product.product_id != product_id]

    def total_cost(self):
        return sum(item.total_price() for item in self._items)

    def clear(self):
        self._items.clear()

class Customer:
    def __init__(self, name):
        self.name = name
        self.cart = ShoppingCart()

    def checkout(self):
        total = self.cart.total_cost()
        self.cart.clear()
        return total

# Demonstration
p1 = Product(101, 'Milk', 2.5)
p2 = Product(102, 'Bread', 1.5)

customer = Customer('Jane')
customer.cart.add_item(p1, 2)
customer.cart.add_item(p2, 1)
customer.cart.add_item(p1, 1)  # increase quantity of Milk

print("Total cost:", customer.cart.total_cost())
print("Checking out... You owe:", customer.checkout())
print("Cart after checkout:", customer.cart._items)


In [None]:
# Lab‑3: NumPy Operations
import numpy as np

# Sample height in inches and weight in pounds
tallness_inches = np.array([65, 70, 72, 60])
weights_pounds = np.array([150, 180, 190, 120])

heights_m = tallness_inches * 0.0254
weights_kg = weights_pounds * 0.453592

bmi = weights_kg / (heights_m ** 2)
print("BMI:", bmi)
print("Overweight flags:", bmi >= 25)

# 2D array operations
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Sum of all elements:", np.sum(a))
print("Mean of each column:", np.mean(a, axis=0))
print("Row with maximum sum:", np.argmax(np.sum(a, axis=1)))
print("Indices that would sort the array flattened:", np.argsort(a.flatten()))

# Simulate random height, weight and age for 10 people
np.random.seed(0)
heights_random = np.random.normal(loc=1.7, scale=0.1, size=10)
weights_random = np.random.normal(loc=70, scale=5, size=10)
ages_random = np.random.randint(18, 60, size=10)

print("Random heights:", heights_random)
print("Random weights:", weights_random)
print("Random ages:", ages_random)


In [None]:
# Lab‑4: Titanic Data Processing
import pandas as pd

file_path = '/home/oai/share/Lab4_Titanic.csv'
titanic = pd.read_csv(file_path)

def clean_titanic(df):
    df = df.copy()
    df['Age'] = df['Age'].fillna(df['Age'].median())
    if 'Cabin' in df.columns:
        df = df.drop(columns=['Cabin'])
    if 'Embarked' in df.columns:
        df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    return df

# Function to detect outliers using IQR
def detect_outliers(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return series[(series < lower) | (series > upper)]

cleaned = clean_titanic(titanic)
print("Missing values after cleaning:\n", cleaned.isnull().sum())

outliers = detect_outliers(cleaned['Fare'])
print("Number of fare outliers:", len(outliers))

# Pipeline function

def process_titanic(df):
    df = clean_titanic(df)
    outlier_indices = detect_outliers(df['Fare']).index
    df = df.drop(index=outlier_indices)
    df['Sex_num'] = df['Sex'].map({'male': 1, 'female': 0})
    return df

processed = process_titanic(titanic)
print("Processed shape:", processed.shape)


In [None]:
# Lab‑5: Visualising health data with Matplotlib
import pandas as pd
import matplotlib.pyplot as plt

health = pd.read_csv('/home/oai/share/health_data.csv')

# Line chart: steps per month
plt.figure()
plt.plot(health['Month'], health['Steps'])
plt.title('Steps per Month')
plt.xlabel('Month')
plt.ylabel('Steps')
plt.show()

# Bar chart: average sleep hours per month
plt.figure()
plt.bar(health['Month'], health['Sleep_Hours'])
plt.title('Average Sleep Hours per Month')
plt.xlabel('Month')
plt.ylabel('Sleep Hours')
plt.show()

# Scatter plot: sleep hours vs heart rate
plt.figure()
plt.scatter(health['Sleep_Hours'], health['Avg_Heart_Rate'])
plt.title('Sleep Hours vs Average Heart Rate')
plt.xlabel('Sleep Hours')
plt.ylabel('Average Heart Rate')
plt.show()

# Pie chart: proportion of steps and calories burned
total_steps = health['Steps'].sum()
total_calories = health['Calories_Burned'].sum()
plt.figure()
plt.pie([total_steps, total_calories], labels=['Steps', 'Calories'], autopct='%1.1f%%')
plt.title('Proportion of Steps vs Calories Burned')
plt.show()


In [None]:
# Lab‑6: Seaborn visualisation with heart data
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

heart = pd.read_csv('/home/oai/share/heart.csv')

# Pairplot of selected features coloured by target
sns.pairplot(heart[['age','chol','thalach','trestbps','target']], hue='target')
plt.suptitle('Pairplot of Heart Features', y=1.02)
plt.show()

# Boxplot of resting blood pressure by heart disease status
plt.figure()
sns.boxplot(x='target', y='trestbps', data=heart)
plt.xlabel('Target (0=No Disease, 1=Disease)')
plt.ylabel('Resting Blood Pressure (trestbps)')
plt.title('Resting Blood Pressure by Heart Disease Status')
plt.show()

# Violin plot of maximum heart rate by sex
plt.figure()
sns.violinplot(x='sex', y='thalach', data=heart)
plt.xlabel('Sex (0=Female, 1=Male)')
plt.ylabel('Maximum Heart Rate (thalach)')
plt.title('Distribution of Max Heart Rate by Sex')
plt.show()

# Heatmap of correlation matrix
plt.figure()
corr = heart.corr()
sns.heatmap(corr, annot=True, fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

# Jointplot: age vs maximum heart rate
sns.jointplot(x='age', y='thalach', data=heart, kind='scatter')
plt.suptitle('Age vs Maximum Heart Rate', y=1.02)
plt.show()


## Assignment‑1: Student, Department & Institute

The graded assignment requires building a hierarchy of classes to model an educational institute. The tasks demonstrate OOP design with NumPy to process numerical data. The classes and their responsibilities are:

1. **Student**: represents a single student with a roll number, a name and a NumPy array of scores. Implements methods:
   - `average()` – return the mean of the scores.
   - `highest_score()` and `lowest_score()` – return the maximum/minimum score.
   - `standard_deviation()` – return the standard deviation (use `np.std`).
   - `__gt__(other)` – compare two students by their average score.

2. **Department**: contains a list of `Student` objects. Methods:
   - `add_student(student)` – add a student.
   - `department_average()` – average score across all students.
   - `topper()` – student with highest average.
   - `weakest_subject()` – subject (index) with lowest average across all students.
   - `rank_students()` – return students sorted by average score in descending order.

3. **Institute**: aggregates multiple departments. Methods:
   - `add_department(dept)` – add a department.
   - `institute_average()` – average of department averages.
   - `best_department()` – department with highest average.
   - `overall_topper()` – student with highest average in the whole institute.
   - `search_by_roll(roll_number)` – return the student with matching roll number.

The following implementation satisfies these requirements and demonstrates the functionality.
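
One detail worth keeping in mind for `institute_average()`: an average of department averages weights every department equally, so it can differ from the average over all students when departments have different sizes. A small sketch with made-up numbers (`dept_a` and `dept_b` are illustrative arrays of per-student averages):

```python
import numpy as np

# Made-up per-student averages for two departments of different size
dept_a = np.array([90.0, 80.0])              # 2 students, mean 85
dept_b = np.array([60.0, 60.0, 60.0, 60.0])  # 4 students, mean 60

# Average of department averages: each department counts once
mean_of_means = np.mean([dept_a.mean(), dept_b.mean()])

# Pooled average: each student counts once
pooled_mean = np.concatenate([dept_a, dept_b]).mean()

print(mean_of_means)  # 72.5
print(pooled_mean)    # about 68.33
```

In the demonstration for this assignment every department has the same number of students, so the two quantities coincide there.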


In [None]:
import numpy as np

class Student:
    def __init__(self, roll_number, name, scores):
        self.roll_number = roll_number
        self.name = name
        self.scores = np.array(scores, dtype=float)

    def average(self):
        return np.mean(self.scores)

    def highest_score(self):
        return np.max(self.scores)

    def lowest_score(self):
        return np.min(self.scores)

    def standard_deviation(self):
        return np.std(self.scores)

    def __gt__(self, other):
        return self.average() > other.average()

    def __str__(self):
        return f"{self.roll_number} - {self.name} (Avg: {self.average():.2f})"

class Department:
    def __init__(self, name):
        self.name = name
        self.students = []

    def add_student(self, student):
        self.students.append(student)

    def department_average(self):
        return np.mean([s.average() for s in self.students])

    def topper(self):
        return max(self.students, key=lambda s: s.average())

    def weakest_subject(self):
        scores_matrix = np.vstack([s.scores for s in self.students])
        subject_means = scores_matrix.mean(axis=0)
        return int(np.argmin(subject_means))

    def rank_students(self):
        return sorted(self.students, key=lambda s: s.average(), reverse=True)

class Institute:
    def __init__(self, name):
        self.name = name
        self.departments = []

    def add_department(self, dept):
        self.departments.append(dept)

    def institute_average(self):
        return np.mean([d.department_average() for d in self.departments])

    def best_department(self):
        return max(self.departments, key=lambda d: d.department_average())

    def overall_topper(self):
        return max([student for dept in self.departments for student in dept.students], key=lambda s: s.average())

    def search_by_roll(self, roll_number):
        for dept in self.departments:
            for student in dept.students:
                if student.roll_number == roll_number:
                    return student
        return None

# Demonstration
np.random.seed(42)

institute = Institute('IIT Bhilai')

dept_names = ['Computer Science', 'Electrical Engineering', 'Mechanical Engineering']
num_subjects = 6

for dept_name in dept_names:
    dept = Department(dept_name)
    for i in range(1, 6):
        roll = f"{dept_name[:2].upper()}{i:02d}"
        scores = np.random.randint(50, 101, size=num_subjects)
        student = Student(roll, f"Student_{roll}", scores)
        dept.add_student(student)
    institute.add_department(dept)

for dept in institute.departments:
    print(f"Department: {dept.name}")
    print(f"Average score: {dept.department_average():.2f}")
    print(f"Topper: {dept.topper()}")
    print(f"Weakest subject index: {dept.weakest_subject()}")
    print("Ranking:")
    for rank, student in enumerate(dept.rank_students(), start=1):
        print(f"  {rank}. {student}")
    print()

print("Institute average:", institute.institute_average())
print("Best department:", institute.best_department().name)
print("Overall topper:", institute.overall_topper())

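# Search by roll number (rolls are built from the first two letters of the
# department name, so Computer Science students are 'CO01' to 'CO05')
search_roll = 'CO01'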
found = institute.search_by_roll(search_roll)
print(f"Search for {search_roll}:", found if found else 'Not found')
