# Lab 1: Working with Columns in Pandas

## Student Information
**Name:** Orchlon Chinbat
**Student ID:** 50291063

## Project Overview
This notebook demonstrates how to:
- Create a structured dataset with multiple data types
- Build and manipulate a pandas DataFrame
- Perform exploratory data analysis
- Add computed columns (Score → Letter Grade conversion)
- Export data to CSV format

The dataset is synthetic: it describes 50 students, each with an ID, an academic score, a major, an expected graduation year, and a pass/fail flag.


## 1. Data Generation

In this section, we generate synthetic student data with the following columns:

| Column | Data Type | Description |
|--------|-----------|-------------|
| `Student_ID` | Integer | Unique identifier (10001-10050) |
| `Score` | Float | Academic score (0.0-100.0) |
| `Major` | String | Student's major (10 different majors) |
| `Graduation_Year` | Integer | Expected graduation year (2024-2027) |
| `Pass/Fail` | Boolean | `True` if score ≥ 60, otherwise `False` |
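
The constraints in the table can be checked programmatically once the data is in a DataFrame. The sketch below is illustrative only: `sample` is a hypothetical three-row frame and `check_schema` is a helper name introduced here, not part of the lab code.

```python
import pandas as pd

# Hypothetical three-row sample that follows the schema in the table above.
sample = pd.DataFrame({
    "Student_ID": [10001, 10002, 10003],
    "Score": [63.94, 2.50, 27.50],
    "Major": ["Engineering", "Engineering", "Chemistry"],
    "Graduation_Year": [2025, 2027, 2024],
})
sample["Pass/Fail"] = sample["Score"] >= 60


def check_schema(frame):
    """Return a dict mapping each rule name to True or False (illustrative helper)."""
    return {
        "ids_unique": bool(frame["Student_ID"].is_unique),
        # between() is inclusive on both ends by default
        "scores_in_range": bool(frame["Score"].between(0.0, 100.0).all()),
        "years_in_range": bool(frame["Graduation_Year"].between(2024, 2027).all()),
        "pass_matches_score": bool((frame["Pass/Fail"] == (frame["Score"] >= 60)).all()),
    }


results = check_schema(sample)
print(results)
```

Running the same helper on the full 50-row DataFrame would be a quick sanity check before any analysis.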


In [1]:
# Import necessary libraries
import pandas as pd          # For DataFrame manipulation and data analysis
import random                # For generating random data
import numpy as np           # For numerical operations and array handling

# Set random seed for reproducibility
random.seed(42)
np.random.seed(42)

# Define the number of students in our dataset
n_students = 50

# ===== Generate Student IDs =====
# Create sequential integer IDs from 10001 to 10050
# Using range() creates consecutive integers, list() converts to a list
student_ids = list(range(10001, 10001 + n_students))

# ===== Generate Academic Scores =====
# Create random floating-point scores between 0.0 and 100.0
# random.uniform(a, b) generates a random float between a and b
# round(value, 2) rounds to 2 decimal places for realistic grade representation
scores = [round(random.uniform(0.0, 100.0), 2) for _ in range(n_students)]

# ===== Assign Academic Majors =====
# Define a list of 10 different possible majors (categorical data)
majors_list = ["Computer Science", "Mathematics", "Physics", "Chemistry", "Biology", 
               "Engineering", "Psychology", "Economics", "Literature", "History"]

# Randomly assign a major to each student
# random.choice(list) picks a random element from the list
majors = [random.choice(majors_list) for _ in range(n_students)]

# ===== Generate Graduation Years =====
# random.randint(a, b) generates a random integer between a and b (both inclusive)
graduation_years = [random.randint(2024, 2027) for _ in range(n_students)]

# ===== Compute Pass/Fail Status =====
# Determine pass/fail based on score threshold of 60.0
# Creates a boolean value: True if score >= 60, False otherwise
pass_fail = [score >= 60.0 for score in scores]

# Display summary of the created dataset
print("Dataset created successfully!")
print(f"Total students: {len(student_ids)}")
print("Data types: Student_ID (int), Score (float), Major (str), Graduation_Year (int), Pass/Fail (bool)")


Dataset created successfully!
Total students: 50
Data types: Student_ID (int), Score (float), Major (str), Graduation_Year (int), Pass/Fail (bool)


In [2]:
# Create a dictionary to organize our data
# Each key is a column name, and each value is a list of data for that column
# All lists must have the same length (50 elements in this case)
data = {
    'Student_ID': student_ids,           # Column 1: Integer IDs
    'Score': scores,                      # Column 2: Float scores
    'Major': majors,                      # Column 3: String (categorical) data
    'Graduation_Year': graduation_years,  # Column 4: Integer years
    'Pass/Fail': pass_fail               # Column 5: Boolean values
}

# Convert the dictionary to a pandas DataFrame
# DataFrames are the primary data structure in pandas for tabular data
# Think of it as a spreadsheet or SQL table in Python
df = pd.DataFrame(data)

# ===== Preview the DataFrame =====

# Display the first 5 rows to see the beginning of our dataset
# head() method shows the top N rows (default is 5)
print("First 5 rows of the DataFrame:")
print(df.head(5))
print("\n" + "="*80 + "\n")  # Print a separator line for readability

# Display the last 5 rows to see the end of our dataset
# tail() method shows the bottom N rows (default is 5)
print("Last 5 rows of the DataFrame:")
print(df.tail(5))
print("\n" + "="*80 + "\n")

# Show the data types of each column
# dtypes attribute reveals what type of data is stored in each column
print("DataFrame dtypes:")
print(df.dtypes)


First 5 rows of the DataFrame:
   Student_ID  Score        Major  Graduation_Year  Pass/Fail
0       10001  63.94  Engineering             2025       True
1       10002   2.50  Engineering             2027      False
2       10003  27.50    Chemistry             2024      False
3       10004  22.32      Biology             2027      False
4       10005  73.65  Mathematics             2027       True


Last 5 rows of the DataFrame:
    Student_ID  Score             Major  Graduation_Year  Pass/Fail
45       10046  23.28         Economics             2025      False
46       10047  10.10       Mathematics             2025      False
47       10048  27.80  Computer Science             2026      False
48       10049  63.57       Mathematics             2027       True
49       10050  36.48           Physics             2026      False


DataFrame dtypes:
Student_ID           int64
Score              float64
Major               object
Graduation_Year      int64
Pass/Fail             bool
dtype: object

## 2. Creating the DataFrame

Now we organize our data into a pandas DataFrame, which provides a structured, table-like format for analysis. We'll also preview the first and last few rows to verify our data was created correctly.
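
When a DataFrame is built from a dictionary of lists, every list must have the same length. The sketch below uses a deliberately mismatched, hypothetical `columns` dictionary to show what happens otherwise; it is not part of the lab code.

```python
import pandas as pd

# Hypothetical input: "Score" is one element shorter than "Student_ID".
columns = {"Student_ID": [10001, 10002, 10003], "Score": [63.94, 2.50]}

try:
    pd.DataFrame(columns)
    built = True
except ValueError as err:
    # pandas refuses to build a frame from columns of unequal length
    built = False
    print(f"Could not build DataFrame: {err}")

# Comparing lengths up front makes the mismatch easy to spot.
lengths = {name: len(values) for name, values in columns.items()}
print(lengths)
```

Checking the lengths before calling `pd.DataFrame` is optional, but it points directly at the offending column.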


In [9]:
# ===== Analyze Numeric Columns =====

# Select only numeric columns (int64 and float64 types)
# select_dtypes() filters columns by data type
# np.number includes all numeric types (integers and floats)
numeric_columns = df.select_dtypes(include=[np.number])

# Generate descriptive statistics for numeric data
# describe() automatically calculates:
#   - count: number of non-null values
#   - mean: average value
#   - std: standard deviation (measure of spread)
#   - min: minimum value
#   - 25%: first quartile (25th percentile)
#   - 50%: median (50th percentile)
#   - 75%: third quartile (75th percentile)
#   - max: maximum value
print("\nDescriptive statistics:")
print(numeric_columns.describe())

# info() prints a concise summary of the DataFrame itself and returns None,
# so wrapping it in print() also prints "None" after the summary
print("\n" + "="*80 + "\n")
print(numeric_columns.info())

print("\n" + "="*80 + "\n")

# ===== Analyze Categorical Columns =====

print("UNIQUE VALUES IN CATEGORICAL COLUMNS:")

# Select only categorical columns (object type, which includes strings)
# In pandas, text/string data is stored as 'object' dtype
categorical_columns = df.select_dtypes(include=['object'])

# Loop through each categorical column to examine its unique values
for column in categorical_columns.columns:
    # Count the number of unique (distinct) values in this column
    # nunique() returns the count of unique values
    unique_count = df[column].nunique()
    
    # Get an array of all unique values in this column
    # unique() returns an array with one instance of each distinct value
    unique_values = df[column].unique()
    
    # Display the results for this column
    print(f"\nColumn: {column}")
    print(f"Number of unique values: {unique_count}")
    print(f"Unique values: {list(unique_values)}")


Descriptive statistics:
        Student_ID     Score  Graduation_Year
count     50.00000  50.00000        50.000000
mean   10025.50000  45.06760      2025.440000
std       14.57738  29.32915         1.109514
min    10001.00000   0.65000      2024.000000
25%    10013.25000  21.90500      2024.250000
50%    10025.50000  46.36500      2025.000000
75%    10037.75000  69.27500      2026.000000
max    10050.00000  97.31000      2027.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Student_ID       50 non-null     int64  
 1   Score            50 non-null     float64
 2   Graduation_Year  50 non-null     int64  
dtypes: float64(1), int64(2)
memory usage: 1.3 KB
None


UNIQUE VALUES IN CATEGORICAL COLUMNS:

Column: Major
Number of unique values: 10
Unique values: ['Engineering', 'Chemistry', 'Biology', 'Mathematics', 'History', 'Physics

## 3. Exploratory Data Analysis

Let's explore our dataset by examining:
- **Descriptive statistics** for numeric columns (mean, standard deviation, min/max, quartiles)
- **Unique values** in categorical columns to understand the diversity of our data
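
As a complement to the two points above, the sketch below summarizes only the `Score` column (statistics such as the mean of an ID column carry no meaning) and counts how often each major occurs. The four-row `sample` frame is hypothetical and stands in for the full dataset.

```python
import pandas as pd

# Hypothetical four-row sample with the same column names as the lab data.
sample = pd.DataFrame({
    "Student_ID": [10001, 10002, 10003, 10004],
    "Score": [63.94, 2.50, 27.50, 73.65],
    "Major": ["Engineering", "Engineering", "Chemistry", "Mathematics"],
})

# describe() on a single numeric column returns a Series of summary statistics.
score_stats = sample["Score"].describe()

# value_counts() shows how many rows fall into each category,
# which nunique() and unique() alone do not reveal.
major_counts = sample["Major"].value_counts()

print(score_stats)
print(major_counts)
```

Here `value_counts()` reports the frequency of each major, while `nunique()` only reports how many distinct majors exist.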


In [48]:
# ===== Define Letter Grade Conversion Function =====

def score_to_letter_grade(score):
    """
    Convert a numeric score to a letter grade based on standard grading scale.
    
    This function takes a single numeric score and returns the corresponding
    letter grade using traditional academic grading thresholds.
    
    Parameters:
    -----------
    score : float
        The numeric score to convert (expected range: 0-100)
    
    Returns:
    --------
    str
        Letter grade ('A', 'B', 'C', 'D', or 'F')
    
    Grading Scale (scores are floats, so each band is half-open):
    - A: 90 and above        (Excellent)
    - B: 80 to below 90      (Good)
    - C: 70 to below 80      (Satisfactory)
    - D: 60 to below 70      (Passing)
    - F: Below 60            (Failing)
    """
    # Check conditions from highest to lowest grade
    # The order matters: we check >= 90 first, so lower scores won't match
    if score >= 90:
        return 'A'
    elif score >= 80:   # This means: 80 <= score < 90
        return 'B'
    elif score >= 70:   # This means: 70 <= score < 80
        return 'C'
    elif score >= 60:   # This means: 60 <= score < 70
        return 'D'
    else:               # This means: score < 60
        return 'F'

# ===== Apply the Function to Create Computed Column =====

# Use Series.apply() to call our function once for every value in the Score column
# apply() takes a function as an argument and returns a new Series of the results
# Assigning that Series to df['Letter_Grade'] creates the new column
# This is a COMPUTED column because it's derived from existing data (Score)
df['Letter_Grade'] = df['Score'].apply(score_to_letter_grade)

# Display summary information about the updated DataFrame
print(f"\nDataFrame now has {len(df.columns)} columns: {list(df.columns)}")

# Show the distribution of letter grades
# value_counts() counts how many times each unique value appears
# sort_index() sorts the results alphabetically (A, B, C, D, F)
print("\nLetter grade distribution:")
print(df['Letter_Grade'].value_counts().sort_index())



DataFrame now has 6 columns: ['Student_ID', 'Score', 'Major', 'Graduation_Year', 'Pass/Fail', 'Letter_Grade']

Letter grade distribution:
Letter_Grade
A     2
B     7
C     3
D     7
F    31
Name: count, dtype: int64


## 4. Adding a Computed Column

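We'll now add a **computed column** called `Letter_Grade` that converts the numeric `Score` into traditional letter grades. Because scores are floats, each band includes its lower bound and excludes its upper bound:

- **A**: 90 and above
- **B**: 80 to below 90
- **C**: 70 to below 80
- **D**: 60 to below 70
- **F**: Below 60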

This demonstrates how to derive new information from existing columns using a custom function.
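
The same mapping can also be expressed without a custom function. The sketch below is an alternative, not the approach used in this lab: it bins a hypothetical `scores` Series with `pd.cut`, using `right=False` so that each bin includes its lower edge, which matches the `>=` comparisons in the grading thresholds.

```python
import numpy as np
import pandas as pd

# Hypothetical scores chosen to sit on and around the grade boundaries.
scores = pd.Series([59.99, 60.0, 69.99, 70.0, 80.0, 89.5, 90.0, 100.0])

# Bin edges from lowest to highest; right=False makes each bin [low, high).
bins = [-np.inf, 60, 70, 80, 90, np.inf]
labels = ["F", "D", "C", "B", "A"]

# pd.cut returns a categorical Series; astype(str) converts it to plain strings.
letter_grades = pd.cut(scores, bins=bins, labels=labels, right=False).astype(str)
print(letter_grades.tolist())
```

A custom function with `apply()` is easier to read for a handful of thresholds, whereas `pd.cut` avoids a Python-level call per row, which may matter for larger datasets.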


In [51]:
# ===== Save DataFrame to CSV File =====

# Define the output filename
csv_filename = 'student_data_with_grades.csv'

# Export the DataFrame to CSV (Comma-Separated Values) format
# to_csv() writes the DataFrame to a CSV file
# index=False prevents pandas from writing row numbers as a separate column
df.to_csv(csv_filename, index=False)

# Display confirmation message with file details
print(f"DataFrame saved to '{csv_filename}' successfully!")
print(f"\nFile contains {len(df)} rows and {len(df.columns)} columns")
print(f"Columns: {', '.join(df.columns)}")

DataFrame saved to 'student_data_with_grades.csv' successfully!

File contains 50 rows and 6 columns
Columns: Student_ID, Score, Major, Graduation_Year, Pass/Fail, Letter_Grade


## 5. Exporting to CSV

Finally, we save our complete DataFrame (now with 6 columns including the computed `Letter_Grade`) to a CSV file. This allows us to:
- Share the data with others
- Use it in other programs or analysis tools
- Create a permanent record of our processed data
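
One way to confirm that an export is usable is to read it back and compare it with the original. The sketch below writes a hypothetical two-row `sample` frame to an in-memory buffer instead of a file on disk, so nothing is left behind. The column names mirror the lab data, but the values are illustrative.

```python
import io

import pandas as pd

# Hypothetical two-row sample using a subset of the lab's column names.
sample = pd.DataFrame({
    "Student_ID": [10001, 10002],
    "Score": [63.94, 2.50],
    "Pass/Fail": [True, False],
    "Letter_Grade": ["D", "F"],
})

# Write to an in-memory text buffer; index=False omits the row index column.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.dtypes)
```

Because CSV stores only text, pandas re-infers the column types on reading, so it is worth inspecting `restored.dtypes` after a round trip.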
