# Week 1: Know Your Tools, Know Your Data
## Data Science for Biomedical Informatics

---

### Week 1 Mantra
> *"Before you analyze, you must organize.*  
> *Before you explore, you must understand what you have."*

---

In this notebook, we'll walk through the **10-Point Inspection** - a systematic approach to understanding any dataset before diving into analysis.

## Setup: Import Libraries and Load Data

This is my first Python test:
print("Hello World! -- explore this")

In [3]:
!pip install pandas


[notice] A new release of pip is available: 24.2 -> 25.3
[notice] To update, run: pip install --upgrade pip


In [4]:
# Import the essential libraries
import pandas as pd
import numpy as np

# Display settings for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("Libraries imported successfully!")
print(f"pandas version: {pd.__version__}")
print(f"numpy version: {np.__version__}")

Libraries imported successfully!
pandas version: 2.3.3
numpy version: 2.4.1


In [5]:
# Load the Student Performance dataset
df = pd.read_csv('student_performance.csv')

print("Dataset loaded successfully!")

Dataset loaded successfully!


---

# The 10-Point Inspection

Like a pilot's pre-flight checklist, we go through each point methodically before taking off with our analysis.

| Point | What We Check | Command |
|-------|---------------|----------|
| 1 | Shape | `df.shape` |
| 2 | Columns | `df.columns` |
| 3 | Data Types | `df.dtypes` |
| 4 | First Look | `df.head()` |
| 5 | Last Look | `df.tail()` |
| 6 | Memory | `df.memory_usage()` |
| 7 | Missing | `df.isnull().sum()` |
| 8 | Duplicates | `df.duplicated()` |
| 9 | Statistics | `df.describe()` |
| 10 | Unique | `df.nunique()` |

---

## Point 1: Shape

**Question:** How big is our dataset?

The shape tells us:
- **Rows** = Number of observations (students in our case)
- **Columns** = Number of features/variables

In [6]:
# Point 1: Shape
print("Dataset Shape:")
print(df.shape)

print(f"\n📊 We have {df.shape[0]:,} students and {df.shape[1]} features")
print(f"📊 Total data points: {df.shape[0] * df.shape[1]:,}")

Dataset Shape:
(14003, 16)

📊 We have 14,003 students and 16 features
📊 Total data points: 224,048


---

## Point 2: Column Names

**Question:** What features do we have?

Understanding column names helps us:
- Know what data is available
- Identify potential issues (spaces, special characters)
- Start thinking about feature relationships

In [7]:
# Point 2: Column Names
print("Column Names:")
print(df.columns.tolist())

print(f"\n📋 Total columns: {len(df.columns)}")

Column Names:
['StudyHours', 'Attendance', 'Resources', 'Extracurricular', 'Motivation', 'Internet', 'Gender', 'Age', 'LearningStyle', 'OnlineCourses', 'Discussions', 'AssignmentCompletion', 'ExamScore', 'EduTech', 'StressLevel', 'FinalGrade']

📋 Total columns: 16


---

## Point 3: Data Types

**Question:** How is each column stored?

⚠️ **KEY INSIGHT:** Just because pandas says `int64` doesn't mean it should be treated as numeric!

In [8]:
# Point 3: Data Types
print("Data Types:")
print(df.dtypes)

print("\n" + "="*50)
print("Data Type Summary:")
print(df.dtypes.value_counts())

Data Types:
StudyHours              int64
Attendance              int64
Resources               int64
Extracurricular         int64
Motivation              int64
Internet                int64
Gender                  int64
Age                     int64
LearningStyle           int64
OnlineCourses           int64
Discussions             int64
AssignmentCompletion    int64
ExamScore               int64
EduTech                 int64
StressLevel             int64
FinalGrade              int64
dtype: object

Data Type Summary:
int64    16
Name: count, dtype: int64


**Observation:** All columns are `int64`, but this doesn't mean they're all truly numeric! Some are encoded categorical variables.
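
One way to act on this observation is to tell pandas which integer columns are really category codes. The sketch below uses a tiny made-up frame rather than the course file, and it assumes (without a data dictionary to confirm it) that `Gender` and `LearningStyle` are unordered codes while `StressLevel` is ordered from low to high.

```python
import pandas as pd

# Tiny stand-in for the course dataset (values are made up for illustration).
sample = pd.DataFrame({
    "Gender": [0, 1, 1, 0],
    "LearningStyle": [2, 3, 1, 0],
    "StressLevel": [1, 2, 0, 2],
    "ExamScore": [40, 66, 99, 71],
})

# Assumed nominal codes: no natural order, so a plain categorical is enough.
for col in ["Gender", "LearningStyle"]:
    sample[col] = sample[col].astype("category")

# Assumed ordinal codes: keep the order so sorting and comparisons stay meaningful.
stress_dtype = pd.CategoricalDtype(categories=[0, 1, 2], ordered=True)
sample["StressLevel"] = sample["StressLevel"].astype(stress_dtype)

print(sample.dtypes)

# describe() now summarizes only the column that is still numeric.
print(sample.describe())
```

After the conversion, `describe()` no longer reports a mean for `Gender`, which is usually what you want for an encoded label.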

---

## Point 4: First Look (Head)

**Question:** What do the first few rows look like?

This helps us:
- See actual data values
- Spot obvious data quality issues
- Understand the data format

In [10]:
# Point 4: First Look
print("First 5 Rows:")
df.head()

First 5 Rows:


Unnamed: 0,StudyHours,Attendance,Resources,Extracurricular,Motivation,Internet,Gender,Age,LearningStyle,OnlineCourses,Discussions,AssignmentCompletion,ExamScore,EduTech,StressLevel,FinalGrade
0,19,64,1,0,0,1,0,19,2,8,1,59,40,0,1,3
1,19,64,1,0,0,1,0,23,3,16,0,90,66,0,1,2
2,19,64,1,0,0,1,0,28,1,19,0,67,99,1,1,0
3,19,64,1,1,0,1,0,19,2,8,1,59,40,0,1,3
4,19,64,1,1,0,1,0,23,3,16,0,90,66,0,1,2


---

## Point 5: Last Look (Tail)

**Question:** What do the last few rows look like?

Why check the tail?
- Data might be sorted or ordered
- End of file might have different patterns
- Can reveal data entry issues

In [12]:
# Point 5: Last Look
print("Last 5 Rows:")
df.tail()

Last 5 Rows:


Unnamed: 0,StudyHours,Attendance,Resources,Extracurricular,Motivation,Internet,Gender,Age,LearningStyle,OnlineCourses,Discussions,AssignmentCompletion,ExamScore,EduTech,StressLevel,FinalGrade
13998,30,62,0,1,1,1,0,22,2,2,1,100,71,1,2,1
13999,30,62,0,1,1,1,0,23,3,12,1,72,55,1,1,2
14000,22,90,2,0,1,1,0,23,3,0,1,80,56,0,0,2
14001,22,90,2,0,1,1,0,29,2,16,0,50,62,1,2,2
14002,10,86,2,1,2,1,0,18,2,8,1,66,77,1,2,1


---

## Point 6: Memory Usage

**Question:** How much memory does our dataset use?

Important for:
- Knowing if data fits in RAM
- Planning for larger datasets
- Optimizing data types if needed

In [13]:
# Point 6: Memory Usage
print("Memory Usage by Column:")
print(df.memory_usage(deep=True))

# Total memory in MB
total_memory_mb = df.memory_usage(deep=True).sum() / 1e6
print(f"\n💾 Total Memory Usage: {total_memory_mb:.2f} MB")

Memory Usage by Column:
Index                      132
StudyHours              112024
Attendance              112024
Resources               112024
Extracurricular         112024
Motivation              112024
Internet                112024
Gender                  112024
Age                     112024
LearningStyle           112024
OnlineCourses           112024
Discussions             112024
AssignmentCompletion    112024
ExamScore               112024
EduTech                 112024
StressLevel             112024
FinalGrade              112024
dtype: int64

💾 Total Memory Usage: 1.79 MB


**Observation:** ~1.79 MB is a small dataset - loads instantly on any machine!
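
If a dataset were large enough for memory to matter, "optimizing data types" could look like the sketch below. It builds a made-up `int64` frame (not the course data) and downcasts each column with `pd.to_numeric(..., downcast="integer")`, which picks the smallest integer type that still holds every value.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same dtype as the course data (int64 everywhere).
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "Attendance": rng.integers(60, 101, size=1_000),
    "Internet": rng.integers(0, 2, size=1_000),
})

before = sample.memory_usage(deep=True).sum()

# Downcast each column to the smallest integer type that holds its values.
smaller = sample.apply(pd.to_numeric, downcast="integer")

after = smaller.memory_usage(deep=True).sum()
print(smaller.dtypes)
print(f"{before:,} bytes -> {after:,} bytes")
```

For a 1.79 MB dataset this is not worth doing; the point is only to show what the option looks like when you need it.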

---

## Point 7: Missing Values

**Question:** Do we have any missing data?

Missing data is crucial because:
- Many algorithms can't handle NaN values
- Missing patterns can be informative
- We may need imputation strategies

In [14]:
# Point 7: Missing Values
print("Missing Values by Column:")
missing = df.isnull().sum()
print(missing)

print("\n" + "="*50)
total_missing = missing.sum()
print(f"❓ Total Missing Values: {total_missing}")

if total_missing == 0:
    print("✅ Great! No missing values - complete dataset!")
else:
    print(f"⚠️ {total_missing} missing values need attention")

Missing Values by Column:
StudyHours              0
Attendance              0
Resources               0
Extracurricular         0
Motivation              0
Internet                0
Gender                  0
Age                     0
LearningStyle           0
OnlineCourses           0
Discussions             0
AssignmentCompletion    0
ExamScore               0
EduTech                 0
StressLevel             0
FinalGrade              0
dtype: int64

❓ Total Missing Values: 0
✅ Great! No missing values - complete dataset!


---

## Point 8: Duplicate Rows

**Question:** Do we have any duplicate records?

Duplicates can:
- Inflate sample size artificially
- Bias statistical analyses
- Indicate data collection issues

In [15]:
# Point 8: Duplicates
duplicate_count = df.duplicated().sum()
duplicate_pct = (duplicate_count / len(df)) * 100

print(f"📑 Duplicate Rows: {duplicate_count:,}")
print(f"📑 Percentage: {duplicate_pct:.2f}%")

if duplicate_count > 0:
    print(f"\n⚠️ Warning: {duplicate_pct:.2f}% of rows are duplicates!")
    print("   This needs investigation in Week 4 (Data Cleaning)")

📑 Duplicate Rows: 1,534
📑 Percentage: 10.95%

⚠️ Warning: 10.95% of rows are duplicates!
   This needs investigation in Week 4 (Data Cleaning)


In [16]:
# Let's look at some duplicate examples
print("Example of duplicate rows:")
duplicates = df[df.duplicated(keep=False)]
duplicates.head(10)

Example of duplicate rows:


Unnamed: 0,StudyHours,Attendance,Resources,Extracurricular,Motivation,Internet,Gender,Age,LearningStyle,OnlineCourses,Discussions,AssignmentCompletion,ExamScore,EduTech,StressLevel,FinalGrade
0,19,64,1,0,0,1,0,19,2,8,1,59,40,0,1,3
1,19,64,1,0,0,1,0,23,3,16,0,90,66,0,1,2
2,19,64,1,0,0,1,0,28,1,19,0,67,99,1,1,0
12,19,64,1,0,0,1,0,19,2,8,1,59,40,0,1,3
13,19,64,1,0,0,1,0,23,3,16,0,90,66,0,1,2
14,19,64,1,0,0,1,0,28,1,19,0,67,99,1,1,0
18,24,98,1,1,1,1,1,29,0,3,0,65,46,1,2,3
19,24,98,1,1,1,1,1,27,0,0,1,71,83,1,0,1
20,24,98,1,1,1,1,1,25,1,5,1,61,95,1,0,0
21,24,98,1,1,1,1,1,29,0,3,0,65,46,1,2,3
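
The rows above repeat in blocks (rows 0-2 reappear as rows 12-14), so it helps to count how often each repeated record occurs before deciding what to do with it. The sketch below uses made-up rows that mimic that pattern. Because the dataset has no student ID column, it cannot tell a true data-entry duplicate from two students who happen to share every value; dropping rows is shown only as one possible option.

```python
import pandas as pd

# Made-up rows that mimic the pattern above: a block of records repeated later.
sample = pd.DataFrame({
    "StudyHours": [19, 19, 19, 19, 19, 24],
    "Age":        [19, 23, 28, 19, 23, 29],
    "ExamScore":  [40, 66, 99, 40, 66, 46],
})

# Every row that has at least one identical twin (all copies are flagged).
all_copies = sample[sample.duplicated(keep=False)]

# How many times each repeated record occurs.
copies_per_record = all_copies.groupby(list(sample.columns)).size()
print(copies_per_record)

# Keep the first occurrence of each record; the original frame is untouched.
deduped = sample.drop_duplicates(keep="first")
print(f"{len(sample)} rows -> {len(deduped)} rows")
```

Note the difference between `duplicated()` (flags only the later copies, which is what the count above reports) and `duplicated(keep=False)` (flags every copy).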


---

## Point 9: Descriptive Statistics

**Question:** What are the statistical properties of our numeric columns?

This tells us:
- Central tendency (mean, median)
- Spread (std, min, max)
- Distribution hints (quartiles)

In [17]:
# Point 9: Descriptive Statistics
print("Descriptive Statistics:")
df.describe()

Descriptive Statistics:


Unnamed: 0,StudyHours,Attendance,Resources,Extracurricular,Motivation,Internet,Gender,Age,LearningStyle,OnlineCourses,Discussions,AssignmentCompletion,ExamScore,EduTech,StressLevel,FinalGrade
count,14003.0,14003.0,14003.0,14003.0,14003.0,14003.0,14003.0,14003.0,14003.0,14003.0,14003.0,14003.0,14003.0,14003.0,14003.0,14003.0
mean,19.987431,80.194316,1.104406,0.594158,0.905806,0.925516,0.551953,23.532172,1.515461,9.891952,0.60587,74.502535,70.346926,0.709062,1.304363,1.447904
std,5.890637,11.472181,0.697362,0.491072,0.695896,0.262566,0.497311,3.514293,1.112941,6.112801,0.48868,14.632177,17.688113,0.454211,0.785383,1.12155
min,5.0,60.0,0.0,0.0,0.0,0.0,0.0,18.0,0.0,0.0,0.0,50.0,40.0,0.0,0.0,0.0
25%,16.0,70.0,1.0,0.0,0.0,1.0,0.0,20.0,1.0,5.0,0.0,62.0,55.0,0.0,1.0,0.0
50%,20.0,80.0,1.0,1.0,1.0,1.0,1.0,24.0,2.0,10.0,1.0,74.0,70.0,1.0,2.0,1.0
75%,24.0,90.0,2.0,1.0,1.0,1.0,1.0,27.0,3.0,15.0,1.0,87.0,86.0,1.0,2.0,2.0
max,44.0,100.0,2.0,1.0,2.0,1.0,1.0,29.0,3.0,20.0,1.0,100.0,100.0,1.0,2.0,3.0


In [None]:
# Transpose for easier reading
print("Descriptive Statistics (Transposed for readability):")
df.describe().T

### Key Observations from Statistics:

| Feature | Range | Mean | Interpretation |
|---------|-------|------|----------------|
| StudyHours | 5-44 | ~20 | Plausible weekly study hours; the maximum of 44 is well above the 75th percentile (24) |
| Attendance | 60-100 | ~80 | No one below 60% |
| Age | 18-29 | ~23.5 | Typical college/grad ages |
| ExamScore | 40-100 | ~70 | Mean and median both near 70; distribution shape not yet checked |

---

## Point 10: Unique Values

**Question:** How many unique values does each column have?

This reveals:
- Binary features (2 unique values)
- Categorical features (few unique values)
- Continuous features (many unique values)

In [18]:
# Point 10: Unique Values
print("Unique Values per Column:")
unique_counts = df.nunique().sort_values()
print(unique_counts)

Unique Values per Column:
Extracurricular          2
Internet                 2
Gender                   2
Discussions              2
EduTech                  2
Resources                3
Motivation               3
StressLevel              3
LearningStyle            4
FinalGrade               4
Age                     12
OnlineCourses           21
StudyHours              37
Attendance              41
AssignmentCompletion    51
ExamScore               61
dtype: int64


In [19]:
# Categorize columns by unique count
print("\n" + "="*50)
print("Feature Classification by Unique Values:")
print("="*50)

binary = unique_counts[unique_counts == 2].index.tolist()
low_cardinality = unique_counts[(unique_counts > 2) & (unique_counts <= 5)].index.tolist()
high_cardinality = unique_counts[unique_counts > 5].index.tolist()

print(f"\n🟢 BINARY (2 values): {binary}")
print(f"\n🔵 LOW CARDINALITY (3-5 values): {low_cardinality}")
print(f"\n🟣 HIGH CARDINALITY (>5 values): {high_cardinality}")


Feature Classification by Unique Values:

🟢 BINARY (2 values): ['Extracurricular', 'Internet', 'Gender', 'Discussions', 'EduTech']

🔵 LOW CARDINALITY (3-5 values): ['Resources', 'Motivation', 'StressLevel', 'LearningStyle', 'FinalGrade']

🟣 HIGH CARDINALITY (>5 values): ['Age', 'OnlineCourses', 'StudyHours', 'Attendance', 'AssignmentCompletion', 'ExamScore']


---

# Feature Classification Summary

Based on our 10-Point Inspection, we can classify our features:

In [20]:
# Feature Classification
feature_classification = {
    'Binary (Yes/No encoded as 0/1)': [
        'Gender', 'Internet', 'Discussions', 'EduTech', 'Extracurricular'
    ],
    'Ordinal Categorical': [
        'Motivation (0,1,2)', 
        'Resources (0,1,2)', 
        'StressLevel (0,1,2)', 
        'LearningStyle (0,1,2,3)'
    ],
    'Numeric (Continuous)': [
        'StudyHours (5-44)',
        'Attendance (60-100)',
        'Age (18-29)',
        'OnlineCourses (0-20)',
        'AssignmentCompletion (50-100)',
        'ExamScore (40-100)'
    ],
    'Target Variable': [
        'FinalGrade (0, 1, 2, 3) - CLASSIFICATION PROBLEM'
    ]
}

for category, features in feature_classification.items():
    print(f"\n{'='*50}")
    print(f"📌 {category}")
    print('='*50)
    for feature in features:
        print(f"   • {feature}")


📌 Binary (Yes/No encoded as 0/1)
   • Gender
   • Internet
   • Discussions
   • EduTech
   • Extracurricular

📌 Ordinal Categorical
   • Motivation (0,1,2)
   • Resources (0,1,2)
   • StressLevel (0,1,2)
   • LearningStyle (0,1,2,3)

📌 Numeric (Continuous)
   • StudyHours (5-44)
   • Attendance (60-100)
   • Age (18-29)
   • OnlineCourses (0-20)
   • AssignmentCompletion (50-100)
   • ExamScore (40-100)

📌 Target Variable
   • FinalGrade (0, 1, 2, 3) - CLASSIFICATION PROBLEM


---

# Sanity Checks

**Question:** Do our values make real-world sense?

In [21]:
# Sanity Checks
print("🔍 SANITY CHECKS")
print("="*50)

# Each check: (display name, column, expected min, expected max, note)
checks = [
    ('Age', 'Age', 18, 29, 'Valid student age range'),
    ('Attendance', 'Attendance', 60, 100, 'Valid percentage range'),
    ('ExamScore', 'ExamScore', 40, 100, 'Valid score range'),
    ('StudyHours', 'StudyHours', 5, 40, 'Plausible weekly study time'),
    ('AssignmentCompletion', 'AssignmentCompletion', 50, 100, 'Valid percentage range')
]

# Slack allowed on each side of the expected range.
# Note: StudyHours peaks at 44, above the expected 40, and passes only because of this slack.
TOLERANCE = 5

all_passed = True
for name, col, expected_min, expected_max, note in checks:
    actual_min = df[col].min()
    actual_max = df[col].max()

    passed = (actual_min >= expected_min - TOLERANCE) and (actual_max <= expected_max + TOLERANCE)
    status = "✅" if passed else "❌"

    print(f"\n{status} {name}:")
    print(f"   Range: {actual_min} to {actual_max}")
    print(f"   Note: {note}")

    if not passed:
        all_passed = False

print("\n" + "="*50)
if all_passed:
    print("✅ ALL SANITY CHECKS PASSED!")
    print("   Data is internally consistent with real-world expectations")
else:
    print("⚠️ Some checks failed - investigate further!")

🔍 SANITY CHECKS

✅ Age:
   Range: 18 to 29
   Note: Valid student age range

✅ Attendance:
   Range: 60 to 100
   Note: Valid percentage range

✅ ExamScore:
   Range: 40 to 100
   Note: Valid score range

✅ StudyHours:
   Range: 5 to 44
   Note: Plausible weekly study time

✅ AssignmentCompletion:
   Range: 50 to 100
   Note: Valid percentage range

✅ ALL SANITY CHECKS PASSED!
   Data is internally consistent with real-world expectations
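
Comparing only the minimum and maximum says whether a column strays outside its range, not which values do. As a complementary sketch on made-up data, the check below applies the expected ranges with no slack and lists the offending values; the `expected` dictionary and the `violations` name are illustrative, not part of the notebook above.

```python
import pandas as pd

# Made-up values; the 44 mirrors the StudyHours maximum reported above.
sample = pd.DataFrame({
    "StudyHours": [5, 20, 44],
    "Attendance": [60, 80, 100],
})

# Expected ranges copied from the checks above (inclusive bounds).
expected = {"StudyHours": (5, 40), "Attendance": (60, 100)}

violations = {}
for col, (low, high) in expected.items():
    # between() is inclusive on both ends by default.
    out_of_range = sample.loc[~sample[col].between(low, high), col]
    if not out_of_range.empty:
        violations[col] = out_of_range.tolist()

print(violations)  # {'StudyHours': [44]}
```

Whether 44 study hours per week is an error or just an unusually dedicated student is a judgment call; the strict check simply makes sure you get to make it.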


---

# Quick Info Summary

pandas provides a convenient `info()` method that combines several inspection points:

In [None]:
# Bonus: df.info() provides a quick summary
print("DataFrame Info:")
df.info()

---

# Reusable 10-Point Inspection Function

Here's a function you can reuse for any dataset:

In [None]:
def ten_point_inspection(df, name="Dataset"):
    """
    Perform a comprehensive 10-point inspection on a DataFrame.
    
    Parameters:
    -----------
    df : pandas DataFrame
        The dataset to inspect
    name : str
        Name of the dataset for display purposes
    """
    print("="*60)
    print(f"📊 10-POINT INSPECTION: {name}")
    print("="*60)

    # 1. Shape
    print(f"\n1️⃣  SHAPE: {df.shape[0]:,} rows × {df.shape[1]} columns")

    # 2. Columns
    print(f"\n2️⃣  COLUMNS: {list(df.columns)}")

    # 3. Data Types
    print(f"\n3️⃣  DATA TYPES:")
    print(df.dtypes.value_counts().to_string())

    # 4 & 5. First and Last rows (just noting)
    print(f"\n4️⃣  FIRST ROW: {dict(df.iloc[0])}")
    print(f"\n5️⃣  LAST ROW: {dict(df.iloc[-1])}")

    # 6. Memory
    memory_mb = df.memory_usage(deep=True).sum() / 1e6
    print(f"\n6️⃣  MEMORY: {memory_mb:.2f} MB")

    # 7. Missing Values
    missing = df.isnull().sum().sum()
    print(f"\n7️⃣  MISSING VALUES: {missing:,}")

    # 8. Duplicates
    dupes = df.duplicated().sum()
    dupe_pct = (dupes / len(df)) * 100
    print(f"\n8️⃣  DUPLICATES: {dupes:,} ({dupe_pct:.2f}%)")

    # 9. Key Statistics
    print(f"\n9️⃣  KEY STATISTICS:")
    numeric_cols = df.select_dtypes(include=[np.number]).columns[:5]  # First 5 numeric
    for col in numeric_cols:
        print(f"    {col}: min={df[col].min()}, max={df[col].max()}, mean={df[col].mean():.2f}")

    # 10. Unique Values
    print(f"\n🔟 UNIQUE VALUE RANGES:")
    unique = df.nunique()
    print(f"    Binary (2): {list(unique[unique == 2].index)}")
    print(f"    Low (3-5): {list(unique[(unique > 2) & (unique <= 5)].index)}")
    print(f"    High (>5): {list(unique[unique > 5].index)}")

    print("\n" + "="*60)
    print("✅ 10-Point Inspection Complete!")
    print("="*60)

In [None]:
# Run the inspection function
ten_point_inspection(df, "Student Performance")

---

# Summary: What We Learned

## Dataset Overview
- **14,003 rows (students)** with **16 features** each
- **1.79 MB** memory footprint
- **0 missing values** (complete dataset)
- **1,534 duplicate rows (10.95%)** - needs investigation

## Feature Types
- **5 Binary**: Gender, Internet, Discussions, EduTech, Extracurricular
- **4 Ordinal**: Motivation, Resources, StressLevel, LearningStyle
- **6 Numeric**: StudyHours, Attendance, Age, OnlineCourses, AssignmentCompletion, ExamScore
- **1 Target**: FinalGrade (Classification problem)

## Key Insight
> All columns are stored as `int64`, but they represent different types of data!

---

## Your Week 1 Deliverable: Data Profile Report

Create a report that includes:
1. Dataset Overview
2. Feature Inventory
3. Data Quality Assessment
4. Summary Statistics
5. Initial Observations
6. Questions for Investigation
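
For the Feature Inventory and Data Quality Assessment parts of the report, a one-row-per-column table is a convenient starting point. The helper below is a hypothetical sketch (the name `build_profile` and the output file name are not part of the course materials), shown on a tiny made-up frame.

```python
import pandas as pd

def build_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with basic quality facts (hypothetical helper)."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
        "missing_pct": (df.isnull().mean() * 100).round(2),
        "unique": df.nunique(),
        "min": df.min(numeric_only=True),
        "max": df.max(numeric_only=True),
    })

# Made-up data with one missing value so the missing columns are not all zero.
sample = pd.DataFrame({
    "Age": [18, 23, None, 29],
    "Gender": [0, 1, 1, 0],
})

profile = build_profile(sample)
print(profile)

# profile.to_csv("data_profile.csv")  # hypothetical output file name
```

The table covers only what can be computed; the Initial Observations and Questions for Investigation sections still need your own reading of the data.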

---

## Coming Next: Week 2
*"Every Column Tells a Story"*

Deep dive into feature meaning, relationships, and domain understanding.