## Section 1: Import Libraries and Load the Dataset

This section loads the core libraries used for data analysis and visualization:

- **pandas** (`pd`) – for loading and manipulating datasets in tabular format.
- **numpy** (`np`) – for numerical and statistical computations.
- **matplotlib.pyplot** (`plt`) – for creating static visualizations like bar charts and line plots.
- **seaborn** (`sns`) – a statistical plotting library built on top of matplotlib that offers attractive defaults and a simpler plotting API.

Finally, the dataset `student_dataset.csv` is loaded into a DataFrame named `df`, which will serve as the main data source throughout the notebook.
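
As a minimal sketch of the loading step, the snippet below reads a tiny in-memory CSV in place of `student_dataset.csv`, so it runs without the file. The column names come from this notebook; the three rows are made-up sample values.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for student_dataset.csv.
sample_csv = io.StringIO(
    "gender,course,number_of_units_enrolled,last_term_GWA,balance\n"
    "Female,Information Systems (BS-IS),18,3.5,0\n"
    "Male,Cybersecurity (BSCSEC),21,2.75,12500\n"
    "Female,Cybersecurity (BSCSEC),15,3.25,0\n"
)

# In the notebook the call would be pd.read_csv("student_dataset.csv"),
# assuming the file sits in the working directory.
df = pd.read_csv(sample_csv)

print(df.shape)  # (3, 5)
```

In the notebook itself, the same cell would also import `numpy`, `matplotlib.pyplot`, and `seaborn` under the aliases listed above.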


In [1]:
# Load essential data science libraries


# Load the student dataset CSV



## Section 2: Initial Data Exploration

In this section, we use the `.head()` method to preview the **first 5 rows** of the dataset.

This is a quick way to:
- Inspect the structure and contents of the DataFrame
- Verify that the dataset was loaded correctly
- Identify column names and sample values
- Spot-check data formats like dates, numbers, or text categories

This early look helps in planning the next steps such as data cleaning, transformation, or visualization.
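
A small sketch of how `.head()` behaves, using a hypothetical eight-row frame (the `student_no` column and all values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical eight-row frame; values are illustrative only.
df = pd.DataFrame({
    "student_no": range(1, 9),
    "number_of_units_enrolled": [18, 21, 15, 24, 18, 20, 19, 17],
})

preview = df.head()      # with no argument, returns the first 5 rows
print(preview)

top_three = df.head(3)   # pass n to change how many rows are returned
print(len(preview), len(top_three))  # 5 3
```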


In [2]:
#Preview the first 5 records





#### `df.dtypes`
This command displays the data type of each column in the DataFrame. Understanding data types is important because:
- **object** usually means text/string data
- **int64** is for whole numbers
- **float64** is for decimal numbers
- **bool** is for True/False values

Correct data types help ensure that mathematical operations and visualizations work properly.

#### `df.shape`
This command returns a tuple showing the number of **rows** and **columns** in the dataset.  
For example, a result like `(1330, 13)` means:
- 1330 rows (students)
- 13 columns (data features)
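
The two attributes can be tried on a small stand-in frame. This is only a sketch: the `is_enrolled` column is hypothetical and was added to show a boolean type.

```python
import pandas as pd

# Hypothetical two-row frame with one column per common data type.
df = pd.DataFrame({
    "course": ["BS-IS", "BSCSEC"],          # text
    "number_of_units_enrolled": [18, 21],   # whole numbers
    "last_term_GWA": [3.5, 2.75],           # decimal numbers
    "is_enrolled": [True, False],           # hypothetical True/False column
})

print(df.dtypes)          # one data type per column

rows, cols = df.shape     # shape is a (rows, columns) tuple
print(rows, cols)         # 2 4
```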


In [3]:
#Check data types of each column



In [4]:
#Check shape (rows, columns)



The command `df.isnull().sum()` is used to **identify missing or null values** in the dataset.

- It returns a count of missing values for each column.
- A result of `0` means there are no missing values in that column.
- A value greater than `0` indicates potential data quality issues.

In this case:
- All columns have complete data **except** for `academic_probation_status`, which has **1,123 missing values**.
- This likely means that only a small portion of students are under probation, and the rest are left blank (which may be interpreted as "Not on probation").

Knowing where data is missing helps you decide whether to clean, impute, or ignore certain columns during analysis.
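
The sketch below counts missing values on a hypothetical four-row frame and then shows one possible treatment. It assumes that a blank `academic_probation_status` really does mean "Not on probation"; the label `"On Probation"` is also an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: only one student has a probation status recorded.
df = pd.DataFrame({
    "course": ["BS-IS", "BSCSEC", "BS-IS", "BSBA-BIA"],
    "academic_probation_status": ["On Probation", np.nan, np.nan, np.nan],
})

missing = df.isnull().sum()   # missing-value count per column
print(missing)

# One possible fix, valid only if blank means "not on probation".
df["academic_probation_status"] = (
    df["academic_probation_status"].fillna("Not on probation")
)
remaining = int(df["academic_probation_status"].isnull().sum())
print(remaining)  # 0
```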


In [5]:
#Check for missing values



The command `df.describe(include='all')` generates **summary statistics** for both numerical and categorical columns in the dataset.

Here's what each row means:

- **count** – Number of non-null entries in each column
- **unique** – Number of unique values (only for categorical/object columns)
- **top** – Most frequently occurring value (also called the mode)
- **freq** – How many times the `top` value appears
- **mean** – Average (for numeric columns only)
- **std** – Standard deviation (spread of numeric values)
- **min / max** – Minimum and maximum values
- **25% / 50% / 75%** – Percentile values (useful for understanding distribution)

For example:
- The `course` column's most common value (`top`) is **Human Resource Management (BSBA-HRM)**, appearing **234 times**.
- The `balance` column has an average of about ₱6,495.54 and a maximum of ₱39,983.
- The `number_of_units_enrolled` ranges from **15 to 24 units**, with a median of **18 units**.

This summary gives a quick overview of both data distribution and potential patterns.
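
To see how the rows of the summary line up with the two kinds of columns, here is a sketch on a hypothetical five-row frame. Individual cells can be read with `.loc[row, column]`.

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column.
df = pd.DataFrame({
    "course": ["BS-IS", "BSCSEC", "BSCSEC", "BS-IS", "BSCSEC"],
    "number_of_units_enrolled": [15, 18, 18, 21, 24],
})

summary = df.describe(include="all")
print(summary)

# Categorical rows are filled for `course`, numeric rows for the units column.
print(summary.loc["top", "course"])                     # BSCSEC
print(summary.loc["freq", "course"])                    # 3
print(summary.loc["50%", "number_of_units_enrolled"])   # 18.0
```

Cells that do not apply to a column, such as `mean` for `course`, appear as `NaN`.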


In [6]:
# Summary statistics (numeric + categorical)



## Section 3: Frequency Counts

The command `df['gender'].value_counts()` shows the **frequency of each unique value** in the `gender` column.

This is useful to:
- Understand how many records belong to each category (e.g., Male vs. Female)
- Check for imbalances or skewed distributions in categorical data
- Identify unexpected values or typos in categorical columns

This function is commonly used on fields like `course`, `municipality`, `payment_status`, etc., to get a quick summary of how data is distributed.
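
A short sketch of how `value_counts()` can surface an inconsistent label. The six values are hypothetical, with a lowercase `"female"` planted on purpose; normalizing the casing is just one possible cleanup.

```python
import pandas as pd

# Hypothetical column with one inconsistently cased entry.
df = pd.DataFrame(
    {"gender": ["Female", "Male", "Female", "female", "Male", "Female"]}
)

counts = df["gender"].value_counts()
print(counts)    # "female" shows up as its own category

# Normalize casing, then count again.
cleaned = df["gender"].str.capitalize().value_counts()
print(cleaned)   # Female 4, Male 2
```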


In [7]:
#Gender distribution



### Frequency Counts – Number of Students per Course

The command `df['course'].value_counts()` shows how many students are enrolled in each academic program.

#### Output (Top to Bottom):
- **Human Resource Management (BSBA-HRM)**: 234 students
- **Interactive Entertainment & Multimedia Computing (BS-IEMC)**: 226 students
- **Cybersecurity (BSCSEC)**: 222 students
- **Business Solutions & Applications (BSBA-BSAA)**: 222 students
- **Business Intelligence & Analytics (BSBA-BIA)**: 219 students
- **Information Systems (BS-IS)**: 207 students

This breakdown helps us:
- Understand the distribution of students across different programs
- Identify which course has the highest or lowest enrollment
- Prepare program-specific statistics or visualizations later

It’s especially useful for plotting bar charts or evaluating program popularity.


In [8]:
#Number of students per course



### Frequency Counts – Student Nationalities

The command `df['nationality'].value_counts()` displays the number of students per nationality in the dataset.

#### Output:
- **Filipino**: 1,259 students
- **American**: 25 students
- **Chinese**: 13 students
- **Vietnamese**: 12 students
- **Korean**: 11 students
- **Japanese**: 10 students

This confirms that the dataset is predominantly composed of **Filipino students (≈95%)**, with a small percentage representing **foreign nationalities (≈5%)**.

This insight is valuable for:
- Diversity reporting
- Scholarship and support program planning
- Visualizing demographic breakdowns using pie or bar charts
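
The percentages quoted above can be checked from the reported counts. The sketch rebuilds the counts as a Series instead of reading the dataset; on the real DataFrame, `df['nationality'].value_counts(normalize=True)` would return the same shares as fractions.

```python
import pandas as pd

# Counts as listed in the output above.
counts = pd.Series({
    "Filipino": 1259,
    "American": 25,
    "Chinese": 13,
    "Vietnamese": 12,
    "Korean": 11,
    "Japanese": 10,
})

total = int(counts.sum())                    # 1330
share = (counts / total * 100).round(1)      # percentage per nationality
foreign_share = round(float(counts.drop("Filipino").sum() / total * 100), 1)

print(total)              # 1330
print(share["Filipino"])  # 94.7
print(foreign_share)      # 5.3
```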


In [9]:
# Count of nationalities



### Frequency Counts – Top 10 Municipalities

The command `df['municipality'].value_counts().head(10)` displays the **10 most common municipalities** represented in the dataset.

#### Output:
- **Makati**: 78
- **Muntinlupa**: 76
- **Antipolo**: 75
- **Pasig**: 74
- **Mandaluyong**: 67
- **Marikina**: 64
- **Imus**: 63
- **Caloocan**: 60
- **Pasay**: 60
- **Taguig**: 59

This gives a snapshot of where most students are coming from. It is useful for:
- Identifying location-based trends or outreach opportunities
- Tailoring services or marketing by geographic concentration
- Visualizing regional density using bar charts or maps


In [10]:
# Top 10 municipalities



## Section 4: NumPy-Based Statistics

This section uses `NumPy` to perform basic statistical analysis on numerical columns in the dataset.

#### 1. Mean of `number_of_units_enrolled`
```python
np.mean(df["number_of_units_enrolled"])
```
- Calculates the **average number of units** that students are currently enrolled in.
- Example Output: `19.39` means students typically enroll in around 19–20 units.

#### 2. Median of `last_term_GWA`
```python
np.median(df["last_term_GWA"])
```
- Calculates the **middle value** of the last term’s GWA when all values are sorted.
- Example Output: `3.5` implies that half of the students have a GWA above 3.5, and half below.

#### 3. Standard Deviation of `last_term_GWA`
```python
np.std(df["last_term_GWA"])
```
- Measures how much the GWA values vary from the mean.
- Example Output: `0.81` suggests most students have a GWA relatively close to the average.

These simple statistical metrics help summarize trends and identify potential outliers in the data.
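
The three calls can be run together on a hypothetical five-row frame, as sketched below. One detail worth knowing: `np.std` defaults to the population standard deviation (`ddof=0`), while the `std` row of `df.describe()` uses the sample version (`ddof=1`), so the two can differ slightly.

```python
import numpy as np
import pandas as pd

# Hypothetical values; not taken from the dataset.
df = pd.DataFrame({
    "number_of_units_enrolled": [15, 18, 18, 21, 24],
    "last_term_GWA": [2.0, 3.0, 3.5, 3.5, 4.0],
})

mean_units = np.mean(df["number_of_units_enrolled"])
median_gwa = np.median(df["last_term_GWA"])
std_population = np.std(df["last_term_GWA"])       # ddof=0 (NumPy default)
std_sample = np.std(df["last_term_GWA"], ddof=1)   # matches pandas' .std()

print(mean_units)   # 19.2
print(median_gwa)   # 3.5
print(round(std_population, 4), round(std_sample, 4))
```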


In [11]:
# Mean of enrolled units



In [12]:
# Median of last term GWA



In [13]:
# Standard deviation of GWA





## Section 5: Data Visualization – Gender Distribution

This chart uses `seaborn.countplot()` to visually represent the number of students by gender.

#### Code Breakdown:
- `sns.set(style="whitegrid")` – applies a clean background grid style to the chart.
- `plt.figure(figsize=(6, 4))` – sets the width and height of the plot.
- `sns.countplot(x="gender", data=df)` – creates a bar chart showing the frequency of each gender in the dataset.
- `plt.title()` – sets the title of the chart.
- `plt.show()` – renders the final plot.

This type of visualization is useful for:
- Quickly identifying the dominant gender in the dataset
- Presenting categorical distributions clearly and visually
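
Put together, the calls listed in the breakdown might look like the sketch below. The five-row frame and the chart title are assumptions, and the `Agg` backend line is only there so the snippet also runs outside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical five-row frame.
df = pd.DataFrame({"gender": ["Female", "Male", "Female", "Female", "Male"]})

sns.set(style="whitegrid")
plt.figure(figsize=(6, 4))
ax = sns.countplot(x="gender", data=df)   # one bar per gender
plt.title("Gender Distribution")          # assumed title text

# Bar heights equal the category counts; the tallest bar is Female (3).
tallest = max(patch.get_height() for patch in ax.patches)
print(tallest)
# In the notebook, finish the cell with plt.show().
```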


In [14]:
# Gender Distribution



### Pie Chart – Gender Distribution

This pie chart presents the proportion of male and female students in the dataset using `matplotlib`.

#### Code Explanation:
- `df['gender'].value_counts()` counts the number of Male and Female entries.
- `plt.pie()` creates the pie chart using:
  - `labels` to show each gender
  - `autopct='%1.1f%%'` to show percentages
  - `startangle=90` to rotate the pie for better visual alignment
  - `colors` to manually set blue and pink shades for each gender
- `plt.axis('equal')` keeps the pie perfectly circular.

#### Insight:
From the chart:
- **52% of students are Female**
- **48% are Male**

This pie chart provides a clear, visual summary of gender distribution in the student population.
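
A sketch of the pie chart under stated assumptions: 25 hypothetical students are split 13 to 12 to reproduce the 52% / 48% proportions, and the color names and their assignment to each gender are guesses at the "blue and pink shades" mentioned above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical 13/12 split, i.e. 52% Female and 48% Male.
gender = pd.Series(["Female"] * 13 + ["Male"] * 12)
counts = gender.value_counts()

fig, ax = plt.subplots(figsize=(5, 5))
wedges, texts, autotexts = ax.pie(
    counts,
    labels=counts.index,
    autopct="%1.1f%%",            # one decimal place plus a percent sign
    startangle=90,
    colors=["pink", "skyblue"],   # assumed shades and order
)
ax.axis("equal")                  # keep the pie circular

percent_labels = [t.get_text() for t in autotexts]
print(percent_labels)  # ['52.0%', '48.0%']
```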


In [15]:
# Gender Distribution (Pie Chart)



### Bar Chart – Students per Course

This horizontal bar chart visualizes the number of students enrolled in each academic program.

#### Code Highlights:
- `df['course'] = df['course'].str.replace('\u2011', '-', regex=False)`  
  Replaces **non-breaking hyphens** with standard hyphens to fix display issues (e.g., missing glyphs).
- `plt.figure(figsize=(10, 5))`  
  Sets the size of the plot (wider layout for long course names).
- `sns.countplot()`  
  Creates a **horizontal bar chart** (`y="course"`) sorted by most common course.
- `plt.title()`, `xlabel()`, `ylabel()`  
  Add chart title and axis labels for context.
- `plt.tight_layout()`  
  Prevents label overlap and ensures spacing looks good.

#### Insight:
The chart shows how students are distributed across different degree programs. For example:
- **Human Resource Management (BSBA-HRM)** has the highest enrollment
- **Information Systems (BS-IS)** has the lowest, but still maintains strong representation

This helps identify which programs are most popular or may require additional support and resources.
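
The hyphen cleanup and the sort order can be prepared with pandas before plotting. This is a sketch on three hypothetical rows; passing the result as `order=order` to `sns.countplot()` is one way to get bars sorted by frequency.

```python
import pandas as pd

# Hypothetical rows; "\u2011" is the non-breaking hyphen.
df = pd.DataFrame({"course": [
    "Information Systems (BS\u2011IS)",
    "Cybersecurity (BSCSEC)",
    "Information Systems (BS\u2011IS)",
]})

# Literal (non-regex) replacement with a standard hyphen.
df["course"] = df["course"].str.replace("\u2011", "-", regex=False)

# Course names from most to least common.
order = df["course"].value_counts().index
print(list(order))
```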


In [16]:
# Course Enrollment

# Replace non-breaking hyphen with normal hyphen



### Histogram – GWA Distribution

This histogram visualizes the **distribution of students' GWA (General Weighted Average)** using `seaborn.histplot()`.

#### Code Breakdown:
- `plt.figure(figsize=(6, 4))` – Sets the plot size.
- `sns.histplot(df["last_term_GWA"], bins=10, kde=True)`  
  - `bins=10` splits the GWA range into 10 intervals for frequency counting.
  - `kde=True` overlays a **Kernel Density Estimate curve** to show the smooth distribution shape.
- `plt.title()` and `xlabel()`/`ylabel()` – Add descriptive labels to the chart.
- `plt.show()` – Renders the plot.

#### Insight:
- The majority of students have a GWA between **3.0 and 4.0**, with a visible peak near 3.0–3.5.
- A few students have lower GWAs (closer to 1.0), indicating academic risk.
- The KDE curve helps us understand the **overall shape** and central tendency of the GWA distribution.

This type of chart is ideal for visualizing the **spread, skewness, and clusters** in continuous numeric data.
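
To see what `bins=10` means numerically, the sketch below uses `np.histogram`, which splits the data range into equal-width intervals in the same spirit as the plot. The ten GWA values are hypothetical.

```python
import numpy as np

# Hypothetical GWA values.
gwa = np.array([1.1, 2.05, 2.6, 3.0, 3.05, 3.2, 3.5, 3.55, 3.8, 4.0])

counts, edges = np.histogram(gwa, bins=10)

print(len(counts), len(edges))   # 10 counts, 11 bin edges
print(edges[0], edges[-1])       # edges span the minimum and maximum values
print(counts.sum())              # every value lands in exactly one bin
```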


In [17]:
# GWA Distribution



### Boxplot – Units Enrolled per Course

This boxplot shows the **distribution of enrolled units** for students in each course using `seaborn.boxplot()`.

#### Code Breakdown:
- `df['course'] = df['course'].str.replace('\u2011', '-', regex=False)`  
  Replaces non-breaking hyphens with standard hyphens to prevent font rendering issues.
- `sns.boxplot(x="number_of_units_enrolled", y="course", data=df)`  
  Creates a **horizontal boxplot** to compare unit enrollment across courses.
- `plt.tight_layout()` ensures the layout is readable and not cramped.

#### Interpretation:
- Each box shows the **interquartile range (IQR)** — the middle 50% of the data.
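- The **line inside the box** is the median (drawn vertically here, because the boxes are horizontal).
- **Whiskers** extend to the most extreme values within 1.5 × IQR of the box (seaborn's default), and **dots (if any)** indicate outliers beyond that range.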
- This chart is useful for spotting:
  - Median number of units per course
  - Spread and consistency of enrolled units
  - Any courses with tighter or wider variation in unit load

It’s especially helpful for comparing student workload across different degree programs.
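
The quantities a boxplot draws can also be computed directly. The sketch uses a hypothetical list of unit loads with a planted outlier (36) and assumes the default whisker rule of 1.5 × IQR.

```python
import pandas as pd

# Hypothetical unit loads; 36 is a planted outlier.
units = pd.Series([15, 18, 18, 18, 21, 21, 24, 36])

q1, median, q3 = units.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1                      # height of the box

# Default whisker limits: 1.5 x IQR beyond the box edges.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = units[(units < lower_fence) | (units > upper_fence)]

print(q1, median, q3)      # 18.0 19.5 21.75
print(iqr)                 # 3.75
print(outliers.tolist())   # [36]
```

On the real data, the same figures per course could be obtained by grouping on `course` before calling `quantile`.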


In [18]:
# Units Enrolled by Course (Boxplot)

# to fix the warning



### Boxplot – Balance Distribution (Unpaid Students Only)

This boxplot shows the **range and spread of unpaid balances** among students with an outstanding balance (`balance > 0`), which is expected to correspond to a `payment_status` of "With Balance".

#### Code Breakdown:
- `df[df["balance"] > 0]` filters the dataset to include only students who have an outstanding balance.
- `sns.boxplot(x="payment_status", y="balance", data=...)` creates a boxplot of balances grouped by payment status.
- `plt.title()`, `xlabel()`, `ylabel()` – Add informative labels for better understanding.

#### Interpretation:
- The box represents the **interquartile range (IQR)**—the middle 50% of student balances.
- The **line inside the box** shows the **median balance**, which appears to be around ₱25,000.
- The **whiskers** extend to the minimum and maximum unpaid amounts (₱10,000 to ₱40,000).

This chart is useful for:
- Identifying the central tendency and spread of student balances
- Understanding tuition affordability patterns
- Planning billing interventions or scholarship policies
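
A sketch of the filtering step on five hypothetical rows. The label `"Fully Paid"` is an assumption; only "With Balance" appears in this notebook.

```python
import pandas as pd

# Hypothetical rows; the "Fully Paid" label is assumed.
df = pd.DataFrame({
    "payment_status": [
        "Fully Paid", "With Balance", "With Balance",
        "Fully Paid", "With Balance",
    ],
    "balance": [0, 10000, 25000, 0, 40000],
})

# Boolean mask keeps only rows with an outstanding balance.
unpaid = df[df["balance"] > 0]

# Check that the numeric filter agrees with the status label.
consistent = bool((unpaid["payment_status"] == "With Balance").all())

print(len(unpaid))                 # 3
print(unpaid["balance"].median())  # 25000.0
print(consistent)                  # True
```

The filtered frame is what would be passed as `data=` to `sns.boxplot()`.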


In [19]:
# Balance Distribution (Unpaid Students Only)



---

### Final Remarks

This notebook provided a complete walkthrough of how to load, explore, analyze, and visualize data using **Pandas**, **NumPy**, **Matplotlib**, and **Seaborn**.

We started by:
- Importing and cleaning the dataset
- Exploring the structure, shape, and data types
- Identifying missing values
- Summarizing statistics both numerically and visually

We then built:
- Frequency tables and bar charts for categorical variables
- Summary statistics using NumPy
- Visualizations including histograms, pie charts, and boxplots

The dataset used in this notebook (`student_dataset.csv`) was carefully generated to simulate real academic scenarios such as course enrollments, payment statuses, academic standing, and demographic data. It serves as an excellent hands-on foundation for beginner data analysts and student researchers.

**Congratulations on completing this data analysis mini-project!** Feel free to expand this notebook by adding correlation matrices, pivot tables, or even predictive modeling in future versions.

---
