# CS ELEC 3C - Final Project

## Background of the Dataset

The dataset used is the `Historical Enrollment Data` from the Department of Education, which was created on September 9, 2024 and last updated on November 22, 2022.

**Historical Enrollment Data**
> Some context.

**Problem**
> Some context.

**Objectives of the Analysis**
> Some context.

In [1]:
#!pip install -r requirements.txt

In [6]:
import datetime
import warnings

import plotly as py
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [7]:
warnings.filterwarnings("ignore") 

## Preliminary Data Analysis

Conduct an initial exploration of the data. This may include data cleaning, missing value handling, and summary statistics (mean, median, mode, standard deviation).

In [8]:
df = pd.read_csv('HistoricalEnrollmentData.csv')

In [9]:
df.head(3)

| | SCHOOL YEAR | SECTOR | REGION | GRADE LEVEL | GENDER | NUMBER OF ENROLLEES |
|---|---|---|---|---|---|---|
| 0 | 2010-2011 | PUBLIC | REGION 1 | KINDERGARTEN | MALE | 43376 |
| 1 | 2010-2011 | PUBLIC | REGION 2 | KINDERGARTEN | MALE | 13837 |
| 2 | 2010-2011 | PUBLIC | REGION 3 | KINDERGARTEN | MALE | 72310 |


In [10]:
df.tail(3)

| | SCHOOL YEAR | SECTOR | REGION | GRADE LEVEL | GENDER | NUMBER OF ENROLLEES |
|---|---|---|---|---|---|---|
| 13468 | 2020-2021 | SUCSLUCS | BARMM | NON-GRADE | FEMALE | - |
| 13469 | 2020-2021 | SUCSLUCS | CAR | NON-GRADE | FEMALE | - |
| 13470 | 2020-2021 | SUCSLUCS | NCR | NON-GRADE | FEMALE | - |


### Data Profiling

**Step 1: Data Type Information**

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13471 entries, 0 to 13470
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   SCHOOL YEAR          13471 non-null  object
 1   SECTOR               13471 non-null  object
 2   REGION               13471 non-null  object
 3   GRADE LEVEL          13471 non-null  object
 4   GENDER               13471 non-null  object
 5   NUMBER OF ENROLLEES  13471 non-null  object
dtypes: object(6)
memory usage: 631.6+ KB


**Step 2: Data Shape**

In [12]:
df.shape

(13471, 6)

**Step 3: Statistical Information**

In [13]:
df.describe()

| | SCHOOL YEAR | SECTOR | REGION | GRADE LEVEL | GENDER | NUMBER OF ENROLLEES |
|---|---|---|---|---|---|---|
| count | 13471 | 13471 | 13471 | 13471 | 13471 | 13471 |
| unique | 11 | 3 | 34 | 13 | 2 | 7996 |
| top | 2015-2016 | SUCSLUCS | REGION 1 | GRADE 10 | FEMALE | - |
| freq | 1326 | 4495 | 760 | 1129 | 6739 | 939 |
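Because every column is still stored as `object`, `describe()` reports only `count`, `unique`, `top`, and `freq`. The mean, median, mode, and standard deviation become available once `NUMBER OF ENROLLEES` is numeric. The following is a minimal sketch on a four-row stand-in whose values are copied from the `head` and `tail` outputs above. The name `sample` and the choice to coerce `-` to NaN are assumptions for illustration, not a decision made in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for df; values copied from the outputs shown above.
sample = pd.DataFrame({
    "REGION": ["REGION 1", "REGION 2", "REGION 3", "BARMM"],
    "NUMBER OF ENROLLEES": ["43376", "13837", "72310", "-"],
})

# errors="coerce" turns anything non-numeric (here "-") into NaN.
enrollees = pd.to_numeric(sample["NUMBER OF ENROLLEES"], errors="coerce")

# describe() on a numeric Series reports count, mean, std, min, quartiles, max.
print(enrollees.describe())
print("median:", enrollees.median())
print("mode:", enrollees.mode().tolist())
```

NaN entries are skipped by these aggregations by default, so the statistics describe only the rows that actually carry a count.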


**Step 4: Null Values**

In [14]:
for column in df.columns:
    missing_values = df[df[column].isna()]
    if not missing_values.empty:
        print(f"Missing values found in column: {column}")
        print(missing_values.head(10))
    else:
        print(f"No missing values in column: {column}")

No missing values in column: SCHOOL YEAR
No missing values in column: SECTOR
No missing values in column: REGION
No missing values in column: GRADE LEVEL
No missing values in column: GENDER
No missing values in column: NUMBER OF ENROLLEES
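`isna()` only detects true nulls (`NaN`, `None`), so it cannot see the `-` placeholder that `describe()` reported as the most frequent value of `NUMBER OF ENROLLEES`. A minimal sketch of counting such placeholders per column follows. It assumes `-` is the only placeholder token, which has not been verified against the full file, and `sample` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for df; values copied from the head/tail outputs.
sample = pd.DataFrame({
    "REGION": ["BARMM", "CAR", "NCR", "REGION 1"],
    "NUMBER OF ENROLLEES": ["-", "-", "-", "43376"],
})

# isna() finds nothing, because "-" is an ordinary string.
print(sample.isna().sum())

# Element-wise comparison exposes the placeholder, column by column.
placeholder_counts = (sample == "-").sum()
print(placeholder_counts)
```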


**Step 6: Format Standardization**

`SCHOOL YEAR`
> Current dtype: object (string)\
> Converted dtype: category

In [17]:
df["SCHOOL YEAR"] = df["SCHOOL YEAR"].astype("category")

`SECTOR`
> Current dtype: object (string)\
> Converted dtype: category

In [18]:
df["SECTOR"] = df["SECTOR"].astype("category")

`REGION`
> Current dtype: object (string)\
> Converted dtype: category

In [19]:
df["REGION"] = df["REGION"].astype("category")

`GRADE LEVEL`
> Current dtype: object (string)\
> Converted dtype: category

In [20]:
df["GRADE LEVEL"] = df["GRADE LEVEL"].astype("category")
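The four conversions above follow the same pattern, so they could also be written as a single `astype` call with a column-to-dtype mapping. This is a sketch of an equivalent form, not a change to the cells above. `sample` and `categorical_columns` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for df; values copied from the head/tail outputs.
sample = pd.DataFrame({
    "SCHOOL YEAR": ["2010-2011", "2010-2011", "2020-2021"],
    "SECTOR": ["PUBLIC", "PUBLIC", "SUCSLUCS"],
    "REGION": ["REGION 1", "REGION 2", "NCR"],
    "GRADE LEVEL": ["KINDERGARTEN", "KINDERGARTEN", "NON-GRADE"],
})

categorical_columns = ["SCHOOL YEAR", "SECTOR", "REGION", "GRADE LEVEL"]

# DataFrame.astype accepts a {column: dtype} mapping and converts all at once.
sample = sample.astype({column: "category" for column in categorical_columns})
print(sample.dtypes)
```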

`NUMBER OF ENROLLEES`
> Current dtype: object (string)\
> Converted dtype: int or float

In [None]:
# We first need to discuss how to handle the "-" values before converting to int or float. - Kyle
# df["NUMBER OF ENROLLEES"] = df["NUMBER OF ENROLLEES"].astype("float")
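To support that discussion, here is a minimal sketch of the two most obvious policies for `-`: treat it as missing, or treat it as zero. Neither has been chosen in this notebook. The sketch assumes `-` is the only non-numeric token in the column, and `raw` is a hypothetical stand-in for `df["NUMBER OF ENROLLEES"]`.

```python
import pandas as pd

# Hypothetical stand-in for the column; values copied from the outputs above.
raw = pd.Series(["43376", "13837", "72310", "-"], name="NUMBER OF ENROLLEES")

# Option A: "-" means "not reported" -> missing value, nullable integer dtype.
# mask() blanks out the placeholder; to_numeric then raises on any other
# unexpected token instead of silently hiding it.
as_missing = pd.to_numeric(raw.mask(raw == "-")).astype("Int64")

# Option B: "-" means "no enrollees" -> zero, plain integer dtype.
as_zero = raw.replace("-", "0").astype("int64")

# The totals agree, but the averages do not, because Option B adds a row.
print(as_missing.sum(), as_zero.sum())
print(as_missing.mean(), as_zero.mean())
```

The choice matters for every later average: Option A excludes the placeholder rows from the denominator, while Option B counts them as real zero-enrollment rows.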

**Step 7: Checking for Incorrect Formatting**

`SCHOOL YEAR`

In [21]:
inc_sy = df[~df["SCHOOL YEAR"].str.match(r"^\d{4}-\d{4}$")]
print("Incorrect SCHOOL YEAR formatting:")
print(inc_sy["SCHOOL YEAR"].unique())  # Display unique incorrect values

Incorrect SCHOOL YEAR formatting:
[], Categories (11, object): ['2010-2011', '2011-2012', '2012-2013', '2013-2014', ..., '2017-2018', '2018-2019', '2019-2020', '2020-2021']


`REGION`

In [22]:
inc_region = df[~df["REGION"].str.match(r"^REGION \d+$")]
print("Incorrect REGION formatting:")
print(inc_region["REGION"].unique())

Incorrect REGION formatting:
['REGION 4A', 'REGION 4B', 'CARAGA', 'BARMM', 'CAR', ..., ' REGION 12 ', ' CARAGA ', ' BARMM ', ' CAR ', ' NCR ']
Length: 23
Categories (34, object): [' BARMM ', ' CAR ', ' CARAGA ', ' NCR ', ..., 'REGION 6', 'REGION 7', 'REGION 8', 'REGION 9']
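The output shows two separate issues: some labels carry leading and trailing spaces (for example `' CARAGA '`), and the pattern `^REGION \d+$` also rejects labels such as `REGION 4A`, `CARAGA`, `BARMM`, `CAR`, and `NCR`. A minimal sketch of stripping the whitespace and rechecking with a broader pattern follows. The allowed names are taken only from the values visible in the output above, so the pattern is an assumption and may need more entries for the elided values.

```python
import pandas as pd

# Hypothetical stand-in for df["REGION"]; values copied from the output above.
regions = pd.Series([
    "REGION 1", " REGION 12 ", "REGION 4A", " CARAGA ", "CARAGA", " NCR ", "BARMM",
])

# Remove surrounding whitespace so ' CARAGA ' and 'CARAGA' become one label.
cleaned = regions.str.strip()

# Assumed pattern: numbered regions with an optional A/B suffix, plus the
# named regions seen in the output.
pattern = r"^(?:REGION \d+[AB]?|CARAGA|BARMM|CAR|NCR)$"
still_invalid = cleaned[~cleaned.str.match(pattern)]

print(cleaned.unique())
print("Remaining invalid labels:", still_invalid.tolist())
```

When the column is already `category`, `.str.strip()` returns a plain object Series, so the result would need `.astype("category")` again.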


`GRADE LEVEL`

In [24]:
inc_grade = df[~df["GRADE LEVEL"].str.match(r"^(GRADE \d+|KINDERGARTEN)$")]
print("Incorrect GRADE LEVEL formatting:")
print(inc_grade["GRADE LEVEL"].unique())

Incorrect GRADE LEVEL formatting:
['NON-GRADE', 'KINDERGARDEN']
Categories (13, object): ['GRADE 1', 'GRADE 10', 'GRADE 2', 'GRADE 3', ..., 'GRADE 9', 'KINDERGARDEN', 'KINDERGARTEN', 'NON-GRADE']
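Of the two flagged values, `KINDERGARDEN` is a misspelling of the existing `KINDERGARTEN` category. `NON-GRADE` looks like an intentional label, although that is an assumption to confirm against the data dictionary. A minimal sketch of merging the misspelled category follows, with `grades` as a hypothetical stand-in for `df["GRADE LEVEL"]`.

```python
import pandas as pd

# Hypothetical stand-in for df["GRADE LEVEL"] after the category conversion.
grades = pd.Series(
    ["KINDERGARTEN", "KINDERGARDEN", "GRADE 1", "NON-GRADE"], dtype="category"
)

# Go through object dtype so the two spellings collapse into one label,
# then convert back to category to drop the stale 'KINDERGARDEN' entry.
fixed = (
    grades.astype("object")
    .replace({"KINDERGARDEN": "KINDERGARTEN"})
    .astype("category")
)

print(fixed.cat.categories.tolist())
```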


### Data Cleaning

**Step 1: Rechecking the DataFrame's Info**

**Step 2: Removing Null Values**

**Step 3: Removing Zero Values**

**Step 4: Fixing Incorrect Format**