In [1]:
#Data loading
import pandas as pd

raw = pd.read_csv('data/raw_data.csv')
inc = pd.read_csv('data/incremental_data.csv')

raw.shape, inc.shape


((8000, 10), (1000, 10))

### Dataset Loading

Both the **raw** and **incremental** CSV files loaded successfully from the `data/` folder.

- The raw dataset contains 8,000 records and 10 columns.  
- The incremental dataset contains 1,000 records and 10 columns, matching the raw dataset's column count.

These files represent student performance data, generated synthetically for this ETL exercise.
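Matching shapes alone do not show that the two files share column names and dtypes. A minimal sketch of an explicit schema comparison follows; the helper `compare_schemas` and the small demo frames are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

def compare_schemas(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Report column-name and dtype differences between two DataFrames."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    shared = [c for c in left.columns if c in right_cols]
    return {
        "only_in_left": sorted(left_cols - right_cols),
        "only_in_right": sorted(right_cols - left_cols),
        "dtype_mismatches": {
            c: (str(left[c].dtype), str(right[c].dtype))
            for c in shared
            if left[c].dtype != right[c].dtype
        },
        "same_column_order": list(left.columns) == list(right.columns),
    }

# Tiny stand-in frames (illustrative values, not taken from the CSV files)
raw_demo = pd.DataFrame({"student_id": [1000, 1001], "exam_score": [55.5, 80.0]})
inc_demo = pd.DataFrame({"student_id": [2000], "exam_score": [91.25]})

report = compare_schemas(raw_demo, inc_demo)
print(report)
```

In the notebook itself, the same call would presumably be `compare_schemas(raw, inc)`, with empty lists and an empty dict indicating a consistent schema.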


In [2]:
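#Data inspection
raw.head()      # not rendered here: only a cell's last expression is displayed
raw.info()      # prints column names, non-null counts, and dtypes
raw.describe()  # summary statistics for the numeric columns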

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   student_id   8000 non-null   int64  
 1   name         8000 non-null   object 
 2   gender       8000 non-null   object 
 3   age          8000 non-null   int64  
 4   subject      8000 non-null   object 
 5   exam_score   8000 non-null   float64
 6   exam_date    8000 non-null   object 
 7   region       8000 non-null   object 
 8   grade_level  8000 non-null   object 
 9   school       8000 non-null   object 
dtypes: float64(1), int64(2), object(7)
memory usage: 625.1+ KB


|       | student_id | age | exam_score |
|-------|------------|-----|------------|
| count | 8000.0 | 8000.0 | 8000.0 |
| mean | 5506.6665 | 21.48475 | 69.84867 |
| std | 2605.57068 | 2.294488 | 17.305507 |
| min | 1000.0 | 18.0 | 40.01 |
| 25% | 3252.5 | 19.0 | 54.7275 |
| 50% | 5513.5 | 22.0 | 69.815 |
| 75% | 7758.25 | 23.0 | 84.6825 |
| max | 9999.0 | 25.0 | 100.0 |


### Dataset Overview

The dataset has 10 columns: `student_id`, `name`, `gender`, `age`, `subject`, `exam_score`, `exam_date`, `region`, `grade_level`, and `school`.

- The `.info()` summary shows 8,000 non-null values in every column, so all columns loaded without gaps.  
- `.describe()` shows `age` ranging from 18 to 25 and `exam_score` from 40.01 to 100.0, both plausible for student exam data.  
- No obvious structural issues are observed at this stage.
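The range check read off `.describe()` above can also be made explicit. The sketch below counts out-of-range values per column. The bounds are assumptions (the source does not define valid ranges), and `out_of_range_counts` is a hypothetical helper.

```python
import pandas as pd

# Assumed bounds: the source does not define valid ranges, so these are
# illustrative. 18-25 matches the observed age range; 0-100 is a guess
# at the exam score scale.
EXPECTED_RANGES = {"age": (18, 25), "exam_score": (0.0, 100.0)}

def out_of_range_counts(df: pd.DataFrame, ranges: dict) -> dict:
    """Count values per column outside [low, high]; NaN counts as outside."""
    return {
        col: int((~df[col].between(low, high)).sum())
        for col, (low, high) in ranges.items()
    }

# Stand-in data with one bad value in each column
demo = pd.DataFrame({"age": [18, 25, 17], "exam_score": [40.01, 100.0, 101.0]})
violations = out_of_range_counts(demo, EXPECTED_RANGES)
print(violations)
```

`Series.between` is inclusive on both ends by default, so values exactly at a bound pass the check.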


In [3]:
#Data quality checks
raw.isnull().sum()


student_id     0
name           0
gender         0
age            0
subject        0
exam_score     0
exam_date      0
region         0
grade_level    0
school         0
dtype: int64

In [4]:
raw.duplicated().sum()


np.int64(0)

In [5]:
raw.dtypes

student_id       int64
name            object
gender          object
age              int64
subject         object
exam_score     float64
exam_date       object
region          object
grade_level     object
school          object
dtype: object

## Data Quality Assessment
After inspecting the raw dataset, the following observations were made:

1. The `exam_date` column is stored as `object` type.
   - It should be converted to `datetime` format for accurate time-based analysis.

2. No missing values were detected in any of the 10 columns.

3. No fully duplicated rows exist in the dataset. (`raw.duplicated()` compares entire rows, so repeated `student_id` values are not ruled out.)

4. All other data types are appropriate:
   - Numeric fields (`student_id`, `age`, `exam_score`) are stored as integers or floats.
   - Text and categorical fields (`name`, `gender`, `subject`, `region`, `grade_level`, `school`) are stored as `object`.
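Two follow-up checks are suggested by these observations: whether `exam_date` parses cleanly as a date, and whether `student_id` repeats. The sketch below uses stand-in data. The ISO `%Y-%m-%d` format is an assumption about how the dates are written, and whether `student_id` is meant to be unique is not stated in the source.

```python
import pandas as pd

# Stand-in rows: one impossible date and one repeated student_id
demo = pd.DataFrame({
    "student_id": [1000, 1001, 1001],
    "exam_date": ["2024-01-15", "2024-02-30", "2024-03-01"],
})

# errors="coerce" turns unparseable strings into NaT instead of raising.
# The format string is an assumption about the file's date layout.
parsed = pd.to_datetime(demo["exam_date"], format="%Y-%m-%d", errors="coerce")
unparseable = int(parsed.isna().sum())

# duplicated() on a single column flags every repeat after the first occurrence
repeated_ids = int(demo["student_id"].duplicated().sum())

print(f"unparseable dates: {unparseable}, repeated ids: {repeated_ids}")
```

Counting `NaT` values after a coerced parse gives a quick measure of how many dates would be lost in the actual conversion.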

In [6]:
#Merge datasets
full = pd.concat([raw, inc], ignore_index=True)
full.shape

(9000, 10)

### Dataset Merge

The raw and incremental datasets were combined into a single DataFrame for validation.  
`pd.concat` appends the incremental rows after the raw rows, and `ignore_index=True` renumbers the index from 0.  

After merging:
- The combined dataset contains exactly **9,000 rows** (8,000 + 1,000).
- It still has 10 columns, which means both sources use the same column names (otherwise `pd.concat` would have added extra columns).
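Because `pd.concat` appends rows without checking keys, a record present in both files would appear twice after the merge. A minimal sketch of an overlap check follows, on stand-in data. It treats `student_id` as the record key and lets the incremental row win, both of which are assumptions rather than rules stated in the source.

```python
import pandas as pd

# Stand-in frames: student 1002 appears in both sources
raw_demo = pd.DataFrame({"student_id": [1000, 1001, 1002],
                         "exam_score": [50.0, 60.0, 70.0]})
inc_demo = pd.DataFrame({"student_id": [1002, 1003],
                         "exam_score": [75.0, 80.0]})

# IDs present in both sources
overlap = sorted(set(raw_demo["student_id"].tolist())
                 & set(inc_demo["student_id"].tolist()))

combined = pd.concat([raw_demo, inc_demo], ignore_index=True)

# keep="last" retains the incremental row, since inc_demo is concatenated second
deduped = (combined
           .drop_duplicates(subset="student_id", keep="last")
           .reset_index(drop=True))

print(overlap, len(combined), len(deduped))
```

If the overlap list is empty, the plain concatenation used in the notebook is already safe under this assumption.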


In [7]:
#Merge datasets (re-run of cell 6; nothing is saved here)
full = pd.concat([raw, inc], ignore_index=True)
full.shape


(9000, 10)

### Data Validation Summary
Both raw and incremental datasets were successfully merged and inspected.  
The resulting datasets are saved to the `data/` folder as:

- `validated_data.csv` (merged raw and incremental records)
- `validated_incremental_data.csv` (incremental records only)

These files keep the original structure. The raw dataset was confirmed to be free of missing values and fully duplicated rows; the incremental dataset was not checked separately in this notebook.  
The next notebook (`etl_transform.ipynb`) will perform cleaning and data type conversions such as converting `exam_date` to datetime.


In [9]:
#Save validated data
full.to_csv('data/validated_data.csv', index=False)
inc.to_csv('data/validated_incremental_data.csv', index=False)

### Saving Validated Data

The validated datasets were exported to the `data/` folder as:

- `validated_data.csv`
- `validated_incremental_data.csv`

These files contain the original data unchanged; no rows or values were modified in this notebook.  
They will serve as the input for the **Transform** phase, where further cleaning, standardization, and enrichment will occur.
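One way to confirm that an export is intact is to read the file back and compare it with the in-memory DataFrame. The sketch below does this in a temporary directory with stand-in data. The file name `validated_demo.csv` is hypothetical, chosen so the example does not touch the real `data/` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in data (illustrative values only)
demo = pd.DataFrame({"student_id": [1000, 1001], "exam_score": [55.5, 80.0]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "validated_demo.csv"  # hypothetical file name
    demo.to_csv(out_path, index=False)           # same call style as the notebook
    reloaded = pd.read_csv(out_path)

# Compare the basic structure of the written file with the original
same_shape = reloaded.shape == demo.shape
same_columns = list(reloaded.columns) == list(demo.columns)

print(same_shape, same_columns)
```

Writing with `index=False`, as the notebook does, keeps the row index out of the file, so the reloaded frame has the same columns as the original.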


## ✅ Extract Phase Summary

| Step | Description | Status |
|------|--------------|--------|
| Data Loading | Imported raw and incremental datasets | ✅ |
| Data Inspection | Checked structure and summary stats | ✅ |
| Data Quality | No missing values or fully duplicated rows found in the raw dataset | ✅ |
| Data Validation | Confirmed both sources have the same 10 columns | ✅ |
| Data Merge | Combined datasets into a single 9,000-row version | ✅ |
| File Export | Saved validated CSV files to `data/` | ✅ |

The Extract phase has been successfully completed.  
The next notebook (`etl_transform.ipynb`) will handle cleaning, formatting, and feature transformation.
