In [1]:
import pandas as pd

## Item 1 - Loading

Load the dataset and show its dimensions.

In [None]:


# Load Dataset3.csv into a dataframe
df = pd.read_csv('Dataset3.csv', delimiter=';')

# Print dimensions
print("Dataset dimensions:", df.shape)

df.head()


### Item 1 - Loading Commentary

The dataset has 4424 rows and 37 columns.

## Item 1 - Principal Component Analysis

Create a new dataframe that transforms and reduces the feature set of the original dataframe using PCA. The PCA dataset will contain only 3 principal components (plus the Target column). Visualise the new PCA dataset.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Create copy of dataframe without Target column for PCA
X = df.drop('Target', axis=1)

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA with 3 components
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Create new dataframe with PCA results
pca_df = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2', 'PC3'])

# Add Target column back
pca_df['Target'] = df['Target'].values

# Display first few rows
print("\nPCA Dataset:")
print(pca_df.head())
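
Reducing the features to 3 components raises the question of how much of the original variance those components retain. The sketch below shows one way to check this with scikit-learn's `explained_variance_ratio_` attribute. The helper name `explained_variance_summary` and the synthetic data are illustrative assumptions, not part of the original notebook; in the notebook itself the feature frame `X` would be passed in instead.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def explained_variance_summary(features: pd.DataFrame, n_components: int = 3) -> pd.DataFrame:
    """Standardise the features, fit PCA, and report the variance each component retains."""
    scaled = StandardScaler().fit_transform(features)
    fitted = PCA(n_components=n_components).fit(scaled)
    ratios = fitted.explained_variance_ratio_
    return pd.DataFrame(
        {
            "explained_variance_ratio": ratios,
            "cumulative": np.cumsum(ratios),
        },
        index=[f"PC{i + 1}" for i in range(n_components)],
    )


# Synthetic stand-in data (NOT Dataset3.csv): 200 rows, 6 numeric features
rng = np.random.default_rng(0)
demo_features = pd.DataFrame(
    rng.normal(size=(200, 6)),
    columns=[f"f{i}" for i in range(6)],
)

summary = explained_variance_summary(demo_features)
print(summary)
```

A low cumulative value for PC3 would suggest that a 3D plot of the components hides much of the structure in the original features, which is worth keeping in mind when reading the visualisation.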


In [None]:
import matplotlib.pyplot as plt
# Only needed to register the '3d' projection on older Matplotlib versions
from mpl_toolkits.mplot3d import Axes3D

# Create 3D scatter plot
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Plot points colored by Target
colors = {'Dropout': 'red', 'Graduate': 'blue', 'Enrolled': 'orange'}
for target in pca_df['Target'].unique():
    mask = pca_df['Target'] == target
    ax.scatter(pca_df.loc[mask, 'PC1'], 
              pca_df.loc[mask, 'PC2'],
              pca_df.loc[mask, 'PC3'],
              c=colors[target],
              label=target,
              alpha=0.6)

# Set labels and title
ax.set_xlabel('First Principal Component')
ax.set_ylabel('Second Principal Component') 
ax.set_zlabel('Third Principal Component')
ax.set_title('3D PCA Visualization')

# Add legend
ax.legend()

# Show plot
plt.show()



### Item 1 - PCA Commentary

From the 3D visualisation it can be seen that there is an imbalance in the target classes, with Graduate being the most populous, followed by Dropout and then Enrolled. The Graduate group also forms a much better-defined cluster across the three components than the other two classes.

In [None]:
# Print value counts of Target column
print("\nTarget Distribution:")
print(pca_df['Target'].value_counts())


A value count of the Target column confirms there is indeed a class imbalance.
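
Raw counts are easier to compare when they are shown alongside proportions. The following is a minimal sketch, assuming a hypothetical helper `class_balance` and a small made-up label series rather than the real Target column; in the notebook, `pca_df['Target']` would be passed in.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and the share of each class, most frequent first."""
    counts = labels.value_counts()
    return pd.DataFrame({"count": counts, "proportion": counts / counts.sum()})


# Made-up labels for illustration only (NOT the real Target distribution)
demo_labels = pd.Series(
    ["Graduate"] * 5 + ["Dropout"] * 3 + ["Enrolled"] * 2,
    name="Target",
)

balance = class_balance(demo_labels)
print(balance)
```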

## Item 2 - Train / Test Splitting

Here we split the dataset into train / test sets. Care is needed because the Target is imbalanced. We must produce 3 train / test splits, as there are 3 classes in the Target output variable.
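
One common way to handle an imbalanced target when splitting is stratification, which keeps the class proportions the same in the train and test sets. The sketch below uses scikit-learn's `train_test_split` with its `stratify` argument. The helper name `stratified_split`, the 80/20 split, the seed, and the demo data are all assumptions made for illustration; how the three required splits should be defined is not specified here, so this shows a single split only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split(data: pd.DataFrame, target_col: str = "Target",
                     test_size: float = 0.2, seed: int = 42):
    """Split features and target while preserving the class proportions."""
    features = data.drop(columns=[target_col])
    target = data[target_col]
    return train_test_split(
        features,
        target,
        test_size=test_size,   # assumed 80/20 split
        stratify=target,       # keep class proportions equal in both sets
        random_state=seed,     # assumed seed, for reproducibility
    )


# Made-up imbalanced data for illustration only (NOT Dataset3.csv)
demo = pd.DataFrame({
    "PC1": range(100),
    "PC2": range(100, 200),
    "Target": ["Graduate"] * 50 + ["Dropout"] * 30 + ["Enrolled"] * 20,
})

X_train, X_test, y_train, y_test = stratified_split(demo)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

Both printed distributions should show the same class shares as the full demo frame, which is the property that matters for an imbalanced target.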