# Lab One: Visualization and Data Preprocessing
# by Nino Castellano, Aayush Dalal, Chloe Prowse, Muskaan Mahes

# 1. Business Understanding
# The Student Placement Dataset from Kaggle was collected and analyzed to determine whether students’ results were sufficient to obtain a job offer. The dataset contained over 50,000 records consisting of academic, technical, and soft-skill attributes that can influence the outcome of being placed or not. Therefore, the primary purpose of the dataset is to help students and educational institutions understand which factors are crucial for achieving a successful placement outcome. Using placement results is important, as they can provide real-world examples that can serve as a template for assessing how well a student is prepared for the job market.

# The main objective of this analysis is to explore the relationships between students’ key features and their placement outcome in order to identify important insights. The target variable in the training set, Placement_Status, records whether each student was placed and is used to measure that outcome.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.linear_model import LogisticRegression
from mlxtend.plotting import plot_decision_regions

# Importing the training dataset for initial EDA
df = pd.read_csv("/Users/muskaanmahes/Downloads/placement_train.csv")
print(df)





# 2. Data Understanding

# This dataset contains 15 columns: Student_ID, a unique identifier, and 14 attributes that describe a student profile. Age is a ratio-scale numeric value. The categorical features Gender, Degree, Branch, and Placement_Status are all nominal variables. The academic variables Internships, Projects, Certifications, and Backlogs are ratio-scale counts, while CGPA is on an interval scale. Skill-related attributes such as Coding_Skills, Communication_Skills, Aptitude_Test_Score, and Soft_Skills_Rating are on either the ordinal or the interval scale, ranging from 1 to 10. The dataset therefore combines numerical and categorical features across several measurement scales, all of which are relevant for analyzing a student's performance.

# Data type of each attribute
df.info()




# After analyzing the dataset, no missing or duplicated data was identified. All 15 columns were checked for missing values, and the check returned an empty DataFrame, indicating that there are no incomplete rows. Additionally, the dataset contained zero duplicate rows, so each student entry is unique. Therefore, the overall data is clean and well-defined.

# Missing values: collect every row that has at least one null
# (the DataFrame loaded above is named df)
rows_with_nulls = df[df.isnull().any(axis=1)]
print("Rows with null values: ")
print(rows_with_nulls)

# Duplicate data: count rows that exactly repeat an earlier row
number_duplicates = df.duplicated().sum()
print(f"\nNumber of duplicate rows: {number_duplicates}")
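
# A per-column summary can complement the row-level check above. The sketch
# below is illustrative only: `toy` is a made-up frame (not the placement
# data) and `summarize_quality` is a hypothetical helper name.
import pandas as pd


def summarize_quality(frame):
    """Return (null count per column, number of fully duplicated rows)."""
    nulls_per_column = frame.isnull().sum()
    duplicate_rows = int(frame.duplicated().sum())
    return nulls_per_column, duplicate_rows


# Toy data: one missing CGPA, and the third row repeats the first.
toy = pd.DataFrame({"CGPA": [8.1, None, 8.1], "Projects": [2, 3, 2]})
toy_nulls, toy_duplicates = summarize_quality(toy)
print(toy_nulls)
print(f"Duplicate rows: {toy_duplicates}")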



# However, after calculating the interquartile range (IQR) for each numeric variable, several columns were flagged as having potential outliers: 153 in CGPA, 1652 in Internships, 1130 in Projects, and 4354 in Soft_Skills_Rating. These flagged values do not appear to be data entry errors, because they fall within plausible ranges. For example, a high CGPA such as 9.80 can indicate strong academic performance, and having multiple internships or projects is realistic for motivated students. Therefore, although the outliers can skew the metrics, we should not remove them. Instead, we could extend the IQR bounds so that fewer values are treated as extreme.

# Are there outliers? Apply the 1.5 * IQR rule to every numeric column
# (the DataFrame loaded above is named df)
numeric_column = df.select_dtypes(include=['float64', 'int64']).columns

for col in numeric_column:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Rows whose value falls outside [lower_bound, upper_bound]
    outlier = df[(df[col] < lower_bound) | (df[col] > upper_bound)]

    print(f"\nOutliers in '{col}': ")
    print(outlier[[col]])
    print(f"Total outliers in '{col}': {outlier.shape[0]}")
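
# Keeping flagged rows while limiting their influence could look like the
# sketch below: values beyond the IQR fences are capped at the fence, and a
# larger multiplier `k` widens the fences so fewer values are touched. This is
# an assumption about how "extending the bounds" might be implemented, not the
# method used in this lab; `cap_outliers_iqr`, `k`, and `toy_scores` are
# hypothetical names, and the data is made up.
import pandas as pd


def cap_outliers_iqr(series, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] instead of dropping rows."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


toy_scores = pd.Series([1, 5, 5, 6, 6, 6, 7, 7, 10], dtype=float)
capped_scores = cap_outliers_iqr(toy_scores)        # default 1.5 * IQR fences
wide_scores = cap_outliers_iqr(toy_scores, k=3.0)   # wider fences
print(capped_scores.tolist())
print(wide_scores.tolist())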



# The violin and strip plots depict the distributions of the four numerical features and their respective outliers. In the CGPA plot, high values such as 9.8 are flagged as outliers, yet they represent strong academic performance. The Internships plot ranges from 0 to 3, with a value of 3 flagged as an outlier, though it can still reflect a real-world scenario. In the Projects plot, the data is mainly concentrated between 2 and 5, while values of 1 and 6 are flagged as outliers, which may reflect lower or higher engagement. Lastly, Soft_Skills_Rating ranges from 1 to 10 and has outliers on both ends. Therefore, these outliers are flagged statistically but are probably genuine observations rather than errors.

# Visualizing outliers
# Columns that had outliers
column_outlier = ['CGPA', 'Internships', 'Projects', 'Soft_Skills_Rating']

plt.figure(figsize=(15, 10))

# Violin and strip plots, one subplot per column (df is the DataFrame loaded above)
for i, col in enumerate(column_outlier, 1):
    plt.subplot(2, 2, i)
    sns.violinplot(x=df[col], inner=None, color='gray')
    sns.stripplot(x=df[col], color='red', size=2, jitter=True)
    plt.title(f'Outlier Visualization of {col}')

plt.tight_layout()
plt.show()  # the call parentheses are required to render the figure


#Dimension Reduction: LDA + Logistic Regression
#defining the features and target variable
X = df[['CGPA', 'Internships', 'Projects', 'Soft_Skills_Rating']]
y = df['Placement_Status'].astype('category').cat.codes  # Converts to 0 (Not Placed), 1 (Placed)

# Train/test split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

#Standardize features
scale = StandardScaler()
x_train_std = scale.fit_transform(x_train)
x_test_std = scale.transform(x_test)

# Applying LDA with 1 component: with two outcome classes, LDA yields at most n_classes - 1 = 1 discriminant
lda = LDA(n_components=1)
x_train_lda = lda.fit_transform(x_train_std, y_train)
x_test_lda = lda.transform(x_test_std)

#training with logistic model
model = LogisticRegression()
model.fit(x_train_lda, y_train)
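
# The split above also produces a held-out test set. A minimal sketch of how
# such a set could be scored is shown below on synthetic data from
# make_classification (an assumption; this is not the placement data). All
# demo_* names are hypothetical, and the printed accuracy says nothing about
# the real dataset.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo_features, demo_labels = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=1
)
demo_tr_X, demo_te_X, demo_tr_y, demo_te_y = train_test_split(
    demo_features, demo_labels, test_size=0.3, random_state=1
)

# Chaining the steps means the scaler and LDA are fit on training rows only.
demo_pipeline = make_pipeline(
    StandardScaler(),
    LinearDiscriminantAnalysis(n_components=1),
    LogisticRegression(),
)
demo_pipeline.fit(demo_tr_X, demo_tr_y)
demo_accuracy = demo_pipeline.score(demo_te_X, demo_te_y)
print(f"Held-out accuracy on synthetic data: {demo_accuracy:.3f}")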

# To implement dimension reduction, linear discriminant analysis (LDA) was applied to reduce the feature space to a single linear discriminant (LD1), with Placement_Status as the target variable. LDA is a supervised technique that not only reduces dimensionality but also separates the two outcome classes. After this transformation, a logistic regression classifier was trained on the projected data. The resulting plot shows the linear decision boundary along LD1 and how the two classes separate. Logistic regression was chosen for its simplicity and effectiveness in binary classification.

#plot
plot_decision_regions(x_train_lda, y_train.to_numpy().astype(np.int_), clf=model)
plt.xlabel("LD1")
plt.ylabel("Decision Boundary")
plt.title("LDA + Logistic Regression: Decision Regions")
plt.show()
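
# For two classes, the single discriminant found by LDA is, up to scale,
# Fisher's direction w = Sw^{-1} (mu1 - mu0), where Sw is the within-class
# scatter matrix. The NumPy sketch below illustrates that idea on synthetic
# data; it is not scikit-learn's implementation, and `fisher_direction` and
# the demo_* names are hypothetical.
import numpy as np


def fisher_direction(X, y):
    """Unit vector along Sw^{-1} (mu1 - mu0) for binary labels 0 and 1."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: summed outer products of class-centered rows.
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    w = np.linalg.solve(Sw, mu1 - mu0)
    return w / np.linalg.norm(w)


demo_rng = np.random.default_rng(0)
demo_X = np.vstack([
    demo_rng.normal(0.0, 1.0, size=(200, 2)),  # class 0 cloud
    demo_rng.normal(3.0, 1.0, size=(200, 2)),  # class 1 cloud
])
demo_y = np.array([0] * 200 + [1] * 200)

demo_w = fisher_direction(demo_X, demo_y)
demo_ld1 = demo_X @ demo_w  # one-dimensional projection, analogous to LD1
print("Class means along the discriminant:",
      demo_ld1[demo_y == 0].mean(), demo_ld1[demo_y == 1].mean())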
