*Welcome! This notebook builds a model that predicts whether an engineering student gets placed. We explore features such as CGPA, internships, backlog history, gender, and stream, then train two classifiers and compare their accuracy. Let's get started!*


## Loading Tools

In [None]:
import numpy as np 
import pandas as pd 
import os

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

In [None]:
df = pd.read_csv("/kaggle/input/engineering-placements-prediction/collegePlace.csv")
df.head()

## Basic EDA

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df['Stream'].value_counts()

In [None]:
# Shorten the stream names so plot labels stay compact

# "EE" for Electrical avoids confusion with ECE (Electronics And Communication)
mapping = {"Electronics And Communication": "ECE", "Computer Science": "CSE", "Information Technology": "IT", "Mechanical": "MECH", "Civil": "Civil", "Electrical": "EE"}

df["Stream"] = df["Stream"].map(mapping)


In [None]:
df['Stream'].value_counts()

In [None]:
df.describe()

## Checking Outliers

In [None]:
# After checking every column, only Age shows outliers.

plt.figure(figsize = (10, 6), dpi = 100)
sns.boxplot(x = "Age", data = df)

## Removing Outliers

In [None]:
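# Keep only rows strictly between the 1st and 95th percentiles of Age.
# Note: the strict comparisons also drop rows equal to either threshold.
max_threshold = df['Age'].quantile(0.95)
print(max_threshold)

min_threshold = df['Age'].quantile(0.01)
print(min_threshold)

df = df[(df['Age'] < max_threshold) & (df['Age'] > min_threshold)]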
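
The percentile cut above is one way to trim outliers. As a point of comparison, here is a minimal sketch of the interquartile-range (Tukey) rule, which is a different technique from the one used in this notebook. The `ages` values are toy data made up for illustration, not rows from `collegePlace.csv`.

```python
import pandas as pd

# Toy ages for illustration only; not taken from collegePlace.csv.
ages = pd.Series([19, 20, 21, 21, 21, 22, 22, 22, 23, 24, 30], name="Age")

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# between() is inclusive, so rows sitting exactly on a bound are kept.
kept = ages[ages.between(lower, upper)]
print(f"bounds: [{lower}, {upper}], kept {len(kept)} of {len(ages)} rows")
```

Printing how many rows survive is a cheap sanity check for any outlier filter, since a filter on a column with few distinct values can remove far more rows than intended.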

## Data Visualization

In [None]:
plt.figure(figsize = (10, 6), dpi = 100)


color_palette = sns.color_palette("Accent_r")
sns.set_palette(color_palette)

sns.countplot(x = "Stream", data = df)

In [None]:
plt.figure(figsize = (10, 6), dpi = 100)


color_palette = sns.color_palette("cool")
sns.set_palette(color_palette)

sns.countplot(x = "Internships", data = df)
plt.show()

In [None]:
plt.figure(figsize = (10, 6), dpi = 100)
# Count how many students share each CGPA value, ordered by CGPA.
cgpa_counts = df['CGPA'].value_counts().sort_index()

plt.title("Distribution of CGPA")
plt.pie(cgpa_counts.values, labels = cgpa_counts.index)
plt.show()

In [None]:
plt.figure(figsize = (10, 6), dpi = 100)


# setting the different color palette
color_palette = sns.color_palette("Accent_r")
sns.set_palette(color_palette)

sns.countplot(x = "Gender", data = df)

plt.show()

## Relationships

In [None]:
plt.figure(figsize = (10, 6), dpi = 100)


# setting the different color palette
color_palette = sns.color_palette("plasma")
sns.set_palette(color_palette)

sns.barplot(x = "PlacedOrNot", y = "Gender", data = df)

plt.show()

In [None]:
plt.figure(figsize = (10, 6), dpi = 100)


# setting the different color palette
color_palette = sns.color_palette("magma")
sns.set_palette(color_palette)

sns.barplot(x = "Stream", y = "PlacedOrNot", data = df)

plt.show()

## Finally, how many placed?

In [None]:
plt.figure(figsize = (10, 6), dpi = 100)


# setting the different color palette
color_palette = sns.color_palette("BuGn_r")
sns.set_palette(color_palette)

sns.countplot(x = "PlacedOrNot", data = df)

plt.show()

## Correlation

In [None]:
plt.figure(figsize = (10, 6), dpi = 100)
color = sns.color_palette("BuGn_r")
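# Gender and Stream are still strings here, so restrict corr() to numeric columns.
sns.heatmap(df.corr(numeric_only=True), vmax=0.9, annot=True, cmap = color)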

In the heatmap above, CGPA shows the strongest correlation with `PlacedOrNot`, which suggests it is the most informative single feature.
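
Reading the strongest correlation off a heatmap can be error-prone, so a sorted list helps. The sketch below ranks features by absolute correlation with the target. The `toy` frame is made-up data that only borrows the notebook's column names, so its numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy data for illustration only; column names mirror the notebook's.
toy = pd.DataFrame({
    "CGPA": [6, 7, 8, 9, 6, 8, 9, 7],
    "Internships": [0, 1, 0, 2, 1, 1, 0, 0],
    "Stream": ["CSE", "IT", "ECE", "CSE", "MECH", "IT", "CSE", "ECE"],
    "PlacedOrNot": [0, 0, 1, 1, 0, 1, 1, 0],
})

# numeric_only=True skips string columns such as Stream.
target_corr = (
    toy.corr(numeric_only=True)["PlacedOrNot"]
    .drop("PlacedOrNot")                      # the target correlates 1.0 with itself
    .sort_values(key=abs, ascending=False)    # strongest relationship first
)
print(target_corr)
```

Keep in mind that Pearson correlation only captures linear relationships, so a low value does not prove a feature is useless to a tree-based model.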

## Model 1: RandomForestClassifier

First we transform the categorical variables into numbers, because scikit-learn's RandomForestClassifier works only with numerical data.

In [None]:
le = preprocessing.LabelEncoder()

df["Gender"] = le.fit_transform(df["Gender"])
df["Stream"] = le.fit_transform(df["Stream"])
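
Reusing a single `LabelEncoder` works, but each `fit_transform` call overwrites the previous mapping, so the `Gender` codes can no longer be decoded afterwards. A minimal sketch of one alternative, assuming we want to keep both mappings, is to store one encoder per column. The `toy` frame and the `encoders` dict are hypothetical names used only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame for illustration only.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Stream": ["CSE", "IT", "ECE"],
})

# One encoder per column, kept in a dict so each mapping stays available.
encoders = {}
for col in ["Gender", "Stream"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# classes_ lists the original labels; position i is the label encoded as i.
for col, enc in encoders.items():
    print(col, dict(enumerate(enc.classes_)))

# inverse_transform recovers the original labels from the codes.
original_streams = encoders["Stream"].inverse_transform(toy["Stream"])
```

Keeping the encoders around makes it possible to label plots or predictions with the original category names later on.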

In [None]:
X=df[['Age', 'Gender', 'Internships', 'CGPA', 'Hostel',
       'HistoryOfBacklogs', 'Stream']]
y= df["PlacedOrNot"]

x_train, x_test, y_train, y_test = train_test_split(X,y, test_size=0.30, random_state=100)

clff = RandomForestClassifier().fit(x_train,y_train)

pred = clff.predict(x_test)

## Accuracy of RandomForestClassifier

In [None]:
acc = accuracy_score(y_test, pred)
acc
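
Accuracy is a single number and can hide which class the model gets wrong. As a hedged sketch, the snippet below adds a confusion matrix and a per-class report. The `y_true` and `y_hat` lists are made-up labels for illustration; in this notebook they would correspond to `y_test` and `pred`.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Toy labels for illustration only; in the notebook these would be y_test and pred.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_hat = [1, 1, 1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes (order: 0, 1).
cm = confusion_matrix(y_true, y_hat)

print(f"accuracy: {acc:.3f}")
print(cm)
print(classification_report(y_true, y_hat))
```

The report lists precision and recall for each class, which is useful when placed and not-placed students are not equally represented.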

## Model 2: CatBoostClassifier

In [None]:
from catboost import CatBoostClassifier

X = df[['Age', 'Gender', 'Internships', 'CGPA', 'Hostel',
       'HistoryOfBacklogs', 'Stream']]

y = df["PlacedOrNot"]

x_train, x_test, y_train, y_test = train_test_split(X,y, test_size=0.30, random_state=100)

clf = CatBoostClassifier(
    iterations = 5,
    learning_rate = 0.1,
    loss_function = 'CrossEntropy',
).fit(x_train, y_train)


pred = clf.predict(x_test)


## Accuracy of CatBoostClassifier

In [None]:
acc = accuracy_score(y_test, pred)
acc

## Conclusion

In this project, we used two machine learning models, Random Forest and CatBoost, to predict placements of engineering students. Before training, we removed outliers in Age, inspected the correlations between features, and label-encoded the categorical variables. The visualizations gave a first look at how stream, internships, CGPA, and gender are distributed and how they relate to placement. Feel free to experiment with other models or tune hyperparameters to improve accuracy further. 🚀
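
For anyone who wants to try hyperparameter tuning, here is a minimal sketch using scikit-learn's `GridSearchCV`. It runs on synthetic data from `make_classification` rather than the placement dataset, and the grid values are arbitrary assumptions chosen to keep the search fast, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the placement features; 7 columns to match X above.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=100
)

# A deliberately small, hypothetical grid; widen it for a real search.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=100),
    param_grid,
    cv=5,                 # 5-fold cross-validation for each combination
    scoring="accuracy",
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

To apply this to the notebook's data, `X_demo` and `y_demo` would be replaced by `x_train` and `y_train`, with the held-out test set reserved for the final accuracy check.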

### Thanks for reading this notebook. Upvote it if you found it useful 😇.
### Check out my other notebooks 🙃
* [XGBoost V/S LightGBM](https://www.kaggle.com/code/neesham/xgboost-v-s-lightgbm)
* [🔥 Pandas V/S SQL](https://www.kaggle.com/code/neesham/pandas-v-s-sql)
* [🔥 Transformers for Beginners (P1)](https://www.kaggle.com/code/neesham/transformers-for-beginners-p1)