# Problem Description

Develop a machine learning system that predicts a student's final CGPA at the end of the fourth year from the grade points (GPs) earned in the courses of the earlier years (up to the first, second, or third year).

Model 1: predict final CGPA based on GPs of first year only.


Model 2: predict final CGPA based on GPs of first two years.


Model 3: predict final CGPA based on GPs of first three years.

## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

from math import ceil
from tabulate import tabulate
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

import warnings
warnings.filterwarnings('ignore')

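try:
    from ipywidgets import widgets
    from ipywidgets import Button, HBox, VBox
except ModuleNotFoundError:
    # Install ipywidgets on first use, then retry the imports
    !pip install ipywidgets
    from ipywidgets import widgets
    from ipywidgets import Button, HBox, VBox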

## Load the Dataset

In [None]:
# Read the CSV file in the project folder with the pandas read_csv method

df = pd.read_csv("The_Grades_Dataset.csv")

## Exploratory Data Analysis

In [None]:
rows, columns = df.shape
print(f"The given dataset contains {rows} rows and {columns} columns")

In [None]:
# Examining first 10 rows
df.head(10)

In [None]:
# Examining last 10 rows
df.tail(10)

In [None]:
# The only numerical variable present is CGPA

df.describe()

In [None]:
# 42 Categorical Variables with some null values
df.info()

In [None]:
# Checking the Unique Values of Each Column
df.apply(pd.Series.unique)

In [None]:
# Example: unique values of the column at index 18
df.apply(pd.Series.unique).values[18]

### Unique Values

The most frequent grades are:

    A+, A, A-, B+, B, B-, C+, C, C-, D+, D, F

Grade W stands for Withdrawal \
Grade WU stands for Withdrawal Unauthorized \
Grade I stands for Incomplete 

**Withdrawal (W)** \
A withdrawal from a class (W) is GPA-neutral: instead of a grade, you receive a W notation on your transcript which does not affect your GPA; you also don't earn credits for the course.

**Withdrawal Unauthorized (WU)** \
An unauthorized withdrawal from a class (WU) is not GPA-neutral: for grade point average and progress point computation, this symbol is equivalent to an "F".

**Incomplete Grade (I)** \
An Incomplete grade, in the GPA computations, has the same effect as a grade F.
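
As a minimal sketch of these three rules, the helper below averages grade points while skipping W and counting WU and I as F. The 4.0-scale values in `POINTS`, the function name `gpa`, and the unweighted average (no credit hours) are illustrative assumptions, not the institution's actual formula.

```python
# Hypothetical 4.0-scale grade points, used only for illustration
POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}


def gpa(grades):
    """Unweighted mean of grade points; W is skipped, WU and I count as F."""
    counted = []
    for grade in grades:
        if grade == "W":
            continue  # GPA-neutral: the course is ignored entirely
        if grade in ("WU", "I"):
            grade = "F"  # same effect as a failing grade
        counted.append(POINTS[grade])
    return sum(counted) / len(counted) if counted else 0.0


neutral = gpa(["A", "B", "W"])     # W leaves the average of A and B unchanged
penalised = gpa(["A", "B", "WU"])  # WU pulls the average down like an F
```

In this sketch a W leaves the average untouched, whereas a WU or I lowers it exactly as an F would.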

### Missing Values

We assume that a missing value means the student has not taken that specific course. If a row has too many missing values, however, we drop it, because such a row is an outlier and would hurt model performance.
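
The idea can be sketched on a toy table: count the missing values per row and keep only the rows below a threshold. The column names, the toy grades, and the threshold of 2 are assumptions made for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Toy grades table; None marks a course the student has not taken
toy = pd.DataFrame({
    "c1": ["A", None, None],
    "c2": ["B", "C", None],
    "c3": ["A", "B", None],
})

# Number of missing grades in each row
missing_per_row = toy.isnull().sum(axis=1)

# Hypothetical threshold: rows with 2 or more missing grades are dropped
threshold = 2
kept = toy[missing_per_row < threshold]
```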

In [None]:
# Checking missing values with respect to each row

for index, i in enumerate(df.isnull().sum(axis=1).to_list()):
    if i:
        print(index,": ",i)

### Irrelevant Values

Seat No. is unique to each student and does not help predict the CGPA.
Courses whose number starts with 4 are final-year courses. Since we predict the CGPA from the GPs of the first year, the first two years, or the first three years only, these courses are not used and we simply drop them.
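
A small sketch of how the year can be read from a course code, assuming every code has the form `XX-NNN` with the year as the first digit after the hyphen. The helper `course_year` and the list `demo_columns` are hypothetical; only `Seat No.`, `CS-403` and `CGPA` are names that appear in this notebook.

```python
def course_year(code):
    """Return the year digit of a course code such as 'CS-403' (year 4)."""
    return int(code.split("-")[1][0])


# Hypothetical column list for illustration
demo_columns = ["Seat No.", "PH-121", "CS-205", "CS-301", "CS-403", "CGPA"]
non_courses = {"Seat No.", "CGPA"}

# Final-year courses are the ones whose number starts with 4
final_year = [c for c in demo_columns
              if c not in non_courses and course_year(c) == 4]

# Everything except the seat number and the final-year courses is kept
kept = [c for c in demo_columns
        if c != "Seat No." and c not in final_year]
```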

In [None]:
print(df.columns)

## Data Preprocessing

### Steps

1. Drop the columns Seat No., CS-403, CS-421, CS-406, CS-414, CS-419, CS-423, CS-412.
2. Drop rows with 23 or more missing values, then fill all remaining missing values with 0.
3. Replace grades WU and I with grade F, and grade W with 0, to make encoding simpler.
4. Encode all the grades as grade points.
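
The preprocessing can be sketched end to end on a toy frame. Following the notebook's code cells, WU and I are treated as F and W is mapped to 0, like a missing grade. The toy rows, the course `CS-105`/`CS-205` columns, and the use of `Series.map` instead of `DataFrame.replace` are assumptions chosen to keep the example short.

```python
import pandas as pd

# Toy data; only the column names "Seat No.", "CS-403" and "CGPA" come from the notebook
toy = pd.DataFrame({
    "Seat No.": ["S-1", "S-2", "S-3"],
    "CS-105": ["A", "WU", None],
    "CS-205": ["B+", "W", "I"],
    "CS-403": ["A-", "C", "D"],
    "CGPA": [3.6, 1.9, 2.1],
})

grade_points = {"F": 1.0, "D": 2.0, "D+": 2.4, "C-": 2.7, "C": 3.0, "C+": 3.4,
                "B-": 3.7, "B": 4.0, "B+": 4.4, "A-": 4.7, "A": 5.0, "A+": 5.0}
# WU and I behave like F; W is GPA-neutral and gets 0
codes = {**grade_points, "WU": 1.0, "I": 1.0, "W": 0.0}

# Step 1: drop the identifier and the final-year course
clean = toy.drop(columns=["Seat No.", "CS-403"])

# Steps 2-4: encode every grade column; unmapped (missing) entries become 0
grade_cols = [c for c in clean.columns if c != "CGPA"]
clean[grade_cols] = clean[grade_cols].apply(lambda col: col.map(codes)).fillna(0.0)
```

Mapping through a single dictionary keeps every grade column numeric, which is what the correlation matrix and the regressors expect.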

#### Drop Irrelevant Features

In [None]:
# Seat_No. has no effect on CGPA so we will drop it
# Also the grades of fourth year will not be used by model

df = df.drop(["Seat No.", "CS-403", "CS-421", "CS-406", "CS-414", "CS-419", "CS-423", "CS-412"], axis=1)

#### Handle Missing Values

In [None]:
# We can drop a row if it has too many missing values
# Let's set our threshold to 23

df = df.drop(df[df.isnull().sum(axis=1) >= 23].index)

# Now the remaining null values are filled by 0, assuming the student hasn't opted for that course yet
# Setting to 0 to show that this grade won't affect the final CGPA

df = df.fillna(0)

In [None]:
# All columns have 566 non-null values

df.info()

#### Encoding Variables

In [None]:
# Replace WU and I with F; replace W with a numeric 0 (no effect on GPA)
# so that the column stays numeric after encoding
df = df.replace({"WU": "F",
                 "I": "F",
                 "W": 0})

In [None]:
# Checking the Data again
df.apply(pd.Series.unique)

![image.png](attachment:image.png)

Now we encode the grades as grade points.
Since the null values were given a weight of 0, we give F a weight of 1 and scale the other grades accordingly.

In [None]:
grade_points = {"F":  1.0,
                "D":  2.0,
                "D+": 2.4,
                "C-": 2.7,
                "C":  3.0,
                "C+": 3.4,
                "B-": 3.7,
                "B":  4.0,
                "B+": 4.4,
                "A-": 4.7,
                "A":  5.0,
                "A+": 5.0 }

In [None]:
# Encoding all the Grades to their respective weights

df = df.replace(grade_points)

In [None]:
df.head()

### Correlation Matrix

In [None]:
# Compute the correlation matrix
corr_mat=df.corr(method='pearson')
plt.figure(figsize=(20,10))
sns.heatmap(corr_mat,vmax=1,square=True,annot=True,cmap='cubehelix')

### Encoding Function

In [None]:
# We have written an encoding function to encode any new data

def encode_data(input_df):
    """
    Encode letter grades as grade points for model training.

    Input: pandas DataFrame of letter grades
    Returns: pandas DataFrame of grade points
    """
    # A missing grade means the course was not taken
    df = input_df.fillna(0)
    # WU and I count as F; W is GPA-neutral, so it becomes a numeric 0
    df = df.replace({"WU": "F",
                     "I": "F",
                     "W": 0})

    grade_points = {"F":  1.0,
                    "D":  2.0,
                    "D+": 2.4,
                    "C-": 2.7,
                    "C":  3.0,
                    "C+": 3.4,
                    "B-": 3.7,
                    "B":  4.0,
                    "B+": 4.4,
                    "A-": 4.7,
                    "A":  5.0,
                    "A+": 5.0}

    df = df.replace(grade_points)

    return df

## Model Training

#### Input Features and Target Variable

In [None]:
X = df.drop("CGPA", axis=1)
y = df[['CGPA']]

#### Splitting the Dataset

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

#### Preparing Dataset for Model 1, Model 2 and Model 3

In [None]:
columns = df.columns.to_list()
first_year, first_two_years = [],[]
for i in columns:
    if i[3] == '1':
        first_year.append(i)
        first_two_years.append(i)
    elif i[3] == '2':
        first_two_years.append(i)
        
first_three_years = list(X.columns)

# Show the three feature lists, then the number of features for Model 3
print(first_year, first_two_years, first_three_years, sep="\n")
len(first_three_years)

### Model 1

#### DataFrame for Model 1

In [None]:
df1 = df[first_year + ["CGPA"]]
x1 = df1.drop("CGPA", axis=1)
y1 = df1[['CGPA']]
X_train1, X_test1, y_train1, y_test1 = train_test_split(x1, y1, test_size=0.2, random_state=0)

#### Linear Regression

In [None]:
reg1 = linear_model.LinearRegression()
reg1.fit(X_train1, y_train1)
r1 = reg1.score(X_test1, y_test1)
r1

#### Gradient Boosting Regressor

In [None]:
grad1 = GradientBoostingRegressor()
grad1.fit(X_train1, y_train1)
g1 = grad1.score(X_test1, y_test1)
g1

### Model 2

#### DataFrame for Model 2

In [None]:
df2 = df[first_two_years + ["CGPA"]]
x2 = df2.drop("CGPA", axis=1)
y2 = df2[['CGPA']]
X_train2, X_test2, y_train2, y_test2 = train_test_split(x2, y2, test_size=0.2, random_state=0)

#### Linear Regression

In [None]:
reg2 = linear_model.LinearRegression()
reg2.fit(X_train2, y_train2)
r2 = reg2.score(X_test2, y_test2)
r2

#### Gradient Boosting Regressor

In [None]:
grad2 = GradientBoostingRegressor()
grad2.fit(X_train2, y_train2)
g2 = grad2.score(X_test2, y_test2)
g2

### Model 3

#### DataFrame for Model 3

In [None]:
df3 = df[columns]
x3 = df3.drop("CGPA", axis=1)
y3 = df3[['CGPA']]
X_train3, X_test3, y_train3, y_test3 = train_test_split(x3, y3, test_size=0.2, random_state=0)

#### Linear Regression

In [None]:
reg3 = linear_model.LinearRegression()
reg3.fit(X_train3, y_train3)
r3 = reg3.score(X_test3, y_test3)
r3

#### Gradient Boosting Regressor

In [None]:
grad3 = GradientBoostingRegressor()
grad3.fit(X_train3, y_train3)
g3 = grad3.score(X_test3, y_test3)
g3

### Testing More Algorithms on these Models
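
The cells in this section all repeat the same fit-and-score pattern, which can be sketched as a single loop over candidate estimators. The synthetic data (`X_demo`, `y_demo`) and the `candidates` dictionary are assumptions for illustration only; the resulting scores say nothing about the grades dataset.

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 6 "grade" features and a target close to their scaled mean
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 5, size=(200, 6))
y_demo = 0.8 * X_demo.mean(axis=1) + rng.normal(0, 0.05, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Candidate regressors with the same alpha used in the cells of this section
candidates = {
    "Lasso": linear_model.Lasso(alpha=0.1),
    "Ridge": linear_model.Ridge(alpha=0.1),
    "ElasticNet": linear_model.ElasticNet(alpha=0.1),
}

# Fit each estimator on the training split and record its test R² score
scores = {name: est.fit(X_tr, y_tr).score(X_te, y_te)
          for name, est in candidates.items()}
```

Collecting the scores in one dictionary makes it easy to print or tabulate them side by side.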

### Model 1

#### Lasso Regression

In [None]:
lasso = linear_model.Lasso(alpha=0.1)
lasso.fit(X_train1, y_train1)
ls1 = lasso.score(X_test1, y_test1)
ls1

#### Ridge Regression

In [None]:
ridge = linear_model.Ridge(alpha=0.1)
ridge.fit(X_train1, y_train1)
rg1 = ridge.score(X_test1, y_test1)
rg1

#### ElasticNet Regression

In [None]:
elasticnet = linear_model.ElasticNet(alpha=0.1)
elasticnet.fit(X_train1, y_train1)
en1 = elasticnet.score(X_test1, y_test1)
en1

### Model 3

#### Huber Regression

In [None]:
huber = linear_model.HuberRegressor()
huber.fit(X_train3, y_train3)
hb3 = huber.score(X_test3, y_test3)
hb3

#### Decision Tree Regressor

In [None]:
dtreg = DecisionTreeRegressor()
dtreg.fit(X_train3, y_train3)
dt3 = dtreg.score(X_test3, y_test3)
dt3

#### KNeighbours Regressor

In [None]:
knreg = KNeighborsRegressor()
knreg.fit(X_train3, y_train3)
kn3 = knreg.score(X_test3, y_test3)
kn3

## Model Evaluation
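
The `score` method of scikit-learn regressors returns the coefficient of determination R², so the values compared in this section are R² scores rather than a classification accuracy. As a minimal sketch, R² can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares; the helper name `r_squared` and the toy numbers are illustrative.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot


perfect = r_squared([1, 2, 3], [1, 2, 3])    # predictions match exactly
mean_only = r_squared([1, 2, 3], [2, 2, 2])  # always predicting the mean
```

A model that always predicts the mean of the target scores 0, a perfect model scores 1, and a model worse than the mean baseline scores below 0.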

In [None]:
# A snippet of code taken from the official scikit-learn website to visualize the learning and cross-validation curves
# in order to test for overfitting

def plot_learning_curve(
    estimator,
    title,
    X,
    y,
    axes=None,
    ylim=None,
    cv=None,
    n_jobs=None,
    train_sizes=np.linspace(0.1, 1.0, 5),
):
    """
    Generate 3 plots: the test and training learning curve, the training
    samples vs fit times curve, the fit times vs score curve.
    
    """
    if axes is None:
        _, axes = plt.subplots(1, 3, figsize=(20, 5))

    axes[0].set_title(title)
    if ylim is not None:
        axes[0].set_ylim(*ylim)
    axes[0].set_xlabel("Training examples")
    axes[0].set_ylabel("Score")

    train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
        estimator,
        X,
        y,
        cv=cv,
        n_jobs=n_jobs,
        train_sizes=train_sizes,
        return_times=True,
    )
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fit_times_mean = np.mean(fit_times, axis=1)
    fit_times_std = np.std(fit_times, axis=1)

    # Plot learning curve
    axes[0].grid()
    axes[0].fill_between(
        train_sizes,
        train_scores_mean - train_scores_std,
        train_scores_mean + train_scores_std,
        alpha=0.1,
        color="r",
    )
    axes[0].fill_between(
        train_sizes,
        test_scores_mean - test_scores_std,
        test_scores_mean + test_scores_std,
        alpha=0.1,
        color="g",
    )
    axes[0].plot(
        train_sizes, train_scores_mean, "o-", color="r", label="Training score"
    )
    axes[0].plot(
        train_sizes, test_scores_mean, "o-", color="g", label="Cross-validation score"
    )
    axes[0].legend(loc="best")

    # Plot n_samples vs fit_times
    axes[1].grid()
    axes[1].plot(train_sizes, fit_times_mean, "o-")
    axes[1].fill_between(
        train_sizes,
        fit_times_mean - fit_times_std,
        fit_times_mean + fit_times_std,
        alpha=0.1,
    )
    axes[1].set_xlabel("Training examples")
    axes[1].set_ylabel("fit_times")
    axes[1].set_title("Scalability of the model")

    # Plot fit_time vs score
    fit_time_argsort = fit_times_mean.argsort()
    fit_time_sorted = fit_times_mean[fit_time_argsort]
    test_scores_mean_sorted = test_scores_mean[fit_time_argsort]
    test_scores_std_sorted = test_scores_std[fit_time_argsort]
    axes[2].grid()
    axes[2].plot(fit_time_sorted, test_scores_mean_sorted, "o-")
    axes[2].fill_between(
        fit_time_sorted,
        test_scores_mean_sorted - test_scores_std_sorted,
        test_scores_mean_sorted + test_scores_std_sorted,
        alpha=0.1,
    )
    axes[2].set_xlabel("fit_times")
    axes[2].set_ylabel("Score")
    axes[2].set_title("Performance of the model")

    return plt

In [None]:
title = "Learning Curve Linear Regression"
# Cross validation with 50 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)

estimator = linear_model.LinearRegression()

plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=4)

plt.show()

In [None]:
title = "Learning Curve Gradient Boosting Regressor"
# Cross validation with 50 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)

estimator = GradientBoostingRegressor()

plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=4)

plt.show()

In [None]:
# R² scores of both algorithms for each of the three models
plotdata = pd.DataFrame({
    "LR": [r1, r2, r3],
    "GB": [g1, g2, g3]},
    index=["Model1", "Model2", "Model3"])

plotdata.plot(kind="bar", figsize=(15, 8), color=["red", "green"])
plt.xlabel('Model')
plt.ylabel('R² score')
plt.title('R² Score Bar Plot')
plt.show()

In [None]:
table = [["Models", 1, 2, 3],["Linear Regression",r1,r2,r3],["Gradient Boosting Regressor",g1,g2,g3]]
print(tabulate(table))

#### Finalizing Models
Linear Regression for Model 1 \
Gradient Boosting Regressor for Model 2 and Model 3

## Model Deployment

In [None]:
from IPython.display import display


def display_menu(model, course_names):
    """Show one grade dropdown per course and a button that predicts the CGPA."""
    grades = ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D', 'F', "W", "WU", "I"]

    # One dropdown per course; an untouched dropdown keeps the value None
    dropdowns = [widgets.Dropdown(options=grades, description=name, value=None)
                 for name in course_names]

    def btn_eventhandler(obj):
        # Collect the selected grades, encode them and predict the CGPA
        selected = [dropdown.value for dropdown in dropdowns]
        prediction_input = pd.DataFrame([selected], columns=course_names)
        prediction_input = encode_data(prediction_input)
        print("The predicted CGPA is:", model.predict(prediction_input))

    # Distribute the dropdowns over three columns
    left_box = dropdowns[0::3]
    centre_box = dropdowns[1::3]
    right_box = dropdowns[2::3]

    print("Please select a grade from the dropdown list for each course. If the student has not taken a course,\n simply leave its dropdown empty.")
    display(HBox([VBox(left_box), VBox(centre_box), VBox(right_box)]))

    btn = widgets.Button(description='Predict CGPA')
    display(btn)

    btn.on_click(btn_eventhandler)

    
print("="*124)
print("\t\t\t\t\t\tWELCOME TO CGPA PREDICTOR")
print("="*124)
print("\t\t\tTHIS MODEL CAN PREDICT THE FINAL CGPA OF A STUDENT AT THE END OF FOURTH YEAR\n \t\t\t\tGIVEN GPs OF THE COURSES OBTAINED IN FIRST THREE YEARS")
print("*"*124)
choice = int(input("Please select one of the following options:\n\
    1. Predict CGPA based on GPs of First Year\n\
    2. Predict CGPA based on GPs of First Two Years\n\
    3. Predict CGPA based on GPs of First Three Years\n"))
print("="*124)
if choice == 1:
    display_menu(reg1, first_year)
elif choice == 2:
    display_menu(grad2, first_two_years)
elif choice == 3:
    display_menu(grad3, first_three_years)
else:
    print("Please select a valid choice")