## Feature Columns
    
* ID: id
* Start time: start_time
* Completion time: completion_time
* Email: email
* Name: name
* Last modified time: last_modified_time
* Jamb score: jamb_score
* English: english
* Maths: maths
* Subject 3: subject_3
* Subject 4: subject_4
* Subject 5: subject_5
* What was your age in Year One: age_in_year_one
* Gender: gender
* Do you have a disability?: has_disability
* Did you attend extra tutorials?: attended_tutorials
* How would you rate your participation in extracurricular activities (tech, music, partying, fellowship, etc.) in Year One?: extracurricular_participation
* How would you rate your class attendance in Year One: class_attendance_rating
* How well did you participate in class activities (Assignments, Asking and Answering Questions, Writing Notes....): class_participation_rating
* Did you use extra materials for study in Year One? (Youtube, Other books, others): used_extra_study_materials
* Morning: morning_study
* Afternoon: afternoon_study
* Evening: evening_study
* Late Night: late_night_study
* How many days per week did you do reading on average in Year One?: days_per_week_reading
* On average, How many hours per day was used for personal study in Year One: hours_per_day_personal_study
* Did you teach your peers in Year One: taught_peers
* How many courses did you offer in Year One?: courses_offered
* Did you fall sick in Year One? if yes, How many times do you remember (0 if none): times_fell_sick
* What was your study mode in Year 1: study_mode
* Did you study the course your originally applied for?: studied_original_course
* Rate your financial status in Year One: financial_status_rating
* Rate the teaching style / method of the lectures received in Year One: teaching_style_rating
* What type of higher institution did you attend in Year One\n: institution_type
* What was your CGPA in Year One?: cgpa_year_one
* What grading system does your school use ( if others, type numbers only): grading_system

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,mean_absolute_error,explained_variance_score
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
import warnings
# Set display options to show all columns
pd.set_option('display.max_columns', None)

# Ignore warnings
warnings.filterwarnings('ignore')
print("Importation complete")

In [None]:
data_path = "../Data/year1_gpa.csv"  # Adjust the path as needed
gpa_data = pd.read_csv(data_path,encoding='latin1')
gpa_data.columns

In [None]:
# Dictionary to map old column names to new names
new_column_names = {
    'ID': 'id',
    'Start time': 'start_time',
    'Completion time': 'completion_time',
    'Email': 'email',
    'Name': 'name',
    'Last modified time': 'last_modified_time',
    'Jamb score': 'jamb_score',
    'English': 'english',
    'Maths': 'maths',
    'Subject 3': 'subject_3',
    'Subject 4': 'subject_4',
    'Subject 5': 'subject_5',
    'What was your age in Year One': 'age_in_year_one',
    'Gender': 'gender',
    'Do you have a disability?': 'has_disability',
    'Did you attend extra tutorials? ': 'attended_tutorials',
    'How would you rate your participation in extracurricular activities (tech, music, partying, fellowship, etc.) in Year One?': 'extracurricular_participation',
    'How would you rate your class attendance in Year One': 'class_attendance_rating',
    'How well did you participate in class activities (Assignments, Asking and Answering Questions, Writing Notes....)': 'class_participation_rating',
    'Did you use extra materials for study in Year One? (Youtube, Other books, others)': 'used_extra_study_materials',
    'Morning': 'morning_study',
    'Afternoon': 'afternoon_study',
    'Evening': 'evening_study',
    'Late Night': 'late_night_study',
    'How many days per week did you do reading on average in Year One?': 'days_per_week_reading',
    'On average, How many hours per day was used for personal study in Year One': 'hours_per_day_personal_study',
    'Did you teach your peers in Year One': 'taught_peers',
    'How many courses did you offer in Year One?': 'courses_offered',
    'Did you fall sick in Year One? if yes, How many times do you remember (0 if none)': 'times_fell_sick',
    'What was your study mode in Year 1': 'study_mode',
    'Did you study the course your originally applied for?': 'studied_original_course',
    'Rate your financial status in Year One': 'financial_status_rating',
    'Rate the teaching style / method of the lectures received in Year One': 'teaching_style_rating',
    'What type of higher institution did you attend in Year One\n': 'institution_type',
    'What was your CGPA in Year One?': 'cgpa_year_one',
    'What grading system does your school use ( if others, type numbers only)': 'grading_system'
}

# Rename columns using the dictionary
gpa_data.rename(columns=new_column_names, inplace=True)

# Preview the DataFrame with updated column names
gpa_data.head()
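
`DataFrame.rename` silently ignores dictionary keys that do not match an existing column, so a header with a stray trailing space or newline (as in two of the keys above) can go unrenamed without any error. A minimal sketch of a check for that, using a small made-up frame and a hypothetical mapping rather than the survey data:

```python
import pandas as pd

# Toy frame standing in for the survey export; headers are illustrative only
toy = pd.DataFrame({"Gender": ["F"], "Did you attend extra tutorials? ": ["Yes"]})

# Hypothetical mapping: the second key lacks the trailing space in the header
toy_names = {
    "Gender": "gender",
    "Did you attend extra tutorials?": "attended_tutorials",
}

# Keys that match no column would be skipped by rename without warning
unmatched = [key for key in toy_names if key not in toy.columns]
print("Unmatched keys:", unmatched)

renamed = toy.rename(columns=toy_names)
print(renamed.columns.tolist())
```

Passing `errors='raise'` to `rename` is another option; it raises a `KeyError` when a key is not found.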


## Feature Engineering

In [None]:
# List of columns to drop
columns_to_drop = ['start_time', 'completion_time', 'email', 'name', 'last_modified_time']

# Drop the specified columns
gpa_data = gpa_data.drop(columns=columns_to_drop)

# Print the DataFrame after dropping columns
gpa_data.head()

In [None]:
# Separate columns into numeric and categorical
numeric_columns = gpa_data.select_dtypes(include=[np.number]).columns.tolist()
# np.object was removed in NumPy 1.24; the string 'object' selects the same dtype
categorical_columns = gpa_data.select_dtypes(include=['object']).columns.tolist()

# Print the lists
print("Numeric Columns:")
print(numeric_columns)

print("\nCategorical Columns:")
print(categorical_columns)

In [None]:
# Ordinal encoding map
ordinal_encoding_map = {'A': 5, 'B': 4, 'C': 3, 'D': 2, 'E': 1, 'F': 0}

# Features to encode
features_to_encode = ['english', 'maths', 'subject_3', 'subject_4', 'subject_5']

# Apply ordinal encoding for the specified features
gpa_data[features_to_encode] = gpa_data[features_to_encode].apply(lambda col: col.map(ordinal_encoding_map))

# Perform label encoding for other categorical columns
categorical_columns = gpa_data.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()

for col in categorical_columns:
    gpa_data[col] = label_encoder.fit_transform(gpa_data[col])

# Create GPA_normal and drop unnecessary columns
gpa_data['GPA_normal'] = gpa_data['cgpa_year_one'] / gpa_data['grading_system']
gpa_data.drop(['grading_system', 'cgpa_year_one'], axis=1, inplace=True)


# Print the DataFrame after engineering
gpa_data.head()
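
The `GPA_normal` target divides each CGPA by its grading scale so that results from different systems become comparable. A small illustration with invented values (not taken from the dataset), assuming `grading_system` holds the scale maximum as a number:

```python
import pandas as pd

# Made-up CGPA values on two different grading scales
toy = pd.DataFrame({
    "cgpa_year_one": [4.0, 3.2, 3.5],
    "grading_system": [5.0, 4.0, 5.0],
})

# Dividing by the scale maximum maps every CGPA onto a common 0-1 range
toy["GPA_normal"] = toy["cgpa_year_one"] / toy["grading_system"]
print(toy)
```

Here a 4.0 on a 5-point scale and a 3.2 on a 4-point scale both normalize to 0.8.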

In [None]:
gpa_data.isnull().sum()
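
`Series.map` returns `NaN` for any value that is not a key in the mapping, so grades outside `A` to `F` would show up as missing values in this count. One possible way to handle such gaps, sketched here on invented data and not necessarily what this notebook intends, is to fill them with the column median:

```python
import pandas as pd

grade_map = {'A': 5, 'B': 4, 'C': 3, 'D': 2, 'E': 1, 'F': 0}

# Invented grades; 'A1' is not a key in the map and becomes NaN
toy = pd.DataFrame({"english": ["A", "B", "A1", "C"]})
toy["english"] = toy["english"].map(grade_map)
n_missing = int(toy["english"].isnull().sum())
print("Missing after mapping:", n_missing)

# Median imputation (an assumption chosen for illustration)
toy["english"] = toy["english"].fillna(toy["english"].median())
print(toy["english"].tolist())
```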

## Exploratory Data Analysis

In [None]:
gpa_data.describe()

## Scaling and Train Test Split

In [None]:
X = gpa_data.drop(['id', 'GPA_normal'], axis=1)  # Features excluding 'id' and 'GPA_normal'
y = gpa_data['GPA_normal']  # Target variable

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=101)

### Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
X_train= scaler.fit_transform(X_train)

In [None]:
X_test = scaler.transform(X_test)

In [None]:
display(X_train.shape)
display(X_test.shape)
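
The scaler is fitted on the training split only and then reused on the test split, which keeps test-set statistics out of preprocessing. A side effect, illustrated below with made-up numbers, is that scaled test values can fall outside the 0 to 1 range when the test data exceeds the training range:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature splits; the test set contains a value above the train max
train = np.array([[1.0], [3.0], [5.0]])
test = np.array([[2.0], [7.0]])

toy_scaler = MinMaxScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns min=1, max=5
test_scaled = toy_scaler.transform(test)        # reuses the training min/max

print(train_scaled.ravel())
print(test_scaled.ravel())
```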

## Creating a Model

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import Adam

In [None]:
# Hidden-layer width: the number of input features, i.e. X_train.shape[1]
n_units = X_train.shape[1]

model = Sequential()
model.add(Dense(n_units, activation='relu'))
model.add(Dense(n_units, activation='relu'))
model.add(Dense(n_units, activation='relu'))
model.add(Dense(n_units, activation='relu'))
# Single output unit for the regression target
model.add(Dense(1))

model.compile(optimizer='adam', loss='mse')

## Training the Model

In [None]:
model.fit(x=X_train,y=y_train.values,
          validation_data=(X_test,y_test.values),
          batch_size=128,epochs=400)

In [None]:
losses = pd.DataFrame(model.history.history)

In [None]:
losses.plot()

## Evaluation on Test Data

In [None]:
X_test

In [None]:
predictions = model.predict(X_test)

In [None]:
mean_absolute_error(y_test,predictions)

In [None]:
np.sqrt(mean_squared_error(y_test,predictions))
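
`explained_variance_score` is imported at the top of the notebook but not used in these cells. A hedged sketch of how the three metrics could be computed side by side, on invented values in place of `y_test` and `predictions`:

```python
import numpy as np
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error)

# Invented normalized GPAs; predictions start as (n, 1) like Keras output
y_true = np.array([0.5, 0.6, 0.7, 0.8])
y_pred = np.array([[0.55], [0.6], [0.65], [0.8]]).ravel()  # flatten to (n,)

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
evs = explained_variance_score(y_true, y_pred)
print(f"MAE={mae:.3f} RMSE={rmse:.3f} EVS={evs:.3f}")
```

Because the target is a normalized GPA between 0 and 1, MAE and RMSE are on that same scale.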

In [None]:
# Our predictions
plt.scatter(y_test,predictions)

# Perfect predictions
plt.plot(y_test,y_test,'r')

In [None]:
# reshape(-1, 1) matches the (n, 1) prediction shape without hard-coding the test size
errors = y_test.values.reshape(-1, 1) - predictions

In [None]:
# distplot is deprecated in seaborn; histplot with kde=True gives a similar view
sns.histplot(errors.ravel(), kde=True)

## Save the model

In [None]:
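# The model is a Keras network, so save it with the Keras API rather than joblib
# (use an '.h5' extension on older TensorFlow versions)
model_filename = 'year1_gpa_model.keras'
model.save(model_filename)

print('Model saved to', model_filename)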