# GPA Prediction Starter Notebook

Welcome to the GPA Prediction Starter Notebook for our project! 🚀

In this notebook, you'll find a ready-to-use Python script that provides a solid foundation for building a GPA predictor based on the data from `year1_gpa.csv`.

## Getting Started

To get started, follow these steps:

1. **Clone the Repository**: Begin by cloning this repository to your local machine.

2. **Organize Your Data**: Ensure that your GPA data is organized in the `Data` directory, particularly the `year1_gpa.csv` file.

3. **Open the Notebook**: Open this notebook in a Jupyter environment.

4. **Follow the Code**: The notebook contains commented code that guides you through the process of setting up the data, building and training the model, and evaluating its performance.

5. **Experiment and Contribute**: Feel free to experiment with different models, engineered features, hyperparameters, or preprocessing techniques. If you come up with improvements, consider contributing them back to the project!

## Important Notes

- Ensure that you have the necessary Python libraries, such as Pandas, NumPy, scikit-learn, and joblib, installed in your environment.
- If you encounter any issues or have questions, don't hesitate to reach out. We're here to help!

Happy coding, and let's build an amazing GPA predictor together! 


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
import warnings
import joblib
# Set display options to show all columns
pd.set_option('display.max_columns', None)

# Ignore warnings
warnings.filterwarnings('ignore')
print("Importation complete")

Importation complete


In [2]:
# Load the GPA data from year1_gpa.csv
gpa_data = pd.read_csv('year1_gpa.csv', encoding='latin-1')
gpa_data.head()

Unnamed: 0,ID,Start time,Completion time,Email,Name,Last modified time,Jamb score,English,Maths,Subject 3,Subject 4,Subject 5,What was your age in Year One,Gender,Do you have a disability?,Did you attend tutorials,"How would you rate your participation in extracurricular activities (tech, music, partying, fellowship, etc.) in Year One?",How would you rate your class attendance in Year One,"How well did you participate in class activities (Assignments, Asking and Answering Questions, Writing Notes....)","use of extra materials(youtube,google,etc)",Morning,Afternoon,Evening,Late Night,How many days per week did you do reading on average in Year One?,"On average, How many hours per day was used for personal study in Year One",Did you teach your peers in Year One,How many courses did you offer in Year One?,"Did you fall sick in Year One? if yes, How many times do you remember (0 if none)",What was your study mode in Year 1,Did you study the course your originally applied for?,What was your monthly allowance in Year One?,Rate the teaching style / method of the lectures received in Year One,What type of higher institution did you attend in Year One\n,What was your CGPA in Year One?,"What grading system does your school use ( if others, type numbers only)"
0,2,9/30/2023 9:42,9/30/2023 9:43,anonymous,,,300,B,A,A,B,B,16,Male,No,Yes,7,10,10,,3,5,5,1,6.0,6.0,"Yes, but just a few times",16 to 20,2,Full Time,Yes,,6,Public (Federal),4.83,5
1,3,9/30/2023 10:06,9/30/2023 10:12,anonymous,,,313,B,A,A,A,B,17,Male,No,No,1,10,9,,3,2,5,4,7.0,10.0,"Yes, but just a few times",13 to 16,1,Full Time,Yes,,6,Public (Federal),4.8,5
2,4,10/2/2023 7:00,10/2/2023 7:13,anonymous,,,249,C,B,B,B,C,22,Male,No,No,2,5,2,,3,1,4,4,4.0,8.0,"No, I studied alone",5 to 8,6,Full Time,No,,2,Public (Federal),3.1,5
3,5,10/2/2023 10:47,10/2/2023 10:52,anonymous,,,213,C,B,B,C,B,17,Female,No,No,3,5,5,,3,1,1,1,2.0,2.0,"No, I studied alone",16 to 20,0,Full Time,No,,1,Public (State),3.33,5
4,6,10/2/2023 10:51,10/2/2023 10:53,anonymous,,,345,C,A,A,A,A,18,Male,No,Yes,6,4,3,,3,3,4,5,3.0,3.0,"Yes, but just a few times",0 to 4,2,Full Time,Yes,,5,Public (Federal),4.6,5


## Data Preprocessing
In the preprocessing stage, we rename the long survey columns, drop identifying fields, encode the categorical answers, and handle missing values so that the GPA dataset is ready for model training and prediction.

In [4]:
gpa_data.columns 

Index(['ID', 'Start time', 'Completion time', 'Email', 'Name',
       'Last modified time', 'Jamb score', 'English', 'Maths', 'Subject 3',
       'Subject 4', 'Subject 5', 'What was your age in Year One', 'Gender',
       'Do you have a disability?', 'Did you attend tutorials',
       'How would you rate your participation in extracurricular activities (tech, music, partying, fellowship, etc.) in Year One?',
       'How would you rate your class attendance in Year One',
       'How well did you participate in class activities (Assignments, Asking and Answering Questions, Writing Notes....)',
       'use of extra materials(youtube,google,etc)', 'Morning', 'Afternoon',
       'Evening', 'Late Night',
       'How many days per week did you do reading on average in Year One?',
       'On average, How many hours per day was used for personal study in Year One',
       'Did you teach your peers in Year One',
       'How many courses did you offer in Year One?',
       'Did you fall sick in Y

In [3]:
# Dictionary to map long column names to short names
new_column_names = {
    'ID' : 'id',
    'Start time' : 'start_time',
    'Completion time' : 'completion_time', 
    'Email' : 'email', 
    'Name' : 'name',
    'Last modified time' : 'last_modified_time', 
    'Jamb score' : 'jamb_score', 
    'English' : 'english', 
    'Maths' : 'maths', 
    'Subject 3' : 'subject_3',
    'Subject 4' : 'subject_4', 
    'Subject 5' : 'subject_5', 
    'What was your age in Year One' : 'age_in_year_one', 
    'Gender' : 'gender',
    'Do you have a disability?': 'disability', 
    'Did you attend tutorials' : 'attend_tutorial',
    'How would you rate your participation in extracurricular activities (tech, music, partying, fellowship, etc.) in Year One?': 'extra_curricular_activities_level', 
    'How would you rate your class attendance in Year One': 'class_attendance_level',
    'How well did you participate in class activities (Assignments, Asking and Answering Questions, Writing Notes....)' : 'class_participation_level', 
    'use of extra materials(youtube,google,etc)' : 'use_of_extra_materials', 
    'Morning' : 'morning', 
    'Afternoon' : 'afternoon', 
    'Evening' : 'evening', 
    'Late Night': 'late_night',
    'How many days per week did you do reading on average in Year One?' : 'average_study_days_per_week', 
    'On average, How many hours per day was used for personal study in Year One' : 'average_study_hours_per_day',
    'Did you teach your peers in Year One' : 'taught_coursemates',
    'How many courses did you offer in Year One?' : 'num_of_courses_offered',
    'Did you fall sick in Year One? if yes, How many times do you remember (0 if none)' : 'num_of_times_fell_sick',
    'What was your study mode in Year 1': 'mode_of_study',
    'Did you study the course your originally applied for?': 'original_course',
    'What was your monthly allowance in Year One?': 'monthly_allowance',
    'Rate the teaching style / method of the lectures received in Year One': 'level_of_teaching_received',
    'What type of higher institution did you attend in Year One\n': 'category_of_university',
    'What was your CGPA in Year One?': 'year_one_cgpa',
    'What grading system does your school use ( if others, type numbers only)' : 'school_grading_system'

}

# Rename columns using the dictionary
gpa_data.rename(columns=new_column_names, inplace=True)

# Print the DataFrame with updated column names
gpa_data.head()



Unnamed: 0,id,start_time,completion_time,email,name,last_modified_time,jamb_score,english,maths,subject_3,subject_4,subject_5,age_in_year_one,gender,disability,attend_tutorial,extra_curricular_activities_level,class_attendance_level,class_participation_level,use_of_extra_materials,morning,afternoon,evening,late_night,average_study_days_per_week,average_study_hours_per_day,taught_coursemates,num_of_courses_offered,num_of_times_fell_sick,mode_of_study,original_course,monthly_allowance,level_of_teaching_received,category_of_university,year_one_cgpa,school_grading_system
0,2,9/30/2023 9:42,9/30/2023 9:43,anonymous,,,300,B,A,A,B,B,16,Male,No,Yes,7,10,10,,3,5,5,1,6.0,6.0,"Yes, but just a few times",16 to 20,2,Full Time,Yes,,6,Public (Federal),4.83,5
1,3,9/30/2023 10:06,9/30/2023 10:12,anonymous,,,313,B,A,A,A,B,17,Male,No,No,1,10,9,,3,2,5,4,7.0,10.0,"Yes, but just a few times",13 to 16,1,Full Time,Yes,,6,Public (Federal),4.8,5
2,4,10/2/2023 7:00,10/2/2023 7:13,anonymous,,,249,C,B,B,B,C,22,Male,No,No,2,5,2,,3,1,4,4,4.0,8.0,"No, I studied alone",5 to 8,6,Full Time,No,,2,Public (Federal),3.1,5
3,5,10/2/2023 10:47,10/2/2023 10:52,anonymous,,,213,C,B,B,C,B,17,Female,No,No,3,5,5,,3,1,1,1,2.0,2.0,"No, I studied alone",16 to 20,0,Full Time,No,,1,Public (State),3.33,5
4,6,10/2/2023 10:51,10/2/2023 10:53,anonymous,,,345,C,A,A,A,A,18,Male,No,Yes,6,4,3,,3,3,4,5,3.0,3.0,"Yes, but just a few times",0 to 4,2,Full Time,Yes,,5,Public (Federal),4.6,5


In [4]:
# changes column names to lowercase and replaces spaces with underscores
gpa_data.columns = gpa_data.columns.str.lower().str.replace(' ', '_')

In [5]:
gpa_data

Unnamed: 0,id,start_time,completion_time,email,name,last_modified_time,jamb_score,english,maths,subject_3,subject_4,subject_5,age_in_year_one,gender,disability,attend_tutorial,extra_curricular_activities_level,class_attendance_level,class_participation_level,use_of_extra_materials,morning,afternoon,evening,late_night,average_study_days_per_week,average_study_hours_per_day,taught_coursemates,num_of_courses_offered,num_of_times_fell_sick,mode_of_study,original_course,monthly_allowance,level_of_teaching_received,category_of_university,year_one_cgpa,school_grading_system
0,2,9/30/2023 9:42,9/30/2023 9:43,anonymous,,,300,B,A,A,B,B,16,Male,No,Yes,7,10,10,,3,5,5,1,6.0,6.0,"Yes, but just a few times",16 to 20,2,Full Time,Yes,,6,Public (Federal),4.83,5
1,3,9/30/2023 10:06,9/30/2023 10:12,anonymous,,,313,B,A,A,A,B,17,Male,No,No,1,10,9,,3,2,5,4,7.0,10.0,"Yes, but just a few times",13 to 16,1,Full Time,Yes,,6,Public (Federal),4.8,5
2,4,10/2/2023 7:00,10/2/2023 7:13,anonymous,,,249,C,B,B,B,C,22,Male,No,No,2,5,2,,3,1,4,4,4.0,8.0,"No, I studied alone",5 to 8,6,Full Time,No,,2,Public (Federal),3.1,5
3,5,10/2/2023 10:47,10/2/2023 10:52,anonymous,,,213,C,B,B,C,B,17,Female,No,No,3,5,5,,3,1,1,1,2.0,2.0,"No, I studied alone",16 to 20,0,Full Time,No,,1,Public (State),3.33,5
4,6,10/2/2023 10:51,10/2/2023 10:53,anonymous,,,345,C,A,A,A,A,18,Male,No,Yes,6,4,3,,3,3,4,5,3.0,3.0,"Yes, but just a few times",0 to 4,2,Full Time,Yes,,5,Public (Federal),4.6,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
140,142,10/26/2023 13:41,10/26/2023 13:49,anonymous,,,279,C,A,A,B,B,17,Male,No,No,3,10,10,10.0,5,4,3,4,0.3,0.5,"No, I studied alone",9 to 12,0,Part Time,Yes,21 to 30k,6,Public (Federal),2.56,5
141,143,10/26/2023 14:02,10/26/2023 14:06,anonymous,,,261,B,A,A,C,B,15,Male,No,No,6,9,8,2.0,5,3,3,2,3.0,4.0,"Yes, but just a few times",9 to 12,2,Full Time,Yes,6 to 10k,4,Public (Federal),3.8,5
142,144,10/26/2023 14:11,10/26/2023 14:14,anonymous,,,289,C,B,B,B,B,17,Male,No,No,1,2,1,2.0,3,2,3,1,3.0,2.0,"No, I didn't interact with my peers",9 to 12,1,Full Time,Yes,11 to 20k,5,Public (Federal),3.77,5
143,145,10/26/2023 14:13,10/26/2023 14:16,anonymous,,,267,B,A,A,B,A,17,Female,No,No,6,8,9,10.0,5,1,2,5,5.0,2.0,"Yes, but just a few times",9 to 12,1,Full Time,Yes,11 to 20k,5,Public (Federal),4.55,5


In [6]:
gpa_data.dtypes

id                                     int64
start_time                            object
completion_time                       object
email                                 object
name                                 float64
last_modified_time                   float64
jamb_score                             int64
english                               object
maths                                 object
subject_3                             object
subject_4                             object
subject_5                             object
age_in_year_one                        int64
gender                                object
disability                            object
attend_tutorial                       object
extra_curricular_activities_level      int64
class_attendance_level                 int64
class_participation_level              int64
use_of_extra_materials               float64
morning                                int64
afternoon                              int64
evening   

In [7]:
# List of columns to drop
columns_to_drop = ['id', 'start_time', 'completion_time', 'email', 'name', 'last_modified_time']

# Drop the specified columns
gpa_data.drop(columns=columns_to_drop, inplace=True)

# Print the DataFrame after dropping columns
gpa_data.head()

Unnamed: 0,jamb_score,english,maths,subject_3,subject_4,subject_5,age_in_year_one,gender,disability,attend_tutorial,extra_curricular_activities_level,class_attendance_level,class_participation_level,use_of_extra_materials,morning,afternoon,evening,late_night,average_study_days_per_week,average_study_hours_per_day,taught_coursemates,num_of_courses_offered,num_of_times_fell_sick,mode_of_study,original_course,monthly_allowance,level_of_teaching_received,category_of_university,year_one_cgpa,school_grading_system
0,300,B,A,A,B,B,16,Male,No,Yes,7,10,10,,3,5,5,1,6.0,6.0,"Yes, but just a few times",16 to 20,2,Full Time,Yes,,6,Public (Federal),4.83,5
1,313,B,A,A,A,B,17,Male,No,No,1,10,9,,3,2,5,4,7.0,10.0,"Yes, but just a few times",13 to 16,1,Full Time,Yes,,6,Public (Federal),4.8,5
2,249,C,B,B,B,C,22,Male,No,No,2,5,2,,3,1,4,4,4.0,8.0,"No, I studied alone",5 to 8,6,Full Time,No,,2,Public (Federal),3.1,5
3,213,C,B,B,C,B,17,Female,No,No,3,5,5,,3,1,1,1,2.0,2.0,"No, I studied alone",16 to 20,0,Full Time,No,,1,Public (State),3.33,5
4,345,C,A,A,A,A,18,Male,No,Yes,6,4,3,,3,3,4,5,3.0,3.0,"Yes, but just a few times",0 to 4,2,Full Time,Yes,,5,Public (Federal),4.6,5


In [8]:
gpa_data.shape

(145, 30)

In [9]:
print(gpa_data.columns)

Index(['jamb_score', 'english', 'maths', 'subject_3', 'subject_4', 'subject_5',
       'age_in_year_one', 'gender', 'disability', 'attend_tutorial',
       'extra_curricular_activities_level', 'class_attendance_level',
       'class_participation_level', 'use_of_extra_materials', 'morning',
       'afternoon', 'evening', 'late_night', 'average_study_days_per_week',
       'average_study_hours_per_day', 'taught_coursemates',
       'num_of_courses_offered', 'num_of_times_fell_sick', 'mode_of_study',
       'original_course', 'monthly_allowance', 'level_of_teaching_received',
       'category_of_university', 'year_one_cgpa', 'school_grading_system'],
      dtype='object')


In [33]:
# Separate columns into numeric and categorical
numeric_columns = gpa_data.select_dtypes(include=[np.number]).columns.tolist()
categorical_columns = gpa_data.select_dtypes(include=[object]).columns.tolist()

# Print the lists
print("Numeric Columns:", numeric_columns)
print('----------------------------------------')
print("Categorical Columns:", categorical_columns)


Numeric Columns: ['jamb_score', 'english', 'maths', 'subject_3', 'subject_4', 'subject_5', 'age_in_year_one', 'extra_curricular_activities_level', 'class_attendance_level', 'class_participation_level', 'use_of_extra_materials', 'morning', 'afternoon', 'evening', 'late_night', 'average_study_days_per_week', 'average_study_hours_per_day', 'num_of_times_fell_sick', 'level_of_teaching_received']
----------------------------------------
Categorical Columns: ['num_of_courses_offered', 'monthly_allowance', 'year_one_cgpa', 'school_grading_system']


In [34]:
# gpa_data['english'].unique()

for col in categorical_columns:
    print(f'unique values for {col} is: {gpa_data[col].unique()}')
    print('-------------------------------------')

unique values for num_of_courses_offered is: ['16 to 20' '13 to 16' '5 to 8' '0 to 4' '9 to 12' '20+']
-------------------------------------
unique values for monthly_allowance is: [nan '6 to 10k' '0 to 5k' '11 to 20k' '21 to 30k' '31 to 50k' '51 to 70k']
-------------------------------------
unique values for year_one_cgpa is: ['4.83' '4.8' '3.1' '3.33' '4.6' '4.06' '4.44' '4.35' '3.5' '3.7' '3.78'
 '3.91' '4.75' '4.27' '3.9' '4.89' '2.67' '4.4' '3.8' '4.51' '3.82' '4.54'
 '4.73' '3.76' '3.6' '4.74' '4.3' '3.2' '4.1' '4.67' '4.7' '4.57' '3.3'
 '2.97' '3.4' '3.97' '4.77' '4.5' '4.92' '4.03' '4.52' '3.87' '4.34'
 '3.69' '4.2' '4.21' '3.49' '3.85' '4.45' '4.91' '3.23' '4.23' '4.49'
 '4.66' '2.23' '4.82' '3.03' 'no idea ' '4.25' '4.81' '4.39' '4.42' '2.2'
 '4.85' '4.48' '3' '3.31' '2.5' '3.52' '3.75' '4' '3.o' '2.6' '1.9'
 '4.264' '4.79' '4.46' '3.46' '4.61' '4.43' '3.27' '4.56' '215' '4.55'
 '4.33' '3.65' '3.43' '2.56' '3.77']
-------------------------------------
unique values for schoo
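
The range-style answers printed above (for example `'5 to 8'` or `'11 to 20k'`) are ordered, so one option is to map each label to its rank before modelling. The sketch below is an assumption about how that could be done and is not part of the original notebook; the helper name `encode_ordered_ranges` is hypothetical, the orderings are copied from the unique values shown above, and the data is a toy stand-in.

```python
import pandas as pd

# Orderings copied from the unique values printed above (lowest to highest).
COURSE_ORDER = ['0 to 4', '5 to 8', '9 to 12', '13 to 16', '16 to 20', '20+']
ALLOWANCE_ORDER = ['0 to 5k', '6 to 10k', '11 to 20k', '21 to 30k', '31 to 50k', '51 to 70k']


def encode_ordered_ranges(series, order):
    """Map each range label to its rank in `order`; unknown or missing labels become NaN."""
    rank = {label: position for position, label in enumerate(order)}
    return series.map(rank)


# Toy data standing in for the real survey columns.
demo = pd.DataFrame({
    'num_of_courses_offered': ['16 to 20', '0 to 4', '20+'],
    'monthly_allowance': [None, '6 to 10k', '51 to 70k'],
})
demo['num_of_courses_offered'] = encode_ordered_ranges(demo['num_of_courses_offered'], COURSE_ORDER)
demo['monthly_allowance'] = encode_ordered_ranges(demo['monthly_allowance'], ALLOWANCE_ORDER)
print(demo)
```

Missing allowances stay as NaN here, so they still need to be imputed or dropped before fitting a model.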

In [12]:

# Ordinal encoding map
ordinal_encoding_map = {'A': 5, 'B': 4, 'C': 3, 'D': 2, 'E': 1, 'F': 0}

# Features to encode
features_to_encode = ['english', 'maths', 'subject_3', 'subject_4', 'subject_5']

# Apply ordinal encoding for the specified features
gpa_data[features_to_encode] = gpa_data[features_to_encode].apply(lambda col: col.map(ordinal_encoding_map))

# # Perform label encoding for other categorical columns
# categorical_columns = gpa_data.select_dtypes(include=['object']).columns
# label_encoder = LabelEncoder()

# for col in categorical_columns:
#     gpa_data[col] = label_encoder.fit_transform(gpa_data[col])


# Create GPA_normalised and drop unnecessary columns
# gpa_data['GPA_normalised'] = gpa_data['year_one_cgpa'] / gpa_data['school_grading_system']
# gpa_data.drop(['school_grading_system', 'year_one_cgpa'], axis=1, inplace=True)


# Print the DataFrame after engineering
gpa_data.head()


Unnamed: 0,jamb_score,english,maths,subject_3,subject_4,subject_5,age_in_year_one,gender,disability,attend_tutorial,extra_curricular_activities_level,class_attendance_level,class_participation_level,use_of_extra_materials,morning,afternoon,evening,late_night,average_study_days_per_week,average_study_hours_per_day,taught_coursemates,num_of_courses_offered,num_of_times_fell_sick,mode_of_study,original_course,monthly_allowance,level_of_teaching_received,category_of_university,year_one_cgpa,school_grading_system
0,300,4,5,5,4,4,16,Male,No,Yes,7,10,10,,3,5,5,1,6.0,6.0,"Yes, but just a few times",16 to 20,2,Full Time,Yes,,6,Public (Federal),4.83,5
1,313,4,5,5,5,4,17,Male,No,No,1,10,9,,3,2,5,4,7.0,10.0,"Yes, but just a few times",13 to 16,1,Full Time,Yes,,6,Public (Federal),4.8,5
2,249,3,4,4,4,3,22,Male,No,No,2,5,2,,3,1,4,4,4.0,8.0,"No, I studied alone",5 to 8,6,Full Time,No,,2,Public (Federal),3.1,5
3,213,3,4,4,3,4,17,Female,No,No,3,5,5,,3,1,1,1,2.0,2.0,"No, I studied alone",16 to 20,0,Full Time,No,,1,Public (State),3.33,5
4,345,3,5,5,5,5,18,Male,No,Yes,6,4,3,,3,3,4,5,3.0,3.0,"Yes, but just a few times",0 to 4,2,Full Time,Yes,,5,Public (Federal),4.6,5


In [14]:
def split_and_encode(df, column_name):
    # Split the column content by ',' and strip whitespace
#     df[column_name] = df[column_name].str.split(',').apply(lambda x: [val.strip() for val in x])

    # Perform dummy encoding
    df_encoded = pd.get_dummies(df[column_name])

#     df_encoded = pd.get_dummies(df[column_name].apply(pd.Series).stack())
        
    # Concatenate the encoded columns with the original DataFrame
    df = pd.concat([df, df_encoded], axis=1)

    # Drop the original column
    df.drop(column_name, axis=1, inplace=True)

    return df

In [15]:
# One-hot encode each nominal categorical column in turn
gpa_data = split_and_encode(gpa_data, 'gender')
gpa_data = split_and_encode(gpa_data, 'disability')
gpa_data = split_and_encode(gpa_data, 'attend_tutorial')
gpa_data = split_and_encode(gpa_data, 'mode_of_study')
gpa_data = split_and_encode(gpa_data, 'original_course')
gpa_data = split_and_encode(gpa_data, 'category_of_university')
gpa_data = split_and_encode(gpa_data, 'taught_coursemates')

In [16]:
gpa_data.columns

Index(['jamb_score', 'english', 'maths', 'subject_3', 'subject_4', 'subject_5',
       'age_in_year_one', 'extra_curricular_activities_level',
       'class_attendance_level', 'class_participation_level',
       'use_of_extra_materials', 'morning', 'afternoon', 'evening',
       'late_night', 'average_study_days_per_week',
       'average_study_hours_per_day', 'num_of_courses_offered',
       'num_of_times_fell_sick', 'monthly_allowance',
       'level_of_teaching_received', 'year_one_cgpa', 'school_grading_system',
       'Female', 'Male', 'No', 'No', 'Yes', 'Full Time', 'Part Time', 'No',
       'Yes', 'Private', 'Public (Federal)', 'Public (State)',
       'No, I didn't interact with my peers', 'No, I studied alone',
       'Yes, I ran a tutorial service', 'Yes, but just a few times'],
      dtype='object')

In [17]:
gpa_data.shape

(145, 39)

In [None]:
# Clean the CGPA text: strip whitespace and fix the letter 'o' typed for zero (e.g. '3.o' -> '3.0')
cgpa_text = gpa_data['year_one_cgpa'].astype(str).str.strip().str.replace('o', '0', regex=False)

# Convert the column to numeric; non-numeric entries such as 'no idea' become NaN
gpa_data['year_one_cgpa'] = pd.to_numeric(cgpa_text, errors='coerce')

# Handle any remaining missing values (NaN) by filling with the mean
gpa_data['year_one_cgpa'] = gpa_data['year_one_cgpa'].fillna(gpa_data['year_one_cgpa'].mean())
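
The unique values printed earlier include entries such as `'3.o'`, `'no idea '`, and `'215'`. As a rough sketch, the cleaning step could also blank out values that fall outside the grading scale. The helper name `clean_cgpa` is hypothetical, and treating anything above the scale as invalid is an assumption, as is a grading-scale column that has already been converted to numbers.

```python
import pandas as pd


def clean_cgpa(raw, scale):
    """Return CGPA as floats; unparseable or out-of-range entries become NaN."""
    text = raw.astype(str).str.strip().str.replace('o', '0', regex=False)
    values = pd.to_numeric(text, errors='coerce')
    # Assumption: a valid CGPA lies between 0 and the school's grading scale.
    return values.where((values >= 0) & (values <= scale))


# Toy values mirroring the kinds of entries seen in the survey.
raw = pd.Series(['4.83', '3.o', 'no idea ', '215', '4'])
scale = pd.Series([5, 5, 5, 5, 5])
cleaned = clean_cgpa(raw, scale)
print(cleaned.tolist())
```

Because CGPA is the prediction target, dropping rows where it is missing may be preferable to filling them with the mean.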


In [18]:
gpa_data.head(2)

Unnamed: 0,jamb_score,english,maths,subject_3,subject_4,subject_5,age_in_year_one,extra_curricular_activities_level,class_attendance_level,class_participation_level,use_of_extra_materials,morning,afternoon,evening,late_night,average_study_days_per_week,average_study_hours_per_day,num_of_courses_offered,num_of_times_fell_sick,monthly_allowance,level_of_teaching_received,year_one_cgpa,school_grading_system,Female,Male,No,No.1,Yes,Full Time,Part Time,No.2,Yes.1,Private,Public (Federal),Public (State),"No, I didn't interact with my peers","No, I studied alone","Yes, I ran a tutorial service","Yes, but just a few times"
0,300,4,5,5,4,4,16,7,10,10,,3,5,5,1,6.0,6.0,16 to 20,2,,6,4.83,5,False,True,True,False,True,True,False,False,True,False,True,False,False,False,False,True
1,313,4,5,5,5,4,17,1,10,9,,3,2,5,4,7.0,10.0,13 to 16,1,,6,4.8,5,False,True,True,True,False,True,False,False,True,False,True,False,False,False,False,True


### Mutual Information

In [25]:
from sklearn.metrics import mutual_info_score

def mutual_information_score(series):
    return mutual_info_score(series, gpa_data.year_one_cgpa)

In [31]:
gpa_data[categorical_columns].apply(mutual_information_score).sort_values(ascending=False)
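
`mutual_info_score` treats both inputs as discrete labels, so with a continuous target like CGPA nearly every value becomes its own class. scikit-learn also provides `mutual_info_regression`, which is intended for continuous targets. The sketch below uses synthetic data and borrows two column names from this notebook purely for illustration; the relationship between them and the target is made up.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 200

# Synthetic integer-rated features (stand-ins for survey ratings).
demo = pd.DataFrame({
    'class_attendance_level': rng.integers(1, 11, size=n),
    'late_night': rng.integers(1, 6, size=n),
})
# Synthetic target that depends on attendance only.
target = 2.0 + 0.3 * demo['class_attendance_level'] + rng.normal(0, 0.1, size=n)

# discrete_features=True tells the estimator that all features are discrete ratings.
scores = mutual_info_regression(demo, target, discrete_features=True, random_state=0)
mi = pd.Series(scores, index=demo.columns).sort_values(ascending=False)
print(mi)
```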

### Correlation

In [30]:
gpa_data[numeric_columns].corrwith(gpa_data.year_one_cgpa)

## Machine Learning Modeling

In this section, we will walk through the steps involved in building and evaluating a machine learning model for our GPA prediction task.


### Model Training

In [29]:
gpa_data.dtypes

jamb_score                               int64
english                                  int64
maths                                    int64
subject_3                                int64
subject_4                                int64
subject_5                                int64
age_in_year_one                          int64
extra_curricular_activities_level        int64
class_attendance_level                   int64
class_participation_level                int64
use_of_extra_materials                 float64
morning                                  int64
afternoon                                int64
evening                                  int64
late_night                               int64
average_study_days_per_week            float64
average_study_hours_per_day            float64
num_of_courses_offered                  object
num_of_times_fell_sick                   int64
monthly_allowance                       object
level_of_teaching_received               int64
year_one_cgpa

In [28]:
gpa_data.isna().sum()

jamb_score                              0
english                                 0
maths                                   0
subject_3                               0
subject_4                               0
subject_5                               0
age_in_year_one                         0
extra_curricular_activities_level       0
class_attendance_level                  0
class_participation_level               0
use_of_extra_materials                 10
morning                                 0
afternoon                               0
evening                                 0
late_night                              0
average_study_days_per_week             0
average_study_hours_per_day             0
num_of_courses_offered                  0
num_of_times_fell_sick                  0
monthly_allowance                      10
level_of_teaching_received              0
year_one_cgpa                           0
school_grading_system                   0
Female                            

In [21]:

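# Fill missing ratings with the column mean (assign back instead of using inplace on a column)
gpa_data['use_of_extra_materials'] = gpa_data['use_of_extra_materials'].fillna(gpa_data['use_of_extra_materials'].mean())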


In [23]:
X = gpa_data.drop(['year_one_cgpa'], axis=1)  # Features: every column except the target
y = gpa_data['year_one_cgpa']  # Target variable

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Linear Regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

In [26]:
from sklearn.ensemble import RandomForestRegressor

# Create a Random Forest model
rf_model = RandomForestRegressor()

# Fit the model on the training data
rf_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf_model.predict(X_test)


In [29]:
# The squared=False argument is deprecated in recent scikit-learn releases, so take the root explicitly
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"root Mean Squared Error: {rmse}")

root Mean Squared Error: 24.8280113636969


### Model Evaluation

In [24]:
# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
# The squared=False argument is deprecated in recent scikit-learn releases, so take the root explicitly
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error (RMSE):', rmse)

Root Mean Squared Error (RMSE): 23.946597880933762
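
With only 145 rows, a single train/test split can give a noisy RMSE. One way to get a steadier estimate is k-fold cross-validation, compared against a baseline that always predicts the mean. The sketch below uses synthetic data (`X_demo` and `y_demo` are made up), so its numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the prepared features and CGPA target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(145, 5))
y_demo = 3.5 + X_demo @ np.array([0.4, 0.2, 0.0, 0.0, -0.1]) + rng.normal(0, 0.3, size=145)

cv = KFold(n_splits=5, shuffle=True, random_state=42)


def cv_rmse(estimator):
    """Average RMSE across the folds (scikit-learn reports it negated)."""
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv,
                             scoring='neg_root_mean_squared_error')
    return -scores.mean()


baseline_rmse = cv_rmse(DummyRegressor(strategy='mean'))
linear_rmse = cv_rmse(LinearRegression())
print(f'Baseline RMSE: {baseline_rmse:.3f}')
print(f'Linear regression RMSE: {linear_rmse:.3f}')
```

A model is only useful if it beats the baseline. An RMSE far larger than the grading scale usually points to a data problem, such as an unclean target value.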


### Save the model

In [None]:
# Save the model to a file
model_filename = 'linear_regression_model.joblib'
joblib.dump(model, model_filename)

print('Model saved to', model_filename)
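
A saved model can be restored later with `joblib.load`. The following minimal sketch round-trips a toy model through a temporary file; the file name `demo_model.joblib` and the data are illustrative only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a tiny model on toy data where y = 2 * x.
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([2.0, 4.0, 6.0, 8.0])
fitted = LinearRegression().fit(X_demo, y_demo)

# Save to a temporary directory, then load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.joblib')
    joblib.dump(fitted, path)
    restored = joblib.load(path)

prediction = restored.predict(np.array([[5.0]]))
print(prediction)
```

The restored model expects the same feature columns, in the same order, as the data it was trained on.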

### Tips to Improve Model Performance

1. **Data Quality:**
   - Ensure clean, high-quality data without missing values or outliers.

2. **Feature Engineering:**
   - Create relevant and new features that capture essential patterns in the data.

3. **Model Selection:**
   - Choose appropriate models and tune hyperparameters for better performance.

4. **Ensemble Learning:**
   - Combine multiple models to improve accuracy and robustness.

5. **Regularization:**
   - Implement regularization to prevent overfitting.

6. **Domain Understanding:**
   - Understand the problem domain to make informed model decisions.

7. **Feedback Loop:**
   - Continuously iterate and improve the model based on feedback and new data.
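
As one possible illustration of the regularization and hyperparameter tuning tips, the sketch below chains imputation, scaling, and a Ridge model in a pipeline and searches over the penalty strength. The data is synthetic and the alpha grid is an arbitrary assumption, so treat it as a starting point, not a recommended configuration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features and CGPA target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(145, 6))
y_demo = 3.5 + 0.5 * X_demo[:, 0] + rng.normal(0, 0.3, size=145)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # simulate missing answers

# Imputation and scaling are fitted inside each fold, which avoids leaking test data.
pipeline = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler(), Ridge())
search = GridSearchCV(
    pipeline,
    param_grid={'ridge__alpha': [0.01, 0.1, 1.0, 10.0, 100.0]},
    scoring='neg_root_mean_squared_error',
    cv=5,
)
search.fit(X_demo, y_demo)
print('Best alpha:', search.best_params_['ridge__alpha'])
print('Cross-validated RMSE:', -search.best_score_)
```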

---

## HAPPY HACKING!!


