# Student Performance Prediction using Linear Regression

This notebook walks through a machine learning workflow that predicts the final grade `G3` from `G1`, `G2`, `studytime`, `failures`, and `absences`. Each step starts with a short explanation, followed by the code.

Dataset source: [Kaggle](https://www.kaggle.com/datasets/dskagglemt/student-performance-data-set)

## Data Attributes
1. `school`: Student's school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira)
2. `sex`: Student's sex (binary: "F" - female or "M" - male)
3. `age`: Student's age (numeric: from 15 to 22)
4. `address`: Student's home address type (binary: "U" - urban or "R" - rural)
5. `famsize`: Family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
6. `Pstatus`: Parent's cohabitation status (binary: "T" - living together or "A" - apart)
7. `Medu`: Mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education, or 4 - higher education)
8. `Fedu`: Father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education, or 4 - higher education)
9. `Mjob`: Mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
10. `Fjob`: Father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
11. `reason`: Reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
12. `guardian`: Student's guardian (nominal: "mother", "father" or "other")
13. `traveltime`: Home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14. `studytime`: Weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15. `failures`: Number of past class failures (numeric: n if 1<=n<3, else 4)
16. `schoolsup`: Extra educational support (binary: yes or no)
17. `famsup`: Family educational support (binary: yes or no)
18. `paid`: Extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19. `activities`: Extra-curricular activities (binary: yes or no)
20. `nursery`: Attended nursery school (binary: yes or no)
21. `higher`: Wants to take higher education (binary: yes or no)
22. `internet`: Internet access at home (binary: yes or no)
23. `romantic`: With a romantic relationship (binary: yes or no)
24. `famrel`: Quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25. `freetime`: Free time after school (numeric: from 1 - very low to 5 - very high)
26. `goout`: Going out with friends (numeric: from 1 - very low to 5 - very high)
27. `Dalc`: Workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28. `Walc`: Weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29. `health`: Current health status (numeric: from 1 - very bad to 5 - very good)
30. `absences`: Number of school absences (numeric: from 0 to 93)
31. `G1`: First period grade (numeric: from 0 to 20)
32. `G2`: Second period grade (numeric: from 0 to 20)
33. `G3`: Final grade (numeric: from 0 to 20, target variable)

## Step 1: Load and Explore the Data
First, we'll load the data from the CSV file and take a look at its structure.

Explanation:
- We'll use `pandas` to load the data.
- We'll inspect the first few rows and basic statistics to understand the dataset.

In [None]:
# Import necessary library
import pandas as pd

# Load the dataset from CSV file
data = pd.read_csv('student-performance.csv', sep=';')

# Display the first few rows of the dataset
data.head()

In [None]:
# Display basic statistics
data.describe()

## Step 2: Data Preprocessing
We'll select the relevant features and handle any missing values.

Explanation:
- Select the columns `G1`, `G2`, `studytime`, `failures`, `absences`, and `G3`.
- Check for and handle any missing values (if any).

In [None]:
# Select relevant columns
features = ['G1', 'G2', 'studytime', 'failures', 'absences']
target = 'G3'

# Check for missing values
print(data[features + [target]].isnull().sum())

# Drop rows with missing target values (if any)
data = data.dropna(subset=[target])
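
If the check reports gaps in the feature columns as well, one possible way to handle them is sketched below on a small made-up DataFrame (the values are illustrative, not taken from the dataset). Rows without a target are dropped, and missing feature values are filled with the column median. Median imputation is an assumption here, not something this workflow prescribes.

```python
import numpy as np
import pandas as pd

features = ['G1', 'G2', 'studytime', 'failures', 'absences']
target = 'G3'

# Toy data with deliberate gaps (illustrative values only)
toy = pd.DataFrame({
    'G1': [10, 12, np.nan, 15],
    'G2': [11, 12, 9, 14],
    'studytime': [2, 1, 3, 2],
    'failures': [0, 1, 0, 0],
    'absences': [4, np.nan, 2, 0],
    'G3': [11, 12, 9, np.nan],
})

# A row without a target cannot be used for supervised training
cleaned = toy.dropna(subset=[target]).copy()

# Fill remaining feature gaps with each column's median (one possible choice)
cleaned[features] = cleaned[features].fillna(cleaned[features].median())

print(cleaned.isnull().sum().sum())  # total remaining missing values
print(cleaned.shape)
```

Dropping rows is simpler and loses nothing when only a handful of rows are affected; imputation keeps more data but adds made-up values, so it is worth checking how many rows are involved before choosing.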

## Step 3: Split the Data into Training and Testing Sets
We need to divide the dataset into training and testing sets to evaluate the model's performance.

Explanation:
- We'll use the `train_test_split` function from scikit-learn to split the data.
- A common split is 80% training and 20% testing.

In [None]:
from sklearn.model_selection import train_test_split

# Split the data
X = data[features]
y = data[target]

# A fixed random_state makes the split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shape of the training and testing sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
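
`train_test_split` shuffles the rows before splitting, so the partition depends on the random seed. The sketch below uses made-up arrays and an arbitrary `random_state=42` (an assumption; any integer works) to show that a fixed seed reproduces the same 80/20 split on every call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with two features each (illustrative data)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Same seed -> identical partition on every call
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
X_train_b, X_test_b, y_train_b, y_test_b = split_b

print(X_train_a.shape, X_test_a.shape)      # 8 training rows, 2 test rows
print(np.array_equal(y_test_a, y_test_b))   # True: the split is reproducible
```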

## Step 4: Train a Machine Learning Model
We'll train a model using a regression algorithm. For simplicity, we'll start with a Linear Regression model.

Explanation:
- We'll use the `LinearRegression` class from scikit-learn.
- Fit the model to the training data.

In [None]:
from sklearn.linear_model import LinearRegression

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Display model coefficients
print(f'Coefficients: {model.coef_}')
print(f'Intercept: {model.intercept_}')

### Coefficients:

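Each coefficient represents the change in the target variable `G3` for a one-unit change in the respective feature, assuming all other features are held constant. The values below come from one particular run; a different train/test split will produce slightly different numbers.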

- G1 (0.14446336): For each additional unit increase in G1, G3 increases by approximately 0.144.
- G2 (0.97961532): For each additional unit increase in G2, G3 increases by approximately 0.98.
- studytime (-0.07123057): For each additional unit increase in study time, G3 decreases by approximately 0.071.
- failures (-0.45581289): For each additional failure, G3 decreases by approximately 0.456.
- absences (0.0392449): For each additional unit increase in absences, G3 increases by approximately 0.039.
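
`model.coef_` is a bare array, so it helps to pair each value with its feature name before reading it. The sketch below does this on synthetic, noise-free data generated from made-up coefficients (`true_coefs` and `true_intercept` are assumptions for illustration, not the values listed above), so the recovered numbers can be checked against the ones used to generate the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

features = ['G1', 'G2', 'studytime', 'failures', 'absences']

# Made-up ground truth used only to generate toy data
true_coefs = np.array([0.2, 0.9, -0.1, -0.5, 0.05])
true_intercept = -1.0

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 20, size=(50, len(features)))
y_toy = X_toy @ true_coefs + true_intercept  # no noise, so the fit is exact

toy_model = LinearRegression().fit(X_toy, y_toy)

# Pair each coefficient with its feature name
named_coefs = dict(zip(features, toy_model.coef_))
for name, coef in named_coefs.items():
    print(f'{name:>10}: {coef: .3f}')
print(f'{"intercept":>10}: {toy_model.intercept_: .3f}')
```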

The coefficients for `studytime` and `absences` look counterintuitive: more study time slightly lowers the predicted G3, and more absences slightly raise it. Both effects are small. Possible explanations are multicollinearity among the features (`G1` and `G2` are likely to be strongly correlated with each other), a non-linear relationship between these features and G3, or other factors that the model does not include.
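
One way to probe the multicollinearity hypothesis is the variance inflation factor (VIF): regress each feature on the others and compute 1 / (1 - R<sup>2</sup>). The sketch below implements this with NumPy on synthetic data in which `g2` is built to track `g1` closely (an assumption that mimics strongly correlated period grades; it is not the real dataset). The helper `vif` is hypothetical, written only for this illustration. Values well above roughly 5 to 10 are commonly read as a sign of strong collinearity.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (hypothetical helper)."""
    n_samples, n_features = X.shape
    factors = []
    for j in range(n_features):
        target_col = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept
        design = np.column_stack([np.ones(n_samples), others])
        beta, *_ = np.linalg.lstsq(design, target_col, rcond=None)
        residuals = target_col - design @ beta
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((target_col - target_col.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(0)
g1 = rng.uniform(0, 20, size=200)
g2 = g1 + rng.normal(0, 1, size=200)      # nearly a copy of g1
absences = rng.uniform(0, 30, size=200)   # unrelated to the grades

X_toy = np.column_stack([g1, g2, absences])
for name, value in zip(['g1', 'g2', 'absences'], vif(X_toy)):
    print(f'{name}: VIF = {value:.1f}')
```

On the real data, the same function could be applied to `X_train.to_numpy()`; whether the period grades actually show high VIFs there would need to be checked rather than assumed.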

### Intercept:

Intercept=−1.6213124035190951

The intercept represents the expected value of G3 when all the features are zero. This value has little practical meaning here: `studytime` is coded from 1 to 4, so it is never zero, and a student with `G1` and `G2` both equal to zero is an edge case. Zero failures and zero absences, by contrast, are perfectly plausible.
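
Putting the reported intercept and coefficients together, the fitted model from this run can be written as a single equation (values rounded to three decimals):

$$
\widehat{G3} = -1.621 + 0.144\,G1 + 0.980\,G2 - 0.071\,\mathrm{studytime} - 0.456\,\mathrm{failures} + 0.039\,\mathrm{absences}
$$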

## Step 5: Evaluate the Model
We'll evaluate the model's performance using three metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), and the R-squared score.

Explanation:
- Predict the target values for the test set.
- Calculate evaluation metrics to assess model performance.

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Predict the target values for the test set
y_pred = model.predict(X_test)

# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Absolute Error: {mae}')
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

### Evaluation Metrics

#### Mean Absolute Error (MAE):

MAE=1.3394159082670924

The MAE measures the average magnitude of the errors in the predictions, without considering their direction. An MAE of 1.34 means that, on average, the model's predictions are off by about 1.34 grade points of `G3` (on the 0 to 20 scale).

#### Mean Squared Error (MSE):

MSE=4.4665032120156045

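The MSE measures the average squared difference between predicted and actual values, so it gives more weight to larger errors. Taking the square root gives a root mean squared error (RMSE) of about 2.11 grade points. Because this is noticeably higher than the MAE of 1.34, a few predictions are off by a large margin.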

#### R-squared (R<sup>2</sup>):

R<sup>2</sup>=0.7821754247320556

The R<sup>2</sup> score indicates how well the model's predictions fit the actual data. An R<sup>2</sup> of 0.78 means that approximately 78% of the variance in G3 can be explained by the model. This is a relatively high value, suggesting the model has a good fit to the data.
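
To make the metric definitions concrete, the sketch below computes them by hand with NumPy on four made-up grades (not taken from the dataset). It also computes the root mean squared error (RMSE), which brings the MSE back to grade units.

```python
import numpy as np

# Made-up actual and predicted grades (illustrative only)
y_true = np.array([10, 12, 8, 15], dtype=float)
y_hat = np.array([11, 11, 9, 13], dtype=float)

errors = y_true - y_hat

mae = np.mean(np.abs(errors))   # average absolute error
mse = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse)             # same units as the grades

ss_res = np.sum(errors ** 2)                     # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
r2 = 1.0 - ss_res / ss_tot

print(f'MAE:  {mae:.3f}')    # 1.250
print(f'MSE:  {mse:.3f}')    # 1.750
print(f'RMSE: {rmse:.3f}')   # 1.323
print(f'R2:   {r2:.3f}')     # 0.738
```

The single error of 2 points contributes 4 of the 7 units of squared error, which is why the RMSE ends up above the MAE.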

### Summary

- The model performs reasonably well with an R<sup>2</sup> of 0.78, meaning it explains 78% of the variance in the final grade G3.
- The MAE of 1.34 indicates that the predictions are, on average, 1.34 grade points away from the actual G3 on the 0 to 20 scale.
- The coefficients provide insights into the relationships between the features and G3, though some coefficients (e.g., studytime and absences) might require further investigation or additional feature engineering to better understand their effects.

## Step 6: Predict the Final Grade for a New Student
- **Collect New Student Data**: Gather the necessary features for the new student.
- **Preprocess the New Data**: Convert the new data into a DataFrame (or appropriate format) and ensure it matches the structure of the training data.
- **Make Predictions**: Use the `predict` method of the trained model to make predictions.

In [None]:
# Data attributes information
# G1: First period grade (numeric: from 0 to 20)
# G2: Second period grade (numeric: from 0 to 20)
# studytime: Weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
# failures: Number of past class failures (numeric: n if 1<=n<3, else 4)
# absences: Number of school absences (numeric: from 0 to 93)

# New student data
new_student = {
    'G1': 20,
    'G2': 5,
    'studytime': 4,
    'failures': 0,
    'absences': 10
}

# Convert the new student data to a DataFrame, with columns in the same order as in training
new_student_data = pd.DataFrame([new_student], columns=features)

# Make a prediction
new_student_prediction = model.predict(new_student_data)

print(f'Predicted G3 for the new student: {new_student_prediction[0]}')
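
As a sanity check, the same prediction can be reproduced by hand from the coefficients and intercept reported in Step 4. The sketch below hard-codes those reported values. A model fitted on a different train/test split will have slightly different coefficients, so its `predict` output need not match this number exactly.

```python
# Coefficients and intercept as reported in Step 4 (from one particular run)
coefficients = {
    'G1': 0.14446336,
    'G2': 0.97961532,
    'studytime': -0.07123057,
    'failures': -0.45581289,
    'absences': 0.0392449,
}
intercept = -1.6213124035190951

new_student = {'G1': 20, 'G2': 5, 'studytime': 4, 'failures': 0, 'absences': 10}

# Linear regression prediction: intercept + sum(coefficient * feature value)
manual_prediction = intercept + sum(
    coefficients[name] * value for name, value in new_student.items()
)

print(f'Manual G3 prediction: {manual_prediction:.2f}')  # 6.27
```

The result of about 6.27 is dominated by the low `G2`: with a coefficient near 0.98, the second period grade outweighs the high `G1`, whose coefficient is only about 0.14.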
