## In this notebook, a linear regression model was created and trained on data collected about students' performance. The model was then evaluated by comparing the error metrics resulting from the test portion of the data.

## <font color='blue'>Description of the Student Performance Dataset:</font>

This dataset was obtained from Kaggle and contains information on the performance of high school students in mathematics, including their grades and demographic information. The data was collected from three high schools in the United States. Link to dataset:
https://www.kaggle.com/datasets/rkiattisak/student-performance-in-mathematics

### <font color='blue'> Dictionary of Columns:</font>
 1. **Gender**
- Data Type: Nominal
- Description: The gender of the student (male/female).

 2. **Race/ethnicity**
- Data Type: Nominal
- Description: The student's racial or ethnic background (e.g., Asian, African-American, Hispanic, etc.).

 3. **Parental level of education**
- Data Type: Ordinal
- Description: The highest level of education attained by the student's parent(s) or guardian(s) (e.g., high school, bachelor's degree, master's degree).

 4. **Lunch**
- Data Type: Nominal
- Description: The type of lunch the student receives (standard or free/reduced).

 5. **Test preparation course**
- Data Type: Nominal
- Description: Whether the student completed a test preparation course (completed/none).

 6. **Math score**
- Data Type: Numerical
- Description: The student's score on a standardized mathematics test.

 7. **Reading score**
- Data Type: Numerical
- Description: The student's score on a standardized reading test.

 8. **Writing score**
- Data Type: Numerical
- Description: The student's score on a standardized writing test.

### <font color='blue'>Importing the necessary libraries: </font>

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn import set_config
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

### <font color='blue'>Loading the data from CSV file:</font>

In [2]:
# Read data from csv file
data = pd.read_csv('exams.csv')
data

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group D,some college,standard,completed,59,70,78
1,male,group D,associate's degree,standard,none,96,93,87
2,female,group D,some college,free/reduced,none,57,76,77
3,male,group B,some college,free/reduced,none,70,70,63
4,female,group D,associate's degree,standard,none,83,85,86
...,...,...,...,...,...,...,...,...
995,male,group C,some college,standard,none,77,77,71
996,male,group C,some college,standard,none,80,66,66
997,female,group A,high school,standard,completed,67,86,86
998,male,group E,high school,standard,none,80,72,62


### <font color='blue'>The Target Feature: Math Exam Score. We will use the existing data fields to predict the math score of any individual.<br><br>Verifying Data Types: </font>

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


<font color='green'>All columns have the correct data types: 5 categorical and 3 numerical (including the target feature).</font>

### <font color='blue'>Deciding which columns to drop: </font>

- **Gender**: Kept, since a student's gender may be associated with their math score.
- **Race/Ethnicity**: Kept, since math scores may differ between the groups in this dataset.
- **Parental Level of Education**: Kept, since parents' education can influence their children's academic performance.
- **Lunch**: Kept, since it can reflect the family's financial situation, which may in turn affect the student's score.
- **Test Preparation Course**: Kept, since preparing can increase one's chance of getting a better score.
- **Reading Score**: Kept, since this reflects their literacy and ability to read.
- **Writing Score**: Kept, since this reflects their ability to write.


<font color='green'> All features will be kept (not dropped) because each of them may have some effect on the target feature (math score).</font>

### <font color='blue'>Checking for duplicates and removing if any: </font>

In [4]:
data.duplicated().sum()

0
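The check above found no duplicates, so nothing had to be removed. As a minimal sketch of what the removal step could look like if duplicates did exist, the example below uses a small made-up frame (the values and the names `toy` and `deduplicated` are hypothetical, not taken from the exam data):

```python
import pandas as pd

# Toy frame (hypothetical values) in which the third row repeats the first
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "math score": [59, 96, 59],
})

# Count rows that are identical to an earlier row
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean index
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(len(deduplicated))
```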

### <font color='blue'>Splitting the Features: <br>- y as the target feature Math Score (will be predicted).<br>- X as the input features (used to predict the target).</font>

In [5]:
# target
y = data["math score"]

# other features
X = data.drop(columns="math score")

# displaying content
print(y)
X

0      59
1      96
2      57
3      70
4      83
       ..
995    77
996    80
997    67
998    80
999    58
Name: math score, Length: 1000, dtype: int64


Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,reading score,writing score
0,female,group D,some college,standard,completed,70,78
1,male,group D,associate's degree,standard,none,93,87
2,female,group D,some college,free/reduced,none,76,77
3,male,group B,some college,free/reduced,none,70,63
4,female,group D,associate's degree,standard,none,85,86
...,...,...,...,...,...,...,...
995,male,group C,some college,standard,none,77,71
996,male,group C,some college,standard,none,66,66
997,female,group A,high school,standard,completed,86,86
998,male,group E,high school,standard,none,72,62


### <font color='blue'>Splitting data to train and test (80% and 20% respectively), while using the number 1210209 as the random seed:</font>


In [6]:
# using train_test_split from the scikit-learn library imported at the start of the notebook
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1210209, test_size=0.2)

#displaying to make sure
X_train

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,reading score,writing score
840,female,group E,high school,free/reduced,completed,94,97
941,male,group E,associate's degree,standard,completed,75,75
571,female,group B,associate's degree,standard,none,94,93
802,female,group D,bachelor's degree,free/reduced,none,85,83
764,female,group B,some college,standard,none,76,70
...,...,...,...,...,...,...,...
999,male,group D,high school,standard,none,47,45
51,female,group E,master's degree,free/reduced,completed,83,84
30,female,group B,associate's degree,standard,completed,81,85
560,male,group C,high school,free/reduced,completed,57,55
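Fixing `random_state` is what makes the split reproducible. The sketch below illustrates this on a small synthetic array (the names `X_toy` and `y_toy` are hypothetical and stand in for the real features and target):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples with a single feature
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10)

# Splitting twice with the same seed should give the same partition
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1210209
)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1210209
)

same_split = np.array_equal(X_te_a, X_te_b) and np.array_equal(y_tr_a, y_tr_b)
print(len(X_tr_a), len(X_te_a), same_split)
```

With `test_size=0.2`, 8 of the 10 toy rows go to training and 2 to testing, mirroring the 80/20 split used above.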


### <font color='blue'>Handling missing values:</font>

In [7]:
#checking if any missing values exist already in the original dataset
data.isna().sum()

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

<font color='green'>Even though there are no missing values in the current dataset, there might be in the data that will later be used to predict the target. For this reason, we handle potential missing values with simple imputers.</font>

- **Nominal missing values** will be filled with the most frequent value (mode), which keeps the imputed values within the existing categories and leaves the distribution largely unchanged.
- **Ordinal missing values** will also be filled with the most frequent value (mode), since the imputer runs on the raw text labels before they are encoded and a median is not defined for them.
- **Numerical missing values** will be filled with the median, which is robust to extreme values and keeps the distribution shape close to that of the original data.
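The two imputation strategies can be sketched on toy columns as follows. The arrays and values are made up for illustration and are not part of the exam data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy columns (hypothetical values), each with one missing entry
scores = np.array([[50.0], [60.0], [np.nan], [100.0]])
lunch = np.array([["standard"], ["standard"], [np.nan], ["free/reduced"]], dtype=object)

median_imputer = SimpleImputer(strategy="median")
mode_imputer = SimpleImputer(strategy="most_frequent")

# The missing score becomes the median of 50, 60, 100
filled_scores = median_imputer.fit_transform(scores)

# The missing label becomes the most frequent category
filled_lunch = mode_imputer.fit_transform(lunch)

print(filled_scores.ravel())
print(filled_lunch.ravel())
```

Note that the median (60) is unaffected by the extreme value 100, whereas the mean of the three observed scores would be 70.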

### <font color='blue'>Splitting the data depending on their data type:</font>

In [8]:
# numerical columns
num_col = X_train.select_dtypes("number").columns

# categorical columns
cat_col = X_train.select_dtypes("object").columns

# differentiating between nominal and ordinal columns
nom_col = cat_col.drop("parental level of education")

ord_col = ["parental level of education"]

### <font color='blue'>Creating Imputers:</font>

In [9]:
# most frequent value for ordinal and nominal data
categorical_imputer = SimpleImputer(strategy='most_frequent')  

# median for numerical data
numerical_imputer = SimpleImputer(strategy='median') 

print(numerical_imputer)
print(categorical_imputer)

SimpleImputer(strategy='median')
SimpleImputer(strategy='most_frequent')


### <font color='blue'>Data Encoding Nominal and Ordinal Features:</font>

- #### One-Hot Encoder for Nominal Features:

In [10]:
# creating one hot encoder for nominal data
ohe_encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")

print(ohe_encoder)

OneHotEncoder(handle_unknown='ignore', sparse_output=False)



- #### Ordinal Encoder for Ordinal Features:
<font color='green'><br>We must first determine the existing categories of the ordinal feature in order to rank them.</font>

In [11]:
data['parental level of education'].unique()

array(['some college', "associate's degree", 'some high school',
       "bachelor's degree", "master's degree", 'high school'],
      dtype=object)

In [12]:
# ordering the education levels from lowest to highest
education_order = ["some high school", "high school", "some college", "associate's degree", "bachelor's degree", "master's degree"]

#representing the order as a list
education_order_list = [education_order]
education_order_list

[['some high school',
  'high school',
  'some college',
  "associate's degree",
  "bachelor's degree",
  "master's degree"]]

In [13]:
#creating the ordinal encoder
ordinal_encoder = OrdinalEncoder(categories=education_order_list)
ordinal_encoder
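To illustrate what the explicit category order does, the sketch below encodes three sample labels with the same ordering. The `sample` array is made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Same ordering as in the notebook, lowest to highest
education_order = ["some high school", "high school", "some college",
                   "associate's degree", "bachelor's degree", "master's degree"]

encoder = OrdinalEncoder(categories=[education_order])

# Toy labels (hypothetical sample)
sample = np.array([["master's degree"], ["some high school"], ["some college"]],
                  dtype=object)

# Each label is replaced by its position in education_order
codes = encoder.fit_transform(sample)
print(codes.ravel())
```

Without the `categories` argument the encoder would sort the labels alphabetically, which would not reflect the actual ranking of education levels.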

### <font color='blue'>Creating a Scaler to standardize the data:</font>

In [14]:
scaler = StandardScaler()
scaler
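`StandardScaler` subtracts the mean and divides by the standard deviation learned from the data it is fitted on. A short sketch on made-up scores:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy scores (hypothetical values)
scores = np.array([[50.0], [60.0], [70.0], [80.0], [90.0]])

toy_scaler = StandardScaler()
scaled = toy_scaler.fit_transform(scores)

# mean_ and scale_ hold the statistics learned during fit
print(toy_scaler.mean_, toy_scaler.scale_)

# After scaling, the column has mean 0 and standard deviation 1
print(scaled.mean(), scaled.std())
```

Because the statistics are stored on the fitted scaler, the same shift and scale are reused when transforming test data.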

### <font color='blue'>Creating the pipelines and a column transformer to fit and transform the data:</font>

In [15]:
#Numerical Pipeline
num_pipe = make_pipeline(numerical_imputer, scaler)

#ordinal pipeline
ord_pipe = make_pipeline(categorical_imputer, ordinal_encoder, scaler)

#nominal pipeline
nominal_pipe = make_pipeline(categorical_imputer, ohe_encoder)

#building pipeline tuples
num_pipe_tuple = ("numeric", num_pipe, num_col)
ord_pipe_tuple = ("ordinal", ord_pipe, ord_col)
nominal_pipe_tuple = ("categorical", nominal_pipe, nom_col)


#creating transformer
col_transformer = ColumnTransformer([num_pipe_tuple, ord_pipe_tuple, nominal_pipe_tuple], remainder="passthrough", verbose_feature_names_out=False)
col_transformer



### <font color='blue'>Linear Regression Model: Creation, Building, and Fitting.</font>

In [16]:
# generating dataframes instead of numpy arrays
set_config(transform_output = "pandas")

#creating linear model
model = LinearRegression()

# making the pipeline holding the column transformer and linear model
linear_model = make_pipeline(col_transformer, model)

#fitting the model to the training data
linear_model.fit(X_train, y_train)
linear_model

<font color="green">The pipeline now handles missing values, encoding, and scaling of the data.</font>

<font color="green">During fitting, the model learned the relationship between the input features and the target (Math Score) by finding the weights that minimize the squared-error loss on the training data.</font>

### <font color='blue'>Obtaining the assigned weights and parameters.</font>

In [17]:
# the y intercept
b_0 = linear_model.named_steps['linearregression'].intercept_.round(3)

# other coefficients
b_i = linear_model.named_steps['linearregression'].coef_.round(3)

# the corresponding column names of these values
col_names = col_transformer.get_feature_names_out()

#displaying values
print(f"**Column Names**\n {col_names}\n")
print(f"**Coefficients**\n {b_i}\n")
print(f"**Intercept**\n {b_0}")

**Column Names**
 ['reading score' 'writing score' 'parental level of education'
 'gender_female' 'gender_male' 'race/ethnicity_group A'
 'race/ethnicity_group B' 'race/ethnicity_group C'
 'race/ethnicity_group D' 'race/ethnicity_group E' 'lunch_free/reduced'
 'lunch_standard' 'test preparation course_completed'
 'test preparation course_none']

**Coefficients**
 [ 3.493 10.78   0.077 -6.659  6.659 -0.553 -0.75  -1.274 -0.833  3.41
 -2.252  2.252 -2.061  2.061]

**Intercept**
 66.742


### <font color='blue'>Interpreting the weights and parameters.</font>

#### Intercept ( b_0 )
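Represents the expected value of the target variable (math score) **when all transformed predictors are zero**. In this case, the intercept is **66.742**. Because the numerical and ordinal features were standardized, a value of zero for them corresponds to the training-set average, so the intercept can be read as roughly the expected math score of a student with average reading score, writing score, and parental education, before the categorical adjustments are applied.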

#### Coefficients ( b_i )
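Each coefficient represents the expected change in the math score for a one-unit change in the transformed predictor, holding all other predictors constant. Because the numerical and ordinal features were standardized in the pipeline, one unit for those features is one standard deviation rather than one raw point.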
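A minimal sketch, on synthetic data rather than the exam data, of how a coefficient fitted on a feature scaled with `StandardScaler` can be converted back to a per-point effect. The data and variable names are assumptions made for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: y rises exactly 0.5 points per raw point of x
x = np.array([[40.0], [50.0], [60.0], [70.0], [80.0]])
y = 10.0 + 0.5 * x.ravel()

x_scaler = StandardScaler()
x_scaled = x_scaler.fit_transform(x)
toy_model = LinearRegression().fit(x_scaled, y)

# Coefficient as fitted: change in y per one standard deviation of x
coef_per_std = toy_model.coef_[0]

# Dividing by the learned standard deviation gives the change per raw point
coef_per_point = coef_per_std / x_scaler.scale_[0]

print(round(coef_per_std, 3), round(coef_per_point, 3))
```

In the fitted pipeline the same conversion would require the `scale_` values of the scaler inside the numeric branch of the column transformer.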

1. <font color="red">**Reading Score**</font> (3.493):
   - For each increase of one standard deviation in the reading score (the feature was standardized), the math score is expected to increase by approximately 3.493 points, indicating a positive relationship between reading and math scores.

2. <font color="green">**Writing Score**</font> (10.780):
   - For each increase of one standard deviation in the writing score, the math score is expected to increase by approximately 10.780 points, indicating a strong positive relationship between writing and math scores.

3. <font color="blue">**Parental Level of Education**</font> (0.077):
   - For each increase of one standard deviation in the encoded parental education level, the math score increases by 0.077 points. Higher levels of parental education are associated with higher math scores, but once the other features are accounted for the effect is very small.
   
4. <font color="orange">**Gender_Female**</font> (-6.659):
   - Being female is associated with a 6.659 point decrease in the math score relative to the model's baseline, holding the other features (including reading and writing scores) constant.

5. <font color="orange">**Gender_Male**</font> (6.659):
   - This coefficient is the negative of the female coefficient: being male adds 6.659 points relative to the baseline. Since both gender columns are in the model, the estimated gap between male and female students with otherwise identical features is about 13.3 points (6.659 - (-6.659)).

6. <font color="purple">**Race/Ethnicity_Group A**</font> (-0.553):
   - Students from group A are expected to score 0.553 points lower in math relative to the model's baseline. No category was dropped during one-hot encoding, so there is no single reference group; the five group coefficients sum to approximately zero and are best compared with each other.

7. <font color="purple">**Race/Ethnicity_Group B**</font> (-0.750):
   - Students from group B are expected to score 0.750 points lower in math relative to the baseline.

8. <font color="purple">**Race/Ethnicity_Group C**</font> (-1.274):
   - Students from group C are expected to score 1.274 points lower in math relative to the baseline.

9. <font color="purple">**Race/Ethnicity_Group D**</font> (-0.833):
   - Students from group D are expected to score 0.833 points lower in math relative to the baseline.

10. <font color="purple">**Race/Ethnicity_Group E**</font> (3.410):
    - Students from group E are expected to score 3.410 points higher in math relative to the baseline, which is the only positive coefficient among the five groups.

11. <font color="hotpink">**Lunch_Free/Reduced**</font> (-2.252):
    - Students who receive free or reduced-price lunch are expected to score 2.252 points lower in math relative to the baseline, which corresponds to a gap of about 4.5 points compared to students who receive standard lunch. One possible explanation is that these students more often come from lower-income households, where financial constraints may bring additional stress and less academic support.

12. <font color="hotpink">**Lunch_Standard**</font> (2.252):
    - This coefficient is the negative of the free/reduced lunch coefficient, and indicates that students who receive standard lunch are expected to score 2.252 points higher in math relative to the baseline.

13. <font color="gold">**Test Preparation Course_Completed**</font> (-2.061):
    - Completing a test preparation course is associated with a 2.061 point decrease in math score relative to the baseline, or about 4.1 points lower than students who did not complete the course. This is counterintuitive. It may partly arise because reading and writing scores are held constant in the model; it could also indicate that the course caused burnout or covered contradictory material, or that there are issues with the dataset.

14. <font color="gold">**Test Preparation Course_None**</font> (2.061):
    - This coefficient is the negative of the completed-course coefficient; it is not a reference category, since no category was dropped. It corresponds to a 2.061 point increase in the math score relative to the baseline when students don't take the test preparation course.
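The mirrored coefficients for two-category features follow from keeping both one-hot columns in the model. The sketch below, on a made-up four-student sample, suggests how `LinearRegression` splits a group difference evenly between the two columns. The scores are hypothetical, and the even split relies on the minimum-norm least-squares solution that scikit-learn's solver returns for collinear columns:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: [gender_female, gender_male] (toy data, hypothetical scores)
X_toy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
y_toy = np.array([58.0, 62.0, 68.0, 72.0])  # females average 60, males average 70

toy_model = LinearRegression().fit(X_toy, y_toy)
coef_female, coef_male = toy_model.coef_

# The difference between the two coefficients is the actual group gap
gap = coef_male - coef_female

print(np.round(toy_model.coef_, 3), round(toy_model.intercept_, 3), round(gap, 3))
```

The quantity that is well defined is the difference between the two coefficients, which is why the gap between categories is twice the magnitude of each individual coefficient.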
    
   
### <font color='blue'>Evaluating the Model's Quality Using MAE, MSE, and RMSE</font>

In [1]:
# Define the regression metrics function
def regression_metrics(y_true, y_pred, label='', verbose=True, output_dict=False):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = mse ** 0.5  # RMSE is the square root of MSE (the `squared` argument is deprecated in recent scikit-learn releases)
    r_squared = r2_score(y_true, y_pred)
    
    if verbose:
        header = "-" * 60
        print(header, f"Regression Metrics: {label}", header, sep='\n')
        print(f"- MAE = {mae:,.3f}")
        print(f"- MSE = {mse:,.3f}")
        print(f"- RMSE = {rmse:,.3f}")
        print(f"- R^2 = {r_squared:,.3f}")
    
    if output_dict:
        return {'Label': label, 'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R^2': r_squared}

# predictions for training data
y_train_pred = linear_model.predict(X_train)

# Calling the function to obtain regression metrics for training data
results_train = regression_metrics(y_train, y_train_pred, label='Training Data', output_dict=True)
print()

# Get predictions for test data
y_test_pred = linear_model.predict(X_test)

# Calling the function to obtain regression metrics for test data
results_test = regression_metrics(y_test, y_test_pred, label='Test Data', output_dict=True)

NameError: name 'linear_model' is not defined

### <font color='blue'>Model Evaluation:</font>


1. **Mean Absolute Error (MAE)**:
   - The MAE was relatively low for both training and testing data. This shows that, on average, the model's predictions were close to the actual scores.

2. **Mean Squared Error (MSE)**:
   - The MSE values of around 30 (in squared points) are the average of the squared prediction errors. Because the errors are squared, large misses are penalized more heavily, and the value is not directly on the same scale as the math score.

3. **Root Mean Squared Error (RMSE)**:
   - The RMSE values of around 5.5 points indicate that a typical prediction is off by about 5.5 points from the actual math score. Lower RMSE values indicate better model performance, so the predictions are generally close to the actual values.

4. **R squared**:
   - Both training and test values were relatively high, and higher R^2 values (close to 1) indicate a better fit.
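To make the relationship between the four metrics concrete, they can be computed by hand on a few made-up predictions. The values below are hypothetical and are not the notebook's results:

```python
import numpy as np

# Toy actual and predicted scores (hypothetical values)
y_true = np.array([60.0, 70.0, 80.0, 90.0])
y_pred = np.array([62.0, 67.0, 85.0, 90.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # average absolute error, in points
mse = np.mean(errors ** 2)      # average squared error, in squared points
rmse = np.sqrt(mse)             # back on the original point scale

# R^2 compares the squared error to that of always predicting the mean
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mse, round(rmse, 3), round(r2, 3))
```

Here the single 5-point miss contributes 25 of the 38 squared-error units, which shows why MSE and RMSE react more strongly to large errors than MAE does.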

### <font color='blue'>Determining if the model over/under fits the data:</font>

 - The model **slightly overfits** the data: the error metrics are lower on the training data and the R^2 value is higher than on the test data. This means the model performs somewhat better on the data it was trained on than on unseen test data. However, the differences are small and the model performs well overall, so the overfitting is only slight.

  - The model is **not underfit**, since it performs well on both the training and test data, with no high errors.
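One way to make the train/test comparison explicit is to compute the difference between the two sets of metrics. The helper `generalization_gap` below is a hypothetical function written for this sketch, and the numbers are placeholders for illustration only, not the notebook's actual results:

```python
def generalization_gap(train_metrics, test_metrics):
    """Return the test-minus-train difference for each shared metric."""
    return {
        name: test_metrics[name] - train_metrics[name]
        for name in train_metrics
        if name in test_metrics
    }

# Placeholder values (NOT the notebook's results)
train = {"RMSE": 5.0, "R^2": 0.90}
test = {"RMSE": 5.5, "R^2": 0.875}

gap = generalization_gap(train, test)
print(gap)
```

A small positive gap in the error metrics together with a small negative gap in R^2 would be consistent with slight overfitting, while large gaps would point to a more serious generalization problem.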
