## Dataset 1 & 2 Overview

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import xgboost as xgb

# Load both datasets from GitHub
dataset1_url = "https://raw.githubusercontent.com/hbedros/data622-assignment1/refs/heads/main/data/dataset-1.csv"
dataset2_url = "https://raw.githubusercontent.com/hbedros/data622-assignment1/refs/heads/main/data/dataset-2.csv"

dataset1 = pd.read_csv(dataset1_url)
dataset2 = pd.read_csv(dataset2_url)

print(dataset1.head())
print(dataset2.head())


   Hours_Studied  Attendance Parental_Involvement Access_to_Resources  \
0             23          84                  Low                High   
1             19          64                  Low              Medium   
2             24          98               Medium              Medium   
3             29          89                  Low              Medium   
4             19          92               Medium              Medium   

  Extracurricular_Activities  Sleep_Hours  Previous_Scores Motivation_Level  \
0                         No            7               73              Low   
1                         No            8               59              Low   
2                        Yes            7               91           Medium   
3                        Yes            8               98           Medium   
4                        Yes            6               65           Medium   

  Internet_Access  Tutoring_Sessions Family_Income Teacher_Quality  \
0             Ye

### Overview of the Datasets

We have two datasets to work with:

1. **Dataset 1 (dataset-1.csv)**:
   - This dataset describes students with numeric columns such as hours studied, attendance, sleep hours, and previous scores.
   - It also includes categorical columns, such as parental involvement, access to resources, and participation in extracurricular activities.
   - Together, these columns give a broad picture of each student's background and behavior.

2. **Dataset 2 (dataset-2.csv)**:
   - This dataset focuses on exam outcomes: Math, Reading, and Writing scores plus a Placement Score.
   - It describes the results of students' studies rather than the circumstances behind them.
   - The Placement Score can be treated as the output variable, with the other three exam scores as possible indicators of placement performance.

### Similarities and Differences

**Similarities**:
- Both datasets describe students and their academic records.
- Both relate to academic performance, which is the outcome this analysis tries to predict.

**Differences**:
- Dataset 1 has a lot more details about students' experiences and environments.
- Dataset 2 is mainly about their scores, so it’s smaller and focused just on performance.
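
The difference in structure between the two datasets can be checked directly by separating numeric from categorical columns. The sketch below is a minimal illustration: `summarize_columns` is a hypothetical helper, `demo1` reuses a few values from the `head()` output shown earlier, and the values in `demo2` are made up for the example. In the notebook, the same helper could be called on `dataset1` and `dataset2`.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> dict:
    """Return the row count plus the numeric and categorical column names."""
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
    return {"rows": len(df), "numeric": numeric_cols, "categorical": categorical_cols}


# Small stand-in frames (hypothetical; not the full datasets).
demo1 = pd.DataFrame({
    "Hours_Studied": [23, 19, 24],
    "Attendance": [84, 64, 98],
    "Parental_Involvement": ["Low", "Low", "Medium"],
})
demo2 = pd.DataFrame({
    "Math_Score": [70, 82, 91],        # made-up values
    "Placement_Score": [65, 80, 77],   # made-up values
})

summary1 = summarize_columns(demo1)
summary2 = summarize_columns(demo2)
print(summary1)
print(summary2)
```

Columns reported as categorical are the ones that need encoding before they can be passed to a model.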

### Analyzing the Data

For the analysis, we aim to predict how well students will perform based on the information in Dataset 1. We will also determine if the math, reading, and writing scores from Dataset 2 indicate how a student will do on their placement exam.

### Which Algorithms to Use

To choose algorithms, we considered several options and then selected two to compare. The output variable is an exam score recorded as a whole number, so it is technically discrete. The data contains many input variables; some will likely have a linear relationship with the exam score, and some will not. Dataset 1 is relatively large compared to Dataset 2.

Here are three machine learning algorithms that could work well:

1. **Random Forest Regression**:
   - This ensemble algorithm can handle a mix of numeric and categorical inputs once the categories are encoded. It is a good fit if we expect non-linear relationships between the factors. Although exam scores are whole numbers, they can be treated as a continuous variable because the differences between values are meaningful.

2. **KNN**:
   - This algorithm works well on small datasets and is commonly used for classification, though a regression variant also exists. It should perform well with Dataset 2. To use it with the larger dataset, it makes sense to reduce dimensionality first, for example with Principal Component Analysis.

3. **XGBoost**:
   - This gradient boosting library is a common starting point for modeling complex, multi-dimensional data. It should work with both Dataset 1 and Dataset 2. We can use it after testing another model to see how the two compare.

Based on those options, we will use Random Forest Regression and XGBoost. KNN might produce better results for Dataset 2, but it will likely struggle with Dataset 1.
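
Although KNN was not selected, the idea of pairing it with Principal Component Analysis can be sketched briefly. The example below is an illustration on synthetic data, not part of the analysis: `X_demo` and `y_demo` are made up, the choice of 5 components and 5 neighbors is arbitrary, and `KNeighborsRegressor` is used because the target here is a numeric score.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 10 numeric features (hypothetical).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

knn_pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),                 # KNN and PCA are both scale-sensitive
    ("pca", PCA(n_components=5)),                # reduce 10 features to 5 components
    ("knn", KNeighborsRegressor(n_neighbors=5)),
])
knn_pipeline.fit(X_demo, y_demo)

# Apply only the preprocessing steps to inspect the reduced feature matrix.
reduced = knn_pipeline[:-1].transform(X_demo)
preds = knn_pipeline.predict(X_demo[:3])
print(reduced.shape)
```

Scaling comes first because both PCA and distance-based neighbors are dominated by whichever feature has the largest numeric range.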

## Applying Machine Learning Algorithms to Dataset 1 and Dataset 2

In this section, we apply Random Forest Regression and XGBoost to both datasets to create models. We start with Random Forest Regression.

### Random Forest Regression - Dataset 1

This random forest model uses six numeric variables (hours studied, attendance, sleep hours, previous scores, tutoring sessions, and physical activity) and three one-hot-encoded categorical variables (teacher quality, family income, and parental involvement). After splitting the data into training and testing sets, the model was trained and evaluated. The results showed some promise but left room for improvement. The **R-squared value** was **0.54**, indicating that about 54.1% of the variance in exam scores is explained by the model, leaving 45.9% unexplained. The **mean absolute error (MAE)** was **1.32**, meaning predictions were off by about 1.3 points on average on a score scale from 0 to 100. While these results are decent, there is potential for better accuracy, so next we apply the **XGBoost** algorithm to see whether it performs better.

In [32]:
X = dataset1[['Hours_Studied', 'Attendance', 'Sleep_Hours', 'Previous_Scores', 
              'Tutoring_Sessions', 'Physical_Activity', 'Teacher_Quality', 
              'Family_Income', 'Parental_Involvement']]  # six numeric features plus three categorical ones
Y = dataset1['Exam_Score']  # exam_score is what we're trying to predict

categorical_columns = ['Teacher_Quality', 'Family_Income', 'Parental_Involvement']

# one-hot encode the categorical variables and leave the numerical ones as they are
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', ['Hours_Studied', 'Attendance', 'Sleep_Hours', 'Previous_Scores', 
                                'Tutoring_Sessions', 'Physical_Activity']),  
        ('cat', OneHotEncoder(), categorical_columns)
    ])

# putting everything into a pipeline with the RandomForestRegressor
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),  # First, handle preprocessing
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=15))  # Then, the actual model
])

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=28)

pipeline.fit(X_train, Y_train)

Y_pred = pipeline.predict(X_test)

mse = mean_squared_error(Y_test, Y_pred)
mae = mean_absolute_error(Y_test, Y_pred)
r2 = r2_score(Y_test, Y_pred)

print(f'Mean Squared Error: {mse}')  # lower is better
print(f'Mean Absolute Error: {mae}')  # lower is better
print(f'R-squared Score: {r2}')  # higher is better here

Mean Squared Error: 7.558078874092011
Mean Absolute Error: 1.32114406779661
R-squared Score: 0.5410372811587938
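
An MAE is easier to judge next to a naive baseline. The sketch below shows one way to make that comparison with scikit-learn's `DummyRegressor`, which always predicts the training mean. It runs on synthetic data (`X_demo` and `y_demo` are made up), so its numbers say nothing about Dataset 1; in the notebook, the same pattern could be applied to `X_train`, `X_test`, `Y_train`, and `Y_test`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a real signal in both features (hypothetical).
rng = np.random.default_rng(28)
X_demo = rng.uniform(0, 40, size=(400, 2))
y_demo = 55 + 0.4 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(scale=1.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=28)

# Baseline: always predict the mean of the training targets.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=100, random_state=15).fit(X_tr, y_tr)

baseline_mae = mean_absolute_error(y_te, baseline.predict(X_te))
forest_mae = mean_absolute_error(y_te, forest.predict(X_te))
print(f"Baseline MAE: {baseline_mae:.2f}")
print(f"Random forest MAE: {forest_mae:.2f}")
```

If a model's MAE is only slightly below the baseline's, the features add little beyond knowing the average score.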


### XGBoost - Gradient Boosting - Dataset 1

This XGBoost model uses the same features as the random forest model: six numeric variables (hours studied, attendance, sleep hours, previous scores, tutoring sessions, and physical activity) and three one-hot-encoded categorical variables (teacher quality, family income, and parental involvement). The data was split into training and testing sets, the model was fit on the training data, and it was evaluated on the test data. The results were not favorable. The **R-squared value** was low at **0.12**, meaning only **11.8%** of the variance in exam scores is explained by the model, leaving **88.2%** unexplained. The **mean absolute error (MAE)** was **2.73**, roughly double the random forest's **1.32**, so this model performed worse overall. One possible contributor is the configuration: 45 boosting rounds at a learning rate of 0.07 may be too few for the model to converge, so tuning these hyperparameters is a natural next step.

In [36]:
X = dataset1[['Hours_Studied', 'Attendance', 'Sleep_Hours', 'Previous_Scores', 'Tutoring_Sessions', 'Physical_Activity', 
              'Teacher_Quality', 'Family_Income', 'Parental_Involvement']]
Y = dataset1['Exam_Score'] 

categorical_columns = ['Teacher_Quality', 'Family_Income', 'Parental_Involvement']

# One-hot encode the categorical variables
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', ['Hours_Studied', 'Attendance', 'Sleep_Hours', 'Previous_Scores', 
                                'Tutoring_Sessions', 'Physical_Activity']),
        ('cat', OneHotEncoder(), categorical_columns)
    ])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', xgb.XGBRegressor(n_estimators=45, learning_rate=0.07, random_state=23))
])

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=28)

pipeline.fit(X_train, Y_train)

Y_pred = pipeline.predict(X_test)

mse = mean_squared_error(Y_test, Y_pred)
mae = mean_absolute_error(Y_test, Y_pred)
r2 = r2_score(Y_test, Y_pred)

print(f'Mean Squared Error: {mse}')
print(f'Mean Absolute Error: {mae}')
print(f'R-squared Score: {r2}')

Mean Squared Error: 14.531362795653902
Mean Absolute Error: 2.7283759613591303
R-squared Score: 0.11758611040394995
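
The XGBoost configuration above uses `n_estimators=45` and `learning_rate=0.07`. A rough piece of arithmetic suggests why that combination might underfit. Under an idealised assumption (squared-error loss, every tree fits the current residual perfectly, no regularisation), each round removes a fraction equal to the learning rate from the remaining residual. The hypothetical helper `remaining_fraction` below computes what would be left; real training converges more slowly than this best case, and how much it matters depends on how far the model's initial prediction is from typical scores, which varies with XGBoost version and settings.

```python
def remaining_fraction(learning_rate: float, n_rounds: int) -> float:
    """Fraction of the initial residual left after n_rounds of boosting,
    assuming each tree fits the current residual perfectly (best case)."""
    return (1.0 - learning_rate) ** n_rounds


# Compare the notebook's setting (45 rounds) with longer runs at the same rate.
for n_rounds in (45, 100, 300):
    fraction = remaining_fraction(0.07, n_rounds)
    print(f"{n_rounds} rounds: {fraction:.4%} of the initial residual remains")
```

With 45 rounds, roughly 4% of the initial gap remains even in this best case, which is one reason to try more rounds or a higher learning rate before concluding that XGBoost is the weaker model here.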


### Random Forest Regression - Dataset 2

This random forest model uses the math, reading, and writing exam scores to predict the placement score. The data was split into training and testing sets, the model was fit on the training data, and it was evaluated on the test data. The results were poor. The R-squared value was negative (about -0.22), which means the model performed worse than a baseline that always predicts the mean placement score; it explains none of the variance. The mean absolute error was about 9.79. On a scale from 0 to 100, that is an average miss of almost 10 points. This data does not appear to be useful for predicting the results of a placement exam. Based on the EDA of this dataset, the results make sense: the correlations between the variables were extremely low.

In [37]:

X = dataset2[['Math_Score', 'Reading_Score', 'Writing_Score']]
Y = dataset2['Placement_Score']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=28)

reg = RandomForestRegressor(n_estimators=100, random_state=15)
reg.fit(X_train, Y_train)

Y_pred = reg.predict(X_test)

mse = mean_squared_error(Y_test, Y_pred)
mae = mean_absolute_error(Y_test, Y_pred)

print(f'Mean Squared Error: {mse}')
print(f'Mean Absolute Error: {mae}')

r2 = r2_score(Y_test, Y_pred)
print(f'R-squared Score: {r2}')



Mean Squared Error: 141.6895601661111
Mean Absolute Error: 9.787733333333332
R-squared Score: -0.21580613084403177
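
The low correlations reported in the EDA can be checked with a single pandas call. The sketch below is an illustration on synthetic data: `demo` holds independent random integers that merely reuse Dataset 2's column names, so the printed values are not the dataset's actual correlations. In the notebook, `dataset2` would take the place of `demo`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: four independent score columns (hypothetical values).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.integers(40, 100, size=(500, 4)),
    columns=["Math_Score", "Reading_Score", "Writing_Score", "Placement_Score"],
)

# Pearson correlation of each feature with the target, excluding the target itself.
corr_with_target = demo.corr()["Placement_Score"].drop("Placement_Score")
print(corr_with_target)
```

Correlations close to zero, as in this independent synthetic data, mean the features carry little linear information about the target.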


### XGBoost - Gradient Boosting - Dataset 2

This XGBoost model uses the math, reading, and writing exam scores to predict the placement score. The data was split into training and testing sets, the model was fit on the training data, and it was evaluated on the test data. The results were poor, similar to the random forest model. The R-squared value was negative (about -0.20), which means the model performed worse than a baseline that always predicts the mean placement score; it explains none of the variance. The mean absolute error was about 9.82. On a scale from 0 to 100, that is an average miss of almost 10 points.

In [38]:

X = dataset2[['Math_Score', 'Reading_Score', 'Writing_Score']]
Y = dataset2['Placement_Score']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=28)

xgb_reg = xgb.XGBRegressor(n_estimators=45, learning_rate=0.07, random_state=23)

xgb_reg.fit(X_train, Y_train)

Y_pred = xgb_reg.predict(X_test)

mse = mean_squared_error(Y_test, Y_pred)
mae = mean_absolute_error(Y_test, Y_pred)
r2 = r2_score(Y_test, Y_pred)

print(f'Mean Squared Error: {mse}')
print(f'Mean Absolute Error: {mae}')
print(f'R-squared Score: {r2}')

Mean Squared Error: 139.28098768582566
Mean Absolute Error: 9.816615447998046
R-squared Score: -0.19513871410083516
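
A negative R-squared can look surprising, so it helps to compute the metric by hand. R-squared is one minus the ratio of the model's squared error to the squared error of always predicting the mean. The sketch below uses a hypothetical helper, `r_squared`, and made-up scores to show that predicting the mean gives exactly 0 and that a model worse than the mean goes negative.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed without external libraries."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model's squared error
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # mean baseline's squared error
    return 1.0 - ss_res / ss_tot


y_true = [60, 70, 80]               # made-up scores
mean_prediction = [70, 70, 70]      # always predict the mean
reversed_prediction = [80, 70, 60]  # systematically wrong

r2_mean = r_squared(y_true, mean_prediction)
r2_reversed = r_squared(y_true, reversed_prediction)
print(r2_mean, r2_reversed)
```

The negative values reported for Dataset 2 therefore mean that both models predicted the test set less accurately than the mean placement score would have.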


## Results from the Datasets

### Dataset 1:

Random forest regression and XGBoost can both help predict exam scores from the provided student information, but both models leave considerable room for improvement. The R-squared value for the random forest regression was **0.54**, while the XGBoost model scored much lower at **0.12**. The relatively low mean absolute errors, **1.32** for random forest and **2.73** for XGBoost, suggest some potential for use with future student data, though accuracy improvements are necessary.

### Dataset 2:

The negative R-squared values and high mean absolute errors suggest that models of this kind are unlikely to help predict placement exam scores from the other exam scores. This dataset is better suited to other work, such as simple statistical analysis and exploratory data analysis. It should not be used for prediction, since the scores appear to be unrelated to one another.