# End of Semester Statistical Analysis

This notebook uses Python to analyze the relationship between students' attendance and their final grades.

## Import Python Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import os

## Choose the Data File

In [None]:
# Set the path to the current directory
folder_path = '.'

# List all files and directories in the current directory
contents = os.listdir(folder_path)

# Print whether each item is a file or a directory
for item in contents:
    item_path = os.path.join(folder_path, item)
    if os.path.isfile(item_path):
        print(f"File: {item}")
    elif os.path.isdir(item_path):
        print(f"Directory: {item}")

## Load the Data

You may need to change the filename. The previous cell listed the contents of the folder this notebook is stored in; find the data file there, then change the file name on the line marked "Change file name or path here."

If 10 rows appear below the cell after you run it, the data has loaded successfully and you may continue.

In [None]:
file_path = 'file_goes_here.csv' # Change file name or path here.
data = pd.read_csv(file_path)

# Display the first 10 rows of the DataFrame
data.head(10)
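
The later cells assume the CSV contains the columns `Absence_Percentage` and `Semester_Total`, and rows with missing values in either column can cause the model fits to fail. A minimal sketch of a pre-check, assuming those two column names; `check_columns` and the sample frame are hypothetical and not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Column names the rest of the notebook relies on
REQUIRED_COLUMNS = ['Absence_Percentage', 'Semester_Total']

def check_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]

# Made-up sample standing in for the loaded CSV (one grade is missing)
sample = pd.DataFrame({
    'Absence_Percentage': [0.0, 5.0, 12.5, 20.0],
    'Semester_Total': [91.0, 78.0, np.nan, 55.0],
})

missing = check_columns(sample)
if missing:
    raise KeyError(f"Missing required columns: {missing}")

# Drop rows where either required column is empty
clean = sample.dropna(subset=REQUIRED_COLUMNS)
print(f"Kept {len(clean)} of {len(sample)} rows")
```

In the notebook itself, `sample` would be replaced by `data`.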

## Cubic Regression Model for Analysis

In [None]:
# Fit a cubic regression model
cubic_fit = np.polyfit(data['Absence_Percentage'], data['Semester_Total'], 3)
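
`np.polyfit` returns the coefficients ordered from the highest power down to the constant term, and `np.polyval` expects that same order. A small sketch on made-up data that follows an exact cubic, so the recovered coefficients are easy to check:

```python
import numpy as np

# Made-up data following y = 2x^3 - x + 5 exactly
x = np.linspace(0, 10, 20)
y = 2 * x**3 - x + 5

# Coefficients come back as [a3, a2, a1, a0], highest power first
coeffs = np.polyfit(x, y, 3)
print(np.round(coeffs, 6))

# Evaluate the fitted polynomial at a new point
y_at_4 = np.polyval(coeffs, 4.0)
print(round(float(y_at_4), 6))
```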

## Linear Regression Model for Analysis

In [None]:
# Fit a linear regression model
linear_regressor = LinearRegression()
linear_regressor.fit(data['Absence_Percentage'].values.reshape(-1, 1), data['Semester_Total'])
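
After fitting, the slope and intercept are available on the estimator as `coef_` and `intercept_`. The slope is the change in predicted grade for each one-point rise in absence percentage. A sketch on made-up, exactly linear data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: grade = 95 - 1.5 * absence
absence = np.array([0.0, 4.0, 8.0, 12.0, 20.0])
grade = 95 - 1.5 * absence

model = LinearRegression()
# scikit-learn expects a 2-D feature array, hence the reshape
model.fit(absence.reshape(-1, 1), grade)

slope = model.coef_[0]
intercept = model.intercept_
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```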

## Higher-degree Polynomial Regression for Analysis

In [None]:
# Fit a higher-degree polynomial regression
degree = 5
coeffs_higher_degree = np.polyfit(data['Absence_Percentage'], data['Semester_Total'], degree)

## Generate the Chart Parameters

In [None]:
# Generate x values (Absence Percentage) for prediction
x_values = np.linspace(data['Absence_Percentage'].min(), data['Absence_Percentage'].max(), 100)

# Predict y values using the fitted models
y_cubic = np.polyval(cubic_fit, x_values)
y_linear = linear_regressor.predict(x_values.reshape(-1, 1))
y_higher_degree = np.polyval(coeffs_higher_degree, x_values)

# Ensure that negative predicted values are set to zero
y_cubic[y_cubic < 0] = 0
y_linear[y_linear < 0] = 0
y_higher_degree[y_higher_degree < 0] = 0
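
The same clipping can be written with `np.clip`, which also allows an upper bound. The upper limit of 100 is an assumption based on the plots' 0 to 100 grade axis, not something the cell above enforces:

```python
import numpy as np

predictions = np.array([-5.0, 42.0, 103.0])  # made-up predicted grades

# Same result as predictions[predictions < 0] = 0, but returns a new array
non_negative = np.clip(predictions, 0, None)

# Assumed extra step: also cap predictions at the maximum grade
bounded = np.clip(predictions, 0, 100)
print(non_negative, bounded)
```

Unlike the boolean-mask assignment, `np.clip` leaves the original array unchanged unless its result is assigned back.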

## Plot the Data in a Scatterplot

Be sure to change the title so that it matches the data that has been plotted.

In [None]:
# Create the plot
plt.figure(figsize=(12, 8))
plt.scatter(data['Absence_Percentage'], data['Semester_Total'], alpha=0.7, label='Data')
plt.plot(x_values, y_cubic, color='green', label='Cubic Regression')
plt.plot(x_values, y_linear, color='blue', label='Linear Regression')
plt.plot(x_values, y_higher_degree, color='red', label=f'{degree} Degree Regression')
plt.ylim(0, 100)
# plt.xlim(0, 25)
plt.axhline(y=60, color='gray', linestyle='--', label='Grade Threshold (60)')
plt.title('Relationship Between Absence Percentage and Final Grade')
plt.xlabel('Absence Percentage')
plt.ylabel('Final Grade (out of 100%)')
plt.legend()
plt.show()

## Descriptive Statistics

*   `count`: Number of non-NA/null observations
*   `mean`: The arithmetic average
*   `std`: The standard deviation
*   `min`: The smallest (minimum) value
*   `25%`: The first quartile (25th percentile)
*   `50%`: The median (50th percentile) 
*   `75%`: The third quartile (75th percentile)
*   `max`: The largest (maximum) value

In [None]:
# Descriptive statistics
print("Descriptive Statistics for Absence Percentage:")
print(data['Absence_Percentage'].describe())
print("\nDescriptive Statistics for Semester Total:")
print(data['Semester_Total'].describe())

## Calculate and Display R-Squared Values for Each Model

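The higher the R-squared value, the more closely the model fits the data. Keep in mind that, on the same data, a higher-degree polynomial never scores lower than a lower-degree one, so a small gain may reflect overfitting rather than a better model. Based on the values, choose the corresponding regression model below to plot it on its own.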

In [None]:
# Calculate and display R-squared values for each model
r2_cubic = r2_score(data['Semester_Total'], np.polyval(cubic_fit, data['Absence_Percentage']))
r2_linear = linear_regressor.score(data['Absence_Percentage'].values.reshape(-1, 1), data['Semester_Total'])
r2_higher_degree = r2_score(data['Semester_Total'], np.polyval(coeffs_higher_degree, data['Absence_Percentage']))

print("\nR-squared values:")
print(f"Cubic Regression: {r2_cubic}")
print(f"Linear Regression: {r2_linear}")
print(f"{degree} Degree Regression: {r2_higher_degree}")
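
R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on made-up data, checks the result against `r2_score`, and prints the score for polynomial degrees 1, 3, and 5 fitted to the same data. The helper `manual_r2` and the synthetic data are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import r2_score

def manual_r2(y_true, y_pred):
    """R-squared as 1 - SS_res / SS_tot (hypothetical helper)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Made-up data: a noisy downward trend in grades as absences rise
rng = np.random.default_rng(0)
x = np.linspace(0, 25, 60)
y = 90 - 1.2 * x + rng.normal(0, 8, size=x.size)

scores = {}
for deg in (1, 3, 5):
    coeffs = np.polyfit(x, y, deg)
    y_pred = np.polyval(coeffs, x)
    scores[deg] = manual_r2(y, y_pred)
    # The hand-computed value should agree with scikit-learn
    assert np.isclose(scores[deg], r2_score(y, y_pred))
    print(f"degree {deg}: R-squared = {scores[deg]:.4f}")
```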

# Determine the Best Regression Model

Based on the calculations above, the regression model with the highest R-squared value should be the best fit for the scatterplot. Run the corresponding cell below to replot the data with only that model.
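
Picking the highest score can also be done in code. A sketch with placeholder R-squared values, made up for illustration and not results from any dataset:

```python
# Placeholder scores; in the notebook these would be
# r2_cubic, r2_linear, and r2_higher_degree
r2_scores = {
    'Cubic': 0.41,
    'Linear': 0.38,
    '5 Degree': 0.43,
}

# Select the model name whose score is largest
best_model = max(r2_scores, key=r2_scores.get)
print(f"Best fit by R-squared: {best_model} ({r2_scores[best_model]})")
```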

## Cubic Regression Model

Use the cubic coefficients obtained from the previous cells. If you haven't saved them, rerun the calculation.

In [None]:
cubic_fit = np.polyfit(data['Absence_Percentage'], data['Semester_Total'], 3)

# Generate x values (Absence Percentage) for prediction
x_values = np.linspace(data['Absence_Percentage'].min(), data['Absence_Percentage'].max(), 100)

# Predict y values using the cubic model
y_cubic = np.polyval(cubic_fit, x_values)
y_cubic[y_cubic < 0] = 0  # Ensuring non-negative predictions

# Re-create the plot with only the cubic regression
plt.figure(figsize=(12, 8))
plt.scatter(data['Absence_Percentage'], data['Semester_Total'], alpha=0.7, label='Data')
plt.plot(x_values, y_cubic, color='green', label='Cubic Regression')
plt.ylim(0, 100)
plt.axhline(y=60, color='gray', linestyle='--', label='Grade Threshold (60)')
plt.title('Relationship Between Absence Percentage and Final Grade (BLP - M)')
plt.xlabel('Absence Percentage')
plt.ylabel('Final Grade (out of 100%)')
plt.legend()
plt.show()

## Linear Regression Model

If the linear model was chosen as the best, use its coefficients. If you haven't saved the model, you might need to fit it again.

In [None]:
linear_regressor = LinearRegression()
linear_regressor.fit(data['Absence_Percentage'].values.reshape(-1, 1), data['Semester_Total'])

# Generate x values (Absence Percentage) for prediction
x_values = np.linspace(data['Absence_Percentage'].min(), data['Absence_Percentage'].max(), 100)

# Predict y values using the linear model
y_linear = linear_regressor.predict(x_values.reshape(-1, 1))
y_linear[y_linear < 0] = 0  # Ensuring non-negative predictions

# Re-create the plot with only the linear regression
plt.figure(figsize=(12, 8))
plt.scatter(data['Absence_Percentage'], data['Semester_Total'], alpha=0.7, label='Data')
plt.plot(x_values, y_linear, color='blue', label='Linear Regression')
plt.ylim(0, 100)
plt.axhline(y=60, color='gray', linestyle='--', label='Grade Threshold (60)')
plt.title('Relationship Between Absence Percentage and Final Grade')
plt.xlabel('Absence Percentage')
plt.ylabel('Final Grade (out of 100%)')
plt.legend()
plt.show()

## Higher Degree Polynomial Regression

If the higher-degree polynomial was chosen as the best, use its coefficients. If you haven't saved the coefficients, you might need to fit it again.

In [None]:
degree = 5  # Or whatever degree was determined to be best
coeffs_higher_degree = np.polyfit(data['Absence_Percentage'], data['Semester_Total'], degree)

# Generate x values (Absence Percentage) for prediction
x_values = np.linspace(data['Absence_Percentage'].min(), data['Absence_Percentage'].max(), 100)

# Predict y values using the higher-degree polynomial model
y_higher_degree = np.polyval(coeffs_higher_degree, x_values)
y_higher_degree[y_higher_degree < 0] = 0  # Ensuring non-negative predictions

# Re-create the plot with only the higher-degree polynomial regression
plt.figure(figsize=(12, 8))
plt.scatter(data['Absence_Percentage'], data['Semester_Total'], alpha=0.7, label='Data')
plt.plot(x_values, y_higher_degree, color='red', label=f'{degree} Degree Regression')
plt.ylim(0, 100)
# plt.xlim(0, 25)
plt.axhline(y=60, color='gray', linestyle='--', label='Grade Threshold (60)')
plt.title('Relationship Between Absence Percentage and Final Grade (Applied College - M)')
plt.xlabel('Absence Percentage')
plt.ylabel('Final Grade (out of 100%)')
plt.legend()
plt.show()


## No Correlation

If none of the regression lines appears to fit the data, you can plot the chart here without a line.

In [None]:
# Re-create the plot with only the data points (no regression line)
plt.figure(figsize=(12, 8))
plt.scatter(data['Absence_Percentage'], data['Semester_Total'], alpha=0.7, label='Data')
plt.ylim(0, 100)
plt.axhline(y=60, color='gray', linestyle='--', label='Grade Threshold (60)')
plt.title('Relationship Between Absence Percentage and Final Grade (Applied College - M)')
plt.xlabel('Absence Percentage')
plt.ylabel('Final Grade (out of 100%)')
plt.legend()
plt.show()
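
To put a number on how strong or weak the relationship is, the Pearson correlation coefficient can be computed with pandas `Series.corr`. Values near 0 suggest little linear relationship, while values near -1 or 1 suggest a strong one. A sketch on made-up, perfectly linear data:

```python
import pandas as pd

# Made-up sample: grades fall by 10 points for every 5 points of absence
sample = pd.DataFrame({
    'Absence_Percentage': [0.0, 5.0, 10.0, 15.0, 20.0],
    'Semester_Total': [95.0, 85.0, 75.0, 65.0, 55.0],
})

# Pearson correlation is the default method for Series.corr
r = sample['Absence_Percentage'].corr(sample['Semester_Total'])
print(f"Pearson r = {r:.3f}")
```

In the notebook, `sample` would be replaced by `data`.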

This Python notebook was created by S. Hatting for the University of Tabuk's English Language Institute in December 2023.