![image](https://storage.googleapis.com/kaggle-datasets-images/3732060/6462375/20652c32a17f81453fd9255f924cf367/dataset-cover.jpeg?t=2023-09-13-06-05-35)

[Image Source](https://storage.googleapis.com/kaggle-datasets-images/3732060/6462375/20652c32a17f81453fd9255f924cf367/dataset-cover.jpeg?t=2023-09-13-06-05-35)

# <div style="color:yellow;display:inline-block;border-radius:5px;background-color:#007BA7;font-family:Nexa;overflow:hidden"><p style="padding:15px;color:yellow;overflow:hidden;font-size:85%;letter-spacing:0.5px;margin:0"><b> </b> Introduction</p></div>


In this analysis, we explored a dataset collected from Iraqi secondary schools. It contains 55 features grouped into demographic, economic, educational, time-related, and marks-related attributes. Preprocessing consisted of checking for missing values and dropping incomplete rows.

We began by plotting the distributions of the attributes with Seaborn and Matplotlib. These plots cover demographics, social status, age, and the remaining categorical features, and give a first picture of the dataset's characteristics.

We then built and assessed regression models using five algorithms: Linear Regression, Random Forest, Support Vector Machine, k-Nearest Neighbors, and Gradient Boosting. Each model was evaluated by Mean Squared Error (MSE) on a held-out 20% test set.

The models were then compared by their test MSE. Linear Regression achieved the lowest error (0.2903), followed by Support Vector Machine (15.8544). Gradient Boosting (89.4943), Random Forest (93.0675), and k-Nearest Neighbors (95.3759) had considerably higher errors and leave the most room for optimization.

Overall, this analysis explores the Iraqi secondary school dataset and shows where each predictive model is strong and where it can be improved. The results are a starting point for further tuning and for a closer look at the patterns in the data.

# <div style="color:yellow;display:inline-block;border-radius:5px;background-color:#007BA7;font-family:Nexa;overflow:hidden"><p style="padding:15px;color:yellow;overflow:hidden;font-size:85%;letter-spacing:0.5px;margin:0"><b> </b> Data Preprocessing</p></div>


### First, let's import the necessary libraries and load the dataset:

# <div style="color:yellow;display:inline-block;border-radius:5px;background-color:#007BA7;font-family:Nexa;overflow:hidden"><p style="padding:15px;color:yellow;overflow:hidden;font-size:70%;letter-spacing:0.5px;margin:0"><b> </b> Import Libraries</p></div>

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


import warnings
warnings.filterwarnings('ignore')

import math

rc = {
    "axes.facecolor": "#E6FFE6",
    "figure.facecolor": "#E6FFE6",
    "axes.edgecolor": "#000000",
    "grid.color": "#EBEBE7",
    "font.family": "serif",
    "axes.labelcolor": "#000000",
    "xtick.color": "#000000",
    "ytick.color": "#000000",
    "grid.alpha": 0.4
}

sns.set(rc=rc)

from colorama import Style, Fore
red = Style.BRIGHT + Fore.RED
blu = Style.BRIGHT + Fore.BLUE
mgt = Style.BRIGHT + Fore.MAGENTA
gld = Style.BRIGHT + Fore.YELLOW
res = Style.RESET_ALL

<div style="border-radius:10px; border:#457B9D solid; padding: 15px; background-color:#E6FFE6; font-size:100%; text-align:left">

<html>
<head>
<style>
table {
  border-collapse: collapse;
  width: 100%;
  font-family: Arial, sans-serif;
}

th, td {
  border: 1px solid #dddddd;
  text-align: left;
  padding: 8px;
}

th {
  background-color: #f2f2f2;
}
</style>
</head>
<body>

<h2>Dataset Attribute Descriptions</h2>

<table>
  <tr>
    <th>Attribute Description </th>
    <th>Values</th>
  </tr>
  <tr>
    <td>Demographics - Gender Binary</td>
    <td>Female (0), Male (1)</td>
  </tr>
  <tr>
    <td>Social Status</td>
    <td>Single (0), Married (1), Apart (2)</td>
  </tr>
  <tr>
    <td>Age</td>
    <td>1: &lt;17 years<br>2: 17-19 years<br>3: 19-21 years<br>4: &gt;21 years</td>
  </tr>
  <tr>
    <td>Governorate Binary</td>
    <td>Baghdad (0), Other (1)</td>
  </tr>
  <tr>
    <td>Living Binary</td>
    <td>City (0), Rural (1)</td>
  </tr>
  <tr>
    <td>Mother Education</td>
    <td>0: Illiterate<br>1: Medium<br>2: Secondary<br>3: B.A.<br>4: Higher</td>
  </tr>
  <tr>
    <td>Father Education</td>
    <td>0: Illiterate<br>1: Medium<br>2: Secondary<br>3: B.A.<br>4: Higher</td>
  </tr>
  <tr>
    <td>Family Member Education Binary</td>
    <td>Yes (0), No (1)</td>
  </tr>
  <tr>
    <td>Father Alive Binary</td>
    <td>Yes (0), No (1)</td>
  </tr>
  <tr>
    <td>Mother Alive Binary</td>
    <td>Yes (0), No (1)</td>
  </tr>
  <tr>
    <td>Family Size Numeric</td>
    <td>0: &lt;4 members<br>1: 4-8 members<br>2: &gt;8 members</td>
  </tr>
  <tr>
    <td>Parent Apart Binary</td>
    <td>Yes (0), No (1)</td>
  </tr>
  <tr>
    <td>The Guardian</td>
    <td>Mother (0), Father (1), Null (2)</td>
  </tr>
  <tr>
    <td>Family Relationship</td>
    <td>0: Bad<br>1: Good<br>2: Vgood<br>3: Excellent</td>
  </tr>
  <tr>
    <td>Economic - Father Job</td>
    <td>0: No<br>1: Employee<br>2: Other</td>
  </tr>
  <tr>
    <td>Economic - Mother Job</td>
    <td>0: No<br>1: Employee<br>2: Other</td>
  </tr>
  <tr>
    <td>Economic - Education Fee Binary</td>
    <td>You (0), Family (1)</td>
  </tr>
  <tr>
    <td>Economic - Secondary Job Binary</td>
    <td>Free Job (0), No (1)</td>
  </tr>
  <tr>
    <td>Economic - Home Ownership Binary</td>
    <td>Own (0), Rent (1)</td>
  </tr>
  <tr>
    <td>Study Room Binary</td>
    <td>Yes (0), No (1)</td>
  </tr>
  <tr>
    <td>Family Economic Level</td>
    <td>0: Poor<br>1: Good<br>2: Vgood<br>3: Excellent</td>
  </tr>
  <tr>
    <td>You Chronic Disease Binary</td>
    <td>Yes (0), No (1)</td>
  </tr>
  <tr>
    <td>Family Chronic Disease Binary</td>
    <td>Yes (0), No (1)</td>
  </tr>
  <tr>
    <td>Education - Specialization Binary</td>
    <td>Applicable (0), Biologist (1)</td>
  </tr>
  <tr>
    <td>Education - Study Willing Binary</td>
    <td>Yes (0), No (1)</td>
  </tr>
  <tr>
    <td>Education - Reason of Study</td>
    <td>You (0), Average (1), Family (2)</td>
  </tr>
  <tr>
    <td>Attendance</td>
    <td>0: Poor<br>1: Good<br>2: Vgood</td>
  </tr>
  <tr>
    <td>Failure Year Binary</td>
    <td>Yes (0), No (1)</td>
  </tr>
  <tr>
    <td>Higher Education Willing Binary</td>
    <td>Yes (0), No (1)</td>
  </tr>
  <tr>
    <td>References Usage Binary</td>
    <td>Yes (0), No (1)</td>
  </tr>
  <tr>
    <td>Time - Internet Usage</td>
    <td>0: &lt;2 hours<br>1: 2-4 hours<br>2: &gt;4 hours</td>
  </tr>
  <tr>
    <td>Time - TV Usage</td>
    <td>0: &lt;2 hours<br>1: 2-4 hours<br>2: &gt;4 hours</td>
  </tr>
  <tr>
    <td>Time - Sleep Hour</td>
    <td>0: &lt;5 hours<br>1: 5-7 hours<br>2: 7-9 hours<br>3: &gt;9 hours</td>
  </tr>
  <tr>
    <td>Time - Study Hour</td>
    <td>0: &lt;2 hours<br>1: 2-4 hours<br>2: 4-6 hours<br>3: &gt;6 hours</td>
  </tr>
  <tr>
    <td>Time - Arrival Time</td>
    <td>0: &lt;hour<br>1: Other</td>
  </tr>
  <tr>
    <td>Time - Transport Binary</td>
    <td>Foot (0), Car (1)</td>
  </tr>
  <tr>
    <td>Time - Holiday Effect Binary</td>
    <td>Yes (0), No (1)</td>
  </tr>
  <tr>
    <td>Time - Worry Effect Binary</td>
    <td>Yes (0), No (1)</td>
  </tr>
  <tr>
    <td>Parent Meeting Binary</td>
    <td>Yes (0), No (1)</td>
  </tr>
  <tr>
    <td>Marks - Materials Degrees for First Semester Numeric</td>
    <td>0-100</td>
  </tr>
  <tr>
    <td>Marks - Avg1 Numeric</td>
    <td>0-100</td>
  </tr>
  <tr>
    <td>Marks - Materials Degrees for... [Incomplete]</td>
    <td>N/A</td>
  </tr>
</table>

</body>
</html>

</div>


# <div style="color:yellow;display:inline-block;border-radius:5px;background-color:#007BA7;font-family:Nexa;overflow:hidden"><p style="padding:15px;color:yellow;overflow:hidden;font-size:70%;letter-spacing:0.5px;margin:0"><b> </b> Load the Dataset</p></div>

In [None]:
# Read the Excel file
data = pd.read_excel('/kaggle/input/iraqi-student-performance-prediction/Iraqi Student Performance Prediction.xlsx')
data.head().style.set_properties(**{'background-color':'royalblue','color':'white','border-color':'#8b8c8c'})

In [None]:
data.info()

In [None]:
data.columns

### Next, we check for missing values and drop incomplete rows:

In [None]:
# Check for missing values in each column
missing_values = data.isnull().sum()
missing_values

In [None]:
# Handling missing values
data = data.dropna()
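
Before dropping rows, it can help to check how many would be lost. The following is a minimal sketch on a toy frame; the values are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; all values are made up.
toy = pd.DataFrame({
    "Age": [1, 2, np.nan, 3],
    "Study Hour": [0, 1, 2, np.nan],
    "Avg1.1": [70.0, 82.5, 64.0, 91.0],
})

rows_before = len(toy)
toy_clean = toy.dropna()  # drops every row that has at least one missing value
rows_after = len(toy_clean)

print(f"Rows before: {rows_before}, after: {rows_after}, "
      f"dropped: {rows_before - rows_after}")
```

If a large share of rows disappears, imputing the affected columns may be preferable to dropping them.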

# <div style="color:yellow;display:inline-block;border-radius:5px;background-color:#007BA7;font-family:Nexa;overflow:hidden"><p style="padding:15px;color:yellow;overflow:hidden;font-size:85%;letter-spacing:0.5px;margin:0"><b> </b> Exploratory Data Analysis (EDA)</p></div>


### Now, let's perform some basic EDA to understand the dataset:

In [None]:
# Plotting a distribution of age
plt.figure(figsize=(10, 6))
sns.histplot(data['Age'], kde=True, color='skyblue')
plt.title('Age Distribution', fontsize = 14, fontweight = 'bold', color = 'darkgreen')
plt.xlabel('Age', fontsize = 12, fontweight = 'bold', color = 'darkblue')
plt.ylabel('Frequency', fontsize = 12, fontweight = 'bold', color = 'darkblue')
plt.savefig('Age Distribution.png')
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(x='Attendance', hue='Specialization', data=data, palette='pastel')
plt.title('Attendance by Specialization', fontsize=18,fontweight = 'bold', color = 'darkgreen')
plt.xlabel('Attendance', fontsize=14, fontweight = 'bold', color = 'darkblue')
plt.ylabel('Count', fontsize=14, fontweight = 'bold', color = 'darkblue')
plt.legend(title='Specialization', labels=['Applicable', 'Biologist'])
ax = plt.gca()
#ax.set_facecolor('#F0F0F0')
plt.savefig('Attendance by Specialization.png')
plt.show()


In [None]:
# Selecting relevant numeric features
numeric_features = ['Islamea', 'arabic', 'english', 'math', 'physics', 'chemistry', 'economy/bio', 
                    'Avg1', 'Islamea.1', 'arabic.1', 'english.1', 'math.1', 'physics.1', 'chemistry.1', 'economy/bio.1', 'Avg1.1']

# Box plots for selected numeric features
plt.figure(figsize=(15, 8))
sns.boxplot(data=data[numeric_features], orient="h", palette="Set2")
plt.title("Box Plots of Numeric Features", fontsize = 14, fontweight = 'bold', color = 'darkgreen')
plt.savefig('Box Plots of Numeric Features.png')
plt.show()

In [None]:
# Calculate some summary statistics
age_stats = data['Age'].describe()

# Create a DataFrame for the results table
results_table = pd.DataFrame({
    'Statistic': ['Mean', 'Median', 'Min', 'Max', 'Std. Dev.'],
    'Value': [age_stats['mean'], age_stats['50%'], age_stats['min'], age_stats['max'], age_stats['std']]
})

print(results_table)


In [None]:
# Get summary statistics
summary_stats = data.describe()

# Print summary statistics
print(summary_stats)

# Visualize relationships (seaborn and matplotlib are already imported above)

# Pairplot for selected features
sns.pairplot(data[['Age', 'Internet Usage', 'TV Usage', 'Sleep Hour', 'Study Hour']])
plt.show()


In [None]:
# Set up subplots
fig, axes = plt.subplots(5, 4, figsize=(20, 20))

# Define attribute categories
categories = [
    'Sex', 'Social Status', 'Governorate', 'Living',
    'Mother education', 'Father education', 'Family member Education',
    'Father Alive', 'Mother Alive', 'Family Size', 'Parent Apart',
    'The Guardian', 'Family Relationship', 'Father Job', 'Mother Job',
    'Education Fee', 'Secondary Job', 'Home Ownership', 'Study Room',
]

# Define colors for each subplot
colors = ['pastel', 'pastel', 'Set3', 'Set2', 'pastel', 'pastel', 'pastel', 'Set3',
          'Set2', 'Set3', 'Set2', 'Set3', 'pastel', 'pastel', 'pastel', 'pastel',
          'Set3', 'Set2', 'Set3', 'Set2', 'pastel', 'pastel', 'pastel', 'pastel']

# Loop through categories and plot
for i, category in enumerate(categories):
    row = i // 4
    col = i % 4
    ax = axes[row, col]
    sns.countplot(x=category, data=data, palette=colors[i], ax=ax)
    ax.set_title(f'{category} Distribution', fontsize=12)
    ax.set_xlabel(category, fontsize=10)
    ax.set_ylabel('Count', fontsize=10)
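
# Hide the unused subplot: the 5x4 grid has 20 axes but only 19 categories
for ax in axes.ravel()[len(categories):]:
    ax.set_visible(False)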

# Adjust layout
plt.tight_layout()
plt.subplots_adjust(top=0.9)

# Set overall title
plt.suptitle('Attribute Distributions 1', fontsize=20)

# Show plot
plt.show()


In [None]:
# Set up subplots
fig, axes = plt.subplots(4, 4, figsize=(20, 20))

# Define attribute categories
categories = [
    'Family Economic Level', 'You  chronic disease',
    'Family Chronic Disease', 'Specialization', 'Study willing',
    'Reason of study', 'Attendance', 'Failure Year', 'Higher Education Willing', 'References Usage', 
    'Arrival Time', 'Transport','Holiday Effect', 'Worry Effect', 'Parent Meeting'
]

# Define colors for each subplot
colors = ['pastel', 'pastel', 'Set3', 'Set2', 'pastel', 'pastel', 'pastel', 'Set3',
          'Set2', 'Set3', 'Set2', 'Set3', 'pastel', 'pastel', 'pastel', 'pastel']

# Loop through categories and plot
for i, category in enumerate(categories):
    row = i // 4
    col = i % 4
    ax = axes[row, col]
    sns.countplot(x=category, data=data, palette=colors[i], ax=ax)
    ax.set_title(f'{category} Distribution', fontsize=12)
    ax.set_xlabel(category, fontsize=10)
    ax.set_ylabel('Count', fontsize=10)
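
# Hide the unused subplot: the 4x4 grid has 16 axes but only 15 categories
for ax in axes.ravel()[len(categories):]:
    ax.set_visible(False)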

# Adjust layout
plt.tight_layout()
plt.subplots_adjust(top=0.9)

# Set overall title
plt.suptitle('Attribute Distributions 2', fontsize=20)

# Show plot
plt.show()

In [None]:
# Define attribute categories and their corresponding counts
categories = [
    'Sex', 'Social Status', 'Governorate', 'Living',
    'Mother education', 'Father education', 'Family member Education',
    'Father Alive', 'Mother Alive', 'Family Size', 'Parent Apart',
    'The Guardian', 'Family Relationship', 'Father Job', 'Mother Job',
    'Education Fee', 'Secondary Job', 'Home Ownership', 'Study Room',
    'Family Economic Level', 'You  chronic disease','Family Chronic Disease', 
    'Specialization', 'Study willing','Reason of study', 'Attendance', 'Failure Year', 
    'Higher Education Willing', 'References Usage', 'Arrival Time', 
    'Transport','Holiday Effect', 'Worry Effect', 'Parent Meeting'
]

counts = [data[category].value_counts() for category in categories]

# Create a DataFrame for the table
table_data = pd.DataFrame(counts)
table_data.index = categories
table_data = table_data.transpose()

# Define a palette of colors
colors = ['#FF9999', '#66B3FF', '#99FF99', '#FFCC99', '#c2c2f0', '#ffb3e6', '#c2f0c2', '#ff6666']

# Create a dictionary to specify background colors
background_colors = {col: colors[i % len(colors)] for i, col in enumerate(table_data.columns)}

# Apply the background colors to the DataFrame
styled_table_data = table_data.style.apply(lambda col: [f'background-color: {background_colors[col.name]}' for _ in col], axis=0)

# Display the styled table
styled_table_data

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(x='Specialization', hue='Sex', data=data, palette='pastel')
plt.title('Specialization by Gender', fontsize=18, fontweight = 'bold', color = 'darkgreen')
plt.xlabel('Specialization', fontsize=14, fontweight = 'bold', color = 'darkblue')
plt.ylabel('Count', fontsize=14, fontweight = 'bold', color = 'darkblue')
plt.legend(title='Gender', labels=['Female', 'Male'])
ax = plt.gca()
#ax.set_facecolor('#F0F0F0')
plt.savefig('Specialization by Gender.png')
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='Specialization', y='Avg1.1', data=data, palette='Set2')
plt.title('Average Grades by Specialization', fontsize=18, fontweight = 'bold', color = 'darkgreen')
plt.xlabel('Specialization', fontsize=14, fontweight = 'bold', color = 'darkblue')
plt.ylabel('Average Grade', fontsize=14, fontweight = 'bold', color = 'darkblue')
ax = plt.gca()
#ax.set_facecolor('#F0F0F0')
plt.savefig('Average Grades by Specialization.png')
plt.show()

In [None]:
numeric_data = data.select_dtypes(include=['number'])
correlation_matrix = numeric_data.corr()

In [None]:
plt.figure(figsize=(20, 9))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", vmin=-1, vmax=1)
plt.title('Correlation Matrix', fontsize=18, fontweight = 'bold', color = 'darkgreen')
plt.savefig('Correlation Matrix.png')
plt.show()
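
A full heatmap of many columns can be hard to read. One alternative is to rank features by their absolute correlation with the target. The sketch below uses synthetic data; the column names mirror the dataset, but the values and the relationship between them are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic marks: physics is loosely tied to math, "noise" is unrelated.
math_mark = rng.uniform(40, 100, n)
physics_mark = 0.8 * math_mark + rng.normal(0, 5, n)
toy = pd.DataFrame({
    "math.1": math_mark,
    "physics.1": physics_mark,
    "noise": rng.normal(0, 1, n),
})
# Assumption for this sketch: the target is the mean of the two marks.
toy["Avg1.1"] = toy[["math.1", "physics.1"]].mean(axis=1)

# Absolute correlation of every feature with the target, strongest first.
corr_with_target = (
    toy.corr()["Avg1.1"]
    .drop("Avg1.1")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```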

# <div style="color:yellow;display:inline-block;border-radius:5px;background-color:#007BA7;font-family:Nexa;overflow:hidden"><p style="padding:15px;color:yellow;overflow:hidden;font-size:85%;letter-spacing:0.5px;margin:0"><b> </b> Build Model and Prediction</p></div>

<html>
<body>
    <span style="color: blue; font-weight: bold; font-size: 20px;">To build and evaluate models for this dataset, we'll follow these steps:</span>
    <ol>
        <li style="color: purple; font-size: 16px;">Data Preprocessing: Split the data into features (X) and target (y), and then further split it into training and testing sets.</li>
        <li style="color: purple; font-size: 16px;">Model Training: We'll train each of the five models: Linear Regression, Random Forest, Support Vector Machine, k-Nearest Neighbors, and Gradient Boosting.</li>
        <li style="color: purple; font-size: 16px;">Model Evaluation: We'll evaluate each model by its Mean Squared Error (MSE) on the test set.</li>
        <li style="color: purple; font-size: 16px;">Make Predictions: We'll use the trained models to make predictions.</li>
    </ol>
</body>
</html>
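
The steps above can be sketched compactly as a loop over the five models. This is a minimal illustration on synthetic data from `make_regression`, not the notebook's actual pipeline; names such as `X_demo` and `mse_by_model` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Step 1: synthetic features/target, split into train and test sets.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Step 2: the five model families used in this notebook.
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "SVM (linear kernel)": SVR(kernel="linear"),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

# Steps 3 and 4: fit, predict on the test set, and record the MSE.
mse_by_model = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    mse_by_model[name] = mean_squared_error(y_te, model.predict(X_te))

# Print models from lowest to highest test MSE.
for name, mse in sorted(mse_by_model.items(), key=lambda kv: kv[1]):
    print(f"{name}: {mse:.2f}")
```

Keeping the models in one dictionary makes it easy to compare them on exactly the same split.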

<html>
<body>
    <span style="color: green; font-weight: bold; font-size: 20px;">Linear Regression</span>
</body>
</html>


In [None]:
# Read the Excel file
data = pd.read_excel('/kaggle/input/iraqi-student-performance-prediction/Iraqi Student Performance Prediction.xlsx')

In [None]:
# Handling missing values
data = data.dropna()

In [None]:
# List of categorical features
categorical_features = ['Sex', 'Social Status', 'Governorate', 'Living',
                        'Mother education', 'Father education', 'Family member Education',
                        'Father Alive', 'Mother Alive', 'Parent Apart',
                        'The Guardian', 'Family Relationship', 'Father Job', 'Mother Job',
                        'Education Fee', 'Secondary Job', 'Home Ownership', 'Study Room',
                        'Family Economic Level', 'You  chronic disease',
                        'Family Chronic Disease', 'Specialization', 'Study willing',
                        'Reason of study', 'Attendance', 'Failure Year',
                        'Higher Education Willing', 'References Usage', 'Internet Usage',
                        'TV Usage', 'Sleep Hour', 'Study Hour', 'Arrival Time', 'Transport',
                        'Holiday Effect', 'Worry Effect', 'Parent Meeting']

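# Impute any remaining missing values with the most common category.
# Note: after dropna() above no missing values remain, so this loop is only a safeguard.
for feature in categorical_features:
    most_common_category = data[feature].mode()[0]
    # Assign back instead of using inplace=True on a column selection,
    # which may not modify the DataFrame in recent pandas versions.
    data[feature] = data[feature].fillna(most_common_category)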

# Check if there are any remaining missing values
missing_values = data.isnull().sum()
print(missing_values[missing_values > 0])
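
To see what mode imputation does when values really are missing, here is a small sketch on a toy frame. The column names are borrowed from the dataset, but the values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; values are made up for illustration.
toy = pd.DataFrame({
    "Transport": [0, 1, np.nan, 1, np.nan],
    "Attendance": [2, 2, 1, np.nan, 2],
})

for col in toy.columns:
    most_common = toy[col].mode()[0]         # most frequent non-missing value
    toy[col] = toy[col].fillna(most_common)  # assign back to the column

print(toy)
print("Remaining missing values:", int(toy.isnull().sum().sum()))
```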


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [None]:
# Encoding categorical variables
data_encoded = pd.get_dummies(data, columns=categorical_features, drop_first=True)

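# Separating features (X) and target (y)
# Note: X keeps every other column, including 'Avg1' and the per-subject marks
# (e.g. 'math.1'); if 'Avg1.1' is derived from those marks, this leaks the target.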
X = data_encoded.drop(columns=['Avg1.1'])  # Removing the target column
y = data_encoded['Avg1.1']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: Model Training
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Step 3: Model Evaluation
y_pred_lr = lr_model.predict(X_test)

# Evaluate the model (e.g., using mean squared error)
mse = mean_squared_error(y_test, y_pred_lr)
print(f"Mean Squared Error: {mse}")
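
MSE is expressed in squared units of the target, so it can be easier to read alongside RMSE (same units as the marks) and R². A small worked sketch with made-up values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted averages for four students.
y_true = np.array([70.0, 80.0, 90.0, 60.0])
y_hat = np.array([72.0, 78.0, 91.0, 63.0])

# Squared errors are 4, 4, 1 and 9, so the MSE is 18 / 4 = 4.5.
mse_demo = mean_squared_error(y_true, y_hat)
rmse_demo = np.sqrt(mse_demo)       # back in the units of the target
r2_demo = r2_score(y_true, y_hat)   # 1 - 18 / 500 = 0.964

print(f"MSE: {mse_demo:.3f}, RMSE: {rmse_demo:.3f}, R^2: {r2_demo:.3f}")
```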

### Next, let's move on to Random Forest:

<html>
<body>
    <span style="color: green; font-weight: bold; font-size: 20px;">Random Forest</span>
</body>
</html>




In [None]:
from sklearn.ensemble import RandomForestRegressor

# Step 2: Model Training
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Step 3: Model Evaluation
y_pred_rf = rf_model.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
print(f'Random Forest MSE: {mse_rf}')

# Step 4: Make Predictions
# For example, to predict the first 5 samples in the test set
predictions_rf = rf_model.predict(X_test[:5])
print(f'Predictions (Random Forest): {predictions_rf}')


### Now, let's proceed with Support Vector Machine (SVM):

<html>
<body>
    <span style="color: green; font-weight: bold; font-size: 20px;">Support Vector Machine (SVM)</span>
</body>
</html>



In [None]:
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

# Step 1: Data Preprocessing (Scaling for SVM)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 2: Model Training
svm_model = SVR(kernel='linear')
svm_model.fit(X_train_scaled, y_train)

# Step 3: Model Evaluation
y_pred_svm = svm_model.predict(X_test_scaled)
mse_svm = mean_squared_error(y_test, y_pred_svm)
print(f'SVM MSE: {mse_svm}')

# Step 4: Make Predictions
# For example, to predict the first 5 samples in the test set
predictions_svm = svm_model.predict(X_test_scaled[:5])
print(f'Predictions (SVM): {predictions_svm}')

### Next, let's move on to k-Nearest Neighbors (KNN):

<html>
<body>
    <span style="color: green; font-weight: bold; font-size: 20px;">k-Nearest Neighbors (KNN)</span>
</body>
</html>



In [None]:
from sklearn.neighbors import KNeighborsRegressor

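# Step 2: Model Training
# Note: KNN is distance-based but is fit on the unscaled features here;
# X_train_scaled / X_test_scaled from the SVM cell could be used instead.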
knn_model = KNeighborsRegressor(n_neighbors=5)
knn_model.fit(X_train, y_train)

# Step 3: Model Evaluation
y_pred_knn = knn_model.predict(X_test)
mse_knn = mean_squared_error(y_test, y_pred_knn)
print(f'KNN MSE: {mse_knn}')

# Step 4: Make Predictions
# For example, to predict the first 5 samples in the test set
predictions_knn = knn_model.predict(X_test[:5])
print(f'Predictions (KNN): {predictions_knn}')
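
Because KNN relies on distances, features on very different scales can dominate the neighbor search. The sketch below illustrates this with synthetic data and a scikit-learn pipeline; the feature construction is an assumption chosen to make the effect visible, not a property of this dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300
informative = rng.normal(0, 1, n)     # small-scale feature that drives the target
large_scale = rng.normal(0, 1000, n)  # large-scale feature unrelated to the target
X_demo = np.column_stack([informative, large_scale])
y_demo = 10 * informative + rng.normal(0, 0.5, n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# KNN on raw features: distances are dominated by the large-scale column.
knn_raw = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr)

# KNN inside a pipeline: the scaler is fit on the training data only.
knn_scaled = make_pipeline(
    StandardScaler(), KNeighborsRegressor(n_neighbors=5)
).fit(X_tr, y_tr)

mse_raw = mean_squared_error(y_te, knn_raw.predict(X_te))
mse_scaled = mean_squared_error(y_te, knn_scaled.predict(X_te))
print(f"KNN MSE without scaling: {mse_raw:.2f}")
print(f"KNN MSE with scaling:    {mse_scaled:.2f}")
```

Wrapping the scaler and the model in one pipeline also avoids fitting the scaler on test data by accident.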


### Finally, let's proceed with Gradient Boosting:

<html>
<body>
    <span style="color: green; font-weight: bold; font-size: 20px;">Gradient Boosting</span>
</body>
</html>




In [None]:
from sklearn.ensemble import GradientBoostingRegressor

# Step 2: Model Training
gb_model = GradientBoostingRegressor(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)

# Step 3: Model Evaluation
y_pred_gb = gb_model.predict(X_test)
mse_gb = mean_squared_error(y_test, y_pred_gb)
print(f'Gradient Boosting MSE: {mse_gb}')

# Step 4: Make Predictions
# For example, to predict the first 5 samples in the test set
predictions_gb = gb_model.predict(X_test[:5])
print(f'Predictions (Gradient Boosting): {predictions_gb}')


<span style="color: blue; font-weight: bold; font-size: 20px;">Conclusion:</span>
<ol>
    <li style="color: green; font-size: 16px;">Linear Regression:
        <ul>
            <li>Mean Squared Error (MSE): 0.2903</li>
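            <li>Linear Regression serves as the baseline for predicting the target variable 'Avg1.1', and its MSE is by far the lowest of the five models. Because the feature matrix still contains 'Avg1' and the per-subject marks, part of this fit may come from the target being close to a linear function of those columns, so the result should be interpreted with caution.</li>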
        </ul>
    </li>
    <li style="color: green; font-size: 16px;">Support Vector Machine (SVM):
        <ul>
            <li>Mean Squared Error (MSE): 15.8544</li>
            <li>The SVM model with a linear kernel has the second-lowest MSE. Tuning hyperparameters such as C and epsilon, or trying other kernels, might lead to improvements.</li>
        </ul>
    </li>
    <li style="color: green; font-size: 16px;">Gradient Boosting:
        <ul>
            <li>Mean Squared Error (MSE): 89.4943</li>
            <li>Gradient Boosting performs similarly to Random Forest (89.4943 vs. 93.0675). It combines many weak learners into an ensemble, but on this split its MSE is well above that of Linear Regression and SVM.</li>
        </ul>
    </li>
    <li style="color: green; font-size: 16px;">Random Forest:
        <ul>
            <li>Mean Squared Error (MSE): 93.0675</li>
            <li>Random Forest averages the predictions of many decision trees. On this split it does not outperform Linear Regression: its MSE is far higher.</li>
        </ul>
    </li>
    <li style="color: green; font-size: 16px;">k-Nearest Neighbors (KNN):
        <ul>
            <li>Mean Squared Error (MSE): 95.3759</li>
            <li>KNN has the highest MSE of the five models. It was trained on unscaled features with 5 neighbors; scaling the features and tuning the number of neighbors may improve results.</li>
        </ul>
    </li>
</ol>
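
One possible explanation for Linear Regression's very low MSE, offered as an assumption rather than something verified on this dataset: if the target is the average of columns that remain in the feature matrix, a linear model can reproduce it almost exactly, while tree ensembles can only approximate it. A synthetic sketch of that effect:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Seven synthetic subject marks per student (values are invented).
marks = rng.uniform(40, 100, size=(300, 7))
# Assumption for this sketch: the target is exactly the mean of the marks.
average = marks.mean(axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(
    marks, average, test_size=0.2, random_state=0
)

linear = LinearRegression().fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

mse_linear = mean_squared_error(y_te, linear.predict(X_te))
mse_forest = mean_squared_error(y_te, forest.predict(X_te))
print(f"Linear Regression MSE: {mse_linear:.6f}")
print(f"Random Forest MSE:     {mse_forest:.6f}")
```

If this is what happens in the real data, dropping the marks columns that make up the target from X would give a fairer comparison between models.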

<span style="color: blue; font-weight: bold; font-size: 20px;">Overall Summary:</span>
<p>Among the models tested, Linear Regression had the lowest MSE (0.2903), followed by Support Vector Machine (15.8544). Gradient Boosting (89.4943), Random Forest (93.0675), and k-Nearest Neighbors (95.3759) had substantially higher errors on this test split.</p>

<span style="color: blue; font-weight: bold; font-size: 20px;">Recommendation:</span>
<p>Based on the evaluation, we recommend further exploration and fine-tuning of Linear Regression, Support Vector Machine, and Gradient Boosting for this dataset. Feature engineering or selection, hyperparameter tuning, and dataset size can all influence model performance, so continued refinement and testing may lead to better results.</p>
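
As a starting point for the tuning mentioned above, the sketch below runs a cross-validated grid search over the number of neighbors for KNN. It uses synthetic data, and the grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the real features and target.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=5.0, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])

# Candidate values for n_neighbors (arbitrary for this sketch).
param_grid = {"knn__n_neighbors": [3, 5, 7, 9]}

search = GridSearchCV(
    pipe, param_grid, scoring="neg_mean_squared_error", cv=5
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["knn__n_neighbors"]
best_cv_mse = -search.best_score_  # the scorer returns negative MSE
print(f"Best n_neighbors: {best_k}, cross-validated MSE: {best_cv_mse:.2f}")
```

The same pattern applies to the other models by swapping the estimator and the parameter grid.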


<div class="alert alert-block alert-info"> 📌 "Take some time to explore and create a notebook based on your insights. Your contributions offer valuable perspectives. If you find the dataset interesting, an upvote would be greatly appreciated. Your support encourages collaboration and knowledge sharing. Thank you!"😊 </div>