**SECJ 5043 - Advanced Artificial Intelligence**

Assignment 2 - Classification

Name: Muhammad Hazman Hanif Bin Roslan

Matric Number: A21MJ5039


# **Introduction**

In education, understanding how students engage with a course can help optimize learning experiences. This project applies machine learning classifiers to predict a student's performance class (the 'Class' column, with levels L, M, and H) from four engagement indicators: the number of times the student raised their hand, visited educational resources, viewed announcements, and participated in discussions.


# **Import Libraries**

In [18]:
import pandas as pd
import sklearn.linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# **Explanation**
* pandas: Library for data manipulation and analysis.
* sklearn: Scikit-learn, a machine learning library in Python.
* train_test_split: Function to split the dataset into training and testing sets.
* StandardScaler: Class to standardize features by removing the mean and scaling to unit variance.
* accuracy_score, classification_report: Functions to evaluate classification performance.
* LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, SVC, KNeighborsClassifier, GaussianNB: Different classifiers provided by scikit-learn.

# **Read Data**

In [19]:
df = pd.read_csv("/content/xAPI-Edu-Data.csv")
df

Unnamed: 0,gender,NationalITy,PlaceofBirth,StageID,GradeID,SectionID,Topic,Semester,Relation,raisedhands,VisITedResources,AnnouncementsView,Discussion,ParentAnsweringSurvey,ParentschoolSatisfaction,StudentAbsenceDays,Class
0,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,15,16,2,20,Yes,Good,Under-7,M
1,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,20,20,3,25,Yes,Good,Under-7,M
2,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,10,7,0,30,No,Bad,Above-7,L
3,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,30,25,5,35,No,Bad,Above-7,L
4,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,40,50,12,50,No,Bad,Above-7,M
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
475,F,Jordan,Jordan,MiddleSchool,G-08,A,Chemistry,S,Father,5,4,5,8,No,Bad,Above-7,L
476,F,Jordan,Jordan,MiddleSchool,G-08,A,Geology,F,Father,50,77,14,28,No,Bad,Under-7,M
477,F,Jordan,Jordan,MiddleSchool,G-08,A,Geology,S,Father,55,74,25,29,No,Bad,Under-7,M
478,F,Jordan,Jordan,MiddleSchool,G-08,A,History,F,Father,30,17,14,57,No,Bad,Above-7,L


# **Explanation**
This line reads the dataset from the CSV file named "xAPI-Edu-Data.csv" and stores it in a pandas DataFrame called df.



# **Display Data**

In [20]:
# Display the first few rows of the dataset
print(df.head())
# Check for missing values
print(df.isnull().sum())
# Get statistical summary
print(df.describe())

  gender NationalITy PlaceofBirth     StageID GradeID SectionID Topic  \
0      M          KW       KuwaIT  lowerlevel    G-04         A    IT   
1      M          KW       KuwaIT  lowerlevel    G-04         A    IT   
2      M          KW       KuwaIT  lowerlevel    G-04         A    IT   
3      M          KW       KuwaIT  lowerlevel    G-04         A    IT   
4      M          KW       KuwaIT  lowerlevel    G-04         A    IT   

  Semester Relation  raisedhands  VisITedResources  AnnouncementsView  \
0        F   Father           15                16                  2   
1        F   Father           20                20                  3   
2        F   Father           10                 7                  0   
3        F   Father           30                25                  5   
4        F   Father           40                50                 12   

   Discussion ParentAnsweringSurvey ParentschoolSatisfaction  \
0          20                   Yes                     Go

* This code prints the first few rows of the dataset. The head() method in pandas is used to display the top rows of the DataFrame. It's useful for quickly inspecting the structure and content of the dataset.
* Here, the isnull() method is used to identify missing values in the dataset. sum() is then applied to count the number of missing values for each column. If there are no missing values, the output will show zeros for all columns. This step is essential for identifying and handling missing data before training machine learning models.
* The describe() method provides a statistical summary of the dataset. It includes count, mean, standard deviation, minimum, 25th percentile (Q1), median (50th percentile or Q2), 75th percentile (Q3), and maximum for numeric columns. This summary is helpful in understanding the distribution and central tendency of the numerical features in the dataset.
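The three inspection calls described above can be tried on a small toy DataFrame. This is a minimal sketch: the values below are made up for illustration and are not taken from xAPI-Edu-Data.csv, and the names `toy`, `missing_per_column`, and `summary` are hypothetical.

```python
import pandas as pd

# Toy data for illustration only; one missing value is included on purpose.
toy = pd.DataFrame({
    "raisedhands": [15, 20, None, 30],
    "Discussion": [20, 25, 30, 35],
    "Class": ["M", "M", "L", "L"],
})

# Count missing values per column.
missing_per_column = toy.isnull().sum()

# Statistical summary; by default only numeric columns are included,
# so the text column 'Class' does not appear in the result.
summary = toy.describe()

print(toy.head(2))          # first two rows
print(missing_per_column)
print(summary)
```

Note that `describe()` skips the non-numeric 'Class' column unless `include="all"` is passed, which is why the summary in the notebook covers only the numeric features.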

# **Select Features and Target Variable**

In [57]:
# Select only the specified columns
selected_columns = ['raisedhands', 'VisITedResources', 'AnnouncementsView', 'Discussion']
X = df[selected_columns]
y = df['Class']  # Assuming 'Class' is the target variable
print(X)
print(y)

     raisedhands  VisITedResources  AnnouncementsView  Discussion
0             15                16                  2          20
1             20                20                  3          25
2             10                 7                  0          30
3             30                25                  5          35
4             40                50                 12          50
..           ...               ...                ...         ...
475            5                 4                  5           8
476           50                77                 14          28
477           55                74                 25          29
478           30                17                 14          57
479           35                14                 23          62

[480 rows x 4 columns]
0      M
1      M
2      L
3      L
4      M
      ..
475    L
476    M
477    M
478    L
479    L
Name: Class, Length: 480, dtype: object


# **Explanation**
selected_columns = ['raisedhands', 'VisITedResources', 'AnnouncementsView', 'Discussion']

* This is a list containing the names of the columns you want to select from your original DataFrame (df).
* The columns specified are 'raisedhands', 'VisITedResources', 'AnnouncementsView', and 'Discussion'.
* These columns were selected because they are numeric measures of student engagement, so they can be passed to the classifiers without any encoding step.

X = df[selected_columns]:

* X now contains the features (independent variables) you want to use for prediction.

y = df['Class']:

* This line creates a Series y by selecting the 'Class' column from the original DataFrame (df).
* y now contains the target variable (dependent variable) you want to predict.
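Before training, it can help to check how the target classes are distributed, since an uneven distribution affects how accuracy should be read. The sketch below uses a short hypothetical label Series (`y_toy`), not the real 'Class' column; on the notebook's data the same calls would be applied to `y`.

```python
import pandas as pd

# Hypothetical labels for illustration; not the real 'Class' column.
y_toy = pd.Series(["M", "M", "L", "H", "M", "L"], name="Class")

counts = y_toy.value_counts()                 # absolute count per class
shares = y_toy.value_counts(normalize=True)   # fraction of rows per class

print(counts)
print(shares.round(2))
```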

# **Train and Test Data**

In [58]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

train_test_split:

* This function is part of the sklearn.model_selection module. Its purpose is to split the dataset into two subsets: one for training the model (X_train and y_train) and the other for evaluating the model's performance (X_test and y_test).

* X and y:
These are the features and target variable obtained from the dataset as explained in the previous code snippet.

* test_size=0.2:
This parameter determines the proportion of the dataset that will be used for testing. In this case, test_size=0.2 means that 20% of the data will be used for testing, and the remaining 80% will be used for training.

* random_state=42:
The random_state parameter is optional but crucial for reproducibility. Setting a random seed (here, 42) ensures that the data split is the same every time the code is run. This is useful for debugging and obtaining consistent results.

Output:

* X_train and y_train are the features and target variable for training the model.
* X_test and y_test are the features and target variable for evaluating the model.
* X_train_scaled and X_test_scaled are the standardized features. The scaler is fitted on the training set only and then applied to the test set, so no test-set statistics influence training.
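The split-then-scale pattern can be checked on synthetic data. This is a sketch under stated assumptions: `X_demo` and `y_demo` are randomly generated stand-ins (100 rows, 4 features, 3 classes), not the notebook's data, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, generated only for this illustration.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 100, size=(100, 4)).astype(float)
y_demo = np.array(["L", "M", "H", "M"] * 25)

# Same split settings as the notebook: 80% train, 20% test, fixed seed.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training set only, then reuse it on the test set.
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr.shape, X_te.shape)
print(X_tr_scaled.mean(axis=0).round(6))  # per-feature mean of the training set
print(X_tr_scaled.std(axis=0).round(6))   # per-feature std of the training set
print(X_te_scaled.mean(axis=0).round(3))  # test set is close to, not exactly, zero
```

The scaled training features have zero mean and unit variance by construction. The test features generally do not, because they are transformed with the training set's mean and standard deviation.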

# **Initialize Classifiers**

In [59]:
# Initialize classifiers
classifiers = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'SVM': SVC(),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB()
}

**Logistic Regression**:

* Purpose: Logistic Regression is a linear classification algorithm. It is most often introduced for binary targets (two classes), but scikit-learn's LogisticRegression also handles multiclass targets such as the three-level 'Class' variable (L, M, H) used here.
* Explanation: It models the probability that a given instance belongs to each class. It is a simple yet effective baseline for classification problems.
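To illustrate that LogisticRegression produces one probability per class on a three-class target, here is a minimal sketch on synthetic data. The one-feature clusters and the names `X_demo`, `y_demo`, and `clf_demo` are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic three-class data: one feature, three well-separated clusters.
rng = np.random.default_rng(0)
X_demo = np.concatenate([
    rng.normal(-3, 1, size=(30, 1)),
    rng.normal(0, 1, size=(30, 1)),
    rng.normal(3, 1, size=(30, 1)),
])
y_demo = np.array(["L"] * 30 + ["M"] * 30 + ["H"] * 30)

clf_demo = LogisticRegression()
clf_demo.fit(X_demo, y_demo)

# predict_proba returns one column per class, ordered as in clf_demo.classes_.
proba = clf_demo.predict_proba([[-3.0], [0.0], [3.0]])
print(clf_demo.classes_)
print(proba.round(2))
```

Each row of `proba` sums to 1, and the predicted class is the one with the highest probability in that row.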

**Decision Tree**:

* Purpose: Decision Trees are versatile algorithms used for both classification and regression tasks.
* Explanation: Decision Trees make decisions based on asking a series of questions, leading to a conclusion about the target variable. They are interpretable and can capture complex relationships in the data.

**Random Forest**:

* Purpose: Random Forest is an ensemble learning method that builds a multitude of decision trees and merges them together to get a more accurate and stable prediction.
* Explanation: It reduces overfitting and increases accuracy compared to a single decision tree by aggregating the predictions of multiple trees.

**Support Vector Machine (SVM)**:

* Purpose: SVM is a powerful algorithm for classification tasks, especially in high-dimensional spaces.
* Explanation: SVM finds the hyperplane that separates the classes with the largest margin. When the data is not linearly separable, a kernel function (scikit-learn's SVC uses the RBF kernel by default) implicitly maps the features into a space where a separating hyperplane can be found.

**K-Nearest Neighbors (KNN)**:

* Purpose: KNN is a simple and intuitive algorithm used for both classification and regression tasks.
* Explanation: KNN classifies an instance based on the majority class of its k-nearest neighbors. It is a lazy learner as it does not build a model during training but rather memorizes the training dataset.
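The majority-vote rule can be written out by hand in a few lines. This is an illustrative sketch, not how KNeighborsClassifier is implemented internally; the function `knn_predict` and the toy points are hypothetical.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most common one.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy training set: three 'L' points near the origin, two 'H' points near (5, 5).
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y_toy = np.array(["L", "L", "L", "H", "H"])

print(knn_predict(X_toy, y_toy, np.array([0.05, 0.05])))
print(knn_predict(X_toy, y_toy, np.array([4.8, 5.2])))
```

Because the vote depends on distances, KNN is sensitive to feature scale, which is one reason the notebook standardizes the features before training.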

**Naive Bayes**:

* Purpose: Naive Bayes is a probabilistic algorithm used for classification tasks.
* Explanation: It is based on Bayes' theorem and assumes that features are conditionally independent given the class. Despite its simplicity, Naive Bayes often performs well and is computationally efficient.

# **Train and Result**

In [60]:
# Train and evaluate each classifier
for name, clf in classifiers.items():
    print(f"\nTraining {name}...")
    clf.fit(X_train_scaled, y_train)

    print(f"Evaluating {name}...")
    y_pred = clf.predict(X_test_scaled)

    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)

    print(f"\nClassifier: {name}")
    print(f"Accuracy: {accuracy:.2f}")
    print("Classification Report:")
    print(report)



Training Logistic Regression...
Evaluating Logistic Regression...

Classifier: Logistic Regression
Accuracy: 0.61
Classification Report:
              precision    recall  f1-score   support

           H       0.46      0.55      0.50        22
           L       0.72      0.81      0.76        26
           M       0.63      0.54      0.58        48

    accuracy                           0.61        96
   macro avg       0.61      0.63      0.62        96
weighted avg       0.62      0.61      0.61        96


Training Decision Tree...
Evaluating Decision Tree...

Classifier: Decision Tree
Accuracy: 0.58
Classification Report:
              precision    recall  f1-score   support

           H       0.50      0.55      0.52        22
           L       0.67      0.62      0.64        26
           M       0.58      0.58      0.58        48

    accuracy                           0.58        96
   macro avg       0.58      0.58      0.58        96
weighted avg       0.59      0.58  

# **Explanation**
Logistic Regression:

* Accuracy: 61%

* Logistic Regression provides a decent accuracy but may not capture the complexity of the data well.


---


Decision Tree:

* Accuracy: 58%
* The Decision Tree has a slightly lower accuracy, which may point to overfitting. Test accuracy alone cannot confirm this; comparing training accuracy with test accuracy would.
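One way to check for overfitting is to compare accuracy on the training set with accuracy on the test set. The sketch below does this on synthetic, deliberately noisy data; the data and the names `X_demo`, `y_demo`, and `tree` are assumptions, and the numbers it prints say nothing about the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends on the first feature plus random noise,
# so no model can classify it perfectly on unseen rows.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# A tree with no depth limit can memorize the training rows.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print(f"Train accuracy: {train_acc:.2f}")
print(f"Test accuracy:  {test_acc:.2f}")
```

A large gap between the two numbers suggests overfitting. Limiting `max_depth` or `min_samples_leaf` is a common way to narrow it.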


---


Random Forest:

* Accuracy: 69%

* Random Forest outperforms Logistic Regression and Decision Tree, demonstrating the power of ensemble methods.


---


Support Vector Machine (SVM):

* Accuracy: 65%

* SVM performs reasonably well, but the accuracy is not as high as Random Forest.


---


K-Nearest Neighbors (KNN):

* Accuracy: 64%

* KNN provides competitive results, falling between SVM and Logistic Regression in terms of accuracy.


---


Naive Bayes:

* Accuracy: 64%

* Naive Bayes offers similar performance to KNN, but with a different trade-off in precision and recall.
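The precision and recall figures in each classification report are computed per class. A small worked example makes the difference concrete; the label lists below are invented for illustration and are not the notebook's predictions.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical true and predicted labels, for illustration only.
y_true = ["H", "H", "L", "L", "M", "M", "M", "M"]
y_pred = ["H", "M", "L", "L", "M", "M", "H", "L"]

labels = ["H", "L", "M"]

# average=None returns one score per class, in the order given by labels.
precision = precision_score(y_true, y_pred, labels=labels, average=None)
recall = recall_score(y_true, y_pred, labels=labels, average=None)

for label, p, r in zip(labels, precision, recall):
    print(f"{label}: precision={p:.2f}, recall={r:.2f}")
```

For class L in this toy case, every true L is found (recall 1.0), but one of the three L predictions is wrong (precision 2/3). Two models with the same accuracy can differ in exactly this way.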

# **Conclusion**

In this project, we aimed to classify students' academic performance levels using different machine learning classification algorithms. The selected features for prediction were 'raisedhands', 'VisITedResources', 'AnnouncementsView', and 'Discussion', while the target variable was 'Class'.

* Random Forest stands out as the best-performing model with the highest accuracy (69%). It's a robust ensemble model suitable for this classification task.

* Logistic Regression and Decision Tree provide acceptable accuracy but may not capture the complexities of the data as well as ensemble methods.

* SVM, KNN, and Naive Bayes fall in the mid-range of accuracy, with differences in precision and recall. Depending on specific requirements, one might be preferred over the other.
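The accuracies above come from a single train/test split, and the tree-based models are not seeded, so the ranking can shift between runs. Cross-validation is one possible way to get a steadier comparison. The sketch below is not part of the original notebook: it uses synthetic data, two of the six classifiers, and hypothetical names (`X_demo`, `y_demo`, `models`, `cv_means`).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the real features and 'Class'.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = np.digitize(X_demo[:, 0] + 0.5 * X_demo[:, 1], bins=[-0.5, 0.5])

models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(random_state=0),
}

cv_means = {}
for name, model in models.items():
    # The pipeline refits the scaler inside each fold, so the held-out
    # fold never influences the scaling statistics.
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.2f} (std {scores.std():.2f})")
```

Reporting the mean and standard deviation across folds shows whether a gap of a few percentage points between two models is larger than the fold-to-fold variation.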