
### Cleaning

* clean column names (ira)
* clean kuwait (bly)
* clean mum to Mother (ira)
* clean lebanon (stephen)
* clean outlier in grade 7 (Anj)
* KW --> Kuwait (stephen)

# **Student's Academic Performance** 
#### *Members: Beaverly Cuasi, Allexandra Domingo, Angeline Gubat, Stephen Salamante, Ira Villanueva  (S12 - Couch Data Scientists)*
This notebook focuses on the Student's Academic Performance Dataset, which can be found [here](https://www.kaggle.com/aljarah/xAPI-Edu-Datal). The first section presents the dataset information and the data cleaning process, followed by data visualization as part of Exploratory Data Analysis (EDA). Finally, this notebook aims to answer the following questions: <br><br> 
  
* **Main Research Question**
  * Which factors affect a student’s academic grades?<br>
* **Sub-questions**
  * Do students who participate more have higher grades than those who participate less?
  * Do parent participation and satisfaction affect a student’s performance?
  * Does the number of class absences greatly affect the student’s grades?
  * Does visiting class resources result in higher class grades?


### **Dataset Information**  
The Student's Performance Dataset was collected from Kalboard 360, a learning management system (LMS), using a learner activity tracker tool called the Experience API (xAPI). The dataset contains **480** observations (rows) across 17 features (columns). Below is a brief description of each feature:  
* **`gender`**: Student's gender <br>
* **`NationalITy`**: Student's nationality <br>
* **`PlaceofBirth`**: Student's place of birth <br>
* **`StageID`**: Educational level that student belongs to <br>
* **`GradeID`**: Grade level that student belongs to <br>
* **`SectionID`**: Classroom that student belongs to<br>
* **`Topic`**: Course topic <br>
* **`Semester`**: Current semester in a school year <br>
* **`Relation`**: Parent who is responsible for a student<br>
* **`raisedhands`**: Total number of times a student raises his/her hand  <br>
* **`VisITedResources`**: Total number of times a student visits course content <br>
* **`AnnouncementsView`**: Total number of times a student checks the announcements <br>
* **`Discussion`**: Total number of times a student participates in discussion groups <br>
* **`ParentAnsweringSurvey`**: Whether a parent answered the surveys provided by the school <br>
* **`ParentschoolSatisfaction`**: Whether a parent is satisfied with the school or not <br>
* **`StudentAbsenceDays`**: Number of days a student was absent, recorded as a category (`Under-7` or `Above-7`) <br>
* **`Class`**: Represents the interval of a student's total grade <br>
  * _Low_: interval includes values from 0 to 69
  * _Middle_: interval includes values from 70 to 89
  * _High_: interval includes values from 90 to 100
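
The three grade intervals can be illustrated with a small sketch. The numeric scores below are hypothetical: the dataset stores only the resulting class label, not the underlying total grade, so this shows the binning rule rather than anything computed from the data.

```python
import pandas as pd

# Hypothetical total grades; the real dataset only contains the class label.
scores = pd.Series([0, 69, 70, 89, 90, 100])

# Bin edges are right-inclusive: (-1, 69], (69, 89], (89, 100].
grade_class = pd.cut(
    scores,
    bins=[-1, 69, 89, 100],
    labels=['Low', 'Middle', 'High'],
)

print(grade_class.tolist())
```

The lower edge is set to -1 so that a score of 0 falls inside the first interval.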

Let's view the observations and features of the dataset. But first, let's import the needed libraries for this notebook.

In [57]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Using `read_csv`, we will load the dataset (xAPI-Edu-Data.csv) into a DataFrame.

In [58]:
url = 'https://raw.githubusercontent.com/okaystephen/DATASCI-MCO/main/xAPI-Edu-Data.csv'
spd_df = pd.read_csv(url)  

# First 5 rows in the dataset:
spd_df.head()

Unnamed: 0,gender,NationalITy,PlaceofBirth,StageID,GradeID,SectionID,Topic,Semester,Relation,raisedhands,VisITedResources,AnnouncementsView,Discussion,ParentAnsweringSurvey,ParentschoolSatisfaction,StudentAbsenceDays,Class
0,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,15,16,2,20,Yes,Good,Under-7,M
1,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,20,20,3,25,Yes,Good,Under-7,M
2,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,10,7,0,30,No,Bad,Above-7,L
3,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,30,25,5,35,No,Bad,Above-7,L
4,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,40,50,12,50,No,Bad,Above-7,M


Now let's look at its general information using the `info()` function.

In [59]:
# Dataset's variables type:
spd_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   gender                    480 non-null    object
 1   NationalITy               480 non-null    object
 2   PlaceofBirth              480 non-null    object
 3   StageID                   480 non-null    object
 4   GradeID                   480 non-null    object
 5   SectionID                 480 non-null    object
 6   Topic                     480 non-null    object
 7   Semester                  480 non-null    object
 8   Relation                  480 non-null    object
 9   raisedhands               480 non-null    int64 
 10  VisITedResources          480 non-null    int64 
 11  AnnouncementsView         480 non-null    int64 
 12  Discussion                480 non-null    int64 
 13  ParentAnsweringSurvey     480 non-null    object
 14  ParentschoolSatisfaction  

In [60]:
spd_df.isnull().sum()

gender                      0
NationalITy                 0
PlaceofBirth                0
StageID                     0
GradeID                     0
SectionID                   0
Topic                       0
Semester                    0
Relation                    0
raisedhands                 0
VisITedResources            0
AnnouncementsView           0
Discussion                  0
ParentAnsweringSurvey       0
ParentschoolSatisfaction    0
StudentAbsenceDays          0
Class                       0
dtype: int64

No column contains null or missing values, so no special handling of null or NaN values is needed. Before analyzing the data, let's explore the dataset further to see what needs cleaning.  
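
One way to explore the dataset for cleaning candidates is to list the distinct labels of every text column and look for inconsistent spelling or casing. The sketch below uses a small stand-in frame (`sample_df`, a hypothetical name) with a few illustrative values rather than the full dataset.

```python
import pandas as pd

# Stand-in for spd_df with a few illustrative values.
sample_df = pd.DataFrame({
    'NationalITy': ['KW', 'lebanon', 'Egypt'],
    'PlaceofBirth': ['KuwaIT', 'lebanon', 'Egypt'],
    'raisedhands': [15, 20, 10],
})

# True if any cell in the frame is missing.
has_missing = bool(sample_df.isnull().any().any())

# Distinct labels of every non-numeric column, sorted for easier reading.
labels_by_column = {
    col: sorted(sample_df[col].unique())
    for col in sample_df.columns
    if not pd.api.types.is_numeric_dtype(sample_df[col])
}

print(has_missing)
for col, labels in labels_by_column.items():
    print(col, labels)
```

Labels such as `lebanon` or `KuwaIT` stand out immediately in a listing like this.
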
### **Data Cleaning**  
#### **Renaming feature names**  
When we viewed the dataset information, we saw that the column names do not follow a consistent format. Some names are entirely lowercase (`gender`, `raisedhands`), while others have stray capital letters, such as `NationalITy` and `VisITedResources`. Renaming the columns to a consistent style makes them easier to recall later on. Here are the columns that we will be renaming:  
* NationalITy --> Nationality
* PlaceofBirth --> BirthPlace
* gender --> Gender
* VisITedResources --> VisitedResources
* ParentschoolSatisfaction --> ParentSchoolSatisfaction
* raisedhands --> RaisedHands

In [61]:
spd_df.rename(columns={
    'NationalITy': 'Nationality',
    'PlaceofBirth': 'BirthPlace',
    'gender': 'Gender',
    'VisITedResources': 'VisitedResources',
    'ParentschoolSatisfaction': 'ParentSchoolSatisfaction',
    'raisedhands': 'RaisedHands',
}, inplace=True)

Let's view the dataset with the renamed columns.

In [62]:
spd_df.head()

Unnamed: 0,Gender,Nationality,BirthPlace,StageID,GradeID,SectionID,Topic,Semester,Relation,RaisedHands,VisitedResources,AnnouncementsView,Discussion,ParentAnsweringSurvey,ParentSchoolSatisfaction,StudentAbsenceDays,Class
0,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,15,16,2,20,Yes,Good,Under-7,M
1,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,20,20,3,25,Yes,Good,Under-7,M
2,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,10,7,0,30,No,Bad,Above-7,L
3,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,30,25,5,35,No,Bad,Above-7,L
4,M,KW,KuwaIT,lowerlevel,G-04,A,IT,F,Father,40,50,12,50,No,Bad,Above-7,M


#### **Renaming Mum to Mother in Relation Column**  
It's also important to ensure that data values are consistent and uniform throughout the dataset, especially within a single feature column. Let's check the `Relation` column in the dataset and view its unique values.  
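
A minimal sketch of this step is shown below. It assumes the `Relation` column holds only the labels `Father` and `Mum`, and it uses a small stand-in frame (`sample_df`, a hypothetical name) instead of the full dataset.

```python
import pandas as pd

# Stand-in for spd_df; only the Relation column is needed here.
sample_df = pd.DataFrame({'Relation': ['Father', 'Mum', 'Father', 'Mum']})

# Inspect the distinct labels before changing anything.
print(sample_df['Relation'].unique())

# Replace the informal label and assign the result back to the column.
sample_df['Relation'] = sample_df['Relation'].replace('Mum', 'Mother')

print(sample_df['Relation'].unique())
```

Assigning the result back to the column keeps the update explicit and avoids relying on `inplace=True` with a selected column.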

#### **Renaming countries from the `BirthPlace` column** 
Here are the countries that we will be renaming:  
* lebanon --> Lebanon
* KuwaIT --> Kuwait
* venzuela --> Venezuela

Let's turn `lebanon` into `Lebanon`.

In [63]:
# Assign the result back; replace(..., inplace=True) on a selected column may not update spd_df in newer pandas versions.
spd_df['BirthPlace'] = spd_df['BirthPlace'].replace('lebanon', 'Lebanon')

In [64]:
spd_df['BirthPlace'].unique()

array(['KuwaIT', 'Lebanon', 'Egypt', 'SaudiArabia', 'USA', 'Jordan',
       'venzuela', 'Iran', 'Tunis', 'Morocco', 'Syria', 'Iraq',
       'Palestine', 'Lybia'], dtype=object)
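
The output above still contains `KuwaIT` and `venzuela`. As a sketch, all three renames listed for this column could be applied in one call by passing a dictionary to `replace`. The frame below is a small stand-in (`sample_df`, a hypothetical name) rather than the full dataset.

```python
import pandas as pd

# Stand-in for spd_df, using labels seen in the BirthPlace output.
sample_df = pd.DataFrame(
    {'BirthPlace': ['KuwaIT', 'lebanon', 'venzuela', 'Egypt']}
)

# Old label -> corrected label, following the list for this column.
birthplace_fixes = {
    'lebanon': 'Lebanon',
    'KuwaIT': 'Kuwait',
    'venzuela': 'Venezuela',
}

# Labels that are not keys of the dictionary are left unchanged.
sample_df['BirthPlace'] = sample_df['BirthPlace'].replace(birthplace_fixes)

print(sample_df['BirthPlace'].unique())
```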

#### **Renaming countries from the `Nationality` column**  
Here are the countries that we will be renaming:  
* KW --> Kuwait
* lebanon --> Lebanon
* venzuela --> Venezuela

Let's turn `KW` into `Kuwait`.

In [65]:
spd_df['Nationality'] = spd_df['Nationality'].replace('KW', 'Kuwait')

In [69]:
spd_df['Nationality'].unique()

array(['Kuwait', 'lebanon', 'Egypt', 'SaudiArabia', 'USA', 'Jordan',
       'venzuela', 'Iran', 'Tunis', 'Morocco', 'Syria', 'Palestine',
       'Iraq', 'Lybia'], dtype=object)
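
The output above still contains `lebanon` and `venzuela`. Since the same kind of fix applies to more than one column, the replacement can be wrapped in a small helper, sketched below. The helper name `apply_label_fixes` and the stand-in frame `sample_df` are hypothetical and not part of the original notebook.

```python
import pandas as pd


def apply_label_fixes(df, column, fixes):
    """Return a copy of df with the labels in `column` replaced per `fixes`."""
    cleaned = df.copy()
    cleaned[column] = cleaned[column].replace(fixes)
    return cleaned


# Stand-in for spd_df, using labels seen in the Nationality output.
sample_df = pd.DataFrame(
    {'Nationality': ['KW', 'lebanon', 'venzuela', 'Egypt']}
)

nationality_fixes = {
    'KW': 'Kuwait',
    'lebanon': 'Lebanon',
    'venzuela': 'Venezuela',
}
cleaned_df = apply_label_fixes(sample_df, 'Nationality', nationality_fixes)

# Count how many of the old labels remain after the replacement.
leftover = int(cleaned_df['Nationality'].isin(list(nationality_fixes)).sum())

print(cleaned_df['Nationality'].unique(), leftover)
```

Returning a copy leaves the input frame untouched, which makes it easy to compare the labels before and after cleaning.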