# Student Performance Indicator

## Life cycle of an ML project

- Understanding the problem statement
- Data Collection
- Data checks to perform
- Exploratory Data Analysis
- Model Training
- Choosing the Best Model

## 1. Problem Statement

- Understand how different features (gender, race/ethnicity, parental level of education, lunch, and test preparation course) affect students' test scores.

## 2. Data Collection

- [Data Source](https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?select=StudentsPerformance.csv)
- The data consists of 8 columns and 1000 rows.

### 2.1 Importing required libraries
`numpy`, `pandas`, `matplotlib`, `seaborn`, and `warnings`

In [5]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

### Importing data as pandas dataframe

In [6]:
import pandas as pd

data = 'sdata'
df = pd.read_csv('data/{}.csv'.format(data))

In [7]:
df

| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | female | group E | master's degree | standard | completed | 88 | 99 | 95 |
| 996 | male | group C | high school | free/reduced | none | 62 | 55 | 55 |
| 997 | female | group C | high school | free/reduced | completed | 59 | 71 | 65 |
| 998 | female | group D | some college | standard | completed | 68 | 78 | 77 |


In [9]:
df.shape

(1000, 8)

### 2.2 Dataset Description

The dataset is published on Kaggle by the user spscientist. It records students' exam performance together with several factors that may influence it. You can access the dataset from the following link: [Dataset Link](https://www.kaggle.com/datasets/spscientist/students-performance-in-exams?select=StudentsPerformance.csv)

#### Column descriptions

* **Gender**
    * This represents the sex of the student. It's a categorical value with two categories: 'male' and 'female'.

* **Race/Ethnicity**
    * This column reflects the racial/ethnic background of the students. It is also a categorical value, with several distinct groups labeled from 'group A' to 'group E'.

* **Parental Level of Education**
    * This describes the highest level of education attained by the student's parents. It is a categorical value with six distinct levels, including high school, some college, associate's degree, bachelor's degree, and master's degree.

* **Lunch**
    * This represents whether the student is on a standard or free/reduced lunch program at the school. This could be an indicator of the socioeconomic status of the student's family.

* **Test Preparation Course**
    * This column indicates whether the student completed a test preparation course or not. This is a binary categorical feature with possible values of 'none' or 'completed'.

* **Math Score**
    * This column represents the student's score in Mathematics. It's a numerical value ranging from 0 to 100.

* **Reading Score**
    * This column represents the student's score in Reading. It's a numerical value ranging from 0 to 100.

* **Writing Score**
    * This column represents the student's score in Writing. It's a numerical value ranging from 0 to 100.


# 3. Data Checks to Perform

## Data Loading

1. **Check successful data load**: Ensure the data has been loaded correctly without any errors.
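
One way to make this check concrete is a small assertion helper. The sketch below is illustrative only: `check_load` and `EXPECTED_COLUMNS` are hypothetical names, and the stand-in frame is built from the first two rows of the preview rather than from the CSV file.

```python
import pandas as pd

# Column names as they appear in the preview of `df` earlier in this notebook.
EXPECTED_COLUMNS = [
    "gender", "race/ethnicity", "parental level of education", "lunch",
    "test preparation course", "math score", "reading score", "writing score",
]

def check_load(frame: pd.DataFrame, expected_rows=None) -> None:
    """Raise AssertionError if the frame does not look like the expected dataset."""
    assert not frame.empty, "DataFrame is empty"
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    assert not missing, f"missing columns: {missing}"
    if expected_rows is not None:
        assert len(frame) == expected_rows, (
            f"expected {expected_rows} rows, got {len(frame)}"
        )

# Stand-in frame (first two preview rows), used here only to exercise the helper.
sample = pd.DataFrame(
    [["female", "group B", "bachelor's degree", "standard", "none", 72, 72, 74],
     ["female", "group C", "some college", "standard", "completed", 69, 90, 88]],
    columns=EXPECTED_COLUMNS,
)
check_load(sample, expected_rows=2)
```

With the real data, the equivalent call would be `check_load(df, expected_rows=1000)`.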

## Missing Data

1. **Check for Null values**: Check if there are any missing values (NaN, Null, None, etc.) in the dataset for all columns.

2. **Percentage of missing data**: If there are missing values, compute the percentage of missing data. This can help decide whether to impute these missing values or discard the column entirely if the majority of values are missing.
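
The percentage of missing data per column can be computed as sketched below. The `demo` frame and the 50% threshold are assumptions made for illustration; the actual dataset has no missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps, for illustration only.
demo = pd.DataFrame({
    "math score": [72, np.nan, 90, 47],
    "reading score": [72, np.nan, np.nan, np.nan],
    "lunch": ["standard", "standard", None, "free/reduced"],
})

# isna() gives a boolean mask; the mean of booleans is the fraction of True values.
missing_pct = demo.isna().mean() * 100
print(missing_pct.sort_values(ascending=False))

# Columns above a chosen threshold (50% is an arbitrary example) are candidates to drop.
too_sparse = missing_pct[missing_pct > 50].index.tolist()
print(too_sparse)
```

Columns below the threshold are usually better handled by imputation than by dropping, but the right cut-off depends on the analysis.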

## Data Quality

1. **Data types**: Check the data types of each column and make sure they match the description. If there's a mismatch, data conversion might be necessary.

2. **Unique values**: For categorical variables, check the unique values and their count. This is useful for understanding the distribution of the categories and identifying any unexpected or erroneous categories.

3. **Statistical summary**: For numerical columns, check the summary statistics like mean, median, mode, min, max, standard deviation etc. This helps understand the distribution of the data.

4. **Outliers**: Look for any outliers in the numeric columns, as they may affect the overall analysis and model performance. Outliers can be identified by visually plotting the data or statistically (for example, using Z-score, IQR).
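
As a sketch of the statistical route, the IQR rule flags values that lie more than 1.5 times the interquartile range beyond the quartiles. The helper name `iqr_outliers` is hypothetical. The sample series holds the nine math scores visible in the preview plus the dataset minimum of 0, so it is not the full column.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Nine preview math scores plus the minimum reported by describe(); illustrative only.
scores = pd.Series([72, 69, 90, 47, 76, 88, 62, 59, 68, 0], name="math score")
print(iqr_outliers(scores))
```

On the real data, the same helper could be applied to each score column, for example `iqr_outliers(df["math score"])`.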

## Consistency

1. **Consistency of categories**: Check if there are inconsistencies in the categorical columns. For example, are 'Male' and 'male' being treated as different categories? 

2. **Data range**: Check if the values of numerical columns (e.g., scores) fall within the expected range (e.g., 0-100 for exam scores).
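
Both consistency checks can be sketched in a few lines. The `demo` frame below is invented to show the failure cases; none of these inconsistencies are known to exist in the actual dataset.

```python
import pandas as pd

# Hypothetical frame with inconsistent labels and out-of-range scores.
demo = pd.DataFrame({
    "gender": ["Male", "male ", "female", "FEMALE"],
    "math score": [72, 105, -3, 88],
})

# Normalise case and surrounding whitespace so 'Male' and 'male ' collapse into one category.
demo["gender"] = demo["gender"].str.strip().str.lower()
print(demo["gender"].value_counts())

# Flag rows whose score falls outside the expected 0-100 range (bounds inclusive).
out_of_range = demo[~demo["math score"].between(0, 100)]
print(out_of_range)
```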

## Correlation

1. **Correlation matrix**: It's useful to calculate a correlation matrix between numerical variables to see if there are highly correlated variables.
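
A minimal sketch of the correlation check follows. It uses only the nine rows visible in the preview, so the resulting coefficients are illustrative and will differ from those of the full dataset.

```python
import pandas as pd

# Score columns from the nine rows shown in the preview (not the full dataset).
preview = pd.DataFrame({
    "math score": [72, 69, 90, 47, 76, 88, 62, 59, 68],
    "reading score": [72, 90, 95, 57, 78, 99, 55, 71, 78],
    "writing score": [74, 88, 93, 44, 75, 95, 55, 65, 77],
})

# Keep numeric columns only, then compute pairwise Pearson correlations (the default).
corr = preview.select_dtypes("number").corr()
print(corr.round(2))
```

On the real data, `df.select_dtypes("number").corr()` gives the same kind of matrix, and `sns.heatmap(corr, annot=True)` is one way to visualise it.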

## Duplicates

1. **Duplicate rows**: Check if there are any duplicate rows in the dataset. Duplicate data can bias the analysis, so you need to decide whether to keep, remove, or consolidate duplicate entries.

Remember, each of these checks may reveal issues with the data that need to be resolved before continuing with the analysis. The nature of the issues and how they should be addressed will depend on the specific context and purpose of the analysis.


## 3.1 Check Missing Values

In [10]:
df.isna().sum()

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

> There are no missing values in the data set. 

## 3.2 Check Duplicates

In [12]:
df.duplicated().sum()

0

> There are no duplicates in the data set.

- If a data set does contain duplicates, they can be removed as follows:

```python
# The drop_duplicates function considers all columns to identify duplicates.
df = df.drop_duplicates()

# Remove duplicates based on certain columns
df = df.drop_duplicates(subset=['column1', 'column2'])
```
- After removing duplicates, it's usually a good idea to reset the DataFrame's index so that it is contiguous again.

```python
df = df.reset_index(drop=True)
```


## 3.3 Check Data Types

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


## 3.4 Checking the number of unique values in each column

In [15]:
df.nunique()

gender                          2
race/ethnicity                  5
parental level of education     6
lunch                           2
test preparation course         2
math score                     81
reading score                  72
writing score                  77
dtype: int64

## 3.5 Checking statistics of the data set 

In [16]:
df.describe()

| | math score | reading score | writing score |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 66.089 | 69.169 | 68.054 |
| std | 15.16308 | 14.600192 | 15.195657 |
| min | 0.0 | 17.0 | 10.0 |
| 25% | 57.0 | 59.0 | 57.75 |
| 50% | 66.0 | 70.0 | 69.0 |
| 75% | 77.0 | 79.0 | 79.0 |
| max | 100.0 | 100.0 | 100.0 |


# Dataset Insights

The dataset contains scores for Mathematics, Reading, and Writing for 1000 students.

## Math Score

- The average (mean) Mathematics score is approximately 66.09, which suggests that students tend to perform moderately well in Mathematics.
- The standard deviation is around 15.16, indicating there's a fair amount of variation in the Mathematics scores. Some students have very high scores, and others have very low scores.
- The minimum score in Mathematics is 0, indicating that at least one student has scored zero in Mathematics. This could be due to several reasons such as not attempting the exam, lack of understanding, or other factors.
- The maximum score is 100, indicating that at least one student has achieved full marks in Mathematics.
- The 25th and 75th percentiles are 57 and 77, so the middle half of the students scored between 57 and 77. The median (50th percentile) is 66.

## Reading Score

- The average Reading score is approximately 69.17, which is slightly higher than the average Mathematics score.
- The standard deviation is around 14.6, which is a bit less than that for Mathematics. This might suggest that students' Reading scores are slightly more clustered around the mean compared to Mathematics.
- The minimum Reading score is 17, and the maximum is 100. The lowest Reading score is well above the lowest Mathematics score (0), but a single minimum says little about overall performance. The higher mean and median are the stronger evidence that students did better in Reading.
- Half of the students have Reading scores between 59 and 79.

## Writing Score

- The average Writing score is around 68.05, about 2 points above the average Mathematics score and about 1 point below the average Reading score.
- The standard deviation is approximately 15.20, indicating a similar spread to the Mathematics scores.
- The minimum Writing score is 10, and the maximum is 100. The minimum lies between those of Mathematics (0) and Reading (17), but a minimum reflects only the single weakest result in each subject and does not show how difficult students found each section on average.
- Half of the students have Writing scores between approximately 57.75 and 79.

# General Insights

From these observations, students seem to perform slightly better in Reading, followed closely by Writing, and then Mathematics. The mean scores differ by only about 3 points, however, while the standard deviations are around 15, so the spread within each subject is much larger than the gap between subjects. Some students have achieved perfect scores, while others have struggled significantly. These insights could form the basis for further investigation into factors contributing to these score distributions.
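
The ranking of subjects and the width of the middle 50% can be derived directly from the summary statistics. In the sketch below, the values are copied from the `df.describe()` output above into a frame named `summary` (a name chosen here for illustration); with the data loaded, `df.describe()` itself could be used instead.

```python
import pandas as pd

# Summary statistics copied from the df.describe() output (count row omitted).
summary = pd.DataFrame(
    {
        "math score": [66.089, 15.16308, 0.0, 57.0, 66.0, 77.0, 100.0],
        "reading score": [69.169, 14.600192, 17.0, 59.0, 70.0, 79.0, 100.0],
        "writing score": [68.054, 15.195657, 10.0, 57.75, 69.0, 79.0, 100.0],
    },
    index=["mean", "std", "min", "25%", "50%", "75%", "max"],
)

# Rank subjects by mean score, highest first.
ranking = summary.loc["mean"].sort_values(ascending=False)
print(ranking)

# Interquartile range: width of the interval holding the middle half of the students.
iqr = summary.loc["75%"] - summary.loc["25%"]
print(iqr)
```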


# Brief Insights

This dataset contains Mathematics, Reading, and Writing scores for 1000 students.

- **Math Score**: Average score is approximately 66.09, with a minimum of 0 and a maximum of 100.
- **Reading Score**: Average score is slightly higher at 69.17, with a minimum of 17 and a maximum of 100. This indicates students performed somewhat better in Reading than Mathematics.
- **Writing Score**: The average score is around 68.05, with a minimum of 10 and a maximum of 100. This suggests performance in Writing lies between Mathematics and Reading.

In general, students perform slightly better in Reading, followed by Writing, then Mathematics. However, there is a considerable spread in scores, highlighting varied performance levels across students.


## 3.7 Exploring Data