## 1.1 Data Types Exploration Exercise

Exploring Variables in the Student Performance Dataset

**Objective**: Identify the variables in the Student Performance dataset and classify each one as numerical or categorical. For numerical variables, determine whether they are continuous or discrete; for categorical variables, determine whether they are ordinal or nominal.

**Dataset Description**:
The Student Performance dataset contains information about students' performance in exams. It includes various attributes such as gender, ethnicity, parental level of education, test scores, and more.

**Instructions**:
- Load the Student Performance dataset into a DataFrame called df.
- Examine the dataset and the available columns.
- For each column, determine its data type and classify it accordingly:

    a) Identify the numerical variables and specify if they are continuous or discrete.

    b) Identify the categorical variables and specify if they are ordinal or nominal.

Remember to consider the nature of the variables, their values, and the context of the dataset when classifying them. Some variables may require further examination or interpretation to determine their exact classification.


**Dataset source URL**

url = "https://raw.githubusercontent.com/data-bootcamp-v4/prework_data/main/students_performance.csv"

In [2]:
# Import pandas library
import pandas as pd

# Store the link of csv file
url = "https://raw.githubusercontent.com/data-bootcamp-v4/prework_data/main/students_performance.csv"

# Read the csv file from url and store it into dataframe "df"
df = pd.read_csv(url)

# Display top 5 rows of the dataset
df.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


### Numerical Data Types

By directly looking at the dataset, we can clearly see that the following variables contain **numerical values**, as they represent students’ scores:

- **`math score`**: Shows the student’s score in the math module.  
- **`reading score`**: Shows the student’s score in the reading module.  
- **`writing score`**: Shows the student’s score in the writing module.  

These variables represent exam scores and are therefore numerical in nature.
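The same split can also be checked programmatically rather than by eye. The following is a minimal sketch, assuming a small hand-built `sample` DataFrame (a hypothetical stand-in that copies a few columns from the five rows shown above) in place of the full `df`. On the full dataset, any numeric index-like column would also be picked up, so the result should still be reviewed manually.

```python
import pandas as pd

# Hypothetical stand-in for `df`: a few columns from the first five rows shown above.
sample = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "lunch": ["standard", "standard", "standard", "free/reduced", "standard"],
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
})

# Columns stored with a numeric dtype (integers or floats)
num_cols_detected = sample.select_dtypes(include="number").columns.tolist()

# Remaining columns (here: text columns)
cat_cols_detected = sample.select_dtypes(exclude="number").columns.tolist()

print(num_cols_detected)
print(cat_cols_detected)
```

A numeric dtype only says how the values are stored; whether a column is conceptually numerical or categorical still depends on what the values mean.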

In [16]:
# Separate the numerical columns
num_cols = ["math score", "reading score", "writing score"]

# Display numerical columns
df[num_cols]

Unnamed: 0,math score,reading score,writing score
0,72,72,74
1,69,90,88
2,90,95,93
3,47,57,44
4,76,78,75
...,...,...,...
995,88,99,95
996,62,55,55
997,59,71,65
998,68,78,77


#### Numerical Data Type: Discrete or Continuous Variables?

Although ``math score``, ``reading score``, and ``writing score`` are numerical columns, they are **discrete numerical variables**, not continuous.

**Continuous Numerical Variables:** Continuous data represents measurements that can take ***any value within a specified range***, are ***not limited to whole numbers*** and can include ***decimal or fractional values***. For example: height, weight, time, temperature.

In this dataset, exam scores are recorded as **whole numbers only**, so decimal values (e.g., 87.9) do **not** occur and the scores do **not** qualify as continuous.

**Discrete Numerical Variables:** Discrete data refers to values that are ***counted***, ***not measured***, can take only ***specific***, ***distinct values***, are typically ***whole numbers*** and cannot be ***subdivided*** into meaningful smaller units. For example: number of students, number of correct answers, exam scores.

**Conclusion**
We can conclude that this dataset does not contain any **continuous variables** and that all the numerical columns are **discrete variables**.
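The whole-number claim can be verified with a quick check. This is a minimal sketch, assuming a hypothetical `scores` DataFrame that holds only the five score rows shown earlier; on the full dataset the same expressions would be applied to `df[num_cols]`.

```python
import pandas as pd

# Hypothetical stand-in for df[num_cols]: the five score rows shown earlier.
scores = pd.DataFrame({
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
})

# Per column: True when every value has no fractional part
all_whole = (scores % 1 == 0).all()

# Per column: True when pandas stores the values with an integer dtype
is_int_dtype = {col: pd.api.types.is_integer_dtype(scores[col]) for col in scores.columns}

print(all_whole)
print(is_int_dtype)
```

If any column returned `False` here, it would contain fractional values and would need to be reconsidered as a continuous variable.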

### Categorical Data Type

Categorical or qualitative data represents variables that are divided into distinct categories or groups.

In [17]:
# Separate categorical columns
cat_cols = [
    "gender",
    "race/ethnicity",
    "parental level of education",
    "lunch",
    "test preparation course"
]

df[cat_cols]

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course
0,female,group B,bachelor's degree,standard,none
1,female,group C,some college,standard,completed
2,female,group B,master's degree,standard,none
3,male,group A,associate's degree,free/reduced,none
4,male,group C,some college,standard,none
...,...,...,...,...,...
995,female,group E,master's degree,standard,completed
996,male,group C,high school,free/reduced,none
997,female,group C,high school,free/reduced,completed
998,female,group D,some college,standard,completed


#### Nominal or Ordinal Variables

**Nominal variables:** Nominal variables represent categories or groups that have no ***inherent order*** or ***ranking***. 

Examples of nominal variables include:
- gender (male, female).
- ethnicity (Asian, African American, Caucasian).
- marital status (single, married, divorced).

**Ordinal variables:** Ordinal variables, on the other hand, represent categories that have a ***natural order*** or ***ranking***. 

Examples of ordinal variables include:
- rating scales (such as Likert scales).
- educational levels (e.g., high school, bachelor's degree, master's degree).
- satisfaction levels (e.g., very satisfied, satisfied, neutral, dissatisfied, very dissatisfied).

**Note:** When working with categorical data, we focus on understanding the frequency counts and proportions within each category.
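As an illustration of frequency counts and proportions, here is a minimal sketch, assuming a hypothetical `gender` Series built from the five rows shown earlier. On the full dataset the same calls would be made on a column such as `df["gender"]`.

```python
import pandas as pd

# Hypothetical stand-in for df["gender"]: values from the first five rows.
gender = pd.Series(["female", "female", "female", "male", "male"], name="gender")

# Number of rows in each category
counts = gender.value_counts()

# Share of rows in each category (sums to 1.0)
proportions = gender.value_counts(normalize=True)

print(counts)
print(proportions)
```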

In [18]:
df[cat_cols].nunique()

gender                         2
race/ethnicity                 5
parental level of education    6
lunch                          2
test preparation course        2
dtype: int64

In [19]:
# Display categorical columns
df[cat_cols]

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course
0,female,group B,bachelor's degree,standard,none
1,female,group C,some college,standard,completed
2,female,group B,master's degree,standard,none
3,male,group A,associate's degree,free/reduced,none
4,male,group C,some college,standard,none
...,...,...,...,...,...
995,female,group E,master's degree,standard,completed
996,male,group C,high school,free/reduced,none
997,female,group C,high school,free/reduced,completed
998,female,group D,some college,standard,completed


Based on the observation, we can say that **`gender`**, **`race/ethnicity`**, **`lunch`**, and **`test preparation course`** are **nominal variables** because they have no natural order and their values are labels only.

On the other hand, the column **`parental level of education`** is an **ordinal variable** because its six categories follow a meaningful order, such as:

`some high school < high school < some college < associate’s degree < bachelor’s degree < master’s degree`
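An ordering like this can be made explicit in pandas with an ordered categorical dtype. The sketch below is one possible encoding, not the only one: it assumes the six education levels reported by `nunique()`, and placing `some college` between `high school` and `associate's degree` is an assumption. The `education` Series is a hypothetical stand-in built from the five rows shown earlier.

```python
import pandas as pd

# Assumed ranking of the six levels, lowest to highest.
education_order = [
    "some high school",
    "high school",
    "some college",
    "associate's degree",
    "bachelor's degree",
    "master's degree",
]
edu_dtype = pd.CategoricalDtype(categories=education_order, ordered=True)

# Hypothetical stand-in for df["parental level of education"] (first five rows).
education = pd.Series(
    ["bachelor's degree", "some college", "master's degree",
     "associate's degree", "some college"],
    name="parental level of education",
).astype(edu_dtype)

# Ordered categoricals support comparisons and min/max based on the ranking.
at_least_associate = education >= "associate's degree"

print(education.min(), "|", education.max())
print(at_least_associate.tolist())
```

Nominal columns such as `gender` can also be stored as categoricals, but without `ordered=True`, since comparisons between their values have no meaning.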
