# Data types

In this hands-on lesson, we will explore the common data types.

In [3]:
# Let's start by importing pandas and reading the titanic dataset
import pandas as pd

# We can read from an online URL
titanic_data = pd.read_csv('https://raw.githubusercontent.com/data-bootcamp-v4/prework_data/main/titanic.csv')

Here is a description of the Titanic dataset variables to get a better understanding of the data:
- PassengerId: A unique ID for each passenger
- Survived: Whether the passenger survived (0 = No, 1 = Yes)
- Pclass: Passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
- Name: Passenger's name
- Sex: Passenger's gender (male or female)
- Age: Passenger's age in years
- SibSp: Number of siblings/spouses aboard the Titanic
- Parch: Number of parents/children aboard the Titanic
- Ticket: Ticket number
- Fare: Passenger fare
- Cabin: Cabin number
- Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

In [4]:
# Let's take a first look at our data
titanic_data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Numerical or quantitative data

Numerical or quantitative data consists of values that **can be measured or counted**. When working with numerical data, **we focus on understanding the central tendency (measures of centrality) and the variability (measures of dispersion)** within the dataset.
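As a minimal sketch of what centrality and dispersion look like in pandas, the block below computes a few of these measures. It uses only the five rows shown in the `head()` output above, typed in by hand as a small DataFrame called `sample` (a name chosen for this illustration), so the numbers describe those five passengers and not the full dataset.

```python
import pandas as pd

# Five rows copied from the head() output above (illustration only)
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Measures of centrality
mean_values = sample.mean()
median_values = sample.median()

# Measures of dispersion
std_values = sample.std()                    # sample standard deviation
range_values = sample.max() - sample.min()   # distance between extremes

print(mean_values, median_values, std_values, range_values, sep="\n\n")
```

On the full dataset the same calls would apply to `titanic_data[["Age", "Fare"]]`.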

In [5]:
# Let's look again at our data
titanic_data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Just by looking at it, we can see that the following variables have numeric values:

- PassengerId: Id for each passenger.
- Survived: Whether the passenger survived (0 = No, 1 = Yes).
- Pclass: Passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class).
- Age: The age of the passengers.
- SibSp: The number of siblings/spouses aboard the Titanic.
- Parch: The number of parents/children aboard the Titanic.
- Fare: The fare or price of the ticket.


But remember what we mentioned in the data types lesson.

❗ **Important note**: It is important to recognize that in some cases, a **numerical or quantitative variable may represent a categorical or qualitative variable**. For example, if a dataset includes a column with numerical values representing different categories or labels, such as "0" for "male" and "1" for "female," it should be treated as a categorical variable rather than a true numerical variable. Always consider the context and meaning of the data when determining the appropriate data type.

The variables "PassengerId," "Survived," and "Pclass" have numerical values. However, they are typically not considered numerical or quantitative variables for the following reasons:

- PassengerId: Although it is represented by numbers, it serves as an identifier or label for each passenger rather than a numerical quantity with meaningful magnitude or units. 

- Survived: While it has numerical values (0 and 1), it represents a binary or categorical outcome rather than a numerical quantity. It indicates whether a passenger survived or not, rather than measuring a quantitative attribute.

- Pclass: Despite having numerical values (1, 2, and 3), it represents different passenger classes or categories rather than a numerical scale. The numbers are used to label the different classes rather than implying a quantitative relationship between them.

For these reasons, "Survived" and "Pclass" are typically treated as categorical variables, and "PassengerId" as an identifier, even though all three are stored as numbers.
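One possible way to record this decision in pandas is to convert number-coded labels to the `category` dtype. The sketch below assumes a hand-typed five-row `sample` DataFrame (values copied from the `head()` output) and a new variable name `typed`; both are illustrative and not part of the lesson's notebook.

```python
import pandas as pd

# Five rows copied from the head() output (illustration only)
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

# Mark the number-coded labels as categories; Age stays numeric
typed = sample.astype({"Survived": "category", "Pclass": "category"})
print(typed.dtypes)

# describe() summarises only numeric columns by default,
# so the converted columns no longer get a mean or standard deviation
numeric_summary = typed.describe()
print(numeric_summary.columns.tolist())
```

After the conversion, an accidental `mean()` of "Pclass" is less likely to slip into a numeric summary.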

With this said, we consider the following as numerical or quantitative variables:
    
- Age
- SibSp
- Parch
- Fare

### Discrete and Continuous variables

Remember that numerical or quantitative variables can be discrete or continuous. 

- **Discrete Data**: Discrete data refers to data that is counted and **can only take on specific values** within a defined range or set. These values are often whole numbers and cannot be further subdivided. For example, let's consider the number of students in each bootcamp. This is a classic example of discrete data, as it can only take on specific whole number values. We cannot have fractions or decimals when counting the number of students. 

- **Continuous Data**: On the other hand, continuous data represents measurements that **can take on any value within a specified range**. It is not limited to whole numbers and can include decimal values. 
A classic example of continuous data is the height of each student in a class. Heights can vary continuously, encompassing a wide range of values. From 0 cm to the tallest height ever recorded, there exists an infinite number of possible height values in between.

Which variables in the titanic dataset are continuous and which are discrete?

In [6]:
# We saw that to access a single column, you can use df['column_name'].
# To access many columns in a DataFrame, you can pass a list of column names using df[column_list].
numerical_vars = ["Age","SibSp","Parch","Fare"]
titanic_data[numerical_vars]

,Age,SibSp,Parch,Fare
0,22.0,1,0,7.2500
1,38.0,1,0,71.2833
2,26.0,0,0,7.9250
3,35.0,1,0,53.1000
4,35.0,0,0,8.0500
...,...,...,...,...
886,27.0,0,0,13.0000
887,19.0,0,0,30.0000
888,,1,2,23.4500
889,26.0,0,0,30.0000



Continuous Variables:

- Age: Represents the age of the passengers. It is a continuous variable as it can take any numerical value within a range.

- Fare: Indicates the fare or ticket price paid by the passengers. It is a continuous variable as it can take any numerical value within a range.

Discrete Variables:

- SibSp: The number of siblings/spouses aboard the Titanic. It represents discrete numerical values.
- Parch: The number of parents/children aboard the Titanic. It also represents discrete numerical values.


*Note: In a strict mathematical sense, age is a continuous variable because it can theoretically take on any value within a range (e.g., 20.5 years, 30.75 years, etc.). Age can be measured with a high level of precision, allowing for decimal values.*

*However, in many practical applications and datasets, age is often recorded and represented as whole numbers (integers) since people's ages are typically reported in whole years (e.g., 20 years, 30 years, etc.). In such cases, age is treated as a discrete variable with integer values.*
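A rough check that can support this classification is to test whether a column holds only whole numbers. The helper `holds_only_whole_numbers` below is a hypothetical name introduced for this sketch, and the data are the five rows from the `head()` output. Treat the result as a hint only: in these five rows "Age" contains only whole numbers, which is exactly the situation described in the note above.

```python
import pandas as pd

# Five rows copied from the head() output (illustration only)
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

def holds_only_whole_numbers(column: pd.Series) -> bool:
    """Return True if every non-missing value has no fractional part."""
    values = column.dropna()
    return bool((values % 1 == 0).all())

whole_number_flags = {
    name: holds_only_whole_numbers(sample[name]) for name in sample.columns
}
print(whole_number_flags)
```

A column that fails the check is a candidate continuous variable; a column that passes may be discrete, or may be a continuous quantity recorded with coarse precision.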

## Categorical or Qualitative Data

Categorical or qualitative data represents variables that are divided into **distinct categories or groups**. When working with categorical data, we focus on understanding the frequency counts and proportions within each category.
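Frequency counts and proportions can be obtained with `value_counts()`. The sketch below uses the "Embarked" values of the five rows shown in the `head()` output, typed in by hand as a Series named `embarked` for illustration.

```python
import pandas as pd

# Embarked values of the five rows shown by head() (illustration only)
embarked = pd.Series(["S", "C", "S", "S", "S"], name="Embarked")

# Frequency count of each category
counts = embarked.value_counts()

# Proportion of each category (counts divided by the number of non-missing values)
proportions = embarked.value_counts(normalize=True)

print(counts)
print(proportions)
```

On the full dataset the equivalent call would be `titanic_data["Embarked"].value_counts()`.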

In [7]:
# Let's look again at our data
titanic_data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


From the dataset description, we know that the following variables take only the values listed below:

- Survived: Indicates whether a passenger survived (0 = No, 1 = Yes).
- Pclass: The passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class).
- Sex: The gender of the passenger (male or female).
- Embarked: The port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

So we know they can be classified as categorical.

How about PassengerId, Name, Ticket, or Cabin?

To determine the uniqueness and potential usefulness of these variables, we can examine the number of distinct or unique values they have. To calculate the number of unique values for each variable in a dataframe, we can use:
```python
dataset.nunique()
```

In [8]:
titanic_data.nunique()

PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64

This confirms what the dataset description told us: the categorical variables Survived and Sex have 2 unique values each, and Pclass and Embarked have 3 each.

How many rows did our dataset have?

In [9]:
titanic_data.shape # shape returns (number of rows, number of columns)

(891, 12)

We can observe that "PassengerId" and "Name" have the same number of unique values as the total number of rows in the dataset (891), indicating that they are unique identifiers for each passenger and not useful for categorical analysis.

The variable "Ticket" has 681 unique values across 891 rows, so most ticket values are shared by very few passengers; it behaves more like an identifier than a category. Similarly, the "Cabin" variable has 147 unique values, too many distinct labels for it to work as a categorical variable in its raw form.

Based on this information, it is advisable to treat "PassengerId," "Name," "Ticket," and "Cabin" as non-categorical variables for analysis purposes. However, further exploration and domain knowledge may be required to determine if any meaningful insights can be derived from these variables through other data manipulation techniques.
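One way to make this comparison systematic is to divide each unique count by the number of rows. The sketch below reuses the counts printed by `nunique()` and the row count from `shape` above, typed in by hand so the block runs on its own. The 0.5 cut-off is an arbitrary assumption for illustration, not a rule.

```python
import pandas as pd

# Unique counts copied from the nunique() output above
unique_counts = pd.Series({
    "PassengerId": 891, "Survived": 2, "Pclass": 3, "Name": 891,
    "Sex": 2, "Age": 88, "SibSp": 7, "Parch": 7,
    "Ticket": 681, "Fare": 248, "Cabin": 147, "Embarked": 3,
})
n_rows = 891  # from titanic_data.shape

# Share of rows that carry a distinct value; 1.0 means every row is unique
unique_ratio = unique_counts / n_rows

# Hypothetical threshold: columns above it look more like identifiers than categories
high_cardinality = unique_ratio[unique_ratio > 0.5].index.tolist()

print(unique_ratio.round(3))
print(high_cardinality)
```

With the DataFrame loaded, the same ratio could be computed directly as `titanic_data.nunique() / len(titanic_data)`.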

Note: The `unique()` method in pandas retrieves the distinct values of a single column (a Series). It returns an array (a NumPy array for ordinary columns) containing the distinct values in the order they first appear.
```python
data["column_name"].unique()
```

In [10]:
titanic_data["Sex"].unique() # This way we can see the unique values of a column

array(['male', 'female'], dtype=object)

### Nominal and Ordinal variables

There are two main types of categorical variables: nominal and ordinal.

- **Nominal Variables**: Nominal variables represent categories or groups that **have no inherent order** or ranking. Each category is distinct and independent, without any numerical or hierarchical relationship between them. Examples of nominal variables include gender (male, female), ethnicity (Asian, African American, Caucasian), and marital status (single, married, divorced).

- **Ordinal Variables**: Ordinal variables, on the other hand, represent categories that **have a natural order or ranking**. The categories possess a qualitative relationship of "more" or "less" compared to others but do not have a consistent or measurable difference between them. Examples of ordinal variables include rating scales (such as Likert scales), educational levels (e.g., high school, bachelor's degree, master's degree), and satisfaction levels (e.g., very satisfied, satisfied, neutral, dissatisfied, very dissatisfied).

Which categorical variables are nominal and which are ordinal?

In [11]:
# We saw that to access a single column, you can use df['column_name'].
# To access many columns in a DataFrame, you can pass a list of column names using df[column_list].
categorical_vars = ["Survived","Pclass","Sex","Embarked"]
titanic_data[categorical_vars]

,Survived,Pclass,Sex,Embarked
0,0,3,male,S
1,1,1,female,C
2,1,3,female,S
3,1,1,female,S
4,0,3,male,S
...,...,...,...,...
886,0,2,male,S
887,1,1,female,S
888,0,3,female,S
889,1,1,male,C


Looking at the Titanic dataset, or just considering the meaning of the variables and the values they take, the categorical variables can be classified as nominal or ordinal as follows:

- Nominal Variables:

    - Sex: It has two categories: male and female. This variable is nominal since there is no inherent order or ranking between the categories.

    - Embarked: Indicates the port of embarkation. It has three categories: C (Cherbourg), Q (Queenstown), and S (Southampton). This variable is also nominal since the categories represent distinct locations with no inherent order.
    
    - Survived: While it represents two distinct categories, there is no inherent order or ranking between them. Therefore, it does not possess an ordinal nature.

- Ordinal Variable:

    - Pclass: Represents the passenger class. It has three categories: 1 (first class), 2 (second class), and 3 (third class). This variable is ordinal because the categories have a natural order (first class ranks above second, and second above third), even though the difference between classes is not a measurable quantity.
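The ordinal nature of "Pclass" can be made explicit in pandas with an ordered categorical dtype. The sketch below is one possible encoding, using the five "Pclass" values from the `head()` output. Listing the categories as 3, 2, 1 (lowest to highest class) is an assumption made here so that first class compares as the highest; the opposite direction would be equally valid if stated clearly.

```python
import pandas as pd

# Pclass values of the five rows shown by head() (illustration only)
pclass = pd.Series([3, 1, 3, 1, 3], name="Pclass")

# Categories listed from lowest to highest: 3rd class < 2nd class < 1st class
class_order = pd.CategoricalDtype(categories=[3, 2, 1], ordered=True)
pclass_ordered = pclass.astype(class_order)

# Order-aware operations now follow the class ranking, not the numeric label
highest_class = pclass_ordered.max()
is_above_third = pclass_ordered > 3

print(highest_class)
print(is_above_third.tolist())
```

A nominal variable such as "Sex" would use an unordered category instead, for which comparisons like `>` are not defined.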

# Exercises

1. Exploring Variables in the Student Performance Dataset

**Objective**: Identify the variables in the Student Performance dataset and classify them as numerical or categorical, as well as determine if they are continuous or discrete, and whether they are ordinal or nominal.

**Dataset Description**:
The Student Performance dataset contains information about students' performance in exams. It includes various attributes such as gender, ethnicity, parental level of education, test scores, and more.

**Instructions**:
- Load the Student Performance dataset into a DataFrame called df.
- Examine the dataset and the available columns.
- For each column, determine its data type and classify it accordingly:

    a) Identify the numerical variables and specify if they are continuous or discrete.

    b) Identify the categorical variables and specify if they are ordinal or nominal.

Remember to consider the nature of the variables, their values, and the context of the dataset when classifying them. Some variables may require further examination or interpretation to determine their exact classification.

In [14]:
import pandas as pd

# Dataset source URL
url = "https://raw.githubusercontent.com/data-bootcamp-v4/prework_data/main/students_performance.csv"

# Load the dataset into a DataFrame called df
df = pd.read_csv(url)

# Examine the first rows and the available columns
df.head()

# Additional Content: Data types and pandas

In Python, we can use the pandas attribute `dtypes` and the method `select_dtypes` to analyze data types.

In terms of data types:

- `int` (integer): Represents whole numbers without decimal places. For example, the "Pclass" variable in the Titanic dataset, representing passenger class.
- `float` (floating-point): Represents numbers with decimal places. For example, the "Fare" variable in the Titanic dataset, representing the fare paid by passengers.
- `object`: Represents non-numeric data types, such as strings. For example, the "Name" variable in the Titanic dataset, containing passenger names.

## dtypes

The dtypes attribute in pandas is used to retrieve the data types of columns in a DataFrame. It provides information about the data type of each column, allowing you to understand how the data is stored and processed.

In [15]:
# Get the data types of all columns
titanic_data.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

## select_dtypes

The `select_dtypes` method in pandas allows you to select columns from a DataFrame based on their data types. It helps you filter and retrieve specific columns that match the desired data types.

The syntax for select_dtypes is as follows:
```python
df.select_dtypes(include=None, exclude=None)
```

- include: Specifies the data types to include. It accepts a string or a list of strings representing the desired data types. For example, include='object' will select columns with the object data type.
- exclude: Specifies the data types to exclude. It also accepts a string or a list of strings representing the data types to be excluded. For example, exclude=['int64', 'float64'] will exclude columns with the int64 and float64 data types.

At least one of include or exclude must be provided: calling select_dtypes with neither argument raises a ValueError.

In [16]:
# Select columns with specific data types, in this case, with numeric data types, which are typically numerical variables
numerical_variables = titanic_data.select_dtypes(include=['int64', 'float64'])

In [None]:
numerical_variables

In [17]:
# Selecting columns with object data type, which are typically categorical variables
categorical_variables = titanic_data.select_dtypes(include=['object'])
categorical_variables

,Name,Sex,Ticket,Cabin,Embarked
0,"Braund, Mr. Owen Harris",male,A/5 21171,,S
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,PC 17599,C85,C
2,"Heikkinen, Miss. Laina",female,STON/O2. 3101282,,S
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,113803,C123,S
4,"Allen, Mr. William Henry",male,373450,,S
...,...,...,...,...,...
886,"Montvila, Rev. Juozas",male,211536,,S
887,"Graham, Miss. Margaret Edith",female,112053,B42,S
888,"Johnston, Miss. Catherine Helen ""Carrie""",female,W./C. 6607,,S
889,"Behr, Mr. Karl Howell",male,111369,C148,C
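As a variation on the cells above, `select_dtypes` also accepts the generic selector `"number"`, which matches every numeric dtype without listing `int64` and `float64` by name. The sketch below uses a hand-typed three-column `sample` DataFrame (values from the `head()` output) for illustration.

```python
import pandas as pd

# Three columns of the five rows shown by head() (illustration only)
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# "number" matches all numeric dtypes (integers and floats alike)
numeric_columns = sample.select_dtypes(include="number").columns.tolist()

# exclude="number" keeps everything that is not numeric
non_numeric_columns = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_columns)
print(non_numeric_columns)
```

Note that "Pclass" lands in the numeric group here, even though it was classified as categorical earlier in this lesson.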


## Important observation

It is important to note that **we cannot solely rely on the assumption that `object` data type indicates categorical variables, while `int` or `float` data types indicate numerical variables**. To ensure accuracy, let's compare these results with our previous analysis and examine any differences. 

Keep in mind that:

- **Numbers can indicate categories**: The presence of numbers in a column does not necessarily mean it represents a numerical variable. We can use the `nunique()` method to determine the number of unique values in a column. If the number of unique values is small, it suggests that the variable is categorical, even if it is represented as a number. It is crucial to understand the variable's meaning and context to correctly interpret its data type.

- **Errors in data can affect data types**: In certain cases, errors or inconsistencies in the data can lead to a numerical variable being classified as an object data type. For example, if an "Age" column contains a space or a value like "h25", it would be stored as an object data type instead of an int or float. This can be misleading and may lead to confusion if the variable is incorrectly assumed to be categorical.

Therefore, it is essential to consider the unique values, data information, and the intended meaning of a variable when determining its data type, rather than solely relying on the data type itself.
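To illustrate the second point, the sketch below builds a hypothetical messy "Age" column (the values "h25" and " " are made-up entry errors, not taken from the Titanic data) and uses `pd.to_numeric` with `errors="coerce"` to convert what can be converted and locate what cannot.

```python
import pandas as pd

# Hypothetical messy column: "h25" and " " are invented entry errors
raw_age = pd.Series(["22", "38", "h25", " ", "35"], name="Age")
print(raw_age.dtype)  # not a numeric dtype, because the values are strings

# errors="coerce" turns values that cannot be parsed into NaN instead of raising
clean_age = pd.to_numeric(raw_age, errors="coerce")

# The original entries that failed to convert, for manual inspection
bad_entries = raw_age[clean_age.isna()]

print(clean_age)
print(bad_entries.tolist())
```

Inspecting `bad_entries` before dropping or correcting them helps avoid silently discarding data that was only mistyped.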