# Descriptive Statistics

In this hands-on lesson, we will explore descriptive statistics using the Titanic dataset, a popular dataset for introductory data analysis.

This lesson covers numerical techniques. The following lesson covers graphical techniques.

## Quick look at the dataset

In [1]:
# Let's start by importing pandas and reading the titanic dataset
import pandas as pd

# We can read from an online URL
titanic_data = pd.read_csv('https://raw.githubusercontent.com/data-bootcamp-v4/prework_data/main/titanic.csv')

Here is a description of the titanic dataset variables to get a better understanding of the data:
- PassengerId: An ID that identifies each passenger
- Survived: Whether the passenger survived (0 = No, 1 = Yes)
- Pclass: Passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
- Name: Passenger's name
- Sex: Passenger's gender (male or female)
- Age: Passenger's age in years
- SibSp: Number of siblings/spouses aboard the Titanic
- Parch: Number of parents/children aboard the Titanic
- Ticket: Ticket number
- Fare: Passenger fare
- Cabin: Cabin number
- Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

In [2]:
# Let's start by gaining a general understanding of the dataset.
titanic_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


As we saw in the previous lesson, we consider the following as numerical or quantitative variables:

- Age: continuous (it can also be considered discrete)
- SibSp: discrete
- Parch: discrete
- Fare: continuous

The following variables are considered categorical variables:

 
- Survived
- Pclass
- Sex
- Embarked

'Sex', 'Embarked', and 'Survived' are considered nominal variables, and 'Pclass' is considered an ordinal variable. While we won't delve into the importance of this distinction at the moment, we will explore it further during the bootcamp to understand its significance in data analysis.

For now, we will ignore PassengerId, Name, Ticket, and Cabin.

## Descriptive statistics for numerical variables

### Numerical Univariate Techniques

#### Measures of Central Tendency

Used for numerical/quantitative variables.

- **Mean**: The **average** value of a variable, calculated by summing all values and dividing by the total number of observations.
	- Mostly used for: numerical/quantitative *continuous* variables.
- **Median**: The **middle value** in a dataset when arranged in ascending or descending order.
	- Mostly used for: numerical/quantitative *continuous* variables.
- **Mode**: The **most frequent** value or values in a dataset.
	- Mostly used for: numerical/quantitative *discrete* variables.
	


<div style="text-align:center">
    <img src="https://github.com/data-bootcamp-v4/prework_img/blob/main/Median-Mode-Mean-and-Range-1.jpg?raw=true" alt="Image" style="width:40%;">
</div>

*Source: [k8schoollessons](https://k8schoollessons.com/median-mode-mean-and-range/)*

*Range is a measure of dispersion; we will talk about it after the measures of central tendency.*

Let's look at a quick example before continuing with our Titanic dataset:

In [3]:
# Mean: add all the numbers and divide by the amount of numbers.
ages = [18, 19, 16, 19, 20, 20, 21, 19, 22]
mean = sum(ages)/len(ages)
mean

19.333333333333332

In [4]:
# Median: sort the numbers; the median is the one in the middle.
ages.sort()
ages  # Looking at the sorted list, the middle value (5th of 9) is 19

[16, 18, 19, 19, 19, 20, 20, 21, 22]

In [5]:
# Mode: the most common number
# Just by looking at the list, we can see it is 19, since it appears 3 times and
# no other number appears that often

Now, what happens if we add two huge numbers at the end of the list?

In [6]:
ages = [18, 19, 16, 19, 20, 20, 21, 19, 22, 110, 100]
mean = sum(ages)/len(ages)
mean

34.90909090909091

In [7]:
# Median: sort the numbers; the median is the one in the middle.
ages.sort()
ages  # Looking at the sorted list, the middle value (6th of 11) is 20

[16, 18, 19, 19, 19, 20, 20, 21, 22, 100, 110]

In [8]:
# Mode: the most common number
# Just by looking at the list, we can see it is still 19, since it appears 3 times and
# no other number appears that often

When the dataset contains extreme values, the mean changes substantially (from about 19.3 to about 34.9), while the median only moves from 19 to 20 and the mode stays at 19. This shows that the **mean is sensitive to extreme values or outliers**, while the **median and mode are more resistant to such values**. Therefore, when analyzing data that may contain outliers, it is advisable to look at the median and mode alongside, or instead of, the mean, since they give a more robust picture of the central tendency.
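
The comparison above can also be reproduced without reading the sorted lists by eye. This is a minimal sketch using Python's built-in `statistics` module, which is not used elsewhere in this lesson; the variable names are illustrative.

```python
import statistics

# The two lists from the example above: without and with the extreme values
ages = [18, 19, 16, 19, 20, 20, 21, 19, 22]
ages_with_outliers = ages + [110, 100]

for label, values in [("original", ages), ("with outliers", ages_with_outliers)]:
    mean = statistics.mean(values)      # sum of the values divided by their count
    median = statistics.median(values)  # middle value of the sorted list
    mode = statistics.mode(values)      # most frequent value
    print(f"{label}: mean={mean:.2f}, median={median}, mode={mode}")
```

Only the mean moves noticeably between the two lists, which matches what we saw when computing the values step by step.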

Let's look at measures of central tendency for the numerical variables in the Titanic dataset.

We saw in the *Introduction to Pandas* lesson that we could calculate the `mean` of a variable by doing:
```python
data["column_name"].mean()
```

In [9]:
titanic_data["Age"].mean()

29.69911764705882

We also saw that to access a single column, you can use 

```python
df['column_name']
```

and to access many columns in a DataFrame, you can pass a list of column names using 
```python
df[column_list]
```


In [10]:
titanic_data[["Age","Fare"]].mean() # This way we calculate at the same time the mean for Age and Fare

Age     29.699118
Fare    32.204208
dtype: float64

By obtaining the list of numeric variables, we can calculate the mean of multiple variables simultaneously using just a single line of code.

In [11]:
numerical_variables = ["Age","SibSp","Parch","Fare"]

In [12]:
titanic_data[numerical_variables].mean()

Age      29.699118
SibSp     0.523008
Parch     0.381594
Fare     32.204208
dtype: float64

The means for the variables in the Titanic dataset can be interpreted as follows:

- Age: The average age of the passengers aboard the Titanic is approximately 29.7 years. This means that, on average, the passengers were around 29.7 years old.

- SibSp: The average number of siblings/spouses aboard the Titanic is approximately 0.52. This indicates that, on average, passengers had slightly more than half a sibling/spouse accompanying them on the ship.

- Parch: The average number of parents/children aboard the Titanic is approximately 0.38. This means that, on average, passengers had less than one parent/child accompanying them on the ship.

- Fare: The average fare paid by the passengers is approximately 32.20 dollars. This suggests that, on average, passengers paid around $32.20 for their tickets.

These interpretations provide insights into the central tendency or average values of the respective variables in the Titanic dataset.

For continuous variables like Age and Fare, the mean provides a meaningful interpretation. However, for discrete variables like SibSp (number of siblings/spouses aboard the Titanic) and Parch (number of parents/children aboard the Titanic), the mean may not have a practical interpretation. It doesn't make sense to have fractions or decimals representing the number of siblings/spouses or parents/children aboard the Titanic (e.g., 0.52 or 0.38).


We can apply many other operations to numerical columns, such as `median()`, `mode()`, `sum()`, `min()`, `max()`, `std()`, and `var()`, the same way we did with `mean()`.

In [13]:
titanic_data[numerical_variables].median()

Age      28.0000
SibSp     0.0000
Parch     0.0000
Fare     14.4542
dtype: float64

- Age: The median age of the passengers aboard the Titanic is 28 years. This means that about half of the passengers were younger than 28 and about half were older. The mean age of approximately 29.7 is slightly higher than the median, which suggests that some older passengers pull the mean upwards.

- SibSp: The median number of siblings/spouses aboard the Titanic is 0. This means that at least half of the passengers did not have any siblings or spouses accompanying them. The mean (approximately 0.52) is higher than the median because some passengers travelled with one or more siblings or spouses.

- Parch: The median number of parents/children aboard the Titanic is 0. This means that at least half of the passengers did not have any parents or children accompanying them. As with SibSp, the mean (approximately 0.38) is higher than the median because some passengers had parents or children on board.

- Fare: The median fare paid by the passengers is approximately 14.45 dollars. This means that 50% of the passengers paid less than or equal to this amount for their tickets. The mean fare of approximately $32.20 is more than twice the median, indicating the presence of a few passengers who paid **significantly** higher fares.

In [14]:
titanic_data[numerical_variables].mode()

| | Age | SibSp | Parch | Fare |
|---|---|---|---|---|
| 0 | 24.0 | 0 | 0 | 8.05 |


The mode, which represents the most frequently occurring value, provides more meaningful information for discrete variables compared to continuous variables. In the case of the Titanic dataset:

- SibSp (number of siblings/spouses aboard): The mode of 0 means that travelling without siblings or spouses was the most common situation. The median of 0 already told us that this applies to at least half of the passengers.

- Parch (number of parents/children aboard): Similarly, the mode of 0 means that travelling without parents or children was the most common situation, which again agrees with the median of 0.

Based on the mean, median, and mode values of the numeric variables in the dataset, we can draw the following general conclusions:

- Age: The age distribution appears to be centered around the late 20s to early 30s range.
- SibSp (number of siblings/spouses aboard): The majority of passengers (mode: 0) did not have any siblings or spouses aboard.
- Parch (number of parents/children aboard): The mode of 0 indicates that most passengers did not have parents or children aboard.
- Fare: The mean is higher than the median, which suggests that a small number of high fares pull the average up.

These general conclusions provide insights into the central tendency and distribution of the numeric variables in the dataset.

#### Measures of Dispersion
Used for numerical/quantitative variables.

- **Range**: The **difference between the maximum and minimum** values in a dataset.
- **Variance**: A measure of **how far each value in the dataset deviates from the mean**, indicating the **spread** of the data. A higher variance indicates that the data points are more dispersed or spread out from the mean. 
	- Mostly used for: numerical/quantitative *continuous* variables.
- **Standard Deviation**: The Standard Deviation is also a measure of how spread out numbers are. It is the square root of the variance. The standard deviation is easier to interpret since it is expressed in the same unit as the data. Like variance, a higher standard deviation indicates that the data points are more spread out from the mean.
	- Mostly used for: numerical/quantitative *continuous* variables.
    
*These measures and others, such as the interquartile range, will be studied during the bootcamp.*
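
To make these definitions concrete, here is a small sketch that computes the three measures by hand for the `ages` list used earlier in this lesson. It assumes the sample formula for the variance (dividing by n - 1), which is also what pandas uses by default in `var()` and `std()`; the variable names are illustrative.

```python
import math

ages = [18, 19, 16, 19, 20, 20, 21, 19, 22]

# Range: difference between the largest and the smallest value
value_range = max(ages) - min(ages)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
mean = sum(ages) / len(ages)
variance = sum((x - mean) ** 2 for x in ages) / (len(ages) - 1)

# Standard deviation: square root of the variance, in the same unit as the data
std_dev = math.sqrt(variance)

print(value_range, round(variance, 2), round(std_dev, 2))
```

For this list the range is 6 years, the variance is 3 (in squared years), and the standard deviation is about 1.73 years, which shows why the standard deviation is the easier of the two to read.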
	


Let's look at a simple example:

<div style="text-align:center">
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f9/Comparison_standard_deviations.svg/1224px-Comparison_standard_deviations.svg.png" alt="Image" style="width:40%;">
</div>

*Source: [wiki.kidzsearch.com](https://wiki.kidzsearch.com/wiki/File:Comparison_standard_deviations.svg)*

In the image, we have two sample populations with the same mean but different standard deviations. The red population has a mean of 100 and a standard deviation of 10, while the blue population has a mean of 100 and a standard deviation of 50.

From this example, we can observe that simply looking at the mean is not sufficient to fully understand the data. We can see that the blue data points are more spread out compared to the red data points. This aligns with our understanding of standard deviation, as the blue population has a higher standard deviation, indicating greater variability or dispersion in the data. Therefore, the standard deviation provides valuable information about the spread of the data, complementing the mean as a measure of centrality.
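
The same effect can be simulated. The sketch below draws two hypothetical samples with the parameters from the image (mean 100, standard deviations 10 and 50) using Python's `random` and `statistics` modules. The sample size and the seed are arbitrary choices and are not taken from the figure.

```python
import random
import statistics

random.seed(42)  # arbitrary seed so the run is repeatable

# Two simulated populations with the same mean but a different spread
red = [random.gauss(100, 10) for _ in range(10_000)]
blue = [random.gauss(100, 50) for _ in range(10_000)]

for name, sample in [("red", red), ("blue", blue)]:
    print(name,
          "mean:", round(statistics.mean(sample), 1),
          "std:", round(statistics.stdev(sample), 1))
```

Both means should come out close to 100, while the standard deviation of the blue sample should be roughly five times that of the red one.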

If you want more information about this, take a look at [this page](https://www.mathsisfun.com/data/standard-deviation.html).

Let's look at dispersion measures for the Titanic dataset.

In [15]:
# Calculate standard deviation
titanic_data[numerical_variables].std()

Age      14.526497
SibSp     1.102743
Parch     0.806057
Fare     49.693429
dtype: float64

In [16]:
# Calculate range
titanic_data[numerical_variables].max() - titanic_data[numerical_variables].min()

Age       79.5800
SibSp      8.0000
Parch      6.0000
Fare     512.3292
dtype: float64

In [17]:
# Calculate variance
titanic_data[numerical_variables].var()

Age       211.019125
SibSp       1.216043
Parch       0.649728
Fare     2469.436846
dtype: float64

- For continuous variables like Age and Fare, the variance, standard deviation, and range make sense as they provide information about the spread or variability of the data. In this case, we can see that the Fare variable has a much larger variance, standard deviation, and range compared to the Age variable, indicating a wider range of values and greater variability in fares.
    - For the variable "Age," the standard deviation of 14.53 indicates that the ages in the dataset typically differ from the mean age of 29.70 by around 14.5 years.
    - For the variable "Fare," the standard deviation of 49.69 indicates that fares typically differ from the mean fare of 32.20 by around 49.69 units, where the unit is the currency (dollars). The standard deviation is larger than the mean here, which is consistent with a few very high fares stretching the distribution.

    In both cases, the standard deviation measures how far the data points typically lie from the mean. It describes the variability or spread of the data around the average.

- For discrete variables like SibSp and Parch, the variance and standard deviation may not provide meaningful insights since they are primarily used for continuous data. Instead, the range can give us an idea of the extent of values for these variables. We can see that both SibSp and Parch have relatively small ranges, indicating that the majority of values are concentrated within a narrower range.

We can use the `describe()` method to obtain a summary of these statistics.

```python
# Generate descriptive statistics for numerical variables
df.describe()
```

The `describe()` method provides statistics such as count, mean, standard deviation, minimum, quartiles, and maximum values for the numerical variables. This information gives us insights into the central tendency, dispersion, and range of these variables.

In [18]:
# Display summary statistics for each numerical variable
titanic_data[numerical_variables].describe()

| | Age | SibSp | Parch | Fare |
|---|---|---|---|---|
| count | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 80.0 | 8.0 | 6.0 | 512.3292 |


Note that in the describe table, the value corresponding to the median is labeled as "50%". This represents the median value of the dataset. The label "50%" is used because the median divides the dataset into two equal parts, with 50% of the data falling below the median and 50% above it.

The values labeled as "25%" and "75%" in the describe table represent the value below which 25% of the data falls, and the value below which 75% of the data falls, respectively. Further discussion on the values labeled as "25%" and "75%" will be covered during the bootcamp, where we will explore their significance and interpretation in descriptive statistics. 
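
As a small illustration of these labels, the sketch below computes the three quartiles for the `ages` list from earlier in this lesson with `statistics.quantiles` from the standard library. The `method="inclusive"` setting interpolates linearly between the sorted data points; we assume here that this corresponds to what `describe()` does by default, so treat the comparison as indicative.

```python
import statistics

ages = [18, 19, 16, 19, 20, 20, 21, 19, 22]

# n=4 splits the data into four parts and returns the three cut points
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")

print("25%:", q1)
print("50%:", q2, "| median:", statistics.median(ages))
print("75%:", q3)
```

The middle cut point is the same value as the median, which is why the median appears under the "50%" label.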

## Descriptive statistics for categorical variables

### Numerical Univariate Techniques
- **Frequency Counts**: used for categorical/qualitative variables.
	- Counting the **number of occurrences of each value or category** within a dataset, providing insights into the distribution of values.



The `value_counts()` method is especially useful for categorical variables as it provides insights into the frequency or occurrence of each category in the dataset. 

To use it simply do:
```python
data["column_name"].value_counts()
```

In [19]:
# Remember we can access columns using [] and that column names are case sensitive
titanic_data['Sex'].value_counts() # Calculate frequency counts for a categorical variable

Sex
male      577
female    314
Name: count, dtype: int64

In [20]:
# Let's define a list of categorical variables
categorical_variables = ["Survived", "Pclass", "Sex", "Embarked"]

In [21]:
# If we do the same as for the numerical variables, the result is not what we wanted
titanic_data[categorical_variables].value_counts()

Survived  Pclass  Sex     Embarked
0         3       male    S           231
          2       male    S            82
1         2       female  S            61
0         3       female  S            55
          1       male    S            51
1         1       female  S            46
                          C            42
0         3       male    Q            36
1         3       male    S            34
0         3       male    C            33
1         3       female  S            33
          1       male    S            28
0         1       male    C            25
1         3       female  Q            24
          1       male    C            17
          3       female  C            15
          2       male    S            15
          3       male    C            10
0         3       female  Q             9
                          C             8
          2       male    C             8
1         2       female  C             7
0         2       female  S             6

This is because calling `value_counts()` on several columns at once counts each combination of values across those columns. In this case, we want the value counts for each variable separately.

We can do this by calculating `value_counts()` for each variable, writing a line of code for each variable, or by iterating through the list of categorical variables like this:

In [22]:
# Display unique values and their frequency counts for each categorical variable
for column in categorical_variables:
    print(titanic_data[column].value_counts())

Survived
0    549
1    342
Name: count, dtype: int64
Pclass
3    491
1    216
2    184
Name: count, dtype: int64
Sex
male      577
female    314
Name: count, dtype: int64
Embarked
S    644
C    168
Q     77
Name: count, dtype: int64


From the value counts of the categorical variables in the Titanic dataset, we can draw the following insights:

- Survived: Out of the 891 passengers, 342 (38.38%) survived, while 549 (61.62%) did not survive. This indicates that the majority of passengers did not survive the Titanic disaster.

- Pclass: The passengers were categorized into three classes - 1, 2, and 3. The majority of passengers (491 or 55.11%) were in the third class, followed by 216 (24.24%) in the first class and 184 (20.65%) in the second class. This suggests that the majority of passengers belonged to the lower class.

- Sex: The dataset consists of 577 male passengers (64.76%) and 314 female passengers (35.24%). This indicates a higher number of male passengers compared to female passengers.

- Embarked: The passengers boarded the Titanic from three different ports - S (Southampton), C (Cherbourg), and Q (Queenstown). The majority of passengers (644 or 72.28%) boarded from Southampton, followed by 168 (18.86%) from Cherbourg and 77 (8.64%) from Queenstown. These three counts add up to 889 rather than 891, so two passengers have no recorded port of embarkation, which is why the percentages do not sum to 100%.

*Note that we calculate the percentages by dividing each count by the total number of rows, 891, and multiplying by 100.*
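
The percentage calculation described in the note can be sketched as follows. To keep the example self-contained, it rebuilds a stand-in `Sex` column from the counts shown above (577 male, 314 female) instead of downloading the dataset; the variable names are illustrative.

```python
import pandas as pd

# Stand-in column rebuilt from the counts reported above
sex = pd.Series(["male"] * 577 + ["female"] * 314, name="Sex")

counts = sex.value_counts()             # absolute frequencies
percentages = counts / len(sex) * 100   # divide by the number of rows, times 100

print(percentages.round(2))
```

`value_counts(normalize=True)` returns proportions directly, but by default it leaves missing values out of the denominator, so for a column with missing entries such as `Embarked` it would not match a division by the total number of rows.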

## Additional Analysis

You can further explore the dataset by calculating statistics for specific subsets of the data and analyzing correlations between variables.

We will look at this during the bootcamp.

## Exercise

Exploring Students' Performance Dataset - Numerical Techniques

**Objective**: The objective of this exercise is to practice using **numerical techniques** to analyze the Students' Performance dataset and gain insights into the students' academic performance.

**Dataset Description**:
The Students' Performance dataset contains information about students' demographic attributes, such as gender, race/ethnicity, parental education, lunch type, and test scores in three subjects: Math, Reading, and Writing.

**Exercise Steps**:

- Load the Dataset: Import the necessary libraries and load the Students' Performance dataset into a pandas DataFrame. 

- Explore the Dataset: Use basic pandas functions to get an overview of the dataset, including the number of rows and columns, and the number of unique values for each column. For those columns that have fewer than 10 distinct values, show those unique values. *Hint: look at the previous lesson. There you can find the functions or methods you need to use.*

- Analyze Descriptive Statistics: Calculate and interpret descriptive statistics, including measures of central tendency (mean, median, mode) and dispersion (standard deviation, range) for the numerical variables, and frequency counts for categorical variables. 
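
For the second step, one possible way to find the columns with few distinct values without listing them by hand is sketched below. It uses a tiny made-up DataFrame (`sample_df`, hypothetical, not the real dataset) and a threshold of 3 so the output stays short; with the real dataset you would apply the same pattern to your own DataFrame and use 10.

```python
import pandas as pd

# Tiny made-up DataFrame standing in for the real dataset
sample_df = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "lunch": ["standard", "standard", "free/reduced", "standard"],
    "math score": [72, 69, 90, 47],
})

threshold = 3  # use 10 for the exercise

# nunique() returns the number of distinct values per column
unique_counts = sample_df.nunique()
few_unique_columns = unique_counts[unique_counts < threshold].index.tolist()

for column in few_unique_columns:
    print(column, "->", sample_df[column].unique())
```

Selecting the columns from `nunique()` avoids typing the column names by hand and keeps working if the dataset changes.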

In [4]:
import pandas as pd

# Dataset source URL
url = "https://raw.githubusercontent.com/data-bootcamp-v4/prework_data/main/students_performance.csv"

# Load the dataset directly from the URL defined above
df = pd.read_csv(url)

# Optional: keep a local copy (relative path, so it works on any machine)
df.to_csv("students_performance.csv", index=False)

print("Number of rows:", df.shape[0])
print("Number of columns:", df.shape[1])
print("Number of unique values:")
print(df.nunique())
print()

columns_with_few_unique_values = ["gender","race/ethnicity","parental level of education","lunch","test preparation course"]
for column in columns_with_few_unique_values:
    unique_values = df[column].unique()
    print(f"Unique values for {column}:")
    print(unique_values)
    print()

numerical_variables = ['math score', 'reading score', 'writing score']
print("Descriptive statistics for numerical variables:")
print(df[numerical_variables].describe())
print()

categorical_variables = ['gender', 'race/ethnicity', 'parental level of education', 'lunch', "test preparation course"]
print("Frequency counts for categorical variables:")
for column in categorical_variables:
    frequency_counts = df[column].value_counts()
    print(frequency_counts)
    print()

Number of rows: 1000
Number of columns: 8
Number of unique values:
gender                          2
race/ethnicity                  5
parental level of education     6
lunch                           2
test preparation course         2
math score                     81
reading score                  72
writing score                  77
dtype: int64

Unique values for gender:
['female' 'male']

Unique values for race/ethnicity:
['group B' 'group C' 'group A' 'group D' 'group E']

Unique values for parental level of education:
["bachelor's degree" 'some college' "master's degree" "associate's degree"
 'high school' 'some high school']

Unique values for lunch:
['standard' 'free/reduced']

Unique values for test preparation course:
['none' 'completed']

Descriptive statistics for numerical variables:
       math score  reading score  writing score
count  1000.00000    1000.000000    1000.000000
mean     66.08900      69.169000      68.054000
std      15.16308      14.600192      15.19565