# Data Literacy

> Gartner defines data literacy as the ability to read, write and communicate data in context, including an understanding of data sources and constructs, analytical methods and techniques applied — and the ability to describe the use case, application and resulting value.
>
> This all boils down to a simple question, “Do you speak data?” 

_Source: [A Data and Analytics Leader’s Guide to Data Literacy](https://www.gartner.com/smarterwithgartner/a-data-and-analytics-leaders-guide-to-data-literacy/)_


Data literacy is the basic competence to **examine, explore, interpret and reason from data**. Developing data literacy is the foundation of a data-driven organization. Data literacy eventually enables success stories in data science and artificial intelligence. For the aspiring data scientist, it is the basic requirement on which to build more advanced skills. However, not just data experts, but many stakeholders and decision makers can benefit from improving their data literacy.

This notebook gives an overview of data literacy skills but is not comprehensive. The best way to improve and broaden your skills is to start working with data: Gather experience from various projects and discuss them with colleagues.

## Preamble

In [None]:
# A small Python module that provides some basic helper functionality.
import data_science_learning_paths

In [None]:
data_science_learning_paths.setup_plot_style()

## What is Data Science?

![](graphics/data_science.png)

The essential ingredients of a successful data science project are **software design**, **statistics** and **domain knowledge**. If you leave out one of these skills, you either 
- find yourself in the realm of **Machine Learning** (software design and statistics only), without a relevant use case or with useless results,
- end up in **traditional research**, if you only combine statistics and domain knowledge,
- or find yourself in the **danger zone**, if you leave out the statistical understanding and the ability to validate your results properly. 

Thus, you need a **combination of all three** areas of expertise. This does not mean that you must find a single person who combines all those skills, but rather a **well-mixed team of experts** who are able to challenge one another and can thus successfully implement projects.

Here, we will focus on the most relevant statistical skills one has to bring into data science projects.

## How to Explore Data

What do we do if we are handed a new, unknown data set? We go and explore.

**Exploratory Data Analysis (EDA)** aims to summarize a data set's main characteristics, often with visual methods. It is generally recommended as a preliminary stage to other, more goal-directed types of data analysis, such as **hypothesis testing** and **modelling**.

The Python ecosystem offers powerful tools for data analysis, and here we import some of them: 
- The `numpy` library for working with large arrays and generating random distributions.
- The `pandas` library for handling tabular data.
- The `matplotlib` library for plotting and customizing graphics.
- The `seaborn` library for statistical graphics.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn

Here, we focus on the output and stay at a high level. Dedicated notebooks at a later stage of the course go into more detail.

Let's start with a dataset. For more information on the dataset, see the Kaggle page [House Sales in King County, USA](https://www.kaggle.com/harlfoxem/housesalesprediction).

In [None]:
data, data_descr = data_science_learning_paths.datasets.read_house_prices_seattle()

In [None]:
# Show data
data

In [None]:
# Data types
data.dtypes

In [None]:
# column descriptions
data_descr

## Measurement Scales


Variables in a data set can have different _measurement scales_ or _levels of measurement_. They tell us something about the nature of the information stored in the values of a variable. Statistics traditionally defines the following four scales:

- **nominal**: A _nominal_ variable is a set of labels that are mutually exclusive and have no numeric meaning. They can denote a **category** into which the data item belongs. Classic example: Gender - people might identify as male, female, or nonbinary on a questionnaire.
- **ordinal**: In an _ordinal_ variable, the values are **ordered** (>) - their order is meaningful, but the exact difference between them is not defined. Classic example: School grades - an A is better than an F.
- **interval**: With an _interval scale_, the **difference** of numeric values is meaningful (however, not their ratio). Classic example: Temperature in degrees Celsius - 20 °C is 10 degrees warmer than 10 °C, but not "twice as warm".
- **ratio**: A variable on a _ratio scale_ has an **absolute zero** point, so ratios are meaningful. Classic example: Temperature in Kelvin - 20 K is indeed twice as warm as 10 K.



Let us examine our data set and find an example for each measurement scale.

In [None]:
data.dtypes

Note: Measurement scales may not be immediately visible from the *data types* of the variables. An integer may encode a nominal, ordinal, interval or ratio scale variable. 
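As a minimal sketch of this point, the following toy example (hypothetical values, not taken from the house price data) uses the pandas `Categorical` type to make the scale explicit. An ordered categorical supports comparisons, whereas an unordered one only supports counting and equality checks.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
# Ordinal: school grades with a meaningful order from worst ("F") to best ("A").
grades = pd.Series(
    pd.Categorical(
        ["C", "A", "B", "A", "F"],
        categories=["F", "D", "C", "B", "A"],
        ordered=True,
    )
)
# Nominal: colours are just labels without any order.
colors = pd.Series(pd.Categorical(["red", "blue", "red"]))

best_grade = grades.max()  # defined, because the categories are ordered
better_than_c = int((grades > "C").sum())  # comparisons respect the category order
color_counts = colors.value_counts()  # counting works on every scale

print(best_grade, better_than_c, color_counts["red"])
```

Declaring the scale in this way is optional, but it lets pandas reject operations that are not meaningful, such as taking the maximum of an unordered category.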

**nominal**

Postal codes (`zipcode`) are mutually exclusive category labels on each of the houses.

In [None]:
data["zipcode"]

For example, we could count how many houses we have in each area:

In [None]:
data["zipcode"].value_counts().plot(kind="bar", figsize=(15, 4))

**ordinal**

The `grade` variable gives a quality rating to each house.

In [None]:
data["grade"]

Let's have a look at the distribution of ratings. We are taking care that the x-axis shows the grades in their proper order.

In [None]:
data["grade"].value_counts().sort_index().plot(kind="bar", title="grade")

**interval**

The year in which the building was built (`yr_built`) is an example of an interval-scaled variable: differences between years are meaningful, but their ratios are not.



In [None]:
data.iloc[:2]

How many years older is the second house than the first house?

In [None]:
data.iloc[0]["yr_built"] - data.iloc[1]["yr_built"]

In [None]:
data["yr_built"]

Have a look at the number of houses built in a specific time range.

In [None]:
data["yr_built"].plot(kind="hist", bins=10)

**ratio**

The house price (`price`) is clearly a ratio-scale variable: there is an absolute zero (0 USD), and we can meaningfully speak of a house being twice as expensive as another.

In [None]:
data["price"]

In [None]:
data.iloc[0]["price"] / data.iloc[1]["price"]

In [None]:
seaborn.histplot(data["price"])

Measurement scales are important to understand:
- they allow us to characterize the kind of information stored in a variable
- they influence which methods and operations can be meaningfully applied to a variable

### Remark:
Keep in mind that we have only looked at the raw data. Working with data always requires some **_manual work_**, where you have to apply your data literacy skills and domain know-how. For example: Do you have an idea how to turn the zipcode into an interval-scaled variable? This transformation might give you a much more meaningful variable for your analysis! 

## Descriptive Statistics

As mentioned above, exploring a new data set is essential for coming up with ideas for analysis, raising questions, and deriving use cases. Data literacy is more than knowing which kinds of plots and styles exist. One also has to know how a plot can be interpreted and which kind of plot represents one's findings best.

Besides visualisation, some statistical properties of your data set can help you interpret it. The field of **descriptive statistics** provides **measures** that describe the fundamental properties of a given data set.

Let's start small with a simple example. Think of a class of 30 students who all get lunch money every day. Let's assume at first that each of them gets 1 or 2 € per day.

In [None]:
# generate lunch-money distribution
student_count = 30
np.random.seed(0)
lunch_money_fair = np.random.randint(
    1,
    3,
    size=student_count,
)

In [None]:
lunch_money_fair

To visualize this, we plot the student ID (a number between 1 and 30) on the x axis and the respective lunch money as a bar whose height is the amount in €. We get a distribution that looks quite uniform, so we call it fair.

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(np.linspace(1, 30, 30), lunch_money_fair)
ax.set(xlabel="Student ID", ylabel="Lunch money", title="Distribution of lunch money")
plt.show()

Let's determine some properties. 
* For example, we can calculate the **arithmetic mean**, $\mu$, which is the sum of all elements divided by the number of elements. 
	* Mean: $\mu=\frac{1}{N}\sum_{i=1}^N x_i$
* The **variance**, $\sigma^2$, measures the overall spread of our data points as the average squared deviation from the mean. For a sample, the sum is divided by $N-1$ instead of $N$, which is what `ddof=1` selects in the code below.
	* Variance: $\text{Var}(x)=\sigma^2=\frac{1}{N-1}\sum_{i=1}^N (x_i-\mu)^2$
* The **standard deviation (std)**, $\sigma$, is the square root of the variance. It describes the spread around the mean in the same unit as the data.
	* Standard deviation: $\text{Std}(x)=\sigma=\sqrt{\sigma^2}=\sqrt{\frac{1}{N-1}\sum_{i=1}^N (x_i-\mu)^2}$

Furthermore, we can determine the **minimum** and **maximum** value of the lunch-money distribution to get a better feeling for the overall range of values.

We get the following values:

In [None]:
lunch_money_fair_mean = np.mean(lunch_money_fair)
lunch_money_fair_std = np.std(lunch_money_fair, ddof=1)
lunch_money_fair_max = np.max(lunch_money_fair)
lunch_money_fair_min = np.min(lunch_money_fair)
print(f"Mean: {np.round(lunch_money_fair_mean,2)} €")
print(f"Std: {np.round(lunch_money_fair_std,2)} €")
print(f"Min: {np.round(lunch_money_fair_min,2)} €")
print(f"Max: {np.round(lunch_money_fair_max,2)} €")

All students get almost the same amount of lunch money. The average is 1.57 € and the values spread by about ± 0.5 € around the mean, so the standard deviation is considerably smaller than the mean. By construction, the minimum value is 1 € and the maximum is 2 €.
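The definitions of mean, variance and standard deviation can be checked by hand on a tiny sample. The sketch below uses five hypothetical values and the sample convention (division by $N-1$), which is what `ddof=1` selects in NumPy.

```python
import numpy as np

# Hypothetical lunch money of five students in €.
x = np.array([1.0, 2.0, 2.0, 1.0, 2.0])
n = len(x)

mean = x.sum() / n
variance = ((x - mean) ** 2).sum() / (n - 1)  # sample variance (N - 1 in the denominator)
std = np.sqrt(variance)

print(f"mean={mean:.2f}, variance={variance:.2f}, std={std:.2f}")
# The NumPy helpers give the same numbers when ddof=1 is passed.
print(np.isclose(variance, np.var(x, ddof=1)), np.isclose(std, np.std(x, ddof=1)))
```

Without `ddof=1`, `np.var` and `np.std` divide by $N$ and return slightly smaller values.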

What happens if we _change the conditions "slightly"_? We now assume that one student gets 91 € and all the others get 1 € each.

In [None]:
lunch_money_unfair = np.append(np.array(91), np.ones(29))

In [None]:
lunch_money_unfair

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(np.linspace(1, 30, 30), lunch_money_unfair)
ax.set(xlabel="Student ID", ylabel="Lunch money", title="Distribution of lunch money")
plt.show()

We get a distribution with a clear peak for the first student. Let's see how the properties change:

In [None]:
lunch_money_unfair_mean = np.mean(lunch_money_unfair)
lunch_money_unfair_std = np.std(lunch_money_unfair, ddof=1)
lunch_money_unfair_max = np.max(lunch_money_unfair)
lunch_money_unfair_min = np.min(lunch_money_unfair)
print(f"Mean: {np.round(lunch_money_unfair_mean,2)} €")
print(f"Std: {np.round(lunch_money_unfair_std,2)} €")
print(f"Min: {np.round(lunch_money_unfair_min,2)} €")
print(f"Max: {np.round(lunch_money_unfair_max,2)} €")

Here, we get a completely different picture. The mean is 4 € and the standard deviation is about 16 €, roughly four times the mean itself. If we take the mean as the typical value, we overestimate the lunch money of 29 out of 30 students by a factor of 4 and underestimate that of the remaining student by a factor of almost 23.
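As a quick check, these numbers follow directly from the definitions, using the sample convention with $N-1$ in the denominator (as `ddof=1` does in the code):

$$\mu = \frac{91 + 29 \cdot 1}{30} = 4, \qquad \sigma^2 = \frac{(91-4)^2 + 29\,(1-4)^2}{29} = \frac{7569 + 261}{29} = 270, \qquad \sigma = \sqrt{270} \approx 16.4$$

The single student with 91 € receives $91/4 \approx 22.8$ times the mean, which is where the factor of almost 23 comes from.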

To get a more accurate picture of the situation, we can calculate the **median**, which is the middle value of the sorted data set. If the number of values is even, the median is calculated as the arithmetic mean of the two data points in the middle. Here is a short example:

In [None]:
# odd number of values
np.median(np.array([0, 1, 2, 3, 4]))

In [None]:
# even number of values
np.median(np.array([0, 1, 2, 3]))

So, let's see what the median is for our two distributions:

In [None]:
print(f"Median fair: {np.round(np.median(lunch_money_fair),2)}")
print(f"Median unfair: {np.round(np.median(lunch_money_unfair),2)}")

The median gives a more realistic picture of the unfairly distributed lunch money, as it is **not susceptible to outliers** in the distribution. 

Our example class of 30 students is far from being the whole **population** (also called the basic population) if we want to make statements about the distribution of lunch money in, for example, the whole school or the whole country. We only have a **random sample** of the population. Thus, the true average of the population can only be approximated by the sample mean.

To quantify the uncertainty of the sample mean, we use the **standard error of the mean**, which is the standard deviation divided by the square root of the number of elements:
* Standard error of the mean: $\text{SEM}(x)=\frac{\sigma}{\sqrt{N}}=\sqrt{\frac{1}{N(N-1)}\sum_{i=1}^N (x_i-\mu)^2}$.

In [None]:
import scipy.stats

In [None]:
print(
    f"Fair mean (± sem): ({np.round(lunch_money_fair_mean,2)}  ± {np.round(scipy.stats.sem(lunch_money_fair),2)}) €"
)
print(
    f"Unfair mean (± sem): ({np.round(lunch_money_unfair_mean,2)} ± {np.round(scipy.stats.sem(lunch_money_unfair),2)}) €"
)

In [None]:
scipy.stats.sem(lunch_money_fair)
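The standard error can also be reproduced with plain NumPy. This is a minimal sketch that rebuilds the unfair example locally and assumes the sample standard deviation (`ddof=1`), which is also the default of `scipy.stats.sem`.

```python
import numpy as np

# Same construction as the unfair example: one student gets 91 €, 29 students get 1 €.
lunch_money = np.append(np.array(91.0), np.ones(29))

n = len(lunch_money)
std = np.std(lunch_money, ddof=1)  # sample standard deviation
sem = std / np.sqrt(n)  # standard error of the mean

print(f"std={std:.2f} €, sem={sem:.2f} €")
```

Here the variance is 270 and $N=30$, so the standard error is $\sqrt{270/30} = 3$ €. This is almost as large as the mean of 4 € itself, which signals that the mean is poorly determined by this sample.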

#### Percentiles & Quantiles
Other common values that describe your data are **quantiles** (or percentiles). Quantiles go from 0 to 1 (percentiles correspondingly from 0% to 100%) and cut your sorted data at the given fraction. The 0.5-quantile (or 50th percentile) is the median, the 0.0-quantile is the minimum and the 1.0-quantile is the maximum of the variable. All quantiles in between are defined analogously. Note that `np.percentile` expects values between 0 and 100. 
If you want to discard all values smaller than the 0.05-quantile and larger than the 0.95-quantile for a visualization, for example to get rid of some strange outliers, you can use **percentiles**.

In [None]:
# minimum
np.percentile(lunch_money_unfair, 0)

In [None]:
# maximum
np.percentile(lunch_money_unfair, 100)

In [None]:
# Median
np.percentile(lunch_money_unfair, 50)
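Discarding the extreme 5% on both sides could look like the following sketch. The data are hypothetical (normally distributed values plus two artificial outliers), and `np.random.default_rng` is used here only to make the example reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical measurements around 10 with two artificial outliers appended.
values = np.append(rng.normal(loc=10.0, scale=1.0, size=1000), [500.0, -300.0])

# 5th and 95th percentile (equivalently the 0.05- and 0.95-quantile).
lower, upper = np.percentile(values, [5, 95])
trimmed = values[(values >= lower) & (values <= upper)]

print(f"kept {len(trimmed)} of {len(values)} values between {lower:.2f} and {upper:.2f}")
# np.quantile takes fractions between 0 and 1 and gives the same cut points.
print(np.allclose(np.quantile(values, [0.05, 0.95]), [lower, upper]))
```

Such trimming is meant for visualization. Whether the discarded values may be ignored in an analysis is a separate question that needs domain knowledge.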

## From Histograms to Probabilities and Distributions

One of the most common visualizations of a data set obtained from a particular collection of *things* is a **histogram** (see, for example, the number of houses built in a specific time range above). Our data set can be something like the income or the heights of individuals within a group of people, or the years in which the houses in a city or district were built. 
A histogram shows the division of the feature (e.g. income, height or year built) into different classes and their frequency. On the x axis directly adjacent rectangles of the width of the respective class are drawn, whose areas represent the (relative or absolute) class frequencies. The height of each rectangle then represents the (relative or absolute) frequency density, i.e. the (relative or absolute) frequency divided by the width of the corresponding class. Thus, a histogram visualizes the **frequency distribution** of the given data set.
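The relation between class frequency, class width and frequency density can be made concrete with `np.histogram`. The sketch below uses hypothetical values and deliberately unequal class widths.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 7, 8, 9, 10])  # hypothetical data
edges = np.array([0, 2, 4, 12])  # class boundaries with unequal widths

counts, _ = np.histogram(values, bins=edges)  # absolute class frequencies
widths = np.diff(edges)
relative_frequency = counts / counts.sum()
relative_density = relative_frequency / widths  # height of each rectangle

# density=True makes NumPy return exactly this relative frequency density.
density, _ = np.histogram(values, bins=edges, density=True)

print(counts, relative_density, (relative_density * widths).sum())
```

The wide last class contains four values but gets the same height as the first class with one value, because the height is the frequency per unit of width. The areas of all rectangles add up to 1.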

### Experiment: Rolling dice
Now, let's have a look at a very simple experiment: **rolling dice**. We generate a data set in which one or more dice are rolled at once and the average number of eyes (pips) per roll is calculated and saved. We use a single die and then ten dice. For each configuration, we roll 10 000 times.

In [None]:
def average_sum_of_eyes_per_roll_of_dice(dice_count):
    return np.sum([np.random.randint(1, 7) for i in range(dice_count)]) / dice_count

In [None]:
roll_counts = 10000
# single die
result_single_dice = [
    average_sum_of_eyes_per_roll_of_dice(dice_count=1) for i in range(roll_counts)
]
# ten dice
result_ten_dice = [
    average_sum_of_eyes_per_roll_of_dice(dice_count=10) for i in range(roll_counts)
]

We can now look at the frequency distributions of the determined averages. When we roll a single die, the number of eyes per roll is quite **uniformly distributed**. This agrees with probability theory, according to which each number of a fair die is equally likely, i.e. has a probability of 1/6. Here we speak of **probability distributions**.


When we roll ten dice 10 000 times and calculate the average number of eyes per roll, we get approximately a **normal distribution** (Gaussian distribution) with its typical bell-like shape. The normal distribution underlies many scientific and technical measurements. A fundamental mathematical concept that explains why normal distributions are so common is the [**central limit theorem**](https://en.wikipedia.org/wiki/Central_limit_theorem), which this little dice experiment illustrates.

In [None]:
fig, ax = plt.subplots(figsize=(10, 5), dpi=200)
ax.hist(
    result_single_dice,
    bins=np.linspace(1, 7, 7),
    alpha=0.5,
    density=False,
    label="Single die",
    color="C0",
)
ax.hist(
    result_ten_dice,
    bins=np.linspace(1, 7, 7),
    alpha=0.5,
    density=False,
    label="Ten dice",
    color="C1",
)
ax.set(xlabel="Average number of eyes per roll", ylabel="Frequency / bin")
ax.legend()

In [None]:
fig, ax = plt.subplots(figsize=(10, 5), dpi=200)
ax.hist(
    result_single_dice,
    bins=np.linspace(1, 7, 7),
    alpha=0.5,
    density=False,
    label="Single die",
    color="C0",
)
ax.set(xlabel="Average number of eyes per roll", ylabel="Frequency / bin")

In [None]:
fig, ax = plt.subplots(figsize=(10,5), dpi=200)
ax.hist(result_ten_dice, bins=20, alpha=0.5, density=False, label="Ten dice", color="C1")
ax.set(xlabel="Average number of eyes per roll", ylabel="Frequency / bin")


In [None]:
np.mean(result_ten_dice), np.std(result_ten_dice)
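The mean and standard deviation of the ten-dice averages can be compared with what probability theory predicts for a fair die. The following sketch is a vectorized re-implementation of the experiment with `np.random.default_rng`, swapped in for the list-comprehension helper used above. The seed is an arbitrary choice.

```python
import numpy as np

# Theoretical mean and variance of a single fair die.
faces = np.arange(1, 7)
die_mean = faces.mean()  # 3.5
die_variance = ((faces - die_mean) ** 2).mean()  # 35/12

# The average of n independent dice has the same mean and variance / n.
dice_count = 10
expected_std = np.sqrt(die_variance / dice_count)

# Simulation: 10 000 rolls of ten dice, averaged per roll.
rng = np.random.default_rng(42)
rolls = rng.integers(1, 7, size=(10_000, dice_count))  # upper bound is exclusive
averages = rolls.mean(axis=1)

print(f"expected: mean={die_mean:.2f}, std={expected_std:.2f}")
print(f"simulated: mean={averages.mean():.2f}, std={averages.std():.2f}")
```

The simulated values should land close to a mean of 3.5 and a standard deviation of about 0.54, with small deviations from run to run.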

### Normal distribution

A **normal distribution** is defined by its **mean** $\mu$ (the *expected value* of the distribution, about 3.5 in the example above) and its **standard deviation** $\sigma$ (here around 0.5). About 68% of the values lie within one $\sigma$ of the mean, so $\sigma$ can be read as half the width of the interval that covers roughly the middle two thirds of the values in a sample.

![](https://upload.wikimedia.org/wikipedia/commons/8/8c/Standard_deviation_diagram.svg)
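The coverage of the intervals $\mu \pm k\sigma$ can be computed with the error function from the standard library. The helper name `coverage` below is made up for this sketch.

```python
import math


def coverage(k):
    """Probability that a normally distributed value lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.4f}")
```

This gives roughly 68%, 95% and 99.7% for one, two and three standard deviations. These figures hold for a normal distribution only and should not be applied blindly to skewed data.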



But there are many more [probability distributions](https://en.wikipedia.org/wiki/List_of_probability_distributions).

### Exponential distribution
An exponential distribution occurs often in real life. It describes the waiting time until the next event, i.e. it is the probability distribution of the time between events in a [Poisson process](https://en.wikipedia.org/wiki/Poisson_point_process). 

Consider, for example, the incoming phone calls in a call center. The call rate differs according to the time of day, but within an interval in which the rate is roughly constant, the exponential distribution is a good approximation for the time until the next phone call arrives.  

In [None]:
seaborn.histplot(np.random.exponential(scale=5.0, size=100000));
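Some characteristic properties of the exponential distribution can be checked numerically. In this sketch the `scale` parameter is the mean waiting time, and the seed is an arbitrary choice to make the numbers reproducible.

```python
import numpy as np

rng = np.random.default_rng(1)
scale = 5.0  # mean waiting time, same value as in the plot above
waiting_times = rng.exponential(scale=scale, size=100_000)

sample_mean = waiting_times.mean()
sample_std = waiting_times.std()
sample_median = np.median(waiting_times)

print(f"mean={sample_mean:.2f}, std={sample_std:.2f}, median={sample_median:.2f}")
print(f"theoretical median: {scale * np.log(2):.2f}")
```

For an exponential distribution, mean and standard deviation are both equal to `scale`, while the median is smaller (`scale * ln 2`). This is a typical signature of a distribution with a long tail towards large values.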

### Examples from the dataset
In real data, we usually do not see the clean distributions from above, but rather a superposition or combination of several distributions, because different factors influence the data.

When looking at the square footage of the living area, we see an asymmetric distribution with a long "tail" towards higher values, i.e. there are many houses with smaller living areas but also a few houses with a very large square footage. We also see a clear peak between 1000 and 2000 sq. ft. Such a shape could be modeled by a normal distribution combined with an exponential distribution.

In [None]:
seaborn.histplot(data["sqft_living"], bins=100)
plt.title("distribution from the dataset");

Now assume you found a second, smaller peak around 3500 sq. ft. How would you interpret this? What could it tell you? 

One possible answer is that there are a number of townhouses of exactly the same size in one area of the city.

In [None]:
fig, ax = plt.subplots(figsize=(16, 3))
plotting_data = data["sqft_living"].copy()
plotting_data = pd.concat(
    [
        plotting_data,
        pd.Series(np.random.normal(loc=3500, scale=50, size=1000)),
    ]
)

seaborn.histplot(plotting_data, bins=100)
plt.title("distribution from the dataset with additional data");

When we look at the distribution of the _longitude coordinates of the houses_, we do not see one prominent peak, but a superposition of several distributions. Interestingly, the distribution has many local minima, i.e. it drops often. This could indicate a barrier running from north to south, such as a river.

In [None]:
seaborn.histplot(data["long"], bins=100);

As a last example, let's have a look at the price. Its shape is quite similar to that of the first example, the square footage. In the next part, we will compare both variables and find out whether the square footage is related to the overall price.

In [None]:
seaborn.histplot(data["price"], bins=100);

## Correlations

In statistics, [**correlation**](https://en.wikipedia.org/wiki/Correlation_and_dependence) is any **statistical relationship between two random variables**. In common language, a correlation exists if one variable does not vary independently of another.

Correlations are useful because they **can indicate a predictive relationship** that can be exploited in practice. For example, an electrical utility may produce less power on a mild day based on the correlation between electricity demand and weather. In this example, there is a causal relationship, because extreme weather causes people to use more electricity for heating or cooling. However, in general, the presence of a correlation is not sufficient to infer the presence of a causal relationship (i.e. **correlation does not imply causation**, see below).

![](graphics/spurious-correlation.svg)
_Source: [Spurious Correlations](https://www.tylervigen.com/spurious-correlations)_
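One common way in which a correlation arises without a direct causal link is a hidden common cause. The following sketch uses purely synthetic data. The variable names and the noise level are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical hidden common cause, e.g. the outside temperature.
common_cause = rng.normal(size=n)
# Two quantities that both depend on the common cause but not on each other.
x = common_cause + rng.normal(scale=0.5, size=n)
y = common_cause + rng.normal(scale=0.5, size=n)

r_xy = np.corrcoef(x, y)[0, 1]
# After removing the common cause, only independent noise is left.
r_residual = np.corrcoef(x - common_cause, y - common_cause)[0, 1]

print(f"correlation of x and y: {r_xy:.2f}")
print(f"correlation after removing the common cause: {r_residual:.2f}")
```

Here `x` and `y` are strongly correlated although neither influences the other. Once the common cause is accounted for, the correlation essentially vanishes.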

### Correlation Coefficients

A **correlation coefficient** is a number that expresses the strength of a correlation.

The [*Pearson correlation coefficient*](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is a statistic that measures linear correlation between two variables. It has a value between +1 and −1. A value of +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.

![](https://upload.wikimedia.org/wikipedia/commons/3/34/Correlation_coefficient.png)

_Source: [Wikimedia Commons](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)_
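To make the definition concrete, the Pearson coefficient can be computed by hand as the covariance of two variables divided by the product of their standard deviations. The function name `pearson_r` and the toy data are made up for this sketch.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r_positive = pearson_r(x, 2 * x + 1)  # perfectly linear, increasing
r_negative = pearson_r(x, -3 * x)  # perfectly linear, decreasing
r_parabola = pearson_r(x - 3, (x - 3) ** 2)  # clear dependence, but not linear

print(r_positive, r_negative, r_parabola)
```

The last case is worth remembering: the parabola is fully determined by `x`, yet the Pearson coefficient is zero because it only measures linear relationships.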

Let's compare the square footage of the home (`sqft_living`) and the square footage of the lot (`sqft_lot`) with the overall `price`.

In [None]:
plt.figure(figsize=(12, 6))
seaborn.scatterplot(x="sqft_living", y="price", data=data);

In [None]:
data[["sqft_living", "price"]].corr(method="pearson")

In [None]:
plt.figure(figsize=(12, 6))
seaborn.scatterplot(x="sqft_lot", y="price", data=data);

In [None]:
data[["sqft_lot", "price"]].corr(method="pearson")

## Regression Analysis and Model Fitting 

[**Regression analysis**](https://en.wikipedia.org/wiki/Regression_analysis) is a set of statistical processes for estimating the **relationships between a dependent variable** (often called the 'outcome variable' or 'target variable') and one or more **independent variables** (often called 'predictors' or 'features'). The most common form of regression analysis is **linear regression**, in which a line (or a more complex linear combination) that most closely fits the data according to a specific mathematical criterion is found. 

**Regression analysis** is primarily used for two conceptually distinct purposes. First, it is widely used for **prediction and forecasting**, where its use has substantial **overlap with the field of Machine Learning**. Second, regression analysis can be used to suggest causal **relationships between** the independent and dependent **variables**.

A basic principle of **regression analysis** is [**Model Fitting**](https://en.wikipedia.org/wiki/Curve_fitting), the process of finding a mathematical function that has the **best fit to a series of data points**. There are several ways (e.g. the [*Least-Squares Method*](https://en.wikipedia.org/wiki/Least_squares) or the [likelihood function](https://en.wikipedia.org/wiki/Likelihood_function)) to find the parameters of the model (a simple line, a polynomial of degree $n$ or a machine learning model) that **best fit** your data. 

In general, this process is always a kind of **minimization task**. The *Least-Squares Method* varies the model's parameters to minimize the sum of the squared residuals, i.e. the differences between the data points and the model. Selecting the parameters is a crucial step, be it choosing the appropriate degree of a polynomial or the appropriate set of parameters for a more complex machine learning algorithm.
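For a straight line $y = a + b\,x$, the least-squares task is to minimize $S(a, b) = \sum_i \left(y_i - (a + b\,x_i)\right)^2$, and this minimum has a closed-form solution. The sketch below uses a handful of hypothetical data points with a small, fixed perturbation instead of random noise, so that the result is reproducible.

```python
import numpy as np

# Hypothetical data: y = 1 + 3x plus a small, fixed perturbation.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = 1 + 3 * x + np.array([0.3, -0.2, 0.1, -0.4, 0.2])


def sum_of_squared_residuals(intercept, slope):
    residuals = y - (intercept + slope * x)
    return (residuals**2).sum()


# Closed-form least-squares solution for a straight line.
slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
intercept = y.mean() - slope * x.mean()

best = sum_of_squared_residuals(intercept, slope)
shifted = sum_of_squared_residuals(intercept + 0.1, slope - 0.1)

print(f"intercept={intercept:.2f}, slope={slope:.2f}")
print(f"S at the optimum: {best:.3f}, S for shifted parameters: {shifted:.3f}")
```

Any other choice of intercept and slope yields a larger sum of squared residuals. Library routines such as `np.polyfit(x, y, 1)` arrive at the same parameters.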

In Python there are several modules for **fitting** and **regression**. A small selection:
* [SciPy optimize](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.least_squares.html)
* [NumPy polyfit](https://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html)
* [scikit-learn Linear Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)


### Example: Linear Regression

The idea is simple: Find the set of parameters of a given function (_Model_) so the sum of the squared **residuals** is minimized.

We use some generated data points. They follow the linear relationship `y = 1 + 3x`, to which we add some random noise.

In [None]:
import scipy.optimize as opt

In [None]:
## Model or function we want to fit
def linear(x, intercept, slope):
    return intercept + slope * x

In [None]:
## Generate example data set
sample_size = 20
error = 1

## Generate example data
x = np.linspace(-2, 2, sample_size)
# Get y-values with random noise
noise = np.random.normal(0, error, size=sample_size)
y = 1 + 3 * x + noise

In [None]:
# Apply a linear fit
params, cov = opt.curve_fit(linear, x, y)

In [None]:
fig, ax = plt.subplots(dpi=200)
ax.scatter(x, y, label="data", marker="x")
ax.plot(x, linear(x, *params), label="Linear fit", color="orange")
ax.legend()
print(f"=== Result: intercept={params[0]:.2f}, slope={params[1]:.2f}")

## Overfitting & Underfitting (bias–variance dilemma)

In principle, regardless of the fitting method, the choice of the function has the highest impact on the performance of a regression or fit. There are two extremes when choosing a model or function:

- **Underfitting** (bias) - Underfitting occurs when our model does not have enough parametric "freedom" to describe the data.

- **Overfitting** (variance) - The opposite problem is called overfitting: The model describes the data points "too well", e.g. it describes random noise instead of simplifying and describing the general relationships in the data.

It is your task to validate whether your function or model _interprets_ the data correctly. There are tools and methods that help with this, and we will come to them at a later stage.


![](graphics/OverUnderfitting.png)
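Both extremes can be reproduced with polynomial fits of different degree. The sketch below is one possible illustration, not a validation recipe: the ground truth `y = 1 + 3x`, the noise level, the seed and the chosen degrees are all assumptions, and the helper names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)


def noisy_line(x):
    # Hypothetical ground truth: y = 1 + 3x with Gaussian noise of width 1.
    return 1 + 3 * x + rng.normal(0, 1, size=x.size)


# Data used for fitting and separate data used only for checking the fit.
x_train = np.linspace(-2, 2, 15)
y_train = noisy_line(x_train)
x_test = np.linspace(-1.9, 1.9, 200)
y_test = noisy_line(x_test)


def mse(coefficients, x, y):
    return np.mean((np.polyval(coefficients, x) - y) ** 2)


train_error, test_error = {}, {}
for degree in (0, 1, 7):
    coefficients = np.polyfit(x_train, y_train, degree)
    train_error[degree] = mse(coefficients, x_train, y_train)
    test_error[degree] = mse(coefficients, x_test, y_test)
    print(
        f"degree {degree}: train MSE={train_error[degree]:.2f}, "
        f"test MSE={test_error[degree]:.2f}"
    )
```

The constant (degree 0) underfits and has a large error on both data sets. The training error can only go down as the degree grows, so a low training error alone says little. Whether the degree-7 polynomial ends up with a higher test error than the straight line depends on the noise, but it typically does, which is the signature of overfitting.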

---

_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_