# Variables in statistics

Previously, we discussed the details around collecting data for our analysis. In this mission, we'll focus on understanding the structural parts of a dataset, and how they're measured.

Whether a sample or a population, **a dataset is generally an attempt to correctly describe a relatively small part of the world**. The dataset we worked with in the previous mission describes basketball players and their performance in the 2016-2017 season.

Other datasets might attempt to describe the stock market, patient symptoms, stars from galaxies other than ours, movie ratings, customer purchases, and all sorts of other things.

The things we want to describe usually have a myriad of properties. A human, for instance, besides the property of being a human, can also have properties like height, weight, age, name, hair color, gender, nationality, whether they're married or not, whether they have a job or not, etc.

In practice, we limit ourselves to the properties relevant to the questions we want to answer, and to the properties that we can actually measure. Let's consider three rows at random from the basketball dataset we've previously worked with:

![rows.png](attachment:rows.png)

Each row describes an individual with a series of properties: name, team, position on the court, height, etc. For most properties, the values vary from row to row. All players have a height, for example, but the height values vary from player to player.

We call the properties with varying values variables. The height property in our dataset is an example of a variable. In fact, all the properties described in our dataset are variables.

A row in our dataset describes the actual values that each variable takes for a given individual.

Notice that this particular meaning of the "variable" concept is restricted to the domain of statistics. A variable in statistics is not the same as a variable in programming, or other domains.

## Quantitative and qualitative variables

Variables in statistics can describe either **quantities**, or **qualities**.

For instance, the Height variable in our dataset describes how tall each player is. The Age variable describes how much time has passed since each player was born. The MIN variable describes how many minutes each player played in the 2016-2017 WNBA season.

Generally, a variable that describes how much there is of something describes a quantity, and, for this reason, it's called a **quantitative variable**.

Usually, quantitative variables describe a quantity using real numbers, but there are also cases when words are used instead. Height, for example, can be described using real numbers, like in our dataset, but it can also be described using labels like "tall" or "short".

A few variables in our dataset clearly don't describe quantities. The Name variable, for instance, describes the name of each player. The Team variable describes what team each player belongs to. The College variable describes what college each player goes or went to.

The Name, Team, and College variables describe for each individual a quality, that is, a property that is not quantitative. Variables that describe qualities are called **qualitative variables** or **categorical variables**. Generally, qualitative variables describe what or how something is.

Usually, qualitative variables describe qualities using words, but numbers can also be used. For instance, a player's shirt number or a racing car's number is written with digits. These numbers don't bear any quantitative meaning, though; they are just names, not quantities.

In the diagram below we do a head-to-head comparison between qualitative and quantitative variables:

![quant.png](attachment:quant.png)

We've selected a few variables from our dataset. For each of the variables selected, indicate whether it's quantitative or qualitative.

- We've already created a dictionary named variables. Each variable name is given as a dictionary key.
- If a variable is quantitative, then complete the value of the corresponding key with the string 'quantitative'. If the variable is qualitative, then use the string 'qualitative'.

In [2]:
import pandas as pd
wnba = pd.read_csv('wnba.csv')
wnba.head()

| Unnamed: 0 | Name | Team | Pos | Height | Weight | BMI | Birth_Place | Birthdate | Age | College | ... | OREB | DREB | REB | AST | STL | BLK | TO | PTS | DD2 | TD3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aerial Powers | DAL | F | 183 | 71.0 | 21.200991 | US | January 17, 1994 | 23 | Michigan State | ... | 6 | 22 | 28 | 12 | 3 | 6 | 12 | 93 | 0 | 0 |
| 1 | Alana Beard | LA | G/F | 185 | 73.0 | 21.329438 | US | May 14, 1982 | 35 | Duke | ... | 19 | 82 | 101 | 72 | 63 | 13 | 40 | 217 | 0 | 0 |
| 2 | Alex Bentley | CON | G | 170 | 69.0 | 23.875433 | US | October 27, 1990 | 26 | Penn State | ... | 4 | 36 | 40 | 78 | 22 | 3 | 24 | 218 | 0 | 0 |
| 3 | Alex Montgomery | SAN | G/F | 185 | 84.0 | 24.543462 | US | December 11, 1988 | 28 | Georgia Tech | ... | 35 | 134 | 169 | 65 | 20 | 10 | 38 | 188 | 2 | 0 |
| 4 | Alexis Jones | MIN | G | 175 | 78.0 | 25.469388 | US | August 5, 1994 | 23 | Baylor | ... | 3 | 9 | 12 | 12 | 7 | 0 | 14 | 50 | 0 | 0 |

In [4]:
variables = {'Name': 'qualitative', 
             'Team': 'qualitative', 
             'Pos': 'qualitative', 
             'Height': 'quantitative', 
             'BMI': 'quantitative',
             'Birth_Place': 'qualitative', 
             'Birthdate': 'quantitative', 
             'Age': 'quantitative', 
             'College': 'qualitative',
             'Experience': 'quantitative',
             'Games Played': 'quantitative', 
             'MIN': 'quantitative', 
             'FGM': 'quantitative', 
             'FGA': 'quantitative',
             '3PA': 'quantitative', 
             'FTM': 'quantitative', 
             'FTA': 'quantitative', 
             'FT%': 'quantitative', 
             'OREB': 'quantitative',
             'DREB': 'quantitative',
             'REB': 'quantitative', 
             'AST': 'quantitative', 
             'PTS': 'quantitative'}

## Scales of measurement

The amount of information a variable provides depends on its nature (whether it's quantitative or qualitative), and on the way it's measured.

For instance, if we analyze the Team variable for any two individuals:

- We can tell whether or not the two individuals are different from each other with respect to the team they play for.
- But if there's a difference:
    - We can't tell the size of the difference.
    - We can't tell the direction of the difference - we can't say that team A is greater or less than team B.

On the other hand, if we analyze the Height variable:

- We can tell whether or not two individuals are different.
- If there's a difference:
    - We can tell the size of the difference. If player A is 190 cm tall and player B is 192 cm tall, then the difference between the two is 2 cm.
    - We can tell the direction of the difference from each perspective: player A is 2 cm shorter than player B, and player B is 2 cm taller than player A.
   
![table.png](attachment:table.png)

The Team and Height variables provide different amounts of information because they have a different nature (one is qualitative, the other quantitative), and because they are measured differently.

The system of rules that defines how each variable is measured is called a **scale of measurement** or, less often, a **level of measurement**.

In the next screens, we'll learn about a system of measurement made up of four different scales of measurement: *nominal, ordinal, interval, and ratio*. As we'll see, the characteristics of each scale pivot around three main questions:

- Can we tell whether two individuals are different?
- Can we tell the direction of the difference?
- Can we tell the size of the difference?
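
As a preview of the four scales covered in the rest of this mission, the sketch below records which of the three questions each scale can answer. The dictionary name `SCALES` and the helper `can_answer` are illustrative only; they are not part of the dataset or of any library.

```python
# Which of the three questions each scale of measurement can answer,
# following the descriptions given in this mission.
SCALES = {
    "nominal":  {"different": True, "direction": False, "size": False},
    "ordinal":  {"different": True, "direction": True,  "size": False},
    "interval": {"different": True, "direction": True,  "size": True},
    "ratio":    {"different": True, "direction": True,  "size": True},
}


def can_answer(scale, question):
    """Return True if the given scale can answer the given question."""
    return SCALES[scale][question]


for scale, answers in SCALES.items():
    print(scale, answers)
```

Note that the interval and ratio scales answer all three questions in the same way; what separates them is the meaning of the zero point, which this table does not capture.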

## The nominal scale

In the previous screen, we discussed the Team variable, and said that by examining its values we can tell whether two individuals are different or not, but we can't indicate the size and the direction of the difference.

The Team variable is an example of a variable measured on a nominal scale. For any variable measured on a nominal scale:

- We can tell whether two individuals are different or not (with respect to that variable).
- We can't say anything about the direction and the size of the difference.
- We know that it can only describe qualities.

![nom.png](attachment:nom.png)

When a qualitative variable is described with numbers, the principles of the nominal scale still hold. We can tell whether there's a difference or not between individuals, but we still can't say anything about the size and the direction of the difference.

If basketball player A has the number 5 on her shirt, and player B has 8, we can tell they're different with respect to shirt numbers, but it doesn't make any sense to subtract the two values and quantify the difference as a 3. Nor does it make sense to say that B is greater than A. **The numbers on the shirts are just identifiers here; they don't quantify anything.**
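
One way to make this explicit in code is to store such identifiers as strings rather than integers. The sketch below is only an illustration: the `Shirt` column and the two players are hypothetical, since the dataset shown earlier has no shirt-number column.

```python
import pandas as pd

# Hypothetical shirt numbers; the WNBA dataset shown above has no such column.
players = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "Shirt": [5, 8],
})

# Store the identifiers as strings so they are not mistaken for quantities.
players["Shirt"] = players["Shirt"].astype(str)

# On a nominal scale we can only ask whether two values are the same.
same_number = players.loc[0, "Shirt"] == players.loc[1, "Shirt"]
print(same_number)

# Subtraction is not defined for strings, so this raises a TypeError.
try:
    players.loc[1, "Shirt"] - players.loc[0, "Shirt"]
except TypeError:
    print("shirt numbers cannot be subtracted")
```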


*Inspect the dataset, and find the variables measured on a nominal scale. In the code editor:*

- Add the variables measured on a nominal scale to a list named nominal_scale, and sort the elements in the list alphabetically (the sorting helps us with answer checking).

- Notice that we've added a new variable named Height_labels. Instead of showing the height in centimeters, the new variable shows labels like "short", "medium", or "tall". By considering the principles that characterize the nominal scale, think whether the new Height_labels variable should be included in your nominal_scale list.

In [9]:
nominal_scale = sorted(['Name', 'Team', 'Pos', 'Birth_Place', 'College'])

## The ordinal scale

In our last exercise, we saw that the new Height_labels variable shows labels like "short", "medium", or "tall". By examining the values of this new variable, we can tell whether two individuals are different or not. But, unlike in the case of a nominal scale, we can also tell the direction of the difference. Someone who is assigned the label "tall" is taller than someone assigned the label "short".

However, we still can't determine the size of the difference. This is an example of a variable measured on an ordinal scale.

![ordinal.png](attachment:ordinal.png)

Generally, for any variable measured on an ordinal scale, we can tell whether individuals are different or not, we can also tell the direction of the difference, but we still can't determine the size of the difference.
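
In pandas, an ordered categorical behaves much like a variable measured on an ordinal scale. The sketch below uses three made-up labels (the actual values of the Height_labels column are not shown in this text) and assumes the order short < medium < tall.

```python
import pandas as pd

# Hypothetical labels; assumed order: short < medium < tall.
labels = pd.Series(pd.Categorical(
    ["short", "tall", "medium"],
    categories=["short", "medium", "tall"],
    ordered=True,
))

# Direction: ordered categories support comparisons.
taller_than_short = labels > "short"
print(taller_than_short.tolist())
print(labels.min(), labels.max())

# Size: subtraction is not defined for categorical data.
try:
    labels - labels
except TypeError:
    print("the size of the difference is undefined")
```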

Variables measured on an ordinal scale can only be quantitative. Quantitative variables, however, can be measured on other scales too, as we'll see next in this mission.

![table2.png](attachment:table2.png)

Common examples of variables measured on ordinal scales include ranks: ranks of athletes, of horses in a race, of people in various competitions, etc.

For example, let's say we only know that athlete A finished second in a marathon, and athlete B finished third in the same race. We can immediately tell that their performances differ, and we know that athlete A finished faster, but we don't know how much faster. The difference between the two could be half a second, 12 minutes, half an hour, etc.

Other common examples include measurements of subjective evaluations that are generally difficult or nearly impossible to quantify with precision. For instance, when answering a survey about how much they like a new product, people may have to choose a label between "It's a disaster, I hate it", "I don't like it", "I like it a bit", "I really like it", "I simply love it".

The values of the variables measured on an ordinal scale can be both words and numbers. When the values are numbers, they are usually ranks. But we still can't use the numbers to compute the size of the difference. We can't say how much faster an athlete was than another by simply comparing their ranks.
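
To see why ranks hide the size of the difference, consider the sketch below. The finish times are invented for illustration: two races produce exactly the same ranks even though the gaps between runners are very different.

```python
import pandas as pd

# Hypothetical finish times in minutes for two separate races.
race_1 = pd.Series({"A": 150.0, "B": 150.5, "C": 175.0})
race_2 = pd.Series({"A": 150.0, "B": 162.0, "C": 162.5})

# Rank 1 is the fastest runner in each race.
ranks_1 = race_1.rank().astype(int)
ranks_2 = race_2.rank().astype(int)

# The ranks are identical in both races...
print(ranks_1.equals(ranks_2))

# ...but the gap between first and second place is not.
gap_1 = race_1["B"] - race_1["A"]
gap_2 = race_2["B"] - race_2["A"]
print(gap_1, gap_2)
```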

Whether a variable is quantitative or qualitative is independent of the way the variable is measured. The Height variable, for instance, is quantitative no matter how we measure it. The fact that we use words like "short" or "tall" doesn't change its underlying nature. The Height variable still describes a magnitude, but in a different way.

## The interval and ratio scales

We've seen in the case of the Height variable that the values have direction when measured on an ordinal scale. The downside is that we don't know the size of each interval between values, and because of this we can't determine the size of the difference.

![ordinal2.png](attachment:ordinal2.png)

An alternative here is to measure the Height variable using real numbers, which will result in having well-defined intervals, which in turn will allow us to determine the size of the difference between any two values.

![real.png](attachment:real.png)

A variable measured on a scale that preserves the order between values and has well-defined intervals using real numbers is an example of a variable measured either on an interval scale, or on a ratio scale.

In practice, variables measured on interval or ratio scales are very common, if not the most common. Examples include:

- Height measured with a numerical unit of measurement (like inches or centimeters).
- Weight measured with a numerical unit of measurement (multiples and submultiples of grams, for instance).
- Time measured with a numerical unit of measurement (multiples and submultiples of seconds, for example).
- The price of various products measured with a numerical unit of measurement (like dollars, pounds, etc.).

![interval.png](attachment:interval.png)

## The difference between ratio and interval scales

What sets ratio scales apart from interval scales is the nature of the zero point.

On a ratio scale, the zero point means no quantity. For example, the Weight variable is measured on a ratio scale, which means that 0 grams indicates the absence of weight.

On an interval scale, however, the zero point doesn't indicate the absence of a quantity. It actually indicates the presence of a quantity.

To illustrate this case using our dataset, we've used the Weight variable (measured on a ratio scale), and created a new variable that is measured on an interval scale. The new variable describes by how many kilograms the weight of a player differs from the average weight of the players in our dataset. Here's a random sample that includes values from the new variable named Weight_deviation:

![w.png](attachment:w.png)

If a player had a value of 0 for our Weight_deviation variable (which is measured on an interval scale), that wouldn't mean the player has no weight. Rather, it'd mean that her weight is exactly the same as the mean. The mean of the Weight variable is roughly 78.98 kg, which means that the zero point in the Weight_deviation variable is equivalent to 78.98 kg.

On the other hand, a value of 0 for the Weight variable, which is measured on a ratio scale, indicates the absolute absence of weight.
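
A variable like Weight_deviation can be sketched as the weight minus the mean weight. This is an assumption about how the column was built, based on the description above. The example uses only the five rows printed by `wnba.head()` earlier, so its mean (75.0 kg) differs from the full dataset's mean of roughly 78.98 kg.

```python
import pandas as pd

# Weights (kg) of the five players shown by wnba.head() earlier.
sample = pd.DataFrame({
    "Name": ["Aerial Powers", "Alana Beard", "Alex Bentley",
             "Alex Montgomery", "Alexis Jones"],
    "Weight": [71.0, 73.0, 69.0, 84.0, 78.0],
})

# Assumption: the deviation is each weight minus the mean weight.
# The mean of these five rows is 75.0, not the full dataset's 78.98.
sample_mean = sample["Weight"].mean()
sample["Weight_deviation"] = sample["Weight"] - sample_mean

print(sample)
```

In this small sample, a Weight_deviation of 0 would correspond to a weight of 75.0 kg, not to the absence of weight.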

Another important difference between the two scales is given by the way we can measure the size of the differences.

On a ratio scale, we can quantify the difference in two ways. One way is to measure a distance between any two points by simply subtracting one from another. The other way is to measure the difference in terms of ratios.

For example, by doing a simple subtraction using the data in the table above, we can tell that the difference (the distance) in weight between Clarissa dos Santos and Alex Montgomery is 5 kg. In terms of ratios, however, Clarissa dos Santos is roughly 1.06 times as heavy as Alex Montgomery (the result of 89 kg divided by 84 kg). To give a straightforward example, if player A weighed 90 kg and player B weighed 45 kg, we could say that player A is twice as heavy as player B (90 kg divided by 45 kg).

On an interval scale, however, we can meaningfully measure the difference between any two points only by finding the distance between them (by subtracting one point from another). If we look at the Weight_deviation variable, we can say there's a difference of 5 kg between Clarissa dos Santos and Alex Montgomery. However, if we took ratios, we'd have to say that Clarissa dos Santos is about twice as heavy as Alex Montgomery (her deviation is roughly 10 kg, while Alex Montgomery's is roughly 5 kg), which is not true.
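
The arithmetic can be checked with a few lines of Python. The weights (89 kg and 84 kg) and the rounded mean (78.98 kg) come from the text; the variable names are illustrative.

```python
# Weights in kg from the text; the mean is the rounded value quoted earlier.
mean_weight = 78.98
weight = {"Clarissa dos Santos": 89.0, "Alex Montgomery": 84.0}

# Interval-scale version: deviation from the mean weight.
deviation = {name: w - mean_weight for name, w in weight.items()}

# Differences agree on both scales (5 kg, up to floating-point rounding).
diff_weight = weight["Clarissa dos Santos"] - weight["Alex Montgomery"]
diff_deviation = deviation["Clarissa dos Santos"] - deviation["Alex Montgomery"]
print(round(diff_weight, 2), round(diff_deviation, 2))

# Ratios do not agree: about 1.06 on the ratio scale,
# but about 2.0 on the interval scale.
ratio_weight = weight["Clarissa dos Santos"] / weight["Alex Montgomery"]
ratio_deviation = deviation["Clarissa dos Santos"] / deviation["Alex Montgomery"]
print(round(ratio_weight, 2), round(ratio_deviation, 2))
```

Subtracting the mean shifts the zero point, which leaves distances unchanged but distorts ratios. This is why ratios are only meaningful on a ratio scale.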

![int_rat.png](attachment:int_rat.png)