# Lesson - Statistics and Probability II: Variables

In the `wnba` dataset, each row describes an individual who has a series of properties: name, team, position on the court, height, etc. For most properties, the values vary from row to row. All players have a height, for example, but the height values vary from player to player.

The properties with varying values are called **variables**. The height property in the `wnba` data set is an example of a variable. In fact, all the properties described in the `wnba` data set are variables. A row in the data set describes the actual values that each variable takes for a given individual.

In [1]:
import pandas as pd

wnba = pd.read_csv("wnba.csv")

print(wnba.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143 entries, 0 to 142
Data columns (total 32 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Name          143 non-null    object 
 1   Team          143 non-null    object 
 2   Pos           143 non-null    object 
 3   Height        143 non-null    int64  
 4   Weight        142 non-null    float64
 5   BMI           142 non-null    float64
 6   Birth_Place   143 non-null    object 
 7   Birthdate     143 non-null    object 
 8   Age           143 non-null    int64  
 9   College       143 non-null    object 
 10  Experience    143 non-null    object 
 11  Games Played  143 non-null    int64  
 12  MIN           143 non-null    int64  
 13  FGM           143 non-null    int64  
 14  FGA           143 non-null    int64  
 15  FG%           143 non-null    float64
 16  15:00         143 non-null    int64  
 17  3PA           143 non-null    int64  
 18  3P%           143 non-null    

### Quantitative and Qualitative Variables

Generally, a variable that describes how much there is of something describes a quantity, and, for this reason, it's called a **quantitative variable**.

Usually, quantitative variables describe a quantity using real numbers, but there are also cases when words are used instead. `Height`, for example, can be described using real numbers, like in our data set, but it can also be described using labels like "tall" or "short".

In the `wnba` dataset, the `Name`, `Team`, and `College` variables describe a quality of each individual, that is, a property that is not quantitative. Variables that describe qualities are called **qualitative variables** or **categorical variables**. Generally, qualitative variables describe what or how something is.

![image.png](attachment:image.png)
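As a rough first pass, the pandas dtype of a column hints at whether a variable is recorded with numbers or with words. The sketch below rebuilds five rows from the sample printed later in this lesson and splits the columns with `select_dtypes`. The names `sample`, `numeric_columns`, and `other_columns` are made up for this illustration. Treat the result as a hint only: in the `info()` output above, `Experience` is stored as `object`, even though experience describes a quantity.

```python
import pandas as pd

# Five rows copied from the sample printed later in this lesson (Name, Weight, PTS).
sample = pd.DataFrame({
    "Name": ["Diana Taurasi", "Sequoia Holmes", "Asia Taylor",
             "Glory Johnson", "Briann January"],
    "Weight": [74.0, 70.0, 76.0, 77.0, 65.0],
    "PTS": [376, 81, 31, 9, 238],
})

# Columns stored as numbers are candidates for quantitative variables.
numeric_columns = sample.select_dtypes(include="number").columns.tolist()

# Everything else (here, text) is a candidate for a qualitative variable.
other_columns = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_columns)  # ['Weight', 'PTS']
print(other_columns)    # ['Name']
```

The final decision still depends on what the variable describes, not on how pandas happens to store it.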

### Scales of Measurement

The `Team` and `Height` variables in the `wnba` dataset provide different amounts of information because they have a different nature (one is qualitative, the other quantitative), and because they are measured differently.

The system of rules that defines how each variable is measured is called a **scale of measurement** or, less often, a level of measurement.

A frequently used system of measurement is made up of four different scales: `nominal`, `ordinal`, `interval`, and `ratio`.

The characteristics of each scale pivot around three main questions:

- Can we tell whether two individuals are different?
- Can we tell the direction of the difference?
- Can we tell the size of the difference?

### Nominal Scale

The `Team`, `Name`, `Pos`, `Birth_Place`, and `College` variables in the `wnba` dataset are examples of variables measured on a **nominal scale**: the values tell us whether two individuals are different or not, but they can't indicate the direction or the size of the difference.

For any variable measured on a nominal scale:

- We can tell whether two individuals are different or not (with respect to that variable).
- We can't say anything about the direction and the size of the difference.
- We know that it can only describe qualities.

![image.png](attachment:image.png)
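One way to mirror a nominal scale in pandas is an unordered categorical column. This is only a sketch, and the choice of an unordered `category` dtype is an assumption of this example rather than something the lesson's data set uses. The three names come from the `wnba` output shown in this lesson; `names`, `same_player`, and `ordering_allowed` are illustrative identifiers.

```python
import pandas as pd

# Three player names, stored as an unordered categorical (nominal) variable.
names = pd.Series(["Aerial Powers", "Alana Beard", "Alex Bentley"], dtype="category")

# Question 1: can we tell whether two individuals are different? Yes.
same_player = names == "Alana Beard"
print(same_player.tolist())  # [False, True, False]

# Question 2: can we tell the direction of the difference? No.
# pandas refuses to order values of an unordered categorical.
try:
    names < "Alana Beard"
    ordering_allowed = True
except TypeError:
    ordering_allowed = False

print(ordering_allowed)  # False
```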

### Ordinal Scale
If we consider a variable `height_label` that categorizes the heights of players as "tall", "medium", and "short", then this is an example of a variable measured on an **ordinal scale**.

By examining the values of this new variable, we can tell whether two individuals are different or not. But, unlike in the case of a nominal scale, we can also tell the direction of the difference. Someone who is assigned the label "tall" has a bigger height than someone assigned the label "short". However, we still can't determine the **size of the difference**.

![image.png](attachment:image.png)

Generally, for any variable measured on an ordinal scale, we can tell whether individuals are different or not, we can also tell the direction of the difference, but we still can't determine the size of the difference.
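The behavior of an ordinal scale can be sketched with an ordered categorical in pandas. The heights and the cut points below are hypothetical values chosen for illustration; they are not taken from `wnba.csv`, and the lesson does not say how `height_label` would actually be derived.

```python
import pandas as pd

# Hypothetical heights in cm (made up for this example, not from wnba.csv).
heights = pd.Series([165, 178, 183, 191, 201])

# Arbitrary, assumed cut points; pd.cut returns an ordered categorical when labels are given.
height_label = pd.cut(heights, bins=[0, 170, 185, 300],
                      labels=["short", "medium", "tall"])
print(height_label.tolist())  # ['short', 'medium', 'medium', 'tall', 'tall']

# Direction of the difference: ordered categories can be compared.
taller_than_short = height_label > "short"
print(taller_than_short.tolist())  # [False, True, True, True, True]

# Size of the difference: not available, subtracting labels is an error.
try:
    height_label - height_label
    subtraction_allowed = True
except TypeError:
    subtraction_allowed = False

print(subtraction_allowed)  # False
```

The labels keep their order, but nothing in them says how many centimeters separate "short" from "tall".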

Variables measured on an ordinal scale can only be quantitative. Quantitative variables, however, can be measured on other scales too. 

![image.png](attachment:image.png)



Common examples of variables measured on ordinal scales include ranks: ranks of athletes, of horses in a race, of people in various competitions, etc.

The values of the variables measured on an ordinal scale can be both words and numbers. When the values are numbers, they are usually ranks. But we still can't use the numbers to compute the size of the difference. We can't say how much faster an athlete was than another by simply comparing their ranks.

Whether a variable is quantitative or qualitative is independent of the way the variable is measured. The `Height` variable, for instance, is quantitative no matter how we measure it. The fact that we use words like "short" or "tall" doesn't change its underlying nature. The Height variable still describes a magnitude, but in a different way.

### Interval and Ratio Scales
In the case of the `height_label` variable the values have direction when measured on an ordinal scale, but we don't know the size of each interval between values, and because of this we can't determine the size of the difference.

![image.png](attachment:image.png)


An alternative is to measure the Height variable using real numbers, which will result in having well-defined intervals, which in turn will allow us to determine the size of the difference between any two values.

![image.png](attachment:image.png)

A variable measured on a scale that preserves the order between values and has well-defined intervals using real numbers is an example of a variable measured either on an **interval scale**, or on a **ratio scale**.

Examples include:

- Height measured with a numerical unit of measurement (like inches or centimeters).
- Weight measured with a numerical unit of measurement (multiples and submultiples of grams, for instance).
- Time measured with a numerical unit of measurement (multiples and submultiples of seconds, for example).
- The price of various products measured with a numerical unit of measurement (like dollars, pounds, etc.).

![image.png](attachment:image.png)

### Difference between Ratio and Interval Scale

What sets apart ratio scales from interval scales is the nature of the **zero point**.

On a ratio scale, the zero point means no quantity. For example, the `Weight` variable is measured on a ratio scale, which means that 0 grams indicates the absence of weight.

On an interval scale, however, the zero point doesn't indicate the absence of a quantity. It actually indicates the presence of a quantity.

For example, the mean of the `Weight` variable is roughly 78.98 kg. If we create a `weight_deviation` variable that measures each player's deviation from the mean, the zero point of `weight_deviation` is equivalent to 78.98 kg. A player with a `weight_deviation` of 0 does not have a weight of 0 kg but a weight of 78.98 kg. So `weight_deviation` is measured on an interval scale.

On the other hand, a value of 0 for the `Weight` variable, which is measured on a ratio scale, indicates the absolute absence of weight.



In [2]:
wnba["weight_deviation"] = wnba["Weight"] - wnba["Weight"].mean()

wnba_weights = wnba[["Name", "Weight", "weight_deviation"]]

print(wnba_weights.head())

              Name  Weight  weight_deviation
0    Aerial Powers    71.0         -7.978873
1      Alana Beard    73.0         -5.978873
2     Alex Bentley    69.0         -9.978873
3  Alex Montgomery    84.0          5.021127
4     Alexis Jones    78.0         -0.978873


Another important difference between the two scales is given by the way we can measure the size of the differences.

On a ratio scale, we can quantify the difference in two ways. One way is to measure a distance between any two points by simply subtracting one from another. The other way is to measure the difference in terms of ratios.

For example, by doing a simple subtraction using the data in the table above, we can tell that the difference (the distance) in weight between Aerial Powers and Alex Montgomery is 13 kg. In terms of ratios, however, Alex Montgomery is roughly 1.18 times as heavy as Aerial Powers (the result of 84 kg divided by 71 kg), or, equivalently, Aerial Powers weighs roughly 0.845 times as much as Alex Montgomery (the result of 71 kg divided by 84 kg, or 1/1.18).

If we look at the `weight_deviation` variable, we can only say there's a difference of 13 kg between Aerial Powers and Alex Montgomery, but not how much heavier or lighter they are with respect to each other.

![image.png](attachment:image.png)
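The arithmetic behind this comparison can be checked with a few lines of Python. The weights are the two values from the table above; the mean of 78.978873 kg is implied by that table (71.0 minus the deviation of -7.978873). The variable names are illustrative.

```python
# Weights in kg, taken from the table above.
powers_kg = 71.0       # Aerial Powers
montgomery_kg = 84.0   # Alex Montgomery

# Mean weight implied by the table: 71.0 - (-7.978873).
mean_kg = 78.978873

# Ratio scale: both the distance and the ratio are meaningful.
difference = montgomery_kg - powers_kg   # 13.0
ratio = montgomery_kg / powers_kg        # roughly 1.18
inverse_ratio = powers_kg / montgomery_kg  # roughly 0.845

# Interval scale: the distance survives, the ratio does not.
dev_powers = powers_kg - mean_kg
dev_montgomery = montgomery_kg - mean_kg
dev_difference = dev_montgomery - dev_powers  # still about 13
dev_ratio = dev_montgomery / dev_powers       # a negative number with no physical meaning

print(difference, round(ratio, 3), round(inverse_ratio, 3))
print(round(dev_difference, 6), round(dev_ratio, 3))
```

Shifting the zero point leaves distances untouched but changes every ratio, which is why ratios are only interpreted on a ratio scale.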

In the `wnba` dataset, the variables are measured as below:

interval = ['Birthdate', 'weight_deviation']

ratio = ['15:00',
 '3P%',
 '3PA',
 'AST',
 'Age',
 'BLK',
 'BMI',
 'DD2',
 'DREB',
 'Experience',
 'FG%',
 'FGA',
 'FGM',
 'FT%',
 'FTA',
 'FTM',
 'Games Played',
 'Height',
 'MIN',
 'OREB',
 'PTS',
 'REB',
 'STL',
 'TD3',
 'TO',
 'Weight']

In [3]:
# An empty DataFrame used as a placeholder for the four variables.
# int64 is requested, but every cell is NaN, so pandas reports float64 (see the output below).

empty = pd.DataFrame(columns = ["height", "weight", "weight_deviation", "points"], index = range(1,11), dtype = 'int64')
print(empty.info())
print(empty)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 1 to 10
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   height            0 non-null      float64
 1   weight            0 non-null      float64
 2   weight_deviation  0 non-null      float64
 3   points            0 non-null      float64
dtypes: float64(4)
memory usage: 452.0 bytes
None
    height  weight  weight_deviation  points
1      NaN     NaN               NaN     NaN
2      NaN     NaN               NaN     NaN
3      NaN     NaN               NaN     NaN
4      NaN     NaN               NaN     NaN
5      NaN     NaN               NaN     NaN
6      NaN     NaN               NaN     NaN
7      NaN     NaN               NaN     NaN
8      NaN     NaN               NaN     NaN
9      NaN     NaN               NaN     NaN
10     NaN     NaN               NaN     NaN


### Common Examples of Interval Scale

In practice, variables measured on an interval scale are relatively rare.

Generally, points in time are indicated by variables measured on an interval scale. Let's say we want to indicate the point in time of the first manned mission on the Moon. If we want to use a ratio scale, our zero point must be meaningful and denote the absence of time. For this reason, we'd basically have to begin the counting at the very beginning of time.

There are many problems with this approach. One of them is that we don't know with precision when time began (assuming time actually has a beginning), which means we don't know how far away in time we are from that zero point.

To overcome this, we can set an arbitrary zero point, and measure the distance in time from there. Customarily, we use the Anno Domini system, where the zero point is arbitrarily set at the moment Jesus was born. Using this system, we can say that the first manned mission on the Moon happened in 1969. This means that the event happened 1968 years after Jesus' birth (1968 because there's no year 0 in the Anno Domini system).

![image.png](attachment:image.png)

Another common example has to do with measuring temperature. In day-to-day life, we usually measure temperature on a Celsius or a Fahrenheit scale. These scales are examples of interval scales.

Because temperature is measured on an interval scale, we need to avoid quantifying the difference in terms of ratios. For example, 0°C and 0°F are arbitrarily set zero points and don't indicate the absence of temperature. If 0°C or 0°F were meaningful zero points, temperatures below 0°C or 0°F wouldn't be possible. But we know that we can go way below 0°C or 0°F.

If yesterday was 10°C, and today is 20°C, we can't say that today is twice as hot as yesterday. We can say, however, that today's temperature is 10°C higher than yesterday's.

Temperature can be measured on a ratio scale too, and this is done using the Kelvin scale. 0 K (0 kelvin) is not set arbitrarily, and it indicates the lack of temperature. The temperature can't possibly drop below 0 K.

![image.png](attachment:image.png)
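The temperature example can be worked through numerically. The sketch below uses the standard conversion K = °C + 273.15; the helper name `celsius_to_kelvin` is made up for this illustration.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert a Celsius temperature to kelvin (K = C + 273.15)."""
    return celsius + 273.15

yesterday_c = 10.0
today_c = 20.0

# On the Celsius (interval) scale the distance is meaningful...
difference_c = today_c - yesterday_c  # 10.0

# ...but this ratio is not: it depends on where the arbitrary zero sits.
naive_ratio = today_c / yesterday_c   # 2.0

# On the Kelvin (ratio) scale the ratio is meaningful.
kelvin_ratio = celsius_to_kelvin(today_c) / celsius_to_kelvin(yesterday_c)

print(difference_c)            # 10.0
print(naive_ratio)             # 2.0
print(round(kelvin_ratio, 3))  # 1.035
```

Measured from a meaningful zero, 20°C is only about 3.5% higher than 10°C, not twice as high.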

### Discrete and Continuous Variables

Variables measured on interval and ratio scales can only take real numbers as values. 

In the following sample, both `Weight` and `PTS` are measured on a ratio scale.

The `PTS` variable cannot take intermediate values such as 9.2 or 9.5, because points are counted in whole numbers. Variables like this, whose values can only be counts, are called **discrete**.

On the other hand, between any two values of the `Weight` variable, there's an infinity of values. We call such variables **continuous**.

Whether a variable is discrete or continuous is determined by the underlying nature of the variable being considered, and not by the values obtained from the measurement.

![image.png](attachment:image.png)

In [37]:
# Random sample of weights and points
wnba_wp = wnba[["Name", "Weight", "PTS"]].sample(5, random_state = 0)

print(wnba_wp)

               Name  Weight  PTS
45    Diana Taurasi    74.0  376
118  Sequoia Holmes    70.0   81
16      Asia Taylor    76.0   31
56    Glory Johnson    77.0    9
22   Briann January    65.0  238


### Real Limits

Most likely, the players recorded at 77 kg (see the table at the end of this section) don't all have an exact weight of 77 kg. If the values were measured with a precision of one decimal, we'd probably see that the players have different weights. One player may weigh 76.7 kg, another 77.2 kg, another 77.1 kg.

If we measure the weight with zero decimals precision (which we do in our data set), a player weighing 77.4 kg will be assigned the same weight (77 kg) as a player weighing 76.6 kg. So if a player is recorded to weigh 77 kg, we can only tell that her actual weight is somewhere between 76.5 kg and 77.5 kg. The value of 77 is not really a distinct value here. Rather, it's an interval of values.

This principle applies to any possible numerical weight value. If a player is measured to weigh 76.5 kg, we can only tell that her weight is somewhere between 76.45 kg and 76.55 kg. If a player is recorded at 77.50 kg, we can only tell that her weight is somewhere between 77.495 kg and 77.505 kg. Because there can be an infinite number of decimals, we could continue this breakdown infinitely.

![image.png](attachment:image.png)




*The weight values in the table all show 77.0, and the trailing zero suggests a precision of one decimal point, but this is not the case. The values are automatically converted by pandas to float64 because of one NaN value in the `Weight` column, and end up with a trailing zero, which gives the false impression of one decimal point precision. So a player was recorded to weigh 77 kg (zero decimals precision), not 77.0 kg (one decimal precision).

Generally, every value of a continuous variable is an interval, no matter how precise the value is. The boundaries of an interval are sometimes called **real limits**. The lower boundary of the interval is called the lower real limit, and the upper boundary is called the upper real limit.

The following dictionary shows the **real limits** of some BMI values:

bmi = {21.201: [21.2005, 21.2015],
 21.329: [21.3285, 21.3295],
 23.875: [23.8745, 23.8755],
 24.543: [24.5425, 24.5435],
 25.469: [25.4685, 25.4695]}

![image.png](attachment:image.png)
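The real limits described above follow a simple rule: take half of the smallest unit the value was recorded with, then subtract it and add it. A minimal sketch, assuming the recorded value is passed as a string so that its number of decimals is preserved; `real_limits` is a hypothetical helper, not part of pandas.

```python
from decimal import Decimal

def real_limits(recorded: str):
    """Return (lower, upper) real limits of a value recorded with a given precision."""
    value = Decimal(recorded)
    # The exponent is minus the number of decimals, e.g. -1 for "76.5".
    smallest_unit = Decimal(1).scaleb(value.as_tuple().exponent)
    half_unit = smallest_unit / 2
    return value - half_unit, value + half_unit

print(real_limits("77"))      # (Decimal('76.5'), Decimal('77.5'))
print(real_limits("76.5"))    # (Decimal('76.45'), Decimal('76.55'))
print(real_limits("77.50"))   # (Decimal('77.495'), Decimal('77.505'))
print(real_limits("21.201"))  # (Decimal('21.2005'), Decimal('21.2015'))
```

`Decimal` is used instead of `float` so that the limits print exactly as they appear in the lesson, without binary rounding noise.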

In [42]:
wnba_wt = wnba.loc[wnba["Weight"] == 77, ["Name", "Weight"]][:5]
print(wnba_wt)

                 Name  Weight
9   Allison Hightower    77.0
19    Breanna Stewart    77.0
21        Bria Holmes    77.0
33       Chelsea Gray    77.0
56      Glory Johnson    77.0
