# Fundamentals of Statistics

## Data

The dataset used here comes from the [OECD](https://www.oecd.org/pisa/). PISA, the OECD's Programme for International Student Assessment, measures 15-year-olds’ ability to use their reading, mathematics and science knowledge and skills to meet real-life challenges. We start by importing the required libraries, then read in the data, which is stored as a space-delimited text file. Finally, we take the subset of rows where `Region == 82620`, the code that identifies the data for Scotland.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Read the space-delimited text file into a DataFrame
PISA = pd.read_csv("Data/PISA_2018_UK.txt", sep=" ", encoding="latin-1")
# Keep only the rows for Scotland (Region code 82620)
PISA_Scotland = PISA.loc[PISA["Region"] == 82620]
PISA_Scotland

## Normal distribution

We take the column `PV1MATH`, which holds the students' maths scores, and store it in a variable named `math`. The histogram below shows the distribution of these scores.

In [None]:
math = PISA_Scotland["PV1MATH"]
plt.hist(math, bins=20)
plt.title("Distribution of Maths scores in Scotland in 2018")
plt.xlabel("Maths scores")
plt.ylabel("Counts")
plt.show()

We use the relevant NumPy functions to calculate the mean and the sample standard deviation of this variable, and compare them with the summary produced by `describe()`.

In [None]:
mu = np.mean(math)
print("Mean =", mu)
# ddof=1 gives the sample standard deviation (divides by n - 1)
std = np.std(math, ddof=1)
print("Standard deviation =", std)
math.describe()
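
The `ddof` argument matters because NumPy's `np.std` divides by $n$ by default, whereas `ddof=1` divides by $n - 1$. pandas' `describe()` reports the $n - 1$ version, so the two outputs in the cell above should agree. The difference can be seen in a minimal sketch on a small made-up sample (the values below are illustrative and not taken from the PISA data):

```python
import numpy as np

# Small made-up sample for illustration only; not taken from the PISA data
scores_demo = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# ddof=0 (NumPy's default) divides by n: the population standard deviation
std_population = np.std(scores_demo)

# ddof=1 divides by n - 1: the sample standard deviation
std_sample = np.std(scores_demo, ddof=1)

print("ddof=0:", std_population)
print("ddof=1:", std_sample)
```

For a sample as large as the Scottish PISA subset the two values are likely to be very close, but the distinction becomes visible for small samples.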

The histogram suggests that the maths scores are approximately Normally distributed. If we make this assumption about the `math` variable, we can calculate probabilities from a Normal distribution with the mean and standard deviation estimated above. For this purpose, we import the `norm` distribution object from `scipy.stats` and use its `cdf` (cumulative distribution function) and `ppf` (percent point function, the inverse of `cdf`) methods.

What is the probability that a student's maths score is smaller than or equal to 492? That is, $P(X \leqslant 492) = ?$, where $X$ is a random variable denoting the maths score.

In [None]:
from scipy.stats import norm
norm.cdf(492, mu, std)
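
One informal way to judge whether the Normal assumption is reasonable is to compare the model probability with the proportion of observations that actually fall at or below the same value. The sketch below does this on a simulated stand-in sample; the seed, sample size, and the parameters 500 and 100 are assumptions chosen for illustration, not values from the PISA data.

```python
import numpy as np
from scipy.stats import norm

# Simulated stand-in sample (an assumption for illustration; not the PISA data)
rng = np.random.default_rng(0)
sample_demo = rng.normal(loc=500, scale=100, size=3000)

# Estimate the parameters from the sample, as done for the maths scores
mu_demo = np.mean(sample_demo)
std_demo = np.std(sample_demo, ddof=1)

# Probability of a value <= 492 under the fitted Normal model
p_model = norm.cdf(492, mu_demo, std_demo)

# Proportion of observations that are actually <= 492
p_empirical = np.mean(sample_demo <= 492)

print("model:", p_model)
print("empirical:", p_empirical)
```

With the notebook's data, the empirical counterpart would be `np.mean(math <= 492)`. A large gap between the two numbers would suggest that the Normal model fits poorly in that region.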

What is the probability that a student's maths score is smaller than or equal to 400? That is, $P(X \leqslant 400) = ?$

In [None]:
norm.cdf(400, mu, std)

What is the probability that a student's maths score is larger than or equal to 400? That is, $P(X \geqslant 400) = ?$ Because the Normal distribution is continuous, $P(X \geqslant 400) = 1 - P(X \leqslant 400)$.

In [None]:
1 - norm.cdf(400, mu, std)
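
As an alternative to subtracting from 1, `scipy.stats.norm` also provides the survival function `sf`, which returns the upper-tail probability directly. The sketch below uses hypothetical parameters (a mean of 500 and a standard deviation of 100, chosen for illustration rather than computed from the PISA data) to show that the two approaches agree:

```python
from scipy.stats import norm

# Hypothetical parameters for illustration only; not computed from the PISA data
mu_demo, std_demo = 500.0, 100.0

# Upper tail as the complement of the CDF
upper_tail_complement = 1 - norm.cdf(400, mu_demo, std_demo)

# Upper tail from the survival function, defined as 1 - cdf
upper_tail_sf = norm.sf(400, mu_demo, std_demo)

print(upper_tail_complement, upper_tail_sf)
```

In the notebook, the equivalent call would be `norm.sf(400, mu, std)`. The survival function can be more accurate than `1 - cdf` far out in the upper tail, where the CDF is very close to 1.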

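What is the probability that a student's maths score is between 400 and 500? That is, $P(400 \leqslant X \leqslant 500) = ?$ This is the difference between the two cumulative probabilities, $P(X \leqslant 500) - P(X \leqslant 400)$.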

In [None]:
norm.cdf(500, mu, std) - norm.cdf(400, mu, std)
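
The same interval probability can be obtained by standardising the endpoints to z-scores, $z = (x - \mu) / \sigma$, and using the standard Normal distribution. The sketch below illustrates this with hypothetical parameters (mean 500, standard deviation 100; these are assumptions, not the values estimated from the PISA data):

```python
from scipy.stats import norm

# Hypothetical parameters for illustration only; not computed from the PISA data
mu_demo, std_demo = 500.0, 100.0

# Convert the interval endpoints to z-scores
z_low = (400 - mu_demo) / std_demo
z_high = (500 - mu_demo) / std_demo

# Interval probability from the standard Normal (mean 0, standard deviation 1)
p_standardised = norm.cdf(z_high) - norm.cdf(z_low)

# Same probability using the location and scale arguments directly
p_direct = norm.cdf(500, mu_demo, std_demo) - norm.cdf(400, mu_demo, std_demo)

print(p_standardised, p_direct)
```

Passing `mu` and `std` as the second and third arguments, as the notebook does, performs this standardisation internally.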

What is the maths score that only the top $10\%$ of students reach or exceed? That is, find the score $x$ such that $P(X \leqslant x) = 0.9$.

In [None]:
norm.ppf(0.9, mu, std)

What is the maths score that $90\%$ of students reach or exceed? That is, find the score $x$ such that $P(X \leqslant x) = 0.1$.

In [None]:
norm.ppf(1 - 0.9, mu, std)
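
Since `ppf` is the inverse of `cdf`, feeding a quantile back into `cdf` should return the original probability. The sketch below checks this round trip and the symmetry of the two cut-offs around the mean, again with hypothetical parameters (mean 500, standard deviation 100) rather than the values estimated from the PISA data:

```python
from scipy.stats import norm

# Hypothetical parameters for illustration only; not computed from the PISA data
mu_demo, std_demo = 500.0, 100.0

# Score exceeded by only the top 10% (the 90th percentile)
cutoff_top10 = norm.ppf(0.9, mu_demo, std_demo)

# Score exceeded by 90% of students (the 10th percentile)
cutoff_bottom10 = norm.ppf(0.1, mu_demo, std_demo)

# Round trip: cdf applied to a ppf result recovers the probability
p_recovered = norm.cdf(cutoff_top10, mu_demo, std_demo)

# The two cut-offs lie at equal distances on either side of the mean
midpoint = (cutoff_top10 + cutoff_bottom10) / 2

print(cutoff_bottom10, cutoff_top10, p_recovered, midpoint)
```

The symmetry check holds for any Normal distribution, so the two cut-offs computed from the Scottish data should also be equidistant from `mu`.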

# Exercises

1. The dataset includes Maths, Reading, and Science scores; let's look at the reading scores now. Make a histogram of the reading scores (`PV1READ`) of students in Scotland. Do you think the distribution of the scores is Normal?

2. Find the summary statistics of the science scores.

3. Assume the Science variable is normally distributed. Calculate the following:
- The probability of a student getting at least 600 in the test.
- The probability of a student getting less than or equal to 421.
- What is the highest score that 75% of the students achieve?

4. Make a boxplot with both Maths and Reading scores and compare them.