# Fundamental Statistics

This section introduces you to the field of statistics and explores statistical concepts that will help you navigate data science and get the most out of your data. The topics are discussed under two major categories: descriptive and inferential statistics. The emphasis is on improving your statistical literacy and on using technology to answer questions.

## What exactly is statistics?

Statistics is the art and science of answering questions and exploring ideas through the processes of gathering data, describing data, and making inferences about a population on the basis of a smaller sample. With this in mind, you are expected to have the skills necessary to __formulate questions__ that can be addressed with data and __collect__, __organize (analyze)__, and __display relevant data to answer the questions (interpret results).__

An example of a statistical question is _“How old are the students in my school?”_. This
is a statistical question because to answer it, you would need to determine the ages of all students in your school,
and there would be variability in those data (since not all students are the same age). <br> <br>Another example of a statistical question is _"Which region of Nigeria had the lowest mortality rate in 2021?"_ __Can you think of more statistical questions?__

## Collecting your data

To gather the data needed to answer your questions, you have to be familiar with a host of terms, which is what this section deals with. We'll be looking at the following concepts:
1. Population and Sample
2. Variables
3. Measures of central tendency
4. Measures of variability and Coefficient of Variation
5. Skewness and Kurtosis
6. Normal distribution
7. Central Limit theorem
8. Confidence Interval

### Variables and Cases

A __case__ is an experimental unit. These are the individuals from which data are collected. When data are collected from humans, we sometimes call them participants. When data are collected from animals, the term subjects is often used.

A __variable__ is a characteristic that is measured and can take on different values. In other words, it is something that varies between cases. This is in contrast to a constant, which is the same for all cases in a research study.

#### Example
A teacher wants to know if `third-grade students` who spend more `time reading at home` get higher `homework grades` and `exam grades`.

The students are the cases. There are three variables: amount of time spent reading at home, homework grades, and exam grades. The grade-level of the students is a constant because all students are in the third grade.

#### Types of Variables

Variables can be broadly classified into categorical (or qualitative) and quantitative variables.

1. Categorical variables: Names or labels (i.e., categories) with no logical order or with a logical order but inconsistent differences between groups (e.g., rankings), also known as qualitative.
2. Quantitative variable: Numerical values with magnitudes that can be placed in a meaningful order with consistent intervals, also known as numerical.
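
As a rough illustration of the two types, here is a minimal pandas sketch. The `patients` table and its column names are hypothetical and invented only for this example. It assumes that an unordered label, an ordered label, and a measurement can each be stored with a matching pandas dtype.

```python
import pandas as pd

# Hypothetical records: each row is a case, each column is a variable
patients = pd.DataFrame({
    "blood_type": ["A", "O", "B", "O"],              # categorical, no logical order
    "pain_level": ["low", "high", "medium", "low"],  # categorical, ordered labels
    "height_cm": [172.5, 160.0, 181.2, 168.4],       # quantitative
})

# Unordered categories: labels only
patients["blood_type"] = patients["blood_type"].astype("category")

# Ordered categories: a ranking exists, but the gaps between levels are not consistent
patients["pain_level"] = pd.Categorical(
    patients["pain_level"], categories=["low", "medium", "high"], ordered=True
)

print(patients.dtypes)
print(patients["pain_level"].max())   # the ordering makes max() meaningful
print(patients["height_cm"].mean())   # arithmetic only makes sense for quantitative data
```

Taking a mean of `blood_type` would not be meaningful, which is one practical reason to identify the variable type before analysing it.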

A team of medical researchers weigh participants in kilograms. What kind of variable is this?

A teacher conducts a poll in her class. She asks her students if they would prefer chocolate, vanilla, or strawberry ice cream at their class party. Ice cream flavor here is what type of variable?

###### Independent vs. dependent variables
Experiments are usually designed to find out what effect one variable has on another. In some research studies, one variable is used to predict or explain differences in another variable. In those cases, the explanatory variable (also known as the independent or predictor variable) is used to predict or explain differences in the response variable (also known as the dependent variable). In an experimental study, the explanatory variable is the variable that is manipulated by the researcher.

### Population and Sample

We often have questions concerning large populations. Gathering information from the entire population is not always possible due to barriers such as time, accessibility, or cost. Instead of gathering information from the whole population, we often gather information from a smaller subset of the population, known as a sample.

__Population:__ The entire set of possible cases

__Sample__: A subset of the population from which data are collected

__Statistic__: A measure concerning a sample (e.g., sample mean)

__Parameter__: A measure concerning a population (e.g., population mean)

__The process of using sample statistics to make conclusions about population parameters is known as inferential statistics. In other words, data from a sample are used to make an inference about a population.__

A survey is carried out at Unilorin to estimate the proportion of all undergraduate students living at home during the current semester. Of the 20,000 undergraduate students enrolled at the campus, a random sample of 300 was surveyed.

* Population: All 20,000 undergraduate students at Unilorin
* Sample: The 300 undergraduate students surveyed
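
To make the parameter/statistic distinction concrete, here is a small simulation sketch of the survey above. The population is synthetic: the 35% "lives at home" rate and the fixed seed are assumptions made purely for illustration, not real survey figures.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 20,000 students: True means "lives at home".
# The 35% rate is invented for illustration only.
population = rng.random(20_000) < 0.35

# Parameter: measured on the whole population (usually unknown in practice)
population_proportion = population.mean()

# Statistic: measured on a random sample of 300 students, drawn without replacement
sample = rng.choice(population, size=300, replace=False)
sample_proportion = sample.mean()

print(f"Population proportion (parameter): {population_proportion:.3f}")
print(f"Sample proportion (statistic):     {sample_proportion:.3f}")
```

The two numbers will usually be close but not identical. Inferential statistics is about reasoning from the second to the first.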

### Measures of central tendency

Central tendency refers to the center, or typical location, of the values in a data set. Measures of central tendency summarize the data with a single representative value, such as the average. They do not describe how widely the values are spread; that is the job of the measures of variability. Knowing the typical value helps you judge whether a new observation is consistent with the existing data.

__Mean:__ The average value of the data, that is, the sum of the values divided by the number of values.

In [2]:
student_age = [12,11,14,12,13,15,16,13,14,15,14,11]
mean_age = sum(student_age)/len(student_age)
mean_age_rounded = round(mean_age, 2)
print(f"Mean student age is: {mean_age_rounded}")

Mean student age is: 13.33


In [10]:
# with numpy
import numpy as np
student_age_arr = np.array(student_age)
mean_age_np = np.mean(student_age_arr)
print(f"mean age with numpy: {mean_age_np:.2f}")

mean age with numpy: 13.33


In [4]:
# TO DO: find the mean using Pandas (HINT: change the list to a pandas series first)
# Your code here
import pandas as pd
student_age_series = pd.Series(student_age)
type(student_age_series)

pandas.core.series.Series

In [8]:
round(student_age_series.mean(), 2)

13.33

In [48]:
# TO DO: find the mean using the statistics module

__Median:__ It is the middle value in a distribution when the values are arranged in ascending or descending order. When the number of values is even, it is the average of the two middle values.

In [11]:
#list_sorted = sorted(student_age)
#mid = len(list_sorted)//2 # provided the length % 2 != 0
#median = list_sorted[mid]
#print(median)
median_age = np.median(student_age_arr)
print(f"The median age is: {median_age}")

The median age is: 13.5
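
The commented-out lines in the cell above only cover an odd number of values. As a sketch, a hand-written version that handles both cases could look like this (`median_by_hand` is a name made up for this example):

```python
def median_by_hand(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # odd count: the single middle value
        return ordered[mid]
    # even count: the average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

student_age = [12, 11, 14, 12, 13, 15, 16, 13, 14, 15, 14, 11]
print(median_by_hand(student_age))  # 13.5, matching np.median above
```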


__Mode:__ It is the most commonly occurring value in a distribution

In [12]:
from statistics import mode
mode_age = mode(student_age)
print(f"Mode is: {mode_age}")
#print(mode(student_age_arr))

Mode is: 14
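
A data set can have more than one mode. As a small sketch with made-up shoe sizes, `statistics.multimode` (available from Python 3.8) returns every value tied for the highest count, while `mode` returns only the first of them that it encounters:

```python
from statistics import mode, multimode

shoe_sizes = [40, 42, 40, 42, 38]  # hypothetical data with a tie between 40 and 42

print(mode(shoe_sizes))       # 40: the first of the tied values encountered
print(multimode(shoe_sizes))  # [40, 42]: all values tied for the highest count
```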


We'll demonstrate the difference between sample statistics and population parameters here. For example, suppose we are interested in the shoe sizes of the guys residing in a school-owned hostel.

In [13]:
# let's assume the total number of guys residing in the school-owned hostel is 150
# randint's upper bound is exclusive, so the sizes range from 30 to 44
size_pool = np.random.randint(30, 45, 150)
print(size_pool)

[34 40 40 34 37 40 30 42 33 37 33 40 34 38 33 35 34 31 33 40 30 44 43 33
 33 43 33 42 30 34 44 31 32 41 31 38 38 31 44 42 34 42 39 40 39 32 36 41
 30 31 34 38 32 34 35 30 41 41 39 34 39 41 37 33 37 42 34 44 30 41 35 30
 32 31 36 39 30 31 36 32 33 39 33 31 34 42 35 36 43 30 41 36 34 41 39 43
 36 43 34 37 43 39 38 44 33 32 37 30 44 33 33 41 31 39 36 38 32 32 39 38
 43 39 44 32 43 37 32 42 42 36 33 44 43 34 34 31 43 39 43 35 37 41 40 40
 44 41 41 41 40 32]


In [14]:
population_mean = np.mean(size_pool)
print(f"Mean shoe size of the population: {population_mean:.2f}")

Mean shoe size of the population: 36.83


In [15]:
# TO DO: find the median of the population
population_median = np.median(size_pool)
print(f"Median shoe size of the population is {population_median:.2f}")

Median shoe size of the population is 37.00


In [11]:
# TO DO: find the mode of the population

In [21]:
# let's assume we chose to sample only three hostels
# note: np.random.choice samples with replacement by default (pass replace=False to avoid repeats)
hostel_a = np.random.choice(size_pool, 40) # 40 students selected randomly for this hostel
print(hostel_a)

[39 37 33 40 33 39 42 32 43 44 33 40 31 33 31 41 32 33 33 40 44 41 35 37
 41 42 42 43 39 32 41 36 33 39 41 33 33 40 37 42]


In [22]:
sample_mean = np.mean(hostel_a)
print(f"Sample mean is: {sample_mean:.2f}")

Sample mean is: 37.50


In [23]:
# Let's create more samples and compare their means
hostel_b = np.random.choice(size_pool, 35)
hostel_c = np.random.choice(size_pool, 50)

In [24]:
# let's create a list of arrays from the samples
all_samples = [hostel_a, hostel_b, hostel_c]
print(all_samples)

[array([39, 37, 33, 40, 33, 39, 42, 32, 43, 44, 33, 40, 31, 33, 31, 41, 32,
       33, 33, 40, 44, 41, 35, 37, 41, 42, 42, 43, 39, 32, 41, 36, 33, 39,
       41, 33, 33, 40, 37, 42]), array([34, 39, 36, 31, 34, 31, 32, 36, 33, 34, 32, 39, 42, 33, 44, 40, 37,
       43, 34, 34, 38, 36, 30, 34, 43, 37, 43, 32, 30, 33, 44, 39, 32, 33,
       44]), array([39, 44, 42, 31, 34, 41, 37, 33, 39, 41, 30, 38, 34, 38, 31, 33, 40,
       33, 34, 44, 32, 32, 36, 33, 41, 34, 34, 44, 34, 41, 42, 41, 36, 34,
       40, 38, 30, 34, 36, 44, 40, 39, 38, 34, 41, 44, 42, 41, 32, 34])]


In [20]:
len(hostel_a)+len(hostel_b)+len(hostel_c)

125

In [16]:
# now let's compare the means of the three samples
# (a new name avoids overwriting the single sample_mean computed earlier)
sample_means = [np.mean(hostel) for hostel in all_samples]

print(sample_means)

[37.95, 35.542857142857144, 37.38]


In [17]:
# We can see that the samples have slightly different sample means

In [33]:
# TO DO: complete the code below:

# turn the list into a dictionary with the sample names as the keys
sample_dict = {"hostel_a":hostel_a,
              "hostel_b":hostel_b,
              "hostel_c":hostel_c}

# create a DataFrame from the sample_dict
import pandas as pd

# wrapping each array in a Series lets pandas pad the shorter samples with NaN
dict_to_frame = {name: pd.Series(sizes) for name, sizes in sample_dict.items()}

df = pd.DataFrame(dict_to_frame)
df

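| | hostel_a | hostel_b | hostel_c |
|---|---|---|---|
| 0 | 39.0 | 34.0 | 39 |
| 1 | 37.0 | 39.0 | 44 |
| 2 | 33.0 | 36.0 | 42 |
| 3 | 40.0 | 31.0 | 31 |
| 4 | 33.0 | 34.0 | 34 |
| 5 | 39.0 | 31.0 | 41 |
| 6 | 42.0 | 32.0 | 37 |
| 7 | 32.0 | 36.0 | 33 |
| 8 | 43.0 | 33.0 | 39 |
| 9 | 44.0 | 34.0 | 41 |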


In [19]:
# What is the average shoe size of the students in hostel_a?

In [20]:
# How many students in hostel_a use more than the average shoe size?

In [21]:
# Which of the hostels has the smallest shoe size?
# Hint: Use min() and compare across the three hostels

In [22]:
# What is the difference between the smallest and largest shoe size in hostel_b?
# Hint: use min() and max()

In [23]:
# What is the most commonly used shoe size across the three hostels?

### Measures of Variability

While measures of central tendency describe the approximate center of a distribution, measures of variability describe the spread of the data.

#### Range

The range is the difference between the lowest and highest values in a data set. To determine the range:
* Identify the largest value in your data set. This is called the maximum.
* Identify the lowest value in your data set. This is called the minimum
* Subtract the minimum from the maximum

In [63]:
age_in_years = [9, 10, 12, 13, 14, 14, 17, 17, 20]
age_in_years_arr = np.array(age_in_years)
age_range = age_in_years_arr.max() - age_in_years_arr.min()
#age_range = np.max(age_in_years_arr) - np.min(age_in_years_arr)
age_range

11

#### Finding the upper and lower quartile

The quartiles of a group of data are the medians of the upper and lower halves of that set. The lower
quartile, Q1, is the median of the lower half, while the upper quartile, Q3, is the median of the upper
half. To find Q1 and Q3:
* Put the data in order from smallest to largest.
* Identify the upper and lower halves of your data
* Using the lower half, find Q1 by finding the median of that half. 
* Using the upper half, find Q3 by finding the median of that half
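
The steps above can be sketched directly in code. One assumption is labeled here: when the number of values is odd, this sketch leaves the overall median out of both halves, which is one common textbook convention (others include it). `np.percentile` uses linear interpolation by default, so its quartiles can differ slightly from the median-of-halves result.

```python
import numpy as np

def quartiles_by_halves(values):
    """Q1 and Q3 as the medians of the lower and upper halves.

    Assumption: for an odd number of values, the overall median
    is excluded from both halves.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 == 1 else data[half:]
    return float(np.median(lower)), float(np.median(upper))

ages = [9, 10, 12, 13, 14, 14, 17, 17, 20]
q1, q3 = quartiles_by_halves(ages)
print(q1, q3)  # 11.0 17.0

# Default linear interpolation gives a different Q1 for the same data
print(np.percentile(ages, 25), np.percentile(ages, 75))  # 12.0 17.0
```

Neither result is wrong; they follow different quartile conventions, so it helps to state which one you used.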

In [29]:
# first quartile
Q1 = np.percentile(age_in_years_arr, 25)
Q1

12.0

In [31]:
# second quartile
Q2 = np.percentile(age_in_years_arr, 50)
Q2

14.0

In [64]:
# third quartile
Q3 = np.percentile(age_in_years_arr, 75)
Q3

17.0

#### Interquartile range

This is the difference between Q3 and Q1 (Q3 - Q1). It can be used to identify observations that fall outside the general scope of the other observations __(i.e., outliers)__.

In [65]:
IQR = Q3 - Q1
IQR

5.0

We can use the IQR method of identifying outliers to set up a “fence” outside of Q1 and Q3. Any values that fall outside of this fence are considered outliers. To build this fence we take 1.5 times the IQR and then subtract this value from Q1 and add this value to Q3. This gives us the minimum and maximum fence posts that we compare each observation to. Any observations that are more than 1.5 IQR below Q1 or more than 1.5 IQR above Q3 are considered outliers. This is the method that Minitab uses to identify outliers by default.

In [35]:
#A teacher wants to examine students’ test scores
test_score = [74, 88, 78, 90, 94, 90, 84, 90, 98, 80]

# turn the list to numpy array
test_score_arr = np.array(test_score)
# find Q1
Q1_score = np.percentile(test_score_arr, 25)
#find Q3
Q3_score = np.percentile(test_score_arr, 75)
#find  IQR
IQR_score = Q3_score - Q1_score
IQR_score

9.0

In [36]:
fence_val = 1.5 * IQR_score
lower_fence = Q1_score - (fence_val)
upper_fence = Q3_score + (fence_val)
print(lower_fence)
print(upper_fence)

67.5
103.5


Any scores that are less than 67.5 or greater than 103.5 are outliers.
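
As a sketch of how the fences could be applied, the helper below (`iqr_outliers`, a name invented for this example) returns the values that fall outside them. The extra score of 40 in the second call is hypothetical and added only to show a value being flagged.

```python
import numpy as np

def iqr_outliers(values):
    """Return (outliers, lower_fence, upper_fence) using the 1.5 * IQR rule."""
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # boolean mask: True where a value lies outside either fence
    mask = (arr < lower_fence) | (arr > upper_fence)
    return arr[mask], lower_fence, upper_fence

test_score = [74, 88, 78, 90, 94, 90, 84, 90, 98, 80]
outliers, low, high = iqr_outliers(test_score)
print(outliers, low, high)  # no outliers; fences at 67.5 and 103.5

# Adding a hypothetical very low score changes the quartiles and gets flagged
outliers_with_40, _, _ = iqr_outliers(test_score + [40])
print(outliers_with_40)
```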

#### Variance and Standard Deviation

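The variance and standard deviation measure spread based on how far each data value is from the
mean. The variance is the average of the squared distances from the mean, and the standard deviation is its square root, which puts it back in the same units as the data.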
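
Here is a minimal sketch of both calculations done by hand on the small age data set used earlier. It also shows the usual distinction between the population formula (divide by n) and the sample formula (divide by n - 1), which NumPy exposes through the `ddof` argument.

```python
import numpy as np

data = np.array([9, 10, 12, 13, 14, 14, 17, 17, 20])
n = data.size

# squared distance of each value from the mean
squared_deviations = (data - data.mean()) ** 2

population_variance = squared_deviations.sum() / n    # divide by n
sample_variance = squared_deviations.sum() / (n - 1)  # divide by n - 1

population_std = np.sqrt(population_variance)
sample_std = np.sqrt(sample_variance)

print(population_variance, sample_variance)
print(population_std, sample_std)

# np.var and np.std default to ddof=0 (population formula);
# ddof=1 gives the sample versions
print(np.isclose(population_variance, np.var(data)))
print(np.isclose(sample_variance, np.var(data, ddof=1)))
```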

In [66]:
# using our size pool data, let's get the population and sample variance
population_var = np.var(size_pool)
population_var

18.355555555555554

In [40]:
# np.var uses ddof=0 (divide by n) by default; pass ddof=1 for the unbiased sample variance
sample_var = np.var(hostel_a)
sample_var

20.647499999999997

In [41]:
# standard deviation
population_std = np.std(size_pool)
population_std

4.720998011249551

In [42]:
# np.std also defaults to ddof=0; pass ddof=1 for the sample standard deviation
sample_std = np.std(hostel_a)
sample_std

4.543952024394623