# EX2 - Statistics

![stats](imgs/stats.png)

In this notebook we'll implement the basics of descriptive statistics. Statistics describe the distribution of data and help us understand how the data is spread out.

We won't use any library in this exercise; you are asked to implement each statistics function yourself.

## Min and Max

The minimum (min) is the smallest value in a data set, while the maximum (max) is the largest value. Together, they provide the range of the data, showing its spread from the lowest to the highest point.

In [57]:
def min(l):
  """
  Loop through the list and return the minimum value
  """
  min_l = l[0]
  for element in l[1:]:
    if element < min_l:
      min_l = element
  return min_l

def max(l):
  """
  Loop through the list and return the maximum value
  """
  max_l = l[0]
  for element in l[1:]:
    if element > max_l:
      max_l = element
  return max_l

## Range

The range is the difference between the maximum and minimum values in a data set. It provides a simple measure of how spread out the data is by indicating the total span between the smallest and largest values.

> Feel free to use the `min` and `max` functions you implemented in the previous exercise

In [58]:
def m_range(l):
  """
  calculate the range
  """
  # max_l = l[0]
  # min_l = l[0]
  # for element in l[1:]:
  #   if element > max_l:
  #     max_l = element
  #   if element < min_l:
  #     min_l = element

  range_of_values = max(l) - min(l)
  return range_of_values
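
The commented-out lines in `m_range` hint at an alternative: finding the minimum and the maximum in a single pass instead of two. A minimal sketch of that idea follows, assuming a non-empty list; the names `min_max_single_pass` and `range_single_pass` are hypothetical and not part of the exercise.

```python
def min_max_single_pass(values):
    """Return (smallest, largest) after one pass over a non-empty list."""
    smallest = largest = values[0]
    for value in values[1:]:
        if value < smallest:
            smallest = value
        elif value > largest:
            largest = value
    return smallest, largest


def range_single_pass(values):
    """Range computed from the single-pass minimum and maximum."""
    smallest, largest = min_max_single_pass(values)
    return largest - smallest


print(range_single_pass([7, 2, 9, 4]))  # 9 - 2 = 7
```

Both approaches are linear in the length of the list; the single pass only saves one traversal.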

## Mean

The mean, or average, is the sum of all values in a data set divided by the number of values. It represents the central point of the data, giving a measure of its overall trend. The formula for the mean is: $\frac{1}{n}\sum_{i=1}^n x_i$

where:
- $n$ is the number of elements in the list
- $x_i$ is the $i$-th element of the list

You can read this formula as the sum of all the elements of the list divided by the length of the list.

In [59]:
def mean(l: list):
  """
  calculate the mean
  """
  # if not l:
  #   return None
  
  # return sum(l) / len(l)

  total = 0
  for i in l:
    total += i

  return total / len(l)

In [60]:
print('Mean ', mean([1,2,3,4,5]))

Mean  3.0


## Median

The median is the middle value of a data set when the values are arranged in ascending or descending order. It divides the data into two equal halves, with half the values being smaller and half being larger, making it a good measure of central tendency when there are outliers.

These are the steps to calculate the median:

1. Sort the list (here you can use the `sorted` function or the `list.sort` method)
2. If the list has an odd number of elements, return the middle element
3. If the list has an even number of elements, return the average of the two middle elements

> **Note :** To know whether the list has one or two middle elements, use the length of the list. If the length is odd, there is only one middle element. If the length is even, there are two middle elements.

> **Note :** To know whether a number is odd or even, use the modulo operator `%`: the remainder of the division by 2 is either `0` or `1`. If it is `1` the number is odd, and if it is `0` the number is even.
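
The two notes above can be illustrated with a small sketch that only locates the middle position(s) of a list. The helper name `middle_indices` is hypothetical, and the list is assumed to be non-empty.

```python
def middle_indices(values):
    """Return the index (odd length) or the two indices (even length)
    of the middle element(s) of a non-empty list."""
    n = len(values)
    half = n // 2          # integer division
    if n % 2 == 1:         # remainder 1: odd length, one middle element
        return [half]
    return [half - 1, half]  # remainder 0: even length, two middle elements


print(middle_indices([10, 20, 30]))      # [1]
print(middle_indices([10, 20, 30, 40]))  # [1, 2]
```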

In [61]:
def median(l):
  """
  calculate the median
  """
  sorted_l = sorted(l)
  n = len(sorted_l) // 2  # index of the (upper) middle element

  # Odd length: a single middle element
  if len(sorted_l) % 2 == 1:
    return sorted_l[n]

  # Even length: average of the two middle elements
  return (sorted_l[n] + sorted_l[n - 1]) / 2
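
To see why the median is described as robust to outliers, the sketch below compares it with the mean on a small list before and after adding one extreme value. The numbers are made up for illustration, and `mean_sketch` and `median_sketch` are hypothetical stand-ins for the functions of this notebook so that the example runs on its own.

```python
def mean_sketch(values):
    """Sum of the values divided by their count."""
    total = 0
    for v in values:
        total += v
    return total / len(values)


def median_sketch(values):
    """Middle value of the sorted list (average of the two middle values if even)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[half]
    return (ordered[half - 1] + ordered[half]) / 2


data = [30, 32, 35, 38, 40]   # illustrative values
with_outlier = data + [500]   # same data plus one extreme value

print(mean_sketch(data), median_sketch(data))                  # 35.0 35
print(mean_sketch(with_outlier), median_sketch(with_outlier))  # 112.5 36.5
```

The single outlier moves the mean from 35.0 to 112.5, while the median only moves from 35 to 36.5.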

## Quartiles

Quartiles divide a data set into four equal parts, helping to describe the distribution. The first quartile (Q1) represents the 25th percentile, the second quartile (Q2) is the median or 50th percentile, and the third quartile (Q3) is the 75th percentile, showing the spread of values in each quarter of the data.

In [62]:
def quartiles(l):
  """
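  Calculate the quartiles of a given list of numbers.
  Returns:
    tuple: A tuple containing three values:
      - Q1 (float): The first quartile (25th percentile).
      - Q2 (float): The second quartile (50th percentile, or median).
      - Q3 (float): The third quartile (75th percentile).
  """
  sorted_l = sorted(l)  # the halves must be taken from sorted data
  mid = len(sorted_l) // 2
  q2 = median(sorted_l)
  q1 = median(sorted_l[:mid])  # lower half
  # Upper half: skip the middle element only when the length is odd
  upper = sorted_l[mid + 1:] if len(sorted_l) % 2 == 1 else sorted_l[mid:]
  q3 = median(upper)
  return q1, q2, q3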
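
Several conventions exist for computing quartiles, and they can give slightly different results. The sketch below illustrates one of them, the "median of halves" convention: sort the data, take Q2 as the median, then take Q1 and Q3 as the medians of the lower and upper halves, leaving the middle element out of both halves when the length is odd. The names `median_sketch` and `quartiles_sketch` are hypothetical, and the list is assumed to hold at least two elements.

```python
def median_sketch(values):
    """Middle value of the sorted list (average of the two middle values if even)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[half]
    return (ordered[half - 1] + ordered[half]) / 2


def quartiles_sketch(values):
    """Q1, Q2, Q3 using the median-of-halves convention."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # Odd length: the middle element belongs to neither half
    upper = ordered[half + 1:] if len(ordered) % 2 == 1 else ordered[half:]
    return median_sketch(lower), median_sketch(ordered), median_sketch(upper)


print(quartiles_sketch([1, 2, 3, 4, 5]))           # (1.5, 3, 4.5)
print(quartiles_sketch([4, 1, 3, 2, 6, 5, 8, 7]))  # (2.5, 4.5, 6.5)
```

The second call uses an unsorted, even-length list: after sorting, the halves are `[1, 2, 3, 4]` and `[5, 6, 7, 8]`.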

## Mode

The mode is the value that appears most frequently in a data set. It represents the most common data point and is useful for identifying patterns or repeated occurrences within the data. If there are multiple elements that appear the same number of times, return the lowest one.

> **Note :** Use a dictionary to store the frequency of appearances of each element.
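
As an illustration of the note above, the sketch below builds the frequency dictionary for a small list and then resolves a tie by keeping the lowest value. The helper name `frequencies` is hypothetical.

```python
def frequencies(values):
    """Map each distinct value to the number of times it appears."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return counts


counts = frequencies([1, 2, 2, 3, 4, 4])
print(counts)  # {1: 1, 2: 2, 3: 1, 4: 2}

# Find the highest frequency, then every value that reaches it
highest = 0
for count in counts.values():
    if count > highest:
        highest = count
tied = [value for value, count in counts.items() if count == highest]
print(tied)  # [2, 4]

# Tie-breaking rule of this exercise: keep the lowest value
lowest_mode = sorted(tied)[0]
print(lowest_mode)  # 2
```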

In [63]:
def mode(l):
  """
  calculate the mode
  """
  # l = sorted(l) # O(len(l)*log(len(l)))

  d = {}
  for i in l: # O(n) 
    d[i] = d.get(i, 0) + 1
    
    # if i not in d:
    #   d[i] = 1  # d.update({i:1})
    # else:
    #   d[i]+=1

  max_o = 0
  for j in d: # O(n)
    if d[j]>max_o:
      max_o = d[j]
  
  maxs = []
  for k in d: # O(n)
    if d[k] == max_o:
      maxs.append(k)
  
  return min(maxs)
      # return k
  #     mode = k
  #     break
  # return mode

In [64]:
l = [1, 2, 2, 3, 4, 4]
print(mode(l))

2


## Variance

Variance measures how far individual data points in a set are from the mean. It quantifies the spread of data, with higher variance indicating more spread-out values and lower variance showing data points are closer to the mean. The formula is: $\frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2$

where:
- $\mu$ is the mean of the data
- $x_i$ is the $i$-th element of the list
- $n$ is the number of elements in the list

You can see it as the sum of the squares of the differences of each element from the mean divided by the length of the list.

In [65]:
def variance(l):
  """
  calculate the variance
  """
  mu = sum(l) / len(l)                           # O(n)
  variance = sum([(x-mu)**2 for x in l])/len(l)  # O(n)

  return variance
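
To make the formula concrete, here is the same computation broken into its intermediate steps for the list `[1, 2, 3, 4, 5]`. The variable names are chosen for this illustration only.

```python
values = [1, 2, 3, 4, 5]
n = len(values)

mu = sum(values) / n                      # mean: 3.0
deviations = [x - mu for x in values]     # [-2.0, -1.0, 0.0, 1.0, 2.0]
squared = [d ** 2 for d in deviations]    # [4.0, 1.0, 0.0, 1.0, 4.0]
variance_value = sum(squared) / n         # 10.0 / 5 = 2.0

print(variance_value)  # 2.0
```

Squaring the deviations keeps positive and negative differences from cancelling each other out.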

## Standard Deviation

Standard deviation measures the amount of variation or dispersion in a data set. It shows how much the values deviate from the mean, with a low standard deviation indicating values are close to the mean, and a high standard deviation indicating more spread-out values.

> Smaller standard deviation indicates that the data points are closer to the mean

> Larger standard deviation indicates that the data points are more spread out.

The formula for the standard deviation is: $\sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2}$

where:
- $\mu$ is the mean of the data
- $x_i$ is the $i$-th element of the list
- $n$ is the number of elements in the list

You can see it as the square root of the sum of the squares of the differences of each element from the mean divided by the length of the list.

Equivalently, the standard deviation (`std`) is the square root of the variance.

In [66]:
def standard_deviation(l):
  """
  calculate the standard deviation
  """
  return variance(l)**0.5
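
Although the exercise itself avoids libraries, the result can be cross-checked against Python's standard `statistics` module. Its `pvariance` and `pstdev` functions divide by $n$, like the formulas in this notebook. The sketch below is one way to run that check; `variance_sketch` and `std_sketch` are hypothetical stand-ins so that the example is self-contained.

```python
import math
import statistics


def variance_sketch(values):
    """Mean of the squared deviations from the mean (divides by n)."""
    mu = sum(values) / len(values)
    return sum([(x - mu) ** 2 for x in values]) / len(values)


def std_sketch(values):
    """Square root of the variance."""
    return variance_sketch(values) ** 0.5


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, squared deviations sum to 32
print(variance_sketch(data))  # 4.0
print(std_sketch(data))       # 2.0

# Cross-check with the standard library (population versions)
print(math.isclose(variance_sketch(data), statistics.pvariance(data)))  # True
print(math.isclose(std_sketch(data), statistics.pstdev(data)))          # True
```

Because the standard deviation is the square root of the variance, it is expressed in the same unit as the data, which makes it easier to interpret.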

## Testing part :)

In [70]:
l = list(range(1, 6))

assert min(l) == 1, f"Expected 1, got {min(l)}"
assert max(l) == 5, f"Expected 5, got {max(l)}"
assert m_range(l) == 4, f"Expected 4, got {m_range(l)}"
assert mean(l) == 3, f"Expected 3, got {mean(l)}"
assert median(l) == 3, f"Expected 3, got {median(l)}"
assert quartiles(l) == (1.5, 3, 4.5), f"Expected (1.5, 3, 4.5), got {quartiles(l)}"
assert variance(l) == 2, f"Expected 2, got {variance(l)}"
assert standard_deviation(l) == 1.4142135623730951, f"Expected 1.4142135623730951, got {standard_deviation(l)}"

l = list(range(1, 6))
l.extend([2, 4, 5, 5])

assert mode(l) == 5, f"Expected 5, got {mode(l)}"
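
Comparing floating-point results with `==` can be fragile, because rounding may change the last digits of a result. A more tolerant variant of the standard deviation check, using `math.isclose` from the standard library, could look like the sketch below. The function `std_sketch` is a hypothetical stand-in that mirrors the formula used in this notebook.

```python
import math


def std_sketch(values):
    """Population standard deviation: square root of the mean squared deviation."""
    mu = sum(values) / len(values)
    return (sum([(x - mu) ** 2 for x in values]) / len(values)) ** 0.5


# Exact equality can fail for floats even when the math is right
print(0.1 + 0.2 == 0.3)              # False
print(math.isclose(0.1 + 0.2, 0.3))  # True

values = list(range(1, 6))
result = std_sketch(values)
expected = math.sqrt(2)  # the variance of 1..5 is 2
assert math.isclose(result, expected), f"Expected about {expected}, got {result}"
```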