<p style="color:#153462; 
          font-weight: bold; 
          font-size: 30px; 
          font-family: Gill Sans, sans-serif; 
          text-align: center;">
          Commonly used Statistics Terms</p>

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       In this notebook, we will learn about the most commonly used statistical terms. I will also include <i>Python</i>
       snippets wherever they are needed.
   </font>
</p>

In [1]:
import statistics
import numpy as np

#### Central Tendency
<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       A measure of central tendency helps us find the middle, or average, of a dataset. The most commonly used measures of <i>central tendency</i> are the <i>mode</i>, <i>median</i> and <i>mean</i>.
   </font>
</p>


#### Mode
<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       The <i>mode</i> or <i>modal</i> value of a dataset is the word, number or other element that occurs most frequently in the dataset.
   </font>
</p>


In [2]:
# Finding the mode
data = [12, 23, 34, 45, 12, 1, 0, 45, 45]
# In this dataset, 45 is the most frequent value (it appears three times)
statistics.mode(data)

45

In [3]:
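countries = ["India", "Japan", "America", "America", "India", "Mexico"]
# "India" and "America" both appear twice; since Python 3.8, mode() returns
# the first of the tied values that it encounters
statistics.mode(countries)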

'India'
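
When several values tie for the highest count, as "India" and "America" do above, `statistics.mode` reports only one of them. A minimal sketch of how all tied values could be listed, assuming Python 3.8 or later (where `statistics.multimode` is available); the variable names `first_mode` and `all_modes` are illustrative only:

```python
import statistics

countries = ["India", "Japan", "America", "America", "India", "Mexico"]

# mode() returns only the first of the tied values (Python 3.8+)
first_mode = statistics.mode(countries)

# multimode() returns every value that shares the highest count,
# in the order the values first appear in the data
all_modes = statistics.multimode(countries)

print(f"mode: {first_mode}")
print(f"multimode: {all_modes}")
```

Here `multimode` returns `['India', 'America']`, which makes the tie visible instead of hiding it.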

<strong> References: </strong> <br>
https://www.scribbr.com/statistics/mode/

#### Median
<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       The median is the value exactly in the middle of a dataset once its values are ordered from low to high.
       Ordering is required: the middle element of an unordered dataset is generally not the median.
       As a measure of central tendency, the median separates the lowest 50% of values from the highest
       50% of values.
   </font>
</p>


<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       <b>Formula when the dataset length is odd</b>:<br>
        $position = \frac{(n+1)}{2}$ <br>
        where <br>
        $n$ is the length of the dataset and positions are counted from 1 <br>
        median = sorted_data[position]
   </font>
</p>


In [10]:
data = sorted([3, 1, 4, 7, 5, 6, 9])
position = (len(data) + 1) // 2
# The formula counts positions from 1, but Python indexing starts from 0, so subtract 1
median = data[position - 1]
print(f"Ordered Data: {data}")
print(f"Median: {median}")

Ordered Data: [1, 3, 4, 5, 6, 7, 9]
Median: 5


In [7]:
# You can also use the statistics module
statistics.median(data)

5

<b>Formula when the dataset length is even:</b><br>
When the dataset length is even, there is no single middle element, so the median is the mean of the two middle elements, at positions <i>middle position</i> and <i>middle position + 1</i>.

$middle\_position = (\frac{n}{2})$

$middle\_position + 1 = (\frac{n}{2}) + 1$

where <strong>n</strong> is the total number of elements

In [15]:
data = sorted([3, 1, 4, 7])
# Subtract 1 because the formula counts positions from 1, while Python indexes from 0
middle_position = (len(data) // 2) - 1
middle_position_1 = middle_position + 1
median = (data[middle_position] + data[middle_position_1]) / 2
print(f"Middle Position: {middle_position}")
print(f"Middle Position + 1: {middle_position_1}")
print(f"Ordered Data: {data}")
print(f"Median: {median}")

Middle Position: 1
Middle Position + 1: 2
Ordered Data: [1, 3, 4, 7]
Median: 3.5


In [16]:
statistics.median(data)

3.5
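
The odd and even cases above can be folded into one function. The sketch below is only an illustration of the two position rules, not a replacement for `statistics.median`; the name `median_by_position` is a hypothetical helper introduced here:

```python
import statistics


def median_by_position(values):
    """Hypothetical helper: median using the odd/even position rules."""
    ordered = sorted(values)  # the rules only work on ordered data
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    middle = n // 2  # 0-based index of the upper middle element
    if n % 2 == 1:
        # Odd length: 1-based position (n + 1) / 2 is 0-based index n // 2
        return ordered[middle]
    # Even length: mean of 1-based positions n / 2 and n / 2 + 1
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_median = median_by_position([3, 1, 4, 7, 5, 6, 9])
even_median = median_by_position([3, 1, 4, 7])
print(odd_median, even_median)
```

For the two datasets used in this section it gives 5 and 3.5, matching `statistics.median`.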

<strong>Reference</strong><br>
https://www.scribbr.com/statistics/median/

#### Mean

The <i>mean</i> is the sum of all values in the dataset divided by the total number of elements in the dataset. It is also called the <i>arithmetic mean</i> or <i>average</i>.

$\bar{X} = \frac{\sum{X}}{N}$

<ul>
    <li>$\bar{X}$ = Mean</li>
    <li>$\sum{X}$ = Sum of all values in the dataset</li>
    <li>N = Total number of elements</li>
</ul>

In [17]:
dataset = [1, 23, 3, 5, 2, 5]
sum_of_dataset = sum(dataset)
total_ele = len(dataset)
mean = sum_of_dataset/total_ele
print(f"Mean: {mean}")

Mean: 6.5


In [18]:
# Using statistics module
statistics.mean(dataset)

6.5
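
As a worked check of the formula, substituting the six values of the dataset `[1, 23, 3, 5, 2, 5]` used in the cells above gives the same result by hand:

$\bar{X} = \frac{1 + 23 + 3 + 5 + 2 + 5}{6} = \frac{39}{6} = 6.5$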

#### Variability
<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       Variability describes how far apart data points lie from each other and from the center of the distribution. 
       Variability is also referred to as spread, scatter or dispersion. It is commonly measured with the <i>Range</i>,
       <i>Interquartile Range</i>, <i>Standard Deviation</i> and <i>Variance</i>.<br><br>
       <b>Standard deviation</b> is expressed in the same units as the original values (e.g., meters).<br>
       <b>Variance</b> is expressed in squared units (e.g., meters squared).<br>
   </font>
</p>
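
The range and interquartile range named above can be sketched with NumPy. This is a minimal illustration, assuming the same eight-value dataset used in the standard deviation cells of this notebook and NumPy's default (linear interpolation) percentile method; other quartile conventions can give slightly different values:

```python
import numpy as np

dataset = np.array([10, 20, 30, 40, 50, 60, 70, 80])

# Range: distance between the largest and smallest value
data_range = dataset.max() - dataset.min()

# Interquartile range: distance between the 75th and 25th percentiles,
# i.e. the spread of the middle 50% of the data
q1, q3 = np.percentile(dataset, [25, 75])
iqr = q3 - q1

print(f"Range: {data_range}")
print(f"Q1: {q1}, Q3: {q3}, IQR: {iqr}")
```

With this dataset the range is 70 and, under the default percentile method, the interquartile range is 35.0.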



#### Standard Deviation
<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       Standard deviation measures the <i>variability</i> of a dataset. It tells us, on average,
       how far each data point lies from the mean. A high standard deviation means that values are generally far
       from the mean, while a low standard deviation indicates that values are clustered close to the mean.
   </font>
</p>
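
To make the high versus low contrast concrete, the sketch below compares two made-up datasets (`clustered` and `spread_out` are illustrative values, not taken from any source) that share the same mean but differ in spread:

```python
import statistics

# Two illustrative datasets with the same mean (50) but different spread
clustered = [48, 49, 50, 51, 52]
spread_out = [10, 30, 50, 70, 90]

clustered_mean = statistics.mean(clustered)
spread_out_mean = statistics.mean(spread_out)

# pstdev() is the population standard deviation (divides by N)
clustered_sd = statistics.pstdev(clustered)
spread_out_sd = statistics.pstdev(spread_out)

print(f"Means: {clustered_mean} and {spread_out_mean}")
print(f"Standard deviations: {clustered_sd:.2f} and {spread_out_sd:.2f}")
```

Both means are 50, yet the second standard deviation is far larger, which is exactly the information the mean alone does not carry.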


<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       <b>Formula</b> (population standard deviation):<br>
        $\sigma = \sqrt{\frac{\sum{(X - \mu)^2}}{N}}$ <br><br>
        where<br>
        $\sigma$ is the standard deviation,<br>
        $\sum$ means "sum of",<br>
        $\mu$ is the mean,<br>
        $X$ is each value,<br>
        $N$ is the number of values in the dataset
   </font>
</p>

In [27]:
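# Use a NumPy array so that subtracting the mean is applied element-wise
dataset = np.array([10, 20, 30, 40, 50, 60, 70, 80])
mean = np.mean(dataset)
# Deviation of each value from the mean
diff = dataset - mean
print(f"Diff: {diff}")
# Squared deviations
squared = diff**2
squared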

Diff: [-35. -25. -15.  -5.   5.  15.  25.  35.]


array([1225.,  625.,  225.,   25.,   25.,  225.,  625., 1225.])

In [29]:
# Square root of the mean of the squared deviations
standard_deviation = np.sqrt(sum(squared) / len(squared))
print(f"{standard_deviation = }")

standard_deviation = 22.9128784747792


In [25]:
# Using NumPy (np.std computes the population standard deviation by default)
np.std(dataset)

22.9128784747792

In [26]:
# statistics.stdev uses the sample standard deviation formula, so the denominator is N-1 rather than N
statistics.stdev(dataset)

24.49489742783178
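
The difference between the two results above comes only from the denominator. A short sketch of how each library exposes both variants, using `statistics.pstdev` for the population formula and NumPy's `ddof` argument for the sample formula (variable names are illustrative):

```python
import statistics

import numpy as np

dataset = [10, 20, 30, 40, 50, 60, 70, 80]

# Population standard deviation: divide by N
population_sd = statistics.pstdev(dataset)
population_sd_np = np.std(dataset)  # ddof=0 is the default

# Sample standard deviation: divide by N - 1
sample_sd = statistics.stdev(dataset)
sample_sd_np = np.std(dataset, ddof=1)  # ddof=1 subtracts 1 from N

print(f"Population: {population_sd} (statistics), {population_sd_np} (numpy)")
print(f"Sample:     {sample_sd} (statistics), {sample_sd_np} (numpy)")
```

Which variant is appropriate depends on whether the data is the whole population or a sample drawn from it.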

#### Variance

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
       The variance is the average of the squared deviations from the mean. It tells you the degree of spread in your dataset: the more spread out the data, the larger the variance is in relation to the mean.
   </font>
</p>

<p style="text-align: justify; text-justify: inter-word;">
   <font size=3>
        <b>Formula</b> (population variance):<br>
        $\sigma^2 = \frac{\sum{(X - \mu)^2}}{N}$ <br><br>
        where<br>
        $\sigma^2$ is the variance,<br>
        $\sum$ means "sum of",<br>
        $\mu$ is the mean,<br>
        $X$ is each value,<br>
        $N$ is the number of values in the dataset
   </font>
</p>


In [34]:
dataset = np.array([10, 20, 30, 40, 50, 60, 70, 80])
mean = np.mean(dataset)
diff = dataset - mean
print(f"{diff = }")
squared = diff**2
squared

diff = array([-35., -25., -15.,  -5.,   5.,  15.,  25.,  35.])


array([1225.,  625.,  225.,   25.,   25.,  225.,  625., 1225.])

In [31]:
variance = sum(squared)/len(squared)
variance

525.0

In [32]:
np.var(dataset)

525.0

In [35]:
# statistics.variance uses the sample formula (N-1 in the denominator), hence 600 rather than 525
statistics.variance(dataset)

600
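
Variance and standard deviation are tied together: the standard deviation is the square root of the variance. The sketch below checks that relationship and shows the population and sample variants side by side, using `statistics.pvariance` and NumPy's `ddof` argument (variable names are illustrative):

```python
import statistics

import numpy as np

dataset = [10, 20, 30, 40, 50, 60, 70, 80]

# Population variance: divide the sum of squared deviations by N
population_var = statistics.pvariance(dataset)

# Sample variance: divide by N - 1
sample_var = statistics.variance(dataset)
sample_var_np = np.var(dataset, ddof=1)

# The standard deviation is the square root of the variance
sd_from_variance = np.sqrt(population_var)

print(f"Population variance: {population_var}")
print(f"Sample variance: {sample_var} (statistics), {sample_var_np} (numpy)")
print(f"sqrt(population variance): {sd_from_variance}")
```

The square root of the population variance (525) reproduces the population standard deviation computed earlier, about 22.91.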