## Machine Learning and Statistics Tasks
#### Hayley Doherty
---

### Task 1: Calculate the square root of 2 to 100 decimal places
---
To calculate the square root of 2 (or of any number), we will use Newton's method. This algorithm produces successively better approximations of the root of a function [1]. To find the square root $z$ of a number $x$, we start with an initial guess and repeatedly apply the update below. Each pass through the loop improves the guess, and we stop once $z^2$ is sufficiently close to $x$ [2].
$$ z = z - \frac{z^2 - x}{2z} $$
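
For context, this update can be obtained by applying the general Newton step to the function $f(z) = z^2 - x$, whose positive root is $\sqrt{x}$. The name $f$ and the iteration index $n$ are notation introduced here for illustration only.

$$ f(z) = z^2 - x, \qquad f'(z) = 2z $$

$$ z_{n+1} = z_n - \frac{f(z_n)}{f'(z_n)} = z_n - \frac{z_n^2 - x}{2 z_n} $$

For $x = 2$ and a starting guess of $z_0 = 1$, the first few iterates are $z_1 = 3/2$, $z_2 = 17/12 \approx 1.41667$ and $z_3 = 577/408 \approx 1.4142157$.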




In [1]:
# Newton's method for the square root of 2
def sqrt2():
    z = 2 / 2  # initial guess
    while abs(2 - (z * z)) > 0.000001:  # stop once z*z is within 0.000001 of 2
        z -= (z * z - 2) / (2 * z)  # Newton update
    return z

In [2]:
# calling the function to check it works
sqrt2()

1.4142135623746899

In [3]:
# Testing my answer against known value
import math
math.sqrt(2)

1.4142135623730951

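Below I have formatted the value returned by the function so that it is displayed to 100 decimal places. This is achieved using '.100f'. The format specification 'f' is used with floating point and decimal numbers to display the number as a fixed-point number [3]. The default precision is 6; however, you can specify the number of decimal places you want by inserting that number in front of the specifier, 'f'. To return the answer to 100 decimal places I added 100 in front of 'f' as shown below. Note that a Python float holds only about 15 to 17 significant decimal digits, and the loop stops once $z^2$ is within 0.000001 of 2. The result therefore agrees with math.sqrt(2) to only 11 decimal places, and the remaining digits reflect the stored floating-point value rather than the true digits of $\sqrt{2}$.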

In [4]:
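# added formatting to return the answer to 100 decimal places
def sqrt2():
    z = 2 / 2  # initial guess
    while abs(2 - (z * z)) > 0.000001:  # stop once z*z is within 0.000001 of 2
        z -= (z * z - 2) / (2 * z)  # Newton update
    return format(z, '.100f')  # fixed-point string with 100 decimal places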

In [5]:
sqrt2()

'1.4142135623746898698271934335934929549694061279296875000000000000000000000000000000000000000000000000'
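
Because a float cannot hold 100 significant digits, one possible way to obtain 100 correct decimal places is to run the same Newton update on Python's arbitrary-precision integers. The sketch below is an illustration under that assumption, not part of the original solution; the function name `sqrt2_digits` and the parameter `places` are hypothetical names chosen here.

```python
def sqrt2_digits(places=100):
    """Return the square root of 2 as a string, truncated to `places` decimals (places >= 1)."""
    # Work with the integer n = 2 * 10**(2 * places); its integer square root
    # is sqrt(2) * 10**places with the fractional part dropped.
    n = 2 * 10 ** (2 * places)
    z = n  # initial guess, deliberately too large
    while True:
        # Same Newton update, rearranged: z - (z*z - n) / (2*z) == (z + n/z) / 2,
        # using floor division so every value stays an exact integer.
        better = (z + n // z) // 2
        if better >= z:  # no further improvement: z is floor(sqrt(n))
            break
        z = better
    digits = str(z)
    return digits[0] + "." + digits[1:]


print(sqrt2_digits())
```

Called with the default of 100 places, the returned string begins 1.41421356237309504880. The value is truncated rather than rounded, and it is returned as a string because converting it back to a float would discard the extra digits.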

<br>

---

### Task 2: Chi-Square
___

A chi-square test, also known as Pearson's chi-squared test, is used to determine whether there is a difference between the expected values and those actually observed. The null hypothesis states that there is no difference between the expected and observed values; if this is true, the test statistic calculated from these values will be chi-squared distributed [4]. This is the distribution of a sum of the squares of *k* independent standard normal random variables [5]. The value produced by the chi-square test is called the chi-square statistic. A small value means that the observed data are close to the expected data, i.e. there is little evidence of a difference between them. The equation for calculating the chi-square statistic is shown below.
$$X^2=\sum_{i}{\frac{(O_i - E_i)^{2}}{E_i}}$$

$O_i$ = Observed value(s)

$E_i$ = Expected value(s)

A common way of calculating the chi-square statistic is to arrange the data in a table, called a contingency table, which summarizes the frequency distribution between variables [6]. The function chi2_contingency from scipy.stats takes the observed contingency table and returns the chi-square statistic, the p-value, the degrees of freedom, and an array of the expected values based on the data we observed.

In [7]:
import numpy as np
from scipy.stats import chi2_contingency

# convert the data into a NumPy array

white = [90, 60, 104, 95]
blue = [30, 50, 51, 20]
no = [30, 40, 45, 35]

collar = np.array([white, blue, no])

In [10]:
# passed the array into the chi-square function
chi2_contingency(collar)

(24.5712028585826,
 0.0004098425861096696,
 6,
 array([[ 80.53846154,  80.53846154, 107.38461538,  80.53846154],
        [ 34.84615385,  34.84615385,  46.46153846,  34.84615385],
        [ 34.61538462,  34.61538462,  46.15384615,  34.61538462]]))

Above we can see the output from the chi-square function; however, the bare tuple can be hard to interpret, so I have used the code and print statements below, adapted from [6], to label each value. Note that the array printed under the "Contingency Table" heading holds the expected frequencies, not the observed counts.

In [12]:
chi2_stat, p_val, dof, ex = chi2_contingency(collar)
print("===Chi2 Stat===")
print(chi2_stat)
print("\n")
print("===Degrees of Freedom===")
print(dof)
print("\n")
print("===P-Value===")
print(p_val)
print("\n")
print("===Contingency Table===")
print(ex)

===Chi2 Stat===
24.5712028585826


===Degrees of Freedom===
6


===P-Value===
0.0004098425861096696


===Contingency Table===
[[ 80.53846154  80.53846154 107.38461538  80.53846154]
 [ 34.84615385  34.84615385  46.46153846  34.84615385]
 [ 34.61538462  34.61538462  46.15384615  34.61538462]]


The chi-square value obtained, 24.5712028585826, matches that presented in the task instructions. With 6 degrees of freedom, the p-value obtained, *p* = 0.0004098425861096696, is less than 0.001, indicating a high level of statistical significance. We can therefore reject the null hypothesis and conclude that there is a difference between the three groups; however, as there are more than two groups, further analysis would have to be performed to reveal exactly which of the groups differed from each other.
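
As a cross-check, the statistic can also be reproduced directly from the formula for $X^2$. The sketch below uses plain Python so that each term is visible, and it assumes, as is standard for a contingency table, that each expected count is the row total multiplied by the column total, divided by the grand total. The variable names are my own and are not taken from SciPy.

```python
# Observed counts, copied from the collar data used above
observed = [
    [90, 60, 104, 95],  # white
    [30, 50, 51, 20],   # blue
    [30, 40, 45, 35],   # no
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson statistic: sum over all cells of (O - E)^2 / E
chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print(chi2_stat, dof)
```

The hand-computed statistic and degrees of freedom should agree with the values returned by chi2_contingency, up to floating-point rounding.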

___

### Task 3: Comparison of standard deviation functions

___

<script type="text/javascript" id="MathJax-script" async
  src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js">
</script>

The standard deviation (SD) of a set of numbers is a measure of the amount of variation among the numbers in the set. A low SD suggests that the numbers in the set are close to the mean, while a high SD indicates that the values are spread over a wide range [7].

There are two formulas that can be used to calculate the SD of a data set: the sample SD and the population SD. The formula to use depends on what your data set represents. If your data set consists of the entire population, or if you are not generalizing your sample data set to a larger population, then you would calculate the population SD; however, if your data set consists of a sample from a larger population to which you will generalize your findings, the sample SD should be used [8].

The Excel population SD function, written STDEV.P, calculates the SD based on the entire population given as arguments. It is calculated using the 'n' method [9]:

$$\sigma=\sqrt{\frac{\sum(x-\overline{x})^2}{n}}$$

$\sigma$= population standard deviation<br>
$\sum$= sum of...<br>
$\overline{x}$= population mean<br>
$n$= population size<br>

The Excel sample SD function, written STDEV.S, calculates the SD based on a portion of the population, i.e. a sample. It is calculated using the 'n-1' method:

$$s=\sqrt{\frac{\sum(x-\overline{x})^2}{n-1}}$$

$s$= sample standard deviation<br>
$\sum$= sum of...<br>
$\overline{x}$= sample mean<br>
$n$= sample size<br>
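
As a rough illustration of the difference between the two formulas, the sketch below computes both on a small data set and compares them with Python's standard `statistics` module, where `pstdev` divides by $n$ and `stdev` divides by $n-1$. The data values are hypothetical, chosen only so the arithmetic is easy to follow, and the code is not a description of how Excel is implemented.

```python
import math
import statistics

# Hypothetical data, chosen for illustration only
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n
squared_deviations = sum((x - mean) ** 2 for x in data)

# Population SD divides by n (the formula behind STDEV.P)
population_sd = math.sqrt(squared_deviations / n)

# Sample SD divides by n - 1 (the formula behind STDEV.S)
sample_sd = math.sqrt(squared_deviations / (n - 1))

# The standard library offers the same two calculations
print(population_sd, statistics.pstdev(data))
print(sample_sd, statistics.stdev(data))
```

For this data the population SD is exactly 2, and the sample SD is slightly larger because the same sum of squared deviations is divided by 7 rather than 8.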

[1]. Newton's method; https://en.wikipedia.org/wiki/Newton%27s_method

[2]. A Tour of Go; Exercise: Loops and Functions; https://tour.golang.com/flowcontrol/8

[3]. Format Specifications; The Python Standard Library; https://docs.python.org/3/library/string.html#format-string-syntax

[4]. Chi-squared test; https://en.wikipedia.org/wiki/Chi-squared_test

[5]. Chi-square distribution; https://en.wikipedia.org/wiki/Chi-square_distribution

[6]. Running Chi-Square Tests with Die Roll Data in Python, Jake Huneycutt; https://towardsdatascience.com/running-chi-square-tests-in-python-with-die-roll-data-b9903817c51b

[7]. Standard Deviation; https://en.wikipedia.org/wiki/Standard_deviation

[8]. Standard Deviation; https://statistics.laerd.com/statistical-guides/measures-of-spread-standard-deviation.php

[9]. STDEV.P function; https://support.microsoft.com/en-us/office/stdev-p-function-6e917c05-31a0-496f-ade7-4f4e7462f285