# Fundamentals of Data Analysis Tasks

**Anthony McGarry**

***

## Task 1

> The Collatz conjecture is a famous unsolved problem in mathematics. The problem is to prove that if you start with any positive integer x and repeatedly apply the function $f(x)$ below, you always get stuck in the repeating sequence 1, 4, 2, 1, 4, 2, ...

In [None]:
def f(x):
  # If x is even, divide it by two.
  if x % 2 == 0:
    return x // 2
  else:
    # If x is odd, multiply it by three and add one.
    return (3 * x) + 1

In [None]:
def collatz(x):
  print(f'Testing Collatz with initial value {x}')

  # Apply f repeatedly, printing each term, until the sequence reaches 1.
  while x != 1:
    x = f(x)
    print(x, end=',')
    

In [None]:
collatz(587)

In [None]:
def verify_collatz(limit):
    for i in range(1, limit + 1):
        x = i  # Start the sequence from i
        # This loop only exits once the sequence reaches 1, so a
        # counterexample would make it run forever rather than return.
        while x != 1:
            x = f(x)

    # Reaching this line means every sequence terminated at 1.
    print('The Collatz conjecture holds for all positive integers up to', limit)

# Verify the Collatz conjecture for the first 10,000 positive integers
verify_collatz(10000)
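
The check above loops until each sequence reaches 1, so a starting value that never reached 1 would leave it running forever instead of reporting a failure. A minimal sketch of a bounded variant follows. The names `reaches_one` and `verify_collatz_bounded` and the `max_steps` cap are hypothetical, introduced here for illustration only.

```python
def f(x):
    """One Collatz step: halve even numbers, map odd x to 3x + 1."""
    return x // 2 if x % 2 == 0 else 3 * x + 1


def reaches_one(start, max_steps=10_000):
    """Return True if the sequence from start hits 1 within max_steps steps.

    max_steps is an arbitrary cap chosen for illustration. Returning False
    means "undecided within the cap", not "counterexample found".
    """
    x = start
    for _ in range(max_steps):
        if x == 1:
            return True
        x = f(x)
    return x == 1


def verify_collatz_bounded(limit, max_steps=10_000):
    """Return the starting values up to limit that did not reach 1 in time."""
    return [i for i in range(1, limit + 1) if not reaches_one(i, max_steps)]


undecided = verify_collatz_bounded(10000)
print('Undecided starting values:', undecided)
```

An empty list means every starting value up to the limit reached 1 within the cap. A non-empty list would only mean those values need a larger cap, not that the conjecture is false.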

***

## End

## Task 2

> Give an overview of the famous penguins data set, explaining the types of variables it contains. Suggest the types of variables that should be used to model them in Python, explaining your rationale.

In [None]:
import pandas as pd

In [None]:

# Specify the name of the CSV file

file_name = "penguins_lter.csv"

# Use the pandas read_csv function to read the CSV file

data = pd.read_csv(file_name)

In [None]:
data.describe()

In [None]:
data.info()

Categorical Variables:

Species: The species of the penguin (Adelie, Gentoo or Chinstrap). This is a categorical (nominal) variable, as each penguin belongs to exactly one species and the species have no natural order.

Island: The island where the penguin was observed (Torgersen, Biscoe or Dream). This is also categorical, as there is a fixed set of islands with no natural order.

Sex: The sex of the penguin (Male or Female). This is categorical with two levels.

Numerical Variables:

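Bill Length (mm): The length of the penguin's bill, which is a continuous numerical variable. The raw `penguins_lter.csv` file refers to the bill as the culmen.

Bill Depth (mm): The depth of the penguin's bill, also continuous.

Flipper Length (mm): The length of the penguin's flipper. It is continuous in principle, although measurements are recorded in millimetres.

Body Mass (g): The mass of the penguin in grams, a continuous numerical variable.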

Ordinal Variables:

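Stage: The developmental stage of the penguin when it was sampled. A stage variable is ordinal in principle, because stages follow a fixed order. It is worth checking `data['Stage'].unique()` first, however: if every row holds the same value, the column is effectively a constant and carries no information for modelling.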
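
One way to carry this classification into Python is through pandas dtypes: `category` for the nominal variables and `float64` for the measurements. The sketch below uses a few illustrative rows and simplified column names (both are assumptions and may differ from the headers in `penguins_lter.csv`) rather than reading the real file.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real file. The column names
# are simplified assumptions, not necessarily those in penguins_lter.csv.
sample = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "island": ["Torgersen", "Biscoe", "Dream", "Dream"],
    "sex": ["MALE", "FEMALE", None, "FEMALE"],
    "bill_length_mm": [39.1, 46.1, 46.5, None],
    "body_mass_g": [3750, 4500, 3500, 3300],
})

# Nominal variables: a fixed set of labels with no order.
for column in ["species", "island", "sex"]:
    sample[column] = sample[column].astype("category")

# Continuous measurements: float64 also represents missing values as NaN.
for column in ["bill_length_mm", "body_mass_g"]:
    sample[column] = sample[column].astype("float64")

print(sample.dtypes)
```

Storing the labels as `category` keeps the set of allowed values explicit and uses less memory than plain strings, while `float64` lets missing measurements appear as `NaN` without changing the column type.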


>Reference:

https://www.analyticsvidhya.com/blog/2022/04/data-exploration-and-visualisation-using-palmer-penguins-dataset/


***

## End

## Task 3

> For each of the variables in the penguins data set, suggest what probability distribution from the numpy random distributions list is the most appropriate to model the variable.

***

## End

## Task 4 

> Suppose you are flipping two coins, each with a probability p of giving heads. Plot the entropy of the total number of heads versus p.

Entropy measures the average amount of information, or uncertainty, in the outcome of a random variable. A low-probability event carries more information when it occurs, and a high-probability event carries less, because we are less surprised that it occurred. If a variable can only take a single value, its entropy is 0, because we know with complete certainty what value it will take. For a fair coin toss, where the outcome can be either heads or tails, the entropy is greater than 0 (exactly 1 bit when measured with base-2 logarithms), because we cannot say for sure what the outcome will be.

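Outlined below is the formula for Shannon entropy, where $P(x_i)$ is the probability of outcome $x_i$ and $b$ is the base of the logarithm ($b = 2$ gives the entropy in bits). If the coin is fair, heads and tails are equally likely, with $p = q = 0.5$.

\begin{align}
H(X) = -\sum_{i=1}^n P(x_i) \log_b P(x_i)
\end{align}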
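
For the two-coin task, the total number of heads $X$ can be 0, 1 or 2. Assuming the two flips are independent (the task statement does not say so explicitly) and writing $q = 1 - p$, the distribution of $X$ and its entropy in bits work out as follows, with the fair-coin case as a check:

\begin{align}
P(X=0) &= q^2, \quad P(X=1) = 2pq, \quad P(X=2) = p^2 \\
H(X) &= -\left[ q^2 \log_2 q^2 + 2pq \log_2 (2pq) + p^2 \log_2 p^2 \right] \\
H(X)\big|_{p=0.5} &= -\left[ \tfrac{1}{4} \log_2 \tfrac{1}{4} + \tfrac{1}{2} \log_2 \tfrac{1}{2} + \tfrac{1}{4} \log_2 \tfrac{1}{4} \right] = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2} = 1.5 \text{ bits}
\end{align}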

In [None]:
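# Import matplotlib and numpy

import matplotlib.pyplot as plt
import numpy as np

# Function to calculate the Shannon entropy (in bits) of the number of
# heads when two coins are flipped, each landing heads with probability p
def entropy(p):
    q = 1 - p
    # Probabilities of getting 0, 1 and 2 heads
    probabilities = [q**2, 2*p*q, p**2]
    # Zero-probability outcomes contribute nothing to the sum
    h = -np.sum([prob * np.log2(prob) if prob != 0 else 0 for prob in probabilities])
    return h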
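
The task also asks for a plot of this entropy against p. A minimal sketch of one way to draw it is given below. The names `two_coin_entropy` and `plot_two_coin_entropy` are hypothetical (chosen to avoid clashing with `entropy` above), the calculation uses the standard-library `math` module instead of NumPy, and the grid of 101 points is an arbitrary choice.

```python
import math


def two_coin_entropy(p):
    """Entropy in bits of the number of heads from two flips with P(heads) = p."""
    q = 1 - p
    probabilities = [q**2, 2 * p * q, p**2]  # P(0 heads), P(1 head), P(2 heads)
    # Outcomes with zero probability contribute nothing to the sum.
    return sum(-prob * math.log2(prob) for prob in probabilities if prob > 0)


def plot_two_coin_entropy(num_points=101):
    """Plot entropy against p on an evenly spaced grid over [0, 1]."""
    # Imported here so the calculation above works without matplotlib.
    import matplotlib.pyplot as plt

    p_values = [i / (num_points - 1) for i in range(num_points)]
    h_values = [two_coin_entropy(p) for p in p_values]

    fig, ax = plt.subplots()
    ax.plot(p_values, h_values)
    ax.set_xlabel('p (probability of heads)')
    ax.set_ylabel('Entropy of the number of heads (bits)')
    ax.set_title('Entropy of the total number of heads for two coin flips')
    return fig
```

Calling `plot_two_coin_entropy()` in a notebook cell draws the curve, which is 0 at p = 0 and p = 1 (the outcome is certain) and peaks at 1.5 bits at p = 0.5.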



>Reference

https://towardsdatascience.com/entropy-is-a-measure-of-uncertainty-e2c000301c2c

https://medium.com/@mhannan94/information-entropy-3cce2bac62a2#:~:text=If%20the%20coin%20is%20fair,result%20is%20likely%20to%20occur.

***

## End

## Task 5 

> Create an appropriate individual plot for each of the variables in the penguin data set.

***

## End