# Fundamentals of Data Analysis Tasks

**Kevin Donovan**

****

In [3]:
#importing standard libraries to use in the tasks
#Numerical arrays and random numbers
import numpy as np
#plotting charts
import matplotlib.pyplot as plt
#to use datasets
import pandas as pd
#for graphing
import seaborn as sns

NumPy reference: [NumPy API Reference: Random sampling](https://numpy.org/doc/stable/reference/random/index.html)

# Task 1

> The Collatz conjecture is a famous unsolved problem in mathematics. The problem is to prove that if you start with any positive integer x and repeatedly apply the function $f(x)$ below, you always get stuck in the repeating sequence 1, 4, 2, 1, 4, 2, ...

The first part is to create the function f(x). An if statement uses the modulo operator `%` to test whether x can be divided evenly by 2. If it can, the function returns x divided by 2; otherwise it returns x multiplied by 3 with 1 added.

In [None]:
def f(x):
    # If x is even, divide it by two. The modulo operator % gives the remainder of the division.
    if x % 2 == 0:
        return x // 2
    else:
        return (3 * x) + 1

Next we create a function collatz(x), which takes a positive integer and repeatedly passes it to f(x) until the value reaches 1, printing each value along the way. At every step f(x) checks whether the current value is odd or even and returns the next value in the sequence. When the sequence reaches 1, the number originally entered by the user is decremented by 1 and the process starts again, so every starting value from x down to 2 is tested.

In [None]:
def collatz(x):
    print(f'Testing the Collatz conjecture using the initial value {x}')
    count = 0   # number of starting values tested
    value = x   # current starting value
    for i in range(1, x, 1):
        print("i = ", i)
        count += 1
        x = value
        # Apply f repeatedly until the sequence reaches 1
        while x != 1:
            print(x, end = ', ')
            x = f(x)
        print("x is: ", x)
        # Move on to the next starting value down
        value = value - 1
        print("value is: ", value)
    print("count : ", count)

In [None]:
collatz(50)
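
Because collatz(x) prints its results rather than returning them, the output is hard to check programmatically. The following is a minimal sketch, assuming we want each sequence back as a list. The name `collatz_sequence` and its `max_steps` guard are hypothetical and introduced here only for illustration; they are not part of the task.

```python
def f(x):
    # Same rule as above: halve even numbers, otherwise 3x + 1
    return x // 2 if x % 2 == 0 else 3 * x + 1


def collatz_sequence(x, max_steps=10_000):
    """Return the list of values visited from x until the sequence reaches 1."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    sequence = [x]
    while x != 1:
        # Guard against running forever (hypothetical safety limit)
        if len(sequence) > max_steps:
            raise RuntimeError("step limit reached before hitting 1")
        x = f(x)
        sequence.append(x)
    return sequence


print(collatz_sequence(6))  # [6, 3, 10, 5, 16, 8, 4, 2, 1]

# Every starting value from 1 to 50 should end at 1
print(all(collatz_sequence(n)[-1] == 1 for n in range(1, 51)))
```

Returning a list keeps the printing separate from the calculation, so the same function can be reused for checks or plots.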

****

# End of Task 1

# Task 2

**Overview of the Penguins dataset**

****

The penguins data set contains 344 sets of data relating to three species of penguin. The data was collected by Dr. Kristen Gorman (https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the Palmer Station, Antarctica LTER (https://pallter.marine.rutgers.edu/), a member of the Long Term Ecological Research Network (https://lternet.edu/). The task asks us to 'Give an overview of the famous penguins data set, explaining the types of variables it contains. Suggest the types of variables that should be used to model them in Python, explaining your rationale.'

I start by reading the penguins CSV file into a DataFrame, which I display to show the data it contains. As can be seen, there are 7 columns of data comprising floating-point numbers and strings. This mixture of data types is the main reason to manipulate the data with a pandas DataFrame rather than a NumPy array: a DataFrame can hold a different data type in each column.

In [2]:
Penguins_Dataset = 'penguins.csv'
df = pd.read_csv(Penguins_Dataset)
df

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | MALE |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE |


As can be seen in the DataFrame, a number of cells contain NaN, which stands for "Not a Number" and marks a missing value. At least two rows are missing all of their measurements, which could skew our results. The DataFrame has 344 rows and 7 columns made up of text and numbers, and some of these rows contain no information other than the species and the island the penguin is from.

Calling the describe function allows me to quickly see how the numeric data is distributed within the DataFrame. The count for each column shows the discrepancy in the number of filled data fields: it should show 344 for each column but, as can be seen, only 342 values are present. Only the first two columns, species and island, are complete. Rows with no data can be handled in two ways. One is to delete the rows completely. The other, depending on the dataset, is to fill the gaps with the average of the data in each column. This can skew the data, so the choice depends on the situation for each dataset. For this dataset, we can comfortably delete such rows, as they contain only the species and the island.

In [None]:
print(df.describe())

Using the dropna() function, the rows that contain NaN values can be removed.

In [None]:
# dropna() drops any row that has a NaN in it. Less coarse methods exist, such as checking only certain columns.
# It returns a new DataFrame, so df itself is unchanged unless the result is assigned.
df.dropna()
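
The two options described above, deleting rows or filling with a column average, can be sketched as follows. This is an illustration only: `toy` is a hypothetical four-row frame built from rows 0, 3, 339 and 340 of the output shown earlier, not the full dataset, and the choice of `measurements` columns is an assumption made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame using four rows from the displayed output
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Torgersen", "Torgersen", "Biscoe", "Biscoe"],
    "bill_length_mm": [39.1, np.nan, np.nan, 46.8],
    "body_mass_g": [3750.0, np.nan, np.nan, 4850.0],
    "sex": ["MALE", np.nan, np.nan, "FEMALE"],
})

# Count the missing values in each column
missing = toy.isna().sum()
print(missing)

# Option 1: drop only the rows where every measurement is missing
measurements = ["bill_length_mm", "body_mass_g"]
dropped = toy.dropna(subset=measurements, how="all")
print(len(dropped))  # 2

# Option 2: fill missing measurements with the column mean
# (text columns such as sex are left as NaN)
filled = toy.fillna(toy[measurements].mean())
print(filled)
```

Restricting `dropna` to the measurement columns keeps rows that lack only a non-numeric field, which plain `dropna()` would also remove.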

The task asks for an overview of the variable types used in the dataset. As has been noted, the dataset contains a mixture of numbers and text. The column headings are:

 - species as type text to denote the different species
 - island as type text to denote the island the species resides on
 - bill length in mm as a number, in the order of tens and to one decimal place
 - bill depth in mm as a number, in the order of tens and to one decimal place
 - flipper length in mm as a number, in the order of hundreds and to one decimal place
 - body mass in grams as a number, in the order of thousands and to one decimal place
 - sex as type text to denote the sex of the subject
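
One possible way to model these variables in Python is sketched below. It assumes that the three text columns take a small fixed set of values and so suit the pandas `category` type, while the measurements stay as floating-point numbers. The frame `toy` is hypothetical and uses rows 0, 1 and 340 of the output shown earlier.

```python
import pandas as pd

# Hypothetical toy frame using three rows from the displayed output
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "island": ["Torgersen", "Torgersen", "Biscoe"],
    "bill_length_mm": [39.1, 39.5, 46.8],
    "body_mass_g": [3750.0, 3800.0, 4850.0],
    "sex": ["MALE", "FEMALE", "FEMALE"],
})

# Text columns with a few repeated values become categoricals;
# the measurement columns are already float64
typed = toy.astype({"species": "category", "island": "category", "sex": "category"})

print(typed.dtypes)
print(list(typed["species"].cat.categories))  # ['Adelie', 'Gentoo']
```

Floats are kept for the measurements because NaN, which marks the missing values, is itself a floating-point value.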



****

# End of task 2

# Task 3

**Data analysis**

****

For this task, we are asked "For each of the variables in the penguins data set, suggest what probability distribution from the numpy random distributions list is the most appropriate to model the variable."

****

# End of task 3

# Task 4

**Entropy of flipping two coins**
 
We are asked: "Suppose you are flipping two coins, each with a probability p of giving heads. Plot the entropy of the total number of heads versus p."

The probability of a given event shows numerically how likely it is to occur. If an event has a probability of 1, it will always occur, whereas it will never occur if its probability is 0 (https://goodcalculators.com/coin-flip-probability-calculator/).


****

Flipping two coins gives the following results:

| Coin 1 | Coin 2 | How many heads? |
|---|---|---|
| H | H | 2 |
| H | T | 1 |
| T | H | 1 |
| T | T | 0 |

There are four equally likely outcomes when the coins are fair, but only three possible totals: two heads occurs once, one head occurs twice, and zero heads occurs once. So the most likely total is one head and one tail, as long as the coins aren't biased in any way.
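
For coins that may be biased, the same counting gives probabilities of (1 - p)^2, 2p(1 - p) and p^2 for zero, one and two heads. A minimal sketch of that calculation follows, assuming the two coins are independent; `heads_distribution` is a hypothetical helper name introduced for illustration.

```python
import numpy as np


def heads_distribution(p):
    """Return [P(0 heads), P(1 head), P(2 heads)] for two independent coins."""
    return np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])


# Fair coins: one head is the most likely total
fair = heads_distribution(0.5)
print(fair)

# Biased coins (p = 0.8): two heads becomes the most likely total
biased = heads_distribution(0.8)
print(biased, biased.sum())
```

The three probabilities always sum to 1, which is a quick check that no outcome has been missed.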


In [None]:
import numpy as np
import matplotlib.pyplot as plt


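The function below computes Shannon's equation of entropy, in bits, for an event that occurs with probability p.

In [None]:
def H(p):
    # Entropy of a two-outcome event with probabilities p and 1 - p
    return -(1 - p) * np.log2(1.0 - p) - p * np.log2(p)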

If we pass 0.5 to H, the result is the entropy of flipping one fair coin: heads and tails are equally likely, which gives the maximum value of 1 bit.

In [None]:
H(0.5)


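Next we pass the function 0.25, the probability that two fair coins both land heads. The result is the entropy of a two-outcome event with probabilities 0.25 and 0.75, which is not the same as the entropy of the total number of heads.

In [None]:
# Entropy of a two-outcome event with probability 0.25.
H(0.25)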

In [None]:
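import numpy as np
from scipy.stats import entropy
base = 2  # work in units of bits
pk = np.array([1/2, 1/2])  # fair coin
# Use a separate name so the function H(p) defined above is not overwritten
H_fair = entropy(pk, base=base)
H_fair

In [None]:
# Check the SciPy result against the definition; isclose avoids an exact float comparison
np.isclose(H_fair, -np.sum(pk * np.log(pk)) / np.log(base))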

In [None]:
# Create an empty plot.
fig, ax = plt.subplots(figsize=(12,6))

# p is a probability.
p = np.linspace(0.00000001, 0.99999999, 100)

# Plot H(p).
ax.plot(p, H(p));
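
The curve above is the entropy of a single two-outcome event. The task statement asks for the entropy of the total number of heads, which can be 0, 1 or 2. A minimal sketch under that reading follows, assuming the two coins are independent so that the totals have probabilities (1 - p)^2, 2p(1 - p) and p^2. The names `total_heads_entropy`, `p_grid`, `fig_total` and `ax_total` are hypothetical and chosen so that nothing defined earlier is overwritten.

```python
import numpy as np
import matplotlib.pyplot as plt


def total_heads_entropy(p):
    """Entropy in bits of the number of heads from two independent coins."""
    p = np.asarray(p, dtype=float)
    # Probabilities of 0, 1 and 2 heads, stacked along the first axis
    probs = np.stack([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])
    # Treat 0 * log2(0) as 0 so the endpoints p = 0 and p = 1 are defined
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(probs > 0, probs * np.log2(probs), 0.0)
    return -terms.sum(axis=0)


p_grid = np.linspace(0.0, 1.0, 101)
entropy_total = total_heads_entropy(p_grid)

fig_total, ax_total = plt.subplots(figsize=(12, 6))
ax_total.plot(p_grid, entropy_total)
ax_total.set_xlabel("p (probability of heads)")
ax_total.set_ylabel("Entropy of total heads (bits)")

print(total_heads_entropy(0.5))  # 1.5
```

For fair coins this gives 1.5 bits rather than 2 bits, because the outcomes HT and TH both count as one head and are merged.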

# References

 - https://en.wikipedia.org/wiki/Fair_coin
 - https://math.stackexchange.com/questions/1424818/flip-2-coins-how-to-show-that-each-point-in-sample-space-has-equal-probability
 - https://www.geeksforgeeks.org/what-is-a-fair-and-unfair-coin/?ref=ml_lbp
 - https://math.stackexchange.com/questions/2321905/how-to-find-probability-of-two-coins
 - https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html


****

# End of task 4

# Task 5

In this task we are asked to create an appropriate individual plot for each of the variables in the penguin data set.

In [None]:
# From https://seaborn.pydata.org/examples/scatterplot_matrix.html
sns.set_theme(style="ticks")
df = sns.load_dataset("penguins")
sns.pairplot(df, hue="species")
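
The pair plot shows the variables against one another. One possible way to get a separate plot per variable is sketched below, assuming a histogram suits the numeric columns and a bar chart of counts suits the text columns. `plot_each_variable` is a hypothetical helper, and `toy` is a small frame built from rows shown in Task 2 rather than the full dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_each_variable(data):
    """Draw one figure per column and return them keyed by column name."""
    figures = {}
    for column in data.columns:
        fig, ax = plt.subplots(figsize=(6, 4))
        series = data[column].dropna()
        if pd.api.types.is_numeric_dtype(series):
            # Numeric variable: histogram of the values
            ax.hist(series, bins=20)
        else:
            # Text variable: bar chart of how often each value occurs
            counts = series.value_counts()
            ax.bar(counts.index.astype(str), counts.values)
        ax.set_xlabel(column)
        ax.set_ylabel("count")
        ax.set_title(column)
        figures[column] = fig
    return figures


# Hypothetical toy frame; pass the full penguins DataFrame instead to plot every variable
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "body_mass_g": [3750.0, 3800.0, 4850.0],
    "sex": ["MALE", "FEMALE", "FEMALE"],
})
figures = plot_each_variable(toy)
print(list(figures))  # ['species', 'body_mass_g', 'sex']
```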

****

# End of task 5

****

# End of assessment