## Fundamentals of Data Analysis Tasks
**Miriam Roddy**

## Task 1 
**Your task is to verify, using Python, that the [collatz] conjecture is true for
the first 10,000 positive integers.**

The Collatz Conjecture is a [slightly head-melting](https://imgs.xkcd.com/comics/collatz_conjecture.png) unsolved problem in mathematics. It posits that if you repeatedly apply a simple set of rules to a positive integer, you will eventually reach the number 1. If the current integer is even, the next integer is half of it. If it is odd, the next integer is three times the current integer plus one. The rule for generating the next integer is:

$$
f(n) = \begin{cases}
    n/2 & \text{if } n \text{ is even} \\
    3n+1 & \text{if } n \text{ is odd}
\end{cases}
$$

The first challenge here is to understand the problem before implementing it in code. In a previous module, I learned about some of the theory behind the conjecture from [Veritasium](https://www.youtube.com/watch?v=094y1Z2wpJg&t=1s&ab_channel=Veritasium) and researched how a sequence of calculations could best be done in Python. [This article on Medium](https://medium.com/the-art-of-python/the-collatz-sequence-in-python-eb7e1f1b4f9e) was useful, as was [GeeksforGeeks](https://www.geeksforgeeks.org/program-to-print-collatz-sequence/).


Our code is not a formal proof, since we have only demonstrated the conjecture's validity for a finite set of numbers. We'll leave that kind of heavy lifting to [Riho Terras](https://en.wikipedia.org/wiki/Riho_Terras_(mathematician)#:~:text=He%20is%20known%20for%20the,established%20bounds%20for%20the%20conjecture.), who managed to prove that the conjecture was true for "almost all" numbers.




In [6]:
def f(x):
    ## If x is even, divide it by two.
    if x % 2 == 0:
        return x // 2
    else:
        ## Otherwise x is odd, so multiply it by three and add one.
        return (3 * x) + 1

In [7]:
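def collatz(x):
    ## Remember the starting value, because x is overwritten inside the loop.
    start = x
    print(f'(Testing Collatz with initial value {start})')
    while x != 1:
        print(x, end=', ')
        x = f(x)
    ## The loop only exits once the sequence has reached 1.
    print(x)

In [8]:
collatz(10000)


(Testing Collatz with initial value 10000)
10000, 5000, 2500, 1250, 625, 1876, 938, 469, 1408, 704, 352, 176, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1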
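
The run above covers a single starting value, 10000, whereas the task asks about every integer from 1 to 10,000. A minimal sketch of that fuller check follows. The names `collatz_step` and `collatz_steps` and the `max_steps` cap are illustrative assumptions, not part of the original notebook; the cap is only there so the loop cannot run forever if a sequence failed to reach 1.

```python
def collatz_step(n):
    """Apply one step of the Collatz rule to a positive integer n."""
    return n // 2 if n % 2 == 0 else 3 * n + 1


def collatz_steps(n, max_steps=10_000):
    """Return the number of steps n takes to reach 1.

    Returns None if the (hypothetical) max_steps cap is hit first.
    """
    steps = 0
    while n != 1:
        if steps >= max_steps:
            return None
        n = collatz_step(n)
        steps += 1
    return steps


# Check every starting value from 1 to 10,000 inclusive.
failures = [n for n in range(1, 10_001) if collatz_steps(n) is None]
print(f'Starting values that did not reach 1: {failures}')
```

An empty list means every starting value in the range reached 1 within the cap, which is a finite check rather than a proof.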



## Task 2 
**Give an overview of the famous penguins data set, explaining the types of variables it contains. Suggest the types of variables that should be
used to model them in Python, explaining your rationale.**

![Penguins](https://github.com/miriamroddy/fund_data/blob/main/Images/penguins.png)

The "Palmer Penguins" dataset is widely used in data science and machine learning. Overuse of the Iris dataset, together with Fisher's association with eugenics, has made the penguins dataset a popular alternative.

It contains information about three penguin species: Adelie, Chinstrap, and Gentoo. There are two penguins datasets - one is the raw data collected by the researchers and the second is [ ].


In [9]:
## Import the pandas library
import pandas as pd
## We load data from a CSV file named 'penguins.csv' into a pandas DataFrame called 'penguindata':
penguindata = pd.read_csv('Data/penguins.csv')

# Display the first few rows of the DataFrame
print(penguindata.head())

print("\n--------------------------------------------\n")

# View summary statistics
print(penguindata.describe())



  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm   
0  Adelie  Torgersen            39.1           18.7              181.0  \
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
3  Adelie  Torgersen             NaN            NaN                NaN   
4  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  
0       3750.0    MALE  
1       3800.0  FEMALE  
2       3250.0  FEMALE  
3          NaN     NaN  
4       3450.0  FEMALE  

--------------------------------------------

       bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
count      342.000000     342.000000         342.000000   342.000000
mean        43.921930      17.151170         200.915205  4201.754386
std          5.459584       1.974793          14.061714   801.954536
min         32.100000      13.100000         172.000000  2700.000000
25%         3
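
The columns in the `head()` output above suggest one possible way to model a single penguin using plain Python types. This is only a sketch: the class names `Species`, `Sex` and `Penguin` are hypothetical, and mapping a missing value (`NaN`) to `None` is an assumption made for illustration. The categorical columns become enumerations or text, and the four measurements become floats.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Species(Enum):
    ADELIE = 'Adelie'
    CHINSTRAP = 'Chinstrap'
    GENTOO = 'Gentoo'


class Sex(Enum):
    MALE = 'MALE'
    FEMALE = 'FEMALE'


@dataclass
class Penguin:
    species: Species                    # categorical (nominal)
    island: str                         # categorical, kept as text here
    bill_length_mm: Optional[float]     # continuous, None when missing
    bill_depth_mm: Optional[float]      # continuous, None when missing
    flipper_length_mm: Optional[float]  # continuous, None when missing
    body_mass_g: Optional[float]        # continuous, None when missing
    sex: Optional[Sex]                  # categorical, None when missing


# Rows 0 and 3 from the head() output above; row 3 has missing values.
row0 = Penguin(Species.ADELIE, 'Torgersen', 39.1, 18.7, 181.0, 3750.0, Sex.MALE)
row3 = Penguin(Species.ADELIE, 'Torgersen', None, None, None, None, None)
print(row0)
```

In pandas terms, the same idea would roughly correspond to the `category` dtype for `species`, `island` and `sex`, and `float64` for the four measurements.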

## Task 3 
**For each of the variables in the penguins data set, suggest what probability distribution from the numpy random distributions list
is the most appropriate to model the variable.**

In [10]:
import numpy as np
rng = np.random.default_rng()  # Create a random number generator
rng.random()
rng.standard_normal(3)


array([-4.21929107e-02,  2.55947859e+00,  2.49989283e-03])
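
As a first, tentative step towards this task, the four numeric columns could be simulated with the generator's `normal` method, using the means and standard deviations from the `describe()` output in Task 2. Treating each measurement as normally distributed is an assumption made here for illustration, not a conclusion from the data; the three species are mixed together in those summary figures, so the real distributions may not be a single bell curve.

```python
import numpy as np

# Mean and standard deviation of each numeric column, copied from the
# describe() output earlier in this notebook.
summary = {
    'bill_length_mm': (43.921930, 5.459584),
    'bill_depth_mm': (17.151170, 1.974793),
    'flipper_length_mm': (200.915205, 14.061714),
    'body_mass_g': (4201.754386, 801.954536),
}

rng = np.random.default_rng(seed=0)  # seeded only so the sketch is repeatable

# Assumption: each measurement is treated as normally distributed.
samples = {
    name: rng.normal(loc=mean, scale=std, size=5000)
    for name, (mean, std) in summary.items()
}

for name, values in samples.items():
    print(f'{name}: sample mean {values.mean():.2f}, sample std {values.std():.2f}')
```

Comparing a histogram of each simulated column with the real column would show how well, or how badly, the normal assumption fits.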

## Task 4
**Suppose you are flipping two coins, each with a probability p of giving heads. Plot the entropy of the total number of heads versus p.**
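
One way to approach this, sketched under the assumption that the two flips are independent: the total number of heads then follows a Binomial(2, p) distribution, with probabilities (1 - p)^2, 2p(1 - p) and p^2 for 0, 1 and 2 heads. The entropy is the sum of -q log2(q) over those three probabilities. The function name `heads_entropy` and the grid of 101 values of p are illustrative choices.

```python
import math


def heads_entropy(p):
    """Entropy in bits of the number of heads from two independent coin flips."""
    # The number of heads follows a Binomial(2, p) distribution.
    probs = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    # Terms with zero probability contribute nothing to the entropy.
    return sum(-q * math.log2(q) for q in probs if q > 0)


# Evaluate the entropy on an evenly spaced grid of p values from 0 to 1.
p_values = [i / 100 for i in range(101)]
entropies = [heads_entropy(p) for p in p_values]

print(heads_entropy(0.5))  # 1.5 bits, the largest value on the grid
```

Plotting `entropies` against `p_values` gives the requested curve: it is zero at p = 0 and p = 1, where the outcome is certain, and symmetric about p = 0.5.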

## Task 5
**Create an appropriate individual plot for each of the variables in the penguin data set.**

***
## End