## Fundamentals of Data Analysis - Tasks ##

**Name: James McEneaney**

***

### Task 1 - *Collatz Conjecture* ###

The Collatz conjecture is a famous unsolved problem in mathematics. The problem is to prove that if you start with any positive
integer x and repeatedly apply the function f(x) below, you always get stuck in the repeating sequence 1, 4, 2, 1, 4, 2, . . .

$f(x)$ equals $(x \div 2)$ if $x$ is even and $(3x + 1)$ otherwise.

For example, starting with the value 10, which is even, we divide it by 2 to get 5. Since 5 is odd, we multiply by 3 and add 1 to get 16. We then repeatedly divide by 2 to get 8, 4, 2, 1. Once we reach 1, we go back to 4 and get stuck in the repeating sequence 4, 2, 1, as expected.
The task below aims to verify, using Python, that the conjecture is true for the first 10,000 positive integers.

In [17]:
def f(x):
    # If x is even, halve it; integer division (//) keeps the result an int.
    if x % 2 == 0:
        return x // 2
    # Otherwise x is odd: multiply by three and add one.
    else:
        return (3 * x) + 1


In [18]:
def collatz(x):
    # Print each term of the sequence until it reaches 1.
    while x != 1:
        print(x, end=', ')
        x = f(x)
    print(x)

In [19]:
collatz(10000)


10000, 5000, 2500, 1250, 625, 1876, 938, 469, 1408, 704, 352, 176, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1
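
The cells above trace a single starting value. To cover the whole range named in the task, the following is a minimal sketch that checks every starting value from 1 to 10,000. The helper names `collatz_steps` and `verify_up_to`, and the `max_steps` safety cap, are my own assumptions and not part of the original task.

```python
def collatz_steps(n, max_steps=10_000):
    """Return the number of steps for n to reach 1, or None if the cap is hit."""
    steps = 0
    while n != 1:
        if steps >= max_steps:
            return None  # gave up: assumed safety cap reached
        # Same rule as f(x): halve even numbers, otherwise compute 3n + 1.
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


def verify_up_to(limit):
    """Return the start values in 1..limit that did not reach 1 within the cap."""
    return [n for n in range(1, limit + 1) if collatz_steps(n) is None]


failures = verify_up_to(10_000)
print("Start values that did not reach 1:", failures)
```

An empty list means every starting value reached 1, which is consistent with the conjecture over this range but is not a proof of it.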


***

#### End of Task 1 ####

***

### Task 2 - *Penguin Dataset* ###

Give an overview of the famous penguins data set, explaining the types of variables it contains. Suggest the types of variables
that should be used to model them in Python, explaining your rationale. 

Introduction:

The Penguins dataset contains data relating to penguins found on the Palmer Archipelago in Antarctica. It consists of 344 rows, each corresponding to an individual penguin studied for the dataset.

Firstly, I downloaded the CSV file from https://github.com/mwaskom/seaborn-data/blob/master/penguins.csv and saved it into the same repository as this file. Next, I read the 'penguins' CSV file using pandas, which lets me see the kinds of data contained within the dataset.

In [20]:
import pandas as pd
df = pd.read_csv('C:\\Users\\James\\pands\\fund_data\\fund_data\\penguins.csv')
print(df)


    species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0    Adelie  Torgersen            39.1           18.7              181.0   
1    Adelie  Torgersen            39.5           17.4              186.0   
2    Adelie  Torgersen            40.3           18.0              195.0   
3    Adelie  Torgersen             NaN            NaN                NaN   
4    Adelie  Torgersen            36.7           19.3              193.0   
..      ...        ...             ...            ...                ...   
339  Gentoo     Biscoe             NaN            NaN                NaN   
340  Gentoo     Biscoe            46.8           14.3              215.0   
341  Gentoo     Biscoe            50.4           15.7              222.0   
342  Gentoo     Biscoe            45.2           14.8              212.0   
343  Gentoo     Biscoe            49.9           16.1              213.0   

     body_mass_g     sex  
0         3750.0    MALE  
1         3800.0  FEMALE  
2     

From the dataset, we can see that seven variables are recorded for each penguin, namely:
- the **species** of penguin
- the **island** on which the penguin was found
- the **bill length**, in millimetres, for the penguin
- the **bill depth**, in millimetres, for the penguin
- the **flipper length**, in millimetres, for the penguin
- the **body mass**, in grams, of the penguin
- the **sex** of the penguin

These variables fall into two broad categories: quantitative (measured numerically) and qualitative (recorded in a non-numerical way).

These categories can be further divided into subcategories:

- Quantitative:
    - *Discrete*: This is data measured by counting in whole numbers (integers)
    - *Continuous*: This kind of data can take on any value within a range. The number of possible values is potentially unlimited, although rounding limits the values which can actually be recorded.
    - *Interval*: These scales have a zero point which is arbitrary (e.g. calendar systems) and can represent values less than zero.
    - *Ratio*: These scales have a zero point which is not arbitrary (e.g. height) and can only represent values of zero or greater.
<br/><br/>
- Qualitative:
    - *Nominal*: This is data which does not have an inherent ranking or order. Examples include lists of countries.
    - *Ordinal*: This is data which does lend itself to being ranked or ordered. Examples include the answers to questions of the type 'on a scale of 1 to 5, where 1 is strongly disagree and 5 is strongly agree ...'.

    *Binary* data is Nominal or Ordinal data which can fall into only one of two categories.

From the first and last five rows of the dataset above, it can be seen that 'species', 'island' and 'sex' are nominal data, and 'sex' is additionally binary. These three variables would best be represented in Python using the **String** data type.

'Bill length', 'bill depth', 'flipper length' and 'body mass' are continuous data types; however, here 'bill length' and 'bill depth' are recorded to one decimal place, while 'flipper length' and 'body mass' are rounded to the nearest whole number.
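
The rounding observation can be checked against the values visible in the printed output. This is a minimal sketch using only the rows shown above, not the full dataset, and the helper name `all_whole` is my own.

```python
# Values copied from the rows printed above (rows containing NaN are omitted).
flipper_length_mm = [181.0, 186.0, 195.0, 193.0, 215.0, 222.0, 212.0, 213.0]
bill_length_mm = [39.1, 39.5, 40.3, 36.7, 46.8, 50.4, 45.2, 49.9]


def all_whole(values):
    """Return True when every value has no fractional part."""
    return all(float(v).is_integer() for v in values)


print("flipper lengths all whole:", all_whole(flipper_length_mm))
print("bill lengths all whole:", all_whole(bill_length_mm))
```

Running the same check over the full columns would be needed before relying on it for every row.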

Therefore, the data types which should be used in Python to represent these variables are:
- **String** for 'species', 'island' and 'sex'
- **Float** for 'bill length' and 'bill depth'
- **Integer** for 'flipper length' and 'body mass'
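
One way to express these choices in plain Python is sketched below. It is only an illustration under stated assumptions: the class name `Penguin` and the helpers `to_float`, `to_int` and `parse_row` are hypothetical, the first two sample rows are copied from the printed output, and the third row is an invented example of a record with missing fields. Because the printed output shows rows with `NaN` measurements, I have assumed each measurement may be absent and typed it as `Optional`.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional

# Rows 1 and 2 are copied from the printed output; row 3 is an illustrative
# record with empty fields (an assumption about how gaps appear in the CSV).
SAMPLE_CSV = """species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
Adelie,Torgersen,,,,,
"""


@dataclass
class Penguin:
    species: str                       # nominal -> str
    island: str                        # nominal -> str
    bill_length_mm: Optional[float]    # continuous, one decimal place -> float
    bill_depth_mm: Optional[float]     # continuous, one decimal place -> float
    flipper_length_mm: Optional[int]   # recorded as whole numbers -> int
    body_mass_g: Optional[int]         # recorded as whole numbers -> int
    sex: Optional[str]                 # binary nominal -> str


def to_float(text):
    """Return a float, or None when the field is empty or 'NaN'."""
    cleaned = text.strip()
    if cleaned == "" or cleaned.lower() == "nan":
        return None
    return float(cleaned)


def to_int(text):
    """Return a whole number, or None when the field is missing."""
    value = to_float(text)
    return None if value is None else round(value)


def parse_row(row):
    """Convert one CSV row (a dict of strings) into a typed Penguin record."""
    return Penguin(
        species=row["species"],
        island=row["island"],
        bill_length_mm=to_float(row["bill_length_mm"]),
        bill_depth_mm=to_float(row["bill_depth_mm"]),
        flipper_length_mm=to_int(row["flipper_length_mm"]),
        body_mass_g=to_int(row["body_mass_g"]),
        sex=row["sex"] or None,
    )


penguins = [parse_row(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(penguins[0])
```

Representing a missing value as `None` keeps the integer fields as true integers. A plain `int` cannot hold `NaN`, which is why the missing rows matter when choosing these types.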
