# Fundamentals of Data Analysis Tasks

**Linda Grealish**

***

## Table of contents
 * [Task 1](#task-1)
 * [Task 2](#task-2)

## Task 1

*The Collatz conjecture is a famous unsolved problem in mathematics. The problem is to prove that if you start with any positive integer $x$ and repeatedly apply the function $f(x)$ below, you always get stuck in the repeating sequence 1, 4, 2, 1, 4, 2, . . .*

$$ f(x) = \begin{cases}
x\div 2 & \text{if $x$ is even} \\
3x+1 & \text{otherwise}
\end{cases} $$

*For example, starting with the value 10, which is an even number, we divide it by 2 to get 5. Then 5 is an odd number, so we multiply by 3 and add 1 to get 16. Then we repeatedly divide by 2 to get 8, 4, 2, 1. Once we are at 1, we go back to 4 and get stuck in the repeating sequence 4, 2, 1 as we suspected.*

*Your task is to verify, using Python, that the conjecture is true for the first 10000 positive integers.*

First we define the function $f(x)$, which takes a positive integer $x$. The function uses the modulo operator % to check whether the integer is even or odd. If it is even, it divides x by 2 using floor division //, and if it is odd, it returns the result of (3 * x) + 1.

In [None]:
def f(x):
  # If x is even, divide it by 2.
  if x % 2 == 0:
    return x // 2
  else:
    return (3 * x) + 1 

Next we define the function *collatz()*. It uses a *while* loop that runs until the value reaches 1, printing each value along the way so that the result is a comma-separated list of numbers. On each pass the loop calls $f(x)$, whose if and else statements decide what happens depending on whether the number is even or odd.

In [None]:
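def collatz(x):
  print(f'Testing Collatz with initial value {x}')
  # Keep applying f until the value reaches 1.
  while x != 1:
    # Print the current value followed by a comma instead of a newline.
    print(x, end=', ')
    x = f(x)
  # Print the final value, which is 1.
  print(x)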


*collatz()* takes a starting integer and prints the resulting sequence as a comma-separated list. The cell below prints the Collatz sequence for 1000. It can be run for any positive integer by changing the value in the parentheses, *collatz(number)*.

In [None]:
collatz(1000)

To satisfy the second part of the task we define a function *collatz_sequence* that takes a starting integer and applies the Collatz function repeatedly, appending the result of each iteration to a list until the value reaches 1.

In [None]:
def collatz_sequence(x):
    sequence = [x]
    while x != 1:
        if x % 2 == 0:
            x = x // 2
        else:
            x = 3 * x + 1
        sequence.append(x)
    return sequence
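As a quick sanity check, the function can be compared against the worked example from the task statement, which starts at 10. The sketch below repeats a compact version of *collatz_sequence* only so that the cell runs on its own; the variable name `seq` is introduced here for illustration.

```python
def collatz_sequence(x):
    # Compact copy of the function above, repeated so this cell is self-contained.
    sequence = [x]
    while x != 1:
        x = x // 2 if x % 2 == 0 else 3 * x + 1
        sequence.append(x)
    return sequence

seq = collatz_sequence(10)
print(seq)           # [10, 5, 16, 8, 4, 2, 1]
print(len(seq) - 1)  # 6 applications of f(x) are needed to reach 1
```

The printed list matches the values 10, 5, 16, 8, 4, 2, 1 described in the task.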

The final part uses the *range* function to generate the integers from 1 to 10,000 inclusive and applies *collatz_sequence(x)* to each one.

In [None]:
for number in range(1, 10001):
    print(f"collatz_sequence {number} is {collatz_sequence(number)}")

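Running the above code prints the Collatz sequence for each of the integers from 1 to 10,000 (inclusive). Because *collatz_sequence* only returns once the value reaches 1, the fact that the loop completes shows that every one of these starting values ends at 1, which verifies the conjecture for the first 10,000 positive integers.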
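Reading through 10,000 printed sequences is impractical, so the same check can also be written to report only the starting values that fail. The following is a minimal sketch rather than part of the original solution: the helper name `reaches_one` and the `max_steps` cap are assumptions added here so that the loop could never run forever if a sequence did not reach 1.

```python
def reaches_one(start, max_steps=10_000):
    """Return True if the Collatz iteration from `start` hits 1 within `max_steps` steps."""
    x = start
    for _ in range(max_steps):
        if x == 1:
            return True
        # Apply f(x): halve even numbers, map odd numbers to 3x + 1.
        x = x // 2 if x % 2 == 0 else 3 * x + 1
    return x == 1

# Collect every starting value from 1 to 10,000 that did not reach 1.
failures = [n for n in range(1, 10001) if not reaches_one(n)]
print(f"Starting values that did not reach 1: {failures}")
```

An empty list means that every starting value in the range reached 1 within the step cap.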

***

## Task 2

*Give an overview of the famous penguins data set, explaining the types of variables it contains. Suggest the types of variables that should be used to model them in Python, explaining your rationale.*

### History and Overview of the dataset

The data were collected from 2007 to 2009 and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the US Long Term Ecological Research Network. The study was conducted on the islands of the Palmer Archipelago, Antarctica.

The dataset comprises various measurements of three penguin species: *Adelie, Gentoo and Chinstrap*.

<img src = https://github.com/allisonhorst/palmerpenguins/raw/main/man/figures/lter_penguins.png alt = "Image of Palmer Penguins">

### Columns in the dataset

The dataset contains 344 rows and 7 columns. The columns in the dataset are:

- **species:**           The species of the penguin (Adelie, Gentoo, or Chinstrap)
- **island:**            The island where the penguin was observed (Biscoe, Dream, or Torgersen)
- **bill_length_mm:**    The length of the penguin’s bill in millimeters
- **bill_depth_mm:**     The depth of the penguin’s bill in millimeters
- **flipper_length_mm:** The length of the penguin’s flipper in millimeters
- **body_mass_g:**       The body mass of the penguin in grams
- **sex:**               The sex of the penguin (male or female)

### Types of Variables Explained

<img src = Images/LOM.PNG alt = "Data Types Infographic">

A variable is a characteristic that can be measured and that can assume different values. Height, age, income, province or country of birth, grades obtained at school and type of housing are all examples of variables. Variables may be classified into two main categories: categorical and numeric. Each category is then classified in two subcategories: nominal or ordinal for categorical variables, discrete or continuous for numeric variables. 

A categorical variable (also called a qualitative variable) refers to a characteristic that cannot be quantified. Categorical variables can be either nominal or ordinal.

A nominal variable is one that describes a name, label or category without natural order. Sex and type of dwelling are examples of nominal variables.

An ordinal variable is a variable whose values are defined by an order relation between the different categories.

A numeric or quantitative variable is a quantifiable characteristic whose values are numbers. These variables can be either continuous or discrete.

A continuous variable is one that can assume an infinite number of real values within a given interval.

A discrete variable can assume only a finite number of real values within a given interval.

https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch8/5214817-eng.htm 
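The difference between nominal and ordinal variables can be illustrated with the pandas `Categorical` type, which can be created with or without an order. This is only a sketch: the island names come from the penguins data set, while the `size` grades are a hypothetical example that does not appear in the data.

```python
import pandas as pd

# Nominal: labels with no natural order (island names from the penguins data set).
island = pd.Categorical(["Biscoe", "Dream", "Torgersen", "Dream"])

# Ordinal: hypothetical size grades with an explicit order (not part of the data set).
size = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
)

print(island.ordered)          # False: the categories cannot be ranked
print(size.ordered)            # True: small < medium < large
print(size.min(), size.max())  # the smallest and largest grade under that order
```

Only the ordered categorical supports comparisons such as a minimum and maximum, which mirrors the definition of an ordinal variable above.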

### Types of Variables in the Penguin Dataset

- **species:**           categorical
- **island:**            categorical
- **bill_length_mm:**    numerical
- **bill_depth_mm:**     numerical
- **flipper_length_mm:** numerical
- **body_mass_g:**       numerical
- **sex:**               categorical

In [9]:
import pandas as pd


In [19]:
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv'

penguins = pd.read_csv(url)

In [23]:
# The code below displays an overview of the penguin dataset with the datatypes assigned for each of
# the columns.  We can see that the 3 categorical variables have been assigned the object datatype
# and the 4 numerical variables have been assigned the float64 datatype. The non-null counts also
# show that some values are missing in the 4 numerical columns and in the sex column.
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


### Conclusion

Based on my research into the types of variables, I conclude that the following variable types are the most appropriate for the variables in the penguin dataset:

- **Nominal:** species, island and sex
- **Ratio:** bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g
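One possible way to reflect this conclusion in Python is to store the nominal columns with the pandas `category` datatype and keep the measurements as `float64`. The sketch below uses a small made-up `sample` DataFrame with the same column names so that it runs without downloading the data; the values are placeholders, not real measurements, and the names `sample` and `nominal_columns` are introduced here for illustration.

```python
import pandas as pd

# Made-up placeholder rows using the data set's column names; not real measurements.
sample = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream"],
    "bill_length_mm": [39.0, 46.5, 49.0],
    "bill_depth_mm": [18.5, 14.5, 18.0],
    "flipper_length_mm": [181.0, 215.0, 196.0],
    "body_mass_g": [3750.0, 5000.0, 3700.0],
    "sex": ["male", "female", None],  # None stands in for a missing value
})

# Store the nominal variables as unordered categories; the measurements stay float64.
nominal_columns = ["species", "island", "sex"]
sample = sample.astype({column: "category" for column in nominal_columns})

print(sample.dtypes)
```

The same `astype` call could be applied to the `penguins` DataFrame loaded earlier. Floating point is a natural fit for the measurements because they are continuous, and it also allows missing values to be stored as NaN.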

### Resources

https://en.wikipedia.org/wiki/Continuous_or_discrete_variable
 
https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch8/5214817-eng.htm

https://articles.outlier.org/discrete-vs-continuous-variables


***
## End