# Fundamentals of Data Analysis Winter 2023 Tasks

**Author: Nur Bujang**

tasks.ipynb
***

## Task 1 : Collatz Conjecture
> The Collatz conjecture is a famous unsolved problem in mathematics. The problem is to prove that if you start with any positive integer x and repeatedly apply the function f(x) below, you always get stuck in the repeating sequence 1, 4, 2, 1, 4, 2, ...

\begin{align*}
f(x) = \begin{cases} 
    x \div 2 & \text{if } x \text{ is even} \\
    3x + 1 & \text{if } x \text{ is odd}
\end{cases}
\end{align*}

> For example, starting with the value 10, which is an even number, we divide it by 2 to get 5. Then 5 is an odd number, so we multiply by 3 and add 1 to get 16. Then we repeatedly divide by 2 to get 8, 4, 2, 1. **Once we are at 1, we go back to 4 and get stuck in the repeating sequence 4, 2, 1 as we suspected.**

### Task Description:
> The task is to verify, using Python, that the Collatz conjecture is true for the first 10000 positive integers.


In [55]:
def collatz(x):
    """Return the Collatz sequence that starts at x and ends at 1."""
    clist = [x] # the sequence is collected in a list

    while x != 1:
        if x % 2 == 0:
            x = x // 2 # if x is even, divide it by two
        else:
            x = (3 * x) + 1 # if x is odd, multiply by 3, then add 1
        clist.append(x) # add the new value to the list

    return clist

In [56]:
# for verification of the first 10000 positive integers

def verify_collatz(limit):
    for i in range(1, limit + 1):
        clist = collatz(i)

        # if NOT verified, the sequence will NOT end with 1
        if clist[-1] != 1: # compare the last element
            print(f"The Collatz Conjecture is not true for x = {i}")
            return

    # reached only if every starting value ended at 1
    print("The Collatz Conjecture is true for the first", limit, "positive integers.")

verify_collatz(10000) # call the function

The Collatz Conjecture is true for the first 10000 positive integers.
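
Because `collatz(x)` loops until `x` reaches 1, a starting value that never reached 1 would keep the loop running forever instead of being reported. A minimal sketch of a bounded check is shown below. The names `reaches_one` and `first_failure` and the step cap `max_steps` are hypothetical and not part of the notebook above; the cap of 10000 steps is an assumption that is far larger than any sequence in this range needs.

```python
def reaches_one(x, max_steps=10_000):
    """Return True if x reaches 1 within max_steps Collatz steps."""
    for _ in range(max_steps):
        if x == 1:
            return True
        # apply one Collatz step: halve even numbers, 3x + 1 for odd numbers
        x = x // 2 if x % 2 == 0 else 3 * x + 1
    return x == 1


def first_failure(limit, max_steps=10_000):
    """Return the first start value in 1..limit that does not reach 1, or None."""
    for start in range(1, limit + 1):
        if not reaches_one(start, max_steps):
            return start
    return None


print(first_failure(10000))  # None, meaning every start value reached 1
```

With a cap in place, a value that fails to reach 1 would be returned as a result rather than hanging the notebook.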


## Task 2 : Penguins Data Set Description
> Give an overview of the famous penguins data set and explain the types of variables it contains. 

### Task Description: 
> The task is to suggest the types of variables that should be used to model them in Python and to explain your rationale.

In [1]:
import numpy as np # for computational operations
import pandas as pd # for data loading from other sources and processing

In [2]:
df = pd.read_csv('penguins.csv') # import penguins.csv
df.head() # show the default first few lines of the dataframe

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |


`df.dtypes` gives the data type of each column.

`df.info()` prints the number of rows and columns, the column names with their data types, and the number of non-null (not missing) values in each column.

In [3]:
data_types = df.dtypes # OR df.info()
print(data_types)

species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object


From the output, the first (species), second (island) and seventh (sex) columns have the `object` data type, which pandas uses for text. They hold categorical data, a type of qualitative data in which each value falls into exactly one distinct group. Species, island and sex are nominal categories, which have no order or ranking. The three species are Adelie, Chinstrap and Gentoo. The three islands are Torgersen, Dream and Biscoe. Sex is recorded as either "MALE" or "FEMALE".

The third (bill length), fourth (bill depth), fifth (flipper length) and sixth (body mass) columns are 64-bit floating-point numbers. They hold continuous quantitative data: these body measurements and weights can take any of infinitely many values within a certain range. They are measured on a ratio scale, which has a true zero point.

In [4]:
# To specify the type of object:

element = df.at[0, 'species'] # row 0 (line 1 of data), column species
if isinstance(element, str):
    print("species column is a string.")
else:
    print("species column is not a string.")

species column is a string.


In [5]:
element = df.at[0, 'island']
if isinstance(element, str):
    print("island column is a string.")
else:
    print("island column is not a string.")

island column is a string.


In [7]:
element = df.at[3, 'sex'] # line with missing data. If row is changed to 0, it will output a string
if isinstance(element, str):
    print("sex column is a string.")
else:
    print("sex column is not a string.")

sex column is not a string.


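A string is a more specialized text data type than the generic object data type. However, treating these columns as strings is only straightforward when we are confident that they contain only strings. The missing value in row 3 of the sex column, for example, is not a string, which is why the check above reports "not a string". Mixed data types and missing values need to be handled before relying on string operations.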
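
As a minimal sketch of such a conversion, the snippet below rebuilds the text columns of the five rows shown by `df.head()` as a small stand-in frame and converts them with `astype`. The names `sample`, `sex_as_string` and `species_as_category` are hypothetical and not part of the notebook, and the sketch assumes pandas 1.0 or later, where the `"string"` dtype is available.

```python
import pandas as pd

# Stand-in for the first five rows of the penguins data (values copied from df.head())
sample = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "island": ["Torgersen"] * 5,
    "sex": ["MALE", "FEMALE", "FEMALE", None, "FEMALE"],  # row 3 is missing
})

# Dedicated string dtype: the missing entry stays a missing value (<NA>)
sex_as_string = sample["sex"].astype("string")
print(sex_as_string.isna().sum())  # 1

# Categorical dtype: suits nominal variables with a few repeated labels
species_as_category = sample["species"].astype("category")
print(list(species_as_category.cat.categories))  # ['Adelie']
```

The `"category"` dtype is often a natural fit for nominal variables such as species, island and sex, since it stores each distinct label once.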

In [9]:
columns_to_check = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

for column_name in columns_to_check:
    element = df.at[0, column_name]
    if isinstance(element, int):
        print(f"{column_name} column is an int.")
    else:
        print(f"{column_name} column is not an int.")

bill_length_mm column is not an int.
bill_depth_mm column is not an int.
flipper_length_mm column is not an int.
body_mass_g column is not an int.


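Bill length and bill depth columns are floats.

**HOWEVER, flipper length and body mass are actually integers in the penguins.csv data file.** pandas still reads them as float64 because both columns contain missing values (see row 3), and the default integer type cannot store NaN.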
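
One way to keep such a column as whole numbers despite a missing row is the nullable integer dtype `Int64` in pandas. The sketch below is an illustration rather than part of the notebook: it rebuilds the five `flipper_length_mm` values shown by `df.head()` as a hypothetical Series named `flipper`, and assumes pandas 1.0 or later.

```python
import pandas as pd

# First five flipper lengths from df.head(); None marks the missing row
flipper = pd.Series([181.0, 186.0, 195.0, None, 193.0], name="flipper_length_mm")
print(flipper.dtype)  # float64, because of the missing value

# Nullable integer dtype (capital "I"): whole numbers plus <NA> for missing entries
flipper_int = flipper.astype("Int64")
print(flipper_int.dtype)  # Int64
print(flipper_int.isna().sum())  # 1
```

This conversion only succeeds when every non-missing value is a whole number, which matches what the data file is described as containing.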

The rationale for selecting variable types for modeling depends on the question and the types of analysis we want to perform. For example, we can use all the quantitative variables to predict species (classification), or use regression analysis to predict one variable from other features. Variable types also determine the appropriate data visualization techniques, such as pie charts, scatter plots and histograms. Understanding and specifying data types is important for choosing the right analysis and for ensuring that operations are performed correctly.

## Task 3 : Penguins Data Set Distribution Model
> For each of the variables in the penguins data set:

### Task Description:
> The task is to suggest what probability distribution from the numpy random distributions list is the most appropriate to model the variable.

## Task 4 : Head Probability
> Suppose you are flipping two coins, each with a probability p of giving heads.

### Task Description:
> The task is to plot the entropy of the total number of heads versus p.
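
As a starting point for this task, the entropy can be computed as sketched below. The sketch assumes the two flips are independent, so the number of heads takes the values 0, 1 and 2 with probabilities (1 - p)^2, 2p(1 - p) and p^2, and it measures entropy in bits (base-2 logarithm). The function name `heads_entropy` and the grid of 101 values of p are hypothetical choices, not part of the task statement.

```python
import math


def heads_entropy(p):
    """Shannon entropy, in bits, of the number of heads in two independent flips."""
    probs = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]  # P(0), P(1), P(2) heads
    # outcomes with probability 0 contribute nothing to the entropy
    return -sum(q * math.log2(q) for q in probs if q > 0)


# Evaluate the entropy on an evenly spaced grid of p values from 0 to 1
p_values = [i / 100 for i in range(101)]
entropies = [heads_entropy(p) for p in p_values]

print(heads_entropy(0.5))  # 1.5
```

The `p_values` and `entropies` lists can then be passed to a plotting function, for example `plt.plot(p_values, entropies)` from matplotlib, to produce the required curve.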

## Task 5 : Penguins Data Set Plots
> Penguins data set.

### Task Description:
> The task is to create an appropriate individual plot for each of the variables in the penguin data set.

***

## End