# TASKS FOR FUNDAMENTALS OF DATA ANALYSIS
## Author : Michael Allen

***

## IMPORT PYTHON LIBRARIES

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm
from scipy.stats import binom
import seaborn as sns

## First Task : COLLATZ CONJECTURE

***
My task is to verify, using Python, that the conjecture is true for
the first 10,000 positive integers.

The Collatz conjecture is a famous unsolved problem.[12] The problem is to prove that if you start with any positive
integer x and repeatedly apply the function f(x) below, you always get stuck in the repeating sequence 1, 4, 2, 1, 4, 2, ...

$ {\displaystyle f(x)={\begin{cases}x/2&{\text{if }}x\equiv 0{\pmod {2}},\\[4px]3x+1&{\text{if }}x\equiv 1{\pmod {2}}.\end{cases}}} $

For example, starting with the value 10, which is an even number,
we divide it by 2 to get 5. Since 5 is an odd number, we multiply it by 3 and add 1 to get 16. We then repeatedly divide by 2 to
get 8, 4, 2, 1. Once we are at 1, we go back to 4 and get stuck in the
repeating sequence 4, 2, 1, as expected. [14]

### Step 1 : 
- Define a function f(x).
- If the remainder of x divided by 2 is equal to zero, return x divided by 2.
- Otherwise, return x multiplied by 3 plus 1.

In [2]:
def f(x):
    if x % 2 == 0:  # x is even: dividing by 2 leaves no remainder
        return x // 2
    else:  # x is odd: dividing by 2 leaves a remainder
        return (x * 3) + 1
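
As a quick check of the worked example with the starting value 10, the sketch below applies the same rule repeatedly and collects every value visited. The helper name `collatz_sequence` is an assumption introduced here for illustration; it is not part of the original notebook.

```python
def f(x):
    """Apply one Collatz step: halve even numbers, map odd x to 3x + 1."""
    if x % 2 == 0:
        return x // 2
    return 3 * x + 1


def collatz_sequence(x):
    """Return the values visited from x until 1 is first reached.

    Hypothetical helper for illustration; assumes x is a positive integer.
    """
    sequence = [x]
    while x != 1:
        x = f(x)
        sequence.append(x)
    return sequence


print(collatz_sequence(10))  # [10, 5, 16, 8, 4, 2, 1]
```

Applying `f` once more to 1 gives 4, which is where the repeating 4, 2, 1 cycle comes from.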

### Step 2 : 
- Define the collatz function, which takes x as an input.[13]
- To reduce the amount of output printed in the notebook, the sequence is only printed when the starting value of x is divisible by 100.
- While x is not equal to 1, apply the f(x) function.
- When x reaches 1, the Collatz conjecture holds for that starting value.
- To reduce the output further, intermediate values are only printed when they are divisible by 10.

In [3]:
def collatz(x):
    if x % 100 == 0:
        print(f'Testing Collatz with initial value {x}')
        while x != 1:
            x=f(x)
            if x % 10 == 0:
                print(x, end=" ")
            elif x==1:
                print(x, end="\n")
                print("The Collatz conjecture is true.\n")
    elif x == 1:
        print(f'The Collatz conjecture is true for the first 10,000 positive integers.\nMission Accomplished !!!')
    else:
        while x != 1:
            x=f(x)      

### Step 3 : 
- Start with x equal to 10,000. Check that the sequence reaches 1.

- Then check that it reaches 1 for 9,999.

- Keep going until you get to 1.

- This verifies that the conjecture is true for the first 10,000 positive integers.

In [None]:
x=10000
while x>=1:
    collatz(x)
    x=x-1        
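
The loop above relies on printed output to show that each sequence ends. A quieter alternative, sketched below, returns a single boolean for the whole range. The names `reaches_one` and `verify_up_to` and the `max_steps` cap are assumptions added here so that the check is guaranteed to terminate; they are not part of the original notebook.

```python
def f(x):
    """Apply one Collatz step: halve even numbers, map odd x to 3x + 1."""
    if x % 2 == 0:
        return x // 2
    return 3 * x + 1


def reaches_one(x, max_steps=10_000):
    """Return True if repeated application of f takes x to 1.

    max_steps is an assumed safety cap so the loop always ends.
    """
    steps = 0
    while x != 1:
        if steps >= max_steps:
            return False
        x = f(x)
        steps += 1
    return True


def verify_up_to(limit):
    """Check every starting value from 1 to limit inclusive."""
    return all(reaches_one(start) for start in range(1, limit + 1))


print(verify_up_to(10_000))  # True
```

Returning a value instead of printing makes the result easy to assert on or reuse in later cells.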

## Second Task : Types of variables
***

Give an overview of the famous penguins data set, explaining the types of variables it contains. Suggest the types of variables
that should be used to model them in Python, explaining your rationale.


In [None]:
penguins = pd.read_excel('./data/penguins.xlsx')
penguins

In [None]:
penguins.columns

In [None]:
penguins.dtypes

I need to change the data type of flipper_length_mm and body_mass_g from a float to an integer.

In [None]:
penguins['flipper_length_mm'] = penguins['flipper_length_mm'].astype('Int64')
penguins['body_mass_g'] = penguins['body_mass_g'].astype('Int64')

Let's check that the conversion worked.

In [None]:
penguins.dtypes
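
The conversion above uses the capitalised `'Int64'` dtype rather than `'int64'`. The sketch below illustrates why, using a small made-up Series as a stand-in for a column with a missing measurement (the values are hypothetical, not taken from the dataset): pandas' nullable integer type can hold missing values, while a plain NumPy integer column cannot.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a float column with one missing measurement.
flipper = pd.Series([181.0, 186.0, np.nan, 193.0], name="flipper_length_mm")

# Nullable integer dtype: the NaN is kept as a missing value (<NA>).
converted = flipper.astype("Int64")
print(converted.dtype)         # Int64
print(converted.isna().sum())  # 1

# A plain NumPy integer dtype has no way to represent NaN, so this fails.
try:
    flipper.astype("int64")
    plain_int_failed = False
except (ValueError, TypeError) as error:
    plain_int_failed = True
    print(f"int64 conversion failed: {error}")
```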

### Type : Float
The float type in Python represents floating point numbers. A float is used to represent a real number and is written with a decimal point dividing the integer and fractional parts. For example, in the first row of the penguin dataset, 39.1 and 18.7 are floating point numbers. Python float values are represented as 64-bit double-precision values.[10]

In the penguin dataset there are two variables of type float:
- bill_length_mm is a float with one decimal place.
- bill_depth_mm is a float with one decimal place.
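
To make the 64-bit double-precision point concrete, the short sketch below inspects the float representation using only the standard library. It assumes a typical CPython build, where a float is stored as a C double; the variable names are introduced here for illustration.

```python
import struct
import sys

# A C double, which backs the Python float, occupies 8 bytes (64 bits).
bits_per_float = struct.calcsize("d") * 8
print(bits_per_float)  # 64

# 53 bits of precision give roughly 15 to 17 significant decimal digits.
precision_bits = sys.float_info.mant_dig
print(precision_bits)  # 53

# Decimal values are stored as the nearest binary fraction,
# so exact equality checks on floats can be surprising.
exact_match = (0.1 + 0.2 == 0.3)
print(exact_match)  # False

# Rounding to the one decimal place used in the dataset avoids the problem.
total = round(39.1 + 18.7, 1)
print(total)  # 57.8
```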

The mathematical notation for real numbers is as follows:

**Reals**

$ \mathbb{R} $

![Real Number Line](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Real_number_line.svg/689px-Real_number_line.svg.png)

In [None]:
type("Adelie")

In [None]:
type("Torgersen")

In [None]:
type(39.1)

In [None]:
type(18.7)

In [None]:
type(181)

In [None]:
type(3750)

In [None]:
type("MALE")

In [None]:
penguins

### Type : Categorical variable
In Python terminology, these are of type string. For example, in the first row of the penguin dataset, Adelie, Torgersen and MALE are strings.

According to the Laerd Statistics website, categorical variables are also known as discrete or qualitative variables.[9] Categorical variables can be further categorized as either nominal, ordinal or dichotomous.

In the penguin dataset there are three categorical variables:
- species is a categorical variable. It has 3 categories: Adélie, Chinstrap and Gentoo. Since there are more than 2 categories and no order, it is referred to as a nominal variable. Nominal variables are variables that have two or more categories, but which do not have an intrinsic order.

- island is a categorical variable. It also has 3 categories: Biscoe, Dream and Torgersen. Since there are more than 2 categories and no order, it is also referred to as a nominal variable.

- sex is a categorical variable with 2 categories: male and female. Since there are only 2 categories, it is referred to as a dichotomous variable, which is also a nominal variable.
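
One way to model these three variables in pandas, beyond plain strings, is the `category` dtype. The sketch below uses a few made-up rows rather than the real dataset, so the data frame name `sample` and its contents are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample rows, not the real penguins data.
sample = pd.DataFrame({
    "species": ["Adelie", "Chinstrap", "Gentoo", "Adelie"],
    "island": ["Torgersen", "Dream", "Biscoe", "Biscoe"],
    "sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
})

# Convert each string column to the pandas category dtype.
for column in ["species", "island", "sex"]:
    sample[column] = sample[column].astype("category")

print(sample.dtypes)

# The categories are stored once; by default they are unordered (nominal).
species_categories = list(sample["species"].cat.categories)
print(species_categories)             # ['Adelie', 'Chinstrap', 'Gentoo']
print(sample["species"].cat.ordered)  # False
```

An unordered category matches the nominal description above; an ordinal variable would instead be created with `ordered=True`.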

### Type : Integer
In Python, integer variables, or "int" variables, store whole numbers as their values. This includes zero and the positive whole numbers (0, 1, 2, 3, 4, 5, ...) as well as the negative whole numbers (-1, -2, -3, -4, -5, ...).

In the penguin dataset there are two integer variables:
- flipper_length_mm is an integer denoting flipper length (millimeters)   
- body_mass_g is an integer denoting body mass (grams)

The mathematical notation for integers is as follows:
    
**Integers**
  
  $\mathbb{Z} = \{ \ldots, -3, -2, -1, 0, 1, 2, 3, \ldots \}$
     
But since both flipper length and body mass cannot be negative, their values are actually natural numbers in mathematical terms.

**Naturals**

   $\mathbb{N} = \{1, 2, 3, \ldots\}$

   $\mathbb{N}_0 = \{0, 1, 2, 3, \ldots\}$

## Third Task : Probability Distributions
***

For each of the variables in the penguins data set, suggest what probability distribution from the numpy random distributions list is the most appropriate to model the variable.

In [None]:
bill_length = penguins.bill_length_mm
fig, ax = plt.subplots(1, 1)
n, bins, patches = plt.hist(bill_length)
title = ("Histogram of Bill Length(Normal Distribution)")
plt.title(title)
# adding labels 
ax.set_xlabel('Bill Length (mm)') 
ax.set_ylabel('Frequency')

In [None]:
n, bins

In [None]:
bill_length.mean()

### The normal distribution is the most appropriate distribution to model the variable bill_length_mm.
It's shaped like a bell curve.
As you can see, the bins in the center of the distribution contain most of the values for bill_length_mm.
As you go further from the mean of 43.92 mm, the number of values in the bins gets smaller and smaller.

In [None]:
bill_depth = penguins.bill_depth_mm
fig, ax = plt.subplots(1, 1)
n, bins, patches = plt.hist(bill_depth)
title = ("Histogram of Bill Depth(Normal Distribution)")
plt.title(title)
# adding labels 
ax.set_xlabel('Bill Depth (mm)') 
ax.set_ylabel('Frequency') 

In [None]:
n, bins

In [None]:
bill_depth.mean()

### bill_depth_mm follows a normal distribution.
It's shaped like a bell curve.
As you can see, the bins in the center of the distribution contain most of the values.
As you go further from the mean of 17.15 mm, the number of values in the bins gets smaller and smaller.

In [None]:
flipper_length = penguins.flipper_length_mm
#plt.hist(flipper_length)
fig, ax = plt.subplots(1, 1)
n, bins, patches = plt.hist(flipper_length[~np.isnan(flipper_length)])
title = ("Histogram of Flipper Length(2 Normal Distributions)")
plt.title(title)
# adding labels 
ax.set_xlabel('Flipper Length (mm)') 
ax.set_ylabel('Frequency') 

In [None]:
n, bins

In [None]:
flipper_length.mean()

### flipper_length_mm looks like 2 normal distributions side-by-side.
It's shaped like 2 bell curves.
As you can see, there are two peaks.
The number of values in each bin (n) rises to a first peak, falls, rises to a second peak and falls again, like a wave.
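
To illustrate what "two normal distributions side-by-side" means, the sketch below draws samples from two normal distributions with `numpy` and combines them. The means, standard deviations and group sizes are made-up values chosen only to give two separated peaks; they are not fitted to the penguin data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical parameters: two groups with different typical flipper lengths.
smaller_group = rng.normal(loc=190, scale=6, size=200)
larger_group = rng.normal(loc=217, scale=6, size=140)

# The combined sample is a mixture of the two normal distributions.
mixture = np.concatenate([smaller_group, larger_group])

# Bin counts of a mixture like this typically rise, fall, rise and fall again.
counts, edges = np.histogram(mixture, bins=10)
print(counts)
print(round(mixture.mean(), 1))
```

Passing `mixture` to `plt.hist` would show the same two-peaked shape as the flipper length histogram.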

In [None]:
body_mass = penguins.body_mass_g
#plt.hist(body_mass)
fig, ax = plt.subplots(1, 1)
n, bins, patches = plt.hist(body_mass[~np.isnan(body_mass)])
title = ("Histogram of Body Mass(Normal Distribution)")
plt.title(title)
# adding labels 
ax.set_xlabel('Body Mass (g)') 
ax.set_ylabel('Frequency') 

In [None]:
n, bins

In [None]:
body_mass.mean()

In [None]:
body_mass.std()

### body_mass_g follows a normal distribution.
It's shaped like a bell curve.
As you can see, the bins in the center of the distribution contain most of the values.
As you go further from the mean of 4201.75 g, the number of values in the bins gets smaller and smaller.

In [None]:
body_mass.replace([np.inf, -np.inf], np.nan, inplace=True)
body_mass.dropna(how="any", inplace=True)

In [None]:
#data = body_mass.dropna()
#data

In [None]:
#penguins.dropna()

In [None]:
np.isnan(body_mass).any()

In [None]:
body_mass.size

In [None]:
data = norm.rvs(4201.754385964912, 801.9545356980955, size=342)
mu, std = norm.fit(data)
fig, ax = plt.subplots(1, 1) 
plt.hist(data, bins=25, density=True, alpha=0.6, color='g')
title = ("Histogram of Body Mass(Normal Distribution)")
plt.title(title)
# adding labels 
ax.set_xlabel('Body Mass (g)') 
ax.set_ylabel('Probability Density') 

In [None]:
# Plot the PDF.
#xmin, xmax = plt.xlim()
body_mass.min()

In [None]:
data.min()

In [None]:
body_mass.max()

In [None]:
data.max()

In [None]:
x = np.linspace(2051, 6644, 100)
p = norm.pdf(x, mu, std)
fig, ax = plt.subplots(1, 1) 
plt.plot(x, p, 'k', linewidth=2)
plt.hist(data, bins=25, density=True, alpha=0.6, color='g')
title = "Probability Density Function for Body Mass (Fit results: mu = %.2f,  std = %.2f)" % (mu, std)
plt.title(title)
# adding labels 
ax.set_xlabel('Body Mass(g)') 
ax.set_ylabel('Probability Density') 

In [None]:
#from scipy import stats
#dist = stats.norm
#data = body_mass
#bounds = [(170, 240), (0, 90)]
#res = stats.fit(dist, data, bounds)
#res

In [None]:
np.isinf(body_mass).any()

In [None]:
body_mass.dtype

In [None]:
body_mass.mean()

In [None]:
body_mass.std()

## TASK 4 : Flipping 2 coins
***
Description : Suppose you are flipping two coins, each with a probability p of
giving heads. Plot the entropy of the total number of heads versus
p.

The distribution of coin flips can be described by a binomial distribution. For a fair coin, the probability of getting heads (or tails) is constant for all coin tosses: p = 1/2. The binomial distribution consists of the probabilities of each of the possible numbers of successes on N trials for independent events that each have a probability of π (the Greek letter pi) of occurring. When the probability of heads is 50%, the distribution closely resembles a normal distribution as the number of trials and the number of coin flips per trial increase.

The probability of getting one head from flipping one coin is 0.5.
In other words, each coin has a probability of 0.5 of giving heads
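
The task description asks for the entropy of the total number of heads as a function of p. A minimal sketch of that calculation is given below, assuming the two coins are independent and that entropy is measured in bits (base-2 logarithm). The function names are introduced here for illustration, and `plot_entropy` assumes matplotlib is installed.

```python
import math


def heads_distribution(p):
    """Probabilities of 0, 1 and 2 heads from two independent coins."""
    q = 1 - p
    return [q * q, 2 * p * q, p * p]


def entropy_bits(probabilities):
    """Shannon entropy in bits, treating 0 * log2(0) as 0."""
    return -sum(prob * math.log2(prob) for prob in probabilities if prob > 0)


def plot_entropy(num_points=101):
    """Plot the entropy of the number of heads against p."""
    import matplotlib.pyplot as plt  # imported here so the maths runs without it

    p_values = [i / (num_points - 1) for i in range(num_points)]
    entropies = [entropy_bits(heads_distribution(p)) for p in p_values]
    fig, ax = plt.subplots(1, 1)
    ax.plot(p_values, entropies)
    ax.set_xlabel("p (probability of heads)")
    ax.set_ylabel("Entropy (bits)")
    ax.set_title("Entropy of the number of heads from 2 coins")
    return ax


print(entropy_bits(heads_distribution(0.5)))  # 1.5
```

Calling `plot_entropy()` in a notebook cell draws the curve, which is zero at p = 0 and p = 1 and peaks at 1.5 bits when p = 0.5.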

In [None]:
p = .5 * .5
p
#rng = np.random.default_rng()
#n, p = 5, .25  # number of trials, probability of each trial
#s = rng.binomial(n, p, 1000)
# result of flipping 2 coin 5 times, tested 1000 times.

The probability of getting 2 heads from flipping 2 coins is 0.25 (0.5 * 0.5).
The probability of getting exactly 1 head from flipping 2 coins is 0.5 (0.25 + 0.25).
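
Assuming two fair, independent coins, these probabilities can be checked by listing the four equally likely outcomes and counting the heads in each. The sketch below uses only the standard library; the variable names are introduced here for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of flipping two coins: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

# Count how many outcomes give 0, 1 or 2 heads.
heads_counts = Counter(outcome.count("H") for outcome in outcomes)

# Convert the counts to exact probabilities.
probabilities = {
    heads: Fraction(count, len(outcomes))
    for heads, count in sorted(heads_counts.items())
}

for heads, probability in probabilities.items():
    print(f"{heads} heads: {probability}")
```

Exactly one head is the most likely result because two of the four outcomes (HT and TH) produce it.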

In [None]:
rng = np.random.default_rng()
n, p = 10, 0.25  # number of trials, probability of success on each trial
s = rng.binomial(n, p, 100000)
s
# number of successes in 10 trials, simulated 100,000 times.

In [None]:
plt.hist(s)
plt.xlabel("Number of heads per trial")
plt.ylabel("Number of trials with x heads")
title = ("Histogram of flipping 2 coins")
plt.title(title)

In [None]:
mu = s.mean()
mu

In [None]:
std = s.std()
std

In [None]:
x = np.linspace(0, 8, 10)
p = norm.pdf(x, mu, std)
fig, ax = plt.subplots(1, 1) 
ax.plot(x, p, 'k', linewidth=2)
ax.hist(s, bins=25, density=True, alpha=0.6, color='g')
title = "Fit results: mu = %.2f,  std = %.2f" % (mu, std)
plt.title(title)
# adding labels 
ax.set_xlabel('Number of heads') 
ax.set_ylabel('Probability') 

In [None]:
# setting the values 
# of n and p 
n = 10
p = 0.25
# defining list of r values 
r_values = list(range(n + 1)) 
# list of pmf values 
dist = [binom.pmf(r, n, p) for r in r_values ] 
# plotting the graph 
fig, ax = plt.subplots(1, 1) 
#ax.hist(penguins.bill_depth_mm) 
ax.bar(r_values, dist) 
#plt.show()
# Set title 
ax.set_title("Probability Mass Function") 
  
# adding labels 
ax.set_xlabel('Number of heads') 
ax.set_ylabel('Probability') 

##### Calculate the first four moments:

In [None]:
n, p = 10, 0.25
mean, var, skew, kurt = binom.stats(n, p, moments='mvsk')
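
For comparison, the same four quantities can be computed from the standard closed-form expressions for a binomial distribution. The sketch below is a hand-written check, not part of scipy; the function name `binomial_moments` is an assumption, and the kurtosis is the excess (Fisher) kurtosis, which is what `binom.stats` reports.

```python
import math


def binomial_moments(n, p):
    """Mean, variance, skewness and excess kurtosis of Binomial(n, p).

    Assumes 0 < p < 1 so that the variance is non-zero.
    """
    mean = n * p
    variance = n * p * (1 - p)
    skewness = (1 - 2 * p) / math.sqrt(variance)
    excess_kurtosis = (1 - 6 * p * (1 - p)) / variance
    return mean, variance, skewness, excess_kurtosis


mean, variance, skewness, excess_kurtosis = binomial_moments(10, 0.25)
print(mean, variance)  # 2.5 1.875
print(round(skewness, 4), round(excess_kurtosis, 4))
```

The positive skewness reflects the longer right tail seen in the probability mass function when p is below 0.5.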

##### Display the probability mass function (pmf):

In [None]:
fig, ax = plt.subplots(1, 1)
x = np.arange(binom.ppf(0.01, n, p),
              binom.ppf(0.99, n, p))
ax.plot(x, binom.pmf(x, n, p), 'bo', ms=8, label='binom pmf')
ax.vlines(x, 0, binom.pmf(x, n, p), colors='b', lw=5, alpha=0.5)
# Set title 
ax.set_title("Probability Mass Function") 
  
# adding labels 
ax.set_xlabel('Number of heads') 
ax.set_ylabel('Probability') 

##### Check accuracy

In [None]:
prob = binom.cdf(x, n, p)
np.allclose(x, binom.ppf(prob, n, p))

##### Generate random numbers

In [None]:
r = binom.rvs(n, p, size=1000)
r

In [None]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
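# histplot replaces the deprecated distplot; stat="density" keeps the y-axis as a density
sns.histplot(r, kde=True, stat="density", color="g").set(title='Kernel Density Estimate (KDE) Plot for Flipping 2 coins')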

## TASK 5 : Plot variables
***
Create an appropriate individual plot for each of the variables in
the penguin data set.

### HISTOGRAM OF BILL DEPTH

In [None]:
# Creating histogram 
fig, ax = plt.subplots(1, 1) 
ax.hist(penguins.bill_depth_mm) 
  
# Set title 
ax.set_title("HISTOGRAM OF BILL DEPTH") 
  
# adding labels 
ax.set_xlabel('BILL DEPTH (mm)') 
ax.set_ylabel('FREQUENCY') 

plt.show()

In [None]:
sns.histplot(data=penguins, x='bill_depth_mm', hue="sex").set(title='HISTOGRAM OF BILL DEPTH BY SEX')

### HISTOGRAM OF BILL LENGTH

In [None]:
# Creating histogram 
fig, ax = plt.subplots(1, 1) 
ax.hist(penguins.bill_length_mm) 
  
# Set title 
ax.set_title("HISTOGRAM OF BILL LENGTH") 
  
# adding labels 
ax.set_xlabel('BILL LENGTH (mm)') 
ax.set_ylabel('FREQUENCY') 

plt.show()

In [None]:
sns.histplot(data=penguins, x='bill_length_mm', hue="island").set(title='HISTOGRAM OF BILL LENGTH BY ISLAND')

### LINE PLOT OF BILL LENGTH AND BILL DEPTH

In [None]:
#plt.plot(penguins.bill_length_mm,penguins.bill_depth_mm )
# Creating lineplot 
fig, ax = plt.subplots(1, 1) 
ax.plot(penguins.bill_length_mm,penguins.bill_depth_mm) 
  
# Set title 
ax.set_title("LINE PLOT OF BILL LENGTH AND BILL DEPTH") 
  
# adding labels 
ax.set_xlabel('BILL LENGTH (mm)') 
ax.set_ylabel('BILL DEPTH (mm)') 

plt.show()

### SCATTER PLOTS OF BILL LENGTH AND BILL DEPTH

In [None]:

# Creating scatter plot
fig, ax = plt.subplots(1, 1) 
ax.scatter(penguins.bill_length_mm,penguins.bill_depth_mm) 
  
# Set title 
ax.set_title("SCATTER PLOT OF BILL LENGTH AND BILL DEPTH") 
  
# adding labels 
ax.set_xlabel('BILL LENGTH (mm)') 
ax.set_ylabel('BILL DEPTH (mm)') 

plt.show()

In [None]:
#sns.scatterplot(data = penguins)
sns.scatterplot(x='bill_length_mm',y='bill_depth_mm', hue="species", data = penguins).set(title='SCATTER PLOT OF BILL LENGTH AND BILL DEPTH')

### PAIR PLOTS 

In [None]:
sns.pairplot(data = penguins)
sns.pairplot(penguins, hue="species", palette="rainbow")
plt.savefig('./img/pairplot.png')
plt.show()
plt.close()

### BILL LENGTH BY SPECIES BAR PLOT

In [None]:
sns.barplot(x="species", y="bill_length_mm", data=penguins).set(title='BILL LENGTH BY SPECIES BAR PLOT')

### BILL DEPTH BY SPECIES BAR PLOT

In [None]:
sns.barplot(x="species", y="bill_depth_mm", data=penguins).set(title='BILL DEPTH BY SPECIES BAR PLOT')

### FLIPPER LENGTH BY SPECIES BAR PLOT

In [None]:
sns.barplot(x="species", y="flipper_length_mm", data=penguins).set(title='FLIPPER LENGTH BY SPECIES BAR PLOT')

### BODY MASS BY SPECIES BAR PLOT

In [None]:
sns.barplot(x="species", y="body_mass_g", data=penguins).set(title='BODY MASS BY SPECIES BAR PLOT')

### GROUPBY SPECIES AND SEX BAR PLOT
- Barcharts: Barcharts are useful when you have two variables, one numerical and the other categorical. A barplot can reveal the relationship between them.[15]

- Grouped Barplot: A grouped barplot is useful when you have more than one categorical variable. Python's Seaborn plotting library makes it easy to create grouped barplots.

- Groupby: The pandas DataFrame.groupby() function is used to split the data into groups based on some criteria. Pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names.
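
The split-and-aggregate idea can be seen on a tiny example. The sketch below groups a made-up data frame by two categorical columns and averages a numerical column within each group; the name `toy` and its values are hypothetical and are not taken from the penguins data.

```python
import pandas as pd

# Hypothetical measurements: two rows for each species and sex combination.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Adelie",
                "Gentoo", "Gentoo", "Gentoo", "Gentoo"],
    "sex": ["FEMALE", "FEMALE", "MALE", "MALE",
            "FEMALE", "FEMALE", "MALE", "MALE"],
    "bill_length_mm": [36.0, 38.0, 40.0, 42.0, 45.0, 47.0, 49.0, 51.0],
})

# Split the rows by (species, sex), then take the mean within each group.
group_means = (
    toy.groupby(["species", "sex"])["bill_length_mm"]
    .mean()
    .reset_index()
)
print(group_means)
```

The result has one row per group, which is the shape a grouped barplot with `hue="sex"` expects.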

In [None]:
# average every numeric column within each species and sex group
penguins_groupby = penguins.groupby(['species', 'sex']).mean(numeric_only=True) 
penguins_groupby = penguins_groupby.reset_index() 
  
# plot barplot 
sns.barplot(x="species", 
           y="bill_length_mm", 
           hue="sex", 
           data=penguins_groupby).set(title='GROUPBY SPECIES AND SEX BAR PLOT') 

In [None]:
penguins_groupby

In [None]:
penguins_groupby.bill_length_mm

- The mean bill length of Adelie FEMALE penguins is 37.257534 mm
- The mean bill length of Adelie MALE penguins is 40.390411 mm
- The mean bill length of Chinstrap FEMALE penguins is 46.573529 mm
- The mean bill length of Chinstrap MALE penguins is 51.094118 mm
- The mean bill length of Gentoo FEMALE penguins is 45.563793 mm
- The mean bill length of Gentoo MALE penguins is 49.473770 mm

On average, the bills of males are longer than those of females. This is true for all 3 species.

### PENGUIN SEX PIECHART

In [None]:
fig, ax = plt.subplots(1,1,figsize=(10,8))
penguins['sex'].value_counts().plot.pie(explode=[0.1,0.1],autopct='%1.1f%%',shadow=True,figsize=(10,8), legend=True, labeldistance=1.1)
plt.title("Penguin Sex %")
plt.show()

### PENGUIN ISLAND PIECHART

In [None]:
fig, ax = plt.subplots(1,1,figsize=(10,8))
penguins['island'].value_counts().plot.pie(explode=[0.1,0.1,0.1],autopct='%1.1f%%',shadow=True,figsize=(10,8), legend=True, labeldistance=1.1)
plt.title("Penguin Island %")
plt.show()

### PENGUIN SPECIES PIECHART

In [None]:
fig, ax = plt.subplots(1,1,figsize=(10,8))
penguins['species'].value_counts().plot.pie(explode=[0.1,0.1,0.1],autopct='%1.1f%%',shadow=True,figsize=(10,8), legend=True, labeldistance=1.1)
plt.title("Penguin Species %")
plt.show()

# RESEARCH / REFERENCES

[1] Convert Pandas column containing NaNs to dtype `int`, Stackoverflow
https://stackoverflow.com/questions/21287624/convert-pandas-column-containing-nans-to-dtype-int

[2] Penguins Dataset Overview — iris alternative, Towards Data Science
https://towardsdatascience.com/penguins-dataset-overview-iris-alternative-9453bb8c8d95

[3] how to plot a histogram with nan?, Stackoverflow
https://stackoverflow.com/questions/54615686/how-to-plot-a-histogram-with-nan#:~:text=You%20can%20use%20numpy.isnan%20%28%29%20to%20choose%20only,to%20be%20np.nan%20for%20this%20to%20work%20though.%29

[4] Fitting a Normal distribution to 1D data, Stackoverflow
https://stackoverflow.com/questions/20011122/fitting-a-normal-distribution-to-1d-data

[5] Types of Variable - Understanding the different types of variable in statistics, Laerd Statistics
https://statistics.laerd.com/statistical-guides/types-of-variable.php

[6] Binomial Distribution via Coin Flips, WOLFRAM TECHNOLOGIES
https://demonstrations.wolfram.com/BinomialDistributionViaCoinFlips/

[7] numpy.random.Generator.binomial API, Numpy
https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.binomial.html#numpy.random.Generator.binomial

[8] scipy.stats.binom API, Scipy
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html

[9] Types of Variable, Laerd Statistics 
https://statistics.laerd.com/statistical-guides/types-of-variable.php

[10] Built-in Types, The Python Software Foundation
https://docs.python.org/3/library/stdtypes.html

[11] Collatz Conjecture and printing statements, Stackoverflow
https://stackoverflow.com/questions/7947624/collatz-conjecture-and-printing-statements

[12] The Simple Math Problem We Still Can’t Solve, Quanta Magazine
https://www.quantamagazine.org/why-mathematicians-still-cant-solve-the-collatz-conjecture-20200922/

[13] Collatz Conjecture in Python, Stackoverflow
https://stackoverflow.com/questions/46505460/collatz-conjecture-in-python

[14] Collatz conjecture, Wikipedia
https://en.wikipedia.org/wiki/Collatz_conjecture#Iterating_on_real_or_complex_numbers

[15] Grouped Barplots in Python with Seaborn, GeeksforGeeks
https://www.geeksforgeeks.org/grouped-barplots-in-python-with-seaborn/


***


# END