# Exploratory Data Analysis: Palmer Penguins
<img src='https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png' width=500>

Done by: Carlos M Mazzaroli

## Data comes from:

https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv

## About the data

Data were collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pallter.marine.rutgers.edu/), a member of the [Long Term Ecological Research Network](https://pallter.marine.rutgers.edu/).

## Attributes:

1. **species:** the penguin species (Adelie, Chinstrap or Gentoo)

1. **island:** island in Antarctica where each penguin was observed (Biscoe, Torgersen or Dream)

1. **bill_length_mm:** bill length measurement in millimeters

1. **bill_depth_mm:** bill depth measurement in millimeters

1. **flipper_length_mm:** flipper length measurement in millimeters

1. **body_mass_g:** body mass measurement in grams

1. **sex:** penguin sex (female or male)

1. **year:** year of study

## Initial configuration





### Install libraries

In [111]:
!pip install --upgrade pip
!pip install numpy pandas matplotlib seaborn scipy empiricaldist statsmodels scikit-learn pyjanitor



### Import libraries

In [112]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
import sklearn.metrics
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats as ss
import empiricaldist
import janitor

### Graph appearance

In [113]:
%matplotlib inline
sns.set_style(style='whitegrid')
sns.set_context(context='notebook')
plt.rcParams['figure.figsize'] = (11,9.4)

# Seaborn: one color per species, island and sex
penguin_color = {
    'Adelie': '#ff6602ff',
    'Gentoo': '#0f7175ff',
    'Chinstrap': '#c65dc9ff',
    'Torgersen': '#955FC8',
    'Biscoe': '#94e2c3',
    'Dream': '#345469',
    'Female': 'pink',
    'Male': 'skyblue',
}

# Matplotlib: the same colors as plain lists
pcolors = ['#ff6602ff', '#0f7175ff', '#c65dc9ff']
icolors = ['#955FC8', '#94e2c3', '#345469']
scolors = ['pink', 'skyblue']

plt_colors = [pcolors, icolors, scolors]

## Data validation

###  Load dataset

In [None]:
df = sns.load_dataset('penguins')
df

###  Dataset information

In [None]:
df.info()

### Variables from the dataset

1. **species:** (Categorical)
1. **island:** (Categorical)
1. **bill_length_mm:** (Numerical)
1. **bill_depth_mm:** (Numerical)
1. **flipper_length_mm:** (Numerical)
1. **body_mass_g:** (Numerical)
1. **sex:** (Categorical)
1. **year:** (Numerical) included in the source CSV, but not in the seaborn copy of the dataset loaded above

- **Numerical data**: 4
- **Categorical data**: 3

Shape of the dataset: 

- rows: 344
- cols: 7

In [None]:
category_columns = ['species','island','sex']
numeric_columns = df.select_dtypes(include=np.number).columns
penguin_columns = ['Adelie', 'Chinstrap', 'Gentoo' ]

## Data Cleaning

### Non-missing values

In [None]:
df.notnull().sum()

### Missing values

In [None]:
df.isnull().sum()

### Proportion of missing values

In [None]:
# Share of missing values per column (missing / total rows)
df.isnull().mean()
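
The share of missing values (missing / total) and the ratio of missing to present values (missing / present) are different quantities. A minimal sketch of the difference, assuming a made-up four-row frame (the `toy_missing` name and its values are illustrative, not the penguins data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one of four values is missing in each column.
toy_missing = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
    "sex": ["Male", "Female", None, "Female"],
})

# Share of missing values per column: missing / total rows.
missing_share = toy_missing.isnull().mean()

# Ratio of missing to non-missing values: missing / present.
missing_ratio = toy_missing.isnull().sum() / toy_missing.notnull().sum()

print(missing_share)  # 0.25 for both columns
print(missing_ratio)  # about 0.33 for both columns
```

With few missing values the two numbers are close, but only the first one is a proportion of the dataset.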

### Rows with missing values

In [None]:
# Flag every row that has at least one missing value
sex_null = df.isnull().any(axis=1)
sex_null
df[sex_null]

**Observation:**
Most missing values come from the sex variable; in addition, two penguins are missing all of their numeric measurements as well as their sex. A new data frame with the null values removed will be used for the rest of the study, and the rows with null values will be revisited later.
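
One way to separate the two kinds of incomplete rows is to count the missing values per row. The following is only a sketch on a made-up frame (the `toy_rows` name and its values are assumptions for illustration, and only measurement columns are included):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one complete row, one row missing everything,
# and one row missing only the sex.
toy_rows = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3],
    "body_mass_g": [3750.0, np.nan, 3250.0],
    "sex": ["Male", None, None],
})

# Number of missing values in each row.
missing_per_row = toy_rows.isnull().sum(axis=1)

# Rows where sex is the only missing value.
only_sex_missing = toy_rows["sex"].isnull() & (missing_per_row == 1)

# Rows where every column is missing.
all_missing = missing_per_row == toy_rows.shape[1]

print(toy_rows[only_sex_missing])
print(toy_rows[all_missing])
```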

In [None]:
df2 = df.dropna()
print(f'''
{df2.isna().any()}
''')

###  Convert data type

In [None]:
df = df.astype({'species': 'category','island': 'category','sex': 'category',}) 
df2 = df2.astype({'species': 'category','island': 'category','sex': 'category',}) 
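
A short sketch of what the `category` dtype provides, assuming a made-up column (the `toy_species` name and its values are illustrative only): the column stores a fixed set of labels that can be listed through the `.cat` accessor.

```python
import pandas as pd

# Hypothetical toy column with repeated text labels.
toy_species = pd.DataFrame({"species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"]})

# Same conversion pattern as above, applied to a single column.
toy_species = toy_species.astype({"species": "category"})

print(toy_species["species"].dtype)                 # category
print(list(toy_species["species"].cat.categories))  # ['Adelie', 'Chinstrap', 'Gentoo']
```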


In [None]:
df2.info()

# Data exploration: Univariate Analysis

## Numerical analysis

### Basic statistics of the numerical data

In [None]:
numerical_statistics = pd.concat([
    df2.describe(include=np.number).iloc[0:1],
    # Keep only the first mode so the table gets a single 'mode' row
    df2.mode(numeric_only=True).iloc[[0]].rename(index={0:'mode'}),
    pd.DataFrame(df2.median(numeric_only=True),columns=['median']).T,
    df2.describe(include=np.number).iloc[1:8],
    ])
numerical_statistics
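
Rows of a `describe()` table are indexed by label, so a single statistic can be read with `.loc[row, column]`. A minimal sketch on a made-up frame (the `toy_flippers` name and its five values are illustrative assumptions, not penguin measurements):

```python
import pandas as pd

# Hypothetical toy measurements.
toy_flippers = pd.DataFrame({"flipper_length_mm": [180.0, 190.0, 200.0, 210.0, 220.0]})

toy_stats = toy_flippers.describe()

# Label-based lookup: row label first, column label second.
q1 = toy_stats.loc["25%", "flipper_length_mm"]
q3 = toy_stats.loc["75%", "flipper_length_mm"]
mean = toy_stats.loc["mean", "flipper_length_mm"]

print(q1, q3, mean)  # 190.0 210.0 200.0
```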

###  Numerical statistics visualization

In [None]:
fig,ax = plt.subplots(1,4,figsize=(20,5))
for i,col in enumerate(numerical_statistics):
    sns.histplot(
        ax=ax[i],
        data=df2,
        x=col,
        bins=50,
        alpha=.55,
        color='#0f7175ff',
        kde=True,
        )
    # Recolor the KDE curve drawn by histplot
    ax[i].lines[0].set_color('#4c36f5')

    # Mark the first quartile, the third quartile and the mean
    ax[i].axvline(
        x=numerical_statistics.loc['25%', col],
        color='#f26a02',
        linestyle='dashed',
        linewidth=2.5,
        label='Q1'
    )
    ax[i].axvline(
        x=numerical_statistics.loc['75%', col],
        color='#bd00b0',
        linestyle='dashed',
        linewidth=2.5,
        label='Q3'
    )
    ax[i].axvline(
        x=numerical_statistics.loc['mean', col],
        color='#f75c6b',
        linestyle='dashed',
        linewidth=2.5,
        label='mean',
    )
    ax[i].legend()


	

### Partial conclusions

From the graphs, we can conclude:
- The variables **bill_length_mm, bill_depth_mm, flipper_length_mm** tend toward bimodal distributions.
- The variable **body_mass_g** tends toward a positively skewed distribution.
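
A visual impression of positive skew can be backed with a number. The sketch below uses a made-up sample (the `right_skewed` name and its values are assumptions, not the body mass data): a positive sample skewness and a mean above the median both point to a longer right tail.

```python
import pandas as pd

# Hypothetical right-skewed sample: most values are low, a few are high.
right_skewed = pd.Series([3.0, 3.1, 3.2, 3.3, 3.4, 3.6, 4.0, 4.8, 5.9])

skewness = right_skewed.skew()

print(skewness > 0)                                 # True: positive (right) skew
print(right_skewed.mean() > right_skewed.median())  # True: the tail pulls the mean up
```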

## Categorical analysis

### Basic statistics of the categorical data

In [None]:
df2.describe(include='category')

### Categorical variables count visualization

In [None]:
fig, ax = plt.subplots(1,3,figsize=(20,5))
for i,category in enumerate(category_columns):
    sns.histplot(
        ax=ax[i], 
        data=df2,
        y=category,
        hue=category,
        palette=penguin_color,
        alpha=0.6
    )
    

$
\begin{matrix}
\text{SPECIES}   & count &&&&  & \text{ISLAND} & count &&&& \text{SEX}  & count\\
Adelie           & 146   &&&&  & Biscoe        & 163   &&&& Male        & 168  \\
Gentoo           & 119   &&&&  & Dream         & 123   &&&& Female      & 165  \\
Chinstrap        & 68    &&&&  & Torgersen     & 47    &&&&                    \\
\end{matrix}
$

- The plots show that the species and island categories are unbalanced, whereas the sex category is balanced.

###  Categorical variables proportion visualization

In [None]:
fig, ax = plt.subplots(1,3,figsize=(20,5))
for i,category in enumerate(category_columns):
    sns.histplot(
        ax=ax[i],
        # Constant helper column so each panel shows a single stacked bar
        data=df2.assign(all_penguins=''),
        y='all_penguins',
        palette=penguin_color,
        multiple='fill',
        stat='count',
        hue=category,
        alpha=0.6
    )
    ax[i].set(ylabel=category, xlabel='proportion')

    


$
\begin{matrix}
\text{SPECIES}   & proportion  &&&&  & \text{ISLAND} & proportion &&&& \text{SEX}  & proportion\\
Adelie           & 43.84\%     &&&&  & Biscoe        & 48.95\%    &&&& Male        & 50.45\%     \\
Gentoo           & 35.74\%     &&&&  & Dream         & 36.94\%    &&&& Female      & 49.55\%     \\
Chinstrap        & 20.42\%     &&&&  & Torgersen     & 14.11\%    &&&& 
\end{matrix}
$



The Adelie and Gentoo species account for similar shares of the records, while Chinstraps are clearly fewer.

The same pattern appears across the islands: Biscoe Island has the most records, followed by Dream, with Torgersen having the fewest.
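
The species percentages can be recomputed from the counts reported earlier (146, 119 and 68). A minimal sketch, assuming those counts and rounding to one decimal; the `species_labels` and `species_shares` names are illustrative:

```python
import pandas as pd

# Counts taken from the species count table in this section.
counts = {"Adelie": 146, "Gentoo": 119, "Chinstrap": 68}

# Rebuild a label column with one entry per penguin.
species_labels = pd.Series(
    [name for name, n in counts.items() for _ in range(n)],
    name="species",
)

# normalize=True returns fractions instead of counts.
species_shares = species_labels.value_counts(normalize=True).mul(100).round(1)

print(len(species_labels))  # 333
print(species_shares)       # Adelie 43.8, Gentoo 35.7, Chinstrap 20.4
```

The same call on the `island` or `sex` column would give the other two columns of the table.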

###  Partial Conclusions

From the previous calculations, several things can be concluded:

- The proportions of species and islands follow a similar pattern: one large group, one intermediate group and one small group.
- More than 40% of the penguins are of the Adelie species.
- Nearly 50% of the penguins were recorded on Biscoe Island.

# Data exploration: Bivariate Analysis

## Numerical & Categorical analysis

Prepare filters for analysis


In [None]:
male = df2.sex == 'Male'
female = ~male

adelie = df2.species == 'Adelie'
chinstrap = df2.species == 'Chinstrap'
gentoo = df2.species == 'Gentoo'

torgersen = df2.island == 'Torgersen'
dream = df2.island == 'Dream'
biscoe = df2.island == 'Biscoe'

species = [adelie,chinstrap,gentoo]
islands = [torgersen,dream,biscoe]
sex = [male, female]
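
Boolean masks like these can be combined with `&` (and), `|` (or) and `~` (not). A minimal sketch on a made-up frame; the `toy_penguins`, `is_adelie`, `on_dream` and `is_male` names and the rows are assumptions chosen so they do not clash with the filters above:

```python
import pandas as pd

# Hypothetical toy frame with four penguins.
toy_penguins = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Dream", "Biscoe", "Dream"],
    "sex": ["Male", "Female", "Female", "Male"],
})

is_adelie = toy_penguins.species == "Adelie"
on_dream = toy_penguins.island == "Dream"
is_male = toy_penguins.sex == "Male"

# Rows that satisfy both conditions.
adelie_on_dream = toy_penguins[is_adelie & on_dream]

# Rows that satisfy at least one condition.
male_or_dream = toy_penguins[is_male | on_dream]

print(adelie_on_dream)
print(male_or_dream)
```

Negating a mask with `~` selects every remaining row, so `~male` only equals the female penguins when the `sex` column has no missing values, as is the case after `dropna()`.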

###  Analysis of penguin features

####  Penguin features statistics 

In [None]:
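# Aggregate only the numeric columns; observed=True skips species/island pairs with no penguins
df2.groupby(['species','island'], observed=True)[list(numeric_columns)].agg(['min','mean','max'])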
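
Grouping by categorical columns can produce one row per possible combination, including combinations that never occur. The sketch below, on a made-up frame (the `toy_groups` name and its values are illustrative assumptions), shows how `observed=True` keeps only the combinations present in the data:

```python
import pandas as pd

# Hypothetical toy frame; species and island are categorical.
toy_groups = pd.DataFrame({
    "species": pd.Categorical(["Adelie", "Adelie", "Gentoo"]),
    "island": pd.Categorical(["Dream", "Biscoe", "Biscoe"]),
    "body_mass_g": [3700.0, 3800.0, 5000.0],
})

# Only the three observed (species, island) pairs appear in the result.
observed_summary = (
    toy_groups
    .groupby(["species", "island"], observed=True)["body_mass_g"]
    .agg(["min", "mean", "max"])
)

print(observed_summary)
```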

#### Penguin features visualization:

In [None]:
fig,ax = plt.subplots(3,len(numeric_columns), figsize=(20,15))

for i, i_col in enumerate(category_columns):
    for j, j_col in enumerate(numeric_columns):
        sns.violinplot(
            # Split violins only make sense with two hue levels (sex)
            split=(i_col == 'sex'),
            ax=ax[i][j],
            data=df2,
            x='species',
            y=j_col,
            hue=i_col,
            palette=penguin_color,
            )
        # Title only the first row of plots
        if i == 0:
            ax[i][j].set_title(j_col)
        ax[i][j].set_xlabel(None)
        ax[i][j].set_ylabel(None)

#### Conclusions

**Adelie penguins:**
1. They are present on all three islands.
1. They are smaller than the Chinstrap and Gentoo species, except in bill depth.
1. Their flippers and bills (both length and depth) are larger on Torgersen Island than on the other islands.
1. They tend to be heavier on Biscoe Island.

**Chinstrap penguins:**
1. They are only found on Dream Island.
1. They have longer flippers and are heavier than Adelie penguins, but less so than Gentoo penguins.
1. Their bill length is similar to that of the Adelie penguin, but their bill is deeper than that of the Gentoo penguin.

**Gentoo penguins:**

1. They are only found on Biscoe Island.
1. They are heavier than the other species.
1. They have longer flippers than the other species.
1. They have the longest and, at the same time, the thinnest bills.

In all species, males tend to be larger than females. However, one female Chinstrap penguin had longer flippers than the males of the same species.

### Analysis of penguin distribution

#### General distribution

In [None]:
fig,ax = plt.subplots(3,len(numeric_columns), figsize=(20,15))

for i, i_col in enumerate(category_columns):
    for j, j_col in enumerate(numeric_columns):
        sns.histplot( 
            ax=ax[i][j],
            data=df2,
            x=j_col,
            hue=i_col,
            bins=40,
            kde=True,
            palette=penguin_color,
            )

#####  **Partial Conclusions**

**From the last graph, we can conclude that:**

1. When split by species, the distributions are the closest to normal.
1. Biscoe Island shows a bimodal trend.
1. Dream Island appears to follow a normal distribution, except for bill length, which tends to be bimodal.
1. Torgersen Island tends to follow a normal distribution.
1. When split by sex, the distributions tend to be bimodal.

**Data insights**

- Skew within a species may arise from the difference in values between the sexes of that species.
- Bimodal trends within an island may suggest the presence of more than one penguin species on that island.
- Bimodal tendencies within a sex are due to the different species.
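
The idea that pooling distinct groups produces a bimodal shape can be illustrated with synthetic numbers. This is only a sketch: the two groups below are randomly generated (not penguin measurements), and the centers 190 and 215 are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic values: two unimodal groups with different centers.
group_a = rng.normal(loc=190, scale=5, size=500)
group_b = rng.normal(loc=215, scale=5, size=500)
pooled = np.concatenate([group_a, group_b])

counts, edges = np.histogram(pooled, bins=20)

# Index of the bin that contains the midpoint between the two centers.
center_bin = np.searchsorted(edges, 202.5) - 1

# The pooled histogram dips between the two peaks.
has_dip = counts[center_bin] < counts.max() / 2
print(has_dip)
```

Each group alone is unimodal; the dip only appears once they are pooled, which is the pattern described for islands hosting several species and for sexes spanning several species.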


#### Individual species distribution

In [None]:
fig,ax = plt.subplots(len(numeric_columns),len(species), figsize=(15,13))
bins = 20
for i, i_col in enumerate(numeric_columns):
    for j, j_col in enumerate(category_columns): 
        sns.histplot( 
            ax=ax[i][j],
            data=df2[adelie],
            x=i_col,
            hue=j_col,
            multiple='layer',
            bins=30,
            kde=True,
            palette=penguin_color,
            )
            
        ax[i][j].set_ylabel(numeric_columns[i], labelpad=60,rotation=0) if j==0 else ax[i][j].set_ylabel(None)
        ax[i][j].set_xlabel(None)
fig.suptitle('Adelie species');
plt.subplots_adjust(top=0.95);


In [None]:
fig,ax = plt.subplots(len(numeric_columns),len(species), figsize=(15,13))
bins = 20
for i, i_col in enumerate(numeric_columns):
    for j, j_col in enumerate(category_columns): 
        sns.histplot( 
            ax=ax[i][j],
            data=df2[chinstrap],
            x=i_col,
            hue=j_col,
            multiple='layer',
            bins=30,
            kde=True,
            palette=penguin_color,
            )
            
        ax[i][j].set_ylabel(numeric_columns[i], labelpad=60,rotation=0) if j==0 else ax[i][j].set_ylabel(None)
        ax[i][j].set_xlabel(None)
        # ax[0][j].set_title(penguin_columns[j])
        # ax[i][j].get_legend().remove()
plt.suptitle('Chinstrap species');
plt.subplots_adjust(top=0.95);


In [None]:
fig,ax = plt.subplots(len(numeric_columns),len(species), figsize=(15,13))
bins = 20
for i, i_col in enumerate(numeric_columns):
    for j, j_col in enumerate(category_columns): 
        sns.histplot( 
            ax=ax[i][j],
            data=df2[gentoo],
            x=i_col,
            hue=j_col,
            multiple='layer',
            bins=30,
            kde=True,
            palette=penguin_color,
            )
            
        ax[i][j].set_ylabel(numeric_columns[i], labelpad=60,rotation=0) if j==0 else ax[i][j].set_ylabel(None)
        ax[i][j].set_xlabel(None)
        # ax[0][j].set_title(penguin_columns[j])
        # ax[i][j].get_legend().remove()
plt.suptitle('Gentoo species');
plt.subplots_adjust(top=0.95);

##### Conclusions

The following variables fit a normal distribution:

- Flipper length of the Gentoo and Chinstrap penguins
- Body mass of the Chinstrap penguins

All the other variables are skewed or tend to be bimodal because of sex.
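
A visual judgement of normality can be complemented with a formal test. The sketch below applies the Shapiro-Wilk test from `scipy.stats` to synthetic samples; this test is not part of the analysis above, and the samples are randomly generated assumptions rather than penguin data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic samples: one roughly normal, one made of two separated groups.
normal_sample = rng.normal(loc=200, scale=7, size=120)
bimodal_sample = np.concatenate([
    rng.normal(loc=190, scale=4, size=60),
    rng.normal(loc=215, scale=4, size=60),
])

p_values = {}
for name, values in [("normal", normal_sample), ("bimodal", bimodal_sample)]:
    statistic, p_value = stats.shapiro(values)
    p_values[name] = p_value
    # A small p-value (for example below 0.05) is evidence against normality.
    print(f"{name}: W={statistic:.3f}, p={p_value:.4f}")
```

The same call could be applied to each species and variable, for example to the flipper lengths of a single species, to check the conclusions drawn from the histograms.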