# Exploratory Data Analysis
- Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods.
- EDA is primarily for seeing what the data can tell us beyond formal modeling, and in that respect it contrasts with traditional hypothesis testing.
- The concept of EDA was promoted by John Tukey, a world-famous statistician, from 1970 onward.
- The aim was to explore the data and formulate hypotheses, which can in turn guide new data collection and experiments.
- EDA differs from **initial data analysis (IDA)**, which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing, handling missing values, and transforming variables as needed.
- EDA encompasses IDA.


## Basic exploration

In [None]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# import warnings
# warnings.filterwarnings("ignore")

In [None]:
penguin_raw = pd.read_csv('penguins_lter_manipulated.csv') # read the dataset from a local CSV file

In [None]:
penguin = penguin_raw.copy()

In [None]:
penguin_raw.shape # exploring number of observations and variables

(344, 17)

We have 344 observations and 17 variables.

In [None]:
penguin_raw.head()

Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
0,PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,11-11-2007,,,181.0,3750,MALE,,,Not enough blood for isotopes.
1,PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,11-11-2007,39.5,17.4,186.0,3800,FEMALE,8.94956,-24.69454,
2,PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,11/16/07,40.3,18.0,195.0,3250,FEMALE,8.36821,-25.33302,
3,PAL0708,4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,11/16/07,,,,.,,,,Adult not sampled.
4,PAL0708,5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,11/16/07,36.7,19.3,193.0,3450,FEMALE,8.76651,-25.32426,


In [None]:
penguin.tail()

Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
339,PAL0910,120,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N38A2,No,12-01-2009,,,,.,,,,
340,PAL0910,121,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N39A1,Yes,11/22/09,46.8,14.3,215.0,4850,FEMALE,8.41151,-26.13832,
341,PAL0910,122,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N39A2,Yes,11/22/09,50.4,15.7,222.0,5750,MALE,8.30166,-26.04117,
342,PAL0910,123,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N43A1,Yes,11/22/09,45.2,14.8,212.0,.,FEMALE,8.24246,-26.11969,
343,PAL0910,124,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N43A2,Yes,11/22/09,49.9,16.1,213.0,5400,MALE,8.3639,-26.15531,


In [None]:
penguin.columns

Index(['studyName', 'Sample Number', 'Species', 'Region', 'Island', 'Stage',
       'Individual ID', 'Clutch Completion', 'Date Egg', 'Culmen Length (mm)',
       'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)', 'Sex',
       'Delta 15 N (o/oo)', 'Delta 13 C (o/oo)', 'Comments'],
      dtype='object')

In [None]:
penguin.dtypes

studyName               object
Sample Number            int64
Species                 object
Region                  object
Island                  object
Stage                   object
Individual ID           object
Clutch Completion       object
Date Egg                object
Culmen Length (mm)     float64
Culmen Depth (mm)      float64
Flipper Length (mm)    float64
Body Mass (g)           object
Sex                     object
Delta 15 N (o/oo)      float64
Delta 13 C (o/oo)      float64
Comments                object
dtype: object
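
`Body Mass (g)` is reported as `object` rather than a numeric dtype, and the previews above show a `.` placeholder in that column. One way to handle this, sketched here on a small toy frame rather than the real file, is `pd.to_numeric` with `errors="coerce"`. The name `toy` and the choice to treat `.` as missing are assumptions made for illustration.

```python
import pandas as pd

# Toy frame reusing a few "Body Mass (g)" entries from the previews above,
# including the "." placeholder that keeps the column from being numeric.
toy = pd.DataFrame({"Body Mass (g)": ["3750", "3800", ".", "4850"]})

# errors="coerce" replaces anything that cannot be parsed as a number with NaN.
toy["Body Mass (g)"] = pd.to_numeric(toy["Body Mass (g)"], errors="coerce")

print(toy.dtypes)                          # the column is now numeric
print(toy["Body Mass (g)"].isna().sum())   # number of entries coerced to NaN
```

Coercing silently discards whatever the placeholder was, so it is worth counting the resulting NaN values before relying on the converted column.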

In [None]:
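# Drop identifier, date, isotope, and free-text columns that are not used in the rest of this exploration
penguin = penguin.drop(['studyName', 'Sample Number', 'Stage', 'Region', 'Date Egg', 'Individual ID', 'Delta 15 N (o/oo)', 'Delta 13 C (o/oo)', 'Comments'], axis=1)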

In [None]:
penguin.nunique() # nunique() counts the distinct non-missing values in each column

Species                  3
Island                   3
Clutch Completion        2
Culmen Length (mm)     144
Culmen Depth (mm)       74
Flipper Length (mm)     55
Body Mass (g)           96
Sex                      3
dtype: int64
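
`Sex` reports three unique values where two might be expected, so it helps to list the labels and their frequencies. The sketch below does this with `value_counts(dropna=False)` on a toy column. The `.` entry is a hypothetical stand-in for an unexpected label; the actual third value in the dataset is not shown above.

```python
import pandas as pd

# Toy column; "." is a hypothetical unexpected label and None a missing entry.
sex = pd.Series(["MALE", "FEMALE", "FEMALE", ".", None, "MALE"], name="Sex")

# dropna=False keeps missing values as their own row in the tally.
counts = sex.value_counts(dropna=False)
print(counts)

# nunique() ignores missing values, so only the three labels are counted.
print(sex.nunique())
```

Running the same call on `penguin['Sex']` would show which label accounts for the third unique value.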

In [None]:
penguin.drop_duplicates() # returns a new DataFrame without duplicate rows; penguin itself is not modified

Unnamed: 0,Species,Island,Clutch Completion,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,,,181.0,3750,MALE
1,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,39.5,17.4,186.0,3800,FEMALE
2,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,40.3,18.0,195.0,3250,FEMALE
3,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,,,,.,
4,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,36.7,19.3,193.0,3450,FEMALE
...,...,...,...,...,...,...,...,...
339,Gentoo penguin (Pygoscelis papua),Biscoe,No,,,,.,
340,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,46.8,14.3,215.0,4850,FEMALE
341,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,50.4,15.7,222.0,5750,MALE
342,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,45.2,14.8,212.0,.,FEMALE
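
Before dropping duplicates, it can be useful to count them and to confirm that `drop_duplicates()` leaves the original frame untouched unless the result is assigned. A minimal sketch on a toy frame (the names `toy`, `n_dupes`, and `deduped`, and the shortened species labels, are illustrative only):

```python
import pandas as pd

# Toy frame in which the last row repeats the first one exactly.
toy = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Gentoo", "Adelie"],
    "Flipper Length (mm)": [186.0, 195.0, 215.0, 186.0],
})

n_dupes = toy.duplicated().sum()   # rows identical to an earlier row
deduped = toy.drop_duplicates()    # new frame; toy keeps all of its rows

print(n_dupes, len(toy), len(deduped))
```

To keep the de-duplicated data, the result has to be assigned back, for example `penguin = penguin.drop_duplicates()`.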


In [None]:
penguin.describe(include = 'all')

Unnamed: 0,Species,Island,Clutch Completion,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
count,344,344,344,288.0,235.0,337.0,344.0,334
unique,3,3,2,,,,96.0,3
top,Adelie Penguin (Pygoscelis adeliae),Biscoe,Yes,,,,3800.0,MALE
freq,152,168,308,,,,11.0,168
mean,,,,44.429514,17.195745,200.925816,,
std,,,,5.262364,2.003539,14.069888,,
min,,,,32.1,13.2,172.0,,
25%,,,,40.2,15.55,190.0,,
50%,,,,45.3,17.5,197.0,,
75%,,,,49.0,18.8,213.0,,
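
The `count` row above is lower than 344 for several columns, which points to missing values. A quick way to quantify them per column is `isna().sum()`, sketched here on a toy frame built from the first five rows of the preview (the names `toy`, `missing`, and `share` are illustrative):

```python
import numpy as np
import pandas as pd

# Measurement columns for the first five rows shown by head().
toy = pd.DataFrame({
    "Culmen Length (mm)": [np.nan, 39.5, 40.3, np.nan, 36.7],
    "Culmen Depth (mm)": [np.nan, 17.4, 18.0, np.nan, 19.3],
    "Flipper Length (mm)": [181.0, 186.0, 195.0, np.nan, 193.0],
})

missing = toy.isna().sum()   # number of missing values per column
share = toy.isna().mean()    # fraction of missing values per column

print(missing)
print(share)
```

Applied to `penguin`, the same two calls give the per-column gaps that the `count` row only hints at.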
