#### **Imports**

In [2]:
import pandas as pd

##### **Clean-up**

Running `.head()` on the raw file showed a leftover index column, 'Unnamed: 0', so it is dropped right after loading.

In [3]:
df = pd.read_csv('../data/penguins.csv')
df.drop(columns='Unnamed: 0', inplace=True)
df.head()

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
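
A column named 'Unnamed: 0' usually means the CSV was saved together with its row index. Instead of dropping it after loading, the first column can be read as the index. The sketch below uses a small made-up in-memory CSV (`sample_csv` is hypothetical, not the real penguins file) to show the idea:

```python
import io

import pandas as pd

# Made-up sample that mimics a CSV saved with its index: the empty
# leading header field is what pandas would label 'Unnamed: 0'.
sample_csv = io.StringIO(
    ",species,island\n"
    "0,Adelie,Torgersen\n"
    "1,Adelie,Torgersen\n"
)

# index_col=0 treats the first column as the row index instead of data.
sample = pd.read_csv(sample_csv, index_col=0)
print(sample.columns.tolist())
```

Either approach should give the same columns; dropping the column afterwards, as done above, simply makes the clean-up step more visible.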


Let's count the missing values per column, then drop every row that contains a null.

In [4]:
print('Missing Values')
print(df.isnull().sum())
df.dropna(inplace=True)
print(f'After Clean, the dataset has {df.shape[0]} rows and {df.shape[1]} columns.')

Missing Values
species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
year                  0
dtype: int64
After Clean, the dataset has 333 rows and 8 columns.
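
Called without arguments, `dropna()` removes a row if any of its columns is null, so the per-column counts above do not simply add up to the number of rows removed. A minimal sketch on a made-up frame (`toy` is hypothetical, not the penguins data) counts the affected rows before dropping them:

```python
import numpy as np
import pandas as pd

# Made-up frame: row 1 is null in both columns, row 2 only in 'sex'.
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3],
    "sex": ["male", None, None],
})

# A row counts once, no matter how many of its columns are null.
rows_with_nulls = int(toy.isnull().any(axis=1).sum())
cleaned = toy.dropna()
print(f"{rows_with_nulls} of {len(toy)} rows contain a null; {len(cleaned)} remain.")
```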


In [5]:
print('Basic Information:')
# df.info() prints its report itself and returns None, so it is not wrapped in print()
df.info()
print('\nStatistical Summary:')
display(df.describe())

Basic Information:
<class 'pandas.core.frame.DataFrame'>
Index: 333 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
 7   year               333 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 23.4+ KB

Statistical Summary:


,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,year
count,333.0,333.0,333.0,333.0,333.0
mean,43.992793,17.164865,200.966967,4207.057057,2008.042042
std,5.468668,1.969235,14.015765,805.215802,0.812944
min,32.1,13.1,172.0,2700.0,2007.0
25%,39.5,15.6,190.0,3550.0,2007.0
50%,44.5,17.3,197.0,4050.0,2008.0
75%,48.6,18.7,213.0,4775.0,2009.0
max,59.6,21.5,231.0,6300.0,2009.0


The `year` column was read as an integer (`int64`). Let's convert it to a datetime!

In [6]:
df['year'] = pd.to_datetime(df['year'], format='%Y')
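
To see what this conversion produces, here is a small sketch on a made-up Series (`years` is hypothetical). With `format='%Y'` only the year is given, so each value becomes January 1st of that year. The cast to `str` is a defensive assumption added for this sketch and is not part of the cell above:

```python
import pandas as pd

# Made-up integer years, like the 'year' column before conversion.
years = pd.Series([2007, 2008, 2009], name="year")

# Cast to str so '%Y' parses the digits as text (assumption for this sketch).
as_dates = pd.to_datetime(years.astype(str), format="%Y")
print(as_dates.iloc[0])

# The plain integer year can still be recovered when needed.
print(as_dates.dt.year.tolist())
```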

#### **Summary**

- The data needed only light cleaning.
- Removed the leftover index column 'Unnamed: 0'.
- Removed rows with null values (333 rows remain).
- Converted the 'year' column from integer to datetime.

In [7]:
df.to_csv('../data/clean_penguins.csv', index=False)
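
Two things are worth keeping in mind when saving: `index=False` keeps the row index out of the file, so no new 'Unnamed: 0' column appears on the next load, and CSV stores dates as plain text, so the datetime dtype is not kept unless the column is parsed again. A minimal sketch with a made-up one-row frame and an in-memory buffer (both hypothetical, standing in for the real file):

```python
import io

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Made-up frame with a datetime 'year' column, like the cleaned data.
toy = pd.DataFrame({
    "species": ["Adelie"],
    "year": pd.to_datetime(["2007"], format="%Y"),
})

# Write to an in-memory buffer instead of a real file.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read leaves 'year' as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates asks pandas to convert the column back to datetime.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["year"])

print(plain["year"].dtype, parsed["year"].dtype)
```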