# Hackathon 1: Python, ETL and Visualisation

## Objective

* Investigate and identify all outliers within the dataset.
* Transform the dataset so it's ready for feature engineering.

### Import Packages

In [2]:
import pandas as pd
import numpy as np

The healthcare insurance dataset is read from the cleaned CSV file and loaded into a DataFrame so that outliers can be investigated.

In [4]:
df = pd.read_csv('../data/inputs/cleaned_data.csv')
df

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1332,50,male,30.970,3,no,northwest,10600.54830
1333,18,female,31.920,0,no,northeast,2205.98080
1334,18,female,36.850,0,no,southeast,1629.83350
1335,21,female,25.800,0,no,southwest,2007.94500


A random sample of four rows is displayed for closer inspection; the fixed `random_state` ensures the same rows are returned on every run.

In [5]:
df.sample(n=4, random_state=28)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1195,19,female,30.02,0,yes,northwest,33307.5508
91,53,female,24.795,1,no,northwest,10942.13205
217,27,male,23.1,0,no,southeast,2483.736
470,27,male,32.67,0,no,southeast,2497.0383


The data type of each column in the loaded dataset is checked to confirm it was inferred correctly.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1337 entries, 0 to 1336
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1337 non-null   int64  
 1   sex       1337 non-null   object 
 2   bmi       1337 non-null   float64
 3   children  1337 non-null   int64  
 4   smoker    1337 non-null   object 
 5   region    1337 non-null   object 
 6   charges   1337 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.2+ KB


In [7]:
df.dtypes

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object
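
The manual dtype inspection above can also be expressed as an automated check. The following is a minimal sketch, assuming a small toy DataFrame that mirrors the column names reported by `df.info()`; the values and the `expected_checks` mapping are illustrative and not part of the original notebook.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype, is_string_dtype

# Toy frame with the same column names as the dataset; the values are made up.
toy = pd.DataFrame({
    "age": [19, 18],
    "sex": ["female", "male"],
    "bmi": [27.9, 33.77],
    "children": [0, 1],
    "smoker": ["yes", "no"],
    "region": ["southwest", "southeast"],
    "charges": [16884.924, 1725.5523],
})

# Hypothetical mapping from column name to the dtype test it should pass.
expected_checks = {
    "age": is_integer_dtype,
    "sex": is_string_dtype,
    "bmi": is_float_dtype,
    "children": is_integer_dtype,
    "smoker": is_string_dtype,
    "region": is_string_dtype,
    "charges": is_float_dtype,
}

# Collect every column whose dtype does not pass its expected test.
mismatches = [col for col, check in expected_checks.items() if not check(toy[col])]
print(mismatches)
```

An empty list means every column has the expected kind of dtype; any column name that appears would need converting before feature engineering.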

Checking that all duplicates were successfully removed in the previous cleaning stage.

In [8]:
df.duplicated().sum()

0
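
To illustrate what this check counts, here is a small sketch on a toy DataFrame (the values are made up and not taken from the dataset). `duplicated()` marks each row that exactly repeats an earlier row, so the sum is the number of redundant rows.

```python
import pandas as pd

# Toy frame in which the third row repeats the first one exactly.
toy = pd.DataFrame({
    "age": [19, 18, 19],
    "bmi": [27.9, 33.77, 27.9],
})

# Count rows that duplicate an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Drop the repeats and confirm the check now returns zero.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
n_remaining = int(deduplicated.duplicated().sum())

print(n_duplicates, n_remaining)
```

A result of `0` on the real dataset, as shown above, therefore indicates that no row is an exact copy of another.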

The Matplotlib and Seaborn libraries are now imported to visualise the dataset and investigate any potential outliers.

In [9]:
import matplotlib.pyplot as plt
import seaborn as sns
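
The notebook does not state at this point which outlier rule will be used. As one possible approach, the sketch below applies Tukey's 1.5 × IQR rule, the same convention that box plots commonly use for their whiskers. The helper `iqr_outlier_mask` and the toy values are hypothetical and serve only as an illustration.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k * IQR, Q3 + k * IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (values < lower) | (values > upper)


# Made-up values with one obviously extreme entry.
toy_charges = pd.Series([1700.0, 2000.0, 2500.0, 3900.0, 4400.0, 60000.0])

mask = iqr_outlier_mask(toy_charges)
outliers = toy_charges[mask]
print(outliers)
```

On the real data the same mask could be computed for a numeric column such as `charges` or `bmi`, and the flagged rows compared with the points drawn beyond the whiskers of a box plot.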