# 6.1: Sourcing Open Data

### This script contains the following points:

* 1 - Importing Libraries
* 2 - Importing Data
* 3 - Data Wrangling
    * I) Changing Data Types
    * II) Check for Mixed Types
* 4 - Data Consistency
    * I) Check Numerical Variables
    * II) Looking for Missing Values (NaN)
    * III) Looking for Duplicate Records
* 5 - Exporting Data

---

## 1. Importing Libraries

In [1]:
# Import Libraries

import pandas as pd
import numpy as np
import os
import datetime as dt

---

## 2. Importing Data

In [2]:
# Define path

path = r'/Users/juanigalvalisi/Desktop/Data Analyst/Achievement 6/'

In [3]:
# Import .CSV

df_medical_costs_raw = pd.read_csv(os.path.join(path, 'Medical Cost Personal Datasets.csv'))

In [4]:
# Check the output of df_medical_costs_raw

df_medical_costs_raw.head()

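|  | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |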


In [5]:
# Check the dimensions (rows, columns) of df_medical_costs_raw

df_medical_costs_raw.shape

(1338, 7)

In [6]:
# Check the column names, non-null counts, and data types of df_medical_costs_raw

df_medical_costs_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


---

## 3. Data Wrangling

### I) Changing Data Types

In [7]:
# Check all the data types of df_medical_costs_raw

df_medical_costs_raw.dtypes

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object
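
The heading above mentions changing data types, although the cell only inspects them. As a minimal sketch, assuming one wanted to store the three text columns as the pandas `category` dtype (a step this notebook does not perform), the conversion could look like the following. The `df_sample` frame and the `categorical_cols` list are hypothetical and only mirror the column names shown above.

```python
import pandas as pd

# Hypothetical sample that mirrors some of the dataset's columns (not the real file)
df_sample = pd.DataFrame({
    'age': [19, 18, 28],
    'sex': ['female', 'male', 'male'],
    'smoker': ['yes', 'no', 'no'],
    'region': ['southwest', 'southeast', 'southeast'],
})

# Assumed list of text columns to convert
categorical_cols = ['sex', 'smoker', 'region']

# astype with a dict converts only the listed columns and returns a new frame
df_sample = df_sample.astype({col: 'category' for col in categorical_cols})

print(df_sample.dtypes)
```

Whether this conversion is worthwhile depends on the later analysis; with only 1,338 rows the memory saving is small, so leaving the columns as `object` is also a reasonable choice.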

### II) Check for Mixed Types

In [8]:
# Check for mixed types

for col in df_medical_costs_raw.columns:
    # Flag rows whose Python type differs from the type of the column's first value
    # (Series.map is used because DataFrame.applymap is deprecated in recent pandas versions)
    mixed = df_medical_costs_raw[col].map(type) != type(df_medical_costs_raw[col].iloc[0])
    if mixed.any():
        print(col)

> #### The loop prints no column names, so there are no mixed-type columns.
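
To see what the check reports when a column does contain mixed types, here is a small sketch on made-up data. The helper name `find_mixed_type_columns` and the `df_demo` frame are hypothetical; the idea is simply to count the distinct Python types per column instead of comparing against the first row.

```python
import pandas as pd


def find_mixed_type_columns(df):
    """Return the names of columns whose values do not all share one Python type."""
    mixed_cols = []
    for col in df.columns:
        # More than one distinct type in a column means it is mixed
        if df[col].map(type).nunique() > 1:
            mixed_cols.append(col)
    return mixed_cols


# Made-up frame: 'region' deliberately mixes strings with an integer
df_demo = pd.DataFrame({
    'age': [19, 18, 28],
    'region': ['southwest', 5, 'southeast'],
})

print(find_mixed_type_columns(df_demo))
```

Returning a list instead of printing inside the loop makes the result easy to reuse, for example to cast the flagged columns to `str` in a follow-up step.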

---

## 4. Data Consistency

### I) Check Numerical Variables

In [9]:
# Check df_medical_costs_raw numerical variables

df_medical_costs_raw.describe()

|  | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


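> #### All numerical variables fall within plausible ranges (age 18 to 64, BMI 15.96 to 53.13, children 0 to 5, charges 1,121.87 to 63,770.43), so the dataframe appears to be consistent.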
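
The visual check of `describe()` can be complemented with explicit range checks. The sketch below is one possible way to do that; the `bounds` dictionary holds hypothetical plausibility limits chosen only for illustration, and `df_sample` reuses the first three rows shown in the `head()` output rather than the full file.

```python
import pandas as pd

# Hypothetical plausibility limits (inclusive); adjust them to the domain at hand
bounds = {
    'age': (0, 120),
    'bmi': (10, 70),
    'children': (0, 20),
    'charges': (0, float('inf')),
}

# First three rows from the head() output above
df_sample = pd.DataFrame({
    'age': [19, 18, 28],
    'bmi': [27.9, 33.77, 33.0],
    'children': [0, 1, 3],
    'charges': [16884.924, 1725.5523, 4449.462],
})

# Count, per column, the values that fall outside the allowed interval
out_of_range = {
    col: int((~df_sample[col].between(low, high)).sum())
    for col, (low, high) in bounds.items()
}

print(out_of_range)
```

A count of zero for every column supports the conclusion drawn from the summary statistics; any non-zero count points to rows worth inspecting.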

### II) Looking for Missing Values (NaN)

In [10]:
# Look for missing values (NaN)

df_medical_costs_raw.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

> #### There are no missing values.
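
This dataset has no missing values, so nothing further is needed here. For reference, the sketch below shows how the same check could be extended when missing values are present; `df_demo` is made-up data with two deliberately missing entries.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in each column
df_demo = pd.DataFrame({
    'bmi': [27.9, np.nan, 33.0, 22.705],
    'smoker': ['yes', 'no', None, 'no'],
})

# Count and percentage of missing values per column
missing_summary = pd.DataFrame({
    'n_missing': df_demo.isnull().sum(),
    'pct_missing': df_demo.isnull().mean() * 100,
})
print(missing_summary)

# Rows that contain at least one missing value
rows_with_missing = df_demo[df_demo.isnull().any(axis=1)]
print(rows_with_missing)
```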

### III) Looking for Duplicate Records

In [11]:
# Look for duplicates

df_medical_costs_raw_dups = df_medical_costs_raw[df_medical_costs_raw.duplicated()]

In [12]:
# Check the output

df_medical_costs_raw_dups.head()

|  | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 581 | 19 | male | 30.59 | 0 | no | northwest | 1639.5631 |


In [13]:
# Drop the duplicate record (the first occurrence is kept)

df_medical_costs_clean = df_medical_costs_raw.drop_duplicates()

In [14]:
# Check the output I

df_medical_costs_clean.shape

(1337, 7)

In [15]:
# Check the output II

df_medical_costs_clean.head()

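|  | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |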


> #### One duplicate record was removed, leaving 1,337 rows.
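
By default `duplicated()` flags only the later copies, which is why a single row (index 581) appears in the output above. To inspect a duplicate next to the row it repeats, `keep=False` can be passed. The sketch below uses made-up rows in `df_demo` to illustrate the difference.

```python
import pandas as pd

# Made-up frame in which rows 0 and 2 are identical
df_demo = pd.DataFrame({
    'age': [19, 18, 19],
    'sex': ['male', 'male', 'male'],
    'charges': [1639.5631, 1725.5523, 1639.5631],
})

# keep=False marks every member of a duplicate group, not just the later copies
all_copies = df_demo[df_demo.duplicated(keep=False)]
print(all_copies)

# drop_duplicates keeps the first occurrence by default
df_demo_clean = df_demo.drop_duplicates()
print(df_demo_clean.shape)
```

Seeing both copies side by side helps decide whether the repeat is a data-entry duplicate or two distinct people who happen to share the same values.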

---

## 5. Exporting Data

In [16]:
# Export df_medical_costs_clean as .pkl

df_medical_costs_clean.to_pickle(os.path.join(path, 'Medical Cost Personal.pkl'))
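
As an optional sanity check after exporting, the pickle can be read back and compared with the dataframe in memory. The sketch below is self-contained and writes to a temporary directory; `df_demo` and the file name `demo.pkl` are hypothetical stand-ins for the cleaned dataframe and its export path.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the cleaned dataframe
df_demo = pd.DataFrame({'age': [19, 18], 'charges': [16884.924, 1725.5523]})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'demo.pkl')
    df_demo.to_pickle(pkl_path)
    # Read the file back while the temporary directory still exists
    df_reloaded = pd.read_pickle(pkl_path)

# equals() compares shape, values, and dtypes
round_trip_ok = df_demo.equals(df_reloaded)
print(round_trip_ok)
```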