# **Data Cleaning Notebook**

## Objectives

* Evaluate and handle missing data
* Clean the data and split it into Train and Test sets

## Inputs

* outputs/datasets/raw/HospitalReadmissions.csv

## Outputs

* Generate cleaned Train and Test sets, both saved under outputs/datasets/cleaned 



---

# Change working directory

* The notebooks are stored in a subfolder, so the working directory must be changed before running the notebook in the editor

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
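
Note that re-running the `os.chdir` cell above moves up one more level each time. A minimal sketch of a guard against that, assuming the project root can be recognised by an `outputs` folder (the helper name `move_to_project_root` and its `marker` parameter are hypothetical and not part of this notebook):

```python
import os


def move_to_project_root(marker="outputs"):
    """Move to the parent folder only if the marker folder is not already visible.

    `marker` is an assumed convention: a folder that exists in the project root.
    """
    cwd = os.getcwd()
    if os.path.isdir(os.path.join(cwd, marker)):
        # Already in the project root; do nothing so the cell is safe to re-run
        return cwd
    parent = os.path.dirname(cwd)
    if os.path.isdir(os.path.join(parent, marker)):
        os.chdir(parent)
        return parent
    raise FileNotFoundError(f"No '{marker}' folder found in {cwd} or its parent")
```

Calling `move_to_project_root()` twice in a row leaves the working directory unchanged after the first call.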

# Load the dataset

In [None]:
import pandas as pd

data_path = 'outputs/datasets/raw/HospitalReadmissions.csv'

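df = pd.read_csv(data_path)

# Target variable used throughout the notebook
target_var = 'readmitted'

# Keep df.head() as the last expression so the notebook displays it
df.head()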

---

# Explore data

First, we check for missing values to decide how best to handle them.

In [None]:
# Boolean mask: keep only the columns with at least one null value
vars_with_missing_values = df.columns[df.isnull().sum() > 0].tolist()
vars_with_missing_values
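
To decide how to handle missing values, it usually helps to know how much is missing in each column, not only which columns are affected. A minimal sketch, using only standard pandas calls; the helper name `summarise_missing` and the toy column names are hypothetical and the data below is not the hospital dataset:

```python
import numpy as np
import pandas as pd


def summarise_missing(frame):
    """Return count and percentage of null values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only affected columns, most affected first
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_count", ascending=False)


# Toy data for illustration only
toy = pd.DataFrame({
    "age": [50, np.nan, 70, 80],
    "glucose_test": ["no", "no", None, None],
    "readmitted": ["yes", "no", "no", "yes"],
})
print(summarise_missing(toy))
```

On the real data, the same helper could be called as `summarise_missing(df)`.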

However, the previous notebook showed that the variable <strong>"medical_specialty"</strong> contains the placeholder value "Missing" in about half of its rows, and pandas does not count this string as a null value. We are going to drop this variable, since it neither contributes to nor correlates with <strong>"readmitted"</strong>, our target variable.

Below, we adapt the check above to list any variables that contain the value "Missing".

In [None]:
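# isin() matches exactly and is case-sensitive, so both spellings of the placeholder are checked
vars_with_MISSING = df.columns[df.isin(['Missing', 'MISSING']).any()].tolist()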
vars_with_MISSING
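
The claim that the placeholder makes up about half of the values can be checked per column. A minimal sketch, assuming the placeholder is stored exactly as the string "Missing"; the helper name `placeholder_share` and the toy column `n_procedures` are hypothetical:

```python
import pandas as pd


def placeholder_share(frame, placeholder="Missing"):
    """Percentage of rows per column that equal the placeholder (exact match)."""
    share = frame.isin([placeholder]).mean() * 100
    # Report only the columns where the placeholder actually occurs
    return share[share > 0].round(2)


# Toy data for illustration only
toy = pd.DataFrame({
    "medical_specialty": ["Missing", "Cardiology", "Missing", "Surgery"],
    "n_procedures": [1, 0, 2, 3],
})
print(placeholder_share(toy))
```

On the real data, `placeholder_share(df)` would give the share for each affected variable.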

---

# Split Train and Test set

In [None]:
from sklearn.model_selection import train_test_split

TrainSet, TestSet = train_test_split(
    df,
    test_size=0.2,
    random_state=0)

print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")
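
The split above keeps every column, including the variable that the text earlier proposes to drop. A minimal sketch of how that drop could be applied to both sets, shown on toy frames rather than on `TrainSet` and `TestSet`; the names `toy_train`, `toy_test` and `vars_to_drop` and the column `n_procedures` are hypothetical:

```python
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    "medical_specialty": ["Missing", "Cardiology"] * 5,
    "n_procedures": range(10),
    "readmitted": ["yes", "no"] * 5,
})

# Stand-ins for the Train and Test sets
toy_train = toy.iloc[:8]
toy_test = toy.iloc[8:]

vars_to_drop = ["medical_specialty"]

# errors="ignore" keeps the cell re-runnable if the column is already gone
toy_train = toy_train.drop(columns=vars_to_drop, errors="ignore")
toy_test = toy_test.drop(columns=vars_to_drop, errors="ignore")

print(f"Train columns: {toy_train.columns.tolist()}")
print(f"Test columns: {toy_test.columns.tolist()}")
```

Applying the same list of columns to both sets keeps their schemas identical, which later notebooks would rely on.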

---

# Push files to Repo

* Save the cleaned Train and Test sets under outputs/datasets/cleaned

In [None]:
import os

# exist_ok=True makes the cell safe to re-run when the folder already exists
os.makedirs('outputs/datasets/cleaned', exist_ok=True)

## Train set

In [None]:
TrainSet.to_csv('outputs/datasets/cleaned/TrainSetCleaned.csv', index=False)

## Test set

In [None]:
TestSet.to_csv('outputs/datasets/cleaned/TestSetCleaned.csv', index=False)

## Next Steps

In the next notebook, we will move on to feature engineering.