# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect the data for missing values and data types, then save it to outputs/datasets/collection

## Inputs

* Kaggle JSON file - the authentication token.
* Kaggle dataset URL - [Predicting Hospital Readmissions](https://www.kaggle.com/datasets/dubradave/hospital-readmissions)

## Outputs

* outputs/datasets/collection/HospitalReadmissions.csv

## Additional Comments

* No additional comments  


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

# Import Dataset from Kaggle

First, install the kaggle package to access the Kaggle API and download the raw dataset.

A valid account must be registered with Kaggle to obtain an API key (as a JSON file).

In [None]:
! pip install kaggle==1.5.12

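Make the Kaggle authentication token available for the session. Setting KAGGLE_CONFIG_DIR tells the Kaggle client which folder holds kaggle.json, and chmod 600 restricts the file to the current user.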

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
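
Before calling the API, it can help to confirm that the token file is present and well formed. The sketch below assumes the token is a JSON object with `username` and `key` fields, which is how Kaggle tokens are normally issued. The helper name `check_kaggle_token` is hypothetical and is not part of the kaggle package.

```python
import json
import os


def check_kaggle_token(config_dir):
    """Return True if config_dir holds a kaggle.json with a username and a key.

    Hypothetical helper: the field names are an assumption about the token layout.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False
    try:
        with open(token_path, "r", encoding="utf-8") as handle:
            token = json.load(handle)
    except (json.JSONDecodeError, OSError):
        # Unreadable or malformed file counts as an invalid token
        return False
    return isinstance(token, dict) and all(
        token.get(field) for field in ("username", "key")
    )
```

Calling `check_kaggle_token(os.getcwd())` in the notebook would flag a missing or malformed file before the download command fails with a less obvious error.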

Define the Kaggle dataset and the destination folder, then download the dataset.

In [None]:
KaggleDatasetPath = "dubradave/hospital-readmissions"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, then delete the zip file and the kaggle.json file.

In [None]:
import zipfile

# Unzip all .zip files in the destination folder
for file_name in os.listdir(DestinationFolder):
    if file_name.endswith('.zip'):
        file_path = os.path.join(DestinationFolder, file_name)
        with zipfile.ZipFile(file_path, 'r') as zip_ref:
            zip_ref.extractall(DestinationFolder)
        os.remove(file_path)  # Remove the zip file after extraction

# Remove the kaggle.json file if it exists
kaggle_json_path = 'kaggle.json'

if os.path.exists(kaggle_json_path):
    os.remove(kaggle_json_path)

print("Files unzipped, zip files and kaggle.json removed.")
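
After extraction, the name of the CSV inside the archive is not always obvious. As a minimal sketch, the hypothetical helper `list_csv_files` below lists the CSV files in a folder so the file name used in the loading step can be confirmed rather than assumed.

```python
import os


def list_csv_files(folder):
    """Return the sorted names of the .csv files directly inside folder."""
    return sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith(".csv")
    )
```

In this notebook the call would be `list_csv_files(DestinationFolder)`.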

---

# Load and Inspect the Kaggle Data

Using the pandas library, the dataset can be loaded as a dataframe and the data inspected.

In [None]:
import pandas as pd

df = pd.read_csv("inputs/datasets/raw/hospital_readmissions.csv")
df.head()

The command below shows the data type of each column and the size of the dataset.

In [None]:
df.info()

Then we check for missing values.

In [None]:
df.isnull().sum()
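
The two checks above (data types and null counts) can also be combined into a single table. This is only a sketch; the helper name `summarise_columns` is hypothetical, and it relies solely on standard pandas calls.

```python
import pandas as pd


def summarise_columns(frame):
    """Return one row per column with its dtype, null count and unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isnull().sum(),
        "n_unique": frame.nunique(),  # nunique ignores nulls by default
    })
```

Running `summarise_columns(df)` gives one row per variable, which makes object columns with very few unique values easy to spot.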

The output above shows no null values. However, the first rows of the dataframe show the value 'Missing' in the 'medical_specialty' column.

* To investigate this variable further, we run the cell below. Almost half of the rows are labelled 'Missing', so these are genuine missing values that were recorded with the label 'Missing' instead of being left empty.

In [None]:
df['medical_specialty'].value_counts()
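
Because isnull() does not count a text label such as 'Missing', the share of placeholder entries has to be measured separately. The sketch below shows one way to do that and to convert the label to a real null value. The helper names are hypothetical, and this notebook itself keeps the label unchanged.

```python
import pandas as pd


def placeholder_share(series, placeholder="Missing"):
    """Return the fraction of entries in series equal to the placeholder label."""
    return float((series == placeholder).mean())


def placeholder_to_nan(series, placeholder="Missing"):
    """Return a copy of series with the placeholder label replaced by NaN."""
    # mask() sets entries to NaN wherever the condition is True
    return series.mask(series == placeholder)
```

For example, `placeholder_share(df['medical_specialty'])` returns the proportion of rows labelled 'Missing'.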

We can also see that half of the columns have the object data type. Variables that only take the values 'yes' and 'no' can already be converted to numeric at this stage by mapping 'no' to 0 and 'yes' to 1.

* Below, we first check which object variables are categorical and which are boolean, and then we convert the boolean ones.

In [None]:
# loops through all columns in the dataframe and 
# prints the value counts for each column that is of type 'object'

for col in df.columns:
    if df[col].dtype == 'object':
        print(f"**{col}**:\n {df[col].value_counts()}\n\n")

In [None]:
# maps the boolean values to the corresponding numerical values

boolean_vars = ['change', 'diabetes_med', 'readmitted']

for var in boolean_vars:
    df[var] = df[var].map({'no': 0, 'yes': 1})
    print(f"**{var}**:\n {df[var].value_counts()}\n\n")

df.head(10)
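
One caveat with map(): any value that is not a key of the dictionary becomes NaN without warning. Running the mapping cell a second time, for instance, would turn every 0 and 1 into NaN. A minimal sketch of a stricter version follows; the helper name `map_yes_no` is hypothetical.

```python
import pandas as pd


def map_yes_no(series):
    """Map 'yes'/'no' strings to 1/0 and fail loudly on any other value."""
    mapped = series.map({"no": 0, "yes": 1})
    # Values that were present in the input but have no mapping
    unmapped = series[mapped.isnull() & series.notnull()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unexpected values: {list(unmapped)}")
    return mapped
```

Used as `df[var] = map_yes_no(df[var])` inside the loop, it raises an error instead of silently producing missing values.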

---

# Save the Dataset

The modified dataset is saved to the outputs directory.

In [None]:
import os

# Create the output folder; exist_ok avoids an error if it already exists
os.makedirs("outputs/datasets/collection", exist_ok=True)

df.to_csv("outputs/datasets/collection/HospitalReadmissions.csv", index=False)
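
As an optional check, the saved file can be read back to confirm that nothing was lost on the way to disk. The sketch below uses a hypothetical helper, `csv_round_trip_ok`, that writes a dataframe, reloads it and compares the shape and column names only.

```python
import pandas as pd


def csv_round_trip_ok(frame, path):
    """Write frame to path without the index, reload it and compare shape and columns."""
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    same_shape = reloaded.shape == frame.shape
    same_columns = list(reloaded.columns) == list(frame.columns)
    return same_shape and same_columns
```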

---

# Conclusions

In this notebook we have achieved the following:

* Successfully downloaded, unzipped and saved the dataset using the Kaggle API.
* Inspected the dataset for missing values and found no null values, although in the "medical_specialty" variable almost half of the rows were already labelled "Missing".
* Inspected the data types and mapped the "change", "diabetes_med" and "readmitted" variables to numerical values.
* Saved the dataset in the outputs directory.

### Next steps

In the next notebook we will begin an EDA using Pandas profiling and correlation studies, and start addressing business requirement 1.

This will take us to the 'Data Understanding' phase of the CRISP-DM workflow.