# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect the data for missing values and data types and save it to outputs/datasets/collection

## Inputs

* Kaggle JSON file - the authentication token.
* Kaggle dataset URL - [Predicting Hospital Readmissions](https://www.kaggle.com/datasets/dubradave/hospital-readmissions)

## Outputs

* outputs/datasets/collection/HospitalReadmissions.csv

## Additional Comments

* No additional comments  


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\Andrias\\Desktop\\patient-readmission\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\Andrias\\Desktop\\patient-readmission'
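As a quick illustration of the step above, the sketch below shows what `os.path.dirname()` returns for a notebook folder. The path is a hypothetical example, not the real project path, and `os.path.join` keeps the example independent of the operating system's path separator.

```python
import os

# Hypothetical project layout, used only for illustration
notebook_dir = os.path.join("patient-readmission", "jupyter_notebooks")

# os.path.dirname() strips the last path component, giving the parent folder
parent_dir = os.path.dirname(notebook_dir)
print(parent_dir)
```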

---

# Import Dataset from Kaggle

First, install the kaggle package to access the Kaggle API and download the raw dataset.

A registered Kaggle account is required to obtain an API token (a JSON file).

In [None]:
! pip install kaggle==1.5.12

Make the kaggle authentication token available for the session.

In [4]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

'chmod' is not recognized as an internal or external command,
operable program or batch file.
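The error above appears because `chmod` is a Unix shell command that is not available in the Windows command prompt. A possible alternative, sketched below, is Python's `os.chmod()`, which runs on both platforms. Note that on Windows it can only set or clear the read-only flag, so the other permission bits are ignored there. The helper name `restrict_token_permissions` is an assumption made for this example and is not part of the original notebook.

```python
import os
import stat


def restrict_token_permissions(token_path="kaggle.json"):
    """Restrict the token file to owner read/write (the equivalent of chmod 600).

    Returns True if the file was found and updated, False otherwise.
    """
    if not os.path.exists(token_path):
        return False
    # S_IRUSR | S_IWUSR == 0o600; on Windows only the read-only flag is honoured
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return True


restrict_token_permissions()
```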


Define the kaggle dataset and the destination folder and then download it. 

In [6]:
KaggleDatasetPath = "dubradave/hospital-readmissions"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading hospital-readmissions.zip to inputs/datasets/raw




  0%|          | 0.00/286k [00:00<?, ?B/s]
100%|██████████| 286k/286k [00:00<00:00, 762kB/s]
100%|██████████| 286k/286k [00:00<00:00, 758kB/s]


Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [7]:
import zipfile

# Unzip all .zip files in the destination folder
for file_name in os.listdir(DestinationFolder):
    if file_name.endswith('.zip'):
        file_path = os.path.join(DestinationFolder, file_name)
        with zipfile.ZipFile(file_path, 'r') as zip_ref:
            zip_ref.extractall(DestinationFolder)
        os.remove(file_path)  # Remove the zip file after extraction

# Remove the kaggle.json file if it exists
kaggle_json_path = 'kaggle.json'

if os.path.exists(kaggle_json_path):
    os.remove(kaggle_json_path)

print("Files unzipped, zip files and kaggle.json removed.")

Files unzipped, zip files and kaggle.json removed.


---

# Load and Inspect the Kaggle Data

Using the pandas library, the dataset can be loaded as a dataframe and the data inspected.

In [8]:
import pandas as pd

df = pd.read_csv("inputs/datasets/raw/hospital_readmissions.csv")
df.head()

Unnamed: 0,age,time_in_hospital,n_lab_procedures,n_procedures,n_medications,n_outpatient,n_inpatient,n_emergency,medical_specialty,diag_1,diag_2,diag_3,glucose_test,A1Ctest,change,diabetes_med,readmitted
0,[70-80),8,72,1,18,2,0,0,Missing,Circulatory,Respiratory,Other,no,no,no,yes,no
1,[70-80),3,34,2,13,0,0,0,Other,Other,Other,Other,no,no,no,yes,no
2,[50-60),5,45,0,18,0,0,0,Missing,Circulatory,Circulatory,Circulatory,no,no,yes,yes,yes
3,[70-80),2,36,0,12,1,0,0,Missing,Circulatory,Other,Diabetes,no,no,yes,yes,yes
4,[60-70),1,42,0,7,0,0,0,InternalMedicine,Other,Circulatory,Respiratory,no,no,no,yes,no


Running the command below shows the data type of each column and the size of the dataset.

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   age                25000 non-null  object
 1   time_in_hospital   25000 non-null  int64 
 2   n_lab_procedures   25000 non-null  int64 
 3   n_procedures       25000 non-null  int64 
 4   n_medications      25000 non-null  int64 
 5   n_outpatient       25000 non-null  int64 
 6   n_inpatient        25000 non-null  int64 
 7   n_emergency        25000 non-null  int64 
 8   medical_specialty  25000 non-null  object
 9   diag_1             25000 non-null  object
 10  diag_2             25000 non-null  object
 11  diag_3             25000 non-null  object
 12  glucose_test       25000 non-null  object
 13  A1Ctest            25000 non-null  object
 14  change             25000 non-null  object
 15  diabetes_med       25000 non-null  object
 16  readmitted         25000 non-null  object

Then we check for missing values.

In [10]:
df.isnull().sum()

age                  0
time_in_hospital     0
n_lab_procedures     0
n_procedures         0
n_medications        0
n_outpatient         0
n_inpatient          0
n_emergency          0
medical_specialty    0
diag_1               0
diag_2               0
diag_3               0
glucose_test         0
A1Ctest              0
change               0
diabetes_med         0
readmitted           0
dtype: int64

The output above shows no null values. However, the DataFrame preview shows that 'medical_specialty' contains the value 'Missing'.

* To investigate this variable further, we run the cell below and see that almost half of the rows are labelled 'Missing'. These are genuine missing values that were recorded with the label 'Missing' rather than left empty.

In [11]:
df['medical_specialty'].value_counts()

medical_specialty
Missing                   12382
InternalMedicine           3565
Other                      2664
Emergency/Trauma           1885
Family/GeneralPractice     1882
Cardiology                 1409
Surgery                    1213
Name: count, dtype: int64
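Because `isnull()` only detects real null values, placeholder labels such as 'Missing' have to be counted explicitly. The sketch below shows one way to do this on a small made-up DataFrame (the `sample` data is illustrative and not taken from the dataset); the same expression could be applied to `df`.

```python
import pandas as pd

# Small illustrative DataFrame, not the real dataset
sample = pd.DataFrame({
    "medical_specialty": ["Missing", "Other", "Missing", "Cardiology"],
    "diag_1": ["Circulatory", "Missing", "Other", "Other"],
    "n_procedures": [1, 2, 0, 0],
})

# isnull() finds nothing here, because 'Missing' is an ordinary string
null_counts = sample.isnull().sum()

# Count the 'Missing' placeholder per column instead
missing_label_counts = (sample == "Missing").sum()
print(missing_label_counts)
```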

We can also see that 10 of the 17 columns are of type object. We would like to convert every boolean variable to a numeric one at this stage, mapping 'no' to 0 and 'yes' to 1.

* Below, we first check which variables are categorical and which are boolean, and then we convert the boolean ones.

In [12]:
# loops through all columns in the dataframe and 
# prints the value counts for each column that is of type 'object'

for col in df.columns:
    if df[col].dtype == 'object':
        print(f"**{col}**:\n {df[col].value_counts()}\n\n")

**age**:
 age
[70-80)     6837
[60-70)     5913
[80-90)     4516
[50-60)     4452
[40-50)     2532
[90-100)     750
Name: count, dtype: int64


**medical_specialty**:
 medical_specialty
Missing                   12382
InternalMedicine           3565
Other                      2664
Emergency/Trauma           1885
Family/GeneralPractice     1882
Cardiology                 1409
Surgery                    1213
Name: count, dtype: int64


**diag_1**:
 diag_1
Circulatory        7824
Other              6498
Respiratory        3680
Digestive          2329
Diabetes           1747
Injury             1666
Musculoskeletal    1252
Missing               4
Name: count, dtype: int64


**diag_2**:
 diag_2
Other              9056
Circulatory        8134
Diabetes           2906
Respiratory        2872
Digestive           973
Injury              591
Musculoskeletal     426
Missing              42
Name: count, dtype: int64


**diag_3**:
 diag_3
Other              9107
Circulatory        7686
Diabetes      

In [14]:
# maps the boolean values to the corresponding numerical values

boolean_vars = ['change', 'diabetes_med', 'readmitted']

for var in boolean_vars:
    df[var] = df[var].map({'no': 0, 'yes': 1})
    print(f"**{var}**:\n {df[var].value_counts()}\n\n")

df.head(10)

**change**:
 change
0    13497
1    11503
Name: count, dtype: int64


**diabetes_med**:
 diabetes_med
1    19228
0     5772
Name: count, dtype: int64


**readmitted**:
 readmitted
0    13246
1    11754
Name: count, dtype: int64




Unnamed: 0,age,time_in_hospital,n_lab_procedures,n_procedures,n_medications,n_outpatient,n_inpatient,n_emergency,medical_specialty,diag_1,diag_2,diag_3,glucose_test,A1Ctest,change,diabetes_med,readmitted
0,[70-80),8,72,1,18,2,0,0,Missing,Circulatory,Respiratory,Other,no,no,0,1,0
1,[70-80),3,34,2,13,0,0,0,Other,Other,Other,Other,no,no,0,1,0
2,[50-60),5,45,0,18,0,0,0,Missing,Circulatory,Circulatory,Circulatory,no,no,1,1,1
3,[70-80),2,36,0,12,1,0,0,Missing,Circulatory,Other,Diabetes,no,no,1,1,1
4,[60-70),1,42,0,7,0,0,0,InternalMedicine,Other,Circulatory,Respiratory,no,no,0,1,0
5,[40-50),2,51,0,10,0,0,0,Missing,Other,Other,Other,no,no,0,0,1
6,[50-60),4,44,2,21,0,0,0,Missing,Injury,Other,Other,no,normal,1,1,0
7,[60-70),1,19,6,16,0,0,1,Other,Circulatory,Other,Other,no,no,0,1,1
8,[80-90),4,67,3,13,0,0,0,InternalMedicine,Digestive,Other,Other,no,no,0,0,1
9,[70-80),8,37,1,18,0,0,0,Family/GeneralPractice,Respiratory,Respiratory,Other,no,no,1,1,0
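The list `boolean_vars` above was chosen by reading the value counts. As a hedged alternative, the yes/no columns could be detected programmatically by checking which columns contain only the values 'yes' and 'no'. The sketch below uses a small made-up DataFrame and a hypothetical helper, `find_yes_no_columns`, which is not part of the original notebook.

```python
import pandas as pd


def find_yes_no_columns(frame):
    """Return names of columns whose non-null values are all 'yes' or 'no'."""
    return [
        col for col in frame.columns
        if set(frame[col].dropna().unique()) <= {"yes", "no"}
    ]


# Small illustrative DataFrame, not the real dataset
sample = pd.DataFrame({
    "A1Ctest": ["no", "normal", "no"],
    "change": ["no", "yes", "no"],
    "readmitted": ["yes", "no", "yes"],
    "n_procedures": [1, 0, 2],
})

yes_no_cols = find_yes_no_columns(sample)

# Map the detected columns to 0/1, as in the cell above
for col in yes_no_cols:
    sample[col] = sample[col].map({"no": 0, "yes": 1})
print(sample)
```

Columns with a third category, such as 'normal' in the illustrative `A1Ctest` column, are left untouched. A column containing only null values would also pass this check, so the result is worth reviewing before mapping.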


---

# Save the Dataset

The modified dataset is saved to the outputs directory.

In [15]:
import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs("outputs/datasets/collection", exist_ok=True)

df.to_csv("outputs/datasets/collection/HospitalReadmissions.csv", index=False)
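A quick way to confirm the export is to read the file back and compare its shape with the DataFrame in memory. The sketch below does this in a temporary folder with a small made-up DataFrame, so it runs without the dataset; passing `exist_ok=True` to `os.makedirs` means no error is raised if the folder already exists.

```python
import os
import tempfile

import pandas as pd

# Small illustrative DataFrame, not the real dataset
sample = pd.DataFrame({"age": ["[70-80)", "[50-60)"], "readmitted": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "outputs", "datasets", "collection")
    os.makedirs(out_dir, exist_ok=True)  # no error if the folder exists

    out_path = os.path.join(out_dir, "HospitalReadmissions.csv")
    sample.to_csv(out_path, index=False)

    # Read the file back so it can be compared with the original
    reloaded = pd.read_csv(out_path)

print(reloaded.shape == sample.shape)
```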

---

# Conclusions

In this notebook we have achieved the following:

* Successfully downloaded, unzipped and saved the dataset using the Kaggle API.
* Inspected the dataset for missing values and found no null values; however, in the "medical_specialty" variable almost half of the rows were already labelled "Missing".
* Inspected the data types and mapped the "change", "diabetes_med" and "readmitted" variables to numerical values.
* Saved the dataset in the outputs directory.

### Next steps

In the next notebook we will begin an exploratory data analysis (EDA) using Pandas profiling and correlation studies, and start addressing business requirement 1.

This will take us to the 'Data Understanding' phase of the CRISP-DM workflow.