# **Data Collection**

## Objectives

* Fetch data from Kaggle and save as raw data

## Inputs

* Kaggle JSON file - the authentication token
* Kaggle dataset path - pavansubhasht/ibm-hr-analytics-attrition-dataset

## Outputs

* outputs/datasets/collection/employee-attrition.csv


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os

current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
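The parent-directory step above can be illustrated without touching the real working directory. The sketch below is a minimal example; the path `/workspace/project/jupyter_notebooks` is a hypothetical location chosen for illustration, not the actual project layout.

```python
import os

# Hypothetical notebook location, used only for illustration
notebook_dir = "/workspace/project/jupyter_notebooks"

# os.path.dirname() drops the last path component, giving the parent folder
project_root = os.path.dirname(notebook_dir)
print(project_root)
```

Note that a trailing slash changes the result: os.path.dirname("/a/b/") returns "/a/b" rather than "/a", which is one reason to start from os.getcwd(), which returns a path without a trailing separator in typical use.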

# Import Dataset from Kaggle

First, the Kaggle API package must be installed before the data can be downloaded.

A registered Kaggle account is required to obtain an API token (the kaggle.json file).


In [None]:
! pip install kaggle==1.6.14

Next, the Kaggle config directory is set to the current working directory, where kaggle.json must be located, and the file's permissions are restricted to read/write for the owner only (600).

In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
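For reference, the same permission change can be made from Python instead of the shell. This is a minimal sketch, assuming a POSIX system; it works on a throwaway file in a temporary directory (a stand-in for kaggle.json) so that it can be run safely anywhere.

```python
import os
import stat
import tempfile

# A throwaway file stands in for kaggle.json (assumption for illustration)
with tempfile.TemporaryDirectory() as tmp_dir:
    token_path = os.path.join(tmp_dir, "kaggle.json")
    with open(token_path, "w") as f:
        f.write("{}")

    # 0o600 = read and write for the owner, nothing for group or others
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)

    # Read the permission bits back to confirm the change
    mode = stat.S_IMODE(os.stat(token_path).st_mode)
    print(oct(mode))
```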

Then, define the Kaggle dataset path and the destination folder, and download the dataset to the 'inputs/datasets/raw' directory.
* The dataset path is the part of the Kaggle URL that follows 'https://www.kaggle.com/datasets/'

In [None]:
KaggleDatasetPath = "pavansubhasht/ibm-hr-analytics-attrition-dataset"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}
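The rule for deriving the dataset path from the URL can be sketched as a small helper. The function name `dataset_path_from_url` is hypothetical and not part of the Kaggle API; it only strips the known prefix from the URL.

```python
KAGGLE_DATASETS_PREFIX = "https://www.kaggle.com/datasets/"


def dataset_path_from_url(url: str) -> str:
    """Return the 'owner/dataset-name' part of a Kaggle dataset URL."""
    if not url.startswith(KAGGLE_DATASETS_PREFIX):
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    # Drop the prefix, then any query string or trailing slash
    path = url[len(KAGGLE_DATASETS_PREFIX):]
    return path.split("?")[0].strip("/")


url = "https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset"
print(dataset_path_from_url(url))
```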

The downloaded file is then unzipped, and the zipped file and kaggle.json are both deleted.

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
        && rm kaggle.json
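Where shell commands are unavailable (for example on Windows), the unzip-and-delete step could be done with Python's standard zipfile module instead. The sketch below is an assumption-laden illustration: it builds a tiny stand-in archive in a temporary folder, with made-up contents, rather than using the real Kaggle download.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as destination_folder:
    # Build a small stand-in archive (the real one comes from Kaggle)
    zip_path = os.path.join(destination_folder, "example.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("example.csv", "col_a,col_b\n1,2\n")

    # Extract every member into the destination folder, then delete the archive
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(destination_folder)
    os.remove(zip_path)

    remaining_files = sorted(os.listdir(destination_folder))
    print(remaining_files)
```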

Rename the file to a shorter name to avoid typos in later steps.

In [None]:
! mv {DestinationFolder}/WA_Fn-UseC_-HR-Employee-Attrition.csv {DestinationFolder}/employee-attrition.csv

---

# Load and Inspect the Kaggle Data

Using the pandas library, the dataset can be loaded as a dataframe and the data inspected.

In [None]:
import pandas as pd


df = pd.read_csv("inputs/datasets/raw/employee-attrition.csv")
df.head()

A summary of the dataframe columns, non-null counts and datatypes can be obtained. The summary shows that there is no missing data.

In [None]:
print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns")
print("-----------------------------")
print("A summary of the dataframe")
print("-----------------------------")
df.info()

We can double-check for missing data. The number of null values is zero in every column, which will make data preparation easier.

In [None]:
df.isnull().sum()

Check for duplicates in the EmployeeNumber feature. An empty result means there are no duplicates.

In [None]:
df[df.duplicated(subset=['EmployeeNumber'])]
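To show what these two checks report when problems do exist, the sketch below runs them on a toy dataframe. The data is invented for illustration and is not taken from the real dataset; the `Score` column in particular is hypothetical.

```python
import pandas as pd

# Toy data for illustration only; not taken from the real dataset
toy_df = pd.DataFrame({
    "EmployeeNumber": [1, 2, 2, 4],
    "Score": [3.0, None, 5.0, 1.0],
})

# Count of missing values per column
missing_per_column = toy_df.isnull().sum()

# Rows whose EmployeeNumber repeats an earlier row (the first occurrence is kept)
duplicate_rows = toy_df[toy_df.duplicated(subset=["EmployeeNumber"])]

print(missing_per_column)
print(duplicate_rows)
```

On the real data, the same expressions could be wrapped in assertions, for example `assert df.isnull().sum().sum() == 0`, so that the notebook fails loudly if a future download contains missing values.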

---

# Save the dataset

Save dataset in the outputs directory

In [None]:
# Create the output folder if it does not already exist
os.makedirs("outputs/datasets/collection/", exist_ok=True)

df.to_csv("outputs/datasets/collection/employee-attrition.csv", index=False)

---

# Conclusions

In this notebook, the following was achieved:
* The dataset was imported via the Kaggle API
* The dataset summary was displayed and checked for missing values and duplicate employee numbers
* The dataset was saved in the outputs directory

# Next Steps

In the next notebook, an exploratory data analysis will be carried out using Pandas profiling and correlation studies.