# **Data Collection**

## Objectives

* Fetch data from Kaggle and save as raw data

## Inputs

* Kaggle JSON file - the authentication token
* Kaggle dataset path - pavansubhasht/ibm-hr-analytics-attrition-dataset

## Outputs

* outputs/datasets/collection/employee-attrition.csv


---

# Change working directory

We need to change the working directory from the notebook's folder to its parent folder, so that relative paths resolve from the project root.
* We access the current directory with os.getcwd()

In [2]:
import os

current_dir = os.getcwd()
current_dir

'/workspace/attrition-predictor/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspace/attrition-predictor'
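
Re-running the cell above moves the working directory up one more level each time. A minimal sketch of a guard against that is shown below; the helper name `move_to_project_root` is hypothetical, and it assumes the notebooks live in a folder called `jupyter_notebooks`, as in the output above.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name="jupyter_notebooks"):
    """Change to the parent directory only when inside the notebooks folder.

    The folder name is an assumption taken from the path printed above.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    # Return the (possibly unchanged) working directory for confirmation.
    return Path.cwd()
```

Calling `move_to_project_root()` a second time leaves the directory unchanged, because the current folder is no longer named `jupyter_notebooks`.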

# Import Dataset from Kaggle

Firstly, the Kaggle API must be installed before the data can be loaded.

A registered Kaggle account is required to obtain an API key (downloaded as a JSON file).


In [None]:
! pip install kaggle==1.6.14

Next, the Kaggle config directory is set to the current working directory, and the permissions on kaggle.json are restricted to read/write for the owner only (mode 600).

In [7]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
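
The same permission change can be made from Python rather than the shell. The sketch below is one way to do it with the standard library; `restrict_to_owner` is a hypothetical helper name, and it assumes a POSIX file system where mode bits are honoured.

```python
import os
import stat


def restrict_to_owner(path):
    """Set owner-only read/write permissions (octal 600) on `path`.

    Returns the resulting permission bits so they can be checked.
    """
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # equivalent to 0o600
    return stat.S_IMODE(os.stat(path).st_mode)
```

In this notebook it would be called as `restrict_to_owner("kaggle.json")`, and the returned value can be compared with `0o600`.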

Then, define the Kaggle dataset path and the destination folder, and download the dataset to the 'inputs/datasets/raw' directory.
* The dataset path is taken from the Kaggle url, after 'https://www.kaggle.com/datasets/'

In [8]:
KaggleDatasetPath = "pavansubhasht/ibm-hr-analytics-attrition-dataset"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset
License(s): DbCL-1.0
Downloading ibm-hr-analytics-attrition-dataset.zip to inputs/datasets/raw
  0%|                                               | 0.00/50.1k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 50.1k/50.1k [00:00<00:00, 2.35MB/s]
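
The dataset path can also be derived from the full URL instead of being copied by hand. The following is a small sketch using only the standard library; `dataset_path_from_url` is a hypothetical helper, and it assumes the URL follows the `https://www.kaggle.com/datasets/<owner>/<dataset>` pattern described above.

```python
from urllib.parse import urlparse

KAGGLE_DATASETS_PREFIX = "/datasets/"


def dataset_path_from_url(url):
    """Return the 'owner/dataset-name' part of a Kaggle dataset URL."""
    path = urlparse(url).path
    if not path.startswith(KAGGLE_DATASETS_PREFIX):
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    # Keep only the first two segments after '/datasets/'.
    parts = path[len(KAGGLE_DATASETS_PREFIX):].strip("/").split("/")
    if len(parts) < 2 or not all(parts[:2]):
        raise ValueError(f"Missing owner or dataset name in: {url}")
    return "/".join(parts[:2])
```

Passing the dataset URL printed in the download output above yields the same string that is assigned to `KaggleDatasetPath`.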


The downloaded file is then unzipped, and the zipped file and kaggle.json are both deleted.

In [9]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
        && rm kaggle.json

Archive:  inputs/datasets/raw/ibm-hr-analytics-attrition-dataset.zip
  inflating: inputs/datasets/raw/WA_Fn-UseC_-HR-Employee-Attrition.csv  
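
Where the `unzip` command is not available, the extraction step could be approximated with Python's `zipfile` module. This is a sketch rather than a drop-in replacement: `extract_and_remove_archives` is a hypothetical helper, and unlike the shell command above it does not delete kaggle.json.

```python
import zipfile
from pathlib import Path


def extract_and_remove_archives(folder):
    """Extract every .zip in `folder` into that folder, then delete the archive.

    Returns the names of the extracted members.
    """
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Remove the archive only after it has been extracted successfully.
        archive.unlink()
    return extracted
```

Here it would be called as `extract_and_remove_archives(DestinationFolder)`.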


Rename the file to a shorter name, so that later references to it are less prone to typos.

In [10]:
! mv {DestinationFolder}/WA_Fn-UseC_-HR-Employee-Attrition.csv {DestinationFolder}/employee-attrition.csv

---

# Load and Inspect the Kaggle Data

Using the pandas library, the dataset can be loaded as a dataframe and the data inspected.

In [5]:
import pandas as pd


df = pd.read_csv("inputs/datasets/raw/employee-attrition.csv")
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


A summary of the dataframe columns, non-null counts and datatypes can be obtained. The non-null count of each listed column equals the number of rows (1470), which indicates that there is no missing data.

In [6]:
print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns")
print("-----------------------------")
print("A summary of the dataframe")
print("-----------------------------")
df.info()

The dataset has 1470 rows and 35 columns
-----------------------------
A summary of the dataframe
-----------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  Hour

We can double-check for missing data. The number of null values in each column is zero, which will make data preparation easier.

In [7]:
df.isnull().sum()

Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSince

---

# Save the Dataset

Save the dataset in the outputs directory.

In [16]:
# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs("outputs/datasets/collection", exist_ok=True)

# Write the dataframe without the index column
df.to_csv("outputs/datasets/collection/employee-attrition.csv", index=False)

---

# Conclusions

In this notebook, the following was achieved:
* The dataset was imported via Kaggle API
* The dataset summary was displayed and checked for null entries (none were found)
* The dataset was saved in the outputs directory

# Next Steps

In the next notebook, an exploratory data analysis will be carried out using Pandas profiling and correlation studies.