# **Data Collection**

## Objectives

* Load and inspect the raw employee attrition dataset, then save a copy to the outputs directory

## Inputs

* Raw dataset in inputs/datasets/raw/employee-attrition.csv

## Outputs

* Dataset saved in outputs/datasets/collection/employee-attrition.csv


---

# Change working directory

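We need to change the working directory from the notebook's folder to its parent folder, so that relative paths such as `inputs/` and `outputs/` resolve from the project folder.
* We access the current directory with `os.getcwd()` and move to its parent with `os.chdir()`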

In [1]:
import os

current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))

In [2]:
current_dir = os.getcwd()
current_dir

'/workspace/attrition-predictor'
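
Re-running the first cell moves the working directory up one more level each time. A minimal sketch of a guarded alternative is shown here; the helper name `move_to_parent_once` and its `notebook_folder` argument are hypothetical, and it assumes the notebooks live in a subfolder whose name is known.

```python
import os


def move_to_parent_once(notebook_folder):
    """Move to the parent directory only while still inside notebook_folder.

    notebook_folder is a hypothetical name for the folder that holds the notebooks.
    """
    current_dir = os.getcwd()
    # Only climb one level if the current folder is the notebook folder
    if os.path.basename(current_dir) == notebook_folder:
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()
```

Calling the helper a second time leaves the working directory unchanged, because the current folder no longer matches `notebook_folder`.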

---

# Pandas Profiling

Using the pandas library, the dataset can be loaded as a dataframe and the data inspected.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')


In [None]:

df = pd.read_csv("inputs/datasets/raw/employee-attrition.csv")
df.head()

|   | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |

A summary of the dataframe columns, non-null counts and datatypes can be obtained.

In [None]:
print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns")
print("-----------------------------")
print("A summary of the dataframe")
print("-----------------------------")
df.info()
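
`df.info()` reports non-null counts per column; missing values can also be counted directly with `isnull().sum()`. The sketch below uses a toy dataframe with made-up values in place of the real dataset, so the names `toy_df` and `missing_counts` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy dataframe with made-up values; in the notebook, `df` would be used instead
toy_df = pd.DataFrame({
    "Age": [41, 49, np.nan],
    "Attrition": ["Yes", "No", "Yes"],
})

# Number of missing values in each column
missing_counts = toy_df.isnull().sum()

# Show only the columns that contain at least one missing value
print(missing_counts[missing_counts > 0])
```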

In [31]:
from ydata_profiling import ProfileReport


pandas_report = ProfileReport(df=df, title="Overview of the attrition dataset", minimal=True)
pandas_report.to_notebook_iframe()


In [30]:

for col in df.columns:
    if df[col].dtype == 'object':
        print(f'{col} is object and has unique values of: {df[col].unique()}')
        print('---------------------')
    if df[col].dtype == 'int64':
        print(f'{col} is int of min:{df[col].min()} and max {df[col].max()}.')
        print('---------------------')



Age is int of min:18 and max 60.
---------------------
Attrition is object and has unique values of: ['Yes' 'No']
---------------------
BusinessTravel is object and has unique values of: ['Travel_Rarely' 'Travel_Frequently' 'Non-Travel']
---------------------
DailyRate is int of min:102 and max 1499.
---------------------
Department is object and has unique values of: ['Sales' 'Research & Development' 'Human Resources']
---------------------
DistanceFromHome is int of min:1 and max 29.
---------------------
Education is int of min:1 and max 5.
---------------------
EducationField is object and has unique values of: ['Life Sciences' 'Other' 'Medical' 'Marketing' 'Technical Degree'
 'Human Resources']
---------------------
EmployeeCount is int of min:1 and max 1.
---------------------
EmployeeNumber is int of min:1 and max 2068.
---------------------
EnvironmentSatisfaction is int of min:1 and max 4.
---------------------
Gender is object and has unique values of: ['Female' 'Male']
-----
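
The output above shows that `EmployeeCount` has the same minimum and maximum, so it holds a single value. As a sketch, such constant columns can be listed with `nunique()`. The toy dataframe below stands in for `df` and its row values are illustrative; `constant_columns` is a hypothetical helper name.

```python
import pandas as pd


def constant_columns(frame):
    """Return the names of columns that hold a single distinct value."""
    unique_counts = frame.nunique()
    return unique_counts[unique_counts == 1].index.tolist()


# Toy dataframe with illustrative values; in the notebook, `df` would be used instead
toy_df = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "Gender": ["Female", "Male", "Male"],
})

print(constant_columns(toy_df))
```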

---

# Save the Dataset

Save dataset in the outputs directory

In [16]:
# Create the output folder if it does not already exist
os.makedirs(name="outputs/datasets/collection", exist_ok=True)

df.to_csv("outputs/datasets/collection/employee-attrition.csv", index=False)
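
A quick way to confirm the save worked is to reload the file and compare it with the dataframe in memory. The sketch below does this with a toy dataframe and a temporary folder, both of which are assumptions for illustration. Passing `index=False` keeps the row index out of the file, so no extra `Unnamed: 0` column appears on reload.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe with made-up values; in the notebook, `df` would be used instead
toy_df = pd.DataFrame({"Age": [41, 49], "Attrition": ["Yes", "No"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Mirror the notebook's folder layout inside a throwaway directory
    out_dir = os.path.join(tmp_dir, "outputs", "datasets", "collection")
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "employee-attrition.csv")

    # Save without the index, then read the file back
    toy_df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# The reloaded data should have the same shape and column names
print(reloaded.shape == toy_df.shape)
print(list(reloaded.columns) == list(toy_df.columns))
```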

---

# Conclusions

In this notebook, the following was achieved:
* The raw dataset was loaded from inputs/datasets/raw/employee-attrition.csv into a pandas dataframe
* The dataset summary was displayed and the non-null counts were checked
* The dataset was saved in the outputs directory

# Next Steps

In the next notebook, an exploratory data analysis will be carried out using Pandas profiling and correlation studies.