# **Data Collection**

## Objectives

* Fetch data from Kaggle and save the raw data in the repo.
* Inspect the data and save it in outputs/datasets/collection.

## Inputs

* Kaggle JSON file - the account authentication token. 

## Outputs

* Generate Dataset: outputs/datasets/collection/.csv

## Additional Comments

* Were this a real-world project, the inputs/datasets/raw and outputs/datasets directories would ideally be listed in .gitignore so that customer data is not pushed to a public repository. This is not done here, so that the Jupyter notebooks run properly for the examiner during project evaluation.


---

# Change working directory

We need to change the working directory from the current **jupyter_notebooks** folder to its parent folder in order to access the whole project
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/creditcard-churn/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/creditcard-churn'
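
One caveat with the cells above: running them a second time moves the working directory up one more level. A minimal sketch of a re-runnable variant follows. The helper name `move_to_project_root` and its `notebook_folder` parameter are hypothetical and not part of the original notebook; the default folder name is taken from the path shown above.

```python
import os


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Move to the parent directory only while still inside the notebook folder.

    Hypothetical helper: the name and default value are assumptions for illustration.
    """
    current_dir = os.getcwd()
    # Only step up if the current folder is the notebook folder,
    # so calling this twice does not climb above the project root
    if os.path.basename(current_dir) == notebook_folder:
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()
```

Calling `move_to_project_root()` repeatedly leaves the working directory at the project root after the first call.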

# Install required packages

Installation of dependencies listed in requirements.txt

In [4]:
%pip install -r requirements.txt

Collecting numpy==1.18.5
  Using cached numpy-1.18.5.zip (5.4 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pandas==1.4.2
  Using cached pandas-1.4.2.tar.gz (4.9 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting matplotlib==3.3.1
  Using cached matplotlib-3.3.1.tar.gz (38.8 MB)
  Preparing metadata (setup.py) ... done
Collecting seaborn==0.11.0
  Using cached seaborn-0.11.0-py3-none-any.whl (283 kB)
Collecting pandas-profiling==3.1.0
  Using cached pandas_profiling-3.1.0-py2.py3-none-any.whl (261 kB)
Collecting plotly==4.12.0
  Using cached plotly-4.12.0-py2.py3-none-any.whl (13.1 MB)
Collecting ppscore==1.2.0
  Using cached ppscore-1.2.0.tar.gz (47 kB)
  Preparing metadata (setup.py) ... done

---

# Fetch data from Kaggle

Install kaggle package to fetch data

In [5]:
%pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59.0/59.0 kB 3.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting tqdm
  Downloading tqdm-4.65.0-py3-none-any.whl (77 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.1/77.1 kB 6.8 MB/s eta 0:00:00
Collecting python-slugify
  Downloading python_slugify-8.0.1-py2.py3-none-any.whl (9.7 kB)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.2/78.2 kB 13.8 MB/s eta 0:00:00
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73031 sha256=31570c46262aba53b66ad18e4cddd3ee62078a882288ca6fc1bad9ffe737b4e7

Note: to run this, you must upload your own kaggle.json file to the workspace to authenticate the request. This cell sets KAGGLE_CONFIG_DIR to the project's directory, and then restricts kaggle.json so that only the file's owner can read and write it (chmod 600). This is so the Kaggle API request can function correctly.

In [8]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
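
The same setup can be done without a shell command. The sketch below is a hedged equivalent of the cell above: `os.chmod` with owner read and write bits matches `chmod 600`. The helper name `configure_kaggle_token` is hypothetical, and the permission bits only behave this way on POSIX systems.

```python
import os
import stat


def configure_kaggle_token(config_dir):
    """Point the Kaggle client at config_dir and lock down kaggle.json.

    Hypothetical helper for illustration; not part of the kaggle package.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Upload kaggle.json to {config_dir} first")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Owner read/write only, equivalent to `chmod 600 kaggle.json`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

Raising early when the token is missing gives a clearer message than letting the later download command fail.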

We now define the Kaggle dataset path, make a directory for it, and then run a kaggle command to download the dataset to this directory.

In [9]:
KaggleDatasetPath = "patelprashant/employee-attrition"
DestinationFolder = "inputs/datasets/raw"
if not os.path.isdir(DestinationFolder):
    os.makedirs(DestinationFolder)

! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading employee-attrition.zip to inputs/datasets/raw
  0%|                                               | 0.00/50.1k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 50.1k/50.1k [00:00<00:00, 2.04MB/s]


Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [10]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} && rm {DestinationFolder}/*.zip && rm kaggle.json

Archive:  inputs/datasets/raw/employee-attrition.zip
  inflating: inputs/datasets/raw/WA_Fn-UseC_-HR-Employee-Attrition.csv  


Rename .csv file to a more readable name

In [11]:
! mv {DestinationFolder}/*Employee-Attrition.csv {DestinationFolder}/EmployeeAttrition.csv
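
The unzip and cleanup steps above rely on shell commands, which may not be available in every environment. As a rough sketch, the standard-library `zipfile` module can do the same extraction and removal. The helper name `extract_and_remove_zips` is hypothetical, and it deliberately leaves kaggle.json and the renaming step alone.

```python
import os
import zipfile


def extract_and_remove_zips(folder):
    """Extract every .zip archive in folder into folder, then delete the archive.

    Hypothetical helper; returns the names of the extracted members.
    """
    extracted = []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".zip"):
            continue
        zip_path = os.path.join(folder, name)
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Remove the archive only after it has been closed
        os.remove(zip_path)
    return extracted
```

A renamed copy of an extracted file can then be produced with `os.replace(old_path, new_path)`.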

# Load and Inspect Kaggle data

In [13]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/EmployeeAttrition.csv")
df.head()

ModuleNotFoundError: No module named 'pandas'

DataFrame summary

In [None]:
df.info()

Check for duplicate customers by looking for repeated values in `CLIENTNUM`. There are no duplicates.

In [None]:
df[df.duplicated(subset=['CLIENTNUM'])]
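
To illustrate what the duplicate check returns, here is a small sketch on toy data (the values and the `value` column are made up and not taken from the Kaggle dataset). `duplicated(subset=...)` flags the second and later occurrences of each repeated key, so an empty result means every key is unique.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"CLIENTNUM": [101, 102, 101], "value": [45, 39, 45]})

# Boolean mask: True for rows whose CLIENTNUM has already appeared above
is_duplicate = toy.duplicated(subset=["CLIENTNUM"])

duplicate_rows = toy[is_duplicate]
duplicate_count = int(is_duplicate.sum())
```

Checking `duplicate_count == 0` gives a single number to confirm, rather than reading an empty table.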

Change `Attrition_Flag` from a categorical variable ('Existing Customer', 'Attrited Customer') to an integer (0 or 1) for use in the model.

In [None]:
df['Attrition_Flag'].unique()

In [None]:
df['Attrition_Flag'] = df['Attrition_Flag'].replace({'Attrited Customer': 1, 'Existing Customer': 0})

Check the `Attrition_Flag` data type

In [None]:
df['Attrition_Flag'].dtype
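
The notebook uses `replace` for this conversion. As an alternative sketch (not the notebook's method), `Series.map` with an explicit dictionary makes unexpected labels easy to detect, because any label missing from the dictionary becomes NaN. The toy series below stands in for the real column.

```python
import pandas as pd

# Toy column using the two labels named above
flags = pd.Series(["Existing Customer", "Attrited Customer", "Existing Customer"])

label_to_int = {"Attrited Customer": 1, "Existing Customer": 0}

# map() yields NaN for labels that are not keys of the dictionary
encoded = flags.map(label_to_int)
if encoded.isna().any():
    raise ValueError("Found a label that is not in label_to_int")

# Safe to cast once every label has been matched
encoded = encoded.astype("int64")
```

Failing loudly on an unmatched label avoids silently passing text or missing values into the model.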

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os

# Create outputs/datasets/collection if it does not already exist
os.makedirs('outputs/datasets/collection', exist_ok=True)

# index=False keeps the DataFrame index out of the saved file
df.to_csv("outputs/datasets/collection/BankChurners.csv", index=False)
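
A detail worth checking when saving with `to_csv`: by default pandas writes the DataFrame index as an extra column, and `index=False` suppresses it. The sketch below shows the difference on a toy frame, using an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({"Attrition_Flag": [0, 1]})

# Default: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))
```

Reloading the first version produces an extra `Unnamed: 0` column, which later notebooks would need to drop.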
