# **Data Collection**

## Objectives

* Fetch data from Kaggle and save the raw data in the repo.
* Inspect the data and save it in outputs/datasets/collection.

## Inputs

* Kaggle JSON file (kaggle.json): the account authentication token.

## Outputs

* Generate Dataset: outputs/datasets/collection/BankChurners.csv

## Additional Comments

* In a real-world project, the outputs/datasets/ directory should be listed in .gitignore so that customer data is not pushed to a public repository. This is not done here, so that the Jupyter notebooks run properly for the examiner during project evaluation.


---

# Change working directory

* The notebooks are stored in the jupyter_notebooks folder, so we need to change the working directory from that folder to its parent folder in order to access the whole project.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
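
Each run of the cells above moves up one level, so running them twice would leave the notebook outside the project root. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` as stated above; the helper name `move_to_project_root` is hypothetical and not part of the original notebook:

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name: str = "jupyter_notebooks") -> Path:
    """Change to the parent folder only when currently inside the notebook folder.

    The folder name is an assumption taken from the text above.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    return Path.cwd()


# Safe to re-run: a second call leaves the working directory unchanged
project_root = move_to_project_root()
```

Because the check is on the folder name, the cell can be re-executed without climbing further up the directory tree.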

# Install required packages

Installation of dependencies listed in requirements.txt

In [None]:
%pip install -r requirements.txt

---

# Fetch data from Kaggle

Install the kaggle package to fetch the data

In [None]:
%pip install kaggle==1.5.12

Note: to run this, you must upload your own kaggle.json file to the workspace to authenticate the request. This cell sets KAGGLE_CONFIG_DIR to the project's directory, and then restricts kaggle.json to read and write access for the file owner only (`chmod 600`), so that the token is not readable by other users and the Kaggle API request can function correctly.

In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
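
The `chmod 600` step can also be done and verified from Python. This is a minimal sketch for a POSIX system, not the notebook's own method; the helper name `restrict_to_owner` is hypothetical:

```python
import os
import stat


def restrict_to_owner(path: str) -> str:
    """Set owner-only read/write (the equivalent of chmod 600).

    Returns the resulting permission bits as an octal string so they can be checked.
    """
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return oct(mode)


# Only touch the token if it has actually been uploaded to the workspace
if os.path.exists("kaggle.json"):
    print(restrict_to_owner("kaggle.json"))
```

Reading the mode back with `os.stat` confirms that group and other users have no access to the token.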

We now define the Kaggle dataset path, create a destination directory for it, and run a kaggle command to download the dataset into that directory.

In [None]:
KaggleDatasetPath = "sakshigoyal7/credit-card-customers"
DestinationFolder = "inputs/datasets/raw"
if not os.path.isdir(DestinationFolder):
    os.makedirs(DestinationFolder)

! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} && rm {DestinationFolder}/*.zip && rm kaggle.json
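
Where the `unzip` shell command is not available, the same extraction can be sketched with Python's standard `zipfile` module. This is an alternative under that assumption, not the notebook's own approach; the helper name `extract_and_remove_zips` is hypothetical, and it does not delete kaggle.json:

```python
import glob
import os
import zipfile


def extract_and_remove_zips(folder: str) -> list:
    """Extract every .zip archive in `folder` into that folder, then delete the archive.

    Returns the names of the extracted members.
    """
    extracted = []
    for zip_path in glob.glob(os.path.join(folder, "*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Remove the archive only after it has been closed
        os.remove(zip_path)
    return extracted


# Returns an empty list when the folder holds no zip files
extracted_files = extract_and_remove_zips("inputs/datasets/raw")
```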

# Load and Inspect Kaggle data

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/BankChurners.csv")
df.head()

DataFrame summary

In [None]:
df.info()

Check if there are any duplicate customers by inspecting for duplicates in `CLIENTNUM`. There are no duplicates.

In [None]:
df[df.duplicated(subset=['CLIENTNUM'])]
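
To illustrate what the duplicate check returns when duplicates do exist, here is a small sketch on made-up data. The values are invented for illustration and are not taken from the Kaggle dataset:

```python
import pandas as pd

# Toy data with one repeated CLIENTNUM (values are invented)
toy = pd.DataFrame({
    "CLIENTNUM": [101, 102, 101, 103],
    "Customer_Age": [45, 50, 45, 38],
})

# duplicated() keeps the first occurrence by default and flags later repeats
is_duplicate = toy.duplicated(subset=["CLIENTNUM"])
duplicate_rows = toy[is_duplicate]
n_duplicates = int(is_duplicate.sum())
```

Here `duplicate_rows` contains only the second row with `CLIENTNUM` 101. An empty result, as with the real data, means every customer appears once.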

Change `Attrition_Flag` from a categorical variable ('Existing Customer', 'Attrited Customer') to an integer (0 or 1) for use in the model.

In [None]:
df['Attrition_Flag'].unique()

In [None]:
df['Attrition_Flag'] = df['Attrition_Flag'].replace({'Attrited Customer': 1, 'Existing Customer': 0})

Check the `Attrition_Flag` data type

In [None]:
df['Attrition_Flag'].dtype
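
`replace` leaves any label that is missing from the dictionary untouched, so an unexpected value would pass through silently. A hedged alternative sketch uses `map`, which turns unmapped labels into NaN so that they can be detected before casting. The series below is toy data, not the real column:

```python
import pandas as pd

# Toy labels standing in for the Attrition_Flag column
labels = pd.Series(["Existing Customer", "Attrited Customer", "Existing Customer"])
mapping = {"Attrited Customer": 1, "Existing Customer": 0}

encoded = labels.map(mapping)

# map() yields NaN for any label missing from the mapping, so check before casting
if encoded.isna().any():
    raise ValueError("unexpected label in Attrition_Flag")

encoded = encoded.astype("int64")
```

The explicit cast makes the integer dtype a guarantee rather than a side effect of the replacement.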

# Push files to Repo

* Save the inspected dataset to outputs/datasets/collection so that it can be pushed to the repo.

In [None]:
import os

# Create the outputs/datasets/collection folder if it does not already exist
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/BankChurners.csv")
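
Since `index=False` is not passed, the export above also writes the DataFrame index as an unnamed first column. The sketch below, on toy data in a temporary folder, shows how that affects the columns when the file is read back. Whether later notebooks should drop the index is a design choice this notebook does not state:

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the collected dataset
toy = pd.DataFrame({"CLIENTNUM": [101, 102], "Attrition_Flag": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    toy.to_csv(with_index)                  # index written as first column
    toy.to_csv(without_index, index=False)  # index omitted

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)
```

Reading the first file back gives an extra `Unnamed: 0` column. It can be avoided either by saving with `index=False` or by loading with `index_col=0`.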
