# **Data Collection**

## Objectives

* Fetch data from Kaggle and save the raw data in the repo.
* Inspect the data and save it in outputs/datasets/collection.

## Inputs

* Kaggle JSON file - the account authentication token. 

## Outputs

* Generate Dataset: outputs/datasets/collection/mushrooms.csv

## Additional Comments

* Were this a real-world project, the inputs/datasets/raw and outputs/datasets/ directories would ideally be listed in .gitignore, so that commercially sensitive data belonging to the client is not pushed to a publicly available repo without their permission. This is not done here, so that the project can be evaluated and the Jupyter notebooks run properly for the examiner.


---

# Change working directory

The notebook runs from the **jupyter_notebooks** folder, so we need to change the working directory to its parent folder in order to access the whole project.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/creditcard-churn/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/creditcard-churn'
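
Re-running the `os.chdir(os.path.dirname(...))` cell moves one level further up each time. A minimal sketch of a guard against that, assuming the notebooks folder is always named `jupyter_notebooks`; the helper `project_root` is hypothetical and not part of the original notebook:

```python
from pathlib import PurePosixPath


def project_root(current_dir: str, notebooks_folder: str = "jupyter_notebooks") -> str:
    """Return the parent of current_dir only when current_dir is the notebooks folder."""
    path = PurePosixPath(current_dir)
    if path.name == notebooks_folder:
        # Still inside the notebooks folder: step up one level.
        return str(path.parent)
    # Already outside the notebooks folder (for example on a re-run): leave it unchanged.
    return str(path)


print(project_root("/workspace/creditcard-churn/jupyter_notebooks"))
print(project_root("/workspace/creditcard-churn"))
```

In the notebook this could be used as `os.chdir(project_root(os.getcwd()))`, which should be safe to run more than once.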

# Install required packages

Installation of dependencies listed in requirements.txt

In [None]:
%pip install -r requirements.txt

altair==4.2.2
astor==0.8.1
asttokens==2.2.1
attrs==23.1.0
backcall==0.2.0
backports.zoneinfo==0.2.1
base58==2.1.1
blinker==1.6.2
cachetools==5.3.1
certifi==2023.5.7
charset-normalizer==3.2.0
click==7.1.2
comm==0.1.3
cycler==0.11.0
dacite==1.8.1
debugpy==1.6.7
decorator==5.1.1
entrypoints==0.4
executing==1.2.0
feature-engine==1.0.2
gitdb==4.0.10
GitPython==3.1.32
htmlmin==0.1.12
idna==3.4
ImageHash==4.3.1
imbalanced-learn==0.8.0
importlib-metadata==6.8.0
importlib-resources==6.0.0
ipykernel==6.24.0
ipython==8.12.2
ipywidgets==8.0.2
jedi==0.18.2
Jinja2==3.1.1
joblib==1.3.1
jsonschema==4.18.2
jsonschema-specifications==2023.6.1
jupyter-client==8.3.0
jupyter-core==5.3.1
jupyterlab-widgets==3.0.8
kiwisolver==1.4.4
MarkupSafe==2.0.1
matplotlib==3.3.1
matplotlib-inline==0.1.6
multimethod==1.9.1
nest-asyncio==1.5.6
networkx==3.1
numpy==1.18.5
packaging==23.1
pandas==1.4.2
parso==0.8.3
patsy==0.5.3
pexpect==4.8.0
phik==0.12.3
pickleshare==0.7.5
Pillow==10.0.0
pkgutil-resolve-name==1.3.10
platfo

---

# Fetch data from Kaggle

Install kaggle package to fetch data

In [None]:
%pip install kaggle==1.5.12

Note: to run this, you must upload your own kaggle.json file to the workspace to authenticate the request. This cell sets KAGGLE_CONFIG_DIR to the project's directory and then uses chmod 600 to restrict kaggle.json to read and write access for the file owner only. The kaggle client warns when the token is readable by other users, so this keeps the API request running cleanly.

In [4]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
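
The same permission change can be made and verified from Python instead of the shell. This is a sketch only, assuming a POSIX file system; `restrict_to_owner` is a hypothetical helper, and it is demonstrated on a throwaway file rather than a real token:

```python
import os
import stat
import tempfile


def restrict_to_owner(token_path: str) -> int:
    """Set owner-only read/write (0o600) on token_path and return the resulting mode bits."""
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(token_path).st_mode)


# Demonstrate on a temporary stand-in file, not a real kaggle.json.
with tempfile.TemporaryDirectory() as tmp:
    demo_token = os.path.join(tmp, "kaggle.json")
    with open(demo_token, "w") as handle:
        handle.write("{}")
    mode = restrict_to_owner(demo_token)
    print(oct(mode))
```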

We now define the kaggle dataset path, make a directory for it, and then run a kaggle command to download the dataset to this directory

In [6]:
KaggleDatasetPath = "uciml/mushroom-classification"
DestinationFolder = "inputs/datasets/raw"
if not os.path.isdir(DestinationFolder):
    os.makedirs(DestinationFolder)

! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading mushroom-classification.zip to inputs/datasets/raw
  0%|                                               | 0.00/34.2k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 34.2k/34.2k [00:00<00:00, 8.24MB/s]
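
The folder creation and download command above could also be prepared in plain Python. The sketch below only builds the argument list for the same `kaggle datasets download -d ... -p ...` call and does not contact Kaggle; `prepare_download` is a hypothetical helper, and the temporary folder stands in for the project directory:

```python
import os
import tempfile


def prepare_download(dataset: str, destination: str) -> list:
    """Create destination if needed and return the kaggle CLI arguments as a list."""
    os.makedirs(destination, exist_ok=True)  # no error if the folder already exists
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", destination]


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "inputs", "datasets", "raw")
    command = prepare_download("uciml/mushroom-classification", target)
    print(command[:5])
```

Passing the list to `subprocess.run(command, check=True)` would run the download, assuming the kaggle CLI is installed and the token is configured as described earlier.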


Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [7]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} && rm {DestinationFolder}/*.zip && rm kaggle.json

Archive:  inputs/datasets/raw/mushroom-classification.zip
replace inputs/datasets/raw/mushrooms.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C
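
The output above shows `unzip` stopping at an interactive overwrite prompt because mushrooms.csv already existed. One way to avoid the prompt is to extract with Python's standard `zipfile` module, which overwrites existing files without asking. A minimal sketch, using a small stand-in archive rather than the real download; `extract_and_remove` is a hypothetical helper:

```python
import os
import tempfile
import zipfile


def extract_and_remove(zip_path: str, destination: str) -> list:
    """Extract zip_path into destination, overwriting existing files, then delete the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(destination)  # overwrites without prompting
    os.remove(zip_path)
    return names


# Demonstrate with a small stand-in archive, not the real Kaggle download.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("mushrooms.csv", "class,odor\np,p\n")
    extracted = extract_and_remove(demo_zip, tmp)
    print(extracted, os.path.exists(demo_zip))
```

Unlike the shell cell, this sketch leaves kaggle.json in place; removing the token would remain a separate step.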


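Rename the .csv file to a more readable name. This step is not needed for this dataset: the archive extracts to mushrooms.csv, and the pattern below (*Employee-Attrition.csv) does not match that file, so the cell is left unexecuted.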

In [None]:
! mv {DestinationFolder}/*Employee-Attrition.csv {DestinationFolder}/EmployeeAttrition.csv

# Load and Inspect Kaggle data

In [8]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/mushrooms.csv")
df.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


DataFrame summary

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring  

Change `class` from a categorical variable ('p' - poisonous, 'e' - edible) to an integer (0 - poisonous, 1 - edible) for use in the model

In [11]:
df['class'].unique()

array(['p', 'e'], dtype=object)

In [12]:
df['class'] = df['class'].replace({'e': 1, 'p': 0})

Check the `class` data type

In [13]:
df['class'].dtype

dtype('int64')
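
The `replace` call above leaves any label other than 'e' or 'p' untouched, so an unexpected value would go unnoticed until the dtype check. A hedged alternative sketch uses `Series.map` and fails loudly on unknown labels; `encode_class` is a hypothetical helper, and the sample values are illustrative rather than taken from the dataset:

```python
import pandas as pd


def encode_class(labels: pd.Series) -> pd.Series:
    """Map 'e' -> 1 and 'p' -> 0, raising if any other label is present."""
    mapping = {"e": 1, "p": 0}
    encoded = labels.map(mapping)  # labels missing from the mapping become NaN
    if encoded.isna().any():
        unexpected = labels[encoded.isna()].unique().tolist()
        raise ValueError(f"Unexpected class labels: {unexpected}")
    return encoded.astype("int64")


sample = pd.Series(["p", "e", "e", "p"], name="class")  # illustrative values only
print(encode_class(sample).tolist())
```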

# Push files to Repo

* The inspected dataset is saved to outputs/datasets/collection so that it can be pushed to the repo.

In [14]:
import os

try:
    # create the outputs/datasets/collection folder
    os.makedirs(name='outputs/datasets/collection')
except Exception as e:
    # on re-runs the folder already exists; report the error and continue
    print(e)

df.to_csv("outputs/datasets/collection/mushrooms.csv")


[Errno 17] File exists: 'outputs/datasets/collection'
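
Note that `to_csv` writes the DataFrame index by default, so a file saved this way gains an extra unnamed first column when it is read back. Whether later notebooks expect that column is not stated here, so the following is only a sketch of the difference, using a tiny illustrative frame and in-memory buffers instead of the real file:

```python
import io

import pandas as pd

sample = pd.DataFrame({"class": [0, 1], "odor": ["p", "a"]})  # illustrative rows only

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```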
