# **Data Collection**

## Objectives

* Fetch the data from Kaggle and save the raw data in the repo.
* Inspect the data and save it in outputs/datasets/collection.

## Inputs

* Kaggle JSON file - the account authentication token. 

## Outputs

* Generated dataset: outputs/datasets/collection/mushrooms.csv

## Additional Comments

* Were this a real-world project, the inputs/datasets/raw and outputs/datasets/ directories would ideally be listed in .gitignore, so that commercially sensitive data belonging to the client is not pushed to a publicly available repo without their permission. This is not done here, so that the project can be evaluated and the Jupyter notebooks run properly for the examiner.


---

# Change working directory

* We need to change the working directory from the current **jupyter_notebooks** folder to its parent folder in order to access the whole project.

We first read the current directory:
* os.getcwd() returns the current working directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
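
Re-running the `os.chdir` cell moves the working directory up one more level each time it is executed. The following is a minimal sketch of a guard against that, assuming the notebooks folder is named jupyter_notebooks as stated above. `move_to_project_root` is a hypothetical helper, not part of the original notebook.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name="jupyter_notebooks"):
    """Change to the parent folder only while still inside the notebooks folder.

    `notebook_dir_name` is an assumption based on the folder name used in this project.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    return Path.cwd()


project_root = move_to_project_root()
print(project_root)
```

Because the function checks the folder name first, calling it a second time leaves the working directory unchanged.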

# Install required packages

Installation of dependencies listed in requirements.txt

In [None]:
%pip install -r requirements.txt

---

# Fetch data from Kaggle

Install kaggle package to fetch data

In [None]:
%pip install kaggle==1.5.12

Note: to run this, you must upload your own kaggle.json file to the workspace to authenticate the request. This cell sets KAGGLE_CONFIG_DIR to the project's directory and then restricts kaggle.json so that only the file's owner can read and write it (chmod 600). The Kaggle API expects the token to be private and warns if other users can read it.

In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
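
For environments where the `chmod` shell command is not available, the same permission change can be sketched in pure Python. This assumes a POSIX file system (on Windows, `os.chmod` only toggles the read-only flag). `restrict_to_owner` is a hypothetical helper name.

```python
import os
import stat


def restrict_to_owner(path):
    """Equivalent of `chmod 600`: the owner may read and write, nobody else has access."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    # Return only the permission bits so the result can be inspected
    return stat.S_IMODE(os.stat(path).st_mode)


# Only act if the token has actually been uploaded to the working directory
if os.path.isfile("kaggle.json"):
    print(oct(restrict_to_owner("kaggle.json")))
```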

We now define the Kaggle dataset path, create the destination directory, and run the kaggle command to download the dataset into that directory

In [None]:
KaggleDatasetPath = "uciml/mushroom-classification"
DestinationFolder = "inputs/datasets/raw"
# Create the destination folder; exist_ok=True makes this a no-op if it already exists
os.makedirs(DestinationFolder, exist_ok=True)

! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} && rm {DestinationFolder}/*.zip && rm kaggle.json
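
The extraction step can also be sketched with the standard-library `zipfile` module, which avoids depending on the `unzip` and `rm` shell commands. `extract_and_remove_zips` is a hypothetical helper. Unlike the cell above, it does not delete kaggle.json.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip archive in `folder` into `folder`, then delete the archive.

    Returns the names of all extracted members.
    """
    extracted = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # The archive is closed at this point, so it is safe to delete it
        archive.unlink()
    return extracted
```

In this notebook it would be called as `extract_and_remove_zips(DestinationFolder)` once the download has finished.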

# Load and Inspect Kaggle data

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/mushrooms.csv")
df.head()

DataFrame summary

In [None]:
df.info()

# Reformat Dataset Labelling

Change `class` from a categorical variable ('p' = poisonous, 'e' = edible) to an integer (0 = poisonous, 1 = edible) for use in the model

In [None]:
df['class'].unique()

In [None]:
df['class'] = df['class'].replace({'e': 1, 'p': 0})

Check the `class` data type

In [None]:
df['class'].dtype
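
The dtype that `replace` leaves behind can depend on the pandas version, which is why it is checked above. As a hedged alternative, the sketch below uses `map` with an explicit integer cast and fails loudly if an unexpected label appears. The `toy` frame is hypothetical stand-in data, not the real dataset.

```python
import pandas as pd

# Toy data standing in for the real `class` column (hypothetical values)
toy = pd.DataFrame({"class": ["p", "e", "e", "p"]})

label_map = {"e": 1, "p": 0}

# Refuse to continue if the column holds a label the mapping does not cover
unexpected = set(toy["class"].unique()) - set(label_map)
if unexpected:
    raise ValueError(f"Unexpected class labels: {sorted(unexpected)}")

toy["class"] = toy["class"].map(label_map).astype("int64")
print(toy["class"].tolist())  # [0, 1, 1, 0]
```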

Rename `class` to `edible` so that the meaning of the column is clearer (1 = edible, 0 = poisonous)

In [None]:
df = df.rename(columns={'class':'edible'})

df.head()

We will also replace each single-character code with a descriptive word, so the user can better understand what the values mean. The definitions are taken from the [kaggle page](https://www.kaggle.com/datasets/uciml/mushroom-classification):

In [None]:
rename_dict = {'cap-shape':{'b': 'bell', 'c': 'conical', 'x': 'convex', 'f': 'flat', 'k': 'knobbed', 's': 'sunken'},
                'cap-surface':{'f': 'fibrous', 'g': 'grooves', 'y': 'scaly', 's': 'smooth'},
                'cap-color':{'n': 'brown', 'b': 'buff', 'c': 'cinnamon', 'g': 'gray', 'r': 'green', 'p': 'pink', 
                'u': 'purple', 'e': 'red', 'w': 'white', 'y': 'yellow'},
                'bruises':{'t': 'bruises', 'f': 'no'},
                'odor':{'a': 'almond', 'l': 'anise', 'c': 'creosote', 'y': 'fishy', 'f': 'foul', 'm': 'musty', 
                        'n': 'none', 'p': 'pungent', 's': 'spicy'},
                'gill-attachment':{'a': 'attached', 'd': 'descending', 'f': 'free', 'n': 'notched'},
                'gill-spacing':{'c': 'close', 'w': 'crowded', 'd': 'distant'},
                'gill-size':{'b': 'broad', 'n': 'narrow'},
                'gill-color':{'k': 'black', 'n': 'brown', 'b': 'buff', 'h': 'chocolate', 'g': 'gray', 'r': 'green', 
                            'o': 'orange', 'p': 'pink', 'u': 'purple', 'e': 'red', 
                                'w': 'white', 'y': 'yellow'},
                'stalk-shape':{'e': 'enlarging', 't': 'tapering'},
                'stalk-root':{'b': 'bulbous', 'c': 'club', 'u': 'cup', 'e': 'equal', 'z': 'rhizomorphs', 'r': 'rooted', 
                            '?': 'missing'},
                'stalk-surface-above-ring':{'f': 'fibrous', 'y': 'scaly', 'k': 'silky', 's': 'smooth'},
                'stalk-surface-below-ring':{'f': 'fibrous', 'y': 'scaly', 'k': 'silky', 's': 'smooth'},
                'stalk-color-above-ring':{'n': 'brown', 'b': 'buff', 'c': 'cinnamon', 'g': 'gray', 'o': 'orange', 
                                          'p': 'pink', 'e': 'red', 'w': 'white', 'y': 'yellow'},
                'stalk-color-below-ring':{'n': 'brown', 'b': 'buff', 'c': 'cinnamon', 'g': 'gray', 'o': 'orange', 
                                            'p': 'pink', 'e': 'red', 'w': 'white', 'y': 'yellow'},
                'veil-type':{'p': 'partial', 'u': 'universal'},
                'veil-color':{'n': 'brown', 'o': 'orange', 'w': 'white', 'y': 'yellow'},
                'ring-number':{'n': 'none', 'o': 'one', 't': 'two'},
                'ring-type':{'c': 'cobwebby', 'e': 'evanescent', 'f': 'flaring', 'l': 'large', 'n': 'none', 'p': 'pendant', 's': 'sheathing', 'z': 'zone'},
                'spore-print-color':{'k': 'black', 'n': 'brown', 'b': 'buff', 'h': 'chocolate', 'r': 'green', 'o': 'orange', 'u': 'purple', 'w': 'white', 'y': 'yellow'},
                'population':{'a': 'abundant', 'c': 'clustered', 'n': 'numerous', 's': 'scattered', 'v': 'several', 'y': 'solitary'},
                'habitat':{'g': 'grasses', 'l': 'leaves', 'm': 'meadows', 'p': 'paths', 'u': 'urban', 'w': 'waste', 'd': 'woods'}}

print(rename_dict)
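
Before applying a mapping of this size, it can help to check that it covers every code present in the data, since `replace` silently leaves unmapped values unchanged. The sketch below shows one way to do that. `find_unmapped_codes`, `toy` and `toy_map` are hypothetical names and data used only for illustration.

```python
import pandas as pd


def find_unmapped_codes(frame, mapping):
    """Return {column: sorted codes} for values that `mapping` does not cover."""
    missing = {}
    for column, codes in mapping.items():
        extra = set(frame[column].dropna().unique()) - set(codes)
        if extra:
            missing[column] = sorted(extra)
    return missing


# Toy example: 'x' is not a known code for `bruises`
toy = pd.DataFrame({"bruises": ["t", "f", "x"], "gill-size": ["b", "n", "b"]})
toy_map = {
    "bruises": {"t": "bruises", "f": "no"},
    "gill-size": {"b": "broad", "n": "narrow"},
}
print(find_unmapped_codes(toy, toy_map))  # {'bruises': ['x']}
```

On the real data the call would be `find_unmapped_codes(df, rename_dict)`, run before the codes are replaced. An empty dictionary means every code has a description.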

`rename_dict` can then be used to alter the dataset's contents, as in the following cell

In [None]:
for key, value in rename_dict.items():
    df[key] = df[key].replace(value)

df.head()

---

# Push files to Repo

In [None]:
import os

# Create the outputs/datasets/collection folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/mushrooms.csv", index=False)
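
A quick way to confirm that the saved file matches the DataFrame in memory is to read it back and compare. The sketch below does this on hypothetical toy data inside a temporary directory, so it does not touch the project's real outputs folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data standing in for the cleaned mushroom DataFrame (hypothetical values)
toy = pd.DataFrame({"edible": [1, 0], "odor": ["almond", "foul"]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "outputs" / "datasets" / "collection"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "mushrooms.csv"

    toy.to_csv(out_file, index=False)
    reloaded = pd.read_csv(out_file)

# True when the values and dtypes survive the round trip
round_trip_ok = toy.equals(reloaded)
print(round_trip_ok)
```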
