# **Data Collection - Brain Tumor Detector**

## Objectives

* Import the necessary packages
* Set working directory
* Fetch the image dataset from Kaggle (imported manually because of Gitpod's 100MB size restriction)
* Clean data so that only image files are used
* Split the data into Test, Train and Validation folders

## Inputs

* Kaggle Dataset: https://www.kaggle.com/datasets/preetviradiya/brian-tumor-dataset
* Kaggle JSON file - authentication token

## Outputs

* Once the dataset is imported and split, it will have the following structure:

* input
    * mri-scans
        * test
            * healthy
            * tumor
        * train
            * healthy
            * tumor
        * validation
            * healthy
            * tumor


## Comments/Conclusions

* These steps fetch the data, clean it and divide it into subsets for machine learning. In the next step, we will use data visualisation to understand the data and look for patterns in it.


---

## Import Packages

In [5]:
%pip install -r /workspace/brain-tumor-detector/requirements.txt

Collecting numpy==1.19.2
  Downloading numpy-1.19.2-cp38-cp38-manylinux2010_x86_64.whl (14.5 MB)
Collecting pandas==1.1.2
  Downloading pandas-1.1.2-cp38-cp38-manylinux1_x86_64.whl (10.4 MB)
Collecting matplotlib==3.3.1
  Downloading matplotlib-3.3.1-cp38-cp38-manylinux1_x86_64.whl (11.6 MB)
Collecting seaborn==0.11.0
  Downloading seaborn-0.11.0-py3-none-any.whl (283 kB)
Collecting plotly==4.12.0
  Downloading plotly-4.12.0-py2.py3-n

In [6]:
import numpy
import os

## Change working directory

* The notebooks are stored in a sub-folder, so when running a notebook in the editor the working directory must be changed from its current folder to its parent folder.

In [7]:
current_dir = os.getcwd()
current_dir

'/workspace/brain-tumor-detector'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [8]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [9]:
current_dir = os.getcwd()
current_dir

'/workspace'
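
The effect of `os.path.dirname()` on the path shown above can be checked in isolation. The following is a minimal sketch using the directory string from the earlier cell output; `parent_of` is a hypothetical helper name and is not part of the notebook.

```python
import os


def parent_of(path):
    """Return the parent directory of `path` (hypothetical helper)."""
    # normpath strips a trailing slash so dirname returns the true parent
    return os.path.dirname(os.path.normpath(path))


print(parent_of('/workspace/brain-tumor-detector'))
```

Normalising first is a small safeguard: without it, a path ending in `/` would make `os.path.dirname()` return the same folder rather than its parent.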

## Install Kaggle

In [6]:
%pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm
  Downloading tqdm-4.66.2-py3-none-any.whl (78 kB)
Collecting python-slugify
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Installing collected packages: text-unidecode, tqdm, python-slugify, kaggle
  DEPRECATION: kaggle is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1

Point Kaggle at the directory containing the JSON token file and restrict the file's permissions

In [7]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
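
The `chmod 600` step limits the token file to read and write access for its owner only. A rough Python equivalent is sketched below, assuming a POSIX file system; `restrict_token` is an illustrative name and not part of the Kaggle package.

```python
import os
import stat


def restrict_token(token_path):
    """Set owner-only read/write permissions (0o600) on a token file."""
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"No token file at {token_path}")
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    # Return the permission bits so the caller can confirm the change
    return stat.S_IMODE(os.stat(token_path).st_mode)
```

Calling `restrict_token('kaggle.json')` from the folder that holds the token should return `0o600` on Linux; raising early when the file is missing makes a forgotten upload easier to spot.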

**The dataset has to be imported manually because the original dataset is larger than 100MB, which exceeds Gitpod's size limit.**

---

## Data Preparation

---

### Data Cleaning

Check and remove non-image files

In [1]:
import os


def remove_non_image_file(my_data_dir):
    """Delete every file in each class folder that is not a .png/.jpg/.jpeg image."""
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(os.path.join(my_data_dir, folder))
        image_count = 0
        non_image_count = 0
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                # remove non-image files
                os.remove(os.path.join(my_data_dir, folder, given_file))
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [8]:
remove_non_image_file(my_data_dir='input/mri-scans')

Folder: healthy - has image file 1922
Folder: healthy - has non-image file 0
Folder: tumor - has image file 1788
Folder: tumor - has non-image file 0
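
Because the cleaning function deletes files as soon as it finds them, it can be useful to list the candidates first. The following is a minimal dry-run sketch, assuming the same layout of one sub-folder per class; `find_non_image_files` and `IMAGE_EXTENSIONS` are hypothetical names introduced for illustration.

```python
import os

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def find_non_image_files(data_dir, extensions=IMAGE_EXTENSIONS):
    """Return paths of files in each class folder that are not images."""
    non_images = []
    for folder in sorted(os.listdir(data_dir)):
        folder_path = os.path.join(data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        for name in sorted(os.listdir(folder_path)):
            # compare in lower case so '.PNG' and '.png' are treated alike
            if not name.lower().endswith(extensions):
                non_images.append(os.path.join(folder_path, name))
    return non_images
```

An empty list from `find_non_image_files('input/mri-scans')` would agree with the zero non-image counts printed above.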


### Split train validation test set

In [10]:
import math
import os
import shutil
import random


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

    # compare with a tolerance: exact float equality is fragile for sums of ratios
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

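    # get the class labels
    labels = os.listdir(my_data_dir)  # expected to contain only the class folder names
    if 'test' in labels:
        # a 'test' folder means the data has already been split, so do nothing
        pass
    else:
        # create train, validation and test folders, each with a sub-folder per class label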
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(my_data_dir + '/' + label)
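
The split above shuffles with the global random state, so each run can assign files to different subsets. If a repeatable split is wanted, one option is a seeded shuffle. This is a sketch under that assumption; `shuffled_copy` and the default seed of 42 are illustrative choices, not part of the notebook.

```python
import random


def shuffled_copy(file_names, seed=42):
    """Return a reproducibly shuffled copy of `file_names`."""
    # Sort first: os.listdir() returns entries in arbitrary order
    ordered = sorted(file_names)
    # A local Random instance leaves the global random state untouched
    random.Random(seed).shuffle(ordered)
    return ordered
```

Replacing `random.shuffle(files)` with `files = shuffled_copy(files)` would be one way to apply this inside the function.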

* The training set receives 70% of the data (ratio 0.70).
* The validation set receives 10% of the data (ratio 0.10).
* The test set receives the remaining 20% of the data (ratio 0.20).
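
The ratios translate into file counts through `int()` truncation, with the test set taking whatever remains. The sketch below mirrors that arithmetic for the class sizes reported by the cleaning step (1922 healthy, 1788 tumor). `split_counts` and `ratios_are_valid` are hypothetical helper names. The tolerance check is included because exact float comparison can be fragile: `0.6 + 0.3 + 0.1`, for instance, evaluates to `0.9999999999999999` in double precision.

```python
import math


def ratios_are_valid(train, validation, test):
    """True when the three ratios sum to 1 within floating-point tolerance."""
    return math.isclose(train + validation + test, 1.0)


def split_counts(n_files, train_ratio, validation_ratio):
    """Mirror the int() truncation used when assigning files to subsets."""
    n_train = int(n_files * train_ratio)
    n_validation = int(n_files * validation_ratio)
    n_test = n_files - n_train - n_validation  # test set takes the remainder
    return n_train, n_validation, n_test


print(split_counts(1922, 0.7, 0.1))  # healthy
print(split_counts(1788, 0.7, 0.1))  # tumor
```

With these inputs the healthy class splits into 1345 / 192 / 385 files and the tumor class into 1251 / 178 / 359, so the test share is slightly above 20% because of the rounding down.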

In [11]:
split_train_validation_test_images(my_data_dir="input/mri-scans",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
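
The split cell prints nothing on success, so a quick count per folder can confirm the result. This is a minimal sketch, assuming the train/validation/test layout described in the Outputs section; `count_split_files` is a hypothetical helper.

```python
import os


def count_split_files(data_dir, splits=('train', 'validation', 'test')):
    """Return {split: {label: file_count}} for an already split dataset."""
    counts = {}
    for split in splits:
        split_path = os.path.join(data_dir, split)
        if not os.path.isdir(split_path):
            continue  # split folder missing, e.g. the split has not run yet
        counts[split] = {
            label: len(os.listdir(os.path.join(split_path, label)))
            for label in sorted(os.listdir(split_path))
        }
    return counts
```

Running `count_split_files('input/mri-scans')` after the split should return one entry per subset, each with a count for `healthy` and `tumor`.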

---

## Push files to Repo

* If necessary, uncomment the code below

In [None]:
# import os
# try:
#   # create here your folder
#   # os.makedirs(name='')
# except Exception as e:
#   print(e)
