# Labeling the dataset

This notebook requires the following libraries:

- NumPy
- pandas
- glob
- os

This notebook reads the images in `data_dir`, labels each one by its class directory, and saves the labels as a CSV file at the path built from `output_dir` and `output_filename`.

## Imports

In [13]:
import numpy as np
import glob
import os

import pandas as pd

## Loading the data and normalising dimensions

### Directories

In [14]:
# Main data directory (contains the 'Parasitized' and 'Uninfected' subdirectories)
data_dir = '../Data/cell-images-for-detecting-malaria/cell_images/'

# Output directory and name of the CSV file
output_dir = '../Data/'
output_filename = 'labels.csv'

### Exception handling

In [15]:
# Stop early if the images have not been downloaded yet
if not os.path.isdir(data_dir):
    raise FileNotFoundError("Data directory not found! Please run the data_download.ipynb notebook before proceeding.")

In [16]:
# Create the output directory if it does not exist yet
os.makedirs(output_dir, exist_ok=True)

### Data loading

In [17]:
file_list_parasitized = glob.glob(os.path.join(data_dir, 'Parasitized', '*.png'))
file_list_uninfected  = glob.glob(os.path.join(data_dir, 'Uninfected', '*.png'))
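
`glob.glob` returns matches in an arbitrary, filesystem-dependent order, so the row order of the resulting CSV may differ between machines. If a reproducible order matters, one option is to sort the paths. The following is a minimal sketch; `list_png_files` is a hypothetical helper name and not part of the original notebook.

```python
import glob
import os


def list_png_files(class_dir):
    """Return the PNG paths inside class_dir, sorted by path.

    list_png_files is a hypothetical helper, not part of the notebook.
    """
    # glob.glob makes no ordering guarantee, so sort explicitly.
    return sorted(glob.glob(os.path.join(class_dir, '*.png')))
```

In this notebook it could be called as `list_png_files(os.path.join(data_dir, 'Parasitized'))`, and likewise for `'Uninfected'`.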

In [18]:
n_parasitized = len(file_list_parasitized)
n_uninfected  = len(file_list_uninfected)

n_parasitized, n_uninfected

(13779, 13779)

In [19]:
file_list_parasitized = np.array(file_list_parasitized)
file_list_uninfected  = np.array(file_list_uninfected)

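# Turn each list into a single column so a label column can be added next to it.
# The reshape method is used here because the newshape keyword of np.reshape is deprecated in recent NumPy releases.
file_list_parasitized = file_list_parasitized.reshape(n_parasitized, 1)
file_list_uninfected  = file_list_uninfected.reshape(n_uninfected, 1)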

In [20]:
# Add a label column: 1 for parasitized, 0 for uninfected
file_list_parasitized = np.append(file_list_parasitized, np.ones(file_list_parasitized.shape), axis = 1)
file_list_uninfected  = np.append(file_list_uninfected , np.zeros(file_list_uninfected.shape), axis = 1)

In [21]:
file_list_parasitized.shape, file_list_uninfected.shape

((13779, 2), (13779, 2))

In [22]:
file_list = np.append(file_list_uninfected, file_list_parasitized, axis = 0)

file_list.shape

(27558, 2)

Each of the two directories contains 13779 images, so the classes are balanced. The combined array has 27558 rows, one per image.

## Creating a labeled dataframe

In [23]:
df = pd.DataFrame(file_list, columns = ['Image_Path', 'Parasitized'])
df

| | Image_Path | Parasitized |
|---|---|---|
| 0 | ../Data/cell-images-for-detecting-malaria/cell... | 0.0 |
| 1 | ../Data/cell-images-for-detecting-malaria/cell... | 0.0 |
| 2 | ../Data/cell-images-for-detecting-malaria/cell... | 0.0 |
| 3 | ../Data/cell-images-for-detecting-malaria/cell... | 0.0 |
| 4 | ../Data/cell-images-for-detecting-malaria/cell... | 0.0 |
| ... | ... | ... |
| 27553 | ../Data/cell-images-for-detecting-malaria/cell... | 1.0 |
| 27554 | ../Data/cell-images-for-detecting-malaria/cell... | 1.0 |
| 27555 | ../Data/cell-images-for-detecting-malaria/cell... | 1.0 |
| 27556 | ../Data/cell-images-for-detecting-malaria/cell... | 1.0 |
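
The labels in the output above appear as `0.0` and `1.0` because they come from `np.ones` and `np.zeros`, which produce floats by default. If integer labels are preferred, the frame can also be built directly in pandas from the two path lists. The following is a sketch under that assumption; `build_label_frame` is a hypothetical helper and the file names are made up.

```python
import pandas as pd


def build_label_frame(uninfected_paths, parasitized_paths):
    """Build a frame with an Image_Path column and an integer Parasitized column.

    build_label_frame is a hypothetical helper, not part of the notebook.
    """
    paths = list(uninfected_paths) + list(parasitized_paths)
    # 0 marks an uninfected cell, 1 a parasitized one.
    labels = [0] * len(uninfected_paths) + [1] * len(parasitized_paths)
    return pd.DataFrame({'Image_Path': paths, 'Parasitized': labels})


# Made-up file names, for illustration only.
example = build_label_frame(['u1.png', 'u2.png'], ['p1.png'])
print(example)
```

The uninfected rows come first here, matching the order used when the two arrays are concatenated in the notebook.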


In [24]:
df.to_csv(os.path.join(output_dir, output_filename), index = False)
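
After writing the file, it can be useful to read it back and confirm the column names and class counts. The sketch below does this with a temporary file and a few made-up rows so that it runs on its own; in the notebook, the path would be `os.path.join(output_dir, output_filename)`.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the real labels, for illustration only.
sample = pd.DataFrame({
    'Image_Path': ['u1.png', 'u2.png', 'p1.png', 'p2.png'],
    'Parasitized': [0.0, 0.0, 1.0, 1.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'labels.csv')
    sample.to_csv(csv_path, index=False)

    # Read the file back and count the rows per class.
    loaded = pd.read_csv(csv_path)
    class_counts = loaded['Parasitized'].value_counts()

print(class_counts)

# The classes are balanced when every class has the same count.
is_balanced = class_counts.nunique() == 1
print(is_balanced)
```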