Tutorial on [writing CSV files in Python](https://www.pythontutorial.net/python-basics/python-write-csv-file/)

Useful information on [pathlib](https://www.atqed.com/python-current-path)

In [None]:
import numpy as np
import pandas as pd

import pathlib
import IPython.display as display
from PIL import Image

import csv

---

### File Path Construction

In [None]:
home_path = str(pathlib.Path.home())
# get current working directory
cwd = pathlib.Path.cwd()

# build the complete paths to the training and test image folders
# `.joinpath()` inserts the path separator of the current operating system
train_data_dir = cwd.joinpath('data', 'train_images')
test_data_dir = cwd.joinpath('data', 'test_images')
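
As a minimal sketch of what `.joinpath()` does on different platforms, the pure path classes of `pathlib` can be used, since they work without touching the file system. The root folders `/project` and `C:/project` are hypothetical and only serve as an illustration.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The same `.joinpath()` call uses the separator of the respective path flavour.
# `/project` and `C:/project` are made-up roots for illustration.
posix_dir = PurePosixPath('/project').joinpath('data', 'train_images')
windows_dir = PureWindowsPath('C:/project').joinpath('data', 'train_images')

print(posix_dir)               # POSIX flavour: forward slashes
print(windows_dir)             # Windows flavour: backslashes
print(windows_dir.as_posix())  # forward slashes, regardless of the flavour
```

`pathlib.Path.cwd()` returns the flavour that matches the machine the notebook runs on, so the notebook code itself never has to spell out a separator.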

In [None]:
test_data_dir

In [None]:
# Count number of images in folder
image_count = len(list(train_data_dir.glob('*.jpg')))
print("We have", image_count, "training images.")

In [None]:
# print the header and the first 2 rows via UNIX commands
!head -3 data/train.csv > /tmp/input.csv
!cat /tmp/input.csv

In [None]:
# Display a few images
images = list(train_data_dir.glob('*.jpg'))

for image in images[:5]:
    display.display(Image.open(str(image)))
    print(image.as_posix())

To build the complete CSV file, we start from `train.csv`, which lists only images with a defect. An image with more than one defect occupies more than one line there, so we first extract the unique `ImageIds`. Images without a defect are missing from `train.csv` and have to be added: we construct a complete list of all images, drop every row whose `ImageId` already occurs in `train.csv`, and concatenate the remaining rows with `train.csv`.
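
The plan can be sketched on a toy example before applying it to the real data. All names prefixed with `toy_` and the file names `a.jpg` to `d.jpg` are hypothetical; this sketch filters with `isin`, which is one possible way to drop the rows and not necessarily the one used in the cells that follow.

```python
import pandas as pd

# Hypothetical stand-in for `train.csv`: `a.jpg` has two defects and therefore two lines.
toy_defects = pd.DataFrame({
    'ImageId': ['a.jpg', 'a.jpg', 'c.jpg'],
    'ClassId': [1, 3, 2],
    'EncodedPixels': ['1 2', '5 3', '7 1'],
})

# Hypothetical stand-in for the complete image list, initialised with zeros.
toy_all_images = pd.DataFrame({
    'ImageId': ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg'],
    'ClassId': 0,
    'EncodedPixels': '0',
})

# keep only the images that never occur in the defect table
toy_missing = toy_all_images[~toy_all_images.ImageId.isin(toy_defects.ImageId)]

# defect rows first, then the rows of the images without a defect
toy_complete = pd.concat([toy_defects, toy_missing], axis=0, ignore_index=True)
print(toy_complete)
```

In the result every image appears at least once, and images with several defects keep one row per defect.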

---

### Prepare train.csv

In [None]:
df_defects = pd.read_csv('data/train.csv')
# create the image paths for all rows in `train.csv`
defect_paths = df_defects.ImageId.apply(lambda x: train_data_dir.joinpath(x))
# add column to the left of the data frame
df_defects = pd.concat([pd.Series(defect_paths, name='FilePath'), df_defects], axis = 1)
df_defects.FilePath[0]

In [None]:
# isolate `ImageIds` for images with defect
defect_ids = df_defects.ImageId.unique()

---

### Building the CSV-File

Create a csv file with all image paths, the respective `ImageId` and an initialisation for `ClassId` and `EncodedPixels`.
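
As a minimal sketch of the intended file layout, the snippet below writes two rows into an in-memory buffer instead of a file and reads them back. The file names `img_1.jpg` and `img_2.jpg` are hypothetical.

```python
import csv
import io

header = ['FilePath', 'ImageId', 'ClassId', 'EncodedPixels']

# hypothetical rows, for illustration only
demo_rows = [
    ['data/train_images/img_1.jpg', 'img_1.jpg', 0, '0'],
    ['data/train_images/img_2.jpg', 'img_2.jpg', 0, '0'],
]

# in-memory text buffer; `newline=''` mirrors the recommended setting for csv files
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(header)     # one header line
writer.writerows(demo_rows) # one line per image

# read the buffer back to inspect the layout
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed)
```

Note that `csv.reader` returns every field as a string, so the integer `0` written for `ClassId` comes back as `'0'`. `pd.read_csv` infers column types when it loads such a file.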

In [None]:
header = ['FilePath', 'ImageId', 'ClassId', 'EncodedPixels']

rows = []

for image in images:
    # `.as_posix()` returns the complete path as a string with forward slashes
    # `.name` returns the file name of the image
    # initialise `ClassId` with 0 and `EncodedPixels` with '0'
    rows.append([image.as_posix(), image.name, 0, '0'])
    
with open(train_data_dir.parent.joinpath('train_raw.csv'), # `.parent` returns the path up to the data directory
          'w', 
          encoding = 'UTF8',
          newline = '' # avoid blank lines between rows
         ) as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows) # write all rows to the file

In [None]:
df_raw = pd.read_csv('data/train_raw.csv')

# get the indices of all rows in `df_raw` that belong to images with a defect
# `.isin()` checks all rows at once instead of looping over them
indices = df_raw.index[df_raw.ImageId.isin(defect_ids)]

In [None]:
# check whether the indices of all defective images were caught
len(indices)

In [None]:
df_raw.drop(indices, inplace=True)
df_raw

In [None]:
# append the rows of the images without a defect to the rows of the defective images
df_complete = pd.concat([df_defects, df_raw], axis=0, ignore_index=True)
df_complete['Defect'] = df_complete.ClassId.apply(lambda x: 1 if x > 0 else 0)
df_complete.to_csv('data/train_complete.csv', sep=',', index=False)
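
The `Defect` flag can be sketched on a toy frame. The frame `toy` and its file names are hypothetical; the vectorised comparison is shown as a possible alternative to the `lambda`, assuming that `ClassId` holds non-negative integers.

```python
import pandas as pd

# hypothetical frame: `a.jpg` has two defects, `b.jpg` has none (`ClassId` 0)
toy = pd.DataFrame({
    'ImageId': ['a.jpg', 'a.jpg', 'b.jpg'],
    'ClassId': [1, 3, 0],
})

# row-wise version, as in the cell above
flag_apply = toy.ClassId.apply(lambda x: 1 if x > 0 else 0)

# vectorised alternative: compare the whole column, then cast the booleans to integers
toy['Defect'] = (toy.ClassId > 0).astype(int)

print(toy)
```

Both versions yield the same flags; the vectorised form avoids one Python function call per row.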

In [None]:
# eliminate unused csv file
!rm -f data/train_raw.csv