# Part 1 - Feature Extraction

article:

[1]: Cinar, Ilkay, and Murat Koklu. "Identification of rice varieties using machine learning algorithms." Journal of Agricultural Sciences (2022): 9-9.

https://dergipark.org.tr/en/download/article-file/1513632

## Introduction

- Rice is a popular ingredient all around the world. It is nutritious and can be grown in many places. Before high-quality rice reaches the table, it has to go through several processing phases in which unwanted material is removed. These phases require classification. In this task, we try to find the best model and hyperparameters for that classification.

- According to the article [1], the data was gathered by photographing different species of rice in a laboratory. The camera had 2.2 megapixels and a resolution of 2048 × 1088. However, the pictures are not used directly. Using image processing techniques, the contour of each rice grain and its geometric features are extracted and saved in a table for later analysis.

- In the image processing phase, the contours were extracted with OpenCV. After that, features were calculated for each grain: color features such as skewness from the RGB values of the pixels inside the contour, and shape features such as roundness from the contour itself.

## Preparations of the data

Make three folders in your working folder: "notebooks", "data" and "training_data". Save this notebook in the "notebooks" folder.
<br> <br>
Prepare the data:
- import all the packages needed for this notebook in one cell 
- import the images. The data can be downloaded from (the download starts as soon as you open the link) https://www.muratkoklu.com/datasets/vtdhnd09.php <br>
    - save the data folders "Arborio", "Basmati" and "Jasmine" in the "data" folder 
- take a random sample of 100 images from each of the Arborio, Basmati and Jasmine rice species (i.e. 300 images in total) 
- determine the contour of each rice grain (you can use e.g. *findContours* from OpenCV) 
- plot one example image of each rice species, including the contour 

In [1]:
import os
import numpy as np
import cv2
import random
import pandas as pd

<font color = red>A contour is a curve that joins all the continuous points along an object boundary that have the same color or intensity.</font>

In [42]:
Arborio_path = '../data/Arborio/' # path of the Arborio data set
Basmati_path = '../data/Basmati/' # path of the Basmati data set
Jasmine_path = '../data/Jasmine/' # path of the Jasmine data set

Arborio_arr = os.listdir(Arborio_path) # getting name of all the images in the path
Basmati_arr = os.listdir(Basmati_path)
Jasmine_arr = os.listdir(Jasmine_path)

Arborio_arr = random.sample(Arborio_arr,100) # making 100 random samples from each type
Basmati_arr = random.sample(Basmati_arr,100)
Jasmine_arr = random.sample(Jasmine_arr,100)

y_train = [] # creating empty lists
x_train = []

for element in Arborio_arr: # iterating through all the image names
  path = Arborio_path + element # building the path of each image
  I1 = cv2.imread(path) # reading the image as a 3D array of pixels in BGR order
  I1 = np.array(I1) # converting to numpy
  y_train.append(0) # label 0 for all the images in the 'Arborio' folder
  x_train.append(I1) # putting all the images from the 'Arborio' folder in x_train
  
for element in Basmati_arr:
  path = Basmati_path + element
  I1 = cv2.imread(path)
  I1 = np.array(I1)
  y_train.append(1) # label 1 for all the images in the 'Basmati' folder
  x_train.append(I1) # putting all the images from the 'Basmati' folder in x_train
  
for element in Jasmine_arr:
  path = Jasmine_path + element
  I1 = cv2.imread(path)
  I1 = np.array(I1)
  y_train.append(2) # label 2 for all the images in the 'Jasmine' folder
  x_train.append(I1) # putting all the images from the 'Jasmine' folder in x_train

In [43]:
contours_list = []

for image in x_train:
    imgray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) # making a grayscale image
    ret, thresh = cv2.threshold(imgray, 127, 255, 0) # making a binary image
    # extracting all contours
    contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    contours_list.append(contours)

In [44]:
number_Arborio = random.randint(0,99) # getting 1 random sample out of 100 (randint includes both ends)
number_Basmati = random.randint(100,199)
number_Jasmine = random.randint(200,299)

numbers = [number_Arborio, number_Basmati, number_Jasmine] # putting all numbers in a list

for number in numbers:
    im = x_train[number].copy() # drawing on a copy so the stored image stays unchanged for feature extraction
    cv2.drawContours(im, [contours_list[number][0]], -1, (0,255,0), 3) # drawing the contour in green
    cv2.imshow('Contours', im)
    cv2.waitKey(0) # keeping the window open until a key is pressed
cv2.destroyAllWindows() # closing all windows

## Feature extraction

Gathering the feature data <br>
<br>
Color features <br>
In this section, I:
- Calculate the following color features for each image, including only the pixels within the contour (*pointPolygonTest* from OpenCV)
    - Mean for each RGB color channel 
    - Variance for each RGB color channel 
    - Skewness for each RGB color channel 
    - Kurtosis for each RGB color channel 
    - Entropy for each RGB color channel 
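
As a sketch of the color features listed above, the statistics can be computed per channel from the pixels selected by a boolean mask. The function name `channel_stats` and the toy data are hypothetical; in the notebook the mask would come from the contour. Skewness and excess kurtosis are written out from central moments (the same biased estimators that `scipy.stats.skew` and `scipy.stats.kurtosis` use by default). Entropy is taken here as the Shannon entropy of the 256-bin intensity histogram, which is one common reading of "entropy of a channel" and is an assumption, not something confirmed by the article.

```python
import numpy as np


def channel_stats(values):
    """Mean, variance, skewness, excess kurtosis and histogram entropy of one channel."""
    x = np.asarray(values, dtype=np.float64)
    mu = x.mean()
    centered = x - mu
    m2 = np.mean(centered ** 2)  # variance (population form, like np.var)
    m3 = np.mean(centered ** 3)
    m4 = np.mean(centered ** 4)
    skewness = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    kurt = m4 / m2 ** 2 - 3.0 if m2 > 0 else 0.0
    # Shannon entropy (in bits) of the intensity histogram, assuming 8-bit values
    counts = np.bincount(x.astype(np.int64), minlength=256)
    p = counts[counts > 0] / counts.sum()
    entropy_bits = float(-(p * np.log2(p)).sum())
    return {"mean": mu, "variance": m2, "skew": skewness,
            "kurtosis": kurt, "entropy": entropy_bits}


# Toy 2x2 image in BGR order (as returned by cv2.imread) and a mask that keeps three pixels.
image = np.array([[[10, 20, 30], [10, 40, 30]],
                  [[20, 20, 30], [20, 40, 90]]], dtype=np.uint8)
mask = np.array([[True, True], [True, False]])

blue, green, red = (image[..., c][mask] for c in range(3))
red_stats = channel_stats(red)
blue_stats = channel_stats(blue)
```

Selecting pixels with a mask replaces the per-pixel loop with a single indexing operation, which is usually much faster on full-size images.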
    
Dimension features <br>
In this section, I:
- Fit an ellipse to the contour points (*fitEllipse* from OpenCV) 
- Plot one example image of each rice species including the fitted ellipse 
- Calculate the following features for each image (for details, see the original article)
    - the major axis length of the ellipse 
    - the minor axis length of the ellipse 
    - area inside the contour (*contourArea* from OpenCV) 
    - perimeter of the contour (*arcLength* from OpenCV) 
    - roundness 
    - aspect ratio 
    
Then I gather all the features in one array or dataframe: one data point per row, with all feature values in columns.  <br>
For each data point, the name of the original image and the label (rice species) are also included. The data is saved in the "training_data" folder. 
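
One way to gather the rows, sketched under assumptions: the file names and numbers below are hypothetical placeholders, only a few of the columns are shown, and a temporary folder stands in for "training_data" so the example runs anywhere.

```python
import os
import tempfile

import pandas as pd

# Hypothetical feature rows; in the notebook each row would hold all 23 columns.
rows = [
    {"name": "example_a.jpg", "Class": "Arborio", "Area": 7450.0, "Perimeter": 380.0},
    {"name": "example_b.jpg", "Class": "Basmati", "Area": 6100.0, "Perimeter": 410.0},
]

# Building the frame once from a list of dicts avoids repeated pd.concat calls.
features = pd.DataFrame(rows)
features["Roundness"] = features["Perimeter"] ** 2 / features["Area"]

# A temporary folder stands in for "training_data".
out_path = os.path.join(tempfile.mkdtemp(), "df.csv")
features.to_csv(out_path, index=False)
restored = pd.read_csv(out_path)
```

Writing with `index=False` keeps the row index out of the CSV, so reading the file back gives the same columns that were saved.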

In [45]:
dfcolumns = ['name', 'Class', 'RMean', 'GMean', 'BMean', 'RVariance', 'GVariance', 'BVariance', 'RSkew',
             'GSkew', 'BSkew', 'RKurtosis','GKurtosis','BKurtosis', 'REntropy', 'GEntropy', 'BEntropy',
             'MajL', 'MinL', 'Area', 'Perimeter', 'Roundness', 'AspectRatio']
df = pd.DataFrame(columns=dfcolumns)

In [46]:
from scipy.stats import skew, entropy, kurtosis
from statistics import mean

for i, image in enumerate(x_train): # we need to do the calculation for each image
    data = []
    if i < 100: # images 0-99 are Arborio
        data.append(Arborio_arr[i]) # file name of the image
        data.append('Arborio') # attaching its type
        
    elif i < 200: # images 100-199 are Basmati
        data.append(Basmati_arr[i - 100])
        data.append('Basmati')
        
    elif i < 300: # images 200-299 are Jasmine
        data.append(Jasmine_arr[i - 200])
        data.append('Jasmine')
        
    pixel_list_R = [] # one list per color channel
    pixel_list_G = []
    pixel_list_B = []
    
    for j, row in enumerate(image): # two nested loops visit every pixel (j = row / y, k = column / x)
        for k, pixel in enumerate(row):
            # checking if the pixel is on the border or inside the contour;
            # contour points are stored as (x, y), so the test point is (k, j)
            if cv2.pointPolygonTest(contour=contours_list[i][0], pt=(k, j),
                                    measureDist=False) > -1:
                # cv2.imread returns pixels in BGR order
                pixel_list_B.append(pixel[0])
                pixel_list_G.append(pixel[1])
                pixel_list_R.append(pixel[2])
                
    # np.mean accumulates integer input in float64, so the uint8 pixel values cannot overflow
    data.append(np.mean(pixel_list_R))
    data.append(np.mean(pixel_list_G))
    data.append(np.mean(pixel_list_B))
    data.append(np.var(pixel_list_R))
    data.append(np.var(pixel_list_G))
    data.append(np.var(pixel_list_B))
    data.append(skew(pixel_list_R))
    data.append(skew(pixel_list_G))
    data.append(skew(pixel_list_B))
    data.append(kurtosis(pixel_list_R))
    data.append(kurtosis(pixel_list_G))
    data.append(kurtosis(pixel_list_B))
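    # note: scipy.stats.entropy normalizes its input and treats it as a probability distribution,
    # so this is the entropy of the normalized pixel values, not of the intensity histogram
    data.append(entropy(pixel_list_R))
    data.append(entropy(pixel_list_G))
    data.append(entropy(pixel_list_B))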
    
    # fitEllipse fits an ellipse to our contour and returns the x and y of its center,
    # the lengths of its two axes and the rotation angle
    (xc,yc),(d1,d2),angle = cv2.fitEllipse(contours_list[i][0])
    data.append(max(d1, d2)) # major
    data.append(min(d1, d2)) # minor
    area = cv2.contourArea(contours_list[i][0]) # calculates the area of a contour
    data.append(area)
    perimeter = cv2.arcLength(contours_list[i][0],True) # calculates the perimeter of a contour
    data.append(perimeter)
    
    data.append(perimeter**2/area) # roundness is taken here as the perimeter squared divided by the area
    x,y,w,h = cv2.boundingRect(contours_list[i][0])
    # aspect ratio is the ratio between the width and the height of the object;
    # to get it, we fit the contour in an upright bounding rectangle and divide its width
    # by its height
    data.append(float(w)/h)
    
    # ignore_index=True gives each row its own index instead of repeating 0
    df = pd.concat([df, pd.DataFrame([data], columns=dfcolumns)], ignore_index=True)

<font color = red> MEAN = the average value of a data set

VARIANCE = the mean squared difference between each value of a data set and the mean of the data set

ASPECT RATIO = the ratio between the width and the height of the object's bounding rectangle

ROUNDNESS = the perimeter squared divided by the area</font>
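
To get a feeling for the roundness definition above, here is a small worked example with two ideal shapes. The helper `roundness` is a hypothetical name used only for this illustration. With this definition a circle gives the smallest possible value, 4π ≈ 12.57, and less compact shapes give larger values.

```python
import math


def roundness(perimeter, area):
    """Roundness as defined in this notebook: perimeter squared divided by area."""
    return perimeter ** 2 / area


r = 5.0
circle = roundness(2 * math.pi * r, math.pi * r ** 2)  # equals 4*pi for any radius
side = 5.0
square = roundness(4 * side, side ** 2)                # equals 16 for any side length
```

An elongated rice grain is expected to score higher than both of these shapes.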

In [50]:
df.to_csv('../training_data/df.csv') # saving the data frame to the csv file

In [51]:
number_Arborio = random.randint(0,99) # getting 1 random sample out of 100 (randint includes both ends)
number_Basmati = random.randint(100,199)
number_Jasmine = random.randint(200,299)

numbers = [number_Arborio, number_Basmati, number_Jasmine] # putting all numbers in a list

for number in numbers:
    cnt = contours_list[number][0] # taking out the contour we want
    ellipse = cv2.fitEllipse(cnt) # fitting an ellipse on it
    im = x_train[number].copy() # drawing on a copy keeps the stored image unchanged
    cv2.ellipse(im, ellipse, (0,0,255), 3) # drawing the ellipse in red (BGR order)
    cv2.imshow("Ellipse", im) # showing the plot
    cv2.waitKey(0)
cv2.destroyAllWindows()
