# CM3151 Coursework

## Dataset (around 300 words)

For my coursework, I have chosen the 'Seeds' classification dataset, retrieved from the UCI Machine Learning Repository (https://archive.ics.uci.edu/dataset/236/seeds) on 23rd October at 3:30pm.

The 'Seeds' dataset contains instances of individual seeds, each described by the following 7 features: area, perimeter, compactness, length, width, asymmetry coefficient and length of kernel groove. These features were measured for each seed using soft X-ray photography. Each seed belongs to one of three wheat varieties: Kama, Rosa or Canadian.

The aim of this project is to find the classification model that is most effective at identifying seed varieties. This will involve creating different classifiers and evaluating each model's accuracy at predicting the correct labels.


## Research Paper (around 300 words)

## Data Pre-processing

In [195]:
import numpy as np
import pandas as pd

In [196]:
seeds_df = pd.read_csv('data/seeds_dataset.txt')

seeds_df.head()

Unnamed: 0,15.26,14.84,0.871,5.763,3.312,2.221,5.22,1
0,14.88,14.57,0.8811,5.554,3.333,1.018,4.956,1
1,14.29,14.09,0.905,5.291,3.337,2.699,4.825,1
2,13.84,13.94,0.8955,5.324,3.379,2.259,4.805,1
3,16.14,14.99,0.9034,5.658,3.562,1.355,5.175,1
4,14.38,14.21,0.8951,5.386,3.312,2.462,4.956,1


The file 'seeds_dataset.txt' does not contain comma-separated values. Instead, each value is separated by a tab.

The code below opens the file, replaces every tab with a comma and writes the result back to the same file.

In [197]:
import re

#Read the whole file into a string
with open('data/seeds_dataset.txt', 'r') as file:
    seed_data = file.read()

#Replace each tab with a comma
seed_data = re.sub("\t", ",", seed_data)

#Overwrite the file with the comma-separated version
with open('data/seeds_dataset.txt', 'w') as file:
    file.write(seed_data)
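
An alternative that leaves the downloaded file untouched is to let pandas split on whitespace directly. The sketch below is an illustration, not the approach used in this notebook: it parses a hypothetical three-row sample held in memory (values copied from the rows shown above), and it assumes the real file may contain runs of more than one tab, which `sep=r'\s+'` treats as a single separator.

```python
import io

import pandas as pd

# Hypothetical in-memory sample mimicking the tab-separated layout of the file.
# The second row deliberately contains a double tab (an assumption about the raw data).
sample = (
    "15.26\t14.84\t0.871\t5.763\t3.312\t2.221\t5.22\t1\n"
    "14.88\t14.57\t\t0.8811\t5.554\t3.333\t1.018\t4.956\t1\n"
    "14.29\t14.09\t0.905\t5.291\t3.337\t2.699\t4.825\t1\n"
)

# sep=r'\s+' splits on any run of whitespace; header=None stops pandas
# from using the first row of values as column names
sample_df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)

print(sample_df.shape)
```

With the real data, `io.StringIO(sample)` would be replaced by the file path, and no rewrite of the file would be needed.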


In [198]:
seeds_df = pd.read_csv('data/seeds_dataset.txt')

#View column names
seeds_df.columns

Index(['15.26', '14.84', '0.871', '5.763', '3.312', '2.221', '5.22', '1'], dtype='object')

The data file also does not include the feature names, so pandas has mistakenly used the first row of values as the column names.

The following code reads the file again and supplies the column names through the `names` parameter.

The column names were taken from the dataset description on the UCI Machine Learning Repository.


In [199]:
#List of column names
seeds_features = ['Area','Perimeter','Compactness','Length','Width','Asymmetry Coefficient','Groove Length', 'Target']

#The 'names' parameter uses the names in the list as the column headings
seeds_df = pd.read_csv('data/seeds_dataset.txt', names=seeds_features)

seeds_df.head()

Unnamed: 0,Area,Perimeter,Compactness,Length,Width,Asymmetry Coefficient,Groove Length,Target
0,15.26,14.84,0.871,5.763,3.312,2.221,5.22,1
1,14.88,14.57,0.8811,5.554,3.333,1.018,4.956,1
2,14.29,14.09,0.905,5.291,3.337,2.699,4.825,1
3,13.84,13.94,0.8955,5.324,3.379,2.259,4.805,1
4,16.14,14.99,0.9034,5.658,3.562,1.355,5.175,1


Currently, the target column values are integers. To make the dataset more readable, we will replace the integers with the corresponding wheat varieties: Kama (1), Rosa (2) and Canadian (3).

The wheat varieties were identified from the research paper 'Wheat Seed Classification: Utilizing Ensemble Machine Learning Approach' by Ajay Khatri, Shweta Agrawal and Jyotir M. Chatterjee (2022).

In [200]:
#Subtract 1 from the target values so they become zero-based codes (0, 1, 2), all < 3
seeds_df.loc[:, 'Target'] = seeds_df['Target'] - 1
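
The subtraction matters because `pd.Categorical.from_codes`, used in a later cell, treats each code as a zero-based position in the list of category names. A minimal sketch of this behaviour, using a hypothetical three-value array in place of the full target column:

```python
import numpy as np
import pandas as pd

target_names = np.array(['Kama', 'Rosa', 'Canadian'])

# Hypothetical stand-in for the original 1-3 labels in the data file
raw_targets = np.array([1, 2, 3])

# The code 3 has no matching position in a list of three names, so this is rejected
try:
    pd.Categorical.from_codes(raw_targets, target_names)
    raw_codes_accepted = True
except ValueError:
    raw_codes_accepted = False

# After subtracting 1, the codes 0, 1 and 2 index the names directly
varieties = pd.Categorical.from_codes(raw_targets - 1, target_names)

print(raw_codes_accepted)
print(list(varieties))
```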


In [201]:
seeds_df.head(210)

Unnamed: 0,Area,Perimeter,Compactness,Length,Width,Asymmetry Coefficient,Groove Length,Target
0,15.26,14.84,0.8710,5.763,3.312,2.221,5.220,0
1,14.88,14.57,0.8811,5.554,3.333,1.018,4.956,0
2,14.29,14.09,0.9050,5.291,3.337,2.699,4.825,0
3,13.84,13.94,0.8955,5.324,3.379,2.259,4.805,0
4,16.14,14.99,0.9034,5.658,3.562,1.355,5.175,0
...,...,...,...,...,...,...,...,...
205,12.19,13.20,0.8783,5.137,2.981,3.631,4.870,2
206,11.23,12.88,0.8511,5.140,2.795,4.325,5.003,2
207,13.20,13.66,0.8883,5.236,3.232,8.315,5.056,2
208,11.84,13.21,0.8521,5.175,2.836,3.598,5.044,2


In [202]:
#All target values stored in a numpy array
targets = np.array(seeds_df['Target'])

#Wheat variety names stored in a numpy array
target_names = np.array(['Kama', 'Rosa', 'Canadian'])

#Add a 'Variety' column that holds the variety name for each integer code
seeds_df['Variety'] = pd.Categorical.from_codes(targets, target_names)
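
The same relabelling could also be sketched with a dictionary and `Series.map`, which works on the original 1-3 labels without the subtract-one step. This is an alternative shown for comparison, not the method used above; `demo_df` and `variety_by_label` are hypothetical names introduced for the example.

```python
import pandas as pd

# Variety labels as given in this section: Kama (1), Rosa (2), Canadian (3)
variety_by_label = {1: 'Kama', 2: 'Rosa', 3: 'Canadian'}

# Hypothetical mini data frame holding the original integer labels
demo_df = pd.DataFrame({'Target': [1, 1, 2, 3]})

# Map each integer to its name, keeping the category order Kama, Rosa, Canadian
demo_df['Variety'] = pd.Categorical(
    demo_df['Target'].map(variety_by_label),
    categories=list(variety_by_label.values()),
)

print(demo_df)
```

Passing `categories` explicitly keeps the same category order as the `from_codes` approach; without it, pandas would sort the categories alphabetically.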



In [203]:
seeds_df.head(210)

Unnamed: 0,Area,Perimeter,Compactness,Length,Width,Asymmetry Coefficient,Groove Length,Target,Variety
0,15.26,14.84,0.8710,5.763,3.312,2.221,5.220,0,Kama
1,14.88,14.57,0.8811,5.554,3.333,1.018,4.956,0,Kama
2,14.29,14.09,0.9050,5.291,3.337,2.699,4.825,0,Kama
3,13.84,13.94,0.8955,5.324,3.379,2.259,4.805,0,Kama
4,16.14,14.99,0.9034,5.658,3.562,1.355,5.175,0,Kama
...,...,...,...,...,...,...,...,...,...
205,12.19,13.20,0.8783,5.137,2.981,3.631,4.870,2,Canadian
206,11.23,12.88,0.8511,5.140,2.795,4.325,5.003,2,Canadian
207,13.20,13.66,0.8883,5.236,3.232,8.315,5.056,2,Canadian
208,11.84,13.21,0.8521,5.175,2.836,3.598,5.044,2,Canadian


In [204]:
#Remove the original integer Target column

seeds_df = seeds_df.drop('Target', axis=1)

In [205]:
seeds_df.head(210)

Unnamed: 0,Area,Perimeter,Compactness,Length,Width,Asymmetry Coefficient,Groove Length,Variety
0,15.26,14.84,0.8710,5.763,3.312,2.221,5.220,Kama
1,14.88,14.57,0.8811,5.554,3.333,1.018,4.956,Kama
2,14.29,14.09,0.9050,5.291,3.337,2.699,4.825,Kama
3,13.84,13.94,0.8955,5.324,3.379,2.259,4.805,Kama
4,16.14,14.99,0.9034,5.658,3.562,1.355,5.175,Kama
...,...,...,...,...,...,...,...,...
205,12.19,13.20,0.8783,5.137,2.981,3.631,4.870,Canadian
206,11.23,12.88,0.8511,5.140,2.795,4.325,5.003,Canadian
207,13.20,13.66,0.8883,5.236,3.232,8.315,5.056,Canadian
208,11.84,13.21,0.8521,5.175,2.836,3.598,5.044,Canadian


## Data Exploration

Check the size of the dataset - the number of rows and columns

In [206]:
print(f'The dataset has {seeds_df.shape[0]} rows and {seeds_df.shape[1]} columns.')

The dataset has 210 rows and 8 columns.


Check the feature names

In [207]:
for column in seeds_df.columns:
    print(column)

Area
Perimeter
Compactness
Length
Width
Asymmetry Coefficient
Groove Length
Variety


Check the number of instances of each variety

In [208]:
seeds_df.value_counts('Variety')

Variety
Kama        70
Rosa        70
Canadian    70
Name: count, dtype: int64
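
Each variety has 70 instances, so the classes are perfectly balanced. A minimal sketch of what that implies for evaluation: always predicting a single variety would be correct about one third of the time, which gives a floor against which classifier accuracy can be compared. The `varieties` series below is a hypothetical stand-in rebuilt from the counts above, not the real column.

```python
import pandas as pd

# Hypothetical stand-in for seeds_df['Variety'], rebuilt from the reported counts
varieties = pd.Series(['Kama'] * 70 + ['Rosa'] * 70 + ['Canadian'] * 70, name='Variety')

# normalize=True returns each class's share of the rows instead of raw counts
proportions = varieties.value_counts(normalize=True)
print(proportions.round(3))

# Accuracy of a baseline that always predicts the most common class
baseline_accuracy = proportions.max()
print(f'Majority-class baseline accuracy: {baseline_accuracy:.3f}')
```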