# Preprocessing Datasets

#### Ishita Kapur, UTA ID: 1001753123

The datasets used to compare the accuracy of the machine learning algorithm need to be preprocessed.

The algorithm accepts only numeric data, so categorical text data must be converted to numbers. The function **convertToNumeric()** implemented in this notebook performs this conversion.

The algorithm also assumes that the last column is the **class** label column. The data is read from CSV files into pandas DataFrames, and the columns are reindexed where needed so that **class** comes last.

In [1]:
import pandas as pd
import numpy as np

The following function converts categorical text data to numeric codes. Each distinct value in a column is numbered 1, 2, ... in order of first appearance.

In [2]:
def convertToNumeric(data):
    # Work on a copy so the caller's DataFrame is left unchanged.
    data_numeric = data.copy()
    for col in data_numeric.columns:
        # Number each distinct value 1, 2, ... in order of first appearance.
        possible = data_numeric[col].unique()
        encode = {key: i + 1 for i, key in enumerate(possible)}
        data_numeric[col] = data_numeric[col].map(encode)
    return data_numeric
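
To illustrate the encoding scheme used by the function above (codes 1, 2, ... assigned per column in order of first appearance), here is a minimal sketch on a single toy column. The `toy` Series is made up for illustration, and `pd.factorize` is swapped in as an alternative way to get the same codes; it is not what the notebook's function uses.

```python
import pandas as pd

# Hypothetical toy column, not taken from any of the datasets.
toy = pd.Series(['vhigh', 'vhigh', 'med', 'low', 'med'])

# factorize numbers distinct values 0, 1, ... in order of first appearance;
# adding 1 gives the 1-based codes described above.
codes, uniques = pd.factorize(toy)
encoded = pd.Series(codes + 1, index=toy.index)

# Mapping from original value to code, kept so the encoding can be inspected.
mapping = {value: i + 1 for i, value in enumerate(uniques)}
print(encoded.tolist())
print(mapping)
```

Note that the codes reflect only the order in which values happen to appear in the file, not any natural ordering of the categories.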

### Hayes Roth Dataset

The data in this dataset is already numeric. The column **'name'** is dropped: although it is numeric, its value is unique for each row, so it acts as an identifier and carries no information about the class. The preprocessed data is stored in a separate CSV file.
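
The reasoning above (a column whose values never repeat acts as a row identifier) can be checked programmatically. The following is a rough sketch on a small hand-written frame; the `toy` data and the `id_like` name are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature frame with a Hayes-Roth-like layout.
toy = pd.DataFrame({
    'name': [92, 10, 83, 61],
    'hobby': [2, 2, 3, 2],
    'class': [1, 2, 3, 3],
})

# A column whose values never repeat behaves like a row identifier.
id_like = [col for col in toy.columns if toy[col].is_unique]
print(id_like)

# Drop the identifier-like columns before training.
cleaned = toy.drop(columns=id_like)
print(list(cleaned.columns))
```

On a very small sample a genuine feature can also happen to be unique, so a check like this is only a hint and the result still needs a manual look.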

In [3]:
data_hayes = pd.read_csv('Datasets/hayes_roth/hayes-roth.data', header=None, names=['name', 'hobby', 'age', 'educational_level', 'marital_status', 'class'])
print('\nOriginal Dataset\n', data_hayes)
data_hayes = data_hayes.drop(columns=['name'])
print('\nPreprocessed Dataset\n', data_hayes)
data_hayes.to_csv('Datasets/hayes_roth/hayes-roth-no-name.data',header=False,index=False)


Original Dataset
      name  hobby  age  educational_level  marital_status  class
0      92      2    1                  1               2      1
1      10      2    1                  3               2      2
2      83      3    1                  4               1      3
3      61      2    4                  2               2      3
4     107      1    1                  3               4      3
..    ...    ...  ...                ...             ...    ...
127    44      1    1                  4               3      3
128    40      2    1                  2               1      1
129    90      1    2                  1               2      2
130    21      1    2                  2               1      2
131     9      3    1                  1               2      1

[132 rows x 6 columns]

Preprocessed Dataset
      hobby  age  educational_level  marital_status  class
0        2    1                  1               2      1
1        2    1                  3               2

### Car Evaluation Dataset

The columns in this dataset contain categorical text data. The dataset is converted to numeric form using the function above, and the numeric dataset is stored in a separate CSV file.
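
Because the numeric file is written without a header row or index column, it has to be read back with `header=None`. The sketch below shows that round trip in memory; the `numeric` frame is made up, and `io.StringIO` stands in for the file on disk.

```python
import io
import pandas as pd

# Hypothetical already-encoded frame (values are made up).
numeric = pd.DataFrame({'buying': [1, 1, 2], 'safety': [1, 2, 3], 'class': [1, 1, 2]})

# Write without header or index, as the notebook does, but into memory.
buffer = io.StringIO()
numeric.to_csv(buffer, header=False, index=False)
text = buffer.getvalue()
print(text)

# Read it back: header=None is needed because the file has no header row.
restored = pd.read_csv(io.StringIO(text), header=None)
print(restored.shape)
```

The column names are lost in the saved file, so the consumer relies purely on column position, with the class label last.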

In [4]:
data_car = pd.read_csv('Datasets/car_evaluation/car.data', header=None, names=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'], index_col=False)
print('\nOriginal Dataset\n', data_car)
data_car_numeric = convertToNumeric(data_car)
data_car_numeric.to_csv('Datasets/car_evaluation/car-eval-number.data',header=False,index=False)
print('\nPreprocessed Dataset\n', data_car_numeric)


Original Dataset
      buying  maint  doors persons lug_boot safety  class
0     vhigh  vhigh      2       2    small    low  unacc
1     vhigh  vhigh      2       2    small    med  unacc
2     vhigh  vhigh      2       2    small   high  unacc
3     vhigh  vhigh      2       2      med    low  unacc
4     vhigh  vhigh      2       2      med    med  unacc
...     ...    ...    ...     ...      ...    ...    ...
1723    low    low  5more    more      med    med   good
1724    low    low  5more    more      med   high  vgood
1725    low    low  5more    more      big    low  unacc
1726    low    low  5more    more      big    med   good
1727    low    low  5more    more      big   high  vgood

[1728 rows x 7 columns]

Preprocessed Dataset
       buying  maint  doors  persons  lug_boot  safety  class
0          1      1      1        1         1       1      1
1          1      1      1        1         1       2      1
2          1      1      1        1         1       3      1
3    

### Breast Cancer Dataset

The columns in this dataset are reindexed so that **'class'** becomes the last column. The columns contain categorical text data, so the dataset is then converted to numeric form using the function above. The numeric dataset is stored in a separate CSV file.
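
Moving the label column to the end can also be done without spelling out every column name. The helper `move_column_last` below is a hypothetical sketch, not part of the notebook, and the `toy` frame is made up.

```python
import pandas as pd

def move_column_last(df, col):
    """Return a new DataFrame with `col` moved to the final position."""
    others = [c for c in df.columns if c != col]
    return df[others + [col]]

# Hypothetical two-row frame with the label in the first column.
toy = pd.DataFrame({
    'class': ['a', 'b'],
    'age': ['30-39', '40-49'],
    'breast': ['left', 'right'],
})
reordered = move_column_last(toy, 'class')
print(list(reordered.columns))
```

The relative order of the remaining columns is preserved, which matches what the explicit `reindex` call produces.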

In [5]:
data_cancer = pd.read_csv('Datasets/breast_cancer/breast-cancer.data', header=None, names=['class', 'age', 'menopause', 'tumor_size', 'inv_nodes', 'node_caps', 'deg_malig', 'breast', 'breast_quad', 'irradiat'], index_col=False)
print('\nOriginal Dataset\n', data_cancer)
data_cancer = data_cancer.reindex(columns=['age', 'menopause', 'tumor_size', 'inv_nodes', 'node_caps', 'deg_malig', 'breast', 'breast_quad', 'irradiat', 'class'])
data_cancer_numeric = convertToNumeric(data_cancer)
data_cancer_numeric.to_csv('Datasets/breast_cancer/breast-cancer-number.data',header=False,index=False)
print('\nPreprocessed Dataset\n', data_cancer_numeric)


Original Dataset
                     class    age menopause tumor_size inv_nodes node_caps  \
0    no-recurrence-events  30-39   premeno      30-34       0-2        no   
1    no-recurrence-events  40-49   premeno      20-24       0-2        no   
2    no-recurrence-events  40-49   premeno      20-24       0-2        no   
3    no-recurrence-events  60-69      ge40      15-19       0-2        no   
4    no-recurrence-events  40-49   premeno        0-4       0-2        no   
..                    ...    ...       ...        ...       ...       ...   
281     recurrence-events  30-39   premeno      30-34       0-2        no   
282     recurrence-events  30-39   premeno      20-24       0-2        no   
283     recurrence-events  60-69      ge40      20-24       0-2        no   
284     recurrence-events  40-49      ge40      30-34       3-5        no   
285     recurrence-events  50-59      ge40      30-34       3-5        no   

     deg_malig breast breast_quad irradiat  
0          