### Load CSV File

The first step is to load the CSV file.

We will use the csv module, which is part of the Python standard library.

The reader() function in the csv module takes a file as an argument.
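
To see what reader() yields, the minimal sketch below parses a small in-memory string instead of a file on disk. The sample text and the use of io.StringIO are assumptions made for illustration; an open file object is consumed the same way.

```python
from csv import reader
from io import StringIO

# Hypothetical two-row sample standing in for a real CSV file.
sample = StringIO("1,5.1,3.5,Iris-setosa\n2,4.9,3.0,Iris-setosa\n")

# reader() yields one list of strings per line of input.
rows = list(reader(sample))
print(rows[0])
```

Every value comes back as a string, including the numeric ones.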

We will wrap this behavior in a function called load_csv() that takes a filename and returns our dataset. We will represent the loaded dataset as a list of lists: the outer list holds the observations (rows), and each inner list holds the column values for one row.

In [1]:
from csv import reader

In [2]:
def load_csv(filename):
    # newline='' lets the csv module handle line endings itself
    with open(filename, 'r', newline='') as file:
        lines = reader(file)
        dataset = list(lines)
    return dataset

In [37]:
# [1:] skips the first row of the file (the column headers)
dataset = load_csv('../Naive-Bayes/Iris.csv')[1:]

A limitation of this function is that it will load empty lines from data files and add them to our list of rows. We can overcome this by adding rows of data one at a time to our dataset and skipping empty rows.

In [5]:
def load_csv(filename):
    dataset = []
    with open(filename, 'r', newline='') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            # Skip empty rows
            if not row:
                continue
            dataset.append(row)
    return dataset
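
As a quick check of the blank-line handling, the sketch below writes a small sample to a temporary file and loads it back. The sample contents and the temporary file are assumptions for illustration, and load_csv() is repeated here only so the snippet runs on its own.

```python
import os
import tempfile
from csv import reader

def load_csv(filename):
    """Load a CSV file as a list of rows, keeping only non-empty rows."""
    dataset = []
    with open(filename, 'r', newline='') as file:
        for row in reader(file):
            if row:
                dataset.append(row)
    return dataset

# Hypothetical sample with a blank line in the middle and another at the end.
text = "1,5.1,Iris-setosa\n\n2,4.9,Iris-setosa\n\n"
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False, newline='') as tmp:
    tmp.write(text)
    path = tmp.name

rows = load_csv(path)
os.remove(path)  # clean up the temporary file
print(len(rows))
```

Only the two data rows should remain, because reader() returns an empty list for each blank line and the function drops those.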

### Convert String to Floats

Most, if not all, machine learning algorithms prefer to work with numbers.

Specifically, floating point numbers are preferred.

Our code for loading a CSV file returns the dataset as a list of lists, but each value is a string. We can see this if we print one record from the dataset:

In [30]:
data_chunk = load_csv('../Naive-Bayes/Iris.csv')[1:10]
data_chunk[0]

['1', '5.1', '3.5', '1.4', '0.2', 'Iris-setosa']

We can write a small function, str_column_to_float(), that converts one column of the dataset to floating point values in place, stripping any surrounding whitespace from each value first.

In [31]:
def str_column_to_float(dataset, column):
    # Convert the given column to float in place, ignoring surrounding whitespace
    for row in dataset:
        row[column] = float(row[column].strip())
    return dataset

In [32]:
# Convert the first five columns; column 5 holds the species label
for i in range(5):
    str_column_to_float(data_chunk, i)

In [33]:
data_chunk

[[1.0, 5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
 [2.0, 4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
 [3.0, 4.7, 3.2, 1.3, 0.2, 'Iris-setosa'],
 [4.0, 4.6, 3.1, 1.5, 0.2, 'Iris-setosa'],
 [5.0, 5.0, 3.6, 1.4, 0.2, 'Iris-setosa'],
 [6.0, 5.4, 3.9, 1.7, 0.4, 'Iris-setosa'],
 [7.0, 4.6, 3.4, 1.4, 0.3, 'Iris-setosa'],
 [8.0, 5.0, 3.4, 1.5, 0.2, 'Iris-setosa'],
 [9.0, 4.4, 2.9, 1.4, 0.2, 'Iris-setosa']]
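
The loop above hard-codes the number of numeric columns. A minimal sketch of one alternative, assuming the class label is always the last column, derives the range from the row length instead. The two sample rows are hypothetical and only mimic the layout of the file.

```python
def str_column_to_float(dataset, column):
    # Convert the given column to float in place, ignoring surrounding whitespace.
    for row in dataset:
        row[column] = float(row[column].strip())
    return dataset

# Hypothetical rows in the same layout as the output above (label in the last column).
rows = [
    ['1', ' 5.1', '3.5', '1.4', '0.2', 'Iris-setosa'],
    ['2', '4.9 ', '3.0', '1.4', '0.2', 'Iris-setosa'],
]

# Convert every column except the last one, which holds the class label.
for column in range(len(rows[0]) - 1):
    str_column_to_float(rows, column)

print(rows[0])
```

Note that float() raises a ValueError on a non-numeric string, so the header row has to be removed before the conversion, as is done with the [1:] slice when loading.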

### Convert String to Integers

The columns in the iris flowers dataset contain numeric data.

The exception is the final column, which is traditionally used to hold the outcome or value to be predicted for a given row. In the iris flowers data, the final column is the flower species, stored as a string.

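Download the dataset and save it as a CSV file, for example iris.csv in the current working directory. The code in this section loads it from ../Naive-Bayes/Iris.csv, so adjust the path to match where you saved the file. Open the file and delete any empty lines at the bottom.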

Some machine learning algorithms prefer all values to be numeric, including the outcome or predicted value.

We can convert the class value in the iris flowers dataset to an integer by creating a map.

1. First, we locate all of the unique class values, which happen to be: Iris-setosa, Iris-versicolor and Iris-virginica.
2. Next, we assign an integer value to each, such as: 0, 1 and 2.
3. Finally, we replace all occurrences of class string values with their corresponding integer values.

Below is a function called str_column_to_int() that does just that. Like the previously introduced str_column_to_float(), it operates on a single column in the dataset. It returns the dictionary that maps each class value to its integer.

In [34]:
def str_column_to_int(dataset, column):
    classes = [row[column] for row in dataset]
    # Sort the unique values so each class gets the same integer on every run
    unique = sorted(set(classes))
    lookup = dict()
    for idx, cls in enumerate(unique):
        lookup[cls] = idx
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

In [38]:
str_column_to_int(dataset,5)

{'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
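
The returned dictionary is worth keeping, since it is the only record of which integer stands for which species. As a minimal sketch, it can be inverted to turn integer codes back into class names. The names reverse_lookup and predictions are hypothetical, and the predictions are made-up values rather than output from a real model.

```python
# Mapping in the form returned by str_column_to_int() above.
lookup = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}

# Invert the mapping so integer codes point back to class names.
reverse_lookup = {code: name for name, code in lookup.items()}

# Hypothetical integer predictions, decoded back to species names.
predictions = [2, 0, 1]
labels = [reverse_lookup[code] for code in predictions]
print(labels)
```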