
# Building a Classifier using Scikit-Learn

In this programming exercise, you will explore the basic steps of training, testing, and using a machine learning model. The popular software library "Scikit-Learn" (or `sklearn`) simplifies each of these tasks. It is often the framework of choice for training ML models that are not neural network-based.

The aim of this exercise is to develop a multiclass (single label) classifier that will identify the type of an animal (mammal, bird, etc.) based on a small number of features describing that animal.

Our first step will be to set up the workspace; then we will load the data and examine it.

## Setup

We first import software libraries into our Python session. Generally, you need to install these libraries before you can import them. In the Colab environment, they are already installed and ready to be imported.

In [2]:
import pandas as pd

## Load and explore the animal dataset

The code below downloads the dataset onto the virtual machine running this Colab.

In [4]:
!gdown --id 1kzUc2NgLHwZCv8pgMSguPg5N1T1PpxrA

Downloading...
From: https://drive.google.com/uc?id=1kzUc2NgLHwZCv8pgMSguPg5N1T1PpxrA
To: /content/BNG_zoo.csv
100MB [00:00, 241MB/s] 


Now we will load the data into a variable called `data`. To do this, we use the `pandas` package, which has a number of useful utilities for loading and saving tabular data such as spreadsheets and .csv files.

In [5]:
data = pd.read_csv("BNG_zoo.csv")

We can now explore the dataset. How many rows and columns are in it?

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 18 columns):
 #   Column    Non-Null Count    Dtype  
---  ------    --------------    -----  
 0   animal    1000000 non-null  object 
 1   hair      1000000 non-null  bool   
 2   feathers  1000000 non-null  bool   
 3   eggs      1000000 non-null  bool   
 4   milk      1000000 non-null  bool   
 5   airborne  1000000 non-null  bool   
 6   aquatic   1000000 non-null  bool   
 7   predator  1000000 non-null  bool   
 8   toothed   1000000 non-null  bool   
 9   backbone  1000000 non-null  bool   
 10  breathes  1000000 non-null  bool   
 11  venomous  1000000 non-null  bool   
 12  fins      1000000 non-null  bool   
 13  legs      1000000 non-null  float64
 14  tail      1000000 non-null  bool   
 15  domestic  1000000 non-null  bool   
 16  catsize   1000000 non-null  bool   
 17  type      1000000 non-null  object 
dtypes: bool(15), float64(1), object(2)
memory usage: 37.2+ MB


There are 18 columns, and each column has 1,000,000 "non-null" (meaning non-empty) values, so the dataset has 1,000,000 rows. The output lists the name of each column along with its "Dtype", which is the datatype of that column. Most columns are "bool", which stands for Boolean (meaning true or false). The exception is `legs`, which is float64, meaning it is stored as a decimal number. We will look more closely at the `legs` column in a moment. The `animal` and `type` columns are listed as "object", which is the generic datatype that pandas uses when a column does not fit a more specific type. In fact, these columns contain strings (text).
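As a cross-check on the counts above, the number of rows and columns and the datatype of each column can also be read directly from a DataFrame. The sketch below uses a tiny hand-made DataFrame called `toy` (a hypothetical stand-in with invented values, not the real dataset) so that it runs on its own. With the real data you would use the same attributes on `data`.

```python
import pandas as pd

# A tiny stand-in for the real dataset (values invented for illustration).
toy = pd.DataFrame({
    "animal": ["herring", "dolphin", "crow"],
    "hair": [False, True, False],
    "legs": [0.0, 0.0, 2.0],
    "type": ["fish", "mammal", "bird"],
})

# .shape is a (number of rows, number of columns) tuple.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# One dtype per column: bool for True/False values, float64 for decimals.
# Text columns usually show up as "object", although the exact name can
# depend on the pandas version.
print(toy.dtypes)

# Count how many columns share each dtype.
print(toy.dtypes.value_counts())
```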

Next, let's look at the first few rows of the data.

In [6]:
data.head()

Unnamed: 0,animal,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs,tail,domestic,catsize,type
0,herring,False,False,True,False,False,True,True,True,True,False,False,True,0.0,True,False,False,fish
1,sole,False,False,True,False,False,True,False,True,True,False,False,True,0.0,True,True,False,fish
2,dolphin,True,False,False,True,False,False,False,True,True,True,False,False,4.0,True,False,True,mammal
3,moth,False,True,True,False,False,False,True,False,True,True,False,False,2.0,True,False,True,bird
4,starfish,False,False,False,False,False,True,True,True,True,True,False,True,8.0,False,False,False,amphibian


Hopefully this gives you an idea of what the data represents. Each row represents an animal, whose name appears in the `animal` column. There are 16 additional features, such as whether the animal has hair or feathers, produces eggs or milk, flies, or lives in water.

The outcome of interest is the animal `type`. Let's see how many distinct values of `type` are in this dataset.

In [7]:
data['type'].unique()

array(['fish', 'mammal', 'bird', 'amphibian', 'invertebrate', 'insect',
       'reptile'], dtype=object)

As you can see, the unique types are fish, mammal, bird, amphibian, invertebrate, insect, and reptile.

The goal of our machine learning model will be to predict the type of animal based on the features. In other words, we wish to be able to feed our model the data in all but the rightmost column, and have the model predict the correct `type` based only on features we provide.
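Separating "all but the rightmost column" from the rightmost column can be sketched with positional indexing. This is only an illustration on a hypothetical `toy` DataFrame with invented values; the names `X` and `y` are a common convention and are assumptions here, not names taken from this exercise.

```python
import pandas as pd

# A tiny stand-in for the real dataset (values invented for illustration).
toy = pd.DataFrame({
    "animal": ["herring", "dolphin", "crow"],
    "milk": [False, True, False],
    "feathers": [False, False, True],
    "type": ["fish", "mammal", "bird"],
})

# Features: every column except the rightmost one.
X = toy.iloc[:, :-1]

# Target: the rightmost column, which the model should learn to predict.
y = toy.iloc[:, -1]

print(X.columns.tolist())
print(y.tolist())
```

Note that `X` still contains the text column `animal` at this point, which is one of the things the pre-processing step has to deal with.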

Think for a moment: Would you rather come up with a rule-based system for finding the `type`? For example, you may say that "If milk = TRUE, then type = mammal". Perhaps other rules are quite evident, like "If backbone = FALSE, then type = invertebrate". A complete set of rules, however, may be difficult to write by hand, even for this simple problem. For more complex data, it is often close to impossible to design rules by hand. The key idea behind machine learning is that we do not need to tell the machine how to compute the `type` variable for a given input. It will learn to do this by detecting patterns in labeled data.
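To make the difficulty concrete, here is a minimal sketch of what a hand-written rule system might look like. The first and third rules are the two quoted above; the `feathers` rule and the function name `rule_based_type` are assumptions added for illustration, and the example animals are invented.

```python
def rule_based_type(animal):
    """Guess an animal's type from a few hand-written rules.

    `animal` is a dict mapping feature names to values.
    Returns None when no rule applies.
    """
    if animal["milk"]:
        return "mammal"
    if animal["feathers"]:  # assumed extra rule, not from the text above
        return "bird"
    if not animal["backbone"]:
        return "invertebrate"
    return None  # no rule covers this case


dolphin = {"milk": True, "feathers": False, "backbone": True}
frog = {"milk": False, "feathers": False, "backbone": True}

print(rule_based_type(dolphin))  # mammal
print(rule_based_type(frog))     # None
```

The frog falls through every rule, because nothing here separates fish, amphibians, and reptiles. The backbone rule is also too coarse: an insect has no backbone either, so it would be labeled "invertebrate" even though the dataset has a separate `insect` type.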

## Data pre-processing

One of the requirements of machine learning is getting the data into a form suitable for training a model. The pre-processing requirements can vary depending on the type of data you are working with and the types of models you want to train. There are a few general ideas that are often true, however:

1. You should remove or remedy missing data.
2. Strings of text should be converted to numeric variables. In particular, labels should be numeric.
3. Some ML models require standardization of numeric data, i.e. transforming the numbers so that their overall distribution has mean 0 and standard deviation 1.
4. Highly correlated features may need to be removed.

Not all of these are relevant to this dataset, but let's discuss each point to get some background on what these mean.
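As a small illustration of the standardization mentioned in point 3, the sketch below rescales a column by hand so that it has mean 0 and standard deviation 1. The values are invented, and the choice of the population standard deviation (`ddof=0`) is an assumption made here to match the convention used by scikit-learn's `StandardScaler`.

```python
import pandas as pd

# Invented values standing in for a numeric column such as `legs`.
legs = pd.Series([0.0, 2.0, 4.0, 4.0, 6.0, 8.0], name="legs")

# Standardize: subtract the mean, then divide by the standard deviation.
# ddof=0 selects the population standard deviation.
standardized = (legs - legs.mean()) / legs.std(ddof=0)

print(standardized.round(3).tolist())

# After the transformation the mean is (numerically) 0 and the
# standard deviation is (numerically) 1.
print(round(standardized.mean(), 6), round(standardized.std(ddof=0), 6))
```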


### 1. Removing missing (null) data

Many machine learning models cannot be trained when the input data contains missing values. In this case, our dataset has already been prepared so as to have no missing values. (You can see this from the `data.info()` output above, which lists exactly 1,000,000 non-null values for each column in the dataset.) However, if there had been missing values, there would be two main ways of handling this issue:

1. Remove rows with missing values: This is sometimes acceptable, but notice that there could be some reason why the values are missing. If so, then by removing those rows you could be introducing a sampling bias into the dataset. This may limit your ability to train a model that would perform well on new data.

2. Imputation: The second option, imputation, refers to filling in the missing values. There are many valid ways of doing this. One way is to replace missing values with the average of all values of that variable that are present in the dataset.
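Both options can be sketched in a few lines of pandas. The DataFrame `toy` below is hypothetical, with one missing `legs` value invented for illustration, since the real dataset has no missing values.

```python
import numpy as np
import pandas as pd

# A tiny example with one missing value (NaN) in the `legs` column.
toy = pd.DataFrame({
    "legs": [4.0, np.nan, 2.0, 0.0],
    "tail": [True, True, False, True],
})

# Option 1: drop every row that contains at least one missing value.
dropped = toy.dropna()

# Option 2: mean imputation. Replace missing `legs` values with the average
# of the values that are present (mean() skips NaN by default).
imputed = toy.copy()
imputed["legs"] = imputed["legs"].fillna(imputed["legs"].mean())

print(len(dropped))              # 3
print(imputed["legs"].tolist())  # [4.0, 2.0, 2.0, 0.0]
```

Dropping keeps only real observations but shrinks the dataset, while imputation keeps every row at the cost of inserting a value that was never observed.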

### 2. Label encoding and removing free text

For many types of machine learning models, the inputs to the model need to be numeric. This means that free text usually needs to be transformed into something else. We will have the opportunity later in the week to study models that primarily process text, but today the main issue is that there are two columns in the dataset (`animal` and `type`) that are free text, and we need to figure out what to do with them.

The `type` column is important because it will be our target outcome, i.e. the variable that the ML model will try to predict. When the target outcome is one of several (in this case, 7) strings, we simply need to replace each occurrence of a given string with a particular number. For example, each `amphibian` could become a 0, each `bird` a 1, and so on. This is called **label encoding**.
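One possible way to do this, assuming scikit-learn's `LabelEncoder` is the tool of choice (the text above does not name one), is sketched below on a short invented list of labels.

```python
from sklearn.preprocessing import LabelEncoder

# Invented sample of labels; the real column has 7 distinct values.
types = ["fish", "mammal", "bird", "fish", "reptile"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(types)  # strings -> integers

# The distinct labels are stored in sorted order; a label's position
# in this array is the integer it is encoded as.
print(list(encoder.classes_))
print(encoded.tolist())

# The mapping can be reversed to recover the original strings.
print(encoder.inverse_transform(encoded).tolist())
```

Because the classes are sorted, the integers carry no meaning beyond identifying the label; "mammal" is not in any sense larger than "bird".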

The `animal` column in the dataset also appears to be free-text. In this case, I have chosen to simply remove this column from the dataset, because it is not 

In [24]:
a = data['animal'].unique().tolist()
a.sort()
a

['aardvark',
 'antelope',
 'bass',
 'bear',
 'boar',
 'buffalo',
 'calf',
 'carp',
 'catfish',
 'cavy',
 'cheetah',
 'chicken',
 'chub',
 'clam',
 'crab',
 'crayfish',
 'crow',
 'deer',
 'dogfish',
 'dolphin',
 'dove',
 'duck',
 'elephant',
 'flamingo',
 'flea',
 'frog',
 'fruitbat',
 'giraffe',
 'girl',
 'gnat',
 'goat',
 'gorilla',
 'gull',
 'haddock',
 'hamster',
 'hare',
 'hawk',
 'herring',
 'honeybee',
 'housefly',
 'kiwi',
 'ladybird',
 'lark',
 'leopard',
 'lion',
 'lobster',
 'lynx',
 'mink',
 'mole',
 'mongoose',
 'moth',
 'newt',
 'octopus',
 'opossum',
 'oryx',
 'ostrich',
 'parakeet',
 'penguin',
 'pheasant',
 'pike',
 'piranha',
 'pitviper',
 'platypus',
 'polecat',
 'pony',
 'porpoise',
 'puma',
 'pussycat',
 'raccoon',
 'reindeer',
 'rhea',
 'scorpion',
 'seahorse',
 'seal',
 'sealion',
 'seasnake',
 'seawasp',
 'skimmer',
 'skua',
 'slowworm',
 'slug',
 'sole',
 'sparrow',
 'squirrel',
 'starfish',
 'stingray',
 'swan',
 'termite',
 'toad',
 'tortoise',
 'tuatara',
 't