This notebook introduces exploratory data analysis for tabular data. We will be using the Zoo Animal dataset from the UCI Machine Learning Repository.

To follow the tutorial, read each section of text and then run the Python code below it. Python is a programming language commonly used for machine learning because its syntax is relatively straightforward and because many libraries exist which hide away much of the complexity and save us reinventing the wheel when solving problems.





The following commands connect to Google Drive. Running them should present an accounts.google.com link. Click to open it in a browser window (where you may be prompted to log in to your Google account and authorise Colab to access your files) and copy the authorisation code it gives you (there's a little 'double rectangle' icon which you can click to copy). Then paste the code in the space below, where it says "Enter your authorization code":

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


Import the libraries used by this tutorial. Libraries contain code that other people have already written, so you don't have to write it yourself.

In [0]:
import pandas as pd
import os
import seaborn as sns
from matplotlib import pyplot as plt

Test that you can access the data files for this tutorial. After running this cell you should see two files in a list:
 ['zoo.csv','zoo.clean.csv']

In [0]:
import os
zoo_dir = '/content/gdrive/My Drive/MLC/Session 1/Data/'
os.listdir(zoo_dir)

Now we are ready to start working with some data.
The first line loads the zoo data csv file into a table (technically a Pandas data frame, for those who care about such things).
The second statement (split over two lines) sets the column names of the table, because the original file doesn't have them.
The last line displays the first and last few rows of the table.

Take some time to look at the data, scrolling across to see more columns.

**How many rows and columns are there in the data table?**

In [0]:
zoo_data = pd.read_csv(zoo_dir + 'zoo.csv', header=None)
zoo_data.columns = ['animalname','hair','feathers','eggs','milk','airborne','aquatic','predator','toothed','backbone',
                    'breathes','venomous','fins','legs','tail','domestic','catsize','animalclass']
zoo_data
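
If you would rather not count rows and columns by eye, a data frame's `shape` attribute reports both. The sketch below uses a tiny made-up table (the name `toy` and its values are illustrative assumptions, not the zoo data); the same attribute should work on `zoo_data`.

```python
import pandas as pd

# A tiny made-up table standing in for zoo_data (values are illustrative only)
toy = pd.DataFrame({
    'animalname': ['cat', 'hen', 'eel'],
    'hair': [1, 0, 0],
    'legs': [4, 2, 0],
})

# shape is a (number of rows, number of columns) tuple
n_rows, n_cols = toy.shape
print(n_rows, 'rows and', n_cols, 'columns')
```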

The describe command outputs a summary of the data in the table.
The summary has 8 rows, including "count", "mean", "min", and "max", and one column for all but one of the columns in the original data table. **Can you guess why one column has been missed out?**

The last column, "animalclass", is the one we will be predicting when we get to Machine Learning. For now we'll refer to this as the Target Variable. The remaining columns are known as Features.

A "boolean" value is either True or False, and in this table is represented by 1 for True and 0 for False. For example, a 1 in the "feathers" column means an animal has feathers.
Most of the columns in the table are boolean.
**Do the "max" and "min" values in the summary table below make sense for the ones you expect to be boolean?**

**How many classes of animal do you think there are?**

In [0]:
zoo_data.describe()
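
As a rough guide to reading the summary for a 0/1 column, the sketch below runs `describe` on a made-up column (the name `toy` and its values are assumptions for illustration, not taken from the zoo data). For such a column the mean is simply the fraction of rows holding a 1.

```python
import pandas as pd

# Made-up 0/1 values, not taken from the zoo data
toy = pd.DataFrame({'feathers': [0, 1, 1, 0]})

summary = toy.describe()

# Pick out three of the summary rows for the one column
print(summary.loc[['min', 'max', 'mean'], 'feathers'])

# For a 0/1 column the mean is the fraction of rows with value 1
fraction_with_feathers = toy['feathers'].mean()
print(fraction_with_feathers)
```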

Let's now look at a particular class of animal. Although they're not labelled as such, the first 3 rows of class 1 suggest they are mammals.

**Look at the top rows of the other classes and see if you can come up with sensible class labels.**

To view a different class, change the 1 after the double equals sign in the line below to a value between 1 and 7.
If you want to look at more than 3 rows, change the number inside the "head" function at the end.

In [0]:
zoo_data[zoo_data['animalclass'] == 1].head(3)
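
The filter above works in two steps, which the following sketch separates out on a small made-up table (`toy` and its values are illustrative assumptions): the comparison produces one True/False value per row, and indexing with that result keeps only the True rows.

```python
import pandas as pd

# Made-up rows standing in for zoo_data
toy = pd.DataFrame({
    'animalname': ['cat', 'hen', 'dog', 'eel'],
    'animalclass': [1, 2, 1, 4],
})

# Step 1: comparing a column to a value gives one True/False per row
mask = toy['animalclass'] == 1
print(mask.tolist())

# Step 2: indexing with the mask keeps only the rows where it is True
class_one = toy[mask]
print(class_one)

# head(1) then keeps just the first of the matching rows
print(class_one.head(1))
```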

Eyeballing data is a great first step, but remember that a picture is worth a thousand words.

So the next step is to start plotting data. We will start with a simple histogram which counts how many rows are in each class. This command uses the Seaborn visualisation library, which is abbreviated to sns. Its countplot function draws histograms like this one.

**Try running the command for a couple of other columns.** Most of them are boolean 0 or 1 columns which aren't so interesting, so try plotting 'legs'. Something odd happens if you plot 'animalname'. **Can you spot the data quality problem?**

In [0]:
sns.countplot(x=zoo_data['animalclass'])
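
The bar heights in a count plot are just the number of rows holding each value. As a minimal sketch of the same tally in table form, assuming a made-up column (the name `classes` and its values are illustrative only), pandas' `value_counts` can be used:

```python
import pandas as pd

# Made-up class labels, not the real zoo data
classes = pd.Series([1, 2, 1, 4, 1, 2], name='animalclass')

# Count the rows per value, then order the result by class number
counts = classes.value_counts().sort_index()
print(counts)
```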

If we want to plot two variables we can use the 'hue' option. The following command will still plot by class but will draw two bars per class, one for each boolean value of the 'predator' column. So class 1 is fairly evenly split between predators (value 1) and non-predators (value 0), while class 7 is predominantly made up of predators.

**Try changing the 'predator' value in the command to other columns** (NOT 'animalname'!). **Do the plots you create make sense for the class labels you derived earlier?**

In [0]:
sns.countplot(x=zoo_data['animalclass'], hue=zoo_data['predator'])
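
The numbers behind a two-variable count plot can also be viewed as a table. The sketch below uses pandas' `crosstab` on made-up values (`toy` and its contents are assumptions for illustration); each cell is the height of one bar.

```python
import pandas as pd

# Made-up rows, not the real zoo data
toy = pd.DataFrame({
    'animalclass': [1, 1, 1, 2, 2, 7],
    'predator':    [1, 0, 1, 0, 0, 1],
})

# Rows are classes, columns are predator values, cells are row counts
table = pd.crosstab(toy['animalclass'], toy['predator'])
print(table)
```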

We can also put multiple plots in a grid for easy comparison. This unfortunately involves a lot more typing, but the extra code is basically doing the same as the above several times over, and much of it is just formatting.

If you want to plot different columns, change the values in the data_columns variable on the first line.

In [0]:
data_columns = ['feathers','hair','tail','domestic','airborne','aquatic']

# Don't edit below this point unless you know what you're doing
rows = 2
cols = 3
fig, ax = plt.subplots(rows, cols, figsize=(5*cols, 10))
fig.tight_layout(pad=cols*1.5)

for r in range(rows):
    for c in range(cols):
        # Position in data_columns of the column shown at grid cell (r, c)
        hue_col = (r*cols) + c
        p = sns.countplot(x=zoo_data['animalclass'], hue=zoo_data[data_columns[hue_col]], ax=ax[r][c])
        p.set_title(data_columns[hue_col])
        # Only the bottom row keeps its x-axis label
        if r < rows-1:
            p.set_xlabel("")
        # Keep a single legend, placed to the right of the bottom-right plot
        if r == rows-1 and c == cols-1:
            p.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
        else:
            p.legend().remove()

plt.show()
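
The only non-obvious arithmetic in the grid code is `(r*cols) + c`, which turns a grid position into a position in the `data_columns` list. The sketch below repeats just that calculation without any plotting; the `layout` dictionary is a hypothetical helper introduced here for illustration.

```python
data_columns = ['feathers', 'hair', 'tail', 'domestic', 'airborne', 'aquatic']
rows, cols = 2, 3

# Walk the grid row by row, using the same index arithmetic as hue_col
layout = {}
for r in range(rows):
    for c in range(cols):
        layout[(r, c)] = data_columns[(r * cols) + c]
        print((r, c), '->', layout[(r, c)])
```

With 2 rows and 3 columns, the first three names fill the top row and the last three fill the bottom row, so the list needs exactly rows × cols entries.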

That's enough for this dataset for now. So far you have learned about the initial steps of data analysis:

1.   Viewing tabular data
2.   Summary statistics for tabular data
3.   Simple histograms
4.   Two variable histograms
5.   Plotting many variables in a grid













