We'll practice data analysis skills using the Iris dataset, a simple dataset commonly used in machine learning. The goal is to predict the species of an Iris flower (setosa, versicolor, or virginica) given its sepal and petal measurements: ![img](https://machinelearninghd.com/wp-content/uploads/2021/03/iris-dataset.png)

To start, we import `pandas`, a Python package for working with tabular data, under its conventional alias `pd`. To run the cell, select it and press Shift+Enter.

In [1]:
import pandas as pd

To work with a data file, we first need to load it into a **DataFrame**, a pandas object that lets us view and manipulate tabular datasets. This notebook introduces simple Python and pandas syntax for creating and using a DataFrame.

We'll use the pandas `read_csv` function to create a DataFrame from our data file. First, find the name of the file (it's in the "data" folder; don't use the "messy" ones for now!). Then edit the `filename` string. You can edit a cell by clicking inside it.

In [2]:
filename = "data/iris.data" # TODO: fill in the name (a string) of the Iris data file

# read_csv creates a DataFrame from a CSV file; header=None because this file has no header row
df = pd.read_csv(filename, header=None)
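
To see why `header=None` matters here, the following is a minimal sketch that reads three rows copied from the output below out of an in-memory string instead of the data file. The names `sample_csv`, `with_header_none`, and `default_header` are illustrative and not part of the notebook.

```python
import io

import pandas as pd

# Three data rows copied from the notebook output; note there is no header line
sample_csv = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

# header=None: every line is treated as data, and columns are numbered 0..4
with_header_none = pd.read_csv(io.StringIO(sample_csv), header=None)

# Default behavior: the first line is consumed as column names, so one row is lost
default_header = pd.read_csv(io.StringIO(sample_csv))

print(with_header_none.shape)
print(default_header.shape)
print(list(default_header.columns))
```

Without `header=None`, the first flower's measurements would silently become the column names, and the DataFrame would have one row fewer than the file.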

Now that we have `df`, we can use some simple pandas functionality to inspect it. Take a look at the documentation for [.shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html), [.head()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html), [.columns](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html), and [.describe()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html). Attributes such as `.shape` and `.columns` are accessed without parentheses, while methods such as `.head()` and `.describe()` are called with them. We'll use these attributes and methods to understand the Iris dataset a bit better.

In [3]:
df.shape # TODO: inspect the shape attribute to find the number of rows and columns

(150, 5)
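
As a small illustration of the difference between an attribute and a method, here is a sketch on a hypothetical toy DataFrame (`toy` is made up for this example and stands in for the Iris `df`):

```python
import pandas as pd

# Hypothetical toy frame, standing in for the Iris DataFrame
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# .shape is an attribute: no parentheses, and it unpacks into (rows, columns)
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# .head() is a method: it is called, optionally with the number of rows to return
first_two = toy.head(2)
print(first_two.shape)
```

Applied to the Iris `df`, the same unpacking would give 150 rows and 5 columns, matching the tuple shown above.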

In [4]:
df.head(10) # TODO: use the head() method to display the first 10 rows of the dataframe

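|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |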


We can see that there are 5 columns in this dataset, meaning 5 different variables have been recorded for each Iris flower. Right now the columns are labeled only with the integers 0 through 4, so we can't tell from the DataFrame what each one corresponds to. We can assign to the `columns` attribute to give each column a descriptive name. Check the [documentation](https://archive.ics.uci.edu/dataset/53/iris) for the Iris dataset to see what each column corresponds to (they are listed in the "Features" table).

In [5]:
#TODO: fill in the two missing column names
column_names = ['sepal_length','sepal_width','petal_length','petal_width','class'] # a list of strings

In [6]:
df.columns = column_names # set the columns of the dataframe to this list

In [7]:
df.head() # TODO: check the head of the DataFrame again to ensure the column names have been set

|   | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
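
As an alternative to assigning `df.columns` after loading, `read_csv` also accepts a `names` argument, so the columns can be named in the same call. The sketch below is one possible variation, not the notebook's own approach; it reads two rows from an in-memory string, and `sample_csv` and `df_named` are illustrative names.

```python
import io

import pandas as pd

column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

# Two data rows copied from the notebook output, used in place of the real file
sample_csv = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)

# names= supplies the column labels; header=None keeps the first line as data
df_named = pd.read_csv(io.StringIO(sample_csv), header=None, names=column_names)

print(df_named.head())
```

Both approaches give the same result; setting the names at load time simply avoids a period where the DataFrame carries uninformative integer labels.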


Finally, we can use the `describe` method to understand the numeric features a bit better. What are their average values? Which variable has the highest maximum value?

In [8]:
df.describe() # TODO: use the describe() method to print summary statistics for the four numeric features

|   | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


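These values are *summary statistics* of our data. `count` is the number of non-missing observations in each column; it is 150 for all four columns here because no values are missing. We then have the `mean`, the standard deviation (`std`), the minimum and maximum (`min` and `max`, respectively), and the `25%`, `50%`, and `75%` percentiles, also called quartiles. The `50%` value is the median. Note that the non-numeric `class` column is left out of the summary.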
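
Each row of the `describe()` output can also be computed directly on a single column. The sketch below checks this on the five rows shown by `df.head()` earlier, loaded from an in-memory string; the names `sample`, `summary`, and `petal` are illustrative, and the numbers it prints describe only this 5-row sample, not the full dataset.

```python
import io

import pandas as pd

column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

# The first five rows of the dataset, copied from the notebook output
sample_csv = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
    "4.6,3.1,1.5,0.2,Iris-setosa\n"
    "5.0,3.6,1.4,0.2,Iris-setosa\n"
)
sample = pd.read_csv(io.StringIO(sample_csv), header=None, names=column_names)

summary = sample.describe()
petal = sample["petal_length"]

# Each pair prints the describe() entry next to the equivalent direct call
print(summary.loc["mean", "petal_length"], petal.mean())
print(summary.loc["std", "petal_length"], petal.std())  # sample standard deviation
print(summary.loc["25%", "petal_length"], petal.quantile(0.25))
print(summary.loc["50%", "petal_length"], petal.median())

# describe() skips the non-numeric "class" column by default;
# value_counts() is one way to summarize it instead
print(sample["class"].value_counts())
```

On the full `df`, `df["class"].value_counts()` would show how many flowers of each species the dataset contains.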