# Data Exploration - Get to Know Your Data
__Objectives__:
* Access and summarize data stored in a DataFrame.
* Select subsets of a DataFrame (__data indexing/slicing__).

---

In [1]:
import pandas as pd

## Load Data into Pandas

We will work with the __Iris__ dataset (`iris.csv`). Let's see what it looks like:

__Note:__ Make sure the path is correct.

In [2]:
less ../datasets/iris.csv

In [3]:
# Load from CSV (Comma Separated Values) file

dataset = "../datasets/iris.csv"

iris = pd.read_csv(dataset)
iris

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


---
## Explore the Data

### View the Data

In [5]:
# Show the first lines of our dataset

iris.head(8)

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa


In [7]:
# Show the last lines of our dataset

iris.tail(11)

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
139,6.9,3.1,5.4,2.1,Iris-virginica
140,6.7,3.1,5.6,2.4,Iris-virginica
141,6.9,3.1,5.1,2.3,Iris-virginica
142,5.8,2.7,5.1,1.9,Iris-virginica
143,6.8,3.2,5.9,2.3,Iris-virginica
144,6.7,3.3,5.7,2.5,Iris-virginica
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [17]:
# Show a random sample of our dataset (+ random state)

iris.sample(10, random_state=8)

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
4,5.0,3.6,1.4,0.2,Iris-setosa
29,4.7,3.2,1.6,0.2,Iris-setosa
27,5.2,3.5,1.5,0.2,Iris-setosa
141,6.9,3.1,5.1,2.3,Iris-virginica
65,6.7,3.1,4.4,1.4,Iris-versicolor
34,4.9,3.1,1.5,0.1,Iris-setosa
23,5.1,3.3,1.7,0.5,Iris-setosa
145,6.7,3.0,5.2,2.3,Iris-virginica
132,6.4,2.8,5.6,2.2,Iris-virginica
74,6.4,2.9,4.3,1.3,Iris-versicolor
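The `random_state` argument fixes the seed of the random draw, so the same rows come back on every run. A minimal sketch on a small made-up frame (the name `toy` and its contents are assumptions for illustration, not part of the Iris data):

```python
import pandas as pd

# Hypothetical frame standing in for a real dataset
toy = pd.DataFrame({"value": range(20)})

first = toy.sample(5, random_state=8)
second = toy.sample(5, random_state=8)

# Same seed: same rows, in the same order
print(first.equals(second))  # True
```

Without `random_state`, two calls will usually return different rows.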


In [18]:
# Get the column names of our dataset

iris.columns

Index(['sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'class'], dtype='object')

In [19]:
# Get the (row) index of our dataset

iris.index

RangeIndex(start=0, stop=150, step=1)

In [22]:
# Get all the values in our dataset as a NumPy array
display(iris.values)
type(iris.values)

array([[5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
       [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
       [4.7, 3.2, 1.3, 0.2, 'Iris-setosa'],
       [4.6, 3.1, 1.5, 0.2, 'Iris-setosa'],
       [5.0, 3.6, 1.4, 0.2, 'Iris-setosa'],
       [5.4, 3.9, 1.7, 0.4, 'Iris-setosa'],
       [4.6, 3.4, 1.4, 0.3, 'Iris-setosa'],
       [5.0, 3.4, 1.5, 0.2, 'Iris-setosa'],
       [4.4, 2.9, 1.4, 0.2, 'Iris-setosa'],
       [4.9, 3.1, 1.5, 0.1, 'Iris-setosa'],
       [5.4, 3.7, 1.5, 0.2, 'Iris-setosa'],
       [4.8, 3.4, 1.6, 0.2, 'Iris-setosa'],
       [4.8, 3.0, 1.4, 0.1, 'Iris-setosa'],
       [4.3, 3.0, 1.1, 0.1, 'Iris-setosa'],
       [5.8, 4.0, 1.2, 0.2, 'Iris-setosa'],
       [5.7, 4.4, 1.5, 0.4, 'Iris-setosa'],
       [5.4, 3.9, 1.3, 0.4, 'Iris-setosa'],
       [5.1, 3.5, 1.4, 0.3, 'Iris-setosa'],
       [5.7, 3.8, 1.7, 0.3, 'Iris-setosa'],
       [5.1, 3.8, 1.5, 0.3, 'Iris-setosa'],
       [5.4, 3.4, 1.7, 0.2, 'Iris-setosa'],
       [5.1, 3.7, 1.5, 0.4, 'Iris-setosa'],
       [4.6, 3.6, 1.0, 0.2, 'Iri

numpy.ndarray

In [26]:
# Show the shape of our dataset

iris.shape

(150, 5)

In [27]:
# Show the size of our dataset (total number of elements: rows x columns)

iris.size

750
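The relation between `shape`, `size` and `len()` can be checked on a tiny frame. This is only a sketch; `toy` is a hypothetical 3x2 frame made up for the example:

```python
import pandas as pd

# Hypothetical 3x2 frame, used only for illustration
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple
print(toy.shape)             # (3, 2)
print(toy.size)              # 6, i.e. n_rows * n_cols
print(len(toy))              # 3, len() counts rows only
```

For Iris the same arithmetic gives 150 x 5 = 750.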

In [28]:
# Show the data types held in our dataset

iris.dtypes

sepallength    float64
sepalwidth     float64
petallength    float64
petalwidth     float64
class           object
dtype: object

### Data High Level Description

In [30]:
# Print information about our dataframe (size, non-null, data types, etc.)

iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   sepallength  150 non-null    float64
 1   sepalwidth   150 non-null    float64
 2   petallength  150 non-null    float64
 3   petalwidth   150 non-null    float64
 4   class        150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [4]:
# Print descriptive statistics of our dataset

iris.describe(include='all')

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
count,150.0,150.0,150.0,150.0,150
unique,,,,,3
top,,,,,Iris-setosa
freq,,,,,50
mean,5.843333,3.054,3.758667,1.198667,
std,0.828066,0.433594,1.76442,0.763161,
min,4.3,2.0,1.0,0.1,
25%,5.1,2.8,1.6,0.3,
50%,5.8,3.0,4.35,1.3,
75%,6.4,3.3,5.1,1.8,
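To see what `include='all'` changes, the sketch below compares it with the default call on a small made-up frame (`toy`, `length` and `label` are hypothetical names). By default `describe()` summarizes only the numeric columns; with `include='all'` the non-numeric columns get `unique`, `top` and `freq` rows as well:

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column
toy = pd.DataFrame({
    "length": [5.1, 4.9, 6.3],
    "label": ["a", "a", "b"],
})

numeric_only = toy.describe()              # default: numeric columns only
everything = toy.describe(include="all")   # also summarizes the text column

print(list(numeric_only.columns))          # ['length']
print("unique" in everything.index)        # True
print(everything.loc["top", "label"])      # a (the most frequent value)
```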


# Data Subset Selection 

<b>Extracting smaller parts from our main dataset.</b>

__We often use the terms:__
* __Indexing:__ Selecting one or more elements by label or position (rows, columns, or values).
* __Slicing:__ Selecting a range of elements using the slice syntax.

\* __Slice syntax reminder:__ `array[ start:stop ]` or `array[ start:stop:step ]`

__N.B.:__ In pandas, `df['col']` selects columns, while `df[ start:stop ]` slices rows.

In [52]:
# Indexing Example

iris['class']

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: class, Length: 150, dtype: object

In [53]:
# Slicing Example

iris[10:15]

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
10,5.4,3.7,1.5,0.2,Iris-setosa
11,4.8,3.4,1.6,0.2,Iris-setosa
12,4.8,3.0,1.4,0.1,Iris-setosa
13,4.3,3.0,1.1,0.1,Iris-setosa
14,5.8,4.0,1.2,0.2,Iris-setosa


## Access Methods

The __options__ we have __to select__ subsets.

<br>

__Our choices:__

__1.__ Square-bracket notation - `[]`

<br>

__2.__ Attribute access (or dot notation) - `.`

<br>

__3.__ Selection by label - `.loc`

<br>

__4.__ Selection by position - `.iloc`

<br>

__Note:__ Pandas also has `.at` and `.iat`. We won't cover these.


## 1. Square-Bracket Notation
You are already familiar with `[]` from __Python Basics__.

This is how we use it if we have Series or DataFrames:

Object Type | Selection
------------|----------
Series | `series[index]` **or** `series[start:stop]`
DataFrame | `frame[colname]` **or** `frame[start:stop]`

__Pros:__
* Easy and quick to use
* Python-wide (works with lists, tuples, dictionaries, etc.)

__Cons:__
* Can be confusing (behavior differs between Series and DataFrames)
* Limited flexibility

__Note:__ When slicing (e.g. `series[1:4]` or `df[1:4]`), the end index is exclusive (**end-exclusive**).
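The four forms in the table above can be tried on a small frame. This is a sketch with made-up data (`toy` and `s` are hypothetical names); it assumes the default integer index, as in the Iris frame:

```python
import pandas as pd

toy = pd.DataFrame({"x": [10, 20, 30, 40], "y": ["a", "b", "c", "d"]})
s = toy["x"]              # frame[colname] -> a Series

print(s[2])               # series[index]: value stored under label 2 -> 30
print(s[1:3].tolist())    # series[start:stop]: positions 1 and 2 -> [20, 30]
print(len(toy[1:3]))      # frame[start:stop]: rows at positions 1 and 2 -> 2

# A bare integer is looked up among the column names, not the rows
try:
    toy[0]
except KeyError:
    print("toy[0] raises KeyError: there is no column named 0")
```

The last case is the main source of confusion: the same brackets mean "column" for a name and "rows" for a slice.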

__Example:__ Indexing and slicing with `[]`

In [54]:
# Select an Iris column, giving its column name

iris['sepallength']

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepallength, Length: 150, dtype: float64

In [55]:
# Select a subset of rows, providing a slice

iris[5:15]

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa
10,5.4,3.7,1.5,0.2,Iris-setosa
11,4.8,3.4,1.6,0.2,Iris-setosa
12,4.8,3.0,1.4,0.1,Iris-setosa
13,4.3,3.0,1.1,0.1,Iris-setosa
14,5.8,4.0,1.2,0.2,Iris-setosa


In [56]:
# Avoid chained indexing (acceptable when only reading, unreliable when assigning)

iris[10:15]['petalwidth']

10    0.2
11    0.2
12    0.1
13    0.1
14    0.2
Name: petalwidth, dtype: float64
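Chained indexing runs two selections one after the other; the same subset can be requested in a single `.loc` call. A minimal sketch on made-up data (`toy`, `chained`, `single` and `fixed` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({
    "petalwidth": [0.2, 0.2, 0.1, 0.1, 0.2],
    "class": ["Iris-setosa"] * 5,
})

chained = toy[1:4]["petalwidth"]       # two steps: slice rows, then pick a column
single = toy.loc[1:3, "petalwidth"]    # one step (.loc is end-inclusive: labels 1, 2, 3)
print(chained.equals(single))          # True

# For assignment, the single-step form targets the frame itself.
# A copy is used here so that `toy` stays unchanged.
fixed = toy.copy()
fixed.loc[1:3, "petalwidth"] = 0.0
print(fixed["petalwidth"].tolist())    # [0.2, 0.0, 0.0, 0.0, 0.2]
```

With the chained form, an assignment may land on a temporary object instead of the original frame, which is why it is discouraged for writing.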

## 2. Attribute Access (Dot notation)

Allows us to access a (Series) index label or a (DataFrame) column directly __as an attribute__ (e.g. `iris.sepallength`).

__Pros:__
* The easiest and fastest to use

__Cons:__
* Will not work if the name conflicts with an existing attribute or method (e.g. `iris.size` returns the number of elements, not a column named `size`)
* Will not work if it conflicts with Python keywords (e.g. `iris.class` is not allowed)
* Will not work if there is a space in the column name (e.g. `iris.sepal length` is not allowed)
* Will not work with integer labels (e.g. `iris.1` is not allowed)

__Bottom line:__ It only works with valid Python identifiers (https://docs.python.org/3/reference/lexical_analysis.html#identifiers)
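The limitations listed above can be reproduced on a small frame whose column names are chosen to trigger them. The frame `toy` and its columns are assumptions made for this sketch; square brackets work in every case:

```python
import pandas as pd

toy = pd.DataFrame({
    "sepallength": [5.1, 4.9],
    "class": ["Iris-setosa", "Iris-setosa"],   # Python keyword
    "size": [1, 2],                            # clashes with DataFrame.size
    "sepal length": [5.1, 4.9],                # contains a space
})

print(toy.sepallength.tolist())      # valid identifier: dot notation works -> [5.1, 4.9]
print(toy.size)                      # 8: the DataFrame attribute (2 rows x 4 columns)
print(toy["size"].tolist())          # [1, 2]: the actual column
print(toy["class"].tolist())         # toy.class would be a SyntaxError
print(toy["sepal length"].tolist())  # a name with a space needs brackets
```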

__Example:__ Try to access the __Iris__ columns with dot notation

In [58]:
# Will all of them work?

iris.class

SyntaxError: invalid syntax (1784424474.py, line 3)

## Recommended access methods for Pandas

* Using `.loc` (access by label - label location)
* Using `.iloc` (access by integer position - integer location)

<br>
<br>

__Note:__ Slicing Pandas DataFrames with the `[]` notation you learned in _Python Basics_ can lead to confusion. It is suggested to use `.loc` and `.iloc` wherever possible.

## 3. Selection by label

### Using the `.loc` Property
__Access a group of rows and columns by labels.__

__Pros:__
* Very powerful and flexible compared to the previous options.
* Explicit and consistent syntax (clear to read).

__Cons:__
* A bit lengthy to write.

<br>


__Note:__ `.loc` is end-inclusive!

__Syntax format:__
* `frame.loc[ rows , columns ]`
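The end-inclusive behavior of `.loc` is easiest to see next to a plain `[]` slice over the same bounds. A minimal sketch, assuming a made-up frame `toy` with the default integer index:

```python
import pandas as pd

toy = pd.DataFrame({"x": [10, 20, 30, 40, 50]})

by_label = toy.loc[1:3, :]   # labels 1, 2 and 3 (end-inclusive)
by_bracket = toy[1:3]        # positions 1 and 2 only (end-exclusive)

print(len(by_label))    # 3
print(len(by_bracket))  # 2
```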

### Select Rows using `.loc`

In [59]:
iris

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [61]:
# Select a particular row and all columns

iris.loc[ 4 , : ]

sepallength            5.0
sepalwidth             3.6
petallength            1.4
petalwidth             0.2
class          Iris-setosa
Name: 4, dtype: object

In [62]:
# Select multiple rows and all columns

iris.loc[ [1, 3, 149] , : ]

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
1,4.9,3.0,1.4,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
149,5.9,3.0,5.1,1.8,Iris-virginica


In [63]:
# Select a range of rows and all columns

iris.loc[ 3:8 , : ]

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa


### Lazy alternative (not recommended!)

<br>

Below we do exactly the same thing, but we let pandas assume that we mean "all columns".

__Example:__ `iris.loc[ 4:7 ]`

pandas will treat this as: `iris.loc[ 4:7, : ]`


<br>

<b>Though remember Python's philosophy:<br>
    Explicit is better than implicit</b>
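That the short and the explicit form return the same result can be verified directly. A small sketch on made-up data (`toy`, `implicit` and `explicit` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({"x": [10, 20, 30, 40], "y": ["a", "b", "c", "d"]})

implicit = toy.loc[1:2]       # columns omitted
explicit = toy.loc[1:2, :]    # columns spelled out

print(implicit.equals(explicit))  # True
```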

In [64]:
# Select a range of rows (all columns implicitly)
# pandas will treat this as iris.loc[ 5:9, : ]

iris.loc[5:9]

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


---
### Select Columns using `.loc`

In [66]:
# Select a column with all its rows

iris.loc[ : , 'sepallength' ]

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepallength, Length: 150, dtype: float64

In [67]:
# Select multiple columns with all the rows

iris.loc[ : , ['class', 'petalwidth'] ]

Unnamed: 0,class,petalwidth
0,Iris-setosa,0.2
1,Iris-setosa,0.2
2,Iris-setosa,0.2
3,Iris-setosa,0.2
4,Iris-setosa,0.2
...,...,...
145,Iris-virginica,2.3
146,Iris-virginica,1.9
147,Iris-virginica,2.0
148,Iris-virginica,2.3


In [68]:
iris

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [69]:
# Select range of columns with all the rows

iris.loc[ : , 'sepalwidth':'petalwidth' ]

Unnamed: 0,sepalwidth,petallength,petalwidth
0,3.5,1.4,0.2
1,3.0,1.4,0.2
2,3.2,1.3,0.2
3,3.1,1.5,0.2
4,3.6,1.4,0.2
...,...,...,...
145,3.0,5.2,2.3
146,2.5,5.0,1.9
147,3.0,5.2,2.0
148,3.4,5.4,2.3


### Combine Row and Column Selection using `.loc`

In [70]:
# Select from range of rows and range of columns

iris.loc[ 2:7 , 'sepallength':'petallength' ]

Unnamed: 0,sepallength,sepalwidth,petallength
2,4.7,3.2,1.3
3,4.6,3.1,1.5
4,5.0,3.6,1.4
5,5.4,3.9,1.7
6,4.6,3.4,1.4
7,5.0,3.4,1.5


In [71]:
# Select the value at the intersection of row and column

iris.loc[4, 'sepalwidth']

np.float64(3.6)

---
### Use `.loc` to Filter with Boolean Conditions
__A sneak peek into Filtering.__ (Hopefully we will have time to see some filtering later)

In [5]:
# Select/filter the rows where 'sepallength' is greater than 7

iris.loc[ iris.loc[: , 'sepallength'] > 7 , :]

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
102,7.1,3.0,5.9,2.1,Iris-virginica
105,7.6,3.0,6.6,2.1,Iris-virginica
107,7.3,2.9,6.3,1.8,Iris-virginica
109,7.2,3.6,6.1,2.5,Iris-virginica
117,7.7,3.8,6.7,2.2,Iris-virginica
118,7.7,2.6,6.9,2.3,Iris-virginica
122,7.7,2.8,6.7,2.0,Iris-virginica
125,7.2,3.2,6.0,1.8,Iris-virginica
129,7.2,3.0,5.8,1.6,Iris-virginica
130,7.4,2.8,6.1,1.9,Iris-virginica


In [7]:
# Let's dissect what we just did - print the mask used

mask = iris.loc[: , 'sepallength'] > 7
display(mask)

iris.loc[mask, :]

0      False
1      False
2      False
3      False
4      False
       ...  
145    False
146    False
147    False
148    False
149    False
Name: sepallength, Length: 150, dtype: bool

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
102,7.1,3.0,5.9,2.1,Iris-virginica
105,7.6,3.0,6.6,2.1,Iris-virginica
107,7.3,2.9,6.3,1.8,Iris-virginica
109,7.2,3.6,6.1,2.5,Iris-virginica
117,7.7,3.8,6.7,2.2,Iris-virginica
118,7.7,2.6,6.9,2.3,Iris-virginica
122,7.7,2.8,6.7,2.0,Iris-virginica
125,7.2,3.2,6.0,1.8,Iris-virginica
129,7.2,3.0,5.8,1.6,Iris-virginica
130,7.4,2.8,6.1,1.9,Iris-virginica
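The mask is an ordinary boolean Series with one entry per row, so it can be inspected or counted before it is used. The sketch below uses a small made-up frame (`toy` and its values are assumptions) and also shows that the column part of `.loc` still works alongside a mask:

```python
import pandas as pd

toy = pd.DataFrame({
    "sepallength": [5.1, 7.1, 6.3, 7.6],
    "class": ["setosa", "virginica", "virginica", "virginica"],
})

mask = toy.loc[:, "sepallength"] > 7   # boolean Series, one entry per row
print(mask.tolist())                   # [False, True, False, True]
print(int(mask.sum()))                 # 2 rows satisfy the condition

# The mask picks the rows; the second argument still picks the columns
selected = toy.loc[mask, ["class"]]
print(selected.index.tolist())         # [1, 3]
```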


---
## 4. Selection by position

### Using the `.iloc` Property
__Access a group of rows and columns by integer position(s)*.__

__*__ By __'integer positions'__ we mean the actual positions of the rows and columns, counted from 0, regardless of their labels.

__Pros:__
* Also very flexible and powerful.
* Also explicit and consistent syntax (clear to read).

__Cons:__
* Also lengthier than the first 2 options


<br>

__Note:__ `.iloc` is end-exclusive! (similar to the `[]` notation)

__Syntax format:__
* `frame.iloc[ rows , columns ]`
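In the Iris frame, labels and positions happen to coincide, which hides the difference between `.loc` and `.iloc`. The sketch below uses a made-up frame (`toy`) whose row labels are 10, 20, 30, 40, so the two no longer match:

```python
import pandas as pd

toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]}, index=[10, 20, 30, 40])

print(toy.loc[10:30, "x"].tolist())   # labels 10 to 30, end-inclusive -> [1.0, 2.0, 3.0]
print(toy.iloc[0:3, 0].tolist())      # positions 0, 1, 2, end-exclusive -> [1.0, 2.0, 3.0]
print(toy.iloc[-1, 0])                # last row by position -> 4.0

# Position 0 exists, but there is no row labelled 0
try:
    toy.loc[0, "x"]
except KeyError:
    print("toy.loc[0, 'x'] raises KeyError")
```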

### Select Columns with `.iloc`

In [None]:
iris

In [74]:
# Select all rows of a single column (position 4 is the last column, 'class')

iris.iloc[: , 4]

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: class, Length: 150, dtype: object

In [77]:
# Select all rows, from a range of columns

iris.iloc[: , 1:4]

Unnamed: 0,sepalwidth,petallength,petalwidth
0,3.5,1.4,0.2
1,3.0,1.4,0.2
2,3.2,1.3,0.2
3,3.1,1.5,0.2
4,3.6,1.4,0.2
...,...,...,...
145,3.0,5.2,2.3
146,2.5,5.0,1.9
147,3.0,5.2,2.0
148,3.4,5.4,2.3


### Select Rows with `.iloc`

In [78]:
# Select a range of rows, all columns

iris.iloc[ 3:8 , : ]

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa


### Combine Row and Column Selection using `.iloc`

In [83]:
# Select from a specific row to the end

iris.iloc[ 142: , : ]

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
142,5.8,2.7,5.1,1.9,Iris-virginica
143,6.8,3.2,5.9,2.3,Iris-virginica
144,6.7,3.3,5.7,2.5,Iris-virginica
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica


In [84]:
# Select a range of rows and a range of columns

iris.iloc[ 125:131 , 0:3 ]

Unnamed: 0,sepallength,sepalwidth,petallength
125,7.2,3.2,6.0
126,6.2,2.8,4.8
127,6.1,3.0,4.9
128,6.4,2.8,5.6
129,7.2,3.0,5.8
130,7.4,2.8,6.1
