# Data Exploration - Get to Know Your Data
__Objectives__:
* Access and summarize data stored in a DataFrame.
* Select subsets of a DataFrame (__data indexing/slicing__).

---

In [None]:
import pandas as pd

## Load Data into Pandas

We will work with the __Iris__ dataset (`iris.csv`). Let's see what it looks like:

__Note:__ Make sure the path is correct.
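As a minimal sketch of the loading step, the snippet below reads CSV text with `pd.read_csv`. The inline text is a hypothetical stand-in for `iris.csv`, and the column names (`sepallength`, `sepalwidth`, `petallength`, `petalwidth`, `class`) are an assumption about the file's header row.

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of iris.csv (header row + 3 records).
csv_text = """sepallength,sepalwidth,petallength,petalwidth,class
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# read_csv accepts a file path or any file-like object.
# With the real file on disk this would be: pd.read_csv("iris.csv")
iris_demo = pd.read_csv(io.StringIO(csv_text))

print(iris_demo.shape)
print(iris_demo.columns.tolist())
```

If the file lives elsewhere, only the path passed to `read_csv` changes.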

In [None]:
less iris.csv

In [None]:
# Load from CSV (Comma Separated Values) file

dataset = "iris.csv"



---
## Explore the Data

### View the Data

In [None]:
# Show the first lines of our dataset



In [None]:
# Show the last lines of our dataset



In [None]:
# Show a random sample of our dataset (+ random state)



In [None]:
# Get the column names of our dataset



In [None]:
# Get the (row) index of our dataset



In [None]:
# Get all the values in our dataset in a Numpy array



In [None]:
# Show the shape of our dataset



In [None]:
# Show the size of our dataset



In [None]:
# Show the data types held in our dataset



### Data High Level Description

In [None]:
# Print information about our dataframe (size, non-null, data types, etc.)



In [None]:
# Print descriptive statistics of our dataset



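The inspection steps in this section can be sketched on a small toy frame. The values and column names below are illustrative assumptions, not the real contents of `iris.csv`.

```python
import pandas as pd

# Toy stand-in for the Iris data (hypothetical values and column names).
iris_demo = pd.DataFrame({
    "sepallength": [5.1, 4.9, 7.1, 6.3, 7.6],
    "sepalwidth": [3.5, 3.0, 3.0, 2.9, 3.0],
    "petallength": [1.4, 1.4, 5.9, 5.6, 6.6],
    "petalwidth": [0.2, 0.2, 2.1, 1.8, 2.1],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-virginica",
              "Iris-virginica", "Iris-virginica"],
})

first_rows = iris_demo.head(2)                        # first 2 rows
last_rows = iris_demo.tail(2)                         # last 2 rows
sample_rows = iris_demo.sample(n=3, random_state=42)  # reproducible random sample
column_names = iris_demo.columns.tolist()             # column labels
row_index = iris_demo.index                           # row labels
values = iris_demo.to_numpy()                         # all values as a NumPy array
n_rows, n_cols = iris_demo.shape                      # (rows, columns)
n_cells = iris_demo.size                              # rows * columns
dtypes = iris_demo.dtypes                             # data type of each column

iris_demo.info()                 # prints size, non-null counts and dtypes
summary = iris_demo.describe()   # descriptive statistics of the numeric columns
print(summary)
```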
---
# Data Subset Selection 

<b>Extracting smaller parts from our main dataset.</b>

__We often use the terms:__
* __Indexing:__ Selecting a single element (a row/column for DataFrames, or a value in a Series) using its index.
* __Slicing:__ Selecting a sequence of elements based on a slice*.

\* __Reminder:__ Slice syntax: `array[ start:stop ]` or `array[ start:stop:step ]`.
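To make the two terms concrete, here is a small sketch on a plain Python list and a pandas Series. The names `letters` and `series` are made up for illustration.

```python
import pandas as pd

letters = ["a", "b", "c", "d", "e", "f"]

first = letters[0]            # indexing: one element
middle = letters[1:4]         # slicing: positions 1, 2, 3 (stop is excluded)
every_other = letters[0:6:2]  # slicing with a step of 2

series = pd.Series([10, 20, 30, 40], index=["w", "x", "y", "z"])
value_y = series["y"]         # indexing a Series by its label

print(first, middle, every_other, value_y)
```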

__Example:__ Indexing and slicing

In [None]:
# Indexing Example



In [None]:
# Slicing Example



## Access Methods

The __options__ we have __to select__ subsets.

<br>

__Our choices:__

__1.__ Square-bracket notation - `[]`

<br>

__2.__ Attribute access (or dot notation) - `.`

<br>

__3.__ Selection by label - `.loc`

<br>

__4.__ Selection by position - `.iloc`

<br>

__Note:__ Pandas also has `.at` and `.iat`. We won't cover these.


## 1. Square-Bracket Notation
You are already familiar with `[]` from __Python Basics__.

This is how we use it if we have Series or DataFrames:

Object Type | Selection
------------|----------
Series | `series[index]`
DataFrame | `frame[colname]`

__Pros:__
* Easy and fast use
* Python-wide (works with lists, tuples, dictionaries, etc.)

__Cons:__
* Can lead to confusion (pandas decides implicitly whether you mean columns or rows)
* Limited functionality

__Note:__ Positional slicing with `[]` is end-exclusive!
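The behaviour described above can be sketched on a toy frame (hypothetical values and column names). Note how the same brackets select a column when given a name and rows when given a slice.

```python
import pandas as pd

# Toy stand-in for the Iris data (hypothetical values and column names).
iris_demo = pd.DataFrame({
    "sepallength": [5.1, 4.9, 7.1, 6.3, 7.6],
    "sepalwidth": [3.5, 3.0, 3.0, 2.9, 3.0],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-virginica",
              "Iris-virginica", "Iris-virginica"],
})

col = iris_demo["sepallength"]                  # a column name returns a Series
two_cols = iris_demo[["sepallength", "class"]]  # a list of names returns a DataFrame
rows = iris_demo[1:3]                           # a slice selects ROWS at positions 1 and 2

print(rows)
```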

__Example:__ Indexing and slicing with `[]`

In [None]:
# Select an Iris column, giving its column name



In [1]:
# Select a subset of rows, providing a slice



In [None]:
# Avoid chain indexing



## 2. Attribute Access (Dot notation)

Allows us to access a (Series) index or a (DataFrame) column directly __as an attribute__ (e.g. `iris.sepallength`).

__Pros:__
* The easiest and fastest to use

__Cons:__
* Will not work if the name conflicts with an existing DataFrame attribute or method (e.g. `iris.size` returns the number of elements, not a column named `size`)
* Will not work if it conflicts with Python keywords (e.g. `iris.class` is not allowed)
* Will not work if there is a space in the column name (e.g. `iris.sepal length` is not allowed)
* Will not work with integer labels (e.g. `iris.1` is not allowed)

__Bottom line:__ It only works with column names that are valid Python identifiers (https://docs.python.org/3/reference/lexical_analysis.html#identifiers)
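A small sketch of these limitations follows. The columns `sepal width` and `size` are hypothetical names added only to trigger the problem cases.

```python
import keyword

import pandas as pd

iris_demo = pd.DataFrame({
    "sepallength": [5.1, 7.0, 6.3],
    "sepal width": [3.5, 3.2, 3.3],  # hypothetical name containing a space
    "size": [1, 2, 3],               # hypothetical name clashing with DataFrame.size
    "class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Dot notation and brackets return the same column when the name is "safe".
same = iris_demo.sepallength.equals(iris_demo["sepallength"])

# The attribute wins over the column: this is the number of cells (3 rows * 4 columns).
size_attr = iris_demo.size
size_col = iris_demo["size"]  # brackets always reach the column

# Which names are valid, non-keyword Python identifiers?
valid_identifier = {
    name: name.isidentifier() and not keyword.iskeyword(name)
    for name in iris_demo.columns
}
print(valid_identifier)
```

Square brackets (`iris_demo["class"]`, `iris_demo["sepal width"]`) keep working in every one of these cases.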

__Example:__ Try to access the __Iris__ columns with dot notation

In [None]:
# Will all of them work?



Two common ways to slice:
* Using `.loc` (access by label - label location)
* Using `.iloc` (access by integer position - integer location)


__Note:__ You can also slice Pandas DataFrames with the `[]` notation you learned in _Python Basics_. However, this can lead to confusion, so we suggest using `.loc` and `.iloc` wherever possible.

In [None]:
# Print dataset
iris

## 3. Selection by Label

### Using the `.loc` Property
__Access a group of rows and columns by labels.__

__Pros:__
* Very powerful and flexible compared to the previous options.
* Explicit and consistent syntax (clear to read).

__Cons:__
* A bit lengthy to write.

<br>


__Note:__ `.loc` is end-inclusive!

__Syntax format:__
* `frame.loc[ rows , columns ]`
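The `frame.loc[ rows , columns ]` pattern can be sketched on a toy frame (hypothetical values and column names). Both the row slice and the column slice include their end label.

```python
import pandas as pd

# Toy stand-in for the Iris data (hypothetical values and column names).
iris_demo = pd.DataFrame({
    "sepallength": [5.1, 4.9, 7.1, 6.3, 7.6],
    "sepalwidth": [3.5, 3.0, 3.0, 2.9, 3.0],
    "petallength": [1.4, 1.4, 5.9, 5.6, 6.6],
    "petalwidth": [0.2, 0.2, 2.1, 1.8, 2.1],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-virginica",
              "Iris-virginica", "Iris-virginica"],
})

one_row = iris_demo.loc[2, :]                             # a single row, all columns
row_range = iris_demo.loc[1:3, :]                         # labels 1, 2 AND 3
col_range = iris_demo.loc[:, "sepallength":"petallength"] # end column included
picked = iris_demo.loc[[0, 4], ["sepallength", "class"]]  # lists of labels
cell = iris_demo.loc[2, "class"]                          # a single value

print(row_range)
```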

### Select Rows using `.loc`

In [None]:
# Select a particular row and all columns



In [None]:
# Select multiple rows and all columns



In [None]:
# Select a range of rows and all columns



### Lazy alternative (not recommended!)

<br>

Below we do exactly the same thing, but we let pandas assume that we mean "all columns".

__Example:__ `iris.loc[ 4:7 ]`

Pandas will assume that we mean: `iris.loc[ 4:7, : ]`


<br>

<b>Though remember Python's philosophy:<br>
    Explicit is better than implicit.</b>
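A quick check on a toy frame (hypothetical values) suggests that the implicit and explicit forms return the same rows and columns:

```python
import pandas as pd

# Toy frame with 10 rows so that labels 4 to 7 exist (hypothetical values).
iris_demo = pd.DataFrame({
    "sepallength": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9],
    "class": ["Iris-setosa"] * 10,
})

implicit = iris_demo.loc[4:7]     # columns left out
explicit = iris_demo.loc[4:7, :]  # columns spelled out

print(implicit.equals(explicit))
```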

In [None]:
# Select a range of rows (all columns implicitly)
# Pandas will assume that we mean iris.loc[ 4:7, : ]



---
### Select Columns using `.loc`

In [None]:
# Select a column with all its rows



In [None]:
# Select multiple columns with all the rows



In [None]:
# Select range of columns with all the rows



### Combine Row and Column Selection using `.loc`

In [None]:
# Select from range of rows and range of columns



In [None]:
# Select the value at the intersection of row and column



---
### Use `.loc` to Filter with Boolean Conditions
__A sneak peek into Filtering.__ (Hopefully we will have time to see some filtering later)
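As a sketch of the idea, a comparison on a column produces a boolean Series (the mask), and `.loc` keeps only the rows where the mask is `True`. The toy values below are hypothetical.

```python
import pandas as pd

# Toy stand-in for the Iris data (hypothetical values and column names).
iris_demo = pd.DataFrame({
    "sepallength": [5.1, 4.9, 7.1, 6.3, 7.6],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-virginica",
              "Iris-virginica", "Iris-virginica"],
})

mask = iris_demo["sepallength"] > 7  # boolean Series, one value per row
filtered = iris_demo.loc[mask, :]    # rows where the mask is True, all columns

print(mask.tolist())
print(filtered)
```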

In [None]:
# Select/filter the rows where 'sepallength' is greater than 7



In [None]:
# Let's dissect what we just did - print the mask used



---
## 4. Selection by Position

### Using the `.iloc` Property
__Access a group of rows and columns by integer position(s)*.__

__*__ By __'integer positions'__ we mean the 0-based positions of the rows and columns, regardless of their labels.

__Pros:__
* Also very flexible and powerful.
* Also explicit and consistent syntax (clear to read).

__Cons:__
* Also lengthier than the first 2 options


<br>

__Note:__ `.iloc` is end-exclusive! (similar to the `[]` notation)

__Syntax format:__
* `frame.iloc[ rows , columns ]`
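The sketch below uses a toy frame whose row labels (10, 20, ...) deliberately differ from the row positions (0, 1, ...), to show that `.iloc` counts positions and excludes the stop, while `.loc` matches labels and includes it. Values, labels and column names are hypothetical.

```python
import pandas as pd

# Toy frame with row labels that differ from positions (hypothetical values).
iris_demo = pd.DataFrame(
    {
        "sepallength": [5.1, 4.9, 7.1, 6.3, 7.6],
        "sepalwidth": [3.5, 3.0, 3.0, 2.9, 3.0],
        "petallength": [1.4, 1.4, 5.9, 5.6, 6.6],
    },
    index=[10, 20, 30, 40, 50],
)

first_col = iris_demo.iloc[:, 0]        # all rows, first column
by_position = iris_demo.iloc[1:3, :]    # positions 1 and 2 (stop excluded)
by_label = iris_demo.loc[20:40, :]      # labels 20, 30 and 40 (stop included)
to_the_end = iris_demo.iloc[3:, :]      # from position 3 to the last row
block = iris_demo.iloc[[0, 2], 1:3]     # rows at positions 0 and 2, columns 1 and 2

print(by_position.index.tolist(), by_label.index.tolist())
```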

In [None]:
# Print dataset
iris

### Select Columns with `.iloc`

In [None]:
# Select all rows, from a column



In [None]:
# Select all rows, from a range of columns



### Select Rows with `.iloc`

In [None]:
# Select a range of rows, all columns



### Combine Row and Column Selection using `.iloc`

In [None]:
# Select from a specific row to the end



In [None]:
# Select multiple rows, and a range of columns

