In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

from class_utils.download import download_file_maybe_extract
download_file_maybe_extract(
    "https://github.com/michalgregor/luiza_notebooks/blob/198b2032e36fbbcfe4c815fe0907eedab3345810/data/iris.csv?raw=true",
    directory="data"
)

## Package `pandas`

The `pandas` package is very useful when handling data. It can read data from various file formats and from databases, and it provides convenient ways of processing the data, computing basic statistical properties, quickly displaying simple plots, and so on.

### Dataframes

A dataframe is the basic `pandas` data type. It is essentially a table in which the named columns represent attributes and the rows correspond to entries. If we have some data with attributes `attr1, attr2, attr3`, we can create a dataframe from it as follows:



In [None]:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["attr1", "attr2", "attr3"])
print(df)

Selection of particular columns is easy:



In [None]:
a = df["attr1"]
print(a)

In [None]:
b = df[["attr2", "attr3"]]
print('df[["attr2", "attr3"]] = \n{}'.format(b))

A new column can be added by indexing the dataframe with the new column name and assigning an expression to it:



In [None]:
df["attr4"] = df["attr2"] + df["attr3"]
print(df)

The `columns` attribute can be used to get the list of all columns:



In [None]:
cols = df.columns
print(cols)

If we want to select a column by its numeric index, we can do the following:



In [None]:
a = df.iloc[:, 1]
print(a)

We can also select rows in the same manner:



In [None]:
a = df.iloc[1:3]
print(a)
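Row and column positions can also be combined in a single `iloc` call. The following is a minimal sketch on a small toy dataframe (the name `demo` is only illustrative); note that, as with Python lists, the end of a slice is exclusive:

```python
import pandas as pd

# Hypothetical toy dataframe mirroring the one used above.
demo = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                    columns=["attr1", "attr2", "attr3"])

# Rows 1 and 2 (the end of the slice is exclusive), columns 0 and 1.
block = demo.iloc[1:3, 0:2]
print(block)

# A single cell: row 2, column 1.
cell = demo.iloc[2, 1]
print(cell)
```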

By using the attribute `.values`, we can extract the data from the dataframe in the form of a standard `numpy` array:



In [None]:
print(df.values)
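The method `to_numpy()` returns the same kind of array and is the form the `pandas` documentation currently recommends. The sketch below uses a small illustrative dataframe (`demo` and `mixed` are made-up names) and also shows that mixing numeric and textual columns yields an array of generic Python objects:

```python
import pandas as pd

# Hypothetical toy dataframe with purely numeric columns.
demo = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                    columns=["attr1", "attr2", "attr3"])

arr = demo.to_numpy()
print(type(arr), arr.shape)

# Adding a textual column forces a common dtype for the whole array.
mixed = demo.assign(label=["a", "b"])
mixed_arr = mixed.to_numpy()
print(mixed_arr.dtype)
```

When only the numeric part is needed for further computation, it is usually better to select the numeric columns first and convert afterwards.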

### Reading Data from a CSV File

CSV files are simple text files containing data separated by commas, e.g.:

```
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.7,3.1,5.6,2.4,virginica
```
In the `pandas` package, a CSV file can be loaded using the `read_csv` function:



In [None]:
import pandas as pd
df = pd.read_csv('data/iris.csv')
df.head()

This function accepts a number of arguments that need to be set correctly for the file at hand, e.g. `sep=';'` if the entries are separated by semicolons (or some other character) rather than by commas. It is also possible to set `header=None` if the CSV file does not have a header row, and there are many other options.
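As a minimal sketch of these options, the following reads a small semicolon-separated table without a header row. The data and the column names are made up for illustration, and `io.StringIO` merely stands in for a file on disk:

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data without a header row.
raw = "5.1;3.5;setosa\n7.0;3.2;versicolor\n"

demo = pd.read_csv(
    io.StringIO(raw),   # a file path would work here as well
    sep=";",            # entries are separated by semicolons
    header=None,        # the first line is data, not column names
    names=["sepal_length", "sepal_width", "species"],  # illustrative names
)
print(demo)
```

Without the `names` argument, `pandas` would label the columns with the integers 0, 1, 2.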

### Transforming Categorical Values to Numbers

Some methods cannot handle categorical attributes (attributes that take one of a small number of possible values, often textual) if their values are represented as strings. In that case we need to convert such values to numbers. This can be done using the `OrdinalEncoder` class from the `sklearn` package, for instance:



In [None]:
from sklearn.preprocessing import OrdinalEncoder
ordenc = OrdinalEncoder()

# function fit_transform fits the parameters of the transformer
# using the data and also returns the transformed data itself
df['species_num'] = ordenc.fit_transform(df[['species']])

# we will display a few samples to verify that everything worked
df[["species", "species_num"]].iloc[[0, 1, 50, 80, 100, 101]]
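To see which number was assigned to which category, the fitted encoder can be inspected. The sketch below uses a small made-up column rather than the full dataset; `categories_` and `inverse_transform` are part of the `OrdinalEncoder` interface, while the names `demo`, `enc`, `codes` and `restored` are only illustrative:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column with three distinct categories.
demo = pd.DataFrame({"species": ["virginica", "setosa", "versicolor", "setosa"]})

enc = OrdinalEncoder()
# fit_transform returns a 2-D array; ravel() flattens it to 1-D.
codes = enc.fit_transform(demo[["species"]]).ravel()

# categories_ holds one array per encoded column;
# the position of a category in that array is its numeric code.
print(enc.categories_[0])
print(codes)

# inverse_transform maps the codes back to the original values
# (it expects a 2-D input, hence the reshape).
restored = enc.inverse_transform(codes.reshape(-1, 1)).ravel()
print(restored)
```

By default the categories are sorted, so the codes follow alphabetical order here and do not express any meaningful ranking of the classes.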

### Simple Plots and Statistics

The `pandas` package can also compute basic statistical measures and do simple plotting. To display information about the distribution of values in a column, we can use the `describe` method:



In [None]:
print(df.iloc[:, 0].describe())
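The `describe` method can also be called on a whole dataframe, or on groups of rows. The following sketch assumes a small made-up table; the column names `sepal_length` and `species` are illustrative and need not match the loaded file:

```python
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column.
demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})

# On a mixed dataframe, describe() summarizes the numeric columns only.
summary = demo.describe()
print(summary)

# The same statistics computed separately for each class.
per_class = demo.groupby("species")["sepal_length"].describe()
print(per_class)
```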

#### Boxplot

The dataframe interface makes it easy to display boxplots using the built-in `boxplot` method. Boxplots present information akin to that provided by `describe` in graphical form:



In [None]:
df.boxplot(column=df.columns[0])

#### Comparing Boxplots Across Classes

If we want to compare boxplots across all individual classes, we can do the following:



In [None]:
df.boxplot(column=df.columns[0], by='species')

#### Histograms

It is similarly easy to display column histograms:



In [None]:
df[df.columns[0]].plot(bins=20, kind='hist')

#### Comparing Histograms

If we want to compare histograms across the classes, we can use the following:



In [None]:
df.hist(column=df.columns[0], by='species', bins=50,
        sharex=True, sharey=True, figsize=[10, 8])

### Applying an Arbitrary Function to an Entire Column

The code above showed how an entire new column can be created from existing ones. However, new columns can also be created by applying an arbitrary function to every element of a column. Let us suppose that we want to determine the length of each string in column `species` and assign the result to a new column:



In [None]:
# the transformation
df["len"] = df["species"].map(len)

# we display a few samples
df["len"].iloc[[0, 1, 50, 80, 100, 101]]
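For common string operations there is often a vectorized alternative to `map`, and `map` itself also accepts a dictionary instead of a function. A minimal sketch on made-up data (the column names `len_map`, `len_str` and `short` are illustrative):

```python
import pandas as pd

# Hypothetical toy column of class names.
demo = pd.DataFrame({"species": ["setosa", "versicolor", "virginica"]})

# Applying a Python function to every element.
demo["len_map"] = demo["species"].map(len)

# The same result using the vectorized string accessor.
demo["len_str"] = demo["species"].str.len()

# map also accepts a dictionary that translates old values to new ones.
abbreviations = {"setosa": "se", "versicolor": "ve", "virginica": "vi"}
demo["short"] = demo["species"].map(abbreviations)

print(demo)
```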

### Group Comparisons and Indexing

As with `numpy` arrays, it is also possible to find and index the entries of a dataframe that meet a certain condition. For instance, we can select the rows in which the value of the column at position 0 is greater than 5:



In [None]:
a = df[df.iloc[:, 0] > 5]
a.head()
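Several conditions can be combined using the element-wise operators `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. The sketch below uses a small made-up table with illustrative column names:

```python
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column.
demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})

# Rows where both conditions hold.
both = demo[(demo["sepal_length"] > 5) & (demo["species"] == "setosa")]
print(both)

# Rows where at least one condition holds.
either = demo[(demo["sepal_length"] > 6.5) | (demo["species"] == "setosa")]
print(either)

# Rows where the condition does not hold.
not_setosa = demo[~(demo["species"] == "setosa")]
print(not_setosa)
```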