# Functions and modules
In `python`, we integrate tools by importing modules and calling functions.
`module`s are containers of useful objects and functions, which let us leverage code written by others.
`function`s, much like mathematical functions, are instead **parametric**, reusable code routines that let us execute a specific functionality.

We can import a `module`, say `numpy`, by using
```python
import numpy
```
This allows us to access `numpy` and all the utilities it offers.
Moreover, we can `import` only the subset of functionalities we are interested in by using `from`:
```python
from numpy import ones
```
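As a small sketch of the difference between the two import styles (the variable names below are only illustrative): with a plain `import` we reach utilities through the module name, while `from ... import` makes the name available directly.

```python
import numpy
from numpy import ones

# with a plain import, utilities are reached through the module name
through_module = numpy.ones(3)

# with from ... import, the imported name is available directly
direct = ones(3)

# both names point to the very same function
same_function = numpy.ones is ones
print(same_function)  # True
```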

How do we know which module to import?
Search on [PyPI](https://pypi.org), the Python Package Index!
On a `module`'s page you will usually also find a link to the available documentation.

## Functions
Functions are invoked with a syntax that, again, resembles the mathematical one:
```python
function(parameter_1, parameter_2, ...)
```
For instance, `numpy` offers a function `ones` (documentation [here](https://numpy.org/doc/stable/reference/generated/numpy.ones.html)), which generates a vector of `1`s of the given size.
We can invoke it with
```python
from numpy import ones

o = ones(5)  # a vector of five 1s
```
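Parameters can be passed by position or by name. A minimal sketch with `ones`, where the variable names are only illustrative:

```python
from numpy import ones

# a single positional parameter: the size of the vector
vector = ones(5)

# a tuple as size gives a 2 x 3 table of ones instead of a vector
table = ones((2, 3))

# parameters can also be passed by name, here asking for integers
integers = ones(3, dtype=int)

print(vector.shape, table.shape, integers.sum())  # (5,) (2, 3) 3
```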

## Methods
Objects are a special `type`, as they hold both values and functions.
Why? Because some data comes with behavior that is best defined by the data itself.
For instance, the movement of a steering wheel is defined by the manufacturer, rather than by whoever uses it (I thought of this example before self-driving cars were a thing).
Thus, instead of having a function `steer(wheel, "right")` we'd rather have `wheel.steer_right()`.
When defined on objects, functions are named `method`s.
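To make the steering wheel example concrete, here is a minimal sketch. The `Wheel` class, its `angle` value and the `degrees` parameter are hypothetical, made up only for illustration.

```python
class Wheel:
    """A hypothetical steering wheel, used only for illustration."""

    def __init__(self):
        # the value held by the object: the current angle, in degrees
        self.angle = 0

    def steer_right(self, degrees=15):
        # the behavior is defined next to the data it changes
        self.angle += degrees
        return self.angle


wheel = Wheel()
wheel.steer_right()
wheel.steer_right(degrees=30)
print(wheel.angle)  # 45
```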


# Pandas and Numpy
`pandas` is a `python` module that allows us to play with data in a tabular format.
Data is stored in an object of type `DataFrame` (reference [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)) which allows us to group records together, each record having a set of attributes.
Access is slightly different from classic `list`s:
- `dataframe.iloc[i]` replaces `list[i]`, and accesses the record at position `i`
- `dataframe[column]` allows us to access features, i.e., columns, of the records. Multiple columns can also be accessed at once by using a `list` in place of `column`.
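The two kinds of access can be sketched on a tiny, made-up table (the `people` dataframe and its columns are assumptions for illustration, not part of the dataset used below):

```python
import pandas

# a tiny, made-up table: three records with two features each
people = pandas.DataFrame({"age": [25, 43, 61],
                           "hours_per_week": [40, 38, 20]})

first_record = people.iloc[0]   # by position, like list[0]
first_two = people.iloc[0:2]    # slicing works as with lists
ages = people["age"]            # one column
both = people[["age", "hours_per_week"]]  # a list selects several columns

print(first_record["age"], ages.tolist(), both.shape)
```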

`numpy` is a related utility that instead allows us to work with vectors, which are of type `numpy.ndarray` (reference [here](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html)).
They resemble `python` lists, but support far more complex operations, e.g., arithmetic applied to all elements at once.
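A short sketch of how a `numpy` vector behaves differently from a plain list, on made-up values:

```python
import numpy

values = [1, 2, 3]
vector = numpy.array(values)

# on a list, * repeats the content...
repeated = values * 2   # [1, 2, 3, 1, 2, 3]
# ...on a vector, it multiplies each element
doubled = vector * 2    # array([2, 4, 6])

# vectors also come with built-in statistics
print(repeated, doubled, vector.mean())
```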

In [None]:
from datasets import load_dataset


dataset = load_dataset("mstz/adult", "income")["train"].to_pandas()
dataset.head()

In [None]:
dataset.dtypes

We can also filter in/out features according to their type.
Reference [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html).

In [None]:
dataset.select_dtypes(include="number")

dataset.select_dtypes(exclude="number")

In [None]:
dataset.shape

### Searching
DataFrames can be filtered down through `selectors`.
A `selector` is a boolean sequence with one value per row: rows marked `True` are kept, the others are discarded.

In [None]:
dataset[dataset["age"] == 43]
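To see what a selector looks like on its own, and how several of them can be combined, here is a sketch on made-up data (the `people` dataframe is an assumption for illustration):

```python
import pandas

# made-up data, only for illustration
people = pandas.DataFrame({"age": [19, 43, 43, 70],
                           "hours_per_week": [20, 40, 10, 0]})

# the selector itself: one boolean per row
is_43 = people["age"] == 43
print(is_43.tolist())  # [False, True, True, False]

# selectors combine with & (and), | (or), ~ (not); mind the parentheses
busy_43 = people[(people["age"] == 43) & (people["hours_per_week"] > 20)]
not_43 = people[~is_43]
print(len(busy_43), len(not_43))  # 1 2
```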

### ...and adding

In [None]:
dataset.loc[dataset.age < 21, "can_drink"] = False
dataset.loc[dataset.age >= 21, "can_drink"] = True
dataset.head()

### Datasets at a glance

In [None]:
dataset.info()

In [None]:
dataset.describe()

You can find a host of example datasets to play with at [huggingface.co/mstz](https://huggingface.co/mstz).

### Dataset filtering
We can filter the dataset to only retain some of its records and/or features.

In [None]:
dataset[["age", "capital_gain", "capital_loss"]]

...and these can be chained!

In [None]:
dataset[["age", "capital_gain", "capital_loss"]]["age"]

We can also filter out duplicate records.
Reference [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html).

In [None]:
dataset = dataset.drop_duplicates()

Or simply filter out records with missing values.

In [None]:
dataset = dataset.dropna()
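The effect of both operations is easier to see on a small, made-up table. The sketch below assumes a hypothetical `people` dataframe; note that both methods return a new dataframe rather than modifying the original.

```python
import numpy
import pandas

# made-up data: row 1 duplicates row 0, row 2 has a missing age
people = pandas.DataFrame({"age": [30.0, 30.0, numpy.nan, 52.0],
                           "hours_per_week": [40, 40, 35, 20]})

deduplicated = people.drop_duplicates()  # drops the second copy of row 0
complete = deduplicated.dropna()         # drops the row with a missing age

print(len(people), len(deduplicated), len(complete))  # 4 3 2
```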

In [None]:
ages = dataset["age"].values
ages_mean = ages.mean()
ages_std = ages.std()

print(ages_mean, ages_std)

### Missing values
Dataframes allow for missing values by using the special value `nan` (Not A Number), which we can detect through:
```python
dataset.isna()
```

In [None]:
dataset["age"].isna().values
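Since `isna()` returns booleans, it can be summed to count missing values. A minimal sketch on made-up data (the `people` dataframe is an assumption for illustration):

```python
import numpy
import pandas

people = pandas.DataFrame({"age": [30.0, numpy.nan, 52.0],
                           "workclass": ["private", None, None]})

missing = people.isna()            # same shape as people, True where missing
per_column = missing.sum()         # True counts as 1
any_missing = missing.any(axis=1)  # rows with at least one missing value

print(per_column["age"], per_column["workclass"])  # 1 2
print(any_missing.tolist())  # [False, True, True]
```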

### Imputations
Imputation is a technique with which we "fill in" missing values according to the distribution of the data.
Imputations can be guided by domain knowledge or computed:
- statistical imputation: the missing value of a feature is replaced with a dataset-related statistic, e.g., mean, mode
- neighbor imputation: the missing value of a feature is replaced according to the values of similar instances

Imputation can introduce noise in your data: use with **extreme** care!
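As a sketch of why care is needed, the example below performs a statistical (mean) imputation by hand with plain `pandas`, on made-up ages, and compares the spread of the data before and after.

```python
import numpy
import pandas

# made-up ages, two of them missing
ages = pandas.Series([20.0, numpy.nan, 40.0, numpy.nan, 60.0])

# mean() skips missing values by default
mean_age = ages.mean()  # (20 + 40 + 60) / 3 = 40
imputed = ages.fillna(mean_age)

print(imputed.tolist())  # [20.0, 40.0, 40.0, 40.0, 60.0]
# the imputed values shrink the spread of the data
print(ages.std(), imputed.std())
```

Here the standard deviation drops after imputation, even though no new information was added.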

In [None]:
import numpy
from sklearn.impute import SimpleImputer


dataset = load_dataset("mstz/adult", "income")["train"].to_pandas()

# creating artificial missing values
missing_indexes = [0, 1, 2]
print(f"Removing {dataset['age'].values[missing_indexes]}")
dataset.loc[missing_indexes, "age"] = numpy.nan


# replace missing values with the mean of the observed ages
imputer = SimpleImputer(missing_values=numpy.nan,
                        strategy="mean")
imputer.fit(dataset[["age"]].values)
# impute only the records we removed
imputer.transform(dataset[["age"]].values[missing_indexes])

In [None]:
dataset = load_dataset("mstz/adult", "income")["train"].to_pandas()

# creating artificial missing values
missing_indexes = [0, 1, 2]
print(f"Removing {dataset['age'].values[missing_indexes]}")
dataset.loc[missing_indexes, "age"] = numpy.nan


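# replace missing values with the mode, i.e., the most frequent observed age
imputer = SimpleImputer(missing_values=numpy.nan,
                        strategy="most_frequent")
imputer.fit(dataset[["age"]].values)
# impute only the records we removed
imputer.transform(dataset[["age"]].values[missing_indexes])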

In [None]:
from sklearn.impute import KNNImputer


dataset = load_dataset("mstz/adult", "income")["train"].to_pandas()

# creating artificial missing values
missing_indexes = [0, 1, 2, 34, 21, 1234, 489, 90, 102]
auxiliary_columns = ["capital_gain", "capital_loss"]
print(f"Removing values {dataset['age'].values[missing_indexes]}")
dataset.loc[missing_indexes, "age"] = numpy.nan

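# estimate each missing age from the 25 most similar records,
# with closer records weighing more
imputer = KNNImputer(n_neighbors=25,
                     weights="distance")
imputer.fit(dataset[["age"] + auxiliary_columns])
# keep only the imputed ages (column 0) of the records we removed
imputer.transform(dataset[["age"] + auxiliary_columns])[missing_indexes, 0]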

# Advanced operations: `scipy` and `numpy`
While `pandas` and `numpy` allow us to model and represent data, respectively, we are still missing several mathematical operations we may be interested in.
`scipy` comes in handy to:
- perform [standard linear algebra operations](https://docs.scipy.org/doc/scipy/tutorial/linalg.html)
- [interpolate data](https://docs.scipy.org/doc/scipy/tutorial/interpolate.html)
- perform [basic statistics](https://docs.scipy.org/doc/scipy/tutorial/stats.html)

In [None]:
from scipy.stats import gaussian_kde


dataset = load_dataset("mstz/adult", "income")["train"].to_pandas()

estimation = gaussian_kde(dataset["age"].values)
estimation
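The object returned by `gaussian_kde` can be used in two ways, sketched below on synthetic ages (drawn at random only for illustration, they are not the dataset above): it can be called to get the estimated density at given values, and it can generate new values.

```python
import numpy
from scipy.stats import gaussian_kde

# synthetic "ages", drawn at random only for illustration
generator = numpy.random.default_rng(seed=0)
ages = generator.normal(loc=40, scale=10, size=500)

estimation = gaussian_kde(ages)

# the estimate is callable: it returns the density at the given values
densities = estimation([40, 90])
print(densities)

# it can also generate new values following the estimated density
samples = estimation.resample(5)
print(samples.shape)  # (1, 5)
```

As expected for data centered around 40, the density at 40 is much higher than the density at 90.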

# Data interpolation
Interpolation allows us to generate new data by interpolating, i.e., combining, already existing data.
This is usually done, again, by estimating some model from the existing data, then leveraging said model to generate new data.

In [None]:
import numpy
from scipy.interpolate import interp1d


ages = dataset["age"].values
sorted_ages = numpy.unique(ages)

interpolation = interp1d(range(sorted_ages.size), sorted_ages)
interpolation
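The object returned by `interp1d` is itself a function, which we can call on positions between the known ones. A minimal sketch on made-up sorted ages, standing in for the output of `numpy.unique`:

```python
import numpy
from scipy.interpolate import interp1d

# made-up sorted ages, standing in for numpy.unique(ages)
sorted_ages = numpy.array([18, 20, 30, 50])

# linear interpolation (the default) between position and age
interpolation = interp1d(range(sorted_ages.size), sorted_ages)

print(interpolation(1))           # 20.0, an existing value
print(interpolation(2.5))         # 40.0, halfway between 30 and 50
print(interpolation([0.5, 1.5]))  # [19. 25.]
```

By default, positions outside the known range (here, below 0 or above 3) raise an error rather than being extrapolated.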

# Density estimation
`scipy` also allows you to estimate the density of a feature by fitting a family of distributions of your choosing to it.

In [None]:
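from scipy.stats import fit
from scipy.stats import norm  # pick any family of distributions of your choosing


ages = dataset["age"].values
distribution = norm
# `fit` keeps loc and scale fixed at 0 and 1 unless we give it bounds to search in
bounds = {"loc": (ages.min(), ages.max()),
          "scale": (1, ages.max() - ages.min())}
estimation = fit(distribution, ages, bounds)
print(estimation)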

# Try this yourself!
Solutions below, try not to cheat :P

- compute the correlation matrix of a dataset
- compute feature distributions of datasets

Reference for pandas DataFrames [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).