# Basics

## validation of dependencies

Look into the [labs](labs) folder; its README contains detailed information. Make sure Docker is running, then run:

```bash
make notebook
```

Then open a browser and go to: [http://localhost:8888/lab](http://localhost:8888/lab)
It will ask you for a token. The token can be retrieved from the console:

```text
To access the notebook, open this file in a browser:
    file:///.../Jupyter/runtime/nbserver-52250-open.html
Or copy and paste one of these URLs:
    http://localhost:8888/?token=<<token>>
```

## why python for data science?

- High-level (easy to learn)
    - In fact, more abstraction can make things faster for both the developer and the machine, which is unusual among programming languages (example: vectorized numpy operations run in compiled code)
- Interpreted / Scripting Lang. (speedy development, slow(er) execution)
- General Purpose
- Multi-Paradigm (OO, Functional, Procedural)
- Open-Source (Free!, Thriving Community, Regular Releases)
    - Almost every deep learning project has a native python API
    - A lot of great frameworks like pyspark, dask, pyTorch, Airflow
- Duck Typing
    - Allows for fast experimentation (no rigid class hierarchy)
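
As a minimal sketch of the duck-typing point above: the function below never checks types, it only relies on each object having a `speak()` method. The classes `Duck` and `Robot` and the function `chorus` are hypothetical names made up for this illustration.

```python
class Duck:
    def speak(self):
        return "Quack"


class Robot:
    def speak(self):
        return "Beep"


def chorus(things):
    # No isinstance checks: anything with a .speak() method is accepted.
    return [thing.speak() for thing in things]


sounds = chorus([Duck(), Robot()])
print(sounds)
```

`Duck` and `Robot` share no base class, which is what makes quick experiments cheap: a new type only has to provide the methods that are actually called.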
    

## interactive python
- using the REPL
- a bigger project (text editor of choice)
- (jupyter) notebooks (JupyterLab is also available via `jupyter lab`)
    - colorful
    - caching of results
    - E2E documentation in the direction of the workflow
    - magics
    
    
### jupyter basics:
- magics https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/01.03-Magic-Commands.ipynb
- help https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/01.01-Help-And-Documentation.ipynb
- shell commands https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/01.05-IPython-And-Shell-Commands.ipynb
- timing code https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/01.07-Timing-and-Profiling.ipynb

In [None]:
%ls

Some first steps:

In [None]:
len?

In [None]:
square = lambda x: x ** 2  # hypothetical helper: `??` needs a defined object to show source for
square??

In [None]:
%timeit sum(range(100))
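
Outside of a notebook the `%timeit` magic is not available. A rough equivalent, sketched here with the standard-library `timeit` module, might look as follows; the number of runs is an arbitrary choice for this example.

```python
import timeit

# Run the statement a fixed number of times and derive the average time per run.
runs = 10_000
total_seconds = timeit.timeit("sum(range(100))", number=runs)
per_run_us = total_seconds / runs * 1e6

print(f"{per_run_us:.2f} us per run (average over {runs} runs)")
```

Unlike `%timeit`, this sketch does not pick the number of runs automatically and does not repeat the measurement, so the result is noisier.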

### VS Code

https://code.visualstudio.com/docs/python/jupyter-support

- sharing of screen with team members (https://visualstudio.microsoft.com/de/services/live-share/?rr=https%3A%2F%2Fcode.visualstudio.com%2Fdocs%2Fpython%2Fjupyter-support)
- remote jupyter notebooks (https://blog.ouseful.info/2019/02/11/connecting-to-a-remote-jupyter-notebook-server-running-on-digital-ocean-from-microsoft-vs-code/)
- potentially easier to version control
- better auto completion


Example:
- open VSCode for the `lab` folder
- play with some basic cells

### Google colaboratory
- easy collaboration like google docs
- separation of compute and storage (you need to mount an object store or google drive)
- free GPU

https://colab.research.google.com/notebooks/welcome.ipynb

## famous python libraries for data science
- jupyter (interactive notebooks) like this one
- pandas (Excel)
- numpy (MatLab, matrices)
- matplotlib (visualization)
- scikit-learn (machine learning)

Great reading: https://github.com/jakevdp/PythonDataScienceHandbook with lots of examples.

The examples below are based on https://github.com/jakevdp/PythonDataScienceHandbook though they might be adapted.

> When working on your own, it is important to read the documentation or further examples.
The content below explains some of the most important topics with minimal examples.

### numpy

- efficient storage & manipulation of array or matrix like data
- you can specify the type of data (int vs. int8 vs. int32, ...) https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.01-Understanding-Data-Types.ipynb
- building block for a lot of libraries
- behaves fairly similarly to MATLAB syntactically with regard to indexing and slicing
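
To make the point about data types concrete, here is a small sketch comparing the memory footprint of two integer types. The array length of 1000 is arbitrary, and the last lines show that fixed-width integers wrap around on overflow.

```python
import numpy as np

small = np.zeros(1000, dtype=np.int8)    # 1 byte per element
large = np.zeros(1000, dtype=np.int64)   # 8 bytes per element

print(small.itemsize, small.nbytes)
print(large.itemsize, large.nbytes)

# Fixed-width integers wrap around: 127 is the largest int8 value.
wrapped = np.array([127], dtype=np.int8) + np.int8(1)
print(wrapped)
```

Choosing the smallest type that safely holds the data saves memory, but the wrap-around shows why the value range has to be checked first.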

In [None]:
import numpy as np

np.zeros(10, dtype=int)

In [None]:
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
a = np.random.normal(0, 1, (3, 3))
a

basic properties of numpy arrays:

In [None]:
print(f"ndim: {a.ndim}")
print(f"shape: {a.shape}")
print(f"size: {a.size}")

accessing the contents of the array (indexing)

In [None]:
a

In [None]:
a[0]

In [None]:
a[0][0]

In [None]:
a[0, 0]

slicing: `x[start:stop:step]`

> NOTE: subarrays obtained by slicing are no-copy views: if the view is modified, the original array is modified as well. Indexing with a list, as in `a[0:2, [2]]` below, returns a copy instead.

In [None]:
a[0:2, [2]]

In [None]:
x2 = np.random.randint(10, size=(3, 4))  # Two-dimensional array
print(x2)
x2_sub = x2[:2, :2]
print(x2_sub)

x2_sub[0, 0] = 99
print(x2_sub)

x2
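
If the original array must stay untouched, an explicit copy helps. The following sketch (array contents chosen arbitrarily) contrasts a slice view, an explicit `.copy()`, and list-based indexing, using `np.shares_memory` to check which results still point at the original data.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

view = x[:2, :2]            # basic slice: shares memory with x
copied = x[:2, :2].copy()   # explicit copy: independent data
fancy = x[:2, [0, 1]]       # list index ("fancy indexing"): also a copy

print(np.shares_memory(x, view))
print(np.shares_memory(x, copied))
print(np.shares_memory(x, fancy))

copied[0, 0] = 99   # does not touch x
view[0, 1] = -1     # writes through to x
print(x[0])
```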

### pandas

Good for tabular data and fairly similar to Excel. Heavily uses `numpy` internally.

It is built on the `Series` type:

In [None]:
%matplotlib inline
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

their underlying numpy array can be accessed:

In [None]:
data.values

and it has an index:

In [None]:
data.index

Indexing works much like in numpy, but is more general.

In [None]:
data[1:3]

In [None]:
data[data > 0.5]
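
One way in which `Series` indexing is more general: the index does not have to be integer positions. The sketch below uses a made-up string index and contrasts label-based access (`.loc`) with position-based access (`.iloc`).

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])

by_label = s.loc["b":"c"]     # label slice: the end label is included
by_position = s.iloc[1:2]     # positional slice: the end position is excluded

print(by_label)
print(by_position)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity when the index itself consists of integers.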

In [None]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
display(area)

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
display(population)

states = pd.DataFrame({'population': population,
                       'area': area})
states

In [None]:
states.index

In [None]:
states.columns

compute new attributes

In [None]:
states['density'] = states.population / states.area
states

Transposition is possible

In [None]:
states.T

Again, indexing:

In [None]:
states.iloc[:3, :2]

More interesting: boolean indexing:

In [None]:
states.loc[states.density > 100, ['population', 'density']]

setting values:

In [None]:
states.iloc[0, 2] = 90
states

you can always apply any numpy function:

In [None]:
np.log1p(states.population)

In [None]:
df = pd.DataFrame({'foo':[1,2,3], 'bar':['one', 'two', 'three']})
# df = pd.read_csv()
# df = pd.read_excel()

# Parquet is a compressed columnar file format.
# - explicit schema
# - column pruning (only required columns can be loaded)
# df = pd.read_parquet()


print(df.dtypes)
print(df.isnull().sum())
display(df.head())
display(df[df.foo > 2])
df.foo.plot()
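
The reader functions commented out above all follow the same pattern. As a sketch, the example below reads CSV from an in-memory string instead of a file (the data and the name `df_small` are made up for illustration) and uses `usecols` and `dtype` to mimic, for CSV, the column pruning and explicit schema mentioned for Parquet.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path.
csv_text = "foo,bar,baz\n1,one,0.1\n2,two,0.2\n3,three,0.3\n"

# usecols loads only the listed columns; dtype fixes column types up front.
df_small = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["foo", "bar"],
    dtype={"foo": "int32"},
)

print(df_small.dtypes)
print(df_small)
```

With CSV the whole file still has to be parsed as text, whereas a columnar format such as Parquet can skip unneeded columns on disk.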

In [None]:
df.bar.value_counts(normalize=True)

In [None]:
df.foo.plot.hist()

In [None]:
df.info()

In [None]:
df.describe()

## GroupBy: Split, Apply, Combine

Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called ``groupby`` operation.
The name "group by" comes from a command in the SQL database language, but it is perhaps more illuminative to think of it in the terms first coined by Hadley Wickham of Rstats fame: *split, apply, combine*.

### Split, apply, combine

This makes clear what the ``groupby`` accomplishes:

- The *split* step involves breaking up and grouping a ``DataFrame`` depending on the value of the specified key.
- The *apply* step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
- The *combine* step merges the results of these operations into an output array.

While this could certainly be done manually using some combination of the masking, aggregation, and merging commands covered earlier, an important realization is that *the intermediate splits do not need to be explicitly instantiated*. Rather, the ``GroupBy`` can (often) do this in a single pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way.
The power of the ``GroupBy`` is that it abstracts away these steps: the user need not think about *how* the computation is done under the hood, but rather thinks about the *operation as a whole*.
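
To see what is being abstracted away, the three steps can be written out by hand. This is only a sketch on a toy `DataFrame` (the name `toy` and its contents are assumptions for this example), compared against the equivalent `groupby` call.

```python
import pandas as pd

toy = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"], "data": range(6)})

# Manual split-apply-combine
manual = {}
for key in toy["key"].unique():
    group = toy[toy["key"] == key]     # split: rows belonging to one key
    manual[key] = group["data"].sum()  # apply: aggregate within the group
manual_result = pd.Series(manual)      # combine: collect the per-group results

# The same computation expressed as one groupby operation
grouped_result = toy.groupby("key")["data"].sum()

print(manual_result)
print(grouped_result)
```

The manual version materializes every intermediate group, which the `groupby` version avoids.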

As a concrete example, let's take a look at using Pandas for the computation shown in this diagram.
We'll start by creating the input ``DataFrame``:

In [None]:
# SINGLE split (=no split at all)
states.mean()

In [None]:
states.population.mean()

In [None]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df

The most basic split-apply-combine operation can be computed with the ``groupby()`` method of ``DataFrame``s, passing the name of the desired key column:

In [None]:
df.groupby('key')

Notice that what is returned is not a set of ``DataFrame``s, but a ``DataFrameGroupBy`` object.
This object is where the magic is: you can think of it as a special view of the ``DataFrame``, which is poised to dig into the groups but does no actual computation until the aggregation is applied.
This "lazy evaluation" approach means that common aggregates can be implemented very efficiently in a way that is almost transparent to the user.

To produce a result, we can apply an aggregate to this ``DataFrameGroupBy`` object, which will perform the appropriate apply/combine steps to produce the desired result:

In [None]:
df.groupby('key').sum()

#### Dispatch methods

Through some Python class magic, any method not explicitly implemented by the ``GroupBy`` object will be passed through and called on the groups, whether they are ``DataFrame`` or ``Series`` objects.
For example, you can use the ``describe()`` method of ``DataFrame``s to perform a set of aggregations that describe each group in the data:

In [None]:
df.groupby('key')['data'].describe()
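
The *apply* step is not limited to a single aggregate. The sketch below rebuilds the small key/data frame under the hypothetical name `toy` and shows one example each of aggregation, transformation, and filtering; the lambda functions and the threshold are arbitrary choices for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"], "data": range(6)})
groups = toy.groupby("key")["data"]

# Aggregation: several summary values per group
summary = groups.agg(["min", "max"])

# Transformation: result has the same length as the input (centre each group)
centered = groups.transform(lambda s: s - s.mean())

# Filtering: keep only the rows of groups whose sum exceeds 4
kept = toy.groupby("key").filter(lambda g: g["data"].sum() > 4)

print(summary)
print(centered)
print(kept)
```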

## Combining Datasets: Merge and Join

Similar to SQL again. Check the documentation to see all available options (kind of join, specification of keys, ...).

In [None]:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
display(df1, df2)

df3 = pd.merge(df1, df2)
df3
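
The kind of join and the key can also be stated explicitly. The following sketch uses two small made-up frames (`left` and `right`, where not every employee appears in both) to contrast an inner and an outer join; `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa"],
                     "group": ["Accounting", "Engineering", "Engineering"]})
right = pd.DataFrame({"employee": ["Lisa", "Bob", "Sue"],
                      "hire_date": [2004, 2008, 2014]})

# Inner join: only employees present in both frames
inner = pd.merge(left, right, on="employee", how="inner")

# Outer join: all employees, missing values become NaN
outer = pd.merge(left, right, on="employee", how="outer", indicator=True)

print(inner)
print(outer)
```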

## Pivot Tables: reshaping
Wide-to-long and the reverse.

Read-along: https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.09-Pivot-Tables.ipynb
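
As a minimal sketch of wide-to-long and back, `melt` and `pivot` can be combined. The frame `wide` and its column names are made up for this illustration.

```python
import pandas as pd

wide = pd.DataFrame({"id": [1, 2], "x": [10, 20], "y": [30, 40]})

# Wide to long: one row per (id, variable) pair
long_df = wide.melt(id_vars="id", var_name="variable", value_name="value")

# Long to wide: the values of "variable" become columns again
back = long_df.pivot(index="id", columns="variable", values="value")

print(long_df)
print(back)
```

`pivot` requires each (index, column) pair to be unique; `pivot_table`, used in the cells below, aggregates duplicates instead.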

In [None]:
import seaborn as sns; sns.set()
titanic = sns.load_dataset('titanic')


titanic.head()

In [None]:
titanic.pivot_table('survived', index='sex', columns='class')

In [None]:
age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', ['sex', age], 'class')

In [None]:
# shell command to download the data:
# uncomment and run!
#!curl -O https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv

In [None]:
births = pd.read_csv('births.csv')
births.head()

In [None]:
births['decade'] = 10 * (births['year'] // 10)
births.pivot_table('births', index='decade', columns='gender', aggfunc='sum')

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
sns.set()  # use Seaborn styles


births.pivot_table('births', index='year', columns='gender', aggfunc='sum').plot()
plt.ylabel('total births per year');

### Exercise:

Read the CSV file `data/reshape_me.csv` to a pandas data frame. Then reshape it so that the result looks like:

```
              value2             value_1            
target             0       1           0           1
foo group                                           
2   bar    2519413.0     NaN  696.454798         NaN
3   bar          NaN  3512.0         NaN  656.505657
```

to move the observations per row into a single row with multiple columns for each target variable.

In [None]:
# Your code here

In [None]:
%load ../solutions/01_reshape.py

Given a JSON string, parse it and load it into pandas. Normalize the result so that it is clean and tidy:

In [None]:
input = """
[{
    "load": 1,
    "results": {
        "key": "A",
        "timing": 1.1
    }
}, {
    "load": 2,
    "results": {
        "key": "B",
        "timing": 2.2
    }
}]
"""

In [None]:
# your code here

In [None]:
%load ../solutions/01_normalize.py

Further topics:
- gpu-based pandas https://rapids.ai/index.html
- time series: https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.11-Working-with-Time-Series.ipynb
- visualization in seaborn https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.14-Visualization-With-Seaborn.ipynb

## Automate the boring things.
Basic data exploration:

In [None]:
import pandas as pd
import numpy as np

In [None]:
import pandas_profiling

df = pd.DataFrame(
    np.random.rand(100, 5),
    columns=['a', 'b', 'c', 'd', 'e']
)
display(df.head())

profile = df.profile_report()
rejected_variables = profile.get_rejected_variables(threshold=0.9)

print(rejected_variables)

profile

## minimalistic web scraping

Extract tabular data from the HTML of http://www.nationmaster.com/country-info/stats/Media/Internet-users.

In [None]:
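from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.nationmaster.com/country-info/stats/Media/Internet-users")
soup = BeautifulSoup(res.content, 'lxml')
table = soup.find_all('table')[0]  # first <table> element on the page
# read_html returns a list of DataFrames; newer pandas expects a file-like object rather than a raw HTML string
df = pd.read_html(StringIO(str(table)))[0]
display(df.head())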

## Embeddings

Read more: https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.10-Manifold-Learning.ipynb

1. Great for reducing the dimensionality and quickly seeing whether groups form.
2. some ML models prefer data with lower dimensionality (https://umap-learn.readthedocs.io/en/latest/clustering.html)
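
The first point can be tried with numpy alone. Note that the sketch below is not UMAP: it swaps in PCA (a linear projection, computed via SVD) as a simple stand-in, and the two-group toy data is generated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two groups of 100 points each in 10 dimensions.
group_a = rng.normal(loc=0.0, scale=1.0, size=(100, 10))
group_b = rng.normal(loc=5.0, scale=1.0, size=(100, 10))
X = np.vstack([group_a, group_b])

# PCA via SVD: centre the data, then project onto the top two principal directions.
X_centered = X - X.mean(axis=0)
_, _, components = np.linalg.svd(X_centered, full_matrices=False)
embedding_2d = X_centered @ components[:2].T

print(embedding_2d.shape)
first = embedding_2d[:100, 0]
second = embedding_2d[100:, 0]
print(first.mean(), second.mean())
```

Here the two groups end up on opposite sides of the first component. Non-linear methods such as UMAP aim to preserve this kind of grouping even when it cannot be captured by a linear projection.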


Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. (https://github.com/lmcinnes/umap)

In [None]:
from bokeh.plotting import figure
from bokeh.models import CategoricalColorMapper, ColumnDataSource, FactorRange
from bokeh.palettes import Category10
from bokeh.io import show, output_notebook

output_notebook()

import umap
from sklearn.datasets import load_digits


digits = load_digits()
embedding = umap.UMAP().fit_transform(digits.data)

targets = [str(d) for d in digits.target_names]

source = ColumnDataSource(dict(
    x = [e[0] for e in embedding],
    y = [e[1] for e in embedding],
    label = [targets[d] for d in digits.target]
))

cmap = CategoricalColorMapper(factors=targets, palette=Category10[10])

p = figure(title="UMAP digits dataset")
p.circle(x='x',
         y='y',
         source=source,
         color={"field": 'label', "transform": cmap},
         legend_field='label')  # newer Bokeh releases use `legend_field` instead of `legend`

show(p)