# Tools and Methods of Data Analysis
## Session 1 - Part 2

Niels Hoppe <niels.hoppe.extern@srh.de>

### Setup

* Install [Visual Studio Code](https://code.visualstudio.com/docs/setup/setup-overview)
* Install [Anaconda](https://www.anaconda.com/)
* Restart
* Create conda environment
* Download [example project and data](http://www.nielshoppe.de/srh-tmda/tmda.zip)

#### Create conda environment

Open the Anaconda PowerShell Prompt and run:

    conda create -n tmda -c conda-forge python=3.10 jupyter numpy pandas pyreadr scipy seaborn statsmodels

This creates a `conda` environment named `tmda` in `C:\Users\...\anaconda3\envs\tmda`.
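After activating the environment, a quick check from Python can confirm that the analysis packages are visible. This is a minimal sketch using only the standard library; the names `packages` and `missing` are illustrative, and `jupyter` is left out of the list because it is mainly used as a command-line tool:

```python
import importlib.util

# Analysis packages requested in the `conda create` command above.
packages = ["numpy", "pandas", "pyreadr", "scipy", "seaborn", "statsmodels"]

# find_spec returns None when a top-level package cannot be found.
missing = [name for name in packages if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All packages found.")
```

If packages are reported as missing, the most likely cause is that a different environment is active.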

### Getting started

* Jupyter Notebooks
* Python
* Pandas

### Getting started with Jupyter Notebooks

* Open Visual Studio Code
* Create Jupyter Notebook
* Select conda environment as kernel

### Getting started with Python

* Imports
* Simple calculations
* Variables
* Output

#### Python Imports

* Python code is organized in modules and distributed in **packages**
* Public packages are available from the [Python Package Index (PyPI)](https://pypi.org)
* Packages must be **imported** before objects and methods can be used

#### Python Imports (cont.)

Simple import of the `math` module:

In [3]:
import math

Import `numpy` package with alias:

In [4]:
import numpy as np

Selective import from `pandas` package:

In [5]:
from pandas import DataFrame, Series
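The three import styles differ in how the imported names are used afterwards. A small sketch, with illustrative values and variable names (`root`, `values`, `s`) that are not part of the course material:

```python
import math
import numpy as np
from pandas import Series

# Plain import: prefix names with the module name.
root = math.sqrt(16)

# Aliased import: prefix names with the alias.
values = np.array([1, 2, 3])

# Selective import: use the imported name directly.
s = Series(values)

print(root, values.sum(), len(s))  # 4.0 6 3
```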

#### Python Imports (cont.)

Packages used in this course:

* [math](https://docs.python.org/3/library/math.html) (included in Python Standard Library)
* [numpy](https://pypi.org/project/numpy/)
* [pandas](https://pypi.org/project/pandas/)
* [pyreadr](https://pypi.org/project/pyreadr/)
* [scipy](https://pypi.org/project/scipy/)
* [seaborn](https://pypi.org/project/seaborn/) (optional)
* [statsmodels](https://pypi.org/project/statsmodels/)

#### Simple Calculations in Python

You can use Python like a calculator:

In [6]:
3 + 6 # addition

9

In [7]:
5 - 2 # subtraction

3

In [8]:
4 * 3 # multiplication

12

In [9]:
18 / 6 # division

3.0
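Note that `/` returns a float even when the result is a whole number. When an integer result or the remainder is needed, Python also offers floor division and modulo. A short sketch with illustrative numbers:

```python
quotient = 17 // 5   # floor division: integer part of the quotient
remainder = 17 % 5   # modulo: remainder of the division
true_div = 18 / 6    # true division: always a float

print(quotient, remainder, true_div)  # 3 2 3.0
```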

#### Simple Calculations in Python (cont.)

In [10]:
5 ** 2 # exponentiation (5 squared)

25

In [11]:
5 ** (1/3) # cube root via a fractional exponent

1.7099759466766968

In [12]:
math.sqrt(9) # square root

3.0

#### Simple Calculations in Python (cont.)

In [13]:
math.exp(2) # e raised to the power of 2

7.38905609893065

In [14]:
math.log(2) # natural logarithm

0.6931471805599453

In [15]:
math.log(2, 10) # logarithm with specified base

0.30102999566398114
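For the common bases 10 and 2, the `math` module also provides dedicated functions. The sketch below uses illustrative inputs and also shows that `math.log` undoes `math.exp` up to rounding:

```python
import math

common = math.log10(1000)          # base-10 logarithm
binary = math.log2(8)              # base-2 logarithm
roundtrip = math.log(math.exp(2))  # log undoes exp, up to rounding

print(common, binary, roundtrip)
```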

#### Simple Calculations in Python (cont.)

In [16]:
math.sin(math.pi / 6) # trigonometric functions; sin, cos, tan

0.49999999999999994

In [17]:
abs(-3) # absolute value

3

In [18]:
round(2.456, 2) # round to two decimals

2.46

In [19]:
math.floor(1.6) # round down

1

In [20]:
math.ceil(1.3) # round up

2
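The sine result above is `0.49999999999999994` rather than `0.5` because floating-point numbers have limited precision. A minimal sketch of two consequences: comparisons are safer with a tolerance, and `round` sends exact halves to the nearest even integer.

```python
import math

value = math.sin(math.pi / 6)

print(value == 0.5)              # False: exact comparison fails
print(math.isclose(value, 0.5))  # True: equal within a small tolerance

# round() sends exact halves to the nearest even integer.
print(round(0.5), round(1.5), round(2.5))  # 0 2 2
```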

#### Variables in Python

In [21]:
x = 5 + 4 # variable assignment
x

9

In [22]:
l = [2 * x for x in range(1, 10)] # list comprehension
l

[2, 4, 6, 8, 10, 12, 14, 16, 18]

In [23]:
d = { 'key1': 'Some value', 'key2': 42, 3: 'Another value' } # dictionary
d

{'key1': 'Some value', 'key2': 42, 3: 'Another value'}
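Once stored, list elements are accessed by position and dictionary values by key. A short sketch that reuses the list and dictionary from above; the names `first`, `last`, `middle`, `answer` and `other` are illustrative:

```python
l = [2 * x for x in range(1, 10)]
d = {'key1': 'Some value', 'key2': 42, 3: 'Another value'}

first = l[0]        # indexing starts at 0
last = l[-1]        # negative indices count from the end
middle = l[2:5]     # slice: positions 2, 3 and 4

answer = d['key2']  # look up a value by its key
other = d[3]        # keys need not be strings

print(first, last, middle, answer, other)
```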

### Getting started with Pandas

* Series and DataFrames
* Accessing DataFrames
* Loading data

#### Series and DataFrames in Pandas

A `Series` is a one-dimensional sequence of values, each labeled by an index.

In [24]:
Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64
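The left-hand column in the output is the index. By default it counts from 0, but custom labels can be supplied. A small sketch with made-up values and labels:

```python
from pandas import Series

# Illustrative values and labels.
s = Series([10, 20, 30], index=['a', 'b', 'c'])

print(s['b'])      # access by label: 20
print(s.iloc[0])   # access by position: 10
print(s.mean())    # arithmetic mean: 20.0
```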

#### Series and DataFrames in Pandas (cont.)

A `DataFrame` is a table of values. It has `Series` as its columns.

In [25]:
df = DataFrame({
    'column1': [1, 2, 3, 4, 5],
    'column2': ['A', 'B', 'C', 'D', 'E']
}, index=['row1', 'row2', 'row3', 'row4', 'row5'])

df

|      | column1 | column2 |
|------|---------|---------|
| row1 | 1       | A       |
| row2 | 2       | B       |
| row3 | 3       | C       |
| row4 | 4       | D       |
| row5 | 5       | E       |
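To see that the columns really are `Series`, a single column can be pulled out and inspected. A brief sketch that rebuilds the same table; the variable `col` is illustrative:

```python
from pandas import DataFrame

df = DataFrame({
    'column1': [1, 2, 3, 4, 5],
    'column2': ['A', 'B', 'C', 'D', 'E']
}, index=['row1', 'row2', 'row3', 'row4', 'row5'])

col = df['column1']        # a single column is a Series
print(type(col).__name__)  # Series
print(df.shape)            # (5, 2): rows, columns
print(list(df.columns))    # ['column1', 'column2']
print(col.sum())           # 15
```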


#### Accessing DataFrames in Pandas

Pandas provides different access methods for `DataFrames`. `iloc` selects rows and columns by integer position, counting from 0:

In [26]:
df.iloc[:, 0] # all rows, first column

row1    1
row2    2
row3    3
row4    4
row5    5
Name: column1, dtype: int64

In [27]:
df.iloc[2:4, :] # rows 3 to 4, all columns

|      | column1 | column2 |
|------|---------|---------|
| row3 | 3       | C       |
| row4 | 4       | D       |


In [28]:
df.iloc[[0, 2, 4], 1:] # rows 1, 3 and 5; skip first column

|      | column2 |
|------|---------|
| row1 | A       |
| row3 | C       |
| row5 | E       |


#### Accessing DataFrames in Pandas (cont.)

`loc` selects rows and columns by label instead of by position:

In [29]:
df.loc[:, 'column2'] # all rows, column selected by name

row1    A
row2    B
row3    C
row4    D
row5    E
Name: column2, dtype: object
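One difference between the two accessors is easy to miss: label slices with `loc` include the end label, while position slices with `iloc` exclude the end position. A short sketch on the same table; `by_label` and `by_position` are illustrative names:

```python
from pandas import DataFrame

df = DataFrame({
    'column1': [1, 2, 3, 4, 5],
    'column2': ['A', 'B', 'C', 'D', 'E']
}, index=['row1', 'row2', 'row3', 'row4', 'row5'])

by_label = df.loc['row2':'row4', 'column1']  # row2, row3 and row4
by_position = df.iloc[1:3, 0]                # positions 1 and 2 only

print(by_label.tolist())          # [2, 3, 4]
print(by_position.tolist())       # [2, 3]
print(df.loc['row3', 'column2'])  # single cell: C
```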

#### Loading data

Data can be read from CSV files (.csv) ...

In [30]:
import pandas as pd

df = pd.read_csv('../data/data.csv')
df


|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 1 | 2 | 4 |
| 2 | 2 | 2 | 3 |


#### Loading data (cont.)

... or from R data files (.rda):

In [31]:
from pyreadr import read_r

data = read_r('../data/devore7/ex01.11.rda')
df = data['ex01.11']
df.head()

| rownames | Scores |
|----------|--------|
| 1        | 74     |
| 2        | 89     |
| 3        | 80     |
| 4        | 93     |
| 5        | 64     |


#### Loading data (cont.)

We can also create dummy data through random sampling with replacement (the result differs from run to run):

In [34]:
labels = ['A', 'B', 'C', 'D', 'F']
df = DataFrame(labels).sample(30, replace=True)

df[0].to_numpy()

array(['B', 'F', 'F', 'F', 'F', 'D', 'C', 'F', 'D', 'B', 'A', 'D', 'D',
       'F', 'C', 'B', 'F', 'B', 'F', 'A', 'A', 'A', 'D', 'B', 'A', 'A',
       'C', 'D', 'A', 'D'], dtype=object)
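When the same dummy data is needed on every run, `sample` accepts a `random_state` argument that fixes the seed. A minimal sketch; the seed value `42` and the names `first` and `second` are arbitrary choices for illustration:

```python
from pandas import DataFrame

labels = ['A', 'B', 'C', 'D', 'F']

# random_state fixes the seed, so both draws are identical.
first = DataFrame(labels).sample(30, replace=True, random_state=42)
second = DataFrame(labels).sample(30, replace=True, random_state=42)

print(first[0].tolist() == second[0].tolist())  # True

# Count how often each label was drawn.
print(first[0].value_counts())
```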

### Additional Resources

* https://www.learnpython.org/
* https://www.kaggle.com/
* https://www.statology.org/python-guides/