# Tools and Methods of Data Analysis
## Session 1 - Part 1

Niels Hoppe <niels.hoppe.extern@srh.de>

### Agenda 04/13

* Introduction
  - Course Organization
* Statistics for Data Analysis
* Types of Variables
* Scales of Measure

### Niels Hoppe, M.Sc., 31yo

#### About

* **B.Sc. Computer Science** at Freie Universität Berlin
* **M.Sc. Computer Science** at Technische Universität Berlin
* Former **Teaching Assistant, Guest Lecturer** at FU and TU Berlin
* Former **Software Engineer** at Sendinblue (formerly Newsletter2Go)
* Former **Researcher** at Fraunhofer FOKUS
* Competitive Ballroom Dancer and Teacher of Dancing

#### Contact

* Email: <niels.hoppe.extern@srh.de>
* LinkedIn: https://www.linkedin.com/in/hielsnoppe/

### Course Organization

10 sessions of 2 parts each, 90 min per part.

| Date  | Topic |
|-------|-------|
| 04/13 | Introduction and Python Basics |
| 04/14 | Empirical Distributions + Statistical Parameters |
| 04/20 | Probability Theory |
| 04/21 | Theoretical Distributions |
| 05/04 | Confidence Intervals |
| 05/11 | Test of Hypotheses Based on One and Two Samples |
| 05/19 | Test of Hypotheses Based on One and Two Samples |
| 05/26 | Recap and exercises |
| 06/09 | Exam |
| TBD   | Exam retake |

#### Course Organization (cont.)

* Slides will be made available after the lecture
* Exercises and other material will be provided through Teams
* Questions?

### Statistics for Data Analysis

Original meaning: Statistics describes the economic status of a country.

As a field in mathematics: Methods for collecting, preparing, and analysing empirical data.

### Statistics for Data Analysis (cont.)

* Terminology
* Descriptive Statistics
* Inferential Statistics

#### Terminology

**Subject**: an entity of interest

**Parameter**: an observable characteristic of an entity (aka. attribute, feature, property)

**Population**: the complete set of all *subjects* of interest

**Sample**: a subset of the *population*

**Variable**: *parameter* of interest

**Value**: concrete expression of a *variable* in a *subject*
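
A minimal Python sketch of how these terms could map onto a data structure. The patients, their attributes, and the sample size are invented for illustration and are not course data.

```python
import random

# Population: the complete set of subjects of interest (hypothetical patients).
population = [
    {"id": 1, "age": 57, "blood_type": "A"},
    {"id": 2, "age": 62, "blood_type": "O"},
    {"id": 3, "age": 45, "blood_type": "B"},
    {"id": 4, "age": 38, "blood_type": "A"},
    {"id": 5, "age": 71, "blood_type": "AB"},
]

# Sample: a subset of the population, here drawn at random without replacement.
rng = random.Random(42)  # fixed seed so the sketch is reproducible
sample = rng.sample(population, k=3)

# Variable: the parameter of interest; value: its expression in one subject.
variable = "age"
values = [subject[variable] for subject in sample]
print(values)
```

Each dictionary is one subject, each key is a parameter, and `values` holds one value per sampled subject.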

#### Determining the Subject

What is the subject in the following examples?

1. daily blood pressure of a patient
2. incidence of a disease in a country per year
3. prevalence of a disease in multiple countries in a specific year
4. number of rain days in the month of June in a specific region
5. number of rain days in the month of June in multiple regions

#### Comparability of Subjects

> "You can't compare apples with oranges"

Actually, you can.

* Subjects must be comparable with regard to a variable, i.e., all subjects must in principle be able to express that parameter.
* Comparability does not imply usefulness!

#### Generalization and Transfer of Findings

* Findings strictly apply only to the studied sample.
* Findings from representative samples can be generalized to the population.
* Findings cannot generally be transferred to another population.
* Even if two populations are technically comparable, there may be unknown parameters in play.

#### What is Data?

**Data**: the entirety of variables and values contained in the sample

Data is collected by observing the parameters of interest (variables) from every subject contained in the sample.

#### Descriptive Statistics

* What does the data look like?
  - What kind of data do we have?
  - What are interesting values from the data?
  - How focused or dispersed is the data?
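
As a rough preview, these questions can be approached with Python's standard `statistics` module. The blood pressure readings below are made up for illustration.

```python
import statistics

# Hypothetical systolic blood pressure readings in mmHg.
readings = [125, 130, 118, 142, 125, 136]

# "Interesting values": where is the data centered?
mean = statistics.mean(readings)
median = statistics.median(readings)

# "Focused or dispersed": how far do values spread around the center?
value_range = max(readings) - min(readings)
spread = statistics.stdev(readings)  # sample standard deviation

print(round(mean, 2), median, value_range, round(spread, 2))
```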

#### Inferential Statistics

* What can we learn from the data?
* Does it support our assumptions (hypotheses)?

### Types of Variables

| Number | Name | Sex | Age | BMI | Packed Cell Volume | Stage of Disease | # OPs | Blood pressure | Cigarettes / day |
|--------|------|-----|-----|-----|--------------------|------------------|-------|----------------|------------------|
| 23-8   | G.L. | f   | 57  | Normal | 47%             | I                | 4     | 125 mmHg       | 0 |
| 49-1   | S.P. | m   | 62  | Obese  | 49%             | III              | 1     | 130 mmHg       | 10 |

### Types of Variables (cont.)

**Variable**: *parameter* of interest

* Can be classified as
  - qualitative/categorical or quantitative
  - discrete or continuous
* Are measured on scales

#### Qualitative vs. Quantitative Variables

Qualitative aka. categorical variables take values from a **defined set of categories**, e.g.,

* gender of a person
* blood type
* disease or therapy method (in clinical trials)



Quantitative variables take **numerical values** obtained by counting or measuring, e.g.,

* age of a person
* weight, height
* blood pressure
* number of siblings
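
One way to picture the distinction in Python, as a sketch only: a qualitative value is checked against a defined set of categories, while quantitative values support arithmetic. The category set `BLOOD_TYPES` and the helper `is_valid_blood_type` are assumptions made for this example.

```python
# Qualitative: values come from a defined set of categories.
BLOOD_TYPES = {"A", "B", "AB", "O"}  # assumed category set for this sketch


def is_valid_blood_type(value: str) -> bool:
    """Check that a value belongs to the defined set of categories."""
    return value in BLOOD_TYPES


# Quantitative: values are numbers obtained by counting or measuring.
siblings = [0, 2, 1, 3]          # counted
weights_kg = [61.5, 80.2, 74.0]  # measured

print(is_valid_blood_type("AB"))  # True
print(is_valid_blood_type("X"))   # False
print(sum(siblings))              # 6
```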

#### Discrete vs. Continuous Variables

Discrete variables take values from a finite or countably infinite set.

* Qualitative variables are always discrete.
* Quantitative variables are discrete if and only if they take only countable values, typically integers, e.g.,
  - number of patients
  - number of children

Continuous variables can take any numerical value within an interval, e.g.,

* weight, height
* temperature
* blood pressure
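
A small sketch of the difference, assuming counts are stored as `int` and measurements as `float` (a common convention, not a rule). The numbers are invented.

```python
# Discrete: counts move in whole steps; there is nothing between 2 and 3 children.
number_of_children = [0, 2, 1, 3]

# Continuous: between any two values lies another possible value.
low, high = 70.1, 70.2  # weight in kg
between = (low + high) / 2
print(low < between < high)  # True
```

In practice, continuous variables are still recorded with finite precision, e.g., weight rounded to 0.1 kg.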

### Scales of Measure

aka. Levels of Measurement

Different scales are used depending on the type of variable:

* Nominal scale
* Ordinal scale
* Interval scale
* Ratio scale

#### Nominal Scale

* For **categorical data**
* Values can be **discerned**, but **not ranked**,
* i.e., no value is inherently less or greater than any other, **just different**.

Examples:

* Categories, e.g.,
  - gender of a person
  - music, literature or movie genre
  - disease or therapy method (in clinical trials)
  - ...
* Names, addresses, telephone numbers, ...
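
A short sketch of what nominal data permits: telling values apart and counting them. The list of genres is invented for illustration.

```python
from collections import Counter

# Hypothetical nominal data: movie genres of six films.
genres = ["drama", "comedy", "drama", "horror", "comedy", "drama"]

# Allowed: telling values apart and counting them.
counts = Counter(genres)
mode, frequency = counts.most_common(1)[0]
print(counts["comedy"])  # 2
print(mode, frequency)   # drama 3

# Python would happily sort the strings, but that order is alphabetical:
# a property of the labels, not of the genres themselves.
```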

#### Nominal Scale (cont.)

**Question**: But I can rank

* names by lexicographical order or length
* movie genres by my personal preference
* therapy methods by their success rate
* addresses by their proximity to where I am
* ...

Why are they on a nominal scale?

**Answer**: You are creating a **derived variable** on another scale.

* The original variable is still on a nominal scale.
* The derived variable **relates to** and **interprets** the original variable.
* The original and the derived variable are **not the same**.
* Interpretation **adds opinion** to the original data and can be **biased**.
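
The idea of a derived variable can be sketched as follows, using name length as the example. The names are invented.

```python
# Original variable: names, on a nominal scale.
names = ["Grace", "Al", "Barbara", "Tim"]

# Derived variable: name length, a count that can be ranked and compared.
name_lengths = {name: len(name) for name in names}

# Ranking by the derived variable, not by the names themselves.
by_length = sorted(names, key=len)
print(by_length)  # ['Al', 'Tim', 'Grace', 'Barbara']
```

The list `names` is unchanged by this; only the separate mapping `name_lengths` carries an order.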

#### Ordinal Scale

* For **categorical data**
* Values can be **discerned and ranked**, but **not arithmetically compared**,
* i.e., we can tell one value is less or greater than another, but not by how much.



Examples:

* Age categories, e.g., child, adolescent, adult, ...
* Dress sizes, e.g., XS, S, M, L, XL, XXL
* School grades, e.g., A, B, C, D, E, F
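
A sketch of working with ordinal data in Python, assuming the order of dress sizes is encoded by list position. The observed sizes are invented, and the names `SIZES` and `RANK` are chosen for this example.

```python
import statistics

# Dress sizes in their defined order; the position encodes rank only.
SIZES = ["XS", "S", "M", "L", "XL", "XXL"]
RANK = {size: position for position, size in enumerate(SIZES)}

observed = ["M", "XS", "L", "M", "XXL"]

# Allowed: ranking and rank-based summaries such as the median.
ranked = sorted(observed, key=RANK.get)
median_size = SIZES[statistics.median_low(RANK[s] for s in observed)]
print(ranked)       # ['XS', 'M', 'M', 'L', 'XXL']
print(median_size)  # M

# Not meaningful: RANK["L"] - RANK["M"] equals RANK["M"] - RANK["S"] numerically,
# but nothing says the size steps are equally large.
```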

#### Interval Scale

* For **quantitative data**
* Values can be **discerned and arithmetically compared**, but only linearly,
* i.e., sums and differences are allowed, products and quotients are not.



Examples:

* Date and time of day
* Temperature in Celsius or Fahrenheit (but not Kelvin!)
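
Why quotients are problematic on an interval scale can be illustrated with a short sketch: the same two temperatures give different "ratios" in Celsius and Fahrenheit, because the zero point is arbitrary. The helper `celsius_to_fahrenheit` and the example dates are assumptions made for this illustration.

```python
from datetime import datetime


def celsius_to_fahrenheit(c: float) -> float:
    return c * 9 / 5 + 32


# The "ratio" of two temperatures depends on the arbitrary zero point.
c_ratio = 20 / 10  # 2.0
f_ratio = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)  # 68 / 50
print(c_ratio, f_ratio)

# Python's datetime reflects the same idea for points in time:
start = datetime(2024, 4, 13, 9, 0)
end = datetime(2024, 4, 13, 10, 30)
print(end - start)  # 1:30:00, a duration
# end / start raises TypeError: a quotient of two points in time is undefined.
```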

#### Interval Scale (cont.)

**Question**: If quotients are not allowed, can I calculate an average?

**Answer**: Yes. Quotients *of values* are not allowed, but the average is the **sum of values** divided by the **number of values**. No value is multiplied or divided by another value; values are only added and counted.
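
A quick check that the average behaves well on an interval scale: averaging and then converting the unit gives the same result as converting and then averaging. The temperatures and the helper `celsius_to_fahrenheit` are assumptions made for this sketch.

```python
import math
import statistics


def celsius_to_fahrenheit(c: float) -> float:
    return c * 9 / 5 + 32


temps_c = [10.0, 20.0, 24.0]  # hypothetical temperatures in Celsius
temps_f = [celsius_to_fahrenheit(c) for c in temps_c]

mean_c = statistics.mean(temps_c)
mean_f = statistics.mean(temps_f)

# Averaging then converting equals converting then averaging.
print(math.isclose(celsius_to_fahrenheit(mean_c), mean_f))  # True
```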

#### Ratio Scale

* For **quantitative data**
* Values can be **discerned and arithmetically compared** without limitation,
* i.e., sums, differences, products and quotients are allowed.

Examples:

* Durations (in contrast to date and time of day!)
* Temperature in Kelvin
* Weight [kg] and height [cm] of a person
* Most physical measurements
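
A sketch of quotients on a ratio scale, using durations and Kelvin temperatures. The values and the helper `celsius_to_kelvin` are assumptions made for this illustration.

```python
from datetime import timedelta

# Durations have a true zero, so quotients are meaningful.
shorter = timedelta(minutes=45)
longer = timedelta(minutes=90)
print(longer / shorter)  # 2.0, "twice as long" regardless of the unit


# Kelvin has an absolute zero as well; Celsius does not.
def celsius_to_kelvin(c: float) -> float:
    return c + 273.15


ratio_k = celsius_to_kelvin(20) / celsius_to_kelvin(10)
print(round(ratio_k, 3))  # 1.035, so 20 °C is not "twice as warm" as 10 °C
```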

#### Absolute and Metric Scale

* **Absolute scale**: ratio scale for discrete values (counts), which have a natural unit.
* **Metric scale**: often used as a synonym for ratio scale; strictly, an umbrella term for interval, ratio and absolute scale.

### Summary

* Variables are classified as qualitative or quantitative and discrete or continuous.
* Variables are measured on scales.

### Exercises

**Classification of Variables**: Determine the **type** (qualitative or quantitative, discrete or continuous) and **level of scale** for the following variables:

1. Name of a patient
2. Gender
3. Age
4. Body-Mass-Index
5. Body-Mass-Index (as categories)
6. Packed cell volume in blood sample
7. Stage of a disease
8. Number of operative procedures
9. Blood pressure
10. Number of smoked cigarettes per day
11. Dose of a medication (low, medium, high)
12. Dose of a medication [mg/day]
13. Blood group
14. Body temperature
15. Presence of a symptom (yes, no)