# Introduction to Python for Data Science

This Jupyter notebook is adapted from the Microsoft [**FREE** edX course](https://www.edx.org/course/introduction-python-data-science-microsoft-dat208x-6) of the same title.

This script was taken from the «signal processing» lecture by Prof. G. Jeschke, autumn semester 2021.

# 1.1. Hello Python

## What you will learn
- Python
- Specifically for Data Science
- Store data
- Manipulate data
- Tools for data analysis

## Python
- Created by Guido van Rossum
- General Purpose: build anything
- Open Source! Free!
- Python Packages, also for Data Science
- Many applications and fields
- [Version 2.7 or 3.x](https://wiki.python.org/moin/Python2orPython3): use 3.x, since Python 2.7 reached end of life in January 2020

# 1.2. Variables and Types

- Specific, case-sensitive name
- Call up value through variable name

#### Task: Define your height and weight, then print their values

In [None]:
height = 1.79
weight = 68.7

In [None]:
height

In [None]:
weight

## Math operations 

#### Task: Calculate your BMI

- Definition: $ \text{BMI} = \frac{\text{weight}}{\text{height}^2} $

In [None]:
height = 1.79
weight = 68.7
bmi = weight / height ** 2
print(bmi)

In [None]:
# Changing height or weight does not alter bmi
height = 1.34
print(bmi)
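
The cell above prints the old value because `bmi` stores a number, not the formula that produced it. A minimal sketch of recomputing it after the change; the name `bmi_new` is only introduced here for illustration:

```python
height = 1.79
weight = 68.7
bmi = weight / height ** 2       # evaluated once, the result is stored as a float

height = 1.34                    # bmi still holds the value computed above
bmi_new = weight / height ** 2   # re-run the calculation to use the new height

print(bmi)
print(bmi_new)
```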

## Python Data Types

- float - real numbers
- int - integer numbers
- str - string, text
- bool - True, False

In [None]:
type(bmi)

In [None]:
day_of_week = 5
type(day_of_week)

In [None]:
x = "body mass index"
type(x)

In [None]:
y = 'this works too'
type(y)

In [None]:
z = True
type(z)

In [None]:
# Different type = different behavior!
print(2 + 3)
print('ab' + 'cd')

# 2.1. Python `List`

## Python Data Types
- `float` - real numbers
- `int` - integer numbers
- `str` - string, text
- `bool` - True, False
- Each variable represents a ***single*** value

## Problem
- Data Science: many data points
- Height of entire family
```Python
height1 = 1.73
height2 = 1.68
height3 = 1.71
height4 = 1.89
```
- Inconvenient

## Python `List`
- Name a collection of values
- Contain any type
- Contain different types

In [None]:
# Basic list
fam = [1.73, 1.68, 1.71, 1.89]
type(fam)

In [None]:
# List with multiple types
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]
type(fam)

In [None]:
# List of lists
fam2 = [["liz", 1.73], ["emma", 1.68], ["mom", 1.71], ["dad", 1.89]]
type(fam2)

# 2.2. Subsetting lists

Zero-based indexing: 

- From the front: $0, 1, 2,\dots, N-2, N-1$
- From the back: $-N, -(N-1), \dots , -2, -1$
    

In [None]:
fam[3]

In [None]:
fam[6]

In [None]:
fam[-1]

In [None]:
fam[-2]
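
The relationship between the two index ranges can be checked directly. A small sketch, assuming the mixed-type `fam` list defined earlier; `n` is a name introduced here:

```python
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]
n = len(fam)

# For a list of length n, index -k refers to the same element as index n - k
print(fam[-1] == fam[n - 1])   # last element
print(fam[-2] == fam[n - 2])   # second to last
print(fam[-n] == fam[0])       # -n is the first element
```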

## List slicing 

[ start (inclusive) : end (exclusive) ]

In [None]:
fam[3:5]

In [None]:
fam[1:4]

In [None]:
fam[:4]

In [None]:
fam[5:]
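
Because the end index is exclusive, a slice `[start:end]` contains `end - start` elements, and two slices that share a boundary do not overlap. A short sketch under the same `fam` list as above; `start`, `end` and `subset` are illustrative names:

```python
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]

start, end = 3, 5
subset = fam[start:end]             # indices 3 and 4; index 5 is excluded

print(subset)
print(len(subset) == end - start)   # slice length is end - start
print(fam[:4] + fam[4:] == fam)     # the two halves meet without overlap
```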

# 2.3. Manipulating Lists

- Change list elements
- Add list elements
- Remove list elements

## Changing list elements


In [None]:
fam[7] = 1.86

In [None]:
fam

In [None]:
fam[0:2] = ["lisa", 1.74]

In [None]:
fam

## Adding and removing elements

In [None]:
fam_ext = fam + ["me", 1.79]
fam_ext

In [None]:
del(fam[2])
fam

In [None]:
del(fam[2])
fam

## Copying Lists

- copying by reference
- copying by value 

In [None]:
# Copy by reference (change the copy, change the original)
x = ["a", "b", "c"]
y = x
y[1] = "z"
print(x)
print(y)

In [None]:
# Copy by value (change the copy, no change to the original)
x = ["a", "b", "c"]
y = x[:] # or y = list(x)
y[1] = "z"
print(x)
print(y)
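
One way to tell the two kinds of copy apart is the `is` operator, which tests whether two names refer to the same object, while `==` only compares contents. A minimal sketch; `alias` and `x_copy` are names chosen for this example:

```python
x = ["a", "b", "c"]
alias = x            # second name for the same list object
x_copy = list(x)     # new list object with the same elements

print(alias is x)    # same object
print(x_copy is x)   # different object
print(x_copy == x)   # but equal contents
```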

# 3.1. Functions

- Nothing new!
- `type()`
- Piece of reusable code
- Solves particular task
- Call function instead of writing code yourself

## Example 1: max( ) function

Find the largest element in a list.

In [None]:
fam = [1.73, 1.68, 1.71, 1.89]
tallest = max(fam)
print(tallest)

## Example 2: round( ) function

Round floating point number.

In [None]:
help(round)

In [None]:
round(1.68,1)

In [None]:
round(1.68)

## Finding functions

- Standard task $\rightarrow$ probably function exists!
- The internet is your friend (google it!)

# 3.2. Methods: Functions that belong to objects

- `str` : capitalize(), replace(), etc.
- `float` : is_integer(), conjugate(), etc.
- `list` : index(), count(), etc.

## `list` Methods

In [None]:
fam = ['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89]
fam.index("mom")

## `str` Methods

In [None]:
sister = "liz"
sister

In [None]:
sister.capitalize()

In [None]:
sister.replace("z", "sa")

## Methods
- Everything = object
- Objects have methods associated, depending on type

In [None]:
fam.replace("mom", "mommy")  # AttributeError: lists have no replace() method

In [None]:
sister.index("z")

In [None]:
fam.append("me")
fam

In [None]:
fam.append(1.79)
fam

## Summary
- Functions
    - `type(fam)`
- Methods: call functions on objects
    - `fam.index("dad")`

# 3.3. Packages

## Motivation

- Functions and methods are powerful
- All code in Python distribution?
     - Huge code base: messy
     - Lots of code you won’t use
     - Maintenance problem

## Packages

- Directory of Python Scripts
- Each script = module
- Specify functions, methods, types
- Thousands of packages available
    - Numpy
    - Matplotlib
    - Scikit-learn

## Install package
- http://pip.readthedocs.org/en/stable/installing/
- Download get-pip.py
- Terminal:
    - python get-pip.py
    - pip install numpy
    
## Import package

In [None]:
import numpy
array([1, 2, 3])  # NameError: after a plain import, only numpy.array is defined

In [None]:
numpy.array([1, 2, 3])

In [None]:
import numpy as np
np.array([1, 2, 3])

In [None]:
from numpy import array
array([1, 2, 3])

# 4.1. Numpy

## Lists Recap
- Powerful
- Collection of values
- Hold different types
- Change, add, remove
- Need for Data Science
    - Mathematical operations over collections
    - Speed

## Illustration

In [None]:
height = [1.73, 1.68, 1.71, 1.89, 1.79]
height

In [None]:
weight = [65.4, 59.2, 63.6, 88.4, 68.7]
weight

In [None]:
weight / height ** 2  # TypeError: lists do not support element-wise arithmetic

## Solution: Numpy
- Numeric Python
- Alternative to Python List: Numpy Array
- Calculations over entire arrays
- Easy and Fast
- Installation
     - In the terminal: pip3 install numpy

In [None]:
import numpy as np
np_height = np.array(height)
np_height

In [None]:
np_weight = np.array(weight)
np_weight

In [None]:
# Element-wise calculations
bmi = np_weight / np_height ** 2
bmi

## Numpy: remarks
- Numpy arrays: contain only one type
- Different types: different behavior!

In [None]:
np.array([1.0, "is", True])
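
The "one type" rule can be made visible through the array's `dtype` attribute. A small sketch; the second array, `numbers`, is an extra example that is not part of the original notebook:

```python
import numpy as np

mixed = np.array([1.0, "is", True])
print(mixed)         # every element has been converted to a string
print(mixed.dtype)   # a Unicode string dtype

numbers = np.array([1, 2.5, True])
print(numbers)       # int and bool are promoted to float
print(numbers.dtype)
```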

In [None]:
python_list = [1, 2, 3]
python_list + python_list

In [None]:
numpy_array = np.array([1, 2, 3])
numpy_array + numpy_array

## Numpy Subsetting

In [None]:
bmi

In [None]:
bmi[1]

In [None]:
bmi > 23

In [None]:
bmi[bmi > 23]
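
The comparison `bmi > 23` produces a boolean array, and that array can be stored and reused to select from any array of the same length. A minimal sketch that rebuilds `bmi` from the values above; `mask` is a name introduced here:

```python
import numpy as np

np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
bmi = np_weight / np_height ** 2

mask = bmi > 23          # one True/False per element
print(mask)
print(mask.sum())        # number of True entries
print(np_height[mask])   # heights of the people whose BMI exceeds 23
```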

# 4.2. 2D Numpy Arrays

## Type of Numpy Arrays
- ndarray = N-dimensional array

In [None]:
import numpy as np
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])

In [None]:
type(np_height)

In [None]:
type(np_weight)

## 2D Numpy Arrays

In [None]:
np_2d = np.array([[1.73, 1.68, 1.71, 1.89, 1.79],[65.4, 59.2, 63.6, 88.4, 68.7]])
np_2d

In [None]:
# 2 rows, 5 columns
np_2d.shape

In [None]:
# NumPy arrays hold a single type: one string turns every element into a string
np.array([[1.73, 1.68, 1.71, 1.89, 1.79],
          [65.4, 59.2, 63.6, 88.4, "68.7"]])

## Subsetting

In [None]:
# Get the first row
np_2d[0]

In [None]:
# First row, third column
np_2d[0][2]

In [None]:
# First row, third column
np_2d[0,2]

In [None]:
# All rows, second and third columns
np_2d[:,1:3]

In [None]:
# Second row, all columns
np_2d[1,:]
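
Row subsetting combines naturally with element-wise arithmetic. As a sketch, the BMI from section 4.1 can be recomputed from the two rows of `np_2d`; `heights` and `weights` are names chosen for this example:

```python
import numpy as np

np_2d = np.array([[1.73, 1.68, 1.71, 1.89, 1.79],
                  [65.4, 59.2, 63.6, 88.4, 68.7]])

heights = np_2d[0, :]          # first row
weights = np_2d[1, :]          # second row
bmi = weights / heights ** 2   # element-wise, one value per column

print(bmi.shape)
print(np_2d[:, 1:3].shape)     # 2 rows, 2 columns (column indices 1 and 2)
```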

# 4.3. Numpy: Basic Statistics

## Data analysis

- Get to know your data
- Little data $\rightarrow$ simply look at it
- Big data $\rightarrow$ ?

## City-wide survey

In [None]:
import numpy as np

# Generate 5000 normally distributed random samples
height = np.round(np.random.normal(1.75, 0.20, 5000), 2) # mean = 1.75, std dev = 0.2
weight = np.round(np.random.normal(60.32, 15, 5000), 2) # mean = 60.32, std dev = 15
np_city = np.column_stack((height, weight))
np_city

In [None]:
np.mean(np_city[:,0])

In [None]:
np.median(np_city[:,0])

In [None]:
np.corrcoef(np_city[:,0], np_city[:,1])

In [None]:
np.std(np_city[:,0])
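
The numbers above change on every run because the samples are random. A sketch of a reproducible variant, assuming NumPy's `default_rng` generator with an arbitrarily chosen seed of 0 (not part of the original notebook). Since height and weight are drawn independently here, the correlation coefficient should come out close to zero:

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed: same samples on every run

height = np.round(rng.normal(1.75, 0.20, 5000), 2)   # mean = 1.75, std dev = 0.2
weight = np.round(rng.normal(60.32, 15, 5000), 2)    # mean = 60.32, std dev = 15
np_city = np.column_stack((height, weight))

print(np_city.shape)
print(np.mean(np_city[:, 0]))
print(np.std(np_city[:, 0]))
# Off-diagonal entry of the 2x2 correlation matrix
print(np.corrcoef(np_city[:, 0], np_city[:, 1])[0, 1])
```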

# 5.1. Basic Plots with Matplotlib

## Data Visualization
- Very important in Data Analysis
    - Explore data
    - Report insights

## Matplotlib

In [None]:
import matplotlib.pyplot as plt
year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]
plt.plot(year, pop)
plt.show()

## Scatter plot

In [None]:
plt.scatter(year, pop)
plt.show()

# 5.2. Histograms

## Histogram
- Explore dataset
- Get idea about distribution

## Matplotlib

In [None]:
import matplotlib.pyplot as plt
help(plt.hist)

## Matplotlib Example

In [None]:
values = [0,0.6,1.4,1.6,2.2,2.5,2.6,3.2,3.5,3.9,4.2,6]
plt.hist(values, bins = 3)
plt.show()
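
To see which numbers the bars represent, the same binning can be computed without plotting. A minimal sketch using `np.histogram`, which returns the count per bin and the bin edges:

```python
import numpy as np

values = [0, 0.6, 1.4, 1.6, 2.2, 2.5, 2.6, 3.2, 3.5, 3.9, 4.2, 6]
counts, edges = np.histogram(values, bins=3)

print(edges)    # 3 equal-width bins between min and max need 4 edges
print(counts)   # number of values falling into each bin
```

With three bins spanning 0 to 6, the edges are 0, 2, 4 and 6, and the counts are 4, 6 and 2.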

# 5.3. Customization

## Data Visualization

- Science & Art
- Many options
    - Different plot types
    - Many customizations
- Choice depends on:
    - Data
    - Story you want to tell
    
## Basic Plot

In [None]:
year = np.linspace(1950.,2100.,num=50)

# Logistic growth model: carrying capacity K, initial population P0, growth rate r
K = 11.
P0 = 2.6
r = 0.03
population = K*P0*np.exp(r*(year-year[0])) / (K + P0*(np.exp(r*(year-year[0]))-1.))

plt.plot(year, population)
plt.show()
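
The curve above follows the logistic growth formula $P(t) = \frac{K P_0 e^{rt}}{K + P_0 (e^{rt} - 1)}$ with $t$ measured from 1950. A sketch that wraps it in a function to check two properties: it starts at `P0` and levels off at `K`. The function name `logistic_population` and the parameter `year0` are introduced here for illustration:

```python
import numpy as np

def logistic_population(year, K=11.0, P0=2.6, r=0.03, year0=1950.0):
    """Logistic growth with the same constants as the plot above."""
    growth = np.exp(r * (year - year0))
    return K * P0 * growth / (K + P0 * (growth - 1.0))

print(logistic_population(1950.0))   # equals P0 at the starting year
print(logistic_population(2100.0))   # still below K
print(logistic_population(3000.0))   # approaches K for large t
```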

## Axis Labels, Title, Ticks, Axis Limits

In [None]:
plt.plot(year, population)

plt.xlabel('Year')
plt.ylabel('Population')
plt.title('World Population Projections')
plt.yticks([0,2,4,6,8,10])
plt.xlim(1950, 2100)
plt.ylim(0, 11)

plt.show()

## Tick Labels

In [None]:
plt.plot(year, population)

plt.xlabel('Year')
plt.ylabel('Population')
plt.title('World Population Projections')
plt.yticks([0,2,4,6,8,10],['0','2B','4B','6B','8B','10B'])
plt.xlim(1950, 2100)
plt.ylim(0, 11)

plt.show()

## Add Historical Data

In [None]:
population = [1.0,1.262,1.650] + population.tolist()
year = [1800,1850,1900] + year.tolist()

plt.fill_between(year, population, 0, color='green')

plt.xlabel('Year')
plt.ylabel('Population')
plt.title('World Population Projections')
plt.yticks([0,2,4,6,8,10],['0','2B','4B','6B','8B','10B'])
plt.xlim(1800, 2100)
plt.ylim(0, 11)

plt.show()

# 6.1. Boolean Logic & Control Flow

## Booleans


In [None]:
2 < 3

In [None]:
2 == 3

In [None]:
x = 2
y = 3
x < y

In [None]:
x == y

## Relational Operators

| operator | meaning |
| :---: | --- |
| < | strictly less than |
| <= | less than or equal |
| > | strictly greater than |
| >= | greater than or equal |
| == | equal |
| != | not equal |

## Logical Operators

- and
- or
- not

In [None]:
print(True and True)
print(True and False)
print(False and True)
print(False and False)

In [None]:
print(True or True)
print(True or False)
print(False or True)
print(False or False)

In [None]:
print(not True)
print(not False)
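
In practice the logical operators usually combine comparisons rather than literal `True` and `False` values. A short sketch with a hypothetical value `x = 12`:

```python
x = 12  # hypothetical value, chosen for illustration

in_range = x > 5 and x < 15    # both comparisons must hold
chained = 5 < x < 15           # chained comparison with the same meaning
outside = x < 5 or x > 15      # at least one comparison must hold
negated = not x > 5            # inverts the result of x > 5

print(in_range, chained, outside, negated)
```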

## Conditional Statements

```python
if condition :
    expression
```
Note the indentation of expression and the colon after the condition.

In [None]:
z = 4
if z % 2 == 0 :
    print("z is even")

In [None]:
z = 5
if z % 2 == 0 :
    print("z is even")
else :
    print("z is odd")

In [None]:
z = 3
if z % 2 == 0 :
    print("z is divisible by 2")
elif z % 3 == 0 :
    print("z is divisible by 3")
else :
    print("z is neither divisible by 2 nor by 3")
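
Only the first branch whose condition is true gets executed, even when later conditions would also hold. A sketch that wraps the chain above in a hypothetical helper function, `describe`, so that several values can be tried:

```python
def describe(z):
    """Hypothetical helper wrapping the if / elif / else chain shown above."""
    if z % 2 == 0:
        return "z is divisible by 2"
    elif z % 3 == 0:
        return "z is divisible by 3"
    else:
        return "z is neither divisible by 2 nor by 3"

print(describe(6))   # divisible by 2 and by 3, but only the first branch runs
print(describe(9))
print(describe(7))
```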