# EEMP - Introduction to Python for Data Analysis

This is an introductory course on Python. It gives you a background in the packages needed for data analysis and introduces the most basic commands to get started. More commands will be introduced and practiced throughout the course.

**Organizational issues**

- 3 python introductory sessions:


    - 07/10/2019, 14:00 - 15:30
    - 07/10/2019, 16:00 - 17:30
    - 08/10/2019, 10:00 - 11:30


- All materials can be found in my GitHub repository:

    https://github.com/lemepe/EEMP
    
    
- We will mainly be working with two tools in this course:

    - Python (version 3.7.3)
    - Jupyter notebook

*Let's set the scene for working with python and jupyter notebook:*

**Step 1:** Install a Python distribution and the required packages

 - we will be using Anaconda https://www.anaconda.com/
 - install the required packages (in this order): numpy, pandas, statsmodels, matplotlib, seaborn, scikit-learn
        - Anaconda Navigator: open Environments -> create a new environment and install the packages there
        - command line: conda create --name *env_name*
                        conda activate *env_name*
                        conda install -c anaconda numpy pandas statsmodels matplotlib seaborn scikit-learn
                        
                        .. check the installed packages with:
                        conda list
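
The installation can also be checked from within Python. The following is a minimal sketch, assuming the package list from Step 1; the helper name `check_packages` is hypothetical, and note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names of the packages from Step 1 (scikit-learn is imported as "sklearn")
required = ["numpy", "pandas", "statsmodels", "matplotlib", "seaborn", "sklearn"]

def check_packages(names):
    """Return a dict mapping each import name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_packages(required)
for name, installed in status.items():
    print(name, "-> installed" if installed else "-> MISSING")
```

If a package is reported as missing, make sure the correct environment is active before starting Python or Jupyter.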
                      
**Step 2:** open a jupyter notebook

 - with Anaconda
 - command line: jupyter notebook

# 8 Reasons Why You Should Learn Python

1. Consistently ranks among the most popular programming languages with a promising future
2. First-class tool for scientific computing tasks, especially for large datasets
3. Straightforward syntax and easy to learn
4. Very versatile and highly compatible
5. Free of charge since it is open source
6. Comprehensive standard libraries with large ecosystem of third-party packages
7. State-of-the-art for machine learning and data science in general
8. Great amount of resources and welcoming community

---
## Short Introduction to Jupyter Notebook

- open source web application
    - works with your browser of choice (chrome, firefox, safari)
- interactive computing environment
- great tool to create and share documents that combine live code, visualizations, equations, text, images, videos etc.
- allows you to work interactively
- check http://www.jupyter.org

### Jupyter Basics

Notebook cells can have 2 different modes:
 - edit mode (green cell border -> Enter)
 - command mode (blue cell border -> Esc; Shift + Enter runs the cell and also switches to command mode)

and types:
 - markdown cell for narrative text, LaTeX style formulas, weblinks, pictures etc. (command mode -> m)
 - code cell (command mode -> y)
     - chosen kernel defines active programming language (don't worry about this, we will only be using the Python kernel)
     
### First Steps with Jupyter:
 1. Take the User Interface Tour (Help -> User Interface Tour)
 2. Check keyboard shortcuts (Help -> Keyboard Shortcuts)
 3. If you want to know more about Jupyter, there are many online resources that give you a more detailed introduction (e.g. Jupyter documentation https://jupyter-notebook.readthedocs.io/en/stable/, blogs, YouTube tutorials etc.) 
     
---

*Let's get started with Python...*

## 1. Datatypes and Operators

### Datatypes:

- integers
- floats (decimal number)
- strings ("text")
- booleans (True/False)

In [None]:
# integers
a = 10
b = 4

print(a)
print(type(a))
(a+b)*3

In [None]:
# want to know more about a function and how to use it? - use the help() function or internet search
help(print)
help(type)

In [None]:
# floats
c = 1.5
d = 28.0

print(type(d))
c*d

In [71]:
# strings

question = "What is the answer to life, to the universe and everything?" # either denote with ""
answer = '42' # .. or ''

print(type(question))

question_answer= question + answer # strings can be added too!

print(question_answer)

print(question," - ",answer)
print(question + ' - ' + answer)

<class 'str'>
What is the answer to life, to the universe and everything?42
What is the answer to life, to the universe and everything?  -  42
What is the answer to life, to the universe and everything? - 42
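
Note that `answer` above is the string `'42'`, not the number 42. As a small sketch, the built-in functions `int()`, `float()` and `str()` convert between these datatypes (the variable name `number` is only chosen for illustration):

```python
answer = '42'             # a string, not a number

number = int(answer)      # convert the string to an integer
print(number + 1)         # arithmetic now works
print(float(answer))      # convert the string to a float
print(str(number) + '!')  # convert back to a string before concatenating

# Adding a string and a number with + raises a TypeError
try:
    answer + 1
except TypeError as error:
    print('TypeError:', error)
```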


In [66]:
# Booleans and True/False - Operators


print(True==1) # True is encoded as 1
print(False==1) # False is encoded as 0, so this comparison is False

print(not True) # we can inverse a boolean with "not"
print(True + True) # We can also add booleans

True
False
False
2


In [67]:
# we can evaluate the truth of a statement with the comparison operators ==, !=, >, <, >=, <= and with not
# -> the output is always a boolean

print(True > False)
print(answer == '42')
print(4 >= 5)
print(10 != 0)

True
True
False
True
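
Several comparisons can be combined with the logical operators `and`, `or` and `not`. A short sketch, with variable names chosen only for illustration:

```python
answer = '42'
number = 10

# "and" is True only if both conditions are True
both = (answer == '42') and (number > 5)
# "or" is True if at least one condition is True
either = (answer == '41') or (number > 5)
# "not" inverts the result of a condition
neither = not (answer == '42')

print(both, either, neither)
```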


In [78]:
# If-statements can be used to execute code only if a certain condition is fulfilled

answer = '42'

# indentation after the if-condition is needed (convention is to indent with 4 spaces)
if answer == "42":
    print("This is the answer to life, to the universe and everything.")

This is the answer to life, to the universe and everything.


In [82]:
# we can also include additional conditions

answer = '1'

if answer == "42":
    print("This is the answer to life, to the universe and everything.")
elif answer == "41":
    print("This is nearly the answer to life, to the universe and everything.")
else:
    print("This is not the answer to life, to the universe and everything.")

This is not the answer to life, to the universe and everything.


## 2. Python Lists

- standard mutable multi-element container in Python
- denoted by square brackets [ ]

In [48]:
# Python lists can contain integers...

l1 = list(range(10))
print(l1, type(l1[0]))

# ...strings

l2 = list(str(i) for i in l1)
print(l2, type(l2[0]))


# ... or a combination of different data types.

l3 = [1.2,42,'Yes',True]
print(l3)
print([type(i) for i in l3])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] <class 'int'>
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'] <class 'str'>
[1.2, 42, 'Yes', True]
[<class 'float'>, <class 'int'>, <class 'str'>, <class 'bool'>]


In [104]:
# one can also access the different elements within a list by their index

print(l1[0]) # Python is a zero-indexed language, i.e. this gives you the first element of the list
print(l2[-1]) # one can also access the list from the end, i.e. this gives you the last element of the list
print(l2[-2]) # ... and this the second-to-last element

print(l3[0:3]) # or slice the list and extract only a certain range

0
9
8
[1.2, 42, 'Yes']
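
"Mutable" means that a list can be changed after it has been created. A minimal sketch using the same values as `l3` above:

```python
l3 = [1.2, 42, 'Yes', True]

l3[2] = 'No'         # replace the element at index 2
l3.append(3.14)      # add an element at the end
removed = l3.pop(0)  # remove and return the element at index 0

print(l3)
print(removed, len(l3))
```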


## 3. Loops

- we can loop over values in a list

In [84]:
for item in ['life','the universe','everything']:
    print('The answer to',item,'is 42.')

The answer to life is 42.
The answer to the universe is 42.
The answer to everything is 42.


In [89]:
even_number = list(range(0,10,2)) # check help(range) to find out about the options within the function

print(even_number)

result = 0
for number in even_number:
    result += number # this is the inplace version of reassigning result = result + number, the outcome is identical
    print(result)

[0, 2, 4, 6, 8]
0
2
6
12
20
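
As a short sketch of the options within `range`: it accepts a stop value only, a start and a stop value, or a start, stop and step value. The stop value itself is never included. The variable names are only chosen for illustration.

```python
stop_only = list(range(5))          # 0 up to (but excluding) 5
start_stop = list(range(2, 5))      # 2 up to (but excluding) 5
with_step = list(range(0, 10, 2))   # every second number from 0
count_down = list(range(10, 0, -3)) # a negative step counts down

for values in [stop_only, start_stop, with_step, count_down]:
    print(values)
```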


## 4. Functions 

- Python is also suitable for writing functions
- Very good for operations that are done repeatedly, but have no built-in functions
- However, whenever there are built-in functions always use those; they are usually computationally more efficient
- We will give you a short idea of what a function is and what it looks like, but writing functions is not the focus of the course

In [90]:
def f(x):
    '''This function squares its numerical input'''
    return x**2

In [95]:
f(3)

9
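
To illustrate the point about built-in functions, the following sketch writes the summation loop from section 3 as a function (the name `total` is hypothetical) and compares it with the built-in `sum`:

```python
def total(numbers):
    '''This function adds up all numbers in a list with a loop'''
    result = 0
    for number in numbers:
        result += number
    return result

even_number = list(range(0, 10, 2))

print(total(even_number))  # hand-written function
print(sum(even_number))    # built-in function, same result
```

Both calls return the same value, so in practice the built-in `sum` is the one to use.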

## 5. Libraries 

### 5.1 NumPy Library 

*Provides numeric vector and matrix operations*

- NumPy's "ndarray" is another of the basic formats data can be stored in
- Similar to Python's built-in lists (see 2.), but lacks their multi-type flexibility, i.e. an ndarray can only contain one data type
- However, in contrast to lists, ndarrays are more efficient in storing and manipulating data, which is important as data become bigger
- Building blocks for many other packages (see 5.2)
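
A small sketch of what "only one data type" means in practice, assuming NumPy is installed: when values of different types are passed to `np.array`, NumPy converts them to one common type. The variable names are only chosen for illustration.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed_numbers = np.array([1, 2.5, 3])   # the integers are converted to floats
mixed_text = np.array([1, 'two', 3.0])  # everything is converted to strings

print(ints, ints.dtype)
print(mixed_numbers, mixed_numbers.dtype)
print(mixed_text, mixed_text.dtype)
```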

In [109]:
# Before we can use a package the first time, we need to import it (given we have it already installed)
# the "as np" indicates the alias we can use to call the package from now on
import numpy as np

In [119]:
# ndarrays
array1 = np.array([0,1,5,15,2])
print(array1, type(array1))

array2 = np.arange(5)
print(array2)

array3 = array1 + array2 # ndarrays can also be added to each other
print(array3)

[ 0  1  5 15  2] <class 'numpy.ndarray'>
[0 1 2 3 4]
[ 0  2  7 18  6]


In [129]:
# we can also build matrices from ndarrays

matrix1 = np.array([[1,0],[0,1]])
print(matrix1, type(matrix1))

matrix2 = np.array([array1, array2, array1 + array3])
print(matrix2)

matrix3 = matrix2 + array1
print(matrix3)

[[1 0]
 [0 1]] <class 'numpy.ndarray'>
[[ 0  1  5 15  2]
 [ 0  1  2  3  4]
 [ 0  3 12 33  8]]
[[ 0  2 10 30  4]
 [ 0  2  7 18  6]
 [ 0  4 17 48 10]]
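
Adding the one-dimensional `array1` to the two-dimensional `matrix2` works because NumPy "broadcasts" the array across every row of the matrix. A minimal sketch with smaller, hypothetical arrays:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])     # shape (3,)

result = matrix + row            # row is added to each row of matrix
print(result)
print(matrix.shape, row.shape, result.shape)

# Shapes that cannot be matched raise a ValueError
try:
    matrix + np.array([1, 2])
except ValueError as error:
    print('ValueError:', error)
```

Broadcasting requires the last dimensions to match (or to be 1), which is why the array with two elements cannot be added to a matrix with three columns.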


In [141]:
# and then work with these arrays and matrices using numpy methods and functions

matrix2_t = matrix2.transpose()
print(matrix2_t)
print(np.shape(matrix2_t)) # gives you a 5x3 matrix from the original 3x5 matrix

[[ 0  0  0]
 [ 1  1  3]
 [ 5  2 12]
 [15  3 33]
 [ 2  4  8]]
(5, 3)


In [146]:
# as with lists you can access elements within an array in a similar fashion

print(array1[0:4:2]) # slicing scheme: array[start:stop:step]

print(matrix2_t[1,2]) # takes only the index 1 row- and index 2 column-element
print(matrix2_t[0:2,0:2], np.shape(matrix2_t[0:2,0:2])) # gives you a 2x2 matrix from the 5x3 original matrix

[0 5]
3
[[0 0]
 [1 1]] (2, 2)


### 5.2 Pandas Library
*Provides the DataFrame, which is the building block for working with data*

- Built around NumPy arrays

In [1]:
# Again, we have to import the package first...
import pandas as pd

*Let's read in some data...*

In [2]:
path_to_data = "https://raw.githubusercontent.com/lemepe/EEMP/master/python_intro/Employee_data.csv"
employee_data = pd.read_csv(path_to_data)
employee_data.head() # by default this gives you the first 5 observations in the dataframe

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [32]:
# Slicing by indices

employee_data.iloc[0:100] # extract first 100 observations [check .loc[] for label-based indexing, e.g. indices that are strings]
employee_data.iloc[0:100,0:2] # sliced both row- and columnwise

exit_data = employee_data.loc[employee_data['Attrition']=='Yes']
print(exit_data.iloc[:,[0,4]].head(10))

    Age              Department
0    41                   Sales
2    37  Research & Development
14   28  Research & Development
21   36                   Sales
24   34  Research & Development
26   32  Research & Development
33   39                   Sales
34   24  Research & Development
36   50                   Sales
42   26  Research & Development
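
The difference between `.iloc` (position-based) and `.loc` (label-based) is easier to see on a small example. The following sketch uses hypothetical toy data, not the employee dataset:

```python
import pandas as pd

# Hypothetical toy data with string labels as index
toy = pd.DataFrame(
    {'Age': [41, 49, 37], 'Department': ['Sales', 'R&D', 'R&D']},
    index=['anna', 'ben', 'cara'],
)

by_position = toy.iloc[0:2]        # rows at positions 0 and 1 (stop is excluded)
by_label = toy.loc['anna':'ben']   # rows from 'anna' up to and including 'ben'
older = toy.loc[toy['Age'] > 40, 'Department']  # boolean filter plus column label

print(by_position)
print(by_label)
print(older)
```

Note that slicing with `.loc` includes the end label, whereas slicing with `.iloc` excludes the end position.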


In [33]:
# Descriptive statistics

# Distribution of exits across departments
print(exit_data['Department'].value_counts(normalize=True))

# Mean age of exits
mean_age_exited = exit_data['Age'].mean()
print(mean_age_exited)

# Mean age of exits across departments
mean_age_exited_by_dep = exit_data['Age'].groupby(exit_data['Department']).mean()
print(mean_age_exited_by_dep)

# Mean age across all employees
mean_age = employee_data['Age'].mean()
print(mean_age)

Research & Development    0.561181
Sales                     0.388186
Human Resources           0.050633
Name: Department, dtype: float64
33.607594936708864
Department
Human Resources           30.083333
Research & Development    33.473684
Sales                     34.260870
Name: Age, dtype: float64
36.923809523809524
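
Group-wise statistics can also be written by grouping the DataFrame by a column name and then selecting the column of interest. A minimal sketch on hypothetical toy data (not the employee dataset):

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Department': ['Sales', 'R&D', 'R&D', 'Sales'],
    'Age': [30, 40, 50, 34],
})

# Mean age per department
mean_by_dep = toy.groupby('Department')['Age'].mean()
print(mean_by_dep)

# Several statistics at once
summary = toy.groupby('Department')['Age'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```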


### 5.3 Matplotlib and Seaborn Libraries
*Provide plotting and visualization support*

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
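
As a first impression of what plotting looks like, here is a minimal sketch of a histogram with Matplotlib. The ages are hypothetical values chosen for illustration, not taken from the employee dataset.

```python
import matplotlib.pyplot as plt

# Hypothetical ages, for illustration only
ages = [41, 49, 37, 33, 27, 32, 59, 30, 38, 36]

fig, ax = plt.subplots()
counts, bin_edges, patches = ax.hist(ages, bins=5)  # histogram with 5 bins
ax.set_xlabel('Age')
ax.set_ylabel('Number of employees')
ax.set_title('Age distribution (toy data)')
plt.show()
```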

In [None]:
%lsmagic

In [None]:
%pwd

In [None]:
%ls

### 5.4 Statsmodels Library
*Provides many different statistical models, statistical tests, and statistical data exploration*

### 5.5 Scikit-learn Library
*Provides general purpose machine learning package with extensive coverage of models and feature transformers*

## 6. References and Further Readings

- VanderPlas, Jake (2016): Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media.
- Adams, Douglas (2008): The Hitchhiker's Guide to the Galaxy. Reclam.