# EEMP - Introduction to Python for Data Analysis

- Introductory course on working with Python for data analysis
- Goals:
    - Overview of basic data structures and commands in Python
    - Essential toolkit for data analysis in Python, giving you a background in the packages needed
    - By no means exhaustive, but should enable you to continue learning by yourself
- More commands will be introduced and practiced throughout the course

# Organizational issues

- 3 introductory Python sessions:


    - 07/10/2019, 14:00 - 15:30
    - 07/10/2019, 16:00 - 17:30
    - 08/10/2019, 10:00 - 11:30


- All materials can be found on the course's GitHub page:

    https://github.com/jeshan49/EEMP2019

We will mainly be working with two tools in this course:

- Python (version 3.7.3)
- Jupyter notebook
  
  
  
*... Let's set the scene for working with Python and Jupyter Notebook.*

**Step 1:** install a Python distribution and the respective packages

 - we will be using Anaconda https://www.anaconda.com/
 - install the required packages (in this order: numpy, pandas, statsmodels, matplotlib, seaborn, scikit-learn)
     - open Anaconda/Environments
        - check whether the respective packages are already installed; if not, install them.
        - *command line*: conda install -c anaconda numpy pandas statsmodels matplotlib seaborn scikit-learn
            - check installed packages with: conda list
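
As a complement to `conda list`, the check can also be sketched from within Python. This is only an illustration using the standard library; note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names of the packages listed in Step 1
# (scikit-learn is imported as "sklearn")
required = ["numpy", "pandas", "statsmodels", "matplotlib", "seaborn", "sklearn"]

# find_spec returns None when a package cannot be found
missing = [name for name in required if importlib.util.find_spec(name) is None]

print("Missing packages:", missing if missing else "none")
```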

**Step 2:** open a Jupyter notebook

 - with Anaconda
 - *command line*: jupyter notebook

# 8 Reasons Why You Should Learn Python

1. Consistently ranks among the most popular programming languages with a promising future
2. First-class tool for scientific computing tasks, especially for large datasets
3. Straightforward syntax and easy to learn
4. Very versatile and highly compatible
5. Free of charge since it is open source
6. Comprehensive standard libraries with large ecosystem of third-party packages
7. State-of-the-art for machine learning and data science in general
8. Great amount of resources and welcoming community

## Short Introduction to Jupyter Notebook

- open source web application
    - works with your browser of choice (chrome, firefox, safari)
- interactive computing environment
- great tool to create and share documents that combine live code, visualizations, equations, text, images, videos etc.
- allows you to work interactively
- check http://www.jupyter.org

### Jupyter Basics

Notebook cells can have 2 different modes:
 - edit mode (green cell border -> Enter)
 - command mode (blue cell border -> Esc; Shift + Enter runs the cell and also returns to command mode)

... and different types:
 - markdown cell for narrative text, LaTeX-style formulas, weblinks, pictures etc. (command mode -> m)
 - code cell (command mode -> y)
     - chosen kernel defines active programming language (don't worry about this, we will only be using the Python kernel)

### First Steps with Jupyter:

 1. Take the User Interface Tour (Help -> User Interface Tour)
 2. Check keyboard shortcuts (Help -> Keyboard Shortcuts)
 3. If you want to know more about Jupyter, there are many online resources that give you a more detailed introduction (e.g. Jupyter documentation https://jupyter-notebook.readthedocs.io/en/stable/, blogs, YouTube tutorials etc.)

*Let's get started with Python...*

## 1. Datatypes and Operators

### Datatypes:

- Integers
- Floats (decimal number)
- Strings ("text")
- Booleans (True/False)
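
A minimal sketch of the four data types listed above. The variable names and values are made up for illustration.

```python
# Each value has a type that can be inspected with type()
n_sessions = 3         # integer
session_hours = 1.5    # float
course = "EEMP"        # string
is_intro = True        # Boolean

for value in (n_sessions, session_hours, course, is_intro):
    print(value, type(value).__name__)

# Arithmetic between an integer and a float yields a float
total_hours = n_sessions * session_hours
print(total_hours)
```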

In [1]:
# want to know more about a function and how to use it? - use the help() function or internet search
help(print)

Help on built-in function print in module builtins:

print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
    
    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file:  a file-like object (stream); defaults to the current sys.stdout.
    sep:   string inserted between values, default a space.
    end:   string appended after the last value, default a newline.
    flush: whether to forcibly flush the stream.



In [None]:
# floats

In [None]:
# strings

In [None]:
# Booleans and True/False - Operators

In [None]:
# we can evaluate the truth of a statement with the different operators ==, !=, >, <, >=, <=, not
# -> the output is always a boolean

In [None]:
# we can also combine different conditions with and, &, or
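
The comparison and logical operators mentioned in the two cells above could be tried out as follows. The values of `age` and `department` are made up.

```python
age = 35
department = "Sales"

# Comparisons always return a Boolean
print(age == 35)
print(age != 35)
print(age >= 18)
print(not age > 40)

# Conditions can be combined with "and" / "or"
is_young_sales = (age < 40) and (department == "Sales")
is_hr_or_young = (department == "HR") or (age < 30)
print(is_young_sales, is_hr_or_young)

# "&" also works on Booleans; the parentheses are required because
# "&" binds more tightly than the comparison operators
both = (age < 40) & (department == "Sales")
print(both)
```

The `&` form becomes important later on, because it is the one used to combine element-wise conditions on pandas columns.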

In [None]:
# If-statements can be used to execute code only if a certain condition is fulfilled
# indentation after the if-condition is required (convention is to indent with 4 spaces)

In [None]:
# we can also include additional conditions
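
A small sketch of an if-statement with additional conditions (`elif`, `else`). The age thresholds are arbitrary and only serve as an example.

```python
age = 35

if age < 30:
    group = "under 30"
elif age < 50:
    # only checked if the first condition was False
    group = "30 to 49"
else:
    # executed if none of the conditions above was True
    group = "50 and over"

print(group)
```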

## 2. Python Lists

- Standard mutable multi-element container in Python
- Denoted by square brackets [ ]

In [None]:
# Python lists can contain integers...



# ...strings



# ... or a combination of different data types.



In [None]:
# one can also access the different elements within a list by their index
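
The list cells above might be filled in roughly like this; the list contents are invented for illustration.

```python
ages = [25, 31, 47]                # integers
names = ["Ann", "Bob", "Cleo"]     # strings
mixed = [25, "Ann", 1.5, True]     # a combination of data types

print(ages[0])      # first element (indexing starts at 0)
print(names[-1])    # last element
print(mixed[1:3])   # slice: elements at positions 1 and 2

# Lists are mutable: elements can be changed or appended
ages[0] = 26
ages.append(52)
print(ages)
```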


## 3. Loops

- We can loop over values in a list
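
A minimal sketch of a `for` loop over a list, using made-up ages:

```python
ages = [25, 31, 47]

# Visit each element in turn and accumulate the sum
total = 0
for age in ages:
    total += age
    print(age)

mean_age = total / len(ages)
print(mean_age)

# enumerate() additionally provides the position of each element
for position, age in enumerate(ages):
    print(position, age)
```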

## 4. Functions 

- Python is also suitable for writing functions
- Very good for operations that are done repeatedly, but have no built-in functions
- However, whenever there are built-in functions always use those; they are usually computationally more efficient
- We will give you a short idea of what a function is and what it looks like, but writing functions is not the focus of the course
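
To give an idea of what a function looks like, here is a small sketch. The function `share` is a hypothetical example and not part of the course material.

```python
def share(part, total):
    """Return part as a percentage of total."""
    return 100 * part / total

print(share(25, 200))

# Where a built-in exists, prefer it over writing your own loop
print(sum([1, 2, 3]))
```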

## 5. Libraries 
### 5.1 NumPy Library 

*Provides numeric vector and matrix operations*

- NumPy's "ndarray" is another of the basic formats data can be stored in
- Similar to Python's built-in lists (see 2.), but lacks their multi-type flexibility, i.e. an ndarray can only contain one data type
- However, in contrast to lists, ndarrays are more efficient in storing and manipulating data, which is important as data become bigger
- Building blocks for many other packages (see 5.2)

In [None]:
# Before we can use a package the first time, we need to import it (given we have it already installed)

In [None]:
# ndarrays

In [None]:
# we can also build matrices from ndarrays

In [None]:
# and then work with these arrays and matrices using numpy methods and functions
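
The NumPy cells above could be filled in along these lines. The numbers are arbitrary; `np` is the conventional alias for the package.

```python
import numpy as np

# One-dimensional ndarray
ages = np.array([25, 31, 47])

# Two-dimensional ndarray (matrix) built from nested lists
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(ages.dtype, ages.shape)
print(matrix.shape)          # (rows, columns)

# Operations are applied element-wise
print(ages + 1)
print(ages.mean())
print(matrix.sum(axis=0))    # column sums
print(matrix.T)              # transpose
```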

In [None]:
# as with lists you can access elements within an array in a similar fashion
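
A short sketch of indexing and slicing an array, again with arbitrary numbers:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(matrix[0, 1])         # element in row 0, column 1
print(matrix[:, 0])         # first column
print(matrix[1, 1:])        # row 1, from column 1 onwards
print(matrix[matrix > 3])   # Boolean mask: all elements greater than 3
```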

### 5.2 Pandas Library
*Provides the DataFrame, which is the building block for working with data*

- Newer package built on top of NumPy
- A Series is the Pandas equivalent to a NumPy array
    - However, Series (and DataFrames) have explicitly defined row (and column) labels attached, whereas NumPy arrays have implicitly defined simple integer indices
- A DataFrame is a 2-dimensional multi-array data structure, i.e. it consists of multiple Series
- DataFrames often contain heterogeneous types and/or missing values

In [None]:
# Again, we have to import the package first...

In [None]:
# A series is a one-dimensional array of indexed data

In [None]:
# However, indices don't need to be numerical, but could also be strings...
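
The Series cells above might look roughly as follows. The names and ages are invented for illustration.

```python
import pandas as pd

# A Series with the default integer index 0, 1, 2
ages = pd.Series([25, 31, 47])
print(ages.index)
print(ages[1])

# A Series with string labels as index
ages_by_name = pd.Series([25, 31, 47], index=["Ann", "Bob", "Cleo"])
print(ages_by_name["Bob"])
print(ages_by_name.values)   # the underlying NumPy array
```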

In [2]:
# Now, let's read in an actual dataset with several columns and numerous rows, and start working with DataFrames...

path_to_data = "https://raw.githubusercontent.com/lemepe/EEMP/master/python_intro/Employee_data.csv"
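
Reading the file would typically be done with `pd.read_csv(path_to_data)`. The sketch below uses a small in-memory CSV instead so that it runs without an internet connection. The rows, and all column names other than `Age`, are made up and need not match the actual dataset.

```python
import io
import pandas as pd

# In the notebook, with an internet connection, this would be:
#     employee_data = pd.read_csv(path_to_data)
# Hypothetical stand-in data for a self-contained example:
csv_text = """Age,Department,Left
41,Sales,Yes
49,Research,No
37,Research,Yes
33,Sales,No
27,HR,No
"""

employee_data = pd.read_csv(io.StringIO(csv_text))
print(employee_data.head())   # first rows of the DataFrame
```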

### Commands for exploratory data analysis (EDA)

In [None]:
# Shape of the dataframe in form of a tuple (#rows, #cols)

In [None]:
# Lists all column labels

In [None]:
# Overview of columns, non-null entries & datatypes

In [None]:
# Summary statistics of the dataset
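
The four EDA commands above can be sketched on a small stand-in DataFrame. The data and the column names `Department` and `Left` are hypothetical; in the notebook the commands would be applied to the DataFrame read from `path_to_data`.

```python
import pandas as pd

# Hypothetical stand-in for the employee dataset
employee_data = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "Department": ["Sales", "Research", "Research", "Sales", "HR"],
    "Left": ["Yes", "No", "Yes", "No", "No"],
})

print(employee_data.shape)        # tuple (#rows, #cols)
print(employee_data.columns)      # column labels
employee_data.info()              # columns, non-null entries, datatypes
print(employee_data.describe())   # summary statistics of numeric columns
```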

In [None]:
# To get an idea about the different values contained within a specific column, we can use the .unique() method
# since unique takes the values in order of appearance, we use the sorted function on top of it
# with [] and the respective column label, we can access this particular Series in the DataFrame

In [None]:
# If we want to know more about the distribution of certain values or categories, we can use value_counts()
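
A sketch of `.unique()` and `value_counts()` on a hypothetical stand-in DataFrame (made-up data; the column name `Department` is an assumption):

```python
import pandas as pd

employee_data = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "Department": ["Sales", "Research", "Research", "Sales", "HR"],
})

# Select one column (a Series) and list its distinct values in sorted order
departments = sorted(employee_data["Department"].unique())
print(departments)

# Absolute and relative frequencies of each category
counts = employee_data["Department"].value_counts()
print(counts)
print(employee_data["Department"].value_counts(normalize=True))
```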

In [None]:
# We can also slice the data by indices (similar to how we did it with lists or NumPy arrays)

In [None]:
# .. which could also be a string, as in the case of the columns

In [None]:
# select subset of data with a condition and assign it to new dataframe

# we can also subselect only certain columns to be shown

# ... or only select rows that fulfill a certain condition
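
The selection steps above could be sketched as follows. The data are made up, and the column names `Department` and `Left` are assumptions; `exit_data` is the name the plotting cells further below use for the subset of employees who left.

```python
import pandas as pd

employee_data = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "Department": ["Sales", "Research", "Research", "Sales", "HR"],
    "Left": ["Yes", "No", "Yes", "No", "No"],
})

# Slice rows by position (rows 0 and 1)
first_rows = employee_data.iloc[0:2]

# Select a single column by its label, or several columns with a list of labels
ages = employee_data["Age"]
subset = employee_data[["Age", "Department"]]

# Select rows with a condition and assign them to a new DataFrame
exit_data = employee_data[employee_data["Left"] == "Yes"]

# Combine conditions with & (each condition in parentheses)
older_sales = employee_data[(employee_data["Age"] > 35)
                            & (employee_data["Department"] == "Sales")]
print(exit_data)
print(older_sales)
```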


### Descriptive statistics

- Overview of pandas aggregation methods:
<div>
<img src="attachment:grafik.png" width="450"/>
</div>

In [None]:
# Distribution of exits across departments

# Mean age of exits

# Mean age of exits across departments


In [None]:
# Creating new variables


In [None]:
# calculate percentage of employees that left the company


In [None]:
# alternatively using value_counts()
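
One possible way to create a new variable and compute the percentage of employees who left, sketched on made-up data. The column names `Left` and `Exit` are assumptions.

```python
import pandas as pd

employee_data = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "Left": ["Yes", "No", "Yes", "No", "No"],
})

# New variable: 1 if the employee left, 0 otherwise
employee_data["Exit"] = (employee_data["Left"] == "Yes").astype(int)

# The mean of a 0/1 variable is the share of ones
pct_left = 100 * employee_data["Exit"].mean()
print(pct_left)

# Alternatively: relative frequencies via value_counts()
shares = employee_data["Left"].value_counts(normalize=True) * 100
print(shares)
```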


In [None]:
# with the groupby() function, we can split the dataset by categories and do calculations on these subgroups
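
A sketch of `groupby()` on a hypothetical stand-in DataFrame; the data and the column names `Department` and `Exit` are invented for illustration.

```python
import pandas as pd

employee_data = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "Department": ["Sales", "Research", "Research", "Sales", "HR"],
    "Exit": [1, 0, 1, 0, 0],
})

# Split by department, then compute the mean within each group
mean_age_by_dept = employee_data.groupby("Department")["Age"].mean()
exit_rate_by_dept = employee_data.groupby("Department")["Exit"].mean()
print(mean_age_by_dept)
print(exit_rate_by_dept)

# Several aggregations at once
summary = employee_data.groupby("Department")["Age"].agg(["count", "mean", "min", "max"])
print(summary)
```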


### 5.3 Visualization Libraries
*Provide plotting and visualization support*
#### Matplotlib Library

- Original visualization library built on NumPy arrays
- Conceived in 2002 to enable MATLAB-style plotting


- We will only provide a quick overview, for more information see matplotlib documentation
- https://matplotlib.org/3.1.1/gallery/index.html

In [None]:
import matplotlib.pyplot as plt

In [None]:
# Here we define the plotstyle to be used
# check https://matplotlib.org/3.1.1/gallery/style_sheets/style_sheets_reference.html for an overview of style sheets


In [None]:
# There exist several "magic functions" in jupyter notebook which allow you additional operations
%lsmagic

In [None]:
# ... one we will need is "%matplotlib inline" which enables matplotlib plots to be displayed directly in the notebook


In [None]:
# Histogram with frequencies
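
A frequency histogram might be drawn as sketched below. The ages are made up; in the notebook one would presumably pass the `Age` column of the employee DataFrame instead.

```python
import matplotlib.pyplot as plt

# Hypothetical ages, for illustration only
ages = [27, 33, 37, 41, 49, 29, 35, 38, 44, 52]

# plt.hist returns the counts per bin and the bin edges
counts, bin_edges, _ = plt.hist(ages, bins=5, alpha=0.5)
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Age distribution (hypothetical data)")
```

Without `density=True`, the bar heights are raw counts, so they sum to the number of observations.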


In [None]:
# Histogram with probability density
plt.hist(employee_data['Age'],density=True, alpha=0.5)
plt.hist(exit_data['Age'],density=True, alpha=0.5)
plt.xlabel('Age')

In [None]:
# We can also combine multiple plots with plt.subplot(#rows,#cols,i)

plt.subplot(1,2,1)
plt.hist(employee_data['Age'],density=True, color = 'red', alpha=0.5)
plt.xlabel('Age (all)')
plt.xlim(18,60)
plt.ylim(0,0.06)

plt.subplot(1,2,2)
plt.hist(exit_data['Age'],density=True, color = 'blue', alpha=0.5)
plt.xlabel('Age (exits)')
plt.xlim(18,60)
plt.ylim(0,0.06)

In [None]:
# Scatter plot example

#### Seaborn Library

- Newer library with more visually appealing and simpler to use toolkit
- Better suited for visualizing DataFrame structures

- Again, we only provide a quick overview, see documentation for more details
- https://seaborn.pydata.org/examples/index.html

In [None]:
# import the package
import seaborn as sns

In [None]:
# Distribution plot with estimated PDF

In [None]:
# Overlaid distribution plot for different groups

In [None]:
# Seaborn Scatterplot with marker for different groups


In [None]:
# Scatterplot with estimated regression line

In [None]:
# Seaborn barplot

### 5.4 Statsmodels Library
*Provides many different statistical models, statistical tests, and statistical data exploration*

- https://www.statsmodels.org/stable/index.html

In [None]:
# import the statsmodels libraries

In [None]:
# simple OLS regression with one explanatory variable
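
A simple OLS regression could be sketched with the statsmodels formula interface as below. The data and the variable names `tenure` and `salary` are purely hypothetical, and the formula interface is only one of several ways to specify a model in statsmodels.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data, for illustration only
data = pd.DataFrame({
    "tenure": [1, 2, 3, 4, 5],
    "salary": [5.1, 7.9, 11.2, 13.8, 17.0],
})

# "salary ~ tenure" regresses salary on tenure; an intercept is added automatically
model = smf.ols("salary ~ tenure", data=data).fit()

print(model.params)     # estimated intercept and slope
print(model.rsquared)   # share of variance explained
```

Further explanatory variables can be added to the formula with `+`, e.g. `"salary ~ tenure + age"` if the data contained an `age` column.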

In [None]:
# OLS regression with several explanatory variables

# 6. References and Further Readings

- VanderPlas, Jake (2016): Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media.