# Session 2A: Python and Social Data Science

*Joachim Kahr Rasmussen*



## Agenda for Session 2A

1. Module Structure
2. Data Science and Beyond
3. Python - an Overview
4. The Python Language

## Associated Readings

BBB, chapter 1:
- Social science 'vs' data science
- The Digital Age: Availability of data and computer power
- Intro to ethics

Grimmer (2015):
- Descriptive vs causal inference
- Importance of research design
- New data: Magnitude and type

Inspirational readings:
- "What is code?": Maybe save it for a rainy day.
- "The end of theory: The data deluge makes the scientific method obsolete": Polemic post about data and models.

# Module Structure

## The Structure of Classes

Most teaching modules will have the following structure

- Before lectures and exercise classes: 
    - Do the **reading** $-$ we have some quality textbooks!
    - Watch **previously recorded lectures** $-$ it takes more than one viewing for the material to sink in

- Lectures: 
    - Introduces **new material**, but we are trying something slightly different this year
    - The **format will vary** for each module: Generally...
        - **Live** recap and introduction
        - **Questions** from last time
        - Technical content is **recorded** for you to...
            - watch at **preferred pace**, and
            - potentially ***while* solving exercises**
    - When watching videos, you can ALWAYS **ask questions**.

- Exercise class:
    - Continue working on **exercises**
    - **Discuss with TA** $-$ use the fact that they are there with the sole purpose of helping you learn

## Small Groups

If there are only 1 or 2 people in your group, come see me in the break.


## The Academic Quarter

In case you don't know...
- 9 means 9.15, 
- 13 means 13.15 (i.e. 1.15pm)

This holds for both exercise classes and lectures!

## Learning Outcomes After Completing Intro SDS 

Main elements in this course:
- Tidy / transform: Data structuring and text (sessions 2-3, 15)
- Visualize: Plotting (session 4)
- Import: Scraping and data IO (sessions 5, 6-8)
- Ethics: Rules and moral considerations for working with data  (session 9)
- Model: Fundamentals of machine learning, application to text (session 10-14)

Also a tiny bit on...
- Communication and access: Git and Markdown (should have been covered briefly by Andreas earlier today)

## What We Don't Teach You Now

Many more courses build on this one:
- Statistics: Econometrics and Machine Learning - overlaps
- More about data - modelling and processing:
    - Text, networks/relational, spatial
- Non-linear ML models:
    - Tree based and kernel 
    - Neural networks
- ML for dynamic decisions: reinforcement learning
- All about privacy

In particular, check out [Advanced SDS I](https://kurser.ku.dk/course/asdk20004u/2020-2021) and [Advanced SDS II](https://kurser.ku.dk/course/asdk20006u/2020-2021). Also sometimes possible to write a project (BA, MSc, [seminar](https://kurser.ku.dk/course/a%c3%98kk08411u/2020-2021)).

# Data Science and Beyond

## Why Data Science Now?

Three trends are important for understanding the increasing interest in and influence of data science:

- **Data** is increasingly available, e.g.
    - *Social Media*: Facebook status updates, public 'tweeting', Instagram pictures, etc.
    - *Shopping*: Online shopping (track everything, make experiments), in store shopping (memberships, less granular data).
    - *Phones*: Patterns in phone calls, GPS logging, phone activity (see recent SODAS studies), etc.
- Faster and bigger **computers** ([Waldrop, 2016](https://pubmed.ncbi.nlm.nih.gov/26863965/)):
    - Moore’s law $\sim$ the number of transistors on a microprocessor chip doubles every two years 
    - Soon coming to a halt due to heat issues...
- Improved **algorithms**, methods for computation and accessibility of tools
    - Example: Development of effective and accessible libraries in python
    - Much more about this later in the course...



## Examples of Major Advances

Innovations from data science already create enormous amounts of value and make our lives much easier. Some examples:
- Autonomous systems: 
    - self-driving cars
    - computer game bots, 
    - trading bots, etc.
- Image and text recognition: 
    - face recognition (Facebook, police)
    - language parsing with e.g. Google Translate/GPT-3 (Grammarly, auto-correct)
- Combined services: 
    - virtual assistants (banks, medical diagnosis)
    - recommendation systems (Amazon, Spotify, Netflix)
    - classification systems (spam, tax fraud, plagiarism)

[McKinsey (2018)](https://www.mckinsey.com/~/media/McKinsey/Featured%20Insights/Artificial%20Intelligence/Notes%20from%20the%20frontier%20Modeling%20the%20impact%20of%20AI%20on%20the%20world%20economy/MGI-Notes-from-the-AI-frontier-Modeling-the-impact-of-AI-on-the-world-economy-September-2018.ashx#:~:text=In%20the%20aggregate%2C%20and%20netting,about%201.2%20percent%20a%20year.&text=The%20economic%20impact%20may%20emerge%20gradually%20and%20be%20visible%20only%20over%20time.): 'Artificial Intelligence' has the potential to increase global GDP by 1.2 percent per year

## Past the Peak? Illustrative example
For a couple of years: Data scientists had HIGHEST entry wages in DK.

More recent evidence: Not top, but still high...

*Why did the mean relative data scientist entry wage decline?*

Paradoxically, this is where *you* as social scientists come into the picture!

Key issue: Prediction based agenda is flawed. But opens up new opportunities:

- Combine with theory: Supply and demand $-$ did supply catch up? Selection?
- Combine with causal inference: Instruments, regression discontinuity, matching...

Clearly a role for econometrics and structural modelling!

## Social Data Science (I/II)

Social data scientists combine skills and tools from two different fields:
- Data scientists in a nutshell (BBB):
    - Developing algorithms that are fast and flexible
    - Largely concerned with prediction $-$ not causal inference
- Social scientist in a nutshell (BBB):
    - Study human behavior and interactions
    - Causal inference is important for understanding implications of e.g. policies


## Social Data Science (II/II)

We may say that the skills and ideas of data science are spreading to social data science:
- smart, free tools for working with
    - small and big structured (tabular) data
    - unstructured data sources from image, text and social media
- incorporate machine learning into
    - statistics and causal inference
    - economic modelling

In particular, data science complements social sciences by:
- enhancing existing fields, and
- opening up for new fields to emerge (new data, combination of methods)

# Python - an Overview

## Introducing Python

*What is Python useful for?*

* It can do "anything" and is [used everywhere](https://www.python.org/about/success/)
    * High-tech manufacturing
    * Space shuttles 
    * Large servers     
* Python has incredible resources for machine learning, big data, visualizations.

## Use is Evidently Trending

<center><img src='https://grapecitycontentcdn.azureedge.net/blogs//grapecity/20181026-the-growth-of-major-programming-languages/3.jpg' alt="Drawing" style="width: 900px;"/></center>

## What and Why?

*What is Python?*

A multiparadigm, general purpose programming language.
  * Can do everything you can imagine a computer can do.
  * E.g. manage databases, advanced computation, web etc.
 

*Why Python?*

Python's main objective is to make programming more ***effortless***. 
- This is done by making syntax intuitive.
- A side effect: programming can be fun
- Downside: not the fastest (solved with packages)

## Other Programs and Languages (I/II)

*Is Python the most popular for statistics and data science?*

There are other good languages, e.g. R, Stata or SAS, why not use them?

- Python has the best data science packages.
- And it is also being used increasingly in statistics.

Each program/language has its own advantages and disadvantages (our opinion):

|                     | Python | R | Julia |
|---------------------|--------|---|-------|
| Data structuring    | X      | X | |
| Plotting            | X      | X | |
| Machine learning    | X      |   | |
| Statistics          |        | X | |
| General programming and modelling | X      |   | (X)|
| Ease of learning    | X      |   | |

Other ('statistical') programming languages: SAS, Stata, ...


## Other Programs and Languages (II/II)

Tools are increasingly integrated
- Jupyter; a shared framework for data science 
- New software allows direct execution side-by-side: use R within Python (and vice versa)
- New tools becoming available across languages, e.g. data processing engine (arrow)

Advice: don't worry. It's likely you need to learn more than one language.


## The Wheel of Data Science

*How does data science work?*

<br>
<br>
<center><img src='https://raw.githubusercontent.com/hadley/r4ds/master/diagrams/data-science.png' alt="Drawing" style="width: 700px;"/></center>

## Learning How to Code... Not a Free Lunch

This course... it ain't easy!

<center><img src='https://media.giphy.com/media/h36vh423PiV9K/giphy.gif' alt="Drawing" style="width: 300px;"/></center>

But at least you get a lot of help!

Learning without supervision, you may struggle with simple stuff...

## Some Encouragement

#### Hadley Wickham

> The bad news is that when ever you learn a new skill **you’re going to suck**. It’s going to be **frustrating**. The good news is that is typical and **happens to everyone** and it is **only temporary**. You can’t go from knowing nothing to becoming an expert without going through a **period of great frustration** and great suckiness.


#### Kosuke Imai

> One can learn data analysis **only by doing**, not by reading.

## Light at the End..

Why would you go through this pain? You choose one of two paths after this course...

i. You move on, you forget some or most of the material.

ii. You are lit and your life has changed. 
- You may return to become a better sociologist, anthropologist, economist etc.
- Or, you may continue along the new track of data science.
- In any case, you keep learning and expanding your programming skills.

## Advice for Coding

Three pieces of advice that will take you far!

1. Be careful: Think before you code $-$ what are you trying to make it do?

2. Be lazy: Reuse code and write reusable code (e.g. functions)

3. Make understandable: Think about audience!
    1. Future you? May not recall this at all.    
    1. Group members or world? May not understand!
    1. Write lots of comments and potentially background explanation/documentation
    


*How do you get there?*

- Maintain healthy curiosity  $-$  how could we do things better?
- Practice and try as much as possible
- Type the code in yourself  $-$ then you see what is going on.


## Help and Advice


Whenever you have a question you do as follows:


1: You ask other people in your group.


2: You search on Google (more advice will follow).

3: You ask the neighboring groups.

4: You raise an [issue in our Github repo](https://github.com/abjer/sds/issues) or you ask us.

# The Python Language


## The Python Shell

The fundamental way of accessing Python is from your shell by typing *`python`*. On a Windows computer, this would simply be the command prompt.

Everyone should be able to run the following commands and reproduce the output.

``` python
>>> print ('hello my friend')
hello my friend
```

``` python
>>> 4*5
20
```

You can leave Python again by simply typing *`quit()`*.

If you also want to close the prompt itself, type *`exit`* once you are back in the shell.

## The Python Script (I/III)

The power of the interpreter is that it can be used to execute Python scripts. 

*What is a script?* 

These are programs containing code blocks.

## The Python Script (II/III)

Everyone should be able to make a text file called *`test.py`* on the desktop or in some folder with some content and run it.

The file should contain the following two lines:


``` python
print ('Line 1')
print ('Line 2')
```

I saved mine on the desktop.

Reopen the prompt and type something equivalent to...

```
cd "C:\Users\xtw562\Desktop"
python test.py
```

This should now yield the following output:

```
Line 1
Line 2
```

## The Python Script (III/III)

Now, everyone should be able to make a text file called *`test.py`* in their current folder with some content and run it. Current folder?

``` python
>>> import os
>>> print("Current working directory: {0}".format(os.getcwd()))
Current working directory: C:\Users\xtw562
```

Now specify your working directory...

``` python
>>> os.chdir('C:/Users/xtw562/Desktop')
>>> print("Current working directory: {0}".format(os.getcwd()))
Current working directory: C:\Users\xtw562\Desktop
```

Now that we have chosen the working directory, save *`test.py`* there; the file should contain the same two lines as before.

Try executing the test file from the shell by typing:

*`python test.py`*

This should yield the following output:

```
Line 1
Line 2
```

## The Jupyter framework (I/III)

*What is Jupyter Notebook?*

- Jupyter provides an interactive and visual platform for working with data. 
- The name refers to the languages Julia, Python, and R.

*Why Jupyter notebook?*

- great for writing 
    - markdown, equations and direct visual output
- interactive: data stays in memory, so you can keep working with it and changing it
- many tools (e.g. creating this slideshow)

*How do we create a Jupyter Notebook?*

We start Jupyter Notebook by typing *`jupyter notebook`* in the shell.

Try making a new notebook: 
- click the button *`New`* in the upper right corner 
- click on *`Python 3`*.

## The Jupyter framework (II/III)


*How do we interact with Jupyter?*

Jupyter works by having cells in which you can put code. The active cell has a colored bar next to it.

A cell is in *`edit mode`* when there is a <span style="color:green">*green*</span> bar to the left. To activate *`edit mode`*, click on a cell.

A cell is in *`command mode`* when the bar to the left is <span style="color:blue">*blue*</span>.

*How do we add and execute code?*

Go into edit mode - add the following:

In [1]:
A = 11
B = 25

A*B 

275

Click the &#9658; to run the code in the cell. What happens if we change A*B to A+B?

## The Jupyter framework (III/III)


*How can we add cells to our notebook?*

Try creating a new cell by clicking the **`+`** symbol.

*Some relevant keyboard short cuts?*

Editing and executing cells
- enter edit mode: click inside the cell or press `ENTER`
- exit edit mode: click outside cell or press `ESC`.
- executing code within a cell is `SHIFT`+`ENTER` or `CTRL`+`ENTER` (not the same!)


Adding cell (`a` above, `b` below) and removing cells (press `d` twice)  

More info:
- For tips, [see this blog post](https://abjer.github.io/sds2019/post/jupyter) or see the list of Jupyter keyboard shortcuts in the menu (top): `Help > Keyboard Shortcuts`.
- General resources in documentation and tutorial available [here](http://jupyter.readthedocs.io/en/latest/).

## Before Looking Into More Advanced Concepts...

... we begin with a quiz!

### Go to [kahoot.it](https://kahoot.it)

## Fundamental data types (I/II)

Recall the four fundamental data types: `int`, `float`, `str` and `bool`.
- Sometimes known as elementary, primitive or basic.

Some data types we can change between, e.g. between `float` and `int`.

In [2]:
int(1.6) # integer conversion cuts off the decimals, i.e. truncates toward zero

1

In [3]:
float(int(1.6)) # it does not retake its former value

1.0

We can similarly convert `float` and `int` to `str`. Note that some conversions are not allowed.
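A minimal sketch of these conversions, including one that is not allowed. The variable names are purely illustrative.

```python
# Converting numbers to strings always works
as_text = str(1.6)
from_int = str(3)

# Converting strings to numbers works only if the text looks like a number
number = float('2.5')
whole = int('12')

# Some conversions are not allowed and raise an error
try:
    int('abc')
except ValueError as err:
    print('Not allowed:', err)
```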

## Fundamental data types (II/II)

*What is an object in Python?*

- A thing, anything - everything is an object.

*Why use objects?*

- Easily manipulable, powerful methods and flexible attributes. 
- We can make complex objects, e.g. estimation methods, quite easily.
- Example of a float method:

In [4]:
(1.5).as_integer_ratio()

(3, 2)
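To make the distinction between methods and attributes concrete, here is a small sketch using only built-in objects; the variable names are illustrative.

```python
x = 1.5
print(type(x))            # every object has a type (its class)
print(x.is_integer())     # a method: a function attached to the object
print(x.real)             # an attribute: data attached to the object

name = 'social data science'
print(name.upper())       # strings come with their own methods
```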

## Debugging (I/III)

*Code fails all the time!*


In [5]:
A='I am a string'
int(A)
print(A)

ValueError: invalid literal for int() with base 10: 'I am a string'

## Debugging (II/III)

*How do you fix code errors?*


Look at the error message:
1. **Where** is the error? I.e. what line number (and which function).
  - Inspect the elements from the failure before the error occurs. 
     - Note: if you use a function you may want to try printing elements
  - Try replacing the objects in the line.
  
2. **What** goes wrong? Examples:
  - `SyntaxError`: the code is not valid Python, e.g. a typo or a missing parenthesis; `ValueError`: right data type, but an inappropriate value.
  - Hint: reread it several times, search on Google if you do not understand the error.
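The steps above can be sketched as follows. This is one possible approach, not the only one: it prints the input just before the failing line and uses `try`/`except` (not covered above) to keep going and read the message. The function `to_number` and the input list are hypothetical.

```python
def to_number(text):
    # Printing the input right before the failing line shows *where* things go wrong
    print('About to convert:', repr(text))
    return int(text)

values = ['12', '7', 'id']     # hypothetical input; the last entry is not a number
converted = []
for v in values:
    try:
        converted.append(to_number(v))
    except ValueError as err:
        # The error message tells us *what* went wrong
        print('Conversion failed:', err)

print(converted)
```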


## Debugging (III/III)

*Exercise: investigate the error we incurred*


* Look at the answers in this stackoverflow post: [https://stackoverflow.com/questions/8420143](https://stackoverflow.com/questions/8420143).
* An explanation by Blender:
> Somewhere in your text file, a line has the word `id` in it, which can't really be converted to a number.


## Operators

*What computations can python do?*

- Numeric operators: Output a numeric value from numeric input. 
    - `+`; `*`; `-`; `/`.
- Comparison operators: Output a boolean value, `True` or `False`
    - `==`; `!=` (equal, not equal - input from most object types)
    - `>`; `<`. (greater, smaller - input from numeric)
- Logical operators: Output a boolean value from boolean input.
    - `and` / `&`; `or` / `|`; `not` (for element-wise negation in numpy/pandas: `~`)



*How can we test an expression in Python?*

We can check the validity of a statement using comparison operations:

In [6]:
3 == (2 + 1) # other ops: >, !=, >=

True

And apply logical operations:

In [15]:
True & False # When one part is false, logically the whole statement is false.

False
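A short sketch that runs through the three operator groups in one go; the numbers and variable names are arbitrary.

```python
a, b = 7, 3

# Numeric operators return numbers
print(a + b, a - b, a * b, a / b)

# Comparison operators return booleans
is_equal = (a == b)
is_greater = (a > b)

# Logical operators combine booleans
both = is_equal and is_greater
either = is_equal or is_greater
negated = not is_equal
print(both, either, negated)
```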

## Control flow (I/II)

*How can we activate code based on data?*


Conditional execution of code: if a condition is true, the code is activated.

In Python this is easy with the `if` statement:

```
if condition:  
    (CODE BLOCK LINE 1)
    (CODE BLOCK LINE 2)
    ...
```

The condition is either a variable or an expression. If it evaluates to `True`, the code block is executed.

## Control flow (II/II)

We can use comparison and logical operators directly as they output boolean values.

In [16]:
if 4 == 4:  
    print ("I'm being executed, yay!")
else:      
    print ("Oh no, I'm not being executed!")

I'm being executed, yay!


We can make deep control flow structures:

In [17]:
A = 11
if  A>=0:      
    if A==0:
        print ("I'm exactly zero!")
        
    elif A<10:
        print ("I'm small but positive!")
        
    else:
        print ("I'm large and positive!")
else:      
    print ("Oh shoot, I'm negative!")

I'm large and positive!


## Containers

*How do we store multiple objects?*

- We put objects into containers. (Like a bag)
- An example is a `list` where we can add and remove objects

*What are they useful for?*

- We can use them to compute statistics (max, min, mean)
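As an illustration, simple statistics on a list can be computed with built-in functions alone. The data below is made up.

```python
ages = [23, 35, 41, 29]            # hypothetical data

oldest = max(ages)                 # largest element
youngest = min(ages)               # smallest element
mean_age = sum(ages) / len(ages)   # the mean is the sum divided by the count

print(oldest, youngest, mean_age)
```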

## Sequential containers

*Which data types are ordered?*

- Sequential containers are ordered from first to last. 
- Their elements can be accessed by integer position/order.
    - Done with square bracket syntax `[]` 
    - Note **first element is 0, and last is n-1!**
    - One exception is iterators (`iter`): they are fast, but their elements cannot be accessed by position.

*Which containers are sequential?*

- `list` which we can modify (**mutable**).
    - useful to collect data on the go
- `tuple` which cannot be modified after initial assignment (**immutable**)
     - tuples are faster as they can do fewer things
- `array` 
    - which is mutable in content (i.e. we can change elements)
    - but immutable in size
    - great for data analysis
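A small sketch of positional access and of the difference between mutable and immutable containers. It is restricted to `list` and `tuple` to keep it free of imports; the names are illustrative.

```python
my_list = [10, 20, 30]
my_tuple = (10, 20, 30)

# Positions start at 0, so the last of n elements sits at position n-1 (or -1)
first, last = my_list[0], my_list[-1]

my_list.append(40)      # lists can grow ...
my_list[0] = 99         # ... and their elements can be replaced

try:
    my_tuple[0] = 99    # tuples cannot be changed after creation
except TypeError as err:
    print('Tuples are immutable:', err)
```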


## Lists

A list can be modified (mutated) by methods, e.g.
- We can `append` objects to it and `remove` them again.
- We can use operations like `+` and `*`.


In [18]:
list_1 = ['A', 'B']
list_2 = ['C', 'D']
list_1 + list_2  

['A', 'B', 'C', 'D']
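The methods `append` and `remove` and the `*` operator can be tried out in the same way; a minimal sketch with an illustrative list:

```python
letters = ['A', 'B']
letters.append('C')        # add an element at the end
letters.remove('A')        # remove the first occurrence of 'A'
print(letters)

repeated = letters * 2     # repeat the list twice
print(repeated)
```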

## Non-sequential types

*Are there any non-sequential containers?*
- A dictionary (`dict`), whose elements are accessed by keys (immutable objects).
    - Focus of tomorrow.
- A `set` where elements are
    - unique (no duplicates) 
    - not ordered
    - disadvantage: cannot access specific elements by position!
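A brief sketch of both container types. The city figures are made-up placeholders, not real data.

```python
# A dict maps keys to values
population = {'Copenhagen': 0.6, 'Aarhus': 0.3}   # hypothetical numbers
population['Odense'] = 0.2                        # add a new key
print(population['Aarhus'])                       # look up by key

# A set keeps only unique elements
answers = {'yes', 'no', 'yes', 'yes'}
n_unique = len(answers)
print(n_unique)
print('no' in answers)     # membership tests work, positions do not
```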


## For loops

*Why are containers so powerful?*



We can iterate over elements in a container -  this creates a *finite* loop, called the `for` loop. 

Example - try the following code:

In [19]:
A = []
for i in range(4):
    i_squared = i**2
    A.append(i_squared)

for a in A:
    print(a)

0
1
4
9


For loops are smart when: iterating over files in a directory; iterating over specific set of columns.

How does Python know where the code inside the loop begins and ends?
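Python uses indentation to decide which lines belong to the loop. A minimal sketch with illustrative names:

```python
total = 0
for i in range(3):
    total += i            # indented: runs in every iteration
    print('inside', i)    # indented: also part of the loop
print('after', total)     # not indented: runs once, after the loop
```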

## The one line loop

*What is the fastest way to write a loop?*

Using a list comprehension (also works for other containers):

In [20]:
A = [i**2 for i in range(4)]
print(A)

[0, 1, 4, 9]



## While loops

*Can we make a loop without specifying the end?*

Yes, this is called a `while` loop. Example - try the following code:


In [21]:
i = 0
L = []
while (i<3):
    L.append(i*2)
    i += 1
print(L)

[0, 2, 4]


Applications
- Can be applied in scraping, model which converges, etc.
- Make server process that keeps running
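As a sketch of the convergence use case, the loop below keeps updating a guess until it stops changing. The example (approximating a square root with repeated averaging) and the tolerance are assumptions chosen for illustration.

```python
target = 2.0
guess = 1.0
change = 1.0
steps = 0

# We do not know in advance how many iterations are needed
while change > 1e-10:
    new_guess = (guess + target / guess) / 2   # average of guess and target/guess
    change = abs(new_guess - guess)            # how much did the guess move?
    guess = new_guess
    steps += 1

print(steps, guess)
```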

## Making Code Reusable


### Functions
Be careful of how you define input/outputs and how objects are created. In particular:

- Globals: Objects defined outside function
    - Note: These are available both locally and globally (although, they can be overwritten)
- Locals: Objects that are created within a sub-level. 
    - Example: objects defined inside a function
    - Cannot use outside, unless we use `return`
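A minimal sketch of the difference between globals and locals. The names `rate`, `after_tax` and `tax` are hypothetical.

```python
rate = 0.25                      # global: defined outside the function

def after_tax(income):
    tax = income * rate          # local: exists only inside the function
    return income - tax          # return hands the result back to the caller

net = after_tax(100)
print(net)

try:
    print(tax)                   # the local name is not known out here
except NameError as err:
    print('Not available outside the function:', err)
```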

### Checklist for Functions
Some key questions that you should often ask yourself:
- Should I write a function?
    - If you are repeating some process: Yes!
    - Makes code easier to write (better overview, fewer mistakes) and read
- If yes:
    - What do you intend the function to output? Is the correct output returned?
    - Do you use locals where possible? 
    - Are globals assigned before you define the function in notebook/script?
        - (if not, your function may fail! Reason: some used globals are not defined)


### Classes 
Where do objects come from?

- From a collection of pre-defined attributes and methods $-$ this is called a `class`.
- We can make our own. It is complex, but powerful.
- Examples of classes/objects we will see: 
    - Dataframes, scraping tools, machine learning models etc.
    - (everything is an object)
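To give a feel for what a class looks like, here is a small sketch. The class `Student` and its data are invented for illustration and are not used elsewhere in the course material.

```python
class Student:
    def __init__(self, name, grades):
        self.name = name          # attribute
        self.grades = grades      # attribute

    def mean_grade(self):         # method
        return sum(self.grades) / len(self.grades)

anna = Student('Anna', [7, 10, 12])   # create an object from the class
print(anna.name, anna.mean_grade())
```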

## Copy vs. View

Important: when writing `A = B`, `A` is only a reference to the same object as `B`!

- In other words: `A` is a **view** of `B`
- Implication 
    - if `B` is mutable, e.g. a list or dataframe: changes to `B` show up in `A` and vice versa 

We can break this dependency by explicitly making a **copy**. 
- For instance, in pandas use the `copy` method: `A = B.copy()`
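The same behaviour can be seen with plain lists, which also have a `copy` method. A minimal sketch:

```python
B = [1, 2, 3]
A = B               # A and B are two names for the same list
B.append(4)
print(A)            # the change made through B shows up in A as well

C = B.copy()        # C is an independent copy
B.append(5)
print(C)            # unaffected by the later change to B
```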



## Coding That is Fast

### One-liners using comprehensions

A very compact way of writing code: *comprehension*. Example of list comprehension:
```python
new_list = [my_proc(a) for a in my_list] 
```

The new thing is that we define the loop inside the list! 


In [22]:
# Define function
def my_proc(a):
    return a**2

# Generate a list
my_list = []
for i in range(10):
    my_list.append(i+1)

# Use line from before to generate transformation 
new_list = [my_proc(a) for a in my_list]

# Print lists
print('my_list:', my_list)
print('new_list:', new_list)

my_list: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
new_list: [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]


### The Need for Speed
Python is elegant and simple

However, Python is NOT built for speed. 

We can compensate by using smart packages that have fast algorithms, e.g. numpy and pandas


### Vectorization

An alternative is to write our code in terms of numpy arrays.

In the example below, we generate a long list by making some transformation of an input... we get a speed-up of around 30 times by using numpy!

In [23]:
%%timeit 
[i+3 for i in range(10**7)]

821 ms ± 21.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [24]:
import numpy as np

In [25]:
%%timeit
np.arange(10**7)+3

24.8 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Take Away on Speed

Use pandas or numpy - they are optimized for speed! 

Why are pandas and numpy fast? 

- Because they are written in fast, low-level code with optimized algorithms! 

Why are we not learning low level? 
- Requires too much space for simple operations - not efficient!
- Too steep learning curve (not like Python!!)

# Session 2B: Data Structuring in Pandas I

*Joachim Kahr Rasmussen*

## Agenda for Session 2B

In this session, we will work with `pandas` and how to structure your data. In particular, we will cover:

- Live part:
    - Why We Structure Data
    - Overview of Numpy and Pandas
- Video part integrated with exercises:
    1. Welcome (Back to) Pandas
    - DataFrames and Series
    - Operations with Elementary Data Types
        - Boolean Operations
        - Numeric Operations and Methods
        - String Operations
    2. Readable Code and Method Chaining
    3. More Advanced Data Types
        - Categorical Data
        - Time Series Data

## Associated Readings

PDA, chapter 7:
- Handling missing data
- Data transformations: 
    - Duplicates 
    - Mapping
    - Replacing
    - Renaming axes
    - Binning
    - Filtering outliers
    - Dummies
- String manipulations

PDA, sections 11.1-11.2:
- Dates and time in Python
- Working with time series in pandas (time as index)

PDA, sections 12.1, 12.3:
- Working with categorical data in pandas
- Method chaining

PML, chapter 4, section 'Handling categorical data':
- Encoding class labels with `LabelEncoder`
- One-hot encoding

## Loading Stuff

In [1]:
# Loading packages
import numpy as np

# Why We Structure Data

## Motivation
*Why do we want to learn data structuring?*

- Data rarely comes in the form of our model. We need to 'wrangle' our data.
- Someone has to do this - and this person might very well be you
- Even as a data science manager, you need to know what is going on with the data structuring: This is the backbone of your analyses!

*Can our machine learning models not do this for us?* 

- Not yet :). The current version needs **tidy** data. What is tidy? 

One row per observation.

<center><img src='https://raw.githubusercontent.com/abjer/sds2017/master/slides/figures/tidy.png'></center>

# Numpy and Pandas

## Numpy Overview
*What is the [`numpy`](http://www.numpy.org/) module?*

`numpy` is a Python module similar to MATLAB 
- fast and versatile for manipulating arrays
- linear algebra tools available
- used in some machine learning and statistics packages

Example from earlier sessions

In [2]:
table = [[1,2],[3,4]]
arr = np.array(table)
arr

array([[1, 2],
       [3, 4]])

## Pandas motivation
*Why use Pandas?*

It is built on numpy:
- Simplicity: Pandas is built with Python's simplicity 
- Powerful and fast tools for manipulating data from numpy

Improves on numpy:
- Clarity, flexibility by using labels (keys)
- Introduces lots of new, useful tools for data analysis (more on this)

The future: interesting development combining tools for big and small data

Note: Much more similar to common software for data manipulation like, say, Stata

## Pandas popularity

<center><img src='https://www.sqlshack.com/wp-content/uploads/2020/08/pandas-in-python-popularity-from-stack-overflow.png' alt="Drawing" style="width: 800px;"/></center>



## Videos + Exercises

In the exercises, you are going to (i) learn more about data structuring in Pandas and (ii) work with Pandas yourself.

# VIDEO 2.1: DataFrames and Series

## Loading Stuff

In [1]:
# Loading packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns

## Pandas Data Types
*How do we work with data in Pandas?*

- We use two fundamental data structures: 
  - ``Series``, and
  - ``DataFrame``.

## Pandas Series (I/V)
*What is a `Series`?*

- A vector/list with labels for each entry. Example:

In [2]:
L = [1, 1.2, 'abc', True]

my_series = pd.Series(L)
my_series

0       1
1     1.2
2     abc
3    True
dtype: object

## Pandas Series (II/V)
*What are the components in a Series?* 

From before, we could see that a Series generally consists of three components:

- `index`: label for each observation

- `values`: observation data

- `dtype`: the format of the series (`object` means any data type is allowed)
  - examples are fundamental datatypes (`float`, `int`, `bool`)  
      - in terms of precision: `float`>`int`>`bool`
      - this comes at a cost in the form of speed
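The three components can be inspected directly on any Series. A minimal sketch, assuming `pandas` is installed; the data and labels are arbitrary.

```python
import pandas as pd

s = pd.Series([1.5, 2.0, 3.5], index=['a', 'b', 'c'])

print(s.index)     # the labels
print(s.values)    # the underlying data as a numpy array
print(s.dtype)     # the data type of the values
```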

## Pandas Series (III/V)
*How do we set custom index?* 

Indices need not have a sequential structure. To see this, consider the following example

In [3]:
num_data = range(0,3) # Generate data
indices = ['B', 'C', 'A'] # Generate index names

Now, combine to a series:

In [4]:
my_series2 = pd.Series(data=num_data, index=indices) # Create a pandas series from the two
my_series2

B    0
C    1
A    2
dtype: int64

## Pandas Series (IV/V)
*What data structure does the pandas series remind us of?*

A mix of Python list and dictionary. Consider the following simple transformation:

In [5]:
my_series.to_dict()

{0: 1, 1: 1.2, 2: 'abc', 3: True}

*Can we also convert a dictionary to a series?*

Yes, we just pass it to the Series class constructor. Example:

In [6]:
d = {'yesterday': 0, 'today': 1, 'tomorrow':3} # Create some dictionary
my_series3 = pd.Series(d) # Use the constructor
my_series3

yesterday    0
today        1
tomorrow     3
dtype: int64

## Pandas Series (V/V)
*How is the series different from a dict?*

An important distinction: Series indices need NOT be unique! Example:

In [7]:
s = pd.Series(range(3), index=['A','A', 'A']) # Create series with same indices
print(s.index.duplicated()) # Check duplicates
print()
print(s.to_dict()) # So translating to a dict gives...

[False  True  True]

{'A': 2}


Series are both key-based and position-based (i.e. sequential).
- Remember that unlike, say, lists, dictionaries are not sequential!
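A short sketch of the two ways of accessing an element, using the pandas accessors `.loc` (by label) and `.iloc` (by position). These accessors are not introduced above, and the Series is illustrative.

```python
import pandas as pd

s = pd.Series([0, 1, 2], index=['B', 'C', 'A'])

by_key = s.loc['A']        # label-based, like a dictionary
by_position = s.iloc[0]    # position-based, like a list

print(by_key, by_position)
```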

## Pandas Data Frames (I/IV)

*OK, so now we know what a series is. What is a `DataFrame` then?*

- A 2d-array (matrix) with labelled columns and rows (which are called indices). Example:

In [8]:
df = pd.DataFrame(data=[[1,2],[3,4]],
                  columns=['A', 'B'])
df

   A  B
0  1  2
1  3  4


## Pandas Data Frames (II/IV)

*How can we really think about this*

There are at least two simple ways of seeing the pandas DataFrame:
1. A numpy array with some additional stuff.
2. A set of series that have been merged horizontally
    - Note that columns can have different datatypes!

Most functions from `numpy` can be applied directly to Pandas. We can convert a DataFrame to a `numpy` array with the `values` attribute.

In [9]:
df.values

array([[1, 2],
       [3, 4]], dtype=int64)

*To note*: In Python we can describe it as a *list of lists* or sometimes a *dict of dicts*.

In [10]:
df.values.tolist()

[[1, 2], [3, 4]]

## Pandas Data Frames (III/IV)

*How can larger pandas dataframes be built?*

Similar to Series, DataFrames can be built from dictionaries.

An important difference: To create distinct columns with labelled rows, each value in the dictionary is itself a dictionary (lists or Series also work as values). Example:

In [11]:
djan = {'1st': 0, '2nd': 1, '3rd':3} # Create some dictionary for january
dfeb = {'1st': -3, '2nd': -1, '3rd':-2} # Create some dictionary for february
dmar = {'1st': 3, '2nd': 5, '3rd':4} # Create some dictionary for march

d = {'january': djan, 'february': dfeb, 'march': dmar} # Create dictionary of dictionaries
my_df1 = pd.DataFrame(d) # Use the constructor
my_df1

     january  february  march
1st        0        -3      3
2nd        1        -1      5
3rd        3        -2      4


## Pandas Data Frames (IV/IV)

*What happens if keys are not the same?*

No big deal...

In [12]:
djan = {'1st': 0, '2nd': 1, '3rd':3} # Create some dictionary for january
dfeb = {'1st': -3, '2nd': -1, '3rd':-2} # Create some dictionary for february
dmar = {'1st': 3, '2nd': 5, '4th':4} # Create some dictionary for march

d = {'january': djan, 'february': dfeb, 'march': dmar} # Create dictionary of dictionaries
my_df2 = pd.DataFrame(d) # Use the constructor
my_df2

|     | january | february | march |
|-----|---------|----------|-------|
| 1st | 0.0     | -3.0     | 3.0   |
| 2nd | 1.0     | -1.0     | 5.0   |
| 3rd | 3.0     | -2.0     | NaN   |
| 4th | NaN     | NaN      | 4.0   |
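
A short sketch of what happens behind the scenes, using a reduced, hypothetical version of the dictionaries above: keys that are missing in a column are filled with `NaN`, and since `NaN` is a float, the affected columns are stored as floats.

```python
import pandas as pd

# Reduced illustrative data: the two columns do not share all keys
d_demo = {'january': {'1st': 0, '2nd': 1, '3rd': 3},
          'march': {'1st': 3, '2nd': 5, '4th': 4}}
df_demo = pd.DataFrame(d_demo)

n_missing = df_demo.isna().sum()  # number of missing values per column

print(df_demo.dtypes)  # float64, because NaN is a float
print(n_missing)
```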


## Series vs DataFrames (I/II)
*How are Series related to DataFrames?*

Put simply: Every column is a Series. Example, access by key (recommended):

In [13]:
print(df['B'])

0    2
1    4
Name: B, dtype: int64


Another option is to access the column as an attribute of the object... smart, but dangerous! Sometimes it works...

In [14]:
print(df.B)

0    2
1    4
Name: B, dtype: int64


But sometimes it doesn't... To illustrate, add one more column:

In [15]:
df['count'] =  5
print(df)

   A  B  count
0  1  2      5
1  3  4      5


## Series vs DataFrames (II/II)
*But when wouldn't this work?*

Now print this and see!

In [17]:
print(df.count)

<bound method DataFrame.count of    A  B  count
0  1  2      5
1  3  4      5>



Clearly, the key-based option is more robust: columns that share their name with a DataFrame method, e.g. `count`, cannot be accessed as attributes.
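
A compact sketch of the difference, using a hypothetical DataFrame `df_demo` that mirrors the one above:

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 3], 'B': [2, 4], 'count': [5, 5]})

col_by_key = df_demo['count']  # always the column
attr = df_demo.count           # the DataFrame.count method, not the column

print(type(col_by_key).__name__)
print(callable(attr))
```

Key-based access returns the column regardless of its name, whereas attribute access silently returns the method.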

## Converting Data Types

The data type of a series can be converted with the **astype** method. Some examples:

In [18]:
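print(my_series3)
print()
print(my_series3.astype(float)) # np.float has been removed from NumPy; use the built-in float
print()
print(my_series3.astype(str)) # likewise, use the built-in str instead of np.str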

yesterday    0
today        1
tomorrow     3
dtype: int64

yesterday    0.0
today        1.0
tomorrow     3.0
dtype: float64

yesterday    0
today        1
tomorrow     3
dtype: object
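
Conversions also work in the other direction. The sketch below assumes a Series with the same content as `my_series3` (rebuilt here under the hypothetical name `s_demo`, since the original was defined earlier):

```python
import pandas as pd

s_demo = pd.Series({'yesterday': 0, 'today': 1, 'tomorrow': 3})

as_float = s_demo.astype(float)    # integers -> floats
as_str = s_demo.astype(str)        # integers -> strings
back_to_int = as_str.astype(int)   # strings of digits -> integers again

print(back_to_int)
```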


## Indices and Column Names
*Why don't we just use numpy arrays and matrices?*


- Inspection of data is quicker
    - What was it that column 18 represented?

- Keep track of rows after deletion
    - Again... What was it that row 18 represented!?

- Indices may contain fundamentally different data structures 
    - e.g. time series (more about this later)
    - Other datatypes (spatial data $\rightarrow$ advanced course)

- Facilitates complex operations (next session):
    - Merging datasets
    - Split-apply-combine (operations on subsets of data)
    - Method chaining (multiple operations in sequence)

## Viewing Series and Dataframes
*How can we view the contents in our dataset?*
- We can use `print` on our dataset
- We can visualize patterns by plotting

## The Head and Tail
*But what if we have a large data set with many rows?*

Let's load the 'titanic' data set that comes with the *seaborn* library:

In [19]:
import seaborn as sns
titanic = sns.load_dataset('titanic')

We now select the *first* 3 rows of the data set with the `head` method.

In [20]:
titanic.head(3)

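|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |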


The `tail` method selects the last observations in a DataFrame. 

## Row Selection (I/III)
*How can we select certain rows in a Series for given index **keys**?* 

With the `loc` attribute. Example:

In [21]:
print(titanic.loc[range(3),['survived', 'age', 'sex']])

   survived   age     sex
0         0  22.0    male
1         1  38.0  female
2         1  26.0  female


## Row selection (II/III)
*How can we select certain rows in a Series for given index **integers**?* 

The `iloc` attribute selects rows by integer position. 

In [22]:
print(titanic.iloc[10:15,:5])

    survived  pclass     sex   age  sibsp
10         1       3  female   4.0      1
11         1       1  female  58.0      0
12         0       3    male  20.0      0
13         0       3    male  39.0      1
14         0       3  female  14.0      0


Clearly, this is very similar to working with matrices in numpy! 

## Row selection (III/III)
*Do our tools for viewing specific rows, i.e. `loc` and `iloc`, work for DataFrames?* 

- Yes, we can use both `loc` and `iloc`. By default, they select rows in the same way as for Series.

In [23]:
my_idx = ['i', 'ii', 'iii']
my_cols = ['a','b']

my_data = np.arange(1,7) #my_data = [[1, 2], [3, 4], [5, 6]]
my_data = my_data.reshape(3,2)

my_df = pd.DataFrame(my_data, columns=my_cols, index=my_idx)

print(my_df)
print()
print(my_df.loc[['i','ii']])
print()
print(my_df.iloc[:2])

     a  b
i    1  2
ii   3  4
iii  5  6

    a  b
i   1  2
ii  3  4

    a  b
i   1  2
ii  3  4
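
One difference worth knowing when slicing: label slices with `loc` include the end point, while integer slices with `iloc` exclude it (as in plain Python). A minimal sketch with a hypothetical DataFrame `df_demo` built like `my_df`:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(1, 7).reshape(3, 2),
                       columns=['a', 'b'], index=['i', 'ii', 'iii'])

by_label = df_demo.loc['i':'ii']   # label slice: end point 'ii' is included
by_position = df_demo.iloc[0:2]    # integer slice: end point 2 is excluded

print(by_label.equals(by_position))
```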


## Columns Selection (I/II)
*How are `loc`, `iloc` different for DataFrames?* 

- For DataFrames, we can also specify columns.

In [24]:
idx_keep = ['i','ii']
cols_keep = ['a']
print(my_df.loc[idx_keep, cols_keep])

    a
i   1
ii  3


## Columns Selection (II/II)
*How can we generally select columns in a DataFrame?* 

- Option 1: using the `[]` and providing a list of columns.
- Option 2: using `loc` and setting row selection as `:`.

In [25]:
print(my_df.loc[:,['b']])

     b
i    2
ii   4
iii  6
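
Option 1 can be sketched as follows (again with a hypothetical `df_demo`). Note that a list of columns returns a DataFrame, while a single label returns a Series:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(1, 7).reshape(3, 2),
                       columns=['a', 'b'], index=['i', 'ii', 'iii'])

as_frame = df_demo[['b']]             # list of columns -> DataFrame
as_series = df_demo['b']              # single label -> Series
same_as_loc = df_demo.loc[:, ['b']]   # option 2 gives the same DataFrame

print(type(as_frame).__name__, type(as_series).__name__)
```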


## Selection quiz
*What does `:` do in `iloc` or `loc`?* 

Select all rows (columns).

## Modifying DataFrames
*Why do we want to modify DataFrames?*

- Because data rarely comes in the form we want it.


## Changing the Index (I/III)
*How can we change the index of a DataFrame?*

We change or set a DataFrame's index using its method `set_index`. Example:

In [26]:
print(my_df.set_index('a'))
print()
print(my_df)

   b
a   
1  2
3  4
5  6

     a  b
i    1  2
ii   3  4
iii  5  6


Note that setting a new index also implicitly discards the previous one.

Also, notice that the header *b* is printed one line above *a*: the extra line holds the name of the index.

## Changing the Index (II/III)
*Is our DataFrame changed? I.e. does it have a new index?*

No, we must overwrite it or make it into a new object:

In [27]:
print(my_df)
my_df_a = my_df.set_index('a')
print()
print(my_df_a)
print()
print(my_df_a.iloc[1,0])

     a  b
i    1  2
ii   3  4
iii  5  6

   b
a   
1  2
3  4
5  6

4


## Changing the index (III/III)

Sometimes we wish to remove the index. This is done with the `reset_index` method:

In [28]:
print(my_df_a.reset_index()) # default (drop=False): index 'a' becomes a column
print()
print(my_df_a.reset_index(drop=True)) # drop=True: index 'a' is discarded
print()
print(my_df)

   a  b
0  1  2
1  3  4
2  5  6

   b
0  2
1  4
2  6

     a  b
i    1  2
ii   3  4
iii  5  6


The original index (`i`, `ii`, `iii`) cannot be restored, since that information was lost when we called `set_index`. By default, `reset_index` turns the current index into a new column.

By specifying the keyword `drop=True`, the current index is discarded instead.

*To note:* Indices can have multiple levels, in this case `level` can be specified to delete a specific level.
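
A minimal sketch of a two-level index and the `level` keyword. The data and the names `df_demo`, `df_multi` and `df_partial` are hypothetical:

```python
import pandas as pd

df_demo = pd.DataFrame({'year': [2020, 2020, 2021],
                        'city': ['Aarhus', 'Odense', 'Aarhus'],
                        'value': [1, 2, 3]})
df_multi = df_demo.set_index(['year', 'city'])   # index with two levels

# Move only the 'city' level back into the columns; 'year' stays as index
df_partial = df_multi.reset_index(level='city')

print(df_partial)
```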

## Changing the Column Names

Column names can simply be changed with `columns`:

In [29]:
print(my_df)
my_df.columns = ['A', 'B']
print()
print(my_df)

     a  b
i    1  2
ii   3  4
iii  5  6

     A  B
i    1  2
ii   3  4
iii  5  6


DataFrames also have a method called `rename`.

In [30]:
my_df.rename(columns={'A': 'Aa'}, inplace=True)
print(my_df)

     Aa  B
i     1  2
ii    3  4
iii   5  6


## Changing all Column Values
*How can we update values in a DataFrame?*

In [31]:
print(my_df)

# set uniform value
my_df['B'] = 3
print()
print(my_df)

# set different values
my_df['B'] = [2,17,0] 
print()
print(my_df)

     Aa  B
i     1  2
ii    3  4
iii   5  6

     Aa  B
i     1  3
ii    3  3
iii   5  3

     Aa   B
i     1   2
ii    3  17
iii   5   0


## Changing Specific Column Values
*How can we update specific values in a DataFrame?*

In [32]:
print(my_df)

# loc, iloc
my_loc2 = ['i', 'iii']
my_df.loc[my_loc2, 'Aa'] = 10

print()
print(my_df)

     Aa   B
i     1   2
ii    3  17
iii   5   0

     Aa   B
i    10   2
ii    3  17
iii  10   0


## Sorting Data

A DataFrame can be sorted with `sort_values`; this method takes one or more columns to sort by. 

In [33]:
print(my_df.sort_values(by='Aa', ascending=True))

     Aa   B
ii    3  17
i    10   2
iii  10   0


`sort_values` accepts many keyword arguments, including `ascending`, which can be set to `False` for one or more variables that we want sorted in descending order. 

In addition, sorting by index is also possible with `sort_index`.

In [34]:
print(my_df.sort_index())

     Aa   B
i    10   2
ii    3  17
iii  10   0
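
Sorting by several columns with different directions might look like this. This is a sketch with a hypothetical `df_demo` holding the same values as `my_df`:

```python
import pandas as pd

df_demo = pd.DataFrame({'Aa': [10, 3, 10], 'B': [2, 17, 0]},
                       index=['i', 'ii', 'iii'])

# Sort by 'Aa' descending; break ties by 'B' ascending
df_sorted = df_demo.sort_values(by=['Aa', 'B'], ascending=[False, True])

print(df_sorted)
```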


# VIDEO 2: Boolean Data

## Logical Expression for Series (I/II)
*Can we test an expression for all elements?*

Yes: **==** and **!=** work when comparing a Series with a single object or with another Series that has the same index. Example:

In [35]:
print(my_series3)
print()
print(my_series3 == 0)

yesterday    0
today        1
tomorrow     3
dtype: int64

yesterday     True
today        False
tomorrow     False
dtype: bool


What datatype is returned? 


## Logical Expression in Series  (II/II)
*Can we check if elements in a series equal some element in a container?*

Yes, the `isin` method. Example:

In [36]:
my_rng = list(range(2))

print(my_rng)
print()
print(my_series3.isin(my_rng)) 

[0, 1]

yesterday     True
today         True
tomorrow     False
dtype: bool


## Power of Boolean Series (I/II)
*Can we combine boolean Series?*

Yes, we can use:
- the `&` operator (*and*)
- the `|` operator (*or*)

In [37]:
print(((titanic.sex == 'female') & (titanic.age >= 30)).head(3)) # selection by multiple columns

0    False
1     True
2    False
dtype: bool


What datatype was returned? 
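
A small sketch of the operators on hypothetical data (`df_demo` mimics the first three rows of `titanic`). When conditions are written inline, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `==` and `>=`. Negation is written with `~`:

```python
import pandas as pd

df_demo = pd.DataFrame({'sex': ['male', 'female', 'female'],
                        'age': [22.0, 38.0, 26.0]})

is_female = df_demo['sex'] == 'female'
is_30_plus = df_demo['age'] >= 30

both = is_female & is_30_plus     # and
either = is_female | is_30_plus   # or
not_female = ~is_female           # negation

print(both.tolist(), either.tolist(), not_female.tolist())
```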


## Power of Boolean Series (II/II)
*Why do we care for boolean series (and arrays)?*

Mainly because we can use them to select rows based on their content.

In [38]:
print(my_series3)
print()
print(my_series3[my_series3<3])

yesterday    0
today        1
tomorrow     3
dtype: int64

yesterday    0
today        1
dtype: int64


NOTE: Boolean selection is extremely useful for dataframes!!
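
For DataFrames, the same idea could be sketched as follows (hypothetical data in `df_demo`). A boolean Series inside `[]` selects rows, and `loc` can combine a boolean row selection with a column selection:

```python
import pandas as pd

df_demo = pd.DataFrame({'sex': ['male', 'female', 'female'],
                        'age': [22.0, 38.0, 26.0]})

# Rows where both conditions hold
selected = df_demo[(df_demo['sex'] == 'female') & (df_demo['age'] >= 30)]

# Boolean row selection combined with a column label
female_ages = df_demo.loc[df_demo['sex'] == 'female', 'age']

print(selected)
print(female_ages)
```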

# VIDEO 3: Numeric Operations and Methods

## Numeric Operations (I/III)
*How can we make basic arithmetic operations with arrays, series and dataframes?*

It works much like arithmetic on single Python numbers, except that the operation is applied to every element (plain Python lists do not support this). An example with squaring:

In [39]:
num_ser1 = pd.Series([2,3,2,1,1])
num_ser2 = num_ser1 ** 2

print(num_ser1)
print(num_ser2)

0    2
1    3
2    2
3    1
4    1
dtype: int64
0    4
1    9
2    4
3    1
4    1
dtype: int64


## Numeric Operations (II/III)
*Are other numeric Python operators the same?*

Numeric operators (`/`, `//`, `-`, `*`, `**`) work as expected.

So do comparison operators (`==`, `!=`, `>`, `<`).

*Why is this useful?*

- vectorized operations are VERY fast;
- requires very little code.
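
To illustrate the "very little code" point, here is a sketch comparing an explicit loop with the vectorized expression. Both give the same result; no timings are claimed here, as these depend on the machine and data size:

```python
import pandas as pd

num_ser = pd.Series([2, 3, 2, 1, 1])

squared_loop = pd.Series([x ** 2 for x in num_ser])  # explicit Python loop
squared_vec = num_ser ** 2                           # vectorized: one expression

print(squared_vec.tolist() == squared_loop.tolist())
```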

## Numeric  Operations (III/III)
*Can we also do this with vectors of data?*

Yes, we can also do elementwise addition, multiplication, subtractions etc. of series. Example: 

In [40]:
num_ser1 + num_ser2

0     6
1    12
2     6
3     2
4     2
dtype: int64
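
One detail worth keeping in mind: elements are matched on index labels, not on position. A minimal sketch with two hypothetical Series whose indices only partly overlap:

```python
import pandas as pd

s_a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
s_b = pd.Series([10, 20, 30], index=['z', 'y', 'w'])

total = s_a + s_b   # matched on labels; labels found in only one Series give NaN

print(total)
```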

## Numeric methods (I/III)

*OK, these were some quite simple operations with pandas series. Are there other numeric methods?*

Yes, pandas series and dataframes have other powerful numeric methods built-in. 

Consider an example series of 10 million randomly generated observations:

In [41]:
arr_rand = np.random.randn(10**7) # Draw 10^7 observations from a standard normal; equivalent to np.random.normal(size=10**7)
s2 = pd.Series(arr_rand) # Convert to pandas series
s2

0         -0.114982
1         -0.142769
2         -0.252456
3         -0.045497
4          0.132941
             ...   
9999995    0.663020
9999996   -0.755392
9999997    0.157724
9999998    0.611881
9999999   -0.778720
Length: 10000000, dtype: float64

Now, display the median of this distribution:

In [42]:
s2.median() # Display median

0.00016248044579439104

Other useful methods include: `mean`, `quantile`, `min`, `max`, `std`, `describe` and many more.

In [43]:
np.round(s2.describe(),2) # Display other characteristics of distribution (rounded)

count    10000000.00
mean            0.00
std             1.00
min            -5.26
25%            -0.67
50%             0.00
75%             0.67
max             5.55
dtype: float64

## Numeric methods (II/III)
An important method is `value_counts`. This counts the number of occurrences of each distinct value. 

Example:

In [44]:
cuts = np.arange(-7, 8, 1) # range from -7 to 7 with intervals of unit size
cats = pd.cut(s2, cuts) # cut into categorical data

In [45]:
cats.unique()

[(-1, 0], (0, 1], (-2, -1], (-3, -2], (1, 2], ..., (-4, -3], (4, 5], (-5, -4], (5, 6], (-6, -5]]
Length: 12
Categories (12, interval[int64]): [(-6, -5] < (-5, -4] < (-4, -3] < (-3, -2] ... (2, 3] < (3, 4] < (4, 5] < (5, 6]]

In [46]:
cats.value_counts()

(0, 1]      3413583
(-1, 0]     3413419
(1, 2]      1359455
(-2, -1]    1358695
(2, 3]       214039
(-3, -2]     213627
(-4, -3]      13274
(3, 4]        13256
(-5, -4]        330
(4, 5]          315
(5, 6]            4
(-6, -5]          3
(6, 7]            0
(-7, -6]          0
dtype: int64

What is the observation in the `value_counts` output: the index or the data?

## Numeric methods (III/III)
*Are there other powerful numeric methods?*

Yes: examples include 
- `unique`, `nunique`: the unique elements and the count of unique elements
- `cut`, `qcut`: partition series into bins 
- `diff`: difference between consecutive observations
- `cumsum`: cumulative sum
- `nlargest`, `nsmallest`: the n largest/smallest elements 
- `idxmin`, `idxmax`: index label of the minimal/maximal element 
- `corr`: correlation with another series (for DataFrames: the correlation matrix)

Check [series documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) for more information.
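
A few of these methods applied to a small hypothetical Series, as a quick sketch:

```python
import pandas as pd

s_demo = pd.Series([2, 3, 2, 1, 1])

n_distinct = s_demo.nunique()    # number of distinct values
changes = s_demo.diff()          # change from the previous observation (first is NaN)
running_total = s_demo.cumsum()  # cumulative sum
top_two = s_demo.nlargest(2)     # the two largest elements
where_max = s_demo.idxmax()      # index label of the maximum

print(n_distinct, running_total.tolist(), top_two.tolist(), where_max)
```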

# VIDEO 4: String Operations

## String Operations (I/III)
*Do the numeric python operators also apply to strings?*

In some cases yes, and this can be done very elegantly! Consider the following example with a series:

In [47]:
names_ser1 = pd.Series(['Andreas', 'Joachim', 'Nicklas', 'Terne'])
names_ser1

0    Andreas
1    Joachim
2    Nicklas
3      Terne
dtype: object

Now add another string:

In [48]:
names_ser1 + ' works @ SAMF'

0    Andreas works @ SAMF
1    Joachim works @ SAMF
2    Nicklas works @ SAMF
3      Terne works @ SAMF
dtype: object

## String Operations (II/III)
*Can two vectors of strings also be combined, as with numeric vectors?*

Fortunately, yes:

In [49]:
names_ser2 = pd.Series(['ethics', 'python and ML', 'scraping', 'text as data'])
names_ser1 + ' teaches ' + names_ser2

0           Andreas teaches ethics
1    Joachim teaches python and ML
2         Nicklas teaches scraping
3       Terne teaches text as data
dtype: object

## String Operations (III/III)
*Any other types of vectorized operations with strings?*

Many. In particular, there is a large set of string-specific operations (see `.str`-notation below). Some examples (see table 7-5 in PDA for more - we will revisit in session 5):

In [50]:
names_ser1.str.upper() # works similarly with lower()

0    ANDREAS
1    JOACHIM
2    NICKLAS
3      TERNE
dtype: object

In [51]:
names_ser1.str.contains('as')

0     True
1    False
2     True
3    False
dtype: bool

In [52]:
names_ser1.str[1:3] # We can even do vectorized slicing of strings!

0    nd
1    oa
2    ic
3    er
dtype: object
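
A few more `.str` methods, sketched on the same names (rebuilt here as the hypothetical `names` so the example runs on its own):

```python
import pandas as pd

names = pd.Series(['Andreas', 'Joachim', 'Nicklas', 'Terne'])

lengths = names.str.len()                               # number of characters
starts_with_n = names.str.startswith('N')               # boolean Series
replaced = names.str.replace('as', 'AS', regex=False)   # literal replacement

print(lengths.tolist())
print(starts_with_n.tolist())
print(replaced.tolist())
```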

# VIDEO 5: Categorical Data

## The Categorical Data Type
*Are string (or object) columns attractive to work with?*

In [53]:
pd.Series(['Pandas', 'series'])

0    Pandas
1    series
dtype: object

Now, sometimes the categorical data type is better:
- Use categorical data when many values (strings) are repeated
    - Less storage and faster computations
- You can put some order (structure) on your string data
- It also allows new features:
    - Plots have bars, violins etc. sorted according to category order

## Example of Categorical Data

Conversion to categorical data:

In [54]:
edu_list = ['BSc Political Science', 'Secondary School'] + ['High School']*2
edu_cats = ['Secondary School', 'High School', 'BSc Political Science']

str_ser = pd.Series(edu_list*10**5)

Option 1: No order

In [55]:
cat_ser = str_ser.astype('category')
cat_ser[:5]

0    BSc Political Science
1         Secondary School
2              High School
3              High School
4    BSc Political Science
dtype: category
Categories (3, object): ['BSc Political Science', 'High School', 'Secondary School']

Option 2: Order

In [56]:
cats = pd.Categorical(str_ser, categories=edu_cats, ordered=True)
cat_ser2 = pd.Series(cats, index=str_ser.index)
cat_ser2[:5]

0    BSc Political Science
1         Secondary School
2              High School
3              High School
4    BSc Political Science
dtype: category
Categories (3, object): ['Secondary School' < 'High School' < 'BSc Political Science']
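
What the order buys us can be sketched as follows, on a shortened, hypothetical version of the education data. With an ordered categorical, comparisons, `min`/`max` and sorting follow the category order rather than the alphabet:

```python
import pandas as pd

edu_cats = ['Secondary School', 'High School', 'BSc Political Science']
edu = pd.Series(['BSc Political Science', 'Secondary School',
                 'High School', 'High School'])
edu_ordered = pd.Series(pd.Categorical(edu, categories=edu_cats, ordered=True))

above_secondary = edu_ordered > 'Secondary School'  # comparison uses category order
lowest, highest = edu_ordered.min(), edu_ordered.max()
in_order = edu_ordered.sort_values().tolist()

print(above_secondary.tolist())
print(lowest, '|', highest)
```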

## Numbers as Categories

It is natural to think of measures in categories, e.g. small and large.

*Can we convert our numerical data to bins in a smart way?*

Yes, there are two methods that are useful (and you just applied one of them earlier in this session!):
- `cut` which divides data by user specified bins
- `qcut` which divides data by user specified quantiles
    - E.g. median, $q=0.5$; lower quartile threshold, $q=0.25$; etc.

In [57]:
x = pd.Series(np.random.normal(size = 10**6))
cat_ser3 = pd.qcut(x, q = [0,0.025, 0.975, 1])
cat_ser3.cat.categories

IntervalIndex([(-5.07, -1.964], (-1.964, 1.957], (1.957, 5.01]],
              closed='right',
              dtype='interval[float64]')

In [58]:
cat_ser3.cat.codes.head(5)

0    1
1    1
2    1
3    1
4    1
dtype: int8
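
The difference between the two methods is easiest to see on a tiny, skewed example (hypothetical data in `x_demo`): `cut` fixes the bin edges, while `qcut` fixes the share of observations in each bin.

```python
import pandas as pd

x_demo = pd.Series([1, 2, 3, 4, 100])

by_value = pd.cut(x_demo, bins=[0, 50, 100])  # bin edges chosen by value
by_quantile = pd.qcut(x_demo, q=2)            # bin edges chosen at the median

counts_value = by_value.value_counts().sort_index().tolist()
counts_quantile = by_quantile.value_counts().sort_index().tolist()

print(counts_value)
print(counts_quantile)
```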

## Converting to Numeric and Binary

For regression, we often want our string / categorical variables as dummy variables:
- That is, each category gets its own binary column (0 and 1)
    - Note: We may leave one 'reference' category out here (intro statistics)
- The remaining variables are kept as numeric

*How can we do this?*

Pass the Series or DataFrame, `df`, to the function as `pd.get_dummies(df)`:

In [59]:
pd.get_dummies(cat_ser3).head(5)

|   | (-5.07, -1.964] | (-1.964, 1.957] | (1.957, 5.01] |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 |
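
Leaving out a reference category can be sketched with the `drop_first` keyword. The Series `size` is hypothetical. Note that recent pandas versions return `True`/`False` columns by default, so `dtype=int` is passed here to get 0/1 as in the output above:

```python
import pandas as pd

size = pd.Series(['small', 'large', 'medium', 'small'])

dummies = pd.get_dummies(size, dtype=int)                       # one column per category
dummies_ref = pd.get_dummies(size, drop_first=True, dtype=int)  # first category is the reference

print(dummies)
print(dummies_ref)
```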


# VIDEO 6: Time Series Data

## Temporal Data Type

*Why is time so fundamental?*

Every measurement made by a human was made at some point in time - therefore, it has a "timestamp"!

## Formats for Time

*How are time stamps measured?*

1. **Datetime** (ISO 8601): Standard calendar
    - year, month, day (hour, minute, second, millisecond); timezone
    - can come as string in raw data
2. **Epoch time**: Seconds since January 1, 1970 - 00:00, GMT (Greenwich time zone)
    - nanoseconds in pandas

## Time Data in Pandas

*Does Pandas store it in a smart way?*

Pandas and numpy have native support for temporal data combining datetime and epoch time.

In [60]:
str_ser2 = pd.Series(['20170101', '20170727', '20170803', '20171224'])
dt_ser = pd.to_datetime(str_ser2)
dt_ser

0   2017-01-01
1   2017-07-27
2   2017-08-03
3   2017-12-24
dtype: datetime64[ns]

## Example of Parsing Temporal Data

*How does the input type matter for how time data is parsed?*

A lot! As we will see, `to_datetime()` may assume either *datetime* or *epoch time* format: strings are read as calendar dates, while integers are read as nanoseconds since the epoch:

In [61]:
pd.to_datetime(str_ser2)

0   2017-01-01
1   2017-07-27
2   2017-08-03
3   2017-12-24
dtype: datetime64[ns]

In [62]:
pd.to_datetime(str_ser2.astype(int))

0   1970-01-01 00:00:00.020170101
1   1970-01-01 00:00:00.020170727
2   1970-01-01 00:00:00.020170803
3   1970-01-01 00:00:00.020171224
dtype: datetime64[ns]
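
To avoid relying on these defaults, the format or the unit can be stated explicitly. A minimal sketch with hypothetical inputs (`int_dates` and the epoch value `1483228800`):

```python
import pandas as pd

int_dates = pd.Series([20170101, 20170727])

# Integers would be read as epoch time, so convert to strings and state the format
parsed = pd.to_datetime(int_dates.astype(str), format='%Y%m%d')

# Epoch time given in seconds instead of the default nanoseconds
from_epoch = pd.to_datetime(1483228800, unit='s')

print(parsed.dt.strftime('%Y-%m-%d').tolist())
print(from_epoch)
```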

## Time Series Data

*Why are temporal data powerful?*

We can easily make and plot time series. Example of 20 years of Apple stock prices:
- Tip: Install in terminal using: *conda install pandas-datareader*

In [63]:
from pandas_datareader import data
aapl = data.DataReader('AAPL', data_source='yahoo', start='2000')['Adj Close']
aapl.plot(figsize = (12,4), logy = True)

RemoteDataError: Unable to read URL: https://finance.yahoo.com/quote/AAPL/history?period1=946695600&period2=1625277599&interval=1d&frequency=1d&filter=history

## Time Series Components

*What is within the `aapl` series? What is a time series?*

In [64]:
aapl.head(5)

NameError: name 'aapl' is not defined

In [65]:
aapl.head(5).index

NameError: name 'aapl' is not defined

So in essence, time series in pandas are often just series of data with a time index.
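
Since the download above did not succeed, here is a sketch of the same idea with synthetic data (the dates and values are made up for illustration and are not stock prices):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2020-01-01', periods=5, freq='D')  # daily time index
ts_demo = pd.Series(np.arange(5.0), index=dates)

index_type = type(ts_demo.index).__name__
window = ts_demo.loc['2020-01-02':'2020-01-03']  # slice by date strings

print(index_type)
print(window)
```

The index is a `DatetimeIndex`, which is what allows slicing by date strings.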

## Pandas and Time Series

*Why is pandas good at handling and processing time series data?*

It has specific tools for resampling and interpolating data:
- See 11.3, 11.5 and 11.6 in PDA textbook

It handles irregular data well:
- missing values (`data.ffill()` or `data.fillna(data.mean())`)
- duplicate entries
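
A small sketch of resampling and filling gaps, using a synthetic series with one missing day (all names and values are hypothetical):

```python
import pandas as pd

dates = pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-04'])  # Jan 3 is missing
ts_gappy = pd.Series([1.0, 2.0, 4.0], index=dates)

daily = ts_gappy.resample('D').mean()  # regular daily grid; the missing day becomes NaN
filled_forward = daily.ffill()         # carry the last observation forward
filled_linear = daily.interpolate()    # linear interpolation between neighbours

print(filled_forward.tolist())
print(filled_linear.tolist())
```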



## Datetime in Pandas

*What other uses might time data have?*

We can extract information from datetime columns. These columns have the `dt` accessor with attributes such as `month`. Example:

In [66]:
dt_ser2 = pd.Series(aapl.index)
dt_ser2.dt.month #also year, weekday, hour, second

NameError: name 'aapl' is not defined

Many other useful features (e.g. aggregation over time into means, medians, etc.)
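
The `dt` accessor can also be tried without the stock data. The sketch below uses the four dates from earlier in this session (rebuilt under the hypothetical name `dt_demo`):

```python
import pandas as pd

dt_demo = pd.to_datetime(pd.Series(['20170101', '20170727', '20170803', '20171224']),
                         format='%Y%m%d')

months = dt_demo.dt.month         # month number, 1-12
weekdays = dt_demo.dt.day_name()  # weekday names

print(months.tolist())
print(weekdays.tolist())
```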

## Videos + Exercises

In the exercises, you are going to (i) learn more about data structuring in Pandas and (ii) work with Pandas yourself.