# TOC
- Datetime in python
- 3 objects: Date, Time, DateTime  (from datetime import datetime, date, time)
- time deltas (from datetime import timedelta)
  - operations:
    - minus, plus, multiply, divide
- convert between datetime and string (both directions)
  - `datetime.now().isoformat()` convert to string
  - `datetime.strptime('21/11/06 16:30', '%d/%m/%y %H:%M')` convert to datetime
  - `import dateutil.parser; dateutil.parser.parse('21/11-06 16:30')` easy (forgiving) datetime parsing library
- Pandas
  - Series and DataFrame
  - Series
    - 1 dimensional data structure
  - DataFrame
    - 2 dimensional data structure
    - `pandas.DataFrame( data, index, columns, dtype, copy)`
    - columns can be of different types
    - data can be lists, dicts, maps, ndarrays.
- pd.Series(

# Datetime in Python

This intro to handling *dates* and *times* in Python is based on the documentation of the standard library's `datetime` module https://docs.python.org/3.6/library/datetime.html.

The datetime module supplies classes for manipulating *dates* and *times* in both simple and complex ways. While date and time arithmetic is supported, the focus of the implementation is on efficient attribute extraction for output formatting and manipulation.

## Date Objects

A date object represents a date (year, month and day) in an idealized calendar, the current Gregorian calendar indefinitely extended in both directions. January 1 of year 1 is called day number 1, January 2 of year 1 is called day number 2, and so on. This matches the definition of the "proleptic Gregorian" calendar in Dershowitz and Reingold’s book Calendrical Calculations, where it is the base calendar for all computations. See the book for algorithms for converting between proleptic Gregorian ordinals and many other calendar systems. Alternatively, you can have a look at the paper describing what is extended in the book:
http://www.cs.tau.ac.il/~nachumd/papers/cc-paper.pdf
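
The day numbering described above can be checked with `date.toordinal()` and `date.fromordinal()`. The following is a minimal sketch; the variable names are only chosen for illustration.

```python
from datetime import date

# January 1 of year 1 is day number 1 in the proleptic Gregorian calendar.
first_ordinal = date(1, 1, 1).toordinal()

# Converting a date to its ordinal and back gives the same date again.
some_day = date(2019, 3, 3)
round_trip = date.fromordinal(some_day.toordinal())

# The difference of two ordinals is the number of days between the dates.
days_between = date(2019, 10, 9).toordinal() - date(2019, 10, 2).toordinal()

print(first_ordinal, round_trip, days_between)
```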

In [24]:
from datetime import date


today = date.today()
print(today)
today

2019-03-03


datetime.date(2019, 3, 3)

In [2]:
today = date(today.year, 10, 2)
today

datetime.date(2019, 10, 2)

In [3]:
next_lecture = date(today.year, 10, 9)
time_to_next_lecture = abs(next_lecture - today)
time_to_next_lecture.days

7

In [4]:
from datetime import timedelta
# see the timedelta documentation:
# https://docs.python.org/3.6/library/datetime.html#timedelta-objects

next_lecture = date(today.year, 10, 2) + timedelta(5)
time_to_next_lecture = abs(next_lecture - today)
time_to_next_lecture.days

5

In [5]:
today.strftime("%d/%m/%Y")

'02/10/2019'

In [6]:
today.strftime("%A %d. %B %Y")

'Wednesday 02. October 2019'

## Time Objects

A time object represents a (local) time of day, independent of any particular day. In our course I will not consider times with respect to different time zones. In case you need to add information about which time zone a `time` refers to, please read https://docs.python.org/3.4/library/datetime.html#tzinfo-objects.

In [7]:
from datetime import datetime, date, time


t = time(12, 10, 30)
t.isoformat()

'12:10:30'

In [8]:
print(t.strftime('%H:%M:%S'))

print('The time is {:%H:%M}.'.format(t))

12:10:30
The time is 12:10.


## Datetime Objects

In [26]:
from datetime import datetime, date, time


d = date.today()
t = time(12, 30)
datetime.combine(d, t)  # today's date combined with the time 12:30

datetime.datetime(2019, 3, 3, 12, 30)

In [41]:
now = datetime.now()
print(now)
now = now.replace(microsecond=0)
print(now)

2019-03-03 16:18:01.947856
2019-03-03 16:18:01


In [11]:
datetime.utcnow()

datetime.datetime(2019, 3, 3, 14, 42, 30, 847761)

In [37]:
ic = now.isocalendar()
print('the year is {}, the week is {} and it is the {}th day'.format(ic[0], ic[1],ic[2]))
week_number = ic[1]
print(week_number)

the year is 2019, the week is 9 and it is the 7th day
9


In [42]:
now.strftime("%A, %d. %B %Y %I:%M%p")

'Sunday, 03. March 2019 04:18PM'

In [14]:
'The {1} is {0:%d}, the {2} is {0:%B}, the {3} is {0:%I:%M%p}.'.format(now, "day", "month", "time")

'The day is 03, the month is March, the time is 03:42PM.'

In [15]:
d = datetime.strptime('10 Jun 2010', '%d %b %Y')
print(d)
d.strftime('%d-%m-%Y week: %U')

2010-06-10 00:00:00


'10-06-2010 week: 23'

## Timedeltas

A `timedelta` object represents a duration, the difference between two dates or times.

In [44]:
from datetime import timedelta


d = timedelta(microseconds=5)
(d.days, d.seconds, d.microseconds)

(0, 0, 5)

In [17]:
timedelta(hours=-5)

datetime.timedelta(-1, 68400)

### Operations with `timedelta`s

In [18]:
year_as_delta = timedelta(days=365)
print('year_as_delta:',year_as_delta)
another_year_delta = timedelta(weeks=40, days=84, hours=23, minutes=50, seconds=600)  # adds up to 365 days

last_year = datetime.now() - year_as_delta
next_year = datetime.now() - year_as_delta + (2 *another_year_delta)
print(last_year)
print(next_year)

two_year_delta = next_year - last_year
print('The two year difference is equivalent to {} days or to {} seconds'.format(
    two_year_delta.days, two_year_delta.total_seconds()))


year_as_delta: 365 days, 0:00:00
2018-03-03 15:42:31.259717
2020-03-02 15:42:31.259846
The two year difference is equivalent to 730 days or to 63072000.000129 seconds
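
Besides adding and subtracting, `timedelta` objects can also be multiplied and divided. The sketch below uses illustrative variable names and shows one example per operation.

```python
from datetime import timedelta

week = timedelta(weeks=1)
day = timedelta(days=1)

total = week + day        # addition: 8 days
remaining = week - day    # subtraction: 6 days
fortnight = 2 * week      # multiplication by a number: 14 days
half_week = week / 2      # division by a number: 3 days and 12 hours
ratio = week / day        # division by another timedelta gives a float
whole_days = week // day  # floor division by a timedelta gives an int

print(total, remaining, fortnight, half_week, ratio, whole_days)
```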


## Converting Strings to Times and Vice Versa

In [19]:
from datetime import datetime


datetime.now().isoformat()

'2019-03-03T15:42:31.312044'

In [20]:
dt = datetime.strptime('21/11/06 16:30', '%d/%m/%y %H:%M')
dt

datetime.datetime(2006, 11, 21, 16, 30)

In [21]:
dt.strftime('%y-%m-%d %H:%M')

'06-11-21 16:30'
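
`strptime` and `strftime` are inverse operations as long as the same format string is used. The following sketch illustrates the round trip and what happens when the input does not match the format; the variable names are only illustrative.

```python
from datetime import datetime

fmt = '%d/%m/%y %H:%M'
text = '21/11/06 16:30'

parsed = datetime.strptime(text, fmt)  # string -> datetime
formatted = parsed.strftime(fmt)       # datetime -> string, same format

# A string that does not match the format raises a ValueError.
try:
    datetime.strptime('2006-11-21', fmt)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(parsed, formatted, mismatch_raised)
```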

### Parsing Arbitrary Dates from Strings

The `dateutil.parser` module offers a generic date/time string parser which is able to parse most known formats to represent a date and/or time.

The module attempts to be forgiving with regard to unlikely input formats, returning a datetime object even for dates which are ambiguous.

In [22]:
import dateutil.parser


dateutil.parser.parse('21/11-06 16:30')

datetime.datetime(2006, 11, 21, 16, 30)
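
Because the parser is forgiving, ambiguous strings are resolved by convention. The sketch below assumes that `dateutil` is installed and uses the `dayfirst` and `yearfirst` keyword arguments of `dateutil.parser.parse` to choose a different interpretation of the same string.

```python
import dateutil.parser

ambiguous = '01/02/03'

# By default the first number is read as the month.
default = dateutil.parser.parse(ambiguous)

# dayfirst=True reads the first number as the day instead.
day_first = dateutil.parser.parse(ambiguous, dayfirst=True)

# yearfirst=True reads the first number as the year.
year_first = dateutil.parser.parse(ambiguous, yearfirst=True)

print(default.date(), day_first.date(), year_first.date())
```

If you know the format of your input, `datetime.strptime` with an explicit format string avoids this kind of guessing.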

# Class exercise with dates
Create a function `getMeetingDates` in a module called `myUtilities.py`
- the function must take 4 arguments (start_date=now, period_as_timedelta, time_of_day, number_of_meetings)
- the function should then return a list of datetimes for a series of meetings that take place from start_date and are evenly distributed throughout the period.
- create another list with the number of attendees who were actually present at each meeting.
- create a bar plot of attendance through the series of meetings.

# Pandas for Time Series and Data Frames

Pandas is, similar to NumPy, another library offering high-level data structures, which enable fast data analysis. For us, the most important are probably the types `Series` and `DataFrame`, both of which are introduced in the following.  

This tutorial is based on the [intro to Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html).

## Pandas vs Numpy
1. In Pandas we have the 1D `Series` and the 2D `DataFrame`; in NumPy we have multi-dimensional `ndarray`s.
2. In a `DataFrame` we have column names (like in SQL); in an `ndarray` we slice data based on indices.
3. In a `DataFrame` different columns can have different data types.
![](images/pandas_vs_numpy.png)
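
The differences listed above can be seen in a few lines. This is a small sketch with made-up toy values; the column names are purely illustrative.

```python
import numpy as np
import pandas as pd

# An ndarray has one dtype for all elements and is sliced by position.
arr = np.array([[1, 2.5], [3, 4.5]])
second_column = arr[:, 1]

# A DataFrame has named columns, and each column has its own dtype.
toy_df = pd.DataFrame({'name': ['a', 'b'], 'count': [1, 2], 'share': [0.5, 0.25]})
share_column = toy_df['share']

print(arr.dtype)
print(toy_df.dtypes)
```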

In [None]:
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [None]:
# !pip3 install pandas --user

As we will refer to Pandas' classes and functions often in code, we usually import the module as `pd`.

In [48]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## `Series`

A `Series` is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

http://pandas.pydata.org/pandas-docs/stable/dsintro.html#series

You can create a Series by passing a list of values, letting Pandas create a default integer index.

In [51]:
s = pd.Series([1, 3, 5, np.nan, 'seks', 8])
print(s,'\n---------------------')
s = pd.Series(['seks','fem','fire'],[6,5,4])
print(s)

0       1
1       3
2       5
3     NaN
4    seks
5       8
dtype: object 
---------------------
6    seks
5     fem
4    fire
dtype: object
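
When a `Series` has an explicit index, values can be looked up either by label or by position. The following sketch reuses the second `Series` from above and uses `loc` for labels and `iloc` for positions.

```python
import pandas as pd

s = pd.Series(['seks', 'fem', 'fire'], index=[6, 5, 4])

by_label = s.loc[6]            # look up the value with index label 6
by_position = s.iloc[0]        # look up the first value, whatever its label
last_by_position = s.iloc[-1]  # negative positions count from the end

print(by_label, by_position, last_by_position)
```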


For the following example introducing `Series` we will collect some open data from the World Bank, see http://data.worldbank.org/?locations=DK-UY. This dataset includes a plethora of interesting data. However, for this example we will focus on the *CO2 emissions*.

First, we have to download the data. We do this by writing the response to a request to the World Bank API into a file. As denoted in the response header, we receive a ZIP file.

In [None]:
import requests


# url = 'http://api.worldbank.org/v2/en/country/DNK;URY' 
# response = requests.get(url, params={'downloadformat': 'csv'})
url = 'http://api.worldbank.org/v2/en/indicator/EN.ATM.CO2E.KT?downloadformat=csv'
response = requests.get(url)

print(response.headers)

fname = response.headers['Content-Disposition'].split('=')[1]

if response.ok:  # status_code == 200:
    with open(fname, 'wb') as f:
        f.write(response.content)   
print('-----------------')
print('Downloaded {}'.format(fname))

In [None]:
%%bash
ls -ltrh | tail
#man ls

You can resort to the standard library's `zipfile` module to uncompress the downloaded file.

In [None]:
import zipfile

zipfile.ZipFile(fname, 'r').extractall('.')
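
A `ZipFile` can also be used as a context manager, which closes the archive automatically. The sketch below does not touch the downloaded file; it creates a small, hypothetical archive in a temporary directory, lists its members and extracts them.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, 'example.zip')

    # Create a tiny archive with one CSV member (toy content).
    with zipfile.ZipFile(archive, 'w') as zf:
        zf.writestr('data.csv', 'year,value\n1960,1.0\n')

    # Inspect and extract it; the archive is closed when the block ends.
    with zipfile.ZipFile(archive, 'r') as zf:
        members = zf.namelist()
        zf.extractall(tmp)

    extracted_exists = os.path.exists(os.path.join(tmp, 'data.csv'))

print(members, extracted_exists)
```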

In [None]:
%%bash

ls -ltrh | tail

Additionally, you can make use of the `glob` module to glob for certain file patterns. We will store the filename of the CSV file we are interested in, in a variable called `local_file`.

In [None]:
from glob import glob
# glob is useful in any situation where your program needs to look for a list of files on the filesystem with names matching a pattern. If you need a list of filenames that all have a certain extension, prefix, or any common string in the middle, use glob instead of writing code to scan the directory contents yourself.

local_file = glob('./*API_EN*.csv')[0]
local_file
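
Note that `glob` returns an empty list when nothing matches, in which case indexing with `[0]` raises an `IndexError`. The sketch below demonstrates this with a few hypothetical, empty files in a temporary directory.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as tmp:
    # Create some empty files with made-up names.
    for name in ['API_EN_demo.csv', 'Metadata_demo.csv', 'notes.txt']:
        open(os.path.join(tmp, name), 'w').close()

    # Only names containing 'API_EN' and ending in '.csv' match.
    matches = glob(os.path.join(tmp, '*API_EN*.csv'))
    match_names = [os.path.basename(m) for m in matches]

    # No file matches this pattern, so the result is an empty list.
    no_match = glob(os.path.join(tmp, '*.zip'))

print(match_names, no_match)
```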

## A small detour...

### Collecting information on the CLI

To see the header of the file that is of our interest, we can use the `head` command.

In [None]:
%%bash
head ./*API_EN.ATM.CO2*.csv

We can see that the actual CSV header is on line five. To extract only the header row, we can use the stream editor *sed*, see `man sed`. The argument `'5!d'` tells `sed` that we are only interested in the fifth line.

In [None]:
%%bash
# sed is a cli application that can filter text from pipeline (inputstream or a file)
# Sed Linux command doesn’t update your data. It only sends the changed text to STDOUT
# To know more about the sed tool: https://www.geeksforgeeks.org/sed-command-in-linux-unix-with-examples/
sed '5!d' API_EN.ATM.CO2E.KT_DS2_en_csv_v2_10473877.csv # sed delete all lines except line 5 (sed '5d' filename.ext (would delete only line 5))
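
The same selection can be done in pure Python without `sed`. The following sketch uses `itertools.islice` on an in-memory stand-in for the file; the sample content is made up for illustration.

```python
import io
from itertools import islice

# Hypothetical stand-in for the downloaded CSV file.
sample = io.StringIO(
    'line 1\nline 2\nline 3\nline 4\n'
    '"Country Name","Country Code"\n'
    'line 6\n'
)

# islice(f, 4, 5) skips four lines and yields only the fifth one.
fifth_line = next(islice(sample, 4, 5)).rstrip('\n')
print(fifth_line)
```

With a real file you would pass the object returned by `open(...)` instead of the `StringIO` object.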

To get the two lines, one for Denmark and one for Uruguay respectively, holding the CO2 emissions in kilotons (kt), we can use the grep command, see `man grep`.

In [None]:
%%bash
grep -E "DNK|URY" API_EN.ATM.CO2E.KT_DS2_en_csv_v2_10473877.csv

### Executing OS Commands from Python


The `subprocess` module allows us to execute shell commands and to read what the process wrote to standard out and standard error.


In [None]:
import subprocess


cmd = 'sed 5!d {}'.format(local_file).split()
out, err = subprocess.Popen(cmd, stdout=subprocess.PIPE, 
                            stderr=subprocess.STDOUT).communicate()
# Since we are getting the output as a byte literal, we have to decode it into string
header_cols = out.splitlines()[0].decode('UTF-8').split(',')
header_cols = [h.replace('"', '') for h in header_cols]
print(header_cols)
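
On Python 3.7 or newer, `subprocess.run` is a shorter alternative to `Popen(...).communicate()`. The sketch below is an assumption about how one might write it; to stay portable it starts a second Python interpreter instead of `sed`.

```python
import subprocess
import sys

# Run a child process and capture its output as text instead of bytes.
cmd = [sys.executable, '-c', 'print("hello from a child process")']
result = subprocess.run(cmd, capture_output=True, text=True, check=True)

output = result.stdout.strip()
print(result.returncode, output)
```

With `check=True` a non-zero exit code raises `subprocess.CalledProcessError`, so failures do not go unnoticed.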

In [None]:
%%bash
ls -ltr *.csv

Now, let's get in a similar way the lines of the CSV file containing the time series corresponding to Denmark's and Uruguay's CO2 emissions.

In [None]:
import subprocess


cmd = ['grep', '-E', 'DNK|URY', local_file]
out, err = subprocess.Popen(cmd, stdout=subprocess.PIPE, 
                            stderr=subprocess.STDOUT).communicate()
lines = out.decode('UTF-8').splitlines()
lines = [l.split(',') for l in lines]
lines = [[c.replace('"', '') for c in l] for l in lines]
lines

Since Pandas `Series` are one-dimensional labeled arrays, we are now ready to create two time series of CO2 emissions for Denmark and Uruguay respectively.
 

In [None]:
print(header_cols)
print(lines)

Since our data and the corresponding index labels are still all strings, we have to convert them to floats and integers respectively. We do so using two different mechanisms: the index is converted by creating a typed NumPy array (`dtype=int`), and the values are converted with the Pandas function `pd.to_numeric`, which converts strings to numerical values of a suitable type.

In [None]:
# reference to check graph: https://www.klimadebat.dk/grafer_co2udledning.php
header_cols[4:-1]
lines[0][4:-1]

ts_dk = pd.Series(lines[0][4:-1], index=np.asarray(header_cols[4:-1], dtype=int))
ts_dk = pd.to_numeric(ts_dk)
ts_dk.loc[1960]
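
`pd.to_numeric` raises a `ValueError` by default when it meets a string it cannot parse. The sketch below uses made-up toy values and the `errors='coerce'` option, which turns such entries into `NaN` instead.

```python
import pandas as pd

# Toy values only; 'n/a' stands for an entry that is not a number.
raw = pd.Series(['1.5', '2', 'n/a'], index=[1960, 1961, 1962])

# errors='coerce' replaces unparsable strings with NaN.
clean = pd.to_numeric(raw, errors='coerce')

print(clean)
```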

In [None]:
ts_dk.plot()

Now we create the corresponding time series for Uruguay.

In [None]:
ts_ury = pd.Series(lines[1][4:-1], index=np.asarray(header_cols[4:-1], dtype=int))
ts_ury = pd.to_numeric(ts_ury)
ts_ury.plot()

## `DataFrame`

Since `Series` are one-dimensional arrays, we have to create a `DataFrame` if we want to combine our two previous `Series` objects `ts_dk` and `ts_ury`.

A `DataFrame` is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or **a dict of Series objects**.
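
The "dict of Series" view can be taken literally. In the following sketch (toy values, illustrative names) a `DataFrame` is built from a dict of two `Series`; the rows are aligned on the index labels, and missing combinations become `NaN`.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=[2000, 2001, 2002])
b = pd.Series([10.0, 20.0], index=[2001, 2002])

# The dict keys become column names; the indexes are aligned by label.
frame = pd.DataFrame({'a': a, 'b': b})

print(frame)
```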

In the following we concatenate two `Series` to form a `DataFrame`.

We will use Pandas' `concat()` function; [a good explanation can be found here](https://www.tutorialspoint.com/python_pandas/python_pandas_concatenation.htm).

In [None]:
ts = pd.concat([ts_dk, ts_ury], axis=1, keys=['DNK', 'URY'])  # axis=0 (the default) stacks the rows, like SQL UNION ALL; axis=1 places the Series side by side as columns
#print(ts)
#print(type(ts))
ts.plot()
#ts.DNK
#print(ts)
#ts['DNK']
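
The effect of the `axis` argument is easiest to see on a small example. The sketch below uses two toy `Series` with made-up values and column keys.

```python
import pandas as pd

first = pd.Series([1.0, 2.0], index=[2000, 2001])
second = pd.Series([10.0, 20.0], index=[2000, 2001])

# axis=0 (default): one longer Series, index labels may repeat.
stacked = pd.concat([first, second])

# axis=1: a DataFrame with one column per Series, named by `keys`.
side_by_side = pd.concat([first, second], axis=1, keys=['A', 'B'])

print(stacked)
print(side_by_side)
```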

More information on `DataFrame`s can be found here:
http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe

Similar to `Series`, we can create `DataFrame`s by giving the data for the values and indexes explicitly.

In [None]:
dates = pd.date_range('20180302', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

But since we do not want to work on random example data, we will have a look at the military expenditures of some countries in the world. We will use this data to exemplify usage of Pandas' `DataFrame` methods.

Again, we will receive the data from the World Bank.
http://data.worldbank.org/indicator/MS.MIL.XPND.CN?locations=DK-CN-US-RU

In [None]:
import requests


url = 'http://api.worldbank.org/v2/en/indicator/MS.MIL.XPND.CN'

response = requests.get(url, params={'downloadformat': 'csv'})
fname = response.headers['Content-Disposition'].split('=')[1]

if response.ok:  # status_code == 200:
    with open(fname, 'wb') as f:
        f.write(response.content)   

print('Downloaded {}'.format(fname))

In [None]:
import os
import zipfile


zipfile.ZipFile(fname, 'r').extractall('.')
os.remove(fname)

In [None]:
%%bash
ls -ltrh | tail

In [None]:
from glob import glob


milit_files = glob('API_MS.MIL.XPND.CN_DS2_en_csv_v2_10418495.csv')
expenditure_csv = milit_files[0]
expenditure_csv

In [None]:
%%bash
head ./Metadata_Country_API_MS.MIL.XPND.CN_DS2_en_csv_v2_10418495.csv

Now, we use Pandas' `read_csv` function to read the downloaded CSV file directly. Note that we have to skip the first four rows as they do not contain data we are interested in, see keyword argument `skiprows=4`.

Reading the CSV file like this returns a `DataFrame` directly.

In [None]:
import pandas as pd


expenditures = pd.read_csv(expenditure_csv, skiprows=4)
expenditures
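
To see what `skiprows` does without downloading anything, the sketch below reads a small in-memory CSV with four preamble lines. The content, including the country names, is made up and only mimics the layout of the World Bank file.

```python
import io
import pandas as pd

# Four preamble lines, then the header row, then two data rows.
text = (
    'Data Source,Toy example\n'
    'Last Updated Date,2019-01-01\n'
    'Some note,first\n'
    'Some note,second\n'
    'Country Name,Country Code,1960,1961\n'
    'Aland,ALD,1.0,2.0\n'
    'Borduria,BRD,3.0,\n'
)

# skiprows=4 drops the preamble, so line five becomes the header.
toy = pd.read_csv(io.StringIO(text), skiprows=4)
print(toy)
```

Empty fields, like the last value of the second row, are read as `NaN`.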

## Viewing Data

In [None]:
expenditures.head()

In [None]:
expenditures.tail()

In [None]:
expenditures.index

In [None]:
expenditures.columns

In [None]:
expenditures.values

## Selection of Data in a `DataFrame`

### Selection by Column Name


In [None]:
expenditures['Country Name']

### Selection by Indexes

In the following we index the fourth row directly (positions are counted from zero, so it has position 3).

In [None]:
albania = expenditures.iloc[3]
print(albania)

In [None]:
expenditures.loc[3]

In [None]:
expenditures.iloc[3:5]

In [None]:
expenditures.iloc[3:5, 4:-1]
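
For `expenditures`, `iloc[3]` and `loc[3]` return the same row because the default index labels coincide with the positions. The sketch below uses a toy `DataFrame` with a different index to show where the two differ.

```python
import pandas as pd

toy = pd.DataFrame({'value': [10, 20, 30, 40]}, index=[100, 101, 102, 103])

by_position = toy.iloc[3, 0]     # fourth row, first column
by_label = toy.loc[103, 'value'] # row labeled 103, column 'value'

# Position slices exclude the end, label slices include it.
pos_slice = toy.iloc[1:3]
label_slice = toy.loc[101:103]

print(by_position, by_label, len(pos_slice), len(label_slice))
```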

## Boolean Indexing

Similar to NumPy, you can use boolean arrays for indexing. That is, you can use boolean expressions directly for indexing.

In the following we assign `expenditures` to `df` as the latter is shorter.

In [None]:
df = expenditures

df[df['Country Name'] == 'Denmark']

Using the `isin()` method for filtering:

In [None]:
df[df['Country Name'].isin(['United States', 'China', 'Denmark', 'Russian Federation'])]

Here, we create a `DataFrame` of all country codes for the four countries, which we want to study further in the following.

In [None]:
c_code_df = df[df['Country Name'].isin(['United States', 'China', 
                                        'Denmark', 'Russian Federation'])]['Country Code']
c_code_df
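
Boolean masks can be combined with `&` (and), `|` (or) and `~` (not); each condition needs its own parentheses. The following sketch uses a toy `DataFrame` with made-up numbers.

```python
import pandas as pd

toy = pd.DataFrame({
    'Country Name': ['Denmark', 'China', 'Uruguay'],
    'Country Code': ['DNK', 'CHN', 'URY'],
    '2000': [1.0, 5.0, 0.5],  # toy values, not real expenditures
})

# Rows that are in the list AND have a value above 2.
mask = toy['Country Name'].isin(['Denmark', 'China']) & (toy['2000'] > 2)
selected = toy[mask]

# Country codes of all rows that are NOT Denmark.
codes = toy.loc[~toy['Country Name'].isin(['Denmark']), 'Country Code'].tolist()

print(selected)
print(codes)
```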

We cannot plot the time series of military expenditures directly in a meaningful way: we would like to have the years on the x-axis, but in the selection of our `DataFrame` the year numbers are column names. Consequently, we have to transpose our `DataFrame`, see the `T` attribute.

Note that the expenditures are given in `LCU` (local currency units) in the World Bank data set, that is, in the currency of the corresponding country.

In [None]:
import matplotlib.pyplot as plt


ts_df = df.iloc[c_code_df.index, 31:-1].T
ts_df = ts_df.rename(columns=dict(c_code_df))
ts_df
ts_df.plot()
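
What the transposition does can be seen on a tiny example. In this sketch (toy values) the years start out as column names and end up as the index, which is what `plot()` puts on the x-axis.

```python
import pandas as pd

# One row per country, one column per year (toy values).
wide = pd.DataFrame({'2000': [1.0, 5.0], '2001': [2.0, 6.0]}, index=['DNK', 'CHN'])

# After transposing there is one row per year and one column per country.
by_year = wide.T

print(by_year)
```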

Since this plot may be a bit misleading, we will 'normalize' all expenditures to Euro, so that they are better comparable.

In [None]:
import requests


# http://www.ecb.europa.eu/stats/policy_and_exchange_rates/euro_reference_exchange_rates/html/index.en.html#dev
response = requests.get('http://www.ecb.europa.eu/stats/eurofxref/eurofxref-daily.xml')
response.text

In [None]:
from bs4 import BeautifulSoup


xml = BeautifulSoup(response.text, 'html5lib')
rate_list = xml.cube.cube.findAll("cube") # [0]['rate']

currency = ['USD', 'DKK', 'RUB', 'CNY']
rate_dict = dict.fromkeys(currency)
for r in rate_list:
    if r['currency'] in currency:
        rate_dict[r['currency']] = float(r['rate'])
        print(r['rate'])
rate_dict

In [None]:
ts_df['DNK'] = ts_df['DNK'] / rate_dict['DKK']
ts_df['USA'] = ts_df['USA'] / rate_dict['USD']
ts_df['CHN'] = ts_df['CHN'] / rate_dict['CNY']
ts_df['RUS'] = ts_df['RUS'] / rate_dict['RUB']
ts_df.plot()

**OBS!!!** Be careful, the graph above is still not really well suited for comparison as currency exchange rates are not fixed. However, the code above normalizes just relying on the most current exchange rate from the European Central Bank. See the exercise block at the bottom for how to fix that issue!

## PS

In case you have to sort the data in your `DataFrames` see the methods `sort_index` and `sort_values`.


```python
df.sort_index(axis=1, ascending=True)
df.sort_values(by='Country Code')
```
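
Both methods return a new, sorted `DataFrame` and leave the original untouched. A small sketch with toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    'Country Code': ['URY', 'DNK', 'CHN'],
    '2000': [0.5, 1.0, 5.0],  # toy values
})

by_code = toy.sort_values(by='Country Code')               # rows, A to Z
by_value_desc = toy.sort_values(by='2000', ascending=False)  # rows, largest first
cols_sorted = toy.sort_index(axis=1)                       # columns by name

print(by_code)
print(by_value_desc)
print(list(cols_sorted.columns))
```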

In [None]:
df.sort_index?

In [None]:
df.sort_values?