In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline

#### notes
 - NaN and missing data and scientific notation like 1e6
 - Date and time
 - Hierarchical Indexes
 - Concat & Merging
 - Pivot tables?
 - More on Pandas and visualization
 - Exploring data with descriptive statistics
 - Correlation and linear fitting
 - Advanced DataFrame manipulations
 - [Google Dataset Search](https://toolbox.google.com/datasetsearch)

## COMP 3122 - Artificial Intelligence with Python
__Week 4 lecture__

### [github.com/kamrik/ML1](https://github.com/kamrik/ML1)

### [slido.com/COMP3122](http://slido.com/COMP3122)

## NaN and the floating point numbers
Floating point arithmetic was standardized as IEEE 754 in 1985. The standard defines several special values, such as ±∞ and NaN ("not a number"), and the rules for computing with them ([Wikipedia article](https://en.wikipedia.org/wiki/Floating-point_arithmetic))
 - (+∞) + (+7) = (+∞)
 - (+∞) × (−2) = (−∞)
 - (+∞) × 0 = NaN, because no meaningful result exists
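
The three rules above can be checked directly with Python floats, which follow IEEE 754 on common platforms. This is a minimal sketch; the variable names are only illustrative.

```python
import math

inf = float("inf")

sum_result = inf + 7        # infinity absorbs any finite addend
product_result = inf * -2   # the sign flips, the magnitude stays infinite
undefined = inf * 0         # no meaningful answer, so the result is NaN

print(sum_result, product_result, undefined)
print(math.isnan(undefined))  # True
```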

## It's not easy to represent a missing value
 - There is no agreed-upon standard for missing integers: an integer column that contains a NaN gets upcast to float
 - For string columns it can be unclear whether a value is a NaN representing absence of data or the literal string "NaN"
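
A small sketch of both points, assuming default pandas settings. The series names (`ints`, `with_gap`, `labels`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# An all-integer column keeps an integer dtype...
ints = pd.Series([1, 2, 3])
# ...but a single missing value forces the whole column to float.
with_gap = pd.Series([1, np.nan, 3])

print(ints.dtype)
print(with_gap.dtype)  # float64

# In a string column, the literal text "NaN" is ordinary data, not a missing value.
labels = pd.Series(["NaN", None])
print(labels.isnull().tolist())  # [False, True]
```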

## Use df.isnull() & df.notnull()
 - Avoid comparing using `x == NaN`: NaN is not equal to anything, including itself, so the comparison is always False

In [34]:
# np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
pd.isnull([None, np.nan, False, ''])

array([ True,  True, False, False])
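
To see why `==` is the wrong tool here, the sketch below compares NaN with itself and then does the same on a small, made-up Series.

```python
import numpy as np
import pandas as pd

x = np.nan
print(x == np.nan)   # False: NaN never compares equal, not even to itself
print(pd.isnull(x))  # True

s_num = pd.Series([1.0, np.nan, 3.0])
eq_mask = s_num == np.nan    # every element is False, so the gap is not found
null_mask = s_num.isnull()   # correctly flags the missing element

print(eq_mask.tolist())    # [False, False, False]
print(null_mask.tolist())  # [False, True, False]
```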

## Pandas string manipulation methods


In [9]:
s = pd.Series(['alice', 'BOB', None, 'Carol'])
s

0    alice
1      BOB
2     None
3    Carol
dtype: object

In [16]:
s.str.upper()

0    ALICE
1      BOB
2     None
3    CAROL
dtype: object
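
The `.str` methods skip missing entries instead of raising an error. A brief sketch with two more methods from the same accessor, using the same example names; `na=False` tells `str.contains` to treat a missing value as "no match".

```python
import pandas as pd

names = pd.Series(['alice', 'BOB', None, 'Carol'])

# Length of each string; the missing entry stays missing.
lengths = names.str.len()

# Case-insensitive search for the letter "a"; missing values count as no match.
has_a = names.str.contains('a', case=False, na=False)

print(lengths.isnull().tolist())  # [False, False, True, False]
print(has_a.tolist())             # [True, False, False, True]
```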

## DataFrame from series (by the way)

In [24]:
s_cap = s.str.capitalize()

In [25]:
df=pd.DataFrame({'name':s_cap, 'raw_name':s})
df

| | name | raw_name |
|---|---|---|
| 0 | Alice | alice |
| 1 | Bob | BOB |
| 2 | | |
| 3 | Carol | Carol |


## Working with dates and time

## Keeping time is difficult and often unintuitive
#### When is Pushkin's birthday? - Google and Wikipedia seem to disagree

## Calendars
 - We basically give names to time intervals like the Jurassic period, or September 27, 2018
 - The Julian calendar, proposed by Julius Caesar, took effect on 1 January 45 BC (AUC 709)
 - Introduction of Gregorian calendar by Pope Gregory XIII
     - 4 October 1582 **was followed by** 15 October 1582

## Adoption of Gregorian calendar took some 350 years
Some examples
 - UK 1752
 - Sweden 1700-1753 (and Finland)
 - Russia 1918 (Jan 31 followed by Feb 14)

![Baltic map](https://img.affordabletours.com/AffordableCruisesWeb/Itineraires_Map/75947__201801091445__.jpg)

## Time
 - Same idea: we name a one-second period 23:59:59
 - Which second comes after 23:59:59?

 - What about December 31, 2016, 23:59:59?

## Leap seconds
 - https://en.wikipedia.org/wiki/Leap_second
 - Introduced in 1972
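
A leap second was inserted at the end of December 31, 2016, so 23:59:59 UTC was followed by 23:59:60. Python's standard `datetime` type cannot represent that second, which the sketch below demonstrates; the `accepted` flag is only an illustrative name.

```python
from datetime import datetime

try:
    # The real leap second: 2016-12-31 23:59:60 UTC.
    datetime(2016, 12, 31, 23, 59, 60)
    accepted = True
except ValueError:
    # datetime only allows seconds in the range 0..59.
    accepted = False

print(accepted)  # False
```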

![Length of a day](https://upload.wikimedia.org/wikipedia/commons/5/5b/Deviation_of_day_length_from_SI_day.svg)

## Coordinated Universal Time - UTC
 - Keeps close to astronomic time by occasionally introducing leap seconds
 - All modern civil time is UTC plus a time zone offset: Toronto summer time is UTC-4h, Toronto winter time is UTC-5h
 - UTC time is monotonic = never goes backwards
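
A minimal sketch of the offset idea using only the standard library. The fixed offsets `EDT` and `EST` are defined by hand here as an assumption for illustration; real code would normally use a time zone database so that the summer/winter switch happens automatically.

```python
from datetime import datetime, timedelta, timezone

# Hand-written fixed offsets for Toronto (illustrative, no DST rules attached).
EDT = timezone(timedelta(hours=-4), "EDT")  # summer
EST = timezone(timedelta(hours=-5), "EST")  # winter

summer_utc = datetime(2018, 9, 27, 17, 17, 19, tzinfo=timezone.utc)
winter_utc = datetime(2018, 1, 15, 17, 0, 0, tzinfo=timezone.utc)

toronto_summer = summer_utc.astimezone(EDT)
toronto_winter = winter_utc.astimezone(EST)

print(toronto_summer.isoformat())  # 2018-09-27T13:17:19-04:00
print(toronto_winter.isoformat())  # 2018-01-15T12:00:00-05:00

# Same instant, different label: aware datetimes compare by their UTC value.
print(toronto_summer == summer_utc)  # True
```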

## Unix time / Epoch time
 - The number of seconds that have **approximately** elapsed since 00:00:00 UTC, Thursday, 1 January 1970
 - Convenient in code, because it's a single number
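
The word "approximately" matters because Unix time ignores leap seconds. The sketch below converts timestamp 0 back to a date and then measures the gap across the leap second inserted at the end of 2016; variable names are illustrative.

```python
from datetime import datetime, timezone

# Timestamp 0 is the epoch itself.
epoch = datetime.fromtimestamp(0, tz=timezone.utc)
print(epoch.isoformat())  # 1970-01-01T00:00:00+00:00

before = datetime(2016, 12, 31, 23, 59, 59, tzinfo=timezone.utc).timestamp()
after = datetime(2017, 1, 1, 0, 0, 0, tzinfo=timezone.utc).timestamp()

# Two real seconds passed (23:59:59 and the leap second 23:59:60),
# but Unix time only counts one.
print(after - before)  # 1.0
```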

In [40]:
import time
time.time()

21.40021586418152

## Date / time representation
 - Way too many options
 - [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) 
    - 2018-09-27T17:17:19Z
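
A short sketch of reading and writing the ISO 8601 example above with the standard library. The format string `ISO_UTC` is an assumption that only covers this exact UTC ("Z") form, not the whole standard. One practical benefit of the format is that plain string sorting gives chronological order.

```python
from datetime import datetime, timezone

ISO_UTC = "%Y-%m-%dT%H:%M:%SZ"  # covers only the "Z" (UTC) variant

parsed = datetime.strptime("2018-09-27T17:17:19Z", ISO_UTC).replace(tzinfo=timezone.utc)
round_trip = parsed.strftime(ISO_UTC)
print(round_trip)  # 2018-09-27T17:17:19Z

# Sorting the strings alphabetically also sorts them by time.
stamps = ["2018-09-27T17:17:19Z", "2017-12-31T23:59:59Z", "2018-01-01T00:00:00Z"]
print(sorted(stamps))
```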

## When in doubt, use UTC and ISO!

## Date & time in Pandas
 - Covered in Video 25 of the video series 

In [62]:
pd.to_datetime('14:25')

Timestamp('2018-09-27 14:25:00')

In [53]:
pd.to_datetime('Tuesday, Sept 25, 2018')


Timestamp('2018-09-25 00:00:00')
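
`pd.to_datetime` guesses the format, which is convenient but risky for ambiguous strings. Note also that a time-only string such as `'14:25'` gets the current date filled in. The sketch below uses a made-up ambiguous date to show the default guess and two ways to be explicit.

```python
import pandas as pd

ambiguous = '03/04/2018'  # March 4 or April 3?

default_guess = pd.to_datetime(ambiguous)                  # month first by default
day_first = pd.to_datetime(ambiguous, dayfirst=True)       # hint: day comes first
explicit = pd.to_datetime(ambiguous, format='%d/%m/%Y')    # exact format, no guessing

print(default_guess)  # 2018-03-04 00:00:00
print(day_first)      # 2018-04-03 00:00:00
print(explicit)       # 2018-04-03 00:00:00
```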

In [63]:
df = pd.read_csv('../../pandas-videos/data/ufo.csv')
df.head()

| | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca | | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | Willingboro | | OTHER | NJ | 6/30/1930 20:00 |
| 2 | Holyoke | | OVAL | CO | 2/15/1931 14:00 |
| 3 | Abilene | | DISK | KS | 6/1/1931 13:00 |
| 4 | New York Worlds Fair | | LIGHT | NY | 4/18/1933 19:00 |


In [65]:
pd.to_datetime(df.Time)

0       1930-06-01 22:00:00
1       1930-06-30 20:00:00
2       1931-02-15 14:00:00
3       1931-06-01 13:00:00
4       1933-04-18 19:00:00
5       1934-09-15 15:30:00
6       1935-06-15 00:00:00
7       1936-07-15 00:00:00
8       1936-10-15 17:00:00
9       1937-06-15 00:00:00
10      1937-08-15 21:00:00
11      1939-06-01 20:00:00
12      1939-06-30 20:00:00
13      1939-07-07 02:00:00
14      1941-06-01 13:00:00
15      1941-07-02 11:30:00
16      1942-02-25 00:00:00
17      1942-06-01 22:30:00
18      1942-07-15 01:00:00
19      1943-04-30 23:00:00
20      1943-06-01 15:00:00
21      1943-08-15 00:00:00
22      1943-08-15 00:00:00
23      1943-10-15 11:00:00
24      1944-01-01 10:00:00
25      1944-01-01 12:00:00
26      1944-01-01 12:00:00
27      1944-04-02 11:00:00
28      1944-06-01 12:00:00
29      1944-06-30 10:00:00
                ...        
18211   2000-12-28 18:00:00
18212   2000-12-28 18:20:00
18213   2000-12-28 19:10:00
18214   2000-12-29 00:00:00
18215   2000-12-29 0

In [46]:
df = pd.read_csv('../../pandas-videos/data/ufo.csv', parse_dates=['Time'])
df.head()

| | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca | | TRIANGLE | NY | 1930-06-01 22:00:00 |
| 1 | Willingboro | | OTHER | NJ | 1930-06-30 20:00:00 |
| 2 | Holyoke | | OVAL | CO | 1931-02-15 14:00:00 |
| 3 | Abilene | | DISK | KS | 1931-06-01 13:00:00 |
| 4 | New York Worlds Fair | | LIGHT | NY | 1933-04-18 19:00:00 |
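
Once a column holds real datetimes rather than strings, the `.dt` accessor, comparisons, and subtraction all work. The sketch below does not read the CSV; it rebuilds the first three rows by hand in a hypothetical `ufo_sample` frame so that it runs on its own.

```python
import pandas as pd

# First three rows of the UFO data, typed in by hand for a self-contained example.
ufo_sample = pd.DataFrame({
    "City": ["Ithaca", "Willingboro", "Holyoke"],
    "Time": ["6/1/1930 22:00", "6/30/1930 20:00", "2/15/1931 14:00"],
})

# Convert the strings with an explicit month/day/year format.
ufo_sample["Time"] = pd.to_datetime(ufo_sample["Time"], format="%m/%d/%Y %H:%M")

# Pull out components with the .dt accessor.
ufo_sample["year"] = ufo_sample["Time"].dt.year
ufo_sample["weekday"] = ufo_sample["Time"].dt.day_name()

# Datetimes can be filtered and subtracted like numbers.
recent = ufo_sample[ufo_sample["Time"] >= "1931-01-01"]
span = ufo_sample["Time"].max() - ufo_sample["Time"].min()

print(ufo_sample)
print(span.days)  # 258
```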



In [26]:
df = pd.read_csv('../exercises/athlete_events.csv')