# Session 2B: Data Structuring in Pandas

*Joachim Kahr Rasmussen*



# VIDEO 1: DataFrames and Series

## Loading Stuff

In [1]:
# Loading packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns

## Pandas Data Types
*How do we work with data in Pandas?*

- We use two fundamental data structures:
  - ``Series``, and
  - ``DataFrame``.

## Pandas Series (I/V)
*What is a `Series`?*

- A vector/list with labels for each entry. Example:

In [2]:
L = [1, 1.2, 'abc', True]

my_series = pd.Series(L)
my_series

0       1
1     1.2
2     abc
3    True
dtype: object

## Pandas Series (II/V)
*What are the components in a Series?* 

From before, we could see that a Series generally consists of three components:

- `index`: label for each observation

- `values`: observation data

- `dtype`: the format of the series (`object` means any data type is allowed)
  - examples are fundamental datatypes (`float`, `int`, `bool`)  
      - in terms of precision: `float`>`int`>`bool`
      - this comes at a cost in the form of speed
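
A minimal sketch of how the `dtype` follows from the values we put in. The variable names are made up for illustration; the point is only that one "more general" value changes the dtype of the whole series:

```python
import pandas as pd

# pandas picks a dtype that can hold all values in the series
bool_ser = pd.Series([True, False])      # only booleans
int_ser = pd.Series([1, 2, 3])           # only integers
float_ser = pd.Series([1, 2.5, 3])       # one float turns everything into floats
mixed_ser = pd.Series([1, 2.5, 'abc'])   # mixed types fall back to object

for name, ser in [('bool', bool_ser), ('int', int_ser),
                  ('float', float_ser), ('mixed', mixed_ser)]:
    print(name, ser.dtype)
```

The `object` series can hold anything, but it generally cannot use the fast numeric routines that the other three can.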

## Pandas Series (III/V)
*How do we set a custom index?* 

Indices need not have a sequential structure. To see this, consider the following example:

In [3]:
num_data = range(0,3) # Generate data
indices = ['B', 'C', 'A'] # Generate index names

Now, combine to a series:

In [4]:
my_series2 = pd.Series(data=num_data, index=indices) # Create a pandas series from the two
my_series2

B    0
C    1
A    2
dtype: int64

## Pandas Series (IV/V)
*What data structure does the pandas series remind us of?*

A mix of Python list and dictionary. Consider the following simple transformation:

In [5]:
my_series.to_dict()

{0: 1, 1: 1.2, 2: 'abc', 3: True}

*Can we also convert a dictionary to a series?*

Yes, we just pass it to the `Series` constructor. Example:

In [6]:
d = {'yesterday': 0, 'today': 1, 'tomorrow':3} # Create some dictionary
my_series3 = pd.Series(d) # Use the constructor
my_series3

yesterday    0
today        1
tomorrow     3
dtype: int64

## Pandas Series (V/V)
*How is the series different from a dict?*

An important distinction: Series indices need NOT be unique! Example:

In [7]:
s = pd.Series(range(3), index=['A','A', 'A']) # Create series with same indices
print(s.index.duplicated()) # Check duplicates
print()
print(s.to_dict()) # So translating to a dict gives...

[False  True  True]

{'A': 2}


Series are both key-based and position-based (i.e. sequential).
- Remember that, unlike lists, dictionaries cannot be indexed by position!
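
To make the two access modes concrete, here is a small sketch (the series `s_dup` is a made-up example). Label-based access with `loc` returns every row that carries the label, while position-based access with `iloc` returns exactly one element:

```python
import pandas as pd

s_dup = pd.Series(range(3), index=['A', 'A', 'B'])

print(s_dup.index.is_unique)  # False: the label 'A' occurs twice
print(s_dup.loc['A'])         # label-based: returns BOTH rows labelled 'A'
print(s_dup.iloc[0])          # position-based: returns only the first element
```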

## Pandas Data Frames (I/IV)

*OK, so now we know what a series is. What is a `DataFrame` then?*

- A 2d-array (matrix) with labelled columns and rows (which are called indices). Example:

In [8]:
df = pd.DataFrame(data=[[1,2],[3,4]],
                  columns=['A', 'B'])
df

   A  B
0  1  2
1  3  4


## Pandas Data Frames (II/IV)

*How can we really think about this?*

There are at least two simple ways of seeing the pandas DataFrame:
1. A numpy array with some additional stuff (row and column labels).
2. A set of series that have been merged horizontally
    - Note that columns can have different datatypes!

Most functions from `numpy` can be applied directly to pandas objects. We can convert a DataFrame to a `numpy` array with the `values` attribute.

In [9]:
df.values

array([[1, 2],
       [3, 4]], dtype=int64)

*To note*: In Python we can describe it as a *list of lists* or sometimes a *dict of dicts*.

In [10]:
df.values.tolist()

[[1, 2], [3, 4]]

## Pandas Data Frames (III/IV)

*How can larger pandas dataframes be built?*

Similar to Series, DataFrames can be built from dictionaries.

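An important difference: When it comes to creating distinct columns, each value in the dictionary must itself be a collection (e.g. a dictionary, a list or a Series). With a dictionary of dictionaries, the outer keys become column names and the inner keys become the row index. Example: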

In [11]:
djan = {'1st': 0, '2nd': 1, '3rd':3} # Create some dictionary for january
dfeb = {'1st': -3, '2nd': -1, '3rd':-2} # Create some dictionary for february
dmar = {'1st': 3, '2nd': 5, '3rd':4} # Create some dictionary for march

d = {'january': djan, 'february': dfeb, 'march': dmar} # Create dictionary of dictionaries
my_df1 = pd.DataFrame(d) # Use the constructor
my_df1

     january  february  march
1st        0        -3      3
2nd        1        -1      5
3rd        3        -2      4
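
As a hedged alternative sketch: the same kind of table can also be built from a dictionary of lists, in which case the row labels are passed separately through `index`. The names `d_lists` and `my_df_lists` are made up for this illustration:

```python
import pandas as pd

# Each key is a column name, each list holds that column's values
d_lists = {'january': [0, 1, 3],
           'february': [-3, -1, -2],
           'march': [3, 5, 4]}

# Without the index argument, the rows would simply be labelled 0, 1, 2
my_df_lists = pd.DataFrame(d_lists, index=['1st', '2nd', '3rd'])
print(my_df_lists)
```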


## Pandas Data Frames (IV/IV)

*What happens if keys are not the same?*

No big deal... Missing entries are simply filled with `NaN` (and the affected columns become floats):

In [12]:
djan = {'1st': 0, '2nd': 1, '3rd':3} # Create some dictionary for january
dfeb = {'1st': -3, '2nd': -1, '3rd':-2} # Create some dictionary for february
dmar = {'1st': 3, '2nd': 5, '4th':4} # Create some dictionary for march

d = {'january': djan, 'february': dfeb, 'march': dmar} # Create dictionary of dictionaries
my_df2 = pd.DataFrame(d) # Use the constructor
my_df2

     january  february  march
1st      0.0      -3.0    3.0
2nd      1.0      -1.0    5.0
3rd      3.0      -2.0    NaN
4th      NaN       NaN    4.0


## Series vs DataFrames (I/II)
*How are Series related to DataFrames?*

Put simply: Every column is a Series. Example, access by key (recommended):

In [13]:
print(df['B'])

0    2
1    4
Name: B, dtype: int64


Another option is to access the column as an attribute of the object... smart, but dangerous! Sometimes it works...

In [14]:
print(df.B)

0    2
1    4
Name: B, dtype: int64


But sometimes it doesn't... To illustrate, add one more column

In [15]:
df['count'] =  5
print(df)

   A  B  count
0  1  2      5
1  3  4      5


## Series vs DataFrames (II/II)
*But when wouldn't this work?*


Now print this and see!

In [17]:
print(df.count)

<bound method DataFrame.count of    A  B  count
0  1  2      5
1  3  4      5>



Clearly, the key-based option is more robust: columns named the same as a DataFrame method, e.g. `count`, cannot be accessed as attributes.
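
A small self-contained sketch of the difference (the frame `df_demo` is made up for this illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 3], 'count': [5, 5]})

print(df_demo['count'])         # key access: always returns the column
print(callable(df_demo.count))  # attribute access: we get the DataFrame method instead
print(df_demo.count())          # calling the method counts non-missing values per column
```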

## Converting Data Types

The data type of a series can be converted with the **astype** method. Some examples:

In [18]:
print(my_series3)
print()
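print(my_series3.astype(float)) # np.float was removed in NumPy 1.24; use the builtin float
print()
print(my_series3.astype(str)) # likewise, use the builtin str instead of np.str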

yesterday    0
today        1
tomorrow     3
dtype: int64

yesterday    0.0
today        1.0
tomorrow     3.0
dtype: float64

yesterday    0
today        1
tomorrow     3
dtype: object


## Indices and Column Names
*Why don't we just use numpy arrays and matrices?*


- Inspection of data is quicker
    - What was it that column 18 represented?

- Keep track of rows after deletion
    - Again... What was it that row 18 represented!?

- Indices may contain fundamentally different data structures 
    - e.g. time series (more about this later)
    - Other datatypes (spatial data $\rightarrow$ advanced course)

- Facilitates complex operations (next session):
    - Merging datasets
    - Split-apply-combine (operations on subsets of data)
    - Method chaining (multiple operations in sequence)

## Viewing Series and Dataframes
*How can we view the contents in our dataset?*
- We can use `print` on our dataset
- We can visualize patterns by plotting

## The Head and Tail
*But what if we have a large data set with many rows?*

Let's load the 'titanic' data set that comes with the *seaborn* library:

In [19]:
import seaborn as sns
titanic = sns.load_dataset('titanic')

We now select the *first* 3 rows of the data set with the `head` method.

In [20]:
titanic.head(3)

   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0     7.25        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0    7.925        S  Third  woman       False  NaN  Southampton   yes   True


The `tail` method selects the last observations in a DataFrame. 

## Row Selection (I/III)
*How can we select certain rows in a Series for given index **keys**?* 

With the `loc` attribute. Example:

In [21]:
print(titanic.loc[range(3),['survived', 'age', 'sex']])

   survived   age     sex
0         0  22.0    male
1         1  38.0  female
2         1  26.0  female


## Row selection (II/III)
*How can we select certain rows in a Series for given index **integers**?* 

The `iloc` attribute selects rows by integer position. 

In [22]:
print(titanic.iloc[10:15,:5])

    survived  pclass     sex   age  sibsp
10         1       3  female   4.0      1
11         1       1  female  58.0      0
12         0       3    male  20.0      0
13         0       3    male  39.0      1
14         0       3  female  14.0      0


Clearly, this is very similar to working with matrices in numpy! 

## Row selection (III/III)
*Do our tools for viewing specific rows, i.e. `loc`, `iloc` work for DataFrames?* 

- Yes, we can use both `loc` and `iloc`. By default, both select rows.

In [23]:
my_idx = ['i', 'ii', 'iii']
my_cols = ['a','b']

my_data = np.arange(1,7) #my_data = [[1, 2], [3, 4], [5, 6]]
my_data = my_data.reshape(3,2)

my_df = pd.DataFrame(my_data, columns=my_cols, index=my_idx)

print(my_df)
print()
print(my_df.loc[['i','ii']])
print()
print(my_df.iloc[:2])

     a  b
i    1  2
ii   3  4
iii  5  6

    a  b
i   1  2
ii  3  4

    a  b
i   1  2
ii  3  4
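
One difference is worth a sketch: when slicing, `loc` includes the end label, while `iloc` follows the usual Python convention and excludes the end position. The frame below is rebuilt under a different name (`demo_df`) so that the snippet runs on its own:

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame(np.arange(1, 7).reshape(3, 2),
                       columns=['a', 'b'], index=['i', 'ii', 'iii'])

by_label = demo_df.loc['i':'ii']   # label slice: the end label 'ii' IS included
by_position = demo_df.iloc[0:1]    # integer slice: the end position 1 is NOT included

print(by_label)
print()
print(by_position)
```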


## Columns Selection (I/II)
*How are `loc`, `iloc` different for DataFrames?* 

- For DataFrames, we can also specify columns.

In [24]:
idx_keep = ['i','ii']
cols_keep = ['a']
print(my_df.loc[idx_keep, cols_keep])

    a
i   1
ii  3


## Columns Selection (II/II)
*How can we generally select columns in a DataFrame?* 

- Option 1: using the `[]` and providing a list of columns.
- Option 2: using `loc` and setting row selection as `:`.

In [25]:
print(my_df.loc[:,['b']])

     b
i    2
ii   4
iii  6
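
A short sketch of Option 1, using a made-up frame `demo_df`. Note the difference between passing a single key (which gives a Series) and passing a list of keys (which gives a DataFrame):

```python
import pandas as pd

demo_df = pd.DataFrame({'a': [1, 3, 5], 'b': [2, 4, 6]},
                       index=['i', 'ii', 'iii'])

col_as_series = demo_df['b']     # single key -> Series
col_as_frame = demo_df[['b']]    # list of keys -> DataFrame with one column
two_cols = demo_df[['b', 'a']]   # column order follows the order in the list

print(type(col_as_series).__name__, type(col_as_frame).__name__)
print(two_cols)
```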


## Selection quiz
*What does `:` do in `iloc` or `loc`?* 

Select all rows (columns).

## Modifying DataFrames
*Why do we want to modify DataFrames?*

- Because data rarely comes in the form we want it.


## Changing the Index (I/III)
*How can we change the index of a DataFrame?*

We change or set a DataFrame's index using its method `set_index`. Example:

In [26]:
print(my_df.set_index('a'))
print()
print(my_df)

   b
a   
1  2
3  4
5  6

     a  b
i    1  2
ii   3  4
iii  5  6


Clearly, doing so also implicitly deletes the previous index.

Also, notice that the column header *b* is now printed one line above the index name *a*.

## Changing the Index (II/III)
*Is our DataFrame changed? I.e. does it have a new index?*

No, we must overwrite it or make it into a new object:

In [27]:
print(my_df)
my_df_a = my_df.set_index('a')
print()
print(my_df_a)
print()
print(my_df_a.iloc[1,0])

     a  b
i    1  2
ii   3  4
iii  5  6

   b
a   
1  2
3  4
5  6

4


## Changing the index (III/III)

Sometimes we wish to remove the index. This is done with the `reset_index` method:

In [28]:
print(my_df_a.reset_index()) # default (drop=False): the old index becomes a column
print()
print(my_df_a.reset_index(drop=True)) # drop=True: the old index is discarded
print()
print(my_df)

   a  b
0  1  2
1  3  4
2  5  6

   b
0  2
1  4
2  6

     a  b
i    1  2
ii   3  4
iii  5  6


The old indices cannot be restored (that information was lost), but the interim index is by default made into a new variable.

By specifying the keyword `drop=True` we delete this index.

*To note:* Indices can have multiple levels; in this case `level` can be specified to reset a specific level.
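
A minimal sketch of the `level` keyword, assuming a hypothetical two-level index of years and quarters (the data are made up):

```python
import pandas as pd

# Hypothetical two-level index: (year, quarter)
idx = pd.MultiIndex.from_product([[2020, 2021], ['Q1', 'Q2']],
                                 names=['year', 'quarter'])
sales = pd.DataFrame({'sales': [10, 12, 11, 15]}, index=idx)

# Only the 'quarter' level is turned into a column; 'year' stays as the index
only_quarter_reset = sales.reset_index(level='quarter')
print(only_quarter_reset)
```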

## Changing the Column Names

Column names can simply be changed with `columns`:

In [29]:
print(my_df)
my_df.columns = ['A', 'B']
print()
print(my_df)

     a  b
i    1  2
ii   3  4
iii  5  6

     A  B
i    1  2
ii   3  4
iii  5  6


DataFrames also have a method called `rename`.

In [30]:
my_df.rename(columns={'A': 'Aa'}, inplace=True)
print(my_df)

     Aa  B
i     1  2
ii    3  4
iii   5  6


## Changing all Column Values
*How can we update values in a DataFrame?*

In [31]:
print(my_df)

# set uniform value
my_df['B'] = 3
print()
print(my_df)

# set different values
my_df['B'] = [2,17,0] 
print()
print(my_df)

     Aa  B
i     1  2
ii    3  4
iii   5  6

     Aa  B
i     1  3
ii    3  3
iii   5  3

     Aa   B
i     1   2
ii    3  17
iii   5   0


## Changing Specific Column Values
*How can we update specific values in a DataFrame?*

In [32]:
print(my_df)

# loc, iloc
my_loc2 = ['i', 'iii']
my_df.loc[my_loc2, 'Aa'] = 10

print()
print(my_df)

     Aa   B
i     1   2
ii    3  17
iii   5   0

     Aa   B
i    10   2
ii    3  17
iii  10   0


## Sorting Data

A DataFrame can be sorted with `sort_values`; this method takes one or more columns to sort by. 

In [33]:
print(my_df.sort_values(by='Aa', ascending=True))

     Aa   B
ii    3  17
i    10   2
iii  10   0


Many keyword arguments are possible for `sort_values`, including `ascending`, which we set to `False` for the variable(s) that we want in descending order. 

In addition, sorting by index is also possible with `sort_index`.

In [34]:
print(my_df.sort_index())

     Aa   B
i    10   2
ii    3  17
iii  10   0
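
As a sketch of sorting by several columns at once: `by` and `ascending` can both be lists, matched up element by element. The frame is rebuilt as `demo_df` so that the snippet is self-contained:

```python
import pandas as pd

demo_df = pd.DataFrame({'Aa': [10, 3, 10], 'B': [2, 17, 0]},
                       index=['i', 'ii', 'iii'])

# Sort by 'Aa' descending; ties in 'Aa' are broken by 'B' ascending
sorted_df = demo_df.sort_values(by=['Aa', 'B'], ascending=[False, True])
print(sorted_df)
```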


# VIDEO 2: Boolean Data

## Logical Expression for Series (I/II)
*Can we test an expression for all elements?*

Yes: `==` and `!=` compare a Series elementwise with a single object, or with another Series that has the same index. Example:

In [35]:
print(my_series3)
print()
print(my_series3 == 0)

yesterday    0
today        1
tomorrow     3
dtype: int64

yesterday     True
today        False
tomorrow     False
dtype: bool


What datatype is returned? 


## Logical Expression in Series  (II/II)
*Can we check if elements in a series equal some element in a container?*

Yes, the `isin` method. Example:

In [36]:
my_rng = list(range(2))

print(my_rng)
print()
print(my_series3.isin(my_rng)) 

[0, 1]

yesterday     True
today         True
tomorrow     False
dtype: bool


## Power of Boolean Series (I/II)
*Can we combine boolean Series?*

Yes, we can use:
- the `&` operator (*and*)
- the `|` operator (*or*)

In [37]:
print(((titanic.sex == 'female') & (titanic.age >= 30)).head(3)) # selection by multiple columns

0    False
1     True
2    False
dtype: bool


What datatype was returned? 


## Power of Boolean Series (II/II)
*Why do we care for boolean series (and arrays)?*

Mainly because we can use them to select rows based on their content.

In [38]:
print(my_series3)
print()
print(my_series3[my_series3<3])

yesterday    0
today        1
tomorrow     3
dtype: int64

yesterday    0
today        1
dtype: int64


NOTE: Boolean selection is extremely useful for dataframes!!
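
A minimal sketch of boolean selection on a DataFrame. The small `people` frame is made up (it only mimics a few titanic columns). Note that each condition is built separately and that `~` negates a boolean series:

```python
import pandas as pd

people = pd.DataFrame({'sex': ['male', 'female', 'female', 'male'],
                       'age': [22, 38, 26, 35],
                       'survived': [0, 1, 1, 0]})

is_female = people['sex'] == 'female'
is_30_plus = people['age'] >= 30

print(people[is_female & is_30_plus])    # rows where BOTH conditions hold
print(people[is_female | is_30_plus])    # rows where AT LEAST ONE condition holds
print(people.loc[~is_female, ['age']])   # negation, combined with column selection
```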

# VIDEO 3: Numeric Operations and Methods

## Numeric Operations (I/III)
*How can we make basic arithmetic operations with arrays, series and dataframes?*

The operators look just like ordinary Python arithmetic, but they are applied elementwise (as with numpy arrays). An example with squaring:

In [39]:
num_ser1 = pd.Series([2,3,2,1,1])
num_ser2 = num_ser1 ** 2

print(num_ser1)
print(num_ser2)

0    2
1    3
2    2
3    1
4    1
dtype: int64
0    4
1    9
2    4
3    1
4    1
dtype: int64


## Numeric Operations (II/III)
*Are other numeric python operators the same?*

The numeric operators `/`, `//`, `-`, `*`, `**` work as expected.

So do the comparison operators (`==`, `!=`, `>`, `<`).

*Why is this useful?*

- vectorized operations are VERY fast;
- requires very little code.
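
To see what "vectorized" means in practice, here is a small sketch comparing a plain Python list with a series (variable names are made up). The same operator behaves differently on the two:

```python
import pandas as pd

py_list = [2, 3, 2]
ser = pd.Series(py_list)

doubled_list = py_list * 2                # list repetition: [2, 3, 2, 2, 3, 2]
doubled_ser = ser * 2                     # elementwise: 4, 6, 4

squared_loop = [x ** 2 for x in py_list]  # lists need an explicit loop
squared_vec = ser ** 2                    # series do it in one vectorized step

print(doubled_list)
print(doubled_ser.tolist())
print(squared_loop == squared_vec.tolist())
```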

## Numeric  Operations (III/III)
*Can we also do this with vectors of data?*

Yes, we can also do elementwise addition, multiplication, subtractions etc. of series. Example: 

In [40]:
num_ser1 + num_ser2

0     6
1    12
2     6
3     2
4     2
dtype: int64
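
One thing to be aware of: operations between two series are matched on the *index labels*, not on position. A hedged sketch with two made-up series whose indices only partly overlap:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

total = left + right                          # labels missing on one side give NaN
total_filled = left.add(right, fill_value=0)  # treat missing entries as 0 instead

print(total)
print()
print(total_filled)
```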

## Numeric methods (I/III)

*OK, these were some quite simple operations with pandas series. Are there other numeric methods?*

Yes, pandas series and dataframes have other powerful numeric methods built-in. 

Consider an example series of 10 million randomly generated observations:

In [41]:
arr_rand = np.random.randn(10**7) # Draw 10^7 observations from standard normal , arr_rand = np.random.normal(size = 10**7)
s2 = pd.Series(arr_rand) # Convert to pandas series
s2

0         -0.114982
1         -0.142769
2         -0.252456
3         -0.045497
4          0.132941
             ...   
9999995    0.663020
9999996   -0.755392
9999997    0.157724
9999998    0.611881
9999999   -0.778720
Length: 10000000, dtype: float64

Now, display the median of this distribution:

In [42]:
s2.median() # Display median

0.00016248044579439104

Other useful methods include: `mean`, `quantile`, `min`, `max`, `std`, `describe` and many more.

In [43]:
np.round(s2.describe(),2) # Display other characteristics of distribution (rounded)

count    10000000.00
mean            0.00
std             1.00
min            -5.26
25%            -0.67
50%             0.00
75%             0.67
max             5.55
dtype: float64

## Numeric methods (II/III)
An important method is `value_counts`. This counts the number of occurrences of each distinct value. 

Example:

In [44]:
cuts = np.arange(-7, 8, 1) # bin edges from -7 to 7 with intervals of unit size
cats = pd.cut(s2, cuts) # cut into categorical data

In [45]:
cats.unique()

[(-1, 0], (0, 1], (-2, -1], (-3, -2], (1, 2], ..., (-4, -3], (4, 5], (-5, -4], (5, 6], (-6, -5]]
Length: 12
Categories (12, interval[int64]): [(-6, -5] < (-5, -4] < (-4, -3] < (-3, -2] ... (2, 3] < (3, 4] < (4, 5] < (5, 6]]

In [46]:
cats.value_counts()

(0, 1]      3413583
(-1, 0]     3413419
(1, 2]      1359455
(-2, -1]    1358695
(2, 3]       214039
(-3, -2]     213627
(-4, -3]      13274
(3, 4]        13256
(-5, -4]        330
(4, 5]          315
(5, 6]            4
(-6, -5]          3
(6, 7]            0
(-7, -6]          0
dtype: int64

Where are the observed values in the `value_counts` output: in the index or in the data?

## Numeric methods (III/III)
*Are there other powerful numeric methods?*

Yes: examples include 
- `unique`, `nunique`: the unique elements and the count of unique elements
- `cut`, `qcut`: partition series into bins 
- `diff`: difference between every two consecutive observations
- `cumsum`: cumulative sum
- `nlargest`, `nsmallest`: the n largest/smallest elements 
- `idxmin`, `idxmax`: index label of the minimal/maximal element 
- `corr`: correlation (a correlation matrix for DataFrames)

Check [series documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) for more information.
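
A quick sketch of a few of these methods on a small series (the same values as `num_ser1` above, rebuilt here as `ser` so that the snippet runs on its own):

```python
import pandas as pd

ser = pd.Series([2, 3, 2, 1, 1])

print(ser.nunique())    # 3 distinct values: 1, 2 and 3
print(ser.diff())       # NaN, 1, -1, -1, 0 (the first element has no predecessor)
print(ser.cumsum())     # 2, 5, 7, 8, 9
print(ser.nlargest(2))  # the two largest values, together with their index labels
print(ser.idxmax())     # 1: the index label of the largest value
```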

# VIDEO 4: String Operations

## String Operations (I/III)
*Do the numeric python operators also apply to strings?*

In some cases yes, and this can be done very elegantly! Consider the following example with a series:

In [47]:
names_ser1 = pd.Series(['Andreas', 'Joachim', 'Nicklas', 'Terne'])
names_ser1

0    Andreas
1    Joachim
2    Nicklas
3      Terne
dtype: object

Now add another string:

In [48]:
names_ser1 + ' works @ SAMF'

0    Andreas works @ SAMF
1    Joachim works @ SAMF
2    Nicklas works @ SAMF
3      Terne works @ SAMF
dtype: object

## String Operations (II/III)
*Can two vectors of strings also be combined, as with numeric vectors?*

Fortunately, yes:

In [49]:
names_ser2 = pd.Series(['ethics', 'python and ML', 'scraping', 'text as data'])
names_ser1 + ' teaches ' + names_ser2

0           Andreas teaches ethics
1    Joachim teaches python and ML
2         Nicklas teaches scraping
3       Terne teaches text as data
dtype: object

## String Operations (III/III)
*Any other types of vectorized operations with strings?*

Many. In particular, there is a large set of string-specific operations (see `.str`-notation below). Some examples (see table 7-5 in PDA for more - we will revisit in session 5):

In [50]:
names_ser1.str.upper() # works similarly with lower()

0    ANDREAS
1    JOACHIM
2    NICKLAS
3      TERNE
dtype: object

In [51]:
names_ser1.str.contains('as')

0     True
1    False
2     True
3    False
dtype: bool

In [52]:
names_ser1.str[1:3] # We can even do vectorized slicing of strings!

0    nd
1    oa
2    ic
3    er
dtype: object
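
A few more `.str` methods, as a hedged sketch (the series is rebuilt as `names` so that the snippet is self-contained):

```python
import pandas as pd

names = pd.Series(['Andreas', 'Joachim', 'Nicklas', 'Terne'])

name_lengths = names.str.len()                         # number of characters per name
starts_with_j = names.str.startswith('J')              # boolean series
replaced = names.str.replace('as', 'AS', regex=False)  # literal (non-regex) replacement

print(name_lengths.tolist())
print(starts_with_j.tolist())
print(replaced.tolist())
```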

# VIDEO 5: Categorical Data

## The Categorical Data Type
*Are string (or object) columns attractive to work with?*

In [53]:
pd.Series(['Pandas', 'series'])

0    Pandas
1    series
dtype: object

Now, sometimes the categorical data type is better:
- Use categorical data when many values are repeated
    - Less storage and faster computations
- You can put some order (structure) on your string data
- It also allows new features:
    - Plots have bars, violins etc. sorted according to category order

## Example of Categorical Data

Conversion to categorical data:

In [54]:
edu_list = ['BSc Political Science', 'Secondary School'] + ['High School']*2
edu_cats = ['Secondary School', 'High School', 'BSc Political Science']

str_ser = pd.Series(edu_list*10**5)

Option 1: No order

In [55]:
cat_ser = str_ser.astype('category')
cat_ser[:5]

0    BSc Political Science
1         Secondary School
2              High School
3              High School
4    BSc Political Science
dtype: category
Categories (3, object): ['BSc Political Science', 'High School', 'Secondary School']

Option 2: Order

In [56]:
cats = pd.Categorical(str_ser, categories=edu_cats, ordered=True)
cat_ser2 = pd.Series(cats, index=str_ser.index)
cat_ser2[:5]

0    BSc Political Science
1         Secondary School
2              High School
3              High School
4    BSc Political Science
dtype: category
Categories (3, object): ['Secondary School' < 'High School' < 'BSc Political Science']
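
A sketch of what we gain from the ordered categorical type, using a smaller version of the education data (rebuilt here so that the snippet runs on its own). The exact byte counts depend on the pandas version, so we only compare them:

```python
import pandas as pd

edu_list = ['BSc Political Science', 'Secondary School'] + ['High School'] * 2
edu_cats = ['Secondary School', 'High School', 'BSc Political Science']

str_ser = pd.Series(edu_list * 10**4)
cat_ser = pd.Series(pd.Categorical(str_ser, categories=edu_cats, ordered=True))

# Storage: categories are stored once, the rows only hold small integer codes
str_bytes = str_ser.memory_usage(deep=True)
cat_bytes = cat_ser.memory_usage(deep=True)
print(cat_bytes < str_bytes)

# Order: comparisons and min/max follow the category order, not the alphabet
at_least_high_school = cat_ser >= 'High School'
print(at_least_high_school.sum())
print(cat_ser.min(), '<', cat_ser.max())
```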

## Numbers as Categories

It is natural to think of measures in categories, e.g. small and large.

*Can we convert our numerical data to bins in a smart way?*

Yes, there are two methods that are useful (and you just applied one of them earlier in this session!):
- `cut` which divides data by user specified bins
- `qcut` which divides data by user specified quantiles
    - E.g. median, $q=0.5$; lower quartile threshold, $q=0.25$; etc.

In [57]:
x = pd.Series(np.random.normal(size = 10**6))
cat_ser3 = pd.qcut(x, q = [0,0.025, 0.975, 1])
cat_ser3.cat.categories

IntervalIndex([(-5.07, -1.964], (-1.964, 1.957], (1.957, 5.01]],
              closed='right',
              dtype='interval[float64]')

In [58]:
cat_ser3.cat.codes.head(5)

0    1
1    1
2    1
3    1
4    1
dtype: int8
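
To contrast the two methods, here is a minimal sketch on the integers 1 to 100 (the bin edges and labels are made up). `cut` gives bins of fixed *width*, `qcut` gives bins with equal *counts*:

```python
import pandas as pd

x = pd.Series(range(1, 101))

# cut: we choose the bin edges ourselves -> (0, 10], (10, 50], (50, 100]
by_edges = pd.cut(x, bins=[0, 10, 50, 100], labels=['low', 'mid', 'high'])

# qcut: we choose the number of quantiles -> four bins with the same number of rows
by_quantiles = pd.qcut(x, q=4, labels=['q1', 'q2', 'q3', 'q4'])

edge_counts = by_edges.value_counts()
quantile_counts = by_quantiles.value_counts()
print(edge_counts)
print(quantile_counts)
```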

## Converting to Numeric and Binary

For regression, we often want our string / categorical variable as dummy variables:
- That is, all categories have their own binary column (0 and 1)
    - Note: We may leave one 'reference' category out here (intro statistics)
- Rest as numeric

*How can we do this?*

Pass the DataFrame (or Series), `df`, to the function: `pd.get_dummies(df)`

In [59]:
pd.get_dummies(cat_ser3).head(5)

   (-5.07, -1.964]  (-1.964, 1.957]  (1.957, 5.01]
0                0                1              0
1                0                1              0
2                0                1              0
3                0                1              0
4                0                1              0
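
A sketch of how the 'reference' category can be left out, using the `drop_first` keyword on a made-up series of sizes. Passing `dtype=int` is a choice made here to get 0/1 columns regardless of the pandas version:

```python
import pandas as pd

size = pd.Series(['small', 'large', 'medium', 'small'])

dummies_all = pd.get_dummies(size, dtype=int)                   # one column per category
dummies_ref = pd.get_dummies(size, drop_first=True, dtype=int)  # first category left out

print(dummies_all)
print()
print(dummies_ref)
```

The categories are sorted alphabetically, so `large` is the one that is dropped as reference here.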


# VIDEO 6: Time Series Data

## Temporal Data Type

*Why is time so fundamental?*

Every measurement made by a human was made at some point in time - therefore, it has a "timestamp"!

## Formats for Time

*How are time stamps measured?*

1. **Datetime** (ISO 8601): Standard calendar
    - year, month, day (hour, minute, second, millisecond); timezone
    - can come as string in raw data
2. **Epoch time**: Seconds since January 1, 1970, 00:00 GMT (Greenwich time zone)
    - nanoseconds in pandas

## Time Data in Pandas

*Does Pandas store it in a smart way?*

Pandas and numpy have native support for temporal data combining datetime and epoch time.

In [60]:
str_ser2 = pd.Series(['20170101', '20170727', '20170803', '20171224'])
dt_ser = pd.to_datetime(str_ser2)
dt_ser

0   2017-01-01
1   2017-07-27
2   2017-08-03
3   2017-12-24
dtype: datetime64[ns]

## Example of Parsing Temporal Data

*How does the input type matter for how time data is parsed?*

A lot! As we will see, `to_datetime()` may assume either *datetime* or *epoch time* format, depending on whether it receives strings or integers:

In [61]:
pd.to_datetime(str_ser2)

0   2017-01-01
1   2017-07-27
2   2017-08-03
3   2017-12-24
dtype: datetime64[ns]

In [62]:
pd.to_datetime(str_ser2.astype(int))

0   1970-01-01 00:00:00.020170101
1   1970-01-01 00:00:00.020170727
2   1970-01-01 00:00:00.020170803
3   1970-01-01 00:00:00.020171224
dtype: datetime64[ns]
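
To avoid this kind of surprise, we can tell `to_datetime` explicitly what the input means. A minimal sketch, using the `format` and `unit` keywords on made-up inputs:

```python
import pandas as pd

date_strings = pd.Series(['20170101', '20171224'])
date_ints = date_strings.astype(int)

# Integers that encode calendar dates: convert to strings and state the format
from_format = pd.to_datetime(date_ints.astype(str), format='%Y%m%d')

# Integers that really are epoch time: state the unit (here seconds, not nanoseconds)
epoch_seconds = pd.Series([0, 86400])
from_epoch = pd.to_datetime(epoch_seconds, unit='s')

print(from_format)
print(from_epoch)
```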

## Time Series Data

*Why are temporal data powerful?*

We can easily make and plot time series. Example of 20 years of Apple stock prices:
- Tip: Install in terminal using: *conda install pandas-datareader*

In [63]:
from pandas_datareader import data
aapl = data.DataReader('AAPL', data_source='yahoo', start='2000')['Adj Close']
aapl.plot(figsize = (12,4), logy = True)

RemoteDataError: Unable to read URL: https://finance.yahoo.com/quote/AAPL/history?period1=946695600&period2=1625277599&interval=1d&frequency=1d&filter=history

## Time Series Components

*What is within the `aapl` series? What is a time series?*

In [64]:
aapl.head(5)

NameError: name 'aapl' is not defined

In [65]:
aapl.head(5).index

NameError: name 'aapl' is not defined

So in essence, time series in pandas are often just series of data with a time index.
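
Since the download above depends on an external data source, here is a hedged stand-in: a *simulated* price series (a random walk, NOT real Apple data) with a business-day index. The names `prices`, `feb` and `monthly_mean` are made up for this illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, so the simulated data are reproducible

# 250 business days starting Monday, January 3, 2000
dates = pd.date_range('2000-01-03', periods=250, freq='B')
log_returns = rng.normal(0, 0.01, size=250)
prices = pd.Series(100 * np.exp(np.cumsum(log_returns)), index=dates, name='price')

print(prices.head(3))
print(type(prices.index).__name__)          # the index is a DatetimeIndex

feb = prices.loc['2000-02']                 # select a whole month with a partial date string
monthly_mean = prices.resample('MS').mean() # aggregate to monthly means
print(len(feb), len(monthly_mean))
```

The series can be plotted just like the stock prices, e.g. with `prices.plot(figsize=(12,4), logy=True)`.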

## Pandas and Time Series

*Why is pandas good at handling and processing time series data?*

It has specific tools for resampling and interpolating data:
- See 11.3, 11.5 and 11.6 in PDA textbook

It handles irregular data well:
- missing values (`fillna(method='ffill')` or `data.fillna(data.mean())`)
- duplicate entries
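
A minimal sketch of both cases on made-up data. Forward filling is written with the `ffill` method here; dropping duplicated time stamps with `index.duplicated` is one possible approach, not the only one:

```python
import numpy as np
import pandas as pd

# Missing values
ts = pd.Series([1.0, np.nan, 3.0], index=pd.date_range('2017-01-01', periods=3))
forward_filled = ts.ffill()         # carry the last known value forward
mean_filled = ts.fillna(ts.mean())  # replace by the mean of the non-missing values

# Duplicate entries: keep only the first observation for each time stamp
dup = pd.Series([1.0, 2.0, 2.5],
                index=pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-02']))
deduplicated = dup[~dup.index.duplicated(keep='first')]

print(forward_filled.tolist())
print(mean_filled.tolist())
print(deduplicated)
```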



## Datetime in Pandas

*What other uses might time data have?*

We can extract data from datetime columns. These columns have the `dt` accessor and its attributes. Example:

In [66]:
dt_ser2 = pd.Series(aapl.index)
dt_ser2.dt.month #also year, weekday, hour, second

NameError: name 'aapl' is not defined

Many other useful features (e.g. aggregation over time into means, medians, etc.)
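
A self-contained sketch of the `dt` accessor, using the four dates from earlier in this session (rebuilt as `dt_demo`, with the format stated explicitly):

```python
import pandas as pd

dt_demo = pd.to_datetime(pd.Series(['20170101', '20170727', '20170803', '20171224']),
                         format='%Y%m%d')

months = dt_demo.dt.month      # 1, 7, 8, 12
years = dt_demo.dt.year        # 2017 for all four dates
weekdays = dt_demo.dt.weekday  # Monday=0, ..., Sunday=6

print(months.tolist())
print(years.tolist())
print(weekdays.tolist())
```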
