# Exercise for midterm preparation
Instructions:

1. This exercise contains 3 questions:

* 2 questions about aggregating and summarising time series data, and

* 1 question about computing non-cumulative values in a new column from cumulative values.

2. Try to complete these exercise problems by yourself.

3. Solutions to the problems are given below.

In [9]:
import pandas as pd

## Working with datetime in Pandas DataFrame

This part goes over the following common datetime tasks, which should get you started with data analysis.

1. Convert dates and times from strings.

2. Create a datetime by combining multiple columns.

3. Get the year, month, and day.

### Convert dates and times from strings.

To convert strings to datetime, Pandas comes with a built-in function named `to_datetime()`. Let's look at a few examples.

#### With the default parameters

Without any further parameters, `to_datetime()` can parse most common date string formats. An ambiguous string such as '3/09/2022' is read month first by default. Consider the following example:

In [35]:
df = pd.DataFrame({'date': ['3/09/2022', '3/10/2022', '3/11/2022'],
                   'value': [2, 4, 8]})
df['date'] = pd.to_datetime(df['date'])
df

Unnamed: 0,date,value
0,2022-03-09,2
1,2022-03-10,4
2,2022-03-11,8
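Because a string such as '3/09/2022' is ambiguous, it may help to compare the default month-first reading with a day-first reading. The following is a minimal sketch: the variable names are illustrative, and `dayfirst=True` is a `to_datetime()` option that is not used elsewhere in this exercise.

```python
import pandas as pd

dates = pd.Series(['3/09/2022', '3/10/2022', '3/11/2022'])

# Default behaviour: the first number is read as the month (March).
month_first = pd.to_datetime(dates)

# With dayfirst=True the first number is read as the day instead.
day_first = pd.to_datetime(dates, dayfirst=True)

print(month_first.dt.month.tolist())  # [3, 3, 3]
print(day_first.dt.month.tolist())    # [9, 10, 11]
```

If the source of the data is known to use day-first dates, stating this explicitly avoids silently swapping days and months.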


#### Format to suit your needs

Your strings may be in an unusual format, such as YYYY-DD-MM HH:MM:SS. The `format` argument of `to_datetime()` allows you to pass a custom format:

In [36]:
df = pd.DataFrame({'date': ['2022-09-3', '2022-10-3', '2022-11-3'],
                   'value': [2, 4, 8]})

df['date'] = pd.to_datetime(df['date'], format="%Y-%d-%m")
df

Unnamed: 0,date,value
0,2022-03-09,2
1,2022-03-10,4
2,2022-03-11,8


In [37]:
df = pd.DataFrame({'date': ['2022 09 3', '2022 10 3', '2022 11 3'],
                   'value': [2, 3, 4]})

df['date'] = pd.to_datetime(df['date'], format="%Y %d %m")
df

Unnamed: 0,date,value
0,2022-03-09,2
1,2022-03-10,3
2,2022-03-11,4
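The two examples above only contain dates. As a hedged sketch of the YYYY-DD-MM HH:MM:SS case mentioned earlier, the same `format` argument can also describe a time part; the sample strings below are made up for illustration.

```python
import pandas as pd

# Illustrative strings in year-day-month order, followed by a time.
stamps = pd.Series(['2022-09-3 14:30:00', '2022-10-3 08:05:15'])

# %d = day, %m = month, %H:%M:%S = hours, minutes, seconds (24-hour clock).
parsed = pd.to_datetime(stamps, format="%Y-%d-%m %H:%M:%S")

print(parsed.tolist())
```

Both values fall in March 2022, because the last number of the date part is the month.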


### Create a datetime by combining multiple columns.

The function `to_datetime()` can also be used to assemble a datetime from several columns of a DataFrame. The keys (column labels) must be date units such as ['year', 'month', 'day'], or common abbreviations or plurals of them.

In [13]:
df = pd.DataFrame({'year': [2022, 2021],
                   'month': [1, 2],
                   'day': [15, 16]})
df['date'] = pd.to_datetime(df)
df

Unnamed: 0,year,month,day,date
0,2022,1,15,2022-01-15
1,2021,2,16,2021-02-16
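When a DataFrame also holds columns that are not date units, one option is to pass only the relevant columns. The sketch below assumes a hypothetical frame with an extra `hour` unit and an unrelated `value` column; the names are for illustration only.

```python
import pandas as pd

records = pd.DataFrame({'year': [2022, 2021],
                        'month': [1, 2],
                        'day': [15, 16],
                        'hour': [9, 18],
                        'value': [2, 4]})  # not a date unit

# Select only the date-unit columns; passing 'value' as well
# would not be accepted, since it is not a recognised unit.
records['timestamp'] = pd.to_datetime(records[['year', 'month', 'day', 'hour']])

print(records['timestamp'].tolist())
```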


### Determine the year, month, and day.

The built-in attributes `dt.year`, `dt.month`, and `dt.day` are used to get the year, month, and day from a Pandas datetime column.

In [6]:
df = pd.DataFrame({'date': ['2022 09 3', '2022 10 3', '2022 11 3'],
                   'value': [2, 3, 4]})

df['date'] = pd.to_datetime(df['date'], format="%Y %d %m")

df['year']= df['date'].dt.year
df['month']= df['date'].dt.month
df['day']= df['date'].dt.day
df

Unnamed: 0,date,value,year,month,day
0,2022-03-09,2,2022,3,9
1,2022-03-10,3,2022,3,10
2,2022-03-11,4,2022,3,11
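The same `dt` accessor also exposes the week of the year and the day of the week. The following is a small sketch using the three dates from the table above; the new column names are illustrative.

```python
import pandas as pd

dates = pd.DataFrame({'date': pd.to_datetime(['2022-03-09', '2022-03-10', '2022-03-11'])})

# ISO week number of the year.
dates['week'] = dates['date'].dt.isocalendar().week

# Day of the week as a number (Monday = 0) and as a name.
dates['dayofweek'] = dates['date'].dt.dayofweek
dates['day_name'] = dates['date'].dt.day_name()

print(dates)
```

All three dates fall in ISO week 10 of 2022, on Wednesday, Thursday, and Friday.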


We are going to use stock price data from Yahoo Finance. The data set includes the following columns:

* Open and Close are the prices at which a stock began and ended trading in the same period.

* Volume is the total amount of trading activity.

We load the data and pass `parse_dates=['Date']` to specify the list of columns to parse as dates.

In [18]:
df = pd.read_csv('/Users/Kaemyuijang/SCMA248/Data/yahoo_stock.csv',parse_dates = ['Date'])

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1825 entries, 0 to 1824
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       1825 non-null   datetime64[ns]
 1   High       1825 non-null   float64       
 2   Low        1825 non-null   float64       
 3   Open       1825 non-null   float64       
 4   Close      1825 non-null   float64       
 5   Volume     1825 non-null   float64       
 6   Adj Close  1825 non-null   float64       
dtypes: datetime64[ns](1), float64(6)
memory usage: 99.9 KB
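To see what `parse_dates` changes without needing the CSV file, here is a minimal sketch that reads a tiny in-memory CSV. The rows are made-up placeholder values, not taken from the Yahoo Finance file.

```python
import io
import pandas as pd

# Made-up sample rows, used only to illustrate the dtype difference.
csv_text = """Date,Close,Volume
2020-01-02,100.0,1000
2020-01-03,101.5,1200
"""

with_parse = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])
without_parse = pd.read_csv(io.StringIO(csv_text))

print(with_parse['Date'].dtype)     # a datetime64 dtype
print(without_parse['Date'].dtype)  # a string/object dtype
```

Only the parsed version supports the `dt` accessor and time-based grouping directly.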


**Exercise** Write Python code to calculate the average yearly high, low, open, close and volume.

In [56]:
# Add your code here:







**Exercise** Write Python code to calculate the average quarterly high, low, open, close and volume.

In [56]:
# Add your code here:







## Calculating non-cumulative values in a new column from cumulative values

Suppose that we have a series of cumulative values, each with a timestamp. We would like to compute the corresponding non-cumulative values, shown in the last column of the second table below.

In [48]:
df = pd.DataFrame({'date': ['3/09/2022', '3/10/2022', '3/11/2022', '3/12/2022'],
                   'cumulative_value': [2, 4, 8, 15]})
df['date'] = pd.to_datetime(df['date'])
df

Unnamed: 0,date,cumulative_value
0,2022-03-09,2
1,2022-03-10,4
2,2022-03-11,8
3,2022-03-12,15


In [44]:
df['noncumulative_value'] = [2,2,4,7]
df

Unnamed: 0,date,cumulative_value,noncumulative_value
0,2022-03-09,2,2
1,2022-03-10,4,2
2,2022-03-11,8,4
3,2022-03-12,15,7


**Exercise** Write Python code to create the column of non-cumulative values shown above.

In [47]:
# Here we first drop the noncumulative_value column.

df.drop(['noncumulative_value'],axis = 1,inplace=True)
df

Unnamed: 0,date,cumulative_value
0,2022-03-09,2
1,2022-03-10,4
2,2022-03-11,8
3,2022-03-12,15


In [56]:
# Add your code here:







## Solutions to exercise problems

**Exercise** Write Python code to calculate the average yearly high, low, open, close and volume.

**Solution** We can use `groupby` together with `pd.Grouper` and an aggregation function (here `mean()`) to aggregate and summarise the data.

#### pd.Grouper

The function `pd.Grouper` is useful when working with time-series data.

##### Year-based grouping

We will use

`pd.Grouper(key=<INPUT COLUMN>, freq=<DESIRED FREQUENCY>)`

in the following example to group our data by the supplied frequency for the specified column.

The frequency in our case is `'Y'` (year end), and the relevant column is `'Date'`. Note that pandas 2.2 and later deprecate `'Y'` in favour of `'YE'`.

In [25]:
df.groupby(pd.Grouper(key='Date',freq='Y')).mean()

Date,High,Low,Open,Close,Volume,Adj Close
2015-12-31,2072.024664,2048.042302,2061.825136,2058.87566,3593263000.0,2058.87566
2016-12-31,2102.493832,2082.798118,2093.452849,2093.592565,3920366000.0,2093.592565
2017-12-31,2453.631583,2441.352216,2447.547878,2448.491076,3399288000.0,2448.491076
2018-12-31,2760.626739,2728.128138,2745.848932,2743.89129,3626960000.0,2743.89129
2019-12-31,2923.954243,2899.792422,2911.631296,2913.895471,3553994000.0,2913.895471
2020-12-31,3184.745498,3130.553108,3160.341503,3159.985474,5000890000.0,3159.985474


In [24]:
# Sanity check: the mean Close up to the end of 2015 should match the first row of the yearly table.
df[df['Date'] <= '2015-12-31']['Close'].mean()

2058.875660431691
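As an alternative that does not depend on a frequency alias, the rows can also be grouped by the calendar year taken from `dt.year`, with `agg` naming the statistic for each column. This is a sketch on made-up toy data (the `toy` frame and its values are hypothetical), not the author's solution.

```python
import pandas as pd

# Hypothetical toy data standing in for the stock price file.
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2019-06-28', '2019-12-31', '2020-01-02', '2020-07-01']),
    'Close': [10.0, 20.0, 30.0, 50.0],
    'Volume': [100.0, 300.0, 200.0, 400.0],
})

# Group by the calendar year and average the selected columns.
yearly = toy.groupby(toy['Date'].dt.year).agg({'Close': 'mean', 'Volume': 'mean'})

print(yearly)
```

The index of the result holds plain years (2019, 2020) rather than year-end dates, which is the main visible difference from the `pd.Grouper` output.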

#### Quarter or other frequency grouping

Other standard frequencies, such as 'D', 'W', 'M', or 'Q', can be used instead of 'Y'.

**Exercise** Write Python code to calculate the average quarterly high, low, open, close and volume.

In [26]:
df.groupby(pd.Grouper(key='Date',freq='Q')).mean()

Date,High,Low,Open,Close,Volume,Adj Close
2015-12-31,2072.024664,2048.042302,2061.825136,2058.87566,3593263000.0,2058.87566
2016-03-31,1964.934285,1935.597145,1951.661427,1952.453953,4582622000.0,1952.453953
2016-06-30,2083.341771,2063.419239,2074.964738,2074.080992,3910602000.0,2074.080992
2016-09-30,2167.638709,2152.447491,2160.367405,2160.912499,3488674000.0,2160.912499
2016-12-31,2192.357178,2177.917948,2185.075657,2185.176617,3706660000.0,2185.176617
2017-03-31,2328.875228,2316.056779,2322.374108,2324.068555,3537906000.0,2324.068555
2017-06-30,2402.197354,2389.173503,2396.404501,2396.013173,3609076000.0,2396.013173
2017-09-30,2470.405759,2459.073255,2464.184976,2465.394369,3162825000.0,2465.394369
2017-12-31,2609.776829,2597.814352,2603.950851,2605.212957,3292639000.0,2605.212957
2018-03-31,2751.427566,2714.169561,2732.261209,2733.328784,3820737000.0,2733.328784
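Quarterly averages can also be sketched by converting each date to its quarter with `dt.to_period('Q')` and grouping on that. The example below uses made-up toy data (the `toy` frame is hypothetical) and is an alternative to the `pd.Grouper` call above, not a replacement for it.

```python
import pandas as pd

# Hypothetical toy data; two rows fall in the same quarter (2019 Q2).
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2019-04-15', '2019-06-28', '2019-12-31', '2020-01-02']),
    'Close': [10.0, 20.0, 30.0, 50.0],
    'Volume': [100.0, 300.0, 200.0, 400.0],
})

# Map each date to its calendar quarter, then average within each quarter.
quarter = toy['Date'].dt.to_period('Q')
quarterly = toy.groupby(quarter)[['Close', 'Volume']].mean()

print(quarterly)
```

Only quarters that actually occur in the data appear in this result, so 2019 Q3 is absent here, whereas grouping with `pd.Grouper` would typically include such an empty quarter as a row of missing values.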


**Exercise** Write Python code to create the column of non-cumulative values shown above.

In [47]:
# Here we first drop the noncumulative_value column.

df.drop(['noncumulative_value'],axis = 1,inplace=True)
df

Unnamed: 0,date,cumulative_value
0,2022-03-09,2
1,2022-03-10,4
2,2022-03-11,8
3,2022-03-12,15


To get the non-cumulative values, we take the lag of the cumulative column using the `shift` function (filling the first, missing lag with 0) and then calculate the difference between the cumulative value and its lag.

In [49]:
df.head()

Unnamed: 0,date,cumulative_value
0,2022-03-09,2
1,2022-03-10,4
2,2022-03-11,8
3,2022-03-12,15


In [54]:
df['lag'] = df.cumulative_value.shift(1).fillna(0)
df['noncumulative_value'] = df.cumulative_value - df.lag

df

Unnamed: 0,date,cumulative_value,lag,noncumulative_value
0,2022-03-09,2,0.0,2.0
1,2022-03-10,4,2.0,2.0
2,2022-03-11,8,4.0,4.0
3,2022-03-12,15,8.0,7.0
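The shift-and-subtract steps above can also be sketched with the `diff` method, which computes the difference from the previous row in one call. This is an alternative technique rather than the solution given above; the `totals` frame is a stand-in for `df`.

```python
import pandas as pd

totals = pd.DataFrame({'cumulative_value': [2, 4, 8, 15]})

# diff() leaves the first row missing, so fill it with the first
# cumulative value, then convert back to integers.
totals['noncumulative_value'] = (
    totals['cumulative_value']
    .diff()
    .fillna(totals['cumulative_value'])
    .astype(int)
)

# Check: the running sum of the non-cumulative values recovers the original.
recovered = totals['noncumulative_value'].cumsum()

print(totals)
```

The `astype(int)` step keeps the new column as integers, whereas the `shift` version above yields floats because of the intermediate missing value.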
