<h1>Home Assignment injunctions!</h1>
<li>No Loops. Use only elementwise operations</li>
<li>Your code should work for any data in the provided format. Data values (for example, customer ids, product ids, number of customers, etc.) should NEVER appear in your code!</li>

<h1>Problem 1:</h1>
Write a function that reads timeseries pricing data from a file into a pandas dataframe and then groups the data as follows:
<li>The arguments to the function are the filename and a threshold number
<li>The function reads the data in the file and creates a new column "pct_change" with the one day percent change
<li>Then groups the data into four categories:
<ul>
<li>"High+" if the percent change is greater than the threshold 
<li>"Low+" if the percent change is zero or positive and less than or equal to the threshold
<li>"Low-" if the percent change is negative but greater than or equal to -1 * the threshold
<li>"High-" if the percent change is less than -1 * the threshold
</ul>
<li>The function should return a dataframe that contains three columns (count, mean, stdev) and four index values (High+, High-, Low+, Low-)
    <p><b>Note: </b>we have to deal with nan percent changes. Make sure that you don't count a NaN in any of the four categories! (see https://pandas.pydata.org/docs/reference/api/pandas.isna.html)  <p>
For the sample data your function should return a dataframe with the following values for a threshold of 1.0:

<pre>
        count	mean	stdev
High+	870	2.162834	1.301745
High-	666	-2.291770	1.370657
Low+	870	0.463122	0.284869
Low-	883	-0.445557	0.276355

</pre>
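The NaN note above can be illustrated on a toy series. This is a minimal sketch, assuming made-up prices (not the AAPL sample data) and a hypothetical helper named `label`. A NaN fails every comparison, so a catch-all final category silently absorbs it unless it is masked out first with `pd.isna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices, not the assignment's sample data
prices = pd.Series([100.0, 102.0, 101.5, 101.5, 99.0])
previous = prices.shift(1)
pct = (prices - previous) * 100 / previous  # first value is NaN
threshold = 1.0

def label(values):
    # Hypothetical helper: np.select picks the first condition that holds;
    # whatever matches none of them gets the default label
    conditions = [values > threshold, values >= 0, values >= -threshold]
    labels = np.select(conditions, ["High+", "Low+", "Low-"], default="High-")
    return pd.Series(labels, index=values.index)

# Naive: the NaN matches no condition and lands in the default "High-"
naive_counts = label(pct).value_counts()

# Masked: drop NaN values first, so only real percent changes are counted
clean_counts = label(pct[~pd.isna(pct)]).value_counts()

print(naive_counts["High-"], clean_counts["High-"])
```

On this toy series the naive version reports one more "High-" row than the masked version, which is exactly the miscount the note warns about.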

    

<h3>read_csv</h3>
The pandas <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html">read_csv</a> function reads data from a delimited file into a pandas dataframe.
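As a small illustration of read_csv, the sketch below parses an in-memory CSV instead of a file on disk. The column names `Date` and `Close` and the values are assumptions made up for the example; the layout of the attached file may differ.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = "Date,Close\n2020-01-02,10.0\n2020-01-03,10.5\n2020-01-06,10.2\n"

# read_csv accepts a path or any file-like object;
# parse_dates converts the named column to datetime values
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
print(df.dtypes)
```

Passing a filename string in place of the `io.StringIO` object reads from disk in the same way.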



<h2>Sample data</h2>
Use the attached AAPL.csv file

In [4]:
import pandas as pd
import numpy as np

In [5]:
def change_groups(datafile,threshold):
    # Read the file named by the caller instead of a hard-coded path
    df = pd.read_csv(datafile)
    # One-day percent change of the second column (assumed to hold the price),
    # computed elementwise against the previous row
    price = df.iloc[:,1]
    df['pct_change'] = (price - price.shift(1)) * 100 / price.shift(1)
    # Drop rows containing NaN so they are not counted in any category
    df = df.dropna()
    # np.select assigns the label of the first condition that holds;
    # anything left over is below -1 * threshold
    pct = df['pct_change']
    conditions = [pct > threshold, pct >= 0, pct >= (-1*threshold)]
    df['cat'] = np.select(conditions, ['High+','Low+','Low-'], default='High-')
    return df.groupby('cat')['pct_change'].agg(['count','mean','std'])


In [6]:
#Test your code
change_groups("AAPL.csv",1.0)

cat,count,mean,std
High+,870,2.162834,1.301745
High-,666,-2.29177,1.370657
Low+,870,0.463122,0.284869
Low-,883,-0.445557,0.276355


<h1>Problem 2</h1>
A manufacturer has data on orders from customers and product prices in two dataframes (see below). They want to use this data to answer the following questions:
<ol>
    <li>Which customer is responsible for the most revenue</li>
    <li>Which customer is responsible for the highest profit</li>
    <li>Which product is responsible for the highest (dollar) profit</li>
    <li>Which customer and product combination is responsible for the most orders</li>
</ol>
<p>
Obviously, your code should work for any actual data values and pandas dataframes of any length!
<p>For the data below, your answers should be:
    
<pre>
Customer with most profit: 005
Customer with most revenue: 007
Product with most profit: 011
Customer 001 with product 010 with 4 orders is the most ordered customer product pair
</pre>


<h2>Useful functions:</h2>
<li><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html">DataFrame.sort_values</a> </li>
<li><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html">pandas dataframe join</a>: the last example on the linked page is probably what you need here!</li>
<li><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html">DataFrame.groupby</a></li>
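The three functions listed above can be combined as in the following sketch. It uses made-up toy frames (`orders`, `prices`) with hypothetical column names rather than the assignment data, and shows one way to join on a key column, aggregate per group, and pick the top entry.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the assignment's dataframes
orders = pd.DataFrame({"item": ["a", "b", "a"], "qty": [2, 1, 3]})
prices = pd.DataFrame({"item": ["a", "b"], "price": [10.0, 5.0]})

# join matches the "item" column of orders against the index of the other frame
joined = orders.join(prices.set_index("item"), on="item")
joined["value"] = joined["qty"] * joined["price"]

# Total value per item, largest first
totals = joined.groupby("item")["value"].sum().sort_values(ascending=False)
top_item = totals.index[0]
print(top_item, totals.iloc[0])
```

Nothing in the code refers to a specific item, so the same lines work for frames of any length.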

In [7]:
import pandas as pd
import numpy as np
order_data = pd.DataFrame([["001","010",24],
                           ["007","012",35],
                           ["001","011",15],
                           ["005","010",30],
                           ["007","011",17],
                           ["005","011",81],
                           ["001","010",32],
                           ["007","012",89],
                           ["001","010",16],
                           ["001","010",33]],columns=["customer","product","amount"])
products = pd.DataFrame([['010',22.3,17.2],
                        ['011',11.7,5.5],
                        ['012',62.5,61.4]],columns=['product','price','cost'])                     

In [8]:
#Your code goes here
def get_answers_prob_2(order_data, products):
    # Merge orders with prices and costs; assign() adds the per-item profit
    # without modifying the caller's products dataframe
    all_data = order_data.merge(products.assign(profit_per_item=products['price'] - products['cost']), on='product')
    all_data['revenue_per_order'] = all_data['amount'] * all_data['price']
    all_data['profit_per_order'] = all_data['amount'] * all_data['profit_per_item']
    #Question 1: revenue is amount * price, not the number of units ordered
    answer_1 = all_data.groupby('customer')['revenue_per_order'].sum().sort_values(ascending=False).index[0]
    #Question 2
    answer_2 = all_data.groupby('customer')['profit_per_order'].sum().sort_values(ascending=False).index[0]
    #Question 3
    answer_3 = all_data.groupby('product')['profit_per_order'].sum().sort_values(ascending=False).index[0]
    #Question 4: count() gives the number of orders per (customer, product) pair
    answer_4 = all_data.groupby(['customer','product'])['amount'].count().sort_values(ascending=False).index[0]
    return '''Customer with most revenue: {}
        Customer with most profit: {}
        Product with most profit: {}
        Customer/Product most orders: {}'''.format(answer_1,answer_2,answer_3,answer_4)
get_answers_prob_2(order_data,products)
    


"Customer with most revenue: 007\n        Customer with most profit: 005\n        Product with most profit: 011\n        Customer/Product most orders: ('001', '010')"

<h1>Problem 3</h1>
In this problem you'll get some practice getting and combining data from the St. Louis Federal Reserve (FRED). Get the following data from FRED (01/01/2010 to 12/31/2022):

<pre>
"TB3MS" #3 month t-bill market yield 
"DGS10" #10 year constant maturity government bond market yield
"NB000334Q" #Real GDP index quarterly (index = 100 at 2012)
"CPIAUCSL" #Consumer price index for all urban consumers seasonally adjusted
</pre>

Since these data items have different frequencies (some are daily, some monthly, some quarterly), make separate data reader calls for each. For GDP and the CPI, use percent changes quarter over quarter rather than the absolute values
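The quarter-over-quarter percent change described above can be sketched on toy data. The monthly values here are made up and are not FRED data. The sketch groups by calendar quarter with `to_period` instead of calling `resample`, a swapped-in technique chosen only because the quarter-end frequency alias accepted by `resample` differs between pandas versions.

```python
import pandas as pd

# Hypothetical monthly index levels, not real FRED observations
monthly = pd.Series(
    [100.0, 101.0, 102.0, 103.0, 104.0, 105.06],
    index=pd.date_range("2020-01-01", periods=6, freq="MS"),
)

# Label each observation with its calendar quarter and keep the last value
quarterly = monthly.groupby(monthly.index.to_period("Q")).last()

# Fractional change from one quarter to the next (first quarter is NaN)
qoq = quarterly.pct_change()
print(qoq)
```

The first quarter has no predecessor, so its percent change is NaN. `DataFrame.corr` leaves NaN pairs out when computing each correlation.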

Then, as a proxy for the stock market, get data for the ticker SPY, the S&P ETF, from tiingo. You will need to create an account and get an API Key (https://www.tiingo.com/). Use the adjusted close. Resample the data to the business quarter and calculate a quarter over quarter percent change.

Align all the data to the end of the business quarter (i.e., use the value on the last day of each quarter).

For the ETF, calculate quarter over quarter percent changes and shift the data back by one quarter (we're interested in the correlation between macroeconomic data in one quarter and the performance of the S&P in the next quarter). For example, if the percent change on 3/31 is 5% and on 6/30 is 2.5%, we want to align the percent change on 6/30 with the macroeconomic data as of 3/31. So we need to replace the value on 3/31 with 2.5%.
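The 3/31 and 6/30 example above can be reproduced directly with `shift`. This is a minimal sketch; the year 2021 is an assumption, since the text gives only the month and day.

```python
import pandas as pd

# Quarter over quarter percent changes from the example (year is hypothetical)
spy_qoq = pd.Series(
    [5.0, 2.5],
    index=pd.to_datetime(["2021-03-31", "2021-06-30"]),
    name="SPY",
)

# shift(-1) moves every value one row earlier, so 3/31 now holds the 6/30 change
aligned = spy_qoq.shift(-1)
print(aligned)
```

The last row has no following quarter, so it becomes NaN after the shift.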

Using the pandas join function, join all the data into one dataframe with the quarter end date as the index
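Joining several dataframes on a shared quarter-end index can be sketched as follows. The frames `a`, `b`, and `c`, their columns, and their values are all made up for illustration. Passing a list to `join` aligns every frame on the index of the first one, and `corr` then works column by column.

```python
import pandas as pd

# Hypothetical quarter-end dates and values
idx = pd.to_datetime(["2021-03-31", "2021-06-30", "2021-09-30", "2021-12-31"])
a = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]}, index=idx)
b = pd.DataFrame({"B": [2.0, 4.0, 6.0, 8.0]}, index=idx)
c = pd.DataFrame({"C": [4.0, 3.0, 2.0, 1.0]}, index=idx)

# Join a list of frames on the index, then compute pairwise correlations
combined = a.join([b, c])
corr = combined.corr()
print(corr)
```

In this toy data B is a multiple of A and C runs in the opposite direction, so the correlations are 1 and -1 respectively.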

Generate the correlation matrix. This is what you should get:

<pre>
	TB3MS	DGS10	NB000334Q	CPIAUCSL	SPY
TB3MS	1.000000	0.409817	0.070356	-0.001078	-0.078983
DGS10	0.409817	1.000000	0.060454	0.038290	-0.329595
NB000334Q	0.070356	0.060454	1.000000	0.089625	-0.009725
CPIAUCSL	-0.001078	0.038290	0.089625	1.000000	-0.393822
SPY	-0.078983	-0.329595	-0.009725	-0.393822	1.000000

</pre>

<h3>Notes:</h3>

1. In the shift function, positive numbers will shift forward while negative numbers will shift backward
2. tiingo returns datetime index values while fred returns date index values. You can convert datetime to date using:

    df.index = df.index.date
    
where df is the dataframe with datetime values as its index
    
3. To rename a column, use df.rename(columns={"old_name":"new_name"})
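Notes 2 and 3 can be tried out on a small frame. This is a sketch under assumptions: the timezone-aware index and the `adjClose` values are made up to mimic a datetime-indexed result, and the new column name `SPY` is chosen for illustration.

```python
import pandas as pd

# Hypothetical frame with a timezone-aware datetime index
df = pd.DataFrame(
    {"adjClose": [1.0, 2.0]},
    index=pd.date_range("2022-03-31", periods=2, freq="D", tz="UTC"),
)

# Note 2: replace the datetime values in the index with plain dates
df.index = df.index.date

# Note 3: rename returns a new dataframe, so assign the result
df = df.rename(columns={"adjClose": "SPY"})
print(type(df.index[0]), list(df.columns))
```

Because `rename` does not modify the dataframe in place by default, the result has to be assigned back for the new name to stick.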

In [9]:
#Your code goes here
import pandas as pd
from pandas_datareader import data as pdr
import datetime as dt
from TiingoApiKey import tiingo_api_key

In [10]:
#Your code goes here
t_bill = pdr.DataReader('TB3MS','fred', dt.datetime(2010,1,1), dt.datetime(2022,12,31))
gov_bond = pdr.DataReader('DGS10','fred', dt.datetime(2010,1,1), dt.datetime(2022,12,31))
gdp = pdr.DataReader('NB000334Q','fred', dt.datetime(2010,1,1), dt.datetime(2022,12,31))
cpi = pdr.DataReader('CPIAUCSL','fred', dt.datetime(2010,1,1), dt.datetime(2022,12,31))
stocks = pdr.DataReader('SPY','tiingo',dt.datetime(2010,1,1), dt.datetime(2022,12,31), api_key=tiingo_api_key)




In [11]:
gdp = gdp.resample('Q').last().pct_change()
cpi = cpi.resample('Q').last().pct_change()
t_bill = t_bill.resample('Q').last()
gov_bond = gov_bond.resample('Q').last()

In [12]:
stocks = stocks.droplevel(level=0)

In [13]:
stocks.index = stocks.index.date
stocks.index = pd.to_datetime(stocks.index)

In [14]:
sandp = pd.DataFrame(stocks.adjClose.resample('Q').last().pct_change().shift(-1))
sandp.columns = ['SPY']

In [15]:
all_data = t_bill.join([gov_bond, gdp, cpi, sandp])

In [16]:
all_data.corr()

,TB3MS,DGS10,NB000334Q,CPIAUCSL,SPY
TB3MS,1.0,0.409817,0.070356,0.049515,-0.078983
DGS10,0.409817,1.0,0.060454,0.060827,-0.329595
NB000334Q,0.070356,0.060454,1.0,0.162621,-0.009725
CPIAUCSL,0.049515,0.060827,0.162621,1.0,-0.398769
SPY,-0.078983,-0.329595,-0.009725,-0.398769,1.0
