<a href="https://colab.research.google.com/github/plthiyagu/AI-Engineering/blob/master/01-Machine%20Learning/Using_Python_And_Pandas_Datareader_to_Analyze_Financial_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Using Python And Pandas Datareader to Analyze Financial Data
Finance and economics are becoming more and more interesting for all kinds of people, regardless of their career or profession. This is because we are all affected by economic data, or at least we are increasingly interested in being up-to-date, and we have a lot of information at hand.

Every day billions of bytes of financial data are sent over the Internet, whether it is the price of a share, an e-commerce transaction, or even information on a country's GDP. All this data, when properly organized and managed, can be used to build some amazing and insightful software applications.

We will use Python to access public financial data, organize it and combine it to gain new insights into how money makes the world go round. We will focus mainly on two Python modules:

Pandas - used to organize and format complex data in table structures called DataFrames.
Pandas-datareader - used to access public financial data from the Internet and import it into Python as a DataFrame.

We will use these modules to import data from some of the largest financial organizations in the world, as well as data stored locally on our computers. By the end of the notebook, you should feel comfortable importing financial data into Python, either from a public source or from a local file, organizing that data, and combining data from different sources.
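
As a minimal sketch of the "local file" case mentioned above, the snippet below reads CSV text into a DataFrame with `pd.read_csv`. The column names and values are invented for illustration, and an in-memory `io.StringIO` buffer stands in for a file on disk; with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a local file (names and values are made up)
csv_text = """date,stocks,bonds
2020-01-01,100.0,50.0
2020-01-02,102.0,49.5
2020-01-03,101.0,49.8
"""

# parse_dates converts the "date" column to timestamps; index_col makes it the row index
local_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

print(local_df)
print(local_df.dtypes)
```

Using the date column as the index gives the local data the same shape as the DataFrames returned by pandas-datareader, which makes the two easier to combine later.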

Importing data via Datareader
Many financial institutions, stock markets and global banks make large amounts of the data they store available to the public. Most of this data is well organized, kept up to date and accessible through an application programming interface (API), which gives programming languages such as Python a way to download and import it.

The pandas-datareader module is specifically designed to interface with some of the world's most popular financial data APIs, and import their data into an easily digestible pandas DataFrame. Each financial API is accessed by a different function exposed by pandas-datareader. Generally, accessing each API requires a different set of arguments and information to be provided by the programmer.

We will import data from several of these APIs and play with them. For a complete list of all data that the pandas-datareader can access, you can consult the official documentation.



In [1]:
!pip3 install pandas_datareader



In [2]:
from pandas_datareader import wb
from datetime import datetime

In [3]:
start = datetime(2005, 1, 1)
end = datetime(2008, 1, 1)
indicator_id = 'NY.GDP.PCAP.KD'

In [4]:
gdp_per_capita = wb.download(indicator=indicator_id, start=start, end=end, country=['US', 'CA', 'MX'])

In [5]:
gdp_per_capita

country,year,NY.GDP.PCAP.KD
Canada,2008,48495.20404
Canada,2007,48534.174477
Canada,2006,45857.996552
Canada,2005,44471.08006
Mexico,2008,9587.636339
Mexico,2007,9622.047957
Mexico,2006,9547.333571
Mexico,2005,9270.656542
United States,2008,49319.478865
United States,2007,49856.28149
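
The result above is indexed by two levels, country and year. As a hedged sketch of how such a frame can be reshaped for side-by-side comparison, the snippet below rebuilds the Canada and Mexico rows by hand (values copied from the output, years assumed to be strings) and uses `unstack` to turn countries into columns. The names `gdp_sample` and `by_year` are introduced here for illustration only.

```python
import pandas as pd

# Canada and Mexico rows copied from the output above; years are assumed to be strings
index = pd.MultiIndex.from_product(
    [["Canada", "Mexico"], ["2008", "2007", "2006", "2005"]],
    names=["country", "year"],
)
gdp_sample = pd.DataFrame(
    {
        "NY.GDP.PCAP.KD": [
            48495.20404, 48534.174477, 45857.996552, 44471.08006,
            9587.636339, 9622.047957, 9547.333571, 9270.656542,
        ]
    },
    index=index,
)

# Move the "country" level into the columns: one row per year, one column per country
by_year = gdp_sample["NY.GDP.PCAP.KD"].unstack(level="country").sort_index()

print(by_year)
```

With one column per country and the years in ascending order, operations such as plotting or computing year-over-year changes become simpler.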


Getting the NASDAQ Symbols

The NASDAQ Stock Exchange identifies each of its shares with a unique symbol:

Apple - AAPL

Google - GOOGL

Tesla - TSLA

It also provides a useful API for accessing the symbols currently traded on it. Pandas-datareader provides several functions to import data from the NASDAQ API through its nasdaq_trader sub-module.



In [6]:
from pandas_datareader.nasdaq_trader import get_nasdaq_symbols

To import the list of stock symbols, we use the function get_nasdaq_symbols from nasdaq_trader, like this:



In [7]:
symbols = get_nasdaq_symbols()

When called, it will go to the NASDAQ API, and import the list of symbols currently being traded. The advantage of using pandas-datareader is that all the logic to interact with the NASDAQ API or any other API is encapsulated in easy-to-use sub-modules and functions like the ones above.



In [8]:
symbols

Symbol,Nasdaq Traded,Security Name,Listing Exchange,Market Category,ETF,Round Lot Size,Test Issue,Financial Status,CQS Symbol,NASDAQ Symbol,NextShares
A,True,"Agilent Technologies, Inc. Common Stock",N,,False,100.0,False,,A,A,False
AA,True,Alcoa Corporation Common Stock,N,,False,100.0,False,,AA,AA,False
AAA,True,Listed Funds Trust AAF First Priority CLO Bond...,P,,True,100.0,False,,AAA,AAA,False
AAAU,True,Goldman Sachs Physical Gold ETF Shares,P,,True,100.0,False,,AAAU,AAAU,False
AACG,True,ATA Creativity Global - American Depositary Sh...,Q,G,False,100.0,False,N,,AACG,False
...,...,...,...,...,...,...,...,...,...,...,...
ZXYZ.A,True,Nasdaq Symbology Test Common Stock,Q,Q,False,100.0,True,N,,ZXYZ.A,False
ZXZZT,True,NASDAQ TEST STOCK,Q,G,False,100.0,True,N,,ZXZZT,False
ZYME,True,Zymeworks Inc. Common Shares,N,,False,100.0,False,,ZYME,ZYME,False
ZYNE,True,"Zynerba Pharmaceuticals, Inc. - Common Stock",Q,G,False,100.0,False,N,,ZYNE,False


In [9]:
symbols.loc['IBM']

Nasdaq Traded                                                    True
Security Name       International Business Machines Corporation Co...
Listing Exchange                                                    N
Market Category                                                      
ETF                                                             False
Round Lot Size                                                    100
Test Issue                                                      False
Financial Status                                                  NaN
CQS Symbol                                                        IBM
NASDAQ Symbol                                                     IBM
NextShares                                                      False
Name: IBM, dtype: object
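
Besides looking up a single symbol with `.loc`, the table can be filtered by its columns. The sketch below rebuilds a few rows by hand (copied from the output above, keeping only some columns) so that it runs without network access, and then drops ETFs and test issues with a boolean mask. The names `symbols_sample` and `common` are illustrative.

```python
import pandas as pd

# A few rows copied from the symbols table above (only some columns kept)
symbols_sample = pd.DataFrame(
    {
        "Security Name": [
            "Alcoa Corporation Common Stock",
            "Goldman Sachs Physical Gold ETF Shares",
            "NASDAQ TEST STOCK",
            "Zymeworks Inc. Common Shares",
        ],
        "Listing Exchange": ["N", "P", "Q", "N"],
        "ETF": [False, True, False, False],
        "Test Issue": [False, False, True, False],
    },
    index=pd.Index(["AA", "AAAU", "ZXZZT", "ZYME"], name="Symbol"),
)

# Keep ordinary securities: rows that are neither ETFs nor test issues
mask = ~symbols_sample["ETF"] & ~symbols_sample["Test Issue"]
common = symbols_sample[mask]

print(common.index.tolist())
```

The same mask should work on the full `symbols` DataFrame, assuming its ETF and Test Issue columns are boolean as they appear in the output.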

Technical Analysis in Finance

Technical analysis in finance is the analysis of stocks (or indices) by means of statistics and charts. Let's see how to do something very simple with Plotly, a Python library for charting. In this case we'll access Microsoft's daily quotes. Let's do it!



In [10]:
!pip3 install plotly



In [11]:
import plotly.graph_objects as go

In [12]:
import pandas_datareader.data as web

stock = 'MSFT'
start = datetime(2019, 1, 1)

df = web.DataReader(stock, data_source='yahoo', start=start)

In [13]:
df

Date,High,Low,Open,Close,Volume,Adj Close
2019-01-02,101.750000,98.940002,99.550003,101.120003,35329300.0,98.602066
2019-01-03,100.190002,97.199997,100.099998,97.400002,42579100.0,94.974693
2019-01-04,102.510002,98.930000,99.720001,101.930000,44060600.0,99.391899
2019-01-07,103.269997,100.980003,101.639999,102.059998,35656100.0,99.518669
2019-01-08,103.970001,101.709999,103.040001,102.800003,31514400.0,100.240234
...,...,...,...,...,...,...
2020-12-18,219.690002,216.020004,218.589996,218.589996,63354900.0,218.589996
2020-12-21,224.000000,217.279999,217.550003,222.589996,37181900.0,222.589996
2020-12-22,225.630005,221.850006,222.690002,223.940002,22612200.0,223.940002
2020-12-23,223.559998,220.800003,223.110001,221.020004,18699600.0,221.020004


We have accessed the data for MSFT. We did this by importing the data sub-module from pandas_datareader and giving it the alias web. Under the hood, the data comes from the Yahoo Finance API, but pandas-datareader lets us fetch it in a very simple way. Now we are going to plot the result to do some technical analysis.



In [14]:
graph = {
    'x': df.index,
    'open': df.Open,
    'close': df.Close,
    'high': df.High,
    'low': df.Low,
    'type': 'candlestick',
    'name': 'MSFT',
    'showlegend': True
}

In [15]:
# go.Figure returns a figure object, so it is named fig rather than layout
fig = go.Figure(
    data=[graph],
    layout_title="Microsoft Stock"
)

In [16]:
fig

We just did something very interesting: we charted MSFT's stock with up-to-date data! When I wrote this on November 20, 2020, the last point on my graph was that day's quote; since no end date is passed to DataReader, running the code later extends the data to the latest available quote. You can do the same: place the mouse at the right edge of the graph to see the last quote of the stock. You could also run this code daily for the tickers in your investment portfolio and do a technical analysis of them!
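
A common next step in technical analysis is a moving average of the closing price. The sketch below is not part of the original notebook: it computes a 3-day simple moving average with pandas `rolling` on the first five closing prices copied from the MSFT table above, so it runs without a network connection. The window length of 3 is an arbitrary choice for illustration.

```python
import pandas as pd

# Closing prices copied from the first five rows of the MSFT table above
close = pd.Series(
    [101.120003, 97.400002, 101.930000, 102.059998, 102.800003],
    index=pd.to_datetime(
        ["2019-01-02", "2019-01-03", "2019-01-04", "2019-01-07", "2019-01-08"]
    ),
    name="Close",
)

# 3-day simple moving average; the first two entries are NaN
# because the window is not yet full
sma_3 = close.rolling(window=3).mean()

print(sma_3)
```

Applied to the full Close column, the resulting series could be drawn on top of the candlestick chart as an additional line trace.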

Data filtering by date

Many of the APIs that pandas-datareader connects to allow us to filter the data we get by date. Financial institutions tend to keep track of data that goes back several decades, and when we import that data, it is useful to be able to specify exactly which period we want it to cover.

An API that does just that is FRED (Federal Reserve Economic Data), maintained by the Federal Reserve Bank of St. Louis, which we can access by first importing the pandas_datareader.data sub-module and then calling its DataReader function:



In [17]:
import pandas_datareader.data as web

start = datetime(2019, 1, 1)
end = datetime(2019, 2, 1)

In [18]:
sap_data = web.DataReader('SP500', 'fred', start, end)

In [19]:
sap_data

DATE,SP500
2019-01-01,
2019-01-02,2510.03
2019-01-03,2447.89
2019-01-04,2531.94
2019-01-07,2549.69
2019-01-08,2574.41
2019-01-09,2584.96
2019-01-10,2596.64
2019-01-11,2596.26
2019-01-14,2582.61


In the call above, the DataReader function takes 4 arguments:

'SP500' - An identifier provided by the API that specifies the data we want to retrieve, in this case the S&P 500 index
'fred' - The name of the API we want to access
start, end - The range of dates we want the data to cover

By changing the start and end dates, you can easily filter the data you receive.
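
Once the data is in a DataFrame, it can also be narrowed further locally. The sketch below rebuilds the first rows of the SP500 output by hand so that it runs offline, then selects a date range with `.loc` and drops the holiday row that has no quote. The names `sp500_sample`, `first_days` and `clean` are illustrative.

```python
import numpy as np
import pandas as pd

# Values copied from the SP500 output above; 2019-01-01 has no quote, hence NaN
sp500_sample = pd.DataFrame(
    {"SP500": [np.nan, 2510.03, 2447.89, 2531.94, 2549.69, 2574.41]},
    index=pd.to_datetime(
        ["2019-01-01", "2019-01-02", "2019-01-03",
         "2019-01-04", "2019-01-07", "2019-01-08"]
    ),
)
sp500_sample.index.name = "DATE"

# Label-based slicing on a DatetimeIndex includes both endpoints
first_days = sp500_sample.loc["2019-01-02":"2019-01-04"]

# Remove rows without a quote (for example market holidays)
clean = sp500_sample.dropna()

print(first_days)
print(len(clean))
```

Filtering at the API level with start and end avoids downloading data that is not needed, while local slicing is handy for exploring sub-periods of data already in memory.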

Using the shift() operation

Once we've imported a DataFrame full of financial data, there are some pretty cool ways to manipulate it. In this exercise we will see the shift() operation, a DataFrame method that moves all the rows in a column up or down by a given number of positions.



In [20]:
start = datetime(2008, 1, 1)
end = datetime(2018, 1, 1)

In [21]:
gdp = web.DataReader('GDP', 'fred', start, end)
gdp.head()

DATE,GDP
2008-01-01,14651.039
2008-04-01,14805.611
2008-07-01,14835.187
2008-10-01,14559.543
2009-01-01,14394.547


We have imported GDP from FRED. Now we will create a new column called Growth that holds the difference (in absolute terms, not percentages) between each quarter and the previous one.



In [22]:
gdp['Growth'] = gdp['GDP'] - gdp['GDP'].shift(1)

In [23]:
gdp.head()

DATE,GDP,Growth
2008-01-01,14651.039,
2008-04-01,14805.611,154.572
2008-07-01,14835.187,29.576
2008-10-01,14559.543,-275.644
2009-01-01,14394.547,-164.996


We can now see the absolute differences in this new column. An important clarification: the first row of the Growth column is 'NaN' because it is the first row of the dataset and has no previous row to subtract.
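
To make the mechanics of shift() visible, the sketch below rebuilds the five GDP rows shown above by hand and keeps the shifted column next to the original. It also adds a relative growth column, which is an extension not present in the original notebook. The column names Previous and Growth % are illustrative.

```python
import pandas as pd

# GDP values copied from the output above (quarterly observations)
gdp_sample = pd.DataFrame(
    {"GDP": [14651.039, 14805.611, 14835.187, 14559.543, 14394.547]},
    index=pd.to_datetime(
        ["2008-01-01", "2008-04-01", "2008-07-01", "2008-10-01", "2009-01-01"]
    ),
)

# shift(1) moves every value down one row, so each row sees the previous quarter
gdp_sample["Previous"] = gdp_sample["GDP"].shift(1)

# Absolute change, as in the notebook
gdp_sample["Growth"] = gdp_sample["GDP"] - gdp_sample["Previous"]

# Relative change in percent, built from the same shifted column
gdp_sample["Growth %"] = (gdp_sample["GDP"] / gdp_sample["Previous"] - 1) * 100

print(gdp_sample)
```

The subtraction could also be written as `gdp_sample["GDP"].diff()`, which gives the same result; writing it with shift() shows where the NaN in the first row comes from.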

Calculating basic financial statistics

Two useful calculations that can be made with financial data are variance and covariance. To illustrate these concepts, let's use the example of a DataFrame that measures stock and bond prices over time.



Variance

Variance measures how far a set of numbers is spread out from its average. In finance, it is used to gauge the volatility of investments.

dataframe['stocks'].var() # 106427
dataframe['bonds'].var() # 2272
In the above variance calculations, the variance of stocks is much greater than that of bonds (106427 vs 2272). That's because stock prices are more dispersed than bond prices, indicating that stocks are a more volatile investment.
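
The DataFrame behind the figures above is not shown, so the sketch below uses invented prices (they do not reproduce 106427 or 2272) to show what var() computes. It compares a hand-written sample variance, which divides by n - 1, with the pandas result.

```python
import pandas as pd

# Hypothetical prices, invented for illustration only
dataframe = pd.DataFrame(
    {
        "stocks": [100.0, 120.0, 90.0, 130.0, 110.0],
        "bonds": [50.0, 49.0, 51.0, 48.5, 50.5],
    }
)


def sample_variance(values):
    """Average squared deviation from the mean, using the n - 1 denominator."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


manual = sample_variance(dataframe["stocks"].tolist())

# pandas uses the same n - 1 denominator by default (ddof=1)
print(manual, dataframe["stocks"].var())
print(dataframe["bonds"].var())
```

In this made-up data the stock prices swing by tens of units while the bond prices move by about one unit, so the stock variance is far larger, mirroring the comparison in the text.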

Covariance

Covariance, in a financial context, describes how the returns of two different investments move together over a period of time, and can be used to help balance a portfolio. Calling cov() on a DataFrame with our stock and bond columns produces a matrix that holds the covariance between each pair of columns. Covariance is closely related to correlation, which is simply covariance rescaled to the range -1 to 1. In our example data, when stock prices go up, bonds go down, so their covariance is negative. We can use the covariance function to see this numerically.

dataframe.cov()
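
As a small worked example, again on invented prices rather than the notebook's data, the sketch below calls cov() and corr() on a two-column DataFrame in which stocks rise when bonds fall.

```python
import pandas as pd

# Hypothetical prices, invented for illustration: bonds fall when stocks rise
dataframe = pd.DataFrame(
    {
        "stocks": [100.0, 120.0, 90.0, 130.0, 110.0],
        "bonds": [50.0, 49.0, 51.0, 48.5, 50.5],
    }
)

# Covariance matrix: the diagonal holds each column's variance,
# the off-diagonal entries hold the covariance between the two columns
cov_matrix = dataframe.cov()

# Correlation matrix: covariance rescaled to the range -1 to 1
corr_matrix = dataframe.corr()

print(cov_matrix)
print(corr_matrix)
```

The off-diagonal covariance is negative here, which is the numerical counterpart of "when stocks go up, bonds go down". The correlation has the same sign but is easier to compare across pairs because it does not depend on the price scale.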


In [24]:
import numpy as np

def log_return(prices):
  return np.log(prices / prices.shift(1))
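
The log_return function above divides each price by the previous one and takes the natural logarithm. The following sketch repeats the function so that it is self-contained and applies it to three invented prices. It also checks a property that makes log returns convenient: they add up to the log of the overall price ratio.

```python
import numpy as np
import pandas as pd


def log_return(prices):
    # Same definition as in the notebook: log of price relative to the previous row
    return np.log(prices / prices.shift(1))


# Hypothetical prices: up 10%, then down 10%
prices = pd.Series([100.0, 110.0, 99.0])

returns = log_return(prices)

# sum() skips the leading NaN; the total equals log(last price / first price)
total = returns.sum()

print(returns)
print(total, np.log(99.0 / 100.0))
```

The first entry is NaN because the first price has no predecessor, just as with the Growth column earlier.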

In [25]:
start = datetime(1999, 1, 1)
end = datetime(2019, 1, 1)

nasdaq_data = web.DataReader("NASDAQ100", "fred", start, end)
sap_data = web.DataReader("SP500", "fred", start, end)
gdp_data = wb.download(indicator='NY.GDP.MKTP.CD', country=['US'], start=start, end=end)
export_data = wb.download(indicator='NE.EXP.GNFS.CN', country=['US'], start=start, end=end)

In [26]:
nasdaq_returns = log_return(nasdaq_data['NASDAQ100'])

In [27]:
nasdaq_returns

DATE
1999-01-01         NaN
1999-01-04         NaN
1999-01-05    0.025876
1999-01-06    0.031526
1999-01-07    0.001221
                ...   
2018-12-26         NaN
2018-12-27    0.004069
2018-12-28   -0.000483
2018-12-31    0.007087
2019-01-01         NaN
Name: NASDAQ100, Length: 5218, dtype: float64
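
The NaN values inside the series above (for example on 1999-01-04 and 2018-12-26) appear because a day without a quote makes both its own return and the next day's return undefined. A possible way to handle this, shown here on invented prices, is to drop the missing quotes before computing returns. This is a sketch of one option, not what the notebook does.

```python
import numpy as np
import pandas as pd


def log_return(prices):
    # Same definition as in the notebook
    return np.log(prices / prices.shift(1))


# Hypothetical prices; the NaN stands for a holiday with no quote
prices = pd.Series(
    [100.0, np.nan, 102.0, 103.0],
    index=pd.to_datetime(["2019-01-02", "2019-01-03", "2019-01-04", "2019-01-07"]),
)

# Without cleaning: the holiday and the day after it both become NaN
raw_returns = log_return(prices)

# With cleaning: the return after the holiday spans the gap instead
clean_returns = log_return(prices.dropna())

print(raw_returns)
print(clean_returns)
```

Whether a return that spans a gap should count as a single daily return is a modelling choice. pandas var() skips NaN values by default, so the variances computed later are taken over the defined returns only.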

In [28]:
sap_returns = log_return(sap_data['SP500'])

In [29]:
sap_returns

DATE
2010-12-27         NaN
2010-12-28    0.000771
2010-12-29    0.001009
2010-12-30   -0.001509
2010-12-31   -0.000191
                ...   
2018-12-26         NaN
2018-12-27    0.008526
2018-12-28   -0.001242
2018-12-31    0.008457
2019-01-01         NaN
Name: SP500, Length: 2092, dtype: float64

In [30]:
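# Note: wb.download returns years newest-first, so shift(1) divides each year by the following year and the signs come out inverted
gdp_returns = log_return(gdp_data['NY.GDP.MKTP.CD'])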

In [31]:
gdp_returns

country        year
United States  2019         NaN
               2018   -0.040615
               2017   -0.052921
               2016   -0.042083
               2015   -0.026545
               2014   -0.039026
               2013   -0.043275
               2012   -0.035650
               2011   -0.041243
               2010   -0.036063
               2009   -0.036900
               2008    0.018100
               2007   -0.017898
               2006   -0.045096
               2005   -0.057963
               2004   -0.065203
               2003   -0.063851
               2002   -0.046611
               2001   -0.032961
               2000   -0.031631
               1999   -0.062554
Name: NY.GDP.MKTP.CD, dtype: float64
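
In the output above the years run from 2019 down to 1999 and most values are negative, because shift(1) pairs each year with the following year rather than the previous one. The sketch below uses invented values to show the effect and one possible remedy, sorting the index into chronological order first. It also checks that the variance is the same either way, since reversing the order only flips the sign of each return.

```python
import numpy as np
import pandas as pd


def log_return(prices):
    # Same definition as in the notebook
    return np.log(prices / prices.shift(1))


# Hypothetical yearly values, listed newest-first like the World Bank output
gdp = pd.Series(
    [132.0, 110.0, 100.0],
    index=pd.Index(["2002", "2001", "2000"], name="year"),
)

# As downloaded: each year is divided by the FOLLOWING year, so growth looks negative
as_downloaded = log_return(gdp)

# Sorted oldest-first: each year is divided by the PREVIOUS year
chronological = log_return(gdp.sort_index())

print(as_downloaded)
print(chronological)
print(as_downloaded.var(), chronological.var())
```

Because the variance is unaffected by the sign flip, the variance figures computed later in the notebook would be unchanged by this reordering, but the individual returns would line up with the years they describe.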

In [32]:
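# Note: wb.download returns years newest-first, so shift(1) divides each year by the following year and the signs come out inverted
export_returns = log_return(export_data['NE.EXP.GNFS.CN'])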

In [33]:
export_returns

country        year
United States  2019         NaN
               2018    0.005533
               2017   -0.062895
               2016   -0.064079
               2015    0.017222
               2014    0.045653
               2013   -0.042320
               2012   -0.036803
               2011   -0.041123
               2010   -0.130190
               2009   -0.154485
               2008    0.149476
               2007   -0.100832
               2006   -0.120293
               2005   -0.120663
               2004   -0.102871
               2003   -0.127967
               2002   -0.036798
               2001    0.025597
               2000    0.067562
               1999   -0.099148
Name: NE.EXP.GNFS.CN, dtype: float64

In [34]:
print('nasdaq_returns:', nasdaq_returns.var())

nasdaq_returns: 0.0003178379833057235


In [35]:
print('sap_returns:', sap_returns.var())

sap_returns: 8.401972356103525e-05


In [36]:
print('gdp_returns:', gdp_returns.var())

gdp_returns: 0.000342043019210802


In [37]:
print('export_returns:', export_returns.var())

export_returns: 0.006201903531105135
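
The four variances above are not directly comparable: the NASDAQ and S&P 500 figures come from daily returns, while the GDP and export figures come from yearly returns. One common convention, assumed here and not taken from the notebook, is to convert a daily variance into an annualized volatility by multiplying by roughly 252 trading days and taking the square root. This treats daily returns as independent, which is a simplification.

```python
import math

# Assumption: about 252 trading days per year (a common convention)
TRADING_DAYS_PER_YEAR = 252

# Daily variances copied from the outputs above
nasdaq_daily_var = 0.0003178379833057235
sap_daily_var = 8.401972356103525e-05


def annualized_volatility(daily_variance, periods=TRADING_DAYS_PER_YEAR):
    """Scale a per-period variance to a yearly standard deviation."""
    return math.sqrt(daily_variance * periods)


nasdaq_vol = annualized_volatility(nasdaq_daily_var)
sap_vol = annualized_volatility(sap_daily_var)

print(f"NASDAQ 100 annualized volatility: {nasdaq_vol:.1%}")
print(f"S&P 500 annualized volatility: {sap_vol:.1%}")
```

Note also that the two daily series cover different periods in the outputs above (the S&P 500 data starts in late 2010, the NASDAQ 100 data in 1999), so even after annualizing, the comparison should be read with care.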
