# Web Application for an ETF Analyzer

In this project, we build a financial database and web application using SQL, Python, and the Voilà library to analyze the performance of a hypothetical fintech ETF. The ETF consists of four stocks: GDOT, GS, PYPL, and SQ. Each stock has its own table in the `etf.db` database.

We analyze the daily returns of the ETF stocks both individually and as a whole, then deploy the visualizations to a web application by using the Voilà library. The notebook is divided into the following parts:

* Analyze a single asset in the ETF

* Optimize data access with Advanced SQL queries

* Analyze the ETF portfolio

* Deploy the notebook as a web application


## Import the required libraries, connect to the SQLite database in the `etf.db` seed file that is included in the repository, create the database engine, and confirm the tables that the database contains.

In [17]:
# Importing the required libraries and dependencies
import numpy as np
import pandas as pd
import hvplot.pandas
import sqlalchemy
from datetime import date, datetime
from sqlalchemy import inspect

# Connection string for the SQLite database stored in the etf.db seed file
database_connection_string = 'sqlite:///etf.db'

# Create an engine to interact with the SQLite database
engine = sqlalchemy.create_engine(database_connection_string)

# Confirm the table names contained in the SQLite database.
print("Table names for data from stocks Green Dot Inc, Goldman Sachs Group Inc, Paypal Inc, and Square Inc.")
inspect(engine).get_table_names()

Table names for data from stocks Green Dot Inc, Goldman Sachs Group Inc, Paypal Inc, and Square Inc.


['GDOT', 'GS', 'PYPL', 'SQ']
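
As a cross-check that does not depend on SQLAlchemy, the table list can also be read from SQLite's own catalog. The following is a minimal sketch on an in-memory database; the two tables and their columns are placeholders chosen for illustration, not the real `etf.db` schema.

```python
import sqlite3

# In-memory database standing in for etf.db (assumption: illustrative schema only).
conn = sqlite3.connect(":memory:")
for table in ("GDOT", "PYPL"):
    conn.execute(f'CREATE TABLE "{table}" (time TEXT, close REAL, daily_returns REAL)')

# sqlite_master is SQLite's built-in catalog of schema objects.
rows = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
table_names = [name for (name,) in rows]
print(table_names)
conn.close()
```

Pointing `sqlite3.connect` at the `etf.db` file instead of `":memory:"` should list the same four tables that `inspect(engine).get_table_names()` reports.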

## Analyze a single asset in the FinTech ETF


### Step 1: Write a SQL `SELECT` statement that reads all the PYPL data from the database. Using the SQL `SELECT` statement, we execute a query that reads the PYPL data from the database into a Pandas DataFrame.

In [19]:
# Write a SQL query to SELECT all of the data from the PYPL table
query = """
SELECT * from PYPL
"""

# Use the query to read the PYPL data into a Pandas DataFrame and set index to "time"
fmt='%Y%m%d %H:%M:%S'
pypl_dataframe = pd.read_sql_query(query, con=engine, parse_dates={'time':fmt})
pypl_dataframe=pypl_dataframe.set_index('time')

print("\033[1m  Table with Paypal Inc. prices, volume and daily returns")
pypl_dataframe

Table with Paypal Inc. prices, volume and daily returns


time,open,high,low,close,volume,daily_returns
2016-12-16,39.90,39.90,39.1200,39.320,7298861,-0.005564
2016-12-19,39.40,39.80,39.1100,39.450,3436478,0.003306
2016-12-20,39.61,39.74,39.2600,39.740,2940991,0.007351
2016-12-21,39.84,40.74,39.8200,40.090,5826704,0.008807
2016-12-22,40.04,40.09,39.5400,39.680,4338385,-0.010227
...,...,...,...,...,...,...
2020-11-30,212.51,215.83,207.0900,214.200,8992681,0.013629
2020-12-01,217.15,220.57,214.3401,216.520,9148174,0.010831
2020-12-02,215.60,215.75,210.5000,212.660,6414746,-0.017827
2020-12-03,213.33,216.93,213.1100,214.680,6463339,0.009499
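
The read-and-index pattern above can also be written in a single call, since `pd.read_sql_query` accepts both `parse_dates` and `index_col`. The sketch below is a hedged illustration on an in-memory database: the table holds only two rows, and the dates are stored as ISO strings, which is an assumption and may differ from how `etf.db` stores them.

```python
import sqlite3
import pandas as pd

# Tiny stand-in for the PYPL table (assumption: ISO date strings, two rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PYPL (time TEXT, close REAL)")
conn.executemany(
    "INSERT INTO PYPL VALUES (?, ?)",
    [("2016-12-16", 39.32), ("2016-12-19", 39.45)],
)

# parse_dates converts the column to datetimes; index_col then makes it the index.
sample = pd.read_sql_query(
    "SELECT * FROM PYPL",
    con=conn,
    parse_dates=["time"],
    index_col="time",
)
print(sample.index.dtype)
conn.close()
```

A datetime index is what makes the later date arithmetic and time-axis plots work without further conversion.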


### Step 2: Use the `head` and `tail` functions to review the first five and the last five rows of the DataFrame. We save the beginning and end dates that are available from this dataset, since we’ll use this information to complete the analysis.

In [20]:
# Print the beginning and end dates of the period.
print(f"Beginning of period: {pypl_dataframe.index[0]}")
print(f"End of period      : {pypl_dataframe.index[-1]}")

# Calculate the length of the period, used to compute the actual annualized return later
period=(pypl_dataframe.index[-1] - pypl_dataframe.index[0])
period_in_years=period.days/365.25
print(f"Period in days is: {period.days}, which is {period_in_years:,.2f} years \n\n")

# View the first and last 5 rows of the DataFrame.
print("\033[1m  First and last rows of pypl_dataframe, with data of Paypal stock. \n")
display(pypl_dataframe.head())
display(pypl_dataframe.tail())

Beginning of period: 2016-12-16 00:00:00
End of period      : 2020-12-04 00:00:00
Period in days is: 1449, which is 3.97 years 


First and last rows of pypl_dataframe, with data of Paypal stock. 



time,open,high,low,close,volume,daily_returns
2016-12-16,39.9,39.9,39.12,39.32,7298861,-0.005564
2016-12-19,39.4,39.8,39.11,39.45,3436478,0.003306
2016-12-20,39.61,39.74,39.26,39.74,2940991,0.007351
2016-12-21,39.84,40.74,39.82,40.09,5826704,0.008807
2016-12-22,40.04,40.09,39.54,39.68,4338385,-0.010227


time,open,high,low,close,volume,daily_returns
2020-11-30,212.51,215.83,207.09,214.2,8992681,0.013629
2020-12-01,217.15,220.57,214.3401,216.52,9148174,0.010831
2020-12-02,215.6,215.75,210.5,212.66,6414746,-0.017827
2020-12-03,213.33,216.93,213.11,214.68,6463339,0.009499
2020-12-04,214.88,217.28,213.01,217.235,2118319,0.011901
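
The period length reported above can be reproduced with the standard library alone. This short sketch uses the first and last dates shown in the output; dividing by 365.25 is the same convention the notebook uses to spread one leap day over four years.

```python
from datetime import date

# First and last dates of the PYPL dataset, as shown in the output above.
start, end = date(2016, 12, 16), date(2020, 12, 4)

days = (end - start).days
years = days / 365.25  # 365.25 averages in one leap day every four years
print(days, round(years, 2))
```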


In [36]:
pypl_dataframe["Daily Returns %"]=pypl_dataframe['daily_returns']*100

time,open,high,low,close,volume,daily_returns,Daily Returns %
2016-12-16,39.90,39.90,39.1200,39.320,7298861,-0.005564,-0.556399
2016-12-19,39.40,39.80,39.1100,39.450,3436478,0.003306,0.330621
2016-12-20,39.61,39.74,39.2600,39.740,2940991,0.007351,0.735108
2016-12-21,39.84,40.74,39.8200,40.090,5826704,0.008807,0.880725
2016-12-22,40.04,40.09,39.5400,39.680,4338385,-0.010227,-1.022699
...,...,...,...,...,...,...,...
2020-11-30,212.51,215.83,207.0900,214.200,8992681,0.013629,1.362862
2020-12-01,217.15,220.57,214.3401,216.520,9148174,0.010831,1.083100
2020-12-02,215.60,215.75,210.5000,212.660,6414746,-0.017827,-1.782745
2020-12-03,213.33,216.93,213.1100,214.680,6463339,0.009499,0.949873


### Step 3: Interactive visualization for the PYPL daily returns using hvPlot.

In [38]:
# Create an interactive visualization with hvplot to plot the daily returns for PYPL.
pypl_dataframe.hvplot(
    title="PYPL Daily Returns (%)"
    ,y='Daily Returns %'
    ,xlabel= 'Date'
    ,ylabel='Returns (%)'
    ,width=800
).opts(
    color='blue')

### Step 4: Interactive visualization for the PYPL cumulative returns. 

In [33]:
# Compute the cumulative growth of the investment
growth_of_1usd_investment=(1+pypl_dataframe["daily_returns"]).cumprod()

# Convert the Series to a DataFrame and rename the column
growth_of_1usd_investment=growth_of_1usd_investment.to_frame().rename(columns={'daily_returns':'Growth of 1 USD Investment'})
print("\033[1m Table: Evolution of a $1 initial investment in PYPL on Dec 15th 2016.")

display(growth_of_1usd_investment)

# Create an interactive visualization with hvplot to plot the cumulative returns for PYPL.
growth_of_1usd_investment.hvplot(
    title="Paypal Holdings Inc -- Growth of 1 USD Initial Investment -- Period Dec-16-2016 to Dec 4th 2020"
    ,ylabel="Initial Investment \n plus Cumulative Return"
    ,xlabel= "Date"
    ,width=900
)

Table: Evolution of a $1 initial investment in PYPL on Dec 15th 2016.


time,Growth of 1 USD Investment
2016-12-16,0.994436
2016-12-19,0.997724
2016-12-20,1.005058
2016-12-21,1.013910
2016-12-22,1.003541
...,...
2020-11-30,5.417299
2020-12-01,5.475974
2020-12-02,5.378351
2020-12-03,5.429439


## Optimize the SQL Queries

For this part, we continue to analyze a single asset (PYPL) from the ETF. We use SQL queries to optimize the efficiency of accessing data from the database.



### Step 1: Access the closing prices for PYPL that are greater than 200 by completing the following steps:

    - Write a SQL `SELECT` statement to select the dates where the PYPL closing price was higher than 200.0.

    - Using the SQL statement, read the data from the database into a Pandas DataFrame, and then review the resulting DataFrame.

    - Select the “time” and “close” columns for those dates where the closing price was higher than 200.0.



In [6]:
# Write a SQL SELECT statement to select the time column 
# where the PYPL closing price was higher than 200.0.
query = """
SELECT time 
FROM PYPL
WHERE close > 200
"""

# Using the query, read the data from the database into a Pandas DataFrame, and convert date strings to date
fmt='%Y%m%d %H:%M:%S'
pypl_dates_higher_than_200 = pd.read_sql_query(query, engine, parse_dates={'time':fmt})

# Review the resulting DataFrame
print("\033[1m Earliest dates when the close price of Paypal is higher than $200. Data comes from the SQL database:")
display(pypl_dates_higher_than_200.head())

# Select those dates from the pypl dataset
pypl_higher_than_200 = pypl_dataframe.loc[pypl_dates_higher_than_200['time'],'close'].to_frame()
print("\033[1m  Earliest dates and close price of Paypal when higher than $200 in the pandas DataFrame")
display(pypl_higher_than_200.head())

Earliest dates when the close price of Paypal is higher than $200. Data comes from the SQL database:


,time
0,2020-08-05
1,2020-08-06
2,2020-08-25
3,2020-08-26
4,2020-08-27


Earliest dates and close price of Paypal when higher than $200 in the pandas DataFrame


time,close
2020-08-05,202.92
2020-08-06,204.09
2020-08-25,201.71
2020-08-26,203.53
2020-08-27,204.34
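
The same filter can return both columns directly from SQL, with the threshold passed as a bound parameter rather than written into the query text. This is a minimal sketch on an in-memory table: the two rows above 200 come from the output above, while the 2020-08-04 row is a hypothetical value added only so that the filter has something to exclude.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PYPL (time TEXT, close REAL)")
conn.executemany(
    "INSERT INTO PYPL VALUES (?, ?)",
    [
        ("2020-08-04", 198.00),  # hypothetical row below the threshold
        ("2020-08-05", 202.92),
        ("2020-08-06", 204.09),
    ],
)

# The ? placeholder lets the driver bind the threshold safely.
threshold = 200.0
rows = conn.execute(
    "SELECT time, close FROM PYPL WHERE close > ? ORDER BY time",
    (threshold,),
).fetchall()
print(rows)
conn.close()
```

Filtering in the database means only the matching rows are transferred, and selecting `close` in the same query avoids the second lookup in the pandas DataFrame.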


### Step 2: Find the top 10 daily returns for PYPL by completing the following steps:

    -  Write a SQL statement to find the top 10 PYPL daily returns. Make sure to do the following:

        * Use `SELECT` to select only the “time” and “daily_returns” columns.

        * Use `ORDER BY` to sort the results in descending order by the “daily_returns” column.

        * Use `LIMIT` to limit the results to the top 10 daily return values.

    - Using the SQL statement, read the data from the database into a Pandas DataFrame, and then review the resulting DataFrame.


In [7]:
# Write a SQL SELECT statement to select the time and daily_returns columns
# Sort the results in descending order and return only the top 10 return values
query = """
SELECT time, daily_returns
FROM PYPL
ORDER BY daily_returns DESC
LIMIT 10
"""

# Using the query, read the data from the database into a Pandas DataFrame
# The default integer index is kept because it makes the ranking easy to read, so the index is not changed to time
fmt= '%Y%m%d %H:%M:%S'
pypl_top_10_returns = pd.read_sql_query(query, engine, parse_dates={'time':fmt})

pypl_top_10_returns['daily_returns']=pypl_top_10_returns['daily_returns']*100

print("\033[1m Table with the dates when the top 10 largest daily returns of Paypal Inc. occurred:")
display(pypl_top_10_returns[['time']])
    
# Review the resulting DataFrame
print ("\n")
print("\033[1m Table with the top 10 largest daily returns of Paypal stock in percentages (%):")
display(round(pypl_top_10_returns,2))


Table with the dates when the top 10 largest daily returns of Paypal Inc. occurred:


,time
0,2020-03-24
1,2020-05-07
2,2020-03-13
3,2020-04-06
4,2018-10-19
5,2019-10-24
6,2020-11-04
7,2020-03-10
8,2020-04-22
9,2018-12-26




Table with the top 10 largest daily returns of Paypal stock in percentages (%):


,time,daily_returns
0,2020-03-24,14.1
1,2020-05-07,14.03
2,2020-03-13,13.87
3,2020-04-06,10.09
4,2018-10-19,9.34
5,2019-10-24,8.59
6,2020-11-04,8.1
7,2020-03-10,8.09
8,2020-04-22,7.53
9,2018-12-26,7.47
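
The `ORDER BY ... DESC LIMIT` pattern can be tried on a small scale as follows. This is a sketch on an in-memory table that holds the first five PYPL daily returns shown earlier in the notebook, with `LIMIT 2` instead of `LIMIT 10` so that the effect is visible on so few rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PYPL (time TEXT, daily_returns REAL)")
conn.executemany(
    "INSERT INTO PYPL VALUES (?, ?)",
    [
        ("2016-12-16", -0.005564),
        ("2016-12-19", 0.003306),
        ("2016-12-20", 0.007351),
        ("2016-12-21", 0.008807),
        ("2016-12-22", -0.010227),
    ],
)

# Sort from the largest return down, then keep only the first two rows.
top_2 = conn.execute(
    "SELECT time, daily_returns FROM PYPL ORDER BY daily_returns DESC LIMIT 2"
).fetchall()
print(top_2)
conn.close()
```

Because the sorting and truncation happen inside the database, only the requested rows reach pandas, however large the table is.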


## Analyze the Fintech ETF Portfolio

For this part, we build the entire ETF portfolio and then evaluate its performance. To do so, we build the ETF portfolio by using SQL joins to combine all the data for each asset.


### Step 1: Write a SQL query to join each table in the portfolio into a single DataFrame. To do so, complete the following steps:

    - Use a SQL inner join to join each table on the “time” column. Access the “time” column in the `GDOT` table via the `GDOT.time` syntax. Access the “time” columns from the other tables via similar syntax.

    - Using the SQL query, read the data from the database into a Pandas DataFrame. Review the resulting DataFrame.

In [8]:
# Create a SQL query to join each table in the portfolio into a single DataFrame 
# Use the time column from each table as the basis for the join
query = """
SELECT  *
FROM GDOT, GS, PYPL, SQ
WHERE GDOT.time=GS.time
AND PYPL.time=SQ.time
AND GDOT.time=PYPL.time
"""
# Using the query, read the data from the database into a Pandas DataFrame
fmt='%Y%m%d %H:%M:%S'
etf_portfolio = pd.read_sql_query(query, engine, parse_dates={'time':fmt})

# Review the resulting DataFrame
print('\n')
print("\033[1m Tables ['GDOT', 'GS', 'PYPL', 'SQ'] joined on date")
display(etf_portfolio.head())



Tables ['GDOT', 'GS', 'PYPL', 'SQ'] joined on date


Unnamed: 0,time,open,high,low,close,volume,daily_returns,time.1,open.1,high.1,...,close.1,volume.1,daily_returns.1,time.2,open.2,high.2,low.1,close.2,volume.2,daily_returns.2
0,2016-12-16,24.41,24.73,23.94,23.98,483544,-0.023218,2016-12-16,242.8,243.19,...,39.32,7298861,-0.005564,2016-12-16,14.29,14.47,14.23,14.375,4516341,0.017339
1,2016-12-19,24.0,24.01,23.55,23.79,288149,-0.007923,2016-12-19,238.34,239.74,...,39.45,3436478,0.003306,2016-12-19,14.34,14.6,14.3,14.36,3944657,-0.001043
2,2016-12-20,23.75,23.94,23.58,23.82,220341,0.001261,2016-12-20,240.52,243.65,...,39.74,2940991,0.007351,2016-12-20,14.73,14.82,14.41,14.49,5207412,0.009053
3,2016-12-21,23.9,23.97,23.69,23.86,249189,0.001679,2016-12-21,242.24,242.4,...,40.09,5826704,0.008807,2016-12-21,14.45,14.54,14.2701,14.38,3901738,-0.007591
4,2016-12-22,23.9,24.01,23.7,24.005,383139,0.006077,2016-12-22,241.23,242.86,...,39.68,4338385,-0.010227,2016-12-22,14.33,14.34,13.9301,14.04,3874004,-0.023644
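
The query above expresses the join through a comma-separated `FROM` list with `WHERE` conditions, which SQLite treats as an inner join. The same idea can be written with explicit `INNER JOIN ... ON` clauses and column aliases, as in the sketch below. It is an illustration on two in-memory tables rather than the full four-table query: the return values come from the output above, and the 2016-12-20 row is deliberately left out of the stand-in `GDOT` table (an assumption made for this example) to show that an inner join keeps only dates present in both tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE GDOT (time TEXT, daily_returns REAL)")
conn.execute("CREATE TABLE PYPL (time TEXT, daily_returns REAL)")
conn.executemany(
    "INSERT INTO GDOT VALUES (?, ?)",
    [("2016-12-16", -0.023218), ("2016-12-19", -0.007923)],
)
conn.executemany(
    "INSERT INTO PYPL VALUES (?, ?)",
    [("2016-12-16", -0.005564), ("2016-12-19", 0.003306), ("2016-12-20", 0.007351)],
)

# Explicit join syntax; aliases keep the two daily_returns columns distinct.
joined_rows = conn.execute(
    """
    SELECT GDOT.time AS time,
           GDOT.daily_returns AS gdot_returns,
           PYPL.daily_returns AS pypl_returns
    FROM GDOT
    INNER JOIN PYPL ON GDOT.time = PYPL.time
    ORDER BY GDOT.time
    """
).fetchall()
print(joined_rows)
conn.close()
```

Aliasing the columns in the `SELECT` list avoids the repeated `time`, `open`, `close`, and `daily_returns` names that `SELECT *` produces.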


### Step 2: Create a DataFrame that averages the “daily_returns” columns for all four assets. Review the resulting DataFrame.

We assume that the ETF is equally weighted, so we average the daily returns of all four assets to get the daily returns of the portfolio. We then use these portfolio returns to calculate the annualized returns and the cumulative returns. To get the average daily returns for the portfolio, we use the following code:

 ```python
 etf_portfolio_returns = etf_portfolio['daily_returns'].mean(axis=1)
 ```

In [9]:
# Create a DataFrame that averages the “daily_returns” columns for all four assets. Review the resulting DataFrame.
etf_portfolio_returns = etf_portfolio['daily_returns'].mean(axis=1)
print('\033[1mETF Portfolio Returns (%)')
display(round((etf_portfolio_returns*100),2).head(10))

ETF Portfolio Returns (%)


0   -0.70
1   -0.12
2    0.86
3   -0.10
4   -0.82
5   -0.12
6    0.03
7   -0.42
8   -0.51
9   -0.37
dtype: float64
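
The averaging step relies on a pandas behavior that is easy to miss: when several columns share one label, selecting that label returns a DataFrame with all of them. The sketch below illustrates this with a hand-built frame whose four columns are all named `daily_returns`. The values are the 2016-12-19 and 2016-12-20 returns of the four stocks, rounded to four decimals, so the averages agree with the table above only up to rounding.

```python
import pandas as pd

# Four columns with the same label, as produced by SELECT * over joined tables.
joined = pd.DataFrame(
    [
        [-0.0079, 0.0008, 0.0033, -0.0010],  # 2016-12-19: GDOT, GS, PYPL, SQ
        [0.0013, 0.0166, 0.0074, 0.0091],    # 2016-12-20: GDOT, GS, PYPL, SQ
    ],
    columns=["daily_returns"] * 4,
)

# Selecting a duplicated label returns every column that carries it.
returns_only = joined["daily_returns"]

# Equal weighting: the portfolio return is the row-wise mean.
equal_weight = returns_only.mean(axis=1)
print(returns_only.shape)
print((equal_weight * 100).round(2).tolist())
```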

In [10]:
# As a second view, we create a DataFrame that displays the value of the “daily_returns” for all four assets only, and assign an index.
# Use the time column from each table as the basis for the join
query2 = """
SELECT  GDOT.time, GDOT.daily_returns as 'GDOT.daily_returns',
   GS.daily_returns as 'GS.daily_returns',
   PYPL.daily_returns as 'PYPL.daily_returns',
   SQ.daily_returns as 'SQ.daily_returns'
FROM GDOT, GS, PYPL, SQ
WHERE GDOT.time=GS.time
AND PYPL.time=SQ.time
AND GDOT.time=PYPL.time
"""

# Using the query, read the data from the database into a Pandas DataFrame
fmt='%Y%m%d %H:%M:%S'
etf_portfolio2 = pd.read_sql_query(
        query2
        , engine
        ,parse_dates={'time':fmt}
)
etf_portfolio2=etf_portfolio2.set_index("time") 

In [11]:
print('\n')
print('\033[1m                         Daily individual returns over time (%)')
display(round((etf_portfolio2*100),2))


# We repeat the calculation of average daily returns using this table
etf_portfolio_returns = etf_portfolio2.mean(axis=1)

# Review the resulting DataFrame
display("ETF Returns (%)")
display(round(etf_portfolio_returns*100,2))



Daily individual returns over time (%)


time,GDOT.daily_returns,GS.daily_returns,PYPL.daily_returns,SQ.daily_returns
2016-12-16,-2.32,-1.67,-0.56,1.73
2016-12-19,-0.79,0.08,0.33,-0.10
2016-12-20,0.13,1.66,0.74,0.91
2016-12-21,0.17,-0.69,0.88,-0.76
2016-12-22,0.61,-0.52,-1.02,-2.36
...,...,...,...,...
2020-11-30,-4.38,-2.13,1.36,-0.72
2020-12-01,0.45,0.65,1.08,-3.78
2020-12-02,-2.73,2.44,-1.78,-0.44
2020-12-03,2.75,-0.90,0.95,1.69


'ETF Returns (%)'

time
2016-12-16   -0.70
2016-12-19   -0.12
2016-12-20    0.86
2016-12-21   -0.10
2016-12-22   -0.82
              ... 
2020-11-30   -1.46
2020-12-01   -0.40
2020-12-02   -0.63
2020-12-03    1.12
2020-12-04    0.91
Length: 999, dtype: float64

### Step 3: Use the average daily returns in the etf_portfolio_returns DataFrame to calculate the annualized returns for the portfolio. Display the annualized return value of the ETF portfolio.

To calculate the expected annualized returns, we multiply the mean of the `etf_portfolio_returns` values by 252.

To convert the decimal values to percentages, we multiply the results by 100 before printing or plotting the values.

In [12]:
# Use the average daily returns provided by the etf_portfolio_returns DataFrame 
# to calculate the annualized return for the portfolio. 
annualized_etf_portfolio_returns = etf_portfolio_returns.mean()*252

print(f"The expected annualized return, calculated using daily average return in the period, times 252 trading days is: {annualized_etf_portfolio_returns*100:,.2f}% ")

The expected annualized return, calculated using daily average return in the period, times 252 trading days is: 43.83% 


### Step 4: We use the average daily returns in the `etf_portfolio_returns` DataFrame to calculate the cumulative returns of the ETF portfolio.


In [13]:
# Use the average daily returns provided by the etf_portfolio_returns DataFrame 
# to calculate the cumulative returns
# This is the growth of 1[USD] initial investment
etf_cumulative_returns = (1+etf_portfolio_returns).cumprod()

In [14]:
# ROI
# .iloc[-1] selects the last value by position (the index holds dates, not integers)
etf_cumulative_return_above_initial_investment=etf_cumulative_returns.iloc[-1]-1
growth_of_1usd_initial_investment=etf_cumulative_return_above_initial_investment+1

# Display the final cumulative return value
print(f"The cumulative return of the investment over the full period, above the initial investment (not annualized), is {etf_cumulative_return_above_initial_investment*100:,.2f}%")
print(f"The growth of a $1.00 initial investment over the full period is ${growth_of_1usd_initial_investment:.2f} ")

The cumulative return of the investment over the full period, above the initial investment (not annualized), is 341.83%
The growth of a $1.00 initial investment over the full period is $4.42 


In [15]:
# Adjusting columns names for proper graph variables
etf_cumulative_returns_df = etf_cumulative_returns.to_frame(name='Growth of 1[USD] Initial Investment')

etf_cumulative_returns_df.tail()

time,Growth of 1[USD] Initial Investment
2020-11-30,4.374534
2020-12-01,4.357078
2020-12-02,4.329679
2020-12-03,4.378371
2020-12-04,4.41825
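
Multiplying the mean daily return by 252, as in Step 3, is an arithmetic annualization that does not account for compounding. A complementary figure is the compound annual growth rate implied by the final growth value and the length of the period. The sketch below is one possible way to compute it, using only numbers reported in this notebook (a final value of 4.41825 and a period of 1449 calendar days); the variable names are illustrative and do not appear elsewhere in the notebook.

```python
# Figures reported in this notebook.
final_growth = 4.41825   # value of 1 USD on 2020-12-04
calendar_days = 1449     # 2016-12-16 to 2020-12-04
years = calendar_days / 365.25

# Constant yearly rate that compounds 1 USD into final_growth over the period.
cagr = final_growth ** (1 / years) - 1
print(f"Compound annual growth rate: {cagr:.2%}")
```

The two measures answer different questions: the arithmetic figure scales an average day to a year, while the compound rate describes what the investment actually earned per year over the period.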


### Step 5: Using hvPlot, we create an interactive line plot that visualizes the cumulative return values of the ETF portfolio. The “time” index of the DataFrame is shown on the x-axis.

In [16]:
# Using hvplot, create an interactive line plot that visualizes the ETF portfolio's cumulative return values.
etf_cumulative_returns_df.hvplot(
    title="ETF - Equally Weighted FinTech Stocks (GDOT, GS, PYPL, SQ) Growth of 1 USD Initial Investment -- Dec-16-2016 to Dec 4th 2020"
    , ylabel="Cumulative Investment [$]"
    ,xlabel= "Date"
    ,width=900
)

## Deployment of the Notebook as a Web Application

For this part, we completed the following steps:

1. Used the Voilà library to deploy the notebook as a web application locally on the computer.

2. Included a screen recording in the GitHub repository, and added screenshots to the "README.md" file to show how the web application appears when using Voilà.