## Pipeline Demo: Daily Stock Market Data Analytics

## <b><font color='red'>The buildout and documentation of this pipeline demo are still in progress.</font></b>

#### End-to-end pipeline demonstrating the implementation of the following steps:
1. **Collect data from API and store in data lake**: Python, AWS CLI, AWS S3, AWS EC2 *(In progress)*
    * Download list of <a href="https://www.nasdaq.com/screening/companies-by-industry.aspx" target="blank">NASDAQ companies</a>.
    * Build a collection of tickers from the company list, using a search argument to specify the companies for which to collect market data.
    * Query the Daily Adjusted Stock Market Time Series <a href="https://www.alphavantage.co/documentation" target="blank">web service</a> for every specified company and every date within a specified date range.<br><br>
2. **Transform data in data lake**: Python, PySpark, AWS EMR (Spark) *(Coming soon)*
3. **Import data into analytics columnar database**: AWS Redshift, SQL *(Coming soon)*
4. **Build pipeline orchestration & scheduling engine**: Python, Apache Airflow *(Coming soon)*
5. **Surface data to public using RESTful web API**: Python, Django *(Coming soon)*
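
The query step in item 1 could be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual implementation: the helper names `buildDailyAdjustedURL` and `filterByDateRange`, the `demo` API key, and the hand-written sample payload are all assumptions. The `TIME_SERIES_DAILY_ADJUSTED` function name and the `Time Series (Daily)` response key are taken from the Alpha Vantage documentation linked above and should be checked against it.

```python
from datetime import date
from urllib.parse import urlencode

ALPHA_VANTAGE_QUERY_URL = 'https://www.alphavantage.co/query'

def buildDailyAdjustedURL(symbol, apiKey):
    # Build the request URL for one ticker (hypothetical helper).
    parameters = {
        'function': 'TIME_SERIES_DAILY_ADJUSTED',
        'symbol': symbol,
        'outputsize': 'full',
        'apikey': apiKey,
    }
    return ALPHA_VANTAGE_QUERY_URL + '?' + urlencode(parameters)

def filterByDateRange(response, startDate, endDate):
    # Keep only the trading days that fall within [startDate, endDate].
    timeSeries = response.get('Time Series (Daily)', {})
    return {
        day: values
        for day, values in timeSeries.items()
        if startDate <= date.fromisoformat(day) <= endDate
    }

# Hypothetical payload standing in for a parsed JSON response; values are made up.
sampleResponse = {
    'Time Series (Daily)': {
        '2018-06-01': {'5. adjusted close': '10.00'},
        '2018-06-04': {'5. adjusted close': '10.50'},
        '2018-07-02': {'5. adjusted close': '11.00'},
    }
}

urls = [buildDailyAdjustedURL(symbol, 'demo') for symbol in ['AKAM', 'ALGN']]
juneRows = filterByDateRange(sampleResponse, date(2018, 6, 1), date(2018, 6, 30))
print(urls[0])
print(sorted(juneRows))
```

Keeping URL construction separate from date filtering means both pieces can be tested without network access; the actual HTTP request and its rate-limit handling are left out of this sketch.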

### Get configurations.

Create a function that reads a configuration value from a JSON configuration file. The file must be named `configuration.json` and located in the current working directory.

In [11]:
def getConfigurationValue(configurationKey):

    # Import packages and functions.
    import json, os

    # Get current working directory.
    currentWorkingDirectory = os.getcwd()

    # Load configuration file.
    with open(os.path.join(currentWorkingDirectory, 'configuration.json'), 'r') as configurationFile:
        dictConfigurations = json.load(configurationFile)

    return(dictConfigurations[configurationKey])

Print sample configuration values.

In [177]:
print('NASDAQCompaniesSourceURL:', getConfigurationValue('NASDAQCompaniesSourceURL'))
print('NASDAQCompaniesDestinationPathPart:', getConfigurationValue('NASDAQCompaniesDestinationPathPart'))

NASDAQCompaniesSourceURL: https://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download
NASDAQCompaniesDestinationPathPart: Data/NASDAQCompanies.csv
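
For reference, a `configuration.json` that yields the output above could look like the file written in the sketch below. The two keys and their values are taken from the printed output; the temporary directory and the `readConfigurationValue` helper, a variant that takes the file path as a parameter, are assumptions made so the example is self-contained.

```python
import json
import os
import tempfile

def readConfigurationValue(configurationKey, configurationPath):
    # Hypothetical variant of getConfigurationValue that accepts an explicit file path.
    with open(configurationPath, 'r') as configurationFile:
        dictConfigurations = json.load(configurationFile)
    return dictConfigurations[configurationKey]

# Keys and values copied from the sample output in this notebook.
sampleConfigurations = {
    'NASDAQCompaniesSourceURL': 'https://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download',
    'NASDAQCompaniesDestinationPathPart': 'Data/NASDAQCompanies.csv',
}

with tempfile.TemporaryDirectory() as temporaryDirectory:
    # Write the sample configuration file, then read one value back.
    configurationPath = os.path.join(temporaryDirectory, 'configuration.json')
    with open(configurationPath, 'w') as configurationFile:
        json.dump(sampleConfigurations, configurationFile, indent=4)
    destinationPathPart = readConfigurationValue('NASDAQCompaniesDestinationPathPart', configurationPath)

print(destinationPathPart)
```

Passing the path explicitly is only a convenience for this example; the notebook's own function always reads from the current working directory.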


### Programmatically download list of <a href="https://www.nasdaq.com/screening/companies-by-industry.aspx" target="blank">NASDAQ companies</a>.

Create a function that downloads the companies CSV, loads it into a pandas data frame, and optionally filters the companies by name.

In [158]:
def getNASDAQCompanies(downloadFile = False, companyKeyword = None):
    # Import package(s)/function(s). urllib.request must be imported explicitly.
    import os, urllib.request
    import pandas as pd
    
    # Declare variables.

    # Get source URL for NASDAQ companies file.
    NASDAQCompaniesSourceURL = getConfigurationValue('NASDAQCompaniesSourceURL')
    
    # Get destination path part for NASDAQ companies file.
    NASDAQCompaniesDestinationPathPart = getConfigurationValue('NASDAQCompaniesDestinationPathPart')
    
    # Get absolute destination path for NASDAQ companies file, relative to the current working directory.
    NASDAQCompaniesDestinationPath = os.path.abspath(os.path.join(os.getcwd(), NASDAQCompaniesDestinationPathPart))

    # Download data file, if so specified.
    if (downloadFile):
        # Remove destination file, if it exists.
        if(os.path.isfile(NASDAQCompaniesDestinationPath)):
            os.remove(NASDAQCompaniesDestinationPath)

        # Download file to the absolute destination path.
        urllib.request.urlretrieve(NASDAQCompaniesSourceURL, NASDAQCompaniesDestinationPath)

    # Create dfCompanies data frame from NASDAQ companies file.
    dfCompanies = pd.read_csv(NASDAQCompaniesDestinationPath)
    
    # Drop empty last column created as a result of the trailing comma on each line.
    dfCompanies.drop(dfCompanies.columns[-1], axis=1, inplace=True)
    
    # Filter dfCompanies by Name based on search argument.
    if (companyKeyword is not None):
        # Use vectorized string method contains(): case-insensitive, missing names treated as non-matches.
        dfCompanies = dfCompanies[dfCompanies.Name.str.contains(companyKeyword, case=False, na=False)]

    return(dfCompanies)
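
The trailing-comma cleanup and the name filter can be tried on a small in-memory CSV, without downloading anything. The sketch below is illustrative only: the `sampleCSV` string is an assumption about the shape of the downloaded file, its first two rows reuse company names from the NASDAQ list, trimmed to three columns, and the `XMPL` row is made up so that the filter has something to exclude.

```python
import io
import pandas as pd

# Hypothetical sample mimicking a CSV whose lines end with a trailing comma.
sampleCSV = (
    '"Symbol","Name","LastSale",\n'
    '"AKAM","Akamai Technologies, Inc.","70.19",\n'
    '"AMKR","Amkor Technology, Inc.","10.28",\n'
    '"XMPL","Example Holdings Corp.","1.00",\n'
)

dfSample = pd.read_csv(io.StringIO(sampleCSV))

# The trailing comma makes pandas create an extra, empty last column; drop it.
dfSample = dfSample.drop(columns=dfSample.columns[-1])

# Case-insensitive substring match on Name; na=False treats missing names as non-matches.
dfTech = dfSample[dfSample['Name'].str.contains('tech', case=False, na=False)]

# Collect the tickers of the matching companies.
tickers = dfTech['Symbol'].tolist()
print(list(dfSample.columns))
print(tickers)
```

The resulting `tickers` list is the kind of collection that the later market data query would iterate over.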


Download the companies CSV into a pandas data frame, keeping only companies whose name contains "tech".

In [178]:
# Filter by 'tech'
dfCompanies = getNASDAQCompanies(downloadFile=True, companyKeyword='tech')

print('dfCompanies count:', len(dfCompanies))

dfCompanies.head(10)

dfCompanies count: 168


| | Symbol | Name | LastSale | MarketCap | ADR TSO | IPOyear | Sector | Industry | Summary Quote |
|---|---|---|---|---|---|---|---|---|---|
| 50 | AEY | ADDvantage Technologies Group, Inc. | 1.34 | 13702830.0 | | | Consumer Services | Office Equipment/Supplies/Services | https://www.nasdaq.com/symbol/aey |
| 51 | IOTS | Adesto Technologies Corporation | 7.25 | 155104900.0 | | 2015.0 | Technology | Semiconductors | https://www.nasdaq.com/symbol/iots |
| 56 | ADRO | Aduro Biotech, Inc. | 8.7 | 704756400.0 | | 2015.0 | Health Care | Major Pharmaceuticals | https://www.nasdaq.com/symbol/adro |
| 62 | ADVM | Adverum Biotechnologies, Inc. | 6.4 | 397909000.0 | | 2014.0 | Health Care | Biotechnology: Biological Products (No Diagnos... | https://www.nasdaq.com/symbol/advm |
| 89 | AKAM | Akamai Technologies, Inc. | 70.19 | 11934520000.0 | | 1999.0 | Miscellaneous | Business Services | https://www.nasdaq.com/symbol/akam |
| 95 | AKTS | Akoustis Technologies, Inc. | 6.64 | 148209400.0 | | | Public Utilities | Telecommunications Equipment | https://www.nasdaq.com/symbol/akts |
| 104 | ALGN | Align Technology, Inc. | 250.92 | 20107530000.0 | | 2001.0 | Health Care | Industrial Specialties | https://www.nasdaq.com/symbol/algn |
| 115 | AMOT | Allied Motion Technologies, Inc. | 39.08 | 368406500.0 | | | Capital Goods | Electrical Products | https://www.nasdaq.com/symbol/amot |
| 153 | AETI | American Electric Technologies, Inc. | 1.0213 | 8854314.0 | | | Energy | Industrial Machinery/Components | https://www.nasdaq.com/symbol/aeti |
| 171 | AMKR | Amkor Technology, Inc. | 10.28 | 2460758000.0 | | 1998.0 | Technology | Semiconductors | https://www.nasdaq.com/symbol/amkr |
