# __WEB SCRAPING__

### Objective:
* Create a data frame with all NASDAQ, NYSE, and AMEX companies with their financial statements.
* Automate the web scraping program to keep the information up to date.
    
### Steps:
1. Research websites that contain financial statements for NASDAQ, NYSE, and AMEX companies.
2. List all the links from which to get data.
3. Write automated scraping code to extract the selected information from each website.
4. Store the data in a data frame.
5. Export the data frame as a CSV file.

### Description:
* In this project I researched financial websites, including the 10-K filings from the __[U.S. Securities and Exchange Commission (SEC)](https://www.sec.gov/)__ and the financial summaries of __[Yahoo Finance](https://finance.yahoo.com/)__, both well-known sources of financial data. After this research I found __[www.advfn.com](https://www.advfn.com/nasdaq/nasdaq.asp?companies=A)__ to be an accurate and up-to-date data source.
* I collected the financial data for every NASDAQ, NYSE, and AMEX company and merged it into one large data set for personal financial analysis, with no commercial distribution.

### Main Breakdowns:
1. List the financial websites
2. Extract Individual Company Information
3. Shaping the DataFrame

### Disclaimers:
* This project is not monetized and is not part of any commercial activity. It complies with the "Copyright And Limited Reproduction Notices" from www.advfn.com (as of May 31st, 2022).

<br><br><br><br>

# Web Scraping Code

In [None]:
from bs4 import BeautifulSoup
import pandas as pd
import urllib.request
from datetime import datetime

# Part 1/3
# List of financial websites

##### Set up header variables

In [None]:
# define header variables
user_agent = 'SaintWhoza@protonmail.com'
headers={'User-Agent':user_agent,
         'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
         'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
         'Accept-Encoding': 'none',
         'Accept-Language': 'en-US,en;q=0.8',
         'DNT': '1',
         'Connection': 'keep-alive'}


##### List with initial websites to be scraped

In [None]:
# create a list with websites containing financials data
base_url = ["https://www.advfn.com/nasdaq/nasdaq.asp?companies=","https://www.advfn.com/nyse/newyorkstockexchange.asp?companies=","https://www.advfn.com/amex/americanstockexchange.asp?companies="]
letters = [ chr(x) for x in range(65,91)] # list with A - Z letters
ls_websites = [ x + y for x in base_url for y in letters ]

##### FUNCTION: input url and return its html (soup)

In [None]:
# download html from url
def getSoup(url):
    request=urllib.request.Request(url,None,headers)
    response = urllib.request.urlopen(request)
    data = response.read()
    soup = BeautifulSoup(data, 'html.parser')
    return soup
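
The crawl issues thousands of requests, so a single transient network error inside `getSoup` would stop the whole run. Below is a minimal sketch of a retry wrapper. The helper name `fetch_with_retry` and its parameters are hypothetical (they are not part of the original notebook); it accepts any fetch callable, such as `getSoup`, and waits a little longer after each failed attempt.

```python
import time
from urllib.error import URLError


def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on URLError up to `retries` times.

    `fetch` is any callable that takes a URL (for example getSoup).
    `sleep` is injectable so the wait can be skipped in tests.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except URLError as error:  # HTTPError is a subclass of URLError
            last_error = error
            if attempt < retries:
                sleep(delay * attempt)  # wait a little longer after each failure
    raise last_error
```

It could be used as `soup = fetch_with_retry(getSoup, url)` in place of a direct `getSoup(url)` call.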


##### FUNCTION: input soup and return all href

In [None]:
# store financial links
def getLinks(soup, list_): # list_ is where to store all href from soup
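    # 'a[href]' skips anchors without an href, which would otherwise raise AttributeError on None
    ls = [ x.get('href') for x in soup.select('a[href]') if x.get('href').endswith("financials")] # list of href ending in "financials"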
    for e in ls:
        list_.append(e) # append each href to the main list
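
The same link may appear more than once across the index pages, and some anchors carry no `href` at all. As a sketch of the filtering step on its own, the hypothetical helper `financial_links` below (not part of the original notebook) takes any iterable of href values, ignores missing ones, and drops duplicates while keeping the first-seen order.

```python
def financial_links(hrefs, suffix="financials"):
    """Return unique hrefs ending in `suffix`, keeping first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        # Anchors without an href attribute come back as None from Tag.get
        if href and href.endswith(suffix) and href not in seen:
            seen.add(href)
            result.append(href)
    return result
```

With a soup in hand, it could be called as `financial_links(a.get('href') for a in soup.select('a'))`.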

##### Loop over the index pages and collect the list of websites to scrape

In [None]:
# Takes about 2m 30s to run
ls = []
for url in ls_websites:
    soup = getSoup(url)
    getLinks(soup, ls)
# len(ls) = 12575


# Part 2/3
# Extract Individual Company Information

##### FUNCTION: input url, validate it, and return soup.

In [None]:
# download html from url ; only those that have "financials" at the end of the url
def getSoupProfile(url):
    request=urllib.request.Request(url,None,headers)
    response = urllib.request.urlopen(request) # a 'financials' url may still redirect to another page
    if response.url.endswith("financials"): # not redirected: extract soup
        data = response.read()
        soup = BeautifulSoup(data, 'html.parser')
        return soup
    else: # redirected: return 0
        return 0
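
The redirect check above compares the end of the full URL string. A slightly more tolerant variant, sketched here as an assumption rather than as what the site requires, looks only at the last path segment, so a trailing slash or a query string does not cause a false "redirected" result. The helper name `is_financials_page` is hypothetical.

```python
from urllib.parse import urlsplit


def is_financials_page(final_url):
    """True when the last path segment of the final URL is 'financials'."""
    path = urlsplit(final_url).path.rstrip("/")  # drop query string and trailing slash
    return path.split("/")[-1] == "financials"
```

Inside `getSoupProfile`, the condition could then read `if is_financials_page(response.url):`.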


#### FUNCTION: input soup and return pd df with column names. It is a single row.

In [None]:
# KEYS in profile

def getNamesInProfile(soup_financials):
    
    description_ls = ['Company Name', 'Ticket', 'Description', 'Source Link', 'Stock Group']
    for e in description_ls:
        name_ls.append(e)

    CompanyProfile = soup_financials.find_all('tr', {'class': "border-bottom"})
    for e in CompanyProfile:
        profileItem = e.find_all('td')[0].get_text()
        name_ls.append(profileItem)
    
    # create a dataFrame with the names transposed.
    dataFrame_main = pd.DataFrame(data={'name': name_ls}).transpose()
    return dataFrame_main


#### FUNCTION: input soup and return pd df with values. It is a single row.

In [None]:
# VALUES in profile

def getValuesInProfile(soup_financials, website):
    value_ls = [] # for each company set the value list to empty
    CompanyProfile = soup_financials.find_all('tr', {'class': "border-bottom"})

    # append title
    title = soup_financials.find_all('div', {'class':'mx-0 px-0'})[0].get_text()[1:-21]
    value_ls.append(title)

    # append company ticket
    initial_string = soup_financials.find_all('div', {'class':'mx-0 px-0'})[0].get_text().rfind("(") + 1
    ticket = soup_financials.find_all('div', {'class':'mx-0 px-0'})[0].get_text()[initial_string:-15]
    value_ls.append(ticket)

    # append company description
    description = soup_financials.find_all('td', {'class':'text-left'})[0].get_text()
    value_ls.append(description)

    # append source website
    value_ls.append(website)

    # append company stock group name (taken from this company's url, not from the template url)
    value_ls.append(website.split('/')[-3])

    # append company Values
    for e in CompanyProfile:
        profileItem = e.find_all('td')[1].get_text()
        try:
            value_ls.append(int(profileItem)) # store whole numbers as int
        except ValueError: # not a whole number: keep the original text
            value_ls.append(profileItem)
    
    # create a dataFrame with the values transposed
    dataFrame_ticket = pd.DataFrame(data={'value': value_ls}).T
    return dataFrame_ticket
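
The title and ticker above are cut out with fixed slice offsets (`[1:-21]` and `[...:-15]`), which only line up when the ticker has a particular length. The sample output at the end of this notebook contains names such as `Zscaler In` and `A2Z Smart Technologies Cor`, which may be a symptom of this. Below is a minimal sketch of a length-independent split. It assumes the heading text has the shape `Company Name (TICKER)` followed by other text; that shape is inferred from the slicing code, not confirmed against the site, and the helper name `split_name_and_ticker` is hypothetical.

```python
import re


def split_name_and_ticker(heading):
    """Split 'Company Name (TICKER) trailing text' into (name, ticker)."""
    # Find the last parenthesised group in the heading
    match = re.search(r"\(([^()]+)\)[^()]*$", heading)
    if match is None:
        return heading.strip(), None  # no ticker found: return the text unchanged
    return heading[:match.start()].strip(), match.group(1).strip()
```

Using the last parenthesised group mirrors the `rfind("(")` call in the original code, so company names that themselves contain parentheses are kept whole.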

#### Build list with column names, then build a pd df, then add df into df list

In [None]:
# Builds the heading (first row) of the combined dataFrame
dataFrame_ls = [] # where all company df are going to be stored
name_ls = []

anyWebsite = "https://ih.advfn.com/stock-market/NASDAQ/tesla-TSLA/financials" # used as a template website to extract KEYS headings
soupWebiste = getSoup(anyWebsite)
names = getNamesInProfile(soupWebiste)
dataFrame_ls.append(names)


#### FUNCTION: input list of root links (with company information), then append a df with company profile values to the df list. It is a single row per company.

In [None]:
# open each profile link and store its info in dataFrame_ls
# link_name = list with links (list)
def build_dataframe(link_name):
    # TRACKING
    list_length = len(link_name)
    redirect_tracking = 0
    download_tracking = 0

    for website in link_name: # for every element in the list containing financial statements urls
        soupWebiste = getSoupProfile(website) # get html from website
        if soupWebiste != 0:
            values = getValuesInProfile(soupWebiste, website) # get values as a dataFrame
            dataFrame_ls.append(values) # add dataFrame to the list

            # TRACKING
            list_length-=1
            download_tracking +=1
            print(download_tracking," downloads.", "\t"*4 , list_length, " left. --------------- SUCCESS", "\t"*1, website.split('/')[-3])
        # TRACKING
        else:
            # TRACKING
            redirect_tracking+=1
            list_length-=1
            print("Redirected.", redirect_tracking, "\t"*4, list_length, " left.", "\t"*4, website.split('/')[-3])

In [None]:
# RUN ALL PROFILE FUNCTION ------- takes 8 hrs.
build_dataframe(ls)

# TRACKING
print("DONE")

# Part 3/3
# Shaping the DataFrame

#### Convert to df and save as csv

In [None]:
dataFrame_finalScrape = pd.concat(dataFrame_ls, axis = 0)

date = datetime.now().strftime("%d-%m-%Y %H%M%S")
nameFile = "InvestorsHub " + date + " - Webscrape raw.csv"
filePath = "/Users/pedrosanhueza/EXOXY/Personal Projects/Programming/Web Scraping/Yahoo Finance (py)/InvestorsHub - Historical Data/" + nameFile

dataFrame_finalScrape.to_csv(filePath, index=False)
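
The file name above uses a day-first timestamp, so files from different months do not sort chronologically by name. As a small sketch, the hypothetical helper `timestamped_csv_path` below builds the same kind of name with a year-first stamp and `pathlib`; the folder argument is a placeholder, not the path used in this notebook.

```python
from datetime import datetime
from pathlib import Path


def timestamped_csv_path(folder, label, now=None):
    """Build '<folder>/InvestorsHub <YYYY-MM-DD HHMMSS> - <label>.csv'."""
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%d %H%M%S")  # year first, so names sort by date
    return Path(folder) / f"InvestorsHub {stamp} - {label}.csv"
```

The result can be passed straight to `to_csv`, for example `dataFrame_finalScrape.to_csv(timestamped_csv_path(folder, "Webscrape raw"), index=False)`.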

In [None]:
# make a copy of the dataFrame
dataFrame = dataFrame_finalScrape.copy()

# First row to column
dataFrame.columns = dataFrame.iloc[0] # assign first row to columns header
dataFrame = dataFrame[1:] # remove first row

# Reset index
dataFrame=dataFrame.reset_index()
dataFrame.drop(['index'], axis=1, inplace=True)


In [None]:
# drop duplicated column
dataFrame.drop(dataFrame.columns[90], axis=1, inplace=True) # col name: 

# list of columns with "$\xa"
colNameEdit_ls = []
for e in dataFrame.columns.values:
    try:    
        if dataFrame[e].iloc[0][0] == chr(36):
            colNameEdit_ls.append(e)
    except:
        try:
            if dataFrame[e].iloc[1][0] == chr(36):
                colNameEdit_ls.append(e)
        except:
            if dataFrame[e].iloc[1] == chr(36):
                colNameEdit_ls.append(e)

dataFrame = dataFrame.replace(',','', regex=True) # remove commas from all dataframe
dataFrame = dataFrame.replace('%','', regex=True) # remove percentage sign from all dataframe
dataFrame = dataFrame.replace('-', '') # replace cells that contain only a hyphen with an empty string (no regex, so hyphens inside other values stay)

# remove all "$\xa" based on list of columns with "$\xa"
for e in colNameEdit_ls:
    dataFrame[e] = dataFrame[e].apply(lambda x: x[2:])
    dataFrame[e] = pd.to_numeric(dataFrame[e])
    # dataFrame[e] = dataFrame[e].apply(lambda x: "$" + x)   
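
The cleanup above depends on the order of the steps: commas are removed first, then the first two characters of each value are sliced off. A self-contained alternative is sketched below as a plain function that handles one value at a time. It assumes money values look like a dollar sign followed by a non-breaking space and a number with thousands separators, which is inferred from the `x[2:]` slice rather than confirmed; the name `parse_money` is hypothetical.

```python
def parse_money(text):
    """Convert a money string (dollar sign, optional non-breaking space,
    thousands separators) to a float; return None for blank or '-' cells."""
    cleaned = text.replace("$", "").replace("\xa0", "").replace(",", "").strip()
    if cleaned in ("", "-"):
        return None
    return float(cleaned)
```

Applied per column, this could look like `dataFrame[e] = dataFrame[e].apply(parse_money)` for each name in `colNameEdit_ls`, provided every cell in those columns is a string.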


In [None]:
# save df
date = datetime.now().strftime("%d-%m-%Y %H%M%S")
nameFile = "InvestorsHub " + date + " - Webscrape.csv"
filePath = "../InvestorsHub - Historical Data/" + nameFile
dataFrame.to_csv(filePath, index=False)


In [13]:
df

Unnamed: 0,Company Name,Ticket,Description,Source Link,Market Cap,Shares Outstanding,Float,Percent Float,Short Interest,Short Percent Float,...,Address,Website,Facsimile,Telephone,Email,Symbol,Name,Country,IPO Year,Volume
0,A SPAC I Acquisition Corp. Unit,ASCAU,ASPAC I Acquisition Corp is a blank check comp...,http://ih.advfn.com/stock-market/NASDAQ/a-spac...,0.000000e+00,,,0.0%,0.0,0.0%,...,10 Marina Boulevard Tower 2Level 39 Marina Bay...,,,+65 68185796,,ASCAU,A SPAC I Acquisition Corp. Unit,Singapore,2022.0,855.0
1,A. Schulman Inc.,SHLM,A. Schulman Inc manufactures and sells a varie...,http://ih.advfn.com/stock-market/NASDAQ/a-schu...,1.298403e+09,2.950915e+07,2.770361e+07,93.88%,0.0,0.0%,...,3637 Ridgewood RoadFairlawn OH 44333,http://www.aschulman.com,,+1 330 666-3751,Jennifer_Beeman@us.aschulman.com,,,,,
2,A2Z Smart Technologies Cor,AZ,A2Z Smart Technologies Corp is a technology co...,http://ih.advfn.com/stock-market/NASDAQ/a2z-sm...,1.113124e+08,2.726603e+07,2.714936e+07,100.0%,0.0,0.0%,...,1600 - 609 Granville StreetSuite 1600Vancouver...,https://www.a2zas.com,,+1 647 558-5564,info@a2zas.com,AZ,A2Z Smart Technologies Corp. Common Shares,Canada,,60898.0
3,ABIOMED Inc.,ABMD,Abiomed Inc provides temporary mechanical circ...,http://ih.advfn.com/stock-market/NASDAQ/abiome...,1.172966e+10,4.554500e+07,3.981849e+07,87.43%,0.0,0.0%,...,22 Cherry Hill DriveDanvers MA 1923,http://www.abiomed.com,+1 978 777-8411,+1 978 646-1400,mediarelations@abiomed.com,ABMD,ABIOMED Inc. Common Stock,United States,,555297.0
4,Abri SPAC I Inc. Unit,ASPAU,Abri SPAC I Inc is a blank check company.,http://ih.advfn.com/stock-market/NASDAQ/abri-s...,5.350118e+07,,5.276250e+06,100.0%,0.0,0.0%,...,9663 Santa Monica BoulevardNo. 1091Beverly Hil...,,,+1 424 732-1021,,ASPAU,Abri SPAC I Inc. Unit,United States,2021.0,13.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6369,Zosano Pharma Corporation,ZSAN,Zosano Pharma Corp is a clinical-stage biophar...,http://ih.advfn.com/stock-market/NASDAQ/zosano...,8.578955e+06,4.902260e+06,4.819354e+06,98.31%,0.0,0.0%,...,34790 Ardentech CourtFremont CA 94555,https://www.zosanopharma.com,,+1 510 745-1200,bd@zosanopharma.com,ZSAN,Zosano Pharma Corporation Common Stock,United States,2015.0,205382.0
6370,Zscaler In,ZS,Zscaler Inc is a security-as-a-service firm th...,http://ih.advfn.com/stock-market/NASDAQ/zscale...,2.439930e+10,1.410853e+08,8.348679e+07,59.17%,0.0,0.0%,...,120 Holger WaySan Jose CA 95134,https://www.zscaler.com,,+1 408 533-0288,ir@zscaler.com,ZS,Zscaler Inc. Common Stock,United States,2018.0,4229899.0
6371,Zygo Corp,ZIGO,,http://ih.advfn.com/stock-market/NASDAQ/zygo-Z...,3.679698e+08,1.912028e+07,1.912028e+07,100.0%,0.0,0.0%,...,Laurel Brook RoadMiddlefield CT 06455-1291,http://www.zygo.com,+1 860 347-8372,+1 860 347-8506,investor@zygo.com,,,,,
6372,Zynerba Pharmaceuticals Inc.,ZYNE,Zynerba Pharmaceuticals Inc is a pharmaceutica...,http://ih.advfn.com/stock-market/NASDAQ/zynerb...,5.452682e+07,4.362145e+07,4.170836e+07,95.61%,0.0,0.0%,...,80 West Lancaster AvenueSuite 300Devon PA 19333,http://www.zynerba.com,,+1 484 581-7505,robertsw@zynerba.com,ZYNE,Zynerba Pharmaceuticals Inc. Common Stock,United States,2015.0,198042.0
