## BeautifulSoup Tutorial 2
### Objective: scrape data for a variety of stocks on Nasdaq (advanced)
ref: https://towardsdatascience.com/web-scraping-for-beginners-beautifulsoup-scrapy-selenium-twitter-api-f5a6d0589ea6
Data will be scraped from the following table:
https://www.nasdaq.com/screening/companies-by-industry.aspx?industry=Technology&sortname=marketcap&sorttype=1&page

### Steps:
#### 1.) Select the URL to scrape
#### 2.) Finalize the info to scrape from the site
#### 3.) Send a GET request
#### 4.) Inspect the website
#### 5.) Parse the HTML with BeautifulSoup
#### 6.) Select data, append to list
#### 7.) Download data to CSV, save locally
#### 8.) Use pandas to analyze the data
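
Step 7 is not reached in the cells below, so here is a minimal sketch of it using the standard-library `csv` module. The helper name `save_rows` and the file name `nasdaq_tech.csv` are assumptions made for illustration. The two sample rows are copied from the scraped output in this tutorial. The column label `Industry` is inferred from the truncated `sortname=industry` header link.

```python
import csv
import os
import tempfile

# Labels follow the table header. 'Industry' is an assumed label (the header
# link is truncated in the printed output). ADR TSO is left out because the
# scraped rows in this tutorial contain no value for it.
COLUMNS = ['Name', 'Symbol', 'Market Cap', 'Country', 'IPO Year', 'Industry']


def save_rows(rows, path):
    """Write a header line plus one CSV line per scraped row."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        writer.writerows(rows)


sample = [
    ['Apple Inc.', 'AAPL', '$986.2B', 'United States', '1980',
     'Computer Manufacturing'],
    ['Microsoft Corporation', 'MSFT', '$967.12B', 'United States', '1986',
     'Computer Software: Prepackaged Software'],
]

# Hypothetical file name, written to the system temp directory
csv_path = os.path.join(tempfile.gettempdir(), 'nasdaq_tech.csv')
save_rows(sample, csv_path)

# Read the file back to confirm the round trip
with open(csv_path, newline='', encoding='utf-8') as f:
    loaded = list(csv.reader(f))
```

For step 8, the saved file could then be loaded with `pd.read_csv(csv_path)`.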


In [1]:
# Import the following libraries:
from time import time, sleep
from random import randint
from IPython.display import clear_output
from requests import get
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import pandas as pd

### Read URL

In [3]:
url = 'https://www.nasdaq.com/screening/companies-by-industry.aspx?industry=Technology&sortname=marketcap&sorttype=1&page'
response = get(url)

In [4]:
print(url)

https://www.nasdaq.com/screening/companies-by-industry.aspx?industry=Technology&sortname=marketcap&sorttype=1&page


In [5]:
print(response)

<Response [200]>
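
`<Response [200]>` means the server answered with HTTP status 200 (OK). The sketch below guards against any other status before parsing. The helper `text_if_ok` is hypothetical, and the `SimpleNamespace` objects are stand-ins for real `requests` responses so that the example runs offline.

```python
from types import SimpleNamespace


def text_if_ok(response):
    """Return the response body, or raise if the status is not 200."""
    if response.status_code != 200:
        raise RuntimeError(f'unexpected HTTP status: {response.status_code}')
    return response.text


# Stand-ins for real responses (assumption: only these two attributes are needed)
ok = SimpleNamespace(status_code=200, text='<html></html>')
missing = SimpleNamespace(status_code=404, text='Not Found')

body = text_if_ok(ok)

try:
    text_if_ok(missing)
    failed = False
except RuntimeError:
    failed = True
```

With a real `requests` response, calling `response.raise_for_status()` gives a similar guard for 4xx and 5xx statuses.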


### Create a BeautifulSoup object from the response above, using the html.parser parser

In [6]:
page_html = BeautifulSoup(response.text, 'html.parser')

### Select stocks on the first page

In [7]:
data = [] #create an empty list

In [8]:
stable = page_html.find('table', attrs={'id':'CompanylistResults'}) #find the table

In [9]:
print(stable)

<table id="CompanylistResults">
<thead>
<tr>
<th valign="top"><a href="companies-by-industry.aspx?industry=Technology&amp;sortname=name&amp;sorttype=0" rel="nofollow">Name</a></th>
<th valign="top"><a href="companies-by-industry.aspx?industry=Technology&amp;sortname=symbol&amp;sorttype=0" rel="nofollow">Symbol</a></th>
<th style="" valign="top"><a href="companies-by-industry.aspx?industry=Technology&amp;sortname=marketcap&amp;sorttype=0" rel="nofollow">Market Cap</a></th>
<th style="display:none" valign="top"><a href="companies-by-industry.aspx?industry=Technology&amp;sortname=adrtso&amp;sorttype=0" rel="nofollow">ADR TSO</a></th>
<th valign="top"><a href="companies-by-industry.aspx?industry=Technology&amp;sortname=country&amp;sorttype=0" rel="nofollow">Country</a></th>
<th valign="top" width="50"><a href="companies-by-industry.aspx?industry=Technology&amp;sortname=ipoyear&amp;sorttype=0" rel="nofollow">IPO Year</a></th>
<th valign="top"><a href="companies-by-industry.aspx?industry=Tec

In [10]:
rows = stable.find_all('tr') # find all rows; each <tr> is one table row

In [11]:
print(rows)

[<tr>
<th valign="top"><a href="companies-by-industry.aspx?industry=Technology&amp;sortname=name&amp;sorttype=0" rel="nofollow">Name</a></th>
<th valign="top"><a href="companies-by-industry.aspx?industry=Technology&amp;sortname=symbol&amp;sorttype=0" rel="nofollow">Symbol</a></th>
<th style="" valign="top"><a href="companies-by-industry.aspx?industry=Technology&amp;sortname=marketcap&amp;sorttype=0" rel="nofollow">Market Cap</a></th>
<th style="display:none" valign="top"><a href="companies-by-industry.aspx?industry=Technology&amp;sortname=adrtso&amp;sorttype=0" rel="nofollow">ADR TSO</a></th>
<th valign="top"><a href="companies-by-industry.aspx?industry=Technology&amp;sortname=country&amp;sorttype=0" rel="nofollow">Country</a></th>
<th valign="top" width="50"><a href="companies-by-industry.aspx?industry=Technology&amp;sortname=ipoyear&amp;sorttype=0" rel="nofollow">IPO Year</a></th>
<th valign="top"><a href="companies-by-industry.aspx?industry=Technology&amp;sortname=industry&amp;sortt

In [12]:
for row in rows: # iterate over each row
    cols = row.find_all('td') # find the cells; each cell is a <td>
    cols = [ele.text.strip() for ele in cols] # strip leading and trailing whitespace from each cell, ref: https://www.programiz.com/python-programming/methods/string/strip
    data.append([ele for ele in cols if ele]) # drop empty values and append the non-empty ones to the list
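
The find-table, find-rows, find-cells steps above can be wrapped into one function. This is only a sketch: the name `table_rows` and the miniature `sample_html` are made up for illustration. It also shows two details of the approach: `find()` returns `None` when no table matches, and the header row yields an empty list because it contains `<th>` cells rather than `<td>` cells.

```python
from bs4 import BeautifulSoup


def table_rows(html, table_id='CompanylistResults'):
    """Return the non-empty cell texts of every row in the table with this id."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', attrs={'id': table_id})
    if table is None:  # find() returns None when nothing matches
        return []
    result = []
    for row in table.find_all('tr'):
        cells = [td.text.strip() for td in row.find_all('td')]
        result.append([c for c in cells if c])  # drop empty cells
    return result


# Made-up miniature of the page, for illustration only
sample_html = """
<table id="CompanylistResults">
  <thead><tr><th>Name</th><th>Symbol</th></tr></thead>
  <tbody>
    <tr><td> Apple Inc. </td><td>AAPL</td><td></td></tr>
  </tbody>
</table>
"""

parsed = table_rows(sample_html)
no_table = table_rows('<p>no table here</p>')
```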

In [13]:
print(data) # HTML is now parsed, providing all non-empty cells of the table

[[], ['Apple Inc.', 'AAPL', '$986.2B', 'United States', '1980', 'Computer Manufacturing'], ['AAPL Stock Quote\n\n\r\n\t\t\t\t                AAPL Ratings\n\n\r\n\t\t\t\t                AAPL Stock Report'], ['Microsoft Corporation', 'MSFT', '$967.12B', 'United States', '1986', 'Computer Software: Prepackaged Software'], ['MSFT Stock Quote\n\n\r\n\t\t\t\t                MSFT Ratings\n\n\r\n\t\t\t\t                MSFT Stock Report'], ['Alphabet Inc.', 'GOOGL', '$809.84B', 'United States', 'n/a', 'Computer Software: Programming, Data Processing'], ['GOOGL Stock Quote\n\n\r\n\t\t\t\t                GOOGL Ratings\n\n\r\n\t\t\t\t                GOOGL Stock Report'], ['Alphabet Inc.', 'GOOG', '$807.13B', 'United States', '2004', 'Computer Software: Programming, Data Processing'], ['GOOG Stock Quote\n\n\r\n\t\t\t\t                GOOG Ratings\n\n\r\n\t\t\t\t                GOOG Stock Report'], ['Facebook, Inc.', 'FB', '$549.57B', 'United States', '2012', 'Computer Software: Programming, Data P
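
The printed list mixes three kinds of entries: an empty list for the header row, six-value company rows, and single-value rows holding link text. The sketch below shows one possible cleanup step: it keeps only the company rows and converts the market cap strings to numbers. Both helper names are hypothetical. The `M` suffix is an assumption, since only `B` appears in the output above, and the link-text row in the sample is shortened.

```python
def is_company_row(row):
    """Company rows carry several cells; header and link rows carry at most one."""
    return len(row) > 1


def market_cap_in_billions(text):
    """Convert a string like '$986.2B' to a float in billions of dollars.

    Assumes a leading '$' and a trailing 'B' (billions) or 'M' (millions).
    Only 'B' appears in the scraped output; 'M' is an assumption.
    """
    scale = {'B': 1.0, 'M': 1e-3}
    value = text.lstrip('$')
    suffix = value[-1:]
    if suffix in scale:
        return float(value[:-1]) * scale[suffix]
    return float(value) / 1e9  # plain dollar amount, no suffix


sample_data = [
    [],
    ['Apple Inc.', 'AAPL', '$986.2B', 'United States', '1980',
     'Computer Manufacturing'],
    ['AAPL Stock Quote\n\n AAPL Ratings\n\n AAPL Stock Report'],  # shortened
    ['Microsoft Corporation', 'MSFT', '$967.12B', 'United States', '1986',
     'Computer Software: Prepackaged Software'],
]

companies = [row for row in sample_data if is_company_row(row)]
caps = {row[1]: market_cap_in_billions(row[2]) for row in companies}
```

Filtering on row length rather than on an exact count of six keeps rows in which an empty cell was dropped during scraping.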

In [20]:
start_time = time() #initialize variables
requests = 0
pages = [str(i) for i in range(1,14)]

In [21]:
for page in pages: # build the URL of each results page; after the loop, url and page hold the last page
    url = 'https://www.nasdaq.com/screening/companies-by-industry.aspx?industry=Technology&sortname=marketcap&sorttype=1&page=' + page

In [22]:
print(page)

13
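
The loop above only builds the page URLs. The `time`, `sleep`, and `randint` imports suggest a throttled request loop, so here is one way it might look. This is a sketch under assumptions, not the original author's code. The `fetch` and `parse` callables are hypothetical and passed in as arguments so that the example runs offline with stand-ins. In the notebook, `fetch` could be `lambda u: get(u).text`.

```python
from random import randint
from time import sleep, time

BASE_URL = ('https://www.nasdaq.com/screening/companies-by-industry.aspx'
            '?industry=Technology&sortname=marketcap&sorttype=1&page=')


def scrape_pages(pages, fetch, parse, pause=(1, 3)):
    """Fetch and parse each page, pausing a random number of seconds after each request."""
    start_time = time()
    rows = []
    for count, page in enumerate(pages, start=1):
        html = fetch(BASE_URL + page)
        rows.extend(parse(html))
        elapsed = time() - start_time
        print(f'Request {count}; elapsed {elapsed:.1f} s')
        sleep(randint(*pause))  # throttle so the server is not flooded
    return rows


# Offline stand-ins for the network call and the parser (illustration only)
seen_urls = []


def fake_fetch(url):
    seen_urls.append(url)
    return url.rsplit('=', 1)[-1]  # pretend the page body is its page number


def fake_parse(html):
    return [['row from page ' + html]]


demo_pages = [str(i) for i in range(1, 4)]
result = scrape_pages(demo_pages, fake_fetch, fake_parse, pause=(0, 0))
```

Passing `pause=(0, 0)` disables the delay for the demonstration. A real run would keep a non-zero pause.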
