<h1>Extracting Netflix Stock History Data Using Web Scraping</h1>

In [None]:
# Install the libraries if they are not already installed
!pip install pandas
!pip install requests
!pip install bs4
!pip install html5lib 
!pip install lxml
!pip install plotly

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [4]:
import warnings
# ignore warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Source Website

Live page with the latest Netflix stock data (for the blog page):
https://finance.yahoo.com/quote/NFLX/history

Page used to extract Netflix stock data in this project:
https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork/labs/project/netflix_data_webpage.html

The page contains a table with the columns Date, Open, High, Low, Close, Adj Close, and Volume, from which we will extract the following columns:

- Date
- Open
- High
- Low
- Close
- Volume

# Steps for extracting the data

1. Send an HTTP request to the web page using the requests library.
2. Parse the HTML content of the web page using BeautifulSoup.
3. Identify the HTML tags that contain the data you want to extract.
4. Use BeautifulSoup methods to extract the data from the HTML tags.
5. View the extracted data.

## Step 1: Send an HTTP request to the web page

In [5]:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork/labs/project/netflix_data_webpage.html"

The requests.get() function takes a URL as its first argument, which specifies the resource to retrieve. Here we pass the url variable defined above.

The .text attribute of the response returns the HTML content as a string, which makes it readable.

In [37]:
data = requests.get(url).text
#print(data)
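requests.get() does not raise an error when the server answers with a status such as 404; .text then holds the body of the error page. The sketch below is one way to make such failures visible. The helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original lab.

```python
import requests


def fetch_html(url, timeout=10):
    """Download `url` and return its body as a string (hypothetical helper)."""
    # the timeout stops the call from waiting indefinitely on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise requests.HTTPError for 4xx/5xx responses instead of returning an error page
    response.raise_for_status()
    return response.text
```

With this helper, `data = fetch_html(url)` could stand in for the cell above.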

## Step 2: Parse the HTML content

### What is parsing?
Parsing is the process of analyzing a string of text or a data structure according to a set of rules or a grammar in order to understand its structure and meaning. It breaks the input into its individual components and then analyzes those components to extract the desired information or to understand how they relate to one another.

Next, you will take the raw HTML content of the web page and transform it into a structured, hierarchical format that is easier to analyze and manipulate in Python. The Beautiful Soup library does this.

### Parsing the data using the BeautifulSoup library

- Create a new BeautifulSoup object.

Note: The BeautifulSoup constructor takes two arguments:

1. The HTML or XML content that you want to parse, as a string.
2. The name of the parser to use. This argument is optional; if you omit it, BeautifulSoup picks the best HTML parser that is installed and issues a warning that it had to guess. In this lab we use the "html5lib" parser.

In [7]:
soup = BeautifulSoup(data, 'html5lib')
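To see what the parsed object offers, here is a minimal sketch on a tiny hand-written page. The HTML string and variable names are assumptions made for illustration, and it uses the built-in "html.parser" rather than "html5lib" so that it runs without extra packages.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a downloaded page
sample_html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p>First</p><p>Second</p></body></html>"
)

# "html.parser" ships with Python; the lab itself uses "html5lib"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Tags can be reached as attributes of the soup object
page_title = sample_soup.title.string

# find_all() collects every matching tag; .text gives the text inside each one
paragraphs = [p.text for p in sample_soup.find_all("p")]

print(page_title)
print(paragraphs)
```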

## Step 3: Identify the HTML tags

The web page contains a table, so we will scrape the content of the HTML page and convert the table into a data frame.

### Working with HTML tables

The following tags are used to create HTML tables.

* &lt;table&gt;: The root tag that marks the start and end of the table. All table content is enclosed within it.

* &lt;tr&gt;: Defines a table row. Each row of the table sits inside its own &lt;tr&gt; tag.

* &lt;td&gt;: Defines a table cell. The content of the cell goes between the opening and closing &lt;td&gt; tags.

* &lt;th&gt;: Defines a header cell, which describes the contents of a column or row. By default, browsers render the text inside a &lt;th&gt; tag bold and centered.

* &lt;tbody&gt;: Groups the main content of the table. It contains one or more &lt;tr&gt; rows.
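The tags listed above can be seen together in a small hand-written table. The sketch below is illustrative only: the table is hypothetical (its values are copied from the output shown later in this notebook), and it is parsed with the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table using the tags described above
table_html = """
<table>
  <thead>
    <tr><th>Date</th><th>Open</th></tr>
  </thead>
  <tbody>
    <tr><td>Jun 01, 2021</td><td>504.01</td></tr>
    <tr><td>May 01, 2021</td><td>512.65</td></tr>
  </tbody>
</table>
"""

table_soup = BeautifulSoup(table_html, "html.parser")

# <th> cells hold the column names
headers = [th.text for th in table_soup.find_all("th")]

# each <tr> inside <tbody> is one data row made of <td> cells
rows = [
    [td.text for td in tr.find_all("td")]
    for tr in table_soup.find("tbody").find_all("tr")
]

print(headers)
print(rows)
```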

We will create an empty data frame using the pd.DataFrame() function with the following columns:
- "Date"
- "Open"
- "High"
- "Low"
- "Close"
- "Volume"

In [8]:
netflix_data = pd.DataFrame(columns=["Date", "Open", "High", "Low", "Close", "Volume"])

## Step 4: Use a BeautifulSoup method for extracting data


We will use the <b>find()</b> and <b>find_all()</b> methods of the BeautifulSoup object to locate the table body and the table rows, respectively, in the HTML.
   * The <i>find()</i> method returns the first matching tag, or None if there is no match.
   * The <i>find_all()</i> method returns a list of all matching tags in the HTML.
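The difference between the two methods is easiest to see on a small input. This is a minimal sketch; the list snippet and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with three matching tags
snippet = "<ul><li>a</li><li>b</li><li>c</li></ul>"
snippet_soup = BeautifulSoup(snippet, "html.parser")

first_item = snippet_soup.find("li")          # first matching tag only
all_items = snippet_soup.find_all("li")       # every matching tag, list-like
missing = snippet_soup.find("table")          # None when nothing matches
missing_all = snippet_soup.find_all("table")  # empty list when nothing matches

print(first_item.text)
print([li.text for li in all_items])
print(missing, len(missing_all))
```

Because find() returns None when nothing matches, a chained call such as `soup.find("tbody").find_all("tr")` raises an AttributeError on a page without a table body.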

In [10]:
# First we isolate the body of the table, which contains all the data rows
# Then we loop through each row and collect the cell values of that row
rows = []
for row in soup.find("tbody").find_all("tr"):
    col = row.find_all("td")
    rows.append({
        "Date": col[0].text,
        "Open": col[1].text,
        "High": col[2].text,
        "Low": col[3].text,
        "Close": col[4].text,
        "Adj Close": col[5].text,
        "Volume": col[6].text,
    })

# Finally we add all rows to the data frame in one step
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
netflix_data = pd.concat([netflix_data, pd.DataFrame(rows)], ignore_index=True)
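The .text attribute always returns strings, so every column filled by the loop holds text rather than numbers or dates. If typed columns are needed, a conversion along the following lines may help. The sample frame is hypothetical, and the date format and the presence of thousands separators are assumptions about how the page writes its values.

```python
import pandas as pd

# Hypothetical scraped rows; every value is a string, as .text returns
raw = pd.DataFrame({
    "Date": ["Jun 01, 2021", "May 01, 2021"],
    "Open": ["504.01", "512.65"],
    "Volume": ["78,560,600", "66,927,600"],
})

typed = raw.copy()
# assumed date layout: abbreviated month, day, four-digit year
typed["Date"] = pd.to_datetime(typed["Date"], format="%b %d, %Y")
typed["Open"] = pd.to_numeric(typed["Open"])
# strip thousands separators before converting, in case the page includes them
typed["Volume"] = pd.to_numeric(typed["Volume"].str.replace(",", "", regex=False))

print(typed.dtypes)
```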

## Step 5: View the extracted data

In [11]:
# view the data frame with the head() or tail() method
netflix_data.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,"Jun 01, 2021",504.01,536.13,482.14,528.21,78560600,528.21
1,"May 01, 2021",512.65,518.95,478.54,502.81,66927600,502.81
2,"Apr 01, 2021",529.93,563.56,499.0,513.47,111573300,513.47
3,"Mar 01, 2021",545.57,556.99,492.85,521.66,90183900,521.66
4,"Feb 01, 2021",536.79,566.65,518.28,538.85,61902300,538.85


In [15]:
netflix_data.columns

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'], dtype='object')

In [16]:
netflix_data.tail()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
135,"Jan 01, 2016",109.0,122.18,90.11,91.84,488193200,91.84
136,"Dec 01, 2015",124.47,133.27,113.85,114.38,319939200,114.38
137,"Nov 01, 2015",109.2,126.6,101.86,123.33,320321800,123.33
138,"Oct 01, 2015",102.91,115.83,96.26,108.38,446204400,108.38
139,"Sep 01, 2015",109.35,111.24,93.55,103.26,497401200,103.26


In [14]:
title = soup.title
title

<title>Netflix, Inc. (NFLX) Stock Historical Prices &amp; Data - Yahoo Finance</title>

# Extracting data using pandas library

We can also extract the data directly from the URL with the `read_html` function from the pandas library.

## What is read_html in pandas library?

`pd.read_html(url)` is a pandas function that extracts tables from HTML web pages. It takes a URL as input and returns a list of DataFrames, one for each table found on the page.

In [17]:
# returns a list of all tables
read_html_pandas_data = pd.read_html(url)

In [26]:
netflix_dataframe = read_html_pandas_data[0]
netflix_dataframe

Unnamed: 0,Date,Open,High,Low,Close*,Adj Close**,Volume
0,"Jun 01, 2021",504.01,536.13,482.14,528.21,528.21,78560600
1,"May 01, 2021",512.65,518.95,478.54,502.81,502.81,66927600
2,"Apr 01, 2021",529.93,563.56,499.00,513.47,513.47,111573300
3,"Mar 01, 2021",545.57,556.99,492.85,521.66,521.66,90183900
4,"Feb 01, 2021",536.79,566.65,518.28,538.85,538.85,61902300
...,...,...,...,...,...,...,...
66,"Dec 01, 2015",124.47,133.27,113.85,114.38,114.38,319939200
67,"Nov 01, 2015",109.20,126.60,101.86,123.33,123.33,320321800
68,"Oct 01, 2015",102.91,115.83,96.26,108.38,108.38,446204400
69,"Sep 01, 2015",109.35,111.24,93.55,103.26,103.26,497401200
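`read_html` can be tried without a network connection by wrapping an HTML string in `StringIO`. The sketch below uses a hypothetical page fragment with values copied from the output above. It assumes that lxml, or bs4 together with html5lib, is installed, as in the install cell at the top.

```python
from io import StringIO

import pandas as pd

# Hypothetical page fragment containing a single table
page = """
<table>
  <thead><tr><th>Date</th><th>Open</th><th>Volume</th></tr></thead>
  <tbody>
    <tr><td>Jun 01, 2021</td><td>504.01</td><td>78560600</td></tr>
    <tr><td>May 01, 2021</td><td>512.65</td><td>66927600</td></tr>
  </tbody>
</table>
"""

# read_html returns one DataFrame per <table> element it finds
tables = pd.read_html(StringIO(page))

print(len(tables))
# numeric columns arrive as numbers here, whereas the .text loop yields strings
print(tables[0].dtypes)
```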


# Exercise: Use web scraping to extract Amazon stock data

In [34]:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork/labs/project/amazon_data_webpage.html"
url_latest = "https://finance.yahoo.com/quote/AMZN/history"    

In [35]:
# read the tables from the course-hosted page (url); url_latest points to the live Yahoo Finance page
read_html_pandas_data = pd.read_html(url)
amazon_data = read_html_pandas_data[0]
amazon_data

Unnamed: 0,Date,Open,High,Low,Close*,Adj Close**,Volume
0,"Jan 01, 2021",3270.00,3363.89,3086.00,3206.20,3206.20,71528900
1,"Dec 01, 2020",3188.50,3350.65,3072.82,3256.93,3256.93,77556200
2,"Nov 01, 2020",3061.74,3366.80,2950.12,3168.04,3168.04,90810500
3,"Oct 01, 2020",3208.00,3496.24,3019.00,3036.15,3036.15,116226100
4,"Sep 01, 2020",3489.58,3552.25,2871.00,3148.73,3148.73,115899300
...,...,...,...,...,...,...,...
57,"Apr 01, 2016",590.49,669.98,585.25,659.59,659.59,78464200
58,"Mar 01, 2016",556.29,603.24,538.58,593.64,593.64,94009500
59,"Feb 01, 2016",578.15,581.80,474.00,552.52,552.52,124144800
60,"Jan 01, 2016",656.29,657.72,547.18,587.00,587.00,130200900
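The cell above solves the exercise with `pd.read_html`. A BeautifulSoup version that mirrors the Netflix steps might look like the sketch below. `scrape_table` is a hypothetical helper, and it assumes that the page holds a single table whose column names are in &lt;th&gt; cells and whose data rows sit inside &lt;tbody&gt;.

```python
import pandas as pd
from bs4 import BeautifulSoup


def scrape_table(html, parser="html.parser"):
    """Turn the first <tbody> in `html` into a DataFrame of strings (hypothetical helper)."""
    soup = BeautifulSoup(html, parser)
    headers = [th.text.strip() for th in soup.find_all("th")]
    body = soup.find("tbody")
    if body is None:
        raise ValueError("no <tbody> found in the page")
    rows = []
    for tr in body.find_all("tr"):
        cells = [td.text.strip() for td in tr.find_all("td")]
        # keep only complete rows; a page may add note rows with fewer cells
        if len(cells) == len(headers):
            rows.append(cells)
    return pd.DataFrame(rows, columns=headers)


# Tiny inline example; the note row is made up to show the length check
demo = scrape_table(
    "<table><thead><tr><th>Date</th><th>Open</th></tr></thead>"
    "<tbody><tr><td>Jan 01, 2021</td><td>3270.00</td></tr>"
    "<tr><td colspan='2'>note row</td></tr></tbody></table>"
)
print(demo)
```

For the exercise page, a call such as `scrape_table(requests.get(url).text, parser="html5lib")` would be the analogue of the Netflix loop.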
