<h1>Extracting Stock Data Using Web Scraping</h1>

Not all stock data is available via an API. In this assignment, you will use web scraping to obtain financial data. You will be quizzed on your results.  
You will extract and share historical data from a web page using the BeautifulSoup library.

In [21]:
!conda install -c anaconda lxml -y

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 23.7.4
  latest version: 24.3.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=24.3.0



# All requested packages already installed.



In [5]:
#!pip install pandas==1.3.3
#!pip install requests==2.26.0
!mamba install bs4==4.10.0 -y
!mamba install html5lib==1.1 -y 
!pip install lxml==4.6.4
#!pip install plotly==5.3.1

zsh:1: command not found: mamba
zsh:1: command not found: mamba
Collecting lxml==4.6.4
  Using cached lxml-4.6.4.tar.gz (3.2 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: lxml
  Building wheel for lxml (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [108 lines of output]
      Building lxml version 4.6.4.
        import pkg_resources
      Building without Cython.
      Building against libxml2 2.10.4 and libxslt 1.1.37
      Building against libxml2/libxslt in one of the following directories:
        /Users/onesimomtintsilana/anaconda3/lib
        /Users/onesimomtintsilana/anaconda3/lib
        /Users/onesimomtintsilana/anaconda3/lib
        /Users/onesimomtintsilana/anaconda3/lib


In [22]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In Python, you can ignore warnings using the warnings module. You can use the filterwarnings function to filter or ignore specific warning messages or categories.

In [23]:
import warnings
# Ignore only warnings of category FutureWarning
warnings.filterwarnings("ignore", category=FutureWarning)
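To illustrate how category-based filtering behaves, here is a minimal sketch. The helper name `visible_warning_categories` is made up for this example. It emits one `FutureWarning` and one `UserWarning` inside a temporary context and reports which of them get through the filter.

```python
import warnings


def visible_warning_categories():
    """Emit two warnings and return the category names that are not filtered out."""
    # catch_warnings restores the previous filters when the block exits.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # show everything by default
        warnings.filterwarnings("ignore", category=FutureWarning)  # same filter as above
        warnings.warn("this API will change", FutureWarning)
        warnings.warn("something looks odd", UserWarning)
    return [w.category.__name__ for w in caught]


visible = visible_warning_categories()
print(visible)
```

Only the `UserWarning` should be reported, which shows that the filter silences a single category rather than every warning.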

## Example: Using Web Scraping to Extract Stock Data

We will extract Netflix stock data from https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork/labs/project/netflix_data_webpage.html.

<center> 
    
#### In this example, we use the Yahoo Finance website to extract Netflix data.

</center>
    <br>

  <center> <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork/labs/project/Images/netflix.png"> </center> 
  
<center> Fig.: The table that we need to extract </center>


The web page contains a table with the columns Date, Open, High, Low, Close, Adj Close, and Volume, from which we must extract the following columns:  

* Date 

* Open  

* High 

* Low 

* Close 

* Volume 


# Steps for extracting the data
1. Send an HTTP request to the web page using the requests library.
2. Parse the HTML content of the web page using BeautifulSoup.
3. Identify the HTML tags that contain the data you want to extract.
4. Use BeautifulSoup methods to extract the data from the HTML tags.
5. Print the extracted data.
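Steps 2 to 5 can be sketched end to end on a small inline table, so the example runs without a network connection (step 1 is skipped here). The function name `table_to_dataframe` and the trimmed `SAMPLE_HTML` string are assumptions made for illustration. The sketch uses Python's built-in `"html.parser"` to avoid extra installs.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Trimmed, hand-written sample that mimics the layout of the table on the page.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>Jun 01, 2021</td><td>504.01</td><td>528.21</td></tr>
    <tr><td>May 01, 2021</td><td>512.65</td><td>502.81</td></tr>
  </tbody>
</table>
"""


def table_to_dataframe(html, columns):
    """Parse the first <tbody> in `html` and return its rows as a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")      # step 2: parse the HTML
    rows = []
    for tr in soup.find("tbody").find_all("tr"):   # step 3: locate the rows
        cells = [td.text.strip() for td in tr.find_all("td")]  # step 4: extract
        rows.append(dict(zip(columns, cells)))
    return pd.DataFrame(rows, columns=columns)


df = table_to_dataframe(SAMPLE_HTML, ["Date", "Open", "Close"])
print(df)                                          # step 5: print the result
```

Note that every value comes back as a string; converting to numeric types would be a separate step.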


### Step 1: Send an HTTP request to the web page

You will use the requests library to send an HTTP request to the web page.<br>


In [24]:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0220EN-SkillsNetwork/labs/project/netflix_data_webpage.html"

The requests.get() function takes a URL as its first argument, which specifies the location of the resource to retrieve. Here, the web page URL is stored in the url variable, and that variable is passed to requests.get().

The .text attribute of the response returns the HTML content decoded as a string, which makes it readable.

In [25]:
data  = requests.get(url).text
#print(data)
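The one-line request above does not check whether the download succeeded. A slightly more defensive variant might look like the sketch below. The function name `fetch_html` and its `session` parameter are assumptions made for this example, not part of the lab; `session` can be the `requests` module itself or a `requests.Session` object.

```python
import requests


def fetch_html(url, session=requests, timeout=10):
    """Fetch a page and return its body as a string."""
    response = session.get(url, timeout=timeout)  # stop waiting after `timeout` seconds
    response.raise_for_status()                   # raise requests.HTTPError on 4xx/5xx
    return response.text                          # response body decoded to str
```

With this helper, `data = fetch_html(url)` would replace the call above and fail loudly on an error page instead of silently returning its HTML.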

### Step 2: Parse the HTML content


<hr>
<hr>
<center>

# What is parsing?
In simple words, parsing refers to the process of analyzing a string of text or a data structure, usually following a set of rules or grammar, to understand its structure and meaning.
Parsing involves breaking down a piece of text or data into its individual components or elements, and then analyzing those components to extract the desired information or to understand their relationships and meanings.</center>
<hr>
<hr>


Next, you will parse the raw HTML content of the web page, transforming it into a structured, hierarchical format that is easier to analyze and manipulate in Python. This can be done using a Python library called Beautiful Soup.

## Parsing the data using the BeautifulSoup library
* Create a new BeautifulSoup object.
<br>
<br>
<b>Note: </b>To create a BeautifulSoup object in Python, you typically pass two arguments to its constructor:

1. The HTML or XML content that you want to parse as a string.
2. The name of the parser that you want to use to parse the HTML or XML content. This argument is optional; if you don't specify a parser, BeautifulSoup picks the best HTML parser that is installed and issues a warning, so it is safer to name one explicitly.

In this lab, we use the "html5lib" parser.


In [26]:
soup = BeautifulSoup(data, 'html5lib')
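As a small illustration of what the parsed object offers, the sketch below parses a hand-written HTML string (made up for this example) and reads two tags from it. It uses `"html.parser"`, which ships with Python, instead of `"html5lib"`, so it runs without extra installs.

```python
from bs4 import BeautifulSoup

# Tiny hand-written document used only for this illustration.
html = "<html><head><title>Demo</title></head><body><p>hello</p></body></html>"

# "html.parser" is built in; "html5lib" and "lxml" must be installed separately.
demo_soup = BeautifulSoup(html, "html.parser")

print(demo_soup.title.text)  # text inside the <title> tag
print(demo_soup.p.text)      # text inside the first <p> tag
```

Different parsers can build slightly different trees from malformed HTML, which is one reason to name the parser explicitly.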

### Step 3: Identify the HTML tags

As stated above, the web page contains a table, so we will scrape the content of the HTML page and convert the table into a data frame.

You will create an empty data frame using the <b> pd.DataFrame() </b> function with the following columns:
* "Date"
* "Open"
* "High" 
* "Low" 
* "Close"
* "Volume"

In [27]:
netflix_data = pd.DataFrame(columns=["Date", "Open", "High", "Low", "Close", "Volume"])

<hr>
<hr>
<center>

### Working on HTML table  </center>
<br>

The following tags are used to create HTML tables.

* &lt;table&gt;: This tag is a root tag used to define the start and end of the table. All the content of the table is enclosed within these tags. 


* &lt;tr&gt;: This tag is used to define a table row. Each row of the table is defined within this tag.

* &lt;td&gt;: This tag is used to define a table cell. Each cell of the table is defined within this tag. You can specify the content of the cell between the opening and closing &lt;td&gt; tags.

* &lt;th&gt;: This tag is used to define a header cell in the table. The header cell is used to describe the contents of a column or row. By default, the text inside a &lt;th&gt; tag is bold and centered.

* &lt;tbody&gt;: This tag defines the main content of the table. It contains one or more &lt;tr&gt; rows.

<hr>
<hr>
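To see how these tags fit together, the sketch below parses a small hand-written table and separates the header cells from the body rows. The sample also uses a &lt;thead&gt; tag, which is not covered above, to group the header row. The variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# Hand-written table: one header row in <thead>, two data rows in <tbody>.
table_html = """
<table>
  <thead>
    <tr><th>Date</th><th>Open</th></tr>
  </thead>
  <tbody>
    <tr><td>Jun 01, 2021</td><td>504.01</td></tr>
    <tr><td>May 01, 2021</td><td>512.65</td></tr>
  </tbody>
</table>
"""

table_soup = BeautifulSoup(table_html, "html.parser")

headers = [th.text for th in table_soup.find_all("th")]        # header cells
body_rows = table_soup.find("tbody").find_all("tr")            # data rows only
first_row = [td.text for td in body_rows[0].find_all("td")]    # cells of the first data row

print(headers)
print(first_row)
```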


### Step 4: Use a BeautifulSoup method for extracting data

We will use the <b>find()</b> and <b>find_all()</b> methods of the BeautifulSoup object to locate the table body and the table rows, respectively, in the HTML. 
   * The <i>find()</i> method returns the first matching tag, or None if nothing matches.
   * The <i>find_all()</i> method returns a list of all matching tags in the HTML.
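The difference between the two methods can be seen on a tiny hand-written snippet (the HTML string and variable names below are made up for illustration):

```python
from bs4 import BeautifulSoup

list_soup = BeautifulSoup("<ul><li>a</li><li>b</li></ul>", "html.parser")

first_item = list_soup.find("li")          # the first matching tag
all_items = list_soup.find_all("li")       # every matching tag, in document order
missing = list_soup.find("table")          # None, because there is no <table>
no_matches = list_soup.find_all("table")   # empty result, not None

print(first_item.text)
print([li.text for li in all_items])
print(missing, len(no_matches))
```

Because find() returns None when there is no match, chaining a call such as `soup.find("tbody").find_all("tr")` raises an AttributeError on a page without a table body, so checking for None first can be worthwhile.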

In [30]:
# Initialize an empty list to store row data
rows_data = []

# Loop through each row of the table body and extract the cell values
for row in soup.find("tbody").find_all('tr'):
    col = row.find_all("td")
    date = col[0].text
    open_price = col[1].text  # lowercase name that does not shadow the built-in open()
    high = col[2].text
    low = col[3].text
    close = col[4].text
    adj_close = col[5].text
    volume = col[6].text
    
    # Append the data of each row to the list
    rows_data.append({"Date": date, "Open": open_price, "High": high, "Low": low, "Close": close, "Adj Close": adj_close, "Volume": volume})

# Create DataFrame from the list of row data
netflix_data = pd.DataFrame(rows_data)


### Step 5: Print the extracted data

We can now print out the data frame using the head() or tail() method.


In [32]:
netflix_data.head()

           Date    Open    High     Low   Close  Adj Close     Volume
0  Jun 01, 2021  504.01  536.13  482.14  528.21     528.21   78560600
1  May 01, 2021  512.65  518.95  478.54  502.81     502.81   66927600
2  Apr 01, 2021  529.93  563.56   499.0  513.47     513.47  111573300
3  Mar 01, 2021  545.57  556.99  492.85  521.66     521.66   90183900
4  Feb 01, 2021  536.79  566.65  518.28  538.85     538.85   61902300


# Extracting data using the `pandas` library

We can also use the `read_html` function from the pandas library and pass it the URL to extract the data.

<center>

## What is read_html in the pandas library?
`pd.read_html(url)` is a function provided by the pandas library in Python that is used to extract tables from HTML web pages. It takes in a URL as input and returns a list of DataFrames, one for each table found on the web page. 
</center>
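As a minimal offline sketch, `pd.read_html` can also read from a file-like object, so the example below wraps a hand-written HTML string in `StringIO` instead of using a URL. It assumes a parser backend that pandas supports (lxml, or bs4 together with html5lib) is installed; the sample table is made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hand-written table; the first row uses <th> cells, so it becomes the header.
sample_html = """<table>
  <tr><th>Date</th><th>Open</th><th>Volume</th></tr>
  <tr><td>Jun 01, 2021</td><td>504.01</td><td>78560600</td></tr>
  <tr><td>May 01, 2021</td><td>512.65</td><td>66927600</td></tr>
</table>"""

tables = pd.read_html(StringIO(sample_html))  # list with one DataFrame per <table>
sample_df = tables[0]                         # pick the table you need by position

print(len(tables))
print(sample_df)
```

Unlike the BeautifulSoup loop, `read_html` infers numeric column types, so a column such as Volume typically arrives as numbers rather than strings.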
