# Python Intermediate - Day 2
---

**More on Data Visualization**
- Pandas Built-in Plot
- Using Seaborn
- JSON Live Data Loading and Visualization

**Web Scraping**
- Plotting with Live Data
- Reading HTML
- Web Scraping

## Import

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Pandas Built-in Plotting

The `plot()` method on `Series` and `DataFrame` is a thin wrapper around Matplotlib's plotting functions such as `plt.plot()`.

Using the Pandas built-in plotting is quick and easy. 

**import students data file**
```
students = pd.read_excel('./data/students.xlsx') 
```

**simple plot**
```
students.plot()
students.plot.bar()
students.plot(x='AcademicYear', y='Undergraduate')
```

The built-in plots are good enough for a first look at your data.

When you need theming and styling on your plots, use Seaborn.
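
Because the pandas plot is a wrapper around Matplotlib, the object it returns can be customized with ordinary Matplotlib calls. The sketch below illustrates this with a small made-up DataFrame; the column names and numbers are assumptions for illustration, not the contents of `students.xlsx`.

```python
import pandas as pd

# Hypothetical stand-in for students.xlsx; the real file's columns may differ.
demo = pd.DataFrame({
    "AcademicYear": ["2018/19", "2019/20", "2020/21"],
    "Undergraduate": [100, 120, 115],
    "Postgraduate": [40, 45, 50],
})

# DataFrame.plot() returns a Matplotlib Axes, so Matplotlib methods apply to it.
ax = demo.plot(x="AcademicYear", y=["Undergraduate", "Postgraduate"], marker="o")
ax.set_title("Headcount by academic year (made-up numbers)")
ax.set_ylabel("Headcount")
ax.tick_params(axis="x", labelrotation=90)  # rotate the x tick labels
```

Keeping the returned `ax` is optional, but it avoids relying on whichever figure happens to be current when several plots are drawn in one cell.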

In [None]:
students = pd.read_excel('./data/students.xlsx')

In [None]:
students

In [None]:
students.plot()

## Pandas Built-in Plot Types

Besides the default line plot, you can choose from the following plot types.

* `bar` or `barh` for bar plots
* `hist` for histogram
* `box` for boxplot
* `kde` or `density` for density plots
* `area` for area plots
* `scatter` for scatter plots
* `hexbin` for hexagonal bin plots
* `pie` for pie plots


Example:
```
students.plot(kind='bar')

students.plot(kind='kde')

students.plot(kind='box')
plt.xticks(rotation=90)

students.plot(x='AcademicYear', y='Undergraduate')
```

or
```
students.plot.bar()
students.plot.bar(stacked=True)
students.plot.kde()

```

In [None]:
students.plot(kind='kde')

In [None]:
students.plot(kind='box')
plt.xticks(rotation=90)

In [None]:
students.plot(x='AcademicYear', y='Undergraduate')

In [None]:
students.plot.bar()

In [None]:
students.plot.bar(stacked=True)

## More on pandas chart visualization

[Pandas Chart Visualization](https://pandas.pydata.org/docs/user_guide/visualization.html)

# Using Seaborn

* Seaborn is a library that uses Matplotlib underneath to plot graphs.
* Matplotlib usually works with array-like inputs such as NumPy arrays, while Seaborn accepts a pandas DataFrame directly through its `data` parameter.
* Seaborn offers built-in themes and styles, which makes plotting easier.
* Seaborn complements Matplotlib rather than replacing it.
* Seaborn and Matplotlib are usually used together.

## Required Imports
```
import matplotlib.pyplot as plt
import seaborn as sns
```

You usually also import numpy and pandas
```
import numpy as np
import pandas as pd
```

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Built-in Seaborn Data Sets

Use the following command to load the names of built-in data sets
```
sns.get_dataset_names()
```

To load a built-in data set
```
iris = sns.load_dataset('iris')
```

Simple Scatter Plot
```
sns.scatterplot(data=iris, x='sepal_length', y='sepal_width')
sns.scatterplot(data=iris, x='petal_length', y='petal_width')
```

In [None]:
sns.get_dataset_names()

In [None]:
iris = sns.load_dataset('iris')
iris

In [None]:
sns.scatterplot(data=iris, x='sepal_length', y='sepal_width')

In [None]:
sns.scatterplot(data=iris, x='petal_length', y='petal_width')

## Loading Data

**Reading CSV File**
```
gra = pd.read_csv('./data/graduates.csv') 
```

**Extracting Rows**
```
ug_bm_m = gra[(gra['LevelOfStudy']=='Undergraduate') 
                    & (gra['ProgrammeCategory']=='Business and Management') 
                    & (gra['Sex']=='M')]
```

**Extracting the column required for plotting**
```
year = ug_bm_m['AcademicYear']
headcount = ug_bm_m['Headcount']
```
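
The row filter above combines several boolean masks. As a minimal sketch of how that works, the example below builds each mask separately on a tiny made-up DataFrame (the rows are hypothetical; only the column names follow `graduates.csv`) and shows an equivalent `query()` call.

```python
import pandas as pd

# Hypothetical rows using the same column names as graduates.csv.
demo = pd.DataFrame({
    "LevelOfStudy": ["Undergraduate", "Undergraduate", "Postgraduate", "Undergraduate"],
    "ProgrammeCategory": ["Business and Management", "Business and Management",
                          "Business and Management", "Sciences"],
    "Sex": ["M", "F", "M", "M"],
    "Headcount": [10, 12, 5, 8],
})

# Each comparison yields a boolean Series with one True/False per row.
is_ug = demo["LevelOfStudy"] == "Undergraduate"
is_bm = demo["ProgrammeCategory"] == "Business and Management"
is_male = demo["Sex"] == "M"

# & is the element-wise AND; written inline, each comparison needs its own parentheses.
selected = demo[is_ug & is_bm & is_male]

# The same filter expressed as a query string.
same_with_query = demo.query(
    "LevelOfStudy == 'Undergraduate' and "
    "ProgrammeCategory == 'Business and Management' and Sex == 'M'"
)
```

Naming the masks makes long filters easier to read and to reuse, for example `demo[is_ug & is_bm]` for both sexes.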

In [None]:
gra = pd.read_csv('./data/graduates.csv') 

In [None]:
ug_bm_m = gra[(gra['LevelOfStudy']=='Undergraduate') 
                    & (gra['ProgrammeCategory']=='Business and Management') 
                    & (gra['Sex']=='M')]

year = ug_bm_m['AcademicYear']
headcount = ug_bm_m['Headcount']

## Matplotlib Style

```
plt.plot(year, headcount)
```

In [None]:
plt.plot(year, headcount)

## Set to Seaborn Style

Call the `sns.set()` function to activate the Seaborn style.

```
sns.set() # activate seaborn style
plt.plot(year, headcount) # use matplotlib plot() function for plotting
```

In [None]:
sns.set() # activate seaborn style
plt.plot(year, headcount)

## Seaborn and Matplotlib are Usually Used Together

In the example below, we call seaborn's `set()` to activate Seaborn style. 

Then we use Matplotlib's `xticks()` to set the rotation of the x tick labels and call its `plot()` function to generate the plot.

**Example:**
```
sns.set() # set to seaborn style
plt.xticks(rotation=90) # calling Matplotlib xticks rotation function
plt.plot(year, headcount)

```

In [None]:
sns.set() # set to seaborn style
plt.xticks(rotation=90) # calling Matplotlib xticks rotation function
plt.plot(year, headcount)

## Style
Seaborn splits the Matplotlib parameters into two groups

* Plot styles
* Plot scale


Use seaborn's `set_style()` function to manipulate the styles.

Below are some themes
* `darkgrid`
* `whitegrid`
* `dark`
* `white`
* `ticks`

**Example**:
```
sns.set_style('white')
plt.plot(year, headcount)
```

**More on `set_style()`**:

[set_style()](https://seaborn.pydata.org/generated/seaborn.set_style.html)

In [None]:
sns.set_style('white')
plt.plot(year, headcount)

## Removing Axes Spines

Call `despine()` function to remove the spine to achieve a cleaner output.

```
sns.set_style('white')
plt.plot(year, headcount)
sns.despine()
```

In [None]:
sns.set_style('white')
plt.plot(year, headcount)
sns.despine()

## Customizing Axes Style

Call `sns.axes_style()` to show the current axes style.
```
sns.axes_style()
```

**To change the axes style**:
```
sns.set_style("darkgrid", {'axes.facecolor': 'yellow', 'grid.color': '.8'})
plt.plot(year, headcount)

sns.set_style("darkgrid", {'axes.facecolor': 'white', 'grid.color': '.8'})
plt.plot(year, headcount)

```

In [None]:
sns.axes_style()

In [None]:
sns.set_style("darkgrid", {'axes.facecolor': 'yellow', 'grid.color': '.8'})
plt.plot(year, headcount)

In [None]:
sns.set_style("darkgrid", {'axes.facecolor': 'white', 'grid.color': '.8'})
plt.plot(year, headcount)

In [None]:
plt.plot(year, headcount)

## Default Color Palettes

**Use `color_palette()` to choose the colors used in your plots**

```
current_palette = sns.color_palette()
sns.palplot(current_palette) # palplot() plots the array of colors horizontally
plt.show()
```

In [None]:
current_palette = sns.color_palette()
sns.palplot(current_palette) # palplot() plots the array of colors horizontally
plt.show()

## Ready to Use Palette

**Some built-in seaborn color palettes**:
* `deep`
* `muted`
* `bright`
* `pastel`
* `dark`
* `colorblind`

**Show the palette**:
```
sns.palplot(sns.color_palette('pastel')) 
plt.show()
```

**Show more palettes**:
```
sns.color_palette("Set1")
sns.color_palette("Set2")
sns.color_palette("Set3")
```

In [None]:
sns.palplot(sns.color_palette('pastel')) 

In [None]:
sns.color_palette("Set1")

In [None]:
sns.color_palette("Set2")

In [None]:
sns.color_palette("Set3")

## Reset Style

Call `reset_defaults()` to reset the seaborn style.

```
sns.reset_defaults()
```

In [None]:
sns.reset_defaults()

## Line Plot

**Examples: undergraduate business management (male students only)**
```
ug_bm_m
sns.lineplot(x='AcademicYear', y='Headcount', data=ug_bm_m)
```

In [None]:
ug_bm_m

In [None]:
sns.lineplot(x='AcademicYear', y='Headcount', data=ug_bm_m)

**Examples: undergraduate business management**
```
ug_bm = gra[(gra['LevelOfStudy']=='Undergraduate') 
                    & (gra['ProgrammeCategory']=='Business and Management')]
ug_bm


plt.xticks(rotation=90)
sns.lineplot(x='AcademicYear', 
             y='Headcount', 
             data=ug_bm, 
             hue='Sex', 
             marker='o')
```

In [None]:
ug_bm = gra[(gra['LevelOfStudy']=='Undergraduate') 
                    & (gra['ProgrammeCategory']=='Business and Management')]
ug_bm

In [None]:
plt.xticks(rotation=90)
sns.lineplot(x='AcademicYear', 
             y='Headcount', 
             data=ug_bm, 
             hue='Sex', 
             marker='o')

**Examples: undergraduate female students**
```
ug_f = gra[(gra['LevelOfStudy']=='Undergraduate') 
                    & (gra['Sex']=='F')]
ug_f

plt.xticks(rotation=90)
sns.lineplot(x='AcademicYear', y='Headcount', data=ug_f, hue='ProgrammeCategory')
plt.legend(loc='upper left')
```

In [None]:
ug_f = gra[(gra['LevelOfStudy']=='Undergraduate') 
                    & (gra['Sex']=='F')]
ug_f

In [None]:
plt.xticks(rotation=90)
sns.lineplot(x='AcademicYear', y='Headcount', data=ug_f, hue='ProgrammeCategory')
plt.legend(loc='upper left')

## Scatter Plot

```
plt.xticks(rotation=90)
sns.scatterplot(x='AcademicYear', 
                y='Headcount', 
                data=ug_bm, 
                hue='Sex')
```

In [None]:
plt.xticks(rotation=90)
sns.scatterplot(x='AcademicYear', 
                y='Headcount', 
                data=ug_bm, 
                hue='Sex')

## Bar Plot

```
sns.barplot(x='AcademicYear', 
            y='Headcount', 
            data=ug_bm_m, 
            palette=sns.color_palette("Set1")
           )
```

```
sns.color_palette("Set1")
```

```
sns.barplot(x='AcademicYear', 
            y='Headcount', 
            data=ug_bm, 
            hue='Sex', 
            palette=sns.color_palette("Set1")
           )
```


In [None]:
sns.barplot(x='AcademicYear', 
            y='Headcount', 
            data=ug_bm_m, 
            palette=sns.color_palette("Set1")
           )

In [None]:
sns.color_palette("Set1")

In [None]:
sns.barplot(x='AcademicYear', 
            y='Headcount', 
            data=ug_bm, 
            hue='Sex', 
            palette=sns.color_palette("Set1")
           )

## Pie Chart

```
ug_m_2122 = gra[(gra['LevelOfStudy']=='Undergraduate') 
                    & (gra['Sex']=='M')
                    & (gra['AcademicYear']=='2020/21')]
ug_m_2122

colors = sns.color_palette('pastel')
plt.pie(ug_m_2122['Headcount'], labels=ug_m_2122['ProgrammeCategory'], colors=colors)
plt.show()

colors = sns.color_palette('dark')
plt.pie(ug_m_2122['Headcount'], labels=ug_m_2122['ProgrammeCategory'], colors=colors)
plt.show()
```

In [None]:
ug_m_2122 = gra[(gra['LevelOfStudy']=='Undergraduate') 
                    & (gra['Sex']=='M')
                    & (gra['AcademicYear']=='2020/21')]
ug_m_2122

In [None]:
colors = sns.color_palette('pastel')
plt.pie(ug_m_2122['Headcount'], labels=ug_m_2122['ProgrammeCategory'], colors=colors)
plt.show()

In [None]:
colors = sns.color_palette('dark')
plt.pie(ug_m_2122['Headcount'], labels=ug_m_2122['ProgrammeCategory'], colors=colors)
plt.show()

## Boxplots

```
ug_f

sns.catplot(x="ProgrammeCategory", y="Headcount", kind="box", data=ug_f)
plt.xticks(rotation=90)

ug = gra[gra['LevelOfStudy']=='Undergraduate']
ug

sns.catplot(x="ProgrammeCategory", y="Headcount", hue='Sex', kind="box", data=ug)
plt.xticks(rotation=90)
```

In [None]:
ug_f

In [None]:
sns.catplot(x="ProgrammeCategory", y="Headcount", kind="box", data=ug_f)
plt.xticks(rotation=90)

In [None]:
ug = gra[gra['LevelOfStudy']=='Undergraduate']
ug

In [None]:
sns.catplot(x="ProgrammeCategory", y="Headcount", hue='Sex', kind="box", data=ug)
plt.xticks(rotation=90)

## Build Your Own Palette

**Use the `color_palette()` function to build your own palette**

```
sns.color_palette(n_colors=4) # specify number of colors to use
sns.palplot(sns.color_palette("Reds")) # specify a base color. Don't forget the ending 's'
sns.color_palette("light:purple") # light theme for a chosen color
sns.color_palette("light:#5A9") # light theme for a chosen color
sns.color_palette("dark:#f00") # dark theme for a chosen color
sns.color_palette("blend:#f00,#00F") # blending from one color to another color
```

**To reset style to default**
```
sns.reset_defaults()

```

**More on `color_palette()`**:

[seaborn.color_palette](https://seaborn.pydata.org/generated/seaborn.color_palette.html#seaborn.color_palette)

### specify `n_colors`

```
sns.color_palette(n_colors=4)
plt.pie(ug_m_2122['Headcount'], 
        labels=ug_m_2122['ProgrammeCategory'], 
        colors=sns.color_palette(n_colors=4))
plt.show()
```

In [None]:
sns.color_palette(n_colors=4)
plt.pie(ug_m_2122['Headcount'], 
        labels=ug_m_2122['ProgrammeCategory'], 
        colors=sns.color_palette(n_colors=4))
plt.show()

### specify fading color series

```
sns.color_palette("Greens")
plt.pie(ug_m_2122['Headcount'], 
        labels=ug_m_2122['ProgrammeCategory'], 
        colors=sns.color_palette("Greens"))
plt.show()
```

In [None]:
sns.color_palette("Greens")
plt.pie(ug_m_2122['Headcount'], 
        labels=ug_m_2122['ProgrammeCategory'], 
        colors=sns.color_palette("Greens"))
plt.show()

### specify light color series using color name

```
sns.color_palette("light:purple")
plt.pie(ug_m_2122['Headcount'], 
        labels=ug_m_2122['ProgrammeCategory'], 
        colors=sns.color_palette("light:purple"))
plt.show()
```

In [None]:
sns.color_palette("light:purple")
plt.pie(ug_m_2122['Headcount'], 
        labels=ug_m_2122['ProgrammeCategory'], 
        colors=sns.color_palette("light:purple"))
plt.show()

### specify light color series using color code

```
sns.color_palette("light:#5A9")
plt.pie(ug_m_2122['Headcount'], 
        labels=ug_m_2122['ProgrammeCategory'], 
        colors=sns.color_palette("light:#5A9"))
plt.show()
```

In [None]:
sns.color_palette("light:#5A9")
plt.pie(ug_m_2122['Headcount'], 
        labels=ug_m_2122['ProgrammeCategory'], 
        colors=sns.color_palette("light:#5A9"))
plt.show()

### specify dark color series

```
sns.color_palette("dark:#f00")
plt.pie(ug_m_2122['Headcount'], 
        labels=ug_m_2122['ProgrammeCategory'], 
        colors=sns.color_palette("dark:#f00"))
plt.show()
```

In [None]:
sns.color_palette("dark:#f00")
plt.pie(ug_m_2122['Headcount'], 
        labels=ug_m_2122['ProgrammeCategory'], 
        colors=sns.color_palette("dark:#f00"))
plt.show()

### specify blend color series

```
sns.color_palette("blend:#f00,#00F")
plt.pie(ug_m_2122['Headcount'], 
        labels=ug_m_2122['ProgrammeCategory'], 
        colors=sns.color_palette("blend:#f00,#00F"))
plt.show()
```

In [None]:
sns.color_palette("blend:#f00,#00F")
plt.pie(ug_m_2122['Headcount'], 
        labels=ug_m_2122['ProgrammeCategory'], 
        colors=sns.color_palette("blend:#f00,#00F"))
plt.show()

## Grouping and Groups' Aggregation

```
gra

grouped_by_level = gra.groupby(['LevelOfStudy', 'AcademicYear'])

type(grouped_by_level)

grouped_by_level.agg(np.sum)

plt.xticks(rotation=90)
sns.lineplot(x='AcademicYear', y='Headcount', 
             data=grouped_by_level.agg(np.sum), 
             hue='LevelOfStudy')
sns.set_style('whitegrid')
```
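
To make the grouping step concrete, here is a minimal sketch on a made-up DataFrame (the rows are hypothetical). It selects only the numeric `Headcount` column before summing, so text columns are left out of the aggregation, and uses `as_index=False` so the group keys stay as ordinary columns.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the graduates data.
demo = pd.DataFrame({
    "LevelOfStudy": ["Undergraduate", "Undergraduate", "Undergraduate", "Postgraduate"],
    "AcademicYear": ["2019/20", "2019/20", "2020/21", "2019/20"],
    "Sex": ["F", "M", "F", "F"],
    "Headcount": [10, 20, 15, 5],
})

# One row per (LevelOfStudy, AcademicYear) pair, with Headcount summed across Sex.
totals = (
    demo.groupby(["LevelOfStudy", "AcademicYear"], as_index=False)["Headcount"]
        .sum()
)
print(totals)
```

A frame shaped like `totals` can be passed to `sns.lineplot()` with `x='AcademicYear'`, `y='Headcount'` and `hue='LevelOfStudy'` in the same way as above.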


In [None]:
gra

In [None]:
grouped_by_level = gra.groupby(['LevelOfStudy', 'AcademicYear'])

In [None]:
type(grouped_by_level)

In [None]:
grouped_by_level.agg(np.sum)

In [None]:
plt.xticks(rotation=90)
sns.lineplot(x='AcademicYear', y='Headcount', 
             data=grouped_by_level.agg(np.sum), 
             hue='LevelOfStudy')
sns.set_style('whitegrid')

## Seaborn Example Gallery

Check the webpage below for seaborn example gallery.

[Seaborn Example Gallery](https://seaborn.pydata.org/examples/index.html)

# JSON Live Data Loading & Visualization




## HTTP `requests` 

`requests` is a Python package for fetching data over HTTP.

**To import**

```
import requests
```

**Provide a web page url (link) that you want to fetch data from**
```
hk_w_url = 'https://data.weather.gov.hk/weatherAPI/opendata/weather.php?dataType=fnd&lang=en'
```

**Issue an HTTP GET request and save the server's response in the variable `r`**
```
r = requests.get(hk_w_url)
```

**Exploring the response from server**

* `r.status_code` # returns 200 if the request succeeded
* `r.headers` # descriptive headers about the server's response
* `r.headers['content-type']` # returns 'application/json; charset=utf-8'
* `r.encoding` # returns 'utf-8'
* `r.text` # the server's response as plain text, type `str`
* `r.json()` # the server's response parsed as JSON, type `dict`
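
The difference between `r.text` and `r.json()` can be sketched without a network call. In the example below, `sample_text` is a hypothetical, heavily shortened body standing in for `r.text`; parsing it with the standard `json` module gives a result comparable to what `r.json()` returns.

```python
import json

# Hypothetical, shortened response body standing in for r.text.
sample_text = '{"generalSituation": "Fine and dry.", "weatherForecast": [{"week": "Monday"}]}'

parsed = json.loads(sample_text)   # comparable to what r.json() hands back

print(type(sample_text).__name__)  # the raw body is a str, like r.text
print(type(parsed).__name__)       # the parsed body is a dict, like r.json()
print(parsed["weatherForecast"][0]["week"])
```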


In [None]:
import requests

In [None]:
hk_w_url = 'https://data.weather.gov.hk/weatherAPI/opendata/weather.php?dataType=fnd&lang=en'

In [None]:
r = requests.get(hk_w_url)

In [None]:
r.status_code

In [None]:
r.headers

In [None]:
r.headers['content-type']

In [None]:
r.encoding

In [None]:
type(r.text)

In [None]:
type(r.json())

## Retrieving JSON Data

**How to retrieve JSON child element**
```
r.json()
type(r.json()) # JSON data is stored as a Python dictionary
r.json()['generalSituation']
r.json()['weatherForecast']
r.json()['weatherForecast'][0]
r.json()['weatherForecast'][0]['week']
r.json()['weatherForecast'][0]['forecastWeather']
```


**Retrieving nested attributes**
```
print(r.json()['weatherForecast'][0]['week'],
      ' | ',
      r.json()['weatherForecast'][0]['forecastMintemp']['value'],
      ('°' + r.json()['weatherForecast'][0]['forecastMintemp']['unit']),
      ' | ',
      r.json()['weatherForecast'][0]['forecastWeather']
     )
```
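
The same lookups can be written as a loop over every forecast day instead of indexing `[0]` by hand. This is a sketch on a hypothetical sample dictionary shaped like `r.json()`; the values are made up, and with live data you would store `r.json()` in a variable once rather than calling it repeatedly.

```python
# Hypothetical sample shaped like r.json(); real values come from the API.
data = {
    "weatherForecast": [
        {"week": "Monday", "forecastWeather": "Sunny periods.",
         "forecastMintemp": {"value": 18, "unit": "C"}},
        {"week": "Tuesday", "forecastWeather": "Cloudy with showers.",
         "forecastMintemp": {"value": 17, "unit": "C"}},
    ]
}

lines = []
for day in data["weatherForecast"]:      # each element is a dict for one day
    low = day["forecastMintemp"]         # nested dict with 'value' and 'unit'
    lines.append(f"{day['week']} | {low['value']}°{low['unit']} | {day['forecastWeather']}")

print("\n".join(lines))
```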

In [None]:
r.json()

In [None]:
type(r.json())

In [None]:
r.json()['generalSituation']

In [None]:
r.json()['weatherForecast']

In [None]:
r.json()['weatherForecast'][0]

In [None]:
r.json()['weatherForecast'][0]['week']

In [None]:
r.json()['weatherForecast'][0]['forecastWeather']

In [None]:
print(r.json()['weatherForecast'][0]['week'],
      ' | ',
      r.json()['weatherForecast'][0]['forecastMintemp']['value'],
      ('°' + r.json()['weatherForecast'][0]['forecastMintemp']['unit']),
      ' | ',
      r.json()['weatherForecast'][0]['forecastWeather']
     )

## Plotting Weather Forecast

**Retrieving weatherForecast element**
```
weather = r.json()['weatherForecast']
weather
type(weather)
```

**Flatten the weatherForecast records into a DataFrame**
```
weather_normalized = pd.json_normalize(weather)
type(weather_normalized)
weather_normalized.info()
sns.lineplot(x='forecastDate', y='forecastMaxtemp.value', data=weather_normalized)
```
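
`pd.json_normalize()` turns nested dictionaries into dotted column names, which is where `forecastMaxtemp.value` comes from. The sketch below uses a hypothetical two-day sample; the `YYYYMMDD` date strings are an assumption about the feed's format, so adjust the `format` argument if the live data differs.

```python
import pandas as pd

# Hypothetical records shaped like r.json()['weatherForecast'].
weather_sample = [
    {"forecastDate": "20240101",
     "forecastMaxtemp": {"value": 22, "unit": "C"},
     "forecastMintemp": {"value": 18, "unit": "C"}},
    {"forecastDate": "20240102",
     "forecastMaxtemp": {"value": 21, "unit": "C"},
     "forecastMintemp": {"value": 17, "unit": "C"}},
]

# Nested keys become columns named parent.child.
flat = pd.json_normalize(weather_sample)
print(flat.columns.tolist())

# The date arrives as text; parsing it gives a real date axis for plotting.
flat["forecastDate"] = pd.to_datetime(flat["forecastDate"], format="%Y%m%d")
```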

**Multiple Plots**
```
sns.set_style('darkgrid')
plt.xticks(rotation=90)
sns.lineplot(x='forecastDate', 
             y='forecastMaxtemp.value', 
             data=weather_normalized, 
             label='Max Temp', 
             marker='o')

sns.lineplot(x='forecastDate', 
             y='forecastMintemp.value', 
             data=weather_normalized, 
             label='Min Temp', 
             marker='o')
```

In [None]:
# Retrieve weatherForecast attribute and convert to DataFrame
weather = r.json()['weatherForecast']

In [None]:
weather

In [None]:
type(weather)

In [2]:
weather_normalized = pd.json_normalize(weather)

In [None]:
type(weather_normalized)

In [None]:
weather_normalized.info()

In [None]:
sns.lineplot(x='forecastDate', y='forecastMaxtemp.value', data=weather_normalized)

In [None]:
sns.set_style('darkgrid')
plt.xticks(rotation=90)
sns.lineplot(x='forecastDate', 
             y='forecastMaxtemp.value', 
             data=weather_normalized, 
             label='Max Temp', 
             marker='o')

sns.lineplot(x='forecastDate', 
             y='forecastMintemp.value', 
             data=weather_normalized, 
             label='Min Temp', 
             marker='o')

# EXERCISE: Interbank Liquidity

Follow the code logic and syntax from the previous weather example. Retrieve the live interbank liquidity data and draw a line plot of `hibor_fixing_1m` and `hibor_overnight`.

**The link for retrieving `daily figures interbank liquidity`**
```
inter_liq_url = "https://api.hkma.gov.hk/public/market-data-and-statistics/daily-monetary-statistics/daily-figures-interbank-liquidity"
```

**HTTP Request**
```
r = requests.get(inter_liq_url)
type(r)
inter_liq_json = r.json()
type(inter_liq_json)
```

In [None]:
inter_liq_url = "https://api.hkma.gov.hk/public/market-data-and-statistics/daily-monetary-statistics/daily-figures-interbank-liquidity"

In [None]:
r = requests.get(inter_liq_url)

In [None]:
type(r)

In [None]:
inter_liq_json = r.json()

In [None]:
type(inter_liq_json)

In [None]:
inter_liq_json["result"]["records"]

In [None]:
records = inter_liq_json["result"]["records"]

In [None]:
type(records)

In [None]:
inter_liq_df = pd.DataFrame(inter_liq_json["result"]["records"])

In [None]:
type(inter_liq_df)

In [None]:
inter_liq_df

In [None]:
inter_liq_df.info()

## Plotting

```
inter_liq_df.plot() # pandas built-in plot; it uses every numeric column, so the result is hard to read

inter_liq_df.plot(y='hibor_fixing_1m') # pandas built-in plots

inter_liq_df.info()

plt.xticks(rotation=90)
plt.xticks([]) # removes xticks
sns.lineplot(x='end_of_date', y='hibor_fixing_1m', data=inter_liq_df)

plt.xticks(rotation=90)
plt.xticks([]) # removes xticks
plt.yticks([]) # removes yticks
sns.lineplot(x='end_of_date', y='hibor_fixing_1m', data=inter_liq_df, label="Hibor One Month")
sns.lineplot(x='end_of_date', y='hibor_overnight', data=inter_liq_df, label="Hibor Overnight")
```

In [None]:
inter_liq_df.plot() # pandas built-in plot; it uses every numeric column, so the result is hard to read

In [None]:
inter_liq_df.plot(y='hibor_fixing_1m') # pandas built-in plots

In [None]:
inter_liq_df.info()

In [None]:
plt.xticks(rotation=90)
plt.xticks([]) # removes xticks
sns.lineplot(x='end_of_date', y='hibor_fixing_1m', data=inter_liq_df)

In [None]:
plt.xticks(rotation=90)
plt.xticks([]) # removes xticks
plt.yticks([]) # removes yticks
sns.lineplot(x='end_of_date', y='hibor_fixing_1m', data=inter_liq_df, label="Hibor One Month")
sns.lineplot(x='end_of_date', y='hibor_overnight', data=inter_liq_df, label="Hibor Overnight")

# Web Scraping Introduction

When a JSON data source is not an option, you can write your own Python code to grab data from the web.

**However, keep in mind that**
- Reading HTML is not always easy
- Many websites implement protections to prevent data grabbing
- Modern web applications generate content on the fly while the page is loading. The HTML page is nearly empty at the beginning and is filled in progressively by JavaScript

## Use Pandas `read_html()` function

You can use pandas' `read_html()` to read a URL that contains HTML tables.  
It requires a web page link and returns a list with one DataFrame per table found.

**Declare url and read url**
```
sp500_url = 'http://www.multpl.com/s-p-500-dividend-yield/table?f=m'
raw_html_tbl = pd.read_html(sp500_url)
```

**Retrieve the elements**
```
type(raw_html_tbl) # returns 'list'
len(raw_html_tbl) # returns 1
raw_html_tbl[0] # get the first table
type(raw_html_tbl[0]) # pandas DataFrame
```

In [None]:
sp500_url = 'http://www.multpl.com/s-p-500-dividend-yield/table?f=m'
raw_html_tbl = pd.read_html(sp500_url)

In [None]:
type(raw_html_tbl)

In [None]:
len(raw_html_tbl)

In [None]:
raw_html_tbl[0]

In [None]:
type(raw_html_tbl[0])

## `read_html()` doesn't always work

- Many pages contain broken HTML
- Some HTML is generated by JavaScript on the fly, so the table is not in the page source

The following scraping won't work:

```
hkej_url = 'https://stock360.hkej.com/marketWatch/Top20'
raw_html_tbl2 = pd.read_html(hkej_url)
raw_html_tbl2
```

In [None]:
hkej_url = 'https://stock360.hkej.com/marketWatch/Top20'
raw_html_tbl2 = pd.read_html(hkej_url)
raw_html_tbl2

## Another Example of Web Scraping Using Pandas

Retrieving currency table from Yahoo Finance

```
yahoo_url = 'https://hk.finance.yahoo.com/currencies'
raw_html_tbl3 = pd.read_html(yahoo_url)
raw_html_tbl3[0]
```

In [None]:
yahoo_url = 'https://hk.finance.yahoo.com/currencies'
raw_html_tbl3 = pd.read_html(yahoo_url)
raw_html_tbl3[0]

# BeautifulSoup

You can extract HTML elements by using BeautifulSoup.  

BeautifulSoup is a popular web scraping tool.  

Scrapy and Selenium are also widely used.

## Import
Import `BeautifulSoup` before you use it

```
from bs4 import BeautifulSoup
```

In [None]:
from bs4 import BeautifulSoup

## Work with Dummy HTML

**Declare the following HTML document**

```
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<b>Sample HTML Contents</b>
<p class="title purple"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister purple" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie (<b>Important</b>)</a>;
and they lived at the bottom of a well.</p>

<p class="story">a paragraph ... </p>
</body>
</html>
"""
```

**Creating the soup object**
```
soup = BeautifulSoup(html_doc, 'html.parser')
```

In [None]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<b>Sample HTML Contents</b>
<p class="title purple"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister purple" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">a paragraph ... </p>
</body>
</html>
"""

In [None]:
soup = BeautifulSoup(html_doc, 'html.parser')

In [None]:
soup

In [None]:
type(soup)

## `prettify()` function

The following command will display a neat output
```
print(soup.prettify())
```

In [None]:
print(soup.prettify())

## `find()` a child element
Examples:
```
soup.find('html')
soup.find('head')
soup.find('title')
soup.find('body')
soup.find('p')
p = soup.find('p')
type(p)
b = p.find('b')
type(b)
```
_The `find()` function returns only the FIRST matching element, even if several elements match._
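
A short sketch of this behaviour, using a made-up HTML snippet rather than the `html_doc` above: `find()` returns the first match, and returns `None` when nothing matches, so it is worth checking the result before using it.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two <a> tags and no <table>.
snippet = '<p class="story">One <a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
demo_soup = BeautifulSoup(snippet, "html.parser")

first_link = demo_soup.find("a")     # only the first of the two <a> tags
missing = demo_soup.find("table")    # no <table> in the snippet, so this is None

print(first_link["id"])
if missing is None:
    print("no <table> found")
```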

## Use `.` to refer to a child element
To retrieve the `<title>` child tag:
```
soup.title
```
Other examples on child elements
```
soup.html
soup.head
soup.body
title_tag = soup.title 
print(title_tag.name)
print(title_tag.string)
print(title_tag.text)
```
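
For the `<title>` tag, `.string` and `.text` give the same result, but they differ once a tag contains other tags. The sketch below uses a small made-up snippet to show the difference: `.string` is `None` when a tag has more than one child, while `.text` joins all the text inside the tag.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one plain link and one link containing a nested <b> tag.
demo_soup = BeautifulSoup(
    '<a id="link2">Lacie</a><a id="link3">Tillie (<b>Important</b>)</a>',
    "html.parser",
)
simple, nested = demo_soup.find_all("a")

print(simple.string)   # the single text child of the first link
print(nested.string)   # None, because the second link has several children
print(nested.text)     # all text inside the second link, nested tags included
```

When scraping real pages, `.text` (or `get_text()`) is often the safer choice because cells frequently contain extra markup.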

In [None]:
title_tag = soup.title

In [None]:
title_tag

In [None]:
type(title_tag)

In [None]:
title_tag.name

In [None]:
title_tag.string

In [None]:
title_tag.text

In [None]:
type(title_tag.string)

In [None]:
type(title_tag.text)

## Get the parent tag
`.parent` gives the parent tag of current tag
```
title_tag.parent
title_tag.parent.name
title_tag.parent.string
```

In [None]:
title_tag.parent

In [None]:
title_tag.parent.name

In [None]:
title_tag.parent.string

## Extract the attributes of a tag
Showing attributes
```
a_tag = soup.a
a_tag
a_tag["class"]
a_tag["href"]
a_tag["id"]
a_tag.attrs # show all the attributes of a tag
```
Showing all attributes
```
a_tag.attrs
```

In [None]:
a_tag = soup.a

In [None]:
a_tag

In [None]:
a_tag["id"]

In [None]:
a_tag.attrs

## `find_all()` elements
The `find_all()` function returns all the matching tags as a list-like `ResultSet`.

Example:
```
soup.find_all('a')
links = soup.find_all('a')
print(links)
type(links)
links[0]
links[1]
links[0]["href"]
```


In [None]:
soup.find_all('a')

In [None]:
links = soup.find_all('a')

In [None]:
links

In [None]:
type(links)

In [None]:
links[0]

In [None]:
links[1]

In [None]:
links[0]["href"]

## Retrieve by css class name
Examples:
```
soup.find(class_='sister')
soup.find_all(class_='sister') # returns all tags with sister css class
soup.find_all('a', class_='purple') # returns the `<a>` tag with css class purple
soup.find_all('p', class_='purple') # returns the `<p>` tag with css class purple

```

In [None]:
soup.find_all(class_='sister')

In [None]:
soup.find_all('a', class_='purple') 

In [None]:
soup.find_all('p', class_='purple') 

## Limit the number in search
Example
```
soup.find_all('a')
soup.find_all('a', limit=2) # set the limit return to 2

```

In [None]:
soup.find_all('a', limit=2)

## Retrieve by HTML `id`
Examples:
```
soup.find(id='link1')
```

**Note**:
- An `id` is meant to be unique, so you should expect only one matching tag.  
- However, there can be exceptions, as it is quite common for HTML code to be buggy and messy.

In [None]:
soup.find(id='link1')

## Advanced CSS Selectors
If you are experienced with CSS, the following selector syntax will look familiar:
```
soup.select('body b')
soup.select('p b')
soup.select('body>b')
soup.select('body>p>b')
```
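
The two selector forms above behave differently: a space matches descendants at any depth, while `>` matches direct children only. Here is a minimal sketch on a made-up snippet; `select_one()` is included as a convenient way to get just the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one <b> directly under <body>, one nested inside <p>.
demo_soup = BeautifulSoup(
    "<body><b>top</b><p><b>inside p</b></p></body>",
    "html.parser",
)

descendants = demo_soup.select("body b")    # any <b> anywhere under <body>
children = demo_soup.select("body > b")     # only <b> directly under <body>
first_match = demo_soup.select_one("p b")   # first match, or None if nothing matches

print([tag.text for tag in descendants])
print([tag.text for tag in children])
print(first_match.text)
```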

In [None]:
soup.select('body b')

# WEB SCRAPING EXERCISE

Use `requests` and BeautifulSoup together.

BeautifulSoup is NOT an HTTP client, so we have to use `requests` to retrieve the HTML source from an actual website.

**Required imports**:
```
import requests
from bs4 import BeautifulSoup
```

In [None]:
import requests
from bs4 import BeautifulSoup

## Declaring url to retrieve
```
hkej_topgainers = "https://stock360.hkej.com/marketWatch/Top20/topGainers"
html_response_from_server = requests.get(hkej_topgainers)

type(html_response_from_server)

html_response_from_server.content

type(html_response_from_server.content)

soup = BeautifulSoup(html_response_from_server.content, 'html.parser')
```

In [None]:
hkej_topgainers = "https://stock360.hkej.com/marketWatch/Top20/topGainers"
html_response_from_server = requests.get(hkej_topgainers)
type(html_response_from_server)

In [None]:
html_response_from_server.content

In [None]:
type(html_response_from_server.content)

In [None]:
soup = BeautifulSoup(html_response_from_server.content, 'html.parser')

In [None]:
type(soup)

## Use `find()` and `find_all()` to retrieve rows
```
top_stocks_table = soup.find(class_='dt640')
print(type(top_stocks_table))
stock_rows = top_stocks_table.find_all("tr")
print(len(stock_rows))
print(stock_rows[0])
print(stock_rows[1])
print(stock_rows[2])
print(stock_rows[3])

```

In [None]:
top_stocks_table = soup.find(class_='dt640')
print(type(top_stocks_table))
stock_rows = top_stocks_table.find_all("tr")
print(len(stock_rows))
print(stock_rows[0])
print(stock_rows[1])
print(stock_rows[2])
print(stock_rows[3])

## Looping the top stock rows
```
for i in range(2, len(stock_rows)):
    stock = stock_rows[i]
    code = stock.find(class_='code')
    name = stock.find(class_='name')
    print(f'{code}\t{name}')
    #print(f'{code.string}\t{name.string}')
```

In [None]:
for i in range(2, len(stock_rows)):
    stock = stock_rows[i]
    code = stock.find(class_='code')
    name = stock.find(class_='name')
    print(f'{code}\t{name}')
    #print(f'{code.string}\t{name.string}')


## Retrieving more stock columns
```
for i in range(2, len(stock_rows)):
    stock = stock_rows[i]
    code = stock.find(class_='code')
    name = stock.find(class_='name')
    latest = stock.find(class_='latest')
    change = stock.find(class_='change')
    change_p = stock.find(class_='change_p')
    volumn = stock.find(class_='volumn')
    turnover = stock.find(class_='turnover')
    market_cap = stock.find(class_='marketCap')    
    print(f'{code.string}\t{name.string}\t{latest.string}\t{change_p.text}')
```
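
The loop above assumes every row contains every cell. As a hedged sketch of a more defensive version, the example below runs on a made-up table that only mimics the `dt640`, `code` and `name` class names (the live page's markup may differ). The helper `cell_text()` is a hypothetical name introduced here; it returns `None` for a missing cell, and header rows are skipped by checking for that instead of hard-coding a start index.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used above; the live page may differ.
sample_html = """
<table class="dt640">
  <tr><th>Code</th><th>Name</th></tr>
  <tr><td class="code">0001</td><td class="name">Alpha</td></tr>
  <tr><td class="code">0002</td><td class="name">Beta</td></tr>
</table>
"""

def cell_text(row, css_class):
    """Return the stripped text of the first matching cell, or None if absent."""
    cell = row.find(class_=css_class)
    return cell.get_text(strip=True) if cell is not None else None

table = BeautifulSoup(sample_html, "html.parser").find(class_="dt640")
records = []
for row in table.find_all("tr"):
    code = cell_text(row, "code")
    if code is None:          # header rows carry no 'code' cell, so skip them
        continue
    records.append({"code": code, "name": cell_text(row, "name")})

print(records)
```

A list of dictionaries like `records` can be handed straight to `pd.DataFrame(records)` if you want to continue with pandas.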

In [None]:
for i in range(2, len(stock_rows)):
    stock = stock_rows[i]
    code = stock.find(class_='code')
    name = stock.find(class_='name')
    latest = stock.find(class_='latest')
    change = stock.find(class_='change')
    change_p = stock.find(class_='change_p')
    volumn = stock.find(class_='volumn')
    turnover = stock.find(class_='turnover')
    market_cap = stock.find(class_='marketCap')    
    print(f'{code.string}\t{name.string.ljust(15)}\t{latest.string}\t{change_p.text}')


## Run the complete python script
There is a complete Python script named `get_active_stock.py` in the script folder of the downloaded folder.

**To run the script**:
- Open the command line window / Terminal
- Type in command: `python3 get_active_stock.py`