# Guided Project: API and Web Data Scraping
## Part 2: Web Scraping
### Scraping Text Content
I selected an article, [World’s Best 10 Countries to Launch a Fintech Startup](https://medium.com/datadriveninvestor/worlds-best-10-cities-to-launch-a-fintech-startup-a3a1c739e04c), from Medium. From this article I wanted to extract the title and the top countries to create a ranking. I chose this article because it covers an industry I really like.
- First, I imported the requests library

In [1]:
import requests

- I specified the URL of the page I wanted to scrape, then used the get function of the requests library and the content attribute of the response to retrieve the raw HTML

In [2]:
url = 'https://medium.com/datadriveninvestor/worlds-best-10-cities-to-launch-a-fintech-startup-a3a1c739e04c'
html = requests.get(url).content
html[0:700]

b'<!DOCTYPE html><html xmlns:cc="http://creativecommons.org/ns#"><head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# medium-com: http://ogp.me/ns/fb/medium-com#"><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta name="viewport" content="width=device-width, initial-scale=1.0, viewport-fit=contain"><title>World\xe2\x80\x99s Best 10 Countries to Launch a Fintech Startup</title><link rel="canonical" href="https://medium.com/datadriveninvestor/worlds-best-10-cities-to-launch-a-fintech-startup-a3a1c739e04c"><meta name="title" content="World\xe2\x80\x99s Best 10 Countries to Launch a Fintech Startup"><meta name="referrer" content="unsafe-url"><meta name="description" content="FinTech st'
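The `b'...'` prefix in the output above shows that `.content` returns raw bytes, which is why the apostrophe in the title appears as `\xe2\x80\x99`. As a minimal sketch, using a shortened sample of that output rather than a live request, decoding the bytes as UTF-8 recovers the readable character:

```python
# A shortened sample of the bytes shown above (not fetched live).
raw = b"<title>World\xe2\x80\x99s Best 10 Countries to Launch a Fintech Startup</title>"

# The three bytes \xe2\x80\x99 are the UTF-8 encoding of U+2019,
# the right single quotation mark used as an apostrophe in the title.
decoded = raw.decode("utf-8")
print(decoded)
```

The requests response also exposes an already decoded string as `.text`, and BeautifulSoup accepts either bytes or a string.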

- I imported the BeautifulSoup library to read the raw HTML and parse the information I wanted
- I opened the page in the browser and used Inspect Element to identify the elements that contained the main header and the names of the 10 countries. They turned out to be h1 and h3, so I proceeded to extract all the text contained within these header tags.
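The tag-based extraction described above can be tried on a small stand-in page first. This is only a sketch: the HTML below is a hypothetical miniature, not the Medium article, and it uses Python's built-in `html.parser` instead of `lxml` so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page that mimics the structure described above.
sample_html = """
<html><body>
  <h1>Sample Article Title</h1>
  <h3>1. First Country</h3>
  <p>Some paragraph text.</p>
  <h3>2. Second Country</h3>
</body></html>
"""

# "html.parser" ships with Python; the notebook itself uses "lxml".
sample_soup = BeautifulSoup(sample_html, "html.parser")

# select() takes a CSS selector and find_all() takes a tag name;
# both return list-like collections of matching tags.
headings = [tag.text for tag in sample_soup.select("h1")]
subheadings = [tag.text for tag in sample_soup.find_all("h3")]

print(headings)
print(subheadings)
```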

In [3]:
from bs4 import BeautifulSoup

In [4]:
soup = BeautifulSoup(html,"lxml")
soup

<!DOCTYPE html>
<html xmlns:cc="http://creativecommons.org/ns#"><head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# medium-com: http://ogp.me/ns/fb/medium-com#"><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><meta content="width=device-width, initial-scale=1.0, viewport-fit=contain" name="viewport"/><title>World’s Best 10 Countries to Launch a Fintech Startup</title><link href="https://medium.com/datadriveninvestor/worlds-best-10-cities-to-launch-a-fintech-startup-a3a1c739e04c" rel="canonical"/><meta content="World’s Best 10 Countries to Launch a Fintech Startup" name="title"/><meta content="unsafe-url" name="referrer"/><meta content="FinTech startups have become one of the hottest booms in the competitive business world of entrepreneurship with the revolutionizing of financial services. Merely a decade ago, Fin-techs acted as a…" name="description"/><meta content="#000000" name="theme-color"/><meta content="World’s Best 10 Countries to Launch a Fintech St

In [5]:
title = [e.text for e in soup.select('h1')]
countries = [e.text for e in soup.find_all('h3')][0:10]
print(type(title))
display(title)
print(type(countries))
display(countries)

<class 'list'>


['World’s Best 10 Countries to Launch a Fintech\xa0Startup']

<class 'list'>


['1. New\xa0Zealand',
 '2. Sweden',
 '3. Denmark',
 '4. United\xa0Kingdom',
 '5. Singapore',
 '6. Canada',
 '7. The Netherlands',
 '8. Ireland',
 '9. Switzerland',
 '10. Hong\xa0Kong']
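The `\xa0` in some of the names above is a non-breaking space (U+00A0) that comes from the page's markup. A minimal sketch of one way to turn it into an ordinary space, using three values copied from the output:

```python
# Sample values copied from the output above.
scraped = ['1. New\xa0Zealand', '4. United\xa0Kingdom', '10. Hong\xa0Kong']

# Replace each non-breaking space with a regular space.
cleaned = [name.replace('\xa0', ' ') for name in scraped]
print(cleaned)
```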

- I created a for loop to split each element of the countries list at the period, producing a nested list called ranking
- I imported the pandas library to create the data frame
- I created a data frame from ranking and named its columns
- I exported the output as .csv

In [6]:
ranking = []
for country in countries:
    ranking.append(country.split("."))

print(type(ranking))
ranking

<class 'list'>


[['1', ' New\xa0Zealand'],
 ['2', ' Sweden'],
 ['3', ' Denmark'],
 ['4', ' United\xa0Kingdom'],
 ['5', ' Singapore'],
 ['6', ' Canada'],
 ['7', ' The Netherlands'],
 ['8', ' Ireland'],
 ['9', ' Switzerland'],
 ['10', ' Hong\xa0Kong']]
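The split above leaves the rank as a string and keeps a leading space in each name. As a sketch of a slightly stricter variant, the helper below (the name `parse_ranked_item` is hypothetical) splits only at the first period, converts the rank to an integer, and normalizes the whitespace in the name:

```python
def parse_ranked_item(item):
    """Split a string such as '7. The Netherlands' into (7, 'The Netherlands')."""
    # maxsplit=1 splits only at the first period, so later periods in a name survive.
    rank, name = item.split(".", 1)
    # str.split() with no arguments splits on any whitespace, including \xa0,
    # so joining the pieces trims the ends and normalizes inner spacing.
    return int(rank), " ".join(name.split())


sample = ['1. New\xa0Zealand', '7. The Netherlands', '10. Hong\xa0Kong']
parsed = [parse_ranked_item(item) for item in sample]
print(parsed)
```

Limiting the split matters for names that contain a period themselves: with a plain `split(".")`, such an item would produce more than two pieces and no longer fit a two-column data frame.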

In [7]:
import pandas as pd

In [8]:
df = pd.DataFrame(ranking)
df.columns = ['Ranking', 'Country']
print(type(df))
df

<class 'pandas.core.frame.DataFrame'>


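|   | Ranking | Country |
|---|---------|---------|
| 0 | 1 | New Zealand |
| 1 | 2 | Sweden |
| 2 | 3 | Denmark |
| 3 | 4 | United Kingdom |
| 4 | 5 | Singapore |
| 5 | 6 | Canada |
| 6 | 7 | The Netherlands |
| 7 | 8 | Ireland |
| 8 | 9 | Switzerland |
| 9 | 10 | Hong Kong |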
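As a small variation on the cell above, the column names can also be passed directly to the constructor. The sketch below uses three sample rows in the same shape as the nested list and additionally converts the rank to an integer, which the notebook does not do for this data frame:

```python
import pandas as pd

# Sample rows in the same shape as the nested list built above.
sample_ranking = [['1', ' New\xa0Zealand'], ['2', ' Sweden'], ['3', ' Denmark']]

# Column names can be passed to the constructor instead of being assigned afterwards.
sample_df = pd.DataFrame(sample_ranking, columns=['Ranking', 'Country'])

# The values are still strings at this point; convert the rank for numeric sorting.
sample_df['Ranking'] = sample_df['Ranking'].astype(int)
print(sample_df.dtypes)
```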


In [9]:
# df.to_csv('output/scraping.csv', index=False)
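The export line above is commented out and writes into an `output/` folder. Below is a minimal sketch of the same export, assuming the folder may not exist yet. The temporary directory and the two-row sample frame are stand-ins, not the notebook's data:

```python
import os
import tempfile

import pandas as pd

sample_df = pd.DataFrame({'Ranking': [1, 2], 'Country': ['New Zealand', 'Sweden']})

# A temporary directory stands in for the notebook's working directory.
with tempfile.TemporaryDirectory() as workdir:
    out_dir = os.path.join(workdir, 'output')
    # to_csv does not create missing folders, so make sure the target exists first.
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, 'scraping.csv')

    # index=False leaves the row index out of the file.
    sample_df.to_csv(out_path, index=False)

    # Read the file back to confirm the round trip.
    restored = pd.read_csv(out_path)

print(restored)
```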

### Scraping Tables
I selected [Wikipedia's](https://en.wikipedia.org/wiki/Mobile_banking) page on mobile banking. From this page I wanted to extract the table that summarizes mobile banking around the world. The table shows a ranking of countries by mobile banking usage in 2012.

- I specified the URL of the page I wanted to scrape, then used the get function of the requests library and the content attribute of the response to retrieve the raw HTML
- I used the BeautifulSoup library to read the raw HTML and parse the information I wanted

In [10]:
url_mb = 'https://en.wikipedia.org/wiki/Mobile_banking'
html_mb = requests.get(url_mb).content
soup_mb = BeautifulSoup(html_mb,"lxml")
soup_mb

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Mobile banking - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Mobile_banking","wgTitle":"Mobile banking","wgCurRevisionId":866411949,"wgRevisionId":866411949,"wgArticleId":4354196,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: Archived copy as title","Use dmy dates from September 2010","E-commerce","Mobile content","Banking technology","Banking terms"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMon

- I opened the page in the browser and used Inspect Element to determine the class of the table. Fortunately the class was specified: the table was a 'wikitable sortable'. I therefore used the class to get the specific table I wanted.
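One detail of class-based lookup is worth a quick check: a multi-word `class_` string is compared against the attribute value as written, while a CSS selector matches both classes in any order. The sketch below illustrates this on hypothetical markup (the three tables and their `id` values are made up for the example):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables whose class names appear in different orders.
sample_html = """
<table class="wikitable sortable" id="a"><tr><td>1</td></tr></table>
<table class="sortable wikitable" id="b"><tr><td>2</td></tr></table>
<table class="infobox" id="c"><tr><td>3</td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# A multi-word class_ string is compared against the attribute value as written.
exact = [t["id"] for t in sample_soup.find_all(class_="wikitable sortable")]

# A CSS selector requires both classes but ignores their order.
by_selector = [t["id"] for t in sample_soup.select("table.wikitable.sortable")]

print(exact)
print(by_selector)
```

For the Wikipedia page in this notebook the exact string works because the attribute is written as `wikitable sortable`; the selector form is simply less sensitive to how the page orders its classes.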

In [11]:
table = soup_mb.find_all(class_="wikitable sortable")[0]
table

<table class="wikitable sortable" style="font-size: 100%; text-align: center; width: 15%;">
<tbody><tr>
<th>Rank</th>
<th>Country/Territory</th>
<th>Usage in 2012
</th></tr>
<tr>
<td>1</td>
<td style="text-align: left"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/09/Flag_of_South_Korea.svg/23px-Flag_of_South_Korea.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/09/Flag_of_South_Korea.svg/35px-Flag_of_South_Korea.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/09/Flag_of_South_Korea.svg/45px-Flag_of_South_Korea.svg.png 2x" width="23"/> </span><a href="/wiki/South_Korea" title="South Korea">South Korea</a></td>
<td>47%
</td></tr>
<tr>
<td>2</td>
<td style="text-align: left"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/f

- I extracted each of the rows and their values into a nested list to be able to load the data into pandas
- I used the index [1:] since the first row contained the column titles, not actual data
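Splitting the row text on newlines works for this table because each cell ends up on its own line. A sketch of an alternative that reads the cells directly, shown on a hypothetical two-row table shaped like the one above:

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table shaped like the one above.
sample_html = """
<table>
<tr><th>Rank</th><th>Country/Territory</th><th>Usage in 2012
</th></tr>
<tr><td>1</td><td><span><img alt=""/> </span><a>South Korea</a></td><td>47%
</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser")

sample_rows = []
for tr in sample_table.find_all("tr")[1:]:  # [1:] skips the header row
    # get_text(strip=True) trims whitespace around each text fragment in the cell
    # and drops fragments that are empty after trimming.
    sample_rows.append([cell.get_text(strip=True) for cell in tr.find_all("td")])

print(sample_rows)
```

Reading cell by cell also removes the whitespace that precedes each country name, which the newline split keeps as a leading `\xa0`.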

In [12]:
rows = [row.text.strip().split("\n") for row in table.find_all('tr')][1:]
print(type(rows))
rows

<class 'list'>


[['1', '\xa0South Korea', '47%'],
 ['2', '\xa0China', '42%'],
 ['3', '\xa0Hong Kong', '41%'],
 ['4', '\xa0Singapore', '38%'],
 ['5', '\xa0India', '37%'],
 ['6', '\xa0Spain', '34%'],
 ['7', '\xa0United States', '32%'],
 ['8', '\xa0Mexico', '30%'],
 ['9', '\xa0Australia', '27%'],
 ['10', '\xa0France', '26%'],
 ['11', '\xa0United Kingdom', '26%'],
 ['12', '\xa0Thailand', '24%'],
 ['13', '\xa0Canada', '22%'],
 ['14', '\xa0Germany', '14%'],
 ['15', '\xa0Pakistan', '9%']]

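- I created the data frame with the data from rows and specified the column names. The usage column is labeled 2014 here, although the header of the scraped table reads 'Usage in 2012'.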

In [13]:
df_mb = pd.DataFrame(rows)
df_mb.columns = ['Ranking', 'Country','Usage in 2014 (%)']
df_mb

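|   | Ranking | Country | Usage in 2014 (%) |
|---|---------|---------|-------------------|
| 0 | 1 | South Korea | 47% |
| 1 | 2 | China | 42% |
| 2 | 3 | Hong Kong | 41% |
| 3 | 4 | Singapore | 38% |
| 4 | 5 | India | 37% |
| 5 | 6 | Spain | 34% |
| 6 | 7 | United States | 32% |
| 7 | 8 | Mexico | 30% |
| 8 | 9 | Australia | 27% |
| 9 | 10 | France | 26% |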


- I checked the information of the data frame I created to see if the column types were correct
- The 'Ranking' and 'Usage in 2014 (%)' columns were stored as object (string) rather than as numeric types, so I converted them

In [14]:
df_mb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 3 columns):
Ranking              15 non-null object
Country              15 non-null object
Usage in 2014 (%)    15 non-null object
dtypes: object(3)
memory usage: 440.0+ bytes


In [15]:
df_mb['Ranking'] = df_mb['Ranking'].astype('int64')
df_mb['Usage in 2014 (%)'] = df_mb['Usage in 2014 (%)'].str.replace('%','').astype('float64')
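The conversion above can be tried on a small stand-in frame. The column name `Usage (%)` and the three sample rows are illustrative, and `str.rstrip('%')` is swapped in for `str.replace` here because it only touches a trailing percent sign:

```python
import pandas as pd

# Stand-in data with the same string-typed columns as the scraped table.
sample = pd.DataFrame({'Ranking': ['1', '2', '3'], 'Usage (%)': ['47%', '42%', '9%']})

# Strings that hold digits only can be cast directly.
sample['Ranking'] = sample['Ranking'].astype('int64')

# Strip the trailing percent sign before casting to float.
sample['Usage (%)'] = sample['Usage (%)'].str.rstrip('%').astype('float64')

print(sample.dtypes)
```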

In [16]:
df_mb

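|   | Ranking | Country | Usage in 2014 (%) |
|---|---------|---------|-------------------|
| 0 | 1 | South Korea | 47.0 |
| 1 | 2 | China | 42.0 |
| 2 | 3 | Hong Kong | 41.0 |
| 3 | 4 | Singapore | 38.0 |
| 4 | 5 | India | 37.0 |
| 5 | 6 | Spain | 34.0 |
| 6 | 7 | United States | 32.0 |
| 7 | 8 | Mexico | 30.0 |
| 8 | 9 | Australia | 27.0 |
| 9 | 10 | France | 26.0 |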


- With the updated column data types, I could now find the mean usage in 2014

In [17]:
df_mb['Usage in 2014 (%)'].mean()

29.933333333333334
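The data frame previews above display only the first 10 rows, while the mean is taken over all 15 scraped rows. A quick cross-check in plain Python, with the values copied from the rows list shown earlier:

```python
# Usage values copied from the 15 scraped rows shown earlier.
usage = [47, 42, 41, 38, 37, 34, 32, 30, 27, 26, 26, 24, 22, 14, 9]

# The mean is the sum of the values divided by how many there are.
mean_usage = sum(usage) / len(usage)
print(len(usage), sum(usage), mean_usage)
```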

- I exported the output as .csv

In [18]:
# df_mb.to_csv('output/scraping_table.csv', index=False)