# Web Scraping

![Data Science Workflow](img/ds-workflow.png)

## Acquire Data
### Common Data Sources
- **The Internet - Web Scraping**
- Databases
- CSV
- Excel
- Parquet

### Web Scraping
- Extracting data from websites
- Legal issues: [wikipedia.org](https://en.wikipedia.org/wiki/Web_scraping#Legal_issues)
- The legality of web scraping varies across the world.
- In general, web scraping may be against the terms of use of some websites, but the enforceability of these terms is unclear.

### Be ethical
- Do not use scraped data for commercial purposes
- Restrict it to private use
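One common courtesy, beyond the points above, is to check a site's `robots.txt` before fetching pages. The sketch below uses Python's standard `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT`, the `my-scraper` user agent, the `example.org` URLs, and the helper `is_allowed` are all hypothetical and only serve as an illustration; a real site's rules will differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs: one outside and one inside the disallowed path.
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.org/wiki/Page"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.org/private/data"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` would download the actual file instead of parsing an inline string.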

## Example
- Let's consider [https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics](https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics)
- **pandas** ```pd.read_html()``` reads the HTML tables on a page into a list of DataFrame objects ([docs](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html)).
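To see what "a list of DataFrame objects" means without touching the network, here is a minimal sketch that parses a small inline HTML snippet. The snippet is hypothetical (its two rows reuse values from the Wikipedia table), and the example assumes an HTML parser such as `lxml` is installed, which `read_html` needs.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with a single table, inlined so no network access is needed.
html = """
<table>
  <tr><th>Year</th><th>Revenue</th></tr>
  <tr><td>2021/22</td><td>$ 154,686,521</td></tr>
  <tr><td>2020/21</td><td>$ 162,886,686</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> element it finds.
tables = pd.read_html(StringIO(html))
first = tables[0]

print(len(tables))
print(first.columns.tolist())
print(first.shape)
```

Because the result is always a list, even a page with one table has to be indexed with `[0]` to get the DataFrame.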

In [5]:
import pandas as pd

In [6]:
url = "https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics"
data = pd.read_html(url)

In [7]:
type(data)

list

In [8]:
type(data[0])

pandas.core.frame.DataFrame

In [9]:
data[0].head()

|   | Year | Source | Revenue | Expenses | Asset rise | Total assets |
|---|---|---|---|---|---|---|
| 0 | 2021/22 | PDF | $ 154,686,521 | $ 145,970,915 | $ 8,173,996 | $ 239,351,532 |
| 1 | 2020/21 | PDF | $ 162,886,686 | $ 111,839,819 | $ 50,861,811 | $ 231,177,536 |
| 2 | 2019/20 | PDF | $ 129,234,327 | $ 112,489,397 | $ 14,674,300 | $ 180,315,725 |
| 3 | 2018/19 | PDF | $ 120,067,266 | $ 91,414,010 | $ 30,691,855 | $ 165,641,425 |
| 4 | 2017/18 | PDF | $ 104,505,783 | $ 81,442,265 | $ 21,619,373 | $ 134,949,570 |


In [10]:
fundraising = data[0]

In [11]:
fundraising.dtypes

Year            object
Source          object
Revenue         object
Expenses        object
Asset rise      object
Total assets    object
dtype: object

In [76]:
# Drop the leading "$ " (two characters), then remove the thousands separators
fundraising['Exp'] = fundraising['Expenses'].str[2:]
fundraising['Exp'] = fundraising['Exp'].str.replace(',', '', regex=False)

fundraising.head()

|   | Year | Source | Revenue | Expenses | Asset rise | Total assets | Exp | Exp1 | Exp2 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021/22 | PDF | $ 154,686,521 | $ 145,970,915 | $ 8,173,996 | $ 239,351,532 | 145970915 | 145970915 | 145970915 |
| 1 | 2020/21 | PDF | $ 162,886,686 | $ 111,839,819 | $ 50,861,811 | $ 231,177,536 | 111839819 | 111839819 | 111839819 |
| 2 | 2019/20 | PDF | $ 129,234,327 | $ 112,489,397 | $ 14,674,300 | $ 180,315,725 | 112489397 | 112489397 | 112489397 |
| 3 | 2018/19 | PDF | $ 120,067,266 | $ 91,414,010 | $ 30,691,855 | $ 165,641,425 | 91414010 | 91414010 | 91414010 |
| 4 | 2017/18 | PDF | $ 104,505,783 | $ 81,442,265 | $ 21,619,373 | $ 134,949,570 | 81442265 | 81442265 | 81442265 |


In [79]:
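# drop() returns a new DataFrame; without assigning the result,
# fundraising itself keeps the Exp1 and Exp2 columns
fundraising.drop(columns=['Exp1','Exp2'])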

|   | Year | Source | Revenue | Expenses | Asset rise | Total assets | Exp |
|---|---|---|---|---|---|---|---|
| 0 | 2021/22 | PDF | $ 154,686,521 | $ 145,970,915 | $ 8,173,996 | $ 239,351,532 | 145970915 |
| 1 | 2020/21 | PDF | $ 162,886,686 | $ 111,839,819 | $ 50,861,811 | $ 231,177,536 | 111839819 |
| 2 | 2019/20 | PDF | $ 129,234,327 | $ 112,489,397 | $ 14,674,300 | $ 180,315,725 | 112489397 |
| 3 | 2018/19 | PDF | $ 120,067,266 | $ 91,414,010 | $ 30,691,855 | $ 165,641,425 | 91414010 |
| 4 | 2017/18 | PDF | $ 104,505,783 | $ 81,442,265 | $ 21,619,373 | $ 134,949,570 | 81442265 |
| 5 | 2016/17 | PDF | $ 91,242,418 | $ 69,136,758 | $ 21,547,402 | $ 113,330,197 | 69136758 |
| 6 | 2015/16 | PDF | $ 81,862,724 | $ 65,947,465 | $ 13,962,497 | $ 91,782,795 | 65947465 |
| 7 | 2014/15 | PDF | $ 75,797,223 | $ 52,596,782 | $ 24,345,277 | $ 77,820,298 | 52596782 |
| 8 | 2013/14 | PDF | $ 52,465,287 | $ 45,900,745 | $ 8,285,897 | $ 53,475,021 | 45900745 |
| 9 | 2012/13 | PDF | $ 48,635,408 | $ 35,704,796 | $ 10,260,066 | $ 45,189,124 | 35704796 |


In [81]:
# Convert the cleaned digit strings to integers
fundraising['Exp'] = pd.to_numeric(fundraising['Exp'])
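The same strip-and-convert steps could be applied to several money columns at once. The following is a minimal sketch, not the notebook's own method: the helper `money_to_int` and the small `sample` frame are hypothetical, and the two rows are copied from the table above so the code runs without network access.

```python
import pandas as pd

# Two rows copied from the fundraising table, so the sketch runs offline.
sample = pd.DataFrame({
    "Year": ["2021/22", "2020/21"],
    "Revenue": ["$ 154,686,521", "$ 162,886,686"],
    "Expenses": ["$ 145,970,915", "$ 111,839,819"],
})


def money_to_int(column: pd.Series) -> pd.Series:
    """Strip '$', whitespace and thousands separators, then convert to numbers."""
    cleaned = column.str.replace(r"[$,\s]", "", regex=True)
    return pd.to_numeric(cleaned)


# Overwrite each money column with its numeric version.
for name in ["Revenue", "Expenses"]:
    sample[name] = money_to_int(sample[name])

print(sample.dtypes)
```

Using one regular expression instead of `.str[2:]` means the helper does not depend on the prefix being exactly two characters long.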

## Data Wrangling
- Data wrangling (also called data munging): transforming and mapping data from a "raw" form into another format
- The goal is to make the data more suitable and valuable for downstream purposes such as analytics
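As a small illustration of mapping a raw form into a more useful one, the sketch below turns fiscal-year labels such as `2021/22` into an integer start year. The series `years` and the name `start_year` are hypothetical; the labels follow the format of the `Year` column in the example above.

```python
import pandas as pd

# Fiscal-year labels in the same "YYYY/YY" form as the Year column.
years = pd.Series(["2021/22", "2020/21", "2019/20"], name="Year")

# Keep the first four characters and convert them to integers.
start_year = years.str[:4].astype(int)

print(start_year.tolist())
```

An integer year can be sorted, filtered, and plotted on a numeric axis, which the raw text label cannot.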

### Check the data types
- Use ```.dtypes``` to check each column's type after a conversion

In [82]:
fundraising.dtypes

Year            object
Source          object
Revenue         object
Expenses        object
Asset rise      object
Total assets    object
Exp              int64
Exp1            object
Exp2            object
dtype: object
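Reading the `.dtypes` listing by eye works for a handful of columns. For wider tables, `select_dtypes` can list the columns that still need conversion. This is a minimal sketch on a hypothetical three-column `sample` frame whose values are taken from the first two rows above.

```python
import pandas as pd

# Small stand-in for the fundraising frame: two text columns, one numeric.
sample = pd.DataFrame({
    "Year": ["2021/22", "2020/21"],
    "Expenses": ["$ 145,970,915", "$ 111,839,819"],
    "Exp": [145970915, 111839819],
})

# Columns that are already numeric, and those that are not yet.
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
non_numeric_columns = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_columns)
print(non_numeric_columns)
```

Here `exclude="number"` is used rather than matching on `object`, so the check does not depend on how a given pandas version labels text columns.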