# HTML Basics for Web Data Collection

## Collecting Web Data

* Web Scraping: extracting the information you want from one specific web page
* Web Crawling: writing a program that scrapes many web pages automatically
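
To make the distinction concrete, here is a minimal sketch using only the Python standard library. The `PAGES` dictionary is a hypothetical in-memory stand-in for a website (a real crawler would fetch pages over HTTP), and the names `scrape` and `crawl` are illustrative, not part of any library.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "website": URL -> HTML (assumption for illustration).
PAGES = {
    "/index.html": "<title>Home</title><a href='/tea.html'>Tea</a><a href='/coffee.html'>Coffee</a>",
    "/tea.html": "<title>Green Tea</title><a href='/index.html'>Back</a>",
    "/coffee.html": "<title>Coffee</title>",
}


class TitleAndLinks(HTMLParser):
    """Collects the <title> text and every <a href=...> on one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape(url):
    """Scraping: pull the wanted information out of ONE page."""
    parser = TitleAndLinks()
    parser.feed(PAGES[url])
    parser.close()
    return parser.title, parser.links


def crawl(start):
    """Crawling: follow links and scrape every page that can be reached."""
    seen, queue, titles = set(), [start], {}
    while queue:
        url = queue.pop(0)
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        title, links = scrape(url)
        titles[url] = title
        queue.extend(links)
    return titles


print(scrape("/index.html"))
print(crawl("/index.html"))
```

`scrape` touches a single page, while `crawl` calls `scrape` repeatedly as it discovers new links.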

## Structure of a Web Page: HTML
* HTML: the code that describes the content of a web page


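### Anatomy of an HTML Tag
* **Tag**: opens with a name in angle brackets \<>, wraps some content, and closes with \</> (some tags have no closing tag)
* **Tag name**: the first thing inside \<>; it identifies what the tag represents.  
ex) \<p>: paragraph, \<li>: list item, \<img>: image
* **Tag attributes**: any tag can carry extra information called attributes. Everything that follows the tag name inside the opening tag is an attribute. 
    * An attribute is usually a name/value pair (ex: name="value") written in the opening tag.  
     >```<li id = "favorite">Milk</li>``` **:**
     ><li id = "favorite">Milk</li>   
    * A single tag can have several attributes. 
    >```<img alt="brazilnut" class="logo-img" src="brazilnut.jpg" width="100"/>``` 
    ><img alt="brazilnut" class="logo-img" src="brazilnut.jpg" width="100"/> 
    >Four attributes in total: alt, class, src, width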
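
The name/value pairs described above can also be read programmatically. The following is a minimal sketch with the standard library's `html.parser`; the class `AttributeCollector` and the helper `collect_attributes` are illustrative names introduced here.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Records each opening tag together with its attributes as a dict."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.tags.append((tag, dict(attrs)))


def collect_attributes(html):
    parser = AttributeCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


html = (
    '<li id="favorite">Milk</li>'
    '<img alt="brazilnut" class="logo-img" src="brazilnut.jpg" width="100"/>'
)
for name, attrs in collect_attributes(html):
    print(name, attrs)
```

When scraping, attributes such as `id` and `class` are typically what you use to pick out the one tag you care about among many tags with the same name.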


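### Structure of HTML Tags
* The HTML tags on a page are nested inside one another. This nesting is called a parent-child relationship, or a tree structure.
```html
<ul>                     <!-- parent tag -->
    <li>Coffee</li>      <!-- child tag -->
    <li>Green tea</li>
    <li>Milk</li>
</ul>
```
><ul>                    
>    <li>Coffee</li>           
>    <li>Green tea</li>
>    <li>Milk</li>
></ul>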
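
The parent-child relationship can be recovered by tracking which tags are currently open. This is a minimal sketch that assumes well-formed HTML in which every tag is explicitly closed; `TreeBuilder` and `parent_child_pairs` are illustrative names.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Records (parent, child) pairs using a stack of currently open tags."""

    def __init__(self):
        super().__init__()
        self.stack = []      # tags that are open right now
        self.relations = []  # (parent tag or None, child tag)

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else None
        self.relations.append((parent, tag))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Assumes well-formed input: the closing tag matches the innermost open tag.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


def parent_child_pairs(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.relations


pairs = parent_child_pairs("<ul><li>Coffee</li><li>Green tea</li><li>Milk</li></ul>")
for parent, child in pairs:
    print(f"{child} is a child of {parent}")
```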

### Example
```html
<!DOCTYPE html>
<html>
<head>
    <title>Sample Website</title>
</head>
<body>
<h2>HTML Practice!</h2>

<p>This is the first paragraph.</p>
<p>This is the second paragraph!</p>

<ul>
    <li>Coffee</li>
    <li>Green tea</li>
    <li>Milk</li>
</ul>
</ul>

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/45/A_small_cup_of_coffee.JPG/550px-A_small_cup_of_coffee.JPG">
</body>
</html>
```
![image.png](attachment:image.png)  
  
><!DOCTYPE html>
><html>
><head>
>    <title>Sample Website</title>
></head>
><body>
><h2>HTML Practice!</h2>
>
><p>This is the first paragraph.</p>
><p>This is the second paragraph!</p>
>
><ul>
>    <li>Coffee</li>
>    <li>Green tea</li>
>    <li>Milk</li>
></ul>
></ul>
>
><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/45/A_small_cup_of_coffee.JPG/550px-A_small_cup_of_coffee.JPG">
></body>
></html>
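
As a first taste of scraping, the sketch below pulls the text out of selected tags in a page like this one. `SAMPLE` is a trimmed copy of the example page (image omitted, text in English), and `TextByTag` and `extract_text` are illustrative names. It assumes the wanted tags are not nested inside each other.

```python
from html.parser import HTMLParser

SAMPLE = """<!DOCTYPE html>
<html>
<head>
    <title>Sample Website</title>
</head>
<body>
<h2>HTML Practice!</h2>
<p>This is the first paragraph.</p>
<p>This is the second paragraph!</p>
<ul>
    <li>Coffee</li>
    <li>Green tea</li>
    <li>Milk</li>
</ul>
</body>
</html>"""


class TextByTag(HTMLParser):
    """Groups the text content of a few chosen tags by tag name."""

    WANTED = {"title", "h2", "p", "li"}

    def __init__(self):
        super().__init__()
        self.found = {}       # tag name -> list of text contents
        self._current = None  # wanted tag we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED:
            self._current = tag
            self.found.setdefault(tag, []).append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.found[self._current][-1] += data


def extract_text(html):
    parser = TextByTag()
    parser.feed(html)
    parser.close()
    return parser.found


found = extract_text(SAMPLE)
print(found["title"])
print(found["li"])
```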


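### Summary of Basic HTML Tags
**DOCTYPE declaration**
* The declaration that comes first, before anything else in an HTML file  
  
  ```<!DOCTYPE html>```  

**title tag**
* The page title. It is the title shown in the browser tab, in the browsing history, and so on.  
  
  ```<title>Sample Website</title>```    
  
**h1~h6 tags**  
* Heading tags 
* Six levels are available, from h1 (the most important) down to h6   
* By default, the text is rendered smaller at each level from h1 to h6  

   ```
   <h1>Heading 1</h1>
   <h2>Heading 2</h2>
   <h3>Heading 3</h3>
   <h4>Heading 4</h4>
   <h5>Heading 5</h5>
   <h6>Heading 6</h6>
   ```  
   ><h1>Heading 1</h1>
   ><h2>Heading 2</h2>
   ><h3>Heading 3</h3>
   ><h4>Heading 4</h4>
   ><h5>Heading 5</h5>
   ><h6>Heading 6</h6>  
   
**p tag**
* Tag for writing a paragraph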
```
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
<p>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</p>
<p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>
```  
><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p>
><p>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</p>
><p>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>


### Scraping the HTML Tags

```python
import pandas as pd

url = "https://www.w3schools.com/tags/ref_byfunc.asp"
table = pd.read_html(url)
len(table)
```

```
12
```
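
`pd.read_html` returns a list with one DataFrame per `<table>` element it finds, which is why `len(table)` is a count of tables. The sketch below illustrates the same idea with only the standard library on a small inline page. It is not how pandas is implemented, it ignores nested tables, and `TableExtractor`, `extract_tables`, and `PAGE` are illustrative names.

```python
from html.parser import HTMLParser

# Hypothetical two-table page; &lt; and &gt; are how "<" and ">" are written as text in HTML.
PAGE = (
    "<table><tr><th>Tag</th><th>Description</th></tr>"
    "<tr><td>&lt;p&gt;</td><td>Defines a paragraph</td></tr></table>"
    "<table><tr><th>Tag</th></tr><tr><td>&lt;img&gt;</td></tr></table>"
)


class TableExtractor(HTMLParser):
    """Collects every table as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._cell = None  # text of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = ""

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.tables[-1][-1].append(self._cell.strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data


def extract_tables(html):
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


tables = extract_tables(PAGE)
print(len(tables))
print(tables[0])
```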

```python
pd.set_option('display.max_colwidth', None)
```

#### Basic HTML

```python
table[0]
```

|  | Tag | Description |
|---|---|---|
| 0 | `<!DOCTYPE>` | Defines the document type |
| 1 | `<html>` | Defines an HTML document |
| 2 | `<head>` | Contains metadata/information for the document |
| 3 | `<title>` | Defines a title for the document |
| 4 | `<body>` | Defines the document's body |
| 5 | `<h1>` to `<h6>` | Defines HTML headings |
| 6 | `<p>` | Defines a paragraph |
| 7 | `<br>` | Inserts a single line break |
| 8 | `<hr>` | Defines a thematic change in the content |
| 9 | `<!--...-->` | Defines a comment |


#### Formatting

```python
table[1]
```

|  | Tag | Description |
|---|---|---|
| 0 | `<acronym>` | Not supported in HTML5. Use `<abbr>` instead.Defines an acronym |
| 1 | `<abbr>` | Defines an abbreviation or an acronym |
| 2 | `<address>` | Defines contact information for the author/owner of a document/article |
| 3 | `<b>` | Defines bold text |
| 4 | `<bdi>` | Isolates a part of text that might be formatted in a different direction from other text outside it |
| 5 | `<bdo>` | Overrides the current text direction |
| 6 | `<big>` | Not supported in HTML5. Use CSS instead.Defines big text |
| 7 | `<blockquote>` | Defines a section that is quoted from another source |
| 8 | `<center>` | Not supported in HTML5. Use CSS instead.Defines centered text |
| 9 | `<cite>` | Defines the title of a work |


#### Forms and Input

```python
table[2]
```

|  | Tag | Description |
|---|---|---|
| 0 | `<form>` | Defines an HTML form for user input |
| 1 | `<input>` | Defines an input control |
| 2 | `<textarea>` | Defines a multiline input control (text area) |
| 3 | `<button>` | Defines a clickable button |
| 4 | `<select>` | Defines a drop-down list |
| 5 | `<optgroup>` | Defines a group of related options in a drop-down list |
| 6 | `<option>` | Defines an option in a drop-down list |
| 7 | `<label>` | Defines a label for an `<input>` element |
| 8 | `<fieldset>` | Groups related elements in a form |
| 9 | `<legend>` | Defines a caption for a `<fieldset>` element |


#### Frames

```python
table[3]
```

|  | Tag | Description |
|---|---|---|
| 0 | `<frame>` | Not supported in HTML5.Defines a window (a frame) in a frameset |
| 1 | `<frameset>` | Not supported in HTML5.Defines a set of frames |
| 2 | `<noframes>` | Not supported in HTML5.Defines an alternate content for users that do not support frames |
| 3 | `<iframe>` | Defines an inline frame |


#### Images

```python
table[4]
```

|  | Tag | Description |
|---|---|---|
| 0 | `<img>` | Defines an image |
| 1 | `<map>` | Defines a client-side image map |
| 2 | `<area>` | Defines an area inside an image map |
| 3 | `<canvas>` | Used to draw graphics, on the fly, via scripting (usually JavaScript) |
| 4 | `<figcaption>` | Defines a caption for a `<figure>` element |
| 5 | `<figure>` | Specifies self-contained content |
| 6 | `<picture>` | Defines a container for multiple image resources |
| 7 | `<svg>` | Defines a container for SVG graphics |


#### Audio/Video

```python
table[5]
```

|  | Tag | Description |
|---|---|---|
| 0 | `<audio>` | Defines sound content |
| 1 | `<source>` | Defines multiple media resources for media elements (`<video>`, `<audio>` and `<picture>`) |
| 2 | `<track>` | Defines text tracks for media elements (`<video>` and `<audio>`) |
| 3 | `<video>` | Defines a video or movie |


#### Links

```python
table[6]
```

|  | Tag | Description |
|---|---|---|
| 0 | `<a>` | Defines a hyperlink |
| 1 | `<link>` | Defines the relationship between a document and an external resource (most used to link to style sheets) |
| 2 | `<nav>` | Defines navigation links |


#### Lists

```python
table[7]
```

|  | Tag | Description |
|---|---|---|
| 0 | `<ul>` | Defines an unordered list |
| 1 | `<ol>` | Defines an ordered list |
| 2 | `<li>` | Defines a list item |
| 3 | `<dir>` | Not supported in HTML5. Use `<ul>` instead.Defines a directory list |
| 4 | `<dl>` | Defines a description list |
| 5 | `<dt>` | Defines a term/name in a description list |
| 6 | `<dd>` | Defines a description of a term/name in a description list |


#### Tables

```python
table[8]
```

|  | Tag | Description |
|---|---|---|
| 0 | `<table>` | Defines a table |
| 1 | `<caption>` | Defines a table caption |
| 2 | `<th>` | Defines a header cell in a table |
| 3 | `<tr>` | Defines a row in a table |
| 4 | `<td>` | Defines a cell in a table |
| 5 | `<thead>` | Groups the header content in a table |
| 6 | `<tbody>` | Groups the body content in a table |
| 7 | `<tfoot>` | Groups the footer content in a table |
| 8 | `<col>` | Specifies column properties for each column within a `<colgroup>` element |
| 9 | `<colgroup>` | Specifies a group of one or more columns in a table for formatting |


#### Styles and Semantics

```python
table[9]
```

|  | Tag | Description |
|---|---|---|
| 0 | `<style>` | Defines style information for a document |
| 1 | `<div>` | Defines a section in a document |
| 2 | `<span>` | Defines a section in a document |
| 3 | `<header>` | Defines a header for a document or section |
| 4 | `<footer>` | Defines a footer for a document or section |
| 5 | `<main>` | Specifies the main content of a document |
| 6 | `<section>` | Defines a section in a document |
| 7 | `<article>` | Defines an article |
| 8 | `<aside>` | Defines content aside from the page content |
| 9 | `<details>` | Defines additional details that the user can view or hide |


#### Meta Info

```python
table[10]
```

|  | Tag | Description |
|---|---|---|
| 0 | `<head>` | Defines information about the document |
| 1 | `<meta>` | Defines metadata about an HTML document |
| 2 | `<base>` | Specifies the base URL/target for all relative URLs in a document |
| 3 | `<basefont>` | Not supported in HTML5. Use CSS instead.Specifies a default color, size, and font for all text in a document |


#### Programming

```python
table[11]
```

|  | Tag | Description |
|---|---|---|
| 0 | `<script>` | Defines a client-side script |
| 1 | `<noscript>` | Defines an alternate content for users that do not support client-side scripts |
| 2 | `<applet>` | Not supported in HTML5. Use `<embed>` or `<object>` instead.Defines an embedded applet |
| 3 | `<embed>` | Defines a container for an external (non-HTML) application |
| 4 | `<object>` | Defines an embedded object |
| 5 | `<param>` | Defines a parameter for an object |
