## [Mini Project] Developing a Machine Learning Model for Malicious Website Detection

### As an engineer on a corporate security team, you have been tasked with developing a machine learning model that detects malicious websites from features extracted from web pages.

### ▣ What problem do we need to solve?
 - Extract features from web pages.
 - Build a well-performing AI model that decides whether a site is malicious.

<br>

---

## ▣ Data Overview
* Web crawling dataset: Feature_Website.xlsx

## ▣ Variables in the Web Crawling Dataset
* html_code: the raw HTML code collected by crawling
* repu: whether the site is malicious (malicious: malicious site, benign: legitimate site)
<br>

---

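## <b>[Stage 1] Data Collection</b>

* In Stage 1, you will practice building features from HTML code collected by crawling.

# <b>Step 0. Install packages before starting the exercise</b>
* Install the BeautifulSoup library
* Install the openpyxl library

* Import the DataFrame-related libraries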

In [1]:
from bs4 import BeautifulSoup as bs
import openpyxl as xl
import pandas as pd
import numpy as np

---
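## <b>Q1. Loading the data</b>
### Load the Excel file that stores the benign/malicious HTML code
- File name: Feature_Website.xlsx


### <span style="color:pink">[Problem 1] Use the Pandas library to load the 'Feature_Website.xlsx' file into the variable 'df', then inspect the data with info() and head().</span>

In [2]:
# Write the exercise code below and check the result.
# pd.read_excel already returns a DataFrame, so no pd.DataFrame() wrapping is needed.
# (This notebook stores the frame in `data` rather than `df`.)
data = pd.read_excel('./Feature_Website.xlsx')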

In [3]:
# Check the DataFrame's info.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   html_code  40 non-null     object
 1   repu       40 non-null     object
dtypes: object(2)
memory usage: 768.0+ bytes


In [4]:
# Inspect the loaded data.
data.tail(2)
# 'malicious'   ## malicious site
# 'benign'      ## legitimate site

Unnamed: 0,html_code,repu
38,_x000D_\n<!DOCTYPE HTML>_x000D_\n<html>_x000D_...,benign
39,<!DOCTYPE html>\n<html>\n\n<head>\n <title>Bu...,benign


---
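# <b>Step 1. Data Collection</b>

### It is rare to build a model using only the data you are given.
### You often need to collect or generate additional data beyond what is provided.
### In this exercise, we take benign/malicious site HTML collected by a web crawler
### and use the BeautifulSoup library to extract the features we need.
### The benign/malicious HTML code was collected in advance and saved in the 'Feature_Website.xlsx' file.


### <span style="color:cyan">[Example] Use the BeautifulSoup library to print the HTML code and compute the length of the \<title> tag.</span>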

In [5]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(data['html_code'][0], 'html.parser')

*<span style="color:cyan"> Print the HTML code</span>

In [6]:
print(soup)

<!DOCTYPE html>

<!--[if lt IE 7]> <html lang="en-us" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->
<!--[if IE 7]>    <html lang="en-us" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->
<!--[if IE 8]>    <html lang="en-us" class="a-no-js a-lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="a-no-js" lang="en-us"><!--<![endif]--><head>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<title dir="ltr">Amazon.com</title>
<meta content="width=device-width" name="viewport"/>
<link href="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css" rel="stylesheet"/>
<script>

if (true === true) {
    var ue_t0 = (+ new Date()),
        ue_csm = window,
        ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },
        ue_furl = "fls-na.amazon.com",
        ue_mid = "ATVPDKIKX0DER"

* <span style="color:cyan"> Print the \<title> tag and compute its length</span>

In [7]:
# Print the <title> tag
print("* title :", soup.head.title)

# Print the length of the <title> tag's text
print("* title length :", len(soup.head.title.get_text()))

* title : <title dir="ltr">Amazon.com</title>
* title length : 10
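
The example above assumes every page has a `<head>` and a `<title>`. As a minimal sketch, the same measurement can be wrapped in a helper that returns 0 when the tag is missing; the name `title_length` and the sample strings are illustrative assumptions, not part of the exercise.

```python
from bs4 import BeautifulSoup


def title_length(html: str) -> int:
    """Return the length of the <title> text, or 0 if the page has no <title>."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("title")  # None when the tag is absent
    return len(title.get_text()) if title is not None else 0


# Hypothetical sample pages for illustration only
sample = "<html><head><title dir='ltr'>Amazon.com</title></head><body></body></html>"
print(title_length(sample))                                   # 10
print(title_length("<html><body>no title</body></html>"))     # 0
```

Returning 0 rather than raising keeps a later row-by-row extraction from stopping on a page that lacks the tag.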


---

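## <b>Q2. Computing the length of \<script>...\</script> tags in the HTML</b>
- Convert the HTML source into a Python object with BeautifulSoup
- Implement it as a function
- Return the result as a float

### <span style="color:pink">[Problem 2] Use the BeautifulSoup library to complete a function that computes the length of the \<script> tags in the HTML code, and check the result.</span>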

In [8]:
data['script_len'] = 0
data.tail(5)

Unnamed: 0,html_code,repu,script_len
35,"\n\n\n <!DOCTYPE HTML>\n <html class=""sp...",benign,0
36,"<!doctype html>\n<html lang=""en""><head><meta h...",benign,0
37,"\n\n\n\t<!DOCTYPE html>\n\t<html class=""no-js""...",benign,0
38,_x000D_\n<!DOCTYPE HTML>_x000D_\n<html>_x000D_...,benign,0
39,<!DOCTYPE html>\n<html>\n\n<head>\n <title>Bu...,benign,0


In [9]:
# Extract the feature: total text length of all <script> tags on each page.
data['script_len'] = 0
for idx, htm in enumerate(data['html_code']):
    soup = BeautifulSoup(htm, 'html.parser')
    scripts = soup.find_all('script')
    r = 0
    for script in scripts:
        r += len(script.get_text())
    # Assign with .loc; chained indexing (data['script_len'][idx] = r)
    # triggers pandas' SettingWithCopyWarning.
    data.loc[idx, 'script_len'] = r


In [10]:
data.tail(5)

Unnamed: 0,html_code,repu,script_len
35,"\n\n\n <!DOCTYPE HTML>\n <html class=""sp...",benign,4336
36,"<!doctype html>\n<html lang=""en""><head><meta h...",benign,0
37,"\n\n\n\t<!DOCTYPE html>\n\t<html class=""no-js""...",benign,0
38,_x000D_\n<!DOCTYPE HTML>_x000D_\n<html>_x000D_...,benign,2908
39,<!DOCTYPE html>\n<html>\n\n<head>\n <title>Bu...,benign,19372
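
The task asks for a function that returns a float, while the cell above uses an inline loop. One possible function-based version is sketched below; `script_length` and the two-row `toy` frame are hypothetical names and data used only for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup


def script_length(html: str) -> float:
    """Sum the text length of every <script> tag in one HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    return float(sum(len(s.get_text()) for s in soup.find_all("script")))


# Hypothetical two-row frame standing in for the real dataset
toy = pd.DataFrame({"html_code": [
    "<html><head><script>var a = 1;</script></head>"
    "<body><script>x()</script></body></html>",
    "<html><body><p>no scripts</p></body></html>",
]})

# apply() calls the function once per row, so no index-based assignment is needed
toy["script_len"] = toy["html_code"].apply(script_length)
print(toy["script_len"].tolist())  # [13.0, 0.0]
```

With `apply`, the column is built in one step, which sidesteps the chained-assignment pattern entirely.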


---

## <b>Q3. Counting spaces in the HTML</b>

- Convert the HTML source into a Python object with BeautifulSoup
- Implement it as a function
- Return the result as a float

### <span style="color:pink">[Problem 3] Use the BeautifulSoup library to complete a function that counts the spaces inside the \<html> tag of the HTML code, and check the result.</span>

In [11]:
# Extract the feature: number of space characters in the text of the <html> tag.
data['count_space'] = 0
for idx, htm in enumerate(data['html_code']):
    soup = BeautifulSoup(htm, 'html.parser')
    htmls = soup.find('html')
    # .loc avoids the SettingWithCopyWarning raised by chained indexing.
    data.loc[idx, 'count_space'] = htmls.get_text().count(' ')


In [12]:
# Check the extracted feature data.
data.tail(5)

Unnamed: 0,html_code,repu,script_len,count_space
35,"\n\n\n <!DOCTYPE HTML>\n <html class=""sp...",benign,4336,6
36,"<!doctype html>\n<html lang=""en""><head><meta h...",benign,0,0
37,"\n\n\n\t<!DOCTYPE html>\n\t<html class=""no-js""...",benign,0,13
38,_x000D_\n<!DOCTYPE HTML>_x000D_\n<html>_x000D_...,benign,2908,4257
39,<!DOCTYPE html>\n<html>\n\n<head>\n <title>Bu...,benign,19372,4
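
The space count can also be written as a function that guards against documents with no `<html>` tag, where `soup.find('html')` returns `None`. This is a sketch under that assumption; `count_spaces` is a hypothetical name.

```python
from bs4 import BeautifulSoup


def count_spaces(html: str) -> float:
    """Count space characters in the text of the <html> tag (0.0 if it is missing)."""
    soup = BeautifulSoup(html, "html.parser")
    root = soup.find("html")
    if root is None:  # HTML fragment without an <html> element
        return 0.0
    return float(root.get_text().count(" "))


print(count_spaces("<html><body><p>a b c</p></body></html>"))  # 2.0
print(count_spaces("<p>fragment without html tag</p>"))        # 0.0
```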


---

## <b>Q4. Computing the body length in the HTML</b>

- Convert the HTML source into a Python object with BeautifulSoup
- Implement it as a function
- Return the result as a float

### <span style="color:pink">[Problem 4] Use the BeautifulSoup library to complete a function that computes the length of the \<body> tag in the HTML code, and check the result.</span>

In [13]:
# Extract the feature: total text length of the <body> tag(s) on each page.
data['body_length'] = 0
for idx, htm in enumerate(data['html_code']):
    soup = BeautifulSoup(htm, 'html.parser')
    bodys = soup.find_all('body')
    r = 0
    for body in bodys:
        r += len(body.get_text())
    # .loc avoids the SettingWithCopyWarning raised by chained indexing.
    data.loc[idx, 'body_length'] = r


In [14]:
# Check the feature extracted from the html_code column of the DataFrame.
data.tail(5)

Unnamed: 0,html_code,repu,script_len,count_space,body_length
35,"\n\n\n <!DOCTYPE HTML>\n <html class=""sp...",benign,4336,6,24
36,"<!doctype html>\n<html lang=""en""><head><meta h...",benign,0,0,0
37,"\n\n\n\t<!DOCTYPE html>\n\t<html class=""no-js""...",benign,0,13,0
38,_x000D_\n<!DOCTYPE HTML>_x000D_\n<html>_x000D_...,benign,2908,4257,8932
39,<!DOCTYPE html>\n<html>\n\n<head>\n <title>Bu...,benign,19372,4,0


---

## <b>Q5. Number of tags with src or href attributes in script</b>

- Convert the HTML source into a Python object with BeautifulSoup
- Implement it as a function
- Return the result as a float

### <span style="color:pink">[Problem 5] Use the BeautifulSoup library to complete a function that counts the tags with src or href attributes in the \<script> tags of the HTML code, and check the result.</span>


In [15]:
# Extract the feature.
# Note: this counts occurrences of the substrings 'src' and 'href' in the text
# of each <script> tag; it does not inspect tag attributes.
data['src&href_count'] = 0
for idx, htm in enumerate(data['html_code']):
    soup = BeautifulSoup(htm, 'html.parser')
    scripts = soup.find_all('script')
    tags = 0
    for script in scripts:
        text = script.get_text()
        tags += text.count('src')
        tags += text.count('href')
    # .loc avoids the SettingWithCopyWarning raised by chained indexing.
    data.loc[idx, 'src&href_count'] = tags


In [16]:
script.get_text()

''

In [17]:
# Check the feature extracted from the html_code column of the DataFrame.
data

Unnamed: 0,html_code,repu,script_len,count_space,body_length,src&href_count
0,<!DOCTYPE html>\n<!--[if lt IE 7]> <html lang=...,malicious,1076,65,402,3
1,\n\t\n\n\n\t\n\n\t\n\n\n\t\n\n\n\t\n\n\t\n\t\t...,malicious,562,87,1041,0
2,"<!DOCTYPE html>\n<html lang=""en"">\n <head>\...",malicious,8968,199,433,3
3,"<!DOCTYPE html><html lang=""en""><head><style da...",malicious,0,0,0,0
4,<!DOCTYPE html>\n\n\n \n \n \n \n ...,malicious,408,1808,2969,0
5,_x000D_\n_x000D_\n_x000D_\n<!DOCTYPE html>_x00...,malicious,3163,358,0,0
6,"<!doctype html>\n\n<html data-ytrk-page=""HOME""...",malicious,23676,156,792,2
7,"\n\t<!DOCTYPE html>\n\t<html class=""no-icon-fo...",malicious,23445,5,0,5
8,"<!DOCTYPE html>\n<html class=""no-js"">\n<head>\...",malicious,1349,95,2032,0
9,"<!DOCTYPE html>\n<html class=""b-header--bl...",malicious,14883,2,0,12
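
The `src&href_count` column above comes from substring counts inside script text. If the problem statement is read literally, as the number of `<script>` tags that carry a `src` or `href` attribute, a sketch could look like the following. This is one interpretation rather than the notebook's method, and `count_script_src_href` is a hypothetical name.

```python
from bs4 import BeautifulSoup


def count_script_src_href(html: str) -> float:
    """Count <script> tags that have a src or href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.find_all(
        lambda tag: tag.name == "script"
        and (tag.has_attr("src") or tag.has_attr("href"))
    )
    return float(len(matches))


# Hypothetical page: one external script, one inline script that merely
# mentions the words "src" and "href", and a <link> that is not a script.
sample = (
    '<html><head>'
    '<script src="a.js"></script>'
    '<script>var s = "src href";</script>'
    '<link href="x.css" rel="stylesheet"/>'
    '</head></html>'
)
print(count_script_src_href(sample))  # 1.0
```

On this sample the substring approach would report 2 (from the inline script's text), while the attribute-based count reports 1, so the two features measure different things.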


## <b>Q6. Additional features that can be derived</b>

- Convert the HTML source into a Python object with BeautifulSoup
- Implement it as a function
- Return the result with an appropriate data type

### <span style="color:pink">[Problem 6] Use the BeautifulSoup library to find additional features that can be built from the HTML code, and check the result.</span>


In [18]:
# Extract the feature.
# Note: '/n' and '/t' are the literal two-character sequences (slash + letter),
# not the newline '\n' and tab '\t' escape characters.
data['inside_newline'] = 0
for idx, htm in enumerate(data['html_code']):
    soup = BeautifulSoup(htm, 'html.parser')
    text = str(soup)
    r = text.count('/n') + text.count('/t')
    # .loc avoids the SettingWithCopyWarning raised by chained indexing.
    data.loc[idx, 'inside_newline'] = r


In [20]:
# Extract the feature: newline and tab characters in the raw HTML,
# minus the inside_newline count from the previous cell.
data['outside_newline'] = 0
for idx, htm in enumerate(data['html_code']):
    r = htm.count('\n') + htm.count('\t')
    r -= data.loc[idx, 'inside_newline']
    # .loc avoids the SettingWithCopyWarning raised by chained indexing.
    data.loc[idx, 'outside_newline'] = r


In [21]:
# Check the extracted feature data.
data


Unnamed: 0,html_code,repu,script_len,count_space,body_length,src&href_count,inside_newline,outside_newline
0,<!DOCTYPE html>\n<!--[if lt IE 7]> <html lang=...,malicious,1076,65,402,3,2,133
1,\n\t\n\n\n\t\n\n\t\n\n\n\t\n\n\n\t\n\n\t\n\t\t...,malicious,562,87,1041,0,13,5326
2,"<!DOCTYPE html>\n<html lang=""en"">\n <head>\...",malicious,8968,199,433,3,6,319
3,"<!DOCTYPE html><html lang=""en""><head><style da...",malicious,0,0,0,0,0,0
4,<!DOCTYPE html>\n\n\n \n \n \n \n ...,malicious,408,1808,2969,0,11,913
5,_x000D_\n_x000D_\n_x000D_\n<!DOCTYPE html>_x00...,malicious,3163,358,0,0,2,59
6,"<!doctype html>\n\n<html data-ytrk-page=""HOME""...",malicious,23676,156,792,2,14,496
7,"\n\t<!DOCTYPE html>\n\t<html class=""no-icon-fo...",malicious,23445,5,0,5,5,238
8,"<!DOCTYPE html>\n<html class=""no-js"">\n<head>\...",malicious,1349,95,2032,0,55,395
9,"<!DOCTYPE html>\n<html class=""b-header--bl...",malicious,14883,2,0,12,3,182
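
The `inside_newline` cell searches for `'/n'` and `'/t'`, which match a forward slash followed by a letter (for example inside URL paths), not the newline and tab control characters. The short sketch below contrasts the two counts on a hypothetical sample string; both helper names are illustrative.

```python
def count_newlines_and_tabs(html: str) -> int:
    """Count real newline and tab control characters."""
    return html.count("\n") + html.count("\t")


def count_slash_sequences(html: str) -> int:
    """Count the literal two-character sequences '/n' and '/t'."""
    return html.count("/n") + html.count("/t")


# Hypothetical sample: 3 newlines, 3 tabs, and a URL containing '/n' and '/t'
sample = (
    "<html>\n\t<body>\n\t\t"
    "<a href='http://example.com/news/top'>x</a>\n</body></html>"
)
print(count_newlines_and_tabs(sample))  # 6
print(count_slash_sequences(sample))    # 2
```

If the intent of the feature is whitespace structure, the first helper matches it; the second mostly reflects how often paths such as `/news` or `/top` appear.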


In [43]:
# Extract the feature: number of spaces in the first <style> tag.
# If a page has no <style> tag, find() returns None and str(None)
# contains no spaces, so the count is 0.
data['count_style'] = 0
for idx, htm in enumerate(data['html_code']):
    soup = BeautifulSoup(htm, 'html.parser')
    r = str(soup.find('style')).count(' ')
    # .loc avoids the SettingWithCopyWarning raised by chained indexing.
    data.loc[idx, 'count_style'] = r


In [44]:
data

Unnamed: 0,html_code,repu,script_len,count_space,body_length,src&href_count,inside_newline,outside_newline,count_style
0,<!DOCTYPE html>\n<!--[if lt IE 7]> <html lang=...,malicious,1076,65,402,3,2,133,0
1,\n\t\n\n\n\t\n\n\t\n\n\n\t\n\n\n\t\n\n\t\n\t\t...,malicious,562,87,1041,0,13,5326,58
2,"<!DOCTYPE html>\n<html lang=""en"">\n <head>\...",malicious,8968,199,433,3,6,319,0
3,"<!DOCTYPE html><html lang=""en""><head><style da...",malicious,0,0,0,0,0,0,196
4,<!DOCTYPE html>\n\n\n \n \n \n \n ...,malicious,408,1808,2969,0,11,913,0
5,_x000D_\n_x000D_\n_x000D_\n<!DOCTYPE html>_x00...,malicious,3163,358,0,0,2,59,30
6,"<!doctype html>\n\n<html data-ytrk-page=""HOME""...",malicious,23676,156,792,2,14,496,0
7,"\n\t<!DOCTYPE html>\n\t<html class=""no-icon-fo...",malicious,23445,5,0,5,5,238,0
8,"<!DOCTYPE html>\n<html class=""no-js"">\n<head>\...",malicious,1349,95,2032,0,55,395,0
9,"<!DOCTYPE html>\n<html class=""b-header--bl...",malicious,14883,2,0,12,3,182,0


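In [51]:
# Encode the label: benign -> 0, malicious -> 1
d = {'benign': 0, 'malicious': 1}
data['repu'] = data['repu'].replace(d)
# Drop the raw HTML so that only numeric columns remain for modeling.
data.drop('html_code', axis=1, inplace=True)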

In [52]:

data

Unnamed: 0,repu,script_len,count_space,body_length,src&href_count,inside_newline,outside_newline,count_style
0,1,1076,65,402,3,2,133,0
1,1,562,87,1041,0,13,5326,58
2,1,8968,199,433,3,6,319,0
3,1,0,0,0,0,0,0,196
4,1,408,1808,2969,0,11,913,0
5,1,3163,358,0,0,2,59,30
6,1,23676,156,792,2,14,496,0
7,1,23445,5,0,5,5,238,0
8,1,1349,95,2032,0,55,395,0
9,1,14883,2,0,12,3,182,0
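
As a side note on the label encoding, `Series.map` is an alternative to `replace` that turns any label missing from the dictionary into `NaN`, which makes unexpected values easy to detect. A minimal sketch on hypothetical labels:

```python
import pandas as pd

# Hypothetical label column for illustration
labels = pd.Series(["malicious", "benign", "benign", "malicious"], name="repu")
mapping = {"benign": 0, "malicious": 1}

encoded = labels.map(mapping)

# Any label not present in `mapping` would show up as NaN here
if encoded.isna().any():
    raise ValueError("unexpected label found in repu column")

encoded = encoded.astype(int)
print(encoded.tolist())  # [1, 0, 0, 1]
```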


In [57]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

target = 'repu'
x = data.drop(target, axis=1)
y = data[target]

# Standardize the features.
# Note: the scaler is fit on all rows before the split, so test-set
# statistics leak into the scaling.
scaler = StandardScaler()
x = scaler.fit_transform(x)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=2, test_size=.25)

model = LogisticRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

# accuracy_score expects (y_true, y_pred)
print('acc:', accuracy_score(y_test, y_pred))

acc: 0.6


In [58]:
y_pred, y_test

(array([1, 0, 1, 0, 0, 0, 1, 0, 1, 0], dtype=int64),
 27    0
 9     1
 14    1
 0     1
 2     1
 30    0
 13    1
 36    0
 17    1
 37    0
 Name: repu, dtype: int64)
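
The modeling cell fits `StandardScaler` on every row before splitting. One way to keep test rows out of the scaling statistics is to place the scaler and the classifier in a scikit-learn pipeline and fit it on the training split only. The sketch below uses synthetic data from `make_classification` as a stand-in for the extracted features, so its accuracy says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 40 rows and 7 numeric features
X, y = make_classification(
    n_samples=40, n_features=7, n_informative=4, n_redundant=0, random_state=0
)

# stratify=y keeps the class ratio similar in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=2, stratify=y
)

# The scaler is fit inside the pipeline, on the training rows only
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("acc:", acc)
```

With only 10 test rows, a single misclassification moves accuracy by 0.1, so cross-validation may give a steadier estimate on a dataset this small.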