# **TF-IDF**
The dataset contains 7,395 websites, each described by 27 features (columns).

y, the target column to predict: label. It is 1 if the site is evergreen (its content stays relevant over time) and 0 otherwise. (predict_y)

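X, the model input column: only the boilerplate column is used. It holds the text content of each website. (train_x)

In other words, let's explore analyzing only the boilerplate column with TF-IDF and feeding the result to the model as input!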

## **Loading the Data & Imports**

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')


Mounted at /content/gdrive


In [2]:
import os
path = "gdrive/My Drive/Colab Notebooks/02_Test/data/"
os.listdir(path)

['stumble_upon_evergreen.tsv']

In [3]:
import pandas as pd 
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline # preprocessing + modeling

from sklearn.ensemble import RandomForestClassifier

## **Exploring the Data**

In [4]:
df = pd.read_table(path + "stumble_upon_evergreen.tsv")
df.head()

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,compression_ratio,embed_ratio,framebased,frameTagRatio,hasDomainLink,html_ratio,image_ratio,is_news,lengthyLinkDomain,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,0.443783,0.0,0,0.090774,0,0.245831,0.003883,1,1,24,0,5424,170,8,0.152941,0.07913,0
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,0.468649,0.0,0,0.098707,0,0.20349,0.088652,1,1,40,0,4973,187,9,0.181818,0.125448,1
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,0.525448,0.0,0,0.072448,0,0.226402,0.120536,1,1,55,0,2240,258,11,0.166667,0.057613,1
3,http://www.dumblittleman.com/2007/12/10-foolpr...,6684,"{""title"":""10 Foolproof Tips for Better Sleep ""...",health,0.801248,1.543103,0.4,0.1,0.016667,0.0,0.480725,0.0,0,0.095861,0,0.265656,0.035343,1,0,24,0,2737,120,5,0.041667,0.100858,1
4,http://bleacherreport.com/articles/1205138-the...,9006,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",sports,0.719157,2.676471,0.5,0.222222,0.123457,0.04321,0.446143,0.0,0,0.024908,0,0.228887,0.050473,1,1,14,0,12032,162,10,0.098765,0.082569,0


In [5]:
df.columns

Index(['url', 'urlid', 'boilerplate', 'alchemy_category',
       'alchemy_category_score', 'avglinksize', 'commonlinkratio_1',
       'commonlinkratio_2', 'commonlinkratio_3', 'commonlinkratio_4',
       'compression_ratio', 'embed_ratio', 'framebased', 'frameTagRatio',
       'hasDomainLink', 'html_ratio', 'image_ratio', 'is_news',
       'lengthyLinkDomain', 'linkwordscore', 'news_front_page',
       'non_markup_alphanum_characters', 'numberOfLinks', 'numwords_in_url',
       'parametrizedLinkRatio', 'spelling_errors_ratio', 'label'],
      dtype='object')

In [6]:
df.shape

(7395, 27)

In [7]:
df['boilerplate'].value_counts()

{"title":"Freebase Pancakes NOTCOT ","body":"notcot in food drink 13 03 Every now and then a series of images comes along and just blows your mind I couldn t stop laughing and then just kind of staring amazement at the micro scale cooking going on here my friend is calling them Freebase Pancakes How one even comes up with a concept such as this amazes me See full set of imagery below via Random Stuff Tags food technology ","url":"notcot archives 2007 04 freebase pancak php"}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         

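## **Data Preprocessing**
Inspecting the boilerplate column shows that it needs to be split on "title", "body", and "url", and that the unneeded characters scattered through it should be removed. The steps required are:


*   Remove {}, "", and :
*   Split on the title, body, and url strings
*   After splitting, remove the leftover trailing text (as a check)

*I should have removed the unneeded characters first, but I deleted them as I went along, so the code grew long... This looks like something to improve.*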
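
Each boilerplate value shown earlier looks like a JSON object with `title`, `body`, and `url` keys, so one possible shortcut is to parse it with the standard `json` module instead of splitting strings by hand. The sketch below assumes every row is valid JSON, which has not been checked against the full dataset. `parse_boilerplate` is a hypothetical helper name, and missing, null, or unparseable fields fall back to an empty string.

```python
import json

def parse_boilerplate(raw):
    """Parse one boilerplate string into (title, body, url).

    Assumes the string is a JSON object; returns empty strings for
    missing/null fields or when the string cannot be parsed.
    """
    try:
        record = json.loads(raw)
    except (TypeError, ValueError):
        # TypeError: raw is None/NaN; ValueError: raw is not valid JSON
        return "", "", ""
    if not isinstance(record, dict):
        return "", "", ""
    # `or ""` turns None (JSON null) into an empty string before stripping
    return tuple(str(record.get(key) or "").strip() for key in ("title", "body", "url"))

# Sample taken from the last row shown in the notebook output
sample = '{"title":"Naturally Ella ","body":" ","url":"naturallyella"}'
title, body, url = parse_boilerplate(sample)

# Join only the non-empty parts so empty fields do not leave double spaces
context = " ".join(part for part in (title, body, url) if part)
print(context)
```

With pandas, the helper could be applied as `input_data['boilerplate'].apply(parse_boilerplate)`, which would replace the replace/split steps with a single pass.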



In [8]:
X_column = ['boilerplate']
Y_column = 'label'

input_data = df[X_column + [Y_column]].copy() 
input_data.head()

Unnamed: 0,boilerplate,label
0,"{""title"":""IBM Sees Holographic Calls Air Breat...",0
1,"{""title"":""The Fully Electronic Futuristic Star...",1
2,"{""title"":""Fruits that Fight the Flu fruits tha...",1
3,"{""title"":""10 Foolproof Tips for Better Sleep ""...",1
4,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",0


In [9]:
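# regex=True so that "[{}]" is treated as a character class (newer pandas versions default to literal matching)
input_data['boilerplate'] = input_data['boilerplate'].str.replace("[{}]", '', regex=True)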
input_data.head().values

array([['"title":"IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries","body":"A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose California Photographer Tony Avelar Bloomberg Buildings stand at the International Business Machines Corp IBM Almaden Research Center campus in the Santa Teresa Hills of San Jose California Photographer Tony Avelar Bloomberg By 2015 your mobile phone will project a 3 D image of anyone who calls and your laptop will be powered by kinetic energy At least that s what International Business Machines Corp sees in its crystal ball The predictions are part of an annual tradition for the Armonk New York based company which surveys its 3 000 researchers to find five ideas expected to take root in the next five years IBM the world s largest provider of computer services looks to Silicon Valley for input gleaning many ideas from its Almaden research center in S

In [10]:
input_data[['temp', 'title']] = input_data['boilerplate'].str.split("\"title\":", expand=True)
input_data = input_data.drop(['temp'], axis=1)

input_data.head()

Unnamed: 0,boilerplate,label,title
0,"""title"":""IBM Sees Holographic Calls Air Breath...",0,"""IBM Sees Holographic Calls Air Breathing Batt..."
1,"""title"":""The Fully Electronic Futuristic Start...",1,"""The Fully Electronic Futuristic Starting Gun ..."
2,"""title"":""Fruits that Fight the Flu fruits that...",1,"""Fruits that Fight the Flu fruits that fight t..."
3,"""title"":""10 Foolproof Tips for Better Sleep "",...",1,"""10 Foolproof Tips for Better Sleep "",""body"":""..."
4,"""title"":""The 50 Coolest Jerseys You Didn t Kno...",0,"""The 50 Coolest Jerseys You Didn t Know Existe..."


In [11]:
input_data[['temp', 'body']] = input_data['title'].str.split(",\"body\":", expand=True)
input_data = input_data.drop(['temp'], axis=1)

input_data.head()

Unnamed: 0,boilerplate,label,title,body
0,"""title"":""IBM Sees Holographic Calls Air Breath...",0,"""IBM Sees Holographic Calls Air Breathing Batt...","""A sign stands outside the International Busin..."
1,"""title"":""The Fully Electronic Futuristic Start...",1,"""The Fully Electronic Futuristic Starting Gun ...","""And that can be carried on a plane without th..."
2,"""title"":""Fruits that Fight the Flu fruits that...",1,"""Fruits that Fight the Flu fruits that fight t...","""Apples The most popular source of antioxidant..."
3,"""title"":""10 Foolproof Tips for Better Sleep "",...",1,"""10 Foolproof Tips for Better Sleep "",""body"":""...","""There was a period in my life when I had a lo..."
4,"""title"":""The 50 Coolest Jerseys You Didn t Kno...",0,"""The 50 Coolest Jerseys You Didn t Know Existe...","""Jersey sales is a curious business Whether yo..."


In [12]:
input_data[['temp', 'url']] = input_data['body'].str.split(",\"url\":", expand=True)
input_data = input_data.drop(['temp'], axis=1)

input_data

Unnamed: 0,boilerplate,label,title,body,url
0,"""title"":""IBM Sees Holographic Calls Air Breath...",0,"""IBM Sees Holographic Calls Air Breathing Batt...","""A sign stands outside the International Busin...","""bloomberg news 2010 12 23 ibm predicts hologr..."
1,"""title"":""The Fully Electronic Futuristic Start...",1,"""The Fully Electronic Futuristic Starting Gun ...","""And that can be carried on a plane without th...","""popsci technology article 2012 07 electronic ..."
2,"""title"":""Fruits that Fight the Flu fruits that...",1,"""Fruits that Fight the Flu fruits that fight t...","""Apples The most popular source of antioxidant...","""menshealth health flu fighting fruits cm mmc ..."
3,"""title"":""10 Foolproof Tips for Better Sleep "",...",1,"""10 Foolproof Tips for Better Sleep "",""body"":""...","""There was a period in my life when I had a lo...","""dumblittleman 2007 12 10 foolproof tips for b..."
4,"""title"":""The 50 Coolest Jerseys You Didn t Kno...",0,"""The 50 Coolest Jerseys You Didn t Know Existe...","""Jersey sales is a curious business Whether yo...","""bleacherreport articles 1205138 the 50 cooles..."
...,...,...,...,...,...
7390,"""title"":""Kno Raises 46 Million More To Build M...",0,"""Kno Raises 46 Million More To Build Most Powe...","""Marc Andreessen is normally enthusiastic abou...","""techcrunch 2010 09 08 kno raises 46 million m..."
7391,"""title"":""Why I Miss College "",""body"":""Mar 30 2...",0,"""Why I Miss College "",""body"":""Mar 30 2009 I d ...","""Mar 30 2009 I d like to congratulate Jane on ...","""uncoached category why i miss college"""
7392,"""title"":""Sweet Potatoes Eat This Not That i'm...",1,"""Sweet Potatoes Eat This Not That i'm eating ...","""They re loaded with vitamin C which smoothes ...","""eatthis menshealth slide sweet potatoes slide..."
7393,"""title"":""Naturally Ella "",""body"":"" "",""url"":""na...",1,"""Naturally Ella "",""body"":"" "",""url"":""naturallye...",""" "",""url"":""naturallyella""","""naturallyella"""


In [13]:
input_data['title'] = input_data['title'].str.split(",\"body\":").str[0]
input_data['body'] = input_data['body'].str.split(",\"url\":").str[0]
input_data

Unnamed: 0,boilerplate,label,title,body,url
0,"""title"":""IBM Sees Holographic Calls Air Breath...",0,"""IBM Sees Holographic Calls Air Breathing Batt...","""A sign stands outside the International Busin...","""bloomberg news 2010 12 23 ibm predicts hologr..."
1,"""title"":""The Fully Electronic Futuristic Start...",1,"""The Fully Electronic Futuristic Starting Gun ...","""And that can be carried on a plane without th...","""popsci technology article 2012 07 electronic ..."
2,"""title"":""Fruits that Fight the Flu fruits that...",1,"""Fruits that Fight the Flu fruits that fight t...","""Apples The most popular source of antioxidant...","""menshealth health flu fighting fruits cm mmc ..."
3,"""title"":""10 Foolproof Tips for Better Sleep "",...",1,"""10 Foolproof Tips for Better Sleep ""","""There was a period in my life when I had a lo...","""dumblittleman 2007 12 10 foolproof tips for b..."
4,"""title"":""The 50 Coolest Jerseys You Didn t Kno...",0,"""The 50 Coolest Jerseys You Didn t Know Existe...","""Jersey sales is a curious business Whether yo...","""bleacherreport articles 1205138 the 50 cooles..."
...,...,...,...,...,...
7390,"""title"":""Kno Raises 46 Million More To Build M...",0,"""Kno Raises 46 Million More To Build Most Powe...","""Marc Andreessen is normally enthusiastic abou...","""techcrunch 2010 09 08 kno raises 46 million m..."
7391,"""title"":""Why I Miss College "",""body"":""Mar 30 2...",0,"""Why I Miss College ""","""Mar 30 2009 I d like to congratulate Jane on ...","""uncoached category why i miss college"""
7392,"""title"":""Sweet Potatoes Eat This Not That i'm...",1,"""Sweet Potatoes Eat This Not That i'm eating ...","""They re loaded with vitamin C which smoothes ...","""eatthis menshealth slide sweet potatoes slide..."
7393,"""title"":""Naturally Ella "",""body"":"" "",""url"":""na...",1,"""Naturally Ella """,""" ""","""naturallyella"""


In [14]:
input_data['title'] = input_data['title'].str.replace("\"", '')
input_data['body'] = input_data['body'].str.replace("\"", '')
input_data['url'] = input_data['url'].str.replace("\"", '')
input_data

Unnamed: 0,boilerplate,label,title,body,url
0,"""title"":""IBM Sees Holographic Calls Air Breath...",0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...,bloomberg news 2010 12 23 ibm predicts hologra...
1,"""title"":""The Fully Electronic Futuristic Start...",1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...,popsci technology article 2012 07 electronic f...
2,"""title"":""Fruits that Fight the Flu fruits that...",1,Fruits that Fight the Flu fruits that fight th...,Apples The most popular source of antioxidants...,menshealth health flu fighting fruits cm mmc F...
3,"""title"":""10 Foolproof Tips for Better Sleep "",...",1,10 Foolproof Tips for Better Sleep,There was a period in my life when I had a lot...,dumblittleman 2007 12 10 foolproof tips for be...
4,"""title"":""The 50 Coolest Jerseys You Didn t Kno...",0,The 50 Coolest Jerseys You Didn t Know Existed...,Jersey sales is a curious business Whether you...,bleacherreport articles 1205138 the 50 coolest...
...,...,...,...,...,...
7390,"""title"":""Kno Raises 46 Million More To Build M...",0,Kno Raises 46 Million More To Build Most Power...,Marc Andreessen is normally enthusiastic about...,techcrunch 2010 09 08 kno raises 46 million mo...
7391,"""title"":""Why I Miss College "",""body"":""Mar 30 2...",0,Why I Miss College,Mar 30 2009 I d like to congratulate Jane on h...,uncoached category why i miss college
7392,"""title"":""Sweet Potatoes Eat This Not That i'm...",1,Sweet Potatoes Eat This Not That i'm eating t...,They re loaded with vitamin C which smoothes o...,eatthis menshealth slide sweet potatoes slides...
7393,"""title"":""Naturally Ella "",""body"":"" "",""url"":""na...",1,Naturally Ella,,naturallyella


In [15]:
input_data['context'] = input_data["title"] + " " + input_data["body"] + " " + input_data["url"]
input_data['context'] = input_data['context'].fillna("")
input_data

# The 'context' column of input_data is now fully prepared.

Unnamed: 0,boilerplate,label,title,body,url,context
0,"""title"":""IBM Sees Holographic Calls Air Breath...",0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...,bloomberg news 2010 12 23 ibm predicts hologra...,IBM Sees Holographic Calls Air Breathing Batte...
1,"""title"":""The Fully Electronic Futuristic Start...",1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...,popsci technology article 2012 07 electronic f...,The Fully Electronic Futuristic Starting Gun T...
2,"""title"":""Fruits that Fight the Flu fruits that...",1,Fruits that Fight the Flu fruits that fight th...,Apples The most popular source of antioxidants...,menshealth health flu fighting fruits cm mmc F...,Fruits that Fight the Flu fruits that fight th...
3,"""title"":""10 Foolproof Tips for Better Sleep "",...",1,10 Foolproof Tips for Better Sleep,There was a period in my life when I had a lot...,dumblittleman 2007 12 10 foolproof tips for be...,10 Foolproof Tips for Better Sleep There was ...
4,"""title"":""The 50 Coolest Jerseys You Didn t Kno...",0,The 50 Coolest Jerseys You Didn t Know Existed...,Jersey sales is a curious business Whether you...,bleacherreport articles 1205138 the 50 coolest...,The 50 Coolest Jerseys You Didn t Know Existed...
...,...,...,...,...,...,...
7390,"""title"":""Kno Raises 46 Million More To Build M...",0,Kno Raises 46 Million More To Build Most Power...,Marc Andreessen is normally enthusiastic about...,techcrunch 2010 09 08 kno raises 46 million mo...,Kno Raises 46 Million More To Build Most Power...
7391,"""title"":""Why I Miss College "",""body"":""Mar 30 2...",0,Why I Miss College,Mar 30 2009 I d like to congratulate Jane on h...,uncoached category why i miss college,Why I Miss College Mar 30 2009 I d like to co...
7392,"""title"":""Sweet Potatoes Eat This Not That i'm...",1,Sweet Potatoes Eat This Not That i'm eating t...,They re loaded with vitamin C which smoothes o...,eatthis menshealth slide sweet potatoes slides...,Sweet Potatoes Eat This Not That i'm eating t...
7393,"""title"":""Naturally Ella "",""body"":"" "",""url"":""na...",1,Naturally Ella,,naturallyella,Naturally Ella naturallyella


## **Preparing the Model!**
A vector of TF-IDF values will be used as the model's input.

For that reason, scikit-learn's TfidfVectorizer has to be used.

This requires the input to be text data (the text preprocessed earlier)!

# **TF-IDF (Term Frequency - Inverse Document Frequency)**


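*   **TF (term frequency)** is how often a particular word appears within a document. The higher the value, the more important the word can be considered in that document.
*   However, a word that also appears frequently across many other documents is a common word, so its importance is lower.
*   The number of documents in which a word appears is called **DF (document frequency)**, and its inverse is called IDF (inverse document frequency).
*   TF-IDF is the product of TF and IDF; a word with a high score appears often in the given document but rarely in other documents.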
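
To make the definitions above concrete, here is a small hand computation on a made-up three-document corpus. It uses raw counts for TF and the plain textbook form idf = log(N / df). This is only a sketch of the idea: scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration): each document is a list of tokens
docs = [
    "apple banana apple".split(),
    "apple cherry".split(),
    "apple banana banana durian".split(),
]
n_docs = len(docs)

# DF: number of documents that contain each term at least once
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    """Return {term: tf * idf} with tf = raw count and idf = log(N / df)."""
    tf = Counter(doc)
    return {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}

scores = [tf_idf(doc) for doc in docs]
for i, doc_scores in enumerate(scores):
    print(i, {term: round(value, 3) for term, value in doc_scores.items()})
```

Here "apple" occurs in every document, so its IDF is log(3/3) = 0 and it scores zero everywhere, while "cherry" and "durian", which occur in only one document each, get the largest weights.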





In [18]:
vectorizer = CountVectorizer(analyzer = 'word', max_features = 5000)

context = list(input_data['context'])

train_data_features = vectorizer.fit_transform(context)

train_data_features

<7395x5000 sparse matrix of type '<class 'numpy.int64'>'
	with 960797 stored elements in Compressed Sparse Row format>
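
Note that the cell above builds raw term counts with `CountVectorizer`, which is why the matrix has an integer dtype. To obtain TF-IDF weights as described earlier, `TfidfVectorizer` accepts the same `analyzer` and `max_features` arguments. The sketch below runs on a few made-up sentences rather than the notebook's data, and the variable names `toy_context`, `tfidf_vectorizer`, and `toy_features` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for input_data['context'] (made-up sentences)
toy_context = [
    "fruits that fight the flu",
    "the coolest jerseys you did not know existed",
    "tips for better sleep",
]

# Same arguments as the CountVectorizer cell, but producing TF-IDF weights
tfidf_vectorizer = TfidfVectorizer(analyzer="word", max_features=5000)
toy_features = tfidf_vectorizer.fit_transform(toy_context)

print(toy_features.shape)  # (number of documents, vocabulary size)
print(toy_features.dtype)  # floating point, since the values are weights rather than counts
```

Swapping this into the notebook would change the feature values, so the accuracy reported later would not necessarily stay the same.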

In [19]:
# Split into train and test data at an 8:2 ratio;
# the train data can later be split again into train and validation data!
# X_column = ['title', 'body', 'url'] -> preprocessing finished
Y_column = 'label'

train_X, test_X, train_Y, test_Y = train_test_split(
    train_data_features,  # X, the input data
    input_data[Y_column], # Y, the target (label) data
    test_size = 0.2,
    #train_size = 13, 
    shuffle=True,
    random_state=42)

train_X

<5916x5000 sparse matrix of type '<class 'numpy.int64'>'
	with 774308 stored elements in Compressed Sparse Row format>

In [21]:
model_rf = RandomForestClassifier(n_estimators=100)
model_rf.fit(train_X, train_Y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
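
The `Pipeline` class imported at the top is not used in the cells above. One way it could be used is to split the raw text first and let the pipeline fit the vectorizer on the training text only, so the vocabulary is never learned from test documents. The sketch below uses made-up texts and labels, and the step names `"tfidf"` and `"rf"` are arbitrary choices.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up texts and labels standing in for input_data['context'] and input_data['label']
texts = [
    "easy pancake recipe", "healthy fruit salad recipe",
    "tips for better sleep", "homemade bread recipe",
    "breaking news stock market today", "game score last night",
    "election results this week", "celebrity gossip today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first so the vectorizer never sees the test documents
train_text, test_text, train_y, test_y = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Vectorizing and modeling chained into a single estimator
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="word", max_features=5000)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
text_clf.fit(train_text, train_y)

pred_y = text_clf.predict(test_text)
print(pred_y)
```

Fixing `random_state` on the classifier also makes repeated runs reproducible, which the default `random_state=None` does not guarantee.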

## Model Prediction and Performance Evaluation

In [23]:
pred_Y = model_rf.predict(test_X)
pred_Y

array([1, 1, 1, ..., 0, 1, 1])

In [24]:
df_result = pd.DataFrame(list(zip(test_Y, pred_Y)), columns=['true_y', 'pred_y'])
df_result

Unnamed: 0,true_y,pred_y
0,1,1
1,1,1
2,1,1
3,1,1
4,0,0
...,...,...
1474,0,0
1475,1,0
1476,1,0
1477,1,1


In [26]:
print("Accuracy %f" % model_rf.score(test_X, test_Y))

Accuracy 0.762677
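
A single accuracy figure does not show which kind of mistake the model makes. As a rough illustration, the sketch below counts the confusion-matrix cells and derives precision and recall in plain Python. The input is only the nine (true_y, pred_y) rows visible in the `df_result` preview, not the full test set, so the numbers it prints say nothing about the model's real performance. `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(true_y, pred_y):
    """Count (tp, fp, fn, tn) for binary labels, treating 1 as the positive class."""
    pairs = list(zip(true_y, pred_y))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

# The nine rows visible in the df_result preview (rows 0-4 and 1474-1477)
true_y = [1, 1, 1, 1, 0, 0, 1, 1, 1]
pred_y = [1, 1, 1, 1, 0, 0, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(true_y, pred_y)
accuracy = (tp + tn) / len(true_y)
# Guard against division by zero when a class is never predicted or never present
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

On the real predictions, the same helper could be called as `confusion_counts(test_Y, pred_Y)`.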
