#Dataset (same as the one used in Chapters 2 and 3): 
    1. https://www.kaggle.com/yelp-dataset/yelp-dataset/data?select=yelp_academic_dataset_business.json
    2. https://www.kaggle.com/yelp-dataset/yelp-dataset/data?select=yelp_academic_dataset_review.json 
    
#Code reference: https://github.com/woosa7/feature-engineering-book/blob/master/Chapter04_The_Effects_of_Feature_Scaling.ipynb 
#Additional reference: https://wikidocs.net/book/2155

# Chapter 4. The Effects of Feature Scaling: From Bag-of-Words to Tf-Idf

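The goal is to find a representation that emphasizes the "meaningful" words in text data.<br>
This chapter shows how Tf-Idf assigns an importance weight to each word, then runs a simple text classification task with several feature representations and compares the results.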

## 1. Tf-Idf(<i>Term frequency-Inverse document frequency</i>) : A Simple Twist on Bag-of-Words
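- A variation on bag-of-words that uses a normalized count instead of the raw word count
- In its simplest form: (count of a word in a given document) × N / (number of documents in which that word appears), where N is the total number of documents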
<br>
<img src="img/figure4-1.png" width="500" height="500">

- Instead of the plain inverse of the number of documents containing the word, a log transform of that ratio can be used
<br>
<img src="img/figure4-2.png" width="500" height="500">

- Tf-Idf treats a word that appears often across all documents as unimportant and a word that appears often in only a few documents as important: a small Tf-Idf value means low importance, a large one means high importance
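
The two weighting schemes described above can be worked through by hand on a tiny corpus. The sketch below is only an illustration in plain Python: the three toy documents and the helper names (`bow`, `doc_freq`, `tfidf_inverse`, `tfidf_log`) are made up for this example, and it assumes N is the total number of documents. Library implementations may differ in details such as smoothing.

```python
import math

# Toy corpus (made up for illustration): each document is a list of tokens.
docs = [
    ["the", "pizza", "was", "great"],
    ["the", "bar", "was", "loud"],
    ["the", "pizza", "the", "pizza"],
]
n_docs = len(docs)  # N, the total number of documents


def bow(word, doc):
    """Raw count of `word` in one document."""
    return doc.count(word)


def doc_freq(word):
    """Number of documents that contain `word` at least once."""
    return sum(1 for doc in docs if word in doc)


def tfidf_inverse(word, doc):
    """Count scaled by the plain inverse document frequency N / df."""
    return bow(word, doc) * n_docs / doc_freq(word)


def tfidf_log(word, doc):
    """Count scaled by log(N / df); a word found in every document gets 0."""
    return bow(word, doc) * math.log(n_docs / doc_freq(word))


for word in ["the", "pizza", "bar"]:
    print(word, [round(tfidf_log(word, d), 3) for d in docs])
```

With the log variant, "the" appears in every document and is zeroed out everywhere, while "bar", which appears in a single document, gets the largest weight per occurrence.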

## 2. Putting It to the Test
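- Tf-Idf multiplies each word-count feature by a constant (one constant per word), so it can be viewed as a form of feature scaling
- Compare the performance of the raw word-count features and the scaled features on a simple text classification task
- Example: given the review text, classify whether the business is in the Nightlife or the Restaurants category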

In [1]:
import json
import numpy as np
import pandas as pd
from sklearn.feature_extraction import text
from sklearn.linear_model import LogisticRegression
import sklearn.model_selection as modsel
import sklearn.preprocessing as preproc

In [2]:
# Load the Yelp business data (one JSON object per line)
with open('C:\\Users\\이인주\\Desktop\\2020 summer study\\feature_engineering\\data\\yelp_academic_dataset_business.json', 'r', encoding='UTF-8') as biz_f:
    biz_df = pd.DataFrame([json.loads(x) for x in biz_f])

# Load the Yelp review data (one JSON object per line)
with open('C:\\Users\\이인주\\Desktop\\2020 summer study\\feature_engineering\\data\\yelp_academic_dataset_review.json', 'r', encoding='UTF-8') as review_file:
    review_df = pd.DataFrame([json.loads(x) for x in review_file])

In [3]:
biz_df.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,1SWheh84yJXfytovILXOAQ,Arizona Biltmore Golf Club,2818 E Camino Acequia Drive,Phoenix,AZ,85016,33.522143,-112.018481,3.0,5,0,{'GoodForKids': 'False'},"Golf, Active Life",
1,QXAEGFB4oINsVuTFxEYKFQ,Emerald Chinese Restaurant,30 Eglinton Avenue W,Mississauga,ON,L5R 3E7,43.605499,-79.652289,2.5,128,1,"{'RestaurantsReservations': 'True', 'GoodForMe...","Specialty Food, Restaurants, Dim Sum, Imported...","{'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W..."
2,gnKjwL_1w79qoiV3IC_xQQ,Musashi Japanese Restaurant,"10110 Johnston Rd, Ste 15",Charlotte,NC,28210,35.092564,-80.859132,4.0,170,1,"{'GoodForKids': 'True', 'NoiseLevel': 'u'avera...","Sushi Bars, Restaurants, Japanese","{'Monday': '17:30-21:30', 'Wednesday': '17:30-..."
3,xvX2CttrVhyG2z1dFg_0xw,Farmers Insurance - Paul Lorenz,"15655 W Roosevelt St, Ste 237",Goodyear,AZ,85338,33.455613,-112.395596,5.0,3,1,,"Insurance, Financial Services","{'Monday': '8:0-17:0', 'Tuesday': '8:0-17:0', ..."
4,HhyxOkGAM07SRYtlQ4wMFQ,Queen City Plumbing,"4209 Stuart Andrew Blvd, Ste F",Charlotte,NC,28217,35.190012,-80.887223,4.0,4,1,"{'BusinessAcceptsBitcoin': 'False', 'ByAppoint...","Plumbing, Shopping, Local Services, Home Servi...","{'Monday': '7:0-23:0', 'Tuesday': '7:0-23:0', ..."


In [4]:
# Drop the rows whose categories value is missing (None)
biz_df.dropna(subset=['categories'], inplace=True)


In [5]:
# Keep only the Nightlife and Restaurants businesses in a new DataFrame called two_biz
two_biz = biz_df[biz_df.apply(lambda x: 'Nightlife' in x['categories'] or 'Restaurants' in x['categories'], axis=1)]

In [6]:
# Merge the business DataFrame and the review DataFrame into a single DataFrame
twobiz_reviews = two_biz.merge(review_df, on='business_id', how='inner') # on='business_id': join on that column; how='inner': keep only keys present in both
twobiz_reviews.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars_x,review_count,...,categories,hours,review_id,user_id,stars_y,useful,funny,cool,text,date
0,QXAEGFB4oINsVuTFxEYKFQ,Emerald Chinese Restaurant,30 Eglinton Avenue W,Mississauga,ON,L5R 3E7,43.605499,-79.652289,2.5,128,...,"Specialty Food, Restaurants, Dim Sum, Imported...","{'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W...",6W0MQHmasK0IsaoDo4bmkw,2K62MJ4CJ19L8Tp5pRfjfQ,3.0,3,2,0,My girlfriend and I went for dinner at Emerald...,2017-01-27 21:54:30
1,QXAEGFB4oINsVuTFxEYKFQ,Emerald Chinese Restaurant,30 Eglinton Avenue W,Mississauga,ON,L5R 3E7,43.605499,-79.652289,2.5,128,...,"Specialty Food, Restaurants, Dim Sum, Imported...","{'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W...",A1D2kUnZ0HTroFreAheNSg,SuOLY03LW5ZcnynKhbTydA,3.0,0,0,0,"***No automatic doors, not baby friendly!*** I...",2016-01-04 12:59:22
2,QXAEGFB4oINsVuTFxEYKFQ,Emerald Chinese Restaurant,30 Eglinton Avenue W,Mississauga,ON,L5R 3E7,43.605499,-79.652289,2.5,128,...,"Specialty Food, Restaurants, Dim Sum, Imported...","{'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W...",_Hr9z8pJ5nZSf7FS1O8ujw,R-xGsTpwlwuOe_vAbg_aeA,2.0,2,0,1,"Despite the poor service here, my family comes...",2015-02-24 04:32:58
3,QXAEGFB4oINsVuTFxEYKFQ,Emerald Chinese Restaurant,30 Eglinton Avenue W,Mississauga,ON,L5R 3E7,43.605499,-79.652289,2.5,128,...,"Specialty Food, Restaurants, Dim Sum, Imported...","{'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W...",4Sg_ziTcrYlGO0dVyj2V3g,agqWketq-FhYwVmRyli4jA,1.0,2,0,0,I went at 230 on a Monday. It was dimsum \n\nI...,2017-01-02 20:32:29
4,QXAEGFB4oINsVuTFxEYKFQ,Emerald Chinese Restaurant,30 Eglinton Avenue W,Mississauga,ON,L5R 3E7,43.605499,-79.652289,2.5,128,...,"Specialty Food, Restaurants, Dim Sum, Imported...","{'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W...",BeeBfUxvzD4qNX4HxrgA5g,A0kENtCCoVT3m7T35zb2Vg,3.0,0,0,0,We've always been there on a Sunday so we were...,2013-06-24 23:11:30


In [7]:
# Select only the features that will be used, dropping the rest
twobiz_reviews = twobiz_reviews[['business_id', 
                                 'name', 
                                 'stars_y', 
                                 'text', 
                                 'categories']]
twobiz_reviews.head()

Unnamed: 0,business_id,name,stars_y,text,categories
0,QXAEGFB4oINsVuTFxEYKFQ,Emerald Chinese Restaurant,3.0,My girlfriend and I went for dinner at Emerald...,"Specialty Food, Restaurants, Dim Sum, Imported..."
1,QXAEGFB4oINsVuTFxEYKFQ,Emerald Chinese Restaurant,3.0,"***No automatic doors, not baby friendly!*** I...","Specialty Food, Restaurants, Dim Sum, Imported..."
2,QXAEGFB4oINsVuTFxEYKFQ,Emerald Chinese Restaurant,2.0,"Despite the poor service here, my family comes...","Specialty Food, Restaurants, Dim Sum, Imported..."
3,QXAEGFB4oINsVuTFxEYKFQ,Emerald Chinese Restaurant,1.0,I went at 230 on a Monday. It was dimsum \n\nI...,"Specialty Food, Restaurants, Dim Sum, Imported..."
4,QXAEGFB4oINsVuTFxEYKFQ,Emerald Chinese Restaurant,3.0,We've always been there on a Sunday so we were...,"Specialty Food, Restaurants, Dim Sum, Imported..."


In [8]:
# Create a column named 'target': True for a Nightlife business, False otherwise
# A business whose categories contain both classes is labeled True (Nightlife)
twobiz_reviews['target'] = twobiz_reviews.apply(lambda x: 'Nightlife' in x['categories'], axis=1)
twobiz_reviews.head()

Unnamed: 0,business_id,name,stars_y,text,categories,target
0,QXAEGFB4oINsVuTFxEYKFQ,Emerald Chinese Restaurant,3.0,My girlfriend and I went for dinner at Emerald...,"Specialty Food, Restaurants, Dim Sum, Imported...",False
1,QXAEGFB4oINsVuTFxEYKFQ,Emerald Chinese Restaurant,3.0,"***No automatic doors, not baby friendly!*** I...","Specialty Food, Restaurants, Dim Sum, Imported...",False
2,QXAEGFB4oINsVuTFxEYKFQ,Emerald Chinese Restaurant,2.0,"Despite the poor service here, my family comes...","Specialty Food, Restaurants, Dim Sum, Imported...",False
3,QXAEGFB4oINsVuTFxEYKFQ,Emerald Chinese Restaurant,1.0,I went at 230 on a Monday. It was dimsum \n\nI...,"Specialty Food, Restaurants, Dim Sum, Imported...",False
4,QXAEGFB4oINsVuTFxEYKFQ,Emerald Chinese Restaurant,3.0,We've always been there on a Sunday so we were...,"Specialty Food, Restaurants, Dim Sum, Imported...",False


### 2.1 Creating a Classification Dataset 
- Use the reviews to classify a business as Nightlife or Restaurants
- The two classes have different numbers of reviews (a class-imbalanced dataset), so the larger class (Restaurants) is downsampled to roughly match the smaller class (Nightlife)
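
The downsampling idea can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's method: the cells below pick sampling fractions by hand with `DataFrame.sample(frac=...)`, whereas this sketch matches the minority count exactly. The function name `downsample_to_balance` and the toy review strings are hypothetical.

```python
import random


def downsample_to_balance(majority, minority, seed=123):
    """Randomly keep only as many majority items as there are minority items."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(majority, k=len(minority)), list(minority)


# Hypothetical toy data: 10 "Restaurants" reviews vs. 4 "Nightlife" reviews.
restaurant_reviews = [f"restaurant review {i}" for i in range(10)]
nightlife_reviews = [f"nightlife review {i}" for i in range(4)]

rest_subset, night_subset = downsample_to_balance(restaurant_reviews, nightlife_reviews)
print(len(rest_subset), len(night_subset))  # 4 4
```

With pandas, the same effect could be obtained by passing `n=` instead of `frac=` to `DataFrame.sample`, for example sampling `n=len(nightlife_subset)` rows from the larger class.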

In [11]:
nightlife = twobiz_reviews[twobiz_reviews.target == True] # target == True: Nightlife
restaurants = twobiz_reviews[twobiz_reviews.target == False] # target == False: Restaurants

In [47]:
# Draw the sampled subsets used for modeling
# The fractions (frac) are chosen so that the two classes end up with roughly the same number of samples
nightlife_subset = nightlife.sample(frac=0.03, random_state=123) # random sample of 3% of all Nightlife reviews
restaurant_subset = restaurants.sample(frac=0.011, random_state=123) # random sample of 1.1% of all Restaurants reviews
print('nightlife:', nightlife_subset.shape)
print('restaurant:', restaurant_subset.shape)

nightlife: (42129, 6)
restaurant: (41129, 6)


In [13]:
# Combine the sampled data from the two classes
combined = pd.concat([nightlife_subset, restaurant_subset])
combined.head()

Unnamed: 0,business_id,name,stars_y,text,categories,target
3831709,6tSvz_21BMo3a4GaItwa0g,Sushi Ya Las Vegas,5.0,"If the food doesn't wow you the upbeat and ""su...","Restaurants, Sushi Bars, Karaoke, Japanese, No...",True
619608,64QldpwnBOWbv25w2wngmg,Encore Lobby Bar,5.0,Wow. Wow. Wow.\n\nThis little cafe was the per...,"Nightlife, Lounges, Bars",True
4247794,NoF90rswXBHESSyDaWeKKA,Proper Brick Oven & Tap Room,4.0,Very good pizza and great service. Nice select...,"Italian, Food, Beer, Wine & Spirits, Nightlife...",True
4984015,xudgMGJvmXIljlxZmeYODg,Cobra Arcade Bar,1.0,This place frequently overlooks sexual harassm...,"Arts & Entertainment, Bars, Nightlife, Arcades",True
4240197,ewAmzOqnSAfLBdt4Stc8bA,McLean's Pub,5.0,"While on our trip to Montreal, Candace and I a...","Pubs, Food, Canadian (New), Sports Bars, Night...",True


In [15]:
# Split into a training set and a test set
training_data, test_data = modsel.train_test_split(combined, test_size=0.3, random_state=123) # 70% training, 30% test
print(training_data.shape)
print(test_data.shape)

(58280, 6)
(24978, 6)


### 2.2 Scaling Bag-of-Words with Tf-Idf Transformation
- Build three feature sets: bag-of-words, Tf-Idf, and the ℓ2-normalized bag-of-words representation
- Define X and y for the linear classifier: X (review text features) → y (True for Nightlife, False for Restaurants)
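
To make the ℓ2 normalization concrete, here is a small plain-Python sketch of what rescaling each feature column to unit ℓ2 norm does, which is the behavior `preproc.normalize(X, axis=0)` is used for in the cells below. The function name `l2_normalize_columns` and the toy count matrix are assumptions for illustration only. Like Tf-Idf, this amounts to multiplying each column by a constant.

```python
import math


def l2_normalize_columns(matrix):
    """Divide every column by its Euclidean (l2) norm; all-zero columns stay zero."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    norms = [
        math.sqrt(sum(matrix[r][c] ** 2 for r in range(n_rows)))
        for c in range(n_cols)
    ]
    return [
        [matrix[r][c] / norms[c] if norms[c] > 0 else 0.0 for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Toy document-term count matrix: rows are documents, columns are words.
counts = [
    [3, 0, 1],
    [4, 0, 1],
    [0, 0, 2],
]
normalized = l2_normalize_columns(counts)
for row in normalized:
    print([round(v, 3) for v in row])
```

The first column has norm 5, so its entries become 0.6, 0.8 and 0; the second column is all zeros and is left untouched rather than divided by zero.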

In [17]:
# Convert the review text to bag-of-words
bow_transform = text.CountVectorizer()

X_tr_bow = bow_transform.fit_transform(training_data['text'])
X_te_bow = bow_transform.transform(test_data['text'])

In [18]:
# Convert the bag-of-words counts to Tf-Idf
tfidf_trfm = text.TfidfTransformer(norm=None)

X_tr_tfidf = tfidf_trfm.fit_transform(X_tr_bow)
X_te_tfidf = tfidf_trfm.transform(X_te_bow)

In [19]:
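# ℓ2-normalize the bag-of-words features
# axis=0 rescales each feature (column) to unit ℓ2 norm; the training and test matrices are normalized independently here
X_tr_l2 = preproc.normalize(X_tr_bow, axis=0)
X_te_l2 = preproc.normalize(X_te_bow, axis=0)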

In [20]:
y_tr = training_data['target']
y_te = test_data['target']

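### 2.3 Classification with Logistic Regression
- Logistic regression is a simple linear classifier, which makes it a good first thing to try
- Logistic regression passes a weighted combination of the input features through the sigmoid function, which maps its input to a value between 0 and 1
- <i>w</i> is the slope of the curve around the point where the function value is 0.5, and <i>b</i> is the intercept, which shifts where the function value reaches 0.5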
<br>
<img src="img/figure4-3.png" width="500" height="500">
<center>(Feature Engineering for Machine Learning by Alice Zheng and Amanda Casari (O’Reilly))</center> 

- If the sigmoid output is greater than 0.5 the data point is predicted as the positive class; if it is smaller, as the negative class
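
The sigmoid and the 0.5 decision threshold can be sketched for a single scalar feature as follows. This is a minimal illustration: the helper names `sigmoid` and `predict` and the values `w=2.0`, `b=-1.0` are hypothetical, chosen so that the output crosses 0.5 at x = 0.5.

```python
import math


def sigmoid(z):
    """Map any real number to the interval [0, 1] without overflowing."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # z < 0, so exp(z) cannot overflow
    return e / (1.0 + e)


def predict(x, w, b, threshold=0.5):
    """Return (probability, label) for one scalar feature x."""
    p = sigmoid(w * x + b)
    return p, p > threshold


w, b = 2.0, -1.0  # hypothetical parameters
for x in [-1.0, 0.5, 2.0]:
    p, label = predict(x, w, b)
    print(x, round(p, 3), label)
```

Inputs to the right of the midpoint (here x > 0.5) are assigned to the positive class, and inputs to the left to the negative class.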

In [21]:
# Train a logistic regression classifier on the given features (named by description), measure its accuracy on the test data, and return the model
def simple_logistic_classify(X_tr, y_tr, X_test, y_test, description, _C=1.0):
    m = LogisticRegression(C=_C).fit(X_tr, y_tr) # train the logistic regression classifier
    s = m.score(X_test, y_test) # mean accuracy on the test data
    print('Test score with', description, 'features:', s)
    return m

In [22]:
m1 = simple_logistic_classify(X_tr_bow, y_tr, X_te_bow, y_te, 'bow')
m2 = simple_logistic_classify(X_tr_l2, y_tr, X_te_l2, y_te, 'l2-normalized')
m3 = simple_logistic_classify(X_tr_tfidf, y_tr, X_te_tfidf, y_te, 'tf-idf')

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Test score with bow features: 0.735767475378333
Test score with l2-normalized features: 0.7541836816398431
Test score with tf-idf features: 0.7101048923052286


### 2.4 Tuning Logistic Regression with Regularization 
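- A classifier that is not properly tuned may give poor classification results
- When there are more features than data points, the problem of finding the best model is said to be 'underdetermined'; resolving it requires placing additional constraints on the training process, which is what regularization does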
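
As a point of reference for the `C` parameter of `LogisticRegression` used in this notebook, an ℓ2-regularized logistic regression objective can be written in the following form. This is a sketch of the common formulation, assuming labels coded as ±1 and using the same <i>w</i> and <i>b</i> as above; exact scaling conventions can differ between library versions.

$$
\min_{w,\,b}\;\; \frac{1}{2}\lVert w\rVert_2^2 \;+\; C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\,(w^\top x_i + b)}\right), \qquad y_i \in \{-1, +1\}
$$

The first term is the added constraint on the weights and the second is the data-fitting loss. `C` is the inverse of the regularization strength: a small `C` lets the penalty dominate (strong regularization), while a large `C` lets the model fit the training data more closely.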

##### Regularization 
- The strength of regularization is set by a regularization parameter, which is a hyperparameter: it is not learned during training and has to be tuned separately for the given problem and learning algorithm
- The most basic way to tune a hyperparameter is grid search: specify a grid of hyperparameter values and let the tuner programmatically find the best one
- <i>k</i>-fold cross validation is also used to check that differences in classification results between feature sets are not just due to noise
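
The mechanics of grid search with <i>k</i>-fold cross validation can be sketched without any library. The example below is only an illustration of the loop structure. It swaps the logistic regression for a trivially simple "shrunken mean" predictor, uses contiguous folds (scikit-learn's splitter may behave differently, for example by stratifying classes), and all names (`k_fold_indices`, `cv_score`, `grid_search`) and the toy data are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) for k contiguous, near-equal folds."""
    start = 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        valid = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


def cv_score(y, shrink, k):
    """Mean validation score (negative squared error) of a shrunken-mean predictor."""
    fold_scores = []
    for train, valid in k_fold_indices(len(y), k):
        # "Fit": training mean pulled toward 0; larger `shrink` = stronger regularization.
        pred = sum(y[i] for i in train) / len(train) / (1.0 + shrink)
        mse = sum((y[i] - pred) ** 2 for i in valid) / len(valid)
        fold_scores.append(-mse)
    return sum(fold_scores) / k


def grid_search(y, grid, k=5):
    """Score every grid value with k-fold CV; return (best_value, mean scores)."""
    mean_scores = {value: cv_score(y, value, k) for value in grid}
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores


# Hypothetical toy targets and grid, chosen only to exercise the loop.
y = [2.1, 1.9, 2.4, 1.7, 2.0, 2.3, 1.8, 2.2, 2.0, 1.6]
best_shrink, scores = grid_search(y, grid=[0.0, 0.1, 1.0, 10.0], k=5)
print(best_shrink, scores[best_shrink])
```

Each grid value is scored on every held-out fold and the scores are averaged, so the chosen value reflects performance across several splits rather than a single lucky one. This is the same idea behind `best_score_` and `best_params_` in the cells below.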

In [23]:
# Specify the grid over which to search for the best hyperparameter
param_grid_ = {'C': [1e-5, 1e-3, 1e-1, 1e0, 1e1, 1e2]}

In [25]:
# Run grid search with 5-fold cross validation on the bag-of-words features
bow_search = modsel.GridSearchCV(LogisticRegression(), cv=5, param_grid=param_grid_)
bow_search.fit(X_tr_bow, y_tr)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [1e-05, 0.001, 0.1, 1.0, 10.0, 100.0]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [27]:
# Run grid search with 5-fold cross validation on the ℓ2-normalized features
l2_search = modsel.GridSearchCV(LogisticRegression(), cv=5, param_grid=param_grid_)
l2_search.fit(X_tr_l2, y_tr)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [1e-05, 0.001, 0.1, 1.0, 10.0, 100.0]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [26]:
# Run grid search with 5-fold cross validation on the Tf-Idf features
tfidf_search = modsel.GridSearchCV(LogisticRegression(), cv=5, param_grid=param_grid_)
tfidf_search.fit(X_tr_tfidf, y_tr)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [1e-05, 0.001, 0.1, 1.0, 10.0, 100.0]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [33]:
# Full results of the 5-fold cross-validated grid search on bag-of-words, followed by the mean test score for each value of C
print(bow_search.cv_results_)
print(bow_search.cv_results_['mean_test_score'])

{'mean_fit_time': array([0.54374614, 1.17725973, 2.79752202, 3.14838452, 3.01712837,
       2.98661704]), 'std_fit_time': array([0.03924144, 0.08986535, 0.06104691, 0.11092474, 0.11309796,
       0.11551533]), 'mean_score_time': array([0.0079792 , 0.00678425, 0.0069808 , 0.00797844, 0.00738063,
       0.00798478]), 'std_score_time': array([0.00126109, 0.00074962, 0.00063098, 0.00166858, 0.00119658,
       0.00088552]), 'param_C': masked_array(data=[1e-05, 0.001, 0.1, 1.0, 10.0, 100.0],
             mask=[False, False, False, False, False, False],
       fill_value='?',
            dtype=object), 'params': [{'C': 1e-05}, {'C': 0.001}, {'C': 0.1}, {'C': 1.0}, {'C': 10.0}, {'C': 100.0}], 'split0_test_score': array([0.61496225, 0.7288092 , 0.73550103, 0.72383322, 0.72031572,
       0.72537749]), 'split1_test_score': array([0.60295127, 0.72040151, 0.74047701, 0.72983871, 0.72554907,
       0.72220316]), 'split2_test_score': array([0.62010981, 0.73695951, 0.74553878, 0.73215511, 0.72889499,


In [36]:
# Best cross-validated score with the bag-of-words features and the hyperparameter that achieved it
print(bow_search.best_score_)
print(bow_search.best_params_)

0.7413177762525739
{'C': 0.1}


In [37]:
# Best cross-validated score with the ℓ2-normalized features and the hyperparameter that achieved it
print(l2_search.best_score_)
print(l2_search.best_params_)

0.7457961564859299
{'C': 1.0}


In [38]:
# Best cross-validated score with the Tf-Idf features and the hyperparameter that achieved it
print(tfidf_search.best_score_)
print(tfidf_search.best_params_)

0.7484557309540152
{'C': 0.001}


In [39]:
# Retrain on the full training set with the best hyperparameter for each feature set
m1 = simple_logistic_classify(X_tr_bow, y_tr, X_te_bow, y_te, 'bow', 
                              _C=bow_search.best_params_['C'])
m2 = simple_logistic_classify(X_tr_l2, y_tr, X_te_l2, y_te, 'l2-normalized', 
                              _C=l2_search.best_params_['C'])
m3 = simple_logistic_classify(X_tr_tfidf, y_tr, X_te_tfidf, y_te, 'tf-idf', 
                              _C=tfidf_search.best_params_['C'])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Test score with bow features: 0.7457762831291537
Test score with l2-normalized features: 0.7541836816398431
Test score with tf-idf features: 0.7507406517735608
