*Last edit by DLao - 2019*

![](https://oec2solutions.com/wp-content/uploads/2016/12/assglb-700x580.png)


# Kaggle reference: https://www.kaggle.com/datasnaek/mbti-type/data

# Context
The Myers Briggs Type Indicator (or MBTI for short) is a personality type system that divides everyone into 16 distinct personality types across four axes:

- Introversion (I) – Extroversion (E)
- Intuition (N) – Sensing (S)
- Thinking (T) – Feeling (F)
- Judging (J) – Perceiving (P)

(More can be learned about what these mean here)

So for example, someone who prefers introversion, intuition, thinking and perceiving would be labelled an INTP in the MBTI system, and there are lots of personality based components that would model or describe this person’s preferences or behaviour based on the label.
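
The way the four axes combine into 16 codes can be sketched in a few lines. This is only an illustration: the names `AXES`, `ALL_TYPES` and `type_to_axes` are assumptions made for this example and are not part of the dataset or of the notebook below.

```python
from itertools import product

# Each axis contributes exactly one letter to the four-letter code.
AXES = [("I", "E"), ("N", "S"), ("T", "F"), ("J", "P")]

# The Cartesian product of the four axes yields every possible type code.
ALL_TYPES = ["".join(letters) for letters in product(*AXES)]


def type_to_axes(mbti_type):
    """Map a code such as 'INTP' to one binary flag per axis (1 = first letter of that axis)."""
    mbti_type = mbti_type.upper()
    if mbti_type not in ALL_TYPES:
        raise ValueError(f"not an MBTI code: {mbti_type!r}")
    return [int(letter == first) for letter, (first, _) in zip(mbti_type, AXES)]


print(len(ALL_TYPES))        # 16
print(type_to_axes("INTP"))  # [1, 1, 1, 0]
```

Splitting a label into four binary flags like this is one possible way to turn the 16-class problem into four two-class problems.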

It is one of the most popular personality tests in the world, if not the most popular. It is used in businesses, online, for fun, for research and lots more. A simple Google search reveals all of the different ways the test has been used over time. It’s safe to say that this test is still very relevant in the world in terms of its use.

From a scientific or psychological perspective, it is based on the work done on cognitive functions by Carl Jung, i.e. Jungian typology. This was a model of 8 distinct functions, thought processes or ways of thinking that were suggested to be present in the mind. Later this work was transformed into several different personality systems to make it more accessible, the most popular of which is of course the MBTI.

Recently, its use/validity has come into question because of unreliability in experiments surrounding it, among other reasons. But it is still clung to as being a very useful tool in a lot of areas, and the purpose of this dataset is to help see if any patterns can be detected in specific types and their style of writing, which overall explores the validity of the test in analysing, predicting or categorising behaviour.

# Content
This dataset contains over 8600 rows of data. Each row holds a person’s:

- Type (this person’s 4-letter MBTI code/type)
- A section of each of the last 50 things they have posted (each entry separated by "|||", i.e. 3 pipe characters)
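
Because every row packs up to 50 posts into one string, a first step is usually to split that string on the separator. A minimal sketch, assuming the field is a plain Python string; `split_posts` and the sample text are made up for illustration:

```python
def split_posts(posts_field, sep="|||"):
    """Split one raw `posts` cell into a list of non-empty individual posts."""
    return [post.strip() for post in posts_field.split(sep) if post.strip()]


# Made-up sample row, not taken from the dataset.
sample = "I love hiking.|||What about you?|||http://example.com/video"
posts = split_posts(sample)
print(len(posts))  # 3
print(posts[1])    # What about you?
```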

# Acknowledgements
This data was collected through the PersonalityCafe forum, as it provides a large selection of people and their MBTI personality type, as well as what they have written.

# Inspiration
Some basic uses could include:

- Use machine learning to evaluate the MBTI’s validity and ability to predict language styles and behaviour online.
- Produce a machine learning algorithm that can attempt to determine a person’s personality type based on some text they have written.

# Travel attraction recommendation system based on personal MBTI type

# Site name: 트리바보 (meaning "you'd be a fool not to use it")

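[step1] Use data crawled from https://www.personalitycafe.com/ (10.4M posts accumulated by 135.2K members), covering the number and content of posts for each personality type.

[step2] Analyze the types and posts, and design a model that predicts the MBTI type from vocabulary, frequency, volume, number of comments and posts, and so on.

[step3] Crawl the star ratings and English reviews of Seoul tourist attractions from each travel review site.

[step4] Apply [the model from step2] to [the crawled data from step3] to predict which MBTI types of traveler prefer each Seoul tourist attraction.

[step5] Build an HTML site: when a traveler enters their MBTI type, the travel sites preferred by people of the same MBTI type are shown on a map using plotnine, seaborn, etc.

[step6] Implement a travel-related payment site.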
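
The aggregation in [step4] could look roughly like the sketch below. It assumes that "preference" is measured as the mean star rating given by reviewers whose predicted type matches. This definition, the function name `mean_score_by_type` and the sample rows are assumptions for illustration, not something the plan above specifies.

```python
from collections import defaultdict


def mean_score_by_type(reviews):
    """reviews: iterable of (attraction, predicted_type, score) tuples.

    Returns {predicted_type: [(attraction, mean_score), ...]} with the best-rated attraction first.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (type, attraction) -> [sum of scores, count]
    for attraction, mbti_type, score in reviews:
        entry = totals[(mbti_type, attraction)]
        entry[0] += float(score)
        entry[1] += 1

    ranking = defaultdict(list)
    for (mbti_type, attraction), (total, count) in totals.items():
        ranking[mbti_type].append((attraction, total / count))
    for mbti_type in ranking:
        # Highest mean score first; ties are broken alphabetically by attraction name.
        ranking[mbti_type].sort(key=lambda item: (-item[1], item[0]))
    return dict(ranking)


# Hypothetical predictions and scores, for illustration only.
sample_reviews = [
    ("Myeong Dong", "ENFP", 5.0),
    ("Myeong Dong", "ENFP", 4.0),
    ("N Seoul Tower", "ENFP", 5.0),
    ("N Seoul Tower", "INTJ", 3.0),
]
print(mean_score_by_type(sample_reviews)["ENFP"])
# [('N Seoul Tower', 5.0), ('Myeong Dong', 4.5)]
```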


In [None]:
#!pip install pandas_profiling
# Load the required modules

import pandas as pd
import numpy as np


import pandas_profiling as pp

import matplotlib
import matplotlib.pyplot as plt
from matplotlib import font_manager, rc

from sklearn.metrics import confusion_matrix
%matplotlib inline


import scipy as sp
import seaborn as sns
from pandas import Series, DataFrame

import platform
sns.set(style='whitegrid', palette='muted')

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split  
from sklearn.decomposition import PCA
from imblearn.combine import SMOTETomek
from sklearn.neural_network import MLPClassifier 
from sklearn.ensemble import VotingClassifier
from sklearn.feature_selection import SelectFromModel # select the important features
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from imblearn.under_sampling import NearMiss
from sklearn.metrics import accuracy_score

from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier

from sklearn.model_selection import KFold #for K-fold cross validation
from sklearn.model_selection import cross_val_score #score evaluation
from sklearn.model_selection import cross_val_predict #prediction

from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import NearMiss
import imblearn
from imblearn.over_sampling import RandomOverSampler

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

import warnings
warnings.filterwarnings('ignore')
import missingno as msno

# [step1] 
- Build a prediction model from the CSV of crawled posts written by each MBTI type

In [None]:
# Adjust the path below so that it points to your downloaded mbti_1.csv.

In [None]:
df = pd.read_csv('C:/Users/TJ/Downloads/W-master/W-master/mbti_1.csv')
df.head()

In [None]:
cols = ['type']
df[cols] = df[cols].apply(lambda x: x.astype('category').cat.codes)
df
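
What the integer codes mean can be sketched in plain Python. This assumes string labels without missing values, for which pandas (to my understanding) uses the sorted unique values as categories; `category_codes` is a hypothetical helper written for this illustration.

```python
def category_codes(values):
    """Mimic `astype('category').cat.codes` for string labels without missing values.

    Categories are the sorted unique values; each value maps to its index in that list.
    """
    categories = sorted(set(values))
    lookup = {label: code for code, label in enumerate(categories)}
    return [lookup[v] for v in values], categories


codes, categories = category_codes(["INFJ", "ENTP", "INTP", "INFJ"])
print(codes)       # [1, 0, 2, 1]
print(categories)  # ['ENTP', 'INFJ', 'INTP']
```

Keeping the `categories` list around makes it possible to decode a predicted code back into its four-letter label later.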

In [None]:
def var_row(row):
    l = []
    for i in row.split('.'):
        l.append(len(i.split()))
    return np.var(l)

df['words_per_comment'] = df['posts'].apply(lambda x: len(x.split())/50)
df['variance_of_word_counts'] = df['posts'].apply(lambda x: var_row(x))
df.head()
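
A small worked example may help in reading these two features. The sketch re-implements them without pandas or NumPy on a made-up string; the helper names and the `n_comments` parameter are assumptions for illustration. Note that the variance is taken over segments split on `.`, not over posts, and that `|||` without surrounding spaces glues two words into one token.

```python
def population_variance(values):
    """Population variance (divide by n), which is what np.var computes by default."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def words_per_comment(posts_field, n_comments=50):
    """Whitespace-separated tokens divided by an assumed number of comments."""
    return len(posts_field.split()) / n_comments


def variance_of_word_counts(posts_field):
    """Variance of the word counts of the '.'-separated segments."""
    counts = [len(segment.split()) for segment in posts_field.split(".")]
    return population_variance(counts)


# Made-up sample with two posts; segments are 'Good morning to everyone' and ' Hello there|||Bye'.
sample = "Good morning to everyone. Hello there|||Bye"
print(words_per_comment(sample, n_comments=2))  # 3.0
print(variance_of_word_counts(sample))          # 1.0
```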

In [None]:
plt.figure(figsize=(15,10))
# pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.swarmplot(x="type", y="words_per_comment", data=df)

In [None]:
df.groupby('type').agg({'type':'count'})
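
The per-type counts also give a useful reference point for the classifiers trained later: the accuracy obtained by always predicting the most frequent type. A minimal sketch with made-up labels, where `majority_baseline` is a hypothetical helper:

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most common label, accuracy obtained by always predicting it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Made-up label distribution, not the real dataset counts.
sample_labels = ["INFP"] * 5 + ["INFJ"] * 3 + ["ESTJ"] * 2
print(majority_baseline(sample_labels))  # ('INFP', 0.5)
```

A model whose test accuracy does not clearly exceed this number has learned little beyond the class imbalance.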

In [None]:
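# 'type' now holds integer category codes, so the ESFJ, ESFP, ESTJ and ESTP rows are excluded by code.
# Assumption: all 16 types occur in the data, so their alphabetical codes are 4, 5, 6 and 7.
df_2 = df[~df['type'].isin([4, 5, 6, 7])].copy()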
df_2['http_per_comment'] = df_2['posts'].apply(lambda x: x.count('http')/50)
df_2['qm_per_comment'] = df_2['posts'].apply(lambda x: x.count('?')/50)
df_2.head()
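
Dividing by a fixed 50 assumes that every row really contains 50 posts. A hedged alternative is to divide by the number of posts actually present in the row; `rate_per_post` and the sample string below are assumptions for illustration. Counting the substring `http` also matches `https`, which is usually what is wanted here.

```python
def rate_per_post(posts_field, token, sep="|||"):
    """Occurrences of `token` divided by the number of posts found in the field."""
    n_posts = len(posts_field.split(sep))
    return posts_field.count(token) / n_posts


# Made-up sample row with four posts.
sample_row = "see https://example.com|||really? why?|||ok|||fine"
print(rate_per_post(sample_row, "http"))  # 0.25
print(rate_per_post(sample_row, "?"))     # 0.5
```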

In [None]:
print(df_2.groupby('type').agg({'http_per_comment': 'mean'}))
print(df_2.groupby('type').agg({'qm_per_comment': 'mean'}))

In [None]:
# jointplot is a figure-level function that creates its own figure, so its size is set with `height`
sns.jointplot(x="variance_of_word_counts", y="words_per_comment", data=df_2, kind="hex", height=10)

In [None]:
def plot_jointplot(mbti_type):
    # jointplot is figure-level: it creates its own figure and accepts no `ax` or `title` argument
    df_3 = df_2[df_2['type'] == mbti_type]
    sns.jointplot(x="variance_of_word_counts", y="words_per_comment", data=df_3, kind="hex")
    plt.suptitle(str(mbti_type))

# one hexbin joint plot per type remaining in df_2
for mbti_type in df_2['type'].unique():
    plot_jointplot(mbti_type)
    


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
df['posts']=df['posts'].astype('category')
df['posts']=df['posts'].cat.codes
df['posts'].value_counts()

In [None]:
mdx=df.drop('type',axis=1)
dfy=df['type']



X_train,X_test,y_train,y_test = train_test_split(mdx,dfy,test_size=.25,random_state=0)

In [None]:
print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)

In [None]:
#!pip install lightgbm
#!pip install imblearn

In [None]:
import lightgbm as gbm 

In [None]:
from lightgbm import LGBMClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score, StratifiedShuffleSplit
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import roc_auc_score
from sklearn.dummy import DummyClassifier
import matplotlib.pyplot as plt
from matplotlib import font_manager, rc
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from imblearn.under_sampling import *
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [None]:
svm = SVC(random_state=0,C=100)
svm.fit(X_train, y_train)

display(svm.score(X_train, y_train))
display(svm.score(X_test, y_test))

In [None]:
mlp = MLPClassifier()
mlp.fit(X_train, y_train)

display(mlp.score(X_train, y_train))
display(mlp.score(X_test, y_test))

In [None]:
lr = LogisticRegression()
lr.fit(X_train, y_train)

display(lr.score(X_train, y_train))
display(lr.score(X_test, y_test))

In [None]:
clf=KNeighborsClassifier(n_neighbors=200)
clf.fit(X_train,y_train).score(X_test, y_test)

display(clf.score(X_train, y_train))
display(clf.score(X_test, y_test))

In [None]:
tree = DecisionTreeClassifier()
tree.fit(X_train,y_train).score(X_test, y_test)

display(tree.score(X_train, y_train))
display(tree.score(X_test, y_test))

In [None]:
gboost = GradientBoostingClassifier()
gboost.fit(X_train,y_train).score(X_test, y_test)

display(gboost.score(X_train, y_train))
display(gboost.score(X_test, y_test))

In [None]:
voting = VotingClassifier(
    estimators = [('svc', svm), ('mlp', mlp), ('lr', lr), ('clf', clf),('tree', tree),('gboost', gboost)],
    voting = 'hard')


# use a separate loop variable so the KNN model bound to `clf` is not overwritten
for model in (svm, mlp, lr, clf, tree, gboost, voting) :
    model.fit(X_train, y_train)
    print(model.__class__.__name__, 
          accuracy_score(y_test, model.predict(X_test)))
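
Hard voting simply takes, for each sample, the label predicted by most of the individual models. The sketch below shows the idea in plain Python with made-up predictions. Breaking ties in favour of the smallest label is intended to mirror scikit-learn's documented behaviour, but treat that detail as an assumption; `hard_vote` is a hypothetical helper.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """predictions_per_model: list of equal-length prediction lists, one list per model."""
    voted = []
    for sample_votes in zip(*predictions_per_model):
        counts = Counter(sample_votes)
        top = max(counts.values())
        # Among the labels with the most votes, pick the smallest one.
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


# Three hypothetical models predicting three samples.
sample_preds = [[0, 1, 2], [0, 2, 2], [1, 1, 0]]
print(hard_vote(sample_preds))  # [0, 1, 2]
```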

In [None]:
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
pipe.fit(X_train, y_train).score(X_test, y_test)
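
The point of putting the scaler inside the pipeline is that its minimum and maximum are learned from the training data only and then reused on the test data. A minimal sketch of that idea for a single column; the helper names are assumptions for illustration and this is not scikit-learn's implementation.

```python
def fit_min_max(train_column):
    """Learn the scaling range from training values only."""
    return min(train_column), max(train_column)


def apply_min_max(column, lo, hi):
    """Rescale values with a previously learned range; a constant column maps to 0.0."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in column]


lo, hi = fit_min_max([10.0, 20.0, 30.0])
print(apply_min_max([10.0, 20.0, 30.0], lo, hi))  # [0.0, 0.5, 1.0]
print(apply_min_max([40.0], lo, hi))              # [1.5]
```

The second call shows that unseen values can fall outside the [0, 1] range, which is expected when the range comes from the training set alone.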

In [None]:
from sklearn.model_selection import GridSearchCV 
param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation accuracy: {:.2f}".format(
    grid.best_score_))
print("Test set score: {:.2f}".format(grid.score(X_test, y_test)))
print("Best parameters: {}".format(grid.best_params_))
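
To get a feeling for the cost of this search, the grid can be expanded by hand. The sketch below only enumerates the parameter combinations and does not train anything; `expand_grid` is a hypothetical helper and `grid_values` repeats the values used above.

```python
from itertools import product

grid_values = {
    "svm__C": [0.001, 0.01, 0.1, 1, 10, 100],
    "svm__gamma": [0.001, 0.01, 0.1, 1, 10, 100],
}


def expand_grid(grid):
    """Return one dict per parameter combination, in the order of the given lists."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


candidates = expand_grid(grid_values)
n_folds = 5
print(len(candidates))            # 36
print(len(candidates) * n_folds)  # 180 cross-validation fits, before any final refit
```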


# End of model building

In [1]:
#!pip install cssselect       # CSS selectors for picking the elements to crawl out of the page source
#!pip install selenium        # selenium, which performs clicks inside the page
#!pip install lxml            
#!pip install openpyxl         # openpyxl, for processing Excel files
#!pip install bs4              # Beautiful Soup, for an alternative way of crawling

In [2]:
import urllib.parse
import time
import requests
import lxml.html
import pandas as pd
from selenium.webdriver import Chrome
from selenium import webdriver
import requests
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen 

#### 1. Collect the Seoul attraction names and href addresses into a DataFrame

In [3]:
sites=[]
names=[]

base_url = "https://www.trip.com/travel-guide/seoul-234/tourist-attractions/"

# Fetch each listing page once and read href and title from the same <a> tag,
# so that `sites` and `names` stay aligned and have the same length.
for page in range(10):
    try:
        soup = BeautifulSoup(urlopen(base_url + str(page)), "html.parser")
        poi_list = soup.find("div", {"class":"jsx-2566084683 gl-poi-list_list"})

        for link in poi_list.findAll("a"):
            if 'href' in link.attrs and 'title' in link.attrs:
                sites.append(link.attrs['href'])
                names.append(link.attrs['title'])

    except Exception as e:
        # skip pages that fail to load or that do not contain the list container
        print(page, e)

sites
names

['Myeong Dong',
 'N Seoul Tower',
 'Bukchon Hanok Village',
 'Namsan Park',
 'Gyeongbokgung Palace',
 'Cheonggyecheon',
 'Nanshan Cable Car',
 'Ewha Womans University',
 'Sinsa-dong',
 'Cheong Wa Dae',
 'Myeong Dong',
 'N Seoul Tower',
 'Bukchon Hanok Village',
 'Namsan Park',
 'Gyeongbokgung Palace',
 'Cheonggyecheon',
 'Nanshan Cable Car',
 'Ewha Womans University',
 'Sinsa-dong',
 'Cheong Wa Dae',
 'Itaewon',
 'Dongdaemun Design Plaza',
 'Gwanghwamun Square',
 'Samcheong-dong',
 'Insadong',
 'Gwangjang Market',
 'National Folk Museum',
 '63 square',
 'Dongdaemun',
 'COEX Aquarium',
 'Ihwa Mural Village',
 'Deoksugung Palace',
 'Leeum Samsung Museum of Art',
 'Alive 4D Art Museum (Insa-dong Main Branch)',
 'Korea House',
 'S.M.Entertainment',
 'Hongik University',
 'Changdeokgung Palace',
 'Youeido Han River Park',
 'National Palace Museum of Korea',
 'Haneul Park',
 'Seoul City Hall',
 'Seoul Daehangno',
 'War Memorial of Korea',
 'Namsangol Hanok Village',
 'Bongeunsa Temple',
 'Gy

In [1]:
attractions_data = pd.DataFrame({'names': names, 'sites': sites})
attractions_data.head(40)
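
The crawled list above contains repeated attractions, so it can help to drop duplicates before building the table. The sketch below keeps the first occurrence of each name and turns hrefs into absolute URLs. It assumes the hrefs may be relative to `https://www.trip.com`. The helper name is hypothetical, and the second sample path is made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.trip.com"


def unique_attractions(names, sites, base_url=BASE_URL):
    """Return (name, absolute_url) pairs, keeping only the first occurrence of each name."""
    seen = set()
    rows = []
    for name, href in zip(names, sites):
        if name in seen:
            continue
        seen.add(name)
        rows.append((name, urljoin(base_url, href)))
    return rows


sample_names = ["Myeong Dong", "N Seoul Tower", "Myeong Dong"]
sample_sites = [
    "/travel-guide/seoul/myeong-dong-10524255/",
    "/travel-guide/seoul/n-seoul-tower/",  # hypothetical path
    "/travel-guide/seoul/myeong-dong-10524255/",
]
for row in unique_attractions(sample_names, sample_sites):
    print(row)
```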


#### 2. Crawl the star ratings and reviews while paging through the results

In [5]:
##### Enter the URL of the attraction

In [9]:
from selenium.webdriver.common.by import By

browser = webdriver.Chrome('C:/Users/parkminwoo/Desktop/R/chromedriver_win32/chromedriver.exe')
url = 'https://www.trip.com/travel-guide/seoul/myeong-dong-10524255/'
browser.get(url)

score=[]
text=[]

for i in range(10) :
    try:
        # collect the rating and the review text shown on the current page
        stars = browser.find_elements(By.CSS_SELECTOR, 'span.comment_score')
        for star in stars:
            score.append(star.text)
        reviews = browser.find_elements(By.CSS_SELECTOR, 'p.mt10')
        for review in reviews:
            text.append(review.text)

        # move on to the next page of reviews
        nextpage = browser.find_elements(By.CSS_SELECTOR, 'button.btn-next')
        nextpage[0].click()
        time.sleep(1)  # give the next page a moment to load

    except Exception as e:
        # stop when there is no further page or the page layout has changed
        print(e)
        break

In [10]:
data1 = pd.DataFrame({'score': score, 'text': text})
data1

Unnamed: 0,score,text
0,5.0,明洞实在人太多了，站在路上放眼望去都是人。地图上看着小小块儿，但是站在路中间，小路一转弯，还...
1,5.0,去首尔怎么都绕不过明洞，在首尔的5天我和小伙伴总结出了一句话：“出来混总是要回明洞的”。我们...
2,5.0,首尔的购物天堂 各种化妆品店服装店商场咖啡厅餐厅聚集地。喜欢购物不要错过 另外在明洞还看了 ...
3,5.0,小牌子很多，价格实惠，值得每年去一次
4,5.0,很好玩的地方，逛吃购物皆有，很开心!我们一行四人吃了烧烤，购了化妆品，逛了小铺，满载而归啊！...
...,...,...
95,5.0,明洞集吃、喝、玩一体，在这里逛一天也不累。好吃的：姜虎东烤肉、古宫、本粥等，小吃也不少：大肠...
96,5.0,人文风气特别好，手机丢了，被捡到直接交到服务台，很意外的连都完好无损的躺在服务台，景色也不错...
97,5.0,明洞人是真的很多，商业店铺也特别的多，在这里吃到了向往的骨头汤饭，感觉所有的人都会来一两句中...
98,5.0,烤龙虾15000，很新鲜，加的芝士，味道不错！商店很多，没什么可买的。晚上的小吃很多，比白天...
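
[step3] asks for English reviews, while the sample output above is in Chinese, so some language filtering is probably needed before the MBTI model can be applied. A rough sketch, assuming that a review counts as English-like when at least 90% of its characters are ASCII. The threshold and helper names are assumptions, and a proper language detector would be more reliable.

```python
def ascii_ratio(text):
    """Fraction of characters in `text` that are ASCII (0.0 for an empty string)."""
    if not text:
        return 0.0
    return sum(ch.isascii() for ch in text) / len(text)


def keep_english_like(reviews, threshold=0.9):
    """Keep only the reviews whose ASCII ratio reaches the threshold."""
    return [review for review in reviews if ascii_ratio(review) >= threshold]


# The first review is made up; the second is taken from the output above.
sample_texts = ["Great place for shopping!", "小牌子很多，价格实惠，值得每年去一次"]
print(keep_english_like(sample_texts))  # ['Great place for shopping!']
```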


In [None]:
#data.to_excel('명동.xlsx')
#import pandas as pd
#import sys
#mod = sys.modules[__name__]
# Save the sales records from 2007 to 2020
# as separate dataframes named sales_2007 ~ sales_2020
#for i in range(2007, 2021):
#    filename = 'data/시세/아파트매매/아파트(매매)%d.csv'%i
#    setattr(mod, 'sales_{}'.format(i), a.to_csv('{}.xlsx'.format(a[0]), encoding='cp949'))