### DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
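- Clusters are separated according to how densely the data points are packed in the feature space
- A point that has at least M points within a radius R of itself is called a core point
- A point that is not a core point but has a core point within radius R is called a border point
- A point that is neither a core point nor a border point is classified as noise (an outlier)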
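
The three definitions above can be sketched in plain Python. This is only a brute-force illustration, not scikit-learn's implementation; the function name `classify_points`, its parameters, and the toy coordinates are all hypothetical. It assumes a point counts itself toward the minimum of M, which matches how scikit-learn documents `min_samples`.

```python
from math import dist

def classify_points(points, radius, min_points):
    """Label each point 'core', 'border', or 'noise' (brute force, O(n^2))."""
    # Neighborhood of each point: indices within `radius`, the point itself included.
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= radius]
        for p in points
    ]
    # A core point has at least `min_points` points in its neighborhood.
    is_core = [len(n) >= min_points for n in neighbors]

    labels = []
    for i in range(len(points)):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            # Not dense enough itself, but within reach of a core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Toy data (hypothetical): a tight group of four points, one nearby point, one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (10.0, 10.0)]
labels = classify_points(points, radius=1.5, min_points=4)
print(labels)
# ['core', 'core', 'core', 'core', 'border', 'noise']
```

The fifth point has only three points in its neighborhood, so it is not a core point, but two of those neighbors are core points, which makes it a border point.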

In [9]:
import pandas as pd 
import folium 

file_path = '../data/2016_middle_shcool_graduates_report.xlsx'
df = pd.read_excel(file_path, header=0, engine='openpyxl')

pd.set_option('display.width', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_colwidth', 20)
pd.set_option('display.unicode.east_asian_width', True)

df.columns.values


array(['Unnamed: 0', '지역', '학교명', '코드', '유형', '주야', '남학생수', '여학생수', '일반고',
       '특성화고', '과학고', '외고_국제고', '예고_체고', '마이스터고', '자사고', '자공고', '기타진학',
       '취업', '미상', '위도', '경도'], dtype=object)

In [10]:
df.head()

,Unnamed: 0,지역,학교명,코드,유형,...,기타진학,취업,미상,위도,경도
0,0,성북구,서울대학교사범대학부설중학교...,3,국립,...,0.004,0,0.0,37.594942,127.038909
1,1,종로구,서울대학교사범대학부설여자중학교...,3,국립,...,0.031,0,0.0,37.577473,127.003857
2,2,강남구,개원중학교,3,공립,...,0.009,0,0.003,37.491637,127.071744
3,3,강남구,개포중학교,3,공립,...,0.019,0,0.0,37.480439,127.062201
4,4,서초구,경원중학교,3,공립,...,0.01,0,0.0,37.51075,127.0089


In [11]:
df.drop(['Unnamed: 0'], axis = 1, inplace=True)

In [12]:
df.head()

,지역,학교명,코드,유형,주야,...,기타진학,취업,미상,위도,경도
0,성북구,서울대학교사범대학부설중학교...,3,국립,주간,...,0.004,0,0.0,37.594942,127.038909
1,종로구,서울대학교사범대학부설여자중학교...,3,국립,주간,...,0.031,0,0.0,37.577473,127.003857
2,강남구,개원중학교,3,공립,주간,...,0.009,0,0.003,37.491637,127.071744
3,강남구,개포중학교,3,공립,주간,...,0.019,0,0.0,37.480439,127.062201
4,서초구,경원중학교,3,공립,주간,...,0.01,0,0.0,37.51075,127.0089


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 415 entries, 0 to 414
Data columns (total 20 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   지역      415 non-null    object 
 1   학교명     415 non-null    object 
 2   코드      415 non-null    int64  
 3   유형      415 non-null    object 
 4   주야      415 non-null    object 
 5   남학생수    415 non-null    int64  
 6   여학생수    415 non-null    int64  
 7   일반고     415 non-null    float64
 8   특성화고    415 non-null    float64
 9   과학고     415 non-null    float64
 10  외고_국제고  415 non-null    float64
 11  예고_체고   415 non-null    float64
 12  마이스터고   415 non-null    float64
 13  자사고     415 non-null    float64
 14  자공고     415 non-null    float64
 15  기타진학    415 non-null    float64
 16  취업      415 non-null    int64  
 17  미상      415 non-null    float64
 18  위도      415 non-null    float64
 19  경도      415 non-null    float64
dtypes: float64(12), int64(4), object(4)
memory usage: 65.0+ KB


In [14]:
df.describe()

,코드,남학생수,여학생수,일반고,특성화고,...,기타진학,취업,미상,위도,경도
count,415.0,415.0,415.0,415.0,415.0,...,415.0,415.0,415.0,415.0,415.0
mean,3.19759,126.53253,116.173494,0.62308,0.149684,...,0.069571,0.0,0.00167,37.491969,127.032792
std,0.804272,79.217906,76.833082,0.211093,0.102977,...,0.23563,0.0,0.003697,0.348926,0.265245
min,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,34.97994,126.639561
25%,3.0,80.0,71.5,0.5665,0.0655,...,0.0,0.0,0.0,37.501934,126.921758
50%,3.0,129.0,118.0,0.681,0.149,...,0.007,0.0,0.0,37.547702,127.013579
75%,3.0,177.5,161.5,0.758,0.2245,...,0.015,0.0,0.003,37.59067,127.071265
max,9.0,337.0,422.0,0.908,0.477,...,1.0,0.0,0.036,37.694777,129.106974


In [15]:
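# Base map centered on Seoul.
# Note: recent folium releases may no longer ship the 'Stamen Terrain' tiles;
# if this call fails, tiles='OpenStreetMap' is one possible substitute.
mschool_map = folium.Map(location=[37.55, 126.98], tiles='Stamen Terrain', zoom_start=12)

# Add one circle marker per school, with the school name as the popup.
for name, lat, lng in zip(df.학교명, df.위도, df.경도):
    folium.CircleMarker([lat, lng],
                        radius=5,
                        color='brown',
                        fill=True,
                        fill_color='coral',
                        popup=name
                        ).add_to(mschool_map)

mschool_map.save('../Part07/seoul_mschool_location.html')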

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 415 entries, 0 to 414
Data columns (total 20 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   지역      415 non-null    object 
 1   학교명     415 non-null    object 
 2   코드      415 non-null    int64  
 3   유형      415 non-null    object 
 4   주야      415 non-null    object 
 5   남학생수    415 non-null    int64  
 6   여학생수    415 non-null    int64  
 7   일반고     415 non-null    float64
 8   특성화고    415 non-null    float64
 9   과학고     415 non-null    float64
 10  외고_국제고  415 non-null    float64
 11  예고_체고   415 non-null    float64
 12  마이스터고   415 non-null    float64
 13  자사고     415 non-null    float64
 14  자공고     415 non-null    float64
 15  기타진학    415 non-null    float64
 16  취업      415 non-null    int64  
 17  미상      415 non-null    float64
 18  위도      415 non-null    float64
 19  경도      415 non-null    float64
dtypes: float64(12), int64(4), object(4)
memory usage: 65.0+ KB


In [18]:
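from sklearn import preprocessing

# Note: despite the onehot_* names, the columns below are label-encoded
# (one integer per category); the OneHotEncoder object is created but never used.
label_encoding = preprocessing.LabelEncoder()
onehot_encoding = preprocessing.OneHotEncoder()

# fit_transform refits the encoder on each call, so one encoder can be reused per column.
onehot_location = label_encoding.fit_transform(df['지역'])
onehot_code = label_encoding.fit_transform(df['코드'])
onehot_type = label_encoding.fit_transform(df['유형'])
onehot_day = label_encoding.fit_transform(df['주야'])

# Store the integer codes as new columns.
df['location'] = onehot_location
df['code'] = onehot_code
df['type'] = onehot_type
df['day'] = onehot_day

df.head()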

,지역,학교명,코드,유형,주야,...,경도,location,code,type,day
0,성북구,서울대학교사범대학부설중학교...,3,국립,주간,...,127.038909,16,0,1,0
1,종로구,서울대학교사범대학부설여자중학교...,3,국립,주간,...,127.003857,22,0,1,0
2,강남구,개원중학교,3,공립,주간,...,127.071744,0,0,0,0
3,강남구,개포중학교,3,공립,주간,...,127.062201,0,0,0,0
4,서초구,경원중학교,3,공립,주간,...,127.0089,14,0,0,0


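### Model training and validation
- Select the 과학고, 외고_국제고, and 자사고 columns as the feature matrix X
- Standardize the features with StandardScaler() (zero mean, unit variance per column)
- Create the model object with the DBSCAN() class from the cluster module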
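
Standardization matters because DBSCAN's `eps` is a single distance threshold applied across all features. As a minimal sketch on hypothetical toy values (`X_toy` is not from the dataset), the following checks that `StandardScaler` matches a manual computation of subtracting the column mean and dividing by the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values): two columns on very different scales.
X_toy = np.array([[0.01, 100.0],
                  [0.02, 200.0],
                  [0.03, 300.0],
                  [0.04, 400.0]])

X_scaled = StandardScaler().fit_transform(X_toy)

# StandardScaler subtracts each column's mean and divides by its population
# standard deviation (ddof=0), so this manual version should match.
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

After scaling, every column contributes on a comparable scale to the Euclidean distance that `eps` is compared against.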

In [19]:
from sklearn import cluster

# Positional indices of the 과학고 (9), 외고_국제고 (10), and 자사고 (13) columns
columns_list = [9, 10, 13]
X = df.iloc[:, columns_list]

X[:5]

,과학고,외고_국제고,자사고
0,0.018,0.007,0.227
1,0.0,0.035,0.043
2,0.009,0.012,0.09
3,0.013,0.013,0.065
4,0.007,0.01,0.282


In [20]:
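# Standardize each feature to zero mean and unit variance
X = preprocessing.StandardScaler().fit(X).transform(X)

# eps: neighborhood radius (in standardized units); min_samples: core-point threshold
dbm = cluster.DBSCAN(eps = 0.2, min_samples = 5)

dbm.fit(X)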

DBSCAN(eps=0.2)

In [21]:
cluster_label = dbm.labels_ 
cluster_label

array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0, -1, -1, -1,
       -1, -1, -1,  2, -1,  0, -1, -1, -1, -1, -1,  0, -1, -1, -1, -1, -1,
        0,  3, -1, -1, -1, -1, -1, -1, -1,  0, -1, -1,  1,  0, -1, -1, -1,
        0, -1, -1, -1, -1,  0, -1,  0,  0, -1, -1,  0, -1, -1, -1,  0,  0,
       -1, -1,  0, -1, -1, -1,  0, -1, -1, -1,  0,  2,  0,  0,  0,  0,  0,
       -1, -1, -1,  0, -1,  0, -1, -1,  0, -1,  0, -1,  0,  0, -1, -1, -1,
       -1,  1,  0, -1,  0,  0, -1, -1, -1,  0, -1, -1, -1, -1, -1,  0,  1,
       -1, -1,  0,  2,  0, -1, -1,  1, -1, -1, -1,  0,  0,  0, -1, -1,  0,
       -1, -1, -1,  0,  0, -1, -1, -1, -1,  0, -1, -1, -1,  0, -1, -1, -1,
        0, -1,  0,  0, -1, -1, -1, -1, -1,  0, -1,  0,  0, -1, -1, -1, -1,
       -1,  0, -1, -1, -1,  1,  0,  3,  1, -1,  0,  0, -1,  0, -1, -1,  0,
        0,  2, -1, -1,  3,  0,  0, -1, -1, -1, -1,  0, -1,  0,  0, -1,  0,
        0,  0, -1, -1,  0, -1, -1, -1, -1, -1,  2,  0, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1
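
In `labels_`, the value -1 marks noise and non-negative integers are cluster ids. A label array like this one can be summarized as sketched below; the `labels` array here is a short hypothetical example in the same format, not the actual output above.

```python
import numpy as np

# Hypothetical label array in the same format as dbm.labels_ (-1 marks noise).
labels = np.array([-1, 0, 0, 1, -1, 0, 2, 1, -1, 0])

# Count how many points carry each label.
values, counts = np.unique(labels, return_counts=True)
summary = {int(v): int(c) for v, c in zip(values, counts)}

# Number of real clusters (noise excluded) and the share of points labeled noise.
n_clusters = len(set(summary) - {-1})
noise_ratio = summary.get(-1, 0) / len(labels)

print(summary)
print(n_clusters, noise_ratio)
```

A large noise share is often a hint that `eps` is small relative to the spread of the standardized data, though the right setting depends on the dataset.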

In [22]:
df['Cluster'] = cluster_label 
df.head()

,지역,학교명,코드,유형,주야,...,location,code,type,day,Cluster
0,성북구,서울대학교사범대학부설중학교...,3,국립,주간,...,16,0,1,0,-1
1,종로구,서울대학교사범대학부설여자중학교...,3,국립,주간,...,22,0,1,0,-1
2,강남구,개원중학교,3,공립,주간,...,0,0,0,0,-1
3,강남구,개포중학교,3,공립,주간,...,0,0,0,0,-1
4,서초구,경원중학교,3,공립,주간,...,14,0,0,0,-1


- Use the groupby method to group the DataFrame by the 'Cluster' column, which returns a group object

In [23]:
# Group by cluster label and print the first rows of each group
grouped_cols = [0, 1, 3] + columns_list 
grouped = df.groupby('Cluster')

for key, group in grouped :
    print('* key : ', key)
    print('* number : ', len(group))
    print(group.iloc[:, grouped_cols].head())

* key :  -1
* number :  255
     지역                               학교명  유형  과학고  외고_국제고  \
0  성북구  서울대학교사범대학부설중학교.....    국립   0.018        0.007   
1  종로구  서울대학교사범대학부설여자중학교...  국립   0.000        0.035   
2  강남구           개원중학교                  공립   0.009        0.012   
3  강남구           개포중학교                  공립   0.013        0.013   
4  서초구           경원중학교                  공립   0.007        0.010   

   자사고  
0   0.227  
1   0.043  
2   0.090  
3   0.065  
4   0.282  
* key :  0
* number :  102
      지역          학교명  유형  과학고  외고_국제고  자사고
13  서초구  동덕여자중학교  사립     0.0        0.022   0.038
22  강남구      수서중학교  공립     0.0        0.019   0.044
28  서초구      언남중학교  공립     0.0        0.015   0.050
34  강남구      은성중학교  사립     0.0        0.016   0.065
43  송파구      거원중학교  공립     0.0        0.021   0.054
* key :  1
* number :  45
         지역          학교명  유형  과학고  외고_국제고  자사고
46     강동구      동신중학교  사립     0.0          0.0   0.044
103    양천구      신원중학교  공립     0.0          0.0   0.006
118    구로구   
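
Printing the head of each group works for a quick look; a per-cluster summary table is sometimes easier to compare. The sketch below uses a small hypothetical frame (`toy`) with the same column names, and the output column names `n_schools`, `mean_science`, and `mean_private` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the clustered DataFrame.
toy = pd.DataFrame({
    "Cluster": [-1, 0, 0, 1, 1, 1],
    "과학고": [0.02, 0.0, 0.0, 0.0, 0.0, 0.0],
    "자사고": [0.25, 0.04, 0.06, 0.01, 0.02, 0.03],
})

# One row per cluster: group size and mean admission ratios.
summary = toy.groupby("Cluster").agg(
    n_schools=("자사고", "size"),
    mean_science=("과학고", "mean"),
    mean_private=("자사고", "mean"),
)
print(summary)
```

The same `agg` call could be applied to `df.groupby('Cluster')` in the notebook, assuming the columns of interest are the ones selected for X.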

In [25]:
# One color per cluster label; -1 (noise) is drawn in gray
colors = {-1:'gray', 0:'coral', 1:'blue', 2:'green', 3:'red', 4:'purple',
          5:'orange', 6:'brown', 7:'firebrick', 8:'yellow', 9:'magenta', 10:'cyan'}

cluster_map = folium.Map(location=[37.55, 126.98], tiles = 'Stamen Terrain',
                         zoom_start=12)

# Draw each school as a circle marker colored by its cluster label
for name, lat, lng, clus in zip(df.학교명, df.위도, df.경도, df.Cluster):
    folium.CircleMarker([lat, lng],
                        radius=5,
                        color = colors[clus],
                        fill = True,
                        fill_color = colors[clus],
                        fill_opacity = 0.7,
                        popup = name
                        ).add_to(cluster_map)

cluster_map.save('../Part07/seoul_mschool_cluster.html')