```
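ID : unique id for each sample
월 : month in which the incident occurred
요일 : day of the week (월요일 to 일요일, i.e. Monday to Sunday)
시간 : time at which the incident occurred
소관경찰서 : police station with jurisdiction over the incident area
소관지역 : area in which the incident occurred
사건발생거리 : distance from the nearest police station to the incident scene
강수량(mm) : rainfall (mm)
강설량(mm) : snowfall (mm)
적설량(cm) : snow depth (cm)
풍향 : wind direction at the crime location (up to 360 degrees)
안개 : fog, when visibility is below 1 km
짙은안개 : dense fog, when visibility is below 200 m
번개 : lightning
진눈깨비 : sleet
서리 : frost
연기/연무 : smoke/haze, when dust or smoke obscures the sky
눈날림 : blowing snow
범죄발생지 : place where the crime occurred
TARGET : crime type [0: robbery, 1: theft, 2: assault]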
```
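The integer codes in `TARGET` are easier to read in plots and in error analysis once they are mapped back to names. The following is a minimal sketch, assuming the code-to-label mapping listed in the data description above. `TARGET_LABELS` and `label_targets` are hypothetical names, the English labels are translations, and the small Series is hand-written illustration data rather than rows from `train.csv`.

```python
import pandas as pd

# Mapping taken from the data description above (English labels are translations).
TARGET_LABELS = {0: "robbery", 1: "theft", 2: "assault"}

def label_targets(target: pd.Series) -> pd.Series:
    """Return readable crime-type names for integer TARGET codes."""
    # Codes that are not in the mapping become NaN.
    return target.map(TARGET_LABELS)

# Illustration only: a tiny hand-written Series, not rows from train.csv.
example = pd.Series([2, 0, 1, 1], name="TARGET")
print(label_targets(example).tolist())
```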

```
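Introduction
1.1 Project overview
1.2 Goals and importance
1.3 Dataset introduction

Data understanding and preprocessing
2.1 Data collection and sources
2.2 Understanding the data structure
2.3 Handling missing values
2.4 Outlier detection and handling
2.5 Data visualization

Feature engineering
3.1 Handling categorical features
3.2 Handling numerical features
3.3 Feature scaling and normalization
3.4 Feature selection

Model selection and training
4.1 Overview of crime-type classification models
4.2 Building a baseline model
4.3 Model performance metrics
4.4 Model training and validation
4.5 Model tuning and optimization

Result analysis and interpretation
5.1 Model performance evaluation
5.2 Feature importance analysis
5.3 Error analysis
5.4 Interpreting results and deriving insights

Improvements and future work
6.1 Improvements for better performance
6.2 Further research and potential developments

Conclusion
7.1 Project summary
7.2 Review of results
7.3 Wrap-up and next steps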
```

In [1]:
import pandas as pd
import random
import os
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import *
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

import optuna
from optuna import Trial, visualization
from optuna.samplers import TPESampler

import xgboost
from xgboost import XGBClassifier

import warnings
warnings.filterwarnings('ignore')

In [2]:
train = pd.read_csv('train.csv')
train

,ID,월,요일,시간,소관경찰서,소관지역,사건발생거리,강수량(mm),강설량(mm),적설량(cm),풍향,안개,짙은안개,번개,진눈깨비,서리,연기/연무,눈날림,범죄발생지,TARGET
0,TRAIN_00000,9,화요일,10,137,8.0,2.611124,0.000000,0.0,0.00,245.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,차도,2
1,TRAIN_00001,11,화요일,6,438,13.0,3.209093,0.000000,0.0,0.00,200.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,차도,0
2,TRAIN_00002,8,일요일,6,1729,47.0,1.619597,0.000000,0.0,0.00,40.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,인도,1
3,TRAIN_00003,5,월요일,6,2337,53.0,1.921615,11.375000,0.0,0.00,225.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,주거지,1
4,TRAIN_00004,9,일요일,11,1439,41.0,1.789721,0.000000,0.0,0.00,255.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,주유소,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84401,TRAIN_84401,4,일요일,7,336,11.0,3.808190,99.111111,0.0,0.00,165.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,차도,1
84402,TRAIN_84402,8,목요일,12,2149,38.0,1.458490,0.000000,0.0,0.00,200.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,차도,0
84403,TRAIN_84403,7,일요일,6,29,46.0,2.944913,105.888889,0.0,0.00,315.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,차도,0
84404,TRAIN_84404,1,화요일,11,536,25.0,0.493679,2.285714,8.6,10.75,330.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,주거지,1


In [3]:
# No missing values
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84406 entries, 0 to 84405
Data columns (total 20 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   ID       84406 non-null  object 
 1   월        84406 non-null  int64  
 2   요일       84406 non-null  object 
 3   시간       84406 non-null  int64  
 4   소관경찰서    84406 non-null  int64  
 5   소관지역     84406 non-null  float64
 6   사건발생거리   84406 non-null  float64
 7   강수량(mm)  84406 non-null  float64
 8   강설량(mm)  84406 non-null  float64
 9   적설량(cm)  84406 non-null  float64
 10  풍향       84406 non-null  float64
 11  안개       84406 non-null  float64
 12  짙은안개     84406 non-null  float64
 13  번개       84406 non-null  float64
 14  진눈깨비     84406 non-null  float64
 15  서리       84406 non-null  float64
 16  연기/연무    84406 non-null  float64
 17  눈날림      84406 non-null  float64
 18  범죄발생지    84406 non-null  object 
 19  TARGET   84406 non-null  int64  
dtypes: float64(13), int64(4), object(3)
memory usage: 
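
Every column in the `train.info()` output reports 84406 non-null entries, which is what the "no missing values" comment rests on. Reading that table by eye gets harder as columns are added, so an explicit count can help. The following is a minimal sketch: `missing_report` is a hypothetical helper, and the small frame with one deliberately missing value is illustration data, not part of the dataset.

```python
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.Series:
    """Count missing values per column, keeping only columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0]

# Illustration only: a hand-written frame with one deliberately missing value.
toy = pd.DataFrame({"월": [9, 11, 8], "풍향": [245.0, None, 40.0]})
report = missing_report(toy)
print(report)        # only columns with missing values are listed
print(report.empty)  # an empty report means there are no missing values
```

Given the `info()` output above, `missing_report(train).empty` would be expected to be `True` for the training data.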

In [4]:
train.describe()

,월,시간,소관경찰서,소관지역,사건발생거리,강수량(mm),강설량(mm),적설량(cm),풍향,안개,짙은안개,번개,진눈깨비,서리,연기/연무,눈날림,TARGET
count,84406.0,84406.0,84406.0,84406.0,84406.0,84406.0,84406.0,84406.0,84406.0,84406.0,84406.0,84406.0,84406.0,84406.0,84406.0,84406.0,84406.0
mean,6.430195,6.769507,1060.027581,26.881726,1.912424,24.608776,2.284407,23.430503,186.926107,0.385423,0.017842,0.144042,0.02033,0.01026,0.210755,0.008921,0.835355
std,3.108302,3.56639,698.380485,13.870968,0.958556,62.711211,15.852881,85.199896,98.299485,0.486698,0.132379,0.351134,0.141128,0.100771,0.407847,0.09403,0.819762
min,1.0,1.0,26.0,5.0,0.012269,0.0,0.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4.0,4.0,526.0,13.0,1.209985,0.0,0.0,0.0,95.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,7.0,7.0,937.0,27.0,1.822279,0.625,0.0,0.0,205.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,9.0,10.0,1638.0,38.0,2.476528,18.571429,0.0,0.0,260.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
max,12.0,12.0,2450.0,54.0,4.998936,614.875,295.0,649.8,360.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0
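
The quartiles above show that `강설량(mm)` and `적설량(cm)` are still 0 at the 75th percentile, and most of the weather flags are 0 at every quartile, so these columns are dominated by zeros. The following is a minimal sketch for quantifying that per column. `zero_share` is a hypothetical helper, and the small frame is hand-written illustration data, not rows from `train.csv`.

```python
import pandas as pd

def zero_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of rows equal to 0 for each numeric column, highest first."""
    numeric = df.select_dtypes(include="number")  # skip object columns such as 요일
    return (numeric == 0).mean().sort_values(ascending=False)

# Illustration only: a hand-written frame, not the competition data.
toy = pd.DataFrame({
    "강수량(mm)": [0.0, 11.375, 0.0, 2.285714],
    "강설량(mm)": [0.0, 0.0, 0.0, 8.6],
    "요일": ["화요일", "월요일", "일요일", "화요일"],
})
print(zero_share(toy))
```

Calling `zero_share(train)` would give the same summary for the real data. Columns with a very high zero share may need different treatment from the continuous ones during scaling, though that is a judgment call rather than something the notebook states.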


In [5]:
train['범죄발생지'].value_counts()

주거지      36077
차도       25879
인도        6437
편의점       4835
주차장       3262
식당        1806
백화점       1493
주유소       1324
공원         736
학교         728
약국         653
호텔/모텔      591
병원         453
은행         132
Name: 범죄발생지, dtype: int64
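
The counts above are heavily skewed: the two largest categories, 주거지 and 차도, cover roughly 73% of all rows, while several places appear only a few hundred times. The following is a minimal sketch that turns the printed counts into relative frequencies and lists the rare categories. The 1% cutoff (`RARE_THRESHOLD`) is an arbitrary illustrative assumption, not a value taken from the notebook.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({
    "주거지": 36077, "차도": 25879, "인도": 6437, "편의점": 4835,
    "주차장": 3262, "식당": 1806, "백화점": 1493, "주유소": 1324,
    "공원": 736, "학교": 728, "약국": 653, "호텔/모텔": 591,
    "병원": 453, "은행": 132,
})

shares = counts / counts.sum()      # equivalent to value_counts(normalize=True)
top2_share = shares.iloc[:2].sum()  # 주거지 + 차도

# Assumption: 1% is an illustrative cutoff, not a value from the notebook.
RARE_THRESHOLD = 0.01
rare = shares[shares < RARE_THRESHOLD].index.tolist()

print(f"top-2 share: {top2_share:.3f}")
print("rare categories:", rare)
```

On the real frame, `train['범죄발생지'].value_counts(normalize=True)` gives the same shares directly. Whether to merge the rare categories into a single bucket before encoding is a design choice that would need to be checked against validation scores.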