# [San Francisco Crime Classification | Kaggle](https://www.kaggle.com/c/sf-crime)

### Following along with [SF Crime Prediction with scikit-learn | Kaggle](https://www.kaggle.com/rhoslug/sf-crime-prediction-with-scikit-learn)

### Data fields
* Dates - timestamp of the crime incident
* Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict.
* Descript - detailed description of the crime incident (only in train.csv)
* DayOfWeek - the day of the week
* PdDistrict - name of the Police Department District
* Resolution - how the crime incident was resolved (only in train.csv)
* Address - the approximate street address of the crime incident 
* X - Longitude 
* Y - Latitude 

In [None]:
from __future__ import print_function, division
import pandas as pd
import numpy as np

In [None]:
df_train = pd.read_csv('data/train.csv', parse_dates=['Dates'])
df_train.shape

In [None]:
df_train.head()

In [None]:
# Drop 'Descript', 'Dates', and 'Resolution'
df_train.drop(['Descript', 'Dates', 'Resolution'], axis=1, inplace=True)
df_train.shape

In [None]:
df_test = pd.read_csv('data/test.csv', parse_dates=['Dates'])
df_test.shape

In [None]:
df_test.head()

In [None]:
df_test.drop(['Dates'], axis=1, inplace=True)

In [None]:
df_test.head()

In [None]:
# Build an index array used to select the training and validation sets.
inds = np.arange(df_train.shape[0])
inds

In [None]:
np.random.shuffle(inds)
df_train.shape[0]

In [None]:
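# Training set: the first 20% of the shuffled indices
train_inds = inds[:int(0.2 * df_train.shape[0])]
print(train_inds.shape)
# Validation set: the remaining 80% (the cast must wrap the whole product)
val_inds = inds[int(0.2 * df_train.shape[0]):]
print(val_inds.shape)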
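
The shuffled-index split can be sketched on a toy array. This is a minimal illustration, not the notebook's data: `n_rows`, `train_frac`, and the seed are made-up values. It also shows why the cast has to wrap the whole product: `int(0.2) * n_rows` evaluates to 0, which would place every row in the validation set.

```python
import numpy as np

n_rows = 10          # hypothetical row count, standing in for df_train.shape[0]
train_frac = 0.2     # fraction of rows used for training

rng = np.random.RandomState(0)   # fixed seed so the shuffle is repeatable
inds = np.arange(n_rows)
rng.shuffle(inds)

cut = int(train_frac * n_rows)   # cast the product, not the fraction
train_inds = inds[:cut]
val_inds = inds[cut:]

print(train_inds.shape, val_inds.shape)   # (2,) (8,)
print(int(train_frac) * n_rows)           # 0: the misplaced cast
```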

In [None]:
# Extract the sorted category names; they become the submission column names.
col_names = np.sort(df_train['Category'].unique())
col_names

In [None]:
# Convert the categorical columns to integer codes.
df_train['Category'] = pd.Categorical(df_train['Category']).codes
df_train['DayOfWeek'] = pd.Categorical(df_train['DayOfWeek']).codes
df_train['PdDistrict'] = pd.Categorical(df_train['PdDistrict']).codes
df_test['DayOfWeek'] = pd.Categorical(df_test['DayOfWeek']).codes
df_test['PdDistrict'] = pd.Categorical(df_test['PdDistrict']).codes
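
One caveat with calling `pd.Categorical(...).codes` separately on each frame: the codes come from the sorted values present in that frame, so a district gets the same code in train and test only when both frames contain the same set of values. A hedged sketch of one way to pin the mapping, using small made-up frames (`toy_train` and `toy_test` are hypothetical):

```python
import pandas as pd

toy_train = pd.DataFrame({'PdDistrict': ['MISSION', 'BAYVIEW', 'TARAVAL']})
toy_test = pd.DataFrame({'PdDistrict': ['TARAVAL', 'MISSION']})

# Encoded independently: codes depend on which values this frame contains.
separate = pd.Categorical(toy_test['PdDistrict']).codes

# Encoded with the training categories: codes follow the training mapping.
cats = pd.Categorical(toy_train['PdDistrict']).categories
shared = pd.Categorical(toy_test['PdDistrict'], categories=cats).codes

print(list(separate))   # [1, 0]
print(list(shared))     # [2, 1]
```

In the toy data, `BAYVIEW` is missing from the test frame, so the independent encoding shifts every code by one.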

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# Extract token counts from the address text.
cvec = CountVectorizer()
cvec

In [None]:
bows_train = cvec.fit_transform(df_train['Address'].values)

In [None]:
# Reuse the vocabulary learned on the training addresses so the columns line up.
bows_test = cvec.transform(df_test['Address'].values)
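
A model trained on bag-of-words features expects the test matrix to have the same columns in the same order. A minimal sketch with made-up addresses (not rows from the dataset): fitting once and calling `transform` on the second set keeps the column count fixed, and tokens unseen during fitting are simply dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up address strings for illustration only.
train_addr = ['800 Block of BRYANT ST', 'OAK ST / LAGUNA ST']
test_addr = ['100 Block of OAK ST', 'MARKET ST / 5TH ST']

vec = CountVectorizer()
bows_a = vec.fit_transform(train_addr)   # learns the vocabulary
bows_b = vec.transform(test_addr)        # reuses it, ignoring unseen tokens

print(bows_a.shape[1] == bows_b.shape[1])
print('market' in vec.vocabulary_)
```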

In [None]:
# Split into training and validation sets.
df_val = df_train.iloc[val_inds]
df_val.head()

In [None]:
df_val.shape

In [None]:
df_train = df_train.iloc[train_inds]
df_train.shape

In [None]:
df_train.head()

In [None]:
from patsy import dmatrices, dmatrix
y_train, X_train = dmatrices('Category ~ X + Y + DayOfWeek + PdDistrict', df_train)

In [None]:
y_train.shape

In [None]:
# dmatrices returns patsy DesignMatrix objects (NumPy arrays) by default, so slice instead of .head()
y_train[:5]

In [None]:
# Append the vectorized addresses.
X_train = np.hstack((X_train, bows_train[train_inds, :].toarray()))

In [None]:
X_train.shape

In [None]:
y_val, X_val = dmatrices('Category ~ X + Y + DayOfWeek + PdDistrict', df_val)

In [None]:
X_val = np.hstack((X_val, bows_train[val_inds, :].toarray()))
X_test = dmatrix('X + Y + DayOfWeek + PdDistrict', df_test)

In [None]:
X_test = np.hstack((X_test, bows_test.toarray()))
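
Calling `.toarray()` on the bag-of-words matrix materializes every zero, which can exhaust memory on a large vocabulary. One possible alternative, assuming the downstream estimator accepts SciPy sparse input, is to keep everything sparse and stack with `scipy.sparse.hstack`. The arrays below are made-up stand-ins for the design matrix and the token counts.

```python
import numpy as np
from scipy import sparse

# Made-up stand-ins: two rows of dense features and two rows of token counts.
dense_part = np.array([[1.0, -122.42], [1.0, -122.40]])
bow_part = sparse.csr_matrix(np.array([[0, 1, 0], [2, 0, 1]]))

# Convert the small dense block to sparse and stack column-wise.
combined = sparse.hstack([sparse.csr_matrix(dense_part), bow_part]).tocsr()

print(combined.shape)
print(sparse.issparse(combined))
```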

In [None]:
# IncrementalPCA
# from sklearn.decomposition import IncrementalPCA
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.linear_model import LogisticRegression

In [None]:
# ipca = IncrementalPCA(n_components=4, batch_size=5)
# ipca

In [None]:
# Failed to run: the local machine ran out of memory.
# X_train = ipca.fit_transform(X_train)

In [None]:
# X_val = ipca.transform(X_val)

In [None]:
# X_test = ipca.transform(X_test)

In [None]:
# # Create a logistic regression model and fit it.
# logistic = LogisticRegression()
# logistic.fit(X_train, y_train.ravel())

# # Check the accuracy.
# print('Mean accuracy (Logistic): {:.4f}'.format(logistic.score(X_val, y_val.ravel())))

In [None]:
# # Fit a random forest.
# randforest = RandomForestClassifier()
# randforest.fit(X_train, y_train.ravel())

# # Check the accuracy.
# print('Mean accuracy (Random Forest): {:.4f}'.format(randforest.score(X_val, y_val.ravel())))

In [None]:
# Make predictions

# predict_probs = logistic.predict_proba(X_test)

In [None]:
import xgboost as xgb
model = xgb.XGBClassifier(n_estimators=5, nthread=-1, seed=37)
model

In [None]:
from sklearn.model_selection import cross_val_score
%time score = cross_val_score(model, X_train, y_train.ravel(), cv=5, scoring="neg_log_loss").mean()

print("Score = {0:.5f}".format(score))
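
`scoring="neg_log_loss"` reports the negated multiclass log loss, so scores closer to zero are better. The underlying metric is the mean negative log of the probability assigned to each row's true class. A minimal NumPy sketch on made-up probabilities (the helper name `multiclass_log_loss` is hypothetical, not a library function):

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    probs = np.clip(probs, eps, 1 - eps)   # avoid log(0)
    rows = np.arange(len(y_true))
    return -np.mean(np.log(probs[rows, y_true]))

# Made-up labels and predicted probabilities for three rows, three classes.
y_true = np.array([0, 2, 1])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.2, 0.5, 0.3]])

loss = multiclass_log_loss(y_true, probs)
print(round(loss, 4))   # 0.4243
```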

In [None]:
model.fit(X_train, y_train.ravel())

In [None]:
# predict_proba returns the predicted probability of each class.
predictions = model.predict_proba(X_test)

print(predictions.shape)
predictions[0]

In [None]:
# Use the XGBoost probabilities computed above (predict_probs was never defined).
df_pred = pd.DataFrame(data=predictions, columns=col_names)
df_pred['Id'] = df_test['Id'].astype(int)
df_pred.to_csv('output.csv', index=False)
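
The submission-building step can be sketched on made-up data: one probability column per category plus an `Id` column. The class names and probabilities below are illustrative, and placing `Id` first is an assumption about the expected layout rather than something the notebook states. The column order has to match the class order of `predict_proba`, which is why the category names are sorted the same way the integer codes were assigned.

```python
import numpy as np
import pandas as pd

# Made-up subset of categories and probabilities for two test rows.
col_names = np.array(['ARSON', 'ASSAULT', 'BURGLARY'])
predictions = np.array([[0.1, 0.6, 0.3],
                        [0.2, 0.2, 0.6]])
ids = pd.Series([0, 1])

df_pred = pd.DataFrame(data=predictions, columns=col_names)
df_pred.insert(0, 'Id', ids.astype(int))   # put Id in front of the class columns

print(df_pred.columns.tolist())
print(np.allclose(df_pred[col_names].sum(axis=1), 1.0))   # each row sums to 1
```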