## sf-crime problem : https://www.kaggle.com/c/sf-crime

### Problem background:
From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz.

Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.

From Sunset to SOMA, and Marina to Excelsior, this competition's dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred.

We're also encouraging you to explore the dataset visually. What can we learn about the city through visualizations like this Top Crimes Map? The top most up-voted scripts from this competition will receive official Kaggle swag as prizes. 

In [21]:
import numpy as np
import pandas as pd
import datetime
from sklearn.preprocessing import LabelEncoder,OneHotEncoder

In [22]:
# load data
train = pd.read_csv('train.csv', parse_dates = ['Dates'])
test = pd.read_csv('test.csv', parse_dates = ['Dates'])

In [23]:
train.head(1)

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599


In [31]:
train['Category'].unique()

array(['WARRANTS', 'OTHER OFFENSES', 'LARCENY/THEFT', 'VEHICLE THEFT',
       'VANDALISM', 'NON-CRIMINAL', 'ROBBERY', 'ASSAULT', 'WEAPON LAWS',
       'BURGLARY', 'SUSPICIOUS OCC', 'DRUNKENNESS',
       'FORGERY/COUNTERFEITING', 'DRUG/NARCOTIC', 'STOLEN PROPERTY',
       'SECONDARY CODES', 'TRESPASS', 'MISSING PERSON', 'FRAUD',
       'KIDNAPPING', 'RUNAWAY', 'DRIVING UNDER THE INFLUENCE',
       'SEX OFFENSES FORCIBLE', 'PROSTITUTION', 'DISORDERLY CONDUCT',
       'ARSON', 'FAMILY OFFENSES', 'LIQUOR LAWS', 'BRIBERY',
       'EMBEZZLEMENT', 'SUICIDE', 'LOITERING',
       'SEX OFFENSES NON FORCIBLE', 'EXTORTION', 'GAMBLING', 'BAD CHECKS',
       'TREA', 'RECOVERED VEHICLE', 'PORNOGRAPHY/OBSCENE MAT'],
      dtype=object)

In [24]:
train.shape

(878049, 9)

In [25]:
test.head(1)

Unnamed: 0,Id,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,0,2015-05-10 23:59:00,Sunday,BAYVIEW,2000 Block of THOMAS AV,-122.399588,37.735051


In [26]:
test.shape

(884262, 7)

In [8]:
# understand data

In [27]:
feature = list(set(train.columns) & set(test.columns))

In [28]:
feature

['Y', 'X', 'Dates', 'Address', 'DayOfWeek', 'PdDistrict']
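
Because a `set` intersection does not preserve order, the column order in `feature` can differ from run to run. The following is a minimal sketch of an order-preserving alternative; `train_demo` and `test_demo` are hypothetical empty frames standing in for `train` and `test`.

```python
import pandas as pd

# Hypothetical stand-ins for train/test: only the column names matter here
train_demo = pd.DataFrame(columns=['Dates', 'Category', 'DayOfWeek', 'X', 'Y'])
test_demo = pd.DataFrame(columns=['Id', 'Dates', 'DayOfWeek', 'X', 'Y'])

# Keep train's column order, retaining only columns that also exist in test
shared = [c for c in train_demo.columns if c in test_demo.columns]
print(shared)
```

A fixed column order makes later outputs reproducible and easier to compare across runs.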

In [11]:
target = 'Category'

In [32]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878049 entries, 0 to 878048
Data columns (total 9 columns):
Dates         878049 non-null datetime64[ns]
Category      878049 non-null object
Descript      878049 non-null object
DayOfWeek     878049 non-null object
PdDistrict    878049 non-null object
Resolution    878049 non-null object
Address       878049 non-null object
X             878049 non-null float64
Y             878049 non-null float64
dtypes: datetime64[ns](1), float64(2), object(6)
memory usage: 60.3+ MB


In [33]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 884262 entries, 0 to 884261
Data columns (total 7 columns):
Id            884262 non-null int64
Dates         884262 non-null datetime64[ns]
DayOfWeek     884262 non-null object
PdDistrict    884262 non-null object
Address       884262 non-null object
X             884262 non-null float64
Y             884262 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(1), object(3)
memory usage: 47.2+ MB


In [34]:
# .copy() returns independent frames, so adding columns does not raise SettingWithCopyWarning
Dtrain = train[feature].copy()
Dtrain['Id'] = -999  # sentinel Id that marks training rows
Dtrain['Category'] = train['Category']
Dtest = test[feature].copy()
Dtest['Id'] = test['Id']
Dtest['Category'] = 'None'  # placeholder: test rows have no label


In [35]:
Dtrain.head(2)

Unnamed: 0,Y,X,Dates,Address,DayOfWeek,PdDistrict,Id,Category
0,37.774599,-122.425892,2015-05-13 23:53:00,OAK ST / LAGUNA ST,Wednesday,NORTHERN,-999,WARRANTS
1,37.774599,-122.425892,2015-05-13 23:53:00,OAK ST / LAGUNA ST,Wednesday,NORTHERN,-999,OTHER OFFENSES


In [36]:
Dtest.head(2)

Unnamed: 0,Y,X,Dates,Address,DayOfWeek,PdDistrict,Id,Category
0,37.735051,-122.399588,2015-05-10 23:59:00,2000 Block of THOMAS AV,Sunday,BAYVIEW,0,
1,37.732432,-122.391523,2015-05-10 23:51:00,3RD ST / REVERE AV,Sunday,BAYVIEW,1,


In [15]:
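# Concatenate train and test so that preprocessing is applied to both uniformly.
# reset_index() keeps the old row labels in a new 'index' column.
D_data = pd.concat([Dtrain,Dtest],axis=0).reset_index()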
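
The notebook marks training rows with the sentinel `Id = -999` so the combined frame can be split again later. As an alternative sketch (not what the notebook does), `pd.concat` accepts a `keys` argument that labels each block, which avoids the sentinel. The small frames below are hypothetical.

```python
import pandas as pd

train_demo = pd.DataFrame({'PdDistrict': ['NORTHERN', 'PARK'],
                           'Category': ['WARRANTS', 'ASSAULT']})
test_demo = pd.DataFrame({'PdDistrict': ['BAYVIEW'], 'Id': [0]})

# keys= labels each block; the label becomes the outer level of a MultiIndex
combined = pd.concat([train_demo, test_demo], keys=['train', 'test'], sort=False)

# ... shared preprocessing on `combined` would go here ...

# Split back by block label
train_part = combined.loc['train']
test_part = combined.loc['test']
print(len(train_part), len(test_part))
```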


In [47]:
# Dates
Dates_feature = D_data[['Dates']].copy()  # copy avoids SettingWithCopyWarning
Dates_feature['year'] = Dates_feature.Dates.dt.year
Dates_feature['month'] = Dates_feature.Dates.dt.month
Dates_feature['day'] = Dates_feature.Dates.dt.day
Dates_feature['hour'] = Dates_feature.Dates.dt.hour

In [48]:
del Dates_feature['Dates']

In [49]:
Dates_feature.head(2)

Unnamed: 0,year,month,day,hour
0,2015,5,13,23
1,2015,5,13,23


In [50]:
len(Dates_feature['year'].unique())

13

In [51]:
len(Dates_feature['month'].unique())

12

In [52]:
len(Dates_feature['day'].unique())

31

In [53]:
len(Dates_feature['hour'].unique())

24
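
The four counts above add up to 13 + 12 + 31 + 24 = 80, which is the number of one-hot columns the encoder produces (`date0` to `date79`). A small sketch of the same check on a hypothetical frame, using `DataFrame.nunique`:

```python
import pandas as pd

# Hypothetical date parts for three records
parts = pd.DataFrame({
    'year': [2015, 2015, 2003],
    'month': [5, 5, 1],
    'day': [13, 13, 1],
    'hour': [23, 23, 0],
})

# Distinct values per column; their sum is the one-hot width
per_column = parts.nunique()
total_onehot_columns = int(per_column.sum())
print(per_column.to_dict(), total_onehot_columns)
```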

In [54]:
date_enc = OneHotEncoder()
date_enc.fit(Dates_feature)


OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)

In [55]:
Dates_feature = date_enc.transform(Dates_feature)

In [56]:
Dates_feature = pd.DataFrame(Dates_feature.toarray(),
                             columns=['date'+str(i) for i in range(Dates_feature.shape[1])],dtype=int)

In [57]:
Dates_feature.head(2)

Unnamed: 0,date0,date1,date2,date3,date4,date5,date6,date7,date8,date9,...,date70,date71,date72,date73,date74,date75,date76,date77,date78,date79
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


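In [None]:
# Label encoding would map the categories to integers: 1, 2, 3

In [None]:
# Integers imply an order (is 3 > 1?), so one-hot vectors are used instead: [1,0,0],[0,1,0],[0,0,1]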

In [45]:
date_enc.inverse_transform(Dates_feature)

array([[2.015e+03, 5.000e+00, 1.300e+01, 2.300e+01],
       [2.015e+03, 5.000e+00, 1.300e+01, 2.300e+01],
       [2.015e+03, 5.000e+00, 1.300e+01, 2.300e+01],
       ...,
       [2.003e+03, 1.000e+00, 1.000e+00, 0.000e+00],
       [2.003e+03, 1.000e+00, 1.000e+00, 0.000e+00],
       [2.003e+03, 1.000e+00, 1.000e+00, 0.000e+00]])
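
To make the round trip above concrete, here is a minimal sketch on a hypothetical two-column frame. It relies only on `fit_transform`, `categories_`, and `inverse_transform`; the output is converted with `.toarray()` because the encoder returns a sparse matrix by default.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical date parts
parts = pd.DataFrame({'month': [1, 5, 5], 'hour': [0, 23, 12]})

enc = OneHotEncoder()
onehot = enc.fit_transform(parts).toarray()

# One block of columns per input column, ordered by sorted category value
print(enc.categories_)
print(onehot)

# inverse_transform maps the indicator columns back to the original values
recovered = enc.inverse_transform(onehot)
print(recovered)
```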

In [58]:
# PdDistrict
D_data.groupby(['PdDistrict'])['Dates'].count()

PdDistrict
BAYVIEW       179022
CENTRAL       171590
INGLESIDE     158929
MISSION       240357
NORTHERN      212313
PARK           99512
RICHMOND       90181
SOUTHERN      314638
TARAVAL       132213
TENDERLOIN    163556
Name: Dates, dtype: int64

In [59]:
PdDistrict_feature = D_data[['PdDistrict']]
PdDistrict_enc = OneHotEncoder()
PdDistrict_enc.fit(PdDistrict_feature)

OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)

In [60]:
PdDistrict_feature = PdDistrict_enc.transform(PdDistrict_feature)

In [61]:
PdDistrict_feature = pd.DataFrame(PdDistrict_feature.toarray(),
                             columns=['PdDistrict'+str(i) for i in range(PdDistrict_feature.shape[1])],dtype=int)
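
For a single string column, `pd.get_dummies` is a shorter route to the same kind of indicator columns, and it names them after the category values. This is a sketch on hypothetical data, not the notebook's method.

```python
import pandas as pd

demo = pd.DataFrame({'PdDistrict': ['NORTHERN', 'BAYVIEW', 'NORTHERN', 'PARK']})

# dtype=int gives 0/1 integers; columns are named <prefix>_<category>
dummies = pd.get_dummies(demo['PdDistrict'], prefix='PdDistrict', dtype=int)
print(dummies)
```

One trade-off: `get_dummies` keeps no fitted state, so `OneHotEncoder` remains the safer choice when new data must be encoded with exactly the same columns later.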

In [62]:
# DayOfWeek
D_data.groupby(['DayOfWeek'])['Dates'].count()
DayOfWeek_feature = D_data[['DayOfWeek']]
DayOfWeek_enc = OneHotEncoder()
DayOfWeek_enc.fit(DayOfWeek_feature)
DayOfWeek_feature = DayOfWeek_enc.transform(DayOfWeek_feature)
DayOfWeek_feature = pd.DataFrame(DayOfWeek_feature.toarray(),
                             columns=['DayOfWeek'+str(i) for i in range(DayOfWeek_feature.shape[1])],dtype=int)

In [63]:
DayOfWeek_feature.head(2)

Unnamed: 0,DayOfWeek0,DayOfWeek1,DayOfWeek2,DayOfWeek3,DayOfWeek4,DayOfWeek5,DayOfWeek6
0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,1


In [64]:
# Address: high-incidence locations
Address= D_data.groupby(['Address'])['Dates'].count().reset_index()

In [65]:
Address.head()

Unnamed: 0,Address,Dates
0,0 Block of HARRISON ST,1
1,0 Block of 10TH AV,5
2,0 Block of 10TH ST,119
3,0 Block of 11TH ST,81
4,0 Block of 12TH AV,20


In [66]:
Address_feature = D_data[['Address']].copy()  # copy avoids SettingWithCopyWarning
# NOTE: `Address` is the grouped frame (one row per unique address), so this assignment
# aligns on its 0..n_unique-1 index. The first rows receive flags for unrelated addresses
# and all remaining rows become NaN. A per-row flag would use D_data['Address'] instead.
Address_feature['Address'] = Address['Address'].str.contains('Block').map(int)


In [67]:
Address_feature = Address_feature.fillna(0)

In [68]:
Address_feature.tail(2)

Unnamed: 0,Address
1762309,0.0
1762310,0.0
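
The trailing `0.0` values above come from `fillna(0)` and not from the addresses themselves, because the flag was computed on the grouped frame. The sketch below, on hypothetical rows, contrasts a per-row flag with the index-aligned result.

```python
import pandas as pd

demo = pd.DataFrame({'Address': ['OAK ST / LAGUNA ST',
                                 '2000 Block of THOMAS AV',
                                 'OAK ST / LAGUNA ST']})

# Per-row flag: 1 for block-level addresses, 0 for intersections
is_block = demo['Address'].str.contains('Block').astype(int)

# Flag computed on the grouped frame: one row per unique address, sorted,
# with its own 0..n_unique-1 index
grouped = demo.groupby('Address').size().reset_index(name='n')
grouped_flag = grouped['Address'].str.contains('Block').astype(int)

# Aligning it to the original index pairs rows by position in the sorted
# unique list, and leaves NaN where the grouped frame has no such index
misaligned = grouped_flag.reindex(demo.index)

print(is_block.tolist())
print(misaligned.tolist())
```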


In [202]:
# X, Y are dropped here. Consider other ways to mine them (think broadly); a separate classifier could also be built on them.
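
One possible way to use the coordinates, offered only as an illustration, is to bin them into grid cells and treat the cell as a categorical feature. The cell size of 0.01 degrees and the column names `x_bin`, `y_bin`, and `grid_cell` are assumptions.

```python
import numpy as np
import pandas as pd

# Coordinates taken from the sample rows shown earlier
demo = pd.DataFrame({'X': [-122.425892, -122.399588, -122.391523],
                     'Y': [37.774599, 37.735051, 37.732432]})

cell = 0.01  # grid size in degrees (assumed value, tune as needed)

# Integer bin index along each axis
demo['x_bin'] = np.floor(demo['X'] / cell).astype(int)
demo['y_bin'] = np.floor(demo['Y'] / cell).astype(int)

# One categorical label per grid cell; nearby points share a label
demo['grid_cell'] = demo['x_bin'].astype(str) + '_' + demo['y_bin'].astype(str)
print(demo['grid_cell'].tolist())
```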

In [69]:
# Combine features
feature_list = [Dates_feature, PdDistrict_feature, DayOfWeek_feature, Address_feature]
all_features = D_data[['Id','Category']]
for frame in feature_list:
    all_features = pd.concat([all_features, frame], axis=1)

In [70]:
train_df = all_features[all_features['Id']==-999].reset_index(drop=True)
test_df = all_features[all_features['Id']!=-999].reset_index(drop=True)

In [71]:
train_df.shape,test_df.shape

((878049, 101), (884262, 101))

In [72]:
l_enc = LabelEncoder()
y_train = l_enc.fit_transform(train_df['Category'])
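
`LabelEncoder` assigns integer codes in sorted order of the class names, and `classes_` and `inverse_transform` map the codes back. A minimal sketch on a hypothetical label list:

```python
from sklearn.preprocessing import LabelEncoder

labels = ['WARRANTS', 'ASSAULT', 'WARRANTS', 'ARSON']

enc = LabelEncoder()
codes = enc.fit_transform(labels)

# classes_ holds the sorted class names; a code is an index into it
print(enc.classes_)
print(codes)
print(enc.inverse_transform([0]))
```

This sorted order is why the probability columns of a fitted classifier line up with the alphabetically sorted category names.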

In [73]:
del train_df['Id']
del train_df['Category']

In [74]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878049 entries, 0 to 878048
Data columns (total 99 columns):
index          878049 non-null int64
date0          878049 non-null int64
date1          878049 non-null int64
date2          878049 non-null int64
date3          878049 non-null int64
date4          878049 non-null int64
date5          878049 non-null int64
date6          878049 non-null int64
date7          878049 non-null int64
date8          878049 non-null int64
date9          878049 non-null int64
date10         878049 non-null int64
date11         878049 non-null int64
date12         878049 non-null int64
date13         878049 non-null int64
date14         878049 non-null int64
date15         878049 non-null int64
date16         878049 non-null int64
date17         878049 non-null int64
date18         878049 non-null int64
date19         878049 non-null int64
date20         878049 non-null int64
date21         878049 non-null int64
date22         878049 non-null int64
...

In [139]:
test_df.head(2)

Unnamed: 0,index,Id,Category,date0,date1,date2,date3,date4,date5,date6,...,PdDistrict8,PdDistrict9,DayOfWeek0,DayOfWeek1,DayOfWeek2,DayOfWeek3,DayOfWeek4,DayOfWeek5,DayOfWeek6,Address
0,878049,0,,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0.0
1,878050,1,,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0.0


In [75]:
del test_df['Id']
del test_df['Category']

In [76]:
train_df.tail()

Unnamed: 0,index,date0,date1,date2,date3,date4,date5,date6,date7,date8,...,PdDistrict8,PdDistrict9,DayOfWeek0,DayOfWeek1,DayOfWeek2,DayOfWeek3,DayOfWeek4,DayOfWeek5,DayOfWeek6,Address
878044,878044,1,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0.0
878045,878045,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0.0
878046,878046,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0.0
878047,878047,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0.0
878048,878048,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0.0


In [77]:
del train_df['index']

In [78]:
del test_df['index']

In [79]:
train_df.head()

Unnamed: 0,date0,date1,date2,date3,date4,date5,date6,date7,date8,date9,...,PdDistrict8,PdDistrict9,DayOfWeek0,DayOfWeek1,DayOfWeek2,DayOfWeek3,DayOfWeek4,DayOfWeek5,DayOfWeek6,Address
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1.0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1.0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1.0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1.0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1.0


### Model selection

In [80]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import log_loss
from sklearn.linear_model import LogisticRegression

In [None]:
# train_df
# y_train

In [81]:
Dtrain,Dvalid,ytrain,yvalid = train_test_split(train_df,y_train, train_size=0.50)
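
The split above is random and unseeded, so the validation score changes between runs. A sketch of a reproducible, class-balanced variant on hypothetical data follows; `stratify` and `random_state` are standard `train_test_split` parameters, but using them here is a suggestion and not part of the original run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, two balanced classes
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

# stratify=y keeps class proportions equal in both halves;
# random_state fixes the shuffle so the split is repeatable
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=0)

print(np.bincount(y_tr), np.bincount(y_va))
```

With 39 categories of very different frequency, stratifying would only work if every category has at least two samples.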



In [82]:
clf = LogisticRegression()
clf.fit(Dtrain,ytrain)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [83]:
predicted = clf.predict_proba(Dvalid)

In [94]:
predicted[0]==np.max(predicted[0])

array([False, False, False, False, False, False, False,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False])

In [89]:
predicted.shape

(439025, 39)

In [84]:
predicted_loss = np.array(predicted)

In [85]:
log_loss(yvalid, predicted_loss)

2.5626849356996733
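
Multi-class log loss is the mean, over samples, of the negative log of the probability assigned to the true class. For reference, a uniform guess over 39 classes scores ln(39) ≈ 3.66, so 2.56 is better than guessing uniformly. The sketch below computes the metric by hand on hypothetical probabilities and compares it with `sklearn.metrics.log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities for 3 classes
y_true = np.array([0, 2, 1])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])

# Pick the probability of the true class in each row, then average -log
manual = -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))
reference = log_loss(y_true, proba, labels=[0, 1, 2])

# Score of a uniform prediction over 39 classes
uniform_39 = np.log(39)

print(manual, reference, uniform_39)
```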

In [86]:
yvalid

array([20, 37,  4, ..., 20, 21, 16])

In [97]:
aa = clf.classes_[predicted[0]==np.max(predicted[0])]

In [98]:
l_enc.inverse_transform(aa)

array(['DRUG/NARCOTIC'], dtype=object)
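
The boolean-mask lookup above handles one row at a time. To get the most likely class for every row at once, `np.argmax` along axis 1 does the same job. A sketch with hypothetical class names and probabilities:

```python
import numpy as np

# Hypothetical class names in sorted order, as a classifier would store them
classes = np.array(['ARSON', 'ASSAULT', 'BURGLARY'])
proba = np.array([[0.2, 0.5, 0.3],
                  [0.6, 0.3, 0.1]])

# Index of the largest probability in each row, mapped to its class name
top = classes[np.argmax(proba, axis=1)]
print(top)
```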

In [225]:
## Tuning C is omitted here
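
For readers who want to try the omitted step, here is a rough sketch of tuning `C` with `GridSearchCV`, scored by negative log loss. The synthetic data, the grid values, and `max_iter=1000` are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic 3-class problem standing in for the real features
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

# Candidate inverse-regularization strengths (assumed grid)
param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}

# neg_log_loss is maximized, so the best score is the least negative
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring='neg_log_loss', cv=3)
search.fit(X, y)

print(search.best_params_, -search.best_score_)
```

On the full dataset each fit is expensive, so searching on a subsample first may be more practical.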

In [51]:
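# Save the results
# predicted_loss holds validation-set probabilities, so predict on test_df for the submission
col_names = np.sort(train['Category'].unique())
result = pd.DataFrame(data=clf.predict_proba(test_df), columns=col_names)


In [53]:
# Assuming test Ids run 0..n-1 in row order, the row index doubles as the Id column
result.to_csv('out_submit.csv', index_label='Id')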
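
As a sanity check before uploading, the submission layout can be sketched on toy data. This assumes the expected file has an `Id` column followed by one probability column per category; the class names and values below are hypothetical.

```python
import numpy as np
import pandas as pd

classes = ['ARSON', 'ASSAULT', 'BURGLARY']  # hypothetical subset of categories
proba = np.array([[0.2, 0.5, 0.3],
                  [0.6, 0.3, 0.1]])
ids = [0, 1]

# Name the index 'Id' so it is written as the first CSV column
submission = pd.DataFrame(proba, columns=classes, index=pd.Index(ids, name='Id'))

# to_csv() without a path returns the CSV text, handy for inspection
csv_text = submission.to_csv()
print(csv_text)

# Each row of probabilities should sum to 1
row_sums = submission.sum(axis=1)
```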