In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import Normalizer, OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
import sklearn.metrics
import copy

from Reader import *

# data understanding, basic feature selection and basic transformations

## data understanding

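- id, name, first, last: self-explanatory
- compas screening date: not entirely clear to me. Google Translate renders it as "screening date"; perhaps it is the time when this record was added to the dataset by the authors?
- sex: sex
- dob: date of birth
- age, age_Cat: age and age bracket
- race: race
- juv_fel_count, juv_misd_count, juv_other_count: unclear to me
- priors_count: number of prior offenses
- days_b_screening_arrest: unclear to me
- c_*: roughly, information about the case
    - decile_score must be removed; otherwise the prediction is guaranteed to be 100% correct
- r_*: information about the re-offense, if this person re-offended
    - r_charge_degree is the charge degree of the re-offense and fully covers the is_recid information
- vr_*: information about the re-offense when it is a violent case
    - v_decile_score fully covers v_score_text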


### basic feature selection

- Criteria
    - Drop features that are obviously useless
    - Features with so many NA values that they would damage the dataset: drop them, or fill with -1
    - For duplicated features, drop one
    - Drop features directly tied to the label

- Useless (dropped outright)
    - id, name, first, last, case_number
    - type_of_assessment, c_charge_desc, v_type_of_assessment
    - r_case_number
- Duplicated features:
    - screen_date, compas screen date
    - violence score, v_score_text
    - r_charge_degree, is_recid
- Too many NA values (dropped)
    - vr_charge_degree
    - c_offense_date
    - c_arrest_date
    - num_r_cases
    - r_days_from_arrest,r_offense_date,r_charge_desc,r_jail_in,r_jail_out,num_vr_cases
    - vr_case_number,vr_offense_date,vr_charge_desc
- Directly tied to the label
    - decile_score
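
The criteria above can be sketched with plain pandas. This is only an illustration on a made-up toy frame, not the actual code in `Reader`; the column values and the choice of `c_days_from_compas` as the column to fill with -1 are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw COMPAS table (all values are made up).
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["a b", "c d", "e f"],
    "age": [25, 40, 33],
    "decile_score": [3, 8, 5],
    "vr_charge_degree": [np.nan, np.nan, "(F3)"],
    "c_days_from_compas": [1.0, np.nan, 0.0],
})

useless = ["id", "name"]             # identifiers carry no signal
too_many_na = ["vr_charge_degree"]   # mostly missing
tied_to_label = ["decile_score"]     # would make the prediction trivial

# Drop everything that matches one of the criteria.
cleaned = raw.drop(columns=useless + too_many_na + tied_to_label)

# Alternative for a sparse column we still want to keep: fill NA with -1.
cleaned = cleaned.fillna({"c_days_from_compas": -1})

print(cleaned)
```

Filling with -1 only works as a sentinel when -1 is not a legitimate value of the column, so it is worth checking the value range first.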
    
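### basic data transformation

- `Reader` converts several time columns to seconds
- `Trainer` converts categorical features / labels
- `out_df.csv` is the data after I removed the features above and converted the time columns; the conversion was done as follows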

```python
reader = Reader('/Users/yee/Desktop/lime/lime_readdata_analysis/data/compas/compas-scores.csv')
basic_data_frame = reader.preprocess_from_raw()
reader.write_to_csv()
```
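
How `Reader` converts time columns to seconds is not shown in this notebook. One plausible way, assuming the columns hold date strings and that "seconds" means seconds since 1970-01-01, is sketched below; the toy values and the `dob_seconds` column name are made up.

```python
import pandas as pd

# Toy date column; the last entry is missing on purpose.
dates = pd.DataFrame({"dob": ["1970-01-01", "1970-01-02", None]})

# Parse strings into timestamps; unparseable or missing entries become NaT.
parsed = pd.to_datetime(dates["dob"], errors="coerce")

# Seconds elapsed since 1970-01-01; NaT turns into NaN.
epoch = pd.Timestamp("1970-01-01")
dates["dob_seconds"] = (parsed - epoch) / pd.Timedelta(seconds=1)

print(dates)
```

After such a conversion, `dob`, `c_jail_in`, `c_jail_out` and `screening_date` become ordinary numeric features that tree models can split on.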

# data importing and basic settings

In [2]:
reader = Reader('./out_df.csv')
basic_data_frame = reader.get_dataframe()
basic_features = ['age', 'sex', 'race', 'priors_count',
                       'screening_date', 'c_jail_in', 'juv_other_count',
                       'dob', 'c_days_from_compas', 'juv_misd_count',
                       'juv_fel_count', 'c_jail_out', 'v_decile_score',
                       'days_b_screening_arrest', 'is_violent_recid',
                       'r_charge_degree', 'c_charge_degree']
special_features = ['age', 'sex', 'race', 'priors_count']
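
`Trainer` comes from `Reader.py` and its code is not included here. Judging from the logs further down (one-hot column names such as `sex_Female`, a `High`/`Low`/`Medium` class mapping, a mean accuracy), a pipeline of roughly the following shape would produce similar output. This is a guess on synthetic data, not the real implementation; the label rule, split ratio and `n_estimators` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(233)
n = 400
toy = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "sex": rng.choice(["Male", "Female"], n),
    "priors_count": rng.integers(0, 15, n),
})
# Made-up label rule so that the toy problem is learnable.
toy["score_text"] = np.where(toy["priors_count"] > 9, "High",
                             np.where(toy["priors_count"] > 4, "Medium", "Low"))

# One-hot encode categorical features: sex -> sex_Female, sex_Male.
X = pd.get_dummies(toy[["age", "sex", "priors_count"]])

# Encode string labels as integers; classes are sorted alphabetically.
encoder = LabelEncoder()
y = encoder.fit_transform(toy["score_text"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=233)

model = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=233)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("class mapping:", dict(enumerate(encoder.classes_)))
print(f"mean accuracy: {accuracy:.3f}")
print(f"oob score: {model.oob_score_:.3f}")
```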



# feature selection

## what if includes all features

In [7]:
"""
random forest

features = ['age', 'sex', 'race', 'priors_count',
                       'screening_date', 'c_jail_in', 'juv_other_count',
                       'dob', 'c_days_from_compas', 'juv_misd_count',
                       'juv_fel_count', 'c_jail_out', 'v_decile_score',
                       'days_b_screening_arrest', 'is_violent_recid',
                       'r_charge_degree', 'c_charge_degree']
"""
basic_settings = {
    'n_estimators': 500,
    'oob_score': True,
    'random_state': 233
}
features_selected = copy.deepcopy(basic_features)

trainer = Trainer(basic_data_frame,
                      label=['score_text'],
                      features=features_selected,
                      method='rf',
                      **basic_settings)

trainer.train_pipeline()

[Reader]:	INFO		Dataframe statisticals: 
[Reader]:	INFO		length of the dataframe: 10568
[Reader]:	INFO		features: ['age', 'sex', 'race', 'priors_count', 'screening_date', 'c_jail_in', 'juv_other_count', 'dob', 'c_days_from_compas', 'juv_misd_count', 'juv_fel_count', 'c_jail_out', 'v_decile_score', 'days_b_screening_arrest', 'is_violent_recid', 'r_charge_degree', 'c_charge_degree']
[Reader]:	INFO		label: ['score_text']
[Reader]:	INFO		method using: rf
[Reader]:	INFO		params: 
[Reader]:	INFO		n_estimators = 500
[Reader]:	INFO		oob_score = True
[Reader]:	INFO		random_state = 233
[Reader]:	INFO		label classes: ['High' 'Low' 'Medium']
[Reader]:	INFO		class mapping: 
[Reader]:	INFO		0: High
[Reader]:	INFO		1: Low
[Reader]:	INFO		2: Medium
[Reader]:	INFO		output results:



mean accuracy: 0.734

feature importances: 
('v_decile_score', 0.25289846025598034)
('priors_count', 0.1122921669420839)
('dob', 0.10612064578348127)
('c_jail_out', 0.08179007036966045)
('c_jail_in', 0.0784346190270801)
('screening_date', 0.0768131122825479)
('age', 0.06748746523372101)
('days_b_screening_arrest', 0.03565219097830756)
('c_days_from_compas', 0.03513396914332562)
('race_African-American', 0.020412464712010418)
('r_charge_degree_O', 0.015794074154735313)
('juv_fel_count', 0.011695651209598592)
('juv_misd_count', 0.010363602516442397)
('r_charge_degree_F', 0.010223102778425548)
('c_charge_degree_M', 0.009906857381799887)
('sex_Female', 0.009822240389255305)
('c_charge_degree_F', 0.009760214287526991)
('sex_Male', 0.009741786220741556)
('juv_other_count', 0.009687924898213547)
('race_Caucasian', 0.009615080038673469)
('r_charge_degree_M', 0.008752819105106597)
('is_violent_recid', 0.006949791376165062)
('race_Hispanic', 0.004817922490260319)
('race_Other', 0.00424183154007

In [15]:
"""
gradient boosting tree

features = ['age', 'sex', 'race', 'priors_count',
                       'screening_date', 'c_jail_in', 'juv_other_count',
                       'dob', 'c_days_from_compas', 'juv_misd_count',
                       'juv_fel_count', 'c_jail_out', 'v_decile_score',
                       'days_b_screening_arrest', 'is_violent_recid',
                       'r_charge_degree', 'c_charge_degree']
"""
basic_settings = {
    'n_estimators': 200,
    'learning_rate': 0.1,
    'random_state': 233
}
features_selected = copy.deepcopy(basic_features)

trainer = Trainer(basic_data_frame,
                      label=['score_text'],
                      features=features_selected,
                      method='gbt',
                      **basic_settings)

trainer.train_pipeline()

[Reader]:	INFO		Dataframe statisticals: 
[Reader]:	INFO		length of the dataframe: 10568
[Reader]:	INFO		features: ['age', 'sex', 'race', 'priors_count', 'screening_date', 'c_jail_in', 'juv_other_count', 'dob', 'c_days_from_compas', 'juv_misd_count', 'juv_fel_count', 'c_jail_out', 'v_decile_score', 'days_b_screening_arrest', 'is_violent_recid', 'r_charge_degree', 'c_charge_degree']
[Reader]:	INFO		label: ['score_text']
[Reader]:	INFO		method using: gbt
[Reader]:	INFO		params: 
[Reader]:	INFO		n_estimators = 200
[Reader]:	INFO		learning_rate = 0.1
[Reader]:	INFO		random_state = 233
[Reader]:	INFO		label classes: ['High' 'Low' 'Medium']
[Reader]:	INFO		class mapping: 
[Reader]:	INFO		0: High
[Reader]:	INFO		1: Low
[Reader]:	INFO		2: Medium



mean accuracy: 0.742


feature importances: 
('v_decile_score', 0.6650116281498192)
('priors_count', 0.1491611190666305)
('dob', 0.06987405678286833)
('c_jail_out', 0.02266896839023569)
('c_jail_in', 0.014017446646820553)
('c_days_from_compas', 0.01351572740840179)
('screening_date', 0.011681783846949186)
('days_b_screening_arrest', 0.010191709912613044)
('sex_Male', 0.0067150587003542836)
('sex_Female', 0.005482087070743126)
('r_charge_degree_F', 0.0043354412159372835)
('juv_fel_count', 0.0038557656141677333)
('race_African-American', 0.0036247367666514267)
('age', 0.003303773648923544)
('race_Other', 0.003001127096838884)
('juv_other_count', 0.0028909429442577307)
('c_charge_degree_M', 0.0026983560327949537)
('r_charge_degree_O', 0.0017336157149900025)
('juv_misd_count', 0.001216490232802859)
('c_charge_degree_F', 0.0011896358927140868)
('race_Hispanic', 0.0010226281011108747)
('is_violent_recid', 0.0009332524367155736)
('race_Caucasian', 0.0006732039668835774)
('c_charge_degree_O',

## section conclusion

gbt and random forest both expose a `feature_importances_` attribute. As shown above, regardless of interpretability, `v_decile_score`, `priors_count`, `dob`, `c_jail_out`, `c_jail_in`, `days_b_screening_arrest` appear to be discriminative in both the random forest and the gradient boosting classifier.

- `v_decile_score` makes sense, because it directly reflects how violent the person is assessed to be.
- `priors_count` makes sense, since someone who has committed more crimes before tends to commit more in the future.
- `dob` (date of birth), `c_jail_in`, `c_jail_out`, `days_b_screening_arrest` are all date-related data, and their importance confuses me a lot.

As for our 4 special features, `age`, `sex`, `race`, `priors_count`: `priors_count` has shown its predictive power. From rf's point of view, `age` and `race_African-American` appear useful. From gbt's point of view, however, `sex` and `race_African-American` seem to be more useful.

From a human's point of view, the `r_*` features should be more useful, because they directly indicate the potential for recidivism. So next I will try these human-understanding features.
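
For reference, this is how the importance rankings can be read from scikit-learn estimators directly. It is a minimal sketch on synthetic data with made-up feature names (`strong`, `weak`, `noise`), not the `Trainer` code.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

rng = np.random.default_rng(233)
X = rng.normal(size=(500, 3))
# The label depends strongly on column 0, weakly on column 1, not on column 2.
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(int)
names = ["strong", "weak", "noise"]

models = {
    "rf": RandomForestClassifier(n_estimators=100, random_state=233),
    "gbt": GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                      random_state=233),
}

rankings = {}
for label, model in models.items():
    model.fit(X, y)
    # feature_importances_ is impurity-based and sums to 1 per model.
    ranked = sorted(zip(names, model.feature_importances_),
                    key=lambda pair: pair[1], reverse=True)
    rankings[label] = ranked
    print(label, ranked)
```

Because the values are normalised within each model, the rankings of rf and gbt can be compared, but the raw numbers of one model should not be read against those of the other.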

## human-understanding features

- 'age', 'sex', 'race', 'priors_count'
- plus 'v_decile_score', 'r_charge_degree', 'c_charge_degree', 'is_violent_recid'

In [17]:
"""
random forest

features = ['age', 'sex', 'race', 'priors_count', 
                'v_decile_score', 'r_charge_degree', 'c_charge_degree', 'is_violent_recid']
"""

basic_settings = {
    'n_estimators': 500,
    'oob_score': True,
    'random_state': 233
}
add_features = ['v_decile_score', 'r_charge_degree', 'c_charge_degree', 'is_violent_recid']

features_selected = special_features + add_features

trainer = Trainer(basic_data_frame,
                      label=['score_text'],
                      features=features_selected,
                      method='rf',
                      **basic_settings)

trainer.train_pipeline()

[Reader]:	INFO		Dataframe statisticals: 
[Reader]:	INFO		length of the dataframe: 10568
[Reader]:	INFO		features: ['age', 'sex', 'race', 'priors_count', 'v_decile_score', 'r_charge_degree', 'c_charge_degree', 'is_violent_recid']
[Reader]:	INFO		label: ['score_text']
[Reader]:	INFO		method using: rf
[Reader]:	INFO		params: 
[Reader]:	INFO		n_estimators = 500
[Reader]:	INFO		oob_score = True
[Reader]:	INFO		random_state = 233
[Reader]:	INFO		label classes: ['High' 'Low' 'Medium']
[Reader]:	INFO		class mapping: 
[Reader]:	INFO		0: High
[Reader]:	INFO		1: Low
[Reader]:	INFO		2: Medium
[Reader]:	INFO		output results:



mean accuracy: 0.702

feature importances: 
('v_decile_score', 0.35237466281718177)
('age', 0.2749843206964959)
('priors_count', 0.21647902159505056)
('race_African-American', 0.024944005141697995)
('r_charge_degree_O', 0.0191377494352166)
('is_violent_recid', 0.01399404977543959)
('r_charge_degree_F', 0.01270901328901669)
('race_Caucasian', 0.012244401736324604)
('c_charge_degree_M', 0.011702225867415407)
('c_charge_degree_F', 0.011404337176615147)
('sex_Female', 0.011184946664489503)
('sex_Male', 0.011161045823855183)
('r_charge_degree_M', 0.010872032172717512)
('race_Hispanic', 0.007182761298853448)
('race_Other', 0.006773504580979071)
('race_Asian', 0.0012781391782794224)
('race_Native American', 0.001071711059610562)
('c_charge_degree_O', 0.0005020716907605248)

oob score: 
0.6980127750177431

f1 score results
High : 0.611
Low : 0.837
Medium : 0.437

thorough output
{'High': {'f1-score': 0.6106666666666667,
          'precision': 0.6122994652406417,
          'recall': 0.60904255

In [18]:
"""
gbt

features = ['age', 'sex', 'race', 'priors_count', 
                'v_decile_score', 'r_charge_degree', 'c_charge_degree', 'is_violent_recid']
"""

basic_settings = {
    'n_estimators': 200,
    'learning_rate': 0.1,
    'random_state': 233
}

add_features = ['v_decile_score', 'r_charge_degree', 'c_charge_degree', 'is_violent_recid']

features_selected = special_features + add_features

trainer = Trainer(basic_data_frame,
                      ['score_text'],
                      features_selected,
                      method='gbt',
                      **basic_settings)

trainer.train_pipeline()

[Reader]:	INFO		Dataframe statisticals: 
[Reader]:	INFO		length of the dataframe: 10568
[Reader]:	INFO		features: ['age', 'sex', 'race', 'priors_count', 'v_decile_score', 'r_charge_degree', 'c_charge_degree', 'is_violent_recid']
[Reader]:	INFO		label: ['score_text']
[Reader]:	INFO		method using: gbt
[Reader]:	INFO		params: 
[Reader]:	INFO		n_estimators = 200
[Reader]:	INFO		learning_rate = 0.1
[Reader]:	INFO		random_state = 233
[Reader]:	INFO		label classes: ['High' 'Low' 'Medium']
[Reader]:	INFO		class mapping: 
[Reader]:	INFO		0: High
[Reader]:	INFO		1: Low
[Reader]:	INFO		2: Medium



mean accuracy: 0.738


feature importances: 
('v_decile_score', 0.7158739686603106)
('priors_count', 0.17357952586219455)
('age', 0.06767966929747737)
('sex_Male', 0.007627239436691127)
('sex_Female', 0.006592922327784828)
('r_charge_degree_F', 0.005058026368163863)
('race_African-American', 0.004523396315772097)
('race_Other', 0.003669347723864177)
('c_charge_degree_M', 0.0033158737539579395)
('race_Hispanic', 0.0022279110408298937)
('r_charge_degree_M', 0.0018493360453548794)
('race_Caucasian', 0.0017023852347622611)
('r_charge_degree_O', 0.0015856016511627395)
('c_charge_degree_F', 0.0015221708992638327)
('is_violent_recid', 0.0011462171631467753)
('c_charge_degree_O', 0.0010444790917548466)
('race_Asian', 0.0005318255719810834)
('race_Native American', 0.00047010355552701375)

f1 score results
High : 0.65
Low : 0.864
Medium : 0.5

thorough output
{'High': {'f1-score': 0.649859943977591,
          'precision': 0.6863905325443787,
          'recall': 0.6170212765957447,
          's

## section conclusion

As shown above, gbt barely suffers (0.742 to 0.738), whereas rf's accuracy drops by about 3 percentage points (0.734 to 0.702).
For both of them, `v_decile_score`, `priors_count`, `age`, `r_charge_degree` appear important.
For rf, `race_African-American` contributes more, while `sex` contributes more for gbt.

Next I will simply use the special features plus the 4 most important features from the first section that are not already among them: `v_decile_score`, `dob`, `c_jail_out`, `c_jail_in`.
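
The "f1 score results" and "thorough output" blocks above have the same shape as scikit-learn's `classification_report` with `output_dict=True`. Assuming that is what `Trainer` uses (not confirmed by this notebook), per-class numbers can be pulled out like this; the toy labels are made up.

```python
from sklearn.metrics import classification_report

# Tiny made-up example with the same three classes as score_text.
y_true = ["High", "Low", "Low", "Medium", "High", "Low"]
y_pred = ["High", "Low", "Medium", "Medium", "Low", "Low"]

# output_dict=True returns nested dicts instead of a formatted string.
report = classification_report(y_true, y_pred, output_dict=True)

for cls in ["High", "Low", "Medium"]:
    print(f"{cls} : {report[cls]['f1-score']:.3f}")
```

Looking at the per-class F1 instead of only the mean accuracy matters here because `Low` has far more samples than `High` and `Medium`.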

## machine-agreed features

In [19]:
"""
random forest

features = ['age', 'sex', 'race', 'priors_count', 
                'v_decile_score', 'dob', 'c_jail_out', 'c_jail_in']
"""

basic_settings = {
    'n_estimators': 500,
    'oob_score': True,
    'random_state': 233
}
add_features = ['v_decile_score', 'dob', 'c_jail_out', 'c_jail_in']

features_selected = special_features + add_features

trainer = Trainer(basic_data_frame,
                      label=['score_text'],
                      features=features_selected,
                      method='rf',
                      **basic_settings)

trainer.train_pipeline()

[Reader]:	INFO		Dataframe statisticals: 
[Reader]:	INFO		length of the dataframe: 10568
[Reader]:	INFO		features: ['age', 'sex', 'race', 'priors_count', 'v_decile_score', 'dob', 'c_jail_out', 'c_jail_in']
[Reader]:	INFO		label: ['score_text']
[Reader]:	INFO		method using: rf
[Reader]:	INFO		params: 
[Reader]:	INFO		n_estimators = 500
[Reader]:	INFO		oob_score = True
[Reader]:	INFO		random_state = 233
[Reader]:	INFO		label classes: ['High' 'Low' 'Medium']
[Reader]:	INFO		class mapping: 
[Reader]:	INFO		0: High
[Reader]:	INFO		1: Low
[Reader]:	INFO		2: Medium
[Reader]:	INFO		output results:



mean accuracy: 0.728

feature importances: 
('v_decile_score', 0.2860628081934829)
('dob', 0.15749160283295874)
('priors_count', 0.14144590050226213)
('c_jail_out', 0.13501084410975747)
('c_jail_in', 0.13199520079928667)
('age', 0.08893884015362717)
('race_African-American', 0.020265528494757427)
('race_Caucasian', 0.009766934351479593)
('sex_Male', 0.00847832491898889)
('sex_Female', 0.008411285907089174)
('race_Hispanic', 0.005388107262405248)
('race_Other', 0.005089413321422308)
('race_Asian', 0.0008794403186428488)
('race_Native American', 0.0007757688338393832)

oob score: 
0.7281760113555713

f1 score results
High : 0.662
Low : 0.85
Medium : 0.475

thorough output
{'High': {'f1-score': 0.6621438263229308,
          'precision': 0.6759002770083102,
          'recall': 0.648936170212766,
          'support': 376},
 'Low': {'f1-score': 0.8502415458937198,
         'precision': 0.8167053364269141,
         'recall': 0.8866498740554156,
         'support': 1191},
 'Medium': {'f1-scor

In [21]:
"""
gbt

features = ['age', 'sex', 'race', 'priors_count', 
                'v_decile_score', 'dob', 'c_jail_out', 'c_jail_in']
"""

basic_settings = {
    'n_estimators': 200,
    'learning_rate': 0.1,
    'random_state': 233
}

add_features = ['v_decile_score', 'dob', 'c_jail_out', 'c_jail_in']

features_selected = special_features + add_features

trainer = Trainer(basic_data_frame,
                      ['score_text'],
                      features_selected,
                      method='gbt',
                      **basic_settings)

trainer.train_pipeline()

[Reader]:	INFO		Dataframe statisticals: 
[Reader]:	INFO		length of the dataframe: 10568
[Reader]:	INFO		features: ['age', 'sex', 'race', 'priors_count', 'v_decile_score', 'dob', 'c_jail_out', 'c_jail_in']
[Reader]:	INFO		label: ['score_text']
[Reader]:	INFO		method using: gbt
[Reader]:	INFO		params: 
[Reader]:	INFO		n_estimators = 200
[Reader]:	INFO		learning_rate = 0.1
[Reader]:	INFO		random_state = 233
[Reader]:	INFO		label classes: ['High' 'Low' 'Medium']
[Reader]:	INFO		class mapping: 
[Reader]:	INFO		0: High
[Reader]:	INFO		1: Low
[Reader]:	INFO		2: Medium



mean accuracy: 0.744


feature importances: 
('v_decile_score', 0.6800452323577436)
('priors_count', 0.15545026545723983)
('dob', 0.08114737160029256)
('c_jail_out', 0.03157787292468014)
('c_jail_in', 0.027084674471178526)
('sex_Female', 0.006991025595693912)
('sex_Male', 0.0051799981044141664)
('race_African-American', 0.0038023976346034665)
('race_Other', 0.002797629789383512)
('age', 0.002752139736258962)
('race_Hispanic', 0.001457614355449864)
('race_Caucasian', 0.0012203269987157248)
('race_Native American', 0.00029516216276986873)
('race_Asian', 0.0001982888115757919)

f1 score results
High : 0.669
Low : 0.863
Medium : 0.516

thorough output
{'High': {'f1-score': 0.669467787114846,
          'precision': 0.7071005917159763,
          'recall': 0.6356382978723404,
          'support': 376},
 'Low': {'f1-score': 0.8629524196827979,
         'precision': 0.8367507886435331,
         'recall': 0.890848026868178,
         'support': 1191},
 'Medium': {'f1-score': 0.5156398104265403,


## section conclusion

In this section, I add 4 machine-agreed features: `v_decile_score`, `dob`, `c_jail_out`, `c_jail_in`.
With them, both rf and gbt get better results than with the human-understanding features (rf: 0.702 to 0.728, gbt: 0.738 to 0.744), and the gain is larger for rf.
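
The comparison across sections boils down to fitting the same two models on different feature subsets with the same split. A compact way to run such a comparison is sketched below on synthetic data; the feature names and the two subsets are hypothetical and only illustrate the loop.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(233)
n = 600
toy = pd.DataFrame({
    "signal_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
# The label needs both signal columns to be predicted well.
y = (toy["signal_a"] + toy["signal_b"] > 0).astype(int)

feature_sets = {
    "set_1": ["signal_a", "noise"],
    "set_2": ["signal_a", "signal_b"],
}

results = {}
for set_name, cols in feature_sets.items():
    # Same random_state, so every feature set sees the same train/test rows.
    X_train, X_test, y_train, y_test = train_test_split(
        toy[cols], y, test_size=0.2, random_state=233)
    candidates = {
        "rf": RandomForestClassifier(n_estimators=100, random_state=233),
        "gbt": GradientBoostingClassifier(n_estimators=100, random_state=233),
    }
    for method, model in candidates.items():
        model.fit(X_train, y_train)
        results[(set_name, method)] = model.score(X_test, y_test)

for key, acc in results.items():
    print(key, f"{acc:.3f}")
```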

# conclusions

gbt and random forest both expose a `feature_importances_` attribute. As shown above, regardless of interpretability, `v_decile_score`, `priors_count`, `dob`, `c_jail_out`, `c_jail_in`, `days_b_screening_arrest` appear to be discriminative in both the random forest and the gradient boosting classifier.

- `v_decile_score` makes sense, because it directly reflects how violent the person is assessed to be.
- `priors_count` makes sense, since someone who has committed more crimes before tends to commit more in the future.
- `dob` (date of birth), `c_jail_in`, `c_jail_out`, `days_b_screening_arrest` are all date-related data, and their importance confuses me a lot.

As for our 4 special features, `age`, `sex`, `race`, `priors_count`: `priors_count` has shown its predictive power. From rf's point of view, `age` and `race_African-American` appear useful. From gbt's point of view, however, `sex` and `race_African-American` seem to be more useful.

From a human's point of view, `v_decile_score`, `r_charge_degree`, `c_charge_degree`, `is_violent_recid` should be more useful, because they directly indicate the potential for recidivism. However, apart from `v_decile_score`, they do not appear more important than the machine-agreed features `dob`, `c_jail_out`, `c_jail_in`.
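
Since `sex` and `race` are split into several one-hot columns, comparing them as whole features requires adding the dummy columns back together. The sketch below does this for the gbt importances of the machine-agreed run (values copied from the output above and rounded to 6 decimals). Summing impurity-based importances is only a rough heuristic, and the grouping by name prefix is my own assumption.

```python
from collections import defaultdict

# gbt importances from the machine-agreed run, rounded to 6 decimals.
importances = {
    "v_decile_score": 0.680045,
    "priors_count": 0.155450,
    "dob": 0.081147,
    "c_jail_out": 0.031578,
    "c_jail_in": 0.027085,
    "sex_Female": 0.006991,
    "sex_Male": 0.005180,
    "race_African-American": 0.003802,
    "race_Other": 0.002798,
    "age": 0.002752,
    "race_Hispanic": 0.001458,
    "race_Caucasian": 0.001220,
    "race_Native American": 0.000295,
    "race_Asian": 0.000198,
}
categorical = ["sex", "race"]

# Map each one-hot column back to its parent feature and sum the importances.
grouped = defaultdict(float)
for column, value in importances.items():
    parent = next((c for c in categorical if column.startswith(c + "_")), column)
    grouped[parent] += value

for name, value in sorted(grouped.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.6f}")
```

Grouped this way, `sex` as a whole still ranks slightly above `race` for gbt, which is in line with the reading above.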