## References:
pickle usage: http://www.cnblogs.com/pzxbc/archive/2012/03/18/2404715.html

sklearn: http://www.cnblogs.com/jasonfreak/p/5448462.html


Data and features determine the upper bound of machine learning; models and algorithms merely approach that bound.

---
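## Feature engineering: a worked example


The IRIS dataset in sklearn is used to illustrate the feature-processing functions. The IRIS dataset was compiled by Fisher in 1936 and contains 4 features (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width). All feature values are positive floats measured in centimetres. The target is the species of iris (Iris Setosa, Iris Versicolour, Iris Virginica). The dataset is loaded as follows: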

In [1]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder
iris = load_iris()


In [2]:
iris.data[0:10]

array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2],
       [ 5.4,  3.9,  1.7,  0.4],
       [ 4.6,  3.4,  1.4,  0.3],
       [ 5. ,  3.4,  1.5,  0.2],
       [ 4.4,  2.9,  1.4,  0.2],
       [ 4.9,  3.1,  1.5,  0.1]])

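The L0 norm is the number of non-zero elements in a vector.

The L1 norm is the sum of the absolute values of the elements of a vector.

The L2 norm is the square root of the sum of the squared elements of a vector.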
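
The three definitions can be checked numerically. The sketch below assumes NumPy is installed and uses a made-up vector `v` (it is not taken from the IRIS data). Note that the L0 "norm" is a count rather than a norm in the strict mathematical sense.

```python
import numpy as np

# Hypothetical example vector, chosen so the results are easy to verify by hand.
v = np.array([3.0, 0.0, -4.0])

l0 = int(np.count_nonzero(v))        # number of non-zero elements
l1 = float(np.sum(np.abs(v)))        # sum of absolute values
l2 = float(np.sqrt(np.sum(v ** 2)))  # square root of the sum of squares

print(l0, l1, l2)  # 2 7.0 5.0
```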

### 1. Preprocessing

In [3]:
# Standardization; returns the standardized data
from sklearn.preprocessing import StandardScaler
StandardScaler().fit_transform(iris.data);

In [4]:
# Range scaling; returns the data scaled to the interval [0, 1]
from sklearn.preprocessing import MinMaxScaler
MinMaxScaler().fit_transform(iris.data);

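The difference between standardization and normalization

In short, standardization processes the feature matrix column by column: it converts each feature to its z-score so that all features share a common scale. Normalization processes the feature matrix row by row: its purpose is to give sample vectors a uniform scale when similarity is computed with a dot product or another kernel function, i.e. every sample is converted to a "unit vector". With the l2 rule, a sample $x = (x_1, \dots, x_m)$ is mapped to

$$x' = \frac{x}{\sqrt{\sum_{j=1}^{m} x_j^2}}$$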

In [5]:
# Normalization; returns the normalized data
from sklearn.preprocessing import Normalizer
Normalizer().fit_transform(iris.data);
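
To make the column-versus-row distinction concrete, here is a minimal NumPy sketch that performs both operations by hand on a made-up 3x2 matrix `X` (an assumption for illustration; it does not call sklearn). Standardized columns should end up with mean 0 and standard deviation 1, while l2-normalized rows should have length 1.

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Standardization (z-score): statistics are computed per column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# l2 normalization: each row is divided by its own Euclidean length.
row_norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
X_norm = X / row_norms

col_means = X_std.mean(axis=0)                     # expected to be ~0 per column
col_stds = X_std.std(axis=0)                       # expected to be ~1 per column
unit_lengths = np.sqrt((X_norm ** 2).sum(axis=1))  # expected to be ~1 per row
print(col_means, col_stds, unit_lengths)
```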

### 2. Data transformation

In [6]:
from sklearn.preprocessing import PolynomialFeatures
# Polynomial transformation
PolynomialFeatures().fit_transform(iris.data);
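
With its default settings, `PolynomialFeatures` produces every monomial of degree 0, 1 and 2. The pure-Python sketch below imitates that expansion for one sample; `poly_features_degree2` is a hypothetical helper written for illustration, not part of sklearn. For the 4 IRIS features it yields 1 + 4 + 10 = 15 columns.

```python
from itertools import combinations_with_replacement
from math import prod

def poly_features_degree2(row):
    """Return all monomials of degree 0, 1 and 2 for one sample."""
    features = []
    for degree in range(3):
        # Each combination picks `degree` feature indices, repeats allowed.
        for combo in combinations_with_replacement(range(len(row)), degree):
            features.append(prod(row[i] for i in combo))
    return features

sample = [2.0, 3.0]                       # hypothetical sample (a, b)
expanded = poly_features_degree2(sample)  # [1, a, b, a^2, a*b, b^2]
n_columns_for_iris = len(poly_features_degree2([0.0] * 4))

print(expanded)            # [1, 2.0, 3.0, 4.0, 6.0, 9.0]
print(n_columns_for_iris)  # 15
```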

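### 3. Feature selection

Whether the feature varies: if a feature does not vary, e.g. its variance is close to 0, the samples barely differ on it, so it is of little use for telling samples apart.

Correlation between feature and target: this one is fairly obvious; features that correlate strongly with the target should be selected first.

**By the form the selection takes, feature selection methods fall into 3 groups:**

Filter: scores each feature by its variance or by its correlation with the target, then selects features using a threshold or a fixed number of features to keep.

Wrapper: guided by an objective function (usually a prediction score), selects or excludes several features at a time.

Embedded: first trains some machine learning algorithm or model to obtain a weight coefficient for each feature, then selects features in descending order of coefficient. It resembles the Filter method, except that the quality of a feature is determined through training.

*The feature_selection library in sklearn is used for feature selection.*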
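
As a minimal sketch of the Filter approach, the two `feature_selection` classes below cover the two criteria named above: `VarianceThreshold` filters by spread, and `SelectKBest` with the `chi2` score filters by relevance to the target. The threshold of 0.5 and `k=2` are illustrative choices, not values taken from this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2

iris = load_iris()

# Filter by spread: drop features whose variance is below the (illustrative) threshold.
X_var = VarianceThreshold(threshold=0.5).fit_transform(iris.data)

# Filter by relevance: keep the k features with the highest chi-squared score.
X_chi2 = SelectKBest(chi2, k=2).fit_transform(iris.data, iris.target)

print(X_var.shape, X_chi2.shape)
```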

---
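# Assignment

Personal understanding:
1. The data does not come with a feature matrix, so one has to be constructed from the data first
2. There are no missing values and the dimensionality is low, so dimensionality reduction is essentially unnecessary; scaling is not needed either, because the feature values are all generated by us. Preprocessing still has to deal with things like noise

### 1. Feature construction (start from the features other people have built)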
1.1. Count of vowels in word

1.2. The vowels in word

1.3. The phonemes before each vowel 

1.4. The phonemes after each vowel 

1.5. prefix in word

1.6. suffix in word (found to be useless: its variance is 0)

In [7]:
import helper
import pandas as pd
import numpy as np
import sklearn

import matplotlib.pyplot as plt

training_data = helper.read_data('./asset/training_data.txt')
testing_data = helper.read_data('./asset/tiny_test.txt')

# Define the vowel and consonant lists
vowel = "AA, AE, AH, AO, AW, AY, EH, ER, EY, IH, IY, OW, OY, UH, UW".replace(",","").split()
consonant = "P, B, CH, D, DH, F, G, HH, JH, K, L, M, N, NG, R, S, SH, T, TH, V, W, Y, Z, ZH".replace(",","").split()

# Combined list of vowels and consonants
vowelAndconsonan = vowel + consonant

In [8]:
print(testing_data) # the test data is tiny, so the 50000 training samples are used for cross-validation

['DATA:D EY T AH', 'MINING:M AY N IH NG', 'MACHINE:M AH SH IY N', 'LEARNING:L ER N IH NG']


In [9]:
# Build a DataFrame from the data that was read in
labels = ['Word', 'Pronunciation']
df_train = pd.DataFrame.from_records([(x.split(':')[0], x.split(':')[1]) for x in training_data], columns=labels)
df_train.head()

| | Word | Pronunciation |
|---|---|---|
| 0 | COED | K OW1 EH2 D |
| 1 | PURVIEW | P ER1 V Y UW2 |
| 2 | HEHIR | HH EH1 HH IH0 R |
| 3 | MUSCLING | M AH1 S AH0 L IH0 NG |
| 4 | NONPOISONOUS | N AA0 N P OY1 Z AH0 N AH0 S |


In [10]:
# Prefix and suffix detection, taken directly from someone else's code
def s_has_pre(s):
    pre = "an,dis,in,ig,il,im,ir,ne,n,non,neg,un,male,mal,pseudo,mis,de,\
    un,anti,ant,contra,contre,contro,counter,ob,oc,of,op,with,by,circum,\
    circu,de,en,ex,ec,es,fore,in,il,im,ir,inter,intel,intro,medi,med,mid,out,\
    over,post,pre,pro,sub,suc,suf,sug,sup,sur,sus,sur,trans,under,up,\
    ante,anti,ex,fore,mid,medi,post,pre,pri,out,over,post,pre,pro,sub,suc,suf,\
    sug,sum,sup,sur,sus,super,sur,trans,under,up,ante,anti,ex,fore,mid,medi,post,\
    pre,pri,pro,re,by,extra,hyper,out,over,sub,suc,sur,super,sur,under,vice,com,\
    cop,con,cor,co,syn,syl,sym,al,over,pan,ex,for,re,se,dia,per,pel,trans,ad,\
    ac,af,ag,an,ap,ar,as,at,ambi,bin,di,twi,tri,thir,deca,deco,dec,deci,hecto,\
    hect,centi,kilo,myria,mega,micro,multi,poly,hemi,demi,semi,pene,arch,auto,bene,\
    eu,male,mal,macro,magni,micro,aud,bio,ge,phon,tele,\
    ac,ad,af,ag,al,an,ap,as,at,an,ab,abs,acer,acid,acri,act,ag,acu,aer,aero,ag,agi,\
    ig,act,agri,agro,alb,albo,ali,allo,alter,alt,am,ami,amor,ambi,ambul,ana,ano,andr,\
    andro,ang,anim,ann,annu,enni,ante,anthrop,anti,ant,anti,antico,apo,ap,aph,aqu,arch,\
    aster,astr,auc,aug,aut,aud,audi,aur,aus,aug,auc,aut,auto,bar,be,belli,bene,bi,bine,\
    bibl,bibli,biblio,bio,bi,brev,cad,cap,cas,ceiv,cept,capt,cid,cip,cad,cas,calor,capit,\
    capt,carn,cat,cata,cath,caus,caut,cause,cuse,cus,ceas,ced,cede,ceed,cess,cent,centr,\
    centri,chrom,chron,cide,cis,cise,circum,cit,civ,clam,claim,clin,clud,clusclaus,co,cog,\
    col,coll,con,com,cor,cogn,gnos,com,con,contr,contra,counter,cord,cor,cardi,corp,cort,\
    cosm,cour,cur,curr,curs,crat,cracy,cre,cresc,cret,crease,crea,cred,cresc,cret,crease,\
    cru,crit,cur,curs,cura,cycl,cyclo,de,dec,deca,dec,dign,dei,div,dem,demo,dent,dont,derm,\
    di,dy,dia,dic,dict,dit,dis,dif,dit,doc,doct,domin,don,dorm,dox,duc,duct,dura,dynam,dys,\
    ec,eco,ecto,en,em,end,epi,equi,erg,ev,et,ex,exter,extra,extro,fa,fess,fac,fact,fec,fect,\
    fic,fas,fea,fall,fals,femto,fer,fic,feign,fain,fit,feat,fid,fid,fide,feder,fig,fila,fili,\
    fin,fix,flex,flect,flict,flu,fluc,fluv,flux,for,fore,forc,fort,form,fract,frag,frai,fuge,\
    fuse,gam,gastr,gastro,gen,gen,geo,germ,gest,giga,gin,gloss,glot,glu,glo,gor,grad,gress\
    ,gree,graph,gram,graf,grat,grav,greg,hale,heal,helio,hema,hemo,her,here,hes,hetero,hex\
    ,ses,sex,homo,hum,human,hydr,hydra,hydro,hyper,hypn,an,ics,ignis,in,im,in,im,il,ir,infra\
    ,inter,intra,intro,ty,jac,ject,join,junct,judice,jug,junct,just,juven,labor,lau,lav,lot\
    ,lut,lect,leg,lig,leg,levi,lex,leag,leg,liber,liver,lide,liter,loc,loco,log,logo,ology\
    ,loqu,locut,luc,lum,lun,lus,lust,lude,macr,macer,magn,main,mal,man,manu,mand,mania,mar\
    ,mari,mer,matri,medi,mega,mem,ment,meso,meta,meter,metr,micro,migra,mill,kilo,milli,min\
    ,mis,mit,miss,mob,mov,mot,mon,mono,mor,mort,morph,multi,nano,nasc,nat,gnant,nai,nat,nasc\
    ,neo,neur,nom,nom,nym,nomen,nomin,non,non,nov,nox,noc,numer,numisma,ob,oc,of,op,oct,oligo\
    ,omni,onym,oper,ortho,over,pac,pair,pare,paleo,pan,para,pat,pass,path,pater,patr,path,pathy\
    ,ped,pod,pedo,pel,puls,pend,pens,pond,per,peri,phage,phan,phas,phen,fan,phant,fant,phe,phil\
    ,phlegma,phobia,phobos,phon,phot,photo,pico,pict,plac,plais,pli,ply,plore,plu,plur,plus,pneuma\
    ,pneumon,pod,poli,poly,pon,pos,pound,pop,port,portion,post,pot,pre,pur,prehendere,prin,prim,\
    prime,pro,proto,psych,punct,pute,quat,quad,quint,penta,quip,quir,quis,quest,quer,re,reg,recti\
    ,retro,ri,ridi,risi,rog,roga,rupt,sacr,sanc,secr,salv,salu,sanct,sat,satis,sci,scio,scientia,\
    scope,scrib,script,se,sect,sec,sed,sess,sid,semi,sen,scen,sent,sens,sept,sequ,secu,sue,serv,\
    sign,signi,simil,simul,sist,sta,stit,soci,sol,solus,solv,solu,solut,somn,soph,spec,spect,spi,\
    spic,sper,sphere,spir,stand,stant,stab,stat,stan,sti,sta,st,stead,strain,strict,string,stige,\
    stru,struct,stroy,stry,sub,suc,suf,sup,sur,sus,sume,sump,super,supra,syn,sym,tact,tang,tag,tig,\
    ting,tain,ten,tent,tin,tect,teg,tele,tem,tempo,ten,tin,tain,tend,tent,tens,tera,term,terr,terra,\
    test,the,theo,therm,thesis,thet,tire,tom,tor,tors,tort,tox,tract,tra,trai,treat,trans,tri,trib,\
    tribute,turbo,typ,ultima,umber,umbraticum,un,uni,vac,vade,vale,vali,valu,veh,vect,ven,vent,ver,\
    veri,verb,verv,vert,vers,vi,vic,vicis,vict,vinc,vid,vis,viv,vita,vivi,voc,voke,vol,volcan,volv\
    ,volt,vol,vor,with,zo".replace(" ","").split(",")
    for i in pre:
        if s.startswith(i.upper()):
            return 1
    return 0

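# Found to be useless. Note: the doubled comma after "ing" in the list below leaves an
# empty string in `end`, and s.endswith("") is always True, so this returns 1 for every word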
def s_has_end(s):
    end = "ee,ese,esque,se,eer,ique,ty,less,ness,ly,ible,able,ion,ic,ical,al,ian,ic,\
    ion,ity,ment,ed,es,er,est,or,ary,ory,ous,cy,ry,ty,al,ure,ute,ble,ar,ly,less,ful,ing,\
    ,inal,tion,sion,osis,oon,sce,\
    que,ette,eer,ee,aire,able,ible,acy,cy,ade,age,al,al,ial,ical,an,ance,ence,ancy,\
    ency,ant,ent,ant,ent,ient,ar,ary,ard,art,ate,ate,ate,ation,cade,drome,ed,ed,en,en,\
    ence,ency,er,ier,er,or,er,or,ery,es,ese,ies,es,ies,ess,est,iest,fold,ful,ful,fy,ia,\
    ian,iatry,ic,ic,ice,ify,ile,ing,ion,ish,ism,ist,ite,ity,ive,ive,ative,itive,ize,less,\
    ly,ment,ness,or,ory,ous,eous,ose,ious,ship,ster,ure,ward,wise,ize,phy,ogy,ity,ion,ic,ical,al".replace(" ","").split(",")
    for i in end:
        if s.endswith(i.upper()):
            return 1
    return 0
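
The `end` string in `s_has_end` contains a doubled comma (`ing,` at the end of one line, `,inal` at the start of the next), so the split list includes an empty string, and `str.endswith("")` is always true. The sketch below reproduces that effect on a short, made-up suffix list and shows one possible way to guard against it. `build_affix_list` and `has_suffix` are hypothetical helpers, not part of the original code.

```python
def build_affix_list(raw):
    """Split a comma-separated affix string, dropping blanks and duplicates."""
    seen = []
    for item in raw.replace(" ", "").split(","):
        if item and item not in seen:
            seen.append(item)
    return seen

def has_suffix(word, suffixes):
    """Return 1 if the upper-case word ends with any suffix, else 0."""
    return int(any(word.endswith(s.upper()) for s in suffixes))

# Hypothetical short list that reproduces the doubled comma ("ing,,inal").
raw_suffixes = "ful,ing,,inal,tion"
naive = raw_suffixes.split(",")          # contains an empty string
clean = build_affix_list(raw_suffixes)   # empty string removed

print(has_suffix("XYZ", naive))     # 1: the empty entry matches every word
print(has_suffix("XYZ", clean))     # 0
print(has_suffix("MINING", clean))  # 1
```

With the empty entry removed, the suffix flag would no longer be constant, so whether it is actually useful would have to be re-checked on the data.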

In [11]:
# From a training word and its pronunciation, get: target, vowel count, phoneme sequence, vowel sequence, has-prefix flag, has-suffix flag
def getInfoOfPronsFromTrain(word, prons):
    hasPre = s_has_pre(word)
    hasEnd = s_has_end(word)
    pronsSeq = []
    vowelsSeq = []
    stressVowelIndex = 0
    count = 0
    sub_prons = prons.split(' ')
    is_has_stress = False
    for i in range(len(sub_prons)):
        if sub_prons[i][-1].isdigit():
            vowelsSeq.append(vowel.index(sub_prons[i][0:-1])) # index of the vowel in the vowel list
            pronsSeq.append(vowelAndconsonan.index(sub_prons[i][0:-1])) # index of the vowel in the combined vowel + consonant list
            count+=1
            if sub_prons[i][-1] == '1':
                stressVowelIndex = count
                is_has_stress = True
        else:
            pronsSeq.append(vowelAndconsonan.index(sub_prons[i])) # index of the consonant in the combined vowel + consonant list
    
    if (is_has_stress is not True):
        print ("No stress!")
    return stressVowelIndex, count, pronsSeq, vowelsSeq, hasPre, hasEnd

# From a test word and its pronunciation, get: vowel count, phoneme sequence, vowel sequence, has-prefix flag, has-suffix flag
def getInfoOfPronsFromTest(word, prons):
    hasPre = s_has_pre(word)
    hasEnd = s_has_end(word)
    pronsSeq = []
    vowelsSeq = []
    count = 0
    sub_prons = prons.split(' ')
    for i in range(len(sub_prons)):
        if sub_prons[i] in vowel:
            vowelsSeq.append(vowel.index(sub_prons[i])) # index of the vowel in the vowel list
            count+=1
        pronsSeq.append(vowelAndconsonan.index(sub_prons[i])) # index of the phoneme (vowel or consonant) in the combined list
    return count, pronsSeq, vowelsSeq, hasPre, hasEnd

In [12]:
## function test
a1, a2 = training_data[3].split(':')
print(a1, a2)
b1, b2, b3, b4, b5, b6 = getInfoOfPronsFromTrain(a1, a2)
print(b1, b2, b3, b4, b5, b6)

a1, a2 = testing_data[3].split(':')
print(a1, a2)
b1, b2, b3, b4, b5= getInfoOfPronsFromTest(a1, a2)
print(b1, b2, b3, b4, b5)

MUSCLING M AH1 S AH0 L IH0 NG
1 3 [26, 2, 30, 2, 25, 9, 28] [2, 2, 9] 0 1
LEARNING L ER N IH NG
2 [25, 7, 27, 9, 28] [7, 9] 0 1


In [13]:
targetArr = []
countArr = []
pronsSeqArr = [] 
vowelsSeqArr = []
hasPreArr = []
hasEndArr = []

for i in range(len(training_data)):
    t_word, t_prons = training_data[i].split(':')
    t1, t2, t3, t4, t5, t6 = getInfoOfPronsFromTrain(t_word, t_prons)
    targetArr.append(t1)
    countArr.append(t2)
    pronsSeqArr.append(t3) 
    vowelsSeqArr.append(t4)
    hasPreArr.append(t5)
    hasEndArr.append(t6)

In [14]:
# a = list(zip(countArr, pronsSeqArr, vowelsSeqArr,hasPreArr))
# a

In [15]:
df = pd.DataFrame({'data' : training_data,
                   'target' : targetArr,
                   'count' : countArr,
                   'pronsSeq' : pronsSeqArr,
                   'vowelsSeq' : vowelsSeqArr,
                   'hasPre' : hasPreArr,
                   'hasEnd' : hasEndArr
})


df.head()

| | count | data | hasEnd | hasPre | pronsSeq | target | vowelsSeq |
|---|---|---|---|---|---|---|---|
| 0 | 2 | COED:K OW1 EH2 D | 1 | 1 | [24, 11, 6, 18] | 1 | [11, 6] |
| 1 | 2 | PURVIEW:P ER1 V Y UW2 | 1 | 1 | [15, 7, 34, 36, 14] | 1 | [7, 14] |
| 2 | 2 | HEHIR:HH EH1 HH IH0 R | 1 | 0 | [22, 6, 22, 9, 29] | 1 | [6, 9] |
| 3 | 3 | MUSCLING:M AH1 S AH0 L IH0 NG | 1 | 0 | [26, 2, 30, 2, 25, 9, 28] | 1 | [2, 2, 9] |
| 4 | 4 | NONPOISONOUS:N AA0 N P OY1 Z AH0 N AH0 S | 1 | 1 | [27, 0, 27, 15, 12, 37, 2, 27, 2, 30] | 2 | [0, 12, 2, 2] |


In [16]:
df.describe()

| | count | hasEnd | hasPre | target |
|---|---|---|---|---|
| count | 50000.0 | 50000.0 | 50000.0 | 50000.0 |
| mean | 2.56734 | 1.0 | 0.44314 | 1.36408 |
| std | 0.696358 | 0.0 | 0.496761 | 0.595326 |
| min | 2.0 | 1.0 | 0.0 | 1.0 |
| 25% | 2.0 | 1.0 | 0.0 | 1.0 |
| 50% | 2.0 | 1.0 | 0.0 | 1.0 |
| 75% | 3.0 | 1.0 | 1.0 | 2.0 |
| max | 4.0 | 1.0 | 1.0 | 4.0 |


In [17]:
# Length of the longest phoneme sequence
print('max length in pronsSeq', max([len(x) for x in pronsSeqArr]))

# Largest number of vowels in a word
print('max length in vowelsSeq',  max([len(x) for x in vowelsSeqArr]))

max length in pronsSeq 14
max length in vowelsSeq 4


### hasEnd has zero variance, so it is dropped; the vowel count turns out to be a very informative feature

In [18]:
df.groupby(['count', 'target']).count()

| count | target | data | hasEnd | hasPre | pronsSeq | vowelsSeq |
|---|---|---|---|---|---|---|
| 2 | 1 | 24435 | 24435 | 24435 | 24435 | 24435 |
| 2 | 2 | 3184 | 3184 | 3184 | 3184 | 3184 |
| 3 | 1 | 8938 | 8938 | 8938 | 8938 | 8938 |
| 3 | 2 | 6903 | 6903 | 6903 | 6903 | 6903 |
| 3 | 3 | 554 | 554 | 554 | 554 | 554 |
| 4 | 1 | 1441 | 1441 | 1441 | 1441 | 1441 |
| 4 | 2 | 2135 | 2135 | 2135 | 2135 | 2135 |
| 4 | 3 | 2356 | 2356 | 2356 | 2356 | 2356 |
| 4 | 4 | 54 | 54 | 54 | 54 | 54 |
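
The grouped counts already give two simple baselines that any trained model should beat. The sketch below copies the counts from the groupby output and computes the accuracy of (A) always predicting stress on the first vowel and (B) predicting the most frequent stress position for each vowel count. The variable names are illustrative.

```python
# Counts copied from the groupby output: {vowel count: {stress position: words}}.
counts = {
    2: {1: 24435, 2: 3184},
    3: {1: 8938, 2: 6903, 3: 554},
    4: {1: 1441, 2: 2135, 3: 2356, 4: 54},
}

total = sum(sum(by_target.values()) for by_target in counts.values())

# Baseline A: always predict stress on the first vowel.
first_vowel_hits = sum(by_target[1] for by_target in counts.values())

# Baseline B: predict the most frequent stress position for each vowel count.
per_count_hits = sum(max(by_target.values()) for by_target in counts.values())

acc_first = first_vowel_hits / total
acc_per_count = per_count_hits / total
print(total, round(acc_first, 5), round(acc_per_count, 5))  # 50000 0.69628 0.71458
```

Baseline B only improves on A for 4-vowel words, where position 3 is the most common, which is why the vowel count alone carries useful signal.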


In [19]:
# Build feature matrix 0
labels_0 = []
for i in range(len(training_data)):
    labels_0.append([countArr[i]])

In [20]:
# Build feature matrix 1
labels_1 = []
for i in range(len(training_data)):
    labels_1.append([countArr[i], hasPreArr[i]])

labels_1

[[2, 1],
 [2, 1],
 [2, 0],
 [3, 0],
 [4, 1],
 [4, 1],
 [2, 0],
 [2, 0],
 [2, 1],
 [2, 0],
 [2, 1],
 [2, 0],
 [2, 1],
 [3, 0],
 [2, 0],
 [2, 0],
 [3, 1],
 [2, 0],
 [2, 1],
 [2, 0],
 [3, 1],
 [4, 1],
 [4, 0],
 [2, 0],
 [2, 0],
 [2, 1],
 [2, 0],
 [2, 0],
 [3, 0],
 [2, 0],
 [2, 1],
 [2, 0],
 [2, 0],
 [4, 0],
 [2, 0],
 [2, 0],
 [2, 0],
 [2, 0],
 [4, 1],
 [2, 0],
 [3, 1],
 [3, 1],
 [2, 1],
 [2, 0],
 [3, 1],
 [2, 0],
 [2, 0],
 [4, 1],
 [2, 1],
 [3, 0],
 [3, 0],
 [4, 1],
 [2, 0],
 [2, 0],
 [2, 0],
 [3, 0],
 [2, 1],
 [2, 1],
 [2, 0],
 [2, 1],
 [2, 0],
 [2, 0],
 [2, 1],
 [3, 1],
 [2, 0],
 [2, 1],
 [3, 0],
 [2, 0],
 [2, 0],
 [3, 1],
 [2, 0],
 [3, 0],
 [3, 1],
 [2, 0],
 [2, 0],
 [4, 0],
 [2, 0],
 [2, 0],
 [4, 1],
 [4, 0],
 [2, 0],
 [2, 0],
 [2, 1],
 [2, 1],
 [3, 0],
 [2, 0],
 [4, 0],
 [3, 0],
 [2, 1],
 [2, 0],
 [2, 0],
 [3, 1],
 [3, 1],
 [2, 0],
 [2, 0],
 [2, 1],
 [2, 1],
 [3, 1],
 [2, 1],
 [2, 1],
 [4, 1],
 [2, 0],
 [2, 1],
 [2, 0],
 [2, 1],
 [2, 1],
 [4, 1],
 [2, 0],
 [2, 0],
 [2, 1],
 [2, 0],
 ...

In [21]:
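pronsSeqArr2 = []

# Right-pad every phoneme sequence with -1 up to the maximum length (14)
for temp in pronsSeqArr:
    k = 14 - len(temp)
    pronsSeqArr2.append(temp + [-1] * k)  # build a new list so pronsSeqArr is not modified in place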
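
The padding step can also be written as a small reusable helper that derives the target length from the data instead of hard-coding it. This is a sketch; `pad_sequences` is a hypothetical name (unrelated to any library function) and the input lists are made up.

```python
def pad_sequences(seqs, pad_value=-1):
    """Right-pad every sequence to the length of the longest one.

    New lists are built, so the input sequences are left unchanged.
    """
    max_len = max(len(s) for s in seqs)
    return [list(s) + [pad_value] * (max_len - len(s)) for s in seqs]

# Hypothetical phoneme-index sequences of different lengths.
raw = [[24, 11, 6, 18], [15, 7, 34, 36, 14], [2]]
padded = pad_sequences(raw)

print(padded)  # [[24, 11, 6, 18, -1], [15, 7, 34, 36, 14], [2, -1, -1, -1, -1]]
print(raw[2])  # [2]
```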

In [22]:
# Build feature matrix 2
labels_2 = []
for i in range(len(training_data)):
    tempList = [countArr[i], hasPreArr[i]]
    tempList += pronsSeqArr2[i];
    labels_2.append(tempList)     

In [23]:
labels_2

[[2, 1, 24, 11, 6, 18, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
 [2, 1, 15, 7, 34, 36, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1],
 [2, 0, 22, 6, 22, 9, 29, -1, -1, -1, -1, -1, -1, -1, -1, -1],
 [3, 0, 26, 2, 30, 2, 25, 9, 28, -1, -1, -1, -1, -1, -1, -1],
 [4, 1, 27, 0, 27, 15, 12, 37, 2, 27, 2, 30, -1, -1, -1, -1],
 [4, 1, 25, 0, 34, 6, 24, 10, 2, -1, -1, -1, -1, -1, -1, -1],
 [2, 0, 16, 2, 24, 2, 25, 18, -1, -1, -1, -1, -1, -1, -1, -1],
 [2, 0, 10, 32, 2, 27, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
 [2, 1, 30, 5, 26, 6, 18, -1, -1, -1, -1, -1, -1, -1, -1, -1],
 [2, 0, 37, 0, 26, 16, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1],
 [2, 1, 16, 9, 24, 30, 2, 25, -1, -1, -1, -1, -1, -1, -1, -1],
 [2, 0, 26, 2, 21, 35, 5, 29, -1, -1, -1, -1, -1, -1, -1, -1],
 [2, 1, 8, 18, 2, 27, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
 [3, 0, 18, 7, 5, 2, 30, -1, -1, -1, -1, -1, -1, -1, -1, -1],
 [2, 0, 25, 11, 18, 7, 37, -1, -1, -1, -1, -1, -1, -1, -1, -1],
 [2, 0, 21, 1, 27, 9, 24, -1, -1, -1, -1, -1, -1, -1, ...

In [24]:
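# Build feature matrix 3
vowelsSeqArr2 = []

# Right-pad every vowel sequence with -1 up to the maximum length (4)
for temp in vowelsSeqArr:
    k = 4 - len(temp)
    vowelsSeqArr2.append(temp + [-1] * k)  # build a new list so vowelsSeqArr is not modified in place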

labels_3 = []
for i in range(len(training_data)):
    tempList = [countArr[i], hasPreArr[i]]
    tempList += vowelsSeqArr2[i];
    labels_3.append(tempList) 



In [26]:
labels_3

[[2, 1, 11, 6, -1, -1],
 [2, 1, 7, 14, -1, -1],
 [2, 0, 6, 9, -1, -1],
 [3, 0, 2, 2, 9, -1],
 [4, 1, 0, 12, 2, 2],
 [4, 1, 0, 6, 10, 2],
 [2, 0, 2, 2, -1, -1],
 [2, 0, 10, 2, -1, -1],
 [2, 1, 5, 6, -1, -1],
 [2, 0, 0, 14, -1, -1],
 [2, 1, 9, 2, -1, -1],
 [2, 0, 2, 5, -1, -1],
 [2, 1, 8, 2, -1, -1],
 [3, 0, 7, 5, 2, -1],
 [2, 0, 11, 7, -1, -1],
 [2, 0, 1, 9, -1, -1],
 [3, 1, 6, 9, 7, -1],
 [2, 0, 5, 7, -1, -1],
 [2, 1, 4, 3, -1, -1],
 [2, 0, 14, 2, -1, -1],
 [3, 1, 0, 0, 11, -1],
 [4, 1, 5, 9, 10, 2],
 [4, 0, 0, 6, 11, 10],
 [2, 0, 1, 2, -1, -1],
 [2, 0, 11, 11, -1, -1],
 [2, 1, 7, 9, -1, -1],
 [2, 0, 6, 7, -1, -1],
 [2, 0, 6, 7, -1, -1],
 [3, 0, 0, 0, 11, -1],
 [2, 0, 8, 10, -1, -1],
 [2, 1, 9, 7, -1, -1],
 [2, 0, 9, 5, -1, -1],
 [2, 0, 0, 7, -1, -1],
 [4, 0, 1, 11, 10, 2],
 [2, 0, 1, 2, -1, -1],
 [2, 0, 6, 9, -1, -1],
 [2, 0, 1, 2, -1, -1],
 [2, 0, 3, 2, -1, -1],
 [4, 1, 6, 2, 5, 9],
 [2, 0, 1, 2, -1, -1],
 [3, 1, 6, 2, 5, -1],
 [3, 1, 1, 6, 7, -1],
 [2, 1, 6, 2, -1, -1],
 [2, 0, 1, 1, ...

In [54]:

df_labels_2 =  pd.DataFrame(labels_2)
df_labels_2.describe()


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 |
| mean | 2.56734 | 0.44314 | 21.39506 | 11.62736 | 21.42314 | 16.16342 | 15.75536 | 12.52468 | 8.02876 | 4.45342 | 1.91706 | 0.31536 | -0.50854 | -0.86346 | -0.96888 | -0.99456 |
| std | 0.696358 | 0.496761 | 8.999909 | 10.646194 | 9.794818 | 11.248399 | 12.097586 | 13.142204 | 12.68147 | 10.901336 | 8.531851 | 6.009199 | 3.771807 | 2.037268 | 0.999346 | 0.432266 |
| min | 2.0 | 0.0 | 0.0 | 0.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 25% | 2.0 | 0.0 | 16.0 | 2.0 | 15.0 | 7.0 | 3.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 50% | 2.0 | 0.0 | 24.0 | 9.0 | 25.0 | 15.0 | 15.0 | 9.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 75% | 3.0 | 1.0 | 29.0 | 23.0 | 29.0 | 27.0 | 27.0 | 27.0 | 18.0 | 2.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| max | 4.0 | 1.0 | 38.0 | 38.0 | 38.0 | 38.0 | 38.0 | 38.0 | 38.0 | 38.0 | 38.0 | 37.0 | 37.0 | 37.0 | 37.0 | 37.0 |


In [None]:
from sklearn import tree
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import f1_score

x_train, x_test, y_train, y_test = train_test_split(labels_3, targetArr, test_size = 0.2) 

In [None]:
x_train

### Decision tree

In [None]:

# Train a decision tree using information entropy as the split criterion
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf.fit(x_train, y_train)

# The importances reflect each feature's influence: the larger the value, the bigger the role the feature plays in classification
clf.feature_importances_

### The last feature has an importance of 0 and can simply be dropped; in fact the last 4 all have very low importance
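
As a self-contained illustration of reading `feature_importances_`, the sketch below fits the same kind of tree on the IRIS data instead of the assignment features, since the importances for those are not reproduced here. The importances are non-negative and sum to 1. The 0.01 cut-off and the names `clf_demo` and `keep` are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf_demo = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf_demo.fit(iris.data, iris.target)

importances = clf_demo.feature_importances_

# Column indices whose importance exceeds an illustrative cut-off of 0.01.
keep = [i for i, w in enumerate(importances) if w > 0.01]
print(importances, keep)
```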

In [None]:
answer = clf.predict(x_train)
answer2 = clf.predict(x_test)
print(answer)
# print(np.mean( answer == y_train))
print('f1 for train = ' , f1_score(y_train, answer, average='micro'))
print('f1 for test = ' , f1_score(y_test, answer2, average='micro'))
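
For single-label multiclass predictions, micro-averaged F1 coincides with plain accuracy, because every misclassified sample counts once as a false positive (for the predicted class) and once as a false negative (for the true class). The pure-Python sketch below checks this on made-up labels; `micro_f1` is a hand-written helper, not sklearn's implementation.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for single-label multiclass predictions."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label and t != label:
                fp += 1
            elif p != label and t == label:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical stress positions (true vs. predicted).
y_true = [1, 1, 2, 3, 1, 2]
y_pred = [1, 2, 2, 3, 1, 1]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(micro_f1(y_true, y_pred), accuracy)  # both are 4/6
```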


### Bayesian ridge regression

In [1]:
from sklearn import linear_model
clf_bayes = linear_model.BayesianRidge()
clf_bayes.fit(x_train, y_train)
answer = clf_bayes.predict(x_train).round() # round to the nearest integer
answer2 = clf_bayes.predict(x_test).round()
print('f1 for train = ' , f1_score(y_train, answer, average='micro'))
print('f1 for test = ' , f1_score(y_test, answer2, average='micro'))

NameError: name 'x_train' is not defined
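
`BayesianRidge` is a regression model, so a rounded prediction is not guaranteed to be a valid stress position: it can fall below 1 or exceed the word's vowel count. One possible guard, sketched here on made-up numbers, is to clamp the rounded value to the range 1..n_vowels. `to_stress_position` is a hypothetical helper and not part of the original notebook.

```python
def to_stress_position(raw_prediction, n_vowels):
    """Round a regression output and clamp it to the range 1..n_vowels."""
    rounded = int(round(raw_prediction))
    return max(1, min(n_vowels, rounded))

# Hypothetical raw regression outputs paired with the vowel count of each word.
raw_outputs = [0.3, 1.6, 4.7]
vowel_counts = [2, 3, 4]
positions = [to_stress_position(r, n) for r, n in zip(raw_outputs, vowel_counts)]

print(positions)  # [1, 2, 4]
```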

In [2]:
type(linear_model.BayesianRidge())


sklearn.linear_model.bayes.BayesianRidge