## 1. Import the fastai library and create a DataFrame

In [0]:
from fastai import *
from fastai.tabular import *

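The standard format for tabular data in Python is the pandas DataFrame. A DataFrame can be loaded from a CSV file with pd.read_csv(), and data stored in a relational database, Hadoop, or Spark can also be converted to a DataFrame easily with pandas.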

In [0]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')

In [0]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,49,Private,101320,Assoc-acdm,12.0,Married-civ-spouse,,Wife,White,Female,0,1902,40,United-States,>=50k
1,44,Private,236746,Masters,14.0,Divorced,Exec-managerial,Not-in-family,White,Male,10520,0,45,United-States,>=50k
2,38,Private,96185,HS-grad,,Divorced,,Unmarried,Black,Female,0,0,32,United-States,<50k
3,38,Self-emp-inc,112847,Prof-school,15.0,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,United-States,>=50k
4,42,Self-emp-not-inc,82297,7th-8th,,Married-civ-spouse,Other-service,Wife,Black,Female,0,0,50,United-States,<50k


## 2. Create a TabularList
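1.   cat_names and cont_names list the categorical variables and the continuous variables, respectively.
2.   procs lists the preprocessing steps applied to the data. A processor for tabular data is broadly similar to a transformation in computer vision, but a processor runs once, at the very start, whereas a transformation is meant for data augmentation and is therefore random and different on every pass. The three processors used here are:

FillMissing: replaces missing values with the median and adds a new column flagging whether the value was missing.

Categorify: converts categorical variables to pandas categories.

Normalize: standardizes continuous variables by subtracting the mean and dividing by the standard deviation.

The training, validation, and test sets must be preprocessed in the same way, so all three use the same processors.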
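To make the three processors concrete, here is a minimal plain-Python sketch of what each one does, with the statistics learned on the training rows and reused on the validation rows. It is not fastai's implementation: the toy values, the helper names, the choice of code 0 for unseen categories, and the use of the sample standard deviation are all assumptions made for illustration.

```python
from statistics import mean, median, stdev

# Hypothetical toy columns; None marks a missing value.
train_edu_num = [12.0, 14.0, None, 15.0, None, 9.0]
valid_edu_num = [None, 10.0]
train_workclass = ["Private", "Self-emp-inc", "Private", "Local-gov"]
valid_workclass = ["Local-gov", "Never-worked"]  # second value never seen in training


def fill_missing(values, fill_value):
    """FillMissing sketch: return (filled values, was-missing flags)."""
    flags = [v is None for v in values]
    filled = [fill_value if v is None else v for v in values]
    return filled, flags


def categorify(values, codes):
    """Categorify sketch: map categories to integer codes (0 = unseen, an assumption)."""
    return [codes.get(v, 0) for v in values]


def normalize(values, mu, sigma):
    """Normalize sketch: subtract the mean and divide by the standard deviation."""
    return [(v - mu) / sigma for v in values]


# Learn every statistic on the training set only.
fill_value = median(v for v in train_edu_num if v is not None)
codes = {cat: i + 1 for i, cat in enumerate(sorted(set(train_workclass)))}

train_filled, train_na = fill_missing(train_edu_num, fill_value)
mu, sigma = mean(train_filled), stdev(train_filled)  # sample std is an assumption

# Reuse the same statistics on the validation set.
valid_filled, valid_na = fill_missing(valid_edu_num, fill_value)
valid_norm = normalize(valid_filled, mu, sigma)
valid_codes = categorify(valid_workclass, codes)
```

The point of the sketch is the last three lines: the validation rows are filled, scaled, and encoded with values computed from the training rows, which is what "the same processors" means in practice.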


In [0]:
dep_var = 'salary'
# Categorical variables
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
# Continuous variables
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [FillMissing, Categorify, Normalize]

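df.iloc selects rows and columns of a DataFrame by integer position. df.iloc[800:1000] selects rows 800 through 999 (the end of the slice is exclusive); these 200 rows will serve as the test set.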


In [0]:
test = TabularList.from_df(df.iloc[800:1000].copy(),path=path,cat_names=cat_names,cont_names=cont_names)
test

TabularList (200 items)
age                               45
workclass                    Private
fnlwgt                         96975
education               Some-college
education-num                    NaN
marital-status              Divorced
occupation         Handlers-cleaners
relationship               Unmarried
race                           White
sex                           Female
capital-gain                       0
capital-loss                       0
hours-per-week                    40
native-country         United-States
salary                          <50k
Name: 800, dtype: object,age                                46
workclass                Self-emp-inc
fnlwgt                         192779
education                 Prof-school
education-num                     NaN
marital-status     Married-civ-spouse
occupation             Prof-specialty
relationship                  Husband
race                            White
sex                              Male
capital-gain    



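1.   split_by_idx() takes valid_idx, the indices of the rows to place in the validation set; the example below puts rows 800 through 999 in the validation set. A validation set is often best taken as one contiguous block of data, such as adjacent video frames or adjacent time periods, which is why split_by_idx() is used so often.
2.   path is where the data is stored when save() is called.
3.   cat_names and cont_names state which columns are categorical variables and which are continuous variables, because a neural network models the two kinds differently: categorical variables are handled with embeddings, while continuous variables are fed in much as pixel values are.
4.   label_from_df() takes the labels from the DataFrame; here the label is the 'salary' column.
5.   add_test() adds the test set to the DataBunch for use once training is done; the label=0 passed in the example is only a placeholder label.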
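The different treatment of categorical and continuous variables in item 3 can be sketched in a few lines of plain Python. This is only an illustration of the idea, not fastai's model code: the vocabulary, the embedding width `emb_dim`, the random weights, and the helper `build_input` are all hypothetical.

```python
import random

random.seed(0)  # reproducible toy weights

# Hypothetical vocabulary for one categorical column; code 0 stands for unknown.
workclass_codes = {"#na#": 0, "Private": 1, "Self-emp-inc": 2, "Local-gov": 3}
emb_dim = 3  # hypothetical embedding width

# Embedding table: one vector per category. In a real model these
# numbers are parameters that training adjusts.
embedding = [[random.uniform(-1, 1) for _ in range(emb_dim)]
             for _ in workclass_codes]


def build_input(workclass, cont_values):
    """Look up the category's vector and append the continuous values unchanged."""
    code = workclass_codes.get(workclass, 0)
    return embedding[code] + list(cont_values)


# One row: a category plus three already-normalized continuous values.
x = build_input("Private", [0.25, 0.14, -0.42])
```

The categorical value becomes a learned vector through a table lookup, while the normalized continuous values pass straight through; the concatenated list is what the first layer of the network would receive.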



In [0]:
data = (TabularList.from_df(df,path=path,cat_names=cat_names,cont_names=cont_names,procs=procs)
           .split_by_idx(list(range(800,1000)))
           .label_from_df(cols=dep_var)
           .add_test(test,label=0)
           .databunch())
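As a rough picture of what the split_by_idx() step does to the row indices, the sketch below partitions a range of row numbers into training and validation indices. The helper `split_rows_by_idx` and the row count of 2000 are hypothetical and chosen only for illustration; fastai's own method operates on the item list rather than on bare integers.

```python
def split_rows_by_idx(n_rows, valid_idx):
    """Return (train indices, validation indices) for rows 0..n_rows-1."""
    valid = set(valid_idx)
    train = [i for i in range(n_rows) if i not in valid]
    return train, sorted(valid)


# range(800, 1000) covers rows 800 through 999: the end is exclusive.
train_rows, valid_rows = split_rows_by_idx(2000, range(800, 1000))
```

Every row that is not listed in the validation indices stays in the training set, so the validation block is carved out of the middle of the data rather than sampled at random.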

In [0]:
data.show_batch(rows=10)  # Shows 5 rows by default; the rows parameter sets how many rows to display.

workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num,target
Local-gov,9th,Separated,Other-service,Other-relative,White,False,0.9098,-0.7131,-1.9869,<50k
Private,HS-grad,Married-civ-spouse,Sales,Husband,White,False,0.2502,0.1376,-0.4224,<50k
Private,Assoc-voc,Married-civ-spouse,Craft-repair,Husband,Asian-Pac-Islander,False,-0.7027,-0.4853,0.3599,<50k
Private,Some-college,Never-married,Adm-clerical,Unmarried,Asian-Pac-Islander,False,-0.4095,1.0633,-0.0312,<50k
Private,Prof-school,Married-civ-spouse,Prof-specialty,Husband,White,False,-0.8493,0.2369,1.9245,>=50k
Private,HS-grad,Never-married,Sales,Own-child,White,False,-1.3624,0.3422,-0.4224,<50k
Private,HS-grad,Never-married,Handlers-cleaners,Own-child,White,False,-1.4357,-1.1488,-0.4224,<50k
Private,9th,Married-civ-spouse,Machine-op-inspct,Husband,White,False,-1.3624,-0.8893,-1.9869,<50k
Private,HS-grad,Never-married,Adm-clerical,Not-in-family,White,False,0.6899,-0.2705,-0.4224,<50k
Private,Assoc-acdm,Separated,Adm-clerical,Unmarried,White,False,0.6166,-0.7007,0.7511,<50k


## 3. Create a tabular_learner and train it

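To create a tabular_learner:

Pass in data, the DataBunch built above from the TabularList;
layers=[200,100] defines the network structure: two hidden fully connected layers with 200 and 100 units;

metrics lists the values reported during training; here metrics=accuracy reports the accuracy.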
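As a reminder of what the reported number means, accuracy is the fraction of predictions that match the true labels. The sketch below computes it on hand-written label lists; the helper `label_accuracy` and the sample values are hypothetical, and fastai's own accuracy metric works on tensors of model outputs rather than on strings.

```python
def label_accuracy(predicted_labels, true_labels):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return matches / len(true_labels)


preds = ["<50k", ">=50k", "<50k", "<50k"]
targets = ["<50k", ">=50k", ">=50k", "<50k"]
acc = label_accuracy(preds, targets)  # 3 of the 4 predictions match
```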

In [0]:
learn = tabular_learner(data,layers=[200,100],metrics=accuracy)

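**The difference between fit() and fit_one_cycle():**

fit() is just the basic training loop, while fit_one_cycle() applies the 1cycle policy. The 1cycle policy needs a well-chosen maximum learning rate, so lr_find() is normally run before fit_one_cycle(). In brief, the 1cycle policy consists of three training steps:

1. The learning rate rises gradually from lr_max/div_factor to lr_max, while the momentum falls from mom_max to mom_min.
2. The learning rate falls gradually from lr_max back to lr_max/div_factor, while the momentum rises from mom_min to mom_max.
3. The learning rate keeps falling, from lr_max/div_factor to lr_max/(div_factor*100), while the momentum stays at mom_max.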
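The three steps can be sketched as a function that returns the learning rate and momentum for a given training step. This is a minimal sketch of the schedule as described above, not fastai's scheduler, which may differ in detail. The linear interpolation, the share of training given to each step (`ramp_frac`), and the default values are assumptions.

```python
def lerp(start, end, frac):
    """Linear interpolation between start and end for frac in [0, 1]."""
    return start + (end - start) * frac


def one_cycle(step, total_steps, lr_max, div_factor=25.0,
              mom_max=0.95, mom_min=0.85, ramp_frac=0.45):
    """Return (learning rate, momentum) at `step` of `total_steps`.

    ramp_frac is a hypothetical share of training given to each of
    steps 1 and 2; whatever remains is used for step 3.
    """
    lr_low = lr_max / div_factor
    lr_final = lr_max / (div_factor * 100)
    up_end = max(1, round(total_steps * ramp_frac))
    down_end = 2 * up_end
    if step < up_end:  # step 1: lr up, momentum down
        frac = step / up_end
        return lerp(lr_low, lr_max, frac), lerp(mom_max, mom_min, frac)
    if step < down_end:  # step 2: lr down, momentum up
        frac = (step - up_end) / (down_end - up_end)
        return lerp(lr_max, lr_low, frac), lerp(mom_min, mom_max, frac)
    # step 3: lr decays further, momentum held at mom_max
    frac = (step - down_end) / max(1, total_steps - 1 - down_end)
    return lerp(lr_low, lr_final, frac), mom_max


# Schedule for a hypothetical run of 100 steps with lr_max = 1e-2.
schedule = [one_cycle(s, 100, lr_max=1e-2) for s in range(100)]
```

With these assumed settings the learning rate starts at lr_max/25, peaks at lr_max at step 45, returns to lr_max/25 at step 90, and ends at lr_max/2500, while the momentum mirrors it between 0.95 and 0.85.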

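For the same number of training epochs, fit_one_cycle() typically reaches higher accuracy and lower loss than fit().

For details, see [fit and fit_one_cycle](https://docs.fast.ai/callbacks.one_cycle.html#The-1cycle-policy).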



In [0]:
learn.fit(1,1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.365029,0.403114,0.825,00:06
