# Quick Start: Data Processing with MindPandas

[![View Source](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/master/docs/mindpandas/docs/source_zh_cn/mindpandas_quick_start.ipynb)

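Data preprocessing matters a great deal for model training, and good feature engineering can substantially improve training accuracy. Using feature engineering for a recommender system as an example, this section walks through the process of handling data with MindPandas.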

## Setting the MindPandas Backend

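Before using MindPandas, set the backend to multithread mode (to use yr mode instead, see [MindPandas Backend Execution Mode Configuration and Performance](https://www.mindspore.cn/mindpandas/docs/zh-CN/master/mindpandas_configuration.html)) and set the partition shape to 16*3, as in the following example: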

In [None]:
import hashlib

import numpy as np
import mindpandas as pd

pd.set_adaptive_concurrency(False)
pd.set_concurrency_mode("multithread")
pd.set_partition_shape((16, 3))
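
To make the partition shape concrete, the sketch below estimates how a table of 10000 rows and 41 columns (the size of the dataset used later in this guide) would divide under a shape of (16, 3). It reads the shape as 16 partitions along the rows and 3 along the columns. The helper `block_sizes` is hypothetical, and the even-split rule is an assumption for illustration; MindPandas may place its block boundaries differently.

```python
def block_sizes(total, parts):
    """Split `total` items into `parts` nearly equal blocks (hypothetical helper)."""
    base, extra = divmod(total, parts)
    # The first `extra` blocks take one additional item each.
    return [base + 1 if i < extra else base for i in range(parts)]


row_blocks = block_sizes(10000, 16)  # 16 partitions along the row axis
col_blocks = block_sizes(41, 3)      # 3 partitions along the column axis

print(row_blocks)
print(col_blocks)
print(len(row_blocks) * len(col_blocks), "blocks in total")
```

Under these assumptions each block holds a few hundred rows and about a third of the columns, which is the unit of work handed to each thread.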

## Downloading and Reading the Data

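Download the [sample dataset](https://gitee.com/mindspore/mindpandas/tree/master/tests/st/data). It contains users' historical web-page click data, along with the category and topic of each clicked page. Read the data2.csv file with the following commands.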

In [None]:
file_name = "data/data2.csv"
mdf = pd.read_csv(file_name)

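The data2.csv file is a two-dimensional table with 10000 rows and 41 columns, containing strings, numbers, and other data. The raw CSV data is displayed as follows: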

In [101]:
mdf.to_pandas()

| | user_id | doc_id | lang_type | domain_website | pt_d | url_drop_last_slash | url | event_time | title | region | ... | domain_website_click_1d | domain_website_exp_1d | domain_website_click_3d | domain_website_exp_3d | domain_website_click_5d | domain_website_exp_5d | domain_website_click_7d | domain_website_exp_7d | hist_doc_id | source_sample.label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 19265380 | 96764505 | 85593781 | 66614410 | 87156290 | mlY5VQiTj8tKG9XNUPqr | n5v0zTX2QWsSZiYCaIg9 | qtmxuICEXp3lAYc81KUh | psJHRxzE5kS2ZwUgOKna | 9LZ6YQrCh4SowkAKOpFf | ... | 0.946144 | 0.103943 | 0.697057 | 0.835622 | wVnJXdoBRY | gPuxfzqjVl | HNQhVUEKuz | fMktwSQGYx | pBLcsZrUGu | yAilT:jRWJB |
| 1 | 85493911 | 87828698 | 98232131 | 41549241 | 48135212 | i0t2AEukzZVljs53ro6c | AqWU9slEbd8Vai1GyI0T | wAO3Krq0b4mtgG58hykD | NFSkpma0dYUlCV7ngiEP | EPKy5bDCel2hAd6zSv78 | ... | 0.408094 | 0.796093 | 0.531365 | 0.306002 | YtBAnhxQsE | tuWNfVbCrK | COtoxJVTPz | OSfPXKYjDl | ThaeSJtwLX | sXEgt:Vwxns |
| 2 | 28783123 | 64871036 | 95277574 | 42539942 | 30658195 | f529MPhI6FCjGdJSsLYN | TfY53EqVZl07s4PAxbHW | 582pskLef7cWwgBQC3Xo | HW6xyvV5gDrIk0mtaubs | kTEpjbtRISuwHxdF0VKA | ... | 0.137916 | 0.483957 | 0.082523 | 0.330797 | BEmxgTMvZC | DjmrOAQnBJ | ZpJMYUnRbS | DAYNFLvswe | hqYTSpfHri | ATGNo:hNdfe |
| 3 | 79702669 | 12454864 | 71934373 | 4087017 | 32475645 | Yx1c8kPSol5grQ06y7su | o7KMWtxpXAsfh6ZPGLD9 | SYrs72pBMVywlP0NQDEJ | tc8nObLwM3ZqBxU0lKCV | LpI1Gtg0quFyKCQlEBcT | ... | 0.709699 | 0.331744 | 0.969565 | 0.508447 | AVzibNoWYM | riCXBTDpbh | bfXWIODQmk | WjSyNqYceR | QYebTdRwZS | EAkiQ:xEkMz:HXalE |
| 4 | 56578946 | 62761265 | 43209571 | 96948416 | 80057523 | ptjJvfT4bq8DYrN3QdmM | 1n0cZ4PCa2jKxE9qpwOM | kpiNSEC7fXG4hy9FsWgL | 5TFouhxA0v4jVZ9Qt6EK | ZIHwSxruBLvA8Cd7MDYV | ... | 0.606441 | 0.396516 | 0.289412 | 0.419054 | bIFpxAhiDM | QEewBPxJVR | taXZbughdo | qBTUkutlJD | NXiqYGmxCU | FIlxp |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 99836239 | 33382142 | 47424555 | 84061856 | 48850740 | vmrMxVwPtg1BTYhDSlWs | CWrvqMT6fGbF4e0NoSOE | 1SBVe5j4bWEXfMiQYsH9 | woPauhtpWL6I7EnOSx4D | R9qDXTJ147uefBz2QKd8 | ... | 0.354468 | 0.853486 | 0.599362 | 0.896752 | EHvfYTiXoZ | uAvYxlHVbZ | ORdZNQHKuo | UaPQmEAzGs | QjoRnYavtf | HOvtS |
| 9996 | 72990782 | 90103899 | 69145205 | 33555186 | 54606213 | eHvhtMIg6uEQl32pj5o9 | UCgl950VGsOfXtmEiAZ3 | bO6rHgnolxkwjfYViB7q | Hgiq1TXc6fBM0KtvFy28 | 1YQ8NSdvDPszZ4cF37nX | ... | 0.634423 | 0.815318 | 0.330612 | 0.884765 | OWmvzstXpq | LTboIcFfhd | mMzCHcjLyk | zKtYMuUcnm | YeRScWFEHr | SLGzu:DdvGQ:PztKs |
| 9997 | 82847429 | 9802771 | 40045859 | 82857068 | 41498974 | wthypCb3WB06XAJYekR8 | IzSgEhpBOD8nkKwAoGYT | fgsL6FaXtTGKMlmVECqe | vN7DYcTSE2ztBm1qWau6 | G5lyJnKTIhZbcrSv84uD | ... | 0.627181 | 0.398546 | 0.648544 | 0.414315 | DjCpViEXhG | ARTxVeuyWX | MarCeJmjVE | dOuBDHwzyS | EZeJipyDsF | ZYhbU |
| 9998 | 11936407 | 90919619 | 47759544 | 40296987 | 90955464 | S216sTinmXeAwkvUORo5 | b9VlQC02fIdW8DZcv7RL | nW1xyl5MqbasAPidGtwI | zwsLvkRFVSpMBA9IWTqt | PsC8o4dhxfJe9ky6KLlb | ... | 0.393747 | 0.168946 | 0.895252 | 0.185574 | LaSzMYPjWH | WkLpStXJnZ | OTiEWldJHj | ZdFzuRCDeP | CWIHARZOUV | eEBuh:aowvZ |


## Processing Specific Columns

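First, apply a lambda function to the "hist_doc_id" column of mdf to split each value on the colon separator; non-string values become empty lists. Then replace every 'x' value in mdf with the missing-value marker pd.NaT. The results before and after are compared below: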

In [102]:
print("before apply function to DataFrame.")
print(mdf["hist_doc_id"])
mdf["hist_doc_id"] = mdf["hist_doc_id"].apply(lambda x: x.split(':') if isinstance(x, str) else [])
mdf = mdf.replace("x", pd.NaT)
print("after apply function to DataFrame.")
print(mdf["hist_doc_id"])

before apply function to DataFrame.
0       pBLcsZrUGu
1       ThaeSJtwLX
2       hqYTSpfHri
3       QYebTdRwZS
4       NXiqYGmxCU
           ...    
9995    QjoRnYavtf
9996    YeRScWFEHr
9997    EZeJipyDsF
9998    CWIHARZOUV
9999    nvMiSxjmlV
Name: hist_doc_id, Length: 10000, dtype: object
after apply function to DataFrame.
0       [pBLcsZrUGu]
1       [ThaeSJtwLX]
2       [hqYTSpfHri]
3       [QYebTdRwZS]
4       [NXiqYGmxCU]
            ...     
9995    [QjoRnYavtf]
9996    [YeRScWFEHr]
9997    [EZeJipyDsF]
9998    [CWIHARZOUV]
9999    [nvMiSxjmlV]
Name: hist_doc_id, Length: 10000, dtype: object
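
The per-value logic of the lambda can be checked in isolation. The sketch below runs the same expression in plain Python on a few values. The function name `split_doc_ids` and the sample inputs other than the first are made up for illustration and are not taken from data2.csv.

```python
def split_doc_ids(value):
    """Same logic as the lambda above: split strings on ':', else return []."""
    return value.split(':') if isinstance(value, str) else []


# One plain ID, one colon-separated history, and two kinds of missing value.
samples = ["pBLcsZrUGu", "aaa:bbb:ccc", float("nan"), None]
results = [split_doc_ids(value) for value in samples]

for value, result in zip(samples, results):
    print(repr(value), "->", result)
```

A string without a colon becomes a one-element list, which is why every row in the output above is wrapped in brackets. The `isinstance` guard keeps missing values from raising an `AttributeError`.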


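Add a "label" column and copy the data in the "source_sample.label" column of mdf into it.
Then drop the last 100 rows of mdf (keeping the first 9900), sort the whole mdf by the values in the "event_time" column, and repartition it with a shape of (16, 3). The results after processing are as follows: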

In [103]:
mdf["label"] = mdf["source_sample.label"]
mdf = mdf[:-100]
print("before sort values to event_time column.")
print(mdf["event_time"])
mdf = mdf.sort_values("event_time")
print("*********************************************")
print("after sort values to event_time column.")
print(mdf["event_time"])
mdf.repartition((16, 3))

before sort values to event_time column.
0       qtmxuICEXp3lAYc81KUh
1       wAO3Krq0b4mtgG58hykD
2       582pskLef7cWwgBQC3Xo
3       SYrs72pBMVywlP0NQDEJ
4       kpiNSEC7fXG4hy9FsWgL
                ...         
9895    6eMaWOcNPHEYVXop87LD
9896    v6fJpel0yX9zPucSDO2Q
9897    ZGFnHJ78Rr2bL1DCoeKO
9898    FX3Cmoj6fvNikbP9Tl7z
9899    sdD723z5UEoKfZecXwGP
Name: event_time, Length: 9900, dtype: object
*********************************************
after sort values to event_time column.
9648    01C8Z6FwuyVbgXIWel5L
2391    01VBTdlYM5DRUZS4v2ys
5997    01X9n5l3vkdEUhQLWbSB
5708    01p2onA8ZjgGaQIiC57v
8177    02MKFh3LclpXECSJt4wo
                ...         
2387    zxIlOpKb4GkZhwD0r6Cc
1431    zy50qrFa92DMeSRYpEoV
6813    zyBHp7RbOfw5AtSPXQLk
7156    zyatUqJ5CbnhAMe0GjNs
1339    zynZbGWaX0eBS5CTUw8c
Name: event_time, Length: 9900, dtype: object
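
The slicing and sorting behavior above can be reproduced on a small list. The sketch below uses plain Python with short made-up strings standing in for `event_time` values. It is an illustration of the semantics only, not of how MindPandas executes the sort.

```python
event_times = ["qtmx", "wAO3", "582p", "SYrs", "kpiN", "01p2", "01C8"]

# A negative stop index drops items from the end: [:-2] keeps all but the last two.
kept = event_times[:-2]

# Pair each value with its original position, then sort by the string value.
ordered = sorted(enumerate(kept), key=lambda pair: pair[1])

for position, value in ordered:
    print(position, value)
```

Because the values are strings, the order is lexicographic by code point: digits sort before uppercase letters, which sort before lowercase letters. This matches the sorted output above, where `01VB...` and `01X9...` come before `01p2...`. The original positions travel with the values, which is why the index in the sorted output is no longer sequential.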


## Data Transformation

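Reset the index of mdf and add an "is_training" column that marks the first 70% of the rows as training data and the remaining rows as non-training data. Then drop the dense feature columns and fill missing values with the string "-1". The source code and output are as follows: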

In [104]:
mdf = mdf.reset_index(drop=True)
m_train_len = int(len(mdf) * 0.7)
mdf["is_training"] = [1] * m_train_len + [0] * (len(mdf) - m_train_len)
print("before dropping dense_feat_names columns.")
print(mdf.shape)
sparse_feat_names = ['user_id', 'doc_id', 'lang_type', 'domain_website', 'url_drop_last_slash', 'url',
                     'title', 'region', 'dtype', 'tags', 'topics', 'categ', 'second_class', 'orgCateg',
                     'hwSourceId', 'level2_topic', 'level2_main_topic', 'level3_topic', 'level3_main_topic',
                     'keyword_for_feed', 'entities']
dense_feat_names = ['pt_d', 'quality', 'porn', 'duration', 'url_click_2h', 'url_click_4h', 'url_click_8h',
                    'url_click_12h', 'url_click_24h', 'domain_website_click_1d', 'domain_website_exp_1d',
                    'domain_website_click_3d', 'domain_website_exp_3d', 'domain_website_click_5d',
                    'domain_website_exp_5d', 'domain_website_click_7d', 'domain_website_exp_7d']
mdf = mdf.drop(columns=dense_feat_names)
print("after dropping dense_feat_names columns.")
print(mdf.shape)
mdf = mdf.fillna("-1")

before dropping dense_feat_names columns.
(9900, 43)
after dropping dense_feat_names columns.
(9900, 26)
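
The numbers in this step can be verified with a few lines of plain Python. The sketch below rebuilds the `is_training` flags for 9900 rows and checks the column count; the variable names are illustrative.

```python
n_rows = 9900                  # row count after dropping the last 100 rows
train_len = int(n_rows * 0.7)  # int() truncates toward zero

# Ones for the training rows, zeros for the rest, in row order.
is_training = [1] * train_len + [0] * (n_rows - train_len)

n_dense = 17                   # number of names in dense_feat_names
print(train_len, n_rows - train_len)
print(43 - n_dense)            # columns left after the drop
```

With 9900 rows this marks the first 6930 rows as training data and the remaining 2970 as non-training data. Since mdf was sorted by "event_time" beforehand, the split follows that sort order rather than being random. Dropping the 17 dense feature columns from 43 leaves the 26 columns reported above.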


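Apply the custom hash function to the sparse feature columns of mdf, drop duplicate rows by "doc_id", reset the index, and insert an "inserted" column of NumPy random numbers at column position 20 (zero-based). The source code and output are as follows: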

In [105]:
def hash_item(val, item_size=10000000, offset=0):
    # Map a value to a bucket in [offset, offset + item_size).
    if isinstance(val, str):
        # sha256 gives the same bucket for the same string on every run.
        return int(hashlib.sha256(val.encode('utf-8')).hexdigest(), 16) % item_size + offset
    return abs(hash(val)) % item_size + offset


print("before apply hash_item function to sparse_feat_names columns.")
print(mdf[sparse_feat_names])
mdf[sparse_feat_names] = mdf[sparse_feat_names].applymap(hash_item)
print("after apply hash_item function to sparse_feat_names columns.")
print(mdf[sparse_feat_names])
mdf = mdf.drop_duplicates(subset=['doc_id'])
mdf = mdf.reset_index(drop=True)

np.random.seed(100)
data = np.random.rand(len(mdf))
mdf.insert(20, "inserted", pd.Series(data))

before apply hash_item function to sparse_feat_names columns.
       user_id    doc_id  lang_type  domain_website   url_drop_last_slash  \
0      5974447  95207556   12864936         8124794  WtFlngb2dN5Dm7fsUpyG   
1     13716918  75778363   63811234        94888995  vXJYGfl3dKN7DtITpAhL   
2     56322769  55593192   12183088        11061708  AsfnYBwd5lJKI9MyvXHW   
3     93774815  18831418   16874517        57356600  b8vJVhksqAGmfdUaYyIC   
4     19501375   1783264   51892125        42043337  mFfTaJy52R4gh7uCMDsP   
...        ...       ...        ...             ...                   ...   
9895  87091601  82490967   25242193        72908750  bS4j7dIoDwaQ20ifPzLA   
9896  50122756  80922387   71099054         8371814  d3U8tB5opcAz9VXrJ1Pi   
9897  87105678  67008910   41511537        59187673  KNBmfPe4Cg957UA3RhcD   
9898  40712385  57465396   31103361        71897453  HgcBkeT7fWuyjF5nP6tR   
9899  93098460  48021476   46309165        50365957  2tXaWcK4xo91BORbGHSz   
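
To see what the hashing step does to individual values, the sketch below applies the same bucketing logic to one string and one integer taken from the output above. `bucket_of` is a hypothetical, simplified stand-in for `hash_item` with the offset fixed at 0.

```python
import hashlib

ITEM_SIZE = 10000000  # same bucket count as the default in hash_item


def bucket_of(val, item_size=ITEM_SIZE):
    """Simplified stand-in for hash_item (offset fixed at 0)."""
    if isinstance(val, str):
        digest = hashlib.sha256(val.encode('utf-8')).hexdigest()
        return int(digest, 16) % item_size
    return abs(hash(val)) % item_size


# Hashing the same string twice yields the same bucket.
bucket_a = bucket_of("WtFlngb2dN5Dm7fsUpyG")
bucket_b = bucket_of("WtFlngb2dN5Dm7fsUpyG")
print(bucket_a == bucket_b, 0 <= bucket_a < ITEM_SIZE)

# Small integers hash to themselves in CPython, so only the modulo applies.
int_bucket = bucket_of(13716918)
print(int_bucket)
```

Using `hashlib.sha256` for strings is a deliberate choice: Python's built-in `hash()` on `str` is randomized per process unless `PYTHONHASHSEED` is fixed, so it would assign different feature IDs on each run. Integer columns such as "user_id" take the second branch and are simply reduced modulo the bucket count.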

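At this point, data preprocessing and feature engineering (deduplication, applying transform functions, filling missing values, adding columns, and splitting the dataset) are complete, and the processed data can be passed to a model for training.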