Reference: https://towardsdatascience.com/data-pre-processing-techniques-you-should-know-8954662716d6

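Data preprocessing topics: Python, reading CSV files with pandas, processing and transforming a column of a pandas DataFrame (custom functions and regular expressions), string handling (specific formats such as email addresses), and date handling (resampling to a common frequency, discretization)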
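
The topic list mentions date handling (resampling to a common frequency and discretization). A minimal sketch on made-up data follows; the frame `events`, its columns `ts` and `value`, and the bin edges are assumptions for illustration and do not come from the competition data.

```python
import pandas as pd

# Hypothetical irregular event log (invented values)
events = pd.DataFrame({
    'ts': pd.to_datetime(['2020-01-01 08:00', '2020-01-01 20:00',
                          '2020-01-03 09:30', '2020-01-04 23:00']),
    'value': [1, 2, 3, 4],
})

# Resample to a common daily frequency; days without events sum to 0
daily = events.set_index('ts')['value'].resample('D').sum()

# Discretize the hour of day into labelled, left-closed bins
events['part_of_day'] = pd.cut(
    events['ts'].dt.hour,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=['night', 'morning', 'afternoon', 'evening'],
)
print(daily)
print(events)
```

`resample` needs a datetime index, which is why `ts` is set as the index first.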

In [1]:
import numpy as np
import pandas as pd



Common operations for inspecting a table's schema and its first few rows

In [None]:
train = pd.read_csv('../input/open-shopee-code-league-marketing-analytics/train.csv')
users = pd.read_csv('../input/open-shopee-code-league-marketing-analytics/users.csv')
# Show the first few rows, then the column names, dtypes and non-null counts
print(train.head())
train.info()
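
When the input files are not at hand, the same inspection steps can be tried on an in-memory CSV. This is only a sketch: the two sample rows are invented, and only the column names `user_id`, `grass_date` and `open_flag` follow the cells in this notebook.

```python
import io
import pandas as pd

# Invented sample rows standing in for train.csv
csv_text = """user_id,grass_date,open_flag
1,2019-07-16 00:00:00+08:00,0
2,2019-07-17 00:00:00+08:00,1
"""

# parse_dates asks pandas to convert the column while reading
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['grass_date'])
print(sample.head())
sample.info()
```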

Common operations for adding columns, dropping columns, and merging or splitting tables

In [None]:
# Merge the two tables on user_id (a left join keeps every row of train)
train = pd.merge(train, users, on='user_id', how='left')

# Drop columns; drop() returns a new DataFrame, so assign the result
train = train.drop(columns=["buyeraddress", "selleraddress", "destination", "origin"])

# Split off the rows where open_flag is 1
train_flag1 = train[train['open_flag'] == 1]

# Add a new feature column (.dt only works on datetime values)
train['day_of_week'] = pd.to_datetime(train['grass_date']).dt.day_name()

# Map a boolean field to strings
train['age_class'] = train['age_class'].map({True: 'Unknown', False: '<>'})

# Concatenate two tables column-wise
train_feat = pd.concat([train_feat, dom_flag], axis=1)
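
A left join silently leaves NaN where the right table has no matching key. The sketch below, on invented toy frames `left` and `right`, uses the `indicator` option of `pd.merge` to make those rows visible.

```python
import pandas as pd

# Toy tables (invented values); user 3 has no row in `right`
left = pd.DataFrame({'user_id': [1, 2, 3], 'open_flag': [0, 1, 0]})
right = pd.DataFrame({'user_id': [1, 2], 'age': [25.0, 40.0]})

# indicator=True adds a `_merge` column telling where each row came from
merged = pd.merge(left, right, on='user_id', how='left', indicator=True)

# Rows without a match keep NaN in `age` and are marked 'left_only'
unmatched = merged[merged['_merge'] == 'left_only']
print(merged)
```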

Summary statistics and data distribution

In [None]:
# Summary statistics (count, mean, std, quartiles) of the numeric columns
train.describe()
# Rows that satisfy a condition
train[train['age'] > 116]

# Count the occurrences of each distinct value
train['open_flag'].value_counts()

# Minimum and maximum of one field
print(train['grass_date'].min(), train['grass_date'].max())

# Count the missing values in each column
train.isna().sum()
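
Raw counts are easier to compare as shares. A small sketch on an invented frame `toy` (not the competition data) shows `value_counts(normalize=True)` and the per-column missing ratio.

```python
import numpy as np
import pandas as pd

# Invented values; one age is missing
toy = pd.DataFrame({'open_flag': [0, 0, 1, 0],
                    'age': [25.0, np.nan, 130.0, 40.0]})

# Share of each class instead of the raw count
share = toy['open_flag'].value_counts(normalize=True)

# Fraction of missing values per column (mean of a boolean mask)
missing_ratio = toy.isna().mean()
print(share)
print(missing_ratio)
```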

Common data visualization tools and functions

In [None]:
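# Histogram of one column; useful for judging whether a feature affects the outcome
train_flag1['country_code'].hist()

# Histogram restricted by a condition; the column holds strings, so convert it before comparing
train_temp1 = train[(train['last_login_day'] != 'Never login') & (train['open_flag'] == 1)].copy()
train_temp1['last_login_day'] = pd.to_numeric(train_temp1['last_login_day'])
train_temp1[train_temp1['last_login_day'] < 200]['last_login_day'].hist()

# Set the bin edges of a histogram explicitly
train_temp1['login_count_last_10_days'].hist(bins=[0, 20, 40, 60, 80, 100])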
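
To read the bin counts without drawing a plot, `numpy.histogram` accepts the same kind of explicit edges. A sketch on invented values; note that every bin is left-closed except the last, which also includes its right edge.

```python
import numpy as np

# Invented login counts
values = np.array([0, 5, 20, 35, 60, 100])

# Same explicit bin edges as in the hist() call above
counts, edges = np.histogram(values, bins=[0, 20, 40, 60, 80, 100])
print(counts)
```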

Type conversion and reassigning values

In [None]:
import datetime as dt
train['country_code'] = train['country_code'].astype(str)
# convert grass_date to datetime
train['grass_date'] = pd.to_datetime(train['grass_date'])

# Set a value on the rows that match a condition
train.loc[train['age'] < 30, 'age_class'] = '<30'

# Mark implausible values as missing (NaN)
train.loc[train['age'] > 110, 'age'] = np.nan

# Derive a category by applying a custom function
def make_domain_type(dom):
    res = np.nan  # domains outside the three lists stay missing
    if dom in ['@163.com', '@gmail.com', '@yahoo.com', '@ymail.com']:
        res = 'low_domain'
    elif dom in ['@outlook.com', '@qq.com', '@rocketmail.com']:
        res = 'med_domain'
    elif dom in ['@hotmail.com', '@icloud.com', '@live.com', 'other']:
        res = 'high_domain'
    return res

train['domain_type'] = train['domain'].apply(make_domain_type)
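
The topic list also mentions string handling for email-like formats. As a sketch, the domain part can be pulled out with `Series.str.extract` and then mapped through a dictionary. The addresses and the partial mapping `domain_type` below are invented for illustration.

```python
import pandas as pd

# Invented addresses, including a missing value
emails = pd.Series(['a.b@gmail.com', 'user@qq.com', 'x@example.org', None])

# Capture everything from '@' to the end, in the '@gmail.com' style used above
domain = emails.str.extract(r'(@[\w.-]+)$', expand=False)

# Partial, illustrative mapping; unmapped domains become NaN
domain_type = {'@gmail.com': 'low_domain', '@qq.com': 'med_domain'}
mapped = domain.map(domain_type)
print(pd.DataFrame({'domain': domain, 'domain_type': mapped}))
```

A dictionary lookup with `map` is vectorized and usually reads more simply than a row-wise `apply`.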



Finally, convert non-numeric features to numeric ones

In [None]:
# Get the numeric feature columns
train.select_dtypes(include='number').columns

# Drop the time zone (grass_date must be tz-aware), then convert to seconds since the Unix epoch
train['grass_date'] = train['grass_date'].dt.tz_convert(None)
train['grass_date'] = (train['grass_date'] - dt.datetime(1970, 1, 1)).dt.total_seconds()

# Replace 'Never open' with a large integer and make the column numeric
train['last_open_day'] = train['last_open_day'].replace(['Never open'], 1000).astype(int)

# One-hot encoding; test_feat and dom_flag_test are assumed to be built the same way from the test set
dom_flag = pd.get_dummies(train['domain'])
test_feat = pd.concat([test_feat, dom_flag_test], axis=1)
# Get the number of features
features = [c for c in train_feat.columns if c not in ['open_flag', 'user_id', 'row_id', 'age_class', 'domain_type', 'day_of_month']]
len(features)

# Impute missing values; mice is assumed to come from the impyute package
from impyute.imputation.cs import mice
train_imputed = mice(train_feat[features].values)
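
One-hot encoding the train and test sets separately can produce different columns. The sketch below, on invented frames `train_toy` and `test_toy`, aligns the test columns to the training ones with `reindex`. It also shows plain median imputation, which is a simpler stand-in here and not the `mice` routine used above.

```python
import numpy as np
import pandas as pd

# Invented data; '@live.com' appears only in the test set
train_toy = pd.DataFrame({'domain': ['@gmail.com', '@qq.com'], 'age': [25.0, np.nan]})
test_toy = pd.DataFrame({'domain': ['@gmail.com', '@live.com'], 'age': [30.0, 41.0]})

train_dom = pd.get_dummies(train_toy['domain'], dtype=int)

# Keep exactly the training columns: unseen categories are dropped, absent ones filled with 0
test_dom = pd.get_dummies(test_toy['domain'], dtype=int).reindex(
    columns=train_dom.columns, fill_value=0)

# Median imputation (swapped in for mice, for illustration only)
age_filled = train_toy['age'].fillna(train_toy['age'].median())
print(test_dom)
```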

Regular expressions

In [None]:
import re

# Validate an email address (raw string, so the backslashes reach the regex engine unchanged)
email = "ankitrai326@gmail.com"
regex = r'^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$'

re.search(regex, email)

Extract email addresses from text: \w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*
Multiple email addresses separated by semicolons: ^((([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6}\;))*(([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})))$
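
A sketch of using the extraction pattern above in Python. Because the pattern contains capturing groups, `re.findall` would return tuples of groups, so the example uses `re.finditer` and takes the whole match. The sample text is invented.

```python
import re

email_re = re.compile(r'\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*')

# Invented sample text
text = "Contact alice.smith@example.com or bob@mail.example.org; thanks."

# group(0) is the full match, regardless of the inner capturing groups
found = [m.group(0) for m in email_re.finditer(text)]
print(found)
```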

Regular expression
\d{3}-\d{8}|\d{4}-\d{7}
Matches           0511-4405222 | 021-87888822
Does not match    02-552255 | 12345-784787

Regular expression
^\((0\d{2}|\d{4})\)(\d{6,7})$
Matches           (021)1234567 | (0411)123456 | (000)000000
Does not match    (123)1234567 | 025123456 | 0252345678

Regular expression (named groups in Python syntax)
^(?P<national>\+?(?:86)?)(?P<separator>\s?-?)(?P<phone>(?P<vender>(13|15|18)[0-9])(?P<area>\d{4})(?P<id>\d{4}))$

Matches           mobile phone numbers
+8613012345678 | 86 13012345678 | 13245679087
Does not match    +86130123456781231434352 | ++8613012345678
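
Python's `re` module writes named groups as `(?P<name>...)`. The sketch below uses that form of the mobile pattern above and reads the parts back by name.

```python
import re

mobile = re.compile(
    r'^(?P<national>\+?(?:86)?)(?P<separator>\s?-?)'
    r'(?P<phone>(?P<vender>(13|15|18)[0-9])(?P<area>\d{4})(?P<id>\d{4}))$'
)

m = mobile.match('+8613012345678')
# groupdict() returns only the named groups
parts = m.groupdict()
print(parts)

# A doubled '+' is rejected
rejected = mobile.match('++8613012345678')
```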

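Extract Chinese mobile phone numbers from text: (86)*0*13\d{9}
Extract Chinese landline numbers from text: (\(\d{3,4}\)|\d{3,4}-|\s)?\d{8}
Extract Chinese phone numbers (mobile and landline) from text: (\(\d{3,4}\)|\d{3,4}-|\s)?\d{7,14}
Validate landline and mobile numbers together: (^(\d{3,4}-)?\d{7,8})$|(13[0-9]{9})
Extract Chinese postal codes from text: [1-9]\d{5}
Extract Chinese ID card numbers from text: \d{18}|\d{15}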
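
A sketch of applying the landline extraction pattern above to free text with `re.finditer`. The sentence and the numbers in it are invented.

```python
import re

landline_re = re.compile(r'(\(\d{3,4}\)|\d{3,4}-|\s)?\d{8}')

# Invented sample text with two landline-style numbers
text = "Call (021)87888822 or 0511-44052221 today"

# Take the whole match; the optional area-code prefix is part of it
numbers = [m.group(0) for m in landline_re.finditer(text)]
print(numbers)
```

The pattern is not anchored, so it also matches inside longer digit runs; adding word boundaries may be worth considering for real text.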

Date: ^\d{4}(\-|\/|\.)\d{1,2}\1\d{1,2}$
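
The backreference `\1` forces both separators to be the same character, but the pattern alone still accepts impossible dates such as 2020-02-30. A sketch that combines it with `datetime.strptime`; the helper name `is_valid_date` is hypothetical.

```python
import re
from datetime import datetime

date_re = re.compile(r'^\d{4}(\-|\/|\.)\d{1,2}\1\d{1,2}$')

def is_valid_date(text):
    """Return True if text has the expected shape and is a real calendar date."""
    m = date_re.match(text)
    if m is None:
        return False
    sep = m.group(1)  # the separator captured by the first group
    try:
        datetime.strptime(text, f'%Y{sep}%m{sep}%d')
    except ValueError:
        return False
    return True

print(is_valid_date('2020/1/5'), is_valid_date('2020-01/05'), is_valid_date('2020-02-30'))
```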