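# 0 Introduction

The competition task "Mobile device user age and gender prediction" (移动设备用户年龄和性别预测) splits into two sub-tasks: 1. gender prediction and 2. age prediction. Gender is a categorical label with two classes, so the first is a binary classification task; age is a continuous quantity, so the second is a regression task.

The training set and the test set each consist of two files: the base table and an app_events table. The two tables join on device_id, which uniquely identifies a device; each device_id is associated with exactly one user.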

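train.csv has five attributes, device_id, gender, age, phone_brand and device_model, which mainly describe the user. Early exploration showed that one device_id can map to more than one (phone_brand, device_model) pair. The age attribute is right-skewed: most values fall in [11, 30], the peak is at 29, and the range [30, 89] is sparsely populated with a long tail.

train_app_events.csv has seven attributes, event_id, app_id, is_installed, is_active, device_id, tag_list and date. It describes how users use apps and is highly redundant. This notebook mines that table: a user activity index is derived from date (see 1.3.1), and app-preference indicators are derived from is_installed, is_active and tag_list (see 1.3.2); both are merged into the base table.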

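For gender prediction we tried KNN and decision tree models; KNN did slightly better on the validation set. For age prediction we used linear regression models: on the validation set the multiple regression model (3.2.2) scored higher than the simple regression model (3.2.1), but on the test set it did not reach the expected performance, probably because it overfit the training set.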

# 1 Preprocessing

## 1.1 Reading the files

In [1]:
#import
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,mean_absolute_error
from sklearn.cluster import KMeans,MiniBatchKMeans

In [2]:
# setting
pd.set_option("display.max_rows",None)

In [2]:
root_path = "F:/代码仓/python/讯飞_移动设备用户年龄和性别预测/移动设备用户年龄和性别预测_数据集/"

# read the training file
train_file_name = "train.csv"
train_file_path = root_path + train_file_name

data_train = pd.read_csv(train_file_path)
data_train.head()

Unnamed: 0,device_id,gender,age,phone_brand,device_model
0,0,0,35,0,0
1,1,1,37,1,1
2,2,0,32,1,2
3,3,1,28,1,2
4,4,0,75,2,3


In [64]:
# read the training app-events file
train_events_file_name = "train_app_events.csv"
train_events_file_path = root_path + train_events_file_name

data_train_events = pd.read_csv(train_events_file_path)
data_train_events.head()

Unnamed: 0,event_id,app_id,is_installed,is_active,device_id,tag_list,date
0,6,0,1,1,14271,"[549, 721, 704, 302, 303, 548, 183]",1
1,6,1,1,1,14271,"[713, 704, 548]",1
2,6,2,1,1,14271,"[549, 710, 704, 548, 172]",1
3,6,3,1,1,14271,"[548, 549]",1
4,6,4,1,1,14271,"[128, 1014]",1


## 1.2 Data cleaning

In [39]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20075 entries, 0 to 20074
Data columns (total 7 columns):
device_id       20075 non-null int64
gender          20075 non-null int64
age             20075 non-null int64
phone_brand     20075 non-null float64
device_model    20075 non-null float64
active_index    20001 non-null float64
pre_index       20075 non-null float64
dtypes: float64(4), int64(3)
memory usage: 1.1 MB


### 1.2.1 Null values

In [6]:
data_train[data_train['device_id'].isnull()==True]

Unnamed: 0,device_id,gender,age,phone_brand,device_model


device_id contains no null values.

### 1.2.2 Duplicates

In [7]:
data_train['device_id'].unique().shape[0]

20000

The table has 20075 rows but only 20000 distinct device_id values, so duplicate rows are likely.

In [4]:
data_train.drop_duplicates(inplace=True)
data_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20001 entries, 0 to 20074
Data columns (total 5 columns):
device_id       20001 non-null int64
gender          20001 non-null int64
age             20001 non-null int64
phone_brand     20001 non-null int64
device_model    20001 non-null int64
dtypes: int64(5)
memory usage: 937.5 KB


In [25]:
# data_train[].duplicated(subset=['device_id'])
data_train[data_train['device_id'].duplicated()==True].head()

Unnamed: 0,device_id,gender,age,phone_brand,device_model
17912,17855,0,18,0,238


In [35]:
data_train[data_train['device_id']==17855]

Unnamed: 0,device_id,gender,age,phone_brand,device_model,active_index
17911,17855,0,18,2,28,316.0
17912,17855,0,18,0,238,725.0


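After dropping exact duplicates, one device_id still appears twice: this user has two phones (the two rows differ in phone_brand and device_model). Reset the row index and check where those rows now sit.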

In [5]:
# data_train = data_train.drop(['index'],axis=1)
data_train = data_train.reset_index(drop=True)
data_train[data_train['device_id']==17855]

Unnamed: 0,device_id,gender,age,phone_brand,device_model
17855,17855,0,18,2,28
17856,17855,0,18,0,238


In [6]:
dup_index = 17855
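
The index above is copied by hand from the printed output. As a minimal sketch, assuming a small made-up frame in place of data_train, the device_id values that still repeat after de-duplication, and the row position of each one's first occurrence, can also be looked up programmatically:

```python
import pandas as pd

# Toy stand-in for data_train (hypothetical values): device 1 owns two phones
train = pd.DataFrame({
    "device_id":   [0, 1, 1, 1, 2],
    "phone_brand": [0, 2, 2, 0, 1],
})

# Drop fully identical rows, then renumber the rows from 0
train = train.drop_duplicates().reset_index(drop=True)

# keep=False marks every row of a repeated device_id, not only the later ones
dup_rows = train[train["device_id"].duplicated(keep=False)]
dup_ids = dup_rows["device_id"].unique().tolist()

# Row position of the first occurrence of each repeated device_id
first_dup_positions = dup_rows.groupby("device_id").head(1).index.tolist()
```

With the real table this would replace the hard-coded value only if exactly one device repeats, which is what the output above suggests.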

### 1.2.3 Normalization

In [5]:
# Min-max normalization: map x from [mmin, mmax] to [0, 1]
def min_max(x,mmin,mmax):
    return (x-mmin) / (mmax-mmin)

In [8]:
mmin = data_train['phone_brand'].min()
mmax = data_train['phone_brand'].max()
data_train['phone_brand'] = data_train['phone_brand'].apply(min_max,args=(mmin,mmax))

mmin = data_train['device_model'].min()
mmax = data_train['device_model'].max()
data_train['device_model'] = data_train['device_model'].apply(min_max,args=(mmin,mmax))
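
The same rescaling can be written as column arithmetic, without `apply`. This is a sketch only: `min_max_column` is a hypothetical helper, and the toy maxima (85 and 889) are picked so that the second row reproduces the values printed later in the notebook (0.011765 and 0.001125).

```python
import pandas as pd

def min_max_column(col):
    """Rescale a numeric Series to [0, 1]; a constant column maps to 0."""
    lo, hi = col.min(), col.max()
    if hi == lo:
        # Avoid dividing by zero when every value is identical
        return col * 0.0
    return (col - lo) / (hi - lo)

# Toy encoded columns (hypothetical values)
df = pd.DataFrame({
    "phone_brand":  [0, 1, 2, 85],
    "device_model": [0, 1, 2, 889],
})

for name in ["phone_brand", "device_model"]:
    df[name] = min_max_column(df[name])
```

Note that phone_brand and device_model are integer category codes, so the rescaled value preserves only the arbitrary order of the encoding.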

### 1.2.4 Miscellaneous

In [6]:
import pickle

# Serialize an object to a file
def set_pickle(path, pk):
    with open(path, 'wb') as f:
        pickle.dump(pk, f)

# Deserialize an object from a file
def get_pickle(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

Preprocess the tag_list field by stripping stray double quotes from both ends:

In [10]:
def delet_quote(row):
    left = int(0)
    right = len(row)
    if row[0]=='\"':
        left = int(1)
    if row[-1]=='\"':
        right = int(-1)
    return row[left:right]
data_train_events['tag_list'] = data_train_events['tag_list'].apply(delet_quote)
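
As an alternative sketch, the quote stripping can be done with the vectorized string accessor, and the cleaned text can then be parsed into integer lists. The sample strings below are made up. Unlike `delet_quote`, `str.strip('"')` removes every leading or trailing quote, not just one.

```python
import ast
import pandas as pd

# Toy tag_list values (hypothetical), some wrapped in stray double quotes
raw = pd.Series(['"[549, 721, 704]"', '[548, 549]', '"[128, 1014]"'])

# Strip leading/trailing double quotes in one vectorized call
clean = raw.str.strip('"')

# literal_eval safely parses the bracketed text into a Python list of ints
parsed = clean.apply(ast.literal_eval)
```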

## 1.3 Extracting features from the events dataset

In [34]:
data_train_events.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10498144 entries, 0 to 10498143
Data columns (total 7 columns):
event_id        int64
app_id          int64
is_installed    int64
is_active       int64
device_id       int64
tag_list        object
date            int64
dtypes: int64(6), object(1)
memory usage: 560.7+ MB


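### 1.3.1 Extracting the user activity index

The activity index is built from $c_i$ and $a_i$: the continuity factor $c_i$ reflects how many consecutive days the device has been active up to day $i$ (the count restarts after a break), and $a_i$ is the number of event rows recorded for the device on day $i$. The activity index $A$ is defined as

$$A=\frac{1}{N}\sum_{i=0}^{N}a_ic_i$$

In the code below $c_i = 1 + 0.1\,k_i$, where $k_i$ is the length of the current streak, and the constant factor $1/N$ is omitted. Omitting it does not change the result after min-max normalization, which is invariant to a constant positive scaling.

The activity index is stored in the training table as the feature active_index. Because its scale can be large, it is normalized afterwards.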
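
To make the definition concrete, here is a minimal sketch for a single device. `activity_index` is a hypothetical helper, not the notebook's code; it assumes days are numbered 1 to 7, takes $c_i = 1 + 0.1 \times$ (current streak length), and omits the $1/N$ factor. It starts from an empty count table on every call, so nothing carries over between devices.

```python
def activity_index(daily_counts, n_days=8):
    """Sum of a_i * c_i over days 1 .. n_days-1.

    daily_counts maps a day number to the number of event rows on that day.
    """
    total = 0.0
    streak = 0  # consecutive active days ending at the current day
    for day in range(1, n_days):
        count = daily_counts.get(day, 0)
        # The streak grows on an active day and restarts after a gap
        streak = streak + 1 if count > 0 else 0
        total += count * (1.0 + 0.1 * streak)
    return total

# Active on days 1, 2 and 4: 10*1.1 + 5*1.2 + 2*1.1
example = activity_index({1: 10, 2: 5, 4: 2})
```

Here `example` is 19.2 up to floating-point rounding; the gap on day 3 resets the streak, so day 4 is weighted 1.1 again.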

#### Training set

In [11]:
data_train_events_gb = data_train_events.groupby(['device_id','date'])['event_id'].count()
# data_train_events_gb
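
For orientation, a small sketch of what this grouped count looks like, using a made-up events frame. The result is a Series with a (device_id, date) MultiIndex. Selecting a device returns its per-day counts, and `count()` counts rows, so an event_id that appears once per app is counted once per app.

```python
import pandas as pd

# Toy events table (hypothetical values); event 6 touches two apps
events = pd.DataFrame({
    "event_id":  [6, 6, 7, 8, 9],
    "device_id": [3, 3, 3, 3, 4],
    "date":      [1, 1, 1, 2, 5],
})

per_day = events.groupby(["device_id", "date"])["event_id"].count()

# Per-day row counts for device 3, indexed by date
device_3 = per_day.loc[3]

# Dates on which device 3 was active
active_dates = device_3.index.tolist()
```

If distinct events rather than rows are wanted, `nunique()` in place of `count()` would be the natural variant.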

Next, compute the activity index of every device and insert it into the training set as active_index:

In [12]:
c = [0 for i in range(0,8)] # continuity factor per day
a = [0 for i in range(0,8)] # event count per day
# NOTE: a is created once and not reset inside the loop, so counts left by an
# earlier device stay on days where the current device was inactive.
# NOTE: A has an integer dtype, so each index is truncated when stored.
A = np.array([0 for i in range(0,data_train['device_id'].unique().shape[0]+1)])
for i in range(0,data_train['device_id'].unique().shape[0]):
    # NOTE: i is a device_id here, so this skips device dup_index+1
    # and leaves its entry in A at 0.
    if i==dup_index+1:
        continue
    # dates on which the device was active
    active_date_ls = []
    if isinstance(data_train_events_gb[i],int):
        active_date_ls = [data_train_events_gb[i]]
    elif isinstance(data_train_events_gb[i],np.int64):
        active_date_ls = [int(data_train_events_gb[i])]
    else:
        active_date_ls = data_train_events_gb[i].index.tolist()
    ac_date = [1 if d in active_date_ls else 0 for d in range(0,8)]

    # continuity factor
    ct = 0 # length of the current streak
    for j in range(1,8):
        if ac_date[j]==1 and ac_date[j-1]==1:
            ct+=1.0
        elif ac_date[j]==1:
            ct = 1.0
        else:
            ct = 0
        c[j] = 1.0+0.1*ct

    # daily event counts: ai = data_train_events_gb[device_id,day]
    a_d = [data_train_events_gb[i,d] for d in active_date_ls]  
    for d in range(0,len(active_date_ls)):
        a[active_date_ls[d]] = a_d[d]

    # rows after the duplicated device are shifted by one
    if i<dup_index:
        A[i] = np.sum(np.array(a)*np.array(c))
    elif i==dup_index:
        A[i] = np.sum(np.array(a)*np.array(c))
        A[i+1] = np.sum(np.array(a)*np.array(c))
    else:
        A[i+1] = np.sum(np.array(a)*np.array(c))

# active_index = sum of ai*ci
data_train['active_index'] = pd.Series(A)
data_train.head()

Unnamed: 0,device_id,gender,age,phone_brand,device_model,active_index
0,0,0,35,0.0,0.0,58
1,1,1,37,0.011765,0.001125,142
2,2,0,32,0.011765,0.00225,272
3,3,1,28,0.011765,0.00225,155
4,4,0,75,0.023529,0.003375,384


In [14]:
data_train[data_train['device_id']==dup_index]

Unnamed: 0,device_id,gender,age,phone_brand,device_model,active_index
17855,17855,0,18,0.023529,0.031496,566
17856,17855,0,18,0.0,0.267717,566


Min-max normalize active_index:

In [13]:
mmin = data_train['active_index'].min()
mmax = data_train['active_index'].max()
data_train['active_index'] = data_train['active_index'].apply(min_max,args=(mmin,mmax))

In [14]:
print(mmin,mmax)

0 67739


In [15]:
data_train.head()

Unnamed: 0,device_id,gender,age,phone_brand,device_model,active_index
0,0,0,35,0.0,0.0,0.000856
1,1,1,37,0.011765,0.001125,0.002096
2,2,0,32,0.011765,0.00225,0.004015
3,3,1,28,0.011765,0.00225,0.002288
4,4,0,75,0.023529,0.003375,0.005669


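### 1.3.2 Extracting user preferences

#### Scheme 3 (used for the second and third submissions)

Two attributes, app_type and pre_index, are added to the train table:

1. Cluster the tag_list vectors with K-means (MiniBatchKMeans) to discover app categories, and add the result to the train table as app_type.
2. Define the preference index pre_index as

$$pre\_index=\sum (is\_installed+1.5\,is\_active)$$

For model training, app_type is generated in chapter 2; pre_index is generated below: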

In [14]:
def generate_preIndex(row):
    is_installed = data_train_events[data_train_events['device_id']==row]['is_installed']
    is_active = data_train_events[data_train_events['device_id']==row]['is_active']
#     print((is_installed + 1.5 * is_active).sum())
    return (is_installed + 1.5 * is_active).sum()

In [15]:
data_train['pre_index'] = data_train['device_id'].apply(generate_preIndex)
data_train['pre_index'].describe()

count    20001.000000
mean       834.667067
std       2073.986174
min          2.000000
25%         87.500000
50%        272.500000
75%        808.000000
max      63109.500000
Name: pre_index, dtype: float64
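
`generate_preIndex` filters the full events table once per device, which is slow on roughly ten million rows. A minimal sketch of the same sum computed in a single grouped pass, using made-up data (the frames and their values here are hypothetical):

```python
import pandas as pd

# Toy events and train tables (hypothetical values)
events = pd.DataFrame({
    "device_id":    [0, 0, 1, 1, 1],
    "is_installed": [1, 1, 1, 1, 0],
    "is_active":    [1, 0, 1, 1, 0],
})
train = pd.DataFrame({"device_id": [0, 1, 2]})

# Per-row score, then one grouped sum per device
score = events["is_installed"] + 1.5 * events["is_active"]
pre_index = score.groupby(events["device_id"]).sum()

# Map the sums back onto the train table; devices without events get 0
train["pre_index"] = train["device_id"].map(pre_index).fillna(0.0)
```

Because the result is keyed by device_id, mapping it back gives both rows of a device with two phones the same value.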

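The standard deviation (about 2074 against a mean of about 835) shows that the preference index varies widely across devices, and the spread grows with the value.

Min-max normalize pre_index: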

In [16]:
mmin = data_train['pre_index'].min()
mmax = data_train['pre_index'].max()
data_train['pre_index'] = data_train['pre_index'].apply(min_max,args=(mmin,mmax))
data_train.head()

Unnamed: 0,device_id,gender,age,phone_brand,device_model,active_index,pre_index
0,0,0,35,0.0,0.0,0.000856,0.000951
1,1,1,37,0.011765,0.001125,0.002096,0.002131
2,2,0,32,0.011765,0.00225,0.004015,0.005071
3,3,1,28,0.011765,0.00225,0.002288,0.003217
4,4,0,75,0.023529,0.003375,0.005669,0.006742


#### Training the model used for testing

In [17]:
def KNN4appType_fit_pred(n_clusters,random_state,array_tag):
    mbk = MiniBatchKMeans(n_clusters=n_clusters, random_state=random_state)
    mbk.fit(array_tag)
    return mbk

def get_arr_tag(X):
    # device ids in X (first column)
    device_id_list = list(tuple(int(i)for i in X[:,0]))

    # event rows that belong to those devices
    af = data_train_events[data_train_events['device_id'].isin(device_id_list)].reset_index(drop=True)

    w = 26
    h = af.shape[0]
    array_tag = np.zeros([h,w])

    # smallest and largest tag id in tag_list
    mmin = 1
    mmax = 1021
    
    # build the app feature matrix: one row per event row, zero-padded to 26 tags
    for i in range(0,af.shape[0]):
        tag_list = af.loc[i,'tag_list']
    #     print(type(tag_list))
        r = tag_list[1:-1].split(',')
        li = [min_max(int(s),mmin,mmax) for s in r] # min-max normalize
        for j in range(0,len(li)):
            array_tag[i][j] = li[j]
    return array_tag
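
The inner loop can be expressed per row with NumPy slicing. A sketch, assuming as the notebook does that tag ids lie in [1, 1021] and that no list has more than 26 tags; `tags_to_row` is a hypothetical helper name.

```python
import numpy as np

def tags_to_row(tag_list_str, width=26, tag_min=1, tag_max=1021):
    """Turn a string like '[548, 549]' into a zero-padded, min-max scaled row."""
    tags = [int(s) for s in tag_list_str.strip("[]").split(",") if s.strip()]
    row = np.zeros(width)
    # Scale the tag ids to [0, 1] and write them into the leading positions
    row[:len(tags)] = (np.array(tags, dtype=float) - tag_min) / (tag_max - tag_min)
    return row

row = tags_to_row("[548, 549]")
```

The first two entries of `row` come out as roughly 0.53627451 and 0.5372549, matching values visible in the tag matrix printed in chapter 2. Note that the zero padding is indistinguishable from tag id 1, which also maps to 0.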

In [13]:
'''
Purpose: 1. assign an app_type to every event row  2. append the per-app-type preference indices to X
X: dataset, array (the first column is device_id)
return: X_1, array
'''
def pre_treatment(X, quick_pattern = True, model='part'):    
    # convert tag_list into a feature matrix
#     array_tag = get_arr_tag(X)
    # device ids in X (first column)
    device_id_list = list(tuple(int(i)for i in X[:,0]))

    # event rows that belong to those devices
    af = data_train_events[data_train_events['device_id'].isin(device_id_list)].reset_index(drop=True)

    w = 26
    h = af.shape[0]
    array_tag = np.zeros([h,w])

    # smallest and largest tag id in tag_list
    mmin = 1
    mmax = 1021
    
    # build the app feature matrix
    for i in range(0,af.shape[0]):
        tag_list = af.loc[i,'tag_list']
    #     print(type(tag_list))
        r = tag_list[1:-1].split(',')
        li = [min_max(int(s),mmin,mmax) for s in r] # min-max normalize
        for j in range(0,len(li)):
            array_tag[i][j] = li[j]
        
    # assign app_type
    n_clusters = 20
    random_state = 17
    batch_size = 100

#     if not quick_pattern:
#         # fit the model
#         mbk = KNN4appType_fit_pred(n_clusters,random_state,array_tag)
#     elif model=='completed':
#         mbk = get_pickle('KNN_appType_completed.pkl')
#     else:
#         # load the model
#         mbk = get_pickle('KNN_appType.pkl')
    # mbk is read from the notebook's global scope; quick_pattern and model
    # have no effect while the block above is commented out
    mbm_pred = mbk.predict(array_tag)
    af['app_type'] = pd.Series(mbm_pred)
    
    
    # preference index per app type
    af['pre_index'] = af['is_installed'] + 1.5 * af['is_active']
    pre_index_sum = af.groupby(['device_id','app_type'])['pre_index'].agg(np.sum)

    t_arr = np.zeros([X.shape[0],20])
    for i in range(0,X.shape[0]):
        did = int(X[i][0])
        for j_i,j in enumerate(list(pre_index_sum[did].index)):
            t_arr[i][j] = pre_index_sum[did].values[j_i]

    # min-max normalize each row
    for i in range(0,t_arr.shape[0]):
        t_arr[i,:] = (t_arr[i,:] - np.min(t_arr[i,:])) / (np.max(t_arr[i,:]) - np.min(t_arr[i,:]))

    # append to X
    X_1 = np.c_[X,t_arr]

    # drop the first column (device_id)
    X_1 = np.delete(X_1, 0, axis=1)
    
    return X_1

In [19]:
array_tag = get_arr_tag(data_train.values)

In [20]:
set_pickle('KNN_appType_completed_arrTag.pkl',array_tag)

In [None]:
array_tag = get_pickle('KNN_appType_completed_arrTag.pkl')

In [24]:
n_clusters = 20
random_state = 17
batch_size = 100

mbk = KNN4appType_fit_pred(n_clusters,random_state,array_tag)

In [25]:
set_pickle('KNN_appType_completed.pkl',mbk)

In [26]:
mbk.predict(array_tag)

array([17,  1,  7, ..., 13,  4, 19])

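#### Scheme 2 (used for the first submission)

Scheme 1 blew up in dimensionality, so the approach was revised.

Grouping tag_list by device_id gives, for each device, a set of vectors of length 26 (one per event row); the mean of these vectors is used to represent the app types the user prefers.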

In [16]:
def toArray(li):
    arr_t = np.zeros([1,26])
    for j in range(0,len(li)):
        arr_t[0][j] = li[j]
    return arr_t
# pd.DataFrame(list_tag).head()

In [17]:
array_tag = np.zeros([20000,26])
mmin = 1
mmax = 1021

In [19]:
for i in range(15000,20000):
    # tag_list rows of device i
    tag_list = data_train_events[data_train_events['device_id']==i]['tag_list']
    # convert each tag_list to a vector and average them
    for row in tag_list:
        r = row[1:-1].split(',')
        array_tag[i] += toArray([min_max(int(s),mmin,mmax) for s in r])[0]
    array_tag[i] /= len(tag_list)

In [20]:
import pickle
f=open('device_apptag_arr.pickle','wb')
pickle.dump(array_tag,f)
f.close()

In [18]:
# load
import pickle
f=open('device_apptag_arr.pickle','rb')
array_tag = pickle.load(f)
f.close()

Merge the vectors with the train dataset.

In [21]:
t = data_train.copy(deep=True)
pd.concat([t,pd.DataFrame(array_tag)], axis=1).head()

Unnamed: 0,device_id,gender,age,phone_brand,device_model,active_index,0,1,2,3,...,16,17,18,19,20,21,22,23,24,25
0,0,0,35,0,0,58.0,0.464391,0.461783,0.376711,0.304329,...,0.011006,0.017055,0.003589,0.004107,0.010118,0.016796,0.017222,0.0,0.0,0.0
1,1,1,37,1,1,142.0,0.563496,0.553062,0.442907,0.37106,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,0,32,1,2,272.0,0.506684,0.540043,0.477184,0.376362,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,1,28,1,2,155.0,0.557951,0.581768,0.460595,0.295997,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,0,75,2,3,384.0,0.526053,0.56323,0.467611,0.335094,...,0.026713,0.027334,0.014664,0.00708,0.0,0.0,0.0,0.0,0.0,0.0


A quick test first: it works. Now do the actual merge.

In [19]:
# duplicate the row of device dup_index so that both of its rows in data_train get its own vector
array_tag = np.insert(array_tag,dup_index,array_tag[dup_index],axis=0)
data_train = pd.concat([data_train,pd.DataFrame(array_tag)], axis=1)
data_train.head()

Unnamed: 0,device_id,gender,age,phone_brand,device_model,active_index,0,1,2,3,...,16,17,18,19,20,21,22,23,24,25
0,0,0,35,0.0,0.0,0.000856,0.464391,0.461783,0.376711,0.304329,...,0.011006,0.017055,0.003589,0.004107,0.010118,0.016796,0.017222,0.0,0.0,0.0
1,1,1,37,0.011765,0.001125,0.002096,0.563496,0.553062,0.442907,0.37106,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,0,32,0.011765,0.00225,0.004015,0.506684,0.540043,0.477184,0.376362,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,1,28,0.011765,0.00225,0.002288,0.557951,0.581768,0.460595,0.295997,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,0,75,0.023529,0.003375,0.005669,0.526053,0.56323,0.467611,0.335094,...,0.026713,0.027334,0.014664,0.00708,0.0,0.0,0.0,0.0,0.0,0.0


#### Scheme 1

Concatenate all tag_list entries of each device_id and insert the result into the training set as a feature.

In [8]:
# Min-max normalization
def min_max(x,mmin,mmax):
    return (x-mmin) / (mmax-mmin)
    
def my_toList(tag_list,mmin,mmax):
    le = []
    for row in tag_list:
        r = row[1:-1].split(',')
        le.extend([min_max(int(s),mmin,mmax) for s in r])
    return le
# pd.DataFrame(list_tag).head()

In [15]:
list_tag = []

There is not enough memory to do this in one pass, so it is processed in two manual batches: 0-15000 and 15000-20000.

In [18]:
# for i in range(0,data_train['device_id'].unique().shape[0]):
temp_list = []
for i in range(15000,20000):
    temp_list.append(my_toList(data_train_events[data_train_events['device_id']==i]['tag_list'],1,1021))

In [19]:
list_tag.extend(temp_list)

In [20]:
len(list_tag)

20000

The list built from tag_list is very large. To guard against data loss and avoid recomputing it, serialize the result first:

In [22]:
import pickle
f=open('device_apptag.pickle','wb')
pickle.dump(list_tag,f)
f.close()

In [3]:
# load
import pickle
f=open('device_apptag.pickle','rb')
list_tag = pickle.load(f)
f.close()

Append the feature vectors to the training set.

In [5]:
max_len = -1
for li in list_tag:
    li_len = len(li)
    if li_len>max_len:
        max_len = li_len

max_len

296593

In [4]:
# convert to an array, zero-padded to the longest list
array_tag = np.zeros([len(list_tag),len(max(list_tag,key = lambda x: len(x)))])
for i,j in enumerate(list_tag):
    array_tag[i][0:len(j)] = j

MemoryError: 

Out of memory: a 20000 × 296593 float64 array needs about 47 GB.

In [26]:
t = data_train.copy(deep=True)
pd.concat([t,pd.DataFrame(np.array(list_tag))], axis=1).head()

Unnamed: 0,device_id,gender,age,phone_brand,device_model,active_index,0
0,0,0,35,0,0,58.0,"[0.6980392156862745, 0.6892156862745098, 0.536..."
1,1,1,37,1,1,142.0,"[0.9911764705882353, 0.6980392156862745, 0.689..."
2,2,0,32,1,2,272.0,"[0.5372549019607843, 0.7058823529411765, 0.689..."
3,3,1,28,1,2,155.0,"[0.5372549019607843, 0.6980392156862745, 0.689..."
4,4,0,75,2,3,384.0,"[0.5372549019607843, 0.6950980392156862, 0.689..."


# 2 Task 1: Gender prediction

Gender is a categorical label, so a binary classifier is enough.

In [34]:
set_pickle('data_train.pkl',data_train)

In [7]:
# load the data
data_train = get_pickle('data_train.pkl')
data_train.head()

Unnamed: 0,device_id,gender,age,phone_brand,device_model,active_index,pre_index
0,0,0,35,0.0,0.0,0.000856,0.000951
1,1,1,37,0.011765,0.001125,0.002096,0.002131
2,2,0,32,0.011765,0.00225,0.004015,0.005071
3,3,1,28,0.011765,0.00225,0.002288,0.003217
4,4,0,75,0.023529,0.003375,0.005669,0.006742


## 2.1 Splitting the dataset

In [30]:
train_gender = data_train.copy(deep=True)
# train_gender.head()

In [31]:
train_gender['gender'] = train_gender['gender'].astype('int64')
# y_train = y_train.astype('int64')

In [32]:
y = train_gender['gender']
# train_gender.drop(['gender','age','device_id'],axis=1, inplace=True)
train_gender.drop(['gender','age'],axis=1, inplace=True)
train_gender=train_gender.iloc[:,0:17]
train_gender.head()

Unnamed: 0,device_id,phone_brand,device_model,active_index,pre_index
0,0,0.0,0.0,0.000856,0.000951
1,1,0.011765,0.001125,0.002096,0.002131
2,2,0.011765,0.00225,0.004015,0.005071
3,3,0.011765,0.00225,0.002288,0.003217
4,4,0.023529,0.003375,0.005669,0.006742


In [38]:
train_gender.describe()

Unnamed: 0,device_id,phone_brand,device_model,active_index,pre_index
count,20001.0,20001.0,20001.0,20001.0,20001.0
mean,9999.892755,0.053855,0.147098,0.01585,0.013194
std,5773.769872,0.086417,0.18027,0.028861,0.032864
min,0.0,0.0,0.0,0.0,0.0
25%,5000.0,0.011765,0.029246,0.005551,0.001355
50%,10000.0,0.023529,0.067492,0.009108,0.004286
75%,15000.0,0.082353,0.197975,0.016534,0.012772
max,19999.0,1.0,1.0,1.0,1.0


In [20]:
train_gender.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20001 entries, 0 to 20000
Data columns (total 5 columns):
device_id       20001 non-null int64
phone_brand     20001 non-null float64
device_model    20001 non-null float64
active_index    20001 non-null float64
pre_index       20001 non-null float64
dtypes: float64(4), int64(1)
memory usage: 781.4 KB


In [33]:
from sklearn.model_selection import train_test_split
X_train, X_holdout, y_train, y_holdout = train_test_split(train_gender.values, y.values, test_size=0.3,random_state=17)

In [45]:
X_train.shape

(14000, 5)

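## Discovering app types

First, use a KMeans model to discover app types.

The mining below is done separately on the split training and validation sets; the final training run uses the model fitted on the whole dataset.

### Training set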

In [22]:
# device ids in the training split (first column)
device_id_list = list(tuple(int(i)for i in X_train[:,0]))
# print(device_id_list)

In [23]:
# event rows for the devices in the training split
af = data_train_events[data_train_events['device_id'].isin(device_id_list)].reset_index(drop=True)
af.head()

Unnamed: 0,event_id,app_id,is_installed,is_active,device_id,tag_list,date
0,6,0,1,1,14271,"[549, 721, 704, 302, 303, 548, 183]",1
1,6,1,1,1,14271,"[713, 704, 548]",1
2,6,2,1,1,14271,"[549, 710, 704, 548, 172]",1
3,6,3,1,1,14271,"[548, 549]",1
4,6,4,1,1,14271,"[128, 1014]",1


In [65]:
af.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7413358 entries, 0 to 7413357
Data columns (total 7 columns):
event_id        int64
app_id          int64
is_installed    int64
is_active       int64
device_id       int64
tag_list        object
date            int64
dtypes: int64(6), object(1)
memory usage: 395.9+ MB


In [22]:
w = 26
h = af.shape[0]
array_tag = np.zeros([h,w])

# smallest and largest tag id in tag_list
mmin = 1
mmax = 1021

In [44]:
print("tag matrix shape: %d×%d"%(h,w))

tag matrix shape: 7413358×26


In [None]:
# build the app feature matrix
for i in range(0,af.shape[0]):
    tag_list = af.loc[i,'tag_list']
#     print(type(tag_list))
    r = tag_list[1:-1].split(',')
    li = [min_max(int(s),mmin,mmax) for s in r] # min-max normalize
    for j in range(0,len(li)):
        array_tag[i][j] = li[j]
print(array_tag[:5])

Save the resulting app tag matrix:

In [69]:
set_pickle("tag_Matrix.pkl",array_tag)

In [24]:
array_tag = get_pickle("tag_Matrix.pkl")

In [24]:
n_clusters = 20
random_state = 17
batch_size = 100

Now fit a model and look at the result:

In [25]:
from sklearn.cluster import KMeans,MiniBatchKMeans

In [55]:
mbm_pred = MiniBatchKMeans(n_clusters=n_clusters, random_state=random_state).fit_predict(array_tag)
mbm_pred

array([ 0, 11, 10, ...,  3, 11, 19])

In [23]:
array_tag

array([[0.5372549 , 0.70588235, 0.68921569, ..., 0.        , 0.        ,
        0.        ],
       [0.69803922, 0.68921569, 0.53627451, ..., 0.        , 0.        ,
        0.        ],
       [0.5372549 , 0.69509804, 0.68921569, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.53627451, 0.5372549 , 0.39607843, ..., 0.        , 0.        ,
        0.        ],
       [0.53627451, 0.5372549 , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.5372549 , 0.69803922, 0.70588235, ..., 0.        , 0.        ,
        0.        ]])

In [26]:
mbk = MiniBatchKMeans(n_clusters=n_clusters, random_state=random_state)
mbk.fit(array_tag)

MiniBatchKMeans(batch_size=100, compute_labels=True, init='k-means++',
                init_size=None, max_iter=100, max_no_improvement=10,
                n_clusters=20, n_init=3, random_state=17,
                reassignment_ratio=0.01, tol=0.0, verbose=0)

In [26]:
mbm_pred = mbk.predict(array_tag)
mbm_pred

array([ 0, 11, 10, ...,  3, 11, 19])

Save the model:

In [30]:
set_pickle("KNN_appType.pkl",mbk)

In [14]:
mbk = get_pickle('KNN_appType.pkl')

In [82]:
mbm_pred.shape[0]

7413358

Insert the cluster labels into the events table (af):

In [27]:
af['app_type'] = pd.Series(mbm_pred)
# data_train_events['app_type'] = data_train_events['app_type'].astype('int64')
af.head()

Unnamed: 0,event_id,app_id,is_installed,is_active,device_id,tag_list,date,app_type
0,6,0,1,1,14271,"[549, 721, 704, 302, 303, 548, 183]",1,0
1,6,1,1,1,14271,"[713, 704, 548]",1,11
2,6,2,1,1,14271,"[549, 710, 704, 548, 172]",1,10
3,6,3,1,1,14271,"[548, 549]",1,11
4,6,4,1,1,14271,"[128, 1014]",1,11


Insert the preference index into the table:

In [28]:
af['pre_index'] = af['is_installed'] + 1.5 * af['is_active']
af.head()

Unnamed: 0,event_id,app_id,is_installed,is_active,device_id,tag_list,date,app_type,pre_index
0,6,0,1,1,14271,"[549, 721, 704, 302, 303, 548, 183]",1,0,2.5
1,6,1,1,1,14271,"[713, 704, 548]",1,11,2.5
2,6,2,1,1,14271,"[549, 710, 704, 548, 172]",1,10,2.5
3,6,3,1,1,14271,"[548, 549]",1,11,2.5
4,6,4,1,1,14271,"[128, 1014]",1,11,2.5


Sum the preference index per device_id and app_type:

In [29]:
pre_index_sum = af.groupby(['device_id','app_type'])['pre_index'].agg(np.sum)
pre_index_sum.head()

device_id  app_type
2          0           10.0
           1           16.5
           2            4.5
           3           15.0
           4            5.0
Name: pre_index, dtype: float64

Insert the preference table into X_train:

In [30]:
t_arr = np.zeros([X_train.shape[0],20])
for i in range(0,X_train.shape[0]):
    did = int(X_train[i][0])
    for j_i,j in enumerate(list(pre_index_sum[did].index)):
        t_arr[i][j] = pre_index_sum[did].values[j_i]
t_arr.shape

(14000, 20)
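
The nested loop above fills one cell at a time. As a sketch with made-up data, the same device-by-app_type table can be produced with a pivot followed by a reindex that fixes both the row order and the number of columns (`device_ids` stands in for the first column of X_train, and 4 clusters are used instead of 20 to keep the example small):

```python
import pandas as pd

# Toy event rows with cluster labels (hypothetical values)
af = pd.DataFrame({
    "device_id": [2, 2, 2, 5],
    "app_type":  [0, 0, 3, 1],
    "pre_index": [2.5, 1.0, 1.0, 2.5],
})
n_clusters = 4
device_ids = [5, 2, 9]  # row order of the feature matrix; device 9 has no events

table = (
    af.pivot_table(index="device_id", columns="app_type",
                   values="pre_index", aggfunc="sum", fill_value=0)
      # Force one row per requested device and one column per cluster
      .reindex(index=device_ids, columns=range(n_clusters), fill_value=0)
)
t_arr = table.to_numpy(dtype=float)
```

Devices without any event rows end up as all-zero rows instead of raising a lookup error.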

Min-max normalize each row:

In [31]:
for i in range(0,t_arr.shape[0]):
    t_arr[i,:] = (t_arr[i,:] - np.min(t_arr[i,:])) / (np.max(t_arr[i,:]) - np.min(t_arr[i,:]))

t_arr[:1,:]

array([[0.07692308, 0.        , 0.34615385, 0.        , 0.        ,
        0.23076923, 0.        , 0.69230769, 0.        , 0.        ,
        0.5       , 1.        , 0.        , 0.        , 0.        ,
        0.19230769, 0.        , 0.        , 0.        , 0.42307692]])
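
In the row-wise loop above, a row whose values are all equal (for example all zeros) gives a zero denominator and therefore NaN. A hedged sketch of a vectorized variant that maps such rows to zeros; `rowwise_min_max` is a hypothetical helper:

```python
import numpy as np

def rowwise_min_max(arr):
    """Min-max scale each row to [0, 1]; constant rows become all zeros."""
    arr = np.asarray(arr, dtype=float)
    lo = arr.min(axis=1, keepdims=True)
    span = arr.max(axis=1, keepdims=True) - lo
    # Replace a zero span by 1 so the division is defined; the numerator is 0 there
    safe_span = np.where(span == 0, 1.0, span)
    return (arr - lo) / safe_span

scaled = rowwise_min_max([[0, 2, 4], [3, 3, 3]])
```

Whether zeros are the right encoding for a device with a flat preference profile is a modelling choice; the point is only to keep NaN out of the feature matrix.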

In [32]:
X_train_1 = np.c_[X_train,t_arr]
X_train_1

array([[8.95000000e+02, 0.00000000e+00, 1.10236220e-01, ...,
        0.00000000e+00, 0.00000000e+00, 4.23076923e-01],
       [3.34200000e+03, 0.00000000e+00, 4.49943757e-03, ...,
        1.44508671e-01, 6.66666667e-01, 5.22157996e-01],
       [1.69090000e+04, 1.52941176e-01, 4.16197975e-01, ...,
        0.00000000e+00, 0.00000000e+00, 2.50000000e-01],
       ...,
       [1.37020000e+04, 0.00000000e+00, 1.68728909e-02, ...,
        2.58302583e-02, 1.54981550e-01, 2.87822878e-01],
       [2.19100000e+03, 4.47058824e-01, 4.45444319e-01, ...,
        0.00000000e+00, 0.00000000e+00, 3.33333333e-01],
       [1.08630000e+04, 0.00000000e+00, 8.32395951e-02, ...,
        0.00000000e+00, 3.32197615e-02, 3.04940375e-01]])

In [8]:
# run once with the model fitted on the full dataset
mbk = get_pickle('KNN_appType_completed.pkl')

In [37]:
X_train_1 = pre_treatment(X_train,quick_pattern=True, model='completed')

In [38]:
X_train.shape

(14052, 5)

Delete the first column of the matrix (device_id):

In [33]:
X_train_1 = np.delete(X_train_1, 0, axis=1)
X_train_1[0,:]

array([0.00000000e+00, 1.10236220e-01, 7.91272384e-03, 6.81377015e-04,
       7.69230769e-02, 0.00000000e+00, 3.46153846e-01, 0.00000000e+00,
       0.00000000e+00, 2.30769231e-01, 0.00000000e+00, 6.92307692e-01,
       0.00000000e+00, 0.00000000e+00, 5.00000000e-01, 1.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.92307692e-01,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.23076923e-01])

### Validation set

Apply the same processing to the validation set to add app_type:

In [36]:
# device ids in the validation split (first column)
device_id_list_test = list(tuple(int(i)for i in X_holdout[:,0]))

# event rows for the devices in the validation split
af_test = data_train_events[data_train_events['device_id'].isin(device_id_list_test)].reset_index(drop=True)

w = 26
h = af_test.shape[0]
array_tag_test = np.zeros([h,w])

# smallest and largest tag id in tag_list
mmin = 1
mmax = 1021

In [37]:
# build the app feature matrix
for i in range(0,af_test.shape[0]):
    tag_list = af_test.loc[i,'tag_list']
#     print(type(tag_list))
    r = tag_list[1:-1].split(',')
    li = [min_max(int(s),mmin,mmax) for s in r] # min-max normalize
    for j in range(0,len(li)):
        array_tag_test[i][j] = li[j]
print(array_tag_test[:1])

[[0.53627451 0.5372549  0.24509804 0.25588235 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.        ]]


In [42]:
array_tag_test.shape

(3085164, 26)

In [38]:
n_clusters = 20
random_state = 17
batch_size = 100

mbm_test_pred = mbk.predict(array_tag_test)
af_test['app_type'] = pd.Series(mbm_test_pred)

af_test.head()

Unnamed: 0,event_id,app_id,is_installed,is_active,device_id,tag_list,date,app_type
0,189,486,1,0,12239,"[548, 549, 251, 262]",2,11
1,189,0,1,0,12239,"[549, 721, 704, 302, 303, 548, 183]",2,0
2,189,2,1,1,12239,"[549, 710, 704, 548, 172]",2,10
3,189,47,1,0,12239,"[721, 704, 548, 302, 303]",2,2
4,189,191,1,0,12239,"[713, 704, 548, 302, 303, 163]",2,2


In [46]:
af_test['pre_index'] = af_test['is_installed'] + 1.5 * af_test['is_active']
pre_index_sum_test = af_test.groupby(['device_id','app_type'])['pre_index'].agg(np.sum)

t_arr = np.zeros([X_holdout.shape[0],20])
for i in range(0,X_holdout.shape[0]):
    did = int(X_holdout[i][0])
    for j_i,j in enumerate(list(pre_index_sum_test[did].index)):
        t_arr[i][j] = pre_index_sum_test[did].values[j_i]

# min-max normalize each row
for i in range(0,t_arr.shape[0]):
    t_arr[i,:] = (t_arr[i,:] - np.min(t_arr[i,:])) / (np.max(t_arr[i,:]) - np.min(t_arr[i,:]))
    
# append to X_holdout
X_holdout_1 = np.c_[X_holdout,t_arr]

# drop the first column (device_id)
X_holdout_1 = np.delete(X_holdout_1, 0, axis=1)
X_holdout_1[0,:]

array([0.01176471, 0.03262092, 0.0167112 , 0.01305709, 0.08870968,
       0.15725806, 0.22177419, 0.        , 0.        , 0.22177419,
       0.08870968, 0.22177419, 0.        , 0.        , 0.41532258,
       1.        , 0.17741935, 0.13306452, 0.        , 0.08870968,
       0.04435484, 0.04435484, 0.04435484, 0.38306452])

In [38]:
# run once with the model fitted on the full dataset
X_holdout_1 = pre_treatment(X_holdout,quick_pattern=True,model='completed')

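## 2.2 Training the models

In the runs below, KNN scored marginally higher than the decision tree on the holdout set (accuracy 0.6534 vs 0.6521), so KNN was chosen as the gender prediction model.

### 2.2.1 Decision tree

Training: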

In [39]:
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

In [40]:
tree = DecisionTreeClassifier(criterion='entropy',max_depth=6, random_state=17)
tree.fit(X_train_1, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=6, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=17, splitter='best')

In [45]:
set_pickle('DecisionTreeClassifier.model',tree)

Validation and evaluation:

In [41]:
from sklearn.metrics import accuracy_score,mean_absolute_error

In [43]:
tree_pred = tree.predict(X_holdout_1)
accuracy_score(y_holdout, tree_pred)

0.6520579903349442

### 2.2.2 KNN

Training:

In [44]:
from sklearn.neighbors import KNeighborsClassifier

In [112]:
# load
knn = get_pickle('KNeighborsClassifier.model')

In [61]:
knn = KNeighborsClassifier(n_neighbors=40)
knn.fit(X_train_1, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=40, p=2,
                     weights='uniform')

In [63]:
set_pickle('KNeighborsClassifier.model',knn)

Validation and evaluation:

In [62]:
knn_pred = knn.predict(X_holdout_1)
accuracy_score(y_holdout, knn_pred)

0.6533911014830861

# 3 Task 2: Age prediction

Train regression models.

## 3.1 Splitting the dataset

In [9]:
train_age = data_train.copy(deep=True)
train_age['age'] = train_age['age'].astype('int64')
y = train_age['age']
# train_age.drop(['gender','age','device_id'],axis=1, inplace=True)
train_age.drop(['gender','age'],axis=1, inplace=True)
train_age = train_age.iloc[:,:17]
train_age.head()

Unnamed: 0,device_id,phone_brand,device_model,active_index,pre_index
0,0,0.0,0.0,0.000856,0.000951
1,1,0.011765,0.001125,0.002096,0.002131
2,2,0.011765,0.00225,0.004015,0.005071
3,3,0.011765,0.00225,0.002288,0.003217
4,4,0.023529,0.003375,0.005669,0.006742


In [47]:
X_train, X_holdout, y_train, y_holdout = train_test_split(train_age.values, y.values, test_size=0.3,random_state=17)

In [11]:
y_train

array([19, 33, 20, ..., 21, 39, 25], dtype=int64)

In [48]:
X_train_1 = pre_treatment(X_train)
X_holdout_1 = pre_treatment(X_holdout)

## 3.2 Training the models

### 3.2.1 Simple linear regression

Training:

In [67]:
from sklearn.linear_model import LinearRegression

In [68]:
lin_reg = LinearRegression()
lin_reg.fit(X_train_1,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Validation and evaluation:

In [69]:
linreg_pred = lin_reg.predict(X_holdout_1)
1 / (1 + mean_absolute_error(y_holdout, np.round(linreg_pred)))

0.12111243415608791
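
The score used here is $1/(1+\mathrm{MAE})$, so it lies in $(0, 1]$ and higher is better. Inverting it, the 0.1211 above corresponds to a mean absolute error of about 7.26 years. A small NumPy-only sketch of the metric (`age_score` is a hypothetical name, and the ages are made up):

```python
import numpy as np

def age_score(y_true, y_pred):
    """Return 1 / (1 + mean absolute error)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    return 1.0 / (1.0 + mae)

# Errors of 2, 3 and 0 years: MAE = 5/3, score = 3/8
score = age_score([20, 30, 40], [22, 27, 40])

# Recover the MAE implied by a reported score
implied_mae = 1.0 / 0.12111243415608791 - 1.0
```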

Save the model:

In [70]:
set_pickle('LinearRegression.model',lin_reg)

### 3.2.2 Multiple linear regression

Training:

In [20]:
# import
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

Z-score standardization:

In [49]:
X_scaler = StandardScaler()
y_scaler = StandardScaler()

In [60]:
# training set
X_scaler.fit(X_train_1)
X_train_scaled = X_scaler.transform(X_train_1)

y_scaler.fit(y_train.reshape(-1,1))
# y_train_scaled = y_scaler.transform(y_train.reshape(-1,1))

# validation set (scaled with the statistics of the training set)
X_holdout_scaled = X_scaler.transform(X_holdout_1)
# y_holdout_scaled = y_scaler.transform(y_holdout.reshape(-1,1))
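
For reference, a NumPy-only sketch of what the z-score transform and its inverse do to the target. The ages are made up. StandardScaler uses the population standard deviation, which is also NumPy's default for `std`.

```python
import numpy as np

y_train_toy = np.array([19.0, 33.0, 20.0, 28.0])  # hypothetical ages

mu = y_train_toy.mean()
sigma = y_train_toy.std()  # population standard deviation (ddof=0)

# Forward transform: zero mean, unit variance
y_scaled = (y_train_toy - mu) / sigma

# Inverse transform: back to years
y_restored = y_scaled * sigma + mu

# An error of 0.5 in standardized units corresponds to 0.5 * sigma years
error_in_years = 0.5 * sigma
```

The last line is why a mean absolute error computed on standardized targets cannot be compared directly with one computed in years: the two differ by the factor `sigma`.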

Fit the model with stochastic gradient descent:

In [61]:
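# 'squared_loss' is the name in the scikit-learn release this notebook ran on; scikit-learn 1.0 renamed it to 'squared_error'
regressor = SGDRegressor(loss='squared_loss')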
# scores = cross_val_score(regressor, X_train_1,y_train,cv=5)
# regressor.fit(X_train_scaled,y_train_scaled)
regressor.fit(X_train_scaled,y_train)
reg_pred_scaled = regressor.predict(X_holdout_scaled)

In [62]:
reg_pred_scaled

array([33.97808306, 31.89342506, 31.44115559, ..., 32.10091725,
       34.47900349, 42.06841974])

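Validation and evaluation:

In [57]:
# y_holdout_scaled is produced by the commented-out transform above, so this
# score appears to come from an earlier run in which the target was also
# standardized; it is in standardized units and is not directly comparable
# with the score in 3.2.1.
1 / (1 + mean_absolute_error(y_holdout_scaled, reg_pred_scaled))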

0.5795560656202156

Map the predictions back to the original scale:

In [55]:
y_scaler.inverse_transform(reg_pred_scaled)

array([33.26367874, 31.87161875, 33.54482918, ..., 29.82571231,
       32.36839788, 39.64694147])

Save the model:

In [63]:
set_pickle('SGDRegressor.model',regressor)

Save the target scaler:

In [59]:
set_pickle('y_scaler.model',y_scaler)