In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

### 0. Basics
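**Lookalike audience expansion**  
Given a seed audience (also called a seed package) supplied by an advertiser, automatically find the users who resemble it (the expanded audience).  
**Seed package**  
The seed audience supplied by the advertiser.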

# 1. Data Analysis

uid, aid, and all user and ad features are encrypted.

### 1. userFeature.data  
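Large-scale user feature data.  
Each line holds the features of one user; feature groups are separated by "|", and unknown values are recorded as 0.  
**The number of features recorded varies from user to user.**

**Feature descriptions:**  
LBS: geographic location  
interest (interest1-5): interest ids  
keyword (kw1, kw2, kw3): interest keywords, more specific than interest  
appIdInstall: recently installed apps  
appIdAction: apps the user uses frequently  
topic: user preference topics mined with LDA  
ct: network connection type (WIFI/2G/3G/4G)  
os: operating system (Android/iOS)  
carrier: mobile carrier (China Mobile/China Unicom/China Telecom/other)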

In [6]:
with open('userFeature.data','r') as f:
    for i in range(10):
        line = f.readline()
        print(line)

uid 26325489|age 4|gender 2|marriageStatus 11|education 7|consumptionAbility 2|LBS 950|interest1 93 70 77 86 109 47 75 69 45 8 29 49 83 6 46 36 11 44 30 118 76 48 28 106 59 67 41 114 111 71 9|interest2 46 19 13 29|interest5 52 100 72 131 116 11 71 12 8 113 28 73 6 132 99 76 46 62 121 59 129 21 93|kw1 664359 276966 734911 103617 562294|kw2 11395 79112 115065 77033 36176|topic1 9826 105 8525 5488 7281|topic2 9708 5553 6745 7477 7150|ct 3 1|os 2|carrier 1

uid 1184123|age 2|gender 1|marriageStatus 5 13|education 2|consumptionAbility 1|LBS 803|interest1 75 29|interest2 33|kw1 338851 361151 542834 496283 229952|kw2 80263 39618 53539 180 38163|topic1 4391 9140 5669 1348 4388|topic2 9401 7724 1380 8890 7153|ct 3 1|os 1|carrier 1

uid 76072711|age 1|gender 1|marriageStatus 13 10|education 5|consumptionAbility 1|LBS 927|interest1 70 12 28 106 59 49 41 6 42 115 35 116 36 11 96|interest2 51 22 79 81 70 6 21 4 41 35|interest5 77 72 80 116 101 13 1 109 8 50 6 42 76 9 46 36 58 64 85 103 131 11 79 48
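
The line format shown above (groups separated by "|", each group a feature name followed by one or more integer ids) can be turned into a dictionary with a few lines of plain Python. This is a minimal sketch: `parse_user_line` is a hypothetical helper name, and the sample string is an abridged copy of the second printed line.

```python
from typing import Dict, List


def parse_user_line(line: str) -> Dict[str, List[int]]:
    """Parse one userFeature.data line into {feature name: list of integer ids}."""
    features: Dict[str, List[int]] = {}
    for group in line.strip().split("|"):
        if not group.strip():
            continue  # skip empty groups, e.g. from a trailing separator
        name, *values = group.split()
        features[name] = [int(v) for v in values]
    return features


# Abridged from the second line printed above.
sample = "uid 1184123|age 2|gender 1|marriageStatus 5 13|interest1 75 29|ct 3 1|os 1|carrier 1"
user = parse_user_line(sample)
print(user["uid"], user["marriageStatus"])
```

Features that a user lacks are simply absent from the dictionary, which matches the note that the number of recorded features varies per user.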

### 2. train.csv
aid uniquely identifies an ad; uid uniquely identifies a user.  
The label is either +1 or -1: +1 marks a seed user and -1 a non-seed user.  

In [3]:
df1 = pd.read_csv('train.csv')
df1.head()

| | aid | uid | label |
|---|---|---|---|
| 0 | 699 | 78508957 | -1 |
| 1 | 1991 | 3637295 | -1 |
| 2 | 1119 | 19229018 | -1 |
| 3 | 2013 | 79277120 | -1 |
| 4 | 692 | 41528441 | -1 |


In [25]:
df1.describe()

| | aid | uid | label |
|---|---|---|---|
| count | 8798814.0 | 8798814.0 | 8798814.0 |
| mean | 1044.592 | 41265600.0 | -0.9040868 |
| std | 613.8768 | 23831980.0 | 0.4273488 |
| min | 6.0 | 2.0 | -1.0 |
| 25% | 519.0 | 20617480.0 | -1.0 |
| 50% | 1107.0 | 41275220.0 | -1.0 |
| 75% | 1530.0 | 61903520.0 | -1.0 |
| max | 2216.0 | 82542900.0 | 1.0 |
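
Because the label takes only the values +1 and -1, its mean determines the share of seed users: mean = 2p - 1, so p = (mean + 1) / 2. A small sketch of that arithmetic using the mean reported above (`positive_rate_from_mean` is a hypothetical helper name):

```python
def positive_rate_from_mean(label_mean: float) -> float:
    """Share of +1 labels when labels are coded as +1 / -1."""
    # mean = (+1) * p + (-1) * (1 - p) = 2p - 1, hence p = (mean + 1) / 2
    return (label_mean + 1.0) / 2.0


label_mean = -0.9040868  # mean of the label column in the describe() output
rate = positive_rate_from_mean(label_mean)
print(f"share of seed users: {rate:.4%}")
```

This comes to roughly 4.8% seed users, about one row in twenty, so the training set is heavily imbalanced toward the -1 class.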


### 3. test1.csv
aid uniquely identifies an ad; uid uniquely identifies a user.  

In [8]:
df2 = pd.read_csv('test1.csv')
df2.head()

| | aid | uid |
|---|---|---|
| 0 | 2118 | 64355836 |
| 1 | 692 | 45051997 |
| 2 | 692 | 10869198 |
| 3 | 1918 | 75929554 |
| 4 | 1596 | 5790162 |


In [12]:
df2.columns

Index(['aid', 'uid'], dtype='object')

In [21]:
df2.index

RangeIndex(start=0, stop=2265989, step=1)

In [18]:
df2.describe()

| | aid | uid |
|---|---|---|
| count | 2265989.0 | 2265989.0 |
| mean | 1044.734 | 41273970.0 |
| std | 613.8297 | 23836900.0 |
| min | 6.0 | 8.0 |
| 25% | 519.0 | 20635200.0 |
| 50% | 1107.0 | 41278560.0 |
| 75% | 1530.0 | 61920640.0 |
| max | 2216.0 | 82542880.0 |


#### Some uids appear more than once

In [13]:
# Keep only the first row for each uid, leaving one row per distinct user
new_df2 = df2.drop_duplicates(subset=['uid'], keep='first')

In [16]:
new_df2.describe()

| | aid | uid |
|---|---|---|
| count | 2195951.0 | 2195951.0 |
| mean | 1042.024 | 41277170.0 |
| std | 612.3307 | 23835080.0 |
| min | 6.0 | 8.0 |
| 25% | 516.0 | 20646910.0 |
| 50% | 1057.0 | 41280920.0 |
| 75% | 1530.0 | 61924910.0 |
| max | 2216.0 | 82542880.0 |
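
Comparing the two counts, 2,265,989 rows shrink to 2,195,951 distinct uids, so 70,038 rows (about 3.1%) belong to a uid that has already appeared. A plausible reading, not verified here, is that the same user is paired with several ads rather than repeated verbatim; telling the two cases apart means also counting exact (aid, uid) repeats. The sketch below shows both counts on made-up pairs in plain Python; on the real frame, `df2['uid'].nunique()` and `df2.duplicated(subset=['aid', 'uid']).sum()` would give the same information.

```python
from collections import Counter

# Toy (aid, uid) pairs standing in for test1.csv rows; the values are made up.
pairs = [(692, 101), (692, 102), (1918, 101), (1596, 103), (2118, 101)]

uid_counts = Counter(uid for _, uid in pairs)
n_rows = len(pairs)
n_users = len(uid_counts)

extra_rows = n_rows - n_users              # rows beyond the first for each uid
repeated_pairs = n_rows - len(set(pairs))  # exact (aid, uid) duplicates

print(f"rows={n_rows} users={n_users} extra={extra_rows} exact_repeats={repeated_pairs}")
```

If exact repeats are zero, dropping duplicate uids discards legitimate (aid, uid) pairs, so the deduplicated frame is suitable for counting users but not for scoring.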


### 4. adFeature.csv
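The account structure has four levels: account, campaign, ad, creative.  
**Feature descriptions:**  
aid: ad id, primary key  
advertiserId: the advertiser that owns the account  
campaignId: campaign (a collection of ads)  
creativeId: creative id  
creativeSize: creative size  
adCategoryId: the ad's category under the ad taxonomy  
productId: id of the product being promoted  
productType: product type corresponding to the campaign objective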

In [5]:
df3 = pd.read_csv('adFeature.csv')
df3.head()

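| | aid | advertiserId | campaignId | creativeId | creativeSize | adCategoryId | productId | productType |
|---|---|---|---|---|---|---|---|---|
| 0 | 177 | 8203 | 76104 | 1500666 | 59 | 282 | 0 | 6 |
| 1 | 2050 | 19441 | 178687 | 245165 | 53 | 1 | 0 | 6 |
| 2 | 1716 | 5552 | 158101 | 1080850 | 35 | 27 | 113 | 9 |
| 3 | 336 | 370 | 4833 | 119845 | 22 | 67 | 113 | 9 |
| 4 | 671 | 45705 | 352827 | 660519 | 42 | 67 | 0 | 4 |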


In [24]:
df3.describe()

| | aid | advertiserId | campaignId | creativeId | creativeSize | adCategoryId | productId | productType |
|---|---|---|---|---|---|---|---|---|
| count | 173.0 | 173.0 | 173.0 | 173.0 | 173.0 | 173.0 | 173.0 | 173.0 |
| mean | 1140.364162 | 13229.202312 | 159044.612717 | 938856.7 | 50.364162 | 57.33526 | 2559.277457 | 7.346821 |
| std | 658.957025 | 23033.243589 | 184836.397583 | 521112.3 | 24.827317 | 62.197012 | 5570.602765 | 2.827595 |
| min | 6.0 | 60.0 | 80.0 | 5977.0 | 20.0 | 1.0 | 0.0 | 4.0 |
| 25% | 562.0 | 702.0 | 31020.0 | 492484.0 | 35.0 | 21.0 | 0.0 | 4.0 |
| 50% | 1171.0 | 7229.0 | 76104.0 | 981822.0 | 42.0 | 27.0 | 0.0 | 6.0 |
| 75% | 1728.0 | 11487.0 | 209098.0 | 1383456.0 | 59.0 | 67.0 | 3733.0 | 11.0 |
| max | 2216.0 | 158679.0 | 766460.0 | 1806760.0 | 109.0 | 282.0 | 28986.0 | 11.0 |
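
Since aid is the primary key of adFeature.csv, each (aid, uid) sample can pick up its ad's features through a lookup on aid. The sketch below uses only the standard library: the two ad rows are copied from the `head()` output, while the `samples` list is invented purely for illustration. With the real frames, `df1.merge(df3, on='aid', how='left')` would be the pandas counterpart.

```python
import csv
import io

# First two rows of adFeature.csv as shown by df3.head().
AD_FEATURE_CSV = """aid,advertiserId,campaignId,creativeId,creativeSize,adCategoryId,productId,productType
177,8203,76104,1500666,59,282,0,6
2050,19441,178687,245165,53,1,0,6
"""

# Hypothetical (aid, uid, label) samples, invented for illustration only.
samples = [(177, 1, -1), (2050, 2, 1), (177, 3, 1)]

# Index the ad rows by aid so that each sample needs a single dictionary lookup.
ads_by_aid = {
    int(row["aid"]): row for row in csv.DictReader(io.StringIO(AD_FEATURE_CSV))
}

# Attach the ad features to every sample (values stay strings, as read by csv).
joined = [
    {"uid": uid, "label": label, **ads_by_aid[aid]}
    for aid, uid, label in samples
]

for row in joined:
    print(row["uid"], row["aid"], row["advertiserId"], row["label"])
```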


### 5. submission.csv
The submission file.  
Each row gives a user's score for a given seed package.  
The score field may contain at most 8 significant digits.  
aid,uid,score  
100,10000000,0.62124588  
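
One way to respect the 8-significant-digit limit when writing the file is to round each score before serializing it. This is a sketch under assumptions: `format_score` and `write_submission` are hypothetical helper names, and avoiding exponent notation is a precaution rather than a documented requirement.

```python
import csv
import io
from decimal import Decimal


def format_score(score: float) -> str:
    """Round to 8 significant digits and render without an exponent."""
    rounded = float(f"{score:.8g}")
    # Decimal's "f" format prints positional notation even for tiny values.
    return format(Decimal(repr(rounded)), "f")


def write_submission(rows, out) -> None:
    """Write (aid, uid, score) rows in the aid,uid,score layout."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["aid", "uid", "score"])
    for aid, uid, score in rows:
        writer.writerow([aid, uid, format_score(score)])


buffer = io.StringIO()
write_submission([(100, 10000000, 0.621245881234)], buffer)
print(buffer.getvalue())
```

Passing an open file handle instead of the `io.StringIO` buffer writes the same content to disk.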