## Tracking Log Data Processing

#### Tracking Logs

Tracking (event instrumentation) means recording specific information at chosen points in an application's flow so that usage can be monitored. The collected data later supports product optimization and operational reporting. For example, tracking logs are a key input for personalized recommendations.

There are two mainstream tracking methods:

- **Self-developed**: Embedding custom code in the product for data collection and building corresponding back-end systems for querying and processing.
- **Third-party platforms**: Using third-party analytics tools such as Umeng, Baidu Mobile Analytics, etc.

**Types of tracking logs:**

- **Impression logs**: When a product is displayed on a page, it is considered an impression. Impression logs record this event each time a product is displayed.
  - Impression time  
  - Impression scene  
  - Unique user ID  
  - Product ID  
  - Product category ID  

- **Clickstream logs**: Records of user actions such as browsing, favoriting, adding to cart, purchasing, commenting, and searching.
  - Impression time (corresponding to impression log)  
  - Impression scene (corresponding to impression log)  
  - Unique user ID  
  - Action time  
  - Action type  
  - Product ID  
  - Product category ID  
  - Dwell time (browsing)  
  - Rating (comment)  
  - Search query (search)  
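
The two record types above can be sketched as Python dataclasses. This is only an illustrative schema: the class names `ImpressionLog` and `ClickstreamLog` are hypothetical, the field types are assumptions, and the field names simply follow the spelling used in the logging cells later in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImpressionLog:
    exposure_timesteamp: int   # impression time (Unix timestamp)
    exposure_loc: str          # impression scene, e.g. "detail"
    uid: int                   # unique user ID
    sku_id: int                # product ID
    cate_id: int               # product category ID


@dataclass
class ClickstreamLog(ImpressionLog):
    timesteamp: int                  # action time (Unix timestamp)
    behavior: str                    # action type, e.g. "pv", "fav", "cart", "buy"
    stay_time: Optional[int] = None  # dwell time, only for browsing
    rating: Optional[int] = None     # rating, only for comments
    query: Optional[str] = None      # search query, only for searches


# A browsing action tied back to the impression that preceded it.
click = ClickstreamLog(
    exposure_timesteamp=1543519181, exposure_loc="detail",
    uid=1, sku_id=1, cate_id=1,
    timesteamp=1543519224, behavior="pv", stay_time=60,
)
print(click)
```

Making the clickstream record extend the impression record mirrors the field lists above, where each action carries the impression time and scene it corresponds to.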

#### Significance of Tracking Logs

**User behavior and preference analysis**  
- Analyzing clickstream logs to capture behavioral features of individual or group users, and predicting user preferences.

**Statistical metrics analysis**  
- **Click-through rate (CTR):** Probability of an item being clicked. Formula: *clicks / impressions*. For example, if a product is displayed 100 times and clicked 10 times, CTR = 10%.  
- **Bounce rate:** When a user visits one page but performs no further actions, it counts as a bounce. Formula: *single-page visits / total visits*.  
  - Overall bounce rate (for an entire section/app).  
  - Single-page bounce rate. For example, if 100 users visited a page, and 10 of them left without any further interaction, the bounce rate is 10%.  
- **Conversion rate:** In e-commerce, conversion rate = *number of completed orders / number of product visits*. For example, if a product has 100 visits but only 5 result in orders, its conversion rate is 5%.  

**Note:** The “visits” in bounce rate and conversion rate refer to **unique user visits**, which are counted per visit rather than per distinct user, so the visit count can exceed the number of distinct users. For example, if User A visits Product 1 on January 1 and again on January 2, Product 1’s visit count is 2, not 1.
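
As a quick check of the formulas above, here is a minimal sketch that reproduces the worked examples. The helper `rate` and the `visits` tuples are hypothetical names and data introduced only for illustration.

```python
def rate(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


# Counts taken from the worked examples above.
ctr = rate(10, 100)              # clicks / impressions
bounce_rate = rate(10, 100)      # single-page visits / total visits
conversion_rate = rate(5, 100)   # completed orders / product visits

# Visits are counted per visit, not per distinct user:
# the same user visiting the same product on two days yields two visits.
visits = [("user_a", "product_1", "01-01"), ("user_a", "product_1", "01-02")]
visit_count = len(set(visits))
distinct_users = len({user for user, _, _ in visits})

print(f"CTR={ctr:.0%} bounce={bounce_rate:.0%} conversion={conversion_rate:.0%}")
print(f"visits={visit_count} distinct_users={distinct_users}")
```

Guarding against a zero denominator matters in practice, since a newly listed product may have no impressions or visits yet.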

In [1]:
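import logging


def get_logger(logger_name, path, level):
    """Return a logger named `logger_name` that writes records to the file at `path`."""
    # Create (or fetch) the logger.
    logger = logging.getLogger(logger_name)
    # Standard levels: CRITICAL, ERROR, WARNING, INFO, DEBUG, NOTSET (custom levels are also allowed).
    logger.setLevel(level)

    # Create the formatter.
    # %(asctime)s: time at which the record was logged
    # %(message)s: the log message
    fmt = '%(asctime)s: %(message)s'
    datefmt = '%Y/%m/%d %H:%M:%S'
    formatter = logging.Formatter(fmt, datefmt)

    # Create the handler.
    # FileHandler writes formatted logging records to disk files.
    handler = logging.FileHandler(path)
    handler.setLevel(level)

    # Attach the formatter to the handler, and the handler to the logger.
    # Note: logging.getLogger returns the same object for the same name, so calling
    # get_logger again with that name adds another handler and duplicates each record.
    handler.setFormatter(formatter)
    logger.addHandler(handler)

    return logger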

In [14]:
import time
import logging

exposure_logger = get_logger(
    'exposure',
    '/root/workspace/3.rs_project/project2/meiduoSourceCode/logs/exposure.log',
    logging.DEBUG,
)
# Impression log fields
exposure_timesteamp = time.time()  # impression time (Unix timestamp)
exposure_loc = 'detail'            # impression scene
uid = 1                            # unique user ID
sku_id = 1                         # product ID
cate_id = 1                        # product category ID


In [15]:
# %d truncates the float returned by time.time() to whole seconds.
exposure_logger.info(
    "exposure_timesteamp<%d> exposure_loc<%s> uid<%d> sku_id<%d> cate_id<%d>"
    % (exposure_timesteamp, exposure_loc, uid, sku_id, cate_id)
)

In [16]:
# Each run of the previous cell writes a record to the log.
# Show the last five records.
!tail -5 /root/workspace/3.rs_project/project2/meiduoSourceCode/logs/exposure.log

2020/12/23 21:16:43:exposure_timesteamp<1608729319> exposure_loc<detail> uid<1> sku_id<1> cate_id<1>
2020/12/23 21:21:21:exposure_timesteamp<1608729679> exposure_loc<detail> uid<1> sku_id<1> cate_id<1>
2020/12/23 21:21:21: exposure_timesteamp<1608729679> exposure_loc<detail> uid<1> sku_id<1> cate_id<1>
2020/12/24 11:09:20: exposure_timesteamp<1608779359> exposure_loc<detail> uid<1> sku_id<1> cate_id<1>
2020/12/24 11:09:20: exposure_timesteamp<1608779359> exposure_loc<detail> uid<1> sku_id<1> cate_id<1>


In [17]:
import time
import logging

click_trace_logger = get_logger(
    'click_trace',
    '/root/workspace/3.rs_project/project2/meiduoSourceCode/logs/click_trace.log',
    logging.DEBUG,
)

In [18]:
# Clickstream log fields
# exposure_timesteamp and exposure_loc are reused from the impression log cell above,
# which ties this action back to the impression it corresponds to.
timesteamp = time.time()  # action time (Unix timestamp)
behavior = 'pv'           # action type: one of pv, fav, cart, buy
uid = 1
sku_id = 1
cate_id = 1
stay_time = 60            # dwell time
# Assume a clickstream log record has the following format:
click_trace_logger.info(
    "exposure_timesteamp<%d> exposure_loc<%s> timesteamp<%d> behavior<%s> uid<%d> sku_id<%d> cate_id<%d> stay_time<%d>"
    % (exposure_timesteamp, exposure_loc, timesteamp, behavior, uid, sku_id, cate_id, stay_time)
)


In [19]:
!cat /root/workspace/3.rs_project/project2/meiduoSourceCode/logs/click_trace.log

2018/11/30 03:20:24: exposure_timesteamp<1543519181> exposure_loc<detail> timesteamp<1543519224> behavior<pv> uid<1> sku_id<1> cate_id<1> stay_time<60>
2018/11/30 11:54:45: exposure_timesteamp<1543519181> exposure_loc<detail> timesteamp<1543550085> behavior<pv> uid<1> sku_id<1> cate_id<1> stay_time<60>
2018/11/30 11:54:45: exposure_timesteamp<1543519181> exposure_loc<detail> timesteamp<1543550085> behavior<pv> uid<1> sku_id<1> cate_id<1> stay_time<60>
2018/11/30 13:17:16: exposure_timesteamp<1543519181> exposure_loc<detail> timesteamp<1543555036> behavior<pv> uid<1> sku_id<1> cate_id<1> stay_time<60>
2018/11/30 13:17:16: exposure_timesteamp<1543519181> exposure_loc<detail> timesteamp<1543555036> behavior<pv> uid<1> sku_id<1> cate_id<1> stay_time<60>
2018/11/30 13:17:16: exposure_timesteamp<1543519181> exposure_loc<detail> timesteamp<1543555036> behavior<pv> uid<1> sku_id<1> cate_id<1> stay_time<60>
2018/11/30 13:18:16: exposure_timesteamp<1543555093> exposure_loc<detail> timestea

#### Flume Log Collection


In [20]:
!hadoop fs -ls /meiduo_mall/logs

Found 2 items
drwxr-xr-x   - root supergroup          0 2020-12-24 08:55 /meiduo_mall/logs/click-trace
drwxr-xr-x   - root supergroup          0 2020-12-23 22:31 /meiduo_mall/logs/click_trace


In [21]:
import re

s = '2018/12/01 02:35:13: exposure_timesteamp<1543601846> exposure_loc<detail> timesteamp<1543602913> behavior<pv> uid<1> sku_id<1> cate_id<1> stay_time<60>'

# Each field is logged as name<value>; capture every value in a named group.
# Adjacent raw string literals are concatenated into a single pattern.
pattern = re.compile(
    r"exposure_timesteamp<(?P<exposure_timesteamp>.*?)> "
    r"exposure_loc<(?P<exposure_loc>.*?)> "
    r"timesteamp<(?P<timesteamp>.*?)> "
    r"behavior<(?P<behavior>.*?)> "
    r"uid<(?P<uid>.*?)> "
    r"sku_id<(?P<sku_id>.*?)> "
    r"cate_id<(?P<cate_id>.*?)> "
    r"stay_time<(?P<stay_time>.*?)>"
)
match = pattern.search(s)

# Collect (field, value) pairs in the order the groups are defined;
# the list stays empty if the line does not match.
result = list(match.groupdict().items()) if match else []
result

[('exposure_timesteamp', '1543601846'),
 ('exposure_loc', 'detail'),
 ('timesteamp', '1543602913'),
 ('behavior', 'pv'),
 ('uid', '1'),
 ('sku_id', '1'),
 ('cate_id', '1'),
 ('stay_time', '60')]
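
Because every field in the record is written as `name<value>`, the same parsing can be done without listing each field in the pattern. The following is a minimal sketch under that assumption. `FIELD_RE`, `CLICK_FIELDS`, and `parse_record` are hypothetical names, and converting all-digit values to `int` is a choice made here for illustration, not something the log format requires.

```python
import re

# One generic pattern: a field name followed by its value in angle brackets.
FIELD_RE = re.compile(r"(\w+)<(.*?)>")

# Fields expected in a clickstream record, as written by the logging cell above.
CLICK_FIELDS = (
    "exposure_timesteamp", "exposure_loc", "timesteamp", "behavior",
    "uid", "sku_id", "cate_id", "stay_time",
)


def parse_record(line, required=CLICK_FIELDS):
    """Parse one log line into a dict, or return None if a required field is missing."""
    fields = dict(FIELD_RE.findall(line))
    if any(name not in fields for name in required):
        return None  # truncated or malformed line
    # Convert purely numeric values to int; leave everything else as a string.
    return {
        name: int(value) if value.isdigit() else value
        for name, value in fields.items()
    }


line = ('2018/12/01 02:35:13: exposure_timesteamp<1543601846> exposure_loc<detail> '
        'timesteamp<1543602913> behavior<pv> uid<1> sku_id<1> cate_id<1> stay_time<60>')
record = parse_record(line)
print(record)

# A line cut off mid-field is rejected instead of yielding a partial record.
truncated = '2018/11/30 13:18:16: exposure_timesteamp<1543555093> exposure_loc<detail> timestea'
print(parse_record(truncated))
```

Returning `None` for incomplete lines lets a batch job skip damaged records rather than fail on them.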