In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#! pip install apache-log-parser 
import apache_log_parser
%matplotlib inline

## Load and parse the data
This case study is based on an article by Nikolay Koldunov (koldunovn@gmail.com).

We use [apache-log-parser](https://github.com/rory/apache-log-parser) to analyze the Apache log. Before parsing, we need to know how Apache logging is configured on the site in question. Here we already know that the log format of the site being analyzed is:

    format = r'%V %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %T'

The fields have the following meanings (see [stackoverflow](http://stackoverflow.com/questions/9234699/understanding-apache-access-log)):
        
    %V  - the server name according to the UseCanonicalName setting
    %h  - remote host (the client IP)
    %l  - identity of the user determined by identd (not usually used since not reliable)
    %u  - user name determined by HTTP authentication
    %t  - time the request was received
    %r  - request line from the client ("GET / HTTP/1.0")
    %>s - status code sent from the server to the client (200, 404, etc.)
    %b  - size of the response sent to the client (in bytes)
    \"%{Referer}i\"  - Referer is the page that linked to this URL.
    \"%{User-Agent}i\"  - the browser identification string
    %T  - time taken to serve the request, in seconds
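To make the mapping between directives and fields concrete, here is a minimal sketch that matches the same layout with a hand-written regular expression from the standard library. It is not how apache-log-parser works internally; the pattern and the group names (`server_name`, `size`, `seconds`, and so on) are assumptions chosen for illustration and differ from the keys the library returns. It is applied to the sample line used in this notebook.

```python
import re

# Hypothetical pattern mirroring the log format above, one group per directive.
# Group names are illustrative only, not apache-log-parser's keys.
LOG_PATTERN = re.compile(
    r'(?P<server_name>\S+) '       # %V
    r'(?P<remote_host>\S+) '       # %h
    r'(?P<logname>\S+) '           # %l
    r'(?P<user>\S+) '              # %u
    r'\[(?P<time>[^\]]+)\] '       # %t
    r'"(?P<request>[^"]*)" '       # %r
    r'(?P<status>\d{3}) '          # %>s
    r'(?P<size>\S+) '              # %b
    r'"(?P<referer>[^"]*)" '       # %{Referer}i
    r'"(?P<user_agent>[^"]*)" '    # %{User-Agent}i
    r'(?P<seconds>\d+)'            # %T
)

line = ('koldunov.net 85.26.235.202 - - [16/Mar/2013:00:19:43 +0400] '
        '"GET /?p=364 HTTP/1.0" 200 65237 "http://koldunov.net/?p=364" '
        '"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) '
        'Chrome/23.0.1271.64 Safari/537.11" 0')

# Every captured value is a string at this point; no type conversion is done.
fields = LOG_PATTERN.match(line).groupdict()
print(fields['remote_host'], fields['status'], fields['size'])
```

This simple pattern assumes that quoted fields contain no escaped double quotes, which is one reason to prefer a dedicated parser for real logs.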

In [15]:
# Log format of the site being analyzed
fformat = '%V %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %T'

Create the parser:

In [16]:
p = apache_log_parser.make_parser(fformat)

For example, take the following log line:
        
        koldunov.net 85.26.235.202 - - [16/Mar/2013:00:19:43 +0400] "GET /?p=364 HTTP/1.0" 200 65237 "http://koldunov.net/?p=364" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11" 0

In [17]:
sample_string = 'koldunov.net 85.26.235.202 - - [16/Mar/2013:00:19:43 +0400] "GET /?p=364 HTTP/1.0" 200 65237 "http://koldunov.net/?p=364" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11" 0'

In [19]:
data = p(sample_string)

In [20]:
data

{'remote_host': '85.26.235.202',
 'remote_logname': '-',
 'remote_user': '-',
 'request_first_line': 'GET /?p=364 HTTP/1.0',
 'request_header_referer': 'http://koldunov.net/?p=364',
 'request_header_user_agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
 'request_header_user_agent__browser__family': 'Chrome',
 'request_header_user_agent__browser__version_string': '23.0.1271',
 'request_header_user_agent__is_mobile': False,
 'request_header_user_agent__os__family': 'Windows XP',
 'request_header_user_agent__os__version_string': '',
 'request_http_ver': '1.0',
 'request_method': 'GET',
 'request_url': '/?p=364',
 'request_url_fragment': '',
 'request_url_hostname': None,
 'request_url_netloc': '',
 'request_url_password': None,
 'request_url_path': '/',
 'request_url_port': None,
 'request_url_query': 'p=364',
 'request_url_query_dict': {'p': ['364']},
 'request_url_query_list': [('p', '364')],
 'request_url_query_simple_dict

In [2]:
# Read all lines; the context manager closes the file afterwards
with open('./data/apache_access_log') as f:
    log = f.readlines()

### Parse each line and build a list of dicts

In [21]:
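import sys

log_list = []
for line in log:
    try:
        data = p(line)
    except Exception:
        # Report the line and skip it, so stale data from the previous
        # iteration is never appended
        sys.stderr.write('Unable to parse %s' % line)
        continue
    # Turn '[16/Mar/2013:00:19:43 +0400]' into '16/Mar/2013 00:19:43 +0400'
    data['time_received'] = data['time_received'][1:12] + ' ' + data['time_received'][13:21] + ' ' + data['time_received'][22:27]

    log_list.append(data)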
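The slicing in the loop above depends on fixed character positions in the bracketed timestamp. As a sketch of what it produces, and of one possible alternative, the snippet below applies the same slices to the timestamp from the sample line and also parses it with `datetime.strptime`. The alternative is an assumption for illustration, not part of the original workflow, and `%b` matches English month abbreviations only under the default C/English locale.

```python
from datetime import datetime

raw = '[16/Mar/2013:00:19:43 +0400]'

# Same slices as in the loop: date, time, and UTC offset, joined by spaces
sliced = raw[1:12] + ' ' + raw[13:21] + ' ' + raw[22:27]

# Alternative: parse directly into a timezone-aware datetime
parsed = datetime.strptime(raw, '[%d/%b/%Y:%H:%M:%S %z]')

print(sliced)              # 16/Mar/2013 00:19:43 +0400
print(parsed.isoformat())  # 2013-03-16T00:19:43+04:00
```

The string form is easy to eyeball in a table, while a datetime object carries the offset explicitly and fails loudly on malformed input.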

## Prepare the site log data

Convert the list of dicts into a DataFrame:

In [22]:
df = pd.DataFrame(log_list)

In [23]:
df.columns

Index(['remote_host', 'remote_logname', 'remote_user', 'request_first_line',
       'request_header_referer', 'request_header_user_agent',
       'request_header_user_agent__browser__family',
       'request_header_user_agent__browser__version_string',
       'request_header_user_agent__is_mobile',
       'request_header_user_agent__os__family',
       'request_header_user_agent__os__version_string', 'request_http_ver',
       'request_method', 'request_url', 'request_url_fragment',
       'request_url_hostname', 'request_url_netloc', 'request_url_password',
       'request_url_path', 'request_url_port', 'request_url_query',
       'request_url_query_dict', 'request_url_query_list',
       'request_url_query_simple_dict', 'request_url_scheme',
       'request_url_username', 'response_bytes_clf', 'server_name2', 'status',
       'time_received', 'time_received_datetimeobj', 'time_received_isoformat',
       'time_received_tz_datetimeobj', 'time_received_tz_isoformat',
       'time_recei

In [24]:
df = df[['status', 'response_bytes_clf', 'remote_host', 'request_first_line', 'time_received']]

In [25]:
df.head()

|   | status | response_bytes_clf | remote_host | request_first_line | time_received |
|---|--------|--------------------|-------------|--------------------|---------------|
| 0 | 200 | 26126 | 109.165.31.156 | GET /index.php?option=com_content&task=section... | 16/Mar/2013 08:00:25 +0400 |
| 1 | 200 | 10532 | 109.165.31.156 | GET /templates/ja_procyon/css/template_css.css... | 16/Mar/2013 08:00:25 +0400 |
| 2 | 200 | 1853 | 109.165.31.156 | GET /templates/ja_procyon/switcher.js HTTP/1.0 | 16/Mar/2013 08:00:25 +0400 |
| 3 | 200 | 37153 | 109.165.31.156 | GET /includes/js/overlib_mini.js HTTP/1.0 | 16/Mar/2013 08:00:25 +0400 |
| 4 | 200 | 3978 | 109.165.31.156 | GET /modules/ja_transmenu/transmenuh.css HTTP/1.0 | 16/Mar/2013 08:00:25 +0400 |


Make time_received the index. Here we use pop, which removes the column from the DataFrame and returns it.