<!DOCTYPE html>
<!--[if IEMobile 7 ]><html class="no-js iem7"><![endif]-->
<!--[if lt IE 9]><html class="no-js lte-ie8"><![endif]-->
<!--[if (gt IE 8)|(gt IEMobile 7)|!(IEMobile)|!(IE)]><!--><html class="no-js" lang="en"><!--<![endif]-->
<head>
<meta charset="utf-8">
<title>Bowen's Blog</title>
<meta name="author" content="Bowen Ma">
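<meta name="description" content="The client's main site is a Spring-based monolith with traffic in the tens of millions per day, and all of its application logs and access logs are uploaded to a Splunk server. As the number of services has grown, so has the volume of logs, and the Splunk bill along with it. Since the main site contributes the most logs, we started there to cut down on useless log uploads. …">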
<!-- http://t.co/dKP3o1e -->
<meta name="HandheldFriendly" content="True">
<meta name="MobileOptimized" content="320">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="canonical" href="http://iambowen.github.com/">
<link href="/favicon.png" rel="icon">
<link href="/stylesheets/screen.css" media="screen, projection" rel="stylesheet" type="text/css">
<link href="/atom.xml" rel="alternate" title="Bowen's Blog" type="application/atom+xml">
<script src="/javascripts/modernizr-2.0.js"></script>
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<script>!window.jQuery && document.write(unescape('%3Cscript src="/javascripts/libs/jquery.min.js"%3E%3C/script%3E'))</script>
<script src="/javascripts/octopress.js" type="text/javascript"></script>
<!--Fonts from Google's Web font directory at http://google.com/webfonts -->
<link href="//fonts.googleapis.com/css?family=PT+Serif:regular,italic,bold,bolditalic" rel="stylesheet" type="text/css">
<link href="//fonts.googleapis.com/css?family=PT+Sans:regular,italic,bold,bolditalic" rel="stylesheet" type="text/css">
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-40335675-1']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
</head>
<body >
<header role="banner"><hgroup>
<h1><a href="/">Bowen's Blog</a></h1>
<h2>Respect My Authorita.</h2>
</hgroup>
</header>
<nav role="navigation"><ul class="subscription" data-subscription="rss email">
<li><a href="/atom.xml" rel="subscribe-rss" title="subscribe via RSS">RSS</a></li>
<li><a href="mailto:iambowen.m@gmail.com" rel="subscribe-email" title="subscribe via email">Email</a></li>
</ul>
<form action="https://www.google.com/search" method="get">
<fieldset role="search">
<input type="hidden" name="sitesearch" value="iambowen.github.com">
<input class="search" type="text" name="q" results="0" placeholder="Search"/>
</fieldset>
</form>
<ul class="main-navigation">
<li><a href="/">Blog</a></li>
<li><a href="/blog/archives">Archives</a></li>
</ul>
</nav>
<div id="main">
<div id="content">
<div class="blog-index">
<article>
<header>
<h1 class="entry-title"><a href="/log/spring/2016/11/17/supressing-an-warning-log">Suppressing a Warning Log</a></h1>
<p class="meta">
<time class='entry-date' datetime='2016-11-17T22:49:53+11:00'><span class='date'><span class='date-month'>Nov</span> <span class='date-day'>17</span><span class='date-suffix'>th</span>, <span class='date-year'>2016</span></span> <span class='time'>10:49 pm</span></time>
| <a href="/log/spring/2016/11/17/supressing-an-warning-log#disqus_thread"
data-disqus-identifier="http://iambowen.github.com/log/spring/2016/11/17/supressing-an-warning-log">Comments</a>
</p>
</header>
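<div class="entry-content"><p>The client's main site is a Spring-based monolith with traffic in the tens of millions per day. All of its application logs and access logs are uploaded to a <a href="https://www.splunk.com">Splunk</a> server. As the number of services has grown, so has the volume of logs, and the Splunk bill along with it. Since the main site contributes the most logs, we started there to cut down on useless log uploads.</p>
<p>One of those useless warning messages looks like this:</p>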
<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>WARN org.apache.commons.httpclient.HttpMethodBase - Going to buffer response body of large or unknown size. Using getResponseBodyAsStream instead is recommended.</span></code></pre></td></tr></table></div></figure>
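<p>More than a million of these warnings were logged in a single week, which is a considerable amount. A quick search showed that they are triggered by calls to <code>getResponseBody()</code> in <code>httpclient</code>:</p>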
<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
</pre></td><td class='code'><pre><code class='java'><span class='line'> <span class="kt">int</span> <span class="n">limit</span> <span class="o">=</span> <span class="n">getParams</span><span class="o">().</span><span class="na">getIntParameter</span><span class="o">(</span><span class="n">HttpMethodParams</span><span class="o">.</span><span class="na">BUFFER_WARN_TRIGGER_LIMIT</span><span class="o">,</span> <span class="mi">1024</span><span class="o">*</span><span class="mi">1024</span><span class="o">);</span>
</span><span class='line'> <span class="k">if</span> <span class="o">((</span><span class="n">contentLength</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="o">)</span> <span class="o">||</span> <span class="o">(</span><span class="n">contentLength</span> <span class="o">></span> <span class="n">limit</span><span class="o">))</span> <span class="o">{</span>
</span><span class='line'> <span class="n">LOG</span><span class="o">.</span><span class="na">warn</span><span class="o">(</span><span class="s">"Going to buffer response body of large or unknown size. "</span>
</span><span class='line'> <span class="o">+</span><span class="s">"Using getResponseBodyAsStream instead is recommended."</span><span class="o">);</span>
</span><span class='line'> <span class="o">}</span>
</span></code></pre></td></tr></table></div></figure>
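<p>Reading this, our understanding was that some responses had bodies larger than the default 1 MB limit (the code takes the size from the <code>Content-Length</code> header), which triggers the warning. At the time we did not realize that the other branch, a response body whose length is genuinely unknown, could be the cause.</p>
<p>The method is called in a dozen or so places in the code base, and we could not tell which call was responsible. Changing them all blindly would have been expensive: it might have meant adding test cases and running regression tests. So we went for the cheapest change we could think of and used the configuration file to give <code>BUFFER_WARN_TRIGGER_LIMIT</code> a larger value, such as 20 MB. This is a legacy project, after all, and few people familiar with the code remain. We did not adjust the log level because <code>HttpMethodBase</code> is a superclass, and bluntly raising its level could hide other useful warnings.</p>
<p>After deploying, we compared the log counts and found that little had changed, which forced us to go back and look for the real root cause. Fortunately the system had just gained a transactionID feature: for every incoming request the application generates a transactionID from a UUID and writes it to the application log, and when the response is returned the same ID is written to the access log. That gave us a link between each request and the code it exercised.</p>
<p>Once the feature was live we searched Splunk again and immediately found that the warning was triggered by requests to the Google Maps API, and that it could be reproduced reliably on a local machine.</p>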
<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
</pre></td><td class='code'><pre><code class='java'><span class='line'> <span class="o">~></span> <span class="n">curl</span> <span class="o">-</span><span class="n">I</span> <span class="s">"https://maps.google.com.au/maps/api/geocode/json?address=Sunnydale%2C+SA+5354&language=en_AU&sensor=false"</span>
</span><span class='line'><span class="n">HTTP</span><span class="o">/</span><span class="mf">1.1</span> <span class="mi">200</span> <span class="n">OK</span>
</span><span class='line'><span class="n">Content</span><span class="o">-</span><span class="nl">Type:</span> <span class="n">application</span><span class="o">/</span><span class="n">json</span><span class="o">;</span> <span class="n">charset</span><span class="o">=</span><span class="n">UTF</span><span class="o">-</span><span class="mi">8</span>
</span><span class='line'><span class="nl">Date:</span> <span class="n">Fri</span><span class="o">,</span> <span class="mi">18</span> <span class="n">Nov</span> <span class="mi">2016</span> <span class="mi">23</span><span class="o">:</span><span class="mi">20</span><span class="o">:</span><span class="mi">16</span> <span class="n">GMT</span>
</span><span class='line'><span class="nl">Expires:</span> <span class="n">Sat</span><span class="o">,</span> <span class="mi">19</span> <span class="n">Nov</span> <span class="mi">2016</span> <span class="mi">23</span><span class="o">:</span><span class="mi">20</span><span class="o">:</span><span class="mi">16</span> <span class="n">GMT</span>
</span><span class='line'><span class="n">Cache</span><span class="o">-</span><span class="nl">Control:</span> <span class="kd">public</span><span class="o">,</span> <span class="n">max</span><span class="o">-</span><span class="n">age</span><span class="o">=</span><span class="mi">86400</span>
</span><span class='line'><span class="n">Access</span><span class="o">-</span><span class="n">Control</span><span class="o">-</span><span class="n">Allow</span><span class="o">-</span><span class="nl">Origin:</span> <span class="o">*</span>
</span><span class='line'><span class="nl">Server:</span> <span class="n">mafe</span>
</span><span class='line'><span class="n">X</span><span class="o">-</span><span class="n">XSS</span><span class="o">-</span><span class="nl">Protection:</span> <span class="mi">1</span><span class="o">;</span> <span class="n">mode</span><span class="o">=</span><span class="n">block</span>
</span><span class='line'><span class="n">X</span><span class="o">-</span><span class="n">Frame</span><span class="o">-</span><span class="nl">Options:</span> <span class="n">SAMEORIGIN</span>
</span><span class='line'><span class="n">Alt</span><span class="o">-</span><span class="nl">Svc:</span> <span class="n">quic</span><span class="o">=</span><span class="s">":443"</span><span class="o">;</span> <span class="n">ma</span><span class="o">=</span><span class="mi">2592000</span><span class="o">;</span> <span class="n">v</span><span class="o">=</span><span class="s">"36,35,34"</span>
</span><span class='line'><span class="n">Transfer</span><span class="o">-</span><span class="nl">Encoding:</span> <span class="n">chunked</span>
</span><span class='line'><span class="n">Accept</span><span class="o">-</span><span class="nl">Ranges:</span> <span class="n">none</span>
</span><span class='line'><span class="nl">Vary:</span> <span class="n">Accept</span><span class="o">-</span><span class="n">Encoding</span>
</span></code></pre></td></tr></table></div></figure>
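<p>It turned out that the response carries no <code>Content-Length</code> header at all :(.</p>
<p>The <code>Content-Length</code> header tells the client how large the body returned by the server is, so that the client can stop reading once it has received that many bytes, which saves overhead. In practice, though, <code>Content-Length</code> may not reflect the body size accurately: a value that is too large leaves the client pending, and a value that is too small truncates the content.</p>
<p><code>Transfer-Encoding: chunked</code> sends the body as a series of chunks. Each chunk carries its length followed by its data, and the last chunk has a length of 0, so the end of the body is known exactly. Because such a response has no <code>Content-Length</code>, <code>contentLength</code> in the check above is <code>-1</code>, and the warning fires no matter how large <code>BUFFER_WARN_TRIGGER_LIMIT</code> is.</p>
<p>Once the problem was located, fixing it was easy. In the end only one line of code had to change:</p>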
<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class='java'><span class='line'><span class="o">-</span> <span class="n">String</span> <span class="n">response</span> <span class="o">=</span> <span class="n">query</span><span class="o">.</span><span class="na">getResponseBodyAsString</span><span class="o">();</span>
</span><span class='line'><span class="o">+</span> <span class="n">String</span> <span class="n">response</span> <span class="o">=</span> <span class="n">IOUtils</span><span class="o">.</span><span class="na">toString</span><span class="o">(</span><span class="n">query</span><span class="o">.</span><span class="na">getResponseBodyAsStream</span><span class="o">());</span>
</span></code></pre></td></tr></table></div></figure>
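<p>To make the chunked framing described above concrete, here is a minimal Python sketch. It is only an illustration, not the httpclient implementation: <code>decode_chunked</code> and <code>would_warn</code> are hypothetical helper names, the decoder ignores trailers, and <code>would_warn</code> simply mirrors the condition quoted from <code>HttpMethodBase</code> earlier in this post.</p>

```python
def decode_chunked(raw: bytes) -> bytes:
    """Decode a chunked body: repeated 'hex-size CRLF data CRLF', ended by a size-0 chunk."""
    body = b""
    pos = 0
    while True:
        line_end = raw.index(b"\r\n", pos)
        # The size line may carry ";extension" parameters; only the hex size matters here.
        size = int(raw[pos:line_end].split(b";")[0], 16)
        if size == 0:
            return body  # last chunk: the boundary of the body is now known
        start = line_end + 2
        body += raw[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data


def would_warn(content_length: int, limit: int) -> bool:
    """Same condition as the quoted Java check; -1 means the length is unknown."""
    return content_length == -1 or content_length > limit
```

<p>With no <code>Content-Length</code> header the length is reported as <code>-1</code>, so <code>would_warn(-1, limit)</code> is true for any limit, which is consistent with the 20 MB setting making no difference.</p>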
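<p>Looking back on the whole process: because this is a legacy system, our approach was somewhat crude. Following the steps below would probably have served us better:</p>
<ol>
<li>Locate the problem and find the root cause (a transactionID makes this much easier), instead of blindly using production to test whether a configuration change is right.</li>
<li>Reproduce the problem locally, apply the fix, and talk it over with colleagues who know the legacy system.</li>
<li>Run regression tests, then release.</li>
</ol>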
</div>
</article>
<article>
<header>
<h1 class="entry-title"><a href="/dns/2016/11/15/issue-raised-by-dns">Issue Raised by DNS</a></h1>
<p class="meta">
<time class='entry-date' datetime='2016-11-15T21:33:22+11:00'><span class='date'><span class='date-month'>Nov</span> <span class='date-day'>15</span><span class='date-suffix'>th</span>, <span class='date-year'>2016</span></span> <span class='time'>9:33 pm</span></time>
| <a href="/dns/2016/11/15/issue-raised-by-dns#disqus_thread"
data-disqus-identifier="http://iambowen.github.com/dns/2016/11/15/issue-raised-by-dns">Comments</a>
</p>
</header>
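<div class="entry-content"><p>I have been on site with a client recently and have witnessed quite a few interesting production incidents. The one described below is one of them.
For some time, the response latency of one microservice in production had jumped by several hundred milliseconds, yet the deployed code was not the cause. New Relic monitoring showed that the main reason for the API's higher latency was that a service it depends on had become slower to respond.</p>
<p>Let's call this external service service.mycompany.com. It is deployed in two data centers, one in Australia and one in Europe, and sits behind Akamai, which load-balances and, as far as possible, routes requests according to where they come from.</p>
<p>The microservice is deployed in AWS's Sydney region, so in theory, when it requests service.mycompany.com, Akamai should return the IP of an edge node in Sydney, and the origin server it reaches should also be in Sydney. But debugging on the microservice's own servers showed high ping and traceroute times, while everything looked normal from the office. We suspected that Akamai's GeoIP detection had gone wrong and was treating requests from AWS Sydney as coming from US IPs, so that they were handled by the service in the European data center. We discussed it with the network administrators in the infrastructure team, and a second investigation reached a similar conclusion.</p>
<p>The problem was in the DHCP options of the VPC in this AWS account. It is a shared AWS account that has been in use since the early days, so its network setup is fairly complex, with Direct Connect links to other data centers and many VPC peerings. For reasons nobody knows, the CloudFormation template that deploys this microservice selected a DHCP options set containing Google's DNS servers <code>8.8.8.8</code> and <code>8.8.4.4</code>. As we know, a service registered with Akamai, such as service.mycompany.com, resolves like this:</p>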
<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
</pre></td><td class='code'><pre><code class=''><span class='line'> ~> host service.mycompany.com
</span><span class='line'>service.mycompany.com is an alias for mycompany.generic.edgekey.net.
</span><span class='line'>mycompany.generic.edgekey.net is an alias for e8888.g.akamaiedge.net.
</span><span class='line'>e8888.g.akamaiedge.net has address 104.116.190.24</span></code></pre></td></tr></table></div></figure>
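<p>The first DNS lookup returns a CNAME record, which in turn resolves to a CNAME in Akamai's dynamic DNS, that is, the CNAME of the edge server. The IP address of an edge server is then returned depending on which DNS server is asking. If the query goes through Google's DNS, the address of an edge server in the US comes back. We can test this:</p>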
<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
</pre></td><td class='code'><pre><code class=''><span class='line'> ~> dig @8.8.8.8 service.mycompany.com
</span><span class='line'>
</span><span class='line'>; <<>> DiG 9.8.3-P1 <<>> @8.8.8.8 service.mycompany.com
</span><span class='line'>; (1 server found)
</span><span class='line'>;; global options: +cmd
</span><span class='line'>;; Got answer:
</span><span class='line'>;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44304
</span><span class='line'>;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0
</span><span class='line'>
</span><span class='line'>;; QUESTION SECTION:
</span><span class='line'>;service.mycompany.com. IN A
</span><span class='line'>
</span><span class='line'>;; ANSWER SECTION:
</span><span class='line'>service.mycompany.com. 86399 IN CNAME mycompany.generic.edgekey.net.
</span><span class='line'>mycompany.generic.edgekey.net. 263 IN CNAME e8888.g.akamaiedge.net.
</span><span class='line'>e8888.g.akamaiedge.net. 19 IN A 23.53.156.156
</span><span class='line'>
</span><span class='line'>;; Query time: 603 msec
</span><span class='line'>;; SERVER: 8.8.8.8#53(8.8.8.8)
</span><span class='line'>;; WHEN: Tue Nov 15 22:15:52 2016
</span><span class='line'>;; MSG SIZE rcvd: 130</span></code></pre></td></tr></table></div></figure>
<p>Looking up information on that IP address:</p>
<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class=''><span class='line'> ~> whois 23.53.156.156 | grep Country
</span><span class='line'>Country: US</span></code></pre></td></tr></table></div></figure>
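<p>The resolution chain in the <code>dig</code> answer above can be traced with a few lines of Python. This is only a sketch over the records printed in this post: <code>follow_chain</code> is a hypothetical helper, it performs no real DNS query, and it assumes the answer contains no CNAME loop.</p>

```python
def follow_chain(records, name):
    """Follow CNAME records from `name` until an A record is reached.

    `records` is a list of (owner, type, value) tuples read off a dig
    ANSWER SECTION. Returns (names_visited, ip_address).
    """
    by_owner = {
        owner.rstrip("."): (rtype, value.rstrip("."))
        for owner, rtype, value in records
    }
    chain = [name]
    while True:
        rtype, value = by_owner[chain[-1]]
        if rtype == "A":
            return chain, value
        chain.append(value)  # CNAME: continue with the alias target


# The three answer records returned by 8.8.8.8 in the dig output above.
ANSWER = [
    ("service.mycompany.com.", "CNAME", "mycompany.generic.edgekey.net."),
    ("mycompany.generic.edgekey.net.", "CNAME", "e8888.g.akamaiedge.net."),
    ("e8888.g.akamaiedge.net.", "A", "23.53.156.156"),
]
```

<p>Only the final A record differs between resolvers: the two CNAME hops are the same in the <code>host</code> and <code>dig</code> outputs, while the address of <code>e8888.g.akamaiedge.net</code> changes with the DNS server that is asked.</p>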
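<p>So this microservice's requests first went to an Akamai edge server in the US, and from there were quite possibly sent on to the origin server in Europe. It would have been spooky 👻 if the latency had not gone up.</p>
<p>The fix was simple: update the configuration so that the DHCP options use the DNS provided by Amazon, and the response time came back down.</p>
<p>The lesson for us is that you should never blindly worship foreign things. Australia may always follow the US, but for DNS you had better use your own.</p>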
</div>
</article>
<article>
<header>
<h1 class="entry-title"><a href="/bamboo/emoji/2016/11/03/dont-put-emoji-in-commit-message">Don't Put Emoji in Commit Message</a></h1>
<p class="meta">
<time class='entry-date' datetime='2016-11-03T23:47:07+11:00'><span class='date'><span class='date-month'>Nov</span> <span class='date-day'>3</span><span class='date-suffix'>rd</span>, <span class='date-year'>2016</span></span> <span class='time'>11:47 pm</span></time>
| <a href="/bamboo/emoji/2016/11/03/dont-put-emoji-in-commit-message#disqus_thread"
data-disqus-identifier="http://iambowen.github.com/bamboo/emoji/2016/11/03/dont-put-emoji-in-commit-message">Comments</a>
</p>
</header>
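<div class="entry-content"><p>As Slack sees more and more use on our projects and emoji keep growing in popularity, many people cannot help using emoji everywhere. For example,
posting <code>:bicyclist::skin-tone-2: :house: :thunder_cloud_and_rain: :disappointed:</code> in a channel. Some go further and add emoji to git commit messages, like this:</p>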
<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>finish story xxxx. :pear: xiao.</span></code></pre></td></tr></table></div></figure>
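<p>This means that the story was completed while pairing with <code>xiao</code>. Once a pull request is opened, reviewers not only post :+1: to show support but also use :shipit:, :ship: and the like to signal approval to merge.
Emoji like these add some fun to development, but sometimes they also cause trouble, as I found out today.
After cleaning up some old code, I added the following commit message:</p>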
<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>clean up the :older_man::skin-tone-2: code.</span></code></pre></td></tr></table></div></figure>
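<p>After the merge I checked the build a while later and found that it had failed before even reaching the run stage. The cause turned out to be that Bamboo hit an error on a complex character while saving the commit message.</p>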
<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>(org.springframework.jdbc.UncategorizedSQLException : Hibernate flushing: Could not execute JDBC batch update; uncategorized SQLException for SQL [insert into USER_COMMIT (REPOSITORY_CHANGESET_ID, AUTHOR_ID, COMMIT_DATE, COMMIT_REVISION, COMMIT_COMMENT_CLOB, FOREIGN_COMMIT, COMMIT_ID) values (?, ?, ?, ?, ?, ?, ?)]; SQL state [HY000]; error code [1366]; Incorrect string value: '\xF0\x9F\x91\xB4 ...' for column 'COMMIT_COMMENT_CLOB' at row 1; nested exception is java.sql.BatchUpdateException: Incorrect string value: '\xF0\x9F\x91\xB4 ...' for column 'COMMIT_COMMENT_CLOB' at row 1)</span></code></pre></td></tr></table></div></figure>
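<p>I immediately suspected the emoji, and searching for the hexadecimal bytes in the error message confirmed it. There was nothing for it but to reset, force-push, rewrite the commit message, and push again.
A colleague told me that the root cause is that MySQL's utf-8 support does not fully cover emoji, and that the fix is to set the database charset to <code>utf8mb4</code>. See this <a href="https://mathiasbynens.be/notes/mysql-utf8mb4">article</a> for details.
So before playing with emoji, make sure the system supports them, or you may end up with some :shit:.</p>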
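<p>The bytes in the error message can be checked directly. The Python sketch below shows that the emoji behind <code>:older_man:</code> (U+1F474) takes four bytes in UTF-8, which is more than MySQL's legacy three-byte <code>utf8</code> charset can store. <code>fits_in_mysql_utf8</code> is a hypothetical helper for illustration, not something Bamboo or MySQL provides.</p>

```python
def fits_in_mysql_utf8(text: str) -> bool:
    """True if no character in `text` needs 4 bytes in UTF-8.

    MySQL's legacy `utf8` charset (utf8mb3) stores at most 3 bytes per
    character; 4-byte characters such as this emoji need `utf8mb4`.
    """
    return all(len(ch.encode("utf-8")) != 4 for ch in text)


OLDER_MAN = "\U0001F474"  # the emoji behind :older_man:
COMMIT_MESSAGE = "clean up the " + OLDER_MAN + " code."
```

<p>A check like this could run in a commit hook to reject such messages early, assuming the build server's database cannot be switched to <code>utf8mb4</code>.</p>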
</div>
</article>
<article>
<header>
<h1 class="entry-title"><a href="/2016/08/19/gpgand-keybase-introduction-and-usage">GPG and Keybase Introduction and Usage</a></h1>
<p class="meta">
<time class='entry-date' datetime='2016-08-19T00:00:00+10:00'><span class='date'><span class='date-month'>Aug</span> <span class='date-day'>19</span><span class='date-suffix'>th</span>, <span class='date-year'>2016</span></span> <span class='time'>12:00 am</span></time>
| <a href="/2016/08/19/gpgand-keybase-introduction-and-usage#disqus_thread"
data-disqus-identifier="http://iambowen.github.com/2016/08/19/gpgand-keybase-introduction-and-usage">Comments</a>
</p>
</header>
<div class="entry-content"><h2>What Is GPG</h2>
</div>
</article>
<article>
<header>
<h1 class="entry-title"><a href="/git/2016/08/17/how-to-write-useful-git-commit-message">How to Write Useful Git Commit Messages</a></h1>
<p class="meta">
<time class='entry-date' datetime='2016-08-17T13:20:33+10:00'><span class='date'><span class='date-month'>Aug</span> <span class='date-day'>17</span><span class='date-suffix'>th</span>, <span class='date-year'>2016</span></span> <span class='time'>1:20 pm</span></time>
| <a href="/git/2016/08/17/how-to-write-useful-git-commit-message#disqus_thread"
data-disqus-identifier="http://iambowen.github.com/git/2016/08/17/how-to-write-useful-git-commit-message">Comments</a>
</p>
</header>
<div class="entry-content"><p>I am sure everyone has seen a commit history like this in their own projects:
<img src="http://imgs.xkcd.com/comics/git_commit.png" alt="" />
I have seen even worse, such as:</p>
<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>53ee0c7 fix build again
</span><span class='line'>7a63a11 fix build</span></code></pre></td></tr></table></div></figure>
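<p>The problem with commit messages like these is that they say nothing: they neither summarize what changed nor explain why. Other people have to read the actual code changes to find out what happened, and even then may not learn why the change was made. Of course, I have written commits like this myself, for reasons such as:</p>
<ol>
<li>Worrying about grammar and spelling mistakes and being too embarrassed to write more</li>
<li>Explaining the reason would take a lot of text, and I am too lazy to type it</li>
<li>Being unable to explain why the change works</li>
</ol>
<p>This is really rather irresponsible, and anyone who sees it will probably despair. Luckily no manager has seen it yet, so I have not been fired so far. For example, suppose a commit causes an error in production and someone needs to find the offending commit quickly. If every commit reads something like <code>Perfectly complete a new story</code> and each one changes a large amount of code, tracking it down takes a long time. If instead the messages are clear, such as <code>BAU-1008 add xxx form in xxx page. :pear: Kevin</code>, you can run <code>git log --oneline --after "Aug 10 2016"</code> to see the relevant commits at a glance, then look at the changes to find the specific problem.</p>
<p>This morning the client ran a session with us on how to write effective <code>git commit</code> messages. He mentioned the seven <a href="http://chris.beams.io/posts/git-commit/">rules</a> of a <code>git commit message</code>. In his view, for the sake of maintainability, a project should pay attention to its commit messages and conventions. A project should first agree on the following three aspects of its commit messages:</p>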
<ol>
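<li>Style. The format of the message body, such as Markdown, what the syntax should look like, capitalization rules, and so on.</li>
<li>Content. What a commit message should and should not contain.</li>
<li>Metadata. Whether to reference issue-tracking IDs (Jira, LeanKit, etc.) and whether to reference the PR's SHA.</li>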
</ol>
<p>The seven rules themselves are:</p>
<ol>
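<li>Separate the subject from the body with a blank line</li>
<li>Limit the subject line to 50 characters</li>
<li>Capitalize the first letter of the subject</li>
<li>Do not end the subject with a period</li>
<li>Use the imperative mood in the subject</li>
<li>Wrap the body at 72 characters</li>
<li>Use the body to explain why the change was made and how</li>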
</ol>
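<p>The mechanical rules in the list above (1 to 4 and 6) lend themselves to automatic checking. The Python sketch below is one possible way to do it. <code>check_commit_message</code> is a hypothetical helper, not part of git, and it does not attempt rules 5 and 7, which need human judgment.</p>

```python
def check_commit_message(message: str) -> list:
    """Return the numbers of the violated rules among 1, 2, 3, 4 and 6."""
    lines = message.rstrip("\n").split("\n")
    subject, rest = lines[0], lines[1:]
    problems = []
    if rest and rest[0] != "":
        problems.append(1)  # no blank line between subject and body
    if len(subject) > 50:
        problems.append(2)  # subject longer than 50 characters
    if subject[:1] != subject[:1].upper():
        problems.append(3)  # subject does not start with a capital letter
    if subject.endswith("."):
        problems.append(4)  # subject ends with a period
    if any(len(line) > 72 for line in rest):
        problems.append(6)  # a body line is longer than 72 characters
    return problems
```

<p>A function like this could be called from a <code>commit-msg</code> hook, which receives the path of the message file as its first argument, to reject messages that return a non-empty list.</p>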
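<p>Rule 1: if the subject and body are not separated, <code>git log --oneline</code> shows the subject and the body that follows it together.
Rule 2: when the subject is longer than 50 characters, GitHub displays the excess as <code>...</code>, and when you open a PR the overflow gets folded into the comments, which is annoying. When editing a commit message in vim, a change in the color of the subject text tells you that you have gone past 50 characters.
Rules 3 and 4: no comment; they seem to be mostly about appearance and consistency.
Rule 5: this lets you write fewer words, and it matches the messages git generates by default, such as the one for a revert.</p>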
<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>Revert "Add the thing with the stuff"
</span><span class='line'>
</span><span class='line'>This reverts commit cc87791524aedd593cff5a74532befe7ab69ce9d.</span></code></pre></td></tr></table></div></figure>
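<p>Rule 6: git does not wrap text for you, so you have to do it by hand. An editor such as vi can help here.
Rule 7: personally I think this is the most important one: explain clearly why the change was made and how. Here is an example quoted from someone else's article:</p>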
<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>commit eb0b56b19017ab5c16c745e6da39c53126924ed6
</span><span class='line'>Author: Pieter Wuille <pieter.wuille@gmail.com>
</span><span class='line'>Date: Fri Aug 1 22:57:55 2014 +0200
</span><span class='line'>
</span><span class='line'>    Simplify serialize.h's exception handling
</span><span class='line'>
</span><span class='line'>    Remove the 'state' and 'exceptmask' from serialize.h's stream
</span><span class='line'>    implementations, as well as related methods.
</span><span class='line'>
</span><span class='line'>    As exceptmask always included 'failbit', and setstate was always
</span><span class='line'>    called with bits = failbit, all it did was immediately raise an
</span><span class='line'>    exception. Get rid of those variables, and replace the setstate
</span><span class='line'>    with direct exception throwing (which also removes some dead
</span><span class='line'>    code).
</span><span class='line'>
</span><span class='line'>    As a result, good() is never reached after a failure (there are
</span><span class='line'>    only 2 calls, one of which is in tests), and can just be replaced
</span><span class='line'>    by !eof().
</span><span class='line'>
</span><span class='line'>    fail(), clear(n) and exceptions() are just never called. Delete
</span><span class='line'>    them.</span></code></pre></td></tr></table></div></figure>
<p>For business-related code changes, you can put the story ID at the very front to make issue tracking easier.</p>
<p>To get everyone to use the same commit format, you can create a commit template and configure git to <a href="https://robots.thoughtbot.com/better-commit-messages-with-a-gitmessage-template">use</a> it. The steps are:</p>
<ol>
<li>Add the following to <code>~/.gitconfig</code>:</li>
</ol>
<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>[commit]
</span><span class='line'>    template = ~/.gitmessage</span></code></pre></td></tr></table></div></figure>
<ol start="2">
<li>Create the template file <code>~/.gitmessage</code> and fill in your own template:</li>
</ol>
<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>Brief here:
</span><span class='line'>
</span><span class='line'>Reason to change:
</span><span class='line'>*
</span><span class='line'>
</span><span class='line'>Way to change:
</span><span class='line'>
</span><span class='line'>*</span></code></pre></td></tr></table></div></figure>
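<p>Once this is configured, make a change in a project and run <code>git ci -a</code> (with <code>ci</code> aliased to <code>commit</code>); you can then write your message starting from the template.</p>
<p>Here is an example from one of our projects. It includes a link to the GitHub issue for the business requirement:</p>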
<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>commit 2196e866261dee6d7c17f266cc15987f
</span><span class='line'>Author: Alex Jin &lt;alex.jin@example.com&gt;
</span><span class='line'>Date: Wed Aug 17 13:24:41 2016 +0800
</span><span class='line'> Make trend bigger and modify link color. :pear: Luke
</span><span class='line'>
</span><span class='line'> * Reason to change:
</span><span class='line'> see [lob/project#208]</span></code></pre></td></tr></table></div></figure>
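<p>If you work with pull requests, the annoying part is that the reason for the change may have to be written out again in the comments. The fix is to create issue/PR templates; see <a href="https://github.com/blog/2111-issue-and-pull-request-templates">here</a>.</p>
<p>What, you ask why I haven't been fired yet?
Because my boss doesn't read the commits :)</p>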
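<p>A minimal sketch of such a PR template, assuming the repository keeps it at <code>.github/PULL_REQUEST_TEMPLATE.md</code> (one of the locations GitHub reads). The temp directory stands in for a real checkout and the template body simply reuses the commit template headings; both are illustrative choices.</p>

```shell
# Hypothetical repository checkout; a temp directory stands in for it here.
repo_dir=$(mktemp -d)
mkdir -p "$repo_dir/.github"

# Reuse the commit template headings so the PR description mirrors the commit message.
cat > "$repo_dir/.github/PULL_REQUEST_TEMPLATE.md" <<'EOF'
Brief here:

Reason to change:
*

Way to change:
*
EOF

echo "wrote $repo_dir/.github/PULL_REQUEST_TEMPLATE.md"
```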
</div>
</article>
<article>
<header>
<h1 class="entry-title"><a href="/docker/ci/2016/08/16/on-dockerising-a-frontend-build-pipeline">On Dockerising a Frontend Build Pipeline</a></h1>
<p class="meta">
<time class='entry-date' datetime='2016-08-16T15:08:58+10:00'><span class='date'><span class='date-month'>Aug</span> <span class='date-day'>16</span><span class='date-suffix'>th</span>, <span class='date-year'>2016</span></span> <span class='time'>3:08 pm</span></time>
| <a href="/docker/ci/2016/08/16/on-dockerising-a-frontend-build-pipeline#disqus_thread"
data-disqus-identifier="http://iambowen.github.com/docker/ci/2016/08/16/on-dockerising-a-frontend-build-pipeline">Comments</a>
</p>
</header>
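<div class="entry-content"><p>I recently spent a while dockerising the build pipeline of our main site, long enough that I felt my reputation was about to be ruined.
Here is a summary of the process and the problems I ran into; I hope it helps.</p>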
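<h3>Background</h3>
<p>This is a pure frontend project, created two years ago when the frontend was split from the backend. It uses a Grunt workflow, Karma as the test framework, Phantomjs <code>1.8.2</code> for headless tests, and Chrome/Safari for functional testing in the development environment. The development environment is based on node <code>0.12</code>; some infrastructure updates, the deployment scripts and the smoke tests are written in Ruby, version <code>2.0</code>.</p>
<p>The project is deployed to S3 in two different AWS regions, each acting as failover for the other, with Akamai in front load-balancing between them.</p>
<p>Bamboo is the continuous integration tool; its agents need <code>ruby 2.0</code>, <code>node 0.12</code> and <code>Phantomjs 1.8.2</code> installed to run the jobs. The whole process already does continuous deployment. A complete build looks like this:</p>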
<ol>
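<li>Commit code</li>
<li>Trigger the build, which runs the unit and integration tests</li>
<li>Automatically deploy to staging</li>
<li>Automatically deploy to production</li>
<li>Run performance tests against the deployed product</li>
<li>Upload information about the third-party libraries the project depends on to an S3 bucket (for security reasons)</li>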
</ol>
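<h3>Existing Problems</h3>
<p>Nobody had done a technical upgrade for too long, which led to the following risks and problems:</p>
<ol>
<li>The development tooling is outdated: node is already at <code>6.3</code>, ruby 2.0 is probably no longer maintained, and Karma and Phantomjs have likewise moved on a lot.</li>
<li>The agents the build depends on are shared; if anyone modifies an agent's environment, this project's continuous integration is affected.</li>
<li>The CI tool will be migrated from Bamboo to Buildkite, building with pipeline as code, with each team managing its own build agents. Using Docker makes that migration easier.</li>
</ol>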
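<h3>The Process and Some Difficulties</h3>
<p>Getting the tests to pass, and the problems along the way:</p>
<ol>
<li>The first step was to build a base Docker image containing the latest node <code>6.3.1</code> and phantomjs <code>2.1.1</code>; it later turned out Phantomjs was not needed, so that part was redundant. The result is here: <a href="https://github.com/iambowen/node_on_docker">https://github.com/iambowen/node_on_docker</a>. Because this kind of environment is fairly generic, I published it to the official Docker repository.</li>
<li>On top of that image, build a base image with the environment our project depends on. It additionally installs Ruby <code>2.3</code>, the latest Chrome, git and some git configuration, because code has to be pulled from GitHub Enterprise.</li>
<li>Upgrade node locally, along with the related grunt, karma and Phantomjs versions, and get the tests passing.</li>
<li>Mount the project into the container and run the tests. <code>npm install</code> failed because installing <code>fsevents</code> errored. Looking at the package, it turns out to be for OS X only. After deleting <code>npm-shrinkwrap.json</code> and re-running, it passed. The cause: someone had run <code>npm shrinkwrap</code> on OS X to generate this version-locking file, which is really annoying. So I did the reverse and generated <code>npm-shrinkwrap.json</code> inside the container; the tests still ran fine on the host, and that solved the problem.</li>
<li>Create a branch in Bamboo and run the tests against my branch.</li>
<li>One step of the tests runs <code>bower install</code> to install third-party JS libraries. Annoyingly, some of them are downloaded over the <code>git</code> protocol rather than <code>https</code>. Everything worked locally, but on the Bamboo agent the connection timed out, most likely because the network ACL or security group of the AWS network Bamboo lives in does not allow TCP on port <code>9418</code>. In the end I neither changed the firewall nor switched the protocol to <code>https</code>; I checked the libraries into git and adjusted the Gruntfile accordingly, so <code>bower install</code> is no longer needed. After the check-in, the Bamboo run still failed while the local run passed. On closer inspection, some bower modules had a directory named <code>dist</code>, which was being git-ignored.</li>
</ol>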
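<p>The git-ignored <code>dist</code> directory is the kind of problem <code>git check-ignore -v</code> can surface before CI does. The sketch below reproduces the situation in a throwaway repository; the <code>vendor/some-lib</code> layout and file names are invented for illustration and are not the project's real paths. Force-adding with <code>git add -f</code> is just one possible way out; adjusting the ignore rules is another.</p>

```shell
# Throwaway repository that mimics the situation: a blanket "dist" ignore rule
# and a vendored module whose payload lives in a dist/ directory.
repo=$(mktemp -d)
cd "$repo"
git init -q .
echo "dist" > .gitignore
mkdir -p vendor/some-lib/dist
echo "// built file" > vendor/some-lib/dist/lib.js

# check-ignore -v prints the ignore file, line number and pattern that matched.
ignored_by=$(git check-ignore -v vendor/some-lib/dist/lib.js)
echo "$ignored_by"

# One possible fix: force-add the vendored file despite the ignore rule.
git add -f vendor/some-lib/dist/lib.js
git ls-files
```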
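<p>With the tests passing, deployment was next. The problem to solve there is how to give the container the temporary credentials of an AWS role so it can upload and update files. ECS seems to support containers assuming a role, but we don't use ECS, so another way was needed.</p>
<p>The approach I came up with is to run <code>assume role</code> on Bamboo's Docker agent and pass the resulting credentials into the container as environment variables. Experiments showed this works. Luckily the Bamboo Docker agent has the aws cli; the lack of <code>jq</code> made extracting the credentials slightly harder. The script:</p>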
<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>if [ "$DEPLOY_ENV" = "staging" ]; then
</span><span class='line'> AWS_ACCOUNT_ID='1111111111'
</span><span class='line'>elif [ "$DEPLOY_ENV" = "production" ]; then
</span><span class='line'> AWS_ACCOUNT_ID='2222222222'
</span><span class='line'>fi
</span><span class='line'>
</span><span class='line'>credentials=$(aws sts assume-role --role-arn arn:aws:iam::"$AWS_ACCOUNT_ID":role/roleName \
</span><span class='line'> --role-session-name roleSessionName \
</span><span class='line'> --query 'Credentials.[SecretAccessKey, SessionToken, AccessKeyId]' \
</span><span class='line'> --output text)
</span><span class='line'>
</span><span class='line'>SecretAccessKey=$(echo $credentials | cut -d' ' -f1)
</span><span class='line'>SessionToken=$(echo $credentials | cut -d' ' -f2)
</span><span class='line'>AccessKeyId=$(echo $credentials | cut -d' ' -f3)
</span><span class='line'>
</span><span class='line'>docker run -e BUILD_VERSION="$BUILD_VERSION" \
</span><span class='line'> -e DEPLOY_ENV="$DEPLOY_ENV" -e AWS_SECRET_ACCESS_KEY="$SecretAccessKey" \
</span><span class='line'> -e AWS_SESSION_TOKEN="$SessionToken" -e AWS_ACCESS_KEY_ID="$AccessKeyId" --rm docker_image bash -c 'grunt deploy'</span></code></pre></td></tr></table></div></figure>
<p>Because the deployment uses the AWS SDK for Node, the environment variable names it reads are slightly different, so take some care there.</p>
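<p>Two parts of the script above can be tightened without <code>jq</code>; the following is a sketch under stated assumptions, not the script that actually ran. The function name <code>account_id_for</code> and the credential values are made up, and the <code>printf</code> line stands in for the real <code>aws sts assume-role ... --output text</code> call so the example runs offline. The field order matches the <code>--query</code> expression in the script.</p>

```shell
# Map a deploy environment to its AWS account id; fail loudly on anything else.
# The ids are the placeholder values from the script above.
account_id_for() {
  case "$1" in
    staging)    echo '1111111111' ;;
    production) echo '2222222222' ;;
    *)          echo "unknown DEPLOY_ENV: $1" >&2; return 1 ;;
  esac
}

# Stand-in for the single tab-separated line printed by
# `aws sts assume-role ... --output text`; the values are invented.
credentials=$(printf 'exampleSecret\texampleToken\texampleKeyId')

# read splits on whitespace (tabs included), one variable per field.
read -r secret_access_key session_token access_key_id <<EOF
$credentials
EOF

echo "account: $(account_id_for staging)"
echo "key id:  $access_key_id"
```

<p>Splitting with a single <code>read</code> avoids three <code>echo | cut</code> pipelines, and the <code>case</code> statement stops the deploy early when <code>DEPLOY_ENV</code> is neither staging nor production instead of building a role ARN with an empty account id.</p>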
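<p>After running on CI, the staging deployment passed. I then checked by hand on the Bamboo Docker agent whether it could assume the production deployment role; it could, which means the production deployment should pass as well.</p>
<h3>Summary</h3>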
<ol>
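<li><code>npm</code> sucks. Worse, programmers add dependencies without much thought: I saw quite a few unmaintained components in <code>package.json</code>, and upgrading and maintaining them later is a problem. Earlier Ruby projects were the same; whenever a version upgrade comes along, hitting an unmaintained gem is very painful.</li>
<li>Using too many languages in one project is also a bad thing. The whole deployment could have been done with the AWS SDK for Node, yet for some reason it was implemented in Ruby, which quietly raises the maintenance cost.</li>
<li>We generally assume Docker guarantees consistency across environments, but some problems, such as the firewall issue and the git-ignored bower modules mentioned above, only show up in the CI environment. So before a PR is merged to master, make sure the change also passes on CI.</li>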
</ol>
</div>
</article>
<article>
<header>
<h1 class="entry-title"><a href="/docker/2016/08/11/one-interesting-docker-issue">One Interesting Docker Issue</a></h1>
<p class="meta">
<time class='entry-date' datetime='2016-08-11T17:52:10+10:00'><span class='date'><span class='date-month'>Aug</span> <span class='date-day'>11</span><span class='date-suffix'>th</span>, <span class='date-year'>2016</span></span> <span class='time'>5:52 pm</span></time>
| <a href="/docker/2016/08/11/one-interesting-docker-issue#disqus_thread"
data-disqus-identifier="http://iambowen.github.com/docker/2016/08/11/one-interesting-docker-issue">Comments</a>
</p>
</header>
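<div class="entry-content"><p>On this project the Akamai regression tests run on a fixed, Puppet-managed virtual server in the data center. The server is a Bamboo agent responsible for all automated deployment jobs of the legacy systems.
A few days ago an Ops engineer from the client asked me to help make this server support Docker and then run the tests inside Docker. We modified the Puppet scripts and updated Docker, only to find that the 2.6 kernel can run Docker 1.7 at most, while the docker compose used for the tests needs a Docker client newer than 1.7. Given how big the change would be, we switched approach and ran the tests on the existing Bamboo Docker agent in the AWS account instead. So we reverted the Puppet changes and applied the reverted configuration on the server.
I thought that was the end of it, but a few days later Ops from another team came to tell me the staging deployment had failed and asked why; the error roughly said there was no route to the NetScaler server. I found that odd and looked at the routing table on the server. This is what I found:</p>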
<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>Destination Gateway Genmask Flags Metric Ref Use Iface
</span><span class='line'>172.17.0.0 * 255.255.0.0 U 0 0 0 docker0</span></code></pre></td></tr></table></div></figure>
<p>Oops: the staging IP range is also <code>172.17</code>, so that was the cause.
The fix was to bring this network device down, delete it, and then restart the network service.</p>
<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class='bash'><span class='line'>ip link set docker0 down
</span><span class='line'>ip link del docker0
</span><span class='line'>service network restart
</span></code></pre></td></tr></table></div></figure>
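<p>A small check like the following could confirm whether a leftover <code>docker0</code> route is still present before and after the cleanup. It is only a sketch: the helper name <code>has_route_on</code> is invented, and the sample text is the routing table excerpt from above so the example runs without a real <code>docker0</code> device. On a live host the input would come from something like <code>route -n</code>.</p>

```shell
# Succeed if the routing table text on stdin has a route whose
# last column (Iface) is the given interface name.
has_route_on() {
  iface=$1
  awk -v iface="$iface" '$NF == iface { found = 1 } END { exit !found }'
}

# The excerpt shown above, used as offline sample input.
sample='Destination Gateway Genmask Flags Metric Ref Use Iface
172.17.0.0 * 255.255.0.0 U 0 0 0 docker0'

if printf '%s\n' "$sample" | has_route_on docker0; then
  conflict=yes
  echo "docker0 still owns a route"
else
  conflict=no
  echo "no docker0 route found"
fi
```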
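<p>I think there are two things to learn from this mistake:</p>
<ol>
<li>Configuration management tools are not fully reliable: Puppet did not clean up everything Docker-related.</li>
<li><code>Pet</code> servers like this one are unreliable: if the server were rebuilt from its configuration every day, this problem would not have happened.</li>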
</ol>
</div>
</article>
<article>
<header>
<h1 class="entry-title"><a href="/2016/07/11/cloudformation-introduction-and-usage">Cloudformation Introduction and Usage</a></h1>
<p class="meta">
<time class='entry-date' datetime='2016-07-11T13:16:21+10:00'><span class='date'><span class='date-month'>Jul</span> <span class='date-day'>11</span><span class='date-suffix'>th</span>, <span class='date-year'>2016</span></span> <span class='time'>1:16 pm</span></time>
| <a href="/2016/07/11/cloudformation-introduction-and-usage#disqus_thread"
data-disqus-identifier="http://iambowen.github.com/2016/07/11/cloudformation-introduction-and-usage">Comments</a>
</p>
</header>
<div class="entry-content"><h2>Introduction to CloudFormation</h2>
<p><a href="https://aws.amazon.com/cloudformation/">CloudFormation</a> is an AWS service for managing AWS resources and for deploying and updating them. It has the following characteristics:</p>
<h2>CloudFormation Concepts</h2>
</div>
</article>
<article>
<header>
<h1 class="entry-title"><a href="/virutalbox/dns/network/2016/01/20/hand-over-dns-resolve-to-virutalbox">Hand Over DNS Resolve to VirtualBox</a></h1>
<p class="meta">
<time class='entry-date' datetime='2016-01-20T20:16:12+11:00'><span class='date'><span class='date-month'>Jan</span> <span class='date-day'>20</span><span class='date-suffix'>th</span>, <span class='date-year'>2016</span></span> <span class='time'>8:16 pm</span></time>
| <a href="/virutalbox/dns/network/2016/01/20/hand-over-dns-resolve-to-virutalbox#disqus_thread"
data-disqus-identifier="http://iambowen.github.com/virutalbox/dns/network/2016/01/20/hand-over-dns-resolve-to-virutalbox">Comments</a>
</p>
</header>
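<div class="entry-content"><p>When you create a virtual machine with <a href="https://www.vagrantup.com/">vagrant</a> (using the virtualbox provider) and connect the guest to the outside network via NAT, <code>/etc/resolv.conf</code> inside the VM is not updated when the host's wireless network changes, so name resolution fails.</p>
<p>The fix is to hand DNS resolution over to the VM manager, VirtualBox in this case. Suppose we want to change the settings of a VM named <code>test</code>:</p>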
<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
</pre></td><td class='code'><pre><code class=''><span class='line'> ~> VBoxManage list vms
</span><span class='line'>"mesos1" {74214693-3477-4386-a9b7-4abc3b7e608d}
</span><span class='line'>.......
</span><span class='line'>"test" {b269c98f-00e8-49a3-a8d0-53629187ea62}
</span><span class='line'>
</span><span class='line'># Make sure the VM is not running, then run
</span><span class='line'> ~> VBoxManage modifyvm test --natdnsproxy1 on</span></code></pre></td></tr></table></div></figure>
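<p>Restart the VM; however you switch networks, name resolution problems should no longer occur.
If the VM configuration is managed with a Vagrantfile, change the VM configuration there:</p>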
<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>config.vm.provider "virtualbox" do |v|
</span><span class='line'> v.customize ["modifyvm", :id, "--natdnsproxy1", "on"]
</span><span class='line'>end</span></code></pre></td></tr></table></div></figure>
</div>
</article>
<article>
<header>
<h1 class="entry-title"><a href="/akamai/diagnostic/2016/01/19/using-akamai-diagnostic-tools-slash-api">Using Akamai Diagnostic tools/API</a></h1>
<p class="meta">
<time class='entry-date' datetime='2016-01-19T16:32:46+11:00'><span class='date'><span class='date-month'>Jan</span> <span class='date-day'>19</span><span class='date-suffix'>th</span>, <span class='date-year'>2016</span></span> <span class='time'>4:32 pm</span></time>
| <a href="/akamai/diagnostic/2016/01/19/using-akamai-diagnostic-tools-slash-api#disqus_thread"
data-disqus-identifier="http://iambowen.github.com/akamai/diagnostic/2016/01/19/using-akamai-diagnostic-tools-slash-api">Comments</a>
</p>
</header>
<div class="entry-content"><p>Sometimes, after submitting an application change on Akamai, a configuration problem produces an error like this:</p>
<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>#30.657008d1.1452737568.1e40544</span></code></pre></td></tr></table></div></figure>
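<p>Tracking down the specific problem through logs can be time-consuming, because you have to wait for Akamai to upload the logs. Akamai itself provides a tool and an API for decoding error codes. Usage is as follows:</p>
<h3>Diagnostic Tools in Luna Control Center</h3>
<hr />
<p>This one is easy: in <code>Luna Control Center</code> choose <code>Resolve</code> => <code>Diagnostic Tools</code>. In the <code>Service Debugging Tools</code> section choose <code>Error Translator (Reference#)</code>, enter the error code string in the <code>Error String:</code> input and click <code>Analyze</code>. After a short wait you will see the detailed error information and its cause.</p>
<h3>Using the Akamai Diagnostic API</h3>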
<hr />
<ol>
<li>Akamai provides a sample client for calling the API. Besides cloning the client repo, you can use Docker directly: just run <code>docker run -it akamaiopen/api-kickstart /bin/bash</code>.</li>
<li>Generate tokens for the new client. First, in <code>Luna Control Center</code> choose <code>CONFIGURE</code> => <code>Manage APIs</code> to open the Open API management page. Under <code>Luna APIs</code> add a new collection, then add a new client to that collection to get new tokens. Click the export button in the upper right to export them to a text file, for example one named <code>api-kickstart.txt</code>.</li>
<li>Set the tokens on the client side. In the client directory run <code>python gen_edgerc.py -s default -f api-kickstart.txt</code>; it generates the credential file <code>~/.edgerc</code> in the user's home directory. <code>python verify_creds.py</code> verifies that the credentials are valid. The tokens in <code>.edgerc</code> are in effect what goes into the authorization headers of the API requests.</li>
<li>Test a request. Once <code>.edgerc</code> is set up and verified, run <code>python diagnostic_tools.py</code> to test. The API endpoints it actually calls are <code>/diagnostic-tools/v1/locations</code> and <code>/diagnostic-tools/v1/dig</code>. The output looks like this:</li>
</ol>
<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
<span class='line-number'>9</span>
<span class='line-number'>10</span>
<span class='line-number'>11</span>
<span class='line-number'>12</span>
<span class='line-number'>13</span>
<span class='line-number'>14</span>
<span class='line-number'>15</span>
<span class='line-number'>16</span>
<span class='line-number'>17</span>
<span class='line-number'>18</span>
<span class='line-number'>19</span>
<span class='line-number'>20</span>
<span class='line-number'>21</span>
<span class='line-number'>22</span>
<span class='line-number'>23</span>
<span class='line-number'>24</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>root@16119b2d4eb8:/opt/examples/python# python diagnostic_tools.py
</span><span class='line'>
</span><span class='line'>Requesting locations that support the diagnostic-tools API.
</span><span class='line'>
</span><span class='line'>There are 72 locations that can run dig in the Akamai Network
</span><span class='line'>We will make our call from Adelaide, Australia
</span><span class='line'>
</span><span class='line'>; <<>> DiG 9.8.1-P1 <<>> developer.akamai.com -t A
</span><span class='line'>;; global options: +cmd
</span><span class='line'>;; Got answer:
</span><span class='line'>;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12919
</span><span class='line'>;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 8, ADDITIONAL: 8
</span><span class='line'>
</span><span class='line'>;; QUESTION SECTION:
</span><span class='line'>;developer.akamai.com. IN A
</span><span class='line'>
</span><span class='line'>;; ANSWER SECTION:
</span><span class='line'>developer.akamai.com. 300 IN CNAME san-developer.akamai.com.edgekey.net.
</span><span class='line'>san-developer.akamai.com.edgekey.net. 21600 IN CNAME e4777.dscx.akamaiedge.net.
</span><span class='line'>e4777.dscx.akamaiedge.net. 20 IN A 23.4.164.144
</span><span class='line'>
</span><span class='line'>;; AUTHORITY SECTION:
</span><span class='line'>dscx.akamaiedge.net. 4000 IN NS n6dscx.akamaiedge.net.
</span><span class='line'>...............
</span></code></pre></td></tr></table></div></figure>
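<p>If the request fails instead, one quick thing to rule out is an incomplete credential file. The sketch below checks a file for the four fields the usual EdgeGrid <code>.edgerc</code> layout carries, as far as I know; the sample file, host name and token values are all placeholders written to a temp path, not real output of <code>gen_edgerc.py</code>. Pointing <code>edgerc</code> at <code>~/.edgerc</code> would check the real file.</p>

```shell
# Placeholder credential file in a temp location; every value is invented.
edgerc=$(mktemp)
cat > "$edgerc" <<'EOF'
[default]
host = example.luna.akamaiapis.net
client_token = example-client-token
client_secret = example-client-secret
access_token = example-access-token
EOF

# Collect the names of any expected fields that are absent.
missing=""
for key in host client_token client_secret access_token; do
  grep -Eq "^${key}[[:space:]]*=" "$edgerc" || missing="$missing $key"
done

if [ -z "$missing" ]; then
  echo "all credential fields present"
else
  echo "missing:$missing"
fi
```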
<p>The list of Akamai diagnostic APIs is <a href="https://developer.akamai.com/api/luna/diagnostic-tools/uses.html">here</a>. The endpoint for translating error codes is <code>/diagnostic-tools/v1/errortranslator{?errorCode}</code>. You can issue this request by reusing the Python code from the example, for instance by modifying <code>diagnostic_tools.py</code> as follows (yes, I am that lazy):</p>
<figure class='code'><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
</pre></td><td class='code'><pre><code class=''><span class='line'>+ location_result = httpCaller.getResult('/diagnostic-tools/v1/errortranslator?errorCode=30.657008d1.1452737568.1e40544')
</span><span class='line'>- location_result = httpCaller.getResult('/diagnostic-tools/v1/locations')
</span><span class='line'>+ print location_result["errorTranslator"]["reasonForFailure"]</span></code></pre></td></tr></table></div></figure>
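<p>You can then see that the failure reason is <code>ERR_FWD_SSL_HANDSHAKE&#x7c;err_conn_strict_cert</code>, meaning I had not set the correct certificate on the CDN, so its SSL handshake with the origin failed.</p>
<p>If you have no special requirements, the diagnostic tool in the Akamai web console is enough. If you want to show off or need automation, call the API from the command line to print the error reason.</p>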
</div>
</article>
<div class="pagination">
<a class="prev" href="/posts/2">← Older</a>
<a href="/blog/archives">Blog Archives</a>
</div>
</div>
<aside class="sidebar">
<section>
<h1>Recent Posts</h1>
<ul id="recent_posts">