eng/Lecture 12 _ Visualizing and Understanding.srt

1
00:00:10,512 --> 00:00:15,376
- Good morning.
So, it's 12:03 so, I want to get started.

2
00:00:15,376 --> 00:00:18,014
Welcome to Lecture 12 of CS231N.

3
00:00:18,014 --> 00:00:21,840
Today we are going to talk about Visualizing
and Understanding convolutional networks.

4
00:00:21,840 --> 00:00:25,270
This is always a super fun lecture to give
because we get to look at a lot of pretty pictures.

5
00:00:25,270 --> 00:00:28,375
So, it's, it's one of my favorites.

6
00:00:28,375 --> 00:00:30,354
As usual a couple administrative things.

7
00:00:30,354 --> 00:00:39,544
So, hopefully your projects are all going well, because as a reminder your milestones
are due on Canvas tonight. It is Canvas, right? Okay, so want to double check, yeah.

8
00:00:39,545 --> 00:00:43,590
Due on Canvas tonight. We are working
furiously on grading your midterms.

9
00:00:43,590 --> 00:00:49,537
So, we hope to have those midterm grades
back to you on Gradescope this week.

10
00:00:49,537 --> 00:00:54,987
So, I know there was a little confusion; you all got registration
emails for Gradescope, probably in the last week.

11
00:00:54,988 --> 00:00:57,372
Something like that. We saw a
couple of questions on Piazza.

12
00:00:57,372 --> 00:00:59,530
So, we've decided to use Gradescope
to grade the midterms.

13
00:00:59,530 --> 00:01:02,973
So, don't be confused, if you
get some emails about that.

14
00:01:02,973 --> 00:01:05,047
Another reminder is that assignment three

15
00:01:05,047 --> 00:01:07,412
was released last week on Friday.

16
00:01:07,412 --> 00:01:11,088
It will be due, a week from
this Friday, on the 26th.

17
00:01:11,088 --> 00:01:12,595
Assignment three

18
00:01:12,595 --> 00:01:14,444
is almost entirely brand new this year.

19
00:01:14,444 --> 00:01:17,152
So, we apologize for taking
a little bit longer than

20
00:01:17,152 --> 00:01:18,847
expected to get it out.

21
00:01:18,847 --> 00:01:20,272
But I think it's super cool.

22
00:01:20,272 --> 00:01:22,644
A lot of the stuff we'll
talk about in today's lecture

23
00:01:22,644 --> 00:01:25,283
you'll actually be implementing
on your assignment.

24
00:01:25,283 --> 00:01:27,188
And for the assignment, you'll
get the choice of either

25
00:01:27,188 --> 00:01:29,575
PyTorch or TensorFlow

26
00:01:29,575 --> 00:01:30,921
to work through these different examples.

27
00:01:30,921 --> 00:01:34,512
So, we hope that's really
useful experience for you guys.

28
00:01:34,512 --> 00:01:35,822
We also saw a lot of activity

29
00:01:35,822 --> 00:01:37,273
on HyperQuest over the weekend.

30
00:01:37,273 --> 00:01:39,084
So that's, that's really awesome.

31
00:01:39,084 --> 00:01:40,549
The leader board went up yesterday.

32
00:01:40,549 --> 00:01:42,568
It seems like you guys are
really trying to battle it out

33
00:01:42,568 --> 00:01:44,227
to show off your deep learning

34
00:01:44,227 --> 00:01:46,063
neural network training skills.

35
00:01:46,063 --> 00:01:47,402
So that's super cool.

36
00:01:47,402 --> 00:01:50,087
And due to the high interest

37
00:01:50,087 --> 00:01:52,811
in HyperQuest and due to
the conflicts with the,

38
00:01:52,811 --> 00:01:55,118
with the milestone submission time,

39
00:01:55,118 --> 00:01:56,808
we decided to extend the deadline

40
00:01:56,808 --> 00:01:58,591
for extra credit through Sunday.

41
00:01:58,591 --> 00:02:02,279
So, anyone who does at
least 12 runs on HyperQuest

42
00:02:02,279 --> 00:02:04,773
by Sunday will get a little bit
of extra credit in the class.

43
00:02:04,773 --> 00:02:07,394
Also those of you who are,
at the top of leader board

44
00:02:07,394 --> 00:02:09,175
doing really well, will
get maybe a little bit

45
00:02:09,175 --> 00:02:11,200
extra, extra credit.

46
00:02:11,200 --> 00:02:13,081
So, thanks for
participating. We got a lot of

47
00:02:13,081 --> 00:02:15,935
interest and that was really cool.

48
00:02:15,935 --> 00:02:17,844
Final reminder is about
the poster session.

49
00:02:17,844 --> 00:02:21,445
So, the poster
session will be on June 6th.

50
00:02:21,445 --> 00:02:22,872
That date is finalized,

51
00:02:22,872 --> 00:02:24,940
I think that, I don't
remember the exact time.

52
00:02:24,940 --> 00:02:25,932
But it is June 6th.

53
00:02:25,932 --> 00:02:27,141
So, we had some questions

54
00:02:27,141 --> 00:02:29,310
about when exactly that poster session is

55
00:02:29,310 --> 00:02:30,297
for those of you who are traveling

56
00:02:30,297 --> 00:02:31,897
at the end of quarter
or starting internships

57
00:02:31,897 --> 00:02:33,247
or something like that.

58
00:02:33,247 --> 00:02:35,497
So, it will be June 6th.

59
00:02:35,497 --> 00:02:37,210
Any questions on the admin notes?

60
00:02:39,241 --> 00:02:41,171
No, totally clear.

61
00:02:41,171 --> 00:02:42,578
So, last time we talked.

62
00:02:42,578 --> 00:02:44,254
So, last time we had a pretty

63
00:02:44,254 --> 00:02:46,259
jam-packed lecture, where we
talked about a lot of different

64
00:02:46,259 --> 00:02:48,161
computer vision tasks, as a reminder.

65
00:02:48,161 --> 00:02:49,955
We talked about semantic segmentation

66
00:02:49,955 --> 00:02:52,035
which is this problem, where
you want to assign labels

67
00:02:52,035 --> 00:02:54,318
to every pixel in the input image.

68
00:02:54,318 --> 00:02:56,131
But you do not differentiate the

69
00:02:56,131 --> 00:02:58,225
object instances in those images.

70
00:02:58,225 --> 00:03:00,773
We talked about classification
plus localization.

71
00:03:00,773 --> 00:03:02,558
Where in addition to a class label

72
00:03:02,558 --> 00:03:04,059
you also want to draw a box

73
00:03:04,059 --> 00:03:06,539
or perhaps several boxes in the image.

74
00:03:06,539 --> 00:03:08,041
Where the distinction here is that,

75
00:03:08,041 --> 00:03:10,130
in a classification
plus localization setup.

76
00:03:10,130 --> 00:03:12,594
You have some fixed number of
objects that you are looking for.

77
00:03:12,594 --> 00:03:14,424
So, we also saw that this type of paradigm

78
00:03:14,424 --> 00:03:16,785
can be applied to the things
like pose recognition.

79
00:03:16,785 --> 00:03:18,836
Where you want to regress to
different numbers of joints

80
00:03:18,836 --> 00:03:20,222
in the human body.

81
00:03:20,222 --> 00:03:22,235
We also talked about object detection

82
00:03:22,235 --> 00:03:23,976
where you start with some fixed

83
00:03:23,976 --> 00:03:25,851
set of category labels
that you are interested in.

84
00:03:25,851 --> 00:03:27,102
Like dogs and cats.

85
00:03:27,102 --> 00:03:29,460
And then the task is
to draw boxes around

86
00:03:29,460 --> 00:03:31,196
every instance of those objects

87
00:03:31,196 --> 00:03:32,769
that appear in the input image.

88
00:03:32,769 --> 00:03:35,303
And object detection
is really distinct from

89
00:03:35,303 --> 00:03:37,063
classification plus localization

90
00:03:37,063 --> 00:03:38,783
because with object
detection, we don't know

91
00:03:38,783 --> 00:03:40,629
ahead of time, how many object instances

92
00:03:40,629 --> 00:03:42,298
we're looking for in the image.

93
00:03:42,298 --> 00:03:44,272
And we saw that there's
this whole family of methods

94
00:03:44,272 --> 00:03:48,100
based on R-CNN, Fast R-CNN and Faster R-CNN,

95
00:03:48,100 --> 00:03:49,916
as well as the single
shot detection methods

96
00:03:49,916 --> 00:03:52,588
for addressing this problem
of object detection.

97
00:03:52,588 --> 00:03:55,026
Then finally we talked
pretty briefly about

98
00:03:55,026 --> 00:03:57,722
instance segmentation,
which is kind of combining

99
00:03:57,722 --> 00:04:01,164
aspects of semantic
segmentation and object detection

100
00:04:01,164 --> 00:04:03,308
where the goal is to
detect all the instances

101
00:04:03,308 --> 00:04:04,934
of the categories we care about,

102
00:04:04,934 --> 00:04:07,997
as well as label the pixels
belonging to each instance.

103
00:04:07,997 --> 00:04:11,339
So, in this case, we
detected two dogs and one cat

104
00:04:11,339 --> 00:04:13,093
and for each of those instances we wanted

105
00:04:13,093 --> 00:04:14,887
to label all the pixels.

106
00:04:14,887 --> 00:04:17,437
So, we kind of
covered a lot last lecture

107
00:04:17,437 --> 00:04:19,509
but those are really interesting
and exciting problems

108
00:04:19,509 --> 00:04:21,284
that you guys might consider

109
00:04:21,284 --> 00:04:23,810
using in parts of your projects.

110
00:04:23,810 --> 00:04:25,645
But today we are going to
shift gears a little bit

111
00:04:25,645 --> 00:04:27,081
and ask another question.

112
00:04:27,081 --> 00:04:28,702
Which is, what's really going on

113
00:04:28,702 --> 00:04:30,578
inside convolutional networks.

114
00:04:30,578 --> 00:04:32,445
We've seen by this point in the class

115
00:04:32,445 --> 00:04:34,120
how to train convolutional networks.

116
00:04:34,120 --> 00:04:35,916
How to stitch up different
types of architectures

117
00:04:35,916 --> 00:04:37,503
to attack different problems.

118
00:04:37,503 --> 00:04:39,860
But one question that you
might have had in your mind,

119
00:04:39,860 --> 00:04:42,653
is what exactly is going
on inside these networks?

120
00:04:42,653 --> 00:04:44,081
How did they do the things that they do?

121
00:04:44,081 --> 00:04:46,444
What kinds of features
are they looking for?

122
00:04:46,444 --> 00:04:48,612
And all these sorts of related questions.

123
00:04:48,612 --> 00:04:51,043
So, so far we've sort of seen

124
00:04:51,043 --> 00:04:53,399
ConvNets as a little bit of a black box.

125
00:04:53,399 --> 00:04:55,635
Where some input image of raw pixels

126
00:04:55,635 --> 00:04:57,100
is coming in on one side.

127
00:04:57,100 --> 00:04:58,816
It goes through many layers of convolution

128
00:04:58,816 --> 00:05:01,170
and pooling and different
sorts of transformations.

129
00:05:01,170 --> 00:05:04,547
And on the other side, we end up
with some set of class scores

130
00:05:04,547 --> 00:05:07,363
or some types of understandable
interpretable output.

131
00:05:07,363 --> 00:05:09,865
Such as class scores or
bounding box positions

132
00:05:09,865 --> 00:05:12,342
or labeled pixels or something like that.

133
00:05:12,342 --> 00:05:13,307
But the question is.

134
00:05:13,307 --> 00:05:15,933
What are all these other
layers in the middle doing?

135
00:05:15,933 --> 00:05:17,685
What kinds of things in the input image

136
00:05:17,685 --> 00:05:18,567
are they looking for?

137
00:05:18,567 --> 00:05:20,857
And can we try to gain intuition for

138
00:05:20,857 --> 00:05:22,023
how ConvNets are working?

139
00:05:22,023 --> 00:05:24,364
What types of things in the
image they are looking for?

140
00:05:24,364 --> 00:05:25,867
And what kinds of techniques do we have

141
00:05:25,867 --> 00:05:29,327
for analyzing these
internals of the network?

142
00:05:29,327 --> 00:05:32,667
So, one relatively simple
thing is the first layer.

143
00:05:32,667 --> 00:05:34,522
So, we've seen, we've
talked about this before.

144
00:05:34,522 --> 00:05:37,508
But recall that the
first convolutional layer

145
00:05:37,508 --> 00:05:39,819
consists of filters that,

146
00:05:39,819 --> 00:05:41,492
so, for example in AlexNet.

147
00:05:41,492 --> 00:05:43,262
The first convolutional layer consists

148
00:05:43,262 --> 00:05:45,193
of a number of convolutional filters.

149
00:05:45,193 --> 00:05:49,230
Each convolutional filter
has shape 3 by 11 by 11.

150
00:05:49,230 --> 00:05:51,228
And these convolutional filters get slid

151
00:05:51,228 --> 00:05:52,268
over the input image.

152
00:05:52,268 --> 00:05:54,947
We take inner products between
some chunk of the image

153
00:05:54,947 --> 00:05:56,909
and the weights of the
convolutional filter.

154
00:05:56,909 --> 00:05:58,689
And that gives us our output

155
00:05:58,689 --> 00:06:01,729
after that first
convolutional layer.
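[Editor's sketch] The sliding inner product described here can be illustrated with a small pure-Python example. This is a minimal sketch, not the lecture's code: it uses a single-channel toy image and a hypothetical 2x2 edge filter instead of AlexNet's 3 by 11 by 11 filters, with stride 1 and no padding assumed.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (both 2-D lists) with stride 1 and no padding.

    Each output value is the inner product between the kernel weights and
    the image patch currently under the kernel.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Toy 4x4 single-channel image: dark left half, bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2x2 vertical-edge filter (negative on the left, positive on the right).
edge_filter = [
    [-1, 1],
    [-1, 1],
]
response = conv2d_valid(image, edge_filter)
```

The response is largest in the column where the filter straddles the dark-to-bright boundary, which is the sense in which a filter "looks for" a pattern.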

156
00:06:01,729 --> 00:06:05,074
So, in AlexNet then we
have 64 of these filters.

157
00:06:05,074 --> 00:06:06,947
But now in the first layer
because we are taking

158
00:06:06,947 --> 00:06:08,780
a direct inner product
between the weights

159
00:06:08,780 --> 00:06:10,175
of the convolutional layer

160
00:06:10,175 --> 00:06:11,682
and the pixels of the image.

161
00:06:11,682 --> 00:06:14,548
We can get some sense for what
these filters are looking for

162
00:06:14,548 --> 00:06:17,697
by simply visualizing the
learned weights of these filters

163
00:06:17,697 --> 00:06:19,458
as images themselves.

164
00:06:19,458 --> 00:06:22,576
So, for each of those
11 by 11 by 3 filters

165
00:06:22,576 --> 00:06:25,027
in AlexNet, we can just
visualize that filter

166
00:06:25,027 --> 00:06:28,461
as a little 11 by 11 image
with three channels

167
00:06:28,461 --> 00:06:30,201
giving you the red, green and blue values.

168
00:06:30,201 --> 00:06:32,051
And then because there
are 64 of these filters

169
00:06:32,051 --> 00:06:35,305
we just visualize 64
little 11 by 11 images.

170
00:06:35,305 --> 00:06:38,047
And we can repeat... So
we have shown here at the.

171
00:06:38,047 --> 00:06:40,982
So, these are filters taken
from the pretrained models

172
00:06:40,982 --> 00:06:42,509
in the PyTorch model zoo.

173
00:06:42,509 --> 00:06:44,739
And we are looking at the
convolutional filters.

174
00:06:44,739 --> 00:06:45,985
The weights of the convolutional filters.

175
00:06:45,985 --> 00:06:48,313
at the first layer of AlexNet, ResNet-18,

176
00:06:48,313 --> 00:06:51,065
ResNet-101 and DenseNet-121.

177
00:06:51,065 --> 00:06:53,753
And you can see, kind
of what all these layers

178
00:06:53,753 --> 00:06:55,553
what these filters are looking for.

179
00:06:55,553 --> 00:06:59,015
You see a lot of things
looking for oriented edges.

180
00:06:59,015 --> 00:07:01,052
Like bars of light and dark

181
00:07:01,052 --> 00:07:04,487
at various angles and
various positions

182
00:07:04,487 --> 00:07:07,200
in the input. We can see opposing colors.

183
00:07:07,200 --> 00:07:09,475
Like these green and pink

184
00:07:09,475 --> 00:07:12,732
opposing colors or this orange
and blue opposing colors.

185
00:07:12,732 --> 00:07:14,893
So, this, this kind of
connects back to what we

186
00:07:14,893 --> 00:07:16,221
talked about with Hubel and Wiesel.

187
00:07:16,221 --> 00:07:17,907
All the way in the first lecture.

188
00:07:17,907 --> 00:07:19,716
That remember the human visual system

189
00:07:19,716 --> 00:07:22,271
is known to detect
things like oriented edges.

190
00:07:22,271 --> 00:07:24,978
At the very early layers
of the human visual system.

191
00:07:24,978 --> 00:07:26,946
And it turns out that
these convolutional networks

192
00:07:26,946 --> 00:07:29,136
tend to do something, somewhat similar.

193
00:07:29,136 --> 00:07:31,566
At their first convolutional
layers as well.

194
00:07:31,566 --> 00:07:33,153
And what's kind of interesting is that

195
00:07:33,153 --> 00:07:35,631
pretty much no matter what type
of architecture you hook up

196
00:07:35,631 --> 00:07:37,920
or whatever type of training
data you train it on.

197
00:07:37,920 --> 00:07:40,594
You almost always get
the first layers of your.

198
00:07:40,594 --> 00:07:42,736
The first convolutional
weights of any pretty much

199
00:07:42,736 --> 00:07:44,990
any convolutional network
looking at images.

200
00:07:44,990 --> 00:07:46,389
end up looking something like this

201
00:07:46,389 --> 00:07:48,676
with oriented edges and opposing colors.

202
00:07:48,676 --> 00:07:51,539
Looking at that input image.

203
00:07:51,539 --> 00:07:53,696
But this really only, sorry
what was that question?

204
00:08:04,215 --> 00:08:06,118
Yes, these are showing the learned weights

205
00:08:06,118 --> 00:08:07,592
of the first convolutional layer.

206
00:08:15,766 --> 00:08:16,826
Oh, so that the question is.

207
00:08:16,826 --> 00:08:18,998
Why does visualizing the
weights of the filters

208
00:08:18,998 --> 00:08:21,318
tell you what the filter is looking for?

209
00:08:21,318 --> 00:08:23,945
So this intuition comes from
sort of template matching

210
00:08:23,945 --> 00:08:25,045
and inner products.

211
00:08:25,045 --> 00:08:28,389
That if you imagine you have
some, some template vector.

212
00:08:28,389 --> 00:08:31,125
And then you imagine you
compute a scalar output

213
00:08:31,125 --> 00:08:33,272
by taking inner product
between your template vector

214
00:08:33,272 --> 00:08:35,044
and some arbitrary piece of data.

215
00:08:35,044 --> 00:08:38,321
Then, the input which
maximizes that activation.

216
00:08:38,321 --> 00:08:40,289
Under a norm constraint on the input

217
00:08:40,289 --> 00:08:43,062
is exactly when those
two vectors match up.

218
00:08:43,062 --> 00:08:45,564
So, in that sense,
whenever you're taking

219
00:08:45,564 --> 00:08:48,066
inner products, the thing
that causes an inner product

220
00:08:48,066 --> 00:08:49,736
to excite maximally

221
00:08:49,736 --> 00:08:52,506
is a copy of the thing you are
taking an inner product with.

222
00:08:52,506 --> 00:08:55,060
So, that, that's why we can
actually visualize these weights

223
00:08:55,060 --> 00:08:56,323
and that's why that shows us

224
00:08:56,323 --> 00:08:57,902
what this first layer is looking for.
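[Editor's sketch] The template-matching argument can be checked numerically. The sketch below assumes a hypothetical 4-element template and compares the inner product for the unit-norm input aligned with the template against many random unit-norm inputs; by the Cauchy-Schwarz inequality none of them can score higher.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Rescale v to unit L2 norm (this is the norm constraint on the input)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


random.seed(0)
template = [0.5, -1.0, 2.0, 0.25]  # hypothetical flattened filter weights

# The unit-norm input that points in the same direction as the template.
best_input = normalize(template)
best_score = dot(template, best_input)  # equals the L2 norm of the template

# Many random unit-norm inputs for comparison.
random_scores = []
for _ in range(1000):
    candidate = normalize([random.gauss(0.0, 1.0) for _ in template])
    random_scores.append(dot(template, candidate))

max_random_score = max(random_scores)
```

Here `best_score` is an upper bound on every entry of `random_scores`, which is why showing the weights themselves as an image shows the input pattern that excites the filter most.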

225
00:09:06,008 --> 00:09:08,731
So, for these networks
the first layer is always

226
00:09:08,731 --> 00:09:10,052
a convolutional layer.

227
00:09:10,052 --> 00:09:12,003
So, generally whenever
you are looking at image.

228
00:09:12,003 --> 00:09:13,808
Whenever you are thinking about image data

229
00:09:13,808 --> 00:09:15,174
and training convolutional networks,

230
00:09:15,174 --> 00:09:16,525
you generally put a convolutional layer

231
00:09:16,525 --> 00:09:18,178
at the first step.

232
00:09:28,086 --> 00:09:29,006
Yeah, so the question is,

233
00:09:29,006 --> 00:09:30,665
can we do this same type of procedure

234
00:09:30,665 --> 00:09:32,118
in the middle of the network?

235
00:09:32,118 --> 00:09:33,202
That's actually the next slide.

236
00:09:33,202 --> 00:09:35,104
So, good anticipation.

237
00:09:35,104 --> 00:09:37,123
So, if we do, if we draw this exact same

238
00:09:37,123 --> 00:09:39,767
visualization for the
intermediate convolutional layers.

239
00:09:39,767 --> 00:09:41,753
It's actually a lot less interpretable.

240
00:09:41,753 --> 00:09:45,081
So, this is, this is performing
exact same visualization.

241
00:09:45,081 --> 00:09:49,278
So, remember, for this we're using
the tiny ConvNetJS demo network

242
00:09:49,278 --> 00:09:50,474
that's running on the course website

243
00:09:50,474 --> 00:09:51,890
whenever you go there.

244
00:09:51,890 --> 00:09:52,702
So, for that network,

245
00:09:52,702 --> 00:09:55,987
the first layer is 7 by
7 convolution with 16 filters.

246
00:09:55,987 --> 00:09:58,263
So, at the top we're visualizing
the first layer weights

247
00:09:58,263 --> 00:10:00,842
for this network just like
we saw in a previous slide.

248
00:10:00,842 --> 00:10:02,366
But now at the second layer weights.

249
00:10:02,366 --> 00:10:04,491
After we do a convolution,
then there's some ReLU

250
00:10:04,491 --> 00:10:06,583
and some other non-linearity perhaps.

251
00:10:06,583 --> 00:10:08,185
But the second convolutional layer,

252
00:10:08,185 --> 00:10:10,629
now receives the 16 channel input.

253
00:10:10,629 --> 00:10:15,116
And does 7 by 7 convolution
with 20 convolutional filters.

254
00:10:15,116 --> 00:10:16,064
And we've actually,

255
00:10:16,064 --> 00:10:18,660
so the problem is that
you can't really visualize

256
00:10:18,660 --> 00:10:20,495
these directly as images.

257
00:10:20,495 --> 00:10:23,846
So, you can try, so, here if you

258
00:10:23,846 --> 00:10:28,547
this 16 by, so the input is
this has 16 dimensions in depth.

259
00:10:28,547 --> 00:10:30,286
And we have these convolutional filters,

260
00:10:30,286 --> 00:10:32,542
each convolutional filter is 7 by 7,

261
00:10:32,542 --> 00:10:34,388
and is extending along the full depth

262
00:10:34,388 --> 00:10:35,759
so has 16 elements.

263
00:10:35,759 --> 00:10:38,072
Then we've 20 such of these
convolutional filters,

264
00:10:38,072 --> 00:10:40,924
that are producing the output
planes of the next layer.

265
00:10:40,924 --> 00:10:44,035
But the problem here is that
we can't, looking at the,

266
00:10:44,035 --> 00:10:45,128
looking directly at the weights

267
00:10:45,128 --> 00:10:47,498
of these filters, doesn't
really tell us much.

268
00:10:47,498 --> 00:10:49,734
So, what's really done here is that,

269
00:10:49,734 --> 00:10:53,743
now for this single 16 by 7
by 7 convolutional filter.

270
00:10:53,743 --> 00:10:58,192
We can spread out those 16 7-by-7
planes of the filter

271
00:10:58,192 --> 00:11:01,782
into 16 7-by-7 grayscale images.

272
00:11:01,782 --> 00:11:03,284
So, that's what we've done.

273
00:11:03,284 --> 00:11:07,095
Up here, which is these little
tiny gray scale images here

274
00:11:07,095 --> 00:11:08,898
show us what is, what are the weights

275
00:11:08,898 --> 00:11:11,852
in one of the convolutional
filters of the second layer.
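[Editor's sketch] The bookkeeping behind this visualization can be written down as follows. This is a minimal sketch assuming random stand-in weights (not real learned ones) stored as nested lists indexed by filter, input channel, row and column, with the layer sizes quoted in the lecture: 16 input channels, 7 by 7 kernels, 20 filters.

```python
import random

random.seed(0)
IN_CHANNELS, KERNEL, NUM_FILTERS = 16, 7, 20

# Hypothetical stand-in for learned weights, indexed [filter][channel][row][col].
weights = [
    [[[random.gauss(0.0, 1.0) for _ in range(KERNEL)]
      for _ in range(KERNEL)]
     for _ in range(IN_CHANNELS)]
    for _ in range(NUM_FILTERS)
]


def filter_to_planes(filt):
    """Split one 16 x 7 x 7 filter into its per-channel 7 x 7 slices.

    Each slice can be drawn as one small grayscale image.
    """
    return list(filt)


planes_per_filter = [filter_to_planes(f) for f in weights]
total_images = sum(len(planes) for planes in planes_per_filter)
```

Each filter contributes 16 small grayscale tiles, so the whole second layer yields 20 times 16 tiles, which is part of why the picture is hard to read at a glance.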

276
00:11:11,852 --> 00:11:14,473
And now, because there are
20 outputs from this layer.

277
00:11:14,473 --> 00:11:17,534
Then this second convolutional
layer has 20 of these

278
00:11:17,534 --> 00:11:21,046
16 by 7 by 7 filters.

279
00:11:21,046 --> 00:11:22,871
So if we visualize the weights

280
00:11:22,871 --> 00:11:24,307
of those convolutional filters

281
00:11:24,307 --> 00:11:26,709
as images, you can see that there are some

282
00:11:26,709 --> 00:11:28,638
kind of spatial structures here.

283
00:11:28,638 --> 00:11:30,897
But it doesn't really
give you good intuition

284
00:11:30,897 --> 00:11:32,128
for what they are looking at.

285
00:11:32,128 --> 00:11:35,099
Because these filters are not
looking, are not connected

286
00:11:35,099 --> 00:11:36,644
directly to the input image.

287
00:11:36,644 --> 00:11:39,493
Instead recall that the second
layer convolutional filters

288
00:11:39,493 --> 00:11:41,851
are connected to the
output of the first layer.

289
00:11:41,851 --> 00:11:44,189
So, this is giving visualization of,

290
00:11:44,189 --> 00:11:46,684
what type of activation
pattern after the first

291
00:11:46,684 --> 00:11:49,331
convolution would cause
the second layer convolution

292
00:11:49,331 --> 00:11:50,646
to maximally activate.

293
00:11:50,646 --> 00:11:52,423
But, that's not very interpretable

294
00:11:52,423 --> 00:11:53,860
because we don't have a good sense

295
00:11:53,860 --> 00:11:55,966
for what those first layer
convolutions look like

296
00:11:55,966 --> 00:11:58,490
in terms of image pixels.

297
00:11:58,490 --> 00:12:00,893
So we'll need to develop some
slightly more fancy technique

298
00:12:00,893 --> 00:12:02,047
to get a sense for what is going on

299
00:12:02,047 --> 00:12:03,556
in the intermediate layers.

300
00:12:03,556 --> 00:12:04,819
Question in the back.

301
00:12:09,189 --> 00:12:10,489
Yeah. So the question is that

302
00:12:10,489 --> 00:12:13,456
for all the visualizations
on this and the previous slide,

303
00:12:13,456 --> 00:12:16,552
we've had to scale the weights
to the zero to 255 range.

304
00:12:16,552 --> 00:12:18,648
So in practice those
weights could be unbounded.

305
00:12:18,648 --> 00:12:19,885
They could have any range.

306
00:12:19,885 --> 00:12:22,983
But to get nice visualizations
we need to scale those.

307
00:12:22,983 --> 00:12:24,685
These visualizations also do not take

308
00:12:24,685 --> 00:12:26,409
into account the biases in these layers.

309
00:12:26,409 --> 00:12:28,162
So you should keep that in mind

310
00:12:28,162 --> 00:12:30,423
and not take these
types of visualizations

311
00:12:30,423 --> 00:12:31,892
too literally.
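[Editor's sketch] One common way to scale unbounded weights into the zero to 255 range is min-max scaling. The lecture does not say which scaling was used for the slides, so the function below is an assumption for illustration, and the weight values are made up.

```python
def scale_to_uint8(values):
    """Min-max scale a flat list of weights to integers in [0, 255].

    The smallest weight maps to 0 and the largest to 255. Biases are not
    involved, matching the caveat in the lecture.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant filter carries no contrast; draw it as all black.
        return [0 for _ in values]
    return [round(255 * (v - lo) / (hi - lo)) for v in values]


weights = [-2.0, -1.0, 0.0, 3.0]  # hypothetical filter weights
pixels = scale_to_uint8(weights)
```

Note that under this scaling a weight of zero does not in general map to mid-gray, which is one more reason not to read the pictures too literally.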

312
00:12:34,180 --> 00:12:35,237
Now at the last layer

313
00:12:35,237 --> 00:12:36,733
remember when we're looking at the last layer

314
00:12:36,733 --> 00:12:38,391
of a convolutional network.

315
00:12:38,391 --> 00:12:40,698
We have these maybe 1000 class scores

316
00:12:40,698 --> 00:12:42,891
that are telling us what
are the predicted scores

317
00:12:42,891 --> 00:12:44,908
for each of the classes
in our training data set

318
00:12:44,908 --> 00:12:46,676
and immediately before the last layer

319
00:12:46,676 --> 00:12:48,628
we often have some fully connected layer.

320
00:12:48,628 --> 00:12:49,962
In the case of AlexNet

321
00:12:49,962 --> 00:12:53,039
we have some 4096-dimensional
feature representation

322
00:12:53,039 --> 00:12:55,516
of our image that then
gets fed into that final

323
00:12:55,516 --> 00:12:58,328
our final layer to predict
our final class scores.

324
00:12:58,328 --> 00:13:00,606
And another kind of route

325
00:13:00,606 --> 00:13:02,787
for tackling the problem
of visualizing

326
00:13:02,787 --> 00:13:04,263
and understanding ConvNets

327
00:13:04,263 --> 00:13:06,520
is to try to understand what's
happening at the last layer

328
00:13:06,520 --> 00:13:07,967
of a convolutional network.

329
00:13:07,967 --> 00:13:09,022
So what we can do

330
00:13:09,022 --> 00:13:11,230
is to take some
data set of images,

331
00:13:11,230 --> 00:13:13,110
run a bunch of images

332
00:13:13,110 --> 00:13:14,815
through our trained convolutional network

333
00:13:14,815 --> 00:13:17,174
and record that 4096-dimensional vector

334
00:13:17,174 --> 00:13:18,687
for each of those images.

335
00:13:18,687 --> 00:13:20,722
And now go through and try to figure out

336
00:13:20,722 --> 00:13:23,219
and visualize that last
layer, that last hidden layer

337
00:13:23,219 --> 00:13:26,075
rather than those rather than
the first convolutional layer.

338
00:13:26,075 --> 00:13:27,804
So, one thing you might imagine is,

339
00:13:27,804 --> 00:13:29,791
is trying a nearest neighbor approach.

340
00:13:29,791 --> 00:13:31,559
So, remember, way back
in the second lecture

341
00:13:31,559 --> 00:13:33,162
we saw this graphic on the left

342
00:13:33,162 --> 00:13:36,045
where we, where we had a
nearest neighbor classifier.

343
00:13:36,045 --> 00:13:37,967
Where we were looking at
nearest neighbors in pixel

344
00:13:37,967 --> 00:13:40,303
space between CIFAR-10 images.

345
00:13:40,303 --> 00:13:41,996
And then when you look
at nearest neighbors

346
00:13:41,996 --> 00:13:44,765
in pixel space between CIFAR-10 images

347
00:13:44,765 --> 00:13:46,500
you see that you pull up images

348
00:13:46,500 --> 00:13:48,660
that look quite similar
to the query image.

349
00:13:48,660 --> 00:13:50,777
So again on the left column
here is some CIFAR-10 image

350
00:13:50,777 --> 00:13:52,350
from the CIFAR-10 data set

351
00:13:52,350 --> 00:13:54,987
and then these, these next five columns

352
00:13:54,987 --> 00:13:57,239
are showing the nearest
neighbors in pixel space

353
00:13:57,239 --> 00:13:58,917
to those test set images.

354
00:13:58,917 --> 00:14:00,185
And so for example

355
00:14:00,185 --> 00:14:02,446
this white dog that you see here,

356
00:14:02,446 --> 00:14:04,523
its nearest neighbors in pixel space

357
00:14:04,523 --> 00:14:06,328
are these kinds of white blobby things

358
00:14:06,328 --> 00:14:08,321
that may, may or may not be dogs,

359
00:14:08,321 --> 00:14:09,885
but at least the raw pixels

360
00:14:09,885 --> 00:14:11,643
of the image are quite similar.

361
00:14:11,643 --> 00:14:14,268
So now we can do the same
type of visualization

362
00:14:14,268 --> 00:14:16,937
computing and visualizing
these nearest neighbor images.

363
00:14:16,937 --> 00:14:17,963
But rather than computing

364
00:14:17,963 --> 00:14:19,952
the nearest neighbors in pixel space,

365
00:14:19,952 --> 00:14:21,735
instead we can compute nearest neighbors

366
00:14:21,735 --> 00:14:24,507
in that 4096 dimensional feature space.

367
00:14:24,507 --> 00:14:27,107
Which is computed by the
convolutional network.

368
00:14:27,107 --> 00:14:28,351
So here on the right

369
00:14:28,351 --> 00:14:29,987
we see some examples.

370
00:14:29,987 --> 00:14:32,069
So this, this first column shows us

371
00:14:32,069 --> 00:14:34,924
some examples of images from the test set

372
00:14:34,924 --> 00:14:38,338
of the ImageNet
classification data set

373
00:14:38,338 --> 00:14:41,253
and now the, these
subsequent columns show us

374
00:14:41,253 --> 00:14:43,614
nearest neighbors to those test set images

375
00:14:43,614 --> 00:14:46,863
in the 4096-dimensional
feature space

376
00:14:46,863 --> 00:14:48,515
computed by AlexNet.

377
00:14:48,515 --> 00:14:51,010
And you can see here that
this is quite different

378
00:14:51,010 --> 00:14:52,941
from the pixel space nearest neighbors,

379
00:14:52,941 --> 00:14:55,086
because the pixels are
often quite different

380
00:14:55,086 --> 00:14:57,111
between the image and
its nearest neighbors

381
00:14:57,111 --> 00:14:58,375
in feature space.

382
00:14:58,375 --> 00:15:03,031
However, the semantic content of those images
tends to be similar in this feature space.

383
00:15:03,031 --> 00:15:10,484
So for example, if you look at this second row, the query image is this
elephant standing on the left side of the image with green grass behind him.

384
00:15:10,484 --> 00:15:17,307
And now its third nearest neighbor in the
test set is actually an elephant standing on the right side of the image.

385
00:15:17,307 --> 00:15:26,942
So this is really interesting, because between this elephant standing on the left and this
elephant standing on the right, the pixels of those two images are almost entirely different.

386
00:15:26,942 --> 00:15:32,554
However, in the feature space which is learned by the network,
those two images end up being very close to each other.

387
00:15:32,554 --> 00:15:37,975
Which means that somehow these last layer features are
capturing some of the semantic content of these images.

388
00:15:37,975 --> 00:15:46,192
That's really cool and really exciting, and in general looking at these kinds of nearest neighbor
visualizations is a really quick and easy way to visualize something about what's going on here.

389
00:16:02,617 --> 00:16:04,630
Yes. So the question is that

390
00:16:04,630 --> 00:16:13,942
through the standard supervised learning procedure for training a classification
network, there's nothing in the loss encouraging these features to be close together.

391
00:16:13,942 --> 00:16:21,476
So that's true. It's just kind of a happy accident that they end up being close to each
other, because we didn't tell the network during training that these features should be close.

392
00:16:21,476 --> 00:16:28,746
However, sometimes people do train networks using
things called either a contrastive loss or a triplet loss,

393
00:16:28,746 --> 00:16:37,253
which actually explicitly make assumptions and put constraints on the network such
that those last layer features end up having some metric space interpretation.

394
00:16:37,253 --> 00:16:39,907
But AlexNet at least was not
trained specifically for that.
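[Editor's sketch] As one illustration of a loss that shapes feature distances directly, here is a common hinge-style form of the triplet loss. The lecture only names the loss, so the exact form (unsquared L2 distances, a margin of 1.0) and the 2-D toy features are assumptions for this sketch.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss.

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is; otherwise it penalizes the shortfall.
    """
    gap = l2_distance(anchor, positive) - l2_distance(anchor, negative)
    return max(0.0, gap + margin)


# Hypothetical 2-D features: the positive shares the anchor's class.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]
near_negative = [1.0, 0.0]  # as close to the anchor as the positive is
far_negative = [3.0, 4.0]   # already well separated from the anchor

loss_hard = triplet_loss(anchor, positive, near_negative)
loss_easy = triplet_loss(anchor, positive, far_negative)
```

Only the first triplet contributes a nonzero loss, so training pressure falls on examples where same-class and different-class features are not yet separated.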

395
00:16:44,931 --> 00:16:46,060
The question is, what is the nearest...

396
00:16:46,060 --> 00:16:48,875
What does this nearest neighbor thing
have to do with the last layer?

397
00:16:48,875 --> 00:16:51,432
So we're taking this image
we're running it through the network

398
00:16:51,432 --> 00:16:57,670
and then the second to last, like the last hidden
layer of the network, is a 4096-dimensional vector.

399
00:16:57,670 --> 00:17:01,797
Because there are
these fully connected layers at the end of the network.

400
00:17:01,797 --> 00:17:06,893
So what we're doing is writing down that
4096-dimensional vector for each of the images

401
00:17:06,894 --> 00:17:12,966
and then we are computing nearest neighbors according to that 4096-dimensional
vector, which is computed by the network.
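As a minimal sketch of the nearest-neighbor lookup just described: this assumes the 4096-dimensional vectors have already been pulled out of the network into a NumPy array. The function name `nearest_neighbors` and the random stand-in data are hypothetical, not code from the lecture.

```python
import numpy as np

def nearest_neighbors(features, query_index, k=3):
    """Return indices of the k rows closest to the query row (L2 distance in feature space)."""
    diffs = features - features[query_index]
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    dists[query_index] = np.inf  # do not return the query image itself
    return np.argsort(dists)[:k]

# Stand-in for real network features: 100 "images", 4096 dimensions each (assumption).
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 4096))
# Make row 7 a slightly perturbed copy of row 0, so it should be row 0's nearest neighbor.
features[7] = features[0] + 0.01 * rng.normal(size=4096)

neighbors = nearest_neighbors(features, query_index=0, k=3)
```

With real features, each row would be the last hidden layer's output for one image, and the returned indices point at the images to display next to the query.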

402
00:17:17,012 --> 00:17:19,171
Maybe, maybe we can chat offline.

403
00:17:19,171 --> 00:17:28,434
So another angle that we might have for visualizing
what's going on in this last layer is some concept of dimensionality reduction.

404
00:17:28,435 --> 00:17:33,220
So those of you who have taken CS229, for
example, have seen something like PCA.

405
00:17:33,220 --> 00:17:39,841
Which lets you take some high-dimensional representation like these
4096-dimensional features and then compress it down to two dimensions.

406
00:17:39,841 --> 00:17:43,183
So then you can visualize that
feature space more directly.

407
00:17:43,183 --> 00:17:51,321
So, Principal Component Analysis or PCA is kind of one way to do that.
But there's another really powerful algorithm called t-SNE.

408
00:17:51,321 --> 00:17:54,656
Standing for t-distributed
stochastic neighbor embedding.

409
00:17:54,656 --> 00:18:03,137
Which is a slightly more powerful method. It's a non-linear dimensionality
reduction method that people in deep learning often use for visualizing features.

410
00:18:03,137 --> 00:18:07,264
So here, just as an
example of what t-SNE can do.

411
00:18:07,264 --> 00:18:13,231
This visualization here is showing a t-SNE
dimensionality reduction on the MNIST data set.

412
00:18:13,231 --> 00:18:17,521
So, MNIST, remember, is this data set of
handwritten digits between zero and nine.

413
00:18:17,521 --> 00:18:22,226
Each image is a
28 by 28 grayscale image

414
00:18:22,226 --> 00:18:32,020
and now we've used t-SNE to take that 28 times 28 dimensional
feature space of the raw pixels for MNIST and compress it down to two dimensions

415
00:18:32,020 --> 00:18:37,096
and then visualize each of those MNIST digits
in this compressed two-dimensional representation

416
00:18:37,096 --> 00:18:42,653
and when you run t-SNE on the raw pixels of
MNIST, you can see these natural clusters appearing.

417
00:18:42,653 --> 00:18:47,532
Which correspond to the digits of
this MNIST data set.

418
00:18:47,532 --> 00:18:57,348
So now we can do a similar type of visualization, where we apply this t-SNE dimensionality
reduction technique to the features from the last layer of our trained ImageNet classifier.

419
00:18:57,348 --> 00:19:05,073
So, to be a little bit more concrete here, what we've done is that we
take a large set of images and we run them through a convolutional network.

420
00:19:05,073 --> 00:19:10,865
We record that final 4096-dimensional feature vector
from the last layer for each of those images.

421
00:19:10,865 --> 00:19:14,756
Which gives us a large collection
of 4096-dimensional vectors.

422
00:19:14,756 --> 00:19:24,277
Now we apply t-SNE dimensionality reduction to compress that
4096-dimensional feature space down into a two-dimensional feature space

423
00:19:24,277 --> 00:19:36,415
and now we lay out a grid in that compressed two-dimensional feature space and visualize what
types of images appear at each location in the grid in this two-dimensional feature space.
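The grid layout step can be sketched as follows. This is only an illustration under assumptions: the 2-D coordinates here are made up by hand rather than produced by t-SNE, and `grid_assignment` is a hypothetical helper that picks, for each grid point, the index of the closest embedded image.

```python
import numpy as np

def grid_assignment(coords_2d, grid_size):
    """For each point of a grid_size x grid_size grid, return the index of the closest embedded point."""
    # Rescale the embedding to the unit square (assumes both axes have nonzero spread).
    mins = coords_2d.min(axis=0)
    spans = coords_2d.max(axis=0) - mins
    unit = (coords_2d - mins) / spans
    ticks = np.linspace(0.0, 1.0, grid_size)
    grid = np.empty((grid_size, grid_size), dtype=int)
    for i, gy in enumerate(ticks):        # rows follow the second coordinate
        for j, gx in enumerate(ticks):    # columns follow the first coordinate
            d2 = (unit[:, 0] - gx) ** 2 + (unit[:, 1] - gy) ** 2
            grid[i, j] = int(np.argmin(d2))
    return grid

# Hand-made 2-D "embedding" of five images (a stand-in for real t-SNE output).
coords = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0], [5.0, 5.0]])
grid = grid_assignment(coords, grid_size=3)
```

Each entry of `grid` is an image index; drawing that image's pixels at the grid cell gives the kind of picture shown on the slide. Because only one image is kept per cell, the result says nothing about how densely populated each region of the feature space is.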

424
00:19:36,415 --> 00:19:43,417
So by doing this you get some very rough sense of
what the geometry of this learned feature space looks like.

425
00:19:43,417 --> 00:19:48,620
So these images are a little bit hard to see. So I'd encourage
you to check out the high resolution versions online.

426
00:19:48,620 --> 00:19:56,451
But at least maybe on the left you can see that there's sort of one cluster
in the bottom here of green things, which are different kinds of flowers

427
00:19:56,451 --> 00:20:01,800
and there are other types of clusters for different types of
dog breeds and other types of animals and locations.

428
00:20:01,800 --> 00:20:06,192
So there's sort of this continuous
semantic notion in this feature space.

429
00:20:06,192 --> 00:20:11,597
Which we can explore by looking through this t-SNE
dimensionality reduction version of the features.

430
00:20:11,597 --> 00:20:12,604
Is there a question?

431
00:20:23,716 --> 00:20:29,793
Yeah. So the basic idea is that we have an image, so now we
end up with three different pieces of information about each image.

432
00:20:29,793 --> 00:20:31,308
We have the pixels of the image.

433
00:20:31,308 --> 00:20:33,353
We have the 4096-dimensional vector.

434
00:20:33,353 --> 00:20:38,109
Then we use t-SNE to convert the 4096-dimensional
vector into a two-dimensional coordinate

435
00:20:38,109 --> 00:20:49,547
and then we take the original pixels of the image and place them at the two-dimensional coordinate corresponding
to the dimensionality-reduced version of the 4096-dimensional feature. Yeah, it's a little bit involved here.

436
00:20:49,547 --> 00:20:50,348
Question in the front.

437
00:20:55,864 --> 00:20:59,255
The question is, roughly how much
variance do these two dimensions explain?

438
00:20:59,255 --> 00:21:06,080
Well, I'm not sure of the exact number, and it gets a little bit muddy when you're
talking about t-SNE, because it's a non-linear dimensionality reduction technique.

439
00:21:06,080 --> 00:21:10,259
So, I'd have to look offline and I'm not
sure of exactly how much it explains.

440
00:21:10,259 --> 00:21:14,377
Question?

441
00:21:14,377 --> 00:21:17,038
Question is, can you do the same analysis
of upper layers of the network?

442
00:21:17,038 --> 00:21:21,384
And yes, you can. But no, I don't have
those visualizations here. Sorry.

443
00:21:21,384 --> 00:21:24,603
Question?

444
00:21:35,559 --> 00:21:39,482
The question is, Shouldn't we have overlaps of
images once we do this dimensionality reduction?

445
00:21:39,482 --> 00:21:40,902
And yes, of course, you would.

446
00:21:40,902 --> 00:21:47,537
So this is just kind of taking a nearest neighbor in our
regular grid and then picking an image close to that grid point.

447
00:21:47,537 --> 00:21:54,792
So, yeah, this is not showing you the kind
of density in different parts of the feature space.

448
00:21:54,792 --> 00:22:03,122
So that's another thing to look at, and again at the link there's a
couple more visualizations of this nature that address that a little bit.

449
00:22:03,122 --> 00:22:07,713
Okay. So another, another thing that you can
do for some of these intermediate features

450
00:22:07,713 --> 00:22:13,856
is, so we talked a couple of slides ago about how visualizing the
weights of these intermediate layers is not so interpretable.

451
00:22:13,856 --> 00:22:20,846
But actually visualizing the activation maps of those
intermediate layers is kind of interpretable in some cases.

452
00:22:20,846 --> 00:22:28,603
So, again, an example with AlexNet. Remember, the
conv5 layer of AlexNet gives us this 128 by...

453
00:22:28,603 --> 00:22:35,668
The conv5 features for any image
are now a 128 by 13 by 13 dimensional tensor.

454
00:22:35,668 --> 00:22:42,386
But we can think of that as 128
different 13 by 13 2-D grids.

455
00:22:42,386 --> 00:22:49,741
So now we can actually go and visualize each of those 13 by
13 element slices of the feature map as a grayscale image

456
00:22:49,741 --> 00:22:58,501
and this gives us some sense for what types of things in the input
each of those features in that convolutional layer is looking for.

457
00:22:58,501 --> 00:23:03,306
So this is a really cool interactive tool
by Jason Yosinski that you can just download.

458
00:23:03,306 --> 00:23:06,598
So I don't have the video here,
but he has a video on his website.

459
00:23:06,598 --> 00:23:10,059
But it's running a convolutional network
on the input stream of a webcam

460
00:23:10,059 --> 00:23:17,279
and then visualizing in real time each of those slices of that
intermediate feature map to give you a sense of what it's looking for

461
00:23:17,279 --> 00:23:23,931
and you can see that, so here the input image is this picture
of a person in front of the camera

462
00:23:23,931 --> 00:23:28,192
and most of these intermediate features
are kind of noisy, not much going on.

463
00:23:28,192 --> 00:23:34,277
But there's this one highlighted
intermediate feature, which is also shown larger here,

464
00:23:34,277 --> 00:23:41,103
that seems to be activating on the portions of the feature map
corresponding to the person's face. Which is really interesting

465
00:23:41,103 --> 00:23:51,045
and that kind of suggests that this particular slice of the feature map of this
layer of this particular network is maybe looking for human faces or something like that.

466
00:23:51,045 --> 00:23:54,132
Which is kind of a nice
and cool finding.

467
00:23:54,132 --> 00:23:55,517
Question?

468
00:23:59,038 --> 00:24:04,957
The question is, are the black activations dead ReLUs?
So you've got to be a little careful with terminology.

469
00:24:04,957 --> 00:24:09,539
We usually say dead ReLU to mean something
that's dead over the entire training data set.

470
00:24:09,539 --> 00:24:14,701
Here I would say that it's a ReLU
that's not active for this particular input.

471
00:24:14,701 --> 00:24:15,702
Question?

472
00:24:19,457 --> 00:24:22,538
The question is, if there are no humans in
ImageNet, how can it recognize a human face?

473
00:24:22,538 --> 00:24:24,182
There definitely are humans in ImageNet.

474
00:24:24,182 --> 00:24:29,020
I don't think it's one
of the thousand categories for the classification challenge.

475
00:24:29,020 --> 00:24:34,906
But people definitely appear in a lot of these images and that
can be a useful signal for detecting other types of things.

476
00:24:34,906 --> 00:24:41,617
So that's actually kind of a nice result, because it shows that the network
can learn features that are useful for the classification task at hand.

477
00:24:41,617 --> 00:24:47,483
That are even maybe a little bit different from the explicit classification
task that we told it to perform. So it's actually a really cool result.

478
00:24:50,346 --> 00:24:51,929
Okay, question?

479
00:24:55,192 --> 00:25:03,334
So in the convolutional network, our input image is
like 3 by 224 by 224, and then it goes through many stages of convolution.

480
00:25:03,334 --> 00:25:07,731
And then, after each convolutional layer,
there is some three-dimensional chunk of numbers.

481
00:25:07,731 --> 00:25:10,476
Which are the outputs from that layer
of the convolutional network.

482
00:25:10,476 --> 00:25:18,155
And that entire three-dimensional chunk of numbers, which is the output
of the previous convolutional layer, we call an activation volume

483
00:25:18,155 --> 00:25:22,156
and then one of those slices
is an activation map.

484
00:25:34,426 --> 00:25:38,513
So the question is, If the image is K by K
will the activation map be K by K?

485
00:25:38,513 --> 00:25:42,489
Not always, because there can be subsampling
due to strided convolution and pooling.

486
00:25:42,489 --> 00:25:47,756
But in general, the, the size of each activation
map will be linear in the size of the input image.

487
00:25:50,492 --> 00:25:55,625
So another, another kind of useful thing we can
do for visualizing intermediate features is...

488
00:25:55,625 --> 00:26:03,453
Visualizing what types of patches from input images cause maximal
activation in different, different features, different neurons.

489
00:26:03,453 --> 00:26:08,605
So what we've done here is that we pick
maybe again the conv5 layer from AlexNet.

490
00:26:08,605 --> 00:26:10,926
And remember each of
these activation volumes

491
00:26:10,926 --> 00:26:15,738
at conv5 in AlexNet gives
us a 128 by 13 by 13 chunk of numbers.

492
00:26:15,738 --> 00:26:19,644
Then we'll pick one of those 128 channels.
Maybe channel 17

493
00:26:19,644 --> 00:26:23,749
and now what we'll do is run many images
through this convolutional network.

494
00:26:23,749 --> 00:26:27,456
And then, for each of those images,
record the conv5 features

495
00:26:27,456 --> 00:26:37,925
and then look at the parts of
that 17th feature map that are maximally activated over our data set of images.

496
00:26:37,925 --> 00:26:45,161
And now, because again this is a convolutional layer each of those neurons
in the convolutional layer has some small receptive field in the input.

497
00:26:45,161 --> 00:26:49,239
Each of those neurons is not looking at the whole image.
They're only looking at a subset of the image.

498
00:26:49,239 --> 00:27:00,731
Then what we'll do is visualize the patches from this large data set of images corresponding
to the maximal activations of that particular feature in that particular layer.

499
00:27:00,731 --> 00:27:06,177
And then we can sort these patches by
their activation at that particular layer.
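As a rough sketch of the bookkeeping behind this, assuming the activations of one chosen channel have already been recorded for every image into a single array (the name `top_activations` and the tiny hand-filled array are hypothetical):

```python
import numpy as np

def top_activations(channel_maps, k):
    """channel_maps: (num_images, H, W) activations of one channel.
    Returns the k strongest activations as (image, row, col) triples, strongest first."""
    flat_order = np.argsort(channel_maps, axis=None)[::-1][:k]
    return [tuple(int(v) for v in np.unravel_index(idx, channel_maps.shape))
            for idx in flat_order]

# Stand-in for one channel of 13 x 13 activation maps recorded over four images.
maps = np.zeros((4, 13, 13))
maps[2, 5, 6] = 9.0
maps[0, 1, 1] = 7.0
maps[3, 12, 0] = 3.0

top = top_activations(maps, k=3)
```

Each `(image, row, col)` triple identifies one neuron in the activation map. To show the actual patch, you would still need to map that location back to its receptive field in the input image, which depends on the strides and filter sizes of the specific network and is not shown here.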

500
00:27:06,177 --> 00:27:12,575
So here are some examples from this network
called a, fully... The network doesn't matter.

501
00:27:12,575 --> 00:27:16,380
But these are some visualizations of these
kind of maximally activating patches.

502
00:27:16,380 --> 00:27:22,500
So, for each row, we've chosen
one neuron from one layer of a network

503
00:27:22,500 --> 00:27:28,280
and then these, sorted,
are the patches from some large data set of images

504
00:27:28,280 --> 00:27:30,611
that maximally activated this one neuron.

505
00:27:30,611 --> 00:27:35,698
And these can give you a sense for what type of
features these, these neurons might be looking for.

506
00:27:35,698 --> 00:27:39,998
So for example, this top row we see a lot
of circly kinds of things in the image.

507
00:27:39,998 --> 00:27:44,621
Some eyes, some, mostly eyes.
But also this, kind of blue circly region.

508
00:27:44,621 --> 00:27:51,303
So then, maybe this, this particular neuron in this particular layer of
this network is looking for kind of blue circly things in the input.

509
00:27:51,303 --> 00:27:56,200
Or maybe in the middle here we have neurons
that are looking for text in different colors

510
00:27:56,200 --> 00:28:02,201
or, or maybe curving, curving edges
of different colors and orientations.

511
00:28:06,246 --> 00:28:09,199
Yeah, so, I've been a little bit loose
with terminology here.

512
00:28:09,199 --> 00:28:13,970
So, I'm saying that a neuron is one scalar
value in that conv5 activation map.

513
00:28:13,970 --> 00:28:19,283
But because it's convolutional, all the neurons
in one channel are all using the same weights.

514
00:28:19,283 --> 00:28:26,451
So we've chosen one channel and then, right? So, you get a lot
of neurons for each convolutional filter at any one layer.

515
00:28:26,451 --> 00:28:32,532
So these patches could've been drawn from
anywhere in the image due to the convolutional nature of the thing.

516
00:28:32,532 --> 00:28:38,721
And now at the bottom we also see some maximally activating
patches for neurons from a higher up layer in the same network.

517
00:28:38,721 --> 00:28:42,294
And now because they are coming from higher in
the network they have a larger receptive field.

518
00:28:42,294 --> 00:28:44,851
So, they're looking at larger
patches of the input image

519
00:28:44,851 --> 00:28:49,213
and we can also see that they're looking for
maybe larger structures in the input image.

520
00:28:49,213 --> 00:28:56,445
So this, this second row is maybe looking, it seems
to be looking for human, humans or maybe human faces.

521
00:28:56,445 --> 00:29:06,410
We have maybe something looking for parts of cameras or different types
of larger, object-like things.

522
00:29:06,410 --> 00:29:11,885
Another cool experiment we can do, which
comes from the Zeiler and Fergus ECCV 2014 paper,

523
00:29:11,885 --> 00:29:14,062
is this idea of an occlusion experiment.

524
00:29:14,062 --> 00:29:21,659
So, what we want to do is figure out which parts of the
input image cause the network to make its classification decision.

525
00:29:21,659 --> 00:29:25,339
So, what we'll do is, we'll take our
input image in this case an elephant

526
00:29:25,339 --> 00:29:32,486
and then we'll block out some part of that, some region in that input
image and just replace it with the mean pixel value from the data set.

527
00:29:32,486 --> 00:29:39,583
And now, run that occluded image through the network and
then record the predicted probability for this occluded image.

528
00:29:39,583 --> 00:29:44,752
And now slide this occluding patch over every position
in the input image and then repeat the same process.

529
00:29:44,752 --> 00:29:53,699
And then draw this heat map showing the predicted probability output from
the network as a function of which part of the input image we occluded.
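The sliding-occluder loop can be sketched like this. It is a toy under stated assumptions: `predict_prob` stands in for running the real network and reading off the probability of the class of interest, the image is a single-channel array, and the fill value is 0.0 here where the lecture uses the data set's mean pixel value.

```python
import numpy as np

def occlusion_heatmap(image, predict_prob, patch, stride, fill_value):
    """Slide a patch x patch occluder over a (H, W) image and record
    predict_prob(occluded_image) for each occluder position."""
    h, w = image.shape
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heat = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            y, x = i * stride, j * stride
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = fill_value
            heat[i, j] = predict_prob(occluded)
    return heat

def toy_prob(img):
    # Hypothetical "classifier": its confidence is the mean brightness of the central 4 x 4 region.
    return float(img[2:6, 2:6].mean())

image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0  # the "object" sits in the center
heat = occlusion_heatmap(image, toy_prob, patch=4, stride=2, fill_value=0.0)
```

Low values in `heat` mark occluder positions that hurt the prediction most. In this toy the lowest value is at the center position, which covers the whole "object".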

530
00:29:53,699 --> 00:29:59,952
And the idea is that if blocking out some part of the
image causes the network score to change drastically,

531
00:29:59,952 --> 00:30:04,809
then probably that part of the input image was
really important for the classification decision.

532
00:30:04,809 --> 00:30:11,420
So here I've shown three different
examples of this occlusion type experiment.

533
00:30:11,420 --> 00:30:14,456
So, maybe this example of
a Go-kart at the bottom,

534
00:30:14,456 --> 00:30:23,077
you can see over here, so here the red corresponds to
a low probability and the white and yellow correspond to a high probability.

535
00:30:23,077 --> 00:30:30,348
So when we block out the region of the image corresponding to this Go-kart
in front. Then the predicted probability for the Go-kart class drops a lot.

536
00:30:30,348 --> 00:30:38,419
So that gives us some sense that the network is actually caring a lot about
these pixels in the input image in order to make its classification decision.

537
00:30:38,419 --> 00:30:39,589
Question?

538
00:30:47,473 --> 00:30:49,780
Yes, the question is that,
what's going on in the background?

539
00:30:49,780 --> 00:30:56,020
So maybe the image is a little bit too small to tell, but this is
actually a Go-kart track and there's a couple other Go-karts in the background.

540
00:30:56,020 --> 00:31:00,395
So I think that, when you're blocking out these other
Go-karts in the background, that's also influencing the score

541
00:31:00,395 --> 00:31:04,628
or maybe like the horizon is there and maybe the
horizon is a useful feature for detecting Go-karts,

542
00:31:04,628 --> 00:31:08,976
it's a little bit hard to tell sometimes.
But this is a pretty cool visualization.

543
00:31:08,976 --> 00:31:10,118
Yeah, was there another question?

544
00:31:20,486 --> 00:31:23,500
So the question is, sorry,
sorry, what was the first question?

545
00:31:30,731 --> 00:31:36,802
So for this example we're
taking one image and then masking all parts of that one image.

546
00:31:36,802 --> 00:31:38,777
The second question
was, how is this useful?

547
00:31:38,777 --> 00:31:42,982
It's not, maybe, you don't really take this information
and then loop it directly into the training process.

548
00:31:42,982 --> 00:31:49,341
Instead, this is a tool for humans to understand
what types of computations these trained networks are doing.

549
00:31:49,341 --> 00:31:54,296
So it's more for your understanding
than for improving performance per se.

550
00:31:54,296 --> 00:31:57,890
So another, another related idea
is this concept of a Saliency Map.

551
00:31:57,890 --> 00:32:00,534
Which is something that you
will see in your homeworks.

552
00:32:00,534 --> 00:32:02,578
So again, we have the same question

553
00:32:02,578 --> 00:32:07,831
of given an input image of a dog in this
case and the predicted class label of dog

554
00:32:07,831 --> 00:32:11,796
we want to know which pixels in the input
image are important for classification.

555
00:32:11,796 --> 00:32:19,452
We saw that masking is one way to get at this question. But Saliency
Maps are another angle for attacking this problem.

556
00:32:19,452 --> 00:32:25,354
And one relatively simple idea,
from Karen Simonyan's paper a couple years ago,

557
00:32:25,354 --> 00:32:31,694
is just computing the gradient of the predicted
class score with respect to the pixels of the input image.

558
00:32:31,694 --> 00:32:36,042
And this will directly tell us, in this
sort of first order approximation sense,

559
00:32:36,042 --> 00:32:43,963
for each pixel in the input image, if we wiggle that pixel a
little bit, then how much will the classification score for the class change?
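To make the "gradient of the class score with respect to the pixels" concrete, here is a small sketch with a hand-written two-layer ReLU network in NumPy. The network, its random weights, and the function names are all hypothetical stand-ins, and collapsing the three color channels by taking the maximum absolute gradient is one common choice that I am assuming here, not something stated in the lecture.

```python
import numpy as np

def class_score_and_grad(x, W1, W2, cls):
    """Score of class cls for a tiny network W2 @ relu(W1 @ x),
    plus the gradient of that score with respect to the input x."""
    hidden_pre = W1 @ x
    hidden = np.maximum(hidden_pre, 0.0)
    score = W2[cls] @ hidden
    grad_hidden = W2[cls] * (hidden_pre > 0)  # backprop through the ReLU
    grad_x = W1.T @ grad_hidden               # backprop through the first layer
    return score, grad_x

def saliency_map(image, W1, W2, cls):
    """image: (3, H, W). Returns an (H, W) map: max over channels of |d score / d pixel|."""
    _, grad = class_score_and_grad(image.ravel(), W1, W2, cls)
    return np.abs(grad.reshape(image.shape)).max(axis=0)

rng = np.random.default_rng(0)
image = rng.normal(size=(3, 4, 4))   # a tiny random "image"
W1 = rng.normal(size=(8, 48))        # 48 = 3 * 4 * 4 input pixels
W2 = rng.normal(size=(5, 8))         # 5 classes
sal = saliency_map(image, W1, W2, cls=2)
```

With a real convolutional network you would get the same gradient from your framework's backward pass instead of writing it by hand; the rest of the recipe stays the same.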

560
00:32:43,963 --> 00:32:50,496
And this is another way to get at this question of which
pixels in the input matter for the classification.

561
00:32:50,496 --> 00:32:59,356
And when we compute, for example, a Saliency
Map for this dog, we see kind of a nice outline of a dog in the image.

562
00:32:59,356 --> 00:33:04,985
Which tells us that these are probably the pixels that
the network is actually looking at for this image.

563
00:33:04,985 --> 00:33:11,675
And when we repeat this type of process for different images, we get
some sense that the network is sort of looking at the right regions.

564
00:33:11,675 --> 00:33:13,360
Which is somewhat comforting.

565
00:33:13,360 --> 00:33:14,462
Question?

566
00:33:17,407 --> 00:33:21,916
The question is, do people use Saliency Maps
for semantic segmentation? The answer is yes.

567
00:33:21,916 --> 00:33:26,741
That actually was... Yeah, you guys are
like really on top of it this lecture.

568
00:33:26,741 --> 00:33:29,513
So that was another component,
again in Karen's paper.

569
00:33:29,513 --> 00:33:38,925
Where there's this idea that maybe you can use these Saliency Maps to perform semantic
segmentation without any labeled data for these segments.

570
00:33:38,925 --> 00:33:43,908
So here they're using this GrabCut segmentation algorithm,
which I don't really want to get into the details of.

571
00:33:43,908 --> 00:33:47,772
But it's kind of an interactive
segmentation algorithm that you can use.

572
00:33:47,772 --> 00:33:55,697
So then when you combine this Saliency Map with this GrabCut segmentation
algorithm, you can in fact sometimes segment out the object in the image.

573
00:33:55,697 --> 00:34:00,326
Which is really cool. However I'd like to
point out that this is a little bit brittle

574
00:34:00,326 --> 00:34:07,182
and in general this will probably work much, much worse
than a network which did have access to supervision at training time.

575
00:34:07,182 --> 00:34:13,458
So, I'm not sure how practical this
is. But it is pretty cool that it works at all.

576
00:34:13,458 --> 00:34:19,025
But it probably works much less well than something
trained explicitly to segment with supervision.

577
00:34:19,025 --> 00:34:23,791
So another related idea is
this idea of guided back propagation.

578
00:34:23,791 --> 00:34:30,001
So again, we still want to answer the question
for one particular image.

579
00:34:30,001 --> 00:34:37,420
But now, instead of looking at the class score, we
want to pick some intermediate neuron in the network and ask again,

580
00:34:37,420 --> 00:34:44,199
which parts of the input image influence the score
of that neuron, that internal neuron in the network.

581
00:34:44,199 --> 00:34:49,059
And then you could imagine
computing a Saliency Map for this again, right?

582
00:34:49,059 --> 00:34:53,466
That rather than computing the gradient of the class
scores with respect to the pixels of the image.

583
00:34:53,466 --> 00:34:58,815
You could compute the gradient of some intermediate value
in the network with respect to the pixels of the image.

584
00:34:58,815 --> 00:35:05,832
And that would tell us again which parts, which pixels in the
input image influence that value of that particular neuron.

585
00:35:05,832 --> 00:35:08,342
And that would be using
normal back propagation.

586
00:35:08,342 --> 00:35:15,093
But it turns out that there is a slight tweak that we can do to this back
propagation procedure that ends up giving some slightly cleaner images.

587
00:35:15,093 --> 00:35:21,393
So that's this idea of guided back propagation that
again comes from Zeiler and Fergus's 2014 paper.

588
00:35:21,393 --> 00:35:24,203
And I don't really want to get
into the details too much here

589
00:35:24,203 --> 00:35:30,220
but it's kind of a weird tweak where you change
the way that you back propagate through ReLU non-linearities.

590
00:35:30,220 --> 00:35:37,254
And you sort of only back propagate positive gradients through
ReLUs and you do not back propagate negative gradients through the ReLUs.
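As I understand the tweak described here, it only changes the backward rule of each ReLU. A minimal NumPy sketch, with hypothetical function names and made-up numbers, compares the ordinary rule with the guided one:

```python
import numpy as np

def relu_backward(upstream, pre_activation):
    """Ordinary backprop through a ReLU: pass the gradient wherever the forward input was positive."""
    return upstream * (pre_activation > 0)

def relu_backward_guided(upstream, pre_activation):
    """Guided variant: additionally drop negative upstream gradients."""
    return upstream * (pre_activation > 0) * (upstream > 0)

pre = np.array([1.0, -2.0, 3.0, 4.0])   # inputs the ReLU saw in the forward pass
up = np.array([0.5, 0.7, -0.2, 0.9])    # gradients arriving from the layer above

normal = relu_backward(up, pre)          # keeps the -0.2 entry
guided = relu_backward_guided(up, pre)   # zeroes the -0.2 entry as well
```

The second entry is zero under both rules because the forward input was negative; the third entry is where the two rules differ.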

591
00:35:37,254 --> 00:35:46,948
So you're no longer computing the true gradient. Instead you're kind of only
keeping track of positive influences throughout the entire network.

592
00:35:46,948 --> 00:35:53,614
So maybe you should read through these papers referenced here
if you want a little bit more detail about why that's a good idea.

593
00:35:53,614 --> 00:36:01,649
But empirically, when you do guided back propagation as opposed to
regular back propagation, you tend to get much cleaner, nicer images

594
00:36:01,649 --> 00:36:07,223
that tell you which pixels of the
input image influence that particular neuron.

595
00:36:07,223 --> 00:36:12,467
So, again we're seeing the same visualization we saw
a few slides ago of the maximally activating patches.

596
00:36:16,488 --> 00:36:20,174
But now, in addition to visualizing
these maximally activating patches,

597
00:36:20,174 --> 00:36:27,604
we've also performed guided back propagation to tell us exactly
which parts of these patches influence the score of that neuron.

598
00:36:27,604 --> 00:36:37,139
So, remember for this example at the top, we thought this neuron is maybe looking
for circly type things in the input patch, because there are a lot of circly type patches.

599
00:36:37,139 --> 00:36:42,028
Well, when we look at guided back propagation, we
can see that that intuition is somewhat confirmed

600
00:36:42,028 --> 00:36:49,218
because it is indeed the circly parts of that input
patch which are influencing that neuron value.

601
00:36:49,218 --> 00:36:56,514
So, this is kind of a useful tool for
understanding what these different intermediate neurons are looking for.

602
00:36:56,514 --> 00:37:05,108
But, one kind of interesting thing about guided back propagation or computing
Saliency Maps is that they're always a function of a fixed input image,

603
00:37:05,108 --> 00:37:12,882
right, they're telling us, for a fixed input image, which pixels or
which parts of that input image influence the value of the neuron.

604
00:37:12,882 --> 00:37:19,110
Another thing you might do is remove
this reliance on some input image.

605
00:37:19,110 --> 00:37:24,641
And then instead just ask what type of input
in general would cause this neuron to activate

606
00:37:24,641 --> 00:37:29,118
and we can answer this question
using a technique called gradient ascent.

607
00:37:29,118 --> 00:37:34,903
So, remember we always use gradient descent to train
our convolutional networks by minimizing the loss.

608
00:37:34,903 --> 00:37:40,552
Instead now, we want to fix the
weights of our trained convolutional network

609
00:37:40,552 --> 00:37:50,932
and instead synthesize an image by performing gradient ascent on the pixels of the
image to try and maximize the score of some intermediate neuron or of some class.

610
00:37:50,932 --> 00:37:58,333
So, in the process of gradient ascent, we're no longer optimizing
over the weights of the network. Those weights remain fixed.

611
00:37:58,333 --> 00:38:07,104
Instead we're trying to change the pixels of some input image to cause this
neuron value, or this class score, to be maximized.

612
00:38:07,104 --> 00:38:10,475
But in addition,
we need some regularization term.

613
00:38:10,475 --> 00:38:19,078
So, remember we've seen regularization terms before, to try
to prevent the network weights from overfitting to the training data.

614
00:38:19,078 --> 00:38:27,109
Now, we need something kind of similar to prevent the pixels of our generated
image from overfitting to the peculiarities of that particular network.

615
00:38:27,109 --> 00:38:34,664
So, here we'll often incorporate some regularization term, because
we want a generated image to have two properties.

616
00:38:34,664 --> 00:38:39,269
One, we want it to maximally activate
some score or some neuron value.

617
00:38:39,269 --> 00:38:42,111
But we also want it to
look like a natural image.

618
00:38:42,111 --> 00:38:46,485
We want it to have the kind of statistics
that we typically see in natural images.

619
00:38:46,485 --> 00:38:52,936
So, this regularization term in the objective is something
to force the generated image to look relatively natural.

620
00:38:52,936 --> 00:38:57,116
And we'll see a couple of different
examples of regularizers as we go through.

621
00:38:57,116 --> 00:39:04,371
But the general strategy for this is actually pretty simple, and again
you'll implement a lot of things of this nature on your assignment three.

622
00:39:04,371 --> 00:39:10,410
But, what we'll do is start with some initial image
either initializing to zeros or to uniform or noise.

623
00:39:10,410 --> 00:39:19,922
But initialize your image in some way, and now repeat: you forward your image
through your network and compute the score or neuron value that you're interested in.

624
00:39:19,922 --> 00:39:26,643
Now, back propagate to compute the gradient of that
neuron score with respect to the pixels of the image

625
00:39:26,643 --> 00:39:33,897
and then make a small gradient ascent update to
the pixels of the image itself, to try and maximize that score.

626
00:39:33,897 --> 00:39:38,786
And now repeat this process over and over
again, until you have a beautiful image.
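The loop just described can be sketched in a few lines. This is a toy under stated assumptions: `score_grad` is a hypothetical stand-in for "forward the image, then back propagate to get the gradient of the score with respect to the pixels", the "score" is just a weighted sum of the pixels, and the regularizer is a simple L2 penalty on the image.

```python
import numpy as np

def gradient_ascent_image(score_grad, shape, steps, lr, l2_weight):
    """Maximize score(image) - l2_weight * ||image||^2 over the pixels,
    starting from an all-zeros image. The 'network' behind score_grad stays fixed."""
    image = np.zeros(shape)
    for _ in range(steps):
        # Gradient of the regularized objective with respect to the pixels.
        grad = score_grad(image) - 2.0 * l2_weight * image
        image = image + lr * grad  # ascent step: move uphill
    return image

# Toy "score": sum(w * image), whose gradient with respect to the image is simply w.
w = np.array([[1.0, -2.0], [0.5, 0.0]])
result = gradient_ascent_image(lambda img: w, w.shape, steps=500, lr=0.1, l2_weight=0.5)
```

For this toy objective the optimum is `w / (2 * l2_weight)`, which equals `w` here, so it is easy to check that the loop converges. With a real network the loop is the same, but the result depends heavily on the regularizer and the step size.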

627
00:39:38,786 --> 00:39:42,311
And then we talked
about the image regularizer.

628
00:39:42,311 --> 00:39:49,428
Well, a very simple idea for an image regularizer
is simply to penalize the L2 norm of the generated image.

629
00:39:49,428 --> 00:39:51,466
This is not so semantically meaningful,

630
00:39:51,466 --> 00:40:01,764
it just does something, and this was one of the earliest regularizers
that we've seen in the literature for these generating-images types of papers.

631
00:40:01,764 --> 00:40:12,153
And when you run this on a trained network, you can see that now we're trying to generate
images that maximize the dumbbell score, in the upper left hand corner here for example.

632
00:40:12,153 --> 00:40:14,820
And, then you can see that
the synthesized image,

633
00:40:14,820 --> 00:40:19,726
it's a little bit hard to see maybe, but
there are a lot of different dumbbell-like shapes,

634
00:40:19,726 --> 00:40:23,162
all kind of superimposed
at different portions of the image.

635
00:40:23,162 --> 00:40:29,111
or if we try to generate an image for cups then we can maybe
see a bunch of different cups all kind of superimposed.

636
00:40:29,111 --> 00:40:30,466
the Dalmatian is pretty cool,

637
00:40:30,466 --> 00:40:35,478
because now we can see kind of this black and white spotted
pattern that's kind of characteristic of Dalmatians

638
00:40:35,478 --> 00:40:40,388
or for lemons we can see these different
kinds of yellow splotches in the image.

639
00:40:40,388 --> 00:40:43,539
And there's a couple more examples here,
I think maybe the goose is kind of cool

640
00:40:43,539 --> 00:40:46,514
or the kit fox actually
maybe looks like a kit fox.

641
00:40:46,514 --> 00:40:47,454
Question?

642
00:40:55,528 --> 00:40:57,929
The question is, why are
these all rainbow colored

643
00:40:57,929 --> 00:41:02,434
and in general getting true colors out
of this visualization is pretty tricky.

644
00:41:02,434 --> 00:41:06,693
Right, because any, any actual image will
be bounded in the range zero to 255.

645
00:41:06,693 --> 00:41:10,395
So, it really should be some kind
of constrained optimization problem

646
00:41:10,395 --> 00:41:15,721
But if we're using these generic methods for gradient
ascent, then that's going to be an unconstrained problem.

647
00:41:15,721 --> 00:41:21,848
So, you maybe use something like a projected gradient ascent
algorithm, or you rescale the image at the end.

648
00:41:21,848 --> 00:41:27,799
So, the colors that you see in these visualizations,
sometimes you cannot take them too seriously.

649
00:41:27,799 --> 00:41:28,702
Question?

650
00:41:32,801 --> 00:41:36,846
The question is what happens, if you let the
thing loose and don't put any regularizer on it.

651
00:41:36,846 --> 00:41:44,860
Well, then you tend to get an image which maximizes the score,
which is confidently classified as the class you wanted,

652
00:41:44,860 --> 00:41:48,522
but, usually it doesn't look like anything.
It kind of looks like random noise.

653
00:41:48,522 --> 00:41:54,538
So, that's kind of an interesting property in itself
that we'll go into in much more detail in a future lecture.

654
00:41:54,538 --> 00:42:00,913
But, that's why, that kind of doesn't help you so much for
understanding what things the network is looking for.

655
00:42:00,913 --> 00:42:09,607
So, if we want to understand why the network makes its decisions, then it's
kind of useful to put a regularizer on there to make the generated image look more natural.

656
00:42:09,607 --> 00:42:10,471
A question in the back.

657
00:42:34,416 --> 00:42:38,492
Yeah, so the question is that we see a lot of multimodality
here, and are there other ways to combat that.

658
00:42:38,492 --> 00:42:44,847
And actually yes, we'll see that; this is kind of a first step
in the whole line of work in improving these visualizations.

659
00:42:44,847 --> 00:42:51,517
So then the angle here is kind
of to improve the regularizer to improve our visualized images.

660
00:42:51,517 --> 00:42:58,621
And there's another paper from Jason Yosinski and some of his
collaborators where they added some additional impressive regularizers.

661
00:42:58,621 --> 00:43:00,924
So, in addition to this
L2 norm constraint,

662
00:43:00,924 --> 00:43:06,213
in addition we also periodically, during optimization,
do some Gaussian blurring on the image,

663
00:43:06,213 --> 00:43:12,441
we also clip some small pixel
values all the way to zero, and we also clip

664
00:43:12,441 --> 00:43:14,694
some of the pixel values
with low gradients to zero.

665
00:43:14,694 --> 00:43:17,559
So, you can see this is kind of
a projected gradient ascent algorithm

666
00:43:17,559 --> 00:43:24,555
where periodically we're projecting our generated image
onto some nicer set of images with some nicer properties.

667
00:43:24,555 --> 00:43:28,241
For example, spatial smoothness
with respect to the Gaussian blurring.

668
00:43:28,241 --> 00:43:32,870
So, when you do this, you tend to get much
nicer images that are much clearer to see.

669
00:43:32,870 --> 00:43:38,553
So, now these flamingos look like flamingos the
ground beetle is starting to look more beetle like

670
00:43:38,553 --> 00:43:41,695
or this black swan maybe
looks like a black swan.

671
00:43:41,695 --> 00:43:48,211
These billiard tables actually look kind of impressive now,
where you can definitely see this billiard table structure.

672
00:43:48,211 --> 00:43:55,209
So, you can see that once you add in nicer regularizers, then
the generated images become a bit, a little bit cleaner.

673
00:43:55,209 --> 00:44:01,038
And, now we can perform this procedure not only for the final
class scores, but also for these intermediate neurons as well.

674
00:44:01,038 --> 00:44:10,111
So, instead of trying to maximize our billiard table score for example
instead we can maximize one of the neurons from some intermediate layer.

675
00:44:10,111 --> 00:44:11,118
Question.

676
00:44:16,743 --> 00:44:19,393
So, the question is what's
with the four examples here,

677
00:44:19,393 --> 00:44:21,794
so remember we're
initializing our image randomly,

678
00:44:21,794 --> 00:44:25,681
so, these four images would be different
random initializations of the input image.

679
00:44:28,106 --> 00:44:36,113
And again, we can use this same type of procedure to synthesize
images which maximally activate intermediate neurons of the network.

680
00:44:36,113 --> 00:44:40,174
And then you can get a sense for what some of
these intermediate neurons are looking for,

681
00:44:40,174 --> 00:44:44,605
so maybe at layer four there's a neuron
that's kind of looking for spirally things

682
00:44:44,605 --> 00:44:49,703
or there's a neuron that's maybe looking for like chunks
of caterpillars, it's a little bit harder to tell.

683
00:44:49,703 --> 00:44:56,585
But in general, as you go higher up in the network, you can see that
obviously the receptive fields of these neurons are larger.

684
00:44:56,585 --> 00:44:58,664
So, you're looking at the
larger patches in the image.

685
00:44:58,664 --> 00:45:03,549
And they tend to be looking for maybe larger
structures or more complex patterns in the input image.

686
00:45:03,549 --> 00:45:04,802
That's pretty cool.

687
00:45:07,499 --> 00:45:15,559
And then people have really gone crazy with this, trying to
basically improve these visualizations by heaping on extra features.

688
00:45:15,559 --> 00:45:23,697
So, this was a cool paper kind of explicitly trying to address this
multimodality that someone asked a question about a few minutes ago.

689
00:45:23,697 --> 00:45:29,849
So, here they were trying to explicitly take
this multimodality into account in the optimization procedure

690
00:45:29,849 --> 00:45:35,254
where they did indeed, I think, see the initial... so
for each of the classes, you run a clustering algorithm

691
00:45:35,254 --> 00:45:42,667
to try to separate the classes into different modes and then
initialize with something that is close to one of those modes.

692
00:45:42,667 --> 00:45:45,890
And, then when you do that, you kind
of account for this multimodality.

693
00:45:45,890 --> 00:45:51,675
so for intuition, on the right here these
eight images are all of grocery stores.

694
00:45:51,675 --> 00:45:56,401
But, the top row, is kind of close
up pictures of produce on the shelf

695
00:45:56,401 --> 00:45:59,068
and those are labeled as grocery stores

696
00:45:59,068 --> 00:46:04,221
And the bottom row kind of shows people walking around grocery
stores or at the checkout line or something like that.

697
00:46:04,221 --> 00:46:06,085
And those are also labeled
as grocery store,

698
00:46:06,085 --> 00:46:08,073
but their visual appearance
is quite different.

699
00:46:08,073 --> 00:46:10,988
So, a lot of these classes
end up being sort of multimodal.

700
00:46:10,988 --> 00:46:17,648
And if you explicitly take this multimodality
into account when generating images, then you can get nicer results.

701
00:46:17,648 --> 00:46:22,569
And then when you look at some of their
example synthesized images for classes,

702
00:46:22,569 --> 00:46:31,840
you can see like the bell pepper, the cardoon, strawberries, jack-o'-lantern;
now they end up with some very beautifully generated images.

703
00:46:31,840 --> 00:46:38,177
And now, I don't want to get too much into the detail of
the next slide. But, then you can even go crazier.

704
00:46:38,177 --> 00:46:43,623
and add an even stronger image prior and
generate some very beautiful images indeed

705
00:46:43,623 --> 00:46:48,921
So, these are all synthesized images that are trying
to maximize the class score for some ImageNet class.

706
00:46:48,921 --> 00:46:59,020
But, the general idea is that rather than optimizing directly the pixels of the input
image, instead they're trying to optimize the FC6 representation of that image instead.

707
00:46:59,020 --> 00:47:03,342
And now they need to use some feature inversion network
and I don't want to get into the details here.

708
00:47:03,342 --> 00:47:05,290
You should read the paper,
it's actually really cool

709
00:47:05,290 --> 00:47:11,905
But, the point is that, when you start adding
additional priors towards modeling natural images

710
00:47:11,905 --> 00:47:16,662
then you can end up generating some quite realistic images that
give you some sense of what the network is looking for.

711
00:47:18,951 --> 00:47:23,839
So, that's, that's sort of one cool thing that
we can do with this strategy, but this idea

712
00:47:23,839 --> 00:47:29,893
of trying to synthesize images by using gradients
on image pixels, is actually super powerful.

713
00:47:29,893 --> 00:47:34,288
And, another really cool thing we can do
with this is this concept of fooling images.

714
00:47:34,288 --> 00:47:43,362
So, what we can do is pick some arbitrary image, and then try to maximize
the, so, say we take a picture of an elephant and then we tell the network

715
00:47:43,362 --> 00:47:49,418
that we want to, change the image to
maximize the score of Koala bear instead

716
00:47:49,418 --> 00:47:57,064
So, then what we were doing is trying to change that image of an elephant
to try and instead cause the network to classify as a Koala bear.

717
00:47:57,064 --> 00:48:05,931
And what you might hope for is that maybe that elephant would sort of morph
into a koala bear and maybe he would sprout little cute ears or something like that.

718
00:48:05,931 --> 00:48:09,241
But, that's not what happens in practice,
which is pretty surprising.

719
00:48:09,241 --> 00:48:17,377
Instead, if you take this picture of an elephant and
try to change the elephant image to instead cause it to be classified as a koala bear,

720
00:48:17,377 --> 00:48:24,853
what you'll find is that this second image on the right
actually is classified as koala bear but it looks the same to us.

721
00:48:24,853 --> 00:48:28,016
So that's pretty fishy
and pretty surprising.

722
00:48:28,016 --> 00:48:34,114
So also on the bottom we've taken this picture
of a boat. Schooner is the ImageNet class,

723
00:48:34,114 --> 00:48:37,170
and then we told the network
to classify it as an iPod.

724
00:48:37,170 --> 00:48:41,881
So now the second example looks just, still looks
like a boat to us but the network thinks it's an iPod

725
00:48:41,881 --> 00:48:46,260
and the differences in pixels between
these two images are basically nothing.

726
00:48:46,260 --> 00:48:52,025
And if you magnify those differences you don't really see
any iPod or Koala like features on these differences,

727
00:48:52,025 --> 00:48:58,924
they're just kind of like random patterns of noise. So the question
is what's going on here? And like how can this possibly be the case?
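
The fooling-image recipe from the last few minutes can be tried on a toy model. This is only a sketch under assumptions: a hypothetical two-class linear classifier `W` on a random 10,000-pixel "image", with a hand-written gradient and an illustrative step size, not the networks or images on the slide. It runs gradient ascent on (target score minus original score) until the predicted class flips, then measures the largest per-pixel change.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000                                  # number of pixels (assumed)
W = rng.standard_normal((2, D))             # toy linear classifier, 2 classes
x = rng.uniform(0.0, 255.0, size=D)         # the original image

def predict(img):
    return int(np.argmax(W @ img))

orig_class = predict(x)
target = 1 - orig_class
grad = W[target] - W[orig_class]            # gradient of (target score - original score)

x_fool = x.copy()
for _ in range(10_000):                     # safety cap on the number of steps
    if predict(x_fool) == target:
        break
    x_fool += 0.01 * grad                   # tiny ascent step on every pixel

max_change = np.max(np.abs(x_fool - x))     # largest change to any single pixel
```

Because every one of the many pixels moves a little in a coordinated direction, the score difference adds up quickly even though `max_change` stays well below the 0 to 255 range in this toy setting. The sketch does not clip to valid pixel values.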

728
00:48:58,924 --> 00:49:03,635
Well, we'll have a guest lecture from Ian
Goodfellow in a week and a half, two weeks.

729
00:49:03,635 --> 00:49:08,068
And he's going to go in much more detail about this
type of phenomenon and that will be really exciting.

730
00:49:08,068 --> 00:49:11,006
But I did want to mention it here
because it is on your homework.

731
00:49:11,006 --> 00:49:11,595
Question?

732
00:49:16,320 --> 00:49:20,050
Yeah, so that's something, so the question
is can we use fooled images as training data

733
00:49:20,050 --> 00:49:27,214
and I think, Ian's going to go in much more detail on all of these types of
strategies. Because that's really a whole lecture unto itself.

734
00:49:27,214 --> 00:49:28,885
Question?

735
00:50:00,608 --> 00:50:03,478
The question is why do we
care about any of this stuff?

736
00:50:03,478 --> 00:50:08,685
Basically... Okay, maybe that was a
mischaracterization, I am sorry.

737
00:50:24,573 --> 00:50:32,027
Yeah, the question is... understanding these intermediate
neurons, how does that help our understanding of the final classification.

738
00:50:32,027 --> 00:50:38,921
So this is actually, this whole field of trying to visualize intermediates
is kind of in response to a common criticism of deep learning.

739
00:50:38,921 --> 00:50:43,011
So a common criticism of deep learning is
like, you've got this big black box network

740
00:50:43,011 --> 00:50:47,350
you trained it with gradient descent, you get a good
number and that's great but we don't trust the network

741
00:50:47,350 --> 00:50:51,272
because we don't understand as people why it's
making the decisions that it's making.

742
00:50:51,272 --> 00:51:01,530
So a lot of these types of visualization techniques were developed to try and address that and try to understand
as people why the networks are making their various classification decisions a bit more.

743
00:51:01,530 --> 00:51:07,721
Because if you contrast, if you contrast a deep convolutional
neural network with other machine learning techniques.

744
00:51:07,722 --> 00:51:10,493
Like linear models are much
easier to interpret in general

745
00:51:10,493 --> 00:51:17,457
because you can look at the weights and kind of understand
how much each input feature affects the decision, or if you look at something like

746
00:51:17,458 --> 00:51:19,459
a random forest or decision tree.

747
00:51:19,459 --> 00:51:27,442
Some other machine learning models end up being a bit more interpretable just
by their very nature than these sort of black box convolutional networks.

748
00:51:27,442 --> 00:51:33,520
So a lot of this is sort of in response to that criticism
to say that, yes they are these large complex models

749
00:51:33,520 --> 00:51:37,263
but they are still doing some interesting
and interpretable things under the hood.

750
00:51:37,263 --> 00:51:42,201
They are not just totally going out and randomly
classifying things. They are doing something meaningful

751
00:51:44,891 --> 00:51:50,989
So another cool thing we can do with this gradient based
optimization of images is this idea of DeepDream.

752
00:51:50,989 --> 00:51:55,592
So this was a really cool blog post that
came out from Google a year or two ago.

753
00:51:55,592 --> 00:52:00,859
And the idea is that, this is, so we talked about
scientific value, this is almost entirely for fun.

754
00:52:00,859 --> 00:52:04,284
So the point of this exercise is mostly
to generate cool images.

755
00:52:04,284 --> 00:52:10,186
And as an aside, you also get some sense for what features
these networks are looking at.

756
00:52:10,186 --> 00:52:15,275
So what we can do is, we take our input image, we run it
through the convolutional network up to some layer

757
00:52:15,275 --> 00:52:17,035
and now we back propagate

758
00:52:17,035 --> 00:52:20,742
and set the gradient at that
layer equal to the activation value.

759
00:52:20,742 --> 00:52:25,427
And now back propagate, back to the image and
update the image and repeat, repeat, repeat.

760
00:52:25,427 --> 00:52:31,682
So this has the interpretation of trying to amplify existing
features that were detected by the network in this image. Right?

761
00:52:31,682 --> 00:52:35,875
Because whatever features existed on that layer
now we set the gradient equal to the feature

762
00:52:35,875 --> 00:52:40,010
and we just tell the network to amplify whatever
features you already saw in that image.

763
00:52:40,010 --> 00:52:46,918
And by the way you can also see this as trying to maximize
the L2 norm of the features at that layer of the image.

764
00:52:46,918 --> 00:52:55,999
And it turns... And when you do this the code ends up looking really simple. So your code for many of
your homework assignments will probably be about this complex or maybe even a little bit less so.

765
00:52:55,999 --> 00:53:00,785
So the idea is that... But there's a couple of tricks
here that you'll also see in your assignments.

766
00:53:00,785 --> 00:53:04,443
So one trick is to jitter the image
before you compute your gradients.

767
00:53:04,443 --> 00:53:11,187
So rather than running the exact image through the network instead you'll shift
the image over by two pixels and kind of wrap the other two pixels over here.

768
00:53:11,187 --> 00:53:19,540
And this is a kind of regularizer to prevent each of these [mumbling] it regularizes
a little bit to encourage a little bit of extra spatial smoothness in the image.

769
00:53:19,540 --> 00:53:26,653
You also see they use L1 normalization of the gradients that's kind of
a useful trick sometimes when doing these image generation problems.

770
00:53:26,653 --> 00:53:33,843
You also see them clipping the pixel values once in a while. So again
we talked about how images actually should be between zero and 255,

771
00:53:33,843 --> 00:53:39,335
so this is a kind of projected gradient descent where
we project on to the space of actual valid images.
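
The tricks just listed (jitter, L1-normalized gradient, clipping) fit into one update step. The following is a minimal sketch rather than the code on the slide: the layer is a hypothetical linear map `weights` over the flattened image so that the backward pass can be written by hand, and `lr` and `max_jitter` are illustrative values.

```python
import numpy as np

def deepdream_step(img, weights, rng, lr=1.5, max_jitter=2):
    """One DeepDream-style update for a toy linear layer  feats = weights @ img.ravel()."""
    oy, ox = (int(v) for v in rng.integers(-max_jitter, max_jitter + 1, size=2))
    shifted = np.roll(img, (oy, ox), axis=(0, 1))       # jitter: shift with wrap-around
    feats = weights @ shifted.ravel()                   # forward pass to the chosen layer
    grad = (weights.T @ feats).reshape(img.shape)       # backprop with upstream grad = activations
    shifted = shifted + lr * grad / (np.abs(grad).mean() + 1e-8)   # L1-normalized ascent step
    out = np.roll(shifted, (-oy, -ox), axis=(0, 1))     # undo the jitter
    return np.clip(out, 0.0, 255.0)                     # project back onto valid pixel values

rng = np.random.default_rng(0)
img = rng.uniform(0.0, 255.0, size=(8, 8))
weights = rng.standard_normal((4, 64))
out = deepdream_step(img, weights, rng)
```

Calling `deepdream_step` repeatedly on its own output gives the "repeat, repeat, repeat" loop. With a real convolutional network only the forward and backward lines change.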

772
00:53:39,335 --> 00:53:46,215
But now when we do all this then we start, we might start with
some image of a sky and then we get really cool results like this.

773
00:53:46,215 --> 00:53:52,614
So you can see that now we've taken these tiny features on the
sky and they get amplified through this, through this process.

774
00:53:52,614 --> 00:53:59,007
And we can see things like this different mutant animals
start to pop up or these kind of spiral shapes pop up.

775
00:53:59,007 --> 00:54:04,296
Different kinds of houses and cars pop up. So
that's all, that's all pretty interesting.

776
00:54:04,296 --> 00:54:08,743
There's a couple patterns in particular that
pop up all the time that people have named.

777
00:54:08,743 --> 00:54:12,133
Right, so there's this admiral
dog that shows up a lot.

778
00:54:12,133 --> 00:54:16,033
There's the pig snail, the camel bird,
the dog fish.

779
00:54:16,033 --> 00:54:22,771
Right, so these are kind of interesting, but actually this fact that
dogs show up so much in these visualizations actually does tell us

780
00:54:22,771 --> 00:54:26,249
something about the data on
which this network was trained.

781
00:54:26,249 --> 00:54:30,786
Right, because this is a network that was trained for ImageNet
classification; ImageNet has a thousand categories.

782
00:54:30,786 --> 00:54:32,915
But 200 of those categories are dogs.

783
00:54:32,915 --> 00:54:44,027
So it's kind of not surprising in a sense that when you do these kind of visualizations then the network
ends up hallucinating a lot of dog like stuff in the image often morphed with other types of animals.

784
00:54:44,027 --> 00:54:47,327
When you do this at other layers of the
network you get other types of results.

785
00:54:47,327 --> 00:54:52,708
So here we're taking one of these lower layers in the network,
the previous example was relatively high up in the network

786
00:54:52,708 --> 00:54:57,791
and now again we have this interpretation that lower layers
may be computing edges and swirls and stuff like that

787
00:54:57,791 --> 00:55:01,766
and that's kind of borne out when we
run DeepDream at a lower layer.

788
00:55:01,766 --> 00:55:08,346
Or if you run this thing for a long time and maybe add in some
multiscale processing you can get some really, really crazy images.

789
00:55:08,346 --> 00:55:14,631
Right, so here they're doing a kind of multiscale processing where they start
with a small image run DeepDream on the small image then make it bigger

790
00:55:14,631 --> 00:55:19,893
and continue DeepDream on the larger image and kind of
repeat with this multiscale processing and then you can get,

791
00:55:19,893 --> 00:55:25,699
and then maybe after you complete the final scale then you
restart from the beginning and you just go wild on this thing.

792
00:55:25,699 --> 00:55:28,126
And you can get some really crazy images.

793
00:55:28,126 --> 00:55:31,454
So these examples were all from networks
trained on ImageNet.

794
00:55:31,454 --> 00:55:35,216
There's another data set from
MIT called MIT Places Data set

795
00:55:35,216 --> 00:55:40,224
but instead of 1,000 categories of objects
instead it's 200 different types of scenes

796
00:55:40,224 --> 00:55:42,663
like bedrooms and kitchens
and stuff like that.

797
00:55:42,663 --> 00:55:50,868
And now if we repeat this DeepDream procedure using a network trained
on MIT Places, we get some really cool visualizations as well.

798
00:55:50,868 --> 00:55:59,491
So now instead of dog slugs and admiral dogs and that kind of stuff, instead
we often get these kind of roof shapes of these kind of Japanese-style buildings

799
00:55:59,491 --> 00:56:02,104
or these different types of
bridges or mountain ranges.

800
00:56:02,104 --> 00:56:05,288
They're like really, really
cool beautiful visualizations.

801
00:56:05,288 --> 00:56:11,685
So the code for DeepDream is online, released by Google you
can go check it out and make your own beautiful pictures

802
00:56:11,685 --> 00:56:14,535
So there's another kind of...
Sorry question?

803
00:56:24,731 --> 00:56:28,252
So the question is, what
are we taking the gradient of?

804
00:56:28,252 --> 00:56:33,318
So like I say, if you take one half
x squared, the gradient of that is x.

805
00:56:33,318 --> 00:56:44,477
So, if you send back the volume of activations as the gradient, that's
equivalent to taking the gradient of the sum of one half x squared over all of the values.

806
00:56:44,477 --> 00:56:49,665
So it's equivalent to maximizing the norm
of the features of that layer.

807
00:56:49,665 --> 00:56:56,511
But in practice many implementations you'll see don't
explicitly compute that, and instead just send the gradient back.
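
A quick numerical check of that claim, as a sketch with made-up activation values: the gradient of one half the sum of squared activations, taken with respect to the activations, is the activation vector itself. Here it is compared against central finite differences (`eps` is an illustrative step size).

```python
import numpy as np

def half_sq_norm(a):
    return 0.5 * np.sum(a ** 2)

rng = np.random.default_rng(0)
acts = rng.standard_normal(5)       # toy activations at the chosen layer
eps = 1e-6

# Central finite difference along each coordinate direction.
numeric = np.array([
    (half_sq_norm(acts + eps * e) - half_sq_norm(acts - eps * e)) / (2 * eps)
    for e in np.eye(acts.size)
])
```

So sending `acts` back as the upstream gradient is the same as backpropagating from `half_sq_norm(acts)` without ever computing that scalar.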

808
00:56:56,511 --> 00:57:01,478
So another kind of useful, another kind of useful
thing we can do is this concept of feature inversion.

809
00:57:01,478 --> 00:57:07,687
So this again gives us a sense for what types of, what types of
elements of the image are captured at different layers of the network.

810
00:57:07,687 --> 00:57:12,220
So what we're going to do now is we're going to
take an image, run that image through network

811
00:57:12,220 --> 00:57:15,832
record the feature value
for one of those images

812
00:57:15,832 --> 00:57:20,283
and now we're going to try to reconstruct
that image from its feature representation.

813
00:57:20,283 --> 00:57:31,074
And now, based on what that reconstructed image looks like,
that'll give us some sense for what type of information about the image was captured in that feature vector.

814
00:57:31,074 --> 00:57:34,191
So again, we can do this with gradient
ascent with some regularizer.

815
00:57:34,191 --> 00:57:41,709
Where now rather than maximizing some score instead we want
to minimize the distance between this cached feature vector

816
00:57:41,709 --> 00:57:50,014
and the computed features of our generated image, to try and again
synthesize a new image that matches the feature vector that we computed before.

817
00:57:50,014 --> 00:57:56,856
And another kind of regularizer that you frequently see here is the
total variation regularizer that you also see on your homework.

818
00:57:56,856 --> 00:58:05,954
So here the total variation regularizer is penalizing differences between adjacent
pixels, both adjacent left to right and adjacent top to bottom,

819
00:58:05,954 --> 00:58:09,956
to again try to encourage spatial
smoothness in the generated image.
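
A minimal sketch of a total variation penalty for a single-channel image. The lecture does not say whether the differences are squared or taken in absolute value; squaring them here is an assumption, as is the function name.

```python
import numpy as np

def total_variation(img):
    """Sum of squared differences between adjacent pixels of a 2-D image."""
    dh = img[1:, :] - img[:-1, :]     # top-to-bottom neighbors
    dw = img[:, 1:] - img[:, :-1]     # left-to-right neighbors
    return np.sum(dh ** 2) + np.sum(dw ** 2)

flat = total_variation(np.ones((4, 4)))                      # constant image
ramp = total_variation(np.array([[0.0, 1.0], [2.0, 3.0]]))   # small 2x2 ramp
```

A constant image has zero penalty, and the penalty grows as neighboring pixels disagree, which is what pushes the generated image toward smoothness.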

820
00:58:09,956 --> 00:58:16,369
So now if we do this idea of feature inversion so this
visualization here on the left we're showing some original image.

821
00:58:16,369 --> 00:58:18,294
The elephants or the fruits at the left.

822
00:58:18,294 --> 00:58:22,458
And then we run that,
we run the image through a VGG-16 network.

823
00:58:22,458 --> 00:58:30,013
Record the features of that network at some layer and then try to
synthesize a new image that matches the recorded features of that layer.

824
00:58:30,013 --> 00:58:37,534
And this kind of gives us a sense for how much information
is stored in these features at different layers.

825
00:58:37,534 --> 00:58:43,849
So for example if we try to reconstruct the image based
on the relu2_2 features from VGG-16,

826
00:58:43,849 --> 00:58:46,628
We see that the image gets
almost perfectly reconstructed.

827
00:58:46,628 --> 00:58:52,664
Which means that we're not really throwing away much
information about the raw pixel values at that layer.

828
00:58:52,664 --> 00:58:58,593
But as we move up into the deeper parts of the network
and try to reconstruct from relu4_3, relu5_1.

829
00:58:58,593 --> 00:59:05,488
We see that our reconstructed image now, we've kind of kept the
general spatial structure of the image.

830
00:59:05,488 --> 00:59:09,684
You can still tell that it's an
elephant or a banana or an apple.

831
00:59:09,684 --> 00:59:16,427
But a lot of the low level details aren't exactly what the pixel values
were and exactly what the colors were, exactly what the textures were.

832
00:59:16,427 --> 00:59:20,923
These kind of low-level details are kind of
lost at these higher layers of this network.

833
00:59:20,923 --> 00:59:29,153
So that gives us some sense that maybe as we move up through the layers of the network
it's kind of throwing away this low level information about the exact pixels of the image

834
00:59:29,153 --> 00:59:38,109
and instead is maybe trying to keep around a little bit more semantic information, it's
a little bit invariant to small changes in color and texture and things like that.

835
00:59:38,109 --> 00:59:42,835
So we're building towards a style
transfer here which is really cool.

836
00:59:42,835 --> 00:59:51,029
So in order to understand style transfer,
we also need to talk about a related problem called texture synthesis.

837
00:59:51,029 --> 00:59:55,112
So in texture synthesis, this is kind
of an old problem in computer graphics.

838
00:59:55,112 --> 01:00:05,792
Here the idea is that we're given some input patch of texture. Something like these little scales
here and now we want to build some model and then generate a larger piece of that same texture.

839
01:00:05,792 --> 01:00:12,056
So for example, we might here want to generate a large
image containing many scales that kind of look like input.

840
01:00:12,056 --> 01:00:15,986
And this is again a pretty old
problem in computer graphics.

841
01:00:15,986 --> 01:00:19,720
There are nearest neighbor approaches to
texture synthesis that work pretty well.

842
01:00:19,720 --> 01:00:21,659
So, there's no neural networks here.

843
01:00:21,659 --> 01:00:27,792
Instead, this is kind of a simple algorithm where we march through
the generated image one pixel at a time in scan line order.

844
01:00:27,792 --> 01:00:34,742
And then copy... And then look at a neighborhood around the
current pixel based on the pixels that we've already generated

845
01:00:34,742 --> 01:00:41,934
and now compute a nearest neighbor of that neighborhood in the patches
of the input image and then copy over one pixel from the input image.

846
01:00:41,934 --> 01:00:48,889
So, maybe you don't need to understand the details here just the idea is that
there's a lot of classical algorithms for texture synthesis, it's a pretty old problem

847
01:00:48,889 --> 01:00:52,749
but you can do this without
neural networks basically.

848
01:00:52,749 --> 01:00:59,915
And when you run this kind of classical texture synthesis
algorithm it actually works reasonably well for simple textures.

849
01:00:59,915 --> 01:01:08,970
But as we move to more complex textures these kinds of simple methods of
maybe copying pixels from the input patch directly tend not to work so well.

850
01:01:08,970 --> 01:01:16,494
So, in 2015, there was a really cool paper that tried to apply
neural network features to this problem of texture synthesis.

851
01:01:16,494 --> 01:01:24,753
And ended up framing it as kind of a gradient ascent procedure, kind of similar to
the feature map, the various feature matching objectives that we've seen already.

852
01:01:24,753 --> 01:01:30,558
So, in order to perform neural texture synthesis
they use this concept of a Gram matrix.

853
01:01:30,558 --> 01:01:36,372
So, what we're going to do, is we're going to take our
input texture and in this case some pictures of rocks

854
01:01:36,372 --> 01:01:44,347
and then take that input texture and pass it through some convolutional neural
network and pull out convolutional features at some layer of the network.

855
01:01:44,347 --> 01:01:53,596
So, maybe then this convolutional feature volume that we've talked about,
might be H by W by C or sorry, C by H by W at that layer of the network.

856
01:01:53,596 --> 01:01:56,515
So, you can think of this
as an H by W spatial grid.

857
01:01:56,515 --> 01:02:04,347
And at each point of the grid, we have this C dimensional feature
vector describing the rough appearance of that image at that point.

858
01:02:04,347 --> 01:02:10,179
And now, we're going to use this activation map to
compute a descriptor of the texture of this input image.

859
01:02:10,179 --> 01:02:15,294
So, what we're going to do is take, pick out two of
these different feature columns in the input volume.

860
01:02:15,294 --> 01:02:18,318
Each of these feature columns
will be a C dimensional vector.

861
01:02:18,318 --> 01:02:23,390
And now take the outer product between those
two vectors to give us a C by C matrix.

862
01:02:23,390 --> 01:02:30,333
This C by C matrix now tells us something about the co-occurrence
of the different features at those two points in the image.

863
01:02:30,333 --> 01:02:40,218
Right, so, if an element, if like element IJ in the C by C matrix is large that means
both elements I and J of those two input vectors were large and something like that.

864
01:02:40,218 --> 01:02:51,572
So, this somehow captures some second order statistics about which features in that feature map
tend to activate together at different spatial positions.

865
01:02:51,572 --> 01:03:01,664
And now we're going to repeat this procedure using all different pairs of feature vectors from all
different points in this H by W grid. Average them all out, and that gives us our C by C Gram matrix.

866
01:03:01,664 --> 01:03:06,323
And this is then used as a descriptor to describe
kind of the texture of that input image.

867
01:03:06,323 --> 01:03:13,623
So, what's interesting about this gram matrix is that it has now
thrown away all spatial information that was in this feature volume.

868
01:03:13,623 --> 01:03:17,545
Because we've averaged over all pairs of
feature vectors at every point in the image.

869
01:03:17,545 --> 01:03:21,863
Instead, it's just capturing the second order
co-occurrence statistics between features.

870
01:03:21,863 --> 01:03:25,364
And this ends up being a
nice descriptor for texture.

871
01:03:25,364 --> 01:03:27,640
And by the way, this is
really efficient to compute.

872
01:03:27,640 --> 01:03:39,682
So, if you have a C by H by W three dimensional tensor, you can just reshape it to C by H times
W, take that times its own transpose, and compute this all in one shot, so it's super efficient.
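
A minimal NumPy sketch of that reshape-and-multiply trick, for reference. The function names, the random test data, and the division by H times W are assumptions made for illustration, not something stated in the lecture. Note that multiplying the reshaped matrix by its own transpose sums the outer product of each position's feature vector with itself, which is what the slow loop below spells out.

```python
import numpy as np

def gram_matrix(features):
    """Return the C x C Gram matrix of a C x H x W feature volume."""
    C, H, W = features.shape
    F = features.reshape(C, H * W)   # each column is one C-dimensional feature vector
    return F @ F.T / (H * W)         # dividing by H*W is an assumed normalization

def gram_matrix_slow(features):
    """Same quantity, written out as an average of per-position outer products."""
    C, H, W = features.shape
    G = np.zeros((C, C))
    for y in range(H):
        for x in range(W):
            v = features[:, y, x]    # feature column at grid position (y, x)
            G += np.outer(v, v)
    return G / (H * W)

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 5, 6))   # stand-in for real convolutional features
G_fast = gram_matrix(feats)
G_slow = gram_matrix_slow(feats)
```

Both functions should agree up to floating point error; the first one does all the work in a single matrix multiply, which is why it is so cheap.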

873
01:03:39,682 --> 01:03:45,417
But you might be wondering why you don't use an actual covariance
matrix or something like that instead of this funny gram matrix

874
01:03:45,417 --> 01:03:51,845
and the answer is that using covariance... Using true covariance
matrices also works but it's a little bit more expensive to compute.

875
01:03:51,845 --> 01:03:55,203
So, in practice a lot of people
just use this gram matrix descriptor.

876
01:03:55,203 --> 01:04:06,916
So now, once we have this sort of neural descriptor of texture, then we use a similar
type of gradient ascent procedure to synthesize a new image that matches the texture of the original image.

877
01:04:06,916 --> 01:04:10,913
So, now this looks kind of like the feature
reconstruction that we saw a few slides ago.

878
01:04:10,913 --> 01:04:20,883
But instead of trying to reconstruct the whole feature map from the input image, we're
just going to try and reconstruct this gram matrix texture descriptor of the input image.

879
01:04:20,883 --> 01:04:25,969
So, in practice what this looks like is that well... You'll
download some pretrained model, like in feature inversion.

880
01:04:25,969 --> 01:04:28,720
Often, people will use
the VGG networks for this.

881
01:04:28,720 --> 01:04:38,553
You'll take your texture image, feed it through the VGG
network, and compute the gram matrix at many different layers of this network.

882
01:04:38,553 --> 01:04:47,414
Then you'll initialize your new image from some random initialization and then it
looks like gradient ascent again. Just like for these other methods that we've seen.

883
01:04:47,414 --> 01:04:52,530
So, you take that image, pass it through the same VGG
network, compute the gram matrix at various layers

884
01:04:52,530 --> 01:05:00,833
and now compute loss as the L2 norm between the gram
matrices of your input texture and your generated image.

885
01:05:00,833 --> 01:05:06,025
And then you backprop and compute a gradient
on the pixels of your generated image.

886
01:05:06,025 --> 01:05:09,273
And then make a gradient ascent step to
update the pixels of the image a little bit.

887
01:05:09,273 --> 01:05:17,071
And now, repeat this process many times: go forward, compute your gram matrices,
compute your losses, backprop to get the gradient on the image, and repeat.
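
Here is a toy sketch of that loop, under heavy assumptions. There is no VGG network here: the "image" being optimized is just a C by N feature matrix, so the gradient can be written analytically instead of backpropagated through a network. The sizes, step size, and iteration count are made up. Since the loss is being minimized, the update steps against the gradient.

```python
import numpy as np

def gram(F):
    """Gram matrix of a C x N feature matrix (N = H*W grid positions)."""
    return F @ F.T / F.shape[1]

def gram_loss_and_grad(F, G_target):
    """Squared L2 distance between Gram matrices, and its gradient w.r.t. F."""
    D = gram(F) - G_target
    loss = np.sum(D ** 2)
    grad = 4.0 / F.shape[1] * D @ F   # analytic gradient; uses the symmetry of D
    return loss, grad

rng = np.random.default_rng(0)
C, N = 3, 64                                   # assumed toy sizes
target_feats = rng.standard_normal((C, N))     # stands in for the texture image's features
G_target = gram(target_feats)

F = rng.standard_normal((C, N))                # random initialization
step_size = 0.5                                # assumed value
losses = []
for _ in range(500):
    loss, grad = gram_loss_and_grad(F, G_target)
    losses.append(loss)
    F -= step_size * grad                      # step against the gradient to lower the loss
```

After the loop, `gram(F)` should be close to `G_target` even though `F` itself is not a copy of `target_feats`, which mirrors the point that only the second order statistics are matched.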

888
01:05:17,071 --> 01:05:22,702
And once you do this, eventually you'll end up generating
a texture that matches your input texture quite nicely.

889
01:05:22,702 --> 01:05:30,022
So, this was all from a NIPS 2015 paper by a group in Germany.
And they had some really cool results for texture synthesis.

890
01:05:30,022 --> 01:05:33,531
So, here on the top, we're showing
four different input textures.

891
01:05:33,531 --> 01:05:41,133
And now, on the bottom, we're showing the results of this
texture synthesis approach by gram matrix matching,

892
01:05:41,133 --> 01:05:45,681
by computing the gram matrix at different
layers of this pretrained convolutional network.

893
01:05:45,681 --> 01:05:56,965
So, you can see that, if we use these very low layers in the convolutional network, then we generally
get splotches of the right colors, but the overall spatial structure doesn't get preserved so much.

894
01:05:56,965 --> 01:06:06,935
And now, as we move further down in the network and you compute these gram matrices
at higher layers, you see that they tend to reconstruct larger patterns from the input image.

895
01:06:06,935 --> 01:06:10,107
For example, these whole rocks
or these whole cranberries.

896
01:06:10,107 --> 01:06:17,677
And now, this works pretty well: we can synthesize these new
images that kind of match the general spatial statistics of your inputs.

897
01:06:17,677 --> 01:06:21,445
But they are quite different pixel wise
from the actual input itself.

898
01:06:21,445 --> 01:06:22,528
Question?

899
01:06:28,481 --> 01:06:30,847
So, the question is, where
do we compute the loss?

900
01:06:30,847 --> 01:06:40,285
And in practice, if you want to get good results, typically people will compute gram matrices at many
different layers, and then the final loss will be a sum of all those, potentially a weighted sum.
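
A small sketch of that weighted sum over layers, assuming the per-layer features are already available as NumPy arrays. The layer names, shapes, and weights below are hypothetical placeholders; a real setup would pull these activations out of the pretrained network.

```python
import numpy as np

def gram_matrix(features):
    """C x C Gram matrix of a C x H x W feature volume (normalization is assumed)."""
    C, H, W = features.shape
    F = features.reshape(C, H * W)
    return F @ F.T / (H * W)

def style_loss(gen_feats, style_feats, layer_weights):
    """Weighted sum over layers of squared Gram matrix differences."""
    total = 0.0
    for name, weight in layer_weights.items():
        diff = gram_matrix(gen_feats[name]) - gram_matrix(style_feats[name])
        total += weight * np.sum(diff ** 2)
    return total

rng = np.random.default_rng(0)
# Hypothetical layer names and shapes, filled with random stand-in activations.
shapes = {"layer1": (4, 8, 8), "layer2": (8, 4, 4)}
style_feats = {k: rng.standard_normal(s) for k, s in shapes.items()}
gen_feats = {k: rng.standard_normal(s) for k, s in shapes.items()}
weights = {"layer1": 1.0, "layer2": 0.5}

loss_value = style_loss(gen_feats, style_feats, weights)
```

Putting all the weight on a single layer recovers the one-layer setting used for the visualization on the slide.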

901
01:06:40,285 --> 01:06:47,940
But I think for this visualization, to try to pin point the effect of the
different layers I think these were doing reconstruction from just one layer.

902
01:06:47,940 --> 01:06:52,999
So then they had a really brilliant
idea kind of after this paper

903
01:06:52,999 --> 01:07:01,417
which is, what if we do this texture synthesis approach but instead of using an
image like rocks or cranberries what if we set it equal to a piece of artwork.

904
01:07:01,417 --> 01:07:03,748
So then, for example, if you...

905
01:07:03,748 --> 01:07:10,333
If you do the same texture synthesis algorithm by matching
gram matrices, but now we take, for example,

906
01:07:10,333 --> 01:07:14,656
Vincent van Gogh's Starry Night
or the Muse by Picasso as our texture...

907
01:07:14,656 --> 01:07:19,759
As our input texture, and then run
this same texture synthesis algorithm.

908
01:07:19,759 --> 01:07:25,683
Then we can see our generated images tend to reconstruct
interesting pieces from those pieces of artwork.

909
01:07:25,683 --> 01:07:34,616
And now, something really interesting happens when you combine this idea of texture
synthesis by gram matrix matching with feature inversion by feature matching.

910
01:07:34,616 --> 01:07:38,988
And then this brings us to this really
cool algorithm called style transfer.

911
01:07:38,988 --> 01:07:42,716
So, in style transfer, we're
going to take two images as input.

912
01:07:42,716 --> 01:07:49,813
One, we're going to take a content image that will guide
what we generally want our output to look like.

913
01:07:49,813 --> 01:07:55,499
Also, a style image that will tell us what is the general
texture or style that we want our generated image to have

914
01:07:55,499 --> 01:08:02,596
and then we will generate a new image
by minimizing the feature reconstruction loss of the content image

915
01:08:02,596 --> 01:08:05,661
and the gram matrix
loss of the style image.

916
01:08:05,661 --> 01:08:14,353
And when we do these two things we get a really cool image that kind of
renders the content image kind of in the artistic style of the style image.

917
01:08:14,353 --> 01:08:18,317
And now this is really cool. And you can
get these really beautiful figures.

918
01:08:18,317 --> 01:08:26,384
So again, what this kind of looks like is that you'll take your style image and your
content image, pass them into your network to compute your gram matrices and your features.

919
01:08:26,384 --> 01:08:29,332
Now, you'll initialize your output image
with some random noise.

920
01:08:29,332 --> 01:08:38,264
Go forward, compute your losses, go backward, compute your gradients on the image, and repeat
this process over and over, doing gradient ascent on the pixels of your generated image.

921
01:08:38,265 --> 01:08:43,247
And after a few hundred iterations,
generally you'll get a beautiful image.

922
01:08:43,247 --> 01:08:48,965
So, I have an implementation of this online on my GitHub,
that a lot of people are using. And it's really cool.

923
01:08:48,965 --> 01:08:54,609
So, this kind of gives you a lot more
control over the generated image as compared to DeepDream.

924
01:08:54,609 --> 01:09:00,544
Right, so in DeepDream, you don't have a lot of control over exactly
what types of things are going to come out at the end.

925
01:09:00,544 --> 01:09:06,500
You just kind of pick different layers of the networks maybe set
different numbers of iterations and then dog slugs pop up everywhere.

926
01:09:06,500 --> 01:09:11,228
But with style transfer, you get a lot more fine-grained
control over what you want the result to look like.

927
01:09:11,228 --> 01:09:19,099
Right, by now, picking different style images with the same content image
you can generate whole different types of results which is really cool.

928
01:09:19,099 --> 01:09:30,349
Also, you can play around with the hyperparameters here. Right, because we're jointly minimizing this
feature reconstruction loss of the content image and this gram matrix reconstruction loss of the style image.

929
01:09:30,350 --> 01:09:39,468
If you trade off the constant, the weighting between those two terms in the loss, then you can get
control over how much we want to match the content versus how much we want to match the style.
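
To make that trade-off concrete, here is a rough sketch of the combined objective at a single layer. The weight names `content_weight` and `style_weight`, the single-layer setup, and the random stand-in features are all assumptions for illustration.

```python
import numpy as np

def gram_matrix(features):
    """C x C Gram matrix of a C x H x W feature volume (normalization is assumed)."""
    C, H, W = features.shape
    F = features.reshape(C, H * W)
    return F @ F.T / (H * W)

def content_loss(gen_feat, content_feat):
    """Feature reconstruction loss: squared L2 distance between feature volumes."""
    return np.sum((gen_feat - content_feat) ** 2)

def style_loss(gen_feat, style_feat):
    """Gram matrix loss: squared L2 distance between Gram matrices."""
    return np.sum((gram_matrix(gen_feat) - gram_matrix(style_feat)) ** 2)

def total_loss(gen_feat, content_feat, style_feat, content_weight, style_weight):
    """Weighted combination of the two terms; the weights set the trade-off."""
    return (content_weight * content_loss(gen_feat, content_feat)
            + style_weight * style_loss(gen_feat, style_feat))

rng = np.random.default_rng(0)
content_feat = rng.standard_normal((4, 6, 6))   # stand-in for content image features
style_feat = rng.standard_normal((4, 6, 6))     # stand-in for style image features
gen_feat = rng.standard_normal((4, 6, 6))       # stand-in for generated image features

mostly_content = total_loss(gen_feat, content_feat, style_feat, 1.0, 0.01)
mostly_style = total_loss(gen_feat, content_feat, style_feat, 0.01, 1.0)
```

Raising `style_weight` relative to `content_weight` pushes the optimizer to care more about matching texture statistics and less about matching the content features, and vice versa.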

930
01:09:39,469 --> 01:09:41,647
There's a lot of other
hyperparameters you can play with.

931
01:09:41,647 --> 01:09:45,707
For example, if you resize the style image
before you compute the gram matrix

932
01:09:45,707 --> 01:09:52,344
that can give you some control over what the scale of features
are that you want to reconstruct from the style image.

933
01:09:52,344 --> 01:09:58,976
So, you can see that here, we've done this same reconstruction; the only
difference is how big the style image was before we computed the gram matrix.

934
01:09:58,976 --> 01:10:04,263
And this gives you another axis over
which you can control these things.

935
01:10:04,263 --> 01:10:07,670
You can also actually do style transfer
with multiple style images

936
01:10:07,670 --> 01:10:13,431
if you just match sort of multiple gram matrices at
the same time. And that's kind of a cool result.

937
01:10:13,431 --> 01:10:25,105
So, another cool thing you can do: we talked about this multi-scale processing for DeepDream
and saw how multi-scale processing in DeepDream can give you some really cool high resolution results.

938
01:10:25,105 --> 01:10:29,330
And you can do a similar type of multi-scale
processing in style transfer as well.

939
01:10:29,330 --> 01:10:40,867
So, then we can compute images like this that are super high resolution. This is I
think a 4k image of our favorite school, like rendered in the style of Starry Night.

940
01:10:40,867 --> 01:10:42,652
But this is actually super
expensive to compute.

941
01:10:42,652 --> 01:10:47,074
I think this one took four GPUs.
So, a little expensive.

942
01:10:47,074 --> 01:10:53,666
We can also use other style images and get some really
cool results from the same content image. Again, at high resolution.

943
01:10:53,666 --> 01:11:01,168
Another fun thing you can do is you know, you can actually
do joint style transfer and DeepDream at the same time.

944
01:11:01,168 --> 01:11:09,017
So, now we'll have three losses, the content loss the style loss and
this... And this DeepDream loss that tries to maximize the norm.

945
01:11:09,017 --> 01:11:14,286
And get something like this. So, now it's Van
Gogh with the dog slugs coming out everywhere.

946
01:11:14,286 --> 01:11:15,858
[laughing]

947
01:11:15,858 --> 01:11:18,466
So, that's really cool.

948
01:11:18,466 --> 01:11:23,012
But there's kind of a problem with these style transfer
algorithms, which is that they are pretty slow.

949
01:11:23,012 --> 01:11:30,164
Right, you need to compute a lot of forward and backward
passes through your pretrained network in order to complete these images.

950
01:11:30,164 --> 01:11:38,200
And especially for these high resolution results that we saw in the previous slide. Each
forward and backward pass of a 4k image is going to take a lot of compute and a lot of memory.

951
01:11:38,200 --> 01:11:46,340
And if you need to do several hundred of those iterations, generating these
images could take many, like tens of minutes, even on a powerful GPU.

952
01:11:46,340 --> 01:11:50,320
So, it's really not so practical
to apply these things in practice.

953
01:11:50,320 --> 01:11:54,874
The solution is to now train another neural
network to do the style transfer for us.

954
01:11:54,874 --> 01:12:03,164
So, I had a paper about this last year and the idea is that we're going to fix
some style that we care about at the beginning. In this case, Starry Night.

955
01:12:03,164 --> 01:12:08,034
And now rather than running a separate optimization
procedure for each image that we want to synthesize

956
01:12:08,034 --> 01:12:15,748
instead we're going to train a single feed forward network that can
input the content image and then directly output the stylized result.

957
01:12:15,748 --> 01:12:26,848
And now the way that we train this network is that we compute the same content and style losses during training
of our feed forward network and use that same gradient to update the weights of the feed forward network.

958
01:12:26,848 --> 01:12:36,148
And now this thing takes maybe a few hours to train but once it's trained, then in order to
produce stylized images you just need to do a single forward pass through the trained network.

959
01:12:36,148 --> 01:12:49,880
So, I have code for this online, and you can see that it ends up with relatively comparable quality in
some cases to this very slow optimization-based method, but now it runs in real time; it's about a thousand times faster.

960
01:12:49,880 --> 01:12:54,990
So, here you can see, this is like a
demo of it running live off my webcam.

961
01:12:54,990 --> 01:13:05,476
So, this is not running live right now obviously, but if you have a big GPU you can easily
run four different styles in real time all simultaneously because it's so efficient.

962
01:13:05,476 --> 01:13:12,650
There was another group from Russia that had a very similar
paper concurrently, and their results are about as good.

963
01:13:12,650 --> 01:13:15,392
They also had this kind
of tweak on the algorithm.

964
01:13:15,392 --> 01:13:25,450
So, this feed forward network that we're training ends up looking a lot like
these segmentation models that we saw. So, these segmentation networks,

965
01:13:25,450 --> 01:13:37,678
for semantic segmentation we're doing downsampling, and then many layers, then some
upsampling with transposed convolution, in order to downsample and upsample to be more efficient.

966
01:13:37,678 --> 01:13:45,244
The only difference is that this final layer produces a
three channel output for the RGB of that final image.

967
01:13:45,244 --> 01:13:48,540
And inside this network, we have batch
normalization in the various layers.

968
01:13:48,540 --> 01:13:56,027
But in this paper, they swap out the batch normalization for something
else called instance normalization, which tends to give you much better results.
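
As a rough illustration of the difference, here is a NumPy sketch of the two normalizations without any learned scale or shift. It assumes inputs of shape (N, C, H, W) and a small `eps` for numerical stability; the function names are just for this example. Instance normalization computes statistics per sample and per channel over the spatial grid, while training-mode batch normalization shares statistics across the whole batch.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each channel of each sample over its own H x W positions."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    """Training-mode batch norm: per-channel statistics shared across the batch."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 4)) * 5.0 + 2.0   # stand-in activations, shape (N, C, H, W)
y_instance = instance_norm(x)
y_batch = batch_norm(x)
```

With instance normalization every individual image ends up with zero mean and unit variance in each channel, regardless of what else is in the batch.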

969
01:13:56,027 --> 01:14:05,500
So, one drawback of these types of methods is that we're now training
one new style transfer network for every style that we want to apply.

970
01:14:05,500 --> 01:14:10,433
So that could be expensive if now you need to
keep a lot of different trained networks around.

971
01:14:10,433 --> 01:14:21,178
So, there was a paper from Google that just came out pretty recently that addressed this by
using one feed forward trained network to apply many different styles to the input image.

972
01:14:21,178 --> 01:14:28,034
So now, they can train one network to apply many
different styles at test time using one trained network.

973
01:14:28,034 --> 01:14:36,477
So, here it's going to take the content image as input, as well as the identity of the style
you want to apply and then this is using one network to apply many different types of styles.

974
01:14:36,477 --> 01:14:39,365
And again, runs in real time.

975
01:14:39,365 --> 01:14:44,442
That same algorithm can also do this kind of style
blending in real time with one trained network.

976
01:14:44,442 --> 01:14:52,458
So now, once you've trained this network on these four different styles, you can actually
specify a blend of these styles to be applied at test time which is really cool.

977
01:14:52,458 --> 01:15:01,976
So, these kinds of real time style transfer methods are on various
apps and you can see these out in practice a lot now these days.

978
01:15:01,976 --> 01:15:04,071
So, kind of the summary
of what we've seen today

979
01:15:04,071 --> 01:15:08,113
is that we've talked about many different
methods for understanding CNN representations.

980
01:15:08,113 --> 01:15:10,190
We've talked about some of
these activation based methods

981
01:15:10,190 --> 01:15:14,220
like nearest neighbor, dimensionality
reduction, maximal patches, occlusion images

982
01:15:14,220 --> 01:15:18,316
to try to understand, based on the activation
values, what the features are looking for.

983
01:15:18,316 --> 01:15:20,461
We also talked about a bunch
of gradient based methods

984
01:15:20,461 --> 01:15:27,127
where you can use gradients to synthesize new images
to understand your features, such as saliency maps,

985
01:15:27,127 --> 01:15:30,417
class visualizations, fooling images,
feature inversion.

986
01:15:30,417 --> 01:15:37,997
And we also had fun by seeing how a lot of these similar ideas can be applied
to things like Style Transfer and DeepDream to generate really cool images.

987
01:15:37,997 --> 01:15:40,397
So, next time, we'll talk
about unsupervised learning

988
01:15:40,397 --> 01:15:45,834
Autoencoders, Variational Autoencoders and generative
adversarial networks so that should be a fun lecture.