/
Assignment2.ipynb
323 lines (323 loc) · 12.4 KB
/
Assignment2.ipynb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Assignment 2\n",
"\n",
"## Introduction\n",
"\n",
"In this assignment we will replicate a gene expression data analysis experiment. We will use both unsupervised clustering, and a supervised approach using the Support Vector Machine classifier.\n",
"\n",
"The data is highly dimensional, in other words there are many more features than samples/observations ($p \\gg N$). This is typical of gene expression data and of some other medical data problems that you might encounter, such as proteomic data or other biomedical data. When the number of features/dimensions is __much bigger__ than the number of samples/observations, this is a high-dimensional problem.\n",
"\n",
"The dataset was described and analysed in the following publication:\n",
"\n",
"S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.H. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J.P. Mesirov, T. Poggio, W. Gerald, M. Loda, E.S. Lander and T.R. Golub. __Multiclass cancer diagnosis using tumor gene expression signatures__. _PNAS, Proceedings of the National Academy of Sciences_. 2001 Dec 18; 98(26): 15149–15154.\n",
"\n",
"The full text is available via PubMed:\n",
"<http://www.ncbi.nlm.nih.gov/pmc/articles/PMC64998/pdf/pq2601015149.pdf>\n",
"\n",
"## Deliverable\n",
"\n",
"The deliverable of this assignment is to replicate the gene expression analysis performed by Ramaswamy et al. in the paper cited above.\n",
"\n",
"## Get the Data\n",
"\n",
"Let's first get the data, which has been made available by the authors of the _Elements of Statistical Learning_ (Hastie, Tibshirani and Friedman, 2nd ed., 2009, Springer Verlag). \n",
"\n",
"In section 18.3, pp. 654–661 of this book, the authors re-analysed the dataset used by Ramaswamy et al. above and have made the formatted gene expression data available via the book's companion website.\n",
"\n",
"The dataset comprises $p=16,063$ gene expressions for $N=144$ tumour samples in the training set and $N=54$ tumour samples in the test set. The data describe 14 different types of cancer. Regarding this dataset, we can safely say that $p \\gg N$.\n",
"\n",
"We will now retrieve the data from the _Elements of Statistical Learning's_ website using `pandas` and `urllib2`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import urllib2\n",
"import csv\n",
"import pandas as pd\n",
"import numpy as np\n",
"from scipy import stats\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"url_X_train = 'http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/14cancer.xtrain'\n",
"url_y_train = 'http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/14cancer.ytrain'\n",
"url_X_test = 'http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/14cancer.xtest'\n",
"url_y_test = 'http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/14cancer.ytest'\n",
"\n",
"# We know there are 144 tumours in the training set and 54 is the test set, so let's make some column names:\n",
"column_names_train = [\"Tumour_Sample_\" + str(_) for _ in np.arange(144)+1]\n",
"column_names_test = [\"Tumour_Sample_\" + str(_) for _ in np.arange(54)+1]\n",
"\n",
"# We will use Pandas to read and properly format the text-based data.\n",
"# The delimiter is a regular expression to look for zero or more repetitions of whitespace (\\s).\n",
"X_train = pd.read_csv(url_X_train, delimiter='\\s*', engine='python', names=column_names_train)\n",
"X_test = pd.read_csv(url_X_test, delimiter='\\s*', engine='python', names=column_names_test)\n",
"\n",
"# Get the labels and store as a list. There are 14 different cancers in the dataset.\n",
"y_train = urllib2.urlopen(url_y_train).read().strip().split()\n",
"y_test = urllib2.urlopen(url_y_test).read().strip().split()\n",
"\n",
"# There are 14 different types of cancer, numbered 1 to 14, in the vectors y_test and y_train above. \n",
"# For visualising, you may find the names of the cancer types useful:\n",
"cancer_names_longform = [\"Breast adenocarcinoma\", \"Prostate adenocarcinoma\", \n",
" \"Lung adenocarcinoma\", \"Collerectal adenocarcinoma\", \n",
" \"Lymphoma\", \"Bladder transitional cell carcinoma\", \n",
" \"Melanoma\", \"Uterine adenocarcinoma\", \"Leukemia\", \n",
" \"Renal cell carcinoma\", \"Pancreatic adenocarcinoma\", \n",
" \"Ovarian adenocarcinoma\", \"Pleural mesothelioma\", \n",
" \"Central nervous system\"]\n",
"\n",
"cancer_names_shortform = [\"breast\", \"prostate\", \"lung\", \"collerectal\", \n",
" \"lymphoma\", \"bladder\", \"melanoma\", \n",
" \"uterus\", \"leukemia\", \"renal\", \"pancreas\", \n",
" \"ovary\", \"meso\", \"cns\"]\n",
"\n",
"# For testing you may want a merged training and test set.\n",
"# To save memory, these are commented out for now.\n",
"# X = pd.concat([X_train, X_test])\n",
"# y = y_train + y_test"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Exploration\n",
"\n",
"Now that the data have been loaded in `X_train`, `X_test`, `y_train`, and `y_test`, we can take a look a closer look at our data. Note: It is convention to use large `X` for data matrices, and small `y` for target vectors.\n",
"\n",
"As can be seen, in our training set we have $p=16,063$ genes/features and $N=144$ tumours/samples:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"X_train.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see a preview of the data, we can use the `head` and `tail` functions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"X_train.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"X_test.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see how the classes are distributed. First let's look at the number of unique values, which should equal 14, as we know we have 14 different cancer types:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"len(np.unique(y_train))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see how the cancer types are distrubuted using the `itemfreq` function of the SciPy `stats` package:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"stats.itemfreq(y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the `cancer_names_longform` list we declared above, we can print tumour frequencies nicely:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for freq in stats.itemfreq(y_train):\n",
" print \"%s samples appear %s times (shortform: %s).\" % (cancer_names_longform[int(freq[0])-1], \n",
" freq[1], \n",
" cancer_names_shortform[int(freq[0])-1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can take a quick look at some statistics values for each gene using the useful `describe` function (we use `transpose` to perform the analysis on a gene-by-gene basis). For example you may want to look at mean expression levels for each gene to see if they are over-expressed or under-expressed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Note: The transpose() function here does not permanently transpose the data stored in X_train.\n",
"X_train.transpose().describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"Now that we have read the data in a form which we can easily use, we move on to the deliverables that must be completed for Assignment 2."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Deliverables for Assignment 2\n",
"\n",
"## Clustering\n",
"\n",
"___Task___: Perform hierarchical clustering mimicking the approaches used by Ramaswamy et al. in their paper cited above. Plot a dendogram of your results (SciPy provides dendogram plotting functions, see <http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.dendrogram.html> for example) - or visualise your clustering in any other way you deem reasonable.\n",
"\n",
"Both SciKit Learn and SciPy offer hierarchical clustering algorithms, see <http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html> and <http://scikit-learn.org/stable/modules/clustering.html>.\n",
"\n",
"Notice that not all clustering techniques are useful for all purposes. In the case of this assignment, we know the number of clusters we are searching for - this is a requirement for certain clustering algorithms. Other algorithms may require parameters you might not immediately have available to you."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Your clustering code. Use as many cells as required, use Markdown cells to document where necessary."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Classification\n",
"\n",
"___Task___: Use Support Vector Machines and a One Vs. All (OVA) approach to replicate the results from the Ramaswamy et al. paper. \n",
"\n",
"SciKit Learn provides an `SVM` package for Support Vector Machines (see <http://scikit-learn.org/stable/modules/svm.html>). \n",
"\n",
"Visualise your results appropriately using plots and tables to describe classification results on the test set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Your classification code. Use as many cells as required, use Markdown cells to document where necessary."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Important Notes\n",
"\n",
"## Hints\n",
"\n",
"- You may find that scaling or normalising your data will yield better results. See the SciKit-Learn `scale` function: <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html>. \n",
"- The `preprocessing` module contains much other useful functionality, see: <http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing>.\n",
"- Cross validation train/test split indexes can be easily creating using SciKit Learn, see <http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cross_validation> \n",
"- Look up the dataset's analysis in _Elements of Statistical Learning_, specifically sections 18.3 (SVM One Vs. All, One Vs. One, etc.) and 13.3 (_k_-nearest neighbours).\n",
"\n",
"## Grading\n",
"\n",
"Your grade will depend on a) quality/inventiveness of approach b) quality of plots or visualisations.\n",
"\n",
"## Submission\n",
"\n",
"In Jupyter, click File -> Download As -> IPython Notebook (.ipynb) and send your completed notebook by email."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}