\documentclass[a4paper,11pt]{article}
%\documentclass[a4paper,11pt]{scrartcl}
\usepackage[a4paper,bindingoffset=0.2in,%
left=1in,right=1in,top=1in,bottom=1in,%
footskip=.25in]{geometry}
\usepackage[utf8]{inputenc}
\usepackage{listings}
\usepackage{color}
\usepackage[pdftex,pdfpagelabels,bookmarks,hyperindex,hyperfigures]{hyperref}
\hypersetup{
bookmarksnumbered=true,
bookmarksopen=true,
bookmarksopenlevel=1,
colorlinks=true,
allcolors=blue,
pdfstartview=Fit,
pdfpagemode=UseOutlines
}
\usepackage{hypcap}
\usepackage{url}
\usepackage{graphicx}
\renewcommand{\familydefault}{\sfdefault}
\definecolor{mygray}{gray}{0.95}
\lstset{
basicstyle=\footnotesize\ttfamily, % the size of the fonts that are used for the code
backgroundcolor=\color{mygray},
language=Python,
keywordstyle=\color{blue},
stringstyle=\color{red},
commentstyle=\color{cyan},
keepspaces=true,
columns=flexible,
tabsize=1,
showstringspaces=false
}
\title{Introduction to Python Programming and Project Design}
\author{Steven Lakin, DVM}
\date{2018 May 14}
\pdfinfo{%
/Title (Introduction to Python Programming and Project Design)
/Author (Steven Lakin)
}
\begin{document}
\maketitle
\pagebreak
\tableofcontents
\pagebreak
\section{The Zen of Python}
The Zen of Python, by Tim Peters \\
\\
Beautiful is better than ugly. \\
Explicit is better than implicit. \\
Simple is better than complex. \\
Complex is better than complicated. \\
Flat is better than nested. \\
Sparse is better than dense. \\
Readability counts. \\
Special cases aren't special enough to break the rules. \\
Although practicality beats purity. \\
Errors should never pass silently. \\
Unless explicitly silenced. \\
In the face of ambiguity, refuse the temptation to guess. \\
There should be one-- and preferably only one --obvious way to do it. \\
Although that way may not be obvious at first unless you're Dutch. \\
Now is better than never. \\
Although never is often better than *right* now. \\
If the implementation is hard to explain, it's a bad idea. \\
If the implementation is easy to explain, it may be a good idea. \\
Namespaces are one honking great idea -- let's do more of those!
\pagebreak
\section{Introduction}
\par
With the advent of big data and the need for automation of repetitive tasks, basic programming
skills are becoming a necessity for working in many fields. However, much of the instructional
material on programming is intended to build a very solid foundation in programming basics. For our
purposes, we can skip the details of rudimentary programming and take a more hands-on approach to
basic scripting, since this is much of what we as non-computer scientists will be doing. We will be
using the Python language, since it is an intuitive and powerful programming language. \par
Python, named after the British comedy troupe Monty Python, was built with the intention of being easy to use,
quick to learn, and concise to write; code that follows the ideals of the language is often described as
``Pythonic.'' Python is what is called a ``high-level''
language; much of the clunky syntax of other languages (Java, C, R) is absent from Python.
There are no semicolons at the ends of lines, no braces around blocks of code, and no need to
declare variables before use. You will find that this makes Python a fast language to program in, since
we don't have to worry as much about typing and checking syntax. Much of the baseline ``work'' of
lower-level languages has been built into Python, which allows you to use Python's intuitive
structures and do your work as easily and quickly as possible. \par
In this workshop, we will be focusing on applying Python to biological problems. Much of this
“patchwork” programming for solving small problems is well-suited to short segments of code called
scripts. A script is simply a file of code that does something. Scripts can be combined to make
programs, packages, modules, and generally “software,” all of which are generally the same thing with
slightly different semantics in different languages. We will mostly be scripting in this class, though we
will work toward making packages and learning the basics of building more advanced code structures. \par
The workshop will be divided into two segments: the first will be a lecture on a programming
topic, and the second will be applying those concepts to miniature problems on Rosalind, named after
Rosalind Franklin, whose work in X-ray crystallography contributed significantly to the discovery of
DNA structure. These problems are bioinformatics related, which is a field that combines biology,
computer science, mathematics, and statistics to solve biological problems involving large data sets.
You will need to make an account on Rosalind to track your progress.
\pagebreak
\section{Obtaining Help}
\textbf{Online resources} \\
Every programmer needs a reference source for learning new skills in a language or to troubleshoot
problems encountered during the programming process. There are many sources of this information,
including physical textbooks or manuals. However, the vast majority of programmers these days utilize
online resources that have been curated and developed by the community. Probably the most popular and
frequently utilized online resource is \href{https://stackoverflow.com}{Stack Overflow}, which is part
of the popular \href{https://stackexchange.com/}{Stack Exchange} system of websites for community-based
question and answer forums. I would recommend Googling ``Stack Exchange'' plus whatever keywords or error
messages you have in order to determine the best path forward in your code. Very frequently, someone
has encountered the same question or problem as you, and that question will already have been answered
on Stack Overflow. However, if you do ever encounter a novel problem (it has only happened to me twice
in my entire programming career), then you can submit a new question on Stack Overflow and hopefully get
the problem solved. Alternatively, the online Python manual often has answers about the language itself. \\
\\
\textbf{Additional Practice} \\
As much as workshops help to learn fundamentals of programming, there is no substitute for continual
practice. Many online websites are available that offer code challenges. If you're looking for
more opportunities to practice coding in Python, check out
\href{https://www.hackerrank.com/domains/python}{HackerRank's Python challenges} or other similar
websites that offer miniature coding sessions with an in-browser interpreter. Alternatively,
try to apply the concepts we're learning here to your own work. \\
\\
\textbf{Reference Resources} \\
I have tried to provide a reference section for commonly used types, objects, and tasks at the end of
these notes. It is not by any means comprehensive, but it might provide a fast reference if
that is helpful to you. The most comprehensive reference will always be the
\href{https://docs.python.org/3/}{Python documentation},
although it can be difficult to read at times if your grasp of the language isn't extensive.
\pagebreak
\section{Python Fundamentals}
\subsection{Installing Python}
\textbf{Python Versions} \\
There are two major versions of Python: Python 2 and Python 3. These days, virtually every application should be
compatible with Python 3, so it's recommended that you download the most recent version of Python 3. The difference
between the two is minimal from a basic user perspective; however, some of the ``grammar'' or syntax differs
between them. This workshop will be conducted as if everyone has Python 3. \\
\\
\textbf{Python Download} \\
You can download Python for Windows and MacOS by visiting the \href{https://www.python.org/downloads}{Python download page}.
Linux users can download Python through their appropriate package manager. I will assume if you're using Linux that you
know the basics of the operating system.\\
\\
\textbf{Integrated Development Environments (IDEs)} \\
When we get to the point where we are writing scripts, you may find it helpful to download specialized software called an
IDE, which is essentially a text editor that knows the Python syntax, can highlight regions of code, helps with code
completion, and can help debug programs. There is a wide array of IDEs for Python, some more feature-complete than others.
I prefer to use a
more complete IDE called PyCharm (JetBrains); however, some people prefer lighter IDEs, and others prefer to use a standard
text editor like Notepad that has no coding-specific features. You can research a few of these IDEs and decide what is
the right fit for your coding experience. \\
\\
\textbf{Python Packages} \\
Python has both core functionality and the ability to be extended for other purposes through the development of packages.
Packages are collections of modules that serve a specific purpose, such as helping to scrape websites, download files, multiply matrices,
and visualize data through graphing/plotting. You can obtain these packages through Python's package manager, called pip.
A basic tutorial on how to use pip to obtain packages can be found in the \href{https://packaging.python.org/tutorials/installing-packages/}{Python Packaging User Guide}.
\pagebreak
\subsection{Variables and Basic Operations}
As a Python programmer, you'll encounter two situations commonly while coding: testing small bits of code in real time,
and assembling larger pieces of code into a program or script. For testing code in real time, you'll use the \textbf{Python
Interpreter}. To get to the Python Interpreter, double click on the Python icon or open a command-line terminal and
type python in lower-case. The interpreter will have three greater-than signs as its prompt, like so:
\vspace{3mm}
\begin{lstlisting}
>>>
\end{lstlisting}
\vspace{3mm}
The Python Interpreter has all of the functionality that we will be using later, however it is difficult to program
larger pieces of code in the Interpreter, so we will be switching to scripting later on. The Interpreter can
perform basic arithmetic such as a calculator would do; simply type the equation and press enter:
\vspace{3mm}
\begin{lstlisting}
>>> 3 - 5
-2
>>> 3 + 5
8
>>> 3 * 5
15
>>> 3 / 2
1.5
>>> 3e5
300000.0
\end{lstlisting}
\vspace{3mm}
Notice that every line of input starts with either three greater-than signs or, for the continuation lines of a multi-line statement, three periods, while output
appears on a line with no prompt. While performing basic arithmetic operations is a nice feature of Python, we could simply use a
calculator for this. Python's functionality and power come from being able to store values in \textbf{variables}:
\vspace{3mm}
\begin{lstlisting}
>>> a = 3
>>> b = 5
>>> a + b
8
>>> a * b
15
>>> a * 5
15
\end{lstlisting}
\vspace{3mm}
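Beyond the four operations shown so far, Python has a few more arithmetic operators that come up often. The short
sketch below is an illustration only; the variable names are arbitrary, and it is written as a small script rather
than typed at the prompt, so each result is printed explicitly:
\vspace{3mm}
\begin{lstlisting}
a = 7
b = 2

quotient = a / b     # true division always gives a float: 3.5
whole = a // b       # floor division drops the remainder: 3
remainder = a % b    # modulo gives the remainder: 1
power = a ** b       # exponentiation: 49

print(quotient, whole, remainder, power)

# Variables can be reassigned; the old value is simply replaced.
a = a + 1
print(a)             # 8
\end{lstlisting}
\vspace{3mm}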
For those interested in proper programming terminology, we refer to a variable being assigned as a \textbf{left-hand value} or l-value,
since it is on the left-hand side of the equals sign, while \textbf{right-hand values} or r-values represent the values that the variable
will take. Remember that right-hand values \textbf{get assigned to} left-hand values; you cannot interchange the two. \par
In Python, we are not limited to numbers. Python, being a simple and elegant language, has only a few basic \textbf{types} of
values. We have already encountered the \textbf{Integer} type and the \textbf{Float} type: integers are
whole numbers, and floats are numbers with a decimal point.
A commonly used third type is the \textbf{String} type, which holds text. Less commonly used but important is the
\textbf{Boolean} type, which takes one of two values: True or False (1 and 0, respectively, in binary).
Some operations handle mixed types for you; for instance, the print function converts each value to text before displaying it:
\vspace{3mm}
\begin{lstlisting}
>>> a = 3
>>> b = "1"
>>> c = "String Example"
>>> print(a, b)
3 1
>>> c + b
'String Example1'
\end{lstlisting}
\vspace{3mm}
However, notice that Python will not automatically convert between strings and numbers when you combine them; to do that you'll have to \textbf{coerce}
a value that you know is compatible with the new type into that type, like so:
\vspace{3mm}
\begin{lstlisting}
>>> a = "1"
>>> b = 3
>>> a + str(b)
'13'
>>> int(a) + b
4
>>> float(a) + b
4.0
>>> bool(a)
True
\end{lstlisting}
\vspace{3mm}
Notice that the addition sign behaves differently depending on the type with which it is used: numbers are
added together, while strings are \textbf{concatenated} in the order in which they were written. Other functionality
built into Python includes the native \textbf{functions} that you can use for common operations like
printing text to the terminal:
\vspace{3mm}
\begin{lstlisting}
>>> myvariable = "Hello World"
>>> print(myvariable)
Hello World
\end{lstlisting}
\vspace{3mm}
Functions take variables or values as input and are used with parentheses as above. Next, we will learn
how to create our own functions as well, which is the foundation of programming.
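\par
Before moving on, the type rules above can be checked directly. The following is a minimal sketch, not part of the
original examples: the variable names are made up, and the \texttt{try}/\texttt{except} lines (not covered in these
notes so far) are used only so that the script keeps running after an error is raised:
\vspace{3mm}
\begin{lstlisting}
whole = 3
word = "3"

# type() reports the type of any value.
print(type(whole))    # <class 'int'>
print(type(word))     # <class 'str'>

# Adding a string and an integer is refused rather than guessed at.
try:
    result = word + whole
except TypeError as error:
    result = None
    print("TypeError:", error)

# Explicit coercion states which meaning we want.
as_number = int(word) + whole    # 6
as_text = word + str(whole)      # '33'
print(as_number, as_text)

# Coercion fails loudly when the value does not fit the new type.
try:
    int("three")
except ValueError as error:
    print("ValueError:", error)
\end{lstlisting}
\vspace{3mm}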
\pagebreak
\subsection{Functions}
Functions are the building blocks of programming; each function should have a single purpose, be specific,
and have an easily-interpretable and appropriate name. The parts of a function include the \textbf{name},
the \textbf{arguments}, the \textbf{function body} where the code resides, and an optional \textbf{return value}.
Functions in Python are \textbf{defined} like so:
\vspace{3mm}
\begin{lstlisting}
>>> def function_name(argument1, argument2):
...     print(argument1)
...     print(argument2)
...     return argument1 + argument2
...
\end{lstlisting}
\vspace{3mm}
The above code defines the function called \texttt{function\_name}, which takes in two arguments, ``argument1'' and
``argument2.'' It first prints each argument on a separate line, then returns their sum. In the Interpreter, a return
value that is not assigned to anything is echoed to the screen. Alternatively, we can treat return values as
right-hand values and assign them to a variable:
\vspace{3mm}
\begin{lstlisting}
>>> function_name(1, 2)
1
2
3
>>> newvalue = function_name(1, 2)
1
2
>>> print(newvalue)
3
\end{lstlisting}
\vspace{3mm}
Notice that in our function definition, the function body (including the return statement) is indented; in
Python, all code that falls into a \textbf{code block} such as a function needs to be indented by the same number
of spaces or tabs. The choice between spaces and tabs is a matter of style but is hotly debated amongst Python
programmers. I personally prefer tabs, as they take fewer keystrokes, although the official style guide (PEP 8) recommends
four spaces per level; you can use what you wish as long as you are consistent, since Python 3 raises an error when tabs and spaces
are mixed inconsistently. Each level of indentation indicates another \textbf{nested} code block. We
will clarify this in the following sections once we have the ability to nest code blocks. \par
Functions can take two kinds of arguments: \textbf{positional} and \textbf{optional} arguments. Positional
arguments are mandatory when using the function and are always listed before optional arguments. Optional
arguments have a default value explicitly specified in the function definition and, when passed by name, can be given in any order.
There are various reasons to use positional versus optional arguments; perhaps your function will not work
without a certain number of arguments, like the function we made above. Alternatively, if you want to specify
a default value but still allow other values, you would use an optional argument. Here is a function definition
that includes both positional and optional arguments:
\vspace{3mm}
\begin{lstlisting}
>>> def add_and_multiply(summand1, summand2, coefficient=1):
...     return coefficient * (summand1 + summand2)
...
>>> add_and_multiply(2, 3)
5
>>> add_and_multiply(2, 3, 3)
15
>>> add_and_multiply(2, 3, coefficient=3)
15
>>> add_and_multiply(summand1=2, summand2=3, coefficient=3)
15
\end{lstlisting}
\vspace{3mm}
Arguments can be passed by name or by position; however, if you are striving for legible code, it
usually pays off to spell out the argument names explicitly so others (or you several years later) can read
the code and understand your intentions. \par
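As a small illustration of these rules, the sketch below repeats the definition of \texttt{add\_and\_multiply} in
script form, passes every argument by name in a shuffled order, and then shows what happens when a mandatory
argument is left out. The \texttt{try}/\texttt{except} lines are an assumption of this sketch and are there only to
catch the error so the script can continue:
\vspace{3mm}
\begin{lstlisting}
def add_and_multiply(summand1, summand2, coefficient=1):
    """Return coefficient * (summand1 + summand2)."""
    return coefficient * (summand1 + summand2)

# Named arguments may be given in any order.
same_result = add_and_multiply(coefficient=3, summand2=3, summand1=2)
print(same_result)    # 15

# Leaving out a mandatory argument raises a TypeError.
try:
    add_and_multiply(2)
except TypeError as error:
    missing_argument_error = str(error)
    print("TypeError:", missing_argument_error)
\end{lstlisting}
\vspace{3mm}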
With functions and variables, you have the most basic use cases of Python covered and can begin performing
operations on input values. However, often times we want to store larger amounts of data than single values,
and often those values need to be stored in a way that relates them to one another or orders them in a
certain way. To do this, we will need to learn how to use the \textbf{data structures} that are inherently
available in Python. However, we will first practice what we have learned so far on one of Rosalind's
introductory bioinformatics problems: counting nucleotides. \\
\\
\textbf{Rosalind Problem 1: Counting Nucleotides} \\
\href{http://rosalind.info/problems/dna/?class=246}{Click here} to visit this problem on Rosalind \\
\\
Often we want to obtain information from string-based data, such as words, paragraphs, or in this case
nucleotides from a DNA sequence. In this problem, you will use functions built into the \textbf{String type}
(functions attached to a type are called \textbf{methods}) to count the number of adenine, cytosine, guanine, and thymine nucleotides in a DNA string of variable length.
To do this, make use of the \textbf{count} method of the String type, and define a function
that prints out the number of A, C, G, and T nucleotides in that order:
\vspace{3mm}
\begin{lstlisting}
>>> dna = "ACGTGTGTGCCCGTGA"
>>> dna.count("A")
2
>>> dna.count("C")
4
\end{lstlisting}
\vspace{3mm}
\pagebreak
\subsection{Data Structures}
Data structures are used to store information in a group or organized context; each kind of data structure was
designed in Python with a certain use in mind, so each has its advantages and disadvantages. One of the most
common uses of data structures in coding is to store multiple values in a specific order. The ordered data
structures in Python include the \textbf{list} and the \textbf{array}. The primary difference between lists
and arrays is that values within a list can be modified after creation while those in an array cannot; this can translate
into a modest speed and memory advantage when using arrays, since the computer knows they won't be modified
after the fact. However, for the majority of applications, lists will be what you choose to use, as often
we want to retain the ability to modify the values of the list. \par
Lists in Python are defined using square brackets [ ], and each element is associated with a position in the
list. These positions \textbf{start at 0} for the first element; in computer science, such a language is referred to
as \textbf{zero-indexed}. Here is an example of a list and how we can access each element of the
list using zero-indexed integers:
\vspace{3mm}
\begin{lstlisting}
>>> mylist = ["one", "two", "three"]
>>> mylist[0]
'one'
>>> mylist[1]
'two'
>>> mylist[2]
'three'
\end{lstlisting}
\vspace{3mm}
Here, the variable \texttt{mylist} stores the list, and we access each element of the list using square brackets immediately
following the variable name. Accessing a single element this way is called \textbf{indexing}. The same brackets can
do more than just access a single value at a time; we can obtain multiple values in a chunk by \textbf{slicing} the list with the
slicing notation: \\
\\
\texttt{mylist[start:stop:by]}
\vspace{3mm}
\begin{lstlisting}
>>> mylist = [1, 2, 3, 4, 5]
>>> mylist[0:3]
[1, 2, 3]
>>> mylist[0:5]
[1, 2, 3, 4, 5]
>>> mylist[3:5]
[4, 5]
>>> mylist[::-1]
[5, 4, 3, 2, 1]
>>> mylist[::2]
[1, 3, 5]
>>> mylist[::3]
[1, 4]
\end{lstlisting}
\vspace{3mm}
Here, we access chunks of the list by using the start:stop notation, where the start is the index of the value we
want to start at, and the stop is the index \textbf{of the element to the right} of where we want to stop. It may
be easier to visualize it like so:
\[
[\;{}_{0}\,1,\;{}_{1}\,2,\;{}_{2}\,3,\;{}_{3}\,4,\;{}_{4}\,5\;{}_{5}\;]
\]
In the example above, the subscripts represent the slicing index positions, so to obtain the numbers 1, 2, 3, we
would slice from 0 to 3. The optional third field in the slicing notation is the \textbf{by} field (called the \textbf{step} in the Python documentation). This
returns \textbf{every Xth element}, so for instance if we wanted every other number in the list, we would
slice the whole list by 2, as in the code block above: \texttt{mylist[::2]}. Leaving a start or stop field empty is shorthand
notation for ``the beginning of the list'' and ``the end of the list,'' respectively. If you provide a negative number
in the by position, the list is traversed from the end toward the beginning, with the same step size otherwise. Finally,
we can add elements to the list using two different methods: the \textbf{addition operator} and the \textbf{append function}:
\vspace{3mm}
\begin{lstlisting}
>>> mylist = []
>>> mylist + [0, 1]
[0, 1]
>>> mylist += [0, 1]
>>> mylist
[0, 1]
>>> mylist.append(2)
>>> mylist
[0, 1, 2]
\end{lstlisting}
\vspace{3mm}
Pay close attention to the second and fourth lines of code above: in the second line, with the addition operator alone,
we simply printed out the list with 0 and 1 added, but we didn't permanently modify the list.
To permanently add the elements to the end of the list, we had to use the addition operator in combination with
the equals sign operator, a construction that is commonly referred to as an \textbf{increment} operation (Python's documentation calls it \textbf{augmented assignment}). For
completeness, we will digress momentarily to show that this increment operation can also be performed on other
data types and structures, such as integers and arrays:
\vspace{3mm}
\begin{lstlisting}
>>> a = 1
>>> a += 1
>>> print(a)
2
>>> myarray = ("one", "two")
>>> myarray += ("three", )
>>> myarray
('one', 'two', 'three')
\end{lstlisting}
\vspace{3mm}
The \textbf{array} data structure is written with parentheses as seen above and operates similarly to the list;
however, we cannot modify
the values of an array (the += above built a new array rather than changing the old one). The property of lists that allows them to be modified is called \textbf{mutability},
and we call lists \textbf{mutable} while arrays are \textbf{immutable}:
\vspace{3mm}
\begin{lstlisting}
>>> mylist = [0, 1, 2]
>>> myarray = (0, 1, 2)
>>> mylist[1] = 3
>>> mylist
[0, 3, 2]
>>> myarray[1] = 3
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
\end{lstlisting}
\vspace{3mm}
In Python, arrays are technically called \textbf{tuples}; many other languages
use the word array for their basic ordered data structure, so we will use that terminology here for convenience. Be aware that Python's
standard library also contains a separate \texttt{array} module, which is not what we mean here. There
are physical differences in how lists and arrays interface with your computer that give them the
properties that define them; we will cover these details later in a section that discusses what
a computer actually is and how basic operations are performed. For now, we still have two important
data structures to cover: the \textbf{dictionary} and the \textbf{set}, which are the two native
\textbf{unordered} data structures. \par
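To recap the ordered structures before moving on, here is a minimal sketch (with made-up variable names) showing
that slicing returns a new list, that negative positions count from the end, and that an array can be copied into a
list with the built-in \texttt{list} function when an editable version is needed:
\vspace{3mm}
\begin{lstlisting}
letters = ["a", "b", "c", "d", "e"]

last = letters[-1]       # negative positions count from the end: 'e'
middle = letters[1:4]    # ['b', 'c', 'd']
print(last, middle)

# A slice is a new list, so changing it leaves the original untouched.
middle[0] = "B"
print(middle)            # ['B', 'c', 'd']
print(letters)           # ['a', 'b', 'c', 'd', 'e']

# The same slicing notation works on an array (tuple).
fixed = ("a", "b", "c")
print(fixed[0:2])        # ('a', 'b')

# list() copies the values into a new, mutable list.
editable = list(fixed)
editable[0] = "z"
print(editable)          # ['z', 'b', 'c']
print(fixed)             # ('a', 'b', 'c')
\end{lstlisting}
\vspace{3mm}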
When the order of the values in a group does not matter, we can use the unordered data structures (as of Python 3.7, dictionaries
do remember the order in which keys were inserted, but values are still looked up by key rather than by position). The first,
called a \textbf{set}, can be thought of the same way as the mathematical definition of a set: a group
of things, whether that be strings, integers, floats, or even arrays. Lists cannot be stored in a set, because set elements must be hashable, which in practice means immutable.
A set is useful because \textbf{checking for membership is very fast}. We will digress into some basics
of computer science algorithms later in the course; for now, know that the use case of a set is
\textbf{to collect items} and \textbf{to check if an item is in the set}. An empty set is created explicitly
by name with \texttt{set()}, and we add elements to it with the \textbf{add} function. We can then check for
membership by asking if a value is in the set:
\vspace{3mm}
\begin{lstlisting}
>>> myset = set()
>>> myset
set()
>>> myset.add(1)
>>> myset.add(2)
>>> myset
{1, 2}
>>> myset.add(1)
>>> myset
{1, 2}
>>> 1 in myset
True
>>> 3 in myset
False
\end{lstlisting}
\vspace{3mm}
Notice that sets don't store duplicate values; when we try to add a duplicate value to the set, the set
automatically checks for membership first and only adds the value if the value is not yet present in
the set. This is useful for collecting unique values. While sets are quite useful, a more common
data structure that we will need to use is called the \textbf{dictionary}. \par
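As a quick illustration of collecting unique values, the sketch below builds sets directly from existing data
instead of adding elements one at a time. The DNA string and the list of readings are invented for the example, and
the \texttt{try}/\texttt{except} lines are there only to catch the expected error:
\vspace{3mm}
\begin{lstlisting}
dna = "ACGTGTGTGCCCGTGA"

# set() accepts any sequence and keeps one copy of each element.
nucleotides = set(dna)
print(len(nucleotides))           # 4

readings = [3, 1, 3, 2, 1]
unique_readings = set(readings)
print(sorted(unique_readings))    # [1, 2, 3]

# Elements must be immutable: a tuple is accepted, a list is not.
pairs = set()
pairs.add(("A", "T"))
try:
    pairs.add(["G", "C"])
except TypeError as error:
    print("TypeError:", error)
print(len(pairs))                 # 1
\end{lstlisting}
\vspace{3mm}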
Dictionaries \textbf{relate values together} and are often called \textbf{associative arrays} in other
programming languages, since they associate a \textbf{key} with a \textbf{value}. These associations
are often called \textbf{key-value pairs} and are extraordinarily common in basic programming tasks.
Let's say we have names and phone numbers and we would like to relate them together; we can do this
by creating a dictionary. Note that dictionary keys must be unique, while values can be duplicated.
We commonly add to dictionaries using two methods: the \textbf{index operator},
and the \textbf{update} function; however, there are more ways to add to dictionaries if you're interested
in reading the Python documentation.
\vspace{3mm}
\begin{lstlisting}
>>> mydict = {"John Doe": "970-555-0001", "Jane Doe": "970-555-0002"}
>>> mydict
{'John Doe': '970-555-0001', 'Jane Doe': '970-555-0002'}
>>> mydict["James Smith"] = "970-555-1234"
>>> mydict
{'John Doe': '970-555-0001',
'Jane Doe': '970-555-0002',
'James Smith': '970-555-1234'}
>>> mydict.update({"Jane Smith": "970-555-1235"})
>>> mydict
{'John Doe': '970-555-0001',
'Jane Doe': '970-555-0002',
'James Smith': '970-555-1234',
'Jane Smith': '970-555-1235'}
\end{lstlisting}
\vspace{3mm}
Notice that the keys are displayed left of the colon and the values with which they are associated are
displayed on the right side of the colon. We can also access the keys and values separately
with the \textbf{keys} and \textbf{values} functions, respectively; these return list-like views that can be converted with \texttt{list()} if a true list is needed. Checking for membership is also
useful, and by default we check for membership in the keys, since they are unique; however, we
can also force Python to check for membership in the values by explicitly stating that:
\vspace{3mm}
\begin{lstlisting}
>>> mydict = {1: "red", 2: "blue", 3:"green"}
>>> mydict.keys()
dict_keys([1, 2, 3])
>>> mydict.values()
dict_values(['red', 'blue', 'green'])
>>> 1 in mydict
True
>>> 4 in mydict
False
>>> "red" in mydict
False
>>> "red" in mydict.values()
True
\end{lstlisting}
\vspace{3mm}
There are many more aspects to dictionaries that we will not cover here for brevity; I suggest if you're
interested in the versatility of dictionaries to read the Python documentation on them in detail. For now,
simply know that dictionaries store key-value pairs, and that such a data structure is highly useful in
many programming contexts. We will see concrete examples of why this is the case in the Rosalind problems. \par
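A few everyday dictionary operations that the examples above do not show are sketched here: looking up a value by
its key, supplying a fallback for a missing key with \texttt{get}, overwriting an existing key, and deleting an
entry. The nucleotide-complement dictionary is an invented example, not taken from the Rosalind problems:
\vspace{3mm}
\begin{lstlisting}
complement = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Square brackets look up the value stored under a key.
print(complement["A"])    # T

# Looking up a missing key with square brackets raises a KeyError.
try:
    complement["N"]
except KeyError:
    print("N is not a key")

# get() returns a fallback value instead of raising an error.
unknown = complement.get("N", "?")
print(unknown)            # ?

# Assigning to an existing key replaces its value; keys stay unique.
phone = {"John Doe": "970-555-0001"}
phone["John Doe"] = "970-555-9999"
print(len(phone))         # 1

# del removes a key together with its value.
del phone["John Doe"]
print(phone)              # {}
\end{lstlisting}
\vspace{3mm}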
As an aside, there are many other data structures available in other Python packages, but with these four
core data structures, you're ready to start learning the truly useful part of programming: \textbf{logic} and
\textbf{flow}. There will be no Rosalind problem for this section, since we have to learn the next topics
to do anything intermediate in programming.
\pagebreak
\section{Logic and Control of Flow}
\subsection{Conditional Operators}
Logic is a foundational concept in computer science and programming; when we encounter certain values,
we often want to do something or store those values differently depending on some condition. Perhaps
we see a value that is interesting, so we want to keep it, however we don't want to keep uninteresting
values. Perhaps we only want to add an element to a list if it isn't already in the list (though
if you remember last section, you could use a set for this instead!). Logic in Python is very
simple compared to other languages, and the developers of Python have tried to make conditional logic
statements match what you are thinking in your head when you're working on a problem. They also
support the standard conditional statements and operators found in other languages, if you're not comfortable
using the Pythonic approach. \par
Perhaps the simplest conditional statement is the \textbf{in} statement; this checks whether a value,
or even another data structure, is a member of a data structure. The \textbf{in} statement has
some computational cost depending on the data structure, but it can be used with almost anything:
\vspace{3mm}
\begin{lstlisting}
>>> mystring = "ACCGTG"
>>> "A" in mystring
True
>>> mylist = [1, 2, 3]
>>> 2 in mylist
True
>>> 2 not in mylist
False
\end{lstlisting}
\vspace{3mm}
Notice two aspects about the above code: to check for membership, we use \textbf{in}, and to check for the
opposite condition, we simply add a \textbf{not} in front of the \textbf{in} statement. Secondly,
\textbf{conditionals return boolean values}, either True or False. These values can then be used to
create logic that performs certain code on some values and not others. If anyone is nerdy like me
about computer science topics and wants to know where this idea originated, you can follow
\href{https://en.wikipedia.org/wiki/Logic_gate}{this link} to learn more about logic gates. \par
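To illustrate the earlier point about computational cost, here is a minimal sketch of the same membership test applied to a list, a set, and a dictionary. The variable names and data are made up for this example. A list is scanned one element at a time, while sets and dictionaries use hashing, so their lookups stay fast as they grow. Note also that membership in a dictionary checks the keys, not the values.

```python
# Hypothetical example data; the names are illustrative only
bases_list = ["A", "C", "G", "T"]
bases_set = set(bases_list)
base_names = {"A": "adenine", "C": "cytosine", "G": "guanine", "T": "thymine"}

# A list is searched one element at a time
in_list = "G" in bases_list

# Sets and dictionaries use hashing, so lookups stay fast as they grow
in_set = "G" in bases_set

# For a dictionary, "in" checks the keys, not the values
key_found = "A" in base_names
value_found = "adenine" in base_names
value_found_in_values = "adenine" in base_names.values()

print(in_list, in_set, key_found, value_found, value_found_in_values)
```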
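The following are the \textbf{Python conditional operators} that we can use to \textbf{compare}
values to one another: \textbf{is, ==, is not, !=, $>$, $>=$, $<$, $<=$}. Note that \textbf{is} and
\textbf{is not} test identity (whether two names refer to the very same object), whereas \textbf{==} and
\textbf{!=} compare values; use \textbf{==} and \textbf{!=} when comparing numbers or strings.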
\vspace{3mm}
\begin{lstlisting}
>>> 2 > 3
False
>>> 3 >= 3
True
>>> x = None
>>> x is None
True
>>> x is not None
False
>>> 2 == 2
True
>>> 2 != 2
False
\end{lstlisting}
\vspace{3mm}
Again, each of these operators returns a boolean value; these boolean values will often be used
by the statements in the next section, called \textbf{if statements}, which allow you to
conditionally perform certain blocks of code.
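The difference between comparing values and comparing identity is easy to miss, so here is a small sketch of it. The variable names are made up for this example. Two lists with the same contents are equal under \textbf{==}, but they are still two separate objects, so \textbf{is} reports False unless both names point at the same list.

```python
# Two separate lists that happen to hold the same values
first = [1, 2, 3]
second = [1, 2, 3]
alias = first  # a second name for the very same list object

same_value = first == second      # True: the contents are equal
same_object = first is second     # False: they are two distinct objects
alias_is_first = alias is first   # True: both names refer to one object

print(same_value, same_object, alias_is_first)
```

In practice, this suggests reserving \textbf{is} for checks such as \textbf{x is None} and using \textbf{==} for numbers and strings.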
\pagebreak
\subsection{If Statements}
If statements begin with an \textbf{if} followed by a conditional expression and a colon, have an indented code body
similar to a function, and can optionally include additional conditionals with one or more
\textbf{elif} (else if) statements, optionally ending with an \textbf{else} statement. Here is an example:
\vspace{3mm}
\begin{lstlisting}
>>> a = 4
>>> b = 3
>>> if a > 4:
...     print("a greater than 4")
... elif a > b:
...     print("a greater than b")
... elif b > a:
...     print("b greater than a")
... else:
...     print("Don't know")
...
a greater than b
\end{lstlisting}
\vspace{3mm}
Notice how we maintain the use of indentation to describe blocks of code for the body of statements,
exactly the same as for function definitions. There can be as many lines of code as are needed between
these statements, as long as they all have the same indentation. You can also nest if statements with
an additional level of indentation:
\vspace{3mm}
\begin{lstlisting}
>>> a = 4
>>> b = 3
>>> if a == 4:
...     if b == 3:
...         print("a = 4 and b = 3")
...
a = 4 and b = 3
>>> if a == 4 and b == 3:
...     print("a = 4 and b = 3")
...
a = 4 and b = 3
\end{lstlisting}
\vspace{3mm}
The \textbf{and} conjunction can be used to chain together conditionals, or you can simply nest them;
the two forms are equivalent here, but programmers usually consider nesting compound logicals to
be redundant and unnecessary. To reiterate, if-elif-else statements must begin with an if statement,
elif and else are optional, and there can be multiple elifs but only one else. We will be using
logicals quite frequently in our future Rosalind challenges. Now, onto the workhorse of programming:
loops.
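As a brief sketch of how these pieces fit together inside a function, the example below wraps an if-elif-else chain and an \textbf{and} conditional in two small helpers. The function names are hypothetical and chosen only for illustration.

```python
def describe_pair(a, b):
    """Return a short description comparing a and b (illustrative helper)."""
    if a > b:
        return "a greater than b"
    elif b > a:
        return "b greater than a"
    else:
        # Reached only when neither comparison above is True
        return "a equals b"

def both_positive(a, b):
    # "and" requires both conditions to be True
    return a > 0 and b > 0

print(describe_pair(4, 3))
print(describe_pair(3, 3))
print(both_positive(4, -1))
```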
\pagebreak
\subsection{For Loops and Comprehension}
The reason we store values in data structures is typically to use them later, and when we're dealing
with thousands or millions of values, we can't reasonably access them one at a time. Loops allow us
to \textbf{iterate} over a large number of values \textit{very quickly} and perform some manipulation
on them. Perhaps it will be to generate and fill new lists or to read in a large file and iterate
over its lines looking for or storing information as we go, but loops will always be the solution
to repetitive tasks in programming. The flow statement for loops is \textbf{for}, used like so:
\vspace{3mm}
\begin{lstlisting}
>>> for i in range(1, 10):
...     print(i)
...
1
2
3
4
5
6
7
8
9
\end{lstlisting}
\vspace{3mm}
The overall construction for a \textbf{for loop} is ``for variable in object/iterator:'', where
we are either looping over some data, perhaps stored in a list or dictionary, or we are generating
new data in the loop, such as in the example above. The \textbf{range} function creates a sequence of
integers that begins at the start argument and ends just before the stop argument, in this case 1 through 9;
we can then use this object as the basis for our for loop. We technically don't have to use the variable
(in this case i); if we simply wanted to print ``Hello world'' 9 times, we could do this instead:
\vspace{3mm}
\begin{lstlisting}
>>> for i in range(1, 10):
...     print("Hello world")
...
Hello world
Hello world
Hello world
Hello world
Hello world
Hello world
Hello world
Hello world
Hello world
\end{lstlisting}
\vspace{3mm}
Of course, the most commonly used construction of loops is to iterate over some data structure, such
as a list or dictionary. Here, we will create a list with several elements, iterate over them, and
find a specific value and print it out to the Interpreter.
\vspace{3mm}
\begin{lstlisting}
>>> mylist = ["Arthur", "Lancelot", "Gawain", "Galahad"]
>>> for knight in mylist:
...     if knight == "Arthur" or knight == "Galahad":
...         print(knight)
...
Arthur
Galahad
\end{lstlisting}
\vspace{3mm}
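Looping over a dictionary works in much the same way. The sketch below uses made-up data and assumes we want both the key and the value on each iteration, which the dictionary's \textbf{items} method provides as a pair.

```python
# Hypothetical data: knights and their quests (made up for illustration)
quests = {"Arthur": "the Grail", "Lancelot": "rescue", "Galahad": "the Grail"}

grail_seekers = []
# .items() yields (key, value) pairs, one per iteration
for knight, quest in quests.items():
    if quest == "the Grail":
        grail_seekers.append(knight)

print(grail_seekers)
```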
You can start to see the general pattern of programming at this point: we have some data, store that data
in the appropriate data structure, manipulate that data depending on our task, and output something
that we need. Generally, it is best practice to write \textbf{functions} that perform a specific task;
we will cover coding best practices later, but this practice helps us to compartmentalize our
operations and makes our code much more legible when you need to revisit it or share it later.
Here is an example of a function that repeats text a certain number of times depending on its
arguments:
\vspace{3mm}
\begin{lstlisting}
>>> def text_repeat(text, n_times):
...     for i in range(n_times):
...         print(text)
...
>>> text_repeat("Hello world", 3)
Hello world
Hello world
Hello world
\end{lstlisting}
\vspace{3mm}
In addition to these standard loop constructions, Python has sought to be elegant in its ability
to generate data objects on the fly, so anytime we're creating a list, dictionary, or set, we also
have the option of looping using \textbf{comprehension}. Comprehension is a quick, one-line way to
generate data into a data structure, and it can be quite useful. Here are examples of list and
dictionary comprehension:
\vspace{3mm}
\begin{lstlisting}
>>> mylist = [x for x in range(10)]
>>> mylist
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> mylist = [x * 3 for x in range(10)]
>>> mylist
[0, 3, 6, 9, 12, 15, 18, 21, 24, 27]
>>> foo = [["foo", "bar"], ["bar", "foo"], ["foobar", "barfoo"]]
>>> mydict = {k:v for k, v in foo}
>>> mydict
{'foo': 'bar', 'bar': 'foo', 'foobar': 'barfoo'}
\end{lstlisting}
\vspace{3mm}
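A comprehension can also filter values with a trailing \textbf{if}. The sketch below shows an explicit loop next to the comprehension that should produce the same list, followed by a set comprehension; the variable names are made up for this example.

```python
# The explicit loop ...
evens_loop = []
for x in range(10):
    if x % 2 == 0:
        evens_loop.append(x * 3)

# ... and the equivalent list comprehension with a trailing "if" filter
evens_comp = [x * 3 for x in range(10) if x % 2 == 0]

# Set comprehension: duplicate letters collapse into one entry each
unique_bases = {base for base in "ACCGTG"}

print(evens_loop)
print(evens_comp == evens_loop)
print(sorted(unique_bases))
```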
Now we're cooking with gas; what you have learned so far should be enough to carry you through basic
scripting. There are a few more topics that may help, however, for specific situations where we
need more control over our flow statements. In the next section, we will be covering a new type
of loop that is used for specific cases and statements that will allow us to control flow.
\pagebreak
\subsection{While Loops and Control of Flow}
In certain situations, we want to continue iterating indefinitely until we detect a certain value,
and at that point, we want to cease iteration. This task is well-suited to either a \textbf{while}
loop with a conditional or a \textbf{while} loop with a \textbf{break} statement. A while loop
continues iterating for as long as its condition is True; it stops when the condition becomes False
or when it is told to stop by a control of flow
statement. It's easiest just to see it in action, so here it goes:
\vspace{3mm}
\begin{lstlisting}
>>> x = 1
>>> while x != 64:
...     x *= 2
...     print(x)
...
2
4
8
16
32
64
>>> x = 1
>>> while True:
...     x *= 2
...     print(x)
...     if x == 64:
...         break
...
2
4
8
16
32
64
\end{lstlisting}
\vspace{3mm}
Here, we continue multiplying the value held in x by 2 and printing it until it reaches the value of 64,
at which time we cease iteration. This can also be done with a conditional and break statement, as seen
above. Be careful with the \textbf{while True} construction: if a break statement condition is never
satisfied, you will have an infinite loop and will have to interrupt it (typically with Ctrl+C) or restart Python
in order to stop it from looping on your computer forever. While loops can be helpful when we don't know the size or number of elements we will
be handling, for instance while reading in data from a file that is too big to open all at once, so we have
to iterate through it line by line until we reach the end. The other major control of flow statement is
the \textbf{continue} statement, which allows us to skip an iteration if a condition is satisfied:
\vspace{3mm}
\begin{lstlisting}
>>> for x in range(10):
...     if x > 6:
...         continue
...     else:
...         print(x)
...
0
1
2
3
4
5
6
\end{lstlisting}
\vspace{3mm}
Any code in the loop body that comes after the continue statement is skipped for that iteration, and we jump
to the beginning of the next iteration of the loop. This can be useful if there are values we know we won't be using, so
we can save time and not execute the code for those values. \par
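The file-reading case mentioned above can be sketched as follows. This is only an illustration under stated assumptions: \textbf{io.StringIO} stands in for a real file so that the example is self-contained, and the file contents are made up. The loop combines \textbf{while True}, \textbf{break}, and \textbf{continue}.

```python
import io

# io.StringIO stands in for a real file here so the sketch is self-contained
fake_file = io.StringIO("# header\nACGT\n\nGGCC\n")

sequences = []
while True:
    line = fake_file.readline()
    if line == "":
        # readline() returns an empty string only at end of file
        break
    line = line.strip()
    if line == "" or line.startswith("#"):
        # Skip blank lines and comment lines
        continue
    sequences.append(line)

print(sequences)
```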
And... that's really all there is to the basics of programming. Not too bad, right? Of course, we
still need to transition away from using the Python Interpreter and begin to write actual scripts: the
cornerstone of the Bioinformatician's toolbox. We will be covering scripting in the next section along
with several common scripting tasks, such as reading and writing files from your computer.
But first, we will apply the skills described in this section to a Rosalind problem. \\
\\
\textbf{Rosalind Problem 2: Transcribing DNA into RNA} \\
\href{http://rosalind.info/problems/rna/?class=246}{Click here} to visit this problem on Rosalind \\
\\
Translation is an important part of bioinformatics; we often want to translate one value into another,
and we can do this in several ways. In this simple case of translating DNA nucleotides into RNA nucleotides,
we can use either \textbf{string substitution} or we can use \textbf{dictionaries}, since we are essentially
associating one value with another to translate it. In this problem, you will write a function to perform
this translation task and apply it to the Rosalind problem. For demonstration purposes, here is a slightly
related example that may help in your programming:
\vspace{3mm}
\begin{lstlisting}
>>> example_string = "ABCDEFG"
>>> example_string.replace("C", "Z")
'ABZDEFG'
\end{lstlisting}
\vspace{3mm}
Using dictionaries as translation objects requires iterating over the string, using either a for loop or list comprehension
syntax:
\vspace{3mm}
\begin{lstlisting}
>>> example_string = "AAAACC"
>>> translator = {"A": "A", "C": "T"}
>>> "".join([translator[x] for x in list(example_string)])
'AAAATT'
\end{lstlisting}
\vspace{3mm}
The above code first \textbf{splits the string} by calling the \textbf{list} function, coercing it into a list
where each value is a single letter. We then run each letter through the associative array (dictionary) by
using \textbf{list comprehension}. The result is still a list of letters, so we have to \textbf{join} the
list back into a single string by calling the \textbf{join} method on an empty string, which serves as the separator. Below is the
step-wise process for demonstration purposes.
\vspace{3mm}
\begin{lstlisting}
>>> example_string = "AAAACC"
>>> translator = {"A": "A", "C": "T"}
>>> list(example_string)
['A', 'A', 'A', 'A', 'C', 'C']
>>> [translator[x] for x in list(example_string)]
['A', 'A', 'A', 'A', 'T', 'T']
>>> "".join([translator[x] for x in list(example_string)])
'AAAATT'
\end{lstlisting}
\vspace{3mm}
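The same dictionary approach can be written with an ordinary for loop instead of a comprehension. The sketch below is one possible way to do it. The function name is hypothetical, and the choice to leave characters that are missing from the dictionary unchanged is an assumption made for this example.

```python
def translate_text(text, translator):
    """Translate text one character at a time using a dictionary.

    Characters missing from the dictionary are kept unchanged.
    """
    translated = []
    for letter in text:
        # dict.get returns the second argument when the key is absent
        translated.append(translator.get(letter, letter))
    return "".join(translated)

print(translate_text("AAAACC", {"A": "A", "C": "T"}))
print(translate_text("AAGACC", {"C": "T"}))
```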
\pagebreak
\section{Scripting}
\subsection{Elements of a Python Script}
A script is a text-based file that the Python Interpreter expects to contain Python code.
It is conventional that these files end in a \textbf{.py suffix}, although that is not a requirement.
To start a Python script, you'll open a blank text file in a plain-text editor such as
Notepad or in your IDE of choice (see section 2 for an explanation of IDEs). \textbf{Do not}
use a rich-text editor such as Microsoft Word, as these programs add special characters to your
code that can be misinterpreted by the Python Interpreter. \par
Your Python script should \textbf{first list any imported modules} that you will be using in the remainder
of the code, as their functions need to be imported before you use them. Then, you'll have function
definitions for each task that you wish to perform; try to keep functions as specific as possible and not
simply lump all of the code into one big function. At the end of the file, you'll have a statement that
may look weird to you, but we'll see it is important to differentiate scripts from module files later on.
Finally, in that last code block, you'll call the functions you need in the correct order, and produce output.
At that point, you can save the file and run it using your operating system's command prompt or terminal by
calling \textbf{python myscript.py}. Here is an example of a basic Python script that we will call
\textbf{numbers.py}:
\vspace{3mm}
\begin{lstlisting}
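import sys

def write_numbers_out(number_range):
    # Print each integer from 0 up to (but not including) number_range
    for i in range(number_range):
        print(i)

if __name__ == "__main__":
    # Command-line arguments arrive as strings, so convert to int first
    write_numbers_out(int(sys.argv[1]))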
\end{lstlisting}
\vspace{3mm}
There are several aspects of the above code that we haven't encountered yet: \textbf{importing modules} and
the \textbf{main conditional declaration} code block. Importing a module gives you access to the
functions and objects contained within that module; here, we have used the \textbf{sys.argv} object, which is a list
of the space-separated words that make up the call to the script. In this case, we would call this script
like so: \textbf{python numbers.py 10}. Because the first \textbf{argument} passed to the numbers.py script
is 10, that value is stored in the sys.argv list, where we can access it using sys.argv[1].
Likewise, if we had placed a second number after 10, it would be stored in sys.argv[2], and so on. The name
of the script is stored in sys.argv[0], which would be numbers.py. Note that command-line arguments are always
stored as strings, so a value such as ``10'' must be converted with int() before it can be used as a number.
We will have more information about
\textbf{command-line parsing} later; for now, simply know that we used the sys module to check for additional
command-line arguments to the script and passed that argument to the function, which would then print out numbers
in that range to the command-line. \par
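A script that reads sys.argv will fail with an error if it is called with no arguments, so a small check can help. The sketch below is one way this might look. The helper name is hypothetical, and a hard-coded list stands in for sys.argv so that the example runs without any command-line input.

```python
def parse_count(argv):
    """Return the first command-line argument as an int (hypothetical helper)."""
    if len(argv) < 2:
        # argv[0] is the script name, so a length of 1 means no arguments
        raise SystemExit("usage: python numbers.py COUNT")
    return int(argv[1])

if __name__ == "__main__":
    # Simulated argument list for demonstration; a real script would pass sys.argv
    demo_argv = ["numbers.py", "3"]
    for i in range(parse_count(demo_argv)):
        print(i)
```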