# Python Basics

## Syntax
Python is an object-oriented scripting language and does not require a specific first or last line (such as <code>public static void main</code> in Java or <code>return</code> in C).

There are no curly braces {} to define code blocks, and semi-colons ; are not required to end a line.  Instead of braces, indentation is rigidly enforced to create a block of code.

In [2]:
# This is a comment

if (3 < 2):
    print "True" # Another Comment.  This print syntax only works in Python 2, not 3
else:
        print "False"

False


Arbitrary indentation can be used within a code block, as long as the indentation is consistent.

In [3]:
if (1 == 1):
        print "We're in "
            print "Deep Trouble:"

IndentationError: unexpected indent (<ipython-input-3-6e3880df02f0>, line 3)

In [4]:
if (0 > -1):
            print "This works "
            print "just fine."

This works 
just fine.


## Variables and Types

Variables can be given alphanumeric names beginning with an underscore or letter.  Variable types do not have to be declared and are inferred at run time.

In [5]:
a = 1
print type(a) # Built in function

<type 'int'>


In [6]:
b = 2.5 
print type(b)

<type 'float'>


Strings can be declared with either single or double quotes.

In [7]:
c1 = "Go "
c2 = 'Gators'
c3 = c1 + c2
print c3
print type(c3)

Go Gators
<type 'str'>


A variable's scope is the function, class, or file (module) in which it is assigned, in increasing order of breadth.  A function can rebind a module-level variable by declaring it with the <code>global</code> keyword.

In [8]:
print "b used to be", b # Prints arguments with a space separator 

# Our first function declaration
def sum():
    global b
    b = a + b
    
sum() # calling sum

# using this syntax, the arguments can be of any type that supports a string representation.  No casting needed.
print "Now b is", b 

b used to be 2.5
Now b is 3.5
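
The difference between a plain assignment inside a function and a <code>global</code> declaration can be seen in a minimal sketch.  The names <code>counter</code>, <code>set_local</code>, and <code>bump_global</code> are made up for illustration, and the <code>print()</code> function form is used so the snippet also runs on Python 3.

```python
counter = 0  # module-level (global) variable

def set_local():
    counter = 100  # assignment creates a new local variable; the global is untouched
    return counter

def bump_global():
    global counter  # rebind the module-level variable instead of creating a local one
    counter += 1
    return counter

local_result = set_local()
after_local = counter          # the global still holds 0 here
global_result = bump_global()  # the global is now 1
print(local_result, after_local, global_result)
```

Without the <code>global</code> line, <code>counter += 1</code> would raise an error, because Python would treat <code>counter</code> as a local variable that has not been assigned yet.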


## Modules and Import
Files with a .py extension are known as modules in Python.  Modules are used to store functions, variables, and class definitions.

Every module, including those in the standard Python library such as <code>math</code>, must be brought into your program with the <code>import</code> statement before it can be used.

In [9]:
# To use Math, we must import it
import math
print cos(0)

NameError: name 'cos' is not defined

Whoops.  Importing the <code>math</code> module allows us access to all of its functions, but we must call them in this way

In [10]:
print math.cos(0)

1.0


Alternatively, you can use the <code>from</code> keyword

In [11]:
from math import cos
print cos(math.pi) # we only imported cos, not the pi constant

-1.0


Using the <code>from</code> statement we can import everything from the math module.  

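Disclaimer: many Pythonistas discourage doing this because it fills your namespace with every name in the module, which can silently overwrite your own variables and makes it hard to tell where a name came from.  Just import what you need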

In [12]:
from math import *
print sin(pi/2) # now we don't have to make a call to math

1.0
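
As a small sketch of importing only what is needed, the snippet below combines a module alias with a selective <code>from</code> import.  The alias <code>m</code> is an arbitrary choice for illustration, and <code>print()</code> is written as a function so the code also runs on Python 3.

```python
import math as m          # alias the module; its names stay behind the m. prefix
from math import cos, pi  # import only the two names that are needed

qualified = m.cos(m.pi)   # access through the module alias
direct = cos(pi)          # access the imported names directly
print(qualified, direct)
```

Both calls reach the same function; the difference is only in which names end up in your own namespace.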


## Strings
As you may expect, Python strings come with a powerful, full-featured set of built-in methods.

### Substrings
Substrings can be extracted from Python strings using bracket (slice) syntax

In [13]:
mystring = "Go Gators, Come on Gators, Get up and go!"
print mystring[11:25]

Come on Gators


Python uses 0-based indexing.  Generally, whenever forming a range of values in Python, the first argument is inclusive whereas the second is not, i.e. <code>mystring[11:25]</code> returns the characters at indexes 11 through 24.

You can omit the first or second argument

In [14]:
print mystring[:9] # all characters before the 9th index

Go Gators


In [15]:
print mystring[27:] # all characters at or after the 27th

Get up and go!


In [16]:
print mystring[:] # you can even omit both arguments

Go Gators, Come on Gators, Get up and go!


Using negative values, you can count positions backwards

In [17]:
print mystring[-3:-1]

go
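
A few more slice forms, as a hedged sketch on the same string.  The optional third value (a step) is not covered in the text above, and <code>print()</code> is written as a function so the snippet also runs on Python 3.

```python
mystring = "Go Gators, Come on Gators, Get up and go!"

last_three = mystring[-3:]       # from the third-to-last character to the end
without_bang = mystring[:-1]     # everything except the final character
every_other = mystring[0:9:2]    # an optional third value sets the step
backwards = mystring[3:9][::-1]  # a step of -1 reverses the slice
print(last_three)
print(every_other)
print(backwards)
```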


### String Functions
Here are some more useful string functions
#### find

In [18]:
print mystring.find("Gators") # returns the index of the first occurrence of Gators

3


In [19]:
print mystring.find("Gators", 4) # specify an index on which to begin searching

19


In [20]:
print mystring.find("Gators", 4, 19) # specify both begin and end indexes to search

-1


Looks like nothing was found in that range.  <code>find</code> returns -1 whenever the substring is absent.

In [21]:
print mystring.find("Seminoles") # no Seminoles here

-1
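
Because <code>find</code> accepts a starting index and signals failure with -1, it can be called repeatedly to locate every occurrence.  The helper <code>find_all</code> below is a hypothetical name for this sketch, not a built-in, and <code>print()</code> is written as a function so the code also runs on Python 3.

```python
def find_all(text, sub):
    """Return the starting index of every occurrence of sub in text."""
    positions = []
    start = text.find(sub)
    while start != -1:                     # -1 means no further match
        positions.append(start)
        start = text.find(sub, start + 1)  # resume just past the last match
    return positions

mystring = "Go Gators, Come on Gators, Get up and go!"
print(find_all(mystring, "Gators"))
print(find_all(mystring, "Seminoles"))
```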


#### lower and upper

In [22]:
print mystring.lower()
print mystring.upper()

go gators, come on gators, get up and go!
GO GATORS, COME ON GATORS, GET UP AND GO!


#### replace

In [23]:
print mystring.replace("Gators", "Seminoles") # replaces all occurrences of Gators with Seminoles

Go Seminoles, Come on Seminoles, Get up and go!


In [24]:
print mystring

Go Gators, Come on Gators, Get up and go!


Notice that replace returned a new string.  Nothing was modified in place

In [25]:
print mystring.replace("Gators", "Seminoles", 1) # limit the number of replacements

Go Seminoles, Come on Gators, Get up and go!


#### split

In [26]:
print mystring.split() # returns a list of strings, split on whitespace by default

['Go', 'Gators,', 'Come', 'on', 'Gators,', 'Get', 'up', 'and', 'go!']


In [27]:
print mystring.split(',') # you can also define the separator

['Go Gators', ' Come on Gators', ' Get up and go!']


#### join

The <code>join</code> method is useful for building strings from lists or other iterables.  Call <code>join</code> on the desired separator

In [28]:
print ' '.join(["Go", "Gators"])

Go Gators
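
<code>split</code> and <code>join</code> are often used together.  The sketch below splits on commas, trims the leftover spaces with <code>strip</code>, and rejoins with a different separator.  The list comprehension and the <code>' | '</code> separator are choices made for this example, and <code>print()</code> is written as a function so the code also runs on Python 3.

```python
mystring = "Go Gators, Come on Gators, Get up and go!"

parts = mystring.split(',')                # pieces keep their leading spaces
cleaned = [part.strip() for part in parts] # strip() removes surrounding whitespace
rejoined = ' | '.join(cleaned)             # join is called on the separator
print(rejoined)
```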


For more information on string functions:

https://docs.python.org/2/library/stdtypes.html#string-methods

## Data Structures
### Lists
The Python standard library does not have traditional C-style fixed-memory fixed-type arrays.  Instead, lists are used and can contain a mix of any type.

Lists are created with square brackets []

In [29]:
mylist = [1, 2, 3, 4, 'five']
print mylist

[1, 2, 3, 4, 'five']


In [30]:
mylist.append(6.0) # add an item to the end of the list
print mylist

[1, 2, 3, 4, 'five', 6.0]


In [31]:
mylist.extend([8, 'nine']) # extend the list with the contents of another list
print mylist

[1, 2, 3, 4, 'five', 6.0, 8, 'nine']


In [32]:
mylist.insert(6, 7) # insert the number 7 at index 6
print mylist

[1, 2, 3, 4, 'five', 6.0, 7, 8, 'nine']


In [33]:
mylist.remove('five') # removes the first matching occurrence
print mylist

[1, 2, 3, 4, 6.0, 7, 8, 'nine']


In [34]:
popped = mylist.pop() # by default, the last item in the list is removed and returned
print popped
print mylist

nine
[1, 2, 3, 4, 6.0, 7, 8]


In [35]:
popped2 = mylist.pop(4) # pops the item at the given index
print popped2
print mylist

6.0
[1, 2, 3, 4, 7, 8]


In [36]:
print len(mylist) # returns the length of any iterable such as lists and strings

6


In [37]:
mylist.extend(range(-3, 0)) # the range function returns a list from -3 inclusive to 0 non inclusive
print mylist

[1, 2, 3, 4, 7, 8, -3, -2, -1]


In [38]:
# default list sorting. When more complex objects are in the list, arguments can be used to customize how to sort
mylist.sort()
print mylist

[-3, -2, -1, 1, 2, 3, 4, 7, 8]


In [39]:
mylist.reverse() # reverse the list
print mylist

[8, 7, 4, 3, 2, 1, -1, -2, -3]
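
For lists of more complex objects, <code>sort</code> accepts a <code>key</code> function and a <code>reverse</code> flag.  The sketch below sorts (name, number) tuples; the list <code>players</code> is assembled for illustration from names used elsewhere in this tutorial, and <code>print()</code> is written as a function so the code also runs on Python 3.

```python
players = [("Kasey Hill", 0), ("Chris Walker", 23), ("Michael Frazier II", 20)]

# key receives each item and returns the value to sort by
players.sort(key=lambda player: player[1])  # in place, by number
by_number = list(players)                   # copy the current order

players.sort(key=lambda player: player[0], reverse=True)  # by name, descending
by_name_desc = list(players)

print(by_number)
print(by_name_desc)
```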


For more information on Lists:

https://docs.python.org/2/tutorial/datastructures.html#more-on-lists

### Tuples

Python supports n-tuple sequences.  These are immutable: once a tuple is created, its items cannot be reassigned

In [40]:
mytuple = 'Tim', 'Tebow', 15 # Created with commas
print mytuple
print type(mytuple)

('Tim', 'Tebow', 15)
<type 'tuple'>


In [41]:
print mytuple[1] # access an item

Tebow


In [42]:
mytuple[1] = "Winston" # results in error

TypeError: 'tuple' object does not support item assignment
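
Tuples can also be unpacked into separate variables in a single assignment.  The sketch below shows unpacking and catches the error from attempted item assignment; the variable names are illustrative, and <code>print()</code> is written as a function so the code also runs on Python 3.

```python
mytuple = 'Tim', 'Tebow', 15

first, last, number = mytuple  # unpack into three variables at once

try:
    mytuple[1] = "Winston"     # tuples do not allow item assignment
except TypeError as err:
    message = str(err)

print(first, last, number)
print(message)
```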

### Sets
Python includes the set data structure which is an unordered collection with no duplicates

In [43]:
schools = ['Florida', 'Florida State', 'Miami', 'Florida']
myset = set(schools) # the set is built from the schools list
print myset

set(['Miami', 'Florida State', 'Florida'])


In [44]:
print 'Georgia' in myset # membership test

False


In [45]:
print 'Florida' in myset

True


In [46]:
badschools = set(['Florida State', 'Miami'])
print myset - badschools # set arithmetic

set(['Florida'])


In [47]:
print myset & badschools # AND

set(['Miami', 'Florida State'])


In [48]:
print myset | set(['Miami', 'Stetson']) # OR

set(['Miami', 'Florida State', 'Florida', 'Stetson'])


In [49]:
print myset ^ set(['Miami', 'Stetson']) # XOR

set(['Stetson', 'Florida', 'Florida State'])
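
Each set operator has an equivalent named method, which some readers find easier to scan.  A brief sketch, with <code>others</code> as an illustrative variable name and <code>print()</code> written as a function so the code also runs on Python 3:

```python
myset = set(['Florida', 'Florida State', 'Miami'])
others = set(['Miami', 'Stetson'])

difference = myset.difference(others)                 # same as myset - others
common = myset.intersection(others)                   # same as myset & others
combined = myset.union(others)                        # same as myset | others
either_not_both = myset.symmetric_difference(others)  # same as myset ^ others

print(sorted(difference))  # sorted() gives a stable order for display
print(sorted(common))
```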


### Dictionaries
Python supports dictionaries which can be thought of as an unordered list of key, value pairs.  Keys can be any immutable type and are typically integers or strings.  Values can be any object, even dictionaries.

Dictionaries are created with curly braces {}

In [50]:
mydict = {'Florida' : 1, 'Georgia' : 2, 'Tennessee' : 3}
print mydict

{'Georgia': 2, 'Florida': 1, 'Tennessee': 3}


In [51]:
print mydict['Florida'] # access the value with key = 'Florida'

1


In [52]:
del mydict['Tennessee'] # funky syntax to delete a key, value pair
print mydict

{'Georgia': 2, 'Florida': 1}


In [53]:
mydict['Georgia'] = 7 # assignment
print mydict

{'Georgia': 7, 'Florida': 1}


In [54]:
mydict['Kentucky'] = 6 # you can append a new key
print mydict

{'Georgia': 7, 'Florida': 1, 'Kentucky': 6}


In [55]:
print mydict.keys() # get a list of keys

['Georgia', 'Florida', 'Kentucky']
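
Accessing a missing key with brackets raises a <code>KeyError</code>.  The sketch below shows two common ways to guard against that, the <code>in</code> test and the <code>get</code> method, plus <code>items</code> for retrieving key, value pairs.  The default value of 0 is an arbitrary choice for this example, and <code>print()</code> is written as a function so the code also runs on Python 3.

```python
mydict = {'Georgia': 7, 'Florida': 1, 'Kentucky': 6}

has_florida = 'Florida' in mydict       # membership test on the keys
tennessee = mydict.get('Tennessee', 0)  # returns the default instead of raising KeyError
pairs = sorted(mydict.items())          # (key, value) tuples, sorted for a stable order

print(has_florida, tennessee)
print(pairs)
```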


## Conditionals
Python supports the standard if / else-if / else conditional, spelled <code>if</code>, <code>elif</code>, and <code>else</code>

In [56]:
a = 2; b = 1;

if a > b: print "a is greater than b"

a is greater than b


In [57]:
if b > a:
    print "b is greater than a"
else:
    print "b is less than or equal to a"

b is less than or equal to a


In [58]:
b = 2

if a > b:
    print "a is greater than b"
elif a < b:
    print "a is less than b"
else:
    print "a is equal to b"

a is equal to b


## Loops
Python supports for, foreach, and while loops
### For (counting)
Traditional counting loops are accomplished in Python with a combination of the <code>for</code> keyword and the <code>range</code> function

In [59]:
for x in range(10): # with one argument, range produces integers from 0 to 9
    print x

0
1
2
3
4
5
6
7
8
9


In [60]:
for y in range(5, 12): # with two arguments, range produces integers from 5 to 11
    print y

5
6
7
8
9
10
11


In [61]:
for z in range(1, 12, 3): # with three arguments, range starts at 1 and goes in steps of 3, stopping before it reaches 12
    print z

1
4
7
10


In [62]:
for a in range(10, 1, -5): # can use a negative step size as well
    print a

10
5


In [63]:
for b in range(2, 1, 1): # with a positive step, the start (2) is already past the stop (1). No integers are produced
    print b

In [64]:
for c in range(1, 2, -1): # likewise with a negative step, the start (1) is already below the stop (2)
    print c
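
The <code>range</code> cases above can be checked at a glance by materializing each one as a list.  This is a small sketch; the <code>list()</code> wrapper is added so the code behaves the same on Python 3, where <code>range</code> no longer returns a list, and <code>print()</code> is written as a function for the same reason.

```python
counting_up = list(range(1, 12, 3))     # start 1, step 3, stop before 12
counting_down = list(range(10, 1, -5))  # start 10, step -5, stop before 1
empty_up = list(range(2, 1, 1))         # start is already past the stop
empty_down = list(range(1, 2, -1))      # start is already below the stop

print(counting_up, counting_down, empty_up, empty_down)
```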

### Foreach
As it turns out, counting loops are just foreach loops in Python.  The <code>range</code> function returns a list of integers over which <code>for in</code> iterates.  This can be extended to any other iterable type

In [65]:
for i in ['foo', 'bar']: # iterate over a list of strings
    print i

foo
bar


In [66]:
anotherdict = {'one' : 1, 'two' : 2, 'three' : 3}

for key in anotherdict.keys(): # iterate over a dictionary.  Order is not guaranteed
    print key, anotherdict[key]

three 3
two 2
one 1
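
Two built-in helpers make foreach loops more convenient: <code>enumerate</code> yields an index alongside each item, and a dictionary's <code>items</code> method yields each key together with its value.  In this sketch the results are collected in a list named <code>lines</code> (an illustrative name), the keys are sorted to get a predictable order, and <code>print()</code> is written as a function so the code also runs on Python 3.

```python
anotherdict = {'one': 1, 'two': 2, 'three': 3}
lines = []

for index, word in enumerate(['foo', 'bar']):   # index and item together
    lines.append("%d %s" % (index, word))

for key, value in sorted(anotherdict.items()):  # key and value together, in sorted key order
    lines.append("%s %d" % (key, value))

print("\n".join(lines))
```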


### While
Python supports standard <code>while</code> loops

In [67]:
a = 1; b = 4; c = 7; d = 5;

while (a < b) and (c > d): # example of and condition
    print c - a
    a += 1 # example of incrementing
    c -= 1 # decrementing

6
4


Python does not have a construct for a do-while loop, though it can be accomplished using the <code>break</code> statement

In [68]:
a = 1; b = 10

while True: # loop unconditionally; the break below ends the loop
    a *= 2
    print a
    if a > b:
        break

2
4
8
16


## Functions
Python makes no distinction between functions that return a value and those that do not; a function without a <code>return</code> statement simply returns <code>None</code>.  If a value is returned, its type is not declared.

Functions can be declared in any module without any distinction between static and non-static.  Functions can even be declared within other functions.

The syntax is the following

In [69]:
def hello():
    print "Hello there!"
    
hello()

Hello there!


In [70]:
def player(name, number): # use some arguments
    print "#" + str(number), name # cast number to a string when concatenating
    
player("Kasey Hill", 0)

#0 Kasey Hill


Functions can have optional arguments if a default value is provided in the function signature

In [71]:
def player(name, number, team = 'Florida'): # optional team argument
    print "#" + str(number), name, team
    
player("Kasey Hill", 0) # no team argument supplied

#0 Kasey Hill Florida


In [72]:
player("Aaron Harrison", 2, "Kentucky") # supplying all three arguments

#2 Aaron Harrison Kentucky


Python functions can be called using named arguments, instead of positional

In [73]:
player(number = 23, name = 'Chris Walker')

#23 Chris Walker Florida


### \*args and \**kwargs
In Python, there is a special dereferencing (unpacking) scheme that allows for defining and calling functions with argument lists or dictionaries.

#### \*args

In [74]:
args = ['Michael Frazier II', 20, 'Florida']

player(*args) # calling player with the dereferenced argument list

#20 Michael Frazier II Florida


Argument lists can also be used in defining a function as such

In [75]:
def foo(*args): 
    for someFoo in args:
        print someFoo
        
foo('la', 'dee', 'da') # supports an arbitrary number of arguments

la
dee
da


#### \**kwargs
Similarly, we can define a dictionary of named parameters

In [76]:
kwargs = {'name' : 'Michael Frazier II', 'number' : 20}

player(**kwargs) # calling player with the dereferenced kwargs dictionary.  The team argument will be defaulted

#20 Michael Frazier II Florida


Just as before, we can define a function taking an arbitrary dictionary

In [77]:
def foo(**kwargs):
    for key in kwargs.keys():
        print key, kwargs[key]
        
foo(**kwargs)

name Michael Frazier II
number 20
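
A single function can accept both forms at once.  In the sketch below, the function name <code>describe</code> is hypothetical; it simply reports how many positional arguments it received and which names were passed by keyword.  <code>print()</code> is written as a function so the code also runs on Python 3.

```python
def describe(*args, **kwargs):
    # args is a tuple of the positional arguments; kwargs is a dict of the named ones
    return len(args), sorted(kwargs.keys())

result = describe('la', 'dee', 'da', name='Michael Frazier II', number=20)
print(result)
```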


### return
In Python functions, an arbitrary number of values can be returned.  Multiple values are packed into a single tuple, which the caller can unpack

In [78]:
def sum(x,y):
    return x + y # return a single value

print sum(1,2)

3


In [79]:
def sum_and_product(x,y):
    return x + y, x * y # return two values

mysum, myproduct = sum_and_product(1,2)
print mysum, myproduct

3 2
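
To see that multiple return values travel as one tuple, the result can be captured in a single variable before unpacking.  A minimal sketch, reusing <code>sum_and_product</code> from above, with <code>print()</code> written as a function so the code also runs on Python 3:

```python
def sum_and_product(x, y):
    return x + y, x * y  # the comma builds a tuple

result = sum_and_product(1, 2)  # capture the tuple itself
kind = type(result).__name__    # name of the returned type
mysum, myproduct = result       # unpack it afterwards
print(kind, result, mysum, myproduct)
```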


# Data Science Tutorial
Now that we've covered some Python basics, we will begin a tutorial going through many tasks a data scientist may perform.  We will obtain real world data and go through the process of auditing, analyzing, visualizing, and building classifiers from the data.

We will use a database of breast cancer data obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.  The data is a collection of samples from Dr. Wolberg's clinical cases with attributes pertaining to tumors and a class labeling the sample as benign or malignant.

| Attribute                      | Domain                          |
|--------------------------------|---------------------------------|
| 1. Sample code number          | id number                       |
| 2. Clump Thickness             | 1 - 10                          |
| 3. Uniformity of Cell Size     | 1 - 10                          |
| 4. Uniformity of Cell Shape    | 1 - 10                          |
| 5. Marginal Adhesion           | 1 - 10                          |
| 6. Single Epithelial Cell Size | 1 - 10                          |
| 7. Bare Nuclei                 | 1 - 10                          |
| 8. Bland Chromatin             | 1 - 10                          |
| 9. Normal Nucleoli             | 1 - 10                          |
| 10. Mitoses                    | 1 - 10                          |
| 11. Class                      | (2 for benign, 4 for malignant) |

For more information on this data set:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

## Obtaining the Data
Let's begin by programmatically obtaining the data.  Here I'll define a function we can use to make HTTP requests and download the data

In [80]:
def download_file(url, local_filename):
    import requests

    # stream=True downloads the body in chunks instead of loading the entire file into memory
    r = requests.get(url, stream=True)
    r.raise_for_status() # stop on an HTTP error instead of saving an error page
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk: # filter out keep-alive chunks
                f.write(chunk)

Now we'll specify the url of the file and the file name we will save to

In [81]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
filename = 'breast-cancer-wisconsin.csv'

And make a call to <code>download_file</code>

In [82]:
download_file(url, filename)

Now this might seem like overkill for downloading a single, small csv file, but we can use this same function to access countless APIs available on the World Wide Web by building an API request in the url.
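
As a rough sketch of building an API request in the url, the query string can be assembled with the standard library's <code>urlencode</code>.  The endpoint <code>api.example.com</code> and both parameters are hypothetical placeholders, not a real service, and the import is written so that it works on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlencode  # Python 3 location
except ImportError:
    from urllib import urlencode        # Python 2 location

base_url = 'https://api.example.com/v1/search'        # hypothetical endpoint
params = [('q', 'breast cancer'), ('format', 'csv')]  # hypothetical query parameters

# urlencode escapes each value and joins the pairs with &
request_url = base_url + '?' + urlencode(params)
print(request_url)
```

The resulting string could then be passed as the <code>url</code> argument of <code>download_file</code>.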

## Wrangling the Data
Now that we have some data, let's get it into a useful form.  For this task we will use a package called pandas. pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for Python.  The most fundamental data structure in pandas is the dataframe, which is similar to the data.frame data structure found in the R statistical programming language.

For more information: http://pandas.pydata.org

pandas dataframes are 2-dimensional labeled data structures with columns of potentially different types.  Dataframes can be thought of as similar to a spreadsheet or SQL table.

There are numerous ways to build a dataframe with pandas.  Since we have already obtained a csv file, we can use a parser built into pandas called <code>read_csv</code> which will read the contents of a csv file directly into a dataframe.

For more information: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.parsers.read_csv.html

In [83]:
import pandas as pd # import the module and alias it as pd

cancer_data = pd.read_csv('breast-cancer-wisconsin.csv')
cancer_data.head() # show the first few rows of the data

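|   | 1000025 | 5 | 1 | 1.1 | 1.2 | 2 | 1.3 | 3 | 1.4 | 1.5 | 2.1 |
|---|---------|---|---|-----|-----|---|-----|---|-----|-----|-----|
| 0 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 1 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 2 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 3 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| 4 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4 |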


Whoops, looks like our csv file did not contain a header row.  <code>read_csv</code> assumes the first row of the csv is the header by default.

Let's check out the file located here: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names

This contains information about the data set including the names of the attributes.

Let's create a list of these attribute names to use when reading the csv file

In [84]:
# line breaks are allowed inside brackets, so no \ continuation is needed
cancer_header = [
    'sample_code_number',
    'clump_thickness',
    'uniformity_cell_size',
    'uniformity_cell_shape',
    'marginal_adhesion',
    'single_epithelial_cell_size',
    'bare_nuclei',
    'bland_chromatin',
    'normal_nucleoli',
    'mitoses',
    'class',
]

Let's try the import again, this time specifying the names.  Because the file has no header row, we also set <code>header</code> to <code>None</code> so that <code>read_csv</code> treats the first line as data rather than as column names

In [85]:
cancer_data = pd.read_csv('breast-cancer-wisconsin.csv', header=None, names=cancer_header)
cancer_data.head()

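|   | sample_code_number | clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_nucleoli | mitoses | class |
|---|--------------------|-----------------|----------------------|-----------------------|-------------------|-----------------------------|-------------|-----------------|-----------------|---------|-------|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |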
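
The effect of <code>header=None</code> and <code>names</code> can be reproduced without a download.  The sketch below feeds two made-up rows, holding only three of the eleven columns, to <code>read_csv</code> through an in-memory buffer.  The variable names and the shortened rows are assumptions for illustration, and <code>print()</code> is written as a function so the code also runs on Python 3.

```python
import io
import pandas as pd

raw = u"1000025,5,2\n1002945,5,2\n"  # two data rows, no header line
columns = ['sample_code_number', 'clump_thickness', 'class']

# Default behavior: the first data row is consumed as the header
with_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every row as data; names supplies the column labels
with_names = pd.read_csv(io.StringIO(raw), header=None, names=columns)

print(len(with_default), len(with_names))
print(list(with_names.columns))
```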


Let's take a look at some simple statistics for the **clump_thickness** column

In [86]:
cancer_data["clump_thickness"].describe()

count    699.000000
mean       4.417740
std        2.815741
min        1.000000
25%        2.000000
50%        4.000000
75%        6.000000
max       10.000000
Name: clump_thickness, dtype: float64

Referring to the documentation link above about the data, the count and range of values (min = 1, max = 10) look correct, and the column was read as numeric (the summary statistics are reported as float64).

Let's take a look at another column, this time **bare_nuclei**

In [87]:
cancer_data["bare_nuclei"].describe()

count     699
unique     11
top         1
freq      402
Name: bare_nuclei, dtype: object

Well, at least the count is correct.  We were expecting no more than 10 unique values, but there are 11, and the data type is now object rather than numeric.

What's up with our data?

We have arrived at arguably the most important part of performing data science: dealing with messy data.  One of the most important tools in a data scientist's toolbox is the ability to audit, clean, and reshape data.  The real world is full of messy data and your sources may not always have data in the exact format you desire.

In this case we are working with csv data, which is a relatively straightforward format, but this will not always be the case when performing real world data science.  Data comes in all varieties from csv all the way to something as unstructured as a collection of emails or documents.  A data scientist must be versed in a wide variety of technologies and methodologies in order to be successful.

Now, let's do a little bit of digging into why we are not getting a numeric pandas column

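In [93]:
cancer_data["bare_nuclei"].unique()

array(['1', '10', '2', '4', '3', '9', '7', '?', '5', '8', '6'], dtype=object)

The eleventh value is '?', which this data set uses to mark a missing entry.  Because of it, pandas stores the entire column as strings.  A call such as <code>cancer_data["bare_nuclei"].convert_objects(convert_numeric=True)</code> coerces the column to numbers, with '?' becoming NaN.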

In [88]:
cancer_data.describe()

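|       | sample_code_number | clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bland_chromatin | normal_nucleoli | mitoses | class |
|-------|--------------------|-----------------|----------------------|-----------------------|-------------------|-----------------------------|-----------------|-----------------|---------|-------|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| mean | 1071704.098712 | 4.41774 | 3.134478 | 3.207439 | 2.806867 | 3.216023 | 3.437768 | 2.866953 | 1.589413 | 2.689557 |
| std | 617095.729819 | 2.815741 | 3.051459 | 2.971913 | 2.855379 | 2.2143 | 2.438364 | 3.053634 | 1.715078 | 0.951273 |
| min | 61634.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 25% | 870688.5 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| 50% | 1171710.0 | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 75% | 1238298.0 | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | 4.0 | 1.0 | 4.0 |
| max | 13454352.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 4.0 |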


In [89]:
cancer_data[cancer_data.sample_code_number == 1057013]

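|    | sample_code_number | clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_nucleoli | mitoses | class |
|----|--------------------|-----------------|----------------------|-----------------------|-------------------|-----------------------------|-------------|-----------------|-----------------|---------|-------|
| 23 | 1057013 | 8 | 4 | 5 | 1 | 2 | ? | 7 | 3 | 1 | 4 |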
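
One way to turn such a column into numbers is <code>pd.to_numeric</code> with <code>errors='coerce'</code>, which in newer pandas releases takes the place of <code>convert_objects</code>.  The sketch below runs on a four-value stand-in for the **bare_nuclei** column rather than the real data; the variable names are illustrative, and <code>print()</code> is written as a function so the code also runs on Python 3.

```python
import pandas as pd

# Small stand-in for the bare_nuclei column, including the '?' placeholder
bare_nuclei = pd.Series(['1', '10', '?', '5'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN
numeric = pd.to_numeric(bare_nuclei, errors='coerce')
n_missing = int(numeric.isnull().sum())  # count the NaN entries

print(numeric.dtype, n_missing)
```

Another option, under the same assumptions, is to pass <code>na_values='?'</code> to <code>read_csv</code> so the placeholder is treated as missing when the file is first read.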
