# Data Understanding and Preprocessing using Python and Pandas

#### A Jupyter Notebook by Muhammad Shakeel (F18-PCS-001)
#### Fall 2018 - CSC763: Advanced Topics in Machine Learning - Assignment 01

## Goals of this tutorial
This tutorial provides easy-to-understand, step-by-step instructions so that a beginner can learn to write programs in Python. It also explores the Pandas library as a tool for exploratory data analysis in Python. Its goals are to:

- **Introduce the basics of programming with Python**, and some skills useful in practice.
- **Introduce the syntax and usage of Pandas**, so that you can make use of the rich data analysis toolset available.

## Table of contents

### 1. [Introduction](#intro)
### 2. [Required software and libraries](#libs)
### 3. [How to run Python code](#run)   
   **3.1  [The Python Interpreter](#pi)   
   3.2  [The IPython Interpreter](#ip)   
   3.3  [Self-contained Python scripts](#ps)   
   3.4  [The Jupyter Notebook](#jn)**
### 4. [Writing programs with Python](#prg)
   **4.1  [Data types and Variables](#vars)   
        4.1.1  [Number types](#nums)   
        4.1.2  [Comments](#comments)   
        4.1.3  [Strings](#strings)<br>
     4.2  [Decisions](#if-else)<br>
        4.2.1  [Relational operators](#rel-ops)<br>
     4.3  [Loops](#loops)<br>
     4.4  [Lists](#lists)<br>
     4.5  [Sets](#sets)<br>
     4.6  [Dictionaries](#dict)<br>
     4.7  [Tuples](#tuples)<br>
     4.8  [Functions](#func)<br>
     4.9  [Lambda functions](#lambda)<br>
         4.9.1  [map function](#map)<br>
         4.9.2  [filter function](#filter)<br>
     5.0  [File I/O](#io)<br>**

### 5. [Introduction to Pandas](#pandas)
   **5.1  [Installation](#pandas-install)   
     5.2  [Getting started with pandas](#load)<br>
     5.3  [Introduction to pandas data structures](#pandas-ds)<br>
         5.3.1  [The Pandas Series object](#series)<br>
             5.3.1.1  [Series as generalized NumPy array](#series-np)<br>
             5.3.1.2  [Series as specialized dictionary](#series-dict)<br>
             5.3.1.3  [Constructing Series objects](#series-obj)<br>
         5.3.2  [The Pandas DataFrame object](#df)<br>
             5.3.2.1  [DataFrame as a generalized NumPy array](#df-np)<br>
             5.3.2.2  [DataFrame as specialized dictionary](#df-dict)<br>
             5.3.2.3  [Constructing DataFrame objects](#df-obj)<br>
         5.3.3  [Data Indexing and Selection](#index)<br>
             5.3.3.1  [Data Selection in Series](#series-select)<br>
             5.3.3.2  [Data Selection in DataFrame](#df-select)<br>**
                      
### 6. [Further reading](#read) 

<a id="intro"></a>
## 1. Introduction

[[ go back to the top ]](#Table-of-contents)

Conceived in the late 1980s primarily as a teaching and scripting language, Python has since become an essential tool for many programmers, engineers, researchers, and data scientists across academia and industry.

The appeal of Python is in its simplicity, as well as the convenience of the large ecosystem of domain-specific tools that have been built on top of it.
For example, most of the Python code in scientific computing and data science is built around a group of mature and useful packages:

- [NumPy](http://numpy.org) provides efficient storage and computation for multi-dimensional data arrays.
- [SciPy](http://scipy.org) contains a wide array of numerical tools such as numerical integration and interpolation.
- [Pandas](http://pandas.pydata.org) provides a DataFrame object along with a powerful set of methods to manipulate, filter, group, and transform data.
- [Matplotlib](http://matplotlib.org) provides a useful interface for creation of publication-quality plots and figures.
- [Scikit-Learn](http://scikit-learn.org) provides a uniform toolkit for applying common machine learning algorithms to data.
- [IPython/Jupyter](http://jupyter.org) provides an enhanced terminal and an interactive notebook environment that is useful for exploratory analysis, as well as creation of interactive, executable documents. 

No less important are the numerous other tools and packages which accompany these: if there is a scientific or data analysis task you want to perform, chances are someone has written a package that will do it for you.

The primary goal of this tutorial is to provide a solid foundation for working with Python and the Pandas library. The first part of the tutorial therefore explains how to program in Python, and the second part details working with the Pandas library. Both topics are reinforced with examples.

<a id="libs"></a>
## 2. Required software and libraries

[[ go back to the top ]](#Table-of-contents)

The software required to work through this tutorial includes the following:

1. To install Python and all required libraries including Pandas, you can use the [Anaconda Python distribution](https://www.anaconda.com/download/). Anaconda provides a simple double-click installer for your convenience.

	This notebook uses Python packages that come standard with the Anaconda Python distribution, so no additional installation will be required except the Anaconda distribution itself.


2. To convert this or any Jupyter Notebook to PDF (Portable Document Format), you also need to install a [LaTeX](https://www.latex-project.org/) distribution. This step is **OPTIONAL** and is only needed if a Jupyter Notebook is to be converted into a PDF document. LaTeX is a document preparation system used extensively for technical and scientific documents. You do not need to be familiar with LaTeX to work through this tutorial.

	If you do need LaTeX, many distributions are available, and the [MiKTeX distribution](https://miktex.org/download) is one popular option. Installation is simple, and the default options are sufficient.

<a id="run"></a>
## 3. How to run Python code
[[ go back to the top ]](#Table-of-contents)

Python is a flexible language, and there are several ways to use it depending on your particular task.
One thing that distinguishes Python from many other programming languages is that it is *interpreted* rather than *compiled*.
This means that code can be executed line by line, which allows programming to be interactive in a way that is not directly possible with compiled languages like Fortran, C, or Java. This section describes four primary ways you can run Python code: the *Python interpreter*, the *IPython interpreter*, *self-contained scripts*, and the *Jupyter notebook*.

<a id="pi"></a>
### 3.1 The Python Interpreter

The most basic way to execute Python code is line by line within the *Python interpreter*.
The Python interpreter can be started by installing the Python language (see the previous section) and typing ``python`` at the command prompt (look for the Terminal on Mac OS X and Unix/Linux systems, or the Command Prompt application in Windows):
```
$ python
Python 3.6.6 |Anaconda custom (64-bit)| (default, Jun  28 2018, 11:24:55)
Type "help", "copyright", "credits" or "license" for more information.
>>>
```
With the interpreter running, you can begin to type and execute code snippets.
Here we'll use the interpreter as a simple calculator, performing calculations and assigning values to variables:
``` python
>>> 1 + 1
2
>>> x = 5
>>> x * 3
15
```

The interpreter makes it very convenient to try out small snippets of Python code and to experiment with short sequences of operations.

<a id="ip"></a>
### 3.2 The IPython interpreter

If you spend much time with the basic Python interpreter, you'll find that it lacks many of the features of a full-fledged interactive development environment.
An alternative interpreter called *IPython* (for Interactive Python) is bundled with the Anaconda distribution, and includes a host of convenient enhancements to the basic Python interpreter.
It can be started by typing ``ipython`` at the command prompt:
```
$ ipython
Python 3.6.6 |Anaconda custom (64-bit)| (default, Jun  28 2018, 11:24:55) 
Type "copyright", "credits" or "license" for more information.
IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: 
```
The main aesthetic difference between the Python interpreter and the enhanced IPython interpreter lies in the command prompt: Python uses ``>>>`` by default, while IPython uses numbered commands (e.g. ``In [1]:``).
Regardless, we can execute code line by line just as we did before:
``` ipython
In [1]: 1 + 1
Out[1]: 2

In [2]: x = 5

In [3]: x * 3
Out[3]: 15
```
Note that just as the input is numbered, the output of each command is numbered as well.

<a id="ps"></a>
### 3.3 Self-contained Python scripts

Running Python snippets line by line is useful in some cases, but for more complicated programs it is more convenient to save the code to a file and execute it all at once.
By convention, Python scripts are saved in files with a *.py* extension.
For example, let's create a script called *test.py* which contains the following:
``` python
# file: test.py
print("Running test.py")
x = 5
print("Result is", 3 * x)
```
To run this file, we make sure it is in the current directory and type ``python`` *``filename``* at the command prompt:
```
$ python test.py
Running test.py
Result is 15
```
For more complicated programs, creating self-contained scripts like this one is a must.

<a id="jn"></a>
### 3.4 The Jupyter notebook

The Jupyter notebook is a browser-based graphical interface to the IPython shell, and builds on it a rich set of dynamic display capabilities.
As well as executing Python/IPython statements, the notebook allows the user to include formatted text, static and dynamic visualizations, mathematical equations, JavaScript widgets, and much more.
Furthermore, these documents can be saved in a way that lets other people open them and execute the code on their own systems. Further information can be seen from the [Jupyter Project](https://jupyter.org/) page.

Though the Jupyter notebook is viewed and edited through your web browser window, it must connect to a running Python process in order to execute code.
This process (known as a "kernel") can be started by running the following command in the system shell:

```
$ jupyter notebook
```

<a id="prg"></a>
# 4. Writing programs with Python

[[ go back to the top ]](#Table-of-contents)

In this first part of the tutorial, we will see how to use Python to write computer programs.

<a id="vars"></a>
## 4.1 Data types and variables

[[ go back to the top ]](#Table-of-contents)

When your program carries out computations, you will want to store values so that you can use them later. In a Python program, you use variables to store values. In this section, you will learn how to define and use variables.

> A **variable** is a storage location in a computer program. Each variable has a *name* and holds a *value*.

You use the **assignment statement** to place a value into a variable. Here is an example:

In [1]:
numberOfBooks = 5

The left-hand side of an assignment statement consists of a variable. The right-hand side is an expression that has a value. That value is stored in the variable. The first time a variable is assigned a value, the variable is created and initialized with that value. After a variable has been defined, it can be used in other statements. For example,

In [2]:
print(numberOfBooks)

5


will print the value stored in the variable numberOfBooks. If an existing variable is assigned a new value, that value replaces the previous contents of the variable. For example,

In [3]:
numberOfBooks = 10
print(numberOfBooks)

10


changes the value contained in variable numberOfBooks from 5 to 10.

Note that in Python, it is not necessary to declare what type of value a variable will store. The type is determined by the value assigned to the variable. For example,

In [4]:
numberOfBooks = 10            #integer type variable
counter = 100.00              #float type variable
subject = "Machine Learning"  #string type variable

print(numberOfBooks)
print(counter)
print(subject)

10
100.0
Machine Learning
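
As a small illustrative sketch (not one of the original cells), the built-in `type` function can be used to check that the type follows the assigned value. The variable name `item` is introduced here only for illustration.

```python
# The same variable can refer to values of different types over time.
item = 10
print(type(item))    # <class 'int'>

item = 100.00
print(type(item))    # <class 'float'>

item = "Machine Learning"
print(type(item))    # <class 'str'>
```

Each assignment replaces the previous value, and the type reported by `type` changes along with it.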


> **Note that a variable must be created and initialized before it can be used for the first time.**
For example, if the variable **b** has not been created and the following code is run, Python raises a `NameError` reporting that the name `b` is not defined.<br>
>`a = b + 100`

<a id="nums"></a>
### 4.1.1 Number types

[[ go back to the top ]](#Table-of-contents)

In Python, there are several different types of numbers. An **integer** value is a whole number without a fractional part. For example, there must be an integer number of books in any pack of books — you cannot have a fraction of a book. In Python, this type is called **int**. 

When a fractional part is required (such as in the number 0.355), we use floating-point numbers, which are called **float** in Python. When a value such as 6 or 0.355 occurs in a Python program, it is called a **number literal**. If a number literal has a decimal point, it is a floating-point number; otherwise, it is an integer. For example, 

In [5]:
2      #integer

2

In [6]:
2 + 3  #integer

5

In [7]:
2.3 + 5.5   #floating-point

7.8

In [8]:
2 ** 3    #exponent

8
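
The following sketch, added for illustration, uses the built-in `type` function to show how Python classifies number literals and what happens when the two number types are mixed in one expression. The variable names `total` and `half` are assumptions made for this example.

```python
print(type(6))        # <class 'int'>
print(type(0.355))    # <class 'float'>

# Mixing an integer with a floating-point number yields a float.
total = 2 + 3.0
print(total)          # 5.0

# The / operator always produces a float, even for two integers.
half = 6 / 2
print(half)           # 3.0
```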

<a id="comments"></a>
### 4.1.2 Comments

[[ go back to the top ]](#Table-of-contents)

Comments are explanations, written for human readers, of what the code is doing. As your programs get more complex, you should add comments to them. For example, here is a comment that explains the value used in a variable:

In [9]:
numberOfBooks = 10            #integer type variable

The interpreter does not execute comments at all. It ignores everything from a # delimiter to the end of the line.

<a id="strings"></a>
### 4.1.3 Strings

[[ go back to the top ]](#Table-of-contents)

Programs are also used to process text, not just numbers. Text consists of characters: letters, numbers, punctuation, spaces, and so on. A string is a sequence of characters. For example, the string `"Hello"` is a sequence of five characters.

A string can be stored in a variable as:

In [10]:
greeting = "Hello"

and later accessed when needed just as numerical values can be:

In [11]:
print(greeting)

Hello


A **string literal** denotes a particular string (such as "Hello"), just as a number literal (such as 2) denotes a particular number. In Python, string literals are specified by enclosing a sequence of characters within a matching pair of either single or double quotes.

In [12]:
print("This is a string.", 'So is this.')

This is a string. So is this.


By allowing both types of delimiters, Python makes it easy to include an apostrophe or quotation mark within a string.

In [13]:
message = 'He said "Hello"'
print(message)

He said "Hello"


The number of characters in a string is called the *length* of the string. For example, the length of "Harry" is 5. You can compute the length of a string using Python’s `len` function:

In [14]:
length = len("World!")    # length is 6
print(length)

6


A string of length 0 is called the *empty string*. It contains no characters and is written as "" or ''.

### Concatenation and repetition

Given two strings, such as "Harry" and "Morgan", you can **concatenate** them into one long string. The result consists of all characters in the first string, followed by all characters in the second string. In Python, you use the `+` operator to concatenate two strings. For example,

In [15]:
firstName = "John"
lastName = "Doe"

name = firstName + " " + lastName
print(name)

John Doe


Note how we have concatenated a space between the two strings in the assignment to the **name** variable above.

When the expression to the left or the right of a `+` operator is a string, the other one must also be a string; otherwise, a `TypeError` occurs at run time. You cannot concatenate a string with a numerical value directly.
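
A minimal sketch of this rule is shown below. The variable names are chosen only for illustration, and the built-in `str` function is used here as one common way to convert a number to a string before concatenating.

```python
count = 5

try:
    label = "Books: " + count      # a string plus an integer is not allowed
except TypeError as error:
    print("TypeError:", error)

# Converting the number to a string first makes the concatenation valid.
label = "Books: " + str(count)
print(label)                       # Books: 5
```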

You can also produce a string that is the result of repeating a string multiple times. For example, suppose you need to print a dashed line. Instead of specifying a literal string with 50 dashes, you can use the `*` operator to create a string that is comprised of the string "-" repeated 50 times. For example,

In [16]:
dashes = "-" * 50
print(dashes)

--------------------------------------------------


A string of any length can be repeated using the `*` operator. For example, the following statements repeat the string "Hello..." five times:

In [17]:
message = "Hello..."
print(message * 5)

Hello...Hello...Hello...Hello...Hello...


### Strings and characters

Strings are sequences of *Unicode* characters. You can access the individual characters of a string based on their position within the string. This position is called the *index* of the character. The first character has index 0, the second has index 1, and so on. For example, in the string "Harry", 

 - the character *H* is at index 0, 
 - the character *a* is at index 1, 
 - the character *r* is at index 2, 
 - the character *r* is at index 3, 
 - and the character *y* is at index 4. 
 
Note that there is **no character** at index **5**.

An individual character is accessed using a special *subscript notation* in which the position is enclosed within square brackets **[ ]**. For example, if the variable `name` is defined as:

In [18]:
name = "Harry"

the statements:

In [19]:
first = name[0]
last = name[4]
print(first, last)

H y


extract two different characters from the string. The first statement extracts the first character as the string "H" and stores it in variable `first`. The second statement extracts the character at position 4, which in this case is the last character, and stores it in variable `last`.

The index value must be within the valid range of character positions; otherwise, an `IndexError` (*string index out of range*) exception is raised at run time.
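
As an illustration, the sketch below deliberately uses an invalid index and catches the resulting exception so that the program can continue. The `try`/`except` construct and the variable `errorMessage` are additions made for this example only.

```python
name = "Harry"

try:
    letter = name[5]               # valid indexes are 0 to 4
except IndexError as error:
    errorMessage = str(error)
    print("IndexError:", errorMessage)    # IndexError: string index out of range
```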

By using this subscript notation, we can also extract a *substring* from a string. For example,

In [20]:
middle = name[1:4]
print(middle)

arr


We get the string "arr", starting from index 1 and up to index 3. That is, it *excludes the last index 4*.
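
The slice notation also allows either bound to be omitted. The sketch below, added as an illustration, shows a few variations on the string "Harry"; the variable `lastIndex` is introduced only for this example.

```python
name = "Harry"

print(name[1:4])    # arr  (indexes 1, 2 and 3)
print(name[:2])     # Ha   (from the start up to, but not including, index 2)
print(name[2:])     # rry  (from index 2 to the end)

# The last valid index is always one less than the length.
lastIndex = len(name) - 1
print(name[lastIndex])    # y
```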

### Updating strings

Just like other variables, the value in a string variable can be changed or replaced by assigning the string variable a new string value. For example, 

In [21]:
name = "John Doe"
print(name)

John Doe


This will replace the earlier value "Harry" by "John Doe" in the **name** string variable.

> Note that strings are immutable: the characters of an existing string cannot be changed in place. To change the value of a string variable, assign it a new string.

We can also use the substring operation and the `+` operator to build a new string from part of an existing one:

In [22]:
originalString = "Hello World!"
updatedString = originalString[:6] + "Python"

print("Updated string: ", updatedString)

Updated string:  Hello Python


This leaves the string "Hello World!" in the variable **originalString** unchanged. Instead, it takes a substring consisting of the first 6 characters, including the space character, and concatenates it with another string, "Python", to produce the new string "Hello Python", which is stored in **updatedString**.
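
A new string has to be built because Python strings cannot be modified in place. The following sketch (an illustration added here, not one of the original cells) shows that assigning to a single character fails, while building a new string succeeds.

```python
originalString = "Hello World!"

try:
    originalString[0] = "J"        # strings do not support item assignment
except TypeError as error:
    print("TypeError:", error)

# Build a new string instead; the original is left unchanged.
updatedString = "J" + originalString[1:]
print(updatedString)      # Jello World!
print(originalString)     # Hello World!
```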

### String methods

Python strings are **objects**; that is, they have a value as well as certain behaviors. For a string, the value is the sequence of characters it stores. The behavior of an object is given through its **methods**. A method, like a function, is a collection of programming instructions that carry out a particular task.

There are many built-in methods that can be applied to strings in Python. For example, you can apply the `upper` method to any string, like this:

In [23]:
name = "John Smith"
uppercaseName = name.upper() # Sets uppercaseName to "JOHN SMITH"
print(uppercaseName)

JOHN SMITH


Note that the method name follows the object, and that a dot (.) separates the object and method name.

There is another string method called lower that yields the lowercase version of a string:

In [24]:
print(name.lower()) # Prints john smith

john smith
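
Strings offer many other methods. As a brief, non-exhaustive sketch, the example below uses three more standard string methods: `strip`, `replace`, and `startswith`. The sample text and variable names are made up for illustration.

```python
name = "  John Smith  "

trimmed = name.strip()                       # removes leading and trailing spaces
print(trimmed)                               # John Smith

renamed = trimmed.replace("Smith", "Doe")    # returns a new string
print(renamed)                               # John Doe

print(trimmed.startswith("John"))            # True
print(trimmed)                               # John Smith (the original is unchanged)
```

Like `upper` and `lower`, these methods return a new value and leave the string they are applied to untouched.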


### Deleting strings

To delete a string variable, we use the `del` statement as follows:

In [25]:
del name

This deletes the variable **name**, that is, the reference it held to the string object. Now, any attempt to access this variable will result in an error:

In [26]:
print(name)    #error: name has been deleted, so it is no longer defined

NameError: name 'name' is not defined

Note that we can reuse a variable name after applying the `del` statement by assigning it a new value:

In [27]:
name = "Jane Doe"
print(name)

Jane Doe


Now, we can reuse the variable **name** in the rest of the code.

### Finding characters in a string

We can use the **in** operator to find whether a character exists in a string. This operator returns **True** if the character exists at any location in the string, and **False** otherwise.

In [28]:
name = "John Smith"

print('S' in name)
print('u' in name)

exists = 'i' in name
print(exists)

True
False
True


Note how we have stored the result of the **in** operator in the variable **exists**. We can also combine **in** with the logical operator **not** to reverse the result. That is, **not in** returns **True** if the specified character does not exist in the string, and **False** otherwise:

In [29]:
print('B' not in name)
print('u' not in name)

True
True


As we can see from the output above, neither of the characters *B* and *u* exists in the string "John Smith", so both statements output **True**.
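
The **in** operator is not limited to single characters. As a short illustrative sketch, the example below tests for a whole substring and shows that the test is case-sensitive; the variable names are chosen for this example only.

```python
name = "John Smith"

hasSmith = "Smith" in name
print(hasSmith)                    # True: "Smith" appears inside the string

hasLowerSmith = "smith" in name
print(hasLowerSmith)               # False: the comparison is case-sensitive

print("smith" in name.lower())     # True after converting to lowercase
```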

### Formatting strings

We may use special **substitution** placeholders, or **format specifiers**, while working with strings. Two common format specifiers are **%s**, which is replaced by a string, and **%d**, which is replaced by an integer. For example:

In [30]:
print("My name is %s and I am %d years old" % ("John Smith", 35))

My name is John Smith and I am 35 years old


In the above example, the **%s** placeholder is replaced with the string **John Smith**, and the **%d** placeholder is replaced with the number **35**.
> The number of values must match the number of placeholders, and each value must be compatible with its placeholder; otherwise, an error is reported. For example, **%d** requires a number, so supplying a string raises a `TypeError`. (**%s** is more lenient, because it converts any value to its string form.)

For example, the below code produces an error as **%d** is being replaced with a value that is a string:

In [31]:
print("My name is %s and I am %d years old" % (35, "John Smith"))   #error: %d expects a number!

TypeError: %d format: a number is required, not str
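
Besides the `%` placeholders, Python offers other ways to format strings. The sketch below, added for comparison, produces the same sentence with the `format` method and with an f-string (available in Python 3.6 and later). The variable names `viaFormat` and `viaFString` are introduced only for this example.

```python
name = "John Smith"
age = 35

# The format method replaces each {} with the corresponding argument.
viaFormat = "My name is {} and I am {} years old".format(name, age)
print(viaFormat)

# An f-string (Python 3.6 or later) embeds the variables directly.
viaFString = f"My name is {name} and I am {age} years old"
print(viaFString)
```

Both statements print `My name is John Smith and I am 35 years old`.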

<a id="if-else"></a>
## 4.2 Decisions

[[ go back to the top ]](#Table-of-contents)

The **if** statement is used to implement a decision. When a condition is fulfilled, one set of statements is executed. Otherwise, another set of statements is executed.

Some constructs in Python are **compound statements**, which span multiple lines and consist of a *header* and a **statement block**. The **if** statement is an example of a compound statement.

As an example, suppose that a store gives a discount of **5%** on the total sales amount whenever a customer's total sales are more than **100**. The printed receipt for the customer reflects this discount. If the customer's total sales are **100** or less, the store does not give any discount.

We will write the corresponding code in Python as follows:

In [32]:
totalSales = 50
if totalSales > 100.0 :     # The header ends in a colon.
    discount = totalSales * 0.05     # Lines in the block are indented to the same level
    totalSales = totalSales - discount
    print("You received a discount of", discount)
else:
    print("Your sales are 100 or less, so there is no discount")

Your sales are 100 or less, so there is no discount


Compound statements require a colon (:) at the end of the header. The statement block is a group of one or more statements, all of which are indented to the same indentation level. A statement block begins on the line following the header and ends at the first statement indented less than the first statement in the block. You can use any number of spaces to indent statements within a block, but all statements within the block must have the same indentation level. Note that comments are not statements and thus can be indented to any level.
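
When more than two outcomes are needed, additional conditions can be chained with `elif`. The sketch below extends the discount example; the second threshold (500) and the 10% rate are hypothetical values chosen only for illustration, as is the variable name `rate`.

```python
totalSales = 200

if totalSales > 500:
    rate = 0.10          # hypothetical higher discount tier
elif totalSales > 100:
    rate = 0.05          # the 5% discount from the store example
else:
    rate = 0.0           # no discount

print("Discount rate:", rate)     # Discount rate: 0.05
discount = totalSales * rate
```

The conditions are tested from top to bottom, and only the block of the first condition that is true gets executed.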

<a id="rel-ops"></a>
### 4.2.1 Relational operators

[[ go back to the top ]](#Table-of-contents)

Every if statement contains a condition. In many cases, the condition involves comparing two values. For instance, in the previous section we tested `totalSales > 100.0`. The comparison **>** is called a **relational operator**. Python has six relational operators.

 - **>**  (Greater-than)
 - **<**  (Less-than)
 - **>=** (Greater-than or equal to)
 - **<=** (Less-than or equal to)
 - **==** (Equal to)
 - **!=** (Not equal to)
 
All relational operators return **True** if the comparison results in true, otherwise **False**. We can compare numbers and strings using the relational operators. For example,

In [33]:
print(1 < 3)

True


In [34]:
print(15 <= 23)

True


In [35]:
"John Smith" == "John Woo"

False

In [36]:
"John Smith" != "John Woo"

True

We can also use **Boolean operators** to combine and compare multiple conditions at the same time. There are three **Boolean operators: and, or,** and **not**.

The **and** operator joins two conditions. Each part is a Boolean value that can be true or false. The combined expression is true only if both individual expressions are true. If either one of the expressions is false, then the result is also false.

In [37]:
(1 == 1) and (5 < 2)

False

For the **or** operator, the combined expression is true if at least one individual expression is true. If neither of the expressions is true, then the result is false.

In [38]:
(1 < 0) or (2 > 3)

False

The **not** operator simply inverts the result: if the result of some comparison is true, it returns false, and vice versa.

In [39]:
not(1 > 2)

True

We can use these operators while writing if statements:

In [40]:
a = 100
b = 200

if a < b or b > 100:
    print("Save")
else:
    print("Cancel")

Save
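
As a further sketch of how the three Boolean operators can be combined with relational operators, the example below checks a quiz score against some limits. The limits (0, 100, 33 and 95) and the variable names are hypothetical and serve only as an illustration.

```python
score = 67.5

# Both conditions must hold for the score to be valid.
isValid = (score >= 0) and (score <= 100)
print(isValid)                 # True

# not inverts the result.
print(not isValid)             # False

# or needs only one condition to hold.
needsReview = (score < 33) or (score > 95)
print(needsReview)             # False
```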


<a id="loops"></a>
## 4.3 Loops

[[ go back to the top ]](#Table-of-contents)

In Python, the **while** statement implements a repetition. It has the form:

`while condition :
    statements`
    
As long as the condition remains true, the statements inside the while statement are executed. This statement block is called the body of the while statement. For example,

In [41]:
timer = 1
while timer <= 10 :
    print(timer)
    timer = timer + 1

1
2
3
4
5
6
7
8
9
10


Often, we will need to visit each character in a string. The **for** loop makes this process particularly easy to program. For example, suppose we want to print a string, with one character per line. We cannot simply print the string using the print function. Instead, we need to iterate over the characters in the string and print each character individually. Here is how you use the for loop to accomplish this task:

In [42]:
cityName = "Lahore"
for letter in cityName :
    print(letter)

L
a
h
o
r
e


Loops that iterate over a range of integer values are very common. To simplify the creation of such loops, Python provides the **range** function for generating a sequence of integers that can be used with the for loop. The loop:

In [43]:
for i in range(1, 10) :        # i = 1, 2, 3, ..., 9
    print(i)

1
2
3
4
5
6
7
8
9


prints the sequential values from 1 to 9. The range function generates a sequence of values based on its arguments. The first argument of the range function is the first value in the sequence. Values are included in the sequence while they are less than the second argument. 

> Note that the ending value (the second argument to the range function) is not included in the output sequence.
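
The `range` function accepts a few other argument combinations as well. In the sketch below, added for illustration, the built-in `list` function is used only to collect the generated values so that they can be printed on one line.

```python
# Two arguments: start value and (excluded) end value.
print(list(range(1, 10)))        # [1, 2, 3, 4, 5, 6, 7, 8, 9]

# With a single argument, the sequence starts at 0.
print(list(range(5)))            # [0, 1, 2, 3, 4]

# An optional third argument sets the step between values.
print(list(range(0, 10, 2)))     # [0, 2, 4, 6, 8]
```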

As another example, the factorial of a number can be calculated using a **for** loop:

In [44]:
fact = 1
N = 5
for i in range(1, N + 1) :
    fact *= i    #same as fact = fact * i

print(fact)

120
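
One way to check a loop like this is to compare its result with the `factorial` function from Python's standard `math` module. The comparison below is a sketch added for illustration and repeats the loop so that the example is self-contained.

```python
import math

N = 5
fact = 1
for i in range(1, N + 1):
    fact *= i

print(fact)                        # 120
print(fact == math.factorial(N))   # True
```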


<a id="lists"></a>
## 4.4 Lists

[[ go back to the top ]](#Table-of-contents)

Lists are the fundamental mechanism in Python for collecting multiple values. Lists are indispensable when multiple values are being stored and there are many operations that are being performed on them.

As an example, suppose that we are storing the scores obtained by 10 students in a quiz. Without lists, we would need to create 10 individual variables. Let's also suppose that we need to find the highest quiz score. We would need to write a function, say **max()**, that finds the highest score among all of these variables. Now let's extend this problem to 100 students. There are now 100 scores to store, and we would need to rewrite the **max()** function so that it handles 100 variables. Such a program does not scale: every time the amount of data grows, the **max()** function needs major modifications.

One way to remedy this is to use a *list*. We now take a look at how to create and use lists.

> A list is an ordered collection of data that is referred to by a single variable name. (In many other programming languages, a similar concept is called an array.)

> Lists are objects in Python. They provide many useful methods for manipulating list elements.

> Whenever we are looking to work with multiple values, we should use Lists.
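
To make the benefit concrete, the sketch below finds the highest score in a list with a single loop that works unchanged for 10 or 100 scores. It uses list creation and a `for` loop over a list ahead of their detailed explanation, and the variable name `highest` is an assumption made for this example. Python's built-in `max` function gives the same answer.

```python
scores = [32, 54, 67.5, 29, 35, 80, 95, 44.5, 100, 65]

# Find the highest score by visiting each element once.
highest = scores[0]
for score in scores:
    if score > highest:
        highest = score

print(highest)         # 100
print(max(scores))     # 100, using Python's built-in max function
```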

### Creating lists

We create a list and specify the initial values that are to be stored in the new list: 

In [45]:
scores = [32, 54, 67.5, 29, 35, 80, 95, 44.5, 100, 65]    #list containing 10 elements
print(scores)

names = ["Allen", "Baker", "Charlie", "Dennis", "Elmo", "Fido", "Gabriel"]
print(names)

temperatures = [32, 33.5, 30, 31.5, 34, 35]
print(temperatures)

[32, 54, 67.5, 29, 35, 80, 95, 44.5, 100, 65]
['Allen', 'Baker', 'Charlie', 'Dennis', 'Elmo', 'Fido', 'Gabriel']
[32, 33.5, 30, 31.5, 34, 35]


The square brackets indicate that we are creating a list. The items are separated by commas and stored in the order in which they are provided. You will want to store the list in a variable so that you can access and manipulate it later.

> In a list, order is important, and the order of the elements in a list never changes unless you explicitly change it. Because the order of elements in a list is important, you refer to each element in a list using its index (its position within the list).

### Accessing List elements

A list is a sequence of elements, each of which has an integer position or index. To access a list element, you specify which index you want to use. That is done with the **subscript operator ([ ])** in the same way that you access individual characters in a string. Indexes in lists always start at zero; that is, the first element in a list is at index 0.

For example,

In [46]:
print(scores[5])    #Prints the element at index 5

80


Both lists and strings are sequences, and the [ ] operator can be used to access an element in any sequence.

There are two differences between lists and strings. Lists can hold values of any type, whereas strings are sequences of characters. Moreover, strings are immutable: you cannot change the characters in the sequence. Lists, however, are mutable. You can replace one list element with another, like this:

In [47]:
scores[5] = 87   #replace the value at index 5 with 87
print(scores[5])

87


Now the element at index 5 is filled with 87.

> Trying to access an element that does not exist in the list is a serious error. For example, since `scores` has ten elements, you are not allowed to access `scores[20]`. Attempting to access an element whose index is not within the valid index range is called an **out-of-range error** or a **bounds error**. When an **out-of-range error** occurs at run time, Python raises an `IndexError` exception.
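
The sketch below, added for illustration, shows two ways of dealing with a possibly invalid index: checking it against `len` before use, and catching the exception with `try`/`except`. The variables `index` and `errorMessage` are assumptions made for this example.

```python
scores = [32, 54, 67.5, 29, 35, 80, 95, 44.5, 100, 65]
index = 20

# Option 1: check the index before using it.
if index < len(scores):
    print(scores[index])
else:
    print("Index", index, "is outside the valid range 0 to", len(scores) - 1)

# Option 2: attempt the access and handle the exception.
try:
    print(scores[index])
except IndexError as error:
    errorMessage = str(error)
    print("IndexError:", errorMessage)    # IndexError: list index out of range
```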

We can also select a range or specific values from the list:

In [48]:
print("scores[1]:", scores[1])   #prints value at index 1 (54)
print("scores[1:5]: ", scores[1:5])   #prints starting index 1 to 4 (ignores index 5)!

scores[1]: 54
scores[1:5]:  [54, 67.5, 29, 35]
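
As a small illustrative sketch (the list `letters` is hypothetical), either bound of a range can be left out, and a negative index counts from the end of the list:

```python
letters = ["a", "b", "c", "d", "e"]   # hypothetical example list

firstThree = letters[:3]    # from the start up to, but not including, index 3
fromTwo = letters[2:]       # from index 2 to the end of the list
lastItem = letters[-1]      # negative indexes count from the end

print(firstThree)   # ['a', 'b', 'c']
print(fromTwo)      # ['c', 'd', 'e']
print(lastItem)     # e
```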


We can use the `len` function to obtain the length of the list; that is, the number of elements:

In [49]:
print(len(scores))    #prints the number of elements in a list

listSize = len(scores)  #stores the number of elements in a variable
print(listSize)

10
10


### Updating List elements

We can update any list element by assigning the list index a new value as follows:

In [50]:
subjects = ["Physics", "Computer Science", 1997, 2000]
print(subjects)

subjects[0] = "Discrete Mathematics"   #replace value at index 0
print(subjects)

['Physics', 'Computer Science', 1997, 2000]
['Discrete Mathematics', 'Computer Science', 1997, 2000]


We can also use the `append()` method to add elements in a list. The `append()` method always adds elements at the end of a List.

In [51]:
print("List size before appending new element: ", len(subjects))
subjects.append("Psychology")
print(subjects)
print("List size after appending new element: ", len(subjects))

print()   #blank line

subjects.append(2018)
print(subjects)

List size before appending new element:  4
['Discrete Mathematics', 'Computer Science', 1997, 2000, 'Psychology']
List size after appending new element:  5

['Discrete Mathematics', 'Computer Science', 1997, 2000, 'Psychology', 2018]


The size, or length, of the list increases after each call to the append method. Any number of elements can be added to a list.

If the order of the elements does not matter, appending new elements is sufficient. Sometimes, however, the order is important and a new element has to be inserted at a specific position in the list.

Suppose that in the above **subjects** list, we would like to add a new element at the very beginning or index 0 of the list. The **insert** method of the List is used for this purpose. While using the **insert** method, first the index where the new value is to be inserted is given, then the new value is given.

For example,

In [52]:
print(subjects, "Length: ", len(subjects))
print()

print("Inserting new value at index 0")
subjects.insert(0, 2009)

print()
print(subjects, "Length: ", len(subjects))

['Discrete Mathematics', 'Computer Science', 1997, 2000, 'Psychology', 2018] Length:  6

Inserting new value at index 0

[2009, 'Discrete Mathematics', 'Computer Science', 1997, 2000, 'Psychology', 2018] Length:  7


All of the elements at and following position 0 are moved down by one position to make room for the new element, which is inserted at position 0. After each call to the insert method, the size of the list is increased by 1.

### Removing an element

The **pop** method removes the element at a given position. For example, suppose we start with the list:

In [53]:
friends = ["Harry", "Cindy", "Emily", "Bob", "Cari", "Bill"] 

To remove the element at index position 1 ("Cindy") in the friends list, you use the command:

In [54]:
friends.pop(1)

'Cindy'

All of the elements following the removed element are moved up one position to close the gap. The size of the list is reduced by 1. The index passed to the **pop** method must be within the valid range.

The element removed from the list is returned by the **pop** method. This allows you to combine two operations in one—accessing the element and removing it:

In [55]:
print("The removed item is", friends.pop(1))
print("List now contains elements: ", friends)

The removed item is Emily
List now contains elements:  ['Harry', 'Bob', 'Cari', 'Bill']


If you call the **pop** method without an argument, it removes and returns the last element of the list. For example, `friends.pop()` removes "Bill".

Similar to the **pop** method, we can also use the **del** statement to remove an element from a list by giving the name of the list followed by the index of the element to be removed as a subscript. For example, 

In [56]:
print("Before the removal of the first element: ", friends)
del friends[0]    #removes the first element

print()
print("After the removal of the first element: ", friends)

Before the removal of the first element:  ['Harry', 'Bob', 'Cari', 'Bill']

After the removal of the first element:  ['Bob', 'Cari', 'Bill']


The **remove** method removes an element by value instead of by position. For example, suppose we want to remove the string "Cari" from the friends list but we do not know where it's located in the list. Instead of having to find the position, we can use the **remove** method:

In [57]:
friends.remove("Cari")
print("List now contains elements: ", friends)

List now contains elements:  ['Bob', 'Bill']
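
Two details of **remove** are worth a quick sketch (the list `pets` is hypothetical): only the first matching element is removed, and a `ValueError` is raised when the value is not in the list, so it can help to test with **in** first.

```python
pets = ["cat", "dog", "cat", "bird"]   # hypothetical list with a duplicate

pets.remove("cat")                     # removes only the first "cat"
print(pets)                            # ['dog', 'cat', 'bird']

if "fish" in pets:                     # check before removing
    pets.remove("fish")
else:
    print("fish is not in the list")

try:
    pets.remove("fish")                # the value is not in the list
except ValueError:
    removalFailed = True
    print("remove raised a ValueError")
```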


<a id="sets"></a>
## 4.5 Sets

[[ go back to the top ]](#Table-of-contents)

A set is a container that stores a collection of **unique** values. Unlike a list, the elements or members of the set are *not stored in any particular order* and *cannot be accessed by position*. The operations available for use with a set are the same as the operations performed on sets in mathematics. Because a set does not have to maintain a particular order, operations such as testing whether an element is a member are typically much faster than the equivalent list operations.

Because sets cannot have multiple occurrences of the same element, it makes sets highly useful to efficiently remove duplicate values from a list or tuple and to perform common math operations like unions and intersections.
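
A minimal sketch of removing duplicates this way, assuming a hypothetical list called `readings`:

```python
readings = [3, 5, 3, 8, 5, 3]       # hypothetical list with repeated values

uniqueReadings = set(readings)      # duplicates collapse into one element
print(len(readings), len(uniqueReadings))   # 6 3

# Convert back to a list; sorted() gives a predictable order
uniqueList = sorted(uniqueReadings)
print(uniqueList)                   # [3, 5, 8]
```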

> Whenever we are looking to work with multiple distinct or unique values, we should use Sets instead of Lists.

### Creating and using sets

To create a set with initial elements, you can specify the elements enclosed in braces, just like in mathematics:

In [58]:
colors = { "Red", "Green", "Blue" }
print(colors)

points = { 5, 10, 6, 2, 8, 3 }
print(points)

addresses = { "123-Elm Street", 100, "101 Main Blvd", 25.5 }
print(addresses)

{'Blue', 'Red', 'Green'}
{2, 3, 5, 6, 8, 10}
{25.5, '123-Elm Street', 100, '101 Main Blvd'}


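Elements of a set are comma separated. An empty set cannot be written as `{}`, because empty braces create an empty dictionary. Use the **set** function with no arguments instead:

In [59]:
emptySet = set()
print(emptySet)

set()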


Alternatively, you can use the **set** function to convert any sequence into a set:

In [60]:
names = ["John", "Hana", "Spiny", "Richard"]
cast = set(names)

print(cast, "Size of set: ", len(cast))

{'John', 'Hana', 'Richard', 'Spiny'} Size of set:  4


As with any container, you can use the **len** function to obtain the number of elements in a set:

In [61]:
numberOfCharacters = len(cast)
print(numberOfCharacters)

4


To determine whether an element is contained in the set, use the **in** operator or its inverse, the **not in** operator:

In [62]:
if "Richard" in cast :
    print("Richard is included in the set of cast.")
else :
    print("Richard is not a character in the show.")

Richard is included in the set of cast.


Because sets are unordered, you cannot access the elements of a set by position as you can with a list. Instead, use a **for** loop to iterate over the individual elements:

In [63]:
print("The cast of characters includes:")
for character in cast :
    print(character)

The cast of characters includes:
John
Hana
Richard
Spiny


> Note that the order in which the elements of the set are visited depends on how they are stored internally.
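
When a predictable order is needed, one option is to iterate over a sorted copy of the set. A minimal sketch, using a hypothetical set `showCast` with the same names as above:

```python
showCast = {"John", "Hana", "Spiny", "Richard"}   # hypothetical set of names

# sorted() returns a new list, so the visiting order no longer
# depends on how the set stores its elements internally
orderedCast = sorted(showCast)

for character in orderedCast:
    print(character)
```

The names are then visited alphabetically: Hana, John, Richard, Spiny.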

### Adding and removing elements

Like lists, sets are mutable collections, so you can add and remove elements. For example, suppose we need to add more characters to the set cast created in the previous section. Use the **add** method to add elements:

In [64]:
cast.add("Arthur")
print(cast, "Size: ", len(cast))

{'John', 'Arthur', 'Richard', 'Hana', 'Spiny'} Size:  5


If the element being added is not already contained in the set, it will be added to the set and the size of the set increased by one. Remember, however, that a set cannot contain duplicate elements. If you attempt to add an element that is already in the set, there is no effect and the set is not changed.

In [66]:
cast.add("John")
print(cast)

{'John', 'Arthur', 'Richard', 'Hana', 'Spiny'}


The **update** method adds all the elements of another set (or of any other iterable passed as an argument) to the set; that is, it can add multiple elements at once. Any duplicate elements are ignored during the **update**.

In [67]:
list1 = [1, 2, 3]  
list2 = [3, 5, 6, 7]  
list3 = [10, 11, 12] 
  
# Lists converted to sets  
set1 = set(list2)  
set2 = set(list1) 
  
# Update method  
set1.update(set2) 
   
# Print the updated set  
print(set1)  
  
# A list can also be passed as an argument; its elements are added to the set  
set1.update(list3)  
print(set1) 

{1, 2, 3, 5, 6, 7}
{1, 2, 3, 5, 6, 7, 10, 11, 12}


There are two methods that can be used to remove individual elements from a set. The **discard** method removes an element if the element exists.  

In [68]:
cast.discard("Arthur")
print(cast, "Size: ", len(cast))

{'John', 'Richard', 'Hana', 'Spiny'} Size:  4


but has no effect if the given element is not a member of the set:

In [69]:
cast.discard("Barbara")     # Has no effect
print(cast)

{'John', 'Richard', 'Hana', 'Spiny'}


The **remove** method, on the other hand, removes an element if it exists, but raises an exception if the given element is not a member of the set:

In [70]:
cast.remove("Barbara")     # Raises an exception

KeyError: 'Barbara'

Finally, the **clear** method removes all elements of a set, leaving the empty set:

In [72]:
cast.clear()    # cast now has size 0
print(cast, "Size: ", len(cast))

set() Size:  0


### Set Union, Intersection, and Difference

The union of two sets contains all of the elements from both sets, with duplicates removed. Use the **union** method to create the union of two sets in Python. For example:

In [73]:
canadian = { "Red", "White" }         # flag colors
british = { "Red", "Blue", "White" }  #flag colors
italian = { "Red", "White", "Green" } #flag colors

inEither = british.union(italian) # The set {"Blue", "Green", "White", "Red"}
print(inEither, "Size: ", len(inEither))

{'Blue', 'Green', 'White', 'Red'} Size:  4


Both the british and italian sets contain the colors "Red" and "White", but the union is a set and therefore contains only one instance of each color.

> Note that the union method returns a new set. It does not modify either of the sets in the call.

The intersection of two sets contains all of the elements that are in both sets. To create the intersection of two Python sets, use the **intersection** method:

In [74]:
inBoth = british.intersection(italian)    # The set {"White", "Red"}
print(inBoth, "Size: ", len(inBoth))

{'White', 'Red'} Size:  2


Finally, the difference of two sets results in a new set that contains those elements in the first set that are not in the second set. For example, the difference between the Italian and the British colors is the set containing only "Green". 

Use the **difference** method to find the set difference:

In [75]:
print("Colors that are in the Italian flag but not the British:")
print(italian.difference(british)) # Prints {'Green'}

Colors that are in the Italian flag but not the British:
{'Green'}


When forming the union or intersection of two sets, the order does not matter. For example, `british.union(italian)` is the same set as `italian.union(british)`. But the order matters with the **difference** method. The set returned by:

In [76]:
print(british.difference(italian))

{'Blue'}


is {"Blue"}.
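
Python sets also support operators for the same three operations: `|` for union, `&` for intersection, and `-` for difference. The sketch below re-creates two of the flag sets and checks that each operator gives the same result as the corresponding method:

```python
british = { "Red", "Blue", "White" }
italian = { "Red", "White", "Green" }

sameUnion = (british | italian) == british.union(italian)
sameIntersection = (british & italian) == british.intersection(italian)
sameDifference = (italian - british) == italian.difference(british)

print(sameUnion, sameIntersection, sameDifference)   # True True True
print(italian - british)                             # {'Green'}
```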

<a id="dict"></a>
## 4.6 Dictionaries

[[ go back to the top ]](#Table-of-contents)

A dictionary is a container that keeps associations between *keys* and *values*. Every key in the dictionary has an associated value. Keys are unique, but a value may be associated with several keys. The dictionary structure is also known as a *map* because it maps a unique key to a value. It stores the keys, values, and the associations between them.

> We use a dictionary in any application where the value to be looked up is associated with a particular, unique key. For example, in a contact list, a person's name acts as a key that is associated with one or more numbers. Similarly, an address book can be regarded as a dictionary.

> Looking up a value by its key in a dictionary is typically much faster than searching a list, because the keys are stored in a way that optimizes searching by key. Sets offer similarly fast membership tests.

> Python dictionaries are implemented using hashing (as hash tables), which makes a key lookup take O(1) time on average.

### Creating dictionaries

Suppose you need to write a program that looks up the phone number for a person in your mobile phone’s contact list. You can use a dictionary where the names are keys and the phone numbers are values. The dictionary also allows you to associate more than one person with a given number.

Here we create a small dictionary for a contact list that contains four items: 

In [77]:
contacts = { "Fred": 7235591, "Mary": 3841212, "Bob": 3841212, "Sarah": 2213278 }
print(contacts, "Size: ", len(contacts))

{'Fred': 7235591, 'Mary': 3841212, 'Bob': 3841212, 'Sarah': 2213278} Size:  4


We can have other examples of dictionaries as follows:

In [78]:
grades = { "A": 4, "B": 3, "C": 2.5, "D": 2 }
print(grades, "Size: ", len(grades))

constants = { "PI": 3.14159, "g": 9.8, "K": 212}
print(constants, "Size: ", len(constants))

{'A': 4, 'B': 3, 'C': 2.5, 'D': 2} Size:  4
{'PI': 3.14159, 'g': 9.8, 'K': 212} Size:  3


Each key/value pair is separated by a colon. You enclose the key/value pairs in braces, just as you would when forming a set. When the braces contain key/value pairs, they denote a dictionary, not a set. The only ambiguous case is an empty { }. By convention, it denotes an empty dictionary, not an empty set.

You can create a duplicate copy of a dictionary using the **dict** function:

In [79]:
oldContacts = dict(contacts)
print(oldContacts)

{'Fred': 7235591, 'Mary': 3841212, 'Bob': 3841212, 'Sarah': 2213278}
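
The copy made by **dict** is a separate dictionary, whereas a plain assignment only gives the same dictionary a second name. A minimal sketch, assuming a shortened, hypothetical contact list called `phoneBook`:

```python
phoneBook = { "Fred": 7235591, "Mary": 3841212 }   # hypothetical contact list
backup = dict(phoneBook)       # separate dictionary with the same items

backup["Fred"] = 1111111       # change only the copy
print(phoneBook["Fred"])       # 7235591 (the original is unchanged)
print(backup["Fred"])          # 1111111

alias = phoneBook              # plain assignment does NOT copy
alias["Mary"] = 2222222
print(phoneBook["Mary"])       # 2222222 (same dictionary, two names)
```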


### Accessing dictionary values

The subscript operator [ ] is used to return the value associated with a key. The statement:

In [80]:
print("Fred's number is", contacts["Fred"])

Fred's number is 7235591


prints `7235591`.

Note that the dictionary is not a sequence-type container like a list. Even though the subscript operator is used with a dictionary, you cannot access the items by index or position. A value can only be accessed using its associated key.

The key supplied to the subscript operator must be a valid key in the dictionary or a `KeyError` exception will be raised. To find out whether a key is present in the dictionary, use the **in** (or **not in**) operator:

In [81]:
if "John" in contacts :
    print("John's number is", contacts["John"])
else :
    print("John is not in my contact list.")

John is not in my contact list.
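
Another way to avoid the `KeyError` is the dictionary's **get** method, which returns `None` (or a default value that you supply) when the key is missing. A minimal sketch with a hypothetical dictionary `phoneBook`:

```python
phoneBook = { "Fred": 7235591, "Mary": 3841212 }   # hypothetical contact list

fredsNumber = phoneBook.get("Fred")       # key exists: its value is returned
johnsNumber = phoneBook.get("John")       # key missing: None, no KeyError
johnsDefault = phoneBook.get("John", 0)   # key missing: the supplied default

print(fredsNumber)    # 7235591
print(johnsNumber)    # None
print(johnsDefault)   # 0
```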


### Adding and modifying items

A dictionary is a mutable container. That is, you can change its contents after it has been created. You can add a new item using the subscript operator [ ] much as you would with a list:

In [82]:
contacts["John"] = 4578102
print(contacts)

{'Fred': 7235591, 'Mary': 3841212, 'Bob': 3841212, 'Sarah': 2213278, 'John': 4578102}


To change the value associated with a given key, set a new value using the [ ] operator on an existing key:

In [83]:
contacts["John"] = 2228102
print(contacts)

{'Fred': 7235591, 'Mary': 3841212, 'Bob': 3841212, 'Sarah': 2213278, 'John': 2228102}


Sometimes you may not know which items will be contained in the dictionary when it's created. You can create an empty dictionary like this:

In [84]:
favoriteColors = {}

and add new items as needed:

In [85]:
favoriteColors["Juliet"] = "Blue"
favoriteColors["Adam"] = "Red"
favoriteColors["Eve"] = "Blue"
favoriteColors["Ramos"] = "Green"

print(favoriteColors)

{'Juliet': 'Blue', 'Adam': 'Red', 'Eve': 'Blue', 'Ramos': 'Green'}


### Removing items

To remove an item from a dictionary, call the **pop** method with the key as the argument:

In [86]:
contacts.pop("Fred")
print(contacts)

{'Mary': 3841212, 'Bob': 3841212, 'Sarah': 2213278, 'John': 2228102}


This removes the entire item, both the key and its associated value. The **pop** method returns the value of the item being removed, so you can use it or store it in a variable:

In [87]:
sarahsNumber = contacts.pop("Sarah")
print(sarahsNumber)

2213278


If the key is not in the dictionary, the **pop** method raises a `KeyError` exception. To prevent the exception from being raised, you can test for the key in the dictionary:

In [88]:
if "Fred" in contacts :
    contacts.pop("Fred")
else:
    print("Fred not present in the dictionary")

Fred not present in the dictionary
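
Alternatively, **pop** accepts an optional second argument that is returned instead of raising the exception when the key is missing. A minimal sketch with a hypothetical dictionary `phoneBook`:

```python
phoneBook = { "Mary": 3841212, "Bob": 3841212 }   # hypothetical contact list

missing = phoneBook.pop("Fred", None)   # "Fred" is absent, so None is returned
print(missing)                          # None

removed = phoneBook.pop("Bob", None)    # "Bob" is present, so its value is returned
print(removed)                          # 3841212
print(phoneBook)                        # {'Mary': 3841212}
```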


### Traversing a Dictionary

You can iterate over the individual keys in a dictionary using a for loop:

In [89]:
print("My Contacts:")
for key in contacts :
    print(key)

My Contacts:
Mary
Bob
John


Note that since Python 3.7 a dictionary keeps its items in the order in which they were added; in earlier versions, the order was an implementation detail and could differ from the insertion order. To access the value associated with a key in the body of the loop, you can use the loop variable with the subscript operator. For example, these statements print both the name and phone number of your contacts:

In [90]:
print("My Contacts:")
for key in contacts :
    print("%-10s %d" % (key, contacts[key]))

My Contacts:
Mary       3841212
Bob        3841212
John       2228102
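
The same listing can also be produced with the dictionary's **items** method, which yields each key together with its value. A minimal sketch, using a hypothetical dictionary `phoneBook` with the same entries:

```python
phoneBook = { "Mary": 3841212, "Bob": 3841212, "John": 2228102 }

lines = []
for name, number in phoneBook.items():    # each item is a (key, value) pair
    lines.append("%-10s %d" % (name, number))

print("My Contacts:")
for line in lines:
    print(line)
```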


<a id="tuples"></a>
## 4.7 Tuples

[[ go back to the top ]](#Table-of-contents)

Python provides a data type for immutable sequences of arbitrary data. A tuple is very similar to a list, but once created, its contents cannot be modified. 

> A tuple is essentially a list that cannot be changed.

> A tuple is a fixed-length, immutable sequence of Python objects.

A tuple is created by specifying its contents as a comma-separated sequence. You can enclose the sequence in parentheses:

In [91]:
triple = (5, 10, 15)
print(triple, "Size: ", len(triple))

data = ("Maths", 98, "Programming", 99, "Research")
print(data, "Size: ", len(data))

(5, 10, 15) Size:  3
('Maths', 98, 'Programming', 99, 'Research') Size:  5


Tuples can contain elements of various types. If you prefer, you can omit the parentheses:

In [92]:
triple = 5, 10, 15

> The difference between lists and tuples is that tuples cannot be changed.

> For all applications where we need to store immutable data, tuples should be preferred over lists.

Each item stored in a tuple is comma-separated. We can create empty tuples as:

In [93]:
counter = ()
print(counter)

()
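
A tuple with exactly one element needs a trailing comma; without it, the parentheses are treated as ordinary grouping. A minimal sketch (the variable names are illustrative only):

```python
single = (5,)        # the trailing comma makes this a one-element tuple
notATuple = (5)      # without the comma, this is just the integer 5

print(type(single).__name__, len(single))   # tuple 1
print(type(notATuple).__name__)             # int
```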


### Accessing values in tuples

We can access an individual element of a tuple by its position using the **[ ]** operator. For example, 

In [94]:
element = triple[1]
print(element)

print(data[0:3])

10
('Maths', 98, 'Programming')


We can iterate over the elements of a tuple using for loops.

In [95]:
for i in range(1, 3) :
    print(data[i])

98
Programming


### Updating tuples

Because tuples are immutable, we cannot change any element in a tuple. For example, the following action is not allowed on a tuple:

In [96]:
data[0] = "Physics"   #error: no changes are allowed!

TypeError: 'tuple' object does not support item assignment

We can use other tuples to create new tuples. For example,

In [97]:
colors1 = ("Red", "Green", "Blue")
colors2 = ("Cyan", "Magenta", "Yellow", "Kelvin")

colors = colors1 + colors2
print(colors)

('Red', 'Green', 'Blue', 'Cyan', 'Magenta', 'Yellow', 'Kelvin')


### Deleting tuples

As tuples are immutable, we cannot delete any element from a tuple. We can, however, delete a whole tuple using the **del** statement:

In [99]:
del colors

print(colors)   #error: no tuple

NameError: name 'colors' is not defined

### When to use tuples?

Because a tuple never changes size, Python can store it slightly more compactly than a list, and creating a tuple is usually a little faster than creating the equivalent list; accessing an individual element takes about the same time in both. The gain is usually small, but if you want to write code that is as lean as possible, look for any case where you have a list that never changes in your program. You can redefine it from a list to a tuple by changing the square brackets to parentheses. Eventually, this concept becomes second nature. You start thinking of unchanging lists as tuples and define them that way right from the start.

There is one additional small benefit to using a tuple. If you have a list of data and you want to ensure that there is no code that makes any changes to it, use a tuple. Any code that attempts to append to, delete from, or modify an element of a tuple will generate an error message. The offending code can quickly be identified and corrected.

<a id="func"></a>
## 4.8 Functions

[[ go back to the top ]](#Table-of-contents)

A function is a sequence of instructions with a name. You have already encountered several functions during the discussion of various topics above.  For example, the `round` function contains instructions to round a floating-point value to a specified number of decimal places. You *call* a function in order to execute its instructions. For example, consider the following program statement:

In [100]:
price = round(6.8275, 2) # Sets price to 6.83
print(price)

6.83


By using the expression `round(6.8275, 2)`, the program *calls* the round function, asking it to round 6.8275 to two decimal digits. The instructions of the `round` function execute and compute the result. The `round` function *returns* its result back to where the function was called and the program resumes execution.

When another function calls the `round` function, it provides “inputs”, such as the values 6.8275 and 2 in the call `round(6.8275, 2)`. These values are called the **arguments** of the function call. Note that they are not necessarily inputs provided by a human user. They are simply the values for which we want the function to compute a result. The “output” that the round function computes is called the **return value**.

Functions can receive multiple arguments, but they return only one value. It is also possible to have functions with no arguments. An example is the `random` function of the `random` module, which requires no argument to produce a random number.
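
When several results are needed, a common approach is to pack them into a single tuple, which is still one return value. The sketch below uses a hypothetical helper, `minAndMax`, and also shows a call with no arguments to `random.random()` from the standard `random` module:

```python
import random

def minAndMax(values):
    # Hypothetical helper: both results are packed into one tuple
    return min(values), max(values)

result = minAndMax([32, 54, 67.5, 29])
print(result)                # (29, 67.5)

lowest, highest = minAndMax([32, 54, 67.5, 29])   # unpack the tuple
print(lowest, highest)       # 29 67.5

number = random.random()     # called with no arguments
print(0.0 <= number < 1.0)   # True
```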

The return value of a function is returned to the point in your program where the function was called. It is then processed according to the statement containing the function call. For example, suppose your program contains a statement:

In [221]:
price = round(6.8275, 2)
print(price)

6.83


When the round function returns its result, the return value is stored in the variable price.

### Implementing a function

We will start with a very simple example: a function to compute the volume of a cube with a given side length. When writing this function, you need to:
-  Pick a name for the function (cubeVolume).
-  Define a variable for each argument (sideLength). These variables are called the **parameter variables**.

Put all this information together along with the def reserved word to form the first line of the function’s definition:

`def cubeVolume(sideLength) :`

This line is called the **header** of the function. Next, specify the **body** of the function. The body contains the statements that are executed when the function is called.

The volume of a cube of side length s is s × s × s. However, for greater clarity, our parameter variable has been called `sideLength`, not `s`, so we need to compute `sideLength ** 3`.

We will store this value in a variable called `volume`:

`volume = sideLength ** 3`

In order to return the result of the function, use the `return` statement:

`return volume`

A function is a compound statement, which requires the statements in the body to be indented to the same level. Here is the complete function:

In [102]:
def cubeVolume(sideLength) :
    volume = sideLength ** 3
    return volume

### Testing a function

In the preceding section, you saw how to write a function. If you run a program containing just the function definition, then nothing happens. After all, nobody is calling the function.

In order to test the function, your program should contain:
-  The definition of the function.
-  Statements that call the function and print the result.

Here is such a program:

In [103]:
def cubeVolume(sideLength) :
    volume = sideLength ** 3
    return volume

result1 = cubeVolume(2)
result2 = cubeVolume(10)

print("A cube with side length 2 has volume", result1)
print("A cube with side length 10 has volume", result2)

A cube with side length 2 has volume 8
A cube with side length 10 has volume 1000


Note that the function returns different results when it is called with different arguments. Consider the call `cubeVolume(2)`. The argument 2 corresponds to the `sideLength` parameter variable. Therefore, in this call, sideLength is 2. The function computes `sideLength ** 3`, or `2 ** 3`. When the function is called with a different argument, say 10, then the function computes `10 ** 3`.

### Programs that contain functions

When you write a program that contains one or more functions, you need to pay attention to the order of the function definitions and statements in the program. Have another look at the program of the preceding section. Note that it contains
-  The definition of the `cubeVolume` function.
-  Several statements, two of which call that function.

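As the Python interpreter reads the source code, it reads each function definition and each statement. The statements in a function definition are not executed until the function is called. Any statement not in a function definition, on the other hand, is executed as it is encountered. Therefore, it is important that you define each function before you call it. For example, when run as a standalone program or in a fresh session, the following raises a `NameError` at run time, because the call on the first line executes before the definition:

In [104]:
print(cubeVolume(10))

def cubeVolume(sideLength) :
    volume = sideLength ** 3
    return volume

1000


The output `1000` appears here only because `cubeVolume` had already been defined by an earlier cell of this notebook. In a fresh session, Python reports `NameError: name 'cubeVolume' is not defined`, because at the moment the first line executes the interpreter has not yet seen the definition of `cubeVolume`.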
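
A minimal sketch of the same situation that can be run in any session: it uses a hypothetical function name, `squareArea`, that has not been defined earlier, and catches the resulting `NameError` so that the program can continue.

```python
try:
    print(squareArea(3))       # squareArea has not been defined yet
except NameError as error:
    message = str(error)
    print("NameError:", message)

def squareArea(sideLength):    # hypothetical function, defined after the first call
    return sideLength ** 2

print(squareArea(3))           # works now that the definition has been executed
```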

However, a function can be called from within another function before the former has been defined. For example, the following is perfectly legal:

In [105]:
def main() :
    result = cubeVolume(2)
    print("A cube with side length 2 has volume", result)

def cubeVolume(sideLength) :
    volume = sideLength ** 3
    return volume

main()

A cube with side length 2 has volume 8


### Parameter passing

When a function is called, variables are created for receiving the function’s arguments. These variables are called **parameter variables**. (Another commonly used term is **formal parameters**.) The values that are supplied to the function when it is called are the **arguments** of the call. (These values are also commonly called the **actual parameters**.) Each parameter variable is initialized with the corresponding argument.

Consider the function call:

In [106]:
result1 = cubeVolume(2)

-  The parameter variable sideLength of the cubeVolume function is created when the function is called.
-  The parameter variable is initialized with the value of the argument that was passed in the call. In our case, `sideLength` is set to 2.
-  The function computes the expression `sideLength ** 3`, which has the value 8. That value is stored in the variable `volume`. 
-  The function returns. All of its variables are removed. The return value is transferred to the *caller*, that is, the function calling the `cubeVolume` function. The caller puts the return value in the `result1` variable. 

Now consider what happens in a subsequent call, `cubeVolume(10)`. A new parameter variable is created. (Recall that the previous parameter variable was removed when the first call to `cubeVolume` returned.) It is initialized with 10, and the process repeats. After the second function call is complete, its variables are again removed.
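
The following sketch illustrates that the parameter variable exists only during the call: after `cubeVolume` returns, the name `sideLength` is no longer defined in the calling code (this assumes no other variable named `sideLength` exists in the session).

```python
def cubeVolume(sideLength):
    volume = sideLength ** 3
    return volume

result1 = cubeVolume(2)
print(result1)               # 8

try:
    print(sideLength)        # the parameter variable no longer exists
except NameError:
    parameterIsGone = True
    print("sideLength exists only while cubeVolume is running")
```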

### Return values

You use the `return` statement to specify the result of a function. In the preceding examples, each `return` statement returned a variable. However, the `return` statement can return the value of any expression. Instead of saving the return value in a variable and returning the variable, it is often possible to eliminate the variable and return the value of a more complex expression:

In [107]:
def cubeVolume(sideLength) :
    return sideLength ** 3

When the return statement is processed, the function exits *immediately*.
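
A small sketch of this behavior, using a hypothetical variant named `safeCubeVolume`: for a negative side length, the first `return` ends the function and the second one is never reached.

```python
def safeCubeVolume(sideLength):
    if sideLength < 0:
        return 0               # the function exits here for negative input
    return sideLength ** 3     # reached only for non-negative input

print(safeCubeVolume(-2))   # 0
print(safeCubeVolume(3))    # 27
```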

<a id="lambda"></a>
## 4.9 Lambda functions

[[ go back to the top ]](#Table-of-contents)

Python has support for so-called *anonymous* or *lambda functions*, which are a way of writing functions consisting of a single expression, the result of which is the return value. They are defined with the **lambda** keyword, which has no meaning other than “we are declaring an anonymous function”.

Lambda functions have the form:

**lambda par1, par2, ...: expression**

where the expression is the value to be returned. One useful feature of lambda expressions is that they can use variables from the function in which they are defined.
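
A minimal sketch of a lambda expression using a variable of its enclosing function (the names `makeMultiplier`, `double`, and `timesThree` are hypothetical):

```python
def makeMultiplier(factor):
    # The lambda uses `factor`, a variable of the enclosing function
    return lambda x: x * factor

double = makeMultiplier(2)
timesThree = makeMultiplier(3)

print(double(5))       # 10
print(timesThree(5))   # 15
```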

> Since lambda expressions are functions without a name, they are often referred to as *anonymous functions*.

The following are some examples of regular functions versus lambda functions:

In [108]:
def short_function(x):
    return x * 2

equiv_anon = lambda x: x * 2

In the above code, note how the **lambda** expression replaces the ordinary function. Note also that there are no `return` statements in lambda functions.

Another example of a lambda function applied to a list:

In [109]:
def apply_to_list(some_list, f):
    return [f(x) for x in some_list]

ints = [4, 0, 1, 5, 6]
apply_to_list(ints, lambda x: x * 2)

[8, 0, 2, 10, 12]

As another example, suppose you wanted to sort a collection of strings by the number of distinct letters in each string:

In [110]:
strings = ['foo', 'card', 'bar', 'aaaa', 'abab']

Here we could pass a lambda function to the list’s sort method:

In [111]:
strings.sort(key=lambda x: len(set(list(x))))
strings

['aaaa', 'foo', 'abab', 'bar', 'card']

The following program sorts names by their surnames. The second line sorts the list of names, and the last two lines display the contents of the sorted list.

In [112]:
names = ["Dennis Ritchie", "Alan Kay", "John Backus", "James Gosling"]    
names.sort(key=lambda name: name.split()[-1])
nameString = ", ".join(names)
print(nameString)

John Backus, James Gosling, Alan Kay, Dennis Ritchie


> Lambda functions do not include the `return` statement. Their body is a single expression, and the value of that expression is always returned. 

<a id="map"></a>
### 4.9.1 map function

[[ go back to the top ]](#Table-of-contents)

The `map()` function provides an easy way to apply a function to each item of an iterable object. It is frequently used together with lambda functions. `map` requires two arguments:

`r = map(function, sequence)`

The first argument is the function to be applied to each element of the second argument, which is an iterable such as a list or a set. In Python 3, `map` returns an iterator over the transformed elements; wrap it in `list()` to obtain a list.
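
In Python 3, the object returned by `map` is an iterator rather than a list, so it can be consumed only once. A minimal sketch (the list `animalNames` is hypothetical) that also passes the built-in `len` directly instead of a lambda:

```python
animalNames = ["hawk", "hen", "zebra"]   # hypothetical example list

lengths = map(len, animalNames)    # a named function works as well as a lambda
print(type(lengths).__name__)      # map

firstPass = list(lengths)          # consumes the iterator
secondPass = list(lengths)         # nothing is left the second time

print(firstPass)                   # [4, 3, 5]
print(secondPass)                  # []
```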

For example, here are efficient, compact ways to perform an operation on a sequence. Note the use of the lambda anonymous function:

In [113]:
lst = [1, 2, 3, 4, 5, 6]
list(map(lambda x: x ** 3, lst))

[1, 8, 27, 64, 125, 216]

In [114]:
animals = ['hawk', 'hen', 'hedgehog', 'hyena', 'zebra', 'giraffe']
print(animals)

list(map(lambda animal: len(animal), animals))     # apply len() to every animals item

['hawk', 'hen', 'hedgehog', 'hyena', 'zebra', 'giraffe']


[4, 3, 8, 5, 5, 7]

In [115]:
numbers = [5, 6, 7, 8, 9, 10, 100, 1000, 10000]
print(numbers)

list(map(lambda n: n % 10, numbers))    

[5, 6, 7, 8, 9, 10, 100, 1000, 10000]


[5, 6, 7, 8, 9, 0, 0, 0, 0]

In [116]:
sentence = "Weather is beginning to change"
words = sentence.split()
print(words)

lengths = map(lambda word: len(word), words)
list(lengths)

['Weather', 'is', 'beginning', 'to', 'change']


[7, 2, 9, 2, 6]

<a id="filter"></a>
### 4.9.2 filter function

[[ go back to the top ]](#Table-of-contents)

The `filter` function provides an elegant way to select elements from a list or any other iterable. It accepts two arguments: `filter(function, iterable)`.  

The first argument, *function*, is applied to every element of the second argument, *iterable*, and returns a **boolean** value of *True* or *False*. Only the elements for which *function* evaluates to *True* are kept. In Python 3, `filter` returns an iterator, so the examples below wrap the result in `list()`.

For example,

In [117]:
words = ["Hello", "World", "Python", "Great", "OK"]
resultList1 = filter(lambda s: len(s) > 2, words)

list(resultList1)

['Hello', 'World', 'Python', 'Great']

In [118]:
sports = ('Cricket', 'Soccer', 'Hockey', 'Baseball')
resultList2 = filter(lambda w: len(w) % 2 == 0, sports)

list(resultList2)

['Soccer', 'Hockey', 'Baseball']

In [119]:
colors = ('Red', 'Green', 'Blue', 'Black')
resultList3 = filter(lambda b: len(b) < 4  , colors)

list(resultList3)

['Red']

In [120]:
numbers = [5, 6, 7, 8, 9, 10, 100, 1000, 10000]
print(numbers)

list(map(lambda n: n % 10 == 0, numbers))

[5, 6, 7, 8, 9, 10, 100, 1000, 10000]


[False, False, False, False, False, True, True, True, True]
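
The last cell applies `map` with a predicate, which yields one boolean per element. As a short sketch of the contrast, `filter` with the same predicate keeps the elements themselves. The helper name `is_multiple_of_ten` is hypothetical and used only for illustration.

```python
numbers = [5, 6, 7, 8, 9, 10, 100, 1000, 10000]

def is_multiple_of_ten(n):   # hypothetical helper name, for illustration only
    return n % 10 == 0

flags = list(map(is_multiple_of_ten, numbers))      # one True/False per element
kept = list(filter(is_multiple_of_ten, numbers))    # only the elements that pass

print(flags)
print(kept)
```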

<a id="io"></a>
## 5.0 File Input / Output (I/O)

[[ go back to the top ]](#Table-of-contents)

This section explains how to perform input and output using the keyboard and files. We begin with reading input from the keyboard.

### Reading input from keyboard: The input function

The `input` function prompts the user to enter data. A typical input statement is:

In [121]:
city = input("Enter the name of your city: ")
print(city)

Enter the name of your city: Lahore
Lahore


When Python reaches this statement, the string "Enter the name of your city: " is displayed and the program pauses. After the user types in the name of his or her city and presses the Enter (or return) key, the variable `city` is assigned the name of the city. (If the variable had not been created previously, it is created at this time.) The general form of an input statement is:

**variableName = input(prompt)**

where prompt is a string that requests a response from the user.

The `input` function always returns a string. However, combining `input` with the `int`, `float`, or `eval` function allows numbers to be entered into a program. Note that `eval` evaluates any Python expression the user types, so it should only be used with trusted input. For instance, consider the following three statements:

In [122]:
age = int(input("Enter your age: "))    #only integers are accepted
weight = float(input("Enter your weight: "))
height = eval(input("Enter your height: "))   #both integer and floating-point values are accepted

Enter your age: 28
Enter your weight: 63
Enter your height: 5.8
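
Because `eval` executes whatever expression the user types, a more defensive pattern is to try the numeric conversions explicitly. The following is a minimal sketch; the function name `parse_number` is an assumption made for illustration and is not part of Python.

```python
def parse_number(text):
    """Convert text to an int if possible, otherwise to a float.

    Raises ValueError if the text is not a valid number.
    """
    text = text.strip()
    try:
        return int(text)
    except ValueError:
        return float(text)   # raises ValueError for non-numeric text

print(parse_number("28"))
print(parse_number("5.8"))
```

In an interactive program it could be used as `height = parse_number(input("Enter your height: "))`.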


### Files input and output 

We now discuss the common task of reading and writing files that contain text. Examples of text files include not only files created with a simple text editor, such as Windows Notepad, but also Python source code and HTML files.

### Opening a file

To access a file, you must first open it. When you open a file, you give the name of the file, or, if the file is stored in a different directory, the file name preceded by the directory path. You also specify whether the file is to be opened for reading or writing. Suppose you want to read data from a file named `input.txt`, located in the same directory as the program. Then you use the following function call to open the file:

In [123]:
infile = open("input.txt", "r")

This statement opens the file for reading (indicated by the string argument "r") and returns a *file object* that is associated with the file named `input.txt`. When opening a file for reading, the file must exist or an exception occurs. 

The file object returned by the `open` function must be saved in a variable. All operations for accessing a file are made via the file object. To open a file for writing, you provide the name of the file as the first argument to the open function and the string "w" as the second argument:

In [124]:
outfile = open("output.txt", "w")

If the output file already exists, it is emptied before the new data is written into it. If the file does not exist, an empty file is created. When you are done processing a file, be sure to close the file using the **close** method:

In [125]:
infile.close()
outfile.close()

After a file has been closed, it cannot be used again until it has been reopened. Attempting to do so will result in an exception.
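
Python's `with` statement closes the file automatically when its block ends, even if an error occurs inside it. The sketch below assumes a temporary directory created with the standard `tempfile` module so that it does not touch real files; the file name is only an example.

```python
import os
import tempfile

# Work in a temporary directory so the sketch does not touch real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output.txt")

    with open(path, "w") as outfile:
        outfile.write("Hello, World! from Python\n")
    closed_after_write = outfile.closed   # True: closed when the block ended

    with open(path, "r") as infile:
        content = infile.read()

print(closed_after_write)
print(content)
```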

### Reading from a file

To read a line of text from a file, call the `readline` method on the `file` object that was returned when you opened the file:

In [126]:
infile = open("input.txt", "r")
line = infile.readline()    #read the first line from the file

print(line)
infile.close()

Demo file for file i/o.



When a file is opened, an input marker is positioned at the beginning of the file. The `readline` method reads the text, starting at the current position and continuing until the end of the line is encountered. The input marker is then moved to the next line. The `readline` method returns the text that it read, including the newline character that denotes the end of the line.

Reading multiple lines of text from a file is very similar to reading a sequence of values with the input function. You repeatedly read a line of text and process it until the sentinel value is reached:

In [127]:
infile = open("input.txt", "r")
line = infile.readline()

while line != "" :
    # Process the line.
    print(line)
    line = infile.readline()   
infile.close()

Demo file for file i/o.



Python I/O Jupyter Notebook



Numbers: 10 20 30 40 50 25.5 89.89



Hello

World!





The sentinel value is an empty string, which is returned by the `readline` method after the end of file has been reached.
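
The blank lines in the output above appear because each line read from the file already ends with a newline character and `print` adds another. A minimal sketch of stripping it follows; an in-memory `io.StringIO` object with made-up contents stands in for the opened file so that the example is self-contained.

```python
import io

# io.StringIO behaves like a text file opened for reading.
infile = io.StringIO("Demo file for file i/o.\nHello\nWorld!\n")

cleaned = []
line = infile.readline()
while line != "":
    cleaned.append(line.rstrip("\n"))   # drop the trailing newline only
    line = infile.readline()
infile.close()

for text in cleaned:
    print(text)
```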

### Writing to a file

You can write text to a file that has been opened for writing. This is done by applying the `write` method to the `file` object. For example, we can write the string "Hello, World! from Python" to our output file using the statement:

In [128]:
outfile = open("output.txt", "w")
outfile.write("Hello, World! from Python\n")
outfile.close()
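
The `write` method accepts only strings, so other values have to be converted first. The sketch below writes a few numbers and reads them back; it also shows that `print` can send its output to a file through its `file` argument. The temporary directory and the sample values are assumptions made for illustration.

```python
import os
import tempfile

values = [10, 20, 30, 25.5]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output.txt")

    outfile = open(path, "w")
    for value in values:
        outfile.write(str(value) + "\n")        # write() accepts only strings
    print("Total:", sum(values), file=outfile)  # print() can target a file too
    outfile.close()

    infile = open(path, "r")
    written = infile.read()
    infile.close()

print(written)
```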

### Iterating over the lines of a file

You have seen how to read a file one line at a time. However, there is a simpler way. Python can treat an input file as though it were a container of strings in which each line is an individual string. To read the lines of text from the file, you can iterate over the file object using a for loop.

For example, the following loop reads all lines from a file and prints them:

In [129]:
infile = open("input.txt", "r")

for line in infile :
    print(line)
    
infile.close()

Demo file for file i/o.



Python I/O Jupyter Notebook



Numbers: 10 20 30 40 50 25.5 89.89



Hello

World!





### Binary files and random access

In the following section, you will learn how to process files that contain data other than text. You will also see how to read and write data at arbitrary positions in a file.

### Reading and writing binary files

There are two fundamentally different ways to store data: in text format or binary format. In text format, data items are represented in human-readable form as a sequence of characters. For example, in text form, the integer 12,345 is stored as the
sequence of five characters:

"1" "2" "3" "4" "5"

In binary form, data items are represented in bytes. A byte is composed of 8 bits, each of which can be 0 or 1. A byte can denote one of 256 values. To represent larger values, one uses sequences of bytes. Integers are frequently stored as a
sequence of four bytes. For example, the integer 123,456 can be stored as:

64 226 1 0 
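
The byte values above correspond to a four-byte, little-endian layout (least significant byte first); this byte order is an assumption, since the text does not state it. A short sketch with the standard `struct` module reproduces the conversion:

```python
import struct

# "<i" means: little-endian, 4-byte signed integer.
packed = struct.pack("<i", 123456)
print(list(packed))

# Unpacking reverses the conversion and returns a tuple.
(value,) = struct.unpack("<i", packed)
print(value)
```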

If you load a binary file into a text editor, you will not be able to view its contents. Processing binary files requires programs written explicitly for reading or writing the binary data.

We have to cover a few technical issues about binary files. To open a binary file for reading, use the following command:

`inFile = open(filename, "rb")`

Remember, the second argument to the open function indicates the mode in which the file will be opened. In this example, the mode string indicates that we are opening a binary file for reading. To open a binary file for writing, you would use the mode
string "wb":

`outFile = open(filename, "wb")`
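
As a minimal sketch of both modes, the following writes four bytes to a binary file and reads them back. The file name `numbers.dat` and the temporary directory are hypothetical choices made for illustration.

```python
import os
import tempfile

data = bytes([64, 226, 1, 0])   # the four bytes of the integer 123456 shown earlier

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "numbers.dat")   # hypothetical file name

    out_file = open(path, "wb")
    out_file.write(data)          # binary files take bytes objects, not strings
    out_file.close()

    in_file = open(path, "rb")
    raw = in_file.read()          # returns a bytes object
    in_file.close()

value = int.from_bytes(raw, byteorder="little")
print(raw, value)
```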

### Random access

So far, you’ve read from a file one string at a time and written to a file one string at a time, without skipping forward or backward. That access pattern is called **sequential access**. In many applications, we would like to access specific items in a file without first having to read all preceding items. This access pattern is called **random access**. There is nothing “random” about random access; the term means that you can read and modify any item stored at any location in the file.

Each file has a special marker that indicates the current position within the file. This marker is used to determine where
the next string is read or written. You can move the file marker to a specific position within the file. To position the marker relative to the beginning of the file, you use the method call:

`inFile.seek(position)`

To determine the current position of the file marker (counted from the beginning of the file), use:

`position = inFile.tell() # Get current position.`

For example,

In [130]:
infile = open("input.txt", "r")

text = infile.read(10)   # read the first 10 characters
print("Read string is: ", text)

pos = infile.tell()    # get the current position
print("Current position is: ", pos)

pos = infile.seek(0, 0)   # reposition the pointer at the beginning of the file
text = infile.read(30)

print("Read string is: ", text)
    
infile.close()

Read string is:  Demo file 
Current position is:  10
Read string is:  Demo file for file i/o.

Pytho
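
Random access is most useful with binary files whose records all have the same size, because the position of any record can then be computed directly. The following sketch assumes records of one 4-byte little-endian integer each and uses `io.BytesIO` in place of a file opened in `"rb"` mode; the helper `read_record` is hypothetical.

```python
import io
import struct

RECORD_SIZE = 4   # each record is one 4-byte little-endian integer

# io.BytesIO stands in for a file opened in "rb" mode.
values = [10, 20, 30, 40, 50]
binary_file = io.BytesIO(b"".join(struct.pack("<i", v) for v in values))

def read_record(file_obj, index):
    """Jump straight to record `index` and decode it."""
    file_obj.seek(index * RECORD_SIZE)          # move the file marker
    (value,) = struct.unpack("<i", file_obj.read(RECORD_SIZE))
    return value

third = read_record(binary_file, 2)
position_after = binary_file.tell()
print(third, position_after)
```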


<a id="pandas"></a>
## 5. Introduction to Pandas

[[ go back to the top ]](#Table-of-contents)

Pandas is an open source Python library for data analysis. It gives Python the ability to work with spreadsheet-like data for fast data loading, manipulating, aligning, merging, etc. To give Python these enhanced features, Pandas introduces two new data types to Python: **Series** and **DataFrame**. The **DataFrame** will represent your entire spreadsheet or rectangular data, whereas the **Series** is a single column of the **DataFrame**. A Pandas **DataFrame** can also be thought of as a dictionary or collection of **Series**.

pandas is often used in tandem with numerical computing tools like `NumPy` and `SciPy`, analytical libraries like `statsmodels` and `scikit-learn`, and data visualization libraries like `matplotlib`. pandas adopts significant parts of NumPy’s idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops.

While pandas adopts many coding idioms from `NumPy`, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. `NumPy`, by contrast, is best suited for working with homogeneous numerical array data.

<a id="pandas-install"></a>
### 5.1 Installation

[[ go back to the top ]](#Table-of-contents)

Installing Pandas on your system requires `NumPy` to be installed as well. The easiest and most general way to install the pandas library is to use a prepackaged solution, i.e., installing it through the Anaconda distribution. If you have already installed Anaconda, no additional installation is required.

<a id="load"></a>
### 5.2 Getting started with pandas

[[ go back to the top ]](#Table-of-contents)

When given a data set, we first load it and begin looking at its structure and contents. The simplest way of exploring a data set is to look at and subset specific rows and columns. We can see what type of information is stored in each column, and can start looking for patterns by aggregating descriptive statistics.

Since Pandas is not part of the Python standard library, we have to first tell Python to load (import) the library and check the version:

In [131]:
import pandas as pd
pd.__version__

'0.23.4'

Just as we generally import `NumPy` under the alias ``np``, we will import Pandas under the alias ``pd``.

<a id="pandas-ds"></a>
### 5.3 Introduction to pandas data structures

[[ go back to the top ]](#Table-of-contents)

At the heart of pandas are two primary data structures, around which nearly all operations performed during data analysis are built: 
-  Series
-  DataFrame

> The **Series** is designed to hold a one-dimensional sequence of data, while the **DataFrame**, a more complex data structure, is designed to hold data with several dimensions.

<a id="series"></a>
### 5.3.1 The Pandas Series object
[[ go back to the top ]](#Table-of-contents)

A Series is a one-dimensional array-like object containing a sequence of values (of similar types to `NumPy` types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:

In [132]:
obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the
data) is created. You can get the array representation and index object of the Series via its values and index attributes, respectively:

In [133]:
obj.values
obj.index # like range(4)

RangeIndex(start=0, stop=4, step=1)
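
A notebook cell displays only the value of its last expression, which is why only the index appears above. A minimal sketch that prints both attributes explicitly:

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

print(obj.values)   # the underlying NumPy array
print(obj.index)    # the default RangeIndex
```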

Like with a `NumPy` array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [134]:
obj[1]

7

In [135]:
obj[1:3]

1    7
2   -5
dtype: int64

As we will see, though, the Pandas ``Series`` is much more general and flexible than the one-dimensional `NumPy` array that it emulates.

<a id="series-np"></a>
### 5.3.1.1 ``Series`` as generalized NumPy array
[[ go back to the top ]](#Table-of-contents)

From what we've seen so far, it may look like the ``Series`` object is basically interchangeable with a one-dimensional `NumPy` array.
The essential difference is the presence of the index: while the `NumPy` array has an *implicitly defined* integer index used to access the values, the Pandas ``Series`` has an *explicitly defined* index associated with the values.

This explicit index definition gives the ``Series`` object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type.
If we wish, we can use strings as an index:

In [136]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [137]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

Compared with `NumPy` arrays, you can use labels in the index when selecting single values or a set of values:

In [138]:
obj2['a']

-5

In [139]:
obj2[2]

-5

In [140]:
obj2['d'] = 6

In [141]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    6
dtype: int64

Here ['c', 'a', 'd'] is interpreted as a list of indices, even though it contains strings instead of integers.

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be used in many contexts where you might use a `dict`:

In [142]:
'b' in obj2

True

In [143]:
'e' in obj2

False

<a id="series-dict"></a>
### 5.3.1.2 ``Series`` as specialized dictionary
[[ go back to the top ]](#Table-of-contents)

In this way, you can think of a Pandas ``Series`` a bit like a specialization of a Python dictionary.
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a ``Series`` is a structure which maps typed keys to a set of typed values.
This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas ``Series`` makes it much more efficient than Python dictionaries for certain operations.
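
As a small sketch of what the typed values make possible, the following compares an explicit loop over a dictionary with a single vectorized expression on a ``Series``. It only demonstrates that both give the same numbers and does not measure speed; the two-state dictionary is a shortened example.

```python
import pandas as pd

population_dict = {'California': 38332521, 'Texas': 26448193}
population = pd.Series(population_dict)

# With a dict, every value has to be handled in an explicit Python loop.
millions_dict = {state: count / 1e6 for state, count in population_dict.items()}

# With a Series, one vectorized expression operates on the whole typed array.
millions_series = population / 1e6

print(population.dtype)
print(millions_series)
```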

The ``Series``-as-dictionary analogy can be made even more clear by constructing a ``Series`` object directly from a Python dictionary:

In [144]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [145]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

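By default, a ``Series`` will be created where the index is drawn from the dictionary keys. In the pandas version used here (0.23.4) on Python 3.6 or later, the keys keep their insertion order, as the output above shows; older versions sorted them.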
From here, typical dictionary-style item access can be performed:

In [146]:
population['California']

38332521

Unlike a dictionary, though, the ``Series`` also supports array-style operations such as slicing:

In [147]:
population['California':'Illinois']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

When you pass only a `dict`, the index of the resulting Series takes the dict’s keys (in insertion order here; older pandas versions sorted them). You can override this by passing an index that lists the keys in the order you want them to appear in the resulting Series:

In [148]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Here, three values found in sdata were placed in the appropriate locations, but since no value for 'California' was found, it appears as NaN (not a number), which is considered in pandas to mark missing or NA values. Since 'Utah' was not included in
states, it is excluded from the resulting object.

The `isnull` and `notnull` functions in pandas should be used to detect missing data:

In [149]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [150]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Series also has these as instance methods:

In [151]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
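
Once missing values have been detected, they are usually counted, dropped, or replaced. The following minimal sketch rebuilds `obj4` and uses the `dropna` and `fillna` methods; the fill value `0` is an arbitrary choice for illustration.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

missing_count = obj4.isnull().sum()   # True counts as 1, so this counts NaNs
without_missing = obj4.dropna()       # drop the entries that are NaN
filled = obj4.fillna(0)               # or replace NaN with a chosen value

print(missing_count)
print(without_missing)
print(filled)
```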

Both the Series object itself and its index have a `name` attribute, which integrates with other key areas of pandas functionality:

In [152]:
obj4.name = 'population'
obj4.index.name = 'state'

obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

A Series’s index can be altered in-place by assignment:

In [153]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [154]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

Series can be constructed from a `dict`:

In [155]:
color_codes = {"Red": 1, "Green": 2, "Blue": 3}   # avoid shadowing the built-in dict
pd.Series(color_codes)

Red      1
Green    2
Blue     3
dtype: int64

<a id="series-obj"></a>
### 5.3.1.3 Constructing ``Series`` objects
[[ go back to the top ]](#Table-of-contents)

We've already seen a few ways of constructing a Pandas ``Series`` from scratch; all of them are some version of the following:

```python
>>> pd.Series(data, index=index)
```

where ``index`` is an optional argument, and ``data`` can be one of many entities.

For example, ``data`` can be a list or NumPy array, in which case ``index`` defaults to an integer sequence:

In [156]:
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

``data`` can be a scalar, which is repeated to fill the specified index:

In [157]:
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

``data`` can be a dictionary, in which case ``index`` defaults to the dictionary keys (kept in insertion order in this pandas version, as the output shows; older versions sorted them):

In [158]:
pd.Series({2:'a', 1:'b', 3:'c'})

2    a
1    b
3    c
dtype: object

In each case, the index can be explicitly set if a different result is preferred:

In [159]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

3    c
2    a
dtype: object

Notice that in this case, the ``Series`` is populated only with the explicitly identified keys.

<a id="df"></a>
### 5.3.2 The Pandas DataFrame object

[[ go back to the top ]](#Table-of-contents)

The next fundamental structure in Pandas is the ``DataFrame``.
Like the ``Series`` object discussed in the previous section, the ``DataFrame`` can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.
We'll now take a look at each of these perspectives.

<a id="df-np"></a>
### 5.3.2.1 DataFrame as a generalized NumPy array
[[ go back to the top ]](#Table-of-contents)

If a ``Series`` is an analog of a one-dimensional array with flexible indices, a ``DataFrame`` is an analog of a two-dimensional array with both flexible row indices and flexible column names.
Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a ``DataFrame`` as a sequence of aligned ``Series`` objects.
Here, by "aligned" we mean that they share the same index.

> A `DataFrame` represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The `DataFrame` has both a row and column index; it can be thought of
as a `dict` of Series all sharing the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than a `list`, `dict`, or some other collection of one-dimensional arrays.

To demonstrate this, let's first construct a new ``Series`` listing the area of each of the five states discussed in the previous section:

In [160]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

Now that we have this along with the ``population`` Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:

In [161]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In a similar fashion:

In [162]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

Like the ``Series`` object, the ``DataFrame`` has an ``index`` attribute that gives access to the index labels:

In [163]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

Additionally, the ``DataFrame`` has a ``columns`` attribute, which is an ``Index`` object holding the column labels:

In [164]:
states.columns

Index(['population', 'area'], dtype='object')

Similarly, the `frame` DataFrame created above has its index assigned automatically, as with `Series`, and its columns follow the order of the keys in the dict (older pandas versions placed them in sorted order):

In [165]:
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


For large DataFrames, the head method selects only the first five rows:

In [166]:
frame.head()

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order:

In [167]:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


If you pass a column that isn’t contained in the dict, it will appear with missing values in the result:

In [168]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
    index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN


In [169]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column in a `DataFrame` can be retrieved as a Series either by dict-like notation or by attribute:

In [170]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [171]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

Note that the returned Series have the same index as the DataFrame, and their `name` attribute has been appropriately set. Rows can also be retrieved by label with the special `loc` attribute (or by integer position with `iloc`): 

In [172]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
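
As a brief sketch of the two row indexers, `loc` selects a row by its index label and `iloc` selects it by its integer position. The small frame below is a shortened, illustrative version of the data used above.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Nevada'],
        'year': [2000, 2001, 2001],
        'pop': [1.5, 1.7, 2.4]}
frame = pd.DataFrame(data, index=['one', 'two', 'three'])

by_label = frame.loc['three']      # row whose index label is 'three'
by_position = frame.iloc[2]        # third row, counting from zero

print(by_label.equals(by_position))
```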

Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values:

In [173]:
frame2['debt'] = 16.5
frame2

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5


When you are assigning lists or arrays to a column, the value’s length must match the length of the `DataFrame`. If you assign a `Series`, its labels will be realigned exactly to the DataFrame’s index, inserting missing values in any holes:

In [174]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val

frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN


Assigning a column that doesn’t exist will create a new column. 
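
A minimal sketch of creating a column by assignment and removing it again with `del`. The column name `eastern` and the small frame are assumptions made for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                      'year': [2000, 2001, 2001]})

# 'eastern' does not exist yet, so this assignment creates it.
frame['eastern'] = frame['state'] == 'Ohio'
columns_after_add = list(frame.columns)

# The del keyword removes a column again, as with a dict.
del frame['eastern']
columns_after_del = list(frame.columns)

print(columns_after_add)
print(columns_after_del)
```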

Thus the ``DataFrame`` can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.

<a id="df-dict"></a>
### 5.3.2.2 DataFrame as specialized dictionary
[[ go back to the top ]](#Table-of-contents)

Similarly, we can also think of a ``DataFrame`` as a specialization of a dictionary.
Where a dictionary maps a key to a value, a ``DataFrame`` maps a column name to a ``Series`` of column data.
For example, asking for the ``'area'`` attribute returns the ``Series`` object containing the areas we saw earlier:

In [175]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Notice the potential point of confusion here: in a two-dimensional NumPy array, ``data[0]`` will return the first *row*. For a ``DataFrame``, ``data['col0']`` will return the first *column*.
Because of this, it is probably better to think about ``DataFrame``s as generalized dictionaries rather than generalized arrays, though both ways of looking at the situation can be useful.
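
The difference can be seen in a short sketch; the array contents and the column names `col0` and `col1` are made up for illustration.

```python
import numpy as np
import pandas as pd

array = np.array([[1, 2], [3, 4], [5, 6]])
frame = pd.DataFrame(array, columns=['col0', 'col1'])

first_row = array[0]          # NumPy: integer indexing selects a row
first_column = frame['col0']  # DataFrame: [] with a label selects a column

print(first_row)
print(list(first_column))
```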

<a id="df-obj"></a>
### 5.3.2.3 Constructing DataFrame objects
[[ go back to the top ]](#Table-of-contents)

A Pandas ``DataFrame`` can be constructed in a variety of ways.
Here we'll give several examples.

#### From a single Series object

A ``DataFrame`` is a collection of ``Series`` objects, and a single-column ``DataFrame`` can be constructed from a single ``Series``:

In [176]:
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


#### From a list of dicts

Any list of dictionaries can be made into a ``DataFrame``.
We'll use a simple list comprehension to create some data:

In [177]:
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


Even if some keys in the dictionary are missing, Pandas will fill them in with ``NaN`` (i.e., "not a number") values:

In [178]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0


#### From a dictionary of Series objects

As we saw before, a ``DataFrame`` can be constructed from a dictionary of ``Series`` objects as well:

In [179]:
pd.DataFrame({'population': population,
              'area': area})

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
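
When the ``Series`` objects do not share exactly the same labels, the constructor aligns them on the union of their indexes and fills the gaps with ``NaN``. A minimal sketch with two deliberately mismatched, shortened ``Series``:

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'Texas': 695662, 'Florida': 170312})

# The row index is the union of both indexes; gaps are filled with NaN.
partial = pd.DataFrame({'population': population, 'area': area})
print(partial)
```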


#### From a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a ``DataFrame`` with any specified column and index names.
If omitted, an integer index will be used for each:

In [180]:
import numpy as np
import pandas as pd

pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

        foo       bar
a  0.611748  0.730560
b  0.716367  0.663102
c  0.808041  0.937762


#### From a NumPy structured array

A Pandas ``DataFrame`` operates much like a structured array, and can be created directly from one:

In [181]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [182]:
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0


<a id="index"></a>
### 5.3.3 Data Indexing and Selection

[[ go back to the top ]](#Table-of-contents)

Here we'll look at means of accessing and modifying values in Pandas ``Series`` and ``DataFrame`` objects.
We'll start with the simple case of the one-dimensional ``Series`` object, and then move on to the more complicated two-dimensional ``DataFrame`` object.

<a id="series-select"></a>
### 5.3.3.1 Data Selection in Series
[[ go back to the top ]](#Table-of-contents)

As we saw in the previous section, a ``Series`` object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.
If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.

#### Series as dictionary

Like a dictionary, the ``Series`` object provides a mapping from a collection of keys to a collection of values:

In [183]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [184]:
data['b']

0.5

We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:

In [185]:
'a' in data

True

In [186]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [187]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

``Series`` objects can even be modified with a dictionary-like syntax.
Just as you can extend a dictionary by assigning to a new key, you can extend a ``Series`` by assigning to a new index value:

In [188]:
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

This easy mutability of the objects is a convenient feature: under the hood, Pandas is making decisions about memory layout and data copying that might need to take place; the user generally does not need to worry about these issues.

#### Series as one-dimensional array

A ``Series`` builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays – that is, *slices*, *masking*, and *fancy indexing*.
Examples of these are as follows:

In [189]:
# slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [190]:
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [191]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [192]:
# indexing
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

Among these, slicing may be the source of the most confusion.
Notice that when slicing with an explicit index (i.e., ``data['a':'c']``), the final index is *included* in the slice, while when slicing with an implicit index (i.e., ``data[0:2]``), the final index is *excluded* from the slice.
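The difference between the two slicing conventions is easiest to see by comparing the labels that each slice returns. This is a small sketch using an illustrative ``Series`` named ``s`` (the name is an assumption; the values match the example above):

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25],
              index=['a', 'b', 'c', 'd', 'e'])

by_label = s['a':'c']     # label slice: the endpoint 'c' is included
by_position = s[0:2]      # positional slice: position 2 is excluded

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
```

The label slice returns three elements while the positional slice returns two, even though both are written with a start and a stop.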

#### Indexers: loc, iloc, and ix

These slicing and indexing conventions can be a source of confusion.
For example, if your ``Series`` has an explicit integer index, an indexing operation such as ``data[1]`` will use the explicit indices, while a slicing operation like ``data[1:3]`` will use the implicit Python-style index.

In [193]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [194]:
# explicit index when indexing
data[1]

'a'

In [195]:
# implicit index when slicing
data[1:3]

3    b
5    c
dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides some special *indexer* attributes that explicitly expose certain indexing schemes.
These are not functional methods, but attributes that expose a particular slicing interface to the data in the ``Series``.

First, the ``loc`` attribute allows indexing and slicing that always references the explicit index:

In [196]:
data.loc[1]

'a'

In [197]:
data.loc[1:3]

1    a
3    b
dtype: object

The ``iloc`` attribute allows indexing and slicing that always references the implicit Python-style index:

In [198]:
data.iloc[1]

'b'

In [199]:
data.iloc[1:3]

3    b
5    c
dtype: object

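A third indexing attribute, ``ix``, was a hybrid of the two, and for ``Series`` objects was equivalent to standard ``[]``-based indexing. It was deprecated in pandas 0.20 and removed in pandas 1.0, so new code should use ``loc`` or ``iloc`` instead.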
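To compare the two explicit indexers side by side, the sketch below repeats the integer-indexed example under the illustrative name ``s`` (an assumption) and stores each result in its own variable:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

label_item = s.loc[1]      # label 1
label_slice = s.loc[1:3]   # labels 1 through 3, endpoint included
pos_item = s.iloc[1]       # position 1 (the second element)
pos_slice = s.iloc[1:3]    # positions 1 and 2, endpoint excluded

print(label_item, list(label_slice.index))  # a [1, 3]
print(pos_item, list(pos_slice.index))      # b [3, 5]
```

Writing ``loc`` or ``iloc`` explicitly makes the intent visible to the reader, which is why it is generally preferable when the index contains integers.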

<a id="df-select"></a>
### 5.3.3.2 Data Selection in DataFrame
[[ go back to the top ]](#Table-of-contents)

Recall that a ``DataFrame`` acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of ``Series`` structures sharing the same index.
These analogies can be helpful to keep in mind as we explore data selection within this structure.

#### DataFrame as a dictionary

The first analogy we will consider is the ``DataFrame`` as a dictionary of related ``Series`` objects.
Let's return to our example of areas and populations of states:

In [200]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


The individual ``Series`` that make up the columns of the ``DataFrame`` can be accessed via dictionary-style indexing of the column name:

In [201]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Equivalently, we can use attribute-style access with column names that are strings:

In [202]:
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

This attribute-style column access actually accesses the exact same object as the dictionary-style access:

In [203]:
data.area is data['area']

True

Though this is a useful shorthand, keep in mind that it does not work for all cases!
If the column names are not strings, or if they conflict with methods of the ``DataFrame``, attribute-style access is not possible.
For example, the ``DataFrame`` has a ``pop()`` method, so ``data.pop`` refers to that method rather than to the ``"pop"`` column:

In [204]:
data.pop is data['pop']

False

In particular, you should avoid the temptation to try column assignment via attribute (i.e., use ``data['pop'] = z`` rather than ``data.pop = z``).
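The name clash can be inspected without modifying anything. The sketch below uses a small two-row frame named ``df`` (an illustrative assumption, not the full example above) to show what each access style returns:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662],
                   'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# 'area' does not clash with any DataFrame attribute, so both spellings agree
same_values = df.area.equals(df['area'])

# 'pop' clashes with DataFrame.pop(), so the attribute is the bound method
pop_is_method = callable(df.pop)

# Bracket access always returns the column, whatever its name
pop_column = df['pop']

print(same_values, pop_is_method)
print(pop_column.tolist())
```

Because bracket access never depends on the column name, it is the safer default in code that has to work with arbitrary column names.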

As with the ``Series`` objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case by adding a new column:

In [205]:
data['density'] = data['pop'] / data['area']
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


This shows a preview of the straightforward syntax of element-by-element arithmetic between ``Series`` objects.
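One detail worth knowing about this arithmetic is that pandas matches elements by index label rather than by position. The sketch below illustrates this with two ``Series`` whose states are deliberately listed in different orders; the names ``area_s``, ``pop_s``, and ``density_s`` are illustrative assumptions:

```python
import pandas as pd

area_s = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297})
# Same states, deliberately listed in a different order
pop_s = pd.Series({'New York': 19651127, 'California': 38332521, 'Texas': 26448193})

# The division pairs values by state name, not by position
density_s = pop_s / area_s

print(density_s['Texas'])
print(density_s['California'])
```

Each state's population is divided by that same state's area, regardless of where the state appears in either ``Series``.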

#### DataFrame as two-dimensional array

As mentioned previously, we can also view the ``DataFrame`` as an enhanced two-dimensional array.
We can examine the raw underlying data array using the ``values`` attribute:

In [206]:
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

With this picture in mind, many familiar array-like operations can be performed on the ``DataFrame`` itself.
For example, we can transpose the full ``DataFrame`` to swap rows and columns:

In [207]:
data.T

           California         Texas      New York       Florida      Illinois
area     4.239670e+05  6.956620e+05  1.412970e+05  1.703120e+05  1.499950e+05
pop      3.833252e+07  2.644819e+07  1.965113e+07  1.955286e+07  1.288214e+07
density  9.041393e+01  3.801874e+01  1.390767e+02  1.148061e+02  8.588376e+01


When it comes to indexing of ``DataFrame`` objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array.
In particular, passing a single index to an array accesses a row:

In [208]:
data.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

and passing a single "index" to a ``DataFrame`` accesses a column:

In [209]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
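The contrast between array-style and dictionary-style access can be summarized in a few lines. This is a sketch on a reduced three-row frame named ``df`` (an illustrative assumption); it uses ``to_numpy()``, which returns the same kind of array as the ``values`` attribute:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 141297],
                   'pop': [38332521, 26448193, 19651127]},
                  index=['California', 'Texas', 'New York'])

arr = df.to_numpy()          # plain 2-D NumPy array; the labels are dropped

first_row = arr[0]           # a single index on the array selects a row
first_col = df['area']       # a single key on the DataFrame selects a column
row_with_labels = df.iloc[0] # array-style row access that keeps the labels

print(arr.shape)
print(first_row.tolist())
print(row_with_labels['pop'])
```

When a row is needed from the ``DataFrame`` itself, ``iloc`` gives the array-style behavior without losing the column labels.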

#### Additional indexing conventions

There are a couple of extra indexing conventions that might seem at odds with the preceding discussion, but nevertheless can be very useful in practice.
First, while *indexing* refers to columns, *slicing* refers to rows:

In [210]:
data['Florida':'Illinois']

            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763


Such slices can also refer to rows by number rather than by index:

In [211]:
data[1:3]

            area       pop     density
Texas     695662  26448193   38.018740
New York  141297  19651127  139.076746


Similarly, direct masking operations are also interpreted row-wise rather than column-wise:

In [212]:
data[data.density > 100]

            area       pop     density
New York  141297  19651127  139.076746
Florida   170312  19552860  114.806121
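The row-wise conventions above also have explicit equivalents through the ``loc`` and ``iloc`` indexers, which work on a ``DataFrame`` as they do on a ``Series``. The following sketch rebuilds the states table under the illustrative name ``df`` (an assumption) and writes each selection with an explicit indexer:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 141297, 170312, 149995],
                   'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
                  index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
df['density'] = df['pop'] / df['area']

# Same rows as df['Florida':'Illinois'], but explicitly label-based
rows_by_label = df.loc['Florida':'Illinois']

# Same rows as df[1:3], but explicitly position-based
rows_by_position = df.iloc[1:3]

# Mask the rows and choose columns in a single step
dense = df.loc[df['density'] > 100, ['pop', 'density']]

print(list(rows_by_label.index))
print(list(rows_by_position.index))
print(list(dense.index), list(dense.columns))
```

The explicit forms are slightly longer, but they remove any doubt about whether a key refers to rows or to columns.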


The `del` keyword will delete columns as with a `dict`.

As an example of `del`, we first add a new column of boolean values where the state column equals 'Ohio':

In [213]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False


The `del` statement can then be used to remove this column:

In [214]:
del frame2['eastern']

In [215]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')
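For comparison, a column can also be removed with the ``drop()`` method, which returns a new ``DataFrame`` instead of changing the original. The sketch below uses a small frame named ``frame`` as a hypothetical stand-in for ``frame2``, whose construction is not repeated here:

```python
import pandas as pd

# Hypothetical stand-in for frame2 (reduced to three rows for illustration)
frame = pd.DataFrame({'year': [2000, 2001, 2002],
                      'state': ['Ohio', 'Ohio', 'Nevada'],
                      'pop': [1.5, 1.7, 2.4]},
                     index=['one', 'two', 'three'])
frame['eastern'] = frame['state'] == 'Ohio'

# drop() returns a new DataFrame and leaves the original untouched
without_flag = frame.drop(columns='eastern')
still_has_flag = 'eastern' in frame.columns

# del removes the column from the original object in place
del frame['eastern']

print(list(without_flag.columns))
print(still_has_flag, list(frame.columns))
```

Use ``del`` when the column is no longer needed anywhere, and ``drop()`` when the original frame should stay intact.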

Another common form of data is a nested `dict` of dicts:

In [216]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

If the nested `dict` is passed to the `DataFrame`, pandas will interpret the outer `dict` keys as the columns and the inner keys as the row indices:

In [217]:
frame3 = pd.DataFrame(pop)
frame3

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
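The mapping from nested keys to rows and columns can be verified programmatically. The sketch below uses the illustrative names ``nested`` and ``frame`` (assumptions) and calls ``sort_index()`` because the row order produced by the constructor may differ between pandas versions:

```python
import pandas as pd

nested = {'Nevada': {2001: 2.4, 2002: 2.9},
          'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Outer keys become columns, inner keys become row labels
frame = pd.DataFrame(nested).sort_index()

# True wherever an inner dict had no entry for that year
missing = frame.isna()

print(list(frame.columns))
print(list(frame.index))
print(int(missing.sum().sum()))
```

The only missing value is Nevada in 2000, because the inner dictionary for Nevada has no entry for that year.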


You can transpose the `DataFrame` (swap rows and columns) with similar syntax to a `NumPy` array:

In [218]:
frame3.T

        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6


If a DataFrame’s index and columns have their `name` attributes set, these will also be displayed:

In [219]:
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
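The axis names can also be set without mutating the original object by using the ``rename_axis()`` method. This is a brief sketch on an illustrative two-row frame named ``frame`` (an assumption, not the ``frame3`` object above):

```python
import pandas as pd

frame = pd.DataFrame({'Nevada': {2001: 2.4, 2002: 2.9},
                      'Ohio': {2001: 1.7, 2002: 3.6}})

# rename_axis returns a new DataFrame with the axis names set
named = frame.rename_axis(index='year', columns='state')

print(named.index.name, named.columns.name)
print(frame.index.name)
```

The original ``frame`` keeps its unnamed axes, which can be convenient inside a chain of method calls.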


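As with `Series`, the `values` attribute returns the data contained in the `DataFrame` as a two-dimensional `ndarray`. Because the columns of `frame2` have different dtypes, the resulting array has dtype `object`, which can hold all of them: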

In [220]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)
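How the dtype of the resulting array is chosen can be seen by comparing frames with different column types. The sketch below uses three small illustrative frames (all names are assumptions) and ``to_numpy()``, which returns the same kind of array as ``values``:

```python
import numpy as np
import pandas as pd

all_float = pd.DataFrame({'pop': [1.5, 1.7], 'debt': [-1.2, -1.5]})
int_and_float = pd.DataFrame({'year': [2000, 2001], 'pop': [1.5, 1.7]})
mixed = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

float_dtype = all_float.to_numpy().dtype       # float columns stay float64
upcast_dtype = int_and_float.to_numpy().dtype  # integers are upcast to float64
object_dtype = mixed.to_numpy().dtype          # strings and floats need object

print(float_dtype, upcast_dtype, object_dtype)
```

This is also why the integer ``area`` and ``pop`` columns appeared as floating-point numbers in the ``data.values`` output shown earlier in this section.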

<a id="read"></a>
## 6. Further reading

[[ go back to the top ]](#Table-of-contents)

This notebook covers a broad variety of topics but skips over many of the specifics. If you're looking to dive deeper into a particular topic, here's some recommended reading.

**Data Science**: William Chen compiled a [list of free books](http://www.wzchen.com/data-science-books/) for newcomers to Data Science, ranging from the basics of R & Python to Machine Learning to interviews and advice from prominent data scientists.

**Machine Learning**: /r/MachineLearning has a useful [Wiki page](https://www.reddit.com/r/MachineLearning/wiki/index) containing links to online courses, books, data sets, etc. for Machine Learning. There's also a [curated list](https://github.com/josephmisiti/awesome-machine-learning) of Machine Learning frameworks, libraries, and software sorted by language.

**Unit testing**: Dive Into Python 3 has a [great walkthrough](http://www.diveintopython3.net/unit-testing.html) of unit testing in Python, how it works, and how it should be used.

**pandas** has [several tutorials](http://pandas.pydata.org/pandas-docs/stable/tutorials.html) covering its myriad features.

**scikit-learn** has a [bunch of tutorials](http://scikit-learn.org/stable/tutorial/index.html) for those looking to learn Machine Learning in Python. Andreas Mueller's [scikit-learn workshop materials](https://github.com/amueller/scipy_2015_sklearn_tutorial) are top-notch and freely available.

**matplotlib** has many [books, videos, and tutorials](http://matplotlib.org/resources/index.html) to teach plotting in Python.

**Seaborn** has a [basic tutorial](http://stanford.edu/~mwaskom/software/seaborn/tutorial.html) covering most of the statistical plotting features.