# Chapter 7 Array-Oriented Programming with NumPy

## 7.1 Introduction

* NumPy, which first appeared in 2006, is the preferred Python array implementation.
* It offers a high-performance, richly functional *n*-dimensional array type called `ndarray`.
    * Referred to as `array` from now on.
* Operations on `array`s are up to 100x faster than those on lists.
* Over 450 Python libraries depend on NumPy.

* A strength of NumPy is "array-oriented programming," which uses functional-style programming with *internal* iteration to make array manipulation concise and straightforward, eliminating the kinds of bugs that can occur with the *external* iteration of explicitly programmed loops.

* **Supplements**
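
As a supplement to the array-oriented programming point above, here is a minimal sketch contrasting *external* iteration (an explicit loop over a list) with *internal* iteration (a single `array` expression). The variable names and sample values are illustrative assumptions, not taken from the chapter.

```python
import numpy as np

values = [2, 3, 5, 7, 11]  # hypothetical sample data

# External iteration: the loop and the result list are written by hand.
doubled_list = []
for value in values:
    doubled_list.append(value * 2)

# Internal iteration: NumPy walks through the elements for you.
doubled_array = np.array(values) * 2

print(doubled_list)
print(doubled_array)
```

Both approaches compute the same values; the `array` version simply leaves no loop bookkeeping to get wrong.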

## 7.2 Creating `array`s from Existing Data

* The NumPy documentation recommends importing the `numpy` module as `np`.

In [2]:
import numpy as np

The `array` function receives as an argument an `array` or other collection of elements and returns a new `array` containing the argument's elements.

In [7]:
numbers = np.array([2,3,5,7,11])
print(type(numbers))
print(repr(numbers))
print(numbers)

<class 'numpy.ndarray'>
array([ 2,  3,  5,  7, 11])
[ 2  3  5  7 11]


* When displaying an `array` with `repr`, NumPy separates each value from the next with a comma and a space and *right-aligns* all the values using the same field width; `print` shows the same alignment without the commas.

### Multidimensional Arguments

* The `array` function copies its argument's dimensions.

In [8]:
np.array([[1,2,3],[4,5,6]])

array([[1, 2, 3],
       [4, 5, 6]])

## 7.3 `array` Attributes

* An `array` object provides **attributes** that enable you to discover information about its structure and contents.

In [10]:
import numpy as np

integers = np.array([[1,2,3],[4,5,6]])
print(repr(integers))

floats = np.array([0.0,0.1,0.2,0.3,0.4])
print(repr(floats))

array([[1, 2, 3],
       [4, 5, 6]])
array([0. , 0.1, 0.2, 0.3, 0.4])


* NumPy does not display trailing 0s to the right of the decimal point in floating-point values.

### Determining an `array`'s Element Type

* The `array` function determines an `array`'s element type from its argument's elements.
* You can check the element type with an `array`'s **dtype** attribute.

In [11]:
print(integers.dtype)
print(floats.dtype)

int64
float64


* For performance reasons, NumPy is written in the C programming language and uses C's data types.
* By default, NumPy stores integers as type `int64` and floating-point numbers as type `float64` on most platforms.
    * The most commonly used data types are `int64`, `float64`, `bool` (for Boolean) and `object` for non-numeric data (such as strings).
    * The 64 is the element size in bits: 8 bytes * 8 bits/byte.
* **supplements**
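
The relationship between `dtype` and `itemsize` can be sketched by requesting element types explicitly. This example assumes the `dtype` keyword argument of `np.array`, which the notes above only mention for `zeros` and `ones`; the variable names are hypothetical.

```python
import numpy as np

# Request a 32-bit integer type instead of the default integer type.
small = np.array([1, 2, 3], dtype=np.int32)
print(small.dtype, small.itemsize)   # int32 4

# Boolean elements occupy one byte each.
flags = np.array([True, False, True])
print(flags.dtype, flags.itemsize)   # bool 1

# dtype=object stores references to arbitrary Python objects.
mixed = np.array([1, 'two', 3.0], dtype=object)
print(mixed.dtype)                   # object
```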

### Determining an `array`'s Dimensions

* The attribute `ndim` contains an `array`'s number of dimensions.
* The attribute `shape` contains a *tuple* specifying an `array`'s dimensions.

In [12]:
print(integers.ndim)
print(floats.ndim)

print(integers.shape)
print(floats.shape)

2
1
(2, 3)
(5,)


### Determining an `array`'s Number of Elements and Element Size

* You can view an `array`'s total number of elements with the attribute `size`.
* The attribute `itemsize` shows the number of bytes required to store each element.

In [13]:
print(integers.size)
print(integers.itemsize)

print(floats.size)
print(floats.itemsize)

6
8
5
8


### Iterating Through a Multidimensional `array`'s Elements

* `array`s are *iterable* and you can use external iteration.

In [14]:
for row in integers:
    for column in row:
        print(column,end=' ')
    print()

1 2 3 
4 5 6 


* You can iterate through a multidimensional `array` as if it were one-dimensional by using its `flat` attribute.

In [15]:
for i in integers.flat:
    print(i,end=' ')

1 2 3 4 5 6 

## 7.4 Filling `array`s with Specific Values

* NumPy provides functions **zeros**, **ones**, and **full** for creating `array`s containing 0s, 1s, or a specified value, respectively.
    * The first argument must be an integer or a tuple of integers specifying the desired dimensions.
    * For **zeros** and **ones**, you can specify the `array`'s element type with the `dtype` keyword argument; the default is `float64`.
    * The `array` returned by **full** contains elements with the second argument's value and type.

In [22]:
import numpy as np

array_1 = np.zeros(5)
print(repr(array_1))
print(array_1.itemsize)
print(array_1.dtype)

array_2 = np.ones((2,4),dtype=int)
print(repr(array_2))
print(array_2.itemsize)
print(array_2.dtype)

array_3 = np.full((3,5),13)
print(repr(array_3))
print(array_3.itemsize)
print(array_3.dtype)

array([0., 0., 0., 0., 0.])
8
float64
array([[1, 1, 1, 1],
       [1, 1, 1, 1]])
8
int64
array([[13, 13, 13, 13, 13],
       [13, 13, 13, 13, 13],
       [13, 13, 13, 13, 13]])
8
int64


## 7.5 Creating `array`s from Ranges

### Creating Integer Ranges with **arange**

* NumPy's **arange** function creates integer ranges &mdash; similar to the built-in **range** function.
* You can create an `array` by passing a **range** to the `array` function, but calling **arange** directly is more efficient.

In [23]:
import numpy as np

a1 = np.arange(5)
print(repr(a1))

a2 = np.arange(5,10)
print(repr(a2))

a3 = np.arange(10,1,-2)
print(repr(a3))

array([0, 1, 2, 3, 4])
array([5, 6, 7, 8, 9])
array([10,  8,  6,  4,  2])


### Creating Floating-Point Ranges with **linspace**

* NumPy's **linspace** function produces evenly spaced floating-point ranges.
    * The first two arguments specify the starting and ending values in the range, and the ending value is *included* in the `array`.
    * The optional keyword argument `num` specifies the number of evenly spaced values to produce &mdash; the default value is 50.

In [25]:
np.linspace(0,1.0,num=5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])
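
To make the spacing explicit, the sketch below uses two further `linspace` keyword arguments that the notes above do not cover, `retstep` and `endpoint`. Treat it as an illustration of how the ending value affects the spacing rather than a full description of the function.

```python
import numpy as np

# retstep=True also returns the spacing between consecutive values.
values, step = np.linspace(0, 1.0, num=5, retstep=True)
print(values)  # [0.   0.25 0.5  0.75 1.  ]
print(step)    # 0.25

# endpoint=False leaves the ending value out of the array.
open_ended = np.linspace(0, 1.0, num=5, endpoint=False)
print(open_ended)  # [0.  0.2 0.4 0.6 0.8]
```

With the ending value included, the spacing is (end - start) / (num - 1); with it excluded, the spacing is (end - start) / num.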

### Reshaping an `array`

* Using the `array` method **reshape**, you can transform any `array` provided that the new shape has the *same* number of elements as the original.
    * Otherwise, a `ValueError` occurs.

In [5]:
import numpy as np

a1 = np.arange(1,21)

print(a1)

a2 = a1.reshape(4,5)
print(a2)

a3 = a2.reshape(5,4)
print(a3)

print()

a1[0] = 0
print(a1)
print(a2)
print(a3)  # a2 and a3 are views of a1's data, so they change too

[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]
 [16 17 18 19 20]]
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]
 [17 18 19 20]]

[ 0  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
[[ 0  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]
 [16 17 18 19 20]]
[[ 0  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]
 [17 18 19 20]]
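
The output above changes in all three `array`s because the reshaped `array`s share the original's element data. The following sketch checks that with `np.shares_memory` (a function not used elsewhere in these notes) and also triggers the `ValueError` for a shape with the wrong number of elements.

```python
import numpy as np

a1 = np.arange(1, 21)
a3 = a1.reshape(4, 5).reshape(5, 4)

# The reshaped array shares its element data with a1, which is why
# an assignment such as a1[0] = 0 is visible through a3 as well.
print(np.shares_memory(a1, a3))  # True

# A new shape with a different number of elements is rejected.
try:
    a1.reshape(3, 7)  # 21 elements requested, 20 available
except ValueError as error:
    print('ValueError:', error)
```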


### Displaying Large `array`s

* When displaying large `array`s, NumPy drops the middle rows, columns, or both from the output.
* The notation `...` represents the missing data.

In [28]:
print(np.arange(1,100001).reshape(4,25000))
print(np.arange(1,100001).reshape(100,1000))

[[     1      2      3 ...  24998  24999  25000]
 [ 25001  25002  25003 ...  49998  49999  50000]
 [ 50001  50002  50003 ...  74998  74999  75000]
 [ 75001  75002  75003 ...  99998  99999 100000]]
[[     1      2      3 ...    998    999   1000]
 [  1001   1002   1003 ...   1998   1999   2000]
 [  2001   2002   2003 ...   2998   2999   3000]
 ...
 [ 97001  97002  97003 ...  97998  97999  98000]
 [ 98001  98002  98003 ...  98998  98999  99000]
 [ 99001  99002  99003 ...  99998  99999 100000]]


## 7.6 `list` vs. `array` Performance: Introducing `%timeit`

## 7.7 `array` Operators

NumPy provides many operators that enable you to write simple expressions that perform operations on entire `array`s.
* These operators work between `array`s and numeric values or between `array`s of the same shape.

### Arithmetic Operations with `array`s and Individual Numeric Values

* You can perform *element-wise arithmetic* with `array`s and numeric values by using arithmetic operators and augmented assignments.
    * With arithmetic operators, a *new* `array` is returned.
    * With augmented assignment, the left-hand-side operand is modified in place.

In [2]:
import numpy as np

numbers = np.arange(1,6)
print(numbers)

print(numbers * 2)
print(3 ** numbers)
print(numbers)

numbers += 5
print(numbers)

[1 2 3 4 5]
[ 2  4  6  8 10]
[  3   9  27  81 243]
[1 2 3 4 5]
[ 6  7  8  9 10]


### Broadcasting

* Normally, arithmetic operations require as operands two `array`s of the *same size* and *shape*.
* **Broadcasting**: When one operand is a single value, called **scalar**, NumPy performs the element-wise calculations as if the scalar were an `array` of the same shape as the other operand, but with the scalar value in all its elements.
    * For example, `numbers * 2` is equivalent to `numbers * [2, 2, 2, 2, 2]` assuming that `numbers` is an array of five elements.
* Broadcasting can be applied between `array`s of different sizes and shapes, enabling some concise and powerful manipulations.
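
As a small illustration of broadcasting between `array`s of different shapes, the sketch below adds a hypothetical 3-by-1 column to a 3-element row. The names and values are assumptions chosen for the example.

```python
import numpy as np

column = np.array([[0], [10], [20]])  # shape (3, 1)
row = np.array([1, 2, 3])             # shape (3,)

# NumPy treats both operands as if they had the common shape (3, 3).
table = column + row
print(table.shape)
print(table)
```

Each row of `table` is `row` plus the corresponding `column` value, so no explicit loop or manual tiling is needed.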

### Arithmetic Operations Between `array`s

* Arithmetic operations between `array`s of the *same* shape perform element-wise operations and return new `array`s, while augmented assignments store the element-wise results in the left operand.

In [4]:
numbers1 = np.arange(1,6)
numbers2 = np.linspace(1.1,5.5,5)
print(numbers1)
print(numbers2)

numbers3 = numbers1 * numbers2
print(numbers3)

[1 2 3 4 5]
[1.1 2.2 3.3 4.4 5.5]
[ 1.1  4.4  9.9 17.6 27.5]


### Comparing `array`s

* You can compare `array`s element-wise with individual values and with other `array`s.
* Such comparisons produce `array`s of Boolean values in which each element's `True` or `False` value indicates the comparison result.

In [7]:
numbers1 = np.arange(11,16)
print(numbers1)

print(numbers1 >= 13) # broadcasting

numbers2 = np.linspace(10,20,5)
print(numbers2)

print(numbers1 < 13)
print(numbers2 < numbers1)
print(numbers1 == numbers2)
print(numbers1 != numbers2)

[11 12 13 14 15]
[False False  True  True  True]
[10.  12.5 15.  17.5 20. ]
[ True  True False False False]
[ True False False False False]
[False False False False False]
[ True  True  True  True  True]


## 7.8 NumPy Calculation Methods

* An `array` has various methods that perform calculations using its contents.
* By default, `array` calculation methods ignore the `array`'s shape and use *all* the elements in the calculations.
    * You can perform these calculations on each dimension as well.

In [7]:
import numpy as np

grades = np.array([
    [87,96,70],
    [100,87,90],
    [94,77,90],
    [100,81,82]
])

print(grades.sum())
print(grades.min())
print(grades.max())
print(grades.mean())
print(grades.std())
print(grades.var())

1054
70
100
87.83333333333333
8.792357792739987
77.30555555555556


### Calculations by Row or Column

* Many calculation methods can be performed on specific `array` dimensions, known as the `array`'s *axes*.
* These methods receive an `axis` keyword argument that specifies which dimension to use in the calculation.
* Check https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html for more calculation methods.

In [8]:
print(grades.mean(axis=0))
print(grades.max(axis=1))

[95.25 85.25 83.  ]
[ 96 100  94 100]


## 7.9 Universal Functions

* NumPy offers dozens of standalone **universal functions** (or **ufuncs**) that perform various element-wise operations.
* Each performs its task using one or two `array` or array-like (such as lists) arguments.
* Some of these functions are called when you use operators like `+` and `*` on `array`s.
* Each returns a new `array` containing the results.
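
The points above can be sketched briefly: the `+` operator and the `add` universal function give the same result, and a universal function accepts an array-like argument such as a list. The variable names are hypothetical.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# The + operator and the add ufunc give the same element-wise result.
print(a + b)         # [11 22 33]
print(np.add(a, b))  # [11 22 33]

# ufuncs also accept array-like arguments such as lists and return arrays.
roots = np.sqrt([1, 4, 9])
print(type(roots), roots)
```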

In [10]:
import numpy as np

numbers1 = np.arange(1,7) ** 2
print(np.sqrt(numbers1))

numbers2 = np.arange(1,7) * 10
print(np.add(numbers1,numbers2))

[1. 2. 3. 4. 5. 6.]
[11 24 39 56 75 96]


### Broadcasting with Universal Functions

* You can view the broadcasting rules at: https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html.
* The complete list of universal functions, their descriptions, and more information can be found at https://docs.scipy.org/doc/numpy/reference/ufuncs.html.

In [14]:
print(np.multiply(numbers1,5))

numbers3 = numbers1.reshape(2,3)
print(numbers3)

numbers4 = np.array([2,4,6])
print(numbers4)

print(np.multiply(numbers3,numbers4))

[  5  20  45  80 125 180]
[[ 1  4  9]
 [16 25 36]]
[2 4 6]
[[  2  16  54]
 [ 32 100 216]]


## 7.10 Indexing and Slicing

* One-dimensional `array`s can be indexed and sliced using the same syntax and techniques as other Python sequences.
* There are some `array`-specific indexing and slicing capabilities.

### Indexing with Two-Dimensional `array`s

* To select an element in a two-dimensional `array`, specify a tuple containing the element's row and column indices in square brackets.

In [1]:
import numpy as np

grades = np.array([
    [87,96,70],
    [100,87,90],
    [94,77,90],
    [100,81,82]
])

grades[0,1]

96

### Selecting a Subset of a Two-Dimensional `array`'s Rows

* To select a single row, specify only one index in square brackets.
* To select multiple rows, use slice notation or a list of row indices.

In [4]:
print(grades)
print(grades[1])
print(grades[0:2])
print(grades[[1,3]])

[[ 87  96  70]
 [100  87  90]
 [ 94  77  90]
 [100  81  82]]
[100  87  90]
[[ 87  96  70]
 [100  87  90]]
[[100  87  90]
 [100  81  82]]


### Selecting a Subset of a Two-Dimensional `array`'s Columns

* You can select subsets of the columns by providing a tuple specifying the row(s) and column(s) to select.
* Each can be a specific index, a slice, or a list.

In [5]:
import numpy as np

grades = np.array([
    [87,96,70],
    [100,87,90],
    [94,77,90],
    [100,81,82]
])

print(grades[:,0])   # first column
print(grades[:,1:3]) # select consecutive columns using a slice
print(grades[:,[0,2]]) # select specific columns using a list of column indices

[ 87 100  94 100]
[[96 70]
 [87 90]
 [77 90]
 [81 82]]
[[ 87  70]
 [100  90]
 [ 94  90]
 [100  82]]


## 7.11 Views: Shallow Copies

* *View objects* are objects that "see" the data in other objects, rather than having their own copies of the data.
* Views are also known as **shallow copies**.

* The `array` method **view** returns a *new* array object with a *view* of the original `array` object's data.
    * The returned view and the original `array` are *different* objects.
    * But, the returned `array` views the *same* data as the original `array`.
    * Changing a value in the view also changes the value in the original `array`.

In [11]:
import numpy as np

numbers = np.arange(1,7)
numbers_view = numbers.view()

print(numbers)
print(numbers_view)

# The original array and the view are different objects.
print(type(numbers))
print(id(numbers))

print(type(numbers_view))
print(id(numbers_view))

# But they see the same data.
numbers[0] += 10
print(numbers)
print(numbers_view)

numbers_view[-1] *= 10
print(numbers)
print(numbers_view)

[1 2 3 4 5 6]
[1 2 3 4 5 6]
<class 'numpy.ndarray'>
4528057168
<class 'numpy.ndarray'>
4528410128
[11  2  3  4  5  6]
[11  2  3  4  5  6]
[11  2  3  4  5 60]
[11  2  3  4  5 60]


### Slice Views

Slices also create views.

In [13]:
import numpy as np

numbers = np.arange(0,6)
numbers_slice = numbers[1:4]

numbers[1] *= 10
print(numbers)
print(numbers_slice)

numbers_slice[2] *= 10
print(numbers)
print(numbers_slice)

[ 0 10  2  3  4  5]
[10  2  3]
[ 0 10  2 30  4  5]
[10  2 30]
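
Not every selection is a view. As a hedged sketch, the example below contrasts a slice with a selection made from a list of indices, which NumPy returns as an independent copy; `np.shares_memory` is used here only as a convenient check and does not appear elsewhere in these notes.

```python
import numpy as np

numbers = np.arange(0, 6)

slice_view = numbers[1:4]        # a slice: shares numbers' data
index_copy = numbers[[1, 2, 3]]  # a list of indices: independent copy

numbers[1] = 100
print(slice_view)   # [100   2   3]
print(index_copy)   # [1 2 3]

print(np.shares_memory(numbers, slice_view))  # True
print(np.shares_memory(numbers, index_copy))  # False
```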


## 7.12 Deep Copies

* Though views are *separate* `array` objects, they save memory by sharing element data from other `array`s.
* When sharing *mutable* values, sometimes it's necessary to create a **deep copy** with *independent* copies of the original data.
* The `array` method **copy** returns a new `array` object with a *deep copy* of the original `array` object's data.

In [14]:
import numpy as np

numbers = np.arange(0,6)
numbers_copy = numbers.copy()

print(numbers)
print(numbers_copy)

numbers[1] *= 10
print(numbers)
print(numbers_copy)

[0 1 2 3 4 5]
[0 1 2 3 4 5]
[ 0 10  2  3  4  5]
[0 1 2 3 4 5]


### Module `copy` &mdash; Shallow vs. Deep Copies for Other Types of Python Objects

If you need deep copies of other types of Python objects, pass them to the **copy** module's **deepcopy** function.
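
A minimal sketch of the difference for an ordinary Python object, using a hypothetical nested list:

```python
import copy

nested = [[1, 2], [3, 4]]  # hypothetical nested list

shallow = copy.copy(nested)   # new outer list, shared inner lists
deep = copy.deepcopy(nested)  # new outer list and new inner lists

nested[0][0] = 99
print(shallow[0][0])  # 99
print(deep[0][0])     # 1
```

The shallow copy still refers to the same inner lists as `nested`, so it sees the change; the deep copy does not.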

## 7.13 Reshaping and Transposing

### **reshape** vs. **resize**

* The array methods `reshape` and `resize` both enable you to change an array's dimensions.
* `reshape` returns a *view* (shallow copy) of the original `array` with the new dimensions.
    * It does *not* modify the original array.
* `resize` *modifies the original `array`'s shape*.

In [20]:
import numpy as np

grades = np.array([[87,96,70],[100,87,90]])
reshaped = grades.reshape(1,6)

print(grades)
print(reshaped)
print()

grades[0,0] = 80
print(grades)
print(reshaped)
print()

reshaped[0,2] = 10
print(grades)
print(reshaped)


[[ 87  96  70]
 [100  87  90]]
[[ 87  96  70 100  87  90]]

[[ 80  96  70]
 [100  87  90]]
[[ 80  96  70 100  87  90]]

[[ 80  96  10]
 [100  87  90]]
[[ 80  96  10 100  87  90]]


In [23]:
import numpy as np

grades = np.array([[87,96,70],[100,87,90]])

print(grades)
grades.resize(1,6)
print(grades)

[[ 87  96  70]
 [100  87  90]]
[[ 87  96  70 100  87  90]]


### **flatten** vs. **ravel**

* You can take a multidimensional array and flatten it into a single dimension with the methods **flatten** and **ravel**.
* Method `flatten` *deep copies* the original array's data.
* Method `ravel` produces a *view* of the original `array` whenever possible, which *shares* the original array's data.

In [25]:
import numpy as np

grades = np.array([[87,96,70],[100,87,90]])
flattened = grades.flatten()

print(grades)
print(flattened)
print()

grades[0,0] = 100
print(grades)
print(flattened)
print()

raveled = grades.ravel()
print(grades)
print(raveled)
print()

raveled[0] = 87
print(grades)
print(raveled)



[[ 87  96  70]
 [100  87  90]]
[ 87  96  70 100  87  90]

[[100  96  70]
 [100  87  90]]
[ 87  96  70 100  87  90]

[[100  96  70]
 [100  87  90]]
[100  96  70 100  87  90]

[[ 87  96  70]
 [100  87  90]]
[ 87  96  70 100  87  90]


### Transposing Rows and Columns

* The **T** attribute returns a transposed *view* (shallow copy) of the `array`.

In [27]:
import numpy as np

grades = np.array([[87,96,70],[100,87,90]])
print(grades)
print(grades.T)
print()

grades.T[0,0] = 100
print(grades)
print(grades.T)

[[ 87  96  70]
 [100  87  90]]
[[ 87 100]
 [ 96  87]
 [ 70  90]]

[[100  96  70]
 [100  87  90]]
[[100 100]
 [ 96  87]
 [ 70  90]]


### Horizontal and Vertical Stacking

* You can combine arrays by adding more columns or more rows &mdash; known as *horizontal stacking* and *vertical stacking*.
* **hstack**: horizontal stack function
* **vstack**: vertical stack function

In [31]:
import numpy as np

grades_1 = np.array([[87,96,70],[100,87,90]])
grades_2 = np.array([[94,77,90],[100,81,82]])

horizontal_stack = np.hstack((grades_1,grades_2))
print(horizontal_stack,end='\n\n')

vertical_stack = np.vstack((grades_1,grades_2))
print(vertical_stack,end='\n\n')

vertical_stack[0,0] = 100
print(grades_1)
print(vertical_stack)

[[ 87  96  70  94  77  90]
 [100  87  90 100  81  82]]

[[ 87  96  70]
 [100  87  90]
 [ 94  77  90]
 [100  81  82]]

[[ 87  96  70]
 [100  87  90]]
[[100  96  70]
 [100  87  90]
 [ 94  77  90]
 [100  81  82]]


## 7.14 Intro to Data Science: pandas `Series` and `DataFrame`s

* NumPy's `array` is optimized for homogeneous numeric data that's accessed via integer indices.
* Big data applications must support mixed data types, customized indexing, missing data, data that's not structured consistently, and data that needs to be manipulated into forms appropriate for the databases and data analysis packages you use.
* **Pandas** is the most popular library for dealing with such data.
    * It provides two key collections: `Series` for one-dimensional collections and `DataFrame`s for two-dimensional collections.
    * You can use pandas' `MultiIndex` to manipulate multi-dimensional data in the context of `Series` and `DataFrame`s.
* NumPy and pandas are intimately related.
    * `Series` and `DataFrame`s use arrays "under the hood."
    * `Series` and `DataFrame`s are valid arguments to many NumPy operations.
    * `array`s are valid arguments to many `Series` and `DataFrame` operations. 

### pandas `Series`

* A `Series` is an enhanced one-dimensional `array`.
* Whereas `array`s use only zero-based integer indices, `Series` support custom indexing, including even non-integer indices like strings.
* `Series` also offer additional capabilities that make them more convenient for many data-science oriented tasks.
    * For example, `Series` may have missing data, and many `Series` operations ignore missing data by default.
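
As a small illustration of how missing data is ignored by default, the sketch below builds a hypothetical `Series` with one missing value. The `skipna` keyword argument is not mentioned in the notes above and is shown only for contrast.

```python
import numpy as np
import pandas as pd

grades = pd.Series([87, np.nan, 94])  # hypothetical data, one value missing

print(grades.count())             # 2
print(grades.mean())              # 90.5
print(grades.mean(skipna=False))  # nan
```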

#### Creating a `Series` with Default Indices

* By default, a `Series` has integer indices numbered sequentially from 0.

In [33]:
import pandas as pd

grades1 = pd.Series([87,100,94])
print(grades1,end='\n\n')

grades2 = pd.Series(grades1)
print(grades2,end='\n\n')

grades3 = pd.Series({'one':1,2:'two',3:'three'})
print(grades3)


0     87
1    100
2     94
dtype: int64

0     87
1    100
2     94
dtype: int64

one        1
2        two
3      three
dtype: object


#### Displaying a `Series`

* Pandas displays a `Series` in two-column format with the indices *left aligned* in the left column and the values *right aligned* in the right column.
* After listing the `Series` elements, pandas shows the data type (`dtype`) of the underlying `array`'s elements.

#### Creating a `Series` with All Elements Having the Same Value

In [34]:
pd.Series(98.6,range(2,8))

2    98.6
3    98.6
4    98.6
5    98.6
6    98.6
7    98.6
dtype: float64

* The second argument is a one-dimensional iterable object (such as a list, an `array`, or a `range`) containing the `Series`' indices.
* The number of indices determines the number of elements.

#### Accessing a `Series`' Elements
* You can access a `Series`' elements via square brackets containing an index.

In [39]:
import pandas as pd

grades = pd.Series([87,100,94,95,90])

print(grades[0],end='\n\n')

print(grades[1:3])

87

1    100
2     94
dtype: int64


#### Producing Descriptive Statistics for a Series

In [40]:
print(grades.count(),end='\n\n')
print(grades.mean(),end='\n\n')
print(grades.min(),end='\n\n')
print(grades.max(),end='\n\n')
print(grades.std(),end='\n\n')

grades.describe()

5

93.2

87

100

4.969909455915671



count      5.000000
mean      93.200000
std        4.969909
min       87.000000
25%       90.000000
50%       94.000000
75%       95.000000
max      100.000000
dtype: float64

* The `Series` method **describe** produces all of these statistics in a single call.
    * The 25%, 50%, and 75% are **quartiles**.
    * The **interquartile range** is the 75% quartile minus the 25% quartile, which is another measure of dispersion, just like standard deviation and variance.
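
The interquartile range can be computed directly, as in this sketch for the same five grades. It uses the `Series` method `quantile`, which the notes above do not introduce.

```python
import pandas as pd

grades = pd.Series([87, 100, 94, 95, 90])

q1 = grades.quantile(0.25)  # same value as the 25% row of describe()
q3 = grades.quantile(0.75)  # same value as the 75% row of describe()
print(q3 - q1)  # 5.0
```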

#### Creating a `Series` with Custom Indices

You can specify *custom* indices with the `index` keyword argument.

In [41]:
grades = pd.Series([87,100,94],index=['Wally','Eva','Sam'])
print(grades)

Wally     87
Eva      100
Sam       94
dtype: int64


#### Dictionary Initializers

If you initialize a `Series` with a dictionary, its keys become the `Series`' indices, and its values become the `Series`' element values.

In [42]:
grades = pd.Series({'Wally':87,'Eva':100,'Sam':94})
print(grades)

Wally     87
Eva      100
Sam       94
dtype: int64


#### Accessing Elements of a `Series` Via Custom Indices

* In a `Series` with custom indices, you can access individual elements via square brackets containing a custom index value.
* If the custom indices are strings that could represent valid Python identifiers, pandas automatically adds them to the `Series` as attributes that you can access via a dot (.).
* `Series` also has *built-in* attributes. For example, the **dtype** attribute returns the underlying `array`'s element type, and the **values** attribute returns the underlying `array`.

In [44]:
print(grades['Eva'],end='\n\n')
print(grades['Wally'],end='\n\n')
print(grades.dtype,end='\n\n')
print(grades.values)

100

87

int64

[ 87 100  94]
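
The cell above does not show the dot notation, so here is a minimal sketch of it using the same three grades:

```python
import pandas as pd

grades = pd.Series({'Wally': 87, 'Eva': 100, 'Sam': 94})

# 'Wally' is a valid Python identifier, so it is available as an attribute.
print(grades.Wally)                     # 87
print(grades.Wally == grades['Wally'])  # True
```

Square brackets remain the safer choice when an index label might clash with one of the built-in attributes, such as `dtype` or `values`.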


#### Creating a Series of Strings

* If a `Series` contains strings, you can use its **str** attribute to call string methods on the elements.
* The **str** attribute provides many string-processing methods that are similar to those in Python's string type.

In [45]:
import pandas as pd

hardware = pd.Series(['Hammer','Saw','Wrench'])
print(hardware,end='\n\n')

has_a = hardware.str.contains('a')
print(has_a,end='\n\n')

upper_case = hardware.str.upper()
print(hardware)
print(upper_case)


0    Hammer
1       Saw
2    Wrench
dtype: object

0     True
1     True
2    False
dtype: bool

0    Hammer
1       Saw
2    Wrench
dtype: object
0    HAMMER
1       SAW
2    WRENCH
dtype: object


### `DataFrame`s

* A `DataFrame` is an enhanced two-dimensional `array`.
* Like `Series`, `DataFrame`s can have custom row and column indices, and offer additional operations and capabilities that make them more convenient for many data-science oriented tasks.
* `DataFrame`s also support missing data.
* Each column in a `DataFrame` is a `Series`.

#### Creating a `DataFrame` from a Dictionary

In [46]:
import pandas as pd

grades_dict = {
    'Wally':[87,96,70],
    'Eva': [100,87,90],
    'Sam': [94,77,90],
    'Katie': [100,81,82],
    'Bob': [83,65,85]
}

grades = pd.DataFrame(grades_dict)
print(grades)

   Wally  Eva  Sam  Katie  Bob
0     87  100   94    100   83
1     96   87   77     81   65
2     70   90   90     82   85


* The dictionary's *keys* become the column names and the *values* associated with each key become the element values in the corresponding column.
* By default, the row indices are auto-generated integers starting from 0.

#### Customizing a `DataFrame`'s Indices with the **index** Attribute

* We can specify custom indices with the `index` keyword argument.

In [49]:
import pandas as pd

grades_dict = {
    'Wally':[87,96,70],
    'Eva': [100,87,90],
    'Sam': [94,77,90],
    'Katie': [100,81,82],
    'Bob': [83,65,85]
}
grades = pd.DataFrame(grades_dict,index=['Test1','Test2','Test3'])
print(grades)

       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test2     96   87   77     81   65
Test3     70   90   90     82   85


* We can use the `index` attribute to change the `DataFrame`'s indices from sequential integers to labels.

In [51]:
import pandas as pd

grades_dict = {
    'Wally':[87,96,70],
    'Eva': [100,87,90],
    'Sam': [94,77,90],
    'Katie': [100,81,82],
    'Bob': [83,65,85]
}

grades = pd.DataFrame(grades_dict)
print(grades,end='\n\n')
grades.index = ['Test1','Test2','Test3']
print(grades)

   Wally  Eva  Sam  Katie  Bob
0     87  100   94    100   83
1     96   87   77     81   65
2     70   90   90     82   85

       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test2     96   87   77     81   65
Test3     70   90   90     82   85


* When specifying the indices, you must provide a one-dimensional collection that has the same number of elements as there are *rows* in the `DataFrame`.
* `Series` also provides an **index** attribute for changing an existing `Series`' indices.
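
The same technique applied to a `Series` can be sketched as follows, including what happens when the number of new indices does not match the number of elements. The data is the hypothetical grades from earlier examples.

```python
import pandas as pd

grades = pd.Series([87, 100, 94])
grades.index = ['Wally', 'Eva', 'Sam']
print(grades['Eva'])  # 100

# Two labels for three elements: pandas rejects the assignment.
try:
    grades.index = ['Wally', 'Eva']
except ValueError as error:
    print('ValueError:', error)
```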

#### Accessing a `DataFrame`'s Columns

* One benefit of pandas is that you can quickly and conveniently look at your data in many different ways, including selecting portions of the data.

In [52]:
import pandas as pd

grades_dict = {
    'Wally':[87,96,70],
    'Eva': [100,87,90],
    'Sam': [94,77,90],
    'Katie': [100,81,82],
    'Bob': [83,65,85]
}
grades = pd.DataFrame(grades_dict,index=['Test1','Test2','Test3'])
print(grades,end='\n\n')

print(grades['Eva'])

       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test2     96   87   77     81   65
Test3     70   90   90     82   85

Test1    100
Test2     87
Test3     90
Name: Eva, dtype: int64


* If a `DataFrame`'s column-name strings are valid Python identifiers, you can use them as attributes.

In [53]:
print(grades.Sam)

Test1    94
Test2    77
Test3    90
Name: Sam, dtype: int64


#### Selecting Rows via the `loc` and `iloc` Attributes

* Though `DataFrame`s support indexing capabilities with [], the pandas documentation recommends using the attributes `loc`, `iloc`, `at`, and `iat`, which are optimized to access `DataFrame`s and also provide additional capabilities beyond what you can do with [].
* The documentation also states that indexing with [] *often* produces a copy of the data, which is a logic error if you attempt to assign new values to the `DataFrame` by assigning to the result of the [] operation.
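
As a hedged sketch of the recommended style, the example below assigns a new value through a single `loc` operation instead of chaining two `[]` operations (for example, `grades['Eva']['Test2'] = 95`), which is the pattern the warning above is about. The smaller two-column `DataFrame` is an assumption made to keep the example short.

```python
import pandas as pd

grades = pd.DataFrame(
    {'Wally': [87, 96, 70], 'Eva': [100, 87, 90]},
    index=['Test1', 'Test2', 'Test3'])

# One loc operation selects the row and column together, so the
# assignment targets the original DataFrame rather than a temporary copy.
grades.loc['Test2', 'Eva'] = 95
print(grades)
```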

In [55]:
import pandas as pd

grades_dict = {
    'Wally':[87,96,70],
    'Eva': [100,87,90],
    'Sam': [94,77,90],
    'Katie': [100,81,82],
    'Bob': [83,65,85]
}
grades = pd.DataFrame(grades_dict,index=['Test1','Test2','Test3'])

print(grades.loc['Test1'],end='\n\n')
print(grades.iloc[1])

Wally     87
Eva      100
Sam       94
Katie    100
Bob       83
Name: Test1, dtype: int64

Wally    96
Eva      87
Sam      77
Katie    81
Bob      65
Name: Test2, dtype: int64


#### Selecting Rows via Slices and Lists with the `loc` and `iloc` Attributes

* The index can be a *slice*.
* When using slices containing labels with `loc`, the range specified *includes* the high index.
* When using slices containing integer indices with `iloc`, the range you specify *excludes* the high index.
* To select *specific rows*, use a *list* rather than slice notation with `loc` or `iloc`.

In [62]:
import pandas as pd

grades_dict = {
    'Wally':[87,96,70],
    'Eva': [100,87,90],
    'Sam': [94,77,90],
    'Katie': [100,81,82],
    'Bob': [83,65,85]
}
grades = pd.DataFrame(grades_dict,index=['Test1','Test2','Test3'])

print(grades.loc['Test1':'Test3'],end='\n\n')
print(grades.iloc[0:2],end='\n\n')

print(grades.loc[['Test1','Test3']],end='\n\n')
print(grades.iloc[[0,2]])

       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test2     96   87   77     81   65
Test3     70   90   90     82   85

       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test2     96   87   77     81   65

       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test3     70   90   90     82   85

       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test3     70   90   90     82   85


#### Selecting Subsets of the Rows and Columns

* You can focus on small subsets of a `DataFrame` by selecting rows *and* columns using two slices, two lists, or a combination of slices and lists.

In [64]:
import pandas as pd

grades_dict = {
    'Wally':[87,96,70],
    'Eva': [100,87,90],
    'Sam': [94,77,90],
    'Katie': [100,81,82],
    'Bob': [83,65,85]
}
grades = pd.DataFrame(grades_dict,index=['Test1','Test2','Test3'])

print(grades.loc['Test1':'Test2',['Eva','Katie']],end='\n\n')
print(grades.iloc[[0,2],0:3])

       Eva  Katie
Test1  100    100
Test2   87     81

       Wally  Eva  Sam
Test1     87  100   94
Test3     70   90   90


#### Boolean Indexing

* One of pandas' more powerful selection capabilities is **Boolean indexing**.

In [66]:
import pandas as pd

grades_dict = {
    'Wally':[87,96,70],
    'Eva': [100,87,90],
    'Sam': [94,77,90],
    'Katie': [100,81,82],
    'Bob': [83,65,85]
}
grades = pd.DataFrame(grades_dict,index=['Test1','Test2','Test3'])

print(grades >= 90,end='\n\n')
grades[grades >= 90]

       Wally    Eva    Sam  Katie    Bob
Test1  False   True   True   True  False
Test2   True  False  False  False  False
Test3  False   True   True  False  False



       Wally    Eva   Sam  Katie  Bob
Test1    NaN  100.0  94.0  100.0  NaN
Test2   96.0    NaN   NaN    NaN  NaN
Test3    NaN   90.0  90.0    NaN  NaN


* Pandas checks every grade to determine whether its value is greater than or equal to 90 and, if so, includes it in the new `DataFrame`.
* Grades for which the condition is `False` are represented as **NaN (not a number)** in the new `DataFrame`.
* `NaN` is pandas' notation for missing values.

In [68]:
print(grades[(grades >= 80) & (grades < 90)])

       Wally   Eva  Sam  Katie   Bob
Test1   87.0   NaN  NaN    NaN  83.0
Test2    NaN  87.0  NaN   81.0   NaN
Test3    NaN   NaN  NaN   82.0  85.0


* Pandas Boolean indices combine multiple conditions with the Python operator & (bitwise AND), *not* the `and` Boolean operator.
* For `or` conditions, use | (bitwise OR).
* NumPy also supports Boolean indexing for `array`s, but always returns a one-dimensional array containing only the values that satisfy the condition.
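
The last two points can be sketched with a smaller, hypothetical two-column version of the grades data: an "or" condition in pandas, followed by NumPy Boolean indexing on the same values.

```python
import numpy as np
import pandas as pd

grades = pd.DataFrame(
    {'Wally': [87, 96, 70], 'Eva': [100, 87, 90]},
    index=['Test1', 'Test2', 'Test3'])

# "or" condition: each comparison needs its own parentheses.
extremes = grades[(grades < 80) | (grades >= 100)]
print(extremes)

# NumPy Boolean indexing returns the matches as a one-dimensional array.
array_grades = np.array([[87, 100], [96, 87], [70, 90]])
high = array_grades[array_grades >= 90]
print(high)  # [100  96  90]
```

The pandas result keeps the original shape and fills non-matching cells with `NaN`, while the NumPy result keeps only the matching values.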

#### Accessing a Specific `DataFrame` Cell by Row and Column

* You can use a `DataFrame`'s `at` and `iat` attributes to get a single value from a `DataFrame`.
* Like `loc` and `iloc`, `at` uses labels and `iat` uses integer indices.
* In each case, the row and column indices must be separated by a comma.

In [69]:
print(grades.at['Test2','Eva'],end='\n\n')
print(grades.iat[2,0])

87

70


* You can also assign new values to specific elements.

In [70]:
print(grades,end='\n\n')

grades.at['Test2','Eva'] = 100
print(grades,end='\n\n')

grades.iat[1,2] = 87
print(grades)

       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test2     96   87   77     81   65
Test3     70   90   90     82   85

       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test2     96  100   77     81   65
Test3     70   90   90     82   85

       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test2     96  100   87     81   65
Test3     70   90   90     82   85


#### Descriptive Statistics

* Both `Series` and `DataFrame`s have a **describe** method that calculates basic descriptive statistics for the data; called on a `DataFrame`, it returns them as a `DataFrame`.
* In a `DataFrame`, the statistics are calculated by column.

In [71]:
import pandas as pd

grades_dict = {
    'Wally':[87,96,70],
    'Eva': [100,87,90],
    'Sam': [94,77,90],
    'Katie': [100,81,82],
    'Bob': [83,65,85]
}
grades = pd.DataFrame(grades_dict,index=['Test1','Test2','Test3'])

statistics = grades.describe()
print(statistics)

           Wally         Eva        Sam       Katie        Bob
count   3.000000    3.000000   3.000000    3.000000   3.000000
mean   84.333333   92.333333  87.000000   87.666667  77.666667
std    13.203535    6.806859   8.888194   10.692677  11.015141
min    70.000000   87.000000  77.000000   81.000000  65.000000
25%    78.500000   88.500000  83.500000   81.500000  74.000000
50%    87.000000   90.000000  90.000000   82.000000  83.000000
75%    91.500000   95.000000  92.000000   91.000000  84.000000
max    96.000000  100.000000  94.000000  100.000000  85.000000


* By default, pandas calculates the descriptive statistics with floating-point values and displays them with six digits of precision.
* You can control the precision and other default settings with pandas' **set_option** function.

In [76]:
pd.set_option('display.precision',2)
print(grades.describe(),end='\n\n')

print(grades.mean())

       Wally     Eva    Sam   Katie    Bob
count   3.00    3.00   3.00    3.00   3.00
mean   84.33   92.33  87.00   87.67  77.67
std    13.20    6.81   8.89   10.69  11.02
min    70.00   87.00  77.00   81.00  65.00
25%    78.50   88.50  83.50   81.50  74.00
50%    87.00   90.00  90.00   82.00  83.00
75%    91.50   95.00  92.00   91.00  84.00
max    96.00  100.00  94.00  100.00  85.00

Wally    84.33
Eva      92.33
Sam      87.00
Katie    87.67
Bob      77.67
dtype: float64
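
`set_option` changes the display precision for the rest of the session. As a hedged sketch, pandas also provides `get_option` to read a setting and the `option_context` context manager to apply a setting temporarily. The names `before`, `inside`, and `after` are illustrative only.

```python
import pandas as pd

grades = pd.DataFrame(
    {'Wally': [87, 96, 70], 'Eva': [100, 87, 90], 'Sam': [94, 77, 90],
     'Katie': [100, 81, 82], 'Bob': [83, 65, 85]},
    index=['Test1', 'Test2', 'Test3'])

before = pd.get_option('display.precision')  # current global setting

# The precision of 2 applies only inside the with block
with pd.option_context('display.precision', 2):
    inside = pd.get_option('display.precision')
    print(grades.mean(), end='\n\n')

after = pd.get_option('display.precision')  # restored automatically
print(before, inside, after)
```

The setting affects only how values are displayed; the stored values keep their full precision.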


#### Transposing the `DataFrame` with the **T** Attribute

* You can quickly **transpose** the rows and columns by using the **T** attribute.
* **T** returns a transposed *view* (not a copy) of the `DataFrame`.

In [78]:
print(grades.T,end='\n\n')

print(grades.T.describe(),end='\n\n')
print(grades.T.mean())

       Test1  Test2  Test3
Wally     87     96     70
Eva      100     87     90
Sam       94     77     90
Katie    100     81     82
Bob       83     65     85

        Test1  Test2  Test3
count    5.00   5.00   5.00
mean    92.80  81.20  83.40
std      7.66  11.54   8.23
min     83.00  65.00  70.00
25%     87.00  77.00  82.00
50%     94.00  81.00  85.00
75%    100.00  87.00  90.00
max    100.00  96.00  90.00

Test1    92.8
Test2    81.2
Test3    83.4
dtype: float64
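
Transposing is one way to get per-test statistics. As a minimal sketch under the same `grades` data, passing `axis=1` to `mean` averages across each row directly and should give the same per-test averages as `grades.T.mean()`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

grades = pd.DataFrame(
    {'Wally': [87, 96, 70], 'Eva': [100, 87, 90], 'Sam': [94, 77, 90],
     'Katie': [100, 81, 82], 'Bob': [83, 65, 85]},
    index=['Test1', 'Test2', 'Test3'])

by_test_transposed = grades.T.mean()  # transpose, then average each column
by_test_axis = grades.mean(axis=1)    # average across each row directly

print(by_test_axis, end='\n\n')

# Confirm that both approaches produce the same averages
print(np.allclose(by_test_transposed, by_test_axis))
```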


#### Sorting Rows by Their Indices

* You can sort a `DataFrame` by its rows or columns, based on their indices or values.
* The **sort_index** method returns a new `DataFrame` containing the sorted data.

In [80]:
import pandas as pd

grades_dict = {
    'Wally':[87,96,70],
    'Eva': [100,87,90],
    'Sam': [94,77,90],
    'Katie': [100,81,82],
    'Bob': [83,65,85]
}
grades = pd.DataFrame(grades_dict,index=['Test1','Test2','Test3'])

sorted_grades = grades.sort_index(ascending=False) # default is True
print(sorted_grades)

       Wally  Eva  Sam  Katie  Bob
Test3     70   90   90     82   85
Test2     96   87   77     81   65
Test1     87  100   94    100   83


#### Sorting by Column Indices

* Passing the **`axis=1`** keyword argument indicates that we wish to sort the *column* indices rather than the row indices; `axis=0` (the default) sorts the *row* indices.

In [81]:
import pandas as pd

grades_dict = {
    'Wally':[87,96,70],
    'Eva': [100,87,90],
    'Sam': [94,77,90],
    'Katie': [100,81,82],
    'Bob': [83,65,85]
}
grades = pd.DataFrame(grades_dict,index=['Test1','Test2','Test3'])

grades.sort_index(axis=1)

       Bob  Eva  Katie  Sam  Wally
Test1   83  100    100   94     87
Test2   65   87     81   77     96
Test3   85   90     82   90     70


#### Sorting by Column Values

* We can call the method **`sort_values`** to sort by column values.
* The `by` and `axis` keyword arguments work together to determine which values will be sorted.

In [85]:
grades_dict = {
    'Wally':[87,96,70],
    'Eva': [100,87,90],
    'Sam': [94,77,90],
    'Katie': [100,81,82],
    'Bob': [83,65,85]
}
grades = pd.DataFrame(grades_dict,index=['Test1','Test2','Test3'])
print(grades,end='\n\n')

print(grades.sort_values(by='Test1',axis=1,ascending=False),end='\n\n')
print(grades.sort_values(by='Sam')) # default is axis=0

       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test2     96   87   77     81   65
Test3     70   90   90     82   85

       Eva  Katie  Sam  Wally  Bob
Test1  100    100   94     87   83
Test2   87     81   77     96   65
Test3   90     82   90     70   85

       Wally  Eva  Sam  Katie  Bob
Test2     96   87   77     81   65
Test3     70   90   90     82   85
Test1     87  100   94    100   83


* We can also sort the transposed `DataFrame` instead.

In [86]:
print(grades.T.sort_values(by='Test1',ascending=False))

       Test1  Test2  Test3
Eva      100     87     90
Katie    100     81     82
Sam       94     77     90
Wally     87     96     70
Bob       83     65     85
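
In the output above, Eva and Katie tie on `Test1`. As a sketch of one way to break such ties, `sort_values` also accepts a list of labels for `by` and a matching list of Booleans for `ascending`. Using `Test2` in ascending order as the tie-breaker is an arbitrary choice made for illustration, as is the name `ranked`.

```python
import pandas as pd

grades = pd.DataFrame(
    {'Wally': [87, 96, 70], 'Eva': [100, 87, 90], 'Sam': [94, 77, 90],
     'Katie': [100, 81, 82], 'Bob': [83, 65, 85]},
    index=['Test1', 'Test2', 'Test3'])

# Sort students by Test1 (highest first); break ties by Test2 (lowest first)
ranked = grades.T.sort_values(by=['Test1', 'Test2'], ascending=[False, True])
print(ranked)
```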


* We can combine selection with sorting to view data of interest only.

In [91]:
print(grades.loc['Test1'].sort_values(ascending=False),end='\n\n')

print(grades.iloc[0:2].T.sort_values(by='Test2',ascending=False))

Eva      100
Katie    100
Sam       94
Wally     87
Bob       83
Name: Test1, dtype: int64

       Test1  Test2
Wally     87     96
Eva      100     87
Katie    100     81
Sam       94     77
Bob       83     65


#### Copy vs. In-Place Sorting

* By default, `sort_index` and `sort_values` return a *copy* of the original `DataFrame`, which could require substantial memory in a big data application.
* To sort the `DataFrame` *in place* rather than *copying* the data, pass the keyword argument `inplace=True` to either `sort_index` or `sort_values`.

In [96]:
print(grades,end='\n\n')
grades.sort_values(by='Test1',axis=1,ascending=False,inplace=True)

print(grades)

       Wally  Eva  Sam  Katie  Bob
Test1     87  100   94    100   83
Test2     96   87   77     81   65
Test3     70   90   90     82   85

       Eva  Katie  Sam  Wally  Bob
Test1  100    100   94     87   83
Test2   87     81   77     96   65
Test3   90     82   90     70   85
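
The difference between the two modes can be sketched as follows, assuming the same `grades` data. The names `sorted_copy` and `result` are illustrative. Note that with `inplace=True` the method returns `None`, so assigning its result back to `grades` would discard the `DataFrame`.

```python
import pandas as pd

grades = pd.DataFrame(
    {'Wally': [87, 96, 70], 'Eva': [100, 87, 90], 'Sam': [94, 77, 90],
     'Katie': [100, 81, 82], 'Bob': [83, 65, 85]},
    index=['Test1', 'Test2', 'Test3'])

# Default: the original is untouched and a new DataFrame is returned
sorted_copy = grades.sort_values(by='Test1', axis=1, ascending=False)
original_columns = list(grades.columns)
print(original_columns)
print(list(sorted_copy.columns))

# inplace=True: grades itself is reordered and the call returns None
result = grades.sort_values(by='Test1', axis=1, ascending=False, inplace=True)
print(result)
print(list(grades.columns))
```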


#### Practice
Given the `temps` dictionary, perform the following tasks:
1. Convert the dictionary into the `DataFrame` named `temperatures` with `'Low'` and `'High'` as the indices, then display the `DataFrame`.
2. Use the column names to select only the columns for `'Mon'` through `'Wed'`.
3. Use the row index `'Low'` to select only the low temperatures for each day.
4. Set the floating-point precision to 2, then calculate the average temperature for each day.
5. Calculate the average low and high temperatures.

In [102]:
temps = {
    'Mon':[68,89],'Tue':[71,93],'Wed':[66,82],
    'Thu':[75,97],'Fri':[62,79]
}

# task 1
temperatures = pd.DataFrame(temps,index=['Low','High'])
print(temperatures,end='\n\n')
# task 2
print(temperatures.loc[:,'Mon':'Wed'],end='\n\n')
# task 3
print(temperatures.loc['Low'],end='\n\n')
# task 4
pd.set_option('display.precision',2)
print(temperatures.mean(),end='\n\n')
# task 5
print(temperatures.mean(axis=1))

      Mon  Tue  Wed  Thu  Fri
Low    68   71   66   75   62
High   89   93   82   97   79

      Mon  Tue  Wed
Low    68   71   66
High   89   93   82

Mon    68
Tue    71
Wed    66
Thu    75
Fri    62
Name: Low, dtype: int64

Mon    78.5
Tue    82.0
Wed    74.0
Thu    86.0
Fri    70.5
dtype: float64

Low     68.4
High    88.0
dtype: float64


## 7.15 Wrap-Up

#### NumPy
We have explored the use of NumPy's high-performance `ndarray`s for 

* storing and retrieving data, and
* performing common data manipulations

concisely and with reduced chance of errors.

You should be familiar with:

* Creating and initializing one- and two-dimensional `array`s, and referring to their individual elements.
* Use of attributes to determine an `array`'s size, shape, and element type.

Operations on `array`s:

* Use `array` operators and universal functions to perform element-wise calculations.
* Broadcasting.
* Built-in `array` methods for performing calculations using all elements of an `array`, or row-by-row or column-by-column.
* Indexing and slicing.
* Reshaping.
* Shallow and deep copies.

#### pandas

* How to create and manipulate pandas `Series` and `DataFrame`s.
* How to customize `Series` and `DataFrame` indices.
* How to set display precision.
* How to access and select data.
* How to calculate descriptive statistics with `describe`.
* How to transpose rows and columns with the `T` attribute.
* Ways to sort `DataFrame`s.