# Introduction to Numpy and Pandas

The following tutorial contains examples of using the numpy and pandas library modules. Read the step-by-step instructions below carefully. To execute the code, click on the cell and press the SHIFT-ENTER keys simultaneously.

## 1 Introduction to Numpy

Numpy, which stands for Numerical Python, is a Python library for numerical computation. Its basic data structure is a multi-dimensional array object called ndarray. Numpy provides a suite of functions that can efficiently manipulate the elements of an ndarray. The full list is in the [Numpy routines reference](https://docs.scipy.org/doc/numpy/reference/routines.html#routines).

### 1.1 Creating ndarray

An ndarray can be created from a list or tuple object.

In [None]:
import numpy as np

In [None]:
oneDim = np.array([1.0,2,3,4,5])   # a 1-dimensional array (vector)
print(oneDim)
print("#Dimensions =", oneDim.ndim)
print("Dimension =", oneDim.shape)
print("Size =", oneDim.size)
print("Array type =", oneDim.dtype)

In [None]:
twoDim = np.array([[1,2],[3,4],[5,6],[7,8]])  # a two-dimensional array (matrix)
print(twoDim)
print("#Dimensions =", twoDim.ndim)
print("Dimension =", twoDim.shape)
print("Size =", twoDim.size)
print("Array type =", twoDim.dtype)

In [None]:
arrFromTuple = np.array([(1,'a',3.0),(2,'b',3.5)])  # create ndarray from a list of tuples; mixed types are coerced to strings
print(arrFromTuple)
print("#Dimensions =", arrFromTuple.ndim)
print("Dimension =", arrFromTuple.shape)
print("Size =", arrFromTuple.size)

Numpy also has several built-in functions for creating ndarrays.

In [None]:
print(np.random.rand(5))      # random numbers from a uniform distribution over [0,1)
print(np.random.randn(5))     # random numbers from a normal distribution
print(np.arange(-10,10,2))    # similar to range, but returns ndarray instead of list
print(np.arange(12).reshape(3,4))  # reshape to a matrix
print(np.linspace(0,1,10))    # split interval [0,1] into 10 equally separated values
print(np.logspace(-3,3,7))    # create ndarray with values from 10^-3 to 10^3

In [None]:
print(np.zeros((2,3)))        # a matrix of zeros
print(np.ones((3,2)))         # a matrix of ones
print(np.eye(3))              # a 3 x 3 identity matrix (diagonal elements)

### 1.2 Element-wise Operations

You can apply standard operators such as addition and multiplication on each element of the ndarray.

In [None]:
x = np.array([1,2,3,4,5])

print(x + 1)      # addition
print(x - 1)      # subtraction
print(x * 2)      # multiplication
print(x // 2)     # integer division
print(x ** 2)     # square
print(x % 2)      # modulo  
print(1 / x)      # division

In [None]:
x = np.array([2,4,6,8,10])
y = np.array([1,2,3,4,5])

print(x + y)
print(x - y)
print(x * y)
print(x / y)
print(x // y)
print(x ** y)

### 1.3 Indexing and Slicing

There are various ways to select elements within an ndarray.

In [None]:
x = np.arange(-5,5)
print(x)

y = x[3:5]     # y is a slice, i.e., a view onto a subarray of x (no data is copied)
print(y)

y[:] = 1000    # modifying the value of y will change x
print(y)
print(x)

z = x[3:5].copy()   # makes a copy of the subarray
print(z)
z[:] = 500          # modifying the value of z will not affect x
print(z)
print(x)
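
Whether two arrays share data can also be checked directly rather than inferred from side effects. The following is a minimal sketch; the names `base`, `view`, and `dup` are illustrative and do not appear elsewhere in this tutorial.

```python
import numpy as np

base = np.arange(-5, 5)          # same starting values as the cell above
view = base[3:5]                 # basic slicing returns a view
dup = base[3:5].copy()           # .copy() allocates new memory

print(np.shares_memory(base, view))   # True: the slice reuses base's buffer
print(np.shares_memory(base, dup))    # False: the copy is independent
print(view.base is base)              # True: a view records the array it was taken from
```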

In [None]:
my2dlist = [[1,2,3,4],[5,6,7,8],[9,10,11,12]]   # a 2-dim list
print(my2dlist)
print(my2dlist[2])        # access the third sublist
print(my2dlist[:][2])     # still the third sublist, not the third element of each sublist
# print(my2dlist[:,2])    # this will raise a TypeError (lists do not accept tuple indices)

my2darr = np.array(my2dlist)
print(my2darr)
print(my2darr[2][:])      # access the third row
print(my2darr[2,:])       # access the third row
print(my2darr[:][2])      # access the third row (similar to 2d list)
print(my2darr[:,2])       # access the third column
print(my2darr[:2,2:])     # access the first two rows & last two columns

ndarray also supports boolean indexing.

In [None]:
my2darr = np.arange(1,13,1).reshape(3,4)
print(my2darr)

divBy3 = my2darr[my2darr % 3 == 0]
print(divBy3, type(divBy3))

divBy3LastRow = my2darr[2:, my2darr[2,:] % 3 == 0]
print(divBy3LastRow)
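
Boolean masks can be combined and can also be used as assignment targets. The sketch below assumes the same 3 x 4 array of the values 1 to 12; the names `arr`, `mask`, and `clipped` are illustrative.

```python
import numpy as np

arr = np.arange(1, 13).reshape(3, 4)

# Combine conditions with & (and), | (or), ~ (not); the parentheses are required
mask = (arr % 3 == 0) & (arr > 5)
print(mask)
print(arr[mask])              # 1-D array holding 6, 9 and 12

clipped = arr.copy()
clipped[clipped > 10] = 10    # assign through a mask; arr itself is unchanged
print(clipped)
```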

More indexing examples.

In [None]:
my2darr = np.arange(1,13,1).reshape(4,3)
print(my2darr)

indices = [2,1,0,3]    # selected row indices
print(my2darr[indices,:])

rowIndex = [0,0,1,2,3]     # row index into my2darr
columnIndex = [0,2,0,1,2]  # column index into my2darr
print(my2darr[rowIndex,columnIndex])
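
When two index lists are supplied, numpy pairs them element by element instead of taking every row/column combination. The sketch below spells out that pairing and, for contrast, uses `np.ix_` to select a full sub-block; the variable names are illustrative.

```python
import numpy as np

arr = np.arange(1, 13).reshape(4, 3)
rows = [0, 0, 1, 2, 3]
cols = [0, 2, 0, 1, 2]

paired = arr[rows, cols]     # picks arr[0,0], arr[0,2], arr[1,0], arr[2,1], arr[3,2]
manual = np.array([arr[r, c] for r, c in zip(rows, cols)])   # explicit equivalent
print(paired)
print(np.array_equal(paired, manual))

block = arr[np.ix_([0, 2], [0, 2])]   # every combination of rows 0,2 and columns 0,2
print(block)
```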

### 1.4 Numpy Arithmetic and Statistical Functions

There are many built-in mathematical functions available for manipulating the elements of an ndarray.

In [None]:
y = np.array([-1.4, 0.4, -3.2, 2.5, 3.4])    # a fixed example vector
print(y)

print(np.abs(y))          # convert to absolute values
print(np.sqrt(abs(y)))    # apply square root to each element
print(np.sign(y))         # get the sign of each element
print(np.exp(y))          # apply exponentiation
print(np.sort(y))         # sort array

In [None]:
x = np.arange(-2,3)
y = np.random.randn(5)
print(x)
print(y)

print(np.add(x,y))           # element-wise addition       x + y
print(np.subtract(x,y))      # element-wise subtraction    x - y
print(np.multiply(x,y))      # element-wise multiplication x * y
print(np.divide(x,y))        # element-wise division       x / y
print(np.maximum(x,y))       # element-wise maximum        max(x,y)

In [None]:
y = np.array([-3.2, -1.4, 0.4, 2.5, 3.4])    # a fixed example vector
print(y)

print("Min =", np.min(y))             # min 
print("Max =", np.max(y))             # max 
print("Average =", np.mean(y))        # mean/average
print("Std deviation =", np.std(y))   # standard deviation
print("Sum =", np.sum(y))             # sum

print("Index of Min = ", np.argmin(y))
print("Index of Max = ", np.argmax(y))

In [None]:
# Q: what's the difference between np.max and np.maximum?

y = np.arange(1,13,1).reshape(4,3)
print(y)
# np.maximum(y)   # raises an error: np.maximum compares two arrays and needs two arguments
print(np.max(y, axis=0))
print(np.max(y, axis=1))
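
One way to answer the question above: `np.max` reduces a single array (optionally along an axis), while `np.maximum` compares two inputs element by element. A small sketch, with `m` as an illustrative name:

```python
import numpy as np

m = np.arange(1, 13).reshape(4, 3)

# np.max: reduction over one array
print(np.max(m))            # 12
print(np.max(m, axis=0))    # column-wise maxima: [10 11 12]

# np.maximum: element-wise comparison of two arrays (or an array and a scalar)
print(np.maximum(m, 6))                    # values below 6 are replaced by 6
print(np.maximum([1, 5, 2], [3, 4, 6]))    # [3 5 6]
```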

### 1.5 Numpy linear algebra

Numpy provides many functions to support linear algebra operations.

In [None]:
X = np.random.randn(2,3)    # create a 2 x 3 random matrix
print(X)
print(X.T)             # matrix transpose operation X^T

y = np.random.randn(3) # random vector 
print(y)
print(X.dot(y))        # matrix-vector multiplication  X * y
print(X.dot(X.T))      # matrix-matrix multiplication  X * X^T
print(X.T.dot(X))      # matrix-matrix multiplication  X^T * X

In [None]:
X = np.random.randn(5,3)
print(X)

C = X.T.dot(X)               # C = X^T * X is a square matrix

invC = np.linalg.inv(C)      # inverse of a square matrix
print(invC)
detC = np.linalg.det(C)      # determinant of a square matrix
print(detC)
S, U = np.linalg.eig(C)      # eigenvalues S and eigenvectors U (one per column) of a square matrix
print(S)
print(U)
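
The results of these routines can be sanity-checked against their definitions. The sketch below uses a seeded generator only so that it is repeatable; the seed and the `@` operator (equivalent to `.dot` for 2-D arrays) are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)      # seeded for repeatability (an assumption of this sketch)
X = rng.standard_normal((5, 3))
C = X.T @ X                          # symmetric 3 x 3 matrix

S, U = np.linalg.eig(C)
print(np.allclose(C @ U, U * S))                      # C u_i = s_i u_i for each column u_i
print(np.allclose(C @ np.linalg.inv(C), np.eye(3)))   # C times its inverse is the identity
print(np.isclose(np.linalg.det(C), np.prod(S)))       # determinant = product of eigenvalues
```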

### 1.6 Numpy sort

In [None]:
a = np.array([[1,4,2],[3,1,2]])
print(a)

print(np.sort(a))                # sort along the last axis
#print(np.sort(a, axis=1))

print(np.sort(a, axis=None))     # sort the flattened array

print(np.sort(a, axis=0))        # sort along the first axis

### 1.7 Numpy manipulation & operation

In [None]:
# change the shape of an array

a = np.array([[0,0],
              [1,1],
              [2,2]])
print(a)
print(a.shape)

In [None]:
print(a.ravel())   # flattened array
print(np.ravel(a))

In [None]:
# The reshape function returns its argument with a modified shape, 
# whereas the ndarray.resize method modifies the array itself:
a.reshape(2,3)
print(a)
a.resize((2,3))
print(a)

In [None]:
# If a dimension is given as -1 in a reshaping operation, 
# the other dimensions are automatically calculated:
print(a.reshape(3,-1))
print(a.reshape((-1)))

In [None]:
# Stacking together different arrays
a = np.array([[0,0],
              [1,1],
              [2,2]])
b = np.array([[3,3],
              [4,4],
              [5,5]])

print(np.vstack((a,b)))
print(np.hstack((a,b)))

In [None]:
# concatenate accepts an optional argument
# giving the axis along which the concatenation should happen.
print(np.concatenate((a,b), axis=0))
print(np.concatenate((a,b), axis=1))
print(np.concatenate((a,b), axis=None))

In [None]:
a = np.array([1,2,3,4])

print(np.cumsum(a)) # cumulative sum
print(np.cumprod(a)) # cumulative product

### 1.8 Index tricks

In [None]:
# Indexing with Arrays of Indices
a = np.arange(12)**2                       # the first 12 square numbers
print(a)

i = np.array( [ 1,1,3,8,5 ] )              # an array of indices
print(a[i])                                       # the elements of a at the positions i

j = np.array( [ [ 3, 4], [ 9, 7 ] ] )      # a bidimensional array of indices
print(a[j])                                       # the same shape as j

In [None]:
# You can also use indexing with arrays as a target to assign to:
a = np.arange(5)
print(a)

a[[1,3,4]] = 0
print(a)

In [None]:
# Indexing with Boolean Arrays
a = np.arange(5)
print(a)
b = a > 2
print(b)
print(a[b])

In [None]:
a = np.array([1,0,2,3,4], dtype=float)   # np.float was removed in NumPy 1.24; use float or np.float64
# Q1: Given an array, standardize the array:
mean = np.mean(a)
std = np.std(a)
a = (a - mean) / std
print(a)

In [None]:
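a = np.array([1,0,2,3,4], dtype=float)   # np.float was removed in NumPy 1.24; use float or np.float64
# Q2: only standardize the non-zero elements and update them:
nonzero = a != 0                  # boolean mask of the non-zero elements
mean = np.mean(a[nonzero])
std = np.std(a[nonzero])
a[nonzero] = (a[nonzero] - mean) / std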
print(a)
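
A standardized array should have mean 0 and standard deviation 1, which is easy to verify. Note that `np.std` divides by n by default (`ddof=0`); the `ddof=1` variant shown here is an optional alternative, not something the exercise requires.

```python
import numpy as np

a = np.array([1, 0, 2, 3, 4], dtype=float)

z = (a - a.mean()) / a.std()       # population standard deviation (ddof=0)
print(np.isclose(z.mean(), 0.0))   # True
print(np.isclose(z.std(), 1.0))    # True

z_sample = (a - a.mean()) / a.std(ddof=1)   # sample standard deviation (divides by n-1)
print(z_sample.std(ddof=1))                 # 1.0 up to rounding
```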

## 2 Introduction to Pandas

Pandas provides two convenient data structures for storing and manipulating data: Series and DataFrame. A Series is similar to a one-dimensional array, whereas a DataFrame is closer to a matrix or a spreadsheet table.

### 2.1 Series

A Series object consists of a one-dimensional array of values, whose elements can be referenced using an index array. A Series object can be created from a list, a numpy array, or a Python dictionary. You can apply most of the numpy functions on the Series object.


In [None]:
import pandas as pd
from pandas import Series

s = Series([3.1, 2.4, -1.7, 0.2, -2.9, 4.5])   # creating a series from a list
print(s)
print('Values=', s.values)     # display values of the Series
print('Index=', s.index)       # display indices of the Series

In [None]:
import numpy as np

s2 = Series(np.random.randn(6))  # creating a series from a numpy ndarray
print(s2)
print('Values=', s2.values)   # display values of the Series
print('Index=', s2.index)     # display indices of the Series

In [None]:
s3 = Series([1.2,2.5,-2.2,3.1,-0.8,-3.2], 
            index = ['Jan 1','Jan 2','Jan 3','Jan 4','Jan 5','Jan 6'])
print(s3)
print('Values=', s3.values)   # display values of the Series
print('Index=', s3.index)     # display indices of the Series

In [None]:
capitals = {'MI': 'Lansing', 'CA': 'Sacramento', 'TX': 'Austin', 'MN': 'St Paul'}

s4 = Series(capitals)   # creating a series from dictionary object
print(s4)
print('Values=', s4.values)   # display values of the Series
print('Index=', s4.index)     # display indices of the Series

In [None]:
# Accessing elements of a Series

s3 = Series([1.2,2.5,-2.2,3.1,-0.8,-3.2], 
            index = ['Jan 1','Jan 2','Jan 3','Jan 4','Jan 5','Jan 6'])
print(s3)

In [None]:
# Accessing elements of a Series

print('s3.iloc[2]=', s3.iloc[2])   # third element by position (plain s3[2] is deprecated for a label index)
print('s3[\'Jan 3\']=', s3['Jan 3'])   # indexing element of a Series 

print('\ns3[1:3]=')             # display a slice of the Series
print(s3[1:3])
print('s3.iloc[1:3]=')        # display a slice of the Series
print(s3.iloc[1:3])
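
Position-based and label-based slices differ in how they treat the end point. A short sketch on the same values, using the illustrative name `s`:

```python
import pandas as pd

s = pd.Series([1.2, 2.5, -2.2, 3.1, -0.8, -3.2],
              index=['Jan 1', 'Jan 2', 'Jan 3', 'Jan 4', 'Jan 5', 'Jan 6'])

by_position = s.iloc[1:3]            # positions 1 and 2; the end position is excluded
by_label = s.loc['Jan 2':'Jan 4']    # label slices include both end points
print(by_position)
print(by_label)
```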

In [None]:
print('shape =', s3.shape)  # get the dimension of the Series
print('size =', s3.size)    # get the # of elements of the Series

In [None]:
# indexing with boolean array (Series)

print(s3[s3 > 0])   # applying filter to select elements of the Series

In [None]:
print(s3 + 4)       # applying scalar operation on a numeric Series
print(s3 / 4)

In [None]:
print(np.log(s3 + 4))    # applying numpy math functions to a numeric Series

In [None]:
# convert to numpy array:

print(s3.to_numpy())

### 2.2 DataFrame

A DataFrame object is a tabular, spreadsheet-like data structure containing a collection of columns, each of which can be of different types (numeric, string, boolean, etc). Unlike Series, a DataFrame has distinct row and column indices. There are many ways to create a DataFrame object (e.g., from a dictionary, list of tuples, or even numpy's ndarrays).

In [None]:
import pandas as pd
from pandas import DataFrame

cars = {'make': ['Ford', 'Honda', 'Toyota', 'Tesla'],
       'model': ['Taurus', 'Accord', 'Camry', 'Model S'],
       'MSRP': [27595, 23570, 23495, 68000]}          
carData = DataFrame(cars)   # creating DataFrame from dictionary
carData                     # display the table, contains objects of different types

In [None]:
print(carData.index)       # print the row indices
print(carData.columns)     # print the column indices

In [None]:
carData2 = DataFrame(cars, index = [1,2,3,4])  # change the row index
carData2['year'] = 2018    # add column with same value
carData2['dealership'] = ['Courtesy Ford','Capital Honda','Spartan Toyota','N/A'] # add new column
carData2                   # display table

In [None]:
# Creating DataFrame from a list of tuples.

tuplelist = [(2011,45.1,32.4),(2012,42.4,34.5),(2013,47.2,39.2),
              (2014,44.2,31.4),(2015,39.9,29.8),(2016,41.5,36.7)]
columnNames = ['year','temp','precip']
weatherData = DataFrame(tuplelist, columns=columnNames)
weatherData

In [None]:
# Creating DataFrame from numpy ndarray

import numpy as np

npdata = np.random.randn(5,3)  # create a 5 by 3 random matrix
columnNames = ['x1','x2','x3']
data = DataFrame(npdata, columns=columnNames)
data

The elements of a DataFrame can be accessed in many ways.

In [None]:
# accessing an entire column will return a Series object

print(data['x2'])
print(type(data['x2']))

In [None]:
# accessing an entire row will return a Series object

print('Row 3 of data table:')
print(data.iloc[2])       # returns the 3rd row of DataFrame
print(type(data.iloc[2]))

In [None]:
# recall carData2
carData2

In [None]:
# accessing a specific element of the DataFrame

print(carData2.iloc[1,2])      # retrieving second row, third column
print(carData2.loc[1,'model']) # retrieving the row labeled 1 (the first row), column named 'model'
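
Because carData2 uses the row labels 1 to 4, a label lookup and a position lookup with the same number land on different rows. The sketch below rebuilds a table with the same contents under the illustrative name `cars`.

```python
import pandas as pd

cars = pd.DataFrame({'make': ['Ford', 'Honda', 'Toyota', 'Tesla'],
                     'model': ['Taurus', 'Accord', 'Camry', 'Model S'],
                     'MSRP': [27595, 23570, 23495, 68000]},
                    index=[1, 2, 3, 4])

print(cars.loc[1, 'model'])    # label 1 is the first row: Taurus
print(cars.iloc[1, 1])         # position 1 is the second row: Accord
print(cars.at[4, 'MSRP'])      # .at is a label-based accessor for a single value: 68000
```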


In [None]:
# accessing a slice of the DataFrame

print('carData2.iloc[1:3,1:3]=')
print(carData2.iloc[1:3,1:3]) # retrieving second and third rows, second and third columns

In [None]:
print('carData2.shape =', carData2.shape)
print('carData2.size =', carData2.size)

In [None]:
# selection and filtering

print('carData2[carData2.MSRP > 25000]')  
print(carData2[carData2.MSRP > 25000])
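
Several conditions can be combined in one filter. This is a minimal sketch on a table with the same car data; `cars` and the result names are illustrative, and `DataFrame.query` is shown only as an alternative spelling.

```python
import pandas as pd

cars = pd.DataFrame({'make': ['Ford', 'Honda', 'Toyota', 'Tesla'],
                     'model': ['Taurus', 'Accord', 'Camry', 'Model S'],
                     'MSRP': [27595, 23570, 23495, 68000]})

# Use & (and) / | (or) with parentheses around each condition
mid_priced = cars[(cars['MSRP'] > 23000) & (cars['MSRP'] < 25000)]
print(mid_priced)

# query() expresses the same kind of filter as a string
expensive = cars.query('MSRP > 25000')
print(expensive)
```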

In [None]:
# convert to numpy array:
df = DataFrame({"A": [1, 2], "B": [3, 4]})
print(df)
print(df.to_numpy())

#### File I/O
* read_csv
* to_csv
* read_excel
* to_excel
  * pip install -U openpyxl

In [None]:
# save csv
df.to_csv("output.csv", index=False)   # index=False: do not write the row index as an extra column

# read csv from local file
df = pd.read_csv("output.csv")
df
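
By default `to_csv` also writes the row index, which comes back as an extra unnamed column when the file is read again. The sketch below shows two ways to handle this; it writes to in-memory buffers (`io.StringIO`) instead of files, purely to keep the example self-contained.

```python
import io
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

buf = io.StringIO()
df.to_csv(buf)                       # default: the index is written as the first column
buf.seek(0)
cols_default = pd.read_csv(buf).columns.tolist()
print(cols_default)                  # ['Unnamed: 0', 'A', 'B']

buf.seek(0)
cols_index_col = pd.read_csv(buf, index_col=0).columns.tolist()
print(cols_index_col)                # ['A', 'B']: the first column is used as the index

buf2 = io.StringIO()
df.to_csv(buf2, index=False)         # or skip the index when writing
buf2.seek(0)
cols_no_index = pd.read_csv(buf2).columns.tolist()
print(cols_no_index)                 # ['A', 'B']
```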

In [None]:
# read csv from url
df = pd.read_csv('https://github.com/liuhoward/teaching/raw/master/big_data/test.csv')

df.head()

In [None]:
# save excel
df.to_excel("output.xlsx", index=False)   # index=False: do not write the row index as an extra column

# read excel from local file
df = pd.read_excel("output.xlsx")
df

In [None]:
# read excel from url
df = pd.read_excel('https://github.com/liuhoward/teaching/raw/master/big_data/test.xlsx')

df.head()

### 2.3 Arithmetic Operations

In [None]:
# recall data
data

In [None]:
print('Data transpose operation:')
print(data.T)    # transpose operation

print('Addition:')
print(data + 4)    # addition operation

print('Multiplication:')
print(data * 10)   # multiplication operation

In [None]:
# .add, .mul methods
print('data =')
print(data)

columnNames = ['x1','x2','x3']
data2 = DataFrame(np.random.randn(5,3), columns=columnNames)
print('\ndata2 =')
print(data2)

In [None]:
print('\ndata + data2 = ')
print(data.add(data2))

print('\ndata * data2 = ')
print(data.mul(data2))
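
Arithmetic between two DataFrames aligns on row and column labels, not on position. In the cells above both tables share the same labels, so the effect is invisible; the sketch below uses two small tables with partly different row labels (`left` and `right` are illustrative names) to show it.

```python
import pandas as pd

left = pd.DataFrame({'x1': [1.0, 2.0], 'x2': [3.0, 4.0]}, index=[0, 1])
right = pd.DataFrame({'x1': [10.0, 20.0], 'x2': [30.0, 40.0]}, index=[1, 2])

plain = left + right                     # rows 0 and 2 have no partner and become NaN
filled = left.add(right, fill_value=0)   # a missing entry is treated as 0 before adding
print(plain)
print(filled)
```

The `fill_value` argument is the main reason to prefer `.add` and `.mul` over the `+` and `*` operators.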

In [None]:

print('\nMaximum value per column:')
print(data.max())    # get maximum value for each column

print('\nMinimum value per row:')
print(data.min(axis=1))    # get minimum value for each row


In [None]:
print(data.abs())    # get the absolute value for each element

print('\nSum of values per column:')
print(data.sum())    # get sum of values for each column

In [None]:
print('\nAverage value per row:')
print(data.mean(axis=1))    # get average value for each row

In [None]:
print('\nCalculate max - min per column')
f = lambda x: x.max() - x.min()
print(data.apply(f))

In [None]:
print('\nCalculate max - min per row')
f = lambda x: x.max() - x.min()
print(data.apply(f, axis=1))
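
For simple reductions like this one, the same result can be obtained without `apply` by combining the built-in column-wise or row-wise methods. A sketch with a seeded random table (the seed and the name `frame` are assumptions of this example):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)   # seeded so the sketch is repeatable
frame = pd.DataFrame(rng.standard_normal((5, 3)), columns=['x1', 'x2', 'x3'])

# Per column: apply passes each column to the function as a Series
col_apply = frame.apply(lambda col: col.max() - col.min())
col_direct = frame.max() - frame.min()

# Per row: axis=1 passes each row instead
row_apply = frame.apply(lambda row: row.max() - row.min(), axis=1)
row_direct = frame.max(axis=1) - frame.min(axis=1)

print(np.allclose(col_apply, col_direct))   # True
print(np.allclose(row_apply, row_direct))   # True
```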

### 2.4 Plotting Series and DataFrame

There are built-in functions you can use to plot the data stored in a Series or a DataFrame.

In [None]:
%matplotlib inline

s3 = Series([1.2,2.5,-2.2,3.1,-0.8,-3.2,1.4], 
            index = ['Jan 1','Jan 2','Jan 3','Jan 4','Jan 5','Jan 6','Jan 7'])

s3.plot(kind='line', title='Line plot')

In [None]:
s3.plot(kind='bar', title='Bar plot')

In [None]:
s3.plot(kind='hist', title = 'Histogram')

In [None]:
tuplelist = [(2011,45.1,32.4),(2012,42.4,34.5),(2013,47.2,39.2),
              (2014,44.2,31.4),(2015,39.9,29.8),(2016,41.5,36.7)]
columnNames = ['year','temp','precip']
weatherData = DataFrame(tuplelist, columns=columnNames)

weatherData[['temp','precip']].plot(kind='box', title='Box plot')