<h1>Python libraries for data analysis</h1>


<li><b><span style="color:blue">Numpy</span></b>: supports numerical and array operations
<li><b><span style="color:blue">Scipy</span></b>: open source library for mathematics and scientific computing
<li><b><span style="color:blue">Pandas</span></b>: supports data manipulation and analysis
<li><b><span style="color:blue">Visualization libraries</span></b>: matplotlib, seaborn, bokeh, plotly, gmplot, and many others provide support for charts and graphs

<h1>numpy</h1>


<h2>Why numpy?</h2>
<li>Multi-dimensional arrays
<li>Faster and more space efficient than lists 
<li>Can incorporate C/C++/Fortran code
<li>Linear algebra, Fourier transforms, Random number support



<h2>numpy array</h2>

In [None]:
import numpy as np
ax = np.array([1,2,3,4,5])
print(type(ax))


<li>A numpy array has a data type associated with its elements
<li>and elements need to be of the same data type
<li>But an element could be an 'arbitrarily' complex object

In [None]:
np.array([1,2,'a'])

In [None]:
np.array([{'a':1,'b':2},4])

<h2>Specifying the type</h2>
<h3>Useful when reading a text stream directly into a numerical array</h3>

<h4>The <i>dtype</i> attribute</h4>
<li>Stores the data type of the array's elements
<li>numpy makes a best guess of the data type

In [None]:
ax = np.array([[1,2,3,4],[5,6,7,8.3]])
ax.dtype

In [None]:
ax = np.array([{'a':1,'b':2},4])
ax.dtype

In [None]:
x=['1','2','3']
xi = np.array(x,'int')
xf = np.array(x,'float')
xs = np.array(x,'str')
print(xi,xf,xs,sep='\n')

<li>The <i>astype</i> method converts an array from one type to another and returns a new array


In [None]:
ax = np.array([1,2,3,'4'])
print(ax.dtype)
ax.astype(int)

In [None]:
ay = ax.astype(np.longdouble) #np.float128 is not defined on every platform; np.longdouble is the portable name
print(ay)
ay.dtype

<h2>Basic operations</h2>

<h4>statistical operations</h4>

In [None]:
x = np.array([13,24,21.2,17.6,21.7],'float')
print(x.sum(),x.mean(),x.std(),sep='\n')

<h4>array arithmetic operations</h4>
<li><b>Important</b>: Arrays must have the same shape (the Broadcasting section covers the exceptions)

In [None]:
x = np.array([13,24,21.2,17.6,21.7],'float')
y = np.array([1,3,4,7,2],'float')
x - y

In [None]:
x+y

In [None]:
x*y

In [None]:
x/y

<h2>Multi-dimensional arrays</h2>

In [None]:
x=[[0,1,2,3,4,5],[10,11,12,13,14,15],[20,21,22,23,24,25]]
ax=np.array(x,float)
print(ax)

<h3>Indexing</h3>

In [None]:
ax[1,3] #indexing

<h3>Slicing</h3>

In [None]:
ax[1:3,2:4]
#Intersection between ax[1:3,:] and ax[:,2:4]

<h3>Reshaping</h3>
<li>nd arrays can be reshaped as long as the total number of elements is unchanged


In [None]:
print(ax.shape)
ax.reshape(9,2)
#ax.reshape(10,3) #would raise ValueError: 18 elements cannot fill a 10x3 array
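
The reshape rule can be checked with a small sketch. The array name `grid` is only for illustration. Passing -1 for one dimension asks numpy to work out that dimension from the element count.

```python
import numpy as np

grid = np.arange(18).reshape(3, 6)   # 18 elements in total

# Any shape whose dimensions multiply to 18 is allowed
tall = grid.reshape(9, 2)

# -1 lets numpy infer the missing dimension (18 / 2 = 9)
inferred = grid.reshape(-1, 2)
print(tall.shape, inferred.shape)

# A shape with a different element count raises ValueError
try:
    grid.reshape(10, 3)
except ValueError as err:
    print("reshape failed:", err)
```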

<h3>Creating nd arrays</h3>

<h4>Using the <i>array</i> function</h4>

In [None]:
data = [[0,1,2,3,4],[5,6,7,8,9]]
data_array = np.array(data)
data_array

<h4>Using initializers</h4>

<li>The <i>arange</i> (array range) function


In [None]:
ax = np.arange(10)
print(ax)
ay = np.array([np.arange(10),np.arange(10)])
print(ay)
ax.dtype

In [None]:
ax = np.arange(10)**2
print(ax)

<li>The <i>ones</i> function creates an array of 1s (floats)

In [None]:
ax = np.ones(10)
print(ax)
ax.dtype

In [None]:
ax = np.array([[1,2,3,4],[5,6,7,8]])
ay = np.ones_like(ax)
ay

In [None]:

ay = np.zeros_like(ax)
ay

<li>The <i>identity(n)</i> function creates an identity matrix of order n

In [None]:
np.identity(10)

<li>The function <i>empty</i> creates an "empty" array
<li>Values in the array are "garbage" values

In [None]:
np.empty([2,3],float)

<h3>Matrix multiplication</h3>


In [None]:
ax = np.arange(10)
ay = np.array([ax,ax])
#Scalar multiplication
ay*2

In [None]:
np.dot(ay,ay.reshape(10,2)) #Dot product

<h2>Lists vs numpy arrays</h2>
<li>Lists are heterogeneous. Elements of a list can be of multiple types
<li>Numpy arrays are homogeneous. Elements can be of only one type
<li>Both are mutable
<li>Homogeneity makes indexed access faster and more memory efficient
<li>numpy arrays are optimized for matrix operations
<li>numpy provides random number support


<h3>numpy arrays are homogeneous</h3>
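
A minimal sketch of the difference, using made-up values: a list keeps each element's own type, while numpy converts every element to one common type.

```python
import numpy as np

mixed_list = [1, 2.5, 'a']            # a list keeps each element's own type
list_types = [type(v).__name__ for v in mixed_list]
print(list_types)

int_float = np.array([1, 2.5])        # the int is upcast to float64
with_text = np.array([1, 2.5, 'a'])   # every element becomes a string
print(int_float.dtype, with_text.dtype)
```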

<h3>numpy arrays are faster</h3>

In [None]:
n=10
ax = np.array([np.arange(n)**2,np.arange(n)**3])
ay = ax.transpose()
print(ax)
print(ay)
np.dot(ax,ay)

<h4>Functionalize this</h4>


In [None]:
def dotproduct(n):
    ax = np.array([np.arange(n)**2,np.arange(n)**3])
    ay = ax.transpose()
    import datetime
    start = datetime.datetime.now()
    np.dot(ax,ay)
    end = datetime.datetime.now()
    return end-start
    
dotproduct(10)    

<h4>Do the same with python lists</h4>


In [None]:

def dot_product_lists(n):
    x = [x**2 for x in range(n)]
    y = [x**3 for x in range(n)]
    ax = [x,y]
    ay = [list(i) for i in zip(*ax)]
    import datetime
    start = datetime.datetime.now()
    [[sum(a*b for a,b in zip(X_row,Y_col)) for Y_col in zip(*ay)] for X_row in ax]
    end = datetime.datetime.now()
    return end-start
    
dot_product_lists(10)

<h4>Compare the two</h4>

In [None]:
for n in [10,100,1000,10000,1000000]:
    numpy_result = dotproduct(n)
    list_result = dot_product_lists(n)
    print(n,numpy_result,list_result,sep='\t')
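
A single datetime difference can be noisy for short runs. As an alternative sketch, the standard library's `timeit` module repeats each call many times. The helper `list_dot`, the size `n = 1000`, and the repeat count are assumptions chosen for illustration, and the timings will vary by machine.

```python
import timeit
import numpy as np

n = 1000
a = np.arange(n, dtype=float)
b = np.arange(n, dtype=float)
la, lb = a.tolist(), b.tolist()

def list_dot(u, v):
    # Pure-Python inner product of two equal-length lists
    return sum(p * q for p, q in zip(u, v))

# Both approaches compute the same value
numpy_value = float(np.dot(a, b))
list_value = list_dot(la, lb)

# timeit runs each call 200 times and returns the total elapsed seconds
numpy_time = timeit.timeit(lambda: np.dot(a, b), number=200)
list_time = timeit.timeit(lambda: list_dot(la, lb), number=200)
print(numpy_time, list_time, sep='\t')
```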

<h3>numpy indexing vs list indexing</h3>
<li>numpy arrays use direct indexing
<li>lists use chained indexing

In [None]:
ax = np.array([1,2,3,4,8,9])
x = [1,2,3,4,8,9]

#Extract the first and last elements from the numpy array into a single array
ax[[0,-1]]

#Extract the first and last elements from the list into a new list
[x[0],x[-1]]

<h3>numpy slicing vs list slicing</h3>

In [None]:
ax = np.array([[11,12,13,14],[21,22,23,24],[31,32,33,34]])
ax[1:3,1:3]


In [None]:
ax

In [None]:
lx = [[11,12,13,14],[21,22,23,24],[31,32,33,34]]
#HELP!
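
One way to get the same 2x2 block out of a nested list is sketched below. Chaining two slices does not work, because the second slice is applied to the rows again rather than to the columns.

```python
lx = [[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34]]

# lx[1:3][1:3] slices the rows twice and never touches the columns
wrong = lx[1:3][1:3]

# Slice the rows first, then slice the columns of each remaining row
sub = [row[1:3] for row in lx[1:3]]
print(wrong)
print(sub)
```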

<h2>batch operations on nd arrays</h2>
<li>numpy arrays allow the application of batch operations on all elements of an array
<li>without having to write a for loop or use an iterator
<li>by <i>vectorizing</i> operations, numpy runs the loop in compiled code, which is much faster than a python for loop
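
A small sketch of the same calculation written both ways. The array `values` and the formula are made up for illustration.

```python
import numpy as np

values = np.arange(1, 6)            # [1 2 3 4 5]

# Loop version: one Python-level operation per element
looped = np.empty_like(values)
for i in range(len(values)):
    looped[i] = values[i] * 2 + 1

# Vectorized version: the same arithmetic applied to the whole array at once
vectorized = values * 2 + 1
print(vectorized)
```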


<h3>batch: selecting elements using a boolean mask</h3>
<li> A boolean mask applies a condition to each element in turn
<li> and returns an array of booleans with
<ul>
<li> True for each value that satisfies the condition
<li> False for every other value
</ul>

In [None]:
ax = np.array([1,4,7,9,2,3,10,11,34,2])
ax < 7

<h4>The mask can be applied as a selection operator on the array</h4>

In [None]:
ax[ax<7]

<h4>The mask doesn't have to be constructed on the same array</h4>
<li>But the mask and the array should have the same shape

In [None]:
names = np.array(['Bill','Sally','Qing','Savitri','Giovanni'])
bonus = np.array([232300.56,478123.45,3891.24,98012.36,52123.50])
names[bonus > 130000]


<h3>batch: arithmetic operations</h3>
<li>+, -, *, /, scalar multiplication do an element by element operation

In [None]:
ax = np.array([[1,2,3],[4,5,6]])
1/ax


<h3>batch: functional arithmetic operators</h3>

In [None]:
ax = np.array([4,9,2,4,0,25,0])
print(np.sum(ax>5))
print(np.count_nonzero(ax))
print(np.any(ax>10))
print(np.all(ax>0))

<h3>Logical operations with numpy</h3>
<li>logical_or
<li>logical_and

In [None]:
np.logical_and(bonus>90000.0, bonus<400000 )

<h4>Boolean operators</h4>
<li>the numpy equivalent of "and" is "&"
<li>the numpy equivalent of "or" is "|"
<li>the numpy equivalent of "not" is "~"


In [None]:
print(np.sum(bonus[(bonus>50000) & (bonus < 200000)]))
print(np.sum(bonus[(names=="Bill") | (names == "Qing")]))
print(np.sum(bonus[(names!="Bill")]))
print(np.sum(bonus[~((names=="Bill") | (names == "Qing"))]))
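
Two details are easy to trip over, sketched here with the same bonus figures: each comparison needs its own parentheses, and Python's own "and" raises an error on arrays.

```python
import numpy as np

bonus = np.array([232300.56, 478123.45, 3891.24, 98012.36, 52123.50])

# Parentheses are required: & binds more tightly than the comparisons
in_range = (bonus > 50000) & (bonus < 200000)
print(in_range)

# ~ negates a mask element by element
print(~in_range)

# Python's "and" asks for a single truth value, which an array cannot give
try:
    (bonus > 50000) and (bonus < 200000)
except ValueError as err:
    print("and failed:", err)
```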


<b>Problem</b> Calculate the mean and median bonus amount for all female employees with bonus less than $100,000

In [None]:
names = np.array(['Bill','Sally','Qing','Savitri','Giovanni'])
bonus = np.array([232300.56,478123.45,3891.24,98012.36,52123.50])
gender = np.array(['M','F','F','F','M'])



<b>Problem</b> Return an nd array containing the names of all female employees with bonus less than $100,000

In [None]:
names = np.array(['Bill','Sally','Qing','Savitri','Giovanni'])
bonus = np.array([232300.56,478123.45,3891.24,98012.36,52123.50])
gender = np.array(['M','F','F','F','M'])
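
One possible approach to the two problems above, offered as a sketch rather than the only answer: build a single mask from both conditions and reuse it. The variable names `mask`, `selected_names`, `mean_bonus` and `median_bonus` are illustrative.

```python
import numpy as np

names = np.array(['Bill', 'Sally', 'Qing', 'Savitri', 'Giovanni'])
bonus = np.array([232300.56, 478123.45, 3891.24, 98012.36, 52123.50])
gender = np.array(['M', 'F', 'F', 'F', 'M'])

# Combine both conditions into one boolean mask
mask = (gender == 'F') & (bonus < 100000)

selected_names = names[mask]
mean_bonus = bonus[mask].mean()
median_bonus = np.median(bonus[mask])
print(selected_names, mean_bonus, median_bonus, sep='\n')
```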


<h3>batch: Selecting elements using where</h3>
<li>The <i>where</i> function creates a new array using a conditional expression
<li>Somewhat like the if function in an excel spreadsheet
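
A short sketch of `np.where(condition, value_if_true, value_if_false)`. The 100000 threshold and the labels are made up for illustration.

```python
import numpy as np

bonus = np.array([232300.56, 478123.45, 3891.24, 98012.36, 52123.50])

# Pick one of two constants for each element
label = np.where(bonus > 100000, 'high', 'low')
print(label)

# The values can also come from an array: cap every bonus at 100000
capped = np.where(bonus > 100000, 100000, bonus)
print(capped)
```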

<h2>axes</h2>
<li>The axis parameter tells numpy which axis to operate along

In [None]:
ax = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
ax = ax.reshape(3,4)
print(ax)

<h4>sum each column</h4>

In [None]:
ax.sum(axis=0)

<h4>sum each row</h4>

In [None]:
ax.sum(axis=1)

<h4>sum by depth</h4>

In [None]:
ax=ax.reshape(2,3,2)
ax.sum(axis=2)

<h4>add an axis to an array</h4>


In [None]:
ax = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
print(ax)
ax[:,np.newaxis]

<h4>Easy to add n-dimensions to an nd array using newaxis</h4>

In [None]:
ax = ax.reshape(4,3)
ax[np.newaxis,np.newaxis,np.newaxis].shape

In [None]:
x=[[0,1,2,3,4,5],[10,11,12,13,14,15],[20,21,22,23,24,25]]
ax=np.array(x,float)
np.where(ax%2==0,1,0)

<h2>Broadcasting</h2>
<li>arithmetic operations work element by element
<li>so both arrays normally have to be of the same shape
<li><b>broadcasting</b> is used for arithmetic on arrays of different shapes

In [None]:
ax = np.array([1,2,3])
ay = np.array([3,2,1])
ax+ay

<li>when one operand is a scalar, numpy works as if it has created a second array
<li>ax + 5 is equivalent to ax + np.array([5,5,5,])
<li>note the "as if" because it doesn't actually do that
<li>instead it <b>broadcasts</b> the 5 to each element of ax
<li>we can do this broadcasting on any dimensional array
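
The scalar case can be checked directly. This sketch compares `ax + 5` with the explicit array version, then repeats the idea on a 2D array named `grid` (an illustrative name).

```python
import numpy as np

ax = np.array([1, 2, 3])

# Adding a scalar gives the same result as adding an array of 5s
print(ax + 5)
print(np.array_equal(ax + 5, ax + np.array([5, 5, 5])))

# The same works for an array of any dimension
grid = np.ones((2, 3))
print(grid * 10)
```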

In [None]:
ay = np.ones([3,3])
ay

In [None]:
ax = np.array([1,2,3])
ax + ay

<b>broadcasting</b> won't work when arrays are of incompatible shapes

In [None]:
ax = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
ay = np.array([3,4,5])
ax + ay #raises ValueError: shapes (12,) and (3,) cannot be broadcast together

<h4>np.newaxis is useful here because we can convert ax into a 2D array</h4>

In [None]:
ax[:,np.newaxis] + ay

In [None]:
#Broadcasting effectively does this:
ax[:,np.newaxis] + np.array([[3,4,5],[3,4,5],[3,4,5],[3,4,5],[3,4,5],[3,4,5],[3,4,5],[3,4,5],[3,4,5],[3,4,5],[3,4,5],[3,4,5]])

<h4>We could also convert ay into a 2D array</h4>
<li>the result will be different (why?) 

In [None]:
ax = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
ay = np.array([3,4,5])
ax + ay[:,np.newaxis]
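
Printing the shapes is one way to see the difference. In this sketch the names `col_plus_row` and `row_plus_col` are illustrative. Both results hold the same sums, but one is the transpose of the other.

```python
import numpy as np

ax = np.arange(1, 13)          # shape (12,)
ay = np.array([3, 4, 5])       # shape (3,)

col_plus_row = ax[:, np.newaxis] + ay   # (12, 1) with (3,)   gives (12, 3)
row_plus_col = ax + ay[:, np.newaxis]   # (12,)   with (3, 1) gives (3, 12)
print(col_plus_row.shape, row_plus_col.shape)

# Same sums, laid out as transposes of each other
print(np.array_equal(col_plus_row, row_plus_col.T))
```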

<h2>Fancy indexing</h2>
<li>numpy lets us work on indexed subsets of an nd array
<li>this lets us construct arbitrary subsets of an nd array in any dimension

In [None]:
ax = np.array([4,3,9,2,1,6])
ay = np.array([2,4])
ax[ay]

In [None]:
ay = np.array([[2,4],[1,3]])
ax[ay]

<h4>multi-dimensional indexes</h4>
<li>In a 2-d index, the index arrays are paired up element by element to generate (row_number,col_number) pairs

In [None]:
ax = np.array([[23,34,21,34,22],[33,44,11,29,32],[14,90,10,20,17]])
rows = np.array([0,2])
cols = np.array([1,4])
ax[rows,cols] # [ax[0,1],ax[2,4]]
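
Paired indexing returns one element per (row, col) pair. To select every combination of the listed rows and columns instead, `np.ix_` can be used, as in this sketch on the same data.

```python
import numpy as np

ax = np.array([[23, 34, 21, 34, 22],
               [33, 44, 11, 29, 32],
               [14, 90, 10, 20, 17]])
rows = np.array([0, 2])
cols = np.array([1, 4])

# Paired indexing: ax[0,1] and ax[2,4]
pairs = ax[rows, cols]

# np.ix_ builds an open mesh: rows 0 and 2 crossed with columns 1 and 4
block = ax[np.ix_(rows, cols)]
print(pairs)
print(block)
```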

<h2>Universal functions</h2>
<li>functions that perform elementwise operations on arrays
<li>fast vectorized "wrapper" functions around operations that take scalar inputs and produce scalar results
<li>sqrt, exp, add,maximum, minimum, abs, etc.
<li>https://docs.scipy.org/doc/numpy/reference/ufuncs.html

In [None]:
ax = np.array([1,2,3,4,5,6,7],float)
np.sqrt(ax)
np.exp(ax)

In [None]:
ay = np.arange(10,17)
np.add(ax,ay)
np.maximum(ax,ay)
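
In a notebook cell only the last expression is displayed, so this sketch stores each result in a variable. The input values are made up so the outputs are easy to check by eye.

```python
import numpy as np

ax = np.array([1, 4, 9, 16], float)
ay = np.array([2, 2, 20, 2], float)

# Unary ufunc: applied to every element
roots = np.sqrt(ax)

# Binary ufunc: applied to matching pairs of elements
larger = np.maximum(ax, ay)

# The + operator on arrays gives the same result as the add ufunc
same = np.array_equal(np.add(ax, ay), ax + ay)
print(roots, larger, same, sep='\n')
```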

In [None]:

#linalg, a linear algebra module
#functions dealing with polynomials, differentials, etc
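
As a hedged illustration of the two comments above, this sketch solves a small linear system with `np.linalg.solve` and evaluates a polynomial with `np.polyval`. The system and the polynomial are invented for the example.

```python
import numpy as np

# Solve the linear system  2x + y = 5,  x + 3y = 10
coeffs = np.array([[2.0, 1.0], [1.0, 3.0]])
rhs = np.array([5.0, 10.0])
solution = np.linalg.solve(coeffs, rhs)
print(solution)

# Evaluate the polynomial x**2 - 3x + 2 at x = 0, 1, 2
poly_values = np.polyval([1, -3, 2], [0, 1, 2])
print(poly_values)
```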


In [None]:
#use numpy's nanmean; the scipy.nanmean alias is deprecated and absent from newer SciPy releases
np.nanmean(x)

<h3>Random number support in numpy</h3>

In [None]:
np.random.normal(size=10)
#np.random.normal(size=(100,100))
#np.random.exponential()
#np.random.exponential(1.0,size=(6,3))
#np.random.randint(-10,10,size=(9,9))
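
Random draws differ on every run. When repeatable results are needed, a seeded generator from `np.random.default_rng` is one option. The seed 42 and the array sizes are arbitrary choices for this sketch.

```python
import numpy as np

# A seeded Generator gives reproducible draws
rng = np.random.default_rng(42)
normal_draws = rng.normal(size=10)
int_draws = rng.integers(-10, 10, size=(3, 3))   # upper bound is exclusive

# Re-creating the generator with the same seed repeats the sequence
rng_again = np.random.default_rng(42)
repeated = np.array_equal(normal_draws, rng_again.normal(size=10))
print(repeated)
```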