<h1>Python libraries for data analysis</h1>


<li><b><span style="color:blue">Numpy</span></b>: supports numerical and array operations</li>
<li><b><span style="color:blue">Scipy</span></b>: open source library for mathematics and scientific computing</li>
<li><b><span style="color:blue">Pandas</span></b>: supports data manipulation and analysis</li>
<li><b><span style="color:blue">Visualization libraries</span></b>: matplotlib, seaborn, bokeh, plotly, gmplot, and many others provide support for charts and graphs</li>
<li><b><span style="color:blue">Machine learning libraries</span></b>: sklearn (scikit-learn), tensorflow, keras, pytorch</li>
<li><b><span style="color:blue">Text analytics libraries</span></b>: nltk, spacy, coreNLP, textblob</li>
<li><b><span style="color:blue">Network analysis libraries</span></b>: networkx, igraph</li>

    

<h1>numpy</h1>

<h2>Why numpy?</h2>
<li>Multi-dimensional arrays
<li>Faster and more space efficient than lists 
<li>Can incorporate C/C++/Fortran code
<li>Linear algebra, Fourier transforms, Random number support
<li>Data analysis and visualization libraries (pandas/scikit-learn) are all numpy based</li> 

<h2>Standard</h2>
<li>numpy is the standard for n-dimensional array based computation in python</li>
<li>tensorflow, keras, (large ML libraries) all use numpy-like APIs</li>
<li>other languages are increasingly using numpy like APIs for array based computation</li>
    

<h4>documentation</h4>
<a href="https://docs.scipy.org/doc/numpy/user/index.html">numpy user guide</a>

In [4]:
x = [1,2,3,42,99]
x

[1, 2, 3, 42, 99]

In [1]:
import numpy
numpy.__version__ 
#'1.25.2'

'1.25.2'

In [None]:
!pip install numpy --upgrade

<h2>numpy array</h2>
<li>numpy is built around an <span style="color:red">ndarray</span> (n-dimensional array) object</li>
<li>ndarrays are array like collections of a single object type (you can't mix object types as with lists)</li>

<h3>0-dimensional array</h3>
<li>scalars</li>

In [2]:
import numpy as np
s = np.array(5.5)
print(s)
print(type(s))

5.5
<class 'numpy.ndarray'>


<h3>1-dimensional array</h3>
<li>vectors</li>
<li>Note that they print differently from lists: the elements are separated by spaces rather than commas</li>

In [1]:
import numpy as np
print([1,2,3,4,5])
v = np.array([1,2,3,4,5])
print(type(v))
print(v)

[1, 2, 3, 4, 5]
<class 'numpy.ndarray'>
[1 2 3 4 5]


<h3>2-dimensional array</h3>
<li>matrices</li>
<li>list of lists vs 2-d array</li>

In [5]:
[
    [11,12,13,14],
    [21,22,23,24],
]

[[11, 12, 13, 14], [21, 22, 23, 24]]

In [6]:
import numpy as np
m = np.array([
    [11,12,13,14],
    [21,22,23,24],
])
print(m)
print(type(m))

[[11 12 13 14]
 [21 22 23 24]]
<class 'numpy.ndarray'>


<h3>3-dimensional array</h3>
n-dimensional arrays are also known as  <span style="color:red">tensors</span>

In [7]:
T = np.array([
    [
        [111,112,113,114],
        [121,122,123,124]
    ],
    [
        [211,212,213,214],
        [221,222,223,224]
    ],
    [
        [311,312,313,314],
        [321,322,323,324]
    ]
])
print(T)
type(T)

[[[111 112 113 114]
  [121 122 123 124]]

 [[211 212 213 214]
  [221 222 223 224]]

 [[311 312 313 314]
  [321 322 323 324]]]


numpy.ndarray
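The number of dimensions of the arrays above can also be checked programmatically. The following is a minimal sketch using the standard ndarray attributes ndim, shape, and size; the array T here is a zero-filled stand-in with the same shape as the 3-dimensional example, not the original data.

```python
import numpy as np

s = np.array(5.5)                         # 0-d (scalar)
v = np.array([1, 2, 3, 4, 5])             # 1-d (vector)
m = np.array([[11, 12, 13, 14],
              [21, 22, 23, 24]])          # 2-d (matrix)
T = np.zeros((3, 2, 4))                   # 3-d stand-in, same shape as T above

for name, arr in [("s", s), ("v", v), ("m", m), ("T", T)]:
    # ndim  = number of dimensions
    # shape = number of elements along each dimension
    # size  = total number of elements
    print(name, arr.ndim, arr.shape, arr.size)
```

A 0-dimensional array has an empty shape tuple, which is why a scalar array reports a size of 1 but no dimensions.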

<h2>data types</h2>

<li>A numpy array has a data type associated with its elements
<li>and elements need to be of the same data type
<li>But an element could be an 'arbitrarily' complex object
<li>note that these are numpy data types, not python data types
<li>for example, python ints are of indeterminate length but numpy ints have fixed sizes

<li>numpy can infer the data type</li>
<li>or you can specify it</li>

In [8]:
ax = np.array([1,2,3,4,5])
ax.dtype

dtype('int64')

In [9]:
np.array([1,2,'abracadabrafhasdklghadl'])

array(['1', '2', 'abracadabrafhasdklghadl'], dtype='<U23')

In [10]:
np.array([{'a':1,'b':2},4])

array([{'a': 1, 'b': 2}, 4], dtype=object)
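To make the difference between python ints and numpy ints concrete, the sketch below compares a python int that grows as needed with the fixed range of an int8. The variable names big and info are illustrative only.

```python
import numpy as np

# A python int has no fixed length: it grows as needed
big = 2**100
print(big)

# A numpy int8 uses exactly 8 bits, so its range is fixed
info = np.iinfo(np.int8)
print(info.min, info.max)

# Each element of an int64 array occupies 8 bytes
ax = np.array([1, 2, 3], dtype='int64')
print(ax.itemsize)
```

np.iinfo reports the limits of any numpy integer type; np.finfo does the same for floating point types.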

<h2>Specifying the type</h2>
<h3>Useful when reading a text stream directly into a numerical array</h3>

<h4>The <i>dtype</i> attribute</h4>
<li>Stores the data type of the array's elements
<li>numpy makes a best guess of the data type
<li>unless the data type is explicitly specified
<li>if a value cannot be converted to the specified type, numpy throws an exception (convertible values, such as the string '3' for an int array, are converted)
<li>the dtype attribute can be read directly (ax.dtype)

In [3]:
ax = np.array([1,2,'3'],dtype='int64') #'3' is automatically converted to int
ax.dtype

dtype('int64')

In [4]:
ax

array([1, 2, 3])
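The example above succeeds because '3' can be parsed as an integer. As a hedged illustration of the failing case, the sketch below uses a made-up value, 'three', that cannot be parsed; the names ok and failed are illustrative.

```python
import numpy as np

# '3' can be parsed as an integer, so this works
ok = np.array([1, 2, '3'], dtype='int64')
print(ok)

# 'three' cannot be parsed as an integer, so numpy raises a ValueError
try:
    np.array([1, 2, 'three'], dtype='int64')
    failed = False
except ValueError as err:
    failed = True
    print("conversion failed:", err)
```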

<li>Numpy lets you specify numeric types of different sizes (int - 8, 16, 32, 64; float - 16, 32, 64, 128)</li>
<li>Giving the programmer some control over memory usage</li>

In [2]:
x=['1','2','3']
xi = np.array(x,'int')
xf = np.array(x,'float')
xs = np.array(x,'str')
xi32 = np.array(x,'int32')
xf128 = np.array(x,'float128') #If the system allows (may not work on Windows machines)
xi8 = np.array(x,'int8')
print(xi,xf,xs,xi32,xf128,xi8,sep='\n')
print(xi.dtype,xf.dtype,xs.dtype,xi32.dtype,xf128.dtype,xi8.dtype,sep='\n')


[1 2 3]
[1. 2. 3.]
['1' '2' '3']
[1 2 3]
[1. 2. 3.]
[1 2 3]
int64
float64
<U1
int32
float128
int8


In [None]:
xs
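The effect of the element size on memory can be checked with the nbytes attribute. This is a small sketch with zero-filled arrays of an arbitrary length (n is an assumption chosen for illustration).

```python
import numpy as np

n = 1000
small = np.zeros(n, dtype='int8')    # 1 byte per element
large = np.zeros(n, dtype='int64')   # 8 bytes per element

# nbytes = number of elements x bytes per element
print(small.nbytes, large.nbytes)
```

The smaller type saves memory, but it can only hold values in its fixed range (for int8, -128 to 127).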

<li>The <i>astype</i> method returns a copy of the array converted from one type to another</li>
<li>Only if the conversion is possible!</li>


In [17]:
ax = np.array([1,2,3,'4'])
print(ax)
print(ax.dtype)
ay = ax.astype(int)

['1' '2' '3' '4']
<U21


In [18]:
ay

array([1, 2, 3, 4])
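Two cases are worth trying with astype: a conversion that loses information and a conversion that is not possible. The sketch below uses made-up values; af and converted are illustrative names.

```python
import numpy as np

# float -> int drops the fractional part (it does not round)
af = np.array([1.7, -1.7, 2.0])
print(af.astype(int))

# Strings that are not numbers cannot be converted
try:
    np.array(['1', '2', 'x']).astype(int)
    converted = True
except ValueError as err:
    converted = False
    print("cannot convert:", err)
```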

<h2>Indexing</h2>
<li>One-dimensional array indexing is identical to list indexing</li>
<li>multi-dimensional arrays use direct indexing, lists use chain indexing</li>

<h4>one-dimensional arrays</h4>

In [19]:
x=[2.3,1.4,9.2,10.1]
x[3]

10.1

In [20]:
ax = np.array([2.3,1.4,9.2,10.1])
ax[3]

10.1

<h4>multi-dimensional arrays</h4>


In [21]:
x = [
    [1,2,3,4],
    [5,6,7,8]
]

#chained indexing
y = x[1]
z = y[1]

#or:
#First get the row into a new object
#Then get the desired column from the new object

x[1][1]

6

In [23]:
(x[1])[1]

6

<li>numpy uses direct (C-style) indexing rather than chained indexing</li>
<li>This is faster, and it works because every element of an np array has the same type and therefore the same size</li>
<li>So the location of any value can be calculated directly from its indexes</li>

In [24]:
x = [
    [1,2,3,4],
    [5,6,7,8]
]

#chained indexing
print(x[0][2])

#direct indexing
ax = np.array(x)
print(ax[0,2])

#Every element has the same fixed size (64 bits when the dtype is int64)
#The array is 2 x 4
#ax[0,2] is at location 4x0 + 2 = 2 of the flattened array [1,2,3,4,5,6,7,8]
#ax[1,2] is at location 4x1 + 2 = 6 of the flattened array


3
3


In [None]:
ax
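The location calculation described in the comments can be verified directly. The sketch below flattens the array with ravel and compares the computed position with direct indexing; the names n_rows, n_cols, flat, and position are illustrative.

```python
import numpy as np

ax = np.array([[1, 2, 3, 4],
               [5, 6, 7, 8]])
n_rows, n_cols = ax.shape
flat = ax.ravel()                 # the same values as a 1-d array

row, col = 1, 2
position = row * n_cols + col     # 4*1 + 2 = 6
print(position, flat[position], ax[row, col])

# strides: how many bytes to skip to move one step along each dimension
print(ax.strides, ax.itemsize)
```

The strides show the same arithmetic in bytes: moving down one row skips a full row of elements, while moving across one column skips a single element.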

<h3>Slicing</h3>
<li>1-d arrays use list-like slicing</li>
<li>2+ dimension arrays are a bit more complicated!</li>
<li>Because lists don't understand the concept of dimension, 2-d lists cannot be directly sliced</li>

In [26]:
x=[1,2,3,4]
x[1:3]

[2, 3]

In [27]:
import numpy as np
ax = np.array([1,3,5,7,9])
ax[2:4]

array([5, 7])

In [5]:
x=[
    [0,1,2,3,4,5],
    [10,11,12,13,14,15],
    [20,21,22,23,24,25]
]
ax = np.array(x)
ax

array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25]])

In [None]:
x

<li>specify the slice for each dimension separately</li>
<li>separating each slice with a comma</li>
<li>defaults work the same way as in list slicing</li>

In [31]:
#First specify the row slice, then the column slice
ax[1:, 1:4]

array([[11, 12, 13],
       [21, 22, 23]])

In [7]:
ax[1:3,2:4]
#Intersection between ax[1:3,:] and ax[:,2:4]

array([[12, 13],
       [22, 23]])

In [8]:
ax[:,2:]

array([[ 2,  3,  4,  5],
       [12, 13, 14, 15],
       [22, 23, 24, 25]])
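To see why 2-d lists cannot be sliced directly, compare the same row and column selection on a list of lists and on an array. This is a minimal sketch; rows_twice, list_block, and array_block are illustrative names.

```python
import numpy as np

x = [
    [0, 1, 2, 3, 4, 5],
    [10, 11, 12, 13, 14, 15],
    [20, 21, 22, 23, 24, 25],
]
ax = np.array(x)

# x[1:][1:4] slices the rows twice; it never reaches the columns
rows_twice = x[1:][1:4]

# With lists, the column slice has to be applied to each row in turn
list_block = [row[1:4] for row in x[1:]]

# With an array, one expression slices both dimensions
array_block = ax[1:, 1:4]

print(rows_twice)
print(list_block)
print(array_block)
```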

<h2>Reshaping an array</h2>
<li>The shape of an array is the number of elements along each dimension</li>
<li>The <span style="color:blue">reshape</span> function changes the shape of an np array, keeping the total number of elements constant</li>

In [34]:
ax=np.array([
    [1,2,3,4,5,6],
    [7,8,9,10,11,12],
    [13,14,15,16,17,18]
])
print(ax)
print(ax.shape)
print(ax.reshape(9,2))
print(ax.reshape(3,3,2))
ax.reshape(10,3) #Why does this not work?


[[ 1  2  3  4  5  6]
 [ 7  8  9 10 11 12]
 [13 14 15 16 17 18]]
(3, 6)
[[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]
 [11 12]
 [13 14]
 [15 16]
 [17 18]]
[[[ 1  2]
  [ 3  4]
  [ 5  6]]

 [[ 7  8]
  [ 9 10]
  [11 12]]

 [[13 14]
  [15 16]
  [17 18]]]


ValueError: cannot reshape array of size 18 into shape (10,3)
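The reshape fails because the new shape must hold exactly the same number of elements: 10 x 3 needs 30 values, but the array has 18. The sketch below also shows that one dimension can be given as -1, in which case numpy works it out from the total size. The array is rebuilt with arange for brevity.

```python
import numpy as np

ax = np.arange(1, 19).reshape(3, 6)   # 18 elements, as in the example above

# -1 asks numpy to work out that dimension from the total size
print(ax.reshape(-1, 2).shape)
print(ax.reshape(2, -1).shape)

# 18 elements cannot fill a 10 x 3 array (30 slots)
try:
    ax.reshape(10, 3)
    fits = True
except ValueError as err:
    fits = False
    print(err)
```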

<h4>reshape returns a view (not a copy) of the original array whenever possible, so changing the view changes the original</h4>

In [35]:
ax=np.array([
    [1,2,3,4,5,6],
    [7,8,9,10,11,12],
    [13,14,15,16,17,18]
])
ay = ax.reshape(9,2)
print(ay)
ay[1,1] = 30
ax

[[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]
 [11 12]
 [13 14]
 [15 16]
 [17 18]]


array([[ 1,  2,  3, 30,  5,  6],
       [ 7,  8,  9, 10, 11, 12],
       [13, 14, 15, 16, 17, 18]])
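Whether a reshaped array shares data with the original can be tested with np.shares_memory. The sketch below shows one case where reshape returns a view and one where numpy has to copy (reshaping a transposed array); ay, az, and ac are illustrative names.

```python
import numpy as np

ax = np.arange(1, 19).reshape(3, 6)

ay = ax.reshape(9, 2)                 # contiguous data: a view
print(np.shares_memory(ax, ay))

az = ax.T.reshape(-1)                 # transposed data: numpy has to copy
print(np.shares_memory(ax, az))

# An explicit copy is always independent of the original
ac = ax.reshape(9, 2).copy()
ac[0, 0] = 99
print(ax[0, 0])
```

When the original must not change, calling copy() on the result makes the independence explicit.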

<h1 style="color:red">elementwise operations</h1>
<li>numpy functions with array arguments can work elementwise</li>
<li>a process of <span style="color:blue">vectorization</span> leads to fast application of transformations on an array</li>
<li><h2><span style="color:green">for loops are a no-no when using numpy (and pandas)!</span></h2></li>


<h4>Comparing elementwise vs for loop</h4>
<li>Add 1 to every element in an np array</li>

In [36]:
import numpy as np
ax = np.array([1,2,3,4])
ax+1

array([2, 3, 4, 5])

In [9]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [37]:
import numpy as np
ax = np.arange(10000000)


<li>time the elementwise operation</li>

In [38]:
%%time
r=ax+1

CPU times: user 12.5 ms, sys: 18.5 ms, total: 31 ms
Wall time: 35.4 ms


<li>time the loop</li>

In [39]:
%%time
r = list()
for i in ax:
    r.append(i+1)

CPU times: user 1.36 s, sys: 73.3 ms, total: 1.44 s
Wall time: 1.44 s


In [40]:
%%time
r = [i+1 for i in ax]

CPU times: user 990 ms, sys: 87.1 ms, total: 1.08 s
Wall time: 1.08 s


<h2>Digression: List comprehension</h2>
<li>a descriptive way to create a new list from an existing one</li>
<li>more generally, a descriptive way of creating a new collection from an existing one</li>
<li>think of list comprehension as a <span style="color:blue">one line</span> way of writing a for loop</li>

In [41]:
#For loop
#Note that y is initialized explicitly
x = [1,2,3,4,5,6,7,8,9,10]
y = list()
for thing in x:
    y.append(thing*thing)
y

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

In [None]:
#List comprehension
#Note that y is initialized implicitly
x = [1,2,3,4,5,6,7,8,9,10]
y=[thing*thing for thing in x]
y

<img src="lc1.png">

<h4>Adding conditions</h4>

In [42]:
x = [1,2,3,4,5,6,7,8,9,10]
y = list()
for thing in x:
    if thing%2==0:
        y.append(thing*thing)
y

[4, 16, 36, 64, 100]

In [None]:
#The for selects a "thing" from x
#The if applies a condition to the selected "thing"
#The thing*thing does some transformation to the selected thing
#The result of the transformation is appended to y
y = [thing*thing for thing in x if thing%2==0]
y
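The same pattern creates other collections, as noted at the start of this digression. A short sketch with a set comprehension and a dict comprehension follows; remainders and even_squares are illustrative names.

```python
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# set comprehension: the distinct remainders when dividing by 3
remainders = {thing % 3 for thing in x}

# dict comprehension: map each even number to its square
even_squares = {thing: thing * thing for thing in x if thing % 2 == 0}

print(remainders)
print(even_squares)
```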

<h2>Try This</h2>
A list contains the names of students and their scores in a test (on 40 points). Use list comprehension to generate a list of scores rescaled to 100 points (percent scores) and then find the average percent score for the class

In [45]:
s = [("Jack",34),("Jill",20),("Moshi",31),("Ravneet",27),("Myo",22)]
sum([thing[1]*100/40 for thing in s])/len(s)

67.0

<h2>aggregate functions and elementwise operations</h2>
<li>No loops necessary!

In [46]:
import numpy as np
ax = np.array([13,24,21.2,17.6,21.7],'float')
print(ax.sum(),ax.mean(),ax.std(),sep='\n')

97.50000000000001
19.500000000000004
3.8429155598321434


In [47]:
%%time
ax = np.arange(10000000)
ax.sum()

CPU times: user 9.9 ms, sys: 2.35 ms, total: 12.2 ms
Wall time: 14.7 ms


49999995000000

In [48]:
%%time
s = 0
for i in ax:
    s += i
s

CPU times: user 1.22 s, sys: 4.74 ms, total: 1.23 s
Wall time: 1.24 s


49999995000000

<h4>Average percent score on exam using numpy</h4>

In [3]:
import numpy as np
s = np.array([("Jack",34),("Jill",20),("Moshi",31),("Ravneet",27),("Myo",22)])
(s[:,1:].astype('float')*100.0/40.0).sum()/len(s)

67.0
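The version above needs astype because an array that mixes names and scores stores everything as strings. One alternative, sketched here under the assumption that the names and scores can be kept in two parallel arrays, avoids the conversion.

```python
import numpy as np

s = [("Jack", 34), ("Jill", 20), ("Moshi", 31), ("Ravneet", 27), ("Myo", 22)]

# Mixing names and scores in one array turns everything into strings
mixed = np.array(s)
print(mixed.dtype)

# Keeping the columns in separate arrays preserves the numeric type
names = np.array([name for name, score in s])
scores = np.array([score for name, score in s], dtype='float')
percent = scores * 100 / 40
print(percent.mean())
```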

<h3>Operations on more than one array</h3>
<li>Operations involving more than one array work elementwise by matching indexes</li>
<li>Must have arrays of compatible shapes</li>

In [50]:
import numpy as np
ax = np.array([13,24,21.2,17.6,21.7],'float')
ay = np.array([1,3,4,7,2],'float')
print(ax - ay)
print(ax + ay)
print(ax * ay)
print(ax/ay)
ax + 1

[12.  21.  17.2 10.6 19.7]
[14.  27.  25.2 24.6 23.7]
[ 13.   72.   84.8 123.2  43.4]
[13.          8.          5.3         2.51428571 10.85      ]


array([14. , 25. , 22.2, 18.6, 22.7])

In [52]:
#This will throw an exception
ax = np.array([13,24,21.2,17.6,21.7],'float')
ay = np.array([1,3,4,7,2,6],'float')
print(ax - ay)

ValueError: operands could not be broadcast together with shapes (5,) (6,) 
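The word "broadcast" in the error message refers to how numpy stretches arrays of different but compatible shapes. A minimal sketch of three compatible cases follows; am, col, row, and grid are illustrative names.

```python
import numpy as np

ax = np.array([13, 24, 21.2, 17.6, 21.7])

# A scalar is stretched to match every element
print(ax + 1)

# A (2, 5) array and a (5,) array are compatible:
# the 1-d array is applied to each row
am = np.array([ax, ax * 2])
print((am - ax).shape)

# A (3, 1) column and a (4,) row combine into a (3, 4) grid
col = np.array([[0], [10], [20]])
row = np.array([1, 2, 3, 4])
grid = col + row
print(grid)
```

Shapes (5,) and (6,) fail because neither length is 1 and the lengths differ, so there is no way to stretch one array to match the other.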

<h3>Matrix multiplication</h3>
<li>The * operator does an element-wise multiplication of the elements of two arrays of compatible shapes</li>
<li>The <span style="color:blue">dot</span> function computes the dot product (matrix multiplication for 2-d arrays)</li>
<li>provided, of course, the dimensions are correct</li>

In [53]:
import numpy as np
ax = np.arange(10)
ay = np.array([ax,ax*2]) #A 2-d array
ay

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18]])

In [54]:
ax*ay

array([[  0,   1,   4,   9,  16,  25,  36,  49,  64,  81],
       [  0,   2,   8,  18,  32,  50,  72,  98, 128, 162]])

In [None]:
print(ax.shape)
print(ay.shape)

<h4>dot product</h4>
<li>ax is a vector (1x10 equivalent)</li>
<li>ay is a matrix (2x10)</li>
<li>Need to reshape ay so that its first dimension is 10 (note that reshape re-flows the values row by row; it is not the same as a transpose)</li>

In [55]:
np.dot(ay,ay.reshape(10,2)) #Dot product

array([[380, 460],
       [760, 920]])
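Reshaping and transposing both produce a 10 x 2 array, but they arrange the values differently, so the products differ. The sketch below compares the two and also multiplies the vector ax by the transposed matrix; by_reshape and by_transpose are illustrative names.

```python
import numpy as np

ax = np.arange(10)
ay = np.array([ax, ax * 2])        # shape (2, 10)

# reshape re-flows the 20 values row by row
print(ay.reshape(10, 2)[:2])
# the transpose turns rows into columns
print(ay.T[:2])

# (2, 10) times (10, 2) gives a (2, 2) result either way,
# but the values differ because the right-hand matrices differ
by_reshape = np.dot(ay, ay.reshape(10, 2))
by_transpose = ay @ ay.T           # @ is the matrix multiplication operator

print(by_reshape)
print(by_transpose)

# vector (10,) times matrix (10, 2)
print(np.dot(ax, ay.T))
```

For the usual linear algebra product of a matrix with its own transpose, ay.T is the form to use.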

<h1>Elementwise Boolean Operations</h1>
<li><b>Boolean masks</b></li>
<ul>
    <li> Boolean operations apply a condition to each element in turn
    <li> And return a <span style="color:red">boolean mask</span>  with
<ul>
<li> True for each value that satisfies the condition
<li> False for every other value
        </ul>
    <li>The mask can then be applied on any array with the same shape</li>
    <li>To return a new array containing only the elements that correspond to the True values in the mask</li>

In [4]:
import numpy as np
ax = np.array([1,4,7,9,2,3,10,11,34,2])
mask = ax < 7 #mask has the same length as ax but the values are boolean
mask

array([ True,  True, False, False,  True,  True, False, False, False,
        True])

In [7]:
x = [1]*len(mask)
x[mask]

TypeError: only integer scalar arrays can be converted to a scalar index

In [10]:
y = np.array(x)
y[mask]

array([1, 1, 1, 1, 1])

In [57]:
#ax[ax<7]
ax[mask] #Only the values less than 7 will be returned

array([1, 4, 2, 3, 2])

<li>Apply the mask to a different array</li>
<li>The second array must have the same shape as the mask</li>
<li>Note that the mask no longer has the original condition, only True or False values</li>
<li>Consequently, ay below is not selecting values less than 7</li>

In [58]:
ay = np.array([8,8,3,2,38,64,7,8,9,2])
ay[mask]

array([ 8,  8, 38, 64,  2])

In [59]:
ay[ay<7]

array([3, 2, 2])

In [60]:
ay[ax<7]

array([ 8,  8, 38, 64,  2])

<h4>The mask doesn't have to be constructed on the same array</h4>
<li>But the mask and the array should have the same shape</li>
<li><i style="color:blue">Which employees receive a bonus of greater than $130,000?</i></li>
<ul><li>Construct a mask on the bonus array</li>
<li>Apply the mask on the names array</li>

In [61]:
names = np.array(['Bill','Ludovica','Qing','Savitri','Giovanni',"Birgit"])
bonus = np.array([232300.56,478123.45,3891.24,98012.36,52123.50,0])

mask = bonus>130000
names[mask]

array(['Bill', 'Ludovica'], dtype='<U8')

In [62]:
#or, for simplicity
names[bonus > 130000]

array(['Bill', 'Ludovica'], dtype='<U8')

<h4>Boolean operations</h4>
<li>numpy logical operators are <b>different</b> from python logical operators</li>
<li>the numpy equivalent of "and" is "&"</li>
<li>the numpy equivalent of "or" is "|"</li>
<li>the numpy equivalent of "not" </li>
    <ul>
        <li>is <span style="color:red">!</span> when used in a relational expression (!=, exactly as in plain python)</li>
        <li> is <span style="color:red">~</span> when used on a boolean mask, as in ~(x>5)</li>


<li>bonus between 90000 and 400000</li>

In [63]:
names = np.array(['Bill','Ludovica','Qing','Savitri','Giovanni',"Birgit"])
bonus = np.array([232300.56,478123.45,3891.24,98012.36,52123.50,0])
(bonus>90000)&(bonus<400000) # be careful - not the same as bonus>90000 & bonus<400000

names[(bonus>90000)&(bonus<400000)]

array(['Bill', 'Savitri'], dtype='<U8')

In [None]:
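#Remember to use parentheses!
(bonus>90000) & (bonus<400000)
#bonus>90000 & bonus<400000  #fails: & binds more tightly than > and <, so python reads this as bonus > (90000 & bonus) < 400000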
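The need for parentheses comes from python's operator precedence: & is evaluated before > and <. The sketch below shows the parenthesized form, the equivalent np.logical_and function, and the two forms that fail. The flag names raised and raised_and are illustrative.

```python
import numpy as np

bonus = np.array([232300.56, 478123.45, 3891.24, 98012.36, 52123.50, 0])

# With parentheses: two masks combined elementwise
mask = (bonus > 90000) & (bonus < 400000)

# np.logical_and is an equivalent function form
same = np.logical_and(bonus > 90000, bonus < 400000)
print(np.array_equal(mask, same))

# Without parentheses python evaluates 90000 & bonus first, which fails
try:
    bonus > 90000 & bonus < 400000
    raised = False
except (TypeError, ValueError) as err:
    raised = True
    print(type(err).__name__)

# The python keyword "and" does not work elementwise on arrays either
try:
    (bonus > 90000) and (bonus < 400000)
    raised_and = False
except ValueError as err:
    raised_and = True
    print(type(err).__name__)
```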

<h4>Aggregate operations and boolean masks</h4>

In [64]:
names = np.array(['Bill','Ludovica','Qing','Savitri','Giovanni',"Birgit"])
bonus = np.array([232300.56,478123.45,3891.24,98012.36,52123.50,0])

print(np.sum(bonus[(bonus>50000) & (bonus < 200000)])) #Sum the bonuses between 50000 and 200000
print(np.sum(bonus[(names=="Bill") | (names == "Qing")])) #Sum up the bonuses paid to Bill and Qing
print(np.sum(bonus[(names!="Bill")])) # sum up the bonuses paid to everyone other than Bill
print(np.sum(bonus[~((names=="Bill") | (names == "Qing"))])) #Sum the bonuses paid to everyone other than Bill and Qing



150135.86
236191.8
632150.55
628259.31


In [65]:
np.count_nonzero(bonus)

5

In [71]:
print("The number of bonuses that are greater than 100000 is: ",np.sum(bonus>100000)) #Adds the True values in the mask
print("The sum of the bonuses that are greater than 100000 is: ",np.sum(bonus[bonus>100000]))
print("The number of non-zero bonuses is: ",np.count_nonzero(bonus))
print("There is at least one bonus greater than 300000: ",np.any(bonus>300000)) #any returns True or False
print("Everyone got a bonus (all are greater than 0): ",np.all(bonus>0)) #all returns True or False

The number of bonuses that are greater than 100000 is:  2
The sum of the bonuses that are greater than 100000 is:  710424.01
The number of non-zero bonuses is:  5
There is at least one bonus greater than 300000:  True
Everyone got a bonus (all are greater than 0):  False


In [68]:
bonus>300000

array([False,  True, False, False, False, False])

In [76]:
bonus>0

array([ True,  True,  True,  True,  True, False])

In [77]:
bonus

array([232300.56, 478123.45,   3891.24,  98012.36,  52123.5 ,      0.  ])

In [72]:
#List the bonus paid to everyone except Bill and Qing
mask = ~((names=="Bill") | (names == "Qing"))
bonus[mask]

array([478123.45,  98012.36,  52123.5 ,      0.  ])

In [74]:
names

array(['Bill', 'Ludovica', 'Qing', 'Savitri', 'Giovanni', 'Birgit'],
      dtype='<U8')

In [75]:
~((names=="Bill") | (names == "Qing"))

array([False,  True, False,  True,  True,  True])

<h2>In-class problem</h2>
<li>Given the data below, answer the following questions (write code that does it!)</li>
<ol>
    <li>Calculate the mean and standard deviation of bonus amounts for all female employees with bonus less than 100,000</li>
    <li>Return an nd array containing the names of all female employees with bonus less than 100,000</li>
    <li>Return the names of female employees who receive a bonus less than 100000 and who are from either New York or Sydney</li>
    <li>Return the names of the employees who make a bonus greater than 50% of their salary</li>

In [11]:
names = np.array(['Bill','Ludovica','Qing','Savitri','Giovanni',"Birgit"])
bonus = np.array([232300.56,478123.45,3891.24,98012.36,52123.50,0])
gender = np.array(['M','F','F','F','M','F'])
city = np.array(["New York","Catania","Paris","New York","Sydney","Sydney"])
salary = np.array([455000,722321,95223,135000,132033,700000])


In [19]:
bonus[(bonus < 100000) & (gender == 'F')].std()

45314.149974939275

In [18]:
np.mean(bonus[(bonus<100000) & (gender=="F")]) #the mean (not displayed: only the last expression in a cell is shown)
np.std(bonus[(bonus<100000) & (gender=="F")])

45314.149974939275

In [88]:
names[(bonus<100000) & (gender=="F")]


array(['Qing', 'Savitri', 'Birgit'], dtype='<U8')

In [91]:
names[(gender=="F") & (bonus<100000) & ((city == "New York") | (city=="Sydney"))]

array(['Savitri', 'Birgit'], dtype='<U8')

In [93]:
names[bonus > (0.5 * salary)]

array(['Bill', 'Ludovica', 'Savitri'], dtype='<U8')

<h3>batch: Selecting elements using where</h3>
<li>The <i>where</i> function creates a new array, of the same shape, using a conditional expression
<li>Somewhat like the IF function in an Excel spreadsheet

In [None]:
#=IF(A1>95,"A+",IF(A1>90,"A","B"))

In [None]:
x=[[0,1,2,3,4,5],[10,11,12,13,14,15],[20,21,22,23,24,25]]
ax=np.array(x,float)
print(ax)
np.where(ax%2==0,ax,0)

<li>Employees who receive more than $250,000 in compensation are managers while the others are serfs</li>
 <li>Find the names of the serfs</li>

In [94]:
salary = np.array([455000,722321,95223,135000,132033,700000])
types = np.where(salary>250000,"Manager","Serf")
types

array(['Manager', 'Manager', 'Serf', 'Serf', 'Serf', 'Manager'],
      dtype='<U7')
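To finish the task and find the names of the serfs, the types array can itself be used to build a mask on the names array. A minimal sketch, reusing the names and salary data from the earlier cells:

```python
import numpy as np

names = np.array(['Bill', 'Ludovica', 'Qing', 'Savitri', 'Giovanni', 'Birgit'])
salary = np.array([455000, 722321, 95223, 135000, 132033, 700000])

types = np.where(salary > 250000, "Manager", "Serf")

# Build a mask from the labels and apply it to the names
serfs = names[types == "Serf"]       # equivalent to names[salary <= 250000]
print(serfs)
```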

In [95]:
np.where(salary>350000,"rich",np.where(salary>100000,"middle","poor"))

array(['rich', 'rich', 'poor', 'middle', 'middle', 'rich'], dtype='<U6')
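Nested where calls become hard to read as the number of categories grows. One alternative, offered here as a sketch rather than as the approach these notes prescribe, is np.select, which takes a list of conditions and a matching list of values. The names conditions, labels, and selected are illustrative.

```python
import numpy as np

salary = np.array([455000, 722321, 95223, 135000, 132033, 700000])

nested = np.where(salary > 350000, "rich",
                  np.where(salary > 100000, "middle", "poor"))

# Conditions are checked in order; the first one that is True wins
conditions = [salary > 350000, salary > 100000]
labels = ["rich", "middle"]
selected = np.select(conditions, labels, default="poor")

print(selected)
```

Because the first matching condition wins, the conditions should be listed from most to least restrictive, just as in the nested version.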