# NumPy arrays, array manipulation, random numbers

**NumPy**, short for Numerical Python, is a cornerstone of numerical computing in Python. 
It provides the data structures, algorithms, and library glue needed for most scientific applications involving numerical data in Python.

NumPy's capabilities include the following:

a) A fast and efficient multi-dimensional array object, *ndarray*

b) Functions for performing element-wise computations with arrays or mathematical computations between arrays.

c) Tools for reading and writing array-based datasets to disk

d) Linear algebra operations, random number generation

NumPy is a large topic.  Let us focus on the following:

* 1) Fast vectorized array operations for data munging and cleaning, subsetting and filtering, transformation, and other computations
* 2) Common array algorithms like sorting, unique and set operations
* 3) Efficient descriptive statistics and aggregating / summarizing data
* 4) Expressing conditional logic as array operations instead of iterative statements (loops) with if-elif-else branches.
* 5) Group-wise data manipulation (aggregation, transformation, function application).
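
The first four items above can be sketched on a small array. This is a minimal illustration rather than a full treatment; the array `scores` and its values are made up for the example.

```python
import numpy as np

# Hypothetical data, invented for this illustration
scores = np.array([72, 85, 90, 85, 60, 72, 95])

# 1) Vectorized filtering: keep only the scores above 70
passed = scores[scores > 70]

# 2) Sorting and unique values
sorted_scores = np.sort(scores)
unique_scores = np.unique(scores)

# 3) Descriptive statistics
mean_score = scores.mean()
max_score = scores.max()

# 4) Conditional logic without a loop: label each score
labels = np.where(scores >= 80, 'high', 'low')

print(passed)
print(sorted_scores)
print(unique_scores)
print(mean_score, max_score)
print(labels)
```

Each statement operates on the whole array at once, so no explicit for loop is needed.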

**NumPy is designed for efficiency on large arrays of data.**


*To check whether NumPy arrays are faster than their pure Python counterparts,
consider a NumPy array of ten million integers and the equivalent Python list.*

In [1]:
import numpy as np
my_array    = np.arange(10000000)
my_list     = list(range(10000000))

Now, let us multiply each sequence by 2.45555 one hundred times in a loop,
using %time to measure the wall time.

In [2]:
print('\nNumpy Arrays speed')
%time for _ in range(100): my_array2 = my_array * 2.45555


Numpy Arrays speed
Wall time: 3.95 s


In [3]:
print('\nOther Pure Python list speed')
%time for _ in range(100): my_list2 = [ x *  2.45555 for x in my_list]


Other Pure Python list speed
Wall time: 1min 25s


In [4]:
85/3.98 # Numpy arrays are about 21 times faster 

21.35678391959799

We observe that calculations using NumPy arrays are around 21 times faster than the same calculations on a pure Python list (about 4 seconds versus 85 seconds in this run). The exact timings depend on the machine.
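
The %time magic works only inside IPython or Jupyter. Outside a notebook, a similar comparison can be sketched with the standard library's time.perf_counter. The smaller array size below is an assumption made to keep the run short, so the measured ratio will not match the figures above.

```python
import time
import numpy as np

n = 100_000                      # smaller than in the text, to keep the run short
my_array = np.arange(n)
my_list = list(range(n))

start = time.perf_counter()
array_result = my_array * 2.45555              # one vectorized operation
numpy_seconds = time.perf_counter() - start

start = time.perf_counter()
list_result = [x * 2.45555 for x in my_list]   # one Python-level loop
list_seconds = time.perf_counter() - start

print('NumPy: %.6f s, list: %.6f s' % (numpy_seconds, list_seconds))
# Both approaches compute the same values
print(np.allclose(array_result, list_result))
```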

numpy.random.randn returns a sample or samples from the standard normal distribution, which has mean 0 and standard deviation 1.

Let us generate a small 2 by 4 array of random data.

In [5]:
import numpy as np
data = np.random.randn(2,4)
print(data) # print the array, data


[[-0.90664013 -0.11620656 -1.12047217  0.45962719]
 [ 0.0669131  -0.51101505  1.08006919  1.94691573]]
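
As a quick sanity check of the distribution parameters, one can draw a large sample and compare its sample mean and standard deviation with the theoretical values of 0 and 1 for the standard normal distribution. The sample size and seed below are arbitrary choices for this sketch.

```python
import numpy as np

np.random.seed(0)                    # arbitrary seed, for repeatability
sample = np.random.randn(100000)     # 100,000 draws from the standard normal

print('sample mean  %.3f' % sample.mean())   # should be close to 0
print('sample std   %.3f' % sample.std())    # should be close to 1
```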


An ndarray is a generic multidimensional container for homogeneous data (all data elements are of the same type).
Every array has 
* a shape, a tuple indicating the size of each dimension.
* a dtype, an object indicating the data type of the array

The data type, or dtype, is a special object containing the information the ndarray needs to interpret a chunk of memory as a particular type of data.

In [6]:
print(data.shape) # size of each dimension

(2, 4)


In [7]:
print(data.dtype) # data type of the array

float64


Generally *array, ndarray, numpy array* all refer to the same **ndarray** object.

**How do you create ndarray?**

The function array creates a new NumPy array containing the passed data (any sequence-like object, including other arrays).

In [8]:
age           =  [29.5, 30.0, 30.5, 31.5, 22.0, 34.5, 33.5, 35.0]
age_arr1      =  np.array(age)
age_arr1

array([29.5, 30. , 30.5, 31.5, 22. , 34.5, 33.5, 35. ])

Nested sequences, like a list of equal-length lists, will be converted into a multi-dimensional array:

In [9]:
data3 = [[1,2,3,4,5], [3,4,5,6,7]]
arr3    = np.array(data3)
arr3

array([[1, 2, 3, 4, 5],
       [3, 4, 5, 6, 7]])

In [10]:
print('\nShape of arr3')
print(arr3.shape)
print('\nDimension of arr3')
print(arr3.ndim)


Shape of arr3
(2, 5)

Dimension of arr3
2


There are a number of functions for creating new arrays, such as zeros and ones, which hold 0s and 1s respectively, with a given length or shape.

The function empty creates an array without initializing its values to any particular value, so its contents are arbitrary.

In [11]:
print('creates an array of 10 zeros')
print(np.zeros(10)) # creates an array of 10 zeros
print('\ncreates a 2 X 5 array of zeros')
print(np.zeros((2,5))) # creates a 2 X 5 array of zeros
print('\ncreates an uninitialized 2 X 3 X 2 array')
print(np.empty((2,3,2))) # two 3 X 2 blocks; values are whatever was in memory, not zeros

creates an array of 10 zeros
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

creates a 2 X 5 array of zeros
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]

creates an uninitialized 2 X 3 X 2 array
[[[1.04723487e-311 2.81617418e-322]
  [0.00000000e+000 0.00000000e+000]
  [0.00000000e+000 8.60952352e-072]]

 [[5.33243585e-091 7.62484380e+169]
  [8.27014352e-072 1.80069055e+185]
  [6.48224659e+170 4.93432906e+257]]]


The function arange is an array-valued version of the built-in Python range function.

In [12]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

###  Array creation functions

| Sl No | Functions | Description |
| --- | -------- | ----------------------------------- |
| 1 | array | Convert input data containing any one of list, tuple, array, or other sequence type to an ndarray either by explicitly specifying a dtype or inferring a dtype; by default copies the input data |
| 2 | asarray | Convert input to ndarray, but do not copy if the input is already an ndarray |
| 3 | arange | Same as built-in range, but returns an ndarray instead |
| 4 | ones, ones_like | Produce an array of all 1s with the given shape and dtype; ones_like takes another array and produces a ones array of the same shape and dtype |
| 5 | zeros, zeros_like |  Produce an array of all 0s with the given shape and dtype; zeros_like takes another array and produces a zeros array of the same shape and dtype |
| 6 | empty, empty_like | Create new arrays by allocating new memory, but do not populate with any values like ones or zeros |
| 7 | full, full_like | Produce an array of the given shape and dtype with all values set to the indicated 'fill value'; full_like takes another array and produces a filled array of the same shape and dtype |
| 8 | eye, identity | Create a square n X n identity matrix, with 1s on the diagonal and 0s elsewhere |
| 9 | matrix | This returns a matrix from an array-like object, or from a string of data. |
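
A few of the functions in the table can be sketched as follows. The array `base` and the fill value 7.5 are arbitrary choices for illustration.

```python
import numpy as np

base = np.array([[1, 2, 3], [4, 5, 6]])

same_obj = np.asarray(base)       # no copy: the input is already an ndarray
new_copy = np.array(base)         # array copies its input by default

print(same_obj is base)           # True
print(new_copy is base)           # False

print(np.ones_like(base))         # 1s with the shape and dtype of base
print(np.full((2, 2), 7.5))       # 2 x 2 array filled with 7.5
print(np.eye(3))                  # 3 x 3 identity matrix
```

The difference between array and asarray matters mainly for large inputs, where an unnecessary copy costs both time and memory.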

**How to create a 2d Identity matrix with 1s on the diagonal and 0s elsewhere?**

In [13]:
import numpy as np

identity_matrix_2d = np.identity(2)
print('\nIdentity Matrix of 2 dimension')
print(identity_matrix_2d )


Identity Matrix of 2 dimension
[[1. 0.]
 [0. 1.]]


In [14]:
### Specify data types for the array while creating

array1 = np.array([1, 3, 4], dtype = np.float64)
array2 = np.array([1, 3, 4], dtype = np.int32)

print('\nData type of array1 %s' % array1.dtype)
print('\nData type of array2 %s' % array2.dtype)


Data type of array1 float64

Data type of array2 int32


You can explicitly convert an array from one data type to another using the astype method.

In [15]:
print('\narray2')
print(array2)
print('\nData type of array 2 ', end = ' =  ')
print(array2.dtype)
array2_float = array2.astype(np.float64)
print('\nData type of array2_float ', end = ' =  ')
print(array2_float.dtype)
print('\narray2_float')
print(array2_float)


array2
[1 3 4]

Data type of array 2  =  int32

Data type of array2_float  =  float64

array2_float
[1. 3. 4.]
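
Two further conversions are worth a short sketch: casting floats to integers, which truncates the decimal part, and casting an array of numeric strings to floats. The sample values are invented for illustration.

```python
import numpy as np

float_arr = np.array([3.7, -1.2, 0.5, 10.9])
int_arr = float_arr.astype(np.int32)      # decimal part is truncated, not rounded
print(int_arr)

numeric_strings = np.array(['1.25', '-9.6', '42'])
as_float = numeric_strings.astype(np.float64)   # strings that represent numbers
print(as_float)

print(float_arr.dtype)    # astype returned a new array; the original is unchanged
```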


**Vectorization property of NumPy**

Arrays are important because they enable you to perform arithmetic operations between equal-size arrays without writing any for loops. NumPy applies the operation element-wise.

Arithmetic operations include multiplication (*), subtraction (-), addition (+), division (/) and power (${**}$).
Arithmetic operations with scalars propagate the scalar argument to each element in the array.



In [16]:
print('data * 100 ',data * 100) # perform mathematical operation, multiply by 100
print('data + data ',data + data) # perform mathematical operation, addition
print('data  - 10 ',data  - 10) # perform mathematical operation, subtract 10
print('data / 10 ', data / 10) # perform division by 10
print('data * 1.4 ', data * 1.4) # perform multiplication with scalar 1.4 for each element
print('data ** 0.5 ', np.abs(data) ** 0.5) # perform square root for each element after taking absolute value


data * 100  [[ -90.66401272  -11.62065558 -112.04721661   45.96271914]
 [   6.69130973  -51.10150468  108.00691873  194.6915728 ]]
data + data  [[-1.81328025 -0.23241311 -2.24094433  0.91925438]
 [ 0.13382619 -1.02203009  2.16013837  3.89383146]]
data  - 10  [[-10.90664013 -10.11620656 -11.12047217  -9.54037281]
 [ -9.9330869  -10.51101505  -8.91993081  -8.05308427]]
data / 10  [[-0.09066401 -0.01162066 -0.11204722  0.04596272]
 [ 0.00669131 -0.0511015   0.10800692  0.19469157]]
data * 1.4  [[-1.26929618 -0.16268918 -1.56866103  0.64347807]
 [ 0.09367834 -0.71542107  1.51209686  2.72568202]]
data ** 0.5  [[0.95217652 0.34089083 1.05852358 0.6779581 ]
 [0.25867566 0.71485316 1.03926377 1.39531922]]


### Basic indexing and slicing

One dimensional arrays are simple and they act similarly to Python lists.

In [17]:
array = np.arange(20)
print(array)
print(array[5]) # Sixth element
print(array[5:8]) # Sixth element to 8th element
array[5:8] = 100
print(array) # Value given to a slice is propagated to the entire selection

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
5
[5 6 7]
[  0   1   2   3   4 100 100 100   8   9  10  11  12  13  14  15  16  17
  18  19]
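
One difference from Python lists is worth a sketch: a slice of a NumPy array is a view on the same data rather than a copy, so modifying the slice modifies the original array. The variable names below are introduced only for this illustration.

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]          # a view, not a copy
view[1] = 12345
print(arr)               # the change is visible in arr

independent = arr[5:8].copy()   # an explicit copy
independent[:] = 0
print(arr)               # arr is unaffected by changes to the copy

# Two-dimensional arrays take one index per axis
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr2d[0, 2])       # row 0, column 2
print(arr2d[:2, 1:])     # first two rows, columns 1 onward
```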


### Types of Random Sampling

### a) Simple Random Sampling

This is the simplest type of sampling. 

Consider a population of size n from which you want to sample k units. A **simple random sample** of size k has the property that every possible sample of size k has the same chance of being chosen.

### Q1 )    From the list of 60 employees, select 10 of them randomly for performing night duties next month.

In [18]:
import   numpy as np

empids = ['E001','E002','E003','E004','E010','E011','E012','E013','E020','E021','E022',\
                  'E023','E024','E025','E026','E027','E028','E029','E030','E031','E032','E035',\
                  'E040','E041','E042','E044','E045','E046','E050','E051','E061','E070','E081',\
                  'E083','E084','E085','E086','E087','E088','E089','E090','E091','E092','E093',\
                  'E094','E095','E096','E097','E098','E100','E101','E121','E131','E135','E701',\
                  'E794','E795','E866','E897','E998' ]
print(len(empids))

60


In [19]:
np.random.seed(1234) # to ensure repeatability
selected_10_empids = np.random.choice(empids, size = 10, replace = False)
print('\nSelected Employee IDs who are selected for night shift\n')
print(selected_10_empids )


Selected Employee IDs who are selected for night shift

['E051' 'E095' 'E020' 'E040' 'E012' 'E701' 'E089' 'E025' 'E030' 'E070']


This is a simple random sample of size 10 where each possible sample of size 10 has the same chance of being chosen.
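
Newer NumPy releases (1.17 and later) also provide the Generator interface through np.random.default_rng, which keeps the random state in an object instead of relying on a global seed. A minimal sketch, assuming a made-up list of 60 consecutive IDs; the selected IDs will differ from those above because the generator is different.

```python
import numpy as np

# Hypothetical ID list for illustration: 'E001' ... 'E060'
ids = ['E%03d' % i for i in range(1, 61)]

rng = np.random.default_rng(1234)             # seeded generator object
selected = rng.choice(ids, size=10, replace=False)
print(selected)
```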

### b) Stratified random sampling

Stratified random sampling is a method of sampling that involves the division of a population into relatively homogeneous subsets called strata and then random samples are taken from each stratum.
Here, we get estimates within each stratum unlike simple random sampling.

Ref: Book Business Analytics - Data Analysis and Decision Making - by Albright and Winston.

Scikit-learn is an open source Python library that implements a range of machine learning, preprocessing, cross-validation and visualization algorithms using a unified interface.

In [20]:
import sklearn as sci
print('My version of sklearn is %s' %sci.__version__)
print('The latest version of sklearn is 0.20.3 Refer: %s' %'https://scikit-learn.org/stable/tutorial/index.html')

My version of sklearn is 0.19.1
The latest version of sklearn is 0.20.3 Refer: https://scikit-learn.org/stable/tutorial/index.html


### Q2) Use the iris data set preloaded in sklearn datasets. Split the data in the ratio 67%:33% using stratified random sampling, and show the value counts of y and y_test.

In [21]:
import pandas                  as       pd
from   sklearn.model_selection import   train_test_split
from   sklearn                 import   datasets

iris              =  datasets.load_iris()

iris_df           =  pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= iris['feature_names'] + ['target'])

### Note: np.c_ stacks its arguments column-wise (concatenation along the second axis)

y                 =  iris_df['target']
X                 =  iris_df.drop(['target'], axis= 1) 

print('X.shape', X.shape)
print('y.shape', y.shape)
print('\nValue counts of y containing %d observations' %y.shape[0])
print(y.value_counts())

x_train, x_test, y_train, y_test = train_test_split(X, y,
                                                    test_size   = 0.33,
                                                    random_state= 12345,
                                                    stratify    = y)

print('\nValue counts of y_test containing %d observations' %y_test.shape[0])
print(y_test.value_counts())

X.shape (150, 4)
y.shape (150,)

Value counts of y containing 150 observations
2.0    50
1.0    50
0.0    50
Name: target, dtype: int64

Value counts of y_test containing 50 observations
2.0    17
1.0    17
0.0    16
Name: target, dtype: int64
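
The idea behind the stratify argument can also be sketched directly in NumPy: split the indices by stratum and draw a simple random sample within each one. This only illustrates the principle and is not how train_test_split is implemented; the `labels` array and the one-third fraction are assumptions for the example.

```python
import numpy as np

np.random.seed(12345)
labels = np.repeat([0, 1, 2], 50)     # hypothetical strata: 3 classes, 50 each
fraction = 1 / 3                      # share to draw from every stratum

sampled_idx = []
for stratum in np.unique(labels):
    members = np.where(labels == stratum)[0]          # indices in this stratum
    k = int(round(fraction * len(members)))           # sample size per stratum
    sampled_idx.extend(np.random.choice(members, size=k, replace=False))

sampled_idx = np.array(sampled_idx)
# Each class contributes the same number of sampled units
print(np.bincount(labels[sampled_idx]))
```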


### c) Systematic random sampling

Systematic sampling is a probability sampling method in which elements are chosen from a target population by selecting a random starting point and then selecting every subsequent member at a fixed sampling interval. The sampling interval is N/n, where N is the population size and n is the desired sample size.

### Q3) Assume there are 55000 customers and their list is  arranged in the decreasing order of order volumes. You are asked to select 250 customers. So you divide the population size by sample size = 55000 / 250 = 220. 

The population of customers is thus divided into 250 groups of 220 customers each.
So the sampling interval is 220 (stored as n in the code below), and the sample size is 250.
We first choose a random starting number between 1 and 220 (both inclusive) and then repeatedly add the sampling interval to get the other numbers.

In [22]:
import numpy as np
n = 220
np.random.seed(1234) # to ensure repeatability
k = np.random.randint(1, n, size = 1) # upper bound is exclusive, so this draws from 1..219; use n + 1 to include 220
print('First random number %d'%k)

First random number 48


In [23]:
start            = k.item()   # k is a one-element array; item() extracts the plain integer
random_nos       = [start + i * n for i in range(250)]   # start, start + 220, start + 440, ...
print(random_nos)

[48, 268, 488, 708, 928, 1148, 1368, 1588, 1808, 2028, 2248, 2468, 2688, 2908, 3128, 3348, 3568, 3788, 4008, 4228, 4448, 4668, 4888, 5108, 5328, 5548, 5768, 5988, 6208, 6428, 6648, 6868, 7088, 7308, 7528, 7748, 7968, 8188, 8408, 8628, 8848, 9068, 9288, 9508, 9728, 9948, 10168, 10388, 10608, 10828, 11048, 11268, 11488, 11708, 11928, 12148, 12368, 12588, 12808, 13028, 13248, 13468, 13688, 13908, 14128, 14348, 14568, 14788, 15008, 15228, 15448, 15668, 15888, 16108, 16328, 16548, 16768, 16988, 17208, 17428, 17648, 17868, 18088, 18308, 18528, 18748, 18968, 19188, 19408, 19628, 19848, 20068, 20288, 20508, 20728, 20948, 21168, 21388, 21608, 21828, 22048, 22268, 22488, 22708, 22928, 23148, 23368, 23588, 23808, 24028, 24248, 24468, 24688, 24908, 25128, 25348, 25568, 25788, 26008, 26228, 26448, 26668, 26888, 27108, 27328, 27548, 27768, 27988, 28208, 28428, 28648, 28868, 29088, 29308, 29528, 29748, 29968, 30188, 30408, 30628, 30848, 31068, 31288, 31508, 31728, 31948, 32168, 32388, 32608, 32828, 3

Here the randomly chosen starting number is 48.
So you would choose the customers at positions 48, 268, 488, and so on.
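
The same positions can be generated in one step with np.arange, which accepts a start, a stop and a step. A minimal sketch, with the starting number 48 hard-coded from the run above; the variable names are chosen for this illustration.

```python
import numpy as np

N = 55000          # population size
interval = 220     # sampling interval, N / sample size
start = 48         # random start, taken from the run above

positions = np.arange(start, N + 1, interval)
print(len(positions))       # 250
print(positions[:3])        # 48, 268, 488
print(positions[-1])        # 54828
```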

**Practice Exercise 1:**

Consider a school with 10000 students, and assume a researcher wants to select 100 of them for further study. All their student ids are numbered from 0 to 9999. 
Choose 100 students randomly without replacement, using the simple random sampling method.

In [24]:
import   numpy  as  np
studentIds = range(10000)

In [25]:
selected_100_studentIds = np.random.choice(studentIds, size = 100, replace = False)
print('\nSelected Student IDs who are selected for further research\n')
print(selected_100_studentIds )


Selected Student IDs who are selected for further research

[2374 1784 6301 1600 7920 6868 9082 9227 6816 1492 4900 6502  786 4147
 7824 5680 8443 3196  954 4906 1552 3764 1763 3637 8866 8524 6011 2165
 9663 3864 5951  104 5513 6858 2295 9806 3765 3981 1816 5141 7316 4441
 7667 1757 9268 2694 7722 7455 6739 9751 3377 3484 2202 3512 7392 4882
 2272 1353 7871 5044 3999 3562 1440 4114 2514 3748 1916 2987 9188  212
 5590 1293 4915 1435 9034  964 9891 8758 2710 1577 6290  818 1954 1111
 3199 8535  315 7654 6564 7356 5068 9421 5997 7868 1914 9493 4885 8686
 3829 3066]


**Practice Exercise 2:**

Load the breast cancer dataset, which is available in sklearn.datasets.
The target variable, class, is binary, with the values Malignant and Benign.
There are 569 observations and 30 attributes.

The class distribution is 212 Malignant and 357 Benign (37% and 63%). 
Split this data set into training and test data sets that have the same proportion of classes.

In [26]:
import pandas                    as       pd
from   sklearn                   import   datasets
from   sklearn.model_selection   import   train_test_split

In [27]:
cancer   = datasets.load_breast_cancer() # Get the breast cancer dataset from sklearn
print(cancer.data.shape)

(569, 30)


In [28]:
print(cancer.DESCR)

Breast Cancer Wisconsin (Diagnostic) Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.

        

In [29]:
# Load the breast cancer dataset
print('\nLoad the dataset as a pandas data frame')
cancer_df     = pd.DataFrame(data = cancer['data'] , columns = cancer['feature_names']) 


Load the dataset as a pandas data frame


In [30]:
print('\nDefine the independent variable as X')
X = cancer_df 
print(X.head(4).T)


Define the independent variable as X
                                   0            1            2           3
mean radius                17.990000    20.570000    19.690000   11.420000
mean texture               10.380000    17.770000    21.250000   20.380000
mean perimeter            122.800000   132.900000   130.000000   77.580000
mean area                1001.000000  1326.000000  1203.000000  386.100000
mean smoothness             0.118400     0.084740     0.109600    0.142500
mean compactness            0.277600     0.078640     0.159900    0.283900
mean concavity              0.300100     0.086900     0.197400    0.241400
mean concave points         0.147100     0.070170     0.127900    0.105200
mean symmetry               0.241900     0.181200     0.206900    0.259700
mean fractal dimension      0.078710     0.056670     0.059990    0.097440
radius error                1.095000     0.543500     0.745600    0.495600
texture error               0.905300     0.733900     0.786900

In [31]:
print('\nDefine the dependent variable as y')
y        =  pd.Series(cancer.target)
print(y.value_counts())


Define the dependent variable as y
1    357
0    212
dtype: int64


### Split X, y in the ratio 80:20 (Training data set : Test data set) and try to retain the same target class distribution.
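
One way to approach this split is sketched below, following the same pattern as the iris example. The random_state value is an arbitrary choice, and the capitalized variable names are introduced only for this sketch.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = pd.Series(cancer.target)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,       # 80:20 split
    random_state=12345,   # arbitrary seed, for repeatability
    stratify=y)           # keep the class proportions in both parts

# Class proportions should be close to the overall 37% : 63%
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

Passing normalize=True to value_counts prints proportions instead of counts, which makes the two parts easy to compare.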

### Practice Exercise 3: 

ABC company wants to study the impact of leadership style on employee motivation level and reduction in attrition. 

ABC has 240 employees in the Operations department who could potentially be interviewed. You identified the sample size as 24 employees. 

**Select 10 employees randomly for interviewing.**
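
One possible approach is sketched below, under the assumption that the employees are simply numbered 1 to 240, since the exercise gives no actual IDs. The variable `sample_size` is a parameter of the sketch and can be set to either figure mentioned in the exercise.

```python
import numpy as np

np.random.seed(1234)                       # arbitrary seed, for repeatability
employee_numbers = np.arange(1, 241)       # hypothetical numbering 1..240
sample_size = 10                           # set to 24 to use the identified sample size

selected = np.random.choice(employee_numbers, size=sample_size, replace=False)
print(np.sort(selected))
```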

