# 101 NumPy Exercises for Data Analysis

*by Selva Prabhakaran*

From the website: https://www.machinelearningplus.com/python/101-numpy-exercises-python/

*The goal of the numpy exercises is to serve as a reference as well as to get you to apply numpy beyond the basics.*
*The questions come in four difficulty levels, with L1 being the easiest and L4 the hardest.*

**NOTE**: Run the next cell to load the data and the libraries needed for the exercises.

**NOTE**: There are actually 70 of them, lol. 🤷

In [1]:
pip install -r requirements.txt


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


1. (L1) **Import numpy as `np` and print the version number.**

In [2]:
import numpy as np
print(np.version.version)

1.26.4


2. (L1) **Create a 1D array of numbers from 0 to 9.**

In [3]:
array = np.arange(0, 10)
array

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

3. (L1) **Create a 3×3 numpy array of all True’s.**

In [4]:
all_trues = np.full((3, 3), True, dtype=bool)
print(all_trues)

all_trues = np.ones((3, 3), dtype=bool)
print(all_trues)

[[ True  True  True]
 [ True  True  True]
 [ True  True  True]]
[[ True  True  True]
 [ True  True  True]
 [ True  True  True]]


4. (L1) **Extract all odd numbers from `arr`.**

In [5]:
# Input
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

odd = arr[arr % 2 == 1]

print(arr, odd)

[0 1 2 3 4 5 6 7 8 9] [1 3 5 7 9]


5. (L1) **Replace all odd numbers in `arr` with -1.**

In [6]:
odds_replaced = arr.copy()
odds_replaced[odds_replaced % 2 == 1] = -1
print(odds_replaced)

[ 0 -1  2 -1  4 -1  6 -1  8 -1]


6. (L2) **Replace all odd numbers in arr with -1 without changing `arr`**

In [7]:
out_of_place_replace = np.where(arr % 2 == 1, -1, arr)
print(out_of_place_replace, arr)

[ 0 -1  2 -1  4 -1  6 -1  8 -1] [0 1 2 3 4 5 6 7 8 9]


7. (L1) **Convert a 1D array to a 2D array with 2 rows.**

In [8]:
reshaped_arr = arr.reshape((2, 5))
print(reshaped_arr)

# -1 sets the dimension automatically
reshaped_arr = arr.reshape((2, -1))
print(reshaped_arr)

[[0 1 2 3 4]
 [5 6 7 8 9]]
[[0 1 2 3 4]
 [5 6 7 8 9]]


8. (L2) **Stack arrays `a` and `b` vertically.**

In [9]:
# Input
a = np.arange(10).reshape(2,-1)
b = np.repeat(1, 10).reshape(2,-1)

print(a, b)

c = np.vstack([a, b])
print(c)

[[0 1 2 3 4]
 [5 6 7 8 9]] [[1 1 1 1 1]
 [1 1 1 1 1]]
[[0 1 2 3 4]
 [5 6 7 8 9]
 [1 1 1 1 1]
 [1 1 1 1 1]]


9. (L2) **Stack the arrays `a` and `b` horizontally.**

In [10]:
d = np.hstack([a, b])
print(d)

[[0 1 2 3 4 1 1 1 1 1]
 [5 6 7 8 9 1 1 1 1 1]]


10. (L2) **Create the following pattern without hardcoding. Use only numpy functions and the below input array `a`.**

In [11]:
# Input
a = np.array([1,2,3])

# This repeats EACH ELEMENT along given axis
b = a.repeat(3)

# This repeats WHOLE ARRAY, as tiles
triple_a = np.tile(a, 3)

c = np.hstack([b, triple_a])
print(c)

[1 1 1 2 2 2 3 3 3 1 2 3 1 2 3 1 2 3]


11. (L2) **Get the common items between a and b.**

In [12]:
# Input
a = np.array([1,2,3,2,3,4,3,4,5,6])
b = np.array([7,2,10,2,7,4,9,4,9,8])

# Indexing and unique (only finds common items that sit at the same position in both arrays)
c = np.unique(a[a == b])
print(c)

# Via intersect method
d = np.intersect1d(ar1=a, ar2=b)
print(d)

[2 4]
[2 4]
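
The mask `a == b` compares the arrays position by position, so the first approach agrees with `np.intersect1d` here only because the shared values happen to line up. A small sketch with made-up arrays (`left` and `right` are illustrative names, not part of the exercise) shows where the two diverge:

```python
import numpy as np

# Hypothetical inputs: both arrays contain 2 and 4, but at different positions
left = np.array([1, 2, 3, 4])
right = np.array([4, 7, 9, 2])

# Position-wise comparison finds nothing, because no index holds equal values
positional = np.unique(left[left == right])

# intersect1d compares the values themselves, regardless of position
common = np.intersect1d(left, right)

print(positional)
print(common)
```

For a general "common items" question, `np.intersect1d` is therefore the safer choice.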


12. (L2) **How to remove from one array those items that exist in another?**

In [13]:
# Input
a = np.array([1,2,3,4,5])
b = np.array([5,6,7,8,9])

# Returns difference between arrays
c = np.setdiff1d(a , b)
print(c)

[1 2 3 4]


13. (L2) **Get the positions where elements of a and b match.**

In [14]:
# Input
a = np.array([1,2,3,2,3,4,3,4,5,6])
b = np.array([7,2,10,2,7,4,9,4,9,8])

'''
From docs: 

nonzero: Returns a tuple of arrays, one for each dimension of a, containing the indices of the non-zero elements in that dimension. The values in a are always tested and returned in row-major, C-style order.
asarray: Convert the input to an array.
'''

c = np.asarray(a == b).nonzero()
print(c)

d = np.where(a == b)
print(d)

(array([1, 3, 5, 7]),)
(array([1, 3, 5, 7]),)


14. (L2) **Get all items between 5 and 10 from `a`.**

In [15]:
# Input
a = np.array([2, 6, 1, 9, 10, 3, 27])

# Middle ground
b = a[np.where((a <= 10) & (a >= 5))]
print(b)

# Shorthand
c = a[(a >= 5) & (a <= 10)]
print(c)

# Extended
d = a[np.where(np.logical_and(a >=5, a <= 10))]
print(d)

[ 6  9 10]
[ 6  9 10]
[ 6  9 10]


15. (L2) **Convert the function maxx that works on two scalars, to work on two arrays.**

In [16]:
# Base function
def maxx(x, y):
    """Get the maximum of two items"""
    if x >= y:
        return x
    else:
        return y

# Inputs
a = np.array([5, 7, 9, 8, 6, 4, 5])
b = np.array([6, 3, 4, 8, 9, 7, 1])

# Direct definition
def pair_max(input_a, input_b):
    return np.where(input_a > input_b, input_a, input_b)

result = pair_max(a, b)
print(result)

# Vectorize
'''
From docs:
Returns an object that acts like pyfunc, but takes arrays as input.
Define a vectorized function which takes a nested sequence of objects or numpy arrays as inputs and returns a single numpy array or a tuple of numpy arrays.
The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy

tl;dr - A map array method
'''
pair_max = np.vectorize(maxx, otypes=[float])

result = pair_max(a, b)
print(result)

[6 7 9 8 9 7 5]
[6. 7. 9. 8. 9. 7. 5.]
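
For this particular function NumPy already ships an element-wise equivalent, `np.maximum`. The NumPy docs describe `np.vectorize` as a convenience wrapper (essentially a loop), so the built-in ufunc is probably the better fit when one exists. A minimal sketch using the same inputs (`pair_max_builtin` is just an illustrative name):

```python
import numpy as np

a = np.array([5, 7, 9, 8, 6, 4, 5])
b = np.array([6, 3, 4, 8, 9, 7, 1])

# np.maximum is a ufunc that compares the two arrays element by element
pair_max_builtin = np.maximum(a, b)
print(pair_max_builtin)
```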


16. (L2) **Swap columns 1 and 2 in the array `arr`.**

In [17]:
# Input
arr = np.arange(9).reshape(3,3)
print(arr)

swapped_cols = arr[:, [1, 0, 2]]
print(swapped_cols)

[[0 1 2]
 [3 4 5]
 [6 7 8]]
[[1 0 2]
 [4 3 5]
 [7 6 8]]


17. (L2) **Swap rows 1 and 2 in the array `arr`.**

In [18]:
swapped_rows = arr[[1, 0, 2], :]
print(swapped_rows)

[[3 4 5]
 [0 1 2]
 [6 7 8]]


18. (L2) **Reverse the rows of a 2D array `arr`.**

In [19]:
# Via flip
reversed_rows = np.flip(arr, axis=0)
print(reversed_rows)

# w/ a negative-step slice (basic slicing, not fancy indexing)
print(arr[::-1])

[[6 7 8]
 [3 4 5]
 [0 1 2]]
[[6 7 8]
 [3 4 5]
 [0 1 2]]


19. (L2) **Reverse the columns of a 2D array `arr`.**

In [20]:
# Via flip
reversed_cols = np.flip(arr, axis=1)
print(reversed_cols)

[[2 1 0]
 [5 4 3]
 [8 7 6]]


20. (L2) **Create a 2D array of shape 5x3 to contain random decimal numbers between 5 and 10.**

In [21]:
MIN = 5
MAX = 10

# Manual sampling from uniform dist
random_arr = (MAX - MIN) * np.random.random_sample((5, 3)) + MIN
print(random_arr)

# Sampling from uniform distribution via uniform() method
random_arr = np.random.uniform(MIN, MAX, (5,3))
print(random_arr)

[[6.79128264 7.38806759 8.1757195 ]
 [5.71424504 8.61436668 9.07662141]
 [7.43911417 8.06328349 8.91768736]
 [6.89172514 5.80153485 7.28053673]
 [6.61805795 7.18816882 5.97671225]]
[[6.71173875 7.02527264 5.63805213]
 [8.71508405 5.83269018 6.03751013]
 [7.32754015 8.95440265 8.10042707]
 [6.29456717 9.28984949 5.80723521]
 [9.91640407 8.96313073 8.74890102]]


21. (L2) **Print or show only 3 decimal places of the numpy array `rand`.**

In [22]:
rand = np.random.random((5,3))

# Using printoptions as ctx manager makes the changes temporary!
with np.printoptions(precision=3):
    print(rand)

# Manual change and revert
np.set_printoptions(precision=3)
print(rand)

## A "Meyer's" reset for printoptions
np.set_printoptions(
    edgeitems=3,
    infstr='inf',
    linewidth=75,
    nanstr='nan',
    precision=8,
    suppress=False,
    threshold=1000,
    formatter=None
)


[[0.687 0.698 0.154]
 [0.591 0.663 0.441]
 [0.621 0.02  0.93 ]
 [0.085 0.928 0.196]
 [0.323 0.417 0.834]]
[[0.687 0.698 0.154]
 [0.591 0.663 0.441]
 [0.621 0.02  0.93 ]
 [0.085 0.928 0.196]
 [0.323 0.417 0.834]]


22. (L2) **Pretty print `rand` by suppressing the scientific notation (like 1e10).**

In [23]:
np.random.seed(100)
rand = np.random.random([3,3])/1e3
with np.printoptions(suppress=True):
    print(rand)

[[0.0005434  0.00027837 0.00042452]
 [0.00084478 0.00000472 0.00012157]
 [0.00067075 0.00082585 0.00013671]]


23. (L1) **Limit the number of items printed in python numpy array a to a maximum of 6 elements.**

In [24]:
a = np.arange(15)
with np.printoptions(threshold=6):
    print(a)

[ 0  1  2 ... 12 13 14]


24. (L1) **Print the full numpy array a without truncating.**

In [25]:
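import sys

# threshold=None leaves the setting unchanged; sys.maxsize disables summarization
with np.printoptions(threshold=sys.maxsize):
    print(a)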

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]


25. (L2) **Import the iris dataset keeping the text intact.**

In [26]:
IRIS_DATASET_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# dtype=object keeps every field as a raw bytes object in a regular 2D array;
# dtype=None would instead infer a type per column and return a 1D structured array
iris_dataset = np.genfromtxt(IRIS_DATASET_URL, delimiter=',', dtype=object)
print(iris_dataset)

[[b'5.1' b'3.5' b'1.4' b'0.2' b'Iris-setosa']
 [b'4.9' b'3.0' b'1.4' b'0.2' b'Iris-setosa']
 [b'4.7' b'3.2' b'1.3' b'0.2' b'Iris-setosa']
 [b'4.6' b'3.1' b'1.5' b'0.2' b'Iris-setosa']
 [b'5.0' b'3.6' b'1.4' b'0.2' b'Iris-setosa']
 [b'5.4' b'3.9' b'1.7' b'0.4' b'Iris-setosa']
 [b'4.6' b'3.4' b'1.4' b'0.3' b'Iris-setosa']
 [b'5.0' b'3.4' b'1.5' b'0.2' b'Iris-setosa']
 [b'4.4' b'2.9' b'1.4' b'0.2' b'Iris-setosa']
 [b'4.9' b'3.1' b'1.5' b'0.1' b'Iris-setosa']
 [b'5.4' b'3.7' b'1.5' b'0.2' b'Iris-setosa']
 [b'4.8' b'3.4' b'1.6' b'0.2' b'Iris-setosa']
 [b'4.8' b'3.0' b'1.4' b'0.1' b'Iris-setosa']
 [b'4.3' b'3.0' b'1.1' b'0.1' b'Iris-setosa']
 [b'5.8' b'4.0' b'1.2' b'0.2' b'Iris-setosa']
 [b'5.7' b'4.4' b'1.5' b'0.4' b'Iris-setosa']
 [b'5.4' b'3.9' b'1.3' b'0.4' b'Iris-setosa']
 [b'5.1' b'3.5' b'1.4' b'0.3' b'Iris-setosa']
 [b'5.7' b'3.8' b'1.7' b'0.3' b'Iris-setosa']
 [b'5.1' b'3.8' b'1.5' b'0.3' b'Iris-setosa']
 [b'5.4' b'3.4' b'1.7' b'0.2' b'Iris-setosa']
 [b'5.1' b'3.7' b'1.5' b'0.4' b'Ir

26. (L2) **Extract the text column species from the 1D iris imported in previous question.**

In [27]:
COLUMN_NAMES = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')

species = [
    row[4]
    for row in iris_dataset
]
print(species)

[b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-versicolor', b'Iris-versicolor', b'Iris-versicolor', b'Iris-versicolor', b'Iris-versicolor', b'Iris-versicolor', b'Iris-versicolor', b'Iris-versicolor', b'Iris-versicolor', b'Iris-versicolor',

27. (L2) **Convert the 1D iris to 2D array iris_2d by omitting the species text field.**

In [28]:
# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

# Removing string column makes all columns of uniform type, so it can be inferred
# and NumPy sees that it must be a matrix of floats
iris_2d = np.genfromtxt(url, delimiter=',', dtype=None, usecols=[0,1,2,3])

print(iris_2d)

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.1 1.5 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.

28. (L1) **Find the mean, median, standard deviation of iris's sepallength (1st column).**

In [29]:
for measure in (np.mean, np.median, np.std):
    print(
        measure(
            iris_2d[:,0]
        )
    )

5.843333333333334
5.8
0.8253012917851409


29. (L2) **Create a normalized form of iris's sepallength whose values range exactly between 0 and 1 so that the minimum has value 0 and maximum has value 1.**

In [30]:
sepallength = iris_2d[:,0]

maximum_sepallength = np.max(sepallength)
minimum_sepallength = np.min(sepallength)

normalized_sepallength = (sepallength.copy() - minimum_sepallength) / (maximum_sepallength - minimum_sepallength)
print(normalized_sepallength)

[0.22222222 0.16666667 0.11111111 0.08333333 0.19444444 0.30555556
 0.08333333 0.19444444 0.02777778 0.16666667 0.30555556 0.13888889
 0.13888889 0.         0.41666667 0.38888889 0.30555556 0.22222222
 0.38888889 0.22222222 0.30555556 0.22222222 0.08333333 0.22222222
 0.13888889 0.19444444 0.19444444 0.25       0.25       0.11111111
 0.13888889 0.30555556 0.25       0.33333333 0.16666667 0.19444444
 0.33333333 0.16666667 0.02777778 0.22222222 0.19444444 0.05555556
 0.02777778 0.19444444 0.22222222 0.13888889 0.22222222 0.08333333
 0.27777778 0.19444444 0.75       0.58333333 0.72222222 0.33333333
 0.61111111 0.38888889 0.55555556 0.16666667 0.63888889 0.25
 0.19444444 0.44444444 0.47222222 0.5        0.36111111 0.66666667
 0.36111111 0.41666667 0.52777778 0.36111111 0.44444444 0.5
 0.55555556 0.5        0.58333333 0.63888889 0.69444444 0.66666667
 0.47222222 0.38888889 0.33333333 0.33333333 0.41666667 0.47222222
 0.30555556 0.47222222 0.66666667 0.55555556 0.36111111 0.33333333
 0.33333

30. (L3) **Compute the softmax score of sepallength.**

In [31]:
# https://stackoverflow.com/questions/34968722/how-to-implement-the-softmax-function-in-python
softmax_exponentials = np.exp(sepallength - np.max(sepallength))
softmax = softmax_exponentials / softmax_exponentials.sum(axis=0)

with np.printoptions(precision=3):
    print(softmax)

[0.002 0.002 0.001 0.001 0.002 0.003 0.001 0.002 0.001 0.002 0.003 0.002
 0.002 0.001 0.004 0.004 0.003 0.002 0.004 0.002 0.003 0.002 0.001 0.002
 0.002 0.002 0.002 0.002 0.002 0.001 0.002 0.003 0.002 0.003 0.002 0.002
 0.003 0.002 0.001 0.002 0.002 0.001 0.001 0.002 0.002 0.002 0.002 0.001
 0.003 0.002 0.015 0.008 0.013 0.003 0.009 0.004 0.007 0.002 0.01  0.002
 0.002 0.005 0.005 0.006 0.004 0.011 0.004 0.004 0.007 0.004 0.005 0.006
 0.007 0.006 0.008 0.01  0.012 0.011 0.005 0.004 0.003 0.003 0.004 0.005
 0.003 0.005 0.011 0.007 0.004 0.003 0.003 0.006 0.004 0.002 0.004 0.004
 0.004 0.007 0.002 0.004 0.007 0.004 0.016 0.007 0.009 0.027 0.002 0.02
 0.011 0.018 0.009 0.008 0.012 0.004 0.004 0.008 0.009 0.03  0.03  0.005
 0.013 0.004 0.03  0.007 0.011 0.018 0.007 0.006 0.008 0.018 0.022 0.037
 0.008 0.007 0.006 0.03  0.007 0.008 0.005 0.013 0.011 0.013 0.004 0.012
 0.011 0.011 0.007 0.009 0.007 0.005]
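
Subtracting the maximum before exponentiating does not change the result, because softmax is invariant to adding a constant to every input; it only keeps `np.exp` from overflowing. A minimal sketch of that property, wrapped in a helper (the function name `softmax` and both input arrays are assumptions made for illustration):

```python
import numpy as np

def softmax(values):
    """Numerically stable softmax of a 1D array."""
    shifted = values - np.max(values)  # the largest exponent becomes 0
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum()

# Hypothetical inputs: exp() of the raw large values would overflow to inf
large_values = np.array([1000.0, 1001.0, 1002.0])
small_values = np.array([0.0, 1.0, 2.0])

# Both arrays differ only by a constant shift, so the scores coincide
print(softmax(large_values))
print(softmax(small_values))
```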


31. (L1) **Find the 5th and 95th percentile of iris's sepallength.**

In [32]:
five_percentile, nine_five_percentile = np.percentile(sepallength, 5), np.percentile(sepallength, 95)
print(five_percentile, nine_five_percentile)

# Doing both at once
print(
    np.percentile(sepallength, [5, 95])
)

4.6 7.254999999999998
[4.6   7.255]


32. (L1) **Insert np.nan values at 20 random positions in iris_2d dataset.**

In [33]:
iris_with_random_nans = iris_2d.copy()
iris_with_random_nans[
    np.random.choice(iris_with_random_nans.shape[0], size=20),  # choice(n) already draws from 0..n-1
    np.random.choice(iris_with_random_nans.shape[1], size=20)
] = np.nan
print(iris_with_random_nans)

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  nan 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 nan 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.1 1.5 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [nan 3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.

33. (L2) **Find the number and position of missing values in iris_2d's sepallength (1st column).**

In [34]:
nan_indices = np.argwhere(
    np.isnan(iris_with_random_nans)
)
print(nan_indices, nan_indices.shape[0])

[[  4   1]
 [ 24   1]
 [ 49   0]
 [ 58   2]
 [ 60   0]
 [ 67   0]
 [ 86   2]
 [ 91   2]
 [ 93   2]
 [100   0]
 [107   2]
 [108   0]
 [129   1]
 [130   1]
 [132   2]
 [135   2]
 [137   1]
 [141   2]
 [143   1]
 [144   1]] 20


34. (L2) **Filter the rows of iris_2d that have petallength (3rd column) > 1.5 and sepallength (1st column) < 5.0.**

In [35]:
iris_2d[
    np.logical_and(
        iris_2d[:,2] > 1.5,
        iris_2d[:,0] < 5.0
    ),
]

array([[4.8, 3.4, 1.6, 0.2],
       [4.8, 3.4, 1.9, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [4.9, 2.4, 3.3, 1. ],
       [4.9, 2.5, 4.5, 1.7]])

35. (L2) **Select the rows of iris_2d that do not have any nan value.**

In [36]:
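# NaN never compares equal to anything (not even itself), so test with np.isnan
non_nan_entries = iris_2d[
    ~np.isnan(iris_2d).any(axis=1)  # keep rows in which no column is NaN
]
print(np.isnan(non_nan_entries).any())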

False
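
Comparisons against `np.nan` are a common trap: NaN is unequal to every value, including itself, so `==`, `!=` and `in` cannot be used to detect it. A short sketch on a made-up array (`values` is a hypothetical name) illustrates this:

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0])

# != is True for every entry, including the NaN itself
not_equal_mask = values != np.nan

# np.isnan is the reliable test: only the NaN entry is flagged
nan_mask = np.isnan(values)

print(not_equal_mask)
print(nan_mask)
```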


36. (L2) **Find the correlation between SepalLength(1st column) and PetalLength(3rd column) in `iris_2d`**

In [37]:
petallength = iris_2d[:,2]

correlation = np.corrcoef(x=sepallength, y=petallength)
print(correlation)

[[1.         0.87175416]
 [0.87175416 1.        ]]


37. (L2) **Find out if `iris_2d` has any missing values.**

In [38]:
nans = np.isnan(iris_2d)
nans.any()

False

38. (L2) **Replace all occurrences of nan with 0 in numpy array.**

In [39]:
# Input
nanified_iris_2d = iris_2d.copy()
nanified_iris_2d[np.random.randint(150, size=20), np.random.randint(4, size=20)] = np.nan

nanified_iris_2d[np.isnan(nanified_iris_2d)] = 0
np.isnan(nanified_iris_2d).any()

False

39. (L2) **Find the unique values and the count of unique values in iris's species.**

In [40]:
# Iterate manually
np.unique(species)
for value in np.unique(species):
    print(value, species.count(value))

# Using the unique.return_counts parameter
print(
    np.unique(species, return_counts=True)
)

b'Iris-setosa' 50
b'Iris-versicolor' 50
b'Iris-virginica' 50
(array([b'Iris-setosa', b'Iris-versicolor', b'Iris-virginica'],
      dtype='|S15'), array([50, 50, 50]))



40. (L2) **Bin the petal length (3rd) column of iris_2d to form a text array, such that if petal length is:**

* Less than 3 --> 'small'
* 3-5 --> 'medium'
* >=5 --> 'large'

In [41]:
bins = [0.0, 3.0, 5.0]
bins_labels = ['small', 'medium', 'large']
petallength_put_in_bins = np.digitize(petallength, bins)
petallength_categories = [
    bins_labels[bin_number-1]
    for bin_number in petallength_put_in_bins
]
print(petallength_categories)

['small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'large', 'medium', 'medium', 'medium', 'medium', 'medium', 'large', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'medium', 'large', 'large', 'large', 'large', 'large', 'large

41. (L2) **Create a new column for volume in iris_2d, where volume is `(pi x petallength x sepal_length^2)/3`**

In [42]:
np.column_stack(
    [
        iris_2d,  # Old array
        (np.pi * petallength * sepallength ** 2) / 3  # New column with volumes - calculated on the fly
    ])

array([[5.10000000e+00, 3.50000000e+00, 1.40000000e+00, 2.00000000e-01,
        3.81326516e+01],
       [4.90000000e+00, 3.00000000e+00, 1.40000000e+00, 2.00000000e-01,
        3.52004985e+01],
       [4.70000000e+00, 3.20000000e+00, 1.30000000e+00, 2.00000000e-01,
        3.00723721e+01],
       [4.60000000e+00, 3.10000000e+00, 1.50000000e+00, 2.00000000e-01,
        3.32380503e+01],
       [5.00000000e+00, 3.60000000e+00, 1.40000000e+00, 2.00000000e-01,
        3.66519143e+01],
       [5.40000000e+00, 3.90000000e+00, 1.70000000e+00, 4.00000000e-01,
        5.19116770e+01],
       [4.60000000e+00, 3.40000000e+00, 1.40000000e+00, 3.00000000e-01,
        3.10221803e+01],
       [5.00000000e+00, 3.40000000e+00, 1.50000000e+00, 2.00000000e-01,
        3.92699082e+01],
       [4.40000000e+00, 2.90000000e+00, 1.40000000e+00, 2.00000000e-01,
        2.83832424e+01],
       [4.90000000e+00, 3.10000000e+00, 1.50000000e+00, 1.00000000e-01,
        3.77148198e+01],
       [5.40000000e+00, 3.7000

42. (L3) **Randomly sample iris's species such that setosa is twice the number of versicolor and virginica.**

In [43]:
probabilities_map = {
    b'Iris-setosa': 0.5,
    b'Iris-versicolor': 0.25,
    b'Iris-virginica': 0.25
}
sampling_probabilities = np.array([
    probabilities_map.get(name)
    for name in species
])
sampling_distribution = sampling_probabilities / sum(sampling_probabilities)
sample = np.random.choice(species, size=100, p=sampling_distribution)
np.unique(sample, return_counts=True)

(array([b'Iris-setosa', b'Iris-versicolor', b'Iris-virginica'],
       dtype='|S15'),
 array([47, 28, 25]))

43. (L3) **What is the value of the second longest petallength of species setosa?**

In [44]:
np.unique(  # Remove repetitions to find the value itself not what is in the second to last POSITION
    np.sort([  # Sort ascending
        iris_dataset_row[2]
        for iris_dataset_row in iris_dataset
        if iris_dataset_row[4]  == b'Iris-setosa'
    ]
    )
)[-2]

b'1.7'

44. (L2) **Sort the iris dataset based on sepallength column.**

In [45]:
sorted_sepallength_indices = iris_2d[:,0].argsort()
sorted_iris_array = iris_dataset[sorted_sepallength_indices]
print(sorted_iris_array)

[[b'4.3' b'3.0' b'1.1' b'0.1' b'Iris-setosa']
 [b'4.4' b'3.2' b'1.3' b'0.2' b'Iris-setosa']
 [b'4.4' b'3.0' b'1.3' b'0.2' b'Iris-setosa']
 [b'4.4' b'2.9' b'1.4' b'0.2' b'Iris-setosa']
 [b'4.5' b'2.3' b'1.3' b'0.3' b'Iris-setosa']
 [b'4.6' b'3.6' b'1.0' b'0.2' b'Iris-setosa']
 [b'4.6' b'3.1' b'1.5' b'0.2' b'Iris-setosa']
 [b'4.6' b'3.4' b'1.4' b'0.3' b'Iris-setosa']
 [b'4.6' b'3.2' b'1.4' b'0.2' b'Iris-setosa']
 [b'4.7' b'3.2' b'1.3' b'0.2' b'Iris-setosa']
 [b'4.7' b'3.2' b'1.6' b'0.2' b'Iris-setosa']
 [b'4.8' b'3.0' b'1.4' b'0.1' b'Iris-setosa']
 [b'4.8' b'3.0' b'1.4' b'0.3' b'Iris-setosa']
 [b'4.8' b'3.4' b'1.9' b'0.2' b'Iris-setosa']
 [b'4.8' b'3.4' b'1.6' b'0.2' b'Iris-setosa']
 [b'4.8' b'3.1' b'1.6' b'0.2' b'Iris-setosa']
 [b'4.9' b'2.4' b'3.3' b'1.0' b'Iris-versicolor']
 [b'4.9' b'2.5' b'4.5' b'1.7' b'Iris-virginica']
 [b'4.9' b'3.1' b'1.5' b'0.1' b'Iris-setosa']
 [b'4.9' b'3.1' b'1.5' b'0.1' b'Iris-setosa']
 [b'4.9' b'3.1' b'1.5' b'0.1' b'Iris-setosa']
 [b'4.9' b'3.0' b'1.4' b'0.

45. (L1) **Find the most frequent value of petal length (3rd column) in iris dataset.**

In [46]:
values, counts = np.unique(iris_dataset[:,2], return_counts=True)
maximum_unique_value_index = counts.argmax()
print(values[maximum_unique_value_index])

b'1.5'


46. (L2) **Find the position of the first occurrence of a value greater than 1.0 in petalwidth 4th column of iris dataset.**

In [47]:
petalwidth = iris_2d[:, 3]
np.argwhere(petalwidth > 1.0)[0]

array([50])

47. (L2) **From the array `a`, replace all values greater than 30 to 30 and less than 10 to 10.**

In [48]:
np.random.seed(100)
a = np.random.uniform(1,50, 20)
a[a > 30] = 30
a[a < 10] = 10
print(a)

[27.62684215 14.64009987 21.80136195 30.         10.         10.
 30.         30.         10.         29.17957314 30.         11.25090398
 10.08108276 10.         11.76517714 30.         30.         10.
 30.         14.42961361]
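
The two masked assignments can also be written as a single call to `np.clip`, which returns a new array and leaves the input untouched. A minimal sketch on the same seeded input (`clipped` is an illustrative name):

```python
import numpy as np

np.random.seed(100)
a = np.random.uniform(1, 50, 20)

# Values below 10 become 10, values above 30 become 30, the rest are unchanged
clipped = np.clip(a, 10, 30)
print(clipped)
```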


48. (L2) **Get the positions of top 5 maximum values in a given array `a`.**

In [49]:
# Inputs

np.random.seed(100)
a = np.random.uniform(1,50, 20)

# argsort is ascending, so the last five indices belong to the five largest values
np.argsort(a)[-5:]

array([18,  7,  3, 10, 15])
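
When only the positions of the few largest values are needed, a full sort does more work than necessary. A possible alternative is `np.argpartition`, which only guarantees that the top `k` indices end up in the last `k` slots; the sketch below then orders just those `k` positions (`top_k` and `top_positions` are illustrative names):

```python
import numpy as np

np.random.seed(100)
a = np.random.uniform(1, 50, 20)

top_k = 5
# The last top_k entries index the top_k largest values, in no particular order
top_positions = np.argpartition(a, -top_k)[-top_k:]
# Order those few positions so that the largest value comes first
top_positions = top_positions[np.argsort(a[top_positions])[::-1]]
print(top_positions)
```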

49. (L4) **Compute the counts of unique values row-wise.**

**NOTE**: The solution provided on the site seems to be either wrong or made for a different version of the dataset. The counts presented there do not show how many times each value appears in each row. I will provide a solution that does that.

In [50]:
# Inputs
np.random.seed(100)
arr = np.random.randint(1,11,size=(6, 10))

counts_arr = arr.copy()
for row in range(0, arr.shape[0]):
    values, counts = np.unique(arr[row], return_counts=True)
    counts_map = {
        values[i]: counts[i]
        for i in range(0, len(values))
    }
    for column in range(arr.shape[1]):
        counts_arr[row, column] = counts_map[arr[row, column]]
print(arr)
print(counts_arr)

[[ 9  9  4  8  8  1  5  3  6  3]
 [ 3  3  2  1  9  5  1 10  7  3]
 [ 5  2  6  4  5  5  4  8  2  2]
 [ 8  8  1  3 10 10  4  3  6  9]
 [ 2  1  8  7  3  1  9  3  6  2]
 [ 9  2  6  5  3  9  4  6  1 10]]
[[2 2 1 2 2 1 1 2 1 2]
 [3 3 1 2 1 1 2 1 1 3]
 [3 3 1 2 3 3 2 1 3 3]
 [2 2 1 2 2 2 1 2 1 1]
 [2 2 1 1 2 2 1 2 1 2]
 [2 1 2 1 1 2 1 2 1 1]]
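
The same per-element counts can be obtained without Python loops by letting broadcasting compare every element of a row with every other element of that row. This is only a sketch: it builds a temporary boolean array of shape `(rows, columns, columns)`, so it suits small inputs like this one (`same_value` and `counts_broadcast` are illustrative names):

```python
import numpy as np

np.random.seed(100)
arr = np.random.randint(1, 11, size=(6, 10))

# same_value[r, i, j] is True when arr[r, i] == arr[r, j]
same_value = arr[:, :, None] == arr[:, None, :]

# Summing over the last axis counts how often each element occurs in its row
counts_broadcast = same_value.sum(axis=2)
print(counts_broadcast)
```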


50. (L2) **Convert `array_of_arrays` into a flat linear 1d array.**

Desired output:

```python
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
```

In [51]:
# Inputs

arr1 = np.arange(3)
arr2 = np.arange(3,7)
arr3 = np.arange(7,10)

np.hstack([arr1, arr2, arr3])

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

51. (L2) **Compute the one-hot encodings (dummy binary variables for each unique value in the array)**

In [52]:
# Input

np.random.seed(101) 
arr = np.random.randint(1,4, size=6)
print(arr)

dummy = np.zeros((arr.shape[0], arr.max()))
for index in range(0, arr.shape[0]):
    dummy[index, arr[index] - 1] = 1
print(dummy)

[2 3 2 2 2 1]
[[0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]]
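
A loop-free variant indexes into an identity matrix: row `k` of `np.eye(n)` is exactly the one-hot vector for category `k`. Like the loop above, this sketch assumes the categories are the integers `1..arr.max()` (`one_hot` is an illustrative name):

```python
import numpy as np

np.random.seed(101)
arr = np.random.randint(1, 4, size=6)

# Pick one identity-matrix row per element; subtract 1 because categories start at 1
one_hot = np.eye(arr.max())[arr - 1]
print(one_hot)
```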


52. (L3) **Create row numbers grouped by a categorical variable. Use the following sample from iris `species` as input.**

Desired output:

```python
[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 6, 7]
```

In [53]:
# Input

species_small = np.sort(np.random.choice(species, size=20))
species_small

values, counts = np.unique(species_small, return_counts=True)
print(f'Unique values :{values}\n')
print(f'Counts: {counts}')

np.hstack([
    np.arange(count)  # 0..count-1 for every group
    for count in counts
])

Unique values :[b'Iris-setosa' b'Iris-versicolor' b'Iris-virginica']

Counts: [5 6 9]


array([0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 6, 7, 8])

53. (L4) **Create group ids based on a given categorical variable. Use the following sample from iris `species` as input.**

Desired output:

```python
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3]
```

In [54]:
# Inputs

species_small = np.sort(np.random.choice(species, size=20))
unique_groups = np.unique(species_small)
print(species_small)

[
    np.argwhere(unique_groups == species_small[index]).flat[0]
    for index in range(0, species_small.shape[0])
]

[b'Iris-setosa' b'Iris-setosa' b'Iris-setosa' b'Iris-setosa'
 b'Iris-setosa' b'Iris-setosa' b'Iris-versicolor' b'Iris-versicolor'
 b'Iris-versicolor' b'Iris-versicolor' b'Iris-versicolor'
 b'Iris-versicolor' b'Iris-virginica' b'Iris-virginica' b'Iris-virginica'
 b'Iris-virginica' b'Iris-virginica' b'Iris-virginica' b'Iris-virginica'
 b'Iris-virginica']


[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2]
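
`np.unique` can produce the group ids directly through its `return_inverse` flag, which reports, for each element, the index of its value in the sorted unique array. A minimal sketch on a small hand-written sample (`labels` is a hypothetical stand-in for `species_small`):

```python
import numpy as np

# Hypothetical sample standing in for species_small
labels = np.array([b'Iris-setosa', b'Iris-setosa', b'Iris-versicolor',
                   b'Iris-virginica', b'Iris-virginica'])

# group_ids[i] is the position of labels[i] within unique_labels
unique_labels, group_ids = np.unique(labels, return_inverse=True)
print(group_ids)
```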

54. (L2) **Create the ranks for the given numeric array `a`.**

In [55]:
# Input

np.random.seed(10)
a = np.random.randint(20, size=10)
print(a)

# Manual
ranks = np.zeros(a.shape[0])
for element_index in range(0, a.shape[0]):
    for rank in np.argwhere(np.sort(a) == a[element_index]):
        if rank + 1 not in ranks:
            ranks[element_index] = rank[0] + 1
            break
print(ranks)

# Double argsort = ranking
a.argsort().argsort()


[ 9  4 15  0 17 16 17  8  9  0]
[ 5.  3.  7.  1.  9.  8. 10.  4.  6.  2.]


array([4, 2, 6, 0, 8, 7, 9, 3, 5, 1])
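
The double `argsort` works because the second call inverts the permutation produced by the first. The sketch below spells out that inversion explicitly, which avoids the second sort (`order` and `ranks` are illustrative names):

```python
import numpy as np

np.random.seed(10)
a = np.random.randint(20, size=10)

order = a.argsort()               # order[k] is the position of the k-th smallest element
ranks = np.empty_like(order)
ranks[order] = np.arange(a.size)  # the element at position order[k] receives rank k
print(ranks)
```

As with the double `argsort`, the ranks are 0-based and tied values receive distinct ranks.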

55. (L3) **Create a rank array of the same shape as a given numeric array `a`.**

In [56]:
# Input

np.random.seed(10)
a = np.random.randint(20, size=[2,5])

a.flatten().argsort().argsort().reshape(a.shape)

array([[4, 2, 6, 0, 8],
       [7, 9, 3, 5, 1]])

56. (L2) **Compute the maximum for each row in the given array.**

In [57]:
# Input

np.random.seed(100)
a = np.random.randint(1,10, [5,3])

print([
    np.max(row)
    for row in a
])

np.amax(a, axis=1)

[9, 8, 6, 3, 9]


array([9, 8, 6, 3, 9])

57. (L3) **Compute the min-by-max for each row for given 2d array.**

In [58]:
# Input

np.random.seed(100)
a = np.random.randint(1,10, [5,3])

np.apply_along_axis(
    lambda row: row.min() / row.max(),
    axis=1,
    arr=a
)

array([0.44444444, 0.125     , 0.5       , 1.        , 0.11111111])

58. (L3) **Find the duplicate entries (2nd occurrence onwards) in the given numpy array and mark them as True. First time occurrences should be False.**

In [59]:
# Input

np.random.seed(100)
a = np.random.randint(0, 5, 10)

_, indices = np.unique(a, return_index=True)
duplicates = np.full(a.shape[0], True)
duplicates[indices] = False
duplicates

array([False,  True, False,  True, False, False,  True,  True,  True,
        True])

59. (L3) **Find the mean of a numeric column grouped by a categorical column in a 2D numpy array.**

In [60]:
COLUMN = 1
groups = np.unique(iris_dataset[:, 4])
for group_name in groups:
    print(
        group_name,
        np.mean(
            iris_dataset[
                np.where(iris_dataset[:, 4] == group_name)
            ][:, COLUMN].astype(float)
        )
    )

b'Iris-setosa' 3.418
b'Iris-versicolor' 2.7700000000000005
b'Iris-virginica' 2.974


60. (L3) **Import the image from the following URL and convert it to a numpy array.**

In [61]:
from PIL import Image
import requests
import io

URL = 'https://upload.wikimedia.org/wikipedia/commons/8/8b/Denali_Mt_McKinley.jpg'

downloaded_image = Image.open(
    io.BytesIO(
        requests.get(URL).content
    )
)
np.array(
    list(
        downloaded_image.getdata()
    )
)

array([[  9,  72, 125],
       [  9,  72, 125],
       [  9,  72, 125],
       ...,
       [ 15,  51,  77],
       [ 12,  48,  74],
       [ 11,  45,  72]])

61. (L2) **Drop all nan values from a 1D numpy array.**

In [62]:
# Input

a = np.array([1,2,3,np.nan,5,6,7,np.nan])

a = a[~np.isnan(a)]  # keep only the entries that are not NaN
print(a)

[1. 2. 3. 5. 6. 7.]


62. (L3) **Compute the euclidean distance between two arrays a and b.**

In [63]:
# Inputs

a = np.array([1,2,3,4,5])
b = np.array([4,5,6,7,8])

manual = np.sqrt(
    np.sum((b - a)**2)
)

linalg = np.linalg.norm(a-b)
print(np.isclose(manual, linalg))  # compare floats with a tolerance rather than ==

True


63. (L3) **Find all the peaks in a 1D numpy array a. Peaks are points surrounded by smaller values on both sides.**

In [64]:
# Input

a = np.array([1, 3, 7, 1, 2, 6, 0, 1])

# Vectorized approach: a peak is where the slope sign flips from +1 to -1,
# i.e. where the difference of consecutive signs equals -2
dunno = np.where(
    np.diff(
        np.sign(  # Takes the sign
            np.diff(a)
        )
    ) == -2
)[0] + 1  # entry i of the double diff describes element i + 1 of a

# General approach

a_diffs = np.sign(np.diff(np.hstack([a[0], a])))
general = [
    diff_index
    for diff_index in range(len(a_diffs[1:]))
    if a_diffs[diff_index] == 1 and a_diffs[diff_index+1] == -1  # Signs of differences: (+1) / [MAX] \ (-1)
]

print(dunno)
print(general)

[2 5]
[2, 5]


64. (L2) **Subtract the 1d array `b_1d` from the 2d array `a_2d`, such that each item of `b_1d` subtracts from respective row of `a_2d`.**

In [65]:
# Inputs

a_2d = np.array([[3,3,3],[4,4,4],[5,5,5]])
b_1d = np.array([1,2,3])

a_2d - b_1d.reshape(b_1d.shape[0], 1)

array([[2, 2, 2],
       [2, 2, 2],
       [2, 2, 2]])
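
The `reshape` turns `b_1d` into a column of shape `(3, 1)`, which broadcasting then stretches across the columns of `a_2d`. Indexing with `np.newaxis` is an equivalent and commonly seen spelling; a minimal sketch (`row_wise_difference` is an illustrative name):

```python
import numpy as np

a_2d = np.array([[3, 3, 3], [4, 4, 4], [5, 5, 5]])
b_1d = np.array([1, 2, 3])

# b_1d[:, np.newaxis] has shape (3, 1), so element i is subtracted from row i
row_wise_difference = a_2d - b_1d[:, np.newaxis]
print(row_wise_difference)
```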

65. (L2) **Find the index of 5th repetition of number 1 in x.**

In [66]:
# Inputs

x = np.array([1, 2, 1, 1, 3, 4, 3, 1, 1, 2, 1, 1, 2])

target_number = 1
target_occurrence = 5

np.argwhere(
    x == target_number
)[target_occurrence-1]

array([8])

66. (L2) **Convert numpy's datetime64 object to datetime's datetime object.**

In [67]:
import datetime

dt64 = np.datetime64('2018-02-25 22:10:10')

manual = datetime.datetime.fromisoformat(dt64.astype(str))
casted = dt64.astype(datetime.datetime)

manual == casted

True

67. (L3) **Compute the moving average of window size 3, for the given 1D array.**

In [68]:
# Inputs

np.random.seed(100)
Z = np.random.randint(10, size=10)

print(Z)

window_size = 3

manual = [
    np.average(
        Z[index:index+window_size]
    )
    for index in range(0, Z.shape[0] - window_size + 1)  # + 1 so the last full window is included
]

convoluted = np.convolve(
    Z,
    np.ones(window_size)/window_size,
    mode='valid'
)
print(manual)
print(convoluted)

[8 8 3 7 7 0 4 2 5 2]
[6.333333333333333, 6.0, 5.666666666666667, 4.666666666666667, 3.6666666666666665, 2.0, 3.6666666666666665, 3.0]
[6.33333333 6.         5.66666667 4.66666667 3.66666667 2.
 3.66666667 3.        ]
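
Another common way to get the same moving average is through cumulative sums: each window sum is the difference between two running totals. A minimal sketch on the same seeded input (`cumulative` and `moving_average` are illustrative names):

```python
import numpy as np

np.random.seed(100)
Z = np.random.randint(10, size=10)
window_size = 3

# Prepend a zero so that cumulative[i] is the sum of the first i elements
cumulative = np.concatenate(([0], np.cumsum(Z)))

# Sum of Z[i:i + window_size] equals cumulative[i + window_size] - cumulative[i]
moving_average = (cumulative[window_size:] - cumulative[:-window_size]) / window_size
print(moving_average)
```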


68. (L2) **Create a numpy array of length 10, starting from 5 and has a step of 3 between consecutive numbers.**

In [69]:
start = 5
step = 3
number_of_elements = 10

np.arange(start, start + number_of_elements * step, step)

array([ 5,  8, 11, 14, 17, 20, 23, 26, 29, 32])

69. (L3) **Given an array of a non-continuous sequence of dates. Make it a continuous sequence of dates, by filling in the missing dates.**

In [70]:
# Inputs

dates = np.arange(np.datetime64('2018-02-01'), np.datetime64('2018-02-25'), 2)

# Inserting into `dates` while looping over its original length shifts the indices
# and leaves gaps, so rebuild the full daily range from the first to the last date
filled_dates = np.arange(dates[0], dates[-1] + np.timedelta64(1, 'D'))
filled_dates

array(['2018-02-01', '2018-02-02', '2018-02-03', '2018-02-04',
       '2018-02-05', '2018-02-06', '2018-02-07', '2018-02-08',
       '2018-02-09', '2018-02-10', '2018-02-11', '2018-02-12',
       '2018-02-13', '2018-02-14', '2018-02-15', '2018-02-16',
       '2018-02-17', '2018-02-18', '2018-02-19', '2018-02-20',
       '2018-02-21', '2018-02-22', '2018-02-23'], dtype='datetime64[D]')

70. (L4) **From the given 1d array `arr`, generate a 2d matrix using strides, with a window length of 4 and strides of 2, like `[[0,1,2,3], [2,3,4,5], [4,5,6,7]..]`**

In [71]:
# Input
arr = np.arange(15)

stride = 2
window_size = 4

manual = [
    arr[jump*stride:jump*stride+window_size]
    for jump in range(0, int(arr.shape[0] / stride + 1))
    if jump*stride+window_size <= arr.shape[0]  # <= keeps a window that ends exactly at the last element
]
print(manual)


[array([0, 1, 2, 3]), array([2, 3, 4, 5]), array([4, 5, 6, 7]), array([6, 7, 8, 9]), array([ 8,  9, 10, 11]), array([10, 11, 12, 13])]
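
Since the exercise asks for strides, here is a sketch using `numpy.lib.stride_tricks.sliding_window_view`, which is available from NumPy 1.20 onward (the notebook runs 1.26.4). It builds every window of length 4 as a read-only view and then keeps every second one (`windows` is an illustrative name):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

arr = np.arange(15)
stride = 2
window_size = 4

# All length-4 windows as a view (no data copied), then every `stride`-th window
windows = sliding_window_view(arr, window_size)[::stride]
print(windows)
```

The result is a regular 2D array rather than a list of 1D arrays.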
