# NUMPY 

If you add two lists directly with +, Python just concatenates them: l1 + l2 gives every element of l1 followed by every element of l2.
To actually add them element by element you need a for loop, which can be very slow if you have very large lists.
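
As a small illustration of the difference between concatenation and element-wise addition (the short lists here are made up for the example):

```python
l1 = [1, 2, 3]
l2 = [10, 20, 30]

# The + operator on lists joins them end to end.
concatenated = l1 + l2

# Element-wise addition needs an explicit loop (here a comprehension over zip).
elementwise = [x + y for x, y in zip(l1, l2)]

print(concatenated)
print(elementwise)
```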

In [1]:
import numpy as np

### Adding up a list without numpy

In [3]:
l1 = list(range(1000000))
%timeit sum(l1)

9.47 ms ± 961 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### Adding up a list using numpy

In [5]:
l1_array = np.array(l1)
%timeit np.sum(l1_array)

854 µs ± 3.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


## Why is numpy faster?

A list is basically a container of addresses (pointers) to Python objects that may live anywhere in memory, whereas a numpy array puts all the data together in one contiguous memory buffer. Every element of a numpy array has the same type, so no per-element type checking is required. When a list is added up, Python has to look up the type of each element and work out how to add it, and it raises an error if the types cannot be added. Numpy already knows that all the numbers share one type, so it can run the whole loop in compiled code.
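
A rough sketch of the difference, assuming a 64-bit integer array (the names mixed_list and arr are only for illustration): a list can hold objects of different types, while an array stores fixed-size values of one type in a single buffer.

```python
import numpy as np

# A list may hold objects of different types, so each one must be inspected.
mixed_list = [1, 2.5, "three"]
element_types = [type(item).__name__ for item in mixed_list]

# An array has one dtype for every element, stored in one block of memory.
arr = np.array([1, 2, 3], dtype=np.int64)
bytes_per_element = arr.itemsize   # size of a single int64 value in bytes
total_bytes = arr.nbytes           # itemsize multiplied by the element count

print(element_types, arr.dtype, bytes_per_element, total_bytes)
```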

In [7]:
l = [1,2,3,'saurabh',5,6,7]
sum(l)

TypeError: unsupported operand type(s) for +: 'int' and 'str'

### Numpy basically does the for loop (a zip-style loop) for us for element-by-element addition, subtraction, multiplication, etc.

In [9]:
a = np.array([1,2,3,4])
b = np.array([10,11,12,13])

In [10]:
a

array([1, 2, 3, 4])

In [11]:
b

array([10, 11, 12, 13])

In [12]:
c = a + b
c

array([11, 13, 15, 17])

In [13]:
a / b

array([0.1       , 0.18181818, 0.25      , 0.30769231])

In [14]:
a * b

array([10, 22, 36, 52])

In [15]:
a ** b

array([       1,     2048,   531441, 67108864])

### A numpy array is homogeneous, i.e. all the elements in a numpy array are of the same data type. If every element is an int the array will be int, but if even one element is a float the whole array will be float. We can enforce a data type using dtype=, which will truncate the elements as required

In [16]:
a.dtype

dtype('int64')
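
A minimal sketch of how the dtype is chosen, using made-up values: one float is enough to make the whole array float, and forcing an integer dtype drops the fractional part.

```python
import numpy as np

all_ints = np.array([1, 2, 3])       # every element is an int
one_float = np.array([1, 2, 3.5])    # a single float upgrades the whole array

# Forcing an integer dtype truncates the fractional part (towards zero).
forced_int = np.array([1.7, 2.9, -3.5], dtype=np.int64)

print(all_ints.dtype, one_float.dtype, forced_int)
```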

#### The Python type of a numpy array is ndarray, where nd stands for n dimensions; in our case n is one

In [18]:
type(a)

numpy.ndarray

#### The number of dimensions of a numpy array is given by

In [19]:
a.ndim

1

#### The transpose of a is the same as a because a is one-dimensional. Transposing reverses the order of the axes, and a one-dimensional array has a single axis (no separate rows and columns), so a.T remains a. (The output below shows 10 as the first element because this cell was run after a[0] was reassigned further down.)

In [50]:
a.T

array([10,  2,  3,  4])
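
If a column is really needed, one option is to give the array a second axis first. This is only a sketch; reshape(-1, 1) is one of several ways to do it.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

same = a.T                   # still one-dimensional, shape (4,)
column = a.reshape(-1, 1)    # two-dimensional column, shape (4, 1)
row = column.T               # transposing now swaps the two axes, shape (1, 4)

print(same.shape, column.shape, row.shape)
```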

#### The shape of an array is given by .shape and it returns a tuple

In [48]:
a.shape #This basically denotes that there is only one dimension and there are 4 elements in it

(4,)

In [22]:
type(10)

int

In [24]:
type((10,))

tuple

In [25]:
type(a.shape)

tuple

In [41]:
d = np.array([1,2,3,4.0], dtype = 'int')
d

array([1, 2, 3, 4])

### Broadcasting in numpy 
#### This is another reason why numpy is faster: we do not need to write for loops for these operations

In [26]:
b / 10

array([1. , 1.1, 1.2, 1.3])

In [27]:
b * 10

array([100, 110, 120, 130])
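
To make the comparison with a for loop concrete, here is a small sketch. The offsets array is a made-up example showing that broadcasting also works between arrays of different shapes, not only with a scalar.

```python
import numpy as np

b = np.array([10, 11, 12, 13])

# The explicit loop that broadcasting replaces.
looped = [value * 10 for value in b]

# The scalar 10 is "stretched" to match the shape of b.
broadcast = b * 10

# A (2, 1) column combined with a (4,) row broadcasts to a (2, 4) result.
offsets = np.array([[0], [100]])
grid = offsets + b

print(broadcast)
print(grid)
```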

In [28]:
np.log(a)

array([0.        , 0.69314718, 1.09861229, 1.38629436])

In [30]:
np.log


<ufunc 'log'>

The ufunc above denotes that we are no longer just using pure Python functions: np.log is a universal function that numpy applies element by element in compiled code.
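
As a rough comparison, the same result can be obtained with math.log from the standard library, but only one value at a time:

```python
import math

import numpy as np

a = np.array([1, 2, 3, 4])

# math.log works on a single number, so a Python-level loop is needed.
looped = np.array([math.log(value) for value in a])

# np.log is a ufunc and handles the whole array in one call.
vectorised = np.log(a)

is_ufunc = isinstance(np.log, np.ufunc)
print(is_ufunc, vectorised)
```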

### Array indexing

In [31]:
a[0]

1

In [32]:
a[0] = 10

In [33]:
a[0]

10

#### Numpy is very strict about the data type you put in: if the array was defined as int, then float values assigned to it are truncated

In [34]:
a[0] = 10.6

In [35]:
a

array([10,  2,  3,  4])

Even fill has the same behaviour

In [36]:
b.fill(4.5)
b

array([4, 4, 4, 4])
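
If the fractional part matters, one way around the truncation is to convert the array to float first. A small sketch, assuming astype(float) is acceptable for the data at hand:

```python
import numpy as np

a = np.array([1, 2, 3, 4])

# Assigning a float into an integer array silently drops the fraction.
a[0] = 10.6

# astype returns a new float array, which can store the value as given.
a_float = a.astype(float)
a_float[0] = 10.6

print(a, a_float)
```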

## Higher Dimensions

Multidimensional arrays are basically created from a list of lists

In [45]:
c = np.array([[10,11,12],[20,21,22]])
c

array([[10, 11, 12],
       [20, 21, 22]])

In [46]:
c.ndim

2

In [49]:
c.shape # this denotes that there are 2 dimensions and 2 rows and 3 columns

(2, 3)

In [51]:
c.size

6

In [52]:
c.nbytes

48

In [55]:
c[0, 1] # highly recommended to use this instead of c[0][1]

11

In [56]:
c[0]

array([10, 11, 12])
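
A short sketch of why c[0, 1] is preferred: c[0][1] first builds the intermediate row c[0] and then indexes into it, while c[0, 1] is a single lookup. The dtype is fixed to int64 here so that the byte count matches the output above.

```python
import numpy as np

c = np.array([[10, 11, 12], [20, 21, 22]], dtype=np.int64)

direct = c[0, 1]      # one indexing operation
first_row = c[0]      # intermediate array created by the chained form
chained = c[0][1]     # two indexing operations, same value

# nbytes is simply the number of elements times the size of each element.
expected_nbytes = c.size * c.itemsize

print(direct, chained, expected_nbytes)
```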

### Some Practice

In [58]:
z = np.arange(25).reshape(5,5)
z

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

In [63]:
z[:,1::2]

array([[ 1,  3],
       [ 6,  8],
       [11, 13],
       [16, 18],
       [21, 23]])

In [60]:
z[:,3]

array([ 3,  8, 13, 18, 23])

In [61]:
z[4,:]

array([20, 21, 22, 23, 24])

In [62]:
z[1::2,:4:2]

array([[ 5,  7],
       [15, 17]])

In [64]:
z[-1]

array([20, 21, 22, 23, 24])

In [65]:
z[4]

array([20, 21, 22, 23, 24])

In [66]:
new = z[1::2,:4:2]
new

array([[ 5,  7],
       [15, 17]])

In [67]:
new[1,1]

17

In [70]:
new[1,1] = 0
new

array([[ 5,  7],
       [15,  0]])

##### Now let's check whether these changes are made to the original array z, of which new is a slice

In [71]:
z 

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16,  0, 18, 19],
       [20, 21, 22, 23, 24]])

### Slicing numpy arrays

##### Yes, the changes are reflected in the original array. This means that even after slicing, the data in the slice still points to the same address, or informally speaking still belongs to the original array.
##### This is done to improve speed, as there is no cost of allocating new memory for the slice.
##### So we can say that a slice is just a view on the original data; no new memory is allocated for the slice
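
One way to confirm that a slice is a view is np.shares_memory, sketched here on the same kind of 5 x 5 array:

```python
import numpy as np

z = np.arange(25).reshape(5, 5)
new = z[1::2, :4:2]     # rows 1 and 3, columns 0 and 2

# True when the two arrays overlap in memory.
shared = np.shares_memory(z, new)

# Writing through the view changes the original array as well.
new[1, 1] = 0

print(shared, z[3, 2])
```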

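#### We can check whether an array owns its data or is a view by using .flags. Note that z itself reports OWNDATA : False, because reshape returned a view of the array created by np.arange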

In [72]:
z.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

In [73]:
new.flags

  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

### If you actually want to make a copy then use .copy()

In [74]:
new1 = new.copy()

In [75]:
new1

array([[ 5,  7],
       [15,  0]])

In [76]:
new1.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
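
A small sketch showing that a copy is independent of the array it came from (the names view and independent are only for illustration):

```python
import numpy as np

z = np.arange(25).reshape(5, 5)
view = z[1::2, :4:2]
independent = view.copy()

# Changing the copy leaves both the view and the original untouched.
independent[0, 0] = -1

print(z[1, 0], view[0, 0], independent[0, 0])
```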

### Boolean Masks

In [77]:
x = np.array([3, -1, -2, 4, -6, 8])

In [78]:
x < 0

array([False,  True,  True, False,  True, False])

In [79]:
mask = x < 0

In [80]:
x[mask]

array([-1, -2, -6])

In [81]:
x[x < 0]

array([-1, -2, -6])

In [83]:
x[x < 0] = 0
x

array([3, 0, 0, 4, 0, 8])

In [84]:
x > 0

array([ True, False, False,  True, False,  True])

In [98]:
m = np.array([10,1,4])
n = np.array([1,3,9])

m > n

array([ True, False, False])

##### Directly using and will give you the following error, as Python's and needs a single True/False value from each array

In [85]:
x > 3 and x < 8

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In [86]:
x > 3

array([False, False, False,  True, False,  True])

##### Use of .any( ) and .all()

In [88]:
(x > 3).any()

True

In [89]:
(x > 3).all()

False

In [91]:
(x > 3).any() and (x < 8).all()

False

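#### This still won't help our original case, as we want an array of True/False values, so we use the BITWISE operators
#### & (and), | (or), ~ (not), ^ (xor)
#### Each comparison must be wrapped in parentheses, because these operators bind more tightly than < and >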

In [93]:
(x > 3) & (x < 8)

array([False, False, False,  True, False, False])
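
For reference, a sketch of all four operators on the same x (after the negative values were replaced by 0). np.logical_and is shown as an equivalent spelling of &.

```python
import numpy as np

x = np.array([3, 0, 0, 4, 0, 8])

between = (x > 3) & (x < 8)        # and: both conditions hold
outside = (x <= 3) | (x >= 8)      # or: at least one condition holds
not_big = ~(x > 3)                 # not: inverts the mask
exactly_one = (x > 0) ^ (x > 3)    # xor: exactly one condition holds

# The function form gives the same result as the & operator.
same_as_and = np.logical_and(x > 3, x < 8)

print(between, outside, not_big, exactly_one)
```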

##### Whatever mask you make, you can use it to create subarrays of an array by putting it into square brackets.

In [94]:
x[(x > 3) & (x < 8)]

array([4])

#####  And this same logic can be used for replacing elements.

In [95]:
x[(x > 3) & (x < 8)] = 5
x

array([3, 0, 0, 5, 0, 8])

In [96]:
np.nonzero(x>3)

(array([3, 5]),)

### Some weird indexing (indexing with lists of positions)

In [108]:
q = np.arange(36).reshape(6,6)
q

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])

Getting the diagonal elements by passing a list of row indices and a list of column indices:

In [109]:
q[[0,1,2,3,4,5],
  [0,1,2,3,4,5]]

array([ 0,  7, 14, 21, 28, 35])
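
The index lists are paired up position by position, so the result contains q[0, 0], q[1, 1] and so on. A sketch using np.arange to build the lists, with np.diagonal as a cross-check:

```python
import numpy as np

q = np.arange(36).reshape(6, 6)
idx = np.arange(6)

# Row index i is paired with column index i.
diagonal = q[idx, idx]

# Reversing the column indices walks the other diagonal instead.
anti_diagonal = q[idx, idx[::-1]]

print(diagonal, anti_diagonal)
```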

### Creating masks

In [113]:
mask = np.array([1, 0, 0, 1, 0, 1], dtype = bool)
mask

array([ True, False, False,  True, False,  True])

In [114]:
q[mask, 2]

array([ 2, 20, 32])

### In contrast to slicing 

In [116]:
a = np.array([10,11,12,23,14])
a

array([10, 11, 12, 23, 14])

In [118]:
a.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

In [117]:
subset = a[[2,4,0]]
subset

array([12, 14, 10])

##### In this case, unlike slicing, the subset array has its own data and a has its own data. So if I make changes to subset, they will not be reflected in a

In [119]:
subset.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
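
A quick side-by-side sketch of the two behaviours on the same array: indexing with a list of positions copies, slicing gives a view.

```python
import numpy as np

a = np.array([10, 11, 12, 23, 14])

# Indexing with a list of positions returns a copy.
subset = a[[2, 4, 0]]
subset[0] = -1          # a[2] is not affected

# Slicing returns a view.
slice_view = a[1:3]
slice_view[0] = 99      # a[1] changes too

print(a, subset, slice_view)
```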

# Modifying a numpy array
## array[mask/list of positions/slice] = value
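
The three forms in the heading above can be sketched on a small made-up array as follows:

```python
import numpy as np

arr = np.arange(10)

arr[arr % 2 == 0] = -1    # mask: every even value becomes -1
arr[[1, 3]] = 100         # list of positions: elements 1 and 3 become 100
arr[-3:] = 0              # slice: the last three elements become 0

print(arr)
```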






#### Exercise: find all the elements divisible by 3

In [121]:
na = np.arange(25).reshape(5,5)
na

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

In [130]:
na % 3 == 0 

array([[ True, False, False,  True, False],
       [False,  True, False, False,  True],
       [False, False,  True, False, False],
       [ True, False, False,  True, False],
       [False,  True, False, False,  True]])

In [132]:
na[na % 3 == 0]

array([ 0,  3,  6,  9, 12, 15, 18, 21, 24])

##### Use np.where( ) so that the result keeps the 2 dimensional shape of the array

In [133]:
np.where(na % 3 == 0, na, np.nan)

array([[ 0., nan, nan,  3., nan],
       [nan,  6., nan, nan,  9.],
       [nan, nan, 12., nan, nan],
       [15., nan, nan, 18., nan],
       [nan, 21., nan, nan, 24.]])
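
np.where behaves differently depending on how many arguments it gets. A short sketch of both forms, where -1 is an arbitrary placeholder used instead of nan so the result stays an integer array:

```python
import numpy as np

na = np.arange(25).reshape(5, 5)
condition = na % 3 == 0

# One argument: returns the row and column indices where the condition holds.
rows, cols = np.where(condition)

# Three arguments: picks from the second argument where True, else the third.
kept = np.where(condition, na, -1)

print(na[rows, cols])
print(kept)
```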

##### np.empty_like()

First we will create a skeleton array of the same shape as na using np.empty_like(). The data type is set to float because we want to fill the array with nans, nan is a float, and a numpy array is homogeneous (has a single data type). The values shown below are whatever happened to be in the uninitialised memory.

In [137]:
na_nan = np.empty_like(na, dtype=float)
na_nan

array([[6.90479605e-310, 6.90476680e-310, 6.90479754e-310,
        6.90479755e-310, 6.90479759e-310],
       [6.90479759e-310, 6.90476680e-310, 6.90476680e-310,
        6.90476680e-310, 6.90476680e-310],
       [6.90476680e-310, 4.64419406e-310, 6.90479757e-310,
        6.90479759e-310, 6.90479758e-310],
       [6.90479753e-310, 6.90479758e-310, 6.90479759e-310,
        6.90479593e-310, 6.90479758e-310],
       [6.90479758e-310, 6.90479759e-310, 6.90479321e-310,
        6.90479322e-310, 6.90479753e-310]])

In [141]:
na_nan.fill(np.nan)
na_nan

array([[nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan]])

In [144]:
na_nan[na % 3 == 0] = na[na % 3 == 0]
na_nan

array([[ 0., nan, nan,  3., nan],
       [nan,  6., nan, nan,  9.],
       [nan, nan, 12., nan, nan],
       [15., nan, nan, 18., nan],
       [nan, 21., nan, nan, 24.]])
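
As an alternative sketch, np.full_like combines the np.empty_like and fill steps into a single call and gives the same result:

```python
import numpy as np

na = np.arange(25).reshape(5, 5)
mask = na % 3 == 0

# Same shape as na, float dtype, every element set to nan.
na_nan = np.full_like(na, np.nan, dtype=float)

# Copy over only the elements divisible by 3.
na_nan[mask] = na[mask]

print(na_nan)
```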

#### np.nansum() finds the sum of an array containing nans while ignoring them; a plain sum would return nan, as nan is poison :p
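
A tiny sketch of the difference, using made-up values:

```python
import numpy as np

values = np.array([0.0, np.nan, 3.0, np.nan, 6.0])

plain_sum = np.sum(values)         # any nan makes the whole sum nan
ignoring_nan = np.nansum(values)   # nans are skipped
mean_ignoring_nan = np.nanmean(values)

print(plain_sum, ignoring_nan, mean_ignoring_nan)
```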

### Difference between a function and a method
##### A function performs operations on whatever is passed to it, e.g. np.sum(a), whereas a method is attached to an object and operates on that object, e.g. a.sum( )
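
A brief sketch of the practical difference: the function also accepts a plain list, while the method only exists on the array object.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

from_function = np.sum(a)           # function: the array is passed in
from_method = a.sum()               # method: called on the array itself

# The function converts a list for us; a list has no .sum method of its own.
from_list = np.sum([1, 2, 3, 4])
list_has_sum_method = hasattr([1, 2, 3, 4], "sum")

print(from_function, from_method, from_list, list_has_sum_method)
```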

Sum along an axis

In [146]:
m = np.arange(24).reshape(6,4)
m.shape

(6, 4)

##### Taking the mean or sum along an axis collapses that axis, and the shape of the result is given by the remaining axes

In [147]:
m.mean(axis=0).shape

(4,)

In [148]:
m.mean(axis=1).shape

(6,)
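
The same idea with sum, sketched on the same 6 x 4 array. keepdims=True is an optional argument that keeps the collapsed axis with length 1.

```python
import numpy as np

m = np.arange(24).reshape(6, 4)

column_sums = m.sum(axis=0)    # collapses the 6 rows, one value per column
row_sums = m.sum(axis=1)       # collapses the 4 columns, one value per row

# keepdims leaves the collapsed axis in place with length 1.
kept = m.sum(axis=0, keepdims=True)

print(column_sums, row_sums, kept.shape)
```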

In [5]:
import pandas as pd
a = {'A':23 ,'A':35 ,'A':45 ,'A':12 ,'B':13 ,'B':17 ,'B':12 ,'B':21 ,'B':23}

In [6]:
df = pd.DataFrame(a)

ValueError: If using all scalar values, you must pass an index

In [None]:
df

In [62]:
raw_data = {'c1': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'], 
        'c2': [23, 35, 45, 12, 13, 17, 12, 21, 23]
           }

df = pd.DataFrame(raw_data)
df


  c1  c2
0  A  23
1  A  35
2  A  45
3  A  12
4  B  13
5  B  17
6  B  12
7  B  21
8  B  23


In [1]:
df1 = df.copy()
df1 = df1.groupby('c1').median()
df1.transpose()

NameError: name 'df' is not defined

In [83]:
df2 = df.pivot('c1').
df2

<bound method GroupBy.sum of <pandas.core.groupby.groupby.SeriesGroupBy object at 0x7f53eff9c358>>

In [60]:
df = df.sort_values('c2', ascending=True)
dfA = df[df['c1'] == 'A']['c2']
dfB = df[df['c1'] == 'B']['c2']
dfA
dfB

6    12
4    13
5    17
7    21
8    23
Name: c2, dtype: int64

In [34]:
l1 = list(df['c1'].unique())

In [29]:
# df.set_index('c1', inplace = True)
# df

In [21]:
list(df)

['c2']

In [87]:
import numpy as np
ar = np.arange(10,91,10)
ar[ar >= 50]

array([50, 60, 70, 80, 90])