
# **Data Analytics**

## **Numpy Arrays**

**Python Objects:**

1. high-level number objects: integers, floating point
2. containers: lists (costless insertion and append), dictionaries (fast lookup)

**Numpy provides:**

1. extension package to Python for multi-dimensional arrays
2. closer to hardware (efficiency)
3. designed for scientific computation (convenience)
4. Also known as array oriented computing


In [1]:
import numpy as np
a = np.array([0, 1, 2, 3])
print(a)
print(np.arange(10)) # array of the integers from 0 to 9

[0 1 2 3]
[0 1 2 3 4 5 6 7 8 9]


**Why it is useful:** Memory-efficient container that provides fast numerical operations.

In [2]:
# Python Lists
L = range(1000)
%timeit [i**2 for i in L]

1000 loops, best of 3: 257 µs per loop


In [3]:
 # Numpy array
a = np.arange(1000)
%timeit a**2

The slowest run took 42.72 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.34 µs per loop


### **Creating arrays**

#### **Manual Construction of arrays**

In [4]:
# Creating a 1-D array
a = np.array([0, 1, 2, 3])
a
# A bare %time measures an empty statement, not the array creation above
%time

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.15 µs


In [5]:
# print dimensions
a.ndim

1

In [6]:
# shape
a.shape

(4,)

In [7]:
len(a)

4

In [8]:
# 2-D, 3-D Numpy array
b = np.array([[0, 1, 2], [3, 4, 5]])
b

array([[0, 1, 2],
       [3, 4, 5]])

In [9]:
b.ndim

2

In [10]:
b.shape

(2, 3)

In [11]:
len(b) # returns the size of the first dimension

2

In [12]:
c = np.array([[[0, 1], [2, 3]], [[4, 5], [6, 7]]])
c

array([[[0, 1],
        [2, 3]],

       [[4, 5],
        [6, 7]]])

In [13]:
c.ndim

3

In [14]:
c.shape

(2, 2, 2)

In short:
- 1D array --> Vector
- 2D array --> Matrix
- **ND array --> Tensor**

#### **Functions for creating arrays** 

In [15]:
# Using the arange function
# arange is an array-valued version of the built-in Python range function
a = np.arange(10) # 0.... n-1
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [16]:
b = np.arange(1, 10, 2) #start, end (exclusive), step
b

array([1, 3, 5, 7, 9])

In [17]:
# Using linspace to get n evenly spaced points
a = np.linspace(0, 1, 6) # start, end (inclusive), number of points
a

array([0. , 0.2, 0.4, 0.6, 0.8, 1. ])

In [18]:
# common arrays for all ones in array
a = np.ones((3, 3))
a

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

In [19]:
# common arrays for all zeros in array
b = np.zeros((3, 3))
b

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [20]:
c = np.eye(3) # Return a 2-D array with ones on diagonal and zeros elsewhere
c

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [21]:
i = np.eye(3, 2)
i # 3 rows, 2 columns; the ones sit on the main diagonal (diagonal index 0 by default)


array([[1., 0.],
       [0., 1.],
       [0., 0.]])

In [22]:
# create array using diag function
a = np.diag([1, 2, 3, 4]) # construct a diagonal array
a

array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 3, 0],
       [0, 0, 0, 4]])

In [23]:
 np.diag(a)

array([1, 2, 3, 4])

In [24]:
# Create arrays of random numbers with rand (uniform) and randn (standard normal)
# rand creates an array of the given shape filled with random samples from a uniform distribution over [0, 1).
a = np.random.rand(4)
a

array([0.67147102, 0.72343096, 0.87431745, 0.53143742])

In [25]:
a = np.random.randn(4) # Return a sample (or samples) from the "standard normal" (Gaussian) distribution
a

array([-1.15152816,  0.10208972, -1.28705733, -1.3311965 ])

**Note**
<br>
For random samples from $N(\mu, \sigma^2)$, use:
`sigma * np.random.randn(...) + mu`
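
As a minimal sketch of the note above, the following draws samples for an assumed mean `mu = 5.0` and standard deviation `sigma = 2.0` (illustrative values that do not come from the notebook) and prints the sample statistics. The seed is fixed only so that repeated runs give the same numbers.

```python
import numpy as np

np.random.seed(0)  # arbitrary fixed seed, used only for repeatability

mu, sigma = 5.0, 2.0  # assumed example parameters
# Scale and shift standard normal draws to get N(mu, sigma^2) samples
samples = sigma * np.random.randn(10000) + mu

print(samples.mean())  # should be close to mu
print(samples.std())   # should be close to sigma
```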

### **Basic DataTypes**

You may have noticed that, in some instances, array elements are displayed with a trailing dot (e.g. 2. vs 2). This is due to a difference in the data-type used:

In [26]:
a = np.arange(10)
a.dtype

dtype('int64')

In [27]:
# You can explicitly specify which data-type you want:
a = np.arange(10, dtype='float64')
a

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

In [28]:
# The default data type is float for zeros and ones function
a = np.zeros((3, 3))
print(a)
a.dtype

[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]


dtype('float64')

There are also other data types:

In [29]:
d = np.array([1+2j, 2+4j])   #Complex datatype
print(d.dtype)

complex128


In [30]:
b = np.array([True, False, True, False])  #Boolean datatype
print(b.dtype)

bool


In [31]:
s = np.array(['Ram', 'Robert', 'Rahim'])
s.dtype

dtype('<U6')

**Each built-in data type has a character code (exposed as `dtype.kind`) that identifies its general kind:**

- 'b' − boolean
- 'i' − (signed) integer
- 'u' − unsigned integer
- 'f' − floating-point
- 'c' − complex-floating point
- 'm' − timedelta
- 'M' − datetime
- 'O' − (Python) objects
- 'S', 'a' − (byte-)string
- 'U' − Unicode
- 'V' − raw data (void)
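
The codes listed above can be read back from any array through `dtype.kind`. The sketch below builds a few small arrays (the dictionary keys are just illustrative labels) and collects the kind code of each.

```python
import numpy as np

# Small example arrays, one per kind; the keys are illustrative labels only
arrays = {
    "bool": np.array([True, False]),
    "int": np.array([1, 2, 3]),
    "float": np.array([1.0, 2.0]),
    "complex": np.array([1 + 2j]),
    "unicode": np.array(["Ram", "Robert"]),
}

# dtype.kind is the one-character code for the general kind of data
kinds = {label: arr.dtype.kind for label, arr in arrays.items()}
print(kinds)
```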


**For more details:**
https://docs.scipy.org/doc/numpy-1.10.1/user/basics.types.html

### **Indexing and Slicing**

#### **Indexing**

The items of an array can be accessed and assigned to the same way as other Python sequences (e.g. lists):

In [32]:
a = np.arange(10)
print(a[5])  #indices begin at 0, like other Python sequences (and C/C++)

5


In [33]:
# For multidimensional arrays, indexes are tuples of integers:
a = np.diag([1, 2, 3])
print(a[2, 2])

3


In [34]:
a[2, 1] = 5 #assigning value
a

array([[1, 0, 0],
       [0, 2, 0],
       [0, 5, 3]])

#### **Slicing**

In [35]:
a = np.arange(10)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [36]:
a[1:8:2] # [startindex: endindex(exclusive) : step]

array([1, 3, 5, 7])

In [37]:
# We can also combine assignment and slicing:
a = np.arange(10)
a[5:] = 10
a

array([ 0,  1,  2,  3,  4, 10, 10, 10, 10, 10])

In [38]:
b = np.arange(5)
a[5:] = b[::-1]  #assigning
a

array([0, 1, 2, 3, 4, 4, 3, 2, 1, 0])

### **Copies and Views**

A slicing operation creates a view on the original array, which is just a way of accessing array data. Thus the original array is not copied in memory. You can use np.may_share_memory() to check whether two arrays might share the same memory block, or np.shares_memory(), used below, for an exact check.

When modifying the view, the original array is modified as well:

In [39]:
a = np.arange(10)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [40]:
b = a[::2]
b

array([0, 2, 4, 6, 8])

In [41]:
np.shares_memory(a, b)

True

In [42]:
b[0] = 10
b

array([10,  2,  4,  6,  8])

In [43]:
a  # even though we modified b, 'a' was updated too because both share the same memory

array([10,  1,  2,  3,  4,  5,  6,  7,  8,  9])

In [44]:
a = np.arange(10)
c = a[::2].copy()     #force a copy
c

array([0, 2, 4, 6, 8])

In [45]:
np.shares_memory(a, c)

False

In [46]:
c[0] = 10
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

### **Fancy Indexing**

NumPy arrays can be indexed with slices, but also with boolean or integer arrays (masks). This method is called fancy indexing. It creates copies not views.
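
The difference between a slice (view) and a fancy index (copy) can be checked directly. This is a small sketch; the variable names `view` and `fancy` are chosen here for illustration.

```python
import numpy as np

a = np.arange(10)
view = a[2:5]          # basic slice: a view on a
fancy = a[[2, 3, 4]]   # integer-array index: a copy of the selected items

print(np.shares_memory(a, view))   # the slice shares memory with a
print(np.shares_memory(a, fancy))  # the fancy-indexed result does not

fancy[0] = 99  # modifies the copy only
print(a[2])    # a is unchanged
```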

**Using Boolean Mask** 

In [47]:
a = np.random.randint(0, 20, 15)
a

array([ 6,  1, 15,  1, 18,  4,  2, 18, 12, 19, 18,  3, 19,  8,  1])

In [48]:
mask = (a % 2 == 0)

In [49]:
extract_from_a = a[mask]
extract_from_a

array([ 6, 18,  4,  2, 18, 12, 18,  8])

**Indexing with a mask can be very useful to assign a new value to a sub-array:**

In [50]:
a[mask] = -1
a

array([-1,  1, 15,  1, -1, -1, -1, -1, -1, 19, -1,  3, 19, -1,  1])

**Indexing with an array of integers**

In [51]:
a = np.arange(0, 100, 10)
a

array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])

In [52]:
# Indexing can be done with an array of integers, where the same index may be repeated several times:
a[[2, 3, 2, 4, 2]]

array([20, 30, 20, 40, 20])

In [53]:
# New values can be assigned 
a[[9, 7]] = -200
a

array([   0,   10,   20,   30,   40,   50,   60, -200,   80, -200])

### **Elementwise Operations**

#### **Basic Operations**

**with scalars**

In [54]:
a = np.array([1, 2, 3, 4]) # create an array
a + 1

array([2, 3, 4, 5])

In [55]:
a ** 2 # squares elementwise

array([ 1,  4,  9, 16])

**All arithmetic operates elementwise**

In [56]:
b = np.ones(4) + 1
a - b

array([-1.,  0.,  1.,  2.])

In [57]:
a * b

array([2., 4., 6., 8.])

In [58]:
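# Elementwise product (*) versus matrix product (.dot)
# For a diagonal matrix such as c, both happen to give the same result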
c = np.diag([1, 2, 3, 4])
print(c * c)
print("+++++++++++++++++++++++")
print(c.dot(c))

[[ 1  0  0  0]
 [ 0  4  0  0]
 [ 0  0  9  0]
 [ 0  0  0 16]]
+++++++++++++++++++++++
[[ 1  0  0  0]
 [ 0  4  0  0]
 [ 0  0  9  0]
 [ 0  0  0 16]]
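
With a diagonal matrix the two products coincide, which can hide the distinction. The sketch below uses a non-diagonal matrix `m` (an example chosen here, not taken from the notebook) where `*` and `.dot` clearly differ.

```python
import numpy as np

m = np.array([[1, 2],
              [3, 4]])

elementwise = m * m         # multiplies matching entries
matrix_product = m.dot(m)   # row-by-column matrix product (same as m @ m)

print(elementwise)
print(matrix_product)
```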


**Comparisons**

In [59]:
a = np.array([1, 2, 3, 4])
b = np.array([5, 2, 2, 4])
a == b

array([False,  True, False,  True])

In [60]:
a > b

array([False, False,  True, False])

In [61]:
# Array-wise comparisons
a = np.array([1, 2, 3, 4])
b = np.array([5, 2, 2, 4])
c = np.array([1, 2, 3, 4])
np.array_equal(a, b)

False

In [62]:
 np.array_equal(a, c)

True

**Logical Operations**

In [63]:
a = np.array([1, 1, 0, 0], dtype=bool)
b = np.array([1, 0, 1, 0], dtype=bool)
np.logical_or(a, b)

array([ True,  True,  True, False])

In [64]:
np.logical_and(a, b)

array([ True, False, False, False])

**Transcendental functions:**

In [65]:
a = np.arange(5)
np.sin(a)

array([ 0.        ,  0.84147098,  0.90929743,  0.14112001, -0.7568025 ])

In [66]:
np.log(a)  # log(0) evaluates to -inf, and NumPy emits a RuntimeWarning (divide by zero)


array([      -inf, 0.        , 0.69314718, 1.09861229, 1.38629436])

In [67]:
np.exp(a) #evaluates e^x for each element in a given input

array([ 1.        ,  2.71828183,  7.3890561 , 20.08553692, 54.59815003])

**Shape Mismatch**

In [68]:
a = np.arange(4)
a + np.array([1, 2])

ValueError: operands could not be broadcast together with shapes (4,) (2,)
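
The failure above can be caught and inspected. The sketch below also shows one possible way to make the shapes compatible by turning `a` into a column. Note that this computes every pairwise sum, which is an assumption about intent rather than a drop-in fix.

```python
import numpy as np

a = np.arange(4)
message = None
try:
    a + np.array([1, 2])  # shapes (4,) and (2,) are incompatible
except ValueError as err:
    message = str(err)
    print("ValueError:", message)

# Reshape a to (4, 1) so that it can be combined with shape (2,)
result = a[:, np.newaxis] + np.array([1, 2])
print(result.shape)
```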

#### **Basic Reductions**

**Computing Sums**

In [69]:
x = np.array([1, 2, 3, 4])
np.sum(x)

10

In [70]:
# Sum by rows and by columns
x = np.array([[1, 1], [2, 2]])
x

array([[1, 1],
       [2, 2]])

In [71]:
x.sum(axis=0) # column-wise sum

array([3, 3])

In [72]:
x.sum(axis=1) # row-wise sum

array([2, 4])

**Other Reductions**

In [73]:
x = np.array([1, 3, 2])
x.min()

1

In [74]:
x.max()

3

In [75]:
x.argmin() # index of the minimum element

0

In [76]:
x.argmax() # index of the maximum element

1

**Logical Operations**

In [77]:
np.all([True, True, False])

False

In [78]:
np.any([True, False, False])

True

In [79]:
# Note: can be used for array comparisons
a = np.zeros((50, 50))
np.any(a != 0)

False

In [80]:
np.all(a == a)

True

In [81]:
a = np.array([1, 2, 3, 2])
b = np.array([2, 2, 3, 2])
c = np.array([6, 4, 4, 5])
((a <= b) & (b <= c)).all()

True

**Statistics**

In [82]:
x = np.array([1, 2, 3, 1])
y = np.array([[1, 2, 3], [5, 6, 1]])
x.mean()

1.75

In [83]:
np.median(x)

1.5

In [85]:
np.median(y, axis=-1) # last axis

array([2., 5.])

In [86]:
x.std()          # full population standard dev.

0.82915619758885
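
By default `std` divides by N (the population form). A short sketch of how the `ddof` argument switches to the sample form, which divides by N - 1:

```python
import numpy as np

x = np.array([1, 2, 3, 1])

population_std = x.std()      # ddof=0 (default): divide by N
sample_std = x.std(ddof=1)    # ddof=1: divide by N - 1

print(population_std, sample_std)
```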

Example:
Data in populations.txt describes the populations of hares and lynxes (and carrots) in northern Canada, with one row per year from 1900 to 1920.



In [87]:
#load data into numpy array object
data = np.loadtxt('populations.txt')

In [88]:
data

array([[ 1900., 30000.,  4000., 48300.],
       [ 1901., 47200.,  6100., 48200.],
       [ 1902., 70200.,  9800., 41500.],
       [ 1903., 77400., 35200., 38200.],
       [ 1904., 36300., 59400., 40600.],
       [ 1905., 20600., 41700., 39800.],
       [ 1906., 18100., 19000., 38600.],
       [ 1907., 21400., 13000., 42300.],
       [ 1908., 22000.,  8300., 44500.],
       [ 1909., 25400.,  9100., 42100.],
       [ 1910., 27100.,  7400., 46000.],
       [ 1911., 40300.,  8000., 46800.],
       [ 1912., 57000., 12300., 43800.],
       [ 1913., 76600., 19500., 40900.],
       [ 1914., 52300., 45700., 39400.],
       [ 1915., 19500., 51100., 39000.],
       [ 1916., 11200., 29700., 36700.],
       [ 1917.,  7600., 15800., 41800.],
       [ 1918., 14600.,  9700., 43300.],
       [ 1919., 16200., 10100., 41300.],
       [ 1920., 24700.,  8600., 47300.]])

In [89]:
year, hares, lynxes, carrots = data.T # columns to variables
print(year)

[1900. 1901. 1902. 1903. 1904. 1905. 1906. 1907. 1908. 1909. 1910. 1911.
 1912. 1913. 1914. 1915. 1916. 1917. 1918. 1919. 1920.]


In [90]:
# Drop the year column to keep only the three population columns
populations = data[:, 1:]
populations

array([[30000.,  4000., 48300.],
       [47200.,  6100., 48200.],
       [70200.,  9800., 41500.],
       [77400., 35200., 38200.],
       [36300., 59400., 40600.],
       [20600., 41700., 39800.],
       [18100., 19000., 38600.],
       [21400., 13000., 42300.],
       [22000.,  8300., 44500.],
       [25400.,  9100., 42100.],
       [27100.,  7400., 46000.],
       [40300.,  8000., 46800.],
       [57000., 12300., 43800.],
       [76600., 19500., 40900.],
       [52300., 45700., 39400.],
       [19500., 51100., 39000.],
       [11200., 29700., 36700.],
       [ 7600., 15800., 41800.],
       [14600.,  9700., 43300.],
       [16200., 10100., 41300.],
       [24700.,  8600., 47300.]])

In [91]:
# standard deviation of each species over time (population form, ddof=0; pass ddof=1 for the sample form)
populations.std(axis=0)

array([20897.90645809, 16254.59153691,  3322.50622558])

In [92]:
# which species has the highest population each year?
np.argmax(populations, axis=1)

array([2, 2, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2, 0, 0, 0, 1, 2, 2, 2, 2, 2])
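
The same axis-based reductions can be tried without the data file. The sketch below hard-codes only the first three rows of the array shown above, so its numbers describe that excerpt and not the full data set.

```python
import numpy as np

# First three rows of the data shown above: year, hares, lynxes, carrots
data = np.array([
    [1900., 30000., 4000., 48300.],
    [1901., 47200., 6100., 48200.],
    [1902., 70200., 9800., 41500.],
])

populations = data[:, 1:]                            # drop the year column
mean_per_species = populations.mean(axis=0)          # average over the years
largest_each_year = np.argmax(populations, axis=1)   # column index of the largest value per row

print(mean_per_species)
print(largest_each_year)
```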

#### **Broadcasting**

Basic operations on NumPy arrays (addition, etc.) are elementwise.

This works on arrays of the same size. Nevertheless, it is also possible to do operations on arrays of different sizes if NumPy can transform these arrays so that they all have the same size: this conversion is called broadcasting.

The image below gives an example of broadcasting:

![broadcasting](https://i.stack.imgur.com/JcKv1.png)

In [93]:
a = np.tile(np.arange(0, 40, 10), (3,1))
print(a)

print("*************")
a = a.T
print(a)

[[ 0 10 20 30]
 [ 0 10 20 30]
 [ 0 10 20 30]]
*************
[[ 0  0  0]
 [10 10 10]
 [20 20 20]
 [30 30 30]]


In [94]:
b = np.array([0, 1, 2])
b

array([0, 1, 2])

In [95]:
a + b

array([[ 0,  1,  2],
       [10, 11, 12],
       [20, 21, 22],
       [30, 31, 32]])

In [96]:
a = np.arange(0, 40, 10)
a.shape

(4,)

In [97]:
a = a[:, np.newaxis]  # adds a new axis -> 2D array
a.shape

(4, 1)

In [98]:
a

array([[ 0],
       [10],
       [20],
       [30]])

In [99]:
a + b

array([[ 0,  1,  2],
       [10, 11, 12],
       [20, 21, 22],
       [30, 31, 32]])
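
One way to picture what broadcasting does is to stretch both operands to the common shape explicitly with `np.broadcast_to` and compare the result with the implicit version. This is a sketch for illustration; NumPy does not need the explicit step.

```python
import numpy as np

a = np.arange(0, 40, 10)[:, np.newaxis]  # shape (4, 1)
b = np.array([0, 1, 2])                  # shape (3,)

# Stretch both operands to the common shape (4, 3)
a_stretched = np.broadcast_to(a, (4, 3))
b_stretched = np.broadcast_to(b, (4, 3))

explicit = a_stretched + b_stretched
implicit = a + b  # broadcasting happens automatically

print(np.array_equal(explicit, implicit))
```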

#### **Array Shape Manipulation**

**Flattening**

In [100]:
a = np.array([[1, 2, 3], [4, 5, 6]])
a.ravel() #Return a contiguous flattened array. A 1-D array, containing the elements of the input, is returned. A copy is made only if needed.

array([1, 2, 3, 4, 5, 6])

In [101]:
a.T #Transpose

array([[1, 4],
       [2, 5],
       [3, 6]])

In [102]:
a.T.ravel()

array([1, 4, 2, 5, 3, 6])

**Reshaping**

The inverse operation to flattening:

In [103]:
print(a.shape)
print(a)

(2, 3)
[[1 2 3]
 [4 5 6]]


In [104]:
b = a.ravel()
print(b)

[1 2 3 4 5 6]


In [105]:
b = b.reshape((2, 3)) # to unflatten
b

array([[1, 2, 3],
       [4, 5, 6]])

In [106]:
b[0, 0] = 100
a

array([[100,   2,   3],
       [  4,   5,   6]])

**Note and beware: reshape may also return a copy!**

In [107]:
a = np.zeros((3, 2))
b = a.T.reshape(3*2)
b[0] = 50
a

array([[0., 0.],
       [0., 0.],
       [0., 0.]])
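
Whether `reshape` returned a view or a copy can be checked with `np.shares_memory`. The names `as_view` and `as_copy` in this sketch are illustrative.

```python
import numpy as np

a = np.zeros((3, 2))

as_view = a.reshape(6)     # a is contiguous, so reshape can return a view
as_copy = a.T.reshape(6)   # the transpose is not C-contiguous, so reshape has to copy

print(np.shares_memory(a, as_view))
print(np.shares_memory(a, as_copy))
```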

**Adding a Dimension**

Indexing with the np.newaxis object allows us to add an axis to an array.

Each use of newaxis increases the number of dimensions of the existing array by one. Thus,

- 1D array will become 2D array
- 2D array will become 3D array
- 3D array will become 4D array and so on

In [108]:
z = np.array([1, 2, 3])
z

array([1, 2, 3])

In [109]:
z[:, np.newaxis]

array([[1],
       [2],
       [3]])
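
The position of `np.newaxis` in the index decides where the new axis goes. A small sketch showing a column, a row, and how the two combine into a 2-D table:

```python
import numpy as np

z = np.array([1, 2, 3])

column = z[:, np.newaxis]   # shape (3, 1)
row = z[np.newaxis, :]      # shape (1, 3)
table = column * row        # shapes (3, 1) and (1, 3) combine to (3, 3)

print(column.shape, row.shape, table.shape)
```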

**Dimension Shuffling**

In [110]:
a = np.arange(4*3*2).reshape(4, 3, 2)
a.shape

(4, 3, 2)

In [111]:
a

array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5]],

       [[ 6,  7],
        [ 8,  9],
        [10, 11]],

       [[12, 13],
        [14, 15],
        [16, 17]],

       [[18, 19],
        [20, 21],
        [22, 23]]])

In [112]:
a[0, 2, 1]

5
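
The cells above only build and index the array. As a sketch of the shuffling itself, `transpose` can reorder the axes; the axis order `(1, 2, 0)` is an example chosen here.

```python
import numpy as np

a = np.arange(4 * 3 * 2).reshape(4, 3, 2)

# New axis order: old axis 1 first, then old axis 2, then old axis 0
b = a.transpose(1, 2, 0)

print(b.shape)
print(b[2, 1, 0])  # same element as a[0, 2, 1]
```

The result is a view, so writing to `b` also changes `a`.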

**Resizing**

In [113]:
a = np.arange(4)
a.resize((8,))
a

array([0, 1, 2, 3, 0, 0, 0, 0])

However, the array must not be referenced anywhere else, otherwise resize raises a ValueError:

In [114]:
b = a
a.resize((4,)) 

ValueError: ignored
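
When other references to the array exist, one alternative is the `np.resize` function, which returns a new array and leaves the original untouched. Note that, unlike the in-place method, it fills the extra space with repeated copies of the data instead of zeros.

```python
import numpy as np

a = np.arange(4)
b = a  # a second reference to the same array

# np.resize builds a new array, so existing references do not matter
c = np.resize(a, (8,))

print(c)        # the values of a, repeated to fill 8 slots
print(a.shape)  # a itself is unchanged
```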

**Sorting Data**

In [115]:
#Sorting along an axis:
a = np.array([[5, 4, 6], [2, 3, 2]])
b = np.sort(a, axis=1)
b

array([[4, 5, 6],
       [2, 2, 3]])

In [116]:
#in-place sort
a.sort(axis=1)
a

array([[4, 5, 6],
       [2, 2, 3]])

In [117]:
#sorting with fancy indexing
a = np.array([4, 3, 1, 2])
j = np.argsort(a)
j

array([2, 3, 1, 0])

In [118]:
a[j]

array([1, 2, 3, 4])
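
The index array returned by `argsort` can also reorder a second, parallel array. In the sketch below, `scores` and `labels` are hypothetical example arrays.

```python
import numpy as np

scores = np.array([4, 3, 1, 2])
labels = np.array(["d", "c", "a", "b"])  # hypothetical labels, one per score

order = np.argsort(scores)      # indices that would sort scores
sorted_scores = scores[order]
sorted_labels = labels[order]   # labels reordered to match

print(sorted_scores)
print(sorted_labels)
```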