# Introductory Material

Data encompasses a collection of discrete objects, numbers, words, events, facts, measurements, observations, or even descriptions of things.

## Making Sense of Data
- Numerical Data
> Discrete Data
> 
> Continuous Data
- Categorical Data
> Dichotomous Variable
>
> Polytomous Variable
- Measurement Scales
> Nominal
> 
> Ordinal
>
> Interval
>
> Ratio
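
As a rough illustration of how these categories can show up in code, the sketch below builds a small pandas table. The column names and values are hypothetical and invented only for this example; the mapping of each column to a scale is noted in the comments.

```python
import pandas as pd

# Hypothetical toy records, invented purely for illustration
df = pd.DataFrame({
    "blood_type": ["A", "O", "B", "O"],              # nominal, polytomous
    "smoker": [True, False, False, True],            # dichotomous
    "pain_level": ["low", "high", "medium", "low"],  # ordinal
    "temp_celsius": [36.6, 38.1, 37.2, 36.9],        # interval, continuous
    "num_visits": [1, 4, 2, 0],                      # ratio, discrete
})

# Nominal data: categories with no inherent order
df["blood_type"] = df["blood_type"].astype("category")

# Ordinal data: categories with a meaningful order
df["pain_level"] = pd.Categorical(
    df["pain_level"], categories=["low", "medium", "high"], ordered=True
)

print(df.dtypes)
# Order-aware operations are only defined for the ordered categorical
print(df["pain_level"].min(), df["pain_level"].max())
```

Declaring the order explicitly is what lets pandas compare `"low"` and `"high"`; without `ordered=True` the column would behave like nominal data.
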

The "why" of exploratory data analysis (EDA) is to process data so that it becomes information, and then to process that information so that it becomes knowledge.

EDA fits into a broader set of activities called data analysis.

The stages of data analysis are as follows:
1. Data requirements
2. Data collection
3. Data processing
4. Data cleaning
5. EDA
6. Modeling and algorithm
7. Data product
8. Communication

## Primary Aim of EDA
To examine what data can tell us before actually going through formal modeling or hypothesis formulation.

## Significance of EDA
EDA reveals what the data actually contains before any modeling assumptions are imposed on it.

## Steps in EDA
1. Problem definition
2. Data preparation
3. Data analysis
4. Development and representation of results

## Activities of EDA
- Discover patterns
- Spot anomalies
- Test hypotheses
- Check assumptions using statistical measures
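
A minimal sketch of a few of these activities, assuming a made-up series of readings. The data is hypothetical, and the 1.5 * IQR rule is only one common convention for flagging anomalies, not the only option.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration only
values = pd.Series([10, 12, 11, 13, 12, 95], name="reading")

# Discover patterns: summary statistics give a first picture of the data
print(values.describe())

# Spot anomalies: flag points more than 1.5 * IQR beyond the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [95]

# Check assumptions: a mean far from the median hints at skew or outliers
print(values.mean(), values.median())
```
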

In [72]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# NumPy

## For creating different types of NumPy arrays

In [73]:
my1DArray = np.array([1, 8, 27, 64])
print(my1DArray)

[ 1  8 27 64]


In [74]:
my2DArray = np.array([[1, 2, 3, 4], [2, 4, 9, 16], [4, 8, 18, 32]])
print(my2DArray)

[[ 1  2  3  4]
 [ 2  4  9 16]
 [ 4  8 18 32]]


In [75]:
my3DArray = np.array([[[1, 2, 3, 4], [5, 6, 7, 8]], [[1, 2, 3, 4], [9, 10, 11, 12]]])
print(my3DArray)

[[[ 1  2  3  4]
  [ 5  6  7  8]]

 [[ 1  2  3  4]
  [ 9 10 11 12]]]


## For displaying basic information, such as the data type, shape, size, and strides of a NumPy array

In [76]:
# .data is a buffer object pointing to the start of the array's memory
print(my2DArray.data)

<memory at 0x0000014B062FF6B0>


In [77]:
print(my2DArray.shape)

(3, 4)


In [78]:
print(my2DArray.dtype)

int32


In [79]:
print(my2DArray.strides)

(16, 4)


In [80]:
print(my3DArray.shape)

(2, 2, 4)


### Strides
Strides in NumPy are a tuple giving, for each dimension, the number of bytes to step in memory to reach the next element along that dimension. Knowing the strides is useful when doing computations with arrays because, together with the shape and data type, they describe the array's memory layout.

For example, consider a contiguous 1D array of eight numbers, each 8 bytes wide (such as `float64`). The stride for this array is 8, which means that to find the next element, you need to jump 8 bytes forward in memory.

Strides also describe multidimensional arrays. For example, consider a contiguous 4x4 array of 8-byte numbers stored in row-major order. The stride for the first dimension is 32 (one full row of 4 elements at 8 bytes each), which means that to reach the next row, you need to jump 32 bytes forward in memory. The stride for the second dimension is 8, which means that to reach the next element within a row, you need to jump 8 bytes forward. The output above follows the same rule: `my2DArray` has shape (3, 4) and 4-byte `int32` elements, so its strides are (16, 4).

NumPy uses strides internally to implement operations such as slicing, indexing, and broadcasting without copying data. A slice with a step returns a view whose stride is a multiple of the original stride. Indexing computes an element's byte offset as the sum of each index multiplied by the stride of its dimension. Broadcasting repeats data along a dimension by giving that dimension a stride of 0.
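
The sketch below shows strides on small arrays. It fixes the data type to `float64` so that each element is 8 bytes on every platform; the array names are chosen only for this example.

```python
import numpy as np

# 1D array of eight 8-byte floats: the next element is 8 bytes away
a = np.arange(8, dtype=np.float64)
print(a.strides)        # (8,)

# 4x4 array of 8-byte floats in row-major (C) order
b = np.zeros((4, 4), dtype=np.float64)
print(b.strides)        # (32, 8)

# Slicing with a step returns a view with a scaled stride, not a copy
print(a[::2].strides)   # (16,)

# Transposing swaps the strides; the underlying data is untouched
print(b.T.strides)      # (8, 32)

# Broadcasting repeats a row by giving the new dimension a stride of 0
row = np.arange(4, dtype=np.float64)
tiled = np.broadcast_to(row, (3, 4))
print(tiled.strides)    # (0, 8)
```

The zero stride in the last case means all three rows of `tiled` read the same four values in memory, which is why broadcasting does not need to copy data.
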

## For creating an array using built-in NumPy functions

In [81]:
ones = np.ones((3,4))
print(ones)

[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]


In [82]:
zeros = np.zeros((2,3,4))
print(zeros)

[[[0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]

 [[0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]]


In [83]:
# np.empty allocates memory without initializing it, so the printed values are arbitrary
emptyArray = np.empty((3,2))
print(emptyArray)

[[4.24399158e-314 8.48798317e-314]
 [8.48798316e-314 3.39519327e-313]
 [1.69759663e-313 6.79038653e-313]]


In [84]:
fullArray = np.full((2,2),7)
print(fullArray)

[[7 7]
 [7 7]]


In [85]:
evenSpacedArray = np.arange(10,25,5)
print(evenSpacedArray)

[10 15 20]


In [86]:
evenSpacedArray2 = np.linspace(0,2,9)
print(evenSpacedArray2)

[0.   0.25 0.5  0.75 1.   1.25 1.5  1.75 2.  ]


## For NumPy arrays and file operations

In [87]:
# Save a NumPy array to a text file
x = np.arange(0.0,50.0,1.0)
np.savetxt('data.out', x, delimiter=',')

In [88]:
# Load the array back from the text file (unpack=True transposes the result, which has no effect on 1D data)
z = np.loadtxt('data.out', unpack=True)
print(z)

[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35.
 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49.]


In [89]:
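# Load the array with genfromtxt: skip_header=1 drops the first line (0.0),
# and filling_values=-999 would be substituted for any missing entries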
my_array2 = np.genfromtxt('data.out', skip_header=1, filling_values=-999)
print(my_array2)

[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17. 18.
 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36.
 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49.]
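
Because `data.out` has no missing entries, the `filling_values` argument has no visible effect in the cell above. The sketch below makes its effect visible on a hypothetical in-memory CSV, invented for this example, with one header line and one empty field.

```python
import io
import numpy as np

# Hypothetical CSV text: a header line, then two rows, one with a missing field
text = io.StringIO("a,b,c\n1,2,3\n4,,6\n")

# skip_header=1 drops the "a,b,c" line; the empty field becomes -999
arr = np.genfromtxt(text, delimiter=",", skip_header=1, filling_values=-999)
print(arr)
```
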


## For inspecting NumPy arrays

In [90]:
# print the number of `my2DArray`'s dimensions
print(my2DArray.ndim)

2


In [91]:
# print the number of `my2DArray`'s elements
print(my2DArray.size)

12


In [92]:
# print information about `my2DArray`'s memory layout
print(my2DArray.flags)

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False



In [93]:
# print the length of one array element in bytes
print(my2DArray.itemsize)

4


In [94]:
# print the total consumed bytes by `my2DArray`'s elements
print(my2DArray.nbytes)

48


## Broadcasting

Broadcasting is a mechanism that permits NumPy to operate on arrays of different shapes when performing arithmetic operations.

In [95]:
# Rule 1: Two dimensions are compatible if they are equal
# Create an array of two dimensions
A = np.ones((6, 8))
# Shape of A
print(A.shape)

(6, 8)


In [96]:
# Create another array
B = np.random.random((6, 8))
# Shape of B
print(B.shape)

(6, 8)


In [97]:
# Sum of A and B, here the shape of both matrices is the same
print(A+B)

[[1.7958867  1.41641285 1.90175176 1.71091957 1.23535193 1.65803324
  1.12541529 1.86849516]
 [1.97712844 1.88464626 1.75228449 1.99540948 1.92037227 1.36489852
  1.27723293 1.46353278]
 [1.53388308 1.04476231 1.72056748 1.92369766 1.2199479  1.64870417
  1.90196711 1.85319559]
 [1.29381387 1.96027946 1.00564823 1.64019453 1.3832256  1.15191269
  1.85390289 1.9667417 ]
 [1.82120432 1.97905128 1.19619173 1.71698056 1.50972764 1.76002047
  1.25364237 1.05834751]
 [1.14622861 1.41289179 1.82711641 1.21898661 1.80676056 1.4311782
  1.82664148 1.9013497 ]]


In [98]:
# Rule 2: Two dimensions are also compatible when one of the dimensions of the array is 1. 
# Initialize `x`
x = np.ones((3, 4))
print(x)

[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]


In [99]:
# Check shape of `x`
print(x.shape)

(3, 4)


In [100]:
# Initialize `y`
y = np.arange(4)
print(y)

[0 1 2 3]


In [101]:
# Check shape of `y`
print(y.shape)

(4,)


In [102]:
# Subtract `y` from `x`; `y` with shape (4,) is broadcast across the 3 rows of `x`
print(x - y)

[[ 1.  0. -1. -2.]
 [ 1.  0. -1. -2.]
 [ 1.  0. -1. -2.]]


In [103]:
# Rule 3: Arrays can be broadcast together if they are compatible in all dimensions
x = np.ones((6, 8))
y = np.random.random((10, 1, 8))
print(x + y)

[[[1.32668565 1.02621496 1.74643904 1.5336168  1.37989955 1.52257279
   1.86764703 1.01393763]
  [1.32668565 1.02621496 1.74643904 1.5336168  1.37989955 1.52257279
   1.86764703 1.01393763]
  [1.32668565 1.02621496 1.74643904 1.5336168  1.37989955 1.52257279
   1.86764703 1.01393763]
  [1.32668565 1.02621496 1.74643904 1.5336168  1.37989955 1.52257279
   1.86764703 1.01393763]
  [1.32668565 1.02621496 1.74643904 1.5336168  1.37989955 1.52257279
   1.86764703 1.01393763]
  [1.32668565 1.02621496 1.74643904 1.5336168  1.37989955 1.52257279
   1.86764703 1.01393763]]

 [[1.03587112 1.20876435 1.68345108 1.14852377 1.91314512 1.17771014
   1.60302877 1.27108385]
  [1.03587112 1.20876435 1.68345108 1.14852377 1.91314512 1.17771014
   1.60302877 1.27108385]
  [1.03587112 1.20876435 1.68345108 1.14852377 1.91314512 1.17771014
   1.60302877 1.27108385]
  [1.03587112 1.20876435 1.68345108 1.14852377 1.91314512 1.17771014
   1.60302877 1.27108385]
  [1.03587112 1.20876435 1.68345108 1.14852377 1

Why did the above work? It comes down to the following:

The dimensions are compared from the last dimension to the first.

- Compare x's last dimension (8) with y's last dimension (8): They are equal.
- Compare x's second-to-last dimension (6) with y's second-to-last dimension (1): One of them is 1, so broadcasting is possible.
- y has an additional dimension at the front (10) which x lacks, so x's shape is implicitly extended with a new leading dimension of size 1. The result therefore has shape (10, 6, 8).
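
The comparison above can be checked directly. This sketch reuses the same shapes as the cell above and adds one deliberately incompatible shape, (6, 7), which is an assumption chosen for this example, to show what a failure looks like.

```python
import numpy as np

x = np.ones((6, 8))
y = np.random.random((10, 1, 8))

# Shapes are aligned from the right: (1, 6, 8) against (10, 1, 8)
result = x + y
print(result.shape)              # (10, 6, 8)
print(np.broadcast(x, y).shape)  # (10, 6, 8), computed without doing the sum

# Incompatible case: the last dimensions are 8 and 7, and neither is 1
try:
    x + np.ones((6, 7))
    failed = False
except ValueError as err:
    failed = True
    print("broadcast failed:", err)
```

Since y has size 1 in its middle dimension, each of the 10 blocks in the result contains 6 identical rows, which is the repetition visible in the printed output above.
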

## For seeing NumPy mathematics at work

In [104]:
# Basic operations (+, -, *, /, %)
x = np.array([[1, 2, 3], [2, 3, 4]])
y = np.array([[1, 4, 9], [2, 3, -2]])

In [105]:
# Add the two arrays
add = np.add(x, y)
print(add)

[[ 2  6 12]
 [ 4  6  2]]


In [106]:
# Subtract the two arrays
sub = np.subtract(x, y)
print(sub)

[[ 0 -2 -6]
 [ 0  0  6]]


In [107]:
# Multiply the two arrays
mul = np.multiply(x, y)
print(mul)

[[ 1  8 27]
 [ 4  9 -8]]


In [108]:
# Divide the two arrays
div = np.divide(x, y)
print(div)

[[ 1.          0.5         0.33333333]
 [ 1.          1.         -2.        ]]


In [109]:
# Calculate the remainder of x and y
rem = np.remainder(x, y)
print(rem)

[[0 2 3]
 [0 0 0]]


## Create a subset and slice an array using an index

In [110]:
x = np.array([10, 20, 30, 40, 50])

In [111]:
# Select items at index 0 and 1
print(x[0:2])

[10 20]


In [112]:
# Select the items in rows 0 and 1 of column 1 from a 2D array
y = np.array([[1, 2, 3, 4], [9, 10, 11, 12]])
print(y[0:2,1])

[ 2 10]


In [113]:
# Specify a condition: a boolean mask selects the elements greater than or equal to 2
atLeast2 = (y >= 2)
print(y[atLeast2])

[ 2  3  4  9 10 11 12]


# Pandas