# Intro To NumPy / Pandas

# What are NumPy and Pandas?

**NumPy** and **Pandas** are Python libraries that provide computation and data analysis tools. They are among the most widely used Python libraries in data science.


## Why are they useful?

NumPy and Pandas are considered essential Python libraries for scientific computation, including model development for machine learning.

**Machine learning** is the practice of giving systems the ability to learn and improve from experience without being explicitly programmed. A program improves how it handles a task by adjusting its process each time it runs, keeping the changes that move it closer to a goal.

Machine learning is useful for any applications that improve over time. This includes virtual personal assistants (Siri, Alexa, etc.), facial recognition, social media services, and search engine preferences.

# NumPy

## Further Look at NumPy

**Numerical Python (NumPy)** is an open source Python module that provides fast mathematical computation on arrays and matrices, which are an essential part of machine learning.

## Importing
We can import NumPy by typing:

In [0]:
import numpy as np

## NumPy Arrays (ndarray)

NumPy’s main object is a homogeneous multidimensional array, also called an **ndarray**. It is a table of elements that all have the same type, usually numbers.

In NumPy, dimensions are called ***axes***. The number of axes is called the ***rank***. There are multiple ways to create NumPy arrays, seen below:



In [12]:
import numpy as np

''' np.array '''

a = np.array([1, 2, 3])
type(a)

numpy.ndarray

In [4]:
import numpy as np

''' np.ones	'''

np.ones( (3,4), dtype=np.int16 )  

array([[1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]], dtype=int16)

In [14]:
import numpy as np

''' np.full	'''

np.full( (3,4), 0.11 )  

array([[0.11, 0.11, 0.11, 0.11],
       [0.11, 0.11, 0.11, 0.11],
       [0.11, 0.11, 0.11, 0.11]])

In [10]:
import numpy as np

''' np.arange	'''

np.arange( 10, 30, 5 )
# array([10, 15, 20, 25])
 
# un-comment this statement to see a floating point example
# np.arange( 0, 2, 0.3 )             


array([10, 15, 20, 25])

In [7]:
import numpy as np

''' np.linspace	'''

np.linspace(0, 5/3, 6)

array([0.        , 0.33333333, 0.66666667, 1.        , 1.33333333,
       1.66666667])

In [8]:
import numpy as np

''' np.random.rand(2,3) '''

np.random.rand(2,3)

array([[0.31506932, 0.819253  , 0.09879741],
       [0.57947711, 0.8896572 , 0.63779976]])

In [9]:
import numpy as np

''' np.empty((2,3))	'''

np.empty((2,3))   # entries are uninitialized, so the values shown are arbitrary leftovers from memory

array([[0.31506932, 0.819253  , 0.09879741],
       [0.57947711, 0.8896572 , 0.63779976]])

### Array Attributes

These are attributes of NumPy's array object, along with two related functions (arange and reshape), that we can use:

#### ndim

returns the number of dimensions (axes) of the array
 
#### shape

returns a tuple of integers giving the size of the array along each axis

#### size

returns the total number of elements in the NumPy array

#### dtype
returns the type of the elements in the array, e.g., int64 or float64

#### itemsize

returns the size in bytes of each item

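#### arange

a function rather than an attribute: np.arange returns an ndarray of evenly spaced values

#### reshape

a method rather than an attribute: returns the array's data arranged in a new shape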
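
As a quick illustration of the items listed above, here is a minimal sketch. The variable name `arr` and the choice of a 3x4 `int16` array are assumptions made only for this example.

```python
import numpy as np

# Build a 3x4 array of 16-bit integers from an evenly spaced range
arr = np.arange(12, dtype=np.int16).reshape(3, 4)

print(arr.ndim)      # number of axes: 2
print(arr.shape)     # size along each axis: (3, 4)
print(arr.size)      # total number of elements: 12
print(arr.dtype)     # element type: int16
print(arr.itemsize)  # bytes per element: 2
```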

### Array Differences

Some ways in which NumPy arrays differ from regular Python lists are:

1. If you assign a single value to an ndarray slice, it is copied across the whole slice. This makes assignment easier, because a regular Python list would need a loop to set every element.

In [15]:
import numpy as np

a = np.array([1, 2, 5, 7, 8])
a[1:3] = -1
a

array([ 1, -1, -1,  7,  8])

2. ndarray slices are actually views on the same array the slice was taken from. If you modify the slice, you modify the original ndarray as well.


In [16]:
a = np.array([1, 2, 5, 7, 8])
a_slice = a[1:5]
a_slice[1] = 1000
a
# Original array was modified

array([   1,    2, 1000,    7,    8])

> If we need a copy of the NumPy array, we need to use the copy method: another_slice = a[1:5].copy(). If we modify another_slice, 'a' remains the same.



In [18]:
import numpy as np

a = np.array([1, 2, 5, 7, 8])
another_slice = a[1:5].copy()
another_slice[1] = 1000
a
# Original array is unchanged

array([1, 2, 5, 7, 8])

3. Multidimensional NumPy arrays are accessed differently from nested Python lists. The generic format for NumPy multidimensional arrays is:

```
Array[row_start_index:row_end_index, column_start_index: column_end_index]
```
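
The row/column slicing format above can be sketched as follows. This is only an illustration; the names `block` and `nested` are made up for the example.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# Rows 0 and 1 (the end index is exclusive), columns 1 and 2
block = a[0:2, 1:3]
print(block)
# [[1 2]
#  [5 6]]

# A single element: row 2, column 3
print(a[2, 3])        # 11

# The same lookup on a nested Python list needs chained indexing
nested = a.tolist()
print(nested[2][3])   # 11
```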

> NumPy arrays can also be accessed using boolean indexing:

In [25]:
import numpy as np

a = np.arange(12).reshape(3, 4)
print(a)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


In [21]:
import numpy as np

a = np.arange(12).reshape(3, 4)

rows_on = np.array([True, False, True])
a[rows_on , : ]      # Rows 0 and 2, all columns

array([[ 0,  1,  2,  3],
       [ 8,  9, 10, 11]])

### Broadcasting

In general, when NumPy expects arrays of the same shape but finds that this is not the case, it applies the **broadcasting** rules.

[Here](https://cloudxlab.com/blog/wp-content/uploads/2017/12/Screen-Shot-2017-12-13-at-5.57.21-PM.png) are some examples of these broadcasting rules applied.

There are 2 rules of Broadcasting to remember:

1. If the arrays do not have the same rank, a 1 is prepended to the shape of the lower-rank array until the ranks match.

>>  Example: When adding arrays A and B of shapes (3,3) and (3,) [rank 2 and rank 1], a 1 is added to the beginning of B's shape to make it (1,3) [rank 2]. Two dimensions are compatible when they are equal or when one of them is 1.

2. When either of the dimensions compared is one, the other is used. In other words, dimensions with size 1 are stretched or “copied” to match the other.

>>  Example: When adding a 2D array A of shape (3,3) to a 2D ndarray B of shape (1,3), NumPy applies the rule above: it stretches B by replicating its single row 3 times, giving B the shape (3,3), and then performs the operation.
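
The two rules can be seen together in a small sketch. The arrays `A` and `B` below are example values chosen here, and `np.tile` is used only to show what the stretched version of `B` would look like if it were built by hand.

```python
import numpy as np

A = np.arange(9).reshape(3, 3)   # shape (3, 3)
B = np.array([10, 20, 30])       # shape (3,)

# Rule 1: B is treated as if it had shape (1, 3).
# Rule 2: that single row is stretched across the 3 rows of A.
C = A + B
print(C.shape)   # (3, 3)
print(C)
# [[10 21 32]
#  [13 24 35]
#  [16 27 38]]

# Same result as stacking B explicitly three times
explicit = A + np.tile(B, (3, 1))
print(np.array_equal(C, explicit))   # True
```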



### Math with NumPy

NumPy provides a toolkit for doing math on NumPy arrays.

NumPy provides ***basic mathematical and statistical functions*** like mean, min, max, sum, prod, std, var, summation across different axes, transposing of a matrix, etc, while NumPy arrays themselves are capable of performing ***basic operations*** such as addition, subtraction, product, matrix dot product, division, modulo, exponents and conditional operations.
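
A few of these functions and operations are sketched below on a small example matrix `m` (the values are arbitrary and chosen only for illustration).

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

# Statistical functions, over the whole array or along one axis
overall_mean = m.mean()      # 3.5
column_sums = m.sum(axis=0)  # [5 7 9]
row_maxima = m.max(axis=1)   # [3 6]
transposed = m.T             # shape (3, 2)

# Element-wise arithmetic and conditional operations
doubled = m * 2              # every entry multiplied by 2
is_even = m % 2 == 0         # boolean array marking the even entries

# Matrix dot product: (2, 3) @ (3, 2) -> (2, 2)
product = m @ m.T
print(product)
# [[14 32]
#  [32 77]]
```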

NumPy is even able to solve systems of linear equations. If we want to find the values of x and y that satisfy:

```
2x + 6y = 6
5x + 3y = -9
```

We can use ndarrays and built-in NumPy linear algebra functionality, like so:

In [29]:
import numpy as np

coeffs  = np.array([[2, 6], [5, 3]])
depvars = np.array([6, -9])
solution = np.linalg.solve(coeffs, depvars)
solution

array([-3.,  2.])
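
One way to sanity-check a result like this is to substitute it back into the equations. The sketch below repeats the setup so it runs on its own; the name `check` is introduced here for illustration.

```python
import numpy as np

coeffs = np.array([[2, 6], [5, 3]])
depvars = np.array([6, -9])
solution = np.linalg.solve(coeffs, depvars)

# Multiplying the coefficient matrix by the solution should
# reproduce the right-hand side (up to floating point error)
check = coeffs @ solution
print(np.allclose(check, depvars))   # True
```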

# Pandas

## Further Look at Pandas

**Pandas** provides high-performance, easy-to-use data structures and data analysis tools.

Pandas provides a 2D table object called **DataFrame**, which is like a spreadsheet with column names and row labels.

With these 2D tables, Pandas is capable of functionality like creating pivot tables, computing columns based on other columns, and plotting graphs.




## Importing 

Pandas can be imported into Python using:

In [0]:
import pandas as pd

## Objects

#### ***Series*** objects
* 1D array, similar to a column in a spreadsheet

A Pandas Series object is created with the *pd.Series* constructor. Each row has an index label; if no index is given, the rows are labeled with integers starting from 0. Like NumPy, Pandas also provides basic mathematical functionality such as addition, subtraction, conditional operations, and broadcasting.


In [44]:
import pandas as pd

year_series = pd.Series([1984, 1985, 1992],
                   index=["barbara", "alice", "charles"],
                   name="year")
year_series

barbara    1984
alice      1985
charles    1992
Name: year, dtype: int64
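
The operations mentioned above can be sketched on the same Series. The reference year 2020 and the names `years_since`, `born_before_1990`, and `default_index` are assumptions made only for this example.

```python
import pandas as pd

year_series = pd.Series([1984, 1985, 1992],
                        index=["barbara", "alice", "charles"],
                        name="year")

# Broadcasting: the scalar is combined with every element
years_since = 2020 - year_series
print(years_since["alice"])   # 35

# Conditional operation: produces a boolean Series usable as a filter
born_before_1990 = year_series < 1990
print(year_series[born_before_1990].index.tolist())   # ['barbara', 'alice']

# Without an explicit index, rows are labeled 0, 1, 2, ...
default_index = pd.Series([10, 20, 30])
print(default_index.index.tolist())   # [0, 1, 2]
```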

#### ***DataFrame*** objects
* 2D table, similar to a spreadsheet

A Pandas DataFrame object represents a spreadsheet with cell values, column names, and row index labels. A DataFrame can be thought of as a dictionary of Series. Pandas also provides SQL-like functionality to filter and sort rows based on conditions.

Here is an example of a dictionary of Series objects used to create a DataFrame object and display as a spreadsheet:

In [40]:
import pandas as pd

people_dict = { "weight": pd.Series([68, 83, 112],
                                    index=["alice", "barbara", "charles"]),
               "birthyear": pd.Series([1984, 1985, 1992],
                                      index=["barbara", "alice", "charles"],
                                      name="year"),
               "children": pd.Series([0, 3],
                                     index=["charles", "barbara"]),
               "hobby": pd.Series(["Biking", "Dancing"],
                                  index=["alice", "barbara"]),}


people = pd.DataFrame(people_dict)
people


# we can un-comment this line to filter the DataFrame to only display people
# who were born before 1990

# people[people["birthyear"] < 1990]

         weight  birthyear  children    hobby
alice        68       1985       NaN   Biking
barbara      83       1984       3.0  Dancing
charles     112       1992       0.0      NaN


DataFrames can be sorted by a particular column and can have new rows and columns added to them.
DataFrames can also be easily exported to and imported from CSV, Excel, JSON, HTML, and SQL databases.
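
Here is a minimal sketch of sorting, adding a column and a row, and a CSV round trip. The `birth_decade` column and the extra row labeled "dana" are hypothetical additions for illustration, and an in-memory buffer stands in for a real file.

```python
import io
import pandas as pd

people = pd.DataFrame(
    {"weight": [68, 83, 112], "birthyear": [1985, 1984, 1992]},
    index=["alice", "barbara", "charles"],
)

# Sort rows by one column
by_year = people.sort_values(by="birthyear")
print(by_year.index.tolist())   # ['barbara', 'alice', 'charles']

# Add a column computed from another column
people["birth_decade"] = people["birthyear"] // 10 * 10

# Add a row by label (hypothetical person)
people.loc["dana"] = [70, 1990, 1990]

# Round trip through CSV using an in-memory buffer instead of a file
buffer = io.StringIO()
people.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored.shape)   # (4, 3)
```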

Some other essential methods that are present in dataframes are:

* ***head()***: returns the first 5 rows of the DataFrame by default
* ***tail()***: returns the last 5 rows of the DataFrame by default
* ***info()***: prints a summary of the DataFrame (index, column types, non-null counts, memory usage)
* ***describe()***: returns summary statistics (count, mean, std, min, quartiles, max) for each numeric column
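
These methods can be tried on a small example. The DataFrame `df` below is hypothetical data with 10 rows, so that the first and last rows differ.

```python
import pandas as pd

# Hypothetical data: the numbers 0..9 and their squares
df = pd.DataFrame({"n": range(10), "square": [i * i for i in range(10)]})

first_rows = df.head()     # first 5 rows (n = 0..4)
last_rows = df.tail(3)     # an explicit count is also accepted (n = 7..9)
df.info()                  # prints column types, non-null counts, memory usage
stats = df.describe()      # one row per statistic, one column per numeric column

print(stats.loc["mean", "n"])   # 4.5
```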

## Afterward

NumPy and Pandas make matrix manipulation easy, which makes them very useful in machine learning model development. Other libraries often used with NumPy and Pandas, like **TensorFlow**, **Matplotlib**, and **Scikit-learn**, can help build and present models from NumPy arrays and Pandas DataFrames.

# Other Machine Learning Libraries

## TensorFlow

**TensorFlow** is a Python library for dataflow and differentiable programming. It uses data flow graphs to build models and is used for machine learning applications such as neural networks.

https://www.tensorflow.org

## Matplotlib

**Matplotlib** is a plotting library for Python that works directly with NumPy arrays.
The library provides an API for embedding plots into applications using general-purpose GUI toolkits like Tkinter.

https://matplotlib.org

## Scikit-learn

**Scikit-learn** contains efficient tools for machine learning and statistical modeling through classification, regression, clustering and dimensionality reduction.

https://scikit-learn.org/