# Numpy and Pandas

There are many tutorials online for NumPy and Pandas. Search for the ones that you are most comfortable with.

Here is one from [W3Schools](https://www.w3schools.com/python/numpy/default.asp).

In [1]:
import numpy as np
import pandas as pd

## Numpy

NumPy is a Python library used for working with arrays.

It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.

NumPy was created in 2005 by Travis Oliphant. It is an open source project and you can use it freely.

NumPy stands for Numerical Python.

### Why is NumPy Faster Than Lists?

NumPy arrays are stored in one contiguous block of memory, unlike lists, so processes can access and manipulate them very efficiently.

This behavior is called locality of reference in computer science.

This is the main reason why NumPy is faster than lists. It is also optimized to work with the latest CPU architectures.
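
As a rough illustration of the contiguous layout, the sketch below inspects an array's memory flags and strides. The array contents and the explicit 8-byte integer dtype are assumptions chosen for the example.

```python
import numpy as np

# A 2x3 array of 8-byte integers occupies a single 48-byte block of memory.
arr = np.arange(6, dtype=np.int64).reshape(2, 3)

is_contiguous = arr.flags["C_CONTIGUOUS"]  # True: rows are laid out back to back
strides = arr.strides                      # bytes to step along each axis
total_bytes = arr.nbytes                   # 6 elements * 8 bytes each

print(is_contiguous)
print(strides)
print(total_bytes)
```

Moving to the next row means skipping 24 bytes (three 8-byte elements), and moving to the next column means skipping 8 bytes, so neighbouring elements sit next to each other in memory.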

In [2]:
arr = np.array(42)

print(arr)

42


In [20]:
arr = np.array([1, 2, 3, 4, 5])

print(arr)

[1 2 3 4 5]


In [21]:
arr.shape

(5,)

In [17]:
arr.T

array([1, 2, 3, 4, 5])

In [18]:
arr.T.shape

(5,)

Transposing a 1D array leaves it unchanged, as shown above. If you want to convert your 1D vector into a 2D array and then transpose it, index it with np.newaxis (or None; they are the same, but np.newaxis is more readable).

In [32]:
arr_2D = arr[np.newaxis]

In [33]:
arr_2D

array([[1, 2, 3, 4, 5]])

In [34]:
arr_2D.T

array([[1],
       [2],
       [3],
       [4],
       [5]])

In [35]:
arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr)

[[1 2 3]
 [4 5 6]]


In [36]:
arr.shape

(2, 3)

In [37]:
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

print(arr)

[[[1 2 3]
  [4 5 6]]

 [[1 2 3]
  [4 5 6]]]


In [38]:
arr.shape

(2, 2, 3)

How do we know the number of dimensions for a numpy array?

In [39]:
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)

0
1
2
3


In [40]:
d.shape

(2, 2, 3)

### Accessing Numpy Arrays

Array indexing means accessing an array element by referring to its index number.

Indexes in NumPy arrays start at 0: the first element has index 0, the second has index 1, and so on.

In [25]:
arr = np.array([1, 2, 3, 4])

print(arr[0])
print(arr[1])
print(arr[-1])
print(arr[-2])

1
2
4
3


In [26]:
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])

print('2nd element on 1st row: ', arr[0, 1])

2nd element on 1st row:  2


In [28]:
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

print('an element in a 3D array: ', arr[0, 1, 2])

an element in a 3D array:  6


Use negative indexing to access an array from the end.

In [29]:
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])

print('Last element from 2nd dim: ', arr[1, -1])

Last element from 2nd dim:  10


### Slicing Arrays

Slicing in Python means taking elements from one given index up to, but not including, another given index.

We pass a slice instead of an index, like this: [start:end].

We can also define the step, like this: [start:end:step].

In [40]:
arr = np.array([1, 2, 3, 4, 5, 6, 7])

# Slice elements from index 1 to index 5 (not included):
print(arr[1:5])

[2 3 4 5]


In [39]:
# Slice elements from index 4 to the end of the array:
print(arr[4:])

[5 6 7]


In [35]:
# Slice elements from the beginning to index 4 (not included):
print(arr[:4])

[1 2 3 4]


In [36]:
# Slice from index 3 from the end to index 1 from the end (not included):
print(arr[-3:-1])

[5 6]


In [37]:
# Return every other element from index 1 to index 5 (not included):
print(arr[1:5:2])

[2 4]


In [38]:
# Return every other element from the entire array:
print(arr[::2])

[1 3 5 7]


In [41]:
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

# From the second row, slice elements from index 1 to index 4 (not included):
print(arr[1, 1:4])

[7 8 9]


In [42]:
# From both rows, return the element at index 2:
print(arr[0:2, 2])

[3 8]


In [43]:
# From both rows, slice index 1 to index 4 (not included); this returns a 2-D array:
print(arr[0:2, 1:4])

[[2 3 4]
 [7 8 9]]


### Numpy Datatypes

By default, Python has these data types:

- strings - used to represent text data; the text is enclosed in quotation marks. e.g. "ABCD"
- integer - used to represent integer numbers. e.g. -1, -2, -3
- float - used to represent real numbers. e.g. 1.2, 42.42
- boolean - used to represent True or False.
- complex - used to represent complex numbers. e.g. 1.0 + 2.0j, 1.5 + 2.5j

NumPy has some extra data types, and refers to data types with a single character, like i for integers, u for unsigned integers, etc.

Below is a list of the kinds of data types in NumPy and the characters used to represent them.

- i - integer
- b - boolean
- u - unsigned integer
- f - float
- c - complex float
- m - timedelta
- M - datetime
- O - object
- S - byte string
- U - unicode string
- V - fixed chunk of memory for other type (void)
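
As a small sketch of how these characters show up in practice, every dtype exposes its character through the kind attribute. The sample arrays below are made up for illustration.

```python
import numpy as np

# One small sample array per kind of data (values are arbitrary).
samples = {
    "integer": np.array([1, 2, 3]),
    "float": np.array([1.5, 2.5]),
    "boolean": np.array([True, False]),
    "complex": np.array([1 + 2j]),
    "unicode": np.array(["apple"]),
    "bytes": np.array([b"apple"]),
    "datetime": np.array(["2024-01-01"], dtype="datetime64[D]"),
}

# dtype.kind is the single character that identifies the kind of data.
kinds = {name: a.dtype.kind for name, a in samples.items()}

for name, kind in kinds.items():
    print(name, kind)
```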

Get the data type of an array object:

In [44]:
arr = np.array([1, 2, 3, 4])

print(arr.dtype)

int64


In [45]:
arr = np.array(['apple', 'banana', 'cherry'])

print(arr.dtype)

<U6


We use the array() function to create arrays. It takes an optional argument, dtype, that lets us define the expected data type of the array elements:

In [46]:
import numpy as np

arr = np.array([1, 2, 3, 4], dtype='S')

print(arr)
print(arr.dtype)

[b'1' b'2' b'3' b'4']
|S1


Create an array with a 4-byte integer data type:

In [47]:
arr = np.array([1, 2, 3, 4], dtype='i4')

print(arr)
print(arr.dtype)

[1 2 3 4]
int32


### Converting Data Types on Existing Arrays

The best way to change the data type of an existing array is to make a copy of the array with the astype() method.

The astype() method creates a copy of the array and allows you to specify the data type as a parameter.

The data type can be specified using a string, like 'f' for float or 'i' for integer, or you can use the data type directly, like float for float and int for integer.

In [48]:
arr = np.array([1.1, 2.1, 3.1])

newarr = arr.astype('i')

print(newarr)
print(newarr.dtype)

[1 2 3]
int32


In [49]:
# Change data type from float to integer by using int as parameter value:

arr = np.array([1.1, 2.1, 3.1])

newarr = arr.astype(int)

print(newarr)
print(newarr.dtype)

[1 2 3]
int64


In [50]:
# Change data type from integer to boolean:

arr = np.array([1, 0, 3])

newarr = arr.astype(bool)

print(newarr)
print(newarr.dtype)

[ True False  True]
bool
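
One detail worth checking when converting floats to integers is that astype() drops the fractional part rather than rounding. The sketch below uses made-up values to show this; rounding first with np.rint is one possible alternative, not the only one.

```python
import numpy as np

arr = np.array([-1.7, -0.2, 0.9, 2.5])

# astype(int) discards the fractional part (truncation toward zero).
as_int = arr.astype(int)

# Round to the nearest integer first (ties go to the even value), then convert.
rounded = np.rint(arr).astype(int)

print(as_int)
print(rounded)
print(arr.dtype)  # the original array keeps its float dtype
```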


### Copy vs View

The main difference between a copy and a view of an array is that the copy is a new array, and the view is just a view of the original array.

The copy owns the data: any changes made to the copy will not affect the original array, and any changes made to the original array will not affect the copy.

The view does not own the data and any changes made to the view will affect the original array, and any changes made to the original array will affect the view.

In [52]:
# COPY

arr = np.array([1, 2, 3, 4, 5])
x = arr.copy()
arr[0] = 42

print(arr)
print(x)

[42  2  3  4  5]
[1 2 3 4 5]


In [53]:
# VIEWS

arr = np.array([1, 2, 3, 4, 5])
x = arr.view()
arr[0] = 42

print(arr)
print(x)

[42  2  3  4  5]
[42  2  3  4  5]


In [54]:
# Make a view, change the view, and display both arrays:

arr = np.array([1, 2, 3, 4, 5])
x = arr.view()
x[0] = 31

print(arr)
print(x)

[31  2  3  4  5]
[31  2  3  4  5]


Copies own the data and views do not, but how can we check this?

Every NumPy array has the attribute base, which is None if the array owns its data.

Otherwise, the base attribute refers to the original object.



In [56]:
arr = np.array([1, 2, 3, 4, 5])

x = arr.copy()
y = arr.view()

print(x.base)
print(y.base)

None
[1 2 3 4 5]
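
Views also appear without calling view(): basic slicing returns a view, while indexing with a list of positions returns a copy. The sketch below checks this with np.shares_memory; the variable names are made up for the example.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

sliced = arr[1:4]        # basic slicing: a view onto arr's data
picked = arr[[1, 2, 3]]  # indexing with a list of positions: a new copy

sliced[0] = 99           # writes through to arr
picked[1] = -1           # only changes the copy

print(arr)
print(np.shares_memory(arr, sliced))
print(np.shares_memory(arr, picked))
```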


### Shape of an Array

NumPy arrays have an attribute called shape that returns a tuple whose entries give the number of elements along each dimension.

In [57]:
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(arr.shape)

(2, 4)


The integer at each index of the tuple is the number of elements in the corresponding dimension: the array above has 2 elements along the first dimension and 4 along the second.

In the next example, the value at index 4 of the shape tuple is 4, so the 5th (4 + 1) dimension has 4 elements.

In [59]:
# Create an array with 5 dimensions from the vector 1, 2, 3, 4 using ndmin,
# and verify that the last dimension has 4 elements:

arr = np.array([1, 2, 3, 4], ndmin=5)

print(arr)
print('shape of array :', arr.shape)

[[[[[1 2 3 4]]]]]
shape of array : (1, 1, 1, 1, 4)


## Pandas

In [2]:
np.random.seed(0)
df = pd.DataFrame(np.random.choice(10, (3, 5)), columns=list('ABCDE'))
df

Unnamed: 0,A,B,C,D,E
0,5,0,3,3,7
1,9,3,5,2,4
2,7,6,8,8,1


## Updating Value of a Row

Consider this dataframe:

In [5]:
info= {"Num":[12,14,13,12,14,13,15], "NAME":['John','Camili','Rheana','Joseph','Amanti','Alexa','Siri']}
 
data = pd.DataFrame(info)
print("Original Data frame:\n")
print(data)

Original Data frame:

   Num    NAME
0   12    John
1   14  Camili
2   13  Rheana
3   12  Joseph
4   14  Amanti
5   13   Alexa
6   15    Siri


## at()

The pandas at accessor lets us update a single value, identified by its row label and column label.

> dataframe.at[index,'column-name']='new value'

In this example, we pass at the row index 6 and the column ‘NAME’, so the value of the column ‘NAME’ at row index 6 gets updated.

In [6]:
data.at[6,'NAME']='Safa'

In [7]:
data

Unnamed: 0,Num,NAME
0,12,John
1,14,Camili
2,13,Rheana
3,12,Joseph
4,14,Amanti
5,13,Alexa
6,15,Safa


## loc()

The pandas loc indexer can also update values by label: we provide the labels of the rows and the labels of the columns.

> dataframe.loc[row index,['column-names']] = value

We update the rows with index labels 0 to 2 (inclusive), setting the columns ‘Num’ and ‘NAME’ to 100 and 'Python', respectively.

In [8]:
data.loc[0:2,['Num','NAME']] = [100,'Python']

In [9]:
data

Unnamed: 0,Num,NAME
0,100,Python
1,100,Python
2,100,Python
3,12,Joseph
4,14,Amanti
5,13,Alexa
6,15,Safa
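
Note that a loc slice includes its end label, whereas an iloc slice excludes its end position. A minimal sketch on a small made-up frame (the name frame is an assumption for this example):

```python
import pandas as pd

frame = pd.DataFrame({"Num": [12, 14, 13, 12],
                      "NAME": ["John", "Camili", "Rheana", "Joseph"]})

by_label = frame.loc[0:2]      # labels 0, 1 and 2: the end label is included
by_position = frame.iloc[0:2]  # positions 0 and 1: the end position is excluded

print(len(by_label))
print(len(by_position))
```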


## replace() 

Using the replace() method, we can replace every occurrence of a value anywhere in a data frame. We do not need to provide an index or label.

> dataframe.replace("old string", "new string")

In [11]:
data.replace("Siri", 
           "Code", 
           inplace=True)

In [12]:
data

Unnamed: 0,Num,NAME
0,100,Python
1,100,Python
2,100,Python
3,12,Joseph
4,14,Amanti
5,13,Alexa
6,15,Safa


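Because “Siri” was already changed to “Safa” in the at() example, there is nothing left to match, so the dataframe is unchanged here. On the original data, this call would have replaced the word “Siri” with “Code”.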
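
replace() only changes cells whose value matches exactly. A minimal sketch on a small hypothetical frame that does contain "Siri" (the frame names is made up for this example):

```python
import pandas as pd

names = pd.DataFrame({"Num": [13, 15], "NAME": ["Alexa", "Siri"]})

# Without inplace=True, replace() returns a new DataFrame
# and leaves the original untouched.
replaced = names.replace("Siri", "Code")

print(replaced)
print(names)
```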


## iloc()

With the pandas iloc indexer, we can update values by providing the integer positions of the rows and columns.

> dataframe.iloc[index] = value

In this example, we set the first column (‘Num’) to 100 in rows 0, 1, 3 and 6.

In [13]:
data.iloc[[0,1,3,6],[0]] = 100

In [14]:
data

Unnamed: 0,Num,NAME
0,100,Python
1,100,Python
2,100,Python
3,100,Joseph
4,14,Amanti
5,13,Alexa
6,100,Safa


## Updating Values with Slicing

Suppose you would like to select all values in column "B" where the value in column "A" is > 5. Pandas allows you to do this in different ways, some more correct than others. For example,

In [3]:
df[df.A > 5]['B']

1    3
2    6
Name: B, dtype: int64

In [4]:
df.loc[df.A > 5, 'B']

1    3
2    6
Name: B, dtype: int64

These return the same result, so if you are only reading these values, it makes no difference. So, what is the issue? The first form is chained indexing: two indexing operations applied one after the other. The problem with chained assignment is that it is generally difficult to predict whether the first operation returns a view or a copy, so __this largely becomes an issue when you are attempting to assign values back__. Consider how this code is executed by the interpreter:

```
df.loc[df.A > 5, 'B'] = 4
# becomes
df.loc.__setitem__((df.A > 5, 'B'), 4)
```

This is a single ```__setitem__``` call that operates on df itself. On the other hand, consider this code:

```
df[df.A > 5]['B'] = 4
# becomes
df.__getitem__(df.A > 5).__setitem__('B', 4)
```

Now, depending on whether ```__getitem__``` returned a view or a copy, the ```__setitem__``` operation may modify a temporary copy and leave df unchanged. In general, you should use loc for label-based assignment and iloc for integer/positional assignment, as the spec guarantees that they always operate on the original. Additionally, for setting a single cell, you should use at and iat.
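
A minimal sketch of the recommended pattern. The frame is rebuilt here with just the A and B values shown in the earlier output so the block runs on its own; the chained form is deliberately left out because its effect depends on the pandas version.

```python
import pandas as pd

# Columns A and B as displayed earlier in this section.
df = pd.DataFrame({"A": [5, 9, 7], "B": [0, 3, 6]})

# One indexing operation: the assignment is applied to df itself.
df.loc[df.A > 5, "B"] = 4

# For a single cell, at (label-based) or iat (position-based) does the same job.
df.at[0, "B"] = 1

print(df)
```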

## Combining DataFrames

In pandas there are 4 (plus a few special-case) ways to combine data from different frames:

* Merging
* Joining
* Concatenating 
* Appending

Merging and joining are basically redundant, as are concatenating and appending (DataFrame.append was removed in pandas 2.0 in favor of pd.concat).

So today we will be going over Merging and Concatenating in pandas. 

Check out the full documentation [here](http://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html), but be warned it is a bit long :)


Okay let's get started.

### Merge

Merging is for doing complex column-wise combinations of dataframes in a SQL-like way. If you don't know SQL joins, check out this resource: [sql joins](https://www.w3schools.com/sql/sql_join.asp).

To merge, we need two dataframes. Let's make them below:

In [4]:
import pandas as pd
weights_df = pd.DataFrame({"animals": ["cats", "dogs", "chipmunks", "turtles"],
                           "avg_wt": [2.5, 10.0, 1.2, 0.8]})

In [7]:
cost_df = pd.DataFrame({"animals": ["cats", "goldfish", "turtles", "parrots"],
                        "avg_cost": [250, 20, 30, 750]})

In [6]:
weights_df

Unnamed: 0,animals,avg_wt
0,cats,2.5
1,dogs,10.0
2,chipmunks,1.2
3,turtles,0.8


In [8]:
cost_df

Unnamed: 0,animals,avg_cost
0,cats,250
1,goldfish,20
2,turtles,30
3,parrots,750


In [10]:
pd.merge(weights_df, cost_df, on="animals")

Unnamed: 0,animals,avg_wt,avg_cost
0,cats,2.5,250
1,turtles,0.8,30


In [11]:
pd.merge(weights_df, cost_df, on="animals", how="left")

Unnamed: 0,animals,avg_wt,avg_cost
0,cats,2.5,250.0
1,dogs,10.0,
2,chipmunks,1.2,
3,turtles,0.8,30.0


In [12]:
pd.merge(weights_df, cost_df, on="animals", how="right")

Unnamed: 0,animals,avg_wt,avg_cost
0,cats,2.5,250
1,goldfish,,20
2,turtles,0.8,30
3,parrots,,750


In [13]:
pd.merge(weights_df, cost_df, on="animals", how="outer")

Unnamed: 0,animals,avg_wt,avg_cost
0,cats,2.5,250.0
1,dogs,10.0,
2,chipmunks,1.2,
3,turtles,0.8,30.0
4,goldfish,,20.0
5,parrots,,750.0


In [14]:
pd.merge(weights_df, cost_df, on="animals", how="inner")

Unnamed: 0,animals,avg_wt,avg_cost
0,cats,2.5,250
1,turtles,0.8,30
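
To see which side each row of a merge came from, pd.merge accepts indicator=True, which adds a _merge column. A short sketch, rebuilding the two frames from above so the block runs on its own:

```python
import pandas as pd

weights_df = pd.DataFrame({"animals": ["cats", "dogs", "chipmunks", "turtles"],
                           "avg_wt": [2.5, 10.0, 1.2, 0.8]})
cost_df = pd.DataFrame({"animals": ["cats", "goldfish", "turtles", "parrots"],
                        "avg_cost": [250, 20, 30, 750]})

# indicator=True adds a "_merge" column: left_only, right_only or both.
merged = pd.merge(weights_df, cost_df, on="animals", how="outer", indicator=True)

# Count how many rows came from each side.
counts = merged["_merge"].astype(str).value_counts().to_dict()

print(merged)
print(counts)
```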


In [15]:
# if the keys live in the index (e.g. a compound key), call reset_index() first
# so that they become ordinary columns that merge can match on
pd.merge(
    weights_df.reset_index(), 
    cost_df.reset_index(), 
    on="animals"
)

Unnamed: 0,index_x,animals,avg_wt,index_y,avg_cost
0,0,cats,2.5,0,250
1,3,turtles,0.8,2,30


### Concatenation

Concatenating is for combining two or more dataframes either column-wise or row-wise. The problem with concatenate is that the combinations it allows you to do are rather simplistic. That's why we need merge.

Concatenate can take as many dataframes as you want, but it does not match rows on a key column. It simply stacks the frames, aligning them on their column labels (row-wise) or on their index (column-wise), so for a column-wise concat to line up, the dataframes need to share the same index. So no more using columns as keys.

Let's check out basic use below:


In [16]:
pd.concat([weights_df, cost_df], sort=False)

Unnamed: 0,animals,avg_wt,avg_cost
0,cats,2.5,
1,dogs,10.0,
2,chipmunks,1.2,
3,turtles,0.8,
0,cats,,250.0
1,goldfish,,20.0
2,turtles,,30.0
3,parrots,,750.0


In [17]:
# stacking the same dataframe
pd.concat([weights_df, weights_df, cost_df], sort=False)

Unnamed: 0,animals,avg_wt,avg_cost
0,cats,2.5,
1,dogs,10.0,
2,chipmunks,1.2,
3,turtles,0.8,
0,cats,2.5,
1,dogs,10.0,
2,chipmunks,1.2,
3,turtles,0.8,
0,cats,,250.0
1,goldfish,,20.0
2,turtles,,30.0
3,parrots,,750.0


In [18]:
# axis=1 concatenates column-wise, aligning rows on the index
pd.concat([weights_df, cost_df], axis=1)

Unnamed: 0,animals,avg_wt,animals.1,avg_cost
0,cats,2.5,cats,250
1,dogs,10.0,goldfish,20
2,chipmunks,1.2,turtles,30
3,turtles,0.8,parrots,750
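
In the row-wise results above, the index values 0 to 3 repeat. Two options that address this are ignore_index=True, which renumbers the rows, and keys, which labels each source frame. A short sketch, rebuilding the two frames so the block runs on its own (the key names "weights" and "cost" are made up):

```python
import pandas as pd

weights_df = pd.DataFrame({"animals": ["cats", "dogs", "chipmunks", "turtles"],
                           "avg_wt": [2.5, 10.0, 1.2, 0.8]})
cost_df = pd.DataFrame({"animals": ["cats", "goldfish", "turtles", "parrots"],
                        "avg_cost": [250, 20, 30, 750]})

# Renumber the rows 0..7 instead of repeating 0..3 twice.
stacked = pd.concat([weights_df, cost_df], ignore_index=True)

# Keep the original row numbers, but add an outer index level naming the source.
labelled = pd.concat([weights_df, cost_df], keys=["weights", "cost"])

print(stacked)
print(labelled.loc["cost"])
```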


### Join

The join() method adds column(s) from another DataFrame or Series, matching rows on the index.

Syntax:
> dataframe.join(other, on, how, lsuffix, rsuffix, sort)

Return Value: A new DataFrame, with the updated result.

This method does not change the original DataFrame.

In [35]:
weights_df

Unnamed: 0,animals,avg_wt
0,cats,2.5
1,dogs,10.0
2,chipmunks,1.2
3,turtles,0.8


In [36]:
cost_df

Unnamed: 0,animals,avg_cost
0,cats,250
1,goldfish,20
2,turtles,30
3,parrots,750


In [39]:
weights_df.join(cost_df, on=["animals"], how="outer")

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
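
The error arises because join() matches the column named in on against the index of the other dataframe, and cost_df still has its default integer index, so string animal names end up being compared with integers. One possible fix, sketched below, is to make animals the index of both frames before joining; this is an illustration, not the only way to do it.

```python
import pandas as pd

weights_df = pd.DataFrame({"animals": ["cats", "dogs", "chipmunks", "turtles"],
                           "avg_wt": [2.5, 10.0, 1.2, 0.8]})
cost_df = pd.DataFrame({"animals": ["cats", "goldfish", "turtles", "parrots"],
                        "avg_cost": [250, 20, 30, 750]})

# Index both frames by animal name so join() can align them on the index.
joined = weights_df.set_index("animals").join(
    cost_df.set_index("animals"),
    how="outer",
)

print(joined)
```

The result has one row per animal found in either frame, with missing values where an animal appears on only one side.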