<center>
  <a href="./">Previous Page</a> | <a href="./">Content Page</a> | <a href="2.2.Intro-to-pandas_SERIES.ipynb">Next Page</a>
</center>

# 2.1 Introduction to Python Data Analytics

## 2.1.1 NumPy: Numerical Python

NumPy provides an efficient way to store and manipulate multi-dimensional dense arrays in Python.
The important features of NumPy are:

- It provides an ``ndarray`` structure, which allows efficient storage and manipulation of vectors, matrices, and higher-dimensional datasets.
- It provides a readable and efficient syntax for operating on this data, from simple element-wise arithmetic to more complicated linear algebraic operations.

In the simplest case, NumPy arrays look a lot like Python lists.

Here is an array containing the range of numbers 1 to 9 (compare this with Python's built-in ``range()``):

In [1]:
import numpy as np
x = np.arange(1, 10)
x

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [2]:
type(x)

numpy.ndarray

NumPy arrays offer both efficient storage of data and efficient element-wise operations on it.
To square each element of the array, we can apply the ``**`` operator to the array directly:

In [3]:
x ** 2

array([ 1,  4,  9, 16, 25, 36, 49, 64, 81])

Compare this with the more verbose pure-Python ways of getting the same result, first a list comprehension and then an explicit loop:

In [4]:
[val ** 2 for val in range(1, 10)]

[1, 4, 9, 16, 25, 36, 49, 64, 81]

In [5]:
# The same result with an explicit loop
val = []
for i in range(1, 10):
    val.append(i ** 2)
print(val)

[1, 4, 9, 16, 25, 36, 49, 64, 81]
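
The storage side of that claim can be checked directly. The sketch below is an illustration, not a benchmark. It fixes the dtype to ``np.int64`` (an assumption made here so the numbers are the same on every platform) and inspects the array's memory footprint:

```python
import numpy as np

# Fix the dtype explicitly; the default integer type can vary by platform.
x64 = np.arange(1, 10, dtype=np.int64)

print(x64.dtype)     # int64
print(x64.itemsize)  # 8 (bytes per element)
print(x64.nbytes)    # 72 (9 elements * 8 bytes)

# The element-wise result matches the list comprehension value for value.
squares = [val ** 2 for val in range(1, 10)]
print((x64 ** 2).tolist() == squares)  # True
```

Every element has the same fixed-size type, which is what lets NumPy store the data in one contiguous block and apply operations to the whole array at once.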


Unlike a flat Python list, which is a one-dimensional sequence, NumPy arrays can be *multi-dimensional*.
For example, here we will reshape our ``x`` array into a 3x3 array:

In [6]:
M = x.reshape((3, 3))
M

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
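
A short sketch of what the reshape changes, using the same ``x`` and ``M`` as above. The ``-1`` shorthand and the failing ``(2, 4)`` shape are additions for illustration only:

```python
import numpy as np

x = np.arange(1, 10)
M = x.reshape((3, 3))

print(x.ndim, x.shape)  # 1 (9,)
print(M.ndim, M.shape)  # 2 (3, 3)

# -1 asks NumPy to infer that dimension from the element count.
print(x.reshape((3, -1)).shape)  # (3, 3)

# A shape that does not account for all 9 elements is rejected.
try:
    x.reshape((2, 4))
except ValueError as err:
    print("reshape failed:", err)
```

The reshaped array holds the same nine values; only the shape used to index them differs.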

A two-dimensional array is one representation of a matrix, and NumPy knows how to efficiently do typical matrix operations. For example, you can compute the transpose using ``.T``:

In [7]:
M.T

array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])

or a matrix-vector product using ``np.dot``:

In [8]:
np.dot(M, [5, 6, 7])
# row 1: (1*5) + (2*6) + (3*7) =  5 + 12 + 21 =  38
# row 2: (4*5) + (5*6) + (6*7) = 20 + 30 + 42 =  92
# row 3: (7*5) + (8*6) + (9*7) = 35 + 48 + 63 = 146

array([ 38,  92, 146])

In [9]:
M

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [10]:
np.dot([5, 6, 7], M)
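# With the vector on the left, it is multiplied against the columns of M:
# column 1: (5*1) + (6*4) + (7*7) =  5 + 24 + 49 =  78
# column 2: (5*2) + (6*5) + (7*8) = 10 + 30 + 56 =  96
# column 3: (5*3) + (6*6) + (7*9) = 15 + 36 + 63 = 114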

array([ 78,  96, 114])

For more on matrix multiplication, see https://people.rit.edu/pnveme/personal/pigf/Matrices/mat_mult_2.html
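
The two ``np.dot`` products above can also be written with Python's ``@`` operator, which NumPy arrays support for matrix multiplication. A minimal sketch, with the vector stored in a variable ``v`` (a name introduced here for illustration):

```python
import numpy as np

M = np.arange(1, 10).reshape((3, 3))
v = np.array([5, 6, 7])

# Matrix-vector product: each row of M dotted with v.
print(M @ v)  # [ 38  92 146]

# Vector-matrix product: v dotted with each column of M.
print(v @ M)  # [ 78  96 114]

# Putting the vector on the left is the same as transposing M first.
print(np.array_equal(v @ M, M.T @ v))  # True
```

The order of the operands matters: ``M @ v`` and ``v @ M`` give different results unless ``M`` is symmetric.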

## 2.1.2 Pandas: Labeled Column-oriented Data (Another Preview of Pandas)

Pandas is a much *newer package* than NumPy, and is in fact *built on top of it*.

Pandas provides a labeled interface to multi-dimensional data in the form of a ``DataFrame`` object, which will feel very familiar to users of R and related languages.
DataFrames in Pandas look something like this:

In [11]:
import pandas as pd
df = pd.DataFrame({'label': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'value': [1, 2, 3, 4, 5, 6]})
df

  label  value
0     A      1
1     B      2
2     C      3
3     A      4
4     B      5
5     C      6


The Pandas interface allows you to do things like select columns by name:

In [12]:
df["label"]

0    A
1    B
2    C
3    A
4    B
5    C
Name: label, dtype: object
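
A small sketch of what column selection returns, using the same ``df``. The names ``col`` and ``sub`` and the threshold ``3`` are chosen here purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({'label': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'value': [1, 2, 3, 4, 5, 6]})

col = df['label']             # a single name returns a Series
sub = df[['label', 'value']]  # a list of names returns a DataFrame

print(type(col).__name__)  # Series
print(type(sub).__name__)  # DataFrame

# A boolean condition on one column selects the matching rows.
print(df[df['value'] > 3]['label'].tolist())  # ['A', 'B', 'C']
```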

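Apply string operations across string entries. Here the lowercased labels are stored in ``a``, and redisplaying ``df`` shows that the original DataFrame is unchanged:

In [13]:
a = df['label'].str.lower()

In [14]:
df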

  label  value
0     A      1
1     B      2
2     C      3
3     A      4
4     B      5
5     C      6
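
As a hedged illustration of what string operations on a column return, the sketch below uses the ``.str`` accessor on the ``label`` column. The choice of ``lower`` and ``contains`` is an assumption made for this example:

```python
import pandas as pd

df = pd.DataFrame({'label': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'value': [1, 2, 3, 4, 5, 6]})

# .str applies a string method to every entry and returns a new Series.
lowered = df['label'].str.lower()
print(lowered.tolist())  # ['a', 'b', 'c', 'a', 'b', 'c']

# String tests return a boolean Series, which can be used to filter rows.
is_a = df['label'].str.contains('A')
print(df[is_a]['value'].tolist())  # [1, 4]

# The original column is left untouched.
print(df['label'].tolist())  # ['A', 'B', 'C', 'A', 'B', 'C']
```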


Apply aggregates across numerical entries:

In [15]:
df['value'].sum()

21

In [16]:
df['value'].mean()

3.5

And, perhaps most importantly, do efficient database-style joins and groupings:

In [21]:
df.groupby('label').sum()

       value
label
A          5
B          7
C          9


In [22]:
df.groupby('label').mean()

       value
label
A        2.5
B        3.5
C        4.5
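
The cells above cover groupings. For the join side, here is a minimal sketch using ``pd.merge``. The lookup table ``names`` and its contents are hypothetical, invented only for this example:

```python
import pandas as pd

df = pd.DataFrame({'label': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'value': [1, 2, 3, 4, 5, 6]})

# Hypothetical lookup table keyed on the same 'label' column.
names = pd.DataFrame({'label': ['A', 'B', 'C'],
                      'name': ['alpha', 'beta', 'gamma']})

# A left join keeps every row of df and attaches the matching name.
joined = pd.merge(df, names, on='label', how='left')
print(joined['name'].tolist())
# ['alpha', 'beta', 'gamma', 'alpha', 'beta', 'gamma']

# Several aggregates can be computed per group in one call.
totals = df.groupby('label')['value'].agg(['sum', 'mean'])
print(totals['sum'].tolist())   # [5, 7, 9]
print(totals['mean'].tolist())  # [2.5, 3.5, 4.5]
```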

