# Numpy and Pandas Introduction

In this notebook we will take a quick look at two very common, basic packages for data analysis: NumPy and Pandas.

* [Numpy](#numpy)
    * [NDArray](#ndarray)
    * [Arithmetic operations](#aritmetic_operations)
    * [Accessing](#accessing)
    * [Reshaping](#reshaping)
    * [Stacking](#stacking)

* [Pandas](#pandas)
    * [Series](#series)
    * [Dataframe](#dataframe)
        * [Creation](#creation)
        * [Operations](#operations)
        * [Adding](#adding)
        * [Data loading](#data_loading)
        
---

In [1]:
import numpy as np
import pandas as pd
np.random.seed(0)  # seed for reproducibility

## <a name="numpy"></a>Numpy

NumPy ("Numerical Python") is, like SciPy, Scikit-Learn, and Pandas, one of the packages you can't skip when learning data science.  
It provides an array data structure with several advantages over Python lists: it is more compact, faster to read from and write to, and more convenient to work with.

### <a name="ndarray"></a>NDArray
An array object represents a multidimensional, homogeneous array of fixed-size items.

![alt text](http://community.datacamp.com.s3.amazonaws.com/community/production/ckeditor_assets/pictures/332/content_arrays-axes.png)
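
The "homogeneous" part of the definition above can be checked directly. The following is a minimal sketch (the variable names `mixed` and `ints` are only illustrative) showing that NumPy picks one common dtype for all elements, and that an existing array keeps its dtype when a value of another type is assigned into it:

```python
import numpy as np

# Mixing ints and a float: every element is promoted to one common dtype.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)   # float64

# Assigning a float into an integer array does not change the array's dtype;
# the value is converted to an integer (the fractional part is dropped).
ints = np.array([1, 2, 3], dtype=np.int64)
ints[0] = 7.9
print(ints)          # [7 2 3]
```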

A NumPy array always contains data of a single type.  
Creating a numpy array can be done in various ways: 

In [2]:
# hardcoded
my_array = np.array([[1,2,3,4], [5,6,7,8]], dtype=np.int64) 
print(my_array)
print()

# using numpy functions
rand_array_1d = np.random.randint(10, size=6)
print(rand_array_1d)
print()
rand_array_2d = np.random.randint(10, size=(2,4))
print(rand_array_2d)
print()

zeros_array = np.zeros((2,3,4),dtype=np.int16)
print(zeros_array)



print("arrange {}".format(np.arange(10)))

matrix_example = np.arange(15).reshape(3, 5)
print(matrix_example)

float_matrix = np.linspace(0, 2*np.pi, 100) # with floats, it's easier to specify the number of elements than the step size
print(float_matrix)

[[1 2 3 4]
 [5 6 7 8]]
()
[5 0 3 3 7 9]
()
[[3 5 2 4]
 [7 6 8 8]]
()
[[[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]]

 [[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]]]
arrange [0 1 2 3 4 5 6 7 8 9]
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]
[0.         0.06346652 0.12693304 0.19039955 0.25386607 0.31733259
 0.38079911 0.44426563 0.50773215 0.57119866 0.63466518 0.6981317
 0.76159822 0.82506474 0.88853126 0.95199777 1.01546429 1.07893081
 1.14239733 1.20586385 1.26933037 1.33279688 1.3962634  1.45972992
 1.52319644 1.58666296 1.65012947 1.71359599 1.77706251 1.84052903
 1.90399555 1.96746207 2.03092858 2.0943951  2.15786162 2.22132814
 2.28479466 2.34826118 2.41172769 2.47519421 2.53866073 2.60212725
 2.66559377 2.72906028 2.7925268  2.85599332 2.91945984 2.98292636
 3.04639288 3.10985939 3.17332591 3.23679243 3.30025895 3.36372547
 3.42719199 3.4906585  3.55412502 3.61759154 3.68105806 3.74452458
 3.8079911  3.87145761 3.93492413 3.99839065 4.06185717 4.12532369
 4.1887902  4.25225672 4.31572324 

An NDArray has a few basic properties:

In [3]:
print(zeros_array)
print("zeros_array ndim: ", zeros_array.ndim)
print("zeros_array shape:", zeros_array.shape)
print("zeros_array size: ", zeros_array.size)
print("dtype:", zeros_array.dtype)

[[[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]]

 [[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]]]
('zeros_array ndim: ', 3)
('zeros_array shape:', (2L, 3L, 4L))
('zeros_array size: ', 24)
('dtype:', dtype('int16'))


### <a name="aritmetic_operations"></a>Arithmetic Operations

Arithmetic operators on numpy arrays apply elementwise.  
A new array is created and filled with the result.

In [4]:
a = np.array( [20,30,40,50] )
b = np.arange(4)
print(a,b)

(array([20, 30, 40, 50]), array([0, 1, 2, 3]))


In [5]:
a = np.array( [20,30,40,50] )
b = np.arange(4)
print(a,b)

c = a-b
print(c)

d = 10*np.sin(a)
print(d)

(array([20, 30, 40, 50]), array([0, 1, 2, 3]))
[20 29 38 47]
[ 9.12945251 -9.88031624  7.4511316  -2.62374854]
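
To make the "a new array is created" point concrete, here is a small sketch contrasting the ordinary operator with its in-place counterpart. The name `alias` is only illustrative, and `np.shares_memory` is used purely as a check:

```python
import numpy as np

a = np.array([20, 30, 40, 50])
b = np.arange(4)

c = a - b                        # allocates a new array for the result
print(np.shares_memory(a, c))    # False: c has its own buffer

alias = a                        # a second name for the same array
a -= b                           # in-place variant: writes into a's buffer
print(alias)                     # [20 29 38 47], changed along with a
```

The in-place form can save memory on large arrays, at the cost of modifying every name that refers to the same data.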


Unlike in many matrix languages, the product operator * operates elementwise on NumPy arrays.  
The matrix product can be performed using the dot function or method:

 ![alt text](https://www.mathsisfun.com/algebra/images/matrix-multiply-a.svg)

In [6]:
A = np.array( [[1,1],[0,1]] )
B = np.array( [[2,0],[3,4]] )

print("A*B:")
print(A*B) # elementwise product

print("A.dot(B)")
print(A.dot(B)) # matrix product

print("np dot")
print(np.dot(A, B)) # another matrix product

A*B:
[[2 0]
 [0 4]]
A.dot(B)
[[5 4]
 [3 4]]
np dot
[[5 4]
 [3 4]]
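
On Python 3.5 and later, the `@` operator is another way to write the matrix product. This short sketch reuses the same two matrices and checks that it agrees with `np.dot`:

```python
import numpy as np

A = np.array([[1, 1], [0, 1]])
B = np.array([[2, 0], [3, 4]])

product = A @ B                               # matrix product, like A.dot(B)
print(product)
print(np.array_equal(product, np.dot(A, B)))  # True
```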


Many unary operations, such as computing the sum of all the elements in the array, are implemented as methods of the ndarray class.

In [7]:
a = np.random.random((2,3))
print(a)
print("The sum of the entire matrix is: {}".format(a.sum()))

[[0.27265629 0.47766512 0.81216873]
 [0.47997717 0.3927848  0.83607876]]
The sum of the entire matrix is: 3.27133087269


By specifying the axis parameter you can apply an operation along the specified axis of an array:

In [8]:
print(a.sum(axis=0))

[0.75263347 0.87044991 1.64824749]
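
Because the random values above make the effect of `axis` hard to verify by eye, here is the same idea on a small fixed array (the array `m` is an illustrative assumption, not part of the original example):

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = m.sum(axis=0)   # collapses the rows: one sum per column
row_sums = m.sum(axis=1)   # collapses the columns: one sum per row
row_max = m.max(axis=1)    # the same parameter works for other reductions

print(col_sums)   # [5 7 9]
print(row_sums)   # [ 6 15]
print(row_max)    # [3 6]
```

A useful way to remember it: the axis you pass is the one that disappears from the result's shape.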


###  <a name="accessing"></a>Accessing

A general paradigm to slice an array into subarrays is: **x[start:stop:step]**

In [9]:
x = np.arange(10)
print(x)
print(x[:5])  # first five elements
print(x[5:])  # elements from index 5 onward
print(x[4:7])  # middle sub-array
print(x[::2])  # every other element
print(x[1::2])  # every other element, starting at index 1
print(x[::-1])  # all elements, reversed
print(x>4) # return array of booleans
print(x[x>4]) # use the boolean array to select the data
x=x[x>4]

[0 1 2 3 4 5 6 7 8 9]
[0 1 2 3 4]
[5 6 7 8 9]
[4 5 6]
[0 2 4 6 8]
[1 3 5 7 9]
[9 8 7 6 5 4 3 2 1 0]
[False False False False False  True  True  True  True  True]
[5 6 7 8 9]


In [10]:
x = np.arange(10)
print(x>4)

print(x[x>4])

[False False False False False  True  True  True  True  True]
[5 6 7 8 9]


In [11]:
x = np.arange(10)

print(x>4)
x[x>4]

[False False False False False  True  True  True  True  True]


array([5, 6, 7, 8, 9])
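
One detail worth knowing about the two access styles above: a basic slice is a view onto the original data, while boolean indexing returns an independent copy. The sketch below illustrates the difference (the names `view` and `picked` are only illustrative):

```python
import numpy as np

x = np.arange(10)

view = x[2:5]        # basic slice: shares memory with x
view[0] = 99
print(x[2])          # 99, the change is visible through x

picked = x[x > 4]    # boolean indexing: independent copy
picked[:] = -1
print(x)             # x is unaffected by the write to picked
```

If a slice needs to be modified without touching the original, calling `.copy()` on it is the usual approach.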

For a multi-dimensional array, you give one slice per axis, separated by commas:

In [12]:
a = np.random.random((4,5,3))
print(a)
print()

print(a[0:2,2:4,1:2])

[[[0.33739616 0.64817187 0.36824154]
  [0.95715516 0.14035078 0.87008726]
  [0.47360805 0.80091075 0.52047748]
  [0.67887953 0.72063265 0.58201979]
  [0.53737323 0.75861562 0.10590761]]

 [[0.47360042 0.18633234 0.73691818]
  [0.21655035 0.13521817 0.32414101]
  [0.14967487 0.22232139 0.38648898]
  [0.90259848 0.44994999 0.61306346]
  [0.90234858 0.09928035 0.96980907]]

 [[0.65314004 0.17090959 0.35815217]
  [0.75068614 0.60783067 0.32504723]
  [0.03842543 0.63427406 0.95894927]
  [0.65279032 0.63505887 0.99529957]
  [0.58185033 0.41436859 0.4746975 ]]

 [[0.6235101  0.33800761 0.67475232]
  [0.31720174 0.77834548 0.94957105]
  [0.66252687 0.01357164 0.6228461 ]
  [0.67365963 0.971945   0.87819347]
  [0.50962438 0.05571469 0.45115921]]]
()
[[[0.80091075]
  [0.72063265]]

 [[0.22232139]
  [0.44994999]]]
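
The shapes that result from multi-dimensional indexing are easier to follow on a small deterministic array. This is a sketch with an assumed 2x3x4 array; note how an integer index removes an axis while a slice keeps it:

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)   # small, deterministic 3-D array

print(a[0, 1, 2])            # a single element: 6
print(a[0].shape)            # integer index drops that axis: (3, 4)
print(a[0:1].shape)          # a slice keeps the axis: (1, 3, 4)
print(a[:, 1:3, 1:2].shape)  # one slice per axis: (2, 2, 1)
```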


To iterate over an array:

In [13]:
for row in a: # iterating over the first axis (each item is a 5x3 sub-array)
    print(row)
    
for element in a.flat: # iterating over each element
    print(element)

[[0.33739616 0.64817187 0.36824154]
 [0.95715516 0.14035078 0.87008726]
 [0.47360805 0.80091075 0.52047748]
 [0.67887953 0.72063265 0.58201979]
 [0.53737323 0.75861562 0.10590761]]
[[0.47360042 0.18633234 0.73691818]
 [0.21655035 0.13521817 0.32414101]
 [0.14967487 0.22232139 0.38648898]
 [0.90259848 0.44994999 0.61306346]
 [0.90234858 0.09928035 0.96980907]]
[[0.65314004 0.17090959 0.35815217]
 [0.75068614 0.60783067 0.32504723]
 [0.03842543 0.63427406 0.95894927]
 [0.65279032 0.63505887 0.99529957]
 [0.58185033 0.41436859 0.4746975 ]]
[[0.6235101  0.33800761 0.67475232]
 [0.31720174 0.77834548 0.94957105]
 [0.66252687 0.01357164 0.6228461 ]
 [0.67365963 0.971945   0.87819347]
 [0.50962438 0.05571469 0.45115921]]
0.3373961604172684
0.6481718720511972
0.36824153984054797
0.9571551589530464
0.14035078041264515
0.8700872583584364
0.4736080452737105
0.8009107519796442
0.5204774795512048
0.6788795301189603
0.7206326547259168
0.5820197920751071
0.5373732294490107
0.7586156243223572
0.105907

###  <a name="reshaping"></a>Reshaping
Reshaping means changing the dimensions of an ndarray without changing its data.

In [14]:
a = np.floor(10*np.random.random((3,4)))
print(a)

print(a.ravel()) # returns the array, flattened

print(a.reshape(6,2))  # returns the array with a modified shape

[[0. 4. 9. 3.]
 [4. 6. 8. 9.]
 [2. 5. 8. 5.]]
[0. 4. 9. 3. 4. 6. 8. 9. 2. 5. 8. 5.]
[[0. 4.]
 [9. 3.]
 [4. 6.]
 [8. 9.]
 [2. 5.]
 [8. 5.]]
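
Two related conveniences, sketched here on an assumed 3x4 array: passing `-1` lets NumPy infer one dimension from the total size, and a target shape that does not match the number of elements raises a `ValueError`:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

inferred = a.reshape(2, -1)     # -1 lets NumPy work out the missing size
flat = a.reshape(-1)            # a single -1 flattens to 1-D
print(inferred.shape)           # (2, 6)
print(flat.shape)               # (12,)
print(a.T.shape)                # transpose swaps the axes: (4, 3)

try:
    a.reshape(5, 2)             # 10 slots cannot hold 12 elements
    failed = False
except ValueError as err:
    failed = True
    print("reshape failed:", err)
```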


###  <a name="stacking"></a>Stacking
Stacking means combining two arrays together (vertically or horizontally).

In [15]:
a = np.floor(10*np.random.random((2,2)))
b = np.floor(10*np.random.random((2,2)))
print(a)
print(b)

print(np.vstack((a,b))) # vertical stacking
print(np.hstack((a,b))) # horizontal stacking

[[9. 9.]
 [0. 2.]]
[[0. 8.]
 [6. 8.]]
[[9. 9.]
 [0. 2.]
 [0. 8.]
 [6. 8.]]
[[9. 9. 0. 8.]
 [0. 2. 6. 8.]]
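
For 2-D arrays, the same results can be obtained with `np.concatenate` and an explicit `axis`, which some readers find easier to generalize. A minimal sketch, using fixed integer arrays in place of the random ones above:

```python
import numpy as np

a = np.array([[9, 9], [0, 2]])
b = np.array([[0, 8], [6, 8]])

rows = np.concatenate((a, b), axis=0)   # matches np.vstack((a, b)) for 2-D input
cols = np.concatenate((a, b), axis=1)   # matches np.hstack((a, b)) for 2-D input

print(rows.shape, cols.shape)                    # (4, 2) (2, 4)
print(np.array_equal(rows, np.vstack((a, b))))   # True
print(np.array_equal(cols, np.hstack((a, b))))   # True
```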


## <a name="pandas"></a>Pandas
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

###  <a name="series"></a>Series
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).  
The axis labels are collectively referred to as the index.  
The basic method to create a Series is to call: **s = pd.Series(data, index=index)**.  
A Series acts very similarly to an ndarray, and is a valid argument to most NumPy functions.  
A Series can be created from different sources.

In [16]:
# creating series from ndarray
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)

# creating series from dictionary
d = {'b' : 1, 'a' : 0, 'c' : 2}
pd.Series(d) 

a   -0.650913
b   -1.498740
c   -1.230635
d    0.194007
e   -0.998382
dtype: float64


a    0
b    1
c    2
dtype: int64
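
When a dict is combined with an explicit `index`, the index decides which labels appear and in what order. The sketch below (the label `'d'` is deliberately absent from the dict) shows that missing labels are filled with NaN, and that a scalar is repeated for every label:

```python
import numpy as np
import pandas as pd

d = {'b': 1, 'a': 0, 'c': 2}

s = pd.Series(d, index=['b', 'c', 'd', 'a'])
print(s)
# 'd' is not a key of the dict, so its value is NaN
# and the dtype becomes float64 to accommodate it

scalar = pd.Series(5.0, index=['a', 'b', 'c'])  # a scalar is repeated for each label
print(scalar)
```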

A Series is like a fixed-size dict in that you can get and set values by index label:

In [17]:
print(s.iloc[0])  # positional access; plain s[0] on a labeled index is deprecated in recent pandas

print(s[:3])

print(s['a'])

print(s.get('f'))

-0.6509128721362502
a   -0.650913
b   -1.498740
c   -1.230635
dtype: float64
-0.6509128721362502
None
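
The dict-like behaviour also covers setting values and membership tests. A small sketch on an assumed three-element Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

s['b'] = 20.0                      # set a value by its label
print(s['b'])                      # 20.0
print('b' in s)                    # True: membership tests the index labels
print('f' in s)                    # False
missing = s.get('f', np.nan)       # .get returns a default instead of raising KeyError
print(missing)
```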


Series can also have a name attribute:

In [18]:
s = pd.Series(np.random.randn(5), name='something')
print(s)

0   -0.367638
1    1.737199
2    0.593613
3   -0.542364
4   -1.719672
Name: something, dtype: float64


###  <a name="dataframe"></a>DataFrame
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.  
You can think of it like a spreadsheet or SQL table, or a dict of Series objects.  
It is generally the most commonly used pandas object.  
Like Series, DataFrame accepts many different kinds of inputs:
* Dict of 1D ndarrays, lists, dicts, or Series
* 2-D numpy.ndarray
* Structured or record ndarray
* A Series
* Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments.  
If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame.  
Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.
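
The rule about a dict of Series plus a specific index can be sketched as follows. The two Series here are illustrative; `'two'` has one more label than `'one'`:

```python
import pandas as pd

d = {
    'one': pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c']),
    'two': pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd']),
}

full = pd.DataFrame(d)                            # index is the union: a, b, c, d
subset = pd.DataFrame(d, index=['d', 'b', 'a'])   # row 'c' is discarded

print(full)
print(subset)
```

In `full`, the cell for row `'d'` in column `'one'` is NaN because that Series has no such label.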

![Pandas DataFrame Structure](../Images/dataframe.jpg)

####  <a name="creation"></a>Creation

In [19]:
# from a dict of lists
d = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.]}
d_df1 = pd.DataFrame(d)
print(d_df1)


d_df2 = pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
print(d_df2)

# from list of dicts
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
d_df3 = pd.DataFrame(data2)
print(d_df3)
d_df4 = pd.DataFrame(data2, index=['first', 'second'])
print(d_df4)
d_df5 = pd.DataFrame(data2, columns=['a', 'b'])
print(d_df5)


print(d_df2)

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
   a   b     c
0  1   2   NaN
1  5  10  20.0
        a   b     c
first   1   2   NaN
second  5  10  20.0
   a   b
0  1   2
1  5  10
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0


In [20]:
d_df4
d_df4.iloc[1]
#d_df2.iloc[3]

a     5.0
b    10.0
c    20.0
Name: second, dtype: float64

####  <a name="operations"></a>Operations

![Pandas DataFrame operations](../Images/slicing.png)

In [21]:
d_df2.two

a    4.0
b    3.0
c    2.0
d    1.0
Name: two, dtype: float64

Select a specific column.

In [22]:


print(d_df2)
print(d_df2.index)
d_df2['one'] # will return a series

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
Index([u'a', u'b', u'c', u'd'], dtype='object')


a    1.0
b    2.0
c    3.0
d    4.0
Name: one, dtype: float64

In [23]:
d_df2[['one']] # will return a dataframe

   one
a  1.0
b  2.0
c  3.0
d  4.0


A column can also be accessed as an attribute, which allows tab auto-completion of its name.

In [24]:
d_df2.one

a    1.0
b    2.0
c    3.0
d    4.0
Name: one, dtype: float64

Select rows with loc (by index label).

In [25]:
d_df2.loc['a']

one    1.0
two    4.0
Name: a, dtype: float64

Select rows with iloc (by integer position).

In [26]:
d_df2.iloc[1:3]

   one  two
b  2.0  3.0
c  3.0  2.0


Simple slicing.

In [27]:
d_df2[1:3]

   one  two
b  2.0  3.0
c  3.0  2.0


This shorthand is not deprecated, and it is completely up to you whether to use it.  
But I strongly prefer not to select rows this way, as it can be ambiguous, especially if your index contains integers.

Using .iloc and .loc is explicit and clearly tells the person reading the code what is going to happen.
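
The ambiguity is easiest to see with an integer index that does not match the row positions. In this sketch the index `[2, 1, 0]` is an assumption chosen to make labels and positions disagree:

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30]}, index=[2, 1, 0])

by_label = df.loc[0, 'x']        # label 0 is the last row
by_position = df.iloc[0, 0]      # position 0 is the first row
print(by_label, by_position)     # 30 10

label_slice = df.loc[2:1]        # label slice: the end label is included
position_slice = df.iloc[0:1]    # positional slice: the end is excluded
print(label_slice)
print(position_slice)
```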

Select by boolean condition.

In [28]:
d_df2[d_df2.one == 2.]

   one  two
b  2.0  3.0
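
Conditions can be combined with `&` (and), `|` (or), and `~` (not); each condition needs its own parentheses because these operators bind more tightly than comparisons. A sketch on a DataFrame with the same values as `d_df2` (rebuilt here as `df` so the block runs on its own):

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]},
                  index=['a', 'b', 'c', 'd'])

both = df[(df.one > 1) & (df.two > 1)]       # AND: rows b and c
either = df[(df.one == 1) | (df.two == 1)]   # OR: rows a and d
negated = df[~(df.one > 2)]                  # NOT: rows a and b
listed = df[df.one.isin([2., 4.])]           # membership: rows b and d

print(both)
```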


####  <a name="adding"></a>Adding

Adding a new column.

In [29]:
d_df2['new_col']=5
d_df2.head()

   one  two  new_col
a  1.0  4.0        5
b  2.0  3.0        5
c  3.0  2.0        5
d  4.0  1.0        5
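
A new column does not have to be a constant; it can be computed from existing columns. This sketch (the column names `total` and `ratio` are illustrative) also shows `assign`, which returns a new DataFrame instead of modifying the original:

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]},
                  index=['a', 'b', 'c', 'd'])

df['total'] = df['one'] + df['two']                   # elementwise, aligned on the index
with_ratio = df.assign(ratio=df['one'] / df['two'])   # returns a new DataFrame

print(df)
print(with_ratio.columns.tolist())   # ['one', 'two', 'total', 'ratio']
print(df.columns.tolist())           # ['one', 'two', 'total']
```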


Adding a new row.

In [30]:
df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
for i in range(5):
    df.loc[i] = [np.random.randint(-1,1) for n in range(3)]
df.head()

  lib qty1 qty2
0  -1    0    0
1  -1   -1   -1
2  -1   -1    0
3  -1    0   -1
4  -1   -1    0
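
Assigning rows one at a time with `loc` works, but when several rows are known up front it is usually simpler to collect them and append them in one step with `pd.concat`. A minimal sketch, with made-up values for the same three columns:

```python
import pandas as pd

df = pd.DataFrame({'lib': [1, 2], 'qty1': [10, 20], 'qty2': [100, 200]})

# Hypothetical new rows, collected first as a list of dicts.
new_rows = pd.DataFrame([
    {'lib': 3, 'qty1': 30, 'qty2': 300},
    {'lib': 4, 'qty1': 40, 'qty2': 400},
])

df = pd.concat([df, new_rows], ignore_index=True)  # one concat instead of a loop
print(df)
```

`ignore_index=True` renumbers the rows 0 to 3 instead of keeping each piece's own index.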


####  <a name="data_loading"></a>Data Loading

Reading data from a CSV file.

In [31]:
# df = pd.read_csv("../Data/diamonds.csv")
df.shape
df.head(10)
#df.tail(2)

  lib qty1 qty2
0  -1    0    0
1  -1   -1   -1
2  -1   -1    0
3  -1    0   -1
4  -1   -1    0
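
Since the `read_csv` call above is commented out, here is a self-contained sketch of what loading looks like. The CSV text and its column names are hypothetical stand-ins for a real file; `io.StringIO` lets `read_csv` read from a string as if it were a file:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """carat,cut,price
0.23,Ideal,326
0.21,Premium,326
0.29,Good,334
"""

loaded = pd.read_csv(io.StringIO(csv_text))  # the first line becomes the column names
print(loaded.shape)     # (3, 3)
print(loaded.dtypes)    # one inferred dtype per column
print(loaded.head(2))
```

With a real file, the path string is passed directly in place of the `StringIO` object.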


View the index of the dataframe.

In [32]:
df.index

Int64Index([0, 1, 2, 3, 4], dtype='int64')

View the column list of the dataframe.

In [33]:
df.columns

Index([u'lib', u'qty1', u'qty2'], dtype='object')

View all of the dataframe values.

In [34]:
df.values

array([[-1L, 0L, 0L],
       [-1L, -1L, -1L],
       [-1L, -1L, 0L],
       [-1L, 0L, -1L],
       [-1L, -1L, 0L]], dtype=object)