**Introduction to NumPy**

NumPy (short for Numerical Python) is the fundamental package for scientific computing and data analysis in Python. Many other packages, such as pandas and statsmodels, are built on top of it, so it is an important package to learn. At the heart of NumPy is a data structure called **ndarray**, a multi-dimensional array built specifically for numerical data analysis. Python has its own array capabilities, but they are more generic. The advantage of ndarray is that operations on whole arrays run in compiled code, which makes processing fast and memory-efficient.

For a full listing of NumPy features, please visit http://wiki.scipy.org/Numpy_Example_List .

Possible applications of the NumPy package are:

+ Algorithmic operations such as sorting, grouping and set operations
+ Performing repetitive operations on whole arrays of data without using loops
+ Data merging and alignment operations
+ Data indexing, filtering, and transformation on individual elements or whole arrays
+ Data summarization and descriptive statistics
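
A few of the operations listed above can be sketched with small arrays. The array names and values below are made up for illustration and do not come from a real dataset.

```python
import numpy as np

# Hypothetical sample data, for illustration only.
scores = np.array([7, 3, 9, 3, 5, 9])
other = np.array([9, 1, 3])

sorted_scores = np.sort(scores)          # sorting
unique_scores = np.unique(scores)        # distinct values, in sorted order
common = np.intersect1d(scores, other)   # set intersection of two arrays
doubled = scores * 2                     # whole-array arithmetic, no loop
large = scores[scores > 4]               # filtering with a boolean condition
mean_score = scores.mean()               # a descriptive statistic

print(sorted_scores)   # [3 3 5 7 9 9]
print(unique_scores)   # [3 5 7 9]
print(common)          # [3 9]
print(doubled)
print(large)
print(mean_score)      # 6.0
```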

**Installing NumPy**

To check whether NumPy is installed, go to Package Manager and type NumPy. You will get a list of packages whose names closely match NumPy. For our purposes, focus on the package named numpy 1.xx. If the package is not installed, click Install.

**Importing NumPy**

To use NumPy, first import it with an import statement:

In [13]:
import numpy as np

The above statement makes all of NumPy available under the name np, which is fine for starters. If you only need a few functions, you can bind those names directly. Note that this does not save memory, because Python still loads the whole package; it only changes which names are available in your workspace:

In [2]:
from numpy import array

**ndarray**

The most important data structure in NumPy is the n-dimensional array object, ndarray. With ndarray you can store large multidimensional datasets in Python. Because it is an array, you can perform mathematical operations either one element at a time or on complete arrays without using loops. An array object is initialized as follows:

In [4]:
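a = array((1,2,3,4,5))       # initializes an array a and assigns values to it
b = array((10,20,30,40,50))  # initializes another array b
print(a)
print(b)
print(a+b)   # element-wise sum of the two arrays
print(a+5)   # the scalar 5 is added to every element
print(a**2)  # every element is squared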

[1 2 3 4 5]
[10 20 30 40 50]
[11 22 33 44 55]
[ 6  7  8  9 10]
[ 1  4  9 16 25]
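
To make the "without using loops" point concrete, here is a small sketch comparing an explicit Python loop with the equivalent whole-array expression. It reuses the values of a and b from the cell above; the names loop_sum and vector_sum are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])

# Explicit loop over plain Python lists, one pair of elements at a time.
loop_sum = [x + y for x, y in zip(a.tolist(), b.tolist())]

# Vectorized form: one expression, applied element by element by NumPy.
vector_sum = a + b

print(loop_sum)     # [11, 22, 33, 44, 55]
print(vector_sum)   # [11 22 33 44 55]
```

Both forms give the same numbers; the vectorized form is shorter and, on large arrays, usually much faster.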


In [5]:
c = np.arange(15)                    # integers 0 through 14; arange works as a sequence or counter
anarray = np.arange(1,15,2)          # from 1 up to (not including) 15, in steps of 2
onemorearray = np.linspace(1,10,15)  # 15 evenly spaced values from 1 to 10, inclusive
print(c)
print(anarray)
print(onemorearray)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
[ 1  3  5  7  9 11 13]
[ 1.          1.64285714  2.28571429  2.92857143  3.57142857  4.21428571
  4.85714286  5.5         6.14285714  6.78571429  7.42857143  8.07142857
  8.71428571  9.35714286 10.        ]


Each ndarray has attributes that describe it, including shape, dtype, and size. The shape gives the length of the array along each dimension (rows and columns for a two-dimensional array), dtype gives the data type of the elements, and size gives the total number of elements.

In [12]:
data = np.array((32,45,123,756,23,2123))
print(data.shape)
print(data.dtype)
print(data.size)

(6,)
int32
6


In [3]:
import numpy as np
data2 = [[1,2,3,4],[5,6,7,8]]
arr2 = np.array(data2)
print(arr2)
print (arr2.shape)
arr2.size

[[1 2 3 4]
 [5 6 7 8]]
(2, 4)


8
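
As a small sketch of how these attributes relate, the example below reshapes a one-dimensional array into two dimensions and converts its element type. The variable names are illustrative.

```python
import numpy as np

flat = np.arange(8)                  # 0..7 in one dimension
grid = flat.reshape(2, 4)            # same elements arranged as 2 rows x 4 columns
as_float = grid.astype(np.float64)   # a copy with a different element type

print(flat.shape, flat.ndim)             # (8,) 1
print(grid.shape, grid.ndim, grid.size)  # (2, 4) 2 8
print(as_float.dtype)                    # float64
```

Reshaping changes shape and ndim but never size, since the number of elements stays the same.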

In [4]:
np.zeros(50)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [5]:
np.zeros((3,5))

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [6]:
np.ones(30)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [7]:
np.ones((5,9))

array([[1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1.]])

In [13]:
np.eye(5) # creates a 5*5 identity matrix. 

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [14]:
np.diag(array([1,3,5,3,4,5]))

array([[1, 0, 0, 0, 0, 0],
       [0, 3, 0, 0, 0, 0],
       [0, 0, 5, 0, 0, 0],
       [0, 0, 0, 3, 0, 0],
       [0, 0, 0, 0, 4, 0],
       [0, 0, 0, 0, 0, 5]])

**pandas**

pandas is the primary package for performing data analysis tasks in Python. pandas derives its name from panel data and provides relational data structures (think Excel or SQL tables) along with a host of capabilities for working with them. It is the most widely used package in Python for data analysis tasks, and works well with cross-sectional, time series, and panel data. pandas sits on top of NumPy and can be used with NumPy arrays and NumPy functions. pandas suits a researcher’s needs in several ways:

+ Has a tabular data structure that can hold both homogeneous and heterogeneous data.
+ Very good indexing capabilities that make data alignment and merging easy.
+ Good time series functionality. No need to use different data structures for time series and cross-sectional data. Allows for both ordered and unordered time-series data.
+ A host of statistical functions developed around NumPy and pandas that make a researcher’s task easy and fast.
+ Programming is a lot simpler and faster.
+ Easily handles data manipulation and cleaning.
+ Easy to expand and shorten data sets. Comprehensive merge, join, and group-by functionality to combine multiple data sets.

**Installing pandas** 

To check whether pandas is installed, go to Package Manager and type pandas. pandas comes preinstalled with the Canopy distribution. If the package is not installed, click Install.

**Importing pandas**

To use pandas, first import it with an import statement:


In [8]:
import pandas as pd #this will import pandas into your workspace

In [9]:
import numpy as np  #we will be using numpy functions so import numpy

**Data Structures in pandas**

There are two basic data structures in pandas: Series and DataFrame

**Series:** It is similar to a NumPy 1-dimensional array. In addition to the values specified by the programmer, pandas attaches a label to each value. If the programmer does not provide labels, pandas assigns them (0 for the first element, 1 for the second element, and so on). A benefit of labeling data values is that manipulating the dataset becomes easier, because the whole dataset behaves much like a dictionary in which each value is associated with a label.


In [10]:
series1 = pd.Series([10,20,30,40])
series1

0    10
1    20
2    30
3    40
dtype: int64

In [11]:
series1.values

array([10, 20, 30, 40], dtype=int64)

In [13]:
series1.index

RangeIndex(start=0, stop=4, step=1)

If you want to specify custom index values rather than the default ones, pass them through the index argument:

In [14]:
series2 = pd.Series([10,20,30,40,50], index=['one','two','three','four','five'])
series2

one      10
two      20
three    30
four     40
five     50
dtype: int64
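
The dictionary analogy can be made concrete with a short sketch. It rebuilds a Series like series2 under the illustrative name s, and the label 'six' is deliberately absent.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50],
              index=['one', 'two', 'three', 'four', 'five'])

has_two = 'two' in s          # membership tests look at labels, like dict keys
fallback = s.get('six', -1)   # .get returns a default when the label is absent
as_dict = s.to_dict()         # convert to a plain Python dict

print(has_two)    # True
print(fallback)   # -1
print(as_dict)
```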

The ways of accessing elements in a Series object are similar to what we have seen in NumPy, and you can perform NumPy operations on Series data.

In [15]:
series2.iloc[2]  # positional access; plain series2[2] is deprecated when the index holds labels

30

In [16]:
series2['three']

30

In [17]:
series2[['one', 'three', 'five']]

one      10
three    30
five     50
dtype: int64

In [18]:
series2.iloc[[0,1,3]]  # positional access with a list of positions

one     10
two     20
four    40
dtype: int64
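
Because a Series can be addressed by label or by position, it can help to be explicit about which one is meant. The sketch below uses the .loc (label) and .iloc (position) accessors on a Series like series2; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50],
              index=['one', 'two', 'three', 'four', 'five'])

by_label = s.loc['three']           # label-based lookup
by_position = s.iloc[2]             # position-based lookup
label_slice = s.loc['two':'four']   # label slices include the end label
position_slice = s.iloc[1:3]        # position slices exclude the end position

print(by_label, by_position)   # 30 30
print(label_slice)
print(position_slice)
```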

In [19]:
series2 + 4

one      14
two      24
three    34
four     44
five     54
dtype: int64

In [20]:
series2 ** 3

one        1000
two        8000
three     27000
four      64000
five     125000
dtype: int64

In [16]:
series2[series2>30]

four    40
five    50
dtype: int64

In [17]:
np.sqrt(series2)

one      3.162278
two      4.472136
three    5.477226
four     6.324555
five     7.071068
dtype: float64

If you have a dictionary, you can create a Series from it. Suppose you are interested in EPS values for firms, and the values come from different sources and are not clean. In that case you don't have to align the values by hand: pandas matches each value to its index label and marks the years with no data as NaN.

In [27]:
years = [90, 91, 92, 93, 94, 95]
f1 = {90:8, 91:9, 92:7, 93:8, 94:9, 95:11}
firm1 = pd.Series(f1,index=years)
firm1

90     8
91     9
92     7
93     8
94     9
95    11
dtype: int64

In [28]:
f2 = {90:14,92:9, 93:13, 94:5}
firm2 = pd.Series(f2,index=years)
firm2

90    14.0
91     NaN
92     9.0
93    13.0
94     5.0
95     NaN
dtype: float64

In [29]:
f3 = {93:10, 94:12, 95: 13}
firm3 = pd.Series(f3,index=years)
firm3

90     NaN
91     NaN
92     NaN
93    10.0
94    12.0
95    13.0
dtype: float64

NaN stands for missing or NA values in pandas. Use the isnull() function to find out whether there are any missing values in the data structure.

In [21]:
pd.isnull(firm3)

90     True
91     True
92     True
93    False
94    False
95    False
dtype: bool
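
Once missing values have been located, a few common follow-up steps are counting them, dropping them, or filling them in. The sketch below rebuilds firm3 from the cell above; filling with 0 is an arbitrary choice made only for illustration.

```python
import pandas as pd

years = [90, 91, 92, 93, 94, 95]
firm3 = pd.Series({93: 10, 94: 12, 95: 13}, index=years)

n_missing = firm3.isnull().sum()   # number of missing entries
observed = firm3.dropna()          # keep only the years that have data
filled = firm3.fillna(0)           # replace missing values with 0 (illustrative choice)

print(n_missing)   # 3
print(observed)
print(filled)
```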

A key feature of Series data structures is that you don't have to worry about data alignment. For example, suppose we have run a word count program on two different files and have the following data structures:

In [30]:
dict1 = {'finance': 10, 'earning': 5, 'debt':8}
dict2 = {'finance' : 8, 'compensation':4, 'earning': 9}
count1 = pd.Series(dict1)
count2 = pd.Series(dict2)
print (count1)
count2

finance    10
earning     5
debt        8
dtype: int64


finance         8
compensation    4
earning         9
dtype: int64

If we want the sum for words common to both files, we don't have to worry about data alignment, because adding two Series matches values by label. A word that appears in only one Series gets NaN (a missing value) in the result. If we want to include all words, we have to handle those NaN values before computing the sum. Note that aggregation methods such as sum() skip NaN values by default.

In [24]:
count1+count2

compensation     NaN
debt             NaN
earning         14.0
finance         18.0
dtype: float64
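
One way to include all words is the add method with fill_value=0, which treats a word missing from one Series as a count of zero. This is a minimal sketch using the same two dictionaries; the names common_only and all_words are illustrative.

```python
import pandas as pd

count1 = pd.Series({'finance': 10, 'earning': 5, 'debt': 8})
count2 = pd.Series({'finance': 8, 'compensation': 4, 'earning': 9})

# Words present in both files: drop the NaN entries from the plain sum.
common_only = (count1 + count2).dropna()

# All words: a label missing from one Series counts as 0 before adding.
all_words = count1.add(count2, fill_value=0)

print(common_only)
print(all_words)
```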

**DataFrame**

DataFrame is a tabular data structure in which data is laid out in rows and columns (similar to a CSV file or a SQL table), but it can also be used for higher dimensional data sets. A DataFrame can contain homogeneous and heterogeneous values, and can be thought of as a logical extension of the Series data structure. In contrast to a Series, which has one index, a DataFrame has one index for rows and one for columns. This allows flexibility in accessing and manipulating data.

In [36]:
import pandas as pd
data = pd.DataFrame({'price':[95, 25, 85, 41, 78],
                     'ticker':['AXP', 'CSCO', 'DIS', 'MSFT', 'WMT'],
                     'company':['American Express', 'Cisco', 'Walt Disney','Microsoft', 'Walmart']})
data

   price ticker           company
0     95    AXP  American Express
1     25   CSCO             Cisco
2     85    DIS       Walt Disney
3     41   MSFT         Microsoft
4     78    WMT           Walmart


If a column is passed with no values, it will simply have NaN values
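
A minimal sketch of this behavior: when the columns argument names a column that the data does not contain, that column appears filled with NaN. The column name volume is hypothetical and is not part of the example data.

```python
import pandas as pd

frame = pd.DataFrame({'price': [95, 25, 85],
                      'ticker': ['AXP', 'CSCO', 'DIS']},
                     columns=['price', 'ticker', 'volume'])  # 'volume' has no data

print(frame)
print(frame['volume'].isnull().all())   # True
```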

To access a column, refer to it by name:

In [37]:
data['company']

0    American Express
1               Cisco
2         Walt Disney
3           Microsoft
4             Walmart
Name: company, dtype: object

In [38]:
data.company

0    American Express
1               Cisco
2         Walt Disney
3           Microsoft
4             Walmart
Name: company, dtype: object

In [39]:
data.iloc[2]

price               85
ticker             DIS
company    Walt Disney
Name: 2, dtype: object

In [40]:
data[data.ticker=='DIS']

   price ticker      company
2     85    DIS  Walt Disney
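
Row filters like the one above can be combined. The sketch below, on a trimmed copy of the same data, shows two conditions joined with & and a filter paired with a column selection through .loc. The thresholds are arbitrary and chosen only for illustration.

```python
import pandas as pd

data = pd.DataFrame({'price': [95, 25, 85, 41, 78],
                     'ticker': ['AXP', 'CSCO', 'DIS', 'MSFT', 'WMT']})

# Each condition needs its own parentheses; & means "and", | means "or".
mid_priced = data[(data.price > 40) & (data.price < 90)]

# .loc accepts a row condition and a column selection together.
cheap_tickers = data.loc[data.price < 50, 'ticker']

print(mid_priced)
print(cheap_tickers)
```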


To add a column, assign values to a new column name:

In [41]:
data['Year'] = 2014
data

   price ticker           company  Year
0     95    AXP  American Express  2014
1     25   CSCO             Cisco  2014
2     85    DIS       Walt Disney  2014
3     41   MSFT         Microsoft  2014
4     78    WMT           Walmart  2014


In [42]:
data['pricesquared'] = data.price**2
data

   price ticker           company  Year  pricesquared
0     95    AXP  American Express  2014          9025
1     25   CSCO             Cisco  2014           625
2     85    DIS       Walt Disney  2014          7225
3     41   MSFT         Microsoft  2014          1681
4     78    WMT           Walmart  2014          6084


In [43]:
del data['pricesquared']
data

   price ticker           company  Year
0     95    AXP  American Express  2014
1     25   CSCO             Cisco  2014
2     85    DIS       Walt Disney  2014
3     41   MSFT         Microsoft  2014
4     78    WMT           Walmart  2014


In [44]:
data['pricesquared'] = np.nan  # np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
data

   price ticker           company  Year  pricesquared
0     95    AXP  American Express  2014           NaN
1     25   CSCO             Cisco  2014           NaN
2     85    DIS       Walt Disney  2014           NaN
3     41   MSFT         Microsoft  2014           NaN
4     78    WMT           Walmart  2014           NaN


In [48]:
data['sequence'] = np.arange(1,6)
data

   price ticker           company  Year  pricesquared  sequence
0     95    AXP  American Express  2014           NaN         1
1     25   CSCO             Cisco  2014           NaN         2
2     85    DIS       Walt Disney  2014           NaN         3
3     41   MSFT         Microsoft  2014           NaN         4
4     78    WMT           Walmart  2014           NaN         5


In [55]:
data.values

array([[95, 'AXP', 'American Express', 2014, nan, 1],
       [25, 'CSCO', 'Cisco', 2014, nan, 2],
       [85, 'DIS', 'Walt Disney', 2014, nan, 3],
       [41, 'MSFT', 'Microsoft', 2014, nan, 4],
       [78, 'WMT', 'Walmart', 2014, nan, 5]], dtype=object)

In [56]:
newdata = data.drop(2)

In [58]:
newdata

   price ticker           company  Year  pricesquared  sequence
0     95    AXP  American Express  2014           NaN         1
1     25   CSCO             Cisco  2014           NaN         2
3     41   MSFT         Microsoft  2014           NaN         4
4     78    WMT           Walmart  2014           NaN         5


In [60]:
years = [90, 91, 92, 93, 94, 95]
f1 = {90:8, 91:9, 92:7, 93:8, 94:9, 95:11}
firm1 = pd.Series(f1,index=years)
firm1
f2 = {90:14,92:9, 93:13, 94:5}
firm2 = pd.Series(f2,index=years)
firm2
f3 = {93:10, 94:12, 95: 13}
firm3 = pd.Series(f3,index=years)
firm3
df1 = pd.DataFrame(columns=['Firm1','Firm2','Firm3'],index=years)
df1
df1.Firm1 = firm1
df1.Firm2 = firm2
df1.Firm3 = firm3
df1


    Firm1  Firm2  Firm3
90      8   14.0    NaN
91      9    NaN    NaN
92      7    9.0    NaN
93      8   13.0   10.0
94      9    5.0   12.0
95     11    NaN   13.0


In [62]:
dft = df1.T
dft

del dft[90]
dft

        91   92    93    94    95
Firm1  9.0  7.0   8.0   9.0  11.0
Firm2  NaN  9.0  13.0   5.0   NaN
Firm3  NaN  NaN  10.0  12.0  13.0


You can pass a number of data structures to DataFrame, such as an ndarray, lists, a dict, Series, or another DataFrame. You can also reindex to conform the data to a new index. Reindexing is a powerful feature that lets you access data in a number of different ways, and conform data to a new time series or other index. Index labels that were not present in the original data are filled with NaN.

In [68]:
reindexdf1 = df1.reindex([88,89,90,91,92,93,94,95,96,97,98])
reindexdf1

    Firm1  Firm2  Firm3
88    NaN    NaN    NaN
89    NaN    NaN    NaN
90    8.0   14.0    NaN
91    9.0    NaN    NaN
92    7.0    9.0    NaN
93    8.0   13.0   10.0
94    9.0    5.0   12.0
95   11.0    NaN   13.0
96    NaN    NaN    NaN
97    NaN    NaN    NaN
98    NaN    NaN    NaN


In [69]:
years1 = [90, 91, 92, 93, 94, 95]
f4 = {90:8, 91:9, 92:7, 93:8, 94:9, 95:11}
firm4 = pd.Series(f4,index=years1)  # use years1 throughout so the cell does not depend on the earlier years list
f5 = {90:14,91:12, 92:9, 93:13, 94:5, 95:8}
firm5 = pd.Series(f5,index=years1)
f6 = {90:8, 91: 9, 92:9,93:10, 94:12, 95: 13}
firm6 = pd.Series(f6,index=years1)
df2 = pd.DataFrame(columns=['Firm1','Firm2','Firm3'],index=years1)
df2.Firm1 = firm4
df2.Firm2 = firm5
df2.Firm3 = firm6
df2


    Firm1  Firm2  Firm3
90      8     14      8
91      9     12      9
92      7      9      9
93      8     13     10
94      9      5     12
95     11      8     13


In [70]:
reindexdf2 = df2.reindex([88,89,90,91,92,93,94,95,96,97,98], fill_value=0)
reindexdf2

    Firm1  Firm2  Firm3
88      0      0      0
89      0      0      0
90      8     14      8
91      9     12      9
92      7      9      9
93      8     13     10
94      9      5     12
95     11      8     13
96      0      0      0
97      0      0      0
98      0      0      0


Similarly, the backfill (bfill) method fills missing values backwards, using the next valid observation.

In [71]:
df2

    Firm1  Firm2  Firm3
90      8     14      8
91      9     12      9
92      7      9      9
93      8     13     10
94      9      5     12
95     11      8     13


In [72]:
reindexdf3 = df2.reindex([88,89,90,91,92,93,94,95,96,97,98], method='bfill')
reindexdf3

    Firm1  Firm2  Firm3
88    8.0   14.0    8.0
89    8.0   14.0    8.0
90    8.0   14.0    8.0
91    9.0   12.0    9.0
92    7.0    9.0    9.0
93    8.0   13.0   10.0
94    9.0    5.0   12.0
95   11.0    8.0   13.0
96    NaN    NaN    NaN
97    NaN    NaN    NaN
98    NaN    NaN    NaN
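
The counterpart of bfill is forward fill (ffill), which carries the last valid observation forward. The sketch below contrasts the two on a one-column frame; the frame df and the shortened new_index are illustrative and are not the df2 used above.

```python
import pandas as pd

years = [90, 91, 92, 93, 94, 95]
df = pd.DataFrame({'Firm1': [8, 9, 7, 8, 9, 11]}, index=years)

new_index = [88, 89, 90, 95, 96, 97]
forward = df.reindex(new_index, method='ffill')    # fills 96 and 97 from 95; 88 and 89 stay NaN
backward = df.reindex(new_index, method='bfill')   # fills 88 and 89 from 90; 96 and 97 stay NaN

print(forward)
print(backward)
```

Both methods require the original index to be sorted, which is the case for a list of consecutive years.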


In [73]:
reindexdf1

    Firm1  Firm2  Firm3
88    NaN    NaN    NaN
89    NaN    NaN    NaN
90    8.0   14.0    NaN
91    9.0    NaN    NaN
92    7.0    9.0    NaN
93    8.0   13.0   10.0
94    9.0    5.0   12.0
95   11.0    NaN   13.0
96    NaN    NaN    NaN
97    NaN    NaN    NaN
98    NaN    NaN    NaN


In [74]:
reindexdf3

    Firm1  Firm2  Firm3
88    8.0   14.0    8.0
89    8.0   14.0    8.0
90    8.0   14.0    8.0
91    9.0   12.0    9.0
92    7.0    9.0    9.0
93    8.0   13.0   10.0
94    9.0    5.0   12.0
95   11.0    8.0   13.0
96    NaN    NaN    NaN
97    NaN    NaN    NaN
98    NaN    NaN    NaN


In [75]:
reindexdf1+reindexdf3

    Firm1  Firm2  Firm3
88    NaN    NaN    NaN
89    NaN    NaN    NaN
90   16.0   28.0    NaN
91   18.0    NaN    NaN
92   14.0   18.0    NaN
93   16.0   26.0   20.0
94   18.0   10.0   24.0
95   22.0    NaN   26.0
96    NaN    NaN    NaN
97    NaN    NaN    NaN
98    NaN    NaN    NaN


In [32]:
reindexdf1.add(reindexdf3, fill_value=0)

    Firm1  Firm2  Firm3
88    8.0   14.0    8.0
89    8.0   14.0    8.0
90   16.0   28.0    8.0
91   18.0   12.0    9.0
92   14.0   18.0    9.0
93   16.0   26.0   20.0
94   18.0   10.0   24.0
95   22.0    8.0   26.0
96    NaN    NaN    NaN
97    NaN    NaN    NaN
98    NaN    NaN    NaN


You can use NumPy functions inside DataFrame objects.

In [77]:
dataframe = pd.DataFrame(np.random.randn(3,3),columns=['one','two','three'])
dataframe

        one       two     three
0 -0.889971  0.563024  0.825479
1  1.129391  0.102491  0.051281
2 -0.227511  0.418780 -0.584180


In [78]:
np.abs(dataframe)

        one       two     three
0  0.889971  0.563024  0.825479
1  1.129391  0.102491  0.051281
2  0.227511  0.418780  0.584180


In [80]:
dataframe

        one       two     three
0 -0.889971  0.563024  0.825479
1  1.129391  0.102491  0.051281
2 -0.227511  0.418780 -0.584180


In [81]:
f = lambda x:x.max()-x.min()
dataframe.apply(f)

one      2.019362
two      0.460533
three    1.409659
dtype: float64

In [82]:
dataframe.apply(f,axis=1)

0    1.715450
1    1.078111
2    1.002960
dtype: float64

In [85]:
g = lambda x: x - np.mean(x)
dataframe.apply(g)

        one       two     three
0 -0.893941  0.201592  0.727953
1  1.125422 -0.258941 -0.046246
2 -0.231481  0.057349 -0.681706


In [86]:
dataframe

        one       two     three
0 -0.889971  0.563024  0.825479
1  1.129391  0.102491  0.051281
2 -0.227511  0.418780 -0.584180


In [87]:
def f(x):
    return pd.Series([np.mean(x), x.max(), x.min()], index=['mean','max','min'])
dataframe.apply(f,axis=1)

       mean       max       min
0  0.166177  0.825479 -0.889971
1  0.427721  1.129391  0.051281
2 -0.130970  0.418780 -0.584180
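
The apply calls above can often be written with built-in DataFrame methods instead. The sketch below shows equivalents for the range, demeaning, and per-row summary examples. It uses a seeded random generator so the numbers are reproducible; they will not match the random values shown above, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded generator, so reruns give the same values
df = pd.DataFrame(rng.standard_normal((3, 3)), columns=['one', 'two', 'three'])

# Column-wise range, same result as df.apply(lambda x: x.max() - x.min())
col_range = df.max() - df.min()

# Subtract each column's mean, same result as df.apply(lambda x: x - np.mean(x))
demeaned = df - df.mean()

# Per-row summary built from row-wise reductions (axis=1)
row_stats = pd.DataFrame({'mean': df.mean(axis=1),
                          'max': df.max(axis=1),
                          'min': df.min(axis=1)})

print(col_range)
print(demeaned)
print(row_stats)
```

On small frames the two styles behave the same; on large frames the built-in methods tend to be faster because they avoid calling a Python function for every row or column.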


In [88]:
dataframe = pd.DataFrame(np.random.randn(3,3),columns=['one','two','three'])
dataframe

        one       two     three
0  0.983734  0.534972  0.572328
1  1.453973  0.320597  0.951175
2  0.137367 -1.598016  0.188671
