
# Basics of Numpy

An ndarray is a multidimensional container for values of a single data type. Every ndarray has a **shape** attribute, a tuple giving the size of each dimension, and a **dtype** attribute describing the data type of its elements. To create an **ndarray** we use the **array** function, which accepts a sequence (or nested sequences) of numbers and converts it into a NumPy array.

&nbsp;

In [1]:
import numpy as np

data1 = [0.5, 6.4, 8.9, 10.23]
arr1 = np.array(data1)
print(arr1)

[ 0.5   6.4   8.9  10.23]


In [3]:
data2 = [[1,2,3,4], [5,6,7,8]]
arr2 = np.array(data2)
print(arr2)

[[1 2 3 4]
 [5 6 7 8]]


In [4]:
arr2.ndim

2

In [5]:
arr2.shape

(2, 4)

In [6]:
arr1.dtype

dtype('float64')

In [7]:
arr2.dtype

dtype('int64')

&nbsp;

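To create arrays filled with placeholder values we use the **zeros**, **ones** or **empty** functions. A single integer gives a one dimensional array; for higher dimensional arrays we pass the shape as a tuple. Note that **empty** does not initialize its memory, so the values it returns are arbitrary.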

&nbsp;

In [8]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [9]:
np.zeros((3,6))

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

In [10]:
np.empty((3,4))

array([[-2.31584178e+077, -2.31584178e+077,  1.03753786e-322,
         0.00000000e+000],
       [ 0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
         0.00000000e+000],
       [ 0.00000000e+000,  0.00000000e+000,  0.00000000e+000,
         0.00000000e+000]])
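
As a small complement to the cells above, the following sketch shows **ones** with a tuple shape, an explicit `dtype` argument, and an **empty** array that is filled before it is read. The variable names are illustrative only.

```python
import numpy as np

# ones takes the same shape argument as zeros
ones_2d = np.ones((2, 3))
print(ones_2d.shape)    # (2, 3)

# a dtype can be requested explicitly
int_zeros = np.zeros((2, 2), dtype=np.int64)
print(int_zeros.dtype)  # int64

# empty only allocates memory, so assign values before using them
buf = np.empty((2, 2))
buf[:] = 7.0
print(buf)
```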

&nbsp;

Note that **arange** is the array valued version of the built-in Python **range** function. We can also explicitly specify the data type when creating an array, instead of letting NumPy infer it from the data. Finally, we can explicitly **cast** an existing array to another data type with **astype**, which by default returns a new array.

&nbsp;

In [11]:
np.arange(15)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [14]:
arr1 = np.array([1,2,3], dtype=np.float64)
arr1

array([1., 2., 3.])

In [15]:
arr = np.array([1,2,3,4,5])
arr.dtype

dtype('int64')

In [18]:
arr2 = arr.astype(np.float64)
arr2.dtype

dtype('float64')

In [21]:
# np.bytes_ is the byte string dtype (the older alias np.string_ was removed in NumPy 2.0)
array1 = np.array(['1.23', '1.34', '5.43'], dtype=np.bytes_)
array2 = array1.astype(np.float64)
array2

array([1.23, 1.34, 5.43])

&nbsp;

An important feature of NumPy is that its arrays support **batch operations** on data without explicit **for loops**. This is called **vectorization**. Arithmetic operations between equal sized arrays are computed element wise, and operations with a scalar are applied to each element of the array. Operations between arrays of different shapes are handled by **broadcasting**.

&nbsp;

In [23]:
arr = np.array([[1,2,3], [4,5,6]])
arr * arr

array([[ 1,  4,  9],
       [16, 25, 36]])

In [24]:
arr - arr

array([[0, 0, 0],
       [0, 0, 0]])

In [25]:
1/arr

array([[1.        , 0.5       , 0.33333333],
       [0.25      , 0.2       , 0.16666667]])
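
To make broadcasting between differently shaped arrays concrete, here is a minimal sketch. The arrays `row` and `col` are made up for illustration; the point is only that a shape of `(3,)` or `(2, 1)` is stretched to match the `(2, 3)` array.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])             # shape (3,)
col = np.array([[100], [200]])           # shape (2, 1)

# the row is stretched across both rows of arr
print(arr + row)
# [[11 22 33]
#  [14 25 36]]

# the column is stretched across all three columns of arr
print(arr + col)
# [[101 102 103]
#  [204 205 206]]
```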

&nbsp;

One dimensional arrays can be **indexed** just like Python lists. If we assign a scalar value to a slice, that value is **propagated** to the entire slice. This is a simple case of **broadcasting**. An important point is that array slices are **views** of the original array, so any change made to a slice also changes the original array.

&nbsp;

In [26]:
arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [28]:
arr[4]

4

In [29]:
arr[5:8]

array([5, 6, 7])

In [30]:
arr[5:8] = 12

In [31]:
arr

array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

In [32]:
arr_slice = arr[5:8]
arr_slice[1] = 1234
arr

array([   0,    1,    2,    3,    4,   12, 1234,   12,    8,    9])
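
When a slice should not write through to the original array, one option is to take an explicit copy. The sketch below contrasts the two cases; the names `arr_copy` and `arr_view` are only illustrative.

```python
import numpy as np

arr = np.arange(10)

# an explicit copy owns its data, so arr is left untouched
arr_copy = arr[5:8].copy()
arr_copy[:] = -1
print(arr)            # [0 1 2 3 4 5 6 7 8 9]

# a plain slice is a view onto arr, so the assignment shows up in arr
arr_view = arr[5:8]
arr_view[:] = -1
print(arr)            # [ 0  1  2  3  4 -1 -1 -1  8  9]
print(np.shares_memory(arr, arr_view))   # True
```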

&nbsp;

In a two dimensional array, each index along the first axis selects not a scalar but a one dimensional array. To reach an individual element we can index recursively or pass a comma separated list of indices. A plain slice on a 2D array selects along the first axis, **axis 0**, that is, it returns a range of rows. We can also pass one slice per axis to obtain customized subsets of our data.

&nbsp;

In [33]:
arr2d = np.array([[1,2,3], [4,5,6], [7,8,9]])
arr2d[1]

array([4, 5, 6])

In [34]:
arr2d[1][2]

6

In [36]:
arr2d[1, 2]

6

In [37]:
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [40]:
arr2d[0:2]

array([[1, 2, 3],
       [4, 5, 6]])

In [42]:
arr2d[:, :2]

array([[1, 2],
       [4, 5],
       [7, 8]])

In [43]:
arr2d[:2, :1]

array([[1],
       [4]])

&nbsp;

Now consider a situation in which we have an array of names with duplicate entries and some numerical data. We generate the numerical data with the **randn** function from **numpy.random**, which draws random numbers from a standard normal distribution.

&nbsp;

In [46]:
names = np.array(['bob', 'joe', 'will', 'bob', 'will', 'joe', 'joe'])
data = np.random.randn(7, 4)
print(names)
print(data)

['bob' 'joe' 'will' 'bob' 'will' 'joe' 'joe']
[[-0.53699063 -0.82655818 -0.22827909  0.95573274]
 [ 0.08583549  1.09407739 -0.77726278  1.54747849]
 [-1.51059031 -1.11900032 -0.37362349 -0.32004849]
 [-2.1461379   0.07304858  0.98022311 -0.40952237]
 [ 0.59736385 -0.44117088 -0.08697034  0.53228029]
 [-0.71501222 -1.40561128  0.80922951  0.27629206]
 [ 1.6479143  -0.15332721 -0.37640045  2.91867758]]


&nbsp;

Suppose that each name corresponds to one row of the data and that we want to extract the rows that correspond to the name **bob**. Applying a comparison operator to an array produces a boolean array, and this boolean array can be passed as an index to fetch only those rows for which it is True. The boolean array must have the same length as the axis it is indexing. Boolean indexing can be combined with slices or integer indices on the other axes, a mask can be negated with **~**, and masks can be combined with **|** (or) and **&** (and).

&nbsp;

In [47]:
names == 'bob'

array([ True, False, False,  True, False, False, False])

In [48]:
data[names == 'bob']

array([[-0.53699063, -0.82655818, -0.22827909,  0.95573274],
       [-2.1461379 ,  0.07304858,  0.98022311, -0.40952237]])

In [49]:
data[names == 'bob', :2]

array([[-0.53699063, -0.82655818],
       [-2.1461379 ,  0.07304858]])

In [51]:
data[~(names == 'bob')]

array([[ 0.08583549,  1.09407739, -0.77726278,  1.54747849],
       [-1.51059031, -1.11900032, -0.37362349, -0.32004849],
       [ 0.59736385, -0.44117088, -0.08697034,  0.53228029],
       [-0.71501222, -1.40561128,  0.80922951,  0.27629206],
       [ 1.6479143 , -0.15332721, -0.37640045,  2.91867758]])

In [52]:
mask = (names == 'bob')|(names == 'joe')
data[mask]

array([[-0.53699063, -0.82655818, -0.22827909,  0.95573274],
       [ 0.08583549,  1.09407739, -0.77726278,  1.54747849],
       [-2.1461379 ,  0.07304858,  0.98022311, -0.40952237],
       [-0.71501222, -1.40561128,  0.80922951,  0.27629206],
       [ 1.6479143 , -0.15332721, -0.37640045,  2.91867758]])

In [53]:
data[data < 0]

array([-0.53699063, -0.82655818, -0.22827909, -0.77726278, -1.51059031,
       -1.11900032, -0.37362349, -0.32004849, -2.1461379 , -0.40952237,
       -0.44117088, -0.08697034, -0.71501222, -1.40561128, -0.15332721,
       -0.37640045])
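
Boolean masks can also appear on the left side of an assignment. The following sketch uses a small hand-written array instead of random data so the result is easy to follow; the values are arbitrary.

```python
import numpy as np

names = np.array(['bob', 'joe', 'will', 'bob'])
data = np.array([[ 1.0, -2.0],
                 [-3.0,  4.0],
                 [ 5.0, -6.0],
                 [-7.0,  8.0]])

# set every negative entry to zero
data[data < 0] = 0
print(data)

# set whole rows, selected by a mask on the names, to a single value
data[names != 'bob'] = 9
print(data)
```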

In [56]:
## simple reshaping of arrays

arr3 = np.arange(15).reshape((3,5))
arr3

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [57]:
# transposing an array with the T attribute

arr = np.arange(15).reshape((3,5))
arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [58]:
arr.T

array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])

&nbsp;

We can compute matrix products of the form $X^TX$ with the **np.dot** function.

&nbsp;

In [60]:
arr = np.random.randn(6,3)
arr

array([[-1.1958221 , -0.40165665,  1.14518797],
       [ 0.2183597 ,  0.0431026 , -0.11388739],
       [-2.83405678, -0.14523599, -1.41938172],
       [ 0.80733639,  0.16660337,  2.40665369],
       [ 0.39525941,  2.25396886, -0.6503489 ],
       [ 1.43966531, -0.32997072,  0.06033277]])

In [61]:
np.dot(arr.T, arr)

array([[12.39020751,  1.45168878,  4.40108044],
       [ 1.45168878,  5.40129237, -1.34355349],
       [ 4.40108044, -1.34355349,  9.55764599]])
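
The same product can also be written with Python's `@` matrix multiplication operator. This is a small sketch on a hand-picked matrix `X` (an assumption for illustration) so the numbers can be checked by hand.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

gram_dot = np.dot(X.T, X)
gram_at = X.T @ X          # same matrix product, written with the @ operator

print(gram_at)
# [[35. 44.]
#  [44. 56.]]
print(np.allclose(gram_dot, gram_at))   # True
```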

&nbsp;

**ufuncs**, or universal functions, perform element wise operations on the data in an ndarray.

&nbsp;

In [62]:
arr = np.arange(10)
np.sqrt(arr)

array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])

In [63]:
np.exp(arr)

array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03, 8.10308393e+03])

&nbsp;

Suppose now that we wish to evaluate a function **sqrt(x^2 + y^2)** across a regular grid of values. We note that the **np.meshgrid** function takes two 1D arrays and produces two 2D matrices corresponding to all pairs of (x, y). 

&nbsp;

In [70]:
points = np.arange(-5, 5, 0.01)
xs, ys = np.meshgrid(points, points)
ys

array([[-5.  , -5.  , -5.  , ..., -5.  , -5.  , -5.  ],
       [-4.99, -4.99, -4.99, ..., -4.99, -4.99, -4.99],
       [-4.98, -4.98, -4.98, ..., -4.98, -4.98, -4.98],
       ...,
       [ 4.97,  4.97,  4.97, ...,  4.97,  4.97,  4.97],
       [ 4.98,  4.98,  4.98, ...,  4.98,  4.98,  4.98],
       [ 4.99,  4.99,  4.99, ...,  4.99,  4.99,  4.99]])

In [73]:
z = np.sqrt(xs**2 + ys**2)
z

array([[7.07106781, 7.06400028, 7.05693985, ..., 7.04988652, 7.05693985,
        7.06400028],
       [7.06400028, 7.05692568, 7.04985815, ..., 7.04279774, 7.04985815,
        7.05692568],
       [7.05693985, 7.04985815, 7.04278354, ..., 7.03571603, 7.04278354,
        7.04985815],
       ...,
       [7.04988652, 7.04279774, 7.03571603, ..., 7.0286414 , 7.03571603,
        7.04279774],
       [7.05693985, 7.04985815, 7.04278354, ..., 7.03571603, 7.04278354,
        7.04985815],
       [7.06400028, 7.05692568, 7.04985815, ..., 7.04279774, 7.04985815,
        7.05692568]])
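
Because the grid above has a million points, the layout of `xs` and `ys` is hard to see. A tiny grid makes it clearer; the input values below are chosen only so that one of the distances is a round number.

```python
import numpy as np

x = np.array([0.0, 3.0])
y = np.array([0.0, 4.0, 8.0])

xs, ys = np.meshgrid(x, y)
print(xs.shape, ys.shape)   # (3, 2) (3, 2)
print(xs)
# [[0. 3.]
#  [0. 3.]
#  [0. 3.]]
print(ys)
# [[0. 0.]
#  [4. 4.]
#  [8. 8.]]

z = np.sqrt(xs**2 + ys**2)
print(z[1, 1])              # 5.0, the distance of the point (3, 4) from the origin
```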

&nbsp;

We can compute the usual statistics on arrays with aggregate functions such as **sum** and **mean**. Without arguments they reduce over the whole array; with an **axis** argument they reduce along that axis. Cumulative functions such as **cumsum** do not aggregate; they output an array of the same shape containing the intermediate results.

&nbsp;

In [74]:
arr = np.random.randn(5, 4)
arr.mean()

-0.11408489388440957

In [75]:
np.mean(arr)

-0.11408489388440957

In [76]:
arr.sum()

-2.2816978776881913

In [77]:
arr.mean(axis = 1)

array([ 0.25439186, -0.39510343,  0.0443087 , -0.51873905,  0.04471745])

In [80]:
array = np.array([[1,2,3], [3,4,5], [5,6,7]])
array.cumsum(0)

array([[ 1,  2,  3],
       [ 4,  6,  8],
       [ 9, 12, 15]])
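
With random data it is hard to tell which axis was reduced. The sketch below repeats the idea on a small fixed array, where `axis=0` collapses the rows (one result per column) and `axis=1` collapses the columns (one result per row).

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

print(arr.sum())          # 21, over the whole array
print(arr.sum(axis=0))    # [5 7 9], one value per column
print(arr.sum(axis=1))    # [ 6 15], one value per row
print(arr.mean(axis=1))   # [2. 5.]

# cumulative functions keep the shape of the input
print(arr.cumsum(axis=1))
# [[ 1  3  6]
#  [ 4  9 15]]
print(arr.cumprod(axis=0))
# [[ 1  2  3]
#  [ 4 10 18]]
```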

In [83]:
arr.sort()
arr

array([[-1.03020103,  0.0525034 ,  0.05376627,  1.94149878],
       [-1.86640925, -0.97478224,  0.09020459,  1.17057319],
       [-0.57144017,  0.01702462,  0.30155924,  0.43009112],
       [-1.8636164 , -0.33143074, -0.03156559,  0.15165653],
       [-0.44424165, -0.31879745,  0.35028278,  0.59162611]])

In [85]:
# using dot product computations

x = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)
y = np.array([[6, 23], [-1, 7], [8, 9]], dtype=np.float64)
print(x)
print(y)

[[1. 2. 3.]
 [4. 5. 6.]]
[[ 6. 23.]
 [-1.  7.]
 [ 8.  9.]]


In [86]:
x.dot(y)

array([[ 28.,  64.],
       [ 67., 181.]])

In [87]:
np.dot(x, y)

array([[ 28.,  64.],
       [ 67., 181.]])

In [88]:
np.dot(x, np.ones(3))

array([ 6., 15.])

In [90]:
# performing more complex linear algebra computations

from numpy.linalg import inv, qr

array = np.random.randn(5, 5)
matrix = array.T.dot(array)
print(inv(matrix))

[[ 2.26834927 -0.77433946 -1.06210808  2.13592429 -1.80797884]
 [-0.77433946  0.56010967  0.52491625 -0.7778496   0.5789922 ]
 [-1.06210808  0.52491625  1.0621428  -1.15700928  0.90443554]
 [ 2.13592429 -0.7778496  -1.15700928  2.28123923 -1.74356554]
 [-1.80797884  0.5789922   0.90443554 -1.74356554  1.61166758]]


In [92]:
print(matrix.dot(inv(matrix)))

[[ 1.00000000e+00  0.00000000e+00  0.00000000e+00 -1.77635684e-15
  -8.88178420e-16]
 [ 4.44089210e-16  1.00000000e+00 -6.66133815e-16 -4.44089210e-16
  -8.88178420e-16]
 [ 6.66133815e-16  2.22044605e-16  1.00000000e+00  8.88178420e-16
   0.00000000e+00]
 [ 2.22044605e-16 -3.88578059e-16  6.66133815e-16  1.00000000e+00
   0.00000000e+00]
 [-1.77635684e-15 -4.44089210e-16 -8.88178420e-16 -1.77635684e-15
   1.00000000e+00]]
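
The product above is the identity only up to floating point error, which is why the off-diagonal entries are tiny numbers rather than exact zeros. One way to check such a result is `np.allclose`, sketched here on a small matrix `A` picked for illustration. When the goal is to solve a linear system, `np.linalg.solve` can be used without forming the inverse at all.

```python
import numpy as np

# a small, well-conditioned matrix chosen for illustration
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
A_inv = np.linalg.inv(A)

# compare against the identity up to floating point tolerance
print(np.allclose(A @ A_inv, np.eye(2)))   # True

# solve A x = b directly instead of multiplying by the inverse
b = np.array([1.0, 2.0])
x = np.linalg.solve(A, b)
print(np.allclose(A @ x, b))               # True
```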


In [93]:
# obtaining samples from random modules

samples = np.random.normal(size=(4,4))
samples

array([[ 0.20890525, -0.34149858,  0.92154582, -0.58737155],
       [-1.26604697,  0.57248925, -1.22607407, -0.5084723 ],
       [-1.1548418 ,  0.01611688,  0.34739951, -0.19132954],
       [-0.34036239,  2.15243419, -0.23317119, -0.01138253]])
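
The random outputs in this notebook change on every run. If reproducible draws are needed, one option is NumPy's `default_rng` generator with a fixed seed. This is a sketch; the seed 42 is an arbitrary choice.

```python
import numpy as np

# a seeded generator produces the same sequence every time
rng = np.random.default_rng(42)
samples = rng.normal(size=(4, 4))

# a second generator with the same seed reproduces the same draws
rng_again = np.random.default_rng(42)
samples_again = rng_again.normal(size=(4, 4))

print(samples.shape)                            # (4, 4)
print(np.array_equal(samples, samples_again))   # True
```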

# Basics of Pandas

We will usually work with pandas when it comes to data analysis. The two main data structures in this package are **Series** and **DataFrame**. A Series contains a one dimensional array of data together with an associated array of index labels. We can use the .values and .index attributes to get the data and the index respectively.

&nbsp;

In [94]:
import pandas as pd

In [96]:
obj = pd.Series([4,5,-7,3])
obj

0    4
1    5
2   -7
3    3
dtype: int64

In [97]:
obj.values

array([ 4,  5, -7,  3])

In [98]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [100]:
# specifying a custom index. 

obj2 = pd.Series([-4,3,23,56], index=['a', 'b', 'c', 'd'])
obj2

a    -4
b     3
c    23
d    56
dtype: int64

In [101]:
obj2.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [103]:
obj2['a']

-4

In [105]:
obj2[['a', 'c']]

a    -4
c    23
dtype: int64

In [106]:
obj2['d'] = 76

In [107]:
obj2

a    -4
b     3
c    23
d    76
dtype: int64

In [109]:
# creating a series from a dict

mydict = {'delhi': 1200, 'mumbai': 4500, 'chennai': 7682, 'bangalore': 7722}
mydata = pd.Series(mydict)
mydata

delhi        1200
mumbai       4500
chennai      7682
bangalore    7722
dtype: int64

In [110]:
states = ['rajasthan', 'delhi', 'chennai', 'mumbai']
mydata = pd.Series(mydict, index=states)
mydata

rajasthan       NaN
delhi        1200.0
chennai      7682.0
mumbai       4500.0
dtype: float64

In [111]:
# using isnull and notnull to detect null values. 

pd.isnull(mydata)

rajasthan     True
delhi        False
chennai      False
mumbai       False
dtype: bool

In [112]:
pd.notnull(mydata)

rajasthan    False
delhi         True
chennai       True
mumbai        True
dtype: bool

In [115]:
mydata.isnull()

rajasthan     True
delhi        False
chennai      False
mumbai       False
dtype: bool
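
Once missing values have been detected, they are usually dropped or filled. The sketch below rebuilds a Series like `mydata` by hand and shows both options; the fill value 0 is only an example, not a recommendation.

```python
import numpy as np
import pandas as pd

mydata = pd.Series({'rajasthan': np.nan, 'delhi': 1200.0,
                    'chennai': 7682.0, 'mumbai': 4500.0})

# drop the missing entries
cleaned = mydata.dropna()
print(cleaned)

# or replace them with a chosen value
filled = mydata.fillna(0)
print(filled['rajasthan'])     # 0.0

# count the missing values
print(mydata.isnull().sum())   # 1
```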

In [117]:
# a DataFrame is a two dimensional table whose columns can have different dtypes

data = {'state': ['ohio', 'ohio', 'ohio', 'nevada', 'nevada'],
        'year': [2000, 2001, 2002, 2003, 2001],
        'pop': [1.5, 3.4, 1.6, 7.3, 2.4]}

frame = pd.DataFrame(data)
frame

    state  year  pop
0    ohio  2000  1.5
1    ohio  2001  3.4
2    ohio  2002  1.6
3  nevada  2003  7.3
4  nevada  2001  2.4


In [118]:
frame = pd.DataFrame(data, columns=['year', 'state', 'pop'])
frame

   year   state  pop
0  2000    ohio  1.5
1  2001    ohio  3.4
2  2002    ohio  1.6
3  2003  nevada  7.3
4  2001  nevada  2.4


In [119]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])
frame2

       year   state  pop  debt
one    2000    ohio  1.5   NaN
two    2001    ohio  3.4   NaN
three  2002    ohio  1.6   NaN
four   2003  nevada  7.3   NaN
five   2001  nevada  2.4   NaN


In [120]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [121]:
frame2['state']

one        ohio
two        ohio
three      ohio
four     nevada
five     nevada
Name: state, dtype: object

In [122]:
frame2.year

one      2000
two      2001
three    2002
four     2003
five     2001
Name: year, dtype: int64

In [126]:
frame2['debt'] = range(23, 28)
frame2

       year   state  pop  debt
one    2000    ohio  1.5    23
two    2001    ohio  3.4    24
three  2002    ohio  1.6    25
four   2003  nevada  7.3    26
five   2001  nevada  2.4    27


In [128]:
frame2['debt'] = np.arange(5)
frame2

       year   state  pop  debt
one    2000    ohio  1.5     0
two    2001    ohio  3.4     1
three  2002    ohio  1.6     2
four   2003  nevada  7.3     3
five   2001  nevada  2.4     4


In [129]:
# assigning a Series to a column aligns on the index; labels missing from the Series become NaN

values = pd.Series([1.23, 4.56, 3.34], index=['two', 'four', 'five'])
frame2['debt'] = values
frame2

       year   state  pop  debt
one    2000    ohio  1.5   NaN
two    2001    ohio  3.4  1.23
three  2002    ohio  1.6   NaN
four   2003  nevada  7.3  4.56
five   2001  nevada  2.4  3.34


In [130]:
# creating a column using boolean operations

frame2['eastern'] = frame2.state == 'ohio'
frame2

       year   state  pop  debt  eastern
one    2000    ohio  1.5   NaN     True
two    2001    ohio  3.4  1.23     True
three  2002    ohio  1.6   NaN     True
four   2003  nevada  7.3  4.56    False
five   2001  nevada  2.4  3.34    False


In [131]:
del frame2['eastern']

In [132]:
frame2

       year   state  pop  debt
one    2000    ohio  1.5   NaN
two    2001    ohio  3.4  1.23
three  2002    ohio  1.6   NaN
four   2003  nevada  7.3  4.56
five   2001  nevada  2.4  3.34


In [136]:
# creating dataframes using nested dicts
states = pd.DataFrame({'nevada': {2001: 2.4, 2002: 2.9},
                      'ohio': {2000: 1.5, 2001: 2.3, 2002: 1.89}})
states

      nevada  ohio
2001     2.4  2.30
2002     2.9  1.89
2000     NaN  1.50


In [137]:
states.T

        2001  2002  2000
nevada   2.4  2.90   NaN
ohio     2.3  1.89   1.5


In [138]:
pdata = {'ohio': states['ohio'][:-1],
         'nevada': states['nevada'][:2]}
pd.DataFrame(pdata)

      ohio  nevada
2001  2.30     2.4
2002  1.89     2.9


In [139]:
# extracting the data into a numpy array

states.values

array([[2.4 , 2.3 ],
       [2.9 , 1.89],
       [ nan, 1.5 ]])

In [140]:
# indexing with series

obj = pd.Series(range(3), index=['a', 'b', 'c'])
obj

a    0
b    1
c    2
dtype: int64

In [141]:
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [142]:
index[1:]

Index(['b', 'c'], dtype='object')

In [143]:
index[1]

'b'

In [146]:
# reindexing the data

obj = pd.Series([1,2,3,4], index=['d', 'b', 'c', 'a'])
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj
obj2

a    4.0
b    2.0
c    3.0
d    1.0
e    NaN
dtype: float64

In [147]:
# forward filling missing values

obj3 = pd.Series(['purple', 'blue', 'yellow'], index=[0, 2, 4])
obj4 = obj3.reindex(range(6), method='ffill')
obj4

0    purple
1    purple
2      blue
3      blue
4    yellow
5    yellow
dtype: object

In [149]:
# more reindexing

data = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'b', 'c'],
                    columns=['ohio', 'california', 'texas'])

data

   ohio  california  texas
a     0           1      2
b     3           4      5
c     6           7      8
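
A DataFrame can be reindexed along either axis. The sketch below rebuilds the same frame and reindexes the rows and then the columns; the extra labels `'d'` and `'utah'` are made up to show where missing values appear.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'b', 'c'],
                    columns=['ohio', 'california', 'texas'])

# reindexing the rows introduces a row of NaN for the new label 'd'
rows = data.reindex(['a', 'b', 'c', 'd'])
print(rows)

# reindexing the columns works the same way through the columns keyword
cols = data.reindex(columns=['texas', 'utah', 'california'])
print(cols)
```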


In [150]:
# dropping values 

obj = pd.Series([1,2,3,4], index=['a', 'b', 'c', 'd'])
obj2 = obj.drop('c')
obj2

a    1
b    2
d    4
dtype: int64

In [151]:
obj.drop(['d', 'c'])

a    1
b    2
dtype: int64

In [152]:
# removal of data from dataframes

data = pd.DataFrame(np.arange(16).reshape((4, 4)), columns=['a', 'b', 'c', 'd'],
                    index=['ohio', 'cali', 'ny', 'nevada'])
data

         a   b   c   d
ohio     0   1   2   3
cali     4   5   6   7
ny       8   9  10  11
nevada  12  13  14  15


In [153]:
data.drop(['ohio', 'cali'])

         a   b   c   d
ny       8   9  10  11
nevada  12  13  14  15


In [154]:
data.drop(['a', 'b'], axis=1)

         c   d
ohio     2   3
cali     6   7
ny      10  11
nevada  14  15


In [155]:
obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
obj[1]

1

In [156]:
obj[1:3]

b    1
c    2
dtype: int64

In [157]:
obj['b':'c']

b    1
c    2
dtype: int64

In [158]:
obj[['b', 'a']]

b    1
a    0
dtype: int64
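
The cells above mix positional and label based indexing through plain square brackets. To my understanding, newer pandas releases deprecate integer keys such as `obj[1]` on a Series with a non-integer index, so the explicit `iloc` and `loc` indexers are the safer spelling. A short sketch of both:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

# iloc is always positional; the end of the slice is excluded
print(obj.iloc[1])        # 1
print(obj.iloc[1:3])      # the entries labelled b and c

# loc is always label based; the end of the slice is included
print(obj.loc['b'])       # 1
print(obj.loc['b':'c'])   # the entries labelled b and c
```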

In [160]:
# getting columns via dataframe indexing

data = pd.DataFrame(np.arange(16).reshape((4, 4)), columns=['a', 'b', 'c', 'd'],
                    index=['ohio', 'cali', 'ny', 'nevada'])
data

         a   b   c   d
ohio     0   1   2   3
cali     4   5   6   7
ny       8   9  10  11
nevada  12  13  14  15


In [161]:
data['b']

ohio       1
cali       5
ny         9
nevada    13
Name: b, dtype: int64

In [162]:
data[['a', 'b']]

         a   b
ohio     0   1
cali     4   5
ny       8   9
nevada  12  13


In [163]:
# selecting rows

data[:2]

      a  b  c  d
ohio  0  1  2  3
cali  4  5  6  7


In [164]:
data[data['c'] > 5]

         a   b   c   d
cali     4   5   6   7
ny       8   9  10  11
nevada  12  13  14  15
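
Rows and columns of a DataFrame can also be selected together with `loc` (labels) and `iloc` (positions). The following sketch rebuilds the same frame and shows a few common combinations.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)), columns=['a', 'b', 'c', 'd'],
                    index=['ohio', 'cali', 'ny', 'nevada'])

# one row by label, returned as a Series
print(data.loc['cali'])

# rows and columns by label
print(data.loc[['ohio', 'ny'], ['a', 'c']])

# rows and columns by position
print(data.iloc[:2, 1:3])

# a boolean row filter combined with a column selection
print(data.loc[data['c'] > 5, 'a'])
```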


In [169]:
# arithmetic between Series with differing indexes: the result has the union of the indexes, with NaN where a label is missing from either side

s1 = pd.Series([2.1, 4.5, 2.3, 5.6], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([4.3, 5.4, 1.2, 6.5, 3.2], index=['a', 'c', 'e', 'f', 'g'])

s1+s2

a    6.4
c    9.9
d    NaN
e    6.8
f    NaN
g    NaN
dtype: float64

In [171]:
df1 = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=list('bcd'),
                   index=['ohio', 'texas', 'colorado'])
df2 = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list('bde'),
                   index=['utah', 'ohio', 'texas', 'oregon'])
df1 + df2

            b    c     d    e
colorado  NaN  NaN   NaN  NaN
ohio      3.0  NaN   6.0  NaN
oregon    NaN  NaN   NaN  NaN
texas     9.0  NaN  12.0  NaN
utah      NaN  NaN   NaN  NaN


In [172]:
# use the add method and fill value

df1.add(df2, fill_value=0)

            b    c     d     e
colorado  6.0  7.0   8.0   NaN
ohio      3.0  1.0   6.0   5.0
oregon    9.0  NaN  10.0  11.0
texas     9.0  4.0  12.0   8.0
utah      0.0  NaN   1.0   2.0


In [173]:
df2.add(df1, fill_value=0)

            b    c     d     e
colorado  6.0  7.0   8.0   NaN
ohio      3.0  1.0   6.0   5.0
oregon    9.0  NaN  10.0  11.0
texas     9.0  4.0  12.0   8.0
utah      0.0  NaN   1.0   2.0


In [176]:
# computing with stock data

import pandas_datareader as web

all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2000', '1/1/2010')

prices = pd.DataFrame({tic:data['Adj Close']
                       for tic, data in all_data.items()})
volume = pd.DataFrame({tic:data['Volume']
                       for tic, data in all_data.items()})

In [177]:
# computing percent changes in prices

returns = prices.pct_change()
returns.tail()

                AAPL       IBM      MSFT      GOOG
Date
2009-12-24  0.034339  0.004385  0.002587  0.011117
2009-12-28  0.012294  0.013326  0.005484  0.007098
2009-12-29 -0.011861 -0.003477  0.007058 -0.005571
2009-12-30  0.012147  0.005461 -0.013699  0.005376
2009-12-31 -0.004299 -0.012597 -0.015504 -0.004416


In [178]:
returns.head()

                AAPL       IBM      MSFT  GOOG
Date
1999-12-31       NaN       NaN       NaN   NaN
2000-01-03  0.088754  0.075319 -0.001606   NaN
2000-01-04 -0.084311 -0.033944 -0.033780   NaN
2000-01-05  0.014634  0.035137  0.010544   NaN
2000-01-06 -0.086539 -0.017242 -0.033498   NaN


In [179]:
# finding correlations

returns.MSFT.corr(returns.IBM)

0.4943579912733756

In [180]:
returns.MSFT.cov(returns.IBM)

0.000215821315591899

In [181]:
returns.corr()

          AAPL       IBM      MSFT      GOOG
AAPL  1.000000  0.412392  0.423598  0.470676
IBM   0.412392  1.000000  0.494358  0.390688
MSFT  0.423598  0.494358  1.000000  0.443586
GOOG  0.470676  0.390688  0.443586  1.000000


In [183]:
returns.corrwith(returns.IBM)

AAPL    0.412392
IBM     1.000000
MSFT    0.494358
GOOG    0.390688
dtype: float64
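
The stock example depends on a network download, so it may not run everywhere. The same `pct_change` and `corr` calls can be tried on a small offline frame. The tickers `AAA` and `BBB` and their prices below are made up for illustration and are not market data.

```python
import pandas as pd

# made-up prices for two hypothetical tickers
prices = pd.DataFrame({'AAA': [100.0, 110.0, 99.0, 108.9],
                       'BBB': [50.0, 55.0, 49.5, 54.45]},
                      index=pd.date_range('2020-01-01', periods=4))

# the first row is NaN because there is no previous price to compare with
returns = prices.pct_change()
print(returns)

# BBB is AAA scaled by one half, so the two return series move together
print(returns['AAA'].corr(returns['BBB']))
```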

In [185]:
# getting other information from series

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [186]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

# Handling Time Series data

In [187]:
from datetime import datetime

now = datetime.now()
now.year, now.month, now.day

(2020, 10, 5)

In [189]:
# time delta represents the temporal difference between 2 datetime objects. 

delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
delta

datetime.timedelta(days=926, seconds=56700)

In [190]:
delta.days

926

In [191]:
delta.seconds

56700

In [194]:
# we can add timedelta to a date to get a shifted object

from datetime import timedelta

start = datetime(2011, 7, 2)

start + timedelta(12)
start - 2*timedelta(10)

datetime.datetime(2011, 6, 12, 0, 0)

In [195]:
# converting string format to date format

from dateutil.parser import parse

parse('2011-11-7')

datetime.datetime(2011, 11, 7, 0, 0)

In [196]:
parse('Jan 31, 1992')

datetime.datetime(1992, 1, 31, 0, 0)

In [198]:
datestrs = ['7/6/2011', '8/6/2011']
dates = pd.to_datetime(datestrs)
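
Strings such as `'7/6/2011'` are ambiguous between day-first and month-first order. When the layout is known, an explicit format string removes the guesswork. The sketch below assumes the strings are month/day/year, which the source does not state.

```python
from datetime import datetime
import pandas as pd

# strptime parses a string with an explicit format
stamp = datetime.strptime('2011-01-03', '%Y-%m-%d')
print(stamp)                          # 2011-01-03 00:00:00

# strftime goes the other way, from datetime to string
print(stamp.strftime('%d/%m/%Y'))     # 03/01/2011

# to_datetime accepts a format as well (month/day/year is an assumption here)
dates = pd.to_datetime(['7/6/2011', '8/6/2011'], format='%m/%d/%Y')
print(dates[0].month, dates[0].day)   # 7 6
```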

In [200]:
# a Series indexed by datetime objects is a time series; its index becomes a DatetimeIndex

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.random.randn(6), index=dates)
ts

2011-01-02    0.643641
2011-01-05   -1.042606
2011-01-07    0.933492
2011-01-08   -2.793350
2011-01-10   -0.599916
2011-01-12    2.334496
dtype: float64

In [201]:
type(ts)

pandas.core.series.Series

In [202]:
ts.index

DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)
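
A DatetimeIndex allows selection with date strings. The sketch below rebuilds a Series on the same dates, with fixed values instead of random ones so the results are easy to check.

```python
from datetime import datetime
import numpy as np
import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.arange(6.0), index=dates)

# select one observation with a date string
print(ts.loc['2011-01-07'])               # 2.0

# slice with date strings; both endpoints are included
# and they do not have to be present in the index
window = ts.loc['2011-01-06':'2011-01-10']
print(window)                             # the rows for Jan 7, 8 and 10
```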