<h2> Python Review and Numpy </h2>

In this module, we review important concepts in Python programming and discuss how to use Numpy for tasks like slicing and data transformation.

The first important concept is a Python list.

<h3> Python List </h3>

A list is a Python data type that stores an ordered collection of items. Lists support basic operations such as indexing and slicing.

In [1]:
a_list = [1,5,8,1,4,9]
print(a_list)
print(type(a_list))

[1, 5, 8, 1, 4, 9]
<class 'list'>


We can access items in a list by index, using the syntax

$list\_name[item\_index]$

This returns the item at position <i>item_index</i>. Indexes in Python start from <b>0</b>, so the first item in a list is $list[0]$, the 2nd item is $list[1]$, the 3rd item is $list[2]$, and so on.

In [2]:
a_list[0]

1

In [3]:
a_list[4]

4

Indexes can also be negative, in which case they are counted from <b>the end</b> of the list. Negative indexes start from -1, so the last item in a list is list[-1], the 2nd last is list[-2], the 3rd last is list[-3], and so on. For example:

In [4]:
a_list[-1]

9

Slicing a list gives us a subset of the current list. We slice lists with the <b>:</b> operator. The syntax is

$list[start\_index:end\_index:step]$

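The sublist starts at <b>start_index</b> and ends at <b>end_index - 1</b> (the item at end_index itself is excluded), and the indices increment or decrement by the given <b>step</b>. Any field can be omitted, in which case Python uses a default: with a positive step, start defaults to 0, end defaults to the length of the list (so the last item is included), and step defaults to 1. Note that an explicit end of -1 is not the same as omitting it: a_list[1:-1] stops before the last item.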

In [5]:
a_list[1:-1]

[5, 8, 1, 4]

In [6]:
a_list[1:5:2]

[5, 1]
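As a quick check of how omitted fields behave, the sketch below repeats the values of a_list under a new name (nums, chosen here only for illustration) and leaves out each field in turn.

```python
# Same values as a_list above, redefined so the block runs on its own
nums = [1, 5, 8, 1, 4, 9]

first_three = nums[:3]      # start omitted: begins at index 0
from_second = nums[1:]      # end omitted: runs through the last item
every_other = nums[::2]     # start and end omitted, step of 2
reversed_nums = nums[::-1]  # a negative step walks the list backwards

print(first_three)    # [1, 5, 8]
print(from_second)    # [5, 8, 1, 4, 9]
print(every_other)    # [1, 8, 4]
print(reversed_nums)  # [9, 4, 1, 8, 5, 1]
```

Compare nums[1:] with the earlier a_list[1:-1]: only the former keeps the last item.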

Lists can be nested. A two-level list is also called a 2D list and resembles a tabular data representation, with each inner list acting as one row.

In [6]:
employees = [
    [100320,36,5,110000],
    [132201,30,3,105000],
    [100212,45,12,133000],
    [143695,27,1,80000]
]

for employee in employees:
    print(employee)

[100320, 36, 5, 110000]
[132201, 30, 3, 105000]
[100212, 45, 12, 133000]
[143695, 27, 1, 80000]


Extracting rows from a 2D list is simple enough with indexing or slicing.

In [8]:
employees[1]

[132201, 30, 3, 105000]

In [9]:
employees[1:-1]

[[132201, 30, 3, 105000], [100212, 45, 12, 133000]]

However, extracting a column is more complicated: we have to loop over the rows and collect the item at the column's index from each one.

In [4]:
age = []

for row in employees:
    age.append(row[1])  #age is the 2nd item (index 1) in each row

print(age)

[36, 30, 45, 27]


In [5]:
sum(age) / len(age)

34.5
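The loop above can be written more compactly. The following is a minimal sketch of two common alternatives, a list comprehension and zip, using the same employee rows (redefined here so the block runs on its own). The name columns is introduced for illustration.

```python
employees = [
    [100320, 36, 5, 110000],
    [132201, 30, 3, 105000],
    [100212, 45, 12, 133000],
    [143695, 27, 1, 80000],
]

# List comprehension: take the item at index 1 from every row
age = [row[1] for row in employees]

# zip(*rows) transposes the 2D list, so each tuple holds one column
columns = list(zip(*employees))

print(age)                  # [36, 30, 45, 27]
print(columns[1])           # (36, 30, 45, 27)
print(sum(age) / len(age))  # 34.5
```

Both versions still rebuild the column item by item, which is part of why plain lists are awkward for tabular data.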

Overall, lists are not ideal for storing and manipulating data. We can use objects from Numpy and Pandas instead. While Pandas is more tailored towards dataset manipulation, Numpy is more basic. We will discuss Pandas in the next module.

<h3> Numpy Arrays </h3>

Numpy is a Python package for numerical computation. It is widely used in data analytics since it is often simpler to use and more flexible than Pandas.

Numpy is an external library, so we have to install it first (for example, with pip install numpy). In each Python session, we need to import numpy before using it.

In [8]:
import numpy as np

A numpy array is fairly similar to a list. However, it is much more capable in terms of indexing, slicing, and numerical computation. We can use np.array() to create a new array from a list.

In [10]:
employees = np.array([
    [100320,36,5,110000],
    [132201,30,3,105000],
    [100212,45,12,133000],
    [143695,27,1,80000]
])

employees

array([[100320,     36,      5, 110000],
       [132201,     30,      3, 105000],
       [100212,     45,     12, 133000],
       [143695,     27,      1,  80000]])

If the data is stored in a text file, we can load it with np.loadtxt(). By default, the values are read as floats.

In [13]:
data = np.loadtxt('employee_data.txt')
data

array([[1.00320e+05, 3.60000e+01, 5.00000e+00, 9.70000e+01, 1.10000e+05],
       [1.32201e+05, 3.00000e+01, 3.00000e+00, 9.50000e+01, 1.05000e+05],
       [1.00212e+05, 4.50000e+01, 1.20000e+01, 9.50000e+01, 1.33000e+05],
       [1.43695e+05, 2.70000e+01, 1.00000e+00, 9.00000e+01, 8.00000e+04]])

We can change the array's data type with the dtype argument.

In [14]:
data = np.loadtxt('employee_data.txt',dtype=np.int32)
data

array([[100320,     36,      5,     97, 110000],
       [132201,     30,      3,     95, 105000],
       [100212,     45,     12,     95, 133000],
       [143695,     27,      1,     90,  80000]])
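To try np.loadtxt() without the file at hand, the sketch below feeds it an in-memory string through io.StringIO. It assumes the file is whitespace-delimited, as the calls above imply by passing no delimiter. The text variable is a stand-in that copies the values shown in the output.

```python
import io

import numpy as np

# Stand-in for employee_data.txt: four whitespace-separated rows
text = """100320 36 5 97 110000
132201 30 3 95 105000
100212 45 12 95 133000
143695 27 1 90 80000
"""

# np.loadtxt accepts a file name or a file-like object
as_float = np.loadtxt(io.StringIO(text))                # default dtype is float
as_int = np.loadtxt(io.StringIO(text), dtype=np.int32)  # parse as 32-bit integers

print(as_float.dtype)  # float64
print(as_int.dtype)    # int32
print(as_int.shape)    # (4, 5)
```

For a comma-separated file, np.loadtxt() also takes a delimiter argument, for example delimiter=','.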

The shape attribute gives the number of rows and columns.

In [12]:
employees.shape

(4, 4)
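Since shape is a plain tuple, it can be unpacked directly. The sketch below shows this along with two related attributes, ndim and size. The names n_rows and n_cols are introduced for illustration.

```python
import numpy as np

# Same employee table as above, redefined so the block runs on its own
employees = np.array([
    [100320, 36, 5, 110000],
    [132201, 30, 3, 105000],
    [100212, 45, 12, 133000],
    [143695, 27, 1, 80000],
])

n_rows, n_cols = employees.shape  # unpack (rows, columns)

print(n_rows, n_cols)  # 4 4
print(employees.ndim)  # 2  (number of dimensions)
print(employees.size)  # 16 (total number of items)
```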

Rows and columns can now be extracted using indexing and slicing. The general syntax is

In [None]:
array[<row slice>, <column slice>]

In [13]:
employees[0]

array([100320,     36,      5, 110000])

In [14]:
employees[[0,3]]

array([[100320,     36,      5, 110000],
       [143695,     27,      1,  80000]])

To select columns, add a second index or slice inside the brackets, separated from the first by a comma. A bare ":" in the row position selects all rows.

In [15]:
employees[:,0]

array([100320, 132201, 100212, 143695])

In [16]:
employees[:,[1,3]]

array([[    36, 110000],
       [    30, 105000],
       [    45, 133000],
       [    27,  80000]])

We can combine both index systems to filter rows and columns at the same time.

In [17]:
employees[:3,[1,3]]

array([[    36, 110000],
       [    30, 105000],
       [    45, 133000]])
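One case deserves care: passing index lists for both rows and columns. Numpy pairs the two lists element by element rather than selecting a block. The sketch below contrasts that with np.ix_, which builds the row-by-column block. The names paired and block are introduced for illustration.

```python
import numpy as np

# Same employee table as above, redefined so the block runs on its own
employees = np.array([
    [100320, 36, 5, 110000],
    [132201, 30, 3, 105000],
    [100212, 45, 12, 133000],
    [143695, 27, 1, 80000],
])

# Two index lists are paired up: (row 0, col 1) and (row 3, col 3)
paired = employees[[0, 3], [1, 3]]

# np.ix_ selects rows 0 and 3 crossed with columns 1 and 3
block = employees[np.ix_([0, 3], [1, 3])]

print(paired.tolist())  # [36, 80000]
print(block.tolist())   # [[36, 110000], [27, 80000]]
```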

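We can even filter rows or columns by conditions. A comparison on a column produces an array of True/False values, and indexing with that array keeps only the positions where the condition is True.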

In [18]:
employees[employees[:,1] > 30]  #rows in which age (2nd column) > 30

array([[100320,     36,      5, 110000],
       [100212,     45,     12, 133000]])

In [19]:
employees[employees[:,-1] < 110000]  #rows in which salary (last column) < 110000

array([[132201,     30,      3, 105000],
       [143695,     27,      1,  80000]])
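Conditions can also be combined. The sketch below uses & (and) and | (or) on the same table. Each comparison needs its own parentheses because these operators bind more tightly than > and <. The variable names are introduced for illustration.

```python
import numpy as np

# Same employee table as above, redefined so the block runs on its own
employees = np.array([
    [100320, 36, 5, 110000],
    [132201, 30, 3, 105000],
    [100212, 45, 12, 133000],
    [143695, 27, 1, 80000],
])

age = employees[:, 1]      # 2nd column
salary = employees[:, -1]  # last column

mask = age > 30  # one True/False value per row
older_and_high = employees[(age > 30) & (salary >= 110000)]  # both conditions hold
young_or_low = employees[(age < 30) | (salary < 100000)]     # either condition holds

print(mask.tolist())                  # [True, False, True, False]
print(older_and_high[:, 0].tolist())  # [100320, 100212]
print(young_or_low[:, 0].tolist())    # [143695]
```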

Operations with a single number are applied to every element of the array.

In [20]:
employees + 2

array([[100322,     38,      7, 110002],
       [132203,     32,      5, 105002],
       [100214,     47,     14, 133002],
       [143697,     29,      3,  80002]])

In [21]:
employees / 2

array([[5.01600e+04, 1.80000e+01, 2.50000e+00, 5.50000e+04],
       [6.61005e+04, 1.50000e+01, 1.50000e+00, 5.25000e+04],
       [5.01060e+04, 2.25000e+01, 6.00000e+00, 6.65000e+04],
       [7.18475e+04, 1.35000e+01, 5.00000e-01, 4.00000e+04]])
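The switch to scientific notation in the last output comes from the change of data type: true division returns floats. As a small sketch, the block below compares / with floor division //, which keeps the integer type. The names halved and floored are introduced for illustration.

```python
import numpy as np

# Same employee table as above, redefined so the block runs on its own
employees = np.array([
    [100320, 36, 5, 110000],
    [132201, 30, 3, 105000],
    [100212, 45, 12, 133000],
    [143695, 27, 1, 80000],
])

halved = employees / 2    # true division: the result is a float array
floored = employees // 2  # floor division: the result stays an integer array

print(halved.dtype)         # float64
print(halved[0].tolist())   # [50160.0, 18.0, 2.5, 55000.0]
print(floored[0].tolist())  # [50160, 18, 2, 55000]
```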

We can also slice a subset, transform it, and assign the result back to the array.

In [23]:
employees[:,1] = employees[:,1] - 10
employees[:,2] = employees[:,2] * 2
employees

array([[100320,     26,     10, 110000],
       [132201,     20,      6, 105000],
       [100212,     35,     24, 133000],
       [143695,     17,      2,  80000]])
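One caveat when assigning back into a slice: the array keeps its data type. If the array holds integers and the transformation produces decimals, the decimals are dropped on assignment. The sketch below illustrates this with a standalone ages array (introduced here for illustration, using the original ages) and one way around it, converting to float first.

```python
import numpy as np

ages = np.array([36, 30, 45, 27])

# Assigning float results into an integer array drops the decimal part
ages_int = ages.copy()
ages_int[:] = ages_int / 8
print(ages_int.tolist())    # [4, 3, 5, 3]

# Converting to float first keeps the decimals
ages_float = ages.astype(float)
ages_float[:] = ages_float / 8
print(ages_float.tolist())  # [4.5, 3.75, 5.625, 3.375]
```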

Numpy comes with numerous math functions that are useful for creating new features in data

In [22]:
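np.log(employees)  #run before the in-place transformation above (see the cell numbers), so the output reflects the original values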

array([[11.51612036,  3.58351894,  1.60943791, 11.60823564],
       [11.79207877,  3.40119738,  1.09861229, 11.56171563],
       [11.51504322,  3.80666249,  2.48490665, 11.79810441],
       [11.87544828,  3.29583687,  0.        , 11.28978191]])

In [24]:
np.exp(employees[:,2])

array([2.20264658e+04, 4.03428793e+02, 2.64891221e+10, 7.38905610e+00])

Creating a log version of the data is fairly common and can be done quite simply with numpy.

In [28]:
log_age = np.log(data[:,1])
log_age

array([3.25809654, 2.99573227, 3.55534806, 2.83321334])

In [29]:
log_data = np.log(data) #creating a log version of the whole data set
data_with_log = np.hstack([data, log_data]) #horizontal concatenation: the log columns are appended to the right of the original columns
data_with_log

array([[1.00320000e+05, 2.60000000e+01, 1.00000000e+01, 9.70000000e+01,
        1.10000000e+05, 1.15161204e+01, 3.25809654e+00, 2.30258509e+00,
        4.57471098e+00, 1.16082356e+01],
       [1.32201000e+05, 2.00000000e+01, 6.00000000e+00, 9.50000000e+01,
        1.05000000e+05, 1.17920788e+01, 2.99573227e+00, 1.79175947e+00,
        4.55387689e+00, 1.15617156e+01],
       [1.00212000e+05, 3.50000000e+01, 2.40000000e+01, 9.50000000e+01,
        1.33000000e+05, 1.15150432e+01, 3.55534806e+00, 3.17805383e+00,
        4.55387689e+00, 1.17981044e+01],
       [1.43695000e+05, 1.70000000e+01, 2.00000000e+00, 9.00000000e+01,
        8.00000000e+04, 1.18754483e+01, 2.83321334e+00, 6.93147181e-01,
        4.49980967e+00, 1.12897819e+01]])
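Appending a single column such as log_age takes one extra step, because a sliced column is 1D while the data is 2D. The sketch below, on a small two-column sample (an assumption made for brevity, not the full dataset), shows two options: reshaping the column before hstack, or using np.column_stack, which treats 1D inputs as columns.

```python
import numpy as np

# Small sample: age and salary for two employees
data = np.array([
    [36.0, 110000.0],
    [30.0, 105000.0],
])

log_age = np.log(data[:, 0])  # 1D array, shape (2,)

# Option 1: reshape to a column of shape (2, 1), then stack horizontally
with_log = np.hstack([data, log_age.reshape(-1, 1)])

# Option 2: column_stack turns 1D inputs into columns automatically
with_log_2 = np.column_stack([data, log_age])

print(log_age.shape)   # (2,)
print(with_log.shape)  # (2, 3)
print(np.array_equal(with_log, with_log_2))  # True
```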

hstack concatenates arrays horizontally, that is, it adds columns. Numpy also has a vstack function, which combines arrays vertically by adding rows. This is useful for combining data from multiple sources, though you have to make sure that the columns are exactly the same, and in the same order, in all data sources before combining them.

In [30]:
data2 = np.loadtxt('employee_data_more.txt', dtype=np.int32)
data2

array([[115452,     26,      4,     92, 100000],
       [164201,     25,      1,     91,  95000],
       [105265,     65,     22,     91, 143000],
       [113151,     47,     10,     96, 180000]])

In [31]:
merged_data = np.vstack([data,data2])
merged_data

array([[100320,     26,     10,     97, 110000],
       [132201,     20,      6,     95, 105000],
       [100212,     35,     24,     95, 133000],
       [143695,     17,      2,     90,  80000],
       [115452,     26,      4,     92, 100000],
       [164201,     25,      1,     91,  95000],
       [105265,     65,     22,     91, 143000],
       [113151,     47,     10,     96, 180000]])
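A simple guard can catch the most obvious mismatch before stacking. The helper below, stack_rows, is hypothetical (it is not part of numpy) and only compares column counts. It cannot tell whether the columns mean the same thing or come in the same order. The two arrays reuse the first rows of data and data2.

```python
import numpy as np


def stack_rows(first, second):
    """Stack two 2D arrays vertically, refusing arrays with different column counts."""
    if first.shape[1] != second.shape[1]:
        raise ValueError(
            f"column counts differ: {first.shape[1]} vs {second.shape[1]}"
        )
    return np.vstack([first, second])


data = np.array([
    [100320, 26, 10, 97, 110000],
    [132201, 20, 6, 95, 105000],
])
data2 = np.array([
    [115452, 26, 4, 92, 100000],
    [164201, 25, 1, 91, 95000],
])

merged = stack_rows(data, data2)
print(merged.shape)  # (4, 5)
```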

Simple statistics like mean, standard deviation, sum, min, and max are available as array methods. You can either slice the data to obtain the statistics for specific columns, or apply them to the whole dataset (but remember to set axis=0 to obtain the statistics for each column; without it, the statistic is computed over all values in the array).

In [32]:
merged_data[:,1].mean(), merged_data[:,1].std()

(32.625, 15.041089554949137)

In [33]:
merged_data[:,2].min(), merged_data[:,2].max()

(1, 24)

In [34]:
merged_data.mean(axis=0), merged_data.std(axis=0)

(array([1.21812125e+05, 3.26250000e+01, 9.87500000e+00, 9.33750000e+01,
        1.18250000e+05]),
 array([2.15097540e+04, 1.50410896e+01, 8.19203119e+00, 2.49687304e+00,
        3.00489184e+04]))
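The sketch below illustrates the axis argument on a tiny array (small, introduced for illustration) and shows that std() computes the population standard deviation by default. Passing ddof=1 gives the sample version instead. The ages array copies the age column of merged_data.

```python
import numpy as np

small = np.array([
    [1, 2, 3],
    [4, 5, 6],
])

print(small.sum())                 # 21 (all values)
print(small.sum(axis=0).tolist())  # [5, 7, 9] (one value per column)
print(small.sum(axis=1).tolist())  # [6, 15] (one value per row)

# Age column of merged_data
ages = np.array([26, 20, 35, 17, 26, 25, 65, 47])

pop_std = ages.std()           # divides by n (default, ddof=0)
sample_std = ages.std(ddof=1)  # divides by n - 1

print(ages.mean())           # 32.625
print(sample_std > pop_std)  # True
```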

Final note: numpy, while powerful, is quite basic and not very user-friendly for data analytics projects. It is better suited to datasets that are fairly clean and require little preliminary analysis and data transformation. Next week, we will discuss Pandas, which is far friendlier to analytics users. Nevertheless, using numpy efficiently is very important and contributes greatly to understanding Pandas.