# Python Data Structures for Data Science: Numpy and Pandas

In the last practical, we took a look at some basic concepts in Python. We covered:
1. Basic data types like numerics and strings.
2. Compound data types like lists, tuples, and dictionaries.
3. Control flow devices like if, else, for, and while.

These concepts give us access to some powerful abilities. However, most data scientists and analysts are not writing for loops over lists to understand their data. Instead, they use higher-level objects to structure their data, namely numpy arrays and pandas dataframes. In this practical, we introduce these libraries and objects, giving a glimpse into how data analysis is done in practice.

For this practical, we generally follow the standard tutorials for these libraries found here:
- [Numpy](https://numpy.org/devdocs/user/quickstart.html)
- [Pandas](https://pandas.pydata.org/docs/user_guide/index.html)

When first encountering numpy arrays, it is natural to wonder why we even need them. After all, they look a lot like the ordinary lists we have already learned about. There are, as we will see, several advantages to their use. The first we will look at is speed. Let's run a small experiment: calculating the mean of one million numbers, first by iterating over a list and then by calling a numpy array method.

In [59]:
import numpy as np
import pandas as pd
import time

In [2]:
heights = np.random.normal(size=1000000)  # this is a numpy array
heights_list = heights.tolist()  # this is a list

In [3]:
start_time = time.perf_counter()
sum_nums = 0
for i in heights_list:
    sum_nums = sum_nums + i
print('Mean: ', sum_nums / len(heights_list))
end_time = time.perf_counter()
list_time = end_time - start_time
print('Elapsed time: ', list_time)

Mean:  0.0017307978739592488
Elapsed time:  0.07947445799982233


In [4]:
start_time = time.perf_counter()
print('Mean: ', heights.mean())
end_time = time.perf_counter()
numpy_time = end_time - start_time
print('Elapsed time: ', numpy_time)

Mean:  0.001730797873959119
Elapsed time:  0.0011659589999908349


In [5]:
numpy_time / list_time

0.014670864442931355

There is in fact a big speed advantage in this case: the numpy version took about 1.5% of the loop's time, which is roughly 68 times faster on this run.
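
A single `time.perf_counter()` measurement can be noisy. As a hedged sketch (not part of the original experiment), the standard-library `timeit` module can repeat each measurement and keep the best time. The names `sample`, `sample_list`, and `loop_mean` are introduced here for illustration, and the array is smaller than the one above so the sketch runs quickly.

```python
import timeit

import numpy as np

# Seeded, smaller sample than in the cells above (an assumption for speed).
rng = np.random.default_rng(0)
sample = rng.normal(size=100_000)
sample_list = sample.tolist()


def loop_mean(values):
    """Mean computed with an explicit Python loop, as in the list cell above."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


# Repeat each measurement and keep the best time to reduce noise.
loop_best = min(timeit.repeat(lambda: loop_mean(sample_list), number=5, repeat=3))
numpy_best = min(timeit.repeat(lambda: sample.mean(), number=5, repeat=3))

print('Loop best: ', loop_best)
print('Numpy best: ', numpy_best)
print('Ratio (numpy / loop): ', numpy_best / loop_best)
```

The exact ratio depends on the machine and on the array size, so treat any single number as indicative only.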

Another advantage is the large number of built-in functions, which let us transform our data quickly and with very few lines of code:

In [6]:
test_np_array = np.array([1, 2, 3, 4, 5])

In [9]:
# log transform
my_logs = np.log(test_np_array)
my_logs

array([0.        , 0.69314718, 1.09861229, 1.38629436, 1.60943791])

In [10]:
# exponential transform
np.exp(my_logs)

array([1., 2., 3., 4., 5.])

In [13]:
# square roots
my_sqrts = np.sqrt(test_np_array)
my_sqrts

array([1.        , 1.41421356, 1.73205081, 2.        , 2.23606798])

In [14]:
np.square(my_sqrts)

array([1., 2., 3., 4., 5.])

In [16]:
my_sqrts * my_logs

array([0.        , 0.98025814, 1.9028523 , 2.77258872, 3.59881258])
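
For comparison, here is a minimal sketch of the same kind of calculation with a plain list. The names `values_list` and `values_arr` are illustrative and not part of the cells above.

```python
import math

import numpy as np

values_list = [1, 2, 3, 4, 5]
values_arr = np.array(values_list)

# With a plain list, each transform needs an explicit loop or comprehension.
list_result = [math.sqrt(v) * math.log(v) for v in values_list]

# With an array, the same calculation is one vectorized expression.
array_result = np.sqrt(values_arr) * np.log(values_arr)

print(np.allclose(list_result, array_result))

# Multiplying two lists together is not defined at all.
list_product_failed = False
try:
    values_list * values_list
except TypeError as err:
    list_product_failed = True
    print('TypeError: ', err)
```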

Yet another advantage is in providing a natural setting for us to do linear algebra - something lists do not provide. We now demonstrate this connection and introduce the numpy library and its core objects more systematically.

The main object in numpy is the "ndarray" (n-dimensional array) or just "array". Let's see some examples:

In [78]:
# ndarrays can behave like vectors!

vec_1 = np.array([1, 2, 3, 4])
vec_2 = np.array([2, 4, 6, 12])

print('Vector sum: ', vec_1 + vec_2)
print('Element-wise product: ', vec_1 * vec_2)
print('Dot product: ', np.dot(vec_1, vec_2))

Vector sum:  [ 3  6  9 16]
Element-wise product:  [ 2  8 18 48]
Dot product:  76
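
The dot product is the sum of the element-wise product. The sketch below checks this with the same values as `vec_1` and `vec_2`, under the illustrative names `vec_a` and `vec_b`. For 1-dimensional arrays, the `@` operator gives the same result as `np.dot`.

```python
import numpy as np

vec_a = np.array([1, 2, 3, 4])
vec_b = np.array([2, 4, 6, 12])

dot = np.dot(vec_a, vec_b)                # built-in dot product
sum_of_products = (vec_a * vec_b).sum()   # element-wise product, then sum
operator_form = vec_a @ vec_b             # operator spelling

print(dot, sum_of_products, operator_form)
```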


In [47]:
# ndarrays can behave like matrices!
mat_1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
mat_2 = np.array([[1, 1], [2, 2], [3, 3]])
mat_3 = np.array([[2, 4, 6], [3, 6, 9], [4, 8, 12]])

print(mat_1)
print(mat_2)
print(mat_3)

[[1 2 3]
 [4 5 6]
 [7 8 9]]
[[1 1]
 [2 2]
 [3 3]]
[[ 2  4  6]
 [ 3  6  9]
 [ 4  8 12]]


In [48]:
# Matrix sums
mat_1 + mat_3

array([[ 3,  6,  9],
       [ 7, 11, 15],
       [11, 16, 21]])

In [49]:
# Matrix multiplication
mat_1 @ mat_2

array([[14, 14],
       [32, 32],
       [50, 50]])
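
Matrix multiplication requires the inner dimensions to agree: a (3, 3) array times a (3, 2) array gives a (3, 2) array. The sketch below rebuilds the same values under the illustrative names `mat_a` and `mat_b` and shows that reversing the order fails.

```python
import numpy as np

mat_a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])   # shape (3, 3)
mat_b = np.array([[1, 1], [2, 2], [3, 3]])            # shape (3, 2)

product = mat_a @ mat_b
print(product.shape)

# (3, 2) @ (3, 3) is not defined: the inner dimensions 2 and 3 differ.
reversed_failed = False
try:
    mat_b @ mat_a
except ValueError as err:
    reversed_failed = True
    print('ValueError: ', err)
```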

In [50]:
# ndarrays can be higher-order things (tensors)
ten_1 = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 1, 1], [2, 2, 2]]])
ten_1

array([[[1, 2, 3],
        [4, 5, 6]],

       [[1, 1, 1],
        [2, 2, 2]]])

In [53]:
# Each of these objects has many properties; shape, ndim, and size are useful to start with

print("vec_1: ", vec_1)
print("vec_1 - shape: ", vec_1.shape, " ndim: ", vec_1.ndim, " size: ", vec_1.size)

print("mat_1: ")
print(mat_1)
print("mat_1 - shape: ", mat_1.shape, " ndim: ", mat_1.ndim, " size: ", mat_1.size)

print("ten_1: ")
print(ten_1)
print("ten_1 - shape: ", ten_1.shape, " ndim: ", ten_1.ndim, " size: ", ten_1.size)

vec_1:  [1 2 3 4]
vec_1 - shape:  (4,)  ndim:  1  size:  4
mat_1: 
[[1 2 3]
 [4 5 6]
 [7 8 9]]
mat_1 - shape:  (3, 3)  ndim:  2  size:  9
ten_1: 
[[[1 2 3]
  [4 5 6]]

 [[1 1 1]
  [2 2 2]]]
ten_1 - shape:  (2, 2, 3)  ndim:  3  size:  12
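
These three properties are related: `ndim` is the length of `shape`, and `size` is the product of the entries of `shape`. A small sketch, using the same values as `ten_1` under the illustrative name `ten`:

```python
import math

import numpy as np

ten = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 1, 1], [2, 2, 2]]])

print(ten.ndim == len(ten.shape))         # number of dimensions
print(ten.size == math.prod(ten.shape))   # total number of elements
print(len(ten))                           # length of the first dimension only
```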


In [58]:
# Like lists, ndarrays can be sliced, with one set of indices per dimension

# Slicing a vector
print('Sliced vector:')
print(vec_1[:2])

# Slicing a matrix
print('Sliced matrix:')
print(mat_1[:2, 1:])

# Slicing a tensor
print('Sliced tensor:')
print(ten_1[0, 0:2, 2])

Sliced vector:
[1 2]
Sliced matrix:
[[2 3]
 [5 6]]
Sliced tensor:
[3 6]
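
Two details of slicing are easy to miss. An integer index removes a dimension while a slice keeps it, and a basic slice is a view onto the original array rather than a copy (unlike a list slice). The sketch below uses the illustrative names `mat`, `row_as_vector`, `row_as_matrix`, and `view`.

```python
import numpy as np

mat = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

row_as_vector = mat[0, :]     # integer index: 1-dimensional result
row_as_matrix = mat[0:1, :]   # slice: result stays 2-dimensional
print(row_as_vector.shape, row_as_matrix.shape)

# Writing through a basic slice changes the original array.
view = mat[:2, 1:]
view[0, 0] = 100
print(mat[0, 1])

# Use .copy() when an independent array is needed.
independent = mat[:2, 1:].copy()
independent[0, 0] = -1
print(mat[0, 1])
```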


I encourage you to review numpy in more detail [here](https://numpy.org/devdocs/user/quickstart.html), but for now we leave numpy behind and move on to pandas - in particular, the pandas series and dataframe.

The pandas dataframe is the common setting for most data science and analysis tasks that concern tabular data. It contains rows and columns of data along with labels for the rows and columns so that we can reference the data easily. A single column of a pandas dataframe is a Series, and it behaves very similarly to a 1-dimensional numpy array, with the notable exception that each element is labelled via the index.

We create a pandas Series by passing in the data and index. The data can be many things (like a dictionary or numpy array) and the index can be left blank if there is no natural one.

In [62]:
# creating a series from a numpy array and using the default index

my_series_1 = pd.Series(vec_1)
my_series_1

0    1
1    2
2    3
3    4
dtype: int64

In [65]:
# creating a series and using a specific index
my_series_2 = pd.Series(vec_1, index=['a','b','c','d'])
my_series_2

a    1
b    2
c    3
d    4
dtype: int64
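
As mentioned above, the data can also be a dictionary, in which case the keys become the index. A minimal sketch with made-up labels and values (`counts` is a hypothetical name):

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration.
counts = {'a': 1, 'b': 2, 'c': 3, 'd': 4}

series_from_dict = pd.Series(counts)
print(series_from_dict)

# Passing an index as well selects and orders entries by key;
# keys that are missing from the dictionary become NaN.
reindexed = pd.Series(counts, index=['d', 'a', 'z'])
print(reindexed)
```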

In [71]:
# subsetting is natural by index
print(my_series_2[['a','c']])

# you can also access elements like a dictionary
print(my_series_2['b'])

a    1
c    3
dtype: int64
2


In [70]:
# things that work on numpy arrays generally work on Series
np.cos(my_series_2)

a    0.540302
b   -0.416147
c   -0.989992
d   -0.653644
dtype: float64
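
One place where a Series differs from an array is arithmetic between two Series, which matches elements by label rather than by position. A small sketch with illustrative names `left` and `right`:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['c', 'b', 'a'])

# Series arithmetic pairs up elements that share a label.
aligned = left + right
print(aligned)

# The underlying arrays are added position by position instead.
positional = left.to_numpy() + right.to_numpy()
print(positional)
```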

Pandas series also have a name attribute that we can set. This becomes natural when we consider dataframes: the columns of a dataframe are series, and each column name is the name of its series.

In [75]:
# I have a name
named_series = pd.Series(vec_1, index=['foo', 'bar', 'baz', 'bat'], name='integers')
print(named_series)
print(named_series.name)

foo    1
bar    2
baz    3
bat    4
Name: integers, dtype: int64
integers
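
The link between a series name and a column name can be seen in a short sketch. The names `named`, `frame`, and `column` are illustrative.

```python
import pandas as pd

named = pd.Series([1, 2, 3, 4], name='integers')

# Turning the Series into a dataframe uses its name as the column name.
frame = named.to_frame()
print(list(frame.columns))

# Selecting that column gives back a Series carrying the same name.
column = frame['integers']
print(type(column).__name__, column.name)
```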


Armed with the knowledge that a series is like a numpy array with an index and a name, we can move on to pandas dataframes, which are like a bunch of Series stuck side by side.

We can create pandas dataframes in many ways - we generally need to specify:
1. The data.
2. The column names (we can ignore this and use defaults).
3. The row names (we can ignore this and use defaults).

Two common ways are from a dictionary or from a csv file.

In [81]:
my_dictionary = {'integers': pd.Series(vec_1, index=['a', 'b', 'c', 'd']),
                 'even_integers': pd.Series(vec_2, index=['a', 'b', 'c', 'f'])}
my_df = pd.DataFrame(my_dictionary)

In [83]:
# note how the two indexes are aligned by label and NaNs fill the gaps!
my_df

|   | integers | even_integers |
| --- | --- | --- |
| a | 1.0 | 2.0 |
| b | 2.0 | 4.0 |
| c | 3.0 | 6.0 |
| d | 4.0 | NaN |
| f | NaN | 12.0 |


In [87]:
# there are many options for reading from a csv depending on how the .csv file is structured
my_df_2 = pd.read_csv('lecture_02_data.csv',
                      names =['sex', 'length', 'diameter', 'height',
                              'whole_weight', 'shucked_weight', 'viscera_weight',
                              'shell_weight', 'rings'])
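
The `names` argument matters when a file has no header line. The sketch below reads two rows taken from this dataset out of an in-memory buffer (`io.StringIO` stands in for the file, which is an assumption made so the example is self-contained) and compares reading with and without `names`.

```python
import io

import pandas as pd

# Two data rows with no header line, held in memory instead of a file.
raw = io.StringIO(
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7\n"
)
columns = ['sex', 'length', 'diameter', 'height',
           'whole_weight', 'shucked_weight', 'viscera_weight',
           'shell_weight', 'rings']

# With names, every line is treated as data.
with_names = pd.read_csv(raw, names=columns)
print(with_names.shape)

# Without names, the first line is consumed as the header.
raw.seek(0)
without_names = pd.read_csv(raw)
print(without_names.shape)
```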

In [88]:
my_df_2.head(10)

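|   | sex | length | diameter | height | whole_weight | shucked_weight | viscera_weight | shell_weight | rings |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |
| 5 | I | 0.425 | 0.3 | 0.095 | 0.3515 | 0.141 | 0.0775 | 0.12 | 8 |
| 6 | F | 0.53 | 0.415 | 0.15 | 0.7775 | 0.237 | 0.1415 | 0.33 | 20 |
| 7 | F | 0.545 | 0.425 | 0.125 | 0.768 | 0.294 | 0.1495 | 0.26 | 16 |
| 8 | M | 0.475 | 0.37 | 0.125 | 0.5095 | 0.2165 | 0.1125 | 0.165 | 9 |
| 9 | F | 0.55 | 0.44 | 0.15 | 0.8945 | 0.3145 | 0.151 | 0.32 | 19 |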


However you create the dataframe, you can treat it as a dictionary of series objects keyed on column name. Let's use this strategy to create some columns:

In [92]:
my_df['A'] = my_df['integers'] * my_df['even_integers']
my_df['B'] = my_df['A'] <= my_df['even_integers']
my_df

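|   | integers | even_integers | A | B |
| --- | --- | --- | --- | --- |
| a | 1.0 | 2.0 | 2.0 | True |
| b | 2.0 | 4.0 | 8.0 | False |
| c | 3.0 | 6.0 | 18.0 | False |
| d | 4.0 | NaN | NaN | False |
| f | NaN | 12.0 | NaN | False |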


For subsetting (or filtering down) dataframes, the following table summarizes the main options:

| Operation | Syntax | Result |
| --------- | ------ | ------ |
|Select column | df[col] | Series |
|Select row by label | df.loc[label] | Series |
|Select row by integer location | df.iloc[loc] | Series |
|Slice rows | df[5:10] | DataFrame |
|Select rows by boolean vector | df[bool_vec] | DataFrame |
|Select by columns | df[[col1, ... , coln]] | DataFrame |


In [102]:
# Suppose we only want rows where column 'A' is not NaN
mask = ~my_df['A'].isna()
print(mask)
clean_df = my_df[mask]
print(clean_df)

a     True
b     True
c     True
d    False
f    False
Name: A, dtype: bool
   integers  even_integers     A      B
a       1.0            2.0   2.0   True
b       2.0            4.0   8.0  False
c       3.0            6.0  18.0  False
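
The mask above can also be written with `notna()`, and pandas provides `dropna` for this common case. The sketch below rebuilds a dataframe with the same values under the illustrative name `df` and checks that both approaches agree.

```python
import numpy as np
import pandas as pd

vec_1 = np.array([1, 2, 3, 4])
vec_2 = np.array([2, 4, 6, 12])
df = pd.DataFrame({'integers': pd.Series(vec_1, index=['a', 'b', 'c', 'd']),
                   'even_integers': pd.Series(vec_2, index=['a', 'b', 'c', 'f'])})
df['A'] = df['integers'] * df['even_integers']

# The integer inputs are stored as float64 because NaN is a floating-point value.
print(df.dtypes)

# Two equivalent ways to keep only the rows where 'A' is present.
via_mask = df[df['A'].notna()]
via_dropna = df.dropna(subset=['A'])
print(via_mask.equals(via_dropna))
```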


In [103]:
# Suppose we only want columns 'integers' and 'A':
my_df[['integers', 'A']]

|   | integers | A |
| --- | --- | --- |
| a | 1.0 | 2.0 |
| b | 2.0 | 8.0 |
| c | 3.0 | 18.0 |
| d | 4.0 | NaN |
| f | NaN | NaN |


In [104]:
# Suppose we only want the first 3 rows:
my_df[:3]

|   | integers | even_integers | A | B |
| --- | --- | --- | --- | --- |
| a | 1.0 | 2.0 | 2.0 | True |
| b | 2.0 | 4.0 | 8.0 | False |
| c | 3.0 | 6.0 | 18.0 | False |


In [108]:
# Suppose we want the third row (integer position 2):
# we could use the integer position
my_df.iloc[2]

integers           3.0
even_integers      6.0
A                 18.0
B                False
Name: c, dtype: object

In [109]:
# or the label
my_df.loc['b']

integers           2.0
even_integers      4.0
A                  8.0
B                False
Name: b, dtype: object
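
The result types listed in the subsetting table can be checked in one place. This sketch uses a small made-up dataframe (`df`, with hypothetical columns `x` and `y`) and prints the type returned by each kind of selection.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [10.0, 20.0, 30.0, 40.0]},
                  index=['a', 'b', 'c', 'd'])

results = {
    "df['x']": df['x'],                  # select column
    "df.loc['b']": df.loc['b'],          # select row by label
    "df.iloc[2]": df.iloc[2],            # select row by integer location
    "df[1:3]": df[1:3],                  # slice rows
    "df[df['x'] > 2]": df[df['x'] > 2],  # select rows by boolean vector
    "df[['x', 'y']]": df[['x', 'y']],    # select several columns
}

for syntax, result in results.items():
    print(syntax, '->', type(result).__name__)
```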

## Exercises
1. Create a pandas dataframe with four columns.
    1. Column 'A' should contain the integers from 0 to 10.
    2. Column 'B' should contain the values of 'A' but doubled.
    3. Column 'C' should contain the values of 'A' but squared.
    4. Column 'D' should contain the values of 'A' but exponentiated (with base e).
2. Create a dataframe from the `lecture_01_data.csv` file with the column names `['sepal_len', 'sepal_width', 'petal_len', 'petal_width', 'class']`.
    1. Create a new column 'sepal_ratio' which is 'sepal_len' / 'sepal_width'.
    2. Is 'sepal_ratio' ever NaN? If so, remove those rows.
    3. Is 'petal_len' ever 0? If so, remove those rows.
    4. Perform a log transform on 'petal_len' (create a new column with log base e of the petal_len column).
    5. Create a new boolean column that is True when petal_len is 1.4 or more and False otherwise.
3. For many modeling and analysis techniques, we need all feature values to be numeric. Starting with the dataframe created by exercise 2, do the following:
    1. Change the boolean column into 1's and 0's.
    2. One-hot encode the 'class' column using the pd.get_dummies() function.
    3. What is get_dummies doing?