<img src="https://github.com/Center-for-Health-Data-Science/PythonTsunami/blob/spring2022/figures/HeaDS_logo_large_withTitle.png?raw=1" width="300">

<img src="https://github.com/Center-for-Health-Data-Science/PythonTsunami/blob/spring2022/figures/tsunami_logo.PNG?raw=1" width="600">

# Numpy arrays

The basic data structure of scikit-learn is the numpy array. Other ML-oriented Python libraries, such as TensorFlow and PyTorch, use closely related array structures, so we'll spend some time getting to know how they work.

There are several ways to create np arrays. Here we'll make one from a list:

In [None]:
import numpy as np

arr = np.array([5,4,3,2,1]) #you can also use a tuple
arr

array([5, 4, 3, 2, 1])
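
As a small sketch of some of the other ways to create arrays, the cell below uses `np.zeros`, `np.arange` and `np.linspace`. The variable names are only illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

# From a tuple instead of a list
from_tuple = np.array((5, 4, 3, 2, 1))

# Five zeros (floats by default)
zeros = np.zeros(5)

# Integers from 0 up to, but not including, 5
counting = np.arange(5)

# Five evenly spaced values from 0 to 1, both ends included
spaced = np.linspace(0, 1, 5)

print(from_tuple)
print(zeros)
print(counting)
print(spaced)
```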

Unlike pandas dataframes and dataframes in R, np arrays do not have column or row names.

## Array dimensions and shape

Let's check the dimensions. We're using the `shape` property of the `ndarray` object here. It's an attribute, not a method, so no parentheses!

In [None]:
arr.shape

(5,)

The array we made is **one dimensional**. There is an x but no y-dimension. We could also say it has a depth of 1 (the same as a list, really), because it has length only along one dimension.

We can also see that with the `ndim` attribute:

In [None]:
arr.ndim

1

The shape of arrays is quite important for using them in scikit-learn.

Data is typically in the form of *n* rows x *m* columns, with observations being in the rows and features in the columns.

Let's load some data and look at it. This is the famous iris dataset.

In [None]:
#load example data
from sklearn.datasets import load_iris
iris = load_iris()

When we load data like this the resulting object is a `bunch`.

In [None]:
type(iris)

A bunch is a dictionary-like object with several contents inside it, like 'data', 'target', 'target_names', 'DESCR'. You can get a list of contents with `dir`:

In [None]:
dir(iris)

['DESCR',
 'data',
 'data_module',
 'feature_names',
 'filename',
 'frame',
 'target',
 'target_names']

Each of them can be accessed with a `.`. Let's take a look at the `data`:

In [None]:
iris.data

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

`iris.data` is a 2D numpy array with 150 rows and 4 columns. Each row is a flower, i.e. an observation, and the columns are features that have been measured on the flower.

In [None]:
print(type(iris.data))
print(iris.data.ndim)
print(iris.data.shape)


<class 'numpy.ndarray'>
2
(150, 4)


## Data types

Arrays have a data type depending on the type of their elements:

In [None]:
print(arr.dtype)
print(iris.data.dtype)

int64
float64


What about arrays of mixed type? If worst comes to worst, everything is a string.

In [None]:
nonono = np.array([1,2,3,'four', True])
print(nonono)
print(nonono.dtype)

['1' '2' '3' 'four' 'True']
<U21


Sometimes a specific type is needed. You can change it with `astype()` (and will get a `ValueError` if that's not possible).

In [None]:
floatie = arr.astype(float)
print(floatie.dtype)
floatie

float64


array([5., 4., 3., 2., 1.])
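
To make the `ValueError` case concrete, here is a minimal sketch with two made-up string arrays: one that converts cleanly and one that does not.

```python
import numpy as np

# Strings that look like numbers can be converted
numeric_strings = np.array(['1', '2', '3'])
as_ints = numeric_strings.astype(int)
print(as_ints, as_ints.dtype)

# A string that is not a number cannot be converted
mixed_strings = np.array(['1', '2', 'four'])
try:
    mixed_strings.astype(float)
    conversion_failed = False
except ValueError as err:
    conversion_failed = True
    print('ValueError:', err)
```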

## Accessing array contents

Just like lists, we can also slice arrays with this syntax:

```
array[start:stop:step]
```


We'll try this on the small test array we made in the beginning.

In [None]:
print(arr)
print(arr[3:])
print(arr[:-2])
print(arr[::-1])

[5 4 3 2 1]
[2 1]
[5 4 3]
[1 2 3 4 5]


As you can see in the example above, we can omit any of the parameters.

The default step size is 1 and we will go in steps of 1 unless something else is specified:

In [None]:
#no step specified
print(arr[0:4])

In [None]:
#in steps of size 2
print(arr[0:4:2])

Leaving out the `start` parameter means 'from the beginning':  

In [None]:
print(arr[:3])

Leaving out the `stop` parameter means all the way to the end:

In [None]:
print(arr[2:])

Leaving out both start and stop means 'everything':

In [None]:
print(arr[:])
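
One difference from lists is worth knowing here: slicing a numpy array with `start:stop:step` gives a *view* on the same data rather than a copy. The sketch below, using an illustrative array called `original`, shows what that means in practice and how `.copy()` avoids it.

```python
import numpy as np

original = np.array([5, 4, 3, 2, 1])

# A slice is a view: it shares memory with the original array
view = original[:2]
view[0] = 99
print(original)   # the first element changed here as well

# An explicit copy is independent of the original
independent = original[:2].copy()
independent[0] = -1
print(original)   # not affected by the edit to the copy
```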

In multidimensional arrays we need to specify the slice for every dimension, separated by a comma. It can also be left empty to get everything in that dimension ...

In [None]:
#first ten rows and all columns
iris.data[0:10,] #the columns slice is empty so we get all

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

... unless it is the first dimension! Then you need a colon `:` to specify everything.

In [None]:
#first two columns
#if a slice starts from 0, we can leave that out:
iris.data[:,:2]

array([[5.1, 3.5],
       [4.9, 3. ],
       [4.7, 3.2],
       [4.6, 3.1],
       [5. , 3.6],
       [5.4, 3.9],
       [4.6, 3.4],
       [5. , 3.4],
       [4.4, 2.9],
       [4.9, 3.1],
       [5.4, 3.7],
       [4.8, 3.4],
       [4.8, 3. ],
       [4.3, 3. ],
       [5.8, 4. ],
       [5.7, 4.4],
       [5.4, 3.9],
       [5.1, 3.5],
       [5.7, 3.8],
       [5.1, 3.8],
       [5.4, 3.4],
       [5.1, 3.7],
       [4.6, 3.6],
       [5.1, 3.3],
       [4.8, 3.4],
       [5. , 3. ],
       [5. , 3.4],
       [5.2, 3.5],
       [5.2, 3.4],
       [4.7, 3.2],
       [4.8, 3.1],
       [5.4, 3.4],
       [5.2, 4.1],
       [5.5, 4.2],
       [4.9, 3.1],
       [5. , 3.2],
       [5.5, 3.5],
       [4.9, 3.6],
       [4.4, 3. ],
       [5.1, 3.4],
       [5. , 3.5],
       [4.5, 2.3],
       [4.4, 3.2],
       [5. , 3.5],
       [5.1, 3.8],
       [4.8, 3. ],
       [5.1, 3.8],
       [4.6, 3.2],
       [5.3, 3.7],
       [5. , 3.3],
       [7. , 3.2],
       [6.4, 3.2],
       [6.9,

If it makes more sense to you, you can always use a colon to signify everything for all dimensions.

In [None]:
iris.data[:10,:] #we can specify : instead of putting nothing, for consistency

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

In [None]:
#first 5 rows and last two columns
iris.data[:5,-2:]

array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.4, 0.2]])

If you are confused which comes first, rows or columns, look at the shape:

In [None]:
iris.data.shape

(150, 4)
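
A related detail, sketched here on a small made-up array: selecting a row with a single index drops that dimension, while selecting it with a slice of length one keeps the result 2D. This matters later when a function expects a 2D input.

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])

row_as_vector = small[0, :]     # single index: the row comes back as 1D
row_as_matrix = small[0:1, :]   # slice of length 1: still 2D

print(row_as_vector, row_as_vector.shape)
print(row_as_matrix, row_as_matrix.shape)
```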

## Exercise 1 (10 mins)

Load the `wine` data, another example dataset from scikit-learn. You can do this in the same way as with the `iris` data, just replacing `iris` with `wine`. If you have trouble look at the code example [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html#sklearn.datasets.load_wine).

Check the dimensions and shape of the wine data sets. How many rows and columns are there?

Now display:
* the first 10 rows
* the last 10 rows
* the 2nd and 3rd column



## Reshaping arrays

You can change the `shape` of an array with the `reshape` method.

Let's make an example.
`arange` is a quick way to create a range of values. We're using it here to demo how to change the shape of an array.

In [None]:
long = np.array(np.arange(15))
long

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

As expected, the long array above is currently one dimensional. It has 15 elements along a single axis and no separate rows and columns, which is why its shape is `(15,)`. If it had 15 rows and 1 column it would be 2D, with shape `(15, 1)`.

In [None]:
print(long.ndim)
print(long.shape)

1
(15,)


Let's make it into a 2d array:

In [None]:
# We ask for 5 rows and 3 cols.
# The length of the original array must equal rows x cols (here 5 x 3 = 15).
mat = long.reshape(5,3)
mat


array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [None]:
print(mat.ndim)
print(mat.shape)

2
(5, 3)


Special feature: You can let numpy calculate the needed length for one dimension by passing `-1`.

In the below case we'll get 3 rows and it calculates that we need to have 5 columns since 3x5 = 15.

If the total length of the original array is not divisible by the number of rows you want, you'll get an error.

In [None]:
mat = long.reshape(3,-1)
mat

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
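
Here is a minimal sketch of the failing case, using an illustrative array `values` of length 15: asking for 4 rows cannot work because 15 is not divisible by 4.

```python
import numpy as np

values = np.arange(15)

try:
    values.reshape(4, -1)   # 15 cannot be split into 4 equal rows
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print('ValueError:', err)
```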

What if we want to return to a simpler time of 1-dimensional arrays? Flatten by only passing `-1`.

In [None]:
mat.reshape(-1)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

Observe that the above has **not changed** the object `mat`, only printed it to the console. `mat` is still a matrix and has dimensions 3 x 5.

In [None]:
print(mat)
print(mat.shape)

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]
(3, 5)


## Adding dimensions

Both our objects `arr` and `long` are 1D arrays, also called vectors. Scikit-learn usually wants to receive a matrix, i.e. a 2D array, even if there is only 1 column or only 1 row.

We can easily comply with this by using `reshape` (without adding any content!).

If your data is supposed to be a single column, i.e. one feature measured on as many samples as there are items in the array, use `reshape(-1,1)`:

In [None]:
long_2D = long.reshape(-1,1)
#original array
print(long)
#reshaped into a single column
print(long_2D)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
[[ 0]
 [ 1]
 [ 2]
 [ 3]
 [ 4]
 [ 5]
 [ 6]
 [ 7]
 [ 8]
 [ 9]
 [10]
 [11]
 [12]
 [13]
 [14]]


As you can see, the content stayed the same. Checking `ndim` and `shape` confirms that we now have a 2D array with 15 rows and 1 column.

In [None]:
print(long.shape)
print(long.ndim)
print(long_2D.shape)
print(long_2D.ndim)

(15,)
1
(15, 1)
2


Conversely, if your data is supposed to be 15 features measured on one sample it is `reshape(1,-1)`

In [None]:
long_2D = long.reshape(1,-1)
print(long_2D)
print(long_2D.shape)
print(long_2D.ndim)

[[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]]
(1, 15)
2


Remember, if you are confused which one you need always check the `shape` and it will be (rows, columns).
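
As an aside, numpy also offers `np.newaxis` for adding a dimension while indexing. The sketch below, on an illustrative array `vector`, gives the same results as the two `reshape` calls above; which spelling to use is a matter of taste.

```python
import numpy as np

vector = np.arange(5)

as_column = vector[:, np.newaxis]   # same result as vector.reshape(-1, 1)
as_row = vector[np.newaxis, :]      # same result as vector.reshape(1, -1)

print(as_column.shape)
print(as_row.shape)
```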

## Joining arrays

Sometimes we'll want to merge data sets. For this we can use two functions: `concatenate` and `stack`.

To join arrays we use `concatenate`:

In [None]:
arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

together = np.concatenate((arr1, arr2))

print(together)

[1 2 3 4 5 6]


Note that `concatenate` preserves the number of dimensions: the new array is 1D, just like the two original ones, only longer.

In [None]:
print(together.shape)
print(arr1.shape)
print(arr2.shape)

(6,)
(3,)
(3,)


When we join n-dim arrays we have to specify the axis:

In [None]:
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])

print('arr1:\n', arr1, end = '\n\n')
print('arr2:\n',arr2)

arr1:
 [[1 2]
 [3 4]]

arr2:
 [[5 6]
 [7 8]]


Do you want them stacked next to each other or on top of each other?

In [None]:
# Next to each other.
# The resulting array has the same number of rows as the two original ones
# but 4 columns instead of 2.
arr = np.concatenate((arr1, arr2), axis=1)

print(arr)
print(arr.shape)

[[1 2 5 6]
 [3 4 7 8]]
(2, 4)


In [None]:
# On top of each other
# The resulting array has the same number of cols as the two original ones
# but 4 rows instead of 2.
arr = np.concatenate((arr1, arr2), axis=0)

print(arr)
print(arr.shape)

[[1 2]
 [3 4]
 [5 6]
 [7 8]]
(4, 2)


To arrange two 1D arrays into one 2D array (which seems like a natural thing to want to do), we instead need `stack`:

In [None]:
arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

side_by_side = np.stack((arr1, arr2), axis=1)
print(side_by_side)
print(side_by_side.shape, end = '\n\n')

below = np.stack((arr1, arr2), axis=0)
print(below)
print(below.shape)

[[1 4]
 [2 5]
 [3 6]]
(3, 2)

[[1 2 3]
 [4 5 6]]
(2, 3)


Alternatively, if you find the axes confusing you can choose between:

* stacking horizontally: `hstack`
* stacking vertically: `vstack`
* stacking in depth: `dstack`

(Though none of these quite does side by side since depth adds an extra dimension.)

In [None]:
h = np.hstack((arr1, arr2))
print(h)
print(h.shape)

[1 2 3 4 5 6]
(6,)


In [None]:
v = np.vstack((arr1, arr2))
print(v)
print(v.shape)

[[1 2 3]
 [4 5 6]]
(2, 3)


In [None]:
d = np.dstack((arr1, arr2))
print(d)
print(d.shape)

[[[1 4]
  [2 5]
  [3 6]]]
(1, 3, 2)
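
If you want the side-by-side arrangement of two 1D arrays without thinking about axes, numpy also provides `np.column_stack`. A minimal sketch with two illustrative arrays:

```python
import numpy as np

first = np.array([1, 2, 3])
second = np.array([4, 5, 6])

# Each 1D array becomes one column of the 2D result
columns = np.column_stack((first, second))
print(columns)
print(columns.shape)
```

This gives the same result as `np.stack((first, second), axis=1)`.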


## Filtering arrays

We can also subset arrays based on conditions by using a boolean mask.

In [None]:
arr = np.array([41, 42, 43, 44])

x = [True, False, True, False]

newarr = arr[x]

print(arr)
print(newarr)

[41 42 43 44]
[41 43]


Usually we will create the mask by capturing the output of a condition instead of writing it manually, i.e.:

In [None]:
filter_arr = arr > 42
print(filter_arr)
arr[filter_arr]

[False False  True  True]


array([43, 44])

In [None]:
filter_arr = arr % 2 == 0
print(filter_arr)
arr[filter_arr]

[False  True False  True]


array([42, 44])

In [None]:
filter_arr = arr == 42
print(filter_arr)
arr[filter_arr]

[False  True False False]


array([42])

You can also write the condition directly into the selection instead of saving it. Base R users might find this syntax familiar:

In [None]:
arr[arr % 2 == 0]

array([42, 44])

In [None]:
arr[arr > 42]

array([43, 44])
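
Masks can also be combined, and they work on 2D arrays too. The sketch below uses made-up data: the first part combines two conditions with `&`, the second uses a condition on one column to select whole rows.

```python
import numpy as np

numbers = np.array([41, 42, 43, 44])

# Combine conditions with & (and) or | (or); each condition gets its own parentheses
even_and_large = numbers[(numbers > 41) & (numbers % 2 == 0)]
print(even_and_large)

# In a 2D array, a mask built from one column selects whole rows
table = np.array([[5.1, 3.5],
                  [4.9, 3.0],
                  [6.2, 2.9]])
long_rows = table[table[:, 0] > 5.0]
print(long_rows)
```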

For finer control of data selection, cleaning and replacing NAs you'll probably want to switch to pandas for ease of use. You can always convert a pandas DataFrame back into a numpy array when you're done.

## Exercise 2 (10 mins)

1. Create a 1D array from the numbers 5 to 24.
2. Reshape the array to have 5 rows.
3. Verify the shape.
4. Join the reshaped array with this array `plus_this = np.array([1,2,3,4])` such that `plus_this` sits on top and you have all the numbers in the proper order. Make sure that the dimensions of the two arrays match so you can join them. The dimensions of the combined array should be 6 x 4.



(5, 4)
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]
 [17 18 19 20]
 [21 22 23 24]]


# Pandas Dataframes

The other data structure we will use a lot is the dataframe from the pandas package. Pandas is a huge library so here we will only cover the basics of how to work with dataframes.  

## Loading data with the pandas csv reader

In [None]:
import pandas as pd

The pandas package has a number of read functions you can use to import data. They are very commonly used.

In [None]:
#using pandas csv reader to load data from our github repo
link_to_file = "https://raw.githubusercontent.com/Center-for-Health-Data-Science/Python_part2/main/data/diabetes.csv"
diabetes_df = pd.read_csv(link_to_file)

#diabetes_df = pd.read_csv("diabetes.csv")
diabetes_df.head()

Unnamed: 0,ID,Age,Sex,BloodPressure,GeneticRisk,BMI,PhysicalActivity,Married,Work,Smoker,Diabetes
0,9046,34,Male,84,0.619,24.7,93,Yes,Self-employed,Unknown,0
1,51676,25,Male,74,0.591,22.5,102,No,Public,Unknown,0
2,31112,30,Male,0,0.839,32.3,75,Yes,Private,Former,1
3,60182,50,Male,80,0.178,34.5,98,Yes,Self-employed,Unknown,1
4,1665,27,Female,60,0.206,26.3,82,Yes,Private,Never,0


The object we get is a pandas dataframe.

In [None]:
type(diabetes_df)
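
If you want to try `read_csv` without a file or an internet connection, you can also pass it a file-like object. The sketch below wraps a short, made-up CSV text in `io.StringIO`; the column names only imitate the diabetes data.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = """ID,Age,Sex
1,34,Male
2,25,Female
3,50,Male
"""

toy_df = pd.read_csv(io.StringIO(csv_text))
print(toy_df)
print(type(toy_df))
```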

## Rows, columns, shape

Pandas dataframes have column names and row names (often referred to as the index).

In [None]:
diabetes_df.columns

Index(['ID', 'Age', 'Sex', 'BloodPressure', 'GeneticRisk', 'BMI',
       'PhysicalActivity', 'Married', 'Work', 'Smoker', 'Diabetes'],
      dtype='object')

The output of `.columns` is iterable, so we can loop over it. This will come in handy later!

In [None]:
for col in diabetes_df.columns:
    print(col)

ID
Age
Sex
BloodPressure
GeneticRisk
BMI
PhysicalActivity
Married
Work
Smoker
Diabetes


In [None]:
#row names:
diabetes_df.index

RangeIndex(start=0, stop=532, step=1)

The default index (row names) you get if you do not specifically set the index is numeric, starting from 0 and going up to length of the df - 1. It is called a range index since it is based on a range of numbers: `range(0,len(df))`. Remember the upper bound is excluded in `range` so it stops at len(df)-1.  
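
The index does not have to be numeric. As a small sketch on a made-up dataframe, `set_index` turns one of the columns into the row names:

```python
import pandas as pd

people = pd.DataFrame({'name': ['Ann', 'Bo', 'Cy'], 'age': [34, 25, 50]})
print(people.index)            # default RangeIndex: 0, 1, 2

by_name = people.set_index('name')
print(by_name.index)           # the names are now the row labels
print(by_name.loc['Bo', 'age'])
```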

Just like a numpy array, a dataframe has dimensions and a shape.

In [None]:
diabetes_df.shape

(532, 11)

Pandas dataframes are always 2D because they are tables.

In [None]:
diabetes_df.ndim

2

## Counting instances

You can count the occurrences of unique instances in a column with `value_counts`:

In [None]:
#here's how you count categorical variables
diabetes_df.value_counts('Work')

Work
Private          283
Public           154
Self-employed     89
Retired            6
dtype: int64
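
`value_counts` is also available on a single column, where the `normalize=True` option returns proportions instead of counts. A minimal sketch on a made-up dataframe:

```python
import pandas as pd

jobs = pd.DataFrame({'Work': ['Private', 'Public', 'Private', 'Retired']})

counts = jobs['Work'].value_counts()
shares = jobs['Work'].value_counts(normalize=True)

print(counts)
print(shares)
```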

You can also check the number of unique elements in each column with `nunique`. Typically, categorical variables will have few unique values and numerical variables will have a number of unique values close to the number of observations (though measurements can of course repeat).

In [None]:
diabetes_df.nunique()

ID                  532
Age                  49
Sex                   2
BloodPressure        44
GeneticRisk         378
BMI                 223
PhysicalActivity    110
Married               2
Work                  4
Smoker                4
Diabetes              2
dtype: int64

## Subsetting dataframes

Pandas gives us fine-grained control to either extract data we want or discard data we do not want.

Let's have another look at our dataframe:

In [None]:
diabetes_df.head()

Unnamed: 0,ID,Age,Sex,BloodPressure,GeneticRisk,BMI,PhysicalActivity,Married,Work,Smoker,Diabetes
0,9046,34,Male,84,0.619,24.7,93,Yes,Self-employed,Unknown,0
1,51676,25,Male,74,0.591,22.5,102,No,Public,Unknown,0
2,31112,30,Male,0,0.839,32.3,75,Yes,Private,Former,1
3,60182,50,Male,80,0.178,34.5,98,Yes,Self-employed,Unknown,1
4,1665,27,Female,60,0.206,26.3,82,Yes,Private,Never,0


### Omitting columns

We might want to remove the ID column when we analyze this data since this is not really a measured feature and will not help us to understand diabetes incidence.

In [None]:
diabetes_no_ID = diabetes_df.drop(columns = ['ID'])
diabetes_no_ID.head()

Unnamed: 0,Age,Sex,BloodPressure,GeneticRisk,BMI,PhysicalActivity,Married,Work,Smoker,Diabetes
0,34,Male,84,0.619,24.7,93,Yes,Self-employed,Unknown,0
1,25,Male,74,0.591,22.5,102,No,Public,Unknown,0
2,30,Male,0,0.839,32.3,75,Yes,Private,Former,1
3,50,Male,80,0.178,34.5,98,Yes,Self-employed,Unknown,1
4,27,Female,60,0.206,26.3,82,Yes,Private,Never,0


If we do not want to create a new object but instead overwrite the existing dataframe we can use the `inplace` argument:

In [None]:
diabetes_df.drop(columns = ['ID'], inplace = True)
diabetes_df.head()

Unnamed: 0,Age,Sex,BloodPressure,GeneticRisk,BMI,PhysicalActivity,Married,Work,Smoker,Diabetes
0,34,Male,84,0.619,24.7,93,Yes,Self-employed,Unknown,0
1,25,Male,74,0.591,22.5,102,No,Public,Unknown,0
2,30,Male,0,0.839,32.3,75,Yes,Private,Former,1
3,50,Male,80,0.178,34.5,98,Yes,Self-employed,Unknown,1
4,27,Female,60,0.206,26.3,82,Yes,Private,Never,0


However, in-place operations are generally discouraged in pandas because they can have unintended consequences, so it is cleaner to use re-assignment. You can reassign to the same name, i.e.

```python
diabetes_df = diabetes_df.drop(columns = ['ID'])
```
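
A short sketch on a made-up dataframe shows two things about `drop`: it returns a new dataframe and leaves the original alone, and dropping a column that is already gone raises a `KeyError` (which can happen when a notebook cell is run twice). The `errors='ignore'` argument of `drop` suppresses that error.

```python
import pandas as pd

toy = pd.DataFrame({'ID': [1, 2], 'Age': [34, 25]})

without_id = toy.drop(columns=['ID'])
print(list(toy.columns))         # the original still has ID
print(list(without_id.columns))

# Dropping a column that is already gone raises a KeyError ...
try:
    without_id.drop(columns=['ID'])
    second_drop_failed = False
except KeyError:
    second_drop_failed = True

# ... unless we ask pandas to ignore missing labels
still_fine = without_id.drop(columns=['ID'], errors='ignore')
print(list(still_fine.columns))
```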

### Conditional selection

You will often want to extract rows, i.e. observations that match a certain condition:

* only people who have diabetes
* only people over 50
* only people over 50 who were never married
* etc.

We do this with conditional selection.

In order to show how it works, let us first create a simpler example dataframe.

In [None]:
example_data = pd.DataFrame(
    {"age": [67, 61, 40, 53, 82, 20],
    "sex": ["male", "male", "female", "male", "female", "female"],
    "smoker": ["yes", "no", "yes", "yes", "yes", "no"],
    "work": ['public', 'private', 'private', 'public', 'retired', 'private'],})

example_data

Unnamed: 0,age,sex,smoker,work
0,67,male,yes,public
1,61,male,no,private
2,40,female,yes,private
3,53,male,yes,public
4,82,female,yes,retired
5,20,female,no,private


First, we need a way to refer to a specific column in the dataframe. There are other ways, but for now we will use square brackets with the column name.

In [None]:
example_data['age']

0    67
1    61
2    40
3    53
4    82
5    20
Name: age, dtype: int64

We can now use this to query conditions in this dataframe.

In [None]:
#Which of the people in our data are older than 50?
#Or in python words:
#Which rows have a value greater than 50 in the age column?
example_data['age'] > 50

0     True
1     True
2    False
3     True
4     True
5    False
Name: age, dtype: bool

What we get is a series of `True` and `False` values with the same length as the number of rows in the dataframe:

* For the first row, the condition is `True` since the person is over 50
* For the second row the condition is also `True`
* For the third row the condition is `False`
* etc.

You see that we verify the condition once for each row. This is also called an **element-wise comparison**.

### Combining conditions

We can also combine conditions. What if we want to know which people are older than 50 and smokers?

Because the comparison is done element-wise, once per row, we cannot use Python's `and`, `or` and `not`. Instead we need the bitwise operators:

* & (instead of `and`)
* | (instead of `or`)
* ~ (instead of `not`)

We'll also need to wrap each separate condition in parentheses, because `&` and `|` bind more tightly than comparisons like `>` and `==`.

In [None]:
#people who are smokers and over 50
(example_data['age'] > 50) & (example_data['smoker'] == 'yes')

0     True
1    False
2    False
3     True
4     True
5    False
dtype: bool

In [None]:
#people who are smokers or over 50
(example_data['age'] > 50) | (example_data['smoker'] == 'yes')

0     True
1     True
2     True
3     True
4     True
5    False
dtype: bool

In [None]:
#people who are neither smokers, nor over 50
~((example_data['age'] > 50) | (example_data['smoker'] == 'yes'))

0    False
1    False
2    False
3    False
4    False
5     True
dtype: bool
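
To see why `and` does not work here, the sketch below tries it on a made-up two-row dataframe. Python's `and` asks for a single `True` or `False` for the whole series, which pandas refuses to guess, while `&` compares row by row.

```python
import pandas as pd

toy = pd.DataFrame({'age': [67, 40], 'smoker': ['yes', 'yes']})

try:
    (toy['age'] > 50) and (toy['smoker'] == 'yes')
    and_failed = False
except ValueError as err:
    and_failed = True
    print('ValueError:', err)

# The element-wise operator works row by row
both = (toy['age'] > 50) & (toy['smoker'] == 'yes')
print(both.tolist())
```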

Two more useful methods for building conditions:

* `between()`
* `isin()`


In [None]:
#people with an age between 60 and 70
example_data['age'].between(60,70)

0     True
1     True
2    False
3    False
4    False
5    False
Name: age, dtype: bool

In [None]:
#people with jobs in the public sector or retirees.
example_data['work'].isin(['public', 'retired'])

0     True
1    False
2    False
3     True
4     True
5    False
Name: work, dtype: bool

## Exercise 3 (10 mins)

Write a condition that returns `True` for :

1. People who are smokers
2. People who work in the private sector and are over 50
3. People who are neither retired nor under 30.

Look through the examples above and compare them to the dataframe if you have trouble! Try to understand for each row what the result of the condition check should be and verify that this is what you get.

### The .loc operator

There are several different ways to access data inside a dataframe because pandas is very flexible.

Here, we will only show you the .loc syntax because it is the most versatile and always works.



https://github.com/Center-for-Health-Data-Science/Python_part2/blob/main/figures/df_loc_conditions.png

With this syntax you can

* select rows on conditions
* select columns
* select both rows and columns

In order to create subsets of our dataframe, we put the conditional we looked at above inside the `.loc` operator:

In [None]:
#extracting people who are over 50 years old

sub = example_data.loc[example_data['age'] > 50]
sub

Unnamed: 0,age,sex,smoker,work
0,67,male,yes,public
1,61,male,no,private
3,53,male,yes,public
4,82,female,yes,retired


We also notice that the row names are not continuous anymore (we skip 2 since this row was deselected). The row names, i.e. the index, still correspond to the original dataframe that we made the subset from.

That can be a good thing if we need to maintain compatibility with the original dataframe. It can also be a hindrance if we want to treat the subset as its own dataset and work further with it. If we do not need to refer back to the original dataframe we can reset the index to be continuous again:

In [None]:
sub = sub.reset_index(drop = True)
sub

Unnamed: 0,age,sex,smoker,work
0,67,male,yes,public
1,61,male,no,private
2,53,male,yes,public
3,82,female,yes,retired


Since we're using .loc we can select columns at the same time:

In [None]:
#extracting people who are over 50 years old and only keeping the age and sex columns

sub = example_data.loc[example_data['age'] > 50, ['age', 'sex']]
sub

Unnamed: 0,age,sex
0,67,male
1,61,male
3,53,male
4,82,female


We're also not required to keep the column the condition is on:

In [None]:
#extracting people who are over 50 years old and only keeping the work and sex columns

sub = example_data.loc[example_data['age'] > 50, ['sex', 'work']]
sub

Unnamed: 0,sex,work
0,male,public
1,male,private
3,male,public
4,female,retired


Going back a step to row selection, the condition in the first field of `.loc` can be arbitrarily complicated.

Here we select people who are over 50 and female:

In [None]:
sub = example_data.loc[(example_data['age'] > 50) & (example_data['sex'] == 'female')]
sub

Unnamed: 0,age,sex,smoker,work
4,82,female,yes,retired


You can also save your condition in a variable and plug that variable into `.loc` if it helps you to keep the overview. This is often referred to as creating a *mask*:

In [None]:
mask = (example_data['age'] > 50) & (example_data['sex'] == 'female')

sub = example_data.loc[mask]
sub

Unnamed: 0,age,sex,smoker,work
4,82,female,yes,retired


Lastly, if we only want to subset on columns and *keep all the rows*, we can also do that in the `.loc` syntax. There are more ways to do this but we won't go over them here in order to keep the subsetting simple.

In [None]:
#extracting only desired columns.
#Remember, the column selection is in the second field.
#You need ':' in the first field to specify 'all rows'.

sub = example_data.loc[:,['work', 'smoker']]
sub

Unnamed: 0,work,smoker
0,public,yes
1,private,no
2,private,yes
3,public,yes
4,retired,yes
5,private,no
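
The same row and column fields of `.loc` can also be used on the left-hand side of an assignment, to change values only in the rows that match a condition. A minimal sketch on a made-up dataframe (the rule that everyone over 65 is retired is just an assumption for the example):

```python
import pandas as pd

toy = pd.DataFrame({'age': [67, 40, 82],
                    'work': ['public', 'private', 'public']})

# Rows where the condition holds get a new value in the 'work' column
toy.loc[toy['age'] > 65, 'work'] = 'retired'
print(toy)
```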


## Exercise 4 (10 mins)

Subset the diabetes_df to:

1. People who are smokers.
2. People who are smokers and retain only the categorical columns.
3. People who are smokers or have an unknown smoking status.
4. People who are over 40 and unmarried.


## Manipulating dataframes

### Adding columns

If you have a list, array or series that has the same length as the number of rows in your dataframe, you can add it as a column:

In [None]:
married_status = ['married', 'divorced', 'married', 'unmarried', 'married', 'unmarried']

example_data['married'] = married_status
example_data

Unnamed: 0,age,sex,smoker,work,married
0,67,male,yes,public,married
1,61,male,no,private,divorced
2,40,female,yes,private,married
3,53,male,yes,public,unmarried
4,82,female,yes,retired,married
5,20,female,no,private,unmarried


You can also create columns based on other columns:

In [None]:
example_data['age_in_5_years'] = example_data['age'] + 5
example_data

Unnamed: 0,age,sex,smoker,work,married,age_in_5_years
0,67,male,yes,public,married,72
1,61,male,no,private,divorced,66
2,40,female,yes,private,married,45
3,53,male,yes,public,unmarried,58
4,82,female,yes,retired,married,87
5,20,female,no,private,unmarried,25


### Omitting columns

We can omit columns by using the `drop` method:

In [None]:
no_age = example_data.drop(columns = ['age'])

We can drop several columns by naming all of them in the list.

## Joining dataframes

Just like with arrays, there are several ways we might want to join dataframes. The most straightforward ones are below each other and side-by-side.



### Below each other (adding rows)

An example:

df1:

| Index   | Name | Fav Animal |
|---------|-----|-----|
| 0       | Katrine   | Dog   |
| 1       | Sven   | Cat   |
| 2       | John   | Snake   |
| 3       | Lisa   | Rabbit   |

df2:

| Index   | Name | Fav Animal |
|---------|-----|-----|
| 0       | Marty   | Dog   |
| 1       | Clara   | Cat   |
| 2       | Ingo   | Car   |

Result:

| Index   | Name | Fav Animal |
|---------|-----|-----|
| 0       | Katrine   | Dog   |
| 1       | Sven   | Cat   |
| 2       | John   | Snake   |
| 3       | Lisa   | Rabbit   |
| 4       | Marty   | Dog   |
| 5       | Clara   | Cat   |
| 6       | Ingo   | Car   |



We can achieve this with `concat`:

In [None]:
df1 = pd.DataFrame({'Name': ['Katrine', 'Sven', 'John', 'Lisa'],
                    'Fav Animal': ['Dog', 'Cat', 'Snake', 'Rabbit'],})


df2 = pd.DataFrame({'Name': ['Marty', 'Clara', 'Ingo'],
                    'Fav Animal': ['Dog', 'Cat', 'Car'],})


#We'll also reset the index to achieve a continuous index instead of having 0, 1, 2 repeat
result = pd.concat([df1,df2]).reset_index(drop = True)
result

Unnamed: 0,Name,Fav Animal
0,Katrine,Dog
1,Sven,Cat
2,John,Snake
3,Lisa,Rabbit
4,Marty,Dog
5,Clara,Cat
6,Ingo,Car
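
As an alternative to calling `reset_index` afterwards, `concat` has an `ignore_index` argument that builds a fresh continuous index directly. A small sketch with shortened, illustrative versions of the two dataframes:

```python
import pandas as pd

top = pd.DataFrame({'Name': ['Katrine', 'Sven'], 'Fav Animal': ['Dog', 'Cat']})
bottom = pd.DataFrame({'Name': ['Marty'], 'Fav Animal': ['Dog']})

stacked = pd.concat([top, bottom], ignore_index=True)
print(stacked)
```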


### Side-by-side (adding columns)

An example:

df1:

| Index   | Name | Fav Animal |
|---------|-----|-----|
| 0       | Katrine   | Dog   |
| 1       | Sven   | Cat   |
| 2       | John   | Snake   |
| 3       | Lisa   | Rabbit   |

df3:

| Index   | Age | Fav Color |
|---------|-----|-----|
| 0       | 10   | Blue   |
| 1       | 11   | Red   |
| 2       | 7   | Green   |
| 3       | 5   | Pink   |

Result:

| Index   | Name | Fav Animal | Age | Fav Color |
|---------|-----|-----|-----|-----|
| 0       | Katrine   | Dog   | 10   | Blue   |
| 1       | Sven   | Cat   |11   | Red   |
| 2       | John   | Snake   |7   | Green   |
| 3       | Lisa   | Rabbit   |5   | Pink|

We can do this by changing the axis of concat to 1:

In [None]:
df1 = pd.DataFrame({'Name': ['Katrine', 'Sven', 'John', 'Lisa'],
                    'Fav Animal': ['Dog', 'Cat', 'Snake', 'Rabbit'],})


df3 = pd.DataFrame({'Age': [10, 11, 7, 5],
                    'Fav Color': ['Blue', 'Red', 'Green', 'Pink'],})


result = pd.concat([df1,df3], axis = 1)
result

Unnamed: 0,Name,Fav Animal,Age,Fav Color
0,Katrine,Dog,10,Blue
1,Sven,Cat,11,Red
2,John,Snake,7,Green
3,Lisa,Rabbit,5,Pink
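
One thing to keep in mind with `axis = 1`: rows are matched on their index, not on their position. The sketch below uses two made-up dataframes whose indexes only partly overlap, so the non-matching cells are filled with `NaN`.

```python
import pandas as pd

left = pd.DataFrame({'Name': ['Katrine', 'Sven']}, index=[0, 1])
right = pd.DataFrame({'Age': [11, 7]}, index=[1, 2])

# Rows are matched on the index, not on their position
aligned = pd.concat([left, right], axis=1)
print(aligned)
```

If the two dataframes really do belong together row by row, resetting the index on both before joining avoids this.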


# From pandas dataframes to np arrays and back

Now we have had an intro into what arrays and dataframes are and how to manipulate them.  

To do data modelling efficiently, it helps to know how to switch between these two data formats.

We will often use pandas for loading, subsetting and plotting data, since seaborn requires dataframes. Conversely, when we do the modelling we use np arrays since scikit learn is written to interface with numpy.

In this section we'll cover some tricks for switching between the two quickly and easily.

### Cast to np array

A pandas dataframe is an object of the `DataFrame` class. We can get the content of any dataframe in an array format by using the aptly named `to_numpy` method:

In [None]:
#save as a new variable
diabetes_np = diabetes_df.to_numpy()
#let's see the first 5 rows
diabetes_np[:5,]

array([[9046, 34, 'Male', 84, 0.619, 24.7, 93, 'Yes', 'Self-employed',
        'Unknown', 0],
       [51676, 25, 'Male', 74, 0.591, 22.5, 102, 'No', 'Public',
        'Unknown', 0],
       [31112, 30, 'Male', 0, 0.839, 32.3, 75, 'Yes', 'Private',
        'Former', 1],
       [60182, 50, 'Male', 80, 0.178, 34.5, 98, 'Yes', 'Self-employed',
        'Unknown', 1],
       [1665, 27, 'Female', 60, 0.206, 26.3, 82, 'Yes', 'Private',
        'Never', 0]], dtype=object)

In [None]:
type(diabetes_np)

numpy.ndarray
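Notice the `dtype=object` at the end of the array output above. An np array holds a single data type, so when a dataframe mixes text and numbers, the array falls back to the generic `object` type. Here is a minimal sketch with a small made-up dataframe (`toy_df` and its values are invented for illustration) that compares the mixed case with a numeric-only selection.

```python
import pandas as pd

# Toy data, invented for illustration: one text column, two numeric columns
toy_df = pd.DataFrame({'Sex': ['Male', 'Female'],
                       'Age': [34, 27],
                       'BMI': [24.7, 26.3]})

# Mixed text and numbers: the array falls back to the generic 'object' dtype
mixed = toy_df.to_numpy()
print(mixed.dtype)

# Only numeric columns: integers and floats are combined into a float array
numeric = toy_df[['Age', 'BMI']].to_numpy()
print(numeric.dtype)
```

Numeric arrays are usually what we want for modelling, so it often pays to select the numeric columns before casting.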

You can directly add `.to_numpy()` to any subsetting command on a dataframe to get an array-like structure out, without saving an intermediate, subset dataframe object:

In [None]:
#Here we omit the Diabetes column and make the rest of the df into an array
diabetes_np = diabetes_df.drop('Diabetes', axis = 1).to_numpy()
diabetes_np

array([[9046, 34, 'Male', ..., 'Yes', 'Self-employed', 'Unknown'],
       [51676, 25, 'Male', ..., 'No', 'Public', 'Unknown'],
       [31112, 30, 'Male', ..., 'Yes', 'Private', 'Former'],
       ...,
       [7621, 37, 'Female', ..., 'No', 'Private', 'Never'],
       [6855, 29, 'Female', ..., 'No', 'Private', 'Smoker'],
       [5374, 35, 'Female', ..., 'No', 'Self-employed', 'Never']],
      dtype=object)

When we model data in scikit-learn, we'll often want to have the target column in a separate object, like how in the iris data we have `iris.data` and `iris.target`.

We can do this by selecting the column and appending `.to_numpy()`.

In [None]:
#here we make a 1-d array (a vector) from the target column
diabetes_target = diabetes_df.loc[:,'Diabetes'].to_numpy()
diabetes_target

array([0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0,
       0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0,
       1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1,
       1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1,

Since we're using `.loc`, we can use the same syntax to get several columns by passing a list of column names. Here we make all numeric columns into an array.


In [None]:
#Extracting several columns by name and cast to np array
diabetes_numeric = diabetes_df.loc[:,['Age', 'BloodPressure', 'GeneticRisk', 'BMI', 'PhysicalActivity']].to_numpy()
diabetes_numeric

array([[ 34.   ,  84.   ,   0.619,  24.7  ,  93.   ],
       [ 25.   ,  74.   ,   0.591,  22.5  , 102.   ],
       [ 30.   ,   0.   ,   0.839,  32.3  ,  75.   ],
       ...,
       [ 37.   ,  84.   ,   0.696,  24.5  , 128.   ],
       [ 29.   ,  86.   ,   0.808,  35.6  ,  51.   ],
       [ 35.   ,  90.   ,   0.314,  36.5  ,  75.   ]])
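Putting the steps above together, a features array and a target array can be made from the same dataframe. The sketch below uses a tiny stand-in for `diabetes_df` (the values are invented, and the names `X` and `y` are just a common convention, not something the data requires). It checks that the features are 2-d and the target is 1-d, with one entry per row.

```python
import pandas as pd

# Toy stand-in for diabetes_df (values invented for illustration)
toy_df = pd.DataFrame({'Age': [34, 25, 30],
                       'BMI': [24.7, 22.5, 32.3],
                       'Diabetes': [0, 0, 1]})

# Features: every column except the target, as a 2-d array
X = toy_df.drop('Diabetes', axis=1).to_numpy()
# Target: the single target column, as a 1-d array
y = toy_df.loc[:, 'Diabetes'].to_numpy()

print(X.shape)  # (3, 2)
print(y.shape)  # (3,)
```

The number of rows in `X` should always equal the length of `y`, since each observation needs exactly one target value.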

### Cast to dataframe

Alright, so how do we get back? We pass the array to the `pd.DataFrame` constructor:

In [None]:
new_df = pd.DataFrame(diabetes_numeric)
new_df.head()

Unnamed: 0,0,1,2,3,4
0,34.0,84.0,0.619,24.7,93.0
1,25.0,74.0,0.591,22.5,102.0
2,30.0,0.0,0.839,32.3,75.0
3,50.0,80.0,0.178,34.5,98.0
4,27.0,60.0,0.206,26.3,82.0


In [None]:
type(new_df)

We have lost the column names, since arrays do not carry them; the columns are simply numbered 0 to 4. We can re-add them (or make up new ones):

In [None]:
new_df.columns = ['Age', 'Blood', 'Risk', 'BMI', 'Activity']
new_df.head()

Unnamed: 0,Age,Blood,Risk,BMI,Activity
0,34.0,84.0,0.619,24.7,93.0
1,25.0,74.0,0.591,22.5,102.0
2,30.0,0.0,0.839,32.3,75.0
3,50.0,80.0,0.178,34.5,98.0
4,27.0,60.0,0.206,26.3,82.0
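The names can also be supplied right when the dataframe is created, using the `columns` argument of `pd.DataFrame`. A minimal sketch, with a small made-up array (`toy_array`) standing in for `diabetes_numeric`:

```python
import numpy as np
import pandas as pd

# Small made-up array standing in for diabetes_numeric
toy_array = np.array([[34.0, 24.7],
                      [25.0, 22.5]])

# Column names are given at construction time via the `columns` argument
toy_df = pd.DataFrame(toy_array, columns=['Age', 'BMI'])
print(list(toy_df.columns))
```

The list of names must have one entry per column of the array.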


We can also add vectors as new columns to an existing dataframe. Here we'll re-add the target, `Diabetes`.

In [None]:
#put in the name of your new column and where the values are coming from
new_df['Diabetes'] = diabetes_target
new_df.tail()

Unnamed: 0,Age,Blood,Risk,BMI,Activity,Diabetes
527,40.0,88.0,0.403,34.5,72.0,1
528,58.0,82.0,0.528,39.2,85.0,1
529,37.0,84.0,0.696,24.5,128.0,0
530,29.0,86.0,0.808,35.6,51.0,1
531,35.0,90.0,0.314,36.5,75.0,1
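This only works when the vector has exactly one value per row of the dataframe. The sketch below, again on made-up data (`toy_df` and the column name `Bad` are purely illustrative), shows that pandas raises a `ValueError` when the lengths do not match.

```python
import numpy as np
import pandas as pd

# Toy dataframe with three rows (values invented for illustration)
toy_df = pd.DataFrame({'Age': [34.0, 25.0, 30.0]})

# Matching length: the values are assigned row by row, in order
toy_df['Diabetes'] = np.array([0, 0, 1])

# Mismatched length: pandas refuses the assignment
try:
    toy_df['Bad'] = np.array([0, 1])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)
```

Note that the values are matched by position, so the array must be in the same row order as the dataframe.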


So far so good!

## Exercise 5 (5 mins)

Make a pandas dataframe from the iris data and give it column names. The columns are: sepal length, sepal width, petal length and petal width. We know this because they are listed in `iris.DESCR`. Have a look at it.

The target value is in `iris.target` and describes the species of iris. Add this as a column to the dataframe.