<a href="https://colab.research.google.com/github/jeremykleindienst/vorlesung_python/blob/main/numpy_pandas_basics_without_solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Numpy & Pandas Basics</h1>

So far, we have mostly used the modules and data structures built into Python for writing our programs.

While they work great for many different application areas, data scientists often use two additional libraries that provide more sophisticated data structures for working with data: `numpy` and `pandas`.

# Numpy

Numpy is a Python library for working with **multidimensional arrays and matrices**. Because these data structures are the building blocks of many machine learning and most deep learning algorithms, a solid understanding of the numpy library is beneficial.

For further information, refer to the [official documentation](https://numpy.org/doc/1.22/index.html).

![](https://upload.wikimedia.org/wikipedia/commons/thumb/1/17/CPT-arrays-2d-demo.svg/488px-CPT-arrays-2d-demo.svg.png)

You might ask why we need a special library for this as Python already provides lists which can also be nested to create multidimensional matrices:

In [None]:
lst = [1,2,3]

In [None]:
multidimensional_array = [
                          [1, 2, 3, 4],
                          [5, 6, 7, 8]
]

In [None]:
for row in multidimensional_array:
  print(row)

[1, 2, 3, 4]
[5, 6, 7, 8]


In [None]:
multidimensional_array[0][2]

3

In [None]:
multidimensional_array + multidimensional_array

[[1, 2, 3, 4], [5, 6, 7, 8], [1, 2, 3, 4], [5, 6, 7, 8]]

The answer mainly comes down to **speed and convenience**. Numpy stores the data of an array in one contiguous block of memory, which allows much faster processing than a Python list, whose elements are separate objects referenced by pointers. Also, large parts of numpy are written in C or C++, which contributes further to the speed improvement. On top of that, numpy provides many highly optimized mathematical operations, which makes our lives a lot easier.
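To make the convenience argument concrete, here is a small sketch contrasting `+` on nested lists and on numpy arrays. The variable names (`nested`, `arr`, `list_result`, `array_result`) are chosen for illustration only.

```python
import numpy as np

nested = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr = np.array(nested)

# For lists, + concatenates: the result has four rows.
list_result = nested + nested

# For arrays, + adds element by element: the shape stays (2, 4).
array_result = arr + arr

print(len(list_result))    # 4
print(array_result.shape)  # (2, 4)
print(array_result)
```

With plain lists, the element-wise version would need nested loops or comprehensions.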

## Creating Arrays

We can start using numpy by first importing it and then using it, e.g., by defining a simple array:

In [None]:
import numpy as np # this allows us to refer to the numpy library just with 'np'
a = np.array([1.2, 2.2, 3.0])

![](https://numpy.org/doc/1.22/_images/np_array.png)

An array is a **grid of values** and it contains information about the raw data, how to locate an element, and how to interpret an element. It has a grid of elements that can be indexed in various ways. The elements are all of the same type, referred to as the **array dtype**.

In [None]:
a.dtype

dtype('float64')

We can change the type using the `astype()` function:

In [None]:
a.astype('int').dtype

dtype('int64')
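Two details of `astype()` are easy to miss: it returns a new array and leaves the original untouched, and converting floats to integers drops the fractional part instead of rounding. A minimal sketch (the names `floats` and `ints` are illustrative):

```python
import numpy as np

floats = np.array([1.2, 2.7, -3.9])
ints = floats.astype(int)   # truncates toward zero, does not round

print(ints.tolist())   # [1, 2, -3]
print(floats.dtype)    # float64, the original array is unchanged
```

If rounding is what you want, one option is to call `np.round()` before converting.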

Let's print the array to see what it looks like:

In [None]:
print(a)

[1.2 2.2 3. ]


You can also create arrays filled with zeros or ones, or an uninitialized ("empty") array whose values are whatever happened to be in that piece of memory:

In [None]:
print(np.zeros(10))

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


In [None]:
np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [None]:
print(np.empty(2))

[1. 1.]
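Because `np.empty` does not initialize its contents, the values it shows can differ from run to run. If a defined starting value is needed, `np.full` is one option. A short sketch with illustrative names:

```python
import numpy as np

uninitialized = np.empty(3)   # contents are arbitrary, do not rely on them
filled = np.full(3, 7.5)      # every element is set to 7.5

print(uninitialized.shape)  # (3,)
print(filled.tolist())      # [7.5, 7.5, 7.5]
```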


Or you can create an array filled with a range of elements:

In [None]:
np.arange(4)

array([0, 1, 2, 3])

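For `np.zeros`, `np.ones`, and `np.empty`, the default data type is floating point (`np.float64`), whereas `np.arange` with integer arguments produces integers. You can explicitly specify which data type you want using the `dtype` keyword.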

In [None]:
a = np.ones(2, dtype=np.float16)

In [None]:
a.dtype

dtype('float16')
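The defaults can be checked directly. This sketch only inspects dtypes; the exact integer width of `np.arange` depends on the platform, so it tests the dtype *kind* instead of a specific type.

```python
import numpy as np

zeros_dtype = np.zeros(3).dtype                        # float64 by default
arange_kind = np.arange(4).dtype.kind                  # 'i' means signed integer
arange_float = np.arange(4, dtype=np.float32).dtype    # explicitly requested

print(zeros_dtype, arange_kind, arange_float)
```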

A numpy array can have more than just one dimension, so let's create one with five dimensions filled with random values between 0 and 1:

In [None]:
a = np.random.rand(5, 10, 20, 30, 40) # 5 dimensions

In [None]:
a

Let's verify the number of dimensions and its shape:

In [None]:
print(a.ndim) # print number of dimensions/axes
print(a.shape) # print shape of array

5
(5, 10, 20, 30, 40)


We can count the number of elements in the array using the `size` property:

In [None]:
a.size

1200000

This should be equal to the product of shape dimensions, let's check:

In [None]:
5*10*20*30*40

1200000

## Indexing

Numpy has created our array as instructed and filled it with random values. Let's check one specific element in the array:

In [None]:
a[2][4][18][22][10] # select element in 10th place of 22nd place ...

0.44919973181784834

Indexing an array works the same as with Python lists:

![](https://numpy.org/doc/1.22/_images/np_indexing.png)

For multidimensional arrays, indexing works similarly:
![](https://numpy.org/doc/1.22/_images/np_matrix_indexing.png)

In [None]:
data = np.random.rand(3, 2, 4)

In [None]:
data[0]

array([[0.3594875 , 0.84853296, 0.43714027, 0.26291735],
       [0.63287313, 0.04567911, 0.10976615, 0.84197284]])

In [None]:
data[0, 1]

array([0.63287313, 0.04567911, 0.10976615, 0.84197284])

In [None]:
a = np.random.rand(3,4)

In [None]:
a

In [None]:
a[1][2]

In [None]:
a[1, 2]

In [None]:
a[1:2]
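The three cells above differ in subtle ways. The following sketch uses a small array with known values (the name `m` is illustrative) so that the results are predictable:

```python
import numpy as np

m = np.arange(12).reshape(3, 4)   # rows: [0..3], [4..7], [8..11]

print(m[1][2])        # 6, chained indexing
print(m[1, 2])        # 6, the same element with a single index tuple
print(m[1].shape)     # (4,), an integer index drops the row dimension
print(m[1:2].shape)   # (1, 4), a slice keeps it
print(m[:, -1])       # last column: [ 3  7 11]
```

Both `m[1][2]` and `m[1, 2]` return the same element, but the tuple form does it in one indexing step and is the usual numpy idiom.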

However, numpy arrays can not only be indexed using indices but also booleans (`True` and `False` values).

In [None]:
boolean = 1 == 1

In [None]:
type(boolean)

bool

In [None]:
a = np.array([0.1, 0.4, 0.6, 0.8])

In [None]:
a[a < 0.5]

array([0.1, 0.4])

If you look at what the `a < 0.5` expression actually returns, you see that it's an array of booleans:

In [None]:
a < 0.5

array([ True,  True, False, False])

Similarly, we could index the array by manually specifying such a list:

In [None]:
a[[True, False, True, False]]

array([0.1, 0.6])
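Boolean masks can also be combined, and they can be used on the left-hand side of an assignment. A minimal sketch with illustrative names:

```python
import numpy as np

values = np.array([0.1, 0.4, 0.6, 0.8])

# Combine conditions with & (and), | (or), ~ (not); the parentheses are required.
middle = values[(values > 0.2) & (values < 0.7)]

# A mask on the left-hand side overwrites only the matching elements.
clipped = values.copy()
clipped[clipped > 0.5] = 0.5

print(middle)    # [0.4 0.6]
print(clipped)   # [0.1 0.4 0.5 0.5]
```

Python's `and` and `or` keywords do not work element-wise on arrays, which is why the bitwise operators are used here.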

## Adding, Removing, Sorting Arrays

Sorting an array is simple with `np.sort()`. You can specify the axis, kind, and order when you call the function.

In [None]:
a = np.array([2, 1, 5, 3, 7, 4, 6, 8])

In [None]:
a = np.sort(a)

In [None]:
a

array([1, 2, 3, 4, 5, 6, 7, 8])
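For arrays with more than one dimension, the `axis` argument decides along which direction the values are sorted. The sketch below uses a small made-up array `grid`:

```python
import numpy as np

grid = np.array([[9, 1, 8],
                 [3, 7, 2]])

by_row = np.sort(grid, axis=1)         # sort within each row (the default is the last axis)
by_col = np.sort(grid, axis=0)         # sort within each column
descending = np.sort(grid, axis=1)[:, ::-1]   # reverse each sorted row

print(by_row)
print(by_col)
print(descending)
```

`np.sort()` returns a sorted copy, so `grid` itself keeps its original order.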

To combine two (or more) arrays, you can use the `concatenate()` function:

In [None]:
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

In [None]:
np.concatenate((a, b))

array([1, 2, 3, 4, 5, 6, 7, 8])

If you want to concatenate them "below" each other, you can explicitly state the axis:

In [None]:
x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6]])

In [None]:
x

array([[1, 2],
       [3, 4]])

In [None]:
y

array([[5, 6]])

In [None]:
np.concatenate((x, y), axis=0)

array([[1, 2],
       [3, 4],
       [5, 6]])
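The same function can also place arrays "next to" each other with `axis=1`, as long as the shapes match along the other axis. A small sketch reusing `x` and `y` from above:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6]])

below = np.concatenate((x, y), axis=0)      # shapes (2, 2) and (1, 2) -> (3, 2)
beside = np.concatenate((x, y.T), axis=1)   # shapes (2, 2) and (2, 1) -> (2, 3)

print(below.shape)   # (3, 2)
print(beside)
```

Here `y.T` transposes `y` into a column so that it has two rows, like `x`.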

## Reshaping Arrays

Using `arr.reshape()` will give a new shape to an array without changing the data. However, the different shapes must have the **same number of elements**.

![](https://numpy.org/doc/1.22/_images/np_reshape.png)

In [None]:
a = np.arange(6)
print(a)

[0 1 2 3 4 5]


In [None]:
b = a.reshape(3, 2)
print(b)

[[0 1]
 [2 3]
 [4 5]]
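Two practical consequences of the "same number of elements" rule, shown as a short sketch: one dimension can be given as `-1` and numpy infers it, and a shape that does not fit raises a `ValueError`.

```python
import numpy as np

a = np.arange(6)

auto = a.reshape(2, -1)   # -1 lets numpy infer the missing dimension: (2, 3)
print(auto.shape)

message = None
try:
    a.reshape(4, 2)       # 8 slots for 6 elements cannot work
except ValueError as err:
    message = str(err)
    print("reshape failed:", message)
```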


You can use `np.newaxis` and `np.expand_dims` to increase the dimensions of your existing array.

Using `np.newaxis` will increase the dimensions of your array by one dimension when used once. This means that a 1D array will become a 2D array, a 2D array will become a 3D array, and so on.

In [None]:
a = np.array([1, 2, 3, 4, 5, 6])
a.shape

(6,)

In [None]:
a2 = a[np.newaxis, :]
a2.shape

(1, 6)

You can explicitly convert a 1D array to either a row vector or a column vector using `np.newaxis`. For example, you can convert a 1D array to a row vector by inserting an axis along the first dimension:

In [None]:
row_vector = a[np.newaxis, :]
row_vector.shape

Or, for a column vector, you can insert an axis along the second dimension:

In [None]:
col_vector = a[:, np.newaxis]
col_vector.shape

(6, 1)

You can also expand an array by inserting a new axis at a specified position with `np.expand_dims`.

In [None]:
a = np.array([1, 2, 3, 4, 5, 6])
a.shape

(6,)

In [None]:
b = np.expand_dims(a, axis=1)
b.shape

(6, 1)

In [None]:
c = np.expand_dims(a, axis=0)
c.shape

(1, 6)
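Both approaches produce the same result. The sketch below checks that and shows one common reason for adding an axis: combining a column with a row yields a full table. It relies on numpy's broadcasting, which is explained in the section on calculating with arrays.

```python
import numpy as np

a = np.array([1, 2, 3])

# np.newaxis and np.expand_dims give identical arrays.
same = np.array_equal(a[:, np.newaxis], np.expand_dims(a, axis=1))

# A (3, 1) column multiplied by a (1, 3) row gives a (3, 3) multiplication table.
table = a[:, np.newaxis] * a[np.newaxis, :]

print(same)    # True
print(table)
```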

## Calculating with Numpy Arrays

In [None]:
data = np.array([1,2])
ones = np.ones(2)

In [None]:
data

array([1, 2])

In [None]:
ones

array([1., 1.])

Using our arrays, we can now perform some basic arithmetic operations.

We can add and subtract:

In [None]:
data + ones

array([2., 3.])

![](https://numpy.org/doc/1.22/_images/np_data_plus_ones.png)
![](https://numpy.org/doc/1.22/_images/np_sub_mult_divide.png)

In [None]:
data - ones

array([0., 1.])

In [None]:
data * data

array([1, 4])

In [None]:
data / data

array([1., 1.])

Basic operations are simple with NumPy. If you want to find the sum of the elements in an array, you'd use `sum()`. This works for 1D arrays, 2D arrays, and arrays in higher dimensions.

![](https://numpy.org/doc/1.22/_images/np_aggregation.png)

In [None]:
a = np.array([1, 2, 3, 4])

In [None]:
a

array([1, 2, 3, 4])

In [None]:
a.sum()

10
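For arrays with more than one dimension, `sum()` accepts an `axis` argument that selects which dimension is collapsed. A minimal sketch with a made-up matrix `m`:

```python
import numpy as np

m = np.array([[1, 2],
              [3, 4],
              [5, 6]])

total = m.sum()            # all elements: 21
col_sums = m.sum(axis=0)   # collapse the rows, one sum per column
row_sums = m.sum(axis=1)   # collapse the columns, one sum per row

print(total)      # 21
print(col_sums)   # [ 9 12]
print(row_sums)   # [ 3  7 11]
```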

There are times when you might want to carry out an operation between an array and a single number (also called an operation between a vector and a scalar) or between arrays of two different sizes. Numpy handles this through a mechanism called **broadcasting**, which applies the operation to every element:

In [None]:
a

array([1, 2, 3, 4])

In [None]:
a * 3 # broadcasting

array([ 3,  6,  9, 12])

![](https://numpy.org/doc/1.22/_images/np_multiply_broadcasting.png)

The matrix product can be computed using the `@` sign or alternatively the `dot` function.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/1/18/Matrix_multiplication_qtl1.svg/800px-Matrix_multiplication_qtl1.svg.png)

Because the number of columns of the first matrix must equal the number of rows of the second matrix, we have to call the `transpose` function on one of the matrices first:

In [None]:
a = np.random.rand(2, 3)

In [None]:
a

array([[0.59343177, 0.84279067, 0.4180077 ],
       [0.50064662, 0.18056498, 0.8899229 ]])

In [None]:
a.transpose()

array([[0.59343177, 0.50064662],
       [0.84279067, 0.18056498],
       [0.4180077 , 0.8899229 ]])

In [None]:
a @ a.transpose()

array([[1.23718782, 0.82127271],
       [0.82127271, 1.07521351]])

In [None]:
a.dot(a.transpose())

array([[1.23718782, 0.82127271],
       [0.82127271, 1.07521351]])
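Since the random values above are hard to verify by hand, here is the same computation as a sketch with small integers. It also contrasts `@` with `*`, which multiplies element by element instead.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])    # shape (2, 3)

product = a @ a.T            # (2, 3) @ (3, 2) -> (2, 2); a.T is short for a.transpose()
elementwise = a * a          # same shape as a, no summation

print(product)       # [[14 32]
                     #  [32 77]]
print(elementwise)
```

For example, the top-left entry is 1·1 + 2·2 + 3·3 = 14.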

Of course, the same operations can be carried out on multidimensional arrays:

In [None]:
data = np.array([[1,2],[3,4],[5,6]])
ones = np.ones((3,2))

![](https://numpy.org/doc/1.22/_images/np_matrix_aggregation.png)

In [None]:
data.max()

6

In [None]:
data.min()

1

In [None]:
data.sum()

21

![](https://numpy.org/doc/1.22/_images/np_matrix_aggregation_row.png)

In [None]:
data = np.array([[1,2],[5,3],[4,6]])

In [None]:
data

array([[1, 2],
       [5, 3],
       [4, 6]])

In [None]:
data.max(axis=0)

array([5, 6])

In [None]:
data.max(axis=1)

array([2, 5, 6])

![](https://numpy.org/doc/1.22/_images/np_matrix_arithmetic.png)

In [None]:
data + ones

You can do these arithmetic operations on matrices of different sizes, but only if one matrix has only one column or one row. In this case, NumPy will use its broadcast rules for the operation.

![](https://numpy.org/doc/1.22/_images/np_matrix_broadcasting.png)

In [None]:
data

array([[1, 2],
       [5, 3],
       [4, 6]])

In [None]:
ones = np.ones((1,2))

In [None]:
ones

array([[1., 1.]])

In [None]:
data + ones

array([[2., 3.],
       [6., 4.],
       [5., 7.]])
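The following sketch shows broadcasting in both directions, along with a shape that does not fit. The arrays `row` and `col` are made-up examples.

```python
import numpy as np

data = np.array([[1, 2], [5, 3], [4, 6]])   # shape (3, 2)

row = np.array([10, 20])                    # shape (2,), repeated for each of the 3 rows
col = np.array([[100], [200], [300]])       # shape (3, 1), repeated for each of the 2 columns

plus_row = data + row
plus_col = data + col

failed = False
try:
    data + np.array([1, 2, 3])              # length 3 does not match the last axis of length 2
except ValueError as err:
    failed = True
    print("broadcast failed:", err)

print(plus_row)
print(plus_col)
```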

## Exercises

### Exercise 1
Create a numpy array of shape (6, 10) filled with random values.
Print the elements of the array which are larger than 0.2 but smaller than 0.6.

In [None]:
a = np.random.rand(6, 10)

In [None]:
%%timeit
a[(a>0.2) & (a<0.6)]

The slowest run took 12.96 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 5: 5.42 µs per loop


### Exercise 2
Create a Python program that compares the performance of Python lists and numpy arrays by implementing elementwise addition of a 100x100x100 array.

Hint: putting `%%timeit` at the top of a cell will time its execution:

In [None]:
%%timeit
def loop():
  for i in range(100000):
    pass
loop()

100 loops, best of 5: 2.74 ms per loop


In [None]:
a.tolist()

In [None]:
numpy_array_1 = np.random.rand(100, 100, 100)
numpy_array_2 = np.random.rand(100, 100, 100)
python_list_1 = numpy_array_1.tolist()
python_list_2 = numpy_array_2.tolist()

In [None]:
numpy_array_1.size

1000000

In [None]:
%%timeit
numpy_array_1 + numpy_array_2

1000 loops, best of 5: 1.8 ms per loop


In [None]:
def elementwise_addition(python_list_1, python_list_2):
  output = np.empty((100,100,100)).tolist()
  for index_1 in range(100):
    for index_2 in range(100):
      for index_3 in range(100):
        python_element_1 = python_list_1[index_1][index_2][index_3]
        python_element_2 = python_list_2[index_1][index_2][index_3]
        result = python_element_1 + python_element_2
        output[index_1][index_2][index_3] = result
  return output
python_result = elementwise_addition(python_list_1, python_list_2)

In [None]:
368/1.8

204.44444444444443

In [None]:
(numpy_array_1 + numpy_array_2).tolist() == python_result

True

# Pandas

The pandas library can be used for working with tabular data from spreadsheets and databases. It offers data structures and operations for manipulating, reading and writing, reshaping and pivoting, slicing, indexing, subsetting and much more. The pandas developers aim to build the fundamental high-level building block for doing practical, real world data analysis in Python.

But enough talk, let's get started using pandas!

## Basics



First, we have to get some data. We'll be using the `titanic` dataset provided by the pandas developers:

In [None]:
# download the data
!wget https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv

--2022-03-11 11:07:57--  https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60302 (59K) [text/plain]
Saving to: ‘titanic.csv’


2022-03-11 11:07:57 (7.15 MB/s) - ‘titanic.csv’ saved [60302/60302]



Let's first look at the raw data in the csv file:

In [None]:
# print raw data contained in file
!cat titanic.csv

That does not really seem to be useful, let's use the pandas library to read the csv:

In [None]:
!pip install pandas

In [None]:
import pandas as pd # this will allow us to use the 'pd' shortcut
pd.read_csv('titanic.csv')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


That looks a lot better, but what's happened here exactly? 

The `read_csv` function from the pandas library reads data stored as a csv file and returns something called a `DataFrame`.

DataFrames contain tabular data and are the central data structure we work with in pandas. DataFrames also define a number of operations and properties.
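A DataFrame does not have to come from a file. As a minimal sketch, the hypothetical three-passenger table below (made-up values, not taken from the Titanic data) is built from a dictionary and inspected with a few of those properties:

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Ada", "Ben", "Cleo"],
    "Age": [31.0, 19.0, None],    # None becomes NaN, the marker for a missing value
    "Survived": [1, 0, 1],
})

print(toy.shape)     # (3, 3): rows, columns
print(toy.dtypes)    # one dtype per column
print(toy.head(2))   # the first two rows
```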

In [None]:
df = pd.read_csv('titanic.csv') # 'df' is our variable name and a common short for dataframe

In [None]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


![](https://www.w3resource.com/w3r_images/pandas-data-structure.svg)

## Selecting Columns

In [None]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


To select a single column from our DataFrame, we can either use the dot notation:

In [None]:
type(df.Name)

pandas.core.series.Series

In [None]:
type(df)

pandas.core.frame.DataFrame

or alternatively the bracket notation:

In [None]:
df['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

pandas returns such a single column as a `Series`: a one-dimensional labeled array that makes it easy to do computations on the column.
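To illustrate what "computations on a column" can look like, here is a sketch with a small hand-made Series (the values are made up):

```python
import pandas as pd

ages = pd.Series([22.0, 38.0, None, 35.0], name="Age")

mean_age = ages.mean()        # missing values are skipped by default
missing = ages.isna().sum()   # number of missing entries
doubled = ages * 2            # arithmetic is applied element by element

print(mean_age)
print(missing)   # 1
```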

We can also select a subset of columns like this:

In [None]:
df[['Name', 'Age', 'Survived']] 

Unnamed: 0,Name,Age,Survived
0,"Braund, Mr. Owen Harris",22.0,0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1
2,"Heikkinen, Miss. Laina",26.0,1
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1
4,"Allen, Mr. William Henry",35.0,0
...,...,...,...
886,"Montvila, Rev. Juozas",27.0,0
887,"Graham, Miss. Margaret Edith",19.0,1
888,"Johnston, Miss. Catherine Helen ""Carrie""",,0
889,"Behr, Mr. Karl Howell",26.0,1


Note that selecting multiple columns does not return a series object, but another DataFrame instead.
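The difference depends on the brackets, not on the number of columns. A short sketch with a made-up table `toy`:

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["Ada", "Ben"], "Age": [31, 19]})

as_series = toy["Name"]     # single brackets with one name -> Series
as_frame = toy[["Name"]]    # a list of names -> DataFrame, even with a single column

print(type(as_series).__name__)   # Series
print(type(as_frame).__name__)    # DataFrame
```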

## Indexing

We have two ways of indexing data from a dataframe:
*   `loc`: primarily label based, but may also be used with a boolean array
*   `iloc`: primarily integer position based, but may also be used with a boolean array



### iloc

Let's look at `iloc` first.

It expects queries in the form `[row_indexer,column_indexer]`. Because positions are zero-based, selecting the element in the fifth row and fourth column of the dataframe means using the positions 4 and 3:

In [None]:
df.iloc[4,3]

'Allen, Mr. William Henry'

With this approach, we can also select an entire row by using `:` as the column indexer:

In [None]:
df.iloc[4, :] # ':' means 'all columns' here

PassengerId                           5
Survived                              0
Pclass                                3
Name           Allen, Mr. William Henry
Sex                                male
Age                                35.0
SibSp                                 0
Parch                                 0
Ticket                           373450
Fare                               8.05
Cabin                               NaN
Embarked                              S
Name: 4, dtype: object

In [None]:
type(df.iloc[4, :])

pandas.core.series.Series

Or entire columns:

In [None]:
df.iloc[:, 3] # ':' means 'all rows' here

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [None]:
df['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

### loc

Let's look at `loc` next.

It works in a similar way, but is label-based so we just write the index name of the row we want:

In [None]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [None]:
df.loc[4, 'Name'] # note that '4' is interpreted as a label of the index; it is *not* the integer position like it was in iloc

'Allen, Mr. William Henry'

In our DataFrame, the row index label is equal to the position so the `row_indexer` for `loc` and `iloc` is the same.

But what happens if we had a different DataFrame? Let's try it out using the `transpose` function!

In [None]:
transposed_df = df.transpose() # rows become columns and columns become rows
transposed_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,881,882,883,884,885,886,887,888,889,890
PassengerId,1,2,3,4,5,6,7,8,9,10,...,882,883,884,885,886,887,888,889,890,891
Survived,0,1,1,1,0,0,0,0,1,1,...,0,0,0,0,0,0,1,0,1,0
Pclass,3,1,3,1,3,3,1,3,3,2,...,3,3,2,3,3,2,1,3,1,3
Name,"Braund, Mr. Owen Harris","Cumings, Mrs. John Bradley (Florence Briggs Th...","Heikkinen, Miss. Laina","Futrelle, Mrs. Jacques Heath (Lily May Peel)","Allen, Mr. William Henry","Moran, Mr. James","McCarthy, Mr. Timothy J","Palsson, Master. Gosta Leonard","Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)","Nasser, Mrs. Nicholas (Adele Achem)",...,"Markun, Mr. Johann","Dahlberg, Miss. Gerda Ulrika","Banfield, Mr. Frederick James","Sutehall, Mr. Henry Jr","Rice, Mrs. William (Margaret Norton)","Montvila, Rev. Juozas","Graham, Miss. Margaret Edith","Johnston, Miss. Catherine Helen ""Carrie""","Behr, Mr. Karl Howell","Dooley, Mr. Patrick"
Sex,male,female,female,female,male,male,male,male,female,female,...,male,female,male,male,female,male,female,female,male,male
Age,22.0,38.0,26.0,35.0,35.0,,54.0,2.0,27.0,14.0,...,33.0,22.0,28.0,25.0,39.0,27.0,19.0,,26.0,32.0
SibSp,1,1,0,1,0,0,0,3,0,1,...,0,0,0,0,0,0,0,1,0,0
Parch,0,0,0,0,0,0,0,1,2,0,...,0,0,0,0,5,0,0,2,0,0
Ticket,A/5 21171,PC 17599,STON/O2. 3101282,113803,373450,330877,17463,349909,347742,237736,...,349257,7552,C.A./SOTON 34068,SOTON/OQ 392076,382652,211536,112053,W./C. 6607,111369,370376
Fare,7.25,71.2833,7.925,53.1,8.05,8.4583,51.8625,21.075,11.1333,30.0708,...,7.8958,10.5167,10.5,7.05,29.125,13.0,30.0,23.45,30.0,7.75


Now, the difference between `loc` and `iloc` becomes clearer:

In [None]:
transposed_df.iloc[3] # select fourth row (remember, zero-based indexing)

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [None]:
transposed_df.loc['Name'] # select row with label 'Name'

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object
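The same distinction can be seen on a tiny, made-up DataFrame whose row labels deliberately differ from the row positions. This is a sketch for illustration and does not use the Titanic data.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Age": [31, 19, 45]},
    index=[10, 20, 30],           # labels differ from the positions 0, 1, 2
)

by_position = toy.iloc[0, 0]      # first row, first column
by_label = toy.loc[10, "Age"]     # row labelled 10, column 'Age'

# Slices differ too: iloc excludes the end position, loc includes the end label.
position_slice = toy.iloc[0:1]    # one row
label_slice = toy.loc[10:20]      # two rows

label_missing = False
try:
    toy.loc[0]                    # there is no row with the label 0
except KeyError:
    label_missing = True

print(by_position, by_label, len(position_slice), len(label_slice), label_missing)
```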

### Boolean-based indexing

`loc` and `iloc` can both also take a boolean array as 'arguments'.

They will use that boolean array to determine which rows and columns to return (`True` means return, `False` means don't return). This means that we have to provide a boolean array with a length equal to the length of the DataFrame:

In [None]:
boolean_array = np.full(len(df), True) # use numpy to create an array filled with 'True' the same size as our dataframe; what happens if you change it to 'False'? Try it by yourself!
df.loc[boolean_array] # because everything is 'True', this returns all rows

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


While this alone is not very useful, it becomes very powerful when we combine it with another feature of pandas: element-wise (broadcast) comparisons.

When we apply a comparison operation to a column of our DataFrame, pandas returns a Series of boolean values, each indicating whether the corresponding row fulfills the comparison (missing values such as `NaN` compare as `False`):

In [None]:
df.Age > 20

0       True
1       True
2       True
3       True
4       True
       ...  
886     True
887    False
888    False
889     True
890     True
Name: Age, Length: 891, dtype: bool

We can now easily select all rows where the passenger is older than 20 by using:

In [None]:
df.loc[df.Age > 20]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
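Several conditions can be combined into one mask, and `loc` can restrict the columns at the same time. The sketch below uses a small made-up table instead of the Titanic data:

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Ada", "Ben", "Cleo", "Dan"],
    "Sex": ["female", "male", "female", "male"],
    "Age": [31.0, 19.0, None, 45.0],
})

# Combine conditions with & (and), | (or), ~ (not); each condition needs parentheses.
mask = (toy["Age"] > 20) & (toy["Sex"] == "male")

# Rows come from the mask, columns from the list of labels.
selected = toy.loc[mask, ["Name", "Age"]]

print(selected)
```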


Or all male passengers:

In [None]:
df.loc[df.Sex == 'male'] # boolean mask built from the 'Sex' column

## Adding Columns and Rows

We can add a new column to a pandas DataFrame by simply assigning to a column name that does not exist yet:

In [None]:
a = np.random.rand(len(df))

In [None]:
a

In [None]:
df['New Column'] = np.random.rand(len(df))

In [None]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,New Column
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,0.833758
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0.165245
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,0.909593
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,0.790978
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,0.051708
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,0.453368
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,0.239575
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,0.630189
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,0.413827


This can also be used to introduce new columns based on other columns:

In [None]:
df['Age Squared'] = df.Age**2
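Derived columns are not limited to a single source column. In the sketch below, the table and the new column names (`FamilySize`, `IsAlone`, `Constant`) are hypothetical and only serve as an illustration:

```python
import pandas as pd

toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 1]})

toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1   # element-wise arithmetic on columns
toy["IsAlone"] = toy["FamilySize"] == 1               # a comparison yields a boolean column
toy["Constant"] = 0                                   # a scalar is repeated for every row

print(toy)
```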

To remove columns from a dataframe, the `drop()` method is used:

In [None]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,New Column,Age Squared
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,0.833758,484.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0.165245,1444.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,0.909593,676.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,0.790978,1225.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,0.051708,1225.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,0.453368,729.0
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,0.239575,361.0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,0.630189,
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,0.413827,676.0


In [None]:
df = df.drop(columns=['New Column'])

Note that `drop()` returns a new dataframe and leaves the original untouched, so you have to reassign the result (as in the cell above) if you want to keep the change.
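The following minimal sketch illustrates this behaviour on a tiny stand-in dataframe. The names `demo` and `without_extra` and all values are made up for illustration and are not part of the Titanic data.

```python
import pandas as pd

# Tiny stand-in dataframe (hypothetical values, not the Titanic data)
demo = pd.DataFrame({"Age": [22.0, 38.0], "Fare": [7.25, 71.28], "Extra": [0.1, 0.2]})

# drop() returns a new dataframe; `demo` itself is not modified
without_extra = demo.drop(columns=["Extra"])

print(list(demo.columns))           # the original still contains 'Extra'
print(list(without_extra.columns))  # the returned dataframe does not
```

Without the assignment to `without_extra`, the result of `drop()` would simply be discarded.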

Removing rows can be accomplished with the same method by passing the `index` argument instead:

In [None]:
df.drop(index=0)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Squared
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1444.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,676.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,1225.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,1225.0
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,729.0
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,361.0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,676.0


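Adding rows to a dataframe is less convenient. One tool for this is the `pd.concat()` function, but the input has to be shaped correctly: concatenating a `Series` with `axis=1`, as in the cell below, attaches it as a new column (labelled `0` in the output) rather than as a new row: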

In [None]:
series = pd.Series([2, 2, 3, 'Mr. Fake', 'male', 23, 1, 0, 'No Ticket', 23.3, np.nan, 'S', 23])
pd.concat([df, series] , axis=1)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Squared,0
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,484.0,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1444.0,2
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,676.0,3
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,1225.0,Mr. Fake
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,1225.0,male
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,729.0,
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,361.0,
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,,
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,676.0,
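In the output above, the series ends up as an extra column labelled `0`. To append it as a row with `pd.concat()`, one option is to wrap the values in a one-row dataframe with matching column names and concatenate along the default row axis. The sketch below uses a small made-up dataframe (`demo`, `new_row` and `extended` are illustrative names) rather than the Titanic data.

```python
import pandas as pd

# Small stand-in dataframe (hypothetical subset of columns)
demo = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Sex": ["male", "female"],
    "Age": [22.0, 26.0],
})

# Wrap the new values in a one-row dataframe whose columns match `demo`
new_row = pd.DataFrame([["Mr. Fake", "male", 23.0]], columns=demo.columns)

# The default axis=0 stacks rows; ignore_index=True renumbers the index from 0
extended = pd.concat([demo, new_row], ignore_index=True)

print(extended.shape)
print(extended.tail(1))
```

As with `drop()`, `pd.concat()` returns a new dataframe, so the result has to be assigned to a name to be kept.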


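Alternatively, a row can be added in place by assigning a list of values to a new index label with `loc`:

In [None]: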
df.loc[891] = [2, 2, 3, 'Mr. Fake', 'male', 23, 1, 0, 'No Ticket', 23.3, np.nan, 'S', 23]

In [None]:
df.shape[0]

892

In [None]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Squared
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,484.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1444.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,676.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,1225.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,1225.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,729.0
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,361.0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,676.0


## Calculating with Pandas

Similar to numpy arrays, pandas dataframes and series also support mathematical operations:

In [None]:
df.mean(numeric_only=True)  # skip text columns such as Name; needed in pandas 2.0 and later

PassengerId     446.000000
Survived          0.383838
Pclass            2.308642
Age              29.699118
SibSp             0.523008
Parch             0.381594
Fare             32.204208
Age Squared    1092.761169
dtype: float64

In [None]:
df.Age.mean()

29.69911764705882

In [None]:
df.Age.max()

80.0

In [None]:
df.Age.min()

0.42

We can also perform more sophisticated operations by passing our own function to the `apply()` method:

In [None]:
def square(x):
  return x*x

df.Age.apply(square)

0       484.0
1      1444.0
2       676.0
3      1225.0
4      1225.0
        ...  
886     729.0
887     361.0
888       NaN
889     676.0
890    1024.0
Name: Age, Length: 891, dtype: float64
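For a simple operation like squaring, `apply()` and plain vectorized arithmetic give the same result. The short sketch below compares the two on a made-up series (`ages` is an illustrative stand-in for `df.Age`, including one missing value).

```python
import pandas as pd

# Stand-in for df.Age, with one missing value (hypothetical data)
ages = pd.Series([22.0, 38.0, None, 35.0], name="Age")

squared_apply = ages.apply(lambda x: x * x)  # element-wise via apply()
squared_vectorized = ages ** 2               # same computation, vectorized

# equals() treats NaN values in the same position as equal
print(squared_apply.equals(squared_vectorized))
```

`apply()` is most useful when the operation cannot easily be written with built-in arithmetic or pandas methods; for simple arithmetic the vectorized form is usually faster.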

## Exercises

### Exercise 1
Use the titanic pandas dataframe from above and select the names of all passengers that survived.

In [None]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Squared
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,484.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1444.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,676.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,1225.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,1225.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,729.0
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,361.0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,676.0


In [None]:
df[df.Survived == 1].Name

1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
8      Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                    Nasser, Mrs. Nicholas (Adele Achem)
                             ...                        
875                     Najib, Miss. Adele Kiamie "Jane"
879        Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)
880         Shelley, Mrs. William (Imanita Parrish Hall)
887                         Graham, Miss. Margaret Edith
889                                Behr, Mr. Karl Howell
Name: Name, Length: 342, dtype: object

### Exercise 2
Use the titanic pandas dataframe from above and check the number of female and male passengers.

In [None]:
df = df.drop(index=891)  # remove the fake passenger added earlier

In [None]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Squared
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,484.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1444.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,676.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,1225.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,1225.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,729.0
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,361.0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,676.0


In [None]:
df[df.Sex == 'male'].shape[0]

577

In [None]:
df[df.Sex == 'female'].shape[0]

314
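The same counts can be obtained in one step with `value_counts()`, which tallies how often each distinct value occurs in a column. The sketch below uses a tiny made-up `Sex` column rather than the Titanic data.

```python
import pandas as pd

# Tiny stand-in for the Sex column (hypothetical data)
demo = pd.DataFrame({"Sex": ["male", "female", "female", "male", "male"]})

# Number of rows per distinct value, sorted from most to least frequent
counts = demo["Sex"].value_counts()
print(counts)
```

On the Titanic dataframe, `df.Sex.value_counts()` would be the equivalent call.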

### Exercise 3
Use the titanic pandas dataframe from above and check whether women were more or less likely to survive.

In [None]:
# Number of survivors per sex
number_women_survived = df[(df.Sex == 'female') & (df.Survived == 1)].shape[0]
number_men_survived = df[(df.Sex == 'male') & (df.Survived == 1)].shape[0]

# Total number of passengers per sex
number_men = df[df.Sex == 'male'].shape[0]
number_women = df[df.Sex == 'female'].shape[0]

# Survival rate per sex
print(number_women_survived/number_women)
print(number_men_survived/number_men)

0.7420382165605095
0.18890814558058924
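Because `Survived` only contains 0 and 1, its mean within a group is that group's survival rate. A more compact way to express the comparison is therefore a `groupby()` followed by `mean()`. The sketch below uses made-up data; `demo` and `survival_rate` are illustrative names.

```python
import pandas as pd

# Hypothetical miniature dataset with the same two columns
demo = pd.DataFrame({
    "Sex": ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
})

# Mean of a 0/1 column per group = share of survivors per group
survival_rate = demo.groupby("Sex")["Survived"].mean()
print(survival_rate)
```

On the Titanic dataframe, the corresponding call would be `df.groupby('Sex')['Survived'].mean()`.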


In [None]:
df.shape[0]

891

In [None]:
len(df)

891

In [None]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Squared
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,484.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1444.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,676.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,1225.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,1225.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,729.0
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,361.0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,676.0


### Exercise 4
Use the titanic pandas dataframe from above and check which of the following passenger groups had the highest risk of dying:

*   passengers younger than 18
*   passengers between 18 and 60
*   passengers above 60

In [None]:
# Note: the boundaries below include age 18 in the first group and age 60 in the third
first_group = df[df.Age <= 18]
second_group = df[(df.Age > 18) & (df.Age < 60)]
third_group = df[df.Age >= 60]

# Survived == 0 marks passengers who died
number_died_first_group = len(first_group[first_group.Survived == 0])
number_died_second_group = len(second_group[second_group.Survived == 0])
number_died_third_group = len(third_group[third_group.Survived == 0])

number_first_group = len(first_group)
number_second_group = len(second_group)
number_third_group = len(third_group)

# Share of each group that died
print(number_died_first_group/number_first_group)
print(number_died_second_group/number_second_group)
print(number_died_third_group/number_third_group)

0.49640287769784175
0.6120218579234973
0.7307692307692307
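The three printed values are the shares of passengers who died in each group, so the oldest group had the highest risk. The repeated steps can also be written more compactly with a dictionary of groups. The sketch below uses the same boundaries as the cell above on a small made-up dataframe; `demo`, `groups` and `death_rates` are illustrative names.

```python
import pandas as pd

# Hypothetical miniature dataset; the last row has a missing age
demo = pd.DataFrame({
    "Age": [4, 17, 30, 45, 59, 60, 71, None],
    "Survived": [1, 0, 0, 1, 0, 0, 0, 1],
})

# Same boundaries as above; rows with a missing Age match none of the masks
groups = {
    "18 or younger": demo[demo.Age <= 18],
    "between 18 and 60": demo[(demo.Age > 18) & (demo.Age < 60)],
    "60 or older": demo[demo.Age >= 60],
}

# Survived is 0/1, so 1 - mean() is the share of the group that died
death_rates = {label: 1 - group.Survived.mean() for label, group in groups.items()}

for label, rate in death_rates.items():
    print(label, rate)
```

Note that passengers without a recorded age are silently excluded from all three groups, both here and in the cell above.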


### Exercise 5
Use the dataset under https://data.humdata.org/hxlproxy/api/data-preview.csv?url=https%3A%2F%2Fraw.githubusercontent.com%2Fnytimes%2Fcovid-19-data%2Fmaster%2Fus-counties.csv&filename=us-counties.csv to find out which US state has the highest cumulative COVID death count.

In [None]:
!wget 'https://data.humdata.org/hxlproxy/api/data-preview.csv?url=https%3A%2F%2Fraw.githubusercontent.com%2Fnytimes%2Fcovid-19-data%2Fmaster%2Fus-counties.csv&filename=us-counties.csv'

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('https://data.humdata.org/hxlproxy/api/data-preview.csv?url=https%3A%2F%2Fraw.githubusercontent.com%2Fnytimes%2Fcovid-19-data%2Fmaster%2Fus-counties.csv&filename=us-counties.csv')

In [None]:
df

Unnamed: 0,date,county,state,fips,cases,deaths
0,2020-01-21,Snohomish,Washington,53061.0,1,0.0
1,2020-01-22,Snohomish,Washington,53061.0,1,0.0
2,2020-01-23,Snohomish,Washington,53061.0,1,0.0
3,2020-01-24,Cook,Illinois,17031.0,1,0.0
4,2020-01-24,Snohomish,Washington,53061.0,1,0.0
...,...,...,...,...,...,...
2254682,2022-02-26,Greenville,South Carolina,45045.0,171160,1866.0
2254683,2022-02-26,Greenwood,South Carolina,45047.0,22357,282.0
2254684,2022-02-26,Hampton,South Carolina,45049.0,4988,80.0
2254685,2022-02-26,Horry,South Carolina,45051.0,95626,1093.0


In [None]:
df[df.cases == df.cases.max()]

Unnamed: 0,date,county,state,fips,cases,deaths
2252446,2022-02-26,Los Angeles,California,6037.0,2794480,30650.0
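The cell above returns the single county row with the highest case count, which is not quite what the exercise asks for. To answer the question per state, the county-level death counts have to be added up for each state. The sketch below shows one possible approach on a made-up miniature of the data (same column names, invented numbers). It assumes that the counts are cumulative, as the exercise states, and that every county has a row for the most recent date, which may not hold for the real file.

```python
import pandas as pd

# Hypothetical miniature of the county-level data (made-up numbers)
demo = pd.DataFrame({
    "date":   ["2022-02-25", "2022-02-25", "2022-02-25",
               "2022-02-26", "2022-02-26", "2022-02-26"],
    "county": ["A", "B", "C", "A", "B", "C"],
    "state":  ["X", "X", "Y", "X", "X", "Y"],
    "deaths": [10.0, 5.0, 12.0, 11.0, 6.0, 13.0],
})

# Counts are cumulative, so only the most recent date is needed
# (ISO-formatted date strings sort correctly as plain text)
latest = demo[demo.date == demo.date.max()]

# Add up the county totals within each state
deaths_per_state = latest.groupby("state")["deaths"].sum()

# idxmax() returns the index label (here: the state) of the largest value
print(deaths_per_state.idxmax(), deaths_per_state.max())
```

In this toy data the county with the most deaths lies in state `Y`, yet state `X` has the higher total, which is why the aggregation step matters.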


# Additional Resources

The content we discussed in this workshop only scratches the surface.

Here are some additional resources you can look at if you need more information or are just interested:

### General
*   [Introduction to Colab, Python, Numpy and Matplotlib (Notebook)](https://colab.research.google.com/github/cs231n/cs231n.github.io/blob/master/python-colab.ipynb#scrollTo=cYb0pjh1L9eb) 

### Jupyter Notebook & Google Colaboratory
*   [Introduction to Colab (YouTube)](https://www.youtube.com/watch?v=inN8seMm7UI)
*   [Overview of Colab (Notebook)](/notebooks/basic_features_overview.ipynb)
*   [Jupyter Notebook Documentation](https://jupyter.org/documentation)

*   [Markdown Guide](/notebooks/markdown_guide.ipynb)
*   [Installing and Importing Libraries (Notebook)](/notebooks/snippets/importing_libraries.ipynb)
*   [Saving and Loading Notebooks in GitHub (Notebook)](https://colab.research.google.com/github/googlecolab/colabtools/blob/master/notebooks/colab-github-demo.ipynb)
*   [Interactive Forms (Notebook)](/notebooks/forms.ipynb)
*   [Interactive Widgets (Notebook)](/notebooks/widgets.ipynb)
*   [Loading Data from Drive, Tables and Google Cloud Storage (Notebook)](/notebooks/io.ipynb) 

### Numpy
*   [Quickstart](https://numpy.org/doc/stable/user/quickstart.html)
*   [Tutorial (Notebook)](https://colab.research.google.com/drive/1NQDtO3Y8kApxS5SwMPE2hvdUl3T_of4V)
*   [Reference](https://numpy.org/doc/stable/reference/index.html)

### Pandas
*   [Getting Started](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html)
*   [Pandas Walkthrough (YouTube)](https://www.youtube.com/watch?v=_T8LGqJtuGc)
*   [API Reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html)
*   [Introduction to Pandas (Notebook)](/notebooks/mlcc/intro_to_pandas.ipynb)