# Module 4 - Control Flow

We've now seen how to store various data types using variables. Most commonly in programming, we're interested not just in storing data but also in manipulating it in some way. This motivates the tools you'll see in this chapter: *control flow* elements of Python that allow you to group and organize operations.

---

## If-else statements
Just like in Julia, Python has standard `if-else` statements. Let's see an example.

In [1]:
x = 3
if 2*x == 6:
    print('2*x equals 6')
    print('Hooray!')

2*x equals 6
Hooray!


Some important points to note:
* The `if` statement ends with a `:`
* Python uses indentation to denote which lines are encapsulated in the `if` statement

Let's see examples which won't work.
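
As a sketch of what goes wrong, the snippet below passes two malformed `if` statements to the built-in `compile()` function, which checks the syntax without running the code. The first omits the colon and the second omits the indentation. The helper name `try_compile` is hypothetical and only used for illustration.

```python
# Two snippets that Python rejects before running a single line.
missing_colon = "x = 3\nif 2*x == 6\n    print('Hooray!')\n"
missing_indent = "x = 3\nif 2*x == 6:\nprint('Hooray!')\n"

def try_compile(source):
    """Return the name of the error raised when compiling `source`, or None."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        return type(err).__name__
    return None

print(try_compile(missing_colon))   # SyntaxError
print(try_compile(missing_indent))  # IndentationError
```

Typing either snippet directly into a cell would stop with the same error.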

We can add an `else` statement to handle the case where the expression fails.

In [2]:
x = 4
if 2*x == 6:
    print('2*x equals 6')
else:
    print('2*x does not equal 6')

2*x does not equal 6


Finally, we can add an `elif` clause if we want an else-if expression.

In [3]:
x = 4
if 2*x == 6:
    print('2*x equals 6')
    print('Hooray!')
elif not (2*x > 8):
    print('2*x does not equal 6, but is not greater than 8')

2*x does not equal 6, but is not greater than 8


## Iteration

Another common pattern of "things you want to do with code" is to do something repeatedly. This could be, for example, performing some actions on each item in a collection, or running some code until a condition no longer holds. Here we investigate ways to express this in Python code.

### The `for` loop

The first example of this is the `for` loop, which allows you to execute statements repeatedly. Python's `for` statement iterates over the items of any sequence, in the order in which they appear. For example:

In [5]:
words = ["I", "am", "happy", "to", "be", "at", "NYU"]
for w in words:
    print("|", w, "|")

| I |
| am |
| happy |
| to |
| be |
| at |
| NYU |


Just like the `if` statement, we use indented blocks to denote what should get repeated! We also have a colon after the sequence. The variable `w` refers to an individual item in the sequence (an individual word).

If you want to iterate over numbers, the `range()` function defines a sequence of integers: `range(10)` contains the integers 0 (incl.) through 10 (excl.), so a loop over it starts at `i = 0` and stops after `i = 9`. Here we use `range(len(words))` to loop over the valid indices of `words`:

In [6]:
for i in range(len(words)):
    print(i, words[i])

0 I
1 am
2 happy
3 to
4 be
5 at
6 NYU
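
For reference, here is a small sketch of the three ways `range()` can be called. The variable names are only illustrative.

```python
# range(stop): integers from 0 up to, but not including, stop
first_ten = list(range(10))
print(first_ten)       # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# range(start, stop): start is included, stop is excluded
two_to_five = list(range(2, 6))
print(two_to_five)     # [2, 3, 4, 5]

# range(start, stop, step): move forward in increments of step
every_third = list(range(0, 10, 3))
print(every_third)     # [0, 3, 6, 9]
```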


Be very careful about modifying a collection you're iterating over! Here's an example:


In [7]:
words = ["I", "am", "happy", "to", "be", "at", "NYU"]
for i in range(len(words)):
    if len(words[i]) < 3:
        words.pop(i) # deletes words shorter than 3 letters

IndexError: list index out of range
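
To see why this fails, the sketch below runs the same loop but catches the error and prints the state at the moment of failure. The `try`/`except` wrapper is an addition for illustration only.

```python
words = ["I", "am", "happy", "to", "be", "at", "NYU"]
try:
    for i in range(len(words)):   # range(7) is fixed before the loop starts
        if len(words[i]) < 3:
            words.pop(i)          # the list shrinks, so later items shift left
except IndexError:
    print("Failed at i =", i, "with words =", words)
    # Failed at i = 4 with words = ['am', 'happy', 'be', 'NYU']
```

The list has only four items left by the time the loop asks for index 4, and the short words `"am"` and `"be"` were never removed because they shifted into positions the loop had already passed.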

And here are a few ways to do it right:

In [9]:
words = ["I", "am", "happy", "to", "be", "at", "NYU"]

# Strategy 1: create a new collection
long_words = []
for w in words:
    if len(w) >= 3:
        long_words.append(w)
long_words

['happy', 'NYU']

In [10]:
words = ["I", "am", "happy", "to", "be", "at", "NYU"]

# Strategy 2: collect the indices to delete
short_word_indices = []
for i in range(len(words)):
    if len(words[i]) < 3:
        short_word_indices.append(i)

for i in sorted(short_word_indices, reverse = True):
    words.pop(i)

words

['happy', 'NYU']
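
A third option, added here as a sketch alongside the two strategies above, is a list comprehension. It builds the new collection in a single expression.

```python
words = ["I", "am", "happy", "to", "be", "at", "NYU"]

# Strategy 3: a list comprehension (equivalent to Strategy 1, written in one line)
long_words = [w for w in words if len(w) >= 3]
print(long_words)  # ['happy', 'NYU']
```

Unlike Strategy 2, this leaves the original `words` list untouched.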

When you want to loop over a sequence but access both the index and the item itself, you should use `enumerate()`. Here's an example (and further reading [here](https://docs.python.org/3/tutorial/datastructures.html#tut-loopidioms)):

In [11]:
for (i, word) in enumerate(words):
    print(i, word)

0 happy
1 NYU


### The `while` loop

Sometimes you want to repeat a block of code until a certain condition is met, rather than for a fixed number of iterations. We accomplish this in Python using `while` loops, as shown below:

In [12]:
# You can use `while` loops to accomplish whatever `for` loops do
count = 0
while count < 10:
    print(count)
    count += 1

0
1
2
3
4
5
6
7
8
9


In [13]:
# But you can also use them to iterate in different ways
# e.g. until a condition is reached!

# This is hard to write as a for loop, since the number of iterations isn't known in advance!
num = 15
while num != 1:
    print(num)
    # compute the Collatz sequence
    if num % 2 == 1:
        num = 3 * num + 1
    else:
        num = num // 2

15
46
23
70
35
106
53
160
80
40
20
10
5
16
8
4
2


In the above examples, the statements in the `while` block are executed repeatedly. Before each iteration, the expression right after the `while` keyword (the *condition*) is evaluated to either `True` or `False`, and the loop stops as soon as the condition evaluates to `False`.

### `break` and `continue` statements, and `else` clauses

There are a few more features of the `for` and `while` loops we haven't mentioned!
- The `break` statement breaks out of the innermost `for` or `while` loop.
- The `continue` statement continues with the next iteration of the loop.
- The `else` clause for a `for` loop is executed after the final iteration of the loop.
- The `else` clause for a `while` loop is executed after the loop's condition becomes false.
- In both cases, if the loop was terminated by a `break`, the `else` clause is not executed.

Here are some examples:

In [14]:
# This for loop searches for primes:
for n in range(2, 10):
    for x in range(2, n):
        if n % x == 0:
            print(n, "equals", x, "*", n // x)
            break # Breaks out of the `for x` loop, since a divisor was found
    else:
        # If the `for x` loop was not broken out of, prints this line
        print(n, "is a prime number.")

2 is a prime number.
3 is a prime number.
4 equals 2 * 2
5 is a prime number.
6 equals 2 * 3
7 is a prime number.
8 equals 2 * 4
9 equals 3 * 3


In [15]:
# This for loop searches for even numbers.
for n in range(2, 10):
    if n % 2 == 0:
        print("Found an even number", n)
    else:
        print("Found an odd number", n)

print()
for n in range(2, 10):
    if n % 2 == 0:
        print("Found an even number", n)
        # Proceeds to the next iteration of the for loop
        continue
    print("Found an odd number", n)

Found an even number 2
Found an odd number 3
Found an even number 4
Found an odd number 5
Found an even number 6
Found an odd number 7
Found an even number 8
Found an odd number 9

Found an even number 2
Found an odd number 3
Found an even number 4
Found an odd number 5
Found an even number 6
Found an odd number 7
Found an even number 8
Found an odd number 9
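
The examples above use `else` with a `for` loop only. As a sketch of the `while` case, the loop below doubles a number a limited number of times. The values of `limit` and the cap of 5 tries are arbitrary choices for illustration.

```python
limit = 100
power = 1
tries = 0
while tries < 5:
    power *= 2
    tries += 1
    if power > limit:
        print("Found", power, "after", tries, "tries")
        break  # leaving via break skips the else clause
else:
    # Runs because the condition became false without a break
    print("Gave up after", tries, "tries; last value was", power)
    # Gave up after 5 tries; last value was 32
```

Raising the cap to 10 tries would let the loop reach 128 and exit through `break`, in which case the `else` clause would not run.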


## Functions

So far, we've already seen some examples of functions, such as the built-in functions `len()`, `range()`, and `sorted()`, as well as sequence-specific methods. The real power of Python (and all modern programming) lies in the ability to define your own functions -- this is very important in creating modular, extensible code.

Here's the syntax for creating a function:

In [16]:
def addition(x, y):
    return x + y

In the above function, the keyword `def` introduces a new function *definition*. It's followed by the function name `addition`.
- What comes inside the parentheses are the *parameters* of the function. Here we have 2 parameters, *x* and *y*.
- What comes after the `return` keyword is the output of the function.
- We don't have to declare the types of *x* and *y*! `addition(x, y)` just works regardless of whether *x* and *y* are ints, floats, other numeric types, or even strings, so long as they support the `+` operation. This is one of the key characteristics of Python: "duck typing"!

(If you want, you can annotate the types of your parameters and output; see [Function Annotations](https://docs.python.org/3/tutorial/controlflow.html#function-annotations))
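
As a minimal sketch, here is what an annotated version of `addition` could look like. The choice of `float` for the annotations is an assumption made for illustration.

```python
def addition(x: float, y: float) -> float:
    # Annotations document intent; Python does not enforce them at runtime
    return x + y

print(addition(1.5, 2))             # 3.5
print(addition("Hello ", "world"))  # Hello world (still works despite the annotations)
print(addition.__annotations__["return"])  # <class 'float'>
```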

In [17]:
addition(5.7, 3 + 2j)

(8.7+2j)

In [18]:
addition("Hello ", "world")

'Hello world'

In [19]:
addition([2, 4], [1, 3, 5])

[2, 4, 1, 3, 5]

There's a lot more to talk about on defining functions, like default argument values, keyword/positional arguments, arbitrary argument lists, etc. Read more [here](https://docs.python.org/3/tutorial/controlflow.html#more-on-defining-functions).
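
To give a flavour of these features, here is a short sketch. The function `scaled_sum` and its parameters are hypothetical names made up for this example.

```python
def scaled_sum(*values, scale=1.0, offset=0):
    """Add up any number of values, then scale and shift the result."""
    # *values collects all positional arguments into a tuple;
    # scale and offset have defaults and must be passed by keyword.
    return scale * sum(values) + offset

print(scaled_sum(1, 2, 3))                       # 6.0
print(scaled_sum(1, 2, 3, scale=2))              # 12
print(scaled_sum(1, 2, 3, offset=1, scale=0.5))  # 4.0
```

Keyword arguments can be given in any order, which is why the last call works even though `offset` comes before `scale`.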

# Module 5: A NumPy Primer

We're going to look at `numpy`, which is a very popular package in Python for numerical computation and scientific computing. It supports fast array arithmetic, vectorized operations, and built-in linear algebra functions, among others.

---

### The NumPy array

The basic object in NumPy is the array. It is a table of elements (usually numbers), all of the same type. It is indexed by a tuple of non-negative integers, along its axes.

In [20]:
import numpy as np

In [21]:
A = np.array([[1, 2, 3],[4, 5, 6]])

In [22]:
A

array([[1, 2, 3],
       [4, 5, 6]])

Here we've created our first array `A` via a list of lists. Since there are 2 levels of nesting, `A` has dimension 2:

In [23]:
print("Shape:", A.shape)
print("Dimension:", A.ndim)
print("Datatype:", A.dtype)

Shape: (2, 3)
Dimension: 2
Datatype: int64


Indexing and slicing into arrays works much like it does for regular lists, except that you can index along several axes at once by separating the indices with commas:

In [24]:
print(A)
print(A[0,2])
print(A[0,:])
print(A[:,2])
print(A[1,0:2])

[[1 2 3]
 [4 5 6]]
3
[1 2 3]
[3 6]
[4 5]
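
To make the comparison with lists concrete, the sketch below performs the same lookups on a nested list and on an array built from it. The name `nested` is only illustrative.

```python
import numpy as np

nested = [[1, 2, 3], [4, 5, 6]]
A = np.array(nested)

# A single element: lists need one pair of brackets per level of nesting
print(nested[0][2])  # 3
print(A[0, 2])       # 3

# A whole column: lists need a loop or comprehension, arrays take a slice
print([row[2] for row in nested])  # [3, 6]
print(A[:, 2])                     # [3 6]

# Comma-separated indices are not supported by plain lists
try:
    nested[0, 2]
except TypeError as err:
    print("TypeError:", err)
```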


A common pattern when working with arrays is that we don't know the array contents ahead of time, but we do know the size and data type. Hence, one can use these functions to *initialize* arrays with placeholder content:

In [25]:
np.zeros((3, 4)) # initialize specified shape with zeros

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [26]:
np.ones((2, 2, 2), dtype = bool) # initialize specified shape with ones of that datatype

array([[[ True,  True],
        [ True,  True]],

       [[ True,  True],
        [ True,  True]]])

In [27]:
np.empty((2, 3)) # creates array but does not initialize values

array([[2.17587523e-316, 0.00000000e+000, 6.57983230e-310],
       [6.57983230e-310, 4.89227755e-317, 6.57983230e-310]])
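
A typical use of this pattern, sketched here with made-up values, is to allocate the array first and fill it in afterwards:

```python
import numpy as np

n = 5
squares = np.zeros(n, dtype=int)  # placeholder content with known size and type
for i in range(n):
    squares[i] = i ** 2           # fill in the real values afterwards

print(squares)  # [ 0  1  4  9 16]
```

With `np.empty()` the same approach works, but every entry must be written before it is read, since the initial values are arbitrary.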

### Math operations

One of the main things you want to do with NumPy is to perform math operations on numbers. If you try this using a base Python list you get a somewhat unexpected result:

In [28]:
[2, 3, 4] * 2

[2, 3, 4, 2, 3, 4]

But trying this with a NumPy array gives the desired result:

In [29]:
a = np.array([2, 3, 4])
a * 2

array([4, 6, 8])

Here are more examples:

In [30]:
a ** 2

array([ 4,  9, 16])

In [31]:
b = np.array([3, 5, 1])
a + b

array([5, 8, 5])

In [32]:
a == 3

array([False,  True, False])

In [33]:
A = np.array([[2, 3], [0, 4]])
B = np.array([[0, 1], [3, 2]])

In [34]:
A * B # elementwise product

array([[0, 3],
       [0, 8]])

In [35]:
A @ B # matrix product

array([[ 9,  8],
       [12,  8]])

Here are some examples of ways you can *aggregate* a table of numbers along any of its axes:

In [36]:
A.sum()

np.int64(9)

In [37]:
A.sum(axis = 0)

array([2, 7])

In [38]:
A.min(axis = 1)

array([2, 0])
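
The `axis` argument names the axis that gets collapsed. As a sketch on a non-square array (the array `M` is made up for illustration), this is easier to see:

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])    # shape (2, 3)

col_sums = M.sum(axis=0)     # collapses the 2 rows: one sum per column
row_sums = M.sum(axis=1)     # collapses the 3 columns: one sum per row

print(col_sums)  # [5 7 9]
print(row_sums)  # [ 6 15]
print(M.mean())  # 3.5 (no axis given, so all elements are aggregated)
```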

### Reshaping

Sometimes, your data manipulation requires you to change the shape of an array. NumPy lets you do this via various commands:

In [39]:
C = np.arange(12).reshape(6, 2)
C

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11]])

In [40]:
C.ravel() # returns the array, flattened

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [41]:
C.reshape(3, 2, 2) # reshapes the array to the specified shape

array([[[ 0,  1],
        [ 2,  3]],

       [[ 4,  5],
        [ 6,  7]],

       [[ 8,  9],
        [10, 11]]])

In [42]:
C.reshape(3, -1, 2) # reshapes to specified shape - NumPy infers the dimension marked -1

array([[[ 0,  1],
        [ 2,  3]],

       [[ 4,  5],
        [ 6,  7]],

       [[ 8,  9],
        [10, 11]]])

In [43]:
C.T # does an array transpose

array([[ 0,  2,  4,  6,  8, 10],
       [ 1,  3,  5,  7,  9, 11]])
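
As a small sketch of how `-1` behaves, NumPy works out the missing dimension from the total number of elements, and raises an error when no whole number fits:

```python
import numpy as np

C = np.arange(12).reshape(6, 2)

D = C.reshape(-1, 3)   # -1 is inferred as 12 / 3 = 4 rows
print(D.shape)         # (4, 3)

message = None
try:
    C.reshape(5, -1)   # 12 elements cannot be split into 5 equal rows
except ValueError as err:
    message = str(err)
    print("ValueError:", message)
```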

For further reading, we highly recommend:
https://numpy.org/devdocs/user/quickstart.html

# Module 6: Intro to Pandas, Data Manipulation, and Visualization in Python
In this section, we will learn and practice how to read in data, and how to manipulate and visualize it in Python. In particular, we will be learning the `pandas` package, which provides a fast and powerful interface to dataframes.




### Data I/O

The first thing we do is to read data into a format known as a Pandas DataFrame. We will be looking at data on historical meteorite landings.

In [44]:
import pandas as pd
import numpy as np

We read the data from a CSV file using the `read_csv()` function:

In [45]:
# Here, the ! before wget indicates that this line is a shell command (not for the Python interpreter)
# We retrieve the file, which I hosted on my Dropbox, before reading it
# The URL is quoted so that the shell does not treat the & character as a command separator
!wget -O meteorites.csv "https://www.dropbox.com/scl/fi/3lbzi31nvcdd6zqf6fu3d/Meteorite_Landings.csv?rlkey=cbta8umoy6h3ol598dwaarxnd&dl=0"
meteorites = pd.read_csv("meteorites.csv")

--2025-04-08 16:12:38--  https://www.dropbox.com/scl/fi/3lbzi31nvcdd6zqf6fu3d/Meteorite_Landings.csv?rlkey=cbta8umoy6h3ol598dwaarxnd
Resolving www.dropbox.com (www.dropbox.com)... 162.125.5.18, 2620:100:601d:18::a27d:512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.5.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://ucc65afec3e2934389bcb18afdb5.dl.dropboxusercontent.com/cd/0/inline/CnZ0dg8xq9kSrfJJb31azs6nh_ztjtwLikw_mE2Jks8iiqetuai-KK8N4zn7djj5s5lEu3AndaaeGuUBKkK7gpkj4dRDDNKjeSo7x5Q77USRMoQaGTQdLI7jZ3-8C2z-FbflxS1jxiZJltJ349Imkw7t/file# [following]
--2025-04-08 16:12:39--  https://ucc65afec3e2934389bcb18afdb5.dl.dropboxusercontent.com/cd/0/inline/CnZ0dg8xq9kSrfJJb31azs6nh_ztjtwLikw_mE2Jks8iiqetuai-KK8N4zn7djj5s5lEu3AndaaeGuUBKkK7gpkj4dRDDNKjeSo7x5Q77USRMoQaGTQdLI7jZ3-8C2z-FbflxS1jxiZJltJ349Imkw7t/file
Resolving ucc65afec3e2934389bcb18afdb5.dl.dropboxusercontent.com (ucc65afec3e2934389bcb18afdb5.dl.dropboxusercontent.com)... 162.125.5.

In [46]:
meteorites

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
0,Aachen,1,Valid,L5,21.0,Fell,01/01/1880 12:00:00 AM,50.77500,6.08333,"(50.775, 6.08333)"
1,Aarhus,2,Valid,H6,720.0,Fell,01/01/1951 12:00:00 AM,56.18333,10.23333,"(56.18333, 10.23333)"
2,Abee,6,Valid,EH4,107000.0,Fell,01/01/1952 12:00:00 AM,54.21667,-113.00000,"(54.21667, -113.0)"
3,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,01/01/1976 12:00:00 AM,16.88333,-99.90000,"(16.88333, -99.9)"
4,Achiras,370,Valid,L6,780.0,Fell,01/01/1902 12:00:00 AM,-33.16667,-64.95000,"(-33.16667, -64.95)"
...,...,...,...,...,...,...,...,...,...,...
45711,Zillah 002,31356,Valid,Eucrite,172.0,Found,01/01/1990 12:00:00 AM,29.03700,17.01850,"(29.037, 17.0185)"
45712,Zinder,30409,Valid,"Pallasite, ungrouped",46.0,Found,01/01/1999 12:00:00 AM,13.78333,8.96667,"(13.78333, 8.96667)"
45713,Zlin,30410,Valid,H4,3.3,Found,01/01/1939 12:00:00 AM,49.25000,17.66667,"(49.25, 17.66667)"
45714,Zubkovsky,31357,Valid,L6,2167.0,Found,01/01/2003 12:00:00 AM,49.78917,41.50460,"(49.78917, 41.5046)"


We can get a glimpse of the data by the following:
- `.shape` to get the dimensions / size of the dataframe;
- `.describe()` to see a summary of the data;
- `.value_counts()` to see the frequency of values in a specific column;
- `.head()` to see the first five rows;
- `.tail()` to see the last five rows.

In [47]:
meteorites.shape

(45716, 10)

In [48]:
meteorites.describe()

Unnamed: 0,id,mass (g),reclat,reclong
count,45716.0,45585.0,38401.0,38401.0
mean,26889.735104,13278.08,-39.12258,61.074319
std,16860.68303,574988.9,46.378511,80.647298
min,1.0,0.0,-87.36667,-165.43333
25%,12688.75,7.2,-76.71424,0.0
50%,24261.5,32.6,-71.5,35.66667
75%,40656.75,202.6,0.0,157.16667
max,57458.0,60000000.0,81.16667,354.47333


In [49]:
meteorites["fall"]

Unnamed: 0,fall
0,Fell
1,Fell
2,Fell
3,Fell
4,Fell
...,...
45711,Found
45712,Found
45713,Found
45714,Found


In [50]:
meteorites["fall"].value_counts()

fall,count
Found,44609
Fell,1107


In [51]:
meteorites["fall"].value_counts(normalize = True)

fall,proportion
Found,0.975785
Fell,0.024215


In [52]:
meteorites.head()

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
0,Aachen,1,Valid,L5,21.0,Fell,01/01/1880 12:00:00 AM,50.775,6.08333,"(50.775, 6.08333)"
1,Aarhus,2,Valid,H6,720.0,Fell,01/01/1951 12:00:00 AM,56.18333,10.23333,"(56.18333, 10.23333)"
2,Abee,6,Valid,EH4,107000.0,Fell,01/01/1952 12:00:00 AM,54.21667,-113.0,"(54.21667, -113.0)"
3,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,01/01/1976 12:00:00 AM,16.88333,-99.9,"(16.88333, -99.9)"
4,Achiras,370,Valid,L6,780.0,Fell,01/01/1902 12:00:00 AM,-33.16667,-64.95,"(-33.16667, -64.95)"


In [53]:
meteorites.tail()

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
45711,Zillah 002,31356,Valid,Eucrite,172.0,Found,01/01/1990 12:00:00 AM,29.037,17.0185,"(29.037, 17.0185)"
45712,Zinder,30409,Valid,"Pallasite, ungrouped",46.0,Found,01/01/1999 12:00:00 AM,13.78333,8.96667,"(13.78333, 8.96667)"
45713,Zlin,30410,Valid,H4,3.3,Found,01/01/1939 12:00:00 AM,49.25,17.66667,"(49.25, 17.66667)"
45714,Zubkovsky,31357,Valid,L6,2167.0,Found,01/01/2003 12:00:00 AM,49.78917,41.5046,"(49.78917, 41.5046)"
45715,Zulu Queen,30414,Valid,L3.7,200.0,Found,01/01/1976 12:00:00 AM,33.98333,-115.68333,"(33.98333, -115.68333)"


You can get the names of all the columns via `.columns`, and the row indices via `.index`. You can even get the data types that each column holds using `.dtypes`.

In [54]:
meteorites.columns

Index(['name', 'id', 'nametype', 'recclass', 'mass (g)', 'fall', 'year',
       'reclat', 'reclong', 'GeoLocation'],
      dtype='object')

In [55]:
meteorites.index

RangeIndex(start=0, stop=45716, step=1)

In [56]:
meteorites.dtypes

Unnamed: 0,0
name,object
id,int64
nametype,object
recclass,object
mass (g),float64
fall,object
year,object
reclat,float64
reclong,float64
GeoLocation,object


### Indexing

After reading in the datasets and looking at the description or the first few rows, we are interested in some basic dataframe manipulations. Here's how to select a specific column:

In [57]:
meteorites["name"]

Unnamed: 0,name
0,Aachen
1,Aarhus
2,Abee
3,Acapulco
4,Achiras
...,...
45711,Zillah 002
45712,Zinder
45713,Zlin
45714,Zubkovsky


What we have selected is a `Series` object, which is a single column together with the index. We can select multiple columns by passing a list of column names, which gives back a `DataFrame`:

In [58]:
meteorites[["name", "mass (g)"]]

Unnamed: 0,name,mass (g)
0,Aachen,21.0
1,Aarhus,720.0
2,Abee,107000.0
3,Acapulco,1914.0
4,Achiras,780.0
...,...,...
45711,Zillah 002,172.0
45712,Zinder,46.0
45713,Zlin,3.3
45714,Zubkovsky,2167.0


We can also index into `meteorites` directly with an integer slice, to select the appropriate rows:

In [59]:
meteorites[100:102]

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
100,Benton,5026,Valid,LL6,2840.0,Fell,01/01/1949 12:00:00 AM,45.95,-67.55,"(45.95, -67.55)"
101,Berduc,48975,Valid,L6,270.0,Fell,01/01/2008 12:00:00 AM,-31.91,-58.32833,"(-31.91, -58.32833)"


There are also ways to select both a subset of rows and columns, depending on whether you know their row and column indices, or their row and column names:
- `.iloc[]` for row and column indices (integer positions); slices exclude the endpoint, as with lists;
- `.loc[]` for row and column names (labels); slices include the endpoint.

In [60]:
meteorites.iloc[100:102, [0, 3, 4, 6]]

Unnamed: 0,name,recclass,mass (g),year
100,Benton,LL6,2840.0,01/01/1949 12:00:00 AM
101,Berduc,L6,270.0,01/01/2008 12:00:00 AM


In [61]:
meteorites.loc[100:102, ["name", "recclass", "mass (g)", "year"]]

Unnamed: 0,name,recclass,mass (g),year
100,Benton,LL6,2840.0,01/01/1949 12:00:00 AM
101,Berduc,L6,270.0,01/01/2008 12:00:00 AM
102,Béréba,Eucrite-mmict,18000.0,01/01/1924 12:00:00 AM
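
The difference between positions and labels is easier to see when the row labels do not start at 0. The small dataframe below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["a", "b", "c", "d"], "mass": [1.0, 2.0, 3.0, 4.0]},
    index=[10, 11, 12, 13],   # row labels differ from row positions
)

by_position = df.iloc[1:3]    # positions 1 and 2; the stop position is excluded
by_label = df.loc[11:13]      # labels 11, 12 and 13; the stop label is included

print(by_position.index.tolist())  # [11, 12]
print(by_label.index.tolist())     # [11, 12, 13]
```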


When selecting rows, we often want only those rows whose values meet some condition. This is done with a Boolean mask, also known as logical indexing. We demonstrate it here:

In [62]:
meteorites["mass (g)"] > 50

Unnamed: 0,mass (g)
0,False
1,True
2,True
3,True
4,True
...,...
45711,True
45712,False
45713,False
45714,True


The above is a Boolean mask, which is a `Series` object. Each entry is `True` or `False` depending on whether or not the meteorite weighed more than 50 grams.

We can compose these Boolean masks (using bitwise operators `&`, `|`, `~`), to encode more complicated logical conditions:

In [63]:
mask = (meteorites["mass (g)"] > 1e6) & (meteorites["fall"] == "Fell")

In [64]:
meteorites[mask]

Unnamed: 0,name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
29,Allende,2278,Valid,CV3,2000000.0,Fell,01/01/1969 12:00:00 AM,26.96667,-105.31667,"(26.96667, -105.31667)"
419,Jilin,12171,Valid,H5,4000000.0,Fell,01/01/1976 12:00:00 AM,44.05,126.16667,"(44.05, 126.16667)"
506,Kunya-Urgench,12379,Valid,H5,1100000.0,Fell,01/01/1998 12:00:00 AM,42.25,59.2,"(42.25, 59.2)"
707,Norton County,17922,Valid,Aubrite,1100000.0,Fell,01/01/1948 12:00:00 AM,39.68333,-99.86667,"(39.68333, -99.86667)"
920,Sikhote-Alin,23593,Valid,"Iron, IIAB",23000000.0,Fell,01/01/1947 12:00:00 AM,46.16,134.65333,"(46.16, 134.65333)"
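
Here is a sketch of all three operators on a tiny made-up dataframe, so that the selected rows can be checked by eye:

```python
import pandas as pd

df = pd.DataFrame({
    "mass": [10.0, 500.0, 2000.0, 30.0],
    "fall": ["Fell", "Found", "Fell", "Found"],
})

heavy = df["mass"] > 100
fell = df["fall"] == "Fell"

both = df[heavy & fell]     # heavy AND fell
either = df[heavy | fell]   # heavy OR fell
light = df[~heavy]          # NOT heavy

print(both.index.tolist())    # [2]
print(either.index.tolist())  # [0, 1, 2]
print(light.index.tolist())   # [0, 3]
```

When the comparisons are written inline rather than stored in variables, each one needs its own parentheses, because `&` and `|` bind more tightly than `>` and `==`.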


### Data manipulation

There are many more things you can do with dataframes: mutate columns, transform and create new columns, reshape them, merge dataframes, and so on. This section introduces these functionalities using 2019 Yellow Taxi Trip Data (provided by [NYC Open Data](https://data.cityofnewyork.us/Transportation/2019-Yellow-Taxi-Trip-Data/2upf-qytp)).

In [65]:
!wget -O taxis.csv "https://www.dropbox.com/scl/fi/bkyfsszcffd9zyko9olgw/2019_Yellow_Taxi_Trip_Data.csv?rlkey=izmgr854im7ppswl0e8u4tk1p&dl=0"
taxis = pd.read_csv("taxis.csv")


--2025-04-08 16:12:59--  https://www.dropbox.com/scl/fi/bkyfsszcffd9zyko9olgw/2019_Yellow_Taxi_Trip_Data.csv?rlkey=izmgr854im7ppswl0e8u4tk1p
Resolving www.dropbox.com (www.dropbox.com)... 162.125.5.18, 2620:100:601d:18::a27d:512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.5.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://ucb75ca2188c068fe2bea453a803.dl.dropboxusercontent.com/cd/0/inline/Cnb24FoRSI2vl6l-AJhmNefGnXbXDJ-GQ0Z1UdNDOGAfjEsWG-qxZ1n9YTKHHqyb0H-Cm_2FWlj5hpS1S9MRjc98VMBU0k78im9kQCdVfJx8ysrjoc3qVAQOW2B4m_3_vzmVnmJ5Gc56QTbGl3aSp-es/file# [following]
--2025-04-08 16:13:00--  https://ucb75ca2188c068fe2bea453a803.dl.dropboxusercontent.com/cd/0/inline/Cnb24FoRSI2vl6l-AJhmNefGnXbXDJ-GQ0Z1UdNDOGAfjEsWG-qxZ1n9YTKHHqyb0H-Cm_2FWlj5hpS1S9MRjc98VMBU0k78im9kQCdVfJx8ysrjoc3qVAQOW2B4m_3_vzmVnmJ5Gc56QTbGl3aSp-es/file
Resolving ucb75ca2188c068fe2bea453a803.dl.dropboxusercontent.com (ucb75ca2188c068fe2bea453a803.dl.dropboxusercontent.com)... 16

In [66]:
taxis.head()

Unnamed: 0,vendorid,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,ratecodeid,store_and_fwd_flag,pulocationid,dolocationid,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,2,2019-10-23T16:39:42.000,2019-10-23T17:14:10.000,1,7.93,1,N,138,170,1,29.5,1.0,0.5,7.98,6.12,0.3,47.9,2.5
1,1,2019-10-23T16:32:08.000,2019-10-23T16:45:26.000,1,2.0,1,N,11,26,1,10.5,1.0,0.5,0.0,0.0,0.3,12.3,0.0
2,2,2019-10-23T16:08:44.000,2019-10-23T16:21:11.000,1,1.36,1,N,163,162,1,9.5,1.0,0.5,2.0,0.0,0.3,15.8,2.5
3,2,2019-10-23T16:22:44.000,2019-10-23T16:43:26.000,1,1.0,1,N,170,163,1,13.0,1.0,0.5,4.32,0.0,0.3,21.62,2.5
4,2,2019-10-23T16:45:11.000,2019-10-23T16:58:49.000,1,1.96,1,N,163,236,1,10.5,1.0,0.5,0.5,0.0,0.3,15.3,2.5


We first drop redundant columns; this can be done by the `drop()` method:

In [67]:
remove_cols = ["vendorid", "ratecodeid", "pulocationid", "dolocationid", "store_and_fwd_flag"]
taxis = taxis.drop(columns = remove_cols)
taxis.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,2019-10-23T16:39:42.000,2019-10-23T17:14:10.000,1,7.93,1,29.5,1.0,0.5,7.98,6.12,0.3,47.9,2.5
1,2019-10-23T16:32:08.000,2019-10-23T16:45:26.000,1,2.0,1,10.5,1.0,0.5,0.0,0.0,0.3,12.3,0.0
2,2019-10-23T16:08:44.000,2019-10-23T16:21:11.000,1,1.36,1,9.5,1.0,0.5,2.0,0.0,0.3,15.8,2.5
3,2019-10-23T16:22:44.000,2019-10-23T16:43:26.000,1,1.0,1,13.0,1.0,0.5,4.32,0.0,0.3,21.62,2.5
4,2019-10-23T16:45:11.000,2019-10-23T16:58:49.000,1,1.96,1,10.5,1.0,0.5,0.5,0.0,0.3,15.3,2.5


We can rename columns using `rename()`:

In [68]:
taxis = taxis.rename(
    columns = {
        "tpep_pickup_datetime" : "pickup",
        "tpep_dropoff_datetime" : "dropoff",
    }
)
taxis.columns

Index(['pickup', 'dropoff', 'passenger_count', 'trip_distance', 'payment_type',
       'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
       'improvement_surcharge', 'total_amount', 'congestion_surcharge'],
      dtype='object')

We can convert the datatypes of columns using `apply()`:

In [69]:
taxis.dtypes

Unnamed: 0,0
pickup,object
dropoff,object
passenger_count,int64
trip_distance,float64
payment_type,int64
fare_amount,float64
extra,float64
mta_tax,float64
tip_amount,float64
tolls_amount,float64


In [70]:
taxis[["pickup", "dropoff"]].apply(pd.to_datetime)

Unnamed: 0,pickup,dropoff
0,2019-10-23 16:39:42,2019-10-23 17:14:10
1,2019-10-23 16:32:08,2019-10-23 16:45:26
2,2019-10-23 16:08:44,2019-10-23 16:21:11
3,2019-10-23 16:22:44,2019-10-23 16:43:26
4,2019-10-23 16:45:11,2019-10-23 16:58:49
...,...,...
9995,2019-10-23 17:39:59,2019-10-23 17:49:26
9996,2019-10-23 17:53:02,2019-10-23 18:00:45
9997,2019-10-23 17:07:16,2019-10-23 17:11:35
9998,2019-10-23 17:38:26,2019-10-23 17:49:28


In [71]:
taxis[["pickup", "dropoff"]] = taxis[["pickup", "dropoff"]].apply(pd.to_datetime)

Here, we've converted the `"pickup"` and `"dropoff"` columns to a `datetime64` datatype. The argument to `apply()` is `pd.to_datetime`, the function that is applied to each column.

We could have also converted numeric datatypes using `pd.to_numeric()`, and we can also use the more general `.astype()` method of a `DataFrame` or `Series`.
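
As a sketch of these two alternatives, the made-up dataframe below stores numbers as strings and converts them. The column names and values are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"count": ["1", "2", "3"], "price": ["9.5", "10", "n/a"]})

# astype() converts to the requested type and fails if any value cannot be converted
df["count"] = df["count"].astype(int)

# to_numeric() with errors="coerce" turns unparseable values into NaN instead of failing
df["price"] = pd.to_numeric(df["price"], errors="coerce")

print(df.dtypes)
print(df)
```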

We can create new columns from our existing columns, either by direct assignment or using `assign()`:

In [72]:
taxis["elapsed_time"] = taxis["dropoff"] - taxis["pickup"]

In [73]:
taxis = taxis.assign(
    elapsed_time = lambda x: x.dropoff - x.pickup,
    cost_before_tip = lambda x: x.total_amount - x.tip_amount,
    tip_pct = lambda x: x.tip_amount / x.cost_before_tip,
    fees = lambda x: x.cost_before_tip - x.fare_amount,
    avg_speed = lambda x: x.trip_distance.div(
        x.elapsed_time.dt.total_seconds() / 60 / 60
    )
)

In [74]:
taxis.head()

Unnamed: 0,pickup,dropoff,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,elapsed_time,cost_before_tip,tip_pct,fees,avg_speed
0,2019-10-23 16:39:42,2019-10-23 17:14:10,1,7.93,1,29.5,1.0,0.5,7.98,6.12,0.3,47.9,2.5,0 days 00:34:28,39.92,0.1999,10.42,13.804642
1,2019-10-23 16:32:08,2019-10-23 16:45:26,1,2.0,1,10.5,1.0,0.5,0.0,0.0,0.3,12.3,0.0,0 days 00:13:18,12.3,0.0,1.8,9.022556
2,2019-10-23 16:08:44,2019-10-23 16:21:11,1,1.36,1,9.5,1.0,0.5,2.0,0.0,0.3,15.8,2.5,0 days 00:12:27,13.8,0.144928,4.3,6.554217
3,2019-10-23 16:22:44,2019-10-23 16:43:26,1,1.0,1,13.0,1.0,0.5,4.32,0.0,0.3,21.62,2.5,0 days 00:20:42,17.3,0.249711,4.3,2.898551
4,2019-10-23 16:45:11,2019-10-23 16:58:49,1,1.96,1,10.5,1.0,0.5,0.5,0.0,0.3,15.3,2.5,0 days 00:13:38,14.8,0.033784,4.3,8.625917


If you haven't seen lambda functions before, these are small anonymous functions that can themselves be arguments to other functions. Here we use them so that columns such as `cost_before_tip` and `elapsed_time` can be used in the same `assign()` call in which they are created.
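
The sketch below shows the same idea on a tiny made-up dataframe. Each lambda receives the dataframe as it stands at that point in the `assign()` call, so the second lambda can see the column created by the first.

```python
import pandas as pd

trips = pd.DataFrame({"total": [12.0, 24.0], "tip": [2.0, 4.0]})

trips = trips.assign(
    before_tip=lambda d: d.total - d.tip,    # d is the dataframe being built
    tip_pct=lambda d: d.tip / d.before_tip,  # before_tip already exists here
)

print(trips)
```

Writing `tip_pct = trips.tip / trips.before_tip` without the lambda would fail on the first run, because `before_tip` is not yet a column of the original `trips`.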

We can also sort our dataframe by any number of its columns:

In [75]:
taxis.sort_values(["passenger_count", "pickup"], ascending = [False, True]).head()

Unnamed: 0,pickup,dropoff,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,elapsed_time,cost_before_tip,tip_pct,fees,avg_speed
5997,2019-10-23 15:55:19,2019-10-23 16:08:25,6,1.58,2,10.0,1.0,0.5,0.0,0.0,0.3,14.3,2.5,0 days 00:13:06,14.3,0.0,4.3,7.236641
443,2019-10-23 15:56:59,2019-10-23 16:04:33,6,1.46,2,7.5,1.0,0.5,0.0,0.0,0.3,11.8,2.5,0 days 00:07:34,11.8,0.0,4.3,11.577093
8722,2019-10-23 15:57:33,2019-10-23 16:03:34,6,0.62,1,5.5,1.0,0.5,0.7,0.0,0.3,10.5,2.5,0 days 00:06:01,9.8,0.071429,4.3,6.182825
4198,2019-10-23 15:57:38,2019-10-23 16:05:07,6,1.18,1,7.0,1.0,0.5,1.0,0.0,0.3,12.3,2.5,0 days 00:07:29,11.3,0.088496,4.3,9.461024
8238,2019-10-23 15:58:31,2019-10-23 16:29:29,6,3.23,2,19.5,1.0,0.5,0.0,0.0,0.3,23.8,2.5,0 days 00:30:58,23.8,0.0,4.3,6.258342
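
To check how the two sort keys interact, here is a sketch on a small made-up dataframe. Rows are ordered by the first column, and the second column is only used to break ties.

```python
import pandas as pd

df = pd.DataFrame({"passengers": [1, 6, 6, 2], "fare": [9.5, 7.5, 10.0, 13.0]})

# passengers descending, then fare ascending among rows with equal passengers
ordered = df.sort_values(["passengers", "fare"], ascending=[False, True])

print(ordered.index.tolist())  # [1, 2, 3, 0]
```

`sort_values()` returns a new dataframe and leaves `df` unchanged, so the result has to be assigned to a name if you want to keep it.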
