# Problems 3 and 4

A very common GIS task is to convert a list of coordinates into geographic objects. For instance, you might have a table with latitude and longitude values and want to show the locations on a map.
Python is an excellent tool for this kind of task: it can read data from (almost) any input format (CSV, text, Excel, GPX, various databases).

In this exercise, we concentrate on reading data using [pandas](https://pandas.pydata.org/), and on creating geometry objects using [shapely](https://shapely.readthedocs.io/). 
Later on in the course, we will get to know other packages that are better tailored to geographic data, and will also learn how to write data to files, including to GIS file formats.

## Sample data set

For this exercise, we read the data from a file that lists travel times between different locations in Helsinki.
The data is stored in a semicolon-separated text file, [`travel_times_2015_helsinki.txt`](data/travel_times_2015_helsinki.txt), which you can find in the `data` folder of this repository.

The first four lines of the file (the header and three records) look like this:

```
from_id;to_id;fromid_toid;route_number;at;from_x;from_y;to_x;to_y;total_route_time;route_time;route_distance
5861326;5785640;5861326_5785640;1;08:10;24.9704379;60.3119173;24.8560344;60.399940599999994;125.0;99.0;22917.6
5861326;5785641;5861326_5785641;1;08:10;24.9704379;60.3119173;24.8605682;60.4000135;123.0;102.0;23123.5
5861326;5785642;5861326_5785642;1;08:10;24.9704379;60.3119173;24.865102;60.4000863;125.0;103.0;23241.3
```

In this exercise, we are interested in the following columns:

| Column name        | Description                                              |
|:------------------ |:-------------------------------------------------------- |
| `from_x`           | x-coordinate of the **origin** location (longitude)      |
| `from_y`           | y-coordinate of the **origin** location (latitude)       |
| `to_x`             | x-coordinate of the **destination** location (longitude) |
| `to_y`             | y-coordinate of the **destination** location (latitude)  |
| `total_route_time` | Travel time by public transport along the route          |

Read more about this data set on the blog of the Digital Geography Lab: https://blogs.helsinki.fi/accessibility/helsinki-region-travel-time-matrix/.
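
To get a feel for how these columns parse, here is a minimal sketch that reads the three sample records shown above from an in-memory string instead of from the file. The names `sample_text`, `wanted_columns` and `sample_df` are illustrative only and are not part of the exercise.

```python
import io

import pandas as pd

# Header and first three records, copied from the sample shown above
sample_text = """from_id;to_id;fromid_toid;route_number;at;from_x;from_y;to_x;to_y;total_route_time;route_time;route_distance
5861326;5785640;5861326_5785640;1;08:10;24.9704379;60.3119173;24.8560344;60.399940599999994;125.0;99.0;22917.6
5861326;5785641;5861326_5785641;1;08:10;24.9704379;60.3119173;24.8605682;60.4000135;123.0;102.0;23123.5
5861326;5785642;5861326_5785642;1;08:10;24.9704379;60.3119173;24.865102;60.4000863;125.0;103.0;23241.3
"""

# Columns of interest in this exercise
wanted_columns = ["from_x", "from_y", "to_x", "to_y", "total_route_time"]

# `sep=";"` tells pandas that fields are separated by semicolons,
# `usecols` restricts parsing to the listed columns
sample_df = pd.read_csv(io.StringIO(sample_text), sep=";", usecols=wanted_columns)

print(sample_df.shape)
print(sample_df.dtypes)
```

Reading only the columns you need keeps the resulting DataFrame small, which matters more for the full file than for this three-row sample.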


----

## Problem 3: Reading coordinates from a text file, and creating geometries (*5 points*)

In this problem, your task is to read data from the file described above, and create two lists of points representing 
the origins and destinations of the routes described in the data set.

This task entails multiple steps:

1. Read the data into a `pandas.DataFrame`
2. Discard all unnecessary columns (this is good practice, as it helps reduce the memory footprint of a program)
3. Create two lists of `shapely.geometry.Point`s

Let’s go step-by-step. 

Remember that there are code cells that you can and should modify (they initially contain only a comment `# ADD YOUR OWN CODE HERE`),
and other code cells that you can and should run (but cannot modify) to test whether your code fulfils the requirements.



----

#### (1)

First, use `pandas` to read the file into a variable `data`. You can revisit [lesson 5 of the Geo-Python course](https://geo-python-site.readthedocs.io/en/latest/notebooks/L5/exploring-data-using-pandas.html#reading-a-data-file-with-pandas) and consult the [pandas documentation](https://pandas.pydata.org/docs/user_guide/) to find the best way to do this.

In [36]:
# ADD YOUR OWN CODE HERE
import pandas as pd

# Read only the columns needed in this exercise from the semicolon-separated file
filepath = "data/travel_times_2015_helsinki.txt"
usecols = ["from_x", "from_y", "to_x", "to_y", "total_route_time"]
data = pd.read_csv(filepath, sep=";", usecols=usecols)

As a little sanity check, print the number of rows and columns of the data set:

In [37]:
# ADD YOUR OWN CODE HERE
print(data.shape)

(14643, 5)


If you loaded the data set successfully, the following code cell will print the first few rows of the data:

In [38]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION
data.head()

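|     | from_x    | from_y    | to_x      | to_y      | total_route_time |
|:--- |:--------- |:--------- |:--------- |:--------- |:---------------- |
| 0   | 24.970438 | 60.311917 | 24.856034 | 60.399941 | 125.0            |
| 1   | 24.970438 | 60.311917 | 24.860568 | 60.400014 | 123.0            |
| 2   | 24.970438 | 60.311917 | 24.865102 | 60.400086 | 125.0            |
| 3   | 24.970438 | 60.311917 | 24.869636 | 60.400159 | 129.0            |
| 4   | 24.970438 | 60.311917 | 24.842582 | 60.397478 | 118.0            |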



----
#### (2)

Now, select the four columns that contain coordinate information (**`from_x`**, **`from_y`**, **`to_x`**, **`to_y`**) and store them in the DataFrame **`data`**, 
i.e. update the variable `data` so that it contains only these four columns.


In [40]:
# ADD YOUR OWN CODE HERE
data = data.loc[:, ["from_x","from_y","to_x","to_y"]]

Run the following code cell to test whether you have successfully replaced `data` with only the required data columns: it prints an error if you haven’t.

In [41]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION
assert list(data.columns) == ["from_x", "from_y", "to_x", "to_y"], "Error: `data` does not (or not only) contain the four columns it should"


----

#### (3)

Finally, create two lists called **`origin_points`** and **`destination_points`** that contain `shapely.geometry.Point` objects created using the coordinates from `data`. 

In particular, the origin points in `origin_points` should be based on columns `from_x` and `from_y`, and the destination points in `destination_points` on columns `to_x` and `to_y`.

There are many ways to achieve this. Two possible approaches are outlined below (you can implement either one of them):

##### **Approach A**

- Create two empty lists for the origin and destination points, respectively
- Use a for-loop to iterate over the rows of your dataframe:
    - For each row, create one `shapely.geometry.Point` object from the origin columns and one from the destination columns
    - Append the two points to the `origin_points` and `destination_points` lists, respectively

You can consult [lesson 6 of Geo-Python](https://geo-python-site.readthedocs.io/en/latest/notebooks/L6/advanced-data-processing-with-pandas.html#iterating-over-rows) to revisit how to loop over the rows of a `pandas.DataFrame`.
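
As a rough sketch of Approach A, the loop could look like the following. It runs on a hypothetical two-row stand-in for `data` (called `toy` here, with coordinates taken from the sample records), and the list names `toy_origins` and `toy_destinations` are illustrative so that they do not overwrite your own variables.

```python
import pandas as pd
from shapely.geometry import Point

# Hypothetical two-row stand-in for `data`, using coordinates from the sample records
toy = pd.DataFrame(
    {
        "from_x": [24.9704379, 24.9704379],
        "from_y": [60.3119173, 60.3119173],
        "to_x": [24.8560344, 24.8605682],
        "to_y": [60.3999406, 60.4000135],
    }
)

# Start with two empty lists
toy_origins = []
toy_destinations = []

# iterrows() yields (index, row) pairs; the index is not needed here
for _, row in toy.iterrows():
    toy_origins.append(Point(row["from_x"], row["from_y"]))
    toy_destinations.append(Point(row["to_x"], row["to_y"]))

print(toy_origins[0].x, toy_origins[0].y)
print(toy_destinations[1].x, toy_destinations[1].y)
```

Note that `Point` takes the x-coordinate (longitude) first and the y-coordinate (latitude) second.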

##### **Approach B (more advanced)**

- Make use of the `.apply()` method of the `pandas.DataFrame` to operate on all rows at once (see its [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html); *hint:* you might want to use the `axis` parameter)
- Use the `shapely.geometry.Point` constructor directly, or wrap it into a [lambda function](https://towardsdatascience.com/apply-and-lambda-usage-in-pandas-b13a1ea037f7)
- Finally, convert the output `pandas.Series` into `list`s





In [130]:
# ADD YOUR OWN CODE HERE
from shapely import Point

# Approach B: build one Point per row (axis=1), then convert the resulting Series to a list
origin_points = list(data.apply(lambda row: Point(row["from_x"], row["from_y"]), axis=1))
destination_points = list(data.apply(lambda row: Point(row["to_x"], row["to_y"]), axis=1))

# Spot check of a single origin point
print("ORIGIN X Y:", origin_points[12203].x, origin_points[12203].y)

ORIGIN X Y: 24.9704379 60.3119173



**NOTE: After you have solved this problem, there might be some left-over variables around.<br />We recommend you *restart the kernel and run all cells* from the toolbar or JupyterLab’s menu.**




Use the following code cell to test whether your solution works:

In [113]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# This test print should print out the first origin and destination coordinates in the two lists:
print("ORIGIN X Y:", origin_points[0].x, origin_points[0].y)
print("DESTINATION X Y:", destination_points[0].x, destination_points[0].y)

# Check that you created a correct amount of points:
assert len(origin_points) == len(data), "Number of origin points must be the same as number of rows in the original file"
assert len(destination_points) == len(data), "Number of destination points must be the same as number of rows in the original file"

ORIGIN X Y: 24.9704379 60.3119173
DESTINATION X Y: 24.8560344 60.3999406



----

Remember to commit your code using git after each major change (for example, after solving each problem).

### Done!

That’s it. Now you are ready to continue to problem 4.


----

## Problem 4: Creating LineStrings that represent the movements (*5 points*)

This problem continues where we left off after completing *Problem 3*. 

The task is to:

1. create a list of lines (`shapely.geometry.LineString`) between each pair of origin and destination points, and 
2. calculate the overall total length of all those lines.

Store the list of lines in a variable called `lines`, and the sum of lengths in a variable called `total_length`.

Once you have working solutions for both tasks, 

3. create functions for them so you can apply them to other similar data sets in the future (see instructions below).

#### (1)

To create the `shapely.geometry.LineString`s for each pair of origins and destinations, you need to loop over both lists at the same time.

Again, there are many ways to achieve this. Here are two suggestions (implement either one):

- (alternative 1) Use the `zip()` function that allows you to iterate over multiple lists at the same time. See this week’s [exercise hints](https://autogis-site.readthedocs.io/en/latest/lessons/L1/exercise-1.html#hints).
- (alternative 2) Use the [*for-range* pattern from lesson 3 of Geo-Python](https://geo-python-site.readthedocs.io/en/latest/notebooks/L3/for-loops.html#looping-over-the-length-of-lists-using-index-values) and an index variable to access the same value in both lists
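
The two alternatives can be sketched on a pair of toy lists as follows. The lists `starts` and `ends` and the result names `lines_zip` and `lines_range` are made up for illustration. In both variants, `LineString` receives a single sequence containing the two points, rather than the points as separate arguments.

```python
from shapely.geometry import LineString, Point

# Toy stand-ins for the origin and destination lists
starts = [Point(0, 0), Point(1, 1)]
ends = [Point(3, 4), Point(1, 2)]

# Alternative 1: zip() pairs up the items of both lists
lines_zip = [LineString([start, end]) for start, end in zip(starts, ends)]

# Alternative 2: the for-range pattern uses an index to access both lists
lines_range = []
for i in range(len(starts)):
    lines_range.append(LineString([starts[i], ends[i]]))

print(lines_zip[0].wkt)
print(lines_range[0].wkt)
```

Both variants should produce the same geometries. The `zip()` version is shorter, while the for-range version makes the shared index explicit.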


In [139]:
# ADD YOUR OWN CODE HERE
from shapely import LineString

# LineString() takes a single sequence of points, so each
# origin-destination pair is wrapped in a list
lines = [LineString([start, end]) for start, end in zip(origin_points, destination_points)]


**NOTE: After you have solved this problem, there might be some left-over variables around.<br />We recommend you *restart the kernel and run all cells* from the toolbar or JupyterLab’s menu.**


In [136]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Test that the list has correct number of LineStrings
assert len(lines) == len(data), "There should be as many lines as there are rows in the original data"


----

#### (2)

Create a variable called **`total_length`**, and store the total (Euclidean) length of all the origin-destination LineStrings that we just created in that variable.

*Hint*: A simple solution is to start with a `total_length` of `0`, and add each line’s length while iterating over the list of lines.
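
The hint can be sketched on two toy lines whose lengths are easy to check by hand. The names `toy_lines` and `toy_total` are illustrative only.

```python
from shapely.geometry import LineString

# Two toy lines with lengths 5 (a 3-4-5 triangle) and 2
toy_lines = [
    LineString([(0, 0), (3, 4)]),
    LineString([(0, 0), (0, 2)]),
]

# Start from zero and add each line's length in turn
toy_total = 0
for line in toy_lines:
    toy_total += line.length

# The built-in sum() gives the same result in a single expression
assert toy_total == sum(line.length for line in toy_lines)

print(toy_total)
```

Keep in mind that `.length` is expressed in the units of the coordinates. For longitude and latitude values, as in this data set, the result is therefore in decimal degrees rather than in metres.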


In [137]:
# ADD YOUR OWN CODE HERE
# Start from zero and add the length of each line in turn
total_length = 0
for line in lines:
    total_length += line.length

In [138]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# This test print should print the total length of all lines
print("Total length of all lines is", round(total_length, 2))
assert round(total_length, 2) == 3148.57

Total length of all lines is 3148.57


----

#### (3)

Now, create functions that automate the functionality you implemented for part (1) and part (2) of this problem:

- `create_od_lines()`: accepts two `list`s of `shapely.geometry.Point`s and returns a `list` of `shapely.geometry.LineString`s 
- `calculate_total_distance()`: takes a `list` of `shapely.geometry.LineString` geometries and returns their total length

You can copy and paste the code you have written earlier into the functions. Be sure to add a **docstring** to each function.
Below, you can find a code cell for testing your functions (you should get the same result as earlier).

In [None]:
# ADD YOUR OWN CODE HERE


In [None]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Create origin-destination lines
od_lines = create_od_lines(origin_points, destination_points)

# Calculate the total distance
tot_dist = calculate_total_distance(od_lines)

print("Total distance", round(tot_dist,2))
assert tot_dist == total_length


----


## Well done!

Awesome, now you have successfully practiced how geometries can be created in Python. Next week we will start using them actively.