## Problem 3: How far did people travel? (8 points)

In this task, the aim is to calculate the straight-line distance in meters that each social media user in the data set prepared in *Problem 2* has travelled between their posts. We are interested in the Euclidean distance between consecutive points generated by the same user.
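
As a reference for what is being measured: assuming a user's $n$ posts are sorted by time and have projected coordinates $(x_i, y_i)$ in meters (this notation is introduced here for illustration and is not part of the exercise), the distance travelled is the sum of the Euclidean distances between consecutive posts:

$$
d = \sum_{i=1}^{n-1} \sqrt{(x_{i+1} - x_i)^2 + (y_{i+1} - y_i)^2}
$$

A user with a single post has no consecutive pair, so no distance can be computed for them.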

For this, we will need to use the `userid` column of the data set `kruger_points.shp` that we created in *Problem 2*.

Answer the following questions:
- What was the shortest distance a user travelled between all their posts (in meters)?
- What was the mean distance travelled per user (in meters)?
- What was the maximum distance a user travelled (in meters)?

---


### a) Read the input file and re-project it

- Read the input file `kruger_points.shp` into a `geopandas.GeoDataFrame` called `kruger_points`
- Transform the data from WGS84 to an `EPSG:32735` projection (UTM Zone 35S, suitable for South Africa). This CRS has *metres* as units.
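
The claim about units can be checked directly. The following is a minimal sketch, assuming `pyproj` is installed (it is a dependency of geopandas); the variable names are chosen here for illustration only. It compares the axis units of WGS84 (`EPSG:4326`) with those of `EPSG:32735`:

```python
import pyproj

# WGS84: a geographic CRS, coordinates are angles
wgs84 = pyproj.CRS.from_epsg(4326)
wgs84_units = [axis.unit_name for axis in wgs84.axis_info]

# UTM Zone 35S: a projected CRS, coordinates are lengths
utm_35s = pyproj.CRS.from_epsg(32735)
utm_units = [axis.unit_name for axis in utm_35s.axis_info]

print(wgs84.is_geographic, wgs84_units)
print(utm_35s.is_projected, utm_units)
```

Lengths computed on geometries in a geographic CRS would be expressed in degrees, which is why the re-projection has to happen before any distance is measured.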

In [1]:
import geopandas as gpd
from shapely.geometry import LineString
import pathlib

# Data path
DATA_DIRECTORY = pathlib.Path().resolve() / "data"

# Import the data
kruger_points = gpd.read_file(DATA_DIRECTORY / "kruger_points.shp")

# Transform the CRS of the data to EPSG:32735
kruger_points = kruger_points.to_crs("EPSG:32735")


In [2]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the data
kruger_points.head()

| | lat | lon | timestamp | userid | geometry |
|---|---|---|---|---|---|
| 0 | -24.980792 | 31.484633 | 2015-07-07 03:02 | 66487960 | POINT (952912.890 7229683.258) |
| 1 | -25.499225 | 31.508906 | 2015-07-07 03:18 | 65281761 | POINT (953433.223 7172080.632) |
| 2 | -24.342578 | 30.930866 | 2015-03-07 03:38 | 90916112 | POINT (898955.144 7302197.408) |
| 3 | -24.854614 | 31.519718 | 2015-10-07 05:04 | 37959089 | POINT (956927.218 7243564.942) |
| 4 | -24.921069 | 31.520836 | 2015-10-07 05:19 | 27793716 | POINT (956794.955 7236187.926) |


In [3]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check that the crs is correct after re-projecting (should be epsg:32735)
import pyproj
assert kruger_points.crs == pyproj.CRS("EPSG:32735")

### b) Group the data by user id

Group the data by `userid` and store the grouped data in a variable `grouped_by_users`

In [5]:
# Group the data by the "userid" column
grouped_by_users = kruger_points.groupby("userid")

In [6]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the number of groups:
assert len(grouped_by_users.groups) == kruger_points["userid"].nunique(), "Number of groups should match number of unique users!"

### c) Create `shapely.geometry.LineString` objects for each user connecting the points from oldest to most recent

There are multiple ways to solve this problem (see the [hints for this exercise](https://autogis-site.readthedocs.io/en/latest/lessons/lesson-2/exercise-2.html)). You can use, for instance, a dictionary or an empty GeoDataFrame to collect the data that is generated using the steps below:

- Use a for-loop to iterate over the grouped object. For each user’s data: 
    - [sort](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) the rows by timestamp 
    - create a `shapely.geometry.LineString` based on the user’s points

**CAREFUL**: Remember that every LineString needs at least two points. Skip users who have fewer than two posts.

Store the results in a `geopandas.GeoDataFrame` called `movements`, and remember to assign a CRS.
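
The steps above can be tried out on a toy data set first. This is a minimal sketch, not the reference solution: the three points, user ids and timestamps are made up, and the names `toy_points`, `toy_records` and `toy_movements` are hypothetical. It skips single-post users with `continue`; storing `None` as the geometry for such users is an alternative.

```python
import geopandas as gpd
from shapely.geometry import LineString, Point

# Toy data (made-up values): user 1 has two posts listed out of
# chronological order, user 2 has a single post
toy_points = gpd.GeoDataFrame(
    {
        "userid": [1, 2, 1],
        "timestamp": ["2015-07-07 03:18", "2015-07-07 03:02", "2015-07-07 03:02"],
    },
    geometry=[Point(3, 4), Point(10, 10), Point(0, 0)],
    crs="EPSG:32735",
)

toy_records = {"userid": [], "geometry": []}

for userid, group in toy_points.groupby("userid"):
    # A LineString needs at least two points
    if len(group) < 2:
        continue

    # sort_values() returns a sorted copy, so keep the result
    group = group.sort_values("timestamp")

    toy_records["userid"].append(userid)
    toy_records["geometry"].append(LineString(list(group.geometry)))

toy_movements = gpd.GeoDataFrame(toy_records, crs=toy_points.crs)
print(toy_movements)
```

Because the timestamps are strings in `YYYY-MM-DD HH:MM` form, sorting them alphabetically also sorts them chronologically; with other date formats it would be safer to convert the column with `pandas.to_datetime()` first.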

In [7]:
# Create a LineString object for each user
mov = {
    "userid": [],
    "geometry": []
}

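for key, group in grouped_by_users:
    mov["userid"].append(key)

    # sort_values() returns a sorted copy, so the result must be re-assigned
    group = group.sort_values("timestamp")
    points = list(group["geometry"])

    # A LineString needs at least two points; other users get no geometry
    if len(points) < 2:
        geom = None
    else:
        geom = LineString(points)

    mov["geometry"].append(geom)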

    
# Create a geopandas.GeoDataFrame to store the data
movements = gpd.GeoDataFrame(mov, crs="EPSG:32735")


In [8]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the result
print(type(movements))
print(movements.crs)

movements

<class 'geopandas.geodataframe.GeoDataFrame'>
EPSG:32735


| | userid | geometry |
|---|---|---|
| 0 | 16301 | LINESTRING (939011.113 7254636.121, 942231.630... |
| 1 | 26589 | |
| 2 | 29322 | |
| 3 | 42181 | |
| 4 | 45136 | LINESTRING (905394.500 7193375.148, 905394.500... |
| ... | ... | ... |
| 14985 | 99966397 | |
| 14986 | 99986933 | LINESTRING (936598.681 7308435.075, 935937.029... |
| 14987 | 99988918 | |
| 14988 | 99990870 | LINESTRING (899089.377 7180296.561, 899089.377... |


### d) Calculate the distance between all posts of a user

- Check once more that the CRS of the data frame is correct
- Compute the lengths of the lines, and store them in a new column called `distance`
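
The `length` attribute of a `shapely.geometry.LineString` is the sum of the straight-line distances between its consecutive vertices, in the units of the coordinates. A small sketch with made-up coordinates (the names `line` and `stationary` are hypothetical) illustrates this:

```python
from shapely.geometry import LineString

# Made-up coordinates in meters: a 5 m segment (3-4-5 triangle)
# followed by a 6 m segment straight north
line = LineString([(0, 0), (3, 4), (3, 10)])
total = line.length

# A user who posts twice from the same location has travelled 0 m
stationary = LineString([(5, 5), (5, 5)])

print(total, stationary.length)
```

This also explains why a distance of `0.0` can appear in the results: several posts from exactly the same coordinates form a line of zero length.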

In [9]:
def compute_distance(row):
    """
    Return the length of the LineString in the row's geometry column,
    or None if the row has no geometry (users with fewer than two posts).
    """
    geom = row["geometry"]
    # Compare against None with "is", not "=="
    if geom is None:
        return None
    else:
        return geom.length
    
    
# Compute the distance between all posts of a user
movements["distance"] = movements.apply(compute_distance, axis=1)

In [10]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the output
movements.head()

| | userid | geometry | distance |
|---|---|---|---|
| 0 | 16301 | LINESTRING (939011.113 7254636.121, 942231.630... | 195251.395657 |
| 1 | 26589 | | |
| 2 | 29322 | | |
| 3 | 42181 | | |
| 4 | 45136 | LINESTRING (905394.500 7193375.148, 905394.500... | 0.0 |


### e) Answer the original questions

You should now be able to quickly find answers to the following questions: 
- What was the shortest distance a user travelled between all their posts (in meters)? (store the value in a variable `shortest_distance`)
- What was the mean distance travelled per user (in meters)? (store the value in a variable `mean_distance`)
- What was the maximum distance a user travelled (in meters)? (store the value in a variable `longest_distance`)
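
Rows without a geometry have no distance, and pandas ignores such missing values by default when aggregating. A minimal sketch with made-up numbers (the name `toy_distances` is hypothetical) shows the effect on the three statistics:

```python
import pandas as pd

# Made-up distances in meters; None is stored as NaN (missing value)
toy_distances = pd.Series([0.0, None, 100.0, 500.0])

toy_min = toy_distances.min()    # NaN is skipped
toy_mean = toy_distances.mean()  # averaged over the 3 valid values only
toy_max = toy_distances.max()

print(toy_min, toy_mean, toy_max)
```

The mean is therefore the mean over users with at least two posts, not over all users in the data set.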

In [11]:
# The shortest distance a user travelled between all their posts
shortest_distance = movements["distance"].min()

# The mean distance travelled per user
mean_distance = movements["distance"].mean()

# The maximum distance a user travelled
longest_distance = movements["distance"].max()


# Print results
print(f"The shortest distance a user travelled between all their posts is {shortest_distance} meters.")
print(f"The mean distance travelled per user is {mean_distance} meters.")
print(f"The maximum distance a user travelled is {longest_distance} meters.")

The shortest distance a user travelled between all their posts is 0.0 meters.
The mean distance travelled per user is 69090.38226302531 meters.
The maximum distance a user travelled is 4535318.9896208625 meters.


### f) Save the movements in a file

Save `movements` to a new shapefile called `movements.shp` inside the `data` directory.

In [12]:
# Save "movements" data as a new Shapefile
movements.to_file(DATA_DIRECTORY / "movements.shp")

In [13]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

assert (DATA_DIRECTORY / "movements.shp").exists()


---

# Fantastic job!

That’s all for this week! 