# Correlation between planted trees in Berlin and speed limits

This project aims to analyze the correlation between planted trees and speed limits in Berlin. The goal is to see whether the two are correlated and, if so, how strongly.

## Datasources

### Datasource 1: Baumbestand - Berlin - [WFS]
- Metadata URL: https://mobilithek.info/offers/-5687470862699743129
- Data URL: https://fbinter.stadt-berlin.de/fb/wfs/data/senstadt/s_wfs_baumbestand
- Data Type: WFS_SRVC

Planted street trees in Berlin.

### Datasource 2: Tempolimits - Berlin - [WFS]
- Metadata URL: https://mobilithek.info/offers/-8613064499673471355
- Data URL: https://fbinter.stadt-berlin.de/fb/wfs/data/senstadt/s_vms_tempolimits_spatial
- Data Type: WFS_SRVC
- Description: Speed Limits in Berlin. 

Important notes: Only speed limits that differ from the default speed limit of 50 km/h are listed. Therefore, we need another datasource to get the streets with the default speed limit.

### Datasource 3: Traffic Network - Berlin - [WFS]
- Metadata URL: https://fbinter.stadt-berlin.de/fb/index.jsp
- Data URL: https://fbinter.stadt-berlin.de/fb/wfs/data/senstadt/s_vms_detailnetz_spatial_gesamt
- Data Type: WFS_SRVC
- Description: Traffic network in Berlin.

Important notes: We use this dataset (filtered for car streets only) and inject the speed limits from datasource 2 into it, defaulting to 50 km/h if no speed limit is found.
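
The injection step described above can be sketched roughly as follows. This is a minimal illustration on toy tables, not the actual pipeline code: it assumes both layers share a key column (here `elem_nr`), whereas the real pipeline may match the layers spatially. The helper name `inject_speed_limits` and the toy values are hypothetical.

```python
import pandas as pd

DEFAULT_SPEED_LIMIT_KMH = 50.0  # default limit for streets without an explicit entry

# Hypothetical toy tables; the real WFS layers have more columns.
streets = pd.DataFrame({
    "elem_nr": ["A1", "A2", "A3"],
    "strassenname": ["Allee", "Ring", "Gasse"],
})
limits = pd.DataFrame({
    "elem_nr": ["A1", "A3"],
    "speed_limit": [30.0, 10.0],
})

def inject_speed_limits(streets, limits, key="elem_nr"):
    """Left-join the explicit limits onto the streets and fill gaps with the default."""
    merged = streets.merge(limits[[key, "speed_limit"]], on=key, how="left")
    merged["speed_limit"] = merged["speed_limit"].fillna(DEFAULT_SPEED_LIMIT_KMH)
    return merged

streets_with_limits = inject_speed_limits(streets, limits)
print(streets_with_limits)
```

The left join keeps every street, so streets missing from datasource 2 end up with the default value instead of being dropped.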


## Question and Hypothesis 
Is there a correlation between planted trees and speed limits in Berlin? If so, how strong is it and does the type of tree have an influence on the correlation?

My initial hypothesis is that there is a small negative correlation between speed limits and planted trees. If this holds, tree-lined streets benefit the city landscape in two ways: cars drive more slowly and the city becomes greener.

## Outline
1. Install required dependencies
2. Load the preprocessed data
3. Visualize the data on a map of Berlin
4. Calculate correlation between planted trees and speed limits
5. Visualize the correlation between planted trees and speed limits
6. Conclusion & Outlook

### 1. Install required dependencies
Initially, install all required dependencies. We use requirements.txt to manage dependencies.

In [None]:
%%capture
%pip install -r requirements.txt

### 2. Load the preprocessed data
Create geopandas dataframes using the preprocessed data from the data pipeline. 
If some data is missing, the data pipeline will be executed again.

In [1]:
import geopandas as gpd
import os

if (not os.path.exists("data/trees.geojson") 
or not os.path.exists("data/streets.geojson") 
or not os.path.exists("data/speed_limits.geojson") 
or not os.path.exists("data/merged_data.geojson")):
    os.chdir("data")
    os.system("python pipeline.py")
    os.chdir("..")

trees: gpd.GeoDataFrame = gpd.read_file("data/trees.geojson")
streets: gpd.GeoDataFrame = gpd.read_file("data/streets.geojson")
speed_limits: gpd.GeoDataFrame = gpd.read_file("data/speed_limits.geojson")
merged_data: gpd.GeoDataFrame = gpd.read_file("data/merged_data.geojson")

### 3. Visualize the data on a map of Berlin
We color all streets according to their speed limits.

In [None]:
import plotly.io as pio
import plotly.express as px
import shapely
import numpy as np

pio.renderers.default = "notebook"

# Prepare data for plotting
merged_data['speed_limit'] = merged_data['speed_limit'].astype(float)
unique_streets = merged_data[~merged_data["elem_nr"].duplicated(keep="last")].sort_values(by="speed_limit")
trees_in_merged = trees[trees["id"].isin(merged_data["id_y"])]


# Flatten every street geometry into coordinate arrays.
# A None entry after each line separates it from the next line in the plot.
lats = []
lons = []
ids = []
names = []
limit_values = []  # named so that it does not shadow the speed_limits GeoDataFrame

for ident, feature, name, speed_limit in zip(unique_streets.id_x, unique_streets.geometry, unique_streets.strassenname, unique_streets.speed_limit):    
    if isinstance(feature, shapely.geometry.linestring.LineString):
        linestrings = [feature]
    elif isinstance(feature, shapely.geometry.multilinestring.MultiLineString):
        linestrings = feature.geoms
    else:
        # skip geometries that are not lines
        continue
    for linestring in linestrings:
        x, y = linestring.xy
        lats = np.append(lats, y)
        lons = np.append(lons, x)
        ids = np.append(ids, [ident]*len(y))
        names = np.append(names, [name]*len(y))
        limit_values = np.append(limit_values, [speed_limit]*len(y))
        lats = np.append(lats, None)
        lons = np.append(lons, None)
        ids = np.append(ids, None)
        names = np.append(names, None)
        limit_values = np.append(limit_values, None)

fig = px.line_mapbox(lat=lats, 
                    lon=lons, 
                    hover_name=names, 
                    hover_data=[ids, limit_values],
                    color=limit_values,
                    color_discrete_map = {
                        3.0: 'green',
                        5.0: 'limegreen',
                        7.0: 'lime',
                        10.0: 'lightgreen',
                        20.0: 'yellow',
                        30.0: 'gold',
                        40.0: 'orange',
                        50.0: 'darkorange',
                        60.0: 'orangered',
                        70.0: 'red'
                    },
                    line_group=ids,
                    zoom=15,
                    height=800, 
                    width=1000, 
                    center={'lon':13.402149951382846, 'lat': 52.514327773853275})


scatter_trace_in_joined = px.scatter_mapbox(trees_in_merged, 
                        lat=trees_in_merged.geometry.y,
                        lon=trees_in_merged.geometry.x,
                        #color='gattung_deutsch',
                        hover_name="gattung_deutsch",
                        hover_data=[],
                        size_max=1,
                        opacity=0.3)




for trace in scatter_trace_in_joined.data:
    fig.add_trace(trace)



fig.update_layout(mapbox_style='carto-positron', margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

### 4. Calculate correlation between planted trees and speed limits

In [None]:
import matplotlib.pyplot as plt

def plot_speed_to_tree_count(df):
    plt.scatter(df['speed_limit'], df['tree_count'])
    plt.xlabel('Speed Limit')
    plt.ylabel('Tree Count')
    plt.title('Correlation between Speed Limit and Tree Count')
    plt.show()


# ensure that speed_limit is a float
merged_data['speed_limit'] = merged_data['speed_limit'].astype(float)
# aggregate data by street name
aggregated_data = merged_data.groupby('strassenname').agg({'speed_limit': 'first', 'gattung_deutsch': 'count'}).reset_index()
aggregated_data.rename(columns={'gattung_deutsch': 'tree_count'}, inplace=True)


plot_speed_to_tree_count(aggregated_data)

#### Comments
As we can see in the graph above, there are some outliers, most of which are Bundesautobahnen and Bundesstraßen. 
They are very long and therefore accumulate a lot of trees.
As we are focusing on the inner city, we filter them out before calculating the correlation between planted trees and speed limits.

In [None]:
# Only consider Gemeindestraßen
inner_city_data = merged_data[merged_data['strassenklasse'] == 'G'].copy()

# aggregate data by street name
aggregated_inner_data = inner_city_data.groupby('strassenname').agg({'speed_limit': 'first', 'gattung_deutsch': 'count'}).reset_index()
aggregated_inner_data.rename(columns={'gattung_deutsch': 'tree_count'}, inplace=True)

plot_speed_to_tree_count(aggregated_inner_data)

#### Comments
We can see that there are fewer outliers now. Still, some remain that we cannot easily filter out.
One solution would be to filter out longer streets or to relate the number of trees to the length of the street.
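
Relating the tree count to the street length could look roughly like the sketch below. It is an illustration only: the column `length_m` is hypothetical and not known to exist in `merged_data` (it could, for example, be derived from `geometry.length` in a projected, metric coordinate system), and the toy rows stand in for the real data, where each row is one tree joined to one street segment.

```python
import pandas as pd

# Toy stand-in for merged_data: one row per tree, with the segment it belongs to.
toy = pd.DataFrame({
    "strassenname": ["Lange Str"] * 4 + ["Kurze Str"] * 2,
    "elem_nr": [1, 1, 1, 2, 3, 3],
    "length_m": [1000.0, 1000.0, 1000.0, 1000.0, 200.0, 200.0],  # hypothetical column
    "speed_limit": [50.0, 50.0, 50.0, 50.0, 30.0, 30.0],
})

def tree_density(df):
    """Return one row per street with its tree count and trees per kilometre."""
    groups = df.groupby("strassenname")
    tree_count = groups.size().rename("tree_count")
    speed_limit = groups["speed_limit"].first()
    # Count each segment's length only once, then sum the segments per street.
    length_m = df.drop_duplicates("elem_nr").groupby("strassenname")["length_m"].sum()
    out = pd.concat([speed_limit, tree_count, length_m], axis=1)
    out["trees_per_km"] = out["tree_count"] / (out["length_m"] / 1000.0)
    return out.rename_axis("strassenname").reset_index()

density = tree_density(toy)
print(density)
```

In this toy data the long street has more trees in total (4 versus 2) but a lower density (2 versus 10 trees per km), which is the effect the normalization is meant to expose. Streets without any trees do not appear in such a table if they are absent from the joined data.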

For this project we will leave it as it is and continue with the correlation calculation.

In [None]:
df = inner_city_data.copy()

df_grouped = df.groupby(['speed_limit']).size().reset_index(name='tree_count')


# Pivot the DataFrame to get tree types as columns
df_pivot = df.pivot_table(index='speed_limit', columns='gattung_deutsch', aggfunc='size', fill_value=0)
 
df_merged = df_grouped.merge(df_pivot, on='speed_limit')

correlation = df_merged.corr()
speed_limit_corr = correlation[['speed_limit']].sort_values(by='speed_limit', ascending=False)
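
To make the cell above easier to follow, here is the same sequence of steps on a small toy table. The data and the column `id_y` values are made up for illustration; only the structure (count trees per speed limit, pivot by genus, merge, correlate) mirrors the cell.

```python
import pandas as pd

# Toy stand-in: one row per tree with the speed limit of its street.
toy = pd.DataFrame({
    "id_y": range(7),
    "speed_limit": [10.0, 10.0, 30.0, 30.0, 50.0, 50.0, 50.0],
    "gattung_deutsch": ["LINDE", "LINDE", "LINDE", "AHORN", "AHORN", "AHORN", "AHORN"],
})

# Total number of trees per speed limit
tree_count = toy.groupby("speed_limit").size().reset_index(name="tree_count")

# Number of trees per genus and speed limit (genera become columns)
per_genus = toy.pivot_table(index="speed_limit", columns="gattung_deutsch",
                            aggfunc="size", fill_value=0)

# One row per speed limit: total count plus one column per genus
table = tree_count.merge(per_genus, on="speed_limit")

# Pearson correlation of every column with the speed limit
corr_with_speed = table.corr()["speed_limit"]
print(table)
print(corr_with_speed)
```

Note that the resulting table has one row per distinct speed limit (three rows in this toy example), so each coefficient is computed from only as many points as there are distinct speed limits in the data.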

### 5. Visualize the correlation between planted trees and speed limits


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(5, 15))  # Specify the size of your heatmap
sns.heatmap(speed_limit_corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Correlation Heatmap for Speed Limit")
plt.show()

### 6. Conclusion
The goal of this project was to investigate the correlation between the number of trees planted and the speed limits on roads in Berlin. Contrary to the initial expectation, the data does not show a negative correlation. Instead, the correlation is slightly positive, ranging between 0 and 0.2.

A plausible explanation for this unexpected outcome is that longer roads, which tend to have higher speed limits, also accumulate more trees due to their length. Some of these longer roads, specifically Bundesautobahnen and Bundesstraßen, can be considered as outliers and have been filtered. Despite filtering them out, the impact of the remaining ones seems significant enough to influence the correlation results.

This conclusion underlines the importance of careful data processing and the influence of outliers on the results of a statistical analysis. 
The findings also emphasize the complexities involved in making accurate predictions and the potential for unexpected results even when reasoning about the data seems sound.

### Outlook
Future work could involve refining the data cleaning process to address the influence of longer roads. For instance, streets could be filtered based on length, or the number of trees could be normalized by street length to account for the different sizes of roads. Another approach could be to exclude streets with speed limits above a certain threshold, such as 50 km/h, to focus on inner-city streets.

Additionally, it could be beneficial to explore the correlation with other factors, such as traffic volume, road width, or the proximity of parks and green areas. The role of urban planning policies could also be an interesting aspect to consider, as they may influence both the placement of trees and the establishment of speed limits.

Ultimately, while the project's results deviated from my initial hypothesis, it provided valuable insights into the nature of data analysis. It underscored the importance of careful data preprocessing, the handling of outliers, and the challenges in making accurate predictions. These lessons can be instrumental in future data analysis projects, both in the context of advanced methods of software engineering and beyond.