<a href="https://colab.research.google.com/github/nathanw9722/trying/blob/master/Mapbox.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Trusted Performance with the Mapbox Search Experience**

### **Run this notebook (⌘/CMD+F9) to see why Mapbox is trusted by the world's leading businesses to deliver an incredible search experience.**


<img src = "https://cdn.prod.website-files.com/609ed46055e27a02ffc0749b/66453b8846de8106813f8310_5f17a938687eab6f6a42e2f9_Finder_Main.png">


---



In [None]:
# @title Install Libraries
#install the libraries used below
!pip install -q folium
!pip install -q geopy
!pip install -q RISE
!jupyter-nbextension install rise --py --sys-prefix -q
!jupyter-nbextension enable rise --py --sys-prefix -q
!pip install -q pycountry_convert

In [18]:
# @title Import Libraries and Data
#I've worked with pandas a lot in the past, folium's new to me though...here goes
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import folium
import ipywidgets as widgets
from IPython.display import display, clear_output

#The sample data lives in a GitHub repo, so we can read it straight from the raw URL (no drive mount needed)
csv_url = "https://raw.githubusercontent.com/nathanw9722/trying/b891f93bf6721a39f2459eb91e53e46b21855518/Sample_Data.csv"
df = pd.read_csv(csv_url)

### Drop rows with non-numeric or missing values in numeric columns
We're not going to do anything fancy like filling missing values with the average of the n nearest neighbors. A more thorough approach would also check for invalid values in non-numeric columns, but since the dataframe is small I can see by inspection that the string columns are fine.

In [9]:
# @title
def is_numeric(value):
    #check if we can convert to a real float; NaN converts fine, so reject it explicitly
    try:
        return not np.isnan(float(value))
    except (ValueError, TypeError):
        return False

#string columns to exclude
exclude_columns = ['Search Query', 'Top Result (Mapbox)', 'Top Result (Provider B)']
#get numeric columns
numeric_columns = [col for col in df.columns if col not in exclude_columns]

#coerce numeric values in each Series in the df (unparseable values become NaN)
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors='coerce')

#create and apply mask. Have to do this with a lambda for an entire dataframe because applymap is deprecated
numeric_mask = df.apply(lambda series: series.map(is_numeric))
df = df[numeric_mask[numeric_columns].all(axis=1)].copy()
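For reference, the "coerce, then drop" idea can also be written with `dropna`. This is a minimal sketch on a toy frame: the values and the names `toy`, `numeric_cols`, and `clean` are made up for illustration and are not part of the sample data.

```python
import pandas as pd

# Toy frame (hypothetical values) with one unparseable entry and one missing entry
toy = pd.DataFrame({
    "Search Query": ["a", "b", "c"],
    "Latency (Mapbox)": ["1.2", "n/a", "0.9"],
    "Latency (B)": [1.3, 0.9, None],
})
numeric_cols = ["Latency (Mapbox)", "Latency (B)"]

# Coerce: anything that cannot be parsed as a number becomes NaN
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Drop every row that has a NaN in any of the numeric columns
clean = toy.dropna(subset=numeric_cols).copy()
print(clean)
```

Only row "a" survives here: row "b" had an unparseable string and row "c" had a missing value.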

Unnamed: 0,Search Query,Latitude,Longitude,Top Result (Mapbox),Top Result (Provider B),Results Returned (Mapbox),Results Returned (Provider B),Distance to Top Result (Mapbox),Distance to Top Result (B),Relevance Score (Mapbox),Relevance Score (B),User Rating of Top Result (Mapbox),User Rating of Top Result (B),Latency (Mapbox),Latency (B)
0,"""best pizza in NYC""",40.7128,-74.006,"""Joe's Pizza""","""Pizza Hut""",10,8,0.5,0.6,5,4,4.5,4.2,1.2,1.3
1,"""coffee near me""",40.7128,-74.006,"""Starbucks""","""Joe's Coffee""",8,6,0.3,0.5,4,5,4.0,4.6,0.8,0.9
2,"""55 Beale Street, San Francisco""",37.7749,-122.4194,"""Beale St. Restaurant""","""Nearby Restaurant""",5,3,0.2,0.3,5,3,4.7,3.5,0.9,1.1
3,"""Golden Gate Park""",37.7749,-122.4194,"""Golden Gate Park""","""Alamo Square Park""",6,4,0.5,1.0,5,4,4.8,4.5,1.0,1.3
4,"""best sushi in Tokyo""",35.6762,139.6503,"""Sushi Zanmai""","""Sushi Saito""",12,10,0.4,0.6,5,5,4.7,4.9,1.5,1.2
5,"""cheap hotels in London""",51.5074,-0.1278,"""Budget Inn""","""EasyHotel""",6,7,1.5,1.2,3,4,3.5,4.0,2.0,1.8
6,"""IST""",41.2753,28.7519,"""Istanbul Airport""","""Ataturk Airport""",8,5,0.4,1.0,5,4,4.8,4.5,1.0,1.2
7,"""Best restaurants in Amsterdam""",52.3676,4.9041,"""De Silveren Spiegel""","""The Pancake Bakery""",12,10,0.5,0.6,5,4,4.9,4.6,1.3,1.1
8,"""Los Angeles Airport""",33.9416,-118.4085,"""LAX""","""Los Angeles International Airport Terminal 1""",9,7,0.3,0.4,5,4,,,1.1,1.5
9,"""best burgers near me""",40.7128,-74.006,"""Shake Shack""","""Burger King""",7,8,0.3,0.4,5,4,4.6,4.2,0.7,0.9


## Add Country & Continent Attributes
We'll use reverse geocoding to get the country for each query.

In [10]:
# @title
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut
from pycountry_convert import country_alpha2_to_continent_code, convert_continent_code_to_continent_name
# Initialize the geocoder
geolocator = Nominatim(user_agent="nathan-mapbox-demo/1.0")

#Reverse geocoding to get country and continent
def get_country_and_continent(lat, lon):
    try:
        location = geolocator.reverse((lat, lon), exactly_one=True, language="en")
        # Pull the address block once; fall back to an empty dict if nothing was found
        address = location.raw.get('address', {}) if location else {}
        country_code = address.get('country_code')
        if 'country' in address and country_code:
            continent_code = country_alpha2_to_continent_code(country_code.upper())
            continent_name = convert_continent_code_to_continent_name(continent_code)
            return address['country'], continent_name
        return None, None
    except (GeocoderTimedOut, KeyError):
        return None, None

# Apply the function to each row and unpack the results
df[['Country', 'Continent']] = df.apply(
    lambda row: pd.Series(get_country_and_continent(row['Latitude'], row['Longitude'])), axis=1
)
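Several queries share the same coordinates (for example, three of the sample rows sit at 40.7128, -74.006), so each repeated row triggers an identical network lookup. Nominatim's public usage policy asks for no more than one request per second, so avoiding repeats is worthwhile. A minimal sketch of caching by coordinate follows; `fake_reverse_lookup` is a hypothetical stand-in for `get_country_and_continent` so the example runs offline, and its longitude rule is purely illustrative.

```python
from functools import lru_cache

calls = []  # records every simulated "network" lookup so we can count them

def fake_reverse_lookup(lat, lon):
    # Hypothetical stand-in for get_country_and_continent(lat, lon);
    # the longitude test is a toy rule, not real geocoding.
    calls.append((lat, lon))
    if lon < -30:
        return ("United States", "North America")
    return ("United Kingdom", "Europe")

@lru_cache(maxsize=None)
def cached_lookup(lat, lon):
    # The first call per (lat, lon) hits the lookup; repeats come from the cache
    return fake_reverse_lookup(lat, lon)

coords = [(40.7128, -74.006), (40.7128, -74.006), (51.5074, -0.1278), (40.7128, -74.006)]
results = [cached_lookup(lat, lon) for lat, lon in coords]
print(f"{len(coords)} rows resolved with {len(calls)} lookups")
```

In the notebook, the same decorator could wrap `get_country_and_continent` directly, since its arguments are plain floats.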




In [11]:
# @title
def histograms_by_geo(df):
  #Histogram for 'Country'
  plt.figure(figsize=(12, 6))
  plt.subplot(1, 2, 1)
  country_counts = df['Country'].value_counts()
  country_counts.plot(kind='bar', color='skyblue')
  plt.title('Search Queries by Country')
  plt.ylabel('Number of Queries')
  plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better readability.

  #Histogram for 'Continent'
  plt.subplot(1, 2, 2)
  continent_counts = df['Continent'].value_counts()
  continent_counts.plot(kind='bar', color='lightcoral')
  plt.title('Search Queries by Continent')
  plt.ylabel('Number of Queries')
  plt.xticks(rotation=0)

  plt.tight_layout()  # Adjust spacing between subplots
  plt.show()

**Results:** Queries from the United States are clearly the most common.

## Get a great visual
I'm breaking the results down into a single radar chart. We're storytelling, so each visual needs to carry as much impact as possible.

First we'll normalize the results so we can plot multiple metrics for both providers (Mapbox and Provider B) on one scale. We'll use min-max normalization because it's more intuitive than z-scores.

In [12]:
# @title

# Function for min-max normalization, with combined min-max normalization to preserve the relationships between comparable columns.
def min_max_normalize(df, columns):
    combined_min = df[columns].min().min()
    combined_max = df[columns].max().max()
    if combined_min == combined_max:  # Handle edge case for constant values
        return df[columns] * 0.0  # all zeros, but still a DataFrame with the same rows and columns
    normalized_df = (df[columns] - combined_min) / (combined_max - combined_min)
    return normalized_df

normalized_df = df.copy()

col_pairs = [
    ['Results Returned (Mapbox)', 'Results Returned (Provider B)'],
     ['Relevance Score (Mapbox)', 'Relevance Score (B)'],
     ['User Rating of Top Result (Mapbox)', 'User Rating of Top Result (B)'],
     ['Distance to Top Result (Mapbox)', 'Distance to Top Result (B)'],
     ['Latency (Mapbox)', 'Latency (B)']
]
for pair in col_pairs:
  # Normalize the specified columns
  normalized_cols = min_max_normalize(df, pair)
  #invert normalization to align with positive-increasing values, i.e. bigger is always better.
  if pair[0] in ['Distance to Top Result (Mapbox)', 'Latency (Mapbox)']:
    normalized_cols = 1 - normalized_cols
  normalized_df[pair] = normalized_cols
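To make the combined min-max step concrete, here is a small worked example using the latency values from three of the sample rows (queries 0, 1, and 5). The variable names `latency`, `lo`, `hi`, `scaled`, and `inverted` are illustrative only.

```python
import pandas as pd

# Latency values for sample queries 0, 1 and 5
latency = pd.DataFrame({
    "Latency (Mapbox)": [1.2, 0.8, 2.0],
    "Latency (B)": [1.3, 0.9, 1.8],
})

# One shared min and max across BOTH columns keeps the providers comparable
lo = latency.min().min()   # smallest latency from either provider
hi = latency.max().max()   # largest latency from either provider
scaled = (latency - lo) / (hi - lo)

# Lower latency is better, so flip the scale: 1 = best, 0 = worst
inverted = 1 - scaled
print(inverted.round(3))
```

Because both columns share one scale, the fastest response overall (Mapbox on query 1) scores 1 and the slowest (Mapbox on query 5) scores 0. Normalizing each column on its own would hide that relationship.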


In [13]:
# @title
# Data for the radar chart
def create_radar_chart(df, continent_name = None):
  #optionally filter by continent
  if continent_name:
    df = df[df['Continent'] == continent_name]
  #Keys to visualize
  keys = ['Results Returned', 'Distance to Top Result',  'Relevance Score', 'User Rating of Top Result', 'Latency']
  mapbox_keys = [key + ' (Mapbox)' for key in keys]
  #Provider B columns are suffixed '(Provider B)' for Results Returned and '(B)' everywhere else
  other_keys = ['Results Returned (Provider B)'] + [key + ' (B)' for key in keys[1:]]

  #sum only the normalized metric columns, so index or helper columns (e.g. 'Unnamed: 0') never leak into the chart
  mapbox_values = [df[key].sum() for key in mapbox_keys]
  other_values = [df[key].sum() for key in other_keys]

  # Categories for radar chart - using the keys
  categories = ['Results Returned', 'Min. Distance to Top Result',  'Relevance Score', 'User Rating of Top Result', 'Min. Latency']

  # Number of categories
  N = len(categories)

  # Angles for the radar chart
  angles = np.linspace(0, 2 * np.pi, N, endpoint=False).tolist()

  # Close the plot
  angles += angles[:1]
  mapbox_values += mapbox_values[:1]
  other_values += other_values[:1]

  # Create the radar chart
  fig, ax = plt.subplots(figsize=(10, 6), subplot_kw=dict(polar=True))

  ax.plot(angles, mapbox_values, linewidth=2, label='Mapbox')
  ax.plot(angles, other_values, linewidth=2, label='Provider B')

  # Set the labels for each category
  ax.set_thetagrids(np.degrees(angles[:-1]), categories)
  ax.set_title('Metrics Comparison: Mapbox vs. Provider B')
  ax.legend()

  plt.tight_layout()
  plt.show()
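The "close the plot" step is easy to get wrong, so here it is in isolation. A radar chart with N categories needs N + 1 points per series, where the last point repeats the first so the line returns to its start. This is a sketch only: `close_polygon` is a hypothetical helper and the scores are made-up numbers.

```python
import numpy as np

def close_polygon(values):
    """Return (angles, values), each with the first point repeated at the end."""
    n = len(values)
    # Evenly spaced angles around the circle, excluding 2*pi (same direction as 0)
    angles = np.linspace(0, 2 * np.pi, n, endpoint=False).tolist()
    return angles + angles[:1], list(values) + list(values[:1])

# Hypothetical summed scores for five metrics
angles, closed = close_polygon([3.0, 2.5, 4.0, 3.5, 2.0])
print(len(angles), len(closed))
```

Both lists must have the same length before they are passed to `ax.plot`; a mismatch is the usual cause of a "x and y must have same first dimension" error.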



---



## Plot the Coordinates
Get a quick visual on the geographic distribution of queries

In [14]:
# @title
def plot_coordinates(df, continent_name, metric= 'Relevance Score'):
  #filter for continent
  filtered_df = df[df['Continent'] == continent_name]

  # Map each metric to its (Mapbox, Provider B) column pair
  metric_columns = {
      'Relevance Score': ('Relevance Score (Mapbox)', 'Relevance Score (B)'),
      'Results Returned': ('Results Returned (Mapbox)', 'Results Returned (Provider B)'),
      'User Rating of Top Result': ('User Rating of Top Result (Mapbox)', 'User Rating of Top Result (B)'),
      'Distance to Top Result': ('Distance to Top Result (Mapbox)', 'Distance to Top Result (B)'),
      'Latency': ('Latency (Mapbox)', 'Latency (B)'),
  }
  if metric not in metric_columns:
    raise ValueError("Invalid metric provided. Choose from: 'Relevance Score', 'Results Returned', 'User Rating of Top Result', 'Distance to Top Result', 'Latency'")
  mapbox_col, other_col = metric_columns[metric]

  # Create a base map centered around the mean coordinates
  m = folium.Map(location=[filtered_df['Latitude'].mean(), filtered_df['Longitude'].mean()], zoom_start=10, width=600, height=400)

  # Add markers for each coordinate
  for idx, row in filtered_df.iterrows():
    popup_text = f"<b>Search Query:</b> {row['Search Query']}<br>" \
                 f"<b>{metric} (Mapbox):</b> {row[mapbox_col]}<br>" \
                 f"<b>{metric} (Provider B):</b> {row[other_col]}"
    folium.Marker([row['Latitude'], row['Longitude']], popup=folium.Popup(popup_text, max_width=300)).add_to(m)

  # Get the bounds of the dataset so that we can see all points when the map opens
  min_lat, max_lat = filtered_df['Latitude'].min(), filtered_df['Latitude'].max()
  min_lon, max_lon = filtered_df['Longitude'].min(), filtered_df['Longitude'].max()

  # Fit the map to the bounds
  m.fit_bounds([[min_lat, min_lon], [max_lat, max_lon]])


  # Add a tileset
  folium.TileLayer('OpenStreetMap').add_to(m)

  # Display the map
  return m

# Introduction
This notebook compares Mapbox's performance on key location-based search metrics, such as Relevance Score and Distance to Top Result, with the performance of a competing alternative from Provider B. As a notebook, the analysis here is repeatable, should you wish to use it on a different sample dataset.


We'll start with an overview that compares the normalized values of all metrics, to see if we can identify a trend across metrics and regions. From there we'll drill down to specific metrics and regions to evaluate performance at these more granular levels.
Finally, we'll discuss our results, the implications for positioning Mapbox against the alternative from Provider B, and opportunities both to solidify this positioning with additional metrics and to support this customer, should we win their business.




# High-level Metric Overview
To compare metrics, all values are normalized between 0 and 1, then each metric's values are summed for an overall score. Note that Distance and Latency are inverted so that a higher value always represents a better result.


---



In [15]:
# @title Overall Metric Performance
create_radar_chart(normalized_df)

A clear winner. One thing to note is the distinct advantage in Relevance Score. Mapbox consistently delivers more relevant results when compared to Provider B. Before we take this any further, we'll check where these results are coming from.

In [None]:
# @title Query Distribution by Location
histograms_by_geo(df)

We can see that our data is dominated by North America and Europe, with the United States having far more queries than any other country.
Let's evaluate metric performance for each continent separately.

In [None]:
# @title Evaluate by Continent { run: "auto" }
Metrics_Geo = "Asia" # @param ["North America", "Europe", "Asia", "Oceania", "Africa"]
Display = "Radar Chart (normalized values)" # @param ["Radar Chart (normalized values)", "Map"]
if Display == "Radar Chart (normalized values)":
  create_radar_chart(normalized_df, Metrics_Geo)
else:
  display(plot_coordinates(df, Metrics_Geo))

Mapbox shows a strong lead in both North America and Europe, particularly in relevance scores and user rating of the top result. These are also the continents from which we have the most data, so this finding is relatively robust.
Compare this to Africa, where the competitor appears to provide a search result that is closer to the point of interest. However, our dataset has only one query from Africa, so this is far from conclusive.

## Relevance Scores Analysis
Given the dominance in relevance scores, let's take a deeper look here. We'll see just how big a lead Mapbox has by checking the difference (or gap) between the Relevance Scores of Mapbox and Provider B.


---





In [None]:
# @title Relevance Score Gaps: Mapbox vs Provider B
# Calculate the directional difference
df['RelevanceDifference'] = df['Relevance Score (Mapbox)'] - df['Relevance Score (B)']
# See where Mapbox performed poorly by looking at the smallest 'RelevanceDifference' values
sort_df = df.sort_values(by='RelevanceDifference')
plt.figure(figsize=(10, 6))  # Adjust figure size as needed
#Converting to int for better visual display
plt.hist(df['RelevanceDifference'], bins=range(int(df['RelevanceDifference'].min()), int(df['RelevanceDifference'].max()) + 2), align='left', rwidth=0.8)
plt.xlabel('Relevance Difference')
plt.ylabel('Frequency')
plt.title('Gap size in Relevance Scores')
plt.xticks(range(int(df['RelevanceDifference'].min()), int(df['RelevanceDifference'].max()) + 1)) # Ensure integer ticks

plt.show()

**Results:** Almost all queries have a difference of 1 point between the Relevance Score of Mapbox and that of Provider B. There are very few ties. This seems to point to a strategic advantage in Mapbox's ability to consistently deliver the most relevant results to users.
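The histogram can be cross-checked with a simple tally of who is ahead on each query. This sketch uses the relevance scores from the ten sample rows shown in the table output earlier in this notebook; the full dataset contains more rows, so the counts here are illustrative rather than the final result.

```python
from collections import Counter

# Relevance scores for sample queries 0-9, copied from the table output above
mapbox = [5, 4, 5, 5, 5, 3, 5, 5, 5, 5]
provider_b = [4, 5, 3, 4, 5, 4, 4, 4, 4, 4]

# Directional gap: positive means Mapbox scored higher
gaps = [m - b for m, b in zip(mapbox, provider_b)]

outcome = Counter(
    "Mapbox ahead" if g > 0 else "tie" if g == 0 else "Provider B ahead"
    for g in gaps
)
print(gaps)
print(dict(outcome))
```

On these ten rows Mapbox is ahead on seven queries, tied on one, and behind on two, with most gaps exactly one point wide.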


# Distance Analysis
Mapbox and Provider B appear to have similar results in Europe and North America when it comes to distance to top result. We'll take a closer look by breaking this down by country.

In [None]:
# @title Distance to Top Result
#seaborn is used for the boxplot below and was not imported earlier
import seaborn as sns

#create boxplots for distance to top results in NA and EU
def grouped_boxplot(df):

  # Filter for North America and Europe
  filtered_df = df[(df['Continent'] == 'North America') | (df['Continent'] == 'Europe')]

  # Select relevant columns for the plot
  plot_data = filtered_df[['Country', 'Distance to Top Result (Mapbox)', 'Distance to Top Result (B)']]

  # Melt the DataFrame for easier plotting with seaborn
  plot_data = plot_data.melt(id_vars='Country', var_name='Metric', value_name='Distance')

  # Create the grouped boxplot
  plt.figure(figsize=(12, 6))
  sns.boxplot(x='Country', y='Distance', hue='Metric', data=plot_data)
  plt.gca().invert_yaxis()
  plt.ylabel('Distance to Top Result')
  plt.title('Distance to Top Result by Country')
  plt.xticks(rotation=45, ha='right')
  plt.tight_layout()
  plt.show()
grouped_boxplot(df)

In the US, where we had a large sample size, the average distance to the top result was noticeably smaller for Mapbox. In countries such as Mexico and Italy, Mapbox also came out well ahead, although the small sample sizes there make those results less conclusive. Spain is an outlier, where Provider B has a smaller average distance to the top result.
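Because sample size matters so much to how far these comparisons can be trusted, it helps to put the query count next to each country's typical distance. The sketch below uses eight of the sample rows shown earlier. The `Country` labels are assigned by hand from the coordinates (an assumption, not the geocoder's output), and `rows` and `summary` are illustrative names.

```python
import pandas as pd

# Distances from sample queries 0, 1, 2, 3, 8, 9 (US), 5 (UK) and 7 (Netherlands).
# Country labels are hand-assigned for this sketch.
rows = pd.DataFrame({
    "Country": ["United States"] * 6 + ["United Kingdom", "Netherlands"],
    "Distance to Top Result (Mapbox)": [0.5, 0.3, 0.2, 0.5, 0.3, 0.3, 1.5, 0.5],
    "Distance to Top Result (B)": [0.6, 0.5, 0.3, 1.0, 0.4, 0.4, 1.2, 0.6],
})

# Query count plus median distance per provider, one row per country
summary = rows.groupby("Country").agg(
    queries=("Distance to Top Result (Mapbox)", "count"),
    median_mapbox=("Distance to Top Result (Mapbox)", "median"),
    median_b=("Distance to Top Result (B)", "median"),
)
print(summary)
```

A country with a single query still gets a median, but the `queries` column makes it obvious that the number rests on one observation.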

# Conclusions


---

Mapbox delivers consistently better performance across a range of metrics. With leading Relevance Scores and User Ratings, particularly in North America and Europe, Mapbox delivers what users want most: results that align with their search intent.
It's hard to talk about geographic fluctuations, given how skewed our data was towards results from the US, but Mapbox leads in all metrics in North America and Europe.

Google Maps is the 500-pound gorilla in this market, but for businesses that need a partner willing to innovate and to prioritize the best results for their users, as shown by the consistent relevance scores, Mapbox is the clear choice.

I'd like to see a breakdown by query language, since we've only seen results for English. I'd also like to see result completeness. To convert results to action, they need to include all the data the user needs, such as email, phone numbers, and operating hours. My hypothesis is that showing completeness of results would help build trust that Mapbox can deliver everything users need. Google has built this implicit trust over the years; challengers need to prove it.

As far as developing a product backlog for dev teams if we won this business, I would look to emphasize the areas that the customer is looking to grow in. There were not a lot of results for Africa, Asia and Oceania, but support for localization (languages) and growing our data assets