# Chart WhatsApp Members on a Map

This notebook plots the members of a WhatsApp group chat on a world map.  It does the following:

* Scrape the phone numbers from the WhatsApp chat log
* Extract either country code or US/Canada area codes from the phone number
* Associate the country/area code to a latitude and longitude using publicly available data
* Map the latitudes and longitudes on a world map using Folium

## Some Preliminary Setup

### Files Needed

#### WhatsApp Chat Log

The WhatsApp chat log should be named:

* chat.txt

If not, change the name in the appropriate cell below.

#### Additional Files

In addition to the WhatsApp chat log, the following files are also needed:

* us-area-code-geo.csv
* ca-area-code-geo.csv
* world_country_and_usa_states_latitude_and_longitude_values.csv
* world-countries.json

These files come from publicly available sources.

Latitude and Longitude data for US and Canadian area codes were downloaded from [https://github.com/ravisorg/Area-Code-Geolocation-Database/tree/master](https://github.com/ravisorg/Area-Code-Geolocation-Database/tree/master).

Latitude and longitude data for all countries can be downloaded from [https://www.kaggle.com/datasets/paultimothymooney/latitude-and-longitude-for-every-country-and-state](https://www.kaggle.com/datasets/paultimothymooney/latitude-and-longitude-for-every-country-and-state).

The country boundaries (GeoJSON) used by Folium can be downloaded from [https://github.com/python-visualization/folium/blob/main/examples/data/world-countries.json](https://github.com/python-visualization/folium/blob/main/examples/data/world-countries.json).

To map country code to actual country name, we will scrape that data from [https://countrycode.org/](https://countrycode.org/).

### Import Libraries

Define a small helper for printing status messages, then import the libraries needed for this notebook.

In [None]:
def print_box(message, chr='*'):
  '''Function to print a message with a box of asterisks around it.

  Input Parameters:
  -----------------
  message : str
      The message to place inside the box.
  chr : str      
      The character to use for the top and bottom of the box (default is '*').
  '''
  msg = f'*** {message} ***'
  msg_len = len(msg)
  print(chr * msg_len)
  print(msg)
  print(chr * msg_len)

print_box('Function loaded!')

In [None]:
# Import some libraries
import numpy as np              # numerical computing library
import pandas as pd             # primary data structure library
from bs4 import BeautifulSoup   # For webscraping
import requests                 # HTTP support

print_box('Libraries loaded!')

Install the mapping library.

In [None]:
# Install Folium library

# %pip is an IPython/Jupyter magic and also works in VS Code notebooks.
# Uncomment the following line if folium is not pre-installed in the kernel.
# %pip install folium

# Then import the library
import folium

print_box('Folium loaded!')

### Define Some Functions

After exporting the chat log from WhatsApp, the text may contain invisible Unicode formatting characters.  This function cleans a string so that Python can process it properly.

In [None]:
# Function to clean up the chat string
def cleanup(input):
  '''Function to clean up the chat string.
  
  This function takes a chat string and either replaces the no-break space with a regular space character,
  or removes embedded characters that can mess up string processing.  It returns the cleaned string.

  Parameters:
  -----------
  input : str
      String to clean.

  Returns:
  --------
    Cleaned string.
  '''
  retVal = ""                                         # Start with a blank string
  for char in input:                                  # Loop through each character
    if char in ["\xa0"]:                              # Is it a no-break space?
      retVal += " "                                   # Yes, then replace with a regular space
    elif char not in ["\u202a", "\u202c", "\u200e"]:  # Is it some other weird character?
      retVal += char                                  # No, then append to the final string
  return retVal                                       # Return the cleaned string

print_box('Function loaded!')
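
The same cleanup can also be sketched with `str.translate`, which applies all replacements in one pass.  This is only an illustrative alternative; the names `CLEANUP_TABLE` and `cleanup_fast` are hypothetical and are not used elsewhere in this notebook.

```python
# Hypothetical alternative to cleanup(); handles the same four characters.
CLEANUP_TABLE = str.maketrans({
    "\xa0": " ",      # no-break space -> regular space
    "\u202a": None,   # left-to-right embedding -> removed
    "\u202c": None,   # pop directional formatting -> removed
    "\u200e": None,   # left-to-right mark -> removed
})

def cleanup_fast(text):
  '''Return text with no-break spaces normalized and directional marks removed.'''
  return text.translate(CLEANUP_TABLE)

# Show the cleaned string with repr() so hidden characters would be visible
print(repr(cleanup_fast("\u200e\u202a+1\xa0(201)\xa0123-4567\u202c")))
```

Either version works for the strings in a chat log; the loop in `cleanup` is easier to step through, while the translation table is easier to extend with more characters.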

## Step 1: Extract Country/Area Codes from Chat

### Load the Chat Logs

WhatsApp does not have an easy way to download the list of members in a group.  The only option is to download the chat logs via the 'Export Chat' command.

Each line in the chat log includes the date/time stamp as well as the poster's phone number like so:

```
[10/9/23, 11:21:51 AM] ‪+[phone number]‬: Thanks for starting the group
```

When a member joins, WhatsApp automatically logs the following message:

```
[10/9/23, 11:20:21 AM] ‪+[phone number]‬: ‎‪+[phone number]‬ joined using this group's invite link
```

Every line contains the phone number of the member who posted the message, but we only process the lines that contain the word `joined`.  This reduces the number of chat messages that need to be processed, and it also captures members who have not posted a message since joining.

The drawback of this method is that, if the person who exported the chat log joined after the group was created, it only finds members who joined after them.  If we remove the `joined` condition, the code looks at all posts (i.e. extra processing), but it still does not pick up members who have not posted anything.

Ideally, the admin who created the group should download the chat.  This way, we can be sure that all members who joined will be included.
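
As a rough illustration of the idea, the join lines can also be matched with a regular expression.  This is a sketch under assumptions: the pattern `JOINED_RE` and the helper `extract_joined_phone` are hypothetical, and the pattern assumes the line format shown in the samples above (a phone number starting with `+`, optionally wrapped in directional marks, followed by `joined`).

```python
import re

# Assumed shape of a join line: "<phone> joined using this group's invite link"
JOINED_RE = re.compile(
    r"(\+[\d\s()\-\u2011]+?)"        # the phone number, starting with "+"
    r"[\u202a\u202c\u200e]*"         # optional directional marks after the number
    r"\s+joined\b"                   # the word "joined" right after the number
)

def extract_joined_phone(line):
  '''Return the phone number that joined, or None if the line is not a join line.'''
  found = JOINED_RE.search(line)
  if found is None:
    return None
  return found.group(1).replace("\xa0", " ").strip()

# Made-up sample lines in the format shown above
join_line = ("[10/9/23, 11:20:21 AM] \u202a+1 (201) 123-4567\u202c: "
             "\u200e\u202a+1 (201) 123-4567\u202c joined using this group's invite link")
post_line = "[10/9/23, 11:21:51 AM] \u202a+1 (201) 123-4567\u202c: I joined late"

print(extract_joined_phone(join_line))
print(extract_joined_phone(post_line))
```

Because the pattern requires a phone number directly before `joined`, an ordinary post that merely contains the word is skipped.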

In [None]:
# Load the chat logs and process it line by line

# Define a constant that points to the location of the chat log
WHATSAPP_CHAT_LOG ='data/chat.txt'

# We use a set here because we only want unique phone numbers. (That's what a set does.)
phones = set()

# Now open the file and process it
with open(WHATSAPP_CHAT_LOG, 'r', encoding='utf-8') as file:  # Open read-only; WhatsApp exports are typically UTF-8
  for line in file:                                       # Loop through the file line by line
    if "joined" in line:                                  # If the line contains the string, use it
      split = line.split(":", 4)                          # Split at the ":" separators (at most 4 splits)
      phone_no = split[-1].split("joined")[0].strip()     # Split the last element again at "joined" and strip any whitespace
      final = cleanup(phone_no)                           # Clean up the phone number
      if final.startswith('+'):                           # Does it begin with a "+"?  (Also safe if the string is empty)
        phones.add(final)                                 # Yes, assume it's a phone number and save it

Print out the number of phone numbers found.  Stop processing if none are found.

In [None]:
# Output the number of phone numbers found
count = len(phones)
print('Found {:,} phone numbers in the chat log.'.format(count))

if count == 0:
  print('No phone numbers found in the chat logs.  Please download another chat log that contains phone numbers.')
  assert count > 0

### Convert to a Pandas DataFrame

In [None]:
df = pd.DataFrame(list(phones))           # Convert the phone set ==> list ==> Pandas DataFrame
df.columns = ['Code']                     # Set the first column to "Code"
df.sort_values('Code', inplace=True)      # Sort the codes in place
df.reset_index(inplace=True, drop=True)   # Reset the index
df.head()                                 # Display partial results

# Output removed for privacy reasons.

### Some Additional Functions

In [None]:
# Function to extract the country code or US/Canada area code
def getCode(phone):
  if "(" in phone:                                        # Does input string have "("
    return phone[phone.index("(") + 1:phone.index(")")]   # Yes, return the area code in between the parenthesis
  return phone[:phone.index(" ")]                         # No, return the country code, including the "+", before the first space

In [None]:
# Test the function
code = '+223 12 34 56 78'
print("The code for '{}' is '{}'".format(code, getCode(code)))
code = '+1 (201) 123‑4567'
print("The code for '{}' is '{}'".format(code, getCode(code)))
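
`getCode` assumes that every phone number contains either parentheses or a space.  A more defensive variant might look like the following sketch; `get_code_safe` is a hypothetical name, and returning the whole string when there is no space is an assumption about what is useful, not something the chat log format guarantees.

```python
def get_code_safe(phone):
  '''Hypothetical defensive variant of getCode.

  Returns the area code between parentheses if present, otherwise the text
  before the first space (or the whole string if there is no space).
  '''
  phone = phone.strip()
  open_paren = phone.find("(")
  close_paren = phone.find(")", open_paren + 1)
  if open_paren != -1 and close_paren != -1:
    return phone[open_paren + 1:close_paren]   # area code, without the parentheses
  return phone.split(" ", 1)[0]                # country code, including the "+"

for sample in ['+223 12 34 56 78', '+1 (201) 123\u20114567', '+49123456']:
  print("The code for '{}' is '{}'".format(sample, get_code_safe(sample)))
```

Unlike `str.index`, `str.find` returns -1 instead of raising an error, so malformed numbers do not stop the notebook.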

### Apply the Function to the DataFrame

In [None]:
# Apply the getCode function to each value and save it back to the original column
df['Code'] = df['Code'].apply(getCode)

In [None]:
# Check a few things
df.dtypes

In [None]:
# Check the shape of the dataframe
df.shape

In [None]:
# Check the value counts
df['Code'].value_counts()

## Step 2: Add Latitude and Longitude Data

This section converts the codes extracted in the previous step to latitude and longitude.  US/Canada area codes are looked up in the area code tables.  The world geo data file is organized by country, so country calling codes must first be converted to a country name before the coordinates can be looked up.

### Load the Country Code Data

Scrape the country code data from the Internet.  This will allow us to convert the country code to an actual country name.

In [None]:
# Load the country calling codes from the Internet
COUNTRY_CODE_URL = "https://countrycode.org/"
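response = requests.get(COUNTRY_CODE_URL, timeout=30)   # Fail instead of hanging if the site does not respond
response.raise_for_status()                             # Stop early on an HTTP error status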
soup = BeautifulSoup(response.content, "html.parser")
soup.title

In [None]:
# Search through the table and extract the country by calling codes.

rows = soup.find("table").find("tbody").find_all("tr")  # Extract all rows from the table on the page
country_codes = {}                                      # Initialize an empty dictionary to store the country + code
for row in rows:                                        # Loop through all the rows
    cells = row.find_all("td")                          # Extract the cells from the row
    country = cells[0].get_text(strip=True)             # First cell is the country (surrounding whitespace removed)
    code = cells[1].get_text(strip=True)                # Second cell is the calling code (surrounding whitespace removed)
    country_codes[code] = country                       # Save code/country to the dictionary

# Note: the 'calling code' is the key and the 'country' is the value.  This is
# because we want to look up the country by code later on.

In [None]:
# Display the first 5 elements in the dictionary to verify we did it right
first_few = dict(list(country_codes.items())[:5])
first_few

### Load Geo Data

In [None]:
# Load US geo data
df_area_codes = pd.read_csv("data/us-area-code-geo.csv")
df_area_codes.head()

In [None]:
# Load Canada geo data
df_ca_area_codes = pd.read_csv("data/ca-area-code-geo.csv")
df_ca_area_codes.head()

In [None]:
# Load the world geo data
df_codes = pd.read_csv("data/world_country_and_usa_states_latitude_and_longitude_values.csv")
df_codes.head()

### Determine Latitudes/Longitudes

With the geo data files loaded, we can go through the extracted codes and figure out the latitude/longitude for each member.

#### Step 2.1: Determine Country

Take the area/country calling codes and figure out which country each one belongs to.

In [None]:
# Function to return the country based on the calling code
def getCountry(code):
  '''Function to return the country based on the calling code.
  
  Parameters:
  -----------
  code : str
      Country or US/Canada area code of the phone number.

  Returns:
  --------
    Name of the country associated with the phone number.
  '''
  if code[0] == "+":                                      # Does the code begin with "+"?
    new_code = code[1:]                                   # Yes, then extract the code without the "+"
    if new_code in country_codes:                         # Is this a valid country code?
      return country_codes[new_code]                      # Yes, return the country
  else:                                                   # No "+", so treat it as a US/Canada area code
    int_code = int(code)                                  # Convert string to integer for the area code tables
    if int_code in df_ca_area_codes['area_code'].values:  # Is the code in Canada?
      return 'Canada'                                     # Yes, return 'Canada'
    elif int_code in df_area_codes['area_code'].values:   # Not Canada, is it in the US?
      return 'United States'                              # Yes, return 'United States'

  # Country/Area code wasn't found!  Let's start trimming it from the back
  # and see if we can find it.  Codes with a "+" are only matched against
  # country codes, so a value like "+316" is never mistaken for an area code.
  if len(code) > 1:                                       # Is the length greater than 1?
    return getCountry(code[:-1])                          # Yes, then call this function again with a shorter string (drop the last character)
  return None                                             # Still can't find it!  Return None instead!

# If this function can't find the code, either the code really doesn't exist (bad input data),
# or we need to update our source tables for country/area codes.

In [None]:
# Test the function with some values
test_code = '+49'
print('{} belongs to {}'.format(test_code, getCountry(test_code)))
test_code = '201'
print('{} belongs to {}'.format(test_code, getCountry(test_code)))
test_code = '+316'
print('{} belongs to {}'.format(test_code, getCountry(test_code)))

# This is an edge case.  The input file contains this value.
test_code = '+447956'
print('{} belongs to {}'.format(test_code, getCountry(test_code)))

In [None]:
df2 = df.copy()                                                         # Make a copy of the original data
df2['Country'] = df2.apply(lambda row: getCountry(row['Code']), axis=1) # Create a new 'Country' column based on the country/area code

In [None]:
# Check to make sure there are no nulls
df2.isnull().sum()

In [None]:
# Check the value counts just to be curious
df2['Country'].value_counts()

#### Step 2.2: Determine Latitudes/Longitudes

In [None]:
# Function to return latitude/longitude based on the country and/or area code of the row
def getLatitudeLongitude(row):
  match row['Country']:                                       # Check 'Country'
    case 'United States':                                     # Is it 'United States'?
      area_code = int(row['Code'])                            # Yes, convert the 'Code' from string to integer and get the coordinates for it
      retVal = df_area_codes[['latitude', 'longitude']].loc[df_area_codes['area_code'] == area_code]

    case 'Canada':                                            # Is it 'Canada'?
      area_code = int(row['Code'])                            # Yes, convert the 'Code' from string to integer and get the coordinates for it
      retVal = df_ca_area_codes[['latitude', 'longitude']].loc[df_ca_area_codes['area_code'] == area_code]

    case _:                                                   # Must be a country calling code; get the coordinates for the country
      retVal = df_codes[['latitude', 'longitude']].loc[df_codes['country'] == row['Country']]

  if len(retVal) == 1:                                        # Found exactly one row?  Should always be true.
    return pd.Series([retVal.iloc[0, 0], retVal.iloc[0, 1]])  # Yes, return latitude/longitude
  return pd.Series([np.nan, np.nan])                          # No, return NaNs so every row still yields two values

In [None]:
# Test the function with the first row of the dataset
row = df2.iloc[0]
ret = getLatitudeLongitude(row)
print('Code {} is located at {}, {}'.format(row['Code'], ret[0], ret[1]))

# Test the function with the last row of the dataset
row = df2.iloc[-1]
ret = getLatitudeLongitude(row)
print('Code {} is located at {}, {}'.format(row['Code'], ret[0], ret[1]))

In [None]:
# Create a copy of the dataset
df3 = df2.copy()

# Now, for each row, call our function and save the data to new columns
df3[['latitude', 'longitude']] = df3.apply(getLatitudeLongitude, axis=1)
df3.head()

In [None]:
# Verify that there are no null rows
df3[df3.isnull().any(axis=1)]

## Step 3: Map Chat Members

Create the following maps:

1. World map of members by count
2. World map of members by count minus US members
3. World map of members by count with adjusted bins
4. World map of members by country/area codes


### Massage the Dataset

In [None]:
# Generate a new dataset of count by country

# Note: The dataframe, df4, will be used by maps 1, 2, and 3.

# Count members per country and name the count column 'Count' (independent of the pandas version's default name)
df4 = df3.value_counts('Country').rename('Count').reset_index()
df4.head()                                                  # Check the results

In [None]:
# Change 'United States' to 'United States of America'
# Need to do this because the world-countries.json file uses this value
df4.loc[df4['Country'] == 'United States', 'Country'] = 'United States of America'
df4.head()
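
Any other country whose name differs between the data and the GeoJSON file would not be colored on the map.  A quick check for such mismatches could look like this sketch.  It assumes the GeoJSON stores the country name under `properties.name` of each feature (the key that the Choropleth calls below match on); `unmatched_names` and `toy_geojson` are hypothetical names.

```python
def unmatched_names(geojson, names):
  '''Return the names that have no GeoJSON feature with the same properties.name.'''
  known = {feature["properties"]["name"] for feature in geojson["features"]}
  return sorted(set(names) - known)

# Tiny stand-in for data/world-countries.json
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"name": "United States of America"}, "geometry": None},
        {"type": "Feature", "properties": {"name": "Germany"}, "geometry": None},
    ],
}

print(unmatched_names(toy_geojson, ["United States", "Germany"]))
```

With the real files, the same function could be called with the result of `json.load` on `data/world-countries.json` and with `df4['Country']`.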

Now that the data is loaded and cleaned, we can map it.

### Map 1: Map of Member Count by Country


In [None]:
# Create a plain world map
world_map = folium.Map(location=[0, 0], zoom_start=2)
WORLD_GEO = r'data/world-countries.json'

# Generate choropleth map using the member count by country
folium.Choropleth(
    geo_data=WORLD_GEO,
    data=df4,
    columns=['Country', 'Count'],
    key_on='feature.properties.name',
    fill_color='YlGn',
    nan_fill_color='gainsboro',
    nan_fill_opacity=0.1,
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='MIT ADS WhatsApp Group Membership'
).add_to(world_map)

# display map
world_map

### Map 2: Map of Member Count by Country minus those in the US

Since the US has 68 members and the next country has only 13, all the other countries fall into the first bucket of the legend above, while the US sits alone in the top bucket.  This makes for a boring map.  So, let's drop the US data and create a new map.

**Note: Values mentioned above will differ based on your chat log.  So, this may not be applicable.**

In [None]:
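# Since the data is heavily skewed by the US, let's drop it and remap the data.
# Filter by name rather than by position so this works even if the US is not in the first row.
df5 = df4[df4['Country'] != 'United States of America']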
df5.head()

In [None]:
# Get the min/max values in the dataset
# (named min_count/max_count to avoid shadowing the built-in min() and max())
min_count = int(df5['Count'].min())
max_count = int(df5['Count'].max())

# Compute the (exclusive) stop value for range() used for the bins.
# With a step of 2, range(min_count, max_count + 2, 2) always ends at
# max_count or max_count + 1, so the last bin edge is never below the largest count.
bin_stop = max_count + 2

# Print out the min/max values so we can see what they are.
print(f'Min = {min_count}')
print(f'Max = {max_count}')

In [None]:
# Start with a new world map
world_map2 = folium.Map(location=[0, 0], zoom_start=2)

# Generate choropleth map using the member count by country
folium.Choropleth(
    geo_data=WORLD_GEO,
    data=df5,
    columns=['Country', 'Count'],
    key_on='feature.properties.name',
    fill_color='YlGn',
    nan_fill_color='gainsboro',
    nan_fill_opacity=0.1,
    fill_opacity=0.7,
    line_opacity=0.2,
    bins=list(range(min_count, bin_stop, 2)),
    highlight=True,
    legend_name='MIT ADS WhatsApp Group Membership (minus US)'
).add_to(world_map2)

# display map
world_map2

With the US data removed, the map shows a better distribution of the remaining members.

### Map 3: Map of Member Count by Country with Adjusted Bins

Let's see if we can adjust the bins instead.  This way, we can include the US data and still see color gradients for countries with lower counts.

Since the other countries have 13 or fewer members, let's use smaller bins for that group and one really large bin for the US.

**Note: This may not be applicable to your chat log.  Adjust the bins to fit your data.**
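
One way to derive such bins from the data, rather than typing them by hand, is sketched below.  The helper `skewed_bin_edges` is hypothetical, and its rule (steps of 2 up to just past the second-largest count, then one wide bin for the largest) is only one possible choice.  The sample counts reuse the 68 and 13 mentioned above; the remaining values are made up for illustration.

```python
def skewed_bin_edges(counts, step=2):
  '''Return bin edges with fine steps for most counts and one wide bin for the largest.'''
  if len(counts) < 2:
    raise ValueError("need at least two counts")
  ordered = sorted(counts)
  lowest, runner_up, largest = ordered[0], ordered[-2], ordered[-1]
  edges = list(range(lowest, runner_up + step, step))  # fine-grained edges up to the runner-up
  if edges[-1] <= runner_up:
    edges.append(edges[-1] + step)                     # close the bin that holds the runner-up
  if largest > edges[-1]:
    edges.append(largest + step)                       # one wide bin for the outlier
  return edges

print(skewed_bin_edges([68, 13, 5, 1, 1]))
```

For these sample counts the result is the same list of edges used in the cell below, and it could be passed as `bins=skewed_bin_edges(list(df4['Count']))`.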

In [None]:
# Create a new world map
world_map3 = folium.Map(location=[0, 0], zoom_start=2)

# Generate choropleth map using the member count by country
folium.Choropleth(
    geo_data=WORLD_GEO,
    data=df4,
    columns=['Country', 'Count'],
    key_on='feature.properties.name',
    fill_color='YlGn',
    nan_fill_color='gainsboro',
    nan_fill_opacity=0.1,
    fill_opacity=0.7,
    line_opacity=0.2,
    bins=[1,3,5,7,9,11,13,15,70],
    legend_name='MIT ADS WhatsApp Group Membership'
).add_to(world_map3)

# display map
world_map3

### Map 4: Map of Members by Area/Country Codes

This will map members in the US and Canada by area code and everyone else by country calling codes.

This map will show where members' phone numbers are located around the world.

In [None]:
# Drop duplicates since we don't really care how many members are in each area/country code

df6 = df3.drop_duplicates() # Drop duplicates and save the results to a new dataframe
df6.head()                  # Display the first few rows

In [None]:
# Check the shape
df6.shape

In [None]:
# Define the world map centered around (0,0) with a low zoom level
world_map4 = folium.Map(location=[0,0], zoom_start=2)

# instantiate a feature group for the dataset
chat_people = folium.map.FeatureGroup()

# loop through the dataframe and add each row to the chat_people feature group
for lat, lng in zip(df6['latitude'], df6['longitude']):
    chat_people.add_child(
        folium.vector_layers.CircleMarker(
            [lat, lng],
            radius=5, # define how big you want the circle markers to be
            color='yellow',
            fill=True,
            fill_color='blue',
            fill_opacity=0.6
        )
    )

# add chat_people to the world map and display it!
world_map4.add_child(chat_people)