# Detecting patterns of speciation in the fossil record

In this assignment, we use data from the NOW (New and Old Worlds) database of fossil mammals to study patterns of speciation over time and space. In particular, we want to know when and where speciation rates have been significantly high: which time periods and which places in the history of mammals have given rise to exceptionally high numbers of new species. This phenomenon is known in the evolutionary literature as the “species factory”. Palaeontologists are interested in why and in what ways those times and places are special. The role of computational science is to identify and characterize such times and places.
We practice using pandas DataFrames, performing logistic regression, and testing for statistical significance in data analysis.

In [None]:
import pandas as pd
import numpy as np

Exercise 2. Create a pandas DataFrame that contains all of the data
and save it as a csv file. How many rows does the DataFrame contain?

In [None]:
df = pd.read_csv("now.txt", sep=',')

num_rows = len(df)
print(f"The DataFrame contains {num_rows} rows.")
df.to_csv("data.csv", index=False)

Exercise 3. a) Remove all rows where LAT = LONG = 0; these occurrences have incorrect coordinates. Drop rows where SPECIES is “sp.” or
“indet.”; these occurrences have not been properly identified.

In [None]:
condition_lat_long_zero = (df['LAT'] == 0) & (df['LONG'] == 0)
df = df[~condition_lat_long_zero]
        
print(f"Number of rows after removing LAT=0 & LONG=0: {len(df)}")

species_to_remove = ["sp.", "indet."]
condition_unidentified_species = df['SPECIES'].isin(species_to_remove)
df = df[~condition_unidentified_species]

print(f"Number of rows after removing unidentified species ('sp.', 'indet.'): {len(df)}")

b) Next we will assign each occurrence to a specific Mammal Neogene (MN) time unit. Table 1 shows the time boundaries of each time unit. Assign each occurrence to the correct time unit based on the mean of MIN AGE and MAX AGE. If the mean age of an occurrence is precisely on the boundary between two time units, assign the occurrence to the older time unit. If the mean age of an occurrence falls outside the MN time interval, assign it to a “pre-MN” or “post-MN” category.

In [None]:
mn_boundaries = {
    "MN1": (23.0, 21.7),
    "MN2": (21.7, 19.5),
    "MN3": (19.5, 17.2),
    "MN4": (17.2, 16.4),
    "MN5": (16.4, 14.2),
    "MN6": (14.2, 12.85),
    "MN7-8": (12.85, 11.2),
    "MN9": (11.2, 9.9),
    "MN10": (9.9, 8.9),
    "MN11": (8.9, 7.6),
    "MN12": (7.6, 7.1),
    "MN13": (7.1, 5.3),
    "MN14": (5.3, 5.0),
    "MN15": (5.0, 3.55),
    "MN16": (3.55, 2.5),
    "MN17": (2.5, 1.9),
    "MQ18": (1.9, 0.85), 
    "MQ19": (0.85, 0.01) 
}

overall_start_age = mn_boundaries["MN1"][0]
overall_end_age = mn_boundaries["MQ19"][1]

def assign_mn_unit(row):
    """Assign an occurrence to an MN unit based on the mean of MIN AGE and MAX AGE."""
    # Rows with a missing age cannot be placed in any time unit.
    if pd.isna(row['MIN AGE']) or pd.isna(row['MAX AGE']):
        return 'unknown_age'
    mean_age = (row['MIN AGE'] + row['MAX AGE']) / 2.0
    # A mean age exactly on the oldest boundary goes to the older side (pre-MN),
    # following the "assign to the older time unit" rule.
    if mean_age >= overall_start_age:
        return 'pre-MN'
    if mean_age < overall_end_age:
        return 'post-MN'
    # Half-open intervals [end_age, start_age): a boundary age falls in the older unit.
    for unit, (start_age, end_age) in mn_boundaries.items():
        if end_age <= mean_age < start_age:
            return unit
    # Should be unreachable; kept as a safety net for the checks below.
    return 'assignment_error'

df['MIN AGE'] = pd.to_numeric(df['MIN AGE'], errors='coerce')
df['MAX AGE'] = pd.to_numeric(df['MAX AGE'], errors='coerce')
df['MN_Unit'] = df.apply(assign_mn_unit, axis=1)
print("First 5 rows with assigned MN Unit:")
print(df[['LIDNUM', 'MIN AGE', 'MAX AGE', 'MN_Unit']].head())
print("\nCounts per MN Unit:")
print(df['MN_Unit'].value_counts().sort_index())
print("\nChecking for assignment issues:")
print(df[df['MN_Unit'] == 'unknown_age'][['LIDNUM', 'MIN AGE', 'MAX AGE']].head())
print(df[df['MN_Unit'] == 'assignment_error'][['LIDNUM', 'MIN AGE', 'MAX AGE']].head())


c) Sometimes expert knowledge may be used to override some of the
information recorded in the data. In our case, experts in palaeontology
tell us that occurrences in the localities “Samos Main Bone Beds” and
“Can Llobateres I” should be assigned to time units MN12 and MN9,
respectively. Check these and if necessary, edit the time units to their
correct values.
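
One way to approach this check-and-edit step is sketched below on a toy DataFrame. It assumes that locality names are stored in a `NAME` column and that the time unit is in the `MN_Unit` column created in part b; the rows themselves are invented for illustration.

```python
import pandas as pd

# Toy occurrences; the rows are made up for illustration only.
toy = pd.DataFrame({
    "NAME": ["Samos Main Bone Beds", "Can Llobateres I", "Some Other Locality"],
    "MN_Unit": ["MN13", "MN9", "MN5"],
})

# Expert-assigned time units, taken from the exercise text.
expert_units = {"Samos Main Bone Beds": "MN12", "Can Llobateres I": "MN9"}

for name, unit in expert_units.items():
    in_locality = toy["NAME"] == name
    # Check: how many rows currently disagree with the expert value?
    n_wrong = (in_locality & (toy["MN_Unit"] != unit)).sum()
    print(f"{name}: {n_wrong} row(s) to correct")
    # Edit: overwrite the time unit for every occurrence in the locality.
    toy.loc[in_locality, "MN_Unit"] = unit
```

Printing the mismatch count before overwriting keeps a record of whether the expert correction actually changed anything.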


d) We need to be able to identify all occurrences of each species. Assign a unique identification number to each unique combination of GENUS and SPECIES. Create a new column in the DataFrame and label each occurrence with the corresponding species identification number.
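
A minimal sketch of one possible labelling, using pandas `groupby(...).ngroup()`. The column name `SPECIES_ID` and the placeholder taxon names are assumptions made for this example only.

```python
import pandas as pd

# Placeholder taxon names; not real data.
toy = pd.DataFrame({
    "GENUS":   ["GenusA", "GenusA", "GenusB", "GenusA"],
    "SPECIES": ["alpha",  "beta",   "alpha",  "alpha"],
})

# ngroup() gives every distinct (GENUS, SPECIES) pair its own integer label,
# so rows 0 and 3 (same genus and species) receive the same ID.
toy["SPECIES_ID"] = toy.groupby(["GENUS", "SPECIES"]).ngroup()
print(toy)
```

Grouping on both columns matters because the same species epithet can appear under different genera.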

e) Each locality should contain no more than one occurrence of any
species. Check whether this is the case and remove duplicate copies, if
necessary.
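
The duplicate check could look roughly like the following, assuming localities are identified by `LIDNUM` and species by a hypothetical `SPECIES_ID` column as in part d. The toy rows are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "LIDNUM":     [1, 1, 1, 2],
    "SPECIES_ID": [10, 10, 11, 10],
})

# Check: does any (locality, species) pair occur more than once?
is_dup = toy.duplicated(subset=["LIDNUM", "SPECIES_ID"])
print(f"Duplicate occurrences found: {is_dup.sum()}")

# Keep the first copy of each pair and drop the rest.
deduped = toy.drop_duplicates(subset=["LIDNUM", "SPECIES_ID"], keep="first")
print(f"Rows before: {len(toy)}, after: {len(deduped)}")
```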

f) How many rows are we left with in the DataFrame (compare with
exercise 2)? How many unique species and localities are identified?

Exercise 4. Create a DataFrame that shows for each species how many
occurrences it has in each time unit. Then, create a different DataFrame
that shows for each species the time unit when it is first observed (i.e.
the oldest time unit). For each time unit, calculate the proportion of first
occurrences to all occurrences. Plot the proportion of first occurrences
over time. Also, plot the total number of occurrences over time.
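
The tables described above can be sketched as follows. This example assumes that a "first occurrence" is any occurrence of a species in the oldest time unit in which that species is observed, and it uses hypothetical column names (`SPECIES_ID`, `IS_FIRST`) and a three-unit toy data set.

```python
import pandas as pd

# Time units ordered oldest -> youngest; only three are used in this toy example.
unit_order = ["MN1", "MN2", "MN3"]

toy = pd.DataFrame({
    "SPECIES_ID": [0, 0, 1, 1, 2, 2],
    "MN_Unit":    ["MN1", "MN2", "MN2", "MN3", "MN3", "MN3"],
})

# Species x time-unit table of occurrence counts.
counts = pd.crosstab(toy["SPECIES_ID"], toy["MN_Unit"]).reindex(
    columns=unit_order, fill_value=0
)

# Oldest unit with at least one occurrence: idxmax returns the first
# column (in unit_order) where the condition is True.
first_unit = (counts > 0).idxmax(axis=1).rename("FIRST_UNIT")

# Flag occurrences that fall in their species' first unit, then aggregate per unit.
toy["IS_FIRST"] = toy["MN_Unit"] == toy["SPECIES_ID"].map(first_unit)
per_unit = (
    toy.groupby("MN_Unit")["IS_FIRST"]
       .agg(N_FIRST="sum", N_ALL="count")
       .reindex(unit_order)
)
per_unit["PROPORTION"] = per_unit["N_FIRST"] / per_unit["N_ALL"]
print(per_unit)
```

Because `per_unit` is indexed in chronological order, its `PROPORTION` and `N_ALL` columns can be passed directly to a line or bar plot.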

Exercise 5. a) Create a DataFrame that collects the following information for every locality: locality number (LIDNUM), longitude, latitude,
time unit, number of first occurrences in the locality, number of all occurrences in the locality and proportion of first occurrences in the locality. Note, you should use LIDNUM to identify unique localities and not the NAME variable (why?).
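
A possible shape for the locality table is sketched below. It assumes each occurrence row already carries a boolean `IS_FIRST` flag (a hypothetical column marking first occurrences) and that coordinates and time unit are constant within a locality; the toy values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "LIDNUM":   [1, 1, 2, 2, 2],
    "LONG":     [10.0, 10.0, 12.5, 12.5, 12.5],
    "LAT":      [45.0, 45.0, 47.0, 47.0, 47.0],
    "MN_Unit":  ["MN2", "MN2", "MN3", "MN3", "MN3"],
    "IS_FIRST": [True, False, True, True, False],
})

# One row per locality: coordinates and time unit are taken from the first
# occurrence, counts are aggregated over all occurrences in the locality.
localities = (
    toy.groupby("LIDNUM")
       .agg(
           LONG=("LONG", "first"),
           LAT=("LAT", "first"),
           MN_Unit=("MN_Unit", "first"),
           N_FIRST=("IS_FIRST", "sum"),
           N_ALL=("IS_FIRST", "count"),
       )
       .reset_index()
)
localities["PROP_FIRST"] = localities["N_FIRST"] / localities["N_ALL"]
print(localities)
```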

b) Visualize the distribution of localities in space and time. For each time unit, plot the LAT and LONG coordinates of each locality (corresponding to the time unit). For example, you can use the code above to create a geographic map and then use a standard matplotlib scatter plot to add the localities. Choose the marker size for each locality so that it scales with the number of occurrences in the locality (bigger markers for bigger localities).

c) Based on exercises 4 and 5, what kind of observations about sampling can you make? Are there differences in sampling density over space and time? Compare some basic sampling properties between Africa, Asia and Europe, e.g. spatial coverage and average number of occurrences per locality.

Exercise 6. For each locality, look at a ten-by-ten-degree area (in latitude and longitude) centered on the locality. Record the total number of occurrences and the total number of first occurrences found within that square in the time unit of the focal locality. Also record the total number of occurrences within that square in the preceding time unit (relative to the focal locality). Add these numbers as new columns to the DataFrame that was created in exercise 5.
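
The focal-area counts can be sketched as below. The example assumes a locality table with hypothetical columns `N_ALL` and `N_FIRST`, treats the square as the locality's coordinates plus or minus 5 degrees (edges included), counts the focal locality itself, and ignores longitude wrap-around. The toy values are invented.

```python
import pandas as pd

unit_order = ["MN1", "MN2", "MN3"]  # oldest -> youngest (toy subset)
# Map each unit to the one before it; the oldest unit has no predecessor.
previous_unit = {unit: prev for prev, unit in zip(unit_order, unit_order[1:])}

localities = pd.DataFrame({
    "LIDNUM":  [1, 2, 3, 4],
    "LONG":    [10.0, 12.0, 30.0, 11.0],
    "LAT":     [45.0, 47.0, 45.0, 46.0],
    "MN_Unit": ["MN2", "MN2", "MN2", "MN1"],
    "N_ALL":   [4, 6, 5, 3],
    "N_FIRST": [1, 2, 0, 3],
})

def focal_counts(row, table, half_width=5.0):
    """Sum occurrences inside the 10 x 10 degree square centred on one locality."""
    in_square = (
        (table["LONG"] - row["LONG"]).abs().le(half_width)
        & (table["LAT"] - row["LAT"]).abs().le(half_width)
    )
    same = in_square & (table["MN_Unit"] == row["MN_Unit"])
    prev_name = previous_unit.get(row["MN_Unit"])
    if prev_name is None:
        prev_all = 0  # no preceding unit in the table
    else:
        prev_all = table.loc[in_square & (table["MN_Unit"] == prev_name), "N_ALL"].sum()
    return pd.Series({
        "FOCAL_ALL": table.loc[same, "N_ALL"].sum(),
        "FOCAL_FIRST": table.loc[same, "N_FIRST"].sum(),
        "FOCAL_PREV_ALL": prev_all,
    })

focal = localities.apply(focal_counts, axis=1, table=localities)
localities = pd.concat([localities, focal], axis=1)
print(localities)
```

This compares every locality with every other one, which is simple but quadratic in the number of localities; that is usually acceptable for a table of a few thousand rows.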

Exercise 7. a) Create the regression data set. Only use localities within the coordinates -25 < LONG < 40 and LAT > 35 and with a time unit within MN2-MQ19 (why not include MN1?). Create an m × 2 array, where m is the total number of occurrences in all the localities. Each row in the array represents one occurrence. For each occurrence, fill in the first column of the array with the number of occurrences in the focal area in the previous time unit (calculated in exercise 6). In the second column, fill in 1 for a first occurrence and 0 for other occurrences.
b) Perform logistic regression.
c) Plot the regression curve and its 95% confidence intervals.
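
The sketch below illustrates the three steps on simulated data. To stay self-contained it fits the logistic model with a hand-written Newton-Raphson loop in NumPy instead of a library routine, and it builds the 95% band as a Wald interval on the linear predictor; both are choices made for this example, not something the exercise prescribes. All function names and the simulated coefficients are hypothetical.

```python
import numpy as np

def build_regression_array(prev_counts, n_all, n_first):
    """One row per occurrence: (focal occurrences in previous unit, first-occurrence flag)."""
    rows = []
    for prev, n, k in zip(prev_counts, n_all, n_first):
        rows += [(prev, 1)] * k + [(prev, 0)] * (n - k)
    return np.array(rows, dtype=float)

def fit_logistic(x, y, n_iter=25):
    """Fit P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton-Raphson."""
    X = np.column_stack([np.ones(len(x)), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        info = X.T @ (X * (p * (1.0 - p))[:, None])  # Fisher information
        beta = beta + np.linalg.solve(info, X.T @ (y - p))
    # Covariance of the estimates: inverse information at the final estimate.
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    cov = np.linalg.inv(X.T @ (X * (p * (1.0 - p))[:, None]))
    return beta, cov

def curve_with_ci(x_grid, beta, cov, z=1.96):
    """Fitted curve with a Wald interval computed on the linear predictor."""
    X = np.column_stack([np.ones(len(x_grid)), x_grid])
    eta = X @ beta
    se = np.sqrt(np.einsum("ij,jk,ik->i", X, cov, X))
    expit = lambda t: 1.0 / (1.0 + np.exp(-t))
    return expit(eta), expit(eta - z * se), expit(eta + z * se)

# Tiny illustration of the array layout (numbers invented).
example = build_regression_array(prev_counts=[3, 0], n_all=[2, 3], n_first=[1, 0])

# Simulated data standing in for the real m x 2 array.
rng = np.random.default_rng(0)
x = rng.integers(0, 200, size=500).astype(float)
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-(0.5 - 0.01 * x)))).astype(float)

beta, cov = fit_logistic(x, y)
x_grid = np.linspace(0, 200, 50)
mid, lower, upper = curve_with_ci(x_grid, beta, cov)
print("Estimated intercept and slope:", beta)
```

The arrays `mid`, `lower` and `upper` can be drawn against `x_grid` with matplotlib's `plot` and `fill_between`. Building the interval on the linear predictor and transforming it afterwards keeps the band inside [0, 1].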

Exercise 8. For each European locality, calculate the expected proportion of first occurrences in the focal area surrounding the locality using
the logistic regression calculated in exercise 7.
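
Given fitted coefficients, the expected proportion is the logistic function evaluated at each locality's count from the previous time unit. The sketch below uses hypothetical coefficient values `b0` and `b1` as stand-ins for the fit from exercise 7.

```python
import numpy as np

def expected_first_proportion(prev_counts, b0, b1):
    """Logistic model: expected probability that an occurrence is a first occurrence."""
    prev_counts = np.asarray(prev_counts, dtype=float)
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * prev_counts)))

# Hypothetical coefficients, standing in for the fit from exercise 7.
b0, b1 = 0.5, -0.01

# Occurrence counts in the focal area in the previous time unit (invented values).
expected = expected_first_proportion([0, 50, 200], b0, b1)
print(expected)
```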

Exercise 9. For each European locality, calculate the probability of observing as many or more first occurrences in the focal area than are actually found. Assume that the number of first occurrences is binomially distributed: each occurrence is either a “first occurrence” or “not a first occurrence”, and the probability that a given occurrence is a first occurrence equals the expected proportion of first occurrences in the focal area. You may use, for example, scipy.stats.binom (https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.binom.html) for the calculations.
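
The quantity asked for is the upper tail P(X ≥ k) of a Binomial(n, p) distribution, where n is the number of occurrences in the focal area, k the observed number of first occurrences and p the expected proportion. The sketch below sums the probability mass function directly with the standard library; the function name is hypothetical. With SciPy the same value is given by `scipy.stats.binom.sf(k - 1, n, p)`.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed directly from the pmf."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))

# Example: 8 or more first occurrences among 10 occurrences when p = 0.5.
p_value = prob_at_least(8, 10, 0.5)
print(p_value)  # (45 + 10 + 1) / 1024 = 0.0546875
```

Note the `k - 1` in the SciPy form: the survival function is P(X > x), so the observed count itself is only included when the argument is shifted down by one. The direct sum is meant for small n; for focal areas with thousands of occurrences the binomial coefficients become too large to convert to floats, and the SciPy routine is the safer choice.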

Exercise 10. For each time unit, plot the localities on a map covering the coordinates defined in exercise 7a and indicate their significance level with a continuous color scale. Highlight localities that have a p-value less than 0.05 (i.e. the probability of the observations is less than 0.05). Describe briefly the overall patterns that you observe.
