
## Giulia Mancini


In [19]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, roc_curve, auc
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression


#### a) Load the data ‘global_shark_attacks.csv’ into a pandas dataframe.

Load the dataset

In [20]:
file_path = "/Users/giuliamancini/Desktop/global_shark_attacks.csv"
df = pd.read_csv(file_path)


Print the dataset info

In [21]:
print("\n * Dataset Info:")
df.info()  # info() prints its summary itself and returns None, so it is not wrapped in print()


 * Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6890 entries, 0 to 6889
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   date       6587 non-null   object 
 1   year       6758 non-null   float64
 2   type       6871 non-null   object 
 3   country    6839 non-null   object 
 4   area       6409 non-null   object 
 5   location   6325 non-null   object 
 6   activity   6304 non-null   object 
 7   name       6670 non-null   object 
 8   sex        6318 non-null   object 
 9   age        3903 non-null   object 
 10  fatal_y_n  6890 non-null   object 
 11  time       3372 non-null   object 
 12  species    3772 non-null   object 
dtypes: float64(1), object(12)
memory usage: 699.9+ KB


Preview the first rows of the dataset

In [22]:
print("\n * Dataset Overview:")
print(df.head())


 * Dataset Overview:
         date    year        type    country                area  \
0  2023-05-13  2023.0  Unprovoked  AUSTRALIA     South Australia   
1  2023-04-29  2023.0  Unprovoked  AUSTRALIA   Western Australia   
2  2022-10-07  2022.0  Unprovoked  AUSTRALIA  Western  Australia   
3  2021-10-04  2021.0  Unprovoked        USA             Florida   
4  2021-10-03  2021.0  Unprovoked        USA             Florida   

                                   location      activity                name  \
0                                  Elliston       Surfing    Simon Baccanello   
1                      Yallingup, Busselton      Swimming                male   
2                              Port Hedland  Spearfishing         Robbie Peck   
3  Fort Pierce State Park, St. Lucie County       Surfing  Truman Van Patrick   
4               Jensen Beach, Martin County      Swimming                male   

  sex   age fatal_y_n   time      species  
0   M    46         Y  10h10  White sh

Print how many null values each column has

In [23]:
print("\n * Dataset Null Values:")
print(df.isnull().sum())


 * Dataset Null Values:
date          303
year          132
type           19
country        51
area          481
location      565
activity      586
name          220
sex           572
age          2987
fatal_y_n       0
time         3518
species      3118
dtype: int64


Observation: After loading the data and running an initial analysis, we can see that the dataset has 13 columns and 6890 rows.

The columns are: date, year, type, country, area, location, activity, name, sex, age, fatal_y_n, time, species.

Before proceeding, I would like to clean the data so that the following questions give more accurate results.

Looking at the missing values per column, I decided that:

Activity has 586 missing values. Since this column is our main target variable, I will remove those rows.

Year has 132 missing values. I will drop those rows and convert the column to integer.

Sex has 572 missing values. Since it is important for the analysis, I will fill the missing values with "Unknown". I will do the same for species. Age will be converted to a number and its missing values filled with the median age, because it is going to be relevant later.

Time has too many missing values (3,518 of 6,890), so I will drop the column. I will also drop date and name, which have missing values as well (303 and 220) and are not used in the models below.

Lastly, I will remove duplicates to ensure the data is clean.
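
The decisions above rest on how much of each column is missing. A minimal sketch of how that share could be computed is shown here on a small made-up frame; `toy`, its values, and the 0.5 threshold are assumptions for illustration, not part of the original analysis. The same two steps would apply to `df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "activity": ["Surfing", None, "Swimming", "Fishing"],
    "time": [None, None, "10h10", None],
    "fatal_y_n": ["Y", "N", "N", "N"],
})

# Fraction of missing values per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Columns missing in more than half of the rows are candidates for dropping
# (the 0.5 cut-off is an illustrative choice)
drop_candidates = missing_share[missing_share > 0.5].index.tolist()
print(drop_candidates)
```

Working with shares rather than raw counts makes it easier to compare columns and to apply one cut-off consistently.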

Cleaning the data:

In [24]:
# Step 1: Drop rows where 'activity' is missing (since we are predicting it)
# .copy() gives an independent frame and avoids SettingWithCopyWarning on the assignments below
df_cleaned = df.dropna(subset=['activity']).copy()

# Step 2: Drop rows where 'year' is missing and convert it to integer
df_cleaned = df_cleaned.dropna(subset=['year'])
df_cleaned['year'] = df_cleaned['year'].astype(int)

# Step 3: Fill missing values
df_cleaned['sex'] = df_cleaned['sex'].fillna('Unknown')
df_cleaned['species'] = df_cleaned['species'].fillna('Unknown')

# Convert 'age' to numeric; entries that cannot be parsed become NaN
df_cleaned['age'] = pd.to_numeric(df_cleaned['age'], errors='coerce')
# Fill NaNs with the median age
df_cleaned['age'] = df_cleaned['age'].fillna(df_cleaned['age'].median())

# Step 4: Drop unnecessary columns
df_cleaned = df_cleaned.drop(columns=['date', 'name', 'time'])

# Step 5: Remove duplicates
df_cleaned = df_cleaned.drop_duplicates()

# Step 6: Encode categorical variables
categorical_cols = ['type', 'country', 'area', 'location', 'sex', 'fatal_y_n', 'species']
encoder = LabelEncoder()

for col in categorical_cols:
    df_cleaned[col] = encoder.fit_transform(df_cleaned[col])

print(df_cleaned.head())  # To display the first few rows

   year  type  country  area  location      activity  sex   age  fatal_y_n  \
0  2023     8       10   639      1047       Surfing    2  46.0          5   
1  2023     8       10   766      4050      Swimming    2  24.0          2   
2  2022     8       10   764      2968  Spearfishing    2  38.0          2   
3  2021     8      195   241      1173       Surfing    2  25.0          2   
4  2021     8      195   241      1561      Swimming    2  24.0          2   

   species  
0     1240  
1      161  
2      620  
3     1230  
4     1230  
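
Because the loop above reuses a single `LabelEncoder`, only the mapping fitted on the last column is kept. If the integer codes ever need to be turned back into labels, one option is to keep one encoder per column. The sketch below uses a made-up frame; `toy` and `encoders` are illustrative names, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with two categorical columns
toy = pd.DataFrame({
    "country": ["USA", "AUSTRALIA", "USA"],
    "sex": ["M", "F", "M"],
})

# Fit a separate encoder for each column and remember it by column name
encoders = {}
for col in ["country", "sex"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# The stored encoder maps the integer codes back to the original labels
decoded = encoders["country"].inverse_transform(toy["country"])
print(list(decoded))
```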


#### c) Make a variable for the activity to predict ‘Swimming’ or not (make it a binary variable). Report on the accuracy, and then produce a ROC curve

In [27]:
# Convert 'activity' into binary classification (Swimming vs Not Swimming)
def simplify_activity(activity):
    return "Swimming" if "swim" in activity.lower() else "Not Swimming"

df_cleaned['activity'] = df_cleaned['activity'].apply(simplify_activity)

# Encode target variable
y = encoder.fit_transform(df_cleaned['activity'])

# Define features
X = df_cleaned.drop(columns=['activity'])

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
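
In the cell above the scaler is fitted on all rows before the split, so the test rows influence the scaling statistics. One common alternative, sketched here on synthetic data, is to split first and let a pipeline fit the scaler on the training rows only. The data from `make_classification`, the `stratify` option, and all `*_demo` names are assumptions for illustration, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the shark-attack features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)

# Split first; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only,
# then applies the same transformation to the test rows
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.4f}")
```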

Train models (Logistic Regression, Decision Tree and SVM)

In [28]:
models = {
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000, class_weight='balanced'),
    "Decision Tree": DecisionTreeClassifier(random_state=42, class_weight='balanced'),
    "SVM": SVC(kernel='linear', random_state=42, class_weight='balanced', probability=True)
}

results = {}

for model_name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    classification_rep = classification_report(y_test, y_pred)

    # Store results
    results[model_name] = {
        "Accuracy": accuracy,
        "Classification Report": classification_rep
    }

# Print Results:
for model_name, metrics in results.items():
    print(f"\n{'='*40}\n * {model_name} Results\n{'='*40}")
    print(f" Accuracy: {metrics['Accuracy']:.4f}")
    print("\n Classification Report:\n", metrics["Classification Report"])


 * Logistic Regression Results
 Accuracy: 0.6810

 Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.74      0.78       964
           1       0.34      0.48      0.39       268

    accuracy                           0.68      1232
   macro avg       0.59      0.61      0.59      1232
weighted avg       0.73      0.68      0.70      1232


 * Decision Tree Results
 Accuracy: 0.7330

 Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.85      0.83       964
           1       0.37      0.33      0.35       268

    accuracy                           0.73      1232
   macro avg       0.60      0.59      0.59      1232
weighted avg       0.72      0.73      0.73      1232


 * SVM Results
 Accuracy: 0.7094

 Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.79      0.81       964
           1       0.35      0.

Observation: 

The Decision Tree achieved the highest accuracy (73.3%). Accuracy alone is a weak guide here, though, because the test set is imbalanced (964 ‘Not Swimming’ rows versus 268 ‘Swimming’ rows).

Logistic Regression had the best recall for ‘Swimming’ (48%), meaning it identified more of the actual swimming cases, at the cost of the lowest accuracy (68.1%).

SVM (Support Vector Machine) sat between the two, with 71% accuracy and 41% recall for ‘Swimming’.
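
A majority-class baseline helps put the accuracy figures in context. The sketch below rebuilds only the test-set class counts from the classification reports (964 and 268) and scores a `DummyClassifier` that always predicts the most frequent class. Fitting the dummy on these same labels is a shortcut for illustration; in the notebook it would be fitted on `y_train`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Test-set class counts taken from the classification reports above
y_true = np.array([0] * 964 + [1] * 268)
X_placeholder = np.zeros((len(y_true), 1))  # the dummy model ignores features

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_true)

baseline_accuracy = accuracy_score(y_true, baseline.predict(X_placeholder))
print(f"Majority-class accuracy: {baseline_accuracy:.4f}")
```

This baseline comes out at about 78%, above all three models, which is why recall and AUC are more informative than accuracy for this task.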
    

I needed to run the following steps because plotly wasn't working and had to be installed.

In [40]:
import sys
!{sys.executable} -m pip install plotly

Collecting plotly
  Downloading plotly-6.0.1-py3-none-any.whl.metadata (6.7 kB)
Collecting narwhals>=1.15.1 (from plotly)
  Downloading narwhals-1.31.0-py3-none-any.whl.metadata (11 kB)
Downloading plotly-6.0.1-py3-none-any.whl (14.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.8/14.8 MB 14.7 MB/s eta 0:00:00
Downloading narwhals-1.31.0-py3-none-any.whl (313 kB)
Installing collected packages: narwhals, plotly
Successfully installed narwhals-1.31.0 plotly-6.0.1

[notice] A new release of pip is available: 24.3.1 -> 25.0.1
[notice] To update, run: pip3 install --upgrade pip


In [41]:
import plotly
print(plotly.__version__)


6.0.1


Create the ROC curves for the three models

In [42]:
import plotly.graph_objects as go
from sklearn.metrics import roc_curve, auc

# Create a figure
fig = go.Figure()

for model_name, model in models.items():
    y_probs = model.predict_proba(X_test)[:, 1]  # Get probability scores for class 1
    fpr, tpr, _ = roc_curve(y_test, y_probs)
    roc_auc = auc(fpr, tpr)
    
    fig.add_trace(go.Scatter(
        x=fpr, y=tpr,
        mode='lines',
        name=f"{model_name} (AUC = {roc_auc:.2f})"
    ))

# Plot diagonal line for random guessing
fig.add_trace(go.Scatter(
    x=[0, 1], y=[0, 1],
    mode='lines', line=dict(dash='dash', color="gray"),
    name="Random Guessing"
))

# Customize layout
fig.update_layout(
    title="ROC Curve for Activity Prediction",
    xaxis_title="False Positive Rate",
    yaxis_title="True Positive Rate",
    template="plotly_white"
)

# Show the interactive plot
fig.show()


Observations: 

SVM has the highest AUC (0.67), so it ranks swimming and non-swimming cases slightly better than the other models, although 0.67 is still only moderately above the 0.5 of random guessing.

Logistic Regression is a close second (AUC = 0.66). Since the SVM here uses a linear kernel, both are linear models, so similar results are expected.

The Decision Tree struggles (AUC = 0.59), possibly due to overfitting or a lack of informative features.
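
A gap as small as 0.67 versus 0.66 on a single split may not be stable. One way to check, sketched here on synthetic data, is to compare cross-validated AUC with its spread across folds. The data from `make_classification`, the 5-fold setup, and the names `candidates` and `cv_auc` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the real features and target
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=42
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "SVM": SVC(kernel="linear", class_weight="balanced"),
}

cv_auc = {}
for name, estimator in candidates.items():
    # Scaling happens inside each fold, so no fold sees the other folds' statistics
    pipe = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="roc_auc")
    cv_auc[name] = (scores.mean(), scores.std())
    print(f"{name}: AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the fold-to-fold spread is larger than the gap between two models, the ranking between them should be read with caution.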


#### d) Build 3 different models to predict the age and report on the MSE and the most important variable from the relative importance

Prepare Data for Regression

In [43]:
# Ensure no missing values in 'age'
# (missing ages were already filled with the median during cleaning, so this keeps every row)
df_cleaned = df_cleaned[df_cleaned['age'].notna()]

# Define target variable (Age)
y_age = df_cleaned['age']

# Drop the target and the activity columns; errors='ignore' skips
# 'activity_binary' if that column does not exist
X_age = df_cleaned.drop(columns=['age', 'activity', 'activity_binary'], errors='ignore')

# Standardize features
scaler = StandardScaler()
X_age_scaled = scaler.fit_transform(X_age)

# Split dataset into training and testing sets
X_train_age, X_test_age, y_train_age, y_test_age = train_test_split(X_age_scaled, y_age, test_size=0.2, random_state=42)


Define Regression Models

In [44]:
regression_models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=42, max_depth=10),
    "Random Forest Regressor": RandomForestRegressor(random_state=42, n_estimators=100)
}

Train and evaluate regression models

In [45]:
regression_results = {}

for model_name, model in regression_models.items():
    model.fit(X_train_age, y_train_age)
    y_pred_age = model.predict(X_test_age)

    # Calculate Mean Squared Error (MSE)
    mse = mean_squared_error(y_test_age, y_pred_age)

    # Store results
    regression_results[model_name] = {
        "MSE": mse,
        "Model": model
    }

# Print results
for model_name, metrics in regression_results.items():
    print(f"\n{'='*40}\n * {model_name} - Age Prediction\n{'='*40}")
    print(f" Mean Squared Error (MSE): {metrics['MSE']:.2f}")


 * Linear Regression - Age Prediction
 Mean Squared Error (MSE): 132.89

 * Decision Tree Regressor - Age Prediction
 Mean Squared Error (MSE): 165.27

 * Random Forest Regressor - Age Prediction
 Mean Squared Error (MSE): 140.00


Observations: 

Linear Regression performed best, with the lowest MSE of 132.89. This suggests that a simple linear fit generalizes at least as well as the tree-based models on features such as location, year and species.

Random Forest performed better than the Decision Tree but worse than Linear Regression. Its MSE of 140.00 suggests that it captures some non-linear relationships but may not generalize well.

Lastly, the Decision Tree had the worst performance, with an MSE of 165.27. This probably happened because it overfit the training data, which led to poor generalization.
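
MSE is expressed in squared years, which is hard to read. Taking the square root gives the RMSE in years. The short sketch below does this for the three MSE values reported above; `mse_by_model` and `rmse_by_model` are illustrative names.

```python
import math

# MSE values copied from the output above
mse_by_model = {
    "Linear Regression": 132.89,
    "Decision Tree Regressor": 165.27,
    "Random Forest Regressor": 140.00,
}

# RMSE is the square root of MSE and has the same unit as the target (years)
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}

for name, rmse in rmse_by_model.items():
    print(f"{name}: RMSE = {rmse:.2f} years")
```

All three models are off by roughly 11.5 to 13 years on average in this sense, so the differences between them are small relative to the overall error.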


Feature importance 

In [46]:
feature_importance = regression_results["Random Forest Regressor"]["Model"].feature_importances_
feature_names = X_age.columns

In [48]:
import plotly.graph_objects as go
import numpy as np

# Sort feature importances in descending order
sorted_idx = np.argsort(feature_importance)[::-1]

# Create a Plotly bar chart
fig = go.Figure()

fig.add_trace(go.Bar(
    x=[feature_names[i] for i in sorted_idx],
    y=feature_importance[sorted_idx],
    marker=dict(color="blue"),
    text=feature_importance[sorted_idx],
    textposition="outside"
))

# Customize layout
fig.update_layout(
    title="Feature Importance - Random Forest Regressor for Age Prediction",
    xaxis_title="Feature",
    yaxis_title="Importance Score",
    template="plotly_white",
    xaxis_tickangle=-45
)

# Show the interactive plot
fig.show()


Observations: We use the Random Forest Regressor for age prediction.

Location is the most influential factor in predicting a person's age, with the highest importance score (0.27).

Year is the second most important feature (0.24), suggesting that the time period of the attack is strongly related to age in this model.

Species of the shark is also significant (0.18), indicating that certain shark species might be associated with attacks on people of certain age groups.

Area where the attack happened has a moderate influence (0.14).

Country (0.07) has some predictive power, but it is not as strong as location.

Sex, fatality (yes/no), and type of attack contribute very little to predicting age.

In summary: 

The dominance of location and year suggests that regional and temporal factors strongly influence the age distribution of shark attack victims. Species also matters, which indicates that certain age groups may be more prone to attacks from specific sharks.
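
Impurity-based importances from tree models can favor features with many distinct values, which may apply to label-encoded columns such as location. Permutation importance on held-out data is one way to cross-check the ranking. The sketch below runs on synthetic data from `make_regression`; the data, `n_repeats=10`, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 5 features, only 2 of which drive the target
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, n_informative=2, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle one feature at a time on the test set and measure the drop in score
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)

for idx in perm.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {perm.importances_mean[idx]:.3f} "
          f"+/- {perm.importances_std[idx]:.3f}")
```

If a feature ranks high in `feature_importances_` but low here, its impurity-based score may reflect its number of distinct values more than its predictive value.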

Now let's compare with the other models:

In [49]:
import plotly.graph_objects as go
import numpy as np

# Extract feature importance for each model
feature_importance = {}

# Linear Regression (Use absolute coefficient values)
feature_importance["Linear Regression"] = np.abs(regression_results["Linear Regression"]["Model"].coef_)

# Decision Tree Regressor
feature_importance["Decision Tree"] = regression_results["Decision Tree Regressor"]["Model"].feature_importances_

# Random Forest Regressor
feature_importance["Random Forest"] = regression_results["Random Forest Regressor"]["Model"].feature_importances_

# Normalize values for comparison
max_values = {model: max(importance) for model, importance in feature_importance.items()}
feature_importance = {model: importance / max_values[model] for model, importance in feature_importance.items()}

feature_names = X_age.columns
sorted_idx = np.argsort(feature_importance["Random Forest"])[::-1]  # Sort by Random Forest importance

# Create a Plotly Figure
fig = go.Figure()

# Add bars for each model
for model_name, importance in feature_importance.items():
    fig.add_trace(go.Bar(
        x=[feature_names[i] for i in sorted_idx],
        y=importance[sorted_idx],
        name=model_name
    ))

# Customize layout
fig.update_layout(
    title="Feature Importance Comparison Across Models",
    xaxis_title="Feature",
    yaxis_title="Importance Score (Normalized)",
    barmode="group",
    template="plotly_white"
)

# Show the interactive plot
fig.show()

Observations:

Comparing the 3 models:

Linear Regression relies heavily on "year", as seen in the graph. 

The Decision Tree finds non-linear relationships, emphasizing "area" and "species".

Random Forest is the most balanced, spreading importance across all features.

Year and location are the best predictors of age, as mentioned in the Random Forest analysis. 
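
The comparison above divides each model's scores by that model's maximum, so the top feature always reads 1.0. An alternative, sketched here with made-up numbers, is to divide by the sum so that each model's scores read as shares of a total, the same scale the tree-based importances already use. The values in `raw_scores` are hypothetical, and absolute coefficients and impurity importances remain different quantities even after rescaling.

```python
import numpy as np

# Hypothetical raw scores for three features (not the notebook's values)
raw_scores = {
    "Linear Regression": np.array([2.0, 0.5, 1.5]),  # absolute coefficients
    "Random Forest": np.array([0.5, 0.2, 0.3]),      # impurity importances
}

# Divide by the sum so each model's scores add up to 1
shares = {name: values / values.sum() for name, values in raw_scores.items()}

for name, values in shares.items():
    print(name, np.round(values, 3))
```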

