## 3.2 De-anonymising a dataset – 50 marks
For this task, we use a dataset provided by our colleagues. The dataset has been anonymised using Bayesian inference.
- Dataset:  
- Credits: 

### 3.2.1 Using standard search mechanisms, determine whether any elements within the dataset you received allow de-anonymisation to occur. Make a note of what you find and explain the procedure you used.

In [None]:
import pandas as pd
from googlesearch import search
from bs4 import BeautifulSoup
import requests

# Load your dataset
df = pd.read_csv('police_shooting_anonymized.csv')



# List to store the names
names = []

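# Iterate over each record (row) in the dataset.
# Note: iterating over df directly would yield column names, not rows.
for _, record in df.iterrows():
    query = f"{record['city']} {record['location']} {record['manner_of_death']} police shooting"
    # Perform a web search for each query
    for url in search(query, num=10, stop=10, pause=2):
        # Fetch the content of the search result; skip pages that fail to load
        try:
            page = requests.get(url, timeout=10)
            page.raise_for_status()
        except requests.RequestException:
            continue
        soup = BeautifulSoup(page.content, 'html.parser')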
        
        # Here you would have logic to parse the soup object and extract names
        # This is highly dependent on the structure of the web page
        # For example:
        # name_tags = soup.find_all('h1', class_='name')
        # for tag in name_tags:
        #     names.append(tag.text.strip())

# Print or save the extracted names
print(names)
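
Before running web searches, it can help to check which records are unique on a handful of columns, since those are the ones a search is most likely to pin down. The following is a minimal sketch, not part of the original procedure: the helper `uniqueness_report` and the toy rows are hypothetical, and the column names are assumed to match those used in the cells of this notebook.

```python
import pandas as pd

def uniqueness_report(df, quasi_identifiers):
    """Annotate each row with the number of rows sharing its quasi-identifier values."""
    group_sizes = (
        df.groupby(quasi_identifiers, dropna=False)
          .size()
          .rename("group_size")
          .reset_index()
    )
    # A left merge keeps the original row order of df
    return df.merge(group_sizes, on=quasi_identifiers, how="left")

# Toy rows for illustration only (made up, not taken from the dataset)
toy = pd.DataFrame({
    "city": ["A", "A", "B", "C"],
    "manner_of_death": ["shot", "shot", "shot", "shot and Tasered"],
    "date": ["2015-01-02", "2015-01-02", "2015-01-03", "2015-01-04"],
})

report = uniqueness_report(toy, ["city", "manner_of_death", "date"])
unique_rows = report[report["group_size"] == 1]
print(f"{len(unique_rows)} of {len(report)} records are unique on the chosen columns")
```

Rows with `group_size == 1` are the natural first candidates for the search queries above.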

### 3.2.2 Design a de-anonymisation algorithm and apply it to both the received dataset and your dataset. Report on the following:

In [8]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
import numpy as np

# Load dataset from CSV file
file_path = 'police_shooting_anonymized.csv'  
df = pd.read_csv(file_path)

# Convert 'date' column to a numerical feature
df['date'] = pd.to_datetime(df['date'])
df['days_since'] = (df['date'] - df['date'].min()).dt.days

# Select columns to be one-hot encoded
categorical_cols = ['city', 'manner_of_death']

# One-hot encoding
encoder = OneHotEncoder()
encoded_features = encoder.fit_transform(df[categorical_cols]).toarray()
encoded_feature_names = encoder.get_feature_names_out(categorical_cols)
encoded_df = pd.DataFrame(encoded_features, columns=encoded_feature_names)

# Combine encoded features with the numerical 'days_since' column
final_df = pd.concat([encoded_df, df['days_since']], axis=1)

# PCA and StandardScaler in a Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2))  # Adjust n_components as needed
])

# Fit and transform the data
pca_features = pipeline.fit_transform(final_df)

# Calculate the Euclidean distance from the origin for each point
distances = np.linalg.norm(pca_features, axis=1)

# Determine a threshold for outliers
threshold = np.mean(distances) + 2 * np.std(distances)

# Identify outliers
outliers = distances > threshold

# Retrieve the entire corresponding rows of the dataset for the outliers
outlier_indices = np.where(outliers)[0]
outlier_rows = df.iloc[outlier_indices].copy()  # Use .copy() to avoid SettingWithCopyWarning

# Add the outlier distances to these rows
outlier_rows['outlier_distance'] = distances[outliers]

# Export the outlier rows with distances to a new CSV file
outlier_rows.to_csv('outliers_with_distances.csv', index=False)

print("Outlier rows with distances have been exported to 'outliers_with_distances.csv'.")


Outlier rows with distances have been exported to 'outliers_with_distances.csv'.
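
The outlier step above only flags unusual records; attaching an identity to them needs a second, identified source. The sketch below shows one possible linkage step under stated assumptions: the helper `link_records`, the `record_id` column, the reference table, and its `name` column are all hypothetical, and exact matching on `city` and `date` is an illustrative choice rather than the method used in this notebook.

```python
import pandas as pd

def link_records(anonymized, reference, keys):
    """Return anonymized rows that match exactly one reference row on `keys`."""
    # Give every anonymized row a stable id so matches can be counted per record
    anonymized = anonymized.assign(record_id=range(len(anonymized)))
    merged = anonymized.merge(reference, on=keys, how="inner")
    # Number of candidate identities found for each anonymized record
    match_counts = merged.groupby("record_id")["record_id"].transform("size")
    # Keep only unambiguous links (a single candidate)
    return merged[match_counts == 1]

# Toy data for illustration only (made up)
anon = pd.DataFrame({
    "city": ["A", "B", "C"],
    "date": ["2015-01-02", "2015-01-03", "2015-01-04"],
})
ref = pd.DataFrame({
    "city": ["A", "A", "B"],
    "date": ["2015-01-02", "2015-01-02", "2015-01-03"],
    "name": ["person_1", "person_2", "person_3"],
})

linked = link_records(anon, ref, ["city", "date"])
print(linked)
```

In the toy data, the first record has two candidates and is left unlinked, the second has exactly one, and the third has none. The same function could be called on `outlier_rows` or on any other dataset with matching key columns.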
