To analyze duplicates, redundancy, and last-modified data for multimedia files using Python, data science tooling, and probabilistic matching (as mentioned in your requirements), we need a structured solution pipeline. It has the following components:

### 1. **Project Objective and Hypothesis**
   - **Objective**: Identify and resolve duplicate or redundant multimedia files based on metadata attributes (like file size, creation date, file type, and content). Ensure the last modified data is captured correctly.
   - **Hypothesis**: Duplicate or similar files can be identified based on probabilistic matching and metadata attributes, even if file names or other attributes are slightly altered. This reduces storage footprint and optimizes data archival on magnetic tapes.

### 2. **Data Sources**
   The data will likely come from multimedia sources like Box, OneDAM, or a custom digital asset management (DAM) system. You'll have metadata available in formats like the ones you uploaded.
   
   - **Metadata** will contain:
     - File names, types, sizes, folder locations, timestamps (created, modified), etc.
     - Content signatures like SHA1 hashes (to match exact duplicates).
     - Associated tags or information about the file.

### 3. **Solution Pipeline**

#### Step 1: **Data Collection**
   - **APIs**: Integrate with Box, OneDAM, or other storage systems to fetch metadata using APIs (as described in your project).
   - Use libraries like:
     - `requests` or `httpx` for API calls.
     - `pymongo` for MongoDB (since MongoDB is mentioned in your database design).

#### Step 2: **Data Cleaning and Preprocessing**
   - Normalize metadata fields like `file_name`, `file_type`, `created_date`, etc., to a uniform structure.
   - Handle missing or inconsistent metadata by imputing missing values or discarding unnecessary records.
   - Convert `created_date` and `modified_date` to comparable formats using `pandas` or `datetime` libraries.
   - Convert text fields (like file names) to lowercase and strip extra spaces for better matching.
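
The cleaning rules above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual preprocessing: the function name `normalize_record` is hypothetical, timestamps are assumed to be ISO 8601 strings, and treating timezone-less timestamps as UTC is an assumption you should check against your sources.

```python
from datetime import datetime, timezone


def normalize_record(record: dict) -> dict:
    """Return a copy with a cleaned file name and UTC-aware timestamps."""
    cleaned = dict(record)
    # Lowercase the name and collapse leading, trailing, and repeated whitespace.
    cleaned["file_name"] = " ".join(record["file_name"].lower().split())
    for field in ("created_date", "modified_date"):
        raw = record.get(field)
        if not raw:
            cleaned[field] = None  # keep missing values explicit
            continue
        # datetime.fromisoformat() before Python 3.11 rejects a trailing "Z".
        parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        if parsed.tzinfo is None:
            # Assumption: timestamps without an offset are already UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        cleaned[field] = parsed.astimezone(timezone.utc)
    return cleaned


example = normalize_record({
    "file_name": "  Video1_FINAL.MP4 ",
    "created_date": "2023-10-01T12:34:56Z",
    "modified_date": "2024-01-01T14:34:56+02:00",
})
print(example["file_name"], example["modified_date"].isoformat())
```

Converting everything to UTC up front means later comparisons of `modified_date` do not depend on which system exported the record.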

#### Step 3: **Duplicate Detection and Probabilistic Matching**
   Use fuzzy matching and probabilistic techniques to identify near-duplicate or exact duplicate files.
   
   - **Exact Matching**: Use hash values like `SHA1` from the metadata to identify exact duplicates.
   - **Near-Duplicate/Probabilistic Matching**: Use a combination of:
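     - **Fuzzy Matching**: Use libraries like `fuzzywuzzy` (now maintained as `thefuzz`) or `rapidfuzz` to detect similar file names. String similarity works on text fields only; it says nothing about whether the media content itself is similar.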
     - **Record Linkage**: Use `splink` or `dedupe` for probabilistic record linkage.
     - **Attributes**: Compare metadata fields like `file_name`, `file_size`, `created_date`, `parentFolder`, etc., to identify duplicates.
     - **Example**:
       ```python
       from fuzzywuzzy import fuzz
       ratio = fuzz.ratio("filename1.mp4", "filename2.mp4")
       ```
       For large datasets, prefer more scalable libraries like `splink`.
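
Exact matching is the cheapest step and is worth running before any fuzzy comparison. A minimal sketch, assuming each metadata record is a dict with `file_id` and `sha1` keys (the helper name and the sample hash values are made up for illustration):

```python
from collections import defaultdict


def group_exact_duplicates(records):
    """Map each SHA-1 shared by two or more records to their file_ids."""
    by_hash = defaultdict(list)
    for rec in records:
        sha1 = rec.get("sha1")
        if sha1:  # records without a hash cannot be matched exactly
            by_hash[sha1.lower()].append(rec["file_id"])
    # Keep only hashes that occur more than once.
    return {h: ids for h, ids in by_hash.items() if len(ids) > 1}


records = [
    {"file_id": 1, "file_name": "intro.mp4", "sha1": "AAA111"},
    {"file_id": 2, "file_name": "intro_copy.mp4", "sha1": "aaa111"},
    {"file_id": 3, "file_name": "outro.mp4", "sha1": "BBB222"},
    {"file_id": 4, "file_name": "no_hash.mp4", "sha1": None},
]
duplicates = group_exact_duplicates(records)
print(duplicates)
```

Only the records left ungrouped here need to go through the slower probabilistic step.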

#### Step 4: **Data Storage and Post-Processing**
   - Store metadata in **MongoDB** (as specified in the task list).
   - Store deduplication analysis and tagged data for files in MongoDB.
   - Structure your MongoDB collections similar to the structure in your uploaded file, with fields for `file_id`, `file_name`, `file_size`, etc.
   - Use MongoDB queries to retrieve and process data efficiently.

#### Step 5: **Archiving to Magnetic Tape**
   - Once duplicates and redundant files are detected, generate a final list of unique multimedia files.
   - Archive these files to magnetic tapes for long-term storage using tools that support tape archives, such as **LTFS** (Linear Tape File System).

### 4. **Libraries to Use**
   - **API and Data Handling**:
     - `requests`, `httpx`, `pymongo`, `boto3` (if using AWS S3)
     - `pandas` for data wrangling and manipulation.
   - **Deduplication and Matching**:
     - `fuzzywuzzy` or `rapidfuzz` for text matching.
     - `splink` or `dedupe` for probabilistic record linkage.
   - **Data Analysis**:
     - `numpy`, `scipy` for statistical analysis.
     - `matplotlib`, `seaborn`, `plotly` for visualization (if insights are needed).
   - **Archiving**:
     - **LTFS** (Linear Tape File System) for magnetic tape archival. LTFS is a tape file-system format with vendor tooling, not a Python library.

### 5. **Integration with Box/OneDAM**
   As per your uploaded project, integrate the system with Box and OneDAM through their respective APIs to retrieve metadata and store results back into the system.

   - Use Box APIs for fetching files and metadata.
   - Schedule tasks for periodic metadata fetching and storing them in MongoDB.
   
### 6. **Performance Tuning**
   - Optimize the system by using efficient data structures and reducing memory footprint in data processing.
   - Use batch processing for handling large metadata volumes.
   - Profiling tools like **cProfile** or **memory_profiler** can be helpful.
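
Batch processing can be as simple as consuming the metadata source in fixed-size chunks. The helper below is a sketch (the name `batched` and the batch size are illustrative choices); it works on any iterable, so it could be fed a database cursor instead of the `range` used here.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of at most batch_size items without loading everything at once."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, batch_size)):
        yield batch


for batch in batched(range(5), 2):
    print(batch)
```

Because the generator pulls items lazily, memory use is bounded by the batch size rather than by the size of the collection.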

### 7. **End-to-End Workflow**
   - **API integration** → **Data Cleaning** → **Fuzzy Matching and Deduplication** → **Data Storage** (MongoDB) → **Post-processing** → **Archiving to Tape**.

#### Example Code Outline:
```python
import pandas as pd
from fuzzywuzzy import fuzz
from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient('mongodb://localhost:27017/')
db = client['media_db']
collection = db['media_metadata']

# Fetch data from MongoDB
metadata_df = pd.DataFrame(list(collection.find()))

# Example of applying fuzzy matching
def find_similar_files(df):
    df['similar_files'] = df.apply(lambda row: [fuzz.ratio(row['file_name'], x) for x in df['file_name']], axis=1)
    return df

# Apply deduplication
metadata_df = find_similar_files(metadata_df)
```
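
The outline above compares every file name with every other one, which grows quadratically with the number of files. A common remedy is blocking: only compare records that share a cheap key. The sketch below blocks on file extension and uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzzywuzzy`; the function name, the blocking key, and the 0.85 threshold are all assumptions to adapt.

```python
import os
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similar_name_pairs(names, threshold=0.85):
    """Return (name_a, name_b, score) for similar names within the same extension block."""
    blocks = defaultdict(list)
    for name in names:
        extension = os.path.splitext(name)[1].lower()
        blocks[extension].append(name)

    pairs = []
    for block in blocks.values():
        # Pairwise comparison happens only inside each block.
        for a, b in combinations(block, 2):
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs


names = ["holiday_video.mp4", "holiday_video1.mp4", "holiday_video.mov", "budget.xlsx"]
pairs = similar_name_pairs(names)
print(pairs)
```

Blocking trades recall for speed: `holiday_video.mov` is never compared with the `.mp4` files here, so pick a key that true duplicates are very likely to share.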

### 8. **Testing and Validation**
   - Unit testing: Ensure each component works as expected (e.g., fuzzy matching, MongoDB storage).
   - Validate the matching engine using a small dataset with known duplicates.
  
---

This solution pipeline focuses on the metadata analysis, deduplication, and archiving process while integrating with storage systems and handling large multimedia datasets effectively.

To design a project pipeline for analyzing duplicates, redundant files, and last-modified data for multimedia using techniques like probabilistic matching, deduplication, fuzzy matching, and record linkage, I’ll suggest an approach and outline the steps below. The final goal is to store cleaned and deduplicated data on magnetic tape, which is often used for archiving purposes.

### 1. **Project Approach**

#### a) **Problem Definition**
   - **Objective**: Analyze multimedia files to identify duplicates, redundant data, and determine the most recent versions of files.
   - **Dataset**: Multimedia metadata, including details like file names, file sizes, creation and modification dates, etc.
   - **Challenges**:
     - Inconsistent or slightly different metadata for duplicates.
     - Identifying multimedia file duplicates across different folders/locations.
     - Ensuring data integrity for archiving on magnetic tape.
   
#### b) **Hypothesis**
   - **Duplicates**: Files with similar metadata such as file size, creation/modification date, and other properties likely represent the same multimedia object.
   - **Redundant Data**: Files that are unused or obsolete, potentially older versions of updated files.
   - **Last Modified**: The latest version of a file can be determined by analyzing metadata.
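
The last-modified hypothesis translates into a small selection step once records have been grouped. In this sketch, `group_id` is a hypothetical field assigned by whatever matching step ran earlier, and `modified_date` is assumed to be a `datetime` already.

```python
from datetime import datetime


def latest_per_group(records, key="group_id"):
    """Keep, for each group, the record with the newest modified_date."""
    latest = {}
    for rec in records:
        current = latest.get(rec[key])
        if current is None or rec["modified_date"] > current["modified_date"]:
            latest[rec[key]] = rec
    return latest


records = [
    {"group_id": "g1", "file_name": "clip_v1.mp4", "modified_date": datetime(2023, 5, 1)},
    {"group_id": "g1", "file_name": "clip_v2.mp4", "modified_date": datetime(2024, 1, 1)},
    {"group_id": "g2", "file_name": "song.mp3", "modified_date": datetime(2022, 3, 9)},
]
latest = latest_per_group(records)
print({group: rec["file_name"] for group, rec in latest.items()})
```

Every record that is not the latest in its group is a candidate for the redundant set.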

---

### 2. **Data Pipeline Overview**

1. **Data Collection/Integration**
   - **Data Sources**: The metadata from systems like Box and OneDAM, as shown in the images, where files are tagged with relevant attributes (created_by, file_name, file_size, etc.).
   - **Tools**: Integrate metadata using APIs for platforms like Box and OneDAM.

2. **Data Preprocessing**
   - **Parsing**: Metadata parsing and extraction from MongoDB, as shown in the provided structure. This includes fields like file_name, file_size, created_date, modified_date, etc.
   - **Data Cleaning**: Handle missing, incomplete, or erroneous data entries.
   - **Normalization**: Standardize the data types and formats (e.g., ensuring all dates are in a consistent timezone, file sizes are in a uniform unit).
   - **Deduplication**: Initial cleaning by eliminating exact duplicates based on metadata.

3. **Probabilistic Matching & Record Linkage**
   - **Tools**: Use record linkage and deduplication techniques to identify duplicates that are not exact matches.
     - **Libraries**:
       - `Dedupe` (Python library for fuzzy matching and deduplication)
       - `Splink` (for probabilistic record linkage and deduplication)
       - `fuzzywuzzy` (for fuzzy string matching)
   - **Methodology**:
     - Apply probabilistic matching for records with slight variations in file names, sizes, or modification dates.
     - Use `fuzzy matching` for metadata fields like `file_name` and `parentFolder`.
     - Link records using combinations of fields (file size + created date + file type) with a weighted score.
   - **Example**:
     - Compare `file_name` using fuzzy matching.
     - Compare `file_size` and dates to determine if files are probable duplicates.

4. **Analysis of Redundancy & Last Modified**
   - Analyze the metadata to find redundant files and determine the most recent version of each file based on `modified_date` and other relevant metadata fields.

5. **Storing Clean Data**
   - After deduplication and redundancy elimination, save the cleaned metadata and corresponding files for archiving.
   - **Output Storage**: Prepare cleaned files for long-term storage on **magnetic tape**.
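   - **Compression**: Consider compression only where it pays off. Most multimedia formats (MP4, MP3, JPEG) are already compressed, so general-purpose compression saves little on them.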

6. **End-to-End Testing and Validation**
   - Create unit test cases and validate the deduplication and matching results using test datasets before applying them on a larger scale.
   - Benchmark the accuracy and efficiency of the probabilistic matching model using validation sets.
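
The weighted-score idea from the Probabilistic Matching & Record Linkage step can be illustrated with a hand-weighted heuristic. This is a simpler stand-in for the learned weights that Splink or dedupe would estimate, not their algorithm; the weights, the 30-day date window, and the function name are illustrative assumptions.

```python
from datetime import datetime
from difflib import SequenceMatcher

# Illustrative weights, not tuned on real data; they sum to 1.
WEIGHTS = {"name": 0.5, "size": 0.3, "type": 0.1, "date": 0.1}


def match_score(a: dict, b: dict) -> float:
    """Combine per-field similarities into one score between 0 and 1."""
    name_sim = SequenceMatcher(None, a["file_name"].lower(), b["file_name"].lower()).ratio()
    size_sim = 1.0 if a["file_size"] == b["file_size"] else 0.0
    type_sim = 1.0 if a["file_type"] == b["file_type"] else 0.0
    days_apart = abs((a["created_date"] - b["created_date"]).days)
    date_sim = max(0.0, 1.0 - days_apart / 30)  # fades to 0 after 30 days
    return (WEIGHTS["name"] * name_sim + WEIGHTS["size"] * size_sim
            + WEIGHTS["type"] * type_sim + WEIGHTS["date"] * date_sim)


original = {"file_name": "interview.mp4", "file_size": 1048576,
            "file_type": "video", "created_date": datetime(2023, 10, 1)}
renamed = {"file_name": "interview_final.mp4", "file_size": 1048576,
           "file_type": "video", "created_date": datetime(2023, 10, 1)}
unrelated = {"file_name": "budget.xlsx", "file_size": 2048,
             "file_type": "spreadsheet", "created_date": datetime(2021, 1, 1)}

print(round(match_score(original, renamed), 3))
print(round(match_score(original, unrelated), 3))
```

A score like this still needs a threshold chosen against a labelled validation set, which is what the benchmarking step above is for.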

---

### 3. **Pipeline Steps & Tools**

#### **Data Sources**:
   - **Box and OneDAM**: Fetch metadata from these systems via APIs.
   - **MongoDB**: Load metadata into a MongoDB database for processing.

#### **Python Libraries**:
   - **pandas**: For data manipulation and cleaning.
   - **Dedupe**: For identifying duplicates based on probabilistic matching.
   - **Splink**: For fuzzy matching and linking records with small variations.
   - **fuzzywuzzy**: For string similarity matching (fuzzy matching).
   - **pymongo**: For interacting with MongoDB and storing the processed metadata.
   - **NumPy**: For efficient data processing.
   - **joblib**: For parallel processing if the dataset is large.

#### **Data Pipeline Framework**:
   - **Apache Airflow**: For scheduling and automating tasks in the pipeline, including fetching metadata, preprocessing, deduplication, and storage.
   - **Celery**: For task distribution if running the pipeline on multiple nodes.
   - **Docker**: To containerize the application for reproducibility across environments.

---

### 4. **Project Stages (Sprints)**

#### **Sprint 1**: Design & Integration
   - Set up MongoDB database schema based on metadata fields (e.g., `file_name`, `file_size`, etc.).
   - Design APIs and use cases for fetching metadata from Box/OneDAM.
   - Initial parsing of metadata and exploration of data for insights.

#### **Sprint 2**: Development & Testing
   - Implement deduplication using probabilistic matching techniques.
   - Set up fuzzy matching logic for metadata comparison.
   - Develop logic to identify redundant data and determine the most recent file versions.
   - Create unit test cases to ensure pipeline accuracy.

#### **Sprint 3**: Optimization & Storage
   - Optimize deduplication and matching algorithms for large datasets.
   - Finalize the pipeline for archiving clean data onto magnetic tapes.
   - Performance tuning and resource utilization improvements.

---

### 5. **Storage & Archival on Magnetic Tape**

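   - **Compression Techniques**: General-purpose compression (e.g., zip, gzip) helps for metadata and uncompressed formats such as WAV or BMP, but saves little on already-compressed media such as MP4, MP3, or JPEG.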
   - **Tape Storage Format**: Ensure the file formats are compatible with magnetic tape storage solutions (e.g., LTO tapes).
   - **Backup Strategy**: Implement a backup strategy for long-term retention and recovery.

---

This pipeline combines probabilistic record linkage, data science approaches, and efficient storage mechanisms to meet the project's goals. The use of MongoDB for metadata storage and the various Python libraries for deduplication and fuzzy matching make this a scalable solution for large datasets.

In [None]:
from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient('mongodb://localhost:27017/')
db = client['metadata_db']
collection = db['files']

# Example record structure
record = {
    'file_name': 'video1.mp4',
    'file_size': 1048576,
    'created_date': '2023-10-01T12:34:56Z',
    'modified_date': '2024-01-01T12:34:56Z',
    'file_type': 'video',
    'isDuplicate': False
}

# Insert the record into MongoDB
collection.insert_one(record)


In [None]:
import pandas as pd

# Fetch records from MongoDB
metadata = pd.DataFrame(list(collection.find()))

# Normalize date fields to timezone-aware UTC timestamps;
# missing or unparseable values become NaT instead of raising
for col in ['created_date', 'modified_date']:
    metadata[col] = pd.to_datetime(metadata[col], utc=True, errors='coerce')

# Handle missing values in text fields only, so date and numeric columns keep their dtypes
text_cols = ['file_name', 'file_type']
metadata[text_cols] = metadata[text_cols].fillna('')


In [None]:
import dedupe

# Define the fields for deduplication.
# This dict syntax is the dedupe 2.x form; dedupe 3.x expects
# objects such as dedupe.variables.String('file_name') instead.
fields = [
    {'field': 'file_name', 'type': 'String'},
    {'field': 'file_size', 'type': 'Exact'},
    {'field': 'file_type', 'type': 'String'},
]

# Initialize a Dedupe object
deduper = dedupe.Dedupe(fields)

# Prepare the data as {record_id: {field: value}}
data = metadata.to_dict('index')

# Sample candidate pairs for training
deduper.prepare_training(data)

# Active learning: label sampled pairs in the console, then train
dedupe.console_label(deduper)
deduper.train()

# Cluster the records; partition() returns a list of (record_ids, scores) tuples
clustered_dupes = deduper.partition(data, threshold=0.5)

# Mark every record that shares a cluster with at least one other record
for record_ids, scores in clustered_dupes:
    if len(record_ids) < 2:
        continue  # singleton cluster: no duplicate found
    for record_id in record_ids:
        collection.update_one({'_id': data[record_id]['_id']}, {'$set': {'isDuplicate': True}})


In [None]:
# Sort newest first so the first row per file_name is the latest version
metadata.sort_values('modified_date', ascending=False, inplace=True)
# duplicated(keep='first') is True for every row after the first, so negate it
metadata['is_latest'] = ~metadata.duplicated(subset=['file_name'], keep='first')

# Mark older versions as redundant
metadata.loc[~metadata['is_latest'], 'isDuplicate'] = True

# Update MongoDB (cast to a plain bool so pymongo can encode the value)
for index, row in metadata.iterrows():
    collection.update_one({'_id': row['_id']}, {'$set': {'isDuplicate': bool(row['isDuplicate'])}})


In [None]:
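import subprocess

# Sketch of archiving to magnetic tape
def archive_to_tape(file_paths, tape_device):
    # Write all files into one tar archive; passing arguments as a list
    # avoids shell quoting problems with spaces in file names
    subprocess.run(['tar', '-cvf', tape_device, '--', *file_paths], check=True)

# Archive all non-duplicate files in a single run, because each separate
# `tar -c` on a rewinding tape device would overwrite the previous archive.
# Assumes every record carries a 'file_path' field.
paths = [record['file_path'] for record in collection.find({'isDuplicate': False})]
if paths:
    archive_to_tape(paths, '/dev/tape_device')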


In [None]:
from fuzzywuzzy import fuzz

# Example of fuzzy matching for file names
def is_similar_name(name1, name2):
    return fuzz.ratio(name1, name2) > 85  # Return True if similarity is above 85%

# Test similarity
print(is_similar_name("video1_final.mp4", "video1.mp4"))  # Prints False: the "_final" suffix drops the score to about 77, so tune the threshold on real data


In [None]:
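from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x path; python_operator is the deprecated 1.x module
from datetime import datetime

# The four callables used below (fetch_metadata_from_box_onedam, preprocess_metadata,
# perform_deduplication, archive_to_magnetic_tape) are assumed to be defined or imported elsewhere.

# Define a DAG (Directed Acyclic Graph); catchup=False avoids backfilling every day since start_date
with DAG('multimedia_deduplication_pipeline', start_date=datetime(2024, 10, 15), schedule_interval='@daily', catchup=False) as dag: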

    fetch_metadata = PythonOperator(
        task_id='fetch_metadata',
        python_callable=fetch_metadata_from_box_onedam
    )

    preprocess_data = PythonOperator(
        task_id='preprocess_data',
        python_callable=preprocess_metadata
    )

    deduplicate_data = PythonOperator(
        task_id='deduplicate_data',
        python_callable=perform_deduplication
    )

    archive_data = PythonOperator(
        task_id='archive_data',
        python_callable=archive_to_magnetic_tape
    )

    # Define task dependencies
    fetch_metadata >> preprocess_data >> deduplicate_data >> archive_data


To generate a SHA-1 checksum for audio files stored in an S3 bucket without downloading them to your own machine, something still has to read the object's bytes, because the **ETag** that S3 returns by default is not a SHA-1. For objects uploaded in a single request without SSE-KMS or SSE-C encryption, the ETag is the **MD5 digest** of the content.

S3 does not attach a SHA-1 checksum to objects by default. It can store one if the uploader requests it (the `ChecksumAlgorithm` upload parameter, available since 2022), but objects uploaded without that option have none. In that case you need to compute the checksum yourself, either during upload or afterwards with **AWS Lambda** or a script using an **AWS SDK**.

### Approach 1: Using AWS SDK with S3's ETag (not SHA-1 but MD5)
- S3’s **ETag** is usually an MD5 checksum, except for multipart uploads, where it is a hash of the per-part MD5 digests followed by `-<part count>`. Either way it cannot be compared with Box, which reports **SHA-1**.
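
To make the multipart caveat concrete, the sketch below reproduces the commonly observed ETag scheme locally: a plain MD5 for single-request uploads, and an MD5 over the concatenated per-part MD5 digests plus a part-count suffix for multipart uploads. The function name is hypothetical, the scheme does not hold for SSE-KMS or SSE-C encrypted objects, and whether a client actually uses multipart for a given file depends on its configured threshold, so treat the result as a hint rather than a guarantee.

```python
import hashlib


def expected_s3_etag(data: bytes, part_size: int) -> str:
    """Estimate the ETag for data uploaded with the given multipart part size."""
    if len(data) <= part_size:
        # Assumed to be sent in one request: the ETag is the plain MD5.
        return hashlib.md5(data).hexdigest()
    part_digests = [
        hashlib.md5(data[offset:offset + part_size]).digest()
        for offset in range(0, len(data), part_size)
    ]
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"


payload = b"a" * 20
# Tiny part sizes are used only for demonstration; real S3 parts are at least 5 MiB.
print(expected_s3_etag(payload, part_size=100))
print(expected_s3_etag(payload, part_size=8))
```

The two outputs differ for identical content, which is why ETags are unreliable as a deduplication key across objects uploaded in different ways.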
  
### Approach 2: Using AWS Lambda for SHA-1 checksum generation (without downloading the file)

You can create an **AWS Lambda function** that reads files directly from S3, computes the SHA-1 checksum, and returns or stores it back in the S3 bucket or another service.

Here’s a **Python AWS Lambda** example using **Boto3** and **hashlib** to calculate SHA-1 checksums for files in S3:

1. **Create an S3-triggered Lambda function** that runs every time a new file is uploaded or on-demand.

2. **Code Example (Python)**:
   ```python
   import hashlib
   import urllib.parse

   import boto3
   import botocore

   s3_client = boto3.client('s3')

   def lambda_handler(event, context):
       record = event['Records'][0]['s3']
       bucket_name = record['bucket']['name']
       # Object keys arrive URL-encoded in S3 event notifications
       object_key = urllib.parse.unquote_plus(record['object']['key'])

       try:
           # Fetch the object from S3 as a stream
           s3_object = s3_client.get_object(Bucket=bucket_name, Key=object_key)

           # Compute the SHA-1 checksum in 1 MiB chunks so memory use stays flat
           sha1 = hashlib.sha1()
           for chunk in s3_object['Body'].iter_chunks(chunk_size=1024 * 1024):
               sha1.update(chunk)
           sha1_checksum = sha1.hexdigest()

           print(f"SHA-1 checksum for {object_key}: {sha1_checksum}")

           # Optionally store the SHA-1 checksum as metadata or in a DynamoDB table, etc.

           return {
               'statusCode': 200,
               'sha1_checksum': sha1_checksum
           }
       except botocore.exceptions.ClientError as e:
           print(f"Error fetching the object {object_key}: {e}")
           return {
               'statusCode': 500,
               'error': str(e)
           }
   ```

   - **Explanation**:
     - The Lambda function is triggered when an object is uploaded to S3.
     - It retrieves the file content and calculates the **SHA-1 checksum** using the `hashlib.sha1()` function.
     - The checksum can then be printed, stored as metadata, or returned as part of the response.

3. **Advantages**:
   - You don't need to download the file to your own machine; the Lambda reads it from S3 inside AWS.
   - Large files are feasible as long as the content is hashed in chunks rather than read into memory at once, but the function must still finish within Lambda's **memory and 15-minute execution limits**.

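4. **Add Metadata (optional)**:
   You can also store the SHA-1 checksum as metadata on the S3 object. S3 metadata cannot be edited in place, and calling `put_object` without a `Body` would overwrite the file with an empty object, so copy the object onto itself with replaced metadata:
   ```python
   # Add the checksum to the metadata of the S3 object via a self-copy.
   # Note: REPLACE discards existing user metadata unless you pass it again,
   # copy_object handles objects up to 5 GB, and in an S3-triggered Lambda
   # the copy emits another ObjectCreated event, so guard against re-processing.
   s3_client.copy_object(
       Bucket=bucket_name,
       Key=object_key,
       CopySource={'Bucket': bucket_name, 'Key': object_key},
       Metadata={'sha1checksum': sha1_checksum},
       MetadataDirective='REPLACE'
   )
   ```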

### Approach 3: Direct Calculation during Upload
If you’re uploading files using the AWS SDK, you can calculate the **SHA-1 checksum** before or during the upload and store it as metadata. Here’s how you can do it using **Boto3** in Python:

```python
import hashlib
import boto3

def upload_with_sha1(bucket_name, file_path, object_key):
    s3_client = boto3.client('s3')
    
    # Calculate SHA-1 checksum
    sha1 = hashlib.sha1()
    
    with open(file_path, 'rb') as f:
        while chunk := f.read(8192):
            sha1.update(chunk)
    sha1_checksum = sha1.hexdigest()
    
    # Upload the file to S3 and store the SHA-1 checksum as metadata
    s3_client.upload_file(
        Filename=file_path,
        Bucket=bucket_name,
        Key=object_key,
        ExtraArgs={'Metadata': {'SHA1Checksum': sha1_checksum}}
    )

    print(f"File {file_path} uploaded to {bucket_name}/{object_key} with SHA-1 checksum: {sha1_checksum}")
```

### Documentation for Reference
1. **Boto3 S3 Documentation** (Python SDK for AWS):
   - [Boto3 S3 Client](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_file)
   - [AWS SDK for S3 Object Operations](https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.html)

2. **hashlib documentation**:
   - [Python hashlib for SHA-1](https://docs.python.org/3/library/hashlib.html#hashlib.sha1)

### Verifying SHA-1 Algorithm
SHA-1 is a standardized algorithm (**FIPS PUB 180-4**), so Box and Python's `hashlib` compute identical digests for identical bytes. A mismatch therefore means the content differs, not the implementation.
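
One practical wrinkle when comparing values is encoding rather than algorithm. As an assumption to verify against your own data: Box reports `sha1` as a hexadecimal string, while S3, for objects uploaded with its optional SHA-1 checksum enabled, reports `ChecksumSHA1` base64-encoded. The helper names below are hypothetical; the sketch only normalizes both forms before comparing.

```python
import base64
import hashlib


def sha1_hex_from_base64(b64_digest: str) -> str:
    """Convert a base64-encoded SHA-1 digest to lowercase hex."""
    return base64.b64decode(b64_digest).hex()


def same_content(box_sha1_hex: str, s3_sha1_b64: str) -> bool:
    """Compare a hex digest with a base64 digest of the same algorithm."""
    return box_sha1_hex.lower() == sha1_hex_from_base64(s3_sha1_b64)


# Build both encodings from the same bytes to show that they agree.
raw_digest = hashlib.sha1(b"example audio bytes").digest()
hex_form = raw_digest.hex().upper()
b64_form = base64.b64encode(raw_digest).decode("ascii")
print(same_content(hex_form, b64_form))
```

Values you compute yourself with `hashlib.sha1(...).hexdigest()` are already hex and only need the case normalization.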

If you want to avoid running Lambda, note that S3 will not produce a SHA-1 for an existing object that was uploaded without one unless something reads the object's content. The practical options are a Lambda (or any script using the SDK) that streams the object, or computing the checksum before upload and storing it as metadata.

This should give you a clear path to compute SHA-1 for your S3 audio files and ensure compatibility with Box's checksum validation process.

To address the problem of generating a SHA-1 checksum for S3 objects (like audio files) using AWS SDKs, integrating metadata from Box, and analyzing redundant or duplicated files, you can approach this in stages by breaking it down into manageable components.

Given your project requirements, I will propose a **pipeline** that can help automate the checksum generation, handle duplicates, and integrate data from S3 and Box. I'll also cover how **splink** and probabilistic matching can help in deduplication and redundant file analysis.

---

### Pipeline Breakdown

1. **Fetch Files from S3 and Compute SHA-1 Checksums**
    - You can use AWS SDK (like **Boto3** in Python) to fetch files from S3 and compute the **SHA-1** checksum for each file.
    - Ensure that you handle this without downloading the file locally using streams and AWS’s direct object access.
    - **Option 1**: Implement a Lambda function that computes the checksum when a file is uploaded to S3, or process all existing files.

    **Code Example** (for Python using Boto3):
    ```python
    import hashlib
    import boto3

    s3_client = boto3.client('s3')

    def generate_sha1_from_s3(bucket, key):
        # Download the file in memory and calculate the SHA-1 checksum
        s3_object = s3_client.get_object(Bucket=bucket, Key=key)
        file_content = s3_object['Body'].read()
        
        # Compute SHA-1
        sha1_checksum = hashlib.sha1(file_content).hexdigest()
        return sha1_checksum

    # Example usage
    bucket_name = "your-s3-bucket"
    file_key = "path/to/your/audio.mp3"
    checksum = generate_sha1_from_s3(bucket_name, file_key)
    print(f"The SHA-1 checksum of the file is: {checksum}")
    ```

2. **Store or Compare Metadata**
    - Once you calculate the SHA-1 checksum, store it in S3 metadata, or in a DynamoDB database, for future comparisons with Box's metadata.
    - Box’s metadata includes a **SHA-1** field, which you can directly compare with S3’s computed checksum.
    - Fetch Box file metadata using **Box SDK** or API and match it against the checksum calculated in S3.

3. **Probabilistic Matching & Duplicate Detection**
    - Using the metadata (SHA-1 checksums, file names, sizes, etc.) from both S3 and Box, you can run **duplicate detection**.
    - For this, you can use the **splink** library for **probabilistic record linkage** (since you are working with potentially large datasets).
    - **Splink** will help match files from Box and S3 using **fuzzy matching**, **dedupe** logic, or **probabilistic approaches** when exact matches (such as SHA-1 hashes) are not found.
    
    **Hypothesis**: Duplicate or redundant files are likely to have matching SHA-1 checksums or near-identical metadata (size, timestamp).
    
    **Example**:
    ```python
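    import pandas as pd
    import splink.comparison_library as cl
    from splink import DuckDBAPI, Linker, SettingsCreator, block_on

    # Written against the Splink 4 API; earlier releases use different entry points.
    # Splink needs a unique ID per record (default column name: unique_id).
    s3_data = pd.DataFrame([
        {'unique_id': 's3-1', 'file_name': 'audio1.mp3', 'sha1': 'ABC123', 'size': 12345},
        {'unique_id': 's3-2', 'file_name': 'audio2.mp3', 'sha1': 'XYZ456', 'size': 67890}
    ])

    box_data = pd.DataFrame([
        {'unique_id': 'box-1', 'file_name': 'audio1.mp3', 'sha1': 'ABC123', 'size': 12345},
        {'unique_id': 'box-2', 'file_name': 'audio_duplicate.mp3', 'sha1': 'ABC123', 'size': 12345}
    ])

    # Define the model: link records across the two sources
    settings = SettingsCreator(
        link_type="link_only",
        # Block on size rather than file_name so renamed copies are still compared
        blocking_rules_to_generate_predictions=[block_on("size")],
        comparisons=[
            cl.ExactMatch("sha1"),
            cl.JaroWinklerAtThresholds("file_name"),
        ],
    )

    linker = Linker([s3_data, box_data], settings, db_api=DuckDBAPI())

    # With real data, estimate the model parameters (linker.training.*) before
    # predicting; without training Splink falls back to default weights.
    results = linker.inference.predict()

    print(results.as_pandas_dataframe())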
    ```

4. **Metadata Storage & Analysis**
    - For efficient storage and analysis of metadata, you can use **MongoDB** or any NoSQL database.
    - Store all metadata (file names, checksums, timestamps, etc.) from S3 and Box for further analysis, comparison, and deduplication.
    - Use MongoDB queries to fetch duplicates, or files that haven't been matched, and analyze redundancy based on user requirements.

5. **Big Data Scaling**
    - If your dataset is large, you can use **AWS Glue** for ETL processing and **AWS Athena** for querying metadata at scale. **AWS S3 Select** can filter structured objects (CSV, JSON, Parquet), such as metadata exports, without pulling the whole object, but it does not apply to the audio files themselves.
    - For **batch processing** and running the deduplication at scale, consider using **AWS EMR** with Spark and PySpark.
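
Before any probabilistic model is involved, the comparison in step 2 can be done as an exact join on the checksum. The sketch below reuses the placeholder records from the example above; the function name is hypothetical and both sides are assumed to carry a `sha1` string.

```python
def match_by_sha1(s3_records, box_records):
    """Split S3 records into those with a Box counterpart and those without."""
    box_by_hash = {}
    for rec in box_records:
        box_by_hash.setdefault(rec["sha1"].lower(), []).append(rec["file_name"])

    matched, unmatched = [], []
    for rec in s3_records:
        box_names = box_by_hash.get(rec["sha1"].lower())
        if box_names:
            matched.append((rec["file_name"], box_names))
        else:
            unmatched.append(rec["file_name"])
    return matched, unmatched


s3_records = [
    {"file_name": "audio1.mp3", "sha1": "ABC123", "size": 12345},
    {"file_name": "audio2.mp3", "sha1": "XYZ456", "size": 67890},
]
box_records = [
    {"file_name": "audio1.mp3", "sha1": "ABC123", "size": 12345},
    {"file_name": "audio_duplicate.mp3", "sha1": "ABC123", "size": 12345},
]
matched, unmatched = match_by_sha1(s3_records, box_records)
print(matched)
print(unmatched)
```

Only the unmatched records then need the fuzzy or probabilistic pass from step 3, which keeps the expensive comparison set small.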

---

### Complete Pipeline

1. **Data Ingestion**:
   - Fetch audio file metadata from both S3 and Box.
   - Compute the **SHA-1 checksum** for files in S3 (using Lambda or direct Boto3 calls).
   
2. **Metadata Storage**:
   - Store the metadata in **MongoDB**, including file name, size, creation date, and SHA-1 checksums from both sources.

3. **Probabilistic Matching & Deduplication**:
   - Use **splink** to match and deduplicate records between S3 and Box.
   - Use **fuzzy logic** and exact matches based on SHA-1, file name, and file size.

4. **Duplicate Identification**:
   - Tag files as duplicates if the checksum matches, or if the probabilistic model identifies them as near duplicates.
   
5. **Post-Processing and Analysis**:
   - Use Python and Pandas for further analysis and reporting, identifying redundant files and generating reports for stakeholders.

---

### Tools and Technologies
1. **AWS SDK (Boto3)** - for S3 operations.
2. **Box SDK** - for fetching metadata from Box.
3. **hashlib** - for SHA-1 checksum generation.
4. **splink** - for probabilistic matching and deduplication.
5. **MongoDB** - for metadata storage and querying.
6. **Python (Pandas)** - for processing and analysis.

---

### Hypothesis to Result Flow
1. **Hypothesis**: Files with identical SHA-1 checksums or similar metadata (name, size) are duplicates.
2. **Steps**:
   - Fetch metadata from S3 and Box.
   - Compute SHA-1 checksum for S3 files.
   - Use **splink** for probabilistic matching.
3. **Result**: Identified duplicate files can be tagged and removed or analyzed for redundancy.

This pipeline ensures end-to-end coverage of metadata processing, deduplication, and checksum validation, providing a solid foundation to complete your project.

To implement the solution using the mentioned tools and technologies, I'll walk you through how each can be applied to solve the problem of **deduplication** and **metadata comparison** between **AWS S3** and **Box**, including computing **SHA-1 checksums**, managing metadata, and **probabilistic matching**.

Here’s a step-by-step guide on how to use these tools effectively:

---

### 1. **AWS SDK (Boto3) for S3 Operations**

Boto3 is the AWS SDK for Python. We’ll use it to interact with AWS S3—specifically to fetch files, compute their SHA-1 checksum, and work with metadata.

#### Use Case:
- Fetch audio files from your **S3 bucket** and retrieve the metadata.
- Compute and store the **SHA-1 checksum** for each file.

#### Example Code:
```python
import boto3
import hashlib

# Initialize S3 client
s3_client = boto3.client('s3')

def fetch_s3_metadata(bucket, key):
    """
    Fetch file content and metadata from an S3 bucket.
    """
    # Fetch the file from S3
    s3_object = s3_client.get_object(Bucket=bucket, Key=key)
    file_content = s3_object['Body'].read()  # File content for checksum
    metadata = s3_object['Metadata']  # Existing metadata if any
    return file_content, metadata

def generate_sha1_checksum(file_content):
    """
    Generate SHA-1 checksum for the file content.
    """
    sha1 = hashlib.sha1()
    sha1.update(file_content)
    return sha1.hexdigest()

bucket_name = "your-s3-bucket"
file_key = "path/to/your/file.mp3"

# Get file content from S3 and calculate the SHA-1 checksum
file_content, metadata = fetch_s3_metadata(bucket_name, file_key)
checksum = generate_sha1_checksum(file_content)

print(f"SHA-1 Checksum for {file_key}: {checksum}")
```

#### Steps:
1. **Fetch files from S3**: Use `get_object()` to retrieve the file and metadata from S3.
2. **Compute SHA-1 Checksum**: Use `hashlib` to generate the checksum for deduplication purposes.
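
Large audio files may not fit comfortably in memory when read with a single `read()` call. As a hedged alternative, the checksum can be computed incrementally from chunks. The helper name `sha1_from_chunks` is hypothetical; with Boto3 the chunks would typically come from the streaming body's `iter_chunks()` method, shown only in a comment here so the sketch runs without AWS access.

```python
import hashlib


def sha1_from_chunks(chunks):
    """Compute a SHA-1 hex digest from an iterable of byte chunks."""
    sha1 = hashlib.sha1()
    for chunk in chunks:
        sha1.update(chunk)  # feed each chunk; the full file is never held in memory
    return sha1.hexdigest()


# With Boto3, the chunks could come from the S3 streaming body, for example:
#   body = s3_client.get_object(Bucket=bucket, Key=key)["Body"]
#   checksum = sha1_from_chunks(body.iter_chunks(chunk_size=1024 * 1024))
demo_chunks = [b"Sample ", b"file ", b"content"]
print(sha1_from_chunks(demo_chunks))
```

The digest is identical to hashing the concatenated bytes in one call, so results stay comparable with the SHA-1 values Box reports.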

---

### 2. **Box SDK for Metadata Fetching**

The **Box SDK** allows you to interact with Box to fetch metadata for files, including the **SHA-1 checksum** Box provides.

#### Use Case:
- Fetch metadata (including the SHA-1 checksum) from Box for comparison with files in S3.

#### Example Code:
```python
from boxsdk import Client, OAuth2

# Authenticate with Box
oauth2 = OAuth2(
    client_id='your-client-id',
    client_secret='your-client-secret',
    access_token='your-access-token'
)
client = Client(oauth2)

def get_box_file_metadata(file_id):
    """
    Fetch metadata (including SHA-1) from Box for the given file.
    """
    box_file = client.file(file_id).get()
    sha1_checksum = box_file.sha1
    file_name = box_file.name
    file_size = box_file.size
    return {"name": file_name, "size": file_size, "sha1": sha1_checksum}

box_file_id = "1234567890"
box_metadata = get_box_file_metadata(box_file_id)
print(f"Box File Metadata: {box_metadata}")
```

#### Steps:
1. **Authenticate with Box**: Use OAuth2 credentials to authenticate with the Box API.
2. **Fetch Metadata**: Use the Box SDK to retrieve metadata, including the SHA-1 checksum.
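
Before records from the two sources can be compared or stored together, they need a common shape. The sketch below shows one possible normalization. The function names and the target field names are assumptions; `head` is assumed to look like the dictionary returned by Boto3's `head_object()` (which includes `ContentLength`), and `meta` is the dictionary returned by `get_box_file_metadata()` above.

```python
def normalize_s3(key, head, sha1):
    """Map an S3 object (key, head_object-style dict, computed SHA-1) to a common record."""
    return {
        "source": "S3",
        "file_name": key.rsplit("/", 1)[-1],  # drop the key prefix, keep the file name
        "size": head["ContentLength"],
        "sha1": sha1.lower(),  # hex digests compare case-insensitively
    }


def normalize_box(meta):
    """Map the dict from get_box_file_metadata() to the same common record."""
    return {
        "source": "Box",
        "file_name": meta["name"],
        "size": meta["size"],
        "sha1": meta["sha1"].lower(),
    }


s3_row = normalize_s3("path/to/your/file.mp3", {"ContentLength": 12345}, "ABC123")
box_row = normalize_box({"name": "file.mp3", "size": 12345, "sha1": "abc123"})
print(s3_row["sha1"] == box_row["sha1"])
```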

---

### 3. **Hashlib for SHA-1 Checksum Generation**

**Hashlib** is used to compute the **SHA-1 checksum** for the files fetched from **S3**.

#### Use Case:
- After fetching the file from S3, compute the **SHA-1 checksum** to compare with Box files.

#### Example:
```python
import hashlib

def generate_sha1_checksum(file_content):
    sha1 = hashlib.sha1()
    sha1.update(file_content)
    return sha1.hexdigest()

file_content = b"Sample file content"
checksum = generate_sha1_checksum(file_content)
print(f"SHA-1 checksum: {checksum}")
```

#### Steps:
1. **Compute Checksum**: Use `hashlib` to generate SHA-1 hash for any file or byte content.

---

### 4. **Splink for Probabilistic Matching and Deduplication**

**Splink** is a library for probabilistic record linkage and deduplication. It helps you match files between S3 and Box, even when exact matches are not possible (e.g., mismatched file names).

#### Use Case:
- Use **Splink** to compare metadata from S3 and Box and identify duplicates or similar files based on checksum, file name, or size.

#### Example Code:
```python
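import pandas as pd
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator

# Example data from S3 and Box (Splink expects a unique_id column in each table)
s3_data = pd.DataFrame([
    {'unique_id': 1, 'file_name': 'audio1.mp3', 'sha1': 'ABC123', 'size': 12345},
    {'unique_id': 2, 'file_name': 'audio2.mp3', 'sha1': 'XYZ456', 'size': 67890}
])

box_data = pd.DataFrame([
    {'unique_id': 1, 'file_name': 'audio_duplicate.mp3', 'sha1': 'ABC123', 'size': 12345},
    {'unique_id': 2, 'file_name': 'audio2.mp3', 'sha1': 'XYZ456', 'size': 67890}
])

# Linkage settings (Splink 4 API): link records across the two sources only
settings = SettingsCreator(
    link_type="link_only",
    comparisons=[
        cl.JaroWinklerAtThresholds("file_name", [0.9, 0.7]),  # fuzzy name match
        cl.ExactMatch("sha1"),
        cl.ExactMatch("size"),
    ],
)

# Link S3 and Box data. The model is untrained here, so the scores only illustrate the API.
linker = Linker([s3_data, box_data], settings, db_api=DuckDBAPI(),
                input_table_aliases=["s3", "box"])
results = linker.inference.predict().as_pandas_dataframe()
print(results[["file_name_l", "file_name_r", "match_probability"]])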
```

#### Steps:
1. **Prepare data**: Extract metadata from both S3 and Box and convert it to a DataFrame.
2. **Define matching settings**: Use Splink to compare file name, SHA-1 checksum, and size for deduplication.
3. **Run Matching**: Execute Splink to find exact or probabilistic matches.
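
Splink is built on the Fellegi-Sunter model, in which each field contributes a Bayes factor `m / u` (the probability of the observed agreement level among true matches divided by the same probability among non-matches). The sketch below shows that arithmetic in plain Python. The `m` and `u` values and the prior are invented for illustration; in practice Splink estimates them from the data.

```python
def match_probability(prior, comparisons):
    """Combine a prior with per-field (m, u) pairs into a match probability.

    prior: probability that two randomly chosen records match.
    comparisons: list of (m, u) for the agreement level observed on each field.
    """
    odds = prior / (1 - prior)  # prior odds of a match
    for m, u in comparisons:
        odds *= m / u  # multiply in each field's Bayes factor
    return odds / (1 + odds)  # convert odds back to a probability


# Illustrative values only: SHA-1 agrees, size agrees, file name disagrees.
evidence = [(0.95, 0.0001), (0.98, 0.05), (0.30, 0.90)]
print(round(match_probability(0.01, evidence), 4))
```

A checksum agreement is very unlikely among non-matches (tiny `u`), so it dominates the score even when the file names differ, which is the behaviour wanted for renamed copies.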

---

### 5. **MongoDB for Metadata Storage**

Store all metadata from S3 and Box in **MongoDB** for querying and deduplication analysis.

#### Use Case:
- Use MongoDB to store the file metadata from S3 and Box, along with computed SHA-1 checksums, for easy querying and analysis.

#### Example Code:
```python
from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient('mongodb://localhost:27017/')
db = client['file_metadata_db']
collection = db['file_metadata']

# Insert metadata from S3
s3_metadata = {
    "file_name": "audio1.mp3",
    "sha1": "ABC123",
    "size": 12345,
    "source": "S3"
}
collection.insert_one(s3_metadata)

# Insert metadata from Box
box_metadata = {
    "file_name": "audio_duplicate.mp3",
    "sha1": "ABC123",
    "size": 12345,
    "source": "Box"
}
collection.insert_one(box_metadata)

# Query for duplicates based on SHA-1 checksum
duplicates = collection.find({"sha1": "ABC123"})
for doc in duplicates:
    print(doc)
```

#### Steps:
1. **Insert Metadata**: After calculating SHA-1 checksums, store file metadata from S3 and Box into MongoDB.
2. **Query Data**: Use MongoDB queries to search for duplicates or near-duplicates based on SHA-1 or other criteria.
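
Querying one hard-coded checksum does not scale to a whole collection. One hedged option is an aggregation pipeline that groups documents by `sha1` and keeps only groups with more than one file. The helper name `duplicate_groups_pipeline` and the output field names (`copies`, `sources`, `files`) are assumptions; the stages use the standard `$group`, `$match`, and `$sort` operators.

```python
def duplicate_groups_pipeline(min_copies=2):
    """Build an aggregation pipeline listing SHA-1 values shared by several files."""
    return [
        {"$group": {
            "_id": "$sha1",                       # one group per checksum
            "copies": {"$sum": 1},                # number of files with this checksum
            "sources": {"$addToSet": "$source"},  # e.g. ["S3", "Box"]
            "files": {"$push": "$file_name"},
        }},
        {"$match": {"copies": {"$gte": min_copies}}},  # keep duplicated checksums only
        {"$sort": {"copies": -1}},
    ]


pipeline = duplicate_groups_pipeline()
# With the collection from the example above:
#   for group in collection.aggregate(pipeline):
#       print(group)
print(pipeline[1])
```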

---

### 6. **Python (Pandas) for Processing and Analysis**

Pandas will help you manipulate and analyze metadata from both **S3** and **Box**. After identifying duplicates with **Splink** or SHA-1 checksum comparison, you can use **Pandas** for further processing.

#### Use Case:
- Use **Pandas** to analyze the file metadata, detect duplicates, and generate reports.

#### Example Code:
```python
import pandas as pd

# Example data from S3 and Box
data = [
    {'source': 'S3', 'file_name': 'audio1.mp3', 'sha1': 'ABC123', 'size': 12345},
    {'source': 'Box', 'file_name': 'audio_duplicate.mp3', 'sha1': 'ABC123', 'size': 12345}
]

# Create DataFrame
df = pd.DataFrame(data)

# Find duplicate files based on SHA-1
duplicates = df[df.duplicated(subset='sha1', keep=False)]
print(duplicates)
```

#### Steps:
1. **Load Metadata**: Use Pandas to read and manipulate metadata from S3 and Box.
2. **Analyze Duplicates**: Use Pandas to detect duplicates based on SHA-1 or other attributes.
3. **Generate Reports**: Summarize the results (e.g., how many duplicates were found) for reporting purposes.
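
For the reporting step, a short sketch of one possible summary follows: one row per duplicated checksum, with the number of copies and the bytes that could be reclaimed. The function name `summarize_duplicates` and the output column names are assumptions, and the sample rows are made up.

```python
import pandas as pd


def summarize_duplicates(df):
    """Summarize files that share a SHA-1: copies, sources, and redundant bytes."""
    dup = df[df.duplicated(subset="sha1", keep=False)]
    summary = (
        dup.groupby("sha1")
        .agg(
            copies=("file_name", "count"),
            sources=("source", lambda s: ", ".join(sorted(set(s)))),
            bytes_each=("size", "first"),
        )
        .reset_index()
    )
    # Every copy beyond the first is redundant storage.
    summary["redundant_bytes"] = (summary["copies"] - 1) * summary["bytes_each"]
    return summary


df = pd.DataFrame([
    {"source": "S3", "file_name": "audio1.mp3", "sha1": "ABC123", "size": 12345},
    {"source": "Box", "file_name": "audio_duplicate.mp3", "sha1": "ABC123", "size": 12345},
    {"source": "S3", "file_name": "audio2.mp3", "sha1": "XYZ456", "size": 67890},
])
report = summarize_duplicates(df)
print(report)
```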

---

### Final Workflow

1. **S3 File Handling**: Use **AWS SDK (Boto3)** to fetch file metadata and generate **SHA-1** checksums.
2. **Box Metadata**: Use **Box SDK** to fetch file metadata from Box.
3. **Checksum Comparison**: Compare the **SHA-1 checksums** between S3 and Box files.

When explaining your project and code to your professor, it's important to present it in a clear, structured, and logical manner. Here's how you can explain the steps, methodology, and choices you made for this deduplication and record linkage project using **Splink** in **Google Colab**.

---

### 1. **Introduction to the Problem**
Start by explaining the context and what the assignment is about.

**Example:**
"Professor, the goal of this project is to identify duplicate media assets between two different systems: **OneDAM** and **Box**. Since these systems may have different ways of calculating checksums (e.g., MD5 and SHA-1), and even different file formats for the same content, we are using probabilistic record linkage to match records. The records are provided as metadata in JSON format, and I'm using a Python library called **Splink** for probabilistic linkage."

---

### 2. **Why Use Probabilistic Linkage?**
Highlight why deterministic methods (like direct comparison of checksums) may not work and why a probabilistic approach is more appropriate.

**Example:**
"Comparing checksums directly is not enough here. OneDAM and Box may record checksums computed with different algorithms (MD5 and SHA-1), and the digests of the same file under different algorithms never match, so they cannot be compared across systems. The same content can also be stored in different file formats, which changes every checksum. For that reason I opted for **probabilistic linkage**, which compares multiple attributes (like filenames, checksums, creation dates, etc.) and assigns a probability to how likely two records are duplicates. This gives us a more flexible way of identifying duplicates."

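To make the checksum point concrete, the short sketch below hashes the same bytes with MD5 and SHA-1. The helper name `digests` is hypothetical. The two hex strings differ in length as well as content, so a checksum recorded by one system with MD5 can never equal one recorded by another system with SHA-1, even for identical files.

```python
import hashlib


def digests(data):
    """Return the MD5 and SHA-1 hex digests of the same bytes."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }


d = digests(b"Sample file content")
print(len(d["md5"]), len(d["sha1"]))  # prints "32 40"
print(d["md5"] == d["sha1"])          # prints "False"
```
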
---

### 3. **Tools and Libraries Used**
Briefly explain the tools you chose and why.

**Example:**
"I used **Google Colab** for coding because it provides a convenient, cloud-based environment that can handle Python code. For the linkage itself, I used the **Splink** library. This library is designed for probabilistic record linkage and is highly configurable. It allowed me to compare metadata fields like filenames, checksums, and modification dates to detect duplicates."

---

### 4. **Steps Followed in the Project**
Now walk through the workflow, explaining the key steps.

#### Step 1: **Loading the Metadata**
"I started by loading the metadata from **OneDAM** and **Box** into Python. The metadata was in JSON format, so I used Python’s `json` and `pandas` libraries to read and convert this data into a structured format (Pandas DataFrame), which is easier to work with."

#### Step 2: **Combining the Data**
"Next, I combined the metadata from both systems into a single dataset. This allowed me to compare records across both systems."

#### Step 3: **Configuring Splink for Record Linkage**
"To set up the probabilistic linkage, I defined which fields in the metadata should be compared. I chose to compare fields like `filename`, `checksum`, `creation_date`, and `modification_date`. Splink then estimates a match weight for each field, reflecting how strongly agreement or disagreement on that field counts as evidence for or against a match."

#### Step 4: **Running the Linkage Process**
"Splink then runs the comparison, assigning a probability score to each potential match. This score helps us determine whether two records from OneDAM and Box are likely to be duplicates."

#### Step 5: **Identifying and Saving the Duplicates**
"Once the linkage process was complete, I extracted the results into a Pandas DataFrame, where I could see which records were flagged as potential duplicates. I then saved the results to a CSV file for further analysis."

---

### 5. **Results and What They Mean**
Summarize what you achieved.

**Example:**
"The result of this process is a set of potential duplicate records across OneDAM and Box, each with a probability score indicating how likely it is that they are duplicates. This approach lets us capture duplicates even when a simple checksum comparison would not have worked."

---

### 6. **Challenges and Solutions**
Mention any challenges you faced and how you overcame them.

**Example:**
"One challenge was that the two systems may use different hash algorithms, like MD5 and SHA-1, so the checksum of the same file differs between them and an exact checksum comparison finds no matches. To address this, I configured Splink to look at multiple metadata fields and used **probabilistic** rather than deterministic matching."

---

### 7. **Why Google Colab?**
Explain why you chose Google Colab as your development environment.

**Example:**
"I used Google Colab because it's a free, cloud-based environment that provides access to all necessary Python libraries and tools. It’s also convenient for uploading files and sharing results."

---

### 8. **Potential Improvements**
You could also mention what could be improved in future work.

**Example:**
"In the future, I could refine the linkage model by including additional metadata fields if available (e.g., file size, MIME type). Additionally, I could tune the thresholds for better precision in detecting duplicates."

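Threshold tuning can be illustrated with a small sketch. It assumes a hand-labelled sample of scored pairs is available, which the project does not yet have; the function name `precision_recall` and the sample scores are made up for illustration.

```python
def precision_recall(scored_pairs, threshold):
    """Compute precision and recall for pairs scored as (match_probability, is_duplicate)."""
    tp = sum(1 for p, truth in scored_pairs if p >= threshold and truth)
    fp = sum(1 for p, truth in scored_pairs if p >= threshold and not truth)
    fn = sum(1 for p, truth in scored_pairs if p < threshold and truth)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical labelled sample: (match probability, reviewer says duplicate?)
labelled = [(0.99, True), (0.95, True), (0.80, False), (0.60, True), (0.30, False)]
for threshold in (0.5, 0.9):
    print(threshold, precision_recall(labelled, threshold))
```

Raising the threshold trades recall for precision, so the right value depends on whether missed duplicates or wrongly flagged files are more costly.
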
---

### 9. **Code Walkthrough**
Be ready to walk your professor through the key parts of the code:

- **Loading the Data**: "Here, I use `pandas` and `json` to load the JSON files into DataFrames."
- **Splink Configuration**: "I defined which fields Splink should compare, such as `filename` and `checksum`."
- **Running the Model**: "Splink then calculates match probabilities for each record pair."
- **Outputting Results**: "Finally, I saved the results to a CSV file for analysis."

---

### 10. **Conclusion**
Wrap up by summarizing the main points and what you've learned.

**Example:**
"In conclusion, this project demonstrates how probabilistic linkage can effectively identify duplicate files across systems even when simple checksum comparisons fail. Using **Splink** in **Google Colab** allowed me to build a robust process for deduplication that can be extended to larger datasets in the future."

---

### Bonus: **Potential Questions**
Be prepared for questions like:
- **Why did you choose these specific fields for linkage?**
  - Explain why fields like `filename` and `checksum` were used—because they are crucial for file identity.
  
- **How accurate is this process?**
  - You could explain that Splink provides probability scores, and the accuracy depends on the quality of the metadata and how the model is configured.

- **How could this be scaled to larger datasets?**
  - Mention that Splink runs on several SQL backends (for example DuckDB for single-machine work and Spark for distributed processing), so the same model can move beyond Colab when the data outgrows it.

---

By following this structure, you can clearly explain your project, the reasoning behind your choices, and how you implemented the solution using Splink in Google Colab. Let me know if you'd like more detail on any of the sections!

In [None]:
%%capture
!pip install splink pandas pyarrow



In [None]:
from google.colab import files

# Upload the two JSON metadata files
uploaded = files.upload()

Saving BoxData.ipynb to BoxData (1).ipynb
Saving onddam_examole_data.json to onddam_examole_data (1).json


In [None]:
import pandas as pd
import json
from splink import Splink
from splink.settings import complete_settings_link_only

# Load JSON data from OneDAM and Box
with open('oneDAM_metadata.json') as f1, open('box_metadata.json') as f2:
    oneDAM_data = json.load(f1)
    box_data = json.load(f2)

# Convert JSON data into Pandas DataFrames
df_oneDAM = pd.DataFrame(oneDAM_data)
df_box = pd.DataFrame(box_data)

# Combine datasets (for linkage)
df_combined = pd.concat([df_oneDAM, df_box], keys=["OneDAM", "Box"])

# Splink linkage configuration
linking_fields = [
    {"col_name": "filename", "term_frequency_adjustments": True},
    {"col_name": "checksum"},
    {"col_name": "creation_date"},
    {"col_name": "modification_date"}
]

# Configure Splink settings for probabilistic record linkage
settings = complete_settings_link_only(linking_fields)

# Initialize Splink model with combined dataset
model = Splink(settings, df_combined)

# Run the linkage process to find duplicates
results = model.run_linking()

# Convert the results to a Pandas DataFrame
duplicates = results.as_pandas_dataframe()

# Show the first few records of possible duplicates
print(duplicates.head())

ImportError: cannot import name 'Splink' from 'splink' (/usr/local/lib/python3.10/dist-packages/splink/__init__.py)

In [None]:
import pandas as pd
import json
# Import Splink from splink.linker
from splink.linker import Linker as Splink  # Splink class is now called Linker
from splink.settings import complete_settings_link_only
# Load JSON data from OneDAM and Box
with open('oneDAM_metadata.json') as f1, open('box_metadata.json') as f2:
    oneDAM_data = json.load(f1)
    box_data = json.load(f2)

# Convert JSON data into Pandas DataFrames
df_oneDAM = pd.DataFrame(oneDAM_data)
df_box = pd.DataFrame(box_data)

# Combine datasets (for linkage)
df_combined = pd.concat([df_oneDAM, df_box], keys=["OneDAM", "Box"])

# Splink linkage configuration
linking_fields = [
    {"col_name": "filename", "term_frequency_adjustments": True},
    {"col_name": "checksum"},
    {"col_name": "creation_date"},
    {"col_name": "modification_date"}
]

# Configure Splink settings for probabilistic record linkage
settings = complete_settings_link_only(linking_fields)

# Initialize Splink model with combined dataset
model = Splink(settings, df_combined)

# Run the linkage process to find duplicates
results = model.run_linking()

# Convert the results to a Pandas DataFrame
duplicates = results.as_pandas_dataframe()

# Show the first few records of possible duplicates
print(duplicates.head())

ModuleNotFoundError: No module named 'splink.linker'

In [None]:
# Save the results to a CSV file
duplicates.to_csv('duplicates_results.csv', index=False)

# Download the file
from google.colab import files
files.download('duplicates_results.csv')

NameError: name 'duplicates' is not defined

In [None]:
!pip install splink pandas pyarrow

Collecting splink
  Downloading splink-4.0.4-py3-none-any.whl.metadata (12 kB)

In [None]:
from google.colab import files

# Upload the two JSON metadata files
uploaded = files.upload()

Saving BoxData.ipynb to BoxData (2).ipynb
Saving onddam_examole_data.json to onddam_examole_data (2).json


In [None]:
import pandas as pd
import json
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator

# Load JSON data from the uploaded files (the keys must match the uploaded file names exactly)
oneDAM_data = json.loads(uploaded['oneDAM_metadata.json'])
box_data = json.loads(uploaded['box_metadata.json'])

# Convert JSON data into Pandas DataFrames
df_oneDAM = pd.DataFrame(oneDAM_data)
df_box = pd.DataFrame(box_data)

# Splink needs a unique_id column in every input table
df_oneDAM["unique_id"] = range(len(df_oneDAM))
df_box["unique_id"] = range(len(df_box))

# Splink linkage configuration (Splink 4 API)
settings = SettingsCreator(
    link_type="link_only",  # only compare OneDAM records with Box records
    comparisons=[
        cl.JaroWinklerAtThresholds("filename", [0.9, 0.7]).configure(
            term_frequency_adjustments=True
        ),
        cl.ExactMatch("checksum"),
        cl.ExactMatch("creation_date"),
        cl.ExactMatch("modification_date"),
    ],
)

# Initialize the linker on the DuckDB backend with one table per system
linker = Linker([df_oneDAM, df_box], settings, db_api=DuckDBAPI(),
                input_table_aliases=["OneDAM", "Box"])

# Score candidate duplicates (the model is untrained, so default parameters are used)
results = linker.inference.predict()

# Convert the results to a Pandas DataFrame
duplicates = results.as_pandas_dataframe()

# Show the first few records of possible duplicates
print(duplicates.head())

# Save the results to a CSV file
duplicates.to_csv('duplicates_results.csv', index=False)

# Download the file
files.download('duplicates_results.csv')


In [None]:
!pip install splink[spark] pandas duckdb



In [None]:
from google.colab import files
import pandas as pd

# Upload files
uploaded = files.upload()

Saving BoxData.ipynb to BoxData.ipynb
Saving onddam_examole_data.json to onddam_examole_data.json


In [None]:
# Load the uploaded files into Pandas DataFrames (assuming they are JSON files)
oneDAM_df = pd.read_json(list(BoxData.ipynb.keys())[0])  # Replace with actual file name
box_df = pd.read_json(list(onddam_examole_data.json.keys())[1])  # Replace with actual file name

# Add a column to indicate the source system
oneDAM_df['system'] = 'OneDAM'
box_df['system'] = 'Box'

# Combine both datasets into a single DataFrame
df_combined = pd.concat([oneDAM_df, box_df], ignore_index=True)

# Preview the data
df_combined.head()

NameError: name 'BoxData' is not defined

In [None]:
import pandas as pd

# Load the uploaded files into Pandas DataFrames (assuming they are JSON files)
oneDAM_df = pd.read_json(list(uploaded.keys())[0])  # replace with the actual OneDAM file name
box_df = pd.read_json(list(uploaded.keys())[1])  # replace with the actual Box file name

# Add a column to indicate the source system
oneDAM_df['system'] = 'OneDAM'
box_df['system'] = 'Box'

In [None]:
# After adding the column, combine both datasets into a single DataFrame for the next steps
df_combined = pd.concat([oneDAM_df, box_df], ignore_index=True)

# Preview the data
df_combined.head()

**# If the code above does not work, you can try this example data instead:**


In [None]:
import pandas as pd

# Example metadata for OneDAM
oneDAM_data = [
    {"file_name": "fileA.mp4", "file_size": 1000, "file_type": "video", "checksum": "md5abc", "created_date": "2023-01-01"},
    {"file_name": "fileB.mp3", "file_size": 500, "file_type": "audio", "checksum": "md5xyz", "created_date": "2022-12-15"}
]

# Example metadata for Box
box_data = [
    {"file_name": "fileA_renamed.mp4", "file_size": 1000, "file_type": "video", "checksum": "sha1abc", "created_date": "2023-01-02"},
    {"file_name": "fileC.mp3", "file_size": 600, "file_type": "audio", "checksum": "sha1xyz", "created_date": "2022-11-20"}
]

# Convert the data to DataFrames
oneDAM_df = pd.DataFrame(oneDAM_data)
box_df = pd.DataFrame(box_data)

# Add a system column
oneDAM_df['system'] = 'OneDAM'
box_df['system'] = 'Box'

# Combine both datasets into a single DataFrame
df_combined = pd.concat([oneDAM_df, box_df], ignore_index=True)

# Preview the combined data
df_combined.head()


## Next, configure the Splink linkage settings and run them against the prepared data.


In [None]:
from splink.duckdb.duckdb_linker import DuckDBLinker
from splink import Splink

# Define Splink settings for probabilistic linkage
settings = {
    "link_type": "link_and_dedupe",  # We want to dedupe and link across the two systems
    "blocking_rules_to_generate_predictions": [
        "l.file_size = r.file_size"  # Block by file size to limit comparisons
    ],
    "comparison_columns": [
        {
            "col_name": "file_name",
            "num_levels": 3,  # Fuzzy matching with multiple levels
            "term_frequency_adjustments": True
        },
        {
            "col_name": "file_size",
            "num_levels": 1,  # Exact match for file size
        },
        {
            "col_name": "file_type",
            "num_levels": 1,  # Exact match for file type
        },
        {
            "col_name": "checksum",
            "num_levels": 1,  # Exact match for checksum (even though algorithms differ, it's worth trying)
        },
        {
            "col_name": "system",  # Ensure we only compare files across systems (OneDAM vs Box)
            "num_levels": 1,
            "case_expression": "l.system != r.system"
        }
    ],
    "probability_two_random_records_match": 0.01,  # Small likelihood of a random match
}


In [None]:
!pip install splink[duckdb]
# Installs splink with DuckDB support



In [None]:
from splink.duckdb.duckdb_linker import DuckDBLinker

ModuleNotFoundError: No module named 'splink.duckdb'

In [None]:
# Next, initialize the Splink DuckDB linker
linker = DuckDBLinker(df_combined, settings)

# Run the linkage process

NameError: name 'DuckDBLinker' is not defined

In [None]:
# Now we can try to predict matches
df_predictions = linker.predict()

# Display the matching results
df_predictions[['l.file_name', 'r.file_name', 'match_probability']].head()

# You can put all of this together in app.py if you are working in VS Code



In [None]:
!pip install splink[spark] pandas duckdb

Collecting splink[spark]
  Downloading splink-4.0.4-py3-none-any.whl.metadata (12 kB)

In [None]:
!pip install splink duckdb



In [None]:
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator

# Define Splink settings for probabilistic linkage (Splink 4 API; the
# `splink.duckdb` module and `comparison_columns` keys belong to older releases)
settings = SettingsCreator(
    # df_combined is one concatenated table, so dedupe_only scores every candidate pair
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        # Block by file size to limit comparisons, and only pair files
        # across systems (OneDAM vs Box)
        "l.file_size = r.file_size AND l.system <> r.system"
    ],
    comparisons=[
        # Fuzzy matching on file name with multiple similarity levels
        cl.JaroWinklerAtThresholds("file_name", [0.9, 0.7]).configure(
            term_frequency_adjustments=True
        ),
        cl.ExactMatch("file_size"),  # Exact match for file size
        cl.ExactMatch("file_type"),  # Exact match for file type
        # Exact match for checksum (even though algorithms differ, it's worth trying)
        cl.ExactMatch("checksum"),
    ],
    # Keep these columns in the output for the later filtering and analysis cells
    additional_columns_to_retain=["system", "created_date"],
    probability_two_random_records_match=0.01,  # Small likelihood of a random match
)

# Initialize the Splink linker on the DuckDB backend
# (Splink expects a `unique_id` column in df_combined by default)
linker = Linker(df_combined, settings, DuckDBAPI())

# For meaningful probabilities, estimate model parameters before predicting,
# e.g. linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

# Predict matches and convert the resulting SplinkDataFrame to pandas
df_predictions = linker.inference.predict().as_pandas_dataframe()

# Display the matching results (pair columns carry _l / _r suffixes)
df_predictions[['file_name_l', 'file_name_r', 'match_probability']].head()
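
To see why blocking on `file_size` limits comparisons, here is a small standard-library sketch with made-up records (the file names, sizes, and ids are illustrative only). It counts every possible pair, then only the pairs that share a file size, which is the candidate set a rule like `l.file_size = r.file_size` passes on for scoring:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical metadata rows from the two systems
records = [
    {"id": 1, "system": "OneDAM", "file_name": "promo_final.mp4", "file_size": 1048576},
    {"id": 2, "system": "Box", "file_name": "promo-final.mp4", "file_size": 1048576},
    {"id": 3, "system": "Box", "file_name": "logo.png", "file_size": 20480},
    {"id": 4, "system": "OneDAM", "file_name": "logo_v2.png", "file_size": 20480},
    {"id": 5, "system": "Box", "file_name": "interview.wav", "file_size": 7340032},
]

# Without blocking, every record is compared with every other record
all_pairs = list(combinations(records, 2))

# With blocking, records are grouped by file size and compared only within a group
by_size = defaultdict(list)
for record in records:
    by_size[record["file_size"]].append(record)

blocked_pairs = [
    (left["id"], right["id"])
    for group in by_size.values()
    for left, right in combinations(group, 2)
]

print(len(all_pairs), "pairs without blocking")
print(len(blocked_pairs), "pairs after blocking on file_size:", blocked_pairs)
```

The trade-off is that two copies whose sizes differ (for example, a re-encoded video) are never compared under this rule, so an additional, looser blocking rule may be worth considering.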

# Next, filter and analyze the matching results
To do this, filter the results on the match probability score to get a list of potential duplicate files across Box and OneDAM.

In [None]:
# Filter results based on match probability (set a threshold, e.g., 0.95)
df_matches = df_predictions[df_predictions['match_probability'] > 0.95]

# Show matched pairs of files across OneDAM and Box
# (Splink names pair columns with _l / _r suffixes; `system` must be retained in the output)
df_matches[['file_name_l', 'file_name_r', 'system_l', 'system_r', 'match_probability']]

# You can now analyze the duplicate files on other fields from the given schema, such as created date, modified date, and other metadata.

# Make sure your data is clean and normalized, with no missing values.
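
A minimal cleaning sketch with pandas, assuming the column names used in this notebook (`file_name`, `file_type`, `file_size`, `created_date`). The sample rows and the helper name `clean_metadata` are hypothetical, and dropping incomplete rows is only one possible policy:

```python
import pandas as pd


def clean_metadata(df):
    """Normalize text fields, parse sizes and dates, and drop incomplete rows."""
    out = df.copy()
    # Trim whitespace and lowercase so "Promo.MP4 " and "promo.mp4" compare equal
    out["file_name"] = out["file_name"].str.strip().str.lower()
    out["file_type"] = out["file_type"].str.strip().str.lower().str.lstrip(".")
    # Unparseable sizes and dates become NaN / NaT instead of raising
    out["file_size"] = pd.to_numeric(out["file_size"], errors="coerce")
    out["created_date"] = pd.to_datetime(out["created_date"], errors="coerce", utc=True)
    # Drop rows that are missing any field the linkage relies on
    out = out.dropna(subset=["file_name", "file_size", "created_date"])
    out = out.reset_index(drop=True)
    return out.astype({"file_size": "int64"})


raw = pd.DataFrame(
    {
        "file_name": ["  Promo_Final.MP4 ", "logo.png", None],
        "file_type": [".MP4", "PNG", "wav"],
        "file_size": ["1048576", "not-a-number", "7340032"],
        "created_date": [
            "2023-01-15T10:00:00Z",
            "2023-02-01T08:30:00Z",
            "2023-03-01T00:00:00Z",
        ],
    }
)

cleaned = clean_metadata(raw)
print(cleaned)
```

Only the first sample row survives here: the second has an unparseable size and the third has no file name.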

In [None]:
# Analysis for duplicates (df_matches is already filtered to match_probability > 0.95)
duplicates = df_matches

# Print details of duplicates (`created_date` must be retained in the Splink output)
print(duplicates[['file_name_l', 'file_name_r', 'created_date_l', 'created_date_r']])
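
Since the project objective includes capturing last-modified data, one possible follow-up is to mark, for each matched pair, which copy was modified most recently. This is a sketch under assumptions: the pair table is taken to carry `modified_date_l` and `modified_date_r` columns (they would have to be retained in the linkage output), the rows are invented, and `keep_system` / `keep_file` are hypothetical column names:

```python
import pandas as pd

# Hypothetical matched pairs, shaped like a filtered predictions table
pairs = pd.DataFrame(
    {
        "file_name_l": ["promo_final.mp4", "logo.png"],
        "file_name_r": ["promo-final.mp4", "logo_v2.png"],
        "system_l": ["OneDAM", "Box"],
        "system_r": ["Box", "OneDAM"],
        "modified_date_l": ["2023-05-01", "2023-01-10"],
        "modified_date_r": ["2023-04-20", "2023-02-15"],
        "match_probability": [0.99, 0.97],
    }
)

# Parse both date columns so they compare chronologically, not as strings
for side in ("l", "r"):
    column = f"modified_date_{side}"
    pairs[column] = pd.to_datetime(pairs[column])

# Ties go to the left-hand record
left_is_newer = pairs["modified_date_l"] >= pairs["modified_date_r"]
pairs["keep_system"] = pairs["system_l"].where(left_is_newer, pairs["system_r"])
pairs["keep_file"] = pairs["file_name_l"].where(left_is_newer, pairs["file_name_r"])

print(pairs[["keep_system", "keep_file", "match_probability"]])
```

Keeping the newest copy is only one policy; an archive might instead prefer the oldest original or the copy with the richest metadata.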