In [4]:
import pandas as pd
import xml.etree.ElementTree as ET
import os
import logging
from tempfile import NamedTemporaryFile
import time
import subprocess
from math import ceil
import requests
import gzip
import io
import re
from datetime import datetime


### Background Information: SEC ADV Filings

The SEC Form ADV is a key regulatory document used by investment advisers to register with the U.S. Securities and Exchange Commission (SEC). It serves as a comprehensive disclosure form that provides detailed information about an investment adviser’s business, including its operations, services, and any potential conflicts of interest.

**Key Components of the SEC Form ADV:**

1. **Part 1A:** This section contains information about the adviser's business, ownership, clients, employees, business practices, affiliations, and any disciplinary events of the adviser or its employees. It also includes details about the private funds managed by the adviser.

2. **Part 2A (Brochure):** This section provides a narrative description of the adviser’s business practices, fees, conflicts of interest, and disciplinary information. It is designed to be a plain-English disclosure document that is given to clients.

3. **Part 2B (Brochure Supplement):** This section includes information about the advisory personnel on whom clients rely for investment advice, including their education, business experience, and any disciplinary history.

**Why is the SEC Form ADV Important?**

- **Transparency:** The form promotes transparency by providing clients and regulators with essential information about the adviser’s business practices and any potential conflicts of interest.
- **Regulatory Compliance:** Filing the Form ADV is a regulatory requirement for investment advisers, ensuring they operate within the legal framework set by the SEC and state authorities.
- **Investor Protection:** By disclosing detailed information about their operations and practices, advisers help protect investors from fraud and misrepresentation.

**Private Fund Reporting:**

Within the SEC Form ADV, Sections 7.B.(1) and 7.B.(2) are specifically focused on private fund reporting. These sections require advisers to provide detailed information about the private funds they manage or advise, including the fund’s name, type, gross asset value, and regulatory status. This information helps regulators and investors understand the scope and nature of the adviser’s involvement with private funds.

**Task Context:**

For this task, you will be working with data extracted from SEC ADV filings. You will download metadata and PDFs for specific firms, extract relevant information, and analyze it to identify top-performing funds. This exercise will help you understand how regulatory filings can be used to gather critical information about investment advisers and their managed funds.

# Get the metadata

In [1]:
# Set the WorkingDate to the latest date, e.g., Feb 14, 2025
WorkingDate = datetime(2025, 2, 14)


path = rf"https://reports.adviserinfo.sec.gov/reports/CompilationReports/IA_FIRM_SEC_Feed_{WorkingDate:%m_%d_%Y}.xml.gz" 

response = requests.get(path, timeout=60)

# Stop here if the download failed; otherwise `df` would be undefined below
response.raise_for_status()

# Unzip the content
with gzip.GzipFile(fileobj=io.BytesIO(response.content)) as gz:
    # Read the data into a dataframe
    df = pd.read_xml(gz, xpath='//Info')

# Create a column with paths for the pdfs
df['DownloadPath'] = df['FirmCrdNb'].apply(lambda x: f"https://reports.adviserinfo.sec.gov/reports/ADV/{x}/PDF/{x}.pdf")
df.set_index('FirmCrdNb', inplace = True)

df


# Download the PDF text

In [1]:
# Define the directory to save the downloaded files
save_dir = "adv_pdfs"  # placeholder folder name; change as needed
os.makedirs(save_dir, exist_ok=True)

count = 0 # number of files skipped

# Setup log to capture if any files failed to download:
logging.basicConfig(filename='download_logs.log', level = logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

start_time = time.time()

for url in df['DownloadPath']: 
    FileName = url.split('/')[-1]
    save_path = os.path.join(save_dir, FileName)
    
    # Check if the file already exists
    if os.path.isfile(save_path):
        count += 1
        logging.info(f"File {FileName} already exists. Skipping download.")
        continue

    # Create the request
    response = requests.get(url)
    
    if response.status_code == 200:
        with open(save_path, 'wb') as file:
            file.write(response.content)
        logging.info(f"Downloaded {FileName}")
    else:
        logging.error(f"Failed to download {FileName}. Status code: {response.status_code}")

end_time = time.time()

duration_minutes = (end_time - start_time) / 60

# Log the total time taken in minutes 
logging.info(f"Completed the download process in {duration_minutes:.2f} minutes.")


# Task:

## 1. Download Metadata and PDFs (Estimated Time: 30-45 minutes):
Using the starter code provided, download the metadata and PDFs for the following FirmCrdNb values. Use the provided URLs to download the files and save them locally. Implement error handling and logging:
* 160882
* 160021
* 1679500
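
One possible shape for the download step is sketched below. It reuses the URL pattern from the starter code and adds retries with logging. The names `pdf_url`, `download_with_retry`, and the injected `fetch` callable are illustrative assumptions, not part of the starter code; `fetch` is any function that takes a URL and returns bytes, raising on failure.

```python
import logging
import time
from pathlib import Path

TARGET_CRDS = [160882, 160021, 1679500]  # FirmCrdNb values listed above


def pdf_url(crd: int) -> str:
    """Build the ADV PDF URL using the pattern from the starter code."""
    return f"https://reports.adviserinfo.sec.gov/reports/ADV/{crd}/PDF/{crd}.pdf"


def download_with_retry(url, save_path, fetch, retries=3, backoff=1.0):
    """Save the bytes returned by fetch(url) to save_path.

    Returns True on success (or if the file already exists), False if every
    attempt failed. `fetch` is a hypothetical callable supplied by the caller.
    """
    save_path = Path(save_path)
    if save_path.is_file():
        logging.info("File %s already exists. Skipping download.", save_path.name)
        return True
    for attempt in range(1, retries + 1):
        try:
            content = fetch(url)
        except Exception as exc:  # broad on purpose: log and retry any failure
            logging.warning("Attempt %d/%d failed for %s: %s", attempt, retries, url, exc)
            if attempt < retries:
                time.sleep(backoff * attempt)  # simple linear backoff
            continue
        save_path.parent.mkdir(parents=True, exist_ok=True)
        save_path.write_bytes(content)
        logging.info("Downloaded %s", save_path.name)
        return True
    logging.error("Giving up on %s after %d attempts", url, retries)
    return False
```

With `requests`, `fetch` could be a small wrapper that calls `requests.get(url, timeout=30)`, then `raise_for_status()`, and returns `response.content`. Injecting it keeps the retry logic testable without network access.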

## 2. Extract and Store Information (Estimated Time: 1-1.5 hours):
* Task: Extract specific information from the downloaded PDFs and store it in a local SQLite database.
* Details: Extract fields such as FirmCrdNb, SECNb, Business Name, Full Legal Name, Address, Phone Number, Compensation Arrangements, Number of employees performing investment advisory functions, Type of Client and Amount of Regulatory Assets Under Management, Names of Private Fund and Private Fund Identification Number, and Signatory of the PDF.
* In practice, you will deal with tens of thousands of files. Your code should systematically parse the text and extract the relevant information, as scalability is an important factor.
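
As a starting point for the storage side, the sketch below lays out one possible SQLite schema using the standard-library `sqlite3` module. The table and column names are assumptions chosen to mirror the fields listed above; one-to-many fields (private funds, client types) get their own tables so a firm can have any number of rows. The upsert uses `ON CONFLICT ... DO UPDATE`, which assumes SQLite 3.24 or newer.

```python
import sqlite3

# Hypothetical column names mirroring the fields in the task description.
FIRM_COLUMNS = [
    "firm_crd_nb", "sec_nb", "business_name", "full_legal_name", "address",
    "phone_number", "compensation_arrangements", "advisory_employees", "signatory",
]

SCHEMA = """
CREATE TABLE IF NOT EXISTS firms (
    firm_crd_nb INTEGER PRIMARY KEY,
    sec_nb TEXT,
    business_name TEXT,
    full_legal_name TEXT,
    address TEXT,
    phone_number TEXT,
    compensation_arrangements TEXT,
    advisory_employees INTEGER,
    signatory TEXT
);
CREATE TABLE IF NOT EXISTS private_funds (
    fund_id TEXT NOT NULL,
    firm_crd_nb INTEGER NOT NULL REFERENCES firms(firm_crd_nb),
    fund_name TEXT,
    PRIMARY KEY (fund_id, firm_crd_nb)
);
CREATE TABLE IF NOT EXISTS client_assets (
    firm_crd_nb INTEGER NOT NULL REFERENCES firms(firm_crd_nb),
    client_type TEXT NOT NULL,
    raum_usd REAL,
    PRIMARY KEY (firm_crd_nb, client_type)
);
"""


def init_db(path=":memory:"):
    """Open the database, enable foreign keys, and create the tables."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def upsert_firm(conn, firm):
    """Insert a firm row, or overwrite the existing row with the same CRD."""
    row = {col: firm.get(col) for col in FIRM_COLUMNS}  # missing fields become NULL
    cols = ", ".join(FIRM_COLUMNS)
    params = ", ".join(f":{c}" for c in FIRM_COLUMNS)
    updates = ", ".join(f"{c} = excluded.{c}" for c in FIRM_COLUMNS[1:])
    with conn:  # commits on success, rolls back on error
        conn.execute(
            f"INSERT INTO firms ({cols}) VALUES ({params}) "
            f"ON CONFLICT(firm_crd_nb) DO UPDATE SET {updates}",
            row,
        )


def add_private_fund(conn, firm_crd_nb, fund_id, fund_name):
    """Record a private fund for a firm; re-adding the same fund is a no-op."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO private_funds (fund_id, firm_crd_nb, fund_name) "
            "VALUES (?, ?, ?)",
            (fund_id, firm_crd_nb, fund_name),
        )
```

Making the writes idempotent means the pipeline can be re-run over the same PDFs without duplicating rows, which matters once the file count reaches the tens of thousands.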

## 3. Data Transformation and Analysis (Estimated Time: 30-45 minutes):
* Task: Perform data transformation and analysis using Pandas.
* Details: Clean and transform the extracted data, and perform basic analysis such as identifying the top-performing funds based on specific criteria (e.g., assets under management).
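
A typical cleaning step is turning dollar strings from the extracted text into numbers before ranking. The sketch below shows this with the standard library only; `parse_usd`, `top_n_by_aum`, and the `"aum"` key are hypothetical names, and the input format (`"$ 1,234,567"`) is an assumption about how amounts appear in the extracted text.

```python
import re
from typing import Optional

# Digits with optional thousands separators and an optional decimal part.
_AMOUNT_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_usd(text) -> Optional[float]:
    """Convert strings such as '$ 1,234,567' to a float; None if no number is found."""
    if text is None:
        return None
    match = _AMOUNT_RE.search(str(text))
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def top_n_by_aum(records, n=3):
    """Return the n records with the largest parsed 'aum' value, largest first.

    Records whose 'aum' cannot be parsed are dropped rather than ranked as zero.
    """
    parsed = [(parse_usd(r.get("aum")), r) for r in records]
    ranked = sorted(
        (pair for pair in parsed if pair[0] is not None),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [record for _, record in ranked[:n]]
```

In a Pandas workflow the same helper can be applied with `df["aum"].map(parse_usd)`, after which `DataFrame.nlargest` gives the ranking.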

## 4. Generate Excel File (Estimated Time: 15-30 minutes):
* Task: Extract the following information from the SQLite database and output it in an **Excel file**. Keep in mind that this Excel file will ultimately be used by BD recruiters, who are considered non-technical users:
* FirmCrdNb
* SECNb
* Business Name
* Full Legal Name
* Address
* Phone Number
* Compensation Arrangements
* Number of employees performing investment advisory functions, including research
* Type of Client and Amount of Regulatory Assets Under Management
* Names of Private Fund and Private Fund Identification Number
* Signatory of the PDF

## 5. Scalability and Performance (Discussion - Estimated Time: 15-30 minutes):
* Task: Optimize the data pipeline for performance.
* Details: Write a brief explanation of how you would handle scalability and performance issues if the dataset were significantly larger.
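
One option worth discussing here is running the I/O-bound download step concurrently. The sketch below uses `concurrent.futures.ThreadPoolExecutor` from the standard library; `run_concurrently` and its `worker` argument are hypothetical names, and the default of 8 workers is an arbitrary assumption that would need tuning against the server's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_concurrently(items, worker, max_workers=8):
    """Apply worker(item) to every item in a thread pool.

    Returns (results, errors): two dicts keyed by item, so one failing item
    does not abort the whole batch. Items must be hashable.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:  # collect the failure and keep going
                errors[item] = exc
    return results, errors
```

Threads suit downloads because the time is spent waiting on the network. PDF text extraction is CPU-bound, so a `ProcessPoolExecutor` with the same submit-and-collect pattern is usually the better fit for that stage.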

## 6. Integration with External Systems (Estimated Time: 1-1.5 hours):
* Task: Simulate integration with an external system using FastAPI.
* Details: Write a Python script that fetches additional data from a mock API backed by your SQLite database, using standard Python frameworks and data models. Create three GET and POST endpoints and demonstrate data quality checks and validations. Provide two tests for each call. Consider and handle potential edge cases.

## 7. Automated Testing and Data Quality (Discussion - Estimated Time: 15-30 minutes):
* Task: Write a brief explanation of how you would implement automated testing and ensure data quality.
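
To make the discussion concrete, a record-level check like the one below can run both in the pipeline and in unit tests. It is a minimal sketch: `validate_firm`, the dictionary keys, and every rule in it are assumptions, including the SEC number pattern (`801-` or `802-` followed by digits), which should be confirmed against real filings.

```python
import re

# Assumed format for SECNb values; verify against the actual data.
SEC_NB_RE = re.compile(r"^80[12]-\d+$")


def validate_firm(record):
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = []

    crd = record.get("firm_crd_nb")
    # bool is a subclass of int, so exclude it explicitly
    if not isinstance(crd, int) or isinstance(crd, bool) or crd <= 0:
        problems.append("firm_crd_nb must be a positive integer")

    sec_nb = record.get("sec_nb")
    if sec_nb is not None and not SEC_NB_RE.match(str(sec_nb)):
        problems.append(f"sec_nb has an unexpected format: {sec_nb!r}")

    if not (record.get("business_name") or "").strip():
        problems.append("business_name is empty")

    employees = record.get("advisory_employees")
    if employees is not None and (not isinstance(employees, int) or employees < 0):
        problems.append("advisory_employees must be a non-negative integer")

    return problems
```

Returning all problems at once, rather than raising on the first, lets the pipeline log a complete report per firm and lets tests assert on specific messages.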

## 8. Identify Top Performing Funds (Discussion - Estimated Time: 15-30 minutes):
* Describe how you would use the information above to identify top-performing funds.
* List any questions you would ask regarding the task.
* Specify any additional information you would need.

## 9. Data Visualization (Optional - Estimated Time: 30-45 minutes (if included)):
* Task: Create visualizations using a Python library such as Matplotlib or Seaborn.
* Details: Generate plots to visualize key metrics and insights from the data.

## [NOTE] Please do not spend more than 6 hours on this task. Also, provide the time it took you to complete it.

# Your code