# DATA PREPROCESSING STEPS

For preprocessing, we define a class called **YearlyFiling13F** that stores the stock portfolio information of the top investment managers for a particular year.

An object is created in 3 steps:
- Scrape 'https://www.advratings.com/top-asset-management-firms' to get the top investors in the market, stored in *funds_list*
- Find the URLs of those investors' 13F-HR filings in the SEC EDGAR database for a particular year, stored in *urls*
- Fetch all the necessary information, stored in *data_dict* & *names_dict*
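
The matching between scraped names and index lines can be sketched as follows. This is a simplified illustration, not the class's exact logic: it assumes each index line is laid out as form type, company name, CIK, filing date and relative path, it skips amended filings, and the helper names `normalize` and `matching_url` as well as the sample line are made up for the example.

```python
ARCHIVE_ROOT = 'https://www.sec.gov/Archives/'

def normalize(name):
    """Keep only alphanumeric characters and upper-case the result."""
    return ''.join(ch for ch in name if ch.isalnum()).upper()

def matching_url(index_line, funds_list):
    """Return the filing URL if the line is a 13F-HR entry of a listed fund, else None."""
    fields = index_line.split()
    if len(fields) < 5 or fields[0] != '13F-HR':
        return None
    # Assumed layout: form type, company name, CIK, filing date, relative path.
    company = normalize(' '.join(fields[1:-3]))
    if len(company) <= 3:
        return None
    # Substring match in either direction, as in the class.
    if any(company in fund or fund in company for fund in funds_list):
        return ARCHIVE_ROOT + fields[-1]
    return None

# Hypothetical fund name and index line, for illustration only.
funds_list = [normalize('Example Asset Management')]
line = ('13F-HR  EXAMPLE ASSET MANAGEMENT LLC  0000000  2018-02-08  '
        'edgar/data/0000000/0000000000-18-000000.txt')
print(matching_url(line, funds_list))
```

Matching on normalized substrings is tolerant of suffixes such as "LLC" or "INC", but it can also produce false positives for short or generic names, so the resulting list is worth checking by hand.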

The object has the following important attributes:
- funds_list: the list of the top investors' names
- urls: the URLs of the top investment managers' 13F-HR filings in the SEC EDGAR database
- data_dict: a nested dictionary that maps each CIK to a dictionary of {issuer : total_amount}
- names_dict: a dictionary mapping each CIK to the manager's name
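
To make the shape of the two dictionaries concrete, here is a minimal sketch that builds them from a few rows. The CIKs, issuer names and share counts are invented for illustration and do not come from any filing.

```python
from collections import defaultdict

# Hypothetical holdings rows: (cik, issuer, number of shares).
rows = [
    ('0000000001', 'ISSUER A', 100),
    ('0000000001', 'ISSUER A', 50),   # same issuer listed twice in one filing
    ('0000000001', 'ISSUER B', 10),
    ('0000000002', 'ISSUER A', 7),
]

# names_dict: CIK -> manager name (hypothetical names).
names_dict = {
    '0000000001': 'EXAMPLE FUND ONE',
    '0000000002': 'EXAMPLE FUND TWO',
}

# data_dict: CIK -> {issuer: total_amount}, summing repeated issuers.
data_dict = defaultdict(dict)
for cik, issuer, shares in rows:
    data_dict[cik][issuer] = data_dict[cik].get(issuer, 0) + shares
data_dict = dict(data_dict)

print(data_dict)
# {'0000000001': {'ISSUER A': 150, 'ISSUER B': 10}, '0000000002': {'ISSUER A': 7}}
```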

The object has one important method:
- create_JSON_files(): writes *data_dict* and *names_dict* to JSON files for the given year, so that later analysis steps can load them directly.
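
The store-and-retrieve round trip can be sketched as below. The helpers `save_year` and `load_year` are hypothetical and not part of the class; the sketch reuses the `data<year>.json` / `names<year>.json` naming and writes to a temporary directory so that it runs anywhere.

```python
import json
import os
import tempfile

def save_year(data_dict, names_dict, year, out_dir):
    """Write data<year>.json and names<year>.json into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for prefix, payload in (('data', data_dict), ('names', names_dict)):
        with open(os.path.join(out_dir, prefix + year + '.json'), 'w') as json_file:
            json.dump(payload, json_file)

def load_year(year, out_dir):
    """Read both dictionaries for one year back from out_dir."""
    loaded = []
    for prefix in ('data', 'names'):
        with open(os.path.join(out_dir, prefix + year + '.json')) as json_file:
            loaded.append(json.load(json_file))
    return tuple(loaded)

# Made-up values, only to show the round trip.
data_dict = {'0000000001': {'ISSUER A': 150}}
names_dict = {'0000000001': 'EXAMPLE FUND ONE'}

with tempfile.TemporaryDirectory() as out_dir:
    save_year(data_dict, names_dict, '2018', out_dir)
    data_2018, names_2018 = load_year('2018', out_dir)

print(data_2018 == data_dict, names_2018 == names_dict)
```

JSON object keys are always strings, which is one reason to keep the CIKs as strings rather than integers.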



In [1]:
import glob
import requests
from bs4 import BeautifulSoup
import pandas as pd
import datetime
import re
import csv
import numpy as np
import networkx as nx
from networkx.algorithms import bipartite
import json
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
class YearlyFiling13F:
    """ 
        Class representing the stock portfolio information of the top investors in a particular year.
        The necessary information is obtained in 3 steps:
        1. Scrape 'https://www.advratings.com/top-asset-management-firms' to get the top investors in the market
        2. Find the 13F-HR filing data of those investors in the SEC EDGAR database for a particular year
        3. Create JSON files to store and retrieve the data for the desired year, which simplifies later analysis steps.
    """
    
    # If True prints out results in console
    debug = False
    
    """ 
        Get the biggest investment managers 
        - Scrape the website 'https://www.advratings.com/top-asset-management-firms' to get the names of the top investors
        - Store the list of the top investors' names in funds_list
    """
    def __init__(self,year=''):
        funds_list = []
        url = 'https://www.advratings.com/top-asset-management-firms'
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')

        for row in soup.findAll('table')[0].tbody.findAll('tr'):
            company = str(row.findAll('td')[1].contents)
            company = re.split(r'<|>', company)
            if len(company) > 2:
                # the name is wrapped in a tag: take the text between '>' and '<'
                company = company[2]
            else:
                # plain text cell: take the text between the quotes
                company = re.split(r'([\'|\'])', company[0])[2]
            # exclude any special characters and white space from the company name
            company = ''.join(e for e in company if e.isalnum())
            funds_list.append(company.upper())
        # Initialize object
        self.funds_list = funds_list[1:]
        self.filepath = (glob.glob("13F/*" + year + ".txt"))[0] # Path of file
        self.year = year
        # Collect the URLs and parse the filings for the given year
        self.get_URLs(self.filepath)
            
    def get_URLs(self, filepath=''):
        """ Parse the relevant information from the 13F-HR index text file 
            Get the paths to all 13F-HR filings per quarter
            Keep only the filings of managers in funds_list - the list of the top asset investment managers
        """
        self.filepath = filepath # Path of file
        
        all_urls = {}
        path = 'https://www.sec.gov/Archives/'
        forms_url = []
        # open the index file in a with-block so that it is closed after reading
        with open(filepath, 'r') as file:
            for line in file:
                # parse the company name out of the line
                company = re.findall(r'13F-HR\s*\d*([\D+\s\D+]*)\s*\d*', line)
                # string processing to get uniform formatting
                company = ''.join(e for e in company)
                company = company.replace(' ', '')
                company = re.sub(r'\d', '', company)
                company = company.upper()
                # keep the investment managers that match the list of the top investment managers *funds_list*
                for name in self.funds_list:
                    if (company in name or name in company) and len(company) > 3:
                        splitted = line.split()
                        forms_url.append(path + splitted[-1])
                        break # add each line's URL once, even if several names match

        # store the URLs under the name of the index file
        all_urls[file.name.split('/')[-1]] = forms_url
        self.urls = forms_url
        # parse the data from the urls
        data_dict = {} # nested dictionary: for each CIK, a dictionary of {issuer : total_amount}
        names_dict = {} # a dictionary linking each CIK to the name
        
        for url in forms_url: 
            page = requests.get(url)
            data = page.text
            soup = BeautifulSoup(data, "lxml")

            cik_pattern = r'\s*CENTRAL INDEX KEY:\s*(\w[\w*|\s*]*)\n'
            cik_key = re.findall(cik_pattern, data)
            cik_key = str(cik_key[0]) if cik_key != [] else None
            # start a fresh record for this filing (a later filing with the same CIK replaces an earlier one)
            data_dict[cik_key] = {}

            name_pattern = r'\s*COMPANY CONFORMED NAME:\s*(\w[\w*|\s*]*)\n*'
            name = re.findall(name_pattern, data)
            name = name[0].split('\n') if name != [] else None 
            names_dict[cik_key] = name[0] if name is not None else None

            stocklist = soup.find_all('infotable')
            for s in stocklist:
                # the tags may or may not carry the "ns1:" namespace prefix
                prefix = "ns1:" if s.find("ns1:nameofissuer") is not None else ""
                # company (issuer) name
                n = s.find(prefix + "nameofissuer").string
                amount = int(s.find(prefix + "shrsorprnamt").find(prefix + "sshprnamt").string)
                # keep one record per issuer: sum the amounts if the issuer appears more than once
                data_dict[cik_key][n] = data_dict[cik_key].get(n, 0) + amount
        
        # remove CIKs with an empty holdings dictionary or without a name
        self.data_dict = dict(filter(lambda sub: sub[1], data_dict.items()))
        self.names_dict = dict(filter(lambda sub: sub[1], names_dict.items()))
        return
    
    def create_JSON_files(self):
        """ Write data_dict and names_dict to JSON files for this year
            The 'json_pipeline' directory must already exist
        """
        with open('json_pipeline/data'+ self.year +'.json', 'w') as json_file:
            json.dump(self.data_dict, json_file)

        with open('json_pipeline/names'+ self.year +'.json', 'w') as json_file:
            json.dump(self.names_dict, json_file)
        return

# EXTRACT DATA AND CREATE JSON_FILES

In this step, we build all the data necessary for the analysis. The number of years can be chosen freely, depending on the purpose of the analysis.

Steps:
- select a year and pass it to YearlyFiling13F
- create the JSON files

In this report, we analyze the years 2018 to 2021, so we create 4 objects and their JSON files: 

In [3]:
#step1
Filing2018 = YearlyFiling13F('2018')

Filing2019 = YearlyFiling13F('2019')

Filing2020 = YearlyFiling13F('2020')

Filing2021 = YearlyFiling13F('2021')

In [6]:
#step2
Filing2018.create_JSON_files()

Filing2019.create_JSON_files()

Filing2020.create_JSON_files()

Filing2021.create_JSON_files()