In [1]:
import pandas as pd
import numpy as np
import os
from tqdm import tqdm     # progress bar on loops
from NEAR_regex import NEAR_regex  # copy this file into the asgn folder
from bs4 import BeautifulSoup
import re
from time import sleep

#if tqdm issues, run this in terminal or with ! trick
#jupyter nbextension enable --py widgetsnbextension
#jupyter labextension install @jupyter-widgets/jupyterlab-manager

os.makedirs('output',exist_ok=True)

## User Note

_This file parses the Wikipedia pages._
- _Parsing 10-Ks requires ONLY ONE CHANGE: Change how `load_me` is defined so that it refers to the 10-K folder._
- _That change produces RADICALLY different measures!_

| My measure of | Fraction that aren't zero with wiki pages | Fraction that aren't zero with 10-K pages |
| --- | --- | --- |
| tax_risks    | 0.003960 |   0.786139 |
| tariff_risk  | 0.000000 |  0.225743 |
| fincon       | 0.140594 |  0.899010 |
| proprietary  | 0.003960 |  0.544554 |

HOLY COW! The exact same searches produce radically different results.

But this shouldn't be surprising: There is a **lot** more info in the 10-K files! The wiki pages are just 86 MB on my computer, but the 10-K files are 2.4 GB! As a result, parsing the 10-Ks took 15 minutes on my computer, whereas the wiki pages are parsed in just 30 seconds.

![](https://media.giphy.com/media/gwv52hvs09UBSVyqHn/source.gif)

## Defining the searches

These searches were defined with a slightly different project in mind, but the general ideas of _**how**_ and _**why**_ they are set up this way still apply:

### Tax risk exposure

Technology firms are often involved in large amounts of creative accounting to reduce tax bills. To identify when a firm is negatively exposed to possible tax changes, I look for a firm mentioning a "risk term" near "tax" (or similar) and "changes". 

**HIT:** "A change to tax policies could negatively affect profits"

**NOT A HIT:** "A change to tax policies is likely"

In [2]:
# search terms only; the 25-word maximum gap is set when NEAR_regex is called in the loop below
tax_risks = ['(risk|risks|could harm|negative|negatively|uncertain)',
            '(tax|taxes|taxation)',
            '(change|new|law|policy|policies|regulation|regulations)']
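
To make the HIT / NOT A HIT distinction concrete, here is a minimal sketch of a "near" search. Note that `near_pattern` is a hypothetical, simplified stand-in written only for this illustration. It is not the code inside `NEAR_regex`, and the real function may differ in its details. The sketch assumes every term group must appear, in any order, with at most `max_words_between` words between neighboring groups.

```python
import re
from itertools import permutations

def near_pattern(groups, max_words_between=5):
    """Hypothetical, simplified stand-in for NEAR_regex (not the real implementation)."""
    # lazily allow up to N whole words, plus separators, between two term groups
    gap = r'(?:\W+\w+){0,%d}?\W+' % max_words_between
    alternatives = []
    # try every ordering of the term groups so the match is order-independent
    for ordering in permutations(groups):
        alternatives.append(gap.join(r'\b' + g + r'\b' for g in ordering))
    return '(?:' + '|'.join(alternatives) + ')'

demo_terms = ['(risk|risks|could harm|negative|negatively|uncertain)',
              '(tax|taxes|taxation)',
              '(change|new|law|policy|policies|regulation|regulations)']
rgx = near_pattern(demo_terms, max_words_between=25)

# both sentences are already lower-cased and stripped of punctuation
hit = 'a change to tax policies could negatively affect profits'
miss = 'a change to tax policies is likely'

n_hit = len(re.findall(rgx, hit))    # all three groups present -> 1 match
n_miss = len(re.findall(rgx, miss))  # no risk term -> 0 matches
print(n_hit, n_miss)
```

The second sentence mentions "change" and "tax" but never a risk term, so it is not counted.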

### Tariffs

Technology firms often ship products across international borders. To identify when a firm is negatively exposed to possible tariff changes, I look for a firm mentioning a "risk term" near "tariff" (or similar) and "changes". 

**HIT:** "A change to tariff policies could negatively affect profits"

**NOT A HIT:** "A change to tariff policies is likely"

In [3]:
# search terms only; the 25-word maximum gap is set when NEAR_regex is called in the loop below
tariff_search = ['(risk|risks|could harm|negative|negatively)',
                '(tariff|tariffs)',
                '(change|new|law|policy|policies|regulation|regulations)']

### Financial constraints

Technology firms tend to be younger and smaller than other public firms. According to published research, young and small firms also tend to be financially constrained.

Following [Hoberg and Maksimovic](https://poseidon01.ssrn.com/delivery.php?ID=875082005085007108066003027097109092018052053087053016092066101124083072025114105026038106063111031098097099020098001110068066029018023080043026109080070118114124088008042110092095070091123122124087109120115122022004003119096075106076087081087092093&EXT=pdf), I define a firm as financially constrained if it discusses "curtailing" near "investment". The full lists, below, come from the paper.

In [4]:
# this list comes from page 9 of the WP version of Hoberg and Maksimovic (link above)

# allow for partial matches and a max gap of 25 (they use 12, but our text is messier)
fin_constraints = ['(delay|abandon|eliminate|curtail|scale back|postpone)',
                   '(construction|expansion|acquisition|restructuring|project|research|development|exploration|product|expenditure|manufactur|entry|renovat|growth|activities|capital improvement|capital spend|capital proj|commercial release|business plan|transmitter deployment|opening restaurants)' ]
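
Several entries in the second list are stems ("manufactur", "renovat"), which is why this search allows partial matches. The sketch below shows the general idea with plain `re`. It is an assumption about what partial matching means here, not a description of how `NEAR_regex` implements `partial=True`.

```python
import re

text = 'we may postpone manufacturing expansion and delayed renovations'

# whole-word matching: the stems never appear as standalone words, so nothing matches
whole_word = re.findall(r'\b(?:manufactur|renovat)\b', text)

# partial matching: let the stem be followed by any word characters
partial = re.findall(r'\b(?:manufactur|renovat)\w*', text)

print(whole_word)
print(partial)
```

With whole-word matching the stems find nothing, while the partial version picks up "manufacturing" and "renovations".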

### Proprietary Information Leak Risk

A crucial task for technology firms is protecting their IP. Following [Hoberg and Maksimovic](https://poseidon01.ssrn.com/delivery.php?ID=875082005085007108066003027097109092018052053087053016092066101124083072025114105026038106063111031098097099020098001110068066029018023080043026109080070118114124088008042110092095070091123122124087109120115122022004003119096075106076087081087092093&EXT=pdf) again, I define firms worried about IP leaks as those that discuss "protecting" near "trade secrets" or "proprietary information". I could use a larger list, but this definition has been vetted.


In [5]:
proprietary_information_risks = ['(protect|safeguard)',
                                '(trade secret|proprietary information|confidential information)']

In [6]:
# add blank new variables for each of the searches

sp500 = (pd.read_csv('inputs/sp500_with_url.csv')
         .assign(tax_risks = np.nan,
                 tariff_risk = np.nan,
                 fincon = np.nan,
                 proprietary = np.nan,))


## Loop over and parse/search wiki pages

In [7]:
for index, row in tqdm(sp500.iterrows(), total=len(sp500)):
      
    load_me = 'text_files/' + row['Symbol'] + '.html'

    if os.path.exists(load_me):
        
        # open file
        with open(load_me,'r',encoding='utf-8') as f:
            text = f.read()
        
        # clean the page text before searching (lowercase, drop punctuation, collapse whitespace)
        lower = BeautifulSoup(text).get_text().lower()
        no_punc = re.sub(r'\W',' ',lower)
        cleaned = re.sub(r'\s+',' ',no_punc).strip()
        
        # search    
        rgx   = NEAR_regex(tax_risks,max_words_between=25)
        sp500.loc[index,"tax_risks"] = len(re.findall(rgx,cleaned)) 
        
        rgx   = NEAR_regex(tariff_search,max_words_between=25)
        sp500.loc[index,"tariff_risk"] = len(re.findall(rgx,cleaned)) 

        rgx   = NEAR_regex(fin_constraints,max_words_between=25,partial=True)
        sp500.loc[index,"fincon"] = len(re.findall(rgx,cleaned)) 

        rgx   = NEAR_regex(proprietary_information_risks,max_words_between=25,partial=True)
        sp500.loc[index,"proprietary"] = len(re.findall(rgx,cleaned)) 
                

100%|████████████████████████████████████████████████████████████████████████████████| 505/505 [00:57<00:00,  8.73it/s]
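
To see what the two `re.sub` cleaning steps in the loop do, here is a small standalone sketch. `clean_text` and the sample sentence are made up for illustration, and the HTML-stripping step (`BeautifulSoup(...).get_text()`) is left out so the example runs with the standard library only.

```python
import re

def clean_text(raw_text):
    """Illustrative helper mirroring the loop's cleaning steps (minus HTML stripping)."""
    lower = raw_text.lower()
    no_punc = re.sub(r'\W', ' ', lower)             # every non-word character becomes a space
    return re.sub(r'\s+', ' ', no_punc).strip()     # collapse runs of whitespace

sample = "The Company's trade-secret\n  protections (see Item 1A) could CHANGE."
cleaned_sample = clean_text(sample)
print(cleaned_sample)
```

One side effect worth knowing: hyphens and apostrophes become spaces, so "trade-secret" turns into "trade secret" (which the proprietary search can then match), and "Company's" turns into "company s".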


## Examining my proposed measures

Initial findings: 

1. The tariff risks search is ill-defined for Wikipedia pages - **it is always zero.** I'd have to simplify it, probably to a mere discussion of "tariffs".
2. The proprietary and tax searches also seem to be poorly defined. Only eBay, Oracle, Amazon, and Exxon show up when you run the query below, despite the S&P 500 containing many firms that rely on patents and that use aggressive tax management policies:

    ```python
    sp500.query('(proprietary != 0) | (tax_risks != 0)')
    ```
    
3. 14% (see below) of S&P 500 firms have indicators for financial constraints. This is probably not "too high" (S&P 500 firms should have pretty diverse funding options) or "too low" (arguably some of these firms should have a degree of constraint).     
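
The simplification suggested in item 1 could look something like the sketch below: count any mention of "tariff" or "tariffs" and drop the risk and change terms. This is only one possible way to do it, and `count_tariff_mentions` is a hypothetical helper, not part of the analysis above.

```python
import re

# whole-word mentions of "tariff" or "tariffs" in already-cleaned (lower-case) text
simple_tariff_rgx = re.compile(r'\b(?:tariff|tariffs)\b')

def count_tariff_mentions(cleaned_text):
    """Hypothetical helper: number of tariff mentions, with no nearby-term requirement."""
    return len(simple_tariff_rgx.findall(cleaned_text))

n_mentions = count_tariff_mentions('new tariffs raised costs and the tariff schedule may change')
n_none = count_tariff_mentions('the company sells software subscriptions')
print(n_mentions, n_none)
```

The trade-off is that a bare mention says nothing about whether the firm is negatively exposed, so this measures discussion of tariffs rather than tariff risk.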

In [8]:
# fraction of firms with a non-zero count (the denominator includes firms with no parsed page)
(sp500[['tax_risks','tariff_risk','fincon','proprietary']] > 0).sum() / len(sp500)

tax_risks      0.003960
tariff_risk    0.000000
fincon         0.140594
proprietary    0.003960
dtype: float64
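
A detail about the denominator: the loop only fills in counts for firms whose file exists, so any other firm keeps `NaN`, and `NaN > 0` evaluates to `False`. Dividing by `len(sp500)` therefore treats unparsed firms as zeros. The toy data below (invented for illustration) shows the difference between that and computing the share among parsed firms only.

```python
import numpy as np
import pandas as pd

# toy stand-in for one measure: the third firm has no parsed page (NaN)
toy = pd.DataFrame({'fincon': [0.0, 2.0, np.nan, 1.0]})

# same calculation as above: NaN rows count toward the denominator
share_all_rows = (toy['fincon'] > 0).sum() / len(toy)

# alternative: share among firms that were actually parsed
share_parsed = (toy['fincon'].dropna() > 0).mean()

print(share_all_rows, share_parsed)
```

With only a handful of missing pages the two numbers are nearly identical, but the gap grows if many files are missing.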

In [9]:
sp500[['tax_risks','tariff_risk','fincon','proprietary']].describe().T.style.format('{:,.2f}')

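| Measure | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tax_risks | 504.0 | 0.0 | 0.06 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| tariff_risk | 504.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| fincon | 504.0 | 0.19 | 0.56 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 |
| proprietary | 504.0 | 0.0 | 0.06 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |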


## Saving the sample for the analysis file

In [10]:
ccm = pd.read_stata('https://github.com/LeDataSciFi/ledatascifi-2021/blob/main/data/2019%20ccm_cleaned.dta?raw=true')

In [11]:
(
    sp500.merge(ccm,how='left',
                left_on='Symbol',right_on='tic',
                indicator=True, validate='one_to_one')
    .to_csv('output/sp500_accting_plus_textrisks.csv',
            index=False)
)    
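
The merge above uses `indicator=True` and `validate='one_to_one'`. As a rough sketch of why those options are useful, the toy example below (made-up tickers and a hypothetical `at` column) shows how the `_merge` column flags firms with no accounting match, while `validate` would raise an error if a ticker appeared more than once on either side.

```python
import pandas as pd

# toy data: 'BBB' has no match in the accounting table
left = pd.DataFrame({'Symbol': ['AAA', 'BBB', 'CCC'], 'fincon': [0, 2, 1]})
right = pd.DataFrame({'tic': ['AAA', 'CCC'], 'at': [10.0, 30.0]})

merged = left.merge(right, how='left',
                    left_on='Symbol', right_on='tic',
                    indicator=True, validate='one_to_one')

# how many rows matched ('both') versus not ('left_only')
counts = merged['_merge'].value_counts()
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Symbol'].tolist()

print(counts)
print(unmatched)
```

Checking the `_merge` counts before saving is a cheap way to confirm that most firms picked up accounting data.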