## Demo 

We want to work through some of the steps for `measure_risk`. I outlined this demo here: https://github.com/orgs/LeDataSciFi/teams/classmates-2022/discussions/19/comments/4

In [1]:
import pandas as pd
import random

# step 1 will load some database and prep it for the loopy parts
# here, we will just use a toy dataset

toy_database = pd.DataFrame({"Security":['3M','TLSA','APPL'],
             "URL":['blahblah.com','wikisomething.com','wiki.com']})

In [2]:
toy_database

| | Security | URL |
|---|---|---|
| 0 | 3M | blahblah.com |
| 1 | TLSA | wikisomething.com |
| 2 | APPL | wiki.com |


The problem:

In [None]:
# step 2: figure out how to loop through this dataframe 
for ______________________________: 
    # A. here, you would open the related 10k, but SKIP this for now

    # B. You'd measure the risk exposures here. Let's just pretend that 
    # you opened+cleaned+searched it and built a risk exposure variable 
    # called "new_risk_var" (a bad name, but this is just example code!)

    new_risk_var = random.randint(0,10) # this is a silly line to "simulate" that you created some risk measure

    __________________________ # add new_risk_var to your toy database

## Solution:
1. `.iterrows()` is nice for looping over a DataFrame
2. `.at[]` is nice for adding one value to a DataFrame

#### Note that the contents of this loop are the meat of this file!

In [3]:
# tip: first run this on just the first 10 firms of the dataset
# then, when you try it on all rows, use tqdm to monitor progress!
for index, row in toy_database.iterrows():
    
    # open 10K here - like discussed below
    # main challenge: getting the full path to the HTML
    # see the hint on the discussion board!
    
    print(row['Security']) # can pull vars out of that row for use

    # clean the 10-K so you just have text in a variable!
    #TBD
    
    # measure risks: this is fake code
    #use nearregex
    
    new_risk_var = random.randint(0,10) # this is a silly line to "simulate" that you created some risk measure
    print(new_risk_var)
    
    # put into database    
    toy_database.at[index, 'Risk_exposure']=new_risk_var # add new_risk_var to your toy database
    print('------')

3M
4
------
TLSA
6
------
APPL
8
------


And now we've added the variable:

In [4]:
toy_database

| | Security | URL | Risk_exposure |
|---|---|---|---|
| 0 | 3M | blahblah.com | 4.0 |
| 1 | TLSA | wikisomething.com | 6.0 |
| 2 | APPL | wiki.com | 8.0 |


## Opening a 10-K

In [5]:
fname = "10k_files/sec-edgar-filings/TSLA/10-K/0001564590-20-004475/filing-details.html"

# when you open a file in Python, it takes up working memory (RAM) and a file handle,
# so you want to close the file when you are done with it (otherwise you will have
# all 500 files open at the same time!)

# "with" automatically closes the file once the block of code inside it ends,
# which reduces RAM/memory requirements
with open(fname, encoding="utf-8") as report_file:
    html = report_file.read()
    # now it automatically closes the 10-K html

Clearly, we will need to find the text from inside the 10-K because the raw html is nasty:

In [6]:
html[:1000]

'<?xml version="1.0" encoding="utf-8"?><!-- DFIN ActiveDisclosure(SM) Inline XBRL Document - http://www.dfinsolutions.com/ --><!-- Creation Date      : 2020-02-13T10:20:23.3895073+00:00 --><!-- Version            : 5.0.1.321 --><!-- Package ID         : 323b37309d3e4cdb824ea7eabcb9d5e2 --><!-- Copyright (c) 2020 Donnelley Financial Solutions, Inc. All Rights Reserved. --><html xmlns="http://www.w3.org/1999/xhtml" xmlns:country="http://xbrl.sec.gov/country/2017-01-31" xmlns:currency="http://xbrl.sec.gov/currency/2019-01-31" xmlns:dei="http://xbrl.sec.gov/dei/2019-01-31" xmlns:exch="http://xbrl.sec.gov/exch/2019-01-31" xmlns:invest="http://xbrl.sec.gov/invest/2013-01-31" xmlns:iso4217="http://www.xbrl.org/2003/iso4217" xmlns:ix="http://www.xbrl.org/2013/inlineXBRL" xmlns:ixt="http://www.xbrl.org/inlineXBRL/transformation/2015-02-26" xmlns:ixt-sec="http://www.sec.gov/inlineXBRL/transformation/2015-08-31" xmlns:link="http://www.xbrl.org/2003/linkbase" xmlns:naics="http://xbrl.sec.gov/naics

In the midterm, the `fname` is always like the above, but "TSLA" is always replaced with the firm's ticker, and the numbers are different for every filing.

#### I left a hint on the discussion board about figuring out how to look for a file when you don't know the full path name!
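One possible approach (not necessarily the one in the discussion board hint) is a wildcard search with the standard library's `glob`. The sketch below assumes the folder layout matches the `fname` above. It builds a throwaway copy of that layout in a temp folder (made up for this demo) so that it runs anywhere, then finds the HTML file without knowing the filing's numeric folder name.

```python
import glob
import os
import tempfile

# Build a fake folder tree that mimics the layout in `fname` above.
# The numeric folder is the part we pretend NOT to know when searching.
root = tempfile.mkdtemp()
filing_dir = os.path.join(root, "sec-edgar-filings", "TSLA", "10-K", "0001564590-20-004475")
os.makedirs(filing_dir)
with open(os.path.join(filing_dir, "filing-details.html"), "w", encoding="utf-8") as f:
    f.write("<html>fake 10-K</html>")

# "*" stands in for the unknown folder name; "*.html" matches the file inside it
ticker = "TSLA"
pattern = os.path.join(root, "sec-edgar-filings", ticker, "10-K", "*", "*.html")
matches = glob.glob(pattern)

# take the first hit if there is one (and decide what to do when there is none!)
fname_found = matches[0] if matches else None
print(fname_found)
```

In the loop, `ticker` would come from the row you are on, and you would need to handle firms where the search finds no file.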

## NLP with our eyes

What are some risk factors for Tesla we can look for? Brainstorm, then find instances in the 10-K that support them.
- oil prices 
- china labor/workforce 
- competition from other EV brands
    - Joe and Ryan
    - In addition, Model 3 and Model Y face competition from existing and future automobile manufacturers in the extremely competitive entry-level premium sedan and compact SUV market, including BMW, Ford, Lexus, Mercedes and Volkswagen Group.
    - "compet" appears 56 times
- lithium 
    - Yang and Matt
    - "Increases in costs, disruption of supply or shortage of materials, in particular for lithium-ion cells, could harm our business."
    - "lithium" appears 21 times
- government policy, environmental (including tax credit phase out)
- business cycle
- non-oil energy prices 
- regulatory (automation of driving rules)
    - "In particular, we offer in our vehicles Autopilot and FSD features that today assist drivers with certain tedious and potentially dangerous aspects of road travel, but which currently require drivers to remain engaged. We are continuing to develop our FSD technology with the goal of achieving full self-driving capability in the future. There is a variety of international, federal and state regulations that may apply to self-driving vehicles, which include many existing vehicle standards that were not originally intended to apply to vehicles that may not have a driver. Such regulations continue to rapidly change, which increases the likelihood of a patchwork of complex or conflicting regulations, or may delay products or restrict self-driving features and availability, any of which could adversely affect our business."
- supply chain issues (chip)
    - Row 2
- government prohibition (strict regulation, like tobacco)
- unionization
    - Jack
    - "union" appears 12 times, but 4 of these are "EU" (European Union)!
    - the other 8 are about labor
    - "Our business may be adversely affected by any disruptions caused by union activities."
- real estate impact on store costs 
- supply chain risk (broadly?)
- production risks (a specific supply chain risk)
    - alex 
    - We have experienced in the past, and may experience in the future, delays or other complications in the design, manufacture, launch, and production ramp of our vehicles, energy products, and product features, or may not realize our manufacturing cost targets, which could harm our brand, business, prospects, financial condition and operating results.
    - We may be unable to meet our growing product sales, delivery and installation plans and vehicle servicing and charging network needs, or accurately project and manage this growth internationally, any of which could harm our business and prospects.
    - We are dependent on our suppliers, the majority of which are single-source suppliers, and the inability of these suppliers to deliver necessary components of our products according to our schedule and at prices, quality levels and volumes acceptable to us, or our inability to efficiently manage these components, could have a material adverse effect on our financial condition and operating results.
- computer chips (a specific supply chain risk)
- changes to tax policy
    - theo
    - "Globally, both the operation of our business by us and the ownership of our products by our customers are impacted by a number of government programs, incentives and other arrangements. Our business and products are also subject to a number of governmental regulations that vary among jurisdictions."
- prices of alternatives (ICE, gas/oil)
    - austen
    - competition, including from other types of alternative fuel vehicles, plug-in hybrid electric vehicles and high fuel-economy internal combustion engine vehicles; 
- business cycle risk (income of customers, preference changes)
    - ian  
    - Our future growth and success is dependent upon consumers’ willingness to adopt electric vehicles and specifically our vehicles. We operate in the automotive industry, which is generally susceptible to cyclicality and volatility. 
- environmental regulation (emission standards up or chemical standard)
    - matt and seb
    - The unavailability, reduction or elimination of, or unfavorable determinations with respect to, government and economic incentives in the U.S. and abroad supporting the development and adoption of electric vehicles, energy storage products or solar energy could have some impact on demand for our products and services. 
- increased competition (EV) 
    - owen colin
    - "The markets in which we operate are highly competitive, and we may not be successful in competing in these industries. We currently face competition from new and established domestic and international competitors and expect to face competition from others in the future, including competition from companies with new technology."
- idiosyncratic management
    - harrison
    - We are highly dependent on the services of Elon Musk, our Chief Executive Officer and largest stockholder. Although Mr. Musk spends significant time with Tesla and is highly active in our management, he does not devote his full time and attention to Tesla. Mr. Musk also currently serves as Chief Executive Officer and Chief Technical Officer of Space Exploration Technologies Corp., a developer and manufacturer of space launch vehicles, and is involved in other emerging technology ventures.
- regulation of automated driving
    - eric
    - "There are no federal U.S. regulations pertaining to the safety of self-driving vehicles; however, NHTSA has established recommended guidelines. Certain U.S. states have legal restrictions on self-driving vehicles, and many other states are considering them. This patchwork increases the legal complexity for our vehicles. In Europe, certain vehicle safety regulations apply to self-driving braking and steering systems, and certain treaties also restrict the legality of certain higher levels of self-driving vehicles. Self-driving laws and regulations are expected to continue to evolve in numerous jurisdictions in the U.S. and foreign countries, and may create restrictions on self-driving features that we develop."
- regulation of DTC

## Near regex basics

Load the function, and `re`.

In [7]:
from near_regex import NEAR_regex 
import re

You need a document (string)

In [8]:
test  = 'partial o o with o o o  partial o o  with o o o partial o o with'

Set up your search terms

In [9]:
words = ['partial','with']

The function simply writes the regex for you:

In [10]:
rgx   = NEAR_regex(words,2)
rgx

'(?:\\bpartial\\b(?: +[^ \\n]*){0,2} *\\bwith\\b)|(?:\\bwith\\b(?: +[^ \\n]*){0,2} *\\bpartial\\b)'

That `rgx` is ugly, but you can use it to count the number of hits easily. Look at the next code. `findall` looks for a pattern (argument 1) inside a string or document (argument 2). You can count the number of hits with `len`.

In [11]:
len(
    re.findall(NEAR_regex(words,2),
               test)
)

3
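To see *what* was matched rather than just the count, look at the list that `findall` returns. This sketch pastes in the pattern printed in `In [10]` as a raw string, so it runs even without `near_regex` installed.

```python
import re

test = 'partial o o with o o o  partial o o  with o o o partial o o with'

# the pattern printed by NEAR_regex(words, 2) above, pasted as a raw string
rgx = r'(?:\bpartial\b(?: +[^ \n]*){0,2} *\bwith\b)|(?:\bwith\b(?: +[^ \n]*){0,2} *\bpartial\b)'

hits = re.findall(rgx, test)
for hit in hits:
    print(repr(hit))  # repr() makes extra spaces inside a match visible
```

Reading the actual matches is the quickest way to spot a search that is catching the wrong thing.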

## Building and practicing a search

The idea is to translate some of our manual 10-K findings to code.

1. **Start with a string that you know discusses your topic**
2. Clean it like you will clean the document (the steps in 4.4 and 4.4.4.4)
3. Tweak and play with the regex until your search finds the match.
    - Below, we looked for "cycle" near "risk" to catch business cycle risk. 
    - Except that didn't work. The sentence never says "risk"; the "risk" term it uses is "susceptible".
    - That still didn't work, so we added other versions of the word "cycle".


In [None]:
# this sentence talks about biz cycle risk
text = "Our future growth and success is dependent \n \r \t upon consumers’ willingness to adopt electric vehicles and specifically our vehicles. We operate in the automotive industry, which is generally susceptible to cyclicality and volatility. "

# clean the document/string: putting the "Good ideas" from 4.4 to work to clean the document:
lower = text.lower()
no_punc = re.sub(r'\W',' ',lower)
cleaned = re.sub(r'\s+',' ',no_punc).strip()

In [None]:
cleaned

'our future growth and success is dependent upon consumers willingness to adopt electric vehicles and specifically our vehicles we operate in the automotive industry which is generally susceptible to cyclicality and volatility'

We ended up with this search, but I'd still want to tweak it more:

In [18]:
len(re.findall(
    NEAR_regex(['(cycle|cyclical|cyclicality)','(risk|susceptible)'],10),
    cleaned))

1

_Notice we used the [tips](https://ledatascifi.github.io/ledatascifi-2022/content/04/02d_RegexApplication.html?highlight=actually%20detect#tips) from the website to add synonyms to the search._
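Step 2 says to clean the practice string the same way you will clean the full document. One way to guarantee that is to wrap the three cleaning lines from above in a small function and call it in both places. The name `clean_text` is made up for this sketch.

```python
import re

def clean_text(raw):
    """Lowercase, turn punctuation into spaces, and collapse runs of whitespace."""
    lower = raw.lower()
    no_punc = re.sub(r'\W', ' ', lower)             # any non-word character becomes a space
    return re.sub(r'\s+', ' ', no_punc).strip()     # squeeze repeated whitespace, trim the ends

sample = "We operate in the automotive industry,\n \t which is generally susceptible to cyclicality and volatility. "
cleaned_sample = clean_text(sample)
print(cleaned_sample)
```

Inside the loop, you would call the same function on the text you pulled out of each 10-K before running any searches.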

## Other things

### How do I know if my search terms and choices are good?

1. **YOU WANT YOUR SEARCH TO CATCH DISCUSSIONS WHERE YOU KNOW IT EXISTS.** 
    - So find examples of such discussions by looking at 5-15 10-Ks. Look for patterns in how the discussions are worded and for terms that come up repeatedly. This will help you build a list of words to use. 
1. **BE CAREFUL TO NOT BE TOO BROAD! You don't want "false positives"!**
    - Searching for "comp" with `partial=True` will catch competition and competing, but also completion and many other words not related to competition.
    - Be careful with the words you allow (ideally they don't have multiple uses), and explicitly list the allowable synonyms (instead of using `partial=True`, use `partial=False` and list the synonyms).
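Here is the false positive problem in miniature. This uses plain `re` to mimic the idea of a broad partial match versus an explicit word list; it is only an illustration and not how `NEAR_regex` builds its patterns internally. The sentence is made up.

```python
import re

sentence = "competition is intense and competing firms delayed completion of the compensation plan"

# Broad: any word that starts with "comp"
broad = re.findall(r'\bcomp\w*', sentence)

# Narrow: only whole words from an explicit list of synonyms
narrow = re.findall(r'\b(?:competition|competing|competitors?)\b', sentence)

print(broad)   # includes "completion" and "compensation", which are false positives
print(narrow)  # only the competition words
```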

### How many words apart should we allow?

After we cleaned the text above, periods and paragraph breaks disappeared. So you can't rely on "is this the same paragraph or the same sentence". 

Think about the number of words in a typical sentence or paragraph or clause. If your words should be adjectives, the words between should probably be 0 or 1. If they should be nearby modifiers, maybe 5 or 10. Same sentence? Maybe 10 or 20. 

Ultimately, the answer is somewhat ad-hoc. 
- If some of your example paragraphs have a long gap between the two related terms, then you can show that and say that's why you increased the distance.
- Run your searches and examine the matches when you have different distances. If at some limit you start getting false matches, like "The cycle is complete: I finally won a game of risk", then reduce the distance allowed. 
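To try different distances without needing `near_regex`, the sketch below uses a hypothetical stand-in, `near_pattern`, modeled on the pattern printed in `In [10]`. It handles only two terms and is not the course's `NEAR_regex`. In the false match above, 8 words sit between "cycle" and "risk" once the text is cleaned.

```python
import re

def near_pattern(word_a, word_b, max_gap):
    """Hypothetical two-term stand-in, modeled on the NEAR_regex output shown earlier."""
    gap = r'(?: +[^ \n]*){0,%d} *' % max_gap   # up to max_gap words in between
    return (r'(?:\b' + word_a + r'\b' + gap + r'\b' + word_b + r'\b)|'
            r'(?:\b' + word_b + r'\b' + gap + r'\b' + word_a + r'\b)')

# the false match from the text, after cleaning
false_match = "the cycle is complete i finally won a game of risk"

counts = {}
for max_gap in (5, 10, 20):
    counts[max_gap] = len(re.findall(near_pattern('cycle', 'risk', max_gap), false_match))
    print(max_gap, counts[max_gap])
```

A limit of 5 rejects this sentence, while 10 and 20 let it through, which is the kind of evidence you can point to when you justify the distance you chose.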


### Cleaning the 10-K HTML

Our loop above requires us to
1. open a 10-K (which you can do as long as you can figure out the path to each 10-K, and there are hints above on that)
2. **get the text of the 10-K out of the HTML**
3. define our search terms, and run them
4. put the number of hits into the dataset

On Wednesday, I will show you a bit about how to get the text out of the 10-K, and we will do some more practice and discussion of search terms. 
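In the meantime, here is a rough sketch of step 2 using only the standard library's `html.parser`. It is not necessarily the method we will use in class, the class and function names are made up, and the tiny HTML string is a stand-in for a real 10-K.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text that sits between HTML tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # called for every piece of text between tags
        # (note: this would also pick up script/style contents if a page has them)
        self.chunks.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

sample_html = "<html><body><p>Risk <b>Factors</b></p><p>We face competition.</p></body></html>"
text_only = html_to_text(sample_html)

# then clean it the same way as the practice string above
cleaned_html_text = re.sub(r'\s+', ' ', re.sub(r'\W', ' ', text_only.lower())).strip()
print(cleaned_html_text)
```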