This notebook, the first of two, demonstrates how I use SEC databases and APIs together with a BERT-based financial sentiment analyzer. The cells below show how I gathered and cleaned raw SEC text and fed it into the FinBERT model to produce sentiment data that hopefully correlates with IBM's historical stock price.

To quickly sum up the data used:
> 10-Qs are publicly accessible forms that the SEC requires companies to file each quarter. They contain numerical and textual data such as quarterly profit, costs, and more. These forms report the calculated book value, but the public's reaction to them can change the market value.

Specific SEC data:
> Each 10-Q contains a section called "Management's Discussion and Analysis" (MDA). In short, it gives the company's own view of its current financial standing. Because SEC regulations are strict, companies have to be factual. Investors often use this section as a guide for whether to buy or sell a company's stock, and because so many people rely on it, the MDA section carries information about possible changes in market value.

Therefore, if extracted properly, the sentiment of this section should correlate with the directional movement of IBM's market value and, in theory, should help predict whether IBM's stock price will go up or down.

---
# Data Selection

To get SEC text data, collect it from the EDGAR database:
https://www.sec.gov/edgar/search-and-access

The data is retrieved with the SEC API (Python implementation of the APIs: https://pypi.org/project/sec-api/#10-k10-q8-k-section-extractor-api)

In [None]:
!pip install sec-api

APIs used:
*   QueryAPI
*   ExtractorAPI

In [None]:
key = "YOUR_SEC_API_KEY"  # placeholder: supply your own sec-api key rather than hard-coding a real one

### QueryAPI
Used for collecting data about specific **filings**: everything submitted at a specific time.

In [None]:
from sec_api import QueryApi
queryApi = QueryApi(api_key=key)

IBM_10q_query = {
    "query": {
        "query_string": {
            "query": "cik: 51143 AND formType:\"10-Q\" AND filedAt:{2004-01-01 TO 2023-12-31}"
        }
    },
    "from": "0",
    "size": "*",
    "sort": [{ "filedAt": { "order": "desc" } }]
}

IBM_10q_filings = queryApi.get_filings(IBM_10q_query)
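
The query string above combines `field:value` terms with `AND`. As a minimal sketch, a small helper could assemble the same kind of string for other companies or form types. The name `build_filing_query` is hypothetical and is not part of `sec-api`; the field names are copied from the query above.

```python
def build_filing_query(cik, form_type, start_date, end_date):
    """Assemble a query string in the same shape as the one used above."""
    return (
        f'cik:{cik} AND formType:"{form_type}" '
        f"AND filedAt:{{{start_date} TO {end_date}}}"
    )

# Rebuild the IBM 10-Q query for 2004 through 2023
query_string = build_filing_query(51143, "10-Q", "2004-01-01", "2023-12-31")
print(query_string)
```

The resulting string would go in the innermost `"query"` field of the dictionary passed to `get_filings`.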

### Output of QueryAPI:

In [None]:
print(IBM_10q_filings.keys(), "\n","\n",
      IBM_10q_filings["total"],"\n", "\n", "Filings:")
IBM_10q_filings["filings"][0]

dict_keys(['total', 'query', 'filings']) 
 
 {'value': 58, 'relation': 'eq'} 
 
 Filings:


{'id': '6711b1f33c49a0aecda3904cdfd8d5d3',
 'accessionNo': '0001558370-22-015322',
 'cik': '51143',
 'ticker': 'IBM',
 'companyName': 'INTERNATIONAL BUSINESS MACHINES CORP',
 'companyNameLong': 'INTERNATIONAL BUSINESS MACHINES CORP (Filer)',
 'formType': '10-Q',
 'description': 'Form 10-Q - Quarterly report [Sections 13 or 15(d)]',
 'filedAt': '2022-10-25T16:16:44-04:00',
 'linkToTxt': 'https://www.sec.gov/Archives/edgar/data/51143/000155837022015322/0001558370-22-015322.txt',
 'linkToHtml': 'https://www.sec.gov/Archives/edgar/data/51143/000155837022015322/0001558370-22-015322-index.htm',
 'linkToXbrl': '',
 'linkToFilingDetails': 'https://www.sec.gov/Archives/edgar/data/51143/000155837022015322/ibm-20220930x10q.htm',
 'entities': [{'companyName': 'INTERNATIONAL BUSINESS MACHINES CORP (Filer)',
   'cik': '51143',
   'irsNo': '130871985',
   'stateOfIncorporation': 'NY',
   'fiscalYearEnd': '1231',
   'type': '10-Q',
   'act': '34',
   'fileNo': '001-02360',
   'filmNo': '221329679',
  

Important data in each filing:
* "linkToFilingDetails" - The document for each report (HTML file for newer reports, text file for older reports)
* "filedAt" - Time of filing (date and time)

In [None]:
# IBM_10q_filings["filings"][0]["linkToFilingDetails"]
# IBM_10q_filings["filings"][0]["filedAt"]
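
The `filedAt` values are ISO 8601 timestamps with a UTC offset, so they can be turned into `datetime` objects before being matched against price data. A minimal sketch using only the standard library, with the timestamp copied from the first filing shown above:

```python
from datetime import datetime

filed_at = "2022-10-25T16:16:44-04:00"  # value taken from the first filing above

# fromisoformat understands the date, the time, and the "-04:00" offset
parsed = datetime.fromisoformat(filed_at)
filing_date = parsed.date()

print(parsed)       # timezone-aware datetime
print(filing_date)  # calendar date only, convenient for joining with daily prices
```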

In [None]:
# demonstration of how to extract the important data
forms = IBM_10q_filings['filings']

# one link and one filing timestamp per filing, in the same order
form_links = [form["linkToFilingDetails"] for form in forms]
date = [form["filedAt"] for form in forms]

print("Lengths (all files):" , "\n" + "Links: "+str(len(form_links)) + "\n" + "Date: "+str(len(date)), "\n")
print("Filing #1:")
print("Link -", form_links[0])
print("Date -", date[0])

Lengths (all files): 
Links: 58
Date: 58 

Filing #1:
Link - https://www.sec.gov/Archives/edgar/data/51143/000155837022015322/ibm-20220930x10q.htm
Date - 2022-10-25T16:16:44-04:00


---
# Data Collection

In [None]:
from sec_api import ExtractorApi
from bs4 import BeautifulSoup
extractorApi = ExtractorApi(key)

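data_dict = {}

for filing, link in enumerate(form_links):
  # "part1item2" requests Part I, Item 2 of the 10-Q (the MDA section)
  section_text = extractorApi.get_section(link, "part1item2", "html")

  soup = BeautifulSoup(section_text, 'html.parser')

  paragraphs = []
  for tag in soup.find_all("p"):
    data = tag.get_text().strip()
    # keep substantial paragraphs; skip footnotes that start with "*" or "+"
    if len(data) > 140 and data[0] != "*" and data[0] != "+":
      if not data[0].isupper() and paragraphs:
        # a chunk that does not start with a capital continues the previous paragraph
        paragraphs[-1] += " " + data
      else:
        paragraphs.append(data)
  data_dict[filing] = paragraphs

#data_dict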

In [None]:
print("Document #s")
print(str(list(data_dict.keys())[:3])[:-1] + ", ..., " + str(list(data_dict.keys())[-3:])[1:])

Document #s
[0, 1, 2, ..., 55, 56, 57]


Check whether each document has data:

In [None]:
doc_length = []
for i in data_dict.keys():
  doc_length.append(len(data_dict[i]))
print(doc_length)

[147, 147, 123, 165, 162, 134, 156, 154, 129, 143, 138, 118, 134, 134, 110, 138, 139, 145, 167, 169, 143, 133, 120, 126, 135, 127, 0, 0, 148, 127, 139, 141, 124, 138, 140, 133, 143, 141, 128, 151, 146, 134, 159, 143, 124, 140, 140, 113, 159, 157, 127, 121, 123, 107, 0, 0, 0, 0]


There are some 0-length documents, meaning the ExtractorAPI did not return any usable paragraphs for those filings.

In [None]:
empty_keys = []
no_data_links = []

for i in data_dict.keys():
  if len(data_dict[i]) == 0:
    empty_keys.append(i)
    no_data_links.append(form_links[i])
print("Reports with no data")
print("\n"+"Keys:","\n"+str(empty_keys))
print("\n"+"Data:")
for i in empty_keys:
  print(data_dict[i])
print("\n"+"Links:")
for i in no_data_links:
  print(i)

Reports with no data

Keys: 
[26, 27, 54, 55, 56, 57]

Data:
[]
[]
[]
[]
[]
[]

Links:
https://www.sec.gov/Archives/edgar/data/51143/000005114314000004/ibm14q1_10q.htm
https://www.sec.gov/Archives/edgar/data/51143/000005114313000007/ibm13q3_10q.htm
https://www.sec.gov/Archives/edgar/data/51143/000110465904032863/a04-12261_210qa.htm
https://www.sec.gov/Archives/edgar/data/51143/000110465904032411/a04-12261_110q.htm
https://www.sec.gov/Archives/edgar/data/51143/000110465904021678/a04-7971_110q.htm
https://www.sec.gov/Archives/edgar/data/51143/000110465904013278/a04-5397_110q.htm
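
One possible fallback for these filings, not part of the original notebook, is to cut the section out of the filing's plain text directly. The sketch below assumes the headings literally read "Item 2." and "Item 3."; it takes the longest span between them so that a short table-of-contents entry is not picked by mistake. The function name `slice_item_2` and the toy text are hypothetical, and real filings would likely need more careful patterns.

```python
import re

ITEM_2 = re.compile(r"item\s+2\.", re.IGNORECASE)
ITEM_3 = re.compile(r"item\s+3\.", re.IGNORECASE)

def slice_item_2(plain_text):
    """Return the longest stretch of text between an 'Item 2.' and the next 'Item 3.'."""
    best = ""
    for start in ITEM_2.finditer(plain_text):
        end = ITEM_3.search(plain_text, start.end())
        stop = end.start() if end else len(plain_text)
        candidate = plain_text[start.end():stop].strip()
        if len(candidate) > len(best):
            best = candidate
    return best

# Toy document: a table of contents followed by the actual section
toy_filing = (
    "Item 2. MD&A 12 Item 3. Market Risk 40 "
    "Item 2. Management's Discussion and Analysis Revenue grew in the quarter. "
    "Item 3. Quantitative Disclosures"
)
section = slice_item_2(toy_filing)
print(section)
```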


---
# Data Creation

### Function 1:
Input a string of text and get the index of every functional period: a "." that marks the end of a sentence.

Non-functional periods: Mr., Inc., and so on

(It might be worthwhile to do further testing to identify more cases of non-functional periods.)

In [None]:
# get a list of indices for every period that ends a sentence
def get_functional_periods(data):
  # skip periods inside numbers (e.g. "80.1") and periods that follow "Inc" or "U.S"
  period_indicies = []
  for i in range(len(data)):
    if data[i] == ".":
      if i == len(data)-1:
        # a period that closes the paragraph always counts
        period_indicies.append(i)
      elif data[i+1]==" " and data[i-3:i] !="Inc" and data[i-3:i] !="U.S":
        period_indicies.append(i)
  return period_indicies
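
The note about finding more non-functional periods could be handled with a lookup of known abbreviations instead of fixed three-character checks. The following is a sketch only: the function name `find_sentence_ends` and the abbreviation list are assumptions and would need tuning against real 10-Q text.

```python
# Hypothetical abbreviation list; extend it after inspecting real filings
ABBREVIATIONS = {"Inc", "Corp", "Co", "Ltd", "Mr", "Mrs", "Ms", "Dr", "U.S"}

def find_sentence_ends(text, abbreviations=ABBREVIATIONS):
    """Return indices of periods that appear to end a sentence."""
    ends = []
    for i, ch in enumerate(text):
        if ch != ".":
            continue
        if i == len(text) - 1:
            ends.append(i)  # a period closing the paragraph always counts
            continue
        if text[i + 1] != " ":
            continue  # decimal point or mid-token period, e.g. "80.1"
        word_before = text[:i].rsplit(" ", 1)[-1]
        if word_before not in abbreviations:
            ends.append(i)
    return ends

sample = "Kyndryl Holdings, Inc. (Kyndryl) was spun off. IBM kept 19.9 percent. Mr. Smith agreed."
ends = find_sentence_ends(sample)
print(ends)
```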

In [None]:
# creates a list of the indicies of all functional periods for the first entry in data_dict[0]
paragraph = data_dict[0][0]
period_indicies = get_functional_periods(paragraph)
print("Indicies of periods:",period_indicies,"\n")
paragraph[0:period_indicies[0]]

Indicies of periods: [264, 481, 669, 959, 1133, 1212] 



'On November 3, 2021, we completed the separation of our managed infrastructure services unit into a new public company with the distribution of 80.1 percent of the outstanding common stock of Kyndryl Holdings, Inc. (Kyndryl) to IBM stockholders on a pro rata basis'

### Function 2:
Input a string of text and the indicies of functional periods, and output a list where each item is a sentence

In [None]:
# creates list of strings from paragraph (each string is a sentence)
def create_sentences(period_indicies, data):
  sentences = []
  for i in range(len(period_indicies)):
    if i == 0:
      sentences.append(data[0:(period_indicies[0]+1)])

    if i != 0 and i != (len(period_indicies)-1):
      sentences.append(data[period_indicies[i-1]+2:period_indicies[i]+1])

  if len(period_indicies) > 1:
    sentences.append(data[period_indicies[-2]+2:])
  return sentences
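
For comparison, here is a more compact sketch of the same slicing idea. It is not the notebook's function: the name `slice_sentences` is hypothetical, and unlike `create_sentences` it keeps any text after the final period as its own item instead of merging it into the last sentence.

```python
def slice_sentences(text, period_indices):
    """Cut text into sentences that end at each index in period_indices."""
    sentences = []
    start = 0
    for end in period_indices:
        sentences.append(text[start:end + 1].strip())
        start = end + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # keep text that follows the final period
    return sentences

example = "First one. Second one. Third"
pieces = slice_sentences(example, [9, 21])
print(pieces)
```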

In [None]:
# combine function 1 and function 2
create_sentences(get_functional_periods(paragraph), paragraph)

['On November 3, 2021, we completed the separation of our managed infrastructure services unit into a new public company with the distribution of 80.1 percent of the outstanding common stock of Kyndryl Holdings, Inc. (Kyndryl) to IBM stockholders on a pro rata basis.',
 'To affect the separation, IBM stockholders received one share of Kyndryl common stock for every five shares of IBM common stock held at the close of business on October 25, 2021, the record date for the distribution.',
 'IBM retained 19.9 percent of the shares of Kyndryl common stock immediately following the separation with the intent to dispose of such shares within twelve months after the distribution.',
 'The company accounts for the retained Kyndryl common stock as a fair value investment included within prepaid expenses and other current assets in the Consolidated Balance Sheet with subsequent fair value changes included in other (income) and expense in the Consolidated Income Statement.',
 'As of September 30, 2

### Function 3:
Combine functions 1 and 2

In [None]:
def easy_create_sentences(paragraph):
  return create_sentences(get_functional_periods(paragraph), paragraph)

The cell below is the first chunk of text after processing it

In [None]:
easy_create_sentences(data_dict[0][0])

['On November 3, 2021, we completed the separation of our managed infrastructure services unit into a new public company with the distribution of 80.1 percent of the outstanding common stock of Kyndryl Holdings, Inc. (Kyndryl) to IBM stockholders on a pro rata basis.',
 'To affect the separation, IBM stockholders received one share of Kyndryl common stock for every five shares of IBM common stock held at the close of business on October 25, 2021, the record date for the distribution.',
 'IBM retained 19.9 percent of the shares of Kyndryl common stock immediately following the separation with the intent to dispose of such shares within twelve months after the distribution.',
 'The company accounts for the retained Kyndryl common stock as a fair value investment included within prepaid expenses and other current assets in the Consolidated Balance Sheet with subsequent fair value changes included in other (income) and expense in the Consolidated Income Statement.',
 'As of September 30, 2

The cell below is the first chunk of text before processing it

In [None]:
data_dict[0][0]

'On November 3, 2021, we completed the separation of our managed infrastructure services unit into a new public company with the distribution of 80.1 percent of the outstanding common stock of Kyndryl Holdings, Inc. (Kyndryl) to IBM stockholders on a pro rata basis. To affect the separation, IBM stockholders received one share of Kyndryl common stock for every five shares of IBM common stock held at the close of business on October 25, 2021, the record date for the distribution. IBM retained 19.9 percent of the shares of Kyndryl common stock immediately following the separation with the intent to dispose of such shares within twelve months after the distribution. The company accounts for the retained Kyndryl common stock as a fair value investment included within prepaid expenses and other current assets in the Consolidated Balance Sheet with subsequent fair value changes included in other (income) and expense in the Consolidated Income Statement. As of September 30, 2022, we transferr

### Creating Processed Dataset Using "easy_create_sentences" Function:

In [None]:
data_dict2 = {}
for i in data_dict:
  data_dict2[i] = []
  for j in data_dict[i]:
    if len(j)!=1:
      data_dict2[i].append(easy_create_sentences(j))

In [None]:
data_dict2.keys()

dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57])

### Loading and Setting Up the FinBERT Model

In [None]:
!pip install transformers  # provides the FinBERT model and tokenizer classes
!pip install torch torchvision  # transformers needs a deep learning backend (PyTorch, TensorFlow, or Flax); PyTorch is used here

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")
nlp = pipeline("text-classification", model=model, tokenizer=tokenizer)

### Tokenizer

The text fed into the tokenizer can be a single input or a list of inputs.

Each input can be a single segment or a pair of two segments.

* (A segment is just a chunk of text, here a sentence.)

In [None]:
encoding = tokenizer([
    [data_dict2[0][0][0],data_dict2[0][0][3]], # this is an input with 2 segments
    data_dict2[0][0][1], # both this input and the one below it are single segments
    data_dict2[0][0][2]  #
],
    padding = True,      # pads shorter inputs with trailing 0s so every row has the same length
    truncation = True,   # cuts inputs that exceed the model's maximum length (512 tokens for BERT)
    return_tensors='pt'  # returns PyTorch tensors, which the model expects
)

Output:

In [None]:
encoding

{'input_ids': tensor([[  101,  2006,  2281,  1017,  1010, 25682,  1010,  2057,  2949,  1996,
          8745,  1997,  2256,  3266,  6502,  2578,  3131,  2046,  1037,  2047,
          2270,  2194,  2007,  1996,  4353,  1997,  3770,  1012,  1015,  3867,
          1997,  1996,  5151,  2691,  4518,  1997, 18712,  4859, 23320,  9583,
          1010,  4297,  1012,  1006, 18712,  4859, 23320,  1007,  2000,  9980,
          4518, 17794,  2006,  1037,  4013,  9350,  2050,  3978,  1012,   102,
          1996,  2194,  6115,  2005,  1996,  6025, 18712,  4859, 23320,  2691,
          4518,  2004,  1037,  4189,  3643,  5211,  2443,  2306, 17463, 14326,
         11727,  1998,  2060,  2783,  7045,  1999,  1996, 10495,  5703,  7123,
          2007,  4745,  4189,  3643,  3431,  2443,  1999,  2060,  1006,  3318,
          1007,  1998, 10961,  1999,  1996, 10495,  3318,  4861,  1012,   102],
        [  101,  2000,  7461,  1996,  8745,  1010,  9980,  4518, 17794,  2363,
          2028,  3745,  1997, 18712, 

In [None]:
encoding.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

Tokens:

The first tensor is the two-segment input.

The second tensor is a single-segment input (padded with trailing 0s).

In [None]:
encoding["input_ids"][0:2]

tensor([[  101,  2006,  2281,  1017,  1010, 25682,  1010,  2057,  2949,  1996,
          8745,  1997,  2256,  3266,  6502,  2578,  3131,  2046,  1037,  2047,
          2270,  2194,  2007,  1996,  4353,  1997,  3770,  1012,  1015,  3867,
          1997,  1996,  5151,  2691,  4518,  1997, 18712,  4859, 23320,  9583,
          1010,  4297,  1012,  1006, 18712,  4859, 23320,  1007,  2000,  9980,
          4518, 17794,  2006,  1037,  4013,  9350,  2050,  3978,  1012,   102,
          1996,  2194,  6115,  2005,  1996,  6025, 18712,  4859, 23320,  2691,
          4518,  2004,  1037,  4189,  3643,  5211,  2443,  2306, 17463, 14326,
         11727,  1998,  2060,  2783,  7045,  1999,  1996, 10495,  5703,  7123,
          2007,  4745,  4189,  3643,  3431,  2443,  1999,  2060,  1006,  3318,
          1007,  1998, 10961,  1999,  1996, 10495,  3318,  4861,  1012,   102],
        [  101,  2000,  7461,  1996,  8745,  1010,  9980,  4518, 17794,  2363,
          2028,  3745,  1997, 18712,  4859, 23320, 

Meaning behind tokens:

1: first segment of the input sequence

2: the tokens produced by the tokenizer ('[CLS]' = start token, '[SEP]' = separator token)

3: numerical IDs of those tokens

In [None]:
import torch
print('1:', data_dict2[0][0][0],"\n")
print('2:',tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist()),"\n")
print('3:',encoding["input_ids"][0].tolist())

1: On November 3, 2021, we completed the separation of our managed infrastructure services unit into a new public company with the distribution of 80.1 percent of the outstanding common stock of Kyndryl Holdings, Inc. (Kyndryl) to IBM stockholders on a pro rata basis. 

2: ['[CLS]', 'on', 'november', '3', ',', '2021', ',', 'we', 'completed', 'the', 'separation', 'of', 'our', 'managed', 'infrastructure', 'services', 'unit', 'into', 'a', 'new', 'public', 'company', 'with', 'the', 'distribution', 'of', '80', '.', '1', 'percent', 'of', 'the', 'outstanding', 'common', 'stock', 'of', 'ky', '##nd', '##ryl', 'holdings', ',', 'inc', '.', '(', 'ky', '##nd', '##ryl', ')', 'to', 'ibm', 'stock', '##holders', 'on', 'a', 'pro', 'rat', '##a', 'basis', '.', '[SEP]', 'the', 'company', 'accounts', 'for', 'the', 'retained', 'ky', '##nd', '##ryl', 'common', 'stock', 'as', 'a', 'fair', 'value', 'investment', 'included', 'within', 'prep', '##aid', 'expenses', 'and', 'other', 'current', 'assets', 'in', 'the',

"token_type_ids"
Each value is 0 or 1 and tells the model which segment a token belongs to (an input has at most 2 segments).
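
To make the segment layout concrete, here is a toy sketch in plain Python. It is not the real tokenizer: `encode_pair` is a hypothetical helper that works on already-split words and only shows where `[CLS]`, `[SEP]`, and the 0/1 segment ids end up.

```python
def encode_pair(first_tokens, second_tokens=None):
    """Lay out tokens and segment ids the way a BERT-style pair input is arranged."""
    tokens = ["[CLS]"] + first_tokens + ["[SEP]"]
    type_ids = [0] * len(tokens)  # first segment, including [CLS] and its [SEP]
    if second_tokens is not None:
        tokens += second_tokens + ["[SEP]"]
        type_ids += [1] * (len(second_tokens) + 1)  # second segment and its [SEP]
    return tokens, type_ids

pair_tokens, pair_type_ids = encode_pair(["ibm", "grew"], ["costs", "fell"])
print(pair_tokens)
print(pair_type_ids)
```

This matches the pattern in the output below, where the first row switches from 0 to 1 after the first separator and the single-segment rows are all 0.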

In [None]:
# encoding = tokenizer([
#     [data_dict2[0][0][0],data_dict2[0][0][3]],
#     data_dict2[0][0][1],
#     data_dict2[0][0][2]
# ],
#     padding = True,
#     truncation = True,
#     return_tensors='pt'
# )

encoding['token_type_ids']

tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     

Attention Mask

Tells the model whether each position holds a real token or padding (1 = real token, 0 = padding).
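
A small plain-Python sketch of how padding and the attention mask relate. The helper `pad_batch` is hypothetical and only mimics what `padding = True` does; the token ids are made up apart from 101 and 102, which are the `[CLS]` and `[SEP]` ids seen above.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest length and build the matching mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

toy_ids, toy_mask = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(toy_ids)
print(toy_mask)
```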

In [None]:
# encoding = tokenizer([
#     [data_dict2[0][0][0],data_dict2[0][0][3]],
#     data_dict2[0][0][1],
#     data_dict2[0][0][2]
# ],
#     padding = True,
#     truncation = True,
#     return_tensors='pt'
# )

encoding['attention_mask']

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     

# Modeling Data

To pass the tokenizer output to the model, unpack the dictionary into keyword arguments with `**`, as in `model(**encoding)`.
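
The `**` operator is ordinary Python dictionary unpacking: each key becomes a keyword argument. A toy sketch, where `toy_model` is a hypothetical stand-in function rather than the FinBERT model:

```python
def toy_model(input_ids, token_type_ids=None, attention_mask=None):
    """Stand-in that just reports what it received."""
    return {"n_inputs": len(input_ids), "has_mask": attention_mask is not None}

toy_encoding = {
    "input_ids": [[101, 102]],
    "token_type_ids": [[0, 0]],
    "attention_mask": [[1, 1]],
}

# These two calls are equivalent
unpacked = toy_model(**toy_encoding)
explicit = toy_model(
    input_ids=toy_encoding["input_ids"],
    token_type_ids=toy_encoding["token_type_ids"],
    attention_mask=toy_encoding["attention_mask"],
)
print(unpacked == explicit)
```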

In [None]:
output = model(**encoding, output_hidden_states = True)
output

SequenceClassifierOutput(loss=None, logits=tensor([[-1.4899,  0.2987,  2.0011],
        [-0.9812, -1.1731,  2.6127],
        [-0.7998, -1.4381,  2.5900]], grad_fn=<AddmmBackward0>), hidden_states=(tensor([[[ 0.1615, -0.2566, -0.3011,  ..., -0.0213,  0.0447,  0.1875],
         [-0.1507,  0.7061,  0.0823,  ...,  0.6274, -0.0640,  0.2646],
         [-0.3093, -0.3974, -0.5383,  ..., -0.0260, -0.9953, -0.2304],
         ...,
         [ 0.4137,  0.4686, -0.5649,  ..., -1.0511, -0.8764, -0.6820],
         [-0.0971,  0.4436,  0.0854,  ...,  0.3906,  0.3979,  0.4538],
         [-0.3806,  0.1136,  0.2359,  ..., -0.5221,  0.1586, -0.0746]],

        [[ 0.1615, -0.2566, -0.3011,  ..., -0.0213,  0.0447,  0.1875],
         [ 0.4015,  0.4841, -0.1439,  ...,  0.5635,  0.5193,  0.2663],
         [ 0.2178,  0.4814, -0.6954,  ..., -0.1432, -0.3436, -0.5575],
         ...,
         [ 0.3699, -0.2974,  0.1428,  ..., -0.3487, -0.4158, -0.0679],
         [ 0.3966, -0.1230,  0.2540,  ..., -0.1848, -0.4398,  0

In [None]:
output.keys()

odict_keys(['logits', 'hidden_states'])

The logits are the raw sentiment scores, one value per label, in the order [positive, negative, neutral].

Softmax converts the logits into probabilities that sum to 1 for each input.
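
As a worked illustration of the softmax step, here is a plain-Python sketch applied to the first row of logits printed above. The label order is the one stated in this notebook; the helper name `softmax` is just for this example.

```python
import math

def softmax(logits):
    """Exponentiate and normalize so the values sum to 1."""
    peak = max(logits)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["positive", "negative", "neutral"]
first_logits = [-1.4899, 0.2987, 2.0011]  # first row of output.logits above

probs = softmax(first_logits)
best_label = labels[probs.index(max(probs))]
print([round(p, 4) for p in probs], best_label)
```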

In [None]:
print(output.logits, "\n")

value = torch.nn.functional.softmax(output.logits,dim=-1)
print(value)

print("\n"+ "should be 1:",(value[0][0]+value[0][1]+value[0][2]).tolist())

tensor([[-1.4899,  0.2987,  2.0011],
        [-0.9812, -1.1731,  2.6127],
        [-0.7998, -1.4381,  2.5900]], grad_fn=<AddmmBackward0>) 

tensor([[0.0251, 0.1503, 0.8246],
        [0.0262, 0.0216, 0.9522],
        [0.0321, 0.0169, 0.9510]], grad_fn=<SoftmaxBackward0>)

should be 1: 1.0


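Hidden states are the token embedding vectors produced at each stage of the model.

The FinBERT model returns 13 sets of hidden states (the output of the embedding layer plus one for each of BERT's 12 transformer layers):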

In [None]:
len(output.hidden_states)

13

Each of these hidden states contains information about:
* each input
* each token in the input
* the actual data: word embeddings (BERT embeddings are 768 in length)

Below is the shape of the first of the 13 (inputs, tokens, embedding size):

In [None]:
output.hidden_states[0].shape

torch.Size([3, 110, 768])

### Running the Model on "data_dict2" and creating "sentiments" dictionary

In [None]:
import numpy as np

#Model
labels = ["positive", "negative", "neutral"]
sentiments = {}
for i in data_dict2:
  print("doc:", i)
  sentiments[i] = {"positive":[],
    "negative":[],
    "neutral":[]}

  for j in range(len(data_dict2[i])):
    # only paragraphs with more than one sentence are scored
    if len(data_dict2[i][j]) > 1:
      print("paragraph:", j)
      batch = tokenizer(
      data_dict2[i][j],
      padding = True,
      truncation = True,
      return_tensors='pt')
      # inference only, so skip gradient tracking to save memory
      with torch.no_grad():
        output = model(**batch)
      prediction = torch.nn.functional.softmax(output.logits,dim=-1)

      # average each label's probability over the sentences in the paragraph;
      # round() replaces string slicing, which breaks on scientific notation
      averages = [round(x, 4) for x in prediction.mean(dim=0).tolist()]

      # record the paragraph under its most likely label
      best = int(np.argmax(averages))
      sentiments[i][labels[best]].append(averages[best])
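
The aggregation rule in the loop above can be seen in isolation with a plain-Python sketch: average each label's probability across a paragraph's sentences, then keep the label with the highest average. The helper name `paragraph_label` and the probabilities below are made up for illustration.

```python
LABELS = ("positive", "negative", "neutral")

def paragraph_label(sentence_probs):
    """Average per-sentence probabilities and return the winning label and its score."""
    n = len(sentence_probs)
    averages = [sum(row[k] for row in sentence_probs) / n for k in range(len(LABELS))]
    best = max(range(len(LABELS)), key=lambda k: averages[k])
    return LABELS[best], averages[best]

# Two sentences, each with [positive, negative, neutral] probabilities
toy_paragraph = [
    [0.7, 0.1, 0.2],
    [0.5, 0.2, 0.3],
]
label, score = paragraph_label(toy_paragraph)
print(label, round(score, 4))
```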

In [None]:
sentiments[0].keys()

dict_keys(['positive', 'negative', 'neutral'])

Creating dictionary for sentiment data

In [None]:
data = {}
for i in sentiments:
  data[i] = {'positive':0, 'negative':0, 'neutral':0,
           "num_positive":0,
           "num_negative":0,
           "num_neutral":0}
  for j in sentiments[i]:
    data[i][j] = np.mean(sentiments[i][j])
    data[i]["num_"+j] = len(sentiments[i][j])

In [None]:
data

{0: {'positive': 0.7247903846153847,
  'negative': 0.6482153846153846,
  'neutral': 0.7546088888888888,
  'num_positive': 52,
  'num_negative': 26,
  'num_neutral': 45},
 1: {'positive': 0.753925,
  'negative': 0.6843857142857143,
  'neutral': 0.7429416666666667,
  'num_positive': 44,
  'num_negative': 28,
  'num_neutral': 48},
 2: {'positive': 0.7459166666666667,
  'negative': 0.6721411764705882,
  'neutral': 0.7340978723404255,
  'num_positive': 36,
  'num_negative': 17,
  'num_neutral': 47},
 3: {'positive': 0.7090380952380952,
  'negative': 0.6718488888888889,
  'neutral': 0.7525847826086955,
  'num_positive': 42,
  'num_negative': 45,
  'num_neutral': 46},
 4: {'positive': 0.7159770833333333,
  'negative': 0.6736918918918918,
  'neutral': 0.7538318181818181,
  'num_positive': 48,
  'num_negative': 37,
  'num_neutral': 44},
 5: {'positive': 0.7168538461538461,
  'negative': 0.6243407407407408,
  'neutral': 0.7734500000000001,
  'num_positive': 39,
  'num_negative': 27,
  'num_neutr

Creating pandas dataframe from "data"

In [None]:
import pandas as pd
df = pd.DataFrame(data)
df = df.transpose()
df["filed_at"] = date
df

In [None]:
df.to_csv("sentiments.csv", index = True)

# Final dataframe

In [None]:
df = pd.read_csv('sentiments.csv')

In [None]:
df.rename(columns = {"Unnamed: 0": "document"}, inplace =True) 
df

Unnamed: 0,document,positive,negative,neutral,num_positive,num_negative,num_neutral,filed_at
0,0,0.72479,0.648215,0.754609,52.0,26.0,45.0,2022-10-25T16:16:44-04:00
1,1,0.753925,0.684386,0.742942,44.0,28.0,48.0,2022-07-25T16:43:18-04:00
2,2,0.745917,0.672141,0.734098,36.0,17.0,47.0,2022-04-26T16:19:53-04:00
3,3,0.709038,0.671849,0.752585,42.0,45.0,46.0,2021-11-05T07:23:16-04:00
4,4,0.715977,0.673692,0.753832,48.0,37.0,44.0,2021-07-27T16:41:04-04:00
5,5,0.716854,0.624341,0.77345,39.0,27.0,40.0,2021-04-27T16:19:53-04:00
6,6,0.727618,0.735065,0.765815,44.0,43.0,41.0,2020-10-27T16:20:30-04:00
7,7,0.735105,0.710935,0.768005,41.0,46.0,41.0,2020-07-28T16:29:01-04:00
8,8,0.676021,0.679038,0.745768,38.0,26.0,41.0,2020-04-28T16:24:43-04:00
9,9,0.716632,0.706492,0.718324,28.0,50.0,41.0,2019-10-29T19:37:08-04:00


In [None]:
# df.drop( columns = "Unnamed: 0", inplace = True)
df.head()

Unnamed: 0.1,Unnamed: 0,positive,negative,neutral,num_positive,num_negative,num_neutral,filed_at
0,0,0.72479,0.648215,0.754609,52.0,26.0,45.0,2022-10-25T16:16:44-04:00
1,1,0.753925,0.684386,0.742942,44.0,28.0,48.0,2022-07-25T16:43:18-04:00
2,2,0.745917,0.672141,0.734098,36.0,17.0,47.0,2022-04-26T16:19:53-04:00
3,3,0.709038,0.671849,0.752585,42.0,45.0,46.0,2021-11-05T07:23:16-04:00
4,4,0.715977,0.673692,0.753832,48.0,37.0,44.0,2021-07-27T16:41:04-04:00


### Next Steps
Change the structure of the dataframe so that there are columns for:
* Label (pos, neg, neu)
* Percentage likelihood of label
* Count of label
* Time of filing
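
A rough sketch of what that restructuring could look like, written in plain Python on a single row copied from the table above. The function name `to_long_rows` and the output column names `label`, `likelihood`, and `count` are assumptions; the resulting list of dictionaries could be passed to `pd.DataFrame` to get the long-format frame.

```python
def to_long_rows(wide_rows):
    """Turn one wide row per document into one row per (document, label)."""
    long_rows = []
    for row in wide_rows:
        for label in ("positive", "negative", "neutral"):
            long_rows.append({
                "document": row["document"],
                "filed_at": row["filed_at"],
                "label": label,
                "likelihood": row[label],
                "count": row["num_" + label],
            })
    return long_rows

wide = [{
    "document": 0,
    "positive": 0.72479, "negative": 0.648215, "neutral": 0.754609,
    "num_positive": 52, "num_negative": 26, "num_neutral": 45,
    "filed_at": "2022-10-25T16:16:44-04:00",
}]
long_rows = to_long_rows(wide)
for r in long_rows:
    print(r)
```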