# Introduction


**What?** Named entity recognition (NER)



# What is NER?


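- NER stands for Named Entity Recognition: the task of finding the spans of text that name real-world things (people, organisations, places, dates, and so on) and assigning each one a category.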



# Objective?


- The goal of this project is to learn and apply Named Entity Recognition to extract important entities (publicly traded companies in our example) and then link each entity to additional information using a knowledge base (the Nifty 500 companies list).
- We'll get the textual data from RSS feeds on the internet, extract the names of buzzing stocks, and then pull their market price data to test the authenticity of the news before taking any position in those stocks.
- **Essentially**, the goal is to learn which stocks are buzzing in the market and get their details on your dashboard.



# Import modules

In [32]:
import requests, spacy
from bs4 import BeautifulSoup
import warnings
import yfinance as yf
import pandas as pd
warnings.filterwarnings("ignore")

# Gather the data


- **Step #1** Get the entire XML document; the `requests` library can do that.
- **Step #2** Find where in the XML file the data we are interested in lives. Here, the headlines sit inside the `<title>` tags.
- **Step #3** Use `spaCy` to extract the main entities from the headlines.


- We'll be using these two news feeds:
    - [Economic Times](https://economictimes.indiatimes.com/markets/stocks/rssfeeds/2146842.cms)
    - [Money Control](https://www.moneycontrol.com/rss/buzzingstocks.xml)
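Step #2 can be tried without a network connection. The sketch below is only an illustration: it parses a small hand-written RSS snippet (hypothetical, not taken from either feed) with the standard library's `xml.etree.ElementTree` rather than BeautifulSoup, and the helper name `item_titles` is made up for this example. It keeps only the titles nested inside `<item>` elements, which leaves out feed-level titles such as the name of the feed itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS snippet, hand-written for illustration only
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>Company A soars 20%</title></item>
    <item><title>Gainers &amp; losers of the day</title></item>
  </channel>
</rss>"""


def item_titles(xml_text):
    """Return the text of every <title> that sits directly inside an <item>."""
    root = ET.fromstring(xml_text)
    return [title.text for title in root.iterfind(".//item/title")]


titles = item_titles(SAMPLE_RSS)
print(titles)
```

The channel's own `<title>` is skipped because the path `.//item/title` only matches titles whose parent is an `<item>`.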



In [3]:
XML_File_No1 = "https://economictimes.indiatimes.com/markets/stocks/rssfeeds/2146842.cms"
XML_File_No2 = "https://www.moneycontrol.com/rss/buzzingstocks.xml"
# Get the XML object
resp = requests.get(XML_File_No1)

In [4]:
resp

<Response [200]>

In [5]:
soup = BeautifulSoup(resp.content, features = "xml")

In [10]:
headlines = soup.find_all("title")

In [11]:
headlines

[<title>Stocks-Markets-Economic Times</title>,
 <title>Economic Times</title>,
 <title>Daily Trading Guide: 2 stock recommendations for Friday</title>,
 <title>Wall Street opens higher, but set for steep monthly losses</title>,
 <title>Biggest gainers &amp; losers of the day: Kanchi Karpooram soars 20%, Aptus shed 5%</title>,
 <title>Market Movers: Zee Entertainment tumbles as risk to Sony merger rise</title>,
 <title>Tech View: Nifty below 13-day SMA, signalling a short-term downswing</title>,
 <title>F&amp;O: Nifty forming lower highs, resistance levels slip lower too</title>,
 <title>Rich List | Kamath worth Rs 26K cr! Raamdeo, Jain, Rashesh among richest brokers</title>,
 <title>Stock market update: Nifty IT index  falls  0.72%</title>,
 <title>Share market update: Most active stocks in today's market in terms of volume</title>,
 <title>Stock market update: Sugar stocks  up  as market  falls </title>,
 <title>Stock market update: Mining stocks  up  as market  falls </title>,
 <titl

# Extract NER


- We'll use a **pre-trained** core language model from the spaCy library to extract the main entities in a headline.
- spaCy has two major classes of pretrained language models, trained on different sizes of textual data:
    - **Core Models**: for general-purpose basic NLP tasks.
    - **Starter Models**: for niche applications that require transfer learning. They let us fine-tune custom models without training from scratch.

- Since our use case is basic, we'll stick with the `en_core_web_sm` core model pipeline.



In [13]:
nlp = spacy.load('en_core_web_sm')

In [15]:
# Let's see how it does with tokenization
for token in nlp(headlines[4].text):
    print(token)

Biggest
gainers
&
losers
of
the
day
:
Kanchi
Karpooram
soars
20
%
,
Aptus
shed
5
%



- A description of the part-of-speech (POS) tags can be found [here](https://spacy.io/models/en)
- A description of the dependency labels can be found [here](https://spacy.io/models/en)



In [21]:
# Let's see how it does with tagging part of speech
for token in nlp(headlines[4].text):
    print(token, "POS? ", token.pos_, " DEPENDENCY GRAPH? ", token.dep_)

Biggest POS?  ADJ  DEPENDENCY GRAPH?  amod
gainers POS?  NOUN  DEPENDENCY GRAPH?  nsubj
& POS?  CCONJ  DEPENDENCY GRAPH?  cc
losers POS?  NOUN  DEPENDENCY GRAPH?  conj
of POS?  ADP  DEPENDENCY GRAPH?  prep
the POS?  DET  DEPENDENCY GRAPH?  det
day POS?  NOUN  DEPENDENCY GRAPH?  pobj
: POS?  PUNCT  DEPENDENCY GRAPH?  punct
Kanchi POS?  PROPN  DEPENDENCY GRAPH?  compound
Karpooram POS?  PROPN  DEPENDENCY GRAPH?  nsubj
soars POS?  VERB  DEPENDENCY GRAPH?  ccomp
20 POS?  NUM  DEPENDENCY GRAPH?  nummod
% POS?  NOUN  DEPENDENCY GRAPH?  dobj
, POS?  PUNCT  DEPENDENCY GRAPH?  punct
Aptus POS?  PROPN  DEPENDENCY GRAPH?  nsubj
shed POS?  VERB  DEPENDENCY GRAPH?  ROOT
5 POS?  NUM  DEPENDENCY GRAPH?  nummod
% POS?  NOUN  DEPENDENCY GRAPH?  dobj


In [24]:
# Visualize the dependency relations among the tokens
spacy.displacy.render(nlp(headlines[4].text), style='dep', jupyter=True, options={'distance': 120})

In [25]:
# Highlight the named entities of the sentence by passing 'ent' as the style
spacy.displacy.render(nlp(headlines[4].text), style='ent', jupyter=True, options={'distance': 120})


- Different entities get different tags: for example, "the day" is tagged `DATE`, while `GPE` marks countries, cities, and states.
- There are many entities we can extract, **which ones are we interested in?**
- We are mainly looking for entities with the `ORG` tag, which covers companies, agencies, institutions, etc.



In [26]:
companies = []
for title in headlines:
    doc = nlp(title.text)
    # Keep only the entities labelled as organisations
    for ent in doc.ents:
        if ent.label_ == 'ORG':
            companies.append(ent.text)

In [27]:
companies

['Daily Trading Guide',
 'Zee Entertainment',
 'Sony',
 'SMA',
 'Nifty',
 'NCLT',
 'ZEE',
 'Invesco’s',
 'BSE',
 'JM Financial',
 'Nifty Auto',
 'Paytm’s Vijay Shekhar',
 'BSE',
 'Nifty Realty',
 'Nifty',
 'Nifty Bank',
 'Nifty',
 'F&O',
 'Zee Ent',
 'HPCL',
 'Indian Oil Corp.',
 'Sensex',
 'REC',
 'PI Industries',
 'Power Finance Corp.',
 'Sensex',
 'Piramal Ent',
 'Marico',
 'Siemens',
 'ACC']
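The extracted list contains repeats (`'BSE'`, `'Nifty'`) and possessive forms (`'Invesco’s'`). A light clean-up pass before linking could look like the sketch below. This is an optional step that is not part of the original pipeline, and the helper name `clean_entities` is hypothetical.

```python
def clean_entities(names):
    """Strip a trailing possessive and drop repeats, keeping first-seen order."""
    cleaned = []
    for name in names:
        # Handle both the typographic and the plain apostrophe
        for suffix in ("’s", "'s"):
            if name.endswith(suffix):
                name = name[: -len(suffix)]
        cleaned.append(name.strip())
    # dict.fromkeys removes duplicates while preserving insertion order
    return list(dict.fromkeys(cleaned))


# A few entries taken from the extracted list
sample = ['Nifty', 'BSE', 'Invesco’s', 'BSE', 'Nifty', 'Zee Ent']
unique_companies = clean_entities(sample)
print(unique_companies)
```

Deduplicating at this stage also avoids looking up and downloading the same stock more than once.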

# Named entity linking


- Of all the companies we extracted, we only want to keep those that appear in our knowledge base.



In [34]:
# Collect various market attributes of a stock into a dictionary
stock_dict = {
    'Org': [],
    'Symbol': [],
    'currentPrice': [],
    'dayHigh': [],
    'dayLow': [],
    'forwardPE': [],
    'dividendYield': []
}



- We have the company names, but to get their trading details we need each company's stock symbol.
- Since I am extracting the details and news of Indian companies, I am going to use an external database of [Nifty 500 companies (a CSV file)](https://www1.nseindia.com/products/content/equities/indices/nifty_500.htm).
- We'll look up every company in this list using pandas, and then fetch its stock market statistics with the `yfinance` library, which pulls data from Yahoo Finance.
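The lookup idea can be sketched without pandas. The example below is a simplified illustration under a few assumptions: the helper name `find_symbol` is hypothetical, the three rows are copied from the head of the Nifty 500 table, and the match is a literal, case-insensitive substring test. By contrast, pandas' `str.contains` is case-sensitive and treats the pattern as a regular expression by default.

```python
# Rows copied from the head of the Nifty 500 table used in this notebook
ROWS = [
    {"Company Name": "3M India Ltd.", "Symbol": "3MINDIA"},
    {"Company Name": "ABB India Ltd.", "Symbol": "ABB"},
    {"Company Name": "ACC Ltd.", "Symbol": "ACC"},
]


def find_symbol(entity, rows):
    """Return the symbol of the first row whose name contains `entity`, else None."""
    needle = entity.casefold()
    for row in rows:
        if needle in row["Company Name"].casefold():
            return row["Symbol"]
    return None


acc_symbol = find_symbol("ACC", ROWS)
sony_symbol = find_symbol("Sony", ROWS)
print(acc_symbol, sony_symbol)
```

Substring matching is deliberately loose, so short entities can match the wrong company. Taking the first hit is a simplification rather than a robust linking strategy.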



In [43]:
input_path = "../DATASETS/ind_nifty500list.csv"
stocks_df = pd.read_csv(input_path)
print('dimension: ', stocks_df.shape)
stocks_df.head()

dimension:  (501, 5)


| | Company Name | Industry | Symbol | Series | ISIN Code |
|---|---|---|---|---|---|
| 0 | 3M India Ltd. | CONSUMER GOODS | 3MINDIA | EQ | INE470A01017 |
| 1 | ABB India Ltd. | INDUSTRIAL MANUFACTURING | ABB | EQ | INE117A01022 |
| 2 | ABB Power Products and Systems India Ltd. | INDUSTRIAL MANUFACTURING | POWERINDIA | EQ | INE07Y701011 |
| 3 | ACC Ltd. | CEMENT & CEMENT PRODUCTS | ACC | EQ | INE012A01025 |
| 4 | AIA Engineering Ltd. | INDUSTRIAL MANUFACTURING | AIAENG | EQ | INE212H01026 |


In [44]:
# For each company, look it up in the Nifty 500 list and gather its market data
fields = ['currentPrice', 'dayHigh', 'dayLow', 'forwardPE', 'dividendYield']
for company in companies:
    try:
        # Note: str.contains treats `company` as a regular expression by default
        matches = stocks_df[stocks_df['Company Name'].str.contains(company)]
        if matches.empty:
            continue
        symbol = matches['Symbol'].values[0]
        org_name = matches['Company Name'].values[0]
        # Indian NSE stock symbols are stored with a .NS suffix in yfinance
        stock_info = yf.Ticker(symbol + ".NS").info
        # Read every field before appending so all lists stay the same length
        values = [stock_info[field] for field in fields]
    except Exception:
        # Skip companies whose lookup or download fails
        continue
    stock_dict['Org'].append(org_name)
    stock_dict['Symbol'].append(symbol)
    for field, value in zip(fields, values):
        stock_dict[field].append(value)
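An alternative way to keep the columns aligned when a field is missing is to build one record per stock and read each field with `dict.get`. The sketch below is only an illustration. The helper name `build_record` is hypothetical, and `fake_info` is a hand-made stand-in for the dictionary returned by `yf.Ticker(...).info`, so nothing is downloaded. Its values are borrowed from the JM Financial row of the results table, with `forwardPE` left out on purpose.

```python
FIELDS = ["currentPrice", "dayHigh", "dayLow", "forwardPE", "dividendYield"]


def build_record(org, symbol, info):
    """Return one flat row; fields absent from `info` become None."""
    record = {"Org": org, "Symbol": symbol}
    for field in FIELDS:
        record[field] = info.get(field)
    return record


# Hypothetical stand-in for a yfinance `.info` dictionary with one field missing
fake_info = {
    "currentPrice": 92.55,
    "dayHigh": 93.15,
    "dayLow": 92.0,
    "dividendYield": 0.0054,
}
record = build_record("JM Financial Ltd.", "JMFINANCIL", fake_info)
print(record)
```

A list of such records can be passed straight to `pd.DataFrame(...)`, which turns each dictionary into one row.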

In [45]:
# Create a dataframe to display the buzzing stocks
pd.DataFrame(stock_dict)

| | Org | Symbol | currentPrice | dayHigh | dayLow | forwardPE | dividendYield |
|---|---|---|---|---|---|---|---|
| 0 | Zee Entertainment Enterprises Ltd. | ZEEL | 303.2 | 309.6 | 300.0 | 18.27607 | 0.0081 |
| 1 | BSE Ltd. | BSE | 1231.25 | 1252.0 | 1225.0 | 29.238897 | 0.0169 |
| 2 | JM Financial Ltd. | JMFINANCIL | 92.55 | 93.15 | 92.0 | NaN | 0.0054 |
| 3 | BSE Ltd. | BSE | 1231.25 | 1252.0 | 1225.0 | 29.238897 | 0.0169 |
| 4 | Zee Entertainment Enterprises Ltd. | ZEEL | 303.2 | 309.6 | 300.0 | 18.27607 | 0.0081 |
| 5 | Indian Oil Corporation Ltd. | IOC | 125.3 | 128.5 | 124.35 | 6.941829 | 0.0942 |
| 6 | REC Ltd. | RECLTD | 157.85 | 163.95 | 157.15 | 5.165249 | 0.0906 |
| 7 | PI Industries Ltd. | PIIND | 3178.35 | 3273.4 | 3142.0 | 89.4554 | 0.0015 |
| 8 | Power Finance Corporation Ltd. | PFC | 142.1 | 146.4 | 141.0 | 5.636653 | 0.0849 |
| 9 | Piramal Enterprises Ltd. | PEL | 2595.8 | 2667.0 | 2584.05 | 19.007101 | 0.0125 |


# Conclusions


- We have automatically extracted organisation names from an XML news feed.
- We have linked them to the companies we are interested in.
- We have collected some important market information about them.



# References


- https://www.kdnuggets.com/2021/09/-structured-financial-newsfeed-using-python-spacy-and-streamlit.html 

