## Load spaCy resources

In [2]:
# Import spacy and English models
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")

Loading spaCy can take a while. In the meantime, here are a few definitions to help you on your NLP journey.



#### Datetime parsing

- https://docs.python.org/3/library/datetime.html

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
file_path = '/content/drive/MyDrive/NLP/NLPClassV1/IndianFinancialNews.csv'

In [4]:
from datetime import datetime

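# Parse dates written as "<month name> <day>, <year>, <weekday name>"
def custom_date_parser(x):
    return datetime.strptime(x, "%B %d, %Y, %A")

# Note: date_parser is deprecated since pandas 2.0; newer versions accept
# date_format="%B %d, %Y, %A" instead.
news_df = pd.read_csv(file_path,
                      index_col=[0],
                      parse_dates=['Date'],
                      date_parser=custom_date_parser)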
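
To make the format string above easier to read, here is a minimal sketch that applies it to a single value. The sample string is hypothetical and only assumes the layout implied by the format codes (`%B` full month name, `%d` day, `%Y` four-digit year, `%A` full weekday name).

```python
from datetime import datetime

# Same format string as the one passed to strptime above
DATE_FORMAT = "%B %d, %Y, %A"

# Hypothetical sample value written in the layout the format string expects
sample = "May 26, 2020, Tuesday"

parsed = datetime.strptime(sample, DATE_FORMAT)
print(parsed)  # 2020-05-26 00:00:00

# A value in a different layout raises ValueError instead of parsing silently
try:
    datetime.strptime("2020-05-26", DATE_FORMAT)
except ValueError as err:
    print("Could not parse:", err)
```

If the raw column contained values in any other layout, the parser would raise, so it is worth checking a few raw rows before loading the whole file.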

In [5]:
pd.set_option("max_colwidth", 200)
news_df.head(5)

Unnamed: 0,Date,Title,Description
0,2020-05-26,"ATMs to become virtual bank branches, accept deposits with instant credit","Close to 14.6 per cent (or 35,000) of the 240,000 ATMs in India are new-age recyclers, even though they have been around for only ..."
1,2020-05-26,IDFC First Bank seniors to forgo 65% of bonus amid Covid-19 crisis,"V Vaidyanathan, managing director and chief executive, will take 30 per cent cut in his compensation, including fixed ..."
2,2020-05-25,"Huge scam in YES Bank for many years, says Enforcement Directorate",Rana Kapoor's wife also charged with abetting crime
3,2020-05-24,"Bank of Maharashtra sanctioned Rs 2,789 cr in loans to MSMEs in 3 months",The bank said it was now gearing up to extend the stimulus package announced by Finance Minister Nirmala Sitharaman to restart ...
4,2020-05-23,DCB Bank's profit before tax declines 37.6% to Rs 93.84 crore in Q4,"Net profit for the financial year ended March 31, 2020 (FY20), stood at Rs 337.25 crore, up marginally from Rs 325.37 crore in ..."


In [6]:
news_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000 entries, 0 to 49999
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Date         50000 non-null  datetime64[ns]
 1   Title        50000 non-null  object        
 2   Description  49290 non-null  object        
dtypes: datetime64[ns](1), object(2)
memory usage: 1.5+ MB


In [7]:
news_df.Date.min()

Timestamp('2003-02-10 00:00:00')

In [8]:
news_df.Date.max()

Timestamp('2020-05-26 00:00:00')

## Named entities

#### Named Entities

A named entity is any real-world object with a proper name, such as a person, location, organisation or product.

- https://spacy.io/api/annotation

In [14]:
all_titles = [str(x) for x in news_df.sample(10)['Description']]

In [15]:
all_titles

['Public sector lender Bank of India plans to raise upto Rs 1,500 crore through Basel III compliant bonds to boost its tier II ...',
 'Enterprise value, say sources, set finally around Rs 280 crore; deal to be announced next month',
 'As per rules, every insurer, who begins to carry on insurance business has to ensure that it undertakes social, rural sector ...',
 "With the revision, PNB's overnight marginal cost of funds based lending rate (MCLR) now stands at 8.2 per cent as against 7.9 per ...",
 "While liquidity played a role, banks' reluctance to lend due to risk aversion and tightened group borrower exposure limits are ...",
 'Banks face a major challenge in terms of introducing innovative customer-friendly products and services and newer technologies to ...',
 "Interest rate rises have been a cause of concern for the past 18 months. Prompted by the Reserve Bank of India's increase in the ...",
 'State Bank of Bikaner & Jaipur is planning to raise Rs 200 crore.According to a rele

In [17]:
# Print every named entity together with its entity type
# (all_titles holds the sampled Description values)
for sent in all_titles:
  doc = nlp(sent)
  for ent in doc.ents:
    print('{} - {}'.format(ent, ent.label_))

Bank of India - ORG
Rs 1,500 - PRODUCT
Basel III - GPE
Rs 280 - PRODUCT
next month - DATE
PNB - ORG
overnight - TIME
8.2 per cent - MONEY
7.9 - CARDINAL
the past 18 months - DATE
the Reserve Bank of India's - ORG
State Bank of Bikaner & Jaipur - ORG
Rs 200 - PRODUCT
BSE - ORG
Dena Bank - ORG
PLR - ORG
75 - CARDINAL
13.50 per cent - MONEY
Syndicate Bank's - ORG
the quarter ended September 2004 - DATE
34 per cent - MONEY
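
The label abbreviations in the output above are not always self-explanatory. As a small sketch, spaCy's `spacy.explain` helper can look up a short description for each one; the `labels` list below is simply copied from the labels seen in the output, and no language model is needed for the lookup.

```python
import spacy

# Entity labels that appear in the output above
labels = ["ORG", "PRODUCT", "GPE", "DATE", "TIME", "MONEY", "CARDINAL"]

# spacy.explain returns a short description string, or None for unknown terms
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print('{} - {}'.format(label, description))
```

Reading the descriptions next to the output also makes the model's mistakes easier to spot, for example percentages tagged as MONEY.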


## Most visible entities in a year



In [20]:
news_df['year'] = pd.DatetimeIndex(news_df['Date']).year
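
The same idea extends to other calendar fields. The sketch below uses a hypothetical three-row frame (`sample_df` is not part of the dataset) to show the `.dt` accessor, which is an equivalent way to extract the year, and how year and month masks can be combined.

```python
import pandas as pd

# Hypothetical miniature frame with the same Date column type as news_df
sample_df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-05-26", "2020-04-30", "2019-05-01"]),
    "Title": ["a", "b", "c"],
})

# The .dt accessor exposes calendar fields of a datetime column
sample_df["year"] = sample_df["Date"].dt.year
sample_df["month"] = sample_df["Date"].dt.month

# Boolean masks are combined with & to select one month of one year
may_2020 = sample_df[(sample_df["year"] == 2020) & (sample_df["month"] == 5)]
print(may_2020["Title"].tolist())  # ['a']
```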

In [21]:
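def get_top_10(year):
  # Keep only the rows published in the requested year
  year_df = news_df[news_df.year == year]

  # Map each organisation name to the number of times it is mentioned
  org_counts = {}

  # Errors are no longer swallowed silently: a failure while processing
  # a title now surfaces instead of returning partial counts.
  for each_title in year_df.Title:
    doc = nlp(str(each_title))
    for ent in doc.ents:
      if ent.label_ == 'ORG':
        entity_name = ent.text
        org_counts[entity_name] = org_counts.get(entity_name, 0) + 1

  # Sort by count, highest first, and keep the ten most frequent
  sort_orders = sorted(org_counts.items(),
                       key=lambda x: x[1],
                       reverse=True)
  return sort_orders[:10]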

In [23]:
get_top_10(2010)

[('RBI', 441),
 ('Bank', 42),
 ('LIC', 40),
 ('IDBI Bank', 38),
 ('PNB', 36),
 ('ICICI Bank', 36),
 ('IPO', 35),
 ('Union Bank', 33),
 ('ICICI', 31),
 ('Indian Bank', 27)]
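
The count-then-sort step inside `get_top_10` could also be written with `collections.Counter` from the standard library. This is only a sketch of that alternative: the `org_mentions` list is hypothetical and stands in for the ORG entity texts that spaCy would produce.

```python
from collections import Counter

# Hypothetical entity mentions, standing in for the ORG entities found by spaCy
org_mentions = ["RBI", "LIC", "RBI", "PNB", "RBI", "LIC"]

# Counter tallies how often each value occurs
org_counts = Counter(org_mentions)

# most_common(n) returns the n highest counts, largest first
top_2 = org_counts.most_common(2)
print(top_2)  # [('RBI', 3), ('LIC', 2)]
```

Note that variants such as 'ICICI' and 'ICICI Bank' are counted separately in either approach, because the counts are keyed on the exact entity text.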

## Part of Speech Tags

#### What is a Part-of-Speech Tag?
A part-of-speech (POS) tag labels the grammatical role a word plays, such as noun, verb or adjective. The tag is context sensitive: the same word can receive different tags depending on the sentence it appears in.
More information about the POS tags used in NLP can be found here:
- https://spacy.io/api/annotation
- http://www.winwaed.com/blog/2011/11/08/part-of-speech-tags/

Examples:

1. NUM, Numeral - 1, 2, 3
2. PROPN, Proper Noun - "Matic", "Andraz", "Cardiff"
3. INTJ, Interjection - "Uhhhhhhhhhhh"

In [24]:
# For each token, print the corresponding part-of-speech tag
# (doc is the last document processed in the named-entity loop above)
for token in doc:
    print('{} - {}'.format(token, token.pos_))

Syndicate - PROPN
Bank - PROPN
's - PART
net - ADJ
profit - NOUN
for - ADP
the - DET
quarter - NOUN
ended - VERB
September - PROPN
2004 - NUM
fell - VERB
34 - NUM
per - ADP
cent - NOUN
compared - VERB
with - ADP
the - DET
corresponding - ADJ
period - NOUN
last - ADV
... - PUNCT


## Exercise for participants

- Write a function that finds the most frequent proper noun for a specific month and year. The month and year will be passed as arguments.