## Load spaCy resources

In [2]:
# Import spacy and English models
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")

Loading spaCy can take a while. In the meantime, here are a few definitions to help you on your NLP journey.

#### What are Stop Words?

Stop words are very common words that add little value when analysing word frequencies in text, because they carry little information about what a sentence is telling the reader.

Example: _"the","and","a","are","is"_
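
As a rough illustration of the idea, the sketch below drops stop words before counting word frequencies. The stop list is just the five example words above, and the sample sentence and the `content_word_counts` helper are made up for the illustration; spaCy ships its own, much larger list.

```python
from collections import Counter

# Illustrative stop list: only the five example words from this section.
STOP_WORDS = {"the", "and", "a", "are", "is"}

def content_word_counts(text):
    """Count lower-cased words in `text`, skipping stop words."""
    words = text.lower().split()
    return Counter(w for w in words if w not in STOP_WORDS)

# Hypothetical sample sentence (no punctuation, so a plain split is enough).
sample = "the bank and the regulator are reviewing a loan and a deposit scheme"
counts = content_word_counts(sample)
print(counts.most_common(3))
```

Without the filter, "the", "and" and "a" would top the frequency list even though they say nothing about the topic.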

#### What is a Corpus?

A corpus (plural: corpora) is a large collection of text or documents and can provide useful training data for NLP models. A corpus might be built from transcribed speech or a collection of manuscripts. Items in a corpus are not necessarily unique, and word frequency counts can help uncover the structure of a corpus.

Examples:

1. Every word written in the complete works of Shakespeare
2. Every word spoken on BBC Radio channels for the past 30 years 
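
To make the idea of frequency counts concrete, here is a minimal sketch over a toy corpus. The three documents and the `word_frequencies` helper are invented for the example, and one document is deliberately repeated to show that corpus items need not be unique.

```python
from collections import Counter

# Toy corpus (hypothetical): a list of short documents, one of them repeated.
corpus = [
    "banks cut lending rates",
    "regulator reviews lending rules",
    "banks cut lending rates",
]

def word_frequencies(documents):
    """Return how often each word occurs across all documents."""
    counts = Counter()
    for document in documents:
        counts.update(document.split())
    return counts

frequencies = word_frequencies(corpus)
print(frequencies.most_common(2))
```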

## Process text

In [3]:
!wget "https://drive.google.com/uc?export=download&id=17j7xui0oJmKhmszrH0CBmIxTWXwcgqky" -O IndianFinancialNews.csv

--2021-06-21 08:04:10--  https://drive.google.com/uc?export=download&id=17j7xui0oJmKhmszrH0CBmIxTWXwcgqky
Resolving drive.google.com (drive.google.com)... 74.125.20.102, 74.125.20.138, 74.125.20.113, ...
Connecting to drive.google.com (drive.google.com)|74.125.20.102|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-0s-44-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/j925besc6ob0pdfcakupodkcghae8cpc/1624262625000/03885802779803335284/*/17j7xui0oJmKhmszrH0CBmIxTWXwcgqky?e=download [following]
--2021-06-21 08:04:13--  https://doc-0s-44-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/j925besc6ob0pdfcakupodkcghae8cpc/1624262625000/03885802779803335284/*/17j7xui0oJmKhmszrH0CBmIxTWXwcgqky?e=download
Resolving doc-0s-44-docs.googleusercontent.com (doc-0s-44-docs.googleusercontent.com)... 74.125.195.132, 2607:f8b0:400e:c09::84
Connecting to doc-0s-44-docs.googleusercontent.com (doc-0s-44-

#### Datetime parsing

- https://docs.python.org/3/library/datetime.html

In [23]:
from datetime import datetime

custom_date_parser = lambda x: datetime.strptime(x, "%B %d, %Y, %A")

news_df = pd.read_csv("IndianFinancialNews.csv", 
                      index_col=[0], 
                      parse_dates=['Date'],
                      date_parser=custom_date_parser)
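
The format string passed to `strptime` can be checked on a single value before parsing the whole file. The sample string below is an assumption about how dates look in the CSV, inferred from the format string; `%B` and `%A` match English month and weekday names only under an English or default locale.

```python
from datetime import datetime

DATE_FORMAT = "%B %d, %Y, %A"  # full month name, day, year, full weekday name

# Assumed sample value, shaped to match the format string above.
sample = "May 26, 2020, Tuesday"
parsed = datetime.strptime(sample, DATE_FORMAT)
print(parsed.date())

# A string that does not match the format raises ValueError.
try:
    datetime.strptime("2020-05-26", DATE_FORMAT)
except ValueError as error:
    print("Could not parse:", error)
```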

In [24]:
pd.set_option("max_colwidth", 200)
news_df.head(5)

Unnamed: 0,Date,Title,Description
0,2020-05-26,"ATMs to become virtual bank branches, accept deposits with instant credit","Close to 14.6 per cent (or 35,000) of the 240,000 ATMs in India are new-age recyclers, even though they have been around for only ..."
1,2020-05-26,IDFC First Bank seniors to forgo 65% of bonus amid Covid-19 crisis,"V Vaidyanathan, managing director and chief executive, will take 30 per cent cut in his compensation, including fixed ..."
2,2020-05-25,"Huge scam in YES Bank for many years, says Enforcement Directorate",Rana Kapoor's wife also charged with abetting crime
3,2020-05-24,"Bank of Maharashtra sanctioned Rs 2,789 cr in loans to MSMEs in 3 months",The bank said it was now gearing up to extend the stimulus package announced by Finance Minister Nirmala Sitharaman to restart ...
4,2020-05-23,DCB Bank's profit before tax declines 37.6% to Rs 93.84 crore in Q4,"Net profit for the financial year ended March 31, 2020 (FY20), stood at Rs 337.25 crore, up marginally from Rs 325.37 crore in ..."


In [25]:
news_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000 entries, 0 to 49999
Data columns (total 3 columns):
Date           50000 non-null datetime64[ns]
Title          50000 non-null object
Description    49290 non-null object
dtypes: datetime64[ns](1), object(2)
memory usage: 1.5+ MB


In [28]:
news_df.Date.min()

Timestamp('2003-02-10 00:00:00')

In [30]:
news_df.Date.max()

Timestamp('2020-05-26 00:00:00')

## Get tokens and sentences

#### What is a Token?
A token is a single unit that a sentence is split into, such as a word, a punctuation mark, or a group of words to analyse. The task of splitting the sentence up is called "tokenisation".

Example: The following sentence can be tokenised by splitting up the sentence into individual words.

	"Cytora is going to PyCon!"
	["Cytora","is","going","to","PyCon!"]
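
Splitting on whitespace leaves punctuation attached to the last word, as in `"PyCon!"` above. A rough regex-based sketch that separates punctuation into its own tokens is shown below; it is a simplification for illustration (the `simple_tokenise` name is made up), not spaCy's tokeniser, which applies language-specific rules.

```python
import re

def simple_tokenise(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Cytora is going to PyCon!"
print(sentence.split())           # whitespace split keeps "PyCon!" together
print(simple_tokenise(sentence))  # punctuation becomes a separate token
```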

In [12]:
all_descriptions = news_df['Description'].tolist()

In [16]:
doc = nlp(unicode(all_descriptions[5]))


# Print tokens (one token per line)
for token in doc:
    print(token)

Under
the
scheme
,
the
government
will
offer
100
per
cent
guarantee
on
loans
.


## Part of speech tags

#### What is a Part-of-Speech Tag?
A part-of-speech tag is a context-sensitive label for the grammatical role a word plays in its sentence, such as noun, verb, or adjective.
More information about the kinds of part-of-speech tags used in NLP can be found here:
- https://spacy.io/api/annotation
- http://www.winwaed.com/blog/2011/11/08/part-of-speech-tags/

Examples:

1. NUM, Numeral - 1, 2, 3
2. PROPN, Proper Noun - "Matic", "Andraz", "Cardiff"
3. INTJ, Interjection - "Uhhhhhhhhhhh"

In [17]:
# For each token, print corresponding part of speech tag
for token in doc:
    print('{} - {}'.format(token, token.pos_))

Under - ADP
the - DET
scheme - NOUN
, - PUNCT
the - DET
government - NOUN
will - VERB
offer - VERB
100 - NUM
per - ADP
cent - NOUN
guarantee - NOUN
on - ADP
loans - NOUN
. - PUNCT
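
Once every token has a tag, a natural next step is to tally how often each tag occurs. The sketch below does this for the (token, tag) pairs printed above; `tag_counts` is a hypothetical helper name, and with a spaCy `doc` the same pairs could be produced from `token.text` and `token.pos_`.

```python
from collections import Counter

# (token, tag) pairs copied from the output above.
tagged = [
    ("Under", "ADP"), ("the", "DET"), ("scheme", "NOUN"), (",", "PUNCT"),
    ("the", "DET"), ("government", "NOUN"), ("will", "VERB"),
    ("offer", "VERB"), ("100", "NUM"), ("per", "ADP"), ("cent", "NOUN"),
    ("guarantee", "NOUN"), ("on", "ADP"), ("loans", "NOUN"), (".", "PUNCT"),
]

def tag_counts(pairs):
    """Count how many tokens carry each part-of-speech tag."""
    return Counter(tag for _, tag in pairs)

counts = tag_counts(tagged)
for tag, count in counts.most_common():
    print('{} - {}'.format(tag, count))
```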


## Named entities

#### Named Entities

A named entity is any real-world object with a proper name, such as a person, location, organisation or product.

- https://spacy.io/api/annotation

In [18]:
# Print all named entities with named entity types

for ent in doc.ents:
    print('{} - {}'.format(ent, ent.label_))

100 per cent - MONEY


### Most visible entities in a year



In [36]:
news_df['year'] = pd.DatetimeIndex(news_df['Date']).year

In [58]:
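def get_top_10(year):
  """Return the 10 most frequent ORG entities in titles from `year`."""
  year_df = news_df[news_df.year == year]

  # Maps organisation name -> number of mentions
  org_counts = {}

  try:
    for each_title in year_df.Title:
      doc = nlp(unicode(each_title))
      for ent in doc.ents:
        if ent.label_ == 'ORG':
          entity_name = ent.text
          org_counts[entity_name] = org_counts.get(entity_name, 0) + 1
  except Exception:
    # Note: an error stops the loop, so only the counts gathered so far are used
    pass

  # Sort by count, highest first
  sort_orders = sorted(org_counts.items(),
                      key=lambda x: x[1],
                      reverse=True)
  return sort_orders[0:10]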

In [59]:
get_top_10(2010)

[(u'Sebi', 14),
 (u'Aegon Religare', 7),
 (u'Citi', 6),
 (u'Basel', 5),
 (u'Bhatt', 5),
 (u'Buffett', 4),
 (u'Crisil', 3),
 (u'Future Generali', 3),
 (u'Tata Cap', 3),
 (u'SKS Microfinance', 3)]
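
The counting and sorting steps in `get_top_10` can also be expressed with `collections.Counter`, whose `most_common` method returns the highest counts first. A minimal sketch, using a made-up list of entity names in place of spaCy output (`top_entities` is a hypothetical helper):

```python
from collections import Counter

def top_entities(entity_names, n=10):
    """Return the n most frequent names as (name, count) pairs."""
    return Counter(entity_names).most_common(n)

# Hypothetical entity mentions, standing in for the ORG entities found by spaCy.
mentions = ["Sebi", "Citi", "Sebi", "Crisil", "Sebi", "Citi"]
print(top_entities(mentions, n=2))
```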

## Exercise for participants

- Write a method to find the most frequently mentioned proper noun for a specific month and year. The month and year will be passed as arguments.