In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/news-data/abcnews-date-text.csv


### What is NLP?
The objective of NLP (Natural Language Processing) is to make computers understand text and spoken words in much the same way human beings can.

NLP applies statistical, machine learning, and deep learning models to large amounts of text to comprehend the speaker’s or writer’s intent and sentiment.

The input to an NLP model usually consists of the words in the text. We can, however, add more features to make the model more accurate.

### What is Named Entity Recognition (NER)?
Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
(from https://en.wikipedia.org/wiki/Named-entity_recognition)

We can use NER to create new columns for an NLP model, and sometimes a visualization of the named entities alone is enough to tell what type of text it is.

### Get to know spaCy
spaCy is a free, open-source Python library for Natural Language Processing. It features NER, POS tagging, dependency parsing, word vectors, and more.

In this project, I will show you step by step how to perform NER with the spaCy library.

### Step summary
- Install and import libraries
- Load an NER model
- Create a tag list column
- Create new features

### Import Libraries

In [2]:
import spacy
import pandas as pd

### Load a dataset

In [3]:
df = pd.read_csv("../input/news-data/abcnews-date-text.csv")
df = df[:10] # Use only 10 samples to reduce computation time

### Load an NER model
Use spacy.load to load the pre-trained NER model. (Make sure you have already downloaded this model.)

In [4]:
ner = spacy.load("en_core_web_sm")
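
spaCy distributes its pre-trained models as regular Python packages, so one way to check whether the model is available before calling spacy.load is to look it up with the standard library. This is a minimal sketch, not part of the original notebook; the helper name `model_is_installed` is hypothetical.

```python
import importlib.util


def model_is_installed(name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(name) is not None


if not model_is_installed("en_core_web_sm"):
    # The model can be downloaded from a terminal or a notebook cell with:
    #   python -m spacy download en_core_web_sm
    print("en_core_web_sm is not installed")
```

If the check prints the message, run the download command shown in the comment and then re-run the loading cell.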

#### Apply the model to a text

In [5]:
doc = ner(df['headline_text'][9])

doc is a spaCy Doc object. You can loop over its tokens and read their attributes.

- Display
Loop over the tokens in doc.

.text returns the word

.pos_ returns the part of speech of the word

.ent_type_ returns the named-entity label of the word (empty if the word is not part of an entity)

From the example “australia is locked into war timetable opp”, you will see the model can detect australia as GPE (Geopolitical Entity).

In [6]:
print("Text is: "+doc.text+"\n")
for token in doc:
    print(token.text+"\t"+token.pos_+"\t"+token.ent_type_)

Text is: australia is locked into war timetable opp

australia	PROPN	GPE
is	AUX	
locked	VERB	
into	ADP	
war	NOUN	
timetable	NOUN	
opp	NOUN	


#### spaCy has a useful display tool, displacy, that can beautifully visualize the named entity.

In [7]:
spacy.displacy.render(doc, style="ent")

#### The list of all NER labels is available through .pipe_labels, and spacy.explain gives a description of each label.

In [8]:
ner_list = ner.pipe_labels['ner']
print("Number of NER: "+str(len(ner_list)))
for label in ner_list:
    print(label+" : "+spacy.explain(label))

Number of NER: 18
CARDINAL : Numerals that do not fall under another type
DATE : Absolute or relative dates or periods
EVENT : Named hurricanes, battles, wars, sports events, etc.
FAC : Buildings, airports, highways, bridges, etc.
GPE : Countries, cities, states
LANGUAGE : Any named language
LAW : Named documents made into laws.
LOC : Non-GPE locations, mountain ranges, bodies of water
MONEY : Monetary values, including unit
NORP : Nationalities or religious or political groups
ORDINAL : "first", "second", etc.
ORG : Companies, agencies, institutions, etc.
PERCENT : Percentage, including "%"
PERSON : People, including fictional
PRODUCT : Objects, vehicles, foods, etc. (not services)
QUANTITY : Measurements, as of weight or distance
TIME : Times smaller than a day
WORK_OF_ART : Titles of books, songs, etc.


- There are 18 NER labels, as shown above.

### Create a tag list column
First, create a function add_tag_column that takes a text and returns a dictionary with the number of occurrences of each tag. Counts are per token, so an entity that spans two tokens adds 2 to its tag.

In [9]:
def add_tag_column(text):
    doc = ner(text)
    tag_dict = dict.fromkeys(ner_list, 0) # start every tag count at zero
    for token in doc:
        if token.ent_type_ != "": # token belongs to a named entity
            tag_dict[token.ent_type_] += 1
    return tag_dict
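
The counting step inside add_tag_column does not depend on spaCy itself, so it can be illustrated in isolation. The sketch below replays the token table printed earlier for the headline “australia is locked into war timetable opp” as plain tuples and counts the non-empty entity labels with `collections.Counter`. The names `tokens`, `labels`, and `count_tags` are hypothetical, and `labels` is shortened to three entries for brevity.

```python
from collections import Counter

# (text, part of speech, entity label) triples copied from the output of the
# token loop earlier in the notebook; an empty string means "not an entity".
tokens = [
    ("australia", "PROPN", "GPE"),
    ("is", "AUX", ""),
    ("locked", "VERB", ""),
    ("into", "ADP", ""),
    ("war", "NOUN", ""),
    ("timetable", "NOUN", ""),
    ("opp", "NOUN", ""),
]

# Shortened, illustrative label list; the notebook uses all 18 labels.
labels = ["GPE", "ORG", "PERSON"]


def count_tags(tokens, labels):
    """Count tokens per entity label, keeping a zero for unseen labels."""
    counts = Counter(ent for _, _, ent in tokens if ent)
    return {label: counts.get(label, 0) for label in labels}


tag_counts = count_tags(tokens, labels)
print(tag_counts)
```

Like add_tag_column, this counts tokens rather than whole entities, so a two-token entity contributes 2 to its label.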

#### Apply the add_tag_column function to all rows in the dataframe. The tag list result will be temporarily stored in the tag column.

In [10]:
df["tag"] = df["headline_text"].apply(add_tag_column)
df.head()

Unnamed: 0,publish_date,headline_text,tag
0,20030219,aba decides against community broadcasting lic...,"{'CARDINAL': 0, 'DATE': 0, 'EVENT': 0, 'FAC': ..."
1,20030219,act fire witnesses must be aware of defamation,"{'CARDINAL': 0, 'DATE': 0, 'EVENT': 0, 'FAC': ..."
2,20030219,a g calls for infrastructure protection summit,"{'CARDINAL': 0, 'DATE': 0, 'EVENT': 0, 'FAC': ..."
3,20030219,air nz staff in aust strike for pay rise,"{'CARDINAL': 0, 'DATE': 0, 'EVENT': 0, 'FAC': ..."
4,20030219,air nz strike to affect australian travellers,"{'CARDINAL': 0, 'DATE': 0, 'EVENT': 0, 'FAC': ..."


### Create new features
Create new columns named tag_<label>, one per NER label, each holding the number of occurrences of that label in the text.

In [11]:
for tag in ner_list:
    df["tag_"+tag] = df['tag'].apply(lambda count: count[tag])
    
df.head()    

Unnamed: 0,publish_date,headline_text,tag,tag_CARDINAL,tag_DATE,tag_EVENT,tag_FAC,tag_GPE,tag_LANGUAGE,tag_LAW,...,tag_MONEY,tag_NORP,tag_ORDINAL,tag_ORG,tag_PERCENT,tag_PERSON,tag_PRODUCT,tag_QUANTITY,tag_TIME,tag_WORK_OF_ART
0,20030219,aba decides against community broadcasting lic...,"{'CARDINAL': 0, 'DATE': 0, 'EVENT': 0, 'FAC': ...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,20030219,act fire witnesses must be aware of defamation,"{'CARDINAL': 0, 'DATE': 0, 'EVENT': 0, 'FAC': ...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,20030219,a g calls for infrastructure protection summit,"{'CARDINAL': 0, 'DATE': 0, 'EVENT': 0, 'FAC': ...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,20030219,air nz staff in aust strike for pay rise,"{'CARDINAL': 0, 'DATE': 0, 'EVENT': 0, 'FAC': ...",0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0
4,20030219,air nz strike to affect australian travellers,"{'CARDINAL': 0, 'DATE': 0, 'EVENT': 0, 'FAC': ...",0,0,0,0,0,0,0,...,0,1,0,0,0,2,0,0,0,0
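
The loop above adds one column per label. As an alternative sketch, pandas can expand a column of dictionaries in one step by building a DataFrame from the list of dictionaries and joining it back. The small frame `demo` below is a stand-in for df: it reuses two headlines and their counts from the output above, but with the tag dictionaries shortened to three labels for readability. The names `demo` and `tag_columns` are hypothetical.

```python
import pandas as pd

# Two headlines from the notebook with a shortened, three-label tag dictionary
# (the real dictionaries hold all 18 labels).
demo = pd.DataFrame(
    {
        "headline_text": [
            "air nz staff in aust strike for pay rise",
            "air nz strike to affect australian travellers",
        ],
        "tag": [
            {"NORP": 0, "ORG": 2, "PERSON": 0},
            {"NORP": 1, "ORG": 0, "PERSON": 2},
        ],
    }
)

# Build one column per dictionary key, prefix the names, and join them back.
tag_columns = pd.DataFrame(demo["tag"].tolist(), index=demo.index).add_prefix("tag_")
demo = demo.join(tag_columns)

print(demo[["tag_NORP", "tag_ORG", "tag_PERSON"]])
```

Passing `index=demo.index` keeps the new columns aligned with the original rows even when the frame has been filtered or re-indexed.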


### You can use these columns to train an NLP model.