# <div class="alert alert-info"> 1. Introduction </div>

1.  **Problem**:  Processing raw text intelligently is difficult: most words are rare, and words that look completely different often mean almost the same thing. The same words in a different order can mean something completely different. It is therefore usually better to use linguistic knowledge to add useful information.


2. **Part-of-Speech (POS) Tagging**: Assigning each word a grammatical category (noun, verb, adjective, ...) based on the word itself and its context in the sentence. spaCy exposes two levels of tags:

+ Coarse-grained POS tags

+ Fine-grained POS tags
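
To make the distinction concrete: in spaCy the coarse-grained tag is available as `token.pos_` (for example `NOUN`, `VERB`) and the fine-grained tag as `token.tag_` (for example `NN`, `VBD` in the English models). The sketch below uses a small hand-written, partial mapping to show how several fine-grained tags typically collapse into one coarse tag. The names `FINE_TO_COARSE` and `coarse_tag` are illustrative and not part of spaCy, and a real tagger assigns both tags per token in context, so the correspondence is not always one-to-one.

```python
from collections import defaultdict

# Hand-written, partial and simplified correspondence (an assumption for
# illustration): a real tagger decides both tags per token in context.
FINE_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN",
    "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
    "JJ": "ADJ", "JJR": "ADJ",
    "DT": "DET",
}

def coarse_tag(fine_tag):
    """Return the typical coarse tag for a fine-grained tag, or 'X' if unknown."""
    return FINE_TO_COARSE.get(fine_tag, "X")

# Group the fine-grained tags under their coarse tag
by_coarse = defaultdict(list)
for fine, coarse in FINE_TO_COARSE.items():
    by_coarse[coarse].append(fine)

for coarse, fines in sorted(by_coarse.items()):
    print(f"{coarse:6} <- {', '.join(fines)}")
```

The coarse tags are convenient for filtering (all nouns, all verbs), while the fine tags keep distinctions such as singular versus plural or past tense versus participle.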


# <div class="alert alert-info"> 2. Setup </div>

In [1]:
#!pip install networkx
#!pip install datapane
# (operator is part of the Python standard library and needs no install)
#!pip install pyvis
#!pip install streamlit pyvis networkx
#!pip install langdetect

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import zipfile
import json
import urllib
import langdetect

import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
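nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')  # needed by nltk.corpus.stopwords in section 3.3
nltk.download('wordnet')    # needed by WordNetLemmatizer in section 3.3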

import spacy
nlp = spacy.load('en_core_web_sm')

from spacy.matcher import Matcher 
from spacy.tokens import Span 
from spacy import displacy 


import networkx as nx
from networkx.algorithms import community #This part of networkx, for community detection, needs to be imported separately
import datapane as dp
#from operator import itemgetter
from pyvis.network import Network

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Mai\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


---
# <div class="alert alert-info"> 3. Data Preparation </div>

 ## <font color=red>**3.1.  News category dataset**

The dataset used is the “News category dataset” from Kaggle (https://www.kaggle.com/rmisra/news-category-dataset). It provides around 200k news headlines from 2012 to 2018, obtained from HuffPost.

## <font color=red>3.2. Load data
    
The original dataset is distributed as a JSON file. Here I load a previously processed CSV copy of it directly into a pandas DataFrame.

In [4]:
df = pd.read_csv('../data/processed/News_Category.csv')
print(df.shape)
## show the first 5 rows
df = df.reset_index(drop=True)
df.head(5)

(26768, 4)


Unnamed: 0,category,headline,short_description,text
0,BUSINESS,"U.S. Launches Auto Import Probe, China Vows To...",The investigation could lead to new U.S. tarif...,"U.S. Launches Auto Import Probe, China Vows To..."
1,BUSINESS,Starbucks Says Anyone Can Now Sit In Its Cafes...,The new policy was unveiled weeks after the co...,Starbucks Says Anyone Can Now Sit In Its Cafes...
2,BUSINESS,Seattle Passes Controversial New Tax On City's...,"Following the council vote, Amazon’s vice pres...",Seattle Passes Controversial New Tax On City's...
3,BUSINESS,Uber Ends Forced Arbitration In Individual Cas...,Victims will be free to go to court -- but a f...,Uber Ends Forced Arbitration In Individual Cas...
4,BUSINESS,"Chili's Hit By Data Breach, Credit And Debit C...",The breach is believed to have occurred betwee...,"Chili's Hit By Data Breach, Credit And Debit C..."


In [5]:
df.category.value_counts()

TRAVEL          9855
FOOD & DRINK    6208
BUSINESS        5878
SPORTS          4827
Name: category, dtype: int64

In [6]:
df = df[df['category'] == 'TRAVEL']

## <font color=red>**3.3. Text Processing and Normalizing** </font>  
    
Before feature engineering, we need to pre-process, clean, and normalize the text. The following are some of the most popular pre-processing techniques:

**1. Text tokenization and lowercasing**: Split the document into individual words and convert all words to lowercase <br>
**2. Removing special characters**: Remove special characters and punctuation <br>
**3. Removing stop words**: Words like "a" and "the" appear so frequently that they carry little information; they are called stop words and can be filtered out of the text to be processed <br>
**4. Stemming**: Reduce words to their stem by stripping suffixes such as -ing, -ly, -ed <br>
**5. Lemmatization**: In contrast to stemming, lemmatization looks beyond word reduction and considers a language's full vocabulary to apply a *morphological analysis* to words. The lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'. Further, the lemma of 'meeting' might be 'meet' or 'meeting' depending on its use in a sentence. <br>
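
As a rough illustration of steps 1 to 3, here is a minimal, dependency-free sketch. The stop-word set `STOP_WORDS` and the helper `simple_preprocess` are hypothetical names introduced only for this example; NLTK's English stop-word list is much longer. Stemming and lemmatization are not shown because they need a stemmer or a lexicon such as the ones NLTK provides.

```python
import re

# Tiny illustrative stop-word list (an assumption, not NLTK's list)
STOP_WORDS = {"a", "an", "the", "is", "are", "on", "of", "and", "to"}

def simple_preprocess(text, remove_stop_words=True):
    """Lowercase, drop special characters, tokenize and optionally remove stop words."""
    text = text.lower()                        # step 1: lowercasing
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # step 2: replace special characters with spaces
    tokens = text.split()                      # step 1: whitespace tokenization
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]  # step 3: stop words
    return tokens

tokens = simple_preprocess("The mice are sleeping on the seats!")
print(tokens)
```

Note that removing punctuation and stop words also removes cues that a dependency parser relies on, so how much cleaning is appropriate depends on what the text is used for afterwards.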

In [7]:
def utils_preprocess_text(text, flg_stemm=False, flg_lemm=False, lst_stopwords=None):
    '''
    Preprocess a string.
    :parameter
        :param text: string - text to clean
        :param flg_stemm: bool - whether stemming is to be applied
        :param flg_lemm: bool - whether lemmatisation is to be applied
        :param lst_stopwords: list - list of stopwords to remove
    :return
        cleaned text
    '''
    ## clean (convert to lowercase and strip surrounding whitespace)
    ## punctuation removal is currently disabled (commented out)
    #text = re.sub(r'[^\w\s]', '', str(text).lower().strip())
    #text = re.sub('[^a-zA-Z\s]', '', text)
    text = str(text).lower().strip()
            
    ## Tokenize (convert from string to list)
    lst_text = text.split()
    ## remove Stopwords
    if lst_stopwords is not None:
        lst_text = [word for word in lst_text if word not in lst_stopwords]
                
    ## Stemming (remove -ing, -ly, ...)
    if flg_stemm:
        ps = nltk.stem.porter.PorterStemmer()
        lst_text = [ps.stem(word) for word in lst_text]
                
    ## Lemmatisation (convert the word into root word)
    if flg_lemm:
        lem = nltk.stem.wordnet.WordNetLemmatizer()
        lst_text = [lem.lemmatize(word) for word in lst_text]
            
    ## back to string from list
    text = " ".join(lst_text)
    return text

In [8]:
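# Note: the stop-word list is loaded but not applied below (lst_stopwords=None),
# so text_clean is only lowercased and stripped
lst_stopwords = nltk.corpus.stopwords.words("english")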

df['text_clean'] = df["short_description"].apply(lambda x: utils_preprocess_text(x, flg_stemm=False, flg_lemm=False, lst_stopwords=None))

df.head()

Unnamed: 0,category,headline,short_description,text,text_clean
5878,TRAVEL,"14 Ways To Make Family Road Trips Easier, From...",Having waterproof covers on the seats is kind ...,"14 Ways To Make Family Road Trips Easier, From...",having waterproof covers on the seats is kind ...
5879,TRAVEL,14 Trips To Take From New York City On A Long ...,"Charming towns, relaxing beaches and top hikin...",14 Trips To Take From New York City On A Long ...,"charming towns, relaxing beaches and top hikin..."
5880,TRAVEL,Disney Reveals Opening Seasons For 'Star Wars'...,Star Wars: Galaxy's Edge will open at Disneyla...,Disney Reveals Opening Seasons For 'Star Wars'...,star wars: galaxy's edge will open at disneyla...
5881,TRAVEL,Lonely Planet's Top European Destinations Of 2...,These underrated travel destinations in Europe...,Lonely Planet's Top European Destinations Of 2...,these underrated travel destinations in europe...
5882,TRAVEL,8 Majestic Islands In Europe That Most Tourist...,If you’re dreaming about a romantic European g...,8 Majestic Islands In Europe That Most Tourist...,if you’re dreaming about a romantic european g...


## <font color=red>3.4.  Split text into sentences

In [9]:
df = df.reset_index(drop=True)
sent_list = []
id_list = []
category_list = []

for i in range(0, len(df)):
    doc = nlp(df.text_clean.iloc[i])
    for sent in doc.sents:
        id_list.append(i)
        sent = "'" + str(sent) + "'"
        sent_list.append(sent)
        category_list.append(df.category.iloc[i])
        
sent_df = pd.DataFrame()
#sent_df['id'] = id_list
sent_df['text_clean'] = sent_list
print(sent_df.shape)

(16747, 1)


In [10]:
#sent_df = pd.read_csv('Category_News_sent_df.csv')
sent_df = pd.DataFrame()
#sent_df['id'] = id_list
sent_df['text_clean'] = sent_list
print(sent_df.shape)

(16747, 1)


In [11]:
sent_df.head()

Unnamed: 0,text_clean
0,'having waterproof covers on the seats is kind...
1,"'charming towns, relaxing beaches and top hiki..."
2,'star wars: galaxy's edge will open at disneyl...
3,'these underrated travel destinations in europ...
4,'if you’re dreaming about a romantic european ...


---
# <div class="alert alert-info"> 4. Linguistics </div>

## <font color=red>4.1. Part of Speech

In [12]:
sentence = 'singapore is currently the most competitive city in the world, beating out new york and london, according to the economist'
sentence = 'challengers to silicon valley include new york, l.a., boston, tel aviv, and london.'
sentence = 'ed young is the senior pastor of fellowship church, which is headquartered in grapevine, texas but has rapidly expanded across texas, to florida, london (uk) and online.'
sentence = 'she earned her bfa at london college of communication, uk'
sentence = 'london -- a classic mercedes-benz race car driven by formula 1 legend juan manuel fangio sold for 19.6 million pounds ($29.6'
sentence  = 'prices soared as people scrambled to flee the london bridge attack'
sentence= "at a london branch of britain's biggest retailer, tesco , which found horse dna in some of its own-brand frozen spaghetti"
sentence = 'maria perez is the co-founder and product manager of glassful'
sentence = 'london is a tourist’s paradise owing to the several iconic attractions it has to offer.therefore,it makes sense why london'
sentence = "investigators in washington and london last month struck a $450 million settlement with barclays in a rate-rigging case, but"
sentence = 'according to a newly released report, the united states is predicted to win the most medals at the london 2012 olympics.'

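# Only the last assignment above takes effect; the other sentences are alternative test inputs.
# Create the Doc object by running the sentence through the spaCy pipeline: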
doc = nlp(sentence)

# For practice, visualize your fine-grained POS tags (shown in the third column):
print(f"{'TOKEN':{10}} {'COARSE':{8}} {'FINE':{6}} {'DESCRIPTION FINE':{50}} {'DEPENDENCY':{6}} {'DESCRIPTION DEPENDENCY'}")
print(f"{'-----':{10}} {'------':{8}} {'----':{6}} {'----------------':{50}} {'----------':{6}} {'----------------------'}")

for token in doc:
    print(f'{token.text:{10}} {token.pos_:{8}} {token.tag_:{6}} {spacy.explain(token.tag_):{50}} {token.dep_:{8}} {spacy.explain(token.dep_)}')

TOKEN      COARSE   FINE   DESCRIPTION FINE                                   DEPENDENCY DESCRIPTION DEPENDENCY
-----      ------   ----   ----------------                                   ---------- ----------------------
according  VERB     VBG    verb, gerund or present participle                 prep     prepositional modifier
to         ADP      IN     conjunction, subordinating or preposition          prep     prepositional modifier
a          DET      DT     determiner                                         det      determiner
newly      ADV      RB     adverb                                             advmod   adverbial modifier
released   VERB     VBN    verb, past participle                              amod     adjectival modifier
report     NOUN     NN     noun, singular or mass                             pobj     object of preposition
,          PUNCT    ,      punctuation mark, comma                            punct    punctuation
the        DET      DT     determiner

## <font color=red>4.2. Named Entity Recognition

---
# <div class="alert alert-info"> 5. Information Retrieval </div>

## <font color=red>5.1. Find documents containing a word

In [224]:
sent_df.text_clean = sent_df.text_clean.astype(str) 


sdf = sent_df[sent_df['text_clean'].str.contains("singapore")]
print(sdf.shape)
sdf = sdf.reset_index()


for i in range(0,18):
    print("\n", sdf.text_clean.iloc[i])

(27, 1)

 'i have had a love hate relationship with singapore ever since i first visited 15 years ago.'

 'singapore airlines' suites class product, available on the airbus a380, made headlines in 2014 after a blogger published his review of the lavish, first-class experience.'

 'and besides being cheap, the region offers some of the world's best party countries, like thailand and singapore.'

 'the raffles hotel group is spreading its wings from its singapore base.'

 'new york city doesn't lack for good food, but i've yet to find satisfying singapore hawker (street food) fare.'

 'though singapore air's 18-hour nonstop flight from newark to singapore was suspended last year, there are still plenty of'

 'the bridge gives you killer views of singapore city and its islands both day and night.'

 'scoot (singapore) to “scoot” implies a jerking, tugging motion that just makes us uncomfortable to think of while soaring'

 'he spent two decades life managing marketing for singapore airlin

## <font color=red>5.2. Extract pattern

In [225]:
# Create the Doc object by running the sentence through the spaCy pipeline:
doc = nlp(sdf.text_clean.iloc[10])

# For practice, visualize your fine-grained POS tags (shown in the third column):
print(f"{'TOKEN':{10}} {'COARSE':{8}} {'FINE':{6}} {'DESCRIPTION FINE':{50}} {'DEPENDENCY':{6}} {'DESCRIPTION DEPENDENCY'}")
print(f"{'-----':{10}} {'------':{8}} {'----':{6}} {'----------------':{50}} {'----------':{6}} {'----------------------'}")

for token in doc:
    print(f'{token.text:{10}} {token.pos_:{8}} {token.tag_:{6}} {spacy.explain(token.tag_):{50}} {token.dep_:{8}} {spacy.explain(token.dep_)}')

TOKEN      COARSE   FINE   DESCRIPTION FINE                                   DEPENDENCY DESCRIPTION DEPENDENCY
-----      ------   ----   ----------------                                   ---------- ----------------------
'          PUNCT    ''     closing quotation mark                             punct    punctuation
shiok      NOUN     NN     noun, singular or mass                             nsubj    nominal subject
,          PUNCT    ,      punctuation mark, comma                            punct    punctuation
if         SCONJ    IN     conjunction, subordinating or preposition          mark     marker
you        PRON     PRP    pronoun, personal                                  nsubj    nominal subject
are        VERB     VBP    verb, non-3rd person singular present              advcl    adverbial clause modifier
n't        PART     RB     adverb                                             neg      negation modifier
familiar   ADJ      JJ     adjective (English), other noun-m

In [226]:
# Extract pattern: noun + adjective

def extract_pattern(doc):
    """Return four token-aligned lists; positions that do not match hold 'N/A'."""
    obj_list = []         # nouns that are not compounds or adjectival modifiers
    adj_list = []         # adjectives and adjectival modifiers
    connection_list = []  # nouns acting as subject, direct object or root
    conj_list = []        # compound nouns (e.g. "oxford" in "oxford street")

    for tok in doc:
        is_noun = tok.pos_ in ("NOUN", "PROPN")

        # Subject, direct object or root noun: used to link the other nouns
        is_link = is_noun and tok.dep_ in ("dobj", "ROOT", "nsubj")
        connection_list.append(tok.text if is_link else 'N/A')

        # All remaining nouns
        is_obj = is_noun and tok.dep_ not in ("compound", "amod")
        obj_list.append(tok.text if is_obj else 'N/A')

        # Compound => pattern: compound + noun (oxford street, covent garden)
        is_compound = is_noun and tok.dep_ == "compound"
        conj_list.append(tok.text if is_compound else 'N/A')

        # Adjectives
        is_adj = tok.pos_ == "ADJ" or tok.dep_ == "amod"
        adj_list.append(tok.text if is_adj else 'N/A')

    return obj_list, adj_list, connection_list, conj_list

doc = nlp(sdf.text_clean.iloc[15])
print(sdf.text_clean.iloc[15])
obj_list, adj_list, connection_list, conj_list = extract_pattern(doc)

'the catch 22 of the singapore food scene is that there is no culture of institutional continuity in this lucrative and well-loved food industry.'


In [227]:
list_obj1 = []
list_obj2 = []
# Note: connection_word is initialised once, so it carries over from one sentence to the next
connection_word = 'N/A'

for e in range(len(sdf)):
    doc = nlp(sdf.text_clean.iloc[e])
    
    obj_list, adj_list, connection_list, conj_list = extract_pattern(doc)
    
    for i in range(len(obj_list)):
        
        # Connection word: the most recent root noun, subject or direct object
        if connection_list[i] != 'N/A':
            connection_word = connection_list[i]
       
        if obj_list[i] == 'N/A':
            continue

        # Compound + noun: merge into one phrase
        # (i > 0 prevents index -1 from wrapping around to the last token)
        if i > 0 and conj_list[i-1] != 'N/A':
            obj_list[i] = conj_list[i-1] + ' ' + obj_list[i]
            
        # Noun + connection word
        if connection_word != obj_list[i] and connection_word != 'N/A':
            list_obj1.append(obj_list[i])
            list_obj2.append(connection_word) 
        
        # Noun + adjective within two tokens to the left
        for j in range(max(i-2, 0), i):
            if adj_list[j] != 'N/A':
                list_obj1.append(obj_list[i])
                list_obj2.append(adj_list[j])

        # Noun + adjective immediately to the right
        if i + 1 < len(obj_list) and adj_list[i+1] != 'N/A':
            list_obj1.append(obj_list[i])
            list_obj2.append(adj_list[i+1])

In [228]:
len(list_obj1), len(list_obj2)

(151, 151)
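
The windowed noun and adjective pairing used above can be illustrated without spaCy. The sketch below works on hand-assigned `(token, POS)` pairs, which are an assumption for illustration (in the notebook the tags come from the spaCy pipeline), and the helper `noun_adj_pairs` is a hypothetical name. It pairs each noun with any adjective up to two tokens to its left or one token to its right.

```python
# Hand-assigned (token, coarse POS) pairs; in the notebook these come from spaCy
tagged = [("the", "DET"), ("lavish", "ADJ"), ("first", "ADJ"), ("class", "NOUN"),
          ("experience", "NOUN"), ("was", "AUX"), ("great", "ADJ")]

def noun_adj_pairs(tagged, left=2, right=1):
    """Pair each noun with adjectives up to `left` tokens before and `right` tokens after it."""
    pairs = []
    for i, (word, pos) in enumerate(tagged):
        if pos not in ("NOUN", "PROPN"):
            continue
        lo = max(i - left, 0)                  # clamp the window at the sentence start
        hi = min(i + right, len(tagged) - 1)   # clamp the window at the sentence end
        for j in range(lo, hi + 1):
            if j != i and tagged[j][1] == "ADJ":
                pairs.append((word, tagged[j][0]))
    return pairs

pairs = noun_adj_pairs(tagged)
print(pairs)
```

A fixed window is a cheap heuristic: "great" is not paired with "experience" here because it sits two tokens to the right, outside the window. Following the dependency tree instead of token distance would be an alternative.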

---
# <div class="alert alert-info"> 6. Knowledge Graph </div>

## <font color=red>6.1. Create a data frame of node pairs (graph edges)

In [229]:
graph_df = pd.DataFrame()
graph_df['obj1'] = list_obj1
graph_df['obj2'] = list_obj2
graph_df['ID'] = range(0,len(graph_df))
#graph_df = graph_df[graph_df.obj2 != 'NA']
print(graph_df.shape)
graph_df.head()

(151, 3)


Unnamed: 0,obj1,obj2,ID
0,hate relationship,relationship,0
1,singapore,relationship,1
2,years,relationship,2
3,singapore airlines,relationship,3
4,class product,product,4


In [230]:
weight_graph_df = pd.DataFrame(graph_df.groupby(['obj1','obj2']).count()).reset_index()
weight_graph_df.rename(columns={'ID':'weight'}, inplace=True)
weight_graph_df = weight_graph_df.drop_duplicates(subset=['obj1','obj2'], keep='last')
weight_graph_df.sort_values(['weight'],ascending=False)

Unnamed: 0,obj1,obj2,weight
145,year,last,2
86,people,most,2
0,aesthetics,bridge,1
102,singapore,city,1
96,services,personal,1
...,...,...,...
50,helix design,bridge,1
51,hilton,marketing,1
52,hotel,iconic,1
53,hotel,people,1


## <font color=red>6.2. Network Analysis (with NetworkX)

In [231]:
# Generate a networkx graph
G = nx.from_pandas_edgelist(weight_graph_df, 'obj1', 'obj2')

# Give the graph a name
G.name = 'Hotel Interactions Network'

# Check whether graph is directed or undirected (False = undirected)
print(G.is_directed())

# Obtain general information of graph
# (nx.info was removed in NetworkX 3.0; on newer versions use print(G) instead)
print(nx.info(G))

# Get graph density
density = nx.density(G)
print("Network density:", density)

False
Name: Hotel Interactions Network
Type: Graph
Number of nodes: 154
Number of edges: 149
Average degree:   1.9351
Network density: 0.012647483235718529
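
For an undirected graph with n nodes and m edges, the average degree is 2m / n and the density is 2m / (n(n - 1)), the share of all possible edges that are actually present. As a sanity check, the short sketch below recomputes both values from the node and edge counts reported above; the variable names are illustrative.

```python
n_nodes = 154   # "Number of nodes" reported above
n_edges = 149   # "Number of edges" reported above

# Each undirected edge contributes to the degree of two nodes
average_degree = 2 * n_edges / n_nodes

# Density = actual edges / possible edges, with n(n-1)/2 possible edges
density = 2 * n_edges / (n_nodes * (n_nodes - 1))

print(f"Average degree: {average_degree:.4f}")
print(f"Network density: {density}")
```

A density of roughly 1.3% means the graph is very sparse: most words are linked to only one or two others, with a few hubs collecting many edges.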


In [232]:
# Get the most connected node (i.e. the word with the highest degree)
max(dict(G.degree()).items(), key = lambda x : x[1])

('singapore', 28)

## <font color=red>6.3. Network Visualization (with Pyvis)

https://towardsdatascience.com/customizing-networkx-graphs-f80b4e69bedf

https://www.cl.cam.ac.uk/teaching/1314/L109/tutorial.pdf

https://www.toptal.com/data-science/graph-data-science-python-networkx

In [233]:
# Define function to generate Pyvis visualization
def generate_network_viz(df, source_col, target_col, weights, 
                         layout='barnes_hut',
                         central_gravity=0.15,
                         node_distance=420,
                         spring_length=100,
                         spring_strength=0.15,
                         damping=0.96,
                         minimum_weight: int = 0,  # currently unused
                         ):
    
    # Generate a networkx graph, keeping the weight column as an edge attribute
    G = nx.from_pandas_edgelist(df, source_col, target_col, weights)
    
    # Dark theme for the repulsion layout, light theme otherwise
    if layout == 'repulsion':
        bgcolor, font_color = '#222222', 'white'
    else:
        bgcolor, font_color = 'white', 'black'
    
    # Initiate PyVis network object
    net = Network(
                  height='700px', 
                  width='100%',
                  bgcolor=bgcolor, 
                  font_color=font_color, 
                  notebook=True
                 )
    
    # Take Networkx graph and translate it to a PyVis graph format
    net.from_nx(G)
    
    # Create different network layout (repulsion or Barnes Hut)
    if layout == 'repulsion':
        net.repulsion(
                      node_distance=node_distance, 
                      central_gravity=central_gravity, 
                      spring_length=spring_length, 
                      spring_strength=spring_strength, 
                      damping=damping
                     )
        
    # Run default Barnes Hut visualization
    # (the physics parameters are commented out, so the Pyvis defaults apply)
    else:
        net.barnes_hut(
#                       gravity=-80000, 
#                       central_gravity=central_gravity, 
#                       spring_length=spring_length, 
#                       spring_strength=spring_strength, 
#                       damping=damping, 
#                       overlap=0
                      )      
    return net

### <font color=blue>Barnes Hut Visualization
Barnes-Hut is a quadtree-based gravity model. It is the fastest, default and recommended solver for non-hierarchical layouts.

In [236]:
# Keep nodes with more than 2 and fewer than 100 connections
selected_nodes = []
for node, degree in G.degree():
    if 2 < degree < 100:
    # if node == 'bridge':
        print(node, degree)
        selected_nodes.append(node)

bridge 9
marketing 7
singapore 28
people 4
feelings 3
lattes 4
cities 3
city 4
sling 9
catch 5
views 5
travels 8
notables 4
terminal 4
shiok 4
fare 6
centre 5
need 3
relationship 4
hawker 3
hotel 3
hotels 3
region 4
services 3
singapore airlines 3
tech 3
year 3


In [237]:
#selected_nodes = ['location','italian', 'prime']
# Keep only the edges that touch at least one of the selected nodes
small_nw = weight_graph_df.loc[weight_graph_df['obj1'].isin(selected_nodes) | weight_graph_df['obj2'].isin(selected_nodes)]

node_color = {'NOUN':'lightblue', 'PRON':'lightblue', 'PROPN':'yellow', 'VERB': 'red', 'ADJ':'red', 'ADV':'red', 'DET':'red', 'X':'grey', 'INTJ':'grey',
              'AUX':'grey','NUM':'grey','SPACE':'grey','PUNCT':'grey','SCONJ':'grey','ADP':'grey'}
# Generate a networkx graph based on subset data
net_repulsion = generate_network_viz(small_nw, 'obj1','obj2', 'weight', layout='repulsion')

# Colour each node by the coarse POS tag of its first token
node_list = [node['id'] for node in net_repulsion.nodes]
type_node = [nlp(x)[0].pos_ for x in node_list]

for node, pos in zip(net_repulsion.nodes, type_node):
    # Fall back to grey for POS tags that are missing from node_color
    node['color'] = node_color.get(pos, 'grey')

net_repulsion.show('hotel_interactions_network_room.html')

# Run the above code chunk in order to display the graph visualization below