- Natural Language Toolkit (NLTK): The complete toolkit for all NLP techniques.
- spaCy: Industrial-strength NLP with Python and Cython.
- Gensim: Topic Modelling for Humans.
- Stanford CoreNLP: NLP services and packages by the Stanford NLP Group.

In [4]:
import pandas as pd
import numpy as np
from newspaper import Article
import nltk
import pickle

from nltk.classify import NaiveBayesClassifier
from nltk.corpus import subjectivity 
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *

import matplotlib.pyplot as plt
import string
from nltk.stem import SnowballStemmer
from nltk.stem.porter import PorterStemmer

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
### for summarization (gensim.summarization was removed in gensim 4.0, so this import needs gensim < 4.0)
from gensim.summarization.summarizer import summarize as gensim_summarize 

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from langdetect import detect

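### download only the corpora used below (a bare nltk.download() opens the interactive downloader)
nltk.download('stopwords')
nltk.download('wordnet')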

# Part 1: get raw text data 

In [5]:
rawDf = pd.read_csv("outputs/step1_bigquery_output.csv")

In [6]:
rawDf.head()

Unnamed: 0,DATE,THEMES,DocumentIdentifier
0,20190101060000,EDUCATION;SOC_POINTSOFINTEREST;SOC_POINTSOFINT...,https://www.daijiworld.com/chan/exclusiveDispl...
1,20190101061500,TAX_FNCACT;TAX_FNCACT_MAN;ARREST;SOC_GENERALCR...,https://caymannewsservice.com/2018/12/
2,20190101063000,TAX_FNCACT;TAX_FNCACT_LEADER;ENV_NUCLEARPOWER;...,https://www.vesti.bg/tehnologii/bil-gejts-sash...
3,20190101061500,ENV_GREEN;WB_507_ENERGY_AND_EXTRACTIVES;WB_525...,https://www.ajc.com/business/economy/georgia-p...
4,20190101061500,ENV_GREEN;WB_507_ENERGY_AND_EXTRACTIVES;WB_525...,https://pv-magazine-usa.com/2018/12/18/breakin...


In [11]:
from newspaper import Article

def get_text(url):
    """
    Func: download an article and return its raw text
        Input: url, a link to an article
        Output: the article text as a string, or None if downloading or parsing fails
    """
    try:
        article = Article(url)
        article.download()

        ### parse html file
        article.parse()
        text = article.text
    
        return text
    except Exception:
        print(f'failed to download news from {url}')
        return None

In [12]:
rawDf.DocumentIdentifier[1]

'https://caymannewsservice.com/2018/12/'

In [13]:
eg = get_text(rawDf.DocumentIdentifier[13])
eg

'ZAP, together with its subsidiaries, designs, develops, manufactures, and sells electric and advanced technology vehicles in the United States and internationally. The company offers electric, alternative energy and fuel efficient automobiles and commercial vehicles, motorcycles and scooters, and other forms of personal transportation. ZAP also markets its electric transportation products through its zapworld.com Website. The company was formerly known as ZAPWORLD.COM and changed its name to ZAP in 2001. ZAP was founded in 1994 and is headquartered in Santa Rosa, California.\n\nZAP (OTCMKTS:ZAAP) Frequently Asked Questions What is ZAP\'s stock symbol? ZAP trades on the OTCMKTS under the ticker symbol "ZAAP." Has ZAP been receiving favorable news coverage? News articles about ZAAP stock have been trending somewhat positive recently, InfoTrie reports. The research firm scores the sentiment of press coverage by analyzing more than six thousand blog and news sources. The firm ranks covera

# Part 2: perform nlp analysis

## 2.1 Translate <br>

#### Some of the news articles are not in English, so we first detect each article's language. Non-English articles then need to be translated to English or skipped.

In [16]:
def detect_lang(text):
    ### detect the language code of the text (e.g. 'en'); return "other" if detection fails
    try:
        language = detect(text)
        print(f"language is {language}")
    except Exception:
        print("Not able to detect language")
        language = "other"
    return language

In [17]:
detect_lang(eg)

language is en


'en'

## 2.2 Text summarization <br>
#### The reason for text summarization
- Reduce noise: news articles contain irrelevant content, and summarization keeps only the most central sentences.
- Improve scalability: shorter strings make the downstream processing steps cheaper.

### Intro of gensim summarization<br>
#### The gensim implementation is based on the popular TextRank algorithm. An introduction: https://medium.com/@shivangisareen/text-summarisation-with-gensim-textrank-46bbb3401289
#### There are two kinds of summarization methods:
- Extractive methods: select phrases and sentences from the source document to make up the new summary.
- Abstractive methods: generate entirely new phrases and sentences to capture the meaning of the source document.
<br>

##### Here we use an extractive method: TextRank-based text summarization.
(To learn more, please search online or contact the DBC consultant about further NLP courses.)
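
To make the extractive idea concrete, here is a minimal pure-Python sketch of a TextRank-style sentence ranker. It is not the gensim implementation: the naive sentence splitter, the word-overlap similarity, and the names `split_sentences`, `similarity` and `textrank_summary` are all assumptions chosen for illustration. Sentences are nodes in a graph, edges are weighted by shared words, and a PageRank-style iteration scores each sentence.

```python
import math
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(words_a, words_b):
    """Shared-word count normalised by the log of each sentence's vocabulary size."""
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids log(1) = 0 in the denominator
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_summary(text, n_sentences=2, damping=0.85, iterations=50):
    """Return the n_sentences highest-ranked sentences, in their original order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n <= n_sentences:
        return sentences

    tokens = [set(re.findall(r"\w+", s.lower())) for s in sentences]

    # Symmetric weight matrix of pairwise sentence similarities.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weights[i][j] = weights[j][i] = similarity(tokens[i], tokens[j])
    out_sum = [sum(row) for row in weights]

    # Weighted PageRank iteration over the sentence graph.
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]

    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]


example = (
    "The cat sat on the mat. The cat chased the mouse on the mat. "
    "The mouse ran from the cat. Bananas are yellow."
)
summary = textrank_summary(example, n_sentences=2)
print(summary)
```

The sentence about bananas shares no words with the others, so it receives only the base score and is never selected. That is the noise-reduction effect described above.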

In [18]:
from gensim.summarization.summarizer import summarize as gensim_summarize 

def summarize(text,**kwargs):
    ### fall back to the original text if gensim cannot summarize it (e.g. the input is too short)
    try:
        summarized = gensim_summarize(text,**kwargs)
    except Exception:
        return text
    return summarized

In [19]:
egSummary = summarize(eg)
egSummary

'ZAP also markets its electric transportation products through its zapworld.com Website.\nZAP trades on the OTCMKTS under the ticker symbol "ZAAP." Has ZAP been receiving favorable news coverage?\nThe firm ranks coverage of public companies on a scale of -5 to 5, with scores closest to five being the most favorable.\n, (Age 37) Mr. Michael Ringstad , Interim Chief Financial Officer (Age 64)\nOne share of ZAAP stock can currently be purchased for approximately $0.0050.\nMarketBeat Community Rating for ZAP (OTCMKTS ZAAP)\nVote "Outperform" if you believe ZAAP will outperform the S&P 500 over the long term.\nVote "Underperform" if you believe ZAAP will underperform the S&P 500 over the long term.'

## 2.3 Preprocess text<br>

- Remove punctuation: based on the `string` package.
- Remove stop words: based on the NLTK English stop-word list.
##### Certain parts of English speech, like conjunctions (“for”, “or”) or the word “the”, are meaningless to a topic model. These terms are called stop words and need to be removed from our token list.
- Lemmatize: use the nltk WordNetLemmatizer.
##### Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. For example, “forms” becomes “form”. Without part-of-speech tags the lemmatizer treats every word as a noun, so it can also make mistakes such as turning “as” into “a”.
- Stem: use the nltk PorterStemmer (SnowballStemmer is an alternative).
##### Stemming words is another common NLP technique to reduce topically similar words to their root. For example, “stemming,” “stemmer,” and “stemmed” all have similar meanings; stemming aims to reduce such terms to a common root like “stem.” This is important for topic modeling, which would otherwise view those terms as separate entities and reduce their importance in the model.
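
The first two steps can be tried without any NLTK data. The sketch below uses only the standard library; the tiny `STOP_WORDS` set and the name `clean_tokens` are assumptions for illustration (the notebook itself uses NLTK's full English stop-word list), and unlike the notebook's function it lower-cases the tokens it returns.

```python
import string

# Tiny illustrative stop-word set (an assumption; the notebook uses NLTK's English list).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "or", "for", "is", "on"}


def clean_tokens(text, stop_words=STOP_WORDS):
    """Delete digits and punctuation, then drop stop words."""
    # str.maketrans('', '', chars) builds a table that deletes every character in chars.
    table = str.maketrans("", "", string.digits + string.punctuation)
    stripped = text.translate(table)
    return [w.lower() for w in stripped.split() if w.lower() not in stop_words]


tokens = clean_tokens("The firm ranks coverage on a scale of -5 to 5.")
print(tokens)  # ['firm', 'ranks', 'coverage', 'scale']
```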

In [44]:
def pre_process(text,return_str=False):
    ### Remove numbers: str.translate(str.maketrans(fromstr, tostr, deletestr)) deletes every char in deletestr
    text = text.translate(str.maketrans('', '',string.digits))
    ### Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    ### Remove stop words and links (build the stop-word set once instead of once per word)
    stop_words = set(stopwords.words('english'))
    text = [word for word in text.split() if word.lower() not in stop_words and not word.startswith("http")]

    ### Lemmatize
    wnl = nltk.WordNetLemmatizer()
    text = [wnl.lemmatize(word) for word in text]
    ### Stem (SnowballStemmer("english") is an alternative)
    stemmer = PorterStemmer()

    words = [stemmer.stem(word) for word in text]
    if return_str:
        return (' ').join(words)
    else:
        return words

In [49]:
processedText = pre_process(egSummary,return_str=True)

In [50]:
processedText

'zap also market electr transport product zapworldcom websit zap trade otcmkt ticker symbol zaap zap receiv favor news coverag firm rank coverag public compani scale score closest five favor age Mr michael ringstad interim chief financi offic age one share zaap stock current purchas approxim marketbeat commun rate zap otcmkt zaap vote outperform believ zaap outperform SP long term vote underperform believ zaap underperform SP long term'

# Part 3: Modularize  

In [63]:

class ProcessPipeline:
    def __init__(self,texts=None,steps=("langdetection","summarization","tokenization")):
        self.stemmer = PorterStemmer()
        self.lemmatizer = nltk.WordNetLemmatizer()
        ### build the stop-word set once instead of once per word
        self.stop_words = set(stopwords.words('english'))
        self.texts = texts
        self.steps = steps
        if "summarization" not in self.steps and "tokenization" not in self.steps:
            raise ValueError("Needs to define at least summarization or tokenization")

    def process(self,text,return_str=False):
        if "langdetection" in self.steps:
            ### keep English text only; anything else is replaced by an empty string
            lang = self.detect_lang(text)
            if lang != "en":
                text = ""
        if "summarization" in self.steps:
            text = self.summarize(text)
        if "tokenization" in self.steps:
            return self.pre_process(text,return_str=return_str)
        return text

    def run(self,return_str=False,workers=6):
        ### process all texts in parallel across `workers` processes
        with ProcessPoolExecutor(max_workers=workers) as executor:
            res = executor.map(self.process, self.texts, [return_str]*len(self.texts))
            return list(res)

    def run_lambda(self):
        ### the lambda must call self.process(x), not just return the method
        return list(map(lambda x:self.process(x),self.texts))

    def run_loop(self):
        processed = []
        for i in self.texts:
            processed.append(self.process(i))
        return processed

    def detect_lang(self,text):
        ### detect the language code of the text (e.g. 'en'); return "other" if detection fails
        try:
            language = detect(text)
            print(f"language is {language}")
        except Exception:
            print("Not able to detect language")
            language = "other"
        return language

    def summarize(self,text,**kwargs):
        ### fall back to the original text if gensim cannot summarize it
        try:
            summarized = gensim_summarize(text,**kwargs)
            return summarized
        except Exception:
            return text

    def pre_process(self,text,return_str=False):
        ### Remove numbers: str.translate(str.maketrans(fromstr, tostr, deletestr)) deletes every char in deletestr
        text = text.translate(str.maketrans('', '',string.digits))
        ### Remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        ### Remove stop words and links
        text = [word for word in text.split() if word.lower() not in self.stop_words and not word.startswith("http")]

        ### Lemmatize
        text = [self.lemmatizer.lemmatize(word) for word in text]
        ### Stem
        words = [self.stemmer.stem(word) for word in text]
        if return_str:
            return (' ').join(words)
        else:
            return words



In [72]:
pipeline = ProcessPipeline()

In [68]:
egText = eg
lang = pipeline.detect_lang(egText)
lang

language is en


'en'

In [69]:
summarized = pipeline.summarize(egText)

summarized

'ZAP also markets its electric transportation products through its zapworld.com Website.\nZAP trades on the OTCMKTS under the ticker symbol "ZAAP." Has ZAP been receiving favorable news coverage?\nThe firm ranks coverage of public companies on a scale of -5 to 5, with scores closest to five being the most favorable.\n, (Age 37) Mr. Michael Ringstad , Interim Chief Financial Officer (Age 64)\nOne share of ZAAP stock can currently be purchased for approximately $0.0050.\nMarketBeat Community Rating for ZAP (OTCMKTS ZAAP)\nVote "Outperform" if you believe ZAAP will outperform the S&P 500 over the long term.\nVote "Underperform" if you believe ZAAP will underperform the S&P 500 over the long term.'

In [70]:
processed = pipeline.pre_process(summarized,return_str=True)
processed

'zap also market electr transport product zapworldcom websit zap trade otcmkt ticker symbol zaap zap receiv favor news coverag firm rank coverag public compani scale score closest five favor age Mr michael ringstad interim chief financi offic age one share zaap stock current purchas approxim marketbeat commun rate zap otcmkt zaap vote outperform believ zaap outperform SP long term vote underperform believ zaap underperform SP long term'

In [71]:
pipeline.pre_process(summarized,return_str=False)

['zap',
 'also',
 'market',
 'electr',
 'transport',
 'product',
 'zapworldcom',
 'websit',
 'zap',
 'trade',
 'otcmkt',
 'ticker',
 'symbol',
 'zaap',
 'zap',
 'receiv',
 'favor',
 'news',
 'coverag',
 'firm',
 'rank',
 'coverag',
 'public',
 'compani',
 'scale',
 'score',
 'closest',
 'five',
 'favor',
 'age',
 'Mr',
 'michael',
 'ringstad',
 'interim',
 'chief',
 'financi',
 'offic',
 'age',
 'one',
 'share',
 'zaap',
 'stock',
 'current',
 'purchas',
 'approxim',
 'marketbeat',
 'commun',
 'rate',
 'zap',
 'otcmkt',
 'zaap',
 'vote',
 'outperform',
 'believ',
 'zaap',
 'outperform',
 'SP',
 'long',
 'term',
 'vote',
 'underperform',
 'believ',
 'zaap',
 'underperform',
 'SP',
 'long',
 'term']

# Part 4: Scrape raw news data

In [None]:
rawDf = pd.read_csv("outputs/step1_bigquery_output.csv")

In [None]:
texts = list(map(lambda x:get_text(x),rawDf.DocumentIdentifier))
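
Downloading articles one at a time is slow because each call mostly waits on the network. A possible speed-up, sketched here under assumptions, is to fan the downloads out over the `ThreadPoolExecutor` imported at the top of the notebook. The helper `fetch_all` and the offline stand-in `fake_fetch` are hypothetical names used for illustration only; in the notebook the `fetch` argument would be `get_text`.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, workers=8):
    """Apply `fetch` to every URL concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(fetch, urls))


def fake_fetch(url):
    # Offline stand-in for get_text: returns None on "failure", like get_text does.
    return None if url.endswith("/bad") else f"text of {url}"


results = fetch_all(["https://example.com/a", "https://example.com/bad"], fake_fetch)
print(results)
```

Threads suit this step because the work is I/O-bound. The CPU-heavy preprocessing in Part 3 is the part that benefits from processes.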

In [None]:
### save as pickle

with open('outputs/step2_news_raw.pickle', 'wb') as handle:
    pickle.dump(texts, handle, protocol=pickle.HIGHEST_PROTOCOL)