# Natural Language Processing

### Install spaCy
https://spacy.io/api/doc/

```
conda install -c conda-forge spacy
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md
```
Optionally, install the French language model (`fr_core_news_sm`).

### Install NLTK
https://www.nltk.org/

```
conda install -c anaconda nltk
```

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [3]:
# Pipelines are a series of operations to tag, parse and describe the data
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']

## Tokenisation
Tokenisation splits text into its component parts (words and punctuation), called "tokens".
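As a rough illustration of the idea, the sketch below splits text with a single regular expression. This is not how spaCy tokenises (spaCy applies language-specific rules and exceptions); `naive_tokenize` is a hypothetical helper used only here.

```python
import re

def naive_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word or number; [^\w\s] matches one non-space symbol
    return re.findall(r"\w+|[^\w\s]", text)

naive_tokens = naive_tokenize("Tesla is looking at buying U.S. startup for $6 million")
print(naive_tokens)
```

The naive version breaks `U.S.` into four pieces, whereas spaCy's tokenizer keeps it as a single token.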

### Token attributes
|Attribute|Description|Value for `doc[0]`|
|:------|:------|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

In [103]:
# Create a Doc object
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million') # the u prefix marks a unicode string (optional in Python 3)

# Print each token separately
for token in doc:
    # print text, part of speech, dependency
    #print(token.text, token.pos_, token.dep_)
    print(f'{token.text:{8}}', '\t', token.pos_, '\t', token.dep_) 

Tesla    	 PROPN 	 nsubj
is       	 AUX 	 aux
looking  	 VERB 	 ROOT
at       	 ADP 	 prep
buying   	 VERB 	 pcomp
U.S.     	 PROPN 	 compound
startup  	 NOUN 	 dobj
for      	 ADP 	 prep
$        	 SYM 	 quantmod
6        	 NUM 	 compound
million  	 NUM 	 pobj


In [6]:
# Get the description of an abbreviation
spacy.explain('PROPN')

'proper noun'

In [123]:
# Sentence segmentation
doc2 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')
for sent in doc2.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


In [12]:
# Named entities
doc3 = nlp(u'Apple to build a Hong Kong factory for $6 million')

for token in doc3:
    print(token.text, end=' | ')

print('\n----')

# Get entities
for ent in doc3.ents:
    print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))

Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 
----
Apple - ORG - Companies, agencies, institutions, etc.
Hong Kong - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


In [107]:
# Named entity visualization
from spacy import displacy
doc5 = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')
displacy.render(doc5, style='ent', jupyter=True)

In [109]:
options = { 'ents': ['ORG'] }
displacy.render(doc5, style='ent', jupyter=True, options=options)

In [92]:
# Noun chunks
doc4 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc4.noun_chunks:
    print(chunk.text)

Autonomous cars
insurance liability
manufacturers


## Stemming
Stemming is a somewhat crude method for cataloging related words; it essentially chops off letters from the end until the stem is reached.

In [22]:
# Import the toolkit and the Porter and Snowball stemmers
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

p_stemmer = PorterStemmer()
# The Snowball Stemmer requires that you pass a language parameter
s_stemmer = SnowballStemmer(language='english')

words = ['run','runner','running','ran','runs','easily','fairly']
#words = ['generous','generation','generously','generate']

print('\n--Porter--')
for word in words:
    print(word+' --> '+p_stemmer.stem(word))

print('\n--Snowball--')
for word in words:
    print(word+' --> '+s_stemmer.stem(word))


--Porter--
run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fairli

--Snowball--
run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fair
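
To make the "chop letters off the end" idea concrete, here is a deliberately crude suffix stripper. It is not the Porter or Snowball algorithm (both apply ordered rule sets with extra conditions); `crude_stem` and its suffix list are assumptions made up for this sketch.

```python
def crude_stem(word, suffixes=("ing", "ly", "s")):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["running", "runs", "easily", "ran"]:
    print(w, "-->", crude_stem(w))
```

Note that `running` becomes `runn` here, while the Porter output above gives `run`: the real stemmers include rules for cases such as doubled consonants.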


## Lemmatization
In contrast to stemming, lemmatization looks beyond word reduction, and considers a language's full vocabulary to apply a *morphological analysis* to words. The lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'.

In [50]:
doc5 = nlp(u"I am a runner running in a race because I love to run since I ran today")

def show_lemmas(text):
    # Iterate over the Doc passed in, not the global doc5
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

show_lemmas(doc5)

I            PRON   4690420944186131903    I
am           AUX    10382539506755952630   be
a            DET    11901859001352538922   a
runner       NOUN   12640964157389618806   runner
running      VERB   12767647472892411841   run
in           ADP    3002984154512732771    in
a            DET    11901859001352538922   a
race         NOUN   8048469955494714898    race
because      SCONJ  16950148841647037698   because
I            PRON   4690420944186131903    I
love         VERB   3702023516439754181    love
to           PART   3791531372978436496    to
run          VERB   12767647472892411841   run
since        SCONJ  10066841407251338481   since
I            PRON   4690420944186131903    I
ran          VERB   12767647472892411841   run
today        NOUN   11042482332948150395   today
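
For contrast with stemming, the simplest possible lemmatizer is a dictionary lookup. The sketch below uses a tiny hand-written table (an assumption for illustration); spaCy's lemmatizer relies on much larger resources and on the part-of-speech tag.

```python
# Hypothetical lookup table; real lemmatizers cover a full vocabulary
LEMMA_TABLE = {"was": "be", "am": "be", "mice": "mouse", "ran": "run", "running": "run"}

def lookup_lemma(word):
    """Return the dictionary form if known, otherwise the lowercased word."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)

print([lookup_lemma(w) for w in ["Mice", "was", "ran", "race"]])
```

Unlike a stemmer, the lookup maps `ran` to `run` and `mice` to `mouse`, because it does not depend on the spelling of the word ending.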


## Stop Words
Words like "a" and "the" appear frequently and don't carry much meaning.

In [57]:
print(nlp.Defaults.stop_words)

{'anywhere', 'hers', 'whom', 'does', 'less', 'ourselves', 'n‘t', 'their', 'after', 'that', '’s', 'them', 'together', 'among', 'further', '‘m', 'made', '‘s', 'of', 'anyway', 'name', 'another', 'yet', 'must', 'very', "n't", 'him', 'two', 'toward', 'themselves', 'rather', "'s", 'until', '’ve', 'you', 'call', 'into', 'while', 'yours', 'however', "'ve", 'top', 'whereby', 'everyone', 'sometimes', 'which', 'otherwise', '‘ve', 'some', 'so', 'ca', 'her', 'unless', 'my', 'due', 'thereby', 'side', 'become', 'no', 'most', 'every', 'eight', 'everything', 'whatever', 'seemed', 'three', 'whereupon', '’d', 'hence', 'can', 'he', 'still', 'something', 'throughout', 'if', 'now', "'re", 'amount', 'many', 'per', 'in', "'ll", 'namely', 'i', 'am', 'about', 'show', '’m', 'whereafter', 'should', 'nor', 'never', 'those', 'she', 'except', 'whether', 'without', 'thence', 'yourself', 'front', 'alone', 'than', 'became', 'why', 'your', 'anyhow', 'under', 'beforehand', 'somehow', 'once', 'had', 'nine', 'other', 'amon

In [52]:
nlp.vocab['myself'].is_stop

True

In [53]:
# Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('btw')
# Set the stop_word tag on the lexeme
nlp.vocab['btw'].is_stop = True

# Remove the word from the set of stop words
nlp.Defaults.stop_words.remove('beyond')
# Remove the stop_word tag from the lexeme
nlp.vocab['beyond'].is_stop = False
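
A stop list is typically used to filter tokens before counting or vectorizing. The sketch below shows that step in plain Python; the small `toy_stop_words` set is an assumption and not spaCy's list.

```python
# Small illustrative stop list (an assumption, not spaCy's defaults)
toy_stop_words = {"a", "the", "is", "in", "to", "and"}

def remove_stop_words(tokens, stop_list):
    """Keep tokens whose lowercase form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_list]

filtered = remove_stop_words(["The", "fox", "is", "in", "the", "garden"], toy_stop_words)
print(filtered)
```

Lowercasing before the membership test is the reason the cell above says to add new stop words in lowercase.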

## Matching
spaCy offers a rule-matching tool called `Matcher` that allows you to build a library of token patterns, then match those patterns against a Doc object to return a list of found matches. https://spacy.io/usage/rule-based-matching

In [84]:
# Import the Matcher library
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [85]:
# Solar Power
pattern1 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]
# Solar-Power
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}] # * = any number of times

matcher.add('SolarPower', [pattern1, pattern2])

doc6 = nlp(u'The Solar Power industry continues to grow as demand for solarpower increases. Solar--power cars are gaining popularity.')

found_matches = matcher(doc6)
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 13, 16)]


In [75]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc6[start:end]                   # get the matched span
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 10 11 solarpower
8656102463236116519 SolarPower 13 16 Solar--power


## Phrase Matching
Phrase matching is basically matching a document against a list of phrases.

In [86]:
# Import the PhraseMatcher library
from spacy.matcher import PhraseMatcher
pmatcher = PhraseMatcher(nlp.vocab)

In [87]:
with open('data/reaganomics.txt') as f:
    doc7 = nlp(f.read())

# First, create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

# Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]

# Pass the list of Doc objects to the matcher (spaCy v3 signature; v2 used add(key, None, *docs)):
pmatcher.add('VoodooEconomics', phrase_patterns)

# Build a list of matches:
matches = pmatcher(doc7)

# (match_id, start, end)
matches

[(3473369816841043438, 41, 45),
 (3473369816841043438, 49, 53),
 (3473369816841043438, 54, 56),
 (3473369816841043438, 61, 65),
 (3473369816841043438, 673, 677),
 (3473369816841043438, 2987, 2991)]

In [88]:
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc7[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

3473369816841043438 VoodooEconomics 41 45 supply-side economics
3473369816841043438 VoodooEconomics 49 53 trickle-down economics
3473369816841043438 VoodooEconomics 54 56 voodoo economics
3473369816841043438 VoodooEconomics 61 65 free-market economics
3473369816841043438 VoodooEconomics 673 677 supply-side economics
3473369816841043438 VoodooEconomics 2987 2991 trickle-down economics
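
Conceptually, phrase matching slides each phrase over the token sequence and reports the spans that match. The plain-Python sketch below shows that idea; it is not how `PhraseMatcher` is implemented, and `find_phrases` is a hypothetical helper that tokenizes on whitespace only.

```python
def find_phrases(tokens, phrases):
    """Return (start, end, phrase) for every phrase found in the token list."""
    lowered = [t.lower() for t in tokens]
    matches = []
    for phrase in phrases:
        target = phrase.lower().split()
        n = len(target)
        # Compare every window of n tokens with the phrase tokens
        for start in range(len(lowered) - n + 1):
            if lowered[start:start + n] == target:
                matches.append((start, start + n, phrase))
    return sorted(matches)

toy_text = "Critics called it voodoo economics while others said free market economics"
phrase_hits = find_phrases(toy_text.split(), ["voodoo economics", "free market economics"])
print(phrase_hits)
```

As with the spaCy matchers, `start` and `end` are token indices, and `end` is exclusive.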


## Counting POS Tags

In [90]:
doc8 = nlp(u"The quick brown fox jumped over the lazy dog's back.")

# Count the frequencies of different coarse-grained POS tags:
POS_counts = doc8.count_by(spacy.attrs.POS)
POS_counts

for k,v in sorted(POS_counts.items()):
    print(f'{k}. {doc8.vocab[k].text:{5}}: {v}')

84. ADJ  : 3
85. ADP  : 1
90. DET  : 2
92. NOUN : 3
94. PART : 1
97. PUNCT: 1
100. VERB : 1


## Named Entity Recognition

In [91]:
# Write a function to display basic entity info:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))
    else:
        print('No named entities found.')

doc9 = nlp(u'May I go to Washington, DC next May to see the Washington Monument?')

show_ents(doc9)

Washington, DC - GPE - Countries, cities, states
next May - DATE - Absolute or relative dates or periods
the Washington Monument - ORG - Companies, agencies, institutions, etc.


In [97]:
# Tesla is not recognized as an entity
doc10 = nlp(u'Tesla to build a U.K. factory for $6 million')
show_ents(doc10)  # show_ents prints its results and returns None, so no print() is needed

U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


In [102]:
# Add entities
from spacy.tokens import Span

# Get the hash value of the ORG entity label
ORG = doc10.vocab.strings[u'ORG']

# Create a Span for the new entity
new_ent = Span(doc10, 0, 1, label=ORG)

# Add the entity to the existing Doc object
doc10.ents = list(doc10.ents) + [new_ent]

# Verify
show_ents(doc10)

Tesla - ORG - Companies, agencies, institutions, etc.
U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


## Feature Extraction

In [140]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('data/smsspamcollection.tsv', sep='\t')
X = df['message']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# spam or ham dataset
df.head()

Unnamed: 0,label,message,length,punct
0,ham,"Go until jurong point, crazy.. Available only ...",111,9
1,ham,Ok lar... Joking wif u oni...,29,6
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155,6
3,ham,U dun say so early hor... U c already then say...,49,6
4,ham,"Nah I don't think he goes to usf, he lives aro...",61,2


In [138]:
# Create pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  

# Form a prediction set
predictions = text_clf.predict(X_test)
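
To show what the `tfidf` step computes, here is a hand-rolled TF-IDF for three toy messages. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is my understanding of the `TfidfVectorizer` defaults; the whitespace tokenization and the names `toy_docs`, `doc_freq` and `tfidf_weights` are simplifications made up for this sketch.

```python
import math
from collections import Counter

toy_docs = ["free prize win now", "call me now", "win a free prize"]
tokenized = [d.split() for d in toy_docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def tfidf_weights(doc_tokens):
    """Smoothed TF-IDF with L2 normalisation for one tokenized document."""
    counts = Counter(doc_tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf_weights(tokenized[1])
print({term: round(w, 3) for term, w in vec.items()})
```

In the second message, `call` and `me` get a higher weight than `now`, because `now` also appears in another message.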

In [139]:
# Report the confusion matrix
from sklearn import metrics

print(metrics.confusion_matrix(y_test,predictions))
# Print a classification report
print(metrics.classification_report(y_test,predictions))
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

[[1586    7]
 [  12  234]]
              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.97      0.95      0.96       246

    accuracy                           0.99      1839
   macro avg       0.98      0.97      0.98      1839
weighted avg       0.99      0.99      0.99      1839

0.989668297988037
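
The numbers in the report can be recomputed by hand from the confusion matrix. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, in the order ham, spam).

```python
# Values copied from the confusion matrix above
tn, fp = 1586, 7    # true ham: predicted ham / predicted spam
fn, tp = 12, 234    # true spam: predicted ham / predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
spam_precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
spam_recall = tp / (tp + fn)      # of the real spam messages, how many were caught
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(round(accuracy, 4), round(spam_precision, 2), round(spam_recall, 2), round(spam_f1, 2))
```

The rounded values (0.97, 0.95, 0.96) agree with the `spam` row of the classification report.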


## Word Vectors

In [150]:
nlp = spacy.load('en_core_web_md')
nlp.vocab.vectors.shape

(20000, 300)

In [151]:
# Create a three-token Doc object:
tokens = nlp(u'lion cat pet')

# Iterate through token combinations:
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

lion lion 1.0
lion cat 0.5265437
lion pet 0.39923772
cat lion 0.5265437
cat cat 1.0
cat pet 0.7505456
pet lion 0.39923772
pet cat 0.7505456
pet pet 1.0
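
spaCy documents the default `similarity` estimate as cosine similarity between vectors. A minimal sketch of that measure follows; the 3-dimensional vectors are made up for illustration, whereas the real model uses 300 dimensions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (assumptions, not taken from en_core_web_md)
toy_cat = [1.0, 0.8, 0.1]
toy_pet = [0.9, 1.0, 0.0]
toy_car = [0.0, 0.1, 1.0]

print(round(cosine(toy_cat, toy_pet), 3), round(cosine(toy_cat, toy_car), 3))
```

Vectors pointing in the same direction score close to 1 regardless of their length, which is why a word compared with itself gives 1.0 in the output above.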


In [152]:
from scipy import spatial

cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

king = nlp.vocab['king'].vector
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector

# Now we find the closest vectors in the vocabulary 
# to the result of "king" - "man" + "woman"
new_vector = king - man + woman
computed_similarities = []

for word in nlp.vocab:
    # Ignore words without vectors and mixed-case words:
    if word.has_vector:
        if word.is_lower:
            if word.is_alpha:
                similarity = cosine_similarity(new_vector, word.vector)
                computed_similarities.append((word, similarity))

computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])

print([w[0].text for w in computed_similarities[:10]])

['king', 'woman', 'she', 'lion', 'who', 'when', 'dare', 'cat', 'was', 'not']


## Sentiment Analysis

In [153]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...


True

In [156]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

In [168]:
a = 'This was the best, most awesome movie EVER MADE!!!'
print(sid.polarity_scores(a))
b = 'This was the worst film to ever disgrace the screen.'
print(sid.polarity_scores(b))

{'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'compound': 0.8877}
{'neg': 0.477, 'neu': 0.523, 'pos': 0.0, 'compound': -0.8074}


In [160]:
df = pd.read_csv('data/amazonreviews.tsv', sep='\t')
df.head()

Unnamed: 0,label,review
0,pos,Stuning even for the non-gamer: This sound tra...
1,pos,The best soundtrack ever to anything.: I'm rea...
2,pos,Amazing!: This soundtrack is my favorite music...
3,pos,Excellent Soundtrack: I truly like this soundt...
4,pos,"Remember, Pull Your Jaw Off The Floor After He..."


In [161]:
# REMOVE NaN VALUES AND EMPTY STRINGS:
df.dropna(inplace=True)

blanks = []  # start with an empty list

for i,lb,rv in df.itertuples():  # iterate over the DataFrame
    if type(rv)==str:            # avoid NaN values
        if rv.isspace():         # test 'review' for whitespace
            blanks.append(i)     # add matching index numbers to the list

df.drop(blanks, inplace=True)

In [165]:
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))
df['compound']  = df['scores'].apply(lambda score_dict: score_dict['compound'])
df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >=0 else 'neg')
df.head()

Unnamed: 0,label,review,scores,compound,comp_score
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454,pos
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957,pos
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858,pos
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814,pos
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781,pos


In [167]:
print(metrics.accuracy_score(df['label'],df['comp_score']))
print(metrics.classification_report(df['label'],df['comp_score']))

0.7091
              precision    recall  f1-score   support

         neg       0.86      0.51      0.64      5097
         pos       0.64      0.91      0.75      4903

    accuracy                           0.71     10000
   macro avg       0.75      0.71      0.70     10000
weighted avg       0.75      0.71      0.70     10000
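
The cell above labels every review with `compound >= 0` as positive. A commonly cited convention for VADER is to treat scores within ±0.05 of zero as neutral; the sketch below applies that three-way rule. `label_from_compound` and its `threshold` parameter are hypothetical names, and the threshold is worth tuning for a given dataset.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to 'pos', 'neg' or 'neu'."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"

for score in (0.8877, -0.8074, 0.0):
    print(score, label_from_compound(score))
```

Since this dataset only has `pos` and `neg` labels, a neutral band would have to be dropped or merged into one of the two classes before computing accuracy.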



## Topic Modeling: LDA

In [171]:
npr = pd.read_csv('data/npr.csv')
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [173]:
# Preprocessing
from sklearn.feature_extraction.text import CountVectorizer
# remove frequent words (max_df=0.95) and very rare words (min_df=2)
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = cv.fit_transform(npr['Article'])

In [174]:
# LDA
from sklearn.decomposition import LatentDirichletAllocation
# n_components is the arbitrary number of expected topics
LDA = LatentDirichletAllocation(n_components=7,random_state=42)
LDA.fit(dtm)

LatentDirichletAllocation(n_components=7, random_state=42)

In [175]:
# Get top words by components (topics)
feature_names = cv.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0
for index,topic in enumerate(LDA.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']


THE TOP 15 WORDS FOR TOPIC #2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']


THE TOP 15 WORDS FOR TOPIC #3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']


THE TOP 15 WORDS FOR TOPIC #4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']


THE TOP 15 WORDS FOR TOPIC #5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think',

In [176]:
# Reassign topics to the dataset
topic_results = LDA.transform(dtm)
npr['Topic'] = topic_results.argmax(axis=1)
npr.head(10)

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2
5,I did not want to join yoga class. I hated tho...,3
6,With a who has publicly supported the debunk...,3
7,"I was standing by the airport exit, debating w...",2
8,"If movies were trying to be more realistic, pe...",3
9,"Eighteen years ago, on New Year’s Eve, David F...",2
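
`LDA.transform` returns one row of topic weights per document, and `argmax(axis=1)` keeps the index of the largest weight in each row. The plain-Python sketch below reproduces that step on made-up numbers; `toy_topic_weights` and `toy_topic_names` are assumptions for illustration.

```python
# Hypothetical document-topic weights: one row per document, one column per topic
toy_topic_weights = [
    [0.05, 0.80, 0.15],
    [0.60, 0.10, 0.30],
    [0.20, 0.25, 0.55],
]
toy_topic_names = {0: "economy", 1: "politics", 2: "local"}

def dominant_topic(row):
    """Index of the largest weight in the row (first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)

assigned = [dominant_topic(row) for row in toy_topic_weights]
print(assigned)
print([toy_topic_names[i] for i in assigned])
```

Keeping only the top index discards how confident the assignment is; a document with weights such as 0.34 / 0.33 / 0.33 is labelled the same way as one with 0.98 / 0.01 / 0.01.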


In [182]:
mytopic_dict = {0:'economy',1:'politics',2:'local',3:'health',4:'election',5:'music',6:'education'}

In [183]:
npr['Topic'] = npr['Topic'].map(mytopic_dict)

## Topic Modeling: Non-negative Matrix Factorization

In [None]:
#npr = pd.read_csv('data/npr.csv')
#npr.head()

In [177]:
# Preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
# remove frequent words (max_df=0.95) and very rare words (min_df=2)
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = tfidf.fit_transform(npr['Article'])

In [178]:
# NMF
from sklearn.decomposition import NMF
# n_components is the arbitrary number of expected topics
nmf_model = NMF(n_components=7,random_state=42)
nmf_model.fit(dtm)

NMF(n_components=7, random_state=42)

In [179]:
# Get top words by components (topics)
feature_names = tfidf.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0
for index,topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


THE TOP 15 WORDS FOR TOPIC #2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


THE TOP 15 WORDS FOR TOPIC #3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


THE TOP 15 WORDS FOR TOPIC #4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']


THE TOP 15 WORDS FOR TOPIC #5
['love', 've', 'don', 'al

In [187]:
# Reassign topics to the dataset
topic_results = nmf_model.transform(dtm)
npr['Topic2'] = topic_results.argmax(axis=1)

In [188]:
mytopic_dict2 = {0:'health',1:'election',2:'legislation',3:'politics',4:'election',5:'music',6:'education'}

In [189]:
npr['Topic2'] = npr['Topic2'].map(mytopic_dict2)
npr.head(10)

Unnamed: 0,Article,Topic,Topic2
0,"In the Washington of 2016, even when the polic...",politics,election
1,Donald Trump has used Twitter — his prefe...,politics,election
2,Donald Trump is unabashedly praising Russian...,politics,election
3,"Updated at 2:50 p. m. ET, Russian President Vl...",politics,politics
4,"From photography, illustration and video, to d...",local,education
5,I did not want to join yoga class. I hated tho...,health,music
6,With a who has publicly supported the debunk...,health,health
7,"I was standing by the airport exit, debating w...",local,health
8,"If movies were trying to be more realistic, pe...",health,health
9,"Eighteen years ago, on New Year’s Eve, David F...",local,music


## Summarization

In [9]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
nlp = spacy.load('en_core_web_sm')

In [10]:
doc ="""Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.In its application across business problems, machine learning is also referred to as predictive analytics."""

In [11]:
docx = nlp(doc)
extra_words = list(STOP_WORDS) + list(punctuation) + ['\n']

In [12]:
# Word frequency
all_words = [word.text for word in docx]

Freq_word = {}
for w in all_words:
    w1 = w.lower()
    if w1 not in extra_words and w1.isalpha():
        if w1 in Freq_word.keys():
            Freq_word[w1] += 1
        else:
            Freq_word[w1] = 1
#Freq_word

In [20]:
# Main topics
val = sorted(Freq_word.values())
max_freq = val[-3:]
print("Topic of document given:")
for word,freq in Freq_word.items():  
    
    if freq in max_freq:
        print(word, end=" ")
        
    else:
        continue

Topic of document given:
machine learning data 

In [22]:
# Normalize each word frequency by the highest frequency (term-frequency weighting, not a full TF-IDF)
for word in Freq_word.keys():  
    Freq_word[word] = (Freq_word[word] / max_freq[-1])
#Freq_word

In [23]:
# Sentence Strength (score)
sent_strength = {}
for sent in docx.sents:
    for word in sent :
       
        if word.text.lower() in Freq_word.keys():
            
            if sent in sent_strength.keys():
                sent_strength[sent]+=Freq_word[word.text.lower()]
            else:
                sent_strength[sent]=Freq_word[word.text.lower()]
        else:
            continue
#sent_strength

In [24]:
# Sort Sentences
top_sentences = (sorted(sent_strength.values())[::-1])
top20percent_sentence=int(0.2 * len(top_sentences))
top_sent=top_sentences[:top20percent_sentence]

In [25]:
# Summary
summary=[]
for sent,strength in sent_strength.items():  
    if strength in top_sent:
        summary.append(sent)
        
    else:
        continue

for i in summary:
    print(i,end="")

Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.
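
The cells above can be condensed into one function. The sketch below is a variation rather than a copy: it splits sentences with a regular expression instead of spaCy, and it tracks sentence indices instead of testing whether a score value appears in the top list. `summarize_by_frequency` and the sample text are assumptions for illustration.

```python
import re
from collections import Counter

def summarize_by_frequency(text, stop_words, n_sentences=1):
    """Score sentences by summed normalised word frequency and keep the top ones."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_words]
    freq = Counter(words)
    top = max(freq.values())
    scores = []
    for idx, sent in enumerate(sentences):
        sent_words = re.findall(r"[a-z]+", sent.lower())
        scores.append((sum(freq[w] / top for w in sent_words if w in freq), idx))
    # Pick the highest-scoring sentences, then restore document order
    best = sorted(sorted(scores, reverse=True)[:n_sentences], key=lambda pair: pair[1])
    return " ".join(sentences[idx] for _, idx in best)

sample = ("Cats sleep a lot. Cats and dogs are popular pets, and cats often "
          "ignore dogs. Birds sing.")
toy_summary = summarize_by_frequency(sample, stop_words={"a", "and", "are", "often"})
print(toy_summary)
```

Because scores are plain sums, longer sentences tend to win; dividing each score by the sentence length is one way to reduce that bias.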

### Compare result with gensim summarizer

In [26]:
# Requires gensim < 4.0: the gensim.summarization module was removed in gensim 4.0
from gensim.summarization import summarize
summarize(doc)

'Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.'

In [None]:
#https://towardsdatascience.com/7-models-on-huggingface-you-probably-didnt-knew-existed-f3d079a4fd7c