# HW 4: Natural Language Processing

## Q1: Extract data using regular expressions
Suppose you have scraped the text shown below from an online source. 
Define an `extract` function which:
- takes a piece of text (in the format shown below) as input
- extracts the data into a list of tuples, e.g. `[('F', 'Ford Motor Company', '15.70', '+0.25', '+1.62%'), ...]`, using a regular expression
- returns the list of tuples
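
As a reminder of how `re.findall` behaves when a pattern has several capture groups (it returns one tuple per match, with one string per group), here is a minimal sketch on made-up data. The sample string and the pattern are illustrative assumptions, not the expected solution for this question.

```python
import re

# Hypothetical sample: "<code> <label> <value>" rows, not the homework data
sample = """A1  red apple  3.50
B22  green pear  12.00"""

# Three capture groups -> findall returns a list of 3-tuples.
# The lazy (.+?) lets the label contain spaces; the trailing number ends the match.
pattern = r'^(\w+)\s+(.+?)\s+(\d+\.\d+)\s*$'
rows = re.findall(pattern, sample, flags=re.MULTILINE)
print(rows)
```

The same idea carries over to the stock table: one group per column, with the name column allowed to contain spaces and punctuation.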

In [1]:
import re
import nltk
import en_core_web_sm
from nltk.corpus import stopwords
import spacy
import string
import pandas as pd
import numpy as np 
from IPython.core.interactiveshell import InteractiveShell
from sklearn.metrics import pairwise_distances
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# columns are: Symbol   Name   Price   Change   % Change
text = '''F     Ford Motor Company    15.70     +0.25     +1.62%   
            AAPL   Apple Inc. 144.84   +1.08   +0.75%   
            SPCE   Virgin Galactic Holdings, Inc.   20.01   -4.05   -16.83%      
            BAC    Bank of America Corporation      46.37   +1.30   +2.88%      
            WFC    Wells Fargo & Company            48.38   +3.07   +6.78%      
            PCG    PG&E Corporation     11.20   +0.43   +3.99%      
            T      AT&T Inc.   25.70   +0.08   +0.31%'''

In [3]:
def extract(text):
    # one tuple per row: symbol, name, price, change, % change
    pattern = r'([A-Z]+)\s+([\w.,&\s]+\S)\s+([\d.]+)\s+([0-9.+-]+)\s+([0-9.%+-]+)'
    return re.findall(pattern, text)

In [4]:
extract(text)

[('F', 'Ford Motor Company', '15.70', '+0.25', '+1.62%'),
 ('AAPL', 'Apple Inc.', '144.84', '+1.08', '+0.75%'),
 ('SPCE', 'Virgin Galactic Holdings, Inc.', '20.01', '-4.05', '-16.83%'),
 ('BAC', 'Bank of America Corporation', '46.37', '+1.30', '+2.88%'),
 ('WFC', 'Wells Fargo & Company', '48.38', '+3.07', '+6.78%'),
 ('PCG', 'PG&E Corporation', '11.20', '+0.43', '+3.99%'),
 ('T', 'AT&T Inc.', '25.70', '+0.08', '+0.31%')]

## Q2: Develop a method to summarize a document

When you have a long document, you often want a concise summary that preserves its key information and overall meaning. Let's implement an `extractive method` based on the concept of TF-IDF. The idea is to identify the key sentences in an article and use them as a summary. Carefully follow the steps below to produce the summary.

### Q2.1. Preprocess the input document

Define a function `preprocess(doc, lemmatized = True, remove_stopword = True, remove_punctuation = True)` 
- Take four parameters as input:
    - `doc`: an input string (e.g. a document)
    - `lemmatized`: an optional boolean parameter indicating whether tokens are lemmatized. The default value is True (i.e. tokens are lemmatized).
    - `remove_stopword`: an optional boolean parameter indicating whether stop words are removed. The default value is True (i.e. remove stop words). 
    - `remove_punctuation`: an optional boolean parameter indicating whether punctuation is removed. The default value is True (i.e. remove all punctuation).
       
- Split the input `doc` into sentences
- Tokenize each sentence into unigram tokens and clean up the tokens as follows:
    - Remove '-\n' with a regular expression
    - If `lemmatized` is True, lemmatize all unigrams.
    - If `remove_stopword` is True, remove all stop words. 
    - If `remove_punctuation` is True, remove all punctuation.
    - Convert all tokens to lower case and remove empty ones

- Return the original sentence list (`sents`) and also the tokenized sentence list (`tokenized_sents`). 
   
(Hint: you can use [spacy](https://spacy.io/api/token#attributes) package for this task.)
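
To make the order of the cleanup steps concrete, here is a rough stand-in that uses only the standard library. It is a sketch under stated assumptions: the `clean_tokens` helper and the tiny `STOP_WORDS` set are hypothetical, tokens are split on whitespace, and no lemmatization is done. A real solution would rely on spaCy token attributes such as `token.lemma_`, `token.is_stop`, and `token.is_punct` instead.

```python
import re
import string

# Hypothetical tiny stop-word list, for illustration only
STOP_WORDS = {"the", "of", "and", "a", "to", "is"}

def clean_tokens(sentence, remove_stopword=True, remove_punctuation=True):
    """Whitespace-tokenize a sentence and apply the cleanup flags (no lemmatization)."""
    # rejoin words that were hyphenated across a line break
    sentence = re.sub(r'-\n', '', sentence)
    tokens = []
    for raw in sentence.split():
        tok = raw.lower()
        if remove_punctuation:
            # strip leading/trailing punctuation characters from the token
            tok = tok.strip(string.punctuation)
        if remove_stopword and tok in STOP_WORDS:
            continue
        if tok:  # drop empty tokens
            tokens.append(tok)
    return tokens

print(clean_tokens("The analy-\nsis of language is hard."))
```

Each flag is checked independently, so any combination of options produces a sensible token list.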

In [5]:
def preprocess(doc, lemmatized=True, remove_stopword=True, remove_punctuation=True):
    sents = []
    tokenized_sents = []
    nlp = spacy.load('en_core_web_sm')

    # rejoin words that were hyphenated across line breaks
    cleaned = re.sub(r'-\n', '', doc)
    for sent in nlp(cleaned).sents:
        tokens = []
        for token in sent:
            # apply each filter only when its flag is set
            if token.is_space:
                continue
            if remove_stopword and token.is_stop:
                continue
            if remove_punctuation and token.is_punct:
                continue
            tokens.append((token.lemma_ if lemmatized else token.text).lower())
        # skip sentences with no tokens left so both lists stay aligned
        if tokens:
            sents.append(sent.text)
            tokenized_sents.append(tokens)

    return sents, tokenized_sents

In [6]:
# load test document

text = open("hw1_test_doc.txt", "r", encoding='utf-8').read()
print(text[0:500])

Advances in natural
language processing
Julia Hirschberg1* and Christopher D. Manning2,3
Natural language processing employs computational techniques for the purpose of learning,
understanding, and producing human language content. Early computational approaches to
language research focused on automating the analysis of the linguistic structure of language
and developing basic technologies such as machine translation, speech recognition, and speech
synthesis. Today’s researchers refine and make 


In [7]:
# test preprocess function
sents, tokenized_sents = preprocess(text)

for sent in tokenized_sents[0:5]:
    print(sent)


['advance', 'natural', 'language', 'processing', 'julia', 'hirschberg1', 'christopher', 'd.', 'manning2,3', 'natural', 'language', 'processing', 'employ', 'computational', 'technique', 'purpose', 'learning', 'understanding', 'produce', 'human', 'language', 'content']
['early', 'computational', 'approach', 'language', 'research', 'focus', 'automate', 'analysis', 'linguistic', 'structure', 'language', 'develop', 'basic', 'technology', 'machine', 'translation', 'speech', 'recognition', 'speech', 'synthesis']
['today', 'researcher', 'refine', 'use', 'tool', 'real', 'world', 'application', 'create', 'spoken', 'dialogue', 'system', 'speech', 'speech', 'translation', 'engine', 'mine', 'social', 'medium', 'information', 'health', 'finance', 'identify', 'sentiment', 'emotion', 'product', 'service']
['describe', 'success', 'challenge', 'rapidly', 'advance', 'area']
['past', '20', 'year', 'computational', 'linguistic', 'grow', 'exciting', 'area', 'scientific', 'research', 'practical', 'technology

### Q2.2. Generate TF-IDF representations for sentences

Define a function `compute_tf_idf(sents, use_idf)` as follows: 


- Take the following two inputs:
    - `sents`: tokenized sentences returned from Q2.1. These sentences form a corpus for you to calculate `TF-IDF` vectors.
    - `use_idf`: if this option is True, return smoothed normalized `TF-IDF` vectors for all sentences; otherwise, return only the normalized `TF` vector for each sentence.
    
- Calculate `TF-IDF` vectors as shown in the lecture notes (Hint: you can slightly modify code segment 7.5 in NLP Lecture Notes (II) for this task)

- Return the `TF-IDF` vectors  if `use_idf` is True.  Return the `TF` vectors if `use_idf` is False.
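
The calculation can be sketched on a toy corpus in plain Python. This assumes the smoothed form used in the solution cell for this question, `idf = ln((N + 1) / (df + 1)) + 1`, with term frequency normalized by sentence length; the three token lists are hypothetical data, and the lecture notes may define the normalization differently.

```python
import math

# Toy corpus of tokenized "sentences" (hypothetical data)
docs = [["speech", "speech", "model"], ["model", "data"], ["model", "nlp"]]
vocab = sorted({tok for doc in docs for tok in doc})  # column order of the vectors
n_docs = len(docs)

# term frequency, normalized by sentence length
tf = [[doc.count(term) / len(doc) for term in vocab] for doc in docs]

# document frequency: number of sentences that contain each term
df = [sum(term in doc for doc in docs) for term in vocab]

# smoothed IDF: ln((N + 1) / (df + 1)) + 1
idf = [math.log((n_docs + 1) / (d + 1)) + 1 for d in df]

# TF-IDF: scale each term frequency by the term's IDF weight
tf_idf = [[t * w for t, w in zip(row, idf)] for row in tf]
for row in tf_idf:
    print([round(x, 3) for x in row])
```

A term that occurs in every sentence (`model` here) gets an IDF of exactly 1, so its weight is just its term frequency, while rarer terms are boosted.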

In [8]:
def compute_tf_idf(sents, use_idf = True):

    # term counts per sentence, built from the `sents` argument (not a global)
    docs_tokens={idx:{token:sent.count(token) for token in set(sent)} for idx,sent in enumerate(sents)}
    dtm=pd.DataFrame.from_dict(docs_tokens, orient="index" )
    dtm=dtm.fillna(0)
    dtm = dtm.sort_index(axis = 0)
    # term frequency normalized by sentence length
    tf=dtm.values
    doc_len=tf.sum(axis=1, keepdims=True)
    tf=np.divide(tf, doc_len)
    if not use_idf:
        return tf
    # smoothed IDF: log((N + 1) / (df + 1)) + 1
    df=np.where(tf>0,1,0)
    smoothed_idf=np.log(np.divide(len(sents)+1, np.sum(df, axis=0)+1))+1    
    return tf*smoothed_idf

In [9]:
tf_idf = compute_tf_idf(tokenized_sents, use_idf = True)

# show shape of TF-IDF
tf_idf.shape

(183, 1172)

### Q2.3. Identify key sentences as summary

The basic idea is that, in a coherent article, all sentences should center around some key ideas. If we can identify a subset of sentences, denoted as $S_{key}$, which precisely captures the key ideas, then $S_{key}$ can be used as a summary. Moreover, the sentences in $S_{key}$ should have high similarity to all the other sentences on average, because all sentences are centered around the key ideas contained in $S_{key}$. Therefore, we can decide whether a sentence belongs to $S_{key}$ by its similarity to all the other sentences.


Define a function `get_summary(tf_idf, sents, topN = 5)`  as follows:

- This function takes three inputs:
    - `tf_idf`: the TF-IDF vectors of all the sentences in a document
    - `sents`: the original sentences corresponding to the TF-IDF vectors
    - `topN`: the number of top-ranked sentences to include in the generated summary

- Steps:
    1. Calculate the cosine similarity for every pair of TF-IDF vectors
    1. For each sentence, calculate its average similarity to all the others
    1. Select the sentences with the `topN` largest average similarity
    1. Print the indexes of the `topN` sentences
    1. Return these sentences as the summary
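
The ranking steps above can be sketched in plain Python on toy vectors. The helper names `cosine` and `rank_by_avg_similarity` and the 2-D vectors are hypothetical and for illustration only; the sketch assumes at least two vectors, and a real solution would typically use a vectorized routine such as scikit-learn's `pairwise_distances`.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_by_avg_similarity(vectors, top_n):
    """Return indexes of the top_n vectors with the highest mean similarity to the others."""
    n = len(vectors)
    averages = []
    for i in range(n):
        # similarity to every other vector, excluding the vector itself
        others = [cosine(vectors[i], vectors[j]) for j in range(n) if j != i]
        averages.append(sum(others) / len(others))
    return sorted(range(n), key=lambda i: averages[i], reverse=True)[:top_n]

# Hypothetical 2-D "sentence vectors": index 1 points between the other two
vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(rank_by_avg_similarity(vectors, top_n=1))
```

The vector at index 1 overlaps with both neighbours, so it has the highest average similarity and is selected first, which mirrors how a "central" sentence is picked for the summary.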

In [10]:
def get_summary(tf_idf, sents, topN = 5):

    # cosine similarity for every pair of sentence vectors
    similarity=1-pairwise_distances(tf_idf, metric = 'cosine')
    # average similarity to all the other sentences (drop the self-similarity on the diagonal)
    n=similarity.shape[0]
    average=(similarity.sum(axis=0)-similarity.diagonal())/max(n-1, 1)
    # indexes of the topN sentences, most similar first
    order=np.argsort(-average)[0:topN]
    summary=[sents[i] for i in order]

    return summary 

In [11]:
# put everything together and test with different options

sents, tokenized_sents = preprocess(text)
tf_idf = compute_tf_idf(tokenized_sents, use_idf = True)
summary = get_summary(tf_idf, sents, topN = 5)

for sent in summary:
    print(sent,"\n")

# test with the option lemmatized=False, remove_stopword=False

# sents, tokenized_sents = preprocess(text, lemmatized=False, remove_stopword=False, remove_punctuation = True )
# tf_idf = compute_tf_idf(tokenized_sents, use_idf = True)
# summary = get_summary(tf_idf, sents, topN = 5)
# for sent in summary:
#    print(sent,"\n")

# # test with the option use_idf = False

# sents, tokenized_sents = preprocess(text)
# tf_idf = compute_tf_idf(tokenized_sents, use_idf = False)
# summary = get_summary(tf_idf, sents, topN = 5)
# for sent in summary:
#    print(sent,"\n")

Early computational approaches to
language research focused on automating the analysis of the linguistic structure of language
and developing basic technologies such as machine translation, speech recognition, and speech
synthesis. 

Today’s researchers refine and make use of such tools in real-world applications,
creating spoken dialogue systems and speech-to-speech translation engines, mining social
media for information about health or finance, and identifying sentiment and emotion toward
products and services. 

These efforts illustrate computational
approaches to big data, based on current cuttingedge methodologies that combine statistical analysis and ML with knowledge of language. 

Building large models
of this form is much more practical with the
massive parallel computation that is now economically available via graphics processing units. 

2Department of Linguistics, Stanford University,
Stanford, CA 94305-2150, USA. 



### Q2.4. Analysis

- Do you think this method is able to generate a good summary? What pros or cons have you observed?
- Do these options `lemmatized, remove_stopword, remove_punctuation, use_idf` matter? 
- Why do you think these options matter or do not matter? 
- If these options matter, what are the best values for these options?


Write your analysis as a PDF file. Be sure to provide some evidence from the output of each step to support your arguments.

### Q2.5. (Bonus)


- Can you think of a way to improve this extractive summary method? Explain the method you propose, implement it, use it to generate a new summary, and demonstrate what is improved in the new summary.

- Or, you can research other extractive summarization methods and implement one here. Compare it with the one you implemented in Q2.1-Q2.3 and show the pros and cons of each method.


In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sents = nltk.sent_tokenize(text)
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(sents)
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
# rank sentences by their average cosine similarity to all sentences
sim_sum=cosine_sim.sum(axis=0)
average=sim_sum/len(sim_sum)
order=np.argsort(-average)
# keep the five highest-ranked sentences
summaryall=[sents[i] for i in order[0:5]]
summaryall

['The creation of SDSs, whether between hu-\nmans or between humans and artificial agents,\nrequires tools for automatic speech recognition\n(ASR), to identify what a human says; dialogue\nmanagement (DM), to determine what that hu-\nman wants; actions to obtain the information or\nperform the activity requested; and text-to-speech\n(TTS) synthesis, to convey that information back\nto the human in spoken form.',
 'Computation-\nal linguistic systems can have multiple purposes:\nThe goal can be aiding human-human commu-\nnication, such as in machine translation (MT);\naiding human-machine communication, such as\nwith conversational agents; or benefiting both\nhumans and machines by analyzing and learn-\ning from the enormous quantity of human lan-\nguage content that is now available online.',
 'Early computational approaches to\nlanguage research focused on automating the analysis of the linguistic structure of language\nand developing basic technologies such as machine translation, sp

In [14]:
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

sents, tokenized_sents = preprocess(text)
# reuse the TF-IDF vectors from Q2.2 instead of recomputing them inline
tf_idf = compute_tf_idf(tokenized_sents, use_idf = True)
# build a sentence graph weighted by cosine similarity and score it with PageRank
cosine_sim = cosine_similarity(tf_idf, tf_idf)
nx_graph = nx.from_numpy_array(cosine_sim)
scores = nx.pagerank(nx_graph)
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(tokenized_sents)), reverse=True)
# show the token lists of the five highest-ranked sentences
ranked_sentences[0][1]
ranked_sentences[1][1]
ranked_sentences[2][1]
ranked_sentences[3][1]
ranked_sentences[4][1]


['early',
 'computational',
 'approach',
 'language',
 'research',
 'focus',
 'automate',
 'analysis',
 'linguistic',
 'structure',
 'language',
 'develop',
 'basic',
 'technology',
 'machine',
 'translation',
 'speech',
 'recognition',
 'speech',
 'synthesis']

['today',
 'researcher',
 'refine',
 'use',
 'tool',
 'real',
 'world',
 'application',
 'create',
 'spoken',
 'dialogue',
 'system',
 'speech',
 'speech',
 'translation',
 'engine',
 'mine',
 'social',
 'medium',
 'information',
 'health',
 'finance',
 'identify',
 'sentiment',
 'emotion',
 'product',
 'service']

['computational',
 'linguistic',
 'know',
 'natural',
 'language',
 'processing',
 'nlp',
 'subfield',
 'computer',
 'science',
 'concern',
 'computational',
 'technique',
 'learn',
 'understand',
 'produce',
 'human',
 'language',
 'content']

['creation',
 'sds',
 'human',
 'human',
 'artificial',
 'agent',
 'require',
 'tool',
 'automatic',
 'speech',
 'recognition',
 'asr',
 'identify',
 'human',
 'say',
 'dialogue',
 'management',
 'dm',
 'determine',
 'human',
 'want',
 'action',
 'obtain',
 'information',
 'perform',
 'activity',
 'request',
 'text',
 'speech',
 'tts',
 'synthesis',
 'convey',
 'information',
 'human',
 'spoken',
 'form']

['subsequent',
 'research',
 'aim',
 'well',
 'exploit',
 'structure',
 'human',
 'language',
 'sentence',
 'i.e.',
 'syntax',
 'translation',
 'system',
 '7',
 '8)',
 'researcher',
 'actively',
 'build',
 'deeply',
 'meaning',
 'representation',
 'language',
 '9',
 'enable',
 'new',
 'level',
 'semantic',
 'mt']

In [29]:
if __name__ == "__main__":  
    
    # Test Q1
    
    text='''F   Ford Motor Company  15.70   +0.25   +1.62%   
AAPL   Apple Inc.   144.84   +1.08   +0.75%   
SPCE   Virgin Galactic Holdings, Inc.   20.01   -4.05   -16.83%      
BAC   Bank of America Corporation   46.37   +1.30   +2.88%      
WFC   Wells Fargo & Company   48.38   +3.07   +6.78%      
PCG   PG&E Corporation   11.20   +0.43   +3.99%      
T   AT&T Inc.   25.70   +0.08   +0.31%'''
    
    
    print("\n==================\n")
    print("Test Q1")
    print(extract(text))
    
    print("\n==================\n")
    print("Test Q2")
    
    text = open("hw1_test_doc.txt", "r", encoding='utf-8').read()
    sents, tokenized_sents = preprocess(text)
    tf_idf = compute_tf_idf(tokenized_sents, use_idf = True)
    summary = get_summary(tf_idf, sents, topN = 5)
    
    for sent in summary:
        print(sent,"\n")




Test Q1
[('F', 'Ford Motor Company', '15.70', '+0.25', '+1.62%'), ('AAPL', 'Apple Inc.', '144.84', '+1.08', '+0.75%'), ('SPCE', 'Virgin Galactic Holdings, Inc.', '20.01', '-4.05', '-16.83%'), ('BAC', 'Bank of America Corporation', '46.37', '+1.30', '+2.88%'), ('WFC', 'Wells Fargo & Company', '48.38', '+3.07', '+6.78%'), ('PCG', 'PG&E Corporation', '11.20', '+0.43', '+3.99%'), ('T', 'AT&T Inc.', '25.70', '+0.08', '+0.31%')]


Test Q2
Early computational approaches to
language research focused on automating the analysis of the linguistic structure of language
and developing basic technologies such as machine translation, speech recognition, and speech
synthesis. 

Today’s researchers refine and make use of such tools in real-world applications,
creating spoken dialogue systems and speech-to-speech translation engines, mining social
media for information about health or finance, and identifying sentiment and emotion toward
products and services. 

These efforts illustrate computational
