As a first introduction to Natural Language Processing (NLP), we will try some rudimentary summarisation techniques. Along the way we will see some common steps for cleaning text and preparing it for processing.

For our texts, we will scrape articles from Wikipedia. We will not go into much depth on the subject of web scraping, but we will cover the basics.

In [11]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
link = "https://en.wikipedia.org/wiki/Battle_of_Halidon_Hill"
page = requests.get(link)
soup = BeautifulSoup(page.content, "html.parser")
# With the three steps above we get the HTML code of the Wikipedia page.
# However, if you take a look, you will notice that it contains a lot of
# unnecessary material. Thankfully, Wikipedia wraps the paragraphs of the
# article text in <p> tags.
content = soup.find(id = "bodyContent").find_all("p")
# Now we have all the text of the article. It still contains some unnecessary
# material, such as hyperlinks, images, references etc. We will apply some
# filters and remove them.
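
To make the role of the `<p>` tags concrete without fetching a live page, here is a minimal sketch that collects paragraph text from an inline HTML snippet. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup, so it is not the method used in this notebook; the `ParagraphCollector` class and the HTML snippet are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False  # True while we are inside a <p> element
        self.paragraphs = []       # one string per paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags such as <a> or <sup> is reported here too.
        if self.in_paragraph:
            self.paragraphs[-1] += data


# Hypothetical snippet imitating the structure of an article page.
html = (
    '<div id="bodyContent">'
    "<h1>Title</h1>"
    '<p>The <a href="/wiki/Battle">battle</a> was fought in 1333.<sup>[1]</sup></p>'
    "<table><tr><td>Infobox data</td></tr></table>"
    "<p>It ended in an English victory.</p>"
    "</div>"
)

collector = ParagraphCollector()
collector.feed(html)
print(collector.paragraphs)
```

The heading and the table are skipped, while the link text and the reference marker inside the paragraph are kept. That is why the reference markers still have to be filtered out in the next step.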

In [12]:
wiki_text = ""
for paragraph in content:
    wiki_text += paragraph.text
# We go through the content and append the text of each paragraph.
wiki_text = wiki_text.replace("\n", " ")
wiki_text = re.sub(r"\[.*?\]", "", wiki_text)
# str.replace returns a new string, so the result must be assigned back.
wiki_text = wiki_text.replace("\\", " ")
# Now we have replaced all the line breaks and removed anything that starts
# and ends with square brackets. This removes the references and results in
# negligible loss of information, as Wikipedia rarely uses square brackets in
# running text.
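
To see what these filters do, here is a small self-contained sketch that applies the same kind of substitutions to a made-up string. The sample text is hypothetical, and the final whitespace clean-up is an extra step added for illustration; it is not part of the cell above.

```python
import re

# Hypothetical sample imitating text scraped from a Wikipedia paragraph.
raw = "The battle took place on 19 July 1333.[1]\nThe Scots were defeated.[note 2][3]"

# Replace line breaks with spaces so the text becomes one continuous string.
cleaned = raw.replace("\n", " ")
# Remove every shortest run that starts with "[" and ends with "]".
cleaned = re.sub(r"\[.*?\]", "", cleaned)
# Collapse any repeated whitespace left behind by the removals.
cleaned = re.sub(r"\s{2,}", " ", cleaned).strip()

print(cleaned)
```

The non-greedy `.*?` matters here: a greedy `.*` would delete everything between the first `[` and the last `]` in the text.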

Now the text is mostly clean, some minor issues notwithstanding. We will import our NLP library, spaCy, to transform the text from a str object into a Doc object.

In [13]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS as stopwords
import string
nlp = spacy.load("en_core_web_sm")
# We load a small general-purpose English pipeline, trained on web text. It
# has to be installed first, e.g. with: python -m spacy download en_core_web_sm
# We also import the English stop words and the string module, which provides
# the punctuation marks, for filtering those out later.

Among many other things, NLP libraries allow us to manipulate text and extract information from it. The most common pieces of information we can extract for each word are:
1) Text
2) Lemma (The dictionary form of the word)
3) Part of speech (e.g. verb, pronoun, adjective etc.)
4) Shape (The word with its letters replaced by "x" or "X" according to case and its digits by "d"; spaCy truncates runs of the same character after four)
5) Whether or not it is alphabetic
6) Whether or not it is a stop word
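
The shape attribute is the least intuitive item on this list, so here is a simplified pure-Python approximation of how such a string can be built. It is a sketch for illustration only, not spaCy's own implementation; `word_shape` is a hypothetical helper, and the real values should always be read from `token.shape_`.

```python
def word_shape(text: str) -> str:
    """Approximate a spaCy-style shape string for a single token."""
    shape = []
    last = ""  # the previously mapped character
    run = 0    # length of the current run of identical mapped characters
    for char in text:
        if char.isalpha():
            mapped = "X" if char.isupper() else "x"
        elif char.isdigit():
            mapped = "d"
        else:
            mapped = char  # punctuation and symbols are kept as they are
        if mapped == last:
            run += 1
        else:
            run = 1
            last = mapped
        # Runs of the same mapped character are cut off after four.
        if run <= 4:
            shape.append(mapped)
    return "".join(shape)


for token in ["Halidon", "hill", "1333", "19th", "UK"]:
    print(token, "->", word_shape(token))
```

Under this approximation "Halidon" maps to "Xxxxx" and "1333" to "dddd", which shows that the shape keeps case and character class but not the exact length of long words.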

In [30]:
doc = nlp(wiki_text)
labels = ["Text", "Lemma", "PoS", "Shape", "Alphabetic", "Stopword"]
# We will create a dataframe to store each word in the text, alongside its
# information. Building the rows first and creating the dataframe once is
# much faster than growing it one cell at a time with .loc.
rows = [
    [word.text, word.lemma_, word.pos_, word.shape_, word.is_alpha, word.is_stop]
    for word in doc
]
doc_df = pd.DataFrame(rows, columns=labels)
# Remove any whitespace and punctuation marks that are counted as words
doc_df = doc_df.loc[(doc_df["PoS"] != "SPACE") & (doc_df["PoS"] != "PUNCT")]

We could use the resulting dataframe to do our frequency analysis, but iterating over it repeatedly would be time-consuming. We will instead use dictionaries in our approach.

The following piece of code is nothing more than a for-loop with a few conditions that counts how often each word (stop words excluded) appears in the text.
After that we will look at each sentence separately and sum the normalised frequency counts of its words.
In the end, we will keep the N% of sentences with the highest frequency sums and merge them in order of appearance in the original text.
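
Before going through the spaCy version, the whole procedure can be sketched on a toy text using only the standard library. This is a minimal illustration under stated assumptions: the stop-word list `TOY_STOPWORDS`, the `summarise` function and the regex-based sentence split are all hypothetical simplifications, and they are far cruder than spaCy's stop words and sentence segmentation.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for this example only.
TOY_STOPWORDS = {"the", "a", "an", "of", "in", "and", "was", "were", "by", "to", "it"}


def tokenise(sentence):
    """Lower-case the sentence and return its words (a naive tokeniser)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def summarise(text, ratio=0.5):
    """Return the top-scoring sentences of `text`, in their original order."""
    # Naive split after ., ! or ? followed by whitespace (assumes no abbreviations).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # 1) Count every word that is not a stop word.
    counts = Counter(
        w for s in sentences for w in tokenise(s) if w not in TOY_STOPWORDS
    )
    if not counts:
        return []
    # 2) Normalise the counts by the highest count.
    top = max(counts.values())
    weights = {word: count / top for word, count in counts.items()}
    # 3) Score each sentence as the sum of the weights of its words.
    scores = [sum(weights.get(w, 0.0) for w in tokenise(s)) for s in sentences]
    # 4) Keep the best sentences, then restore their order of appearance.
    keep = max(1, int(len(sentences) * ratio))
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(best)]


text = (
    "The battle was fought near Berwick. "
    "The weather was mild. "
    "The English army won the battle. "
    "Berwick surrendered after the battle."
)
summary = summarise(text, ratio=0.5)
print(summary)
```

With this toy input the first and last sentences are kept, because they contain both of the most frequent words, "battle" and "berwick". Note that summing the weights favours long sentences, which is a known limitation of this simple scoring.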

In [16]:
frequency = {}
# A set makes the punctuation membership test below fast.
punctuation = set(string.punctuation)
for word in doc:
    word_temp = word.text.lower()
    # We go through each word in the text and, if it is not a stop word,
    # punctuation mark or whitespace token, we count its occurrences in a
    # dictionary.
    if word_temp in stopwords or word_temp in punctuation or word.is_space:
        continue
    # If it is the first time we encounter this word, get() returns 0 and we
    # create an entry in the dictionary. Otherwise we add one to the count.
    frequency[word_temp] = frequency.get(word_temp, 0) + 1

In [17]:
# Find the word with the greatest occurrence and divide everything by that, 
# in order to normalise the count.
max_frequency = max(frequency.values())
for word in frequency.keys():
    frequency[word] = frequency[word] / max_frequency

In [18]:
sentences_total = list(doc.sents)
sentence_sum = {}
# Now we will do the same as above, but instead of separate words we will
# target sentences, and add up the frequencies of their constituent words.
for sentence in sentences_total:
    for word in sentence:
        word_temp = word.text.lower()
        if word_temp in frequency:
            sentence_sum[sentence] = sentence_sum.get(sentence, 0) + frequency[word_temp]

In [19]:
# Calculate how many sentences make up N% (let's say 5%) of the total,
# keeping at least one so that short articles still get a summary.
length = max(1, int(len(sentences_total) * 0.05))
# Now we will transform the dictionary with the frequency sums into a dataframe.
# The reason is that the dictionary still preserves the order of appearance of
# the sentences, and we will lose that information if we just sort it.
df = pd.DataFrame.from_dict([sentence_sum])
df = (df.T).reset_index()
labels = {"index": "Sentence", 0: "Score"}
df.rename(labels, axis=1, inplace=True)
df["Rank"] = range(len(df))
# Now on each row of the dataframe we have a sentence, its frequency score,
# as well as its order of appearance.
df = df.sort_values(by="Score", ascending=False)
df = df.head(length)
# Keep only the top sentences that we want.
df = df.sort_values(by="Rank").reset_index(drop=True)
# Sort them by order of appearance and combine them in a list.
result = df["Sentence"].tolist()

We now have our summary of the Wikipedia article. Naturally it is not very accurate, and since it works by frequency analysis, it favours names that appear often in the text. We could mitigate that by passing the text through a filter of given names.
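
One way to read that suggestion is to drop known names from the frequency table before the sentences are scored. The sketch below assumes a ready-made set of names; `drop_names`, the sample frequencies and the name list are all hypothetical, and in practice the list would have to come from an external source.

```python
def drop_names(frequency, names):
    """Return a copy of `frequency` without the given names (case-insensitive)."""
    lowered = {name.lower() for name in names}
    return {word: score for word, score in frequency.items() if word not in lowered}


# Hypothetical normalised frequencies and name list, for illustration only.
sample_frequency = {"edward": 1.0, "battle": 0.8, "douglas": 0.6, "army": 0.4}
given_names = {"Edward", "Douglas"}

filtered = drop_names(sample_frequency, given_names)
print(filtered)
```

After filtering, the remaining scores may need to be normalised again, since the most frequent word might have been one of the removed names.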