# SCC.413 Applied Data Mining
# Week 18
# Feature Extraction

## Contents
* [Introduction](#intro)
* [Preamble](#preamble)
* [Bag of Words](#bow)
    - [Filtered List](#filtered)
    - [Word N-grams](#wordn)
* [Characters](#chars)
    - [Char N-grams](#charn)
* [Annotation](#ann)
* [Other features](#other)
* [Documents](#docs)
* [MPs Dataset](#mps)
* [Corpus analysis](#corpus)
* [TF-IDF](#tfidf)
* [Exercise](#ex)

<a name="intro"></a>
## Introduction

In previous weeks we have collected data, preprocessed and cleaned it, and tokenised the text into meaningful units ("words"). Now with usable text and a token list, in this lab we will look to extract features by counting occurrences of different elements, and calculating other features over the text, tokens, and other features.

A range of features will be looked at here that can be used for a variety of analyses; however, there are many other features that can be extracted (see lecture slides and reading list). You should keep in mind how preprocessing and tokenisation (your pipeline) can impact the features extracted.

<a name="preamble"></a>
## Preamble

You should upload all of the provided files to a Google Drive folder; you can then access these files from your Python code. See also the files tab.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/')

We save the folder we are working from as a variable for easy access. You may need to edit the path to match your own.

In [None]:
working_folder = '/content/gdrive/MyDrive/413/wk18/'

The below code adds the working folder to the system path, so you can import Python files from this folder.

In [None]:
import sys
sys.path.append(working_folder)

We can use code from last week to preprocess our text. A method is defined below to do some basic preprocessing; please check your understanding. You may see fit to edit the preprocessing to suit your needs later.

In [None]:
!pip install ftfy

In [None]:
import ftfy
import re

hashtag_re = re.compile(r"#\w+")
mention_re = re.compile(r"@\w+")
url_re = re.compile(r"(?:https?://)?(?:[-\w]+\.)+[a-zA-Z]{2,9}[-\w/#~:;.?+=&%@~]*")

def preprocess(text):
    p_text = hashtag_re.sub("[hashtag]",text)
    p_text = mention_re.sub("[mention]",p_text)
    p_text = url_re.sub("[url]",p_text)
    p_text = ftfy.fix_text(p_text)
    return p_text

To demonstrate the feature extraction, we're going to start by working with a single tweet:

In [None]:
tweet = "This week we’re at a #careers event in #Blackpool @Pleasure_Beach, talking to students about #languages and language careers! Come have a go at some of our activities! 🌏#LoveLanguages #LoveLancaster @Lancaster_CI https://t.co/vQQWdrUuqh"

In [None]:
p_tweet = preprocess(tweet)
print(p_tweet)
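To check your understanding of the substitution order, the sketch below applies the same three patterns to two made-up strings. It leaves out `ftfy` so that it runs with the standard library only; `replace_markers` and the sample strings are illustrative names, not part of the lab code.

```python
import re

# Same patterns as in the preprocessing cell above.
hashtag_re = re.compile(r"#\w+")
mention_re = re.compile(r"@\w+")
url_re = re.compile(r"(?:https?://)?(?:[-\w]+\.)+[a-zA-Z]{2,9}[-\w/#~:;.?+=&%@~]*")

def replace_markers(text):
    # Hypothetical helper: the three substitutions from preprocess(), without ftfy.
    text = hashtag_re.sub("[hashtag]", text)
    text = mention_re.sub("[mention]", text)
    return url_re.sub("[url]", text)

plain = replace_markers("Join #NLP with @alice at https://example.org/lab now")
fragment = replace_markers("See https://example.org/page#top")

print(plain)     # Join [hashtag] with [mention] at [url] now
print(fragment)  # See [url][hashtag]
```

The second string shows one thing to watch for: because hashtags are replaced before URLs, a URL fragment such as `#top` is treated as a hashtag. Whether that matters depends on your data.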

For tokenisation, we have a basic custom tokeniser. This is equivalent to the custom tokenisers created last week, but with a pre-compiled regular expression. Alternation is used to separate patterns. Again, you may see fit to edit to suit your needs later.

In [None]:
tokenise_re = re.compile(r"(\[[^\]]+\]|[-'\w]+|[^\s\w\[']+)") #([]|words|other non-space)
def custom_tokenise(text):
    return tokenise_re.findall(text)
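As a small worked example of the three alternatives, the sketch below runs the same pattern over two made-up strings (the strings are assumptions for illustration only).

```python
import re

# Same pattern as above: [marker] | word (with - and ') | run of other non-space symbols.
tokenise_re = re.compile(r"(\[[^\]]+\]|[-'\w]+|[^\s\w\[']+)")

def custom_tokenise(text):
    return tokenise_re.findall(text)

tokens_a = custom_tokenise("We're at a [hashtag] event, talking! [url]")
tokens_b = custom_tokenise("A well-known idea!?")

print(tokens_a)  # ["We're", 'at', 'a', '[hashtag]', 'event', ',', 'talking', '!', '[url]']
print(tokens_b)  # ['A', 'well-known', 'idea', '!?']
```

Note that apostrophes and hyphens stay inside words, bracketed markers survive as single tokens, and adjacent punctuation such as `!?` is grouped into one token by the third alternative.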

Utility methods for displaying/saving tokens list. Can be used for any list.

In [None]:
def print_tokens(tokens):
    for token in tokens: #iterate tokens and print one per line.
        print(token)
    print(f"Total: {len(tokens)} tokens")

In [None]:
def save_tokens(tokens, outfile):
    with open(outfile, 'w', encoding="utf-8") as f:
        for token in tokens: #iterate tokens and output to file.
            f.write(token + '\n')
        f.write(f"Total: {len(tokens)} tokens")

<a name="bow"></a>
## Bag of words

Probably the most common NLP feature set, traditionally, is the "bag of words". This is a count of each word in the text, disregarding context. Whilst limited, due to the lack of context, a simple bag of words can achieve reasonable results for simple classification tasks, and is often used as a baseline.

First we need to tokenise the text. The tokenisation used will determine what is considered a "word", although post processing of the token list could be undertaken, e.g. to filter.

In [None]:
tokens = custom_tokenise(p_tweet)

In [None]:
print_tokens(tokens)

For a simple bag of words, it often makes sense to make the token list all lowercase, so that differently cased forms of the same word are merged (e.g. if a word is at the beginning of a sentence).

In [None]:
lower_tokens = [t.lower() for t in tokens] #list comprehension
print_tokens(lower_tokens)

Note that Python's `lower()` method is Unicode aware, and will lowercase letters with diacritics and from non-Latin alphabets.

In [None]:
"ÅÉÎÑÇΛФ".lower()

To make a frequency list, we simply place the token list in a [`Counter`](https://docs.python.org/3.7/library/collections.html#counter-objects) object, which extends `dict`, mapping items to frequencies. [NLTK's FreqDist](http://www.nltk.org/_modules/nltk/probability.html#FreqDist), which extends `Counter`, could also be used.

In [None]:
from collections import Counter

tokens_fql = Counter(lower_tokens)

In [None]:
tokens_fql.most_common() #displays frequency list in descending frequency order.

<a name="filtered"></a>
### Filtered list

At some point we will need to filter the bag of words, e.g. to some top-500 or top-1000 words, as it rarely makes sense to have a feature vector containing all words.

The method below uses word frequencies to create a new frequency list containing every word in a predefined list, including 0s for words not found (a dense vector). The vector can be made sparse (0s removed) with the unary plus operator, `+counter`.

In [None]:
def filter_fql(fql, predefined_list):
    return Counter({t: fql[t] for t in predefined_list}) #dict comprehension, t: fql[t] is token: freq.

A common feature set (especially for authorship analysis) is function words (aka stop words). Here we use the function word list taken from https://ieeexplore.ieee.org/abstract/document/6234420.

In [None]:
def read_list(file):
    with open(file) as f:
        items = []
        lines = f.readlines()
        for line in lines:
            items.append(line.strip())
    return items

In [None]:
fws = read_list(working_folder + "functionwords.txt")

In [None]:
fws_fql = filter_fql(tokens_fql, fws)
fws_fql.most_common()

Remove 0s, and make into sparse vector: 

In [None]:
+fws_fql
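The difference between the dense and sparse forms can be seen on a toy example. This is a minimal sketch: the token list and the three-word `wanted` list are made up, and `filter_fql` is repeated so the block runs on its own.

```python
from collections import Counter

def filter_fql(fql, predefined_list):
    return Counter({t: fql[t] for t in predefined_list})

fql = Counter(["the", "cat", "sat", "on", "the", "mat"])
wanted = ["the", "on", "of"]  # hypothetical mini function-word list

dense = filter_fql(fql, wanted)  # keeps "of" with a count of 0
sparse = +dense                  # unary plus drops zero (and negative) counts

print(dense)   # Counter({'the': 2, 'on': 1, 'of': 0})
print(sparse)  # Counter({'the': 2, 'on': 1})
```

Looking up a missing key in a `Counter` returns 0 without adding the key, so `fql` itself is left unchanged by the filtering.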

Note you need to be careful that the tokenisation matches what is in the function word / stopword list.

You could also use NLTK's stopword list.

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stoplist = stopwords.words('english')
print(stoplist)

We can remove a list of words (e.g. a stopword list) from a frequency list by iterating over the words to remove and 'popping off' ([dict.pop(key,None)](https://docs.python.org/3/library/stdtypes.html?highlight=pop#dict.pop)) each word if present.

In [None]:
def remove_list(fql, to_remove):
    filtered = Counter(fql)
    for r in to_remove:
        filtered.pop(r,None)        
    return filtered

filtered = remove_list(tokens_fql, stoplist)
print(filtered)

<a name="wordn"></a>
### Word n-grams

To get some context for words, we can use sequences of words instead of single words; these are known as word n-grams. Bigrams (2-grams) and trigrams (3-grams) are popular. One issue with word n-grams is their sparsity. It's a good idea to reduce the size of the vocabulary as much as possible, e.g. digits and dates could be mapped to single tokens.

Whether a token appears at the start or end of the text (or of a sentence) can be useful, so we can introduce buffer markers at the start and end to indicate this.

Note also that n-grams should be created with a sliding window over the text, i.e. the first word bigram is the first and second word, the second bigram is the second and third word.

The method below is a generic method for turning a list of tokens into an n-gram list, adding the buffer characters either side, and moving a sliding window of size n across the text and providing a list of n-grams. Check your understanding of how this works.

In [None]:
def ngrams(tokens, n, sep = "_", buffer="^"):
    buffered = [buffer] * (n-1) + tokens + [buffer] * (n-1) #add buffer either side to denote start and end
    return [sep.join(buffered[i:i+n]) for i in range(len(buffered)-n+1)] #list comprehension joining each window of n items with sep, sliding the window one position at a time

In [None]:
word_bigrams = ngrams(lower_tokens,2)

In [None]:
word_bigrams

In [None]:
word_bigrams_fql = Counter(word_bigrams)
word_bigrams_fql.most_common()
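To make the buffering and the sliding window concrete, here is the same function applied to a four-token toy list (the list is made up for illustration). With `n-1` buffer markers on each side, a list of `len(tokens)` items yields `len(tokens) + n - 1` n-grams.

```python
def ngrams(tokens, n, sep="_", buffer="^"):
    buffered = [buffer] * (n - 1) + tokens + [buffer] * (n - 1)
    return [sep.join(buffered[i:i + n]) for i in range(len(buffered) - n + 1)]

toy = ["come", "have", "a", "go"]

bigrams = ngrams(toy, 2)   # buffered list is ['^', 'come', 'have', 'a', 'go', '^']
unigrams = ngrams(toy, 1)  # n=1 adds no buffer and returns the tokens unchanged

print(bigrams)   # ['^_come', 'come_have', 'have_a', 'a_go', 'go_^']
print(unigrams)  # ['come', 'have', 'a', 'go']
```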

**Quick task:** Produce a frequency list of word trigrams.

<a name="chars"></a>
## Characters

Just looking at characters as features is a simple (yet often powerful) way of processing text.

In [None]:
print(tweet)

In [None]:
print(p_tweet)

We probably don't want the artificial hashtag, mention, and url markers. We could keep these as is, replace them with single chars, or just remove them. Below we just remove them. We often have different pre-processing for different features.

In [None]:
def preprocess_remove(text):
    r_text = hashtag_re.sub("",text)
    r_text = mention_re.sub("",r_text)
    r_text = url_re.sub("",r_text)
    r_text = ftfy.fix_text(r_text)
    return r_text

In [None]:
r_tweet = preprocess_remove(tweet)
print(r_tweet)

Note, extra spaces are included now, how could you preprocess the text further to reduce multiple spaces to a single space?
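One possible approach, sketched under the assumption that all runs of whitespace (including tabs and newlines) should become a single space, is another regular expression substitution. `collapse_spaces` and the sample string are hypothetical names for illustration.

```python
import re

multi_space_re = re.compile(r"\s+")

def collapse_spaces(text):
    # Replace each run of whitespace with one space, then trim both ends.
    return multi_space_re.sub(" ", text).strip()

sample = "students about  and language careers!   "
cleaned = collapse_spaces(sample)
print(cleaned)  # students about and language careers!
```

If only literal spaces should be collapsed, a pattern such as `r" {2,}"` could be used instead of `r"\s+"`.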

In Python a string is just a sequence (list) of characters, so we can just iterate through the characters as below:

In [None]:
for char in r_tweet:
    print(char)

We can make this count the frequency of each character easily:

In [None]:
char_fql = Counter(r_tweet)
char_fql.most_common()

This appears to work well, **but this should be used with caution**:

In [None]:
test = "Remember the spicy jalapen\u0303o"
print(test)

In [None]:
for i, char in enumerate(test):
    print(i,char)

Notice the  ̃ separated from the n because it is a separate codepoint (combining). It is placed over the space.

This looks even worse if we view the characters as a list:

In [None]:
print(list(test))

See the  ̃ over the single quote mark. Nasty! 🤢

The combining codepoint combines with whatever the character before is, and in this case it's displayed as the quote mark.

We need to be careful how we define "character". In Python a 'character' is a single Unicode codepoint, whereas in reality we should be looking for "graphemes", i.e. displayed single characters (which may be a cluster of codepoints): https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
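To inspect which codepoints make up a string, the standard library's `unicodedata` module can help. The sketch below lists each codepoint of the example word with its Unicode name and combining class, and then shows that NFC normalisation happens to compose this particular pair into one codepoint. Normalisation is offered here only as an aside: it works where a precomposed codepoint exists, so it is not a general substitute for grapheme segmentation.

```python
import unicodedata

word = "jalapen\u0303o"  # 'n' followed by U+0303 COMBINING TILDE

for ch in word:
    # A non-zero combining class marks a combining codepoint.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))

composed = unicodedata.normalize("NFC", word)
print(len(word), len(composed))  # 9 8
```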

As you saw last week, we can use regular expressions to find these graphemes, but Python's default regular expression library (re), whilst being Unicode aware, does not deal with Unicode particularly well. The [regex library](https://pypi.org/project/regex/) has better support, providing the use of unicode categories: https://www.regular-expressions.info/unicode.html, including `\X` to match single graphemes.

In [None]:
import regex
char_regex = regex.compile(r'\X')

In [None]:
chars = char_regex.findall(test)
print(chars)

This nicely separates ñ as a single "character". We can put this into a frequency list:

In [None]:
char_fql = Counter(chars)
char_fql.most_common()

Even more "fun" can be had with emojis, which can contain numerous codepoints, particularly joined with zero-width-joiners: https://unicode.org/emoji/charts/emoji-zwj-sequences.html.

In [None]:
emoji_test = "This is one emoji: \U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

In [None]:
print(emoji_test)

In [None]:
test_matches = char_regex.findall(emoji_test)

In [None]:
for match in test_matches:
    print(match)

Another library, [grapheme](https://pypi.org/project/grapheme/), also provides functionality to deal with graphemes as characters.

In [None]:
!pip install grapheme

In [None]:
import grapheme

In [None]:
graphemes = list(grapheme.graphemes(emoji_test))

In [None]:
for g in graphemes:
    print(g)

In [None]:
char_fql = Counter(graphemes)
char_fql.most_common()

Note, when composite graphemes are printed in a list/tuple, they're expanded for some reason (if you know why, please tell me!). As can be seen, this is just a display issue:

In [None]:
for char in char_fql.most_common():
    print("{}\t{}".format(char[0], char[1]))
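A likely explanation for the expansion: when a list or tuple is printed, Python shows the `repr()` of each element rather than its `str()`, and `repr()` escapes any codepoint that `str.isprintable()` reports as non-printable. The zero-width joiner (U+200D) falls into that group, so the sequence is shown split apart by `\u200d` escapes. The short check below uses a two-person sequence made up for illustration.

```python
zwj = "\u200d"                        # ZERO WIDTH JOINER
pair = "\U0001F468\u200D\U0001F469"   # two emoji joined by a ZWJ

print(zwj.isprintable())  # False
print(repr(pair))         # the joiner appears as a \u200d escape
print(str([pair]) == "[" + repr(pair) + "]")  # True: lists display repr() of items
```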

<a name="charn"></a>
### Character n-grams

We can also look at sequences of characters, though be aware that these will overlap with words and other features (double counting).

You have everything you need to do this (remember the n-grams function is generic).

**Task:** Produce character trigrams for the tweet. You don't need a separator for chars, so the first trigram should be '^^T'

<a name="ann"></a>
## Annotation

As discussed in the lecture, various levels of annotation are available to add on top of the tokens. These are extra levels of information that can be used as features for various NLP tasks. Lemmatisation is one option available and straightforward to [implement with nltk](http://www.nltk.org/book/ch03.html#lemmatization).

Part-of-speech (POS) tags are probably the most used form of annotation, certainly for classification tasks. NLTK provides a POS tagger using the standard [Penn Treebank POS tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). Tokenised text can be POS tagged easily:

In [None]:
nltk.download('punkt')
nltk.download('maxent_treebank_pos_tagger')
nltk.download('averaged_perceptron_tagger') # check how uses penn and look at alternatives.

In [None]:
pos_tagged = nltk.pos_tag(tokens)
pos_tagged

What do you think of the accuracy of the POS tags on this small sample? You can see a description of each POS tag with the below. Note, we should POS tag the tokens without making them lowercase first, as POS taggers will use capital letters, e.g. for proper nouns.

In [None]:
nltk.download("tagsets")
nltk.help.upenn_tagset()

To create a POS frequency list is straightforward:

In [None]:
pos = [tag[1] for tag in pos_tagged]
pos_fql = Counter(pos)
pos_fql.most_common()

**Task:** Try to make improvements to the POS tagging by changing the preprocessing and tokenisation. As a minimum, try using NLTK's default tokeniser.

**Advanced task:** 

Developing a POS tagger that is capable of dealing well with the intricacies of user generated content (e.g. Twitter) text is difficult, although there have been attempts, e.g. http://www.cs.cmu.edu/~ark/TweetNLP/. One option is to post-process the POS tagged text to fix the main issues.

Define a function that takes the POS tagged text and post-processes the output to add new tags for mentions, hashtags, urls, emojis, and anything else you can see to fix with simple rules.

<a name="other"></a>
## Other features

Many other features can be calculated over the text, token stream, or other feature frequency lists. Some examples:

In [None]:
length_chars = len(tweet) #length of text in chars
length_tokens = len(tokens) #length of text in tokens
print(length_chars)
print(length_tokens)

Average word length:

In [None]:
avg_word_length = sum([len(tok) for tok in tokens])/length_tokens #make a list of lengths per token, sum and divide by number of tokens
print(avg_word_length)

Various vocabulary measures are available that represent how varied and large the vocabulary is of the text.

We need to know the number of **word types** present, this is the number of words, counting multiple instances (tokens) of the same word once. This is simply the size of the frequency list:

In [None]:
length_types = len(tokens_fql)

Type Token Ratio (TTR) is a popular vocabulary measure, simply dividing the number of types by the number of tokens.

In [None]:
ttr = length_types / length_tokens #type token ratio (ttr)
print(ttr)
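A toy illustration of how TTR depends on text length: repeating the same six made-up tokens doubles the token count but adds no new types, so the ratio halves. The helper name `type_token_ratio` is an assumption for this sketch.

```python
from collections import Counter

def type_token_ratio(tokens):
    fql = Counter(tokens)
    return len(fql) / len(tokens)  # types / tokens

short = "the cat sat on the mat".split()  # 5 types, 6 tokens
longer = short * 2                        # 5 types, 12 tokens

ttr_short = type_token_ratio(short)
ttr_longer = type_token_ratio(longer)
print(round(ttr_short, 3), round(ttr_longer, 3))  # 0.833 0.417
```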

TTR is not comparable over texts of very different lengths; instead, use something like Moving-Average Type-Token Ratio (MATTR): https://doi.org/10.1080/09296171003643098

**Advanced task**: Reading the above linked paper, implement MATTR.

Other vocabulary measures look at the number of hapaxes (word types which only appear once). Below, a simple hapax ratio is calculated.

In [None]:
hapaxes = list(tokens_fql.values()).count(1) #convert frequencies to list and count 1s.
hapax_ratio = hapaxes / length_types
print(hapax_ratio)

There are many other features that could be implemented. Readability metrics could be calculated, most of which require a count of syllables. Counting syllables is actually [quite an involved task](https://stackoverflow.com/questions/405161/detecting-syllables-in-a-word/4103234), especially for user generated content, and multi-lingual data. [Big Phoney](https://github.com/repp/big-phoney) is one option that seems promising (based on some limited testing). An **Advanced Task** would be to implement one or more of the readability measures (e.g. [*Flesch reading ease*](https://en.wikipedia.org/wiki/Flesch–Kincaid_readability_tests)).

Counting and splitting text into sentences is also needed for some features. This is quite simple to do with NLTK, as below. Be aware, though, that like other segmentation tasks, doing this accurately with user generated content is not straightforward.

In [None]:
from nltk.tokenize import sent_tokenize
sent_tokenize(p_tweet)

<a name="docs"></a>
## Documents

So far we have been utilising a single line of text (Tweet) to demonstrate feature extraction. However, we will often be dealing with larger texts consisting of multiple lines of text (e.g. paragraphs or sets of Tweets); we can call these documents. We normally do not want sequence features (e.g. n-grams) to go across line boundaries within a document. Hence we process and extract features per line of text.

To make things a little easier, we create a `Document` class which holds the features of a document (and any metadata provided). Features are calculated with the `extract_features` function, which takes in a list (iterable) of texts (which could be lines in a text, or individual tweets from a user). Currently, just tokens are counted (i.e. Bag of Words), and a single method (`get_ttr`) demonstrates how to return Document level features.

In [None]:
class Document:
    def __init__(self, meta=None):
        self.meta = meta if meta is not None else {} #avoid sharing one mutable default dict between Documents.
        self.tokens_fql = Counter() #empty counter, ready to be added to with Counter.update.
        
    def extract_features(self, texts): #document should be iterable text lines, e.g. read in from file.
        for text in texts:
            p_text = preprocess(text)
            tokens = custom_tokenise(p_text)
            lower_tokens = [t.lower() for t in tokens]
            self.tokens_fql.update(lower_tokens) #updating Counter counts items in list, adding to existing Counter items.
            
    def get_ttr(self): #type token ratio
        length_types = len(self.tokens_fql)
        length_tokens = sum(self.tokens_fql.values())
        return length_types / length_tokens

To utilise this, we simply create a Document, and add text to it. An example using the existing tweet we've been using is given below.

In [None]:
tweet_doc = Document()
tweet_doc.extract_features([tweet])
print(tweet_doc.tokens_fql)
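The accumulation inside `extract_features` relies on `Counter.update`, which adds counts to whatever is already stored. A minimal sketch with two made-up lines, using whitespace splitting in place of the lab's tokeniser:

```python
from collections import Counter

fql = Counter()
lines = ["Vote today", "vote early , vote often"]  # made-up text lines

for line in lines:
    fql.update(line.lower().split())  # counts are added line by line

print(fql["vote"], len(fql), sum(fql.values()))  # 3 5 7

# Pitfall: updating with a bare string counts its characters, not words.
chars = Counter()
chars.update("vote")
print(chars)  # Counter({'v': 1, 'o': 1, 't': 1, 'e': 1})
```

The same pitfall applies to `extract_features`: it expects an iterable of texts, which is why the single tweet is wrapped in a list above.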

<a name="mps"></a>
## MPs Dataset
In order to play with features, a collection of Tweets from MP accounts is provided in the `mps` folder. These are plain text files for each user, split into Labour and Conservative. These Tweets were collected a while back, so the list of MPs (some are no longer MPs, or have left Labour or Conservatives) and Tweets is not current. You could use what you've learnt from week 14 (data collection) to gather a list of MPs from https://www.politics-social.com/list/name, and download their latest tweets. More MP data also here: https://www.theyworkforyou.com/mps/.

The corpus can be read into Documents as follows.

In [None]:
from os import listdir
from os.path import isfile, join, splitext, split


def import_party_folder(party):
    folder = working_folder + "mps/" + party
    textfiles = [join(folder, f) for f in listdir(folder) if isfile(join(folder, f)) and f.endswith(".txt")]
    for tf in textfiles:
        username = splitext(split(tf)[1])[0] #extract just username from filename.
        print("Processing " + username)
        doc = Document({'username': username, 'party': party}) #include metadata
        with open(tf, encoding="utf-8") as f: #assumes the tweet files are UTF-8, as used when saving tokens above.
            tweets = f.readlines()
        doc.extract_features(tweets)
        yield doc

In [None]:
corpus = []
corpus.extend(import_party_folder("labour"))
corpus.extend(import_party_folder("conservative"))

We now have a **corpus** of MPs on Twitter we can use for further analysis.

In [None]:
for doc in corpus:
    print(doc.meta['username'], doc.meta['party'], sum(doc.tokens_fql.values()),sep=", ")

<a name="corpus"></a>
## Corpus analysis
We can compare corpora or sub-corpora to start to gain insights into language differences.

Our frequency lists (FQLs) are stored as [`Counters`](https://docs.python.org/3.7/library/collections.html#counter-objects), which can be merged easily by just adding them together.

In [None]:
def merge_fqls(fqls):
    merged = Counter()
    for fql in fqls:
        merged += fql
    return merged
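As a quick check of how `Counter` addition behaves, the sketch below merges two made-up frequency lists. It also shows `sum` with an empty `Counter` as the start value, which gives the same result as the loop above.

```python
from collections import Counter

a = Counter({"nhs": 3, "brexit": 1})
b = Counter({"brexit": 4, "jobs": 2})

merged = a + b                   # counts for shared terms are added together
total = sum([a, b], Counter())   # equivalent one-liner

print(merged)  # Counter({'brexit': 5, 'nhs': 3, 'jobs': 2})
print(total == merged)  # True
```

Addition returns a new `Counter` and leaves `a` and `b` unchanged. It also drops any entries whose resulting count is zero or negative, which matters if dense (zero-filled) lists are merged.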

Create two sub-corpora: one for Conservative MPs, another for Labour MPs.

In [None]:
con_fql = merge_fqls([doc.tokens_fql for doc in corpus if doc.meta['party']=="conservative"])
lab_fql = merge_fqls([doc.tokens_fql for doc in corpus if doc.meta['party']=="labour"])

In [None]:
con_size = sum(con_fql.values())
lab_size = sum(lab_fql.values())
print(con_size,lab_size)

We can start analysing the most frequent words:

In [None]:
print(lab_fql.most_common(20))
print(con_fql.most_common(20))

And even create a basic [word cloud](https://github.com/amueller/word_cloud).

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def create_wordcloud(words):
    wordcloud = WordCloud().generate_from_frequencies(words)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()

In [None]:
create_wordcloud(con_fql)

In [None]:
create_wordcloud(lab_fql)

Common words dominate. How could we remove these?

To normalise the frequencies, we can simply divide by the number of tokens, to gain relative frequencies.

In [None]:
def relative_freqs(fql):
    size = sum(fql.values())
    return {term: fql[term]/size for term in fql}

In [None]:
con_rel = relative_freqs(con_fql)
lab_rel = relative_freqs(lab_fql)

To do a "Key words" comparison between the sub-corpora, we can utilise [*Log Ratio*](http://cass.lancs.ac.uk/log-ratio-an-informal-introduction/), which is the binary log of the relative risk (ratio between relative frequencies). Other significance tests and effect size measures can be used: http://ucrel.lancs.ac.uk/llwizard.html

In [None]:
from math import log

#Calculates log ratio for terms in corpus1, compared to corpus2.
#we pass the corpus sizes for ease.
#If the term is not present in corpus2, we make the frequency 0.5.
def log_ratio(corpus1, corpus1_size, corpus2, corpus2_size, min_freq1=0, min_freq2=0):
    return {term: log((corpus1[term]/corpus1_size)/((corpus2[term] if corpus2[term] else 0.5)/corpus2_size),2) for term in corpus1 if corpus1[term] >= min_freq1 and corpus2[term] >= min_freq2}

The above method is a dict comprehension one-liner, which may be difficult to interpret. The below method does exactly the same as the above, but is split over multiple lines to ease readability and understanding. 

In [None]:
from math import log

#Calculates log ratio for terms in corpus1, compared to corpus2.
#we pass the corpus sizes for ease.
#If the term is not present in corpus2, we make the frequency 0.5.
def log_ratio(corpus1, corpus1_size, corpus2, corpus2_size, min_freq1=0, min_freq2=0):
    lrs = dict()
    for term in corpus1:
        if corpus1[term] >= min_freq1 and corpus2[term] >= min_freq2:
            rel_freq1 = corpus1[term]/corpus1_size
            if corpus2[term]:
                freq2 = corpus2[term]
            else:
                freq2 = 0.5
            rel_freq2 = freq2/corpus2_size
            lr = log(rel_freq1/rel_freq2, 2)
            lrs[term] = lr

    return lrs
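A worked example with made-up numbers may help with reading the scores. The helper `log_ratio_single` is hypothetical and handles one term at a time; the corpus sizes are chosen as powers of two purely so that the arithmetic comes out exact.

```python
from math import log2

def log_ratio_single(freq1, size1, freq2, size2):
    # Binary log of the ratio of relative frequencies; 0.5 stands in for a zero count.
    freq2 = freq2 if freq2 else 0.5
    return log2((freq1 / size1) / (freq2 / size2))

size1, size2 = 1024, 2048  # made-up corpus sizes in tokens

lr_over = log_ratio_single(8, size1, 4, size2)     # relative freq 4x higher in corpus 1
lr_absent = log_ratio_single(8, size1, 0, size2)   # absent from corpus 2, so 0.5 is used
lr_under = log_ratio_single(2, size1, 16, size2)   # relative freq 4x lower in corpus 1

print(lr_over, lr_absent, lr_under)  # 2.0 5.0 -2.0
```

Each additional point of log ratio corresponds to a doubling of the relative frequency in the first corpus compared with the second, and negative values indicate terms that are relatively more frequent in the second corpus.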
        
            

Calculate the terms from Conservative MPs with the biggest log ratio compared to terms from Labour MPs.

In [None]:
con_lr = log_ratio(con_fql, con_size, lab_fql, lab_size)

We can sort our list of terms by this log ratio:

In [None]:
sorted_terms = sorted(con_lr.items(), key=lambda x: x[1], reverse=True)
print(sorted_terms[:20])

and create a word cloud using the log ratios, instead of frequencies:

In [None]:
create_wordcloud(con_lr)

and the other way round:

In [None]:
lab_lr = log_ratio(lab_fql, lab_size, con_fql, con_size)
sorted_terms = sorted(lab_lr.items(), key=lambda x: x[1], reverse=True)
print(sorted_terms[:20])
create_wordcloud(lab_lr)

Some interesting terms appear, but with a small number of authors, a term used heavily by a single MP can become prominent, boosting its frequency in the sub-corpus. You can set a minimum frequency for a term in each corpus. The below sets a minimum frequency of 5 in each corpus, which also rules out words that only appear in one sub-corpus. You can change these minimum frequencies (e.g. 1,1). Notice you will get quite different words highlighted in the word cloud.

In [None]:
lab_lr = log_ratio(lab_fql, lab_size, con_fql, con_size,5,5)
sorted_terms = sorted(lab_lr.items(), key=lambda x: x[1], reverse=True)
print(sorted_terms[:20])
create_wordcloud(lab_lr)

<a name="tfidf"></a>
## TF-IDF

As discussed in the lecture, TF-IDF is a commonly used normalisation method which considers the term frequency along with how many documents in the corpus the term appears in.

In [None]:
#doc is a Counter representing an fql from a document.
def tf(term, doc):
    return doc[term] / sum(doc.values()) #term freq / total terms (relative term freq)

def num_containing(term, corpus):
    return sum(1 for doc in corpus if term in doc) #counts docs in corpus containing term.

#1 is added to the numerator and denominator to prevent division by zero; this is equivalent to adding an extra document that contains every term once.
def idf(term, corpus):
    n_t = num_containing(term,corpus)
    return log((len(corpus)+1) / ((n_t) + 1))
    
def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)
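To see how these definitions behave, here is a small sketch on a made-up corpus of three frequency lists. The functions are repeated in compact form so the block runs on its own; the natural log is used, as in the cell above.

```python
from collections import Counter
from math import log

def tf(term, doc):
    return doc[term] / sum(doc.values())

def idf(term, corpus):
    n_t = sum(1 for d in corpus if term in d)
    return log((len(corpus) + 1) / (n_t + 1))

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

docs = [  # made-up frequency lists for three "documents"
    Counter({"nhs": 2, "the": 2}),
    Counter({"tax": 1, "the": 3}),
    Counter({"the": 4}),
]

score_nhs = tfidf("nhs", docs[0], docs)  # tf = 2/4, idf = ln(4/2)
score_the = tfidf("the", docs[0], docs)  # tf = 2/4, idf = ln(4/4) = 0

print(round(score_nhs, 4), score_the)  # 0.3466 0.0
```

A term that occurs in every document scores 0 regardless of how frequent it is. Note also that `term in d` tests for the presence of a key, so a dense frequency list containing zero counts would be treated as containing the term.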

We can calculate the TF-IDF for every term for every MP in the corpus. By listing the terms with the highest TF-IDF for an MP, we can find terms that the MP uses frequently but that are used by no other MPs, or by only a small number of them.

In [None]:
corpus_fqls = [doc.tokens_fql for doc in corpus]
for doc in corpus:
    print(doc.meta['username'], doc.meta['party'])
    scores = {term: tfidf(term,doc.tokens_fql,corpus_fqls) for term in doc.tokens_fql}
    sorted_terms = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for term, score in sorted_terms[:5]:
        print("\tToken: {}, TF-IDF: {}".format(term, round(score, 5)))

<a name="ex"></a>
## Exercise
Use the MPs data provided to conduct some further feature extraction and analysis. Write code to answer the following questions:

1. Which MP in the dataset has the highest average token length?
2. What are the 5 **key** part-of-speech tags overused by Labour MPs compared to Conservative MPs?
3. What hashtags does Jeremy Corbyn use frequently, which aren't used widely by the rest of the Labour party MPs in the dataset (TF-IDF)?
4. **Advanced:** What are the **key** adjectives overused by Boris Johnson compared to other MPs?
5. **Advanced:** If you want to go further, devise your own research question, either using the MP data provided, collecting a new MP dataset, or on different data.

You may need to pre-process and tokenise the text differently. Re-use the code above, including adapting the Document class, adding/editing preprocessing, tokenisation, and feature extraction.

If you prefer, you can create a new notebook for the exercise work. The methods and imports above are provided in a Python file too: `features.py`.