# BLU09 - Information Extraction

In [1]:
import os
import re
import spacy
import hashlib
import numpy as np
import pandas as pd
import json

from tqdm import tqdm
from collections import Counter
from spacy.matcher import Matcher
from sklearn.metrics import accuracy_score
from nltk.tokenize import WordPunctTokenizer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer

import utils

# os.cpu_count() may return None; fall back to 4 in that case
cpu_count = os.cpu_count() or 4

In this exercise notebook you are going to tackle a very real problem: **Detecting fake news!** You'll create a classification workflow to determine if a piece of news is considered 'reliable' or 'unreliable'. You will start by building some basic features, then extract information from the text, go on to build more features, and finally put it all together.

The data set we will be using is the [Fake News data set](https://www.kaggle.com/c/fake-news/overview) from Kaggle. Each piece of news is labeled either '0' (reliable) or '1' (unreliable and possibly fake). First, let's load the data and see what we are dealing with.

In [2]:
data_path = "data/fakenews/train.csv"
df = pd.read_csv(data_path, index_col=0)
df["title"] = df["title"].astype(str)
df["text"] = df["text"].astype(str)

df = df[:5000]

df.head()

id,title,author,text,label
0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \r\nAn Iranian woman has been sentenced ...,1


We have 4 columns that are pretty self-explanatory. Let's drop the author column, since we only want to practice our text analysis, and drop the title as well for simplicity's sake.

In [3]:
df = df.drop(columns=["author", "title"])

Let's also load SpaCy's module with the [merged entities](https://spacy.io/api/pipeline-functions#merge_entities) (which will come in handy later) and stopwords. We insert the merged entities module into the SpaCy pipeline after the NER module.

In [4]:
nlp = spacy.load('en_core_web_md')
nlp.add_pipe("merge_entities", after="ner")
en_stopwords = nlp.Defaults.stop_words
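To get a feeling for what `merge_entities` does, here is a minimal sketch on a blank English pipeline. It assumes that a hand-labeled entity is a fair stand-in for what the NER module would predict; `nlp_demo` and `demo_doc` are illustrative names and not part of the notebook.

```python
import spacy
from spacy.pipeline import merge_entities
from spacy.tokens import Span

# Blank English pipeline: tokenizer only, no trained model required.
nlp_demo = spacy.blank("en")
demo_doc = nlp_demo("James Comey sent a letter to Congress")

# Hand-label one entity, standing in for a prediction of the NER component.
demo_doc.ents = [Span(demo_doc, 0, 2, label="PERSON")]

tokens_before = [token.text for token in demo_doc]
demo_doc = merge_entities(demo_doc)  # retokenizes each entity into one token
tokens_after = [token.text for token in demo_doc]

print(tokens_before)
print(tokens_after)
```

After merging, the two-token name is a single token, which is why later token-level patterns see whole entities instead of their parts.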



Here we process the news data with SpaCy to use later on. This might take a while depending on your hardware (a break to walk the dog? 🐶).

In [5]:
# Keep one core free, but never ask for fewer than one process
docs = list(tqdm(nlp.pipe(df["text"], batch_size=20, n_process=max(1, cpu_count - 1)), total=len(df["text"])))
docs[:3]

100%|██████████| 5000/5000 [08:55<00:00,  9.33it/s]  


[House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It By Darrell Lucus on October 30, 2016 Subscribe Jason Chaffetz on the stump in American Fork, Utah ( image courtesy Michael Jolley, available under a Creative Commons-BY license) 
 With apologies to Keith Olbermann, there is no doubt who the Worst Person in The World is this week–FBI Director James Comey. But according to a House Democratic aide, it looks like we also know who the second-worst person is as well. It turns out that when Comey sent his now-infamous letter announcing that the FBI was looking into emails that may be related to Hillary Clinton’s email server, the ranking Democrats on the relevant committees didn’t hear about it from Comey. They found out via a tweet from one of the Republican committee chairmen. 
 As we now know, Comey notified the Republican chairmen and Democratic ranking members of the House Intelligence, Judiciary, and Oversight committees that his agency was reviewing emai

Overall, the text looks good: few errors and well written, as expected from a news article. Fake news is a tough and relatively recent problem that keeps showing up more frequently in the wild. Unlike spam, it usually contains few orthographic mistakes or slang, because its sources want to appear credible while staying clickbaity enough to profit from ad revenue and sow distrust.

## Exercise 1 - Pipeline

Let's create a baseline classification workflow. We'll use the TfidfVectorizer to get a simple, fast and trustworthy baseline.

Create a function that applies a pipeline to the given train data, makes a prediction for the test data, and returns the accuracy of the prediction. The pipeline should consist of a `TfidfVectorizer` and a `RandomForestClassifier`.

In [6]:
def tfidf_rf_pipeline(X_train, X_test, y_train, y_test, seed=42):
    """
    Trains a TfidfVectorizer + RandomForestClassifier pipeline on the given train data.
    Makes a prediction on the test data.
    Returns the trained pipeline and the accuracy of the prediction.

    Parameters:
        X_train, y_train: train data, pd.Series
        X_test, y_test: test data, pd.Series
        seed (int): random state seed for the classifier
    
    Returns:
        pipe: fitted pipeline
        acc (float): accuracy of the prediction
    """
    
    pipe = Pipeline([('tfidf', TfidfVectorizer()),
                     ('classifier', RandomForestClassifier(random_state=seed))])
    
    pipe.fit(X_train, y_train)
    
    y_pred = pipe.predict(X_test)
    
    acc = accuracy_score(y_test, y_pred)
    
    return pipe, acc

For the baseline, we will preprocess the text (remove punctuation and stopwords, then tokenize it) and run it through the pipeline:

In [7]:
df_processed = df.copy()
df_processed["text"] = df_processed["text"].apply(utils.remove_punctuation)
df_processed["text"] = df_processed["text"].apply(utils.remove_stopwords, stopwords = en_stopwords, 
                                                  tokenizer = WordPunctTokenizer())

X_train, X_test, y_train, y_test = train_test_split(df_processed["text"], df_processed["label"], 
                                                    test_size=0.2, random_state=42, stratify=df_processed["label"])
baseline_model, baseline_acc = tfidf_rf_pipeline(X_train, X_test, y_train, y_test)

assert isinstance(baseline_model, Pipeline)
assert hashlib.sha256(json.dumps(str(baseline_model[0])).encode()).hexdigest() == \
'e68c8e581c16f0d62f3b9cb33a7967b17890e18c1fe819d013181e6714e7a303', "The pipeline parameters are not correct."
assert hashlib.sha256(json.dumps(str(baseline_model[1])).encode()).hexdigest() == \
'36a4f3295ffa4c170fc0addee2a8cac5613970f06e3dde6956fa31daf19aa329', "The pipeline parameters are not correct."
np.testing.assert_almost_equal(baseline_acc, 0.908, decimal=2, err_msg="The accuracy is not correct.")
print(f'Baseline accuracy: {baseline_acc}')

Baseline accuracy: 0.908


Wow, the accuracy is quite good for such a simple text model! This shows how far a trustworthy baseline can take you. I can't stress enough how important it is to have a simple first iteration; afterwards we can add complexity and study which features make sense and which don't.

Sometimes data scientists jump right to the most complex solutions when a simple one would be enough. Real-life problems will usually yield lower scores, since the data sets are not controlled or cleaned for you, but that should not stop you from starting with a simpler and easier solution.

Now let's see if we can engineer more features. We will extract information with SpaCy and see if we can use it to train the model.

## Exercise 2 - SpaCy Matcher

Let's see if we can extract some useful features with the SpaCy Matcher.

### Exercise 2.1 - Simple matcher

You think of some words that could be related to the detection of fake news. Something starts ringing in your mind about "propaganda", "USA" and "fraud", so you decide to use the SpaCy Matcher to check how many times those words appear in the news articles.

Use the `docs` list preprocessed by SpaCy and count the number of occurrences of these words in all documents. Make sure to match the words regardless of case. The output should be the sum of occurrences over all news articles.
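As a small illustration of case-insensitive matching, the sketch below runs a `Matcher` with `LOWER` patterns over one made-up sentence. It uses a blank English pipeline (tokenizer only), so it does not reproduce the counts on the real data; `nlp_demo`, `matcher_demo` and the sentence are illustrative assumptions.

```python
import spacy
from spacy.matcher import Matcher

nlp_demo = spacy.blank("en")  # tokenizer only, no trained model needed
matcher_demo = Matcher(nlp_demo.vocab)

# LOWER compares against the lowercased token text, so "Fraud" and "fraud"
# both match the same pattern.
matcher_demo.add("KEYWORD", [[{"LOWER": "fraud"}], [{"LOWER": "usa"}]])

demo_doc = nlp_demo("Fraud claims spread across the USA, but no fraud was found.")
n_matches = len(matcher_demo(demo_doc))
print(n_matches)
```

Each match is a `(match_id, start, end)` tuple, so counting matches per document and summing over documents gives the total.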

In [8]:
words = ["propaganda", "usa", "fraud"]

# YOUR CODE HERE
# One single-token pattern per word; LOWER makes the match case-insensitive
matcher = Matcher(nlp.vocab)
for word in words:
    matcher.add(word, [[{"LOWER": word}]])

# Sum the number of matches over all documents
count = sum(len(matcher(doc)) for doc in docs)

count


723

In [9]:
assert hashlib.sha256(json.dumps(str(count)).encode()).hexdigest() == \
'9d44059c29e077b9fd8496ebcc41c94aeb203bf1adce7729d3ecda30bc885a90', 'Not correct, try again.'
print(f'Count: {count}')

AssertionError: Not correct, try again.

### Exercise 2.2 - POS-tagging search

Ok, this doesn't look like the way to go, so let's look at other theories. You start thinking that fake news might overuse adjectives and adverbs in over-the-top descriptions. So you decide to create a feature that counts the number of _Adjectives_ and _Adverbs_ in a news article. The count should be normalized to the token count of the article.

The result should be a list of adjective and adverb counts for each document normalized to the token count of the document.
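To make the normalization concrete, here is a minimal sketch on a hand-tagged sentence. The POS tags are written by hand as an assumption (no tagger is run), and `demo_doc` and `ratio` are illustrative names.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-tagged toy sentence; in the notebook the tags come from the SpaCy model.
words_demo = ["The", "shocking", "truth", "was", "really", "quite", "simple"]
pos_demo = ["DET", "ADJ", "NOUN", "AUX", "ADV", "ADV", "ADJ"]
demo_doc = Doc(Vocab(), words=words_demo, pos=pos_demo)

# 4 adjectives/adverbs out of 7 tokens
n_adj_adv = sum(1 for token in demo_doc if token.pos_ in {"ADJ", "ADV"})
ratio = n_adj_adv / len(demo_doc)
print(ratio)
```

Dividing by the token count keeps long articles from getting a higher value simply because they contain more words.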

In [29]:
nb_adj_adv = []


TARGET_POS = {"ADJ", "ADV"}

for doc in docs:
    # Count adjectives and adverbs
    pos_count = sum(1 for token in doc if token.pos_ in TARGET_POS)
    # Normalize by token count (avoid division by zero)
    token_count = len(doc)
    normalized_count = pos_count / token_count if token_count > 0 else 0
    nb_adj_adv.append(normalized_count)

# Sum over all documents, the value checked in the next cell
np.sum(nb_adj_adv)


462.70903854782966

In [13]:
assert isinstance(nb_adj_adv,list), "The result should be a list."
assert len(nb_adj_adv) == len(docs), "The length of the result list is wrong. You should have a count for every news article."
np.testing.assert_almost_equal(np.var(nb_adj_adv), 0.00105, decimal=4, err_msg='The result is not correct.')
np.testing.assert_almost_equal(np.sum(nb_adj_adv), 462.5, decimal=1, err_msg='The result is not correct.')

AssertionError: 
Arrays are not almost equal to 1 decimals The result is not correct.
 ACTUAL: 462.70903854782966
 DESIRED: 462.5

Let's add this feature to our dataframe:

In [30]:
df_processed["nb_adj_adv"] = nb_adj_adv

### Exercise 2.3 - Adjectivized proper nouns

Another theory that might be worth testing is that adjectives with proper nouns are often used in this kind of news to induce sentiments towards people or organizations. So you decide to extract proper nouns preceded by adjectives, possibly to use in a later analysis.

Create a `Matcher` to search for adjective + proper noun combinations. Count the number of occurrences of each combination. Store the 10 most common combinations and the number of their occurrences as tuples in a list, sorted in descending order by the number of occurrences.

In [None]:
# most_common_adj_propn = []

# YOUR CODE HERE
matcher = Matcher(nlp.vocab)

pattern = [{"POS": "ADJ"}, {"POS": "PROPN"}]

matcher.add("AdjectiveProperNoun", [pattern])

combinations_counter = Counter()

for doc in docs:
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        combinations_counter[span.text] += 1

most_common_adj_propn = combinations_counter.most_common(10)

In [32]:
assert isinstance(most_common_adj_propn,list), "The output should be a list."
assert len(most_common_adj_propn) == 10, "It should be the top 10!"
assert isinstance(most_common_adj_propn[0],tuple), 'The elements of the list should be tuples of (combination, occurrences).'
assert hashlib.sha256(json.dumps(most_common_adj_propn).encode()).hexdigest() == \
'0b12899bfedce520180f460bfd6742c1241ac7270ee98d4dcb482284e134cde8', 'The top ten list is not correct.'

Let's look at the 10 most common combinations:

In [33]:
most_common_adj_propn

[('former President', 99),
 ('eastern Aleppo', 76),
 ('many Americans', 72),
 ('most Americans', 49),
 ('northern Syria', 49),
 ('east Aleppo', 45),
 ('Russian President', 40),
 ('former Secretary', 38),
 ('Islamic State', 32),
 ('congressional Republicans', 24)]

The counts are too low to use these terms as features. Maybe running a vectorizer on all the results could work better.

### Exercise 2.4 - Objects of preposition
The objects in the sentences could indicate something. For instance, 'NGO financed by Soros' is more likely to appear in fake news than 'NGO financed by UNESCO'. Both objects in these sentences are objects of preposition (hint: SpaCy has a dependency label for this).

Create a `Matcher` to search for objects of preposition which are nouns. Again, count the number of occurrences of each. Store the 10 most common objects and their occurrences as tuples in a list, sorted in descending order by the number of occurrences.
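The `Matcher` can combine several token attributes in one pattern, including the dependency label. The sketch below shows this on a toy phrase whose POS tags, heads and dependency labels are written by hand (an assumption, since no parser is run); all `_demo` names are illustrative.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-annotated toy phrase: "Groups funded by donors in Europe"
words_demo = ["Groups", "funded", "by", "donors", "in", "Europe"]
pos_demo = ["NOUN", "VERB", "ADP", "NOUN", "ADP", "PROPN"]
heads_demo = [0, 0, 1, 2, 3, 4]  # absolute index of each token's head
deps_demo = ["ROOT", "acl", "prep", "pobj", "prep", "pobj"]
demo_doc = Doc(Vocab(), words=words_demo, pos=pos_demo,
               heads=heads_demo, deps=deps_demo)

# One token that is both a noun and an object of a preposition
matcher_demo = Matcher(demo_doc.vocab)
matcher_demo.add("NOUN_POBJ", [[{"POS": "NOUN", "DEP": "pobj"}]])

found = [demo_doc[start:end].text for _, start, end in matcher_demo(demo_doc)]
print(found)
```

"Europe" is also an object of a preposition, but it is tagged as a proper noun, so only "donors" satisfies both conditions.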

In [None]:
# most_common_pobj = []

# YOUR CODE HERE
matcher = Matcher(nlp.vocab)

# Match single tokens that are nouns and objects of a preposition ('pobj')
pattern = [{"POS": "NOUN", "DEP": "pobj"}]
matcher.add("NOUN_POBJ", [pattern])

# Count the text of every match over all documents
pobj_counter = Counter()
for doc in docs:
    for match_id, start, end in matcher(doc):
        pobj_counter[doc[start:end].text] += 1

# Get the 10 most common objects of preposition
most_common_pobj = pobj_counter.most_common(10)




In [None]:
assert isinstance(most_common_pobj,list), "The output should be a list."
assert len(most_common_pobj) == 10, "It should be the top 10!"
assert isinstance(most_common_pobj[0],tuple), 'The elements of the list should be tuples of (combination, occurrences).'
assert  hashlib.sha256(json.dumps(most_common_pobj).encode()).hexdigest() == \
'8b947c095d53dc4ccc4f28d1a448e60ef2f0c509eeac370cb0448ba9418d25c3', 'The top ten list is not correct.'

AssertionError: The top ten list is not correct.

This time the counts are higher and might be more interesting for a feature.

In [53]:
most_common_pobj

[('up comments', 6),
 ('over two years', 3),
 ('up letter', 3),
 ('out donors', 2),
 ('off lands', 2),
 ('in refrigerator', 2),
 ('up Genic.ai', 2),
 ('for congress', 2),
 ('Defense for Acquisition, Technology and Logistics Frank Kendall', 2),
 ('ON SILVERDOCTORS', 2)]

### Exercise 2.5 - Verbs with direct objects
As the last point, you decide to look at verbs with direct objects. These should indicate actions taken towards something or someone. This exercise can be solved without a Matcher.

Search for verbs with direct objects which are not pronouns. This time it's a bit trickier: you need to look at the [parse tree](https://spacy.io/usage/linguistic-features#navigating) because the object does not necessarily come right after the verb. Lemmatize both the verb and the object and count the occurrences of the lemmatized verb and direct object separated by a space, like this: 'verb_lemma dobj_lemma'. Don't forget to exclude objects that are pronouns.

Again, output the 10 most common combinations and their occurrences as tuples in a list, sorted in descending order by the number of occurrences.
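To see why the parse tree is needed, consider a verb whose object is separated from it by other words. The sketch below walks the children of each verb on a hand-annotated sentence; the lemmas, tags and dependencies are written by hand as an assumption, and all `_demo` names are illustrative.

```python
from collections import Counter

from spacy.tokens import Doc
from spacy.vocab import Vocab

# "Officials made a difficult decision and told him"
words_demo = ["Officials", "made", "a", "difficult", "decision", "and", "told", "him"]
lemmas_demo = ["official", "make", "a", "difficult", "decision", "and", "tell", "he"]
pos_demo = ["NOUN", "VERB", "DET", "ADJ", "NOUN", "CCONJ", "VERB", "PRON"]
heads_demo = [1, 1, 4, 4, 1, 1, 1, 6]  # absolute index of each token's head
deps_demo = ["nsubj", "ROOT", "det", "amod", "dobj", "cc", "conj", "dobj"]
demo_doc = Doc(Vocab(), words=words_demo, lemmas=lemmas_demo, pos=pos_demo,
               heads=heads_demo, deps=deps_demo)

pairs_demo = Counter()
for token in demo_doc:
    if token.pos_ != "VERB":
        continue
    for child in token.children:
        # Keep direct objects, skip pronouns such as "him"
        if child.dep_ == "dobj" and child.pos_ != "PRON":
            pairs_demo[f"{token.lemma_} {child.lemma_}"] += 1

print(pairs_demo.most_common())
```

"decision" is two tokens away from "made", yet it is found as a child of the verb, while the pronoun object of "told" is dropped.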

In [65]:
# most_common_dobj = []

# YOUR CODE HERE
# Counter for verb-direct object pairs
verb_object_counter = Counter()

for doc in docs:
    for token in doc:
        # Look for verbs with direct objects (dobj)
        if token.pos_ == "VERB":
            for child in token.children:
                # Ensure the child is a direct object (dobj) and not a pronoun
                if child.dep_ == "dobj" and child.pos_ != "PRON":
                    # Lemmatize verb and object, join with a space
                    combination = f"{token.lemma_} {child.lemma_}"
                    verb_object_counter[combination] += 1

# Get the 10 most common combinations
most_common_dobj = verb_object_counter.most_common(10)



In [66]:
assert isinstance(most_common_dobj,list), "The output should be a list."
assert len(most_common_dobj) == 10, "It should be the top 10!"
assert isinstance(most_common_dobj[0],tuple), 'The elements of the list should be tuples of (combination, occurrences).'
assert hashlib.sha256(json.dumps(most_common_dobj).encode()).hexdigest() == \
'4aa91fbf85175e56f4a132fb253c40869c6f839fc2cfd4cee821ac1422217f29' or \
hashlib.sha256(json.dumps(most_common_dobj).encode()).hexdigest() == \
'7135e8ae4d25d6cf68db7f5083830236e38dae2843b24b6435342d7ded486e45', 'The top ten list is not correct.'

Not so many occurrences, but again the whole list could be used in a vectorizer:

In [67]:
most_common_dobj

[('take place', 377),
 ('do thing', 203),
 ('play role', 197),
 ('tell reporter', 177),
 ('kill people', 163),
 ('win election', 155),
 ('have right', 154),
 ('make decision', 150),
 ('make sense', 142),
 ('take action', 140)]

## Exercise 3 - Feature unions

We're going to create a few more numerical features here, then use them in a feature union pipeline and see if the baseline improves.

### Exercise 3.1 - More features

There are a few more simple features that we can extract from the data set to try to enrich our model. Let's add the following features to the `df_processed` dataframe:
- number of words in the news article
- character length of the news article
- average word length
- average sentence length.

Use the SpaCy processed `Doc`s for calculating the average sentence length (note that you will obtain sentence length in tokens).

Use the tokenized text in `df_processed` for everything else. Punctuation and stopwords were already removed from this text.
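The word-level features can be checked by hand on a tiny example. The sketch below uses a made-up two-row `Series` (`toy_text`), including an empty string to show the guard against division by zero; it is an illustration and not the notebook's data.

```python
import pandas as pd

toy_text = pd.Series(["fake news spreads fast", ""])
toy_tokens = toy_text.str.split()  # whitespace-separated words per row

toy_stats = pd.DataFrame({
    "doc_length": toy_text.str.len(),  # characters, spaces included
    "nb_words": toy_tokens.str.len(),  # number of words
})
# Mean word length; an empty text gets 0.0 instead of a division by zero
toy_stats["avg_word_length"] = [
    sum(len(word) for word in tokens) / len(tokens) if tokens else 0.0
    for tokens in toy_tokens
]
print(toy_stats)
```

For the first row this gives 22 characters, 4 words and an average word length of 19 / 4 = 4.75.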

In [95]:
# df_processed["nb_words"] = ...
# df_processed["doc_length"] = ...
# df_processed["avg_word_length"] = ...
# df_processed["avg_sentence_length"] = ...

# YOUR CODE HERE
# Word-level features come from the preprocessed text (whitespace-separated words)
text_tokens = df_processed["text"].str.split()
df_processed["doc_length"] = df_processed["text"].str.len()
df_processed["nb_words"] = text_tokens.str.len()
df_processed["avg_word_length"] = text_tokens.apply(
    lambda tokens: sum(len(word) for word in tokens) / len(tokens) if tokens else 0
)

# Average sentence length in tokens, taken from the SpaCy docs
df_processed["avg_sentence_length"] = [
    np.mean([len(sentence) for sentence in doc.sents]) for doc in docs
]


df_processed.head()

id,text,label,doc_length,nb_words,avg_word_length,avg_sentence_length
0,house dem aide didnt comeys letter jason chaff...,1,3155,413,6.641646,28.0625
1,feeling life circles roundabout heads straight...,0,2588,345,6.504348,25.806452
2,truth fired october 29 2016 tension intelligen...,1,4854,620,6.830645,25.690909
3,videos 15 civilians killed single airstrike id...,1,2060,274,6.521898,22.481481
4,print iranian woman sentenced years prison ira...,1,639,83,6.710843,33.8


In [96]:
assert df_processed.shape == (5000, 7), "Something wrong about the shape, do you have all columns/rows?"
assert "nb_words" in df_processed, "Missing column! Maybe wrong name?"
assert "doc_length" in df_processed, "Missing column! Maybe wrong name?"
assert "avg_word_length" in df_processed, "Missing column! Maybe wrong name?"
assert "avg_sentence_length" in df_processed, "Missing column! Maybe wrong name?"

assert np.sum(df_processed["nb_words"]) == 1963935, "Something is wrong with the nb_words column."
assert np.sum(df_processed["doc_length"]) == 14636737, "Something is wrong with the doc_length column."
np.testing.assert_almost_equal(np.sum(df_processed["avg_word_length"]), 32100.0, decimal=1, 
                               err_msg='Something is wrong with the avg_word_length column.')
np.testing.assert_almost_equal(np.sum(df_processed["avg_sentence_length"]), 118628.9, 
                               decimal=1, err_msg='Something is wrong with the avg_sentence_length column.')

AssertionError: Something wrong about the shape, do you have all columns/rows?

### Exercise 3.2 - Define a feature union for preprocessing

Let's create a processing pipeline for every feature in `df_processed` and join them all in a feature union. Each pipeline starts with a selector for its column. After the selector, the pipeline for the textual feature should have one step, a `TfidfVectorizer` with default parameters, and each pipeline for a numerical feature should have one step, a `StandardScaler`. Afterwards, join the features' pipelines in a feature union.

Use the `Selector` classes in the cell below.
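Before building the full union, it can help to see how a `FeatureUnion` stacks the outputs of its branches side by side. The sketch below uses a toy dataframe and a hypothetical `ColumnSelector` class as a stand-in for the notebook's selectors; names and data are assumptions for illustration only.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical selector: returns one column as a Series or a DataFrame."""

    def __init__(self, key, as_frame=False):
        self.key = key
        self.as_frame = as_frame

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Scalers expect 2D input (DataFrame), vectorizers expect 1D text (Series)
        return X[[self.key]] if self.as_frame else X[self.key]


toy_df = pd.DataFrame({
    "text": ["fake news spreads fast", "officials confirm report", "fake report"],
    "nb_words": [4, 3, 2],
})

toy_union = FeatureUnion([
    ("text", Pipeline([("selector", ColumnSelector("text")),
                       ("tfidf", TfidfVectorizer())])),
    ("nb_words", Pipeline([("selector", ColumnSelector("nb_words", as_frame=True)),
                           ("scaler", StandardScaler())])),
])

toy_features = toy_union.fit_transform(toy_df)
print(toy_features.shape)
```

The toy texts contain 7 distinct words, so the result has 7 TF-IDF columns followed by 1 scaled numeric column. The branches are concatenated in the order in which they are listed.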

In [76]:
class Selector(TransformerMixin, BaseEstimator):
    """
    Transformer to select a column from a dataframe 
    on which to perform additional transformations.
    """ 
    def __init__(self, key):
        self.key = key
        
    def fit(self, X, y=None):
        return self
    

class TextSelector(Selector):
    """
    Transformer to select a single text column from the dataframe
    on which to perform additional transformations.
    """
    def transform(self, X):
        return X[self.key]
    
    
class NumberSelector(Selector):
    """
    Transformer to select a single numerical column from the dataframe
    on which to perform additional transformations.
    """
    def transform(self, X):
        return X[[self.key]]

In [77]:
# text_pipe = ...
# nb_adj_adv_pipe = ...
# nb_words_pipe = ...
# doc_length_pipe = ...
# avg_word_length_pipe = ...
# avg_sentence_length_pipe = ...
# feats = ...

# YOUR CODE HERE
text_pipe = Pipeline([
                ('selector', TextSelector("text")),
                ('tfidf', TfidfVectorizer())
            ])

nb_adj_adv_pipe =  Pipeline([
                ('selector', NumberSelector("nb_adj_adv")),
                ('standard', StandardScaler())
            ])

nb_words_pipe =  Pipeline([
                ('selector', NumberSelector("nb_words")),
                ('standard', StandardScaler())
            ])

doc_length_pipe =  Pipeline([
                ('selector', NumberSelector("doc_length")),
                ('standard', StandardScaler())
            ])

avg_word_length_pipe =  Pipeline([
                ('selector', NumberSelector("avg_word_length")),
                ('standard', StandardScaler())
            ])

avg_sentence_length_pipe = Pipeline([
                ('selector', NumberSelector("avg_sentence_length")),
                ('standard', StandardScaler())
            ])
feats = FeatureUnion([('text', text_pipe), 
                      ('nb_adj_adv', nb_adj_adv_pipe),
                      ('nb_words', nb_words_pipe),
                      ('doc_length', doc_length_pipe),
                      ('avg_word_length', avg_word_length_pipe),
                      ('avg_sentence_length', avg_sentence_length_pipe)
                     ])

In [78]:
assert isinstance(feats, FeatureUnion)
assert len(feats.transformer_list) == 6, "Did you create a pipeline for each feature?"
for pipe in feats.transformer_list:
    
    selector = pipe[1][0]
    if not (isinstance(selector, TextSelector) or isinstance(selector, NumberSelector)):
        raise AssertionError("The first step of the pipeline is not correct.")
        
    feature_builder = pipe[1][1]
    if not (isinstance(feature_builder, TfidfVectorizer) or isinstance(feature_builder, StandardScaler)):
        raise AssertionError("The second step of the pipeline is not correct.")    

### Exercise 3.3 - Fit the feature union
Define a function with a pipeline that applies the preprocessing steps from the previous exercise and fits a classifier to the provided data. The pipeline should have two steps: the feature union from the previous exercise and a `RandomForestClassifier`.
The function should fit the pipeline to the train data, make a prediction on the test data and calculate its accuracy.

In [79]:
def improved_pipeline(feats, X_train, X_test, y_train, y_test, seed=42):
    """
    Creates a pipeline with the provided feature union and a Random Forest classifier.
    Fits the pipeline to the train data and makes a prediction with the test data.
    Outputs the fitted pipeline and the accuracy of the prediction.

    Parameters:
        feats: feature union
        X_train, y_train: train data
        X_test, y_test: test data
        seed (int): seed for random state in the classifier

    Returns:
        pipe: fitted pipeline
        acc (float): accuracy of the prediction for the test data
    """
    
    # YOUR CODE HERE
    pipe = Pipeline([
        ('features', feats),
        ('classifier', RandomForestClassifier(random_state=seed))
    ])
    pipe.fit(X_train, y_train)

    preds = pipe.predict(X_test)
    acc = np.mean(preds == y_test)
    
    return pipe, acc

In [None]:
Y = df_processed["label"]
X = df_processed.drop(columns="label")

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)
pipeline_model, pipeline_acc = improved_pipeline(feats, X_train, X_test, y_train, y_test)

assert isinstance(pipeline_model, Pipeline)
assert isinstance(pipeline_model[0],FeatureUnion), "The first step of the pipeline is not correct."
assert isinstance(pipeline_model[1],RandomForestClassifier),  "The second step of the pipeline is not correct."
np.testing.assert_almost_equal(pipeline_acc, 0.908, decimal=3, err_msg="The accuracy score is not correct.")

AssertionError: 
Arrays are not almost equal to 3 decimals The accuracy score is not correct.
 ACTUAL: 0.904
 DESIRED: 0.913

With this more complex approach we have achieved basically the same performance as our baseline. This can mean several things: our features might have no real relevance to the model (which you can check with feature importances), or we have reached a plateau and can't improve the score with this technique.
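A minimal sketch of such a feature importance check, on synthetic data where one feature determines the label and the other is pure noise. The data and the names `X_toy`, `y_toy` and `forest` are assumptions for illustration, not the notebook's model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n_samples = 500
informative = rng.normal(size=n_samples)  # decides the label
noise = rng.normal(size=n_samples)        # unrelated to the label

X_toy = np.column_stack([informative, noise])
y_toy = (informative > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_toy, y_toy)

# Impurity-based importances, one value per input column, summing to 1
for name, importance in zip(["informative", "noise"], forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

For the pipeline in this notebook, the same attribute is available on the fitted classifier step. Since a feature union concatenates its branches in order, the last five importances would correspond to the five numerical features and all earlier ones to the TF-IDF terms.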

Nevertheless, it is a good score for this problem and data set. Regardless of the score, you have learnt a lot about SpaCy and feature unions, and also that the sky is the limit when creating features. Anything can be a feature; good features are a different matter and might need more research and validation.