# DIGI405 Text Analysis Project Notebook

[0.2.2 - 2025-09-08](https://github.com/polsci/DIGI405-assignments/blob/main/CHANGELOG.md) - quality of life improvements, probability labels    
[0.2.1 - 2025-08-19](https://github.com/polsci/DIGI405-assignments/blob/main/CHANGELOG.md) - ensure filtering / grid search handled as expected   
Note: Search notebook for 0.2.1/0.2.2 to find changes if you want to apply to your existing notebook.

## Introduction

You should use this notebook as a starting point for your DIGI405 project. It provides code to select your dataset and run a complete text classification pipeline with [textplumber](https://geoffford.nz/textplumber/), a package that provides an easy-to-use interface to methods covered in this course.

**Name:** *Jiajun li*  
**Student ID:** *38339315*  
**Project option:** *Essay*  
**Project submission date:** *10.16*  

Please also add your name to your notebook filename (where it says 'NAME').

### Notebook structure

Sections 1-4 provide code you should modify or extend. In your report, you can refer to code sections by their section number, e.g. 2.1.

## 1. Setup

You must select the Python 3.12 kernel to run the code in this notebook. 

In [1]:
from datasets import load_dataset, ClassLabel, DatasetDict
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

from sklearn.feature_selection import SelectKBest, mutual_info_classif, chi2
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import FeatureUnion
from sklearn.metrics import confusion_matrix, classification_report

from textplumber.core import *
from textplumber.clean import *
from textplumber.preprocess import *
from textplumber.tokens import *
from textplumber.pos import *
from textplumber.embeddings import *
from textplumber.report import *
from textplumber.store import *
from textplumber.lexicons import *
from textplumber.textstats import *

from imblearn.under_sampling import RandomUnderSampler 

import warnings

# in the interests of readability, ignoring this warning
warnings.filterwarnings("ignore", message="Your stop_words may be inconsistent with your preprocessing")

These settings control the display of Pandas dataframes in the notebook.

In [2]:
pd.set_option('display.max_columns', None) # show all columns
pd.set_option('display.max_colwidth', 500) # increase this to see more text in the dataframe

Get word lists: 
* The stop word list is from NLTK.   
* Any of the word lists (including the stop word list) can be used to extract lexicon count features, i.e. features based on counts of the words in a given list.

In [3]:
stop_words = get_stop_words()
stop_words_lexicon = {'stop_words': stop_words}
empath_lexicons = get_empath_lexicons()
vader_lexicons = get_sentiment_lexicons()
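
To make the idea of a lexicon count feature concrete, here is a minimal plain-Python sketch. It is not how textplumber implements its lexicon features; the function `lexicon_counts`, the simple regex tokenizer and the toy word lists are all assumptions made for illustration.

```python
import re
from collections import Counter

def lexicon_counts(text, lexicons):
    """Count how many tokens of `text` belong to each named word list."""
    # Very rough tokenizer (assumption): lowercase runs of letters and apostrophes
    tokens = re.findall(r"[a-z']+", text.lower())
    token_counts = Counter(tokens)
    return {
        name: sum(token_counts[word] for word in set(words))
        for name, words in lexicons.items()
    }

# Hypothetical toy lexicons standing in for the word lists loaded above
toy_lexicons = {
    "stop_words": ["the", "a", "of", "and"],
    "positive": ["good", "great"],
}

counts = lexicon_counts(
    "The plot of the film was good, and the cast was great.", toy_lexicons
)
print(counts)  # {'stop_words': 5, 'positive': 2}
```

Each lexicon becomes one numeric feature per text, so a dictionary of several word lists yields several feature columns.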

## 2. Load and inspect data

### 2.1 Choose a dataset and preview the labels

Below you can select a dataset for the assignment. The options are `sentiment`, `essay` and `genre`. Change the value of `dataset_option` below. Datasets hosted on Huggingface.co are downloaded automatically, and a link to the dataset card with more information is printed. The `genre` dataset was distributed with this notebook.   

Note: The `movie_reviews` dataset is used to demonstrate the notebook and is not one of your options for the assignment.  

In [4]:
# Choose 'essay', 'sentiment', or 'genre' ('movie_reviews' is just for testing/demonstration)
dataset_option = 'essay' 

if dataset_option == 'movie_reviews':
    dataset_name = 'polsci/sentiment-polarity-dataset-v2.0'
    dataset_dir = None
    target_labels = ['neg', 'pos']
    text_column = 'text'
    label_column = 'label'
    train_split_name = 'train'
    test_split_name = 'train'
    print('The movie_reviews dataset is only for demonstrating the notebook and is not an assignment option.')
elif dataset_option == 'sentiment':
    dataset_name = 'cardiffnlp/tweet_eval'
    dataset_dir = 'sentiment'
    target_labels = ['negative', 'neutral', 'positive']
    text_column = 'text'
    label_column = 'label'
    train_split_name = 'train'
    test_split_name = 'validation'
    print('You selected the sentiment dataset. Read more about this at https://huggingface.co/datasets/cardiffnlp/tweet_eval')
elif dataset_option == 'essay':
    dataset_name = 'polsci/ghostbuster-essay-cleaned'
    dataset_dir = None
    target_labels = ['claude', 'gpt', 'human']
    text_column = 'text'
    label_column = 'label'
    train_split_name = 'train'
    test_split_name = 'test'
    print('You selected the essay dataset. Read more about this at https://huggingface.co/datasets/polsci/ghostbuster-essay-cleaned')
elif dataset_option == 'genre':
    dataset_name = 'genre'
    dataset_type = 'json'
    # Note: Quality of life improvement for version 0.2.2
    dataset_dir = '/srv/source-data/genre_dataset.json' # if you are running this locally change to the path on your machine
    target_labels = ['Fiction', 'Letter', 'Notice', 'Obituary', 'Poetry or verse', 'Recipe', 'Review']
    text_column = 'text'
    label_column = 'label'
    train_split_name = 'train'
    test_split_name = 'test'
    print('You selected the genre dataset.')
else:
    print('Try again! That was not an option!')

You selected the essay dataset. Read more about this at https://huggingface.co/datasets/polsci/ghostbuster-essay-cleaned


#### Important notes about specific datasets:

* Make sure you go to the relevant Huggingface page to read more about the [essay](https://huggingface.co/datasets/polsci/ghostbuster-essay-cleaned) and [sentiment](https://huggingface.co/datasets/cardiffnlp/tweet_eval/viewer/sentiment) datasets. Note the sentiment dataset is one subset of the larger 'tweet_eval' dataset.  
* For the *sentiment* dataset, it is challenging to get good accuracy with three classes. If you like you can remove the `neutral` class. There is a cell below that does this for you - don't change the cell above.
* For the *essay* dataset, there are differences in punctuation between classes. You should use `character_replacements = {"’": "'", '“': '"', '”': '"',}` in the `TextCleaner` component in your pipeline to make sure you are not overfitting to a quirk of the data.
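
As a rough illustration of what the recommended `character_replacements` mapping does, the sketch below applies the same mapping with plain Python string translation. It only shows the effect on a sample string; it is not `TextCleaner`'s implementation, and the helper name `normalise_quotes` is made up for this example.

```python
# The same mapping the notebook recommends passing to TextCleaner
character_replacements = {"’": "'", "“": '"', "”": '"'}

def normalise_quotes(text, replacements=character_replacements):
    """Replace typographic quotes with their plain ASCII equivalents."""
    # str.maketrans accepts a dict whose keys are single characters
    table = str.maketrans(replacements)
    return text.translate(table)

sample = "“It’s fine,” she said."
cleaned = normalise_quotes(sample)
print(cleaned)  # "It's fine," she said.
```

After this step a curly apostrophe and a straight apostrophe are the same character, so a model cannot separate classes on that difference alone.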

This loads the dataset. 

In [5]:
if dataset_option != 'genre': # if loading from huggingface ...
    dataset = load_dataset(dataset_name, data_dir=dataset_dir)
else: # if loading the genre dataset from the provided json file
    dataset = load_dataset(dataset_type, data_files=dataset_dir)
    train_dataset = dataset['train'].filter(lambda example: example['split'] == 'train')
    test_dataset = dataset['train'].filter(lambda example: example['split'] == 'test')
    dataset = DatasetDict({
        'train': train_dataset,
        'test': test_dataset
        })

This cell will show you information on the dataset fields and the splits.

In [6]:
preview_dataset(dataset)

Here is the breakdown of the composition of labels in each data-set split.

In [7]:
# casting label column to ClassLabel if not already
cast_column_to_label(dataset, label_column)
label_names = get_label_names(dataset, label_column)

dfs = {}
for split in dataset.keys():
    dfs[split] = dataset[split].to_pandas()
    dfs[split].insert(1, 'label_name', dfs[split][label_column].apply(lambda x: dataset[split].features[label_column].int2str(x)))
    print('Labels for {}:'.format(split))
    preview_label_counts(dfs[split], label_column, label_names)

Column 'label' is already a ClassLabel.
Labels for train:


label,label_name,count
0,claude,694
1,gpt,694
2,gpt_prompt1,694
3,gpt_prompt2,694
4,gpt_semantic,694
5,gpt_writing,694
6,human,694


Labels for test:


label,label_name,count
0,claude,300
1,gpt,300
2,gpt_prompt1,300
3,gpt_prompt2,300
4,gpt_semantic,300
5,gpt_writing,300
6,human,300


### 2.2 Configure the labels (optional)

* You can override the default labels for the data-set here to make the task more or less challenging. High accuracy does not guarantee a high grade. 
* See the assignment instructions and the dataset card or corresponding paper for explanations of the data.  
* Read the comments below, uncomment the relevant lines for your data-set, and amend the label names if needed.
* Remember, this is optional.

In [8]:
# for the movie reviews dataset (this is just for testing/demonstration) - there are 2 labels and that is it!

# for the sentiment dataset - there are 3 labels - you can make the task simpler as a binary classification problem using one of these options:
#target_labels = ['negative', 'neutral']
#target_labels = ['negative', 'positive']
#target_labels = ['neutral', 'positive']

# for the essay dataset - there are 7 labels - you can make the task simpler as a binary classification problem using one of these options:
#target_labels = ['claude', 'gpt']
# target_labels = ['human', 'gpt'] 
#target_labels = ['human', 'claude']

# for the genre dataset - there are 7 labels - you can turn the task into one or more binary classification problems using options such as:
#target_labels = ['Letter', 'Notice']
#target_labels = ['Letter', 'Fiction']
#target_labels = ['Review', 'Fiction']
#target_labels = ['Notice', 'Obituary']

print(target_labels)

['claude', 'gpt', 'human']
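
The sketch below shows, on toy arrays, how a list of label names can be turned into class ids and used to filter the data, which is the same pattern the next section applies with `label_names.index` and `np.isin`. The label names are the ones listed in the output above; the toy `y` and `texts` arrays are invented for illustration.

```python
import numpy as np

# Label names as shown in the label counts above (position = class id)
label_names = ["claude", "gpt", "gpt_prompt1", "gpt_prompt2",
               "gpt_semantic", "gpt_writing", "human"]
target_labels = ["human", "gpt"]  # one of the optional binary setups

# Map each chosen name to its integer class id
target_classes = [label_names.index(name) for name in target_labels]

# Toy labels and texts (assumption: stand-ins for the real dataset columns)
y = np.array([0, 1, 2, 6, 6, 3, 1])
texts = np.array(["a", "b", "c", "d", "e", "f", "g"])

# Keep only the rows whose label is one of the target classes
mask = np.isin(y, target_classes)
kept_texts = texts[mask]
kept_labels = y[mask]

print(target_classes)        # [6, 1]
print(kept_labels.tolist())  # [1, 6, 6, 1]
```

Note that the class ids keep their original values after filtering, which is why `human` still appears as class 6 in the split previews.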


### 2.3 Prepare the train and test splits

* This cell handles the train-test split for you.
* Some of the data-sets are unbalanced. This cell will balance the training data using under-sampling.

In [9]:
target_classes = [label_names.index(name) for name in target_labels]
target_names = [label_names[i] for i in target_classes]

if train_split_name == test_split_name:
    X = dataset[train_split_name].to_pandas()
    X.insert(1, 'label_name', dfs[train_split_name][label_column].apply(lambda x: dataset[train_split_name].features[label_column].int2str(x)))
    y = np.array(dataset[train_split_name][label_column])

    mask = np.isin(y, target_classes)
    X = X.loc[mask]
    y = y[mask]

    # creating df splits with original data first  - so can look at the train data if needed
    dfs['train'], dfs['test'], y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

    # we're just using the text for features
    X_train = np.array(dfs['train'][text_column])
    X_test = np.array(dfs['test'][text_column])
else:
    X_train = np.array(dataset[train_split_name][text_column])
    y_train = np.array(dataset[train_split_name][label_column])
    X_test = np.array(dataset[test_split_name][text_column])
    y_test = np.array(dataset[test_split_name][label_column])

    mask = np.isin(y_train, target_classes)
    mask_test = np.isin(y_test, target_classes)

    X_train = X_train[mask]
    y_train = y_train[mask]
    X_test = X_test[mask_test]
    y_test = y_test[mask_test]

# this cell undersamples all but the minority class to balance the training data
X_train = X_train.reshape(-1, 1)
X_train, y_train = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)
X_train = X_train.reshape(-1)

preview_splits(X_train, y_train, X_test, y_test, target_classes = target_classes, target_names = target_names)

Train: 2082 samples, 3 classes


label,label_name,count
0,claude,694
1,gpt,694
6,human,694


Test: 900 samples, 3 classes


label,label_name,count
0,claude,300
1,gpt,300
6,human,300
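
To show what random under-sampling does, here is a small standard-library sketch on an invented, unbalanced toy dataset. It mirrors the idea behind `RandomUnderSampler` (shrink every class to the size of the smallest one) but is not imbalanced-learn's implementation; the function name and toy data are assumptions.

```python
import random
from collections import Counter, defaultdict

def random_undersample(texts, labels, seed=0):
    """Randomly keep only as many texts per class as the smallest class has."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    n_min = min(len(items) for items in by_label.values())

    out_texts, out_labels = [], []
    for label in sorted(by_label):
        # sample without replacement so no text is duplicated
        for text in rng.sample(by_label[label], n_min):
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels

# Toy data (assumption): class 0 has 6 texts, class 1 has 3, class 2 has 2
texts = [f"doc {i}" for i in range(11)]
labels = [0] * 6 + [1] * 3 + [2] * 2

balanced_texts, balanced_labels = random_undersample(texts, labels)
balanced_counts = dict(Counter(balanced_labels))
print(balanced_counts)  # {0: 2, 1: 2, 2: 2}
```

For the essay run shown above, each of the three target classes already has 694 training texts, so the balancing step leaves the counts unchanged.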


### 2.4 Preview the texts

Time to get to know your data. We will only preview the train split.

In [10]:
y_train_names = [label_names[i] for i in np.unique(y_train)] # names of the labels present in the training data
# Note: Version 0.2.1 corrects display of the dataframe to ensure filtering by the selected labels 
display(dfs['train'][dfs['train']['label_name'].isin(y_train_names)].sample(10))

Unnamed: 0,text,label_name,label,ID,filename,prompt
1524,"Management and Goal-Setting: Behavioral Challenges and Strategies\r\nIntroduction:\r\nSales teams are the driving force behind the success of any organization's revenue generation. As a supervisor, managing a sales team can be both rewarding and challenging. Supervisor's encounter various behavioral challenges while managing their sales teams, which can hinder the achievement of goals. This essay aims to discuss the behavioral challenges faced by supervisors in managing their sales team in t...",gpt,1,1219,295.txt,"management, and goal-setting.\r\nDiscuss the behavioral challenges faced by the supervisor in managing their sales team in the workplace and propose strategies from a behavioral perspective to address these challenges."
2242,"Changes in the global management paradigm have affected one of the most important components of management – the leadership process. The technological and information revolution has led to the fact that most businesses have the ability to scale their activities to a global scale. This gave the leaders, on the one hand, the opportunity to enter global markets. On the other hand, it led to the need to design business changes in such a way that the company was ready to win against an unlimited ...",human,6,6323,389.txt,"""How has the technological and information revolution impacted the leadership process, particularly in terms of global scalability and cultural management skills? Discuss the importance of transformational leadership in achieving global leadership and the role of cultural differences in shaping leadership behavior and actions."""
3290,"International Joint Ventures (IJVs) are strategic alliances formed between two or more partner firms that operate in different countries to pursue international business opportunities. There are several reasons why companies opt to establish IJVs. First, IJVs allow companies to share the costs and risks of developing new international markets or products. Launching new ventures in foreign markets can be very risky and expensive due to high costs of research, marketing, distribution, and othe...",claude,0,474,524.txt,Evaluate the role of mass media in shaping public opinion and personal worldview on racial appearance. Discuss the potential impact of media's limited presentation and biased portrayal on the formation of opinions and ideas about other races.
2780,"The aim of the experiment investigating photosynthetic control, P/O ratio, and the effect of uncoupling agents using isolated chloroplasts and an oxygen electrode was to better understand the relationship between light absorption, electron transport, and oxygen evolution in the photosynthetic process. Specifically, the experiment sought to determine the proportion of light energy used directly for oxygen production versus wasted as heat (as quantified by the P/O ratio) and how this could be ...",claude,0,401,459.txt,How can the government implement preventive measures to combat overpopulation in cities and its associated environmental and social problems?
2447,"What is a Mediterranean diet?\r\nThe Mediterranean diet is perhaps the most popular gastronomy trend of the current decade. The trend has been popular for a long time and continues to gain popularity among food and culinary enthusiasts. Thanks to the COVID-19 pandemic, the Mediterranean diet has gained wider recognition among people from different countries. The trend adopts its name from the traditional cuisine of the four Mediterranean countries: Morocco, Greece, Italy, and Spain (Papadaki...",human,6,6352,414.txt,"Write an essay discussing the popularity of the Mediterranean diet and its simplicity in preparing and availability of ingredients. Describe a proposed two-course meal that incorporates the principles of the Mediterranean diet and explain the importance of using locally available ingredients. Provide a recipe for the Cream of Mushroom Soup appetizer, including the necessary ingredients and preparation instructions."
4580,"It should be noted that there is a multitude of motivational factors to consider for a manager from both personal and organizational standpoints. In the case of the former, these include personal investment, personal growth, recognition, and achievement. On the organizational level, equity plays a critical role in impacting motivation since if there is a perception of inequity, then the disadvantaged workers will lose motivation. Dan Pink claims that there are also critically important and e...",human,6,6658,690.txt,"""In your opinion, what are the most effective intrinsic and extrinsic motivational factors for managers to consider? How would you implement these factors in a start-up company with limited resources? Provide specific examples."""
3974,"Introduction:\r\nCar theft continues to be a significant issue in both the United Kingdom and North America, causing financial loss, emotional distress, and safety concerns for car owners. This essay aims to explore the different methods employed by car thieves to obtain keys and steal vehicles in these regions, analyze the reasons behind their choice of techniques, and discuss the implications for car owners.\r\nBody:\r\n1. Traditional Methods:\r\n a. Hotwiring: A classic method, commonly...",gpt,1,1571,611.txt,"""What are the different methods used by car thieves to obtain keys and steal vehicles in the United Kingdom and North America? Discuss the reasons behind their choice of methods and the implications for car owners."""
602,"Employee involvement (DEI) programs have become increasingly popular in organizations over recent decades. DEI aims to give employees more voice, influence and responsibility over their work. This can lead to a range of benefits for the organization, such as improved productivity, motivation, and retention. However, for DEI programs to be effective, they require significant investments in communication and teamworking, as they can represent major cultural changes that need to be carefully im...",claude,0,87,176.txt,"Discuss the influence of culture on leadership behaviors, providing examples and evidence from different cultural clusters."
4595,"Historical Overview\r\nMultiple religions characterize Brazil as the Church emerged due to the Portuguese conquests. It is believed that the first religion to enter Brazil with the Portuguese was Catholicism. The starting point is the Catholic Mass of 1500, verified by some of the first colonizers. Subject to the strong influence of colonization and the rigidity of their forces, Brazil struggled to cope with the religious onslaught. Missionary and educational activities helped strengthen Cat...",human,6,6660,692.txt,Explain the historical influence of Catholicism in Brazil and its impact on the current political and social landscape.
3132,"Compare and Contrast IKEA's Performance Objectives with Traditional Competitors \r\n\r\nIKEA is a well-known Swedish furniture company that has developed a very distinct business model focused on providing affordable, functional furniture to budget-conscious consumers. IKEA's performance objectives around quality, speed, dependability, and flexibility differ substantially from those of traditional furniture competitors that offer higher-end, customizable pieces at premium price points. By an...",claude,0,451,503.txt,"Prompt:\r\n\r\nDiscuss the need for a comprehensive palliative care policy in Intermountain Healthcare (IH) facilities. Analyze the current policy's shortcomings in elaborating on specific nursing responsibilities and providing detailed instructions for care teams. Consider the implications of lacking a comprehensive policy on the quality and efficiency of palliative care services. Additionally, draw upon literature and research to support your argument for the development of a detailed pall..."


Enter the index (the number in the first column) as `selected_index` to see the row. The `limit` value controls how much of the text you see. Set a higher limit to see more of the text or set it to 0 to see all of the text.

In [11]:
# We can display the full text of a selected article by dataframe index
selected_index = 10

preview_row_text(dfs['train'], selected_index, text_column = text_column, limit=400) # change limit to see more of the text if needed

Attribute,Value
label_name,gpt_prompt2
label,3
ID,3002
filename,10.txt
prompt,"Analyze the study conducted in Malawi on the impact of financial assistance on reducing the spread of HIV among young girls. In your response, consider the primary and secondary sources used, the experimental design, and the findings of the study. Discuss the implications of the findings and the limitations of the research."


text:
Introduction:
The study conducted in Malawi aimed to analyze the impact of financial
assistance on reducing the spread of HIV among young girls. This essay will
evaluate the primary and secondary sources used, discuss the experimental
design, analyze the findings of the study, and explore the implications and
limitations of the research.
Analysis of Sources:
The study relied on a combination of...


## 3. Create a classification pipeline and train a model

Create a scikit-learn pipeline to preprocess the texts and train a classification model. You will add pipeline components as you work through the notebook. There are a number of pipeline components you can access through the `textplumber` package. You will have an opportunity to learn about this in labs, but documentation is [available here](https://geoffford.nz/textplumber).

To speed up preprocessing, some of the pipeline components store the preprocessed data in a cache to avoid recomputing it. Run this as is - it will create an SQLite file with the name of your dataset option in the directory of the notebook. This will speed up some repeated processing (e.g. tokenization with spaCy).

In [12]:
feature_store = TextFeatureStore(f'assignment-{dataset_option}.sqlite')
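
The general idea of caching preprocessed text in SQLite can be sketched with the standard library. This is only an illustration of the caching pattern, not how `TextFeatureStore` works internally; the class `TinyFeatureCache`, its table layout and the `slow_tokenise` stand-in are all hypothetical.

```python
import hashlib
import sqlite3

class TinyFeatureCache:
    """Hypothetical minimal cache: maps a hash of the text to a stored result."""

    def __init__(self, path=":memory:"):
        # ":memory:" keeps the example self-contained; a file path would persist
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)"
        )
        self.misses = 0

    def get_or_compute(self, text, compute):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        row = self.conn.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: skip the expensive step
        self.misses += 1
        value = compute(text)
        self.conn.execute(
            "INSERT INTO cache (key, value) VALUES (?, ?)", (key, value)
        )
        self.conn.commit()
        return value

def slow_tokenise(text):
    # Stand-in for an expensive step such as spaCy tokenization
    return " ".join(text.lower().split())

cache = TinyFeatureCache()
first = cache.get_or_compute("Hello  World", slow_tokenise)
second = cache.get_or_compute("Hello  World", slow_tokenise)
print(first, cache.misses)  # hello world 1
```

The second call returns the stored value without recomputing it, which is why repeated runs of the pipeline on the same texts become faster.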

The pipeline below includes a number of different components. Most are commented out on the first run of the notebook. There are lots of options for each component. You will need to look at the documentation and examples in labs to learn about these. These components can extract different kinds of features, any of which can be applied to build a model. The feature types include:

* Token features  
* Bigram features  
* Parts of speech features
* Lexicon-based features  
* Document-level statistics  
* Text embeddings


In [19]:
pipeline = Pipeline([
    # ('cleaner', TextCleaner(strip_whitespace=True)),
	('cleaner', TextCleaner(strip_whitespace=True, character_replacements = {"’": "'", '“': '"', '”': '"',})), # for the essay dataset you should use character_replacements = {"’": "'", '“': '"', '”': '"',}
	('spacy', SpacyPreprocessor(feature_store=feature_store)),
	('features', FeatureUnion([
		('tokens', # token features - these can be single tokens or ngrams of tokens using TokensVectorizer - see textplumber documentation for examples
			Pipeline([
				('spacy_token_vectorizer', TokensVectorizer(feature_store = feature_store, vectorizer_type='count', max_features=100, lowercase = True, remove_punctuation = True, stop_words = stop_words, min_df=0.0, max_df=1.0, ngram_range=(1, 1))),
				('selector', SelectKBest(score_func=mutual_info_classif, k=100)), # feature selection - comment out this line to disable it
				('scaler', StandardScaler(with_mean=False)),
				], verbose = True)),

		('pos', # pos features - these can be a single label or ngrams of pos tags using POSVectorizer - see textplumber documentation for examples
			Pipeline([
				('spacy_pos_vectorizer', POSVectorizer(feature_store=feature_store)),
				#('selector', SelectKBest(score_func=mutual_info_classif, k=5)),
				('scaler', StandardScaler(with_mean=False)),
				], verbose = True)),

		('textstats', # document-level text statistics using TextstatsTransformer - see textplumber documentation for examples
			Pipeline([
				('textstats_vectorizer', TextstatsTransformer(feature_store=feature_store)),
				('scaler', StandardScaler(with_mean=False)),
				], verbose = True)),

		# ('lexicon', # lexicon features - defined above are empath_lexicons, vader_lexicons and stop_words_lexicon - see textplumber documentation for examples
		# 	Pipeline([
		# 		('lexicon_vectorizer', LexiconCountVectorizer(feature_store=feature_store, lexicons=empath_lexicons)), # the notebook has already provided example lexicons right at the top!
		#  		#('selector', SelectKBest(score_func=mutual_info_classif, k=5)),
		# 		('scaler', StandardScaler(with_mean=False)),
		# 		], verbose = True)),

		('embeddings', Model2VecEmbedder(feature_store=feature_store)), # extract embeddings using Model2Vec - see textplumber documentation for examples

		], verbose = True)),
	
	('classifier', LogisticRegression(max_iter=5000, random_state=42)) # for logistic regression - only select one classifier!
    #('classifier', DecisionTreeClassifier(max_depth = 3, random_state=42)) # for decision tree - only select one classifier!
], verbose = True) # using verbose because I like to see what is going on

display(pipeline)
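
The nesting used above (a `Pipeline` containing a `FeatureUnion` of smaller pipelines) can be tried out on toy data with scikit-learn alone. In this sketch `CountVectorizer` stands in for the textplumber vectorizers and `chi2` is swapped in for `mutual_info_classif` because the toy dataset is tiny; the texts, labels and step names are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (assumption): two small topical classes
toy_texts = [
    "the cat sat on the mat", "a cat and a dog", "dogs chase cats daily",
    "stocks fell sharply today", "the market rallied again",
    "investors sold their shares",
]
toy_labels = [0, 0, 0, 1, 1, 1]

toy_pipeline = Pipeline([
    ("features", FeatureUnion([
        ("tokens", Pipeline([  # word counts, then keep the 5 best-scoring
            ("vectorizer", CountVectorizer(ngram_range=(1, 1))),
            ("selector", SelectKBest(score_func=chi2, k=5)),
            ("scaler", StandardScaler(with_mean=False)),
        ])),
        ("char_ngrams", Pipeline([  # a second, independent feature block
            ("vectorizer", CountVectorizer(analyzer="char", ngram_range=(2, 3))),
            ("scaler", StandardScaler(with_mean=False)),
        ])),
    ])),
    ("classifier", LogisticRegression(max_iter=1000, random_state=42)),
])

toy_pipeline.fit(toy_texts, toy_labels)
toy_predictions = toy_pipeline.predict(toy_texts)

# Nested parameters are addressed with double underscores
param_names = list(toy_pipeline.get_params().keys())
print("features__tokens__selector__k" in param_names)  # True
```

Each branch of the `FeatureUnion` produces its own block of columns, and the blocks are concatenated side by side before reaching the classifier. The double-underscore names are what a grid search uses to address a parameter inside a nested step.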


In [20]:
pipeline.fit(X_train, y_train)

[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.6s
[Pipeline] ............. (step 2 of 4) Processing spacy, total=   0.3s
[Pipeline]  (step 1 of 3) Processing spacy_token_vectorizer, total=   1.8s
[Pipeline] .......... (step 2 of 3) Processing selector, total=   0.1s
[Pipeline] ............ (step 3 of 3) Processing scaler, total=   0.0s
[FeatureUnion] ........ (step 1 of 4) Processing tokens, total=   2.0s




[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.7s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.7s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.2s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.9s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   7.6s


Run the predictions and output model metrics and a confusion matrix using this cell.

In [21]:
y_predicted = pipeline.predict(X_test)
print(classification_report(y_test, y_predicted, target_names = target_names, digits=3, zero_division=0))
plot_confusion_matrix(y_test, y_predicted, target_classes, target_names)

              precision    recall  f1-score   support

      claude      0.923     0.843     0.882       300
         gpt      0.884     0.963     0.922       300
       human      0.936     0.933     0.935       300

    accuracy                          0.913       900
   macro avg      0.915     0.913     0.913       900
weighted avg      0.915     0.913     0.913       900
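
To help read the report above, this sketch computes precision, recall and F1 per class by hand on a small invented example, then averages F1 across classes (the macro average). It uses the standard definitions and treats a zero denominator as 0, in the spirit of `zero_division=0`; the function name and toy labels are assumptions.

```python
def per_class_metrics(y_true, y_pred, classes):
    """Precision, recall and F1 for each class, from true/predicted labels."""
    metrics = {}
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # correctly predicted c
        fp = sum(1 for t, p in pairs if t != c and p == c)  # predicted c, was not c
        fn = sum(1 for t, p in pairs if t == c and p != c)  # was c, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[c] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

# Toy labels (assumption): one mistake in each direction
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

metrics = per_class_metrics(y_true, y_pred, classes=[0, 1])
macro_f1 = sum(m["f1"] for m in metrics.values()) / len(metrics)
print(round(macro_f1, 3))  # 0.667
```

Precision answers "of the texts predicted as this class, how many were right?", while recall answers "of the texts that truly belong to this class, how many were found?".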



The cell below is commented out, but you have the option to uncomment it to run a grid search based on the pipeline you've created above.

In [22]:
# # Note: Version 0.2.1 commented grid search out by default as intended

# # Note: if you get a warning about tokenizers and parallelism - uncomment these two lines
# # import os
# # os.environ["TOKENIZERS_PARALLELISM"] = "false"

# # setup gridsearch to test different max_features
# from sklearn.model_selection import GridSearchCV
# param_grid = {
#     'features__tokens__spacy_token_vectorizer__max_features': [50, 100, 150, 200, 250, 300],  # this assumes you are using the tokens part of the pipeline
#     'features__tokens__selector__k': [50, 100, 150, 200, 250, 300],  # this assumes you have enabled the selector for tokens
# }
# grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring='f1_macro', verbose=100, n_jobs=1)
# grid_search.fit(X_train, y_train)

# print('\n-----------------------------------------------------------------')
# print("Best parameters found: ", grid_search.best_params_)
# print("Best score found: ", grid_search.best_score_)
# print('-----------------------------------------------------------------\n')

# y_pred = grid_search.predict(X_test)

# print(classification_report(y_test, y_pred, target_names = target_names, digits=3))
# plot_confusion_matrix(y_test, y_pred, target_classes, target_names)

Fitting 3 folds for each of 36 candidates, totalling 108 fits
[CV 1/3; 1/36] START features__tokens__selector__k=50, features__tokens__spacy_token_vectorizer__max_features=50
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............. (step 2 of 4) Processing spacy, total=   0.2s
[Pipeline]  (step 1 of 3) Processing spacy_token_vectorizer, total=   1.2s
[Pipeline] .......... (step 2 of 3) Processing selector, total=   0.1s
[Pipeline] ............ (step 3 of 3) Processing scaler, total=   0.0s
[FeatureUnion] ........ (step 1 of 4) Processing tokens, total=   1.3s
[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[... output truncated ...]
[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   1.8s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   5.7s
[CV 1/3; 7/36] END features__tokens__selector__k=100, features__tokens__spacy_token_vectorizer__max_features=50;, score=0.904 total time=   9.0s
[CV 2/3; 7/36] START features__tokens__selector__k=100, features__tokens__spacy_token_vectorizer__max_features=50
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[... output truncated ...]
[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   1.8s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   6.0s
[CV 2/3; 7/36] END features__tokens__selector__k=100, features__tokens__spacy_token_vectorizer__max_features=50;, score=0.908 total time=   9.4s
[CV 3/3; 7/36] START features__tokens__selector__k=100, features__tokens__spacy_token_vectorizer__max_features=50
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[... output truncated ...]
[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   1.9s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   5.4s
[CV 3/3; 7/36] END features__tokens__selector__k=100, features__tokens__spacy_token_vectorizer__max_features=50;, score=0.914 total time=   8.9s
[CV 1/3; 8/36] START features__tokens__selector__k=100, features__tokens__spacy_token_vectorizer__max_features=100
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.4s
[Pipeline] ............. (



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   1.8s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   5.9s
[CV 1/3; 13/36] END features__tokens__selector__k=150, features__tokens__spacy_token_vectorizer__max_features=50;, score=0.904 total time=   9.3s
[CV 2/3; 13/36] START features__tokens__selector__k=150, features__tokens__spacy_token_vectorizer__max_features=50
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............. 



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   1.8s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   6.3s
[CV 2/3; 13/36] END features__tokens__selector__k=150, features__tokens__spacy_token_vectorizer__max_features=50;, score=0.908 total time=   9.7s
[CV 3/3; 13/36] START features__tokens__selector__k=150, features__tokens__spacy_token_vectorizer__max_features=50
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.4s
[Pipeline] ............. 



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   1.8s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   5.3s
[CV 3/3; 13/36] END features__tokens__selector__k=150, features__tokens__spacy_token_vectorizer__max_features=50;, score=0.914 total time=   8.7s
[CV 1/3; 14/36] START features__tokens__selector__k=150, features__tokens__spacy_token_vectorizer__max_features=100
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] .............



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   1.9s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   4.3s
[CV 1/3; 14/36] END features__tokens__selector__k=150, features__tokens__spacy_token_vectorizer__max_features=100;, score=0.903 total time=   8.1s
[CV 2/3; 14/36] START features__tokens__selector__k=150, features__tokens__spacy_token_vectorizer__max_features=100
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   1.9s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   4.7s
[CV 2/3; 14/36] END features__tokens__selector__k=150, features__tokens__spacy_token_vectorizer__max_features=100;, score=0.923 total time=   8.3s
[CV 3/3; 14/36] START features__tokens__selector__k=150, features__tokens__spacy_token_vectorizer__max_features=100
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.5s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.5s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.1s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   4.8s
[CV 3/3; 14/36] END features__tokens__selector__k=150, features__tokens__spacy_token_vectorizer__max_features=100;, score=0.908 total time=   8.7s
[CV 1/3; 15/36] START features__tokens__selector__k=150, features__tokens__spacy_token_vectorizer__max_features=150
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.5s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.5s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.7s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   6.7s
[CV 1/3; 19/36] END features__tokens__selector__k=200, features__tokens__spacy_token_vectorizer__max_features=50;, score=0.904 total time=  11.1s
[CV 2/3; 19/36] START features__tokens__selector__k=200, features__tokens__spacy_token_vectorizer__max_features=50
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............. 



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.5s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.5s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.2s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   7.4s
[CV 2/3; 19/36] END features__tokens__selector__k=200, features__tokens__spacy_token_vectorizer__max_features=50;, score=0.908 total time=  11.3s
[CV 3/3; 19/36] START features__tokens__selector__k=200, features__tokens__spacy_token_vectorizer__max_features=50
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............. 



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.5s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.5s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.0s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   5.9s
[CV 3/3; 19/36] END features__tokens__selector__k=200, features__tokens__spacy_token_vectorizer__max_features=50;, score=0.914 total time=   9.5s
[CV 1/3; 20/36] START features__tokens__selector__k=200, features__tokens__spacy_token_vectorizer__max_features=100
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.4s
[Pipeline] .............



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.0s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   4.5s
[CV 1/3; 20/36] END features__tokens__selector__k=200, features__tokens__spacy_token_vectorizer__max_features=100;, score=0.903 total time=   8.1s
[CV 2/3; 20/36] START features__tokens__selector__k=200, features__tokens__spacy_token_vectorizer__max_features=100
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.5s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.5s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.1s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   5.0s
[CV 2/3; 20/36] END features__tokens__selector__k=200, features__tokens__spacy_token_vectorizer__max_features=100;, score=0.923 total time=   8.8s
[CV 3/3; 20/36] START features__tokens__selector__k=200, features__tokens__spacy_token_vectorizer__max_features=100
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.1s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   4.7s
[CV 3/3; 20/36] END features__tokens__selector__k=200, features__tokens__spacy_token_vectorizer__max_features=100;, score=0.908 total time=   8.4s
[CV 1/3; 21/36] START features__tokens__selector__k=200, features__tokens__spacy_token_vectorizer__max_features=150
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............



[Pipeline] .......... (step 2 of 3) Processing selector, total=   0.2s
[Pipeline] ............ (step 3 of 3) Processing scaler, total=   0.0s
[FeatureUnion] ........ (step 1 of 4) Processing tokens, total=   1.5s
[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.5s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.2s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   3.0s
[CV 1/3; 21/36] END features__tokens__selector__k=200, features__tokens__spacy_token_vectorizer__max_features=150;, score=0.927 total time=   6.



[Pipeline] .......... (step 2 of 3) Processing selector, total=   0.3s
[Pipeline] ............ (step 3 of 3) Processing scaler, total=   0.0s
[FeatureUnion] ........ (step 1 of 4) Processing tokens, total=   1.7s
[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.5s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.5s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.4s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   3.5s
[CV 2/3; 21/36] END features__tokens__selector__k=200, features__tokens__spacy_token_vectorizer__max_features=150;, score=0.938 total time=   7.



[Pipeline] .......... (step 2 of 3) Processing selector, total=   0.2s
[Pipeline] ............ (step 3 of 3) Processing scaler, total=   0.0s
[FeatureUnion] ........ (step 1 of 4) Processing tokens, total=   2.1s
[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.6s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.6s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.9s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   4.2s
[CV 3/3; 21/36] END features__tokens__selector__k=200, features__tokens__spacy_token_vectorizer__max_features=150;, score=0.932 total time=   9.



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.5s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.5s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.1s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   6.8s
[CV 1/3; 25/36] END features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=50;, score=0.904 total time=  10.5s
[CV 2/3; 25/36] START features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=50
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............. 



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.5s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.1s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   6.8s
[CV 2/3; 25/36] END features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=50;, score=0.908 total time=  10.6s
[CV 3/3; 25/36] START features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=50
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............. 



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   1.9s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   5.5s
[CV 3/3; 25/36] END features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=50;, score=0.914 total time=   9.0s
[CV 1/3; 26/36] START features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=100
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] .............



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.5s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.5s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.2s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   4.6s
[CV 1/3; 26/36] END features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=100;, score=0.903 total time=   8.4s
[CV 2/3; 26/36] START features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=100
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.5s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.5s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.2s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   4.9s
[CV 2/3; 26/36] END features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=100;, score=0.923 total time=   8.8s
[CV 3/3; 26/36] START features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=100
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.8s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.8s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.6s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   4.9s
[CV 3/3; 26/36] END features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=100;, score=0.908 total time=   9.5s
[CV 1/3; 27/36] START features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=150
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.4s
[Pipeline] ............



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.5s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.5s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.0s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   2.8s
[CV 1/3; 27/36] END features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=150;, score=0.927 total time=   6.4s
[CV 2/3; 27/36] START features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=150
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.0s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   2.9s
[CV 2/3; 27/36] END features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=150;, score=0.938 total time=   6.5s
[CV 3/3; 27/36] START features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=150
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.5s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.5s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.1s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   3.6s
[CV 3/3; 27/36] END features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=150;, score=0.932 total time=   7.4s
[CV 1/3; 28/36] START features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=200
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............



[Pipeline] .......... (step 2 of 3) Processing selector, total=   0.3s
[Pipeline] ............ (step 3 of 3) Processing scaler, total=   0.0s
[FeatureUnion] ........ (step 1 of 4) Processing tokens, total=   1.5s
[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.5s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.5s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.2s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   2.8s
[CV 1/3; 28/36] END features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=200;, score=0.937 total time=   6.



[Pipeline] .......... (step 2 of 3) Processing selector, total=   0.3s
[Pipeline] ............ (step 3 of 3) Processing scaler, total=   0.0s
[FeatureUnion] ........ (step 1 of 4) Processing tokens, total=   1.5s
[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.5s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.5s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.2s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   2.6s
[CV 2/3; 28/36] END features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=200;, score=0.938 total time=   6.



[Pipeline] .......... (step 2 of 3) Processing selector, total=   0.3s
[Pipeline] ............ (step 3 of 3) Processing scaler, total=   0.0s
[FeatureUnion] ........ (step 1 of 4) Processing tokens, total=   1.5s
[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.2s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   2.8s
[CV 3/3; 28/36] END features__tokens__selector__k=250, features__tokens__spacy_token_vectorizer__max_features=200;, score=0.937 total time=   6.



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   1.9s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   6.3s
[CV 1/3; 31/36] END features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=50;, score=0.904 total time=   9.9s
[CV 2/3; 31/36] START features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=50
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............. 



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.5s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.5s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   1.9s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   6.5s
[CV 2/3; 31/36] END features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=50;, score=0.908 total time=  10.1s
[CV 3/3; 31/36] START features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=50
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............. 



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.2s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   5.4s
[CV 3/3; 31/36] END features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=50;, score=0.914 total time=   9.3s
[CV 1/3; 32/36] START features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=100
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] .............



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   1.9s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   4.2s
[CV 1/3; 32/36] END features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=100;, score=0.903 total time=   7.7s
[CV 2/3; 32/36] START features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=100
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   1.9s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   4.6s
[CV 2/3; 32/36] END features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=100;, score=0.923 total time=   8.1s
[CV 3/3; 32/36] START features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=100
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   1.9s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   4.4s
[CV 3/3; 32/36] END features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=100;, score=0.908 total time=   7.8s
[CV 1/3; 33/36] START features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=150
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............



[Pipeline] .......... (step 2 of 3) Processing selector, total=   0.2s
[Pipeline] ............ (step 3 of 3) Processing scaler, total=   0.0s
[FeatureUnion] ........ (step 1 of 4) Processing tokens, total=   1.4s
[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.5s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.5s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.1s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   3.0s
[CV 1/3; 33/36] END features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=150;, score=0.927 total time=   6.



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.0s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   3.2s
[CV 2/3; 33/36] END features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=150;, score=0.938 total time=   6.8s
[CV 3/3; 33/36] START features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=150
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............



[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.0s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   3.4s
[CV 3/3; 33/36] END features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=150;, score=0.932 total time=   6.9s
[CV 1/3; 34/36] START features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=200
[Pipeline] ........... (step 1 of 4) Processing cleaner, total=   0.5s
[Pipeline] ............



[Pipeline] .......... (step 2 of 3) Processing selector, total=   0.2s
[Pipeline] ............ (step 3 of 3) Processing scaler, total=   0.0s
[FeatureUnion] ........ (step 1 of 4) Processing tokens, total=   1.4s
[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.0s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   2.7s
[CV 1/3; 34/36] END features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=200;, score=0.937 total time=   6.



[Pipeline] .......... (step 2 of 3) Processing selector, total=   0.2s
[Pipeline] ............ (step 3 of 3) Processing scaler, total=   0.0s
[FeatureUnion] ........ (step 1 of 4) Processing tokens, total=   1.8s
[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.4s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   2.4s
[CV 2/3; 34/36] END features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=200;, score=0.938 total time=   6.



[Pipeline] .......... (step 2 of 3) Processing selector, total=   0.2s
[Pipeline] ............ (step 3 of 3) Processing scaler, total=   0.0s
[FeatureUnion] ........ (step 1 of 4) Processing tokens, total=   1.4s
[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.0s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   2.6s
[CV 3/3; 34/36] END features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=200;, score=0.937 total time=   6.



[Pipeline] .......... (step 2 of 3) Processing selector, total=   0.3s
[Pipeline] ............ (step 3 of 3) Processing scaler, total=   0.0s
[FeatureUnion] ........ (step 1 of 4) Processing tokens, total=   1.5s
[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.1s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   2.4s
[CV 1/3; 35/36] END features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=250;, score=0.934 total time=   6.



[Pipeline] .......... (step 2 of 3) Processing selector, total=   0.3s
[Pipeline] ............ (step 3 of 3) Processing scaler, total=   0.0s
[FeatureUnion] ........ (step 1 of 4) Processing tokens, total=   1.5s
[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.1s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   2.4s
[CV 2/3; 35/36] END features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=250;, score=0.942 total time=   6.



[Pipeline] .......... (step 2 of 3) Processing selector, total=   0.3s
[Pipeline] ............ (step 3 of 3) Processing scaler, total=   0.0s
[FeatureUnion] ........ (step 1 of 4) Processing tokens, total=   1.5s
[Pipeline]  (step 1 of 2) Processing spacy_pos_vectorizer, total=   0.4s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ........... (step 2 of 4) Processing pos, total=   0.4s
[Pipeline]  (step 1 of 2) Processing textstats_vectorizer, total=   0.1s
[Pipeline] ............ (step 2 of 2) Processing scaler, total=   0.0s
[FeatureUnion] ..... (step 3 of 4) Processing textstats, total=   0.1s
[FeatureUnion] .... (step 4 of 4) Processing embeddings, total=   0.1s
[Pipeline] .......... (step 3 of 4) Processing features, total=   2.1s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   2.8s
[CV 3/3; 35/36] END features__tokens__selector__k=300, features__tokens__spacy_token_vectorizer__max_features=250;, score=0.942 total time=   6.

## 4. Evaluate your model and investigate model predictions

You already have some metrics in the cell above. Below is some additional reporting to help you understand your model.

### 4.1 Classifier-specific features

If your pipeline uses a Decision Tree classifier, the cell below plots the fitted tree.

In [17]:
if pipeline.named_steps['classifier'].__class__.__name__ == 'DecisionTreeClassifier':
    plot_decision_tree_from_pipeline(pipeline, X_train, y_train, target_classes, target_names, 'classifier', 'features')
else:
    print('The classifier is not a decision tree - so no plot is shown!')

The classifier is not a decision tree - so no plot is shown!


If you are using a Logistic Regression classifier in your pipeline, this will plot the coefficients of the features in the model.


In [18]:
if pipeline.named_steps['classifier'].__class__.__name__ == 'LogisticRegression':
    plot_logistic_regression_features_from_pipeline(pipeline, target_classes, target_names, top_n=20, classifier_step_name='classifier', features_step_name='features')

Unnamed: 0,Feature,Log Odds (Logit),Odds Ratio
105,pos__DET,-2.28174,0.102106
112,pos__PUNCT,-2.206801,0.110052
127,textstats__hapax_legomena_count,1.426865,4.165619
134,embeddings__emb_5,-1.279421,0.278198
114,pos__SYM,1.22972,3.420271
104,pos__CCONJ,1.098508,2.999686
93,tokens__various,-1.073689,0.341745
136,embeddings__emb_7,1.00249,2.725058
192,embeddings__emb_63,-0.981348,0.374806
195,embeddings__emb_66,0.907776,2.478804


Unnamed: 0,Feature,Log Odds (Logit),Odds Ratio
122,textstats__unique_tokens_relfreq,-1.667904,0.188642
129,embeddings__emb_0,-1.496928,0.223817
134,embeddings__emb_5,1.174182,3.235496
13,tokens__crucial,1.100141,3.004588
108,pos__NUM,-1.049534,0.350101
127,textstats__hapax_legomena_count,-1.048345,0.350517
113,pos__SCONJ,-1.010101,0.364182
11,tokens__conclusion,1.008563,2.741657
123,textstats__average_characters_per_token,0.946506,2.576692
144,embeddings__emb_15,0.913637,2.493375


Unnamed: 0,Feature,Log Odds (Logit),Odds Ratio
122,textstats__unique_tokens_relfreq,2.260484,9.587729
112,pos__PUNCT,1.656479,5.240823
105,pos__DET,1.543559,4.681222
103,pos__AUX,1.471068,4.353884
108,pos__NUM,1.372979,3.947093
123,textstats__average_characters_per_token,-1.204636,0.299801
115,pos__VERB,-1.183369,0.306245
129,embeddings__emb_0,1.146996,3.14872
104,pos__CCONJ,-1.139646,0.319932
66,tokens__people,0.949395,2.584146
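
The two numeric columns in the tables above carry the same information: exponentiating the `Log Odds (Logit)` value gives the `Odds Ratio`. The sketch below recomputes the odds ratio for three rows copied from those tables, using only the standard library. The variable names are illustrative and not part of textplumber.

```python
import math

# (feature, log odds, reported odds ratio) rows copied from the tables above
rows = [
    ("pos__DET", -2.28174, 0.102106),
    ("textstats__hapax_legomena_count", 1.426865, 4.165619),
    ("textstats__unique_tokens_relfreq", 2.260484, 9.587729),
]

# odds ratio = exp(log odds)
recomputed = {name: math.exp(log_odds) for name, log_odds, _ in rows}

for name, log_odds, reported in rows:
    print(f"{name}: exp({log_odds}) = {recomputed[name]:.6f} (reported {reported})")
```

An odds ratio above 1 means that higher values of the feature raise the odds of the class that table describes, and a value below 1 means they lower the odds. Because the pipeline log shows scaler steps, a "unit" here is most likely one unit of the scaled feature rather than a raw count.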


### 4.2 Investigate correct and incorrect predictions

To see your model's predictions, run this cell. The output can be long, depending on the dataset and the number of misclassifications. The Pandas `display.max_rows` option is set at the top of the cell to limit the length of each table; adjust it as required. It is reset to the Pandas default at the end of the cell.
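
Setting a display option and restoring it afterwards can also be done with pandas' `option_context` context manager, which restores the previous value automatically, even if the cell raises an error part-way through. This is a minimal sketch of that alternative, not how the notebook cell is written; the `df` below is a hypothetical stand-in for the predictions table.

```python
import pandas as pd

# Hypothetical stand-in for the predictions table
df = pd.DataFrame({"value": range(20)})

default_max_rows = pd.get_option("display.max_rows")

# The option only applies inside the with-block
with pd.option_context("display.max_rows", 5):
    inside_max_rows = pd.get_option("display.max_rows")
    truncated_repr = repr(df)  # rendered as a truncated preview

restored_max_rows = pd.get_option("display.max_rows")
print(truncated_repr)
```

In a notebook, calls to `display(...)` placed inside the `with` block would be truncated in the same way.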

In [19]:
# adjust max rows
pd.set_option('display.max_rows', 5) # limit each displayed table to 5 rows

# creating dataframe from y_predicted, y_test and the text
predictions_df = pd.DataFrame(data = {'true': y_test, 'predicted': y_predicted})
y_predicted_probs = pipeline.predict_proba(X_test)
y_predicted_probs = np.round(y_predicted_probs, 3)
# Note: Version 0.2.2 changed the following line to ensure probability labels are correct regardless of the order of target classes
columns = [f'{label_names[c]}_prob' for c in pipeline.named_steps['classifier'].classes_ if c in target_classes]
predictions_df['predicted'] = predictions_df['predicted'].apply(lambda x: label_names[x])
predictions_df['true'] = predictions_df['true'].apply(lambda x: label_names[x])
predictions_df['correct'] = predictions_df['true'] == predictions_df['predicted']
predictions_df['text'] = X_test
predictions_df = pd.concat([predictions_df, pd.DataFrame(y_predicted_probs, columns=columns)], axis=1)

# output a preview of docs for each cell of confusion matrix ...
for true_target, true_name in enumerate(target_names):
    for predicted_target, predicted_name in enumerate(target_names):
        if true_target == predicted_target:
            print(f'\nCORRECTLY CLASSIFIED: {true_name}')
        else:
            print(f'\n{true_name} INCORRECTLY CLASSIFIED as: {predicted_name}')
        print('=================================================================')

        display(predictions_df[(predictions_df['true'] == true_name) & (predictions_df['predicted'] == predicted_name)])

pd.set_option('display.max_rows', 60) # setting back to the default


CORRECTLY CLASSIFIED: claude


Unnamed: 0,true,predicted,correct,text,claude_prob,gpt_prob,human_prob
2,claude,claude,True,"Romantic poets in late 18th and early 19th century Britain employed poetic language and various tropes to challenge popular opinions supporting slavery and the oppression of slaves. Four poets in particular—William Wordsworth, Samuel Taylor Coleridge, Robert Southey and William Cowper—used their works to critique the institution of slavery and advocate for greater sensibility and humanity. However, they differed in their approaches and the extent to which they promoted outright revolution.\r...",0.979,0.020,0.001
3,claude,claude,True,"Human-wildlife conflicts refer to situations where humans and wildlife have adverse interactions that lead to perceived or real harm. These conflicts arise due to competition for resources such as land, food, and water, as well as direct aggression in the form of predation of livestock or even human attacks. Several factors contribute to the prevalence and intensity of human-wildlife conflicts around the world.\r\n\r\nOne of the primary drivers of conflict is overlap in land use or habitat b...",1.000,0.000,0.000
...,...,...,...,...,...,...,...
894,claude,claude,True,"Theories and approaches to studying gendered language have evolved significantly over time along with broader social changes. Early research largely focused on the differences in men's and women's speech, investigating language as a reflection of gender identity. More recent postmodern perspectives have challenged these traditional views, recognizing language as a multifunctional system that constructs gender identity. \r\n\r\nEarly approaches were influenced by the idea that gender is inna...",0.997,0.000,0.003
897,claude,claude,True,"Pattern grammar refers to the study of the frequent and systematic occurrences of lexical, grammatical, semantic, and discursive patterns in language. Corpus linguistics, the analysis of large collections of authentic language data, provides powerful tools for identifying and theorizing about these patterns. By examining historical corpora, we now understand that language is highly “phraseological”—that is, structured around the usage of common patterns, from lexical bundles to idioms to col...",0.852,0.086,0.062



claude INCORRECTLY CLASSIFIED as: gpt


Unnamed: 0,true,predicted,correct,text,claude_prob,gpt_prob,human_prob
44,claude,gpt,False,"William Blake's poetry exemplifies the Romantic era's view of nature as a living, spiritual entity that symbolizes deeper meaning and transcendent truths. For the Romantic poets, nature was not something separate or external to be observed, but an essential interconnected part of existence, the very ""life of things."" In Blake's poem ""The Rose,"" natural imagery is used to represent spiritual and political ideas that critique contemporary society. \r\n\r\nThe rose in Blake's poem is a central...",0.053,0.945,0.002
48,claude,gpt,False,"T.S. Eliot's landmark poem ""The Waste Land"" has been subject to a variety of theoretical and critical interpretations since its publication in 1922. Several theoretical approaches can aid in interpreting and understanding the poem, including New Criticism, psychoanalytic theory, Structuralism, and Post-structuralism. Each of these approaches provides insight into different aspects of the poem's meaning and structure.\r\n\r\nThe New Critical approach emerged around the time ""The Waste Land"" w...",0.276,0.724,0.000
...,...,...,...,...,...,...,...
777,claude,gpt,False,"Standardization in the hospitality industry refers to theconsistent delivery of a predictable and uniform product or service by a business to its customers. Many hospitality businesses standardize elements of the customer experience including décor, menus, service procedures, and employee training in order to increase operational efficiency and ensure a consistent customer experience across multiple locations. However, standardization may reduce the ability for businesses to customize offer...",0.369,0.631,0.000
871,claude,gpt,False,"Investigating Hellenistic Athens from an archaeological perspective presents numerous challenges that relate to the interpretive approaches scholars have taken. The archaeological record from the Hellenistic period in Athens is fragmentary, uneven, and often difficult to interpret. Key sites have been destroyed or built over, leaving only limited material evidence. The archaeological evidence that does remain must be interpreted in light of the complex historical context of the period. Schol...",0.162,0.785,0.052



claude INCORRECTLY CLASSIFIED as: human


Unnamed: 0,true,predicted,correct,text,claude_prob,gpt_prob,human_prob
87,claude,human,False,"Through my work experiences in the hospitality industry, I have gained valuable managerial and personal skills, as well as insights into industry practices. However, there are also certain skills I have not yet fully developed. \r\n\r\nTwo key skills I have acquired are effective communication and problem-solving. As a front desk agent, I regularly interacted with guests and managed their concerns by actively listening, addressing their needs, and resolving issues to their satisfaction. For ...",0.454,0.022,0.525
92,claude,human,False,"To identify my career aspirations of opening an inclusive fusion bakery, I first analyzed my key interests and values. I have always been passionate about baking and trying recipes from different cultures. After graduating with a degree in Hospitality Management, I gained experience in various roles at a prestigious international hotel company. However, I felt unfulfilled in my strategic management position and craved more creativity and autonomy in my work. \r\n\r\nEntrepreneurship seemed t...",0.299,0.088,0.613
...,...,...,...,...,...,...,...
855,claude,human,False,"In October 2005, a study on the nutrient intakes of 67 food and nutrition students (aged 18 to 35 years) at the University of Reading was completed using a validated food frequency questionnaire. The results showed that the students' mean energy intakes were 8.4 MJ for males and 7.2 MJ for females, which were comparable to UK reference values. However, their intakes of some vitamins and minerals were below recommendations. \r\n\r\nFor Vitamin C, the students' mean intake was 64 mg for males ...",0.455,0.000,0.545
859,claude,human,False,"Testing the peroxide value (PV) in palm oil is an important quality control process to ensure the oil remains fresh and safe for consumption or use in various products. The PV measures the amount of peroxides in an oil sample, which indicates the level of oxidation and rancidity. A higher PV means the oil has started to oxidize and break down, producing foul odors and flavors as well as potentially harmful compounds. \r\n\r\nTo test the PV, oil samples are first collected from various points...",0.326,0.089,0.585



gpt INCORRECTLY CLASSIFIED as: claude


Unnamed: 0,true,predicted,correct,text,claude_prob,gpt_prob,human_prob
36,gpt,claude,False,"Introduction (50 words):\r\nThe current wealth distribution patterns in the U.S. have raised concerns about social justice and fair compensation. This essay will analyze the underpayment of registered nurses (RNs) and the overpayment of National Football League (NFL) players as exemplifications of these patterns. By utilizing utilitarian ethics as a framework, this essay argues that compensation should align with social importance and proposes strategies to address this injustice.\r\nAnalysi...",0.803,0.197,0.001
257,gpt,claude,False,"The United States' military presence in the Middle East has been a contentious issue ever since it became heavily involved in the region several decades ago. Critics argue that the military interventions have only exacerbated tensions and fueled conflicts, while proponents argue that the presence is necessary for global peace and security. To evaluate the extent to which the US military presence is necessary, we need to consider both the historical context and the consequences of its actions...",0.956,0.002,0.043
...,...,...,...,...,...,...,...
601,gpt,claude,False,"Conflict is an inevitable part of human existence, and the need for resolution becomes paramount to maintain peace and harmony within societies. Mediation, social work, and law are three professions that are instrumental in resolving conflicts between different parties. Despite having distinct roles, each profession's techniques and values share a common motive: assisting parties in reaching a fair and just resolution. However, professionals face challenges when transitioning between roles i...",0.690,0.310,0.000
866,gpt,claude,False,"Emile Durkheim and Karl Marx are two eminent sociologists who have contributed significantly to our understanding of society and its functions. While Durkheim developed the theory of functionalism, Marx advocated the conflict theory. Although these two perspectives differ in many aspects, they also complement each other in explaining societal structures and dynamics.\r\nDurkheim's theory of functionalism views society as a complex organism with interdependent parts that work together to main...",0.990,0.010,0.000



CORRECTLY CLASSIFIED: gpt


Unnamed: 0,true,predicted,correct,text,claude_prob,gpt_prob,human_prob
0,gpt,gpt,True,"Theodore Roosevelt was a prominent figure during the Progressive Era in American history and played a significant role in shaping the movement's goals and ideals. In his own words, Roosevelt defined progressivism as the belief that ""we must act for the benefit of the whole people."" Under this definition, progressivism aimed to address various social, economic, and political issues by implementing reforms that would lead to the betterment of society as a whole.\r\nRoosevelt emphasized the nee...",0.148,0.837,0.015
5,gpt,gpt,True,"Thyroiditis is a medical condition characterized by inflammation of the thyroid gland, which can lead to various symptoms that can significantly affect an individual's well-being. There are different types of thyroiditis, each with its own distinct symptoms and treatment options.\r\nOne type of thyroiditis is Hashimoto's thyroiditis. It is an autoimmune disease in which the body's immune system mistakenly attacks and damages the thyroid gland. People with Hashimoto's thyroiditis often experi...",0.000,1.000,0.000
...,...,...,...,...,...,...,...
896,gpt,gpt,True,"The book ""Edges of the Rainbow"" by Michael Delsol and Haruku Shinozaki is a compelling and eye-opening exploration of the LGBTQ+ community in Japan. Through a collection of interviews with individuals from various backgrounds, sexual orientations, and gender identities, the book challenges stereotypes and highlights the diverse experiences within this community. By delving into the personal stories of LGBTQ+ individuals, the authors reveal the complexities and struggles that often go unnotic...",0.000,1.000,0.000
898,gpt,gpt,True,"Introduction:\r\nProteins play a crucial role in our diets as they are essential for body growth, repair, and overall health maintenance. Traditional sources of protein such as meat have long been a staple in human diets, but their environmental impact and ethical considerations have prompted the exploration of alternative protein sources. This essay aims to evaluate the various sources of meat and their biochemical compositions while discussing the advantages and limitations of edible insec...",0.000,1.000,0.000



gpt INCORRECTLY CLASSIFIED as: human


Unnamed: 0,true,predicted,correct,text,claude_prob,gpt_prob,human_prob
593,gpt,human,False,"In his article ""Dude,"" Scott Kiesling explores the various patterns of use, meanings, and functions associated with the term ""dude"" in American culture. As a linguist, Kiesling takes an in-depth look at how ""dude"" has evolved and adapted over time, becoming an integral part of American vernacular.\r\nOne of the first patterns of use that Kiesling highlights is the term's flexibility. ""Dude"" can be used as a noun, verb, adjective, or even an interjection, depending on the context. This versat...",0.02,0.262,0.718
831,gpt,human,False,"The discovery of the genetic code has revolutionized our understanding of biology and evolution. The genetic code is a set of rules that determine how the genetic information encoded in DNA is translated into functional proteins. It is a highly complex system that has evolved over billions of years, and its origins can be traced back to a simpler two-nucleotide code. This two-nucleotide code laid the foundation for the evolution of the current triplet genetic code, and in turn, has implicati...",0.025,0.343,0.632



human INCORRECTLY CLASSIFIED as: claude


Unnamed: 0,true,predicted,correct,text,claude_prob,gpt_prob,human_prob
49,human,claude,False,"Despite the low unemployment rate in the U.S, Black Americans are twice susceptible to unemployment compared to white Americans. The Bureau of Labor Statistics (BLS) recorded a 7.1% unemployment rate of African Americans compared to the national average of 4.2% (BLS, 2022). Racism and inequality in education in the U.S causes structural, cyclical, and frictional unemployment. Structural racism in education has disadvantaged Black Americans by offering them lower levers of education resulting...",0.655,0.007,0.338
230,human,claude,False,"One of the prominent areas of anthropology, primatology, studies nonhuman primates, their behavior, and social capabilities to provide valuable details about human evolution. Next, one of the primary traits that could characterize most apes, especially chimpanzees, is aggression among the male species. However, while aggression presents a significant component of social structure and gender differences among nonhuman primates, its role and influence in human evolution remain implicit. This e...",0.930,0.000,0.070
...,...,...,...,...,...,...,...
824,human,claude,False,"A strategic initiative in a business organization is a comprehensive plan that links up the companies’ objectives and its future goals and visions. Based on the results and outcomes from the SWOT analysis we conducted, as a measure to assess the establishment’s strengths, weaknesses, opportunities, and threats, my team came up with a tremendous life-changing initiative to upgrade the financial management of the DIVERSICARE HEALTHCARE organization. Referring to the financial statements we loo...",1.000,0.000,0.000
874,human,claude,False,"Dyslexia affects children with normal eyesight and intellect in different ways, such as an inability to read or acquire words. Coaching or a specialized educational intervention may help most children with Dyslexia to perform better in school. According to the article, Learning difficulties are one of the many consequences of Dyslexia. A youngster with Dyslexia will have difficulty keeping up with their classmates in most classrooms since reading is such a fundamental ability in so many area...",0.600,0.000,0.399



human INCORRECTLY CLASSIFIED as: gpt


Unnamed: 0,true,predicted,correct,text,claude_prob,gpt_prob,human_prob
128,human,gpt,False,"Most American investors and people in business do not support investigative journalism in the business community. This thesis is of salience as it reveals the media’s hidden skirmishes, especially in conducting and reporting findings related to business and trade. The news houses have the right to conduct research on each sector of the economy and pass the results to the members of the society. However, this right has been compromised over time by the top businessmen. The main reasons for th...",0.000,0.634,0.366
190,human,gpt,False,"The success of Phineas Taylor Barnum can be largely attributed to his ability as a marketing specialist and the sensationalist nature of his products and services. In order to understand how his performance could be implemented in the modern day, it is essential to assess the appeal of such an experience and the primary elements that allowed Barnum to successfully market it as a product to interested customers. From his museum of curiosities to his circus, Barnum would often instigate curios...",0.022,0.935,0.043
...,...,...,...,...,...,...,...
672,human,gpt,False,"There are always multiple obstacles that can significantly diminish the overall efficiency of any project plan. Therefore, it is crucial to analyze all the factors that may stay behind potential barriers to its successful implementation. Given the complexity of the project plan and the holistic approach expected to prevail in all stages of its fulfillment, there are numerous spheres that may be exposed to certain risks. Significant changes that affect the organizational structure of an insti...",0.000,0.973,0.026
820,human,gpt,False,"Organizational culture is an organization’s unique characteristics that set it apart from others in the industry. It manifests itself in values and norms, work ethics, employees’ awareness of themselves and their place in the organization, communications, and human relationships within the team and patients. The development of organizational culture requires the definition of the main overall goal of the organization – the mission, as well as the choice of a strategy for the implementation o...",0.000,0.822,0.177



CORRECTLY CLASSIFIED: human


Unnamed: 0,true,predicted,correct,text,claude_prob,gpt_prob,human_prob
1,human,human,True,"Theodore Roosevelt was one of the youngest American presidents known for his progressive ideas and the desire to change society, addressing the existing distrust and helplessness. The decision to promote progressive ideas was not spontaneous, and he saw the movement as the possibility to protect the real rule of humans (Roosevelt, 1912). The characteristics of a progressive are the intention to stand for social justice, achieve good for all people, and improve the environment in which indivi...",0.002,0.0,0.998
4,human,human,True,"Thyroiditis is considered to be an illness that is connected to inflammation and affection of the thyroid gland. The thyroid gland, formed on the front side of the neck and underneath the laryngeal prominence, produces hormones that regulate metabolism. Hashimoto’s thyroiditis, postnatal thyroiditis, and subacute thyroiditis are the most prevalent types of thyroiditis discussed by practitioners and researchers (Quintero et al., 2021). Thyroid dysfunction manifests itself in a triphasic illne...",0.010,0.0,0.990
...,...,...,...,...,...,...,...
895,human,human,True,"“Edges of the Rainbow” was created by Michael Delsol and Haruku Shinozaki in 2017. It is a collection of photographs and personal stories of LGBT+Q community representatives from different parts of Japan. Michael Delsos is a French photographer who depicts the people connected with art creation, and his pictures are published in magazines and newspapers. He also participates in the exhibits where Delsol presents his work separately or with other photographers. Haruku Shinozaki is a journalis...",0.000,0.0,1.000
899,human,human,True,"Introduction\r\nDue to rising meat consumption and decreasing farmland supply, creating alternate protein sources is pressing. According to Van Huis (2015), food security occurs when all populations have physical, societal, and financial access to adequate, secure, and healthier meals to suit their dietary requirements for an active lifestyle at all times. Globally, meat consumption is anticipated to rise by 76% between 2005/2007 and 2050 (Van Huis, 2015). Therefore, alternate protein supple...",0.000,0.0,1.000


In [20]:
# Note: Quality of life improvement for version 0.2.2
# We can display the full text of any selected row (for example, a misclassified text) by dataframe index
selected_index = 15

preview_row_text(predictions_df, selected_index, text_column = text_column, limit=400) # change limit to see more of the text if needed

Attribute,Value
true,claude
predicted,claude
correct,True
claude_prob,0.942
gpt_prob,0.043
human_prob,0.015


text:
The speakers in Pablo Neruda's "Tonight I Can Write" and William Butler Yeats'
"When You Are Old" express their affection for their muses through strategic
poetic elements that reflect their respective cultures. Neruda's poem is
characterized by fluid, melodious lines that convey passion and vitality,
mirroring Latin American artistic traditions. In contrast, Yeats' poem has a
more formal structur...


### 4.3 Run inference on new (or old) data

You can also run inference on new data (or on any of the texts from training/validation) by changing the contents of the `texts` list. For each text, the code prints the predicted class, the probability of each class, and the features present in the text that the model uses to make its prediction. The number shown for each feature is its value as input to the final step of the pipeline (the classifier), so it may be scaled or transformed depending on the pipeline components you've chosen.

In [21]:
texts = ['''
It was excellent!
''',
		'''
This was a terrible movie!
''',
	'''
This might not not be the best movie ever made, or it could be the best movie of no time.
''',
]

y_inference = pipeline.predict(texts)

preprocessor = Pipeline(pipeline.steps[:-1])
feature_names = preprocessor.named_steps['features'].get_feature_names_out()

for i, text in enumerate(texts):
	print(f"Text {i}: {text}")
	
	print(f"\tPredicted class: {label_names[y_inference[i]]}")
	print()

	y_inference_proba = pipeline.predict_proba([text])
	
	# Note: Version 0.2.2 changed the following lines to ensure probability labels are correct regardless of the order of target classes
	for idx, prob in enumerate(y_inference_proba[0]):
		c = pipeline.named_steps['classifier'].classes_[idx]
		if c in target_classes:
			print(f"\tProbability of class {label_names[c]}: {prob:.2f}")
	# End change for 0.2.2

	print()
	print("\tFeatures:")

	embeddings = 0
    
	frequencies = preprocessor.transform([text])
	if not isinstance(frequencies, np.ndarray):
		frequencies = frequencies.toarray()
	frequencies = frequencies[0].T
    
	for j, freq in enumerate(frequencies):
		if feature_names[j].startswith('embeddings_'):
			embeddings += 1
		elif freq > 0:
			print(f"\t{feature_names[j]}: {freq:.2f}")
	if embeddings > 0:
		print(f"\tFeatures also include {embeddings} embedding dimensions")

	print()


Text 0: 
It was excellent!

	Predicted class: human

	Probability of class claude: 0.00
	Probability of class gpt: 0.00
	Probability of class human: 1.00

	Features:
	pos__ADJ: 0.03
	pos__AUX: 0.05
	pos__PRON: 0.06
	pos__PUNCT: 0.02
	textstats__tokens_count: 0.01
	textstats__sentences_count: 0.06
	textstats__characters_count: 0.01
	textstats__monosyllabic_words_relfreq: 9.01
	textstats__polysyllabic_words_relfreq: 4.93
	textstats__unique_tokens_relfreq: 16.41
	textstats__average_characters_per_token: 9.77
	textstats__average_tokens_per_sentence: 1.28
	textstats__characters_proportion_letters: 157.16
	textstats__characters_proportion_uppercase: 7.62
	textstats__hapax_legomena_count: 0.06
	textstats__hapax_legomena_to_unique: 17.81
	Features also include 256 embedding dimensions

Text 1: 
This was a terrible movie!

	Predicted class: human

	Probability of class claude: 0.00
	Probability of class gpt: 0.00
	Probability of class human: 1.00

	Features:
	pos__ADJ: 0.03
	pos__AUX: 0.05
	pos