Created on April 11th 2021 by Patrick Rotzetter

https://www.linkedin.com/in/rotzetter/

# Small experiment of document mining with various techniques Part 11

We will be using the brand new spaCy version 3.0 for sentence segmentation.

## Load the files

In [1]:
# Import required libraries
import spacy
import texthero as hero
import pandas as pd
import numpy as np

In [2]:
# Validate the installed spaCy language models, just in case. This command does not work on Mac ARM systems unless you have installed the brew workaround and reinstalled Python.
!python -m spacy validate

[2K[38;5;2m✔ Loaded compatibility table[0m
[1m
[38;5;4mℹ spaCy installation:
/home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages/spacy[0m

TYPE      NAME             MODEL            VERSION                            
package   en-core-web-sm   en_core_web_sm   [38;5;2m2.3.1[0m   [38;5;2m✔[0m



In [3]:
# path of the directory containing the input text files
path='./sampledocs/txt/'

In [4]:
# Let us scan the full directory, read the text files and clean them using texthero
import glob

docName = []
docType = []
docText = []
list_of_files = glob.glob(path + '*.txt')  # list of all .txt files in the directory
for file_name in list_of_files:
    # the context manager closes each file once it has been read
    with open(file_name, 'r') as f:
        fileText = f.read()
    docName.append(file_name)
    docType.append('txt')
    docText.append(fileText)
fullDocs = pd.DataFrame({'Name': docName, 'Type': docType, 'Text': docText})
fullDocs['cleanText'] = hero.clean(fullDocs['Text'])

In [5]:
print("Average length of text:" + str(np.mean(fullDocs['Text'].str.len())))
print("Min length of text:" + str(np.min(fullDocs['Text'].str.len())))
print("Max length of text:" + str(np.max(fullDocs['Text'].str.len())))

Average length of text:91714.61111111111
Min length of text:9170
Max length of text:328295


In [6]:
fullDocs['text_word_count'] = fullDocs['Text'].apply(lambda x: len(x.strip().split()))  # word count
fullDocs['text_unique_words']=fullDocs['Text'].apply(lambda x:len(set(str(x).split())))  # number of unique words
fullDocs.head()

Unnamed: 0,Name,Type,Text,cleanText,text_word_count,text_unique_words
0,./sampledocs/txt/AI-bank-of-the-future-Can-ban...,txt,Global Banking & Securities\n\nAI-bank of the ...,global banking securities ai bank future banks...,5774,2144
1,./sampledocs/txt/Artificial Financial Intellig...,txt,Texas A&M University School of Law\n\nTexas A&...,texas university school law texas law scholars...,22240,6349
2,./sampledocs/txt/Data machine the insurers usi...,txt,Data machine: the insurers using AI to reshape...,data machine insurers using ai reshape industr...,1454,684
3,./sampledocs/txt/Digital-disruption-in-Insuran...,txt,Digital disruption\nin insurance:\nCutting thr...,digital disruption insurance cutting noise con...,34485,7049
4,./sampledocs/txt/Impact-Big-Data-AI-in-the-Ins...,txt,The Impact of Big Data and\nArtificial Intelli...,impact big data artificial intelligence ai ins...,13471,3467


In [7]:
fullDocs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Name               18 non-null     object
 1   Type               18 non-null     object
 2   Text               18 non-null     object
 3   cleanText          18 non-null     object
 4   text_word_count    18 non-null     int64 
 5   text_unique_words  18 non-null     int64 
dtypes: int64(2), object(4)
memory usage: 992.0+ bytes


In [8]:
fullDocs.describe()

Unnamed: 0,text_word_count,text_unique_words
count,18.0,18.0
mean,13696.666667,3516.944444
std,12138.456947,2126.698366
min,1454.0,684.0
25%,5404.25,1872.5
50%,10611.0,3148.5
75%,16399.0,4303.75
max,49748.0,8458.0


## Process files with spacy Sentencizer 

In [None]:
# Create a blank English pipeline that contains only the rule-based sentencizer

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # spaCy 3.x expects the component name as a string
#nlp = spacy.load("en_core_web_sm", exclude=["tok2vec", "tagger", "parser", "ner"])

In [None]:
# helper function to process a document inside an apply call and return the spaCy Doc object
def processDoc(doc):
    return nlp(doc)

In [None]:
test='I love Safaris. I want to go to South Africa .'
doc=nlp(test)
for sent in doc.sents:
    print(sent)

In [None]:
fullDocs['NLP']=fullDocs['cleanText'].apply(processDoc)
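
A rule-based sentence splitter relies on punctuation to find sentence boundaries. The cleanText preview shown earlier contains no punctuation, so splitting the cleaned text is likely to return each document as a single sentence. The sketch below illustrates the effect with a deliberately naive splitter. It is not spaCy's implementation, and `split_sentences` and the hand-written `cleaned` string are hypothetical stand-ins used only for illustration.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


raw = "I love Safaris. I want to go to South Africa."
# Hand-written stand-in for what a cleaning step without punctuation might produce
cleaned = "love safaris want go south africa"

print(split_sentences(raw))      # two sentences
print(split_sentences(cleaned))  # one "sentence": no punctuation left to split on
```

If sentence boundaries matter for later steps, applying the pipeline to the 'Text' column instead of 'cleanText' may be worth trying.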

## Transformer Summarization Pipeline

In [None]:
# Transformers installation
! pip install transformers datasets
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

In [9]:
from transformers import pipeline

summarizer = pipeline("summarization",model="t5-base", tokenizer="t5-base", framework="tf")


In [14]:
summarizer('in 2016, AlphaGo, a machine, defeated 18-time world champion Lee Sedol at the game of Go, a complex board game requiring intuition, imagination, and strategic thinking—abilities long considered distinctly human. Since then, artificial intelligence (AI) technologies haveadvanced even further,and their transformativeimpact is increasingly evident acrossindustries. AI-powered machines are tailoringrecommendations of digital content to individualtastes and preferences, designing clothinglines for fashion retailers, and even beginning tosurpass experienced doctors in detecting signs ofcancer', max_length=40)

[{'summary_text': 'artificial intelligence (AI) technologies haveadvanced even further . their transformative impact is increasingly evident across industries . machines are tailoring digital content to individual tastes and preferences '}]

Let us summarize the documents using the standard transformers pipeline.

In [15]:
# helper function to summarize a document inside an apply call: normalize whitespace, then return the summarizer output
def summarizeDoc(doc):
    doc = ' '.join(doc.split())
    return summarizer(doc)

In [None]:
fullDocs['summary']=fullDocs['cleanText'].apply(summarizeDoc)

Token indices sequence length is longer than the specified maximum sequence length for this model (5452 > 512). Running this sequence through the model will result in indexing errors
2022-04-24 15:03:01.788355: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1426766592 exceeds 10% of free system memory.


The documents are much longer than the model's maximum sequence length (5452 tokens against a limit of 512 in the warning above), so they cannot be summarized properly in a single pass. We need to find another way to summarize longer documents.
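
One possible direction, sketched here under assumptions rather than as the method used in this notebook, is to split each document into smaller chunks, summarize every chunk separately and join the partial summaries. The helper names `chunk_by_words` and `summarize_long` are hypothetical, and the 300-word budget is an assumed value meant to stay below the 512-token limit. Words and tokens are not the same thing, so the budget is a heuristic and not a guarantee.

```python
def chunk_by_words(text, max_words=300):
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long(text, summarize_fn, max_words=300):
    """Summarize each chunk with summarize_fn (a callable str -> str) and join the results."""
    return " ".join(summarize_fn(chunk) for chunk in chunk_by_words(text, max_words))


# Tiny demonstration with a stand-in "summarizer" that keeps the first word of each chunk
demo = " ".join(f"w{i}" for i in range(7))
print(chunk_by_words(demo, max_words=3))                           # ['w0 w1 w2', 'w3 w4 w5', 'w6']
print(summarize_long(demo, lambda c: c.split()[0], max_words=3))   # w0 w3 w6
```

With the pipeline created earlier, `summarize_fn` could be something like `lambda chunk: summarizer(chunk)[0]['summary_text']`, which matches the output format shown above. Cutting on word counts can split a sentence in the middle, so chunking on sentence boundaries would probably give cleaner summaries.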

## Summarizing Long Documents