# Text Summarization & Question Answering using Google's T5 Transformer
This notebook uses Google's pre-trained T5 Transformer, loaded through Spark NLP, for two tasks: summarizing Reddit selftext (about 800 characters long on average) and answering questions, since many post titles are phrased as questions.

### References:

Spark NLP documentation and instructions: https://nlp.johnsnowlabs.com/docs/en/quickstart

Spark NLP Google T5 Article: https://towardsdatascience.com/hands-on-googles-text-to-text-transfer-transformer-t5-with-spark-nlp-6f7db75cecff

Spark NLP annotators: https://nlp.johnsnowlabs.com/docs/en/annotators

Spark NLP models: https://nlp.johnsnowlabs.com/models

Read CSV correctly in pyspark: https://stackoverflow.com/questions/50751687/spark-incorrectly-reading-csv

## 1. Start the Spark session and read data

Import dependencies and start Spark session.

In [3]:
import json
import pandas as pd
import numpy as np

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
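from sparknlp.pretrained import PretrainedPipeline

# Start a Spark session with Spark NLP loaded; the cells below use `spark`.
# Skip this line if the environment already provides a configured session.
spark = sparknlp.start()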

In [5]:
# Set sep, escape, and multiLine so that quoted selftext containing commas,
# double quotes, or line breaks is read as a single field
auto = spark.read.csv('Auto.csv', header=True, sep=',', escape="\"", multiLine=True)
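
To see why the quote-aware options matter, here is a minimal sketch that uses Python's standard `csv` module rather than Spark, so it runs without a cluster. The sample rows are hypothetical and only mimic the shape of the Reddit export: one `selftext` value contains a comma, a doubled quote, and a line break.

```python
import csv
import io

# Hypothetical sample in the same shape as the Reddit export.
# The first record's selftext holds a comma, a doubled quote (""), and a newline.
sample = (
    'title,selftext,score\n'
    '"Trade in car?","We owe ""about"" 17.5K,\nshould we sell?",1\n'
    '"Second post","plain text",2\n'
)

# A quote-aware parser keeps the multi-line field together.
rows = list(csv.reader(io.StringIO(sample)))
header, records = rows[0], rows[1:]

# Counting physical lines instead splits the first record in two.
naive_line_count = len(sample.strip().split('\n')) - 1

print(len(records))       # records seen by the quote-aware parser
print(naive_line_count)   # "records" seen when every line is a row
print(repr(records[0][1]))
```

The `multiLine=True` and `escape="\""` options in the cell above ask Spark's CSV reader for the same behavior: records may span lines, and a doubled quote inside a quoted field is a literal quote.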

In [6]:
auto.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+-----+-----------+
|               title|              author|link_flair_css_class|            selftext|                 url|score|created_utc|
+--------------------+--------------------+--------------------+--------------------+--------------------+-----+-----------+
|Trade in car or k...|  crosstitchchampion|                Auto|So we have a Jeep...|https://www.reddi...|    1| 1584143188|
|Wanting to buy fi...|      victorriiaaaaa|                Auto|I'm not sure whic...|https://www.reddi...|    1| 1584140907|
|Please help! Youn...|         shanerobles|                Auto|hello, i’m 18 yea...|https://www.reddi...|    1| 1584138407|
|Do I let them rep...|      username910975|                Auto|23F. Bought my ca...|https://www.reddi...|    1| 1584136256|
|Car Dealership Ne...|             Rewbies|                Auto|Last month I purc...|https://www.reddi...|    1| 1584135468|


In [7]:
auto.count()

7324

In [45]:
auto.printSchema()

root
 |-- title: string (nullable = true)
 |-- author: string (nullable = true)
 |-- link_flair_css_class: string (nullable = true)
 |-- selftext: string (nullable = true)
 |-- url: string (nullable = true)
 |-- score: string (nullable = true)
 |-- created_utc: string (nullable = true)
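
The schema above shows every column as `string`, including `score` and `created_utc`, because the CSV was read without schema inference. If numeric columns are needed later, one option is to cast them explicitly. The sketch below only builds the SQL expressions; the target types `INT` and `BIGINT` and the helper name `build_cast_exprs` are assumptions for illustration.

```python
# Column names are taken from the schema printed above.
ALL_COLUMNS = [
    "title", "author", "link_flair_css_class",
    "selftext", "url", "score", "created_utc",
]

# Assumed target types; adjust if the data calls for something else.
NUMERIC_TYPES = {"score": "INT", "created_utc": "BIGINT"}


def build_cast_exprs(columns, numeric_types):
    """Return one select expression per column, casting the numeric ones."""
    exprs = []
    for name in columns:
        if name in numeric_types:
            exprs.append(f"CAST({name} AS {numeric_types[name]}) AS {name}")
        else:
            exprs.append(name)
    return exprs


cast_exprs = build_cast_exprs(ALL_COLUMNS, NUMERIC_TYPES)
print(cast_exprs)
```

With a live session, these expressions could be applied as `auto.selectExpr(*cast_exprs)`. Passing `inferSchema=True` to `spark.read.csv` is another route, at the cost of an extra pass over the file.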



## 2. Select the Deep Learning model

For the complete model list:
https://nlp.johnsnowlabs.com/models

For `T5` models:
https://nlp.johnsnowlabs.com/models?tag=t5

## 3. Text Summarization using T5 Transformer

Define the Spark NLP pipeline

In [16]:
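# Wrap the raw selftext column as Spark NLP documents.
document_assembler = DocumentAssembler() \
    .setInputCol("selftext") \
    .setOutputCol("documents")

# Load the pretrained small T5 model and configure it for summarization,
# capping each generated summary at 100 tokens.
t5 = T5Transformer.pretrained("t5_small", "en") \
    .setTask("summarize:") \
    .setMaxOutputLength(100) \
    .setInputCols(["documents"]) \
    .setOutputCol("summaries")

summarizer_pp = Pipeline(stages=[
    document_assembler, t5
])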

t5_small download started this may take some time.
Approximate size to download 139 MB
[OK!]
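
T5 treats every task as text-to-text, and the task is selected by a short prefix placed in front of the input; that is what `.setTask("summarize:")` configures above. The sketch below shows the idea in plain Python. The helper `with_task_prefix`, the single-space join, and the sample post are assumptions for illustration, not Spark NLP's internal code.

```python
def with_task_prefix(task, text):
    """Join a T5 task prefix and the input text into one model input string."""
    # Assumption: prefix and text are separated by a single space.
    return f"{task.strip()} {text.strip()}"


# Hypothetical post text, not taken from the dataset.
post = "My car needs new brakes and I am not sure whether to repair or sell it."

model_input = with_task_prefix("summarize:", post)
print(model_input)
```

Changing only the prefix switches the task for the same model, which is why the summarization and question answering pipelines in this notebook look so similar.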


Run the pipeline

In [17]:
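# Every stage is pretrained, so fit() on an empty DataFrame only assembles
# the PipelineModel; no training happens here.
empty_df = spark.createDataFrame([['']]).toDF('selftext')
pipeline_model = summarizer_pp.fit(empty_df)
sum_lmodel = LightPipeline(pipeline_model)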

In [18]:
result = sum_lmodel.transform(auto)

In [36]:
result.select(F.explode(F.arrays_zip('summaries.result')).alias('cols')) \
      .select(F.expr("cols['0']").alias('summaries')) \
      .show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|summaries                                                                                                                                                                                                                                                          |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|we have a Jeep Grand Cherokee and we average about 20MPG . we owe about 17.5K on it and our payments are about 420 a month . we talked about trading it in on a smaller car to get substantial fuel gains .          

In [37]:
summary_cols = result.select(F.explode(F.arrays_zip('summaries.result')).alias('cols')) \
      .select(F.expr("cols['0']").alias('summaries'))
summary_cols.show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|summaries                                                                                                                                                                                                                                                          |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|we have a Jeep Grand Cherokee and we average about 20MPG . we owe about 17.5K on it and our payments are about 420 a month . we talked about trading it in on a smaller car to get substantial fuel gains .          
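
The `explode` call above turns the array column `summaries.result` into one output row per generated summary. As a rough plain-Python analogy, using hypothetical rows in place of the transformed DataFrame, the flattening looks like this:

```python
# Hypothetical stand-in for two rows of the transformed DataFrame: each row
# holds an array of annotations, and each annotation has a `result` string.
rows = [
    {"summaries": [{"result": "first summary"}]},
    {"summaries": [{"result": "second summary, part one"},
                   {"result": "second summary, part two"}]},
]

# Equivalent of selecting `summaries.result` and exploding it:
# one output item per annotation, in row order.
flat_summaries = [ann["result"] for row in rows for ann in row["summaries"]]

print(len(flat_summaries))
for summary in flat_summaries:
    print(summary)
```

Because only one array is zipped here, `result.select(F.explode('summaries.result').alias('summaries'))` would likely produce the same column without `arrays_zip`; that shortcut has not been run against this dataset.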

## 4. Question Answering using T5 Transformer

Define the Spark NLP pipeline

In [14]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

sentence_detector = SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl", "en")\
    .setInputCols(["documents"])\
    .setOutputCol("questions")

# Closed-book question answering model; no task prefix is set here.
t5 = T5Transformer.pretrained("google_t5_small_ssm_nq", "en") \
    .setInputCols(["questions"]) \
    .setOutputCol("answers")

qa_pp = Pipeline(stages=[
    document_assembler, sentence_detector, t5
])

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
google_t5_small_ssm_nq download started this may take some time.
Approximate size to download 139 MB
[OK!]


Run the pipeline

In [15]:
empty_df = spark.createDataFrame([['']]).toDF('text')
pipeline_model = qa_pp.fit(empty_df)
qa_lmodel = LightPipeline(pipeline_model)

questions = ["Do student loans or credit card debt take precedence?",
             "How does Wageworks work?",
             "What to ask for when buying a used car?",
             "How fast does your credit score actually update?",
             "What happens with unused credit card accounts?"
]

res = qa_lmodel.fullAnnotate(questions)


for question, r in zip(questions, res):
    print("Question:", question)
    for sent in r['answers']:
        print('Answer:\t', sent.result)


Question: Do student loans or credit card debt take precedence?
Answer:	 Over one million students
Question: How does Wageworks work?
Answer:	 sales of ten million units
Question: What to ask for when buying a used car?
Answer:	 car gage
Question: How fast does your credit score actually update?
Answer:	 until you are interrogated
Question: What happens with unused credit card accounts?
Answer:	 bankruptcy
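
The questions above are hard-coded. Since many post titles are themselves questions, one possible way to collect candidates from the `title` column is a simple check for a trailing question mark. The helper `is_question` and the sample titles below are hypothetical, and the rule is a rough heuristic rather than a real question detector.

```python
def is_question(title):
    """Return True when a title looks like a question (ends with '?')."""
    return isinstance(title, str) and title.strip().endswith("?")


# Hypothetical titles, including a missing value and trailing whitespace.
sample_titles = [
    "How does Wageworks work?",
    "Car dealership never sent my plates",
    None,
    "What to ask for when buying a used car?  ",
]

candidate_questions = [t.strip() for t in sample_titles if is_question(t)]
print(candidate_questions)
```

On the Spark side, a comparable filter could be written as `auto.filter(F.trim(F.col('title')).endswith('?'))` before collecting the titles and passing them to `qa_lmodel.fullAnnotate`.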
