<a href="https://colab.research.google.com/github/jacksonchen1998/Chinese-Text-Summarization/blob/main/ML_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Chinese text summarization**

This project is based on the paper "Leveraging BERT for Extractive Text Summarization on Lectures".

The source code of the underlying library is available on GitHub: https://github.com/dmmiller612/bert-extractive-summarizer

The project uses bert-base-chinese as the text summarization model; you can swap in a different model if you prefer.
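
As a rough illustration of the extractive idea behind the paper and the library, the sketch below clusters sentence embeddings with k-means and keeps the sentence closest to each centroid. The embeddings are made-up 2-D toy vectors, and `pick_sentences` with its small k-means loop is a hypothetical helper written for this example, not the library's code; in the real pipeline the embeddings come from BERT hidden states.

```python
import numpy as np

def pick_sentences(embeddings, k, n_iter=20):
    """Toy k-means: return sorted indices of the sentences nearest each centroid."""
    X = np.asarray(embeddings, dtype=float)
    # Assumption: initialise the centroids with the first k sentences (deterministic).
    centroids = X[:k].copy()
    for _ in range(n_iter):
        # Distance from every sentence to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Keep the closest sentence per centroid; sort to preserve document order.
    return sorted(set(int(i) for i in dists.argmin(axis=0)))

# Six toy "sentences" forming two obvious groups.
toy_embeddings = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
                  [5.0, 5.0], [5.1, 5.0], [5.3, 5.0]]
selected = pick_sentences(toy_embeddings, k=2)
print(selected)
```

Each selected index stands for one sentence that represents its cluster, so the summary is built only from sentences that already exist in the input.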

In [1]:
!pip install bert-extractive-summarizer # GitHub source code: https://github.com/dmmiller612/bert-extractive-summarizer
!pip install spacy==2.3.1
!pip install transformers # Provides the bert-base-chinese model
!pip install neuralcoref # Coreference resolution
!pip install sacremoses # Python port of the Moses tokenizer

# Download the Chinese spaCy model
!python -m spacy download zh_core_web_sm # Can be changed, e.g. to zh_core_web_md

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting zh_core_web_sm==2.3.1
  Downloading https://github.com/explosion/spacy-models/releases/download/zh_core_web_sm-2.3.1/zh_core_web_sm-2.3.1.tar.gz (47.9 MB)
     |████████████████████████████████| 47.9 MB 4.3 MB/s 
✔ Download and installation successful
You can now load the model via spacy.load('zh_core_web_sm')


# **Setting Model parameter**

In this section, the program loads the model, the model config, and the tokenizer via Transformers.

We use **bert-base-chinese** as the custom model and zh_core_web_sm as the trained spaCy pipeline for Chinese.

The summarizer can be tuned with the following parameters:

*   model: Used by the Hugging Face BERT library to load the model; you can supply a custom-trained model here.
*   custom_model: If you have a pre-trained model, pass it here.
*   custom_tokenizer: If you have a custom tokenizer, pass it here.
*   hidden: Must be negative; selects which layer the embeddings come from.
*   reduce_option: One of 'mean', 'median', or 'max'; the pooling operation used to reduce the embedding layer.
*   sentence_handler: The handler that processes sentences. To use coreference, instantiate and pass a CoreferenceHandler instance.
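
To make the `hidden` and `reduce_option` parameters more concrete, here is a small NumPy sketch of the idea: pick one layer by negative index, then pool its token vectors into a single sentence vector. The `hidden_states` array is made-up toy data and `sentence_embedding` is a hypothetical helper for illustration, not part of the library.

```python
import numpy as np

def sentence_embedding(hidden_states, hidden=-2, reduce_option="mean"):
    """Pool the token vectors of one layer into a single sentence vector."""
    layer = np.asarray(hidden_states)[hidden]  # shape: (tokens, dim)
    if reduce_option == "mean":
        return layer.mean(axis=0)
    if reduce_option == "median":
        return np.median(layer, axis=0)
    if reduce_option == "max":
        return layer.max(axis=0)
    raise ValueError(f"unknown reduce_option: {reduce_option}")

# Toy data: 3 layers, 3 tokens, 2 dimensions.
hidden_states = np.array([
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    [[1.0, 2.0], [3.0, 4.0], [8.0, 0.0]],
    [[9.0, 9.0], [9.0, 9.0], [9.0, 9.0]],
])
mean_vec = sentence_embedding(hidden_states, hidden=-2, reduce_option="mean")
median_vec = sentence_embedding(hidden_states, hidden=-2, reduce_option="median")
max_vec = sentence_embedding(hidden_states, hidden=-2, reduce_option="max")
```

With `hidden=-2` the second-to-last layer is used, so the three pooling options operate on the middle block of the toy array.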


In [2]:
import spacy
import zh_core_web_sm
import neuralcoref

nlp = zh_core_web_sm.load()
neuralcoref.add_to_pipe(nlp)

from summarizer import Summarizer
from summarizer.text_processors.sentence_handler import SentenceHandler

from spacy.lang.zh import Chinese
from transformers import AutoConfig, AutoModel, AutoTokenizer

modelName = "bert-base-chinese"
custom_config = AutoConfig.from_pretrained(modelName) # Download the configuration from huggingface.co and cache it.
custom_config.output_hidden_states = True # Return the hidden states of every layer so they can be used as sentence embeddings
custom_tokenizer = AutoTokenizer.from_pretrained(modelName)
custom_model = AutoModel.from_pretrained(modelName, config=custom_config) # Pre-trained BERT model used to embed the text

model = Summarizer(
    custom_model=custom_model, 
    custom_tokenizer=custom_tokenizer,
    sentence_handler = SentenceHandler(language=Chinese)
    )

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 1.068 seconds.
Prefix dict has been built successfully.
loading configuration file https://huggingface.co/bert-base-chinese/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/6cc404ca8136bc87bae0fb24f2259904943d776a6c5ddc26598bbdc319476f42.0f9bcd8314d841c06633e7b92b04509f1802c16796ee67b0f1177065739e24ae
Model config BertConfig {
  "_name_or_path": "bert-base-chinese",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_laye

# **Main function (bart_summarize)**

The summarizer call accepts five parameters that control Chinese text summarization:

*   body: The text that you want to summarize (type=str)
*   ratio: The fraction of sentences to keep in the final summary (type=float)
*   min_length: Sentences shorter than this length are removed; the library's description cites 40 characters (type=int)
*   max_length: Sentences longer than this length are removed (type=int)
*   num_sentences: Number of sentences to use. Overrides ratio if supplied (type=int)
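
The interaction between these parameters can be sketched in plain Python. The helpers `filter_by_length` and `summary_size` below are hypothetical names written for this example, and the default values are assumptions; they only illustrate the behaviour described in the list, not the library's actual implementation.

```python
def filter_by_length(sentences, min_length=40, max_length=600):
    """Keep sentences whose character count lies between the two bounds."""
    return [s for s in sentences if min_length < len(s) < max_length]

def summary_size(n_sentences, ratio=0.2, num_sentences=None):
    """Number of sentences to keep; num_sentences wins over ratio when given."""
    if num_sentences is not None:
        return min(num_sentences, n_sentences)
    # Always keep at least one sentence, even for very small ratios.
    return max(1, int(n_sentences * ratio))

candidates = filter_by_length(["短句", "x" * 50, "y" * 700])
by_ratio = summary_size(10, ratio=0.5)
by_count = summary_size(10, ratio=0.5, num_sentences=3)
```

Here only the 50-character sentence survives the length filter, and supplying `num_sentences` makes the `ratio` value irrelevant.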


In [3]:
import google.colab.output

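def bart_summarize(text, ratio, num_sentences):
  # Values arrive from JavaScript as strings; strip newlines before summarizing
  text = text.replace('\n','')
  # Pass by keyword so the values cannot be bound to other positional parameters such as min_length
  result = model(text, ratio=float(ratio), num_sentences=int(num_sentences))
  summary = ''.join(result)
  return summary

# Register the callback so that JavaScript can invoke it
google.colab.output.register_callback('bart_summarize', bart_summarize)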

# **User Interface**

A textarea serves as the text input.

You can set the ratio and the number of sentences to control the length of the summary.

After clicking the button, the execution time and the actual reduction ratio are displayed.
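
One plausible reading of the reduction ratio is the fraction of characters removed from the original text. The helper `reduced_ratio` below is a hypothetical name used only for this sketch, and the strings are toy inputs.

```python
def reduced_ratio(original, summary):
    """Fraction of characters removed by summarization (0.0 for empty input)."""
    if len(original) == 0:
        return 0.0
    return (len(original) - len(summary)) / len(original)

# A 10-character text reduced to 3 characters.
ratio_removed = reduced_ratio("一二三四五六七八九十", "一二三")
```

Guarding against empty input avoids a division by zero when the button is clicked with nothing in the textarea.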

In [4]:
from IPython.display import HTML

spinner_css = """
<style>

input.custom_checkbox {
  width: 20px;
  height: 20px;
}

@keyframes c-inline-spinner-kf {
  0% {
    transform: rotate(0deg);
  }
  100% {
    transform: rotate(360deg);
  }
}

.c-inline-spinner,
.c-inline-spinner:before {
  display: inline-block;
  width: 11px;
  height: 11px;
  transform-origin: 50%;
  border: 2px solid transparent;
  border-color: #74a8d0 #74a8d0 transparent transparent;
  border-radius: 50%;
  content: "";
  animation: linear c-inline-spinner-kf 300ms infinite;
  position: relative;
  vertical-align: inherit;
  line-height: inherit;
}
.c-inline-spinner {
  top: 3px;
  margin: 0 3px;
}
.c-inline-spinner:before {
  border-color: #74a8d0 #74a8d0 transparent transparent;
  position: absolute;
  left: -2px;
  top: -2px;
  border-style: solid;
}
.description{
  font-size: 15px;
}
</style>
"""

input_form = """
<link rel="stylesheet" href="https://unpkg.com/purecss@1.0.1/build/pure-min.css" integrity="sha384-oAOxQR6DkCoMliIh8yFnu25d7Eq/PHS21PClpwjOTeU2jRSq11vu66rf90/cZr47" crossorigin="anonymous">

<div style="background-color:white; border:solid #ccc; width:800px; padding:20px; color: black;">
<p style="font-size: 20px "><strong>BERT Based </strong>Chinese extractive text summarization</p>
<textarea id="main_textarea" cols="75" rows="20" placeholder="Paste your text here..." style="font-family: 'Liberation Serif', 'DejaVu Serif', Georgia, 'Times New Roman', Times, serif; font-size: 13pt; padding:10px;"></textarea><br>
<p><strong>Current text length: </strong><strong id = "text_length"> 0 </strong></p>
<br>
<p style="font-size: 20px "><strong>Description: </strong></p>
<div class="description">
    <p><strong>ratio: </strong> The ratio of sentences that you want for the final summary</p>
    <!---<p><strong>max_length: </strong>Parameter to specify to remove sentences greater than the max length</p>-->
    <!---<p><strong>min_length: </strong>Parameter to specify to remove sentences that are less than 40 characters</p>-->
    <p><strong>num_sentences: </strong>Number of sentences to use. Overrides ratio if supplied.</p>
</div>
<br>

<div class="pure-form pure-form-aligned">
    
    <div class="pure-control-group">
        <label for="ratio"><strong>ratio:</strong></label>
        <input type="number" id="ratio" max="1" min="0" step="0.01" value="0.5" style="background-color: white;">
    </div>
    <!---

    <div class="pure-control-group">
        <label for="max_length"><strong>max_length:</strong></label>
        <input type="number" id="max_length" value="100" style="background-color: white;">
    </div>
    -->

     <div class="pure-control-group">
        <label for="num_sentences"><strong>num_sentences:</strong></label>
        <input type="number" id="num_sentences" value="10" style="background-color: white;">
    </div>
    
    <br>

    <p>
      <input type="checkbox" id="save" class="custom_checkbox">
      <label for="save"> Save result as text file</label>
    </p>
    
    <br>
    

    <div style="width: 300px; display: block; margin-left: auto !important; margin-right: auto !important;">
        <p><button class="pure-button pure-button-primary" style="font-size: 125%;" onclick="summarize()">Summarize</button>
        <span class="c-inline-spinner" style="visibility: hidden;" id="spinner"></span></p>
    </div>
</div>
    <p><strong>Execution time: </strong><strong id = "show_time"> NaN </strong></p>
    <p><strong>Reduced ratio: </strong><strong id = "reduce_ratio"> NaN </strong></p>
</div>
"""

javascript = """
<script type="text/Javascript">

    var textarea = document.querySelector("textarea");
    // Keep the displayed character count in sync with the textarea
    textarea.addEventListener("input", function(){
        document.getElementById('text_length').innerHTML = this.value.length;
    });


       function saveTextAsFile(textToWrite, fileNameToSaveAs)
    {
    	var textFileAsBlob = new Blob([textToWrite], {type:'text/plain'}); 
    	var downloadLink = document.createElement("a");
    	downloadLink.download = fileNameToSaveAs;
    	downloadLink.innerHTML = "Download File";
    	if (window.webkitURL != null)
    	{
    		// Chrome allows the link to be clicked
    		// without actually adding it to the DOM.
    		downloadLink.href = window.webkitURL.createObjectURL(textFileAsBlob);
    	}
    	else
    	{
    		// Firefox requires the link to be added to the DOM
    		// before it can be clicked.
    		downloadLink.href = window.URL.createObjectURL(textFileAsBlob);
    		downloadLink.onclick = function(event){ document.body.removeChild(event.target); };
    		downloadLink.style.display = "none";
    		document.body.appendChild(downloadLink);
    	}
    
    	downloadLink.click();
    }


    function summarize(){

        var text = document.getElementById('main_textarea').value;
        var ratio = document.getElementById('ratio').value;
        /*
        var max_length = document.getElementById('max_length').value;
        */
        var num_sentences = document.getElementById('num_sentences').value;
        var kernel = google.colab.kernel;

        // Start timing before the kernel call so the whole round trip is measured
        var start_time = new Date();
        var resultPromise = kernel.invokeFunction("bart_summarize", [text, ratio, num_sentences]);

        resultPromise.then(
            function(result) {
              var summary = result.data["text/plain"];
              if(text.length != 0 && document.getElementById("save").checked){
                saveTextAsFile(summary, 'summary.txt');
              }
              document.getElementById('main_textarea').value = summary;
              document.getElementById('spinner').style = "visibility: hidden;";
              var end_time = new Date();
              document.getElementById('show_time').innerHTML = (end_time - start_time) + "ms";
              // Fraction of characters removed relative to the original text
              document.getElementById('reduce_ratio').innerHTML =
                  text.length ? ((text.length - summary.length) / text.length).toFixed(2) : "NaN";
              document.getElementById('text_length').innerHTML = summary.length;
        }).catch(function(error){
            // Hide the spinner on failure and show the error in the textarea
            document.getElementById('spinner').style = "visibility: hidden;";
            document.getElementById('main_textarea').value = error;
        });
        document.getElementById('spinner').style = "visibility: visible;";
    };
</script>
""" 


HTML(spinner_css + input_form + javascript)