<a href="https://colab.research.google.com/github/jacksonchen1998/Chinese-Text-Summarization/blob/main/ML_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Chinese text summarization**

This project is based on the paper *Leveraging BERT for Extractive Text Summarization on Lectures*.

The source code is available on GitHub: https://github.com/dmmiller612/bert-extractive-summarizer

This project uses bert-base-chinese as the text summarization model; you can swap in a different model if you want.
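
As we understand the paper, the approach embeds each sentence with BERT, clusters the embeddings with k-means, and keeps the sentence closest to each centroid. The block below is a toy sketch of that selection step in plain Python, assuming hand-made 2-D vectors in place of real BERT embeddings. The names `dist`, `kmeans`, and `pick_sentences`, and the deterministic initialisation, are illustrative choices and not the library's code.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points serve as initial centroids (assumption)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

def pick_sentences(sentences, embeddings, k):
    """Return the sentence nearest each centroid, in original document order."""
    centroids = kmeans(embeddings, k)
    chosen = {
        min(range(len(embeddings)), key=lambda i: dist(embeddings[i], c))
        for c in centroids
    }
    return [sentences[i] for i in sorted(chosen)]

# Two obvious groups of toy "sentence embeddings".
sentences = ["A1", "A2", "A3", "B1", "B2", "B3"]
embeddings = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [10.0, 12.0]]
print(pick_sentences(sentences, embeddings, k=2))  # ['A2', 'B2']
```

One sentence is kept per cluster, so `k` plays the role of the summary length.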

In [1]:
!pip install bert-extractive-summarizer # GitHub source: https://github.com/dmmiller612/bert-extractive-summarizer
!pip install spacy==2.3.1
!pip install transformers # Provides the bert-base-chinese model
!pip install neuralcoref # Coreference resolution
!pip install sacremoses # Moses tokenizer

# Download the Chinese spaCy model
!python -m spacy download zh_core_web_sm # Can be changed, e.g. zh_core_web_md

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bert-extractive-summarizer
  Downloading bert_extractive_summarizer-0.10.1-py3-none-any.whl (25 kB)
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Installing collected packages: pyyaml, tokenizers, 

# **Setting model parameters**

In this section, the program loads the model, model config, and tokenizer via Transformers.

We use **bert-base-chinese** as the custom model and zh_core_web_sm as the trained spaCy pipeline for Chinese.

The summarizer can be tuned with these parameters:

*   model: Used by the Hugging Face BERT library to load the model; you can supply a custom-trained model here.
*   custom_model: If you have a pre-trained model, you can add the model class here.
*   custom_tokenizer: If you have a custom tokenizer, you can add the tokenizer here.
*   hidden: Must be negative; selects which layer the embeddings come from.
*   reduce_option: Can be 'mean', 'median', or 'max'. This reduces the embedding layer for pooling.
*   sentence_handler: The handler used to process sentences. To use coreference, instantiate and pass a CoreferenceHandler instance.
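
To make `hidden` and `reduce_option` more concrete, here is a minimal sketch of what picking a layer and pooling over tokens means. It assumes plain Python lists instead of tensors; `pool_tokens`, `sentence_vector`, and the default `hidden=-2` are illustrative and not the library's actual implementation.

```python
from statistics import mean, median

def pool_tokens(token_vectors, reduce_option="mean"):
    """Collapse per-token vectors (one list of floats per token) into one vector."""
    reducers = {"mean": mean, "median": median, "max": max}
    if reduce_option not in reducers:
        raise ValueError(f"unknown reduce_option: {reduce_option!r}")
    reduce = reducers[reduce_option]
    # zip(*...) regroups the values by dimension, so each dimension
    # is reduced across all tokens of the sentence.
    return [reduce(dim) for dim in zip(*token_vectors)]

def sentence_vector(hidden_states, hidden=-2, reduce_option="mean"):
    """hidden_states: list of layers, each a list of token vectors."""
    if hidden >= 0:
        raise ValueError("hidden must be negative (counted from the last layer)")
    return pool_tokens(hidden_states[hidden], reduce_option)

# Toy data: 3 layers, 3 tokens, 2 dimensions.
layers = [
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    [[1.0, 4.0], [2.0, 5.0], [6.0, 0.0]],
    [[9.0, 9.0], [9.0, 9.0], [9.0, 9.0]],
]
print(sentence_vector(layers, hidden=-2, reduce_option="mean"))  # [3.0, 3.0]
```

With `hidden=-2` the second-to-last layer is used, and each reduce option gives a different sentence vector from the same token vectors.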


In [2]:
import spacy
import zh_core_web_sm
import neuralcoref
from spacy.lang.zh import Chinese
from summarizer import Summarizer
from summarizer.text_processors.sentence_handler import SentenceHandler
from transformers import AutoConfig, AutoModel, AutoTokenizer

# Load the Chinese spaCy pipeline and attach neuralcoref for coreference resolution
nlp = zh_core_web_sm.load()
neuralcoref.add_to_pipe(nlp)

modelName = "bert-base-chinese"
custom_config = AutoConfig.from_pretrained(modelName) # Download the configuration from huggingface.co and cache it
custom_config.output_hidden_states = True # Return the hidden states of every layer so sentence embeddings can be built from them
custom_tokenizer = AutoTokenizer.from_pretrained(modelName)
custom_model = AutoModel.from_pretrained(modelName, config=custom_config) # BERT encoder used to embed the text

model = Summarizer(
    custom_model=custom_model,
    custom_tokenizer=custom_tokenizer,
    sentence_handler=SentenceHandler(language=Chinese),
)

100%|██████████| 40155833/40155833 [00:02<00:00, 19328187.85B/s]
Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.133 seconds.
Prefix dict has been built successfully.
https://huggingface.co/bert-base-chinese/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpb_se6u6p


Downloading:   0%|          | 0.00/624 [00:00<?, ?B/s]

storing https://huggingface.co/bert-base-chinese/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/6cc404ca8136bc87bae0fb24f2259904943d776a6c5ddc26598bbdc319476f42.0f9bcd8314d841c06633e7b92b04509f1802c16796ee67b0f1177065739e24ae
creating metadata file for /root/.cache/huggingface/transformers/6cc404ca8136bc87bae0fb24f2259904943d776a6c5ddc26598bbdc319476f42.0f9bcd8314d841c06633e7b92b04509f1802c16796ee67b0f1177065739e24ae
loading configuration file https://huggingface.co/bert-base-chinese/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/6cc404ca8136bc87bae0fb24f2259904943d776a6c5ddc26598bbdc319476f42.0f9bcd8314d841c06633e7b92b04509f1802c16796ee67b0f1177065739e24ae
Model config BertConfig {
  "_name_or_path": "bert-base-chinese",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_s

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

storing https://huggingface.co/bert-base-chinese/resolve/main/tokenizer_config.json in cache at /root/.cache/huggingface/transformers/2dc6085404c55008ba7fc09ab7483ef3f0a4ca2496ccee0cdbf51c2b5d529dff.ec5c189f89475aac7d8cbd243960a0655cfadc3d0474da8ff2ed0bf1699c2a5f
creating metadata file for /root/.cache/huggingface/transformers/2dc6085404c55008ba7fc09ab7483ef3f0a4ca2496ccee0cdbf51c2b5d529dff.ec5c189f89475aac7d8cbd243960a0655cfadc3d0474da8ff2ed0bf1699c2a5f

Downloading:   0%|          | 0.00/107k [00:00<?, ?B/s]

storing https://huggingface.co/bert-base-chinese/resolve/main/vocab.txt in cache at /root/.cache/huggingface/transformers/36acdf4f3edf0a14ffb2b2c68ba47e93abd9448825202377ddb16dae8114fe07.accd894ff58c6ff7bd4f3072890776c14f4ea34fcc08e79cd88c2d157756dceb
creating metadata file for /root/.cache/huggingface/transformers/36acdf4f3edf0a14ffb2b2c68ba47e93abd9448825202377ddb16dae8114fe07.accd894ff58c6ff7bd4f3072890776c14f4ea34fcc08e79cd88c2d157756dceb
https://huggingface.co/bert-base-chinese/resolve/main/tokenizer.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpk843vziv


Downloading:   0%|          | 0.00/263k [00:00<?, ?B/s]

storing https://huggingface.co/bert-base-chinese/resolve/main/tokenizer.json in cache at /root/.cache/huggingface/transformers/7e23f4e1f58f867d672f84d9a459826e41cea3be6d0fe62502ddce9920f57e48.4495f7812b44ff0568ce7c4ff3fdbb2bac5eaf330440ffa30f46893bf749184d
creating metadata file for /root/.cache/huggingface/transformers/7e23f4e1f58f867d672f84d9a459826e41cea3be6d0fe62502ddce9920f57e48.4495f7812b44ff0568ce7c4ff3fdbb2bac5eaf330440ffa30f46893bf749184d
loading file https://huggingface.co/bert-base-chinese/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/36acdf4f3edf0a14ffb2b2c68ba47e93abd9448825202377ddb16dae8114fe07.accd894ff58c6ff7bd4f3072890776c14f4ea34fcc08e79cd88c2d157756dceb
loading file https://huggingface.co/bert-base-chinese/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/7e23f4e1f58f867d672f84d9a459826e41cea3be6d0fe62502ddce9920f57e48.4495f7812b44ff0568ce7c4ff3fdbb2bac5eaf330440ffa30f46893bf749184d
loading file https://hugg

Downloading:   0%|          | 0.00/393M [00:00<?, ?B/s]

storing https://huggingface.co/bert-base-chinese/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/58592490276d9ed1e8e33f3c12caf23000c22973cb2b3218c641bd74547a1889.fabda197bfe5d6a318c2833172d6757ccc7e49f692cb949a6fabf560cee81508
creating metadata file for /root/.cache/huggingface/transformers/58592490276d9ed1e8e33f3c12caf23000c22973cb2b3218c641bd74547a1889.fabda197bfe5d6a318c2833172d6757ccc7e49f692cb949a6fabf560cee81508
loading weights file https://huggingface.co/bert-base-chinese/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/58592490276d9ed1e8e33f3c12caf23000c22973cb2b3218c641bd74547a1889.fabda197bfe5d6a318c2833172d6757ccc7e49f692cb949a6fabf560cee81508
Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predicti

# **Main function (bert_summarize)**

The summarizer call takes five parameters that we can use for Chinese text summarization:

*   body: The text that you want to summarize (type=str)
*   ratio: The ratio of sentences that you want for the final summary (type=float)
*   min_length: Sentences shorter than this many characters are removed; the default is 40 (type=int)
*   max_length: Sentences longer than this many characters are removed (type=int)
*   num_sentences: Number of sentences to use. Overrides ratio if supplied (type=int)
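
The sketch below shows roughly how these parameters interact: the length bounds filter candidate sentences, then either `ratio` or `num_sentences` decides how many survive. The helper names `keep_sentence` and `summary_size`, the defaults `max_length=600` and `ratio=0.2`, and the rounding rule are assumptions for illustration, not the library's code.

```python
def keep_sentence(sentence, min_length=40, max_length=600):
    """True when the stripped sentence length lies strictly between the bounds."""
    return min_length < len(sentence.strip()) < max_length

def summary_size(n_sentences, ratio=0.2, num_sentences=None):
    """Number of sentences to keep; num_sentences wins over ratio when given."""
    if num_sentences is not None:
        return min(num_sentences, n_sentences)
    # Always keep at least one sentence, even for very small ratios.
    return max(1, int(n_sentences * ratio))

candidates = ["短句", "好" * 50, "长" * 700]
kept = [s for s in candidates if keep_sentence(s)]
print(len(kept))                                  # 1
print(summary_size(10, ratio=0.5))                # 5
print(summary_size(10, ratio=0.5, num_sentences=3))  # 3
```

Note that with the default `min_length` of 40, short Chinese sentences are dropped before clustering, so lowering it may be worth trying for texts with brief sentences.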


In [3]:
import google.colab.output

# Each wrapper strips newlines, then passes its option to the summarizer by keyword.
# A positional second argument would always be interpreted as `ratio`.
# The values arrive from JavaScript as strings, hence the float()/int() casts.

def bert_summarize_ratio(text, ratio):
  text = text.replace('\n','')
  return model(text, ratio=float(ratio))

def bert_summarize_max(text, max_length):
  text = text.replace('\n','')
  return model(text, max_length=int(max_length))

def bert_summarize_min(text, min_length):
  text = text.replace('\n','')
  return model(text, min_length=int(min_length))

def bert_summarize_num(text, num_sentences):
  text = text.replace('\n','')
  return model(text, num_sentences=int(num_sentences))


# Register the callbacks so the JavaScript UI can invoke them
google.colab.output.register_callback('bert_summarize_ratio', bert_summarize_ratio)
google.colab.output.register_callback('bert_summarize_max', bert_summarize_max)
google.colab.output.register_callback('bert_summarize_min', bert_summarize_min)
google.colab.output.register_callback('bert_summarize_num', bert_summarize_num)

# **UI Interface**

A textarea is used for the text input.

We can set the ratio, max_length, min_length, or number of sentences to control the summary; only one of these options is active at a time.

After clicking the button, we can see the execution time and the ratio of the summary length to the original text length.
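
The two numbers shown in the UI can also be computed on the Python side. This is a minimal sketch assuming a hypothetical helper `timed_summary` and a stand-in summarizer `first_half`; in the notebook you would pass one of the `bert_summarize_*` functions instead.

```python
import time

def timed_summary(summarize, text, *args):
    """Run summarize(text, *args) and report elapsed ms and the length ratio."""
    start = time.perf_counter()
    summary = summarize(text, *args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Ratio of summary length to original length, in characters.
    ratio = len(summary) / len(text) if text else 0.0
    return summary, elapsed_ms, ratio

def first_half(text):
    """Hypothetical stand-in for a real summarizer: keeps the first half."""
    return text[: len(text) // 2]

summary, elapsed_ms, ratio = timed_summary(first_half, "一二三四五六七八")
print(summary, f"{elapsed_ms:.2f} ms", ratio)
```

Timing starts before the summarizer is called, so the reported time covers the model call itself and not only the display update.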

In [8]:
from IPython.display import HTML

spinner_css = """
<style>

 /* The switch - the box around the slider */
.switch {
  position: relative;
  display: inline-block;
  width: 60px;
  height: 34px;
}

/* Hide default HTML checkbox */
.switch input {
  opacity: 0;
  width: 0;
  height: 0;
}

/* The slider */
.slider {
  position: absolute;
  cursor: pointer;
  top: 0;
  left: 0;
  right: 0;
  bottom: 0;
  background-color: #ccc;
  -webkit-transition: .4s;
  transition: .4s;
}

.slider:before {
  position: absolute;
  content: "";
  height: 26px;
  width: 26px;
  left: 4px;
  bottom: 4px;
  background-color: white;
  -webkit-transition: .4s;
  transition: .4s;
}

input:checked + .slider {
  background-color: #2196F3;
}

input:focus + .slider {
  box-shadow: 0 0 1px #2196F3;
}

input:checked + .slider:before {
  -webkit-transform: translateX(26px);
  -ms-transform: translateX(26px);
  transform: translateX(26px);
}

@keyframes c-inline-spinner-kf {
  0% {
    transform: rotate(0deg);
  }
  100% {
    transform: rotate(360deg);
  }
}

.c-inline-spinner,
.c-inline-spinner:before {
  display: inline-block;
  width: 11px;
  height: 11px;
  transform-origin: 50%;
  border: 2px solid transparent;
  border-color: #74a8d0 #74a8d0 transparent transparent;
  border-radius: 50%;
  content: "";
  animation: linear c-inline-spinner-kf 300ms infinite;
  position: relative;
  vertical-align: inherit;
  line-height: inherit;
}
.c-inline-spinner {
  top: 3px;
  margin: 0 3px;
}
.c-inline-spinner:before {
  border-color: #74a8d0 #74a8d0 transparent transparent;
  position: absolute;
  left: -2px;
  top: -2px;
  border-style: solid;
}
.description{
  font-size: 15px;
}
</style>
"""

input_form = """
<link rel="stylesheet" href="https://unpkg.com/purecss@1.0.1/build/pure-min.css" integrity="sha384-oAOxQR6DkCoMliIh8yFnu25d7Eq/PHS21PClpwjOTeU2jRSq11vu66rf90/cZr47" crossorigin="anonymous">

<div style="background-color:white; border:solid #ccc; width:800px; padding:20px; color: black;">
<p style="font-size: 20px "><strong>BERT Based </strong>Chinese extractive text summarization</p>
<textarea id="main_textarea" cols="75" rows="20" placeholder="Paste your text here..." style="font-family: 'Liberation Serif', 'DejaVu Serif', Georgia, 'Times New Roman', Times, serif; font-size: 13pt; padding:10px;"></textarea><br>
<p><strong>Current text length: </strong><strong id="text_length"> 0 </strong></p>
<br>
<p style="font-size: 20px "><strong>Description: </strong></p>
<div class="description">
    <p><strong>ratio: </strong>The ratio of sentences that you want for the final summary</p>
    <p><strong>max_length: </strong>Removes sentences longer than this many characters</p>
    <p><strong>min_length: </strong>Removes sentences shorter than this many characters</p>
    <p><strong>num_sentences: </strong>Number of sentences to use. Overrides ratio if supplied.</p>
    <p><a target="_blank" href='https://raw.githubusercontent.com/jacksonchen1998/Chinese-Text-Summarization/main/sample.txt'>A sample text file that you can try summarizing :)</a></p>
</div>
<br>

<div class="pure-form pure-form-aligned">

    <label class="switch">
      <input type="checkbox" id="ratio_check" name="check" onclick="onlyOne(this)" checked="true">
        <span class="slider"></span>
    </label>
    
    <div class="pure-control-group">

        <label for="ratio"><strong>ratio:</strong></label>
        <input type="number" id="ratio" max="1" min="0" step="0.01" value="0.5" style="background-color: white;">
    </div>

    <label class="switch">
      <input type="checkbox" id="max_check" name="check" onclick="onlyOne(this)">
        <span class="slider"></span>
    </label>

    <div class="pure-control-group">
        <label for="max_length"><strong>max_length:</strong></label>
        <input type="number" id="max_length" value="100" style="background-color: white;">
    </div>

    <label class="switch">
      <input type="checkbox" id="min_check" name="check" onclick="onlyOne(this)">
        <span class="slider"></span>
    </label>

    <div class="pure-control-group">
        <label for="min_length"><strong>min_length:</strong></label>
        <input type="number" id="min_length" value="40" style="background-color: white;">
    </div>


    <label class="switch">
        <input type="checkbox" id="num_check" name="check" onclick="onlyOne(this)">
          <span class="slider"></span>
    </label>

     <div class="pure-control-group">
        <label for="num_sentences"><strong>num_sentences:</strong></label>
        <input type="number" id="num_sentences" value="10" style="background-color: white;">
    </div>
    
    <br>
    <br>

      
      <label class="switch">
      <input type="checkbox" id="save">
        <span class="slider"></span>
      </label>
      <p>Save result as text file</p>

    
    <br>
    

    <div style="width: 300px; display: block; margin-left: auto !important; margin-right: auto !important;">
        <p><button class="pure-button pure-button-primary" style="font-size: 125%;" onclick="summarize()">Summarize</button>
        <span class="c-inline-spinner" style="visibility: hidden;" id="spinner"></span></p>
    </div>
</div>
    <p><strong>Execution time: </strong><strong id="show_time"> NaN </strong></p>
    <p><strong>Reduced ratio: </strong><strong id="reduce_ratio"> NaN </strong></p>
</div>
"""

javascript = """
<script type="text/Javascript">

    var textarea = document.querySelector("textarea");
    var save_checkbox = document.getElementById("save").checked;


    textarea.addEventListener("input", function(){
        document.getElementById('text_length').innerHTML= this.value.length;
    });

    if (document.getElementById("ratio_check").checked == true){
      document.getElementById("num_check").checked = false;
    }


    function saveTextAsFile(textToWrite, fileNameToSaveAs)
    {
    	var textFileAsBlob = new Blob([textToWrite], {type:'text/plain'});
    	var downloadLink = document.createElement("a");
    	downloadLink.download = fileNameToSaveAs;
    	downloadLink.innerHTML = "Download File";
    	downloadLink.href = (window.URL || window.webkitURL).createObjectURL(textFileAsBlob);
    	// Some browsers only honor the click when the link is attached to the DOM,
    	// so add it hidden and remove it again once it has been clicked.
    	downloadLink.style.display = "none";
    	downloadLink.onclick = function(event) { document.body.removeChild(event.target); };
    	document.body.appendChild(downloadLink);

    	downloadLink.click();
    }

    function onlyOne(checkbox) {
      var checkboxes = document.getElementsByName('check')
      checkboxes.forEach((item) => {
          if (item != checkbox) item.checked = false
      })
    }


    function summarize(){

        var text = document.getElementById('main_textarea').value;
        var ratio = document.getElementById('ratio').value;
        var max_length = document.getElementById('max_length').value;
        var min_length = document.getElementById('min_length').value;
        var num_sentences = document.getElementById('num_sentences').value;
        var kernel = google.colab.kernel;
        var resultPromise;

        // Start timing before the kernel call so the summarization itself is measured
        var start_time = new Date();

        // At most one option switch is active at a time (see onlyOne)
        if (document.getElementById("ratio_check").checked){
          resultPromise = kernel.invokeFunction("bert_summarize_ratio", [text,ratio]);
        } else if (document.getElementById("max_check").checked){
          resultPromise = kernel.invokeFunction("bert_summarize_max", [text,max_length]);
        } else if (document.getElementById("min_check").checked){
          resultPromise = kernel.invokeFunction("bert_summarize_min", [text,min_length]);
        } else if (document.getElementById("num_check").checked){
          resultPromise = kernel.invokeFunction("bert_summarize_num", [text,num_sentences]);
        } else {
          // No option is switched on, so there is nothing to run
          return;
        }

        resultPromise.then(
            function(result) {
              var end_time = new Date();
              var last_length = textarea.value.length;
              var summary = result.data["text/plain"];
              if(last_length != 0 && document.getElementById("save").checked){
                saveTextAsFile(summary, 'summary.txt')
              }
              document.getElementById('main_textarea').value = summary;
              document.getElementById('spinner').style = "visibility: hidden;";
              document.getElementById('show_time').innerHTML = (end_time - start_time) + "ms";
              // Summary length divided by the original text length
              document.getElementById('reduce_ratio').innerHTML = textarea.value.length / last_length;
        }).catch(function(error){
              document.getElementById('spinner').style = "visibility: hidden;";
              document.getElementById('main_textarea').value = error;
        });
        document.getElementById('spinner').style = "visibility: visible;";
    };
</script>
""" 


HTML(spinner_css + input_form + javascript)