<a href="https://colab.research.google.com/github/kuekuetwo/rbx1-software/blob/master/Generating_Meta_Description_Tags_using_TextSummBert_by_WordLift.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generating Meta Description Tags

<table align="left">
  <td>
  <a href="https://wordlift.io">
    <img width=130px src="https://wordlift.io/wp-content/uploads/2018/07/logo-assets-510x287.png" />
    </a>
    </td>
    <td>
      by 
      <a href="https://wordlift.io/blog/en/entity/andrea-volpini">
        Andrea Volpini
      </a>
      <br/>
      <br/>
      MIT License
      <br/>
      <br/>
      <i>Last updated: <b>December 3rd, 2019</b></i>
  </td>
</table>

You can read the blog post here: https://wordlift.io/blog/en/write-meta-descriptions-bert/

## Importing and installing the libraries we need


In [None]:
!pip install -U git+https://github.com/adbar/trafilatura.git

%tensorflow_version 1.x
!pip install spacy==2.1.3
!pip install transformers
!pip install bert-extractive-summarizer==0.2.*


from bs4 import BeautifulSoup

import csv
import os
import requests, sys
import pandas as pd
import re
import numpy as np
import trafilatura




## Downloading crawl data from Google Sheet 

The script uses the `_url` CSV file generated with the **WooRank Crawler** (or, alternatively, the data from **Screaming Frog**), which lists the crawled URLs and indicates where the meta description (MD) is missing.  

The data has been imported into Google Sheets so that we can inspect it. Change the URL below after publishing your CSV:


> 1. Open file from "My Drive" or "Upload"
> 2. File -> Publish to the web -> "Sheet name" option and "csv" option
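
As an alternative to downloading the file with `wget`, pandas can read a published sheet straight from its URL. The sketch below assumes the URL pattern visible in the `wget` commands of this notebook; `published_csv_url`, `load_published_csv`, `DOC_ID` and `GID` are hypothetical names and placeholders, not part of the original notebook.

```python
from urllib.parse import urlencode

import pandas as pd


def published_csv_url(doc_id, gid):
    """Build the "Publish to the web" CSV URL for one sheet (pattern assumed from the wget calls)."""
    query = urlencode({"gid": gid, "single": "true", "output": "csv"})
    return f"https://docs.google.com/spreadsheets/d/e/{doc_id}/pub?{query}"


def load_published_csv(doc_id, gid, **read_csv_kwargs):
    """Read the published sheet into a DataFrame (needs network access)."""
    return pd.read_csv(published_csv_url(doc_id, gid), **read_csv_kwargs)


# Placeholders: replace them with the values of your own published sheet
DOC_ID = "YOUR_PUBLISHED_DOC_ID"
GID = "0"
print(published_csv_url(DOC_ID, GID))
```

Calling `load_published_csv(DOC_ID, GID, usecols=[...])` would then replace both the `wget` step and the hard-coded file name in `pd.read_csv`.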


### Using WooRank

In [None]:
# Download the list of URLs from Google Docs (file generated with WooRank) 
# Replace the following with a crawl from your favorite website that you have published on Google Drive
!wget 'https://docs.google.com/spreadsheets/d/e/2PACX-1vRKcg1Ly4wD2ANquGnZCgUZv22lVPcRvMlTyzhLSavnH97VSPGhm0qC7U2ggVl330aFauJOftTxGIhQ/pub?gid=217899676&single=true&output=csv'

--2020-05-07 16:22:55--  https://docs.google.com/spreadsheets/d/e/2PACX-1vRKcg1Ly4wD2ANquGnZCgUZv22lVPcRvMlTyzhLSavnH97VSPGhm0qC7U2ggVl330aFauJOftTxGIhQ/pub?gid=217899676&single=true&output=csv
Resolving docs.google.com (docs.google.com)... 74.125.20.138, 74.125.20.139, 74.125.20.101, ...
Connecting to docs.google.com (docs.google.com)|74.125.20.138|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘pub?gid=217899676&single=true&output=csv.2’


2020-05-07 16:22:55 (6.67 MB/s) - ‘pub?gid=217899676&single=true&output=csv.2’ saved [748591]



#### Creating a Pandas DataFrame from WooRank data


Following the file structure generated by WooRank's crawler, we will use the following columns:

- *url* (`cols='0'` | `url`),
- *status code* (`cols='5'` | `status`),
- *page type* (`cols='8'` | `parent_type`),
- *internal or external* (`cols='12'` | `from_internal`),
- *position* (`cols='38'` | `position`),
- *meta description length in px* (`cols='46'` | `description_len_px`).

We will then use the *status code* to restrict our analysis to URLs responding with `HTTP 200`.
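
To make the column selection concrete, here is a minimal sketch on a tiny, made-up CSV (the data and the `sample` / `ok_pages` names are illustrative only). It shows that `usecols` takes zero-based column positions and how the status filter is applied afterwards.

```python
import io

import pandas as pd

# Tiny made-up crawl export with one column we do not need
raw_csv = io.StringIO(
    "url,title,status\n"
    "https://example.com/a,Page A,200\n"
    "https://example.com/b,Page B,301\n"
    "https://example.com/c,Page C,200\n"
)

# usecols takes zero-based column positions, so [0, 2] keeps `url` and `status`
sample = pd.read_csv(raw_csv, usecols=[0, 2], header=0)

# Keep only the pages that answered with HTTP 200
ok_pages = sample[sample["status"] == 200]
print(ok_pages)
```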

In [None]:
df = pd.read_csv('pub?gid=217899676&single=true&output=csv.2', # Update the string here to change the file
                 usecols=[0,5,8,12,38,46],  
                 header=0,
                 encoding="utf-8-sig" ) 

print("we have a total of:", len(df), " urls")

df.head()

we have a total of: 1707  urls


Unnamed: 0,url,status,parent_type,from_internal,position,description_len_px
0,http://wordlift.io/robots.txt,302,,,,
1,https://wordlift.io/robots.txt,200,,,,
2,https://wordlift.io/blog/en/entity/semantic-seo,301,PAGE,yes,,
3,https://wordpress.org/plugins/wordlift/,200,PAGE,no,,
4,https://vimeo.com/io10,200,PAGE,no,,


#### Finding all URLs where the meta description is missing


In [None]:
# Keep all rows representing an internal page with status = 200, with md either null or 0, from the Italian blog and with 3 < Position < 15 
 
df = df[(df['from_internal'] != 'no') & (df['status'] == 200) & (df['parent_type'] == 'PAGE') & ((df['description_len_px'].isnull()) | (df['description_len_px']== 0)) & (df['url'].str.contains("blog/it")) & (df['position'] < 15) & (df['position'] > 3)] # Use this with WooRank

print("we have to process:", len(df), " urls")

# Reindex df
df.index = range(len(df.index))

df.head()

we have to process: 22  urls


Unnamed: 0,url,status,parent_type,from_internal,position,description_len_px
0,https://wordlift.io/blog/it/vocabolario/wordca...,200,PAGE,yes,10.755906,0.0
1,https://wordlift.io/blog/it/vocabolario/json-ld/,200,PAGE,yes,11.821712,0.0
2,https://wordlift.io/blog/it/vocabolario/thubte...,200,PAGE,yes,6.739726,0.0
3,https://wordlift.io/blog/it/vocabolario/robert...,200,PAGE,yes,11.796475,0.0
4,https://wordlift.io/blog/it/vocabolario/wordlift/,200,PAGE,yes,10.535714,0.0
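
The one-line filter above can be hard to read and to adapt. As a sketch, the same conditions can be written as named boolean masks; the rows below are made up and only reuse the WooRank column names from this notebook.

```python
import pandas as pd

# Made-up rows using the WooRank column names from this notebook
sample = pd.DataFrame({
    "url": ["https://example.com/blog/it/a", "https://example.com/blog/it/b",
            "https://example.com/blog/en/c", "https://example.com/blog/it/d"],
    "status": [200, 200, 200, 404],
    "description_len_px": [0.0, 512.0, None, 0.0],
    "position": [10.2, 8.0, 5.5, 7.1],
})

is_ok = sample["status"] == 200
# A missing description shows up either as NaN or as a width of 0
missing_md = sample["description_len_px"].isnull() | (sample["description_len_px"] == 0)
is_italian = sample["url"].str.contains("blog/it", regex=False)
in_range = (sample["position"] > 3) & (sample["position"] < 15)

# Combine the masks and renumber the remaining rows from 0
todo = sample[is_ok & missing_md & is_italian & in_range].reset_index(drop=True)
print(todo["url"].tolist())
```

Naming each mask makes it easy to switch a single condition, for example the language path, without touching the others.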


### Using Screaming Frog

In [None]:
# Download the list of URLs from Google Docs (file generated with Screaming Frog SEO Spider) 
# Replace the following with a crawl from your favorite website that you have published on Google Drive

!wget 'https://docs.google.com/spreadsheets/d/e/2PACX-1vTGpl7KboITzqC8d-rosX_H4geyib-kHrVtVwrhM9rZSie7X35vYvC8iVJLVwGOYTemC4xm1qduMU8v/pub?gid=662239818&single=true&output=csv'

--2019-12-04 09:00:03--  https://docs.google.com/spreadsheets/d/e/2PACX-1vTGpl7KboITzqC8d-rosX_H4geyib-kHrVtVwrhM9rZSie7X35vYvC8iVJLVwGOYTemC4xm1qduMU8v/pub?gid=662239818&single=true&output=csv
Resolving docs.google.com (docs.google.com)... 74.125.142.139, 74.125.142.138, 74.125.142.101, ...
Connecting to docs.google.com (docs.google.com)|74.125.142.139|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘pub?gid=662239818&single=true&output=csv’


2019-12-04 09:00:03 (2.52 MB/s) - ‘pub?gid=662239818&single=true&output=csv’ saved [100324]



#### Creating a Pandas DataFrame from Screaming Frog data


Following the file structure generated by Screaming Frog's crawler, we will use the following columns:

- *url* (`cols='0'` | `Address`),
- *http status* (`cols='2'` | `Status Code`),
- *meta description width in pixels* (`cols='11'` | `Meta Description 1 Pixel Width`),
- *position* (`cols='48'` | `Position`).

We will then use the *http status* to restrict our analysis to URLs responding with `HTTP 200`.

In [None]:
df = pd.read_csv('pub?gid=662239818&single=true&output=csv', # Update the string here to change the file
                 usecols=[0,2,11,48],  
                 header=0,
                 encoding="utf-8-sig" ) 

print("we have a total of:", len(df), " urls")

df.head()

#### Finding all URLs where the meta description is missing


In [None]:
# Keep all rows representing a page with status = 200, with md 0, from the Italian blog and with 3 < Position < 15 
 
df = df[(df['Status Code'] == 200) & ((df['Meta Description 1 Pixel Width']== 0)) & (df['Address'].str.contains("blog/it")) & (df['Position'] < 15) & (df['Position'] > 3)] # Use this with Screaming Frog

print("we have to process:", len(df), " urls")

# Reindex df
df.index = range(len(df.index))

df.head()

## Summarizing 


## Running the analysis 

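The next cell defines `url_to_string`, which downloads a URL and returns its text. If you know the class of the element that contains the article body on your website, adjust the function accordingly (it looks for `entry-content` by default). An alternative implementation based on Trafilatura is included, disabled, at the end of the cell.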

In [None]:
# Get clean text from URL

def url_to_string(url):
  try:
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
    res = requests.get(url, headers=headers)
    # requests does not raise on 4xx/5xx by itself; this makes the except clause below reachable
    res.raise_for_status()
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')
    # Drop elements that never contain article text
    for tag in soup(["script", "style", 'aside']):
        tag.extract()

    # Change 'entry-content' to the class of the div containing the article body on your site;
    # when no such div is found we fall back to the text of the whole page
    content = soup.find('div', {'class': 'entry-content'})
    if content is None:
      return " ".join(re.split(r'[\n\t]+', soup.get_text()))
    else:
      return " ".join(re.split(r'[\n\t]+', content.text))

  except requests.exceptions.HTTPError as err:
    print(err)
    sys.exit(1)

'''
# Alternative: get clean text from URL using Trafilatura

def url_to_string(url):
  try:
    downloaded = trafilatura.fetch_url(url)
    if downloaded is None: # the download failed, there is nothing to extract
      return None
    return trafilatura.extract(downloaded, include_tables=False, include_formatting=False, include_comments=False)
  except ValueError as err:
    print(err)
    sys.exit(1)
'''


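
The expression `" ".join(re.split(r'[\n\t]+', ...))` used in `url_to_string` flattens the extracted text into a single line. The sketch below runs it on a made-up string to show the effect, including the stray spaces that leading and trailing newlines leave behind; the extra `strip()` is a suggestion, not something the original function does.

```python
import re

# Made-up text as it might come out of an HTML page, full of newlines and tabs
raw_text = "\n\tTitle\n\n\tFirst paragraph.\nSecond paragraph.\n"

# Same expression as in url_to_string: split on runs of newlines/tabs, rejoin with spaces
flattened = " ".join(re.split(r'[\n\t]+', raw_text))

# Leading/trailing separators produce empty strings, hence one space at each end
cleaned = flattened.strip()
print(repr(flattened))
print(repr(cleaned))
```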

In [None]:
# Create a list to store the MDs
data_x = [] 

from summarizer import Summarizer

# Load the model once, outside the loop, instead of reloading it for every URL
model = Summarizer()

# For each URL in the input CSV run the analysis and store the results in the list 
for i in range(len(df)):
    # Here is the URL to be analyzed
    line = df.iloc[i, 0]

    # Error handling for HTTP connection problems: report the failure and skip this URL
    try:
        body = url_to_string(line)
    except Exception as err:
        print('error while fetching', line, err)
        continue

    # BERT
    print("Summarizing URL via BERT: " + line)
    result = model(body, min_length=60, ratio=0.005)
    full = ''.join(result)
    print(full)

    # Storing all values into the list 
    data_x.append({"url": line, "BERT": full})


Summarizing URL via BERT: https://wordlift.io/blog/it/vocabolario/wordcamp-europe-2019/
Dal 20 al 22 giugno, la comunità di WordPress si è riunita a Berlino in occasione del WordCamp Europe (#WCEU) e, ovviamente, il nostro team non poteva mancare all’appello. Matt è salito sul palco per spiegare come l’editor a blocchi di Gutenberg abbia aggiunto una serie di notevoli miglioramenti, tra cui le funzionalità di gestione dei blocchi, un blocco di copertina con elementi nidificati, widget da integrare come blocchi, raggruppamenti di blocchi e avvisi in stile snackbar.
Summarizing URL via BERT: https://wordlift.io/blog/it/vocabolario/json-ld/
JSON-LD sta per JavaScript Object Notation per i Linked Data ed è un formato leggero per i Linked Data, per leggere e scrivere in modo semplice i metadati sul web. Queste entità hanno ID unici (unique resource identifier) nel web dei dati e grazie a questi ID, WordLift estrae delle informazioni aggiuntive e le inietta nelle pagine web usando JSON-LD.
S
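
The summaries printed above are longer than the snippet a search engine typically shows, so they may need trimming before being used as meta descriptions. The helper below is a minimal sketch using the standard library; `to_meta_description` is a hypothetical name, and the limit of 155 characters is a common rule of thumb assumed here, not a value taken from this notebook.

```python
import textwrap

# Assumption: ~155 characters is a commonly used rule of thumb, not a value from this notebook
MAX_MD_CHARS = 155


def to_meta_description(summary, max_chars=MAX_MD_CHARS):
    """Collapse whitespace and cut the summary at a word boundary within max_chars."""
    return textwrap.shorten(summary, width=max_chars, placeholder="...")


# Made-up summary, long enough to need trimming
summary = (
    "This page explains what a structured data format is, why search engines "
    "rely on it to understand the content of a page, and how a plugin can add "
    "it automatically to every article without any manual markup, so that "
    "editors can focus on writing instead of maintaining metadata by hand."
)
md = to_meta_description(summary)
print(len(md), md)
```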

### Testing BERT Multilingual

This cell is an alternative to the cells above and loads a variant of BERT called `bert-base-multilingual-cased`.

It was trained on cased text in the **104 languages** with the largest Wikipedias.

In [None]:
# Create a list to store the MDs
data_x = [] 

from transformers import BertTokenizer, BertModel

bert_model = BertModel.from_pretrained('bert-base-multilingual-cased', output_hidden_states=True)
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

from summarizer import Summarizer

# Build the summarizer once, outside the loop, instead of rebuilding it for every URL
model = Summarizer(custom_model=bert_model, custom_tokenizer=bert_tokenizer)

# For each URL in the input CSV run the analysis and store the results in the list 
for i in range(len(df)):
    # Here is the URL to be analyzed
    line = df.iloc[i, 0]

    # Error handling for HTTP connection problems: report the failure and skip this URL
    try:
        body = url_to_string(line)
    except Exception as err:
        print('error while fetching', line, err)
        continue

    # BERT Multilingual
    print("Summarizing URL via BERT  ML: " + line)
    result = model(body, min_length=60, ratio=0.005)
    full = ''.join(result)
    print(full)

    # Storing all values into the list 
    data_x.append({"url": line, "BERT": full})


Summarizing URL via BERT  ML: https://wordlift.io/blog/it/vocabolario/wordcamp-europe-2019/
Dal 20 al 22 giugno, la comunità di WordPress si è riunita a Berlino in occasione del WordCamp Europe (#WCEU) e, ovviamente, il nostro team non poteva mancare all’appello. Matteoc and Cyberandy sul palco del WCEU
Google sta aumentando il suo impegno nell’ecosistema WordPress e per questa edizione di WCEU è stato introdotto un nuovo strumento chiamato Google Site Kit.
Summarizing URL via BERT  ML: https://wordlift.io/blog/it/vocabolario/json-ld/


TypeError: ignored

### Testing the brand new ALBERT implementation

This cell is an alternative to the cell above and loads ALBERT (see: "[ALBERT: A Lite BERT For Self-Supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942)"). 

In [None]:
# Create a list to store the MDs
data_x = [] 

from transformers import AlbertTokenizer, AlbertModel

albert_model = AlbertModel.from_pretrained('albert-base-v1', output_hidden_states=True)
albert_tokenizer = AlbertTokenizer.from_pretrained('albert-base-v1')

from summarizer import Summarizer

# Build the summarizer once, outside the loop, instead of rebuilding it for every URL
model = Summarizer(custom_model=albert_model, custom_tokenizer=albert_tokenizer)

# For each URL in the input CSV run the analysis and store the results in the list 
for i in range(len(df)):
    # Here is the URL to be analyzed
    line = df.iloc[i, 0]

    # Error handling for HTTP connection problems: report the failure and skip this URL
    try:
        body = url_to_string(line)
    except Exception as err:
        print('error while fetching', line, err)
        continue

    # ALBERT (results are still stored under the "BERT" key used by the output CSV)
    print("Summarizing URL via ALBERT: " + line)
    result = model(body, min_length=60, ratio=0.005)
    full = ''.join(result)
    print(full)

    # Storing all values into the list 
    data_x.append({"url": line, "BERT": full})


Summarizing URL via ALBERT: https://wordlift.io/blog/en/entity/freeyork/
%More sessions from GoogleFounded by Sam Isma in 2009, Freeyork is a community-driven design magazine which aims to spread the works and stories of upcoming artists. ”Sam Isma, Founder of FreeyorkThe ResultsAfter the fist three months, WordLift improved the number of organic sessions (+18.47% increase of sessions from Google) and the number of new users with a double digit growth (+12.13% of new users).On average, pages enriched with WordLift compared with all the other pages, are performing 2.4 times better in terms of page views and in terms of sessions.
Summarizing URL via ALBERT: https://wordlift.io/blog/en/entity/fact-checking/
According to Wikipedia, fact checking is:“Fact checking is the act of checking factual assertions in non-fictional text in order to determine the veracity and correctness of the factual statements in the text.
Summarizing URL via ALBERT: https://wordlift.io/blog/en/entity/wordpress/
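
The three summarization cells differ only in the model they use, so the loop can be factored into one function that receives the fetcher and the summarizer as arguments. This is a sketch, not the notebook's own code: `summarize_urls` is a hypothetical helper, and `fake_fetch` / `fake_summarize` are stand-ins so that it runs without network access or model downloads.

```python
def summarize_urls(urls, fetch_text, summarize):
    """Fetch and summarize each URL, skipping the ones that cannot be fetched."""
    rows = []
    for url in urls:
        try:
            body = fetch_text(url)
        except Exception as err:
            print('error while fetching', url, err)
            continue
        if not body:
            # Nothing to summarize, e.g. the extractor returned None or an empty string
            print('no text extracted from', url)
            continue
        rows.append({"url": url, "BERT": summarize(body)})
    return rows


# Stand-ins so the sketch runs without network access or model downloads
def fake_fetch(url):
    if url.endswith("/broken"):
        raise ConnectionError("simulated failure")
    return "Text of " + url


def fake_summarize(body):
    return body.upper()


rows = summarize_urls(["https://example.com/a", "https://example.com/broken"],
                      fake_fetch, fake_summarize)
print(rows)
```

In the notebook this could be called as `data_x = summarize_urls(df.iloc[:, 0], url_to_string, lambda body: model(body, min_length=60, ratio=0.005))`, with `model` being any of the `Summarizer` instances created above.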


## Storing data 

In the following cells we save a CSV file containing, for each URL, the summary generated by the model that was run above. 


In [None]:
# Collect the results into a DataFrame (it is written to the output CSV below)
df_new = pd.DataFrame(data_x, columns=["url", "BERT"])


In [None]:
df_new.head()

Unnamed: 0,url,BERT
0,https://wordlift.io/blog/it/vocabolario/wordca...,"Dal 20 al 22 giugno, la comunità di WordPress ..."
1,https://wordlift.io/blog/it/vocabolario/json-ld/,JSON-LD sta per JavaScript Object Notation per...
2,https://wordlift.io/blog/it/vocabolario/thubte...,Nel 1878 è stati riconosciuto come la reincarn...
3,https://wordlift.io/blog/it/vocabolario/robert...,"Organizzatore del #WMT2017, ha anche portato u..."
4,https://wordlift.io/blog/it/vocabolario/wordlift/,WordLift è una startup innovativa romana che h...


In [None]:
from google.colab import files

# We set the variable for the name of the CSV where we will store the new MDs 
outputcsv = 'new-md.csv'
print("output csv name: ", outputcsv)

df_new.to_csv(outputcsv, encoding='utf-8', index=False)
print("Saving results on:", outputcsv)
files.download(outputcsv)

output csv name:  new-md.csv
Saving results on: new-md.csv
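
The cell above only works inside Colab, because `google.colab` is not available elsewhere. The following sketch shows one way to make the saving step portable and to check that accented characters survive the round trip; the single result row is made up and the temporary output path is an assumption for the example.

```python
import os
import tempfile

import pandas as pd

try:
    from google.colab import files  # only available inside Colab
except ImportError:
    files = None

# Made-up result row standing in for data_x
df_new = pd.DataFrame([{"url": "https://example.com/a", "BERT": "Riassunto con è e à"}],
                      columns=["url", "BERT"])

outputcsv = os.path.join(tempfile.gettempdir(), "new-md.csv")
df_new.to_csv(outputcsv, encoding="utf-8", index=False)

# Read the file back to check that accented characters survived the round trip
check = pd.read_csv(outputcsv, encoding="utf-8")
print(check)

# Trigger the browser download only when running in Colab
if files is not None:
    files.download(outputcsv)
```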


# License

MIT License

Copyright (c) 2019 Andrea Volpini, WordLift

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.