# Juno's Co-op Python Project

At Ben Roger's request, I am building a simple bot that uses natural language processing to extract information from a web page automatically. The extracted text is meant to serve as the page's meta description for search engine optimization (SEO).

Google announced in September 2009 that neither meta descriptions nor meta keywords factor into its ranking algorithms for web search. Meta descriptions can nonetheless affect a page's click-through rate (CTR) on Google, which can significantly affect the page's ability to rank.

The code is below. To use it, enter the URL of the web page you want to extract data from.

BeautifulSoup was used to parse the HTML into a readable, searchable form.
spaCy and scikit-learn were used for automatic text summarization.

Extractive summarization, in which a summary is built by copying individual sentences from the source text, comes first. If time allows, I hope to implement abstractive summarization, in which new text is generated from the input text.
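
As a rough illustration of the extractive approach, the sketch below scores each sentence by the average frequency of its words and keeps the top-scoring ones. It is a standard-library stand-in, not the spaCy/scikit-learn pipeline this project uses; the stop-word list, the function names, and the sample text are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "it",
              "of", "on", "or", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text and return its words, minus stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        # Average rather than sum, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


# Made-up sample text for the example.
sample = (
    "Forestry students study forests. "
    "The weather was nice yesterday. "
    "Forests and forestry research shape how students manage forests."
)
print(extractive_summary(sample, max_sentences=1))
```

Because the selected sentences are copied verbatim, the result may still exceed the length limit for a meta description and need trimming afterwards.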

The data will be output as a .csv file, and each meta description will be a string of at most 160 characters.
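
One way the 160-character limit could be enforced is sketched below. The helper name `truncate_description`, the choice to cut at a word boundary, and the trailing ellipsis are assumptions for illustration, not part of the original notebook.

```python
MAX_META_LENGTH = 160  # limit stated in this document


def truncate_description(text, limit=MAX_META_LENGTH):
    """Collapse whitespace and cut at a word boundary so the result fits in `limit`."""
    text = " ".join(text.split())
    if len(text) <= limit:
        return text
    # Reserve one character for the ellipsis, then back up to the last space.
    cut = text[: limit - 1]
    if " " in cut:
        cut = cut[: cut.rfind(" ")]
    return cut.rstrip(" ,;:.") + "…"


# Description text taken from the page used later in this notebook.
example = (
    "The Faculty of Forestry provides exceptional opportunities for Aboriginal "
    "students through innovative programs, engaging material, and award-winning "
    "faculty. If your interests include a love of the outdoors, a desire to effect "
    "positive change, and the opportunity for well-paid employment, you should "
    "consider enrolling in one of our five undergraduate degree programs or seven "
    "graduate programs."
)
shortened = truncate_description(example)
print(len(shortened), shortened)
```

Cutting at a space avoids ending the description in the middle of a word, at the cost of sometimes using fewer than 160 characters.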

First, we get the HTML code from a page:

In [16]:
# Imports, and installation of Python packages

import pandas as pd
import csv
import requests
from bs4 import BeautifulSoup
import sys
!{sys.executable} -m pip install spacy
!{sys.executable} -m pip install scikit-learn



In [17]:
url = 'https://forestry.ubc.ca/about/our-faculty-today/indigenous-initiatives/'
response = requests.get(url)
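# Name the parser explicitly; otherwise BeautifulSoup guesses one and emits a warning
soup = BeautifulSoup(response.text, 'html.parser')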
print(soup)

<!DOCTYPE html>
<html class="no-js" lang="en-US">
<head itemscope="itemscope" itemtype="http://schema.org/WebSite">
<!-- Google Tag Manager -->
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-TLMJ5BN');</script>
<!-- End Google Tag Manager -->
<link as="font" crossorigin="" data-wpacu-preload-font="1" href="/wp-content/themes/ubc-forestry/fonts/fontawesome/webfonts/fa-solid-900.woff2" rel="preload"/>
<link as="font" crossorigin="" data-wpacu-preload-font="1" href="/wp-content/themes/ubc-forestry/fonts/fontawesome/webfonts/fa-brands-400.woff2" rel="preload"/>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-192597-1"></script>
<

Second, we find all the meta tags in the page and then extract the meta description. These tags are part of the page's own HTML, so they come from the site (its authors or its content management system) rather than from Google. While you can find the description directly, as in the next code block, I decided to leave this step in to see what other metadata is present.

In [18]:
metas = soup.find_all('meta')
print(metas)

[<meta charset="utf-8"/>, <meta content="width=device-width, initial-scale=1" name="viewport"/>, <meta content="UBC Forestry 1.0.0" name="generator"/>, <meta content="index, follow" name="robots"/>, <meta content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1" name="googlebot"/>, <meta content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1" name="bingbot"/>, <meta content="en_US" property="og:locale"/>, <meta content="article" property="og:type"/>, <meta content="Indigenous Initiatives | UBC Forestry" property="og:title"/>, <meta content="The Faculty of Forestry provides exceptional opportunities for Aboriginal students through innovative programs, engaging material, and award-winning faculty. If your interests include a love of the outdoors, a desire to effect positive change, and the opportunity for well-paid employment, you should consider enrolling in one of our five undergraduate degree programs or seven graduate progra

In [19]:
# Match the Open Graph description tag by its property attribute (a plain string, not a set)
descs = soup.find('meta', property='og:description')
desc_text = descs.get("content")
print(desc_text)

The Faculty of Forestry provides exceptional opportunities for Aboriginal students through innovative programs, engaging material, and award-winning faculty. If your interests include a love of the outdoors, a desire to effect positive change, and the opportunity for well-paid employment, you should consider enrolling in one of our five undergraduate degree programs or seven graduate programs.…


Having pulled a description successfully, we can then run some linguistic analysis and determine how we will shorten the description.

In [20]:
# import spacy
# from spacy import displacy

# nlp = spacy.load("en_core_web_sm")
# doc1 = nlp(desc_text)
# doc2 = nlp("This is another sentence.")
# html = displacy.render([doc1, doc2], style="dep", page=True)


Finally, we export the result as a .csv file with the columns 'url' and 'meta_desc':

In [21]:
# with open('url_meta.csv', mode='w', newline='') as url_meta:  # newline='' is what the csv module recommends
#     field_names =['url', 'meta_desc']
#     meta_writer = csv.DictWriter(url_meta, fieldnames=field_names)
#     meta_writer.writeheader()
#     meta_writer.writerow({'url': 'www.test.com', 'meta_desc' : 'test meta desc'})
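
The export step could look roughly like the sketch below once it is enabled. The helper `write_meta_rows` is a hypothetical name, and the row is the same placeholder used in the commented-out cell; the sketch writes to an in-memory buffer so it can be run without creating a file.

```python
import csv
import io

FIELD_NAMES = ["url", "meta_desc"]


def write_meta_rows(rows, handle):
    """Write (url, meta_desc) pairs to an open text handle as CSV with a header."""
    writer = csv.DictWriter(handle, fieldnames=FIELD_NAMES)
    writer.writeheader()
    for url, meta_desc in rows:
        writer.writerow({"url": url, "meta_desc": meta_desc})


# Placeholder row, as in the commented-out cell above.
rows = [("www.test.com", "test meta desc")]

buffer = io.StringIO()
write_meta_rows(rows, buffer)
print(buffer.getvalue())

# To write a real file instead, open it with newline="":
# with open("url_meta.csv", mode="w", newline="", encoding="utf-8") as url_meta:
#     write_meta_rows(rows, url_meta)
```

`csv.DictWriter` quotes fields that contain commas, which matters here because meta descriptions frequently do.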

## Automation
Next, we automate the process for all URLs in the .csv. First, we read the file into a pandas DataFrame and take its "Address" column, which holds the URLs.

In [22]:
col_list = ["Address", "HasMeta", "Status Code", "Status", "Page Content 1"]
df = pd.read_csv("custom_extraction_all-content.csv", usecols=col_list)
urls = df["Address"]
print(urls)

0                               https://forestry.ubc.ca/
1                         https://forestry.ubc.ca/about/
2             https://forestry.ubc.ca/about/departments/
3       https://forestry.ubc.ca/about/our-faculty-today/
4      https://forestry.ubc.ca/about/our-faculty-toda...
                             ...                        
506    https://forestry.ubc.ca/student-stories/marina...
507    https://forestry.ubc.ca/student-stories/samuel...
508    https://forestry.ubc.ca/student-stories/seo-ye...
509    https://forestry.ubc.ca/student-stories/sneha-...
510    https://forestry.ubc.ca/student-stories/veena-...
Name: Address, Length: 511, dtype: object


Having collected all the URLs, we can then create a function that pulls a description from a URL, and loop it over all the URLs in the DataFrame.

In [23]:
def description_function(url_arg):
    # Fetch the page and print its og:description.
    # Raises AttributeError if the page has no og:description tag,
    # because soup.find() then returns None.
    response = requests.get(url_arg)
    soup = BeautifulSoup(response.text, 'html.parser')
    descs = soup.find('meta', property='og:description')
    desc_text = descs.get("content")
    print(desc_text)

In [15]:
# A plain loop is clearer than a list comprehension used only for its side effects
for urls_ind in urls:
    description_function(urls_ind)

UBC Forestry is Canada's largest forestry school - global leader in education & research for conservation, wood products & natural resources
UBC Faculty of Forestry is internationally-recognized for its award-winning educational programs, research, and initiatives.
Access our listing of departments within the faculty and learn more about the programs, services, and opportunities that they offer.
Welcome to the Faculty of Forestry. Located at the University of British Columbia Vancouver campus - one of Canada's top universities.
The Forest Sciences Centre (FSC) is one of the Designated Learning Spaces on campus at UBC. This spot is a vibrant place for individual or group study.
The Faculty of Forestry provides exceptional opportunities for Aboriginal students through innovative programs, engaging material, and award-winning faculty. If your interests include a love of the outdoors, a desire to effect positive change, and the opportunity for well-paid employment, you should consider enro

AttributeError: 'NoneType' object has no attribute 'get'
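
The `AttributeError` above appears when a page has no `og:description` tag: `soup.find` then returns `None`, and `None` has no `.get` method. With BeautifulSoup, the fix is to check `if descs is None` before calling `.get`. The sketch below shows the same guard in a form that runs without extra packages, so it uses the standard library's `html.parser` in place of BeautifulSoup. The class and function names and the sample pages are hypothetical.

```python
from html.parser import HTMLParser


class OgDescriptionParser(HTMLParser):
    """Collect the content of a <meta property="og:description"> tag, if present."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything once a description has been found.
        if tag != "meta" or self.description is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("property") == "og:description":
            self.description = attr_map.get("content")


def extract_description(html):
    """Return the og:description text, or None when the page has no such tag."""
    parser = OgDescriptionParser()
    parser.feed(html)
    return parser.description


# Hypothetical pages standing in for downloaded HTML.
pages = {
    "https://example.com/with-tag": '<head><meta content="A short summary." property="og:description"/></head>',
    "https://example.com/without-tag": "<head><title>No description here</title></head>",
}

descriptions = {url: extract_description(html) for url, html in pages.items()}
missing = [url for url, desc in descriptions.items() if desc is None]

print(descriptions)
print("Pages needing a generated description:", missing)
```

Returning `None` instead of raising lets the loop finish, and the `missing` list then identifies the pages for which a description still has to be generated.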