NLP task:
1. Develop a Python program that extracts the company's vision/mission from the example texts we provide. Share the results in a commented Jupyter notebook that explains your solution and justifies your model choice.
2. Explain the challenges in extracting a business's service/product from the example texts, and write down some ideas for how you would approach the task.
Attached you will find:
• 3 example JSON files to work with (warning: the data is very noisy, so you have to identify the right sections!)


In [None]:
from pathlib import Path
import json

In [None]:
scrapes = {}

In [None]:
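for json_path in Path('data/').glob('*.json'):
    # JSON is normally UTF-8; be explicit so the platform default does not matter
    with open(json_path, encoding='utf-8') as f:
        scrape = json.load(f)
    scrape['json_path'] = str(json_path)
    # Index each scrape by its company name
    scrapes[scrape['website']['company_name']] = scrape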

In [None]:
list(scrapes.keys())

In [None]:
company_name = 'Black Bear Carbon'
for company, scrape in scrapes.items():
    print(company)
    print(list(scrape['website']['website_content'].keys()))

In [6]:
import re

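def get_website_content_with_regex(scrape, regex=r'(about)'):
    """Return the scraped pages whose URL matches `regex`, as {url: content}."""
    pages = scrape['website']['website_content']
    # re.search is enough: we only need to know whether the URL matches at all
    return {url: content for url, content in pages.items() if re.search(regex, url)}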
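
To make the URL filtering easier to check, here is a minimal sketch that applies the two patterns used in this notebook (home page and "about" page) to a handful of made-up URLs. The helper `filter_pages` and all `example.com` URLs are hypothetical and exist only for illustration; the patterns themselves are the ones passed to `get_website_content_with_regex` in the cells that follow.

```python
import re

def filter_pages(pages, regex):
    """Keep only the pages whose URL matches `regex` (hypothetical helper)."""
    return {url: text for url, text in pages.items() if re.search(regex, url)}

# Made-up URLs, for illustration only
pages = {
    'https://example.com': 'home',
    'https://example.com/': 'home with trailing slash',
    'https://example.com/about-us': 'about',
    'https://example.com/about/team': 'team',
    'https://example.com/products': 'products',
}

# Home page: scheme and host, optionally followed by a single trailing slash
HOME = r'https?://[^/]+/?$'
# "About" page: the last path segment must contain the word "about"
ABOUT = r'https?://.*/[^/]*(about)[^/]*$'

home_pages = filter_pages(pages, HOME)
about_pages = filter_pages(pages, ABOUT)

print(sorted(home_pages))
print(sorted(about_pages))
```

Note that the "about" pattern only looks at the last path segment, so a nested page such as `/about/team` is not selected. Whether that is desirable depends on how the scraped sites are structured.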

In [7]:
get_website_content_with_regex(scrapes[company_name], r'https?://[^/]+/?$')

{'https://blackbearcarbon.com': 'Translated: Home About Us Carbon Black Applications Coatings Inks Plastics Technical Rubber Tires Becoming a Partner Sustainability Media Gallery Events Videos Press Contact WE ARE BLACK BEAR WE ARE TECH PIONEERS DISCOVER WHY CARBON BLACK Almost every black-coloured object you see contains Carbon Black. Your mobile phone case, the ink in your pen, the buttons on your keyboard… and the tires on your car. Carbon Black is everywhere. UPCYCLED CARBON BLACK FROM TIRES Every year, more than 1.5 billion polluting end-of-life tires enter the global waste stream. Until now, there has been no sustainable solution. BLACK BEAR We upcycle end-of-life tires to produce sustainable Carbon Black, preventing CO 2 emissions and solving the global waste tire problem. BLACK BEAR IS SUPPORTED BY PRESS RELEASE Black Bear awarded as Technology Pioneer by World Economic Forum! Press Release (EN) See CARBON BLACK APPLICATIONS LATEST NEWS Black Bear appoints Victor Vreeken as Chi

In [8]:
from collections import defaultdict

missions = defaultdict(dict)
for company, scrape in scrapes.items():
    missions[company] |= get_website_content_with_regex(scrape, r'https?://[^/]+/?$')
    missions[company] |= get_website_content_with_regex(scrape, r'https?://.*/[^/]*(about)[^/]*$')
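
The in-place dictionary merge `|=` used above is only available from Python 3.9 onwards. As a small sketch for older interpreters, `dict.update` gives the same result; the company name and page contents below are placeholders, not data from the scrapes.

```python
from collections import defaultdict

missions = defaultdict(dict)

# Placeholder page dictionaries standing in for the two regex queries
home = {'https://example.com': 'home text'}
about = {'https://example.com/about': 'about text'}

# Equivalent to `missions[...] |= home` and `missions[...] |= about`
missions['Example Co'].update(home)
missions['Example Co'].update(about)

print(missions['Example Co'])
```

In both variants, if the same URL appears in both dictionaries, the value merged last wins.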


In [9]:
missions['Hyperdrive Innovation']

{'https://hyperdriveinnovation.com': 'Translated: Skip to content Home About Battery Energy Storage Industries Technology Manufacturing Tel: +44 (0) 191 640 4586 Email: info@hyperdriveinnovation.com Menu Close Home About Battery Energy Storage Industries Technology Manufacturing Insights Community Careers Contact Twitter LinkedIn Announcement: Business Continuity in Response to COVID-19 Find out more High Performance Battery Energy Storage Systems Hyperdrive Innovation designs, develops and manufactures lithium ion battery systems. As a trusted electrification partner to original equipment manufacturers around the world, our battery technology is present in a diverse range of applications, providing our customers with the right energy at the right time. About Hyperdrive Innovation Language select: 普通话 日本語 Our battery energy storage solution Modular product range With a standardised design, our modular product range provides a flexible and scalable battery energy storage solution. Combi

In [10]:
import spacy

nlp = spacy.load('en_core_web_md')

In [11]:
contents = missions[company_name].values()

In [12]:
docs = nlp.pipe(contents)

In [13]:
import numpy as np

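for company, pages in missions.items():
    # Parse this company's own pages, not the single-company `contents` list
    docs = nlp.pipe(pages.values())
    for doc in docs:
        for sent in doc.sents:
            # Drop punctuation, numbers, URLs and e-mail addresses
            print([tok for tok in sent if not any([tok.is_punct, tok.like_num, tok.like_url, tok.like_email])])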
            

[Translated, Home, About]
[Us, Carbon, Black, Applications, Coatings, Inks, Plastics, Technical, Rubber, Tires, Becoming]
[a, Partner, Sustainability, Media, Gallery, Events, Videos]
[Press, Contact]
[WE, ARE]
[BLACK, BEAR, WE, ARE, TECH, PIONEERS, DISCOVER, WHY, CARBON, BLACK]
[Almost, every, black, coloured, object, you, see, contains, Carbon, Black]
[Your, mobile, phone, case, the, ink, in, your, pen, the, buttons, on, your, keyboard, and, the, tires, on, your, car]
[Carbon, Black, is, everywhere]
[UPCYCLED, CARBON, BLACK, FROM, TIRES]
[Every, year, more, than, polluting, end, of, life, tires, enter, the, global, waste, stream]
[Until, now, there, has, been, no, sustainable, solution]
[BLACK, BEAR]
[We, upcycle, end, of, life, tires, to, produce, sustainable, Carbon, Black, preventing, CO, emissions, and, solving, the, global, waste, tire, problem]
[BLACK, BEAR, IS, SUPPORTED, BY]
[PRESS, RELEASE, Black, Bear, awarded, as, Technology, Pioneer, by, World, Economic, Forum]
[Press, Rel