# NLP task:
1. Please develop a program in Python that extracts the vision/mission of the company from the example texts we provide. Please share the results in a commented Jupyter notebook that explains your solution (please justify your model choice).
2. Explain the challenges that lie in extracting the service/product of a business from the example texts. Write down some ideas on how you would approach the task.
Attached you will find:
• 3 example JSON files to work with (as a warning, there is a lot of noise in the data, so you have to identify the right sections!)


In [1]:
from pathlib import Path
import json

In [2]:
scrapes = {}

In [3]:
for json_path in Path('data/').glob('*.json'):
    with open(json_path) as f:
        scrape = json.load(f)
        scrape['json_path'] = str(json_path)
        scrapes[scrape['website']['company_name']] = scrape

In [4]:
list(scrapes.keys())

['Hyperdrive Innovation', 'Black Bear Carbon', 'INFARM']

In [5]:
company_name = 'Black Bear Carbon'
for company, scrape in scrapes.items():
    print(company, 'has', len(scrape['website']['website_content'].keys()), 'pages scraped')


Hyperdrive Innovation has 48 pages scraped
Black Bear Carbon has 27 pages scraped
INFARM has 38 pages scraped


## Mission / Vision extraction
Because only a small amount of data is available, the best initial solution is manual feature engineering.
This should serve as a good baseline for a supervised classification task we might implement later.

The approach below is straightforward. Using custom extensions on spaCy docs, we define a mission sentence as any sentence that contains at least one of a set of predefined keywords, which are picked manually.
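To make the keyword rule concrete, here is a minimal sketch that uses only the standard library. It is not the notebook's implementation: the sentence splitter is a naive regex, the sample text is made up, and matching is done on surface forms rather than lemmas. The last point shows why the spaCy version compares lemmas instead.

```python
import re

# Same keyword list as the notebook's `mission_words`.
MISSION_WORDS = ('mission', 'vision', 'idea', 'goal', 'solve', 'believe')

def split_sentences(text):
    """Naive splitter (an assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def has_mission_word(sentence):
    """True if any lowercased word in the sentence is exactly a mission keyword."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return any(w in MISSION_WORDS for w in words)

# Made-up sample text, for illustration only.
sample = (
    "Our mission is to cut waste. "
    "We opened a new office in May. "
    "The team believes in clean energy."
)

mission_sentences = [s for s in split_sentences(sample) if has_mission_word(s)]
print(mission_sentences)
```

Only the first sentence is selected. The third one is missed because "believes" is not literally in the keyword list, which is the gap that lemma-based matching is meant to close.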

In [None]:
import spacy
from spacy.tokens import Doc, Span, Token

nlp = spacy.load('en_core_web_md')

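# Manually picked keywords. Tokens are compared on their lowercased lemma,
# so inflected forms should match as long as the lemmatizer resolves them.
mission_words = ('mission', 'vision', 'idea', 'goal', 'solve', 'believe')
is_mission_getter = lambda token: token.lemma_.lower() in mission_words
has_mission_getter = lambda obj: any(t.lemma_.lower() in mission_words for t in obj)

# force=True lets this cell be re-run without an error for an already registered extension.
Token.set_extension("is_mission_word", getter=is_mission_getter, force=True)
Span.set_extension("has_mission_word", getter=has_mission_getter, force=True)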

In [None]:
from dataclasses import dataclass, field

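@dataclass
class Content:
    """One piece of scraped text (web page, tweet or Medium post) plus its parsed spaCy doc."""
    identifier: str
    content: str
    source: str
    doc: Doc = field(init=False)

    def __post_init__(self):
        # Parse once on creation so sentences can be queried repeatedly
        self.doc = nlp(self.content)

    def get_sentences(self):
        return self.doc.sents

    def get_mission_sentences(self):
        # Keep only sentences that contain at least one mission keyword
        return [sent for sent in self.get_sentences() if sent._.has_mission_word]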

In [None]:
import re

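def get_website_content_with_regex(scrape, regex=r'(about)'):
    """Return a Content object for every scraped page whose URL matches `regex`."""
    contents = []
    for page_url, content in scrape.get('website', {}).get('website_content', {}).items():
        # re.search is enough here: we only need to know whether the URL matches
        if re.search(regex, page_url):
            contents.append(Content(page_url, content, 'website'))
    return contents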

def get_twitter_content(scrape):
    contents = []
    for dic in scrape.get('Twitter_account', []):
        contents.append(Content(dic['id_str'], dic['text'], 'twitter'))
    return contents

def get_medium_content(scrape):
    contents = []
    for dic in scrape.get('medium', []):
        contents.append(Content(dic['url'], dic['text'], 'medium'))
    return contents

In [None]:
from collections import defaultdict

scrape_contents = defaultdict(list)
for company, scrape in scrapes.items():
    scrape_contents[company] += get_website_content_with_regex(scrape, r'https?://[^/]+/?$')
    scrape_contents[company] += get_website_content_with_regex(scrape, r'https?://.*/[^/]*(about)[^/]*$')
    scrape_contents[company] += get_twitter_content(scrape)
    scrape_contents[company] += get_medium_content(scrape)    
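The two URL patterns above decide which pages are analysed, so it helps to see what they select. The sketch below applies them to a handful of hypothetical URLs (the `example.com` addresses and the `select` helper are assumptions for illustration, not part of the scraped data).

```python
import re

# The two patterns passed to get_website_content_with_regex in the notebook.
HOMEPAGE_RE = r'https?://[^/]+/?$'
ABOUT_RE = r'https?://.*/[^/]*(about)[^/]*$'

# Hypothetical URLs, for illustration only.
urls = [
    'https://example.com',
    'https://example.com/',
    'https://example.com/about-us',
    'https://example.com/company/about',
    'https://example.com/about/',
    'https://example.com/about/team',
    'https://example.com/products',
]

def select(regex, urls):
    """Return the URLs in which the pattern is found."""
    return [u for u in urls if re.search(regex, u)]

print('homepage:', select(HOMEPAGE_RE, urls))
print('about:   ', select(ABOUT_RE, urls))
```

The first pattern keeps only the site root, with or without a trailing slash. The second requires "about" to appear in the last path segment, so `/about-us` and `/company/about` are kept, while `/about/team` and an about page with a trailing slash (`/about/`) are not. If the scraped URLs use trailing slashes, the pattern may need a small adjustment.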

In [None]:
# `scrape` is the last item left over from the loop above; inspect its top-level keys
scrape.keys()

In [None]:
for company, contents in scrape_contents.items():
    print(company)
    # Flatten the per-content sentence lists into one list per company
    all_mission_sentences = [s for c in contents for s in c.get_mission_sentences()]
    print(all_mission_sentences, '\n')