<!--
 Copyright 2021 Pujit Mehrotra
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at
 
     http://www.apache.org/licenses/LICENSE-2.0
 
 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

## Project Overview

This notebook analyzes and visualizes the sentence structure of the summary of Euler's Wikipedia entry using spaCy, a production-grade natural language processing library.

## Implementation

The demo implementation is simple: fetch the text from Wikipedia via an API call, process it with the NLP library, and use the resulting properties to display the analysis. The visualization utilities come conveniently bundled with the library.

I chose spaCy over NLTK, TextBlob, Stanford CoreNLP, and Gensim because it is the one I would use for a real-life project in the same spirit as the assignment.

TextBlob, and by extension NLTK (which it is built upon), is used here as a spaCy plugin, but on their own the two aren't performant or feature-rich enough to use in a production setting.

CoreNLP is written in Java. Although we could use the Python wrapper around it, doing so would expose my team to a tool dependency that neither I nor leadership would require them to have experience with, which is risky.

I would use Gensim for text similarity and topic modeling, but that was not my intent with this assignment.

spaCy was the most performant and modular choice.

## Results

Euler's entry is more subjective than I expected. The summary scores about 0.42 for subjectivity on a scale from 0 to 1, whereas I expected a Wikipedia entry to be more objective and closer to zero, much like its polarity score of about 0.14.
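To make "closer to zero" concrete, the two scores can be bucketed into rough verbal labels. The following is only a sketch: the function name `describe_scores`, the `neutral_band` width of 0.2, and the 0.5 subjectivity cut-off are all illustrative assumptions, not thresholds defined by TextBlob or spaCy.

```python
def describe_scores(polarity: float, subjectivity: float,
                    neutral_band: float = 0.2) -> str:
    """Summarize TextBlob-style scores in words.

    neutral_band and the 0.5 cut-off are arbitrary, illustrative choices.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if not 0.0 <= subjectivity <= 1.0:
        raise ValueError("subjectivity must lie in [0, 1]")

    # Polarity within the band around zero is treated as neutral.
    if abs(polarity) <= neutral_band:
        tone = "neutral"
    elif polarity > 0:
        tone = "positive"
    else:
        tone = "negative"

    # Subjectivity is split at the midpoint of its [0, 1] range.
    leaning = "subjective" if subjectivity >= 0.5 else "objective"
    return f"{tone}, leaning {leaning}"


# Scores copied from the notebook output for the Euler summary.
print(describe_scores(0.14302641802641805, 0.4157148407148408))
```

Under these assumed thresholds the summary still lands on the objective side, but its subjectivity sits much closer to the midpoint than its polarity does to either extreme.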

It was also interesting to note the entry's varied sentence structure. Longer sentences are followed by shorter ones and vice versa. Proper nouns and numbers also appear in groups instead of being scattered across the content.
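These observations come from reading the visualizations, but they could also be checked numerically. Below is a minimal sketch with two hypothetical helpers, `tag_runs` and `length_direction_changes`, that work on plain lists so they do not depend on any particular spaCy version. It assumes spaCy's universal part-of-speech tags `PROPN` and `NUM` for proper nouns and numbers.

```python
from itertools import groupby
from typing import Iterable, List, Sequence, Tuple


def tag_runs(pos_tags: Iterable[str],
             targets: Sequence[str] = ("PROPN", "NUM")) -> List[Tuple[str, int]]:
    """Return (tag, run_length) for each consecutive run of a target tag."""
    return [
        (tag, sum(1 for _ in group))
        for tag, group in groupby(pos_tags)
        if tag in targets
    ]


def length_direction_changes(sentence_lengths: Sequence[int]) -> int:
    """Count how often sentence length flips between growing and shrinking."""
    # Ignore neighbours of equal length so a plateau does not hide a flip.
    deltas = [b - a for a, b in zip(sentence_lengths, sentence_lengths[1:]) if b != a]
    return sum(1 for d1, d2 in zip(deltas, deltas[1:]) if (d1 > 0) != (d2 > 0))


# Toy input; in the notebook the lists could be built with
#   [token.pos_ for token in processed_doc]
#   [len(sentence) for sentence in processed_doc.sents]
runs = tag_runs(["PROPN", "PROPN", "VERB", "NUM", "NUM", "NUM", "NOUN", "PROPN"])
flips = length_direction_changes([30, 12, 28, 9, 9, 25])
print(runs, flips)
```

Runs longer than one token would support the "appear in groups" reading, and a flip count close to the number of sentences would support the long-short alternation.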

## Process

The process was smooth and straightforward. Not much to improve on.

## Getting Started

Before running the contents of this notebook, run the following command in a terminal *after* activating a Python virtual environment:

```bash
pip install -r requirements.txt
```

In [11]:
# Copyright 2021 Pujit Mehrotra
# 
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# 
#     http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from mediawiki import MediaWiki
import spacy
from spacy import displacy
from spacytextblob.spacytextblob import SpacyTextBlob

nlp = spacy.load("en_core_web_sm")  # Load English tokenizer, tagger, parser and NER
nlp.add_pipe('spacytextblob')
wikipedia = MediaWiki()

In [12]:
euler = wikipedia.page("Euler")
processed_doc = nlp(euler.summary)
# Polarity is a float in the range [-1, 1], where 1 means a positive
# statement and -1 means a negative statement.
# Subjectivity is a float in the range [0, 1]. Subjective sentences generally
# express personal opinion, emotion, or judgment, whereas objective sentences
# convey factual information.
print('Euler Wikipedia Summary Analysis:')
print(f'Polarity: {processed_doc._.polarity}')
print(f'Subjectivity: {processed_doc._.subjectivity}')

Euler Wikipedia Summary Analysis:
Polarity: 0.14302641802641805
Subjectivity: 0.4157148407148408


In [13]:
summary_sentences = list(processed_doc.sents)
displacy.render(summary_sentences, style="dep", jupyter=True)


In [14]:
displacy.render(summary_sentences, style="ent", jupyter=True)