# Basic NLP for ecosystem mapping
In this short tutorial you'll learn some basic (high-level) NLP functionality that may come in handy when analysing industries or other topics of interest. We will be using Named Entity Recognition, which is an advanced technique. However, as we will be relying on the spaCy library, we don't have to worry about developing the functionality from scratch. This has been solved for us, and the performance is good enough for our needs.

In Python (as in many other languages) you can comment things by adding a "#". Everything after a # on a line will be ignored by the interpreter. Leaving clear comments is good practice: it allows others and yourself (it's so easy to forget code) to understand what you've actually done.
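
As a small illustration of the commenting rule described above, the sketch below shows a full-line comment, an inline comment, and a "#" inside a string (which is ordinary text, not a comment). The variable names are made up for this example.

```python
# A full-line comment: Python skips this line entirely.
radius = 2  # An inline comment: only the part after the "#" is skipped.

# A "#" inside a string is ordinary text, not the start of a comment.
hashtag = "#nlp"

area = 3.14159 * radius ** 2  # area of a circle with the radius above
print(hashtag, round(area, 2))
```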

In [1]:
# Install the libraries needed for getting a list of URLs and for extracting text from articles
!pip install newspaper3k
!pip install newsapi-python


An ! in front of a command will execute it not in Python but in the underlying system (here Linux).

In [2]:
# Import the article-extractor package
from newspaper import Article 

In [3]:
# Loop example
numbers = [1,2,3,4,5]
for some_happy_number in numbers:
    print(some_happy_number * 2)

2
4
6
8
10


**Loops** are a fundamental concept in programming: they let the computer perform a repetitive task over and over again.
Above, you see a loop that takes the numbers 1-5 out of their list "numbers" one by one and displays each number multiplied by 2.

Make sure that you follow the indentation structure in Python. Everything that has to happen "in the loop" (over and over) has to be indented one level deeper than the surrounding code. Usually, once you type the ":" that starts a loop and press Enter, your editor or notebook will indent the next line automatically.
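
To make the indentation rule concrete, here is a minimal sketch (the names `number` and `doubled` are only for illustration). The indented line runs once per element, while the unindented `print` runs a single time after the loop is done.

```python
numbers = [1, 2, 3]
doubled = []

for number in numbers:
    # Indented: this line belongs to the loop and runs once per element.
    doubled.append(number * 2)

# Not indented: this line runs once, after the loop has finished.
print(doubled)
```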

Now, let's fetch some article texts.

In [4]:
# Create a list [] called "urls" with 2 urls leading to some news articles
urls = ['https://techcrunch.com/2019/04/08/iphone-spyware-certificate/', 
 "https://techcrunch.com/2019/04/07/rise-of-the-snapchat-empire/"]

In [5]:
# Extract the article text
article_container = [] #create an empty list

for happy_url in urls: #take one url at a time
    our_happy_test_article = Article(happy_url) #instantiate it as an "Article"
    our_happy_test_article.download() #download it
    our_happy_test_article.parse() #read it (and try to guess what the title, author etc. are)
    article_container.append(our_happy_test_article.text) #extract its text and put it (append) into the empty list created earlier

In a more realistic case, you would like to fetch more than two articles. You probably would also not like to compile the list of urls manually. One way to automate the process is using NewsAPI through its Python client, NewsApiClient. It's a programmatic news search engine built for app developers who want to include news streams in their applications.
To use it, please register with them and get a free API key at https://newsapi.org.
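
Once you have a key, you may prefer not to paste it directly into a notebook that you share. The following is only a sketch of one possible approach: it reads the key from an environment variable. The variable name `NEWSAPI_KEY` and the helper `load_api_key` are assumptions made for this example, not something NewsAPI prescribes.

```python
import os

def load_api_key(env=os.environ, name="NEWSAPI_KEY"):
    """Return the API key stored under `name`, or raise a readable error.

    `NEWSAPI_KEY` is a hypothetical variable name chosen for this sketch.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable to your NewsAPI key.")
    return key

# Demonstration with a plain dict standing in for the real environment.
demo_key = load_api_key({"NEWSAPI_KEY": "XXXXXXX12345"})
print(demo_key)
```

In a real session you would call `load_api_key()` without arguments and hand the result to `NewsApiClient(api_key=...)`.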

In [6]:
from newsapi import NewsApiClient #import news-api
from collections import Counter #import the Counter class, which counts how often each item occurs (useful)
import itertools #iterator library that helps performing complex iteration routines (e.g. combinations)

In [7]:
# for example: give me all possible combinations of 2 elements from 1,2,3

list(itertools.combinations([1,2,3], 2))

[(1, 2), (1, 3), (2, 3)]

In [8]:
# authenticate with the server...

# GET your free API key at https://newsapi.org/

newsapi = NewsApiClient(api_key='XXXXXXX12345')

In [9]:
# Let's fetch urls for 100 most relevant articles for the query: "China Artificial Intelligence"
# As you can see, you have many other options including language and dates
all_articles = newsapi.get_everything(q='China Artificial Intelligence',
                                        #domains = "techcrunch.com",
                                        language='en',
                                        sort_by='relevancy',
                                        page_size = 100,
                                        #from_param = start_date,
                                        #to = end_date
                                     )

NewsAPIException: {'status': 'error', 'code': 'apiKeyInvalid', 'message': 'Your API key is invalid or incorrect. Check your key, or go to https://newsapi.org to create a free API key.'}

In [10]:
# This will display the url of the first article that has been found - Python indices start with 0, R starts with 1

all_articles['articles'][0]['url']

NameError: name 'all_articles' is not defined

The "all_articles" object created above is a dictionary, i.e. a collection of key-value pairs.
More on dictionaries here: https://www.geeksforgeeks.org/python-dictionary/

Calling ['articles'] opens up a list with the 100 found articles. Each of these elements is again a dictionary.
The structure is thus dict -> list -> dict.
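
The nesting can be tried out without an API key. The sketch below uses made-up placeholder data (the titles and example.com urls are invented; only the keys 'status', 'articles' and 'url' are taken from the outputs and code in this tutorial) and walks from the outer dictionary to the list to one inner dictionary.

```python
# Made-up placeholder data shaped like the response described above.
fake_response = {
    "status": "ok",
    "articles": [
        {"title": "First placeholder", "url": "https://example.com/a"},
        {"title": "Second placeholder", "url": "https://example.com/b"},
    ],
}

articles = fake_response["articles"]  # outer dict -> list
first_article = articles[0]           # list -> inner dict (index 0 is the first element)
first_url = first_article["url"]      # inner dict -> value

print(type(fake_response).__name__, type(articles).__name__, type(first_article).__name__)
print(first_url)
```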


In [11]:
# here we collect all urls into one list.
# the below is a list comprehension - a short option in Python to write a loop.
# it can be translated into: *Create a list in which you put the url that you take from
# each element in all_articles['articles']*

urls_big = [x['url'] for x in all_articles['articles']]


NameError: name 'all_articles' is not defined

More on list comprehensions here: https://www.pythonforbeginners.com/basics/list-comprehensions-in-python
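
To see that a list comprehension really is just a compact loop, the sketch below builds the same list both ways. The `records` list is made-up placeholder data standing in for all_articles['articles'].

```python
# Placeholder records standing in for all_articles['articles'].
records = [{"url": "https://example.com/a"}, {"url": "https://example.com/b"}]

# Version 1: an explicit loop that appends one url at a time.
urls_loop = []
for record in records:
    urls_loop.append(record["url"])

# Version 2: the equivalent list comprehension.
urls_comprehension = [record["url"] for record in records]

print(urls_loop == urls_comprehension)
```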

In [12]:
# Let's fetch all the 100 articles

texts = []

for url in urls_big:
    article = Article(url)
    article.download()
    try:
        article.parse()
    except Exception as e:
        print(e)
        continue
    texts.append(article)

NameError: name 'urls_big' is not defined

The syntax is just as before, where we only had 2 urls. However, we add some exception handling here. We do that because some news outlets, e.g. Forbes, don't like what we are doing here and will try to block us. When this happens, our loop would usually break with an error. To prevent that, we add the try-except statement, which will attempt to do what we want but skip to the next url in case an error occurs.

More on that here: https://www.datacamp.com/community/tutorials/exception-handling-python
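
The skip-on-error pattern can be tried without any network access. In this sketch, `fake_fetch` is a hypothetical stand-in for the download-and-parse step: it fails on purpose for any url containing "blocked". The urls are placeholders as well.

```python
def fake_fetch(url):
    """Hypothetical stand-in for download + parse; fails for 'blocked' urls."""
    if "blocked" in url:
        raise ValueError(f"could not fetch {url}")
    return f"text of {url}"

demo_urls = [
    "https://example.com/ok-1",
    "https://example.com/blocked",
    "https://example.com/ok-2",
]
fetched, failed = [], []

for url in demo_urls:
    try:
        text = fake_fetch(url)
    except ValueError as error:
        failed.append(url)  # remember which url caused trouble ...
        print(error)
        continue            # ... and move on to the next url
    fetched.append(text)    # only reached when no error occurred

print(len(fetched), "fetched,", len(failed), "failed")
```

Keeping a list of the failed urls is optional, but it makes it easy to see afterwards how many articles were skipped.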

In [13]:
# texts is a list of Article objects that contain not only the text but also other meta-information
# let's make sure that only the text is left
texts = [x.text for x in texts]

In [14]:
# quick check of how long they are
len(texts)

0

In [15]:
# download the medium-size model if you work on your own computer or Google Colab (or elsewhere),
# and then load 'en_core_web_md' instead of 'en_core_web_lg' below; for now we comment that out
# because Kaggle has us covered with the large model
#!python -m spacy download en_core_web_md

In [16]:
# Introducing spacy

import spacy #load the library
nlp = spacy.load('en_core_web_lg') #load the (large English) model

More info on spaCy: https://spacy.io/
and https://nlpforhackers.io/complete-guide-to-spacy/

In [17]:
# Let's try out some stuff

# produce 3 sentences
sen1 = "The weather today is cold and Donald Trump is fun."
sen2 = "It's sunny and im HAPPY"
sen3 = "Everyone is bored and cold"

In [18]:
# Let spacy read and annotate them

AI_sen1 = nlp(sen1)
AI_sen2 = nlp(sen2)
AI_sen3 = nlp(sen3)

In [19]:
# Getting the 2nd entity type of the first sentence
AI_sen1.ents[1].label_

'PERSON'
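
Beyond picking out one entity by its index, it can help to see all entities grouped by their label. The sketch below only relies on the `.ents`, `.text` and `.label_` attributes used above. So that it runs without a downloaded model, it uses a hypothetical stand-in object instead of a real annotated sentence; the 'DATE' label for "today" is an assumption about what a model might return, not an output taken from this tutorial.

```python
from collections import defaultdict
from types import SimpleNamespace

def group_entities_by_label(doc):
    """Map each entity label to the list of entity texts carrying it."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)

# Hypothetical stand-in for an annotated sentence (assumed labels, no model needed).
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="today", label_="DATE"),
    SimpleNamespace(text="Donald Trump", label_="PERSON"),
])

grouped = group_entities_by_label(fake_doc)
print(grouped)
```

With the model loaded, the same helper could be called on an annotated sentence such as AI_sen1.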

In [20]:
# let's have it read one of our articles

AI_texts_0 = nlp(texts[0])

IndexError: list index out of range

In [21]:
# Make a list of the texts of all entities in text 0 that are labelled as a person

[ent.text for ent in AI_texts_0.ents if ent.label_ == 'PERSON']

NameError: name 'AI_texts_0' is not defined

In [22]:
# let's extract all location, person and organisation entities (GPE, PERSON, ORG) into an empty container

container = []

for article in texts: # take an article
    article_nlp = nlp(article) # read it
    # keep each entity's text together with its label so that we can count per type later
    entities = [(ent.text, ent.label_) for ent in article_nlp.ents if ent.label_ in ('GPE', 'PERSON', 'ORG')]
    container.extend(entities) # drop them into the "container"

In [23]:
people = Counter(text for text, label in container if label == 'PERSON') # count the persons in the container
people.most_common(100) # show the 100 most common

[]

In [24]:
org = Counter(text for text, label in container if label == 'ORG') # count the organisations
org.most_common(100)

[]

In [25]:
gpe = Counter(text for text, label in container if label == 'GPE') # count the locations
gpe.most_common(100)

[]
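
Because the article list above ended up empty, the counters have nothing to show. To see what Counter and most_common do when there is data, here is a small sketch on a made-up list of mentions (the names are placeholders, not results from the articles).

```python
from collections import Counter

# Made-up mentions, only to demonstrate the counting step.
mentions = ["China", "US", "China", "Japan", "China", "US"]

counts = Counter(mentions)        # maps each item to its number of occurrences
top_two = counts.most_common(2)   # the two most frequent items, most frequent first

print(top_two)  # [('China', 3), ('US', 2)]
```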