# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Week 1: Introduction to Natural Language Processing</font>

# <font color="#003660">Notebook 1: Annotating and Exploring Texts with spaCy</font>

<center><br><img width=256 src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you will be able to...</b><br><br>
        ... perform NLP preprocessing with spaCy.</font>
</div>
</center>
</p>

# Import packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `NLTK` is a leading platform for building Python programs to work with human language data.
- `spaCy` is a library for industrial-strength natural language processing.
- `SQLAlchemy`, together with `pymysql`, allows us to communicate with SQL databases.
- `getpass` provides a function for entering passwords without echoing them to the screen.
- `altair` is a visualization library based on the grammar of graphics.

In [None]:
# Install missing packages
!pip install pymysql

Collecting pymysql
  Downloading PyMySQL-1.1.0-py3-none-any.whl (44 kB)
Installing collected packages: pymysql
Successfully installed pymysql-1.1.0


In [None]:
import pandas as pd
import nltk
import spacy
from spacy import displacy
from sqlalchemy import create_engine
import getpass
import altair as alt

# Quick tour of spaCy

spaCy is an open-source library for Natural Language Processing (NLP) in Python. It helps you build NLP applications that process and understand large volumes of unstructured text. One of spaCy's main features is its linguistic annotations, which give you insight into a text's grammatical structure (e.g., word order, types of words, parts of speech, grammatical roles and relations).

At the center of spaCy is the processing pipeline, an object that is usually called `nlp`. The pipeline is built on top of a language-specific machine learning model and a set of handcrafted rules.

The pipeline contains different components, each specialized for a specific NLP task.

[More...](https://spacy.io/usage/spacy-101#whats-spacy)
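To make the idea of pipeline components concrete, here is a minimal sketch that assembles a pipeline by hand instead of loading a trained model. It assumes spaCy v3 and uses `spacy.blank("en")`, which contains only the tokenizer, plus the rule-based `sentencizer` component. The names `toy_nlp`, `toy_doc`, and `toy_sentences` are illustrative and not part of this notebook.

```python
import spacy

# Start from an empty English pipeline: tokenizer only, no trained components.
toy_nlp = spacy.blank("en")
print(toy_nlp.pipe_names)  # no components yet

# Add a rule-based sentence splitter as the only component.
toy_nlp.add_pipe("sentencizer")
print(toy_nlp.pipe_names)

# Calling the pipeline runs the tokenizer and then every component in order.
toy_doc = toy_nlp("Yesterday, I went to five pubs in Oxford. It was fun.")
toy_sentences = [sent.text for sent in toy_doc.sents]
print(toy_sentences)
```

A trained pipeline works the same way but ships with more components; once one is loaded, its `pipe_names` attribute lists them.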

<center><br><img src="https://d33wubrfki0l68.cloudfront.net/3ad0582d97663a1272ffc4ccf09f1c5b335b17e9/7f49c/pipeline-fde48da9b43661abcdf62ab70a546d71.svg"/><br></center>

The following code creates a pipeline based on the `en_core_web_sm` model and assigns it to the variable `nlp`.

In [None]:
nlp = spacy.load("en_core_web_sm")

Let's pass a simple sentence to the `nlp` object. When you process a text with the `nlp` object, spaCy returns a `Doc` object, which we store in the variable `doc`. The `doc` lets you access information about the text in a structured way.

In [None]:
doc = nlp(u"Yesterday, I went to five pubs in Oxford. It was fun.")

The first component of every spaCy pipeline is the `tokenizer`, which segments an unstructured text into words, punctuation, and so on. These `tokens` are the main contents of the `doc` object. [More...](https://spacy.io/usage/spacy-101#annotations-token)



In [None]:
for token in doc:
  print(token.text)

Yesterday
,
I
went
to
five
pubs
in
Oxford
.
It
was
fun
.


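The tokens contain many useful attributes: `text` is the token's string, `i` is its position in the doc, `idx` is the character offset of its first character in the original text, and `is_sent_start` / `is_sent_end` indicate whether the token begins or ends a sentence.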

In [None]:
print(doc[0].text)
print(doc[0].i)
print(doc[3].idx)
print(doc[1].is_sent_start)
print(doc[9].is_sent_end)

Yesterday
0
13
False
True
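The difference between `i` (token position) and `idx` (character offset) is easy to mix up. The following sketch, which assumes spaCy v3 and uses a tokenizer-only pipeline named `toy_nlp` (an illustrative name), shows that slicing the original string at `idx` recovers the token.

```python
import spacy

# Tokenizer-only pipeline; no trained model is needed for offsets.
toy_nlp = spacy.blank("en")
text = "Yesterday, I went to five pubs in Oxford. It was fun."
toy_doc = toy_nlp(text)

went = toy_doc[3]
print(went.text, went.i, went.idx)

# token.idx is a character offset into the original string,
# so slicing the text at that offset recovers the token.
recovered = text[went.idx : went.idx + len(went)]
print(recovered)
```

Character offsets are useful whenever annotations have to be mapped back onto the raw text, for example for highlighting.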


spaCy has also recognized that the doc consists of two sentences. We can access them via `doc.sents`. Because `doc.sents` is a generator, we wrap it in `list` before indexing into it.

In [None]:
doc.sents

<generator at 0x787e7e4cbb00>

In [None]:
list(doc.sents)

[Yesterday, I went to five pubs in Oxford., It was fun.]

In [None]:
list(doc.sents)[1]

It was fun.

We can also iterate through all tokens of a doc and access their attributes. For example, we can access the `lemma` of each token. [More...](https://spacy.io/usage/linguistic-features#lemmatization)

In [None]:
for token in doc:
  print(f"{token.text} -> {token.lemma_}")

Yesterday -> yesterday
, -> ,
I -> I
went -> go
to -> to
five -> five
pubs -> pub
in -> in
Oxford -> Oxford
. -> .
It -> it
was -> be
fun -> fun
. -> .


`Part-of-speech` (POS) tagging is the process of labeling each word in a text with its part of speech, based on both its definition and its context. Examples of parts of speech include nouns, verbs, adjectives, and adverbs. [More...](https://spacy.io/usage/linguistic-features#pos-tagging)

In [None]:
for token in doc:
  print(f"{token.text} -> {token.pos_} ({spacy.explain(token.pos_)})")

Yesterday -> NOUN (noun)
, -> PUNCT (punctuation)
I -> PRON (pronoun)
went -> VERB (verb)
to -> ADP (adposition)
five -> NUM (numeral)
pubs -> NOUN (noun)
in -> ADP (adposition)
Oxford -> PROPN (proper noun)
. -> PUNCT (punctuation)
It -> PRON (pronoun)
was -> AUX (auxiliary)
fun -> ADJ (adjective)
. -> PUNCT (punctuation)


We can also recognize so-called named entities. A `named entity` is a "real-world object" that is assigned a name, for example a person, a country, or a product. spaCy can recognize various types of named entities in a document. [More...](https://spacy.io/usage/linguistic-features#named-entities)

In [None]:
for ent in doc.ents:
  print(ent.text, "->", ent.label_, "(", spacy.explain(ent.label_), ")")

Yesterday -> DATE ( Absolute or relative dates or periods )
five -> CARDINAL ( Numerals that do not fall under another type )
Oxford -> GPE ( Countries, cities, states )
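Each entity in `doc.ents` is a span that carries a label and character offsets. The sketch below illustrates this without the statistical recognizer used above: it swaps in spaCy's rule-based `entity_ruler` with a single hand-written pattern, an assumption made so that the example runs without a trained model (spaCy v3 API). The names `toy_nlp`, `ruler`, and `found` are illustrative.

```python
import spacy

toy_nlp = spacy.blank("en")

# Rule-based stand-in for a trained NER component (not the notebook's model).
ruler = toy_nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "Oxford"}])

text = "Yesterday, I went to five pubs in Oxford. It was fun."
toy_doc = toy_nlp(text)

# Each entity is a Span with a label and character offsets into the text.
found = [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in toy_doc.ents]
print(found)
```

A rule-based component like this only finds what its patterns describe, whereas a trained recognizer generalizes to names it has not seen.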


spaCy also has some nice visualization features: `displacy` can render the named entities and the dependency parse of a doc.

In [None]:
displacy.render(doc, style='ent', jupyter=True)

In [None]:
displacy.render(list(doc.sents)[0], style='dep', jupyter=True)

# Exploring song lyrics


## Load documents
This time, we load our data from a MySQL database. For security reasons, we don't store the database credentials here; please check Panda to find out how to get them.

In [None]:
# Get credentials
user = input("Username: ")
passwd = getpass.getpass("Password: ")
server = input("Server: ")
db = input("Database: ")

# Create an engine instance (SQLAlchemy)
engine = create_engine("mysql+pymysql://{}:{}@{}/{}".format(user, passwd, server, db))

# Define SQL query
sql_query = "SELECT * FROM BillboardLyrics"

# Query dataset (pandas)
corpus = pd.read_sql(sql=sql_query, con=engine)

# Sample
corpus.head()

Username: student
Password: ··········
Server: manila.uni-paderborn.de
Database: aml4ta


| | index | Rank | Song | Artist | Year | Lyrics | Source |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | wooly bully | sam the sham and the pharaohs | 1965 | sam the sham miscellaneous wooly bully wooly b... | 3 |
| 1 | 1 | 2 | i cant help myself sugar pie honey bunch | four tops | 1965 | sugar pie honey bunch you know that i love you... | 1 |
| 2 | 2 | 3 | i cant get no satisfaction | the rolling stones | 1965 | | 1 |
| 3 | 3 | 4 | you were on my mind | we five | 1965 | when i woke up this morning you were on my min... | 1 |
| 4 | 4 | 5 | youve lost that lovin feelin | the righteous brothers | 1965 | you never close your eyes anymore when i kiss ... | 1 |
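One caveat when a connection URL is assembled from user input: characters such as `@`, `/`, or `:` in a password would be read as URL delimiters. A common remedy is to percent-encode the credentials first. The sketch below uses only the standard library; all credential values are hypothetical and chosen purely for illustration.

```python
from urllib.parse import quote_plus

# Hypothetical credentials, for illustration only.
user = "student"
passwd = "p@ss/w:rd"
server = "db.example.org"
db = "exampledb"

# Percent-encode user name and password so that '@', '/' and ':' are not
# mistaken for URL delimiters.
url = f"mysql+pymysql://{quote_plus(user)}:{quote_plus(passwd)}@{server}/{db}"
print(url)
```

The resulting string can be passed to `create_engine` in place of the unescaped one.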


## Preprocess documents
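We tokenize, remove stop words, and lemmatize in one go: for each song, we keep the lemma of every token that is purely alphabetic and not a stop word. Songs without lyrics get an empty token list.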

In [None]:
# Convert the DataFrame into a list of dicts, one per song
docs_prep = corpus.to_dict("records")
for entry in docs_prep:
  if entry["Lyrics"]:
    doc = nlp(entry["Lyrics"])
    # Keep the lemmas of alphabetic tokens that are not stop words
    entry["Lyrics_prep"] = [
      token.lemma_ for token in doc if token.is_alpha and not token.is_stop
    ]
  else:
    # Missing or empty lyrics
    entry["Lyrics_prep"] = []

In [None]:
docs_prep[42]

{'index': 42,
 'Rank': 43,
 'Song': 'ferry cross the mersey',
 'Artist': 'gerry and the pacemakers',
 'Year': 1965,
 'Lyrics': 'gerry miscellaneous ferry cross the mersey ferry cross the mersey gerry and pace makers gerry marsden life goes on day after day hearts torn in every way so ferry cross the mersey cause this lands the place i love and here ill stay people they rush everywhere each with their own secret care so ferry cross the mersey and always take me there the place i love people around every corner they seem to smile and say we dont care what your name is boy well never turn you away so ill continue to say here i always will stay so ferry cross the mersey cause this lands the place i love and here ill stay and here ill stay here ill stay',
 'Source': '3',
 'Lyrics_prep': ['gerry',
  'miscellaneous',
  'ferry',
  'cross',
  'mersey',
  'ferry',
  'cross',
  'mersey',
  'gerry',
  'pace',
  'maker',
  'gerry',
  'marsden',
  'life',
  'go',
  'day',
  'day',
  'heart',
  'tear
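To see what the filter in the preprocessing loop above keeps and drops, here is a small sketch on a single sentence. It assumes spaCy v3 and a tokenizer-only pipeline (`toy_nlp`, an illustrative name). Stop word flags are available there, but lemmas require a trained model, so the sketch keeps the raw token text instead of the lemma.

```python
import spacy

# Tokenizer-only pipeline: stop word flags work, lemmas need a trained model.
toy_nlp = spacy.blank("en")
toy_doc = toy_nlp("Yesterday, I went to 5 pubs in Oxford.")

# Same condition as in the preprocessing loop, applied to the raw token text.
kept = [t.text for t in toy_doc if t.is_alpha and not t.is_stop]
dropped = [t.text for t in toy_doc if not (t.is_alpha and not t.is_stop)]
print(kept)
print(dropped)
```

Punctuation and digits fail the `is_alpha` test, while function words such as "to" and "in" are removed by `is_stop`.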

## Counting words
We count words separately for each condition, that is, for each year. We do this "by hand": we iterate through all docs and tokens and increment the token's count for the respective condition.

In [None]:
cfreq = nltk.ConditionalFreqDist()

for doc in docs_prep:
  for token in doc["Lyrics_prep"]:
    condition = doc["Year"]
    cfreq[condition][token] += 1

In [None]:
cfreq[2010]

FreqDist({'not': 655, 'm': 641, 'like': 519, 'oh': 384, 'love': 376, 'know': 334, 'baby': 311, 'get': 295, 'go': 282, 'yeah': 242, ...})
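The structure behind this is a mapping from each condition to its own counter. As an illustration, the same per-year counting can be sketched with the standard library alone, using a `defaultdict` of `Counter` objects as a stand-in for NLTK's class. The toy data and the names `toy_docs` and `counts_by_year` are hypothetical.

```python
from collections import Counter, defaultdict

# Toy stand-in for docs_prep: one dict per song (hypothetical values).
toy_docs = [
    {"Year": 1965, "Lyrics_prep": ["love", "baby", "love"]},
    {"Year": 1965, "Lyrics_prep": ["love", "money"]},
    {"Year": 2010, "Lyrics_prep": ["money", "money", "yeah"]},
]

# One Counter per condition (year), created on first access.
counts_by_year = defaultdict(Counter)
for song in toy_docs:
    counts_by_year[song["Year"]].update(song["Lyrics_prep"])

print(counts_by_year[1965].most_common(2))
print(counts_by_year[2010]["money"])
```

As with the NLTK version, looking up a word that never occurred in a year returns 0 rather than raising an error.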

## Time series of word counts

For every year from 1965 to 2015, look up how often the word "money" occurs in that year's lyrics.

In [None]:
word = u"money"
years = range(1965,2016)
occurences = []
for year in years:
  occurences.append(cfreq[year][word])
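Note that these are raw counts, which also grow when a year simply contributes more text to the corpus. One way to make years comparable is the relative frequency: the word's count divided by the total number of counted tokens in that year. The sketch below uses toy numbers; `toy_counts` and the helper `relative_frequency` are hypothetical and not part of this notebook.

```python
from collections import Counter

# Toy per-year counts standing in for cfreq (hypothetical values).
toy_counts = {
    1965: Counter({"love": 30, "money": 2, "baby": 8}),
    2010: Counter({"love": 20, "money": 10, "yeah": 70}),
}

def relative_frequency(counts, word):
    """Share of all counted tokens that are `word`; 0.0 for an empty counter."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

rel = {year: relative_frequency(c, "money") for year, c in toy_counts.items()}
print(rel)
```

With NLTK's frequency distributions, the `freq` method should give the same ratio, for example `cfreq[year].freq(word)`.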

Merge the years and the word occurrences into one dataframe.

In [None]:
timeseries = pd.DataFrame(list(zip(years, occurences)),
              columns=['years','count'])
timeseries['years'] = pd.to_datetime(timeseries['years'], format='%Y')

Plot the time series.

In [None]:
alt.Chart(timeseries).mark_line().encode(
    x='years',
    y='count'
).interactive()