<a href="https://colab.research.google.com/github/olivermueller/amlta2021/blob/main/Session_01/1_03_Conditional_word_counting_with_Spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


In [None]:
# Set up Google Drive

from google.colab import drive

drive.mount('/content/gdrive')

%cd /content/gdrive/MyDrive/Colab Notebooks/AMLTA2021/Session_01

!pip install pymysql

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/MyDrive/Colab Notebooks/AMLTA2021/Session_01


# <font color="#003660">Week 1: Basics of Natural Language Processing</font>

# <font color="#003660">Notebook 3: Conditional Word Counting with Spacy</font>

<center><br><img width=256 src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you will be able to...</b><br><br>
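        ... tokenize, lemmatize, tag, and parse text with spaCy,<br>
        ... count words conditional on a category such as the release year, and<br>
        ... plot how often a word occurs over time.</font>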
</div>
</center>
</p>

# Import packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `nltk` (NLTK) is a leading platform for building Python programs that work with human language data.
- `spacy` (spaCy) is a library for industrial-strength natural language processing.
- `json` lets us read and write JSON data.
- `altair` is a visualization library based on the grammar of graphics.

In [None]:
import pandas as pd
import nltk
import spacy
import json
import altair as alt

# Load documents
We load the lyrics of all songs that made it to the Billboard charts between 1964 and 2015. The data is in JSON Lines format (http://jsonlines.org/), where each line holds one JSON object, so we iterate over the file line by line, parse each line, and append the result to a list called `corpus`.

In [None]:
file_path = 'billboard_lyrics_1964-2015.json'
corpus = []
with open(file_path) as f:
    for line in f:
        corpus.append(json.loads(line))
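
For reference, here is a minimal, self-contained sketch of the same line-by-line parsing on an in-memory sample. The two records and their values are made up for illustration; only the keys mirror the ones used in this notebook. The blank-line check is an extra safeguard that the cell above does not need if the file is clean.

```python
import io
import json

# Hypothetical two-record sample in JSON Lines format (one JSON object per line).
sample = io.StringIO(
    '{"Artist": "artist a", "Year": "1965", "Lyrics": "la la la"}\n'
    '\n'
    '{"Artist": "artist b", "Year": "1966", "Lyrics": "na na na"}\n'
)

records = []
for line in sample:
    line = line.strip()
    if not line:  # skip blank lines, which json.loads would reject
        continue
    records.append(json.loads(line))

print(len(records), records[0]["Artist"])
```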

In [None]:
corpus[5000]

{'Artist': 'mark ronson featuring bruno mars',
 'Lyrics': ' this hit that ice cold michelle pfeiffer that white gold this one for them hood girls them good girls straight masterpieces stylin whilen livin it up in the city got chucks on with saint laurent got kiss myself im so prettyim too hot hot damn called a police and a fireman im too hot hot damn make a dragon wanna retire man im too hot hot damn say my name you know who i am im too hot hot damn am i bad bout that money break it downgirls hit your hallelujah whoo girls hit your hallelujah whoo girls hit your hallelujah whoo cause uptown funk gon give it to you cause uptown funk gon give it to you cause uptown funk gon give it to you saturday night and we in the spot dont believe me just watch come ondont believe me just watch uhdont believe me just watch dont believe me just watch dont believe me just watch dont believe me just watch hey hey hey oh    meaning  byamandah   editor    70s girl group the sequence accused bruno mars and

# Test drive Spacy

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
doc = nlp(u"Yesterday, I went to five pubs in Oxford.")

Tokenization

In [None]:
for token in doc:
  print(token.text)

Yesterday
,
I
went
to
five
pubs
in
Oxford
.


Lemmatization

In [None]:
for token in doc:
  print(token.lemma_)

yesterday
,
-PRON-
go
to
five
pub
in
Oxford
.


Part-of-speech Tagging

In [None]:
for token in doc:
  print(token.pos_, spacy.explain(token.pos_))

NOUN noun
PUNCT punctuation
PRON pronoun
VERB verb
ADP adposition
NUM numeral
NOUN noun
ADP adposition
PROPN proper noun
PUNCT punctuation


In [None]:
for token in doc:
  print(token.tag_, spacy.explain(token.tag_))

NN noun, singular or mass
, punctuation mark, comma
PRP pronoun, personal
VBD verb, past tense
IN conjunction, subordinating or preposition
CD cardinal number
NNS noun, plural
IN conjunction, subordinating or preposition
NNP noun, proper singular
. punctuation mark, sentence closer


Dependency Parsing

In [None]:
for token in doc:
  print(token.text, token.dep_, spacy.explain(token.dep_))

Yesterday npadvmod noun phrase as adverbial modifier
, punct punctuation
I nsubj nominal subject
went ROOT None
to prep prepositional modifier
five nummod numeric modifier
pubs pobj object of preposition
in prep prepositional modifier
Oxford pobj object of preposition
. punct punctuation


Named Entity Recognition

In [None]:
for ent in doc.ents:
  print(ent.text, ent.label_, spacy.explain(ent.label_))

Yesterday DATE Absolute or relative dates or periods
five CARDINAL Numerals that do not fall under another type
Oxford ORG Companies, agencies, institutions, etc.


# Preprocess documents with Spacy

We tokenize each song's lyrics, remove stop words, and lemmatize the remaining tokens in a single pass. The result is stored under a new key, `Lyrics_prep`.

In [None]:
# Shallow copy: the entries are shared with `corpus`,
# so the new "Lyrics_prep" key shows up in both lists.
docs_prep = corpus[:]
for entry in docs_prep:
  doc = nlp(entry["Lyrics"])
  # Keep the lemma of every token that is not a stop word.
  entry["Lyrics_prep"] = [token.lemma_ for token in doc if not token.is_stop]
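
To make the filtering step concrete without loading a language model, the following toy sketch mimics it with a hand-written stop word set and lemma table. Both are hypothetical stand-ins, and the whitespace split is a crude substitute for spaCy's tokenizer; spaCy derives `is_stop` and `lemma_` from its own resources.

```python
# Hypothetical stand-ins for spaCy's stop word list and lemmatizer.
STOP_WORDS = {"the", "is", "are", "in", "you"}
LEMMAS = {"lights": "light", "brighter": "bright", "going": "go"}

def preprocess(text):
    """Split on whitespace, drop stop words, and map tokens to their lemma."""
    tokens = text.split()
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]

print(preprocess("the lights are brighter in the city"))
```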

In [None]:
docs_prep[5]

{'Artist': 'petula clark',
 'Lyrics': ' when youre alone and life is making you lonely you can always go downtown when youve got worries all the noise and the hurry seems to help i know downtownjust listen to the music of the traffic in the city linger on the sidewalk where the neon signs are pretty how can you lose the lights are much brighter there you can forget all your troubles forget all your caresso go downtown things will be great when youre downtown no finer place for sure downtown every things waiting for youdont hang around and let your problems surround you there are movie shows downtown maybe you know some little places to go to where they never close downtownjust listen to the rhythm of a gentle bossa nova youll be dancing with em too before the night is over happy again the lights are much brighter there you can forget all your troubles forget all your caresso go downtown where all the lights are bright downtown waiting for you tonight downtown youre gonna be alright now

# Conditional word counting
We count words separately for each condition, that is, for each year. This time we have to do it "by hand": we iterate through all documents and their tokens and increment the token count for the respective condition.

In [None]:
cfreq = nltk.ConditionalFreqDist()

for doc in docs_prep:
    # The condition (the song's year) is the same for all tokens of a document.
    condition = doc["Year"]
    for token in doc["Lyrics_prep"]:
        cfreq[condition][token] += 1
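
The underlying idea is a mapping from each condition to its own word counter. As a rough sketch of that structure using only the standard library (not NLTK's actual implementation), the same counting can be written with `defaultdict` and `Counter`. The toy documents below are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical preprocessed documents with the same keys as docs_prep.
toy_docs = [
    {"Year": "1965", "Lyrics_prep": ["love", "night", "love"]},
    {"Year": "1965", "Lyrics_prep": ["night"]},
    {"Year": "2010", "Lyrics_prep": ["money", "party", "money"]},
]

# One Counter per condition (year); missing years start out empty.
counts_by_year = defaultdict(Counter)
for doc in toy_docs:
    counts_by_year[doc["Year"]].update(doc["Lyrics_prep"])

print(counts_by_year["1965"]["love"])
print(counts_by_year["2010"].most_common(1))
```

As with `ConditionalFreqDist`, looking up a word that never occurred under a condition yields a count of 0 rather than an error.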

In [None]:
cfreq["2010"]

FreqDist({' ': 98,
          'wake': 18,
          'morning': 3,
          'feeling': 13,
          'like': 519,
          'p': 1,
          'diddy': 1,
          'hey': 110,
          'girl': 170,
          'grab': 1,
          'glass': 2,
          'be': 761,
          'door': 8,
          'go': 306,
          'to': 189,
          'hit': 27,
          'city': 18,
          'let': 218,
          'leave': 62,
          'brush': 1,
          'tooth': 3,
          'bottle': 19,
          'jack': 4,
          'cause': 162,
          'night': 74,
          'not': 655,
          'come': 184,
          'backim': 1,
          'talk': 21,
          'pedicure': 1,
          'toe': 7,
          'try': 38,
          'clothe': 10,
          'boy': 65,
          'blow': 20,
          'phone': 10,
          'droptoppe': 1,
          'play': 29,
          'favorite': 6,
          'cd': 2,
          'rollin': 2,
          'party': 29,
          'little': 31,
          'bit': 2,
          'tipsydont': 

# Time series of word occurrences

For all years between 1965 and 2015, get the frequency of the word "money".

In [None]:
word = u"money"
years = range(1965,2016)
occurences = []
for year in years:
  occurences.append(cfreq[str(year)][word])

Merge the years and the word occurrences into one dataframe.

In [None]:
timeseries = pd.DataFrame(list(zip(years, occurences)),
              columns=['years','count'])
timeseries['years'] = pd.to_datetime(timeseries['years'], format='%Y')

Plot the time series.

In [None]:
alt.Chart(timeseries).mark_line().encode(
    x='years',
    y='count'
).interactive()
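
Raw counts also reflect how many tokens each year contributes to the corpus, so a rising line may partly mean more or longer lyrics. One common remedy is to divide each count by the year's total token count. The sketch below shows the arithmetic with plain `Counter` objects and made-up numbers; with the `cfreq` object above, NLTK's `FreqDist.freq()` and `FreqDist.N()` methods should provide the same quantities.

```python
from collections import Counter

# Hypothetical per-year word counts, standing in for cfreq.
toy_counts = {
    "1965": Counter({"love": 30, "money": 2, "night": 8}),
    "2010": Counter({"love": 50, "money": 10, "night": 40}),
}

def relative_frequency(counter, word):
    """Share of all tokens in `counter` that are `word` (0.0 if empty)."""
    total = sum(counter.values())
    return counter[word] / total if total else 0.0

for year, counter in toy_counts.items():
    print(year, relative_frequency(counter, "money"))
```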