<a href="https://colab.research.google.com/github/olivermueller/amlta2021/blob/main/Session_01/1_02_Conditional_word_counting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>



In [None]:
# Set up Google Drive

from google.colab import drive

drive.mount('/content/gdrive')

%cd /content/gdrive/MyDrive/Colab Notebooks/AMLTA2021/Session_01

!pip install pymysql

# <font color="#003660">Week 1: Basics of Natural Language Processing</font>

# <font color="#003660">Notebook 2: Conditional Word Counting</font>

<center><br><img width=256 src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you will be able to...</b><br><br>
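        ... count word frequencies conditional on document metadata (e.g., year of publication), and<br>
        ... plot the occurrences of a word over time.</font>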
</div>
</center>
</p>

# Import packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `NLTK` is a leading platform for building Python programs to work with human language data.
- `altair` is a visualization library based on the grammar of graphics.

In [None]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import altair as alt

To work with the `NLTK` package, you also need to download some additional data (e.g., stopword lists).

In [None]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

# Load documents
This time, we want to analyze documents with regard to some metadata (here, the year of publication). Each document is stored as a dictionary with two keys (`text` and `year`). The corpus is stored as a list of these dictionaries.

In [None]:
corpus = [
    {"text":"Hello World", "year":2015},
    {"text":"How are you today?", "year":2015},
    {"text":"The world is nice", "year":2016},
    {"text":"The weather is also nice", "year":2016},
    {"text":"Yesterday, the weather was also nice", "year":2017},
    {"text":"I own two bicycles", "year":2017},
    {"text":"I love to ride my bicycle", "year":2018}
]

In [None]:
corpus

[{'text': 'Hello World', 'year': 2015},
 {'text': 'How are you today?', 'year': 2015},
 {'text': 'The world is nice', 'year': 2016},
 {'text': 'The weather is also nice', 'year': 2016},
 {'text': 'Yesterday, the weather was also nice', 'year': 2017},
 {'text': 'I own two bicycles', 'year': 2017},
 {'text': 'I love to ride my bicycle', 'year': 2018}]

# Preprocess documents

We make a shallow copy of the corpus list and iterate over its entries to add a `tokens` field.

In [None]:
docs_tokenized = corpus[:]
for i, entry in enumerate(docs_tokenized):
    entry["tokens"] = nltk.word_tokenize(entry["text"])
docs_tokenized

[{'text': 'Hello World', 'tokens': ['Hello', 'World'], 'year': 2015},
 {'text': 'How are you today?',
  'tokens': ['How', 'are', 'you', 'today', '?'],
  'year': 2015},
 {'text': 'The world is nice',
  'tokens': ['The', 'world', 'is', 'nice'],
  'year': 2016},
 {'text': 'The weather is also nice',
  'tokens': ['The', 'weather', 'is', 'also', 'nice'],
  'year': 2016},
 {'text': 'Yesterday, the weather was also nice',
  'tokens': ['Yesterday', ',', 'the', 'weather', 'was', 'also', 'nice'],
  'year': 2017},
 {'text': 'I own two bicycles',
  'tokens': ['I', 'own', 'two', 'bicycles'],
  'year': 2017},
 {'text': 'I love to ride my bicycle',
  'tokens': ['I', 'love', 'to', 'ride', 'my', 'bicycle'],
  'year': 2018}]
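
One caveat about the copy above: `corpus[:]` creates a new list, but the dictionaries inside it are still shared with `corpus`, so adding a `tokens` field also changes the original entries. The sketch below contrasts this with `copy.deepcopy` on a two-document stand-in corpus (the names `small_corpus`, `shallow`, and `deep` are made up for this illustration).

```python
import copy

small_corpus = [
    {"text": "Hello World", "year": 2015},
    {"text": "The world is nice", "year": 2016},
]

# Slicing copies the list, but both lists point to the same dictionaries.
shallow = small_corpus[:]
shallow[0]["tokens"] = ["Hello", "World"]
print("tokens" in small_corpus[0])  # True: the original entry changed too

# deepcopy duplicates the dictionaries as well, so the original stays untouched.
deep = copy.deepcopy(small_corpus)
deep[1]["tokens"] = ["The", "world", "is", "nice"]
print("tokens" in small_corpus[1])  # False
```

For this notebook the shared dictionaries are harmless, because every step works on the same entries anyway. A deep copy only matters if the untouched corpus is needed later.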

Next, we iterate over the corpus again to convert all tokens to lowercase.

In [None]:
docs_tokenized_lower = docs_tokenized[:]
for i,entry in enumerate(docs_tokenized_lower):
    tokens_lower = []
    for token in entry["tokens"]:
        tokens_lower.append(token.lower())
    entry["tokens"] = tokens_lower
docs_tokenized_lower

[{'text': 'Hello World', 'tokens': ['hello', 'world'], 'year': 2015},
 {'text': 'How are you today?',
  'tokens': ['how', 'are', 'you', 'today', '?'],
  'year': 2015},
 {'text': 'The world is nice',
  'tokens': ['the', 'world', 'is', 'nice'],
  'year': 2016},
 {'text': 'The weather is also nice',
  'tokens': ['the', 'weather', 'is', 'also', 'nice'],
  'year': 2016},
 {'text': 'Yesterday, the weather was also nice',
  'tokens': ['yesterday', ',', 'the', 'weather', 'was', 'also', 'nice'],
  'year': 2017},
 {'text': 'I own two bicycles',
  'tokens': ['i', 'own', 'two', 'bicycles'],
  'year': 2017},
 {'text': 'I love to ride my bicycle',
  'tokens': ['i', 'love', 'to', 'ride', 'my', 'bicycle'],
  'year': 2018}]
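
The same step can be written more compactly with a list comprehension. The following sketch builds new dictionaries instead of modifying the existing ones; `docs` is a shortened stand-in for `docs_tokenized`, and `docs_lower` is a name chosen for this example only.

```python
docs = [
    {"text": "Hello World", "year": 2015, "tokens": ["Hello", "World"]},
    {"text": "How are you today?", "year": 2015,
     "tokens": ["How", "are", "you", "today", "?"]},
]

# Build a new dictionary per document; only the tokens field is replaced.
docs_lower = [
    {**entry, "tokens": [token.lower() for token in entry["tokens"]]}
    for entry in docs
]

print(docs_lower[0]["tokens"])  # ['hello', 'world']
print(docs[0]["tokens"])        # ['Hello', 'World'] (the input is unchanged)
```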

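Next, we lemmatize all tokens. Called without a part-of-speech tag, `WordNetLemmatizer.lemmatize` treats every token as a noun, which is why `bicycles` becomes `bicycle` but `was` is wrongly reduced to `wa`.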

In [None]:
lemmatizer = WordNetLemmatizer()

docs_tokenized_lower_lemmatized = docs_tokenized_lower[:]
for i,entry in enumerate(docs_tokenized_lower_lemmatized):
    tokens_lemmatized = []
    for token in entry["tokens"]:
        tokens_lemmatized.append(lemmatizer.lemmatize(token))
    entry["tokens"] = tokens_lemmatized
docs_tokenized_lower_lemmatized

[{'text': 'Hello World', 'tokens': ['hello', 'world'], 'year': 2015},
 {'text': 'How are you today?',
  'tokens': ['how', 'are', 'you', 'today', '?'],
  'year': 2015},
 {'text': 'The world is nice',
  'tokens': ['the', 'world', 'is', 'nice'],
  'year': 2016},
 {'text': 'The weather is also nice',
  'tokens': ['the', 'weather', 'is', 'also', 'nice'],
  'year': 2016},
 {'text': 'Yesterday, the weather was also nice',
  'tokens': ['yesterday', ',', 'the', 'weather', 'wa', 'also', 'nice'],
  'year': 2017},
 {'text': 'I own two bicycles',
  'tokens': ['i', 'own', 'two', 'bicycle'],
  'year': 2017},
 {'text': 'I love to ride my bicycle',
  'tokens': ['i', 'love', 'to', 'ride', 'my', 'bicycle'],
  'year': 2018}]

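Finally, we iterate over the corpus one last time to remove stopwords. Note that the cleaned tokens are stored in the `text` field, replacing the original string, while the `tokens` field keeps the lemmatized tokens including stopwords.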

In [None]:
docs_tokenized_lower_lemmatized_cleaned = docs_tokenized_lower_lemmatized[:]
for i,entry in enumerate(docs_tokenized_lower_lemmatized_cleaned):
    tokens_cleaned = []
    for token in entry["tokens"]:
        if (token not in stopwords.words('english')):
            tokens_cleaned.append(token)
    entry["text"] = tokens_cleaned
docs_tokenized_lower_lemmatized_cleaned

[{'text': ['hello', 'world'], 'tokens': ['hello', 'world'], 'year': 2015},
 {'text': ['today', '?'],
  'tokens': ['how', 'are', 'you', 'today', '?'],
  'year': 2015},
 {'text': ['world', 'nice'],
  'tokens': ['the', 'world', 'is', 'nice'],
  'year': 2016},
 {'text': ['weather', 'also', 'nice'],
  'tokens': ['the', 'weather', 'is', 'also', 'nice'],
  'year': 2016},
 {'text': ['yesterday', ',', 'weather', 'wa', 'also', 'nice'],
  'tokens': ['yesterday', ',', 'the', 'weather', 'wa', 'also', 'nice'],
  'year': 2017},
 {'text': ['two', 'bicycle'],
  'tokens': ['i', 'own', 'two', 'bicycle'],
  'year': 2017},
 {'text': ['love', 'ride', 'bicycle'],
  'tokens': ['i', 'love', 'to', 'ride', 'my', 'bicycle'],
  'year': 2018}]
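
Membership tests against a list scan it element by element, whereas a set lookup takes roughly constant time, so it can pay off to convert the stopword list to a set once before the loop. A minimal sketch, in which `stop_words` is a tiny hand-written stand-in for `set(stopwords.words('english'))`:

```python
# Hypothetical miniature stopword set; in the notebook this would be
# set(stopwords.words('english')), built once before the loop.
stop_words = {"i", "the", "is", "to", "my", "own"}

tokens = ["i", "love", "to", "ride", "my", "bicycle"]

# Keep only the tokens that are not stopwords.
tokens_cleaned = [token for token in tokens if token not in stop_words]

print(tokens_cleaned)  # ['love', 'ride', 'bicycle']
```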

# Conditional word counting
We count words separately for each condition, that is, for each year. This time we do this "by hand": we iterate through all documents and tokens and increment the token count for the respective condition.

In [None]:
cfreq = nltk.ConditionalFreqDist()

for doc in docs_tokenized_lower_lemmatized_cleaned:
    for token in doc["text"]:
        condition = doc["year"]
        cfreq[condition][token] += 1
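
As an alternative to the explicit loop, `ConditionalFreqDist` also accepts an iterable of `(condition, sample)` pairs in its constructor. The sketch below assumes a shortened stand-in corpus called `docs`; `pairs` and `cfreq_pairs` are names chosen for this example.

```python
import nltk

# Shortened stand-in for docs_tokenized_lower_lemmatized_cleaned.
docs = [
    {"text": ["hello", "world"], "year": 2015},
    {"text": ["world", "nice"], "year": 2016},
    {"text": ["weather", "also", "nice"], "year": 2016},
]

# Each pair is (condition, sample), here (year, token).
pairs = [(doc["year"], token) for doc in docs for token in doc["text"]]
cfreq_pairs = nltk.ConditionalFreqDist(pairs)

print(cfreq_pairs[2016]["nice"])         # 2
print(sorted(cfreq_pairs.conditions()))  # [2015, 2016]
```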

Print the frequency distributions for all conditions.

In [None]:
cfreq

ConditionalFreqDist(nltk.probability.FreqDist,
                    {2015: FreqDist({'?': 1,
                               'hello': 1,
                               'today': 1,
                               'world': 1}),
                     2016: FreqDist({'also': 1,
                               'nice': 2,
                               'weather': 1,
                               'world': 1}),
                     2017: FreqDist({',': 1,
                               'also': 1,
                               'bicycle': 1,
                               'nice': 1,
                               'two': 1,
                               'wa': 1,
                               'weather': 1,
                               'yesterday': 1}),
                     2018: FreqDist({'bicycle': 1, 'love': 1, 'ride': 1})})

Print the frequency distribution for the year 2017.

In [None]:
cfreq[2017]

FreqDist({',': 1,
          'also': 1,
          'bicycle': 1,
          'nice': 1,
          'two': 1,
          'wa': 1,
          'weather': 1,
          'yesterday': 1})
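
To see what kind of data structure this is: `FreqDist` is a subclass of `collections.Counter`, and `ConditionalFreqDist` is a subclass of `collections.defaultdict`. The same counts can therefore be approximated with the standard library alone. A rough sketch on a shortened stand-in corpus (`docs` and `counts_by_year` are illustrative names):

```python
from collections import Counter, defaultdict

# Shortened stand-in for the cleaned corpus.
docs = [
    {"text": ["hello", "world"], "year": 2015},
    {"text": ["world", "nice"], "year": 2016},
    {"text": ["weather", "also", "nice"], "year": 2016},
]

# One Counter per year, created on first access.
counts_by_year = defaultdict(Counter)
for doc in docs:
    counts_by_year[doc["year"]].update(doc["text"])

print(counts_by_year[2016].most_common(1))  # [('nice', 2)]
print(counts_by_year[2015]["bicycle"])      # 0 (unseen words count as zero)
```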

# Time series of word occurrences

For all years from 2010 to 2019, get the frequency of the word "world". Years without any documents simply yield a count of 0.

In [None]:
word = "world"
years = range(2010,2020)
occurences = []
for year in years:
    occurences.append(cfreq[year][word])

Print the resulting time series.

In [None]:
occurences

[0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
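
One side effect is worth knowing: because `ConditionalFreqDist` behaves like a `defaultdict`, indexing it with a year that has no documents creates an empty entry for that year. If the set of conditions should stay unchanged, the lookup can be guarded with a membership test. A small sketch with a hand-built distribution (the pairs below are a reduced stand-in for the real counts, and the variable name `occurrences` is chosen for this example):

```python
import nltk

# Reduced stand-in for the cfreq object built earlier.
cfreq = nltk.ConditionalFreqDist([(2015, "world"), (2016, "world"), (2016, "nice")])

word = "world"
years = range(2010, 2020)

# Only index into cfreq for years that actually exist as conditions.
occurrences = [cfreq[year][word] if year in cfreq else 0 for year in years]

print(occurrences)                 # [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
print(sorted(cfreq.conditions()))  # [2015, 2016]
```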

Merge the years and the word occurrences into one DataFrame.

In [None]:
timeseries = pd.DataFrame(list(zip(years, occurences)),
              columns=['years','count'])
timeseries['years'] = pd.to_datetime(timeseries['years'], format='%Y')
timeseries

       years  count
0 2010-01-01      0
1 2011-01-01      0
2 2012-01-01      0
3 2013-01-01      0
4 2014-01-01      0
5 2015-01-01      1
6 2016-01-01      1
7 2017-01-01      0
8 2018-01-01      0
9 2019-01-01      0
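
To compare several words, one option is a "long" DataFrame with one row per word and year. The sketch below hard-codes the counts for "world" and "nice" as read from the conditional frequency distribution printed earlier; the names `counts`, `records`, and `timeseries_long` are assumptions made for this example.

```python
import pandas as pd

# Counts per year, taken from the conditional frequency distribution above.
counts = {
    "world": {2015: 1, 2016: 1},
    "nice": {2016: 2, 2017: 1},
}
years = range(2010, 2020)

# One record per (word, year); years without a count get 0.
records = [
    {"years": year, "word": word, "count": by_year.get(year, 0)}
    for word, by_year in counts.items()
    for year in years
]

timeseries_long = pd.DataFrame(records)
# Convert the integer years to strings first so that '%Y' parses them as years.
timeseries_long["years"] = pd.to_datetime(
    timeseries_long["years"].astype(str), format="%Y"
)

print(timeseries_long[timeseries_long["count"] > 0])
```

A table in this shape can be passed to a plotting library with the `word` column mapped to color, which gives one line per word.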


Plot the time series.

In [None]:
alt.Chart(timeseries).mark_line().encode(
    x='years',
    y='count'
).interactive()