<a href="https://colab.research.google.com/github/olivermueller/aml4ta-2021/blob/main/Session_01/1_02_Conditional_word_counting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


In [None]:
# Set up Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
# Install packages
!pip install pymysql

# <font color="#003660">Week 1: Basics of Natural Language Processing</font>

# <font color="#003660">Notebook 2: Conditional Word Counting</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you will be able to...</b><br><br>
        ... load text data from files and databases,<br> 
        ... conduct basic NLP preprocessing (e.g., tokenization, stopword removal, stemming, lemmatization),<br>
        ... calculate corpus statistics (esp. word frequencies), and<br>
        ... calculate and visualize corpus statistics over time.
    </font>
</div>
</center>
</p>

# Import packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `NLTK` is a leading platform for building Python programs to work with human language data.
- `altair` is a visualization library based on the grammar of graphics.

In [None]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import altair as alt

To work with the `NLTK` package, you also need to download some additional data (e.g., stopword lists).

In [None]:
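nltk.download('punkt')      # tokenizer models used by nltk.word_tokenize
nltk.download('punkt_tab')  # recent NLTK releases load the tokenizer from this resource instead
nltk.download('wordnet')    # lexical database used by WordNetLemmatizer
nltk.download('stopwords')  # stopword lists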

# Load documents
This time, we want to analyze documents with respect to some metadata (here, the year of publication). Each document is stored in a dictionary with two keys (`text` and `year`). The corpus is stored as a list of these dictionaries.

In [None]:
corpus = [
    {"text":"Hello World", "year":2015},
    {"text":"How are you today?", "year":2015},
    {"text":"The world is nice", "year":2016},
    {"text":"The weather is also nice", "year":2016},
    {"text":"Yesterday, the weather was also nice", "year":2017},
    {"text":"I own two bicycles", "year":2017},
    {"text":"I love to ride my bicycle", "year":2018}
]

In [None]:
corpus

# Preprocess documents

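We copy the corpus (a list of dictionaries), iterate over its entries, and add a `tokens` field with the tokenized text. Each dictionary is copied individually so that the original `corpus` stays unchanged.

In [None]:
# Copy every entry; corpus.copy() alone would still share the dictionaries with corpus
docs_tokenized = [dict(entry) for entry in corpus]
for entry in docs_tokenized: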
    entry["tokens"] = nltk.word_tokenize(entry["text"])
docs_tokenized
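
Copying a list of dictionaries needs some care, because `list.copy()` only copies the outer list and keeps pointing to the same dictionary objects. The following minimal sketch illustrates the difference between a shallow and a deep copy; `toy_corpus`, `shallow`, and `deep` are hypothetical names used only for this illustration.

```python
import copy

# Hypothetical toy corpus, used only for this illustration
toy_corpus = [{"text": "Hello World", "year": 2015}]

shallow = toy_corpus.copy()        # new list, but the same dictionary objects
deep = copy.deepcopy(toy_corpus)   # new list with independent dictionaries

# Adding a field through the shallow copy also changes the original entry
shallow[0]["tokens"] = ["Hello", "World"]

print("tokens" in toy_corpus[0])   # True: the shallow copy shares its entries
print("tokens" in deep[0])         # False: the deep copy is unaffected
```

Whether sharing entries matters depends on whether you still need the unprocessed version of the corpus later on.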

Next, we iterate over the corpus again to convert all tokens to lowercase.

In [None]:
docs_tokenized_lower = [dict(entry) for entry in docs_tokenized]  # copy each entry, not just the list
for entry in docs_tokenized_lower:
    tokens_lower = []
    for token in entry["tokens"]:
        tokens_lower.append(token.lower())
    entry["tokens"] = tokens_lower
docs_tokenized_lower

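Then we lemmatize all tokens. Called without a part-of-speech argument, `lemmatize` treats every token as a noun, so verb forms are generally not reduced to their base form.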

In [None]:
lemmatizer = WordNetLemmatizer()

docs_tokenized_lower_lemmatized = [dict(entry) for entry in docs_tokenized_lower]  # copy each entry, not just the list
for entry in docs_tokenized_lower_lemmatized:
    tokens_lemmatized = []
    for token in entry["tokens"]:
        tokens_lemmatized.append(lemmatizer.lemmatize(token))
    entry["tokens"] = tokens_lemmatized
docs_tokenized_lower_lemmatized

Finally, we iterate over the corpus one last time to remove stopwords and non-alphabetic tokens such as punctuation.

In [None]:
# Build the stopword set once; a set lookup is faster than rescanning the list for every token
stop_words = set(stopwords.words('english'))

docs_tokenized_lower_lemmatized_cleaned = [dict(entry) for entry in docs_tokenized_lower_lemmatized]
for entry in docs_tokenized_lower_lemmatized_cleaned:
    tokens_cleaned = []
    for token in entry["tokens"]:
        if token.isalpha() and token not in stop_words:
            tokens_cleaned.append(token)
    entry["tokens"] = tokens_cleaned
docs_tokenized_lower_lemmatized_cleaned
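
The preprocessing steps above can also be written more compactly as a single function. The following is only a sketch under simplifying assumptions: `TOY_STOPWORDS` is a tiny hand-written stopword set standing in for NLTK's English list, a regular expression replaces `nltk.word_tokenize`, and lemmatization is left out. The names `preprocess`, `toy_corpus`, and `toy_docs` are hypothetical.

```python
import re

# Assumption: a tiny hand-written stopword set stands in for NLTK's English list
TOY_STOPWORDS = {"the", "is", "are", "you", "how", "also", "i", "to", "my"}

def preprocess(text, stop_words=TOY_STOPWORDS):
    """Lowercase, tokenize on alphabetic runs, and drop stopwords."""
    # Simplified tokenizer (not nltk.word_tokenize): keeps runs of letters only
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_words]

toy_corpus = [
    {"text": "The world is nice", "year": 2016},
    {"text": "I love to ride my bicycle", "year": 2018},
]

# Build new dictionaries instead of modifying the originals
toy_docs = [{**doc, "tokens": preprocess(doc["text"])} for doc in toy_corpus]

for doc in toy_docs:
    print(doc["year"], doc["tokens"])
# 2016 ['world', 'nice']
# 2018 ['love', 'ride', 'bicycle']
```

Keeping the steps in one function makes it easier to apply exactly the same preprocessing to new documents later.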

# Conditional word counting
We count words separately for each condition, that is, for each year. This time we have to do it "by hand": we iterate over all documents and tokens and increment the token count for the respective condition.

In [None]:
cfreq = nltk.ConditionalFreqDist()

for doc in docs_tokenized_lower_lemmatized_cleaned:
    condition = doc["year"]  # the year is the condition for every token of this document
    for token in doc["tokens"]:
        cfreq[condition][token] += 1
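
Conceptually, a conditional frequency distribution maps each condition to its own word counter. The idea can be sketched with the standard library alone, using a `defaultdict` of `Counter` objects as a stand-in for `nltk.ConditionalFreqDist`. The names `toy_docs` and `toy_cfreq` and the preprocessed tokens are hypothetical and serve only as an illustration.

```python
from collections import Counter, defaultdict

# Hypothetical, already preprocessed toy documents
toy_docs = [
    {"year": 2016, "tokens": ["world", "nice"]},
    {"year": 2016, "tokens": ["weather", "also", "nice"]},
    {"year": 2017, "tokens": ["yesterday", "weather", "also", "nice"]},
]

# One Counter per condition (year), created automatically on first access
toy_cfreq = defaultdict(Counter)
for doc in toy_docs:
    toy_cfreq[doc["year"]].update(doc["tokens"])

print(toy_cfreq[2016]["nice"])     # 2
print(toy_cfreq[2017]["nice"])     # 1
print(toy_cfreq[2016]["weather"])  # 1
```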

Print the frequency distributions for all conditions.

In [None]:
cfreq

Print the frequency distribution for the year 2017.

In [None]:
cfreq[2017]

# Time series of word occurrences

For all years between 2015 and 2018, get the frequency of a chosen word (here, "world").

In [None]:
word = "world"
years = range(2015,2019)
occurences = []
for year in years:
    occurences.append(cfreq[year][word])

Print the resulting time series.

In [None]:
occurences
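
Years in which the word never occurs still need an entry in the time series, otherwise the years and the counts would no longer line up. The following sketch shows this with standard-library stand-ins, where missing words and missing years both yield a count of 0. The names `toy_cfreq`, `toy_word`, `toy_years`, and `toy_series` and the counts are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical counts per year, standing in for cfreq
toy_cfreq = defaultdict(Counter)
toy_cfreq[2015].update(["hello", "world"])
toy_cfreq[2016].update(["world", "nice", "weather", "nice"])

toy_word = "world"
toy_years = range(2015, 2019)

# Counter returns 0 for unseen words; defaultdict returns an empty Counter for unseen years
toy_series = [toy_cfreq[year][toy_word] for year in toy_years]
print(toy_series)  # [1, 1, 0, 0]
```

One side effect to be aware of: looking up an unseen year in a `defaultdict` creates an empty entry for that year.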

Merge the years and the word occurrences into one dataframe.

In [None]:
timeseries = pd.DataFrame(list(zip(years, occurences)),
              columns=['years','count'])
timeseries['years'] = pd.to_datetime(timeseries['years'], format='%Y')
timeseries

Plot the time series.

In [None]:
alt.Chart(timeseries).mark_line().encode(
    x='years',
    y='count'
).interactive()