<a href="https://colab.research.google.com/github/loperntu/Coding4Linguists/blob/master/Lyrics_analytics_from_corpus_to_AI_applications.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Corpus Analysis Workflow: A Lyrics Corpus as an Example

# Corpus Analytics Workflow

- data collection (**crawling**)
- data preprocessing (**cleaning** and **tokenization**)
- data tagging and parsing (**pos**, **dependency formalism**)

----------------- CORPUS ---------------------

- Exploratory data analysis (hypothesis formation via **statistics** and **visualization**)
- Annotation
- Analysis and Application
  - linguistic analysis
  - NLP/ML/AI applications

# Package installation

In [0]:
!pip install nltk

In [0]:
# Download some dependencies
!git clone https://github.com/ldkrsi/jieba-zh_TW.git jieba_tw
!pip install opencc-python-reimplemented
!pip install zhon
!pip install jieba
!pip install -U scikit-learn

# Corpus Data Ingestion

In [0]:
from google.colab import files
import re
import json
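from pathlib import Path
import requests  # used by the crawler below
from bs4 import BeautifulSoup as bs  # aliased as `bs` in the crawler below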
import jieba
import jieba.posseg as pseg
from jieba_tw import jieba as jieba_tw
from opencc import OpenCC
from zhon import hanzi
import string
import nltk
nltk.download("averaged_perceptron_tagger")

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

## Lyrics data collection

In [0]:
# Example: lyrics by Jonathan Lee (李宗盛)

SONG_REGEX = re.compile(r'\n\n(\D+)\n{2,4}', re.DOTALL | re.MULTILINE)

# SONG_REGEX = re.compile(r'\n\n(.+)\n{2,4}(\[\d\d:\d\d.\d\d\]\s*\w*)?', re.DOTALL)
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'
}
sess = requests.Session()
sess.headers.update(HEADERS)


def get_song_rows(singer_link):
    """Provide a link to a singer's Mojim page. Returns a list of BeautifulSoup tags"""
    if not singer_link.endswith('-A2.htm'):
        singer_link = singer_link.replace('.htm', '-A2.htm')
    # name the parser explicitly (same one used for the lyrics pages below)
    soup = bs(sess.get(singer_link).text, 'html5lib')
    return soup.select('dd.hb2, dd.hb3')


def get_song_attrs(rows):
    """
    Provide a list of BeautifulSoup tags containing information such as a link to a song's lyrics, 
    title, and release time. Returns a list of dictionaries containing the aforementioned information.
    """
    song_attrs_list = []
    visited = set()  # keep track of songs
    BASE_URL = "http://mojim.com"
    for row in rows:
        link = f"{BASE_URL}{row.select('span.hc1 a')[0].attrs['href']}"
        title = row.select('span.hc1 a')[0].text
        time = row.select('span.hc4')[0].text
        if title in visited:
            print(f'{title} already exists.')
            continue
        song_attrs_list.append({'link': link, 'title': title, 'time': time})
        visited.add(title)
    print(f'{len(song_attrs_list)} unique songs found.')
    return song_attrs_list


def get_song_lyrics(song_dicts):
    """
    Provide a list of song dictionaries (as built by get_song_attrs). Fetches each song's
    lyrics and stores them in place under the 'lyrics' key.
    """
    total = len(song_dicts)
    errors = []
    for idx, song_dict in enumerate(song_dicts, 1):
        link = song_dict['link']
        r = sess.get(link).text
        soup = bs(r, 'html5lib')
        soup = str(soup.select('dd.fsZx3')[0])
        # remove link
        soup = soup.replace(
            r'<br/>更多更詳盡歌詞 在 <a href="http://mojim.com">※ Mojim.com　魔鏡歌詞網 </a><br/>', '')
        soup = soup.replace('<br/>', '\n')
        matches = SONG_REGEX.search(soup)
        try:
            match = matches[1].strip()
        except TypeError as e:
            print(e)
            print(song_dict)
            errors.append({'error': e, 'song': song_dict})
        else:
            song_dict['lyrics'] = match
        if idx % 25 == 0:
            print(f'{idx} of {total} complete')
    print('Done!')
    return errors  # songs whose lyrics could not be extracted


def save(song_dict_list, filename):
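    # write UTF-8 explicitly, since ensure_ascii=False keeps the Chinese characters as-is
    with open(filename, 'w', encoding='utf-8') as fp: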
        json.dump(song_dict_list, fp, ensure_ascii=False)


In [0]:
rows = get_song_rows('http://mojim.com/twh100041-A2.htm')  # Mojim singer's page
song_attrs = get_song_attrs(rows)
get_song_lyrics(song_attrs)
save(song_attrs, 'lyrics.json')

## Lyrics data preprocessing

In [0]:


def preprocess(files, pos_tagging=True, remove_stopwords=False):
  """Convert, clean, and segment each lyrics file.
  Returns a list of {'title': ..., 'lyrics': ...} dictionaries."""
  print("Starting...")
  cc = OpenCC('s2tw')
  completed = []

  # load the stopword list and convert it to Traditional Chinese (Taiwan standard)
  with open("./baidu_stopwords.txt", encoding="utf-8") as fp:
    stopwords = set(cc.convert(fp.read()).splitlines())

  # escape the ASCII punctuation so that characters such as ']' are matched literally
  regex = re.compile(rf"[{hanzi.punctuation}{re.escape(string.punctuation)}]")
  for file in files:
    with open(file, encoding="utf-8") as fp:
      lyrics = fp.read()
    # the title is the part of the file name before the first underscore
    title = cc.convert(Path(file).name.split("_")[0])
    print(title)
    # simplified to traditional (Taiwan standard)
    converted = cc.convert(lyrics)
    # remove punctuation
    no_punc = regex.sub("", converted)
    # remove newlines and all other whitespace
    no_whitespace = re.sub(r"\s+", "", no_punc)

    if pos_tagging:
      # pseg.cut returns a generator of (word, tag) pairs
      seg = [tuple(s) for s in pseg.cut(no_whitespace)]
      if remove_stopwords:
        seg = [(word, tag) for word, tag in seg if word not in stopwords]
    else:
      # lcut returns a list of plain strings instead of a generator
      seg = jieba_tw.lcut(no_whitespace)
      if remove_stopwords:
        seg = [word for word in seg if word not in stopwords]

    completed.append({'title': title, 'lyrics': seg})

  return completed

In [0]:
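# jay_files, leehom_files, jam_files, and jolin_files are not defined in this notebook;
# they are assumed to be lists of pathlib.Path objects, one per lyrics file.
preprocessed_lyrics = []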
for name, lst in [("Jay", jay_files), ("Leehom", leehom_files), ("Jam", jam_files), ("Jolin", jolin_files)]:
  preprocessed = preprocess(lst)
  preprocessed_lyrics.append({
      name: preprocessed
  })

## NLP-enhanced data processing
- POS tagging
- Dependency parsing
- NER
- (Sentiment analysis)

#### POS tagging 


In [0]:
import os       #importing os to set environment variable
def install_java():
  !apt-get install -y openjdk-8-jdk-headless -qq > /dev/null      #install openjdk
  os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"     #set environment variable
  !java -version       #check java version
install_java()


openjdk version "10.0.2" 2018-07-17
OpenJDK Runtime Environment (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.4)
OpenJDK 64-Bit Server VM (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.4, mixed mode)


In [0]:
!pip install StanfordCoreNLP
from stanfordcorenlp import StanfordCoreNLP
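# 'stanford-corenlp' is the path to a local CoreNLP distribution.
# lang='zh' selects the Chinese pipeline (the Chinese models jar must be in that directory).
nlp = StanfordCoreNLP('stanford-corenlp', lang='zh', memory='4g')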

#### Dependency parsing

In [0]:
#code here

#### NER

In [0]:
# code here

# Exploratory Corpus Data Analysis

Corpus Statistics and Visualization

- Descriptive Statistics
- Statistical Testing
- Association and Productivity Measure
- Multivariate Statistics (Clustering)


> Use the `pandas` library for **data manipulation** (preparation, transformation, aggregation)

> *Note*: exploratory *structured* vs *unstructured*  data analysis

### Corpus Data Summary, Query and Graphics

- Corpus Basic statistics
-  *Plot, Barplot, and Histograms* (check `Code snippets: Altair`) https://altair-viz.github.io/
-  Visualization 


### Basic Corpus Statistics and Plot

In [0]:
#code:test, correlation
## for structured data (e.g., data frame)

#### More text-oriented plot
- *Dispersion* and *Strip charts*


In [0]:
#code
## for unstructured data

### Concordance


In [0]:
import nltk
import jieba
raw = open(....).read()
corpus = nltk.Text(jieba.lcut(raw))
corpus.concordance(u'愛', width = 40, lines = 15)
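
What a concordance does can also be sketched without `nltk`: every hit of a keyword is returned together with a window of context tokens on each side (a keyword-in-context, or KWIC, view). The function name `kwic`, the `window` parameter, and the toy token list are illustrative assumptions, not part of the notebook's corpus.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every hit of `keyword`."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]      # up to `window` tokens before the hit
            right = tokens[i + 1:i + 1 + window]     # up to `window` tokens after the hit
            hits.append((left, tok, right))
    return hits


# toy, already segmented input (hypothetical; real input would come from jieba.lcut)
tokens = ["我", "愛", "你", "因為", "你", "愛", "我"]
for left, kw, right in kwic(tokens, "愛", window=2):
    print(" ".join(left), f"[{kw}]", " ".join(right))
```

Working on a token list rather than a raw string keeps the context window aligned with word boundaries, which matters for Chinese text that has no spaces.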

### Visualization

- *Word cloud*
- *Motion Chart* 
- *ScatterText*

#### Word Cloud

#### Motion chart

#### Scatter Text

- A recent method for making legible, interactive scatter plots for text visualization, by [Jason S. Kessler](https://github.com/JasonKessler)
- Check the author's [Tutorial video](https://www.youtube.com/watch?v=H7X9CA2pWKo)

In [0]:
!pip install --upgrade scattertext
import sys
import pandas as pd
import scattertext as st
import numpy as np
from IPython.display import IFrame, display, HTML
display(HTML("<style>.container { width:98% !important; }</style>"))

In [0]:
from __future__ import print_function
from scattertext import CorpusFromParsedDocuments
from scattertext import chinese_nlp
from scattertext import produce_scattertext_explorer

In [0]:
## compare chinese translations of tale of two cities and ulysses, from http://www.pku.edu.cn/study/novel/ulysses/cindex.htm

df = pd.read_csv('https://cdn.rawgit.com/JasonKessler/scattertext/e508bf32/scattertext/data/chinese.csv')
df['text'] = df['text'].apply(chinese_nlp)
corpus = CorpusFromParsedDocuments(df,
                                   category_col='novel',
                                   parsed_col='text').build()
html = produce_scattertext_explorer(corpus,
                                    category='Tale of Two Cities',
                                    category_name='Tale of Two Cities',
                                    not_category_name='Ulysses',
                                    width_in_pixels=1000,
                                    metadata=df['novel'],
                                    asian_mode=True)
# open('./demo_chinese.html', 'w').write(html)
# print('Open ./demo_chinese.html in Chrome or Firefox.')



Building prefix dict from the default dictionary ...
DEBUG:jieba:Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
DEBUG:jieba:Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.146 seconds.
DEBUG:jieba:Loading model cost 1.146 seconds.
Prefix dict has been built succesfully.
DEBUG:jieba:Prefix dict has been built succesfully.


Open ./demo_chinese.html in Chrome or Firefox.


In [0]:
display(HTML(html))

### Clustering

> Clustering is the task of organizing unlabelled objects in a way that objects in the same group are similar to each other and dissimilar to those in other groups. In other words, clustering is like *unsupervised classification* where the algorithm models the *similarities* instead of the boundaries.

#### Basic notions

- a **distance measure** to define whether or not two documents are similar.

- a **criterion function** to compute the quality of our clusters, and 

- an algorithm to **optimize** this criterion.
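
The three notions above can be made concrete with a minimal k-means sketch in plain Python. It assumes bag-of-words count vectors, squared Euclidean distance as the distance measure, the within-cluster sum of squares as the criterion, and Lloyd's algorithm as the optimizer. The toy documents and all function names are illustrative assumptions; other distance measures (e.g., cosine) or criteria are equally valid choices.

```python
from collections import Counter


def vectorize(docs):
    """Turn tokenized documents into count vectors over a shared, sorted vocabulary."""
    vocab = sorted({tok for doc in docs for tok in doc})
    return [[Counter(doc)[w] for w in vocab] for doc in docs]


def sq_dist(a, b):
    """Distance measure: squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def criterion(vectors, labels, centroids):
    """Criterion function: within-cluster sum of squared distances (lower is better)."""
    return sum(sq_dist(v, centroids[l]) for v, l in zip(vectors, labels))


def kmeans(vectors, init_centroids, n_iter=10):
    """Optimization: alternate cluster assignment and centroid update (Lloyd's algorithm)."""
    centroids = [list(c) for c in init_centroids]
    labels = []
    for _ in range(n_iter):
        # assignment step: each document goes to its nearest centroid
        labels = [min(range(len(centroids)), key=lambda k: sq_dist(v, centroids[k]))
                  for v in vectors]
        # update step: each centroid becomes the mean of its members
        for k in range(len(centroids)):
            members = [v for v, l in zip(vectors, labels) if l == k]
            if members:
                centroids[k] = mean(members)
    return labels, centroids


# toy tokenized "lyrics" (hypothetical data)
docs = [
    ["love", "heart", "love"],
    ["love", "heart", "tears"],
    ["road", "city", "night"],
    ["city", "night", "night"],
]
vectors = vectorize(docs)
labels, centroids = kmeans(vectors, init_centroids=[vectors[0], vectors[2]])
score = criterion(vectors, labels, centroids)
print(labels, score)
```

The initial centroids are passed in explicitly to keep the sketch deterministic; in practice they are usually chosen at random and the run is repeated several times.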

#### Principal Component Analysis (PCA)

#### t-SNE (t-distributed stochastic neighbor embedding)

- t-SNE is useful for visualizing similarity between objects. It takes a set of high-dimensional word vectors (e.g., 100 dimensions from Word2Vec) and compresses them into 2-dimensional (x, y) coordinate pairs.

- The idea is to keep similar words close together on the plane, while maximizing the distance between dissimilar words.

- resolution of syntactic and semantic ambiguity (polysemy)

Steps

- Clean the data
- Build a corpus
- Train a Word2Vec Model
- Visualize t-SNE representations of the most common words

In [0]:
# import pandas as pd
# pd.options.mode.chained_assignment = None 
# import numpy as np
# import re
# import nltk

# from gensim.models import word2vec

# from sklearn.manifold import TSNE
# import matplotlib.pyplot as plt
# %matplotlib inline


## Co-occurrence, Association and Productivity Measures

- Co-occurrence is the simultaneous occurrence of *usually two* linguistic phenomena.

- Finding lexical-grammatical patterns is useful both for linguistic and lexicographical studies.


### Collocation

- statistically significant co-occurrence of linguistic forms.

> You shall know a word by the company it keeps (Firth, 1957).

- lexical and grammatical collocations




In [0]:
#code: a naive collocation extraction (via frequency as in textbook)
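
A naive frequency-based extraction can be sketched as follows: count adjacent token pairs and keep the most frequent ones. The function names, the `min_freq` threshold, the optional stopword filter, and the toy token list are assumptions for illustration. Raw frequency favors pairs of very common words, which is why association measures are normally applied on top of it.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs (bigrams) in a token list."""
    return Counter(zip(tokens, tokens[1:]))


def top_collocations(tokens, n=5, min_freq=2, stopwords=frozenset()):
    """Return the n most frequent bigrams that occur at least min_freq times
    and contain no stopword, sorted by descending frequency."""
    counts = bigram_counts(tokens)
    candidates = [(pair, freq) for pair, freq in counts.items()
                  if freq >= min_freq and not (set(pair) & stopwords)]
    return sorted(candidates, key=lambda item: (-item[1], item[0]))[:n]


# toy, already segmented input (hypothetical; real input would come from the preprocessed lyrics)
tokens = ["我", "想", "你", "我", "想", "你", "的", "夜", "我", "想", "家"]
print(top_collocations(tokens))
print(top_collocations(tokens, stopwords={"我"}))
```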

### Colligation
- Like collocation, except that the co-occurrence involves a grammatical category or construction rather than only lexical items.

`use languageR data(dative)` 

### Collostruction

Three methods proposed by Stefanowitsch and Gries:

1. **Collexeme analysis**: measures the mutual attraction of lexemes and constructions.

   - (Fisher's exact test, odds ratio, $G^2$, MI, $\chi^2$)

2. **Distinctive collexeme analysis**

   - (Fisher's exact test, $G^2$)

3. **Co-varying collexeme analysis**

   - (Fisher's exact test, odds ratio, $G^2$)
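
As a rough illustration of a simple collexeme analysis, the sketch below computes the odds ratio and the log-likelihood statistic $G^2$ from a 2×2 table that crosses one lexeme with one construction. The cell layout (`a`, `b`, `c`, `d`), the function names, and the counts are all hypothetical; they are not taken from any study.

```python
import math

# 2x2 table for one lexeme L and one construction C (hypothetical counts):
#                     in C     not in C
#   lexeme L          a        b
#   other lexemes     c        d


def expected_counts(a, b, c, d):
    """Expected cell counts under independence of lexeme and construction."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return (row1 * col1 / n, row1 * col2 / n, row2 * col1 / n, row2 * col2 / n)


def odds_ratio(a, b, c, d):
    """Odds of L occurring in C divided by the odds of other lexemes occurring in C.
    Undefined when b or c is zero."""
    return (a * d) / (b * c)


def log_likelihood_g2(a, b, c, d):
    """G^2 = 2 * sum(O * ln(O / E)) over the four cells; empty cells contribute 0."""
    observed = (a, b, c, d)
    expected = expected_counts(a, b, c, d)
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


a, b, c, d = 20, 80, 30, 870  # hypothetical corpus counts
print(f"odds ratio = {odds_ratio(a, b, c, d):.2f}")
print(f"G^2        = {log_likelihood_g2(a, b, c, d):.2f}")
```

An odds ratio above 1 indicates attraction between the lexeme and the construction, and below 1 indicates repulsion; $G^2$ measures how far the table departs from independence but not in which direction.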

In [0]:
#code

## Association Measures

- measure the significant co-occurrences (between two units)
- on the basis of *contingency table*

```
# an example table here
```
----

- Mutual information
- Fisher's Exact Test
- The $\chi^2$ Test
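
The three measures listed above can be sketched on a single contingency table for a word pair. The table layout, the function names, and the counts are assumptions for illustration; the Fisher test shown is the one-sided variant (probability of observing at least `a` co-occurrences), computed directly from the hypergeometric distribution.

```python
import math

# Contingency table for a word pair (w1, w2), hypothetical counts:
#                 w2       not w2
#   w1            a        b
#   not w1        c        d


def pointwise_mi(a, b, c, d):
    """Mutual information score: log2 of observed over expected co-occurrence."""
    n = a + b + c + d
    expected = (a + b) * (a + c) / n
    return math.log2(a / expected)


def chi_square(a, b, c, d):
    """Pearson's chi-square statistic for a 2x2 table (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test: P(co-occurrence count >= a) with all margins fixed."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    upper = min(row1, col1)
    favorable = sum(math.comb(col1, k) * math.comb(n - col1, row1 - k)
                    for k in range(a, upper + 1))
    return favorable / math.comb(n, row1)


a, b, c, d = 30, 70, 20, 880  # hypothetical corpus counts
print(f"MI      = {pointwise_mi(a, b, c, d):.3f}")
print(f"chi^2   = {chi_square(a, b, c, d):.2f}")
print(f"Fisher p = {fisher_exact_greater(a, b, c, d):.3e}")
```

Mutual information tends to overrate rare pairs, while the chi-square approximation becomes unreliable when expected counts are small; Fisher's exact test avoids the approximation at a higher computational cost.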



### Lexical Richness and Productivity

- Types, Tokens, and TTR (type-token ratio)
- Vocabulary Growth Curve
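
Both notions can be computed from a token list in a few lines. The sketch below assumes already tokenized text; the function names, the `step` parameter, and the toy tokens are illustrative. The growth curve records how many distinct types have been seen after each sampled number of tokens.

```python
def type_token_ratio(tokens):
    """TTR = number of distinct types / number of tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def vocabulary_growth(tokens, step=1):
    """Return (tokens_seen, types_seen) pairs sampled every `step` tokens,
    always including the final token."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, 1):
        seen.add(tok)
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve


# toy token list (hypothetical data)
tokens = ["love", "is", "love", "and", "love", "is", "blind"]
print(type_token_ratio(tokens))
print(vocabulary_growth(tokens, step=2))
```

TTR depends strongly on text length, so it should only be compared across samples of similar size; the growth curve makes that dependence visible.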








In [0]:
# code

-------------

# Corpus Annotation and Analysis

- Data is often annotated using both automatic taggers/parsers and a growing set of manual annotation tools (e.g. EXMARaLDA, ELAN, annotate/Synpathy, MMAX, RSTTool, Arborator, WebAnno, Atomic).
- For example, ANNIS provides the means for visualizing and retrieving this data. Pepper is used to import the multiple annotation formats into ANNIS.

![ANNIS](http://corpus-tools.org/annis/images/annis3_full.png)

An easier alternative: ![WebAnno](https://webanno.github.io/webanno/assets/img/logo.png)

![Annotation is a process](https://webanno.github.io/webanno/releases/3.4.6/docs/user-guide/images/progress_workflow.jpg)

## `WebAnno` annotation practice

> WebAnno (Eckart de Castilho, R. et al. 2016) is a general purpose web-based annotation tool for a wide range of linguistic annotations including various layers of morphological, syntactical, and semantic annotations. Additionally, custom annotation layers can be defined, allowing WebAnno to be used also for non-linguistic annotation tasks.

### Sentiment and Emotion Annotation

- Sentiment Polarity annotation
- Emotion annotation `REMAN corpus (Relational EMotion ANnotation)`(Kim et al. 2018)


## Install `Docker` 
https://docs.docker.com/





## Exercise.1


-------

# Applications

### Sentiment Analysis / Emotion detection

In [0]:
# SA from previous tutorial
# 

### Lyrics generation

- Use an LSTM (Long Short-Term Memory) neural network, which mitigates the long-term dependency problem of plain RNNs.

- [Chinese lyrics generator demo](http://140.112.147.125:5000/)




# COPENS: Open Corpus Project

http://140.112.147.125:8000/





- Rationale

    - Corpus as a human-machine collaboration interface (collaboration, learning, and mutual benefit)
    - User uploads
    - Open data and source 

- Corpus Query Language tutorial


## Example: The Lyrics Corpus

- Corpus query language

## Exercise.2

In [0]:
- Upload
- Search (using CQL)

# Reference

- Evgeny Kim and Roman Klinger. Who Feels What and Why? Annotation of a Literature Corpus with Semantic Roles of Emotions. In Proceedings of COLING 2018, the 27th International Conference on Computational Linguistics, Santa Fe, USA, August 2018.