# Class 1

## Semantic Text Similarity

### How to find the Semantic Text Similarity between words using Python

- **WordNet is easily imported into Python through NLTK, and it helps find the appropriate sense of a word (it is important to define what counts as appropriate)**

```python
import nltk
from nltk.corpus import wordnet as wn

# Pick the first noun sense of each word
deer = wn.synset("deer.n.01")
elk = wn.synset("elk.n.01")
horse = wn.synset("horse.n.01")
```


- **Find the Path Similarity**

```python
deer.path_similarity(elk) # --> 0.5
deer.path_similarity(horse) # --> 0.1428
```
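
Path similarity is 1 / (1 + length of the shortest path between the two senses in the hypernym hierarchy). A minimal sketch of that idea on a tiny, hypothetical taxonomy (the `PARENT` mapping is made up for illustration and is much shallower than WordNet, so the deer/horse score differs from the WordNet value):

```python
# Toy hypernym tree (child -> parent); hypothetical, not WordNet's real hierarchy.
PARENT = {
    "elk": "deer",
    "deer": "ruminant",
    "ruminant": "mammal",
    "horse": "equine",
    "equine": "mammal",
}

def ancestors(node):
    """Return {ancestor: steps up from node}, including node itself at 0 steps."""
    steps = {node: 0}
    while node in PARENT:
        node = PARENT[node]
        steps[node] = len(steps)
    return steps

def path_similarity(a, b):
    """1 / (1 + length of the shortest path between a and b in the tree)."""
    up_a, up_b = ancestors(a), ancestors(b)
    common = set(up_a) & set(up_b)
    if not common:
        return None  # no shared ancestor, so no path
    distance = min(up_a[n] + up_b[n] for n in common)
    return 1 / (1 + distance)

print(path_similarity("deer", "elk"))    # 0.5 (one step apart)
print(path_similarity("deer", "horse"))  # 0.2 (four steps apart in this toy tree)
```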


- **Use information content to find the Lin Similarity**

```python
from nltk.corpus import wordnet_ic

# Information content estimated from the Brown corpus
brown_ic = wordnet_ic.ic("ic-brown.dat")
deer.lin_similarity(elk, brown_ic) # --> 0.7726
deer.lin_similarity(horse, brown_ic) # --> 0.8623
```
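
For reference, Lin similarity combines the information content (IC) of the two senses with that of their lowest common subsumer (LCS), the most specific ancestor they share. The standard definition, with $P(c)$ estimated from corpus counts, is:

$$IC(c) = -\log P(c), \qquad sim_{Lin}(u, v) = \frac{2 \cdot IC(LCS(u, v))}{IC(u) + IC(v)}$$

Because the score depends on corpus statistics and not only on path length, it can rank pairs differently from path similarity, as the deer / horse values show.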

### Collocations and Distributional Similarity


- What are collocations?
    - *"You shall know a word by the company it keeps" (Firth, 1957)*
    - Two words that frequently appear in similar contexts are more likely to be semantically related
    - For example:
        - The friends **met at** a **<font color='red'>café</font>**
        - Shyam **met** Ray **at** a **<font color='red'>pizzeria</font>**
        - Let's **meet** up **near the** **<font color='red'>coffee shop</font>**
        - The secret **meeting at the** **<font color='red'>restaurant</font>** soon became public
        


### Distributional Similarity: Context

- Words that appear before and after the target word, within a small window of neighbors

- Parts-Of-Speech (POS) of the words that appear before and after the target word, within a small window of neighbors

- Specific syntactic relation to the target word

- Words in the same sentence, same documents, ...
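
A minimal sketch of the first kind of context, a small window of neighboring words. The function name `context_windows` and the sample sentence (adapted from the café example, with a made-up ending) are assumptions for illustration:

```python
def context_windows(tokens, target, window=2):
    """Collect the words within `window` positions of each occurrence of `target`."""
    contexts = []
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - window):i]      # up to `window` words before
            right = tokens[i + 1:i + 1 + window]     # up to `window` words after
            contexts.append(left + right)
    return contexts

sentence = "the friends met at a café near the station".split()
print(context_windows(sentence, "café"))
# [['at', 'a', 'near', 'the']]
```

Words whose collected contexts overlap heavily (for example *café* and *pizzeria* after "met at a") are candidates for being semantically related.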



### Strength of Association between Words

- How frequently do the two words occur together in the document?
    - Words that rarely occur together are unlikely to be related


- It is also important to look at how frequent each individual word is
    - The word *the* is so frequent in English that it is likely to co-occur often with almost every other word
    
    
- Pointwise Mutual Information **(PMI)**: $$PMI(w, c) = \log\left(\frac{P(w, c)}{P(w) \cdot P(c)}\right)$$
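
A worked sketch of the formula, assuming base-2 logarithms and probabilities estimated from raw counts. All counts below are hypothetical and chosen only to show how PMI corrects for very frequent words:

```python
import math

def pmi(pair_count, w_count, c_count, total):
    """PMI(w, c) = log2( P(w, c) / (P(w) * P(c)) ), with probabilities from raw counts."""
    p_wc = pair_count / total
    p_w = w_count / total
    p_c = c_count / total
    return math.log2(p_wc / (p_w * p_c))

TOTAL = 1_000_000  # hypothetical number of observations

# Two rare words that almost always appear together: strongly positive PMI (about 13)
rare_pair = pmi(pair_count=80, w_count=100, c_count=100, total=TOTAL)

# A pair involving a very frequent word: many more co-occurrences,
# but no more than chance would predict, so PMI is approximately 0
frequent_pair = pmi(pair_count=5_000, w_count=50_000, c_count=100_000, total=TOTAL)

print(rare_pair, frequent_pair)
```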



### How to Apply this in Python

- Use NLTK's collocation finder and association measures; the finder also offers other useful functions, such as a frequency filter

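```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# `text` is assumed to be a list of word tokens
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(text)

# Ignore bigrams seen fewer than 10 times, then take the 10 best by PMI
finder.apply_freq_filter(10)
finder.nbest(bigram_measures.pmi, 10)
```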



### Take Home Concepts

- Finding similarity between words and text is a non-trivial task

- WordNet is a useful resource for semantic relationships between words

- Many similarity functions exist to help with several tasks

- NLTK is a useful package for many such tasks

---

# Class 2

## Topic Modeling

### What is Topic Modeling?

- A coarse-level analysis of what's in a text collection

- *Topic*: the subject or theme of a discourse, a scientific paper, ...

- Topics are represented as a word distribution 

- A document is assumed to be a mixture of topics



- What's known:
    - The text collection, also known as the *corpus*
    - The number of topics (for example, 4 topics such as sports, science, cooking, gaming)



- What's not known:
    - The actual topics
    - The topic distribution for each document (40% sports, 60% genetics)



- Essentially, this is a text clustering problem
    - Documents and words are clustered simultaneously
    

- Different topic modeling approaches are available:
    - Probabilistic Latent Semantic Analysis (PLSA) 
    - **Latent Dirichlet Allocation (LDA)**

## Generative Models and Latent Dirichlet Allocation

### Latent Dirichlet Allocation (LDA)


- Generative model for a document `d`
    - Choose the length of document `d`
    - Choose a mixture of topics for the document `d`
    - Use a topic's multinomial distribution to output words to fill that topic's quota
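
The generative steps above can be sketched in a few lines. This is a simplified illustration, not a full LDA implementation: the topics, their word probabilities, and the function name `generate_document` are hypothetical, and the mixture is fixed by hand, whereas in full LDA both the mixture and the topic word distributions are drawn from Dirichlet priors.

```python
import random

# Hypothetical topics: each is a word distribution (word -> probability).
TOPICS = {
    "sports":   {"game": 0.5, "team": 0.3, "score": 0.2},
    "genetics": {"gene": 0.4, "cell": 0.4, "dna": 0.2},
}

def generate_document(length, topic_mixture, rng):
    """Generate `length` words: pick a topic per word, then a word from that topic."""
    topic_names = list(topic_mixture)
    topic_weights = list(topic_mixture.values())
    words = []
    for _ in range(length):
        # Step 1: choose a topic according to the document's mixture
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        # Step 2: choose a word according to that topic's word distribution
        vocab = TOPICS[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return words

rng = random.Random(0)  # fixed seed so runs are repeatable
doc = generate_document(length=10, topic_mixture={"sports": 0.4, "genetics": 0.6}, rng=rng)
print(doc)
```

Topic modeling runs this story in reverse: given only the generated words, it tries to recover the topics and each document's mixture.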
    
    
    
### Topic Modeling in Practice

- How many Topics?
    - Finding, or even guessing, the number of topics in a text collection is a hard task


- Interpreting topics
    - Topics are just word distributions
    - Making sense of the words / generating labels is subjective (*computer science* and *data science* may be seen as similar tags)
    
    
    
### Topic Modeling - Summary

- Great tool for exploratory text analysis
    - What are the documents (tweets, reviews, news, articles) about?


- Many Python tools make this nearly effortless



### Working with LDA in Python

- There are many packages available, such as *gensim* and *lda*


- Text pre-processing steps:
    - Tokenize the text into sentences and words, then normalize (lowercase) the tokens
    - Remove stop words such as *the*, *is*, ...
    - Stemming and lemmatization
    

- Convert the tokenized documents to a document-term matrix

- Build LDA models on the doc-term matrix
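
A minimal, standard-library sketch of the pre-processing and document-term matrix steps, to show what the data looks like before it reaches an LDA package. The stop-word list, sample sentences, and helper names are assumptions for illustration; stemming and lemmatization are omitted because they need a dedicated library.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "is", "a", "of", "and", "in"}  # tiny illustrative list

def preprocess(text):
    """Lowercase, split into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def doc_term_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    vocabulary = sorted({token for doc in docs for token in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

raw_docs = [
    "The team won the game.",
    "The gene is in the cell and the cell is small.",
]
doc_set = [preprocess(text) for text in raw_docs]
vocabulary, matrix = doc_term_matrix(doc_set)
print(vocabulary)  # ['cell', 'game', 'gene', 'small', 'team', 'won']
print(matrix)      # [[0, 1, 0, 0, 1, 1], [2, 0, 1, 1, 0, 0]]
```

gensim stores the same information sparsely: `doc2bow` returns a list of `(token_id, count)` pairs per document instead of a full row of counts.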


- `doc_set`: the set of pre-processed text documents (each a list of tokens)

```python
import gensim
from gensim import corpora, models

# Map each token to an integer id, then turn each document into a bag of words
dictionary = corpora.Dictionary(doc_set)
corpus = [dictionary.doc2bow(doc) for doc in doc_set]
lda_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=50)

# Show the 5 highest-weighted words for each of the 4 topics
print(lda_model.print_topics(num_topics=4, num_words=5))
```


- The `lda_model` can also be used to find the topic distribution of each document



### Take Home Concepts

- Topic modeling is an exploratory tool frequently used for text mining

- Latent Dirichlet Allocation is a generative model used extensively for modeling large text corpora

- LDA can also be used as a feature selection technique for text classification and other tasks

---

# Class 3

## Information Extraction

### Information is Hidden in Free-Text and Information Extraction

- Most traditional transactional information is in a structured format


- Abundance of unstructured data, freeform text
    - The 3 most common types of unstructured data: images, sound, text


- How can unstructured text be converted to a structured form?


- The GOAL is to identify and extract the fields of interest from free, unstructured text
     - For example, in an online article about lung cancer, the information of interest might be:
         - Erbitux helps treat lung cancer
         - Author: Charlene Laino
         - Reviewers: Louise Chang, MD
         - Date: September 23, 2009
         - Location: Berlin
         - ...
         
         

### Fields of Interest

- Named entities
    - **<font color='red'>[NEWS]</font>** People, Places, Dates, ...
    - **<font color='red'>[FINANCE]</font>** Companies, Stocks, Money, ...
    - **<font color='red'>[MEDICINE]</font>** Diseases, Drugs, Procedures, ...
    
    
- Relations
    - What happened to *whom, when, where, ...*
    
    
    
### Named Entity Recognition

- **<font color='red'>Named Entities</font> :** Noun phrases that are of a specific type and refer to specific individuals, places, organizations


- **<font color='red'>Named Entity Recognition</font> :** Technique(s) to identify all mentions of pre-defined named entities in text:
    - Identify the mention / phrase: Boundary detection
    - Identify the type: *Tagging* / classification
    
    
    
    
### Approaches to Identify Named Entities

- Depends on the kinds of entities that need to be identified

- For well-formatted fields such as dates and phone numbers, regular expressions are recommended (recall the Week 1 assignment)

- For other fields: typically a machine learning algorithm (recall Week 3)
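
A small sketch of the regex approach for well-formatted fields. The patterns are illustrative only and cover a single format each; the sample text reuses the date from the lung cancer example, and the phone number is made up:

```python
import re

# Illustrative patterns only; real dates and phone numbers come in many more formats.
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

text = "Reviewed on September 23, 2009. For questions call 555-123-4567."

print(DATE_PATTERN.findall(text))   # ['September 23, 2009']
print(PHONE_PATTERN.findall(text))  # ['555-123-4567']
```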




### Person, Organization, Location / GPE


- Standard NER task in the NLP research community



- Typically a four-class model
    - PER (person)
    - ORG (organization)
    - LOC/GPE (location)
    - Other / Outside (any other class)
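
To connect the four classes with boundary detection: if every token receives one of these tags, consecutive tokens with the same non-`O` tag can be grouped into a mention. A simplified sketch, where the sentence is made up from the lung cancer example, the tags are assigned by hand instead of by a model, and the function name is hypothetical:

```python
def extract_mentions(tokens, tags):
    """Group consecutive tokens that share a non-'O' tag into (phrase, type) mentions."""
    mentions = []
    current_tokens, current_tag = [], "O"
    for token, tag in zip(tokens, tags):
        if tag != current_tag:
            # The tag changed, so any mention in progress ends here
            if current_tag != "O":
                mentions.append((" ".join(current_tokens), current_tag))
            current_tokens, current_tag = [], tag
        if tag != "O":
            current_tokens.append(token)
    if current_tag != "O":
        mentions.append((" ".join(current_tokens), current_tag))
    return mentions

tokens = ["Charlene", "Laino", "reported", "from", "Berlin"]
tags   = ["PER",      "PER",   "O",        "O",    "LOC"]  # hand-assigned, not model output
print(extract_mentions(tokens, tags))
# [('Charlene Laino', 'PER'), ('Berlin', 'LOC')]
```

One limitation of this simplification: two adjacent mentions of the same type would be merged into one, which is why practical systems often use richer tagging schemes that also mark where a mention begins.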
    
    
    
### Question Answering


- Given a question, find the most appropriate answer from the text
    - What does Erbitux treat?
    - Who gave Anita the rose?
    
    
    
- Builds on named entity recognition, relation extraction, and co-reference resolution



### Take Home Concepts

- Information Extraction is important for natural language understanding and making sense of textual data

- Named Entity Recognition is a key building block to address many advanced NLP tasks

- Named Entity Recognition systems extensively deploy supervised machine learning and text mining techniques discussed in this course

---