# Assignment 1: Creating a Mini Corpus

$\textbf{Programming Assignment}$

Choose a topic that you will use for your term paper in this subject. Collect articles, publications, stories, etc. on your chosen topic and develop your own mini-corpus using the required preprocessing steps. Be sure to print the output.

Note that this corpus will be used for the entire subject.

$\textbf{Text Vectorization}$
-

- A vector is a geometric object that has a magnitude and a direction.

- Text vectorization is the projection of words into a mathematical space while preserving information.
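
To make "magnitude and direction" concrete, here is a minimal sketch in plain Python. The helper names `magnitude` and `cosine_similarity` and the example vectors are illustrative assumptions, not part of any library used in this notebook.

```python
import math

def magnitude(v):
    """Euclidean length of a vector given as a list of numbers."""
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: compares direction and ignores length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (magnitude(a) * magnitude(b))

v1 = [3, 4]
v2 = [6, 8]    # same direction as v1, twice the length
v3 = [4, -3]   # perpendicular to v1

print(magnitude(v1))               # 5.0
print(cosine_similarity(v1, v2))   # 1.0 (same direction)
print(cosine_similarity(v1, v3))   # 0.0 (perpendicular)
```

Once text is turned into vectors, the same comparison of directions can be used to ask how similar two documents are.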

$\textbf{The Bag of Words Model}$
-

- BOW is a straightforward model for vectorizing sentences.

- BOW uses word frequencies to construct vectors.

- The BOW model is an orderless document representation: only the counts of the words matter.

- Because BOW does not take the position of words into account, we lose semantic information.

- To compare sentences, we vectorize each one against a single vocabulary built by joining the tokens of all the sentences.

- The vocabulary acts as a reference for whether a specific word is present or absent in each sentence.

$EXAMPLE$

In [29]:
import re
import string

s1 = "My dog sat on the mat."
s2 = "The neighborhood cat loves playing with my dog."

def token_sentence(s):
    # Make a regular expression that matches all punctuation
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    # Use the regex
    res = regex.sub('', s)
    res = res.split()
    return res

new_s1 = token_sentence(s1)
new_s2 = token_sentence(s2)
vocabulary = list(set(new_s1 + new_s2))

print(vocabulary)

['mat', 'playing', 'my', 'sat', 'the', 'The', 'with', 'on', 'My', 'cat', 'neighborhood', 'dog', 'loves']


In [30]:
# printing the token sentences
print(new_s1)
print(new_s2)

['My', 'dog', 'sat', 'on', 'the', 'mat']
['The', 'neighborhood', 'cat', 'loves', 'playing', 'with', 'my', 'dog']


In [31]:
BOW1 = [int(u in new_s1) for u in vocabulary]
BOW2 = [int(u in new_s2) for u in vocabulary]

print(BOW1)
print(BOW2)

[1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0]
[0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1]
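
The vectors above record only presence or absence, and the vocabulary treats "My" and "my" (and "The" and "the") as different words. As a hedged variation, the sketch below lowercases the tokens and stores counts instead, which matches the "word frequencies" description of BOW. The names `tokenize`, `bow_counts`, and `vocab` are illustrative.

```python
import re
import string
from collections import Counter

def tokenize(sentence):
    """Strip punctuation, lowercase, and split on whitespace."""
    no_punct = re.sub('[%s]' % re.escape(string.punctuation), '', sentence)
    return no_punct.lower().split()

tokens1 = tokenize("My dog sat on the mat.")
tokens2 = tokenize("The neighborhood cat loves playing with my dog.")

# Sorting gives a reproducible order; a bare set has no guaranteed order.
vocab = sorted(set(tokens1 + tokens2))

def bow_counts(tokens, vocab):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

print(vocab)
print(bow_counts(tokens1, vocab))
print(bow_counts(tokens2, vocab))
```

With lowercasing, the vocabulary shrinks from 13 entries to 11, and a repeated word in a sentence would show up as a count greater than 1.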


$\text{Term Frequency Inverse Document Frequency (TF-IDF)}$
-

- A model largely used in search engines to query relevant documents.

- Two pieces of information are encoded: the term frequency and the inverse document frequency.

- The term frequency is the number of times a term appears in a document, normalized by the length of that document.

- The inverse document frequency measures how much weight a word deserves: words that occur in few documents are treated as more informative.

- The inverse document frequency is calculated by logarithmically scaling the inverse fraction of the documents containing the word. This is obtained by dividing the total number of documents by the number of documents containing the term, followed by taking the logarithm of the ratio.

- In other words, the inverse document frequency measures how common or rare a term is among all documents.

The formulas are:
\begin{gather}
TF(t) = \frac{\text{number of times the term "t" appears in a specific document}}{\text{total number of terms in the document}}
\end{gather}

\begin{gather}
IDF(t) = \log\left(\frac{\text{total number of documents}}{\text{number of documents with term "t"}}\right)
\end{gather}

\begin{gather}
\text{TF-IDF}(t) = TF(t) \cdot IDF(t)
\end{gather}

- TF-IDF carries more information than a plain count-based vector representation: instead of using raw word counts as in BOW, it makes rare terms more prominent and down-weights common words, such as the stopwords "is", "that", and "of".
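
The three formulas can be worked through on a toy collection. This is a minimal sketch that assumes the natural logarithm (the formula does not fix a base) and that the queried term occurs in at least one document. The documents in `toy_docs` are made up for illustration.

```python
import math

toy_docs = [
    ["my", "dog", "sat", "on", "the", "mat"],
    ["the", "neighborhood", "cat", "loves", "playing", "with", "my", "dog"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

def tf(term, doc):
    """Occurrences of the term divided by the total number of terms in the document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (number of documents / number of documents containing the term)."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", toy_docs[0], toy_docs))           # 0.0: "the" is in every document
print(tf_idf("dog", toy_docs[0], toy_docs))
print(tf_idf("neighborhood", toy_docs[1], toy_docs))
```

A word found in every document gets an IDF of log(1) = 0, so its TF-IDF weight vanishes, while "neighborhood", which occurs in only one document, scores highest of the three.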

$\text{Vectorization Using Gensim}$

In [32]:
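# cleaning the text: collapse whitespace (including line breaks) and drop double quotes
def clean_text(text):
    # split() with no arguments splits on any run of whitespace, including '\n' and '\r',
    # so re-joining with single spaces removes line breaks and excess spaces in one step
    text = ' '.join(text.split())
    # Remove double quotes
    text = text.replace('"', '')
    return text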

Fetch articles from arXiv

In [33]:
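import os
import feedparser

n_papers = 10
chunk_size = 10
category = 'physics.optics'
out_dir = os.path.join('files', 'assignment1')
os.makedirs(out_dir, exist_ok=True)  # create the output folder if it is missing

for chunk_i in range(0, n_papers, chunk_size):
    # Query the arXiv API for one chunk of results in the chosen category
    feed = feedparser.parse('http://export.arxiv.org/api/query?search_query=cat:%s&start=%d&max_results=%d' % (category, chunk_i, chunk_size))

    for entry in feed.entries:
        title = (entry.title).replace('\n', "")  # removes newlines
        print(dir(entry))  # inspect the attributes available on an entry
        # Save the abstract to a text file named after the paper title
        with open(os.path.join(out_dir, clean_text(title) + '.txt'), 'w') as f:
            f.write(clean_text(entry.summary))

        print(entry.summary)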

['__class__', '__class_getitem__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__ior__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__or__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__ror__', '__setattr__', '__setitem__', '__sizeof__', '__slotnames__', '__str__', '__subclasshook__', '__weakref__', 'clear', 'copy', 'fromkeys', 'get', 'has_key', 'items', 'keymap', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values']
A numerical study of the properties of Gaussian pulses propagating in planar
waveguide under the combined effect of positive Kerr-type nonlinearity,
diffraction in planar waveguides and anomalous or normal dispersion, is
presented. It is demonstrated how the relative strength of dispersion and
diffraction, the strength of nonl

In [34]:
from gensim import corpora
import spacy
nlp = spacy.load('en_core_web_sm')

documents = []


In [35]:
import os

# folder path
dir_path = os.path.join('files', 'assignment1')

# list to store file names
res = []

# Iterate over the directory
for path in os.listdir(dir_path):
    # keep only regular files
    if os.path.isfile(os.path.join(dir_path, path)):
        res.append(path)
print(res)

documents = []

for file in res:
    # the context manager closes the file once it has been read
    with open(os.path.join(dir_path, file), "r") as f:
        doc = f.read()
    clean_doc = clean_text(doc)
    documents.append(clean_doc)
    print(clean_doc)

print(len(documents))

['A Numerical Study of Absorption by Multilayered Biperiodic Structures.txt', 'Giant phase-conjugate reflection with a normal mirror in front of an optical phase-conjugator.txt', 'Numerical Study on Space-Time Pulse Compression.txt', 'On Localized X-shaped Superluminal Solutions to Maxwell Equations.txt', 'Power Switching in Hybrid Coherent Couplers.txt', 'Propagation of a short laser pulse in a plasma.txt', 'Scattering Integral Equations for Diffusive Waves. Detection of Objects Buried in Diffusive Media in the Presence of Interfaces.txt', 'Space-Time Approach to Scattering from Many Body Systems.txt', 'Stokes Parameters as a Minkowskian Four-vector.txt', 'Thirring Solitons in the presence of dispersion.txt']
We study the electromagnetic scattering by multilayered biperiodic aggregates of dielectric layers and gratings of conducting plates. We show that the characteristic lengths of such structures provide a good control of absorption bands. The influence of the physical parameters of

In [36]:
texts = []
for document in documents:
    text = []
    doc = nlp(document)
    for w in doc:
        if not w.is_stop and not w.is_punct and not w.like_num:
            text.append(w.lemma_)
    texts.append(text)
# texts is the mini-corpus built from the arXiv physics.optics abstracts
print(texts)

[['study', 'electromagnetic', 'scattering', 'multilayere', 'biperiodic', 'aggregate', 'dielectric', 'layer', 'grating', 'conduct', 'plate', 'characteristic', 'length', 'structure', 'provide', 'good', 'control', 'absorption', 'band', 'influence', 'physical', 'parameter', 'problem', 'size', 'impedance', 'discuss'], ['theoretically', 'study', 'reflection', 'light', 'phase', 'conjugate', 'mirror', 'precede', 'partially', 'reflect', 'normal', 'mirror', 'presence', 'suitably', 'choose', 'normal', 'mirror', 'phase', 'conjugator', 'find', 'greatly', 'enhance', 'total', 'phase', 'conjugate', 'reflected', 'power', 'order', 'magnitude', 'require', 'condition', 'phase', 'conjugate', 'mirror', 'amplifie', 'reflection', 'constructive', 'interference', 'light', 'region', 'mirror', 'take', 'place', 'phase', 'conjugate', 'reflect', 'power', 'exhibit', 'maximum', 'function', 'transmittance', 'normal', 'mirror'], ['numerical', 'study', 'property', 'Gaussian', 'pulse', 'propagate', 'planar', 'waveguide', 

In [37]:
# building the dictionary (word -> integer ID) used for the BOW representation of the mini-corpus
dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)

{'absorption': 0, 'aggregate': 1, 'band': 2, 'biperiodic': 3, 'characteristic': 4, 'conduct': 5, 'control': 6, 'dielectric': 7, 'discuss': 8, 'electromagnetic': 9, 'good': 10, 'grating': 11, 'impedance': 12, 'influence': 13, 'layer': 14, 'length': 15, 'multilayere': 16, 'parameter': 17, 'physical': 18, 'plate': 19, 'problem': 20, 'provide': 21, 'scattering': 22, 'size': 23, 'structure': 24, 'study': 25, 'amplifie': 26, 'choose': 27, 'condition': 28, 'conjugate': 29, 'conjugator': 30, 'constructive': 31, 'enhance': 32, 'exhibit': 33, 'find': 34, 'function': 35, 'greatly': 36, 'interference': 37, 'light': 38, 'magnitude': 39, 'maximum': 40, 'mirror': 41, 'normal': 42, 'order': 43, 'partially': 44, 'phase': 45, 'place': 46, 'power': 47, 'precede': 48, 'presence': 49, 'reflect': 50, 'reflected': 51, 'reflection': 52, 'region': 53, 'require': 54, 'suitably': 55, 'take': 56, 'theoretically': 57, 'total': 58, 'transmittance': 59, 'Gaussian': 60, 'Kerr': 61, 'anomalous': 62, 'chirp': 63, 'comb

$INSIGHTS$

- There are 329 unique words in our corpus, which is built from arXiv abstracts in the physics.optics category.

- Each word is indexed with an integer.

- The index is termed a "word ID".

- The dictionary can now be used for word-to-integer-ID mapping.

Next, we use the doc2bow method, which, as the name suggests, converts a document to its bag-of-words representation.

In [38]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpus

[[(0, 1),
  (1, 1),
  (2, 1),
  (3, 1),
  (4, 1),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1)],
 [(25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 4),
  (30, 1),
  (31, 1),
  (32, 1),
  (33, 1),
  (34, 1),
  (35, 1),
  (36, 1),
  (37, 1),
  (38, 2),
  (39, 1),
  (40, 1),
  (41, 6),
  (42, 3),
  (43, 1),
  (44, 1),
  (45, 5),
  (46, 1),
  (47, 2),
  (48, 1),
  (49, 1),
  (50, 2),
  (51, 1),
  (52, 2),
  (53, 1),
  (54, 1),
  (55, 1),
  (56, 1),
  (57, 1),
  (58, 1),
  (59, 1)],
 [(17, 1),
  (25, 1),
  (42, 1),
  (60, 1),
  (61, 1),
  (62, 1),
  (63, 1),
  (64, 1),
  (65, 3),
  (66, 1),
  (67, 2),
  (68, 2),
  (69, 1),
  (70, 2),
  (71, 1),
  (72, 1),
  (73, 2),
  (74, 2),
  (75, 1),
  (76, 2),
  (77, 1),
  (78, 1),
  (79, 1),
  (80, 1),
  (81, 1),
  (82, 3),
  (83, 1),
  (84, 1),
  (85, 2),
  (86, 1),
  (87, 

- The output is a nested list.

- Each individual sublist is one document's bag-of-words representation.

- A reminder: you might see different numbers in your list, because the mapping from words to IDs may differ each time you create a dictionary.

- Unlike the example we demonstrated, where the absence of a word was a 0, we use tuples that represent (word_id, word_count), and absent words are simply left out.

- We can easily verify this by checking the original document, mapping each word to its integer ID and reconstructing our list.

- In this output, every word in the first document has a count of one, which tends to happen with short documents in small corpora. The second and third documents do contain repeated words, such as word ID 41 ("mirror") with a count of 6.
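
To see how the (word_id, word_count) tuples arise, here is a pure-Python sketch of the same idea. It is not Gensim's implementation: the ID assignment order is an assumption, and `toy_texts`, `toy_token2id`, and `doc2bow_sketch` are illustrative names.

```python
from collections import Counter

toy_texts = [
    ["study", "phase", "mirror", "phase"],
    ["study", "pulse", "waveguide"],
]

# Assign integer IDs in order of first appearance (Gensim's ordering may differ).
toy_token2id = {}
for text in toy_texts:
    for token in text:
        toy_token2id.setdefault(token, len(toy_token2id))

def doc2bow_sketch(text, token2id):
    """Return sorted (word_id, word_count) pairs for the tokens in one document."""
    counts = Counter(token2id[token] for token in text)
    return sorted(counts.items())

toy_corpus = [doc2bow_sketch(text, toy_token2id) for text in toy_texts]
print(toy_corpus)

# Invert the mapping to read a sparse vector back as words.
toy_id2token = {i: t for t, i in toy_token2id.items()}
print([(toy_id2token[i], n) for i, n in toy_corpus[0]])
```

Reading the first vector back through the inverted mapping gives the words and counts of the first toy document, which is the verification step described in the list above.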

In [39]:
#storing your generated corpus

corpora.MmCorpus.serialize('mini-corpus.mm', corpus)

- Storing the corpus on disk and loading it later is more memory efficient, because the corpus can then be streamed so that at most one vector resides in RAM at a time.
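
The "one vector at a time" idea can be illustrated with a generator. The sketch below is only an analogy: it writes a toy corpus as one JSON line per document rather than in the Matrix Market format that `MmCorpus` uses, and `stream_corpus` and the file name are assumptions made for the example.

```python
import json
import os
import tempfile

toy_bows = [[(0, 1), (1, 2)], [(1, 1), (2, 3)], [(0, 4)]]

# Write one document per line (a stand-in for a serialized corpus file).
toy_path = os.path.join(tempfile.mkdtemp(), "toy-corpus.jsonl")
with open(toy_path, "w") as f:
    for bow in toy_bows:
        f.write(json.dumps(bow) + "\n")

def stream_corpus(path):
    """Yield one bag-of-words vector at a time instead of loading the whole file."""
    with open(path) as f:
        for line in f:
            yield [tuple(pair) for pair in json.loads(line)]

for bow in stream_corpus(toy_path):
    print(bow)
```

In Gensim itself, the serialized file can be opened again with `corpora.MmCorpus('mini-corpus.mm')` and iterated over in the same document-by-document fashion.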

In [40]:
# Converting Bag-of-Words to TF-IDF representation
from gensim import models
tfidf = models.TfidfModel(corpus)

for document in tfidf[corpus]:
    print(document)

[(0, 0.21612403188063795), (1, 0.21612403188063795), (2, 0.21612403188063795), (3, 0.21612403188063795), (4, 0.21612403188063795), (5, 0.21612403188063795), (6, 0.21612403188063795), (7, 0.21612403188063795), (8, 0.15106421550072735), (9, 0.11300666261467564), (10, 0.21612403188063795), (11, 0.15106421550072735), (12, 0.21612403188063795), (13, 0.21612403188063795), (14, 0.21612403188063795), (15, 0.15106421550072735), (16, 0.21612403188063795), (17, 0.06505981637991057), (18, 0.21612403188063795), (19, 0.21612403188063795), (20, 0.15106421550072735), (21, 0.21612403188063795), (22, 0.21612403188063795), (23, 0.21612403188063795), (24, 0.21612403188063795), (25, 0.06505981637991057)]
[(25, 0.027936411533166672), (26, 0.0928027503423617), (27, 0.0928027503423617), (28, 0.0928027503423617), (29, 0.3712110013694468), (30, 0.0928027503423617), (31, 0.0928027503423617), (32, 0.0928027503423617), (33, 0.0928027503423617), (34, 0.06486633880919503), (35, 0.06486633880919503), (36, 0.092802750

- TF-IDF scores: the higher the score, the more important the word is to that document.
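
The printed weights do not follow the earlier formulas exactly. They appear consistent with raw counts multiplied by a base-2 logarithm IDF, with each document vector then scaled to unit length. The sketch below works under that assumption. The document frequencies in `toy_doc_freq` are inferred from the printed weights of the first document, not read from the corpus, and `tfidf_sketch` is an illustrative name.

```python
import math

def tfidf_sketch(bow, doc_freq, n_docs):
    """bow: list of (word_id, count); doc_freq: word_id -> number of documents containing it."""
    weights = [(wid, count * math.log2(n_docs / doc_freq[wid])) for wid, count in bow]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return weights  # every term occurs in every document; nothing to scale
    return [(wid, w / norm) for wid, w in weights]

# Assumed document frequencies for the 26 terms of the first document (10 documents in total).
toy_doc_freq = {wid: 1 for wid in range(26)}
toy_doc_freq.update({8: 2, 11: 2, 15: 2, 20: 2, 9: 3, 17: 5, 25: 5})
toy_bow = [(wid, 1) for wid in range(26)]

for wid, weight in tfidf_sketch(toy_bow, toy_doc_freq, n_docs=10):
    print(wid, round(weight, 4))
```

Under these assumptions, terms found in a single document come out near 0.2161 and terms found in five documents near 0.0651, matching the pattern in the first line of output above.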

$\textbf{N-Gramming}$

- Context is very important when working with text data.
- This context is lost during vector representation because only the word frequency is taken into account.
- An n-gram is a contiguous sequence of n items in the text. In our case the items are words, but depending on the use case they could also be letters, syllables, or, in the case of speech, phonemes.
- Mono-gram, n = 1
- Bi-gram, n = 2
- Tri-gram, n = 3
- N-gram models are built from the conditional probability of a token given the preceding token.
- N-gramming can also be done by counting which words appear close to each other.
- Bi-gramming is also called collocation: it locates pairs of words that are very likely to appear close together.
- Example: "New Hampshire" is treated as one unit, not as the separate words "New" and "Hampshire".
- Gensim approaches bigrams by simply combining the two high-probability tokens with an underscore. The tokens new and york will now become new_york instead. Similar to the TF-IDF model, bigrams can be created using another Gensim model, Phrases.
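
Before turning to Gensim, the definitions above can be sketched in plain Python: extracting contiguous n-grams and estimating the conditional probability of a token given the preceding one from counts. The sentence and the helper names `ngrams` and `conditional_probability` are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

toy_tokens = "new york is not new hampshire but new york is big".split()

unigram_counts = Counter(toy_tokens)
bigram_counts = Counter(ngrams(toy_tokens, 2))

def conditional_probability(w1, w2):
    """Estimate P(w2 | w1) as count(w1 w2) / count(w1)."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(ngrams(toy_tokens, 3)[:2])
print(conditional_probability("new", "york"))
print("_".join(("new", "york")))  # joined with an underscore, as described above
```

Here "new" occurs three times and is followed by "york" twice, so the estimate is 2/3. A pair with a high estimate is a candidate for being merged into a single token.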

In [41]:
import gensim
bigram = gensim.models.Phrases(texts)
texts = [bigram[line] for line in texts]
texts

[['study',
  'electromagnetic',
  'scattering',
  'multilayere',
  'biperiodic',
  'aggregate',
  'dielectric',
  'layer',
  'grating',
  'conduct',
  'plate',
  'characteristic',
  'length',
  'structure',
  'provide',
  'good',
  'control',
  'absorption',
  'band',
  'influence',
  'physical',
  'parameter',
  'problem',
  'size',
  'impedance',
  'discuss'],
 ['theoretically',
  'study',
  'reflection',
  'light',
  'phase',
  'conjugate',
  'mirror',
  'precede',
  'partially',
  'reflect',
  'normal',
  'mirror',
  'presence',
  'suitably',
  'choose',
  'normal',
  'mirror',
  'phase',
  'conjugator',
  'find',
  'greatly',
  'enhance',
  'total',
  'phase',
  'conjugate',
  'reflected',
  'power',
  'order',
  'magnitude',
  'require',
  'condition',
  'phase',
  'conjugate',
  'mirror',
  'amplifie',
  'reflection',
  'constructive',
  'interference',
  'light',
  'region',
  'mirror',
  'take',
  'place',
  'phase',
  'conjugate',
  'reflect',
  'power',
  'exhibit',
  'maxim

$\textbf{NOTE}:$ Because creating new phrases adds tokens to the vocabulary, the bigram step must be done before we create our dictionary. Since we built the dictionary earlier, we have to rebuild it and the corpus by running this:

In [42]:
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# print key-value pairs of the corpus dictionary that has gone through bi-gramming
print(dictionary.token2id)

{'absorption': 0, 'aggregate': 1, 'band': 2, 'biperiodic': 3, 'characteristic': 4, 'conduct': 5, 'control': 6, 'dielectric': 7, 'discuss': 8, 'electromagnetic': 9, 'good': 10, 'grating': 11, 'impedance': 12, 'influence': 13, 'layer': 14, 'length': 15, 'multilayere': 16, 'parameter': 17, 'physical': 18, 'plate': 19, 'problem': 20, 'provide': 21, 'scattering': 22, 'size': 23, 'structure': 24, 'study': 25, 'amplifie': 26, 'choose': 27, 'condition': 28, 'conjugate': 29, 'conjugator': 30, 'constructive': 31, 'enhance': 32, 'exhibit': 33, 'find': 34, 'function': 35, 'greatly': 36, 'interference': 37, 'light': 38, 'magnitude': 39, 'maximum': 40, 'mirror': 41, 'normal': 42, 'order': 43, 'partially': 44, 'phase': 45, 'place': 46, 'power': 47, 'precede': 48, 'presence': 49, 'reflect': 50, 'reflected': 51, 'reflection': 52, 'region': 53, 'require': 54, 'suitably': 55, 'take': 56, 'theoretically': 57, 'total': 58, 'transmittance': 59, 'Gaussian': 60, 'Kerr': 61, 'anomalous': 62, 'chirp': 63, 'comb

After creating our bi-grams, we can create tri-grams and other n-grams by simply running the Phrases model multiple times on our corpus. Bi-grams remain the most used n-gram model, though it is worth one's time to glance over the other uses and kinds of n-gram implementations.
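
In the output above, no tokens were joined with an underscore. A likely reason is that Phrases only merges pairs that occur often enough and score above a threshold, and a ten-abstract corpus rarely meets the defaults (to my understanding, `min_count=5` and `threshold=10.0`). The sketch below shows a Mikolov-style collocation score of the kind Phrases is based on. The exact formula and all the counts here are assumptions for illustration, not values read from the corpus.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Collocation score: high when a and b co-occur more often than their counts suggest."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Hypothetical counts: word a occurs 5 times, word b 4 times, the pair "a b" 4 times.
score_default = phrase_score(count_a=5, count_b=4, count_ab=4, vocab_size=329, min_count=5)
score_relaxed = phrase_score(count_a=5, count_b=4, count_ab=4, vocab_size=329, min_count=1)

print(score_default)  # negative: the pair is seen fewer than min_count times
print(score_relaxed)  # well above a threshold of 10
```

On a small corpus, passing a lower `min_count` (and possibly a lower `threshold`) to `Phrases` is one way to let frequent pairs be merged.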

In [43]:
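# Removing both high-frequency and low-frequency words.
# no_below keeps only tokens that appear in at least that many documents;
# no_above keeps only tokens that appear in no more than that fraction of the documents.
# Here no_below=len(documents) requires a token to be in every document, while
# no_above=0.5 allows at most half of them, so no token can satisfy both conditions.
dictionary.filter_extremes(no_below=len(documents), no_above=0.5)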


print(dictionary.token2id)

{}
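
The empty dictionary follows from the two thresholds contradicting each other. The pure-Python sketch below mimics the filtering rule as I understand it (keep a token if its document frequency is at least `no_below` and at most `no_above` times the number of documents). The document frequencies in `toy_dfs` are hypothetical.

```python
def filter_extremes_sketch(doc_freq, n_docs, no_below, no_above):
    """Keep tokens found in at least no_below documents and at most no_above * n_docs documents."""
    max_docs = int(no_above * n_docs)
    return {tok: df for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Hypothetical document frequencies in a 10-document corpus.
toy_dfs = {"study": 5, "pulse": 3, "mirror": 1, "wave": 10}

print(filter_extremes_sketch(toy_dfs, 10, no_below=10, no_above=0.5))  # {}
print(filter_extremes_sketch(toy_dfs, 10, no_below=2, no_above=0.5))
```

With `no_below=2`, tokens that appear in only one document are dropped and tokens in more than half of the documents are dropped, leaving the mid-frequency vocabulary. On the real dictionary this would correspond to a call such as `dictionary.filter_extremes(no_below=2, no_above=0.5)`.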
