# Natural Language Processing

Natural Language Processing (NLP) is a subfield of artificial intelligence that helps computers understand human language and extract insights from unstructured data.

It combines statistics, machine learning models, and deep learning models to achieve these goals.

## Applications 

- Sentiment Analysis: determining the underlying subjective tone of a piece of writing.
- Named Entity Recognition (NER): locating and classifying named entities mentioned in unstructured text into predefined categories. Named entities are real-world objects such as people or locations.
- Chatbots: conversational systems such as ChatGPT.

# Regular Expressions 

In [None]:
import numpy as np 
import matplotlib.pyplot as plt

import nltk

nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('wordnet')

In [None]:
import re

re.match('abc', 'abcdef')

In [None]:
word_regex = r'\w+'
re.match(word_regex, 'hi there!')

| **Pattern**  | **Matches**                               | **Example**         |
|--------------|-------------------------------------------|---------------------|
| `.`          | Any character except newline              | `a.b` matches `acb` |
| `\d`         | Any digit (0-9)                           | `\d\d\d` matches `123` |
| `\w`         | Any word character (alphanumeric + _)     | `\w+` matches `hello123` |
| `\s`         | Any whitespace (space, tab, newline)      | `\s+` matches space(s) |
| `^`          | Start of a string                         | `^abc` matches `abc` only at start |
| `$`          | End of a string                           | `xyz$` matches `xyz` only at end |
| `*`          | Zero or more of the preceding character   | `a*` matches `aa`, `aaa`, or empty |
| `+`          | One or more of the preceding character    | `a+` matches `a`, `aa` |
| `?`          | Zero or one of the preceding character    | `a?` matches `a` or empty |
| `[abc]`      | Any one character from the set            | `[abc]` matches `a`, `b`, or `c` |
| `[^abc]`     | Any character not in the set              | `[^abc]` matches `d`, `e`, etc. |
| `(a\|b)`      | Either `a` or `b`                         | `(cat\|dog)` matches `cat` or `dog` |
| `\b`         | Word boundary                             | `\bword\b` matches `word` exactly |
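
The rows of the table above can be checked with a few small calls to `re.findall`. This is only a sketch: the sample texts and the `cases` and `results` names are made up for illustration and do not come from the exercises below.

```python
import re

# (pattern, sample text) pairs; the sample texts are invented for illustration.
cases = [
    (r"a.b", "acb a\nb"),                   # . matches any character except newline
    (r"\d+", "room 101, floor 3"),          # runs of digits
    (r"\w+", "hello123 world_1!"),          # runs of word characters
    (r"^abc", "abcabc"),                    # only the occurrence at the start
    (r"xyz$", "xyzxyz"),                    # only the occurrence at the end
    (r"colou?r", "color colour"),           # the "u" is optional
    (r"[^abc]", "abcde"),                   # any character outside the set
    (r"cat|dog", "cat, dog, bird"),         # alternation
    (r"\bword\b", "word swordfish words"),  # whole word only
]

# Collect every non-overlapping match for each pattern.
results = {pattern: re.findall(pattern, text) for pattern, text in cases}

for pattern, matches in results.items():
    print(f"{pattern!r:14} -> {matches}")
```

Using `re.findall` for every row keeps the comparison uniform: an anchored pattern such as `^abc` returns a single match even though `abc` appears twice in the sample text.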

In [None]:
my_string = "Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?" 

# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

In [None]:
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

In [None]:
spaces = r"\s+"
print(re.split(spaces, my_string))

In [None]:
# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

# Tokenization

**Tokenization** is the process of turning a string or document into **tokens** (smaller chunks).

It's often used as a first step when preparing a text for NLP.

Tokenization can be done directly with regular expressions, but libraries like *nltk* already provide implementations.
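
As a rough illustration of the regular-expression route, the sketch below tokenizes a sentence with `re.findall` alone. The `simple_tokenize` helper and its pattern are assumptions made for this example, not part of nltk, and nltk's tokenizers follow different rules (for contractions, for instance), so their output will not necessarily match.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+(?:'\w+)? keeps contractions such as "Let's" in one token;
    # [^\w\s] captures any single character that is neither word nor space.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

tokens = simple_tokenize("Let's write RegEx! Won't that be fun?")
print(tokens)
```

A single pattern like this is easy to adapt, but it needs care with edge cases such as abbreviations and URLs, which is one reason to prefer a library tokenizer for real work.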

In [None]:
from nltk.tokenize import word_tokenize

word_tokenize(my_string)

Other nltk tokenizers are:
- **sent_tokenize**: tokenizes a document into sentences
- **regexp_tokenize**: tokenizes a string or document based on a regular expression pattern
- **TweetTokenizer**: a class specialized for tweets that can separate hashtags, mentions, and repeated exclamation points
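
The idea behind pattern-based tokenization of tweets can be approximated with the standard library. The sketch below is a stand-in that uses `re.findall`; it is not nltk's implementation, and the sample tweet, the `pattern`, and the `@some_user` handle are made up for illustration.

```python
import re

tweet = "Loving the #NLP course by @some_user!!! Best #python class :)"

# Hypothetical pattern: hashtags and mentions first, then plain words,
# then runs of any other non-space characters (punctuation, emoticons).
pattern = r"[#@]\w+|\w+|[^\w\s]+"

tokens = re.findall(pattern, tweet)
hashtags = [t for t in tokens if t.startswith("#")]
mentions = [t for t in tokens if t.startswith("@")]

print(tokens)
print(hashtags)
print(mentions)
```

The order of the alternatives matters: putting `[#@]\w+` first keeps `#NLP` together instead of splitting it into `#` and `NLP`.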

## Search VS Match 

`re.search` scans the whole string and returns the first match it finds, while `re.match` only matches at the beginning of the string.

In [None]:
re.match('abc', 'abcde')

In [None]:
re.search('abc', 'abcde')

In [None]:
re.match('bc', 'abcde')

In [None]:
re.search('bc', 'abcde')

In [None]:
with open("../data/grail.txt", "r") as file:
    content=file.read()

# Import necessary modules
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(content)

tokenized_sent = word_tokenize(sentences[3])

unique_tokens = set(word_tokenize(content))

print(unique_tokens)


In [None]:
german_text = 'Wann gehen wir Pizza essen? 🍕 Und fährst du mit Über? 🚕'

from nltk.tokenize import regexp_tokenize

# Tokenize and print all words in german_text
all_words = word_tokenize(german_text)
print(all_words)

# Tokenize and print only capital words
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))

# Tokenize and print only emoji
# (one character class of ranges; quotes or | inside [] would be matched literally)
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
print(regexp_tokenize(german_text, emoji))

# Bag of Words 

**Bag of Words** is a basic method for finding topics in a text.

It requires the text to be tokenized first; the occurrences of each token are then counted.

The more frequent a word is, the more important it might be, so these counts can be a good way to identify the significant words in a text.




In [None]:
from nltk.tokenize import word_tokenize
from collections import Counter

counter = Counter(word_tokenize("Hace tanto que sueño su boca que la vida se me ha vuelto loca. Nana nino nanaino nanaino niro na niro na niro nonaaaa"))
print(counter)

In [None]:
counter.most_common(2)

# Text Preprocessing 

Text preprocessing produces better input data, for instance when feeding machine learning models or other statistical methods.

Common preprocessing techniques include tokenization, lowercasing, lemmatization or stemming, and removing stop words and punctuation.

It's good to experiment with several approaches, since results may differ.



In [None]:
from nltk.corpus import stopwords 

text='the cat is in the box. The cat likes the box. The box is over the cat' 

tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]

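# Build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
no_stops = [t for t in tokens if t not in stop_words]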

Counter(no_stops).most_common(2)

In [None]:
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Retain alphabetic words: alpha_only
alpha_only = [t for t in no_stops if t.isalpha()]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in alpha_only]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))

# Gensim

[Gensim](https://radimrehurek.com/gensim/) is a popular open-source NLP library.

It uses top academic models to perform complex tasks:
- Building document or word vectors
- Performing topic identification and document comparison

## Word Vectors 

Based on how words are used in the text, we can derive word vectors that would allow us to find relationships between terms.
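
To make the idea concrete, here is a minimal sketch that derives count-based vectors from co-occurrence within a small window and compares them with cosine similarity. It is a simplified stand-in, not the model gensim trains; the toy sentences, the `window` size, and the helper names are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy, already-tokenized sentences (hypothetical data).
sentences = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
    ["the", "cat", "chases", "the", "mouse"],
    ["the", "dog", "chases", "the", "ball"],
]

def cooccurrence_vectors(docs, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for doc in docs:
        for i, word in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][doc[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = cooccurrence_vectors(sentences)
cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_milk = cosine(vectors["cat"], vectors["milk"])
print(f"cat vs dog:  {cat_dog:.3f}")
print(f"cat vs milk: {cat_milk:.3f}")
```

In this toy data, "cat" and "dog" appear in similar contexts, so their vectors end up closer to each other than "cat" and "milk" are.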

## Corpus and Corpora 

A corpus (plural: corpora) is a set of texts used to help perform NLP tasks.

A gensim corpus is a list of lists, one per document, where each inner list holds tuples of (token id from the dictionary, frequency of that token within the document).

## Dictionary

A dictionary is a mapping from each word (token) to an integer id.
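
The two structures can be sketched in plain Python to show their shape. This is an illustration under stated assumptions, not gensim's code: the toy `documents` and the `token2id` and `corpus` names are chosen for this example, and the ids gensim assigns to tokens may differ from the ones produced here.

```python
from collections import Counter

# Two tiny, already-tokenized documents (hypothetical data).
documents = [
    ["computer", "software", "computer", "system"],
    ["software", "debugging", "software"],
]

# Dictionary: give each distinct token an integer id, in order of first appearance.
token2id = {}
for doc in documents:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# Corpus: one list per document, holding (token_id, frequency) tuples.
corpus = [
    sorted((token2id[token], count) for token, count in Counter(doc).items())
    for doc in documents
]

print(token2id)
print(corpus)
```

Storing ids and counts rather than the raw strings keeps each document compact and lets models work with numbers only.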

In [None]:
articles=[['uses',
'file',
'operating',
'system',
'placement',
'software',
'.svg|thumb|upright|a',
'diagram',
'showing',
'user',
'computing',
'|user',
'interacts',
'application',
'software',
'typical',
'desktop',
'computer.the',
'application',
'software',
'layer',
'interfaces',
'operating',
'system',
'turn',
'communicates',
'personal',
'computer',
'hardware|hardware',
'arrows',
'indicate',
'information',
'flow',
"''",
'computer',
'software',
"''",
'simply',
'``',
"'software",
"''",
'part',
'computer',
'system',
'consists',
'data',
'computing',
'|data',
'computer',
'instructions',
'contrast',
'computer',
'hardware|physical',
'hardware',
'system',
'built',
'computer',
'science',
'software',
'engineering',
'computer',
'software',
'information',
'processed',
'computer',
'systems',
'computer',
'program|programs',
'data',
'computer',
'software',
'includes',
'computer',
'programs',
'library',
'computing',
'|libraries',
'related',
'non-executable',
'data',
'computing',
'|data',
'software',
'documentation|online',
'documentation',
'digital',
'media',
'computer',
'hardware',
'software',
'require',
'neither',
  'realistically', 'used', 'lowest', 'level', 'executable', 'code', 'consists', 'machine', 'code|machine', 'language', 'instructions', 'specific', 'individual', 'microprocessor|processor—typically', 'central', 'processing', 'unit', 'cpu', 'machine', 'language', 'consists', 'groups', 'binary', 'numbers|binary', 'values', 'signifying', 'processor', 'instructions', 'change', 'state', 'computer', 'preceding', 'state', 'example', 'instruction', 'may', 'change', 'value',
  'stored',
  'particular',
  'storage',
  'location',
  'computer—an',
  'effect',
  'directly',
  'observable',
  'user',
  'instruction',
  'may',
  'also',
  'indirectly',
  'cause',
  'something',
  'appear',
  'display',
  'computer',
  'system—a',
  'state',
  'change',
  'visible',
  'user',
  'processor',
  'carries',
  'instructions',
  'order',
  'provided',
  'unless',
  'instructed',
  'branch',
  'instruction|',
  "''",
  'jump',
  "''",
  'different',
  'instruction',
  'interrupted',
  'multi-core',
  'processors',
  'dominant',
  'core',
  'run',
  'instructions',
  'order',
  'however',
  'application',
  'software',
  'runs',
  'one',
  'core',
  'default',
  'software',
  'made',
  'run',
  'many',
  'majority',
  'software',
  'written',
  'high-level',
  'programming',
  'languages',
  'easier',
  'efficient',
  'programmers',
  'meaning',
  'closer',
  'natural',
  'language',
  'cite',
  'web|title=compiler',
  'construction|url=http',
  '//www.cs.uu.nl/education/vak.php',
  'vak=infomcco',
  'high-level',
  'languages',
  'translated',
  'machine',
  'language',
  'using',
  'compiler',
  'interpreter',
  'computing',
  '|interpreter',
  'combination',
  'two',
  'software',
  'may',
  'also',
  'written',
  'low-level',
  'assembly',
  'language',
  'essentially',
  'vaguely',
  'mnemonic',
  'representation',
  'machine',
  'language',
  'using',
  'natural',
  'language',
  'alphabet',
  'translated',
  'machine',
  'language',
  'using',
  'assembly',
  'language|assembler',
  'history',
  'main',
  'article|history',
  'software',
  'outline',
  'algorithm',
  'would',
  'first',
  'piece',
  'software',
  'written',
  'ada',
  'lovelace',
  '19th',
  'century',
  'planned',
  'analytical',
  'engine',
  'however',
  'neither',
  'analytical',
  'engine',
  'software',
  'ever',
  'created',
  'first',
  'theory',
  'software—prior',
  'creation',
  'computers',
  'know',
  'today—was',
  'proposed',
  'alan',
  'turing',
  '1935',
  'essay',
  '``',
  'computable',
  'numbers',
  'application',
  'entscheidungsproblem',
  "''",
  'decision',
  'problem',
  'eventually',
  'led',
  'creation',
  'twin',
  'academic',
  'fields',
  'computer',
  'science',
  'software',
  'engineering',
  'study',
  'software',
  'creation',
  'computer',
  'science',
  'theoretical',
  'turing',
  "'s",
  'essay',
  'example',
  'computer',
  'science',
  'software',
  'engineering',
  'focuses',
  'practical',
  'concerns',
  'however',
  'prior',
  '1946',
  'software',
  'understand',
  'it—programs',
  'stored',
  'memory',
  'stored-program',
  'digital',
  'computers—did',
  'yet',
  'exist',
  'first',
  'electronic',
  'computing',
  'devices',
  'instead',
  'rewired',
  'order',
  '``',
  'reprogram',
  "''",
  'types',
  'software',
  'see',
  'also|list',
  'software',
  'categories',
  'virtually',
  'computer',
  'platforms',
  'software',
  'grouped',
  'broad',
  'categories',
  '=purpose',
  'domain',
  'use=',
  'based',
  'goal',
  'computer',
  'software',
  'divided',
  '``',
  'application',
  'software',
  "''",
  'software',
  'uses',
  'computer',
  'system',
  'perform',
  'special',
  'functions',
  'provide',
  'video',
  'game|entertainment',
  'functions',
  'beyond',
  'basic',
  'operation',
  'computer',
  'many',
  'different',
  'types',
  'application',
  'software',
  'range',
  'tasks',
  'performed',
  'modern',
  'computer',
  'large—see',
  'list',
  'software',
  "''system",
  'software',
  "''",
  'software',
  'directly',
  'operates',
  'computer',
  'hardware',
  'provide',
  'basic',
  'functionality',
  'needed',
  'users',
  'software',
  'provide',
  'platform',
  'running',
  'application',
  'software',
  'cite',
  'web|title=system',
  'software|url=http',
  '//home.olemiss.edu/~misbook/sfsysfm.htm|archive-url=https',
  '//web.archive.org/web/20010530092843/http',
  '//home.olemiss.edu:80/~misbook/sfsysfm.htm|dead-url=yes|archive-date=2001-05-30|publisher=the',
  'university',
  'mississippi',
  'system',
  'software',
  'includes',
  '``',
  'operating',
  'systems',
  "''",
  'essential',
  'collections',
  'software',
  'manage',
  'resources',
  'provides',
  'common',
  'services',
  'software',
  'runs',
  '``',
  'top',
  "''",
  'supervisory',
  'programs',
  'boot',
  'loaders',
  'shell',
  'computing',
  '|shells',
  'window',
  'systems',
  'core',
  'parts',
  'operating',
  'systems',
  'practice',
  'operating',
  'system',
  'comes',
  'bundled',
  'additional',
  'software',
  'including',
  'application',
  'software',
  'user',
  'potentially',
  'work',
  'computer',
  'operating',
  'system',
  "''device",
  'drivers',
  "''",
  'operate',
  'control',
  'particular',
  'type',
  'device',
  'attached',
  'computer',
  'device',
  'needs',
  'least',
  'one',
  'corresponding',
  'device',
  'driver',
  'computer',
  'typically',
  'minimum',
  'least',
  'one',
  'input',
  'device',
  'least',
  'one',
  'output',
  'device',
  'computer',
  'typically',
  'needs',
  'one',
  'device',
  'driver',
  "''software",
  'utility|utilities',
  "''",
  'computer',
  'programs',
  'designed',
  'assist',
  'users',
  'maintenance',
  'care',
  'computers',
  "''malicious",
  'software',
  "''",
  '``',
  'malware',
  "''",
  'software',
  'developed',
  'harm',
  'disrupt',
  'computers',
  'malware',
  'undesirable',
  'malware',
  'closely',
  'associated',
  'computer-related',
  'crimes',
  'though',
  'malicious',
  'programs',
  'may',
  'designed',
  'practical',
  'jokes',
  '=nature',
  'domain',
  'execution=',
  'desktop',
  'applications',
  'web',
  'browsers',
  'microsoft',
  'office',
  'well',
  'smartphone',
  'tablet',
  'computer|tablet',
  'applications',
  'called',
  '``',
  'mobile',
  'app|apps',
  "''",
  'push',
  'parts',
  'software',
  'industry',
  'merge',
  'desktop',
  'applications',
  'mobile',
  'apps',
  'extent',
  'windows',
  '8',
  'later',
  'ubuntu',
  'touch',
  'tried',
  'allow',
  'style',
  'application',
  'user',
  'interface',
  'used',
  'desktops',
  'laptops',
  'mobiles',
  'javascript',
  'scripts',
  'pieces',
  'software',
  'traditionally',
  'embedded',
  'web',
  'pages',
  'run',
  'directly',
  'inside',
  'web',
  'browser',
  'web',
  'page',
  'loaded',
  'without',
  'need',
  'web',
  'browser',
  'plugin',
  'software',
  'written',
  'programming',
  'languages',
  'also',
  'run',
  'within',
  'web',
  'browser',
  'software',
  'either',
  'translated',
  'javascript',
  'web',
  'browser',
  'plugin',
  'supports',
  'language',
  'installed',
  'common',
  'example',
  'latter',
  'actionscript',
  'scripts',
  'supported',
  'adobe',
  'flash',
  'plugin',
  'server',
  'software',
  'including',
  'web',
  'applications',
  'usually',
  'run',
  'web',
  'server',
  'output',
  'dynamically',
  'generated',
  'web',
  'pages',
  'web',
  'browsers',
  'using',
  'e.g',
  'php',
  'java',
  'programming',
  'language',
  '|java',
  'asp.net',
  'even',
  'node.js|javascript',
  'runs',
  'server',
  'modern',
  'times',
  'commonly',
  'include',
  'javascript',
  'run',
  'web',
  'browser',
  'well',
  'case',
  'typically',
  'run',
  'partly',
  'server',
  'partly',
  'web',
  'browser',
  'plug-in',
  'computing',
  '|plugins',
  'extensions',
  'software',
  'extends',
  'modifies',
  'functionality',
  'another',
  'piece',
  'software',
  'require',
  'software',
  'used',
  'order',
  'function',
  'embedded',
  'software',
  'resides',
  'firmware',
  'within',
  'embedded',
  'systems',
  'devices',
  'dedicated',
  'single',
  'use',
  'uses',
  'cars',
  'televisions',
  'although',
  'embedded',
  'devices',
  'wireless',
  'chipsets',
  '``',
  "''",
  'part',
  'ordinary',
  'non-embedded',
  'computer',
  'system',
  'pc',
  'smartphone',
  'cite',
  'web|title=embedded',
  'software—technologies',
  'trends|url=http',
  '//www.computer.org/csdl/mags/so/2009/03/mso2009030014.html|publisher=ieee',
  'computer',
  'society|date=may–june',
  '2009|accessdate=6',
  'november',
  '2013',
  'embedded',
  'system',
  'context',
  'sometimes',
  'clear',
  'distinction',
  'system',
  'software',
  'application',
  'software',
  'however',
  'embedded',
  'systems',
  'run',
  'embedded',
  'operating',
  'systems',
  'systems',
  'retain',
  'distinction',
  'system',
  'software',
  'application',
  'software',
  'although',
  'typically',
  'one',
  'fixed',
  'application',
  'always',
  'run',
  'microcode',
  'special',
  'relatively',
  'obscure',
  'type',
  'embedded',
  'software',
  'tells',
  'processor',
  '``',
  "''",
  'execute',
  'machine',
  'code',
  'actually',
  'lower',
  'level',
  'machine',
  'code',
  'typically',
  'proprietary',
  'processor',
  'manufacturer',
  'necessary',
  'correctional',
  'microcode',
  'software',
  'updates',
  'supplied',
  'users',
  'much',
  'cheaper',
  'shipping',
  'replacement',
  'processor',
  'hardware',
  'thus',
  'ordinary',
  'programmer',
  'would',
  'expect',
  'ever',
  'deal',
  '=programming',
  'tools=',
  'main',
  'article|programming',
  'tool',
  'programming',
  'tools',
  'also',
  'software',
  'form',
  'programs',
  'applications',
  'software',
  'developers',
  'also',
  'known',
  '``',
  'programmers',
  'coders',
  'hackers',
  "''",
  '``',
  'software',
  'engineers',
  "''",
  'use',
  'create',
  'debugging|debug',
  'software',
  'maintenance|maintain',
  'i.e',
  'improve',
  'fix',
  'otherwise',
  'technical',
  'support|support',
  'software',
  'software',
  'written',
  'one',
  'programming',
  'languages',
  'many',
  'programming',
  'languages',
  'existence',
  'least',
  'one',
  'implementation',
  'consists',
  'set',
  'programming',
  'tools',
  'tools',
  'may',
  'relatively',
  'self-contained',
  'programs',
  'compilers',
  'debuggers',
  'interpreter',
  'computing',
  '|interpreters',
  'linker',
  'computing',
  '|linkers',
  'text',
  'editors',
  'combined',
  'together',
  'accomplish',
  'task',
  'may',
  'form',
  'integrated',
  'development',
  'environment',
  'ide',
  'combines',
  'much',
  'functionality',
  'self-contained',
  'tools',
  'ides',
  'may',
  'either',
  'invoking',
  'relevant',
  'individual',
  'tools',
  're-implementing',
  'functionality',
  'new',
  'way',
  'ide',
  'make',
  'easier',
  'specific',
  'tasks',
  'searching',
  'files',
  'particular',
  'project',
  'many',
  'programming',
  'language',
  'implementations',
  'provide',
  'option',
  'using',
  'individual',
  'tools',
  'ide',
  'software',
  'topics',
  '=architecture=',
  'see',
  'also|software',
  'architecture',
  'users',
  'often',
  'see',
  'things',
  'differently',
  'programmers',
  'people',
  'use',
  'modern',
  'general',
  'purpose',
  'computers',
  'opposed',
  'embedded',
  'systems',
  'analog',
  'computers',
  'supercomputers',
  'usually',
  'see',
  'three',
  'layers',
  'software',
  'performing',
  'variety',
  'tasks',
  'platform',
  'application',
  'user',
  'software',
  'platform',
  'software',
  'platform',
  'computing',
  '|platform',
  'includes',
  'firmware',
  'device',
  'drivers',
  'operating',
  'system',
  'typically',
  'graphical',
  'user',
  'interface',
  'total',
  'allow',
  'user',
  'interact',
  'computer',
  'peripherals',
  'associated',
  'equipment',
  'platform',
  'software',
  'often',
  'comes',
  'bundled',
  'computer',
  'personal',
  'computer|pc',
  'one',
  'usually',
  'ability',
  'change',
  'platform',
  'software',
  'application',
  'software',
  'application',
  'software',
  'applications',
  'people',
  'think',
  'think',
  'software',
  'typical',
  'examples',
  'include',
  'office',
  'suites',
  'video'],
 ["''",
  'debugging',
  "''",
  'process',
  'finding',
  'resolving',
  'defects',
  'prevent',
  'correct',
  'operation',
  'computer',
  'software',
  'system',
  'numerous',
  'books',
  'written',
  'debugging',
  'see',
  'reading|further',
  'reading',
  'involves',
  'numerous',
  'aspects',
  'including',
  'interactive',
  'debugging',
  'control',
  'flow',
  'integration',
  'testing',
  'logfile|log',
  'files',
  'monitoring',
  'application',
  'monitoring|application',
  'system',
  'monitoring|system',
  'memory',
  'dumps',
  'profiling',
  'computer',
  'programming',
  '|profiling',
  'statistical',
  'process',
  'control',
  'special',
  'design',
  'tactics',
  'improve',
  'detection',
  'simplifying',
  'changes',
  'origin',
  'computer',
  'log',
  'entry',
  'mark',
  'nbsp',
  'ii',
  'moth',
  'taped',
  'page',
  'terms',
  '``',
  'bug',
  "''",
  '``',
  'debugging',
  "''",
  'popularly',
  'attributed',
  'admiral',
  'grace',
  'hopper',
  '1940s',
  'http',
  '//foldoc.org/grace+hopper',
  'grace',
  'hopper',
  'foldoc',
  'working',
  'harvard',
  'mark',
  'ii|mark',
  'ii',
  'computer',
  'harvard',
  'university',
  'associates',
  'discovered',
  'moth',
  'stuck',
  'relay',
  'thereby',
  'impeding',
  'operation',
  'whereupon',
  'remarked',
  '``',
  'debugging',
  "''",
  'system',
  'however',
  'term',
  '``',
  'bug',
  "''",
  'meaning',
  'technical',
  'error',
  'dates',
  'back',
  'least',
  '1878',
  'thomas',
  'edison',
  'see',
  'software',
  'bug',
  'full',
  'discussion',
  '``',
  'debugging',
  "''",
  'seems',
  'used',
  'term',
  'aeronautics',
  'entering',
  'world',
  'computers',
  'indeed',
  'interview',
  'grace',
  'hopper',
  'remarked',
  'coining',
  'term',
  'citation',
  'needed|date=july',
  '2015',
  'moth',
  'fit',
  'already',
  'existing',
  'terminology',
  'saved',
  'letter',
  'j.',
  'robert',
  'oppenheimer',
  'director',
  'wwii',
  'atomic',
  'bomb',
  '``',
  'manhattan',
  "''",
  'project',
  'los',
  'alamos',
  'nm',
  'used',
  'term',
  'letter',
  'dr.',
  'ernest',
  'lawrence',
  'uc',
  'berkeley',
  'dated',
  'october',
  '27',
  '1944',
  'http',
  '//bancroft.berkeley.edu/exhibits/physics/images/bigscience25.jpg',
  'regarding',
  'recruitment',
  'additional',
  'technical',
  'staff',
  'oxford',
  'english',
  'dictionary',
  'entry',
  '``',
  'debug',
  "''",
  'quotes',
  'term',
  '``',
  'debugging',
  "''",
  'used',
  'reference',
  'airplane',
  'engine',
  'testing',
  '1945',
  'article',
  'journal',
  'royal',
  'aeronautical',
  'society',
  'article',
  '``',
  'airforce',
  "''",
  'june',
  '1945',
  'p.',
  'nbsp',
  '50',
  'also',
  'refers',
  'debugging',
  'time',
  'aircraft',
  'cameras',
  'hopper',
  "'s",
  'computer',
  'bug|bug',
  'found',
  'september',
  '9',
  '1947.',
  'term',
  'adopted',
  'computer',
  'programmers',
  'early',
  '1950s',
  'seminal',
  'article',
  'gills',
  'gill',
  'http',
  '//www.jstor.org/stable/98663',
  'diagnosis',
  'mistakes',
  'programmes',
  'edsac',
  'proceedings',
  'royal',
  'society',
  'london',
  'series',
  'mathematical',
  'physical',
  'sciences',
  'vol',
  '206',
  '1087',
  'may',
  '22',
  '1951',
  'pp',
  '538-554',
  '1951',
  'earliest',
  'in-depth',
  'discussion',
  'programming',
  'errors',
  'use',
  'term',
  '``',
  'bug',
  "''",
  '``',
  'debugging',
  "''",
  'association',
  'computing',
  'machinery|acm',
  "'s",
  'digital',
  'library',
  'term',
  '``',
  'debugging',
  "''",
  'first',
  'used',
  'three',
  'papers',
  '1952',
  'acm',
  'national',
  'meetings.robert',
  'v.',
  'd.',
  'campbell',
  'http',
  '//portal.acm.org/citation.cfm',
  'id=609784.609786',
  'evolution',
  'automatic',
  'computation',
  'proceedings',
  '1952',
  'acm',
  'national',
  'meeting',
  'pittsburgh',
  'p',
  '29-32',
  '1952.alex',
  'orden',
  'http',
  '//portal.acm.org/citation.cfm',
  'id=609784.609793',
  'solution',
  'systems',
  'linear',
  'inequalities',
  'digital',
  'computer',
  'proceedings',
  '1952',
  'acm',
  'national',
  'meeting',
  'pittsburgh',
  'p.',
  '91-95',
  '1952.howard',
  'b.',
  'demuth',
  'john',
  'b.',
  'jackson',
  'edmund',
  'klein',
  'n.',
  'metropolis',
  'walter',
  'orvedahl',
  'james',
  'h.',
  'richardson',
  'http',
  '//portal.acm.org/citation.cfm',
  'id=800259.808982',
  'maniac',
  'proceedings',
  '1952',
  'acm',
  'national',
  'meeting',
  'toronto',
  'p.',
  '13-16',
  'two',
  'three',
  'use',
  'term',
  'quotation',
  'marks',
  '1963',
  '``',
  'debugging',
  "''",
  'common',
  'enough',
  'term',
  'mentioned',
  'passing',
  'without',
  'explanation',
  'page',
  '1',
  'compatible',
  'time-sharing',
  'system|ctss',
  'manual',
  'http',
  '//www.bitsavers.org/pdf/mit/ctss/ctss_programmersguide.pdf',
  'compatible',
  'time-sharing',
  'system',
  'm.i.t',
  'press',
  '1963',
  'kidwell',
  "'s",
  'article',
  '``',
  'stalking',
  'elusive',
  'computer',
  'bug',
  "''",
  'peggy',
  'aldrich',
  'kidwell',
  'http',
  '//ieeexplore.ieee.org/xpl/freeabs_all.jsp',
  'tp=',
  'arnumber=728224',
  'isnumber=15706',
  'stalking',
  'elusive',
  'computer',
  'bug',
  'ieee',
  'annals',
  'history',
  'computing',
  '1998.',
  'discusses',
  'etymology',
  '``',
  'bug',
  "''",
  '``',
  'debug',
  "''",
  'greater',
  'detail',
  'scope',
  'software',
  'electronic',
  'systems',
  'become',
  'generally',
  'complex',
  'various',
  'common',
  'debugging',
  'techniques',
  'expanded',
  'methods',
  'detect',
  'anomalies',
  'assess',
  'impact',
  'schedule',
  'software',
  'patches',
  'full',
  'updates',
  'system',
  'words',
  '``',
  'anomaly',
  "''",
  '``',
  'discrepancy',
  "''",
  'used',
  'neutral',
  'terms',
  'avoid',
  'words',
  '``',
  'error',
  "''",
  '``',
  'defect',
  "''",
  '``',
  'bug',
  "''",
  'might',
  'implication',
  'so-called',
  '``',
  'errors',
  "''",
  '``',
  'defects',
  "''",
  '``',
  'bugs',
  "''",
  'must',
  'fixed',
  'costs',
  'instead',
  'impact',
  'assessment',
  'made',
  'determine',
  'changes',
  'remove',
  '``',
  'anomaly',
  "''",
  '``',
  'discrepancy',
  "''",
  'would',
  'cost-effective',
  'system',
  'perhaps',
  'scheduled',
  'new',
  'release',
  'might',
  'render',
  'change',
  'unnecessary',
  'issues',
  'life-critical',
  'mission-critical',
  'system',
  'also',
  'important',
  'avoid',
  'situation',
  'change',
  'might',
  'upsetting',
  'users',
  'long-term',
  'living',
  'known',
  'problem',
  '``',
  'cure',
  'would',
  'worse',
  'disease',
  "''",
  'basing',
  'decisions',
  'acceptability',
  'anomalies',
  'avoid',
  'culture',
  '``',
  'zero-defects',
  "''",
  'mandate',
  'people',
  'might',
  'tempted',
  'deny',
  'existence',
  'problems',
  'result',
  'would',
  'appear',
  'zero',
  '``',
  'defects',
  "''",
  'considering',
  'collateral',
  'issues',
  'cost-versus-benefit',
  'impact',
  'assessment',
  'broader',
  'debugging',
  'techniques',
  'expand',
  'determine',
  'frequency',
  'anomalies',
  'often',
  '``',
  'bugs',
  "''",
  'occur',
  'help',
  'assess',
  'impact',
  'overall',
  'system',
  'tools',
  'debugging',
  'video',
  'game',
  'consoles',
  'usually',
  'done',
  'special',
  'hardware',
  'xbox',
  'console',
  '|xbox',
  'debug',
  'unit',
  'intended',
  'developers',
  'debugging',
  'ranges',
  'complexity',
  'fixing',
  'simple',
  'errors',
  'performing',
  'lengthy',
  'tiresome',
  'tasks',
  'data',
  'collection',
  'analysis',
  'scheduling',
  'updates',
  'debugging',
  'skill',
  'programmer',
  'major',
  'factor',
  'ability',
  'debug',
  'problem',
  'difficulty',
  'software',
  'debugging',
  'varies',
  'greatly',
  'complexity',
  'system',
  'also',
  'depends',
  'extent',
  'programming',
  'language',
  'used',
  'available',
  'tools',
  '``',
  'debuggers',
  "''",
  'debuggers',
  'software',
  'tools',
  'enable',
  'programmer',
  'monitor',
  'execution',
  'computers',
  '|execution',
  'program',
  'stop',
  'restart',
  'set',
  'breakpoints',
  'change',
  'values',
  'memory',
  'term',
  '``',
  'debugger',
  "''",
  'also',
  'refer',
  'person',
  'debugging',
  'generally',
  'high-level',
  'programming',
  'languages',
  'java',
  'programming',
  'language',
  '|java',
  'make',
  'debugging',
  'easier',
  'features',
  'exception',
  'handling',
  'make',
  'real',
  'sources',
  'erratic',
  'behaviour',
  'easier',
  'spot',
  'programming',
  'languages',
  'c',
  'programming',
  'language',
  '|c',
  'assembly',
  'language|assembly',
  'bugs',
  'may',
  'cause',
  'silent',
  'problems',
  'memory',
  'corruption',
  'often',
  'difficult',
  'see',
  'initial',
  'problem',
  'happened',
  'cases',
  'memory',
  'debugging|memory',
  'debugger',
  'tools',
  'may',
  'needed',
  'certain',
  'situations',
  'general',
  'purpose',
  'software',
  'tools',
  'language',
  'specific',
  'nature',
  'useful',
  'take',
  'form',
  '``',
  'list',
  'tools',
  'static',
  'code',
  'analysis|static',
  'code',
  'analysis',
  'tools',
  "''",
  'tools',
  'look',
  'specific',
  'set',
  'known',
  'problems',
  'common',
  'rare',
  'within',
  'source',
  'code',
  'issues',
  'detected',
  'tools',
  'would',
  'rarely',
  'picked',
  'compiler',
  'interpreter',
  'thus',
  'syntax',
  'checkers',
  'semantic',
  'checkers',
  'tools',
  'claim',
  'able',
  'detect',
  '300+',
  'unique',
  'problems',
  'commercial',
  'free',
  'tools',
  'exist',
  'various',
  'languages',
  'tools',
  'extremely',
  'useful',
  'checking',
  'large',
  'source',
  'trees',
  'impractical',
  'code',
  'walkthroughs',
  'typical',
  'example',
  'problem',
  'detected',
  'would',
  'variable',
  'dereference',
  'occurs',
  '``',
  "''",
  'variable',
  'assigned',
  'value',
  'another',
  'example',
  'would',
  'perform',
  'strong',
  'type',
  'checking',
  'language',
  'require',
  'thus',
  'better',
  'locating',
  'likely',
  'errors',
  'versus',
  'actual',
  'errors',
  'result',
  'tools',
  'reputation',
  'false',
  'positives',
  'old',
  'unix',
  '``',
  'lint',
  'programming',
  'tool|lint',
  "''",
  'program',
  'early',
  'example',
  'debugging',
  'electronic',
  'hardware',
  'e.g.',
  'computer',
  'hardware',
  'well',
  'low-level',
  'software',
  'e.g.',
  'bioses',
  'device',
  'drivers',
  'firmware',
  'instruments',
  'oscilloscopes',
  'logic',
  'analyzers',
  'in-circuit',
  'emulator|in-circuit',
  'emulators',
  'ices',
  'often',
  'used',
  'alone',
  'combination',
  'ice',
  'may',
  'perform',
  'many',
  'typical',
  'software',
  'debugger',
  "'s",
  'tasks',
  'low-level',
  'software',
  'firmware',
  'debugging',
  'process',
  'normally',
  'first',
  'step',
  'debugging',
  'attempt',
  'reproduce',
  'problem',
  'non-trivial',
  'task',
  'example',
  'parallel',
  'computing|parallel',
  'processes',
  'unusual',
  'software',
  'bugs',
  'also',
  'specific',
  'user',
  'environment',
  'usage',
  'history',
  'make',
  'difficult',
  'reproduce',
  'problem',
  'bug',
  'reproduced',
  'input',
  'program',
  'may',
  'need',
  'simplified',
  'make',
  'easier',
  'debug',
  'example',
  'bug',
  'compiler',
  'make',
  'crash',
  'computing',
  '|crash',
  'parsing',
  'large',
  'source',
  'file',
  'however',
  'simplification',
  'test',
  'case',
  'lines',
  'original',
  'source',
  'file',
  'sufficient',
  'reproduce',
  'crash',
  'simplification',
  'made',
  'manually',
  'using',
  'divide',
  'conquer',
  'algorithm|divide-and-conquer',
  'approach',
  'programmer',
  'try',
  'remove',
  'parts',
  'original',
  'test',
  'case',
  'check',
  'problem',
  'still',
  'exists',
  'debugging',
  'problem',
  'graphical',
  'user',
  'interface|gui',
  'programmer',
  'try',
  'skip',
  'user',
  'interaction',
  'original',
  'problem',
  'description',
  'check',
  'remaining',
  'actions',
  'sufficient',
  'bugs',
  'appear',
  'test',
  'case',
  'sufficiently',
  'simplified',
  'programmer',
  'use',
  'debugger',
  'tool',
  'examine',
  'program',
  'states',
  'values',
  'variables',
  'plus',
  'call',
  'stack',
  'track',
  'origin',
  'problem',
  'alternatively',
  'tracing',
  'software',
  '|tracing',
  'used',
  'simple',
  'cases',
  'tracing',
  'print',
  'statements',
  'output',
  'values',
  'variables',
  'certain',
  'points',
  'program',
  'execution',
  'citation',
  'needed|date=february',
  '2016',
  'techniques',
  '``',
  'interactive',
  'debugging',
  "''",
  '``',
  'visible'],
 ['use',
  'dmy',
  'dates|date=september',
  '2013',
  'refimprove|date=december',
  '2013',
  'file',
  'crashed',
  'computer.jpg|thumb|a',
  'crashed',
  'imac',
  'computing',
  '``',
  "'crash",
  "''",
  '``',
  "'system",
  'crash',
  "''",
  'occurs',
  'computer',
  'program',
  'software',
  'application',
  'operating',
  'system',
  'stops',
  'functioning',
  'properly',
  'exit',
  'system',
  'call',
  '|exits',
  'program',
  'responsible',
  'may',
  'appear',
  'hang',
  'computing',
  '|hang',
  'crash',
  'reporter|crash',
  'reporting',
  'service',
  'reports',
  'crash',
  'details',
  'relating',
  'program',
  'critical',
  'part',
  'operating',
  'system',
  'entire',
  'system',
  'may',
  'crash',
  'hang',
  'computing',
  '|hang',
  'often',
  'resulting',
  'kernel',
  'panic',
  'fatal',
  'system',
  'error',
  'crashes',
  'result',
  'executing',
  'invalid',
  'instruction',
  'set|machine',
  'instructions',
  'typical',
  'causes',
  'include',
  'incorrect',
  'address',
  'space|address',
  'values',
  'program',
  'counter',
  'buffer',
  'overflow',
  'overwriting',
  'portion',
  'affected',
  'program',
  'code',
  'due',
  'earlier',
  'computer',
  'bug|bug',
  'accessing',
  'invalid',
  'memory',
  'addresses',
  'using',
  'illegal',
  'opcode',
  'triggering',
  'unhandled',
  'exception',
  'handling|exception',
  'original',
  'software',
  'bug',
  'started',
  'chain',
  'events',
  'typically',
  'considered',
  'cause',
  'crash',
  'discovered',
  'process',
  'debugging',
  'original',
  'bug',
  'far',
  'removed',
  'source',
  'code|code',
  'actually',
  'crashed',
  'earlier',
  'personal',
  'computers',
  'attempting',
  'write',
  'data',
  'hardware',
  'addresses',
  'outside',
  'system',
  "'s",
  'main',
  'memory',
  'could',
  'cause',
  'hardware',
  'damage',
  'crashes',
  'exploit',
  'computer',
  'security',
  '|exploitable',
  'allow',
  'malicious',
  'program',
  'hacker',
  'execute',
  'arbitrary',
  'code',
  'execution|arbitrary',
  'code',
  'allowing',
  'replication',
  'computer',
  'virus|viruses',
  'acquisition',
  'data',
  'would',
  'normally',
  'inaccessible',
  'application',
  'crashes',
  'image',
  'computer',
  'crash',
  'airport.jpg|thumb|a',
  'display',
  'frankfurt',
  'airport',
  'running',
  'program',
  'windows',
  'xp',
  'crashed',
  'due',
  'segmentation',
  'fault|memory',
  'read',
  'access',
  'violation',
  'software',
  'application|application',
  'typically',
  'crashes',
  'performs',
  'operation',
  'allowed',
  'operating',
  'system',
  'operating',
  'system',
  'triggers',
  'exception',
  'handling|exception',
  'signal',
  'computing',
  '|signal',
  'application',
  'unix',
  'applications',
  'traditionally',
  'responded',
  'signal',
  'core',
  'dump|dumping',
  'core',
  'windows',
  'unix',
  'graphical',
  'user',
  'interface|gui',
  'applications',
  'respond',
  'displaying',
  'dialogue',
  'box',
  'one',
  'shown',
  'right',
  'option',
  'attach',
  'debugger',
  'one',
  'installed',
  'applications',
  'attempt',
  'recover',
  'error',
  'continue',
  'running',
  'instead',
  'exit',
  'system',
  'call',
  '|exiting',
  'typical',
  'errors',
  'result',
  'application',
  'crashes',
  'include',
  'attempting',
  'read',
  'write',
  'memory',
  'allocated',
  'reading',
  'writing',
  'application',
  'segmentation',
  'fault',
  'x86',
  'specific',
  'general',
  'protection',
  'fault',
  'attempting',
  'execute',
  'privileged',
  'invalid',
  'instructions',
  'attempting',
  'perform',
  'i/o',
  'operations',
  'computer',
  'hardware|hardware',
  'devices',
  'permission',
  'access',
  'passing',
  'invalid',
  'arguments',
  'system',
  'calls',
  'attempting',
  'access',
  'system',
  'resources',
  'application',
  'permission',
  'access',
  'attempting',
  'execute',
  'machine',
  'instructions',
  'bad',
  'arguments',
  'depending',
  'cpu',
  'architecture',
  'division',
  'zero|divide',
  'zero',
  'operations',
  'denormal',
  'number|denorms',
  'nan',
  'values',
  'memory',
  'access',
  'bus',
  'error|unaligned',
  'addresses',
  'etc',
  'web',
  'server',
  'crashes',
  'software',
  'running',
  'web',
  'server',
  'behind',
  'website',
  'may',
  'crash',
  'rendering',
  'inaccessible',
  'entirely',
  'providing',
  'error',
  'message',
  'instead',
  'normal',
  'content',
  'example',
  'site',
  'using',
  'sql',
  'database',
  'mysql',
  'script',
  'php',
  'sql',
  'database',
  'server',
  'crashes',
  'php',
  'display',
  'connection',
  'error',
  'operating',
  'system',
  'crashes',
  'operating',
  'system',
  'crash',
  'commonly',
  'occurs',
  'exception',
  'handling',
  'exception',
  'handling',
  'hardware|hardware',
  'exception',
  'occurs',
  'exception',
  'handling|handled',
  'operating',
  'system',
  'crashes',
  'also',
  'occur',
  'internal',
  'sanity',
  'check|sanity-checking',
  'logic',
  'within',
  'operating',
  'system',
  'detects',
  'operating',
  'system',
  'lost',
  'internal',
  'self-consistency',
  'modern',
  'multi-tasking',
  'operating',
  'systems',
  'windows',
  'nt',
  'linux',
  'macos',
  'usually',
  'remain',
  'unharmed',
  'application',
  'program',
  'crashes',
  'security',
  'implications',
  'crashes',
  'many',
  'software',
  'bugs',
  'cause',
  'crashes',
  'also',
  'exploit',
  'computer',
  'security',
  '|exploitable',
  'arbitrary',
  'code',
  'execution',
  'types',
  'privilege',
  'escalation.',
  'ref',
  'cite',
  'web|url=http',
  '//msdn.microsoft.com/en-us/magazine/cc163311.aspx',
  '|title=analyze',
  'crashes',
  'find',
  'security',
  'vulnerabilities',
  'apps',
  '|publisher=msdn.microsoft.com',
  '|date=2007-04-26',
  '|accessdate=2014-06-26',
  'ref',
  'cite',
  'web|url=http',
  '//www.squarefree.com/2006/11/01/memory-safety-bugs-in-c-code/',
  '|title=jesse',
  'ruderman',
  '»',
  'memory',
  'safety',
  'bugs',
  'c++',
  'code',
  '|publisher=squarefree.com',
  '|date=2006-11-01',
  '|accessdate=2014-06-26',
  'example',
  'stack',
  'buffer',
  'overflow',
  'overwrite',
  'return',
  'address',
  'subroutine',
  'invalid',
  'value',
  'cause',
  'segmentation',
  'fault',
  'subroutine',
  'returns',
  'however',
  'exploit',
  'overwrites',
  'return',
  'address',
  'valid',
  'value',
  'code',
  'address',
  'executed',
  'see',
  'also',
  'blue',
  'screen',
  'death',
  'crash-only',
  'software',
  'crash',
  'reporter',
  'crash',
  'desktop',
  'data',
  'loss',
  'debugging',
  'guru',
  'meditation',
  'kernel',
  'panic',
  'memory',
  'corruption',
  'reboot',
  'computing',
  '|reboot',
  'safe',
  'mode',
  'segmentation',
  'fault',
  'systemrescuecd',
  'undefined',
  'behaviour',
  'references',
  'reflist',
  'external',
  'links',
  'commons',
  'category|computer',
  'errors',
  'http',
  '//windows.microsoft.com/en-us/windows-vista/picking-up-the-pieces-after-a-computer-crash',
  'picking',
  'pieces',
  'computer',
  'crash',
  'http',
  '//www.scientificamerican.com/article.cfm',
  'id=why-do-computers-crash',
  'computers',
  'crash',
  'defaultsort',
  'crash',
  'computing',
  'category',
  'computer',
  'jargon',
  'category',
  'computer',
  'errors',
  'category',
  'software',
  'anomalies'],
 ['use',
  'dmy',
  'dates|date=march',
  '2014',
  'information',
  'security',
  'refimprove|date=july',
  '2013',
  "''",
  'malware',
  "''",
  'short',
  '``',
  "'malicious",
  'software',
  "''",
  'software',
  'used',
  'disrupt',
  'computer',
  'mobile',
  'operations',
  'gather',
  'sensitive',
  'information',
  'gain',
  'access',
  'private',
  'computer',
  'systems',
  'display',
  'unwanted',
  'advertising.',
  'ref',
  'cite',
  'web|url=http',
  '//techterms.com/definition/malware|',
  'title=malware',
  'definition|',
  'publisher=techterms.com',
  '|accessdate=27',
  'september',
  '2015',
  'term',
  'malware',
  'coined',
  'yisrael',
  'radai',
  '1990',
  'malicious',
  'software',
  'referred',
  'computer',
  'viruses.',
  'ref',
  'name=',
  "''",
  'elisan2012',
  "''",
  'cite',
  'book|author=christopher',
  'elisan|title=malware',
  'rootkits',
  'botnets',
  'beginner',
  "'s",
  'guide|url=https',
  '//books.google.com/books',
  'id=josfllpg1kkc',
  'pg=pa10|date=5',
  'september',
  '2012|publisher=mcgraw',
  'hill',
  'professional|isbn=978-0-07-179205-9|pages=10–',
  'first',
  'category',
  'malware',
  'propagation',
  'concerns',
  'parasitic',
  'software',
  'fragments',
  'attach',
  'existing',
  'executable',
  'content',
  'fragment',
  'may',
  'machine',
  'code',
  'infects',
  'existing',
  'application',
  'utility',
  'system',
  'program',
  'even',
  'code',
  'used',
  'boot',
  'computer',
  'system.',
  'ref',
  'name=',
  "''",
  'stallings',
  '2012',
  'p.182',
  '``',
  'cite',
  'book',
  'last=stallings',
  'first=william',
  'title=computer',
  'security',
  'principles',
  'practice',
  'publisher=pearson',
  'location=boston',
  'year=2012',
  'isbn=978-0-13-277506-9',
  'page=182',
  'malware',
  'defined',
  'malicious',
  'intent',
  'acting',
  'requirements',
  'computer',
  'user',
  'include',
  'software',
  'causes',
  'unintentional',
  'harm',
  'due',
  'deficiency',
  'malware',
  'may',
  'stealthy',
  'intended',
  'steal',
  'information',
  'spy',
  'computer',
  'users',
  'extended',
  'period',
  'without',
  'knowledge',
  'example',
  'regin',
  'malware',
  '|regin',
  'may',
  'designed',
  'cause',
  'harm',
  'often',
  'sabotage',
  'e.g.',
  'stuxnet',
  'extort',
  'payment',
  'cryptolocker',
  "'malware",
  'umbrella',
  'term',
  'used',
  'refer',
  'variety',
  'forms',
  'hostile',
  'intrusive',
  'software',
  'ref',
  'cite',
  'web|url=http',
  '//technet.microsoft.com/en-us/library/dd632948.aspx|title=defining',
  'malware',
  'faq|publisher=technet.microsoft.com|accessdate=10',
  'september',
  '2009',
  'including',
  'computer',
  'viruses',
  'computer',
  'worm|worms',
  'trojan',
  'horse',
  'computing',
  '|trojan',
  'horses',
  'ransomware',
  'malware',
  '|ransomware',
  'spyware',
  'adware',
  'scareware',
  'malicious',
  'programs',
  '--',
  'rootkits',
  'keyloggers',
  'dialers',
  'bhos',
  'types',
  'malware',
  'function',
  'groups',
  'necessarily',
  'even',
  'typically',
  'malware',
  'would',
  'incorrect',
  'assert',
  'malware',
  'includes',
  'say',
  'drivers',
  'macros.',
  '--',
  'take',
  'form',
  'executable',
  'code',
  'script',
  'computing',
  '|scripts',
  'active',
  'content',
  'software.',
  'ref',
  'cite',
  'web|url=https',
  '//ics-cert.us-cert.gov/sites/default/files/recommended_practices/casestudy-002.pdf',
  '|title=an',
  'undirected',
  'attack',
  'critical',
  'infrastructure',
  '|publisher=united',
  'states',
  'computer',
  'emergency',
  'readiness',
  'team',
  'us-cert.gov',
  '|date=',
  'format=pdf|',
  'accessdate=28',
  'september',
  '2014',
  'malware',
  'often',
  'disguised',
  'embedded',
  'non-malicious',
  'files',
  'of|2011',
  'majority',
  'active',
  'malware',
  'threats',
  'worms',
  'trojans',
  'rather',
  'viruses.',
  'ref',
  'cite',
  'web|url=http',
  '//www.microsoft.com/security/sir/story/default.aspx',
  '10year_malware',
  '|title=evolution',
  'malware-malware',
  'trends',
  '|publisher=microsoft.com',
  '|date=',
  'work=microsoft',
  'security',
  'intelligence',
  'report-featured',
  'articles',
  '|accessdate=28',
  'april',
  '2013',
  'law',
  'malware',
  'sometimes',
  'known',
  '``',
  "'computer",
  'contaminant',
  "''",
  'legal',
  'codes',
  'several',
  'united',
  'states|u.s',
  'states.',
  'ref',
  'cite',
  'web|',
  'publisher=national',
  'conference',
  'state',
  'legislatures',
  '|url=http',
  '//www.ncsl.org/issues-research/telecom/state-virus-and-computer-contaminant-laws.aspx|',
  'title=virus/contaminant/destructive',
  'transmission',
  'statutes',
  'state|',
  'date=2012-02-14|',
  'accessdate=26',
  'august',
  '2013',
  'ref',
  'cite',
  'web|url=http',
  '//jcots.state.va.us/2005',
  '20content/pdf/computer',
  '20contamination',
  '20bill.pdf|title=§',
  'nbsp',
  '18.2-152.4:1',
  'penalty',
  'computer',
  'contamination|format=pdf|publisher=joint',
  'commission',
  'technology',
  'science|accessdate=17',
  'september',
  '2010',
  'spyware',
  'malware',
  'sometimes',
  'found',
  'embedded',
  'programs',
  'supplied',
  'officially',
  'companies',
  'e.g.',
  'downloadable',
  'websites',
  'appear',
  'useful',
  'attractive',
  'may',
  'example',
  'additional',
  'hidden',
  'tracking',
  'functionality',
  'gathers',
  'marketing',
  'statistics',
  'example',
  'software',
  'described',
  'illegitimate',
  'sony',
  'rootkit',
  'trojan',
  'embedded',
  'compact',
  'disc|cds',
  'sold',
  'sony',
  'silently',
  'installed',
  'concealed',
  'purchasers',
  'computers',
  'intention',
  'preventing',
  'illicit',
  'copying',
  'also',
  'reported',
  'users',
  'listening',
  'habits',
  'unintentionally',
  'created',
  'vulnerabilities',
  'exploited',
  'unrelated',
  'malware.',
  'ref',
  'cite',
  'web',
  '|last=russinovich',
  '|first=mark',
  '|url=http',
  '//blogs.technet.com/markrussinovich/archive/2005/10/31/sony-rootkits-and-digital-rights-management-gone-too-far.aspx',
  '|title=sony',
  'rootkits',
  'digital',
  'rights',
  'management',
  'gone',
  'far',
  '|work=mark',
  "'s",
  'blog',
  '|publisher=microsoft',
  'msdn',
  '|date=2005-10-31',
  '|accessdate=2009-07-29',
  'software',
  'anti-virus',
  'firewall',
  'computing',
  '|firewalls',
  'used',
  'protect',
  'activity',
  'identified',
  'malicious',
  'recover',
  'attacks.',
  'ref',
  'cite',
  'web|title=protect',
  'computer',
  'malware|url=http',
  '//www.onguardonline.gov/media/video-0056-protect-your-computer-malware|publisher=onguardonline.gov',
  '|accessdate=26',
  'august',
  '2013',
  'purposes',
  'file',
  'malware',
  'statics',
  '2011-03-16-en.svg|thumb|alt=this',
  'pie',
  'chart',
  'shows',
  '2011',
  '70',
  'percent',
  'malware',
  'infections',
  'trojan',
  'horses',
  '17',
  'percent',
  'viruses',
  '8',
  'percent',
  'worms',
  'remaining',
  'percentages',
  'divided',
  'among',
  'adware',
  'backdoor',
  'spyware',
  'exploits.|300px|malware',
  'categories',
  '16',
  'march',
  '2011.',
  'many',
  'early',
  'infectious',
  'programs',
  'including',
  'morris',
  'worm|first',
  'internet',
  'worm',
  'written',
  'experiments',
  'pranks',
  'today',
  'malware',
  'used',
  'black-hat',
  'hacking|black',
  'hat',
  'hackers',
  'governments',
  'steal',
  'personal',
  'financial',
  'business',
  'information.',
  'ref',
  'cite',
  'web|title=malware|url=http',
  '//www.consumer.ftc.gov/articles/0011-malware|publisher=federal',
  'trade',
  'commission-',
  'consumer',
  'information|accessdate=27',
  'march',
  '2014',
  'ref',
  'cite',
  'web|last=hernandez|first=pedro|title=microsoft',
  'vows',
  'combat',
  'government',
  'cyber-spying|url=http',
  '//www.eweek.com/security/microsoft-vows-to-combat-government-cyber-spying.html|publisher=eweek|accessdate=15',
  'december',
  '2013',
  'malware',
  'sometimes',
  'used',
  'broadly',
  'government',
  'corporate',
  'websites',
  'gather',
  'guarded',
  'information',
  'ref',
  'cite',
  'web',
  '|last=kovacs',
  '|first=eduard',
  '|title=miniduke',
  'malware',
  'used',
  'european',
  'government',
  'organizations|url=http',
  '//news.softpedia.com/news/miniduke-malware-used-against-european-government-organizations-333006.shtml|publisher=softpedia|accessdate=27',
  'february',
  '2013',
  'disrupt',
  'operation',
  'general',
  'however',
  'malware',
  'often',
  'used',
  'individuals',
  'gain',
  'information',
  'personal',
  'identification',
  'numbers',
  'details',
  'bank',
  'credit',
  'card',
  'numbers',
  'passwords',
  'left',
  'unguarded',
  'personal',
  'computer',
  'network|networked',
  'computers',
  'considerable',
  'risk',
  'threats',
  'frequently',
  'defended',
  'various',
  'types',
  'firewall',
  'computing',
  '|firewall',
  'anti-virus',
  'software',
  'network',
  'switch|network',
  'hardware',
  'ref',
  'cite',
  'news|title=south',
  'korea',
  'network',
  'attack',
  'computer',
  "virus'",
  '|url=http',
  '//www.bbc.co.uk/news/world-asia-21855051|newspaper=bbc|accessdate=20',
  'march',
  '2013',
  'since',
  'rise',
  'widespread',
  'broadband',
  'internet',
  'access',
  'malicious',
  'software',
  'frequently',
  'designed',
  'profit',
  'since',
  '2003',
  'majority',
  'widespread',
  'computer',
  'virus|viruses',
  'worms',
  'designed',
  'take',
  'control',
  'users',
  'computers',
  'illicit',
  'purposes.',
  'ref',
  'cite',
  'web|title=malware',
  'revolution',
  'change',
  'target|url=http',
  '//technet.microsoft.com/en-us/library/cc512596.aspx|date=march',
  '2007',
  'infected',
  '``',
  'zombie',
  'computers',
  "''",
  'used',
  'send',
  'email',
  'spam',
  'host',
  'contraband',
  'data',
  'child',
  'pornography',
  'ref',
  'cite',
  'web|title=child',
  'porn',
  'malware',
  "'s",
  'ultimate',
  'evil|url=http',
  '//www.itworld.com/security/84077/child-porn-malwares-ultimate-evil|date=november',
  '2009',
  'engage',
  'distributed',
  'denial-of-service',
  'attack',
  'computing',
  '|attacks',
  'form',
  'extortion.',
  'ref',
  'http',
  '//www.pcworld.com/article/id,116841-page,1/article.html',
  'pc',
  'world',
  '–',
  'zombie',
  'pcs',
  'silent',
  'growing',
  'threat',
  '--',
  'bot',
  'generated',
  'title',
  '--',
  'programs',
  'designed',
  'monitor',
  'users',
  'web',
  'browsing',
  'display',
  'unsolicited',
  'advertisements',
  'redirect',
  'affiliate',
  'marketing',
  'revenues',
  'called',
  'spyware',
  'spyware',
  'programs',
  'spread',
  'like',
  'computer',
  'virus|viruses',
  'instead',
  'generally',
  'installed',
  'exploiting',
  'security',
  'holes',
  'also',
  'hidden',
  'packaged',
  'together',
  'unrelated',
  'user-installed',
  'software.',
  'ref',
  'cite',
  'web|title=peer',
  'peer',
  'information|url=http',
  '//oit.ncsu.edu/resnet/p2p|publisher=north',
  'carolina',
  'state',
  'university|accessdate=25',
  'march',
  '2011',
  'ransomware',
  'affects',
  'infected',
  'computer',
  'way',
  'demands',
  'payment',
  'reverse',
  'damage',
  'example',
  'programs',
  'cryptolocker',
  'encryption|encrypt',
  'files',
  'securely',
  'decrypt',
  'payment',
  'substantial',
  'sum',
  'money',
  'malware',
  'used',
  'generate',
  'money',
  'click',
  'fraud',
  'making',
  'appear',
  'computer',
  'user',
  'clicked',
  'advertising',
  'link',
  'site',
  'generating',
  'payment',
  'advertiser',
  'estimated',
  '2012',
  '60',
  '70',
  'active',
  'malware',
  'used',
  'kind',
  'click',
  'fraud',
  '22',
  'ad-clicks',
  'fraudulent.',
  'ref',
  'cite',
  'web|url=http',
  '//blogs.technet.com/b/mmpc/archive/2012/11/29/another-way-microsoft-is-disrupting-the-malware-ecosystem.aspx|title=another',
  'way',
  'microsoft',
  'disrupting',
  'malware',
  'ecosystem|publisher=|accessdate=18',
  'february',
  '2015',
  'malware',
  'usually',
  'used',
  'criminal',
  'purposes',
  'used',
  'sabotage',
  'often',
  'without',
  'direct',
  'benefit',
  'perpetrators',
  'one',
  'example',
  'sabotage',
  'stuxnet',
  'used',
  'destroy',
  'specific',
  'industrial',
  'equipment',
  'politically',
  'motivated',
  'attacks',
  'spread',
  'shut',
  'large',
  'computer',
  'networks',
  'including',
  'massive',
  'deletion',
  'files',
  'corruption',
  'master',
  'boot',
  'records',
  'described',
  '``',
  'computer',
  'killing',
  "''",
  'attacks',
  'made',
  'sony',
  'pictures',
  'entertainment',
  '25',
  'november',
  '2014',
  'using',
  'malware',
  'known',
  'shamoon',
  'w32.disttrack',
  'saudi',
  'aramco',
  'august',
  '2012',
  'ref',
  'cite',
  'web|url=http',
  '//www.computerweekly.com/news/2240161674/shamoon-is-latest-malware-to-target-energy-sector|title=shamoon',
  'latest',
  'malware',
  'target',
  'energy',
  'sector|publisher=|accessdate=18',
  'february',
  '2015',
  'ref',
  'cite',
  'web|url=http',
  '//www.computerweekly.com/news/2240235919/computer-killing-malware-used-in-sony-attack-a-wake-up-call-to-business',
  'asrc=em_mdn_37122786',
  'utm_medium=em',
  'utm_source=mdn',
  'utm_campaign=20141203_computer-killing',
  '20malware',
  '20used',
  '20in',
  '20sony',
  '20attack',
  '20a',
  '20wake-up',
  '20call_|title=computer-killing',
  'malware',
  'used',
  'sony',
  'attack',
  'wake-up',
  'call|publisher=|accessdate=18',
  'february',
  '2015',
  'proliferation',
  'preliminary',
  'results',
  'symantec',
  'published',
  '2008',
  'suggested',
  '``',
  'release',
  'rate',
  'malicious',
  'executable|code',
  'unwanted',
  'programs',
  'may',
  'exceeding',
  'legitimate',
  'software',
  'applications',
  '``',
  'ref',
  'cite',
  'journal|title=symantec',
  'internet',
  'security',
  'threat',
  'report',
  'trends',
  'july–december',
  '2007',
  'executive',
  'summary',
  '|publisher=symantec',
  'corp.|volume=xiii|page=29|date=april',
  '2008|url=http',
  '//eval.symantec.com/mktginfo/enterprise/white_papers/b-whitepaper_exec_summary_internet_security_threat_report_xiii_04-2008.en-us.pdf',
  '|format=pdf|accessdate=11',
  'may',
  '2008',
  'according',
  'f-secure',
  '``',
  'much',
  'malware',
  'produced',
  '2007',
  'previous',
  '20',
  'years',
  'altogether',
  '``',
  'ref',
  'cite',
  'press',
  'release|title=f-secure',
  'reports',
  'amount',
  'malware',
  'grew',
  '100',
  '2007|url=http',
  '//www.f-secure.com/f-secure/pressroom/news/fs_news_20071204_1_eng.html|date=4',
  'december',
  '2007|publisher=f-secure',
  'corporation|accessdate=11',
  'december',
  '2007',
  'malware',
  "'s",
  'common',
  'pathway',
  'criminals',
  'users',
  'internet',
  'primarily',
  'e-mail',
  'world',
  'wide',
  'web.',
  'ref',
  'cite',
  'web|title=',
  'f-secure',
  'quarterly',
  'security',
  'wrap-up',
  'first',
  'quarter',
  '2008|url=http',
  '//www.f-secure.com/f-secure/pressroom/news/fsnews_20080331_1_eng.html|publisher=f-secure|date=31',
  'march',
  '2008|accessdate=25',
  'april',
  '2008',
  'prevalence',
  'malware',
  'vehicle',
  'internet',
  'crime',
  'along',
  'challenge',
  'anti-malware',
  'software',
  'keep',
  'continuous',
  'stream',
  'new',
  'malware',
  'seen',
  'adoption',
  'new',
  'mindset',
  'individuals',
  'businesses',
  'using',
  'internet',
  'amount',
  'malware',
  'currently',
  'distributed',
  'percentage',
  'computers',
  'currently',
  'assumed',
  'infected',
  'businesses',
  'especially',
  'sell',
  'mainly',
  'internet',
  'means',
  'need'],
 ["''",
  'reverse',
  'engineering',
  "''",
  'also',
  'called',
  '``',
  "'back",
  'engineering',
  "''",
  'process',
  'engineering',
  '|processes',
  'extracting',
  'knowledge',
  'design',
  'information',
  'anything',
  'man-made',
  're-producing',
  're-producing',
  'anything',
  'based',
  'extracted',
  'information.',
  'ref',
  'name=',
  "''",
  'eilam',
  "''",
  'cite',
  'book|authors=eilam',
  'eldad',
  '|title=reversing',
  'secrets',
  'reverse',
  'engineering|publisher=john',
  'wiley',
  'sons|year=2005|isbn=978-0-7645-7481-8',
  'rp|3',
  'process',
  'often',
  'involves',
  'disassembling',
  'something',
  'machine|mechanical',
  'device',
  'electronic',
  'component',
  'computer',
  'program',
  'biological',
  'chemical',
  'organic',
  'matter',
  'analyzing',
  'components',
  'workings',
  'detail',
  'reasons',
  'goals',
  'obtaining',
  'information',
  'vary',
  'widely',
  'everyday',
  'socially',
  'beneficial',
  'actions',
  'criminal',
  'actions',
  'depending',
  'upon',
  'situation',
  'often',
  'intellectual',
  'property',
  'rights',
  'breached',
  'person',
  'business',
  'recollect',
  'something',
  'done',
  'something',
  'needs',
  'reverse',
  'engineer',
  'work',
  'reverse',
  'engineering',
  'also',
  'beneficial',
  'crime',
  'prevention',
  'suspected',
  'malware',
  'reverse',
  'engineered',
  'understand',
  'anti-virus|how',
  'detect',
  'remove',
  'allow',
  'computers',
  'devices',
  'work',
  'together',
  '``',
  'interoperate',
  "''",
  'allow',
  'saved',
  'files',
  'obsolete',
  'systems',
  'used',
  'newer',
  'systems',
  'contrast',
  'reverse',
  'engineering',
  'also',
  'used',
  'software',
  'cracking|',
  "''",
  'crack',
  "''",
  'software',
  'media',
  'remove',
  'copy',
  'protection',
  'ref',
  'name=',
  "''",
  'eilam',
  "''",
  'rp|5',
  'create',
  'possibly',
  'improved',
  'copying|copy',
  'even',
  'knockoff',
  'usually',
  'goal',
  'competitor.',
  'ref',
  'name=',
  "''",
  'eilam',
  "''",
  'rp|4',
  'reverse',
  'engineering',
  'origins',
  'analysis',
  'hardware',
  'commercial',
  'military',
  'advantage.',
  'ref',
  'name=',
  "''",
  'chikofsky',
  "''",
  'cite',
  'journal',
  '|doi=10.1109/52.43044',
  '|first=e',
  'j',
  '|last=chikofsky',
  '|lastauthoramp=yes',
  '|first2=j',
  'h.',
  'ii',
  '|last2=cross',
  '|title=reverse',
  'engineering',
  'design',
  'recovery',
  'taxonomy',
  '|journal=ieee',
  'software',
  '|volume=7',
  '|issue=1',
  '|pages=13–17',
  '|year=1990',
  'rp|13',
  'however',
  'reverse',
  'engineering',
  'process',
  'concerned',
  'creating',
  'copy',
  'changing',
  'artifact',
  'way',
  'analysis',
  'order',
  'deductive',
  'reasoning|deduce',
  'design',
  'features',
  'products',
  'little',
  'additional',
  'knowledge',
  'procedures',
  'involved',
  'original',
  'production.',
  'ref',
  'name=',
  "''",
  'chikofsky',
  "''",
  'rp|15',
  'cases',
  'goal',
  'reverse',
  'engineering',
  'process',
  'simply',
  'documentation|redocumentation',
  'legacy',
  'systems.',
  'ref',
  'name=',
  "''",
  'chikofsky',
  "''",
  'rp|15',
  'ref',
  'name=',
  "''",
  'nelson96',
  "''",
  'survey',
  'reverse',
  'engineering',
  'program',
  'comprehension',
  'michael',
  'l.',
  'nelson',
  'april',
  '19',
  '1996',
  'odu',
  'cs',
  '551',
  'nbsp',
  '–',
  'software',
  'engineering',
  'survey',
  'arxiv|cs/0503068v1',
  'even',
  'product',
  'reverse',
  'engineered',
  'competitor',
  'goal',
  'may',
  'copy',
  'perform',
  'competitor',
  'analysis.',
  'ref',
  'name=',
  "''",
  'rajafernandes2007',
  "''",
  'cite',
  'book|author1=vinesh',
  'raja|author2=kiran',
  'j.',
  'fernandes|title=reverse',
  'engineering',
  'industrial',
  'perspective|year=2007|publisher=springer',
  'science',
  'business',
  'media|isbn=978-1-84628-856-2|page=3',
  'reverse',
  'engineering',
  'may',
  'also',
  'used',
  'create',
  'interoperability|interoperable',
  'products',
  'despite',
  'narrowly',
  'tailored',
  'us',
  'eu',
  'legislation',
  'legality',
  'using',
  'specific',
  'reverse',
  'engineering',
  'techniques',
  'purpose',
  'hotly',
  'contested',
  'courts',
  'worldwide',
  'two',
  'decades.',
  'ref',
  'name=',
  "''",
  'bandkatoh2011',
  "''",
  'cite',
  'book|author1=jonathan',
  'band|author2=masanobu',
  'katoh|title=interfaces',
  'trial',
  '2.0|year=2011|publisher=mit',
  'press|isbn=978-0-262-29446-1|page=136',
  'motivation',
  'refimprove',
  'section|date=july',
  '2014',
  'reasons',
  'reverse',
  'engineering',
  '``',
  "'interfacing",
  "''",
  'reverse',
  'engineering',
  'used',
  'system',
  'required',
  'interface',
  'another',
  'system',
  'systems',
  'would',
  'negotiate',
  'established',
  'requirements',
  'typically',
  'exist',
  'interoperability',
  "''",
  'military',
  'commercial',
  'espionage',
  "''",
  'learning',
  'enemy',
  "'s",
  'competitor',
  "'s",
  'latest',
  'research',
  'stealing',
  'capturing',
  'prototype',
  'dismantling',
  'may',
  'result',
  'development',
  'similar',
  'product',
  'better',
  'countermeasures',
  "''",
  'improve',
  'documentation',
  'shortcomings',
  "''",
  'reverse',
  'engineering',
  'done',
  'documentation',
  'system',
  'design',
  'production',
  'operation',
  'maintenance',
  'shortcomings',
  'original',
  'designers',
  'available',
  'improve',
  'reverse',
  'engineering',
  'software',
  'provide',
  'current',
  'documentation',
  'necessary',
  'understanding',
  'current',
  'state',
  'software',
  'system',
  "''",
  'obsolescence',
  "''",
  'integrated',
  'circuits',
  'often',
  'designed',
  'proprietary',
  'systems',
  'built',
  'production',
  'lines',
  'become',
  'obsolete',
  'years',
  'systems',
  'using',
  'parts',
  'longer',
  'maintained',
  'since',
  'parts',
  'longer',
  'made',
  'way',
  'incorporate',
  'functionality',
  'new',
  'technology',
  'reverse-engineer',
  'existing',
  'chip',
  'remake',
  'computing',
  '|re-design',
  'using',
  'newer',
  'tools',
  'using',
  'understanding',
  'gained',
  'guide',
  'another',
  'obsolescence',
  'originated',
  'problem',
  'solved',
  'reverse',
  'engineering',
  'need',
  'support',
  'maintenance',
  'supply',
  'continuous',
  'operation',
  'existing',
  'legacy',
  'devices',
  'longer',
  'supported',
  'original',
  'equipment',
  'manufacturer',
  'oem',
  'problem',
  'particularly',
  'critical',
  'military',
  'operations',
  "''",
  'software',
  'modernization',
  "''",
  'often',
  'knowledge',
  'lost',
  'time',
  'prevent',
  'updates',
  'improvements',
  'reverse',
  'engineering',
  'generally',
  'needed',
  'order',
  'understand',
  "'as",
  'state',
  'existing',
  'legacy',
  'software',
  'order',
  'properly',
  'estimate',
  'effort',
  'required',
  'migrate',
  'system',
  'knowledge',
  "'to",
  'state',
  'much',
  'may',
  'driven',
  'changing',
  'functional',
  'compliance',
  'security',
  'requirements',
  "''",
  'product',
  'security',
  'analysis',
  "''",
  'examine',
  'product',
  'works',
  'specifications',
  'components',
  'estimate',
  'costs',
  'identify',
  'potential',
  'patent',
  'infringement',
  'acquiring',
  'sensitive',
  'data',
  'disassembling',
  'analysing',
  'design',
  'system',
  'component.',
  'ref',
  'name=rfc2828',
  'internet',
  'engineering',
  'task',
  'force',
  'rfc',
  '2828',
  'internet',
  'security',
  'glossary',
  'another',
  'intent',
  'may',
  'remove',
  'copy',
  'protection',
  'circumvention',
  'access',
  'restrictions',
  "''",
  'bug',
  'fixing',
  "''",
  'unofficial',
  'patch|fix',
  'sometimes',
  'enhance',
  'legacy',
  'software',
  'longer',
  'supported',
  'creators',
  'e.g',
  'abandonware',
  "''",
  'creation',
  'unlicensed/unapproved',
  'duplicates',
  "''",
  'duplicates',
  'sometimes',
  'called',
  'clone',
  'computing',
  '|clones',
  'computing',
  'domain',
  "''",
  'academic/learning',
  'purposes',
  "''",
  'reverse',
  'engineering',
  'learning',
  'purposes',
  'may',
  'understand',
  'key',
  'issues',
  'unsuccessful',
  'design',
  'subsequently',
  'improve',
  'design',
  "''",
  'competitive',
  'technical',
  'intelligence',
  "''",
  'understand',
  'one',
  "'s",
  'competitor',
  'actually',
  'versus',
  'say',
  "''",
  'saving',
  'money',
  "''",
  'one',
  'finds',
  'piece',
  'electronics',
  'capable',
  'spare',
  'user',
  'purchase',
  'separate',
  'product',
  "''",
  'repurposing',
  "''",
  'opportunities',
  'repurpose',
  'stuff',
  'otherwise',
  'obsolete',
  'incorporated',
  'bigger',
  'body',
  'utility',
  'common',
  'situations',
  '=reverse',
  'engineering',
  'machines=',
  'computer-aided',
  'design',
  'cad',
  'become',
  'popular',
  'reverse',
  'engineering',
  'become',
  'viable',
  'method',
  'create',
  '3d',
  'virtual',
  'model',
  'existing',
  'physical',
  'part',
  'use',
  '3d',
  'cad',
  'computer-aided',
  'manufacturing|cam',
  'computer-aided',
  'engineering|cae',
  'software',
  'cite',
  'journal|doi=10.1016/s0010-4485',
  '96',
  '00054-1|url=http',
  '//ralph.cs.cf.ac.uk/papers/geometry/re.pdf|title=reverse',
  'engineering',
  'geometric',
  'models–an',
  'introduction|year=1997|last1=varady|first1=t|last2=martin|first2=r|last3=cox|first3=j|journal=computer-aided',
  'design|volume=29|issue=4|pages=255–268',
  'reverse-engineering',
  'process',
  'involves',
  'measuring',
  'object',
  'reconstructing',
  '3d',
  'model',
  'physical',
  'object',
  'measured',
  'using',
  '3d',
  'scanner|3d',
  'scanning',
  'technologies',
  'like',
  'coordinate-measuring',
  'machine|cmms',
  '3d',
  'scanner',
  'triangulation|laser',
  'scanners',
  '3d',
  'scanner',
  'structured',
  'light|structured',
  'light',
  'digitizers',
  'industrial',
  'ct',
  'scanning',
  'computed',
  'tomography',
  'measured',
  'data',
  'alone',
  'usually',
  'represented',
  'point',
  'cloud',
  'lacks',
  'topological',
  'information',
  'therefore',
  'often',
  'processed',
  'modeled',
  'usable',
  'format',
  'triangular-faced',
  'mesh',
  'set',
  'nonuniform',
  'rational',
  'b-spline|nurbs',
  'surfaces',
  'computer',
  'assisted',
  'design|cad',
  'model',
  'cite',
  'web|url',
  'http',
  '//haman-co.com|title',
  'haman',
  'engineering',
  'solutions|date',
  '|accessdate',
  '|website',
  '|publisher',
  '|last',
  '|first',
  'hybrid',
  'modelling',
  'commonly',
  'used',
  'term',
  'nurbs',
  'solid',
  'modeling|parametric',
  'modelling',
  'implemented',
  'together',
  'using',
  'combination',
  'geometric',
  'freeform',
  'surfaces',
  'provide',
  'powerful',
  'method',
  '3d',
  'modelling',
  'areas',
  'freeform',
  'data',
  'combined',
  'exact',
  'geometric',
  'surfaces',
  'create',
  'hybrid',
  'model',
  'typical',
  'example',
  'would',
  'reverse',
  'engineering',
  'cylinder',
  'head',
  'includes',
  'freeform',
  'cast',
  'features',
  'water',
  'jackets',
  'high',
  'tolerance',
  'machined',
  'areas',
  'cite',
  'web|url=http',
  '//www.physicaldigital.com/services/reverse-engineering/|title=reverse',
  'engineering|last=|first=|date=|website=|publisher=|access-date=',
  'reverse',
  'engineering',
  'also',
  'used',
  'businesses',
  'bring',
  'existing',
  'physical',
  'geometry',
  'digital',
  'product',
  'development',
  'environments',
  'make',
  'digital',
  '3d',
  'record',
  'products',
  'assess',
  'competitors',
  'products',
  'used',
  'analyse',
  'instance',
  'product',
  'works',
  'components',
  'consists',
  'estimate',
  'costs',
  'identify',
  'potential',
  'patent',
  'infringement',
  'etc',
  'value',
  'engineering',
  'related',
  'activity',
  'also',
  'used',
  'businesses',
  'involves',
  'de-constructing',
  'analysing',
  'products',
  'objective',
  'find',
  'opportunities',
  'cost',
  'cutting',
  '=reverse',
  'engineering',
  'software=',
  'term',
  '``',
  'reverse',
  'engineering',
  "''",
  'applied',
  'software',
  'means',
  'different',
  'things',
  'different',
  'people',
  'prompting',
  'chikofsky',
  'cross',
  'write',
  'paper',
  'researching',
  'various',
  'uses',
  'defining',
  'taxonomy',
  'general',
  '|taxonomy',
  'paper',
  'state',
  '``',
  'reverse',
  'engineering',
  'process',
  'analyzing',
  'subject',
  'system',
  'create',
  'representations',
  'system',
  'higher',
  'level',
  'abstraction',
  '``',
  'cite',
  'journal',
  'last1',
  'chikofsky',
  'first1',
  'e.',
  'j',
  'last2',
  'cross',
  'first2',
  'j.',
  'h.',
  'doi',
  '10.1109/52.43044',
  'title',
  'reverse',
  'engineering',
  'design',
  'recovery',
  'taxonomy',
  'journal',
  'ieee',
  'software',
  'volume',
  '7',
  'pages',
  '13–17',
  '|date=january',
  '1990',
  'url',
  'http',
  '//win.ua.ac.be/~lore/research/chikofsky1990-taxonomy.pdf',
  'also',
  'seen',
  '``',
  'going',
  'backwards',
  'development',
  'cycle',
  "''",
  'cite',
  'book',
  'last=warden',
  'first=r',
  'title=software',
  'reuse',
  'reverse',
  'engineering',
  'practice',
  'year=1992',
  'publisher=chapman',
  'hall',
  'location=london',
  'england',
  'pages=283–305',
  'model',
  'output',
  'implementation',
  'phase',
  'source',
  'code',
  'form',
  'reverse-engineered',
  'back',
  'analysis',
  'phase',
  'inversion',
  'traditional',
  'waterfall',
  'model',
  'another',
  'term',
  'technique',
  'program',
  'comprehension.',
  'ref',
  'name=',
  "''",
  'nelson96',
  "''",
  'reverse',
  'engineering',
  'process',
  'examination',
  'software',
  'system',
  'consideration',
  'modified',
  'would',
  'make',
  'reengineering',
  'software',
  '|re-engineering',
  'software',
  'anti-tamper',
  'technology',
  'like',
  'obfuscation'],
 ['use',
  'mdy',
  'dates|date=october',
  '2014',
  'infobox',
  'military',
  'person',
  '|name=grace',
  'murray',
  'hopper',
  '|birth_date',
  'birth',
  'date|1906|12|9',
  '|death_date',
  'death',
  'date',
  'age|1992|1|1|1906|12|9',
  '|birth_place=new',
  'york',
  'city',
  'new',
  'york',
  'u.s.',
  '|death_place=arlington',
  'virginia',
  'u.s.',
  '|placeofburial=arlington',
  'national',
  'cemetery',
  '|placeofburial_label=',
  'place',
  'burial',
  '|image=commodore',
  'grace',
  'm.',
  'hopper',
  'usn',
  'covered',
  '.jpg',
  '|caption=rear',
  'admiral',
  'grace',
  'm.',
  'hopper',
  '1984',
  '|nickname=',
  "''",
  'amazing',
  'grace',
  "''",
  '|alma_mater',
  'yale',
  'university',
  '|allegiance=',
  'flagu|united',
  'states',
  'america',
  '|serviceyears=1943–1966',
  '1967–1971',
  '1972–1986',
  '|rank=',
  'file',
  'us-o7',
  'insignia.svg|24px',
  'rear',
  'admiral',
  'united',
  'states',
  '|rear',
  'admiral',
  'lower',
  'half',
  '|branch=',
  'flag|united',
  'states',
  'navy',
  '|commands=',
  '|awards=file',
  'defense',
  'distinguished',
  'service',
  'ribbon.svg|border|22px',
  'defense',
  'distinguished',
  'service',
  'medal',
  'br',
  'file',
  'legion',
  'merit',
  'ribbon.svg|border|22px',
  'legion',
  'merit',
  'br',
  'file',
  'meritorious',
  'service',
  'ribbon.svg|border|22px',
  'meritorious',
  'service',
  'medal',
  'usa',
  '|meritorious',
  'service',
  'medal',
  'br',
  'file',
  'american',
  'campaign',
  'medal',
  'ribbon.svg|border|22px',
  'american',
  'campaign',
  'medal',
  'br',
  'file',
  'world',
  'war',
  'ii',
  'victory',
  'medal',
  'ribbon.svg|border|22px',
  'world',
  'war',
  'ii',
  'victory',
  'medal',
  'br',
  'file',
  'national',
  'defense',
  'service',
  'medal',
  'ribbon.svg|border|22px',
  'national',
  'defense',
  'service',
  'medal',
  'br',
  'file',
  'afrm',
  'hourglass',
  'device',
  'silver',
  '.jpg|border|22px',
  'armed',
  'forces',
  'reserve',
  'medal',
  'two',
  'hourglass',
  'devices',
  'br',
  'file',
  'naval',
  'reserve',
  'medal',
  'ribbon.svg|border|22px',
  'naval',
  'reserve',
  'medal',
  'br',
  'file',
  'presidential',
  'medal',
  'freedom',
  'ribbon',
  '.png|border|22px',
  'presidential',
  'medal',
  'freedom',
  'posthumous',
  '|relations=',
  '|laterwork=',
  "''",
  'grace',
  'brewster',
  'murray',
  'hopper',
  "''",
  'née|',
  "''",
  "'murray",
  "''",
  'december',
  '9',
  '1906',
  '–',
  'january',
  '1',
  '1992',
  'american',
  'computer',
  'scientist',
  'united',
  'states',
  'navy',
  'rear',
  'admiral',
  'united',
  'states',
  '|rear',
  'admiral',
  'cite',
  'news|url',
  'http',
  '//content.yudu.com/a2qfj4/201403march/resources/3.htm|title',
  'amazing',
  'grace',
  'rear',
  'adm.',
  'grace',
  'hopper',
  'usn',
  'pioneer',
  'computer',
  'science|first',
  'mark|last',
  'cantrell|magazine',
  'military',
  'officer|publisher',
  'military',
  'officers',
  'association',
  'america|volume',
  '12|issue',
  '3|pages',
  '52–55',
  '106|date',
  'march',
  '2014|accessdate',
  'march',
  '1',
  '2014',
  '1944',
  'one',
  'first',
  'programmers',
  'harvard',
  'mark',
  'computer',
  'http',
  '//chsi.harvard.edu/exhibitions/harvard-mark-l',
  'mark',
  'computer',
  'harvard',
  'university',
  'invented',
  'first',
  'compiler',
  'computer',
  'programming',
  'language.',
  'ref',
  'name=',
  "''",
  'wexelblat81',
  "''",
  'cite',
  'book',
  '|author=',
  'richard',
  'l.',
  'wexelblat',
  'ed',
  '|title=',
  'history',
  'programming',
  'languages',
  '|year=',
  '1981',
  '|location=',
  'new',
  'york',
  '|publisher=',
  'academic',
  'press',
  '|isbn=',
  '0-12-745040-8',
  'ref',
  'name=',
  "''",
  'spencer85',
  "''",
  'cite',
  'book',
  '|author=',
  'donald',
  'd.',
  'spencer',
  '|title=',
  'computers',
  'information',
  'processing',
  '|year=',
  '1985',
  '|publisher=',
  'c.e',
  'merrill',
  'publishing',
  'co',
  '|isbn=',
  '978-0-675-20290-9',
  'ref',
  'name=',
  "''",
  'laplante01',
  "''",
  'cite',
  'book',
  '|author=',
  'phillip',
  'a.',
  'laplante',
  '|title=',
  'dictionary',
  'computer',
  'science',
  'engineering',
  'technology',
  '|year=',
  '2001',
  '|publisher=',
  'crc',
  'press',
  '|isbn=',
  '978-0-8493-2691-2',
  'ref',
  'name=',
  "''",
  'bunch93',
  "''",
  'cite',
  'book',
  '|author=',
  'bryan',
  'h.',
  'bunch',
  'alexander',
  'hellemans',
  '|title=',
  'timetables',
  'technology',
  'chronology',
  'important',
  'people',
  'events',
  'history',
  'technology',
  '|year=',
  '1993',
  '|publisher=',
  'simon',
  'schuster',
  '|isbn=',
  '978-0-671-76918-5',
  'ref',
  'name=',
  "''",
  'booss03',
  "''",
  'cite',
  'book',
  '|author=',
  'bernhelm',
  'booss-bavnbek',
  'jens',
  'høyrup',
  '|title=',
  'mathematics',
  'war',
  '|year=',
  '2003',
  '|publisher=',
  'birkhäuser',
  'verlag',
  '|isbn=',
  '978-3-7643-1634-1',
  'popularized',
  'idea',
  'machine-independent',
  'programming',
  'languages',
  'led',
  'development',
  'cobol',
  'one',
  'first',
  'high-level',
  'programming',
  'languages',
  'owing',
  'accomplishments',
  'naval',
  'rank',
  'sometimes',
  'referred',
  '``',
  'amazing',
  'grace',
  "''",
  'ref',
  'name=',
  "''",
  'urlcyber',
  'heroes',
  'past',
  'amazing',
  'grace',
  'hopper',
  "''",
  'cite',
  'web|url=http',
  '//wvegter.hivemind.net/abacus/cyberheroes/hopper.htm|title=cyber',
  'heroes',
  'past',
  '``',
  'amazing',
  'grace',
  "''",
  'hopper|accessdate=december',
  '12',
  '2012',
  'ref',
  'name=',
  "''",
  'urlgrace',
  'murray',
  'hopper',
  "''",
  'cite',
  'web|url=http',
  '//www.agnesscott.edu/lriddle/women/hopper.htm|title=grace',
  'murray',
  'hopper|accessdate=december',
  '12',
  '2012',
  'u.s.',
  'navy',
  'sclass-|arleigh',
  'burke|destroyer|0',
  'guided-missile',
  'destroyer',
  'uss|hopper',
  'named',
  'cray',
  'xe6',
  '``',
  'hopper',
  "''",
  'supercomputer',
  'nersc',
  'cite',
  'web|url=http',
  '//www.nersc.gov/users/computational-systems/retired-systems/hopper/|title=hopper|website=www.nersc.gov|access-date=2016-03-19',
  'november',
  '22',
  '2016',
  'posthumously',
  'awarded',
  'presidential',
  'medal',
  'freedom',
  'president',
  'barack',
  'obama',
  'cite',
  'web|url=http',
  '//www.cbsnews.com/news/white-house-medal-of-freedom-margaret-hamilton-grace-hopper/|title=white',
  'house',
  'honors',
  'two',
  'tech',
  "'s",
  'female',
  'pioneers|work=cbsnews.com|accessdate=november',
  '23',
  '2016',
  'early',
  'life',
  'education',
  'listen|type=speech|pos=right|filename=grace',
  'hopper',
  'told',
  'u.s.',
  'chief',
  'technology',
  'officer',
  'megan',
  'smith',
  '.oggvorbis.ogg|title=grace',
  'hopper',
  'told',
  'u.s.',
  'chief',
  'technology',
  'officer',
  'megan',
  'smith',
  '|description=',
  'hopper',
  'born',
  'new',
  'york',
  'city',
  'eldest',
  'three',
  'children',
  'parents',
  'walter',
  'fletcher',
  'murray',
  'mary',
  'campbell',
  'van',
  'horne',
  'scottish',
  'people|scottish',
  'dutch',
  'people|dutch',
  'descent',
  'attended',
  'west',
  'end',
  'collegiate',
  'church',
  'cite',
  'book',
  'publisher',
  'naval',
  'institute',
  'press|',
  'isbn',
  '1557509522|',
  'last',
  'williams|',
  'first',
  'kathleen',
  'broome|',
  'title',
  'grace',
  'hopper',
  'admiral',
  'cyber',
  'sea|',
  'location',
  'annapolis',
  'md|',
  'series',
  'library',
  'naval',
  'biography|',
  'date',
  '2004',
  'great-grandfather',
  'alexander',
  'wilson',
  'russell',
  'admiral',
  'us',
  'navy',
  'fought',
  'battle',
  'mobile',
  'bay',
  'american',
  'civil',
  'war|civil',
  'war',
  'grace',
  'curious',
  'child',
  'lifelong',
  'trait',
  'age',
  'seven',
  'decided',
  'determine',
  'alarm',
  'clock',
  'worked',
  'dismantled',
  'seven',
  'alarm',
  'clocks',
  'mother',
  'realized',
  'limited',
  'one',
  'clock',
  'cite',
  'journal',
  '|last1=dickason',
  '|first=elizabeth',
  '|url=http',
  '//inventors.about.com/library/inventors/bl_grace_murray_hopper.htm',
  '|title=looking',
  'back',
  'grace',
  'murray',
  'hopper',
  "'s",
  'younger',
  'years',
  '|journal=chips',
  '|date=april',
  '1992',
  'university-preparatory',
  'school|preparatory',
  'school',
  'education',
  'attended',
  'wardlaw-hartridge',
  'school|hartridge',
  'school',
  'plainfield',
  'new',
  'jersey',
  'hopper',
  'initially',
  'rejected',
  'early',
  'admission',
  'vassar',
  'college',
  'age',
  '16',
  'test',
  'scores',
  'latin',
  'low',
  'admitted',
  'following',
  'year',
  'graduated',
  'phi',
  'beta',
  'kappa',
  'society|phi',
  'beta',
  'kappa',
  'vassar',
  '1928',
  'bachelor',
  "'s",
  'degree',
  'mathematics',
  'physics',
  'earned',
  'master',
  "'s",
  'degree',
  'yale',
  'university',
  '1930.',
  '1934',
  'earned',
  'ph.d.',
  'mathematics',
  'yale',
  'ref',
  'name=',
  "''",
  'nwhm',
  "''",
  'cite',
  'web|',
  'url=http',
  '//www.nwhm.org/education-resources/biography/biographies/grace-murray-hopper/|',
  'title=grace',
  'murray',
  'hopper',
  '1906-1992',
  'accessdate=september',
  '1',
  '2014|',
  'publisher=national',
  'women',
  "'s",
  'history',
  'museum|',
  'website=nwhm.org',
  'direction',
  'øystein',
  'ore.',
  'ref',
  'name=',
  "''",
  'greenladuke09',
  "''",
  'though',
  'books',
  'including',
  'kurt',
  'beyer',
  "'s",
  '``',
  'grace',
  'hopper',
  'invention',
  'information',
  'age',
  "''",
  'reported',
  'hopper',
  'first',
  'woman',
  'earn',
  'yale',
  'phd',
  'mathematics',
  'first',
  'ten',
  'women',
  'prior',
  '1934',
  'charlotte',
  'cynthia',
  'barnum',
  '1860–1934',
  'cite',
  'news',
  'last',
  'murray',
  'first',
  'margaret',
  'a.',
  'm.',
  'publication-date',
  'may–june',
  '2010',
  'title',
  'first',
  'lady',
  'math',
  'periodical',
  'yale',
  'alumni',
  'magazine',
  'volume',
  '73',
  'issue',
  '5',
  'pages',
  '5–6',
  'issn',
  '0044-0051',
  'postscript',
  '--',
  'none',
  '--',
  'dissertation',
  '``',
  'new',
  'types',
  'irreducibility',
  'criteria',
  "''",
  'published',
  'year.g',
  'm.',
  'hopper',
  'o.',
  'ore',
  '``',
  'new',
  'types',
  'irreducibility',
  'criteria',
  "''",
  '``',
  'bull',
  'amer',
  'math',
  'soc',
  "''",
  '40',
  '1934',
  '216',
  'cite',
  'web',
  'title=new',
  'types',
  'irreducibility',
  'criteria',
  'url=http',
  '//www.ams.org/journals/bull/1934-40-03/s0002-9904-1934-05818-x/',
  'hopper',
  'began',
  'teaching',
  'mathematics',
  'vassar',
  '1931',
  'promoted',
  'associate',
  'professor',
  '1941.',
  'ref',
  'name=ogilvie',
  'cite',
  'book|last=ogilvie|first=marilyn|title=the',
  'biographical',
  'dictionary',
  'women',
  'science',
  'pioneering',
  'lives',
  'ancient',
  'times',
  'mid-20th',
  'century.|year=2000|publisher=routledge|location=new',
  'york|isbn=0-415-92040-x|author2=',
  'joy',
  'harvey|url=https',
  '//books.google.com/books',
  'id=qmfyk0qtsrac',
  'q=hopper',
  'v=snippet',
  'q=hopper',
  'f=false',
  'check',
  'cite|reason=does',
  "n't",
  'seem',
  'support',
  'dates|date=november',
  '2013',
  'married',
  'new',
  'york',
  'university',
  'professor',
  'vincent',
  'foster',
  'hopper',
  '1906–76',
  '1930',
  'divorce',
  '1945.',
  'ref',
  'name=',
  "''",
  'greenladuke09',
  "''",
  'cite',
  'book',
  '|last=green',
  '|first=judy',
  'jeanne',
  'laduke',
  '|title=pioneering',
  'women',
  'american',
  'mathematics',
  'pre-1940',
  'phd',
  "'s",
  '|accessdate=',
  '|edition=',
  '|year=2009',
  '|publisher=american',
  'mathematical',
  'society',
  '|location=providence',
  'rhode',
  'island',
  '|isbn=978-0821843765',
  'cite',
  'news|title=prof',
  'vincent',
  'hopper',
  'n.y.u.',
  'literature',
  'teacher',
  'dead',
  '69|newspaper=the',
  'new',
  'york',
  'times|date=january',
  '21',
  '1976',
  'marry',
  'chose',
  'retain',
  'surname',
  'career',
  '=world',
  'war',
  'ii=',
  'file',
  'harvard',
  'mark',
  'sign-up.agr.jpg|thumb|hopper',
  "'s",
  'signatures',
  'duty',
  'officer',
  'signup',
  'sheet',
  'bureau',
  'ships',
  'computation',
  'project',
  'harvard',
  'built',
  'operated',
  'harvard',
  'mark',
  'i|mark',
  'hopper',
  'tried',
  'enlist',
  'navy',
  'early',
  'war',
  'age',
  '34',
  'old',
  'enlist',
  'weight',
  'height',
  'ratio',
  'low',
  'also',
  'denied',
  'basis',
  'job',
  'mathematician',
  'mdashb',
  'mathematics',
  'professor',
  'vassar',
  'college',
  'mdashb',
  'valuable',
  'war',
  'effort',
  'cite',
  'web|url=https',
  '//www.thocp.net/biographies/hopper_grace.html|title=grace',
  'hopper|website=www.thocp.net|access-date=2016-12-12',
  'world',
  'war',
  'ii',
  '1943',
  'hopper',
  'obtained',
  'leave',
  'absence',
  'vassar',
  'sworn',
  'united',
  'states',
  'navy',
  'reserve',
  'one',
  'many',
  'women',
  'volunteer',
  'serve',
  'waves',
  'get',
  'exemption',
  'enlist',
  'convert|15|lb',
  'navy',
  'minimum',
  'weight',
  'convert|120|lb',
  'reported',
  'december',
  'trained',
  'naval',
  'reserve',
  'midshipmen',
  "'s",
  'school',
  'smith',
  'college',
  'northampton',
  'massachusetts',
  'hopper',
  'graduated',
  'first',
  'class',
  '1944',
  'assigned',
  'bureau',
  'ships',
  'computation',
  'project',
  'harvard',
  'university',
  'lieutenant',
  'junior',
  'grade',
  'served',
  'harvard',
  'mark',
  'i|mark'],
 ['redirect|computer',
  'system||computer',
  'disambiguation',
  '|and|computer',
  'system',
  'disambiguation',
  'pp-semi-indef',
  'pp-move-indef',
  'infobox|title',
  'computer',
  '|image',
  'div',
  'style=',
  "''",
  'white-space',
  'nowrap',
  "''",
  'file',
  'acer',
  'aspire',
  '8920',
  'gemstone',
  'georgy.jpg|x81pxfile',
  'columbia',
  'supercomputer',
  'nasa',
  'advanced',
  'supercomputing',
  'facility.jpg|x81pxfile',
  'intertec',
  'superbrain.jpg|x81px',
  'br',
  'file:2010-01-26-technikkrempel-by-ralfr-05.jpg|x79pxfile',
  'thinking',
  'machines',
  'connection',
  'machine',
  'cm-5',
  'frostburg',
  '2.jpg|x79pxfile',
  'g5',
  'supplying',
  'wikipedia',
  'via',
  'gigabit',
  'lange',
  'nacht',
  'der',
  'wissenschaften',
  '2006',
  'dresden.jpg|x79px',
  'br',
  'file',
  'dm',
  'ibm',
  's360.jpg|x77pxfile',
  'acorn',
  'bbc',
  'master',
  'series',
  'microcomputer.jpg|x77pxfile',
  'dell',
  'poweredge',
  'servers.jpg|x77px',
  '|caption',
  'computers',
  'computing',
  'devices',
  'different',
  'eras',
  '--',
  'paragraph',
  'currently',
  'discussion',
  'talk',
  '--',
  '``',
  "'computer",
  "''",
  'device',
  'computer',
  'programming|instructed',
  'carry',
  'arbitrary',
  'set',
  'arithmetic',
  'boolean',
  'algebra|logical',
  'operations',
  'automatically',
  'ability',
  'computers',
  'follow',
  'sequence',
  'operations',
  'called',
  '``',
  'computer',
  'program|program',
  "''",
  'make',
  'computers',
  'applicable',
  'wide',
  'range',
  'tasks',
  'computers',
  'used',
  'control',
  'systems',
  'wide',
  'variety',
  'programmable',
  'logic',
  'controller|industrial',
  'consumer',
  'electronics|consumer',
  'devices',
  'includes',
  'simple',
  'special',
  'purpose',
  'devices',
  'like',
  'microwave',
  'ovens',
  'remote',
  'controls',
  'factory',
  'devices',
  'industrial',
  'robots',
  'computer',
  'assisted',
  'design',
  'also',
  'general',
  'purpose',
  'devices',
  'like',
  'personal',
  'computers',
  'mobile',
  'devices',
  'smartphones',
  'internet',
  'run',
  'computers',
  'connects',
  'millions',
  'computers',
  'since',
  'ancient',
  'times',
  'simple',
  'manual',
  'devices',
  'like',
  'abacus',
  'aided',
  'people',
  'calculations',
  'early',
  'industrial',
  'revolution',
  'mechanical',
  'devices',
  'built',
  'automate',
  'long',
  'tedious',
  'tasks',
  'guiding',
  'patterns',
  'looms',
  'sophisticated',
  'electrical',
  'machines',
  'specialized',
  'analogue',
  'electronics|analog',
  'calculations',
  'early',
  '20th',
  'century',
  'first',
  'digital',
  'data|digital',
  'electronic',
  'calculating',
  'machines',
  'developed',
  'world',
  'war',
  'ii',
  'speed',
  'power',
  'versatility',
  'computers',
  'increased',
  'continuously',
  'dramatically',
  'since',
  'conventionally',
  'modern',
  'computer',
  'consists',
  'least',
  'one',
  'processing',
  'element',
  'typically',
  'central',
  'processing',
  'unit',
  'cpu',
  'form',
  'memory',
  'computers',
  '|memory',
  'processing',
  'element',
  'carries',
  'arithmetic',
  'logical',
  'operations',
  'sequencing',
  'control',
  'unit',
  'change',
  'order',
  'operations',
  'response',
  'stored',
  'data|information',
  'peripheral|peripheral',
  'devices',
  'include',
  'input',
  'devices',
  'keyboards',
  'mice',
  'joystick',
  'etc',
  'output',
  'devices',
  'monitor',
  'screens',
  'printers',
  'etc',
  'input/output',
  'devices',
  'perform',
  'functions',
  'e.g.',
  '2000s-era',
  'touchscreen',
  'peripheral',
  'devices',
  'allow',
  'information',
  'retrieved'
]]

In [None]:
from collections import defaultdict
import itertools 

# Import Dictionary
from gensim.corpora.dictionary import Dictionary

# Create a Dictionary from the articles: dictionary
dictionary = Dictionary(articles)

# Select the id for "computer": computer_id
computer_id = dictionary.token2id.get('computer')

# Use computer_id with the dictionary to print the word
print(dictionary.get(computer_id))

# Create a bag-of-words corpus (one list of (token_id, count) pairs per article): corpus
corpus = [dictionary.doc2bow(article) for article in articles]

# Print the first 10 word ids with their frequency counts from the fifth document
print(corpus[4][:10])

# Save the fifth document: doc
doc = corpus[4]

# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)
    
# Create the defaultdict: total_word_count
total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count
    
# Create a sorted list from the defaultdict: sorted_word_count
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True) 

# Print the top 5 words across all documents alongside the count
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)
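
To make the bag-of-words representation used above more concrete, here is a minimal pure-Python sketch of the same idea: every distinct token gets an integer id, and each document becomes a list of `(token_id, count)` pairs. The `toy_articles` data and the `to_bow` helper are hypothetical names invented for this illustration; gensim's `Dictionary` may assign different ids than this sketch does.

```python
from collections import Counter

# Two tiny, made-up tokenized documents (illustrative data only)
toy_articles = [
    ["computer", "program", "computer"],
    ["navy", "computer", "admiral"],
]

# Assign an integer id to each distinct token, in order of first appearance
token2id = {}
for article in toy_articles:
    for token in article:
        token2id.setdefault(token, len(token2id))


def to_bow(tokens):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in tokens)
    return sorted(counts.items())


toy_corpus = [to_bow(article) for article in toy_articles]

print(token2id)
print(toy_corpus)
```

Word order is lost in this representation: only which tokens occur, and how often, is kept.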

## TF-IDF with gensim

**TF-IDF** stands for *Term Frequency - Inverse Document Frequency*. 

This technique identifies the most important words in each document.

The idea is that the documents in a corpus may share words beyond plain stopwords, and that these shared words should be removed or at least down-weighted.

TF-IDF keeps the words that are common across the whole corpus from showing up as keywords, while words that are frequent in one specific document keep a high weight.

$$ w_{i,j} = tf_{i,j} * \log\left(\frac{N}{df_{i}}\right) $$

- $w_{i,j}$ = tf-idf weight for token *i* in document *j*
- $tf_{i,j}$ = number of occurrences of token *i* in document *j*
- $df_{i}$ = number of documents that contain token *i*
- $N$ = total number of documents
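
The formula can be checked by hand on a toy corpus. The sketch below is a minimal pure-Python illustration, assuming the natural logarithm and three made-up documents; the names `docs`, `document_frequency` and `tf_idf` are hypothetical. gensim's `TfidfModel` applies its own logarithm base and normalization by default, so its numbers will not match this sketch exactly.

```python
import math

# Three tiny, made-up tokenized documents (illustrative data only)
docs = [
    ["computer", "program", "computer", "language"],
    ["computer", "compiler", "language"],
    ["computer", "navy", "admiral"],
]

N = len(docs)  # total number of documents


def document_frequency(token):
    """df_i: number of documents that contain the token at least once."""
    return sum(1 for doc in docs if token in doc)


def tf_idf(token, doc):
    """w_ij = tf_ij * log(N / df_i), with the natural logarithm (an assumption)."""
    tf = doc.count(token)
    df = document_frequency(token)
    return tf * math.log(N / df)


# Weights for every distinct token of the first document
weights = {token: tf_idf(token, docs[0]) for token in set(docs[0])}

for token, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{token:10s} {weight:.3f}")
```

"computer" is the most frequent token in the first document, but because it appears in all three documents its weight is zero, while "program", which appears in only one document, gets the highest weight.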

In [None]:
from gensim.models import TfidfModel

# Create a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

# Calculate the tfidf weights of doc: tfidf_weights
tfidf_weights = tfidf[doc]

# Print the first five weights
print(tfidf_weights[:5])

In [None]:

# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)

# Named Entity Recognition 

**Named Entity Recognition** (NER) is an NLP task that identifies important named entities in a text, such as people, places or organizations. Entities can also be dates, states or works of art.

NER can be used alongside topic identification or on its own.

NLTK ships its own NER model and also provides an interface to the one from the Stanford CoreNLP library.



In [None]:
import nltk

nltk.download('averaged_perceptron_tagger_eng')

sentence= '''In New York I like to ride the metro to visit MOMA and some restaurants rated well by Ruth Reichl.'''

tokenized_sent = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokenized_sent)
tagged_sent

In [None]:
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')

print(nltk.ne_chunk(tagged_sent))

# SpaCy

SpaCy is an NLP library similar to Gensim, but with different implementations.

It is designed for building information extraction systems.

SpaCy focuses on creating NLP pipelines to generate models and corpora.

It is open-source, supports more than 64 languages, is robust, and comes with extra visualization tools such as [displacy](https://demos.explosion.ai/displacy-ent).

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
import spacy 

nlp = spacy.load("en_core_web_sm")
doc = nlp("He works at Google.")
# spacy.displacy.serve(doc, style="ent", auto_select_port=True)


Once the spaCy model is loaded, we get the `nlp` object.

The `nlp` object converts text into a `Doc` object, which stores the processed text.

In [None]:
doc = nlp('Berlin is the capital of Germany, and the residence of Chancellor Angela Merkel.')
doc.ents

In [None]:
print(doc.ents[0], doc.ents[0].label_)

> SpaCy makes pipeline creation easier and identifies different entity types than NLTK.
> On top of that, spaCy includes informal language corpora, which enables NER on tweets and chat messages.

In [None]:
# The article is written as two adjacent string literals, which Python joins into a single string
article = (
    '\ufeffThe taxi-hailing company Uber brings into very sharp focus the question of whether corporations can be said to have a moral character. If any human being were to behave with the single-minded and ruthless greed of the company, we would consider them sociopathic. Uber wanted to know as much as possible about the people who use its service, and those who don’t. It has an arrangement with unroll.me, a company which offered a free service for unsubscribing from junk mail, to buy the contacts unroll.me customers had had with rival taxi companies. Even if their email was notionally anonymised, this use of it was not something the users had bargained for. Beyond that, it keeps track of the phones that have been used to summon its services even after the original owner has sold them, attempting this with Apple’s phones even thought it is forbidden by the company.\r\n\r\n\r\nUber has also tweaked its software so that regulatory agencies that the company regarded as hostile would, when they tried to hire a driver, be given false reports about the location of its cars. Uber management booked and then cancelled rides with a rival taxi-hailing company which took their vehicles out of circulation. Uber deny this was the intention. The punishment for this behaviour was negligible. Uber promised not to use this “greyball” software against law enforcement – one wonders what would happen to someone carrying a knife who promised never to stab a policeman with it. Travis Kalanick of Uber got a personal dressing down from Tim Cook, who runs Apple, but the company did not prohibit the use of the app. Too much money was at stake for that.\r\n\r\n\r\nMillions of people around the world value the cheapness and convenience of Uber’s rides too much to care about the lack of drivers’ rights or pay. Many of the users themselves are not much richer than the drivers. '
    'The “sharing economy” encourages the insecure and exploited to exploit others equally insecure to the profit of a tiny clique of billionaires. Silicon Valley’s culture seems hostile to humane and democratic values. The outgoing CEO of Yahoo, Marissa Mayer, who is widely judged to have been a failure, is likely to get a $186m payout. This may not be a cause for panic, any more than the previous hero worship should have been a cause for euphoria. Yet there’s an urgent political task to tame these companies, to ensure they are punished when they break the law, that they pay their taxes fairly and that they behave responsibly.'
)

# Import spacy
import spacy

# Instantiate the English model: nlp
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'matcher'])

# Create a new document: doc
doc = nlp(article)

# Print all of the found entities and their labels
for ent in doc.ents:
    print(ent.label_, '-', ent.text)


> Converting text into a `Doc` with the spaCy `nlp` object includes steps like tokenization, tagging, parsing, named entity recognition, and others.

SpaCy has multiple data structures to represent text data: 
- **Doc**: a container for accessing linguistic annotations of text
- **Span**: a slice from a Doc object
- **Token**: an individual token (word, punctuation, whitespace...)

The spaCy language processing pipeline always depends on the loaded model and its capabilities:
- **Tokenizer**: segments text into tokens and creates the Doc object
- **Tagger**: assigns part-of-speech tags
- **Lemmatizer**: reduces words to their base forms (lemmas)
- **EntityRecognizer**: detects and labels named entities

Other important components are **Language**, **DependencyParser** and **Sentencizer**.

## Tokenization 

Tokenization is always the first operation in spaCy.

All the other operations require tokens.

Tokens can be words, numbers or punctuation.
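
To illustrate the three kinds of tokens, here is a minimal regex-based sketch built on the `re` module. It is not how spaCy tokenizes: spaCy uses language-specific rules and exceptions and, for example, splits contractions differently. The pattern and the `simple_tokenize` helper are assumptions made for this illustration.

```python
import re

# Alternatives are tried left to right:
#   \d+(?:\.\d+)?   numbers such as 1943 or 3.14
#   \w+(?:'\w+)?    words, optionally with an internal apostrophe (didn't)
#   [^\w\s]         any single punctuation character
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|\w+(?:'\w+)?|[^\w\s]")


def simple_tokenize(text):
    """Split text into word, number and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("Hopper joined the Navy in 1943, didn't she?")
print(tokens)
```

Whitespace is dropped because none of the alternatives can match it, so `findall` simply skips over it.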

## Sentence Segmentation 

Sentence segmentation is more complex than tokenization.

It is performed as part of the DependencyParser component.



In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp("We are learning NLP. In particular, NLP with Spacy")

for sent in doc.sents:
    print(sent.text)

## Lemmatization 

A lemma is the base form of a token.

For example, the lemma of *eats* and *ate* is *eat*.

Lemmatization can improve the accuracy of language models.

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp("We are learning NLP. In particular, NLP with Spacy")

for token in doc:
    print(token.text, '-', token.lemma_)

## Linguistic Features in Spacy 

PoS (part-of-speech) tagging consists of categorizing words grammatically, based on their function and context within a sentence (verb, noun, adjective, conjunction...).


In [None]:
verb_sent = 'I watch TV'
noun_sent = 'I have lost my watch recently'

for sent in [verb_sent, noun_sent]: 
    print([(token.text, token.pos_, spacy.explain(token.pos_)) for token in nlp(sent)])
    print('-')

## Named Entity Recognition 

A named entity is a word or phrase that refers to a specific entity with a name.

The predefined categories are: 
- PERSON: person
- ORG: organization
- GPE: geopolitical entity
- LOC: non GPE locations
- DATE: absolute or relative dates or periods
- TIME: time smaller than a day
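
The label descriptions can also be looked up programmatically with `spacy.explain`, which reads spaCy's built-in glossary and does not need a loaded model. A minimal sketch for the labels listed above:

```python
import spacy

labels = ["PERSON", "ORG", "GPE", "LOC", "DATE", "TIME"]

# spacy.explain returns a short description string for a known label
for label in labels:
    print(label, "-", spacy.explain(label))
```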

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Albert Einstein was a genius.")

print([(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents])

> We can also read the entity type of each token through its `ent_type_` attribute; it is empty for tokens outside any entity.

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Albert Einstein was a genius.")

print([(token.text, token.ent_type_) for token in doc])

In [None]:
from spacy import displacy 

displacy.serve(doc, style='ent', port=5001)

## Linguistic Features

**Word-sense disambiguation** is the process of deciding which sense of a word is meant from its context. PoS tags help here: *watch* is a verb in "I watch TV" and a noun in "I have lost my watch".

**Dependency parsing** explores the syntax of a sentence by linking tokens to each other, which creates a tree.

In [None]:
displacy.serve(doc, style='dep', port=5001)

# Word Vectors

1. **Introduction to Word Vectors**  
   Word vectors (or embeddings) are numerical representations of words that allow computers to work with text data. Traditional methods, such as the "bag-of-words," assign unique numbers to words but fail to capture their meaning or context.

2. **Limitations of Older Methods**  
   Older models, like "bag-of-words," can map words to numbers but struggle with understanding semantics. For instance, the sentences "I got covid" and "I got coronavirus" would be treated as different, despite their similar meaning.

3. **Improving Word Understanding with Vectors**  
   Recent techniques generate word vectors that help computers recognize words with similar meanings. These vectors have multiple dimensions and capture word context, enabling semantic understanding.

4. **Common Algorithms for Word Vectors**  
   Methods like Word2Vec, GloVe, fastText, and transformer-based models produce word vectors using neural networks or co-occurrence matrices. spaCy integrates some of these methods to create word vectors.

5. **Word Vectors in spaCy**  
   spaCy’s models vary in their use of word vectors. For example, the small model `en_core_web_sm` has no word vectors, while the medium model `en_core_web_md` includes 20,000 word vectors.

6. **Accessing Word Vectors in spaCy**  
   In spaCy, word vectors can be accessed for words in the model's vocabulary using `nlp.vocab`. By mapping a word to its ID, users can retrieve its corresponding word vector for analysis.
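
The bag-of-words limitation described in point 2 can be made concrete with a small pure-Python sketch. The `bag_of_words` helper is a hypothetical name introduced only for this illustration.

```python
from collections import Counter

sentences = ["I got covid", "I got coronavirus"]

# Build a shared vocabulary; each word gets its own column
vocabulary = sorted({word for s in sentences for word in s.lower().split()})

def bag_of_words(sentence):
    """Return the count of each vocabulary word in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(s) for s in sentences]

print(vocabulary)
for sentence, vector in zip(sentences, vectors):
    print(vector, "<-", sentence)
```

"covid" and "coronavirus" end up in separate, unrelated columns, so nothing in the representation says that the two sentences mean the same thing.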


In [None]:
!python -m spacy download en_core_web_md


In [None]:
# Load the en_core_web_md model
md_nlp = spacy.load('en_core_web_md')

# Print the number of unique word vectors stored in the model
print("Number of vectors: ", md_nlp.meta["vectors"]["vectors"], "\n")

# Print the dimensions of word vectors in en_core_web_md model
print("Dimension of word vectors: ", md_nlp.meta["vectors"]["width"])

In [None]:
words = ["like", "love"]

# IDs of all the given words
ids = [md_nlp.vocab.strings[w] for w in words]

# Store the first ten elements of the word vectors for each word
word_vectors = [md_nlp.vocab.vectors[i][:10] for i in ids]

# Print the first ten elements of the first word vector
print(word_vectors[0])

# Visualizing Word Vectors and Semantic Understanding

1. **Introduction to Word Vectors and spaCy**  
   Word vectors can be used to visualize and identify similar contexts. We'll learn how to visualize them and use spaCy to explore word similarities.

2. **Visualizing Word Vectors with PCA**  
   To visualize word vectors, we project them into a two-dimensional space using Principal Component Analysis (PCA). This allows us to group words based on their semantic similarities, such as animals, fruits, and emotional contexts.

3. **Creating Word Vector Visualizations with Python**  
   Using `matplotlib`, `spaCy`, and `sklearn`, we can create word vector visualizations. By loading a spaCy model and extracting word vectors, we apply PCA to project the 300-dimensional vectors into two dimensions and visualize them in a scatter plot.

4. **Dimensionality Reduction with PCA**  
   To reduce the 300-dimensional word vectors into two dimensions, we use PCA to extract the two principal components. These components serve as the x and y coordinates in the scatter plot, allowing us to visualize word groupings.

5. **Analogies and Vector Operations**  
   Word vectors can capture semantic relationships and support analogies through vector operations. For example, "queen - woman + man = king" demonstrates how gender relations can be represented by vector manipulation.

6. **Finding Similar Words in a Vocabulary**  
   We can use spaCy to find words that are semantically similar to a given word, like "covid". By extracting its word vector and using the `most_similar()` method, spaCy returns similar terms such as "covid-19" and "corona."
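
The analogy in point 5 can be sketched with hand-made toy vectors. The two-dimensional values below are hypothetical (axis 0 stands for "royalty", axis 1 for "gender") and are not taken from any spaCy model; real word vectors have hundreds of dimensions and only approximate this behaviour.

```python
# Hypothetical 2-D vectors: axis 0 encodes "royalty", axis 1 encodes "gender"
toy_vectors = {
    "king":  (1.0,  1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0,  1.0),
    "woman": (0.0, -1.0),
}

def analogy(a, b, c):
    """Return the word whose vector is closest to vector(a) - vector(b) + vector(c)."""
    target = tuple(x - y + z for x, y, z in
                   zip(toy_vectors[a], toy_vectors[b], toy_vectors[c]))

    def squared_distance(word):
        return sum((t - v) ** 2 for t, v in zip(target, toy_vectors[word]))

    # Exclude the input words so the answer is a different word
    candidates = [w for w in toy_vectors if w not in (a, b, c)]
    return min(candidates, key=squared_distance)

result = analogy("queen", "woman", "man")
print("queen - woman + man is closest to:", result)
```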


In [None]:
from sklearn.decomposition import PCA

words = ["tiger", "bird", "sea", "air", "land", "king", "queen"]

# Extract word IDs of given words
word_ids = [md_nlp.vocab.strings[w] for w in words]

# Extract word vectors and stack the first five elements vertically
word_vectors = np.vstack([md_nlp.vocab.vectors[i][:5] for i in word_ids])

# Calculate the transformed word vectors using the pca object
pca = PCA(n_components=2)
word_vectors_transformed = pca.fit_transform(word_vectors)

# Print the first component of the transformed word vectors
print(word_vectors_transformed[:, 0])

fig, ax = plt.subplots() 
ax.scatter(x=word_vectors_transformed[:, 0], y=word_vectors_transformed[:, 1])

for i, txt in enumerate(words):
    ax.annotate(txt, (word_vectors_transformed[i, 0]+0.15, word_vectors_transformed[i, 1]+0.15))


# Measuring Semantic Similarity with spaCy

**Understanding Semantic Similarity**  
   Semantic similarity helps to categorize texts or detect relevant content. For example, identifying customer questions related to "price" can be done by calculating similarity scores between sentences and the target word.

**Calculating Similarity Scores**  
   The similarity score is calculated using **cosine similarity** between word vectors. Cosine similarity ranges from -1 to 1 in general, although scores for related texts typically fall between 0 and 1. A higher value indicates a closer semantic relationship between texts.

**Token Similarity**  
   We can calculate similarity scores between individual tokens. For example, the similarity between "pizza" and "pasta" can be calculated, resulting in a score of 0.685, showing they are somewhat related.

**Span Similarity**  
   spaCy can also compute similarity between spans of text. For example, "eat pizza" and "eat pasta" have a similarity score of 0.936, indicating they are very closely related, whereas "eat pizza" and "like to eat pasta" score lower at 0.588.

**Doc Similarity**  
   Documents can be compared using spaCy's `.similarity()` method. For example, "I like to play basketball" and "I love to play basketball" score 0.975, showing high similarity.

**Sentence Similarity**  
   spaCy can find the most relevant sentence to a keyword using sentence similarity. For instance, "What is the cheapest flight from Boston to Seattle?" scores highest in relevance to the keyword "price".
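
The cosine similarity behind these scores can be sketched in plain Python. This is a minimal illustration on made-up vectors, not spaCy's internal implementation; the `cosine_similarity` function name is an assumption for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print("Same direction:", round(same_direction, 3))
print("Orthogonal:", round(orthogonal, 3))
print("Opposite:", round(opposite, 3))
```

Only the direction of the vectors matters: scaling a vector does not change its cosine similarity with another one.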


In [None]:
texts=['I like the Vitality canned dog food products.',
 'The peanuts were actually small sized unsalted. Not sure if this was an error.',
 'It is a light, pillowy citrus gelatin with nuts - in this case Filberts.',
 'the Root Beer Extract I ordered is very medicinal.',
 'Great taffy at a great price.']

# Create a documents list containing Doc containers
documents = [md_nlp(t) for t in texts]

# Create a Doc container of the category
category = "canned dog food"
category_document = md_nlp(category)

# Print similarity scores of each Doc container and the category_document
for i, doc in enumerate(documents):
  print(f"Semantic similarity with document {i+1}:", round(doc.similarity(category_document), 3))

# Spacy Pipelines



In [None]:
texts='I like the Vitality canned dog food products. The peanuts were actually small sized unsalted. Not sure if this was an error. It is a light, pillowy citrus gelatin with nuts - in this case Filberts.'

# Load a blank spaCy English model and add a sentencizer component
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Create Doc containers, store sentences and print its number of sentences
doc = nlp(texts)
sentences = [s for s in doc.sents]
print("Number of sentences: ", len(sentences), "\n")

# Print the list of tokens in the second sentence
print("Second sentence tokens: ", [token for token in sentences[1]])

In [None]:
# Load a blank spaCy English model
nlp = spacy.blank("en")

# Add tagger and entity_linker pipeline components
nlp.add_pipe("tagger")
nlp.add_pipe("entity_linker")

# Analyze the pipeline
analysis = nlp.analyze_pipes(pretty=True)

> The analysis output shows that the pipeline is missing components that `entity_linker` depends on: a named entity recognizer and a sentencizer.

# EntityRuler  

The EntityRuler adds named entities to a Doc container.

It can be used on its own or combined with the EntityRecognizer.



In [None]:
nlp = spacy.blank("en")
patterns = [{"label": "ORG", "pattern": [{"LOWER": "openai"}]},
            {"label": "ORG", "pattern": [{"LOWER": "microsoft"}]}]
text = "OpenAI has joined forces with Microsoft."

# Add EntityRuler component to the model
entity_ruler = nlp.add_pipe("entity_ruler")

# Add given patterns to the EntityRuler component
entity_ruler.add_patterns(patterns)

# Run the model on a given text
doc = nlp(text)

# Print entities text and type for all entities in the Doc container
print([(ent.text, ent.label_) for ent in doc.ents])

In [None]:
nlp = spacy.load("en_core_web_sm")
text = "New York Group was built in 1987."

# Add an EntityRuler to the nlp before NER component
ruler = nlp.add_pipe("entity_ruler", before="ner")

# Define a token pattern (one dict per token) to classify lower cased new york group as ORG
patterns = [{"label": "ORG", "pattern": [{"LOWER": "new"}, {"LOWER": "york"}, {"LOWER": "group"}]}]

# Add the patterns to the EntityRuler component
ruler.add_patterns(patterns)

# Run the model and print entities text and type for all the entities
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])

In [None]:
example_text='This is a confection. In this case Filberts. And it is cut into tiny squares. This is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.'

nlp = spacy.load("en_core_web_md")

# Print a list of tuples of entities text and types in the example_text
print("Before EntityRuler: ", [(ent.text, ent.label_) for ent in nlp(example_text).ents], "\n")

# Define pattern to add a label PERSON for lower cased sisters and brother entities
patterns = [{"label": 'PERSON', "pattern": [{"lower": 'brother'}]},
            {"label": 'PERSON', "pattern": [{"lower": 'sisters'}]}]

# Add an EntityRuler component and add the patterns to the ruler
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

# Print a list of tuples of entities text and types
print("After EntityRuler: ", [(ent.text, ent.label_) for ent in nlp(example_text).ents])

> Weird: before the EntityRuler was added, 'Sisters' was not supposed to be caught as an entity.

# Regex with Spacy

Regex can be used to find and retrieve patterns, or to replace matching patterns. Links and phone numbers are typical examples.

- It enables writing robust rules to retrieve information
- It allows us to find many types of variance in strings
- It runs fast
- It is supported by most programming languages

On the downside: 
- Regex patterns can be complex
- Writing them requires understanding how the possible matches vary across the corpus.
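
Replacing matches, mentioned above, can be done with `re.sub`. The sketch below masks phone numbers of the form `ddd-ddd-dddd`; the `<PHONE>` placeholder is an arbitrary choice made for this example.

```python
import re

text = "Our phone number is 834-234-5643 and their phone number is 234-456-2322."
phone_pattern = r"\d{3}-\d{3}-\d{4}"

# Replace every substring that matches the pattern with a placeholder
masked = re.sub(phone_pattern, "<PHONE>", text)
print(masked)
```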




In [None]:
import re

pattern = r"((\d){3}-(\d){3}-(\d){4})"
text = "Our phone number is 834-234-5643 and their phone number is 234-456-2322."

iter_matches = re.finditer(pattern,text)
for match in iter_matches:
    start_char = match.start()
    end_char = match.end()
    print("Start character:", start_char, "| End character:", end_char, "| Matching text: ", text[start_char:end_char])

spaCy supports rule-based pattern matching, including regex, through three classes: Matcher, PhraseMatcher and EntityRuler.


In [None]:
text = "Our phone number is 832-123-5555 and their phone number is 425-123-4567."
nlp = spacy.blank("en")
patterns = [{"label": "PHONE_NUMBER", "pattern": [{"SHAPE": "ddd"},{"ORTH": "-"},{"SHAPE": "ddd"},{"ORTH": "-"}, {"SHAPE":"dddd"}]}]

ruler=nlp.add_pipe('entity_ruler')
ruler.add_patterns(patterns) 
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])

In [None]:
text = "Our phone number is 4251234567."

# Define a pattern to match phone numbers
patterns = [{"label": "PHONE_NUMBERS", "pattern": [{"TEXT": {"REGEX": r"(\d){10}"}}]}]

# Load a blank model and add an EntityRuler
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Add the compiled patterns to the EntityRuler
ruler.add_patterns(patterns)

# Print the tuple of entities texts and types for the given text
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])

# Matcher in Spacy 

Since regex can be complex and difficult to read and debug, spaCy offers a readable, production-ready alternative: the **Matcher** class.


In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

doc = nlp("Good morning, this is our first day on campus.")

matcher = Matcher(nlp.vocab)

pattern = [{"LOWER": "good"},{"LOWER": "morning"}]

matcher.add("morning_greeting", [pattern])

matches = matcher(doc)
for match_id, start, end in matches:
    print("Start token:", start, "| End token:", end, "| Matched text: ", doc[start:end].text)

Matchers have many other operators to better describe how the pattern would match the text.

In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

doc = nlp("Good morning, this is our first day on campus.")

matcher = Matcher(nlp.vocab)

pattern = [{"LOWER": "good"},{"LOWER": {"IN": ["morning", "evening"]}}]

matcher.add("morning_greeting", [pattern])

matches = matcher(doc)
for match_id, start, end in matches:
    print("Start token:", start, "| End token:", end, "| Matched text: ", doc[start:end].text)
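
Besides `IN`, token patterns accept an `OP` key that controls how many times a token may occur. The following sketch uses `"OP": "?"` to make a token optional; it assumes a blank English pipeline, which is enough here because `LOWER` only needs the tokenizer. The sentences and the pattern name are illustrative.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "OP": "?" makes the middle token optional (zero or one occurrence)
pattern = [{"LOWER": "good"}, {"LOWER": "early", "OP": "?"}, {"LOWER": "morning"}]
matcher.add("optional_early_greeting", [pattern])

matched_texts = []
for text in ["Good morning, everyone.", "Good early morning, everyone."]:
    doc = nlp(text)
    for match_id, start, end in matcher(doc):
        matched_texts.append(doc[start:end].text)

print(matched_texts)
```

A single pattern therefore covers both greetings, where a fixed token sequence would need two separate patterns.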

# Phrase Matcher 

The Matcher class needs its patterns to be crafted one by one.

spaCy has another class, PhraseMatcher, that makes creating patterns from a list of terms much easier.
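
A minimal PhraseMatcher sketch, assuming a blank English pipeline and an illustrative list of city names: each term is turned into a pattern with `nlp.make_doc`, and `attr="LOWER"` makes the comparison case-insensitive.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Compare tokens by their lowercase form instead of their exact text
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

terms = ["New York", "San Francisco"]
matcher.add("CITIES", [nlp.make_doc(term) for term in terms])

doc = nlp("I flew from new york to San Francisco.")
found = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(found)
```

The terms are supplied as plain strings, so no per-token pattern dictionaries have to be written by hand.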



In [None]:
example_text = 'I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis\' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.'

nlp = spacy.load("en_core_web_sm")
doc = nlp(example_text)

# Initialize a Matcher object
matcher = Matcher(nlp.vocab)

# Define a pattern to match lower cased word witch
pattern = [{"lower" : "witch"}]

# Add the pattern to matcher object and find matches
matcher.add("CustomMatcher", [pattern])
matches = matcher(doc)

# Print start and end token indices and span of the matched text
for match_id, start, end in matches:
    print("Start token: ", start, " | End token: ", end, "| Matched text: ", doc[start:end].text)

In [None]:
from spacy.matcher import PhraseMatcher

text = "There are only a few acceptable IP addresses: (1) 127.100.0.1, (2) 123.4.1.0."
terms = ["110.0.0.0", "101.243.0.0"]

# Initialize a PhraseMatcher class to match to shapes of given terms
matcher = PhraseMatcher(nlp.vocab, attr='SHAPE')

# Create patterns to add to the PhraseMatcher object
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("IPAddresses", patterns)

# Find matches to the given patterns and print start and end token indices and matched texts
doc = nlp(text)
matches = matcher(doc)
for match_id, start, end in matches:
    print("Start token: ", start, " | End token: ", end, "| Matched text: ", doc[start:end].text)

In [None]:
example_text = 'It is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.'

nlp = spacy.load("en_core_web_sm")
doc = nlp(example_text)

# Define a matcher object
matcher = Matcher(nlp.vocab)
# Define a pattern to match tiny squares and tiny mouthful
pattern = [{"lower": 'tiny'}, {"lower": {"IN": ["squares", "mouthful"]}}]

# Add the pattern to matcher object and find matches
matcher.add("CustomMatcher", [pattern])
matches = matcher(doc)

# Print out start and end token indices and the matched text span per match
for match_id, start, end in matches:
    print("Start token: ", start, " | End token: ", end, "| Matched text: ", doc[start:end].text)

> Spacy allows for its models to be adapted to specific domains.

## Multilingual NER with polyglot

