### The corpus is taken from https://en.wikipedia.org/wiki/Natural_language_processing




In [2]:
df=""" class="mw-heading mw-heading3"><h3 id="Symbolic_NLP_(1950s_–_early_1990s)"><span id="Symbolic_NLP_.281950s_.E2.80.93_early_1990s.29"></span>Symbolic NLP (1950s – early 1990s)</h3><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Natural_language_processing&amp;action=edit&amp;section=2" title="Edit section: Symbolic NLP (1950s – early 1990s)"><span>edit</span></a><span class="mw-editsection-bracket">]</span></span></div>
<p>The premise of symbolic NLP is well-summarized by <a href="/wiki/John_Searle" title="John Searle">John Searle</a>'s <a href="/wiki/Chinese_room" title="Chinese room">Chinese room</a> experiment: Given a collection of rules (e.g., a Chinese phrasebook, with questions and matching answers), the computer emulates natural language understanding (or other NLP tasks) by applying those rules to the data it confronts.
</p>
<ul><li><b>1950s</b>: The <a href="/wiki/Georgetown-IBM_experiment" class="mw-redirect" title="Georgetown-IBM experiment">Georgetown experiment</a> in 1954 involved fully <a href="/wiki/Automatic_translation" class="mw-redirect" title="Automatic translation">automatic translation</a> of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.<sup id="cite_ref-2" class="reference"><a href="#cite_note-2"><span class="cite-bracket">&#91;</span>2<span class="cite-bracket">&#93;</span></a></sup>  However, real progress was much slower, and after the <a href="/wiki/ALPAC" title="ALPAC">ALPAC report</a> in 1966, which found that ten years of research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted in America (though some research continued elsewhere, such as Japan and Europe<sup id="cite_ref-3" class="reference"><a href="#cite_note-3"><span class="cite-bracket">&#91;</span>3<span class="cite-bracket">&#93;</span></a></sup>) until the late 1980s when the first <a href="/wiki/Statistical_machine_translation" title="Statistical machine translation">statistical machine translation</a> systems were developed.</li>
<li><b>1960s</b>: Some notably successful natural language processing systems developed in the 1960s were <a href="/wiki/SHRDLU" title="SHRDLU">SHRDLU</a>, a natural language system working in restricted "<a href="/wiki/Blocks_world" title="Blocks world">blocks worlds</a>" with restricted vocabularies, and <a href="/wiki/ELIZA" title="ELIZA">ELIZA</a>, a simulation of a <a href="/wiki/Rogerian_psychotherapy" class="mw-redirect" title="Rogerian psychotherapy">Rogerian psychotherapist</a>, written by <a href="/wiki/Joseph_Weizenbaum" title="Joseph Weizenbaum">Joseph Weizenbaum</a> between 1964 and 1966. Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. When the "patient" exceeded the very small knowledge base, ELIZA might provide a generic response, for example, responding to "My head hurts" with "Why do you say your head hurts?". <a href="/w/index.php?title=Ross_Quillian&amp;action=edit&amp;redlink=1" class="new" title="Ross Quillian (page does not exist)">Ross Quillian</a>'s successful work on natural language was demonstrated with a vocabulary of only <i>twenty</i> words, because that was all that would fit in a computer  memory at the time.<sup id="cite_ref-4" class="reference"><a href="#cite_note-4"><span class="cite-bracket">&#91;</span>4<span class="cite-bracket">&#93;</span></a></sup></li></ul>
<ul><li><b>1970s</b>: During the 1970s, many programmers began to write "conceptual <a href="/wiki/Ontology_(information_science)" title="Ontology (information science)">ontologies</a>", which structured real-world information into computer-understandable data.  Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert 1981).  During this time, the first <a href="/wiki/Chatterbots" class="mw-redirect" title="Chatterbots">chatterbots</a> were written (e.g., <a href="/wiki/PARRY" title="PARRY">PARRY</a>).</li>
<li><b>1980s</b>: The 1980s and early 1990s mark the heyday of symbolic methods in NLP. Focus areas of the time included research on rule-based parsing (e.g., the development of <a href="/wiki/Head-driven_phrase_structure_grammar" title="Head-driven phrase structure grammar">HPSG</a> as a computational operationalization of <a href="/wiki/Generative_grammar" title="Generative grammar">generative grammar</a>), morphology (e.g., two-level morphology<sup id="cite_ref-5" class="reference"><a href="#cite_note-5"><span class="cite-bracket">&#91;</span>5<span class="cite-bracket">&#93;</span></a></sup>), semantics (e.g., <a href="/wiki/Lesk_algorithm" title="Lesk algorithm">Lesk algorithm</a>), reference (e.g., within Centering Theory<sup id="cite_ref-6" class="reference"><a href="#cite_note-6"><span class="cite-bracket">&#91;</span>6<span class="cite-bracket">&#93;</span></a></sup>) and other areas of natural language understanding (e.g., in the <a href="/wiki/Rhetorical_structure_theory" title="Rhetorical structure theory">Rhetorical Structure Theory</a>). Other lines of research were continued, e.g., the development of chatterbots with <a href="/wiki/Racter" title="Racter">Racter</a> and <a href="/wiki/Jabberwacky" title="Jabberwacky">Jabberwacky</a>. An important development (that eventually led to the statistical turn in the 1990s) was the rising importance of quantitative evaluation in this period.<sup id="cite_ref-7" class="reference"><a href="#cite_note-7"><span class="cite-bracket">&#91;</span>7<span class="cite-bracket">&#93;</span></a></sup></li></ul>
<div class="mw-heading mw-heading3"><h3 id="Statistical_NLP_(1990s–2010s)"><span id="Statistical_NLP_.281990s.E2.80.932010s.29"></span>Statistical NLP (1990s–2010s)</h3><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Natural_language_processing&amp;action=edit&amp;section=3" title="Edit section: Statistical NLP (1990s–2010s)"><span>edit</span></a><span class="mw-editsection-bracket">]</span></span></div>
<p>Up until the 1980s, most natural language processing systems were based on complex sets of hand-written rules.  Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of <a href="/wiki/Machine_learning" title="Machine learning">machine learning</a> algorithms for language processing.  This was due to both the steady increase in computational power (see <a href="/wiki/Moore%27s_law" title="Moore&#39;s law">Moore's law</a>) and the gradual lessening of the dominance of <a href="/wiki/Noam_Chomsky" title="Noam Chomsky">Chomskyan</a> theories of linguistics (e.g. <a href="/wiki/Transformational_grammar" title="Transformational grammar">transformational grammar</a>), whose theoretical underpinnings discouraged the sort of <a href="/wiki/Corpus_linguistics" title="Corpus linguistics">corpus linguistics</a> that underlies the machine-learning approach to language processing.<sup id="cite_ref-8" class="reference"><a href="#cite_note-8"><span class="cite-bracket">&#91;</span>8<span class="cite-bracket">&#93;</span></a></sup> 
</p>
<ul><li><b>1990s</b>: Many of the notable early successes in statistical methods in NLP occurred in the field of <a href="/wiki/Machine_translation" title="Machine translation">machine translation</a>, due especially to work at IBM Research, such as <a href="/wiki/IBM_alignment_models" title="IBM alignment models">IBM alignment models</a>.  These systems were able to take advantage of existing multilingual <a href="/wiki/Text_corpus" title="Text corpus">textual corpora</a> that had been produced by the <a href="/wiki/Parliament_of_Canada" title="Parliament of Canada">Parliament of Canada</a> and the <a href="/wiki/European_Union" title="European Union">European Union</a> as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government.  However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in the success of these systems. As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data.</li>
<li><b>2000s</b>: With the growth of the web, increasing amounts of raw (unannotated) language data have become available since the mid-1990s. Research has thus increasingly focused on <a href="/wiki/Unsupervised_learning" title="Unsupervised learning">unsupervised</a> and <a href="/wiki/Semi-supervised_learning" class="mw-redirect" title="Semi-supervised learning">semi-supervised learning</a> algorithms.  Such algorithms can learn from data that has not been hand-annotated with the desired answers or using a combination of annotated and non-annotated data.  Generally, this task is much more difficult than <a href="/wiki/Supervised_learning" title="Supervised learning">supervised learning</a>, and typically produces less accurate results for a given amount of input data.  However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the <a href="/wiki/World_Wide_Web" title="World Wide Web">World Wide Web</a>), which can often make up for the inferior results if the algorithm used has a low enough <a href="/wiki/Time_complexity" title="Time complexity">time complexity</a> to be practical.</li></ul>
<div class="mw-heading mw-heading3"><h3 id="Neural_NLP_(present)"><span id="Neural_NLP_.28present.29"></span>Neural NLP (present)</h3><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Natural_language_processing&amp;action=edit&amp;section=4" title="Edit section: Neural NLP (present)"><span>edit</span></a><span class="mw-editsection-bracket">]</span></span></div>
<p>In 2003, <a href="/wiki/Word_n-gram_language_model" title="Word n-gram language model">word n-gram model</a>, at the time the best statistical algorithm, was outperformed by a <a href="/wiki/Multi-layer_perceptron" class="mw-redirect" title="Multi-layer perceptron">multi-layer perceptron</a> (with a single hidden layer and context length of several words trained on up to 14 million of words with a CPU cluster in <a href="/wiki/Language_model" title="Language model">language modelling</a>) by <a href="/wiki/Yoshua_Bengio" title="Yoshua Bengio">Yoshua Bengio</a> with co-authors.<sup id="cite_ref-9" class="reference"><a href="#cite_note-9"><span class="cite-bracket">&#91;</span>9<span class="cite-bracket">&#93;</span></a></sup> 
</p><p>In 2010, <a href="/wiki/Tom%C3%A1%C5%A1_Mikolov" title="Tomáš Mikolov">Tomáš Mikolov</a> (then a PhD student at <a href="/wiki/Brno_University_of_Technology" title="Brno University of Technology">Brno University of Technology</a>) with co-authors applied a simple <a href="/wiki/Recurrent_neural_network" title="Recurrent neural network">recurrent neural network</a> with a single hidden layer to language modelling,<sup id="cite_ref-10" class="reference"><a href="#cite_note-10"><span class="cite-bracket">&#91;</span>10<span class="cite-bracket">&#93;</span></a></sup> and in the following years he went on to develop <a href="/wiki/Word2vec" title="Word2vec">Word2vec</a>. In the 2010s, <a href="/wiki/Representation_learning" class="mw-redirect" title="Representation learning">representation learning</a> and <a href="/wiki/Deep_learning" title="Deep learning">deep neural network</a>-style (featuring many hidden layers) machine learning methods became widespread in natural language processing. 
That popularity was due partly to a flurry of results showing that such techniques<sup id="cite_ref-goldberg:nnlp17_11-0" class="reference"><a href="#cite_note-goldberg:nnlp17-11"><span class="cite-bracket">&#91;</span>11<span class="cite-bracket">&#93;</span></a></sup><sup id="cite_ref-goodfellow:book16_12-0" class="reference"><a href="#cite_note-goodfellow:book16-12"><span class="cite-bracket">&#91;</span>12<span class="cite-bracket">&#93;</span></a></sup> can achieve state-of-the-art results in many natural language tasks, e.g., in <a href="/wiki/Language_modeling" class="mw-redirect" title="Language modeling">language modeling</a><sup id="cite_ref-jozefowicz:lm16_13-0" class="reference"><a href="#cite_note-jozefowicz:lm16-13"><span class="cite-bracket">&#91;</span>13<span class="cite-bracket">&#93;</span></a></sup> and parsing.<sup id="cite_ref-choe:emnlp16_14-0" class="reference"><a href="#cite_note-choe:emnlp16-14"><span class="cite-bracket">&#91;</span>14<span class="cite-bracket">&#93;</span></a></sup><sup id="cite_ref-vinyals:nips15_15-0" class="reference"><a href="#cite_note-vinyals:nips15-15"><span class="cite-bracket">&#91;</span>15<span class="cite-bracket">&#93;</span></a></sup> This is increasingly important <a href="/wiki/Artificial_intelligence_in_healthcare" title="Artificial intelligence in healthcare">in medicine and healthcare</a>, where NLP helps analyze notes and text in <a href="/wiki/Electronic_health_record" title="Electronic health record">electronic health records</a> that would otherwise be inaccessible for study when seeking to improve care<sup id="cite_ref-16" class="reference"><a href="#cite_note-16"><span class="cite-bracket">&#91;</span>16<span class="cite-bracket">&#93;</span></a></sup> or protect patient privacy.<sup id="cite_ref-17" class="reference"><a href="#cite_note-17"><span class="cite-bracket">&#91;</span>17<span class="cite-bracket">&#93;</span></a></sup>
</p>"""

In [4]:
df=df.lower()
print(df)

 class="mw-heading mw-heading3"><h3 id="symbolic_nlp_(1950s_–_early_1990s)"><span id="symbolic_nlp_.281950s_.e2.80.93_early_1990s.29"></span>symbolic nlp (1950s – early 1990s)</h3><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=natural_language_processing&amp;action=edit&amp;section=2" title="edit section: symbolic nlp (1950s – early 1990s)"><span>edit</span></a><span class="mw-editsection-bracket">]</span></span></div>
<p>the premise of symbolic nlp is well-summarized by <a href="/wiki/john_searle" title="john searle">john searle</a>'s <a href="/wiki/chinese_room" title="chinese room">chinese room</a> experiment: given a collection of rules (e.g., a chinese phrasebook, with questions and matching answers), the computer emulates natural language understanding (or other nlp tasks) by applying those rules to the data it confronts.
</p>
<ul><li><b>1950s</b>: the <a href="/wiki/georgetown-ibm_experiment" class="mw-redirect" title="georg

In [6]:
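import re

def remove_html_tags(text):
    # Non-greedy match of everything between '<' and '>'.
    # Character references such as '&#91;' are not tags and pass through unchanged,
    # and so does a fragment that lacks its opening '<'.
    pattern = re.compile(r'<.*?>')
    return pattern.sub('', text)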

df=remove_html_tags(df)
print(df)


 class="mw-heading mw-heading3">symbolic nlp (1950s – early 1990s)[edit]
the premise of symbolic nlp is well-summarized by john searle's chinese room experiment: given a collection of rules (e.g., a chinese phrasebook, with questions and matching answers), the computer emulates natural language understanding (or other nlp tasks) by applying those rules to the data it confronts.

1950s: the georgetown experiment in 1954 involved fully automatic translation of more than sixty russian sentences into english. the authors claimed that within three or five years, machine translation would be a solved problem.&#91;2&#93;  however, real progress was much slower, and after the alpac report in 1966, which found that ten years of research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. little further research in machine translation was conducted in america (though some research continued elsewhere, such as japan and europe&#91;3&#93;) until the la
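
The `&#91;2&#93;` residue in the output above is an HTML character reference, which the tag regex does not touch; once punctuation is stripped it ends up fused to the preceding word (`problem91293`). A minimal sketch of one way to avoid that, assuming the standard-library `html.unescape` and a hypothetical helper name `strip_html` that is not part of the original notebook:

```python
import html
import re

TAG_RE = re.compile(r'<[^>]*>')
CITATION_RE = re.compile(r'\[\d+\]')   # Wikipedia-style markers such as [2]

def strip_html(text):
    """Hypothetical helper: remove tags, decode entities, drop [n] citation markers."""
    text = TAG_RE.sub('', text)
    # Decode only after the tags are gone, so a decoded '<' is never mistaken for markup.
    text = html.unescape(text)          # '&#91;2&#93;' -> '[2]', '&amp;' -> '&'
    return CITATION_RE.sub('', text)

sample = 'a solved problem.<sup><a href="#cite_note-2">&#91;2&#93;</a></sup> however'
print(strip_html(sample))   # prints: a solved problem. however
```

Dropping the citation markers is a choice made for this corpus; text where bracketed numbers carry meaning would need a narrower pattern.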

In [7]:
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

df= remove_url(df)
print(df)


 class="mw-heading mw-heading3">symbolic nlp (1950s – early 1990s)[edit]
the premise of symbolic nlp is well-summarized by john searle's chinese room experiment: given a collection of rules (e.g., a chinese phrasebook, with questions and matching answers), the computer emulates natural language understanding (or other nlp tasks) by applying those rules to the data it confronts.

1950s: the georgetown experiment in 1954 involved fully automatic translation of more than sixty russian sentences into english. the authors claimed that within three or five years, machine translation would be a solved problem.&#91;2&#93;  however, real progress was much slower, and after the alpac report in 1966, which found that ten years of research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. little further research in machine translation was conducted in america (though some research continued elsewhere, such as japan and europe&#91;3&#93;) until the la

In [8]:
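import string
import time

# ASCII punctuation only; non-ASCII marks such as the en dash '–' are left in place.
punct = string.punctuation

def remove_punc1(text):
    # Deletes each punctuation character without inserting a space,
    # so 'well-summarized' becomes 'wellsummarized'.
    return text.translate(str.maketrans('', '', punct))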

df=remove_punc1(df)
print(df)

 classmwheading mwheading3symbolic nlp 1950s – early 1990sedit
the premise of symbolic nlp is wellsummarized by john searles chinese room experiment given a collection of rules eg a chinese phrasebook with questions and matching answers the computer emulates natural language understanding or other nlp tasks by applying those rules to the data it confronts

1950s the georgetown experiment in 1954 involved fully automatic translation of more than sixty russian sentences into english the authors claimed that within three or five years machine translation would be a solved problem91293  however real progress was much slower and after the alpac report in 1966 which found that ten years of research had failed to fulfill the expectations funding for machine translation was dramatically reduced little further research in machine translation was conducted in america though some research continued elsewhere such as japan and europe91393 until the late 1980s when the first statistical machine tra
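
Because `remove_punc1` deletes punctuation outright, neighbouring words are glued together in the output above (`wellsummarized`, `1990sedit`), and the en dash survives because it is not in `string.punctuation`. A minimal sketch of an alternative that maps punctuation to spaces instead; the function name `remove_punc_keep_boundaries` is hypothetical, and adding the en dash to the character set is an assumption based on this corpus:

```python
import re
import string

# Map every ASCII punctuation character, plus the en dash seen in the corpus, to a space.
PUNCT_TO_SPACE = str.maketrans({ch: ' ' for ch in string.punctuation + '–'})

def remove_punc_keep_boundaries(text):
    """Hypothetical variant of remove_punc1 that preserves word boundaries."""
    spaced = text.translate(PUNCT_TO_SPACE)
    # Collapse the runs of whitespace (including newlines) left behind.
    return re.sub(r'\s+', ' ', spaced).strip()

print(remove_punc_keep_boundaries('well-summarized (1950s – early 1990s)[edit]'))
# prints: well summarized 1950s early 1990s edit
```

The trade-off is that possessives and contractions are split as well (`searle's` becomes `searle s`), so whether this is preferable depends on the downstream step.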

In [9]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
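stop = set(stopwords.words('english'))  # a set makes each membership test O(1)

# split() followed by join() also collapses newlines and repeated spaces into single spaces.
df = ' '.join(word for word in df.split() if word not in stop)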
print(df)

classmwheading mwheading3symbolic nlp 1950s – early 1990sedit premise symbolic nlp wellsummarized john searles chinese room experiment given collection rules eg chinese phrasebook questions matching answers computer emulates natural language understanding nlp tasks applying rules data confronts 1950s georgetown experiment 1954 involved fully automatic translation sixty russian sentences english authors claimed within three five years machine translation would solved problem91293 however real progress much slower alpac report 1966 found ten years research failed fulfill expectations funding machine translation dramatically reduced little research machine translation conducted america though research continued elsewhere japan europe91393 late 1980s first statistical machine translation systems developed 1960s notably successful natural language processing systems developed 1960s shrdlu natural language system working restricted blocks worlds restricted vocabularies eliza simulation roger

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hrish\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [11]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])
stemwords=stem_words(df)
stemwords

'classmwhead mwheading3symbol nlp 1950 – earli 1990sedit premis symbol nlp wellsummar john searl chines room experi given collect rule eg chines phrasebook question match answer comput emul natur languag understand nlp task appli rule data confront 1950 georgetown experi 1954 involv fulli automat translat sixti russian sentenc english author claim within three five year machin translat would solv problem91293 howev real progress much slower alpac report 1966 found ten year research fail fulfil expect fund machin translat dramat reduc littl research machin translat conduct america though research continu elsewher japan europe91393 late 1980 first statist machin translat system develop 1960 notabl success natur languag process system develop 1960 shrdlu natur languag system work restrict block world restrict vocabulari eliza simul rogerian psychotherapist written joseph weizenbaum 1964 1966 use almost inform human thought emot eliza sometim provid startlingli humanlik interact patient ex

In [12]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

sentence = df
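punctuations = set("?:!.,;")
sentence_words = nltk.word_tokenize(sentence)
# Build a new list rather than calling remove() while iterating over the same list,
# which skips the element that follows each removal.
sentence_words = [word for word in sentence_words if word not in punctuations]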

sentence_words

['classmwheading',
 'mwheading3symbolic',
 'nlp',
 '1950s',
 '–',
 'early',
 '1990sedit',
 'premise',
 'symbolic',
 'nlp',
 'wellsummarized',
 'john',
 'searles',
 'chinese',
 'room',
 'experiment',
 'given',
 'collection',
 'rules',
 'eg',
 'chinese',
 'phrasebook',
 'questions',
 'matching',
 'answers',
 'computer',
 'emulates',
 'natural',
 'language',
 'understanding',
 'nlp',
 'tasks',
 'applying',
 'rules',
 'data',
 'confronts',
 '1950s',
 'georgetown',
 'experiment',
 '1954',
 'involved',
 'fully',
 'automatic',
 'translation',
 'sixty',
 'russian',
 'sentences',
 'english',
 'authors',
 'claimed',
 'within',
 'three',
 'five',
 'years',
 'machine',
 'translation',
 'would',
 'solved',
 'problem91293',
 'however',
 'real',
 'progress',
 'much',
 'slower',
 'alpac',
 'report',
 '1966',
 'found',
 'ten',
 'years',
 'research',
 'failed',
 'fulfill',
 'expectations',
 'funding',
 'machine',
 'translation',
 'dramatically',
 'reduced',
 'little',
 'research',
 'machine',
 'translat

In [13]:
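# pos='v' treats every token as a verb, so only verb inflections are reduced
# (e.g. 'given' -> 'give'); tokens WordNet does not know are returned unchanged.
for word in sentence_words:
    lemma = wordnet_lemmatizer.lemmatize(word, pos='v')
    print(f"{word:20}{lemma:20}")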

classmwheading      classmwheading      
mwheading3symbolic  mwheading3symbolic  
nlp                 nlp                 
1950s               1950s               
–                   –                   
early               early               
1990sedit           1990sedit           
premise             premise             
symbolic            symbolic            
nlp                 nlp                 
wellsummarized      wellsummarized      
john                john                
searles             searles             
chinese             chinese             
room                room                
experiment          experiment          
given               give                
collection          collection          
rules               rule                
eg                  eg                  
chinese             chinese             
phrasebook          phrasebook          
questions           question            
matching            match               
answers         

In [14]:
from textblob import TextBlob
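textBlb = TextBlob(df)

# correct() rewrites tokens it does not recognize, so domain terms and names change too
# (in the output below, 'nlp' becomes 'nap' and 'alpac' becomes 'iliac').
corrected_strng = textBlb.correct().string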
print(corrected_strng)

classmwheading mwheading3symbolic nap 1950s – early 1990sedit premise symbolic nap wellsummarized john series chinese room experiment given collection rules eg chinese phrasebook questions watching answers computer emulated natural language understanding nap tasks applying rules data confront 1950s georgetown experiment 1954 involved fully automatic translation sixty russian sentences english authors claimed within three five years machine translation would solved problem91293 however real progress much slower iliac report 1966 found ten years research failed fulfill expectations funding machine translation dramatically reduced little research machine translation conducted america though research continued elsewhere japan europe91393 late 1980s first statistical machine translation systems developed 1960s notably successful natural language processing systems developed 1960s shrill natural language system working restricted blocks worlds restricted vocabularies eliza stimulation rogeri
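
The output above shows the spelling corrector rewriting domain terms (`nlp` to `nap`, `alpac` to `iliac`). One way to limit that is to skip a set of protected tokens. The sketch below is an assumption-laden illustration, not the notebook's method: `correct_except`, `PROTECTED`, and `toy_corrector` are hypothetical names, and the toy corrector only stands in for a real one (with TextBlob, something like `lambda w: str(Word(w).correct())` could be passed instead).

```python
# Hypothetical set of tokens that should never be "corrected".
PROTECTED = {'nlp', 'alpac', 'shrdlu', 'eliza'}

def correct_except(text, correct_word, protected):
    """Apply correct_word to each whitespace-separated token unless it is protected."""
    return ' '.join(
        word if word in protected else correct_word(word)
        for word in text.split()
    )

# Toy stand-in for a spelling corrector, mimicking the rewrites seen in the output above.
TOY_FIXES = {'nlp': 'nap', 'alpac': 'iliac', 'experimnt': 'experiment'}

def toy_corrector(word):
    return TOY_FIXES.get(word, word)

print(correct_except('nlp alpac experimnt', toy_corrector, PROTECTED))
# prints: nlp alpac experiment
```

Correcting token by token is slower than a single `correct()` call on the whole text, so this is only worth it when the protected vocabulary matters for later steps.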

In [15]:
from nltk.tokenize import word_tokenize

text = corrected_strng

tokens = word_tokenize(text)
print(tokens)

['classmwheading', 'mwheading3symbolic', 'nap', '1950s', '–', 'early', '1990sedit', 'premise', 'symbolic', 'nap', 'wellsummarized', 'john', 'series', 'chinese', 'room', 'experiment', 'given', 'collection', 'rules', 'eg', 'chinese', 'phrasebook', 'questions', 'watching', 'answers', 'computer', 'emulated', 'natural', 'language', 'understanding', 'nap', 'tasks', 'applying', 'rules', 'data', 'confront', '1950s', 'georgetown', 'experiment', '1954', 'involved', 'fully', 'automatic', 'translation', 'sixty', 'russian', 'sentences', 'english', 'authors', 'claimed', 'within', 'three', 'five', 'years', 'machine', 'translation', 'would', 'solved', 'problem91293', 'however', 'real', 'progress', 'much', 'slower', 'iliac', 'report', '1966', 'found', 'ten', 'years', 'research', 'failed', 'fulfill', 'expectations', 'funding', 'machine', 'translation', 'dramatically', 'reduced', 'little', 'research', 'machine', 'translation', 'conducted', 'america', 'though', 'research', 'continued', 'elsewhere', 'japan