## Natural Language Processing (NLP)
We will look at some typical tools for NLP in Python. The examples below can be found in https://github.com/pdeitel/PythonForProgrammers. The areas covered are:
- nltk https://nltk.org - tokenizing, stemming, lemmatization and part-of-speech (POS) tagging
- textblob https://textblob.readthedocs.io/en/dev/ - builds on nltk and simplifies its use
- textatistic http://www.erinhengel.com/software/textatistic/ - readability scores
- spacy https://spacy.io/ - speed-optimized, simplified analysis and similarity scoring

Examples below will take you through these tools and typical use cases. 

In [1]:
import nltk

In [2]:
nltk.download('punkt')  # needed for sentence and word tokenizing

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\i080272\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
nltk.download('averaged_perceptron_tagger') 

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\i080272\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [4]:
nltk.download('movie_reviews')  # corpus used to train the Naive Bayes sentiment analyzer

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\i080272\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [5]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\i080272\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

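## TextBlob
TextBlob covers a lot of NLP functionality behind one simple interface. Sentence and word tokenizing, part-of-speech tagging, language detection and sentiment analysis are shown below.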

In [6]:
!pip install textblob   



In [7]:
import textblob

In [8]:
from textblob import TextBlob

In [10]:
text1 = 'Today is a beautiful day. Tomorrow looks like bad weather.'

In [11]:
text2 = "Aujourd'hui il fait beau"

In [12]:
blob = TextBlob(text1)

In [13]:
text1

'Today is a beautiful day. Tomorrow looks like bad weather.'

In [14]:
blob

TextBlob("Today is a beautiful day. Tomorrow looks like bad weather.")

In [15]:
blob.sentences 

[Sentence("Today is a beautiful day."),
 Sentence("Tomorrow looks like bad weather.")]

In [16]:
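blob.detect_language()  # calls Google Translate, so it needs internet access; deprecated in TextBlob 0.16 and removed in later releases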

'en'

In [17]:
blob.words

WordList(['Today', 'is', 'a', 'beautiful', 'day', 'Tomorrow', 'looks', 'like', 'bad', 'weather'])
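To make the two tokenizing steps above more concrete, here is a rough sketch using only regular expressions. It is not how TextBlob or the punkt model works internally. The helper names `naive_sentences` and `naive_words` are made up for this illustration, and the rules are deliberately simplistic (an abbreviation such as "Dr." would be split wrongly).

```python
import re

text1 = 'Today is a beautiful day. Tomorrow looks like bad weather.'

def naive_sentences(text):
    """Split after '.', '!' or '?' when whitespace follows (hypothetical helper)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def naive_words(text):
    """Return runs of letters and apostrophes, dropping punctuation (hypothetical helper)."""
    return re.findall(r"[A-Za-z']+", text)

print(naive_sentences(text1))
# ['Today is a beautiful day.', 'Tomorrow looks like bad weather.']
print(naive_words(text1))
```

For this simple text the result lines up with `blob.sentences` and `blob.words`. Real text is the reason trained tokenizers such as punkt exist.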

In [18]:
blob.tags

[('Today', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('beautiful', 'JJ'),
 ('day', 'NN'),
 ('Tomorrow', 'NNP'),
 ('looks', 'VBZ'),
 ('like', 'IN'),
 ('bad', 'JJ'),
 ('weather', 'NN')]
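The tags above are Penn Treebank codes. A small sketch of how the tagged pairs might be used downstream follows. The `tags` list is copied from the output above so that the block runs without TextBlob, and `TAG_MEANINGS` is a hand-written lookup covering only the tags that appear here.

```python
from collections import Counter

# Copied from the blob.tags output above
tags = [('Today', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('beautiful', 'JJ'),
        ('day', 'NN'), ('Tomorrow', 'NNP'), ('looks', 'VBZ'), ('like', 'IN'),
        ('bad', 'JJ'), ('weather', 'NN')]

# Hand-written lookup for the Penn Treebank tags seen in this example only
TAG_MEANINGS = {
    'NN': 'noun, singular or mass',
    'NNP': 'proper noun, singular',
    'VBZ': 'verb, 3rd person singular present',
    'DT': 'determiner',
    'JJ': 'adjective',
    'IN': 'preposition or subordinating conjunction',
}

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in tags)

# Pull out the adjectives, which carry most of the sentiment in this text
adjectives = [word for word, tag in tags if tag == 'JJ']

for tag, count in tag_counts.most_common():
    print(f'{tag:4} {count}  {TAG_MEANINGS[tag]}')
print(adjectives)
```

With real data you would pass `blob.tags` in place of the copied list.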

In [19]:
blob.sentiment

Sentiment(polarity=0.07500000000000007, subjectivity=0.8333333333333333)
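TextBlob's default analyzer is lexicon based: broadly speaking, words found in a sentiment lexicon contribute a polarity and a subjectivity value, and the contributions are averaged. The toy sketch below illustrates that idea only. `TOY_LEXICON` and its values are assumptions picked for this example, not TextBlob's data, and the real analyzer also handles negation and intensifiers.

```python
# Hypothetical lexicon: word -> (polarity in [-1, 1], subjectivity in [0, 1])
TOY_LEXICON = {
    'beautiful': (0.85, 1.0),
    'bad': (-0.7, 2 / 3),
}

def toy_sentiment(words):
    """Average polarity and subjectivity over the words found in the lexicon."""
    scored = [TOY_LEXICON[w.lower()] for w in words if w.lower() in TOY_LEXICON]
    if not scored:
        return 0.0, 0.0  # no known words: treat as neutral and objective
    polarity = sum(p for p, _ in scored) / len(scored)
    subjectivity = sum(s for _, s in scored) / len(scored)
    return polarity, subjectivity

words = ['Today', 'is', 'a', 'beautiful', 'day', 'Tomorrow', 'looks',
         'like', 'bad', 'weather']
polarity, subjectivity = toy_sentiment(words)
print(round(polarity, 3), round(subjectivity, 3))
```

One strongly positive word and one fairly negative word nearly cancel, which is why the overall polarity ends up close to zero while subjectivity stays high.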

In [20]:
from textblob.sentiments import NaiveBayesAnalyzer

In [21]:
blob = TextBlob(text1, analyzer=NaiveBayesAnalyzer())

In [22]:
blob.sentiment

Sentiment(classification='neg', p_pos=0.47662917962091056, p_neg=0.5233708203790892)
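The Naive Bayes result above reports a class plus two probabilities that sum to 1. A minimal pure-Python sketch of that technique follows. It is not the analyzer's actual implementation: the training sentences in `TRAIN` are invented for illustration (the real analyzer is trained on the movie_reviews corpus downloaded earlier), and `train` and `classify` are hypothetical helper names.

```python
import math
from collections import Counter

# Invented training data, for illustration only
TRAIN = [
    ('a beautiful and moving film', 'pos'),
    ('great acting and a beautiful score', 'pos'),
    ('bad plot and bad acting', 'neg'),
    ('a dull film with bad weather scenes', 'neg'),
]

def train(examples):
    """Count words per class and documents per class."""
    word_counts = {'pos': Counter(), 'neg': Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts['pos']) | set(word_counts['neg'])
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return (label, probabilities) using multinomial Naive Bayes."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in word_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                # Laplace (add-one) smoothing avoids zero probabilities
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalize the log scores into probabilities that sum to 1
    top = max(log_scores.values())
    exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
    norm = sum(exp_scores.values())
    probs = {k: v / norm for k, v in exp_scores.items()}
    return max(probs, key=probs.get), probs

model = train(TRAIN)
label, probs = classify('bad weather tomorrow', model)
print(label, probs)
```

With so little training data the probabilities are extreme. The near 50/50 split in the real output above reflects a sentence whose words give the trained model little evidence either way.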

## Determining readability 
Textatistic computes various readability scores for a text. Installation notes:
- documentation http://www.erinhengel.com/software/textatistic/
- use the GitHub repo https://github.com/erinhengel/Textatistic
- unpack it and run python setup.py install
- if/when it fails on Windows, download the Visual Studio Build Tools 2019 from https://visualstudio.microsoft.com/downloads/
- a reboot will be needed afterwards
- the install is probably different on a Mac

To learn about specific measures see, for example, https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests

In [23]:
!pip install Textatistic



In [24]:
from pathlib import Path

In [25]:
doc1 = (Path(r'./data/RomeoAndJuliet.txt').read_text())

In [26]:
from textatistic import Textatistic

In [27]:
reading_level = Textatistic(doc1)

In [28]:
reading_level.dict()

{'char_count': 133715,
 'word_count': 29279,
 'sent_count': 3403,
 'sybl_count': 34830,
 'notdalechall_count': 6963,
 'polysyblword_count': 906,
 'flesch_score': 97.46276509546402,
 'fleschkincaid_score': 1.8026725219007993,
 'gunningfog_score': 4.679298762961586,
 'smog_score': 6.0767645611351515,
 'dalechall_score': 7.818359126732924}

In [29]:
%precision 3 

'%.3f'

In [30]:
reading_level.dict()

{'char_count': 133715,
 'word_count': 29279,
 'sent_count': 3403,
 'sybl_count': 34830,
 'notdalechall_count': 6963,
 'polysyblword_count': 906,
 'flesch_score': 97.463,
 'fleschkincaid_score': 1.803,
 'gunningfog_score': 4.679,
 'smog_score': 6.077,
 'dalechall_score': 7.818}
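The scores are all simple functions of the counts in the same dictionary. The sketch below applies the commonly published formulas for each measure to the counts shown above. That Textatistic computes them in exactly this way is an assumption, although the results agree with the output above to about three decimal places. The function name `readability` is made up for this example.

```python
import math

# Counts copied from the reading_level.dict() output above
counts = {
    'word_count': 29279,
    'sent_count': 3403,
    'sybl_count': 34830,
    'notdalechall_count': 6963,
    'polysyblword_count': 906,
}

def readability(c):
    """Apply the standard published readability formulas to raw counts."""
    words_per_sent = c['word_count'] / c['sent_count']
    sybl_per_word = c['sybl_count'] / c['word_count']
    pct_polysybl = 100 * c['polysyblword_count'] / c['word_count']
    pct_hard = 100 * c['notdalechall_count'] / c['word_count']

    # Dale-Chall adds a constant when more than 5% of words are "difficult"
    dalechall = 0.1579 * pct_hard + 0.0496 * words_per_sent
    if pct_hard > 5:
        dalechall += 3.6365

    return {
        'flesch_score': 206.835 - 1.015 * words_per_sent - 84.6 * sybl_per_word,
        'fleschkincaid_score': 0.39 * words_per_sent + 11.8 * sybl_per_word - 15.59,
        'gunningfog_score': 0.4 * (words_per_sent + pct_polysybl),
        'smog_score': 1.043 * math.sqrt(30 * c['polysyblword_count'] / c['sent_count']) + 3.1291,
        'dalechall_score': dalechall,
    }

scores = readability(counts)
for name, value in scores.items():
    print(f'{name}: {value:.3f}')
```

Short sentences (about 8.6 words each) and few syllables per word (about 1.19) are what push the Flesch score so high and the grade-level scores so low for this play.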

## Similarity checking in documents 
spaCy is a high-performance NLP package.
- documentation https://spacy.io/
- github https://github.com/explosion/spaCy
- creating the symbolic link for the model shortcut often fails with an error like the one below:
Error: Couldn't link model to 'en_core_web_sm'
    Creating a symlink in spacy/data failed. Make sure you have the required
    permissions and try re-running the command as admin, or use a
    virtualenv. You can still import the model as a module and call its
    load() method, or create the symlink manually.

    C:\Users\i080272\AppData\Local\Continuum\anaconda3\envs\pd\lib\site-packages\en_core_web_sm
    -->
    C:\Users\i080272\AppData\Local\Continuum\anaconda3\envs\pd\lib\site-packages\spacy\data\en_core_web_sm
- See https://github.com/explosion/spaCy/issues/1283 
- This can also be a permissions problem, in which case run the Anaconda prompt as admin and then python -m spacy download en

spaCy is used here to compare two texts for similarity, which can hint at authorship. The case below compares Shakespeare's Romeo and Juliet with Christopher Marlowe's Edward the Second, both taken from Deitel, which took them from Project Gutenberg.

In [31]:
import spacy

In [None]:
#!python -m spacy download en  # needs admin rights to work; the steps below depend on this being set up correctly

In [32]:
nlp = spacy.load('en')

In [None]:
#!python -m spacy download en_core_web_sm

In [None]:
# From Stack Overflow we learn that 'en' is only a symbolic link to the model package directory, so we can also import the model explicitly (import en_core_web_sm, then call en_core_web_sm.load())

In [33]:
from pathlib import Path

In [40]:
doc1 = nlp(Path('./data/RomeoAndJuliet.txt').read_text())

In [41]:
doc2 = nlp(Path('./data/EdwardTheSecond.txt').read_text())

In [37]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [38]:
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]
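Stop words like the ones listed above are typically filtered out before counting or comparing words. A minimal sketch of that filtering step follows. `STOP_SUBSET` is a small hand-picked subset of the list above so that the block runs without NLTK, and `remove_stop_words` is a hypothetical helper name.

```python
# Hand-picked subset of the NLTK stop words shown above (assumption: enough for this demo)
STOP_SUBSET = {'is', 'a', 'the', 'and', 'of', 'to', 'in'}

def remove_stop_words(text, stop_words):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (word.strip('.,;:!?"\'').lower() for word in text.split())
    return [t for t in tokens if t and t not in stop_words]

filtered = remove_stop_words('Today is a beautiful day.', STOP_SUBSET)
print(filtered)
# ['today', 'beautiful', 'day']
```

With the real list, passing `set(stop_words)` instead of the list keeps each membership check fast on long texts.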

In [42]:
doc1.similarity(doc2)

0.9470151584521619
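According to the spaCy documentation, the default similarity estimate is the cosine similarity of the documents' averaged word vectors. The sketch below shows that calculation on tiny made-up vectors. The three-dimensional "word vectors" are assumptions for illustration (real vectors have hundreds of dimensions and come from the loaded model), and `mean_vector` and `cosine_similarity` are hypothetical helper names.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up word vectors for two tiny "documents"
doc_a_words = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
doc_b_words = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

similarity = cosine_similarity(mean_vector(doc_a_words), mean_vector(doc_b_words))
print(round(similarity, 4))
```

Because averaging smooths out individual words, long documents in the same language and genre tend to score high, so a value such as the one above is best read relative to other document pairs and not as proof of common authorship.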