### What is NLP?

**Natural Language Processing (NLP)** is the process of analysing natural language (that is, the way we humans speak and write) and extracting meaningful insights from it. NLP has become a very popular area of interest thanks to the growing volume of natural-language data and to developments in Information Extraction (IE) methodologies.

### Sources of Natural Language

* Social Media (like FB Posts/Comments, Twitter Tweets, YouTube Comments)
* Speech Transcripts (Call Center Conversations) 
* Voice Agents (Amazon Echo, Google Home, Apple Siri) 

### Some Applications of NLP

* Automated Customer Service 
* Chatbots
* Social Listening
* Market Trends and much more

### About this Dataset

This dataset contains tweets tagged **#JustDoIt** that were posted after **Nike** released its ad campaign with Colin Kaepernick, which turned controversial.

<img src="https://www.thenation.com/wp-content/uploads/2018/09/Kaepernick-Nike-Ad-sg-img.jpg" alt="drawing" width="400"/>

### About spaCy:

spaCy by [explosion.ai](https://explosion.ai/) is a library for advanced **Natural Language Processing** in Python and Cython.
spaCy comes with
*pre-trained statistical models* and word
vectors, and currently supports tokenization for **20+ languages**. It features
the **fastest syntactic parser** in the world, convolutional **neural network models**
for tagging, parsing and **named entity recognition** and easy **deep learning**
integration. It's commercial open-source software, released under the MIT license.

### About this Kernel:

In this Kernel, we will learn how to use *spaCy* in Python to perform a few common NLP tasks.

Let us begin our journey by loading the required libraries.

### Loading the required Libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
#import os
#print(os.listdir("../input"))
import spacy
import random 
from collections import Counter #for counting
import seaborn as sns #for visualization

As we have seen above, *spaCy* comes with pre-trained language models. Since our tweets are predominantly English, let us load the English model `en_core_web_sm` using the following code:

### Loading spaCy English Model

In [3]:
import spacy
nlp = spacy.load("en_core_web_sm")

Please note that you can download other language models by running a command like the one below in your shell or terminal

`python -m spacy download en_core_web_sm` 

and then loading the model using `spacy.load()`. The last argument in the above command is the name of the language model to be downloaded.

Now that our model is successfully loaded into `nlp`, let us read our input data using `read_csv()` of `pandas`. 

### Reading input file - Tweets

In [5]:
tweets = pd.read_csv(r"D:\avinash\Capstone-Projects\NLP\4.Twetter Sentimental Analysis\tweets.csv")
tweets.head()

Unnamed: 0,tweet_contributors,tweet_coordinates,tweet_created_at,tweet_display_text_range,tweet_entities,tweet_extended_entities,tweet_favorite_count,tweet_favorited,tweet_full_text,tweet_geo,...,user_profile_text_color,user_profile_use_background_image,user_protected,user_screen_name,user_statuses_count,user_time_zone,user_translator_type,user_url,user_utc_offset,user_verified
0,,,Fri Sep 07 16:25:06 +0000 2018,"[0, 75]","{'hashtags': [{'text': 'quote', 'indices': [47...","{'media': [{'id': 1038100853872197632, 'id_str...",0,False,Done is better than perfect. — Sheryl Sandberg...,,...,333333,True,False,UltraYOUwoman,91870.0,,none,https://t.co/jGlJswxjwS,,False
1,,,Fri Sep 07 16:24:59 +0000 2018,"[0, 237]","{'hashtags': [{'text': 'hero', 'indices': [90,...",,0,False,Shout out to the Great Fire Department and the...,,...,333333,True,False,yungcutup,618822.0,,none,http://t.co/lVm8vfDbfO,,False
2,,,Fri Sep 07 16:24:50 +0000 2018,"[0, 176]","{'hashtags': [{'text': 'JustDoIt', 'indices': ...","{'media': [{'id': 1038100773396041728, 'id_str...",0,False,There are some AMAZINGLY hilarious Nike Ad mem...,,...,50505,True,False,rachelbogle,48075.0,,none,https://t.co/g9exqgZp9x,,True
3,,"{'type': 'Point', 'coordinates': [-86.45594032...",Fri Sep 07 16:24:44 +0000 2018,"[0, 88]","{'hashtags': [{'text': 'kapernickeffect', 'ind...",,0,False,#kapernickeffect #swoosh #justdoit @ Lucas Bis...,"{'type': 'Point', 'coordinates': [35.85402047,...",...,333333,True,False,ErvGotti609,1983.0,,none,,,False
4,,,Fri Sep 07 16:24:39 +0000 2018,"[0, 132]","{'hashtags': [{'text': 'shaquem', 'indices': [...","{'media': [{'id': 1038100736595255296, 'id_str...",0,False,"One Hand, One Dream: The Shaquem Griffin Story...",,...,1A1E1A,True,False,NoLuckNeeded,24989.0,,none,http://t.co/MMGF9RfLz0,,False


As with any dataset, let us do a few basics like understanding the shape (dimension) of the dataset and then see a sample row. 

### Dimension of the input file

In [6]:
tweets.shape

(5089, 72)

### Sample Row

In [7]:
tweets.head(1)

Unnamed: 0,tweet_contributors,tweet_coordinates,tweet_created_at,tweet_display_text_range,tweet_entities,tweet_extended_entities,tweet_favorite_count,tweet_favorited,tweet_full_text,tweet_geo,...,user_profile_text_color,user_profile_use_background_image,user_protected,user_screen_name,user_statuses_count,user_time_zone,user_translator_type,user_url,user_utc_offset,user_verified
0,,,Fri Sep 07 16:25:06 +0000 2018,"[0, 75]","{'hashtags': [{'text': 'quote', 'indices': [47...","{'media': [{'id': 1038100853872197632, 'id_str...",0,False,Done is better than perfect. — Sheryl Sandberg...,,...,333333,True,False,UltraYOUwoman,91870.0,,none,https://t.co/jGlJswxjwS,,False


Now that we know `tweet_full_text` is the column name in which tweets are stored, let us print some sample tweets.

### Sample Tweets Text

For simplicity, let us take a random sample of 10 tweets, drawn from rows 1 to 99.

In [8]:
random.seed(888)
text = tweets.tweet_full_text[random.sample(range(1,100),10)]
text

11    Colin Kaepernick's business partner @Nike send...
56    This is why Colin kneels.  We all should kneel...
75    @Nike is aligning itself with the core values ...
57    Sounds like a plan! #JustDoIt #FirstAmendment ...
64    @washingtonpost Thank you #Kap #JustDoIt #Nike...
50    Invest in #Mojo50?\n\n#JustDoIt \n\n@DocThomps...
81    Owned Yet, Libs? https://t.co/D7I86zTfL7 #Nike...
48    If you work hard, limitless and focus on your ...
70    @JWKeady @Kaepernick7 @KillerMike @tmorello #N...
54    Create Your Own Nike Just Do It Colin Kaeperni...
Name: tweet_full_text, dtype: object

### Annotation:

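Let us begin our NLP journey with linguistic annotation, which means marking every word with its linguistic type, such as NOUN or VERB. This helps us give grammatical labels to our text corpus. The function `nlp()` accepts only a string, so let us use `str()` to turn the sampled rows above into one long string. Note that `str()` on a pandas Series returns its printed representation, so the string also contains the index labels (11, 56 and so on), text truncated with `...`, and the trailing `Name: tweet_full_text, dtype: object` line. These show up in the outputs that follow.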

In [12]:
text_combined = str(text)
text_combined

"11    Colin Kaepernick's business partner @Nike send...\n56    This is why Colin kneels.  We all should kneel...\n75    @Nike is aligning itself with the core values ...\n57    Sounds like a plan! #JustDoIt #FirstAmendment ...\n64    @washingtonpost Thank you #Kap #JustDoIt #Nike...\n50    Invest in #Mojo50?\\n\\n#JustDoIt \\n\\n@DocThomps...\n81    Owned Yet, Libs? https://t.co/D7I86zTfL7 #Nike...\n48    If you work hard, limitless and focus on your ...\n70    @JWKeady @Kaepernick7 @KillerMike @tmorello #N...\n54    Create Your Own Nike Just Do It Colin Kaeperni...\nName: tweet_full_text, dtype: object"

In [11]:
doc = nlp(text_combined)
doc

11    Colin Kaepernick's business partner @Nike send...
56    This is why Colin kneels.  We all should kneel...
75    @Nike is aligning itself with the core values ...
57    Sounds like a plan! #JustDoIt #FirstAmendment ...
64    @washingtonpost Thank you #Kap #JustDoIt #Nike...
50    Invest in #Mojo50?\n\n#JustDoIt \n\n@DocThomps...
81    Owned Yet, Libs? https://t.co/D7I86zTfL7 #Nike...
48    If you work hard, limitless and focus on your ...
70    @JWKeady @Kaepernick7 @KillerMike @tmorello #N...
54    Create Your Own Nike Just Do It Colin Kaeperni...
Name: tweet_full_text, dtype: object
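One caveat: `str(text)` returns the printed form of the Series, so index labels and truncated text end up inside `doc`. As a minimal sketch of an alternative, assuming we want the full text of every sampled tweet, the helper below joins the raw strings instead. The name `combine_texts` is hypothetical (it is not part of pandas or spaCy), and the sample strings are shortened placeholders; with the real data one would pass `text` itself.

```python
def combine_texts(texts):
    """Join tweet texts into one string, one tweet per line."""
    # Keep only real strings; this skips missing values such as NaN.
    return "\n".join(t for t in texts if isinstance(t, str))


# Placeholder tweets (shortened); with the real data use: combine_texts(text)
sample = [
    "This is why Colin kneels.",
    "Sounds like a plan! #JustDoIt",
    float("nan"),  # a missing value, as pandas would represent it
]

text_combined = combine_texts(sample)
print(text_combined)
```

The outputs in the rest of this kernel still come from the `str(text)` version.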

### Tokenization 

`doc` is the annotated text produced by the loaded language model. Tokenization has already been done as part of that process, so we can now print the **tokens**. As a naive description, tokenization is nothing but breaking long sentences or a text corpus into small units called **tokens** (mostly words and punctuation marks).

In [13]:
for token in doc:
    print(token)

11
   
Colin
Kaepernick
's
business
partner
@Nike
send
...


56
   
This
is
why
Colin
kneels
.
 
We
all
should
kneel
...


75
   
@Nike
is
aligning
itself
with
the
core
values
...


57
   
Sounds
like
a
plan
!
#
JustDoIt
#
FirstAmendment
...


64
   
@washingtonpost
Thank
you
#
Kap
#
JustDoIt
#
Nike
...


50
   
Invest
in
#
Mojo50?\n\n#JustDoIt
\n\n@DocThomps
...


81
   
Owned
Yet
,
Libs
?
https://t.co/D7I86zTfL7
#
Nike
...


48
   
If
you
work
hard
,
limitless
and
focus
on
your
...


70
   
@JWKeady
@Kaepernick7
@KillerMike
@tmorello
#
N
...


54
   
Create
Your
Own
Nike
Just
Do
It
Colin
Kaeperni
...


Name
:
tweet_full_text
,
dtype
:
object


Since we have already done the annotation, let us print our tokens with their part-of-speech (POS) tags.

In [14]:
for token in doc:
    print(token.text, token.pos_)

11 NUM
    SPACE
Colin PROPN
Kaepernick PROPN
's PART
business NOUN
partner NOUN
@Nike AUX
send VERB
... PUNCT

 SPACE
56 NUM
    SPACE
This PRON
is AUX
why SCONJ
Colin PROPN
kneels NOUN
. PUNCT
  SPACE
We PRON
all PRON
should AUX
kneel VERB
... PUNCT

 SPACE
75 NUM
    SPACE
@Nike ADP
is AUX
aligning VERB
itself PRON
with ADP
the DET
core NOUN
values NOUN
... PUNCT

 SPACE
57 NUM
    SPACE
Sounds VERB
like ADP
a DET
plan NOUN
! PUNCT
# SYM
JustDoIt PROPN
# SYM
FirstAmendment PROPN
... PUNCT

 SPACE
64 NUM
    SPACE
@washingtonpost NOUN
Thank VERB
you PRON
# SYM
Kap PROPN
# SYM
JustDoIt PROPN
# SYM
Nike NOUN
... PUNCT

 SPACE
50 NUM
    SPACE
Invest PROPN
in ADP
# SYM
Mojo50?\n\n#JustDoIt NOUN
\n\n@DocThomps NOUN
... PUNCT

 SPACE
81 NUM
    SPACE
Owned VERB
Yet ADV
, PUNCT
Libs PROPN
? PUNCT
https://t.co/D7I86zTfL7 PROPN
# NOUN
Nike NOUN
... PUNCT

 SPACE
48 NUM
    SPACE
If SCONJ
you PRON
work VERB
hard ADV
, PUNCT
limitless VERB
and CCONJ
focus VERB
on ADP
your PRON
... PUNCT

 SPAC

That's good, we've got the tokens and their respective POS tags. Perhaps we don't want to see everything, just the nouns. spaCy's `doc.noun_chunks` returns the noun phrases in the text, that is, each noun together with the words that describe it:

In [15]:
nouns = list(doc.noun_chunks)
nouns

[11    Colin Kaepernick's business partner,
 This,
 We,
 all,
 itself,
 the core values,
 a plan,
 #JustDoIt,
 64    @washingtonpost,
 you,
 Mojo50?\n\n#JustDoIt \n\n@DocThomps,
 https://t.co/D7I86zTfL7,
 Nike,
 you,
 your ...
 70    @JWKeady @Kaepernick7,
 @tmorello,
 #N,
 Your Own Nike,
 It,
 Name,
 tweet_full_text,
 dtype,
 object]
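`doc.noun_chunks` yields noun phrases rather than single nouns. If only the tokens tagged `NOUN` are wanted, filtering on `token.pos_` is one option. The sketch below assumes nothing beyond objects that have `.text` and `.pos_` attributes, which spaCy tokens do. The helper `tokens_with_pos` and the `Token` stand-in are hypothetical names; the stand-in copies a few text/tag pairs from the POS output above so that the snippet runs without a model.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: just the two attributes we need.
Token = namedtuple("Token", ["text", "pos_"])


def tokens_with_pos(doc, wanted):
    """Return the text of every token whose coarse POS tag is in `wanted`."""
    return [token.text for token in doc if token.pos_ in wanted]


# Text/tag pairs taken from the POS output above.
sample_doc = [
    Token("Colin", "PROPN"),
    Token("Kaepernick", "PROPN"),
    Token("'s", "PART"),
    Token("business", "NOUN"),
    Token("partner", "NOUN"),
    Token("@Nike", "AUX"),
    Token("send", "VERB"),
]

nouns_only = tokens_with_pos(sample_doc, {"NOUN"})
print(nouns_only)  # ['business', 'partner']
```

With the real data the call would be `tokens_with_pos(doc, {"NOUN"})`; adding `"PROPN"` to the set would also keep proper nouns such as *Colin*.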

Sometimes we might need to tokenize based on sentences. Let's say we've got a chat transcript from customer service; in that case we need to split the transcript into sentences.

In [16]:
list(doc.sents)

[11    Colin Kaepernick's business partner @Nike send...
 56    ,
 This is why Colin kneels.  ,
 We all should kneel...
 75    @Nike is aligning itself with the core values ...
 57    Sounds like a plan!,
 #JustDoIt #FirstAmendment ...
 64    @washingtonpost Thank you #Kap #JustDoIt #Nike...
 50    Invest in #Mojo50?\n\n#JustDoIt \n\n@DocThomps...
 81    Owned,
 Yet, Libs?,
 https://t.co/D7I86zTfL7 #,
 Nike...
 48    If you work hard, limitless and focus on your ...
 70    @JWKeady @Kaepernick7 @KillerMike @tmorello #N...
 54    Create Your Own Nike,
 Just Do It Colin Kaeperni...,
 Name: tweet_full_text, dtype: object]

### Named Entity Recognition (NER)

NER is the process of extracting named entities such as Person, Organization and Location from our text corpus. spaCy also has a built-in visualizer, `displacy`, that lets us visualize our text with its named entities. We can print the named entities using the following code:

In [17]:
for ent in doc.ents:
    print(ent.text,ent.label_)

11 CARDINAL
Colin Kaepernick's PERSON
56 CARDINAL
Colin kneels PERSON
75 CARDINAL
57 CARDINAL
#JustDoIt # MONEY
64 CARDINAL
#Kap MONEY
#JustDoIt # MONEY
50 CARDINAL
81 CARDINAL
Nike ORG
48 CARDINAL
70 CARDINAL
@KillerMike ORG
54 CARDINAL
Colin Kaeperni PERSON


**spaCy** also lets us visualize named entities along with their labels.

In [18]:
spacy.displacy.render(doc, style='ent',jupyter=True)

### Lemmatization

Lemmatization is the process of retrieving the base form (the lemma) of a word. It is an essential step in NLP for bringing different variants of a single word to one root word.

In [19]:
for token in doc:
    print(token.text, token.lemma_)

11 11
       
Colin Colin
Kaepernick Kaepernick
's 's
business business
partner partner
@Nike @nike
send send
... ...

 

56 56
       
This this
is be
why why
Colin Colin
kneels kneel
. .
   
We we
all all
should should
kneel kneel
... ...

 

75 75
       
@Nike @nike
is be
aligning align
itself itself
with with
the the
core core
values value
... ...

 

57 57
       
Sounds sound
like like
a a
plan plan
! !
# #
JustDoIt JustDoIt
# #
FirstAmendment FirstAmendment
... ...

 

64 64
       
@washingtonpost @washingtonpost
Thank thank
you you
# #
Kap Kap
# #
JustDoIt JustDoIt
# #
Nike Nike
... ...

 

50 50
       
Invest Invest
in in
# #
Mojo50?\n\n#JustDoIt mojo50?\n\n#justdoit
\n\n@DocThomps \n\n@docthomps
... ...

 

81 81
       
Owned own
Yet yet
, ,
Libs Libs
? ?
https://t.co/D7I86zTfL7 https://t.co/D7I86zTfL7
# #
Nike Nike
... ...

 

48 48
       
If if
you you
work work
hard hard
, ,
limitless limitless
and and
focus focus
on on
your your
... ...

 

70 70
       
@JWKeady @jw

As you can see in the above output, words like *aligning* and *values* have been converted to their root words *align* and *value*. 
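To see why this matters, here is a small sketch that counts words before and after lemmatization. The `(text, lemma)` pairs are copied from the output above as stand-ins, so the snippet runs without a model; with a real `doc` one could build the same counters from `token.text` and `token.lemma_`.

```python
from collections import Counter

# (token text, lemma) pairs taken from the lemmatization output above.
pairs = [
    ("is", "be"),
    ("kneels", "kneel"),
    ("kneel", "kneel"),
    ("is", "be"),
    ("aligning", "align"),
    ("values", "value"),
]

# Counting surface forms keeps "kneels" and "kneel" apart...
surface_counts = Counter(text for text, _ in pairs)
# ...while counting lemmas merges them into a single entry.
lemma_counts = Counter(lemma for _, lemma in pairs)

print(surface_counts["kneels"], surface_counts["kneel"])  # 1 1
print(lemma_counts["kneel"])  # 2
```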

### Dependency Parser Visualization

In [20]:
spacy.displacy.render(doc, style='dep',jupyter=True)

### WIP! We'll get into more concepts soon!

Also, please note that text cleaning is a vital preprocessing step for anything NLP, and this kernel hasn't addressed it.
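As a rough idea of what such cleaning could look like for tweets, here is a minimal sketch using only the standard `re` module. The function name `clean_tweet` and the choice of what to strip (URLs, @mentions, the `#` symbol) are assumptions made for illustration, not a recommended recipe, and the example tweet is made up from pieces seen in the samples above.

```python
import re

URL_RE = re.compile(r"https?://\S+")  # http/https links
MENTION_RE = re.compile(r"@\w+")      # @username mentions
WHITESPACE_RE = re.compile(r"\s+")    # any run of whitespace


def clean_tweet(text):
    """Strip URLs, @mentions and '#' symbols, then tidy the whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")  # keep the hashtag word, drop the symbol
    return WHITESPACE_RE.sub(" ", text).strip()


example = "Sounds like a plan!  #JustDoIt @Nike https://t.co/D7I86zTfL7"
cleaned = clean_tweet(example)
print(cleaned)  # Sounds like a plan! JustDoIt
```

Whether mentions and hashtags should be removed depends on the task: for entity recognition, for instance, a mention like `@Nike` may be worth keeping.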