# Module 6
## Vader Sentiment Analysis

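The compound score is the sum of the lexicon ratings of all words in the text, normalized to the range -1 (most negative) to +1 (most positive). The usual thresholds are:  
positive sentiment: compound >= 0.05  
neutral sentiment: -0.05 < compound < 0.05  
negative sentiment: compound <= -0.05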

In [22]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

In [23]:
sentence = "The car is cool."
analyzer.polarity_scores(sentence)


{'neg': 0.0, 'neu': 0.566, 'pos': 0.434, 'compound': 0.3182}
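
As a rough sketch of where a compound value like the one above comes from: the summed word valences are squashed into the range (-1, 1) and then compared against the thresholds. The snippet assumes the normalization `x / sqrt(x^2 + alpha)` with `alpha = 15` and cut-offs of ±0.05; the helper names `normalize` and `classify` and the raw sum of 1.3 are made up for illustration.

```python
import math

def normalize(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

def classify(compound, threshold=0.05):
    """Map a compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

compound = normalize(1.3)  # 1.3 is an assumed raw valence sum
print(round(compound, 4), classify(compound))
```

Under these assumptions a raw sum of 1.3 lands at roughly 0.318, in line with the compound value shown for "The car is cool.".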

### Vader Key Points
Beyond the lexicon ratings, VADER adjusts the intensity based on these cues:
- punctuation: `the car is cool!!`
- capitalization: `The car is COOL!!`
- degree modifiers: `The car is a bit Cool.`
- conjunctions: `The car is cool, but small.`

### Tri-gram
VADER checks the tri-gram (the three words) preceding a sentiment-laden lexical feature for modifiers, which can have a huge impact on the sentiment.  
Such a modifier usually increases the negativity,  
for example `not that`:

"your hotel service is NOT THAT great!"

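The tri-gram check can be sketched as a toy function: look at the three tokens before a sentiment word and, if one of them is a negation, flip and dampen the word's valence. This is a simplified illustration, not VADER's implementation; the negation list, the valence of 3.1 for "great", the flip factor of -0.74, and the name `apply_negation` are all assumptions.

```python
NEGATIONS = {"not", "never", "no", "without"}  # tiny illustrative list

def apply_negation(tokens, index, valence, flip=-0.74):
    """Flip and dampen `valence` if a negation occurs in the three tokens before `index`."""
    window = [t.lower() for t in tokens[max(0, index - 3):index]]
    if any(t in NEGATIONS for t in window):
        return valence * flip
    return valence

tokens = "your hotel service is NOT THAT great".split()
great_index = tokens.index("great")

# "NOT" sits inside the three-token window before "great", so the sign flips
print(apply_negation(tokens, great_index, valence=3.1))
```

Without a negation in the window, the function returns the valence unchanged.
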
# Module 7
## Chunking (Shallow Parsing)
Chunking breaks a sentence into small, meaningful groups of words, such as noun phrases.

In [1]:
from textblob import TextBlob
mystring = "John found a new coach and a new bed in his new apartment."
output = TextBlob(mystring)
output.tags

[('John', 'NNP'),
 ('found', 'VBD'),
 ('a', 'DT'),
 ('new', 'JJ'),
 ('coach', 'NN'),
 ('and', 'CC'),
 ('a', 'DT'),
 ('new', 'JJ'),
 ('bed', 'NN'),
 ('in', 'IN'),
 ('his', 'PRP$'),
 ('new', 'JJ'),
 ('apartment', 'NN')]

Parse the POS tags with a regular-expression grammar to extract patterns:

In [9]:
import nltk
# NP chunk: an optional determiner, zero or more adjectives, then one noun
regex = "NP: {<DT>?<JJ>*<NN>}"
rp = nltk.chunk.RegexpParser(regex)  # regular-expression chunk parser

In [10]:
output = rp.parse(output.tags)
print(output)

(S
  John/NNP
  found/VBD
  (NP a/DT new/JJ coach/NN)
  and/CC
  (NP a/DT new/JJ bed/NN)
  in/IN
  his/PRP$
  (NP new/JJ apartment/NN))


In [12]:
output.draw()
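
Besides drawing the tree, the chunks can be read out programmatically. The sketch below rebuilds the parse from the tagged tokens shown earlier (hard-coded here so it runs without TextBlob) and collects the text of every `NP` subtree; the variable names are illustrative.

```python
import nltk

# POS-tagged tokens copied from the TextBlob output earlier in this module
tagged = [('John', 'NNP'), ('found', 'VBD'), ('a', 'DT'), ('new', 'JJ'),
          ('coach', 'NN'), ('and', 'CC'), ('a', 'DT'), ('new', 'JJ'),
          ('bed', 'NN'), ('in', 'IN'), ('his', 'PRP$'), ('new', 'JJ'),
          ('apartment', 'NN')]

parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = parser.parse(tagged)

# Keep only the NP subtrees and join their words back into phrases
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```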

## Chinking
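Chinking removes unwanted words from a chunk, that is, it removes a chink from a chunk.

Removing is different from deleting: the chinked tokens stay in the sentence, they just no longer belong to a chunk.

Chunk patterns are written inside `{ }`; chink patterns are written inside `} {`.

- If the matching sequence of tokens spans an entire chunk, the whole chunk is removed (when a chunk and a chink match exactly the same tokens, the chink wins).
- If the matching sequence is in the middle of a chunk, those tokens are removed and two smaller chunks are left.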

In [19]:
string2="the little yellow dog barked at the cat."
output= TextBlob(string2)
output.tags

[('the', 'DT'),
 ('little', 'JJ'),
 ('yellow', 'JJ'),
 ('dog', 'NN'),
 ('barked', 'VBD'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('cat', 'NN')]

In [20]:
regex2=r"""NP:
{<.*>+}       # chunk everything
}<VBD|IN>+{   # chink sequences of VBD and IN
"""

In [21]:
rp = nltk.chunk.RegexpParser(regex2)
output=rp.parse(output.tags)
print(output)

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
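
The output above shows a chink in the middle of a chunk, which leaves two smaller chunks. The other case, a chink that covers an entire chunk, might look like the sketch below. The grammar is a made-up variation for illustration: it first chunks noun phrases and then chinks any bare determiner + noun sequence, so `the cat` loses its chunk entirely while `the little yellow dog` is untouched.

```python
import nltk

tagged = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
          ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]

grammar = r"""NP:
{<DT>?<JJ>*<NN>}   # chunk determiner + adjectives + noun
}<DT><NN>{         # chink a bare determiner + noun sequence
"""
tree = nltk.RegexpParser(grammar).parse(tagged)
print(tree)

# Collect the tokens of the NP chunks that survived chinking
chunks = [subtree.leaves()
          for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]
```

The tokens `the` and `cat` are still in the tree; they simply sit outside any chunk.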


# Module 8 
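## Pipelines

A pipeline chains the preprocessing and modelling steps into a single estimator. This makes it simpler and cleaner to test different training setups, and the whole pipeline can be passed as an argument to a function.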

In [26]:
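from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split  # used by train_test below
from sklearn.pipeline import Pipeline
# import news dataset: `news` must expose `.data` (texts) and `.target` (labels)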

In [27]:
# Create a function that is going to receive pipeline and X, Y
def train_test(classifier, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=48)
    classifier.fit(X_train,y_train)
    print("classifier accuracy is", classifier.score(X_test,y_test))
    return classifier

In [None]:
# First pipeline: TF-IDF features fed into Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB

trial1 = Pipeline([("vectorizer",TfidfVectorizer()),
                  ("classifier", MultinomialNB())])
train_test(trial1, news.data, news.target)


In [None]:
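# Second pipeline: same as the first, but with English stop words removed
# Assumption: `stop_words` is NLTK's stopwords corpus (needs nltk.download('stopwords'))
from nltk.corpus import stopwords as stop_words

trial2 = Pipeline([('vectorizer',TfidfVectorizer(stop_words=stop_words.words('english'))),
                  ('classifier',MultinomialNB())])
train_test(trial2,news.data, news.target)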

In [None]:
# Third pipeline: lower the Naive Bayes smoothing parameter alpha
trial3 = Pipeline([('vectorizer', TfidfVectorizer(stop_words=stop_words.words('english'))),
                  ('classifier', MultinomialNB(alpha=0.05))])
train_test(trial3, news.data, news.target)

In [None]:
# Fourth pipeline: swap the classifier for a linear SVM
from sklearn import svm
trial4 = Pipeline([('vectorizer',TfidfVectorizer(stop_words=stop_words.words('english'))),
                  ('classifier', svm.LinearSVC())])
train_test(trial4, news.data, news.target)
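
Because every pipeline exposes the same `fit` / `score` interface, the trials can also be compared in a loop. The following is a minimal, self-contained sketch under stated assumptions: the twelve-sentence corpus and its labels are made up, scikit-learn's built-in `stop_words="english"` list stands in for the NLTK stop words, and the scores on such a tiny sample mean very little.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus: label 0 = sports, label 1 = cooking
texts = [
    "the team won the football match", "a late goal decided the game",
    "the striker scored twice", "the coach praised the players",
    "fans cheered in the stadium", "the league season starts soon",
    "simmer the sauce with garlic", "bake the bread until golden",
    "chop the onions and fry them", "season the soup with salt",
    "whisk the eggs into the flour", "roast the vegetables in the oven",
]
labels = [0] * 6 + [1] * 6

candidates = {
    "nb": Pipeline([("vectorizer", TfidfVectorizer()),
                    ("classifier", MultinomialNB())]),
    "nb_stop": Pipeline([("vectorizer", TfidfVectorizer(stop_words="english")),
                         ("classifier", MultinomialNB(alpha=0.05))]),
    "svm": Pipeline([("vectorizer", TfidfVectorizer(stop_words="english")),
                     ("classifier", LinearSVC())]),
}

# Stratify so both classes appear in the training split
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=48, stratify=labels)

scores = {}
for name, pipeline in candidates.items():
    pipeline.fit(X_train, y_train)                    # vectorize + train in one call
    scores[name] = pipeline.score(X_test, y_test)     # accuracy on held-out texts
    print(name, scores[name])
```

Keeping the candidates in a dictionary means adding a new trial is a one-line change.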



# Module 10  
## Word2Vec

Word2Vec is used to generate vector representations of words. It is a two-layer neural network that takes text as input and outputs a vector for each word.  
The result is also called a word embedding.  
The vector of each word is a semantic representation of how the word is used in context.


Representing words as plain integers hurts training, because an integer ID carries no context or semantics. Word2Vec instead learns from the surrounding words, so words that appear in similar contexts are treated as similar.  
Adding dimensions that capture semantics improves how the relations between words are represented.

Therefore, a similar word will have a similar vector, which makes it easier to interpret.
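
To make "similar words have similar vectors" concrete, here is a small sketch using cosine similarity. The three-dimensional vectors are hand-made for illustration, not learned by any model, and the words are arbitrary.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical dense vectors: "car" and "truck" point in a similar direction
vectors = {
    "car": [0.9, 0.8, 0.1],
    "truck": [0.8, 0.9, 0.2],
    "rain": [0.1, 0.2, 0.9],
}

# Integer IDs, by contrast, only tell us whether two words are identical
word_ids = {"car": 0, "truck": 1, "rain": 2}

print(cosine_similarity(vectors["car"], vectors["truck"]))
print(cosine_similarity(vectors["car"], vectors["rain"]))
```

The first similarity comes out much higher than the second, which is the kind of relation an integer encoding cannot express.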


In [2]:
from nltk import sent_tokenize
from nltk import word_tokenize
text = "This is our first sentence. This is the second one, and one more."

sentences = sent_tokenize(text)
token = [word_tokenize(sentence) for sentence in sentences]

In [6]:
from gensim.models import Word2Vec
model = Word2Vec(token, min_count=1, vector_size=50)
# vector_size is the number of dimensions of each word vector; a bigger size requires more training data but can lead to better accuracy.
# min_count is the minimum number of occurrences a word needs to be included in the model.

In [7]:
print(model)

Word2Vec<vocab=12, vector_size=50, alpha=0.025>


In [11]:
words = list(model.wv.index_to_key)
words

['one',
 '.',
 'is',
 'This',
 'more',
 'and',
 ',',
 'second',
 'the',
 'sentence',
 'first',
 'our']

In [12]:
print(model.wv['sentence'])

[ 0.00805479  0.00869582  0.01991692 -0.00894846 -0.00277883 -0.01463625
 -0.01939779 -0.01816251 -0.00204573 -0.01300801  0.00970052 -0.01232941
  0.00503892  0.00147904 -0.00678505 -0.00195866  0.01996044  0.01829378
 -0.00892464  0.01816805 -0.01128476  0.01186315 -0.00619512  0.00686426
  0.00603512  0.01380244 -0.00474829  0.017552    0.01518052 -0.01909739
 -0.01601818 -0.01527747  0.00584716 -0.00559006 -0.01386056 -0.01625831
  0.01662018  0.00398141 -0.01865808 -0.00958648  0.00627417 -0.00942745
  0.01056284 -0.00846781  0.00528417 -0.01609314  0.01242113  0.00963884
  0.00157456  0.00602756]


In [13]:
model.save('wvmodel.bin')
model_load = Word2Vec.load('wvmodel.bin')
print(model_load)

Word2Vec<vocab=12, vector_size=50, alpha=0.025>


### Contextualized word embedding
Static word embeddings assign a single, fixed meaning to each word. But the meaning may change depending on the context, which is a problem for static embeddings. For example, if the embedding for the word "rain" is learned as the water that falls, and the word later appears as a verb, it will still be interpreted as the water and not as the verb.

This is why contextualized word embeddings came into the picture.
They create a vector for each word conditioned on its context.

The representation of each word is a function of the entire input sentence (that is, it is computed from the other words in the sentence).
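
The difference can be illustrated with a deliberately crude toy: a static lookup returns the same vector for "rain" in every sentence, while a function of the whole sentence does not. The blending rule below is only a stand-in to show the idea and is not how ELMo or BERT compute their vectors; all vectors and names are hypothetical.

```python
# Hypothetical 2-dimensional static vectors
static = {
    "the": [0.0, 0.0],
    "rain": [1.0, 0.0],
    "fell": [0.8, 0.2],
    "it": [0.0, 0.1],
    "will": [0.1, 0.9],
    "tomorrow": [0.2, 0.6],
}

def static_embedding(sentence, word):
    """Ignore the sentence: one fixed vector per word."""
    return static[word]

def toy_contextual_embedding(sentence, word, mix=0.5):
    """Blend the word's static vector with the mean vector of the sentence."""
    vectors = [static[w] for w in sentence]
    mean = [sum(dim) / len(vectors) for dim in zip(*vectors)]
    return [(1 - mix) * s + mix * m for s, m in zip(static[word], mean)]

noun_sentence = ["the", "rain", "fell"]
verb_sentence = ["it", "will", "rain", "tomorrow"]

print(static_embedding(noun_sentence, "rain"), static_embedding(verb_sentence, "rain"))
print(toy_contextual_embedding(noun_sentence, "rain"),
      toy_contextual_embedding(verb_sentence, "rain"))
```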

List of dynamic LMs / Contextualized Word Embeddings (CWE):
- BERT
- ELMo
- GPT
- Transformers
- CoVe
- ULMFiT


### ELMo
Embeddings from Language Models.
Provides pretrained models for downstream tasks.
Built on LSTMs, so each word's vector takes the surrounding words into account.

### BERT
Bidirectional Encoder Representations from Transformers, a milestone after ELMo.
Uses a pretrained Transformer language model.
Trained by masking 15% of the words; the task is to predict those words.
Trained on 800M words from BookCorpus and 2,500M words from Wikipedia.
Can be fine-tuned with our own data.
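
The masking step can be sketched in a few lines. This is a simplified illustration: it replaces every selected token with `[MASK]`, whereas the actual BERT recipe keeps or randomly swaps a fraction of the selected tokens. The function name `mask_tokens`, the seed, and the sample sentence are assumptions.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Hide about `mask_rate` of the tokens and return the prediction targets."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]   # the word the model has to predict
        masked[pos] = mask_token    # what the model actually sees
    return masked, labels

tokens = "john found a new coach and a new bed in his new apartment".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

With 13 tokens and a 15% rate, two positions are masked; which ones depends on the seed.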



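### GPT-2
Transformer-based model.
Can be used for any downstream task.
It uses more parameters than its predecessor (about 1,542M in its largest version).
Trained on WebText, about 40 GB of text from web pages (the original GPT was trained on BookCorpus, about 800M words).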




## Word2Vec use cases
- Analyzing surveys and verbatim comments in forums
- Recommendation systems

It finds complex relationships between the response being reviewed and the specific context within which the response was made.

These models work not only for text but also in other areas, such as recommendation systems.



