Medium-sized spaCy English model: python -m spacy download en_core_web_md

Large spaCy English model: python -m spacy download en_core_web_lg

Once the download finishes, you can load the package via spacy.load('en_core_web_lg')

Word2vec is a two-layer neural network that processes text. Its input is a text corpus, and its output is a set
of vectors: feature vectors for the words in that corpus.

The purpose of word2vec is to place the vectors of similar words close together in this vector space,
which means it can detect similarities between words mathematically.

Word2vec trains each word against the words that neighbor it in the input corpus.
It can do this in one of two ways. The first uses the context to predict a target word; this is known as the continuous bag of words (CBOW) approach.

The second uses a word to predict its target context; this is called skip-gram.

In the CBOW approach, you have several input words, and the projection tries to predict the highest-probability word given the context of those surrounding words.

The skip-gram method takes a little longer to train because it does the opposite:
given a single input word, the network's projection tries to output the weighted probabilities
of the other words that are likely to show up in the context of that input word.
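
To make the difference between the two approaches concrete, here is a minimal sketch of how training pairs could be built from a sentence. The function name training_pairs and the window parameter are made up for illustration; this only shows the shape of the data, not how any particular word2vec implementation trains.

```python
def training_pairs(tokens, window=2):
    """Build (context -> target) pairs for CBOW and (target -> context) pairs for skip-gram."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Every word inside the window except the target itself
        context = [tokens[j] for j in range(lo, hi) if j != i]
        cbow.append((context, target))                  # many words in, one word out
        skipgram.extend((target, c) for c in context)   # one word in, one neighbor out
    return cbow, skipgram

tokens = "the quick brown fox jumped".split()
cbow_pairs, skipgram_pairs = training_pairs(tokens, window=1)

print(cbow_pairs[1])       # (['the', 'brown'], 'quick')
print(skipgram_pairs[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

Skip-gram produces one training example per (word, neighbor) pair, so it ends up with more examples than CBOW for the same text, which fits with it taking longer to train.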

In [2]:
import numpy as np
import pandas as pd
import spacy

nlp = spacy.load('en_core_web_lg')

In [4]:
len(nlp.vocab)

764

In [5]:
nlp(u"Lion").vector

array([ 0.19375 , -0.50978 ,  1.3235  , -4.2232  ,  1.9153  ,  2.7848  ,
        3.5018  , -5.0287  ,  4.1574  ,  1.6299  , -4.1079  ,  4.4102  ,
       -2.1288  ,  2.4968  , -6.0555  , -0.13991 ,  0.12095 ,  2.2596  ,
        1.275   ,  3.067   , -1.482   ,  1.3716  ,  2.1392  ,  0.26788 ,
        0.90869 , -0.60697 , -4.805   ,  0.18177 , -3.3778  ,  0.2611  ,
       -1.1376  ,  0.86083 ,  1.1145  , -0.33503 ,  0.27146 , -0.56102 ,
        0.65067 ,  0.14716 , -3.9606  , -4.7924  , -0.72956 ,  2.0589  ,
       -0.55128 ,  3.675   , -1.3233  ,  0.17632 , -0.47537 , -1.4397  ,
        5.0616  , -4.8501  , -0.16462 ,  4.5376  ,  0.24597 ,  4.3179  ,
       -1.3986  ,  0.19962 , -3.1926  , -4.927   ,  0.57918 , -0.76117 ,
       -1.4544  , -0.81949 ,  0.27261 ,  3.798   , -1.317   , -0.84707 ,
       -3.4222  , -0.26315 ,  3.4899  , -6.5645  , -2.0544  ,  2.6137  ,
       -2.3301  ,  0.027578, -0.86062 ,  1.9059  , -0.025821,  0.75302 ,
        1.3491  ,  1.749   ,  3.7663  ,  0.47703 , 

In [6]:
nlp(u"Lion").vector.shape

(300,)

In [7]:
nlp(u"The quick brown fox jumped").vector.shape

(300,)

In [8]:
nlp(u"fox").vector.shape

(300,)
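
Note that the one-word and the five-word inputs both return a vector of shape (300,). By default, spaCy's Doc.vector is the average of the token vectors, so the length of the text does not change the shape. A rough sketch of that averaging with toy 3-dimensional vectors (mean_vector is a hypothetical helper, not a spaCy function):

```python
def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Two made-up "token vectors" standing in for real 300-dimensional ones
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]

doc_vector = mean_vector(token_vectors)
print(doc_vector)  # [2.0, 2.0, 1.0]
```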

In [9]:
tokens = nlp(u"lion cat pet")

In [11]:
for token1 in tokens:
    print(token1)

lion
cat
pet


In [13]:
for token1 in tokens:
    for token2 in tokens:
        print(token2)

lion
cat
pet
lion
cat
pet
lion
cat
pet


In [10]:
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

lion lion 1.0
lion cat 0.3854507803916931
lion pet 0.20031584799289703
cat lion 0.3854507803916931
cat cat 1.0
cat pet 0.732966423034668
pet lion 0.20031584799289703
pet cat 0.732966423034668
pet pet 1.0


In [14]:
tokens = nlp(u"like love hate")

In [15]:
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

like like 1.0
like love 0.5212638974189758
like hate 0.5065141320228577
love like 0.5212638974189758
love love 1.0
love hate 0.5708349943161011
hate like 0.5065141320228577
hate love 0.5708349943161011
hate hate 1.0
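
The similarity scores above are, by default, cosine similarities between the word vectors. Words that appear in similar contexts end up close together even when their meanings are opposite, which is one plausible reason love and hate score fairly high here. A minimal sketch of the calculation on toy vectors (cosine_sim is a name chosen for this example):

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_sim([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_sim([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```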


In [17]:
len(nlp.vocab.vectors)

514157

In [19]:
nlp.vocab.vectors.shape

(514157, 300)

In [22]:
tokens = nlp(u"dog cat solanki")

In [23]:
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)  # is_oov: out of vocabulary

dog True 75.254234 False
cat True 63.188496 False
solanki False 0.0 True


In [24]:
# Vector arithmetic
from scipy import spatial

In [25]:
def cosine_similarity(vec1, vec2):
    # scipy returns the cosine distance, so subtract it from 1 to get the similarity
    return 1 - spatial.distance.cosine(vec1, vec2)

In [26]:
king = nlp.vocab["king"].vector
man = nlp.vocab["man"].vector
woman = nlp.vocab["woman"].vector


In [34]:
# king - man + woman: the new vector is expected to be close to queen, princess, highness

new_vector = king-man+woman

In [35]:
computed_similarities = []

# For every lowercase, alphabetic word in the vocab that has a vector
for word in nlp.vocab:
    if word.has_vector and word.is_lower and word.is_alpha:
        similarity = cosine_similarity(new_vector, word.vector)
        computed_similarities.append((word, similarity))

In [36]:
# Sort from most similar to least similar
computed_similarities = sorted(computed_similarities, key=lambda item: item[1], reverse=True)

In [37]:
print([t[0].text for t in computed_similarities[:10]])

['king', 'and', 'that', 'where', 'she', 'they', 'woman', 'there', 'should', 'these']
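
The list above is dominated by common words rather than queen or princess. A likely cause is that iterating over nlp.vocab only visits the lexemes that have been loaded so far (len(nlp.vocab) was 764 earlier), not all 514157 rows of the vector table. One option that I believe searches the full table is nlp.vocab.vectors.most_similar, but that is not run here. The sketch below shows the same rank-by-cosine idea on a made-up 2-dimensional vocabulary (first axis roughly "royalty", second roughly "gender"), where the analogy works out exactly. The names toy_vectors and cosine_sim are illustrative only.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hand-made toy vectors, not real embeddings
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
}

# king - man + woman, computed component by component
new_vector = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]

# Rank every word in the toy vocabulary by similarity to the new vector
ranked = sorted(
    toy_vectors,
    key=lambda word: cosine_sim(new_vector, toy_vectors[word]),
    reverse=True,
)
print(new_vector)  # [1.0, -1.0]
print(ranked)      # ['queen', 'woman', 'king', 'man']
```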


VADER (Valence Aware Dictionary and sEntiment Reasoner) Sentiment Analysis with Python and NLTK

In [39]:
import nltk

In [41]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\manor\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [42]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [43]:
sid = SentimentIntensityAnalyzer()

In [44]:
a = "This is a good movie"

In [45]:
sid.polarity_scores(a)

{'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}

In [46]:
a = "This was the best, most awesome movie EVER MADE!!!"

In [47]:
sid.polarity_scores(a)

{'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'compound': 0.8877}

In [48]:
a = "This was the WORST movie that has ever disgraced the screen"

In [49]:
sid.polarity_scores(a)

{'neg': 0.465, 'neu': 0.535, 'pos': 0.0, 'compound': -0.8331}
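
The compound value is a normalized sum of the word valences, squeezed into the range -1 to 1. As far as I know, the reference VADER implementation normalizes with x / sqrt(x^2 + alpha) with alpha = 15. The sketch below assumes that formula, and assumes the lexicon rates "good" at 1.9 with every other word in "This is a good movie" neutral. Under those assumptions it reproduces the 0.4404 seen above.

```python
import math

def normalize(score, alpha=15):
    """Map an unbounded valence sum into [-1, 1] (assumed VADER-style normalization)."""
    norm = score / math.sqrt(score * score + alpha)
    return max(-1.0, min(1.0, norm))

# Assumed valence of "good" in the VADER lexicon; the rest of the sentence adds nothing
good_valence = 1.9
print(round(normalize(good_valence), 4))  # 0.4404
```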

In [50]:
import pandas as pd

In [51]:
df = pd.read_csv("TextFiles/amazonreviews.tsv", sep="\t")
df.head()

Unnamed: 0,label,review
0,pos,Stuning even for the non-gamer: This sound tra...
1,pos,The best soundtrack ever to anything.: I'm rea...
2,pos,Amazing!: This soundtrack is my favorite music...
3,pos,Excellent Soundtrack: I truly like this soundt...
4,pos,"Remember, Pull Your Jaw Off The Floor After He..."


In [53]:
df['label'].value_counts()

label
neg    5097
pos    4903
Name: count, dtype: int64

In [54]:
df.isnull().sum()

label     0
review    0
dtype: int64

In [55]:
df.dropna(inplace=True)

In [56]:
# Collect the index of every review that contains only whitespace
blanks = []
for i, lb, rv in df.itertuples():
    # (index, label, review)
    if isinstance(rv, str) and rv.isspace():
        blanks.append(i)

In [57]:
df.drop(blanks, inplace=True)

In [59]:
df.iloc[0]['review']

'Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^'

In [58]:
sid.polarity_scores(df.iloc[0]['review'])

{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'compound': 0.9454}

In [61]:
df['scores'] = df['review'].apply(sid.polarity_scores)

In [62]:
df.head()

Unnamed: 0,label,review,scores
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co..."
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co..."
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com..."
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com..."
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp..."


In [63]:
df['compound'] = df['scores'].apply(lambda d:d['compound'])
df.head()

Unnamed: 0,label,review,scores,compound
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781


In [64]:
df['comp_score'] = df['compound'].apply(lambda score: 'pos' if score >= 0 else 'neg')
df.head()

Unnamed: 0,label,review,scores,compound,comp_score
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454,pos
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957,pos
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858,pos
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814,pos
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781,pos
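
The cell above splits at 0 because this dataset only has pos and neg labels. The VADER documentation commonly suggests thresholds of +0.05 and -0.05, with everything in between treated as neutral. A sketch of that three-way variant, where label_from_compound and its threshold parameter are names made up for this example:

```python
def label_from_compound(score, threshold=0.05):
    """Turn a compound score into 'pos', 'neg' or 'neu' using symmetric thresholds."""
    if score >= threshold:
        return "pos"
    if score <= -threshold:
        return "neg"
    return "neu"

print(label_from_compound(0.9454))   # pos
print(label_from_compound(-0.8331))  # neg
print(label_from_compound(0.0))      # neu
```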


In [65]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [66]:
print(accuracy_score(df['label'], df['comp_score']))

0.7097


In [67]:
print(classification_report(df['label'], df['comp_score']))

              precision    recall  f1-score   support

         neg       0.86      0.52      0.64      5097
         pos       0.64      0.91      0.75      4903

    accuracy                           0.71     10000
   macro avg       0.75      0.71      0.70     10000
weighted avg       0.75      0.71      0.70     10000



In [68]:
print(confusion_matrix(df['label'], df['comp_score']))

[[2629 2468]
 [ 435 4468]]
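
The numbers in the classification report can be recomputed by hand from this confusion matrix. Rows are the true labels and columns are the predicted labels, both in the order neg, pos. The variable names below are chosen for this example.

```python
# Confusion matrix copied from the output above
cm = [[2629, 2468],
      [435, 4468]]

true_neg, false_pos = cm[0]   # true label neg: predicted neg, predicted pos
false_neg, true_pos = cm[1]   # true label pos: predicted neg, predicted pos
total = true_neg + false_pos + false_neg + true_pos

accuracy = (true_neg + true_pos) / total
neg_precision = true_neg / (true_neg + false_neg)
neg_recall = true_neg / (true_neg + false_pos)
pos_precision = true_pos / (true_pos + false_pos)
pos_recall = true_pos / (true_pos + false_neg)

print(accuracy)                                       # 0.7097
print(round(neg_precision, 2), round(neg_recall, 2))  # 0.86 0.52
print(round(pos_precision, 2), round(pos_recall, 2))  # 0.64 0.91
```

The largest error is the 2468 negative reviews that VADER scored as positive, which is why recall for neg is only 0.52.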
