# Data management

## Introduction to Text Data

## [Michel Coppée](https://www.uliege.be/cms/c_9054334/fr/repertoire?uid=u224042) & [Malka Guillot](https://malkaguillot.github.io/)

## HEC Liège | ECON2306

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/malkaguillot/ECON2206-Data-Management-2023/HEAD?labpath=%2Fpractice%2F4.3-text-data.ipynb)

This notebook provides an introduction to the basic tools for text analytics.

In [1]:
#!pip install unidecode
#!pip install googletrans
#!pip install gensim
#!pip install spacy
#!pip install wordcloud
#!pip install pyldavis

#!python -m spacy download en_core_web_sm
#!python -m spacy download en_core_web_lg

#import nltk
#nltk.download('stopwords') 
#nltk.download('punkt') 
#nltk.download('wordnet') 
#nltk.download('averaged_perceptron_tagger')
#nltk.download('vader_lexicon')

**Set up and load data**

In [2]:
# Common imports
import numpy as np
import os
import pandas as pd

# To plot pretty figures
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib as mpl
import matplotlib.pyplot as plt
#%matplotlib notebook
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

import seaborn as sns
import warnings
# silence noisy library warnings in the notebook output
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

# to make this notebook's output identical at every run
np.random.seed(42)

In [3]:
# Scikit-Learn ≥0.20 is required
import sklearn

We use as an example the **20 Newsgroups** dataset ([http://qwone.com/~jason/20Newsgroups/](http://qwone.com/~jason/20Newsgroups/)), available through `sklearn`: a collection of about 20,000 newsgroup (message forum) documents.

In [4]:
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups() # returns a dict-like Bunch; the default subset is 'train'
data.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

Data Set Characteristics:

In [5]:
print(data['DESCR'])

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features      

In [6]:
W, y = data.data, data.target
n_samples = y.shape[0]
n_samples

11314

In [7]:
y[:10] # news story categories

array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])

In [8]:
doc = W[0]
doc

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

**Make a pandas dataframe**

In [9]:
df = pd.DataFrame(W,columns=['text'])
df['topic'] = y
df.head()

| | text | topic |
|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 |


# Working with Text Data

Iterate over some documents:

In [10]:
from gensim.utils import simple_preprocess

processed = []
# iterate over rows
for i, text in enumerate(W):
    document = simple_preprocess(text) # lowercase and split into tokens
    processed.append(document) # add to list
    if i > 100:
        break


In [None]:
processed[0][:10]

*Transliterating non-ASCII characters*

In [None]:
from unidecode import unidecode # package for transliterating Unicode text to ASCII
unicode_str = 'Visualizations\xa0'
fixed = unidecode(unicode_str) # example usage
print([unicode_str],[fixed]) # the non-breaking space '\xa0' becomes a plain space
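
If `unidecode` is not installed, a rough standard-library approximation is possible. The sketch below is an assumption about what is good enough for simple cleaning, not a replacement for `unidecode`: the helper `to_ascii` is hypothetical, and characters without a Unicode decomposition (for example `ß` or `ø`) are dropped rather than transliterated.

```python
import unicodedata

def to_ascii(s):
    """Fold a string to plain ASCII using only the standard library."""
    # NFKD splits accented letters into base letter + combining mark and maps
    # compatibility characters (e.g. the non-breaking space) to plain ones.
    decomposed = unicodedata.normalize('NFKD', s)
    # Drop whatever still has no ASCII equivalent (e.g. the combining marks).
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print([to_ascii('Visualizations\xa0'), to_ascii('Liège')])
```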

# Quantity of Text

Count words per document.

In [None]:
def get_words_per_doc(txt):
    # split text into words and count them.
    return len(txt.split()) 

# apply to our data
df['num_words'] = df['text'].apply(get_words_per_doc)
df['num_words'].hist()

In [None]:
df['log_words'] = np.log(df['num_words'])
import seaborn as sns
sns.jointplot(data=df,x='topic', y='log_words',kind='hex')

Build a frequency distribution over words with `Counter`.

In [None]:
from collections import Counter
freqs = Counter()
for i, row in df.iterrows():
    freqs.update(row['text'].lower().split())
    if i > 100:
        break
freqs.most_common(20) # the most frequent words can serve as style/function words

# Dictionary / Matching Methods

## Sentiment Analysis

In [None]:
# Dictionary-Based Sentiment Analysis

from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
polarity = sid.polarity_scores(doc)
print(polarity)

In [None]:
# sample 20% of the dataset
dfs = df.sample(frac=.2) 

# apply compound sentiment score to data-frame
def get_sentiment(snippet):
    return sid.polarity_scores(snippet)['compound']
dfs['sentiment'] = dfs['text'].apply(get_sentiment)

In [None]:
dfs.sort_values('sentiment',inplace=True)
# print beginning of most positive documents
[x[50:150] for x  in dfs[-5:]['text']]

In [None]:
# print beginning of most negative documents
[x[50:150] for x  in dfs[:5]['text']]
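
To see what a dictionary-based score does under the hood, here is a minimal sketch. The two word lists are a hypothetical toy lexicon made up for illustration; VADER's lexicon is far larger, weights each word, and handles negation and intensifiers, none of which is attempted here.

```python
import re

# Hypothetical toy lexicon, for illustration only
POSITIVE = {'good', 'great', 'excellent', 'happy', 'thanks'}
NEGATIVE = {'bad', 'terrible', 'awful', 'angry', 'broken'}

def toy_sentiment(text):
    """Return (#positive - #negative) / #tokens, a score between -1 and 1."""
    tokens = re.findall(r'[a-z]+', text.lower())
    if not tokens:
        return 0.0  # empty text: neutral by convention
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)

print(toy_sentiment('Great car, thanks!'))
print(toy_sentiment('The engine is broken and the service was awful.'))
```

Dividing by the number of tokens keeps long and short documents comparable, which is one simple design choice among several.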

## StopWords

In [None]:
from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))
stopwords

In [None]:
stopwords = set([x for x in stopwords if x != 'against']) # exception

In [None]:
stopfreq = np.sum([freqs[x] for x in stopwords])
stopfreq

In [None]:
otherfreq = np.sum([freqs[x] for x in freqs if x not in stopwords])
otherfreq

## RegEx

Please refer to the [RegexOne regular expressions lessons](https://regexone.com) and [the Python documentation](https://docs.python.org/3/howto/regex.html).

In [None]:
import re

docs = dfs[:5]['text']

# Extract the first word after "Subject:".
# Always prefix the pattern with r (raw string) so that backslashes are treated literally.
for doc in docs:    
    print(re.findall(r'Subject: \w+ ', # pattern to match
                     doc,            # string
                     re.IGNORECASE))  # ignore upper/lowercase (optional)

In [None]:
# Extract hyphenated words
for doc in docs:    
    print(re.findall(r'[a-z]+-[a-z]+', 
                     doc,            
                     re.IGNORECASE))  

In [None]:
# extract email addresses
for i, doc in enumerate(docs):
    finder = re.finditer(r'\w+@[^\s]+\.\w\w\w', # pattern to match ([^\s] means non-white-space)
                     doc)            # string
    for m in finder: 
        print(i, m.span(),m.group()) # location (start,end) and matching string

In [None]:
# Baker-Bloom-Davis economic policy uncertainty: a document must mention
# uncertainty, the economy, and at least one policy term
pattern1 = r'\buncertain[a-z]*'
pattern2 = r'\beconom[a-z]*'
pattern3 = r'\b(?:congress|deficit|federal reserve|legislation|regulation|white house)\b'

In [None]:
re.search(pattern1,'The White House tried to calm uncertainty in the markets.')

In [None]:
def indicates_uncertainty(doc):
    "True if doc matches all three keyword patterns (case-insensitive)"
    m1 = re.search(pattern1, doc, re.IGNORECASE)
    m2 = re.search(pattern2, doc, re.IGNORECASE)
    m3 = re.search(pattern3, doc, re.IGNORECASE)
    return bool(m1 and m2 and m3)

In [None]:
df['uncertainty'] = df['text'].apply(indicates_uncertainty)

In [None]:
df.uncertainty.mean()

In [None]:
df[df.uncertainty]

# Featurizing Texts

## Main

In [None]:
text = "Prof. Zurich hailed from Zurich. She got 3 M.A.'s from ETH."

**Sentence Tokenization**

**NLTK has a fast implementation that makes errors.**

In [None]:
from nltk import sent_tokenize
sentences = sent_tokenize(text) # split document into sentences
print(sentences)

**spacy works better.**

**Install spacy and the English model if you have not already.**

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
sentences = list(doc.sents)
print(sentences)

**Removing capitalization.**

In [None]:
# Capitalization
text_lower = text.lower() # go to lower-case
text_lower

In [None]:
#####
# Punctuation
#####

# recipe for fast punctuation removal
from string import punctuation
punc_remover = str.maketrans('','',punctuation) 
text_nopunc = text_lower.translate(punc_remover)
print(text_nopunc)

In [None]:
# Tokens
tokens = text_nopunc.split() # splits a string on white space
print(tokens)

In [None]:
# Numbers
# remove numbers (keep if not a digit)
no_numbers = [t for t in tokens if not t.isdigit()]
# keep if not a digit, else replace with "#"
norm_numbers = [t if not t.isdigit() else '#' 
                for t in tokens ]
print(no_numbers )
print(norm_numbers)

In [None]:
# Stopwords
from nltk.corpus import stopwords
stoplist = stopwords.words('english') 
# keep if not a stopword
nostop = [t for t in norm_numbers if t not in stoplist]
print(nostop)

In [None]:
# scikit-learn stopwords
# (the old sklearn.feature_extraction.stop_words module was removed in recent versions)
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
sorted(ENGLISH_STOP_WORDS)[:10]

In [None]:
# spacy stopwords
sorted(list(nlp.Defaults.stop_words))[:10]

In [None]:
# Stemming
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english') # snowball stemmer, english
# remake list of tokens, replace with stemmed versions
tokens_stemmed = [stemmer.stem(t) for t in ['tax','taxes','taxed','taxation']]
print(tokens_stemmed)

In [None]:
stemmer = SnowballStemmer('german') # snowball stemmer, german
print(stemmer.stem("Autobahnen"))

In [None]:
# Lemmatizing
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
[wnl.lemmatize(c) for c in ['corporation', 'corporations', 'corporate']]

Let's wrap it into a recipe.

In [None]:
from string import punctuation
translator = str.maketrans('','',punctuation) 
from nltk.corpus import stopwords
stoplist = set(stopwords.words('english'))
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')

def normalize_text(doc):
    "Input doc and return clean list of tokens"
    doc = doc.replace('\r', ' ').replace('\n', ' ')
    lower = doc.lower() # all lower case
    nopunc = lower.translate(translator) # remove punctuation
    words = nopunc.split() # split into tokens
    nostop = [w for w in words if w not in stoplist] # remove stopwords
    no_numbers = [w if not w.isdigit() else '#' for w in nostop] # normalize numbers
    stemmed = [stemmer.stem(w) for w in no_numbers] # stem each word
    return stemmed

And apply it to the corpus.

In [None]:
df['tokens_cleaned'] = df['text'].apply(normalize_text)
df['tokens_cleaned']

**Shortcut: `gensim.utils.simple_preprocess`.**

In [None]:
from gensim.utils import simple_preprocess
print(simple_preprocess(text))

In [None]:
from collections import Counter
print(Counter(simple_preprocess(text)))

Now let's apply `simple_preprocess` to the corpus.

In [None]:
df['tokens_simple'] = df['text'].apply(simple_preprocess)
df['tokens_simple']

**Tagging Parts of Speech**

In [None]:
text = 'Science cannot solve the ultimate mystery of nature. And that is because, in the last analysis, we ourselves are a part of the mystery that we are trying to solve.'

#nltk.download('averaged_perceptron_tagger')
from nltk.tag import perceptron 
from nltk import word_tokenize
tagger = perceptron.PerceptronTagger()
tokens = word_tokenize(text)
tagged_sentence = tagger.tag(tokens)
tagged_sentence

Plot nouns and adjectives by topic

In [None]:
from collections import Counter
from nltk import word_tokenize

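def get_nouns_adj(snippet):
    "Return the shares of noun and adjective tags among all tokens"
    tags = [x[1] for x in tagger.tag(word_tokenize(snippet))]
    if not tags: # empty document: avoid dividing by zero
        return 0.0, 0.0
    share_nouns = len([t for t in tags if t[0] == 'N']) / len(tags) # NN, NNS, NNP, ...
    share_adj = len([t for t in tags if t[0] == 'J']) / len(tags) # JJ, JJR, JJS
    return share_nouns, share_adj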

dfs['nouns'], dfs['adj'] = zip(*dfs['text'].map(get_nouns_adj))
dfs.groupby('topic')[['nouns','adj']].mean().plot()

## Corpus Prep with spaCy

Get spaCy documents for each text and add them to the dataframe. Using `apply` is quicker than iterating over the dataframe with `iterrows()`, but slower than a parallelized solution. It will take a few minutes for a whole corpus.

In [None]:
dfs = df.sample(10)
dfs['doc'] = dfs['text'].apply(nlp)

In [None]:
dfs.doc

In [None]:
# The spacy model already gives you sentences and tokens.
# For example:
tensents = list(dfs['doc'].iloc[0].sents)[:10]
tensents

In [None]:
# tokens
list(tensents[-1]) 

In [None]:
# lemmas
[x.lemma_ for x in tensents[-1]]

In [None]:
# POS tags
[x.tag_ for x in tensents[-1]]

## N-grams

In [None]:
from nltk import ngrams
from collections import Counter

# get n-gram counts for the first 52 documents
grams = []
for i, row in df.iterrows():
    tokens = row['text'].lower().split() # get tokens
    for n in range(2,4):
        grams += list(ngrams(tokens,n)) # get bigrams and trigrams
    if i > 50:
        break
Counter(grams).most_common()[:8]  # most frequent n-grams
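
For intuition, an n-gram is just a sliding window of `n` consecutive tokens. The sketch below rebuilds that idea without `nltk`; `make_ngrams` and the toy sentence are hypothetical names and data introduced only for this illustration.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return all windows of n consecutive tokens as tuples."""
    # zip over n shifted copies of the token list; zip stops at the shortest one
    return list(zip(*(tokens[i:] for i in range(n))))

toy_tokens = 'the car is a sports car and the car is fast'.split()
toy_grams = make_ngrams(toy_tokens, 2) + make_ngrams(toy_tokens, 3)
print(Counter(toy_grams).most_common(3))
```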

In [None]:
# spaCy noun chunks (base noun phrases from the parser, not named entities)
chunks = list(nlp(df['text'].iloc[10]).noun_chunks)
chunks

## Tokenizers / Vectorizers

In [None]:
# Counter is a quick pure-python solution.
from collections import Counter
freqs = Counter(tokens)
freqs.most_common()[:20]

Usually we use scikit-learn's vectorizer.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(min_df=0.001, # keep terms appearing in at least 0.1% of docs
                        max_df=.8, # drop terms appearing in more than 80% of docs
                        max_features=1000,
                        stop_words='english',
                        ngram_range=(1,3)) # words, bigrams, and trigrams
X = vec.fit_transform(df['text'])

# save the vectors
# pd.to_pickle(X,'X.pkl')

# save the vectorizer 
# (so you can transform other documents, 
# also for the vocab)
#pd.to_pickle(vec, 'vec-3grams-1.pkl')

In [None]:
X

In [None]:
# tf-idf vectorizer up-weights rare/distinctive words
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(min_df=0.001, 
                        max_df=0.9,  
                        max_features=1000,
                        stop_words='english',
                        use_idf=True, # the new piece
                        ngram_range=(1,2))

X_tfidf = tfidf.fit_transform(df['text'])
#pd.to_pickle(X_tfidf,'X_tfidf.pkl')

In [None]:
X_tfidf
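
To make the weighting concrete, the following sketch computes tf-idf by hand on three toy documents. It assumes the smoothed formula documented for scikit-learn's default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each document vector; the documents and helper names are hypothetical, and the tokenisation is a plain whitespace split rather than the vectorizer's own.

```python
import math
from collections import Counter

toy_docs = [
    'the car is fast',
    'the car is red',
    'the game was long',
]
tokenized = [d.split() for d in toy_docs]
n_docs = len(tokenized)

# document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def smooth_idf(term):
    # smoothed inverse document frequency (assumed scikit-learn default)
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(doc):
    """Return {term: weight} with L2-normalised tf-idf weights."""
    counts = Counter(doc)
    raw = {t: c * smooth_idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf_vector(tokenized[0])
print(sorted(weights.items(), key=lambda kv: -kv[1]))
```

In the first document, `fast` appears in no other document and so receives the largest weight, while `the` appears everywhere and receives the smallest.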

Make word cloud of common words by topic id.

In [None]:
df['topic'].value_counts() 

In [None]:
vocab = tfidf.get_feature_names_out() # scikit-learn ≥1.0 (get_feature_names() was removed in 1.2)
vocab[:10], vocab[-10:]

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

for topic_id in [1,2,8,9]: 
    slicer = df['topic'] == topic_id
    f = X_tfidf[slicer.values]
    total_freqs = list(np.array(f.sum(axis=0))[0])
    fdict = dict(zip(vocab,total_freqs))
    # generate word cloud of words with highest counts
    wordcloud = WordCloud().generate_from_frequencies(fdict) 
    print(topic_id)
    plt.clf()
    plt.imshow(wordcloud, interpolation='bilinear') 
    plt.axis("off") 
    plt.show()

In [None]:
# hash vectorizer
from sklearn.feature_extraction.text import HashingVectorizer

hv = HashingVectorizer(n_features=10)
X_hash = hv.fit_transform(df['text'])
X_hash
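
The hashing trick behind this vectorizer can be sketched in a few lines: each token is hashed to one of `n_features` buckets, so no vocabulary has to be stored. The function below is a hypothetical illustration that uses `md5` as a stand-in hash; scikit-learn's `HashingVectorizer` uses a different hash function, alternates signs and normalises the rows by default, so its numbers will not match this sketch.

```python
import hashlib

def hash_vectorize(text, n_features=10):
    """Map a document to a fixed-length count vector via the hashing trick."""
    vector = [0] * n_features
    for token in text.lower().split():
        # md5 is used here only as a stable stand-in hash function
        digest = hashlib.md5(token.encode('utf-8')).hexdigest()
        bucket = int(digest, 16) % n_features
        vector[bucket] += 1  # different tokens may collide in the same bucket
    return vector

toy_vec = hash_vectorize('the car is a fast car')
print(toy_vec)
```

With only 10 buckets, collisions are frequent; larger values of `n_features` trade memory for fewer collisions.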

## Feature Selection

In [None]:
#%% Univariate feature selection using chi2
from sklearn.feature_selection import SelectKBest, chi2, f_classif, f_regression, mutual_info_classif
# X comes from the CountVectorizer `vec`, so the feature names must come from `vec` too
# (requires scikit-learn ≥1.0), not from the tf-idf vectorizer
count_vocab = vec.get_feature_names_out()
select = SelectKBest(chi2, k=10)
Y = df['topic']==1
X_new = select.fit_transform(X, Y)
# top 10 features by chi-squared:
[count_vocab[i] for i in np.argsort(select.scores_)[-10:]]

In [None]:
#%% top 10 features by ANOVA F-value:
select = SelectKBest(f_classif, k=10)
select.fit(X, Y)
[count_vocab[i] for i in np.argsort(select.scores_)[-10:]]

In [None]:
#%% top 10 features by linear regression
select = SelectKBest(f_regression, k=10)
select.fit(X, Y)
[count_vocab[i] for i in np.argsort(select.scores_)[-10:]]

In [None]:
#%% top 10 features by mutual information (classification)
# (computed on the first 1000 documents only, to keep it fast)
select = SelectKBest(mutual_info_classif, k=10)
select.fit(X[:1000], Y[:1000])
[count_vocab[i] for i in np.argsort(select.scores_)[-10:]]
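
As a rough picture of what a chi-squared score measures, the sketch below compares the observed count of each term in each class with the count expected if the term were independent of the class. It is a simplified illustration on hypothetical toy counts; scikit-learn's implementation is vectorised and may differ in details.

```python
def chi2_scores(X, y):
    """Chi-squared score of each count feature against a class label."""
    n_docs, n_features = len(X), len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(n_features):
        total = sum(row[j] for row in X)  # total count of feature j
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # expected count if the feature were independent of the class
            expected = total * sum(label == c for label in y) / n_docs
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

toy_X = [[3, 1], [2, 1], [0, 1], [0, 1]]  # hypothetical counts for two terms
toy_y = [1, 1, 0, 0]                      # class labels of the four documents
print(chi2_scores(toy_X, toy_y))
```

The first term occurs only in class 1 and gets a high score; the second is spread evenly across classes and scores zero.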

# Document Distance

In [None]:
# compute pairwise similarities between the first 100 documents in the corpus
from sklearn.metrics.pairwise import cosine_similarity

sim = cosine_similarity(X[:100])
sim.shape

In [None]:
sim[:4,:4]

In [None]:
# TF-IDF Similarity
tsim = cosine_similarity(X_tfidf[:100])
tsim[:4,:4]
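
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ignores document length and only compares direction. A minimal pure-Python sketch on hypothetical term counts:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)

doc_a = [2, 1, 0]   # hypothetical term counts
doc_b = [4, 2, 0]   # same direction as doc_a, twice as long
doc_c = [0, 0, 3]   # no terms in common with doc_a
print(cosine(doc_a, doc_b), cosine(doc_a, doc_c))
```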

In [None]:
11000*11000 # number of entries in the full pairwise matrix for ~11,000 documents

# Topic models

We use as an example the **20 Newsgroups** dataset ([http://qwone.com/~jason/20Newsgroups/](http://qwone.com/~jason/20Newsgroups/)), available through `sklearn`: a collection of about 20,000 newsgroup (message forum) documents.
Cf. Week 4 on introduction to text analysis. 

In [None]:
W=data.data

In [None]:
from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))
stopwords.add('edu')

Pre-processing

In [None]:
from gensim.utils import simple_preprocess

doc_clean = []
# iterate over rows
for i, text in enumerate(W):
    document = simple_preprocess(text) # lowercase and split into tokens
    document = [word for word in document if word not in stopwords] # remove stopwords
    doc_clean.append(document) # add to list
    if i > 100:
        break

In [None]:
# shuffle the documents
from random import shuffle
shuffle(doc_clean)

# creating the term dictionary
from gensim import corpora
dictionary = corpora.Dictionary(doc_clean)

Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.

In [None]:
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

TF-IDF matrix

In [None]:
from gensim.models import TfidfModel
tfidf = TfidfModel(doc_term_matrix)  # fit model

In [None]:
vector = tfidf[doc_term_matrix[0]]  # apply model to the first corpus document
vector

In [None]:
corpus_tfidf = tfidf[doc_term_matrix]   # apply model to whole corpus


Parameters of LDA:

- `num_topics`: the number of topics to extract from the documents.
- `alpha`: document-topic density. The larger `alpha`, the more topics each document is spread over; with a small `alpha`, each document concentrates on a few topics.
- `eta`: topic-word density. The larger `eta`, the more words carry weight within each topic; with a small `eta`, each topic concentrates on a few words.
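
The role of `alpha` can be illustrated by simulation, since LDA draws each document's topic shares from a Dirichlet distribution. The sketch below assumes a symmetric Dirichlet over 10 topics and compares the average share of the dominant topic for a small and a large `alpha`; the helper names, the number of draws and the seed are arbitrary choices made for this illustration.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one point from a symmetric Dirichlet(alpha) over k topics."""
    # normalising independent Gamma(alpha, 1) draws gives a Dirichlet draw
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=10, n_draws=2000, seed=0):
    """Average weight of the dominant topic across simulated documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_draws)) / n_draws

sparse = mean_largest_share(alpha=0.1)   # small alpha: one topic tends to dominate
spread = mean_largest_share(alpha=10.0)  # large alpha: weight spread across topics
print(round(sparse, 2), round(spread, 2))
```

The same logic applies to `eta`, with words within a topic in place of topics within a document.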


#### Using the DTM

In [None]:
# train LDA with 10 topics and print 
from gensim.models.ldamodel import LdaModel

lda = LdaModel(doc_term_matrix, num_topics=10, 
               id2word = dictionary, passes=3)
lda.show_topics(formatted=False)

In [None]:
lda_idf = LdaModel(corpus_tfidf, num_topics=10, 
               id2word = dictionary, passes=3)
lda_idf.show_topics(formatted=False)

In [None]:
# to get the topic proportions for a document, use
# the corresponding row from the document-term matrix.
lda[doc_term_matrix[0]]

The `wordcloud` package builds a visual representation of the most common words. We apply it to each topic here.

In [None]:
###
# LDA Word Clouds
###

from numpy.random import randint
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# make the output directory if it does not exist
import os
os.makedirs('lda', exist_ok=True)

# make word clouds for the topics
for i,weights in lda.show_topics(num_topics=-1,
                                 num_words=100,
                                 formatted=False):
    
    #logweights = [w[0], np.log(w[1]) for w in weights]
    maincol = randint(0,360)
    def colorfunc(word=None, font_size=None, 
                  position=None, orientation=None, 
                  font_path=None, random_state=None):   
        color = randint(maincol-10, maincol+10)
        if color < 0:
            color = 360 + color
        return "hsl(%d, %d%%, %d%%)" % (color,randint(65, 75)+font_size / 7, randint(35, 45)-font_size / 10)   

    
    wordcloud = WordCloud(background_color="white", 
                          ranks_only=False, 
                          max_font_size=120,
                          color_func=colorfunc,
                          height=600,width=800).generate_from_frequencies(dict(weights))

    plt.clf()
    plt.imshow(wordcloud,interpolation="bilinear")
    plt.axis("off")
    plt.show()

**LDAvis viz**

# Word Embeddings

In [None]:

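def get_sentences(doc):
    "Split doc into sentences; return a list of cleaned, stemmed token lists"
    sentences = []
    
    for raw in sent_tokenize(doc):
        # uses translator, stoplist and stemmer from the normalize_text recipe above
        raw2 = [i for i in raw.translate(translator).lower().split() if i not in stoplist and len(i) < 10]
        raw3 = [stemmer.stem(t) for t in raw2]
        sentences.append(raw3)
    return sentences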

In [None]:
###
# Word2Vec in gensim
###

# word2vec takes lists of tokens as input;
# here each document is treated as one long "sentence"
sentences = []
for doc in df['text']:
    sentences.append(simple_preprocess(doc))
from random import shuffle
shuffle(sentences) # stream in sentences in random order

# train the model
from gensim.models import Word2Vec
w2v = Word2Vec(sentences,  # list of tokenized sentences
               workers = 8, # Number of threads to run in parallel
               vector_size=300,  # Word vector dimensionality (named `size` before gensim 4)
               min_count =  25, # Minimum word count  
               window = 5, # Context window size      
               sample = 1e-3, # Downsample setting for frequent words
               )

# gensim 4 no longer needs init_sims(replace=True) after training (the call is deprecated)

w2v.save('w2v-vectors.pkl')

In [None]:
w2v.wv.most_similar('man') # most similar words

In [None]:
# analogies: judge is to man as __ is to woman (judge - man + woman)
w2v.wv.most_similar(positive=['judge','woman'],
                 negative=['man'])

In [None]:
# Word2Vec: K-Means Clusters
from sklearn.cluster import KMeans
kmw = KMeans(n_clusters=50)
kmw.fit(w2v.wv.vectors)

In [None]:
# words in the same cluster as 'woman' (gensim ≥4: key_to_index / index_to_key)
clust = kmw.labels_[w2v.wv.key_to_index['woman']]
for i, cluster in enumerate(kmw.labels_):
    if cluster == clust:
        print(w2v.wv.index_to_key[i])
    if i > 1000:
        break

In [None]:
###
# Pre-trained vectors
###

import spacy
en = spacy.load('en_core_web_lg') # higher-quality vectors (but 800MB)
apple = en('apple') 
apple.vector[:10] # vector for 'apple'

In [None]:
apple.similarity(apple)

In [None]:
orange = en('orange')
apple.similarity(orange)

# Document Embeddings

In [None]:
###
# Make document vectors from word embeddings
##

# Represent each document by the average of its word vectors
from gensim.models import Word2Vec
w2v = Word2Vec.load('w2v-vectors.pkl')

sentvecs = []
for sentence in sentences:
    vecs = [w2v.wv[w] for w in sentence if w in w2v.wv]
    if len(vecs)== 0:
        sentvecs.append(np.nan)
        continue
    sentvec = np.mean(vecs,axis=0)
    sentvecs.append(sentvec.reshape(1,-1))
sentvecs[0][0][:30]
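
The averaging step does not depend on gensim. The sketch below repeats it on a hypothetical three-dimensional embedding table, invented only for illustration, and returns `None` when no token is known instead of `np.nan`.

```python
# Hypothetical 3-dimensional embeddings, for illustration only
toy_embeddings = {
    'judge': [0.9, 0.1, 0.0],
    'court': [0.8, 0.2, 0.0],
    'car':   [0.0, 0.1, 0.9],
}

def average_vector(tokens, embeddings):
    """Mean of the vectors of the known tokens, or None if none is known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

print(average_vector(['the', 'judge', 'court'], toy_embeddings))
```

Unknown tokens such as `the` are skipped, which mirrors the `if w in w2v.wv` filter above.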

In [None]:
# compute cosine similarity between sentence vectors
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(sentvecs[0],
                  sentvecs[1])[0][0]

In [None]:
sentvecs[0]

In [None]:
###
# Doc2Vec
###

from nltk import word_tokenize
docs = []

for i, row in df.iterrows():
    docs += [word_tokenize(row['text'])]
shuffle(docs)

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
doc_iterator = [TaggedDocument(doc, [i]) for i, doc in enumerate(docs)]
d2v = Doc2Vec(doc_iterator,
                min_count=10, # minimum word count
                window=10,    # window size
                vector_size=200, # size of document vector
                sample=1e-4, 
                negative=5, 
                workers=4, # threads
                #dbow_words = 1 # uncomment to get word vectors too
                max_vocab_size=1000) # max vocab size

In [None]:
d2v.save('d2v-vectors.pkl')

In [None]:
# matrix of all document vectors:
D = d2v.dv.vectors # gensim ≥4 (was d2v.docvecs.vectors_docs in gensim 3)
D.shape

In [None]:
D

In [None]:
# infer vectors for new documents (infer_vector expects a list of tokens)
d2v.infer_vector('the judge on the court'.split())[:20]

In [None]:
# get all pair-wise document similarities
pairwise_sims = cosine_similarity(D)
pairwise_sims.shape

In [None]:
pairwise_sims[:3,:3]

In [None]:
# Document clusters
from sklearn.cluster import KMeans

# create 10 clusters of similar documents
num_clusters = 10
kmw = KMeans(n_clusters=num_clusters)
kmw.fit(D)

In [None]:
# Documents from an example cluster
for i, doc in enumerate(docs):
    if kmw.labels_[i] == 3:
        print(doc[10:20])
    if i == 20000:
        break

In [None]:
# t-SNE for visualization
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, verbose=1, perplexity=50, n_iter=300)
d2v_tsne = tsne.fit_transform(D)

In [None]:
vdf = pd.DataFrame(d2v_tsne,
                  columns=['x-tsne', 'y-tsne'])
vdf['cluster'] = kmw.labels_

In [None]:
import seaborn as sns
vdf = pd.DataFrame(d2v_tsne,
                  columns=['x', 'y'])
vdf['cluster'] = kmw.labels_

chart = sns.scatterplot(data=vdf, x='x', y='y', hue='cluster')

# Dependency Parsing

In [None]:
text = 'Science cannot solve the ultimate mystery of nature. And that is because, in the last analysis, we ourselves are a part of the mystery that we are trying to solve.'
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

In [None]:
doc

In [None]:
for sent in doc.sents:
    print(sent)
    print(sent.root)
    print([(w, w.dep_) for w in sent.root.children])
    print()

In [None]:
sent

In [None]:
# Noun Phrase Chunking
list(doc.noun_chunks)

In [None]:
sent.root

In [None]:
list(sent.root.children)

In [None]:
# Left children
list(sent.root.lefts)

In [None]:
# Right children
list(sent.root.rights)

In [None]:
sent[0]

In [None]:
sent[0].dep_

In [None]:
sent[0].head