In [5]:
import pandas as pd

In [6]:
sens_data = pd.read_hdf('./data/big_newsdata.h5', 'sens', mode='r')  # read-only: the file is only loaded here

In [7]:
sens_data.head()

In [7]:
sens_data.shape

I take a random sample of 100 rows to keep memory use manageable.  

In [7]:
sens_sample = sens_data.sample(n=100)
sens_sample.shape

(100, 5)
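Because `sample` draws different rows on every run, the outputs further down will change between runs. A minimal sketch of a reproducible draw, assuming a fixed seed is acceptable; the toy frame and the seed value 42 are illustrative and not part of the original data:

```python
import pandas as pd

# Hypothetical stand-in for sens_data, used only for illustration.
toy = pd.DataFrame({"id": range(1000), "text": ["headline"] * 1000})

# Passing random_state fixes the seed, so repeated calls pick the same rows.
first = toy.sample(n=100, random_state=42)
second = toy.sample(n=100, random_state=42)

print(first.shape)           # (100, 2)
print(first.equals(second))  # True
```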

I use the spaCy library for text processing and named-entity recognition.  

In [8]:
import spacy
from spacy import displacy
nlp = spacy.load('en')

Let's just preview the structure of our text.  

In [9]:
parsed = sens_sample['text'].map(nlp)

In [16]:
displacy.render(parsed.iloc[0], style='ent', minify=True, jupyter=True)

Now let's get the sentences of the first document.

In [98]:
sentences = pd.DataFrame(list(zip(parsed.iloc[0].sents)))
sentences

Unnamed: 0,0
0,"(SBV, SVN, 201801020013A, -, Trading, statemen..."
1,"(LIMITED-(Incorporated, in, the, Republic, of,..."
2,"(Expected, 12, months, )"
3,"(Expected, 12, months, )"
4,"(Audited, 12, months, -, to, 31, December, ..."
5,"(2017, , 2016-%, increase, ..."
6,(>)
7,"(100, %, , 1, 498, ..."
8,(>)
9,"(100, %, , 1, 498, ..."


In [90]:
list(parsed.iloc[0].sents)

[SBV SVN 201801020013A-Trading statement and update to Sabvest shareholders-SABVEST,
 LIMITED-(Incorporated in the Republic of South Africa)-Registration number 1987/003753/06-ISIN: ZAE000006417 Ordinary shares-ISIN: ZAE000012043 N Ordinary shares-(“Sabvest” or “the Company”)-TRADING STATEMENT AND UPDATE TO SABVEST SHAREHOLDERS-Trading Statement-Shareholders are advised that the financial results of Sabvest for the twelve months-ending 31 December 2017 are expected to be as follows:-,
 Expected 12 months     ,
 Expected 12 months     ,
 Audited 12 months-to 31 December         to 31 December        to 31 December-2017                   ,
 2017                  2016-% increase                  cents                 cents-Net asset value per                  +35,2%                   5 040                 3 728-share-Headline earnings per                 ,
 >,
 100%                   1 498                 119,7-share-Earnings per share                    ,
 >,
 100%                   1 49

Now let's get the entities.  
We also do our text normalization (stemming/lemmatization).  `log_prob` is the token's log probability, so values closer to 0 mean a more likely token and more negative values mean a less likely one.  
An out-of-vocab token is a word that is not part of the model's vocabulary.
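To make the log scale concrete, here is a small sketch that converts a log probability back to a plain probability. It assumes the value is a natural logarithm; the helper name `to_probability` and the value -3.0 are made up for illustration, while -20.0 is the value shown in the table below.

```python
import math

def to_probability(log_prob):
    """Convert a natural-log probability back to a probability in (0, 1]."""
    return math.exp(log_prob)

common = to_probability(-3.0)   # illustrative value for a frequent token
rare = to_probability(-20.0)    # the value reported for every token below

print(common)          # roughly 0.05
print(rare)            # roughly 2e-09
print(common > rare)   # True: closer to 0 means more likely
```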

In [76]:
token_attributes = [(token.orth_,
                     token.lemma_,
                     token.ent_type_,
                     token.pos_,
                     token.prob,
                     token.is_punct,
                     token.is_space,
                     token.like_num,
                     token.is_oov,
                    )
                    for token in parsed.iloc[0]]

attributes = pd.DataFrame(token_attributes, columns=['text','lemma','entity','part of speech', 'log_prob','punctuation','space','number', 'out of vocab'])
attributes.loc[~attributes['punctuation']]

Unnamed: 0,text,lemma,entity,part of speech,log_prob,punctuation,space,number,out of vocab
0,SBV,sbv,PERSON,PROPN,-20.0,False,False,False,True
1,SVN,svn,PERSON,PROPN,-20.0,False,False,False,True
2,201801020013A,201801020013a,,PROPN,-20.0,False,False,False,True
4,Trading,trading,,NOUN,-20.0,False,False,False,True
5,statement,statement,,NOUN,-20.0,False,False,False,True
6,and,and,,CCONJ,-20.0,False,False,False,True
7,update,update,,NOUN,-20.0,False,False,False,True
8,to,to,,ADP,-20.0,False,False,False,True
9,Sabvest,sabvest,ORG,PROPN,-20.0,False,False,False,True
10,shareholders,shareholder,,NOUN,-20.0,False,False,False,True


## Phrase Modelling
Learning combinations of tokens that together represent meaningful multi-word concepts.  
  
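$\frac{count(AB) - count_{min}}{count(A) \times count(B)} \times N > threshold$

Where $count(A)$ is the number of times token $A$ appears in the corpus and $count(AB)$ is the number of times $A$ is immediately followed by $B$.  
$N$ is the total size of the corpus vocabulary.  
$count_{min}$ and $threshold$ are user-defined parameters: the first ensures that accepted phrases occur a minimum number of times, the second controls how strong the association must be.  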
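The scoring rule can be sketched in plain Python on a toy corpus. This is an illustration of the formula, not gensim's implementation: `phrase_scores`, the toy sentences, and the threshold of 1.5 are hypothetical, and N is taken here as the number of distinct tokens, which is an assumption.

```python
from collections import Counter

def phrase_scores(sentences, min_count=1):
    """Score every adjacent token pair with the formula above."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)  # N, assumed to be the number of distinct tokens
    return {
        (a, b): (n_ab - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        for (a, b), n_ab in bigrams.items()
    }

# Toy corpus, invented for illustration.
toy_corpus = [
    ["net", "asset", "value", "per", "share"],
    ["headline", "earnings", "per", "share"],
    ["earnings", "per", "share"],
]

scores = phrase_scores(toy_corpus, min_count=1)
threshold = 1.5  # illustrative value
phrases = sorted(pair for pair, score in scores.items() if score > threshold)
print(phrases)  # [('per', 'share')]
```

Here "per share" occurs three times, so its score is (3 - 1) / (3 * 3) * 7, about 1.56, and it is the only pair above the threshold. Pairs seen only once score 0 because `min_count` is subtracted from their count.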
  
Gensim is a library for statistical modelling of text, and its `Phrases` model is used for the phrase detection here.  

In [77]:
! pip install gensim

Collecting gensim
  Downloading gensim-3.3.0-cp36-cp36m-manylinux1_x86_64.whl (22.5MB)
Collecting smart-open>=1.2.1 (from gensim)
  Using cached smart_open-1.5.6.tar.gz
Collecting boto>=2.32 (from smart-open>=1.2.1->gensim)
  Using cached boto-2.48.0-py2.py3-none-any.whl
Collecting bz2file (from smart-open>=1.2.1->gensim)
  Using cached bz2file-0.98.tar.gz
Collecting boto3 (from smart-open>=1.2.1->gensim)
  Downloading boto3-1.5.36-py2.py3-none-any.whl (128kB)
Collecting jmespath<1.0.0,>=0.7.1 (from boto3->smart-open>=1.2.1->gensim)
  Using cached jmespath-0.9.3-py2.py3-none-any.whl
Collecting botocore<1.9.0,>=1.8.50 (from boto3->smart-open>=1.2.1->gensim)
  Downloading botocore-1.8.50-py2.py3-none-any.whl (4.1MB)
Collecting s3transfer<0.2.0,>=0.1.10 (from boto3->smart-open>=1.2.1->gensim)
 

In [87]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence