In [None]:
%matplotlib inline


Similarity Queries
==================

Demonstrates querying a corpus for similar documents.



In [None]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Creating the Corpus
-------------------

First, we need to create a corpus to work with.
This step is the same as in the previous tutorial;
if you completed it, feel free to skip to the next section.



In [None]:
from collections import defaultdict
from gensim import corpora

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

Similarity interface
--------------------

In the previous tutorials on
`sphx_glr_auto_examples_core_run_corpora_and_vector_spaces.py`
and
`sphx_glr_auto_examples_core_run_topics_and_transformations.py`,
we covered what it means to create a corpus in the Vector Space Model and how
to transform it between different vector spaces. A common reason for such a
charade is that we want to determine **similarity between pairs of
documents**, or the **similarity between a specific document and a set of
other documents** (such as a user query vs. indexed documents).

To show how this can be done in gensim, let us consider the same corpus as in the
previous examples (it originally comes from Deerwester et al.'s seminal 1990 article
`"Indexing by Latent Semantic Analysis" <http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf>`_).
To follow Deerwester's example, we first use this tiny corpus to define a 2-dimensional
LSI space:



In [None]:
from gensim import models
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

For the purposes of this tutorial, there are only two things you need to know about LSI.
First, it's just another transformation: it transforms vectors from one space to another.
Second, the benefit of LSI is that it enables identifying patterns and relationships between terms (in our case, words in a document) and topics.
Our LSI space is two-dimensional (`num_topics = 2`), so there are two topics, but this choice is arbitrary.
If you're interested, you can read more about LSI here: `Latent Semantic Indexing <https://en.wikipedia.org/wiki/Latent_semantic_indexing>`_.

Now suppose a user typed in the query `"Human computer interaction"`. We would
like to sort our nine corpus documents in decreasing order of relevance to this query.
Unlike modern search engines, here we only concentrate on a single aspect of possible
similarities---on apparent semantic relatedness of their texts (words). No hyperlinks,
no random-walk static ranks, just a semantic extension over the boolean keyword match:



In [None]:
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]  # convert the query to LSI space
print(vec_lsi)

In addition, we will be considering `cosine similarity <http://en.wikipedia.org/wiki/Cosine_similarity>`_
to determine the similarity of two vectors. Cosine similarity is a standard measure
in Vector Space Modeling, but wherever the vectors represent probability distributions,
`different similarity measures <http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Symmetrised_divergence>`_
may be more appropriate.
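
As a rough illustration of the measure itself, here is a minimal pure-Python sketch of cosine similarity between two dense vectors. The function name `cosine_similarity` and the sample vectors are made up for this example, and returning 0.0 for a zero vector is a convention chosen here; this is not gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


# Hypothetical 2-D vectors standing in for documents in a two-topic space
query = [0.5, -0.1]
same_direction = [1.0, -0.2]   # a scaled copy of the query
opposite = [-0.5, 0.1]         # points the other way

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, opposite))
```

Because only the angle matters, a scaled copy of a vector scores as maximally similar, which is why document length does not dominate the comparison.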

Initializing query structures
++++++++++++++++++++++++++++++++

To prepare for similarity queries, we need to enter all documents which we want
to compare against subsequent queries. In our case, they are the same nine documents
used for training LSI, converted to the 2-D LSI space. But that's only incidental; we
might also be indexing a different corpus altogether.



In [None]:
from gensim import similarities
index = similarities.MatrixSimilarity(lsi[corpus])  # transform corpus to LSI space and index it

<div class="alert alert-danger"><h4>Warning</h4><p>The class :class:`similarities.MatrixSimilarity` is only appropriate when the whole
  set of vectors fits into memory. For example, a corpus of one million documents
  would require 2GB of RAM in a 256-dimensional LSI space, when used with this class.

  Without 2GB of free RAM, you would need to use the :class:`similarities.Similarity` class.
  This class operates in fixed memory, by splitting the index across multiple files on disk, called shards.
  It uses :class:`similarities.MatrixSimilarity` and :class:`similarities.SparseMatrixSimilarity` internally,
  so it is still fast, although slightly more complex.</p></div>

Index persistency is handled via the standard :func:`save` and :func:`load` functions:



In [None]:
index.save('/tmp/deerwester.index')
index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')

This is true for all similarity indexing classes (:class:`similarities.Similarity`,
:class:`similarities.MatrixSimilarity` and :class:`similarities.SparseMatrixSimilarity`).
Also in the following, `index` can be an object of any of these. When in doubt,
use :class:`similarities.Similarity`, as it is the most scalable version, and it also
supports adding more documents to the index later.

Performing queries
++++++++++++++++++

To obtain similarities of our query document against the nine indexed documents:



In [None]:
sims = index[vec_lsi]  # perform a similarity query against the corpus
print(list(enumerate(sims)))  # print (document_number, document_similarity) 2-tuples

The cosine measure returns similarities in the range `[-1, 1]` (the greater, the more similar);
the first document, for example, has a score of 0.99809301.

With some standard Python magic we sort these similarities into descending
order, and obtain the final answer to the query `"Human computer interaction"`:



In [None]:
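sims = sorted(enumerate(sims), key=lambda item: -item[1])
# each entry is (document_number, similarity); look the text up by document number
for doc_position, doc_score in sims:
    print(doc_score, documents[doc_position])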

The thing to note here is that documents no. 2 (``"The EPS user interface management system"``)
and 4 (``"Relation of user perceived response time to error measurement"``) would never be returned by
a standard boolean fulltext search, because they do not share any common words with ``"Human
computer interaction"``. However, after applying LSI, we can observe that both of
them received quite high similarity scores (no. 2 is actually the most similar!),
which corresponds better to our intuition of
them sharing a "computer-human" related topic with the query. In fact, this semantic
generalization is the reason why we apply transformations and do topic modelling
in the first place.
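
To make the contrast with keyword matching concrete, the following sketch runs a plain boolean match of the query against the nine documents. The helper name `tokens` is invented for this example; the documents and stoplist are the ones used earlier in the tutorial.

```python
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
stoplist = set('for a of the and to in'.split())


def tokens(text):
    """Lowercased, whitespace-split words with stopwords removed."""
    return {word for word in text.lower().split() if word not in stoplist}


query_terms = tokens("Human computer interaction")

# A document is a boolean hit if it shares at least one term with the query
keyword_hits = [i for i, doc in enumerate(documents) if tokens(doc) & query_terms]
print(keyword_hits)  # [0, 1, 3]
```

Documents 2 and 4 are absent from the hit list, even though the LSI query above ranks them highly.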

Where next?
------------

Congratulations, you have finished the tutorials -- now you know how gensim works :-)
To delve into more details, you can browse through the `apiref`,
see the `wiki` or perhaps check out `distributed` in `gensim`.

Gensim is a fairly mature package that has been used successfully by many individuals and companies, both for rapid prototyping and in production.
That doesn't mean it's perfect though:

* there are parts that could be implemented more efficiently (in C, for example), or make better use of parallelism (multiple machines or cores)
* new algorithms are published all the time; help gensim keep up by `discussing them <http://groups.google.com/group/gensim>`_ and `contributing code <https://github.com/piskvorky/gensim/wiki/Developer-page>`_
* your **feedback is most welcome** and appreciated (and it's not just the code!):
  `bug reports <https://github.com/piskvorky/gensim/issues>`_ or
  `user stories and general questions <http://groups.google.com/group/gensim/topics>`_.

Gensim has no ambition to become an all-encompassing framework, across all NLP (or even Machine Learning) subfields.
Its mission is to help NLP practitioners try out popular topic modelling algorithms
on large datasets easily, and to facilitate prototyping of new algorithms for researchers.



In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
img = mpimg.imread('run_similarity_queries.png')
imgplot = plt.imshow(img)
plt.axis('off')
plt.show()

In [None]:
import xml.etree.ElementTree as ET
from urllib.error import HTTPError
import urllib.request
import threading
import datetime
import random
import time
import csv
import re

In [None]:
date = datetime.datetime.now()
currDate = '{}/{}/{}'.format(date.day, date.month, date.year)
# seed with a float timestamp: datetime objects are not a supported seed type
# (supported types are None, int, float, str, bytes, and bytearray)
random.seed(date.timestamp())


In [None]:
# Directories for the BBC news webpages I'm interested in
BBCArticleURLs = ('News',
'Modi/Indai','Modi/BJP','BJP/Congress','Modi/Rahul Gandhi','BJP/Modi',
'Congress/Rahul Gandhi')

# Directories for the CNN news webpages I'm interested in
CNNArticleURLs = ('Modi',
'Modi/Indai','Modi/BJP','BJP/Congress','Modi/Rahul Gandhi','BJP/Modi',
'Congress/Rahul Gandhi')

# Directories for the RT news webpages I'm interested in
RTArticleURLs = ('Modi',
'Modi/Indai','Modi/BJP','BJP/Congress','Modi/Rahul Gandhi','BJP/Modi',
'Congress/Rahul Gandhi')


In [None]:
def getArticles(dir, website):
    try:
        if website == 'BBC':
            tree = ET.parse(source=urllib.request.urlopen('http://feeds.bbci.co.uk/news/'+dir+'/rss.xml'))
        elif website == 'CNN':
            tree = ET.parse(source=urllib.request.urlopen('http://rss.cnn.com/rss/'+dir+'.rss'))
        elif website == 'RT':
            tree = ET.parse(source=urllib.request.urlopen('https://www.rt.com/rss/'+dir))
        else:
            tree = ET.parse(source=urllib.request.urlopen('https://www.theguardian.com/sitemaps/news.xml'))
    except HTTPError as err:
        print(err)
        return None
    except ET.ParseError as err:
        return None
    else:
        # Gets the xml tree as an object which
        # is then used to extract the articles
        root = tree.getroot()
        if website != 'guardian':
            allArticles = list()
            # TODO: fix issue with filtering bad titles
            # e.g 'RT UK News' or 'CNN.com - RSS' etc
            for elem in root.iter('title'):
                allArticles.append(elem.text)
            return allArticles
        else:
            i = 0
            if dir == 'titles':
                allTitles = list()
                for elem in root.iter('{http://www.google.com/schemas/sitemap-news/0.9}title'):
                    allTitles.append(elem.text.strip())
                    i += 1
                print('{} article titles scraped'.format(i))
                return allTitles

            else:
                # Returns a list where each element is a list containing
                # the keywords for an article title
                allKeywords = list()
                tempKeywords = list()
                keywordString = ''

                for elem in root.iter('{http://www.google.com/schemas/sitemap-news/0.9}keywords'):
                    keywordString = elem.text
                    try:
                        tempKeywords = keywordString.split(',')
                    except AttributeError as err:
                        print('Guardian article {} had no keywords'.format(i))
                        tempKeywords = ['no keywords']
                    allKeywords.append(tempKeywords)
                    i += 1
                print('{} article keyword lists scraped'.format(i))
                return allKeywords


def writeCSV(articleList, dir, invalid, website):
    if invalid:
        with open('errorLog.csv', 'a', encoding="utf-8") as file:
            fields = ['date', 'website', 'dir', 'articleTitle']
            writeObj = csv.DictWriter(file, fieldnames=fields,lineterminator='\n')

            for article in articleList:
                writeObj.writerow({'date':'{}'.format(currDate),'website':'{}'.format(website), 'dir':'{}'.format(dir),'articleTitle':'{}'.format(article)})
    else:
        with open('{}infoXML.csv'.format(website), 'a', encoding="utf-8") as file:
            fields = ['date', 'dir', 'articleTitle']
            writeObj = csv.DictWriter(file, fieldnames=fields,lineterminator='\n')

            for article in articleList:
                writeObj.writerow({'date':'{}'.format(currDate),'dir':'{}'.format(dir),'articleTitle':'{}'.format(article)})


def writeGuardianCSV(allTitles, allKeywords, date, invalid):
    if invalid:
        with open('errorLog.csv', 'a', encoding='utf-8') as file:
            i = 0
            fields = ['date', 'website', 'keywordsArr', 'articleTitle']
            writeObj = csv.DictWriter(file, fieldnames=fields, delimiter=',',lineterminator='\n')

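            # `website` is not defined in this function, so name the source explicitly;
            # fall back to empty lists because a failed scrape returns None
            for title, keywords in zip(allTitles or [], allKeywords or []):
                writeObj.writerow({'date':'{}'.format(currDate), 'website':'guardian', 'keywordsArr':'{}'.format(keywords),'articleTitle':'{}'.format(title)})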
                i += 1
    else:
        with open('guardianInfoXML.csv', 'a', encoding='utf-8') as file:
            i = 0
            fields = ['date','keywordsArr', 'articleTitle']
            writeObj = csv.DictWriter(file, fieldnames=fields, delimiter=',',lineterminator='\n')

            for title, keywords in zip(allTitles, allKeywords):
                writeObj.writerow({'date':'{}'.format(currDate),'keywordsArr':'{}'.format(keywords),'articleTitle':'{}'.format(title)})
                i += 1


def scrape(dir, website):
    if website != 'guardian':
        allArticles = getArticles(dir, website)
        if allArticles is not None:
            writeCSV(allArticles, dir, 0, website)
            # the original test `website == 'BBC' or 'guardian' or 'RT'` was always
            # true, so every site took the same branch; print unconditionally
            print('Downloaded articles from section: {} - {}'.format(website, dir))
        else:
            badscrapeMsg = 'Error could not scrape from section: {}'.format(dir)
            badscrape = list()
            badscrape.append(badscrapeMsg)
            writeCSV(badscrape, dir, 1, website)
            print('############ Failed to download articles from section: {} ############ '.format(dir))
    # If scraping from the guardian, slightly different format
    # due to the keywords list used
    if dir == 'titles':
        titlesList = getArticles('titles', 'guardian')
        return titlesList
    if dir == 'keywords':
        keywordsList = getArticles('keywords', 'guardian')
        return keywordsList


def BBCControl():
    for target in BBCArticleURLs:
        scrape(target, 'BBC')
        time.sleep(random.random())


def CNNControl():
    for target in CNNArticleURLs:
        scrape(target, 'CNN')
        time.sleep(random.random())


def RTControl():
    for target in RTArticleURLs:
        scrape(target, 'RT')
        time.sleep(random.random())


def guardianControl():
    titlesList = scrape('titles', 'guardian')
    keywordsList = scrape('keywords', 'guardian')

    if titlesList and keywordsList:
        writeGuardianCSV(titlesList, keywordsList, currDate, 0)
    else:
        writeGuardianCSV(titlesList, keywordsList, currDate, 1)

def main():
    threading.Thread(target=BBCControl).start()
    threading.Thread(target=CNNControl).start()
    threading.Thread(target=guardianControl).start()
    threading.Thread(target=RTControl).start()


if __name__ == '__main__':
    main()

############ Failed to download articles from section: News ############ 
HTTP Error 404: Not Found
############ Failed to download articles from section: Modi ############ 
539 article titles scraped
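
The TODO in `getArticles` notes that feed-level titles such as 'CNN.com - RSS' end up in the scraped list, because `root.iter('title')` visits every `<title>` element in the tree. One possible way around that, sketched below on a made-up RSS 2.0 snippet, is to read titles only from `<item>` elements. The sample feed and the helper name `item_titles` are assumptions for illustration; real feeds may need extra handling (namespaces, Atom format).

```python
import xml.etree.ElementTree as ET

# Made-up feed: one channel title, one image title, three items (one empty)
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed - RSS</title>
    <image><title>Example Feed logo</title></image>
    <item><title>First headline</title></item>
    <item><title>  Second headline </title></item>
    <item><title></title></item>
  </channel>
</rss>"""


def item_titles(xml_text):
    """Return stripped, non-empty titles of <item> elements only."""
    root = ET.fromstring(xml_text)
    titles = []
    for item in root.iter('item'):
        text = item.findtext('title')  # None if missing, '' if empty
        if text and text.strip():
            titles.append(text.strip())
    return titles


print(item_titles(SAMPLE_RSS))  # ['First headline', 'Second headline']
```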


In [None]:
allArticles

NameError: ignored

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import random
import os
from sklearn.metrics import accuracy_score
import pickle
import sys

class FactOrFeelModel(object):
	log_model = LogisticRegression()
	vectorizer = CountVectorizer()

	def __init__(self):
		# load the model from disk
		filename = 'finalized_model.sav'
		if (sys.version_info > (3, 0)): # if python3
			with open(filename,'rb') as f:
				self.log_model = pickle.load(f, encoding='latin1')
		else:
			self.log_model = pickle.load(open(filename, 'rb'))


		#load the vectorizer from the disk
		filename2 = 'vectorizer.sav'
		if (sys.version_info > (3, 0)):	# if python 3
			with open(filename2,'rb') as f:
				self.vectorizer = pickle.load(f, encoding='latin1')
		else:
			self.vectorizer = pickle.load(open(filename2, 'rb'))

	def example(self):
		text1 = "You should be proud of yourself"
		text2 = "The lab coat is white"
		data = [text1,text2]
		print(self.make_prediction(data))  # print_results was never defined

	# data can be of the form string or [string]
	# returns ['fact'] or ['feel']
	def make_prediction(self,data):
		prediction = ''
		if type(data) == str:
			data = [data]
			return self.log_model.predict(self.vectorizer.transform(data).toarray())
		elif type(data) == list:
			return self.log_model.predict(self.vectorizer.transform(data).toarray())
		else:
			raise ValueError("data must be either list of strings or a string but is of type " + str(type(data)))

	# text is a string
	# returns [percent fact, percent feel]; each prediction covers two sentences
	def evaluateText(self,text):
		factCounter = 0
		feelCounter = 0

		splitText = text.split('.')
		splitText.pop()  # drop the fragment after the final period
		splitText = [x+y for x,y in zip(splitText[0::2], splitText[1::2])] # each prediction is two sentences
		if not splitText:
			return [0, 0]  # fewer than two sentences: nothing to classify

		# reuse this instance instead of loading the pickled model a second time
		preds = self.make_prediction(splitText)

		for pred in preds:
			if isinstance(pred, bytes):  # labels unpickled from Python 2 may be bytes
				pred = pred.decode("utf-8")
			if pred == 'fact':
				factCounter+=1
			else:
				feelCounter+=1

		total = factCounter + feelCounter
		percentFacts = int(factCounter / total * 100)
		percentFeels = int(feelCounter / total * 100)

		return [percentFacts,percentFeels]

	def printEvaluations(self, percentages):
		print("facts: " + str(percentages[0]) + "% | feels: " + str(percentages[1]) + "% | Predictions accuracy: 73%")


if __name__ == "__main__":
	model = FactOrFeelModel()
	path = '/content/trump.txt'
	with open(path, 'r') as content_file:
		content = content_file.read()
		percentages = model.evaluateText(content)
		model.printEvaluations(percentages)

	# while(True):
	# 	data = raw_input("Enter a sentance. (type 'q' to quit)\n")
	# 	if data == "q":
	# 		break
	# 	print(model.make_prediction(data))

FileNotFoundError: ignored
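
The sentence pairing and percentage arithmetic in `evaluateText` can be tried without the pickled model. The sketch below uses hand-written labels in place of classifier output; `pair_sentences` and `label_percentages` are names invented for this example, and the pairing joins sentences with `'. '` for readability, which differs slightly from the plain concatenation in the class.

```python
def pair_sentences(text):
    """Split on periods and join consecutive sentences two at a time."""
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    # zip stops at the shorter slice, so an unpaired last sentence is dropped
    return [a + '. ' + b + '.' for a, b in zip(sentences[0::2], sentences[1::2])]


def label_percentages(labels):
    """Return [percent fact, percent feel], truncated to whole numbers."""
    total = len(labels)
    if total == 0:
        return [0, 0]  # avoid dividing by zero when nothing was classified
    facts = sum(1 for label in labels if label == 'fact')
    return [int(facts / total * 100), int((total - facts) / total * 100)]


print(pair_sentences("A. B. C. D. E."))               # ['A. B.', 'C. D.']
print(label_percentages(['fact', 'feel', 'feel']))    # [33, 66]
```

Because both percentages are truncated, they need not sum to 100.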

In [None]:
pip install spacy

In [None]:

article = '''
Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a
sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped
riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2
week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in
electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight
sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
European Union over Brexit, British Prime Minister Theresa May said on Monday.'''

import spacy

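# the 'en' shortcut link was removed in spaCy 3; load an installed model by its
# package name (assumes en_core_web_sm has been downloaded)
spacy_nlp = spacy.load('en_core_web_sm')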
document = spacy_nlp(article)

print('Original Sentence: %s' % (article))

for element in document.ents:
    print('Type: %s, Value: %s' % (element.label_, element))

In [None]:
import keras


class SkillsExtractorNN:

    def __init__(self, word_features_dim, dense_features_dim):

        lstm_input_phrase = keras.layers.Input(shape=(None, word_features_dim))
        lstm_input_cont = keras.layers.Input(shape=(None, word_features_dim))
        dense_input = keras.layers.Input(shape=(dense_features_dim,))

        lstm_emb_phrase = keras.layers.LSTM(256)(lstm_input_phrase)
        lstm_emb_phrase = keras.layers.Dense(128, activation='relu')(lstm_emb_phrase)

        lstm_emb_cont = keras.layers.LSTM(256)(lstm_input_cont)
        lstm_emb_cont = keras.layers.Dense(128, activation='relu')(lstm_emb_cont)

        dense_emb = keras.layers.Dense(512, activation='relu')(dense_input)
        dense_emb = keras.layers.Dense(256, activation='relu')(dense_emb)

        x = keras.layers.concatenate([lstm_emb_phrase, lstm_emb_cont, dense_emb])
        x = keras.layers.Dense(128, activation='relu')(x)
        x = keras.layers.Dense(64, activation='relu')(x)
        x = keras.layers.Dense(32, activation='relu')(x)

        main_output = keras.layers.Dense(2, activation='softplus')(x)

        self.model = keras.models.Model(inputs=[lstm_input_phrase, lstm_input_cont, dense_input],
                                        outputs=main_output)

        # `lr` is the deprecated spelling of this argument
        optimizer = keras.optimizers.Adam(learning_rate=0.0001)

        self.model.compile(optimizer=optimizer, loss='binary_crossentropy')

In [None]:
SkillsExtractorNN

In [None]:
# `fit` and `predict` take `self`: they are meant to be methods of SkillsExtractorNN
def fit(self, x_lstm_phrase, x_lstm_context, x_dense, y,
        val_split=0.25, patience=5, max_epochs=1000, batch_size=32):

    # pad variable-length sequences so each batch is rectangular
    x_lstm_phrase_seq = keras.preprocessing.sequence.pad_sequences(x_lstm_phrase)
    x_lstm_context_seq = keras.preprocessing.sequence.pad_sequences(x_lstm_context)

    y_onehot = onehot_transform(y)

    self.model.fit([x_lstm_phrase_seq, x_lstm_context_seq, x_dense],
                   y_onehot,
                   batch_size=batch_size,
                   epochs=max_epochs,
                   validation_split=val_split,
                   callbacks=[keras.callbacks.EarlyStopping(monitor='val_loss', patience=patience)])


def predict(self, x_lstm_phrase, x_lstm_context, x_dense):

    x_lstm_phrase_seq = keras.preprocessing.sequence.pad_sequences(x_lstm_phrase)
    x_lstm_context_seq = keras.preprocessing.sequence.pad_sequences(x_lstm_context)

    y = self.model.predict([x_lstm_phrase_seq, x_lstm_context_seq, x_dense])

    return y







In [None]:
import numpy as np


def onehot_transform(y):

    onehot_y = []

    for numb in y:
        onehot_arr = np.zeros(2)
        onehot_arr[numb] = 1
        onehot_y.append(np.array(onehot_arr))

    return np.array(onehot_y)

In [None]:
from urllib.request import urlopen
url = urlopen("http://venturebeat.com/2014/07/04/facebooks-little-social-experiment-got-you-bummed-out-get-over-it/")
#contents = url.read()
html = url.read()
html[:500]


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')




In [None]:
print(soup.get_text())


In [None]:
# requires the third-party `readability-lxml` package
from readability.readability import Document
from bs4 import BeautifulSoup

doc = Document(html)  # parse once, reuse for both the summary and the title
readable_article = doc.summary()
readable_title = doc.title()
soup = BeautifulSoup(readable_article, 'html.parser')  # name the parser explicitly
print('*** TITLE *** \n\"' + readable_title + '\"\n')
print('*** CONTENT *** \n\"' + soup.text[:500] + '[...]\"')

ModuleNotFoundError: ignored

In [None]:
import nltk
tokens = [word for sent in nltk.sent_tokenize(soup.text) for word in nltk.word_tokenize(sent)]

for token in sorted(set(tokens))[:30]:
    print(token + ' [' + str(tokens.count(token)) + ']')

In [None]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
stemmed_tokens = [stemmer.stem(t) for t in tokens]

for token in sorted(set(stemmed_tokens))[50:75]:
    print(token+ ' [' + str(stemmed_tokens.count(token)) + ']')

In [None]:

import requests
import nltk

In [None]:
page = requests.get('https://qz.com/africa/latest')
soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
weblinks = soup.find_all('article')
pagelinks = []
for link in weblinks[5:]:
      url = link.contents[0].find_all('a')[0]
      pagelinks.append('http://qz.com'+url.get('href'))

In [None]:
from newspaper import Article

In [None]:
pip install newspaper3k

Collecting newspaper3k
  Downloading newspaper3k-0.2.8-py3-none-any.whl (211 kB)
Collecting cssselect>=0.9.2 (from newspaper3k)
  Downloading cssselect-1.2.0-py2.py3-none-any.whl (18 kB)
Collecting feedparser>=5.2.1 (from newspaper3k)
  Downloading feedparser-6.0.10-py3-none-any.whl (81 kB)
Collecting tldextract>=2.0.1 (from newspaper3k)
  Downloading tldextract-3.4.4-py3-none-any.whl (93 kB)

In [None]:
url1 = 'https://www.vox.com/platform/amp/2019/11/26/20983690/trump-impeachment-hearings-women-poll-2020-democrats'
article1 = Article(url1)

url2 = 'https://amp.cnn.com/cnn/2019/11/26/politics/trump-cnn-impeachment-poll/index.html'
article2 = Article(url2)

url3 = 'https://amp.usatoday.com/amp/4305749002'
article3 = Article(url3)

url4 = 'https://mobile.reuters.com/article/amp/idUSKBN1Y02MO'
article4 = Article(url4)


In [None]:
article1.download()

article2.download()

article3.download()

article4.download()




In [None]:
article2.html

'<!doctype html>\n<html amp lang="en">\n  <head>\n    <meta charset="utf-8">\n    <title>The new Donald Trump impeachment poll is not good news for the President - CNNPolitics</title>\n    <link rel="shortcut icon" href="/static/cnn-favicon.png">\n    <meta property="og:pubdate" content="2019-11-26T21:18:54Z">\n    <meta property="og:url" content="https://www.cnn.com/2019/11/26/politics/trump-cnn-impeachment-poll/index.html" >\n    <meta property="og:title" content="No, the new CNN poll is not good news for Donald Trump on impeachment">\n    <meta property="og:description" content="A new CNN poll shows that half the country believes that President Donald Trump should be not only impeached by the House, but also removed from office by the Senate." >\n    <meta property="og:site_name" content="CNN" >\n    <meta property="og:type" content="article" >\n    <meta property="og:image" content="https://cdn.cnn.com/cnnnext/dam/assets/191002164147-11-trump-impeachment-inquiry-super-tease.jpg">\n

In [None]:
article1.parse()






In [None]:
article2.parse()

In [None]:
article3.parse()

In [None]:
article4.parse()

In [None]:
Doc1 = article1.text
Doc2 = article2.text
Doc3 = article3.text
Doc4 = article4.text


article1.text
Doc = (Doc1,Doc2,Doc3,Doc4)

In [None]:
article3.nlp()
article3.keywords
article3.summary
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
sentence_list1 = nltk.sent_tokenize(article1.text)
sentence_list2 = nltk.sent_tokenize(article2.text)
sentence_list3 = nltk.sent_tokenize(article3.text)
sentence_list4 = nltk.sent_tokenize(article4.text)

sentence_list=(sentence_list1,sentence_list2,sentence_list3,sentence_list4)

out = [item for t in sentence_list for item in t]
out

['Even after last week’s extensive public hearings, the American public remains split on impeaching President Trump, with 50 percent in support of the process and 43 percent against it.',
 'But some groups of Americans are a lot more supportive of impeachment than others.',
 'In a CNN poll released Tuesday, a full 61 percent of women were in favor of impeaching Trump while 34 percent were against it.',
 'Among men, by contrast, just 40 percent support impeachment and 53 percent oppose it.',
 'The gender gap also shows up in Trump’s approval ratings: 52 percent of men approve of how the president is doing compared with 32 percent of women, according to the CNN poll.',
 'One big factor at work in the gender divide is party identification.',
 'Women are more likely than men to be Democrats, and “one of the strongest drivers of support for impeachment is partisanship,” Kelly Dittmar, an assistant professor of political science and scholar at the Center for American Women and Politics, told

In [None]:
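stopwords = set(nltk.corpus.stopwords.words('english'))

# Count non-stopword tokens over every article in Doc. The original loop
# re-tokenized Doc3 on each pass and reset the counts, so only Doc3 was counted.
# Tokens are lowercased so they match the lowercased lookups used when scoring.
word_frequencies = {}
for art in Doc:
    for word in nltk.word_tokenize(art.lower()):
        if word not in stopwords:
            word_frequencies[word] = word_frequencies.get(word, 0) + 1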






In [None]:
maximum_frequency = max(word_frequencies.values())

# normalize counts so the most frequent word has weight 1.0
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequency)

In [None]:
sentence_scores = {}
#change out
for sent in sentence_list1:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

In [None]:
import heapq
summary_sentences = heapq.nlargest(40, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
print(summary)

article = summary


But then came James Comey’s letter regarding Hillary Clinton’s emails, and Trump, obviously, ended up winning the election — with, famously, 53 percent of white female voters supporting him. So for Democrats, the poll numbers are a sign that among one key voting bloc, at least, there’s little political cost to proceeding with impeachment. Among women, however, the numbers are much higher and growing, with 61 percent now supporting impeachment compared with 56 percent in October and 51 percent in May. “If you’re a Democrat, women are effectively your base,” Dittmar said — especially black women, who voted for Democrats in overwhelming majorities in 2016 and 2018. Since then, he’s mocked Christine Blasey Ford for her testimony that Brett Kavanaugh, now a Supreme Court Justice, sexually assaulted her when the two were in high school. There’s no reason, of course, why male voters can’t care about sexual misconduct allegations, too. For one, Dittmar said, they send a signal to Democrats tha
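
The summarization cells above follow a common frequency-based recipe: weight each word by its normalized count, score each short sentence by summing the weights of its words, and keep the highest-scoring sentences. The sketch below reproduces that recipe with the standard library only, so it runs without NLTK downloads. The regex tokenizer, the tiny `STOPWORDS` set, and the function name `summarize` are simplifications assumed for this example, not what the notebook uses.

```python
import heapq
import re

# Tiny illustrative stopword list (the notebook uses NLTK's English list)
STOPWORDS = {'the', 'a', 'of', 'and', 'to', 'in', 'is', 'on', 'it'}


def summarize(text, max_sentences=2, max_words=30):
    """Return the highest-scoring sentences under a word-frequency score."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    if not words:
        return []

    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    top = max(counts.values())
    weights = {w: c / top for w, c in counts.items()}  # most frequent word -> 1.0

    scores = {}
    for sent in sentences:
        if len(sent.split()) < max_words:  # skip very long sentences
            sent_words = re.findall(r"[a-z']+", sent.lower())
            scores[sent] = sum(weights.get(w, 0.0) for w in sent_words)
    return heapq.nlargest(max_sentences, scores, key=scores.get)


sample = "Cats sleep a lot. Cats chase mice and cats climb trees. Dogs bark."
print(summarize(sample, max_sentences=1))
```

Summing weights favors longer sentences, which is one reason the notebook caps sentence length at 30 words.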

In [None]:
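# assumes the third-party `textteaser` package is already installed;
# the original `pip install -update _version` line is not a valid pip command
from textteaser import TextTeaser
tt = TextTeaser()
# use the parsed article's own title (`title` is not defined at this point)
tt.summarize(article2.title, Doc2)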

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
nltk.download('stopwords')
# tokenization
X_list = word_tokenize(Doc1)
Y_list = word_tokenize(Doc4)

In [None]:
# sw contains the list of stopwords
sw = stopwords.words('english')
l1 = []
l2 = []

In [None]:
# remove stop words from string
X_set = {w for w in X_list if w not in sw}
Y_set = {w for w in Y_list if w not in sw}

In [None]:
# form a set containing keywords of both strings
rvector = X_set.union(Y_set)
for w in rvector:
    # build binary presence vectors over the shared vocabulary
    l1.append(1 if w in X_set else 0)
    l2.append(1 if w in Y_set else 0)

# cosine formula: dot product divided by the product of the vector norms
# (for 0/1 vectors, the squared norm equals the sum of the entries)
c = sum(a * b for a, b in zip(l1, l2))
cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
print("similarity between article 3,4: ", cosine)
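
For binary presence vectors, the cosine above reduces to the number of shared tokens divided by the square root of the product of the two vocabulary sizes. A minimal sketch of that shortcut follows; the helper name `binary_cosine` and the toy token lists are assumptions made for illustration.

```python
import math


def binary_cosine(tokens_a, tokens_b):
    """Cosine similarity of two token lists treated as 0/1 presence vectors."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    # An empty document has a zero vector; define its similarity as 0.
    if not set_a or not set_b:
        return 0.0
    # Dot product of 0/1 vectors = number of shared tokens.
    shared = len(set_a & set_b)
    # Norm of a 0/1 vector = sqrt(number of distinct tokens).
    return shared / math.sqrt(len(set_a) * len(set_b))


toy_similarity = binary_cosine(
    ["graph", "minors", "survey"],
    ["graph", "trees", "survey", "paths"],
)
print(toy_similarity)
```

This version also sidesteps the division by zero that the list-based cell would hit if either document were empty after stop-word removal.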

In [None]:
authorname = []
title = []
thearticle = []

# get the article's authors, title, and text
aname = article.authors
thetitle = article.title
articletext = article.text

# store the article text and its metadata
thearticle.append(articletext)
authorname.append(aname)
title.append(thetitle)
# join the stored texts into one string (joining each text on its own
# would insert a space between every character)
myarticle = ' '.join(thearticle)

In [None]:

articletext


In [None]:
import pandas as pd

# save article data to file
data = {'Title':article.title,
        'Author':article.authors,
        'PageLink':url,
        'Article':article.text,
        'Date':article.publish_date,
        'Summary':article.summary,
        'Keywords':listToStr}

filename = '/content/news.xls'
wks_name = 'Data'
cols = ['Title', 'Author', 'PageLink', 'Article', 'Date', 'Summary','Keywords']

oldnews = pd.read_excel(filename)
news = pd.DataFrame(data=data)[cols]
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
afronews = pd.concat([oldnews, news])
afronews.drop_duplicates(subset='Title', keep='last', inplace=True)
afronews.reset_index(drop=True, inplace=True)
# the context manager writes the file on exit (ExcelWriter.save was removed in pandas 2.0)
with pd.ExcelWriter(filename) as writer:
    afronews.to_excel(writer, sheet_name=wks_name, index=False)

In [None]:
#!python -m spacy link en_core_web_sm en
!python -m spacy download en_core_web_lg


In [None]:
!python -m spacy link en_core_web_lg

In [None]:
pip install textacy

In [None]:
!python -m textacy download depeche_mood

In [None]:
!python -m textacy --help

In [None]:
#doc = textacy.make_spacy_doc(text)
from pathlib import Path

In [None]:
#!python -m spacy download en_core_web_lg
#!pip install -U spacy download en_core_web_sm
import spacy
import textacy
text = Doc2




import sys



In [None]:

#import neuralcoref
nlp = spacy.load('en_core_web_lg')
#neuralcoref.add_to_pipe(nlp)

In [None]:

doc = nlp1(article3.summary)

In [None]:
nlp1 = spacy.load('en_core_web_lg')
doc = nlp1(text)

In [None]:
print(doc)

William Cummings | USA TODAYHannah Gaber, USA TODAYSupport for President Donald Trump's impeachment remains at about 50% despite two weeks of testimony in public hearings that Democrats felt strongly bolstered their case, according to a CNN poll released Tuesday.
Of those who support impeachment, 91% said they "strongly" felt that way, a one percentage point increase from the month before.
Among those who oppose impeachment, the number who strongly felt that way climbed from 86% to 89%.
As has been the case with virtually every poll on impeachment, the numbers were clearly divided by party affiliation, race and gender.
View | 88 PhotosDonald Trump's impeachment inquiry related to Ukraine in pictures


In [None]:
print("Named Entities")
for entity in doc.ents:
    print(f" - {entity.text} ({entity.label_})")

In [None]:
svos = textacy.extract.subject_verb_object_triples(doc)

# Print the results
print("Subject, verb, object tuples:")

for svo in svos:
    subject, verb, obj = svo  # `obj` avoids shadowing the built-in `object`
    print(f" - {svo}")

In [None]:
import spacy
import textacy.extract

In [None]:
nlp = spacy.load('en_core_web_lg')

In [None]:
document = nlp(text)
document

In [None]:
statements = textacy.extract.semistructured_statements(doc, "President")

print("**** Information on Trump ****")
for count, statement in enumerate(statements, start=1):
    subject, verb, fact = statement
    # print() does not accept `cue` or `ignore_entity_case`; those are
    # keyword arguments of semistructured_statements, so they are dropped here
    print(f"{count} - Statement: ", statement)
    print(f"{count} - Fact: ", fact)

for entity in doc.ents:
    print(f"{entity.text} ({entity.label_})")

In [None]:
text = Path("/content/trump.txt").read_text()
text

In [None]:
# Parse the document with spaCy
doc = nlp(text)
doc

In [None]:
statements = textacy.extract.semistructured_statements(doc, "Democrats")

In [None]:
print("Here are the things I know about Trump:\n")

for statement in statements:
    subject, verb, fact = statement
    print(f"- {fact}")

In [None]:
# Print the results
print("Here are the facts on Donald Trump :")
for token in doc:
    verb1 = token.lemma_
    #if token.pos_ == 'VERB':
    # Extract semi-structured statements, using each token's lemma as the cue
    statements = textacy.extract.semistructured_statements(
        doc, "Trump", cue=token.lemma_, ignore_entity_case=True
    )
    for statement in statements:
        entity, verb, fact = statement
        print(" - ", verb1, " " + token.text + " " + str(fact))

In [None]:
pip install textteaser

In [None]:
from gensim.test.utils import common_dictionary, common_corpus
from gensim.models import LsiModel
model = LsiModel(common_corpus, id2word=common_dictionary)
vectorized_corpus = model[common_corpus]

In [None]:
list(textacy.extract.semistructured_statements(doc, "Trump", cue=token.lemma_, ignore_entity_case=True))



In [None]:
# semistructured_statements expects a parsed spaCy Doc, so parse the raw text first
statements = textacy.extract.semistructured_statements(nlp(Doc2), "Trump")

print("This text is about: ")
for statement in statements:
    subject, verb, point = statement
    print(f':{point}')

In [None]:
nlp = spacy.load('en_core_web_lg')
from pathlib import Path

text = Path("/content/trump.txt").read_text()

doc = textacy.make_spacy_doc(text,lang='en_core_web_sm')



In [None]:
statements = textacy.extract.semistructured_statements(doc, "Impeachment")
statements

In [None]:


for statement in statements:
    subject,verb,fact=statement
    print(fact)






In [None]:
pip install textacy

In [None]:
pip install -U spacy==2.1.8

In [None]:

import datetime
from datetime import timedelta  # used below to extend the date axis
import os
import sys
import warnings

import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np
import pandas as pd
import requests
import seaborn as sns
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from sklearn import preprocessing, svm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.preprocessing import MinMaxScaler
from tqdm import tqdm
#from yahoofinancials import YahooFinancials
if not sys.warnoptions:
    warnings.simplefilter('ignore')
sns.set()
tf.compat.v1.random.set_random_seed(1234)
# To grab stock data
import yfinance as fyf
from pandas_datareader import data as pdr
fyf.pdr_override()
stocks = ["AAPL"] # If you want to grab multiple stocks add more labels to this list

# Set start and end dates
start = datetime.datetime(1995, 1, 1)
end   = datetime.datetime.now()

# Grab data
df = pdr.get_data_yahoo(stocks, start = start, end = end)
print(df.head())

# select Close by name: with Date as the index, position 4 is 'Adj Close'
minmax = MinMaxScaler().fit(df[['Close']].astype('float32'))
df_log = minmax.transform(df[['Close']].astype('float32'))
df_log = pd.DataFrame(df_log)
print(df_log.head())

simulation_size = 6
num_layers = 1
size_layer = 128
timestamp = 5
epoch = 300
dropout_rate = 0.8
test_size = 30
learning_rate = 0.01

df_train = df_log
print(df.shape, df_train.shape)


class Model:
    def __init__(
            self,
            learning_rate,
            num_layers,
            size,
            size_layer,
            output_size,
            forget_bias=0.1,
    ):
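        def lstm_cell(size_layer):
            # tf.keras.layers.LSTMCell does not accept state_is_tuple,
            # so use the TF 1.x cell, as in the later Model definition
            return tf.nn.rnn_cell.LSTMCell(size_layer, state_is_tuple=False)

        rnn_cells = tf.nn.rnn_cell.MultiRNNCell(
            [lstm_cell(size_layer) for _ in range(num_layers)],
            state_is_tuple=False,
        )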
        self.X = tf.placeholder(tf.float32, (None, None, size))
        self.Y = tf.placeholder(tf.float32, (None, output_size))
        drop = tf.contrib.rnn.DropoutWrapper(
            rnn_cells, output_keep_prob=forget_bias
        )
        self.hidden_layer = tf.placeholder(
            tf.float32, (None, num_layers * 2 * size_layer)
        )
        self.outputs, self.last_state = tf.nn.dynamic_rnn(
            drop, self.X, initial_state=self.hidden_layer, dtype=tf.float32
        )
        self.logits = tf.layers.dense(self.outputs[-1], output_size)
        self.cost = tf.reduce_mean(tf.square(self.Y - self.logits))
        self.optimizer = tf.train.AdamOptimizer(learning_rate).minimize(
            self.cost
        )


def calculate_accuracy(real, predict):
    real = np.array(real) + 1
    predict = np.array(predict) + 1
    percentage = 1 - np.sqrt(np.mean(np.square((real - predict) / real)))
    return percentage * 100


def anchor(signal, weight):
    buffer = []
    last = signal[0]
    for i in signal:
        smoothed_val = last * weight + (1 - weight) * i
        buffer.append(smoothed_val)
        last = smoothed_val
    return buffer


def forecast():
    tf.compat.v1.reset_default_graph()
    modelnn = Model(learning_rate, num_layers, df_log.shape[1], size_layer, df_log.shape[1], dropout_rate)
    sess = tf.InteractiveSession()
    sess.run(tf.global_variables_initializer())
    # dates live in the index here; column 0 is 'Open'
    date_ori = pd.to_datetime(df.index).tolist()

    pbar = tqdm(range(epoch), desc='train loop')
    for i in pbar:
        init_value = np.zeros((1, num_layers * 2 * size_layer))
        total_loss, total_acc = [], []
        for k in range(0, df_train.shape[0] - 1, timestamp):
            index = min(k + timestamp, df_train.shape[0] - 1)
            batch_x = np.expand_dims(
                df_train.iloc[k: index, :].values, axis=0
            )
            batch_y = df_train.iloc[k + 1: index + 1, :].values
            logits, last_state, _, loss = sess.run(
                [modelnn.logits, modelnn.last_state, modelnn.optimizer, modelnn.cost],
                feed_dict={
                    modelnn.X: batch_x,
                    modelnn.Y: batch_y,
                    modelnn.hidden_layer: init_value,
                },
            )
            init_value = last_state
            total_loss.append(loss)
            total_acc.append(calculate_accuracy(batch_y[:, 0], logits[:, 0]))
        pbar.set_postfix(cost=np.mean(total_loss), acc=np.mean(total_acc))

    future_day = test_size

    output_predict = np.zeros((df_train.shape[0] + future_day, df_train.shape[1]))
    output_predict[0] = df_train.iloc[0]
    upper_b = (df_train.shape[0] // timestamp) * timestamp
    init_value = np.zeros((1, num_layers * 2 * size_layer))

    for k in range(0, (df_train.shape[0] // timestamp) * timestamp, timestamp):
        out_logits, last_state = sess.run(
            [modelnn.logits, modelnn.last_state],
            feed_dict={
                modelnn.X: np.expand_dims(
                    df_train.iloc[k: k + timestamp], axis=0
                ),
                modelnn.hidden_layer: init_value,
            },
        )
        init_value = last_state
        output_predict[k + 1: k + timestamp + 1] = out_logits

    if upper_b != df_train.shape[0]:
        out_logits, last_state = sess.run(
            [modelnn.logits, modelnn.last_state],
            feed_dict={
                modelnn.X: np.expand_dims(df_train.iloc[upper_b:], axis=0),
                modelnn.hidden_layer: init_value,
            },
        )
        output_predict[upper_b + 1: df_train.shape[0] + 1] = out_logits
        future_day -= 1
        date_ori.append(date_ori[-1] + timedelta(days=1))

    init_value = last_state

    for i in range(future_day):
        o = output_predict[-future_day - timestamp + i:-future_day + i]
        out_logits, last_state = sess.run(
            [modelnn.logits, modelnn.last_state],
            feed_dict={
                modelnn.X: np.expand_dims(o, axis=0),
                modelnn.hidden_layer: init_value,
            },
        )
        init_value = last_state
        output_predict[-future_day + i] = out_logits[-1]
        date_ori.append(date_ori[-1] + timedelta(days=1))

    output_predict = minmax.inverse_transform(output_predict)
    deep_future = anchor(output_predict[:, 0], 0.4)

    return deep_future


results = []
for i in range(simulation_size):
    print('simulation %d'%(i + 1))
    results.append(forecast())
print(results)
# dates live in the index here; column 0 is 'Open'
date_ori = pd.to_datetime(df.index).tolist()
for i in range(test_size):
    date_ori.append(date_ori[-1] + timedelta(days = 1))
date_ori = pd.Series(date_ori).dt.strftime(date_format = '%Y-%m-%d').tolist()
print(date_ori[-5:])

accepted_results = []
for r in results:
    if (np.array(r[-test_size:]) < np.min(df['Close'])).sum() == 0 and \
    (np.array(r[-test_size:]) > np.max(df['Close']) * 2).sum() == 0:
        accepted_results.append(r)
print(len(accepted_results))

accuracies = [calculate_accuracy(df['Close'].values, r[:-test_size]) for r in accepted_results]

plt.figure(figsize = (15, 5))
for no, r in enumerate(accepted_results):
    plt.plot(r, label = 'forecast %d'%(no + 1))
plt.plot(df['Close'], label = 'true trend', c = 'black')
plt.legend()
plt.title('average accuracy: %.4f'%(np.mean(accuracies)))

x_range_future = np.arange(len(results[0]))
plt.xticks(x_range_future[::30], date_ori[::30])

plt.show()

[*********************100%***********************]  1 of 1 completed
                Open      High       Low     Close  Adj Close     Volume
Date                                                                    
1995-01-03  1.388393  1.388393  1.352679  1.370536   1.179026   25967200
1995-01-04  1.379464  1.415179  1.379464  1.406250   1.209749   39670400
1995-01-05  1.401786  1.406250  1.383929  1.388393   1.194387   18410000
1995-01-06  1.486607  1.540179  1.468750  1.500000   1.290400  269155600
1995-01-09  1.486607  1.495536  1.464286  1.471540   1.265917   68521600
          0
0  0.002828
1  0.002940
2  0.002884
3  0.003233
4  0.003144
(6283, 6) (6283, 1)
simulation 1


TypeError: ignored

In [None]:
pip install yfinance

Collecting yfinance
  Downloading https://files.pythonhosted.org/packages/c2/31/8b374a12b90def92a4e27d0fc595fc43635f395984e36a075244d98bd265/yfinance-0.1.54.tar.gz
Building wheels for collected packages: yfinance
  Building wheel for yfinance (setup.py) ... done
  Created wheel for yfinance: filename=yfinance-0.1.54-py2.py3-none-any.whl size=22411 sha256=4801d20ee2c9af5ea6426250c40e53ceb3adf5b53c38fc30a78297aa24c4f46f
  Stored in directory: /root/.cache/pip/wheels/f9/e3/5b/ec24dd2984b12d61e0abf26289746c2436a0e7844f26f2515c
Successfully built yfinance
Installing collected packages: yfinance
Successfully installed yfinance-0.1.54


In [None]:
import datetime
from datetime import timedelta
import os
import sys
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
from tqdm import tqdm

sns.set()
tf.compat.v1.random.set_random_seed(1234)



In [None]:


from tqdm import tqdm
if not sys.warnoptions:
    warnings.simplefilter('ignore')
sns.set()
tf.compat.v1.random.set_random_seed(1234)
# To grab stock data
import yfinance as fyf
from pandas_datareader import data as pdr
fyf.pdr_override()
stocks = ["AAPL"] # If you want to grab multiple stocks add more labels to this list
#stocks = ["AHEALTH"]
# Set start and end dates
start = datetime.datetime(2000, 1, 1)
end   = datetime.datetime.now()

# Grab data
#df = pdr.get_data_yahoo(stocks, start = start, end = end)
df = pd.read_csv("/content/WTK.csv")
print(df.tail())

            Date  Open  High   Low  Close  Adj Close   Volume
2469  11-25-2016  1.03  1.05  1.02   1.05       1.05   854400
2470  11-28-2016  1.03  1.04  1.02   1.04       1.04  1223700
2471  11-29-2016  1.04  1.04  1.02   1.04       1.04   151300
2472  11-30-2016  1.02  1.02  1.01   1.01       1.01   893700
2473  12-01-2016  1.01  1.02  1.01   1.02       1.02   386500


In [None]:
df['Date'] = df['Date'].astype('datetime64[ns]')
df.info()
df = df.sort_values(by=['Date'])

df.tail()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2474 entries, 765 to 1032
Data columns (total 7 columns):
Date         2474 non-null datetime64[ns]
Open         2474 non-null float64
High         2473 non-null float64
Low          2474 non-null object
Close        2474 non-null float64
Adj Close    2474 non-null float64
Volume       2474 non-null int64
dtypes: datetime64[ns](1), float64(4), int64(1), object(1)
memory usage: 154.6+ KB


Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
1007,2020-01-27,0.515,0.515,0.51,0.51,0.51,192800
1006,2020-01-28,0.505,0.505,0.485,0.49,0.49,1520300
1005,2020-01-29,0.5,0.515,0.5,0.51,0.51,456200
1004,2020-01-30,0.51,0.51,0.495,0.495,0.495,320900
1032,2020-01-31,0.495,0.495,0.485,0.485,0.485,798200


In [None]:
#df = df.set_index('Date')
df.to_csv("/content/WTK_.csv")

In [None]:
minmax = MinMaxScaler().fit(df.iloc[:, 4:5].astype('float32')) # Close index
df_log = minmax.transform(df.iloc[:, 4:5].astype('float32')) # Close index
df_log = pd.DataFrame(df_log)
df_log.head()

Unnamed: 0,0
0,0.413174
1,0.431138
2,0.437126
3,0.413174
4,0.419162
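
`MinMaxScaler` maps each value to `(x - min) / (max - min)`, and `inverse_transform` undoes that mapping once forecasting is done. The sketch below reproduces the arithmetic in plain Python as an illustration only; the function names and `toy_closes` are hypothetical, and the notebook itself relies on scikit-learn.

```python
def fit_minmax(values):
    """Return the (min, max) pair that defines the scaling."""
    return min(values), max(values)


def transform(values, low, high):
    """Scale values into [0, 1]; a constant series maps to 0.0."""
    span = high - low
    return [(v - low) / span if span else 0.0 for v in values]


def inverse_transform(scaled, low, high):
    """Map scaled values back to the original price range."""
    return [s * (high - low) + low for s in scaled]


toy_closes = [1.0, 1.5, 2.0]
toy_low, toy_high = fit_minmax(toy_closes)
toy_scaled = transform(toy_closes, toy_low, toy_high)
toy_restored = inverse_transform(toy_scaled, toy_low, toy_high)
print(toy_scaled, toy_restored)
```

Forecast values that fall outside `[0, 1]` in scaled space still invert cleanly; they simply land outside the historical price range.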


In [None]:

simulation_size = 10
num_layers = 1
size_layer = 128
timestamp = 5
epoch = 500
dropout_rate = 0.8
test_size = 7
learning_rate = 0.01

df_train = df_log
df.shape, df_train.shape

((2474, 6), (2474, 1))

In [None]:
class Model:
    def __init__(
        self,
        learning_rate,
        num_layers,
        size,
        size_layer,
        output_size,
        forget_bias = 0.1,
    ):
        def lstm_cell(size_layer):
            return tf.nn.rnn_cell.LSTMCell(size_layer, state_is_tuple = False)

        rnn_cells = tf.nn.rnn_cell.MultiRNNCell(
            [lstm_cell(size_layer) for _ in range(num_layers)],
            state_is_tuple = False,
        )
        self.X = tf.placeholder(tf.float32, (None, None, size))
        self.Y = tf.placeholder(tf.float32, (None, output_size))
        drop = tf.contrib.rnn.DropoutWrapper(
            rnn_cells, output_keep_prob = forget_bias
        )
        self.hidden_layer = tf.placeholder(
            tf.float32, (None, num_layers * 2 * size_layer)
        )
        self.outputs, self.last_state = tf.nn.dynamic_rnn(
            drop, self.X, initial_state = self.hidden_layer, dtype = tf.float32
        )
        self.logits = tf.layers.dense(self.outputs[-1], output_size)
        self.cost = tf.reduce_mean(tf.square(self.Y - self.logits))
        self.optimizer = tf.train.AdamOptimizer(learning_rate).minimize(
            self.cost
        )

def calculate_accuracy(real, predict):
    real = np.array(real) + 1
    predict = np.array(predict) + 1
    percentage = 1 - np.sqrt(np.mean(np.square((real - predict) / real)))
    return percentage * 100

def anchor(signal, weight):
    buffer = []
    last = signal[0]
    for i in signal:
        smoothed_val = last * weight + (1 - weight) * i
        buffer.append(smoothed_val)
        last = smoothed_val
    return buffer
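
To see what these two helpers compute, here is a small pure-Python sketch of the same arithmetic: `calculate_accuracy` is 100 times one minus the root-mean-square percentage error on values shifted by 1, and `anchor` is exponential smoothing. The names `rmspe_accuracy` and `smooth` and the toy inputs are assumptions for illustration.

```python
import math


def rmspe_accuracy(real, predict):
    """100 * (1 - RMS percentage error), with both series shifted by +1."""
    # The +1 shift keeps the division well-defined for values near zero.
    pairs = [(r + 1, p + 1) for r, p in zip(real, predict)]
    mean_sq = sum(((r - p) / r) ** 2 for r, p in pairs) / len(pairs)
    return (1 - math.sqrt(mean_sq)) * 100


def smooth(signal, weight):
    """Exponential smoothing: each output blends the previous output and the input."""
    out = []
    last = signal[0]
    for value in signal:
        last = last * weight + (1 - weight) * value
        out.append(last)
    return out


toy_accuracy = rmspe_accuracy([1.0, 1.0], [1.0, 1.2])
toy_smoothed = smooth([1.0, 2.0, 3.0], 0.4)
print(toy_accuracy, toy_smoothed)
```

A larger `weight` makes the smoothed curve lag further behind the raw forecast; the notebook uses 0.4.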

In [None]:
def forecast():
    tf.reset_default_graph()
    modelnn = Model(
        learning_rate, num_layers, df_log.shape[1], size_layer, df_log.shape[1], dropout_rate
    )
    sess = tf.InteractiveSession()
    sess.run(tf.global_variables_initializer())
    date_ori = pd.to_datetime(df.iloc[:, 0]).tolist()

    pbar = tqdm(range(epoch), desc = 'train loop')
    for i in pbar:
        init_value = np.zeros((1, num_layers * 2 * size_layer))
        total_loss, total_acc = [], []
        for k in range(0, df_train.shape[0] - 1, timestamp):
            index = min(k + timestamp, df_train.shape[0] - 1)
            batch_x = np.expand_dims(
                df_train.iloc[k : index, :].values, axis = 0
            )
            batch_y = df_train.iloc[k + 1 : index + 1, :].values
            logits, last_state, _, loss = sess.run(
                [modelnn.logits, modelnn.last_state, modelnn.optimizer, modelnn.cost],
                feed_dict = {
                    modelnn.X: batch_x,
                    modelnn.Y: batch_y,
                    modelnn.hidden_layer: init_value,
                },
            )
            init_value = last_state
            total_loss.append(loss)
            total_acc.append(calculate_accuracy(batch_y[:, 0], logits[:, 0]))
        pbar.set_postfix(cost = np.mean(total_loss), acc = np.mean(total_acc))

    future_day = test_size

    output_predict = np.zeros((df_train.shape[0] + future_day, df_train.shape[1]))
    output_predict[0] = df_train.iloc[0]
    upper_b = (df_train.shape[0] // timestamp) * timestamp
    init_value = np.zeros((1, num_layers * 2 * size_layer))

    for k in range(0, (df_train.shape[0] // timestamp) * timestamp, timestamp):
        out_logits, last_state = sess.run(
            [modelnn.logits, modelnn.last_state],
            feed_dict = {
                modelnn.X: np.expand_dims(
                    df_train.iloc[k : k + timestamp], axis = 0
                ),
                modelnn.hidden_layer: init_value,
            },
        )
        init_value = last_state
        output_predict[k + 1 : k + timestamp + 1] = out_logits

    if upper_b != df_train.shape[0]:
        out_logits, last_state = sess.run(
            [modelnn.logits, modelnn.last_state],
            feed_dict = {
                modelnn.X: np.expand_dims(df_train.iloc[upper_b:], axis = 0),
                modelnn.hidden_layer: init_value,
            },
        )
        output_predict[upper_b + 1 : df_train.shape[0] + 1] = out_logits
        future_day -= 1
        date_ori.append(date_ori[-1] + timedelta(days = 1))

    init_value = last_state

    for i in range(future_day):
        o = output_predict[-future_day - timestamp + i:-future_day + i]
        out_logits, last_state = sess.run(
            [modelnn.logits, modelnn.last_state],
            feed_dict = {
                modelnn.X: np.expand_dims(o, axis = 0),
                modelnn.hidden_layer: init_value,
            },
        )
        init_value = last_state
        output_predict[-future_day + i] = out_logits[-1]
        date_ori.append(date_ori[-1] + timedelta(days = 1))

    output_predict = minmax.inverse_transform(output_predict)
    deep_future = anchor(output_predict[:, 0], 0.4)

    return deep_future
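
The training loop in `forecast()` walks the series in windows of `timestamp` rows and pairs each window with the same window shifted one step ahead, so the network learns next-step prediction. A minimal sketch of just that windowing follows, with a hypothetical helper `make_batches` and a toy series.

```python
def make_batches(series, timestamp):
    """Pair each window of inputs with the window shifted one step ahead."""
    n = len(series)
    batches = []
    for k in range(0, n - 1, timestamp):
        # Clamp the window end so the shifted targets stay inside the series.
        index = min(k + timestamp, n - 1)
        inputs = series[k:index]
        targets = series[k + 1:index + 1]
        batches.append((inputs, targets))
    return batches


toy_series = [10, 11, 12, 13, 14, 15, 16]
toy_batches = make_batches(toy_series, 3)
for inputs, targets in toy_batches:
    print(inputs, "->", targets)
```

The windows do not overlap, and the LSTM state carried between them (`init_value = last_state`) is what links one window to the next.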

In [None]:
results = []
for i in range(simulation_size):
    print('simulation %d'%(i + 1))
    results.append(forecast())

simulation 1
Instructions for updating:
This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0.
Instructions for updating:
This class is equivalent as tf.keras.layers.StackedRNNCells, and will be replaced by that in Tensorflow 2.0.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
Instructions for updating:
Please use `layer.add_weight` method instead.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Use keras.layers.Dense instead.
Instructions

train loop: 100%|██████████| 500/500 [18:37<00:00,  2.10s/it, acc=97.8, cost=0.00184]


simulation 2


train loop: 100%|██████████| 500/500 [17:53<00:00,  2.15s/it, acc=97.9, cost=0.00149]


simulation 3


train loop: 100%|██████████| 500/500 [17:39<00:00,  2.12s/it, acc=98.2, cost=0.00113]


simulation 4


train loop: 100%|██████████| 500/500 [17:42<00:00,  2.16s/it, acc=97.8, cost=0.00162]


simulation 5


train loop: 100%|██████████| 500/500 [17:53<00:00,  2.22s/it, acc=97.8, cost=0.00176]


simulation 6


train loop: 100%|██████████| 500/500 [18:24<00:00,  2.27s/it, acc=98, cost=0.00139]


simulation 7


train loop: 100%|██████████| 500/500 [18:17<00:00,  2.12s/it, acc=97.8, cost=0.00167]


simulation 8


train loop: 100%|██████████| 500/500 [17:34<00:00,  2.13s/it, acc=97.7, cost=0.00226]


simulation 9


train loop: 100%|██████████| 500/500 [17:46<00:00,  2.13s/it, acc=97.9, cost=0.0014]


simulation 10


train loop: 100%|██████████| 500/500 [17:45<00:00,  2.17s/it, acc=97.8, cost=0.00165]


In [None]:
date_ori = pd.to_datetime(df.iloc[:, 0]).tolist()
for i in range(test_size):
    date_ori.append(date_ori[-1] + timedelta(days = 1))
date_ori = pd.Series(date_ori).dt.strftime(date_format = '%Y-%m-%d').tolist()
#date_ori[-5:]
date_ori

['1970-01-01',
 '1970-01-01',
 '1970-01-01',
 ...

In [None]:
accepted_results = []
for r in results:
    if (np.array(r[-test_size:]) < np.min(df['Close'])).sum() == 0 and \
    (np.array(r[-test_size:]) > np.max(df['Close']) * 2).sum() == 0:
        accepted_results.append(r)
len(accepted_results)
#accepted_results

9
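
The filter above keeps a simulation only if none of its last `test_size` forecast values drops below the historical minimum close or exceeds twice the historical maximum. A minimal sketch of the same rule on toy data; `filter_forecasts`, `toy_history`, and `toy_forecasts` are hypothetical names.

```python
def filter_forecasts(forecasts, history, horizon, upper_factor=2.0):
    """Keep forecasts whose last `horizon` values stay within a plausible range."""
    low = min(history)
    high = max(history) * upper_factor
    # Only the forecast tail is checked; earlier values are ignored.
    return [f for f in forecasts if all(low <= v <= high for v in f[-horizon:])]


toy_history = [1.0, 2.0, 3.0]
toy_forecasts = [
    [1.0, 2.0, 3.0, 2.5, 3.5],  # tail within [1.0, 6.0]: kept
    [1.0, 2.0, 3.0, 0.5, 2.0],  # tail dips below the minimum: dropped
    [1.0, 2.0, 3.0, 5.0, 7.0],  # tail exceeds twice the maximum: dropped
    [0.1, 2.0, 3.0, 4.0, 6.0],  # early outlier is outside the tail: kept
]
toy_accepted = filter_forecasts(toy_forecasts, toy_history, horizon=2)
print(len(toy_accepted))
```

This is a plausibility check rather than a measure of accuracy: a run can pass the filter and still track the true series poorly.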

In [None]:
# one column per accepted forecast; the first accepted run is skipped
dfAll = pd.concat([pd.DataFrame(r) for r in accepted_results[1:9]], axis = 1)
# average the forecasts row by row and keep the last 7 days
dfMean = dfAll.mean(axis = 1, skipna = True)
df_col = dfMean.tail(7)
df_col.to_csv('/content/sample_data/7days__WTK.csv')




In [None]:
dfAll.tail(10)



Unnamed: 0,0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8
2471,0.522226,0.556952,0.50052,0.51797,0.500833,0.549653,0.507244,0.554856,0.47239
2472,0.586633,0.530565,0.504145,0.516599,0.531278,0.572034,0.514699,0.528961,0.467345
2473,0.551822,0.508491,0.496588,0.546839,0.544776,0.578906,0.509856,0.548027,0.502916
2474,0.61267,0.501473,0.507234,0.530068,0.546451,0.558677,0.494513,0.498764,0.464933
2475,0.750057,0.518257,0.559687,0.556092,0.56325,0.588775,0.510843,0.535418,0.468739
2476,0.856165,0.547038,0.561331,0.585506,0.616332,0.622677,0.521069,0.509301,0.432176
2477,0.927503,0.585715,0.52486,0.606218,0.645433,0.616523,0.547208,0.561313,0.509687
2478,0.962525,0.628514,0.531948,0.598464,0.661084,0.610217,0.586318,0.581132,0.475847
2479,0.983011,0.63405,0.513697,0.57951,0.66945,0.639104,0.625818,0.558938,0.557099
2480,0.98888,0.614764,0.486517,0.58927,0.666239,0.653973,0.634729,0.552755,0.579619


In [None]:
j
#dfAll.tail(30)

Unnamed: 0,0
0,1.120000
1,1.057627
2,1.059366
3,1.061841
4,1.050585
...,...
2476,0.432176
2477,0.509687
2478,0.475847
2479,0.557099


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
accuracies = [calculate_accuracy(df['Close'].values, r[:-test_size]) for r in accepted_results]

plt.figure(figsize = (15, 5))
for no, r in enumerate(accepted_results):
    plt.plot(r, label = 'forecast %d'%(no + 1))
plt.plot(df['Close'], label = 'true trend', c = 'black')
plt.legend()
plt.title('average accuracy: %.4f'%(np.mean(accuracies)))

x_range_future = np.arange(len(results[0]))
plt.xticks(x_range_future[::30], date_ori[::30])

plt.show()


To register the converters:
	>>> from pandas.plotting import register_matplotlib_converters
	>>> register_matplotlib_converters()


NameError: ignored

ValueError: ignored

<Figure size 1080x360 with 1 Axes>

In [None]:
import yfinance as yf

msft = yf.Ticker("MSFT")
hist = msft.history(period="max")
hist

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
1986-03-13,0.06,0.07,0.06,0.06,1031788800,0.0,0.0
1986-03-14,0.06,0.07,0.06,0.06,308160000,0.0,0.0
1986-03-17,0.06,0.07,0.06,0.07,133171200,0.0,0.0
1986-03-18,0.07,0.07,0.06,0.06,67766400,0.0,0.0
1986-03-19,0.06,0.06,0.06,0.06,47894400,0.0,0.0
...,...,...,...,...,...,...,...
2020-01-27,161.15,163.38,160.20,162.28,32078100,0.0,0.0
2020-01-28,163.78,165.76,163.07,165.46,24899900,0.0,0.0
2020-01-29,167.84,168.75,165.69,168.04,34754500,0.0,0.0
2020-01-30,174.05,174.05,170.79,172.78,51597500,0.0,0.0
