# Summary of the task

In this project, we will try to predict the upvote score of posts on the Hacker News website https://news.ycombinator.com/ using just their titles.

From this week onward, you are allowed (and recommneded) to use PyTorch and its libraries.

This is the recipe which we suggest.

1. Import SentencePiece to use for tokenizing our data.
2. Prepare the dataset of Hacker News titles and upvote scores
    - Obtain the data from the database [DATABASE_LINK]
    - Tokenise the titles using SentencePiece
3. Implement and train an architecture to obtain word embeddings in the style of the word2vec paper
https://arxiv.org/pdf/1301.3781.pdf
using either the *continuous bag of words (CBOW) or Skip-gram model (or both).
4. Implement a regression model to predict a Hacker News upvote score from the pooled average of the word embeddings in each title.
5. Extension: train your word embeddings on a different dataset, such as
- More Hacker News content, such as comments
- A completely different corpus of text, like (some of) Wikipedia

In the following chapters of this week's Cortex module, we will delve into some of these steps and the technologies which they use.


# Wikipedia 
6GB worth of compressed articles. Which you can load as follows:
```
import datasets

dataset = load_dataset("wikipedia", "20200501.en")
```

There is also a summary dataset: https://github.com/tscheepers/Wikipedia-Summary-Dataset
which is 5.1 Million wiki articles which is 2.7Gb uncompressed.

### Download the wiki dataset using gensim downloader

In [2]:
import gensim.downloader as api

# https://github.com/piskvorky/gensim-data/releases/tag/wiki-english-20171001
dataset = api.load("wiki-english-20171001")

In [16]:
print(len([1 for i in dataset]))

4924894


#### Exploring the large 6GB wikipedia dump
UPDATE: This has lots of wiki markup in it. So not useful for our purposes. We will use the summary dataset instead.

In [15]:
# def load_n_articles(dataset, n=100)
for i, thing in enumerate(dataset):
    # print(i, thing, type(thing))
    print(f"{i}th article")
    for k, v in thing.items():
        print(f"\t{k}; type(val)={type(v)}", end=" ")
        if isinstance(v, str) == str or isinstance(v, list):
            if isinstance(v, list) and isinstance(v[0], str):
                string_text = " ".join(v).strip()
                print(f"len={len(string_text)}")
                print(f"\t\t{string_text[0:100]}")
            else:
                print(f"len={len(v)}")
                print(f"\t{k}: {v[0:100]}")
        print()
    if i > 10:
        break

0th article
	section_texts; type(val)=<class 'list'> len=69962
		'''Anarchism''' is a political philosophy that advocates self-governed societies based on voluntary 

	section_titles; type(val)=<class 'list'> len=184
		Introduction Etymology and terminology History Anarchist schools of thought Internal issues and deba

	title; type(val)=<class 'str'> 
1th article
	section_texts; type(val)=<class 'list'> len=54309
		'''Autism''' is a neurodevelopmental disorder characterized by impaired social interaction, impaired

	section_titles; type(val)=<class 'list'> len=175
		Introduction Characteristics Causes Mechanism Diagnosis Screening Prevention Management Society and 

	title; type(val)=<class 'str'> 
2th article
	section_texts; type(val)=<class 'list'> len=18444
		Percentage of diffusely reflected sunlight in relation to various surface conditions

'''Albedo''' (

	section_titles; type(val)=<class 'list'> len=149
		Introduction Terrestrial albedo  Astronomical albedo  Examples of terrest

In [9]:
def load_n_summary_articles(summary_file_path=..., start=0, end=100):
    """
    read start to end lines from data/dir within the summary_file_path without reading in the whole
    file in python
    """
    summary_dataset = []
    for i in wiki_article_iterator(summary_file_path, start, end):
        summary_dataset.append(i)
    return summary_dataset
    
def wiki_article_iterator(summary_file_path=..., start=0, end=100):
    """
    read start to end lines from data/dir within the summary_file_path without reading in the whole
    file in python
    """
    with open(summary_file_path, "r") as f:
        for i, line in enumerate(f):
            if i > end:
                break
            if "|||" in line:
                yield line.split("|||")[1].strip()

In [7]:
from pathlib import Path

# make the current project path + summary_file_path
summary_file_path = "data/raw.txt"
# full_path = os.path.join(os.getcwd(), summary_file_path)
full_path = Path.cwd() / summary_file_path
print(full_path, full_path.exists())
assert full_path.exists()

/Users/sid/workspace/recurrent_rebels_week2_hackernews_upvotes/notebooks/data/raw.txt True


In [8]:
articles = load_n_summary_articles(full_path, 0, 1e9)
print(articles[0])
print(len(articles))
print(
    f"total of {sum([len(a) for a in articles])} characters in {len(articles)} articles"
)

Anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary and harmful. While anti-statism is central, anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations, including—but not limited to—the state system. Anarchism is usually considered an extreme left-wing ideology and much of anarchist economics and anarchist legal philosophy reflects anti-authoritarian interpretations of communism, collectivism, syndicalism, mutualism or participatory economics. Anarchism does not offer a fixed body of doctrine from a single particular world view, instead fluxing and flowing as a philosophy. Many types and traditions of anarchism exist, not all of which are mutuall

In [10]:
spm.SentencePieceTrainer.train()

Help on function Train in module sentencepiece:

Train(arg=None, logstream=None, **kwargs)



# Train the sentence piece model

In [2]:
VOCAB_SIZE = 10000
import sentencepiece as spm
import io


In [None]:
NUM_ARTICLES = 5315384
model = io.BytesIO()
# for article in articles[0:100]:
iterator = wiki_article_iterator(full_path, 0, 5e6)
spm.SentencePieceTrainer.train(sentence_iterator=iterator, model_writer=model, vocab_size=VOCAB_SIZE)


In [21]:
print(articles[1000001])
print(' '.join(sp_processor.encode_as_pieces(articles[1000001])))
print(sp_processor.encode_as_ids(articles[1000001]))

Alexandra Hills is a locality in Redland City, Queensland, Australia. Alexandra Hills sits between two major areas of Redlands, with Cleveland to the east and Capalaba to the west.
▁Alexandr a ▁Hills ▁is ▁a ▁local ity ▁in ▁Red land ▁City , ▁Queensland , ▁Australia . ▁Alexandr a ▁Hills ▁si ts ▁between ▁two ▁major ▁areas ▁of ▁Red lands , ▁with ▁Cleveland ▁to ▁the ▁east ▁and ▁Cap al aba ▁to ▁the ▁west .
[7560, 45, 3821, 12, 10, 456, 261, 9, 1080, 289, 303, 4, 4506, 4, 507, 6, 7560, 45, 3821, 2717, 684, 131, 85, 447, 1052, 7, 1080, 3420, 4, 26, 4914, 11, 3, 581, 8, 4166, 53, 5209, 11, 3, 593, 6]


### Save the model

In [16]:
sp_model_path = Path.cwd() / "model/spm_1e6.model"
Path.mkdir(sp_model_path.parent, exist_ok=True)

with open(sp_model_path, "wb") as f:
    f.write(model.getvalue())

In [17]:
sp_processor = spm.SentencePieceProcessor(model_proto=model.getvalue())

# Train the word2vec model

### Use Gensim?

In [23]:
help(Word2Vec)

Help on class Word2Vec in module gensim.models.word2vec:

class Word2Vec(gensim.utils.SaveLoad)
 |  Word2Vec(sentences=None, corpus_file=None, vector_size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function hash>, epochs=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False, callbacks=(), comment=None, max_final_vocab=None, shrink_windows=True)
 |  
 |  Method resolution order:
 |      Word2Vec
 |      gensim.utils.SaveLoad
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, sentences=None, corpus_file=None, vector_size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function hash>, epochs=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=100

In [29]:
X = sp_processor.encode_as_pieces(articles[:int(1e6)])

In [30]:
# Use the word2vec model to get the word embeddings for the tokens from sentence piece
# Trained on the same dataset of 1e6 articles used for the sentence piece model.
from gensim.models import Word2Vec
reference_w2v_model = Word2Vec(
         sentences=X, 
         vector_size=1024, 
         window=8, # Maximum distance between the current and predicted word within a sentence.
         min_count=10,  # Minimum frequency below which a token is ignored
         workers=4, 
         sg=1, # Skip gram
         epochs=10,
         max_vocab_size=VOCAB_SIZE,
         compute_loss=True,
)

In [1]:
reference_w2v_model

NameError: name 'reference_w2v_model' is not defined

### adhoc testing of the word2Vec model
Write a The king - man + woman = queen example test with the trained embeddings.
Another is to plot the loss as its trained


# 

In [38]:
# Tokenize wikipedia `articles` list using sentence piece.
# Feed the tokens to gensim and train word2vec model.
# Save the model to disk.



model = create_sentencepiece_model("wiki-ml-100k.txt")
# Quick eval:
## Generate example embeddings
test_words = [
    "queen",
    "woman",
    "mother",
    "wife",
    "new delhi",
    "mughals",
    "taj mahal",
    "hotel",
    "vacation",
    "hiking",
]
## Visualize a set of vectors using t-SNE

NameError: name 'spm' is not defined

In [32]:
# Replace these values with your PostgreSQL credentials
DATABASE = {
    "drivername": "postgresql",
    "host": "your_host",
    "port": "your_port",
    "username": "your_username",
    "password": "your_password",
    "database": "your_database",
}

# Format the database URL
# db_url = f"{DATABASE['drivername']}://{DATABASE['username']}:{DATABASE['password']}@{DATABASE['host']}:{DATABASE['port']}/{DATABASE['database']}"
db_url = "postgres://arcanum:nz2TBHLHl8VSBTSxznk@pg.mlx.institute:5433/arcanum"


# Create the database engine
engine = create_engine(db_url)

NameError: name 'create_engine' is not defined