In [1]:
from doc2vec import *
import sys

## Initialization
In `server.py`, the following lines of code handle initialization:
```python
i = sys.argv.index('server:app')
glove_filename = sys.argv[i+1]
articles_dirname = sys.argv[i+2]

gloves = load_glove(glove_filename)
articles_loaded = load_articles(articles_dirname, gloves)   
```
How do they work?
First, recall how we launch the web server with gunicorn:
```
gunicorn -D --threads 4 -b 0.0.0.0:5000 --access-logfile server.log --timeout 60 server:app glove.6B.300d.txt bbc
```
We need to find the GloVe and articles arguments, and the first three lines of code do that. They locate `server:app` in `sys.argv`, then read the next argument, `glove.6B.300d.txt`, into `glove_filename` and the one after it, `bbc`, into `articles_dirname`.
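
To make the indexing concrete, here is a minimal sketch that runs the same lookup on a hard-coded list standing in for `sys.argv`. The helper name `parse_server_args` and the exact contents of the list are assumptions for illustration; they are not taken from `server.py`.

```python
def parse_server_args(argv):
    """Return (glove_filename, articles_dirname) from a gunicorn-style argv.

    Hypothetical helper: mirrors the three lines shown above.
    """
    i = argv.index('server:app')  # raises ValueError if 'server:app' is absent
    return argv[i + 1], argv[i + 2]

# Assumed stand-in for sys.argv, built from the launch command above.
argv = ['gunicorn', '-D', '--threads', '4', '-b', '0.0.0.0:5000',
        '--access-logfile', 'server.log', '--timeout', '60',
        'server:app', 'glove.6B.300d.txt', 'bbc']

glove_filename, articles_dirname = parse_server_args(argv)
print(glove_filename, articles_dirname)
```

Because the lookup is relative to `server:app`, the gunicorn options before it can change without breaking the argument parsing.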

In [13]:
glove_filename = 'glove.6B.300d.txt'
articles_dirname = 'bbc'

The `gloves` variable is a dictionary mapping each word to its 300-dimensional vector.
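
As a rough idea of how such a dictionary can be built, the sketch below parses GloVe-style text, where each line holds a word followed by its space-separated vector components. `load_glove_sketch` is a hypothetical stand-in, not the actual `load_glove` from `doc2vec`, and the two-line sample with 3-dimensional vectors is made up.

```python
import io
import numpy as np

def load_glove_sketch(lines):
    """Parse GloVe-format lines into a {word: vector} dictionary.

    Hypothetical helper; the real load_glove may differ.
    """
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word = parts[0]
        vectors[word] = np.array([float(x) for x in parts[1:]])
    return vectors

# Made-up sample in the GloVe text layout (3 dimensions instead of 300).
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n")
toy_gloves = load_glove_sketch(sample)
print(toy_gloves['the'].shape)
```

With the real file you would pass an open file handle instead of the `StringIO` sample.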

In [14]:
gloves = load_glove(glove_filename)

In [15]:
gloves['the'].shape

(300,)

In [16]:
gloves['the']

array([ 4.6560e-02,  2.1318e-01, -7.4364e-03, -4.5854e-01, -3.5639e-02,
        2.3643e-01, -2.8836e-01,  2.1521e-01, -1.3486e-01, -1.6413e+00,
       -2.6091e-01,  3.2434e-02,  5.6621e-02, -4.3296e-02, -2.1672e-02,
        2.2476e-01, -7.5129e-02, -6.7018e-02, -1.4247e-01,  3.8825e-02,
       -1.8951e-01,  2.9977e-01,  3.9305e-01,  1.7887e-01, -1.7343e-01,
       -2.1178e-01,  2.3617e-01, -6.3681e-02, -4.2318e-01, -1.1661e-01,
        9.3754e-02,  1.7296e-01, -3.3073e-01,  4.9112e-01, -6.8995e-01,
       -9.2462e-02,  2.4742e-01, -1.7991e-01,  9.7908e-02,  8.3118e-02,
        1.5299e-01, -2.7276e-01, -3.8934e-02,  5.4453e-01,  5.3737e-01,
        2.9105e-01, -7.3514e-03,  4.7880e-02, -4.0760e-01, -2.6759e-02,
        1.7919e-01,  1.0977e-02, -1.0963e-01, -2.6395e-01,  7.3990e-02,
        2.6236e-01, -1.5080e-01,  3.4623e-01,  2.5758e-01,  1.1971e-01,
       -3.7135e-02, -7.1593e-02,  4.3898e-01, -4.0764e-02,  1.6425e-02,
       -4.4640e-01,  1.7197e-01,  4.6246e-02,  5.8639e-02,  4.14

`articles_loaded` is a list of records, one for each article. An article record is a sequence (displayed below as a tuple) containing the file name, the article title, the text without the title, and the word vector centroid computed from the text without the title.
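
The last field of each record can be pictured as the mean of the vectors of the words in the text. The sketch below assumes a naive lowercase-and-split tokenizer and skips words missing from the dictionary; `centroid_sketch` and the toy vectors are hypothetical, and the real `doc2vec` code may tokenize and filter differently.

```python
import numpy as np

def centroid_sketch(text, gloves):
    """Average the vectors of the words in text that appear in gloves.

    Hypothetical helper; assumes gloves is non-empty.
    """
    words = text.lower().split()
    known = [gloves[w] for w in words if w in gloves]
    if not known:
        # No known words: fall back to a zero vector of the right size.
        dim = len(next(iter(gloves.values())))
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Made-up 2-dimensional vectors for illustration.
toy_gloves = {'red': np.array([1.0, 0.0]), 'tape': np.array([0.0, 1.0])}
text = 'Red tape everywhere'
centroid = centroid_sketch(text, toy_gloves)

# A record in the same field order as described above.
record = ('toy/1.txt', 'Toy title', text, centroid)
print(record[3])
```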

In [17]:
articles_loaded = load_articles(articles_dirname, gloves)   

In [18]:
print('number of articles: ', len(articles_loaded))

number of articles:  2225


In [19]:
article = articles_loaded[0] 
article
# it displays filename, title, article-text-minus-title, wordvec-centroid-for-article-text

('entertainment/289.txt',
 'Musicians to tackle US red tape',
 '\nMusicians\' groups are to tackle US visa regulations which are blamed for hindering British acts\' chances of succeeding across the Atlantic.\n\nA singer hoping to perform in the US can expect to pay $1,300 (£680) simply for obtaining a visa. Groups including the Musicians\' Union are calling for an end to the "raw deal" faced by British performers. US acts are not faced with comparable expense and bureaucracy when visiting the UK for promotional purposes.\n\nNigel McCune from the Musicians\' Union said British musicians are "disadvantaged" compared to their US counterparts. A sponsor has to make a petition on their behalf, which is a form amounting to nearly 30 pages, while musicians face tougher regulations than athletes and journalists. "If you make a mistake on your form, you risk a five-year ban and thus the ability to further your career," says Mr McCune.\n\n"The US is the world\'s biggest music market, which mean

## Recommendation
After loading `gloves` (the database of word vectors) and `articles_loaded` (the corpus of article records) into memory, we can use each article's word vector centroid to recommend articles.
To get the five articles most relevant to `article`, we do this:

```python
similar_articles = recommended(article, articles_loaded, 5)
```

In [22]:
similar_articles = recommended(article, articles_loaded, 5)

`similar_articles` is a list containing 5 article records.

In [26]:
len(similar_articles)

5

In [36]:
# print the name of the 5 articles
for similar_article in similar_articles:
    print(similar_article[0])

entertainment/131.txt
politics/250.txt
entertainment/271.txt
politics/122.txt
politics/220.txt


### How does `recommended` work?
What happens exactly when we run `recommended(article, articles_loaded, 5)`?

First, it runs `distances(article, articles)`, which computes the Euclidean distance from `article` to every other article and returns a list of `(distance, a)` tuples for all `a` in `articles`.
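
A minimal sketch of that step, assuming the centroid sits at index 3 of each record and that the query article is skipped by identity (which would explain why 2225 articles yield 2224 distances). `distances_sketch` and the toy records are hypothetical, not the actual `doc2vec` code.

```python
import numpy as np

def distances_sketch(article, articles):
    """Return (distance, a) tuples from article to every other record.

    Hypothetical helper; assumes record[3] is the centroid vector.
    """
    result = []
    for a in articles:
        if a is article:
            continue  # do not compare the article with itself
        result.append((np.linalg.norm(article[3] - a[3]), a))
    return result

# Made-up records: (filename, title, text, centroid).
a = ('a.txt', 'A', '', np.array([0.0, 0.0]))
b = ('b.txt', 'B', '', np.array([3.0, 4.0]))
c = ('c.txt', 'C', '', np.array([0.0, 1.0]))

toy_distances = distances_sketch(a, [a, b, c])
for dist, rec in toy_distances:
    print(rec[0], dist)
```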

In [29]:
distances_list = distances(article, articles_loaded)

In [31]:
len(distances_list)

2224

In [44]:
example = distances_list[0]
print('name of the sample article: ', example[1][0]) 
print(f'the euclidean distance from {article[0]} to {example[1][0]}:', example[0] )

name of the sample article:  entertainment/262.txt
the euclidean distance from entertainment/289.txt to entertainment/262.txt: 1.3157544907583372


Then, it sorts `distances_list` in ascending order by distance, so the closest articles come first.

In [45]:
sorted_distances_list = sorted(distances_list, key = lambda x:x[0], reverse=False) 

Finally, it returns the first n (n is 5 in this example) article records, which are the n nearest articles.

In [47]:
similar_articles = [record[1] for record in sorted_distances_list[:5]]
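
Putting the three steps together, a self-contained sketch of the whole routine could look like the following. `recommended_sketch` is a hypothetical re-implementation under the same assumptions as before (centroid at index 3, query article excluded by identity); it is not the author's `recommended`.

```python
import numpy as np

def recommended_sketch(article, articles, n):
    """Return the n records whose centroids are closest to article's.

    Hypothetical helper; assumes record[3] is the centroid vector.
    """
    # Step 1: distance from the query to every other article.
    scored = [(np.linalg.norm(article[3] - a[3]), a)
              for a in articles if a is not article]
    # Step 2: sort ascending by distance only.
    scored.sort(key=lambda pair: pair[0])
    # Step 3: keep the records of the n nearest articles.
    return [a for _, a in scored[:n]]

# Made-up records: (filename, title, text, centroid).
q = ('q.txt', 'Q', '', np.array([0.0, 0.0]))
b = ('b.txt', 'B', '', np.array([3.0, 4.0]))
c = ('c.txt', 'C', '', np.array([0.0, 1.0]))
d = ('d.txt', 'D', '', np.array([10.0, 0.0]))

nearest = recommended_sketch(q, [q, b, c, d], 2)
print([rec[0] for rec in nearest])
```

Sorting with an explicit `key` on the distance matters: without it, a tie in distance would make Python compare the records themselves, and comparing tuples that contain NumPy arrays can raise an error.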