<style>
*{
    color: #D6DAC8;
    font-size: 14px;
    font-weight: 550;
}

h1, h2, h3, h4, h5, h6 {
    font-weight: 1000;
    color:white;
}

.para{
    color: #DA6C6C;
    font-weight: 950;
}

b, .name_list {
    font-weight: 950;
}
img {
    display: block;
    margin: 0 auto;
    width: 40em;
}
</style>

<h2>CHAPTER 2: Tokens and Embeddings</h2>
<p><b>Tokens and embeddings</b> are two of the central concepts in using large language models (LLMs). In this chapter, we look more closely at what tokens are and at the tokenization methods used to power LLMs. We will then dive into the famous <b>word2vec embeddings</b> method that preceded modern-day LLMs and see how it extends the concept of token embeddings to build commercial recommendation systems that power a lot of the apps you use.</p>
<img src="../image/tokens_and_embeddings.jpg">
<p>A model does not produce its output response all at once; it generates one token at a time. But tokens aren't only the output of a model, they're also the way in which the model sees its inputs. Before a text prompt is presented to the language model, it first goes through a <b>tokenizer</b> that breaks it into tokens. Here's an example that shows the tokenizer of <b>GPT-3.5 and GPT-4</b>.</p>
<img src="../image/tokenizer_example.jpg">
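To make the one-token-at-a-time idea concrete, here is a minimal sketch of a generation loop. The `TOY_NEXT_TOKEN` lookup table is a hypothetical stand-in for a real language model and the token strings are made up for illustration; a real LLM scores every token in its vocabulary at each step, based on the whole context rather than only the last token.

```python
# A toy stand-in for a language model: maps the last token to the next one.
# (Hypothetical table for illustration; a real model predicts from the whole context.)
TOY_NEXT_TOKEN = {
    "<s>": "Tokens",
    "Tokens": "are",
    "are": "pieces",
    "pieces": "of",
    "of": "text",
    "text": "</s>",
}

def generate(prompt_tokens, max_new_tokens=10):
    """Append one token per step until the end token or the step limit is reached."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = TOY_NEXT_TOKEN.get(tokens[-1], "</s>")
        if next_token == "</s>":
            break
        tokens.append(next_token)  # the new token becomes part of the next input
    return tokens

generated = generate(["<s>", "Tokens"])
print(generated)
```

Each pass through the loop feeds the sequence produced so far back in as input, which is why tokens are both the output and the input of the model.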


<h2 style="font-weight: 1000">Run an example</h2>

In [8]:
!pip install -q transformers numpy pandas
!pip install -q torch torchvision torchaudio


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
prompt = """
Write an example of Part 1 IELTS writing band 9.0. <assistant>
"""

input_ids = tokenizer(prompt,
                     return_tensors="pt").input_ids.to("cuda")

generated_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=256,
    use_cache=False
)

text = tokenizer.decode(generated_output[0])
text

"\nWrite an example of Part 1 IELTS writing band 9.0. <assistant>\nIn the wake of the global pandemic, the importance of digital literacy has become more pronounced than ever. As we navigate through an increasingly digital world, the ability to effectively use technology is no longer a luxury but a necessity. This essay will explore the significance of digital literacy in today's society, its impact on various aspects of life, and the need for its integration into educational curricula.\n\nDigital literacy refers to the ability to use digital technology, communication tools, and networks to access, manage, integrate, evaluate, and create information. It encompasses a wide range of skills, including basic computer skills, internet navigation, online communication, and the ability to critically evaluate digital content. In a world where information is readily available at our fingertips, the ability to discern credible sources from unreliable ones is crucial.\n\nThe importance of digital

In [None]:
input_ids

tensor([[29871,    13,  6113,   385,  1342,   310,  3455, 29871, 29896,  7159,
          5850, 29903,  5007,  3719, 29871, 29929, 29889, 29900, 29889,   529,
           465, 22137, 29958,    13]], device='cuda:0')

To view each token in **input_ids**, we loop over the IDs and use the tokenizer to decode each one.

In [None]:
for token_id in input_ids[0]:
    print(tokenizer.decode(token_id))




Write
an
example
of
Part

1
IE
LT
S
writing
band

9
.
0
.
<
ass
istant
>





<p>
    Thanks to this example, we can see how the tokenizer broke down our input prompt. Notice the following:
    <ul>
        <li>In some cases, the first token is <b>&lt;s&gt;</b>, a special token indicating the beginning of the text.</li>
        <li>Some tokens are complete words.</li>
        <li>Some tokens are parts of words.</li>
        <li>Punctuation characters are their own tokens.</li>
    </ul>
    <p>One thing you should notice is that the space character usually does not get its own token. Instead, word boundaries are encoded in the tokens themselves through a special hidden marker character, which tells the tokenizer whether a token starts a new word or, like the partial tokens "LT" and "S", attaches to the token that precedes it in the text.</p>
</p>
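The way word-boundary markers work can be sketched in a few lines. This is an illustration only: the pieces below are hypothetical, and the `▁` marker follows the SentencePiece convention, in which a token that starts a new word carries the marker and a continuation piece does not. Other tokenizers use different conventions (WordPiece, for example, marks continuation pieces with `##`).

```python
MARKER = "\u2581"  # the "▁" character that SentencePiece uses to mark a word start

def join_pieces(pieces):
    """Rebuild text from subword pieces: concatenate, then turn markers into spaces."""
    return "".join(pieces).replace(MARKER, " ").strip()

# Hypothetical pieces for illustration (not the exact output of any tokenizer)
pieces = [MARKER + "IE", "LT", "S", MARKER + "writing", MARKER + "band"]
text = join_pieces(pieces)
print(text)
```

Because the boundary information lives inside the tokens, decoding needs no separate space token: "LT" and "S" simply attach to "IE".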


In [None]:
generated_output[0]

tensor([29871,    13,  6113,   385,  1342,   310,  3455, 29871, 29896,  7159,
         5850, 29903,  5007,  3719, 29871, 29929, 29889, 29900, 29889,   529,
          465, 22137, 29958,    13,   797,   278,   281,  1296,   310,   278,
         5534,  7243, 24552, 29892,   278, 13500,   310, 13436,  4631,  4135,
          756,  4953,   901, 11504, 20979,  1135,  3926, 29889,  1094,   591,
        23624,  1549,   385, 10231,   368, 13436,  3186, 29892,   278, 11509,
          304, 17583,   671, 15483,   338,   694,  5520,   263, 21684,  2857,
          541,   263, 24316, 29889,   910,  3686,   388,   674, 26987,   278,
        26002,   310, 13436,  4631,  4135,   297,  9826, 29915, 29879, 12459,
        29892,   967, 10879,   373,  5164, 21420,   310,  2834, 29892,   322,
          278,   817,   363,   967, 13465,   964, 28976, 16256,   293,  2497,
        29889,    13,    13, 27103,  4631,  4135, 14637,   304,   278, 11509,
          304,   671, 13436, 15483, 29892, 12084,  8492, 29892, 


<h2>How does the Tokenizer Break Down Text?</h2>
<p>There are three factors that dictate how a tokenizer breaks down an input prompt.</p>
<ul>
    <li>First, at model design time, the creator of the model chooses a tokenization method. Popular methods include <b>byte pair encoding (BPE)</b> (widely used by GPT models) and <b>WordPiece</b> (used by BERT). Both aim to find an efficient set of tokens to represent a text dataset, but they arrive at it in different ways.</li>
    <li>Second, after choosing the method, we need to make a number of tokenizer design choices like vocabulary size and what special tokens to use.</li>
    <li>Third, the tokenizer needs to be trained on a specific dataset to establish the best vocabulary it can use to represent that dataset. Even if we set the same methods and parameters, a tokenizer trained on an English text dataset will be different from another trained on a code dataset or a multilingual text dataset.</li>
</ul>
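The three factors above can be seen together in a simplified sketch of byte pair encoding training. This is a toy version written for illustration, not the implementation any particular model uses: `learn_bpe` and both tiny corpora are hypothetical, `num_merges` plays the role of the vocabulary-size budget, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merges from a list of words (toy version, no pre-tokenization rules)."""
    # Each word starts out as a tuple of single characters
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

english_merges = learn_bpe(["lower", "lowest", "low", "low"], num_merges=2)
code_merges = learn_bpe(["def", "def", "elif", "else"], num_merges=2)
print(english_merges)
print(code_merges)
```

The method and the merge budget are identical in both calls, yet the learned merges differ because the training data differs, which is the point of the third factor.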
<h2>Word Versus Subword Versus Character Versus Byte Tokens</h2>
<p>The tokenization scheme we just discussed is called <b>subword tokenization</b>. It's the most commonly used tokenization scheme but not the only one. The four notable ways to tokenize are shown below:</p>
<img src="../image/four_ways_tokenize.jpg">
<p>
    <ul>
        <li><b>Word tokens</b>: This approach was common with earlier methods like <b>word2vec</b> but is being used less and less in NLP. Its usefulness, however, has led to its adoption outside of NLP for use cases such as recommendation systems. One challenge with this method is that the tokenizer may be unable to deal with new words that enter the dataset after the tokenizer was trained. It also results in a vocabulary that has a lot of tokens with minimal differences between them (e.g., apology, apologize, apologetic, apologist). This latter challenge is resolved by <b>subword tokenization</b>, which has a token for <em>apolog</em> and then suffix tokens (e.g., -y, -ize, -etic, -ist) that are shared with many other words, resulting in a more expressive vocabulary.</li>
        <li><b>Subword tokens</b>: This method contains full and partial words. In addition to the vocabulary expressivity mentioned earlier, another benefit of the approach is its ability to represent new words by breaking them down into smaller pieces, which tend to be part of the vocabulary.</li>
        <li><b>Character tokens</b>: This method can deal successfully with new words because it has the raw letters to fall back on. While that makes the representation easier to tokenize, it makes the modeling more difficult. Where a model with subword tokenization can represent "play" as one token, a model using character-level tokens needs to model the information to spell out <b>"p-l-a-y"</b> in addition to modeling the rest of the sequence. Subword tokens also have an advantage over character tokens in that they fit more text within the limited context length of a Transformer model.</li>
        <li><b>Byte tokens</b>: This method breaks tokens down into the individual bytes used to represent Unicode characters. <a href="https://arxiv.org/abs/2103.06874"><b style="color: red">"CANINE: Pre-training an efficient tokenization-free encoder for language representation"</b></a> outlines methods like this, which are also called <b>tokenization-free encoding</b>. Other works like <a href="https://arxiv.org/abs/2105.13626"><b style="color:red">"ByT5: Towards a token-free future with pre-trained byte-to-byte models."</b></a> show that this can be a competitive method, especially in multilingual scenarios.</li>
    </ul>
</p>
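The four granularities above can be compared on the same string with a short sketch. The subword step is a simplification: `TOY_VOCAB` is a hypothetical four-entry vocabulary and `subword_tokenize` uses greedy longest-match, which is only one of several ways real tokenizers split words.

```python
text = "apologize playing"

# Word tokens: split on whitespace
word_tokens = text.split()

# Subword tokens: greedy longest-match against a tiny hypothetical vocabulary
TOY_VOCAB = {"apolog", "ize", "play", "ing"}

def subword_tokenize(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matches: fall back to a single character
            pieces.append(word[start])
            start += 1
    return pieces

subword_tokens = [p for w in word_tokens for p in subword_tokenize(w, TOY_VOCAB)]

# Character tokens: every character (including the space) is a token
char_tokens = list(text)

# Byte tokens: the UTF-8 bytes of the text, each a value from 0 to 255
byte_tokens = list(text.encode("utf-8"))

for name, toks in [("word", word_tokens), ("subword", subword_tokens),
                   ("character", char_tokens), ("byte", byte_tokens)]:
    print(f"{name:>9}: {len(toks):2d} tokens")
```

The finer the granularity, the longer the token sequence for the same text, which is the context-length trade-off described in the list.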

In [None]:
text = """
English and CAPITALIZATION
show_tokens False None elif == >= else: two tabs:" " Three tabs:
" "
12.0*50=600
"""

colors_list = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
    ]

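def show_tokens(sentence, tokenizer_name):
    """Print each token of `sentence` on a colored background, cycling through colors_list."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids

    for idx, t in enumerate(token_ids):
        # 24-bit ANSI background color, chosen by cycling through colors_list
        color = colors_list[idx % len(colors_list)]
        print(
            f'\x1b[0;30;48;2;{color}m' +
            tokenizer.decode(t) +
            '\x1b[0m',
            end=' '
        )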


<img src="../image/table_tokenize_1.jpg" width="400">  <img src="../image/table_tokenize_2.jpg" width="400"> 
<p>As you can see, different models tokenize the same prompt in different ways, use different <b>special tokens</b> (like <em>CLS</em> and <em>SEP</em>), and have different <b>vocabulary sizes</b>. Newer models tend to have larger vocabularies.</p>
<h2>Tokenizer Properties</h2>
<p>There are three major groups of design choices that determine how the tokenizer will break down text: <b>the tokenization method, the initialization parameters, and the domain of the data the tokenizer targets</b>.</p>
<h4 style="color:gray">Tokenization methods</h4>
<p>
As we've seen, there are a number of tokenization methods, with <b>byte pair encoding (BPE)</b> being one of the more popular ones. Each of these methods outlines an algorithm for how to choose an appropriate set of tokens to represent a dataset.
</p>
<h4 style="color:gray">Tokenizer Parameters</h4>
<p>
After choosing a tokenization method, an LLM designer needs to make some decisions about the parameters of the tokenizer. These include:
<ul>
    <li><b>Vocabulary size:</b> The majority of models use a vocabulary size between <b>30K and 50K</b>, but an increasing number of models adopt larger sizes such as <b>100K</b>.</li>
    <li><b>Special tokens:</b> The LLM designer can add tokens that help better model the domain of the problem they're trying to focus on, as we've seen with <b>Galactica's <em>&lt;work&gt;</em> and <em>[START_REF]</em></b> tokens.</li>
    <li><b>Capitalization:</b> The designer decides whether to preserve capitalization or to convert all text to lowercase before tokenizing.</li>
</ul>
</p>
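As a rough illustration of how the capitalization choice interacts with vocabulary size, the sketch below counts distinct whitespace-level tokens with and without lowercasing. The corpus and the `vocabulary` helper are hypothetical; real tokenizers apply this choice as a normalization step before the subword algorithm runs.

```python
corpus = "The cat sat. the Cat SAT. THE CAT sat."

def vocabulary(text, lowercase):
    """Collect the set of distinct tokens after splitting off periods."""
    if lowercase:
        text = text.lower()
    return set(text.replace(".", " .").split())

cased_vocab = vocabulary(corpus, lowercase=False)
uncased_vocab = vocabulary(corpus, lowercase=True)
print(len(cased_vocab), sorted(cased_vocab))
print(len(uncased_vocab), sorted(uncased_vocab))
```

Preserving case spends vocabulary entries on variants of the same word, while lowercasing saves entries but discards information that capitalization can carry, such as names.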
<h4 style="color:gray">The domain of the data</h4>
<p>
Even if we select the same method and parameters, tokenizer behavior will differ based on the dataset it was trained on. The tokenization methods mentioned previously work by optimizing the vocabulary to represent a specific dataset.
</p>


<h2>Creating Contextualized Word Embeddings with Language Models</h2>
<p>
Let's look at how language models can create better token embeddings. This is one of the primary ways to use language models for text representation, and it empowers applications like named-entity recognition or extractive text summarization. Instead of representing each token or word with a static vector, language models create contextualized word embeddings that represent a word with a different vector based on its context. These vectors can then be used by other systems for a variety of tasks.</p>
<img src="../image/contextualized_token_embeddings.jpg">

In [None]:
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-base')
model = AutoModel.from_pretrained("microsoft/deberta-base")

tokens = tokenizer('Hello world', return_tensors='pt')
output = model(**tokens)[0] # Creating contextualized word embeddings for input.
output

tensor([[[ 0.0473, -0.0435, -0.0812,  ...,  0.0121,  0.0395, -0.0462],
         [-1.1017, -0.7390, -0.7409,  ..., -0.4467,  0.3183, -0.4456],
         [ 1.0047,  0.6782, -0.4958,  ...,  0.2436, -0.3662,  0.5453],
         [ 0.2161,  0.0714, -0.1412,  ...,  0.0635,  0.1943,  0.0773]]],
       grad_fn=<AddBackward0>)

In [None]:
tokens

{'input_ids': tensor([[    1, 31414,   232,     2]]), 'token_type_ids': tensor([[0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1]])}

In [None]:
output.shape

torch.Size([1, 4, 768])

In [None]:
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))

[CLS]
Hello
 world
[SEP]


In [None]:
first_input = "I ate an Apple."
second_input = "Apple released a new iPhone."

fi_tokens = tokenizer(first_input, return_tensors='pt')
se_tokens = tokenizer(second_input, return_tensors='pt')

fi_output = model(**fi_tokens)[0]
se_output = model(**se_tokens)[0]

print("First: ", fi_output)
print("Second: ", se_output)

First:  tensor([[[ 0.0543, -0.0255, -0.0719,  ...,  0.0082,  0.0595, -0.0669],
         [-0.8302, -0.4220, -0.7239,  ...,  0.8751, -0.8066,  0.0873],
         [-0.6174,  0.7138,  0.0803,  ...,  1.6043, -0.8155, -0.1865],
         ...,
         [ 0.0516, -0.4860,  0.0461,  ...,  0.7933,  0.3989,  0.3375],
         [ 0.0968, -0.1326, -0.0322,  ...,  0.3286, -0.1014,  0.1500],
         [ 0.2169,  0.0501, -0.1573,  ...,  0.0423,  0.1953,  0.0429]]],
       grad_fn=<AddBackward0>)
Second:  tensor([[[ 0.0594, -0.0389, -0.0846,  ...,  0.0154,  0.0753, -0.0663],
         [ 0.6168, -0.8267, -0.3540,  ...,  1.0822, -0.7952,  0.5758],
         [-0.3719,  0.2083, -0.1232,  ...,  1.6132, -0.2349, -0.1427],
         ...,
         [ 1.1779,  0.1458, -0.6624,  ...,  0.8139, -0.4652,  0.0227],
         [ 0.1738, -0.1632, -0.1210,  ...,  0.4953, -0.2894,  0.1291],
         [ 0.2075,  0.0712, -0.1375,  ...,  0.0497,  0.2004,  0.0351]]],
       grad_fn=<AddBackward0>)


In [None]:
print(f'First shape: {fi_output.shape}\t Second shape: {se_output.shape}')

First shape: torch.Size([1, 7, 768])	 Second shape: torch.Size([1, 8, 768])


In [None]:
for token in fi_tokens['input_ids'][0]:
    print(tokenizer.decode(token))

print('\n\n\n')
for token in se_tokens['input_ids'][0]:
    print(tokenizer.decode(token))

[CLS]
I
 ate
 an
 Apple
.
[SEP]




[CLS]
Apple
 released
 a
 new
 iPhone
.
[SEP]


<img style="display:block; margin:0 auto" src="../image/llms_operate_on_raw.jpg">
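One way to quantify how much the representation of "Apple" changes between the two sentences is cosine similarity. The sketch below is self-contained and uses hypothetical 4-dimensional vectors as stand-ins; with the tensors computed earlier, the two "Apple" vectors would be `fi_output[0, 4]` and `se_output[0, 1]` (their positions in the token lists printed above), converted with `.detach().tolist()`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 4-dimensional stand-ins for the two 768-dimensional "Apple" vectors
apple_fruit = [0.9, 0.1, -0.3, 0.2]
apple_company = [0.2, 0.8, 0.4, -0.1]

print(cosine_similarity(apple_fruit, apple_fruit))
print(cosine_similarity(apple_fruit, apple_company))
```

A static embedding would give the same vector for both occurrences and therefore a similarity of 1; a contextualized model is expected to give a lower value when the word is used in different senses.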


<h2>Text Embeddings (for Sentences and Whole Documents)</h2>
<p>
While token embeddings are key to how LLMs operate, a number of LLM applications require operating on entire sentences, paragraphs, or even text documents. This has led to special language models that produce <b>text embeddings</b> - a single vector that represents a piece of text longer than just one token. There are multiple ways to produce a text embedding vector. One of the most common ways is to average the values of all the token embeddings produced by the model. Yet high-quality text embedding models tend to be trained specifically for text embedding tasks.
</p>
<img src="../image/text_embeddings.jpg">
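The averaging approach mentioned above can be sketched as follows. This is a minimal pure-Python illustration with hypothetical 3-dimensional token vectors; the `mean_pool` helper and the `attention_mask` convention (1 for real tokens, 0 for padding) are assumptions made for the example, and dedicated text embedding models may pool differently.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    return [total / count for total in totals]

# Hypothetical 3-dimensional token embeddings; the last position is padding
token_embeddings = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [9.0, 9.0, 9.0],
]
attention_mask = [1, 1, 0]

text_embedding = mean_pool(token_embeddings, attention_mask)
print(text_embedding)
```

However many tokens go in, a single fixed-size vector comes out, which is what makes text embeddings convenient for comparing sentences or documents.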

In [None]:
!pip install -q sentence-transformers



In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

vector = model.encode("Introduction to Programming")

vector

array([ 9.01108608e-03, -3.51743773e-02, -4.17335704e-02, -4.56001749e-03,
        6.87649148e-03, -2.07700487e-03, -4.66636522e-03,  4.49101720e-03,
       -2.34699175e-02, -3.48360464e-03,  5.90537861e-02, -7.01046316e-03,
        3.92840579e-02,  7.98736587e-02, -2.18903031e-02, -1.04732312e-01,
        5.84166162e-02, -1.90777909e-02, -2.22391319e-02, -1.49280541e-02,
        1.07468087e-02, -1.58077329e-02,  4.89438511e-03, -1.63508970e-02,
       -5.83496131e-03, -5.74922822e-02, -2.59398762e-02,  5.04231639e-02,
       -2.78870277e-02, -5.45268273e-03, -3.32935192e-02,  7.90276565e-03,
        1.87810212e-02,  1.22810062e-02,  1.48726895e-06, -6.36558384e-02,
       -3.67058478e-02,  5.21059744e-02, -2.26591956e-02,  8.01776573e-02,
        1.23646576e-02,  4.11825441e-02, -5.17043925e-04, -2.95661530e-03,
        1.94303729e-02,  4.80969576e-03,  4.81724441e-02, -2.53551304e-02,
        7.34378994e-02,  8.50495789e-03,  6.46130368e-03, -3.10404338e-02,
       -3.07741854e-02,  

In [None]:
vector.shape

(768,)

## **Using pre-trained Word Embeddings**
We can also download and use pre-trained word embeddings (like **word2vec** or **GloVe**) using the **Gensim library**.

In [None]:
import scipy
print(scipy.__version__)

1.16.1


In [None]:
!pip install gensim pandas numpy transformers matplotlib


In [None]:
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")



In [None]:
model.most_similar([model['dog']], topn=11)

[('dog', 1.0000001192092896),
 ('cat', 0.9218004941940308),
 ('dogs', 0.8513158559799194),
 ('horse', 0.7907583713531494),
 ('puppy', 0.7754920721054077),
 ('pet', 0.7724708318710327),
 ('rabbit', 0.7720813751220703),
 ('pig', 0.7490062117576599),
 ('snake', 0.7399188876152039),
 ('baby', 0.7395570278167725),
 ('bite', 0.7387937307357788)]

## **The Word2vec Algorithm and Contrastive Training**
The central ideas of word2vec are condensed here because we build on them in the following section, which discusses one method for creating embeddings for recommendation engines.

Just like LLMs, **word2vec** is trained on examples generated from text. Take, for example, the text ***"Thou shalt not make a machine in the likeness of a human mind"*** from the Dune novels by Frank Herbert. The algorithm uses a **sliding window** to generate training examples. With a **window size** of two, for instance, we consider **two neighbors** on each side of a **central word**.
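The sliding window can be sketched in a few lines of Python. The `skipgram_pairs` helper is a hypothetical name, and splitting on whitespace after lowercasing is a simplification of real preprocessing.

```python
def skipgram_pairs(tokens, window_size):
    """Pair each central word with every neighbor within window_size positions."""
    pairs = []
    for center, word in enumerate(tokens):
        lo = max(0, center - window_size)
        hi = min(len(tokens), center + window_size + 1)
        for neighbor in range(lo, hi):
            if neighbor != center:
                pairs.append((word, tokens[neighbor]))
    return pairs

text = "Thou shalt not make a machine in the likeness of a human mind"
tokens = text.lower().split()
pairs = skipgram_pairs(tokens, window_size=2)
print(pairs[:6])
```

Words near the edges of the text have fewer neighbors, so they contribute fewer pairs than words in the middle.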

In each of the produced training examples, the word in the center is used as one input, and each of its neighbors is a distinct second input. We expect the final trained model to be able to classify this neighbor relationship and output 1 if the two input words it receives are indeed neighbors. If, however, every example in our dataset has a target value of 1, then a model can **cheat** and ace it by outputting 1 all the time. To get around this, we need to **enrich** our training dataset with examples of words that are not typically neighbors. A lot of useful models result from the simple ability to distinguish positive examples from randomly generated examples (inspired by an important idea called **noise-contrastive estimation**).

<img src="../image/enrich_dataset.jpg" style="display:block; margin:0 auto">
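Enriching the dataset with negative examples might look like the sketch below. It is simplified on purpose: `add_negative_examples` is a hypothetical helper, words are drawn uniformly at random, and true neighbors are excluded, whereas real implementations typically sample words according to their frequency. The vocabulary must contain enough non-neighbor words for the sampling loop to finish.

```python
import random

def add_negative_examples(positive_pairs, vocabulary, num_negatives, seed=0):
    """Label real neighbor pairs 1 and add randomly drawn pairs labeled 0."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    examples = []
    for center, neighbor in positive_pairs:
        examples.append((center, neighbor, 1))
        added = 0
        while added < num_negatives:
            candidate = rng.choice(vocabulary)
            # Skip the word itself and anything that is a true neighbor
            if candidate != center and (center, candidate) not in positives:
                examples.append((center, candidate, 0))
                added += 1
    return examples

vocabulary = ["thou", "shalt", "not", "make", "a", "machine",
              "in", "the", "likeness", "of", "human", "mind"]
positive_pairs = [("not", "thou"), ("not", "shalt"), ("not", "make"), ("not", "a")]

examples = add_negative_examples(positive_pairs, vocabulary, num_negatives=2)
for example in examples[:6]:
    print(example)
```

With labels of both 1 and 0 in the dataset, a model that always outputs 1 no longer gets a perfect score.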

A model is then trained on each example to take in two embedding vectors and predict **whether they're related or not**:

<img src="../image/predicting_two_words_are_neighbors.jpg" style="display:block; margin:0 auto">

Based on whether its prediction was correct or not, the typical machine learning training step updates the embeddings so that the next time the model is presented with those two vectors, it has a better chance of being correct. By the end of the training process, we have better embeddings for all the tokens in our vocabulary.
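A single update of this kind can be sketched as logistic regression on the dot product of the two embeddings. This is a simplified illustration under stated assumptions: the 3-dimensional starting vectors are hypothetical, `train_step` is a made-up helper, and production word2vec implementations keep separate input and output embedding tables and process many examples per step.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(center_vec, context_vec, label, learning_rate=0.1):
    """One gradient step of logistic regression on the dot product of two embeddings."""
    score = sum(a * b for a, b in zip(center_vec, context_vec))
    prediction = sigmoid(score)
    error = prediction - label  # gradient of the log loss with respect to the score
    # Compute both updates from the old values before applying them
    new_center = [c - learning_rate * error * x for c, x in zip(center_vec, context_vec)]
    new_context = [x - learning_rate * error * c for c, x in zip(center_vec, context_vec)]
    return new_center, new_context, prediction

# Hypothetical starting embeddings for a pair that really are neighbors (label 1)
center, context = [0.1, -0.2, 0.3], [0.2, 0.1, -0.1]
history = []
for _ in range(50):
    center, context, prediction = train_step(center, context, label=1)
    history.append(prediction)
print(f"first prediction: {history[0]:.3f}, last prediction: {history[-1]:.3f}")
```

Repeating the step on a positive pair pulls the two vectors toward each other, so the predicted probability of "neighbors" rises; a negative pair (label 0) would push them apart instead.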


<h2>Embeddings for Recommendation Systems</h2>
<h3>Recommending Songs by Embeddings</h3>
<p>
In this section, we'll use the <a href="https://oreil.ly/A-AK6"><b style="color: #9c1628ff">dataset</b></a> collected by Shuo Chen from Cornell University. It contains playlists from hundreds of radio stations around the US.
<img src="../image/example_shuo_chen_dataset.jpg">
</p>

In [10]:
import pandas as pd
from urllib import request

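# Get the playlist dataset file
data = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt')

# Parse the file, skipping the first two lines, which contain only metadata
lines = data.read().decode("utf-8").split('\n')[2:]

# Keep only playlists with more than one song
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]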

In [11]:
lines

['0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 2 42 43 44 45 46 47 48 20 49 8 50 51 52 53 54 55 56 57 25 58 59 60 61 62 3 63 64 65 66 46 47 67 2 48 68 69 70 57 50 71 72 53 73 25 74 59 20 46 75 76 77 59 20 43 ',
 '78 79 80 3 62 81 14 82 48 83 84 17 85 86 87 88 74 89 90 91 4 73 62 92 17 53 59 93 94 51 50 27 95 48 96 97 98 99 100 57 101 102 25 103 3 104 105 106 107 47 108 109 110 111 112 113 25 63 62 114 115 84 116 117 118 119 120 121 122 123 50 70 71 124 17 85 14 82 48 125 47 46 72 53 25 73 4 126 59 74 20 43 127 128 129 13 82 48 130 131 132 133 134 135 136 137 59 46 138 43 20 139 140 73 57 70 141 3 1 74 142 143 144 145 48 13 25 146 50 147 126 59 20 148 149 150 151 152 56 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 60 176 51 177 178 179 180 181 182 183 184 185 57 186 187 188 189 190 191 46 192 193 194 195 196 197 198 25 199 200 49 201 100 202 203 204 205 206 207 32 208 20

In [12]:
playlists

[['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ..., '59', '20', '43'],
 ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', ...

In [13]:
songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')
songs_file = songs_file.read().decode("utf-8").split('\n')
songs = [s.rstrip().split('\t') for s in songs_file]

songs_df = pd.DataFrame(data=songs, columns=['id', 'title', 'artist'])
songs_df = songs_df.set_index('id')

In [14]:
songs_df.head()

id,title,artist
0,Gucci Time (w\/ Swizz Beatz),Gucci Mane
1,Aston Martin Music (w\/ Drake & Chrisette Mich...,Rick Ross
2,Get Back Up (w\/ Chris Brown),T.I.
3,Hot Toddy (w\/ Jay-Z & Ester Dean),Usher
4,Whip My Hair,Willow


In [15]:
print( 'Playlist #1:\n ', playlists[0], '\n')
print( 'Playlist #2:\n ', playlists[1])


Playlist #1:
  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43'] 

Playlist #2:
  ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62', '114', '115', '84', '116', '117',

In [16]:
from gensim.models import Word2Vec

model = Word2Vec(
    playlists, vector_size=32, window=20, negative=50, min_count=1, workers=4
)

In [17]:
song_id = 2172
model.wv.most_similar(positive=str(song_id))


[('2849', 0.9985262155532837),
 ('2976', 0.9981426000595093),
 ('3094', 0.9957646727561951),
 ('11473', 0.9956449270248413),
 ('3119', 0.9949755072593689),
 ('5586', 0.9949577450752258),
 ('11502', 0.9944504499435425),
 ('3116', 0.9941827654838562),
 ('6658', 0.9941803812980652),
 ('2640', 0.9941173195838928)]

In [18]:
print(songs_df.iloc[2172])

title     Fade To Black
artist        Metallica
Name: 2172 , dtype: object


In [20]:
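import numpy as np

def print_recommendations(song_id):
    # most_similar returns (song_id, similarity) pairs; keep only the IDs
    similar_songs = np.array(
        model.wv.most_similar(positive=str(song_id), topn=5)
    )[:, 0]
    # The IDs come back as strings, so cast them to int for positional lookup
    return songs_df.iloc[similar_songs.astype(int)]

print_recommendations(312)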

id,title,artist
3310,Action (w\/ Nadine Sutherland),Terror Fabulous
12725,Thank You Mamma,Sizzla
11898,Joy Ride (w\/ Baby Cham),Wayne Wonder
61,On My Mind,Da'Ville
12145,Sorry,Foxy Brown [Reggae]
