<a href="https://colab.research.google.com/github/harnalashok/deeplearning-sequences/blob/main/skipgram_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
r"""
Last amended: 3rd Nov, 2022
My folder: C:\Users\Ashok\OneDrive\Documents\skipgrams

Ref:
https://ljvmiranda921.github.io/notebook/2021/12/11/word-vectors/#pairs
https://www.kaggle.com/competitions/word2vec-nlp-tutorial/overview/part-2-word-vectors
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/skipgrams
https://stackoverflow.com/a/1994012

Objectives:
        i)   To generate skipgram pairs (a word paired with a word from its window) from a text

"""

## Install software

In [1]:
! pip install --upgrade gensim 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Call libraries

In [2]:
# 1.0 Call libraries
#%reset -f
import pandas as pd
import numpy as np

# 1.1 Import tensorflow
import tensorflow as tf


# 1.2 API to manipulate sequences of words
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.sequence import skipgrams

# 1.3
import gensim
import re
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from collections import Counter

# 1.4
import os


In [3]:
# 1.5 Download stopwords from nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
# 1.6 Display multiple commands output from a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Useful functions


In [5]:
# 2.0 Function to clean text (from Kaggle):

def review_to_wordlist( review, remove_stopwords=False ):
    # Function to convert a document to a sequence of words,
    # optionally removing stop words.  Returns a list of words.
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(review, "html.parser").get_text()
    #
    # 2. Remove non-letters
    review_text = re.sub("[^a-zA-Z]"," ", review_text)
    #
    # 3. Convert words to lower case and split them
    words = review_text.lower().split()
    #
    # 4. Optionally remove stop words (false by default)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    #
    # 5. Return a list of words
    return words
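
As a quick illustration of what the cleaning steps do, here is a minimal sketch that uses only the standard library. It skips the HTML-stripping step, and `TOY_STOPS` is a tiny hypothetical stop-word set standing in for NLTK's English list, so its output will not match `review_to_wordlist()` on real text.

```python
import re

# Hypothetical, tiny stop-word set; NLTK's English list is much longer
TOY_STOPS = {"is", "a", "of", "that", "to", "the"}

def toy_wordlist(text, remove_stopwords=False):
    # Replace every non-letter with a space, lower-case, and split on whitespace
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in TOY_STOPS]
    return words

sample = "Football is a family of team sports (since the 19th century)."
tokens_all = toy_wordlist(sample)
tokens_kept = toy_wordlist(sample, remove_stopwords=True)
print(tokens_all)
print(tokens_kept)
```

Because digits are replaced by spaces, `19th` is reduced to the token `th`, which is why `th` shows up among the vocabulary words.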


## Read data
Upload the text file <i>'football.txt'</i> directly to the `/content/` folder of the Colab virtual machine.

In [6]:
# 3.0 We will upload our text file (football.txt)
#      directly to /content/ folder from our laptop:

os.chdir(r"/content/")

In [7]:
# 3.1 Read football.txt:

tx_data = pd.read_csv("football.txt", sep = "\t", header = 'infer')
tx_data.head()

Unnamed: 0,text
0,Football is a family of team sports that invol...


In [8]:
# 3.2 Examine relevant column:
tx_data['text']


0    Football is a family of team sports that invol...
Name: text, dtype: object

In [9]:
# 3.3 Get complete text as a list of one string:

text = list(tx_data['text'])
text

['Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. Unqualified, the word football normally means the form of football that is the most popular where the word is used. Sports commonly called football include association football (known as soccer in North America and Oceania); gridiron football (specifically American football or Canadian football); Australian rules football; rugby union and rugby league; and Gaelic football.1 These various forms of football share to varying extent common origins and are known as football codes. There are a number of references to traditional, ancient, or prehistoric ball games played in many different parts of the world.234 Contemporary codes of football can be traced back to the codification of these games at English public schools during the 19th century.56 The expansion and cultural influence of the British Empire allowed these rules of football to spread to areas of British influence outside the di

## Process text
Clean text and create word-to-int index

In [10]:
# 4.0 Clean the text and get tokens:

cleaned_tokens = review_to_wordlist(text[0], remove_stopwords=True)

In [11]:
# 4.1 Look at tokens and how many total words:
cleaned_tokens[:10]
print()
len(cleaned_tokens)

['football',
 'family',
 'team',
 'sports',
 'involve',
 'varying',
 'degrees',
 'kicking',
 'ball',
 'score']




740

In [12]:
# 4.2 Number of unique words:

max_vocab = len(set(cleaned_tokens))
max_vocab  # 480

480

In [13]:
# 5.0 Count how many times
#     each word occurs:

freq_of_words = Counter(cleaned_tokens)
freq_of_words

Counter({'football': 29,
         'family': 1,
         'team': 4,
         'sports': 4,
         'involve': 1,
         'varying': 2,
         'degrees': 1,
         'kicking': 4,
         'ball': 10,
         'score': 1,
         'goal': 4,
         'unqualified': 1,
         'word': 4,
         'normally': 1,
         'means': 1,
         'form': 2,
         'popular': 3,
         'used': 1,
         'commonly': 1,
         'called': 1,
         'include': 7,
         'association': 2,
         'known': 3,
         'soccer': 1,
         'north': 1,
         'america': 1,
         'oceania': 1,
         'gridiron': 1,
         'specifically': 1,
         'american': 2,
         'canadian': 2,
         'australian': 2,
         'rules': 5,
         'rugby': 4,
         'union': 2,
         'league': 3,
         'gaelic': 3,
         'various': 7,
         'forms': 1,
         'share': 2,
         'extent': 1,
         'common': 4,
         'origins': 2,
         'codes': 8,
         '

In [14]:
# 5.1 Sort words in order of freq
#     Most freq at the top:

vocab = sorted(freq_of_words,
               key=freq_of_words.get,
               reverse=True)

# 5.1.1
vocab   # Most freq at the top; least at the bottom

['football',
 'ball',
 'symptoms',
 'religion',
 'codes',
 'include',
 'various',
 'covid',
 'may',
 'people',
 'develop',
 'players',
 'rules',
 'virus',
 'religious',
 'team',
 'sports',
 'kicking',
 'goal',
 'word',
 'rugby',
 'common',
 'games',
 'disease',
 'religions',
 'popular',
 'known',
 'league',
 'gaelic',
 'many',
 'public',
 'th',
 'century',
 'spread',
 'elements',
 'two',
 'also',
 'either',
 'foot',
 'severe',
 'risk',
 'contaminated',
 'transcription',
 'social',
 'beliefs',
 'supernatural',
 'sacred',
 'life',
 'varying',
 'form',
 'association',
 'american',
 'canadian',
 'australian',
 'union',
 'share',
 'origins',
 'traditional',
 'played',
 'different',
 'world',
 'cultural',
 'influence',
 'british',
 'empire',
 'end',
 'distinct',
 'developing',
 'first',
 'several',
 'moved',
 'field',
 'hands',
 'teams',
 'usually',
 'defined',
 'area',
 'scoring',
 'goals',
 'points',
 'opposing',
 'line',
 'resulting',
 'goalposts',
 'explanations',
 'origin',
 'explanatio

In [15]:
# 5.2 Get a dict of word and its int label:
#     Most freq word gets transformed to 1:

word_index = {word: ii for ii, word in enumerate(vocab, 1)}
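
The counting, sorting, and indexing steps above can be traced on a toy token list. This is a minimal sketch; the `toy_*` names and the tokens are made up for illustration.

```python
from collections import Counter

toy_tokens = ["ball", "football", "ball", "goal", "football", "ball", "team"]

toy_freq = Counter(toy_tokens)
# sorted() is stable, so words with equal counts keep their first-seen order
toy_vocab = sorted(toy_freq, key=toy_freq.get, reverse=True)
# Numbering starts at 1, so index 0 is never assigned to a word
toy_index = {word: ii for ii, word in enumerate(toy_vocab, 1)}
print(toy_index)
```

Starting at 1 matters because `skipgrams()` treats index 0 as a non-word and skips it.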

In [16]:
# 5.3 Here is our word to int index:

word_index

{'football': 1,
 'ball': 2,
 'symptoms': 3,
 'religion': 4,
 'codes': 5,
 'include': 6,
 'various': 7,
 'covid': 8,
 'may': 9,
 'people': 10,
 'develop': 11,
 'players': 12,
 'rules': 13,
 'virus': 14,
 'religious': 15,
 'team': 16,
 'sports': 17,
 'kicking': 18,
 'goal': 19,
 'word': 20,
 'rugby': 21,
 'common': 22,
 'games': 23,
 'disease': 24,
 'religions': 25,
 'popular': 26,
 'known': 27,
 'league': 28,
 'gaelic': 29,
 'many': 30,
 'public': 31,
 'th': 32,
 'century': 33,
 'spread': 34,
 'elements': 35,
 'two': 36,
 'also': 37,
 'either': 38,
 'foot': 39,
 'severe': 40,
 'risk': 41,
 'contaminated': 42,
 'transcription': 43,
 'social': 44,
 'beliefs': 45,
 'supernatural': 46,
 'sacred': 47,
 'life': 48,
 'varying': 49,
 'form': 50,
 'association': 51,
 'american': 52,
 'canadian': 53,
 'australian': 54,
 'union': 55,
 'share': 56,
 'origins': 57,
 'traditional': 58,
 'played': 59,
 'different': 60,
 'world': 61,
 'cultural': 62,
 'influence': 63,
 'british': 64,
 'empire': 65,
 'e

## Get skipgrams
Transform text to int sequence and get skipgram pairs

In [17]:
# 6.0 Function to map a word list to integers
#     as per mapping in word_index dict:

def seq2int(wordList, word_index):
    return [word_index[x] for x in wordList]

In [18]:
# 6.1 Here is our int seq:

int_seq = seq2int(cleaned_tokens, word_index)
int_seq[:10]

[1, 119, 16, 17, 120, 49, 121, 18, 2, 122]
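
`seq2int()` raises a `KeyError` for any word missing from `word_index`, which could happen if new text were encoded with this index. One possible variant, sketched here under the hypothetical name `seq2int_known`, simply drops unknown words; the toy index is made up for illustration.

```python
def seq2int_known(word_list, word_index):
    # Keep only words present in the index; unknown words are dropped
    return [word_index[w] for w in word_list if w in word_index]

toy_index = {"football": 1, "ball": 2, "team": 3}
encoded = seq2int_known(["football", "stadium", "ball"], toy_index)
print(encoded)
```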

In [19]:
# 6.2 Translate the int_seq to pairs of skipgrams:

pairs = skipgrams(
                   int_seq,
                   vocabulary_size = max_vocab + 1,   # max word index + 1; indices run 1..max_vocab
                   window_size=2,
                   negative_samples=1.0,
                   shuffle=True,
                   categorical=False,
                   sampling_table=None,
                   seed=None
                  )
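
To make the `window_size` and `negative_samples` arguments concrete, here is a simplified pure-Python sketch of skipgram pair generation. It is an illustration modeled on the documented behavior, not the Keras implementation: `toy_skipgrams` is a hypothetical name, and the sketch omits shuffling, the sampling table, and the skipping of index 0.

```python
import random

def toy_skipgrams(seq, vocabulary_size, window_size=2, negative_samples=1.0, seed=0):
    rng = random.Random(seed)
    couples, labels = [], []
    # Positive pairs: each word with every neighbour inside the window
    for i, wi in enumerate(seq):
        start = max(0, i - window_size)
        end = min(len(seq), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                couples.append([wi, seq[j]])
                labels.append(1)
    # Negative pairs: a word from the sequence with a randomly drawn word index
    n_neg = int(len(labels) * negative_samples)
    targets = [c[0] for c in couples]
    rng.shuffle(targets)
    for k in range(n_neg):
        couples.append([targets[k % len(targets)], rng.randint(1, vocabulary_size - 1)])
        labels.append(0)
    return couples, labels

toy_couples, toy_labels = toy_skipgrams([1, 2, 3, 4], vocabulary_size=5, window_size=1)
positives = [c for c, lab in zip(toy_couples, toy_labels) if lab == 1]
print(positives)
print(len(toy_couples), sum(toy_labels))
```

With `negative_samples=1.0` the sketch produces as many label-0 pairs as label-1 pairs. A randomly drawn "negative" word can occasionally be a true neighbour, so the labels are noisy by construction.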

In [20]:
# 6.3 Look at the word pairs; pairs[0] holds the pairs, pairs[1] their labels (1 or 0):

pairs[0]

[[455, 213],
 [61, 265],
 [180, 181],
 [426, 357],
 [455, 453],
 [333, 208],
 [17, 126],
 [39, 245],
 [382, 35],
 [245, 134],
 [440, 439],
 [85, 84],
 [1, 1],
 [38, 53],
 [196, 7],
 [93, 219],
 [279, 98],
 [20, 1],
 [102, 101],
 [2, 33],
 [18, 49],
 [459, 300],
 [230, 232],
 [4, 386],
 [7, 5],
 [154, 5],
 [117, 116],
 [380, 378],
 [20, 85],
 [417, 418],
 [225, 379],
 [11, 406],
 [1, 129],
 [439, 408],
 [267, 10],
 [362, 222],
 [57, 51],
 [11, 237],
 [82, 27],
 [246, 243],
 [23, 57],
 [24, 435],
 [12, 80],
 [474, 100],
 [11, 460],
 [67, 5],
 [375, 376],
 [385, 383],
 [68, 155],
 [417, 31],
 [1, 71],
 [86, 1],
 [29, 2],
 [88, 470],
 [142, 379],
 [173, 103],
 [81, 362],
 [451, 116],
 [434, 414],
 [416, 418],
 [58, 431],
 [283, 225],
 [235, 164],
 [469, 70],
 [399, 349],
 [178, 392],
 [277, 101],
 [231, 232],
 [308, 314],
 [299, 390],
 [9, 200],
 [277, 173],
 [453, 455],
 [382, 229],
 [103, 43],
 [24, 229],
 [195, 16],
 [231, 353],
 [351, 8],
 [60, 136],
 [345, 344],
 [386, 389],
 [43, 421

## Save processed data
Save the skipgram pairs and the word-to-index dict to the `/content/` folder of the virtual machine

In [25]:
# 7.0 Transform pairs to a pandas DataFrame and pickle it
#     (the labels in pairs[1] are not saved):

data = pd.DataFrame(pairs[0], columns = ["a", "b"])
data.to_pickle("/content/seq.pkl")
data.head()

Unnamed: 0,a,b
0,455,213
1,61,265
2,180,181
3,426,357
4,455,453
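
The DataFrame above holds only `pairs[0]`, so the saved file cannot tell positive pairs from negative ones. If the labels are needed later, one option is to store them as a third column. The sketch below uses the standard-library `csv` module instead of pandas, with made-up toy data and a temporary file path; all names are hypothetical.

```python
import csv
import os
import tempfile

# Made-up data in the (couples, labels) shape returned by skipgrams()
toy_pairs = ([[1, 2], [2, 1], [3, 4]], [1, 1, 0])

# One row per pair: first word, second word, label
rows = [(a, b, label) for (a, b), label in zip(toy_pairs[0], toy_pairs[1])]

csv_path = os.path.join(tempfile.mkdtemp(), "seq_with_labels.csv")
with open(csv_path, "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["a", "b", "label"])
    writer.writerows(rows)

# Read the rows back, skipping the header line
with open(csv_path, newline="") as fh:
    restored = [tuple(int(x) for x in row) for row in list(csv.reader(fh))[1:]]
print(restored)
```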


In [26]:
# 7.1 Save word_to_index dict to a text file:

with open("/content/word_index.txt", 'wt') as filehandler:
    n_chars = filehandler.write(str(word_index))   # number of characters written
n_chars

7643

## Read back saved files

In [27]:
# 8.0 Read back pkl file:
seq = pd.read_pickle("/content/seq.pkl")
seq.head()

Unnamed: 0,a,b
0,455,213
1,61,265
2,180,181
3,426,357
4,455,453


In [None]:
# 8.1 Read saved dict:

with open("/content/word_index.txt", 'r') as filehandler:
    saved_text = filehandler.read()   # the dict comes back as one string
saved_text
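
Reading the file returns the dictionary as a single string, not as a `dict`. Since the file holds the output of `str(word_index)`, one way to rebuild the dictionary is `ast.literal_eval`, which parses Python literals without executing code. A minimal sketch on a toy dict and a temporary path (both made up for illustration):

```python
import ast
import os
import tempfile

toy_index = {"football": 1, "ball": 2}

path = os.path.join(tempfile.mkdtemp(), "word_index.txt")
with open(path, "wt") as fh:
    fh.write(str(toy_index))      # same str(dict) format as the notebook uses

with open(path, "r") as fh:
    restored_index = ast.literal_eval(fh.read())   # parse the literal back into a dict

print(restored_index == toy_index)
```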

In [None]:
#####################