
In [1]:
# https://www.kaggle.com/shivamkushwaha/bbc-full-text-document-classification
!wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv

--2023-06-11 04:48:01--  https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv
Resolving lazyprogrammer.me (lazyprogrammer.me)... 104.21.23.210, 172.67.213.166, 2606:4700:3030::ac43:d5a6, ...
Connecting to lazyprogrammer.me (lazyprogrammer.me)|104.21.23.210|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5085081 (4.8M) [text/csv]
Saving to: ‘bbc_text_cls.csv’


2023-06-11 04:48:01 (59.9 MB/s) - ‘bbc_text_cls.csv’ saved [5085081/5085081]



In [2]:
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.tokenize import word_tokenize
import textwrap

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
text_df = pd.read_csv('bbc_text_cls.csv')
text_df.sample(5)

Unnamed: 0,text,labels
1978,Games win for Blu-ray DVD format\n\nThe next-g...,tech
2052,Robotic pods take on car design\n\nA new breed...,tech
1647,Bortolami predicts dour contest\n\nItaly skipp...,sport
632,Label withdraws McFadden's video\n\nThe new vi...,entertainment
2006,Consumers 'snub portable video'\n\nConsumers w...,tech


In [5]:
label = 'tech'
specific_df = text_df[text_df.labels == label].text
specific_df

1824    Ink helps drive democracy in Asia\n\nThe Kyrgy...
1825    China net cafe culture crackdown\n\nChinese au...
1826    Microsoft seeking spyware trojan\n\nMicrosoft ...
1827    Digital guru floats sub-$100 PC\n\nNicholas Ne...
1828    Technology gets the creative bug\n\nThe hi-tec...
                              ...                        
2220    BT program to beat dialler scams\n\nBT is intr...
2221    Spam e-mails tempt net shoppers\n\nComputer us...
2222    Be careful how you code\n\nA new European dire...
2223    US cyber security chief resigns\n\nThe man mak...
2224    Losing yourself in online gaming\n\nOnline rol...
Name: text, Length: 401, dtype: object

In [6]:
# reindex aligns on existing index labels: the old index runs 1824..2224, so none of the
# new labels 0..400 exist in it and every value comes back as NaN
specific_df.reindex(index=range(len(specific_df)))

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
      ... 
396    NaN
397    NaN
398    NaN
399    NaN
400    NaN
Name: text, Length: 401, dtype: object
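The all-NaN result above follows from how `reindex` looks up labels rather than positions. A minimal sketch on a toy Series (the values and index labels here are made up for illustration) contrasts it with `reset_index(drop=True)`, which renumbers the rows instead:

```python
import pandas as pd

# Toy Series standing in for specific_df; labels 10..12 play the role of 1824..2224.
toy = pd.Series(["first article", "second article", "third article"], index=[10, 11, 12])

# reindex looks up each requested label in the existing index.
# Labels 0, 1, 2 are absent, so every entry is NaN.
reindexed = toy.reindex([0, 1, 2])

# reset_index(drop=True) discards the old labels and numbers the rows 0..n-1.
renumbered = toy.reset_index(drop=True)

print(reindexed.isna().all())
print(list(renumbered.index), list(renumbered))
```

Assigning a new list to `.index` achieves the same renumbering in place; `reset_index(drop=True)` simply returns a new object.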

In [7]:
# specific_df is a Series, so only its index needs renumbering to 0..n-1;
# a Series has no columns to rename
specific_df = specific_df.reset_index(drop=True)
specific_df[0]

'Ink helps drive democracy in Asia\n\nThe Kyrgyz Republic, a small, mountainous state of the former Soviet republic, is using invisible ink and ultraviolet readers in the country\'s elections as part of a drive to prevent multiple voting.\n\nThis new technology is causing both worries and guarded optimism among different sectors of the population. In an effort to live up to its reputation in the 1990s as "an island of democracy", the Kyrgyz President, Askar Akaev, pushed through the law requiring the use of ink during the upcoming Parliamentary and Presidential elections. The US government agreed to fund all expenses associated with this decision.\n\nThe Kyrgyz Republic is seen by many experts as backsliding from the high point it reached in the mid-1990s with a hastily pushed through referendum in 2003, reducing the legislative branch to one chamber with 75 deputies. The use of ink is only one part of a general effort to show commitment towards more open elections - the German Embassy

In [None]:
for i in range(5):
  print(specific_df[i], '\n\n')  # quick sanity check on the first five articles

In [None]:
for para in specific_df:
  print(para)

In [53]:
specific_df[0].split('\n')

['Ad sales boost Time Warner profit',
 '',
 'Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.',
 '',
 'The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.',
 '',
 "Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers a

In [None]:
specific_df[0].split()

In [8]:
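def trigram_probs(df):
  # Count, for every (previous word, next word) pair, how often each middle word occurs.
  # The returned dict holds raw counts; a later cell normalizes them into probabilities.
  probs_dict = {}
  for para in df:
    # tokenize line by line so that no trigram spans the headline or a paragraph break
    lines = para.split("\n")
    for line in lines:
      tokens = word_tokenize(line)
      for i in range(len(tokens) - 2):
        t_0 = tokens[i]
        t_1 = tokens[i+1]
        t_2 = tokens[i+2]
        key = (t_0, t_2)
        if key not in probs_dict:
          probs_dict[key] = {}

        if t_1 not in probs_dict[key]:
          probs_dict[key][t_1] = 1
        else:
          probs_dict[key][t_1] += 1
  return probs_dict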

In [9]:
tech_probs_dict = trigram_probs(specific_df)

# Example entry: ('law', 'the'): {'requiring': 1, 'both': 2, 'at': 1, 'for': 1}
# A key maps to several middle words because the same (previous, next) word pair can surround
# different words across the corpus; each value counts how often that middle word appeared.
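To make the structure of these counts concrete, here is a minimal sketch on three made-up sentences. It assumes plain whitespace splitting instead of `word_tokenize`, and the helper name `middle_word_counts` is hypothetical, not part of the notebook:

```python
from collections import defaultdict

def middle_word_counts(sentences):
    """Map each (previous, next) word pair to a dict of middle-word counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.split()  # assumption: whitespace tokens, not NLTK tokens
        # Slide a window of three tokens over the sentence.
        for prev_word, middle, next_word in zip(tokens, tokens[1:], tokens[2:]):
            counts[(prev_word, next_word)][middle] += 1
    # Convert the nested defaultdicts to plain dicts for readable output.
    return {key: dict(middles) for key, middles in counts.items()}

toy_sentences = [
    "the new law requires ink",
    "the old law requires ink",
    "the new law bans ink",
]
toy_counts = middle_word_counts(toy_sentences)
print(toy_counts[("the", "law")])   # {'new': 2, 'old': 1}
print(toy_counts[("law", "ink")])   # {'requires': 2, 'bans': 1}
```

Keys whose context appears with only one middle word, such as `("new", "requires")`, end up with a single entry and therefore offer nothing to swap in later.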

In [10]:
len(tech_probs_dict)

108655

In [11]:
# conversion of counts into probabilities
for key, word_possibilities in tech_probs_dict.items():
  total = sum(word_possibilities.values())
  for word, count in word_possibilities.items():
    word_possibilities[word] = count/total
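As a worked instance of this normalization, the sketch below applies the same division to the single `('law', 'the')` entry quoted earlier. It builds a new dict rather than overwriting in place, which is a stylistic choice made for this illustration only:

```python
# Counts for the ('law', 'the') context, as quoted in the comment above.
counts = {"requiring": 1, "both": 2, "at": 1, "for": 1}

total = sum(counts.values())  # 5 observations of this context in total
probs = {word: count / total for word, count in counts.items()}

print(probs)  # {'requiring': 0.2, 'both': 0.4, 'at': 0.2, 'for': 0.2}
```

Each inner dict becomes its own probability distribution over middle words, so the values under one key sum to 1 (up to floating-point rounding).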

In [12]:
tech_probs_dict

{('Ink', 'drive'): {'helps': 1.0},
 ('helps', 'democracy'): {'drive': 1.0},
 ('drive', 'in'): {'democracy': 0.3333333333333333,
  'sales': 0.6666666666666666},
 ('democracy', 'Asia'): {'in': 1.0},
 ('The', 'Republic'): {'Kyrgyz': 0.6666666666666666,
  'Old': 0.3333333333333333},
 ('Kyrgyz', ','): {'Republic': 0.5, 'President': 0.5},
 ('Republic', 'a'): {',': 1.0},
 (',', 'small'): {'a': 0.6666666666666666, 'really': 0.3333333333333333},
 ('a', ','): {'small': 0.006024096385542169,
  'year': 0.030120481927710843,
  'virtual': 0.006024096385542169,
  'gadget': 0.012048192771084338,
  'first': 0.012048192771084338,
  'VCR': 0.006024096385542169,
  'fast': 0.018072289156626505,
  'day': 0.012048192771084338,
  'statement': 0.03614457831325301,
  'blog': 0.006024096385542169,
  'phone': 0.018072289156626505,
  'task': 0.006024096385542169,
  'high-speed': 0.012048192771084338,
  'second': 0.012048192771084338,
  'weblogger': 0.006024096385542169,
  'questionnaire': 0.012048192771084338,
  '

In [33]:
specific_df[0].split('\n')[2]

"The Kyrgyz Republic, a small, mountainous state of the former Soviet republic, is using invisible ink and ultraviolet readers in the country's elections as part of a drive to prevent multiple voting."

In [None]:
word_tokenize(specific_df[0].split('\n')[2])

In [13]:
detokenizer = TreebankWordDetokenizer()
detokenizer.detokenize(word_tokenize(specific_df[0].split('\n')[2]))

# detokenizing the tokenized line reproduces the original sentence

"The Kyrgyz Republic, a small, mountainous state of the former Soviet republic, is using invisible ink and ultraviolet readers in the country's elections as part of a drive to prevent multiple voting."

In [14]:
# seed NumPy's random number generator so the sampled words are reproducible
np.random.seed(1234)

In [15]:
# sample one middle word from a {word: probability} dict by walking the cumulative distribution
def sample_words(prob_dict):
  prob_threshold = np.random.random()
  cumulative = 0
  for term, probability in prob_dict.items():
    cumulative += probability
    if prob_threshold < cumulative:
      return term
  assert False, "unreachable if the probabilities sum to 1"
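The cumulative-threshold idea can be checked empirically. The sketch below is a standard-library variant, assuming `random.Random` in place of NumPy; the name `sample_word` and the fallback `return` are additions for this illustration. It draws many samples from the `('drive', 'in')` distribution shown in the output above:

```python
import random
from collections import Counter

def sample_word(prob_dict, rng):
    """Draw one key from a {word: probability} dict via its cumulative distribution."""
    threshold = rng.random()  # uniform in [0, 1)
    cumulative = 0.0
    for word, probability in prob_dict.items():
        cumulative += probability
        if threshold < cumulative:
            return word
    # Assumption: fall back to the last word if rounding leaves the sum just below 1.
    return word

rng = random.Random(1234)
dist = {"democracy": 1 / 3, "sales": 2 / 3}
draws = Counter(sample_word(dist, rng) for _ in range(10_000))
print(draws)
```

Over many draws, `sales` should come up roughly twice as often as `democracy`, matching the stored probabilities.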

In [16]:
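def spin_lines_randomly(line):
  tokens = word_tokenize(line)
  if not tokens:      # whitespace-only line: nothing to spin
    return line
  i = 0
  output = [tokens[0]]
  while i < (len(tokens) - 2):
    t_0 = tokens[i]
    t_1 = tokens[i+1]
    t_2 = tokens[i+2]
    key = (t_0, t_2)
    probs_dist = tech_probs_dict[key]
    # only spin when the corpus offers more than one middle word, and then only 40% of the time
    if len(probs_dist) > 1 and np.random.random() < 0.4:
      middle = sample_words(probs_dist)
      # keep the original word and show the sampled alternative after it in < > for comparison
      output.append(t_1)
      output.append('<' + middle + '>')
      output.append(t_2)
      # skip ahead by two so the word right after a spun word is never spun itself
      i += 2
    else:
      output.append(t_1)
      i += 1
  # if the loop stopped one short of the end, the last token has not been appended yet
  if i == len(tokens) - 2:
    output.append(tokens[-1])
  return detokenizer.detokenize(output)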

In [17]:
def spin_document(whole_doc):
  lines = whole_doc.split('\n')
  output = []
  for line in lines:
    if line:
      new_line = spin_lines_randomly(line)
    else:
      new_line = line
    output.append(new_line)
  return '\n'.join(output)
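The spinning loop can be tried end to end without NLTK or the corpus. This is a rough sketch under stated assumptions: whitespace tokens, a tiny hand-written probability table, and a hypothetical `spin_line` helper with a configurable `spin_rate`. Unlike the notebook code, it uses `.get` so unseen contexts are simply left alone:

```python
import random

def spin_line(line, probs, rng, spin_rate=0.4):
    """Annotate a line with <alternative> words sampled from probs."""
    tokens = line.split()  # assumption: whitespace tokens, not NLTK tokens
    if not tokens:
        return line
    output = [tokens[0]]
    i = 0
    while i < len(tokens) - 2:
        options = probs.get((tokens[i], tokens[i + 2]), {})
        output.append(tokens[i + 1])  # the original middle word is always kept
        if len(options) > 1 and rng.random() < spin_rate:
            # Weighted draw of an alternative middle word for this context.
            alternative = rng.choices(list(options), weights=list(options.values()))[0]
            output.append("<" + alternative + ">")
            output.append(tokens[i + 2])
            i += 2  # the word after a spun word is not spun itself
        else:
            i += 1
    if i == len(tokens) - 2:
        output.append(tokens[-1])  # last token not yet emitted
    return " ".join(output)

toy_probs = {("the", "law"): {"new": 0.5, "old": 0.5}}  # hypothetical table
line = "the new law requires ink"
unchanged = spin_line(line, toy_probs, random.Random(0), spin_rate=0.0)
spun = spin_line(line, toy_probs, random.Random(0), spin_rate=1.0)
print(unchanged)
print(spun)
```

With `spin_rate=0.0` the line comes back untouched; with `spin_rate=1.0` every context that has more than one option is annotated, which makes the bookkeeping around `i` easy to inspect.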

In [18]:
text_df.labels.unique()

array(['business', 'entertainment', 'politics', 'sport', 'tech'],
      dtype=object)

In [22]:
# specific_df.shape[0] is the number of articles; np.random.choice(n) draws a random
# integer position in [0, n), which is then used to pick one article
location = np.random.choice(specific_df.shape[0])
text = specific_df.iloc[location]
new_text = spin_document(text)

In [23]:
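# Sampled alternatives appear in < > right after the original word they could replace.
# The original word is kept for comparison; the alternatives are drawn from corpus
# statistics, so they are not guaranteed to be synonyms.

print(textwrap.fill(new_text, replace_whitespace=False, fix_sentence_endings=False))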

UK pioneers digital film <film> network

The world <world>'s first
digital cinema network <worlds> will be established <those> in the UK
over the next <next> 18 months.

The UK Film Council has awarded a
contract <deal> worth £11.5m to Arts Alliance Digital Cinema (AADC
<MDA>), who <they> will set up the network <Association> of up to 250
screens . AADC <Viewers> will oversee the selection <umbrella> of
cinemas across the UK <industry> which will use <double> the digital
<communications> equipment . High definition projectors and computer
servers <users> will be installed <uploaded> to show mainly British
and specialist films . Most cinemas currently have mechanical
projectors but the new network will see <end> up to 250 screens in up
to 150 <150> cinemas fitted with digital projectors capable of
displaying high definition images . The new network will double the
world <world>'s total of digital <digital> screens . Cinemas will be
given the film <Directive> on a portable <portable> har