First, we import the libraries.

In [0]:
import pandas as pd
import shutil
import re
import nltk

In [32]:
nltk.download("punkt")
nltk.download("perluniprops")

from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()

from nltk.tokenize.moses import MosesDetokenizer
detokenizer = MosesDetokenizer()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package perluniprops to /root/nltk_data...
[nltk_data]   Package perluniprops is already up-to-date!


Now we fetch our token data.

In [33]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [0]:
filename = "All_Entries.json"
shutil.copy2('/content/gdrive/My Drive/'+filename,'.')

All_Entries = pd.read_json(filename).reset_index(drop=True)
All_Entries.begin = All_Entries.begin.astype(int)
All_Entries.end = All_Entries.end.astype(int)
All_Entries.delete = All_Entries.delete.astype(bool)
All_Entries.substr_words = All_Entries.substr_words.astype(int)

In [35]:
list(All_Entries.sample(5)["context"])

['Many people are agree with this point of view and have a lot of reasons for it. Firstly, large number of stadiums, sportsgrounds and swimming pools make sport activities [MASK] to citizens. For instance, you will attend to some sports when they are really geographically close to you.',
 'In sharp contrast to this, the percent of mothers of two children grows gradually and they keep getting the highest position all 25 years. The amount of families with one child also becomes higher [MASK] 5.6 percent. Finally, more and more women decide not to have child at all – 8.5 percent in 1981 versus 15.9 in 2006.',
 'To begin with, I would like to say, that it is true, that every year musicians and film producers lose huge amounts of money from illegal pirate copies, but I can’t agree, that copying and sharing music or films on the internet is bad and should be punished. For my opinion, musicians and film producers should be thankful to people[MASK] [MASK] theire products on the Internet, becau

# Adding contexts

Let's copy the REALEC dump over from Google Drive.

In [36]:
import shutil

drive_path = "Clean REALEC dumps/" #@param {type:"string"}
filename = "realec_110319_2315.tar.gz" #@param {type:"string"}


shutil.copy2('/content/gdrive/My Drive/'+drive_path+filename,'.')

'./realec_110319_2315.tar.gz'

In [0]:
import tarfile

tar = tarfile.open(filename)
tar.extractall()
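
A slightly more defensive way to unpack the archive is to open it in a `with` block, so the file handle is closed afterwards, and to extract into an explicit directory. The sketch below is illustrative only: `extract_archive` and its `dest` argument are hypothetical names, not part of the original notebook.

```python
import tarfile

def extract_archive(archive_path, dest="."):
    """Extract a tar archive into dest and return the member names."""
    # In the default 'r' mode, tarfile.open detects gzip compression by itself.
    with tarfile.open(archive_path) as tar:
        names = tar.getnames()
        tar.extractall(path=dest)
    return names
```

Calling `extract_archive(filename)` would unpack into the current directory, as the cell above does.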

Time to create another monster: a structure to contain all the texts we're interested in

In [38]:
%%time

Texts = list({x[:-4]+'.txt' for x in All_Entries["path"]})
# Read each text inside a `with` block so the file handle gets closed.
Text_Dict = {}
for path in Texts:
  with open(path, 'r', encoding='utf-8-sig') as f:
    Text_Dict[path] = f.read()

CPU times: user 182 ms, sys: 77.2 ms, total: 259 ms
Wall time: 259 ms


In [39]:
print(len(Text_Dict))

5554
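
The `[:-4]+'.txt'` slice used to build the text paths assumes that every entry path ends in a four-character suffix (a dot plus a three-letter extension). A sketch of the same swap that does not depend on the suffix length, using `os.path.splitext`; the helper name `text_path_for` and the sample path are made up for illustration.

```python
import os

def text_path_for(entry_path):
    """Replace whatever extension entry_path has with '.txt'."""
    root, _ext = os.path.splitext(entry_path)
    return root + ".txt"

print(text_path_for("some_dir/essay_1.ann"))  # some_dir/essay_1.txt
```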


Now let's add unmasked contexts to our dataset.

In [0]:
All_Entries['unmasked_context'] = ""

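We should still check a couple of things: that the placeholder character `chr(8)` survives whitespace normalization, and that it survives sentence tokenization.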

In [41]:
raw = chr(8)
raw = re.sub(r'\s', ' ', raw)
raw = re.sub(r'( )+', ' ', raw)
raw

'\x08'

In [42]:
s = "This is a generic example. This is "+chr(8)+" example of masked context."
ss = nltk.sent_tokenize(s)
ss

['This is a generic example.', 'This is \x08 example of masked context.']

Works great: the placeholder survives both steps. Good to go.
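
The first check passes because the backspace character `chr(8)` is not matched by the regex class `\s`, so whitespace normalization leaves it alone. A minimal, self-contained sketch of that property follows; `normalize_whitespace` and `PLACEHOLDER` are hypothetical names that simply bundle the two `re.sub` calls from the check.

```python
import re

PLACEHOLDER = chr(8)  # backspace, the marker used in the checks above

def normalize_whitespace(text):
    """Turn every whitespace character into a space, then collapse runs of spaces."""
    text = re.sub(r'\s', ' ', text)
    return re.sub(r'( )+', ' ', text)

sample = "before\n\n" + PLACEHOLDER + "\t after"
print(repr(normalize_whitespace(sample)))  # 'before \x08 after'
```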

In [43]:
%%time

for i, row in All_Entries.iterrows():
  fpath = row['path'][:-4]+'.txt'
  start = row['begin']
  end = row['end']
  raw = Text_Dict[fpath]
  substr = raw[start:end]
  raw = raw[:start] + chr(8) + raw[end:]
  raw = re.sub(r'\s', ' ', raw)
  raw = re.sub(r'( )+', ' ', raw)
  sentences = nltk.sent_tokenize(raw)
  for k, sentence in enumerate(sentences):
    if chr(8) in sentence:
      # Keep the previous sentence (if any), the marked one, and the next one.
      retstr = ' '.join(sentences[max(k-1, 0):k+2])
      All_Entries.at[i,'unmasked_context'] = retstr.replace(chr(8), substr)
      break

CPU times: user 1min 50s, sys: 70.1 ms, total: 1min 50s
Wall time: 1min 51s
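
The core of the loop above, picking the marked sentence together with its neighbours and putting the original substring back, can be isolated into a small function that is easy to test on its own. This is a sketch rather than the notebook's code: `context_window` is a hypothetical name, and it takes an already tokenized list of sentences so that it does not depend on NLTK.

```python
def context_window(sentences, substr, marker=chr(8)):
    """Return the sentence containing marker plus one neighbour on each side,
    with marker replaced by substr. Returns '' if no sentence has the marker."""
    for k, sentence in enumerate(sentences):
        if marker in sentence:
            # max() keeps the slice start from going negative for the first sentence.
            window = sentences[max(k - 1, 0):k + 2]
            return ' '.join(window).replace(marker, substr)
    return ""

sentences = ["One.", "Two " + chr(8) + " here.", "Three.", "Four."]
print(context_window(sentences, "goes"))  # One. Two goes here. Three.
```

Returning an empty string when no sentence contains the marker mirrors the empty default already stored in the `unmasked_context` column.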


In [44]:
list(All_Entries.sample(5)["unmasked_context"])

['Recently, there has been considered discussion of acception male and female students at universities. Some people reckon that equal numbers of men and women should be accepted in every subject.',
 'We can consider that Germany and China are more old-fashioned countries than USA. Meanwhile the markets in the UK will have steady decreasing in selling eBooks and steady increasing in selling print. The same with Germany and China.',
 'The young and the older try to care about themselves. Actually some people are sure that if the government increases the quantity of places for doing sport, the aim stated will be reached, other think there are lots of other ways in deciding this problem. To begin with, people of the first group want more sport equipment.',
 'Considering charts for Italy, the quantity of people of 15-59 years takes 61,6% out of 100%, but it will reduce to 2050. The number of children is low enough and it consists 14,3% out of all population, and it will be also reduce to 11

# Uploading the results

Wow. That was epic and took a long time. We'd better save the result to a JSON file so that future loads are faster.

In [45]:
%%time

jsonname = "All_Entries.json" #@param {type:"string"}
with open(jsonname, 'w', encoding="utf-8") as outie:
    # Write the DataFrame directly instead of building the call with exec().
    outie.write(All_Entries.to_json())

CPU times: user 397 ms, sys: 151 ms, total: 548 ms
Wall time: 550 ms


Whew, great! Let's upload this to Google Drive right now. 

In [0]:
# Install the PyDrive wrapper & import libraries.
# This only needs to be done once in a notebook.
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once in a notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [47]:
# Create & upload a file.
uploaded = drive.CreateFile({'title': jsonname})
uploaded.SetContentFile(jsonname)
uploaded.Upload()
print('Uploaded file with ID {}'.format(uploaded.get('id')))

Uploaded file with ID 1Ut0jtIhPA-mFVOobWhbYAL9NmxgGSgkq
