<a href="https://colab.research.google.com/github/malipalema/lafand-mt/blob/main/masakhane_baseline_on_lafand.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluating the Masakhane MT models on the English to Isizulu LAFAND test set

## Masakhane - Machine Translation for African Languages (Using JoeyNMT)

In [None]:
! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .

Cloning into 'joeynmt'...
remote: Enumerating objects: 3224, done.[K
remote: Counting objects: 100% (273/273), done.[K
remote: Compressing objects: 100% (143/143), done.[K
remote: Total 3224 (delta 155), reused 209 (delta 130), pack-reused 2951[K
Receiving objects: 100% (3224/3224), 8.19 MiB | 16.90 MiB/s, done.
Resolving deltas: 100% (2183/2183), done.
Processing /content/joeynmt
[33m  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.[0m
Collecting sacrebleu>=2.0.0
  Downloading sacrebleu-2.0.0-py3-none-any.whl (90 kB)
[K     |████████████████████████████████| 90 kB 3.9 MB/s 
[?25hCollecting subword-nmt
  Downloading subword_nmt-0.3.7-py2.

In [1]:
import torch
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

device_num = torch.cuda.current_device()
torch.cuda.get_device_name(device_num)
# torch.cuda.is_available()

'Tesla K80'

In [2]:
# Install opus-tools
! pip install opustools-pkg

Collecting opustools-pkg
  Downloading opustools_pkg-0.0.52-py3-none-any.whl (80 kB)
[K     |████████████████████████████████| 80 kB 3.8 MB/s 
[?25hInstalling collected packages: opustools-pkg
Successfully installed opustools-pkg-0.0.52


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
import os
import numpy as np
import pandas as pd

source_language = "en"
target_language = "yo" 
lc = True  # If True, lowercase the data.
seed = 42  # Random seed for shuffling.
tag = "baseline" # Give a unique name to your folder - this is to ensure you don't rewrite any models you've already submitted
vocab_size=4000
corpus = "JW300"

os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language
os.environ["tag"] = tag
os.environ["vocab_size"] = str(vocab_size)
os.environ["corpus"] = corpus

In [5]:
# This will save it to a folder in our gdrive instead! 
!mkdir -p "/content/drive/My Drive/masakhane/baseline/$src-$trg-$tag"
gdrive_path = f"/content/drive/My Drive/masakhane/baseline/{source_language}-{target_language}-{tag}/"
os.environ["gdrive_path"] = gdrive_path
! echo $gdrive_path

/content/drive/My Drive/masakhane/baseline/en-yo-baseline/


In [6]:
!echo $gdrive_path

/content/drive/My Drive/masakhane/baseline/en-yo-baseline/


In [7]:
# Download the global test set.
! wget https://raw.githubusercontent.com/masakhane-io/masakhane/master/jw300_utils/test/test.en-any.en
  
# And the specific test set for this language pair.
os.environ["trg"] = target_language 
os.environ["src"] = source_language 

! wget https://raw.githubusercontent.com/masakhane-io/masakhane/master/jw300_utils/test/test.en-$trg.en 
! mv test.en-$trg.en test.en
! wget https://raw.githubusercontent.com/masakhane-io/masakhane/master/jw300_utils/test/test.en-$trg.$trg 
! mv test.en-$trg.$trg test.$trg

--2021-12-03 16:07:14--  https://raw.githubusercontent.com/masakhane-io/masakhane/master/jw300_utils/test/test.en-any.en
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 277791 (271K) [text/plain]
Saving to: ‘test.en-any.en’


2021-12-03 16:07:14 (8.70 MB/s) - ‘test.en-any.en’ saved [277791/277791]

--2021-12-03 16:07:14--  https://raw.githubusercontent.com/masakhane-io/masakhane/master/jw300_utils/test/test.en-yo.en
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 201994 (197K) [text/plain]
Saving to: ‘test.en-yo.en’


2021-12-03 1

In [8]:
# Read the test data to filter from train and dev splits.
# Store english portion in set for quick filtering checks.
en_test_sents = set()
filter_test_sents = "test.en-any.en"
j = 0
blanks = [] # sometimes blank lines creep innto test set - store which lines these are
with open(filter_test_sents) as f:
  for line in f:
    en_test_sents.add(line.strip())
    if len(line)<=1:
      blanks.append(j)
    j += 1
print('Loaded {} global test sentences to filter from the training/dev data.'.format(j))
print(f'There are {len(blanks)} blank lines in the test set')

Loaded 3571 global test sentences to filter from the training/dev data.
There are 0 blank lines in the test set


In [9]:
# filter test set

source_file = f"test.{source_language}"
target_file = f"test.{target_language}"

source = []
target = []

with open(source_file) as f:
  source = f.readlines()
            
with open(target_file) as f:
  target = f.readlines()

df = pd.DataFrame(zip(source, target), columns=['source_sentence', 'target_sentence'])

# remove trailing newline chars
df['source_sentence'] = df['source_sentence'].str.rstrip('" \n')
df['target_sentence'] = df['target_sentence'].str.rstrip('" \n')

# remove leading newline chars
df['source_sentence'] = df['source_sentence'].str.lstrip('"')
df['target_sentence'] = df['target_sentence'].str.lstrip('"')

# remove rows with really short sentences
df = df[~(df['source_sentence'].str.len() <8)] # remove rows wher esource text len <8 characters
df = df[~(df['target_sentence'].str.len() <8)] # remove rows wher esource text len <8 characters

# save the filtered test set
df['source_sentence'].to_csv(f'{source_file}', index=False, header=False, doublequote=False)
df['target_sentence'].to_csv(f'{target_file}', index=False, header=False, doublequote=False)

In [10]:
df.head()

Unnamed: 0,source_sentence,target_sentence
0,Some names in this article have been changed .,A ti yí àwọn orúkọ kan padà nínú àpilẹ̀kọ yìí .
1,It does not belong to man who is walking even ...,Kì í ṣe ti ènìyàn tí ń rìn àní láti darí àwọn ...
2,Published by Jehovah’s Witnesses but now out o...,"Àwọn Ẹlẹ́rìí Jèhófà ló tẹ̀ ẹ́ jáde , àmọ́ wọn ..."
3,“ The whole world is lying in the power of the...,“ Gbogbo ayé wà lábẹ́ agbára ẹni burúkú náà . ”
4,"Moreover , do not call anyone your father on e...","Jù bẹ́ẹ̀ lọ , ẹ má pe ẹnikẹ́ni ní baba yín lór..."


In [16]:
# en_jw = read_file(data_dir + 'jw300.en')
# zu_jw = read_file(data_dir + 'jw300.zu')

menyo_train = pd.read_csv('/content/drive/MyDrive/masakhane/train.tsv', sep='\t')
en_menyo_train = menyo_train['English'].values
yo_menyo_train = menyo_train['Yoruba'].values

menyo_dev = pd.read_csv('/content/drive/MyDrive/masakhane/dev.tsv', sep='\t')
en_menyo_dev = menyo_dev['english'].values
yo_menyo_dev = menyo_dev['yoruba'].values

menyo_test = pd.read_csv('/content/drive/MyDrive/masakhane/test_news.tsv', sep='\t')
en_menyo_test = menyo_test['english'].values
yo_menyo_test = menyo_test['yoruba'].values

# merge data
# Train data
train_data_en = list(en_menyo_train)  # en_jw + list(en_menzu_train)
train_data_yo = list(yo_menyo_train)  # zu_jw + list(zu_menzu_train)

df_train_enyo = pd.DataFrame(train_data_en, columns=['source_sentence'])
df_train_enyo['target_sentence'] = train_data_yo

df_train_yoen = pd.DataFrame(train_data_yo, columns=['source_sentence'])
df_train_yoen['target_sentence'] = train_data_en

# dev data
df_dev_enyo = pd.DataFrame(en_menyo_dev, columns=['source_sentence'])
df_dev_enyo['target_sentence'] = yo_menyo_dev

df_dev_yoen = pd.DataFrame(yo_menyo_dev, columns=['source_sentence'])
df_dev_yoen['target_sentence'] = en_menyo_dev

# test data
df_test_enyo = pd.DataFrame(en_menyo_test, columns=['source_sentence'])
df_test_enyo['target_sentence'] = yo_menyo_test

df_test_yoen = pd.DataFrame(yo_menyo_test, columns=['source_sentence'])
df_test_yoen['target_sentence'] = en_menyo_test

In [18]:
df_test_enyo.shape

(3102, 2)

In [19]:
# How many samples
size = len(df_test_enyo)
print(f"\n {size} samples in original text")


 3102 samples in original text


## Pre-processing and export

It is generally a good idea to remove duplicate translations and conflicting translations from the corpus. In practice, these public corpora include some number of these that need to be cleaned.

In addition we will split our data into dev/test/train and export to the filesystem.

In [21]:
## Preprocessing - Step 1 : Drop NaNs

df_pp = df_test_enyo.dropna()
#df_pp.info(memory_usage='deep')
new_size = len(df_pp)
print(f"\n {size-new_size}({100*(size-new_size)/size :.2f} %) samples removed by dropping all NaNs")
size = new_size


 0(0.00 %) samples removed by dropping all NaNs


In [22]:
## Preprocessing - Step 2a : Drop all duplicates in Source (en) text

df_pp = df_pp.drop_duplicates(subset='source_sentence')
#df_pp.info(memory_usage='deep')
new_size = len(df_pp)
print(f"\n {size-new_size}({100*(size-new_size)/size :.2f} %) samples removed by dropping Source sentence duplicates")
size = new_size


 45(1.45 %) samples removed by dropping Source sentence duplicates


In [23]:
## Preprocessing - Step 2b : Drop all duplicates in Target (zu) text

df_pp = df_pp.drop_duplicates(subset='target_sentence')
#df_pp.info(memory_usage='deep')
new_size = len(df_pp)
print(f"\n {size-new_size}({100*(size-new_size)/size :.2f} %) samples removed by dropping Target sentence duplicates")
size = new_size


 0(0.00 %) samples removed by dropping Target sentence duplicates


In [24]:
##  Preprocessing - Step 3 : Remove all numeric entries

pattern = r"([0-9]*\.?[0-9]*)"  # catch integers and decimals
import re
r = re.compile(pattern)

In [25]:
%%time
##  Preprocessing - Step 3a : Remove all numeric entries - Source text

df_pp['source_sentence'] = df_pp['source_sentence'].str.replace(pattern,"")
df_pp['source_sentence'] = df_pp['source_sentence'].replace("",np.nan)

df_pp = df_pp.dropna()
#df_pp.info(memory_usage='deep')
new_size = len(df_pp)

print(f"\n {size-new_size}({100*(size-new_size)/size :.2f} %) samples removed by dropping nummeric entries from source text")
size = new_size


 0(0.00 %) samples removed by dropping nummeric entries from source text
CPU times: user 130 ms, sys: 0 ns, total: 130 ms
Wall time: 132 ms


In [26]:
# DOnt RUN
# Install fuzzy wuzzy to remove "almost duplicate" sentences in the
# test and training sets.
! pip install fuzzywuzzy
! pip install python-Levenshtein
import time
from fuzzywuzzy import process
import numpy as np
from os import cpu_count
from functools import partial
from multiprocessing import Pool


# reset the index of the training set after previous filtering
df_pp.reset_index(drop=False, inplace=True)

# Remove samples from the training data set if they "almost overlap" with the
# samples in the test set.

# Filtering function. Adjust pad to narrow down the candidate matches to
# within a certain length of characters of the given sample.
def fuzzfilter(sample, candidates, pad):
  candidates = [x for x in candidates if len(x) <= len(sample)+pad and len(x) >= len(sample)-pad] 
  if len(candidates) > 0:
    return process.extractOne(sample, candidates)[1]
  else:
    return np.nan

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0
Collecting python-Levenshtein
  Downloading python-Levenshtein-0.12.2.tar.gz (50 kB)
[K     |████████████████████████████████| 50 kB 2.9 MB/s 
Building wheels for collected packages: python-Levenshtein
  Building wheel for python-Levenshtein (setup.py) ... [?25l[?25hdone
  Created wheel for python-Levenshtein: filename=python_Levenshtein-0.12.2-cp37-cp37m-linux_x86_64.whl size=149870 sha256=15cacaa74c656077591967f862d26ec7be0ef5edf033cedb9f1c56cf59088d45
  Stored in directory: /root/.cache/pip/wheels/05/5f/ca/7c4367734892581bb5ff896f15027a932c551080b2abd3e00d
Successfully built python-Levenshtein
Installing collected packages: python-Levenshtein
Successfully installed python-Levenshtein-0.12.2


In [27]:
# Dont RUN
start_time = time.time()
### iterating over pandas dataframe rows is not recomended, let use multi processing to apply the function

with Pool(cpu_count()-1) as pool:
    scores = pool.map(partial(fuzzfilter, candidates=list(en_test_sents), pad=5), df_pp['source_sentence'])
hours, rem = divmod(time.time() - start_time, 3600)
minutes, seconds = divmod(rem, 60)
print("done in {}h:{}min:{}seconds".format(hours, minutes, seconds))

# Filter out "almost overlapping samples"
df_pp = df_pp.assign(scores=scores)
df_pp = df_pp[df_pp['scores'] < 95]

done in 0.0h:0.0min:55.06313967704773seconds


In [28]:
# Dont Run
# This section does the split between train/dev for the parallel corpora then saves them as separate files
# We use 1000 dev test and the given test set.
import csv

# TODO: if your corpus is smaller than 1000, reduce this number. With a corpus that small you might not obtain good results with NMT though :/
# Do the split between dev/train and create parallel corpora
num_dev_patterns = 100

# Optional: lower case the corpora - this will make it easier to generalize, but without proper casing.
if lc:  # Julia: making lowercasing optional
    df_pp["source_sentence"] = df_pp["source_sentence"].str.lower()
    df_pp["target_sentence"] = df_pp["target_sentence"].str.lower()

# Julia: test sets are already generated
dev = df_pp.tail(num_dev_patterns) # Herman: Error in original
stripped = df_pp.drop(df_pp.tail(num_dev_patterns).index)

with open("train."+source_language, "w") as src_file, open("train."+target_language, "w") as trg_file:
  for index, row in stripped.iterrows():
    src_file.write(row["source_sentence"]+"\n")
    trg_file.write(row["target_sentence"]+"\n")
    
with open("dev."+source_language, "w") as src_file, open("dev."+target_language, "w") as trg_file:
  for index, row in dev.iterrows():
    src_file.write(row["source_sentence"]+"\n")
    trg_file.write(row["target_sentence"]+"\n")

#stripped[["source_sentence"]].to_csv("train."+source_language, header=False, index=False)  # Herman: Added `header=False` everywhere
#stripped[["target_sentence"]].to_csv("train."+target_language, header=False, index=aFalse)  # Julia: Problematic handling of quotation marks.

#dev[["source_sentence"]].to_csv("dev."+source_language, header=False, index=False)
#dev[["target_sentence"]].to_csv("dev."+target_language, header=False, index=False)


# TODO: Doublecheck the format below. There should be no extra quotation marks or weird characters. It should also not be empty.
! head train.*
! head dev.*

106

151

37

45

180

223

198

238

205

216

124

239

118

145

78

61

116

138

99

78

147

128

200

248

37

32

67

59

81

107

95

106

51

79

87

89

114

62

50

27

78

84

163

140

94

127

73

84

63

67

117

124

210

255

81

130

196

235

48

95

44

46

126

183

230

281

130

135

133

153

222

326

56

60

299

253

91

102

93

93

186

120

169

137

126

69

58

47

109

102

210

193

42

61

164

225

259

283

145

141

246

273

134

139

159

134

177

173

147

112

43

63

181

162

304

337

89

38

125

88

173

173

119

132

87

103

101

97

288

331

86

82

127

102

105

65

68

84

118

126

216

254

118

101

86

87

170

189

11

16

96

121

76

86

126

147

143

138

35

21

146

86

172

246

87

97

139

140

87

113

52

51

129

174

129

179

215

167

194

260

137

108

61

70

82

94

172

171

44

48

133

168

170

233

163

232

162

135

193

141

76

58

70

27

127

124

131

121

134

90

115

124

50

59

19

20

178

209

178

201

141

129

172

221

206

258

77

87

88

93

16

19

131

165

178

188

98

142

146

131

100

142

90

76

95

95

138

186

85

108

126

100

76

98

56

56

141

174

163

257

154

170

164

192

245

308

92

135

92

86

281

370

156

161

166

226

181

191

221

240

218

169

108

104

134

107

239

229

189

172

44

56

133

106

71

91

29

51

67

68

242

282

133

124

140

101

98

145

76

81

123

126

138

155

114

126

27

35

151

144

78

95

94

103

136

138

191

203

210

239

64

66

248

223

53

72

134

135

225

235

68

68

75

64

81

102

48

51

32

59

97

99

226

241

54

69

156

123

65

64

50

54

98

97

91

83

114

123

245

223

32

33

152

118

118

125

82

89

178

181

93

99

292

271

152

150

88

110

56

45

159

142

216

301

247

256

128

110

40

31

183

188

189

167

177

247

41

41

124

164

140

185

84

71

216

227

36

34

130

159

78

124

173

186

110

75

172

129

209

191

183

174

113

93

46

89

190

231

31

45

60

54

107

100

106

104

111

117

105

115

115

113

120

135

50

50

187

186

134

173

15

32

62

64

68

73

172

182

264

291

123

127

77

121

153

147

205

194

187

201

196

153

108

93

145

122

112

131

216

156

229

296

120

125

236

238

197

198

89

141

170

159

136

205

158

183

276

333

206

283

236

220

85

106

51

33

241

146

216

130

129

198

247

253

177

170

101

131

190

270

134

113

250

239

249

244

156

187

173

232

158

203

159

165

185

195

174

204

65

76

48

55

65

65

70

76

139

177

160

115

94

129

60

60

122

181

129

158

19

23

159

174

92

63

121

142

57

77

23

15

101

103

49

66

139

106

68

52

117

74

53

68

125

90

35

23

114

91

123

134

94

57

154

135

118

112

25

31

92

92

131

127

149

112

155

154

84

85

129

137

114

139

107

113

27

26

120

121

133

129

89

103

178

160

102

77

51

35

148

133

22

42

126

136

66

83

73

74

98

75

44

60

45

54

173

161

99

84

102

87

50

48

106

126

27

26

112

105

206

162

127

89

169

99

97

75

137

91

40

36

79

77

169

191

106

119

60

45

131

102

94

96

53

46

76

57

130

106

83

78

95

114

28

99

101

135

85

138

111

110

67

65

112

139

103

100

54

54

58

85

96

141

174

177

68

122

336

369

219

180

27

33

100

108

150

115

107

128

197

145

100

107

129

153

93

75

16

17

169

163

194

194

91

87

144

102

22

17

33

34

273

205

66

57

129

131

256

216

252

227

103

112

228

174

27

31

141

130

113

116

200

159

11

18

238

231

171

126

122

118

181

206

113

85

46

48

134

173

86

72

38

47

56

49

56

59

39

58

34

46

145

192

127

150

126

104

251

259

192

201

23

16

222

272

163

199

103

107

63

94

147

170

84

120

90

102

179

139

141

112

118

160

180

196

116

137

140

206

77

115

79

100

93

136

52

42

158

115

132

209

109

112

28

27

54

67

113

124

214

278

111

146

105

162

84

93

109

127

119

129

125

86

180

243

165

154

117

86

54

61

46

53

35

48

125

200

152

163

194

223

67

97

91

131

36

50

131

167

305

345

289

238

120

140

277

245

67

75

165

199

117

125

84

197

282

186

222

192

88

80

69

64

165

236

135

176

34

35

92

130

204

179

62

81

90

93

86

100

137

144

23

26

124

104

110

81

87

72

69

100

251

272

69

99

200

233

152

145

64

67

142

128

181

201

67

96

158

124

146

159

195

254

91

82

111

107

63

54

79

126

165

237

220

246

148

148

118

125

98

116

200

315

249

283

99

61

195

244

172

208

55

40

59

66

248

254

60

52

130

110

139

164

50

59

164

133

94

92

28

45

80

97

85

90

159

150

156

195

168

169

96

124

44

56

129

143

204

189

136

95

94

124

170

155

125

123

185

211

109

126

272

331

143

144

78

104

58

51

88

112

65

73

57

60

155

192

74

62

73

103

40

62

116

128

149

175

124

138

61

72

54

65

67

70

120

100

71

118

72

61

114

113

73

74

27

24

40

37

21

19

89

110

81

83

86

97

252

171

26

24

283

226

102

93

120

130

44

45

39

38

176

194

119

97

123

137

115

82

146

122

159

193

96

105

36

30

94

66

50

75

82

93

171

157

57

44

83

85

68

97

169

194

81

89

43

54

43

59

77

92

47

60

170

220

143

199

177

208

122

181

189

294

68

96

24

41

13

28

36

33

97

70

69

77

29

21

31

41

71

67

67

66

93

105

94

94

106

130

93

78

142

124

119

139

105

100

116

112

146

116

133

152

190

163

115

124

90

98

45

46

59

59

36

43

136

166

33

24

111

158

237

306

33

36

146

154

40

51

35

37

49

58

232

276

219

263

34

28

187

154

67

75

199

246

112

145

49

54

152

176

78

63

110

114

22

30

231

251

209

210

122

96

96

115

49

55

215

217

186

179

120

162

194

211

54

60

165

134

169

191

91

112

170

182

166

195

110

149

169

139

232

165

147

156

56

70

141

170

114

113

104

140

76

86

148

164

128

185

97

122

21

40

281

317

192

178

94

141

212

221

116

131

42

47

245

221

294

315

133

121

95

126

37

34

83

109

294

349

191

196

39

46

182

176

126

182

261

314

188

180

72

76

51

58

157

137

194

196

121

184

274

349

162

189

191

207

115

134

79

88

88

221

213

212

78

133

202

258

180

240

289

340

199

227

199

180

14

23

214

220

105

113

84

125

143

218

51

78

128

123

90

151

175

144

154

171

70

119

62

66

100

142

81

125

169

229

152

162

132

202

77

100

41

51

143

142

165

160

92

112

216

213

271

246

297

356

66

76

75

93

175

187

140

139

174

173

103

108

154

206

70

100

30

42

259

204

72

123

204

232

130

109

107

111

203

229

60

60

143

167

125

143

177

193

160

212

51

81

44

44

108

121

100

93

272

272

253

251

86

81

158

86

180

237

196

183

148

174

77

102

65

63

100

97

35

33

210

241

227

275

120

139

148

151

129

170

130

134

26

38

106

159

137

179

51

56

164

207

199

254

140

149

30

36

166

195

194

201

126

167

80

77

76

75

195

245

83

84

47

48

106

95

145

130

84

47

77

77

37

35

89

90

109

167

112

100

207

215

72

73

98

92

162

134

115

113

120

121

29

35

270

309

87

83

125

118

187

204

286

343

168

208

210

190

164

155

101

109

189

151

46

56

90

80

172

141

39

47

31

34

118

164

67

50

79

112

141

174

153

116

148

114

32

37

181

175

136

99

96

95

83

54

147

111

134

136

134

150

113

103

106

107

214

215

131

110

114

98

76

87

134

144

65

73

106

88

88

77

118

131

82

84

111

132

72

75

100

96

91

122

168

182

74

68

90

100

59

45

66

104

133

164

96

130

200

180

308

370

212

241

102

150

116

80

35

43

119

96

175

186

143

153

207

244

111

149

191

211

187

163

93

106

96

101

105

122

97

73

46

62

205

187

161

162

165

203

78

51

114

83

160

162

96

87

83

57

114

143

174

156

147

120

177

185

177

192

212

210

98

142

94

120

196

170

209

278

231

241

137

113

144

165

177

173

27

32

16

15

91

100

76

61

145

135

175

167

123

129

138

140

253

265

133

145

199

203

82

97

155

220

149

146

21

17

126

128

279

280

119

120

173

199

54

57

21

23

101

76

194

210

114

136

153

130

57

69

119

92

39

32

43

41

61

61

62

43

96

96

64

50

91

75

92

96

135

125

130

121

115

111

84

96

184

192

138

136

73

68

68

96

53

74

77

93

158

211

217

307

117

116

82

89

53

51

98

111

109

158

78

113

115

221

118

131

169

181

121

151

79

98

117

182

153

201

102

124

220

244

129

141

70

63

132

147

98

123

108

103

106

136

18

24

115

136

97

115

189

247

104

135

89

95

226

179

86

145

208

272

124

193

173

160

121

113

202

182

154

150

66

96

205

249

27

26

106

116

180

232

156

182

195

163

213

301

187

270

200

173

215

257

296

290

199

267

178

203

100

120

173

241

139

213

210

212

224

261

99

151

75

91

175

255

137

221

100

161

42

70

248

245

219

254

158

254

74

136

117

116

16

16

38

77

128

177

45

68

239

318

95

138

63

69

96

104

90

131

16

21

25

28

143

174

66

65

197

256

65

86

28

36

226

279

59

88

189

240

207

316

241

336

193

222

75

64

289

332

136

182

45

69

280

311

270

362

109

125

41

45

77

80

44

42

225

247

194

211

47

66

281

346

182

207

164

333

158

189

102

106

195

189

253

263

14

29

194

252

206

285

124

121

247

250

181

166

128

254

35

42

67

102

96

117

154

171

59

104

212

182

136

211

259

322

236

293

94

146

164

269

139

218

127

201

141

146

60

88

59

68

86

112

184

207

74

112

88

105

63

69

71

52

55

58

53

50

15

16

54

63

13

15

23

24

189

201

66

67

100

80

212

254

101

108

135

150

80

75

108

91

115

128

220

209

99

85

139

146

176

229

37

36

227

212

219

284

165

134

127

121

62

44

212

270

222

278

255

259

169

169

170

220

215

182

171

218

91

119

123

137

158

187

95

125

170

138

165

167

153

153

42

39

187

225

190

214

144

135

161

126

140

172

244

282

136

170

70

92

74

88

138

163

64

67

86

71

200

191

146

179

146

168

126

65

36

36

230

292

67

92

214

292

109

115

58

88

188

227

170

215

89

110

213

260

146

183

95

157

75

114

119

176

84

140

109

163

128

126

79

67

183

210

60

86

73

91

242

225

291

374

104

141

158

174

145

119

52

46

71

90

231

182

104

88

28

28

134

167

108

137

95

97

203

237

205

241

117

146

97

120

116

125

142

172

117

93

53

86

83

86

303

258

78

106

135

116

206

206

94

87

104

109

228

225

27

20

161

144

173

200

24

22

218

232

236

349

128

216

144

170

113

114

49

59

92

103

70

71

206

212

164

192

237

208

157

162

147

113

122

154

71

117

88

107

96

103

96

158

113

136

119

162

72

44

176

199

165

140

26

27

83

77

106

91

58

69

390

390

42

45

112

116

150

130

220

300

35

46

133

183

215

221

258

291

135

137

168

193

84

87

197

276

194

205

68

103

207

223

86

73

28

29

140

153

178

250

106

110

88

85

160

181

203

157

42

44

110

137

158

155

142

211

310

335

90

89

207

228

101

120

32

42

50

71

119

125

91

132

310

340

156

150

124

85

122

95

168

156

93

86

54

55

62

57

68

64

78

78

145

125

97

82

122

139

82

102

115

125

242

195

116

130

103

112

90

84

96

88

42

38

77

75

81

72

71

88

95

85

76

57

40

52

108

94

76

87

25

32

65

65

66

68

25

26

77

160

146

124

37

49

110

185

169

177

72

107

83

71

20

38

178

179

98

91

193

228

122

157

81

116

68

107

146

132

140

107

166

188

121

161

206

260

66

65

154

210

85

87

126

105

44

64

118

106

141

167

244

222

116

153

63

55

146

143

123

141

55

73

27

29

81

68

87

86

75

41

24

28

132

125

42

37

43

31

33

26

24

34

132

133

103

186

33

34

105

101

31

39

79

93

164

148

216

211

261

262

192

179

106

86

46

56

95

90

100

93

64

78

157

173

221

193

157

129

138

123

157

168

260

259

158

145

142

90

178

154

149

133

284

289

74

82

80

71

122

127

132

167

215

238

170

175

38

55

123

163

34

69

126

92

31

42

194

355

65

72

192

180

146

147

181

201

288

274

129

135

223

215

166

146

275

257

160

137

53

53

125

110

168

171

243

237

68

73

179

174

183

153

137

116

240

296

248

226

128

138

239

207

244

205

224

213

337

253

177

158

64

69

236

223

54

90

264

271

192

196

64

66

198

199

67

64

242

261

18

16

169

158

196

217

111

101

254

246

97

112

114

152

216

212

122

111

201

159

177

177

161

139

194

170

148

120

197

157

241

217

187

178

293

245

233

242

290

243

394

299

181

173

178

143

183

150

274

259

263

250

147

113

262

220

175

199

86

80

298

238

245

192

202

156

71

70

144

123

90

85

155

130

155

170

137

93

186

191

176

137

55

45

209

185

181

186

211

197

77

75

71

60

194

163

95

102

166

157

138

135

118

119

59

65

154

166

137

141

248

210

155

123

47

53

131

141

179

183

122

123

153

214

180

167

199

155

234

229

55

63

102

107

224

223

52

54

48

41

116

118

99

112

83

92

54

54

34

54

142

158

304

342

180

189

154

154

74

71

249

216

169

152

221

197

100

114

117

95

189

184

239

220

121

117

69

65

205

216

115

115

82

90

73

95

133

351

241

467

55

83

189

242

161

183

151

128

174

190

205

227

194

189

103

108

62

57

155

151

183

192

118

193

137

130

96

85

146

188

160

149

140

74

100

138

198

168

126

139

67

85

46

155

169

323

117

212

128

296

99

91

144

172

165

167

148

225

133

180

187

209

168

148

41

84

112

110

62

84

82

110

124

395

75

101

94

126

255

182

196

136

190

194

203

135

307

277

59

89

207

221

295

233

74

71

173

189

201

83

297

224

335

296

229

314

88

120

40

88

79

115

159

209

128

315

80

96

160

165

248

240

245

228

205

196

246

262

190

169

196

156

138

114

109

112

145

137

173

163

85

92

139

120

177

167

48

63

331

313

191

256

222

394

172

226

122

200

51

83

150

194

167

152

94

110

122

100

60

65

156

183

125

115

105

107

137

175

44

52

193

215

47

89

74

115

223

400

127

271

63

52

301

240

213

236

232

210

39

37

51

51

135

112

65

88

160

166

146

129

167

199

197

177

123

116

189

211

43

44

172

162

254

194

197

158

169

171

153

84

187

173

56

65

64

87

180

191

124

139

96

105

40

84

79

111

50

61

74

89

204

189

108

130

192

164

190

201

171

164

182

180

141

140

91

62

127

130

210

196

75

66

177

178

186

149

135

140

45

78

45

35

38

54

81

68

137

130

197

174

97

92

89

101

170

162

140

168

115

128

239

258

126

163

70

75

168

201

76

79

181

173

120

126

114

83

222

206

135

127

124

93

178

117

119

107

80

73

256

310

189

190

99

125

149

216

128

210

173

259

299

334

107

88

155

142

72

105

185

237

187

195

159

224

85

106

143

147

62

75

192

249

65

70

186

185

331

379

218

177

60

74

169

213

121

132

131

102

97

111

130

103

144

175

20

19

79

100

151

175

119

113

80

81

216

194

74

79

128

119

126

110

180

254

107

91

83

84

75

109

132

132

100

87

158

165

37

38

182

181

101

90

161

134

258

273

258

359

159

119

172

151

115

148

39

51

178

147

176

170

53

65

178

139

126

182

30

50

57

42

47

61

143

130

43

72

82

95

113

190

69

80

144

190

169

117

160

153

140

112

158

154

90

89

72

80

116

167

118

109

159

155

183

196

129

134

287

220

59

72

60

70

136

167

114

107

99

106

115

140

45

72

141

173

100

228

116

170

68

64

196

228

274

334

206

129

87

91

198

174

45

74

109

179

111

157

52

71

137

176

120

196

123

156

77

76

180

186

179

195

91

129

121

166

78

100

223

238

90

128

235

274

159

205

79

77

293

173

311

356

194

165

162

143

81

96

126

117

64

65

137

122

164

144

210

154

181

149

180

141

163

149

225

190

91

89

145

144

148

124

191

149

251

250

143

154

147

131

123

136

158

167

199

160

55

54

219

194

197

215

171

166

136

133

76

70

200

143

131

176

225

135

40

73

79

108

73

94

155

263

104

145

62

45

138

177

337

356

197

224

120

129

50

41

144

186

73

126

159

275

45

74

89

71

278

230

158

155

131

144

135

212

76

94

55

52

115

161

130

190

137

207

77

63

56

102

153

212

71

75

282

225

306

306

217

178

26

18

141

156

141

89

249

227

250

229

86

72

143

121

175

162

123

111

206

188

145

127

68

59

81

72

198

170

173

166

172

169

181

142

136

122

264

192

160

146

245

213

119

125

237

231

138

145

68

67

182

156

191

163

134

114

113

107

158

139

39

70

200

149

112

131

68

69

207

161

197

199

176

189

101

101

225

202

166

202

68

74

211

196

150

158

208

286

128

118

107

113

104

88

117

93

82

87

156

121

187

101

112

108

155

267

42

34

201

186

120

137

127

100

163

96

111

116

82

79

184

132

135

178

52

50

94

94

100

105

208

212

83

78

57

60

125

113

152

138

158

124

114

115

239

196

188

150

222

177

46

43

250

262

135

128

189

190

103

119

56

57

92

105

122

132

244

201

166

109

177

159

131

128

132

127

309

247

195

174

146

106

76

61

282

252

147

131

225

185

209

165

47

62

72

74

180

178

102

112

105

121

89

115

67

73

202

194

331

278

246

206

212

205

153

143

127

117

182

176

162

135

239

231

132

140

135

111

230

203

187

181

183

136

167

152

167

146

264

235

193

164

147

122

215

214

159

139

155

132

192

169

100

97

123

109

84

62

45

55

40

51

117

122

143

130

139

139

100

117

55

53

85

108

155

154

39

56

85

80

50

48

102

106

159

162

78

68

97

93

49

46

152

136

103

104

139

124

114

95

146

124

129

114

112

93

142

147

71

60

62

63

149

160

258

253

130

141

108

116

85

84

141

147

186

180

45

48

174

178

187

199

189

173

155

119

135

122

191

168

176

147

76

84

151

178

229

221

147

126

234

199

65

53

195

202

62

87

190

199

194

193

119

131

231

252

256

216

202

227

142

142

143

127

118

104

182

224

65

73

193

203

198

251

147

185

198

147

162

141

47

103

152

250

310

255

62

101

112

156

32

42

51

54

111

64

85

121

175

215

60

46

188

148

264

199

142

140

186

150

264

239

101

75

47

55

150

125

275

269

221

190

117

70

151

189

151

108

47

69

148

170

134

122

162

121

229

228

105

100

100

133

112

99

150

114

62

81

166

251

236

172

210

243

152

153

75

70

239

227

143

139

136

195

118

85

190

132

224

185

162

160

176

174

161

175

141

178

135

107

91

63

186

217

65

72

38

51

144

142

79

74

131

115

197

197

195

226

45

54

167

196

102

148

166

147

184

218

237

212

133

158

124

150

243

185

61

43

212

206

60

66

191

254

173

158

178

121

128

142

310

269

229

199

251

355

159

179

105

121

109

108

165

153

166

137

64

54

112

139

251

242

118

146

202

146

66

100

150

155

131

212

49

85

226

249

148

193

105

116

25

30

188

186

159

210

227

229

149

189

194

187

95

94

67

121

13

9

201

212

220

183

34

31

205

185

57

61

148

187

186

203

144

136

221

175

167

103

114

120

89

65

137

96

115

89

139

120

166

167

57

64

136

144

216

225

95

110

161

142

168

158

108

136

99

104

57

61

57

94

174

226

230

291

274

253

99

99

56

39

144

167

157

189

138

187

37

41

57

86

115

168

216

260

111

89

164

147

79

93

61

50

212

258

169

219

112

105

57

67

135

144

203

230

205

196

190

229

54

66

110

136

188

231

333

318

61

56

156

138

175

150

170

151

96

87

239

248

90

66

60

73

178

192

152

185

55

99

163

202

212

246

58

55

186

191

103

92

175

130

184

128

74

83

207

209

161

147

160

142

255

353

60

92

241

274

203

170

49

71

156

250

106

130

98

101

169

223

166

161

56

93

104

159

172

173

217

216

100

159

62

81

278

348

247

268

119

87

100

130

138

142

241

226

53

53

190

272

183

233

137

162

114

96

152

183

94

80

199

184

74

74

132

79

67

87

134

140

149

217

225

321

132

93

53

43

145

144

258

242

17

23

191

224

231

168

137

131

311

204

88

98

41

64

153

208

194

184

136

134

229

132

145

172

162

164

66

97

190

265

111

123

135

167

64

85

183

170

52

77

260

252

229

217

111

132

181

220

78

77

75

95

182

156

178

151

102

105

202

222

96

106

113

118

96

229

67

63

244

242

265

246

230

289

96

54

90

112

78

107

54

82

146

208

243

239

83

88

89

72

216

203

212

203

288

310

112

111

52

53

180

192

240

221

141

166

244

152

127

103

60

77

202

223

213

261

161

238

213

173

210

249

256

216

113

136

105

110

62

66

252

374

101

133

147

162

302

215

262

167

128

174

46

32

120

107

108

129

168

173

96

86

88

70

77

80

37

54

287

351

130

144

106

119

106

92

82

84

50

69

232

145

165

131

194

177

221

175

101

150

48

77

191

161

172

207

201

166

81

94

146

184

156

117

181

199

163

185

104

90

148

148

140

138

67

70

242

306

172

169

43

100

192

158

96

108

205

240

73

66

171

171

138

149

337

263

203

164

55

48

193

160

236

216

147

170

276

204

260

331

47

67

136

128

56

67

104

121

188

169

239

167

171

250

46

76

148

124

181

202

76

123

52

83

174

174

153

132

59

66

249

288

150

185

53

90

140

218

128

117

144

144

124

117

166

247

137

162

143

121

103

99

116

145

212

238

155

162

80

90

252

242

174

208

225

289

139

153

87

109

195

232

104

113

190

180

171

183

165

218

117

121

105

118

110

143

40

47

134

142

120

108

62

79

136

207

260

261

163

194

203

146

173

172

155

110

134

151

170

183

147

147

156

141

72

74

116

90

168

141

82

75

162

81

121

120

97

99

58

74

185

197

164

149

138

130

125

115

199

124

141

93

64

94

59

48

186

190

102

77

176

235

105

47

216

218

210

224

230

186

58

80

163

216

204

144

160

175

151

139

204

212

108

125

51

66

127

143

176

197

211

143

172

185

220

279

141

132

206

152

150

129

187

153

163

197

167

160

119

138

63

71

144

155

139

168

95

119

236

305

133

155

63

85

44

58

97

101

153

170

66

99

84

95

84

93

128

155

71

76

73

90

176

183

140

168

196

230

156

198

75

82

137

179

36

36

69

96

135

137

110

124

141

167

308

321

70

72

139

139

128

95

85

70

81

104

86

107

59

72

73

76

204

253

93

122

93

108

75

84

62

65

61

90

147

208

106

131

105

117

159

206

182

177

53

59

175

175

202

238

136

157

208

242

65

70

96

135

166

142

115

120

136

128

99

107

214

218

177

162

67

81

242

296

187

229

117

134

253

357

201

275

109

104

118

129

206

231

127

155

37

31

56

55

149

152

82

95

153

174

115

129

111

154

63

96

59

53

126

141

83

88

83

93

44

61

137

145

70

68

88

86

41

67

180

161

187

187

84

98

185

194

96

83

170

201

50

49

154

169

83

130

208

210

277

338

234

199

96

100

238

230

222

268

99

129

99

91

133

120

52

39

202

165

203

155

83

62

57

71

106

118

226

248

232

232

205

167

62

67

154

162

127

243

152

150

186

173

194

116

199

252

87

111

173

213

191

233

101

142

201

196

231

203

51

75

232

184

259

212

224

169

170

124

137

116

52

78

173

173

95

84

235

243

48

62

62

87

87

116

144

155

247

304

294

234

153

170

78

79

203

310

199

158

162

154

203

253

257

281

75

85

174

195

100

112

148

141

128

127

50

62

159

208

120

128

128

141

211

230

164

182

151

157

193

234

109

117

155

181

54

46

123

130

119

137

87

109

94

110

109

118

105

133

46

71

131

186

109

144

160

197

142

296

216

278

119

116

71

56

168

170

127

155

114

111

149

171

119

160

142

177

89

117

54

81

128

139

52

65

149

171

113

173

115

138

51

96

170

224

172

161

172

233

118

122

93

81

118

151

61

72

291

402

218

251

152

159

137

159

204

251

35

68

214

254

147

198

86

110

144

158

59

104

42

70

106

142

114

157

114

185

123

207

126

234

103

127

84

106

56

81

106

151

21

46

84

152

154

185

159

217

109

155

67

71

127

161

65

91

121

138

117

138

111

107

41

71

113

150

79

124

132

171

125

151

71

81

57

67

165

205

223

340

91

126

29

25

116

116

33

57

192

283

45

68

219

260

117

185

137

147

257

357

105

101

129

201

22

28

81

63

200

202

103

125

61

97

130

179

131

166

145

180

157

157

76

72

84

100

69

113

80

79

221

292

195

172

255

307

252

206

192

212

60

75

221

276

244

306

130

106

275

324

47

75

186

172

394

418

203

236

287

255

42

54

111

128

96

134

59

81

47

54

94

120

214

345

274

315

159

151

56

112

156

236

185

213

140

180

138

186

161

177

149

191

73

82

139

165

113

135

254

295

223

293

109

76

49

55

213

265

181

249

148

178

91

96

143

193

45

52

147

158

164

252

110

191

230

291

42

54

161

197

157

267

83

134

161

182

61

127

56

85

241

297

95

101

259

270

194

218

213

274

48

51

184

275

183

208

212

342

54

96

194

329

211

282

210

292

209

297

49

77

174

179

196

195

226

251

48

45

243

254

161

163

108

158

52

42

295

485

63

82

225

332

43

87

157

181

163

200

145

145

46

89

113

134

153

223

86

120

144

201

49

79

83

144

224

252

42

61

222

258

86

169

97

124

50

60

173

248

82

137

164

169

151

199

27

39

303

390

46

70

226

301

334

424

208

292

209

245

43

52

185

243

52

68

148

175

143

253

169

236

56

90

265

348

81

106

106

144

86

101

97

90

195

187

31

28

96

81

90

119

60

72

145

190

169

261

166

212

133

208

69

108

177

265

142

183

156

193

203

235

153

248

161

209

61

62

144

220

52

76

225

279

172

226

236

351

156

190

62

63

112

141

110

102

53

64

227

305

176

227

124

167

55

94

175

198

97

205

92

140

103

130

152

134

116

132

119

154

74

98

90

120

161

306

67

61

116

118

94

99

126

131

59

95

93

114

170

162

143

162

46

52

177

174

134

177

152

148

96

140

180

224

48

75

174

233

162

135

129

120

141

153

42

49

190

258

145

146

76

141

147

169

110

112

92

107

85

126

81

86

156

130

201

132

209

248

162

178

169

151

31

18

247

269

78

78

64

81

214

293

207

362

177

183

50

61

232

261

83

107

104

98

52

50

175

240

105

120

64

48

85

49

55

72

78

105

96

95

49

71

104

126

122

166

101

103

340

378

96

97

55

108

126

196

122

163

191

215

240

204

116

197

98

132

46

65

200

251

188

261

118

147

89

99

44

45

141

149

44

58

124

169

182

193

44

40

42

44

130

137

147

142

115

125

98

136

201

237

169

135

212

260

200

208

33

40

158

223

121

133

146

211

138

169

==> train.en <==
[update] as members of sebin searched diaz's and soto's home, luz mely reyes reported live from the scene
the sntp also reported the incident:
urgent at this time, : am, a commission from the intelligence forces arrives to journalist and human rights activist luis carlos díaz's home, missing since : pm #whereisluiscarlos
other human rights and freedom of speech organizations joined the campaign through #dondeestaluiscarlos (where is luis carlos?) which is, at the moment, a trending topic in venezuela's twittosphere
díaz is a journalist and a human rights and freedom of speech advocate who is well known and highly appreciated in venezuela and abroad for his commentary and criticism of the government of nicolas maduro
he has long worked with soto producing web-based video and radio programs focused on politics and human rights in venezuela
he has also worked as an educator and promoter of the creation of citizen media spaces and independent media projects
díaz has also b

In [29]:
# Install JoeyNMT
! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .
# Install Pytorch with GPU support v1.7.1.
#! pip install torch==1.9.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Cloning into 'joeynmt'...
remote: Enumerating objects: 3224, done.[K
remote: Counting objects: 100% (273/273), done.[K
remote: Compressing objects: 100% (143/143), done.[K
remote: Total 3224 (delta 155), reused 209 (delta 130), pack-reused 2951[K
Receiving objects: 100% (3224/3224), 8.19 MiB | 16.34 MiB/s, done.
Resolving deltas: 100% (2183/2183), done.
Processing /content/joeynmt
[33m  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.[0m
Collecting sacrebleu>=2.0.0
  Downloading sacrebleu-2.0.0-py3-none-any.whl (90 kB)
[K     |████████████████████████████████| 90 kB 4.0 MB/s 
[?25hCollecting subword-nmt
  Downloading subword_nmt-0.3.7-py2.

In [30]:
# One of the huge boosts in NMT performance was to use a different method of tokenizing. 
# Usually, NMT would tokenize by words. However, using a method called BPE gave amazing boosts to performance

# Do subword NMT
from os import path
os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language

# Learn BPEs on the training data.
os.environ["data_path"] = path.join("joeynmt", "data", source_language + target_language) # Herman! 
! subword-nmt learn-joint-bpe-and-vocab --input train.$src train.$tgt -s 4000 -o bpe.codes.4000 --write-vocabulary vocab.$src vocab.$tgt

# Apply BPE splits to the development and test data.
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < train.$src > train.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < train.$tgt > train.bpe.$tgt

! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < dev.$src > dev.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < dev.$tgt > dev.bpe.$tgt
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < test.$src > test.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < test.$tgt > test.bpe.$tgt

# Create directory, move everyone we care about to the correct location
! mkdir -p "$data_path"
! cp train.* "$data_path"
! cp test.* "$data_path"
! cp dev.* "$data_path"
! cp bpe.codes.4000 "$data_path"
! ls "$data_path"

# Also move everything we care about to a mounted location in google drive (relevant if running in colab) at gdrive_path
! cp train.* "$gdrive_path"
! cp test.* "$gdrive_path"
! cp dev.* "$gdrive_path"
! cp bpe.codes.4000 "$gdrive_path"
! ls "$gdrive_path"

# Create that vocab using build_vocab
! sudo chmod 777 joeynmt/scripts/build_vocab.py
! joeynmt/scripts/build_vocab.py joeynmt/data/$src$tgt/train.bpe.$src joeynmt/data/$src$tgt/train.bpe.$tgt --output_path joeynmt/data/$src$tgt/vocab.txt

# Some output
! echo "BPE Test language Sentences"
! tail -n 5 test.bpe.$tgt
! echo "Combined BPE Vocab"
! tail -n 10 joeynmt/data/$src$tgt/vocab.txt  # Herman

bpe.codes.4000	dev.en	     test.bpe.yo     test.yo	   train.en
dev.bpe.en	dev.yo	     test.en	     train.bpe.en  train.yo
dev.bpe.yo	test.bpe.en  test.en-any.en  train.bpe.yo
cp: target '/content/drive/My Drive/masakhane/baseline/en-yo-baseline/' is not a directory
cp: target '/content/drive/My Drive/masakhane/baseline/en-yo-baseline/' is not a directory
cp: target '/content/drive/My Drive/masakhane/baseline/en-yo-baseline/' is not a directory
cp: cannot create regular file '/content/drive/My Drive/masakhane/baseline/en-yo-baseline/': Not a directory
ls: cannot access '/content/drive/My Drive/masakhane/baseline/en-yo-baseline/': No such file or directory
BPE Test language Sentences
A@@ p@@ ata ńlá ti ìgb@@ àgb@@ ó@@ ̣ ( W@@ o ìpín@@ r@@ ò@@ ̣ 1@@ 2 - 1@@ 4 )
À@@ ṣí@@ borí ìgb@@ àl@@ à ( W@@ o ìpín@@ r@@ ò@@ ̣ 1@@ 5 - 1@@ 8 )
"@@ M@@ o ti rí i pé àwọn èè@@ yàn máa ń f@@ é@@ ̣ gbó@@ ̣@@ r@@ ò@@ ̣ wa tí w@@ ó@@ ̣@@ n bá rí i pé a ló@@ ye B@@ í@@ b@@ é@@ l@@ ì dá@@ adá@@ a , a sì f@@ é@@ ̣

In [31]:
# Also move everything we care about to a mounted location in google drive (relevant if running in colab) at gdrive_path
! cp train.* "$gdrive_path"
! cp test.* "$gdrive_path"
! cp dev.* "$gdrive_path"
! cp bpe.codes.4000 "$gdrive_path"
! ls "$gdrive_path"

cp: target '/content/drive/My Drive/masakhane/baseline/en-yo-baseline/' is not a directory
cp: target '/content/drive/My Drive/masakhane/baseline/en-yo-baseline/' is not a directory
cp: target '/content/drive/My Drive/masakhane/baseline/en-yo-baseline/' is not a directory
cp: cannot create regular file '/content/drive/My Drive/masakhane/baseline/en-yo-baseline/': Not a directory
ls: cannot access '/content/drive/My Drive/masakhane/baseline/en-yo-baseline/': No such file or directory


In [32]:
# This creates the config file for our JoeyNMT system. It might seem overwhelming so we've provided a couple of useful parameters you'll need to update
# (You can of course play with all the parameters if you'd like!)

name = '%s%s' % (source_language, target_language)
gdrive_path = os.environ["gdrive_path"]

# Create the config
config = """
name: "{name}_transformer"

data:
    src: "{source_language}"
    trg: "{target_language}"
    train: "data/{name}/train.bpe"
    dev:   "data/{name}/dev.bpe"
    test:  "data/{name}/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "data/{name}/vocab.txt"
    trg_vocab: "data/{name}/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0
    sacrebleu:                      # sacrebleu options
        remove_whitespace: True     # `remove_whitespace` option in sacrebleu.corpus_chrf() function (defalut: True)
        tokenize: "none"            # `tokenize` option in sacrebleu.corpus_bleu() function (options include: "none" (use for already tokenized test data), "13a" (default minimal tokenizer), "intl" which mostly does punctuation and unicode, etc) 

training:
    #load_model: "{gdrive_path}/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 25
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 3600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 30                     # TODO: Decrease for when playing around and checking of working. Around 30 is sufficient to check if its working at all
    validation_freq: 100          # TODO: Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/{name}_transformer"
    overwrite: True               # TODO: Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 1

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path=os.environ["gdrive_path"], source_language=source_language, target_language=target_language)
with open("joeynmt/configs/transformer_{name}.yaml".format(name=name),'w') as f:
    f.write(config)

3514

In [33]:
# Train the model
# You can press Ctrl-C to stop. And then run the next cell to save your checkpoints! 
!cd joeynmt; python3 -m joeynmt train configs/transformer_$src$tgt.yaml

2021-12-03 16:27:43,566 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-12-03 16:27:43,581 - INFO - joeynmt.data - Loading training data...
2021-12-03 16:27:43,636 - INFO - joeynmt.data - Building vocabulary...
2021-12-03 16:27:43,940 - INFO - joeynmt.data - Loading dev data...
2021-12-03 16:27:43,942 - INFO - joeynmt.data - Loading test data...
2021-12-03 16:27:44,029 - INFO - joeynmt.data - Data loaded.
2021-12-03 16:27:44,030 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-12-03 16:27:44,461 - INFO - joeynmt.model - Enc-dec model built.
2021-12-03 16:27:46,849 - INFO - joeynmt.training - Total params: 12107520
2021-12-03 16:27:58,516 - INFO - joeynmt.helpers - cfg.name                           : enyo_transformer
2021-12-03 16:27:58,516 - INFO - joeynmt.helpers - cfg.data.src                       : en
2021-12-03 16:27:58,517 - INFO - joeynmt.helpers - cfg.data.trg                       : yo
2021-12-03 16:27:58,517 - INFO - joeynmt.helpers - cfg.data.t

In [34]:
# Copy the created models from the notebook storage to google drive for persistent storage 
! mkdir -p "$gdrive_path/models/${src}${tgt}_transformer/"
! cp -r joeynmt/models/${src}${tgt}_transformer/* "$gdrive_path/models/${src}${tgt}_transformer/"

In [35]:
# Output our validation accuracy
! cat "$gdrive_path/models/${src}${tgt}_transformer/validations.txt"

Steps: 100	Loss: 25717.04688	PPL: 403.52728	bleu: 0.03219	LR: 0.00030000	*
Steps: 200	Loss: 25040.62695	PPL: 344.61349	bleu: 0.01046	LR: 0.00030000	*
Steps: 300	Loss: 24206.30859	PPL: 283.65601	bleu: 0.01047	LR: 0.00030000	*
Steps: 400	Loss: 23636.44727	PPL: 248.34116	bleu: 0.02877	LR: 0.00030000	*
Steps: 500	Loss: 23217.34375	PPL: 225.20686	bleu: 0.09617	LR: 0.00030000	*
Steps: 600	Loss: 22636.63281	PPL: 196.67035	bleu: 0.12135	LR: 0.00030000	*
Steps: 700	Loss: 22100.86328	PPL: 173.56023	bleu: 0.17651	LR: 0.00030000	*
Steps: 800	Loss: 21621.75391	PPL: 155.20390	bleu: 0.35282	LR: 0.00030000	*
Steps: 900	Loss: 21157.94922	PPL: 139.28551	bleu: 0.44453	LR: 0.00030000	*
Steps: 1000	Loss: 20708.88867	PPL: 125.43050	bleu: 0.39353	LR: 0.00030000	*
Steps: 1100	Loss: 20508.19531	PPL: 119.69261	bleu: 0.22327	LR: 0.00030000	*
Steps: 1200	Loss: 20269.20117	PPL: 113.20101	bleu: 0.39414	LR: 0.00030000	*
Steps: 1300	Loss: 20065.06641	PPL: 107.93581	bleu: 0.52607	LR: 0.00030000	*
Steps: 1400	Loss: 199

In [36]:
# Test our model
! cd joeynmt; python3 -m joeynmt test "$gdrive_path/models/${src}${tgt}_transformer/config.yaml"

2021-12-03 16:52:53,708 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-12-03 16:52:53,709 - INFO - joeynmt.data - Building vocabulary...
2021-12-03 16:52:53,990 - INFO - joeynmt.data - Loading dev data...
2021-12-03 16:52:53,993 - INFO - joeynmt.data - Loading test data...
2021-12-03 16:52:54,027 - INFO - joeynmt.data - Data loaded.
2021-12-03 16:52:54,034 - INFO - joeynmt.prediction - Process device: cuda, n_gpu: 1, batch_size per device: 3600
2021-12-03 16:52:54,034 - INFO - joeynmt.prediction - Loading model from models/enyo_transformer/1900.ckpt
2021-12-03 16:52:56,744 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-12-03 16:52:56,991 - INFO - joeynmt.model - Enc-dec model built.
2021-12-03 16:52:57,073 - INFO - joeynmt.prediction - Decoding on dev set (data/enyo/dev.bpe.yo)...
2021-12-03 16:53:29,882 - INFO - joeynmt.prediction -  dev bleu[none]:   1.33 [Beam search decoding with beam size = 5 and alpha = 1.0]
2021-12-03 16:53:29,882 - INFO - joeynm