# Damara JW.ORG Scraping

This notebook scrapes and preprocesses the English and Damara songs available on [jw.org](https://www.jw.org) in order to build a parallel corpus.

Authors: 
- [Musie Meressa](https://github.com/Msquarme)
- [Wilhelmina Nekoto](https://twitter.com/Onyothi)

In [None]:
from bs4 import BeautifulSoup
import requests, re, os, sys
from urllib.request import urlopen
from glob import glob
import pandas as pd

In [None]:
DMR_URL = "https://www.jw.org/naq-x-dmr/ǂkhanin/musiki-ǁnaetsanadi/%C7%83gâiaǂgaob-ǀkha-jehovaba-ǁnaeba-re/"
ENG_URL = "https://www.jw.org/en/library/music-songs/sing-out-joyfully/"

In [None]:
def get_songs(song_url):
    # Download the song index page and parse it
    response = requests.get(song_url)
    page = BeautifulSoup(response.text, 'lxml')
    # The song titles are listed inside the 'musicList' div
    songs = page.find('div', attrs={'class': 'musicList'}).text
    # Replace non-breaking spaces with regular spaces
    songs = songs.replace(u'\xa0', u' ')

    # Keep one entry per line, dropping blank and one-character lines
    song_list = [song for song in songs.split("\n") if len(song) > 1]

    return song_list

In [None]:
English_songs = get_songs(ENG_URL)
print(English_songs)

['1. Jehovah’s Attributes', 'Play', '2. Jehovah Is Your Name', 'Play', '3. Our Strength, Our Hope, Our Confidence', 'Play', '4. “Jehovah Is My Shepherd”', 'Play', '5. God’s Wondrous Works', 'Play', '6. The Heavens Declare God’s Glory', 'Play', '7. Jehovah, Our Strength', 'Play', '8. Jehovah Is Our Refuge', 'Play', '9. Jehovah Is Our King!', 'Play', '10. Praise Jehovah Our God!', 'Play', '11. Creation Praises God', 'Play', '12. Great God, Jehovah', 'Play', '13. Christ, Our Model', 'Play', '14. Praising Earth’s New King', 'Play', '15. Praise Jehovah’s Firstborn!', 'Play', '16. Praise Jah for His Son, the Anointed', 'Play', '17. “I Want To”', 'Play', '18. Grateful for the Ransom', 'Play', '19. The Lord’s Evening Meal', 'Play', '20. You Gave Your Precious Son', 'Play', '21. Keep On Seeking First the Kingdom', 'Play', '22. The Kingdom Is in Place\u200b—Let It Come!', 'Play', '23. Jehovah Begins His Rule', 'Play', '24. Come to Jehovah’s Mountain', 'Play', '25. A Special Possession', 'Play', 

Remove the word 'Play' from the list.

In [None]:
English_songs = get_songs(ENG_URL)
# Keep every entry except the 'Play' button labels
List_english_song = [s for s in English_songs if s != 'Play']

In [None]:
# Original list: the word 'Play' has not been removed
English_songs

['1. Jehovah’s Attributes',
 'Play',
 '2. Jehovah Is Your Name',
 'Play',
 '3. Our Strength, Our Hope, Our Confidence',
 'Play',
 '4. “Jehovah Is My Shepherd”',
 'Play',
 '5. God’s Wondrous Works',
 'Play',
 '6. The Heavens Declare God’s Glory',
 'Play',
 '7. Jehovah, Our Strength',
 'Play',
 '8. Jehovah Is Our Refuge',
 'Play',
 '9. Jehovah Is Our King!',
 'Play',
 '10. Praise Jehovah Our God!',
 'Play',
 '11. Creation Praises God',
 'Play',
 '12. Great God, Jehovah',
 'Play',
 '13. Christ, Our Model',
 'Play',
 '14. Praising Earth’s New King',
 'Play',
 '15. Praise Jehovah’s Firstborn!',
 'Play',
 '16. Praise Jah for His Son, the Anointed',
 'Play',
 '17. “I Want To”',
 'Play',
 '18. Grateful for the Ransom',
 'Play',
 '19. The Lord’s Evening Meal',
 'Play',
 '20. You Gave Your Precious Son',
 'Play',
 '21. Keep On Seeking First the Kingdom',
 'Play',
 '22. The Kingdom Is in Place\u200b—Let It Come!',
 'Play',
 '23. Jehovah Begins His Rule',
 'Play',
 '24. Come to Jehovah’s Mountain'

In [None]:
# Filtered list: the word 'Play' has been removed
List_english_song

['1. Jehovah’s Attributes',
 '2. Jehovah Is Your Name',
 '3. Our Strength, Our Hope, Our Confidence',
 '4. “Jehovah Is My Shepherd”',
 '5. God’s Wondrous Works',
 '6. The Heavens Declare God’s Glory',
 '7. Jehovah, Our Strength',
 '8. Jehovah Is Our Refuge',
 '9. Jehovah Is Our King!',
 '10. Praise Jehovah Our God!',
 '11. Creation Praises God',
 '12. Great God, Jehovah',
 '13. Christ, Our Model',
 '14. Praising Earth’s New King',
 '15. Praise Jehovah’s Firstborn!',
 '16. Praise Jah for His Son, the Anointed',
 '17. “I Want To”',
 '18. Grateful for the Ransom',
 '19. The Lord’s Evening Meal',
 '20. You Gave Your Precious Son',
 '21. Keep On Seeking First the Kingdom',
 '22. The Kingdom Is in Place\u200b—Let It Come!',
 '23. Jehovah Begins His Rule',
 '24. Come to Jehovah’s Mountain',
 '25. A Special Possession',
 '26. You Did It for Me',
 '27. The Revealing of God’s Sons',
 '28. Gaining Jehovah’s Friendship',
 '29. Living Up to Our Name',
 '30. My Father, My God and Friend',
 '31. Oh, 

However, the titles cannot be pasted into the links as they are.
Observations:
 * remove the `.` after the number
 * join the words using `-`
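The two observations above can be sketched as a small helper. This is only a minimal sketch, not the notebook's own method: `title_to_slug` is a hypothetical name, and it only strips punctuation and joins the words. It does not drop words such as "The" or "and", which the links also leave out.

```python
import re

def title_to_slug(title):
    """Turn a title such as '1. Jehovah’s Attributes' into a link fragment.

    Hypothetical helper: removes punctuation and joins the words with '-'.
    """
    # Remove punctuation, curly quotes, long dashes and zero-width spaces.
    cleaned = re.sub(r"[.,!?:“”‘’—\u200b]", "", title)
    # Split on whitespace and join the words with hyphens.
    return "-".join(cleaned.split())

print(title_to_slug("1. Jehovah’s Attributes"))
print(title_to_slug("17. “I Want To”"))
```

The two sample titles come from the list printed earlier; any other title may need extra rules.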



In [None]:
# Punctuation to strip from the titles (the last entry is a zero-width space)
characters = "”“,.!?:’—\u200b"
# Words that appear in the titles but not in the links
stopwords = ['—','-and','-That','-A-','-The']
Link_english_song = []
for s in List_english_song:
    # Replace spaces with hyphens
    s = '-'.join(s.split())
    # Remove special characters
    for ch in characters:
        s = s.replace(ch, '')
    # Remove the words that the links leave out
    for stpword in stopwords:
        if stpword == '-A-':
            # Replace with a single hyphen so that words starting with 'A' are not affected
            s = s.replace('-A-', '-')
        else:
            s = s.replace(stpword, '')
    # To Do: replace the '—' in title 22 with a normal '-' instead of removing it

    print(s)
    Link_english_song.append(s)

1-Jehovahs-Attributes
2-Jehovah-Is-Your-Name
3-Our-Strength-Our-Hope-Our-Confidence
4-Jehovah-Is-My-Shepherd
5-Gods-Wondrous-Works
6-Heavens-Declare-Gods-Glory
7-Jehovah-Our-Strength
8-Jehovah-Is-Our-Refuge
9-Jehovah-Is-Our-King
10-Praise-Jehovah-Our-God
11-Creation-Praises-God
12-Great-God-Jehovah
13-Christ-Our-Model
14-Praising-Earths-New-King
15-Praise-Jehovahs-Firstborn
16-Praise-Jah-for-His-Son-the-Anointed
17-I-Want-To
18-Grateful-for-the-Ransom
19-Lords-Evening-Meal
20-You-Gave-Your-Precious-Son
21-Keep-On-Seeking-First-the-Kingdom
22-Kingdom-Is-in-PlaceLet-It-Come
23-Jehovah-Begins-His-Rule
24-Come-to-Jehovahs-Mountain
25-Special-Possession
26-You-Did-It-for-Me
27-Revealing-of-Gods-Sons
28-Gaining-Jehovahs-Friendship
29-Living-Up-to-Our-Name
30-My-Father-My-God-Friend
31-Oh-Walk-With-God
32-Take-Sides-With-Jehovah
33-Throw-Your-Burden-on-Jehovah
34-Walking-in-Integrity
35-Make-Sure-of-the-More-Important-Things
36-We-Guard-Our-Hearts
37-Serving-Jehovah-Whole-Souled
38-He-Will-Ma

https://www.jw.org/en/library/music-songs/sing-out-joyfully/30-My-Father-My-God-Friend


Cleaning up the Damara song titles

In [None]:
Damara_songs = get_songs(DMR_URL)
List_damara_song = []
for s in Damara_songs: #cleaning it up
    if len(s)>1:
        List_damara_song.append(s)

In [None]:
List_damara_song

['1. Jehovab di ǀgaugu',
 '2. Sadu ǀons ge a Jehova',
 '3. Sida ǀgaib, ǃâubasens tsî ǂgomǃgâb',
 '4. Jehovab ge a ti ǃûi-ao',
 '5. Elob di buruxa sîsengu',
 '6. ǀHommi ge ǁÎb ǂkhaisiba ra ǁgau',
 '7. Jehovab ge a ti ǃgâiǃō',
 '8. Jehovab ge sida sâuǃkhai',
 '9. Jehovab ge sida Gao-aoǃ',
 '10. Kare re Jehova Eloba',
 '11. Kurus ge Eloba ra koa',
 '12. Jehova Buruxa Elob',
 '13. Xristub ge a sida di aiǁgau',
 '14. Kare re ǃhūbaib di ǀasa Gao-aoba',
 '15. Koa re Elob ǂguro ǃnaesabeba!',
 '16. Koa re Jehovaba ǁîb di ǂkhauhesa ǀGôab ǃgao',
 '17. “ǂGao ta ra”',
 '18. Aio re Xoremariba',
 '19. ǃKhūb di ǃuiǂûs',
 '20. Sadu ǀgôaba du ge ge mā',
 '21. ǂGurose Elob Gaosiba ôa re',
 '22. Gaosib ge go ǂgaeǂgui tsoatsoa\u200b—Ab hā re!',
 '23. Jehovab ǂgaeǂguis go tsoatsoa',
 '24. Hā re Jehovab ǃhommi ǃoa',
 '25. ǃKhōǂuibasensa ǁaes',
 '26. Tita du ge dība',
 '27. Elob ǀgôagu di ǂhaiǂhais',
 '28. Jehovab ǀhōsa kai re',
 '29. Sida ǀons ǃoa-ai da ge ra ûi',
 '30. Ti ǁGû, ti Elo tsî ti ǀhōsa',
 '31. El

Let's clean up the links a little by removing certain strings: the `'` at the start and end, the `.` after the numbers, and any other stray character at the end of the last word.
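One detail is easy to miss here: Damara titles can end in the click letter `ǃ`, which looks like an ASCII exclamation mark but is a different character, so stripping `!` alone does not remove it. The sketch below uses only the standard library and assumes the titles use the code point U+01C3. It also shows why that letter appears as `%C7%83` in `DMR_URL`. The variable names are illustrative.

```python
import unicodedata
from urllib.parse import quote

click = "\u01c3"   # the click letter, assumed to be the one used in the titles
bang = "!"         # ASCII exclamation mark

# Print each character with its code point and official Unicode name.
for ch in (click, bang):
    print(repr(ch), hex(ord(ch)), unicodedata.name(ch))

# The two characters compare as different, so each needs its own check.
print(click == bang)

# Percent-encoding the click letter as UTF-8 gives the form seen in DMR_URL.
print(quote(click))
```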

In [None]:
# Punctuation to strip from the titles (includes a zero-width space)
characters_to_remove = ".“”,:‘’?\u200b—"
Link_damara_song = []
for s in List_damara_song:
    # Replace spaces with hyphens
    s = '-'.join(s.split())
    print(s)

    # Remove special characters still in the text
    for ch in characters_to_remove:
        s = s.replace(ch, '')

    # Remove a trailing '!' (ASCII exclamation mark)
    s = s.rstrip('!')
    # Remove a trailing 'ǃ' (the click letter, which only looks like '!')
    if s.endswith('ǃ'):
        s = s[:-1]
    # 'Î' must be lowercase for the link to work
    if 'Î' in s:
        s = s.replace('Î', 'î')
        print(s)

    print(s, "\n")
    Link_damara_song.append(s)

1.-Jehovab-di-ǀgaugu
1-Jehovab-di-ǀgaugu 

2.-Sadu-ǀons-ge-a-Jehova
2-Sadu-ǀons-ge-a-Jehova 

3.-Sida-ǀgaib,-ǃâubasens-tsî-ǂgomǃgâb
3-Sida-ǀgaib-ǃâubasens-tsî-ǂgomǃgâb 

4.-Jehovab-ge-a-ti-ǃûi-ao
4-Jehovab-ge-a-ti-ǃûi-ao 

5.-Elob-di-buruxa-sîsengu
5-Elob-di-buruxa-sîsengu 

6.-ǀHommi-ge-ǁÎb-ǂkhaisiba-ra-ǁgau
6-ǀHommi-ge-ǁîb-ǂkhaisiba-ra-ǁgau
6-ǀHommi-ge-ǁîb-ǂkhaisiba-ra-ǁgau 

7.-Jehovab-ge-a-ti-ǃgâiǃō
7-Jehovab-ge-a-ti-ǃgâiǃō 

8.-Jehovab-ge-sida-sâuǃkhai
8-Jehovab-ge-sida-sâuǃkhai 

9.-Jehovab-ge-sida-Gao-aoǃ
9-Jehovab-ge-sida-Gao-ao 

10.-Kare-re-Jehova-Eloba
10-Kare-re-Jehova-Eloba 

11.-Kurus-ge-Eloba-ra-koa
11-Kurus-ge-Eloba-ra-koa 

12.-Jehova-Buruxa-Elob
12-Jehova-Buruxa-Elob 

13.-Xristub-ge-a-sida-di-aiǁgau
13-Xristub-ge-a-sida-di-aiǁgau 

14.-Kare-re-ǃhūbaib-di-ǀasa-Gao-aoba
14-Kare-re-ǃhūbaib-di-ǀasa-Gao-aoba 

15.-Koa-re-Elob-ǂguro-ǃnaesabeba!
15-Koa-re-Elob-ǂguro-ǃnaesabeba 

16.-Koa-re-Jehovaba-ǁîb-di-ǂkhauhesa-ǀGôab-ǃgao
16-Koa-re-Jehovaba-ǁîb-di-ǂkhauhesa-ǀGôab-ǃgao 


In [None]:
for t in Link_damara_song:
    print(t)

1-Jehovab-di-ǀgaugu
2-Sadu-ǀons-ge-a-Jehova
3-Sida-ǀgaib-ǃâubasens-tsî-ǂgomǃgâb
4-Jehovab-ge-a-ti-ǃûi-ao
5-Elob-di-buruxa-sîsengu
6-ǀHommi-ge-ǁîb-ǂkhaisiba-ra-ǁgau
7-Jehovab-ge-a-ti-ǃgâiǃō
8-Jehovab-ge-sida-sâuǃkhai
9-Jehovab-ge-sida-Gao-ao
10-Kare-re-Jehova-Eloba
11-Kurus-ge-Eloba-ra-koa
12-Jehova-Buruxa-Elob
13-Xristub-ge-a-sida-di-aiǁgau
14-Kare-re-ǃhūbaib-di-ǀasa-Gao-aoba
15-Koa-re-Elob-ǂguro-ǃnaesabeba
16-Koa-re-Jehovaba-ǁîb-di-ǂkhauhesa-ǀGôab-ǃgao
17-ǂGao-ta-ra
18-Aio-re-Xoremariba
19-ǃKhūb-di-ǃuiǂûs
20-Sadu-ǀgôaba-du-ge-ge-mā
21-ǂGurose-Elob-Gaosiba-ôa-re
22-Gaosib-ge-go-ǂgaeǂgui-tsoatsoaAb-hā-re
23-Jehovab-ǂgaeǂguis-go-tsoatsoa
24-Hā-re-Jehovab-ǃhommi-ǃoa
25-ǃKhōǂuibasensa-ǁaes
26-Tita-du-ge-dība
27-Elob-ǀgôagu-di-ǂhaiǂhais
28-Jehovab-ǀhōsa-kai-re
29-Sida-ǀons-ǃoa-ai-da-ge-ra-ûi
30-Ti-ǁGû-ti-Elo-tsî-ti-ǀhōsa
31-Elob-daob-ǃnâ-ǃgû-re
32-ǂGomǂgomsase-Jehovab-ǃoa-ai-hâ-re
33-Sadu-ǃgomsiga-Jehovab-ǃomǁae-mā-re
34-ǂHauǃnâse-ǃgû-re
35-ǁApoǁapo-re-ǂHâǂhâsa-xūna
36-Sida-ǂgaoga-da-ge-ra-

Great! All links work fine now.
Let's get the text, then write it to file.

# To Do
* Write to file.
* Clean up the data a little more: merge all sentences, created line by line, and evaluate all of them.
* Remove the additional spaces '\xa0'.
* Concatenate until each line holds one paragraph.
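The clean-up steps above can be sketched on a tiny made-up sample. This is a minimal sketch under one assumption, namely that each verse starts with its number followed by a full stop (for example `1.`). `split_verses` is a hypothetical name and the sample lines are placeholders, not scraped text.

```python
import re

def split_verses(lines):
    """Join the raw lines of one song and split them into verses.

    Assumes each verse starts with its number, e.g. '1.'.
    """
    # Replace non-breaking spaces, then drop empty and digit-only lines.
    cleaned = [ln.replace("\xa0", " ").strip() for ln in lines]
    cleaned = [ln for ln in cleaned if ln and not ln.isdigit()]
    text = " ".join(cleaned)
    # Split at each verse number; the piece before verse 1 is dropped.
    parts = re.split(r"\d+\.\s*", text)[1:]
    return [p.strip() for p in parts if p.strip()]

sample = ["Song title", "1. First line,", "second\xa0line.", "", "2. Another verse."]
for verse in split_verses(sample):
    print(verse)
```

Each returned entry is one verse on a single line, which is the shape needed later for line-by-line pairing.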

In [None]:
!rm -r Scrapped/* 

rm: cannot remove 'Scrapped/*': No such file or directory


In [None]:
def write_song_to_file(sub_url, songs, lang):
    # Create the output folder for this language (no error if it already exists)
    os.makedirs("Scrapped/" + lang, exist_ok=True)
    for song_link in songs:
        address = sub_url + song_link
        print(address)
        url = requests.get(address)
        page = BeautifulSoup(url.text, 'lxml')
        # Get the song text: the lines of the 'pGroup' div, skipping the first one
        song = page.find('div', attrs={'class': 'pGroup'}).text.split('\n')[1:]
        # Strip the song URL down to the song title and use it as the file name
        filename = str(address).split('/')[-1]
        # Drop lines that contain only digits, join the rest, then split on the
        # verse numbers so that each verse is written as one full line
        res = " ".join(filter(lambda x: not x.isdigit(), song))
        res = re.split(r"\d+", res)[1:]
        with open("Scrapped/" + lang + "/" + filename + ".txt", 'w', encoding='utf-8') as f:
            for verse in res:
                f.write("{}\n".format(verse))
        # To Do: merge the verses, create the corpus line by line,
        # and check that all of them are present

In [None]:
write_song_to_file(ENG_URL, Link_english_song[0:150],"English")

https://www.jw.org/en/library/music-songs/sing-out-joyfully/1-Jehovahs-Attributes
https://www.jw.org/en/library/music-songs/sing-out-joyfully/2-Jehovah-Is-Your-Name
https://www.jw.org/en/library/music-songs/sing-out-joyfully/3-Our-Strength-Our-Hope-Our-Confidence
https://www.jw.org/en/library/music-songs/sing-out-joyfully/4-Jehovah-Is-My-Shepherd
https://www.jw.org/en/library/music-songs/sing-out-joyfully/5-Gods-Wondrous-Works
https://www.jw.org/en/library/music-songs/sing-out-joyfully/6-Heavens-Declare-Gods-Glory
https://www.jw.org/en/library/music-songs/sing-out-joyfully/7-Jehovah-Our-Strength
https://www.jw.org/en/library/music-songs/sing-out-joyfully/8-Jehovah-Is-Our-Refuge
https://www.jw.org/en/library/music-songs/sing-out-joyfully/9-Jehovah-Is-Our-King
https://www.jw.org/en/library/music-songs/sing-out-joyfully/10-Praise-Jehovah-Our-God
https://www.jw.org/en/library/music-songs/sing-out-joyfully/11-Creation-Praises-God
https://www.jw.org/en/library/music-songs/sing-out-joyfully/1

Great, all works well for English. Let's do the same for Damara.

In [None]:
import pandas as pd

In [None]:
write_song_to_file(DMR_URL, Link_damara_song[:150], "Damara")

https://www.jw.org/naq-x-dmr/ǂkhanin/musiki-ǁnaetsanadi/%C7%83gâiaǂgaob-ǀkha-jehovaba-ǁnaeba-re/1-Jehovab-di-ǀgaugu
https://www.jw.org/naq-x-dmr/ǂkhanin/musiki-ǁnaetsanadi/%C7%83gâiaǂgaob-ǀkha-jehovaba-ǁnaeba-re/2-Sadu-ǀons-ge-a-Jehova
https://www.jw.org/naq-x-dmr/ǂkhanin/musiki-ǁnaetsanadi/%C7%83gâiaǂgaob-ǀkha-jehovaba-ǁnaeba-re/3-Sida-ǀgaib-ǃâubasens-tsî-ǂgomǃgâb
https://www.jw.org/naq-x-dmr/ǂkhanin/musiki-ǁnaetsanadi/%C7%83gâiaǂgaob-ǀkha-jehovaba-ǁnaeba-re/4-Jehovab-ge-a-ti-ǃûi-ao
https://www.jw.org/naq-x-dmr/ǂkhanin/musiki-ǁnaetsanadi/%C7%83gâiaǂgaob-ǀkha-jehovaba-ǁnaeba-re/5-Elob-di-buruxa-sîsengu
https://www.jw.org/naq-x-dmr/ǂkhanin/musiki-ǁnaetsanadi/%C7%83gâiaǂgaob-ǀkha-jehovaba-ǁnaeba-re/6-ǀHommi-ge-ǁîb-ǂkhaisiba-ra-ǁgau
https://www.jw.org/naq-x-dmr/ǂkhanin/musiki-ǁnaetsanadi/%C7%83gâiaǂgaob-ǀkha-jehovaba-ǁnaeba-re/7-Jehovab-ge-a-ti-ǃgâiǃō
https://www.jw.org/naq-x-dmr/ǂkhanin/musiki-ǁnaetsanadi/%C7%83gâiaǂgaob-ǀkha-jehovaba-ǁnaeba-re/8-Jehovab-ge-sida-sâuǃkhai
https://www.jw.o

Great! Successfully writing songs to file.

To Do: 

*   Write functions to merge songs
*   Create a parallel corpus using a pandas DataFrame
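One way to approach the second item is to pair the verses song by song, so that a song with a different number of verses in the two languages does not shift every later pair. The sketch below is not the notebook's method. `build_parallel_corpus` and its dictionary inputs (song number mapped to a list of verses) are assumptions made for illustration, and the sample strings are placeholders.

```python
import pandas as pd

def build_parallel_corpus(damara_songs, english_songs):
    """Pair verses song by song, keeping only songs whose verse counts match.

    Both arguments are hypothetical dicts: song number -> list of verses.
    """
    rows = []
    skipped = []
    for number in sorted(damara_songs.keys() & english_songs.keys()):
        dmr, eng = damara_songs[number], english_songs[number]
        if len(dmr) != len(eng):
            # Different verse counts would misalign the pairs for this song.
            skipped.append(number)
            continue
        rows.extend({"Damara": d, "English": e} for d, e in zip(dmr, eng))
    return pd.DataFrame(rows, columns=["Damara", "English"]), skipped

# Placeholder strings, not real song text.
damara_sample = {1: ["dmr 1a", "dmr 1b"], 2: ["dmr 2a"]}
english_sample = {1: ["eng 1a", "eng 1b"], 2: ["eng 2a", "eng 2b"]}
corpus, skipped = build_parallel_corpus(damara_sample, english_sample)
print(corpus)
print("Skipped songs:", skipped)
```

Returning the skipped song numbers makes it easy to inspect those songs by hand instead of silently pairing unrelated verses.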



In [None]:
def merge_songs(lang, songs):
    # Paths of the individual song files, in song order
    file_lang = ["Scrapped/" + lang + "/" + sn + ".txt" for sn in songs]

    # Concatenate all the song files into a single All.txt
    with open("Scrapped/" + lang + "/All.txt", "wb") as write_file:
        for f in file_lang:
            with open(f, 'rb') as r:
                write_file.write(r.read())
          


In [None]:
# Add list of songs to be omitted. Damara has 8 songs.
#omit_dmr_songs
merge_songs("Damara",Link_damara_song[:150])

In [None]:
# Add list of songs to be omitted. English has 40 songs.
#omit_songs(Link_english_song[21, 27, 36, 72, 96, 125, 129, 143])
merge_songs("English",Link_english_song[:150])


In [None]:
damara = pd.read_csv("Scrapped/Damara/All.txt", header=None, sep="\n")
damara.columns = ["Damara"]
damara.head()

Unnamed: 0,Damara
0,". Je-ho-va ǃKhū ǀgai-sa tsî du a ǁkhā, tsî ûib..."
1,. ǂHa-nu-ai as ge sa-du di tron-sa. Sa-du ǂha-...
2,. Tsî ǃgō-sa-se sa-du di kai ǀnam-ma. Tā-tsē d...
3,. Bu-ru-xa a E-lob— Hoa-na ge a ku-ru-ba. Mâ-ǁ...
4,". Sa-du ra ī kai da, ǂgao du ra kha-mi hoa-ǁae..."


In [None]:
english = pd.read_csv("Scrapped/English/All.txt", header=None, sep="\n")
english.columns = ["English"]
english.head()

Unnamed: 0,English
0,". Jehovah our God, exalted in might, Creator o..."
1,". Your heavenly throne, on justice it stands. ..."
2,. The greatest of all is your perfect love. Be...
3,. The living and true God— The God of all crea...
4,". You cause us to become Whatever you desire, ..."
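Some newer pandas releases may reject a newline as the `sep` argument of `read_csv`. As a hedged alternative, and assuming that each line of `All.txt` holds exactly one verse, the file can be read with plain Python. `read_verse_lines` is a hypothetical helper, and the demonstration uses a temporary file with placeholder text.

```python
import os
import tempfile

def read_verse_lines(path):
    """Return the non-empty lines of a text file, one verse per entry."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]

# Demonstration on a temporary file with placeholder text.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "All.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(". first verse, with a comma\n\n. second verse\n")
    lines = read_verse_lines(demo_path)

print(lines)
```

The resulting list can then be wrapped in a one-column frame, for example `pd.DataFrame({"English": lines})`. Commas inside a verse are kept as ordinary text because nothing is parsed as a separator.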


In [None]:
Corpus = pd.concat([damara,english], axis=1)
Corpus = Corpus.dropna()

In [None]:
Corpus

Unnamed: 0,Damara,English
0,". Je-ho-va ǃKhū ǀgai-sa tsî du a ǁkhā, tsî ûib...",". Jehovah our God, exalted in might, Creator o..."
1,. ǂHa-nu-ai as ge sa-du di tron-sa. Sa-du ǂha-...,". Your heavenly throne, on justice it stands. ..."
2,. Tsî ǃgō-sa-se sa-du di kai ǀnam-ma. Tā-tsē d...,. The greatest of all is your perfect love. Be...
3,. Bu-ru-xa a E-lob— Hoa-na ge a ku-ru-ba. Mâ-ǁ...,. The living and true God— The God of all crea...
4,". Sa-du ra ī kai da, ǂgao du ra kha-mi hoa-ǁae...",". You cause us to become Whatever you desire, ..."
...,...,...
277,". Je-ho-va ǂkhîb E-lob, ǂkhî-ba du ge a mî-mâi...",". Let all men the pure New Jerusalem see, The ..."
278,". Ga-gab ge ra hui da, Sa-du mîs ǃnâ-ba ra mā....",. This city so grand will become a delight. It...
279,". ǀNam du an ge ra ǀgui, ǀhom-mi tsî ǃhūb-aib ...",". The living God, Jehovah, you have proved to ..."
280,. Je-ho-va E-lob ǀgui-ǃnâ-xa-sib di-ba koa. ǁÎ...,". Though ropes of death encircle me, I call to..."


In [None]:
# Remove the leading full stop and space left over from the verse numbers
Corpus['Damara'] = Corpus['Damara'].str.replace(r'^\.\s*', '', regex=True)
Corpus['English'] = Corpus['English'].str.replace(r'^\.\s*', '', regex=True)

In [None]:
# Strip leading or trailing spaces
Corpus['Damara'] = Corpus['Damara'].apply(lambda x: x.strip())
Corpus['English'] = Corpus['English'].apply(lambda x: x.strip())

## Let's save this somewhere safe!



In [None]:
from google.colab import drive
drive.mount('/content/drive')
!mkdir "drive/My Drive/masakhane-damara"

Mounted at /content/drive


In [None]:
# Make folder (-p avoids an error if it already exists)
!mkdir -p "drive/My Drive/masakhane-damara"

In [None]:
# Save files
import csv
Corpus.to_csv("drive/My Drive/masakhane-damara/DMREN.csv",index=False)
Corpus.to_csv("drive/My Drive/masakhane-damara/songs.en", columns=['English'], header=None, index=False)
Corpus.to_csv("drive/My Drive/masakhane-damara/songs.naq", columns=['Damara'], header=None, index=False)
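Note that `to_csv` applies CSV quoting by default, so any line that contains a comma is wrapped in double quotes in `songs.en` and `songs.naq`. If plain files with one sentence per line are wanted instead, a minimal sketch could look like this. `write_plain_lines` is a hypothetical helper, and the demonstration writes placeholder sentences to a temporary folder rather than to Drive.

```python
import os
import tempfile

def write_plain_lines(sentences, path):
    """Write one sentence per line, with no CSV quoting and no header."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            f.write(sentence + "\n")

# Demonstration with placeholder sentences in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "songs.en")
    write_plain_lines(["first line, with a comma", "second line"], out_path)
    with open(out_path, encoding="utf-8") as f:
        written = f.read()

print(written)
```

In this notebook the helper could be called with a column such as `Corpus['English']`, since iterating over a pandas Series yields its values.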


In [None]:
# Load in data for further processing
Corpus = pd.read_csv("drive/My Drive/masakhane-damara/DMREN.csv")

Corpus

Unnamed: 0,Damara,English
0,"Je-ho-va ǃKhū ǀgai-sa tsî du a ǁkhā, tsî ûib d...","Jehovah our God, exalted in might, Creator of ..."
1,ǂHa-nu-ai as ge sa-du di tron-sa. Sa-du ǂha-nu...,"Your heavenly throne, on justice it stands. To..."
2,Tsî ǃgō-sa-se sa-du di kai ǀnam-ma. Tā-tsē da ...,The greatest of all is your perfect love. Beyo...
3,Bu-ru-xa a E-lob— Hoa-na ge a ku-ru-ba. Mâ-ǁae...,The living and true God— The God of all creati...
4,"Sa-du ra ī kai da, ǂgao du ra kha-mi hoa-ǁae, ...","You cause us to become Whatever you desire, To..."
...,...,...
277,"Je-ho-va ǂkhîb E-lob, ǂkhî-ba du ge a mî-mâi. ...","Let all men the pure New Jerusalem see, The br..."
278,"Ga-gab ge ra hui da, Sa-du mîs ǃnâ-ba ra mā. D...",This city so grand will become a delight. Its ...
279,"ǀNam du an ge ra ǀgui, ǀhom-mi tsî ǃhūb-aib ts...","The living God, Jehovah, you have proved to be..."
280,Je-ho-va E-lob ǀgui-ǃnâ-xa-sib di-ba koa. ǁÎb ...,"Though ropes of death encircle me, I call to y..."
