# Text generation using BERT model
### Luxin Tian and Heather Chen
In this notebook we mainly use a BERT model for text generation.

In [44]:
import os
import sklearn 
import pandas as pd
import numpy as np
import re

import lucem_illud_2020 

In [45]:
# First, load the corpus into a DataFrame
file_path = './raw_data/Converted sessions/'
entry_list = []

list_of_folders = os.listdir(file_path)
for session in list_of_folders:
    if not session.startswith('.') and 'Session' in session: 
        list_of_files = os.listdir(file_path + session)
    else: 
        list_of_files = []
    for country in list_of_files: 
        if not country.startswith('.') and '.txt' in country:
            filename = country.split('.')[0].split('_')
            country_code = filename[0]
            session_code = filename[1]
            year_code = filename[2]
            # Use a context manager so the file handle is closed after reading
            with open(file_path + session + '/' + country) as f:
                text = f.read()
            entry_list.append(pd.Series({'filename': country, 
                                         'country_code': country_code, 
                                         'session': session_code, 
                                         'year': year_code, 
                                         'text': text}))

ungdc_df = pd.DataFrame(entry_list)
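
The loop above relies on every file name following a `COUNTRY_SESSION_YEAR.txt` pattern such as `BRB_73_2018.txt`. As a minimal sketch, assuming that pattern holds for all files, the parsing step could be isolated in a small function. `parse_ungdc_filename` is a hypothetical helper name, not part of the notebook, and unlike the loop above it converts the session and year to integers and raises an error on names that do not match.

```python
import os


def parse_ungdc_filename(filename):
    """Split a name like 'BRB_73_2018.txt' into country, session and year."""
    stem, _ext = os.path.splitext(os.path.basename(filename))
    parts = stem.split('_')
    if len(parts) != 3:
        # Fail loudly instead of silently mis-assigning the fields
        raise ValueError(f"unexpected file name: {filename!r}")
    country_code, session_code, year_code = parts
    return {'country_code': country_code,
            'session': int(session_code),
            'year': int(year_code)}


print(parse_ungdc_filename('BRB_73_2018.txt'))
```

Storing the year as an integer at load time would make a later `astype('int')` conversion unnecessary.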

In [46]:
ungdc_df

Unnamed: 0,filename,country_code,session,year,text
0,BRB_73_2018.txt,BRB,73,2018,Let me begin by congratulating Ms. María Ferna...
1,IND_73_2018.txt,IND,73,2018,"On my own behalf and on behalf of my country, ..."
2,ARG_73_2018.txt,ARG,73,2018,I would like to congratulate the President on ...
3,JOR_73_2018.txt,JOR,73,2018,It is an honour to take part in the general de...
4,SWE_73_2018.txt,SWE,73,2018,"Just a bit more than a week ago, we honoured t..."
...,...,...,...,...,...
8088,LIE_69_2014.txt,LIE,69,2014,This has been an \nenormously difficult year f...
8089,AZE_69_2014.txt,AZE,69,2014,"At the outset, \nI would like to congratulate ..."
8090,GRC_69_2014.txt,GRC,69,2014,This sixty-ninth session of the General Assemb...
8091,ISL_69_2014.txt,ISL,69,2014,Next year we will \ncelebrate the seventieth a...


In [5]:
# Split the data into a training set and a test set
from sklearn.model_selection import train_test_split
train_text, test_text = train_test_split(ungdc_df['text'], test_size=0.2)

In [13]:
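# Save them to the local repo as space-separated text files
# (mode='w' overwrites, so re-running the cell does not append duplicate rows)
train_text.to_frame().to_csv(r'train_text_ungdc', header=None, index=None, sep=' ', mode='w')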

In [14]:
test_text.to_frame().to_csv(r'test_text_ungdc', header=None, index=None, sep=' ', mode='w')  # overwrite instead of appending on re-run

The model training that follows is done in Google Colab. The whole dataset is too large to use, so we try using only the texts from 2000 onward.

In [54]:
ungdc_2000_df = ungdc_df.loc[ungdc_df['year'].astype('int') >= 2000]
ungdc_2000_df = ungdc_2000_df.sort_values(by='year')
ungdc_2000_df

Unnamed: 0,filename,country_code,session,year,text
6053,SAU_55_2000.txt,SAU,55,2000,It\ngives me pleasure at the outset of the fif...
5938,AZE_55_2000.txt,AZE,55,2000,"Mr.\nPresident, allow me first of all to congr..."
5937,GRC_55_2000.txt,GRC,55,2000,I express my sincere\ncongratulations to the P...
5936,VEN_55_2000.txt,VEN,55,2000,I\nagain extend our congratulations to the Pre...
5935,LIE_55_2000.txt,LIE,55,2000,Allow me to begin my\nremarks by congratulatin...
...,...,...,...,...,...
127,MWI_73_2018.txt,MWI,73,2018,The General Assembly is a representation of hu...
126,LCA_73_2018.txt,LCA,73,2018,Allow me to begin by congratulating the Presid...
125,BEN_73_2018.txt,BEN,73,2018,I have the honour to deliver the following st...
123,ROU_73_2018.txt,ROU,73,2018,I am particularly honoured to address this ye...


In [55]:
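# Replace the line breaks inside each speech with a space.
# A vectorized replacement is used because assigning to the rows yielded by
# iterrows() is not guaranteed to modify the DataFrame itself.
ungdc_2000_df['text'] = ungdc_2000_df['text'].str.replace('\n', ' ', regex=False)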

ungdc_2000_df

Unnamed: 0,filename,country_code,session,year,text
6053,SAU_55_2000.txt,SAU,55,2000,It gives me pleasure at the outset of the fift...
5938,AZE_55_2000.txt,AZE,55,2000,"Mr. President, allow me first of all to congra..."
5937,GRC_55_2000.txt,GRC,55,2000,I express my sincere congratulations to the Pr...
5936,VEN_55_2000.txt,VEN,55,2000,I again extend our congratulations to the Pres...
5935,LIE_55_2000.txt,LIE,55,2000,Allow me to begin my remarks by congratulating...
...,...,...,...,...,...
127,MWI_73_2018.txt,MWI,73,2018,The General Assembly is a representation of hu...
126,LCA_73_2018.txt,LCA,73,2018,Allow me to begin by congratulating the Presid...
125,BEN_73_2018.txt,BEN,73,2018,I have the honour to deliver the following st...
123,ROU_73_2018.txt,ROU,73,2018,I am particularly honoured to address this ye...
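
Replacing each `\n` with a single space can leave double spaces wherever a line break sat next to an existing space. The following is a minimal sketch of a stricter cleanup that collapses every run of whitespace into one space. `normalize_whitespace` is a hypothetical helper name, and whether the extra normalization matters for training is an assumption rather than something tested here.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# A made-up fragment with a line break next to a space
print(normalize_whitespace("It\ngives me pleasure \n at the outset"))
```

With pandas, the same function could be applied to a whole column via `Series.map`.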


In [57]:
# Split the post-2000 data into a training set and a test set
from sklearn.model_selection import train_test_split
train_text, test_text = train_test_split(ungdc_2000_df['text'], test_size=0.2)

In [58]:
# Save them to the local repo as space-separated text files
# (mode='w' overwrites, so re-running the cell does not append duplicate rows)
train_text.to_frame().to_csv(r'train_text_ungdc2000', header=None, index=None, sep=' ', mode='w')
test_text.to_frame().to_csv(r'test_text_ungdc2000', header=None, index=None, sep=' ', mode='w')
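
One detail of saving through `to_csv` with `sep=' '`: under pandas' default quoting, any field that contains the separator is wrapped in double quotes, so each speech is written as a quoted string. This may be why the samples generated later in the notebook end with a stray `"`, although that link is a guess. A minimal sketch of an alternative that writes plain text, one speech per line, using only the standard library follows. `write_one_document_per_line` and the demo file name are hypothetical, and the demo strings are made up.

```python
import os
import tempfile


def write_one_document_per_line(texts, path):
    """Write each text on its own line, flattening internal whitespace."""
    # 'w' overwrites the file, so re-running does not append duplicates
    with open(path, 'w', encoding='utf-8') as handle:
        for text in texts:
            handle.write(' '.join(text.split()) + '\n')


# Small demonstration with made-up strings standing in for speeches
demo_path = os.path.join(tempfile.mkdtemp(), 'train_text_demo.txt')
write_one_document_per_line(['First\nspeech.', 'Second "quoted" speech.'], demo_path)
with open(demo_path, encoding='utf-8') as handle:
    print(handle.read())
```

The function accepts any iterable of strings, so a pandas Series such as `train_text` could be passed directly.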

We train the model in Google Colab; the notebook is available here: [My Google Colab notebook](https://colab.research.google.com/drive/1VYVadCwQX3RU1cPIKcYHsP8dy7OPCI_g#scrollTo=L5bjXfZhuKOj)

Now load the model that was trained in Google Colab.

In [61]:
from transformers import AutoModelWithLMHead, AutoTokenizer

tokenizer_ungdc = AutoTokenizer.from_pretrained("output_gpt_ungdc2000")
model_ungdc = AutoModelWithLMHead.from_pretrained("output_gpt_ungdc2000")

### Discovering topics using text generation

In [63]:
sequence = "Climate change is"

# Named input_ids to avoid shadowing the built-in input()
input_ids = tokenizer_ungdc.encode(sequence, return_tensors="pt")
generated = model_ungdc.generate(input_ids, max_length=50, bos_token_id=1, pad_token_id=1, eos_token_ids=1)

resulting_string = tokenizer_ungdc.decode(generated.tolist()[0])
print(resulting_string)

Climate change is a major challenge for the world. The world is facing a new challenge, one that is not only global, but also global in scope. The United Nations is the only global organization that can effectively address the challenges of climate change."
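
The cells in this section repeat the same three steps (encode the prompt, generate, decode) with a different prompt each time. As a minimal sketch, those steps could be wrapped in one function. `complete_prompt` is a hypothetical helper name. The tokenizer and model are passed in as arguments, so the function assumes nothing about which checkpoint is loaded, and any extra keyword arguments are forwarded unchanged to `generate`.

```python
def complete_prompt(prompt, tokenizer, model, max_length=50, **generate_kwargs):
    """Encode a prompt, generate a continuation, and decode it to a string."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated = model.generate(input_ids, max_length=max_length, **generate_kwargs)
    # generate() returns a batch of sequences; decode the first (and only) one
    return tokenizer.decode(generated.tolist()[0])


# Intended usage with the objects loaded earlier in this notebook:
# print(complete_prompt("Climate change is", tokenizer_ungdc, model_ungdc,
#                       max_length=50, bos_token_id=1, pad_token_id=1))
```

Keeping the generation settings in one place makes it easier to compare prompts under identical conditions.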


In [65]:
sequence = "Public health is"

# Named input_ids to avoid shadowing the built-in input()
input_ids = tokenizer_ungdc.encode(sequence, return_tensors="pt")
generated = model_ungdc.generate(input_ids, max_length=100, bos_token_id=1, pad_token_id=1, eos_token_ids=1)

resulting_string = tokenizer_ungdc.decode(generated.tolist()[0])
print(resulting_string)

Public health is the most important  priority of the United Nations. We must ensure that  the United Nations is able to respond to the needs of  the most vulnerable countries.   The United Nations is the only Organization that  can provide the necessary support to the most vulnerable  countries. The United Nations is the only  Organization that can provide the necessary support to  the most vulnerable countries.   The United Nations is the only Organization that can  provide the necessary support to the most vulnerable"


In [66]:
sequence = "Terrorism is"

# Named input_ids to avoid shadowing the built-in input()
input_ids = tokenizer_ungdc.encode(sequence, return_tensors="pt")
generated = model_ungdc.generate(input_ids, max_length=100, bos_token_id=1, pad_token_id=1, eos_token_ids=1)

resulting_string = tokenizer_ungdc.decode(generated.tolist()[0])
print(resulting_string)

Terrorism is a threat to international peace and security. It is a threat to the stability of the region and to the stability of the world. It is a threat to the stability of the world. It is a threat to the stability of the world. It is a threat to the stability of the world. It is a threat to the stability of the world. It is a threat to the stability of the world. It is a threat to the stability of the world. It is a threat"


In [70]:
sequence = "The Syrian conflict has"

# Named input_ids to avoid shadowing the built-in input()
input_ids = tokenizer_ungdc.encode(sequence, return_tensors="pt")
generated = model_ungdc.generate(input_ids, max_length=50, bos_token_id=1, pad_token_id=1, eos_token_ids=1)

resulting_string = tokenizer_ungdc.decode(generated.tolist()[0])
print(resulting_string)

The Syrian conflict has been a source of great concern to the international community. The Syrian people have suffered a terrible loss of life and have suffered a great loss of property. The Syrian people have suffered a great loss of life and have suffered a great"


In [72]:
sequence = "The relationship between North and South Korea has"

# Named input_ids to avoid shadowing the built-in input()
input_ids = tokenizer_ungdc.encode(sequence, return_tensors="pt")
generated = model_ungdc.generate(input_ids, max_length=100, bos_token_id=1, pad_token_id=1, eos_token_ids=1)

resulting_string = tokenizer_ungdc.decode(generated.tolist()[0])
print(resulting_string)

The relationship between North and South Korea has been a source of great concern to the international community. The recent nuclear tests of the United States and the Russian Federation have demonstrated that the United States and Russia are not only adversaries, but also partners in the fight against terrorism. The United States and Russia have been at odds for decades, and the United States and Russia have been at odds for decades. The United States and Russia have been at odds for decades, and the United States and Russia have been at"


In [73]:
sequence = "The conflicts in Crimea has"

# Named input_ids to avoid shadowing the built-in input()
input_ids = tokenizer_ungdc.encode(sequence, return_tensors="pt")
generated = model_ungdc.generate(input_ids, max_length=100, bos_token_id=1, pad_token_id=1, eos_token_ids=1)

resulting_string = tokenizer_ungdc.decode(generated.tolist()[0])
print(resulting_string)

The conflicts in Crimea has been a source of great concern to the international community. The situation in the region has become a source of concern to the international community. The situation in the Democratic Republic of the Congo has been a source of concern to the international community. The situation in the Democratic Republic of the Congo has been a source of concern to the international community. The situation in the Democratic Republic of the Congo has been a source of concern to the international community. The situation in the Democratic Republic of"


In [74]:
sequence = "China is"

# Named input_ids to avoid shadowing the built-in input()
input_ids = tokenizer_ungdc.encode(sequence, return_tensors="pt")
generated = model_ungdc.generate(input_ids, max_length=100, bos_token_id=1, pad_token_id=1, eos_token_ids=1)

resulting_string = tokenizer_ungdc.decode(generated.tolist()[0])
print(resulting_string)

China is a country that has been a member of the United Nations for more than half a century. We are proud to be a member of the United Nations. We are proud to be a member of the United Nations. We are proud to be a member of the United Nations. We are proud to be a member of the United Nations. We are proud to be a member of the United Nations. We are proud to be a member of the United Nations. We are proud to be a member"
