# Natural Language Processing - Solutions

In this tutorial we will work with transcripts of the general debates at the United Nations from 1970 to 2016. We will try to see whether the fall of the Iron Curtain changed the debates.

This data set contains the following features:

* 'session': The UN session. There is one session per year, and the data set ranges from session 25 to session 71.
* 'year': The year of the session, from 1970 to 2016.
* 'country': The representative’s country, as an ISO 3166 Alpha-3 country code (more information: https://www.iso.org/iso-3166-country-codes.html).
* 'text': The complete text of that country’s statement in the general debate from that year, with OCR page numbers removed.
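
The session and year ranges listed above differ by a constant: session 25 took place in 1970 and session 71 in 2016, so the year equals the session number plus 1945. The following is a minimal sketch of that relationship, which can serve as a sanity check on the data. The names `SESSION_YEAR_OFFSET` and `session_to_year` are introduced here for illustration only and are not part of the data set.

```python
# Offset implied by the ranges above: session 25 <-> 1970, session 71 <-> 2016.
SESSION_YEAR_OFFSET = 1945


def session_to_year(session: int) -> int:
    """Return the calendar year of a UN session (illustrative helper)."""
    return session + SESSION_YEAR_OFFSET


# Both endpoints of the data set agree with the offset.
assert session_to_year(25) == 1970
assert session_to_year(71) == 2016
```

Once the data has been loaded into a data frame `df`, the same check could be written as `(df['year'] - df['session'] == 1945).all()`.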

## Packages

In [1]:
InstallPackages = False
if InstallPackages:
    import sys
    !pip install pandas
    !pip install nltk
    !pip install spacy
    !pip install numpy

In [2]:
import pandas as pd
import numpy as np
import nltk
from nltk import ngrams
from nltk.corpus import stopwords
import re
import spacy

In [3]:
DownloadAdditions = False
if DownloadAdditions:
    nltk.download('stopwords')
    spacy.cli.download('en_core_web_sm')  #other models (e.g. de_core_news_sm, de_core_news_md) can be downloaded the same way

In [4]:
nlp = spacy.load('en_core_web_sm')

## Read the Data
Read in the 02.1 un-general-debates.csv file and set it to a data frame called df.

In [5]:
df = pd.read_csv('Data/02.1 un-general-debates.csv')

Use info on df

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7507 entries, 0 to 7506
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   session  7507 non-null   int64 
 1   year     7507 non-null   int64 
 2   country  7507 non-null   object
 3   text     7507 non-null   object
dtypes: int64(2), object(2)
memory usage: 234.7+ KB


Check the head of df

In [7]:
df.head()

   session  year country  text
0       44  1989     MDV  It is indeed a pleasure for me and the member...
1       44  1989     FIN  \nMay I begin by congratulating you. Sir, on ...
2       44  1989     NER  \nMr. President, it is a particular pleasure ...
3       44  1989     URY  \nDuring the debate at the fortieth session o...
4       44  1989     ZWE  I should like at the outset to express my del...


## Exercise 1 - Write a function to pre-process the general debates

*Hint 1:* You may write a function to lemmatize the general debates

*Hint 2:* You may write a function to correct spelling mistakes in the general debates

*Hint 3:* You may write a pre-processing function that uses those functions


Lemmatization

In [8]:
def lemmatize(statement):
    
    """
    This function is used to lemmatize the corpus. 
    
    """
    
    #define strings which should not be lemmatized (an empty set means everything is lemmatized)
    donotlemmatize = set()#{'A','B'}
    
    #process the corpus using spaCy
    doc = nlp(statement)
    
    #lemmatize the corpus
    lemmas = [token.text.lower() if token.text in donotlemmatize else token.lemma_.lower() for token in doc]
    
    return lemmas
    

Correction function

In [9]:
def correction_all(statement):
    
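    """
    This function corrects spelling mistakes. It is currently a placeholder
    that returns the statements unchanged.
    
    """
    #example of a correction rule (disabled): replace the standalone token '20' with a space
    #statement = statement.apply(lambda x: re.sub(r'\b20\b', ' ', x))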

    return statement       
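
As written, `correction_all` returns its input unchanged. The following is a minimal sketch of what a dictionary-based correction could look like on a single string, using only the standard library. The misspellings in `CORRECTIONS` are hypothetical examples, not errors confirmed in the UN corpus, and `correct_text` is an illustrative helper name.

```python
import re

# Hypothetical misspelling -> correction pairs; replace with errors found in the corpus.
CORRECTIONS = {
    "goverment": "government",
    "recieve": "receive",
}

# One pattern matching any listed misspelling as a whole word, case-insensitively.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in CORRECTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def correct_text(text: str) -> str:
    """Replace each listed misspelling with its (lowercase) correction."""
    return _PATTERN.sub(lambda match: CORRECTIONS[match.group(0).lower()], text)


print(correct_text("The goverment will recieve aid."))
# The government will receive aid.
```

Because the statements arrive as a pandas Series, such a helper could be applied inside `correction_all` with `statement = statement.apply(correct_text)`. The replacement is always lowercase, which is harmless here because lemmatization lowercases every token anyway.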

Pre-processing function

In [10]:
def preprocess(docs):
    
    """
    
    This function is used to preprocess the corpus: correct spelling, remove special characters, lemmatize and remove stopwords.
    
    """
        
    #spelling correction of strings
    docs = correction_all(docs)
    
    #define strings which should not be stopwords
    nostopwords = {''}#{'A', 'B','a', 'b','X', 'Y','x','y'}
    
    #create set of stop words (a set keeps the membership test in the stopword removal fast)
    stop_words = set(stopwords.words('english')) - nostopwords
    
    #replace special characters with a space; letters (including accented ones such as umlauts), digits, spaces, newlines and '/' are kept
    docs = docs.apply(lambda x: re.sub(r'[^ \nA-Za-z0-9À-ÖØ-öø-ÿ/]+', ' ', x))
    
    #lemmatize strings
    docs = docs.apply(lambda x: " ".join(lemmatize(x)))
    
    #remove stopwords
    docs = docs.apply(lambda x: " ".join(x for x in str(x).split() if x not in stop_words))

    
    return docs
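
To illustrate what the character filter and the stopword step in `preprocess` do, the sketch below applies both to a single string using only the standard library. It makes several simplifying assumptions: it uses a tiny hand-picked stopword set (`DEMO_STOP_WORDS`, hypothetical) instead of NLTK's English list, it lowercases instead of lemmatizing, and it does not call spaCy at all. Its output will therefore differ from the real pipeline.

```python
import re

# Same character class as in preprocess(): every other character becomes a space.
SPECIAL_CHARS = re.compile(r"[^ \nA-Za-z0-9À-ÖØ-öø-ÿ/]+")

# Hypothetical miniature stopword set; the notebook uses NLTK's English list.
DEMO_STOP_WORDS = {"it", "is", "a", "for", "me"}


def demo_preprocess(text: str) -> str:
    """Strip special characters, lowercase, and drop the demo stopwords."""
    cleaned = SPECIAL_CHARS.sub(" ", text)
    tokens = cleaned.lower().split()
    return " ".join(token for token in tokens if token not in DEMO_STOP_WORDS)


# The leading "\ufeff" mimics the byte order mark found at the start of some statements.
sample = "\ufeffIt is indeed a pleasure for me, Mr. President!"
print(demo_preprocess(sample))
# indeed pleasure mr president
```

The byte order mark and the punctuation are not in the allowed character class, so they are replaced by spaces and disappear when the string is split into tokens.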

## Exercise 2 - Pre-Process the General Debates

Now it's time to pre-process the general debates.

In [11]:
df['document'] = preprocess(df['text'].astype(str))

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7507 entries, 0 to 7506
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   session   7507 non-null   int64 
 1   year      7507 non-null   int64 
 2   country   7507 non-null   object
 3   text      7507 non-null   object
 4   document  7507 non-null   object
dtypes: int64(2), object(3)
memory usage: 293.4+ KB


In [13]:
df.head()

   session  year country  text                                              document
0       44  1989     MDV  It is indeed a pleasure for me and the member...  indeed pleasure member delegation extend ambas...
1       44  1989     FIN  \nMay I begin by congratulating you. Sir, on ...  may begin congratulate sir election presidency...
2       44  1989     NER  \nMr. President, it is a particular pleasure ...  mr president particular pleasure behalf delega...
3       44  1989     URY  \nDuring the debate at the fortieth session o...  debate fortieth session general assembly four ...
4       44  1989     ZWE  I should like at the outset to express my del...  like outset express delegation satisfaction pl...


In [14]:
df['text'][5]

"\ufeffBefore you began to occupy that exalted seat, Mr. President, you were already discharging a noble international mandate, having presided for years over the Special Committee against Apartheid. Under your direction, the global counterattack against that insult to the race of man has gained signal victories. And now, there you are, Mr. President, a friend of the Filipinos, who visited us in 1987, a friend of our late martyr, Ninoy Aquino, having worked with him at Harvard. We Filipinos rejoice that it is you who will preside over this body of nations for the next 365 days.\nAnd to Mr. Dante Caputo, the former Foreign Minister of Argentina, let me say how much our prediction of success for his presidency last year has been proved correct. His expert hand steered us through the proceedings without conflict, without incident, without delay. We had faith in him as an outstanding human leader, and he justified that faith.\nFilipinos can take added pride in the fact that Dante Caputo, o

In [15]:
df['document'][5]

'begin occupy exalt seat mr president already discharge noble international mandate preside year special committee apartheid direction global counterattack insult race man gain signal victory mr president friend filipinos visit 1987 friend late martyr ninoy aquino work harvard filipinos rejoice preside body nation next 365 day mr dante caputo former foreign minister argentina let say much prediction success presidency last year prove correct expert hand steer proceeding without conflict without incident without delay faith outstanding human leader justify faith filipinos take add pride fact dante caputo successful outgoing president man like root world hispanic culture last year come rostrum bring assembly message poor country say poor come indolent race say countryman cover earth 2 million americas half million middle east quarter million europe half million asia pacific seeker toil life teacher nation physician man builder industry designer module challenge star settle moon man woman

## Exercise 3 - Save the pre-processed data

Save the data as 03.1 un-general-debates.csv in the Data folder, so that the original 02.1 file is not overwritten.

In [17]:
csvfilename = "Data/03.1 un-general-debates.csv"
df.to_csv(csvfilename,index=False)