# DATA20001 Deep Learning - Group Project
## Text project

**Due Thursday, December 13, before 23:59.**

The task is to learn to assign the correct labels to news articles.  The corpus contains ~850K articles from Reuters.  The test set is about 10% of the articles. The data is unextracted in XML files.

We're only giving you the code for downloading the data, and how to save the final model. The rest you'll have to do yourselves.

Some comments and hints particular to the project:

- One document may belong to many classes in this problem, i.e., it's a multi-label classification problem. In fact there are documents that don't belong to any class, and you should also be able to handle these correctly. Pay careful attention to how you design the outputs of the network (e.g., what activation to use) and what loss function should be used.
- You may use word-embeddings to get better results. For example, you were already using a smaller version of the GloVE  embeddings in exercise 4. Do note that these embeddings take a lot of memory. 
- In the exercises we used e.g., `torchvision.datasets.MNIST` to handle the loading of the data in suitable batches. Here, you need to handle the dataloading yourself.  The easiest way is probably to create a custom `Dataset`. [See for example here for a tutorial](https://github.com/utkuozbulak/pytorch-custom-dataset-examples).

In [2]:
import os
import torch
from torchvision.datasets.utils import download_url
import zipfile

from bs4 import BeautifulSoup
import pandas as pd

In [3]:
def read_one_zipfile(filepath):  
    '''
    read and parse contents of single zipfile (with about 100+ xml-files in it)
    fields: headline, text, classes
    return them as list
    '''
    this_documents=[]
    
    zf = zipfile.ZipFile(filepath, 'r')    

    # for all xml-files within a zip
    for name in zf.namelist():
        #if name.endswith('xml'): continue
    
        infile = zf.open(name)    
        contents = infile.read()
        soup = BeautifulSoup(contents,'lxml')
    
        headline = soup.find('headline')
        text = soup.find('text')       #print(headline.get_text())
    
    # extract all topic-classes 
    # only take "codes" by topic, not region or industry: class == 'bip:topics:1.0'
        classcodes = []
        for element in soup.find_all('codes', class_='bip:topics:1.0'):
            for code in element.find_all('code'):
                clas = code['code']
                #print(clas)
                classcodes.append(clas)

        this_documents.append({'headline': headline.get_text(), 'text': text.get_text(), 'codes': classcodes})
    return this_documents

## Download the data

In [12]:
train_path = 'train/'

dl_file='reuters.zip'
dl_url='https://www.cs.helsinki.fi/u/jgpyykko/'
zip_path = os.path.join(train_path, dl_file)
if not os.path.isfile(zip_path):
    download_url(dl_url + dl_file, root=train_path, filename=dl_file, md5=None)

with zipfile.ZipFile(zip_path) as zip_f:
    zip_f.extractall(train_path)
    #os.unlink(zip_path)

The above command downloads and extracts the data files into the `train` subdirectory.

The files can be found in `train/`, and are named as `19970405.zip`, etc. You will have to manage the content of these zips to get the data. There is a readme which has links to further descriptions on the data.

The class labels, or topics, can be found in the readme file called `train/codes.zip`.  The zip contains a file called "topic_codes.txt".  This file contains the special codes for the topics (about 130 of them), and the explanation - what each code means.  

The XML document files contain the article's headline, the main body text, and the list of topic labels assigned to each article.  You will have to extract the topics of each article from the XML.  For example: 
&lt;code code="C18"&gt; refers to the topic "OWNERSHIP CHANGES" (like a corporate buyout).

You should pre-process the XML to extract the words from the article: the &lt;headline&gt; element and the &lt;text&gt;.  You should not need any other parts of the article.

### Read the CLASS codes

In [4]:
# read one of the Class-code files
import pandas as pd
import zipfile

zf = zipfile.ZipFile('train/REUTERS_CORPUS_2/codes.zip', 'r') 
colnames=['Code','Description']
df = pd.read_csv(zf.open('topic_codes.txt'), skiprows=2, error_bad_lines=True, 
                 header=None, names=colnames, sep='\t')

# df # (the file has 2 first rows as CODE/DESCRIPTION, the extra line is still at row 0)

In [49]:
# Save to csv
df.to_csv('input/classcodes.csv', index=None)

In [5]:
# read csv
classcodes= pd.read_csv('input/classcodes.csv')

### Parse XML

In [53]:
# example http://www2.hawaii.edu/~takebaya/cent110/xml_parse/xml_parse.html

# testing
# unzip a single xml first
"""
from bs4 import BeautifulSoup
infile = open("train/477886newsML.xml","r")
contents = infile.read()
soup = BeautifulSoup(contents,'lxml') # use parser lxml as parser xml returns empty list

headline = soup.find('headline')
print(headline.get_text())

text = soup.find('text')
print(text.get_text()[0:1000])
"""

'\nfrom bs4 import BeautifulSoup\ninfile = open("train/477886newsML.xml","r")\ncontents = infile.read()\nsoup = BeautifulSoup(contents,\'lxml\') # use parser lxml as parser xml returns empty list\n\nheadline = soup.find(\'headline\')\nprint(headline.get_text())\n\ntext = soup.find(\'text\')\nprint(text.get_text()[0:1000])\n'

In [54]:
# codes = soup.find_all('code')
# for code in codes:
#     print(code)

Refer to topic_codes.txt inside codes.zip, Which defines that 

G15 EUROPEAN COMMUNITY 

GCAT	GOVERNMENT/SOCIAL

EEC is found in region_codes.txt: EEC	EUROPEAN UNION

In [55]:
# only take codes by topic, not region or industry
# codes = soup.find_all('codes', class_='bip:topics:1.0')
# for code in codes:
#     print(code)

In [56]:
# extract all topic-classes 
# only take "codes" by topic, not region or industry: class == 'bip:topics:1.0'
# example
# <codes class="bip:topics:1.0">
# <code code="G15"> ... </code>
# <code code="GCAT"> ... </code>
# </codes>
for element in soup.find_all('codes', class_='bip:topics:1.0'):
    for code in element.find_all('code'):
        clas = code['code']
        print(clas)

G15
GCAT


### Read files

In [6]:
# get list of *.zip files in dir, such that contain xml-files (name starts with 1).
dirpath = 'train/REUTERS_CORPUS_2/'
files = [f for f in os.listdir(dirpath) if os.path.isfile(os.path.join(dirpath, f))]
# cut out codes.zip, readme.txt etc. All zips containing .xml start with 1
filenames_zip = [f for f in files if '1' in f]
print(len(filenames_zip))
print(filenames_zip[0:4])

127
['19970722.zip', '19970508.zip', '19970421.zip', '19970612.zip']


In [7]:
# get xml-filenames inside a single zip-file
mypath = 'train/REUTERS_CORPUS_2/'
file = '19970722.zip'
zf = zipfile.ZipFile(mypath+file, 'r')

# get names of all xml-files within a zip
# for name in zf.namelist():    
    # print(name)    
    #f = zf.open(name)
    #print(f.read()) 

In [15]:
# Read single zipfiles contents
mypath = 'train/REUTERS_CORPUS_2/'
documents= []    

file = '19970722.zip'

documents.extend( read_one_zipfile(mypath+file) )
len(documents)

3426

In [16]:
data_small = pd.DataFrame(documents)
# data_small[0:5]

In [62]:
# optional - faster test with cutting list to 2 zipfiles
#filenames_zip = filenames_zip[0:2]
#filenames_zip

In [63]:
# CAN TAKE ABOUT 30-60 MIN!
# Read all zipfiles
mypath = 'train/REUTERS_CORPUS_2/'
documents= []

ind_8 = 0
for i, file in enumerate(filenames_zip):
    if i == 8:
        ind_8 = len(documents)
    documents.extend( read_one_zipfile(mypath+file) )
len(documents)    

299773

In [66]:
data = pd.DataFrame(documents)
# data[0:5]

### Turn classnames into integers

#### string-to-int and int-to-string dictionaries for classcodes, turn classes into integers

In [19]:
# add index field to DataFrame
classcodes = classcodes.reset_index()

In [20]:
# Create dictionary index/int to classcode and classcode to int
itocode = dict(zip(classcodes.index, classcodes.Code))
codetoi = dict(zip(classcodes.Code, classcodes.index))

In [21]:
list(itocode.items())[0:7]

[(0, '1POL'),
 (1, '2ECO'),
 (2, '3SPO'),
 (3, '4GEN'),
 (4, '6INS'),
 (5, '7RSK'),
 (6, '8YDB')]

In [22]:
print(itocode[3])
print(codetoi['4GEN'])

4GEN
3


In [23]:
# Turn one list of codes into ints
def listToInt(mylist):
    return [codetoi[item] for item in mylist]

#test
listToInt(['C18', 'C181', 'CCAT'])

[25, 26, 44]

In [24]:
# for each list of codes, turn it to ints
reuters = data_small

reuters['classes'] = [listToInt(codelist) for codelist in reuters.codes]
data_small = reuters
print(reuters[0:3])

               codes                                           headline  \
0  [C18, C181, CCAT]    Eureko is latest suitor for French insurer GAN.   
1        [G15, GCAT]  Reuter EC Report Long-Term Diary for July 28 -...   
2        [G15, GCAT]  Official Journal contents - OJ L 190 of July 1...   

                                                text       classes  
0  \nEureko, an alliance of six European financia...  [25, 26, 44]  
1  \n****\nHIGHLIGHTS\n****\nLUXEMBOURG - Luxembo...      [80, 90]  
2  \n*\n(Note - contents are displayed in reverse...      [80, 90]  


In [None]:
reuters2 = pd.DataFrame(data.loc[:ind_8])
reuters2['classes'] = [listToInt(codelist) for codelist in reuters2.codes]
data_8 = reuters2

### Pad with -1 for given length


 for pytorch nn.MultiLabelMarginLoss(), which expects labels in start, then -1 padding

In [26]:
# Pad list with -1 to given length
def padList(mylist, length=10):

    mylist = (mylist + length*['-1'])[:length]
    return mylist

#test
padList([2,3],length=4)

[2, 3, '-1', '-1']

In [27]:
# for each list of codes, pad it
reuters = data_small

reuters['classes_pad'] = padList(reuters['classes'], length=10)
data_small = reuters
reuters[0:3]

Unnamed: 0,codes,headline,text,classes,classes_pad
0,"[C18, C181, CCAT]",Eureko is latest suitor for French insurer GAN.,"\nEureko, an alliance of six European financia...","[25, 26, 44]","[25, 26, 44, -1, -1, -1, -1, -1, -1, -1, -1, -..."
1,"[G15, GCAT]",Reuter EC Report Long-Term Diary for July 28 -...,\n****\nHIGHLIGHTS\n****\nLUXEMBOURG - Luxembo...,"[80, 90]","[80, 90, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]"
2,"[G15, GCAT]",Official Journal contents - OJ L 190 of July 1...,\n*\n(Note - contents are displayed in reverse...,"[80, 90]","[80, 90, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]"


### Save the data

In [None]:
# Save small example-table to pickle
data_small.to_pickle('input/reuters_small.pkl')

In [None]:
data_small.to_json('input/reuters_small.json', orient='records', lines=True)

In [None]:
# load small
reuters = pd.read_pickle('input/reuters_small.pkl')

In [85]:
# Save 8 zip example-table to pickle
data_small.to_pickle('input/reuters_small8.pkl')

In [None]:
# Save large table to pickle
data.to_pickle('input/reuters_all.pkl')

In [None]:
# load large
reuters = pd.read_pickle('input/reuters_all.pkl')

In [None]:
len(reuters)

### Read new unseen data: XML / ZIP, preprocess it and save

In [9]:
# get list of *.zip files in dir, such that contain xml-files (name starts with 1).
dirpath = 'test_data/'
files = [f for f in os.listdir(dirpath) if os.path.isfile(os.path.join(dirpath, f))]
# cut out codes.zip, readme.txt etc. All zips containing .xml start with 1
# NOTICE - in new_data, all zips contain 19, also one extra html contain only 1
filenames_zip = [f for f in files if '19' in f]
print(len(filenames_zip))
print(filenames_zip[0:4])

14
['19970410-test.zip', '19970619-test.zip', '19970719-test.zip', '19970510-test.zip']


In [10]:
# Read all zipfiles
mypath = 'test_data/'
documents= []

ind_8 = 0
for i, file in enumerate(filenames_zip):
    if i == 8:
        ind_8 = len(documents)
    documents.extend( read_one_zipfile(mypath+file) )
len(documents)    

33142

In [11]:
new_data = pd.DataFrame(documents)
new_data[0:2]

Unnamed: 0,codes,headline,text
0,[],PRESS DIGEST - SOUTH AFRICA - APRIL 10.,\nThese are the leading stories in the South A...
1,[],OFFICIAL JOURNAL CONTENTS - OJ C 110 OF APRIL ...,\n*\n(Note - contents are displayed in reverse...


In [12]:
# Save to pickle
new_data.to_pickle('input/data_new.pkl')