# From METS-ALTO to enriched structured datasets

First, create a directory called `downloaded_newspapers` (or anything else you like, provided you change this to your chosen name in the rest of the notebook):

In [1]:
!mkdir downloaded_newspapers

mkdir: downloaded_newspapers: File exists


Now head to the repository containing the [British Library newspaper collection](https://bl.iro.bl.uk/collections/9a6a4cdd-2bfe-47bb-8c14-c0a5d100501f?locale=en). The top level of the repository is organized by title, and each title is organized by year: the OCR text, in XML, is released as one separate .zip folder per year. For the purpose of this exercise, download two .zip folders (i.e. two years) from one title, e.g. the [_Warrington Examiner_](https://bl.iro.bl.uk/concern/datasets/2e4efac2-e341-467b-86e9-4269ec07c474?locale=en).

Now create a subfolder called `warrington_examiner` (or, again, any other name that works for you) under the folder we created above (`downloaded_newspapers`).

Unzip the downloaded folders and place them in the subfolder `warrington_examiner`.
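
If you prefer to unzip from Python rather than by hand, the following is a minimal sketch using the standard library. The function name `unzip_downloads` and both paths in the example call are hypothetical, and the sketch assumes each archive already contains a single top-level folder; if yours do not, extract each archive into its own subfolder instead.

```python
from pathlib import Path
import zipfile


def unzip_downloads(zip_dir, title_dir):
    """Extract every .zip archive found in zip_dir into title_dir.

    Both arguments are hypothetical paths chosen for this sketch.
    Returns the names of the archives that were extracted.
    """
    title_dir = Path(title_dir)
    title_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(zip_dir).glob('*.zip')):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(title_dir)
        extracted.append(archive.name)
    return extracted


# Example call (adjust both paths to your own setup):
# unzip_downloads('Downloads', 'downloaded_newspapers/warrington_examiner')
```

Whichever way you unzip, check the result against the folder tree shown next before moving on.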

Your folder structure should now be:

```
downloaded_newspapers
└─ warrington_examiner
   ├─ downloaded folder 1
   └─ downloaded folder 2

```

You can download as many titles as you want and as many years for those titles as you want, but for the following code to work, this folder structure must be followed. If you downloaded another title, say, [_Birkenhead News_](https://bl.iro.bl.uk/concern/datasets/30830be0-b512-4609-8e3d-be5b7f2b1498?locale=en), you'd make a folder under `downloaded_newspapers` called `birkenhead_news` (or similar), and place the unzipped folders under it. Your new structure would look like this:

```
downloaded_newspapers
├─ warrington_examiner
│  ├─ downloaded folder 1
│  └─ downloaded folder 2
└─ birkenhead_news
   ├─ downloaded folder 1
   └─ downloaded folder 2
```

And so on until you've downloaded and organized all the titles like this. 
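
Since the rest of the notebook depends on this layout, a quick check can save a failed run. The sketch below lists each title folder with its subfolders and flags two common slips: an empty title folder, and a .zip file that was moved but never unzipped. The helper names `list_layout` and `find_problems` are made up for this example.

```python
from pathlib import Path


def list_layout(root):
    """Map each title folder under root to the names of its subfolders."""
    layout = {}
    for title_dir in sorted(Path(root).iterdir()):
        if title_dir.is_dir():
            layout[title_dir.name] = sorted(
                p.name for p in title_dir.iterdir() if p.is_dir()
            )
    return layout


def find_problems(root):
    """Return warnings about folders that do not match the expected layout."""
    problems = []
    for title, subfolders in list_layout(root).items():
        if not subfolders:
            problems.append(f'{title}: no unzipped folders found')
    for stray in sorted(Path(root).glob('*/*.zip')):
        problems.append(f'{stray}: still zipped')
    return problems


# Example call, assuming the directory name used in this notebook:
# find_problems('downloaded_newspapers')
```

An empty list from `find_problems` means neither slip was detected; it does not guarantee that the contents of each folder are valid.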

## Run alto2txt

We now run `alto2txt` to convert the METS-ALTO XML into plain-text files. Each TXT file holds _one_ 'item' from the downloaded folders (often roughly a news article, though this really depends on the quality of the OCR page segmentation) and is paired with an XML file containing some metadata about that particular item.

In [3]:
from alto2txt import multiprocess_xml_to_text

Define your output directory (it will be created if it doesn't exist yet):

In [8]:
output_dir = 'newspapers_plain_text'

In [9]:
multiprocess_xml_to_text.publications_to_text(
    'downloaded_newspapers',
    output_dir,
    'log.txt')

2024-05-15 08:33:51,623:alto2txt.xml_to_text:85185:INFO:Processing publication: wolverhamptontimes
2024-05-15 08:33:51,624:alto2txt.xml_to_text:85185:INFO:Processing issue: BLNewspapers_WolverhamptonTimesandMidlandCounties_0002603_1876/0513
2024-05-15 08:33:52,975:alto2txt.xml_to_text:85185:INFO:downloaded_newspapers/wolverhamptontimes/BLNewspapers_WolverhamptonTimesandMidlandCounties_0002603_1876/0513/0002603_18760513_mets.xml gave XSLT output
2024-05-15 08:33:53,136:alto2txt.xml_to_text:85185:INFO:downloaded_newspapers/wolverhamptontimes/BLNewspapers_WolverhamptonTimesandMidlandCounties_0002603_1876/0513 {'num_files': 9, 'bad_xml': 0, 'converted_ok': 1, 'converted_bad': 0, 'skipped_alto': 8, 'skipped_bl_page': 0, 'skipped_mets_unknown': 0, 'skipped_root_unknown': 0, 'non_xml': 0}
2024-05-15 08:33:53,143:alto2txt.xml_to_text:85185:INFO:Processing issue: BLNewspapers_WolverhamptonTimesandMidlandCounties_0002603_1876/0115
2024-05-15 08:33:54,652:alto2txt.xml_to_text:85185:INFO:downloade
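
Before structuring the data, it can help to look inside one of the metadata XML files that `alto2txt` wrote next to each TXT file. The sketch below is an illustration only: it uses the standard library's `xml.etree.ElementTree` (not the BeautifulSoup parser used in the next section), the helper name `summarise_metadata` is made up, and it assumes the file has no XML namespaces, in which case tag names come back unprefixed.

```python
import xml.etree.ElementTree as ET


def summarise_metadata(xml_path):
    """Return {tag: text} for the leaf elements of one metadata XML file.

    When a tag occurs more than once, only its first occurrence is kept.
    """
    summary = {}
    for element in ET.parse(xml_path).iter():
        text = (element.text or '').strip()
        if len(element) == 0 and text:  # leaf element with text content
            summary.setdefault(element.tag, text)
    return summary


# Example call, assuming the title/year/issue nesting of the output directory:
# from glob import glob
# first_xml = sorted(glob('newspapers_plain_text/*/*/*/*.xml'))[0]
# summarise_metadata(first_xml)
```

Because repeated tags are collapsed to their first occurrence, treat the result as a quick overview rather than a full parse.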

## Structure the data

In [10]:
from bs4 import BeautifulSoup
from glob import glob
import re
import csv

In [13]:
alltitles = glob(f'{output_dir}/*')
alltitles

['newspapers_plain_text/wolverhamptontimes']

[TODO Rename these variables to reflect Metadata DB?]

In [14]:
columns = ['publication_code','issue_id','item_id','newspaper_title','data_provider','date','year','month','day','location','word_count','ocrquality','text']

In [25]:
filenotfound = ''  # metadata files whose plain-text file is missing
leftbehind = ''  # metadata files lacking one of the expected elements

# newline='' is what the csv module documentation recommends for output files
with open('news_with_meta.csv', 'w', newline='') as csvout:
    csv_output = csv.writer(csvout)
    csv_output.writerow(columns)
    for title in alltitles:
        allyears = glob('{}/*'.format(title))
        for year_dir in allyears:
            print(f'reading {year_dir}')
            allissues = glob('{}/*'.format(year_dir))
            for issue_dir in allissues:
                metadataxmls = glob('{}/*.xml'.format(issue_dir))
                for metadataxml in metadataxmls:
                    with open(metadataxml, 'r') as meta:
                        soup = BeautifulSoup(meta, 'lxml')
                        try:
                            # Item-level metadata
                            ocrquality = soup.find('ocr_quality_mean').text
                            # Optional: filter on ocrquality here to keep only well-recognised items
                            txtfilename = soup.find('plain_text_file').text
                            word_count = soup.find('word_count').text
                            date = soup.find('date').text
                            year = date.split('-')[0]
                            month = date.split('-')[1]
                            day = date.split('-')[2]
                            # Publication-level metadata
                            pubmeta = soup.find('publication')
                            data_provider = pubmeta.find('source').text
                            newspaper_title = pubmeta.find('title', recursive=False).text.strip()
                            location = soup.find('location').text
                            # The filename encodes publication code, issue date and item number
                            publication_code = txtfilename.split('_')[0]
                            issue_id = txtfilename.split('_')[1].split(year)[-1]
                            item_id = txtfilename.split('_')[-1].split('.txt')[0]
                            # The plain-text file sits next to its metadata file
                            txtpath = '/'.join(metadataxml.split('/')[:-1]) + '/' + txtfilename
                            try:
                                with open(txtpath) as txtfile:
                                    contents = txtfile.read()
                                    # Rejoin words hyphenated across line breaks, then remove remaining line breaks
                                    text = re.sub('-\n', '', contents)
                                    text = re.sub('\n', ' ', text)
                                    csv_output.writerow([publication_code, issue_id, item_id, newspaper_title, data_provider, date, year, month, day, location, word_count, ocrquality, text])
                            except FileNotFoundError:
                                filenotfound += metadataxml + '\n'
                        except AttributeError:
                            leftbehind += metadataxml + '\n'

# Record which metadata files were skipped, and why
with open('leftbehind.txt', 'w') as newleft:
    newleft.write(leftbehind)

with open('filenotfound.txt', 'w') as notfound:
    notfound.write(filenotfound)


reading newspapers_plain_text/wolverhamptontimes
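
Two steps in the loop above are easier to follow in isolation: splitting the TXT filename into identifiers, and cleaning up line breaks. The sketch below wraps the same expressions in two small functions. The function names are made up, the sample filename is an assumption pieced together from the METS filename in the log and the item numbers in the output, and the sample text is invented for illustration.

```python
import re


def parse_item_filename(txtfilename, year):
    """Split an item filename into (publication_code, issue_id, item_id)."""
    parts = txtfilename.split('_')
    publication_code = parts[0]
    issue_id = parts[1].split(year)[-1]   # what follows the year in the date part
    item_id = parts[-1].split('.txt')[0]
    return publication_code, issue_id, item_id


def join_lines(contents):
    """Rejoin words hyphenated across line breaks, then flatten to one line."""
    text = re.sub('-\n', '', contents)
    return re.sub('\n', ' ', text)


# Assumed filename pattern: <publication code>_<YYYYMMDD>_<item number>.txt
print(parse_item_filename('0002603_18760513_art0046.txt', '1876'))
# → ('0002603', '0513', 'art0046')

# Invented sample text with a word split across two lines
print(join_lines('REDUCTION IN WAR-\nWICKSHIRE\nrates'))
# → REDUCTION IN WARWICKSHIRE rates
```

Note that the first substitution removes every hyphen that ends a line, so a genuinely hyphenated compound broken at that point loses its hyphen too.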


## Check the CSV out

In [26]:
import pandas as pd

In [28]:
df_news = pd.read_csv('news_with_meta.csv')
df_news.head()

| Unnamed: 0 | NLP | issue | art_num | title | collection | full_date | year | month | day | location | word_count | ocrquality | text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2603 | 513 | art0046 | The Wolverhampton and Midland Counties Adverti... | British Library Living with Machines Project | 1876-05-13 | 1876 | 5 | 13 | Wolverhampton, West Midlands, England | 154 | 0.9664 | GREAT BRIDGE. TRYING TO GOUGE A POLICEMAN'S E... |
| 1 | 2603 | 513 | art0013 | The Wolverhampton and Midland Counties Adverti... | British Library Living with Machines Project | 1876-05-13 | 1876 | 5 | 13 | Wolverhampton, West Midlands, England | 287 | 0.9563 | THE PROPOSED REDUCTION IN WAR- . WICKSHIRE WA(... |
| 2 | 2603 | 513 | art0041 | The Wolverhampton and Midland Counties Adverti... | British Library Living with Machines Project | 1876-05-13 | 1876 | 5 | 13 | Wolverhampton, West Midlands, England | 869 | 0.9794 | GOSSIP OF THE CLUBS. FROM OUR LONDON CORRESPON... |
| 3 | 2603 | 513 | art0014 | The Wolverhampton and Midland Counties Adverti... | British Library Living with Machines Project | 1876-05-13 | 1876 | 5 | 13 | Wolverhampton, West Midlands, England | 925 | 0.958 | WEDNESBITRY SCHOOL BOARD. On Monday last, the... |
| 4 | 2603 | 513 | art0054 | The Wolverhampton and Midland Counties Adverti... | British Library Living with Machines Project | 1876-05-13 | 1876 | 5 | 13 | Wolverhampton, West Midlands, England | 875 | 0.948 | CHESS MATCH. On Tuesday evening last, a match... |
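
In the preview above, the publication code appears as 2603 and the issue as 513, whereas the METS filename in the log reads `0002603_18760513`. By default `pd.read_csv` infers numeric types, which drops leading zeros. If you need the identifiers exactly as written, one option is to read those columns as strings. The sketch below uses a tiny in-memory stand-in for the CSV and assumes the column names from the `columns` list defined earlier.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for news_with_meta.csv (illustrative row only).
sample_csv = io.StringIO(
    'publication_code,issue_id,item_id\n'
    '0002603,0513,art0046\n'
)

# Read identifier columns as text so that leading zeros survive.
id_columns = {'publication_code': str, 'issue_id': str, 'item_id': str}
df_sample = pd.read_csv(sample_csv, dtype=id_columns)
print(df_sample.loc[0, 'publication_code'])
# → 0002603
```

For the real file, the equivalent call would be `pd.read_csv('news_with_meta.csv', dtype=id_columns)`.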


## Enrich it with variables from the Press Directories

[here link to latest version.]

In [None]:
press_dir_df = pd.read_csv('')
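
Once the Press Directories file is loaded, enrichment will most likely come down to a join between the two dataframes. The sketch below shows the general shape of such a join with `DataFrame.merge`. Everything in it is a hypothetical stand-in: the join key, the column `some_directory_variable` and the values do not come from the real Press Directories data, so adapt them once you have seen the actual columns.

```python
import pandas as pd

# Hypothetical stand-ins: neither the key nor the extra column is taken
# from the real Press Directories file.
news = pd.DataFrame({
    'publication_code': ['0002603', '0002603'],
    'item_id': ['art0046', 'art0013'],
})
directory = pd.DataFrame({
    'publication_code': ['0002603'],
    'some_directory_variable': ['value'],
})

enriched = news.merge(
    directory,
    on='publication_code',
    how='left',        # keep every news item, matched or not
    validate='m:1',    # raise an error if the directory repeats a key
    indicator=True,    # add a _merge column showing the match status
)
print(enriched['_merge'].value_counts())
```

A left join keeps items whose publication has no directory entry, and the `_merge` column makes those unmatched rows easy to count before you decide how to handle them.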