In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import os

### Open file and parse it with bs4

In [2]:
# the context manager closes the file once it has been read
with open('data_files/reut2-000.sgm', encoding='utf-8', errors='ignore') as f:
    dataFile = f.read()
soup = BeautifulSoup(dataFile, 'lxml')

In [3]:
type(soup)

bs4.BeautifulSoup

In [6]:
f.name

'data_files/reut2-000.sgm'

### Check structure of an article

In [157]:
content = soup.find_all('reuters')

In [158]:
len(content)

1000

This file contains 1000 articles, one per `<reuters>` element.

In [159]:
content[5]

<reuters cgisplit="TRAINING-SET" lewissplit="TRAIN" newid="6" oldid="5549" topics="YES">
<date>26-FEB-1987 15:14:36.41</date>
<topics><d>veg-oil</d><d>linseed</d><d>lin-oil</d><d>soy-oil</d><d>sun-oil</d><d>soybean</d><d>oilseed</d><d>corn</d><d>sunseed</d><d>grain</d><d>sorghum</d><d>wheat</d></topics>
<places><d>argentina</d></places>
<people></people>
<orgs></orgs>
<exchanges></exchanges>
<companies></companies>
<unknown> 
G
f0754reute
r f BC-ARGENTINE-1986/87-GRA   02-26 0066</unknown>
<text>
<title>ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS</title>
<dateline>    BUENOS AIRES, Feb 26 - </dateline>Argentine grain board figures show
crop registrations of grains, oilseeds and their products to
February 11, in thousands of tonnes, showing those for futurE
shipments month, 1986/87 total and 1985/86 total to February
12, 1986, in brackets:
    Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total
2,692.4 (4,161.0).
    Maize Mar 48.0, total 48.0 (nil).
    Sorghum nil (nil)
    Oils

### Look at topic list

In [160]:
topics = np.genfromtxt("data_files/all-topics-strings.lc.txt", 
                      delimiter='\n', dtype=None, encoding=None)
topics

array(['acq', 'alum', 'austdlr', 'austral', 'barley', 'bfr', 'bop', 'can',
       'carcass', 'castor-meal', 'castor-oil', 'castorseed', 'citruspulp',
       'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper',
       'copra-cake', 'corn', 'corn-oil', 'cornglutenfeed', 'cotton',
       'cotton-meal', 'cotton-oil', 'cottonseed', 'cpi', 'cpu', 'crude',
       'cruzado', 'dfl', 'dkr', 'dlr', 'dmk', 'drachma', 'earn', 'escudo',
       'f-cattle', 'ffr', 'fishmeal', 'flaxseed', 'fuel', 'gas', 'gnp',
       'gold', 'grain', 'groundnut', 'groundnut-meal', 'groundnut-oil',
       'heat', 'hk', 'hog', 'housing', 'income', 'instal-debt',
       'interest', 'inventories', 'ipi', 'iron-steel', 'jet', 'jobs',
       'l-cattle', 'lead', 'lei', 'lin-meal', 'lin-oil', 'linseed', 'lit',
       'livestock', 'lumber', 'lupin', 'meal-feed', 'mexpeso', 'money-fx',
       'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr',
       'oat', 'oilseed', 'orange', 'palladium', 'palm-meal', 'palm-oil',

### Find all topics and turn them into text

In [161]:
topics = soup.find_all('topics')
topic_list = list()

In [162]:
for x in topics:
    # turn bs4.tag into text
    words = [i.text for i in x]
    #append text to list
    topic_list.append(words)

### Make a dataframe with the topics

Make a function to pull out the **'earn'** topic

For now I only care about the 'earn' topic, so each article is labelled 'earn' if that topic is present and 'other' if it has topics but not 'earn'. I'm not sure what it means when an article has no topics at all, so I'm labelling those 'blank'.
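A minimal sketch of this three-way rule as a pure function, assuming each article's topics arrive as a list of strings. The name `label_article` and the `'learning'` topic are made up for illustration. The sketch tests list membership, so a topic that only contains the letters 'earn' is not counted as 'earn'.

```python
def label_article(topics):
    """Map one article's list of topic strings to 'earn', 'other', or 'blank'."""
    if not topics:
        # no topics at all
        return "blank"
    if "earn" in topics:
        # exact list membership, not a substring test
        return "earn"
    return "other"


# 'learning' is a hypothetical topic, used only to show it does not match 'earn'
sample_topics = [["cocoa"], [], ["earn", "acq"], ["learning"]]
labels = [label_article(t) for t in sample_topics]
print(labels)
```

Because the function returns a new label instead of overwriting its input, the original topic lists stay available for later inspection.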

In [7]:
def pull_out_earn_topic(topic_list):
    # topic_list holds one list of topic strings per article;
    # each entry is overwritten in place with its label
    for i, topic in enumerate(topic_list):

        # assign correct topic
        if not topic:
            topic_list[i] = 'blank'
        # exact list membership, so a topic that merely contains 'earn' won't match
        elif 'earn' in topic:
            topic_list[i] = 'earn'
        else:
            topic_list[i] = 'other'

    return topic_list

In [164]:
test_list = pull_out_earn_topic(topic_list[0:100])
test_list[0:10]

['other',
 'blank',
 'blank',
 'blank',
 'other',
 'other',
 'blank',
 'blank',
 'earn',
 'other']

Test case works. Make the whole dataframe.

In [165]:
topics_for_df = pull_out_earn_topic(topic_list)
df=pd.DataFrame(topics_for_df, columns=['topics'])

In [166]:
df

Unnamed: 0,topics
0,other
1,blank
2,blank
3,blank
4,other
...,...
995,blank
996,blank
997,earn
998,other


### Make list of article content

In [167]:
all_text = soup.find_all("text")
len(all_text)

1000

In [168]:
list_all_text = list()
for text in all_text:
    
    # getting just the text from the element
    # stripping out the newline indicator
    working_text = text.get_text().replace("\n", " ")
    
    # removing extra spaces
    working_text = ' '.join(working_text.split())
    
    # appending to list
    list_all_text.append(working_text)
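A small sketch of the whitespace clean-up on its own, assuming plain Python strings. `clean_text` and the sample string are illustrative names. Since `str.split()` with no arguments already splits on any run of whitespace, including newlines, the join alone should give the same result as replacing newlines first.

```python
def clean_text(raw):
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    return " ".join(raw.split())


# illustrative sample, loosely shaped like a Reuters dateline plus body
sample = "  BUENOS AIRES, Feb 26 - \nArgentine grain board\n   figures show  "

cleaned = clean_text(sample)
cleaned_with_replace = clean_text(sample.replace("\n", " "))
print(cleaned)
```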

In [169]:
df['text'] = list_all_text

In [170]:
df

Unnamed: 0,topics,text
0,other,"BAHIA COCOA REVIEW SALVADOR, Feb 26 - Showers ..."
1,blank,STANDARD OIL <SRD> TO FORM FINANCIAL UNIT CLEV...
2,blank,TEXAS COMMERCE BANCSHARES <TCB> FILES PLAN HOU...
3,blank,TALKING POINT/BANKAMERICA <BAC> EQUITY OFFER b...
4,other,NATIONAL AVERAGE PRICES FOR FARMER-OWNED RESER...
...,...,...
995,blank,ASHTON-TATE <TATE> TO OFFER COMMON SHARES TORR...
996,blank,KEYCORP <KEY> REGISTERS SUBORDINATED NOTES ALB...
997,earn,<NATIONAL SEA PRODUCTS LTD> 4TH QTR NET HALIFA...
998,other,U.K. MONEY MARKET SHORTAGE FORECAST REVISED DO...


Now that it works for one .sgm file, run the same steps over the entire dataset.

# Join all datafiles

- make an empty df with the two columns
- loop through each file
    - read the file and make it a bs4 object
    - find the topics
        - make an empty list
        - loop through the topics and turn the bs4 tags into text
        - run the function on the list to make it the topics for the df
    - find the text
        - make an empty list
        - run the loop using `.get_text()`
            - clean the text in the loop
    - add the lists to df using `add_to_df` function
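
The plan above can also be sketched so that each file contributes one small frame and everything is concatenated once at the end. This is only an illustration with made-up in-memory data (`parsed_files` and its contents are hypothetical stand-ins for the parsed `.sgm` output), not the loop used in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for what parsing each .sgm file would yield:
# a list of (topic_label, cleaned_text) pairs per file.
parsed_files = {
    "file_b.sgm": [("other", "COCOA REVIEW")],
    "file_a.sgm": [("earn", "ACME 4TH QTR NET"), ("blank", "ACME FILES PLAN")],
}

frames = []
for name in sorted(parsed_files):  # sorted so the row order is reproducible
    labels, texts = zip(*parsed_files[name])
    frames.append(pd.DataFrame({"topic": list(labels), "text": list(texts)}))

# one concatenation at the end instead of growing the frame row by row
df = pd.concat(frames, ignore_index=True)
print(df)
```

Concatenating once avoids copying the growing frame on every added row, which can matter with tens of thousands of articles.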
  


### Define functions

In [8]:
def pull_out_earn_topic(topic_list):
    # topic_list holds one list of topic strings per article;
    # each entry is overwritten in place with its label
    for i, topic in enumerate(topic_list):

        # assign correct topic
        if not topic:
            topic_list[i] = 'blank'
        # exact list membership, so a topic that merely contains 'earn' won't match
        elif 'earn' in topic:
            topic_list[i] = 'earn'
        else:
            topic_list[i] = 'other'

    return topic_list

In [13]:
def make_bs4(file):
    filename = os.path.join("data_files", file)
    # the context manager closes the file once it has been read
    with open(filename, 'r', encoding='utf-8', errors='ignore') as f:
        dataFile = f.read()
    print(filename)

    # make it a bs4 object
    soup = BeautifulSoup(dataFile, 'lxml')
    return soup

In [10]:
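def add_to_df(topics, texts, df):
    # DataFrame.append was removed in pandas 2.0, so build one frame
    # from the new rows and concatenate it onto the existing one
    new_rows = pd.DataFrame({df.columns[0]: topics, df.columns[1]: texts})
    return pd.concat([df, new_rows], ignore_index=True)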

### Make empty dataframe

In [11]:
df = pd.DataFrame(columns=['topic', 'text'])
df

Unnamed: 0,topic,text


### Execute big loop to add all the data to the dataframe

In [15]:

# os.listdir returns names in arbitrary order; sort so row numbers are reproducible
for file in sorted(os.listdir("data_files/")):

    if file.endswith(".sgm"):
        
        # for each sgm file, read it and make it a bs4 object
        soup = make_bs4(file)
        
        # isolate topics
        topic_list = list()
        
        topics = soup.find_all('topics')
        
        for x in topics:
            # turn bs4.tag into text
            words = [i.text for i in x]
            #append text to list
            topic_list.append(words)
        
        topic_list = pull_out_earn_topic(topic_list)
        
        # isolate text
        list_all_text = list()
        
        all_text = soup.find_all("text")
        
        for text in all_text:
            
            # getting just the text from the element
            # stripping out the newline indicator
            working_text = text.get_text().replace("\n", " ")
            
            # removing extra spaces
            working_text = ' '.join(working_text.split())
            
            # appending to list
            list_all_text.append(working_text)
        
        # add the article's topic and the article's text to the df         
        df = add_to_df(topic_list, list_all_text, df)
df

data_files\reut2-000.sgm
1000
1000
data_files\reut2-001.sgm
1000
1000
data_files\reut2-002.sgm
1000
1000
data_files\reut2-003.sgm
1000
1000
data_files\reut2-004.sgm
1000
1000
data_files\reut2-005.sgm
1000
1000
data_files\reut2-006.sgm
1000
1000
data_files\reut2-007.sgm
1000
1000
data_files\reut2-008.sgm
1000
1000
data_files\reut2-009.sgm
1000
1000
data_files\reut2-010.sgm
1000
1000
data_files\reut2-011.sgm
1000
1000
data_files\reut2-012.sgm
1000
1000
data_files\reut2-013.sgm
1000
1000
data_files\reut2-014.sgm
1000
1000
data_files\reut2-015.sgm
1000
1000
data_files\reut2-016.sgm
1000
1000
data_files\reut2-017.sgm
1000
1000
data_files\reut2-018.sgm
1000
1000
data_files\reut2-019.sgm
1000
1000
data_files\reut2-020.sgm
1000
1000


Unnamed: 0,topic,text
0,other,"BAHIA COCOA REVIEW SALVADOR, Feb 26 - Showers ..."
1,blank,STANDARD OIL <SRD> TO FORM FINANCIAL UNIT CLEV...
2,blank,TEXAS COMMERCE BANCSHARES <TCB> FILES PLAN HOU...
3,blank,TALKING POINT/BANKAMERICA <BAC> EQUITY OFFER b...
4,other,NATIONAL AVERAGE PRICES FOR FARMER-OWNED RESER...
...,...,...
22995,blank,******PACIFIC STOCK EXCHANGE SAYS IT WILL CLOS...
22996,blank,"******DOW FALLS 404 POINTS TO 1844, LOWEST LEV..."
22997,blank,"AFG INDUSTRIES <AFG> TO BUY BACK STOCK IRVINE,..."
22998,blank,J.P. MORGAN <JPM> LOWERS LOAN-LOSS PROVISIONS ...


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23000 entries, 0 to 22999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   topic   23000 non-null  object
 1   text    23000 non-null  object
dtypes: object(2)
memory usage: 359.5+ KB


In [175]:
df[0:10]

Unnamed: 0,topic,text
0,other,"BAHIA COCOA REVIEW SALVADOR, Feb 26 - Showers ..."
1,blank,STANDARD OIL <SRD> TO FORM FINANCIAL UNIT CLEV...
2,blank,TEXAS COMMERCE BANCSHARES <TCB> FILES PLAN HOU...
3,blank,TALKING POINT/BANKAMERICA <BAC> EQUITY OFFER b...
4,other,NATIONAL AVERAGE PRICES FOR FARMER-OWNED RESER...
5,other,ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS ...
6,blank,"RED LION INNS FILES PLANS OFFERING PORTLAND, O..."
7,blank,"USX <X> DEBT DOWGRADED BY MOODY'S NEW YORK, Fe..."
8,earn,CHAMPION PRODUCTS <CH> APPROVES STOCK SPLIT RO...
9,other,COMPUTER TERMINAL SYSTEMS <CPML> COMPLETES SAL...


In [178]:
df[990:1001]

Unnamed: 0,topic,text
990,other,U.S. ASKS JAPAN END AGRICULTURE IMPORT CONTROL...
991,blank,U.S. FINANCIAL ANALYSTS - March 3 Wilcox/Gibbs...
992,blank,U.S. DIVIDEND MEETINGS - MARCH 3 Mickelberry C...
993,blank,U.S. SHAREHOLDER MEETINGS - MARCH 3 None Repor...
994,earn,FIRSTCORP <FCR> SEES GAIN ON CONDEMNATION RALE...
995,blank,ASHTON-TATE <TATE> TO OFFER COMMON SHARES TORR...
996,blank,KEYCORP <KEY> REGISTERS SUBORDINATED NOTES ALB...
997,earn,<NATIONAL SEA PRODUCTS LTD> 4TH QTR NET HALIFA...
998,other,U.K. MONEY MARKET SHORTAGE FORECAST REVISED DO...
999,other,NATIONAL AMUSEMENTS AGAIN UPS VIACOM <VIA> BID...


In [18]:
df.iloc[14792]

topic    blank
text          
Name: 14792, dtype: object

### Check for null values and empty strings

In [19]:
np.where(pd.isnull(df))

(array([], dtype=int64), array([], dtype=int64))

In [21]:
# element-wise comparison; avoids applymap, which pandas 2.1 deprecates
np.where(df == '')

(array([14792], dtype=int64), array([1], dtype=int64))

In [22]:
df.iloc[14792]

topic    blank
text          
Name: 14792, dtype: object

Row 14792 has an empty string in the `text` column. Remove it and double-check.

In [23]:
df = df.drop([14792])
np.where(df == '')

(array([], dtype=int64), array([], dtype=int64))

Looks OK: no empty strings are left.
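
As an alternative to dropping a hard-coded row number, a boolean mask can select the empty rows. A sketch on a toy frame (the data here is invented, and the `.str.strip()` is an extra precaution for whitespace-only strings):

```python
import pandas as pd

# invented toy data with one empty and one whitespace-only text
toy = pd.DataFrame({
    "topic": ["earn", "blank", "other"],
    "text": ["NET PROFIT UP", "", "   "],
})

# True where the text is empty once surrounding whitespace is removed
is_empty = toy["text"].str.strip() == ""

# keep only rows with real text and renumber them from 0
cleaned = toy[~is_empty].reset_index(drop=True)
print(cleaned)
```

The mask keeps working if the files are read in a different order and the empty article ends up at another row number.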

In [24]:
df.to_csv('data_files/topics_and_text.csv', index=False)
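
One reason to remove the empty string before saving: with default settings, `pd.read_csv` reads an empty field back as NaN rather than as `''`. A quick round-trip sketch on invented data, using an in-memory buffer instead of a file:

```python
import io

import pandas as pd

# invented two-row frame; the second text is an empty string
original = pd.DataFrame({
    "topic": ["earn", "blank"],
    "text": ["some article text", ""],
})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

# read it back with default settings
roundtrip = pd.read_csv(buffer)
print(roundtrip["text"].isna())
```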