<b>Studying the Reuters Corpus</b>

This notebook requires the Reuters corpus. Download it once by running the following in a Python session:
* `import nltk`
* `nltk.download('reuters')`
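
The download only needs to happen once, so a small guard can check whether the corpus is already present. The sketch below is one possible way to do this; the helper name `ensure_corpus` is hypothetical (not part of NLTK), and it assumes the corpus lives under the resource path `corpora/<name>`, as the NLTK lookup message suggests.

```python
def ensure_corpus(name="reuters"):
    """Download an NLTK corpus only if it cannot be found locally.

    `ensure_corpus` is a hypothetical helper, not an NLTK function.
    """
    # Imported here so the helper only needs nltk when it is actually called
    import nltk

    try:
        # nltk.data.find raises LookupError when the resource is missing
        nltk.data.find(f"corpora/{name}")
    except LookupError:
        nltk.download(name)
```

Calling `ensure_corpus()` at the top of the notebook, before `from nltk.corpus import reuters` is used, should avoid the `LookupError` on a fresh environment.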

In [1]:
import nltk
from nltk.corpus import reuters

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
import numpy as np
import pandas as pd

How many categories are there in the Reuters Corpus?

In [3]:
print('There are ' + str(len(reuters.categories())) + ' categories in the Reuters Corpus')

LookupError: 
**********************************************************************
  Resource reuters not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('reuters')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/reuters

  Searched in:
    - '/Users/jan/nltk_data'
    - '/Users/jan/.pyenv/versions/3.9.5/envs/mda_2022/nltk_data'
    - '/Users/jan/.pyenv/versions/3.9.5/envs/mda_2022/share/nltk_data'
    - '/Users/jan/.pyenv/versions/3.9.5/envs/mda_2022/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


Creating a list of (category, document) pairs across the different categories

In [None]:
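# one (category, fileid) tuple per document in each category;
# `fileid` avoids shadowing the built-in `id`
category_doc = [(category, fileid)
                for category in reuters.categories()
                for fileid in reuters.fileids(categories=category)]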

What kind of variable is category_doc?

In [None]:
print(type(category_doc))
print(category_doc[0])

Reading the first 5 sentences of the first article

In [None]:
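from nltk.tokenize import sent_tokenize
# split the raw text of the article into sentences
sent = sent_tokenize(reuters.raw('test/14843'))
# slicing avoids an IndexError if the article has fewer than 5 sentences
for i, s in enumerate(sent[:5]):
    print(i, s)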

Printing the first 10 categories of the corpus

In [None]:
reuters.categories()[:10]


The list category_doc can be turned into a mapping from each category to its documents using the class <b>ConditionalFreqDist</b>.<br>
The resulting object behaves like a dictionary: each <b>key</b> is the name of a category, and each <b>value</b> is a frequency distribution (FreqDist) over the file ids of the articles belonging to this category.
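
To make this structure concrete, here is a standard-library sketch of the same idea on toy data. It does not use NLTK; `defaultdict(Counter)` only stands in for ConditionalFreqDist, and the pairs below are invented for illustration rather than taken from the corpus.

```python
from collections import Counter, defaultdict

# Invented (category, file id) pairs standing in for category_doc
pairs = [
    ("acq", "training/10"),
    ("acq", "training/12"),
    ("acq", "test/15"),
    ("grain", "training/12"),
    ("crude", "test/15"),
    ("crude", "test/20"),
]

# One Counter per category, keyed by file id
cfd_like = defaultdict(Counter)
for category, fileid in pairs:
    cfd_like[category][fileid] += 1

# The number of distinct file ids is the number of documents in the category
docs_per_category = {cat: len(counts) for cat, counts in cfd_like.items()}
print(docs_per_category)
```

Note that one document can appear under several categories, which is why the count is taken per category rather than over the whole list.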

In [None]:
cfd = nltk.ConditionalFreqDist(category_doc)

# number of distinct documents in each category
cat_freq = {cat: len(cfd[cat]) for cat in cfd.keys()}

# put the result in a dataframe
df = pd.Series(cat_freq, name='freq').to_frame()
df.plot(title='Reuters Categories', kind='bar', figsize=(14,5));


Which category contains the most documents?

In [None]:
df_top = df.sort_values(by='freq',ascending=False).head()

In [None]:
df_top

Mergers and acquisitions

The mergers \& acquisitions category (acq) is one of the largest. The Reuters corpus can be used to understand when most M\&A news hits the newswires, by counting how often each day of the week is mentioned in the articles of this category:

When does one observe most of the merger & acquisition news?

In [None]:
# one counter per day of the week, initialised at zero
Days = {day: 0 for day in ['monday', 'tuesday', 'wednesday', 'thursday',
                           'friday', 'saturday', 'sunday']}

for doc in reuters.fileids(categories='acq'):
    # convert every word of the article to lower case
    # and store the result in an array
    words = np.array([w.lower() for w in reuters.words(doc)])
    for d in Days:
        # number of words in the article that match the day name
        Days[d] += int((words == d).sum())
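
The counting logic of the loop above can also be tried in isolation. The following sketch uses `collections.Counter` on a made-up token list instead of the corpus; the function name `count_weekdays` and the example tokens are assumptions for illustration only.

```python
from collections import Counter

WEEKDAYS = ("monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday")

def count_weekdays(tokens):
    """Count case-insensitive mentions of each weekday in a list of tokens."""
    # Start every day at zero so that unmentioned days still appear
    counts = Counter({day: 0 for day in WEEKDAYS})
    lowered = (t.lower() for t in tokens)
    counts.update(w for w in lowered if w in WEEKDAYS)
    return counts

# Made-up tokens, in the style of what reuters.words(doc) returns
tokens = ["The", "bid", "was", "announced", "on", "Monday", ".",
          "Talks", "resume", "FRIDAY", "after", "Monday", "'", "s", "close"]

day_counts = count_weekdays(tokens)
print(day_counts.most_common(2))
```

On the real corpus, the same function could be applied to `reuters.words(doc)` for each document and the resulting counters added together.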

In [None]:
df = pd.DataFrame.from_dict(Days,orient='index')
df.columns = ['nbr']
df.sort_values(by='nbr',ascending=False,inplace=True)
df