# Tutorial

This notebook is a short guide to loading and interacting with the data through the `Corpus` object. Run `make data` first: it produces the preprocessed data files that are read here.

In [1]:
from src.utils.corpus import Corpus

The `debates` instance variable is a `pd.DataFrame` where each row corresponds to a paragraph.

In [2]:
corp = Corpus()
corp.debates.head()

Unnamed: 0,document_id,paragraph_index,session,year,country,country_name,bag_of_words,paragraph_id,text
0,0,0,25,1970,ALB,Albania,"[convey, president, congratulation, albanian, ...",0_0,33: May I first convey to our President the co...
1,0,1,25,1970,ALB,Albania,"[take, work, agenda, twenty-, fifth, session, ...",0_1,34.\tIn taking up the work on the agenda of th...
2,0,2,25,1970,ALB,Albania,"[utilization, united, nations, serve, policy, ...",0_2,35.\tThe utilization of the United Nations to ...
3,0,3,25,1970,ALB,Albania,"[progressive, mankind, recall, admiration, her...",0_3,36.\tThe whole of progressive mankind recalls ...
4,0,4,25,1970,ALB,Albania,"[know, consequence, united, nations, particula...",0_4,37.\tAll this has had well known consequences ...


The `speeches` and `paragraphs` instance variables hold lists of `Speech` and `Paragraph` objects, respectively. The `paragraphs` list is in the same order as the rows of the `debates` data frame. Some common operations on these objects are demonstrated below.

In [3]:
corp.speeches[:5]

[<src.utils.corpus.Speech at 0x7393204e0>,
 <src.utils.corpus.Speech at 0x72af5def0>,
 <src.utils.corpus.Speech at 0x7261b1630>,
 <src.utils.corpus.Speech at 0x725b846a0>,
 <src.utils.corpus.Speech at 0x7277b8a20>]

In [4]:
corp.paragraphs[:5]

[<src.utils.corpus.Paragraph at 0x738143240>,
 <src.utils.corpus.Paragraph at 0x737206b00>,
 <src.utils.corpus.Paragraph at 0x734ecf7f0>,
 <src.utils.corpus.Paragraph at 0x732b97f60>,
 <src.utils.corpus.Paragraph at 0x70d968908>]

In [5]:
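# check that the order is the same as in the debates df
# (zip stops at the shorter input, so compare the lengths explicitly first)
assert len(corp.debates) == len(corp.paragraphs)
for par_id_from_df, par_obj in zip(corp.debates.paragraph_id, corp.paragraphs):
    assert par_id_from_df == par_obj.id_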

To access the serialized spaCy markup, call the `spacy_doc` method. The markup is computed for the full speech, so calling `spacy_doc` on a `Paragraph` object returns a span into the parent speech's document. The markup is lazily loaded and cached at the speech level, so only the first call for a given speech pays the loading cost. Loading the markup for all speeches takes several minutes.

In [6]:
doc = corp.speeches[0].spacy_doc()
type(doc)

spacy.tokens.doc.Doc

In [7]:
doc_paragraph = corp.paragraphs[0].spacy_doc()
type(doc_paragraph)

spacy.tokens.span.Span
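
The lazy-load-and-cache behaviour described above can be pictured with a small stand-in. This is only a sketch of the pattern, not the code in `src.utils.corpus`: `ToySpeech`, `ToyParagraph`, `_load_doc`, and `load_count` are hypothetical names, and whitespace tokenisation stands in for deserializing spaCy markup.

```python
class ToySpeech:
    """Hypothetical stand-in for Speech; not the real class."""

    def __init__(self, text):
        self.text = text
        self._doc = None      # nothing is loaded until first requested
        self.load_count = 0   # counts how often the expensive load runs

    def _load_doc(self):
        # Stand-in for deserializing spaCy markup: a plain token list.
        self.load_count += 1
        return self.text.split()

    def spacy_doc(self):
        if self._doc is None:          # lazy: load on first access only
            self._doc = self._load_doc()
        return self._doc               # cached on the speech afterwards


class ToyParagraph:
    """Hypothetical stand-in for Paragraph: a token range in its parent speech."""

    def __init__(self, speech, start, end):
        self.speech = speech
        self.start = start
        self.end = end

    def spacy_doc(self):
        # Slice the parent's cached doc, much as a Span is a slice of a Doc.
        return self.speech.spacy_doc()[self.start:self.end]


speech = ToySpeech("we thank the president for opening this session")
first = ToyParagraph(speech, 0, 3)
second = ToyParagraph(speech, 3, 6)

print(first.spacy_doc())    # ['we', 'thank', 'the']
print(second.spacy_doc())   # ['president', 'for', 'opening']
print(speech.load_count)    # 1: both paragraphs reused the cached doc
```

Keeping the cache on the speech rather than on each paragraph means paragraphs of the same speech share a single load.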

You can also retrieve an individual speech or paragraph directly by its id, using the `speech` and `paragraph` methods.

In [8]:
corp.speech(45)

<src.utils.corpus.Speech at 0x7374b0320>

In [9]:
corp.paragraph('467_20')

<src.utils.corpus.Paragraph at 0x7426a7710>
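
The paragraph ids in the `debates` preview (`0_0`, `0_1`, ...) look like `<document_id>_<paragraph_index>`. Assuming that layout holds across the corpus (it is inferred from the preview, not stated anywhere in this notebook), a small helper can split an id into its two parts. `split_paragraph_id` is a hypothetical name and is not part of `Corpus`.

```python
def split_paragraph_id(paragraph_id):
    """Split an id such as '467_20' into (document_id, paragraph_index).

    Hypothetical helper; assumes the '<document_id>_<paragraph_index>' layout.
    """
    document_id, paragraph_index = paragraph_id.split("_")
    return int(document_id), int(paragraph_index)


document_id, paragraph_index = split_paragraph_id("467_20")
print(document_id, paragraph_index)  # 467 20
```

If speech ids coincide with `document_id` (a further assumption), `corp.speech(document_id)` would then return the speech that contains the paragraph.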