# Context and Sequences

## New Notebook, New Language

In this notebook, we will extract sequences of words from a corpus and use them to build a semantic space and a Markov model. Because it will be more important to actually understand the data we are working with, we will switch from our sample of Polish data to a slightly bigger sample of English: the first 2 million words of the BNC (British National Corpus). Let's get started!

As we are working with new data, we should take a look and see what we are dealing with. We can check what the first 20 lines look like to get an idea:

In [2]:
!head -n 20 ../data/BNC.sample


FACTSHEET WHAT IS AIDS?
AIDS (Acquired Immune Deficiency Syndrome)is a condition caused by a virus called HIV (Human Immuno Deficiency Virus).
This virus affects the body's defence system so that it cannot fight infection.
How is infection transmitted?
through unprotected sexual intercourse with an infected partner.
through infected blood or blood products.
from an infected mother to her baby.
It is not transmitted from:
giving blood/mosquito bites/toilet seats/kissing/from normal day-to-day contact
---END.OF.DOCUMENT---

How does it affect you?
The medical aspects can be cancer, pneumonia, sudden blindness, dementia, dramatic weight loss or any combination of these.
Often infected people are rejected by family and friends, leaving them to face this chronic condition alone.
---END.OF.DOCUMENT---

Did you know?
there is no vaccine or cure currently available.
10 million people worldwide are infected with HIV.


(The ! at the beginning of that command is one of the reasons we are using IPython rather than the regular Python interpreter. It tells IPython that the command (head) is something we would type into the terminal rather than Python code. IPython hands it to the system shell and then reports back with the output we would see in the terminal. Cool.)
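For comparison, the same peek at a file can be done without the shell. The following is a minimal sketch in plain Python; the function name head and the UTF-8 encoding are assumptions made for illustration, not something the data set prescribes.

```python
from itertools import islice


def head(path, n=20):
    """Return the first n lines of a text file, without trailing newlines."""
    # Assumption: the file is UTF-8 encoded; adjust `encoding` if it is not.
    with open(path, "r", encoding="utf-8") as handle:
        # islice stops after n lines, so the rest of the file is never read
        return [line.rstrip("\n") for line in islice(handle, n)]
```

Calling head("../data/BNC.sample") should then return the same 20 lines as a list of strings, which is handy when the notebook runs on a system without the head command.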

Now, the output of head is quite different from what we had before. Our data does not come separated into words, and it's not tagged either. It does come with document separators, and every sentence appears to be on its own line. This will come in handy when we extract sequences, as we would not want those to cross documents, would we? Anyway, let's open the data in Python and see whether we can work with it.

To get started, we will only work on the first twenty documents to get a feel for some of the difficulties we might run into when working with (more) raw text data:

In [4]:
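file_path = "../data/BNC.sample"

with open(file_path, "r") as bnc:
    documents = list()
    current_document = list()
    # Collect lines until we have seen 20 document separators
    while len(documents) < 20:
        line = bnc.readline()
        if not line:
            # readline() returns "" at end of file; stop instead of looping forever
            break
        if line.strip() == "---END.OF.DOCUMENT---":
            # Separator reached: store the finished document and start a new one
            documents.append(current_document)
            current_document = list()
        else:
            current_document.append(line.strip())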

In [6]:
documents[0]

['',
 'FACTSHEET WHAT IS AIDS?',
 'AIDS (Acquired Immune Deficiency Syndrome)is a condition caused by a virus called HIV (Human Immuno Deficiency Virus).',
 "This virus affects the body's defence system so that it cannot fight infection.",
 'How is infection transmitted?',
 'through unprotected sexual intercourse with an infected partner.',
 'through infected blood or blood products.',
 'from an infected mother to her baby.',
 'It is not transmitted from:',
 'giving blood/mosquito bites/toilet seats/kissing/from normal day-to-day contact']

We now have a data structure that contains the first 20 documents. Your first task will be to replicate what we did in the previous notebook: generate a vocabulary with frequencies. To make this task even more fun, try your best to only count actual words and try to catch all of them!
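If you get stuck, here is one possible starting point rather than the definitive solution. It assumes that a "word" is a run of the letters a-z, optionally joined by internal apostrophes or hyphens, and that case does not matter. The names WORD_PATTERN and count_words, as well as the tiny sample list, are made up for this sketch.

```python
import re
from collections import Counter

# Assumption: a word is a run of letters a-z, optionally joined by internal
# apostrophes or hyphens (e.g. "body's", "day-to-day"). Digits, punctuation
# and non-ASCII letters are ignored by this pattern.
WORD_PATTERN = re.compile(r"[a-z]+(?:['-][a-z]+)*")


def count_words(documents):
    """Return a Counter mapping each lowercased word to its frequency."""
    vocabulary = Counter()
    for document in documents:
        for line in document:
            vocabulary.update(WORD_PATTERN.findall(line.lower()))
    return vocabulary


# Tiny stand-in with the same shape as the `documents` list built above
sample = [["",
           "AIDS (Acquired Immune Deficiency Syndrome)is a condition.",
           "through infected blood or blood products."]]
vocabulary = count_words(sample)
```

Searching for word-shaped matches, rather than splitting on whitespace, copes with cases like "Syndrome)is" and "blood/mosquito", where words are not separated by spaces. Whether hyphenated forms such as "day-to-day" should count as one word or three is a design choice you may want to revisit.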