# 1. Primary data analysis

## 1.1 Connecting libraries and importing data


In [1]:
%run "../../Oleksandr Zakharchuk Handbook.ipynb"

In [2]:
# !pip install --user -U nltk

In [3]:
# !pip install -U spacy

In [4]:
# !pip install gensim

In [5]:
# !pip install pytest

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import *
from nltk.stem.porter import *
from nltk.tokenize import PunktSentenceTokenizer
from sklearn.linear_model import LogisticRegression
import spacy
from spacy import displacy
from spacy.matcher import Matcher
from nltk.test.gensim_fixt import setup_module
import gensim
from gensim import models
import nltk
from nltk.corpus import brown

In [7]:
# !python -m spacy download en_core_web_sm

In [8]:
df = pd.read_csv('Data Folder/IMDB Dataset.csv')

For speed and convenience, we temporarily keep only the first 1,000 rows:



In [9]:
df = df.loc[0:999]

## 1.2 General information


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     1000 non-null   object
 1   sentiment  1000 non-null   object
dtypes: object(2)
memory usage: 15.8+ KB


In [11]:
df

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| ... | ... | ... |
| 995 | Nothing is sacred. Just ask Ernie Fosselius. T... | positive |
| 996 | I hated it. I hate self-aware pretentious inan... | negative |
| 997 | I usually try to be professional and construct... | negative |
| 998 | If you like me is going to see this in a film ... | negative |


In [12]:
target_name = 'sentiment'
target = [target_name]

# 2. Type conversion and value adjustment

## 2.1 Parsing Data Types and Values


In [13]:
df['review'] = df['review'].str.replace('<br />', '')

In [14]:
analysis_dataframe_values_by_column(df, regex='<br')

Search 0 value per column: {}
Search nan value per column: {}
Search None value per column: {}
Search [inf, -inf] value per column: {}
Search count per column by regular expression '<br' (without single quotes): {}
Search unique value per column by regular expression '<br' (without single quotes): {}


All `<br />` tags have been removed.



# 4. Dividing the dataset into training and test parts


In [15]:
X, y = get_features_target_split(df, target_name)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [16]:
y_train.shape

(670,)

# 3. Classification

## 3.1 Bag of Words (BoW) (CountVectorizer)

A corpus of documents can be represented by a matrix with one row per document and one column per token (e.g. a word) occurring in the corpus.

Vectorization is the general process of converting a collection of text documents into numerical feature vectors. This particular strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences, and the relative position of the words in the document is ignored entirely.
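The idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not how CountVectorizer is implemented: the three toy documents and the `tokenize` helper are hypothetical, and the regular expression mirrors CountVectorizer's default token pattern (tokens of two or more word characters).

```python
import re
from collections import Counter

# Hypothetical toy corpus, not taken from the IMDB data
docs = ["the film was great", "the plot was thin", "great film great cast"]

def tokenize(text):
    # Lowercase, then keep tokens made of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# One column per distinct token, in sorted order
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

# One row per document: how often each vocabulary term occurs in it
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)  # ['cast', 'film', 'great', 'plot', 'the', 'thin', 'was']
for row in matrix:
    print(row)
# [0, 1, 1, 0, 1, 0, 1]
# [0, 0, 0, 1, 1, 1, 1]
# [1, 1, 2, 0, 0, 0, 0]
```

Word order is lost: the third row only records that "great" occurs twice, not where.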


In [17]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X_train['review'])

Get the list of words (features) found in the training reviews:


In [18]:
vectorizer_get_feature_names_out = vectorizer.get_feature_names_out()
vectorizer_get_feature_names_out

array(['00', '000', '007', ..., 'zulu', 'zzzzzzzzzzzzzzzzzz', 'ísnt'],
      dtype=object)

The number of unique words in the training set:


In [19]:
len(vectorizer.vocabulary_)

14723

Let's look at the count matrix, which shows how many times each word occurs in each review:


In [20]:
print(X.toarray())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [21]:
X.toarray().shape

(670, 14723)

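As we can see, the number of columns equals the number of unique words in the training set. Next, let's build a second vectorizer that also counts sequences of two and three consecutive words (n-grams with n from 1 to 3):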


In [22]:
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(1, 3))  # analyzer can be 'word', 'char' or 'char_wb'
X2 = vectorizer2.fit_transform(X_train['review'])

In [23]:
list(vectorizer2.vocabulary_.items())[:50]

[('must', 127879),
 ('admit', 4183),
 ('was', 218048),
 ('expecting', 62430),
 ('something', 174550),
 ('quite', 154955),
 ('different', 51336),
 ('from', 72610),
 ('my', 128055),
 ('first', 68434),
 ('viewing', 216486),
 ('of', 134163),
 ('cut', 47220),
 ('last', 109834),
 ('night', 130674),
 ('though', 202505),
 ('delighted', 49362),
 ('with', 227204),
 ('the', 186651),
 ('unexpected', 212706),
 ('australian', 22544),
 ('horror', 89812),
 ('gem', 74808),
 ('am', 7955),
 ('true', 210676),
 ('fan', 64004),
 ('as', 19279),
 ('they', 199055),
 ('come', 42096),
 ('and', 9544),
 ('found', 71739),
 ('to', 204462),
 ('not', 131824),
 ('only', 141672),
 ('be', 24617),
 ('best', 28145),
 ('genre', 74997),
 ('australia', 22527),
 ('has', 81726),
 ('ever', 60797),
 ('produced', 153172),
 ('but', 33110),
 ('one', 140908),
 ('great', 78812),
 ('parody', 146283),
 ('comedy', 42259),
 ('films', 67494),
 ('late', 109965),
 ('concern', 43460),
 ('is', 98251)]

Let's get the features, which now include two- and three-word sequences (bigrams and trigrams) alongside single words:


In [24]:
vectorizer2.get_feature_names_out()

array(['00', '00 am', '00 am stayed', ..., 'ísnt', 'ísnt entertaining',
       'ísnt entertaining if'], dtype=object)
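To make the `ngram_range=(1, 3)` setting concrete, here is a minimal pure-Python sketch of how word n-grams can be enumerated from a token list. The helper `word_ngrams` and the example sentence are hypothetical and only illustrate the idea. CountVectorizer's internal implementation differs in detail.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all n-grams of min_n..max_n consecutive tokens, joined by spaces."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the sequence
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

tokens = "this is not good".split()
features = word_ngrams(tokens)
print(features)
# ['this', 'is', 'not', 'good',
#  'this is', 'is not', 'not good',
#  'this is not', 'is not good']
```

Unlike single words, the bigram "not good" keeps the negation attached to the word it modifies, which is why n-grams can help sentiment models. The price is a much larger feature space.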

The number of unique n-grams (features) in the training set:


In [25]:
len(vectorizer2.get_feature_names_out())

234260

Matrix of n-gram counts:


In [26]:
print(X2.toarray())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [27]:
X2.toarray().shape

(670, 234260)

## 3.2 TF-IDF (CountVectorizer, TfidfTransformer)

Prepare the pipeline and fit it on the training reviews:


In [28]:
pipe = Pipeline(
    [
        ('count', CountVectorizer(vocabulary=vectorizer_get_feature_names_out)), 
        ('tfid', TfidfTransformer())
    ]
).fit(X_train['review'])

In [29]:
pipe['count']

In [30]:
pipe['count'].transform(X_train['review']).toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

We see the words (features):


In [31]:
pipe['count'].get_feature_names_out()

array(['00', '000', '007', ..., 'zulu', 'zzzzzzzzzzzzzzzzzz', 'ísnt'],
      dtype=object)

Number of words (features):


In [32]:
len(pipe['count'].get_feature_names_out())

14723

Next, we want to highlight the words that occur often in a given review but rarely in the other reviews. The TF-IDF metric does this: TF (term frequency) is how often a word occurs in a document, and IDF (inverse document frequency) down-weights words that occur in many documents. The fitted IDF weights, one per feature:

In [33]:
pipe['tfid'].idf_

array([5.89933122, 5.4293276 , 6.81562196, ..., 6.81562196, 6.81562196,
       6.81562196])
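The printed weights can be checked by hand. With its default `smooth_idf=True`, TfidfTransformer computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The sketch below assumes n = 670 (the size of the training set above). The helper name `smooth_idf` and the chosen document frequencies are illustrative.

```python
import math

def smooth_idf(n_docs, doc_freq):
    # Smoothed inverse document frequency, as in scikit-learn's default setting
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 670  # number of training reviews

for doc_freq in (1, 4, 7, 670):
    print(doc_freq, round(smooth_idf(n_docs, doc_freq), 8))
# A term found in a single review gets about 6.81562196 (the largest value above),
# a term found in 4 reviews about 5.899331,
# and a term found in every review gets exactly 1.0.
```

The final TF-IDF value of a term in a document is its count multiplied by this weight, after which each row is scaled to unit Euclidean length (the default `norm='l2'`).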

In [34]:
vectorizer_tfidf = TfidfVectorizer()
X = vectorizer_tfidf.fit_transform(X_train['review'])

In [35]:
vectorizer_tfidf.get_feature_names_out()

array(['00', '000', '007', ..., 'zulu', 'zzzzzzzzzzzzzzzzzz', 'ísnt'],
      dtype=object)

In [36]:
print(X)

  (0, 5597)	0.039589484285865825
  (0, 7831)	0.05804152816229731
  (0, 4349)	0.04070505939805111
  (0, 14054)	0.02983697400209609
  (0, 13309)	0.08661010119341983
  (0, 2322)	0.08393337600498052
  (0, 13419)	0.1000624790968887
  (0, 11351)	0.07798060395217861
  (0, 1283)	0.052805061211738454
  (0, 14633)	0.04401333754281163
  (0, 9194)	0.027830554894485455
  (0, 3011)	0.055718587377774006
  (0, 10163)	0.1000624790968887
  (0, 8645)	0.021877782841683537
  (0, 13212)	0.03717367840531482
  (0, 551)	0.048305667164118694
  (0, 13063)	0.1000624790968887
  (0, 7107)	0.1000624790968887
  (0, 11119)	0.04918082390135742
  (0, 1771)	0.06131757502665992
  (0, 2146)	0.08393337600498052
  (0, 1660)	0.1000624790968887
  (0, 11902)	0.08393337600498052
  (0, 14330)	0.06475835065848883
  (0, 4818)	0.06240557953482789
  :	:
  (669, 8645)	0.02906555331052809
  (669, 13212)	0.09877358590281722
  (669, 14342)	0.13296372385871721
  (669, 8033)	0.05007124141908814
  (669, 13194)	0.03613801386372284
  (669, 13

## 3.3 NLTK

### 3.3.1 nltk.stem, nltk.stem.porter

The PorterStemmer class in the NLTK (Natural Language Toolkit) library stems words using the Porter algorithm. Stemming reduces inflected word forms to a common stem by stripping suffixes. For example, PorterStemmer maps the word "running" to the stem "run". The stem is not necessarily a dictionary word (the output below contains "abbrevi", for instance).


In [37]:
stemmer = PorterStemmer()

Let's reduce our words (features) to their stems:


In [38]:
singles = [stemmer.stem(v) for v in vectorizer_get_feature_names_out]

In [39]:
print(' '.join(singles)[0:5000])

00 000 007 00am 10 100 1000 100th 101 102 103 105 10i 10thi 11 12 120 13 135 13th 14 15 16 17 1700 177 1794 18 1800 1840 1860 18th 19 1900 1903 1919 1920 1920 1921 1922 1923 1928 1929 1930 1930 1932 1934 1936 1937 1938 1939 1940 1940 1941 1944 1945 1947 1948 1949 1950 1950 1951 1952 1953 1954 1955 1956 1959 1960 1960 1963 1964 1966 1968 1969 1970 1971 1972 1973 1974 1976 1977 1980 1980 1981 1982 1983 1984 1984ish 1985 1987 1990 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 19th 1m 1sound 1st 20 2000 20000 2000 2001 2002 2003 2004 2005 2006 2008 2012 2036 20 20th 21 22 23 24 25 250 2500 25min 26 28 29 2am 2ftm 2hour 2hr 2nd 30 300 3000 30pm 30 31 31st 32 33 35 35c 36 360 36th 37 38 3rd 40 400 40 43 44c 45 450 47 48 49th 4th 50 500 50c 50 52 53 55 5539 56 57 58 588 5th 5yo 60 60 63 64 666 69 70 700 70 74 747 75 75c 77 78 80 8000 80 85 87 88 89 90 90c 90 95 950 99 9th _inspire_ aaliyah aamir aaron abandon abandon abandon abbey abbot abbott abbrevi abducte abed abet abhorr abid abid ab

### 3.3.2 PunktSentenceTokenizer

The PunktSentenceTokenizer class in the NLTK (Natural Language Toolkit) library splits text into sentences using the Punkt algorithm.

PunktSentenceTokenizer implements an unsupervised learning algorithm: it learns from raw text which tokens are abbreviations, collocations and frequent sentence starters, and uses this knowledge to decide which punctuation marks end a sentence.

The workflow is as follows:

Training: Punkt is trained on a large body of plain text in the target language; no sentence annotations are needed. NLTK ships pre-trained models that can be used out of the box (for example through nltk.sent_tokenize), and you can also train a PunktSentenceTokenizer on your own corpus by passing the text to its constructor. A tokenizer created without training text, as in the cell below, has no learned abbreviations and relies on its default parameters.

Tokenization: The tokenize method takes the input text and returns a list of sentences as strings.

Segmentation algorithm: The tokenizer examines each candidate sentence boundary and takes into account the punctuation, known abbreviations and the surrounding context.

All in all, PunktSentenceTokenizer provides a reasonably reliable and efficient way to split text into sentences. It is not perfect, however, and can make mistakes on text with non-standard features or complex sentence structures.
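To see why abbreviation handling matters, compare with a deliberately naive splitter. The sketch below is not the Punkt algorithm. The `naive_split` helper and the example sentences are hypothetical and only show what goes wrong when every period followed by whitespace is treated as a sentence end.

```python
import re

def naive_split(text):
    # Split after '.', '!' or '?' whenever whitespace follows
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# The abbreviation "Dr." is wrongly treated as the end of a sentence
print(naive_split("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']

# No whitespace after the period, so no split happens at all
print(naive_split("It happened with me.The first thing struck me."))
# ['It happened with me.The first thing struck me.']
```

The second case also appears in the reviews here: removing the `<br />` tags without inserting a space glues sentences together ("with me.The first thing"), and whitespace-based tokenizers, Punkt included, will not separate them.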

In [40]:
PST = PunktSentenceTokenizer()
tokenized = PST.tokenize(X_train['review'][0])
tokenized

["One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked.",
 'They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO.',
 'Trust me, this is not a show for the faint hearted or timid.',
 'This show pulls no punches with regards to drugs, sex or violence.',
 'Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary.',
 'It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda.',
 "Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it 

## 3.4 LogisticRegression

In [41]:
df_lr = df.copy()

Let's transform the reviews into vectors of numerical features:


In [42]:
vectorizer_lr = CountVectorizer(analyzer='word', ngram_range=(1, 3))  # analyzer can be 'word', 'char' or 'char_wb'
X_lr = vectorizer_lr.fit_transform(df_lr['review'])

Split the dataset:



In [43]:
# LogisticRegression accepts sparse input, so the matrix is not converted to a dense array
X_train_lr, X_test_lr, y_train_lr, y_test_lr = train_test_split(X_lr, df_lr['sentiment'], test_size=0.33, random_state=42)

Let's train:


In [44]:
clf = LogisticRegression(random_state=0).fit(X_train_lr, y_train_lr)

Let's make a prediction for the test data:


In [45]:
clf.predict(X_test_lr)

array(['positive', 'positive', 'negative', 'positive', 'negative',
       'positive', 'positive', 'positive', 'negative', 'positive',
       'positive', 'positive', 'negative', 'negative', 'negative',
       'negative', 'positive', 'negative', 'positive', 'positive',
       'positive', 'negative', 'positive', 'negative', 'negative',
       'positive', 'negative', 'positive', 'positive', 'negative',
       'negative', 'negative', 'negative', 'positive', 'positive',
       'positive', 'negative', 'negative', 'negative', 'positive',
       'positive', 'negative', 'negative', 'negative', 'positive',
       'positive', 'positive', 'positive', 'positive', 'positive',
       'positive', 'negative', 'negative', 'negative', 'negative',
       'negative', 'negative', 'positive', 'negative', 'positive',
       'negative', 'positive', 'negative', 'positive', 'positive',
       'positive', 'positive', 'positive', 'negative', 'negative',
       'negative', 'positive', 'positive', 'positive', 'positi
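
The vectorizer in this section is fitted on all 1,000 reviews before the split, so its vocabulary also contains n-grams that occur only in test reviews. A minimal sketch of one way to avoid this follows, assuming scikit-learn's `make_pipeline` and a tiny made-up dataset (the texts and labels are hypothetical, not IMDB data). The split is done on the raw text, and the pipeline fits the vectorizer on the training part only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df['review'] and df['sentiment']
texts = ["great film", "loved it", "wonderful acting", "really great fun",
         "terrible film", "hated it", "awful acting", "really bad plot"]
labels = ["positive"] * 4 + ["negative"] * 4

# Split the raw text first, so the test reviews stay unseen
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer learns its vocabulary from the training texts only
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),
    LogisticRegression(random_state=0),
)
model.fit(texts_train, y_train)

predictions = model.predict(texts_test)
accuracy = model.score(texts_test, y_test)  # share of correct test predictions
print(predictions, accuracy)
```

The pipeline keeps the count matrix sparse from start to finish, and `score` gives a quick accuracy figure for the held-out reviews.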

## 3.5 spaCy

spaCy is a Python library for Natural Language Processing (NLP). It provides tools for text processing tasks such as tokenization, lemmatization, part-of-speech tagging, named entity recognition, dependency parsing, and more.

The main features of the spaCy library are:

Tokenization: Dividing text into individual words or tokens. spaCy provides efficient, language-specific tokenization rules.

Lemmatization: Reducing words to their base (dictionary) forms, called lemmas. This is useful, for example, for matching different forms of the same word.

Part-of-speech tagging: Identifying the part of speech of each token in a text, such as noun, verb or adjective.

Named Entity Recognition (NER): Recognizing and classifying named entities in text, such as the names of people, organizations and locations.

Dependency parsing: Analyzing the structure of sentences to determine the syntactic relationships between words.

Word vectors: spaCy provides pre-trained models with word vector representations that can be used for text comparison and semantic analysis.

Loading a pretrained English pipeline (it includes a tagger, a parser and an NER component):


In [46]:
nlp = spacy.load("en_core_web_sm")

In [47]:
doc = nlp(X_train['review'][0])

In [48]:
for token in doc[0:20]:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

One one NUM CD nsubj Xxx True True
of of ADP IN prep xx True True
the the DET DT det xxx True True
other other ADJ JJ amod xxxx True True
reviewers reviewer NOUN NNS pobj xxxx True False
has have AUX VBZ aux xxx True True
mentioned mention VERB VBN ROOT xxxx True False
that that SCONJ IN mark xxxx True True
after after ADP IN prep xxxx True True
watching watch VERB VBG pcomp xxxx True False
just just ADV RB advmod xxxx True True
1 1 NUM CD nummod d False False
Oz oz NOUN NN compound Xx True False
episode episode NOUN NN dobj xxxx True False
you you PRON PRP nsubjpass xxx True True
'll will AUX MD aux 'xx False True
be be AUX VB auxpass xx True True
hooked hook VERB VBN ccomp xxxx True False
. . PUNCT . punct . False False
They they PRON PRP nsubj Xxxx True True


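Visualization of the dependency parse of a sentence using displaCy, spaCy's built-in visualizer. Each arc connects a head to one of its children and is labeled with the dependency relation; the arrowhead marks the child. Running the cell starts a local web server, so open the printed address in a browser to view the result: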


In [None]:
displacy.serve(doc[0:20], style="dep")




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...



In [50]:
nlp = spacy.load("en_core_web_sm")

In [51]:
matcher = Matcher(nlp.vocab)

In [52]:
pattern = [{"LOWER": "senses"}, {"IS_PUNCT": True}, {"LOWER": "particularly"}]

In [53]:
matcher.add("my_matcher", [pattern])

In [54]:
doc = nlp(X_train['review'][1])

In [55]:
doc

A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.

We get:  
match identifier (match_id),  
string representation (string_id),  
the position of the start (start) and end of the match (end),  
text in the corresponding span (span.text):

In [56]:
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(f"match_id: {match_id}\r\nstring_id: {string_id}\r\nstart: {start}\r\nend: {end}\r\nspan.text: {span.text}")

match_id: 476750250461617799
string_id: my_matcher
start: 151
end: 154
span.text: senses, particularly


## 3.6 Gensim

In [57]:
setup_module()

In [58]:
nltk.download('word2vec_sample')

[nltk_data] Downloading package word2vec_sample to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package word2vec_sample is already up-to-date!


True

In [59]:
first_sents = []
for word in doc:
    # Note: this stores spaCy Token objects; use word.text to train on plain strings
    first_sents.append(word)

In [60]:
print(first_sents)

[A, wonderful, little, production, ., The, filming, technique, is, very, unassuming-, very, old, -, time, -, BBC, fashion, and, gives, a, comforting, ,, and, sometimes, discomforting, ,, sense, of, realism, to, the, entire, piece, ., The, actors, are, extremely, well, chosen-, Michael, Sheen, not, only, ", has, got, all, the, polari, ", but, he, has, all, the, voices, down, pat, too, !, You, can, truly, see, the, seamless, editing, guided, by, the, references, to, Williams, ', diary, entries, ,, not, only, is, it, well, worth, the, watching, but, it, is, a, terrificly, written, and, performed, piece, ., A, masterful, production, about, one, of, the, great, master, 's, of, comedy, and, his, life, ., The, realism, really, comes, home, with, the, little, things, :, the, fantasy, of, the, guard, which, ,, rather, than, use, the, traditional, ', dream, ', techniques, remains, solid, then, disappears, ., It, plays, on, our, knowledge, and, our, senses, ,, particularly, with, the, scenes, con

In [61]:
model = gensim.models.Word2Vec([first_sents], min_count=1)

In [1]:
# model.save("word2vec.my_model")

In [63]:
print(model.wv.index_to_key)

[., seamless, see, truly, can, You, !, too, pat, down, voices, the, all, has, he, but, ", polari, the, all, got, the, editing, ", guided, it, but, watching, the, worth, well, it, is, only, not, ,, entries, diary, ', Williams, to, references, the, by, has, only, done, comforting, gives, and, fashion, BBC, -, time, -, old, very, unassuming-, very, is, technique, filming, The, ., production, little, wonderful, a, ,, not, and, Sheen, Michael, chosen-, well, extremely, are, actors, The, ., piece, entire, the, to, realism, of, sense, ,, discomforting, sometimes, is, a, terrificly, Orton, scenes, the, with, particularly, ,, senses, our, and, knowledge, our, on, plays, It, ., disappears, then, solid, remains, techniques, concerning, and, written, Halliwell, well, terribly, are, ), surface, every, decorating, murals, 's, Halliwell, with, flat, their, of, particularly, (, sets, the, and, ', dream, ', traditional, life, his, and, comedy, of, 's, master, great, the, of, one, about, production, mas

In [2]:
# len(model.wv['rather'])

## 3.7 NER

Named Entity Recognition (NER) in spaCy is the process of finding named entities in text and classifying them. Named entities are real-world objects that can be referred to by name, such as persons, organizations and locations; spaCy's models also label dates, times, monetary values and similar expressions.

NER in spaCy uses machine learning to recognize and classify named entities in text. The library provides pre-trained models that can be used to perform NER in different languages and domains.

When a spaCy pipeline with an NER component is applied to text, the text is first split into tokens. The model then predicts which spans of one or more tokens are named entities and assigns each span a type (for example PERSON, ORG or GPE). The recognized spans are available in doc.ents.

Applying NER to the first 20 training reviews and extracting the named entities (the pipeline components that NER does not need are disabled for speed):


In [65]:
for doc in nlp.pipe(X_train['review'][0:20], disable=["tok2vec", "tagger",  "parser", "attribute_ruler", "lemmatizer"]):
    # Collect (text, label) pairs for every entity found in the review
    print([(ent.text, ent.label_) for ent in doc.ents])

[('first', 'ORDINAL'), ('last night', 'TIME'), ('Australian', 'NORP'), ('Cut', 'PERSON'), ('Australia', 'GPE'), ('gore', 'PERSON'), ('Kylie Minogue', 'PERSON'), ('Molly Ringwald', 'PERSON'), ('Simon Bossell', 'PERSON'), ('Castle', 'ORG'), ('the last couple or years', 'DATE')]
[('CIA', 'ORG'), ('Olivia', 'ORG'), ('Tom Conti', 'PERSON'), ('Hendrick Haese', 'PERSON'), ('Roger Moore', 'PERSON')]
[('ITV', 'ORG'), ('last night', 'TIME'), ('Rupert Grint', 'PERSON'), ('Ben', 'PERSON'), ('Dame Eve Walton', 'PERSON'), ('Julie Walters', 'PERSON'), ('Driving Lessons', 'WORK_OF_ART'), ('2 hours', 'TIME')]
[('two', 'CARDINAL'), ('Pepsi', 'ORG'), ('Kimberly Williams', 'PERSON'), ('Kimberly Williams', 'PERSON')]
[('One', 'CARDINAL'), ('just 1 Oz', 'PERCENT'), ('the Oswald Maximum Security State Penitentary', 'ORG'), ('Emerald City', 'GPE'), ('Aryans', 'NORP'), ('Muslims', 'NORP'), ('Latinos', 'ORG'), ('Christians', 'NORP'), ('Italians', 'NORP'), ('Irish', 'NORP'), ('first', 'ORDINAL'), ('Watching Oz',