# HW02: Tokenization

Remember that these homeworks are graded for completion. **You can skip one section without losing credit.**

In [1]:
# Import the AG News dataset (same as hw01)
# Download it from here
#!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv

import pandas as pd
import nltk
df = pd.read_csv('train.csv')

df.columns = ["label", "title", "lead"]
label_map = {1:"world", 2:"sport", 3:"business", 4:"sci/tech"}
def replace_label(x):
	return label_map[x]
df["label"] = df["label"].apply(replace_label) 
df["text"] = df["title"] + " " + df["lead"]
df.head()

Unnamed: 0,label,title,lead,text
0,business,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...,Carlyle Looks Toward Commercial Aerospace (Reu...
1,business,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...,Oil and Economy Cloud Stocks' Outlook (Reuters...
2,business,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...,Iraq Halts Oil Exports from Main Southern Pipe...
3,business,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco...","Oil prices soar to all-time record, posing new..."
4,business,"Stocks End Up, But Near Year Lows (Reuters)",Reuters - Stocks ended slightly higher on Frid...,"Stocks End Up, But Near Year Lows (Reuters) Re..."


## Preprocess Text

In [2]:
import spacy
dfs = df.sample(50)
nlp = spacy.load('en_core_web_sm')

##TODO use spacy to split the documents in the sampled dataframe (dfs) in sentences and tokens
dfs['document'] = df['title'] + ' ' + df['lead'] + ' ' + df['text']
# Combine the text columns into a single 'document' column, as in hw01.
# Note: 'text' is already title + lead, so each document contains that content twice.

# For storing sentences and tokens
sentences = list()
tokens = list()

for doc_text in dfs['document']:
    doc = nlp(doc_text)

    sentences.extend([sentence.text for sentence in doc.sents])
    tokens.extend([token.text for token in doc])

##TODO print the first sentence of the first document in your sample
print(sentences[0])

Eagles' Biggest Hurdle Lies Ahead EAST RUTHERFORD, N.J., Nov. 28 - Even the flamboyant Terrell Owens was subdued after Philadelphia's 27-6 victory over the Giants on Sunday.


In [26]:
##TODO create a new column with tokens in lowercase (x.lower()), without punctuation tokens (x.is_punct), stopwords (x.is_stop), and digits (x.is_digit)
dfs['processed'] = dfs['document'].apply(lambda doc: [token.text.lower() for token in nlp(doc) if not token.is_punct and not token.is_stop and not token.is_digit])


##TODO print the tokens (x.lemma_) and the dependency labels (x.dep_ ) of the first sentence of the first document (doc.sents)

for token in nlp(sentences[0]):
    print(token, token.lemma_, token.dep_)

Eagles eagle poss
' ' case
Biggest big amod
Hurdle Hurdle nsubj
Lies lie ROOT
Ahead ahead advmod
EAST EAST compound
RUTHERFORD RUTHERFORD nsubjpass
, , punct
N.J. N.J. appos
, , punct
Nov. Nov. npadvmod
28 28 nummod
- - punct
Even even advmod
the the det
flamboyant flamboyant amod
Terrell Terrell compound
Owens Owens nsubjpass
was be auxpass
subdued subdue advcl
after after prep
Philadelphia Philadelphia poss
's 's case
27 27 nummod
- - punct
6 6 prep
victory victory pobj
over over prep
the the det
Giants Giants pobj
on on prep
Sunday Sunday pobj
. . punct


In [27]:
dfs['processed'].head()

103735    [eagles, biggest, hurdle, lies, ahead, east, r...
97576     [robinson, hails, hodgson, heroics, charlie, h...
55421     [honor, system, flu, shots, u.s., chain, store...
10214     [iraq, sistani, begins, journey, najaf, witnes...
91886     [angels, renew, molina, ortiz, cbc, sports, on...
Name: processed, dtype: object

### Named Entities

Let's compute the ratio of tokens inside named entities that start with a capital letter. For example, in the NE "University of Chicago", "University" and "Chicago" are capitalized and "of" is not, so the ratio is 2/3.
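
A minimal sketch of the token-level ratio described above, assuming each named entity is available as a list of its token strings (with spaCy this could be built as `[[t.text for t in ent] for ent in doc.ents]`). The helper name `capitalised_token_ratio` and the sample span are illustrative assumptions, not part of the assignment.

```python
def capitalised_token_ratio(entities):
    """Return (capitalised, total) token counts over all entity spans.

    `entities` is a list of entity spans, each given as a list of token strings.
    """
    total = 0
    capitalised = 0
    for span in entities:
        for token_text in span:
            total += 1
            # isupper() on the first character is False for digits and punctuation
            if token_text[:1].isupper():
                capitalised += 1
    return capitalised, total

# Hand-written example span (hypothetical, not taken from the dataset)
example_entities = [["University", "of", "Chicago"]]
cap, total = capitalised_token_ratio(example_entities)
print(f"{cap}/{total}")  # prints 2/3
```

Counting inside the span, rather than once per span, is what makes "of" lower the ratio.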

In [4]:
##TODO print the ratio of tokens being part of a named entity span starting with a capital letter (doc.ents)
total_ne = 0
cap_ne = 0

for doc_text in dfs['document']:
    doc = nlp(doc_text)

    # identify all named entities in the doc
    # note: this counts entity spans, not the individual tokens inside them
    for ent in doc.ents:
        # count every entity span
        total_ne += 1
        if ent.text[0].isupper():  # does the entity span start with a capital letter?
            cap_ne += 1

print('Ratio of capitalised NEs to the total number of NEs: {}/{}'.format(cap_ne, total_ne))

Ratio of capitalised NEs to the total number of NEs: 427/617


In [19]:
##TODO print the ratio of capitalized tokens not being part of a named entity span (have no token.ent_type_)
# e.g. "The dog barks" = 1/3; 3 tokens, only "The" is capitalized

total_tok = 0
cap_tok = 0

for doc_text in dfs['document']:
    doc = nlp(doc_text)
    total_tok += len(doc) #Count total number of tokens per doc
    
    for token in doc:
        if token.ent_type_ == "" and token.text[0].isupper():
            cap_tok += 1

print('Ratio of capitalised non-NEs to the total number of tokens: {}/{}'.format(cap_tok, total_tok))

Ratio of capitalised non-NEs to the total number of tokens: 336/4566


In [20]:
##TODO print the ratio of capitalized tokens not being a named entity and not being the first token in a sentence
# e.g. "The dog barks" = 0; 3 tokens, "The" is capitalized but the starting token of a sentence, no other tokens are capitalized.
total_tok = 0
cap_tok = 0

for doc_text in dfs['document']:
    doc = nlp(doc_text)
    total_tok += len(doc) #Count total number of tokens per doc
    
    for token in doc:
        if token.ent_type_ == "" and not token.is_sent_start and token.text[0].isupper():
            cap_tok += 1

print('Ratio of capitalised non-NEs/non-start tokens to the total number of tokens: {}/{}'.format(cap_tok, total_tok))

Ratio of capitalised non-NEs/non-start tokens to the total number of tokens: 263/4566


Give an example of a capitalized token in the data which is neither a named entity nor at the start of a sentence. What could be the reason the token is capitalized (one sentence)?

In [21]:
found_example = False
for doc_text in dfs['document']:
    doc = nlp(doc_text)
    for token in doc:
        if not found_example and token.ent_type_ == "" and not token.is_sent_start and token.text[0].isupper():
            print(token.text)
            found_example = True
            break
    if found_example:
        break
    

# Such tokens are most likely capitalised for stylistic reasons: news headlines use title case, and datelines such as PITTSBURGH are written in all caps.

PITTSBURGH


## Term Frequencies

In [21]:
# one list of preprocessed tokens per business document
business_text = dfs[dfs['label'] == 'business']['processed'].tolist()
business_text[1]

['gm',
 'close',
 'baltimore',
 'plant',
 'general',
 'motors',
 'corp.',
 'world',
 '39;s',
 'largest',
 'automaker',
 'announced',
 'tuesday',
 '1,100',
 'remaining',
 'workers',
 'baltimore',
 'assembly',
 'plant',
 'lose',
 'jobs',
 'year',
 ' ',
 'gm',
 'close',
 'baltimore',
 'plant',
 'general',
 'motors',
 'corp.',
 'world',
 '39;s',
 'largest',
 'automaker',
 'announced',
 'tuesday',
 '1,100',
 'remaining',
 'workers',
 'baltimore',
 'assembly',
 'plant',
 'lose',
 'jobs',
 'year']

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(min_df=0.01, 
                        max_df=0.9,  
                        max_features=1000,
                        stop_words='english',
                        use_idf=True, # the new piece
                        ngram_range=(1,2))

from wordcloud import WordCloud
import matplotlib.pyplot as plt

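##TODO using the whole sample, produce a word cloud with bigrams for label == business using tfidf frequencies
import numpy as np

# TfidfVectorizer expects one string per document, so join each token list back into text
business_docs = [' '.join(doc_tokens) for doc_tokens in business_text]
tfidf_matrix = tfidf.fit_transform(business_docs)

feature_names = tfidf.get_feature_names_out()

# Sum each term's TF-IDF weight over all business documents and keep only the bigrams
term_scores = np.asarray(tfidf_matrix.sum(axis=0)).ravel()
tfidf_dict = {term: score for term, score in zip(feature_names, term_scores) if ' ' in term}

# Generate word cloud using TF-IDF scores
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(tfidf_dict)

# Plot the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud for "Business" Label with Bigrams (TF-IDF)')
plt.show()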

In [13]:
business_text[:30]

'short lived bush rally washing'

## Supervised Feature Selection

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif, chi2

##TODO compute the number of words per document (excluding stopwords)
##TODO get the most predictive features of the number of words per document using first f_classif and then chi2

Are the results different? What could be a reason for this? 

## Huggingface Tokenizers

In [None]:
# we use the DistilBERT tokenizer
# !pip install transformers
from transformers import DistilBertTokenizerFast

# let's instantiate a tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

##TODO tokenize the sentences in the sampled dataframe (dfs) using the DistilBertTokenizer
##TODO what is the type/token ratio from this tokenizer (number_of_unique_token_types/number_of_tokens)?
##TODO what is the amount of subword tokens returned by the huggingface tokenizer? hint: each subword continuation token starts with "##"
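
The two quantities asked for above can be sketched on a plain list of token strings. With the tokenizer from this cell, such a list could come from `tokenizer.tokenize(text)`; the tokens below are hand-written in WordPiece style and are not real tokenizer output, and both helper names are illustrative assumptions.

```python
def type_token_ratio(tokens):
    """Number of unique token types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def count_subword_tokens(tokens):
    """Count WordPiece continuation pieces, which carry a '##' prefix."""
    return sum(1 for tok in tokens if tok.startswith("##"))

# Hand-written WordPiece-style tokens (hypothetical)
example_tokens = [
    "token", "##ization", "splits", "rare", "words",
    "into", "word", "##piece", "##s", "token",
]
print(type_token_ratio(example_tokens))      # 9 types / 10 tokens = 0.9
print(count_subword_tokens(example_tokens))  # 3
```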



# Parsing

In [10]:
import pandas as pd
import nltk
df = pd.read_csv('train.csv')

df.columns = ["label", "title", "lead"]
label_map = {1:"world", 2:"sport", 3:"business", 4:"sci/tech"}
def replace_label(x):
	return label_map[x]
df["label"] = df["label"].apply(replace_label) 
df["text"] = df["title"] + " " + df["lead"]
df = df.sample(n=10000)  # only use 10K datapoints
df.head()

Unnamed: 0,label,title,lead,text
110581,sport,UEFA to probe Valencia-Werder Bremen incidents,UEFA will be launching disciplinary proceeding...,UEFA to probe Valencia-Werder Bremen incidents...
15925,sci/tech,New iMac tries to play it cool,Hot G5 chip requires some serious effort to av...,New iMac tries to play it cool Hot G5 chip req...
46720,business,Ford Reports Disappointing U.S. Sales,Ford's car and truck business sales fell nearl...,Ford Reports Disappointing U.S. Sales Ford's c...
57350,sci/tech,Microsoft Upgrades Windows XP Media Center (Ne...,NewsFactor - Bill Gates is about to announce a...,Microsoft Upgrades Windows XP Media Center (Ne...
90601,sport,Rams make statement they #39;re team to beat i...,ST. LOUIS -- When St. Louis coach Mike Martz l...,Rams make statement they #39;re team to beat i...


In [11]:
import spacy
nlp = spacy.load('en_core_web_sm')
#TODO preprocess the corpus using spacy


### Information Extraction

In [None]:
def extract_subject_verb_pairs(sent):
    subjs = [w for w in sent if w.dep_ == "nsubj"]
    pairs = [(w.lemma_.lower(), w.head.lemma_.lower()) for w in subjs]
    return pairs
##TODO extract the subject-verb pairs and print the result for the second document

from collections import Counter
counter = Counter()

##TODO create a list ranking the most common pairs and print the first 10 items
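
A minimal sketch of the ranking step, assuming the pairs have already been extracted into one list per document. The pairs below are hypothetical placeholders rather than output from the dataset.

```python
from collections import Counter

# Hypothetical (subject, verb) lemma pairs, one list per document
pairs_per_doc = [
    [("price", "rise"), ("company", "say")],
    [("price", "rise"), ("team", "win")],
    [("company", "say"), ("price", "rise")],
]

pair_counter = Counter()
for pairs in pairs_per_doc:
    pair_counter.update(pairs)  # adds one count per pair occurrence

# most_common(n) returns (pair, count) tuples sorted by descending count
top_pairs = pair_counter.most_common(10)
for pair, count in top_pairs:
    print(pair, count)
```

The same pattern applies to the verb-object and adjective-noun pairs, since only the extraction function changes.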

In [None]:
##TODO do the same for verb-object pairs ('dobj')
##TODO create a list ranking the most common pairs and print the first 10 items

In [None]:
##TODO do the same for adjective-noun pairs ('amod')
##TODO create a list ranking the most common pairs and print the first 10 items

### Exploring cross label dependencies

In [None]:
##TODO extract all the subject-verb and verb-object pairs for the verb "rise"

In [None]:
##TODO for each label create a list ranking the most common subject-verb pairs and one for the most common verb-object pairs
##TODO print the 10 most common pairs for each of the two lists for the labels "world" and "sci/tech"
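
Grouping the counts by label can be sketched with a `defaultdict` of `Counter` objects. The labelled pairs below are hypothetical; in the notebook they would come from applying a function such as `extract_subject_verb_pairs` to each parsed document.

```python
from collections import Counter, defaultdict

# Hypothetical (label, pairs) records, one per document
labelled_pairs = [
    ("world", [("minister", "say"), ("troop", "leave")]),
    ("sci/tech", [("microsoft", "release"), ("user", "download")]),
    ("world", [("minister", "say")]),
    ("sci/tech", [("microsoft", "release")]),
]

# One Counter per label, created on first use
counts_by_label = defaultdict(Counter)
for label, pairs in labelled_pairs:
    counts_by_label[label].update(pairs)

for label in ("world", "sci/tech"):
    print(label, counts_by_label[label].most_common(10))
```

Keeping one counter per label makes it easy to compare which pairs dominate in "world" versus "sci/tech".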

In [22]:
import os

os.system('jupyter nbconvert --to html homework_02.ipynb')

0