# Week 1: Explore the BBC News archive

Welcome! In this assignment you will be working with a variation of the [BBC News Classification Dataset](https://www.kaggle.com/c/learn-ai-bbc/overview), which contains 2225 examples of news articles with their respective categories (labels).

Let's get started!

In [7]:
import csv
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

Begin by looking at the structure of the csv that contains the data:

In [8]:
with open("./bbc-text.csv", 'r') as csvfile:
    print(f"First line (header) looks like this:\n\n{csvfile.readline()}")
    print(f"Each data point looks like this:\n\n{csvfile.readline()}")     

First line (header) looks like this:

category,text

Each data point looks like this:

tech,tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially

As you can see, each data point is composed of the category of the news article followed by a comma and then the actual text of the article.
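To make the row layout concrete, here is a minimal sketch that parses a made-up two-line sample with `csv.reader`. The sample string and the variable names are illustrative assumptions, not taken from the dataset.

```python
import csv
import io

# Hypothetical sample mimicking the file layout: a header line, then one data point.
sample = "category,text\ntech,tv future in the hands of viewers\n"

reader = csv.reader(io.StringIO(sample))
header = next(reader)        # the first row holds the column names
label, text = next(reader)   # each data row is [category, text]

print(header)
print(label)
print(text)
```

Each row comes back as a list of strings, so the category and the text can be unpacked directly.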

## Removing Stopwords

One important step when working with text data is to remove the **stopwords** from it. These are the most common words in the language and they rarely provide useful information for the classification process.

Complete the `remove_stopwords` function below. This function should receive a string and return another string that excludes all of the stopwords provided.

In [54]:
# GRADED FUNCTION: remove_stopwords
def remove_stopwords(sentence):
    # List of stopwords
    stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]
    
    # Sentence converted to lowercase-only
    sentence = sentence.lower()

    # The sentence is a string of words separated by blank spaces, so the
    # string split method can break it into a list of substrings (the words).
    # The syntax is: string.split(sep, maxsplit)
    # Both arguments are optional and may be passed by position or by keyword.
    # Examples: x = txt.split("#", 1)
    #           x = txt.split(", ")
    # https://www.w3schools.com/python/ref_string_split.asp
    list_of_words = sentence.split(' ')

    # Create a support list to store the words that are not defined as 'stopwords':
    support_list = []

    # Loop through all items (elements named 'word') in the list_of_words
    for word in list_of_words:
        
        # Check whether the word is absent from the stopwords list:
        if word not in stopwords:
            # If it is not, append it into the support list:
            
            support_list.append(word)
    
    # Now, support_list contains only the words from 'sentence' that are not classified as 'stopwords'.
    # Let's recreate the sentence without the 'stopwords':
    
    # Start from an empty string and rebuild the sentence word by word:
    sentence = ''

    # Loop through every element of support_list, from index 0 (the first word)
    # to index len(support_list) - 1 (the last word):
    for i in range(len(support_list)):
        # Concatenate the word at index i to the sentence, followed by a blank space:
        sentence = sentence + support_list[i] + " "
    
    # Apply the rstrip method to remove the trailing blank space left by the loop.
    # Syntax: string.rstrip(characters)
    # characters: optional. A set of characters to remove from the end of the string.
    # When no argument is given, rstrip() removes trailing whitespace.
    # Example: txt = "banana,,,,,ssqqqww....."
    # Remove the trailing characters if they are commas, periods, s, q, or w:
    # x = txt.rstrip(",.qsw")
    # https://www.w3schools.com/python/ref_string_rstrip.asp

    sentence = sentence.rstrip() # whitespace is stripped by default when no argument is provided.


    ### END CODE HERE
    return sentence

In [55]:
# Test your function
remove_stopwords("I am about to go to the store and get any snack")

'go store get snack'

***Expected Output:***
```
'go store get snack'

```
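For comparison, the same idea can be written more compactly with a set lookup and `str.join`. This is only a sketch: `remove_stopwords_compact` and the shortened `STOPWORDS` set are hypothetical names introduced here, and the set holds just the stopwords needed for the test sentence rather than the full list used in the assignment.

```python
# A deliberately tiny stopword set for illustration; the assignment uses a longer list.
STOPWORDS = {"i", "am", "about", "to", "the", "and", "any"}

def remove_stopwords_compact(sentence, stopwords=STOPWORDS):
    """Lowercase the sentence and drop every word found in `stopwords`."""
    words = sentence.lower().split(" ")
    kept = [word for word in words if word not in stopwords]
    # join() places a single space between words; rstrip() drops trailing whitespace.
    return " ".join(kept).rstrip()

result = remove_stopwords_compact("I am about to go to the store and get any snack")
print(result)
```

Membership tests on a set take constant time on average, which may matter when the function runs over thousands of long articles.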

## Reading the raw data

Now you need to read the data from the csv file. To do so, complete the `parse_data_from_file` function.

A couple of things to note:
- You should omit the first line as it contains the headers and not data points.
- There is no need to save the data points as numpy arrays; regular lists are fine.
- To read from csv files use [`csv.reader`](https://docs.python.org/3/library/csv.html#csv.reader) by passing the appropriate arguments.
- `csv.reader` returns an iterator that yields one row per iteration. Each row is a list of strings, so the label can be accessed via `row[0]` and the text via `row[1]`.
- Use the `remove_stopwords` function on each sentence.
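The notes above can be sketched on in-memory data as follows. This is one possible approach, not the graded solution: `parse_rows` and the sample rows are hypothetical, the header is skipped with `next()`, and the stopword-removal step is left out to keep the example self-contained.

```python
import csv
import io

def parse_rows(csvfile):
    """Return (sentences, labels) from an open CSV file object, skipping the header."""
    reader = csv.reader(csvfile, delimiter=",")
    next(reader)  # omit the first line, which holds the headers
    sentences, labels = [], []
    for row in reader:
        labels.append(row[0])      # category
        sentences.append(row[1])   # article text
    return sentences, labels

# Hypothetical in-memory data standing in for the real file.
fake_file = io.StringIO("category,text\ntech,tv future\nsport,great match\n")
demo_sentences, demo_labels = parse_rows(fake_file)
print(demo_sentences, demo_labels)
```

Skipping the header before the loop avoids having to delete it from the lists afterwards.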

In [56]:
def parse_data_from_file(filename):
    
    import csv
    
    sentences = []
    labels = []
    
    with open(filename, 'r') as csvfile:
        ### START CODE HERE
        csv_reader = csv.reader(csvfile, delimiter = ',')

        for analyzed_row in csv_reader:
          
          # Iterate through all the rows of the csv_reader object; each one
          # is named 'analyzed_row' and is a list of strings:
          row_label = analyzed_row[0] # element with index 0 (the category)
          row_sentence = analyzed_row[1] # element with index 1 (the text)
          # Here, we should not apply the slicing analyzed_row[1:2] because it
          # will generate a list with a single element instead of a string. And we
          # must obtain a string to apply the split method.
          
          # Now, call function remove_stopwords for removing those words from
          # row_sentence:
          row_sentence = remove_stopwords(row_sentence)

          # Now append these values to the list:
          labels.append(row_label)
          sentences.append(row_sentence)

        # By now, the lists contain all rows, including the first one (the header).
        # Remove it with the del keyword, which deletes the list element at the
        # given index (indices start at 0). Syntax:
        # del list[index]
        # https://www.guru99.com/python-list-remove-clear-pop-del.html

        # Delete the elements with index 0 (the header) from labels and sentences:
        del labels[0]
        del sentences[0]
        
        ### END CODE HERE
    return sentences, labels

In [57]:
# Test your function
sentences, labels = parse_data_from_file("./bbc-text.csv")

print(f"There are {len(sentences)} sentences in the dataset.\n")
print(f"First sentence has {len(sentences[0].split())} words (after removing stopwords).\n")
print(f"There are {len(labels)} labels in the dataset.\n")
print(f"The first 5 labels are {labels[:5]}")

1st label = tech
1st sentence = tv future hands viewers home theatre systems  plasma high-definition tvs  digital video recorders moving living room  way people watch tv will radically different five years  time.  according expert panel gathered annual consumer electronics show las vegas discuss new technologies will impact one favourite pastimes. us leading trend  programmes content will delivered viewers via home networks  cable  satellite  telecoms companies  broadband service providers front rooms portable devices.  one talked-about technologies ces digital personal video recorders (dvr pvr). set-top boxes  like us s tivo uk s sky+ system  allow people record  store  play  pause forward wind tv programmes want.  essentially  technology allows much personalised tv. also built-in high-definition tv sets  big business japan us  slower take off europe lack high-definition programming. not can people forward wind adverts  can also forget abiding network channel schedules  putting togeth

***Expected Output:***
```
There are 2225 sentences in the dataset.

First sentence has 436 words (after removing stopwords).

There are 2225 labels in the dataset.

The first 5 labels are ['tech', 'business', 'sport', 'sport', 'entertainment']

```

## Using the Tokenizer

Now it is time to tokenize the sentences of the dataset. 

Complete the `fit_tokenizer` function below.

This function should receive the list of sentences as input and return a [Tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) that has been fitted to those sentences. You should also define the "Out of Vocabulary" token as `<OOV>`.
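To illustrate what an "Out of Vocabulary" token does, here is a toy word index written in plain Python. It only imitates the idea and is not the Keras implementation: `build_word_index` and `encode` are hypothetical helpers, the OOV token is assumed to take index 1, and punctuation filtering is ignored.

```python
from collections import Counter

def build_word_index(sentences, oov_token="<OOV>"):
    """Map each word to an integer, most frequent first, reserving index 1 for OOV."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    index = {oov_token: 1}
    for word, _ in counts.most_common():
        index[word] = len(index) + 1
    return index

def encode(sentence, word_index, oov_token="<OOV>"):
    """Replace each word by its index; unseen words get the OOV index."""
    return [word_index.get(w, word_index[oov_token]) for w in sentence.lower().split()]

toy_index = build_word_index(["tv future", "tv viewers"])
encoded = encode("tv radio", toy_index)
print(toy_index)
print(encoded)
```

The word `radio` never appeared while building the index, so it is encoded with the OOV index instead of being dropped.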

In [58]:
def fit_tokenizer(sentences):
    
    from tensorflow.keras.preprocessing.text import Tokenizer
    ### START CODE HERE
    # Instantiate the Tokenizer class by passing in the oov_token argument
    # We will not set a maximum num_words here.
    tokenizer = Tokenizer(oov_token="<OOV>")
    # oov: out of vocabulary. It should be a special token, different from any possible word,
    # that stands in for words that were not seen when the word index was created.
    # Without an OOV token, such words would simply be dropped when encoding a sequence.
    # With it, they are encoded as the special token oov_token instead.

    # Fit on the sentences
    # Tokenize the input sentences
    tokenizer.fit_on_texts(sentences)
    
    ### END CODE HERE
    return tokenizer

In [59]:
tokenizer = fit_tokenizer(sentences)
word_index = tokenizer.word_index

print(f"Vocabulary contains {len(word_index)} words\n")
print("<OOV> token included in vocabulary" if "<OOV>" in word_index else "<OOV> token NOT included in vocabulary")

Vocabulary contains 29714 words

<OOV> token included in vocabulary


***Expected Output:***
```
Vocabulary contains 29714 words

<OOV> token included in vocabulary

```

In [60]:
def get_padded_sequences(tokenizer, sentences):
    
    ### START CODE HERE
    # Convert sentences to sequences
    sequences = tokenizer.texts_to_sequences(sentences)
    
    # Pad the sequences using the post padding strategy
    # We do not set a maximum length (maxlen) here, so every sequence is padded to the
    # length of the longest one and truncating = 'post' only takes effect if maxlen is set.
    padded_sequences = pad_sequences(sequences, padding = 'post', truncating = 'post')
    # Keras requires inputs of the same length: the same number of variables (columns), images of
    # constant dimensions, or text sequences with the same length.
    # maxlen: sets the maximum length allowed for the sequences. Longer sequences are truncated
    # to the number of tokens specified as maxlen. If maxlen = 5, the sequences are
    # allowed to have only 5 tokens (5 words, basically).

    # By default, both padding and truncating act on the beginning of the sequence ('pre'): zeros are
    # added to the left of the shorter sequences, before their first words, until they reach the
    # maximum length. In turn, sequences longer than the maximum length are cropped (truncated) on
    # their left, i.e., their first words are removed until they reach the maximum allowed length.
    # By specifying padding = 'post', we tell Keras that the zeros should be added at the end (far right)
    # of the sequences, so only what comes after the sentence is 'modified'
    # (filled with zeros to reach the maximum length).

    # In turn, by specifying truncating = 'post', we tell Keras to remove only the last (far-right)
    # words, so the beginning of a sentence longer than the maximum length is kept (we lose
    # the final part of the text instead of its beginning).

    ### END CODE HERE
    
    return padded_sequences

In [61]:
padded_sequences = get_padded_sequences(tokenizer, sentences)
print(f"First padded sequence looks like this: \n\n{padded_sequences[0]}\n")
print(f"Numpy array of all sequences has shape: {padded_sequences.shape}\n")
print(f"This means there are {padded_sequences.shape[0]} sequences in total and each one has a size of {padded_sequences.shape[1]}")

First padded sequence looks like this: 

[  96  176 1157 ...    0    0    0]

Numpy array of all sequences has shape: (2225, 2438)

This means there are 2225 sequences in total and each one has a size of 2438


***Expected Output:***
```
First padded sequence looks like this: 

[  96  176 1157 ...    0    0    0]

Numpy array of all sequences has shape: (2225, 2438)

This means there are 2225 sequences in total and each one has a size of 2438

```
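The effect of post padding and post truncation can be seen on a tiny example. The sketch below re-creates the behaviour in plain Python for illustration only; `pad_post` is a hypothetical helper, not part of Keras, and it returns nested lists rather than a numpy array.

```python
def pad_post(sequences, maxlen=None, value=0):
    """Pad (and, if needed, truncate) each sequence at its end to a common length."""
    if maxlen is None:
        # With no maximum length given, use the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        trimmed = seq[:maxlen]  # 'post' truncation keeps the beginning
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

toy_padded = pad_post([[96, 176, 1157], [5], [7, 8]])
toy_truncated = pad_post([[96, 176, 1157], [5], [7, 8]], maxlen=2)
print(toy_padded)
print(toy_truncated)
```

Without `maxlen`, nothing is truncated and every row grows to the longest length, which is why the dataset's padded array has as many columns as its longest article has tokens.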

In [64]:
def tokenize_labels(labels):
    ### START CODE HERE
    from tensorflow.keras.preprocessing.text import Tokenizer

    # Instantiate the Tokenizer class
    # No additional arguments are needed when tokenizing the labels:
    # every label is known in advance, so do not pass an OOV (out of vocabulary) token.
    label_tokenizer = Tokenizer()
    
    # Fit the tokenizer to the labels
    label_tokenizer.fit_on_texts(labels)
    
    # Save the word index
    label_word_index = label_tokenizer.word_index
    
    # Save the sequences
    label_sequences = label_tokenizer.texts_to_sequences(labels)

    ### END CODE HERE
    
    return label_sequences, label_word_index

In [65]:
label_sequences, label_word_index = tokenize_labels(labels)
print(f"Vocabulary of labels looks like this {label_word_index}\n")
print(f"First ten sequences {label_sequences[:10]}\n")

Vocabulary of labels looks like this {'sport': 1, 'business': 2, 'politics': 3, 'tech': 4, 'entertainment': 5}

First ten sequences [[4], [2], [1], [1], [5], [3], [3], [1], [1], [5]]



***Expected Output:***
```
Vocabulary of labels looks like this {'sport': 1, 'business': 2, 'politics': 3, 'tech': 4, 'entertainment': 5}

First ten sequences [[4], [2], [1], [1], [5], [3], [3], [1], [1], [5]]

```
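As a quick sanity check, the label sequences can be mapped back to category names by inverting the word index. The sketch below hard-codes the vocabulary and the first five sequences from the output above; `index_to_label` and `decoded` are names introduced here for illustration.

```python
# Values copied from the printed output above.
label_word_index = {'sport': 1, 'business': 2, 'politics': 3, 'tech': 4, 'entertainment': 5}
label_sequences = [[4], [2], [1], [1], [5]]

# Invert the mapping: integer index -> category name.
index_to_label = {index: label for label, index in label_word_index.items()}

# Each sequence holds a single token, so take its first element.
decoded = [index_to_label[seq[0]] for seq in label_sequences]
print(decoded)
```

The decoded names match the first five labels reported when the raw data was read.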

**Congratulations on finishing this week's assignment!**

You have successfully implemented functions for processing text data: removing stopwords, reading from raw files, and tokenizing text.

**Keep it up!**