# Feature Engineering

The next step is to create features from the raw text so we can train the machine learning models. The steps followed are:

1. **Text Cleaning and Preparation**: cleaning of special characters, downcasing, removal of punctuation signs, possessive pronouns and stop words, and lemmatization.
2. **Label coding**: creation of a dictionary to map each category to a code.
3. **Train-test split**: to test the models on unseen data.
4. **Text representation**: use of TF-IDF scores to represent text.

In [1]:
import pickle
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import numpy as np

First of all we'll load the dataset:

In [2]:
path_df = "./Pickles/all_articles_processed.pickle"

with open(path_df, 'rb') as data:
    df = pickle.load(data)

In [3]:
df = df.rename(columns={'article': 'Content', 'category': 'Category'})
df = df.reset_index()
df.head()

|   | index | source | title | Content | Category | category_code |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | The Straits Times | Sales for Handmaid's Tale sequel top 125,000 c... | new york ap sales margaret atwoods testament... | Lifestyle | 3 |
| 1 | 1 | The Straits Times | R. Kelly a no-show in court on Minnesota solic... | minneapolis ap singer r kelly noshow initial... | Lifestyle | 3 |
| 2 | 2 | The Straits Times | HK director Derek Tsang picks forest in Japan ... | soul mate director derek tsang know ask team... | Lifestyle | 3 |
| 3 | 3 | The Straits Times | Tony Hadley, ex-frontman of Spandau Ballet, to... | singapore voice behind spandau ballet hit tr... | Lifestyle | 3 |
| 4 | 4 | The Straits Times | South Korean actor Sung Hoon holding meet-and-... | singapore south korean heartthrob sing hoon ... | Lifestyle | 3 |


In [4]:
list(df.columns)

['index', 'source', 'title', 'Content', 'Category', 'category_code']

And visualize one sample news content:

In [5]:
df.loc[1]['Content']

' minneapolis ap  singer r kelly noshow initial court appearance minnesota case accuse offer yearold girl us take clothe dance  kelly jail chicago sexual abuse count charge minnesota august solicit girl meet concert minneapolis kelly whose full name robert sylvester kelly face previously file federal state charge new york chicago     prosecutor judith cole tell judge jay quam thursdays sept  brief hear federal authorities illinois give us access case resolve judge issue bench warrant formality kellys attorney steve greenberg didnt attend hear isnt officially register minnesota case say never get notice       spokesman county attorneys office say summon send kellys last know address '

## 1. Text cleaning and preparation

### 1.1. Special character cleaning

We can see the following special characters:

* ``\r``
* ``\n``
* ``\`` before possessive pronouns (`government's = government\'s`)
* ``\`` before possessive pronouns 2 (`Yukos'` = `Yukos\'`)
* ``"`` when quoting text

In [6]:
# \r and \n
df['Content_Parsed_1'] = df['Content'].str.replace("\r", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("\n", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("    ", " ")

Regarding the third and fourth bullets: although a backslash shows up in the printed representation, it is only an escape for the quote and not a *real* character in the string, so it won't affect us:

In [7]:
text = "Mr Greenspan\'s"
text

"Mr Greenspan's"

In [8]:
# " when quoting text
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace('"', '')
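
The point made with the `Mr Greenspan\'s` example above can be checked directly. This is a small standalone sketch; the variable names are chosen here for illustration and are not part of the notebook.

```python
escaped = "Mr Greenspan\'s"   # the backslash escapes the quote in the source code only
plain = "Mr Greenspan's"
raw = r"Mr Greenspan\'s"      # a raw string keeps the backslash as a real character

print(escaped == plain)        # True
print(len(escaped), len(raw))  # 14 15
```

Only a raw string (or a doubled backslash) would put an actual backslash into the text, which is why no cleaning step is needed for this case.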

### 1.2. Upcase/downcase

We'll downcase the texts because we want, for example, `Football` and `football` to be the same word.

In [9]:
# Lowercasing the text
df['Content_Parsed_2'] = df['Content_Parsed_1'].str.lower()

### 1.3. Punctuation signs

Punctuation signs won't have any predictive power, so we'll just get rid of them.

In [10]:
punctuation_signs = list("?:!.,;")
df['Content_Parsed_3'] = df['Content_Parsed_2']

for punct_sign in punctuation_signs:
    # regex=False: "?" and "." must be treated literally, not as regex metacharacters
    df['Content_Parsed_3'] = df['Content_Parsed_3'].str.replace(punct_sign, '', regex=False)

By doing this we also alter some numbers (decimal points and thousands separators are removed), but that's not a problem since we don't expect any predictive power from them.
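
To make the side effect on numbers concrete, here is a minimal standard-library sketch that strips the same six signs with `str.translate`. It is independent of the pandas code above, and the sample sentence is made up for illustration.

```python
# Same six punctuation signs as in the notebook cell above
punctuation_signs = "?:!.,;"

# With a third argument, str.maketrans builds a table that deletes those characters
delete_table = str.maketrans("", "", punctuation_signs)

def strip_punctuation(text):
    """Remove every occurrence of the listed punctuation signs."""
    return text.translate(delete_table)

sample = "Shares rose 2.5%, to $1,250; why?"  # hypothetical sentence
cleaned = strip_punctuation(sample)
print(cleaned)  # Shares rose 25% to $1250 why
```

Note how `2.5` becomes `25` and `1,250` becomes `1250`: the digits survive, but the values they spell change.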

### 1.4. Possessive pronouns

We'll also remove the possessive ending `'s`:

In [11]:
df['Content_Parsed_4'] = df['Content_Parsed_3'].str.replace("'s", "")
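
A plain replacement of `'s` removes that character pair wherever it occurs, including at the start of a quoted word. As a hedged alternative, the sketch below anchors the match at the end of a word with `\b`. The helper name and the sample text are assumptions made for this example.

```python
import re

# Matches 's only when it ends a word, e.g. "government's" -> "government"
possessive_end = re.compile(r"'s\b")

def strip_possessive(text):
    """Drop the possessive ending without touching other quote-plus-s sequences."""
    return possessive_end.sub("", text)

sample = "he said 'sorry' about the government's plan"  # hypothetical text

anchored = strip_possessive(sample)
naive = sample.replace("'s", "")

print(anchored)  # he said 'sorry' about the government plan
print(naive)     # he said orry' about the government plan
```

With pandas, the same pattern could be passed as `df['Content_Parsed_3'].str.replace(r"'s\b", "", regex=True)`.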

### 1.5. Stemming and Lemmatization

Since stemming can produce output words that don't exist, we'll only use a lemmatization process at this moment. Lemmatization takes into consideration the morphological analysis of the words and returns words that do exist, so it will be more useful for us.

In [12]:
# Downloading punkt and wordnet from NLTK
nltk.download('punkt')
print("------------------------------------------------------------")
nltk.download('wordnet')

------------------------------------------------------------


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\darry\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\darry\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [13]:
# Saving the lemmatizer into an object
wordnet_lemmatizer = WordNetLemmatizer()

In order to lemmatize, we have to iterate through every word:

In [14]:
nrows = len(df)
lemmatized_text_list = []

for row in range(0, nrows):
    
    # Create an empty list containing lemmatized words
    lemmatized_list = []
    
    # Save the text and its words into an object
    text = df.loc[row]['Content_Parsed_4']
    text_words = text.split(" ")

    # Iterate through every word to lemmatize
    for word in text_words:
        lemmatized_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
        
    # Join the list
    lemmatized_text = " ".join(lemmatized_list)
    
    # Append to the list containing the texts
    lemmatized_text_list.append(lemmatized_text)

In [15]:
df['Content_Parsed_5'] = lemmatized_text_list

Although lemmatization doesn't work perfectly in all cases (as can be seen in the example below), it can be useful.
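
The row-by-row loop above can also be written as a small helper that takes the lemmatizing function as an argument. The sketch below uses a toy lookup table as a stand-in for a real lemmatizer so that it runs without NLTK data; `lemmatize_text`, `toy_lemmas` and `toy_lemmatize` are hypothetical names.

```python
def lemmatize_text(text, lemmatize):
    """Apply `lemmatize` to every space-separated token and join the result."""
    return " ".join(lemmatize(word) for word in text.split(" "))

# Hypothetical stand-in for a real lemmatizer: a tiny lookup table
toy_lemmas = {"was": "be", "running": "run", "ate": "eat"}

def toy_lemmatize(word):
    # Words missing from the table are returned unchanged
    return toy_lemmas.get(word, word)

result = lemmatize_text("she was running and ate", toy_lemmatize)
print(result)  # she be run and eat
```

With the notebook's objects, the helper could be applied as `df['Content_Parsed_4'].apply(lambda t: lemmatize_text(t, lambda w: wordnet_lemmatizer.lemmatize(w, pos="v")))`.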

### 1.6. Stop words

In [16]:
# Downloading the stop words list
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\darry\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [17]:
# Loading the stop words in english
stop_words = list(stopwords.words('english'))

In [18]:
stop_words[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

To remove the stop words, we'll use a regular expression that only matches whole words, as seen in the following example:

In [19]:
example = "me eating a meal"
word = "me"

# The regular expression is:
regex = r"\b" + word + r"\b"  # we need to build it like that to work properly

re.sub(regex, "StopWord", example)

'StopWord eating a meal'

We can now loop through all the stop words:

In [20]:
df['Content_Parsed_6'] = df['Content_Parsed_5']

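for stop_word in stop_words:

    regex_stopword = r"\b" + stop_word + r"\b"
    # regex=True is required: recent pandas versions treat the pattern literally by default
    df['Content_Parsed_6'] = df['Content_Parsed_6'].str.replace(regex_stopword, '', regex=True)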

We have some double/triple spaces between words because of the replacements. However, it's not a problem because we'll tokenize by the spaces later.
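
As an alternative to one pass per stop word, all stop words can be combined into a single compiled pattern, and the leftover runs of spaces can be collapsed in the same helper. This is a minimal sketch with a shortened, illustrative stop word list; `remove_stop_words` is a hypothetical name.

```python
import re

# A few stop words for illustration; the notebook uses NLTK's full English list
stop_words = ["me", "a", "the", "you're"]

# re.escape guards against regex metacharacters inside a stop word
pattern = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in stop_words) + r")\b")

def remove_stop_words(text):
    """Delete whole-word stop words, then collapse the spaces left behind."""
    without = pattern.sub("", text)
    return re.sub(r"\s+", " ", without).strip()

result = remove_stop_words("me eating a meal at the park")
print(result)  # eating meal at park
```

The word boundaries keep `meal` and `at` intact even though they contain `me` and `a`. pandas' `str.replace` also accepts a compiled pattern when `regex=True` is passed.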

As an example, we'll show an original news article and its modifications throughout the process:

In [21]:
df.loc[5]['Content']

' jacqueline wong kenneth reconcile since catch kiss singer andy hui taxi april career front reason happy inadvertently benefit current tense situation hong kong wongs show put cold storage employer tvb scandal break     hong kong face demonstrations broadcaster careful screen show depict police triads link assault prodemocracy supporters yuen long mtr station july still show must go tvb reportedly reshuffle program include tap show feature wong come news show find voice slat primetime air oct       ask wong would return unite state  flee scandal  promote show producer tell oriental daily news tvb tell could  \xa0        relate story jacqueline wong set early comeback tvb put hold show police triads     \xa0   relate story fan urge kenneth date actress natalie tong breakup jacqueline wong     \xa0   relate story kenneth longer consider jacqueline wong girlfriend friends    could contact get reply wong send text message inform find voice receive rollout clearance meanwhile tap star sequ

1. Special character cleaning

In [22]:
df.loc[5]['Content_Parsed_1']

' jacqueline wong kenneth reconcile since catch kiss singer andy hui taxi april career front reason happy inadvertently benefit current tense situation hong kong wongs show put cold storage employer tvb scandal break  hong kong face demonstrations broadcaster careful screen show depict police triads link assault prodemocracy supporters yuen long mtr station july still show must go tvb reportedly reshuffle program include tap show feature wong come news show find voice slat primetime air oct    ask wong would return unite state  flee scandal  promote show producer tell oriental daily news tvb tell could  \xa0  relate story jacqueline wong set early comeback tvb put hold show police triads  \xa0   relate story fan urge kenneth date actress natalie tong breakup jacqueline wong  \xa0   relate story kenneth longer consider jacqueline wong girlfriend friends could contact get reply wong send text message inform find voice receive rollout clearance meanwhile tap star sequel hit tvb drama want

2. Upcase/downcase

In [23]:
df.loc[5]['Content_Parsed_2']

' jacqueline wong kenneth reconcile since catch kiss singer andy hui taxi april career front reason happy inadvertently benefit current tense situation hong kong wongs show put cold storage employer tvb scandal break  hong kong face demonstrations broadcaster careful screen show depict police triads link assault prodemocracy supporters yuen long mtr station july still show must go tvb reportedly reshuffle program include tap show feature wong come news show find voice slat primetime air oct    ask wong would return unite state  flee scandal  promote show producer tell oriental daily news tvb tell could  \xa0  relate story jacqueline wong set early comeback tvb put hold show police triads  \xa0   relate story fan urge kenneth date actress natalie tong breakup jacqueline wong  \xa0   relate story kenneth longer consider jacqueline wong girlfriend friends could contact get reply wong send text message inform find voice receive rollout clearance meanwhile tap star sequel hit tvb drama want

3. Punctuation signs

In [24]:
df.loc[5]['Content_Parsed_3']

' jacqueline wong kenneth reconcile since catch kiss singer andy hui taxi april career front reason happy inadvertently benefit current tense situation hong kong wongs show put cold storage employer tvb scandal break  hong kong face demonstrations broadcaster careful screen show depict police triads link assault prodemocracy supporters yuen long mtr station july still show must go tvb reportedly reshuffle program include tap show feature wong come news show find voice slat primetime air oct    ask wong would return unite state  flee scandal  promote show producer tell oriental daily news tvb tell could  \xa0  relate story jacqueline wong set early comeback tvb put hold show police triads  \xa0   relate story fan urge kenneth date actress natalie tong breakup jacqueline wong  \xa0   relate story kenneth longer consider jacqueline wong girlfriend friends could contact get reply wong send text message inform find voice receive rollout clearance meanwhile tap star sequel hit tvb drama want

4. Possessive pronouns

In [25]:
df.loc[5]['Content_Parsed_4']

' jacqueline wong kenneth reconcile since catch kiss singer andy hui taxi april career front reason happy inadvertently benefit current tense situation hong kong wongs show put cold storage employer tvb scandal break  hong kong face demonstrations broadcaster careful screen show depict police triads link assault prodemocracy supporters yuen long mtr station july still show must go tvb reportedly reshuffle program include tap show feature wong come news show find voice slat primetime air oct    ask wong would return unite state  flee scandal  promote show producer tell oriental daily news tvb tell could  \xa0  relate story jacqueline wong set early comeback tvb put hold show police triads  \xa0   relate story fan urge kenneth date actress natalie tong breakup jacqueline wong  \xa0   relate story kenneth longer consider jacqueline wong girlfriend friends could contact get reply wong send text message inform find voice receive rollout clearance meanwhile tap star sequel hit tvb drama want

5. Stemming and Lemmatization

In [26]:
df.loc[5]['Content_Parsed_5']

' jacqueline wong kenneth reconcile since catch kiss singer andy hui taxi april career front reason happy inadvertently benefit current tense situation hong kong wongs show put cold storage employer tvb scandal break  hong kong face demonstrations broadcaster careful screen show depict police triads link assault prodemocracy supporters yuen long mtr station july still show must go tvb reportedly reshuffle program include tap show feature wong come news show find voice slat primetime air oct    ask wong would return unite state  flee scandal  promote show producer tell oriental daily news tvb tell could  \xa0  relate story jacqueline wong set early comeback tvb put hold show police triads  \xa0   relate story fan urge kenneth date actress natalie tong breakup jacqueline wong  \xa0   relate story kenneth longer consider jacqueline wong girlfriend friends could contact get reply wong send text message inform find voice receive rollout clearance meanwhile tap star sequel hit tvb drama want

6. Stop words

In [27]:
df.loc[5]['Content_Parsed_6']

' jacqueline wong kenneth reconcile since catch kiss singer andy hui taxi april career front reason happy inadvertently benefit current tense situation hong kong wongs show put cold storage employer tvb scandal break  hong kong face demonstrations broadcaster careful screen show depict police triads link assault prodemocracy supporters yuen long mtr station july still show must go tvb reportedly reshuffle program include tap show feature wong come news show find voice slat primetime air oct    ask wong would return unite state  flee scandal  promote show producer tell oriental daily news tvb tell could  \xa0  relate story jacqueline wong set early comeback tvb put hold show police triads  \xa0   relate story fan urge kenneth date actress natalie tong breakup jacqueline wong  \xa0   relate story kenneth longer consider jacqueline wong girlfriend friends could contact get reply wong send text message inform find voice receive rollout clearance meanwhile tap star sequel hit tvb drama want

Finally, we can delete the intermediate columns:

In [28]:
df.head(1)

|   | index | source | title | Content | Category | category_code | Content_Parsed_1 | Content_Parsed_2 | Content_Parsed_3 | Content_Parsed_4 | Content_Parsed_5 | Content_Parsed_6 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | The Straits Times | Sales for Handmaid's Tale sequel top 125,000 c... | new york ap sales margaret atwoods testament... | Lifestyle | 3 | new york ap sales margaret atwoods testament... | new york ap sales margaret atwoods testament... | new york ap sales margaret atwoods testament... | new york ap sales margaret atwoods testament... | new york ap sales margaret atwoods testament... | new york ap sales margaret atwoods testament... |


In [29]:
list_columns = ["Category", "Content", "Content_Parsed_6"]
df = df[list_columns]

df = df.rename(columns={'Content_Parsed_6': 'Content_Parsed'})

In [30]:
df.head()

|   | Category | Content | Content_Parsed |
| --- | --- | --- | --- |
| 0 | Lifestyle | new york ap sales margaret atwoods testament... | new york ap sales margaret atwoods testament... |
| 1 | Lifestyle | minneapolis ap singer r kelly noshow initial... | minneapolis ap singer r kelly noshow initial... |
| 2 | Lifestyle | soul mate director derek tsang know ask team... | soul mate director derek tsang know ask team... |
| 3 | Lifestyle | singapore voice behind spandau ballet hit tr... | singapore voice behind spandau ballet hit tr... |
| 4 | Lifestyle | singapore south korean heartthrob sing hoon ... | singapore south korean heartthrob sing hoon ... |


**IMPORTANT:**

We need to remember that our model will be fed the latest news articles gathered from different newspapers whenever we run it. For that reason, we need to account not only for the peculiarities of the training set articles, but also for those that may appear in the newly gathered articles.

For this reason, possible peculiarities have been studied in the *05. News Scraping* folder.

## 2. Label coding

We'll create a dictionary with the label codification:

In [31]:
category_codes = {
    'Singapore': 1,
    'Sports': 2,
    'Lifestyle': 3,
    'World': 4,
    'Business': 5,
    'Technology': 4  # NOTE: same code as 'World', so the two categories are merged into one class downstream
}

In [32]:
# Category mapping
df['Category_Code'] = df['Category']
df = df.replace({'Category_Code':category_codes})

In [33]:
df.head()

|   | Category | Content | Content_Parsed | Category_Code |
| --- | --- | --- | --- | --- |
| 0 | Lifestyle | new york ap sales margaret atwoods testament... | new york ap sales margaret atwoods testament... | 3 |
| 1 | Lifestyle | minneapolis ap singer r kelly noshow initial... | minneapolis ap singer r kelly noshow initial... | 3 |
| 2 | Lifestyle | soul mate director derek tsang know ask team... | soul mate director derek tsang know ask team... | 3 |
| 3 | Lifestyle | singapore voice behind spandau ballet hit tr... | singapore voice behind spandau ballet hit tr... | 3 |
| 4 | Lifestyle | singapore south korean heartthrob sing hoon ... | singapore south korean heartthrob sing hoon ... | 3 |
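
Because the codes are typed by hand, a quick sanity check can show whether any two labels share a code. The sketch below runs on the dictionary exactly as defined above; `find_shared_codes` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict

# Mapping copied from the notebook cell above
category_codes = {
    'Singapore': 1,
    'Sports': 2,
    'Lifestyle': 3,
    'World': 4,
    'Business': 5,
    'Technology': 4,
}

def find_shared_codes(mapping):
    """Return {code: [labels]} for every code used by more than one label."""
    labels_by_code = defaultdict(list)
    for label, code in mapping.items():
        labels_by_code[code].append(label)
    return {code: labels for code, labels in labels_by_code.items() if len(labels) > 1}

shared = find_shared_codes(category_codes)
print(shared)  # {4: ['World', 'Technology']}
```

Any label pair reported here is treated as a single class by everything that consumes `Category_Code`.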


## 3. Train - test split

We'll set apart a test set to assess the quality of our models. We'll run cross-validation on the training set to tune the hyperparameters and then measure performance on the unseen data of the test set.

In [34]:
X_train, X_test, y_train, y_test = train_test_split(df['Content_Parsed'], 
                                                    df['Category_Code'], 
                                                    test_size=0.15, 
                                                    random_state=8)

Since the dataset is fairly small (10,919 articles in total, per the shapes printed in section 4), we'll use 15% of it as the test set.
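
The resulting split sizes can be reproduced with a little arithmetic. The sketch assumes that the test count is rounded up and the training set receives the remainder, which is how I understand `train_test_split` to behave when only `test_size` is given; the row counts are taken from the shapes printed in section 4.

```python
import math

n_samples = 9281 + 1638   # train and test rows reported in section 4
test_size = 0.15

# Assumption: the test count is rounded up, the training set gets the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_samples, n_train, n_test)  # 10919 9281 1638
```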

## 4. Text representation

We have various options:

* Count Vectors as features
* TF-IDF Vectors as features
* Word Embeddings as features
* Text / NLP based features
* Topic Models as features

We'll use **TF-IDF Vectors** as features.

We have to define the different parameters:

* `ngram_range`: We want to consider both unigrams and bigrams.
* `max_df`: When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold.
* `min_df`: When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold.
* `max_features`: If not `None`, build a vocabulary that only considers the top `max_features` terms ordered by term frequency across the corpus.

See `TfidfVectorizer?` for further detail.

It needs to be mentioned that we are implicitly scaling our data when representing it as TF-IDF features with the argument `norm`.
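
To show what the `norm` and `sublinear_tf` arguments do, here is a pure-Python sketch of TF-IDF on a toy corpus. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` and the sublinear term frequency `1 + ln(tf)`, which is my reading of scikit-learn's documented behaviour; `tfidf_rows` and the corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with sublinear tf, smoothed idf and L2 row normalisation."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    vocab = sorted({tok for doc in tokenised for tok in doc})
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(tok for doc in tokenised for tok in set(doc))
    # Smoothed idf, as if one extra document contained every term
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        # Sublinear tf replaces the raw count tf with 1 + ln(tf)
        weights = [(1 + math.log(counts[t])) * idf[t] if counts[t] else 0.0 for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm for w in weights])  # L2: each row has unit length
    return vocab, rows

docs = ["trade tariff trade", "world cup match", "trade cup"]  # toy corpus
vocab, rows = tfidf_rows(docs)
print(vocab)
print([round(sum(w * w for w in row), 6) for row in rows])  # every squared norm is 1.0
```

Because every row is rescaled to unit length, long and short articles end up on the same scale, which is the implicit scaling mentioned above.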

In [35]:
# Parameter selection
ngram_range = (1,2)
min_df = 10
max_df = 1.
max_features = 20000

We have chosen these values as a first approximation. Since the models we develop later already have very good predictive power, we'll stick to them, although other combinations could be tried to improve accuracy further.

In [36]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(9281, 20000)
(1638, 20000)


Please note that we have fitted and then transformed the training set, but we have **only transformed** the **test set**.
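
The reason for fitting on the training set only is that the vocabulary (and, for TF-IDF, the document frequencies) must not be influenced by the test data. The toy class below is a hypothetical stand-in for a vectorizer, written in plain Python to make the fit/transform split visible; it counts terms and is not scikit-learn's implementation.

```python
class ToyCountVectorizer:
    """Hypothetical stand-in for a vectorizer: learns a vocabulary, then counts."""

    def fit(self, docs):
        # The vocabulary is learned from the documents passed to fit, and only those
        tokens = sorted({tok for doc in docs for tok in doc.split()})
        self.vocabulary_ = {tok: i for i, tok in enumerate(tokens)}
        return self

    def transform(self, docs):
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for tok in doc.split():
                if tok in self.vocabulary_:   # unseen tokens are silently dropped
                    row[self.vocabulary_[tok]] += 1
            rows.append(row)
        return rows

train_docs = ["trade tariff", "world cup"]
test_docs = ["trade blockchain cup"]   # "blockchain" never appears in training

vectorizer = ToyCountVectorizer().fit(train_docs)
train_rows = vectorizer.transform(train_docs)
test_rows = vectorizer.transform(test_docs)

print(vectorizer.vocabulary_)  # {'cup': 0, 'tariff': 1, 'trade': 2, 'world': 3}
print(test_rows)               # [[1, 0, 1, 0]]
```

The test matrix has the same number of columns as the training matrix, and the unseen word contributes nothing, which mirrors what happens to genuinely new articles at prediction time.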

We can use the chi-squared test to see which unigrams and bigrams are most correlated with each category:

In [37]:
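from sklearn.feature_selection import chi2
import numpy as np

# Feature names in column order (get_feature_names_out needs scikit-learn >= 1.0;
# older releases expose get_feature_names instead)
all_feature_names = np.array(tfidf.get_feature_names_out())

for category, category_id in sorted(category_codes.items()):
    # One-vs-rest chi-squared test: this category's code against all others
    features_chi2 = chi2(features_train, labels_train == category_id)
    # Sort ascending by the chi-squared statistic, so the strongest terms come last
    indices = np.argsort(features_chi2[0])
    feature_names = all_feature_names[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}' category:".format(category))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-5:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-2:])))
    print("")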


# 'Business' category:
  . Most correlated unigrams:
. billion
. per
. cent
. trade
. tariff
  . Most correlated bigrams:
. trade pact
. per cent

# 'Lifestyle' category:
  . Most correlated unigrams:
. actress
. movie
. actor
. singer
. film
  . Most correlated bigrams:
. post instagram
. relate story

# 'Singapore' category:
  . Most correlated unigrams:
. jul
. mr
. scdf
. jail
. singapore
  . Most correlated bigrams:
. singapore civil
. years fin

# 'Sports' category:
  . Most correlated unigrams:
. cup
. match
. win
. champion
. league
  . Most correlated bigrams:
. us open
. world cup

# 'Technology' category:
  . Most correlated unigrams:
. security
. government
. protesters
. protest
. singapore
  . Most correlated bigrams:
. fire tear
. tear gas

# 'World' category:
  . Most correlated unigrams:
. security
. government
. protesters
. protest
. singapore
  . Most correlated bigrams:
. fire tear
. tear gas
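
For intuition about the ranking, the statistic can be computed by hand on a tiny count matrix. The sketch below compares, per feature, the observed per-class totals with the totals expected if the feature were independent of the class. `chi2_scores` and the toy data are hypothetical, and the function assumes every feature occurs at least once.

```python
def chi2_scores(X, y):
    """Chi-squared statistic per feature for non-negative features X and labels y."""
    n_samples, n_features = len(X), len(X[0])
    classes = sorted(set(y))
    feature_totals = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for c in classes:
        class_prob = sum(1 for label in y if label == c) / n_samples
        for j in range(n_features):
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_prob * feature_totals[j]   # independence assumption
            scores[j] += (observed - expected) ** 2 / expected
    return scores

# Toy term-count matrix: columns are the hypothetical terms "cup" and "say"
X = [[3, 1], [2, 1], [0, 1], [0, 1]]
y = [True, True, False, False]   # plays the role of labels_train == category_id

scores = chi2_scores(X, y)
print(scores)  # [5.0, 0.0]
```

The term that appears only in the positive class gets a high score, while the term spread evenly across both classes scores zero.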



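As we can see, the unigrams correspond well to their category, while the bigrams are less consistent. Note that 'Technology' and 'World' print identical terms because both are mapped to code 4 in `category_codes`, so the test cannot tell them apart. If we inspect the bigrams in our features: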

In [38]:
bigrams

['ongoing trade',
 'saturday night',
 'cost around',
 'foreign investment',
 'could add',
 'become one',
 'macron say',
 'nearly per',
 'get way',
 'also create',
 'two others',
 'auto part',
 'say hear',
 'year could',
 'post per',
 'metoo movement',
 'estimate per',
 'years also',
 'say past',
 'time around',
 'wear mask',
 'two three',
 'china india',
 'one place',
 'rate increase',
 'even try',
 'hold per',
 'year find',
 'take months',
 'least four',
 'want come',
 'cultural heritage',
 'say remain',
 'ask court',
 'record us',
 'across world',
 'stream television',
 'almost years',
 'end last',
 'year accord',
 'areas like',
 'know whether',
 'group include',
 'home many',
 'like see',
 'midautumn festival',
 'earlier wednesday',
 'good thing',
 'british pound',
 'pass away',
 'plunge per',
 'effect climate',
 'quarter compare',
 'use fund',
 'say reason',
 'another two',
 'quarter say',
 'malaysian authorities',
 'worth million',
 'private sector',
 'say want',
 'billion worth',

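The list is long: with `max_features` set to 20000 and `min_df` set to 10, many bigrams make it into the vocabulary, not just the two printed per category. Because `bigrams` is sorted by ascending chi-squared score, the entries at the top of the list are generic word pairs with little association with the category, and the informative ones sit at the end.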

Let's save the files we'll need in the next steps: