# Feature Engineering

The next step is to create features from the raw text so we can train the machine learning models. The steps followed are:

1. **Text Cleaning and Preparation**: cleaning of special characters, downcasing, removal of punctuation signs, possessive endings and stop words, and lemmatization.
2. **Label coding**: creation of a dictionary to map each category to a code.
3. **Train-test split**: to test the models on unseen data.
4. **Text representation**: use of TF-IDF scores to represent text.

In [46]:
import pickle
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import numpy as np

First of all we'll load the dataset:

In [47]:
path_df = "/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/02. Exploratory Data Analysis/News_dataset.pickle16"

with open(path_df, 'rb') as data:
    df = pickle.load(data)

In [48]:
df.head()

Unnamed: 0,File_Name,Content,Category,Complete_Filename,id,News_length
0,Neha_Assign_AutonomusCar.txt,"A year ago, Detroit and Silicon Valley had vis...",Baleno,Neha_Assign_AutonomusCar.txt-Baleno,1,8503
1,Neha_Assign_AutonomusCar2.txt,Artificial intelligence (AI) is used in a wide...,Suziki,Neha_Assign_AutonomusCar2.txt-Suziki,1,10366


And visualize one sample news content:

In [49]:
df.loc[1]['Content']

'Artificial intelligence (AI) is used in a wide variety of products and services, including maps embedded on our smart phones and â€œchat botsâ€\x9d that help answer our questions on websites. Many hope that AI will transform our economy in ways that drive growth, similar to how steam engines did in the late 19th century and electricity did in the early 20th century. But it is hard to imagine that maps on smart phones, chatbots, and other existing AI-enabled services will drive the type of economic growth we saw from stream and electricity. What we need to see are some dramatic new AI-enabled products and services that transform our way of lifeâ€”in short, we are waiting for an AI â€œkiller app.â€\x9d\r\nAutonomous vehicles (AVs)â€”vehicles that accelerate, brake, and turn on their own, requiring little or no input from a human driverâ€”may be such a killer app that transforms our economy significantly. AI supports AVs in a variety of ways, including quickly processing and interpreting

## 1. Text cleaning and preparation

### 1.1. Special character cleaning

We can see the following special characters:

* ``\r``
* ``\n``
* ``\`` before possessive pronouns (`government's = government\'s`)
* ``\`` before possessive pronouns 2 (`Yukos'` = `Yukos\'`)
* ``"`` when quoting text

In [50]:
# \r and \n
df['Content_Parsed_1'] = df['Content'].str.replace("\r", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("\n", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("    ", " ")

Regarding the 3rd and 4th bullets: although there appears to be a special character, the backslash is only Python's escape for the apostrophe in the string representation. It is not part of the text, so it won't affect us:

In [51]:
text = "Mr Greenspan\'s"
text

"Mr Greenspan's"

In [52]:
# " when quoting text
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace('"', '')
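
The cleaning steps above can also be sketched on a plain string with the standard library. This is only an illustration, not the notebook's method: `clean_special_characters` is a hypothetical helper, and it collapses any run of whitespace into one space, which is slightly more aggressive than the replacements above.

```python
import re

def clean_special_characters(text):
    """Replace \\r, \\n and runs of spaces with one space, then drop double quotes."""
    # \s+ matches carriage returns, newlines, tabs and repeated spaces
    text = re.sub(r"\s+", " ", text)
    # Remove the double quotes used when quoting text
    return text.replace('"', "")

sample = 'an AI "killer app".\r\nAutonomous    vehicles'
print(clean_special_characters(sample))
```

With pandas, the same function could be applied row by row through `df['Content'].apply(clean_special_characters)`.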

### 1.2. Upcase/downcase

We'll downcase the texts because we want, for example, `Football` and `football` to be the same word.

In [53]:
# Lowercasing the text
df['Content_Parsed_2'] = df['Content_Parsed_1'].str.lower()

### 1.3. Punctuation signs

Punctuation signs won't have any predictive power, so we'll just get rid of them.

In [54]:
punctuation_signs = list("?:!.,;")
df['Content_Parsed_3'] = df['Content_Parsed_2']

for punct_sign in punctuation_signs:
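    # regex=False treats each sign literally (otherwise "." or "?" could be read as regex syntax)
    df['Content_Parsed_3'] = df['Content_Parsed_3'].str.replace(punct_sign, '', regex=False)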


This also strips the decimal points and thousands separators from numbers (for example, `2.5` becomes `25`), but that is not a problem since we don't expect numbers to have any predictive power.
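
As a rough illustration of why the signs must be treated literally, the sketch below removes the same punctuation with one escaped character class from the standard library. `strip_punctuation` is a hypothetical helper name, not part of the notebook.

```python
import re

PUNCTUATION_SIGNS = "?:!.,;"

def strip_punctuation(text, signs=PUNCTUATION_SIGNS):
    """Remove every character listed in `signs` from `text`."""
    # re.escape makes "." and "?" literal characters instead of regex operators
    pattern = "[" + re.escape(signs) + "]"
    return re.sub(pattern, "", text)

print(strip_punctuation("Growth of 2.5%, really?"))

# An unescaped "." is a wildcard and would wipe the whole string
print(repr(re.sub(".", "", "a.b")))
```

The first call also shows the side effect on numbers mentioned above.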

### 1.4. Possessive pronouns

We'll also remove possessive endings (`'s`):

In [58]:
df['Content_Parsed_4'] = df['Content_Parsed_3'].str.replace("'s", "")
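
A plain substring replacement also fires on an opening quote followed by a word that starts with `s` (for example `'single` becomes `ingle`). A slightly stricter variant, sketched here under the assumption that we only want to drop `'s` at the end of a word, could look like this. `strip_possessive` is a hypothetical helper, and the curly apostrophe is included as an assumption about what correctly decoded text might contain.

```python
import re

# 's (straight or curly apostrophe) preceded by a word character and ending a word
POSSESSIVE = re.compile(r"(?<=\w)['\u2019]s\b")

def strip_possessive(text):
    """Drop possessive endings such as government's -> government."""
    return POSSESSIVE.sub("", text)

print(strip_possessive("the government's plan"))
print(strip_possessive("'single quotes'"))
```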

### 1.5. Stemming and Lemmatization

Since stemming can produce output words that don't exist, we'll only use a lemmatization process at this moment. Lemmatization takes into consideration the morphological analysis of the words and returns words that do exist, so it will be more useful for us.

In [59]:
# Downloading punkt and wordnet from NLTK
nltk.download('punkt')
print("------------------------------------------------------------")
nltk.download('wordnet')

------------------------------------------------------------


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/keerthanareddy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/keerthanareddy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [60]:
# Saving the lemmatizer into an object
wordnet_lemmatizer = WordNetLemmatizer()

In order to lemmatize, we have to iterate through every word:

In [61]:
nrows = len(df)
lemmatized_text_list = []

for row in range(0, nrows):
    
    # Create an empty list containing lemmatized words
    lemmatized_list = []
    
    # Save the text and its words into an object
    text = df.loc[row]['Content_Parsed_4']
    text_words = str(text).split(" ")

    # Iterate through every word to lemmatize
    for word in text_words:
        lemmatized_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
        
    # Join the list
    lemmatized_text = " ".join(lemmatized_list)
    
    # Append to the list containing the texts
    lemmatized_text_list.append(lemmatized_text)

In [62]:
df['Content_Parsed_5'] = lemmatized_text_list

Although lemmatization doesn't work perfectly in all cases (as can be seen in the example below), it can be useful.
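
The row-by-row loop can also be expressed as a small function that receives the word-level lemmatizer as an argument. The sketch below uses a toy dictionary as a stand-in so that it runs without NLTK data; `lemmatize_text`, `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names introduced only for illustration.

```python
def lemmatize_text(text, lemmatize_word):
    """Split on single spaces, lemmatize each word, and join the result back."""
    return " ".join(lemmatize_word(word) for word in str(text).split(" "))

# Toy stand-in for wordnet_lemmatizer.lemmatize(word, pos="v")
TOY_LEMMAS = {"is": "be", "used": "use", "including": "include"}

def toy_lemmatize(word):
    # Unknown words are returned unchanged
    return TOY_LEMMAS.get(word, word)

print(lemmatize_text("ai is used in maps", toy_lemmatize))
```

In the notebook, passing `lambda w: wordnet_lemmatizer.lemmatize(w, pos="v")` as `lemmatize_word` and calling the function through `df['Content_Parsed_4'].apply(...)` should give the same column as the explicit loop.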

### 1.6. Stop words

In [63]:
# Downloading the stop words list
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/keerthanareddy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [64]:
# Loading the stop words in english
stop_words = list(stopwords.words('english'))

In [65]:
stop_words[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

To remove the stop words, we'll use a regular expression that matches whole words only, as seen in the following example:

In [66]:
example = "me eating a meal"
word = "me"

# The regular expression is:
regex = r"\b" + word + r"\b"  # raw strings keep \b as a regex word boundary, not a backspace character

re.sub(regex, "StopWord", example)

'StopWord eating a meal'

We can now loop through all the stop words:

In [67]:
df['Content_Parsed_6'] = df['Content_Parsed_5']

for stop_word in stop_words:

    regex_stopword = r"\b" + stop_word + r"\b"
    # regex=True is required so that \b is interpreted as a word boundary
    df['Content_Parsed_6'] = df['Content_Parsed_6'].str.replace(regex_stopword, '', regex=True)


We have some double/triple spaces between words because of the replacements. However, it's not a problem because we'll tokenize by the spaces later.
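
As an alternative sketch, all stop words can be combined into a single whole-word pattern, and the leftover spaces collapsed in the same pass. `remove_stop_words` is a hypothetical helper, and collapsing the spaces is an extra step that the notebook itself does not perform.

```python
import re

def remove_stop_words(text, stop_words):
    """Remove whole-word occurrences of the stop words and tidy the spacing."""
    # Longest first, so that "you're" is tried before "you"
    ordered = sorted(stop_words, key=len, reverse=True)
    alternatives = "|".join(re.escape(word) for word in ordered)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    without_stop_words = pattern.sub("", text)
    # Collapse the double/triple spaces left behind by the removals
    return re.sub(r"\s+", " ", without_stop_words).strip()

print(remove_stop_words("me eating a meal", ["me", "a"]))
```

Because of the `\b` boundaries, `meal` and `eating` are left intact even though they contain `me` and `a`.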

As an example, we'll show an original news article and its modifications throughout the process:

In [68]:
df.loc[1]['Content']

'Artificial intelligence (AI) is used in a wide variety of products and services, including maps embedded on our smart phones and â€œchat botsâ€\x9d that help answer our questions on websites. Many hope that AI will transform our economy in ways that drive growth, similar to how steam engines did in the late 19th century and electricity did in the early 20th century. But it is hard to imagine that maps on smart phones, chatbots, and other existing AI-enabled services will drive the type of economic growth we saw from stream and electricity. What we need to see are some dramatic new AI-enabled products and services that transform our way of lifeâ€”in short, we are waiting for an AI â€œkiller app.â€\x9d\r\nAutonomous vehicles (AVs)â€”vehicles that accelerate, brake, and turn on their own, requiring little or no input from a human driverâ€”may be such a killer app that transforms our economy significantly. AI supports AVs in a variety of ways, including quickly processing and interpreting

1. Special character cleaning

In [69]:
df.loc[1]['Content_Parsed_1']

'Artificial intelligence (AI) is used in a wide variety of products and services, including maps embedded on our smart phones and â€œchat botsâ€\x9d that help answer our questions on websites. Many hope that AI will transform our economy in ways that drive growth, similar to how steam engines did in the late 19th century and electricity did in the early 20th century. But it is hard to imagine that maps on smart phones, chatbots, and other existing AI-enabled services will drive the type of economic growth we saw from stream and electricity. What we need to see are some dramatic new AI-enabled products and services that transform our way of lifeâ€”in short, we are waiting for an AI â€œkiller app.â€\x9d  Autonomous vehicles (AVs)â€”vehicles that accelerate, brake, and turn on their own, requiring little or no input from a human driverâ€”may be such a killer app that transforms our economy significantly. AI supports AVs in a variety of ways, including quickly processing and interpreting t

2. Upcase/downcase

In [70]:
df.loc[1]['Content_Parsed_2']

'artificial intelligence (ai) is used in a wide variety of products and services, including maps embedded on our smart phones and â€œchat botsâ€\x9d that help answer our questions on websites. many hope that ai will transform our economy in ways that drive growth, similar to how steam engines did in the late 19th century and electricity did in the early 20th century. but it is hard to imagine that maps on smart phones, chatbots, and other existing ai-enabled services will drive the type of economic growth we saw from stream and electricity. what we need to see are some dramatic new ai-enabled products and services that transform our way of lifeâ€”in short, we are waiting for an ai â€œkiller app.â€\x9d  autonomous vehicles (avs)â€”vehicles that accelerate, brake, and turn on their own, requiring little or no input from a human driverâ€”may be such a killer app that transforms our economy significantly. ai supports avs in a variety of ways, including quickly processing and interpreting t

3. Punctuation signs

In [71]:
df.loc[1]['Content_Parsed_3']

'artificial intelligence (ai) is used in a wide variety of products and services including maps embedded on our smart phones and â€œchat botsâ€\x9d that help answer our questions on websites many hope that ai will transform our economy in ways that drive growth similar to how steam engines did in the late 19th century and electricity did in the early 20th century but it is hard to imagine that maps on smart phones chatbots and other existing ai-enabled services will drive the type of economic growth we saw from stream and electricity what we need to see are some dramatic new ai-enabled products and services that transform our way of lifeâ€”in short we are waiting for an ai â€œkiller appâ€\x9d  autonomous vehicles (avs)â€”vehicles that accelerate brake and turn on their own requiring little or no input from a human driverâ€”may be such a killer app that transforms our economy significantly ai supports avs in a variety of ways including quickly processing and interpreting the large amoun

4. Possessive pronouns

In [72]:
df.loc[1]['Content_Parsed_4']

'artificial intelligence (ai) is used in a wide variety of products and services including maps embedded on our smart phones and â€œchat botsâ€\x9d that help answer our questions on websites many hope that ai will transform our economy in ways that drive growth similar to how steam engines did in the late 19th century and electricity did in the early 20th century but it is hard to imagine that maps on smart phones chatbots and other existing ai-enabled services will drive the type of economic growth we saw from stream and electricity what we need to see are some dramatic new ai-enabled products and services that transform our way of lifeâ€”in short we are waiting for an ai â€œkiller appâ€\x9d  autonomous vehicles (avs)â€”vehicles that accelerate brake and turn on their own requiring little or no input from a human driverâ€”may be such a killer app that transforms our economy significantly ai supports avs in a variety of ways including quickly processing and interpreting the large amoun

5. Stemming and Lemmatization

In [73]:
df.loc[1]['Content_Parsed_5']

'artificial intelligence (ai) be use in a wide variety of products and service include map embed on our smart phone and â€œchat botsâ€\x9d that help answer our question on websites many hope that ai will transform our economy in ways that drive growth similar to how steam engines do in the late 19th century and electricity do in the early 20th century but it be hard to imagine that map on smart phone chatbots and other exist ai-enabled service will drive the type of economic growth we saw from stream and electricity what we need to see be some dramatic new ai-enabled products and service that transform our way of lifeâ€”in short we be wait for an ai â€œkiller appâ€\x9d  autonomous vehicles (avs)â€”vehicles that accelerate brake and turn on their own require little or no input from a human driverâ€”may be such a killer app that transform our economy significantly ai support avs in a variety of ways include quickly process and interpret the large amount of data generate by the vehicleâ€™

6. Stop words

In [77]:
df.loc[1]['Content_Parsed_6']

'artificial intelligence (ai)  use   wide variety  products  service include map embed   smart phone  â€œchat botsâ€\x9d  help answer  question  websites many hope  ai  transform  economy  ways  drive growth similar   steam engines    late 19th century  electricity    early 20th century    hard  imagine  map  smart phone chatbots   exist ai-enabled service  drive  type  economic growth  saw  stream  electricity   need  see   dramatic new ai-enabled products  service  transform  way  lifeâ€” short   wait   ai â€œkiller appâ€\x9d  autonomous vehicles (avs)â€”vehicles  accelerate brake  turn    require little   input   human driverâ€”may    killer app  transform  economy significantly ai support avs   variety  ways include quickly process  interpret  large amount  data generate   vehicleâ€™ cameras  sensors  help  improve vehicle fuel efficiency  safety  impact  broad adoption  avs  equally numerous  potentially lower transportation cost  limit  need  drivers  possibly transform mobility 

Finally, we can delete the intermediate columns:

In [78]:
df.head(1)

Unnamed: 0,File_Name,Content,Category,Complete_Filename,id,News_length,Content_Parsed_1,Content_Parsed_2,Content_Parsed_3,Content_Parsed_4,Content_Parsed_5,Content_Parsed_6
0,Neha_Assign_AutonomusCar.txt,"A year ago, Detroit and Silicon Valley had vis...",Baleno,Neha_Assign_AutonomusCar.txt-Baleno,1,8503,"A year ago, Detroit and Silicon Valley had vis...","a year ago, detroit and silicon valley had vis...",a year ago detroit and silicon valley had visi...,a year ago detroit and silicon valley had visi...,a year ago detroit and silicon valley have vis...,year ago detroit silicon valley visions pu...


In [79]:
list_columns = ["File_Name", "Category", "Content", "Content_Parsed_6"]
df = df[list_columns]

df = df.rename(columns={'Content_Parsed_6': 'Content_Parsed'})

In [32]:
df.head()

Unnamed: 0,File_Name,Category,Content,Content_Parsed
0,Lahari_Assg1.txt,Alto,Ask nine futurists what life will be like in 5...,ask nine futurists life like 50 years ' l...
1,Lahari_Assg2.txt,Ferrari,From ushering in an era of decreased car owner...,usher era decrease car ownership narrow s...


**IMPORTANT:**

We need to remember that our model will classify the latest news articles gathered from different newspapers whenever we run it. For that reason, we need to take into account not only the peculiarities of the training set articles, but also those that may be present in the gathered news articles.

For this reason, possible peculiarities have been studied in the *05. News Scraping* folder.

## 2. Label coding

We'll create a dictionary with the label codification:

In [80]:
category_codes = {
    'business': 0,
    'entertainment': 1,
    'politics': 2,
    'sport': 3,
    'tech': 4
}

In [81]:
# Category mapping
df['Category_Code'] = df['Category']
df = df.replace({'Category_Code':category_codes})

In [82]:
df.head()

Unnamed: 0,File_Name,Category,Content,Content_Parsed,Category_Code
0,Neha_Assign_AutonomusCar.txt,Baleno,"A year ago, Detroit and Silicon Valley had vis...",year ago detroit silicon valley visions pu...,Baleno
1,Neha_Assign_AutonomusCar2.txt,Suziki,Artificial intelligence (AI) is used in a wide...,artificial intelligence (ai) use wide varie...,Suziki
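
In the output above, `Category_Code` still holds the category names, because `replace` only changes values that appear as keys of the dictionary. One way to avoid a mismatch, sketched here as an assumption rather than as the notebook's method, is to derive the codes from the categories actually present in the data. `build_category_codes` is a hypothetical helper.

```python
def build_category_codes(categories):
    """Map each distinct category to an integer code, in alphabetical order."""
    return {category: code for code, category in enumerate(sorted(set(categories)))}

codes = build_category_codes(["Baleno", "Suziki", "Baleno"])
print(codes)
```

With pandas, `df['Category'].map(codes)` would then produce an integer column.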


## 3. Train - test split

We'll set apart a test set to assess the quality of our models. We'll use cross-validation on the training set to tune the hyperparameters, and then test performance on the unseen data of the test set.

In [83]:
X_train, X_test, y_train, y_test = train_test_split(df['Content_Parsed'], 
                                                    df['Category_Code'], 
                                                    test_size=0.15, 
                                                    random_state=8)

Since we don't have many observations (only 2,225), we'll choose a test set size of 15% of the full dataset.
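
To make the split sizes concrete, here is a pure-Python sketch of a seeded shuffle split. It is not scikit-learn's implementation: `simple_train_test_split` is a hypothetical function, the shuffling differs from `train_test_split`, and rounding the test size up is an assumption of this sketch.

```python
import math
import random

def simple_train_test_split(items, test_size=0.15, seed=8):
    """Shuffle the indices with a fixed seed and hold out a fraction as test set."""
    items = list(items)
    # Number of held-out observations, rounded up
    n_test = math.ceil(len(items) * test_size)
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    test_indices = set(indices[:n_test])
    train = [x for i, x in enumerate(items) if i not in test_indices]
    test = [x for i, x in enumerate(items) if i in test_indices]
    return train, test

train, test = simple_train_test_split(range(2225))
print(len(train), len(test))
```

Under these assumptions, 2,225 observations give 334 test rows and 1,891 training rows.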

## 4. Text representation

We have various options:

* Count Vectors as features
* TF-IDF Vectors as features
* Word Embeddings as features
* Text / NLP based features
* Topic Models as features

We'll use **TF-IDF Vectors** as features.

We have to define the different parameters:

* `ngram_range`: We want to consider both unigrams and bigrams.
* `max_df`: When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold.
* `min_df`: When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold.
* `max_features`: If not `None`, build a vocabulary that only considers the top `max_features` terms ordered by term frequency across the corpus.

See `TfidfVectorizer?` for further detail.

Note that we are implicitly scaling our data when representing it as TF-IDF features: the `norm` argument normalizes each document vector.
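
To show what the scores and the normalization amount to, the following sketch computes TF-IDF by hand for unigrams. It assumes the smoothed IDF that scikit-learn uses by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, a sublinear term frequency `1 + ln(tf)`, and L2 normalization. `tfidf_rows` is a hypothetical function, and it tokenizes on whitespace only, which is simpler than the vectorizer's own tokenization.

```python
import math
from collections import Counter

def tfidf_rows(docs, sublinear_tf=True):
    """Return the sorted vocabulary and one L2-normalized TF-IDF row per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = []
        for term in vocab:
            tf = counts[term]
            if tf and sublinear_tf:
                tf = 1 + math.log(tf)  # dampen repeated terms
            weights.append(tf * idf[term])
        # L2 normalization: each document vector ends up with length 1
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows

vocab, rows = tfidf_rows(["car fuel car", "car map"])
print(vocab)
print(rows)
```

In the second document, `map` receives a higher weight than `car` because `car` appears in every document and therefore has the lowest IDF.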

In [84]:
# Parameter selection
ngram_range = (1,2)
min_df = 1
max_df = 10
max_features = 300

We have chosen these values as a first approximation. Since the models that we develop later have very good predictive power, we'll stick to these values. However, different combinations could be tried to further improve the accuracy of the models.

In [85]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',  # valid values are 'l1', 'l2' or None; 'l2' scales each document vector to unit length
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(1, 300)
(1, 300)


Please note that we have fitted and then transformed the training set, but we have **only transformed** the **test set**.

We can use the chi-squared test to see which unigrams and bigrams are most correlated with each category:

In [86]:
from sklearn.feature_selection import chi2
import numpy as np

for Product, category_id in sorted(category_codes.items()):
    features_chi2 = chi2(features_train, labels_train == category_id)
    indices = np.argsort(features_chi2[0])
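    # get_feature_names() was removed in recent scikit-learn versions; get_feature_names_out() replaces it
    feature_names = np.array(tfidf.get_feature_names_out())[indices]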
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}' category:".format(Product))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-5:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-2:])))
    print("")


# 'business' category:
  . Most correlated unigrams:
. gittleman
. future
. full
. fuel
. framework
  . Most correlated bigrams:
. fuel efficiency
. œkiller appâ

# 'entertainment' category:
  . Most correlated unigrams:
. gittleman
. future
. full
. fuel
. framework
  . Most correlated bigrams:
. fuel efficiency
. œkiller appâ

# 'politics' category:
  . Most correlated unigrams:
. gittleman
. future
. full
. fuel
. framework
  . Most correlated bigrams:
. fuel efficiency
. œkiller appâ

# 'sport' category:
  . Most correlated unigrams:
. gittleman
. future
. full
. fuel
. framework
  . Most correlated bigrams:
. fuel efficiency
. œkiller appâ

# 'tech' category:
  . Most correlated unigrams:
. gittleman
. future
. full
. fuel
. framework
  . Most correlated bigrams:
. fuel efficiency
. œkiller appâ



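Note that the output above is identical for all five categories: the labels in this dataset (`Baleno`, `Suziki`) were not mapped by `category_codes`, so `labels_train == category_id` is `False` for every row and the scores cannot separate the categories. With correctly coded labels, we would expect the unigrams to correspond well to their category and the bigrams less so. If we get the bigrams in our features: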

In [87]:
bigrams

['prospective av',
 'promote state',
 'project adoption',
 'provide autonomous',
 'products service',
 'profile accidents',
 'provide clarity',
 'provide nhtsa',
 'provide variety',
 'range level',
 'quickly process',
 'question websites',
 'purpose aka',
 'publish book',
 'public utilities',
 'public transit',
 'public laws',
 'model two',
 'mobility urban',
 'mind attempt',
 'mean fewer',
 'may ultimately',
 'may take',
 'may provide',
 'may outweigh',
 'may lead',
 'may increase',
 'monaco 2017',
 'reduce human',
 'monaco estimate',
 'muskâ 2019',
 'patchwork state',
 'reduce labor',
 'reduction fuel',
 'urban environments',
 'transform economy',
 'variety ways',
 'would require',
 'worth note',
 'widespread adoption',
 'vehicle traffic',
 'state laws',
 'start act',
 'require human',
 'relative time',
 'relative ease',
 'related policy',
 'regulations current',
 'regulate avs',
 'smart phone',
 'self driving',
 'self drive',
 'maximize benefit',
 'map smart',
 'commercially availab

The bigrams listed above are the ones that made it into the vocabulary. Since `max_features` keeps only the 300 terms with the highest term frequency across the corpus, any bigram outside that top 300 is not considered.

Let's save the files we'll need in the next steps:

In [88]:
# X_train
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/X_train.pickle', 'wb') as output:
    pickle.dump(X_train, output)
    
# X_test    
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/X_test.pickle', 'wb') as output:
    pickle.dump(X_test, output)
    
# y_train
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/y_train.pickle', 'wb') as output:
    pickle.dump(y_train, output)
    
# y_test
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/y_test.pickle', 'wb') as output:
    pickle.dump(y_test, output)
    
# df
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/df.pickle', 'wb') as output:
    pickle.dump(df, output)
    
# features_train
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/features_train.pickle', 'wb') as output:
    pickle.dump(features_train, output)

# labels_train
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/labels_train.pickle', 'wb') as output:
    pickle.dump(labels_train, output)

# features_test
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/features_test.pickle', 'wb') as output:
    pickle.dump(features_test, output)

# labels_test
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/labels_test.pickle', 'wb') as output:
    pickle.dump(labels_test, output)
    
# TF-IDF object
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/tfidf.pickle', 'wb') as output:
    pickle.dump(tfidf, output)
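
The repeated `with open(...)` blocks could also be written as a loop over a dictionary of names and objects. The sketch below is one possible way to do it. `save_pickles` is a hypothetical helper, and the temporary folder only stands in for the real `Pickles` directory.

```python
import pickle
import tempfile
from pathlib import Path

def save_pickles(objects, folder):
    """Write each object to <folder>/<name>.pickle and return the paths."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name, obj in objects.items():
        path = folder / f"{name}.pickle"
        with open(path, "wb") as output:
            pickle.dump(obj, output)
        paths[name] = path
    return paths

# Placeholder objects and a temporary folder, for illustration only
with tempfile.TemporaryDirectory() as tmp:
    saved = save_pickles({"X_train": ["text a"], "y_train": [0]}, tmp)
    print(sorted(saved))
```

In the notebook, the dictionary would hold `X_train`, `X_test`, `y_train`, `y_test`, `df`, the feature and label arrays, and the `tfidf` object.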