# Feature Engineering

The next step is to create features from the raw text so we can train the machine learning models. The steps followed are:

1. **Text Cleaning and Preparation**: cleaning of special characters, downcasing, removal of punctuation signs, possessive endings and stop words, and lemmatization.
2. **Label coding**: creation of a dictionary to map each category to a code.
3. **Train-test split**: to test the models on unseen data.
4. **Text representation**: use of TF-IDF scores to represent text.

In [62]:
import pickle
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import numpy as np

First of all we'll load the dataset:

In [63]:
path_df = "/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/02. Exploratory Data Analysis/News_dataset.pickle"

with open(path_df, 'rb') as data:
    df = pickle.load(data)

In [64]:
df.head()

| | File_Name | Content | Category | Complete_Filename | id | News_length |
|---|---|---|---|---|---|---|
| 0 | Autonomous_new_1.txt | Self-driving car dilemmas reveal that moral ch... | Audi | Autonomous_new_1.txt-Audi | 1 | 3215 |
| 1 | Atonomous_new_3.txt | Self-driving car dilemmas reveal that moral ch... | BMW | Atonomous_new_3.txt-BMW | 1 | 8005 |
| 2 | Autonomous_new_2.txt | Self-driving car dilemmas reveal that moral ch... | Tesla | Autonomous_new_2.txt-Tesla | 1 | 1916 |


And visualize one sample news content:

In [65]:
df.loc[1]['Content']

"Self-driving car dilemmas reveal that moral choices are not universal\r\nSurvey maps global variations in ethics for programming autonomous vehicles.\r\nAmy Maxmen\r\n  \r\n\r\nSelf-driving cars are being developed by several major technology companies and carmakers. credit: VCG/Getty\r\n\r\nWhen a driver slams on the brakes to avoid hitting a pedestrian crossing the road illegally, she is making a moral decision that shifts risk from the pedestrian to the people in the car. Self-driving cars might soon have to make such ethical judgments on their own â€” but settling on a universal moral code for the vehicles could be a thorny task, suggests a survey of 2.3 million people from around the world.\r\n\r\nThe largest ever survey of machine ethics1, published today in Nature, finds that many of the moral principles that guide a driverâ€™s decisions vary by country. For example, in a scenario in which some combination of pedestrians and passengers will die in a collision, people from relat

## 1. Text cleaning and preparation

### 1.1. Special character cleaning

We can see the following special characters:

* ``\r``
* ``\n``
* ``\`` before possessive pronouns (`government's = government\'s`)
* ``\`` before possessive pronouns 2 (`Yukos'` = `Yukos\'`)
* ``"`` when quoting text

In [66]:
# \r and \n
df['Content_Parsed_1'] = df['Content'].str.replace("\r", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("\n", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("    ", " ")

Regarding the 3rd and 4th bullets: although the backslash looks like a special character, it won't affect us, because it is only an escape in the printed representation of the string and not a *real* character in the text:

In [67]:
text = "Mr Greenspan\'s"
text

"Mr Greenspan's"

In [68]:
# " when quoting text
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace('"', '')
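
The replacements in this subsection can also be collected into one helper. The following is a minimal sketch that uses only the standard library; `clean_special_characters` and the sample string are illustrative names that are not part of the notebook. Unlike the cells above, it collapses every run of spaces, not only runs of exactly four.

```python
import re


def clean_special_characters(text: str) -> str:
    """Replace line breaks with spaces, drop double quotes, collapse spaces."""
    text = text.replace("\r", " ").replace("\n", " ")
    text = text.replace('"', "")
    # Collapse any run of two or more spaces into a single space.
    return re.sub(r" {2,}", " ", text).strip()


# Illustrative sample, loosely modelled on the article shown above.
sample = 'Survey maps "global" variations.\r\nAmy Maxmen\r\n  \r\n\r\nSelf-driving cars'
cleaned = clean_special_characters(sample)
print(cleaned)
```

With pandas, the same helper could be applied to a whole column with `df['Content'].apply(clean_special_characters)`.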

### 1.2. Upcase/downcase

We'll downcase the texts because we want, for example, `Football` and `football` to be the same word.

In [69]:
# Lowercasing the text
df['Content_Parsed_2'] = df['Content_Parsed_1'].str.lower()

### 1.3. Punctuation signs

Punctuation signs won't have any predictive power, so we'll just get rid of them.

In [70]:
punctuation_signs = list("?:!.,;")
df['Content_Parsed_3'] = df['Content_Parsed_2']

for punct_sign in punctuation_signs:
    # regex=False: treat "?" and "." as literal characters, not regex metacharacters
    df['Content_Parsed_3'] = df['Content_Parsed_3'].str.replace(punct_sign, '', regex=False)


This also alters some numbers (for example, `2.3` becomes `23`), but that is not a problem since we don't expect them to carry any predictive power.
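
As an alternative to looping over the signs, all of them can be removed in a single pass. This is a small sketch based on `str.translate` from the standard library; the function name and the sample sentence are assumptions made for illustration.

```python
punctuation_signs = "?:!.,;"

# Translation table that maps every punctuation sign to "deleted".
removal_table = str.maketrans("", "", punctuation_signs)


def remove_punctuation(text: str) -> str:
    """Delete the listed punctuation signs from the text in one pass."""
    return text.translate(removal_table)


sample = "a survey of 2.3 million people: really?"
stripped = remove_punctuation(sample)
print(stripped)  # the decimal point disappears too, so 2.3 becomes 23
```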

### 1.4. Possessive pronouns

We'll also remove the possessive ending `'s`:

In [71]:
df['Content_Parsed_4'] = df['Content_Parsed_3'].str.replace("'s", "")
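
A plain replacement of `'s` only catches the straight apostrophe. The sketch below is one possible extension, not the notebook's method: it uses a regular expression that also accepts the typographic apostrophe and only matches at the end of a word. The names `POSSESSIVE_ENDING` and `strip_possessive` are hypothetical.

```python
import re

# Matches "s" after a straight (') or typographic (\u2019) apostrophe,
# but only when the "s" ends the word.
POSSESSIVE_ENDING = re.compile(r"['\u2019]s\b")


def strip_possessive(text: str) -> str:
    """Remove possessive 's endings; bare trailing apostrophes are kept."""
    return POSSESSIVE_ENDING.sub("", text)


sample = "the government's plan and the driver\u2019s decisions"
result = strip_possessive(sample)
print(result)
```

Note that contractions such as `it's` are shortened as well, which is usually acceptable here because the resulting words are stop words anyway.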

### 1.5. Stemming and Lemmatization

Since stemming can produce output words that don't exist, we'll only use a lemmatization process at this moment. Lemmatization takes into consideration the morphological analysis of the words and returns words that do exist, so it will be more useful for us.

In [72]:
# Downloading punkt and wordnet from NLTK
nltk.download('punkt')
print("------------------------------------------------------------")
nltk.download('wordnet')

------------------------------------------------------------


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/keerthanareddy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/keerthanareddy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [73]:
# Saving the lemmatizer into an object
wordnet_lemmatizer = WordNetLemmatizer()

In order to lemmatize, we have to iterate through every word:

In [74]:
nrows = len(df)
lemmatized_text_list = []

for row in range(0, nrows):
    
    # Create an empty list containing lemmatized words
    lemmatized_list = []
    
    # Save the text and its words into an object
    text = df.loc[row]['Content_Parsed_4']
    text_words = str(text).split(" ")

    # Iterate through every word to lemmatize
    for word in text_words:
        lemmatized_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
        
    # Join the list
    lemmatized_text = " ".join(lemmatized_list)
    
    # Append to the list containing the texts
    lemmatized_text_list.append(lemmatized_text)

In [75]:
df['Content_Parsed_5'] = lemmatized_text_list
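
The word-by-word loop can also be expressed as a small function that receives the lemmatizer as an argument. This is only a sketch: `lemmatize_text` is a hypothetical helper, and the toy dictionary stands in for WordNet so that the example runs without downloading any NLTK data.

```python
from typing import Callable


def lemmatize_text(text: str, lemmatize: Callable[[str], str]) -> str:
    """Split on spaces, lemmatize every word, and join the words again."""
    return " ".join(lemmatize(word) for word in str(text).split(" "))


# Toy stand-in for a real lemmatizer (assumption, for illustration only).
TOY_LEMMAS = {"are": "be", "making": "make", "slams": "slam"}


def toy_lemmatize(word: str) -> str:
    """Return the toy lemma if one is known, otherwise the word itself."""
    return TOY_LEMMAS.get(word, word)


result = lemmatize_text("cars are making choices", toy_lemmatize)
print(result)
```

With NLTK, the second argument could be `lambda w: wordnet_lemmatizer.lemmatize(w, pos="v")`, and the function could be applied to the column with `df['Content_Parsed_4'].apply(...)`.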

Although lemmatization doesn't work perfectly in all cases (as can be seen in the example below), it can be useful.

### 1.6. Stop words

In [76]:
# Downloading the stop words list
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/keerthanareddy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [77]:
# Loading the stop words in english
stop_words = list(stopwords.words('english'))

In [78]:
stop_words[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

To remove the stop words, we'll use a regular expression that matches whole words only, as seen in the following example:

In [79]:
example = "me eating a meal"
word = "me"

# The regular expression is:
regex = r"\b" + word + r"\b"  # we need to build it like that to work properly

re.sub(regex, "StopWord", example)

'StopWord eating a meal'

We can now loop through all the stop words:

In [80]:
df['Content_Parsed_6'] = df['Content_Parsed_5']

for stop_word in stop_words:

    regex_stopword = r"\b" + stop_word + r"\b"
    # regex=True is required so that \b is interpreted as a word boundary
    df['Content_Parsed_6'] = df['Content_Parsed_6'].str.replace(regex_stopword, '', regex=True)


The replacements leave some double and triple spaces between words. However, this is not a problem because we'll tokenize on whitespace later.
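
Looping over the stop words runs one replacement per word. A possible alternative, sketched below with the standard library only, compiles all stop words into a single whole-word pattern and also collapses the leftover spaces. The short stop word list and the function name are assumptions made for illustration.

```python
import re

# Small illustrative subset; the notebook uses NLTK's English stop word list.
stop_words = ["me", "a", "the", "you're"]

# Longest words first, so that "you're" would win over a shorter prefix.
alternatives = "|".join(
    re.escape(word) for word in sorted(stop_words, key=len, reverse=True)
)
stop_word_pattern = re.compile(r"\b(?:" + alternatives + r")\b")


def remove_stop_words(text: str) -> str:
    """Delete whole-word stop words and collapse the remaining whitespace."""
    without_stop_words = stop_word_pattern.sub("", text)
    return " ".join(without_stop_words.split())


result = remove_stop_words("me eating a meal at the table")
print(result)
```

`re.escape` guards against stop words that contain characters with a special meaning in regular expressions.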

As an example, we'll show an original news article and its modifications throughout the process:

In [82]:
df.loc[2]['Content']

'Self-driving car dilemmas reveal that moral choices are not universal\r\nSurvey maps global variations in ethics for programming autonomous vehicles.\r\nAmy Maxmen\r\n  \r\n\r\nSelf-driving cars are being developed by several major technology companies and carmakers. credit: VCG/Getty\r\n\r\nWhen a driver slams on the brakes to avoid hitting a pedestrian crossing the road illegally, she is making a moral decision that shifts risk from the pedestrian to the people in the car. Self-driving cars might soon have to make such ethical judgments on their own â€” but settling on a universal moral code for the vehicles could be a thorny task, suggests a survey of 2.3 million people from around the world.\r\n\r\nThe largest ever survey of machine ethics1, published today in Nature, finds that many of the moral principles that guide a driverâ€™s decisions vary by country. For example, in a scenario in which some combination of pedestrians and passengers will die in a collision, people from relat

1. Special character cleaning

In [83]:
df.loc[2]['Content_Parsed_1']

'Self-driving car dilemmas reveal that moral choices are not universal  Survey maps global variations in ethics for programming autonomous vehicles.  Amy Maxmen  Self-driving cars are being developed by several major technology companies and carmakers. credit: VCG/Getty When a driver slams on the brakes to avoid hitting a pedestrian crossing the road illegally, she is making a moral decision that shifts risk from the pedestrian to the people in the car. Self-driving cars might soon have to make such ethical judgments on their own â€” but settling on a universal moral code for the vehicles could be a thorny task, suggests a survey of 2.3 million people from around the world. The largest ever survey of machine ethics1, published today in Nature, finds that many of the moral principles that guide a driverâ€™s decisions vary by country. For example, in a scenario in which some combination of pedestrians and passengers will die in a collision, people from relatively prosperous countries wit

2. Upcase/downcase

In [84]:
df.loc[2]['Content_Parsed_2']

'self-driving car dilemmas reveal that moral choices are not universal  survey maps global variations in ethics for programming autonomous vehicles.  amy maxmen  self-driving cars are being developed by several major technology companies and carmakers. credit: vcg/getty when a driver slams on the brakes to avoid hitting a pedestrian crossing the road illegally, she is making a moral decision that shifts risk from the pedestrian to the people in the car. self-driving cars might soon have to make such ethical judgments on their own â€” but settling on a universal moral code for the vehicles could be a thorny task, suggests a survey of 2.3 million people from around the world. the largest ever survey of machine ethics1, published today in nature, finds that many of the moral principles that guide a driverâ€™s decisions vary by country. for example, in a scenario in which some combination of pedestrians and passengers will die in a collision, people from relatively prosperous countries wit

3. Punctuation signs

In [85]:
df.loc[2]['Content_Parsed_3']

'self-driving car dilemmas reveal that moral choices are not universal  survey maps global variations in ethics for programming autonomous vehicles  amy maxmen  self-driving cars are being developed by several major technology companies and carmakers credit vcg/getty when a driver slams on the brakes to avoid hitting a pedestrian crossing the road illegally she is making a moral decision that shifts risk from the pedestrian to the people in the car self-driving cars might soon have to make such ethical judgments on their own â€” but settling on a universal moral code for the vehicles could be a thorny task suggests a survey of 23 million people from around the world the largest ever survey of machine ethics1 published today in nature finds that many of the moral principles that guide a driverâ€™s decisions vary by country for example in a scenario in which some combination of pedestrians and passengers will die in a collision people from relatively prosperous countries with strong inst

4. Possessive pronouns

In [86]:
df.loc[2]['Content_Parsed_4']

'self-driving car dilemmas reveal that moral choices are not universal  survey maps global variations in ethics for programming autonomous vehicles  amy maxmen  self-driving cars are being developed by several major technology companies and carmakers credit vcg/getty when a driver slams on the brakes to avoid hitting a pedestrian crossing the road illegally she is making a moral decision that shifts risk from the pedestrian to the people in the car self-driving cars might soon have to make such ethical judgments on their own â€” but settling on a universal moral code for the vehicles could be a thorny task suggests a survey of 23 million people from around the world the largest ever survey of machine ethics1 published today in nature finds that many of the moral principles that guide a driverâ€™s decisions vary by country for example in a scenario in which some combination of pedestrians and passengers will die in a collision people from relatively prosperous countries with strong inst

5. Stemming and Lemmatization

In [88]:
df.loc[2]['Content_Parsed_5']

'self-driving car dilemmas reveal that moral choices be not universal  survey map global variations in ethics for program autonomous vehicles  amy maxmen  self-driving cars be be develop by several major technology company and carmakers credit vcg/getty when a driver slam on the brake to avoid hit a pedestrian cross the road illegally she be make a moral decision that shift risk from the pedestrian to the people in the car self-driving cars might soon have to make such ethical judgments on their own â€” but settle on a universal moral code for the vehicles could be a thorny task suggest a survey of 23 million people from around the world the largest ever survey of machine ethics1 publish today in nature find that many of the moral principles that guide a driverâ€™s decisions vary by country for example in a scenario in which some combination of pedestrians and passengers will die in a collision people from relatively prosperous countries with strong institutions be less likely to spare

6. Stop words

In [89]:
df.loc[2]['Content_Parsed_6']

'self-driving car dilemmas reveal  moral choices   universal  survey map global variations  ethics  program autonomous vehicles  amy maxmen  self-driving cars   develop  several major technology company  carmakers credit vcg/getty   driver slam   brake  avoid hit  pedestrian cross  road illegally   make  moral decision  shift risk   pedestrian   people   car self-driving cars might soon   make  ethical judgments    â€”  settle   universal moral code   vehicles could   thorny task suggest  survey  23 million people  around  world  largest ever survey  machine ethics1 publish today  nature find  many   moral principles  guide  driverâ€™ decisions vary  country  example   scenario    combination  pedestrians  passengers  die   collision people  relatively prosperous countries  strong institutions  less likely  spare  pedestrian  step  traffic illegally â€œpeople  think  machine ethics make  sound like   come    perfect set  rule  robots    show   data      universal rulesâ€\x9d say iyad r

Finally, we can delete the intermediate columns:

In [90]:
df.head(1)

| | File_Name | Content | Category | Complete_Filename | id | News_length | Content_Parsed_1 | Content_Parsed_2 | Content_Parsed_3 | Content_Parsed_4 | Content_Parsed_5 | Content_Parsed_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Autonomous_new_1.txt | Self-driving car dilemmas reveal that moral ch... | Audi | Autonomous_new_1.txt-Audi | 1 | 3215 | Self-driving car dilemmas reveal that moral ch... | self-driving car dilemmas reveal that moral ch... | self-driving car dilemmas reveal that moral ch... | self-driving car dilemmas reveal that moral ch... | self-driving car dilemmas reveal that moral ch... | self-driving car dilemmas reveal moral choice... |


In [91]:
list_columns = ["File_Name", "Category", "Content", "Content_Parsed_6"]
df = df[list_columns]

df = df.rename(columns={'Content_Parsed_6': 'Content_Parsed'})

In [92]:
df.head()

| | File_Name | Category | Content | Content_Parsed |
|---|---|---|---|---|
| 0 | Autonomous_new_1.txt | Audi | Self-driving car dilemmas reveal that moral ch... | self-driving car dilemmas reveal moral choice... |
| 1 | Atonomous_new_3.txt | BMW | Self-driving car dilemmas reveal that moral ch... | self-driving car dilemmas reveal moral choice... |
| 2 | Autonomous_new_2.txt | Tesla | Self-driving car dilemmas reveal that moral ch... | self-driving car dilemmas reveal moral choice... |


**IMPORTANT:**

Remember that our model will gather the latest news articles from different newspapers whenever we run it. We therefore need to account not only for the peculiarities of the training-set articles, but also for those that may appear in the newly gathered articles.

For this reason, possible peculiarities have been studied in the *05. News Scraping* folder.

## 2. Label coding

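We'll create a dictionary with the label codification. The keys must match the values in the `Category` column exactly, since `replace` leaves any label without an entry unchanged: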

In [93]:
category_codes = {
    'business': 0,
    'entertainment': 1,
    'politics': 2,
    'sport': 3,
    'tech': 4
}

In [94]:
# Category mapping
df['Category_Code'] = df['Category']
df = df.replace({'Category_Code':category_codes})

In [95]:
df.head()

| | File_Name | Category | Content | Content_Parsed | Category_Code |
|---|---|---|---|---|---|
| 0 | Autonomous_new_1.txt | Audi | Self-driving car dilemmas reveal that moral ch... | self-driving car dilemmas reveal moral choice... | Audi |
| 1 | Atonomous_new_3.txt | BMW | Self-driving car dilemmas reveal that moral ch... | self-driving car dilemmas reveal moral choice... | BMW |
| 2 | Autonomous_new_2.txt | Tesla | Self-driving car dilemmas reveal that moral ch... | self-driving car dilemmas reveal moral choice... | Tesla |
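
One way to keep the dictionary and the data in sync is to derive the codes from the labels that are actually present, and to fail loudly when a label has no code. The following is a sketch under that assumption; the list of categories and the `encode` helper are hypothetical and only mimic the `Category` column shown above.

```python
# Hypothetical values, mimicking the Category column of the dataset.
categories = ["Audi", "BMW", "Tesla", "BMW"]

# Derive one integer code per distinct label (sorted for reproducibility).
category_codes = {
    name: code for code, name in enumerate(sorted(set(categories)))
}


def encode(labels, codes):
    """Map every label to its code; raise if any label has no code."""
    missing = sorted(set(labels) - set(codes))
    if missing:
        raise KeyError(f"No code defined for: {missing}")
    return [codes[label] for label in labels]


category_code_column = encode(categories, category_codes)
print(category_codes)
print(category_code_column)
```

In pandas, `df['Category'].map(category_codes)` plays a similar role: labels without an entry become `NaN` instead of silently keeping their original value.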


## 3. Train-test split

We'll set apart a test set to assess the quality of our models. We'll run cross-validation on the training set to tune the hyperparameters, and then measure performance on the unseen data of the test set.

In [96]:
X_train, X_test, y_train, y_test = train_test_split(df['Content_Parsed'], 
                                                    df['Category_Code'], 
                                                    test_size=0.15, 
                                                    random_state=8)

Since we don't have many observations (only 2,225), we'll choose a test set size of 15% of the full dataset.
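
With a small dataset it can also help to keep the class proportions similar in both splits. The sketch below shows the optional `stratify` argument of `train_test_split` on made-up data; the notebook's own split above is not stratified, so this is an extension and not the method used here.

```python
from sklearn.model_selection import train_test_split

# Made-up texts and labels, for illustration only.
texts = [f"article {i}" for i in range(40)]
labels = ["Audi"] * 20 + ["BMW"] * 10 + ["Tesla"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.15,
    random_state=8,
    stratify=labels,  # keep class proportions similar in both splits
)

print(len(X_train), len(X_test))
print(sorted(set(y_test)))
```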

## 4. Text representation

We have various options:

* Count Vectors as features
* TF-IDF Vectors as features
* Word Embeddings as features
* Text / NLP based features
* Topic Models as features

We'll use **TF-IDF Vectors** as features.

We have to define the different parameters:

* `ngram_range`: we want to consider both unigrams and bigrams.
* `max_df`: when building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (an integer is an absolute document count, a float is a proportion of documents).
* `min_df`: when building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold (same convention as `max_df`).
* `max_features`: if not `None`, build a vocabulary that only considers the top `max_features` terms ordered by term frequency across the corpus.

See `TfidfVectorizer?` for further detail.

Note that we are implicitly scaling our data when representing it as TF-IDF features, through the `norm` argument.
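
To make the weighting concrete, the sketch below computes TF-IDF by hand for a tiny corpus. It assumes the formulas scikit-learn documents for `smooth_idf=True` (the default), `sublinear_tf=True` and `norm='l2'`; the corpus and the function names are made up for illustration.

```python
import math
from collections import Counter

# Tiny made-up corpus; every document is already cleaned and space-separated.
docs = ["car brake brake", "car road", "survey road road"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
document_frequency = Counter(term for doc in tokenized for term in set(doc))


def idf(term: str) -> float:
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + document_frequency[term])) + 1


def tfidf_vector(doc_terms):
    """Sublinear tf (1 + ln(tf)) times idf, scaled to unit L2 norm."""
    counts = Counter(doc_terms)
    weights = {t: (1 + math.log(c)) * idf(t) for t, c in counts.items()}
    l2_norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / l2_norm for t, w in weights.items()}


vector = tfidf_vector(tokenized[0])
print(vector)
```

In the first document, `brake` is both more frequent and rarer across the corpus than `car`, so it receives the larger weight. The L2 scaling is what makes every document vector have length one.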

In [101]:
# Parameter selection
ngram_range = (1,2)
min_df = 1
max_df = 10
max_features = 300

We have chosen these values as a first approximation. Since the models we develop later have very good predictive power, we'll stick to them, although other combinations could be tried to further improve the accuracy of the models.

In [102]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
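                        norm='l2',  # valid values are 'l1', 'l2' or None; an empty string is rejected by recent scikit-learn versions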
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(2, 300)
(1, 300)


Please note that we have fitted and then transformed the training set, but we have **only transformed** the **test set**.
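
The effect of fitting only on the training set can be seen on a toy example. In this sketch (made-up texts, default vectorizer settings rather than the parameters chosen above), a word that appears only in the test text never enters the vocabulary and is ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up texts; "drone" never appears in the training texts.
train_texts = ["car brake road", "survey moral car"]
test_texts = ["car drone road"]

vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and IDF weights from the training set.
train_features = vectorizer.fit_transform(train_texts)

# transform reuses that vocabulary; unseen words are dropped.
test_features = vectorizer.transform(test_texts)

print(sorted(vectorizer.vocabulary_))
print(train_features.shape, test_features.shape)
```

Both matrices share the same columns, which is what lets a model trained on the training features score the test features.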

We can use the chi-squared test to see which unigrams and bigrams are most correlated with each category:

In [103]:
from sklearn.feature_selection import chi2
import numpy as np

for Product, category_id in sorted(category_codes.items()):
    features_chi2 = chi2(features_train, labels_train == category_id)
    indices = np.argsort(features_chi2[0])
    feature_names = np.array(tfidf.get_feature_names_out())[indices]  # get_feature_names() was removed in scikit-learn 1.2
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}' category:".format(Product))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-5:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-2:])))
    print("")


# 'business' category:
  . Most correlated unigrams:
. likelihood
. life
. level
. less
. œwe
  . Most correlated bigrams:
. least five
. law professor

# 'entertainment' category:
  . Most correlated unigrams:
. likelihood
. life
. level
. less
. œwe
  . Most correlated bigrams:
. least five
. law professor

# 'politics' category:
  . Most correlated unigrams:
. likelihood
. life
. level
. less
. œwe
  . Most correlated bigrams:
. least five
. law professor

# 'sport' category:
  . Most correlated unigrams:
. likelihood
. life
. level
. less
. œwe
  . Most correlated bigrams:
. least five
. law professor

# 'tech' category:
  . Most correlated unigrams:
. likelihood
. life
. level
. less
. œwe
  . Most correlated bigrams:
. least five
. law professor



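Ideally, the unigrams correspond well to their category, while the bigrams do not. In the output above, however, every category lists the same terms: `labels_train` still holds the original labels (`Audi`, `BMW`, `Tesla`) rather than the integer codes, so `labels_train == category_id` is never true and the chi-squared scores cannot separate the categories. If we get the bigrams in our features: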

In [104]:
bigrams

['road increase',
 'road illegally',
 'practical use',
 'say scenarios',
 'say study',
 'studyâ author',
 'study valuable',
 'study unrealistic',
 'self driving',
 'political policies',
 'policies favor',
 'pedestrian cross',
 'passengers die',
 'oncoming vehicle',
 'number driverless',
 'norms support',
 'need come',
 'pedestrian people',
 'pedestrian step',
 'pedestrians passengers',
 'policies express',
 'play gamesâ',
 'place lower',
 'peopleâ moral',
 'people œi',
 'people relatively',
 'people play',
 'people car',
 'people around',
 'suggest survey',
 'support policies',
 'walker smith',
 'vehicles could',
 'valuable wege',
 'use say',
 'use bryant',
 'us cities',
 'unrealistic instance',
 'œabout risk',
 'yet sale',
 'year events',
 'would face',
 'would cause',
 'worry automate',
 'world largest',
 'wide use',
 'university south',
 'university british',
 'technology company',
 'task suggest',
 'survey really',
 'survey practical',
 'survey moral',
 'survey map',
 'survey machi

The list shows the bigrams that made it into the vocabulary. Because `max_features` keeps only the 300 terms with the highest frequency across the corpus, bigrams are included only when they occur often enough to compete with the unigrams.

Let's save the files we'll need in the next steps:

In [105]:
# X_train
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/X_train.pickle', 'wb') as output:
    pickle.dump(X_train, output)
    
# X_test    
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/X_test.pickle', 'wb') as output:
    pickle.dump(X_test, output)
    
# y_train
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/y_train.pickle', 'wb') as output:
    pickle.dump(y_train, output)
    
# y_test
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/y_test.pickle', 'wb') as output:
    pickle.dump(y_test, output)
    
# df
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/df.pickle', 'wb') as output:
    pickle.dump(df, output)
    
# features_train
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/features_train.pickle', 'wb') as output:
    pickle.dump(features_train, output)

# labels_train
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/labels_train.pickle', 'wb') as output:
    pickle.dump(labels_train, output)

# features_test
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/features_test.pickle', 'wb') as output:
    pickle.dump(features_test, output)

# labels_test
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/labels_test.pickle', 'wb') as output:
    pickle.dump(labels_test, output)
    
# TF-IDF object
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/tfidf.pickle', 'wb') as output:
    pickle.dump(tfidf, output)
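
The repeated `with open(...)` blocks can also be written as a loop over a dictionary of objects. The following is a sketch only: `save_pickles` is a hypothetical helper, and the demo writes to a temporary directory instead of the project folder used above.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickles(objects, output_dir):
    """Write every object to <output_dir>/<name>.pickle and return the paths."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, obj in objects.items():
        path = output_dir / f"{name}.pickle"
        with open(path, "wb") as output:
            pickle.dump(obj, output)
        paths.append(path)
    return paths


# Demo with placeholder objects and a temporary directory (assumptions).
demo_dir = Path(tempfile.mkdtemp()) / "Pickles"
saved = save_pickles({"X_train": ["a", "b"], "y_train": [0, 1]}, demo_dir)

# Read one file back to check the round trip.
with open(demo_dir / "X_train.pickle", "rb") as data:
    restored = pickle.load(data)
print(restored)
```

In the notebook, the dictionary would hold `X_train`, `X_test`, `y_train`, `y_test`, `df`, the feature and label arrays, and the fitted `tfidf` object.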