# Feature Engineering

The next step is to create features from the raw text so we can train the machine learning models. The steps followed are:

1. **Text Cleaning and Preparation**: cleaning of special characters, downcasing, removal of punctuation signs, possessive pronouns and stop words, and lemmatization.
2. **Label coding**: creation of a dictionary to map each category to a code.
3. **Train-test split**: to test the models on unseen data.
4. **Text representation**: use of TF-IDF scores to represent text.

In [2]:
import pickle
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import numpy as np

First of all we'll load the dataset:

In [3]:
path_df = "/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/02. Exploratory Data Analysis/News_dataset.pickle12"

with open(path_df, 'rb') as data:
    df = pickle.load(data)

In [4]:
df.head()

Unnamed: 0,File_Name,Content,Category,Complete_Filename,id,News_length
0,Lahari_Assg1.txt,Ask nine futurists what life will be like in 5...,Alto,Lahari_Assg1.txt-Alto,1,8327
1,Lahari_Assg2.txt,From ushering in an era of decreased car owner...,Ferrari,Lahari_Assg2.txt-Ferrari,1,21021


And display the content of one sample article:

In [5]:
df.loc[1]['Content']

'From ushering in an era of decreased car ownership, to narrowing streets and eliminating parking lots, autonomous vehicles promise to dramatically reshape our cities.\r\n\r\nBut after an Uber-operated self-driving vehicle struck and killed 49-year-old Elaine Herzberg, who was crossing the street with her bike in Tempe, Arizona on March 18, 2018, there are more questions than ever about the safety of this technology, especially as these vehicles are being tested more frequently on public streets.\r\n\r\nSome argue the safety record for self-driving cars isnâ€™t proven, and that itâ€™s unclear whether or not enough testing miles have been driven in real-life conditions. Other safety advocates go further, and say that driverless cars are introducing a new problem to cities, when cities should instead be focusing on improving transit and encouraging walking and biking instead.\r\n\r\nContentions aside, the autonomous revolution is already here, although some cities will see its impacts so

## 1. Text cleaning and preparation

### 1.1. Special character cleaning

We can see the following special characters:

* ``\r``
* ``\n``
* ``\`` before possessive pronouns (`government's = government\'s`)
* ``\`` before possessive pronouns 2 (`Yukos'` = `Yukos\'`)
* ``"`` when quoting text

In [6]:
# \r and \n
df['Content_Parsed_1'] = df['Content'].str.replace("\r", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("\n", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("    ", " ")

Regarding the third and fourth bullets: although the backslash looks like a special character, it won't affect us. It is only the escape sequence Python uses when writing the apostrophe, not a *real* character in the text:

In [7]:
text = "Mr Greenspan\'s"
text

"Mr Greenspan's"

In [9]:
# " when quoting text
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace('"', '')
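
The chained replacements above handle `\r`, `\n` and one specific run of four spaces. As a minimal sketch of an alternative (not what this notebook runs), a single regular expression can collapse any run of whitespace. The helper name `normalize_whitespace` is hypothetical.

```python
import re

def normalize_whitespace(text):
    # \s+ matches any run of whitespace: carriage returns, newlines, tabs, spaces
    return re.sub(r"\s+", " ", text).strip()

sample = "reshape our cities.\r\n\r\nBut after    an Uber-operated vehicle"
print(normalize_whitespace(sample))
# -> reshape our cities. But after an Uber-operated vehicle
```

With pandas, the same pattern could be applied to a whole column through `Series.str.replace(r"\s+", " ", regex=True)`.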

### 1.2. Upcase/downcase

We'll downcase the texts because we want, for example, `Football` and `football` to be the same word.

In [10]:
# Lowercasing the text
df['Content_Parsed_2'] = df['Content_Parsed_1'].str.lower()

### 1.3. Punctuation signs

Punctuation signs won't have any predictive power, so we'll just get rid of them.

In [11]:
punctuation_signs = list("?:!.,;")
df['Content_Parsed_3'] = df['Content_Parsed_2']

for punct_sign in punctuation_signs:
    # regex=False: treat '?' and '.' as literal characters, not regex metacharacters
    df['Content_Parsed_3'] = df['Content_Parsed_3'].str.replace(punct_sign, '', regex=False)


By doing this we alter some numbers (for example, `2.5` becomes `25`), but that is not a problem since we don't expect them to have any predictive power.
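
As an illustration of the same step on a plain string, the sketch below deletes the six punctuation signs in one pass with `str.translate`. It matches characters literally, which matters because `.` and `?` have a special meaning in regular expressions. The names `PUNCTUATION_SIGNS` and `strip_punctuation` are made up for this example.

```python
PUNCTUATION_SIGNS = "?:!.,;"

# With a third argument, str.maketrans maps each listed character to None,
# so translate() deletes them literally (no regex interpretation).
_DELETE_PUNCT = str.maketrans("", "", PUNCTUATION_SIGNS)

def strip_punctuation(text):
    return text.translate(_DELETE_PUNCT)

print(strip_punctuation("on march 18, 2018, there are questions."))
# -> on march 18 2018 there are questions
print(strip_punctuation("a 2.5 mile route"))
# -> a 25 mile route
```

The second call shows the side effect on numbers mentioned above.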

### 1.4. Possessive pronouns

We'll also remove the possessive `'s` ending:

In [12]:
# regex=False: replace the literal two-character string
df['Content_Parsed_4'] = df['Content_Parsed_3'].str.replace("'s", "", regex=False)
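
The replacement above only matches the straight apostrophe. News text often uses the typographic apostrophe (U+2019) instead, which is left untouched. The following is a hedged sketch that covers both forms, assuming the text has been decoded correctly; mis-encoded sequences such as the ones visible in the sample article would have to be repaired when the files are read. `strip_possessive` is a hypothetical helper.

```python
import re

# 's written with a straight (') or typographic (U+2019) apostrophe,
# only when it ends a word.
POSSESSIVE = re.compile(r"['\u2019]s\b")

def strip_possessive(text):
    return POSSESSIVE.sub("", text)

print(strip_possessive("government's plan and uber\u2019s crash"))
# -> government plan and uber crash
```

Like the original replacement, this also shortens contractions such as `it's` to `it`.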

### 1.5. Stemming and Lemmatization

Since stemming can produce output words that don't exist, we'll only use a lemmatization process at this moment. Lemmatization takes into consideration the morphological analysis of the words and returns words that do exist, so it will be more useful for us.

In [13]:
# Downloading punkt and wordnet from NLTK
nltk.download('punkt')
print("------------------------------------------------------------")
nltk.download('wordnet')

------------------------------------------------------------


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/keerthanareddy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/keerthanareddy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [14]:
# Saving the lemmatizer into an object
wordnet_lemmatizer = WordNetLemmatizer()

In order to lemmatize, we have to iterate through every word:

In [15]:
nrows = len(df)
lemmatized_text_list = []

for row in range(0, nrows):
    
    # Create an empty list containing lemmatized words
    lemmatized_list = []
    
    # Save the text and its words into an object
    text = df.loc[row]['Content_Parsed_4']
    text_words = str(text).split(" ")

    # Iterate through every word to lemmatize
    for word in text_words:
        lemmatized_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
        
    # Join the list
    lemmatized_text = " ".join(lemmatized_list)
    
    # Append to the list containing the texts
    lemmatized_text_list.append(lemmatized_text)

In [16]:
df['Content_Parsed_5'] = lemmatized_text_list

Although lemmatization doesn't work perfectly in all cases (as can be seen in the example below), it can be useful.

### 1.6. Stop words

In [17]:
# Downloading the stop words list
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/keerthanareddy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [18]:
# Loading the stop words in english
stop_words = list(stopwords.words('english'))

In [19]:
stop_words[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

To remove the stop words, we'll use a regular expression that matches whole words only, as in the following example:

In [20]:
example = "me eating a meal"
word = "me"

# The regular expression is:
regex = r"\b" + word + r"\b"  # we need to build it like that to work properly

re.sub(regex, "StopWord", example)

'StopWord eating a meal'

We can now loop through all the stop words:

In [21]:
df['Content_Parsed_6'] = df['Content_Parsed_5']

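for stop_word in stop_words:
    # \b anchors the match to whole words; re.escape guards against regex metacharacters
    regex_stopword = r"\b" + re.escape(stop_word) + r"\b"
    # regex=True is required for the pattern to be interpreted as a regular expression
    df['Content_Parsed_6'] = df['Content_Parsed_6'].str.replace(regex_stopword, '', regex=True)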


We have some double or triple spaces between words because of the replacements. However, this is not a problem because we'll tokenize on whitespace later.
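
The loop makes one pass over the column per stop word. A possible alternative, sketched here on plain strings, is to compile all stop words into a single whole-word pattern. Sorting the words longest first means `you're` is tried before `you`; in the stop word list shown earlier `you` comes first, so a one-by-one loop matches it inside `you're` and leaves `'re` behind. The function names are hypothetical, and the final whitespace collapse is an addition that the notebook does not perform.

```python
import re

def build_stopword_pattern(stop_words):
    # Longest first, so "you're" is tried before "you"
    ordered = sorted(stop_words, key=len, reverse=True)
    alternatives = "|".join(re.escape(word) for word in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b")

def remove_stop_words(text, pattern):
    without = pattern.sub("", text)
    # Collapse the double/triple spaces left behind by the removals
    return " ".join(without.split())

pattern = build_stopword_pattern(["me", "a", "you", "you're"])
print(remove_stop_words("me eating a meal", pattern))  # -> eating meal
print(remove_stop_words("you're eating", pattern))     # -> eating
```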

As an example, we'll show an original news article and its modifications throughout the process:

In [23]:
df.loc[1]['Content']

'From ushering in an era of decreased car ownership, to narrowing streets and eliminating parking lots, autonomous vehicles promise to dramatically reshape our cities.\r\n\r\nBut after an Uber-operated self-driving vehicle struck and killed 49-year-old Elaine Herzberg, who was crossing the street with her bike in Tempe, Arizona on March 18, 2018, there are more questions than ever about the safety of this technology, especially as these vehicles are being tested more frequently on public streets.\r\n\r\nSome argue the safety record for self-driving cars isnâ€™t proven, and that itâ€™s unclear whether or not enough testing miles have been driven in real-life conditions. Other safety advocates go further, and say that driverless cars are introducing a new problem to cities, when cities should instead be focusing on improving transit and encouraging walking and biking instead.\r\n\r\nContentions aside, the autonomous revolution is already here, although some cities will see its impacts so

1. Special character cleaning

In [24]:
df.loc[1]['Content_Parsed_1']

'From ushering in an era of decreased car ownership, to narrowing streets and eliminating parking lots, autonomous vehicles promise to dramatically reshape our cities. But after an Uber-operated self-driving vehicle struck and killed 49-year-old Elaine Herzberg, who was crossing the street with her bike in Tempe, Arizona on March 18, 2018, there are more questions than ever about the safety of this technology, especially as these vehicles are being tested more frequently on public streets. Some argue the safety record for self-driving cars isnâ€™t proven, and that itâ€™s unclear whether or not enough testing miles have been driven in real-life conditions. Other safety advocates go further, and say that driverless cars are introducing a new problem to cities, when cities should instead be focusing on improving transit and encouraging walking and biking instead. Contentions aside, the autonomous revolution is already here, although some cities will see its impacts sooner than others. Fro

2. Upcase/downcase

In [25]:
df.loc[1]['Content_Parsed_2']

'from ushering in an era of decreased car ownership, to narrowing streets and eliminating parking lots, autonomous vehicles promise to dramatically reshape our cities. but after an uber-operated self-driving vehicle struck and killed 49-year-old elaine herzberg, who was crossing the street with her bike in tempe, arizona on march 18, 2018, there are more questions than ever about the safety of this technology, especially as these vehicles are being tested more frequently on public streets. some argue the safety record for self-driving cars isnâ€™t proven, and that itâ€™s unclear whether or not enough testing miles have been driven in real-life conditions. other safety advocates go further, and say that driverless cars are introducing a new problem to cities, when cities should instead be focusing on improving transit and encouraging walking and biking instead. contentions aside, the autonomous revolution is already here, although some cities will see its impacts sooner than others. fro

3. Punctuation signs

In [26]:
df.loc[1]['Content_Parsed_3']

'from ushering in an era of decreased car ownership to narrowing streets and eliminating parking lots autonomous vehicles promise to dramatically reshape our cities but after an uber-operated self-driving vehicle struck and killed 49-year-old elaine herzberg who was crossing the street with her bike in tempe arizona on march 18 2018 there are more questions than ever about the safety of this technology especially as these vehicles are being tested more frequently on public streets some argue the safety record for self-driving cars isnâ€™t proven and that itâ€™s unclear whether or not enough testing miles have been driven in real-life conditions other safety advocates go further and say that driverless cars are introducing a new problem to cities when cities should instead be focusing on improving transit and encouraging walking and biking instead contentions aside the autonomous revolution is already here although some cities will see its impacts sooner than others from las vegas where

4. Possessive pronouns

In [27]:
df.loc[1]['Content_Parsed_4']

'from ushering in an era of decreased car ownership to narrowing streets and eliminating parking lots autonomous vehicles promise to dramatically reshape our cities but after an uber-operated self-driving vehicle struck and killed 49-year-old elaine herzberg who was crossing the street with her bike in tempe arizona on march 18 2018 there are more questions than ever about the safety of this technology especially as these vehicles are being tested more frequently on public streets some argue the safety record for self-driving cars isnâ€™t proven and that itâ€™s unclear whether or not enough testing miles have been driven in real-life conditions other safety advocates go further and say that driverless cars are introducing a new problem to cities when cities should instead be focusing on improving transit and encouraging walking and biking instead contentions aside the autonomous revolution is already here although some cities will see its impacts sooner than others from las vegas where

5. Stemming and Lemmatization

In [28]:
df.loc[1]['Content_Parsed_5']

'from usher in an era of decrease car ownership to narrow streets and eliminate park lot autonomous vehicles promise to dramatically reshape our cities but after an uber-operated self-driving vehicle strike and kill 49-year-old elaine herzberg who be cross the street with her bike in tempe arizona on march 18 2018 there be more question than ever about the safety of this technology especially as these vehicles be be test more frequently on public streets some argue the safety record for self-driving cars isnâ€™t prove and that itâ€™s unclear whether or not enough test miles have be drive in real-life condition other safety advocate go further and say that driverless cars be introduce a new problem to cities when cities should instead be focus on improve transit and encourage walk and bike instead contentions aside the autonomous revolution be already here although some cities will see its impact sooner than others from las vegas where a navya self-driving minibus scoot slowly along a d

6. Stop words

In [29]:
df.loc[1]['Content_Parsed_6']

' usher   era  decrease car ownership  narrow streets  eliminate park lot autonomous vehicles promise  dramatically reshape  cities    uber-operated self-driving vehicle strike  kill 49-year-old elaine herzberg   cross  street   bike  tempe arizona  march 18 2018    question  ever   safety   technology especially   vehicles   test  frequently  public streets  argue  safety record  self-driving cars isnâ€™ prove   itâ€™ unclear whether   enough test miles   drive  real-life condition  safety advocate go   say  driverless cars  introduce  new problem  cities  cities  instead  focus  improve transit  encourage walk  bike instead contentions aside  autonomous revolution  already  although  cities  see  impact sooner  others  las vegas   navya self-driving minibus scoot slowly along  downtown street  general motorsâ€™ cruise ride-hailing service  san francisco  backup humans   driverâ€™ seat  waymoâ€™ family-focused chandler arizonaâ€“based pilot program  use  human operators   chrysler pac

Finally, we can delete the intermediate columns:

In [30]:
df.head(1)

Unnamed: 0,File_Name,Content,Category,Complete_Filename,id,News_length,Content_Parsed_1,Content_Parsed_2,Content_Parsed_3,Content_Parsed_4,Content_Parsed_5,Content_Parsed_6
0,Lahari_Assg1.txt,Ask nine futurists what life will be like in 5...,Alto,Lahari_Assg1.txt-Alto,1,8327,Ask nine futurists what life will be like in 5...,ask nine futurists what life will be like in 5...,ask nine futurists what life will be like in 5...,ask nine futurists what life will be like in 5...,ask nine futurists what life will be like in 5...,ask nine futurists life like 50 years ' l...


In [31]:
list_columns = ["File_Name", "Category", "Content", "Content_Parsed_6"]
df = df[list_columns]

df = df.rename(columns={'Content_Parsed_6': 'Content_Parsed'})

In [32]:
df.head()

Unnamed: 0,File_Name,Category,Content,Content_Parsed
0,Lahari_Assg1.txt,Alto,Ask nine futurists what life will be like in 5...,ask nine futurists life like 50 years ' l...
1,Lahari_Assg2.txt,Ferrari,From ushering in an era of decreased car owner...,usher era decrease car ownership narrow s...


**IMPORTANT:**

We need to remember that our model will gather the latest news articles from different newspapers on demand. For that reason, we need to take into account not only the peculiarities of the training set articles, but also those that may appear in the gathered news articles.

For this reason, possible peculiarities have been studied in the *05. News Scraping* folder.
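
Since newly gathered articles must go through the same preparation as the training set, it can help to bundle the steps of section 1 into one function. The sketch below is an approximation under stated assumptions, not the code this notebook runs: `clean_text` is a hypothetical name, the lemmatizer is passed in as a callable (identity by default), and stop words are filtered token by token rather than with a regular expression.

```python
import re

def clean_text(text, stop_words=(), lemmatize=lambda word: word):
    """Apply the cleaning steps of section 1 to a single string."""
    text = re.sub(r"[\r\n]+", " ", text).replace('"', "")    # 1.1 special characters
    text = text.lower()                                       # 1.2 downcase
    text = text.translate(str.maketrans("", "", "?:!.,;"))    # 1.3 punctuation signs
    text = text.replace("'s", "")                             # 1.4 possessive endings
    words = [lemmatize(word) for word in text.split()]        # 1.5 lemmatization
    stops = set(stop_words)
    return " ".join(word for word in words if word not in stops)  # 1.6 stop words

print(clean_text('The "Cat\'s" toy.\r\nMe too!', stop_words=["the", "me"]))
# -> cat toy too
```

In this notebook the callable could be `lambda word: wordnet_lemmatizer.lemmatize(word, pos="v")` and the stop words the NLTK list loaded above.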

## 2. Label coding

We'll create a dictionary with the label codification:

In [33]:
category_codes = {
    'business': 0,
    'entertainment': 1,
    'politics': 2,
    'sport': 3,
    'tech': 4
}

In [34]:
# Category mapping
df['Category_Code'] = df['Category']
df = df.replace({'Category_Code':category_codes})

In [35]:
df.head()

Unnamed: 0,File_Name,Category,Content,Content_Parsed,Category_Code
0,Lahari_Assg1.txt,Alto,Ask nine futurists what life will be like in 5...,ask nine futurists life like 50 years ' l...,Alto
1,Lahari_Assg2.txt,Ferrari,From ushering in an era of decreased car owner...,usher era decrease car ownership narrow s...,Ferrari
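
Note that in the output above `Category_Code` still holds the original labels (`Alto`, `Ferrari`): none of them is a key of `category_codes`, so `replace` leaves them untouched. One way to avoid this kind of mismatch, shown here as a sketch with a hypothetical helper `build_category_codes`, is to derive the dictionary from the labels actually present in the data.

```python
def build_category_codes(labels):
    """Map each distinct label to an integer code, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}

labels = ["Alto", "Ferrari", "Alto"]
codes = build_category_codes(labels)
print(codes)                                # -> {'Alto': 0, 'Ferrari': 1}
print([codes[label] for label in labels])   # -> [0, 1, 0]
```

With the DataFrame, the labels would come from `df['Category']` and the codes could be assigned with `df['Category'].map(codes)`.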


## 3. Train - test split

We'll set apart a test set to assess the quality of our models. We'll use cross-validation on the training set to tune the hyperparameters and then measure performance on the unseen data of the test set.

In [36]:
X_train, X_test, y_train, y_test = train_test_split(df['Content_Parsed'], 
                                                    df['Category_Code'], 
                                                    test_size=0.15, 
                                                    random_state=8)

Since we don't have many observations (only 2,225), we'll choose a test set size of 15% of the full dataset.
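
To make the 15% figure concrete, the arithmetic can be sketched as follows. This assumes the test part is rounded up to a whole number of rows, which is our understanding of how `train_test_split` treats a fractional `test_size`; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

print(split_sizes(2225, 0.15))  # -> (1891, 334)
print(split_sizes(2, 0.15))     # -> (1, 1)
```

The second call shows that with only two rows, each side of the split receives a single row.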

## 4. Text representation

We have various options:

* Count Vectors as features
* TF-IDF Vectors as features
* Word Embeddings as features
* Text / NLP based features
* Topic Models as features

We'll use **TF-IDF Vectors** as features.

We have to define the different parameters:

* `ngram_range`: We want to consider both unigrams and bigrams.
* `max_df`: When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold.
* `min_df`: When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold.
* `max_features`: If not None, build a vocabulary that only considers the top `max_features` terms ordered by term frequency across the corpus.

See `TfidfVectorizer?` for further detail.

It needs to be mentioned that we are implicitly scaling our data when representing it as TF-IDF features with the argument `norm`.
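
To show what the scaling amounts to, here is a small pure-Python sketch of a TF-IDF computation with sublinear term frequency (`1 + ln(tf)`), smoothed inverse document frequency (`ln((1 + n) / (1 + df)) + 1`) and L2 normalisation of each row. These formulas reflect our understanding of the scikit-learn defaults combined with `sublinear_tf=True`; the function `tfidf_rows` and the whitespace tokenisation are simplifications made for the example.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with sublinear tf, smoothed idf and L2 row normalisation."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [(1 + math.log(counts[t])) * idf[t] if counts[t] else 0.0 for t in vocab]
        length = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / length for v in raw])  # every non-empty row has length 1
    return vocab, rows

vocab, rows = tfidf_rows(["car crash car", "car safety"])
print(vocab)  # -> ['car', 'crash', 'safety']
for row in rows:
    print([round(v, 3) for v in row])
```

After normalisation every document vector has the same length, so long and short articles become comparable.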

In [39]:
# Parameter selection
ngram_range = (1,2)
min_df = 1
max_df = 10
max_features = 300

We have chosen these values as a first approximation. Since the models we develop later have very good predictive power, we'll stick to these values, although other combinations could be tried to improve the accuracy of the models further.

In [40]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',  # L2-normalise each row; valid values are 'l1', 'l2' or None
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(1, 300)
(1, 300)


Please note that we have fitted and then transformed the training set, but we have **only transformed** the **test set**.
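
The reason for this distinction can be illustrated with a stripped-down count vectorizer: the vocabulary is learned from the training documents only, and terms that appear solely in the test documents are ignored. Both functions below are hypothetical simplifications, not scikit-learn code.

```python
def fit_vocabulary(train_docs):
    """Learn the term -> column index mapping from the training documents only."""
    terms = sorted({term for doc in train_docs for term in doc.split()})
    return {term: index for index, term in enumerate(terms)}

def transform(doc, vocabulary):
    """Count known terms; terms absent from the vocabulary are ignored."""
    row = [0] * len(vocabulary)
    for term in doc.split():
        if term in vocabulary:
            row[vocabulary[term]] += 1
    return row

vocabulary = fit_vocabulary(["car crash", "car safety"])
print(transform("car minibus crash", vocabulary))  # -> [1, 1, 0]
```

Fitting on the test set as well would let information from unseen data leak into the features.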

We can use the Chi squared test in order to see what unigrams and bigrams are most correlated with each category:

In [41]:
from sklearn.feature_selection import chi2
import numpy as np

for category, category_id in sorted(category_codes.items()):
    # One-vs-rest chi-squared score of every feature for this category
    features_chi2 = chi2(features_train, labels_train == category_id)
    indices = np.argsort(features_chi2[0])
    # get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}' category:".format(category))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-5:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-2:])))
    print("")


# 'business' category:
  . Most correlated unigrams:
. fatal
. fact
. facility
. even
. œthe
  . Most correlated bigrams:
. fully autonomous
. fatal crash

# 'entertainment' category:
  . Most correlated unigrams:
. fatal
. fact
. facility
. even
. œthe
  . Most correlated bigrams:
. fully autonomous
. fatal crash

# 'politics' category:
  . Most correlated unigrams:
. fatal
. fact
. facility
. even
. œthe
  . Most correlated bigrams:
. fully autonomous
. fatal crash

# 'sport' category:
  . Most correlated unigrams:
. fatal
. fact
. facility
. even
. œthe
  . Most correlated bigrams:
. fully autonomous
. fatal crash

# 'tech' category:
  . Most correlated unigrams:
. fatal
. fact
. facility
. even
. œthe
  . Most correlated bigrams:
. fully autonomous
. fatal crash



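In this run the lists are identical for every category, so the scores are not separating the categories: the training matrix printed above has a single row, and the `Category_Code` column shown in section 2 still holds the original labels rather than numeric codes. We can still inspect the bigrams that made it into the features: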

In [42]:
bigrams

['public transportation',
 'public streets',
 'public roads',
 'real time',
 'road safety',
 'right many',
 'ride hailing',
 'reduce number',
 'miles us',
 'miles travel',
 'miles drive',
 'november 2019',
 'may 2018',
 'uberâ fatal',
 'transportation secretary',
 'transportation safety',
 'traffic deaths',
 'us streets',
 'without human',
 'vehicles public',
 'vehicles detect',
 'test vehicles',
 'test self',
 'semi autonomous',
 'self driving',
 'self driven',
 'san francisco',
 'safety board',
 'safety advocate',
 'test public',
 'cars make',
 'crash cause',
 'board ntsb',
 '94 percent',
 'autonomous vehicles',
 'autonomous vehicle',
 'autonomous technology',
 'automate vehicles',
 'human error',
 'human drivers',
 'human driver',
 'human driven',
 'google show',
 'many case',
 'many cars',
 'driving vehicles',
 'driving project',
 'driving program',
 'driving industry',
 'driving company',
 'driving cars',
 'fully autonomous',
 'fatal crash']

The list contains 53 bigrams, so unigrams make up most of the 300 features. Since `max_features` keeps the terms with the highest frequency across the corpus, and individual words tend to occur more often than word pairs, only a minority of bigrams are retained.
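
For reference, the chi-squared score used in the loop above can be sketched for a single feature and a binary label: the feature total observed in each class is compared with the total expected if the feature were independent of the class. This follows our understanding of the computation behind `sklearn.feature_selection.chi2`; `chi2_score` is a hypothetical name, and skipping classes with an expected count of zero is a simplification made here to avoid dividing by zero.

```python
def chi2_score(feature_values, is_in_category):
    """Chi-squared statistic of one non-negative feature against a binary label."""
    total = sum(feature_values)
    share_pos = sum(is_in_category) / len(feature_values)
    observed_pos = sum(v for v, flag in zip(feature_values, is_in_category) if flag)
    observed = [observed_pos, total - observed_pos]
    expected = [total * share_pos, total * (1 - share_pos)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

in_category = [True, True, False, False]
print(chi2_score([2, 2, 0, 0], in_category))  # -> 4.0
print(chi2_score([1, 1, 1, 1], in_category))  # -> 0.0
```

A term concentrated in one category scores high, while a term spread evenly across categories scores zero.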

Let's save the files we'll need in the next steps:

In [43]:
# X_train
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/X_train.pickle', 'wb') as output:
    pickle.dump(X_train, output)
    
# X_test    
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/X_test.pickle', 'wb') as output:
    pickle.dump(X_test, output)
    
# y_train
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/y_train.pickle', 'wb') as output:
    pickle.dump(y_train, output)
    
# y_test
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/y_test.pickle', 'wb') as output:
    pickle.dump(y_test, output)
    
# df
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/df.pickle', 'wb') as output:
    pickle.dump(df, output)
    
# features_train
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/features_train.pickle', 'wb') as output:
    pickle.dump(features_train, output)

# labels_train
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/labels_train.pickle', 'wb') as output:
    pickle.dump(labels_train, output)

# features_test
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/features_test.pickle', 'wb') as output:
    pickle.dump(features_test, output)

# labels_test
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/labels_test.pickle', 'wb') as output:
    pickle.dump(labels_test, output)
    
# TF-IDF object
with open('/Users/keerthanareddy/Downloads/Latest-News-Classifier-master/0. Latest News Classifier/03. Feature Engineering/Pickles/tfidf.pickle', 'wb') as output:
    pickle.dump(tfidf, output)
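
The ten blocks above differ only in the object and the file name. As a sketch of a more compact variant, the objects can be kept in a dictionary and written in a loop. `save_pickles` is a hypothetical helper, and the example uses stand-in objects and a temporary directory rather than the notebook's data and paths.

```python
import pickle
import tempfile
from pathlib import Path

def save_pickles(objects, directory):
    """Write each object to <directory>/<name>.pickle and return the paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name, obj in objects.items():
        path = directory / f"{name}.pickle"
        with open(path, "wb") as output:
            pickle.dump(obj, output)
        paths[name] = path
    return paths

# Stand-in objects and a temporary directory, purely for illustration
with tempfile.TemporaryDirectory() as tmp:
    paths = save_pickles({"X_train": ["usher era"], "labels_train": [0]}, tmp)
    with open(paths["X_train"], "rb") as data:
        restored = pickle.load(data)
    print(restored)  # -> ['usher era']
```

Reading one file back, as done at the end, is a quick check that the objects survive the round trip.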