<a href="https://colab.research.google.com/github/lindyco/Colab/blob/main/Feature_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook scrapes quotes from a website, extracts text features from them, and trains classifiers to predict which page each quote came from.

In [1]:
from bs4 import BeautifulSoup as bs
import urllib.request as req
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

In [2]:
# Import the data from the web page

url1='http://quotes.toscrape.com/page/1/'
url2='http://quotes.toscrape.com/page/2/'

sourcedata1 = req.urlopen(url1)
soup1=bs(sourcedata1,"html.parser")

sourcedata2 = req.urlopen(url2)
soup2=bs(sourcedata2,"html.parser")


In [3]:
# Get all the quotes from each page

quotes = soup1.find_all("span",{"class":"text"}) + soup2.find_all("span",{"class":"text"}) 
length = len(quotes)
length

20

In [4]:
content1 = []
for each in quotes:
    txt = each.text.strip()
    content1.append(txt)
    
content1

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”',
 "“This life is what you make it. No matter what, you'r

In [5]:
# Get all the authors from each page

authors = soup1.find_all("small",{"class": "author"}) + soup2.find_all("small",{"class": "author"})
content2 = []
for each in authors:
    txt = each.text.strip()
    content2.append(txt)
content2

['Albert Einstein',
 'J.K. Rowling',
 'Albert Einstein',
 'Jane Austen',
 'Marilyn Monroe',
 'Albert Einstein',
 'André Gide',
 'Thomas A. Edison',
 'Eleanor Roosevelt',
 'Steve Martin',
 'Marilyn Monroe',
 'J.K. Rowling',
 'Albert Einstein',
 'Bob Marley',
 'Dr. Seuss',
 'Douglas Adams',
 'Elie Wiesel',
 'Friedrich Nietzsche',
 'Mark Twain',
 'Allen Saunders']
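
The quotes and the authors are collected with two separate `find_all` calls, so the two lists line up only if every quote has exactly one author element. As a minimal sketch, the pairs can instead be read from a shared container. The inline HTML below is a hypothetical fragment, and the `div` with class `quote` is an assumption about the page markup, not something taken from the cells above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment mimicking the assumed page structure
html = """
<div class="quote">
  <span class="text">“First quote.”</span>
  <small class="author">Author A</small>
</div>
<div class="quote">
  <span class="text">“Second quote.”</span>
  <small class="author">Author B</small>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Read the text and the author from the same container so they cannot drift apart
rows = []
for block in soup.find_all("div", {"class": "quote"}):
    text = block.find("span", {"class": "text"}).text.strip()
    author = block.find("small", {"class": "author"}).text.strip()
    rows.append({"quote": text, "author": author})

print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame`.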

In [6]:
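# Assign a page label to each quote based on its index
# Note: 0 <= num <= 10 covers 11 indices, so index 10 (the first quote
# scraped from page 2) is also labeled "page0"; see the output below.

content3 = []
for num in range(length):
  if(0 <= num <= 10):
    content3.append("page0")
  if(10 < num <= 20):
    content3.append("page1")
content3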

['page0',
 'page0',
 'page0',
 'page0',
 'page0',
 'page0',
 'page0',
 'page0',
 'page0',
 'page0',
 'page0',
 'page1',
 'page1',
 'page1',
 'page1',
 'page1',
 'page1',
 'page1',
 'page1',
 'page1']
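
The output above contains 11 `page0` labels and 9 `page1` labels, although each page contributes 10 quotes. One way to avoid hard-coded index boundaries is to derive the labels from the length of each page's result list. The sketch below uses hypothetical stand-in lists (`page1_quotes`, `page2_quotes`) in place of the scraped results.

```python
# Hypothetical stand-ins for the quotes scraped from each page
page1_quotes = ["quote %d" % i for i in range(10)]
page2_quotes = ["quote %d" % i for i in range(10, 20)]

# Label every quote with the page it was scraped from
labels = ["page0"] * len(page1_quotes) + ["page1"] * len(page2_quotes)

print(labels.count("page0"), labels.count("page1"))
```

The labels then stay correct even if a page returns a different number of quotes.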

In [7]:
# Build a DataFrame to store the collected data
df = pd.DataFrame(data = {'quote':content1, 'author':content2, 'page':content3})

df

Unnamed: 0,quote,author,page
0,“The world as we have created it is a process ...,Albert Einstein,page0
1,"“It is our choices, Harry, that show what we t...",J.K. Rowling,page0
2,“There are only two ways to live your life. On...,Albert Einstein,page0
3,"“The person, be it gentleman or lady, who has ...",Jane Austen,page0
4,"“Imperfection is beauty, madness is genius and...",Marilyn Monroe,page0
5,“Try not to become a man of success. Rather be...,Albert Einstein,page0
6,“It is better to be hated for what you are tha...,André Gide,page0
7,"“I have not failed. I've just found 10,000 way...",Thomas A. Edison,page0
8,“A woman is like a tea bag; you never know how...,Eleanor Roosevelt,page0
9,"“A day without sunshine is like, you know, nig...",Steve Martin,page0


In [8]:
# Save the scraped quotes, authors and page labels to a CSV file
df.to_csv("quotesCollection.csv",encoding = 'utf-8-sig')

After creating the CSV file, I train the classifiers and predict the labels.

In [9]:
# Import NLTK and download the resources it needs
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score

In [11]:
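# Fix the NumPy random seed (500) so that results are reproducible
np.random.seed(500)

# Load the scraped data with pandas
# Note: the file was written as utf-8-sig but is read here as latin-1, which
# garbles non-ASCII characters such as the curly quotes. Words attached to
# them can then fail the isalpha() check in the cleaning step and be dropped.
# encoding='utf-8-sig' would avoid this, but would change the outputs below.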
Quotes = pd.read_csv(r"quotesCollection.csv",encoding='latin-1')
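
The CSV file is written with `utf-8-sig` and read back with `latin-1`. The following sketch, using only built-in string methods and a sample word taken from the second quote, illustrates what that mismatch does to text that touches a curly quote.

```python
text = "abilities.”"  # ends with a curly closing quote (non-ASCII)

# Bytes as written with encoding='utf-8-sig' (a BOM followed by UTF-8)
raw = text.encode("utf-8-sig")

# Decoding with the wrong codec does not fail, but garbles the text
as_latin1 = raw.decode("latin-1")
as_utf8 = raw.decode("utf-8-sig")

print(repr(as_latin1))
print(repr(as_utf8))
```

Reading the file with the same encoding that wrote it returns the original text unchanged.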


In [12]:
# Data pre-processing - this helps the classification algorithms get better results

# Remove rows with a missing quote, if any.
Quotes.dropna(subset=['quote'], inplace=True)

# Change all the text to lower case.
Quotes['quote'] = [entry.lower() for entry in Quotes['quote']]

# Tokenization: each quote is broken into a list of words
Quotes['quote']= [word_tokenize(entry) for entry in Quotes['quote']]

# Remove stop words and non-alphabetic tokens, and perform lemmatization.

# WordNetLemmatizer needs POS tags to know whether a word is a noun, verb, adjective etc. By default it is set to noun
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
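
To illustrate how `tag_map` is used, the sketch below looks up a few tags by their first letter. To stay free of NLTK downloads it writes the WordNet constants as plain strings, and the tag list is a hypothetical sample of Penn Treebank tags of the kind `pos_tag` returns.

```python
from collections import defaultdict

# WordNet POS constants as plain strings (these are the values of
# wn.NOUN, wn.ADJ, wn.VERB and wn.ADV in nltk.corpus.wordnet)
NOUN, ADJ, VERB, ADV = "n", "a", "v", "r"

tag_map = defaultdict(lambda: NOUN)
tag_map["J"] = ADJ
tag_map["V"] = VERB
tag_map["R"] = ADV

# Hypothetical Penn Treebank tags; only the first letter is looked up
for tag in ["NN", "VBD", "JJ", "RB", "IN"]:
    print(tag, "->", tag_map[tag[0]])
```

Any tag whose first letter is not `J`, `V` or `R`, such as `IN`, falls back to the noun default.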

In [13]:
# Build the stop word set and the lemmatizer once, outside the loop
stop_words = set(stopwords.words('english'))
word_Lemmatized = WordNetLemmatizer()

for index,entry in enumerate(Quotes['quote']):
    # Empty list to store the cleaned words of this quote
    Final_words = []

    # pos_tag provides the 'tag', i.e. whether the word is a noun (N), a verb (V) or something else.
    for word, tag in pos_tag(entry):
        # Keep only alphabetic tokens that are not stop words
        if word.isalpha() and word not in stop_words:
            word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
            Final_words.append(word_Final)
    # Store the processed words of each quote in 'text_final_str' as a string
    Quotes.loc[index,'text_final_str'] = str(Final_words)

print(Quotes['text_final_str'].head())

0    ['world', 'create', 'process', 'thinking', 'ch...
1          ['choice', 'harry', 'show', 'truly', 'far']
2    ['two', 'way', 'live', 'life', 'one', 'though'...
3    ['person', 'gentleman', 'lady', 'pleasure', 'g...
4    ['beauty', 'madness', 'genius', 'good', 'absol...
Name: text_final_str, dtype: object


In [14]:
# Split the data into training (80%) and test (20%) sets
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(Quotes['text_final_str'],Quotes['page'],test_size=0.2, random_state=42)
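
With only 20 rows, a random split can leave the test set with an uneven mix of classes. As a hedged sketch, `train_test_split` accepts a `stratify` argument that keeps the class proportions equal in both parts. The texts and labels below are hypothetical placeholders, not the scraped data.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 texts with 10 labels per class
texts = ["text %d" % i for i in range(20)]
labels = ["page0"] * 10 + ["page1"] * 10

# stratify=labels asks for the same class proportions in both parts
train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(train_x), len(test_x))
print(Counter(test_y))
```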

In [15]:
# Encode the target variable (the page label) - this transforms the categorical string labels into numerical values
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
# Reuse the mapping learned from the training labels instead of refitting on the test labels
Test_Y = Encoder.transform(Test_Y)
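
The test labels need the same string-to-number mapping as the training labels. The sketch below, on hypothetical label lists, shows how `transform` and a second `fit_transform` can disagree when the test set does not contain every class.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label lists; the test set happens to contain one class only
train_labels = ["page0", "page1", "page0", "page1"]
test_labels = ["page1", "page1"]

encoder = LabelEncoder()
encoder.fit(train_labels)

# transform reuses the mapping learned from the training labels
consistent = encoder.transform(test_labels)

# Refitting on the test labels alone builds a new mapping
refit = LabelEncoder().fit_transform(test_labels)

print(list(encoder.classes_), consistent, refit)
```

When both splits contain every class, the two approaches give the same numbers, which is the case for the data in this notebook.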

In [16]:
# Vectorize the words with the TF-IDF vectorizer - this measures how important a word is in a quote relative to the whole collection of quotes
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(Quotes['text_final_str'])

Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)
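
The `text_final_str` column holds Python lists converted to strings, brackets and quotes included. The sketch below, on hypothetical documents in the same form, suggests why this still works: the vectorizer's default token pattern only keeps runs of two or more word characters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents in the same "stringified list" form as text_final_str
docs = [
    "['world', 'create', 'process', 'thinking']",
    "['choice', 'harry', 'show', 'truly']",
    "['world', 'choice']",
]

vectorizer = TfidfVectorizer(max_features=5000)
matrix = vectorizer.fit_transform(docs)

# Brackets, quotes and commas are not word characters, so they are ignored
print(sorted(vectorizer.vocabulary_))
print(matrix.shape)
```

Each row of `matrix` is one document and each column is one vocabulary word.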

In [17]:
# Now we can run different algorithms to classify our data and check their accuracy

# Classifier - Algorithm - Naive Bayes
# fit the training dataset on the classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf,Train_Y)

# predict the labels on the test dataset
predictions_NB = Naive.predict(Test_X_Tfidf)
print(predictions_NB)

# Use accuracy_score(y_true, y_pred) to get the accuracy
print("Naive Bayes Accuracy Score -> ",accuracy_score(Test_Y, predictions_NB)*100)


[0 0 0 0]
Naive Bayes Accuracy Score ->  50.0


In [18]:
# Classifier - Algorithm - SVM
# fit the training dataset on the classifier
# (degree and gamma are ignored by the linear kernel)
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X_Tfidf,Train_Y)

# predict the labels on the test dataset
predictions_SVM = SVM.predict(Test_X_Tfidf)
print(predictions_SVM)

# Use accuracy_score(y_true, y_pred) to get the accuracy
print("SVM Accuracy Score -> ",accuracy_score(Test_Y, predictions_SVM)*100)

[0 0 0 0]
SVM Accuracy Score ->  50.0


------------------------------------------------------------------------------------------------

Both classification models reach the same accuracy, 50%. I think the reason is that this program predicts the page number, which is not a category related to any words, whereas Naive Bayes and SVM predict from word statistics.
Therefore, I decided to make another CSV file based on the category of the quotes, using the categories life and humor.
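
Both models printed `[0 0 0 0]`, meaning they predicted class 0 for every test quote. The arithmetic behind the 50% score can be sketched as follows; the true labels here are a hypothetical balanced set, since the notebook does not print `Test_Y`.

```python
# Hypothetical true labels for a balanced four-quote test set
true_labels = [0, 1, 1, 0]

# Predictions as printed by both models: class 0 for every test quote
predictions = [0, 0, 0, 0]

correct = sum(t == p for t, p in zip(true_labels, predictions))
accuracy = 100 * correct / len(true_labels)
print(accuracy)
```

A model that always predicts one class scores exactly the share of that class in the test set, so 50% here carries no evidence that the words were informative.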

In [19]:
# Create the csv file quotetype

url1='http://quotes.toscrape.com/tag/life/'
url2='http://quotes.toscrape.com/tag/humor/'

sourcedata1 = req.urlopen(url1)
soup1=bs(sourcedata1,"html.parser")

sourcedata2 = req.urlopen(url2)
soup2=bs(sourcedata2,"html.parser")

quotes = soup1.find_all("span",{"class":"text"}) + soup2.find_all("span",{"class":"text"}) 
length = len(quotes)
length

content1 = []
for each in quotes:
    txt = each.text.strip()
    content1.append(txt)
    
content1

authors = soup1.find_all("small",{"class": "author"}) + soup2.find_all("small",{"class": "author"})
content2 = []
for each in authors:
    txt = each.text.strip()
    content2.append(txt)
content2

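# type_0 is the tag life and type_1 is the tag humor
# Note: 0 <= num <= 10 covers 11 indices, so if each page contributes
# 10 quotes, the first humor quote (index 10) is also labeled "type_0".
content3 = []
for num in range(length):
  if(0 <= num <= 10):
    content3.append("type_0")
  if(10 < num <= 20):
    content3.append("type_1")
content3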

df = pd.DataFrame(data = {'quote':content1, 'author':content2, 'type':content3})

df.to_csv("quotetype.csv",encoding = 'utf-8-sig')

In [20]:
# Prepare the data for the prediction models

Quotes_type = pd.read_csv(r"quotetype.csv",encoding='latin-1')

# Remove rows with a missing quote, if any.
Quotes_type.dropna(subset=['quote'], inplace=True)

Quotes_type['quote'] = [entry.lower() for entry in Quotes_type['quote']]

Quotes_type['quote']= [word_tokenize(entry) for entry in Quotes_type['quote']]

tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

for index,entry in enumerate(Quotes_type['quote']):
    Final_words = []
    word_Lemmatized = WordNetLemmatizer()

    for word, tag in pos_tag(entry):
        if word.isalpha() and word not in stopwords.words('english') :
            word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
            Final_words.append(word_Final)
    Quotes_type.loc[index,'text_final_str'] = str(Final_words)

print(Quotes_type['text_final_str'].head())


0    ['two', 'way', 'live', 'life', 'one', 'though'...
1                              ['well', 'hat', 'love']
2    ['life', 'make', 'matter', 'go', 'mess', 'some...
3    ['may', 'go', 'intend', 'go', 'think', 'end', ...
4    ['friend', 'good', 'book', 'sleepy', 'conscien...
Name: text_final_str, dtype: object


In [21]:
# Set up the training and testing data sets

Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(Quotes_type['text_final_str'],Quotes_type['type'],test_size=0.2, random_state=42)

Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
# Reuse the mapping learned from the training labels
Test_Y = Encoder.transform(Test_Y)

Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(Quotes_type['text_final_str'])

Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)

In [22]:
# Naive Bayes model
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf,Train_Y)

predictions_NB = Naive.predict(Test_X_Tfidf)
print(predictions_NB)

print("Naive Bayes Accuracy Score -> ",accuracy_score(Test_Y, predictions_NB)*100)

# SVM model
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X_Tfidf,Train_Y)

predictions_SVM = SVM.predict(Test_X_Tfidf)
print(predictions_SVM)

print("SVM Accuracy Score -> ",accuracy_score(Test_Y, predictions_SVM)*100)


[0 0 1 0]
Naive Bayes Accuracy Score ->  75.0
[0 0 0 0]
SVM Accuracy Score ->  50.0


This time the two models return different accuracies: Naive Bayes (75%) scores higher than the SVM (50%). With only four test quotes, each prediction changes the score by 25 points.