# COMP41680 Assignment 2: Text Scraping and Classification

# Problem Statement

The objective of this assignment is to scrape a corpus of news articles from a set of
web pages, pre-process the corpus, and evaluate the performance of automated
classification of these articles in a supervised learning context. 

# Part 1. Data Collection 

1.1. Base URL: http://mlg.ucd.ie/modules/COMP41680/archive/index.html

This URL stores news articles organised by month, each labelled with a category such as Sport, Business or Technology.

### Importing the packages required by the code

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
import glob
import urllib.request  # urlopen lives in the urllib.request submodule

Declaring `url` and `PATH` as global variables so that they can be accessed from every function. `url` holds the base location from which the data is scraped, and `PATH` holds a directory name for article files. Note that the category files written later go to a separate `data/` folder.

In [2]:
url = "http://mlg.ucd.ie/modules/COMP41680/archive/"
PATH = "articles/"

Defining a generic function `get_html()` that fetches the HTML page at a URL. It can be called whenever a page needs to be downloaded. It takes the URL as input and returns the page parsed by Beautiful Soup.

In [3]:
# Fetch a page and return it as parsed HTML
def get_html(url):
    response = requests.get(url)  # Response object holding everything the server returned
    data = response.text          # requests decodes the body to text automatically
    return BeautifulSoup(data, 'html.parser')  # parse with Python's built-in HTML parser

The `web_scraping()` function returns all the month links found on the index page. It takes an extension (the page name appended to the base `url`) and a filter as input and returns a list of links. The `filter` argument is accepted but not used in the body shown here.

In [4]:
# Scrape the links from a page under the base url
def web_scraping(extension, filter):
    links = []
    html = get_html(url + extension)  # fetch and parse the page
    for link in html.find_all("a"):   # <a> tags hold the hyperlinks
        link_href = link.get("href", "").strip()  # href value with surrounding whitespace removed ("" if missing)
        if link_href != 'index.html' and link_href != '':  # skip empty links and the index page itself
            links.append(link_href)
    return links
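
The filtering logic in `web_scraping()` can be checked without network access. The sketch below is an illustration only: it swaps Beautiful Soup for the standard library's `html.parser`, and the HTML fragment, the `LinkCollector` class and the `extract_links` helper are hypothetical names introduced here, not part of the assignment code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://mlg.ucd.ie/modules/COMP41680/archive/"

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (hypothetical helper)."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""   # a missing href becomes ""
            self.hrefs.append(href.strip())

def extract_links(html_text):
    """Apply the same filter as web_scraping(): drop empty links and index.html."""
    parser = LinkCollector()
    parser.feed(html_text)
    return [h for h in parser.hrefs if h not in ("", "index.html")]

# Made-up page fragment used only for this illustration
sample_html = """
<a href="index.html">Home</a>
<a href=" month-jan.html ">January</a>
<a href="">Empty</a>
<a>No href</a>
<a href="month-feb.html">February</a>
"""

links = extract_links(sample_html)
absolute_links = [urljoin(BASE_URL, link) for link in links]
print(links)
print(absolute_links)
```

Using `urljoin` rather than string concatenation is a design choice of this sketch: it resolves each relative link against the base URL.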


# Functions performed by getArticleData

#### getArticleData takes the month links as input, follows each one down to the article level, and retrieves the content of every article.

#### It collects the article content category by category.

#### It stores the content in one file per category.

#### It generates .txt files.

In [5]:
# Save the article text to one file per category
def getArticleData(links):
    os.makedirs("data", exist_ok=True)       # make sure the output folder exists
    for month_link in links:                 # for every month page
        soup = get_html(url + month_link)    # parsed HTML of the month page
        rows = soup.find_all("tr")           # each table row holds one article and its category
        for row in rows[1:]:                 # skip the first row (presumably the table header)
            label = row.find(class_="category").get_text().strip()  # category name, whitespace stripped

            if label != "N/A":               # only keep articles that have a category
                art_link = row.find(class_="title").a["href"].strip()  # relative link to the article
                article_url = url + art_link  # full URL of the article

                r = urllib.request.urlopen(article_url).read()  # download the article page
                article_soup = BeautifulSoup(r, "html.parser")  # parse the response
                paragraphs = article_soup.find_all("p")          # the body text is stored in <p> tags
                content = article_soup.find("h2").get_text()    # start with the headline
                for para in paragraphs[:-1]:                    # append every paragraph except the last
                    content = content + " " + para.get_text()
                fileName = "data/" + label + ".txt"             # one file per category

                # append one article per line; the with-block closes the file
                with open(fileName, "a+", encoding="utf-8") as f:
                    f.write(content.strip())
                    f.write("\n")
            
            

# Saving the Data in txt File

In [6]:
links = web_scraping("index.html", "month")  # fetch the link of every month page
getArticleData(links)   # scrape each month and write the articles to the category files

# Loading the Data from the txt file

In [7]:

df = pd.DataFrame()
category_files = ['business.txt', 'sport.txt', 'technology.txt']
for filename in category_files:
    name1 = filename.replace(".txt", "")   # category name without the extension
    temp_df = pd.DataFrame()
    print(" The name of the file:", name1)
    body_list = []
    # read one article per line from the category file written earlier
    with open(os.path.join("data", filename), "r", encoding="utf-8") as f:
        for line in f:
            body_list.append(line.strip())
    temp_df["Article"] = body_list
    temp_df["Category"] = name1
    df = pd.concat([df, temp_df], ignore_index=True)  # ignore_index keeps row labels unique across categories
    


 The name of the file: business
 The name of the file: sport
 The name of the file: technology


### The dataframe loaded from the file 

In [8]:
df.head()

Unnamed: 0,Article,Category
0,Asian quake hits European shares Asian quake h...,business
1,Barclays shares up on merger talk Barclays sha...,business
2,Bush budget seeks deep cutbacks Bush budget se...,business
3,Bush to get 'tough' on deficit Bush to get 'to...,business
4,Card fraudsters 'targeting web' Card fraudster...,business


### Removing the column width limit so that the full article text is displayed

In [9]:
pd.set_option('display.max_colwidth', None)  # None removes the limit; -1 is deprecated in newer pandas versions

#### Encode the category labels as integer ids

In [10]:
category_id, unique_category = pd.factorize(df["Category"])
#enumerating the category as numerical value and storing as category id
#print(category_id)
#print(unique_category)

df['category_id']=category_id
df.head()

Unnamed: 0,Article,Category,category_id
0,"Asian quake hits European shares Asian quake hits European shares Shares in Europe's leading reinsurers and travel firms have fallen as the scale of the damage wrought by tsunamis across south Asia has become apparent. More than 23,000 people have been killed following a massive underwater earthquake and many of the worst hit areas are popular tourist destinations. Reisurance firms such as Swiss Re and Munich Re lost value as investors worried about rebuilding costs. But the disaster has little impact on stock markets in the US and Asia. Currencies including the Thai baht and Indonesian rupiah weakened as analysts warned that economic growth may slow. ""It came at the worst possible time,"" said Hans Goetti, a Singapore-based fund manager. ""The impact on the tourist industry is pretty devastating, especially in Thailand."" Travel-related shares dropped in Europe, with companies such as Germany's TUI and Lufthansa and France's Club Mediterranne sliding. Insurers and reinsurance firms were also under pressure in Europe. Shares in Munich Re and Swiss Re - the world's two biggest reinsurers - both fell 1.7% as the market speculated about the cost of rebuilding in Asia. Zurich Financial, Allianz and Axa also suffered a decline in value. However, their losses were much smaller, reflecting the market's view that reinsurers were likely to pick up the bulk of the costs. Worries about the size of insurance liabilities dragged European shares down, although the impact was exacerbated by light post-Christmas trading. Germany's benchmark Dax index closed the day 16.29 points lower at 3.817.69 while France's Cac index of leading shares fell 5.07 points to 3.817.69. Investors pointed out, however, that declines probably would be industry specific, with the travel and insurance firms hit hardest. ""It's still too early for concrete damage figures,"" Swiss Re's spokesman Floiran Woest told Associated Press. 
""That also has to do with the fact that the damage is very widely spread geographically."" The unfolding scale of the disaster in south Asia had little immediate impact on US shares, however. The Dow Jones index had risen 20.54 points, or 0.2%, to 10,847.66 by late morning as analsyts were cheered by more encouraging reports from retailers about post-Christmas sales. In Asian markets, adjustments were made quickly to account for lower earnings and the cost of repairs. Thai Airways shed almost 4%. The country relies on tourism for about 6% of its total economy. Singapore Airlines dropped 2.6%. About 5% of Singapore's annual gross domestic product (GDP) comes from tourism. Malaysia's budget airline, AirAsia fell 2.9%. Resort operator Tanco Holdings slumped 5%. Travel companies also took a hit, with Japan's Kinki Nippon sliding 1.5% and HIS dropping 3.3%. However, the overall impact on Asia's largest stock market, Japan's Nikkei, was slight. Shares fell just 0.03%. Concerns about the strength of economic growth going forward weighed on the currency markets. The Indonesian rupiah lost as much as 0.6% against the US dollar, before bouncing back slightly to trade at 9,300. The Thai baht lost 0.3% against the US currency, trading at 39.10. In India, where more than 2,000 people are thought to have died, the rupee shed 0.1% against the dollar Analysts said that it was difficult to predict the total cost of the disaster and warned that share prices and currencies would come under increasing pressure as the bills mounted.",business,0
1,"Barclays shares up on merger talk Barclays shares up on merger talk Shares in UK banking group Barclays have risen on Monday following a weekend press report that it had held merger talks with US bank Wells Fargo. A tie-up between Barclays and California-based Wells Fargo would create the world's fourth biggest bank, valued at $180bn (£96bn). Barclays has declined to comment on the report in the Sunday Express, saying it does not respond to market speculation. The two banks reportedly held talks in October and November 2004. Barclays shares were up 8 pence, or 1.3%, at 605 pence by late morning in London on Monday, making it the second biggest gainer in the FTSE 100 index. UK banking icon Barclays was founded more than 300 years ago; it has operations in over 60 countries and employs 76,200 staff worldwide. Its North American divisions focus on business banking, whereas Wells Fargo operates retail and business banking services from 6,000 branches. In 2003, Barclays reported a 20% rise in pre-tax profits to £3.8bn, and it has recently forecast similar gains in 2004, predicting that full year pre-tax profits would rise 18% to £4.5bn. Wells Fargo had net income of $6.2bn in its last financial year, a 9% increase on the previous year, and revenues of $28.4bn. Barclays was the focus of takeover speculation in August, when it was linked to Citigroup, though no bid has ever materialised. Stock market traders were sceptical that the latest reports heralded a deal. ""The chief executive would be abandoning his duty if he didn't talk to rivals, but a deal doesn't seem likely,"" Online News quoted one trader as saying.",business,0
2,"Bush budget seeks deep cutbacks Bush budget seeks deep cutbacks President Bush has presented his 2006 budget, cutting domestic spending in a bid to lower a record deficit projected to peak at $427bn (£230bn) this year. The $2.58 trillion (£1.38 trillion) budget submitted to Congress affects 150 domestic programmes from farming to the environment, education and health. But foreign aid is due to rise by 10%, with more money to treat HIV/Aids and reward economic and political reform. Military spending is also set to rise by 4.8%, to reach $419.3bn. The budget does not include the cost of running military operations in Iraq and Afghanistan, for which the administration is expected to seek an extra $80bn from Congress later this year. Congress will spend several months debating George W Bush's proposal. The state department's planned budget would rise to just under $23bn - a fraction of the defence department's request - including almost $6bn to assist US allies in the ""war on terror"". However, the administration is keen to highlight its global effort to tackle HIV/Aids, the Online News's Jonathan Beale reports, and planned spending would almost double to $3bn, with much of that money going to African nations. Mr Bush also wants to increase the amount given to poorer countries through his Millennium Challenge Corporation. The scheme has been set up to reward developing countries that embrace what the US considers to be good governance and sound policies. Yet Mr Bush's proposed spending of $3bn on that project is well below his initial promise of $5bn. A key spending line missing from proposals is the cost of funding the administration's proposed radical overhaul of Social Security, the pensions programme on which many Americans rely for their retirement income. Some experts believe this could require borrowing of up to $4.5 trillion over a 20-year period. Neither does the budget include any cash to purchase crude oil for the US emergency petroleum stockpile. 
Concern over the level of the reserve, created in 1970s, has led to rises in oil prices over the past year. The Bush administration will instead continue to fill the reserve by taking oil - rather than cash - from energy companies that drill under federal leases. The outline proposes reductions in budgets at 12 out of 23 government agencies including cuts of 9.6% at Agriculture and 5.6% at the Environmental Protection Agency. The spending plan for the year beginning 1 October is banking on a healthy US economy to boost government income by 6.1% to $2.18 trillion. Spending is forecast to grow by 3.5% to $2.57 trillion. But the budget is still the tightest yet under Mr Bush's presidency. ""In order to sustain our economic expansion, we must continue pro-growth policies and enforce even greater spending restraint across federal government,"" Mr Bush said in his budget message to Congress. Mr Bush has promised to halve the US's massive budget deficit within five years. The deficit, partly the result of massive tax cuts early in Mr Bush's presidency, has been a key factor in pushing the US dollar lower. The independent Congressional Budget Office estimates that the shortfall could shrink to little more than $200bn by 2009, returning to the surpluses seen in the late 1990s by 2012. But its estimates depend on the tax cuts not being made permanent, in line with the promise when they were passed that they would ""sunset"", or disappear, in 2010. Most Republicans, however, want them to stay in place. And the figures also rely on the ""Social Security trust fund"" - the money set aside to cover the swelling costs of retirement pensions - being offset against the main budget deficit.",business,0
3,"Bush to get 'tough' on deficit Bush to get 'tough' on deficit US president George W Bush has pledged to introduce a ""tough"" federal budget next February in a bid to halve the country's deficit in five years. The US budget and its trade deficit are both deep in the red, helping to push the dollar to lows against the euro and fuelling fears about the economy. Mr Bush indicated there would be ""strict discipline"" on non-defence spending in the budget. The vow to cut the deficit had been one of his re-election declarations. The federal budget deficit hit a record $412bn (£211.6bn) in the 12 months to 30 September and $377bn in the previous year. ""We will submit a budget that fits the times,"" Mr Bush said. ""It will provide every tool and resource to the military, will protect the homeland, and meet other priorities of the government."" The US has said it is committed to a strong dollar. But the dollar's weakness has hit European and Asian exporters and lead to calls for US intervention to boost the currency. Mr Bush, however, has said the best way to halt the dollar's slide is to deal with the US deficit. ""It's a budget that I think will send the right signal to the financial markets and to those concerned about our short-term deficits,"" Mr Bush added. ""As well, we've got to deal with the long-term deficit issues.""",business,0
4,"Card fraudsters 'targeting web' Card fraudsters 'targeting web' New safeguards on credit and debit card payments in shops has led fraudsters to focus on internet and phone payments, an anti-fraud agency has said. Anti-fraud consultancy Retail Decisions says 'card-not-present' fraud, where goods are paid for online or by phone, has risen since the start of 2005. The introduction of 'chip and pin' cards has tightened security for transactions on the High Street. But the clampdown has caused fraudsters to change tack, Retail Decisions said. The introduction of chip and pin cards aimed to cut down on credit card fraud in stores by asking shoppers to verify their identity with a confidential personal pin number, instead of a signature. Retail Decisions chief executive Carl Clump told the Online News that there was ""no doubt"" that chip and pin would ""reduce card fraud in the card-present environment"". ""However, it is important to monitor what happens in the card-not-present environment as fraudsters will turn their attention to the internet, mail order, telephone order and interactive TV,"" he said. ""We have seen a 22% uplift in card-not-present fraud here in the UK... since the start of the year. ""Fraud doesn't just disappear, it mutates to the next weakest link in the chain,"" he said. Retail Decisions' survey on the implementation of chip and pin found that shoppers had adapted easily to the new system, but that banks' performance in distributing the new cards had been patchy, at best. ""The main issue is that not everyone has the pins they need,"" said Mr Clump. Nearly two thirds - 65% - of the 1,000 people interviewed said they had used chip and pin to make payments. Of these, 83% were happy with the experience, though nearly a quarter said they struggled to remember their pin number. However, only 34% said they had received replacement cards with the necessary 'chip' technology from all their card providers. 
Furthermore, 16% said that none of their cards had been replaced, while 30% said only some had. UK shoppers spent £5.3bn on plastic cards in 2003, the last full year for which figures are available from the Association of Payment Clearing Services (Apacs). Altogether, card scams on UK-issued cards totalled £402.4m in 2003. Card-not-present fraud rose an annual 6% to £116.4m, making it the biggest category even then. Within this, internet fraud totalled £43m, Apacs' figures show.",business,0
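
`pd.factorize` assigns integer codes in order of first appearance, which is why every business article above receives `category_id` 0. A plain-Python sketch of the same idea follows; the `factorize` function and the sample labels are illustrative only and are not the pandas implementation.

```python
def factorize(labels):
    """Map each distinct label to an integer id in order of first appearance."""
    ids, uniques, seen = [], [], {}
    for label in labels:
        if label not in seen:
            seen[label] = len(uniques)   # next free id
            uniques.append(label)
        ids.append(seen[label])
    return ids, uniques

# Made-up label sequence for illustration
sample = ["business", "business", "sport", "technology", "sport"]
ids, uniques = factorize(sample)
print(ids)
print(uniques)
```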


In [11]:
article_id=df.index.tolist()
print(article_id[:5])
categories=df["category_id"]   # category_id column, kept as a pandas Series
print(categories[:5])


[0, 1, 2, 3, 4]
0    0
1    0
2    0
3    0
4    0
Name: category_id, dtype: int64


# Tokenizing Text

For tokenizing, it is convenient to have the articles in a list. The `article` variable therefore stores all the articles of the `df` dataframe in a list for further pre-processing.

In [12]:
article=df["Article"].tolist() #converting article column of data frame to a list for tokenizing it
print("Read %d raw text documents" % len(article))

Read 1408 raw text documents


Raw text documents are textual, not numeric. The first step in analysing unstructured documents is to split the raw text into individual tokens, each corresponding to a single term (word). As an example:

In [13]:
print(article[0])

Asian quake hits European shares Asian quake hits European shares  Shares in Europe's leading reinsurers and travel firms have fallen as the scale of the damage wrought by tsunamis across south Asia has become apparent.  More than 23,000 people have been killed following a massive underwater earthquake and many of the worst hit areas are popular tourist destinations. Reisurance firms such as Swiss Re and Munich Re lost value as investors worried about rebuilding costs. But the disaster has little impact on stock markets in the US and Asia.  Currencies including the Thai baht and Indonesian rupiah weakened as analysts warned that economic growth may slow. "It came at the worst possible time," said Hans Goetti, a Singapore-based fund manager. "The impact on the tourist industry is pretty devastating, especially in Thailand." Travel-related shares dropped in Europe, with companies such as Germany's TUI and Lufthansa and France's Club Mediterranne sliding. Insurers and reinsurance firms we

#### We will use the built-in scikit-learn tokenizer to split each document into tokens. Note that we first convert the entire text to lowercase.

In [28]:
from sklearn.feature_extraction.text import CountVectorizer
tokenize = CountVectorizer().build_tokenizer()

# convert to lowercase (Normalisation), then tokenise 
tokens = []
for articles in article:
    tokens.append(tokenize(articles.lower()))

In [29]:
#printing sample tokens
print(tokens[0])
print("length:", len(tokens[0]))

['asian', 'quake', 'hits', 'european', 'shares', 'asian', 'quake', 'hits', 'european', 'shares', 'shares', 'in', 'europe', 'leading', 'reinsurers', 'and', 'travel', 'firms', 'have', 'fallen', 'as', 'the', 'scale', 'of', 'the', 'damage', 'wrought', 'by', 'tsunamis', 'across', 'south', 'asia', 'has', 'become', 'apparent', 'more', 'than', '23', '000', 'people', 'have', 'been', 'killed', 'following', 'massive', 'underwater', 'earthquake', 'and', 'many', 'of', 'the', 'worst', 'hit', 'areas', 'are', 'popular', 'tourist', 'destinations', 'reisurance', 'firms', 'such', 'as', 'swiss', 're', 'and', 'munich', 're', 'lost', 'value', 'as', 'investors', 'worried', 'about', 'rebuilding', 'costs', 'but', 'the', 'disaster', 'has', 'little', 'impact', 'on', 'stock', 'markets', 'in', 'the', 'us', 'and', 'asia', 'currencies', 'including', 'the', 'thai', 'baht', 'and', 'indonesian', 'rupiah', 'weakened', 'as', 'analysts', 'warned', 'that', 'economic', 'growth', 'may', 'slow', 'it', 'came', 'at', 'the', 'wo
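
By default `CountVectorizer` tokenizes with the regular expression `(?u)\b\w\w+\b`, that is, runs of two or more word characters. This explains why `Europe's` yields only `europe` and why `23,000` is split into `23` and `000` in the output above. A minimal sketch using only the standard `re` module; the `simple_tokenize` helper and the sample sentence are made up for this illustration.

```python
import re

# Default token pattern used by scikit-learn's CountVectorizer
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def simple_tokenize(text):
    """Lowercase the text, then keep runs of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

sample = "More than 23,000 people have been killed; Europe's shares fell."
sample_tokens = simple_tokenize(sample)
print(sample_tokens)
```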

## Stop Word Removal

#### We immediately see that many of the words here are not useful (e.g. "to", "the" etc.). Scikit-learn provides a list of such stop words:


In [30]:
# load English stop words
from sklearn.feature_extraction import text
stopwords = text.ENGLISH_STOP_WORDS
print(stopwords)


frozenset({'almost', 'couldnt', 'around', 'were', 'anyhow', 'former', 'own', 'there', 'whole', 'namely', 'could', 'seeming', 'hereafter', 'sometime', 'had', 'often', 'my', 'to', 'while', 'down', 'each', 'system', 'whatever', 'who', 'until', 'although', 'them', 'at', 'latter', 'less', 'give', 'through', 'is', 'full', 'nothing', 'or', 'and', 'both', 'interest', 'hereby', 'might', 'move', 'a', 'rather', 'about', 'top', 'fifteen', 'thereby', 'amount', 'should', 'noone', 'therefore', 'throughout', 'we', 'inc', 'six', 'become', 'now', 'may', 'part', 'i', 'moreover', 'cant', 'himself', 'show', 'how', 'herself', 'detail', 'whether', 'above', 'hers', 'against', 'her', 'be', 'can', 'here', 'never', 'this', 'seem', 'him', 'itself', 'side', 'few', 'wherever', 'forty', 'thence', 'enough', 'seems', 'yours', 'such', 'when', 'nine', 'already', 'again', 'mill', 'go', 'sixty', 'between', 'after', 'neither', 'therein', 'among', 'eight', 'call', 'nor', 'either', 'everything', 'below', 'see', 'themselves',

#### We can filter out these stopwords from our document:

In [31]:
for i in range (0, len(tokens)):
    filtered_token = []
    for token in tokens[i]:
        if token not in stopwords:
            filtered_token.append(token)
    tokens[i] = filtered_token

In [32]:
# Example after removing stop words

print(tokens[0])
print("length:", len(tokens[0]))

['asian', 'quake', 'hits', 'european', 'shares', 'asian', 'quake', 'hits', 'european', 'shares', 'shares', 'europe', 'leading', 'reinsurers', 'travel', 'firms', 'fallen', 'scale', 'damage', 'wrought', 'tsunamis', 'south', 'asia', 'apparent', '23', '000', 'people', 'killed', 'following', 'massive', 'underwater', 'earthquake', 'worst', 'hit', 'areas', 'popular', 'tourist', 'destinations', 'reisurance', 'firms', 'swiss', 'munich', 'lost', 'value', 'investors', 'worried', 'rebuilding', 'costs', 'disaster', 'little', 'impact', 'stock', 'markets', 'asia', 'currencies', 'including', 'thai', 'baht', 'indonesian', 'rupiah', 'weakened', 'analysts', 'warned', 'economic', 'growth', 'slow', 'came', 'worst', 'possible', 'time', 'said', 'hans', 'goetti', 'singapore', 'based', 'fund', 'manager', 'impact', 'tourist', 'industry', 'pretty', 'devastating', 'especially', 'thailand', 'travel', 'related', 'shares', 'dropped', 'europe', 'companies', 'germany', 'tui', 'lufthansa', 'france', 'club', 'mediterran
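
The same filter can be written as a list comprehension. The sketch below uses a small hand-picked stop word set and a short token list, both made up for illustration; the notebook itself uses scikit-learn's much larger `ENGLISH_STOP_WORDS`.

```python
# Small hand-picked stop word set, for illustration only
demo_stopwords = frozenset({"in", "and", "have", "as", "the", "of", "by"})

demo_tokens = ["shares", "in", "europe", "leading", "reinsurers", "and",
               "travel", "firms", "have", "fallen", "as", "the", "scale",
               "of", "the", "damage"]

# Keep only the tokens that are not stop words, preserving their order
filtered = [tok for tok in demo_tokens if tok not in demo_stopwords]
print(filtered)
print("removed:", len(demo_tokens) - len(filtered))
```

Storing the stop words in a set (or frozenset) keeps each membership test cheap, which matters when the loop runs over every token in the corpus.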

The same tokenizing and filtering steps can be written as a single loop over all documents:

In [33]:
all_filtered_tokens = []
for doc in article:
    # tokenize the next document
    tokens = tokenize(doc.lower())
    # remove the stopwords
    filtered_tokens = []
    for token in tokens:
        if token not in stopwords:
            filtered_tokens.append(token)  
    # add to the overall list
    all_filtered_tokens.append( filtered_tokens )
print("Created %d filtered token lists" % len(all_filtered_tokens) )

Created 1408 filtered token lists


### Counting Tokens

A simple type of analysis that we might do is to count the number of times specific terms (words) appear in our corpus. We could do this by creating a dictionary of term frequency counts:

In [38]:
counts = {}
# process filtered tokens for each document
for doc_tokens in all_filtered_tokens:
    for token in doc_tokens:
        # increment existing?
        if token in counts:
            counts[token] += 1
        # a new term?
        else:
            counts[token] = 1
print("Found %d unique terms in this corpus" % len(counts))

Found 22601 unique terms in this corpus


### The 20 most frequent terms

In [35]:
import operator
sorted_counts = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)
for i in range(20):
    term = sorted_counts[i][0]
    count = sorted_counts[i][1]
    print( "%s (count=%d)" % ( term, count )  )

said (count=4119)
year (count=1563)
new (count=1233)
people (count=1204)
mr (count=1092)
world (count=966)
time (count=934)
game (count=886)
news (count=772)
online (count=733)
just (count=683)
market (count=652)
like (count=618)
games (count=615)
players (count=605)
make (count=604)
company (count=602)
years (count=600)
technology (count=583)
firm (count=555)
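
The counting and ranking steps above can also be expressed with `collections.Counter` from the standard library. This is an alternative sketch, not the notebook's method; `demo_docs` is a made-up stand-in for `all_filtered_tokens`.

```python
from collections import Counter

# Toy token lists standing in for all_filtered_tokens
demo_docs = [
    ["share", "market", "firm", "share"],
    ["game", "player", "game", "market"],
    ["share", "game"],
]

demo_counts = Counter()
for doc_tokens in demo_docs:
    demo_counts.update(doc_tokens)   # add this document's term counts

print("Found %d unique terms" % len(demo_counts))
for term, count in demo_counts.most_common(2):   # the two most frequent terms
    print("%s (count=%d)" % (term, count))
```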


# Lemmatisation: WordNetLemmatizer

In [44]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

for i in range (0, len(all_filtered_tokens)):
    filtered_token = []
    for token in all_filtered_tokens[i]:
        filtered_token.append(lemmatizer.lemmatize(token))
    all_filtered_tokens[i] = filtered_token

In [46]:
# Example after lemmatising
#print(article_title[0])
print(all_filtered_tokens[0])
print("length:", len(all_filtered_tokens[0]))

['asian', 'quake', 'hit', 'european', 'share', 'asian', 'quake', 'hit', 'european', 'share', 'share', 'europe', 'leading', 'reinsurers', 'travel', 'firm', 'fallen', 'scale', 'damage', 'wrought', 'tsunami', 'south', 'asia', 'apparent', '23', '000', 'people', 'killed', 'following', 'massive', 'underwater', 'earthquake', 'worst', 'hit', 'area', 'popular', 'tourist', 'destination', 'reisurance', 'firm', 'swiss', 'munich', 'lost', 'value', 'investor', 'worried', 'rebuilding', 'cost', 'disaster', 'little', 'impact', 'stock', 'market', 'asia', 'currency', 'including', 'thai', 'baht', 'indonesian', 'rupiah', 'weakened', 'analyst', 'warned', 'economic', 'growth', 'slow', 'came', 'worst', 'possible', 'time', 'said', 'han', 'goetti', 'singapore', 'based', 'fund', 'manager', 'impact', 'tourist', 'industry', 'pretty', 'devastating', 'especially', 'thailand', 'travel', 'related', 'share', 'dropped', 'europe', 'company', 'germany', 'tui', 'lufthansa', 'france', 'club', 'mediterranne', 'sliding', 'ins

In [47]:
corpus = []
for articles in all_filtered_tokens:
    corpus.append(" ".join(articles))

# example
corpus[0]

'asian quake hit european share asian quake hit european share share europe leading reinsurers travel firm fallen scale damage wrought tsunami south asia apparent 23 000 people killed following massive underwater earthquake worst hit area popular tourist destination reisurance firm swiss munich lost value investor worried rebuilding cost disaster little impact stock market asia currency including thai baht indonesian rupiah weakened analyst warned economic growth slow came worst possible time said han goetti singapore based fund manager impact tourist industry pretty devastating especially thailand travel related share dropped europe company germany tui lufthansa france club mediterranne sliding insurer reinsurance firm pressure europe share munich swiss world biggest reinsurers fell market speculated cost rebuilding asia zurich financial allianz axa suffered decline value loss smaller reflecting market view reinsurers likely pick bulk cost worry size insurance liability dragged europe

# Creating a document-term matrix

### Bag-of-Words Representation

In the *bag-of-words model*, each document is represented by a vector in an *m*-dimensional coordinate space, where *m* is the number of unique terms across all documents. This set of terms is called the corpus *vocabulary*. Note that the positioning (context) of terms within the original document is lost in this model.

Since each document can be represented as a term vector, we can stack these vectors to create a full *document-term matrix*. We can easily create this matrix from a list of document strings using Scikit-learn:
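
To make the idea concrete, here is a minimal pure-Python sketch that builds a document-term matrix for a three-document toy corpus (the corpus is made up for this illustration). Here *m* = 4, so every document becomes a vector of four counts.

```python
# Toy corpus, made up for this illustration
demo_corpus = [
    "share market share",
    "game player game",
    "market game",
]

# Vocabulary: the sorted set of unique terms across all documents
vocabulary = sorted({term for doc in demo_corpus for term in doc.split()})
term_index = {term: j for j, term in enumerate(vocabulary)}

# One row per document, one column per vocabulary term
matrix = []
for doc in demo_corpus:
    row = [0] * len(vocabulary)
    for term in doc.split():
        row[term_index[term]] += 1   # count each occurrence of the term
    matrix.append(row)

print(vocabulary)
for row in matrix:
    print(row)
```

Word order is discarded: "share market share" and "market share share" would produce the same row.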

In [25]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.shape)

ValueError: empty vocabulary; perhaps the documents only contain stop words

In [None]:
print("Number of terms in model is %d" % len(vectorizer.vocabulary_))

In [None]:
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
vocab = vectorizer.vocabulary_
print("Vocabulary has %d distinct terms" % len(terms))

In [None]:
df_count = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out(), index = article_id).T
df_count

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

In [None]:
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
vocab = vectorizer.vocabulary_
print("Vocabulary has %d distinct terms" % len(terms))

In [None]:
print(tfidf.shape)

# convert to dense matrix
tfidf_dense = tfidf.toarray()
df_tfidf = pd.DataFrame(tfidf_dense, columns = vectorizer.get_feature_names_out(), index = article_id).T
df_tfidf
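
The notebook does not spell out how `TfidfVectorizer` weights terms. Under its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), scikit-learn documents the inverse document frequency as idf(t) = ln((1 + n) / (1 + df(t))) + 1, and each document row is then scaled to unit Euclidean length. The sketch below reproduces that calculation in plain Python on made-up counts; the variable names are illustrative.

```python
import math

# Toy term counts: rows are documents, columns are terms (made-up values)
toy_terms = ["game", "market", "share"]
toy_counts = [
    [0, 1, 2],
    [2, 0, 0],
    [1, 1, 0],
]

n_docs = len(toy_counts)
# Document frequency: number of documents containing each term
doc_freq = [sum(1 for row in toy_counts if row[j] > 0) for j in range(len(toy_terms))]
# Smoothed idf, as documented for smooth_idf=True
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

tfidf_rows = []
for row in toy_counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    # L2-normalise each document vector (guarding against empty documents)
    tfidf_rows.append([v / norm if norm else 0.0 for v in weighted])

for row in tfidf_rows:
    print([round(v, 3) for v in row])
```

The term "share" occurs in only one document, so it receives a larger idf than "game" or "market", which each occur in two.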

# Classification

## Split the data into training and test sets

In [None]:
from sklearn.model_selection import train_test_split
dataset_train, dataset_test, target_train, target_test = train_test_split(tfidf, categories, test_size=0.2)

### The size of the training set and test set

In [None]:
print("Training set size is %d" % dataset_train.shape[0] )
print("Test set size is %d" % dataset_test.shape[0] )

## KNN Classifier

k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.
The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.
The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.

In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point.
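
The classification phase described above can be sketched in a few lines of pure Python. This is an illustrative sketch only: `knn_predict` and the two-dimensional toy data are hypothetical, and it uses Euclidean distance with a simple majority vote rather than scikit-learn's implementation.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=3):
    """Return the most frequent label among the k training points nearest to query."""
    # Euclidean distance from the query to every stored training vector.
    distances = [
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    ]
    # Keep the k closest points and take a majority vote over their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training data with two classes.
train_vectors = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
train_labels = ["business", "business", "business", "sport", "sport"]

print(knn_predict(train_vectors, train_labels, (0.15, 0.15)))
print(knn_predict(train_vectors, train_labels, (0.95, 1.0)))
```

Note that "training" here is nothing more than keeping `train_vectors` and `train_labels` around; all the work happens when a query arrives.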

A commonly used distance metric for continuous variables is Euclidean distance. For discrete variables, such as for text classification, another metric can be used, such as the overlap metric (or Hamming distance). In the context of gene expression microarray data, for example, k-NN has also been employed with correlation coefficients such as Pearson and Spearman.[3] Often, the classification accuracy of k-NN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighbourhood components analysis.
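
As a small illustration of how the choice of metric changes the notion of "nearest", the sketch below computes Euclidean and Hamming distance for the same pair of vectors. The helper names and the term-presence vectors `doc_a` and `doc_b` are hypothetical.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical term-presence vectors for two short documents.
doc_a = [1, 0, 1, 1, 0]
doc_b = [1, 1, 0, 1, 0]

print(euclidean(doc_a, doc_b))
print(hamming(doc_a, doc_b))
```

In scikit-learn, `KNeighborsClassifier` takes a `metric` parameter for this choice; the notebook leaves it at its default.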

### Advantages
- Simple to implement
- Flexible to feature / distance choices
- Naturally handles multi-class cases
- Can do well in practice with enough representative data

### Disadvantages
- Large search problem to find nearest neighbours
- Storage of data
- Must know we have a meaningful distance function

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3)
model.fit(dataset_train, target_train) #fitting the model
print(model)

Predicted labels of the test dataset

In [None]:
predicted = model.predict(dataset_test)
predicted  #prediction by the model

In [None]:
unique_category

In [None]:
num_b = (predicted == 0).sum()
num_s = (predicted == 1).sum()
num_t = (predicted == 2).sum()

print( "Number of labels predicted as 'business' : %d" % num_b )
print( "Number of labels predicted as 'sport' : %d" % num_s )
print( "Number of labels predicted as 'technology' : %d" % num_t )

Example of correct labels versus predicted labels

In [None]:
print("Predictions\n", predicted[:10])
print("Correct labels\n", target_test[:10])

# The Confusion Matrix of KNN

In [None]:
# import only the evaluation function used here (avoids a wildcard import)
from sklearn.metrics import confusion_matrix
# build the confusion matrix
cm = confusion_matrix(target_test, predicted)
print(cm)

### The accuracy of the KNN model

In [None]:
from sklearn.metrics import accuracy_score
score = accuracy_score(target_test, predicted)
print("KNN accuracy = %.2f" % score)

# Decision Tree Classifier

Decision tree learning uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It is one of the predictive modelling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
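
To illustrate the structure described above, here is a minimal sketch of a hand-built classification tree and the walk from root to leaf. The tree `toy_tree`, its word-count features and thresholds, and the function `tree_predict` are all hypothetical; a tree learned by scikit-learn from the TF-IDF data would look different.

```python
# Hypothetical hand-built classification tree over word-count features.
# Internal nodes test one feature against a threshold; leaves hold a class label.
toy_tree = {
    "feature": "goal", "threshold": 0,
    "left": {   # goal count <= 0
        "feature": "share", "threshold": 0,
        "left": {"label": "technology"},   # share count <= 0
        "right": {"label": "business"},    # share count > 0
    },
    "right": {"label": "sport"},           # goal count > 0
}

def tree_predict(node, features):
    """Follow branches from the root until a leaf is reached."""
    while "label" not in node:
        value = features.get(node["feature"], 0)
        node = node["left"] if value <= node["threshold"] else node["right"]
    return node["label"]

print(tree_predict(toy_tree, {"goal": 2}))
print(tree_predict(toy_tree, {"share": 3}))
print(tree_predict(toy_tree, {"software": 1}))
```

Each path from the root to a leaf is a conjunction of feature tests, for example "goal count is 0 and share count is above 0" leads to the `business` leaf.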


### Decision tree advantages
Amongst other data mining methods, decision trees have various advantages:

- Simple to understand and interpret. People are able to understand decision tree models after a brief explanation. Trees can also be displayed graphically in a way that is easy for non-experts to interpret.[16]
- Able to handle both numerical and categorical data.[16] Other techniques are usually specialised in analysing datasets that have only one type of variable. (For example, relation rules can be used only with nominal variables while neural networks can be used only with numerical variables or categoricals converted to 0-1 values.)
- Requires little data preparation. Other techniques often require data normalization. Since trees can handle qualitative predictors, there is no need to create dummy variables.[16]
- Uses a white box model. If a given situation is observable in a model the explanation for the condition is easily explained by boolean logic. By contrast, in a black box model, the explanation for the results is typically difficult to understand, for example with an artificial neural network.
- Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
- Non-statistical approach that makes no assumptions of the training data or prediction residuals; e.g., no distributional, independence, or constant variance assumptions
- Performs well with large datasets. Large amounts of data can be analysed using standard computing resources in reasonable time.
- Mirrors human decision making more closely than other approaches.[16] This could be useful when modeling human decisions/behavior.
- Robust against co-linearity, particularly boosting
- Built-in feature selection. Irrelevant features are used less often, so they can be removed on subsequent runs.


### Limitations
- Trees do not tend to be as accurate as other approaches.
- Trees can be very non-robust. A small change in the training data can result in a big change in the tree, and thus a big change in final predictions.
- The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristics such as the greedy algorithm, where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. To reduce the greedy effect of local optimality, some methods such as the dual information distance (DID) tree were proposed.[19]
- Decision-tree learners can create over-complex trees that do not generalize well from the training data. (This is known as overfitting.[20]) Mechanisms such as pruning are necessary to avoid this problem (with the exception of some algorithms such as the Conditional Inference approach, which does not require pruning).
- There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems. In such cases, the decision tree becomes prohibitively large. Approaches to solve the problem involve either changing the representation of the problem domain (known as propositionalization)[21] or using learning algorithms based on more expressive representations (such as statistical relational learning or inductive logic programming).
- For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of those attributes with more levels.[22] However, the issue of biased predictor selection is avoided by the Conditional Inference approach[12], a two-stage approach[23], or adaptive leave-one-out feature selection.
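
The greedy heuristic mentioned in the limitations can be sketched as follows: at a single node, try every feature and threshold and keep the split with the lowest weighted impurity. This is a minimal sketch under assumptions, not the notebook's method: the text does not name a splitting criterion, so Gini impurity is assumed here (it is the default `criterion` of scikit-learn's `DecisionTreeClassifier`), and `gini`, `best_split` and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure set, larger for more mixed sets."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, labels):
    """Greedily pick the (feature index, threshold) with the lowest weighted impurity."""
    best = None  # (weighted impurity, feature index, threshold)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must send samples to both sides
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or weighted < best[0]:
                best = (weighted, feature, threshold)
    return best

# Hypothetical data: column 0 is noise, column 1 separates the classes.
rows = [(3, 0), (1, 0), (2, 1), (3, 2), (1, 2)]
labels = ["business", "business", "sport", "sport", "sport"]

print(best_split(rows, labels))
```

A full learner would apply `best_split` recursively to the left and right subsets. Because each node is optimised in isolation, the resulting tree is only locally optimal, which is the limitation described above.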

In [None]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(dataset_train, target_train) #fitting the model

In [None]:
predicted = clf.predict(dataset_test)
predicted  #prediction by the model

In [None]:
num_b = (predicted == 0).sum()
num_s = (predicted == 1).sum()
num_t = (predicted == 2).sum()

print( "Number of labels predicted as 'business' : %d" % num_b )
print( "Number of labels predicted as 'sport' : %d" % num_s )
print( "Number of labels predicted as 'technology' : %d" % num_t )

In [None]:
print("Predictions\n", predicted[:10])
print("Correct labels\n", target_test[:10])

### The accuracy of the Decision tree model

In [None]:
from sklearn.metrics import accuracy_score
score1 = accuracy_score(target_test, predicted)
print("Decision tree accuracy = %.2f" % score1)

# The Confusion Matrix of Decision Tree

In [None]:
# import only the evaluation function used here (avoids a wildcard import)
from sklearn.metrics import confusion_matrix
# build the confusion matrix
cm = confusion_matrix(target_test, predicted)
print(cm)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Plotting the graph

In [None]:
prediction_accuracy = [score, score1]
classifier_names = ['KNN(N=3)', 'Decision Tree']
# pd.Series.from_array was removed from pandas; build the Series directly.
freq_series = pd.Series(prediction_accuracy)
# Plot the figure.
plt.figure(figsize=(10, 6))
ax = freq_series.plot(kind='bar')
ax.set_title('Accuracy of classifiers')
ax.set_xlabel('Classifiers')
ax.set_ylabel('Accuracy')  # accuracy_score returns a fraction in [0, 1], not a percentage
ax.set_xticklabels(classifier_names)
rects = ax.patches
# Label each bar with its accuracy, centred above the bar.
labels = ["%.2f" % value for value in prediction_accuracy]

for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width() / 2, height, label, ha='center', va='bottom')

# Cross Validation

A problem with simply randomly splitting a dataset into two sets is that each random split might give different results. We are also ignoring a portion of our dataset during training. One way to address this is to use *k-fold cross-validation* to evaluate a classifier:
1. Divide the data into k disjoint subsets - “folds” (e.g. k=5).
2. For each of k experiments, use k-1 folds for training and the selected one fold for testing.
3. Repeat for all k folds, average the accuracy/error rates.
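
The three steps above can be sketched by hand in pure Python. This is an illustrative sketch only: `k_fold_indices` and `majority_class_accuracy` are hypothetical helpers, the labels are made up, and a trivial majority-class predictor stands in for a real classifier so that the fold handling stays in focus.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def majority_class_accuracy(train_labels, test_labels):
    """Stand-in classifier: always predict the most common training label."""
    prediction = Counter(train_labels).most_common(1)[0][0]
    return sum(label == prediction for label in test_labels) / len(test_labels)

# Hypothetical labels; a real run would also pass the feature vectors.
labels = ["business"] * 6 + ["sport"] * 4
folds = k_fold_indices(len(labels), k=5)

scores = []
for test_fold in folds:
    held_out = set(test_fold)
    # Train on the other k-1 folds, test on the held-out fold.
    train_labels = [labels[i] for i in range(len(labels)) if i not in held_out]
    test_labels = [labels[i] for i in test_fold]
    scores.append(majority_class_accuracy(train_labels, test_labels))

print(scores)
print(sum(scores) / len(scores))
```

Every sample is used for testing exactly once and for training in the remaining folds, so no part of the dataset is ignored.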

While this is a relatively complex process, scikit-learn allows us to achieve this using a single command. Let's do a 10-fold cross-validation of the KNN classifier:

In [None]:
# create a single classifier
model = KNeighborsClassifier(n_neighbors=3)
# apply 10-fold cross-validation (cv=10), measuring accuracy each time
from sklearn.model_selection import cross_val_score
acc_scores = cross_val_score(model, dataset_train, target_train, cv=10, scoring="accuracy")
print(acc_scores)

In [None]:
print("KNN: Mean cross-validation accuracy = %.2f" % acc_scores.mean() )

In [None]:
from sklearn import tree
model = tree.DecisionTreeClassifier()
acc_scores1 = cross_val_score(model, dataset_train, target_train, cv=10, scoring="accuracy")
print("Decision tree: Mean cross-validation accuracy = %.2f" % acc_scores1.mean() )

# Plotting the graph

In [None]:
prediction_accuracy = [acc_scores.mean(), acc_scores1.mean()]
classifier_names = ['KNN(N=3)', 'Decision Tree']
# pd.Series.from_array was removed from pandas; build the Series directly.
freq_series = pd.Series(prediction_accuracy)
# Plot the figure.
plt.figure(figsize=(10, 6))
ax = freq_series.plot(kind='bar')
ax.set_title('Accuracy of classifiers after k-fold validation')
ax.set_xlabel('Classifiers')
ax.set_ylabel('Mean accuracy')  # cross_val_score returns fractions in [0, 1], not percentages
ax.set_xticklabels(classifier_names)
rects = ax.patches
# Label each bar with its mean accuracy, centred above the bar.
labels = ["%.2f" % value for value in prediction_accuracy]

for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width() / 2, height, label, ha='center', va='bottom')