## News Extraction

### Import Packages

In [1]:
from bs4 import BeautifulSoup
from requests import get

### Creating a function to extract only the text from paragraph tags

In [2]:
def get_only_text(url):
    """
    Return the title and the text of the article
    at the specified url.
    """
    page = get(url)
    soup = BeautifulSoup(page.content, "lxml")
    # Join the text of every <p> tag into a single string
    text = ' '.join(p.text for p in soup.find_all('p'))
    # Collapse the strings inside the <title> tag into one line
    title = ' '.join(soup.title.stripped_strings)
    return title, text

### Calling the function with the desired News URL

In [3]:
text = get_only_text("https://en.wikinews.org/wiki/Global_markets_plunge")

In [4]:
text

('Global markets plunge - Wikinews, the free news source',
 'Friday, October 10, 2008\xa0\n Stock markets across the world have fallen sharply with several seeing the biggest drop in their history. \n Asian markets saw the biggest sell-off. The Nikkei dropped 9.62% to reach a 20 year low. Japan also saw a collapse of a mid-size insurance company, Yamato Life Insurance Company, which declared bankruptcy.  The Hang Seng, which was one of the few markets that was positive yesterday, fell 7.19%. Australia dropped by 8.4% and South Korea saw a 9% fall. \n In Europe, markets dropped at the open with the FTSE losing 11%. They have recovered only sightly with all European markets losing more than 5%. The European sell off was more about the Asian lows then any specific news. European banks and financial institutes saw the most selling. Also, oil related companies saw large drops as an result of an expected decrease in oil consumption. \n\n The U.S. markets opened lower with the Dow Jones Indus

### Number of Words - Original Text (Title Only)

In [5]:
text[0]

'Global markets plunge - Wikinews, the free news source'

In [6]:
len(text[0].split())

9
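
Note that `text[0]` is the title, so the count of 9 is the title's word count, not the article's. The snippet below is a minimal sketch of counting words in both parts of the tuple, assuming `text` has the `(title, body)` shape returned by `get_only_text`. The `count_words` helper and the `sample` tuple are hypothetical stand-ins so that the snippet runs without network access.

```python
def count_words(s):
    """Count whitespace-separated tokens in a string."""
    return len(s.split())

# Hypothetical stand-in for the (title, body) tuple returned by get_only_text
sample = (
    "Global markets plunge - Wikinews, the free news source",
    "Stock markets across the world have fallen sharply.\n Asian markets saw the biggest sell-off.",
)

title_words = count_words(sample[0])  # index 0 is the title
body_words = count_words(sample[1])   # index 1 is the article body
print("Title words:", title_words)
print("Body words:", body_words)
```

In the notebook itself, `count_words(text[1])` would give the word count of the article body.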

## Summarization

In [17]:
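# Note: the gensim.summarization module was removed in gensim 4.0,
# so these imports require an older release (gensim < 4.0).
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords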

### Printing the Summarized Text

### Method #1 - Word Count

In [8]:
print("Title : " + text[0])
print("Summary : ")
# Note: repr() escapes newlines and quotes, so the output shows literal \n and \'
print(summarize(repr(text[1]), word_count=100))

Title : Global markets plunge - Wikinews, the free news source
Summary : 
Bush made an address on the economy and said markets were being "driven by uncertainty and fear."\n Oil has seen losses of more than US$6 in trading with the current price of a barrel of oil less than $80.
The reality is that most investors have been spooked by the sheer pressure that the credit crunch is putting on the global economy.”\n The Japanese Nikkei 225 has recorded it\'s third biggest drop in history with a massive sell-off in the exchange that has resulted in USD 250 billion being knocked of the index\'s value.\n Toyota, which is the second largest carmaker in the world, fell by the largest amount in 21 years, while Elpida Memory, the world\'s largest manufacturer of computer memory, dropped in value to a record low.\n Masafumi Oshiden, a fund manager in Toyota commented on the drop."It\'s capitulation," he said.
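
The literal `\n` and `\'` sequences in this summary come from passing `repr(text[1])` to `summarize`: `repr` returns the source-code form of a string, with escape sequences and surrounding quotes. The sketch below shows that effect on a made-up string, together with one possible alternative: collapsing whitespace before summarizing. `normalize_whitespace` is a hypothetical helper, not part of gensim, and a summary computed from normalized text would likely differ from the one shown here.

```python
# Made-up sample containing both quote styles and a newline
raw = 'He said "It\'s capitulation."\n Markets fell.'

escaped = repr(raw)
# repr() wraps the text in quotes, escapes the apostrophe, and turns the
# newline into the two characters backslash + n.
print(escaped)

def normalize_whitespace(s):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(s.split())

clean = normalize_whitespace(raw)
print(clean)
```

With this approach the call would read `summarize(normalize_whitespace(text[1]), word_count=100)`.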


### Method #2 - Ratio

In [9]:
print("Title : " + text[0])
print("Summary : ")
print(summarize(repr(text[1]), ratio=0.1))

Title : Global markets plunge - Wikinews, the free news source
Summary : 
Bush made an address on the economy and said markets were being "driven by uncertainty and fear."\n Oil has seen losses of more than US$6 in trading with the current price of a barrel of oil less than $80.
Cats Protection had a total of £11.2 million saved in the now-collapsed Kaupthing bank.\n The British National Council for Voluntary Organisations said that 60 of its 6,500 have lost money due to the collapse of banks.\n The Dow Jones Industrial Average fell to its lowest level in five years at 8,579.19, falling 679 points in one day.
The reality is that most investors have been spooked by the sheer pressure that the credit crunch is putting on the global economy.”\n The Japanese Nikkei 225 has recorded it\'s third biggest drop in history with a massive sell-off in the exchange that has resulted in USD 250 billion being knocked of the index\'s value.\n Toyota, which is the second largest carmaker in the world, 

In [10]:
summarized_text = summarize(repr(text[1]), ratio=0.1)

### Number of Words - Summarized Text

In [11]:
len(summarize(repr(text[1]), word_count=100).split())

143
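
To put this figure in context, one option is to compare the summary's word count with the original's. The sketch below assumes both are plain strings. `compression_ratio` is a hypothetical helper and the sample strings are made up for illustration.

```python
def compression_ratio(original, summary):
    """Return the summary word count divided by the original word count."""
    original_words = len(original.split())
    if original_words == 0:
        raise ValueError("original text contains no words")
    return len(summary.split()) / original_words

# Made-up sample strings standing in for the article body and its summary
original = (
    "Stock markets fell sharply today. "
    "Asian markets saw the biggest sell-off. "
    "Oil prices also dropped."
)
summary = "Asian markets saw the biggest sell-off."

ratio = compression_ratio(original, summary)
print(f"Summary keeps {ratio:.0%} of the original words")
```

In the notebook, the equivalent call would be `compression_ratio(text[1], summarized_text)`.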

### Keywords

In [12]:
print('\nKeywords:')
print(keywords(text[1], ratio=0.1))


Keywords:
markets
market
drop
dropped
drops
largest
fell
low
lows
canadian
canadians
oil
commented
mortgage
mortgages
fall
falling
falls
correction
corrections
national
nation
saw
said
today
company
companies
collapse
collapsing
european
banks
bank
banking
states
restore
restoring
unsettling
people
credit


In [13]:
print('\nKeywords:')
print(keywords(text[1], ratio=0.1, lemmatize=True))


Keywords:
market
drops
largest
fell
lows
canadians
oil
commented
mortgages
falls
corrections
nation
saw
said
today
companies
european
banking
restoring
unsettling
people
states
credit


# Scraping data using BeautifulSoup as general-purpose code

In [14]:
import wikipedia as w
import pandas as pd

In [15]:
text = w.summary("Flash (Barry Allen)")
text


'The Flash (Bartholomew Henry "Barry" Allen) is a superhero appearing in American comic books published by DC Comics. The character first appeared in Showcase #4 (October 1956), created by writer Robert Kanigher and penciler Carmine Infantino. Barry Allen is a reinvention of the original Flash, Jay Garrick.\nBecause he is a speedster, his power consists mainly of superhuman speed. Various other effects are also attributed to his ability to control the slowness of molecular vibrations, including his ability to vibrate at speed to pass through objects. The Flash wears a distinct red and gold costume treated to resist friction and wind resistance, traditionally storing the costume compressed inside a ring.\nBarry Allen\'s classic stories introduced the concept of the Multiverse to DC Comics, and this concept played a large part in DC\'s various continuity reboots over the years. The Flash has traditionally always had a significant role in DC\'s major company-wide reboot stories, and in th

In [16]:
# Search Wikipedia and load the page for the first result
results = w.search("Flash (Barry Allen)")
page = w.page(results[0])
title = page.title
content = page.content
print("Page title : ", title, "\n")
print("Page content : ", content, "\n")

Page title :  Flash (Barry Allen) 

Page content :  The Flash (Bartholomew Henry "Barry" Allen) is a superhero appearing in American comic books published by DC Comics. The character first appeared in Showcase #4 (October 1956), created by writer Robert Kanigher and penciler Carmine Infantino. Barry Allen is a reinvention of the original Flash, Jay Garrick.
Because he is a speedster, his power consists mainly of superhuman speed. Various other effects are also attributed to his ability to control the slowness of molecular vibrations, including his ability to vibrate at speed to pass through objects. The Flash wears a distinct red and gold costume treated to resist friction and wind resistance, traditionally storing the costume compressed inside a ring.
Barry Allen's classic stories introduced the concept of the Multiverse to DC Comics, and this concept played a large part in DC's various continuity reboots over the years. The Flash has traditionally always had a significant role in DC'