# Capstone Project - Scraping

For my capstone project, I will build a Streamlit application that collects the latest news. The first step is to scrape the news sites.

### Import packages

In [1]:
import pandas as pd
import re
import requests
from bs4 import BeautifulSoup

## Ikon
The first website we'll scrape is ikon.mn.
Let's collect all the articles that are less than a day old.

In [113]:
response = requests.get("https://ikon.mn/j/sd7qmo/sd7qmo")

In [24]:
response = requests.get("https://ikon.mn/l/1")

In [6]:
response = requests.get("https://ikon.mn/l/2")

In [114]:
soup = BeautifulSoup(response.content, "html.parser")

In [115]:
ikon_links = soup.find_all("div",{"class":"nlitem"})

In [116]:
ikon_urls = []
for link in ikon_links:
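    # "Өчигдөр" means "Yesterday". Check the item's text: `in` on a Tag only tests its direct children.
    if "Өчигдөр" not in link.get_text():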
        ikon_a = link.find("a")
        ikon_url = ikon_a.get('href')
        ikon_urls.append(ikon_url)
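
One subtlety when filtering list items by a date label: in BeautifulSoup, `in` applied to a `Tag` tests membership among the tag's direct child nodes and does not search its text, so a label such as "Өчигдөр" ("Yesterday") has to be looked up in `get_text()`. The sketch below shows the difference on a made-up snippet. The markup, the `date` class, and the `/n/old1` and `/n/new1` paths are assumptions for illustration, not ikon.mn's actual HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real ikon.mn structure may differ.
sample_html = (
    '<div class="nlitem"><span class="date">Өчигдөр</span><a href="/n/old1">Old story</a></div>'
    '<div class="nlitem"><span class="date">12:30</span><a href="/n/new1">New story</a></div>'
)

soup = BeautifulSoup(sample_html, "html.parser")
items = soup.find_all("div", {"class": "nlitem"})

# `in` on a Tag compares against its child nodes, so the label is never found.
membership_hits = ["Өчигдөр" in item for item in items]

# Searching the extracted text finds the label as intended.
text_hits = ["Өчигдөр" in item.get_text() for item in items]

# Keep only the links of items that are not labelled "Yesterday".
recent_urls = [
    item.find("a").get("href")
    for item in items
    if "Өчигдөр" not in item.get_text()
]
print(membership_hits, text_hits, recent_urls)
```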

In [117]:
ikon_urls

['/n/354p',
 '/n/354o',
 '/n/354k',
 '/n/354m',
 '/n/354h',
 '/n/354e',
 '/n/354l',
 '/n/354j',
 '/n/354i',
 '/n/354g',
 '/n/3539',
 '/n/354f',
 '/n/354d',
 '/n/354c',
 '/n/354a',
 '/n/354b',
 '/n/3549',
 '/n/3548',
 '/n/3546',
 '/n/3545']

In [118]:
full_ikon_urls = []
for url in ikon_urls:
    full_url = "https://ikon.mn" + url
    full_ikon_urls.append(full_url)
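
As an alternative to string concatenation, the standard library's `urljoin` resolves a relative path against a base URL and leaves already-absolute links untouched. A minimal sketch, where `BASE_URL` and `absolutize` are hypothetical names introduced for illustration:

```python
from urllib.parse import urljoin

BASE_URL = "https://ikon.mn"  # assumed base for the relative hrefs collected above


def absolutize(paths, base=BASE_URL):
    """Resolve each scraped href against the site's base URL."""
    return [urljoin(base, path) for path in paths]


example = absolutize(["/n/354p", "https://ikon.mn/n/354o"])
print(example)
```

This also guards against doubling the domain if the site ever returns a full URL in `href`.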

In [119]:
full_ikon_urls

['https://ikon.mn/n/354p',
 'https://ikon.mn/n/354o',
 'https://ikon.mn/n/354k',
 'https://ikon.mn/n/354m',
 'https://ikon.mn/n/354h',
 'https://ikon.mn/n/354e',
 'https://ikon.mn/n/354l',
 'https://ikon.mn/n/354j',
 'https://ikon.mn/n/354i',
 'https://ikon.mn/n/354g',
 'https://ikon.mn/n/3539',
 'https://ikon.mn/n/354f',
 'https://ikon.mn/n/354d',
 'https://ikon.mn/n/354c',
 'https://ikon.mn/n/354a',
 'https://ikon.mn/n/354b',
 'https://ikon.mn/n/3549',
 'https://ikon.mn/n/3548',
 'https://ikon.mn/n/3546',
 'https://ikon.mn/n/3545']

### Scraping a single article

In [120]:
response = requests.get('https://ikon.mn/n/3548')

In [121]:
soup = BeautifulSoup(response.content, "html.parser")

In [122]:
article_title = soup.find('h1').get_text(strip=True)

In [123]:
article_paragraphs = soup.find_all('p')

In [124]:
body = ""
for paragraph in article_paragraphs:
    paragraph = paragraph.get_text(strip=True)
    body = body + paragraph

In [125]:
body

'УИХ-ын гишүүн асан Ж.Батзандан ШИНЭ намаас гарч, АН-д нэгдэж буйгаа мэдэгдсэн. Үүнтэй холбогдуулан ШИНЭ намаас мэдээлэл хийлээ.ШИНЭ намын дарга Ц.Гантулга“Өнгөрсөн хоёр хоногийн хугацаанд Ардчилсан Намын зүгээс ШИНЭ намыг өөрийн намтай нэгдсэн мэтээр мэдээлэл тараасан. Тиймээс ШИНЭ нам анх яаж байгуулагдсан талаар мэдээлэл өгөх зүйтэй гэж үзлээ.Анх 2019 оны есдүгээр сарын 16-нд ШИНЭ нам байгуулагдсан. ШИНЭ намыг 113 төрийн бус байгууллага, мэргэжлийн холбоод хамтран Монголыг МАНАН дэглэмээс аврах, шинэ нийгмийн байгуулах эрхэм зорилготойгоор анх байгуулагдсан.Тухайн үеийн УИХ-ын гишүүн Ж.Батзандан, Л.Болд нар 1,200 төлөөлөгчтэйгөөр анхны хурлаа хийсэн. Анх 801 хүн гарын үсэг зурснаар байгуулагдсан байсан нам одоо 24 мянган гишүүнээ үнэмлэхжүүлсэн, 50 гаруй мянган дэмжигчтэй, өдөр тутмын үйл ажиллагаагаа академик түвшинд явуулж байгаа улс төрийн хүчин болоод байна.Энэ намын анхны дарга Ж.Батзандан АН-д орж буйгаа мэдэгдсэн. Тэр бол хувь хүний сонголт. Цөөн хэдэн хүн улс төрийн өөр нам,

### Scraping multiple articles

In [126]:
article_titles = []
article_bodies = []
for url in full_ikon_urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    article_title = soup.find('h1').get_text(strip=True)
    article_paragraphs = soup.find_all('p')
    body = ""
    for paragraph in article_paragraphs:
        paragraph = paragraph.get_text(strip=True)
        body = body + paragraph
    article_titles.append(article_title)
    article_bodies.append(body)
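
If a request fails midway through a loop like this one, the URL list and the scraped lists no longer line up. One possible way to keep them aligned is to build one record per successfully scraped page, as in the sketch below. `scrape_articles`, its `fetch` hook, the `delay` parameter, and `fake_fetch` are hypothetical names, and unlike the loop above this version joins paragraphs with a space.

```python
import time
from bs4 import BeautifulSoup


def scrape_articles(urls, fetch, delay=1.0):
    """Return one record per successfully scraped URL.

    `fetch` is any callable mapping a URL to HTML (a hypothetical hook,
    for example a thin wrapper around requests.get).
    """
    records = []
    for url in urls:
        try:
            html = fetch(url)
        except Exception as error:  # broad on purpose: skip any failed download
            print(f"skipping {url}: {error}")
            continue
        soup = BeautifulSoup(html, "html.parser")
        heading = soup.find("h1")
        body = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
        records.append({
            "article url": url,
            "title": heading.get_text(strip=True) if heading else "",
            "article": body,
        })
        time.sleep(delay)  # pause between requests to go easy on the server
    return records


# Stand-in fetcher so the sketch runs offline.
def fake_fetch(url):
    if url.endswith("/broken"):
        raise RuntimeError("simulated network failure")
    return "<h1>Title</h1><p>First.</p><p>Second.</p>"


records = scrape_articles(
    ["https://example.com/a", "https://example.com/broken"], fake_fetch, delay=0
)
print(records)
```

In the notebook, `fetch` could be something like `lambda url: requests.get(url, timeout=10).content`, and `pd.DataFrame(records)` would build a table with the same three columns.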

In [127]:
df = pd.DataFrame({'article url': full_ikon_urls, 'title': article_titles, 'article': article_bodies})

In [128]:
df

| | article url | title | article |
|---|---|---|---|
| 0 | https://ikon.mn/n/354p | Өнгөрсөн сард улсын хэмжээнд 10.5 сая тонн ача... | Зам тээврийн салбар энэ оны дөрөвдүгээр сард у... |
| 1 | https://ikon.mn/n/354o | Шатахуун түгээхийн болон түргэн тусламжийн маш... | Хан-Уул дүүргийн нутаг дэвсгэрт шатахуун түгээ... |
| 2 | https://ikon.mn/n/354k | Монголын хөлбөмбөгийн шигшээ баг Азийн АШТ-ий ... | 2027 онд Саудын Араб улсад зохион байгуулагдах... |
| 3 | https://ikon.mn/n/354m | Цэцэг төвийн уулзварын зам засварын ажил марга... | Цэцэг төвийн уулзвараас Цаг уур, орчны шинжилг... |
| 4 | https://ikon.mn/n/354h | "Монголбанкны эх үүсвэрээр орон сууц авч буй и... | Ипотекийн зээлийн үндсэн шалгуур болон зээл ол... |
| 5 | https://ikon.mn/n/354e | "207-р байрны галд өртсөн найман хувийг засуул... | Газын дэлбэрэлтэд өртсөн 207 дугаар байрны орш... |
| 6 | https://ikon.mn/n/354l | “Grow with Google Mongolia” хөтөлбөрийн дэмжиг... | Монгол Улсад анх удаа хэрэгжүүлж буй “Grow wit... |
| 7 | https://ikon.mn/n/354j | Хуримтлалын сангийн дансанд өнөөдөр Эрдэнэт, О... | Үндэсний баялгийн сангийн тухай хууль хэрэгжиж... |
| 8 | https://ikon.mn/n/354i | Буудлагын Дэлхийн цомын тэмцээнд О.Есүгэн найм... | Азербайжаны нийслэл Баку хотноо буудлагын Дэлх... |
| 9 | https://ikon.mn/n/354g | Бороотой үеэр Сүхбаатар дүүргийн ногоон байгуу... | Өнөөдөр өглөө бороотой байсан ч Сүхбаатар дүүр... |
9,https://ikon.mn/n/354g,Бороотой үеэр Сүхбаатар дүүргийн ногоон байгуу...,Өнөөдөр өглөө бороотой байсан ч Сүхбаатар дүүр...


In [130]:
df.to_csv("ikonrecent.csv")
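
By default `to_csv` also writes the DataFrame index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the effect on a small stand-in frame (`df_demo` is a hypothetical example, and in-memory buffers are used instead of files); passing `index=False` avoids the extra column.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"title": ["a", "b"], "article": ["x", "y"]})

# Default: the index is written as a first column with an empty header...
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
# ...which read_csv then labels "Unnamed: 0".
columns_with_index = list(pd.read_csv(with_index).columns)

# index=False keeps only the real columns.
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = list(pd.read_csv(without_index).columns)

print(columns_with_index)
print(columns_without_index)
```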

## Yahoo Sports
For sports-related news, I will scrape Yahoo Sports.

In [2]:
response = requests.get("https://sports.yahoo.com/nba/news/") #nba

In [None]:
response = requests.get("https://sports.yahoo.com/soccer/news/") #soccer

In [3]:
soup = BeautifulSoup(response.content, "html.parser")

In [4]:
sports_links = soup.find_all("li",{"class":"stream-item js-stream-content Bgc(t) Pos(r) Mb(24px)"})
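
Matching on the full class string is brittle, because it only succeeds when the element's `class` attribute is exactly that string. Matching on a single, more stable class is usually more forgiving. The sketch below illustrates this on made-up markup; the class lists and the choice of `stream-item` as the stable class are assumptions, not a statement about Yahoo's current HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the site's real class lists are longer and may change.
sample_html = """
<ul>
  <li class="stream-item js-stream-content Pos(r)"><a href="/a">A</a></li>
  <li class="stream-item js-stream-content Pos(r) Mb(24px)"><a href="/b">B</a></li>
  <li class="ad-item"><a href="/ad">Ad</a></li>
</ul>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Exact match on the whole class string: only the first <li> qualifies.
exact = soup.find_all("li", {"class": "stream-item js-stream-content Pos(r)"})

# Match on a single class: both stream items qualify.
by_class = soup.find_all("li", class_="stream-item")

# The same query written as a CSS selector.
by_selector = soup.select("li.stream-item")

hrefs = [li.find("a").get("href") for li in by_class]
print(len(exact), len(by_class), len(by_selector), hrefs)
```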

In [5]:
time = sports_links[0].find("time").get_text(strip=True)

In [6]:
sports_urls = []
for link in sports_links:
    time = link.find("time").get_text(strip=True)
    if "h" in time:  # presumably keeps timestamps expressed in hours, i.e. articles under a day old
        sports_a = link.find("a",{"class":"stream-title D(b) Td(n) Td(n):f C(--batcave) C($streamBrandHoverClass):h C($streamBrandHoverClass):fv"})
        sports_url = sports_a.get('href')
        sports_urls.append(sports_url)
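
Checking for the letter "h" works only as long as no other label happens to contain that letter. A stricter check could parse the relative timestamp with the `re` module imported at the top of the notebook. The sketch below assumes labels of the form "<number><unit> ago" with units `m`, `h`, or `d`; that format, and the names `is_recent`, `_RELATIVE`, and `_HOURS_PER_UNIT`, are assumptions for illustration.

```python
import re

# Assumed label format: a number followed by m (minutes), h (hours) or d (days).
_RELATIVE = re.compile(r"^\s*(\d+)\s*(m|h|d)\b", re.IGNORECASE)
_HOURS_PER_UNIT = {"m": 1 / 60, "h": 1, "d": 24}


def is_recent(label, max_hours=24):
    """Return True if a relative timestamp label is younger than max_hours."""
    match = _RELATIVE.match(label)
    if match is None:
        return False  # unknown format: treat as not recent
    amount = int(match.group(1))
    unit = match.group(2).lower()
    return amount * _HOURS_PER_UNIT[unit] < max_hours


labels = ["45m ago", "3h ago", "1d ago", "3 months ago", "Yesterday"]
flags = [is_recent(label) for label in labels]
print(flags)
```

Labels that do not match the assumed pattern are treated as not recent, which errs on the side of dropping an article instead of keeping a stale one.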

In [7]:
sports_urls

['https://sports.yahoo.com/nba-playoffs-cavaliers-stun-celtics-with-dominant-11894-win-in-game-2-at-boston-013648391.html',
 'https://sports.yahoo.com/report-suns-fire-head-coach-frank-vogel-after-1st-round-playoff-sweep-eyeing-mike-budenholzer-as-replacement-202449030.html',
 'https://sports.yahoo.com/nba-playoffs-knicks-rule-og-anunoby-out-for-game-3-vs-pacers-with-hamstring-injury-213757605.html',
 'https://sports.yahoo.com/brunson-leads-knicks-over-pacers-hornets-hire-charles-lee--big-questions-for-eliminated-teams--no-cap-room-211404027.html',
 'https://sports.yahoo.com/2024-nba-playoffs-second-round-schedule-how-to-watch-tonights-games-where-to-stream-and-more-thursday-173731096.html',
 'https://sports.yahoo.com/nba-suspends-bucks-patrick-beverley-4-games-for-throwing-ball-at-fans-kicking-reporter-out-of-interview-180830598.html',
 'https://sports.yahoo.com/former-nba-player-glen-davis-sentenced-to-40-months-in-prison-for-involvement-in-healthcare-fraud-scheme-190328457.html',
 '

### Scraping a single article

In [82]:
response = requests.get('https://sports.yahoo.com/nba-suspends-bucks-patrick-beverley-4-games-for-throwing-ball-at-fans-kicking-reporter-out-of-interview-180830598.html')

In [83]:
soup = BeautifulSoup(response.content, "html.parser")

In [84]:
article_title = soup.find('h1').get_text(strip=True)

In [85]:
article_paragraphs = soup.find_all('p')

In [86]:
body = ""
for paragraph in article_paragraphs:
    paragraph = paragraph.get_text(strip=True)
    body = body + paragraph

In [87]:
body

'Patrick Beverleyof theMilwaukee Buckshas been suspendedfour games without pay by the NBAfor throwing a basketball at fans and for his interaction with an ESPN reporter after Game 6 of their first-round series against theIndiana Pacers.In the final minutes of the Bucks’ Game 6 loss to the Pacers, whichclosed out the seriesand sent the Pacers into the Eastern Conference semifinals,Beverley was seen throwing a balltoward Pacers fans and hitting a woman in the head. Beverley then waved at a different fan to throw the ball back to him, which he did, and then Beverley chucked the ball right back at him hard.Beverley kept jawing with fans behind their bench before teammates and others defused the situation. He was not ejected from the game, but he didn’t return. The Bucks were down by 20 points at the time.altercation between pat bev and pacers fans behind the benchpic.twitter.com/dfQpqSBv33— Rob Perez (@WorldWideWob)May 3, 2024"It\'s an unfortunate situation that should have never happened.

### Scraping multiple articles

In [8]:
article_titles = []
article_bodies = []
for url in sports_urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    article_title = soup.find('h1').get_text(strip=True)
    article_paragraphs = soup.find_all('p')
    body = ""
    for paragraph in article_paragraphs:
        paragraph = paragraph.get_text(strip=True)
        body = body + paragraph
    article_titles.append(article_title)
    article_bodies.append(body)

In [9]:
df = pd.DataFrame({'article url': sports_urls, 'title': article_titles, 'article': article_bodies})

In [10]:
df

| | article url | title | article |
|---|---|---|---|
| 0 | https://sports.yahoo.com/nba-playoffs-cavalier... | NBA playoffs: Cavaliers stun Celtics with domi... | TheCleveland Cavaliersfinally won their first ... |
| 1 | https://sports.yahoo.com/report-suns-fire-head... | Report: Suns fire head coach Frank Vogel after... | ThePhoenix Sunsare moving on from head coach F... |
| 2 | https://sports.yahoo.com/nba-playoffs-knicks-r... | NBA playoffs: Knicks rule OG Anunoby out for G... | TheNew York Knickswill be withoutOG Anunobyon ... |
| 3 | https://sports.yahoo.com/brunson-leads-knicks-... | Brunson leads Knicks over Pacers, Hornets hire... | On this episode ofNo Cap Room, Yahoo Sports se... |
| 4 | https://sports.yahoo.com/2024-nba-playoffs-sec... | 2024 NBA Playoffs second round schedule: How t... | The2024 NBA Playoffsare in full swing! After a... |
| 5 | https://sports.yahoo.com/nba-suspends-bucks-pa... | NBA suspends Bucks' Patrick Beverley 4 games f... | Patrick Beverleyof theMilwaukee Buckshas been ... |
| 6 | https://sports.yahoo.com/former-nba-player-gle... | Former NBA player Glen Davis sentenced to 40 m... | Former NBA player Glen Davis was sentenced to ... |
| 7 | https://sports.yahoo.com/is-pacers-coach-rick-... | Is Pacers coach Rick Carlisle right to be upse... | It wouldn’t be the playoffs we know and love w... |
| 8 | https://sports.yahoo.com/how-the-deep-bond-bet... | How the deep bond between Nikola Jokić and Aar... | One day, afterDenverclimbed the NBA’s rocky mo... |
| 9 | https://sports.yahoo.com/brooklyn-nets-2024-nb... | Brooklyn Nets 2024 NBA offseason preview: Path... | 2023-24 season:32-50Highlight of the season:In... |


## Esports news
For esports news, I will scrape Chess.com. Since chess news is published slowly, let's just take the 10 most recent articles.

In [126]:
response = requests.get("https://www.chess.com/news")

In [127]:
soup = BeautifulSoup(response.content, "html.parser")

In [128]:
response

<Response [200]>

In [121]:
chess_links = soup.find_all("a",{"class":"post-preview-title"})

In [123]:
chess_links[1].get('href')

'https://www.chess.com/news/view/2024-superbet-poland-rapid-blitz-day-2'

In [129]:
chess_urls = []
for link in chess_links[:10]:
    chess_url = link.get('href')
    chess_urls.append(chess_url)

In [130]:
chess_urls

['https://www.chess.com/news/view/2024-cct-chesscom-classic-day-2',
 'https://www.chess.com/news/view/2024-superbet-poland-rapid-blitz-day-2',
 'https://www.chess.com/news/view/2024-cct-chesscom-classic-day-1',
 'https://www.chess.com/news/view/2024-superbet-poland-rapid-blitz-day-1',
 'https://www.chess.com/news/view/le-carlsen-win-titled-tuesday-may-7-2024',
 'https://www.chess.com/news/view/amateurs-mind-chessable',
 'https://www.chess.com/news/view/thomas-mars-bot',
 'https://www.chess.com/news/view/hikaru-nakamura-wins-bullet-brawl-may-4-2024',
 'https://www.chess.com/news/view/fides-call-for-world-championship-bids-draws-reactions',
 'https://www.chess.com/news/view/abdusattorov-wins-2024-tepe-sigeman-tournament']

### Scraping a single article

In [131]:
response = requests.get('https://www.chess.com/news/view/2024-cct-chesscom-classic-day-2')

In [132]:
soup = BeautifulSoup(response.content, "html.parser")

In [133]:
article_title = soup.find('h1').get_text(strip=True)

In [149]:
article_body = soup.find('div',{"class":"post-view-content"})

In [151]:
article_body.get_text(strip=True)

"This is a flash report. The full article will be added to this page soon.GMVelimir Ivichad the performance of a lifetime on day two of theChampions Chess Tour Chess.com Classic 2024. After surviving elimination from Division I thanks to a mouse slip by his opponent, he defeated GMMaxime Vachier-Lagravein armageddon—and then went on to send GMFabiano Caruanato Division II as well.Ivic is joined in Division I by GMsIan Nepomniachtchi,Jan-Krzysztof Duda,Alexey Sarana,Denis Lazavik, and the three pre-qualified players, GMsMagnus Carlsen,Alireza Firouzja, andVincent Keymer.In Division II Placement, six players won both of their matches. One player to keep an eye on is GMVidit Gujrathi, who after a difficultFIDE Candidates Tournamentand then a less-than-ideal performance in Wednesday's Play-in, has started to punch back.In Division III Placement, 20 players won their match (only one was played). GMLiem Leplayed two exceptional attacking games to eliminate GMChristopher Yoo. We will go over 

### Scraping multiple articles

In [152]:
article_titles = []
article_bodies = []
for url in chess_urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    article_title = soup.find('h1').get_text(strip=True)
    article_body = soup.find('div',{"class":"post-view-content"})
    article_titles.append(article_title)
    article_bodies.append(article_body.get_text(strip=True))
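
`soup.find` returns `None` when nothing matches, so a page without an `h1` or without the expected content container would raise an `AttributeError` on `.get_text` and stop the whole loop. A small guard keeps the loop going. The helper name `text_of` and the sample page below are hypothetical.

```python
from bs4 import BeautifulSoup


def text_of(node, default=""):
    """Return the stripped text of a BeautifulSoup node, or `default` if the node is None."""
    return node.get_text(" ", strip=True) if node is not None else default


# Made-up page that lacks a "post-view-content" container.
soup = BeautifulSoup(
    "<h1>Flash report</h1><div class='other'>...</div>", "html.parser"
)

title = text_of(soup.find("h1"))
body = text_of(
    soup.find("div", {"class": "post-view-content"}), default="(no body found)"
)
print(title, "|", body)
```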

In [153]:
df = pd.DataFrame({'article url': chess_urls, 'title': article_titles, 'article': article_bodies})

In [154]:
df

| | article url | title | article |
|---|---|---|---|
| 0 | https://www.chess.com/news/view/2024-cct-chess... | Vachier-Lagrave, Caruana, Wesley So Kicked Out... | This is a flash report. The full article will ... |
| 1 | https://www.chess.com/news/view/2024-superbet-... | Carlsen Lets Gukesh Escape, Leads With Wei Yi | GMMagnus Carlsencame within a whisker of beati... |
| 2 | https://www.chess.com/news/view/2024-cct-chess... | Caruana Wins In Final Round To Take Sole 1st I... | GMFabiano Caruanafinished first in the Play-in... |
| 3 | https://www.chess.com/news/view/2024-superbet-... | Shevchenko Leads Carlsen And Abdusattorov Afte... | GMKirill Shevchenkois the absolute underdog in... |
| 4 | https://www.chess.com/news/view/le-carlsen-win... | Carlsen Wins Final Clash To Take Tuesday From ... | GMTuan Minh Lewon the early edition ofTitled T... |
| 5 | https://www.chess.com/news/view/amateurs-mind-... | Break Free From Amateur Mistakes With Silman's... | We're happy to announce the release of IMJerem... |
| 6 | https://www.chess.com/news/view/thomas-mars-bot | Run Run Run! It's Chess Party Time With The Ne... | If you're looking for some chessEntertainment,... |
| 7 | https://www.chess.com/news/view/hikaru-nakamur... | Nakamura Pushes Bullet Brawl Earnings Over $10... | GMHikaru Nakamurasecured his second straightBu... |
| 8 | https://www.chess.com/news/view/fides-call-for... | FIDE's Call For World Championship Bids Sparks... | The International Chess Federation (FIDE) has ... |
| 9 | https://www.chess.com/news/view/abdusattorov-w... | Abdusattorov Wins TePe Sigeman Chess Tournamen... | He was in fourth place, trailing three leaders... |


## Summarization

In [12]:
from pysummarization.nlpbase.auto_abstractor import AutoAbstractor
from pysummarization.tokenizabledoc.simple_tokenizer import SimpleTokenizer
from pysummarization.abstractabledoc.top_n_rank_abstractor import TopNRankAbstractor

In [13]:
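for text in df['article']:
    # Set up the summarizer: simple tokenizer, with "." and newlines as sentence delimiters.
    auto_abstractor = AutoAbstractor()
    auto_abstractor.tokenizable_doc = SimpleTokenizer()
    auto_abstractor.delimiter_list = [".", "\n"]
    # Rank sentences and keep the top-scoring ones.
    abstractable_doc = TopNRankAbstractor()

    result_dict = auto_abstractor.summarize(text, abstractable_doc)

    # Print the sentences selected for the summary.
    for sentence in result_dict["summarize_result"]:
        print(sentence)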
    


TheCleveland Cavaliersfinally won their first road game of the NBA playoffs, surprising theBoston Celticsand the TD Garden crowd with a118–94 win in Game 2 of their second-round Eastern Conference series.

 The series is tied at 1–1 going into Saturday's Game 3 at Cleveland.

Early on, it looked like the Cavs might win a playoff game withoutDonovan Mitchellscoring the majority of their points.

 After averaging 39 points over Cleveland's past three playoff games, Mitchell had only six points at halftime.

 But he took over in the second half, scoring 16 in the third quarter and 29 for the game.

Cleveland took control in the second half, outscoring the Celtics 64–40.

 With approximately five minutes remaining in the game, Boston head coach Joe Mazzulla conceded and put most of his bench on the court while the home fans headed for the exits.

com/Whb7NyqnlF— Cleveland Cavaliers (@cavs)May 9, 2024Boston came back in the second quarter as Mobley and LeVert cooled off, while Mitchell andD
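
The summarization loop above prints the summaries, but for the app it may be more convenient to store them. The sketch below collects one summary string per article. It assumes only that each result is a dictionary with a `"summarize_result"` list of sentences, as used in that loop. The helper names `join_summary` and `summarize_all`, and the stub summarizer, are hypothetical.

```python
def join_summary(result_dict):
    """Join the sentences of a summarization result into one string."""
    sentences = (sentence.strip() for sentence in result_dict["summarize_result"])
    return " ".join(sentence for sentence in sentences if sentence)


def summarize_all(texts, summarize_one):
    """Apply `summarize_one` (text -> result dict) to every text."""
    return [join_summary(summarize_one(text)) for text in texts]


# Stub standing in for the real summarizer so the sketch runs on its own.
def stub_summarizer(text):
    first_sentence = text.split(".")[0] + ".\n"
    return {"summarize_result": [first_sentence, "  \n"]}


summaries = summarize_all(["One. Two. Three.", "Alpha. Beta."], stub_summarizer)
print(summaries)
```

In the notebook, `summarize_one` could be `lambda text: auto_abstractor.summarize(text, abstractable_doc)`, and the returned list could be assigned to a new column such as `df['summary']`.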