# Project: What's The Hot Topic In Town? - kelvin.ahiakpor & emmanuel.acquaye

This notebook addresses Phase 1 of the What's The Hot Topic In Town? project: **Web Scraping**.

### Information Retrieval

# Phase 1           
Web Scraping

The self-created rubric in our repository states the requirement for this phase, quoted below.  
**Description:** Build a Spider to scrape news from at least 1 credible news source (AllAfrica, BBC, The New York Times, Reuters, Foreign Affairs).  
**Note:** This notebook **only** shows the code and instructions for web scraping with a spider defined in .py files. The spider is meant to be run from .py files inside a Scrapy project, so the cells below will not run in this environment. To learn how to build a Scrapy spider, read the [Scrapy Tutorial](https://docs.scrapy.org/en/latest/intro/tutorial.html).

### Repository Link 

Here is a link to our repository:

[What's The Hot Topic In Town?](https://github.com/kelvin-ahiakpor/Whats.The.Hot.Topic.In.Town)

### Imports

In [1]:
import scrapy
import pycountry
import pandas as pd
import random
import time
import sqlite3 as sql
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

**Create a .py file called** `news_scraper.py` **to define the Spider class**   

In [2]:
class TrendsNewsSpider(scrapy.Spider):
    name = 'trends_news'
    
    custom_settings = {
        'DOWNLOAD_DELAY': 2,  # Default delay of 2 seconds
        'RANDOMIZE_DOWNLOAD_DELAY': True,
        'CONCURRENT_REQUESTS': 1,
        'RETRY_ENABLED': True,
        'RETRY_TIMES': 3,
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'newsspider.middlewares.RandomUserAgentMiddleware': 400,
        }
    }
    
    def __init__(self, *args, **kwargs):
        super(TrendsNewsSpider, self).__init__(*args, **kwargs)
        self.data = []

    def start_requests(self):
        # The country is passed in as a spider argument, e.g. country="South Africa"
        country = getattr(self, 'country', None)
        if country is None:
            self.log("No country provided")
            return

        # The trends24.in URL uses the lowercase country name with spaces replaced by hyphens
        url = 'https://trends24.in/' + country.replace(' ', '-').lower() + '/'
        url_allafrica = f'https://allafrica.com/{country.lower()}/'

        yield scrapy.Request(url=url, callback=self.parse_trends, meta={'country': country})
        yield scrapy.Request(url=url_allafrica, callback=self.parse_allafrica, meta={'country': country})

    def parse_trends(self, response):
        country = response.meta['country']
        trend_list = response.css('.list-container .trend-card__list')
        if not trend_list:
            # Nothing matched the selector; return early instead of raising an IndexError
            self.log(f"No trends found for {country}")
            return

        # Only the first trend list on the page is used
        trends = [trend.strip() for trend in trend_list[0].css('li a::text').getall()]

        self.log(f"Top 50 trends in {country}:")
        for trend in trends[:50]:
            self.log(trend)
    
    def parse_allafrica(self, response):
        country = response.meta['country']
        self.log(f"Scraping AllAfrica for {country}")
        
        articles = response.css('.container.mid .row .col-tn-12.col-sm-8.column.main .section.box.headlines.two-column .content .stories li a')
        
        for article in articles:
            title = article.css('::attr(title)').get()
            link = article.css('::attr(href)').get()

            if not link:
                continue  # Skip anchors without an href

            # urljoin resolves relative links and leaves absolute ones unchanged
            link = response.urljoin(link)

            # DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY already space out requests;
            # calling time.sleep() in a callback would block Scrapy's event loop
            yield scrapy.Request(url=link, callback=self.parse_article, meta={'title': title, 'country': country})

    def parse_article(self, response):
        title = response.meta['title']
        country = response.meta['country']
        
        paragraphs = response.css('.container.mid .row .col-tn-12.col-sm-8.column.main .story-body p::text').getall()
        
        body = ' '.join(paragraphs).strip()
        
        org_link = response.css('.container.mid .row .col-tn-12.col-sm-8.column.main .story-footer-link .source-url::attr(href)').get()

        self.data.append({
            'TITLE': title,
            'COUNTRY': country,
            'BODY': body,
            'Website Link': org_link
        })

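    def closed(self, reason):
        # Called once when the spider finishes: save everything collected
        df = pd.DataFrame(self.data)
        df.to_csv('newsData.csv', index=False)
        conn = sql.connect('newsData.db')
        try:
            df.to_sql('news', conn, if_exists='replace', index=False)
        finally:
            # Close the connection even if the write fails
            conn.close()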
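
The `custom_settings` of the spider point to `newsspider.middlewares.RandomUserAgentMiddleware`, which is not shown in this notebook. A minimal sketch of what such a downloader middleware could look like is given here. The `USER_AGENTS` list and the class body are assumptions for illustration and may differ from the version in the repository.

```python
import random

# Hypothetical pool of User-Agent strings; replace with the ones you want to rotate
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agents=None):
        # Fall back to the module-level pool when no list is supplied
        self.user_agents = list(user_agents or USER_AGENTS)

    def process_request(self, request, spider):
        # Scrapy calls this hook for every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None  # Returning None lets Scrapy continue handling the request
```

For the dotted path in `DOWNLOADER_MIDDLEWARES` to resolve, a class like this would live in `newsspider/middlewares.py`.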

**Create a .py file called** `runspider.py` **to run the spider**

In [3]:
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks
from newsspider.spiders.news_scraper import TrendsNewsSpider
from scrapy.utils.log import configure_logging
import sys

# Configure logging for Scrapy
configure_logging()

@inlineCallbacks
def crawl(runner, country):
    try:
        yield runner.crawl(TrendsNewsSpider, country=country)
    finally:
        # Stop the reactor whether the crawl succeeded or failed, so the script can exit
        reactor.stop()

def run_spider(country):
    runner = CrawlerRunner(get_project_settings())
    crawl(runner, country)
    reactor.run(installSignalHandlers=False)

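if __name__ == "__main__":
    # The country name is expected as the first command-line argument
    if len(sys.argv) < 2:
        sys.exit("Usage: python runspider.py <country>")
    country = sys.argv[1]
    run_spider(country)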
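
Once the spider has finished, `closed()` writes the scraped articles to `newsData.csv` and to the `news` table in `newsData.db`. A quick way to check the result is sketched below using the standard-library `sqlite3` module. The helper name `count_articles_by_country` is hypothetical, and the column name comes from the dictionary built in `parse_article`.

```python
import sqlite3


def count_articles_by_country(conn):
    """Return {country: number_of_articles} from the `news` table."""
    rows = conn.execute(
        'SELECT "COUNTRY", COUNT(*) FROM news GROUP BY "COUNTRY"'
    ).fetchall()
    return dict(rows)


# Usage, assuming the spider has already created newsData.db in this folder:
#   conn = sqlite3.connect("newsData.db")
#   print(count_articles_by_country(conn))
#   conn.close()
```

A count of zero rows, or a missing table, would suggest that the selectors in `parse_allafrica` or `parse_article` no longer match the page layout.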