
https://www.dw.com/en/why-south-korean-women-arent-having-babies/a-68419317

### Importing Libraries

In [156]:
import requests
import pandas as pd
import json
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings('ignore')

### Requesting the Page

In [158]:
base_url = 'https://www.dw.com'
article_url = '/en/why-south-korean-women-arent-having-babies/a-68419317'

In [159]:
# function for requesting url
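def request_url(link_url):
    url = base_url + link_url
    # a timeout keeps the notebook from hanging on a stalled connection
    response = requests.get(url, timeout=30)
    # fail early on HTTP errors instead of parsing an error page
    response.raise_for_status()
    return response.text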

# creating soup object
html_article = request_url(article_url)
soup = BeautifulSoup(html_article, 'html.parser')

### Extracting Information

Author of Article

In [160]:
# author name of article 
author_class_name = 'author-link'
author = soup.find(name="a", attrs={'class': author_class_name})
author_name = author.text

print(f'Author Name is:\n {author_name}')

Author Name is:
 Julian Ryall


Date of Article

In [161]:
# date of article

date_of_article_tag = soup.select('time')
date_of_article = date_of_article_tag[0].contents[0]
print(f'Date of Article is:\n {date_of_article}')


Date of Article is:
 03/01/2024
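
The scraped value `03/01/2024` is ambiguous on its own. The following is a minimal sketch, assuming the site prints dates as month/day/year (which would match the 01.03.2024 date used in Exercise 2); `parse_article_date` is a hypothetical helper and not part of the original notebook.

```python
from datetime import datetime

def parse_article_date(raw_date):
    """Parse a 'MM/DD/YYYY' string (assumed format) into a date object."""
    return datetime.strptime(raw_date.strip(), "%m/%d/%Y").date()

parsed = parse_article_date("03/01/2024")
# ISO format (YYYY-MM-DD) removes the month/day ambiguity
print(parsed.isoformat())
```

Storing the ISO string instead of the raw text would also make the dates sortable, if that is needed later.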


Title of Article

In [162]:
# title of the article
title_class_name = 'sc-ivDvhZ iBEcrE sc-iJfdHH djgAGb sc-jIQHsi VsLvM sc-ZiJuh dCQQTg sc-isKtiH jzOpwm'
title = soup.find(name="h1", attrs={'class': title_class_name})
title_text = title.text

print(f'Title of the Article is:\n {title_text}')


Title of the Article is:
 Why South Korean women aren't having babies


Summary of Article

In [163]:
# summary of article

summary = soup.find(name = 'p', attrs = {'class' :'teaser-text' })
summary_text = summary.text

print(f"The summary of the article is:\n {summary_text}")

The summary of the article is:
 New statistics show a record low number of children were born last year in South Korea, with women citing a desire for a career and to push back against a male-dominated society as key reasons.


Main text of Article

In [164]:
# main text of article
main_tag = soup.find(name = 'div', attrs = {'class' :'rich-text'})
main_text = main_tag.text
print(f'Main text is:\n {main_text}')

Main text is:
 When she was younger, Hyobin Lee yearned to be a mother. There came a point, however, when she had to make a difficult decision. Ultimately, she chose her career over a family and is now a successful academic in the South Korean city of Daejeon.
Lee, now 44, is just one of millions of Korean women who are making a conscious decision to remain childless — resulting in the nation's fertility rate dropping to a new record low.
The fertility rate — the average number of births per woman — shrank to 0.72 last year, according to preliminary government statistics released on Wednesday, down from 0.78 in the previous year and continuing the gradual annual decline since 2015.
That figure is well below the 2.1 children required to maintain South Korea's population, with the mere 230,000 children born last year hinting that the nation's total population is on course to fall to around 26 million — half the current total — by 2100.

A dream of a son
"When I was young, I dreamed of ha

Related Topics

In [165]:
# related topics 
related_topics_links = {}
related_topics = soup.find('aside').find_all('a')
for i in related_topics:
    related_topics_links[i.text] = base_url + i.attrs['href']
    
print(f'Related topics are: \n {related_topics_links}')


Related topics are: 
 {'Minorities': 'https://www.dw.com/en/minorities/t-50781525', "Women's rights": 'https://www.dw.com/en/womens-rights/t-17455067', 'South Korea': 'https://www.dw.com/en/south-korea/t-63357867'}


Creating Dictionary

In [166]:
#adding keys in dictionary 
website_data_dict = {}
website_data_dict['author_name'] = author_name
website_data_dict['date_of_article'] = date_of_article
website_data_dict['title_of_article'] = title_text
website_data_dict['summary'] = summary_text
website_data_dict['main_text'] = main_text
website_data_dict['related_topic'] = related_topics_links

In [167]:
def prettify_dict(dictionary_input):
    print(json.dumps(dictionary_input, sort_keys=True, indent=4))
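
The JSON output in this notebook shows dashes escaped as `\u2014`, because `json.dumps` escapes non-ASCII characters by default. A minimal sketch of a variant that keeps them readable is given here; `to_pretty_json` and the sample dictionary are hypothetical and only serve as illustration.

```python
import json

def to_pretty_json(data):
    """Serialize a dictionary with sorted keys, keeping non-ASCII characters readable."""
    return json.dumps(data, sort_keys=True, indent=4, ensure_ascii=False)

# made-up sample data, not the scraped dictionary
sample = {"summary": "childless \u2014 resulting", "author_name": "Julian Ryall"}
pretty = to_pretty_json(sample)
print(pretty)
```

Returning the string instead of printing it inside the function also makes it possible to write the result to a file.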

In [168]:
print(f'The resulting dictionary is : \n')
prettify_dict(website_data_dict)

The resulting dictionary is : 

{
    "author_name": "Julian Ryall",
    "date_of_article": "03/01/2024",
    "main_text": "When she was younger, Hyobin Lee yearned to be a mother. There came a point, however, when she had to make a difficult decision. Ultimately, she chose her career over a family and is now a successful academic in the South Korean city of Daejeon.\nLee, now 44, is just one of millions of Korean women who are making a conscious decision to remain childless \u2014 resulting in the nation's fertility rate dropping to a new record low.\nThe fertility rate \u2014 the average number of births per woman \u2014 shrank to 0.72 last year, according to preliminary government statistics released on Wednesday, down from 0.78 in the previous year and continuing the gradual annual decline since 2015.\nThat figure is well below the 2.1 children required to maintain South Korea's population, with the mere 230,000 children born last year hinting that the nation's total population is 

### Analyzing the article

Inspecting the page shows that the following data can also be extracted:
1. Share links for social media websites
2. Links to images related to the article
3. Links to videos related to the article
4. Links embedded in the paragraphs that point to additional information (for example: 'South Korea')
5. The author's bio (journalistic background)
6. Name of the editor
7. Image names

In [169]:
# Collecting social websites link 

# always visible websites 
websites_icons = soup.select_one("div.always-visible").find_all('a')

social_websites_link = {}


for w in websites_icons:

    title = w.attrs['title']
    print(f'Before splitting the title : {title}')

    #extracting title from a string 
    title = title.split('—')[1].split('with')[1].strip()

    print(f'After splitting the title: {title}')
    
    social_link = w.attrs['href']

    social_websites_link[title] = social_link
    


print(f'\nSocial Websites link are: \n {social_websites_link}')



Before splitting the title : External link — share this page with Facebook
After splitting the title: Facebook
Before splitting the title : External link — share this page with X / Twitter
After splitting the title: X / Twitter

Social Websites link are: 
 {'Facebook': 'https://www.facebook.com/sharer/sharer.php?u=https://p.dw.com/p/4d4zN%3Fmaca%3Den-Facebook-sharing', 'X / Twitter': "https://twitter.com/intent/tweet?source=webclient&text=Why%20South%20Korean%20women%20aren't%20having%20babies+https://p.dw.com/p/4d4zN%3Fmaca%3Den-Twitter-sharing"}
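
The chained `split` calls above raise an `IndexError` when a title lacks either separator. A minimal sketch of a more forgiving variant follows, assuming the network name is whatever comes after the last " with " in the title; `share_target` is a hypothetical helper, and it differs from the original in that it does not depend on the dash at all.

```python
def share_target(title):
    """Return the text after the last ' with ' in a share-link title, or None."""
    _, separator, target = title.rpartition(" with ")
    if not separator:
        # the title does not follow the expected pattern
        return None
    return target.strip()

# the two titles printed by the cell above
titles = [
    "External link \u2014 share this page with Facebook",
    "External link \u2014 share this page with X / Twitter",
]
for t in titles:
    print(share_target(t))
```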


In [170]:
# extracting first image
images_dict = {}
image = soup.select_one('picture').find('img')

images_dict[image.attrs['title']] = image.attrs['srcset'].split(',')[0].split()[0]

print(f'Images dictionary :\n {images_dict}')


Images dictionary :
 {'A woman holding an umbrella crosses a road in central Seoul ': 'https://static.dw.com/image/68419984_600.jpg'}
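
The cell above keeps only the first candidate of the `srcset` attribute. As a rough sketch, the whole attribute can be split into (URL, width) pairs so that a specific size can be chosen. This assumes the candidates use width descriptors such as `600w` and that the URLs contain no commas; `parse_srcset`, `widest_image`, and the example value are hypothetical.

```python
def parse_srcset(srcset):
    """Split a srcset string into (url, width) pairs; width is None without a 'w' descriptor."""
    candidates = []
    for entry in srcset.split(","):
        parts = entry.split()
        if not parts:
            continue  # skip empty entries, e.g. after a trailing comma
        url = parts[0]
        width = None
        if len(parts) > 1 and parts[1].endswith("w") and parts[1][:-1].isdigit():
            width = int(parts[1][:-1])
        candidates.append((url, width))
    return candidates

def widest_image(srcset):
    """Return the URL with the largest declared width, or the first URL as a fallback."""
    candidates = parse_srcset(srcset)
    if not candidates:
        return None
    with_width = [c for c in candidates if c[1] is not None]
    if not with_width:
        return candidates[0][0]
    return max(with_width, key=lambda c: c[1])[0]

# made-up srcset value for illustration only
example_srcset = "https://example.com/img_600.jpg 600w, https://example.com/img_1200.jpg 1200w"
print(widest_image(example_srcset))
```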


Adding metadata to the original dictionary

In [171]:
website_data_dict['image'] = images_dict
website_data_dict['social_website'] = social_websites_link

### Printing Dictionary

In [172]:
print(f"The final dictionary with all data and metadata: \n")

prettify_dict(website_data_dict)

The final dictionary with all data and metadata: 

{
    "author_name": "Julian Ryall",
    "date_of_article": "03/01/2024",
    "image": {
        "A woman holding an umbrella crosses a road in central Seoul ": "https://static.dw.com/image/68419984_600.jpg"
    },
    "main_text": "When she was younger, Hyobin Lee yearned to be a mother. There came a point, however, when she had to make a difficult decision. Ultimately, she chose her career over a family and is now a successful academic in the South Korean city of Daejeon.\nLee, now 44, is just one of millions of Korean women who are making a conscious decision to remain childless \u2014 resulting in the nation's fertility rate dropping to a new record low.\nThe fertility rate \u2014 the average number of births per woman \u2014 shrank to 0.72 last year, according to preliminary government statistics released on Wednesday, down from 0.78 in the previous year and continuing the gradual annual decline since 2015.\nThat figure is well be

## Exercise 2

### Extracting Information for articles on date 01.03.2024

In [173]:
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By  
import time
import pandas as pd

Opening the Chrome browser

In [174]:
search_url = 'https://www.dw.com/search/?languageCode=en'
driver = webdriver.Chrome()
driver.get(search_url)
driver.maximize_window()

Selecting Media Type as Article

In [175]:
article_css_selector = '#searchType > a:nth-child(5)'
article_media = driver.find_element(By.CSS_SELECTOR,article_css_selector)
article_media.click()

Selecting dates from 01.03.2024 to 01.03.2024

In [176]:
month_to_select = 'March'
year_to_select = '2024'
day_to_select = '1'

month_from_select = 'March'
year_from_select = '2024'
day_from_select = '1'

months_list = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

In [177]:
#selecting from date input field
date_from_id = 'dateFrom'
date_from_element = driver.find_element(By.ID, date_from_id)
date_from_element.click()


#selecting the calendar div 
calendar_div_id = 'ui-datepicker-div'
calendar = driver.find_element(By.ID, calendar_div_id)

# selecting the month year div
month_year_class = 'ui-datepicker-title'
month_year = driver.find_element(By.CLASS_NAME, month_year_class)


current_month_from = month_year.text.split()[0]
current_year_from = month_year.text.split()[1]

print(f"The current Month is: {current_month_from}")
print(f"The current Year is: {current_year_from}")


The current Month is: April
The current Year is: 2024


In [178]:
def selecting_date(month_to_select, year_to_select, day_to_select, current_year, current_month):
    while (current_month + current_year) != (month_to_select + year_to_select):
        if (current_year < year_to_select) :
            next_button  = driver.find_element(By.XPATH, '//*[@id="ui-datepicker-div"]/div/a[2]')
            next_button.click()
            current_year = driver.find_element(By.CLASS_NAME, 'ui-datepicker-year').text
            current_month = driver.find_element(By.CLASS_NAME, 'ui-datepicker-month').text
            if current_month + current_year == month_to_select + year_to_select:
                break
        elif current_year > year_to_select:
             previous_button = driver.find_element(By.XPATH, '//*[@id="ui-datepicker-div"]/div/a[1]')
             previous_button.click()
             current_year = driver.find_element(By.CLASS_NAME, 'ui-datepicker-year').text
             current_month = driver.find_element(By.CLASS_NAME, 'ui-datepicker-month').text
             if  current_month + current_year == month_to_select + year_to_select:
                break
        elif months_list.index(current_month) < months_list.index(month_to_select):
            next_button  = driver.find_element(By.XPATH, '//*[@id="ui-datepicker-div"]/div/a[2]')
            next_button.click()
            current_year = driver.find_element(By.CLASS_NAME, 'ui-datepicker-year').text
            current_month = driver.find_element(By.CLASS_NAME, 'ui-datepicker-month').text
            if current_month + current_year == month_to_select + year_to_select:
                break
        elif months_list.index(current_month) > months_list.index(month_to_select):
             previous_button = driver.find_element(By.XPATH, '//*[@id="ui-datepicker-div"]/div/a[1]')
             previous_button.click()
             current_year = driver.find_element(By.CLASS_NAME, 'ui-datepicker-year').text
             current_month = driver.find_element(By.CLASS_NAME, 'ui-datepicker-month').text
             if current_month + current_year == month_to_select + year_to_select:
                break
        else:
            break

    #setting date on the calendar 
    calendar_table = driver.find_element(By.CSS_SELECTOR, '#ui-datepicker-div > table > tbody')
    # check every day link in the calendar table
    for cell in calendar_table.find_elements(By.TAG_NAME, "a"):
        if cell.text == day_to_select:
            cell.click()
            # stop searching once the day has been clicked
            return
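
The navigation in `selecting_date` decides between the next and previous buttons by comparing year and month separately. The same decision can be sketched as a single signed month offset, assuming each button click moves the calendar by exactly one month; `month_offset` and `MONTHS` are hypothetical names and not part of the original notebook.

```python
MONTHS = ['January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December']

def month_offset(current_month, current_year, target_month, target_year):
    """Return how many months the target lies after (+) or before (-) the current view."""
    current = int(current_year) * 12 + MONTHS.index(current_month)
    target = int(target_year) * 12 + MONTHS.index(target_month)
    return target - current

# calendar opens on April 2024, target is March 2024
print(month_offset('April', '2024', 'March', '2024'))
```

A positive result would mean clicking the next button that many times, a negative result the previous button. Converting the years with `int` also avoids comparing them as strings.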


In [179]:
# selecting from date
selecting_date(month_to_select=month_from_select, year_to_select=year_from_select, day_to_select=day_from_select, current_month=current_month_from, current_year=current_year_from)

In [180]:
#selecting date-to input field
date_to_id = 'dateTo'
date_to_element = driver.find_element(By.ID, date_to_id)
date_to_element.click()


#selecting the calendar div 
calendar_div_id = 'ui-datepicker-div'
calendar = driver.find_element(By.ID, calendar_div_id)

# selecting the month year div
month_year_class = 'ui-datepicker-title'
month_year_to = driver.find_element(By.CLASS_NAME, month_year_class)


current_month_to = month_year_to.text.split()[0]
current_year_to = month_year_to.text.split()[1]

print(f"The current Month to is: {current_month_to}")
print(f"The current Year to is: {current_year_to}")

The current Month to is: April
The current Year to is: 2024


In [181]:
# selecting to date
selecting_date(month_to_select=month_to_select, year_to_select=year_to_select, day_to_select=day_to_select, current_year=current_year_to, current_month=current_month_to)

Extracting article details

In [182]:
#dataframe for articles
articles_dataframe = pd.DataFrame()

#dataframe for article details
main_dataframe = pd.DataFrame()

#setting the col width of dataframe to max
pd.set_option('display.max_colwidth', 200)


Checking Pagination

In [183]:
show_more = driver.find_elements(By.XPATH, '//*[@id="searchResult"]/a')
if len(show_more) > 0:
    more_pages = True
else:
    more_pages = False

In [184]:
while more_pages:
    # look up the "show more" link again on every pass
    show_more = driver.find_elements(By.XPATH, '//*[@id="searchResult"]/a')
    if len(show_more) == 0:
        more_pages = False
        break
    driver.execute_script("arguments[0].click();", show_more[0])
    # wait for the next batch of results to load
    time.sleep(3)
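
The pagination loop has no upper bound, so it would run indefinitely if the link never disappears. A minimal sketch of the same loop with a click cap is shown here, with the browser calls passed in as functions so the logic can be checked without Selenium. `click_until_gone` and the default cap of 50 are assumptions for illustration.

```python
def click_until_gone(find_links, click, max_clicks=50):
    """Click the first link returned by find_links() until none is left or the cap is hit.

    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        links = find_links()
        if not links:
            break
        click(links[0])
        clicks += 1
    return clicks

# stand-in for the page: three "show more" links that disappear one per click
remaining = ["more", "more", "more"]
performed = click_until_gone(lambda: list(remaining), lambda link: remaining.pop())
print(performed)
```

In the notebook, `find_links` would wrap the `driver.find_elements` call and `click` would wrap `driver.execute_script` followed by the `time.sleep(3)` pause.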
    

Searching through the whole list

In [185]:
para_list_class = 'searchResult'
para_list = driver.find_elements(By.CLASS_NAME, para_list_class)


for i in range(len(para_list)):
         
    #getting heading
    para_heading = para_list[i].find_element(By.TAG_NAME, 'h2').text
    span_heading =  para_list[i].find_element(By.TAG_NAME, 'span').text
    para_heading = para_heading.replace(span_heading,"")
    articles_dataframe.loc[i, 'title']= str(para_heading)
    
    #getting two liners
    para_two_liner = para_list[i].find_element(By.TAG_NAME, 'p').text
    articles_dataframe.loc[i,'detail'] = str(para_two_liner)

    #getting images 
    para_img = para_list[i].find_element(By.TAG_NAME, 'img').get_attribute('src')
    articles_dataframe.loc[i,'image'] = str(para_img)

    # Getting Urls
    para_url = para_list[i].find_element(By.TAG_NAME, 'a').get_attribute('href')
    articles_dataframe.loc[i,'article_url'] = str(para_url)

    
    
            

Articles Dataframe

In [186]:
print(f"Number of Rows : {articles_dataframe.shape[0]}")
print(f"Number of Cols : {articles_dataframe.shape[1]}")
articles_dataframe.head(10)

Number of Rows : 24
Number of Cols : 4


index,title,detail,image,article_url
0,Apple pulls the plug on its self-driving e-car project,"Apple has ended its decadelong autonomous vehicle effort, known as Project Titan. The stock market breathed a sigh of relief in response, with insiders pointing to areas where Apple should redoubl...",https://static.dw.com/image/66794208_305.webp,https://www.dw.com/en/apple-pulls-the-plug-on-its-self-driving-e-car-project/a-68421710
1,Iran elections: Low turnout in parliamentary vote,Iran's parliamentary elections saw low voter turnout as candidates competed for a seat in the 290-member parliament. New members of Iran's Assembly of Experts were also to be elected.,https://static.dw.com/image/68412824_305.webp,https://www.dw.com/en/iran-elections-low-turnout-in-parliamentary-vote/a-68412885
2,Germany: Tesla plant protesters to spend week in forest,Treehouses and a piano concert in the woods: Activists protesting against a Tesla factory expansion near Berlin said they intend to stay in a forest for a week. The police keep a low profile.,https://static.dw.com/image/68404578_305.webp,https://www.dw.com/en/germany-tesla-plant-protesters-to-spend-week-in-forest/a-68422841
3,Nicaragua says Germany facilitates genocide by aiding Israel,Nicaragua has filed a case at the International Court of Justice against Germany for financial and military support it provides to Israel. A hearing could take place within weeks.,https://static.dw.com/image/68138344_305.webp,https://www.dw.com/en/nicaragua-says-germany-facilitates-genocide-by-aiding-israel/a-68422846
4,Why some African countries have strange shapes,Arbitrary boundaries were drawn on maps to separate European colonies in Africa. Most of those lines drawn on a Berlin Conference table between 1884 and 1885 are still in place today – with deadly...,https://static.dw.com/image/51539775_305.webp,https://www.dw.com/en/why-some-african-countries-have-strange-shapes/a-67624254
5,Why South Korean women aren't having babies,"New statistics show a record low number of children were born last year in South Korea, with women citing a desire for a career and to push back against a male-dominated society as key reasons.",https://static.dw.com/image/68419984_305.webp,https://www.dw.com/en/why-south-korean-women-arent-having-babies/a-68419317
6,Navalny buried at Moscow cemetery after prison death,"Crowds chanted ""Navalny, Navalny!"" as his coffin was carried into the church for his funeral service. Proceedings had taken place amid a heavy security presence and warnings that protests would no...",https://static.dw.com/image/68417403_305.webp,https://www.dw.com/en/navalny-buried-at-moscow-cemetery-after-prison-death/a-68412409
7,Transnistria: Will Russia's next war be in Moldova?,"Moldova's breakaway region of Transnistria has asked Moscow for protection. Although President Putin made no mention of the appeal in his annual address to the nation, is Russia's next war imminent?",https://static.dw.com/image/66201775_305.webp,https://www.dw.com/en/transnistria-will-russias-next-war-be-in-moldova/a-68418058
8,New species of green anaconda identified in Amazon,"Researchers have shown that two genetically very different types of green anaconda exist in the Amazon. The two species appear virtually identical, meaning their genetic divergence had not been no...",https://static.dw.com/image/68417231_305.webp,https://www.dw.com/en/new-species-of-green-anaconda-identified-in-amazon/a-68417173
9,Decoding China: National People's Congress meets at bad time,"China's economy is spluttering, as nearly 3,000 lawmakers are set to meet in Beijing for the National People's Congress. Prime Minister Li Qiang will face some tough questions.",https://static.dw.com/image/67309353_305.webp,https://www.dw.com/en/china-national-peoples-congress-meets-economic-recovery/a-68417995


Creating the main dataframe

In [187]:
main_topics = []
for j in range(len(articles_dataframe['article_url'])):
    html_article = requests.get(articles_dataframe['article_url'][j]).text
    soup_article = BeautifulSoup(html_article, 'html.parser')
    
    #authors 
    if soup_article.find(name = 'div', attrs={'class': 'author-details'}):
       main_dataframe.loc[j,'author'] = soup_article.find(name = 'div', attrs={'class': 'author-details'}).text
       
    else:
        main_dataframe.loc[j,'author'] = None
      
    
    #dates
    if soup_article.find('time'):
        main_dataframe.loc[j,'date'] = soup_article.find('time').text
        
    else:
        main_dataframe.loc[j,'date'] = None
        
    #summary
    if soup_article.find('p', attrs={'class':'teaser-text'}):
       main_dataframe.loc[j,'summary'] = soup_article.find('p', attrs={'class':'teaser-text'}).text
    else:
       main_dataframe.loc[j,'summary'] = None
      
    #titles
    if soup_article.find('h1', attrs={'class': 'jzOpwm'}):
        main_dataframe.loc[j, 'title'] = soup_article.find('h1', attrs={'class': 'jzOpwm'}).text
    else:
        main_dataframe.loc[j, 'title'] = None
    

    #main text
    if soup_article.find('div', attrs={'class':'rich-text'}):
        main_dataframe.loc[j,'main_text'] = soup_article.find('div', attrs={'class':'rich-text'}).text
    else:
        main_dataframe.loc[j,'main_text'] = None


    # topics
    if soup_article.find('aside'):
        topics = []
        for i in soup_article.find('aside').find_all('a'):
            topics.append(i.text)
        main_topics.append(topics)
    else:
        main_topics.append([None])
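
Each field in the loop above looks up the same tag twice, once for the check and once for the text. A minimal sketch of a helper that does the lookup once and returns `None` for a missing tag is given here. `tag_text` and the sample HTML are hypothetical, and unlike `.text` in the original, `get_text(strip=True)` also trims surrounding whitespace.

```python
from bs4 import BeautifulSoup

def tag_text(soup_obj, name, class_name=None):
    """Return the stripped text of the first matching tag, or None if it is absent."""
    if class_name:
        tag = soup_obj.find(name, class_=class_name)
    else:
        tag = soup_obj.find(name)
    return tag.get_text(strip=True) if tag else None

# made-up markup for illustration; it has no <time> tag on purpose
sample_html = """
<article>
  <h1 class="headline">Sample title</h1>
  <p class="teaser-text">Sample summary.</p>
</article>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
row = {
    "title": tag_text(sample_soup, "h1"),
    "summary": tag_text(sample_soup, "p", "teaser-text"),
    "date": tag_text(sample_soup, "time"),
}
print(row)
```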

    

In [188]:
main_dataframe['topics'] = main_topics

Shape of Dataframe and first five rows

In [189]:
print(f"Shape of Dataframe is {main_dataframe.shape}")
print(f"Number of rows : {main_dataframe.shape[0]} ")
print(f"Number of cols : {main_dataframe.shape[1]} ")

Shape of Dataframe is (24, 6)
Number of rows : 24 
Number of cols : 6 


In [190]:
#first five rows 
main_dataframe.head(5)

index,author,date,summary,title,main_text,topics
0,Dirk Kaufmann,03/01/2024,"Apple has ended its decadelong autonomous vehicle effort, known as Project Titan. The stock market breathed a sigh of relief in response, with insiders pointing to areas where Apple should redoubl...",Apple pulls the plug on its self-driving e-car project,"Traditional car manufacturers in Europe, Asia and the US face a number of problems, some of them of their own making. To make matters worse, the creeping concern that tech companies could crowd le...","[Electric vehicles, Apple, Artificial intelligence]"
1,,03/01/2024,Iran's parliamentary elections saw low voter turnout as candidates competed for a seat in the 290-member parliament. New members of Iran's Assembly of Experts were also to be elected.,Iran elections: Low turnout in parliamentary vote,"Iran held its first parliamentary elections since mass protests swept the Islamic Republic in 2022, triggered by the death of Jina Mahsa Amini while in custody.\n""Voting for the 12th term of the I...","[Iran protests, Jina Mahsa Amini, Iran]"
2,,03/01/2024,Treehouses and a piano concert in the woods: Activists protesting against a Tesla factory expansion near Berlin said they intend to stay in a forest for a week. The police keep a low profile.,Germany: Tesla plant protesters to spend week in forest,"Environmental activists protesting the expansion of US electric carmaker Tesla's factory outside Berlin said they plan to occupy the forest near the factory for a week. \n""The longer the occupatio...","[Rhine River, Poverty in Germany, Greta Thunberg, Robert Habeck, Germany]"
3,,03/01/2024,Nicaragua has filed a case at the International Court of Justice against Germany for financial and military support it provides to Israel. A hearing could take place within weeks.,Nicaragua says Germany facilitates genocide by aiding Israel,"Nicaragua has accused Germany of facilitating ""genocide"" in Gaza in a case filed at the International Court of Justice (ICJ) on Friday.\nThe Nicaraguan government said Germany provided financial a...","[Rhine River, Poverty in Germany, Robert Habeck, International Court of Justice (ICJ), Israel, Germany]"
4,Cai Nebe,03/01/2024,Arbitrary boundaries were drawn on maps to separate European colonies in Africa. Most of those lines drawn on a Berlin Conference table between 1884 and 1885 are still in place today – with deadly...,Why some African countries have strange shapes,"Why is the Caprivi Strip symbolic of colonial era mapmaking?\nIt's one of the weirdest looking territories on a map, and forms part of Namibia. It looks like a pan handle, has parallel straight li...","[Namibia, Cameroon, Togo, Tanzania]"


Closing the browser

In [191]:
driver.quit()

End of File