# DAV 5400 Project 2

### Working with HTML, JSON, Web Scraping, and Web APIs
 Team Members: **Meher Venkat Karri**, **Ruchika Reddy Nannuri**

## Working with HTML and JSON

To share album information, including artist details, album titles, release years, and songs, you'll need to create HTML and JSON files containing these details for three albums. After creating these files:

- Upload the HTML and JSON files to a GitHub repository.
- Use the urllib.request library in Python to access the raw URLs of these files on GitHub.
- Retrieve the content of the files from GitHub.
- Load the JSON file content into a Python dictionary using the json module.
- Convert the parsed HTML data and the dictionary from the JSON file into pandas DataFrames for analysis and manipulation.
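
The retrieval and parsing steps above can be sketched with only the standard library. The helper names `fetch_raw` and `parse_albums` and the inline sample payload are illustrative assumptions; the actual layout of `Album.json` may differ.

```python
import json
import urllib.request


def fetch_raw(url, timeout=10):
    """Download the raw bytes behind a URL (for example a raw GitHub link)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_albums(raw_bytes):
    """Decode UTF-8 bytes and parse them as JSON into Python objects."""
    return json.loads(raw_bytes.decode("utf-8"))


# Hypothetical payload standing in for the downloaded file contents.
sample = (
    b'[{"Artist": "A.R. Rahman", "Album Title": "Roja", '
    b'"Year Released": 1992, "Songs": ["Roja Roja"]}]'
)
albums = parse_albums(sample)
print(albums[0]["Album Title"])
```

Calling `fetch_raw(json_url)` would supply real bytes in place of `sample`, and the parsed records can then be handed to `pd.DataFrame(albums)`.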

In [88]:
import pandas as pd
import json 
import requests
import urllib.request

In [89]:
html_url='https://raw.githubusercontent.com/meher646/DAV-5400/main/A%20R%20Rahman.html'

In [106]:
json_url='https://raw.githubusercontent.com/meher646/DAV-5400/main/Album.json'


In [107]:
html_df=pd.read_html(html_url)[0]

In [108]:
html_df

Unnamed: 0,Artist,Album Title,Year Released,Songs
0,A.R. Rahman,Roja,1992,"Naa Cheli Rojave, Roja Roja, Paruvam Vanaga, C..."
1,Devi Sri Prasad,Gabbar Singh,2012,"Dekho Dekho Gabbar Singh, Akasam Ammayaithe, P..."
2,M.M. Keeravani,Baahubali: The Beginning,2015,"Sivuni Aana, Pacha Bottasi, Dhivara, Manohari,..."


In [109]:
pd.read_json(json_url)

Unnamed: 0,Artist,Album Title,Year Released,Songs
0,A.R. Rahman,Roja,1992,"[Naa Cheli Rojave, Roja Roja, Paruvam Vanaga, ..."
1,Devi Sri Prasad,Gabbar Singh,2012,"[Dekho Dekho Gabbar Singh, Akasam Ammayaithe, ..."
2,M.M. Keeravani,Baahubali: The Beginning,2015,"[Sivuni Aana, Pacha Bottasi, Dhivara, Manohari..."


The HTML and JSON files contain the same data, but the resulting DataFrames are not identical because of the following difference:

**Data Structure**: In the HTML DataFrame, the songs for each album are stored as a single comma-separated string. In the JSON DataFrame, the songs are stored as a list, indicated by the square brackets. The trailing ellipses in both outputs are pandas truncating long cells for display; they are not part of the data. The difference arises because an HTML table cell holds only text, whereas JSON has a native array type.
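
If the two tables need to match, the string form can be split into a list. The following is a minimal sketch; `split_songs` is a hypothetical helper and the sample values are shortened stand-ins for the real cells. It assumes no song title contains a comma.

```python
def split_songs(songs_text):
    """Turn a comma-separated song string into a list of trimmed titles."""
    return [title.strip() for title in songs_text.split(",") if title.strip()]


# Hypothetical cell value as it would appear in the HTML-derived table.
html_songs = "Naa Cheli Rojave, Roja Roja, Paruvam Vanaga"
# The matching JSON-derived value is already a list.
json_songs = ["Naa Cheli Rojave", "Roja Roja", "Paruvam Vanaga"]

print(split_songs(html_songs) == json_songs)
```

On the DataFrame itself, the same helper could be applied with `html_df['Songs'].apply(split_songs)`.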



## Scraping the Katz School’s “Staff” Web Page

To scrape data from the **Katz School** webpage, we use BeautifulSoup from the bs4 package, which parses HTML content. To extract specific information such as names, titles, emails, and phone numbers that follow particular patterns in the webpage's HTML structure, we use the re module for regular expressions. The first step is to identify the HTML tags and classes that enclose the desired data on the webpage.

After setting up our environment by importing necessary libraries, we proceed by fetching the webpage content into a Python object. For organizing and storing the extracted information, we will create a pandas DataFrame named **staff_info**. This DataFrame is designed to hold text data, making it suitable for storing the extracted staff details.

Within the parsed HTML content held in our variable, we aim to identify and extract details about staff members, including their names, titles, office locations, emails, and phone numbers. These extracted pieces of information will then be methodically saved into the staff_info DataFrame.

In [20]:
from bs4 import BeautifulSoup,NavigableString
import re

In [21]:
url = 'https://www.yu.edu/katz/staff'

page = requests.get(url)

soup = BeautifulSoup(page.text,'html')

In [22]:
soup.find('div')

<div aria-describedby="message_from_our_president_2_desc" aria-labelledby="message_from_our_president_2_label" aria-modal="true" class="modal fade js-modal-page-show" data-backdrop="true" data-keyboard="true" data-modal-options='{"id":"message_from_our_president_2","auto_open":false,"open_modal_on_element_click":".open-letter-modal"}' id="js-modal-page-show-modal" role="dialog" tabindex="-1">
<div class="modal-page-dialog modal-dialog modal-lg" role="document">
<div class="modal-page-content modal-content">
<div class="modal-page-content modal-header">
<button class="close js-modal-page-ok-button" data-dismiss="modal" type="button">
                              ×
                          </button>
<h4 class="modal-title modal-page-title" id="message_from_our_president_2_label">Message From Our President</h4>
</div>
<div class="modal-body modal-page-body" id="message_from_our_president_2_desc">
<p><img alt="Ari Berman" data-entity-type="file" data-entity-uuid="00f7c8ae-cb73-4b40-904d-

In [23]:
staff_div = soup.find('div',class_= 'text-only')

In [24]:
staff_div

<div class="text-only">
<div class="field field--name-field-paragraph-body"><h3>Office of the Dean </h3>
<p>Paul Russo, Vice Provost and Dean <br/>
Professor of Data Science<br/><a href="/faculty/pages/russo-paul">Read Dr. Russo's Biography</a> </p>
<p>Aaron Ross, Assistant Dean for Academic Programs and Deputy to the Dean <br/><a href="mailto:Aaron.Ross2@yu.edu">aaron.ross2@yu.edu</a> | 646-592-4148  <br/>
 <br/>
Sofia Binioris, Director of Communications and Strategic Initiatives<br/><a href="mailto:Sofia.Binioris@yu.edu">sofia.binioris@yu.edu</a> | 645-592-4719</p>
<p>Jackie Hamilton, Executive Director of Enrollment Management and Partnerships<br/><a href="mailto:jackie.hamilton@yu.edu">jackie.hamilton@yu.edu</a> | 646-787-6194</p>
<p>Pamela Rodman, Director of Finance and Administration<br/><a href="mailto:pamela.rodman@yu.edu">pamela.rodman@yu.edu</a> | 646.592.4777</p>
<p>Tabitha Collazo, Business and Operations Coordinator<br/><a href="mailto:tabitha.collazo@yu.edu">tabitha.col

In [25]:
staff_info = pd.DataFrame(columns = ['Office','Name','Title','Email','Phone'])

In [26]:
officeName = None  # most recent <h3> office heading seen so far

for p in staff_div.find_all('div', recursive=False):
    # Start at the first element inside the div, then walk its siblings
    # to pick up the <h3> office headings and the <p> staff entries.
    x = p.find_next()
    while x:
        if x.name == 'h3':
            # Extract the office name from the h3 tag
            officeName = x.text
        if x.name == 'p' and x.text.strip():
            # Name: text between the opening <p> and the first comma.
            # Only the first person listed in each <p> is captured.
            name_segments = re.findall(r'<p>(.*?)\s*,', str(x))
            # Title: text after the first comma, up to the next <br/> tag
            title_segments = re.findall(r',(.*?)\s*<br\s*/?>', str(x))
            name = name_segments[0]

            if len(title_segments) == 0:
                title = x.find('span').text
            else:
                title = title_segments[0]

            # Extract the email using a regex pattern for emails
            email_find = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', str(x))
            email = email_find[0] if len(email_find) > 0 else 'N/A'
            # Extract the phone number; this pattern matches hyphen-separated numbers only
            phone_find = re.findall(r'\b\d{3}-\d{3}-\d{4}\b', str(x))
            phone = phone_find[0] if len(phone_find) > 0 else 'N/A'
            staff_info.loc[len(staff_info)] = [officeName, name, title, email, phone]
        x = x.next_sibling


In [27]:
staff_info

Unnamed: 0,Office,Name,Title,Email,Phone
0,Office of the Dean,Paul Russo,Vice Provost and Dean,,
1,Office of the Dean,Aaron Ross,Assistant Dean for Academic Programs and Depu...,Aaron.Ross2@yu.edu,646-592-4148
2,Office of the Dean,Jackie Hamilton,Executive Director of Enrollment Management a...,jackie.hamilton@yu.edu,646-787-6194
3,Office of the Dean,Pamela Rodman,Director of Finance and Administration,pamela.rodman@yu.edu,
4,Office of the Dean,Tabitha Collazo,Business and Operations Coordinator,tabitha.collazo@yu.edu,646-592-4735
5,Office of the Dean,Ann Leary,Office Manager/Executive Assistant to the Dean...,ann.leary@yu.edu,646-592-4724
6,Graduate Admissions,Jared Hakimi,Director,jared.hakimi@yu.edu,646-592-4722
7,Graduate Admissions,Xavier Velasquez,Associate Director of Graduate Admissions Ope...,xavier.velasquez@yu.edu,646-592-4737
8,Graduate Admissions,Shayna Matzner,Assistant Director,Shayna.matzner@yu.edu,646-592-4726
9,Graduate Admissions,Linyu Zheng,Assistant Director,linyu.zheng@yu.edu,332-271-5865
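
The missing phone number for Pamela Rodman is consistent with the page markup, which lists it as `646.592.4777`: the pattern `\b\d{3}-\d{3}-\d{4}\b` used in the loop accepts hyphens only. A more tolerant pattern might look like the sketch below; `PHONE_PATTERN` and `find_phone` are illustrative names, not part of the notebook.

```python
import re

# Accept either "-" or "." between the digit groups; the backreference \1
# requires the same separator in both positions.
PHONE_PATTERN = re.compile(r"\b\d{3}([-.])\d{3}\1\d{4}\b")


def find_phone(text, default="N/A"):
    """Return the first phone-like match in text, or default if none."""
    match = PHONE_PATTERN.search(text)
    return match.group(0) if match else default


print(find_phone("pamela.rodman@yu.edu | 646.592.4777"))
print(find_phone("aaron.ross2@yu.edu | 646-592-4148"))
print(find_phone("Read Dr. Russo's Biography"))
```

Because the whole match is read through `match.group(0)`, the capturing group used for the separator does not change what is returned.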


## Working with Web APIs

To work with the **Newsdata.io** web API, you need to sign up for an account on the Newsdata.io website. Once you have registered, you will be provided with an API key. This key is used to authenticate your requests when you call their API to fetch news data.

With the API key, you can construct requests to the Newsdata.io API endpoints to retrieve news articles based on specific search queries, such as **"crypto"** for cryptocurrency-related news. In Python, you would typically use a library like requests to make HTTP calls to the API.

The response from the API is typically in JSON format, which you can convert into a Python dictionary using the `.json()` method of the response object returned by requests. From there:

- Extract the relevant parts of the data (such as articles, titles, and descriptions) from the dictionary.
- Store this data in a pandas DataFrame, which allows for easier manipulation and analysis.
- Perform the desired analysis on the DataFrame, such as statistical analysis or text analysis.
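
One way to assemble the request URL without hard-coding the key is sketched below. The endpoint and the `apikey`, `q`, and `language` parameters are the ones used in this notebook's URLs; the helper `build_news_url` and the environment variable name `NEWSDATA_API_KEY` are assumptions made for illustration.

```python
import os
from urllib.parse import urlencode

BASE_URL = "https://newsdata.io/api/1/news"


def build_news_url(query, api_key, language=None):
    """Assemble a request URL from the endpoint and its query parameters."""
    params = {"apikey": api_key, "q": query}
    if language:
        params["language"] = language
    return f"{BASE_URL}?{urlencode(params)}"


# NEWSDATA_API_KEY is a hypothetical environment variable name; the fallback
# string is a placeholder, not a working key.
api_key = os.environ.get("NEWSDATA_API_KEY", "YOUR_API_KEY")
print(build_news_url("crypto", api_key, language="en"))
```

Reading the key from the environment keeps it out of the shared notebook, and `urlencode` takes care of escaping search terms that contain spaces or special characters.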



In [72]:
news_api = "https://newsdata.io/api/1/news?apikey=pub_40202b4f77b894ab7862ce897af038b739202&q=pegasus&language=en"

In [73]:
import requests

def fetch_live_breaking_news(url):
    """Request the given URL and return the parsed JSON, or None on failure."""
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        # Parse the JSON body of the response into a Python dictionary
        return response.json()
    else:
        print(f"Failed to fetch data (HTTP {response.status_code})")
        return None


In [135]:
news_json = fetch_live_breaking_news(news_api)

In [136]:
df1 = pd.json_normalize(news_json['results'])
df1.head()

Unnamed: 0,article_id,title,link,keywords,creator,video_url,description,content,pubDate,image_url,...,source_url,source_icon,source_priority,country,category,language,ai_tag,sentiment,sentiment_stats,ai_region
0,977d1ba39a517648a31d10722dfaf598,Top Wall Street analysts prefer these three st...,https://www.cnbc.com/2024/03/17/top-wall-stree...,,,,TipRanks' analyst ranking service highlights W...,ONLY AVAILABLE IN PAID PLANS,2024-03-17 11:45:31,https://image.cnbcfm.com/api/v1/image/10738344...,...,http://cnbc.com,,82,[australia],[entertainment],english,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN CORPORATE PLANS
1,2d924607bb25cc0d79f1668f08a168a4,"Top gospel acts, ministers for Mindset Conference",https://www.jamaicaobserver.com/2024/03/17/top...,[entertainment],[BY KEVIN JACKSON Jamaica Observer Writer],,"Top gospel acts, ministers for Mindset Conference",ONLY AVAILABLE IN PAID PLANS,2024-03-17 05:24:00,https://www.jamaicaobserver.com/jamaicaobserve...,...,https://www.jamaicaobserver.com,,14529,[gabon],[entertainment],english,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN CORPORATE PLANS
2,50df8da4a8d5f68664af421d053a5520,Former Polish PM Kaczynski questioned over spy...,https://news.knowledia.com/all/all/articles/fo...,,,,Former Polish prime minister Jarosław Kaczyńsk...,ONLY AVAILABLE IN PAID PLANS,2024-03-17 03:40:57,https://i.abcnewsfe.com/a/197b8204-84f8-4c85-8...,...,http://news.knowledia.com,https://i.bytvi.com/domain_icons/knowledia.png,167779,[india],[top],english,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN CORPORATE PLANS
3,f463c468572a32bf6c9380620bb71d46,Former Polish PM Kaczynski questioned over spy...,https://news.knowledia.com/US/en/articles/form...,,,,Former Polish prime minister Jarosław Kaczyńsk...,ONLY AVAILABLE IN PAID PLANS,2024-03-17 03:40:57,https://i.abcnewsfe.com/a/197b8204-84f8-4c85-8...,...,http://news.knowledia.com,https://i.bytvi.com/domain_icons/knowledia.png,167779,[united states of america],[top],english,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN CORPORATE PLANS
4,925abbcfefd41200ed55708a6b71b02b,"Despot, radical … peacemaker? The millennial p...",https://www.theage.com.au/world/middle-east/de...,[world / middle east],[Sherryn Groch],,"Reformer, tyrant or both? Saudi Arabia’s crown...",ONLY AVAILABLE IN PAID PLANS,2024-03-16 18:00:00,https://static.ffx.io/images/$zoom_0.5521%2C$m...,...,https://www.theage.com.au,https://i.bytvi.com/domain_icons/theage.png,62533,[australia],[top],english,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN CORPORATE PLANS


In [137]:
crypto_api = "https://newsdata.io/api/1/news?apikey=pub_40202b4f77b894ab7862ce897af038b739202&q=crypto"
crypto_json = fetch_live_breaking_news(crypto_api)

In [139]:
df2 = pd.json_normalize(crypto_json['results'])
df2.head()

Unnamed: 0,article_id,title,link,keywords,creator,video_url,description,content,pubDate,image_url,...,source_url,source_icon,source_priority,country,category,language,ai_tag,sentiment,sentiment_stats,ai_region
0,e3564cfbe07bfbd8bba467ff674b0327,Top Meme Coins: Floki (FLOKI) vs Shiba Budz (B...,https://coinfomania.com/shiba-budz-cryptocurre...,"[press release, sponsored]",[PR Desk],,"In the vibrant world of cryptocurrency, meme c...",ONLY AVAILABLE IN PAID PLANS,2024-03-17 15:00:24,,...,https://coinfomania.com,https://i.bytvi.com/domain_icons/coinfomania.png,5766846,[united states of america],[top],english,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN CORPORATE PLANS
1,75601010eab99809dd07cee6309dd468,Avalanche and Toncoin Price Skyrockets But Nug...,https://bitcoinworld.co.in/avalanche-and-tonco...,"[latest news, press release, altcoin price, al...",[Keshav Aggarwal],,TLDR: Avalanche price has increased 32.7% in t...,ONLY AVAILABLE IN PAID PLANS,2024-03-17 15:00:24,https://bitcoinworld.co.in/wp-content/uploads/...,...,https://bitcoinworld.co.in,https://i.bytvi.com/domain_icons/bitcoinworld.jpg,2739438,[india],[business],english,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN CORPORATE PLANS
2,e63cd639dd75631e41c12ba56330b402,8 iPhone browser apps you should use instead o...,https://www.digitaltrends.com/mobile/iphone-br...,"[features, mobile, apple, apps, ios, iphone, s...",[Bryan M. Wolfe],,Safari is the default web browser for every iP...,ONLY AVAILABLE IN PAID PLANS,2024-03-17 15:00:05,https://www.digitaltrends.com/wp-content/uploa...,...,https://www.digitaltrends.com,https://i.bytvi.com/domain_icons/digitaltrends...,287,[ireland],[technology],english,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN CORPORATE PLANS
3,f71949b7a98c7b2fb1f5c23769bd4415,Dencun Goes Live of Ethereum Mainnet: CryptoQu...,https://finbold.com/dencun-goes-live-of-ethere...,[press releases],[Paul L.],,"TLDR: After months of waiting, the highly-anti...",ONLY AVAILABLE IN PAID PLANS,2024-03-17 15:00:00,https://assets.finbold.com/uploads/2024/03/ima...,...,https://finbold.com,https://i.bytvi.com/domain_icons/finbold.png,14758,[united kingdom],[top],english,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN CORPORATE PLANS
4,6e33e11eece285a39ba22ef764957cad,Polygon’s (MATIC) Investors $3 Dream Goes Up I...,https://coinedition.com/polygons-matic-investo...,[sponsored],[Coin Edition],,In the ever-evolving landscape of cryptocurren...,ONLY AVAILABLE IN PAID PLANS,2024-03-17 15:00:00,https://coinedition.com/wp-content/uploads/202...,...,https://coinedition.com,https://i.bytvi.com/domain_icons/coinedition.png,2131824,[united arab emirates],[top],english,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE P...,ONLY AVAILABLE IN CORPORATE PLANS


In [170]:
df1.isnull().sum()

article_id          0
title               0
link                0
keywords            3
creator             3
video_url          10
description         0
pubDate             0
image_url           0
source_id           0
source_url          0
source_icon         2
source_priority     0
country             0
category            0
language            0
dtype: int64

In [171]:
df2.isnull().sum()

article_id          0
title               0
link                0
keywords            0
creator             0
video_url          10
description         0
pubDate             0
image_url           3
source_id           0
source_url          0
source_icon         0
source_priority     0
country             0
category            0
language            0
dtype: int64

In [140]:
df1.columns

Index(['article_id', 'title', 'link', 'keywords', 'creator', 'video_url',
       'description', 'content', 'pubDate', 'image_url', 'source_id',
       'source_url', 'source_icon', 'source_priority', 'country', 'category',
       'language', 'ai_tag', 'sentiment', 'sentiment_stats', 'ai_region'],
      dtype='object')

In [141]:
df2.columns

Index(['article_id', 'title', 'link', 'keywords', 'creator', 'video_url',
       'description', 'content', 'pubDate', 'image_url', 'source_id',
       'source_url', 'source_icon', 'source_priority', 'country', 'category',
       'language', 'ai_tag', 'sentiment', 'sentiment_stats', 'ai_region'],
      dtype='object')

In [142]:
df1.drop(['ai_tag', 'sentiment', 'sentiment_stats', 'ai_region','content'],axis=1,inplace=True)
df2.drop(['ai_tag', 'sentiment', 'sentiment_stats', 'ai_region','content'],axis=1,inplace=True)

In [144]:
df1['language'].value_counts()

language
english    10
Name: count, dtype: int64

In [145]:
df2['language'].value_counts()

language
english    9
dutch      1
Name: count, dtype: int64

In [147]:
df1['category'].value_counts()

category
[top]              5
[world]            3
[entertainment]    2
Name: count, dtype: int64

In [148]:
df2['category'].value_counts()

category
[top]           6
[business]      3
[technology]    1
Name: count, dtype: int64

In [150]:
df1['country'].value_counts()

country
[australia]                   7
[gabon]                       1
[india]                       1
[united states of america]    1
Name: count, dtype: int64

In [151]:
df2['country'].value_counts()

country
[united states of america]    2
[united kingdom]              2
[united arab emirates]        2
[india]                       1
[ireland]                     1
[canada]                      1
[netherland]                  1
Name: count, dtype: int64
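
The `category` and `country` values are lists (hence the square brackets), so the counts above are per whole list rather than per label. If a row carried more than one label, counting individual labels may be more informative. A minimal sketch follows; the sample rows are hypothetical and not taken from the API responses.

```python
from collections import Counter


def count_labels(rows):
    """Count individual labels across rows whose values are lists."""
    return Counter(label for labels in rows for label in labels)


# Hypothetical rows: the second article carries two categories.
categories = [["top"], ["business", "technology"], ["top"]]
print(count_labels(categories).most_common())
```

In pandas, `Series.explode()` followed by `value_counts()` is the DataFrame counterpart of this flattening step.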

In [152]:
df1

Unnamed: 0,article_id,title,link,keywords,creator,video_url,description,pubDate,image_url,source_id,source_url,source_icon,source_priority,country,category,language
0,977d1ba39a517648a31d10722dfaf598,Top Wall Street analysts prefer these three st...,https://www.cnbc.com/2024/03/17/top-wall-stree...,,,,TipRanks' analyst ranking service highlights W...,2024-03-17 11:45:31,https://image.cnbcfm.com/api/v1/image/10738344...,cnbc,http://cnbc.com,,82,[australia],[entertainment],english
1,2d924607bb25cc0d79f1668f08a168a4,"Top gospel acts, ministers for Mindset Conference",https://www.jamaicaobserver.com/2024/03/17/top...,[entertainment],[BY KEVIN JACKSON Jamaica Observer Writer],,"Top gospel acts, ministers for Mindset Conference",2024-03-17 05:24:00,https://www.jamaicaobserver.com/jamaicaobserve...,jamaicaobserver,https://www.jamaicaobserver.com,,14529,[gabon],[entertainment],english
2,50df8da4a8d5f68664af421d053a5520,Former Polish PM Kaczynski questioned over spy...,https://news.knowledia.com/all/all/articles/fo...,,,,Former Polish prime minister Jarosław Kaczyńsk...,2024-03-17 03:40:57,https://i.abcnewsfe.com/a/197b8204-84f8-4c85-8...,knowledia,http://news.knowledia.com,https://i.bytvi.com/domain_icons/knowledia.png,167779,[india],[top],english
3,f463c468572a32bf6c9380620bb71d46,Former Polish PM Kaczynski questioned over spy...,https://news.knowledia.com/US/en/articles/form...,,,,Former Polish prime minister Jarosław Kaczyńsk...,2024-03-17 03:40:57,https://i.abcnewsfe.com/a/197b8204-84f8-4c85-8...,knowledia,http://news.knowledia.com,https://i.bytvi.com/domain_icons/knowledia.png,167779,[united states of america],[top],english
4,925abbcfefd41200ed55708a6b71b02b,"Despot, radical … peacemaker? The millennial p...",https://www.theage.com.au/world/middle-east/de...,[world / middle east],[Sherryn Groch],,"Reformer, tyrant or both? Saudi Arabia’s crown...",2024-03-16 18:00:00,https://static.ffx.io/images/$zoom_0.5521%2C$m...,theage,https://www.theage.com.au,https://i.bytvi.com/domain_icons/theage.png,62533,[australia],[top],english
5,3255b3f97d2a157037af81faf558e086,"Despot, radical … peacemaker? The millennial p...",https://www.theage.com.au/world/middle-east/de...,[world / middle east],[Sherryn Groch],,"Reformer, tyrant or both? Saudi Arabia’s crown...",2024-03-16 18:00:00,https://static.ffx.io/images/$zoom_0.5521%2C$m...,theage,https://www.theage.com.au,https://i.bytvi.com/domain_icons/theage.png,62533,[australia],[world],english
6,ae0e36047e6ea50699d9dfa2ab110174,"Despot, radical … peacemaker? The millennial p...",https://www.smh.com.au/world/middle-east/despo...,[world / middle east],[Sherryn Groch],,"Reformer, tyrant or both? Saudi Arabia’s crown...",2024-03-16 18:00:00,https://static.ffx.io/images/$zoom_0.5521%2C$m...,smh,https://www.smh.com.au,https://i.bytvi.com/domain_icons/smh.png,6729,[australia],[top],english
7,4e1306150ae954e492bb431769c773cc,"Despot, radical … peacemaker? The millennial p...",https://www.smh.com.au/world/middle-east/despo...,[world / middle east],[Sherryn Groch],,"Reformer, tyrant or both? Saudi Arabia’s crown...",2024-03-16 18:00:00,https://static.ffx.io/images/$zoom_0.5521%2C$m...,smh,https://www.smh.com.au,https://i.bytvi.com/domain_icons/smh.png,6729,[australia],[world],english
8,9fa1474f1a72bb6803905a42fb77891e,"Despot, radical … peacemaker? The millennial p...",https://www.watoday.com.au/world/middle-east/d...,[world / middle east],[Sherryn Groch],,"Reformer, tyrant or both? Saudi Arabia’s crown...",2024-03-16 18:00:00,https://static.ffx.io/images/$zoom_0.5521%2C$m...,watoday,https://www.watoday.com.au,https://i.bytvi.com/domain_icons/watoday.png,347743,[australia],[top],english
9,a03c0b3e132a512a2bd6fe38fc9a0ed5,"Despot, radical … peacemaker? The millennial p...",https://www.watoday.com.au/world/middle-east/d...,[world / middle east],[Sherryn Groch],,"Reformer, tyrant or both? Saudi Arabia’s crown...",2024-03-16 18:00:00,https://static.ffx.io/images/$zoom_0.5521%2C$m...,watoday,https://www.watoday.com.au,https://i.bytvi.com/domain_icons/watoday.png,347743,[australia],[world],english


In [176]:
x1 = ' '.join(df1['description'])

In [177]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mehervenkat/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [178]:
import string
def clean_sentence(sentence):
    cleaned_sentence = sentence.lower()
    cleaned_sentence = cleaned_sentence.translate(str.maketrans('', '', string.punctuation))
    cleaned_sentence = ' '.join(cleaned_sentence.split())
    stop_words = set(stopwords.words('english'))
    cleaned_sentence = ' '.join(word for word in cleaned_sentence.split() if word not in stop_words)
    return cleaned_sentence

In [179]:
x1 =clean_sentence(x1)

In [186]:
from collections import Counter
Counter(x1.split(' ')).most_common(1)

[('reformer', 6)]

In [184]:
x2 = ' '.join(df2['description'])
x2 =clean_sentence(x2)
Counter(x2.split(' ')).most_common(1)

[('investors', 7)]
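
Several rows of `df1` share the same description because the same article is syndicated across outlets. The count of 6 for "reformer" matches the six rows whose description begins "Reformer, tyrant or both?". Dropping duplicate descriptions before counting may give a fairer picture. The sketch below assumes a tiny stand-in stop-word set instead of NLTK's list, and `top_words` is a hypothetical helper.

```python
import string
from collections import Counter

# Small illustrative stop-word set; the notebook uses NLTK's English list.
STOP_WORDS = {"or", "both", "the", "a", "of", "for"}


def top_words(descriptions, n=1):
    """Most common non-stop words after dropping duplicate descriptions."""
    unique = list(dict.fromkeys(descriptions))  # keeps first-seen order
    strip_punct = str.maketrans("", "", string.punctuation)
    words = []
    for text in unique:
        cleaned = text.lower().translate(strip_punct)
        words.extend(w for w in cleaned.split() if w not in STOP_WORDS)
    return Counter(words).most_common(n)


# Hypothetical descriptions: the first one is repeated three times.
docs = ["Reformer, tyrant or both?"] * 3 + ["Top gospel acts for the conference"]
print(top_words(docs, n=3))
```

With the notebook's data, the equivalent step would be to call `df1['description'].drop_duplicates()` before joining the text.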

