Q-3. Imagine you have a dataset where you have different categories of data, Now
you need to find the most similar data to the given data by using any 4 different
similarity algorithms. Now you have to build a model which can find the most similar
data to the given data.

In [51]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances, manhattan_distances
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
from gensim.similarities import WmdSimilarity

import warnings
warnings.filterwarnings("ignore")

In [7]:
df = pd.read_json(r"C:\Users\Nirbhay\Downloads\News_Category_Dataset_v3.json",lines=True)

In [8]:
df.head()

Unnamed: 0,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22


In [9]:
df.shape

(209527, 6)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209527 entries, 0 to 209526
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   link               209527 non-null  object        
 1   headline           209527 non-null  object        
 2   category           209527 non-null  object        
 3   short_description  209527 non-null  object        
 4   authors            209527 non-null  object        
 5   date               209527 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(5)
memory usage: 9.6+ MB


In [11]:
df.isnull().sum()

link                 0
headline             0
category             0
short_description    0
authors              0
date                 0
dtype: int64

In [12]:
df.duplicated().sum()

13

In [13]:
df.drop_duplicates(inplace=True)

In [19]:
data = df.copy()

In [23]:
category_count = data['category'].value_counts()
print(f'There are {len(category_count)} categories of news')
print(category_count)

There are 42 categories of news
category
POLITICS          35601
WELLNESS          17942
ENTERTAINMENT     17362
TRAVEL             9900
STYLE & BEAUTY     9811
PARENTING          8791
HEALTHY LIVING     6694
QUEER VOICES       6347
FOOD & DRINK       6340
BUSINESS           5992
COMEDY             5400
SPORTS             5077
BLACK VOICES       4583
HOME & LIVING      4320
PARENTS            3955
THE WORLDPOST      3664
WEDDINGS           3653
WOMEN              3571
CRIME              3562
IMPACT             3484
DIVORCE            3426
WORLD NEWS         3299
MEDIA              2944
WEIRD NEWS         2777
GREEN              2622
WORLDPOST          2579
RELIGION           2577
STYLE              2254
SCIENCE            2206
TECH               2100
TASTE              2096
MONEY              1756
ARTS               1509
ENVIRONMENT        1443
FIFTY              1401
GOOD NEWS          1398
U.S. NEWS          1377
ARTS & CULTURE     1339
COLLEGE            1144
LATINO VOICES      1130

In [33]:
data.columns

Index(['link', 'headline', 'category', 'short_description', 'authors', 'date',
       'News'],
      dtype='object')

In [34]:
data = data[['category', 'headline', 'short_description']]

In [36]:
data['News'] = data['headline'] + ' ' + data['short_description']

In [37]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data['News'])

In [60]:
def find_similar_data(query, top_n=5):
    # Vectorize the query
    query_vector = vectorizer.transform([query])

    # Calculating similarities using different algorithms
    cosine_sim = cosine_similarity(X, query_vector).flatten()
    euclidean_sim = euclidean_distances(X, query_vector).flatten()
    manhattan_sim = manhattan_distances(X, query_vector).flatten()

    similarity_scores = (cosine_sim + euclidean_sim + manhattan_sim) / 3

    top_indices = similarity_scores.argsort()[-top_n:][::-1]

    similar_data = df.iloc[top_indices]

    return similar_data


In [61]:
given_data = "Over 4 Million Americans Roll Up Sleeves For O..."
similar_data = find_similar_data(given_data)
print(similar_data)

                                                     link   
109802  https://www.huffingtonpost.com/entry/weekend-r...  \
66816   https://www.huffingtonpost.com/entry/sunday-ro...   
72892   https://www.huffingtonpost.com/entry/sunday-ro...   
63109   https://www.huffingtonpost.com/entry/sunday-ro...   
107893  https://www.huffingtonpost.com/entry/sunday-ro...   

                                headline   category   
109802  Weekend Roundup: Laughing at God  WORLDPOST  \
66816                     Sunday Roundup   POLITICS   
72892                     Sunday Roundup   POLITICS   
63109                     Sunday Roundup   POLITICS   
107893                    Sunday Roundup   POLITICS   

                                        short_description   
109802  The first principle of an open society is not ...  \
66816   This week the nation watched as the #NeverTrum...   
72892   This week the GOP debate circus pulled into Mi...   
63109   This week, the nation was reminded, in ways bo... 