# Keyword Influence Analysis on Kaggle Dataset Upvotes
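This notebook analyzes how keywords in Kaggle dataset names relate to upvotes, by comparing the median upvotes of datasets whose names contain a keyword against the median across all datasets.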

[![youtube thumbnail](https://i.ytimg.com/vi/Uul4gA6XCg0/maxresdefault.jpg)](https://www.youtube.com/watch?v=Uul4gA6XCg0)
[Video tutorial on making this notebook](https://www.youtube.com/watch?v=Uul4gA6XCg0)

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/kaggle/input/kaggle-dataset/kaggle-preprocessed.csv')
df.head()

| Unnamed: 0.1 | Unnamed: 0 | Dataset_name | Author_name | Author_id | No_of_files | size | Type_of_file | Upvotes | Medals | Usability | Date | Day | Time | Dataset_link |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Hotel Reservations Dataset | Ahsan Raza | https://www.kaggle.com/ahsan81 | 1 | 491 kB | CSV | 315 | Silver | 10.0 | 1/04/2023 | Wed | 18:20:31 | https://www.kaggle.com/datasets/ahsan81/hotel-... |
| 1 | 2 | Most Subscribed 1000 Youtube Channels | Mrityunjay Pathak | https://www.kaggle.com/themrityunjaypathak | 1 | 29 kB | CSV | 76 | Bronze | 10.0 | 1/21/2023 | Sat | 20:12:05 | https://www.kaggle.com/datasets/themrityunjayp... |
| 2 | 3 | Olympics 124 years Dataset(till 2020) | Nitish Sharma01 | https://www.kaggle.com/nitishsharma01 | 3 | 5 MB | CSV | 30 | Bronze | 10.0 | 2/01/2023 | Wed | 14:30:49 | https://www.kaggle.com/datasets/nitishsharma01... |
| 3 | 4 | Medical Student Mental Health | The Devastator | https://www.kaggle.com/thedevastator | 2 | 19 kB | CSV | 37 | Bronze | 10.0 | 1/25/2023 | Wed | 06:30:14 | https://www.kaggle.com/datasets/thedevastator/... |
| 4 | 5 | Latest Netflix TV shows and movies | Senapati Rajesh | https://www.kaggle.com/senapatirajesh | 1 | 1 MB | CSV | 94 | Bronze | 9.4 | 1/14/2023 | Sat | 22:33:12 | https://www.kaggle.com/datasets/senapatirajesh... |


## Calculate Median Upvotes
Compute the median number of upvotes across all datasets. This value is the baseline that each keyword's median is compared against later.

In [2]:
# Calculate the median number of upvotes
median_upvotes = df['Upvotes'].median()
median_upvotes

23.0

## Extract Keywords
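Lowercase each dataset name and split it on whitespace to get individual keywords, then collect the upvotes of every dataset whose name contains each keyword. No further cleaning is applied, so punctuation stays attached to a token. Only keywords that occur at least 10 times are kept.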

In [3]:
# Extract keywords from dataset names
from collections import defaultdict

keyword_dict = defaultdict(list)

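# Pair each name with its upvotes; zip avoids the per-row overhead of iterrows()
for name, upvotes in zip(df['Dataset_name'], df['Upvotes']):
    # Lowercase the name and split on whitespace to get its keywords
    for keyword in name.lower().split():
        keyword_dict[keyword].append(upvotes)

# Keep only keywords that occur at least 10 times across dataset names
filtered_keywords = {k: v for k, v in keyword_dict.items() if len(v) >= 10}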
len(filtered_keywords)

530
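
The same keyword-to-upvotes mapping can also be built without an explicit Python loop. The following is a minimal sketch, not the notebook's own method: it runs on a small made-up frame (`toy`, with hypothetical values) rather than the Kaggle file, and uses `Series.str.split` with `DataFrame.explode`. It also shows one possible variant, `unique_upvotes`, that counts each dataset at most once per keyword, which the loop above does not do when a name repeats a word.

```python
import pandas as pd

# Toy stand-in for the notebook's DataFrame (hypothetical values).
toy = pd.DataFrame({
    "Dataset_name": ["Covid Data", "Covid covid Cases", "Brain MRI Data", None],
    "Upvotes": [10, 30, 50, 5],
})

# Lowercase, split on whitespace, and produce one row per token.
# Rows with a missing name are dropped first so .str methods see only strings.
tokens = (
    toy.dropna(subset=["Dataset_name"])
       .assign(keyword=lambda d: d["Dataset_name"].str.lower().str.split())
       .explode("keyword")
)

# Same behavior as the loop: a word repeated in one name is counted each time.
keyword_upvotes = tokens.groupby("keyword")["Upvotes"].apply(list).to_dict()

# Variant (an assumption, not in the notebook): count each dataset once per keyword.
deduped = tokens.reset_index().drop_duplicates(subset=["index", "keyword"])
unique_upvotes = deduped.groupby("keyword")["Upvotes"].apply(list).to_dict()

print(keyword_upvotes["covid"])
print(unique_upvotes["covid"])
```

Whether to deduplicate depends on the question being asked: the occurrence-based count matches the cell above, while the per-dataset count matches the phrase "datasets containing that keyword" more literally.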

## Keyword Analysis
For each remaining keyword, calculate the median number of upvotes for datasets containing that keyword, divide it by the overall median to get a ratio, and rank the keywords by that ratio.

In [4]:
# Calculate median upvotes for each keyword
keyword_median_upvotes = {k: pd.Series(v).median() for k, v in filtered_keywords.items()}

# Compute the ratio of keyword median upvotes to overall median upvotes
keyword_ratios = {k: v / median_upvotes for k, v in keyword_median_upvotes.items()}

# Rank keywords by ratio in descending order
sorted_keywords = sorted(keyword_ratios.items(), key=lambda item: item[1], reverse=True)

# Display top 10 keywords
top_10_keywords = sorted_keywords[:10]
top_10_keywords

[('coronavirus', 4.869565217391305),
 ('star', 4.043478260869565),
 ('suicide', 4.0),
 ('2016', 3.9130434782608696),
 ('brain', 3.869565217391304),
 ('shootings', 3.739130434782609),
 ('mnist', 3.608695652173913),
 ('legends', 3.5652173913043477),
 ('expression', 3.0652173913043477),
 ('women', 3.0434782608695654)]
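
A ratio on its own hides how many datasets it is based on, so it can help to report the count and the raw median next to it. The sketch below is one way to do that under stated assumptions: `summarize_keywords` is a hypothetical helper, not part of the notebook, and the input is made-up toy data rather than `filtered_keywords`.

```python
from statistics import median

def summarize_keywords(keyword_upvotes, baseline, min_count=10):
    """Return (keyword, count, median, ratio) rows sorted by ratio, highest first."""
    rows = []
    for keyword, upvotes in keyword_upvotes.items():
        if len(upvotes) < min_count:
            continue  # skip keywords with fewer than min_count observations
        kw_median = median(upvotes)
        rows.append((keyword, len(upvotes), kw_median, kw_median / baseline))
    return sorted(rows, key=lambda row: row[3], reverse=True)

# Hypothetical toy input; min_count is lowered so the short lists pass the filter.
toy_upvotes = {"alpha": [10, 40, 70], "beta": [5, 20, 35], "gamma": [100]}
summary = summarize_keywords(toy_upvotes, baseline=20, min_count=3)

for keyword, count, kw_median, ratio in summary:
    print(f"{keyword:<8} n={count:<3} median={kw_median:<6} ratio={ratio:.2f}")
```

In the notebook this could be called as `summarize_keywords(filtered_keywords, median_upvotes)`. Multiplying a ratio by the overall median of 23.0 recovers the keyword's median, so the ratio of about 4.87 for `coronavirus` corresponds to a median of 112 upvotes.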