
In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Training a Word2Vec Model on Sports & Outdoors Reviews Dataset

## Problem

Train a Word2Vec model using the Sports & Outdoors Reviews Dataset and perform the following tasks:

1. Find the words most similar to 'awful'.
2. Compute the similarity between the following word pairs:
   - ('good', 'great')
   - ('slow', 'steady')

# Reading and Exploring the Dataset

The dataset is a subset of Amazon reviews from the Sports & Outdoors category. It is distributed as a gzip-compressed JSON Lines file (one review object per line) and can be read with pandas once extracted.

**Link to the Dataset:** [Sports & Outdoors Reviews Dataset](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Sports_and_Outdoors_5.json.gz)
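
To make the file layout concrete, here is a minimal sketch of reading such a file record by record with only the standard library. The helper name `iter_reviews` and the two sample records are made up for illustration; the sketch writes a tiny stand-in file so it runs without downloading anything.

```python
import gzip
import json
import os
import tempfile
from itertools import islice


def iter_reviews(gz_path):
    """Yield one review dict per line of a gzipped JSON Lines file."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Tiny stand-in file with two made-up records, so the sketch runs offline.
sample_records = [
    {"reviewerID": "A1", "reviewText": "Great bottle", "overall": 5},
    {"reviewerID": "A2", "reviewText": "Too small", "overall": 2},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_reviews.json.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
        for record in sample_records:
            handle.write(json.dumps(record) + "\n")

    # Read back only the first two records without loading the whole file.
    first_reviews = list(islice(iter_reviews(sample_path), 2))

print([review["reviewText"] for review in first_reviews])
```

Streaming like this is handy for peeking at a large file; for the full analysis, loading everything into a DataFrame with pandas is more convenient.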


## Data Loading

In [2]:
import os
import gensim
import requests
import gzip
import shutil

# URL of the dataset
dataset_url = "http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Sports_and_Outdoors_5.json.gz"

# Name of the local files
gz_filename = 'reviews_Sports_and_Outdoors_5.json.gz'
json_filename = 'reviews_Sports_and_Outdoors_5.json'

# Step 1: Check if the JSON file already exists
if not os.path.isfile(json_filename):
    # Step 2: Download the dataset (fail early on an HTTP error)
    response = requests.get(dataset_url, stream=True)
    response.raise_for_status()
    with open(gz_filename, 'wb') as file:
        shutil.copyfileobj(response.raw, file)

    # Step 3: Extract the .gz file
    with gzip.open(gz_filename, 'rb') as f_in:
        with open(json_filename, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

    print("Dataset downloaded and extracted successfully.")
else:
    print("Dataset already exists. Skipping download and extraction.")

Dataset downloaded and extracted successfully.


In [3]:
df = pd.read_json("/kaggle/working/reviews_Sports_and_Outdoors_5.json", lines=True)
df.head()

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AIXZKN4ACSKI | 1881509818 | David Briner | [0, 0] | This came in on time and I am veru happy with ... | 5 | Woks very good | 1390694400 | 01 26, 2014 |
| 1 | A1L5P841VIO02V | 1881509818 | Jason A. Kramer | [1, 1] | I had a factory Glock tool that I was using fo... | 5 | Works as well as the factory tool | 1328140800 | 02 2, 2012 |
| 2 | AB2W04NI4OEAD | 1881509818 | J. Fernald | [2, 2] | If you don't have a 3/32 punch or would like t... | 4 | It's a punch, that's all. | 1330387200 | 02 28, 2012 |
| 3 | A148SVSWKTJKU6 | 1881509818 | Jusitn A. Watts "Maverick9614" | [0, 0] | This works no better than any 3/32 punch you w... | 4 | It's a punch with a Glock logo. | 1328400000 | 02 5, 2012 |
| 4 | AAAWJ6LW9WMOO | 1881509818 | Material Man | [0, 0] | I purchased this thinking maybe I need a speci... | 4 | Ok,tool does what a regular punch does. | 1366675200 | 04 23, 2013 |


In [4]:
df.shape

(296337, 9)

## Simple Preprocessing & Tokenization

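Before training, the review text needs to be cleaned and split into tokens. The cells below use `gensim.utils.simple_preprocess`, which lowercases and tokenizes the text and, with its default settings, discards tokens shorter than 2 or longer than 15 characters. For Natural Language Processing (NLP), basic cleaning typically involves:

- **Converting all words to lowercase**
- **Trimming spaces**
- **Removing punctuation**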

Preprocessing can be extended further (this notebook does not do so) by:

- **Removing stop words** such as 'and', 'or', 'is', 'the', 'a', 'an'
- **Converting words to their root forms**, e.g., changing 'running' to 'run'
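
The basic steps and stop-word removal can be sketched in plain Python as follows. This is a hand-rolled illustration, not how `simple_preprocess` is implemented; the function name `basic_preprocess` and the tiny stop-word list are assumptions, and conversion to root forms (stemming or lemmatization) is left out.

```python
import re

# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"and", "or", "is", "the", "a", "an"}


def basic_preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, drop punctuation, digits and spaces, then remove stop words."""
    # Keeping only runs of letters discards punctuation, digits and whitespace.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_words]


tokens = basic_preprocess("  The bottle is GREAT, and the price is a steal!  ")
print(tokens)  # ['bottle', 'great', 'price', 'steal']
```

For real work, a maintained stop-word list and a proper stemmer or lemmatizer from an NLP library would be a better choice than this toy version.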

In [5]:
review_text = df.reviewText.apply(gensim.utils.simple_preprocess)

In [6]:
review_text

0         [this, came, in, on, time, and, am, veru, happ...
1         [had, factory, glock, tool, that, was, using, ...
2         [if, you, don, have, punch, or, would, like, t...
3         [this, works, no, better, than, any, punch, yo...
4         [purchased, this, thinking, maybe, need, speci...
                                ...                        
296332    [this, is, water, bottle, done, right, it, is,...
296333    [if, you, re, looking, for, an, insulated, wat...
296334    [this, hydracentials, sporty, oz, double, insu...
296335    [as, usual, received, this, item, free, in, ex...
296336    [hydracentials, insulated, oz, water, bottle, ...
Name: reviewText, Length: 296337, dtype: object

In [7]:
review_text.loc[0]

['this',
 'came',
 'in',
 'on',
 'time',
 'and',
 'am',
 'veru',
 'happy',
 'with',
 'it',
 'haved',
 'used',
 'it',
 'already',
 'and',
 'it',
 'makes',
 'taking',
 'out',
 'the',
 'pins',
 'in',
 'my',
 'glock',
 'very',
 'easy']

In [8]:
df.reviewText.loc[0]

'This came in on time and I am veru happy with it, I haved used it already and it makes taking out the pins in my glock 32 very easy'

# Word2Vec Model

Train the Word2Vec model on the reviews with the following configuration:

- **Window Size**: 10 (i.e., up to 10 words before and 10 words after the current word are used as context)
- **Minimum Count**: Set `min_count=2` so that words appearing fewer than 2 times in the whole corpus are dropped from the vocabulary.
- **Workers**: The number of CPU threads used for training (4 here).
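
To illustrate what `min_count` and `window` mean, here is a small pure-Python sketch on a toy corpus. The helper names `filter_by_min_count` and `context_pairs` are hypothetical, and Gensim's actual training differs in details (for example, it may randomly shrink the effective window), so treat this only as an illustration of the two parameters.

```python
from collections import Counter


def filter_by_min_count(sentences, min_count):
    """Keep only tokens whose total corpus frequency is >= min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [
        [token for token in sentence if counts[token] >= min_count]
        for sentence in sentences
    ]


def context_pairs(sentence, window):
    """Return (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs


toy_corpus = [["good", "bottle", "good", "price"], ["bottle", "leaks"]]

# "price" and "leaks" occur only once, so min_count=2 removes them.
filtered = filter_by_min_count(toy_corpus, min_count=2)
print(filtered)  # [['good', 'bottle', 'good'], ['bottle']]

# With window=1, each word is paired only with its immediate neighbours.
pairs = context_pairs(filtered[0], window=1)
print(pairs)
```

Note that the filter works on word frequency across the whole corpus, not on sentence length: a sentence can end up with a single token and is still kept.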

## Initialize the model

In [9]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=2,
    workers=4,
)

## Build Vocabulary

In [10]:
model.build_vocab(review_text, progress_per=1000)

## Train the Word2Vec Model

In [11]:
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)

(91342349, 121496535)

# Finding Similar Words and Similarity Between Words

For more details, refer to the [Gensim Word2Vec documentation](https://radimrehurek.com/gensim/models/word2vec.html).


In [12]:
model.wv.most_similar("awful")

[('horrible', 0.7032758593559265),
 ('terrible', 0.6824818253517151),
 ('unpleasant', 0.6486302614212036),
 ('ugly', 0.624929666519165),
 ('horrendous', 0.6086167693138123),
 ('overpowering', 0.600160539150238),
 ('pungent', 0.5843437910079956),
 ('aweful', 0.5704948306083679),
 ('overwhelming', 0.5665915012359619),
 ('funny', 0.5591309666633606)]

In [13]:
model.wv.similarity(w1="good", w2="great")

0.78608596

In [14]:
model.wv.similarity(w1="slow", w2="steady")

0.39040118
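
The scores returned by `model.wv.similarity` are cosine similarities between the two word vectors. A minimal sketch of that computation on made-up three-dimensional vectors (real Word2Vec vectors have many more dimensions; the vectors and the function name here are purely illustrative):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a, so similarity is close to 1
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a, so similarity is 0

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(sim_ab, sim_ac)
```

Read this way, the 0.786 for ('good', 'great') means the two vectors point in fairly similar directions, while the 0.390 for ('slow', 'steady') indicates a much weaker association in this corpus.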