To get the kaggle.json file, generate an API token from your Kaggle account. The steps are:
Step 1: Sign in to Kaggle
1.	Go to the Kaggle website and log in to your account.
Step 2: Go to Account Settings
1.	Click on your profile picture in the top right corner.
2.	Select "Settings" from the dropdown menu (older versions of the site label this "My Account").
Step 3: Create an API Token
1.	Scroll down to the "API" section.
2.	Click the "Create New API Token" button. Kaggle generates a kaggle.json file and downloads it to your computer. The file contains your API credentials (username and key), so keep it private.
Step 4: Upload kaggle.json to Google Colab
Next, upload the file to your Google Colab environment:
1.	Open your Google Colab notebook.
2.	Run the following cell. It installs the Kaggle client, uploads kaggle.json, stores it in ~/.kaggle, and then downloads and loads the IMDb dataset:


In [None]:
# Install Kaggle package
!pip install kaggle

# Upload kaggle.json file
from google.colab import files
files.upload()

# Create a directory for the Kaggle API key and move the kaggle.json file there
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

# Set permissions for the key
!chmod 600 ~/.kaggle/kaggle.json

# List datasets to verify the setup (optional)
!kaggle datasets list

# Download the IMDb dataset
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

# Unzip the downloaded dataset
!unzip imdb-dataset-of-50k-movie-reviews.zip

# Load the CSV file into a DataFrame
import pandas as pd
df = pd.read_csv('IMDB Dataset.csv')  # Replace 'IMDB Dataset.csv' with the actual file name
df.head()




Saving kaggle.json to kaggle.json
ref                                                                   title                                              size  lastUpdated          downloadCount  voteCount  usabilityRating  
--------------------------------------------------------------------  ------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
shreyanshverma27/online-sales-dataset-popular-marketplace-data        Online Sales Dataset - Popular Marketplace Data     7KB  2024-05-25 23:55:26           5667        107  1.0              
kunjadiyarohit/flats-uncleaned-dataset                                Flats Uncleaned Dataset                           283KB  2024-06-07 12:26:37            565         23  1.0              
nuhmanpk/india-lok-sabha-election-results-2024                        Lok Sabha Election Results 2024 India              20KB  2024-06-05 05:49:01           1549         35  1.0              
hibrah

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
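
As an alternative to copying kaggle.json into ~/.kaggle, the Kaggle client can also read credentials from the KAGGLE_USERNAME and KAGGLE_KEY environment variables. The sketch below assumes the file holds the two fields described above; the helper name load_kaggle_credentials is hypothetical and not part of the Kaggle package.

```python
import json
import os

def load_kaggle_credentials(path="kaggle.json"):
    """Read a kaggle.json file and export its fields as environment variables."""
    with open(path, encoding="utf-8") as handle:
        credentials = json.load(handle)
    # kaggle.json is expected to hold the fields "username" and "key".
    missing = {"username", "key"} - credentials.keys()
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")
    os.environ["KAGGLE_USERNAME"] = credentials["username"]
    os.environ["KAGGLE_KEY"] = credentials["key"]
    # Return only the username so the key is never echoed in notebook output.
    return credentials["username"]
```

Calling load_kaggle_credentials() once before the `!kaggle` commands should be enough in a notebook session, since shell commands started from the notebook normally inherit its environment.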


In [None]:
import numpy as np
import pandas as pd

In [None]:
df.shape

(50000, 2)

In [None]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production The filming tech...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically theres a family where a little boy J...,negative
4,Petter Matteis Love in the Time of Money is a ...,positive


In [None]:
df['review'][1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [None]:
df['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

# Lowercase

In [None]:
df['review'] = df['review'].str.lower()

In [None]:
df.head()

Unnamed: 0,review,sentiment
0,four daughters begins as just another clone of...,positive
1,okay ill admit iti am a goofball and i occasio...,positive
2,scarecrow gone wild hes the death of the party...,negative
3,this schiffer guy is a real genius the movie i...,positive
4,jiøí trnka made his last animated short an ind...,positive


# Remove HTML Tags

In [None]:
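import re

# Non-greedy match for any HTML tag, such as <br />
HTML_TAG_PATTERN = re.compile(r'<.*?>')

def remove_html_tags(text):
    """Strip HTML tags from a review."""
    return HTML_TAG_PATTERN.sub('', text)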

In [None]:
df['review'] = df['review'].apply(remove_html_tags)

In [None]:
df['review']

0        four daughters begins as just another clone of...
1        okay ill admit iti am a goofball and i occasio...
2        scarecrow gone wild hes the death of the party...
3        this schiffer guy is a real genius the movie i...
4        jiøí trnka made his last animated short an ind...
                               ...                        
24995    okay the only reason i watched this movie is b...
24996    i never made any comment here on imdb but as i...
24997    the second beginning as its title explains sho...
24998    documentary starts in 1986 in nyc where black ...
24999    this is without a doubt the worst sequel i hav...
Name: review, Length: 25000, dtype: object
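
Deleting a tag outright can fuse the words on either side of it when the review has no space around the tag, which is one likely source of tokens such as 'effectswhile' later in this notebook. A minimal sketch of a variant that replaces each tag with a space instead; the function name remove_html_tags_spaced and the sample string are made up for illustration.

```python
import re

TAG_RE = re.compile(r'<.*?>')

def remove_html_tags_spaced(text):
    """Replace each HTML tag with a space, then collapse repeated whitespace."""
    without_tags = TAG_RE.sub(' ', text)
    return re.sub(r'\s+', ' ', without_tags).strip()

sample = 'cheap special effects.<br /><br />While watching'
print(TAG_RE.sub('', sample))           # tags deleted: 'effects.While'
print(remove_html_tags_spaced(sample))  # tags replaced: 'effects. While'
```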

# Remove URLs

In [None]:
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

In [None]:
df['review'] = df['review'].apply(remove_url)
df['review'].head(2)

0    four daughters begins as just another clone of...
1    okay ill admit iti am a goofball and i occasio...
Name: review, dtype: object
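
To check what the URL pattern actually matches, it helps to run it on a short string first. The sketch below repeats the same regular expression on an invented sample sentence; the URLs in it are placeholders.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def remove_url(text):
    """Delete http(s):// links and bare www. links from text."""
    return URL_PATTERN.sub('', text)

sample = 'see https://www.imdb.com/title/tt0000001/ or www.example.com for details'
print(remove_url(sample))
```

Because `\S+` runs to the next whitespace character, punctuation attached to the end of a link is removed along with it, and the spaces that surrounded the link stay behind.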

# Remove Punctuation

In [None]:
import string
# Function to preprocess text data by removing punctuation
def preprocess_text(text):
    # Convert bytes to string if necessary
    if isinstance(text, bytes):
        text = text.decode('utf-8')
    # Remove punctuation using regex
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)
    return text

In [None]:
df['review'] = df['review'].apply(preprocess_text)

In [None]:
df['review'].head(2)

0    four daughters begins as just another clone of...
1    okay ill admit iti am a goofball and i occasio...
Name: review, dtype: object

In [None]:
df['review'][5]

'while i am not a big fan of musicals i have loved the films of fred astaire and ginger rogers because they are just so much fun sure they can be a bit formulaic but even though you know what is going to happen they still are very pleasing to watch however despite this i was a bit disappointed in this outing part of it was because this film doesnt have the wonderful supporting cast like you saw in top hat or shall we dance without edward everett horton or eric blore the film seems to be a bit lackingespecially in the fun department the silly antics of these supporting actors gave the other films charm that you just dont get with follow the fleet in addition unlike the usual character played by astaire this one is more of a jerkas his fat head gets rogers into trouble again and again and as a result its a lot harder to like him or want to see them get together in the end of the film plus although the music is by irving berlin the songs just dont seem as memorable in fact none of the son
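
The same punctuation removal can also be written without a regular expression, using a translation table. This is an alternative sketch rather than the approach used in the cell above; the name strip_punctuation is hypothetical.

```python
import string

# Translation table that maps every ASCII punctuation character to None.
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Remove ASCII punctuation characters from text."""
    return text.translate(PUNCTUATION_TABLE)

print(strip_punctuation("it's well-made, isn't it?"))
```

Both versions rely on string.punctuation, which covers ASCII characters only, so typographic quotes and dashes pass through unchanged.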

# Remove Stopwords

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords

In [None]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [None]:
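# Build the stopword set once; calling stopwords.words('english') for every
# word rebuilds the list each time and makes the loop very slow.
english_stopwords = set(stopwords.words('english'))

def remove_stopwords(text):
    # Replace each stopword with an empty string and keep every other word.
    # The empty strings leave extra spaces behind when the words are re-joined.
    new_text = ['' if word in english_stopwords else word for word in text.split()]
    return " ".join(new_text)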

In [None]:
df['review'] = df['review'].apply(remove_stopwords)
df['review'][2]

'scarecrow gone wild hes  death   party need  say  scarecrow gone wild got four   ten stars    one simple reason aside   terrible acting plot holes cheap special effects  anticlimactic whistling   cinematic gold  think   movie could  actually  really good   scarecrow turned     baseball coach  portrayed   eversobrilliant ken shamrock     would    cut  awesome return   jedi electricity special effectswhile watching  movie  friends    convinced     fact written  one   friends  stereotypical teenaged boy  movie  topless women miserably fake gore  dialog  could   talked  way    paper bag    case  cornfieldif  could ask  filmmaker one thing  would    much     pay  teenager  wrote   '
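
The runs of spaces in the output above come from substituting an empty string for each stopword. A minimal sketch of a variant that drops the stopwords instead; DEMO_STOPWORDS is a small invented stand-in for stopwords.words('english') so the example runs without NLTK data.

```python
# Hypothetical stand-in for the NLTK English stopword list.
DEMO_STOPWORDS = {"i", "this", "is", "the", "a", "of", "and", "it"}

def drop_stopwords(text, stop_words=DEMO_STOPWORDS):
    """Return text with stopwords removed and single spaces between words."""
    kept = [word for word in text.split() if word not in stop_words]
    return " ".join(kept)

print(drop_stopwords("this is the death of the party"))
```

The NLTK list printed earlier includes 'no', 'nor' and 'not'. For sentiment analysis it may be worth keeping those words, since removing them can hide negation.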

# Tokenization

In [None]:
import spacy

# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

def tokenize_text(text):
    # Process the text with spaCy
    doc = nlp(text)

    # Extract tokens
    tokens = [token.text for token in doc]

    return tokens


In [None]:
df['review'] = df['review'].apply(tokenize_text)
print(df['review'][2])

['scarecrow', 'gone', 'wild', 'he', 's', ' ', 'death', '  ', 'party', 'need', ' ', 'say', ' ', 'scarecrow', 'gone', 'wild', 'got', 'four', '  ', 'ten', 'stars', '   ', 'one', 'simple', 'reason', 'aside', '  ', 'terrible', 'acting', 'plot', 'holes', 'cheap', 'special', 'effects', ' ', 'anticlimactic', 'whistling', '  ', 'cinematic', 'gold', ' ', 'think', '  ', 'movie', 'could', ' ', 'actually', ' ', 'really', 'good', '  ', 'scarecrow', 'turned', '    ', 'baseball', 'coach', ' ', 'portrayed', '  ', 'eversobrilliant', 'ken', 'shamrock', '    ', 'would', '   ', 'cut', ' ', 'awesome', 'return', '  ', 'jedi', 'electricity', 'special', 'effectswhile', 'watching', ' ', 'movie', ' ', 'friends', '   ', 'convinced', '    ', 'fact', 'written', ' ', 'one', '  ', 'friends', ' ', 'stereotypical', 'teenaged', 'boy', ' ', 'movie', ' ', 'topless', 'women', 'miserably', 'fake', 'gore', ' ', 'dialog', ' ', 'could', '  ', 'talked', ' ', 'way', '   ', 'paper', 'bag', '   ', 'case', ' ', 'cornfieldif', ' ', 
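
The token list above contains whitespace-only entries such as ' ' and '  ', left over from the extra spaces in the text. One way to discard them, sketched here on plain strings so it runs without spaCy (the helper name is hypothetical):

```python
def drop_whitespace_tokens(tokens):
    """Keep only tokens that contain at least one non-whitespace character."""
    return [token for token in tokens if token.strip()]

sample_tokens = ['scarecrow', 'gone', 'wild', 'he', 's', ' ', 'death', '  ', 'party']
print(drop_whitespace_tokens(sample_tokens))
```

When working with spaCy tokens directly, the `is_space` attribute of each token offers the same check.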

# Lemmatizer

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Ensure you have the necessary NLTK resources
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Initialize the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def lemmatize_tokens(tokens):
    lemmatized_tokens = [wordnet_lemmatizer.lemmatize(token, get_wordnet_pos(token)) for token in tokens]
    return lemmatized_tokens

# Apply lemmatization
df['lemmatized_review'] = df['review'].apply(lemmatize_tokens)

# If you want to convert the lemmatized tokens back into a single string for each review
df['lemmatized_review'] = df['lemmatized_review'].apply(lambda tokens: ' '.join(tokens))



[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Choosing Between Stemming and Lemmatization:

•	Speed vs. Accuracy: Stemming is generally faster and simpler but less accurate, while lemmatization is slower and more computationally intensive but provides better accuracy.

•	Application Requirements: For applications where precise word forms are essential (e.g., legal documents, medical texts), lemmatization is preferable. For applications that prioritize speed and can tolerate some inaccuracies, stemming might be sufficient.

•	Language Complexity: For languages with complex morphology, lemmatization offers more benefits by accurately capturing the base forms of words.
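
The trade-off described above can be illustrated with a toy comparison. The sketch below is deliberately simplified and is neither the Porter stemmer nor the WordNet lemmatizer: toy_stem strips a few common suffixes by rule, and toy_lemmatize looks words up in a tiny hand-written table. All names and table entries are invented for illustration.

```python
SUFFIXES = ("ing", "ies", "ed", "es", "s")

def toy_stem(word):
    """Rule-based stemming: chop the first matching suffix, keeping 3+ letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hand-written lookup table standing in for a real lexical database.
TOY_LEMMAS = {"studies": "study", "better": "good", "ran": "run", "watching": "watch"}

def toy_lemmatize(word):
    """Dictionary-based lemmatization: return the known base form, if any."""
    return TOY_LEMMAS.get(word, word)

for word in ["studies", "better", "ran", "watching"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The rule-based version is cheap but can produce non-words such as 'stud' and cannot relate 'better' to 'good'. The lookup version returns real base forms but only for words its dictionary knows.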


In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import joblib

In [None]:
# Label Encoding the sentiment column
label_encoder = LabelEncoder()
df['sentiment'] = label_encoder.fit_transform(df['sentiment'])
y = df['sentiment']
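
LabelEncoder assigns integer codes to the sorted unique labels, so with this dataset 'negative' becomes 0 and 'positive' becomes 1. A pure-Python sketch of that mapping, with a hypothetical helper name:

```python
def fit_label_mapping(labels):
    """Map each distinct label to its position in sorted order."""
    classes = sorted(set(labels))
    return {label: code for code, label in enumerate(classes)}

mapping = fit_label_mapping(["positive", "negative", "positive"])
print(mapping)  # {'negative': 0, 'positive': 1}
```

On the fitted encoder, `label_encoder.classes_` lists the labels in the same order as their codes.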

In [None]:
# Vectorization using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['lemmatized_review'])
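
To make the TF-IDF weighting concrete, the sketch below computes it by hand for a tiny invented corpus. It assumes the smoothed idf formula that scikit-learn documents for its default settings, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization. It splits on whitespace only, which is simpler than TfidfVectorizer's own tokenization.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

vectors = tfidf(["good movie", "bad movie", "good good plot"])
print(vectors[1])
```

In the second document, 'bad' appears in only one review and so receives a larger weight than 'movie', which appears in two.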

In [None]:
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

In [None]:
# Evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy}")

Model accuracy: 0.8798
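
Accuracy is the share of test reviews whose predicted label matches the true label. A single number hides which class the errors fall in, so it can help to count each (true, predicted) pair as well. A minimal sketch on invented labels, where 1 stands for positive and 0 for negative:

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred))
print(confusion_counts(y_true, y_pred))
```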


In [None]:
# Save the model, vectorizer, and label encoder
joblib.dump(model, 'logistic_regression_model.pkl')
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')
joblib.dump(label_encoder, 'label_encoder.pkl')

['label_encoder.pkl']

# Prediction Function

In [None]:
import spacy
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import joblib
import nltk

# Ensure necessary NLTK resources are downloaded
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

# Initialize the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def lemmatize_tokens(tokens):
    lemmatized_tokens = [wordnet_lemmatizer.lemmatize(token.text, get_wordnet_pos(token.text)) for token in tokens]
    return lemmatized_tokens

def load_model_and_vectorizer(model_path='logistic_regression_model.pkl', vectorizer_path='tfidf_vectorizer.pkl', encoder_path='label_encoder.pkl'):
    model = joblib.load(model_path)
    vectorizer = joblib.load(vectorizer_path)
    label_encoder = joblib.load(encoder_path)
    return model, vectorizer, label_encoder

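def predict_sentiment(review, model, vectorizer, label_encoder):
    # Apply the same cleaning steps used on the training data, in the same order.
    # These helper functions are defined in the preprocessing cells above.
    review = remove_html_tags(review.lower())
    review = remove_url(review)
    review = preprocess_text(review)
    review = remove_stopwords(review)

    # Tokenize and lemmatize the review using spaCy and NLTK
    doc = nlp(review)
    tokens = [token for token in doc]
    lemmatized_tokens = lemmatize_tokens(tokens)
    lemmatized_review = ' '.join(lemmatized_tokens)

    # Vectorize the lemmatized review
    review_vector = vectorizer.transform([lemmatized_review])

    # Predict sentiment
    prediction = model.predict(review_vector)
    sentiment = label_encoder.inverse_transform(prediction)
    return sentiment[0]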

# Load the saved model, vectorizer, and label encoder
model, vectorizer, label_encoder = load_model_and_vectorizer()




[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
# Example prediction
new_review = "I love this movie! It's fantastic."
sentiment = predict_sentiment(new_review, model, vectorizer, label_encoder)
print(f"Predicted sentiment: {sentiment}")

Predicted sentiment: positive


In [None]:
import sklearn
sklearn.__version__

'1.2.2'