2. Program on word vectors and cosine similarity


Cosine similarity is a metric used to measure how similar two vectors are in a multi-dimensional space. It is particularly popular in natural language processing and information retrieval for comparing the similarity of documents or words based on their vector representations.
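As a minimal sketch of the definition (the dot product of two vectors divided by the product of their lengths), the helper below computes cosine similarity by hand. The function name `cosine_sim` and the toy vectors are illustrative assumptions, not part of the program that follows.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]    # same direction as u, twice the length
w = [-3.0, 0.0, 1.0]   # orthogonal to u (dot product is 0)

print(round(cosine_sim(u, v), 3))  # 1.0: same direction
print(round(cosine_sim(u, w), 3))  # 0.0: orthogonal
```

The result depends only on direction, not on length, which is why it is convenient for comparing vectors of different magnitudes.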


In [7]:
import pandas as pd
import spacy #natural language processing (NLP) library in Python.
from sklearn.metrics.pairwise import cosine_similarity

# Load spaCy's small English pipeline, "en_core_web_sm".
nlp=spacy.load('en_core_web_sm')
# Note: this small pipeline does not ship static word vectors; .vector falls back to
# context-sensitive tensors (96 dimensions here), so the similarities are only approximate.

terms=['I','like','apples','oranges','pears']

# Run each term through the pipeline and take the .vector of the resulting Doc;
# .tolist() converts that NumPy array to a plain Python list.
vectors=[
    nlp(term).vector.tolist() for term in terms
]

# the word vector for 'apples' is extracted from the vectors list using terms.index('apples') 
# to find the index of 'apples' in the terms list. 
x=pd.Series(vectors[terms.index('apples')]).rename('apples')
print("word vector for apples:\n ",x)#prints the Pandas Series x, which represents the word vector for 'apples'.

# a Pandas DataFrame named abc is created
abc=pd.DataFrame(
    cosine_similarity(vectors),
    index=terms,
    columns=terms
).round(3)

#prints the cosine similarity matrix stored in the abc DataFrame. 
# The matrix shows how similar each term is to every other term in the terms list based on their word embeddings.
print("\ncosine similarity matrix :\n",abc)



word vector for apples:
  0    -1.244198
1     0.849777
2    -0.847986
3     1.536878
4     1.451894
        ...   
91    0.708207
92    1.898031
93   -0.110315
94   -0.203278
95    0.668344
Name: apples, Length: 96, dtype: float64

cosine similarity matrix :
              I   like  apples  oranges  pears
I        1.000  0.163   0.376    0.168 -0.088
like     0.163  1.000   0.073   -0.006  0.217
apples   0.376  0.073   1.000    0.750  0.289
oranges  0.168 -0.006   0.750    1.000  0.584
pears   -0.088  0.217   0.289    0.584  1.000


The first entries of the vector for "apples" are approximately -1.244198 (index 0) and 0.849777 (index 1); the full vector has 96 components. Together, these values give the position of "apples" in a 96-dimensional space. Individual dimensions are not directly interpretable; meaning shows up in how close vectors lie to one another, which is why "apples" scores highest against "oranges" (0.750) in the matrix above. Representing words this way lets algorithms work with word meaning in numerical form.
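One way to read the similarity matrix is to look up, for each term, the other term with the highest score. The sketch below does that with the values copied from the printed matrix; the helper name `nearest` is an illustrative assumption.

```python
terms = ["I", "like", "apples", "oranges", "pears"]

# Values copied from the cosine similarity matrix printed above.
sim = [
    [1.000, 0.163, 0.376, 0.168, -0.088],
    [0.163, 1.000, 0.073, -0.006, 0.217],
    [0.376, 0.073, 1.000, 0.750, 0.289],
    [0.168, -0.006, 0.750, 1.000, 0.584],
    [-0.088, 0.217, 0.289, 0.584, 1.000],
]

def nearest(term):
    """Return the most similar *other* term according to the matrix."""
    i = terms.index(term)
    candidates = [(sim[i][j], terms[j]) for j in range(len(terms)) if j != i]
    return max(candidates)[1]

for t in terms:
    print(t, "->", nearest(t))
```

With these values the three fruit names point at each other ("apples" and "oranges" are mutual nearest neighbours, and "pears" is closest to "oranges").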

3. Program to download a dataset from a URL

Downloading and reading sentiment analysis data from UCI's Machine Learning Repository.

In [31]:
#importing libs
import pandas as pd
import os
import requests
import zipfile
from io import BytesIO

# Create a 'data' directory in the current working directory if it does not already exist.
data_dir=f'{os.getcwd()}/data'
os.makedirs(data_dir, exist_ok=True)

url ="https://archive.ics.uci.edu/static/public/331/sentiment+labelled+sentences.zip"
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed

#extracts the contents of the downloaded zip file
#BytesIO(response.content) converts the binary content into an in-memory binary stream using BytesIO.
with zipfile.ZipFile(file=BytesIO(response.content),mode='r') as compressed_file:
    compressed_file.extractall(data_dir)


The next cell assumes that the three tab-separated text files were extracted into a subdirectory named 'sentiment labelled sentences' inside the previously created data_dir.
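A quick way to check that kind of layout assumption is to list the archive members before extracting. The sketch below builds a tiny stand-in archive in memory so that it runs without a download; the member names and contents are hypothetical and only mimic the expected layout.

```python
import zipfile
from io import BytesIO

# Build a small stand-in archive in memory (hypothetical contents).
buffer = BytesIO()
with zipfile.ZipFile(buffer, mode="w") as zf:
    zf.writestr("sentiment labelled sentences/imdb_labelled.txt", "A fine film.\t1\n")
    zf.writestr("sentiment labelled sentences/readme.txt", "placeholder\n")

# Re-open it for reading, the same way a downloaded archive would be opened.
buffer.seek(0)
with zipfile.ZipFile(buffer, mode="r") as zf:
    members = zf.namelist()

# Keep only the labelled data files.
txt_files = [m for m in members if m.endswith("_labelled.txt")]
print(txt_files)
```

For the real download, `BytesIO(response.content)` would take the place of `buffer`, and `namelist()` would show the actual folder name inside the zip.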

In [32]:
# initialized an empty list called df_list, 
# will be used to store DataFrames read from individual CSV files.
df_list = []
for csv_file in ['imdb_labelled.txt','yelp_labelled.txt','amazon_cells_labelled.txt']:
    # constructs the full file path for each CSV file using f'{data_dir}/sentiment labelled sentences/{csv_file}'
    csv_file_with_path = f'{data_dir}/sentiment labelled sentences/{csv_file}'
    temp_df = pd.read_csv(
                            csv_file_with_path,
                            sep="\t",#tab-separated 
                            header=0,#treats the first row as a header; with names= given, that row is replaced, not kept as data (use header=None if the files have no header row).
                            names=['text', 'sentiment'] #provides column names for the DataFrame.
    )
    df_list.append(temp_df)# resulting DataFrame (temp_df) for each CSV file is appended to the df_list.

df = pd.concat(df_list)
pd.options.display.max_colwidth = 90
df[['text', 'sentiment']].sample(10, random_state=42)


     text                                                                                       sentiment
471  This is a stunning movie.                                                                  1
278  I had the mac salad and it was pretty bland so I will not be getting that again.           0
20   The food, amazing.                                                                         1
150  Audio Quality is poor, very poor.                                                          0
430  His acting alongside Olivia De Havilland was brilliant and the ending was fantastic!       1
39   The shrimp tender and moist.                                                               1
340  Was not happy.                                                                             0
84   The headsets are easy to use and everyone loves them.                                      1
599  It is wonderful and inspiring to watch, and I hope that it gets released again on to v...  1
984  The problem I have is that they charge $11.99 for a sandwich that is no bigger than a ...  0


4. Program on Naïve Bayes Classifier

The Naïve Bayes classifier is a probabilistic machine learning algorithm used for classification tasks, particularly text classification in natural language processing (NLP). It is based on Bayes' theorem and is called "naïve" because it assumes that the features used to describe the data are conditionally independent of each other given the class, an assumption that often does not hold in real-world data.
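To make the independence assumption concrete, the sketch below scores each class as prior times the product of per-word likelihoods and then normalises. All probabilities, class names, and the helper `posterior` are hypothetical values chosen only for illustration.

```python
import math

# Hypothetical numbers, chosen only for illustration.
prior = {"pos": 0.5, "neg": 0.5}
likelihood = {
    "pos": {"great": 0.20, "boring": 0.02},
    "neg": {"great": 0.05, "boring": 0.15},
}

def posterior(words):
    """P(class | words), treating words as independent given the class."""
    scores = {
        c: prior[c] * math.prod(likelihood[c][w] for w in words)
        for c in prior
    }
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

print({c: round(p, 3) for c, p in posterior(["great"]).items()})
# {'pos': 0.8, 'neg': 0.2}
print({c: round(p, 3) for c, p in posterior(["great", "boring"]).items()})
# {'pos': 0.348, 'neg': 0.652}
```

Each word contributes its own factor regardless of which other words are present; that is the simplification the name refers to.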

In [33]:
from sklearn.model_selection import train_test_split
df_train,df_test=train_test_split(df,test_size=0.4,random_state=42)

y_train=df_train['sentiment']
y_test=df_test['sentiment']


Feature extraction: CountVectorizer converts the text data into numerical features (n-gram counts).

The ngram_range parameter is a tuple (min_n, max_n); all values of n such that min_n <= n <= max_n are used.
For example, an ngram_range of (1, 1) means only unigrams,
(1, 2) means unigrams and bigrams, and (2, 2) means only bigrams (two-word phrases).
It only applies if analyzer is not callable.

Here ngram_range=(1,3), so unigrams (single words), bigrams (two-word sequences) and trigrams (three-word sequences) are all considered as features.

The min_df parameter is set to 3, so an n-gram must appear in at least 3 documents to be included as a feature.
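The effect of ngram_range and min_df can be sketched in plain Python. This is a simplified stand-in, not CountVectorizer's implementation: it assumes whitespace tokenisation, and the helper `ngrams` and the toy documents are illustrative.

```python
from collections import Counter

def ngrams(text, min_n, max_n):
    """All word n-grams of text with min_n <= n <= max_n."""
    tokens = text.lower().split()  # simplification: split on whitespace
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

docs = ["the food was great", "the food was bad", "the service was great"]

# Document frequency: in how many documents does each n-gram occur?
doc_freq = Counter(g for d in docs for g in set(ngrams(d, 1, 3)))

# Keep only n-grams that occur in at least min_df documents.
min_df = 2
vocab = sorted(g for g, count in doc_freq.items() if count >= min_df)

print(ngrams("the food was great", 1, 3))  # 4 unigrams, 3 bigrams, 2 trigrams
print(vocab)
```

With these three toy documents and min_df = 2, rare n-grams such as "bad" or "the service was" are dropped, while "the food was" survives because it occurs in two documents.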

In [34]:
from sklearn.feature_extraction.text import CountVectorizer

vec=CountVectorizer(ngram_range=(1,3),min_df=3,strip_accents='ascii')
# x_train and x_test are sparse document-term matrices of n-gram counts
x_train=vec.fit_transform(df_train['text'])
x_test=vec.transform(df_test['text'])

To train a Naïve Bayes classifier, you need labeled data where each data point is associated with a known class label. The classifier calculates probabilities during training and uses them during inference to assign a class label to new, unseen data.
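The training and inference steps can be sketched from scratch on a toy labelled set. This follows the general idea of multinomial Naïve Bayes with additive smoothing; it is not scikit-learn's code, and the data and the helper names `log_score` and `predict` are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy labelled data (hypothetical): 1 = positive, 0 = negative.
train = [
    ("great food", 1),
    ("great service", 1),
    ("bad food", 0),
    ("bad slow service", 0),
]

# Training: count classes and per-class word occurrences.
class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def log_score(text, label, alpha=1.0):
    """log P(label) + sum of log P(word | label), with add-alpha smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(train))
    for w in text.split():
        if w in vocab:  # words never seen in training are ignored
            score += math.log(
                (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
            )
    return score

def predict(text):
    """Inference: pick the class with the highest log score."""
    return max(class_counts, key=lambda label: log_score(text, label))

print(predict("great food"))    # 1
print(predict("slow service"))  # 0
```

Working in log space avoids multiplying many small probabilities, and the smoothing term keeps a word that was never seen with a class from forcing that class's probability to zero.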

In [35]:
from sklearn.naive_bayes import MultinomialNB
clf=MultinomialNB(fit_prior=True)
clf.fit(x_train,y_train)


In [36]:
y_test_pred=clf.predict(x_test)

In [37]:
from sklearn.metrics import classification_report,precision_recall_fscore_support
print(classification_report(y_test, y_test_pred)) #precision    recall  f1-score   support

              precision    recall  f1-score   support

           0       0.81      0.78      0.79       565
           1       0.77      0.80      0.79       533

    accuracy                           0.79      1098
   macro avg       0.79      0.79      0.79      1098
weighted avg       0.79      0.79      0.79      1098



In [38]:
p,r,f,s= precision_recall_fscore_support(y_test,y_test_pred)
print("precision: ",p)
print("recall: ", r)
print("fscore: ",f)
print("support: ",s)


precision:  [0.80586081 0.77355072]
recall:  [0.77876106 0.8011257 ]
fscore:  [0.79207921 0.78709677]
support:  [565 533]
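The per-class numbers follow directly from the counts of correct and incorrect predictions. The confusion counts below are not printed by the notebook; they are inferred from the reported precision, recall and support, so treat them as an assumption. The helper `prf` is illustrative.

```python
# Confusion counts inferred from the reported metrics (an assumption):
# outer key = true class, inner key = predicted class.
confusion = {
    0: {0: 440, 1: 125},
    1: {0: 106, 1: 427},
}

def prf(cls):
    """Precision, recall, F1 and support for one class."""
    tp = confusion[cls][cls]
    predicted = sum(confusion[t][cls] for t in confusion)  # column total
    actual = sum(confusion[cls].values())                   # row total = support
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, actual

for cls in (0, 1):
    p, r, f, s = prf(cls)
    print(cls, round(p, 4), round(r, 4), round(f, 4), s)
# 0 0.8059 0.7788 0.7921 565
# 1 0.7736 0.8011 0.7871 533
```

Precision divides the correct predictions for a class by everything predicted as that class, recall divides them by everything that truly belongs to it, and the F-score is the harmonic mean of the two.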
