## Building an NLP Sentiment Analysis Pipeline in Python
References:

1. https://www.linkedin.com/advice/0/how-do-you-design-implement-nlp-pipelines#:~:text=An%20NLP%20pipeline%20is%20a,entity%20recognition%2C%20or%20sentiment%20analysis.
2. https://www.geeksforgeeks.org/natural-language-processing-nlp-pipeline/
#### 1. Data Acquisition
#### 2. Text Cleaning
- Unicode normalisation: symbols, emojis, special characters
- Regex: pattern-based removal of email addresses, phone numbers, URLs
- Spelling correction: for web-scraped data, build a corpus or dictionary of misspelled words
#### 3. Text Preprocessing
Break the text down into its smallest units (words):
- Tokenization
- Lowercasing
- Stop-word removal
- Stemming/Lemmatization
- POS tagging: assign a part of speech to each word in the text (used in NER, sentiment analysis, and machine translation)
#### 4. Feature Engineering
Text vectorization/representation
- Classical approach:
    - One-hot encoding
    - Bag of words
    - Bag of n-grams
    - TF-IDF
- Neural approach (word embeddings), which captures contextual meaning:
    - Continuous bag of words (CBOW)
    - Skip-gram
    - Pre-trained word embeddings trained on a large corpus, loaded via Gensim or Hugging Face: Word2Vec (Google), GloVe (Stanford)
#### 5. Building Model

#### 6. Evaluation
### Data Acquisition - https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset

### Import necessary Libraries

In [1]:
import pandas as pd
import re
import nltk
import numpy as np

#### Load dataset

In [2]:
data_reference  = pd.read_csv(r"D:\Malathi\SEM_6\NLP\archive\train.csv",encoding='latin1')

In [3]:
data_reference.head(5)

Unnamed: 0,textID,text,selected_text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²)
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral,morning,0-20,Afghanistan,38928346,652860.0,60
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,noon,21-30,Albania,2877797,27400.0,105
2,088c60f138,my boss is bullying me...,bullying me,negative,night,31-45,Algeria,43851044,2381740.0,18
3,9642c003ef,what interview! leave me alone,leave me alone,negative,morning,46-60,Andorra,77265,470.0,164
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,noon,60-70,Angola,32866272,1246700.0,26


In [4]:
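def load_dataset(file):
    data = pd.read_csv(file, encoding='latin1')
    data.drop_duplicates(inplace=True)  # drop exact duplicate rows
    data.dropna(inplace=True)  # drop rows with any missing value
    selected_columns = ['text', 'sentiment']
    # keep only the tweet text and its label; copy() so columns can be added later without warnings
    data = data[selected_columns].copy()
    return data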
    
train_data = load_dataset(r"D:\Malathi\SEM_6\NLP\archive\train.csv")
test_data =load_dataset(r"D:\Malathi\SEM_6\NLP\archive\test.csv")

In [5]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 27480 entries, 0 to 27480
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       27480 non-null  object
 1   sentiment  27480 non-null  object
dtypes: object(2)
memory usage: 644.1+ KB


In [6]:
type(train_data)

pandas.core.frame.DataFrame

In [7]:
train_data

Unnamed: 0,text,sentiment
0,"I`d have responded, if I were going",neutral
1,Sooo SAD I will miss you here in San Diego!!!,negative
2,my boss is bullying me...,negative
3,what interview! leave me alone,negative
4,"Sons of ****, why couldn`t they put them on t...",negative
...,...,...
27476,wish we could come see u on Denver husband l...,negative
27477,I`ve wondered about rake to. The client has ...,negative
27478,Yay good for both of you. Enjoy the break - y...,positive
27479,But it was worth it ****.,positive


In [8]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3534 entries, 0 to 3533
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       3534 non-null   object
 1   sentiment  3534 non-null   object
dtypes: object(2)
memory usage: 82.8+ KB


In [9]:
data_reference = data_reference.iloc[:,:4]

In [10]:
data_reference

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative
...,...,...,...,...
27476,4eac33d1c0,wish we could come see u on Denver husband l...,d lost,negative
27477,4f4c4fc327,I`ve wondered about rake to. The client has ...,", don`t force",negative
27478,f67aae2310,Yay good for both of you. Enjoy the break - y...,Yay good for both of you.,positive
27479,ed167662a5,But it was worth it ****.,But it was worth it ****.,positive


In [11]:
data_reference = data_reference.drop(['textID','selected_text'],axis=1)

In [12]:
data_reference

Unnamed: 0,text,sentiment
0,"I`d have responded, if I were going",neutral
1,Sooo SAD I will miss you here in San Diego!!!,negative
2,my boss is bullying me...,negative
3,what interview! leave me alone,negative
4,"Sons of ****, why couldn`t they put them on t...",negative
...,...,...
27476,wish we could come see u on Denver husband l...,negative
27477,I`ve wondered about rake to. The client has ...,negative
27478,Yay good for both of you. Enjoy the break - y...,positive
27479,But it was worth it ****.,positive


##### Label encoding of the output column (negative, neutral, positive)

In [13]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train = le.fit_transform(train_data['sentiment'])
y_test = le.transform(test_data['sentiment'])

In [14]:
y_train.shape

(27480,)

In [15]:
y_test

array([1, 2, 0, ..., 0, 2, 2])
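
The integer codes above are easier to read next to the labels they stand for. The sketch below fits a `LabelEncoder` on a small hand-written list (an assumption standing in for `train_data['sentiment']`) and reads the mapping from `classes_`. `LabelEncoder` stores its classes in sorted order, so the codes follow the alphabetical order of the label strings.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for train_data['sentiment'] (illustrative only)
labels = ["neutral", "negative", "positive", "negative"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# classes_ is sorted, so the position of each label is its integer code
label_to_code = {str(label): code for code, label in enumerate(le_demo.classes_)}
print(label_to_code)
print(codes.tolist())
```

With these three labels, 0 stands for negative, 1 for neutral and 2 for positive, which is how the class rows in the classification reports further down can be read.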

### Text Cleaning 

In [16]:
def preprocessing_1(data: str) -> str:
    data = data.strip()  # remove leading and trailing whitespace
    data = data.lower()  # convert to lower case
    data = re.sub(r"https?://\S+|www\.\S+", "", data)  # remove URLs
    data = re.sub(r"@\w+", "", data)  # remove @usernames
    data = re.sub(r"#\w+", "", data)  # remove hashtags
    data = re.sub(r"([a-zA-Z])\1{2,}", r"\1", data)  # collapse 3+ repeated letters into one
    data = re.sub(r"[^a-zA-Z\s]", "", data)  # remove everything except letters and whitespace
    return data

train_data['preprocess_1']=train_data['text'].apply(preprocessing_1)
test_data['preprocess_1']=test_data['text'].apply(preprocessing_1)

In [17]:
train_data['preprocess_1']=train_data['text'].apply(preprocessing_1)

In [18]:
train_data

Unnamed: 0,text,sentiment,preprocess_1
0,"I`d have responded, if I were going",neutral,id have responded if i were going
1,Sooo SAD I will miss you here in San Diego!!!,negative,so sad i will miss you here in san diego
2,my boss is bullying me...,negative,my boss is bullying me
3,what interview! leave me alone,negative,what interview leave me alone
4,"Sons of ****, why couldn`t they put them on t...",negative,sons of why couldnt they put them on the rele...
...,...,...,...
27476,wish we could come see u on Denver husband l...,negative,wish we could come see u on denver husband lo...
27477,I`ve wondered about rake to. The client has ...,negative,ive wondered about rake to the client has mad...
27478,Yay good for both of you. Enjoy the break - y...,positive,yay good for both of you enjoy the break you ...
27479,But it was worth it ****.,positive,but it was worth it


In [19]:
test_data

Unnamed: 0,text,sentiment,preprocess_1
0,Last session of the day http://twitpic.com/67ezh,neutral,last session of the day
1,Shanghai is also really exciting (precisely -...,positive,shanghai is also really exciting precisely sk...
2,"Recession hit Veronique Branquinho, she has to...",negative,recession hit veronique branquinho she has to ...
3,happy bday!,positive,happy bday
4,http://twitpic.com/4w75p - I like it!!,positive,i like it
...,...,...,...
3529,"its at 3 am, im very tired but i can`t sleep ...",negative,its at am im very tired but i cant sleep but...
3530,All alone in this old house again. Thanks for...,positive,all alone in this old house again thanks for ...
3531,I know what you mean. My little dog is sinkin...,negative,i know what you mean my little dog is sinking ...
3532,_sutra what is your next youtube video gonna b...,positive,sutra what is your next youtube video gonna be...


In [20]:
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
#nltk.download('wordnet')  # required by WordNetLemmatizer

In [21]:
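def preprocessing_2(data: str):
    data = nltk.word_tokenize(data)  # split the cleaned text into word tokens
    def get_pos(word):
        # pos_tag returns full Penn Treebank tags (e.g. "NN", "VBD")
        tag = nltk.pos_tag([word])[0][1].upper()
        tag_dict = {"N":"n","V":"v","R":"r","J":"a"}
        # NOTE: full tags never equal these one-letter keys, so this falls back to "n" (noun)
        return tag_dict.get(tag,"n")
    lemma = nltk.stem.WordNetLemmatizer()
    # lemmatize each token with the WordNet POS chosen above
    data = [lemma.lemmatize(word,pos=get_pos(word))for word in data]
    return data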
    
train_data['preprocess_2']=train_data["preprocess_1"].apply(preprocessing_2)


In [22]:
test_data['preprocess_2']=test_data["preprocess_1"].apply(preprocessing_2)

In [23]:
train_data

Unnamed: 0,text,sentiment,preprocess_1,preprocess_2
0,"I`d have responded, if I were going",neutral,id have responded if i were going,"[id, have, responded, if, i, were, going]"
1,Sooo SAD I will miss you here in San Diego!!!,negative,so sad i will miss you here in san diego,"[so, sad, i, will, miss, you, here, in, san, d..."
2,my boss is bullying me...,negative,my boss is bullying me,"[my, bos, is, bullying, me]"
3,what interview! leave me alone,negative,what interview leave me alone,"[what, interview, leave, me, alone]"
4,"Sons of ****, why couldn`t they put them on t...",negative,sons of why couldnt they put them on the rele...,"[son, of, why, couldnt, they, put, them, on, t..."
...,...,...,...,...
27476,wish we could come see u on Denver husband l...,negative,wish we could come see u on denver husband lo...,"[wish, we, could, come, see, u, on, denver, hu..."
27477,I`ve wondered about rake to. The client has ...,negative,ive wondered about rake to the client has mad...,"[ive, wondered, about, rake, to, the, client, ..."
27478,Yay good for both of you. Enjoy the break - y...,positive,yay good for both of you enjoy the break you ...,"[yay, good, for, both, of, you, enjoy, the, br..."
27479,But it was worth it ****.,positive,but it was worth it,"[but, it, wa, worth, it]"
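
Lemmas such as `bos` (from "boss") and `wa` (from "was") in the output above suggest that every token is being lemmatized as a noun. `nltk.pos_tag` returns full Penn Treebank tags such as `NN` or `VBD`, and those never equal the one-letter keys of `tag_dict`, so the lookup always falls back to `"n"`. A minimal sketch of one possible fix is to match on the first letter of the tag. The names `penn_to_wordnet` and `lemmatize_tokens` are hypothetical, and tagging the whole token list at once instead of word by word is an additional assumption about what gives the tagger useful context.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    tag_dict = {"N": "n", "V": "v", "R": "r", "J": "a"}
    # Only the first letter of the tag identifies the coarse word class
    return tag_dict.get(tag[:1].upper(), "n")

def lemmatize_tokens(tokens):
    """Lemmatize tokens using POS tags computed over the whole token list.

    Requires the NLTK 'averaged_perceptron_tagger' and 'wordnet' resources.
    """
    import nltk  # imported here so the mapping above can be used on its own
    lemmatizer = nltk.stem.WordNetLemmatizer()
    tagged = nltk.pos_tag(tokens)
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)) for word, tag in tagged]

print(penn_to_wordnet("VBD"), penn_to_wordnet("NN"), penn_to_wordnet("JJ"))
```

Switching to a lemmatizer like this changes the tokens, so the vocabulary size and every score reported later would have to be recomputed.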


In [24]:
test_data.head(5)

Unnamed: 0,text,sentiment,preprocess_1,preprocess_2
0,Last session of the day http://twitpic.com/67ezh,neutral,last session of the day,"[last, session, of, the, day]"
1,Shanghai is also really exciting (precisely -...,positive,shanghai is also really exciting precisely sk...,"[shanghai, is, also, really, exciting, precise..."
2,"Recession hit Veronique Branquinho, she has to...",negative,recession hit veronique branquinho she has to ...,"[recession, hit, veronique, branquinho, she, h..."
3,happy bday!,positive,happy bday,"[happy, bday]"
4,http://twitpic.com/4w75p - I like it!!,positive,i like it,"[i, like, it]"


In [25]:
train_data["documents"] = train_data["preprocess_2"].apply(lambda x : " ".join(x))
test_data["documents"] = test_data["preprocess_2"].apply(lambda x : " ".join(x))

In [29]:
train_data.head(5)
train_data.to_csv('processed_train_senti.csv', index=False)

In [27]:
test_data.head(5)

Unnamed: 0,text,sentiment,preprocess_1,preprocess_2,documents
0,Last session of the day http://twitpic.com/67ezh,neutral,last session of the day,"[last, session, of, the, day]",last session of the day
1,Shanghai is also really exciting (precisely -...,positive,shanghai is also really exciting precisely sk...,"[shanghai, is, also, really, exciting, precise...",shanghai is also really exciting precisely sky...
2,"Recession hit Veronique Branquinho, she has to...",negative,recession hit veronique branquinho she has to ...,"[recession, hit, veronique, branquinho, she, h...",recession hit veronique branquinho she ha to q...
3,happy bday!,positive,happy bday,"[happy, bday]",happy bday
4,http://twitpic.com/4w75p - I like it!!,positive,i like it,"[i, like, it]",i like it


In [28]:
res_1 = preprocessing_1(" Hellooooo I'ammmm keerthan@gmail.com #NLP is niceeeee")

In [32]:
res_1

'hello iam keerthancom  is nice'

In [33]:
preprocessing_2(res_1)

['hello', 'iam', 'keerthancom', 'is', 'nice']
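
The sample run above leaves `keerthancom` behind: the username pattern strips `@gmail`, and the e-mail address is never matched as a whole. The sketch below removes e-mail addresses (and phone-like digit runs, as listed in the outline) and is meant to run before the username step. Both patterns are simplified assumptions rather than full validators, and `remove_contacts` is a hypothetical helper name.

```python
import re

# Simplified patterns (assumptions, not exhaustive validators)
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def remove_contacts(text: str) -> str:
    """Remove e-mail addresses and phone-like digit runs from text."""
    text = EMAIL_PATTERN.sub("", text)
    text = PHONE_PATTERN.sub("", text)
    # collapse the whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()

sample = " Hellooooo I'ammmm keerthan@gmail.com #NLP is niceeeee"
cleaned = remove_contacts(sample)
print(cleaned)
```

The result can then be passed to `preprocessing_1`, which handles lowercasing, hashtags and repeated letters as before.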

### Creating a vocabulary from the unique words in the text - set()

In [34]:
vocab = set()
for words in train_data['preprocess_2']:
    for word in words:
        vocab.add(word)
print("Vocabulary Size:",len(vocab))

Vocabulary Size: 23462


#### Vectorization

### Bag of words

In [35]:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer()
train_bow = bow.fit_transform(train_data['documents'])
test_bow = bow.transform(test_data['documents'])

In [36]:
bow

In [37]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter = 1000)
model.fit(train_bow, y_train)


from sklearn.metrics import classification_report, accuracy_score

predict = model.predict(test_bow)
print("Accuracy Score :", accuracy_score(y_test, predict), end='\n\n')
print(classification_report(y_true = y_test, y_pred = predict))

Accuracy Score : 0.6983588002263724

              precision    recall  f1-score   support

           0       0.70      0.65      0.67      1001
           1       0.64      0.73      0.68      1430
           2       0.79      0.71      0.75      1103

    accuracy                           0.70      3534
   macro avg       0.71      0.69      0.70      3534
weighted avg       0.71      0.70      0.70      3534



### TF-IDF

In [38]:
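from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf = TfidfVectorizer()

# fit the vocabulary and IDF weights on the training documents only, then reuse them for the test set
train_idf = tf_idf.fit_transform(train_data['documents'])
test_idf = tf_idf.transform(test_data['documents'])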
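
For reference, with its default settings (`smooth_idf=True`, `norm='l2'`), scikit-learn's `TfidfVectorizer` weights a term $t$ in a document $d$ as shown below, where $n$ is the number of documents and $\mathrm{df}(t)$ is the number of documents that contain $t$. Each document row is then scaled to unit Euclidean length.

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \left( \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1 \right)$$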

In [39]:
test_idf.shape

(3534, 23436)
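
The matrix has 23436 columns, while the hand-built vocabulary earlier had 23462 entries. One likely reason (an assumption, not verified against this dataset) is the default `token_pattern` of scikit-learn's vectorizers, `(?u)\b\w\w+\b`, which keeps only tokens of two or more characters, so single-letter tokens such as `i` or `u` never become features. The pattern can be tried directly with `re`; `sklearn_style_tokens` is a hypothetical helper name.

```python
import re

# Default token_pattern used by CountVectorizer and TfidfVectorizer
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def sklearn_style_tokens(document: str):
    """Tokenize the way the default scikit-learn pattern would (lowercased first)."""
    return re.findall(DEFAULT_TOKEN_PATTERN, document.lower())

short_doc = sklearn_style_tokens("i like it")
long_doc = sklearn_style_tokens("wish we could come see u on denver")
print(short_doc)
print(long_doc)
```

Passing `token_pattern=r"(?u)\b\w+\b"` to the vectorizer would keep one-character tokens, if they turn out to matter.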

In [40]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter = 1000)
model.fit(train_idf, y_train)

from sklearn.metrics import classification_report, accuracy_score

predict = model.predict(test_idf)
print("Accuracy Score :", accuracy_score(y_test, predict), end='\n\n')
print(classification_report(y_true = y_test, y_pred = predict))

Accuracy Score : 0.7085455574419921

              precision    recall  f1-score   support

           0       0.73      0.64      0.68      1001
           1       0.64      0.76      0.69      1430
           2       0.81      0.71      0.76      1103

    accuracy                           0.71      3534
   macro avg       0.73      0.70      0.71      3534
weighted avg       0.72      0.71      0.71      3534



### Continuous Bag of Words (CBOW)

In [44]:
from gensim.models import Word2Vec
# sg defaults to 0, which selects the CBOW training algorithm
g_model = Word2Vec(sentences = train_data['preprocess_2'],vector_size=200,window=5, workers=5, epochs=500)

In [45]:
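def in_vocab(word_l):
    # True only when every token has a vector in the trained model
    return all(word in g_model.wv for word in word_l)

# Sum the word vectors of each document; empty documents and documents with any
# out-of-vocabulary token are represented by a zero vector
train_vec = [g_model.wv[x].sum(axis = 0) if len(x) and in_vocab(x) else np.zeros((200)) for x in train_data['preprocess_2']]
test_vec  = [g_model.wv[x].sum(axis = 0) if len(x) and in_vocab(x) else np.zeros((200)) for x in test_data['preprocess_2']]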

In [46]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter = 1000)
model.fit(train_vec, y_train)

from sklearn.metrics import classification_report, accuracy_score

predict = model.predict(test_vec)
print("Accuracy Score :", accuracy_score(y_test, predict), end='\n\n')
print(classification_report(y_true = y_test, y_pred = predict))

Accuracy Score : 0.5118845500848896

              precision    recall  f1-score   support

           0       0.66      0.26      0.37      1001
           1       0.46      0.86      0.60      1430
           2       0.72      0.29      0.41      1103

    accuracy                           0.51      3534
   macro avg       0.61      0.47      0.46      3534
weighted avg       0.60      0.51      0.48      3534
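
The features above turn a whole document into zeros as soon as one token is missing from the Word2Vec vocabulary, and gensim's default `min_count=5` drops rare words, so many tweets may end up as all-zero vectors. That could be part of why recall collapses toward a single class, although this was not checked here. A common alternative is to average only the tokens that do have a vector. The sketch below uses a plain dict as a stand-in for `g_model.wv`; `mean_vector` and the toy values are hypothetical.

```python
import numpy as np

def mean_vector(tokens, vectors, dim):
    """Average the vectors of the tokens that have one; zeros if none do."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy 3-dimensional embeddings standing in for g_model.wv (illustrative values)
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "day": np.array([3.0, 2.0, 0.0]),
}

# "unseenword" has no vector and is simply skipped
doc_vec = mean_vector(["good", "day", "unseenword"], toy_vectors, dim=3)
print(doc_vec)
```

With the trained model, the same helper could be applied as `[mean_vector(x, g_model.wv, 200) for x in train_data['preprocess_2']]`, since the keyed vectors support both `in` and item lookup.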



### Skip-gram

In [47]:
from gensim.models import Word2Vec

# sg=1 selects the skip-gram training algorithm
g_model = Word2Vec(sentences=train_data['preprocess_2'], vector_size=200, window=5, workers=5, sg=1, epochs=500)


In [48]:
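def in_vocab(word_l):
    # True only when every token has a vector in the trained model
    return all(word in g_model.wv for word in word_l)

# Sum the word vectors of each document; empty documents and documents with any
# out-of-vocabulary token are represented by a zero vector
train_vec = [g_model.wv[x].sum(axis = 0) if len(x) and in_vocab(x) else np.zeros((200)) for x in train_data["preprocess_2"]]
test_vec  = [g_model.wv[x].sum(axis = 0) if len(x) and in_vocab(x) else np.zeros((200)) for x in test_data["preprocess_2"]]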


In [49]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter = 1000)
model.fit(train_vec, y_train)

from sklearn.metrics import classification_report, accuracy_score

predict = model.predict(test_vec)
print("Accuracy Score :", accuracy_score(y_test, predict), end='\n\n')
print(classification_report(y_true = y_test, y_pred = predict))

Accuracy Score : 0.512450481041313

              precision    recall  f1-score   support

           0       0.65      0.26      0.37      1001
           1       0.46      0.86      0.60      1430
           2       0.72      0.29      0.41      1103

    accuracy                           0.51      3534
   macro avg       0.61      0.47      0.46      3534
weighted avg       0.59      0.51      0.48      3534



### Pre-trained GloVe (Twitter) embeddings via Gensim

In [50]:
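import gensim.downloader as api

# Pre-trained 200-dimensional GloVe vectors trained on Twitter data (large download)
glove = api.load("glove-twitter-200")

shape_n = 200

def in_vocab(word_l):
    # True only when every token has a pre-trained vector
    return all(word in glove for word in word_l)

# Sum the word vectors of each document; empty documents and documents with any
# out-of-vocabulary token are represented by a zero vector
train_vec = [glove[x].sum(axis = 0) if len(x) and in_vocab(x) else np.zeros((shape_n)) for x in train_data['preprocess_2']]
test_vec  = [glove[x].sum(axis = 0) if len(x) and in_vocab(x) else np.zeros((shape_n)) for x in test_data['preprocess_2']]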



In [51]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter = 1000)
model.fit(train_vec, y_train)

from sklearn.metrics import classification_report, accuracy_score

predict = model.predict(test_vec)
print("Accuracy Score :", accuracy_score(y_test, predict), end='\n\n')
print(classification_report(y_true = y_test, y_pred = predict))

Accuracy Score : 0.6428975664968873

              precision    recall  f1-score   support

           0       0.71      0.56      0.62      1001
           1       0.56      0.73      0.64      1430
           2       0.74      0.60      0.67      1103

    accuracy                           0.64      3534
   macro avg       0.67      0.63      0.64      3534
weighted avg       0.66      0.64      0.64      3534



### Classification using TF-IDF

In [52]:
text = """What is not to like about this product.
Not bad.
Not an issue.
Not buggy.
Not happy.
Not user-friendly.
Not good.
Is it any good?
I do not dislike horror movies. 
Disliking horror movies is not uncommon. 
Sometimes I really hate the show. 
I love having to wait two months for the next series to come out! 
The final episode was surprising with a terrible twist at the end.
The film was easy to watch but I would not recommend it to my friends. 
I LOL’d at the end of the cake scene."""

input_text = text.split("\n")
input_text = [" ".join(preprocessing_2(string)) for string in input_text]

In [53]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf = TfidfVectorizer()

train_idf = tf_idf.fit_transform(train_data["documents"])
pred_idf = tf_idf.transform(input_text)

In [54]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter = 1000)
model.fit(train_idf, y_train)

predict = model.predict(pred_idf)
predict = le.inverse_transform(predict)

In [55]:
for index, line in enumerate(text.split("\n")):
    print(line, " : ", predict[index])

What is not to like about this product.  :  negative
Not bad.  :  negative
Not an issue.  :  negative
Not buggy.  :  neutral
Not happy.  :  positive
Not user-friendly.  :  neutral
Not good.  :  positive
Is it any good?  :  positive
I do not dislike horror movies.   :  negative
Disliking horror movies is not uncommon.   :  negative
Sometimes I really hate the show.   :  negative
I love having to wait two months for the next series to come out!   :  positive
The final episode was surprising with a terrible twist at the end.  :  neutral
The film was easy to watch but I would not recommend it to my friends.   :  neutral
I LOL’d at the end of the cake scene.  :  neutral
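
Several negated sentences above are labelled by the polarity of the word that follows "not" (for example, "Not happy." comes out positive). With single-word features the classifier sees `not` and `happy` as unrelated columns. The bag of n-grams idea from the outline keeps adjacent words together as one feature. The sketch below builds unigram and bigram features by hand; `ngrams` is a hypothetical helper, and scikit-learn's vectorizers expose the same idea through their `ngram_range` parameter. Whether this improves the predictions on this dataset was not tested.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (joined with a space) found in tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["not", "happy"]
# unigrams plus bigrams, so the negation stays attached to the word it modifies
features = ngrams(tokens, 1) + ngrams(tokens, 2)
print(features)
```

In the notebook's setup this would correspond to something like `TfidfVectorizer(ngram_range=(1, 2))`, at the cost of a much larger feature matrix.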


In [43]:
#pip install -U gensim
