<a href="https://colab.research.google.com/github/hibatallahk/NLP-Sentiment-Analysis/blob/main/sentiment_analysis_practicum.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis

Sentiment analysis identifies emotionally charged texts. It is especially useful in business for gauging consumer reactions to a new product: a person would need several hours to read through thousands of reviews, while a computer can process them in a couple of minutes.
Here, sentiment analysis means labeling each text as positive or negative: positive text is labeled "1" and negative text is labeled "0".
To determine the tonality of a text, we will use TF-IDF values as features.
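
As a rough illustration of what TF-IDF features look like, the sketch below vectorizes a tiny made-up corpus. The reviews and the variable names (`toy_reviews`, `toy_vectorizer`, `toy_matrix`) are assumptions for the example and are not part of the IMDB data. Each review becomes one row and each distinct word one column. Within a review, a word that occurs in fewer documents receives a larger weight than an equally frequent word that occurs in many.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, not taken from the IMDB files)
toy_reviews = [
    "great movie great cast",
    "boring movie",
    "great fun",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_reviews)

# Column order follows the alphabetically sorted vocabulary
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)

# One row per review, one column per distinct word
print(toy_matrix.shape)
print(toy_matrix.toarray().round(2))
```

In the second review, "boring" occurs in only one document while "movie" occurs in two, so "boring" ends up with the larger weight in that row.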

### Task
Train a logistic regression to determine the tonality of reviews. Use TF-IDF vectors for lemmatized reviews as features.
The training part of the dataset is in the imdb_reviews_small_lemm_train.tsv file. The lemmatized reviews are in the review_lemm column (so you don't have to lemmatize the reviews yourself), and the target is in the pos column (0 for a negative review, 1 for a positive review).
Use the trained classification model to predict the tonality of the test sample of reviews from the imdb_reviews_small_lemm_test.tsv file. Save the predictions to the pos column. Save the table with results as a CSV file. Don't specify the file extension so that the platform accepts the file (for example, call it 'predictions').
The model accuracy should be at least 0.82.
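
The notebook does not measure accuracy itself, so one possible way to estimate it before submitting is to hold out part of the labeled training data. The sketch below is an assumption about how that could be done, not the platform's scoring procedure: `holdout_accuracy` is a hypothetical helper, and the toy texts are made up so that the example runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def holdout_accuracy(texts, labels, valid_size=0.25, seed=12345):
    """Hypothetical helper: accuracy of TF-IDF + logistic regression on a held-out split."""
    texts_train, texts_valid, y_train, y_valid = train_test_split(
        texts, labels, test_size=valid_size, random_state=seed, stratify=labels
    )
    vectorizer = TfidfVectorizer()
    # Fit the vocabulary on the training part only, then reuse it for validation
    x_train = vectorizer.fit_transform(texts_train)
    x_valid = vectorizer.transform(texts_valid)
    model = LogisticRegression(random_state=seed)
    model.fit(x_train, y_train)
    return accuracy_score(y_valid, model.predict(x_valid))


# Made-up data: 8 positive and 8 negative mini reviews
toy_texts = [
    "great film", "wonderful acting", "great story", "wonderful film",
    "great acting", "lovely story", "lovely film", "wonderful story",
    "awful film", "terrible acting", "awful story", "terrible film",
    "awful acting", "dull story", "dull film", "terrible story",
]
toy_labels = [1] * 8 + [0] * 8

toy_accuracy = holdout_accuracy(toy_texts, toy_labels)
print(toy_accuracy)
```

With the real data, the same helper could be called as `holdout_accuracy(train_set['review_lemm'], train_set['pos'])` once the training file has been loaded, and the result compared against the 0.82 threshold.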

In [1]:
import nltk
import pandas as pd
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Download the NLTK stop-word lists (skipped if they are already present)
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
train_set = pd.read_csv('imdb_reviews_small_lemm_train.tsv', sep='\t')
test_set = pd.read_csv('imdb_reviews_small_lemm_test.tsv', sep='\t')

In [3]:
stop_words = set(nltk_stopwords.words('english'))

In [4]:
# scikit-learn expects stop_words as a list; recent versions reject a set
count_tf_idf = TfidfVectorizer(stop_words=list(stop_words))

In [5]:
features_train = count_tf_idf.fit_transform(train_set['review_lemm'])
features_test = count_tf_idf.transform(test_set['review_lemm'])
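
The cell above calls `fit_transform` on the training reviews but only `transform` on the test reviews. A minimal sketch of why, using made-up texts (`demo_train`, `demo_test` and the other `demo_` names are assumptions for illustration): the vocabulary is learned from the training data alone, so the test matrix has the same columns as the training matrix and words never seen in training are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

demo_train = ["good plot", "bad plot"]
demo_test = ["good soundtrack"]  # 'soundtrack' never appears in demo_train

demo_vectorizer = TfidfVectorizer()
demo_train_matrix = demo_vectorizer.fit_transform(demo_train)  # learns the vocabulary
demo_test_matrix = demo_vectorizer.transform(demo_test)        # reuses it unchanged

demo_vocab = sorted(demo_vectorizer.vocabulary_)
print(demo_vocab)                # vocabulary comes from the training texts only
print(demo_train_matrix.shape)   # (number of train texts, vocabulary size)
print(demo_test_matrix.shape)    # same number of columns as the train matrix
print(demo_test_matrix.nnz)      # only 'good' contributes a non-zero value
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the resulting columns would no longer line up with what the model was trained on.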

In [6]:
target_train = train_set['pos']
# target_test is not defined: the 'pos' values for the test set are what the model predicts below

In [7]:
model_lr = LogisticRegression(random_state=12345)

In [8]:
model_lr.fit(features_train, target_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=12345, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [9]:
predictions = model_lr.predict(features_test)

In [10]:
submission = pd.DataFrame({'pos': predictions})
submission.to_csv('predictions')
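
By default, `to_csv` also writes the DataFrame index as an extra unnamed first column. The task does not say whether the platform expects that column, so the sketch below only shows the difference on a made-up table written to a temporary directory (`demo`, `with_index` and `without_index` are illustrative names).

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({'pos': [1, 0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'predictions')  # no file extension, as in the task

    demo.to_csv(path)                # default: the index is written as the first column
    with_index = pd.read_csv(path)

    demo.to_csv(path, index=False)   # only the 'pos' column is written
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```

If the platform turns out to accept only the pos column, passing `index=False` in the final cell would be the corresponding change.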