# Sentiment Analysis in Python

The goal here is to determine how positive or negative the sentiment of a text is. The texts are usually reviews or comments of some kind.

The data comes from the [Multi-Domain Sentiment Dataset](http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html), which contains product reviews taken from Amazon.com for 4 product types (domains): Kitchen, Books, DVDs, and Electronics. Each domain has several thousand reviews, but the exact number varies by domain. Reviews contain star ratings (1 to 5 stars) that can be converted into binary labels if needed.
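As a minimal sketch of the conversion from star ratings to binary labels, the function below treats ratings above 3 as positive and ratings below 3 as negative, and drops 3-star reviews as ambiguous. The threshold and the name `star_to_label` are assumptions made for illustration, not something this tutorial prescribes.

```python
from typing import Optional


def star_to_label(stars: float) -> Optional[int]:
    """Map a 1-5 star rating to 1 (positive), 0 (negative), or None (neutral)."""
    if stars > 3:
        return 1
    if stars < 3:
        return 0
    return None  # assumption: 3-star reviews are ambiguous, so this sketch drops them


ratings = [5, 1, 3, 4, 2]
labels = [star_to_label(r) for r in ratings]
print(labels)  # [1, 0, None, 1, 0]
```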

The data is in XML format, so we need an XML parser to read it. Here we will use _BeautifulSoup_, which can be installed with the following command:

```
conda install -c anaconda beautifulsoup4 
```

## Outline of the Sentiment Analyzer

Here we will look at the electronics category. One approach would be regression with the star ratings as targets. However, since the data is already labeled with "positive"/"negative" sentiments, we will treat this as a classification problem and use a simple classifier, _logistic regression_, whose learned weights are also easy to interpret.

For this purpose, we only need the review text, which is found under the "review_text" tag.
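The extraction step could look roughly like the sketch below. The inline `sample` string is a made-up stand-in for a review file, and the choice of Python's built-in `html.parser` backend is an assumption made to avoid extra dependencies; only the "review_text" tag name comes from the text above.

```python
from bs4 import BeautifulSoup

# Hypothetical two-review sample; the real files contain more fields per review
sample = """
<review>
<review_text>
Great headphones, very clear sound.
</review_text>
</review>
<review>
<review_text>
Stopped working after a week.
</review_text>
</review>
"""

# html.parser ships with Python and accepts input without a single root element
soup = BeautifulSoup(sample, "html.parser")

# Collect the text of every <review_text> element, trimming surrounding whitespace
review_texts = [tag.get_text().strip() for tag in soup.find_all("review_text")]
print(review_texts)
```

For a real file, the string passed to `BeautifulSoup` would be the file's contents instead of `sample`.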

Two passes over the data are required: one to determine the vocabulary size and assign an index to each word, and another to create the data vectors.
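The two passes can be sketched as follows. The toy `reviews` list, the helper name `tokens_to_vector`, and the choice of word proportions as features (with the label stored in the last position) are assumptions made for illustration; other feature encodings would fit the same two-pass structure.

```python
# Hypothetical tokenized reviews with labels (1 = positive, 0 = negative)
reviews = [
    (["great", "sound", "great", "price"], 1),
    (["poor", "sound"], 0),
]

# Pass 1: assign each distinct word the next free index
word_index_map = {}
for tokens, _ in reviews:
    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] = len(word_index_map)

vocab_size = len(word_index_map)


# Pass 2: turn each review into word proportions, with the label as last element
def tokens_to_vector(tokens, label):
    vector = [0.0] * (vocab_size + 1)
    for token in tokens:
        vector[word_index_map[token]] += 1
    total = sum(vector[:vocab_size])
    if total > 0:
        vector[:vocab_size] = [count / total for count in vector[:vocab_size]]
    vector[vocab_size] = label
    return vector


data = [tokens_to_vector(tokens, label) for tokens, label in reviews]
print(word_index_map)  # {'great': 0, 'sound': 1, 'price': 2, 'poor': 3}
print(data)            # [[0.5, 0.25, 0.25, 0.0, 1], [0.0, 0.5, 0.0, 0.5, 0]]
```

The first pass is needed because the length of every vector depends on the final vocabulary size, which is only known after all reviews have been seen.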

## Logistic Regression Review
For a review of logistic regression theory and its use in Python, you can refer to this tutorial: https://github.com/mahsa-teimourikia/LogisticRegressionWithPython

## Preprocessing

### Tokenization
Tokenization is the process of splitting an input sequence into tokens, which are the units used for semantic processing.

The NLTK library includes tokenization functionality. Here are some examples:

```python
import nltk

text = "This is Mahsa's cup, isn't it?"

# Using the white space tokenizer
tokenizer = nltk.tokenize.WhitespaceTokenizer()
tokenizer.tokenize(text)
```

```
['This', 'is', "Mahsa's", 'cup,', "isn't", 'it?']
```

```python
# Splitting by a set of rules
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokenizer.tokenize(text)
```

```
['This', 'is', 'Mahsa', "'s", 'cup', ',', 'is', "n't", 'it', '?']
```

```python
# Splitting by punctuation
tokenizer = nltk.tokenize.WordPunctTokenizer()
tokenizer.tokenize(text)
```

```
['This', 'is', 'Mahsa', "'", 's', 'cup', ',', 'isn', "'", 't', 'it', '?']
```

Sometimes we need to write our own tokenizer, for example to remove stop words, numbers, or single-letter tokens. Other applications may call for their own specific tokenization techniques.
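A custom tokenizer along these lines might look like the sketch below. The name `simple_tokenizer`, the regular expression, and the tiny `STOP_WORDS` set are all illustrative assumptions; a real application would likely use a much longer stop word list.

```python
import re

# Tiny illustrative stop word list; a real one would be much longer
STOP_WORDS = {"this", "is", "it", "a", "the", "and", "of", "to", "at"}


def simple_tokenizer(text):
    """Lowercase the text, keep alphabetic words, and filter unwanted tokens."""
    # Match runs of letters, optionally followed by an apostrophe part ("isn't");
    # digits never match, so numbers are dropped at this step
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Drop single-letter tokens and stop words
    return [t for t in tokens if len(t) > 1 and t not in STOP_WORDS]


print(simple_tokenizer("This is Mahsa's cup, isn't it? It cost 12 dollars at store B."))
# ["mahsa's", 'cup', "isn't", 'cost', 'dollars', 'store']
```

Keeping the apostrophe inside the token is a design choice: it avoids fragments such as `isn` and `t`, at the cost of treating `mahsa's` and `mahsa` as different words.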