<table>
    <tr><td>
         <a href="https://nbviewer.jupyter.org/github/panayiotiska/Jupyter-Sentiment-Analysis-Video-games-reviews/blob/master/Naive_Bayes_&_SVM_HashingVectorizer.ipynb">
         <img alt="start" src="figures/button_previous.jpg" width= 70% height= 70%></a>
    </td><td>
        <a href="https://nbviewer.jupyter.org/github/panayiotiska/Jupyter-Sentiment-Analysis-Video-games-reviews/blob/master/Index.ipynb">
         <img alt="start" src="figures/button_table-of-contents.jpg" width= 70% height= 70%></a>
    </td><td>
         <a href="https://nbviewer.jupyter.org/github/panayiotiska/Jupyter-Sentiment-Analysis-Video-games-reviews/blob/master/SVM_HashingVectorizer-LancasterStemming.ipynb">
         <img alt="start" src="figures/button_next.jpg" width= 70% height= 70%></a>
    </td></tr>
</table>

# Stemming

Stemming is a text normalization technique used to prepare text, words, and documents for further processing. In natural language, words are inflected to express grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood. In natural language processing, and specifically in sentiment analysis, it is important to consider the base meaning of a word together with the frequency with which it appears in a text. Stemming addresses the problem of having different forms of the same word inside a corpus by trying to reduce words to their common root form.

Usually the affected part of a word is its suffix, so most algorithms simply cut off commonly used suffixes such as "s", "es", "ies", and "ing".

- Examples of stemming:
    - Playing &#8594; play
    - Played  &#8594; play
    - Plays   &#8594; play
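
The suffix-cutting idea can be sketched with a toy function. This is only an illustration and not one of the stemmers examined in this project: the suffix list, the minimum stem length, and the name `naive_stem` are assumptions made for the example.

```python
# Toy suffix stripper: an illustration of the idea, NOT a real stemmer.
# The suffix list and the minimum stem length are hypothetical choices.
SUFFIXES = ("ing", "ies", "es", "ed", "s")

def naive_stem(word, min_stem_len=3):
    """Remove the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["Playing", "Played", "Plays"]:
    print(w, "->", naive_stem(w))
```

Real stemmers add conditions and rewrite rules on top of this, because blind suffix removal quickly produces wrong stems.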

The algorithms examined in the following notebooks are the Lancaster Stemmer and the Snowball Stemmer.

### Lancaster Stemmer

The Lancaster, or Paice-Husk, Stemmer was developed by Chris D. Paice at Lancaster University. It is considered a very aggressive stemming algorithm.

The Lancaster stemmer is iterative, and uses just one table of rules; each rule may specify either deletion or replacement of an ending. The rules are grouped into sections corresponding to the final letter of the suffix; this means that the rule table is accessed quickly by looking up the final letter of the current word or truncated word.
Within each section the ordering of the rules is significant. Some rules are restricted to intact words, i.e., words from which no ending has yet been removed. A simple blanket acceptability test is applied before any matching rule is activated. After a rule has been applied, processing may be allowed to continue iteratively, or may be terminated.

#### The Lancaster stemming algorithm

For each word, the algorithm inspects the final letter and looks up the rules in the matching section of the table. A rule is applied if its ending matches the end of the word (for example, the word ends in 's'), any intact-word restriction on the rule is met, and the resulting stem passes the acceptability test. Depending on the rule, processing then either continues with the new final letter or stops. This is repeated for every word in the corpus.
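
The lookup loop described above can be sketched as follows. This is a minimal sketch under stated assumptions, not NLTK's implementation: the miniature rule table, the simplified acceptability test, and the names `RULES`, `acceptable`, and `toy_lancaster` are all hypothetical and exist only for illustration.

```python
# Hypothetical miniature rule table, keyed by the final letter of the ending.
# Each rule: (ending, replacement, intact_only, continue_after).
# These are illustrative rules, not the real Lancaster rule set.
RULES = {
    "s": [("ies", "y", False, False), ("s", "", True, True)],
    "g": [("ing", "", False, True)],
    "d": [("ed", "", False, True)],
    "y": [("ly", "", False, True)],
}

VOWELS = set("aeiou")

def acceptable(stem):
    """Simplified blanket acceptability test (an assumption for this sketch)."""
    if not stem:
        return False
    if stem[0] in VOWELS:
        return len(stem) >= 2
    return len(stem) >= 3 and any(c in VOWELS or c == "y" for c in stem)

def toy_lancaster(word):
    stem = word.lower()
    intact = True  # True while no ending has been removed yet
    while True:
        # Look up the rule section by the final letter of the current stem
        for ending, replacement, intact_only, keep_going in RULES.get(stem[-1:], []):
            if not stem.endswith(ending):
                continue
            if intact_only and not intact:
                continue  # rule is restricted to intact words
            candidate = stem[: len(stem) - len(ending)] + replacement
            if not acceptable(candidate):
                continue
            stem, intact = candidate, False
            if keep_going:
                break  # iterate again with the new final letter
            return stem  # rule terminates processing
        else:
            return stem  # no rule fired, so stemming is finished

for w in ["playings", "parties", "sing", "glass"]:
    print(w, "->", toy_lancaster(w))
```

In this sketch "playings" is reduced in two passes, "sing" is left alone because the stem "s" fails the acceptability test, and "glass" loses only one "s" because the rule is restricted to intact words.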

#### Example from the dataset:

In [3]:
import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer

dataset = pd.read_json(r"C:\Users\Panos\Desktop\Dissert\Code\Sample_Video_Games_5.json", lines=True, encoding='latin-1')
dataset = dataset[['reviewText','overall']]

corpus = []

# Print the initial form of a review
print(dataset['reviewText'][1])

# Build the stop-word set and the stemmer once, outside the loop
stop_words = set(stopwords.words('english'))
lc = LancasterStemmer()

# Clean the text and apply stemming
for i in range(0, len(dataset)):
    review = re.sub('[^a-zA-Z]', ' ', dataset['reviewText'][i])  # keep letters only
    review = review.lower()
    review = review.split()
    # Drop stop words, then stem the remaining words
    review = [lc.stem(word) for word in review if word not in stop_words]
    # Drop stems that coincide with stop words
    review = [word for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

# Print review after stemming
print("---------")
print(corpus[1])

If you like rally cars get this game you will have fun.It is more oriented to &#34;European market&#34; since here in America there isn't a huge rally fan party. Music it is very European and even the voices from the game very &#34;English&#34; accent.The multiplayer isn't the best but it works just ok.
---------
lik ral car get gam fun ory europ market sint americ hug ral fan party mus europ ev voic gam engl acc multiplay best work ok


*Printing a specific review from the dataset before and after applying the Lancaster Stemmer shows that the algorithm is aggressive: it cuts off a significant part of many words. This is not always a bad thing, because it conflates more words that share the same root. However, two different words can sometimes end up with the same stem, which is undesirable in a sentiment analysis task.*

### Snowball Stemmer

NLTK's Snowball Stemmer for English is the updated version of the Porter Stemmer, also known as Porter2. Porter2 is a suffix-stripping stemmer. It transforms words into stems by applying a deterministic sequence of changes to the final portion of the word.


#### The Snowball stemming algorithm

The original algorithm (Porter1) consists of 5 phases of word reduction. Each phase has a set of rules listed one beneath another, of which only one is applied: the rule with the longest matching suffix.
Each rule may carry a condition on the remaining stem, typically its measure, which loosely counts the number of syllables and determines whether the word is long enough for the suffix to be removed or replaced.
If the condition is satisfied, the suffix is rewritten according to the rule.
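
The length condition of the original Porter algorithm can be sketched in a few lines. This sketch covers only the measure and a single conditional rule, not the five phases. The helper names `is_consonant`, `measure`, and `apply_rule` are chosen here for illustration and are not NLTK functions.

```python
def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-consonant sequences: m in the form [C](VC)^m[V]."""
    stem = stem.lower()
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_vowel = not cons
    return m

def apply_rule(word, suffix, replacement, min_measure):
    """Apply '(m > min_measure) suffix -> replacement' if the condition holds."""
    if word.endswith(suffix):
        stem = word[: len(word) - len(suffix)]
        if measure(stem) > min_measure:
            return stem + replacement
    return word

print(measure("tree"), measure("trouble"), measure("private"))
print(apply_rule("replacement", "ement", "", 1))
print(apply_rule("cement", "ement", "", 1))
```

With the condition m > 1, "replacement" loses its suffix because the stem "replac" is long enough, while "cement" is left unchanged because the stem "c" is too short.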

Later, Dr. Porter himself suggested several improvements to the original algorithm.
The changes are:
- A terminating ‘y’ is changed to ‘i’ less often.
- The suffix ‘us’ does not lose its ‘s’.
- Additional suffixes are removed, including the suffix ‘ly’.
- A step 0 is added to handle apostrophes.
- A small list of exceptional forms is included.



#### Example from the dataset:

In [4]:
import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.snowball  import SnowballStemmer

dataset = pd.read_json(r"C:\Users\Panos\Desktop\Dissert\Code\Sample_Video_Games_5.json", lines=True, encoding='latin-1')
dataset = dataset[['reviewText','overall']]

corpus = []

# Build the stop-word set and the stemmer once, outside the loop
stop_words = set(stopwords.words('english'))
sb = SnowballStemmer("english")

# Clean the text and apply stemming
for i in range(0, len(dataset)):
    review = re.sub('[^a-zA-Z]', ' ', dataset['reviewText'][i])  # keep letters only
    review = review.lower()
    review = review.split()
    # Drop stop words, then stem the remaining words
    review = [sb.stem(word) for word in review if word not in stop_words]
    # Drop stems that coincide with stop words
    review = [word for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

# Print review after stemming
print("---------")
print(corpus[1])

---------
like ralli car get game fun orient european market sinc america huge ralli fan parti music european even voic game english accent multiplay best work ok


*Compared with the Lancaster Stemmer, the Snowball Stemmer appears to produce a much "cleaner" result, with stems in a form that is more familiar to a human reader. However, to see which algorithm performs better in the type of task examined in this project, the two algorithms have to be tested on the whole dataset. In the next four notebooks the stemming algorithms are tested for different cases.*
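
As a rough, single-review illustration of the difference, the two stemmed outputs printed in this notebook can be compared token by token. The sketch assumes both stemmers received the same token sequence, so positions line up. The variable and function names are chosen here for illustration.

```python
# Stemmed outputs copied from the two examples in this notebook
lancaster_out = ("lik ral car get gam fun ory europ market sint americ hug ral "
                 "fan party mus europ ev voic gam engl acc multiplay best work ok")
snowball_out = ("like ralli car get game fun orient european market sinc america "
                "huge ralli fan parti music european even voic game english "
                "accent multiplay best work ok")

lancaster_tokens = lancaster_out.split()
snowball_tokens = snowball_out.split()

# Pairs of stems that differ between the two stemmers
differing = [(lan, snow)
             for lan, snow in zip(lancaster_tokens, snowball_tokens)
             if lan != snow]

def average_length(tokens):
    """Mean number of characters per stem."""
    return sum(len(t) for t in tokens) / len(tokens)

print(f"{len(differing)} of {len(lancaster_tokens)} stems differ")
print(f"average stem length: Lancaster {average_length(lancaster_tokens):.2f}, "
      f"Snowball {average_length(snowball_tokens):.2f}")
for lan, snow in differing:
    print(f"{lan:>10}  |  {snow}")
```

A single review is not evidence of which stemmer classifies better. It only makes the difference in aggressiveness visible.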

<a href="https://nbviewer.jupyter.org/github/panayiotiska/Jupyter-Sentiment-Analysis-Video-games-reviews/blob/master/SVM_HashingVectorizer-LancasterStemming.ipynb">
         <img alt="start" src="figures/button_next.jpg"></a>