__In this project I cover some of the most basic techniques to tackle text data using pyspark.__
It is good to note that the five major steps in most NLP workflows are:
- Reading the corpus
- Tokenization
- Cleaning / stopword removal
- Stemming
- Converting into numerical form

In [1]:
# Create a Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Natural Language Processing').getOrCreate()

Let us see how we can do tokenization using PySpark. The first step is to create a dataframe that has text data.

In [4]:
df = spark.createDataFrame([(1,'I really liked this movie'),
                            (2,'I would recommend this movie to my friends'),
                            (3,'movie was alright but acting was horrible'),
                            (4,'I am never watching that movie ever again')],
                           ['user_id','review'])
df.show(4,False)

+-------+------------------------------------------+
|user_id|review                                    |
+-------+------------------------------------------+
|1      |I really liked this movie                 |
|2      |I would recommend this movie to my friends|
|3      |movie was alright but acting was horrible |
|4      |I am never watching that movie ever again |
+-------+------------------------------------------+



The next step is to import Tokenizer from pyspark.ml.feature. We then specify the input column and name the output column that will hold the tokens, and call the transform function to apply tokenization to the review column.

In [7]:
from pyspark.ml.feature import Tokenizer
tokenization = Tokenizer(inputCol='review', outputCol='tokens')
tokenized_df=tokenization.transform(df)
tokenized_df.show(5, False)

+-------+------------------------------------------+---------------------------------------------------+
|user_id|review                                    |tokens                                             |
+-------+------------------------------------------+---------------------------------------------------+
|1      |I really liked this movie                 |[i, really, liked, this, movie]                    |
|2      |I would recommend this movie to my friends|[i, would, recommend, this, movie, to, my, friends]|
|3      |movie was alright but acting was horrible |[movie, was, alright, but, acting, was, horrible]  |
|4      |I am never watching that movie ever again |[i, am, never, watching, that, movie, ever, again] |
+-------+------------------------------------------+---------------------------------------------------+



__The new tokens column contains the list of lowercased tokens for each sentence.__
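
To make the behaviour visible without Spark, here is a rough pure-Python stand-in for what happens to each string: lowercase it, then split on whitespace. The helper name simple_tokenize is hypothetical and only approximates Tokenizer.

```python
# Rough stand-in for Tokenizer on a single string:
# lowercase the text, then split it on whitespace.
def simple_tokenize(text):
    return text.lower().split()

reviews = [
    "I really liked this movie",
    "movie was alright but acting was horrible",
]
tokens = [simple_tokenize(r) for r in reviews]
print(tokens[0])
print(tokens[1])
```

Punctuation stays attached to words with this approach (for example "movie!" would be one token), which is the kind of case where RegexTokenizer from pyspark.ml.feature may be a better fit.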

### Stop Words Removal
Stop words add little to no value to our analysis. Including them only increases computational overhead without adding much insight.

In [9]:
from pyspark.ml.feature import StopWordsRemover
stopword_removal=StopWordsRemover(inputCol='tokens', outputCol='refined_tokens')
refined_df = stopword_removal.transform(tokenized_df)
refined_df.select('user_id', 'tokens', 'refined_tokens').show(5, False)

+-------+---------------------------------------------------+----------------------------------+
|user_id|tokens                                             |refined_tokens                    |
+-------+---------------------------------------------------+----------------------------------+
|1      |[i, really, liked, this, movie]                    |[really, liked, movie]            |
|2      |[i, would, recommend, this, movie, to, my, friends]|[would, recommend, movie, friends]|
|3      |[movie, was, alright, but, acting, was, horrible]  |[movie, alright, acting, horrible]|
|4      |[i, am, never, watching, that, movie, ever, again] |[never, watching, movie, ever]    |
+-------+---------------------------------------------------+----------------------------------+



__The refined_tokens column has all the stopwords removed.__
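
Conceptually, stop word removal is a filter against a fixed word list. The sketch below shows that idea in plain Python; the STOP_WORDS set is a tiny illustrative subset chosen for these reviews (an assumption), not the default list used by StopWordsRemover.

```python
# Small illustrative stop-word set (an assumption for this sketch; the
# default list used by StopWordsRemover is much longer).
STOP_WORDS = {"i", "this", "to", "my", "was", "but", "am", "that", "again"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Keep the original token order and drop anything in the stop-word set.
    return [t for t in tokens if t not in stop_words]

tokens = ["i", "really", "liked", "this", "movie"]
refined = remove_stop_words(tokens)
print(refined)
```

Because the list is just data, it can be extended with domain-specific words when the defaults are not enough.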

## Bag of Words (BOW)
This is the methodology through which we represent text data in numerical form so that it can be used by machine learning models or any other analysis.

## Count Vectorizer
This counts how many times each word appears in a particular document. Let's see how it works.

In [20]:
from pyspark.ml.feature import CountVectorizer
count_vec = CountVectorizer(inputCol='refined_tokens', outputCol='features')
cv_df = count_vec.fit(refined_df).transform(refined_df)
cv_df.select('user_id','refined_tokens', 'features').show(5, False)

+-------+----------------------------------+---------------------------------+
|user_id|refined_tokens                    |features                         |
+-------+----------------------------------+---------------------------------+
|1      |[really, liked, movie]            |(12,[0,4,5],[1.0,1.0,1.0])       |
|2      |[would, recommend, movie, friends]|(12,[0,1,8,10],[1.0,1.0,1.0,1.0])|
|3      |[movie, alright, acting, horrible]|(12,[0,2,7,9],[1.0,1.0,1.0,1.0]) |
|4      |[never, watching, movie, ever]    |(12,[0,3,6,11],[1.0,1.0,1.0,1.0])|
+-------+----------------------------------+---------------------------------+



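As we can observe, each sentence is represented as a sparse vector. The first number is the vector length (12, the size of the vocabulary), followed by the indexes of the nonzero entries and their values; the first sentence has three nonzero values, at indexes 0, 4, and 5.

__To inspect the vocabulary of the count vectorizer, we can use the vocabulary attribute of the fitted model.__ Because the call below fits the vectorizer again, words with equal counts may come back in a different order than the indexes in the table above.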

In [21]:
count_vec.fit(refined_df).vocabulary

['movie',
 'recommend',
 'acting',
 'never',
 'horrible',
 'really',
 'watching',
 'ever',
 'liked',
 'would',
 'alright',
 'friends']
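
The relationship between the vocabulary and the (size, indices, values) triples can be sketched in plain Python. This is only an illustration of the idea: the helper names are hypothetical, and ties between equally frequent words are broken alphabetically here, which is an arbitrary choice and need not match the order Spark produces.

```python
from collections import Counter

docs = [
    ["really", "liked", "movie"],
    ["would", "recommend", "movie", "friends"],
    ["movie", "alright", "acting", "horrible"],
    ["never", "watching", "movie", "ever"],
]

# Vocabulary ordered by corpus-wide count, most frequent first.
# Ties are broken alphabetically, an arbitrary choice for this sketch.
totals = Counter(word for doc in docs for word in doc)
vocabulary = sorted(totals, key=lambda w: (-totals[w], w))
index_of = {word: i for i, word in enumerate(vocabulary)}

def to_sparse(doc):
    # Return (size, indices, values), mirroring how the features column prints.
    counts = Counter(index_of[w] for w in doc)
    indices = sorted(counts)
    return len(vocabulary), indices, [float(counts[i]) for i in indices]

def to_dense(size, indices, values):
    # Expand the sparse triple into a full-length list of counts.
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

sparse = to_sparse(docs[0])
print(vocabulary)
print(sparse)
print(to_dense(*sparse))
```

Only the nonzero positions are stored, which is why the sparse form stays compact even when the vocabulary grows large.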

The drawback of using the Count Vectorizer method is that it does not take into account how often a word occurs in other documents. In simple terms, words that appear more often have a larger impact on the feature vector, even when they are common across the whole corpus.

## Term Frequency – Inverse Document Frequency (TF-IDF)
This method normalizes a word's frequency by how often the word occurs in other documents. The idea is to give more weight to a word that appears many times in the same document, but to penalize it if it also appears in many other documents, since that indicates the word is common across the corpus and is less important than its frequency in the current document suggests.
- __Term Frequency:__ Score based on the frequency of the word in the current document.
- __Inverse Document Frequency:__ Score based on the number of documents that contain the word; the more documents contain it, the lower the score.
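
The two scores can be worked through by hand on the four refined reviews. The sketch below uses the smoothed form log((N + 1) / (df + 1)) for the inverse document frequency, which to my understanding is the form described in the Spark MLlib documentation; the helper names idf and tf_idf are hypothetical.

```python
import math
from collections import Counter

docs = [
    ["really", "liked", "movie"],
    ["would", "recommend", "movie", "friends"],
    ["movie", "alright", "acting", "horrible"],
    ["never", "watching", "movie", "ever"],
]
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for doc in docs for word in set(doc))

def idf(word):
    # Smoothed inverse document frequency: log((N + 1) / (df + 1)).
    return math.log((n_docs + 1) / (doc_freq[word] + 1))

def tf_idf(doc):
    # Term frequency (raw count in this document) times the IDF weight.
    tf = Counter(doc)
    return {word: count * idf(word) for word, count in tf.items()}

weights = tf_idf(docs[0])
for word, weight in weights.items():
    print(word, round(weight, 3))
```

Since "movie" appears in all four reviews, its weight collapses to zero, while words unique to one review keep a positive weight.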

In [22]:
from pyspark.ml.feature import HashingTF,IDF
hashing_vec = HashingTF(inputCol='refined_tokens', outputCol='tf_features')
hashing_df = hashing_vec.transform(refined_df)
hashing_df.show()

+-------+--------------------+--------------------+--------------------+--------------------+
|user_id|              review|              tokens|      refined_tokens|         tf_features|
+-------+--------------------+--------------------+--------------------+--------------------+
|      1|I really liked th...|[i, really, liked...|[really, liked, m...|(262144,[14,32675...|
|      2|I would recommend...|[i, would, recomm...|[would, recommend...|(262144,[68867,12...|
|      3|movie was alright...|[movie, was, alri...|[movie, alright, ...|(262144,[80824,15...|
|      4|I am never watchi...|[i, am, never, wa...|[never, watching,...|(262144,[63139,15...|
+-------+--------------------+--------------------+--------------------+--------------------+
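
Unlike the count vectorizer, HashingTF does not fit a vocabulary: each token is hashed straight to a position in a fixed-size vector (262144 = 2^18 positions in the output above). A minimal sketch of that hashing trick follows. It uses zlib.crc32 as a stand-in hash, which is an assumption of this example, so the bucket numbers will not match the ones Spark prints.

```python
import zlib
from collections import Counter

NUM_FEATURES = 262144  # 2**18, the vector size visible in the output above

def bucket(token, num_features=NUM_FEATURES):
    # zlib.crc32 is a stand-in hash for this sketch; Spark uses its own
    # hash function, so these indexes differ from Spark's.
    return zlib.crc32(token.encode("utf-8")) % num_features

def hashing_tf(tokens, num_features=NUM_FEATURES):
    # Count how many tokens land in each bucket and return a sparse triple.
    counts = Counter(bucket(t, num_features) for t in tokens)
    indices = sorted(counts)
    return num_features, indices, [float(counts[i]) for i in indices]

size, indices, values = hashing_tf(["really", "liked", "movie"])
print(size, indices, values)
```

The trade-off is that two different words can collide in the same bucket, and an index cannot be mapped back to a word the way a fitted vocabulary allows.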



In [24]:
tf_idf_vec = IDF(inputCol='tf_features', outputCol='tf_idf_features')
tf_idf_df=tf_idf_vec.fit(hashing_df).transform(hashing_df)
tf_idf_df.show()

+-------+--------------------+--------------------+--------------------+--------------------+--------------------+
|user_id|              review|              tokens|      refined_tokens|         tf_features|     tf_idf_features|
+-------+--------------------+--------------------+--------------------+--------------------+--------------------+
|      1|I really liked th...|[i, really, liked...|[really, liked, m...|(262144,[14,32675...|(262144,[14,32675...|
|      2|I would recommend...|[i, would, recomm...|[would, recommend...|(262144,[68867,12...|(262144,[68867,12...|
|      3|movie was alright...|[movie, was, alri...|[movie, alright, ...|(262144,[80824,15...|(262144,[80824,15...|
|      4|I am never watchi...|[i, am, never, wa...|[never, watching,...|(262144,[63139,15...|(262144,[63139,15...|
+-------+--------------------+--------------------+--------------------+--------------------+--------------------+



## Text Classification Using Machine Learning
