#Scenario: **Analyze Amazon Alexa Reviews using spaCy**

###**What is Natural Language Processing?**

Natural Language Processing (NLP) is a branch of artificial intelligence that deals with analyzing, understanding, and generating the languages humans use naturally. It lets people interact with computers in written and spoken form using human languages instead of computer languages. Some common use cases of NLP in machine learning are:

- **Topic discovery and modeling:** Capture the meaning and themes in text collections, and apply advanced modeling techniques such as Topic Modeling to group similar documents together.
- **Sentiment Analysis:** Identifying the mood or subjective opinions within large amounts of text, including average sentiment and opinion mining.
- **Document summarization:** Automatically generating synopses of large bodies of text.
- **Speech-to-text and text-to-speech conversion:** Transforming voice commands into written text, and vice versa.
- **Machine translation:** Automatic translation of text or speech from one language to another.  

__[Learn More about Natural Language Processing](https://en.wikipedia.org/wiki/Natural_language_processing)__

###**Dataset Description:**

This dataset consists of 568,454 Amazon customer reviews of various products. Each record holds the review text, a short summary, a star rating (`Score`), the review time, and helpfulness votes, in the following columns:

- **Id**
- **ProductId**
- **UserId**
- **ProfileName**
- **HelpfulnessNumerator**
- **HelpfulnessDenominator**
- **Score**
- **Time**
- **Summary**
- **Text**


### Tasks to be performed:

- Download the dataset from Dropbox and install dependencies
- Import the required libraries and load the dataset
- Perform Exploratory Data Analysis (EDA) on the dataset
- Use spaCy to implement:
 - **Tokenization**
 - **Part of Speech Tagging**
 - **Stopwords removal**
 - **Lemmatization**
 - **Dependency Parsing**
 - **Named Entity Recognition**
- Implement Text Summarization Using **Gensim**
- Use **PyCaret** to implement NLP


### Importing Libraries and getting data

In [None]:
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
# import missingno as msno
# sns.set()

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

print('Libraries Imported')

In [None]:
!wget https://www.dropbox.com/s/socxqy7mbtteo65/Reviews.csv

In [3]:
df = pd.read_csv('Reviews.csv')
print(df.shape)
df.head(2)

(568454, 10)


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...


In [None]:
print(df.shape)
df.columns

In [4]:
print(df.loc[1,'Summary'])
print(df.loc[1,'Text'])
print(df.loc[1,'Score'])

Not as Advertised
Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".
1


### Missing values

In [5]:
#msno.matrix(df)
#msno.bar(df)
df.isna().sum()

Id                         0
ProductId                  0
UserId                     0
ProfileName               16
HelpfulnessNumerator       0
HelpfulnessDenominator     0
Score                      0
Time                       0
Summary                   27
Text                       0
dtype: int64

### EDA using Sweetviz



**Sweetviz** is an open source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with a single line of code. **Output** is a fully self-contained **HTML** application.

The system is built around quickly visualizing target values and comparing datasets. Its goal is to help quick analysis of target characteristics, training vs testing data, and other such data characterization tasks.

**[Click Here!](https://pypi.org/project/sweetviz/)** to learn more about Sweetviz

In [None]:
!pip install sweetviz

In [7]:
import sweetviz as sv
report = sv.analyze(df)
report.show_html('Output.html')

Report Output.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


### Pre-processing

In [8]:
df = df[['Text','Score']].dropna()

In [9]:
print(df.shape)
df.head()

(568454, 2)


Unnamed: 0,Text,Score
0,I have bought several of the Vitality canned d...,5
1,Product arrived labeled as Jumbo Salted Peanut...,1
2,This is a confection that has been around a fe...,4
3,If you are looking for the secret ingredient i...,2
4,Great taffy at a great price. There was a wid...,5


In [10]:
df.Score.unique()

array([5, 1, 4, 2, 3])

In [11]:
df.Score.value_counts()
# plt.figure(figsize=(12,8))
# df.Score.value_counts().plot(kind='bar')
# plt.show()

5    363122
4     80655
1     52268
3     42640
2     29769
Name: Score, dtype: int64

___
Let us treat rating 4 and 5 as positive and rest as negative reviews
___
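As a sanity check on this labelling rule, the sketch below applies it to the star-rating counts printed above and shows the class sizes the relabelled column should have. The helper name `to_sentiment` is hypothetical; the counts are copied from the `value_counts()` output.

```python
from collections import Counter

# Star-rating counts copied from the value_counts() output above.
score_counts = {5: 363122, 4: 80655, 1: 52268, 3: 42640, 2: 29769}

def to_sentiment(score: int) -> int:
    """Hypothetical helper: 1 for a positive review (4 or 5 stars), else 0."""
    return 1 if score >= 4 else 0

# Sum the review counts per sentiment label.
label_counts = Counter()
for score, count in score_counts.items():
    label_counts[to_sentiment(score)] += count

print(dict(label_counts))  # {1: 443777, 0: 124677}
```

Roughly 78% of the reviews end up in the positive class, so the two classes are noticeably imbalanced.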

In [12]:
# Use .loc so the assignment is applied to df itself rather than to a temporary copy
df.loc[df.Score <= 3, 'Score'] = 0
df.loc[df.Score >= 4, 'Score'] = 1


In [13]:
df.Score.value_counts()
# plt.figure(figsize=(12,8))
# df.Score.value_counts().plot(kind='bar')
# plt.show()

1    443777
0    124677
Name: Score, dtype: int64

In [14]:
print(df.shape)
df.head()

(568454, 2)


Unnamed: 0,Text,Score
0,I have bought several of the Vitality canned d...,1
1,Product arrived labeled as Jumbo Salted Peanut...,0
2,This is a confection that has been around a fe...,1
3,If you are looking for the secret ingredient i...,0
4,Great taffy at a great price. There was a wid...,1


###Linguistic features: 

- Tokenization
- Part-of-speech tagging
- Lemmatization
- Named Entity Recognition
- Dependency parsing
- Visualization using spacy.displacy and explacy


In [15]:
import spacy
!python -m spacy download en_core_web_lg

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-lg==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.5.0/en_core_web_lg-3.5.0-py3-none-any.whl (587.7 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.5.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [16]:

# from spacy.matcher import Matcher
# from spacy.tokens import Span


# Loading english language model
# A language model is a statistical model that lets us perform NLP tasks such as POS-tagging and NER-tagging

nlp = spacy.load('en_core_web_lg')

In [17]:
print(df.shape)
df.head()

(568454, 2)


Unnamed: 0,Text,Score
0,I have bought several of the Vitality canned d...,1
1,Product arrived labeled as Jumbo Salted Peanut...,0
2,This is a confection that has been around a fe...,1
3,If you are looking for the secret ingredient i...,0
4,Great taffy at a great price. There was a wid...,1


In [18]:
df.Text[5]

'I got a wild hair for taffy and ordered this five pound bag. The taffy was all very enjoyable with many flavors: watermelon, root beer, melon, peppermint, grape, etc. My only complaint is there was a bit too much red/black licorice-flavored pieces (just not my particular favorites). Between me, my kids, and my husband, this lasted only two weeks! I would recommend this brand of taffy -- it was a delightful treat.'

In [19]:
#Loading a random review from the data set 
review1 = df.Text[15]
print('Random Review:\n',review1)
print(len(review1))

Random Review:
 My daughter loves twizzlers and this shipment of six pounds really hit the spot. It's exactly what you would expect...six packages of strawberry twizzlers.
155


In [20]:
# Calling nlp() on a string makes spaCy tokenize the text and create a Doc object
doc = nlp(review1)
print(type(doc))

<class 'spacy.tokens.doc.Doc'>


In [25]:
i = 0
for tokenObj in doc:
  i += 1
  print(tokenObj.text, tokenObj.pos_, tokenObj.lemma_, tokenObj.is_alpha)

  if i > 5:
    break

My PRON my True
daughter NOUN daughter True
loves VERB love True
twizzlers NOUN twizzler True
and CCONJ and True
this DET this True


In [30]:
#for i,token in enumerate(doc):
tlist = [tokenObj.text for tokenObj in doc]
print(review1)
print(tlist)
print(len(tlist))

My daughter loves twizzlers and this shipment of six pounds really hit the spot. It's exactly what you would expect...six packages of strawberry twizzlers.
['My', 'daughter', 'loves', 'twizzlers', 'and', 'this', 'shipment', 'of', 'six', 'pounds', 'really', 'hit', 'the', 'spot', '.', 'It', "'s", 'exactly', 'what', 'you', 'would', 'expect', '...', 'six', 'packages', 'of', 'strawberry', 'twizzlers', '.']
29


In [29]:
mylist = [[token.text, token.lemma_, token.pos_] for token in doc]
print(mylist[0])
print(mylist[1])
print(len(mylist))
print(mylist)

['My', 'my', 'PRON']
['daughter', 'daughter', 'NOUN']
29
[['My', 'my', 'PRON'], ['daughter', 'daughter', 'NOUN'], ['loves', 'love', 'VERB'], ['twizzlers', 'twizzler', 'NOUN'], ['and', 'and', 'CCONJ'], ['this', 'this', 'DET'], ['shipment', 'shipment', 'NOUN'], ['of', 'of', 'ADP'], ['six', 'six', 'NUM'], ['pounds', 'pound', 'NOUN'], ['really', 'really', 'ADV'], ['hit', 'hit', 'VERB'], ['the', 'the', 'DET'], ['spot', 'spot', 'NOUN'], ['.', '.', 'PUNCT'], ['It', 'it', 'PRON'], ["'s", 'be', 'AUX'], ['exactly', 'exactly', 'ADV'], ['what', 'what', 'PRON'], ['you', 'you', 'PRON'], ['would', 'would', 'AUX'], ['expect', 'expect', 'VERB'], ['...', '...', 'PUNCT'], ['six', 'six', 'NUM'], ['packages', 'package', 'NOUN'], ['of', 'of', 'ADP'], ['strawberry', 'strawberry', 'NOUN'], ['twizzlers', 'twizzler', 'NOUN'], ['.', '.', 'PUNCT']]


In [None]:
print(mylist)

In [31]:
myReview1TokenList = review1.split()
print(myReview1TokenList)

['My', 'daughter', 'loves', 'twizzlers', 'and', 'this', 'shipment', 'of', 'six', 'pounds', 'really', 'hit', 'the', 'spot.', "It's", 'exactly', 'what', 'you', 'would', 'expect...six', 'packages', 'of', 'strawberry', 'twizzlers.']
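The whitespace split above leaves punctuation glued to words (`'spot.'`, `"It's"`, `'expect...six'`), whereas spaCy returned 29 separate tokens for the same review. The following is a rough sketch of why a tokenizer needs more than `split()`. The regular expression is an illustrative assumption, not spaCy's actual algorithm, which relies on language-specific prefix, suffix, and exception rules.

```python
import re

review = ("My daughter loves twizzlers and this shipment of six pounds really "
          "hit the spot. It's exactly what you would expect...six packages of "
          "strawberry twizzlers.")

# Illustrative pattern (an assumption, not spaCy's rules): an ellipsis,
# a run of word characters, a clitic such as 's, or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\.\.\.|\w+|'\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

whitespace_tokens = review.split()
regex_tokens = simple_tokenize(review)

print(len(whitespace_tokens), len(regex_tokens))  # 24 29
print(regex_tokens[13:17])  # ['spot', '.', 'It', "'s"]
```

On this one review the pattern happens to reproduce spaCy's 29 tokens, but it would not handle cases such as URLs, abbreviations, or hyphenated words the way a trained pipeline does.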


In [32]:
tokenized_text = pd.DataFrame()

for i, token in enumerate(doc):
    tokenized_text.loc[i, 'text'] = token.text
    tokenized_text.loc[i, 'lemma'] = token.lemma_  # no trailing comma, which would store a 1-tuple
    tokenized_text.loc[i, 'pos'] = token.pos_
    tokenized_text.loc[i, 'tag'] = token.tag_
    # tokenized_text.loc[i, 'dep'] = token.dep_
    # tokenized_text.loc[i, 'shape'] = token.shape_
    tokenized_text.loc[i, 'is_alpha'] = token.is_alpha
    tokenized_text.loc[i, 'is_stop'] = token.is_stop
    tokenized_text.loc[i, 'is_punctuation'] = token.is_punct
    tokenized_text.loc[i, 'entity'] = token.ent_type_

print(tokenized_text.shape)
tokenized_text[:20]

(29, 8)


Unnamed: 0,text,lemma,pos,tag,is_alpha,is_stop,is_punctuation,entity
0,My,my,PRON,PRP$,True,True,False,
1,daughter,daughter,NOUN,NN,True,False,False,
2,loves,love,VERB,VBZ,True,False,False,
3,twizzlers,twizzler,NOUN,NNS,True,False,False,
4,and,and,CCONJ,CC,True,True,False,
5,this,this,DET,DT,True,True,False,
6,shipment,shipment,NOUN,NN,True,False,False,
7,of,of,ADP,IN,True,True,False,
8,six,six,NUM,CD,True,True,False,QUANTITY
9,pounds,pound,NOUN,NNS,True,False,False,QUANTITY


In [33]:
review1 = df.Text[3]
doc = nlp(review1)
tokenized_text = pd.DataFrame()

for i, token in enumerate(doc):
    tokenized_text.loc[i, 'text'] = token.text
    tokenized_text.loc[i, 'lemma'] = token.lemma_  # no trailing comma, which would store a 1-tuple
    tokenized_text.loc[i, 'pos'] = token.pos_
    tokenized_text.loc[i, 'tag'] = token.tag_
    # tokenized_text.loc[i, 'dep'] = token.dep_
    # tokenized_text.loc[i, 'shape'] = token.shape_
    tokenized_text.loc[i, 'is_alpha'] = token.is_alpha
    tokenized_text.loc[i, 'is_stop'] = token.is_stop
    tokenized_text.loc[i, 'is_punctuation'] = token.is_punct
    tokenized_text.loc[i, 'entity'] = token.ent_type_

print(tokenized_text.shape)
tokenized_text[:20]

(48, 8)


Unnamed: 0,text,lemma,pos,tag,is_alpha,is_stop,is_punctuation,entity
0,If,if,SCONJ,IN,True,True,False,
1,you,you,PRON,PRP,True,True,False,
2,are,be,AUX,VBP,True,True,False,
3,looking,look,VERB,VBG,True,False,False,
4,for,for,ADP,IN,True,True,False,
5,the,the,DET,DT,True,True,False,
6,secret,secret,ADJ,JJ,True,False,False,
7,ingredient,ingredient,NOUN,NN,True,False,False,
8,in,in,ADP,IN,True,True,False,
9,Robitussin,Robitussin,PROPN,NNP,True,False,False,GPE


In [34]:
entities = [[tokenized_text.text[i], tokenized_text.entity[i]] for i in range(0, len(tokenized_text.entity))]
print(entities)
entities1 = [e for e in entities if len(e[1]) > 0]
print(entities1)


[['If', ''], ['you', ''], ['are', ''], ['looking', ''], ['for', ''], ['the', ''], ['secret', ''], ['ingredient', ''], ['in', ''], ['Robitussin', 'GPE'], ['I', ''], ['believe', ''], ['I', ''], ['have', ''], ['found', ''], ['it', ''], ['.', ''], [' ', ''], ['I', ''], ['got', ''], ['this', ''], ['in', ''], ['addition', ''], ['to', ''], ['the', 'ORG'], ['Root', 'ORG'], ['Beer', 'ORG'], ['Extract', 'ORG'], ['I', 'ORG'], ['ordered', ''], ['(', ''], ['which', ''], ['was', ''], ['good', ''], [')', ''], ['and', ''], ['made', ''], ['some', ''], ['cherry', ''], ['soda', ''], ['.', ''], [' ', ''], ['The', ''], ['flavor', ''], ['is', ''], ['very', ''], ['medicinal', ''], ['.', '']]
[['Robitussin', 'GPE'], ['the', 'ORG'], ['Root', 'ORG'], ['Beer', 'ORG'], ['Extract', 'ORG'], ['I', 'ORG']]


In [37]:
entities = [[tokenObj.text, tokenObj.ent_type_] for tokenObj in doc]
print(entities)
entities1 = [e for e in entities if len(e[1]) > 0]
print(entities1)

[['If', ''], ['you', ''], ['are', ''], ['looking', ''], ['for', ''], ['the', ''], ['secret', ''], ['ingredient', ''], ['in', ''], ['Robitussin', 'GPE'], ['I', ''], ['believe', ''], ['I', ''], ['have', ''], ['found', ''], ['it', ''], ['.', ''], [' ', ''], ['I', ''], ['got', ''], ['this', ''], ['in', ''], ['addition', ''], ['to', ''], ['the', 'ORG'], ['Root', 'ORG'], ['Beer', 'ORG'], ['Extract', 'ORG'], ['I', 'ORG'], ['ordered', ''], ['(', ''], ['which', ''], ['was', ''], ['good', ''], [')', ''], ['and', ''], ['made', ''], ['some', ''], ['cherry', ''], ['soda', ''], ['.', ''], [' ', ''], ['The', ''], ['flavor', ''], ['is', ''], ['very', ''], ['medicinal', ''], ['.', '']]
[['Robitussin', 'GPE'], ['the', 'ORG'], ['Root', 'ORG'], ['Beer', 'ORG'], ['Extract', 'ORG'], ['I', 'ORG']]
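The token-level list above repeats the label for every token of a multi-word entity. spaCy already exposes whole entity spans through `doc.ents`, but the idea can be sketched on plain `(text, label)` pairs so that it runs without a model. The helper `group_entities` is hypothetical, and merging runs of identical labels is only an approximation: two adjacent entities of the same type would be fused.

```python
from itertools import groupby

# (token text, entity label) pairs, shortened from the output above.
token_entities = [
    ("in", ""), ("Robitussin", "GPE"), ("I", ""), ("believe", ""),
    ("to", ""), ("the", "ORG"), ("Root", "ORG"), ("Beer", "ORG"),
    ("Extract", "ORG"), ("I", "ORG"), ("ordered", ""),
]

def group_entities(pairs):
    """Merge runs of consecutive tokens that share a non-empty label."""
    spans = []
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        if label:  # skip tokens that are not part of any entity
            spans.append((" ".join(text for text, _ in run), label))
    return spans

print(group_entities(token_entities))
# [('Robitussin', 'GPE'), ('the Root Beer Extract I', 'ORG')]
```

The grouped view also makes the model's mistakes easier to spot: "Robitussin" is tagged as a place, and the ORG span swallows the surrounding words.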


#### Universal POS tags

- ADJ	 adjective
- ADP	 adposition
- ADV	 adverb
- AUX	 auxiliary
- CCONJ	 coordinating conjunction
- DET	 determiner
- INTJ	 interjection
- NOUN	 noun
- NUM	 numeral
- PART	 particle
- PRON	 pronoun
- PROPN	 proper noun
- PUNCT	 punctuation
- SCONJ	 subordinating conjunction
- SYM	 symbol
- VERB	 verb
- X	 other

Source: http://universaldependencies.org/u/pos/index.html
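These tags are handy for filtering. As a small sketch, the block below counts the tags in the first sentence of the twizzlers review and keeps only the lemmas of "content" words. The triples are copied from the spaCy output earlier in this notebook; the choice of `CONTENT_TAGS` is an assumption made for illustration.

```python
from collections import Counter

# (text, lemma, pos) triples copied from the spaCy output for the twizzlers review.
tokens = [
    ("My", "my", "PRON"), ("daughter", "daughter", "NOUN"), ("loves", "love", "VERB"),
    ("twizzlers", "twizzler", "NOUN"), ("and", "and", "CCONJ"), ("this", "this", "DET"),
    ("shipment", "shipment", "NOUN"), ("of", "of", "ADP"), ("six", "six", "NUM"),
    ("pounds", "pound", "NOUN"), ("really", "really", "ADV"), ("hit", "hit", "VERB"),
    ("the", "the", "DET"), ("spot", "spot", "NOUN"), (".", ".", "PUNCT"),
]

CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ"}  # assumed choice of "content" tags

# Count how often each universal POS tag occurs.
pos_counts = Counter(pos for _, _, pos in tokens)

# Keep only the lemmas of tokens whose tag is in CONTENT_TAGS.
content_lemmas = [lemma for _, lemma, pos in tokens if pos in CONTENT_TAGS]

print(pos_counts["NOUN"], pos_counts["VERB"])  # 5 2
print(content_lemmas)
# ['daughter', 'love', 'twizzler', 'shipment', 'pound', 'hit', 'spot']
```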

#### Visualizing entities using displacy

In [38]:
from spacy import displacy

In [41]:
review1 = df.Text[25]
doc = nlp(review1)
spacy.displacy.render(doc, style='ent', jupyter=True)

In [42]:
mytext = ''' 
India's industrial production growth slipped to five-month low of 1.1 per cent in March from 5.8 per cent in February 2023, mainly due to poor performance of power and manufacturing sectors, according to official data released on Friday.
'''
doc = nlp(mytext)
spacy.displacy.render(doc, style='ent', jupyter=True)

In [43]:
mytext = '''
 Inter-governmental body Shanghai Cooperation Organization members, which includes China and Pakistan, have unanimously adopted India's proposal for developing Digital Public Infrastructure, Union minister Ashwini Vaishnaw said on Saturday.
India has developed Digital Public Infrastructure (DPI) like unified payment interface, Aadhaar etc to make services available to people in a convenient manner.
'''
doc = nlp(mytext)
spacy.displacy.render(doc, style='ent', jupyter=True)

In [None]:
doc = nlp(review1)
# doc = nlp(df.Text[0])
spacy.displacy.render(doc, style='ent', jupyter=True)

In [44]:
spacy.explain('GPE')

'Countries, cities, states'

In [45]:
lt = ['GPE', 'CARDINAL','PERSON','DATE', 'ORG', 'LOC']

for i in lt:
  print(i, ':', spacy.explain(i))

GPE : Countries, cities, states
CARDINAL : Numerals that do not fall under another type
PERSON : People, including fictional
DATE : Absolute or relative dates or periods
ORG : Companies, agencies, institutions, etc.
LOC : Non-GPE locations, mountain ranges, bodies of water


#### Visualizing dependency parsing using displacy 

In [46]:
spacy.displacy.render(doc, style='dep', jupyter=True,options={'distance': 140})

####Visualizing using **Explacy**

In [47]:
!wget https://raw.githubusercontent.com/tylerneylon/explacy/master/explacy.py

--2023-05-14 04:17:00--  https://raw.githubusercontent.com/tylerneylon/explacy/master/explacy.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6896 (6.7K) [text/plain]
Saving to: ‘explacy.py’


2023-05-14 04:17:00 (80.0 MB/s) - ‘explacy.py’ saved [6896/6896]



In [48]:
import explacy

# explacy.print_parse_info(nlp, df.Text[0])
explacy.print_parse_info(nlp, mytext)

Dep tree                 Token          Dep type Lemma          Part of Sp
──────────────────────── ────────────── ──────── ────────────── ──────────
                     ┌─► 
              dep      
              SPACE     
                  ┌─►└── Inter          amod     inter          ADJ       
                  │ ┌──► -              amod     -              ADJ       
                  │ │┌─► governmental   amod     governmental   ADJ       
               ┌─►└─┴┴── body           compound body           NOUN      
               │    ┌──► Shanghai       compound Shanghai       PROPN     
               │    │┌─► Cooperation    compound Cooperation    PROPN     
               │ ┌─►└┴── Organization   compound Organization   PROPN     
          ┌─►┌┬┴─┴───┬── members        nsubj    member         NOUN      
          │  ││      └─► ,              punct    ,              PUNCT     
          │  ││      ┌─► which          nsubj    which          PRON      
          │  │└─►┌───┴── 

### Text Summarization

- When you open a news site, do you start reading every article? Probably not.
- We typically glance at the short news summary and then read the details if we are interested. Short, informative news summaries are now everywhere: magazines, news aggregator apps, research sites, etc.

- These summaries can be created automatically as the news comes in from various sources around the world.

- The method of extracting these summaries from a large original text without losing vital information is called **Text Summarization**.

- **Google News**, the **inshorts app**, and various other news aggregator apps take advantage of text summarization algorithms.




[**Click Here!**](https://www.machinelearningplus.com/nlp/text-summarization-approaches-nlp-example/) to learn more about Text Summarization

**Types of Text Summarization Methods**

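There are two main methods of text summarization:

- **Extractive**: selects the most important sentences from the original text and returns them unchanged.
- **Abstractive**: generates new sentences that paraphrase the key ideas of the original text.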
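To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and keeps the top ones in their original order. It is a simple word-frequency heuristic, not the TextRank-based approach behind Gensim's `summarize`; the function name, the tiny stop-word list, and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny stop-word list for illustration only; real pipelines use a fuller list.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "it", "as"}

def content_words(text):
    """Lowercase words of the text with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def extractive_summary(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)

text = (
    "Taffy is a chewy candy. This taffy came in many flavors. "
    "The shipping box was brown. Everyone loved the taffy flavors."
)
print(extractive_summary(text, max_sentences=2))
# Taffy is a chewy candy. Everyone loved the taffy flavors.
```

Every sentence in the result appears verbatim in the input, which is the defining property of an extractive summary.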





### Text Summarization using Gensim

In [49]:
# As per Gensim’s Github changelog 188, gensim.summarization module has been removed in versions Gensim 4.x
# as it was an unmaintained third-party module.

# To continue using gensim.summarization, you will need to downgrade the version of Gensim in the requirements.txt file
# by replacing it with gensim==3.8.3 or an older version
!pip install gensim==3.8.3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gensim==3.8.3
  Downloading gensim-3.8.3.tar.gz (23.4 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: gensim
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  Building wheel for gensim (setup.py) ... error
  ERROR: Failed building wheel for gensim
  Running setup.py clean for gensim
Failed to build gensim
ERROR: Could not build wheels for gensim, which is required to install pyproject.toml-based projects


In [50]:
import gensim
gensim.__version__

'4.3.1'

In [None]:
from gensim.summarization.summarizer import summarize 
from gensim.summarization import keywords 

In [None]:
news = '''
Chinese technology conglomerate Tencent has bought stake worth $264 million (about Rs 2,060 crore) in Flipkart from its co-founder Binny Bansal through its European subsidiary, according to official documents.
Singapore-headquartered e-commerce firm Flipkart has operations in India only.
Bansal holds around 1.84 per cent stake in Flipkart after selling part of his stake to Tencent Cloud Europe BV.
The transaction was completed on October 26, 2021 and was shared with the government authorities at the beginning of the current financial year.

Post the transaction, Tencent arm holds 0.72 per stake in Flipkart which is valued at around $264 million, as per last valuation of $37.6 billion disclosed by the e-commerce firm in July 2021.
The company's valuation surged to $37.6 billion after raising $3.6 billion (about Rs 26,805.6 crore) in funding round led by Singapore's sovereign wealth fund GIC, CPP Investments, SoftBank Vision Fund 2 and Walmart.
DisruptAD, Qatar Investment Authority, Khazanah Nasional Berhad as well as marquee investors Tencent, Willoughby Capital, Antara Capital, Franklin Templeton and Tiger Global also participated in the funding round.
The transaction between Bansal and Tencent took place after the July funding round.
Sources said the transaction took place in Singapore but Flipkart informed Indian authorities about it as a responsible entity and that the transaction does not fall under purview of 'Press Note 3' which calls for scrutiny of investment that any Indian company gets from countries sharing land border with India.
While there are several companies operating in India in which Tencent has made investment, the government has banned some gaming apps including PUBG Mobile, PUBG Mobile Lite which were published by Tencent Group. E-mail query sent to Flipkart and Bansal did not elicit any reply.
'''

news_summ = summarize(news, word_count = 100)
print(news_summ)

In [None]:
news = '''
Several Indian investors trading on cryptocurrency platforms have been losing crores of rupees and falling prey to fraudsters who promise high returns and goad them into opening wallets (accounts) on fake websites and transferring crypto coins they bought from genuine trading websites to these fake accounts.
STOI spoke to a few victims from Karnataka, Delhi, and Tamil Nadu and also came in contact with an alleged cyber fraudster to understand how the deceit plays out. Cheats offering advice on Telegram groups are behind most frauds.
A year ago, Delhi-based Nitin Vashishta was duped of about Rs 20 lakh. He bought 24,000 tether (USDT) from a genuine crypto exchange. On Telegram, he came in contact with a person who identified herself as Milanka and said she was an investment adviser working for a reputed firm. The woman used a Singapore virtual phone number. The scamster had the international number to gain the trust of targeted victims. She claimed to be based out of Hong Kong.

Milanka first asked Vashishta to register himself on a website and sent him the link. When he registered on the website using his mail ID and phone number, he was assigned a new digital wallet. The woman then told him to start trading. He bought USDT on an exchange. For further trading, he transferred the same to the new wallet assigned to him on the website. But he was not able to withdraw the USDT. The woman said he had to pay 30% commission of the total amount in the wallet to withdraw. His account was frozen. Even after he paid the commission, the fraudster told Vashishta that he had to pay more tax as withdrawing without taxes would be flagged as money laundering. Soon, the website too was taken down.
'''
news_summ = summarize(news, word_count = 100)
print(news_summ)

##### Getting wikipedia content and summarizing it

In [None]:
pip install wikipedia

In [None]:
import wikipedia 
wikisearch = wikipedia.page("A. P. J. Abdul Kalam") 
wikicontent = wikisearch.content 
print(wikicontent)

In [None]:
#Loading the nlp core
nlp = spacy.load('en_core_web_sm')
doc = nlp(wikicontent) 

In [None]:
# Summary (2% of the original content).
summ_per = summarize(wikicontent, ratio = 0.02) 
print("Percentagewise Summary\n") 
print(summ_per) 

In [None]:
# Summary (100 words)
summ_words = summarize(wikicontent, word_count = 100) 

print('--------------------------------------------------------------------------------------------------------------------------------')
print("Word count summary\n") 
print(summ_words) 

The parameters are:

- **ratio**: Takes values between 0 and 1. It is the proportion of the original text to keep in the summary.

- **word_count**: Sets the maximum number of words in the summary. If both parameters are given, `ratio` is ignored.
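As a rough illustration of how such parameters could translate into a selection budget, the sketch below turns a ratio into a sentence count and, separately, takes ranked sentences until a word limit is reached. Both helpers are hypothetical and are not Gensim's implementation.

```python
def sentence_budget(n_sentences, ratio=0.2):
    """Number of sentences to keep for a given ratio (always at least one)."""
    return max(1, int(n_sentences * ratio))

def take_until_word_count(ranked_sentences, word_count):
    """Take sentences in ranked order while the total stays within word_count."""
    chosen, total = [], 0
    for sentence in ranked_sentences:
        length = len(sentence.split())
        if total + length > word_count:
            break
        chosen.append(sentence)
        total += length
    return chosen

# Sentences assumed to be already sorted from most to least important.
ranked = ["a b c", "d e", "f g h i"]

print(sentence_budget(10, ratio=0.5))            # 5
print(take_until_word_count(ranked, word_count=5))  # ['a b c', 'd e']
```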

### PyCaret



####Overview of **Natural Language Processing** Module in **PyCaret**

PyCaret's NLP module (`pycaret.nlp`) is an unsupervised machine learning module for analyzing text data by building a topic model that finds hidden semantic structure in documents. It comes with a wide range of built-in text pre-processing techniques, which are a fundamental step in any NLP problem: they transform the raw text into a format that machine learning algorithms can learn from.

As of its first release, PyCaret's NLP module supports only the `English` language and provides several popular implementations of topic models, from Latent Dirichlet Allocation to Non-Negative Matrix Factorization. It has over 5 ready-to-use algorithms and over 10 plots for analyzing text. The module also implements a unique function, `tune_model()`, that tunes the hyperparameters of a topic model to optimize a supervised learning objective such as `AUC` for classification or `R2` for regression.



[**Click Here!**](https://pycaret.org/) to learn more about **PyCaret**

**Installing PyCaret**

- !pip install pycaret

####**Tasks to be performed**

- Import PyCaret and load the data set
- Setup the environment 
- Create a Topic model
- Assign a model
- Plot a model


####**Import PyCaret and load the data set**

In [None]:
!wget https://www.dropbox.com/s/x2aza32otpkc53b/spam.csv

In [None]:
pip install pycaret

In [None]:
#Loading the dataset
import pandas as pd
df = pd.read_csv('/content/spam.csv', encoding='latin-1')

df.head() #Printing the first 5 rows of dataframe

In [None]:
df.drop(['Unnamed: 2',	'Unnamed: 3',	'Unnamed: 4'], axis = 1, inplace = True)

In [None]:
df.head()

####**Setup the environment**

The `setup()` function initializes the environment in PyCaret and performs several text pre-processing steps that are essential for NLP problems. It must be called before any other PyCaret function. It takes two parameters: a pandas DataFrame and the name of the text column, passed as the `target` parameter. You can also pass a `list` of texts, in which case the `target` parameter is not needed. When `setup()` runs, the following pre-processing steps are applied automatically:

- **Removing Numeric Characters:** All numeric characters are removed from the text. They are replaced with blanks.<br/>
<br/>
- **Removing Special Characters:** All non-alphanumeric special characters are removed from the text. They are also replaced with blanks.<br/>
<br/>
- **Word Tokenization:** Word tokenization is the process of splitting a large sample of text into words. This is the core requirement in natural language processing tasks where each word needs to be captured separately for further analysis. __[Read More](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html)__ <br/>
<br/> 
- **Stopword Removal:** A stop word (or stopword) is a word that is often removed from text because it is common and provides little value for information retrieval, even though it might be linguistically meaningful. Examples of such words in English are "the", "a", "an", and "in". __[Read More](https://en.wikipedia.org/wiki/Stop_words)__ <br/>
<br/>
- **Bigram Extraction:** A bigram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. For example, "New York" is captured as two separate words, "New" and "York", when tokenization is performed, but if the pair is repeated often enough, Bigram Extraction represents it as a single token, i.e. "New_York". __[Read More](https://en.wikipedia.org/wiki/Bigram)__ <br/>
<br/>
- **Trigram Extraction:** Similar to bigram extraction, trigram is a sequence of three adjacent elements from a string of tokens. __[Read More](https://en.wikipedia.org/wiki/Trigram)__ <br/>
<br/>
- **Lemmatizing:** Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single word, identified by the word's lemma, or dictionary form. In English, a word can appear in several inflected forms. For example, the verb 'to walk' may appear as 'walk', 'walked', 'walks', or 'walking'. The base form, 'walk', that one might look up in a dictionary, is called the lemma of the word. __[Read More](https://en.wikipedia.org/wiki/Lemmatisation)__ <br/>
<br/>
- **Custom Stopwords:** Text often contains words that are not stop words by the rules of the language but add little or no information. For example, in a loan dataset, words like "loan", "bank", "money", and "business" are too obvious and add no value. More often than not, they also add a lot of noise to the topic model. You can remove such words from the corpus with the `custom_stopwords` parameter. <br/>
<br/>

**Note:** Some functionality in `pycaret.nlp` requires an English language model, which is not downloaded automatically when you install PyCaret. You have to download it from a command line interface such as Anaconda Prompt by typing the following:

`python -m spacy download en_core_web_sm` <br/>
`python -m textblob.download_corpora` <br/>
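Several of the pre-processing steps listed above can be sketched in plain Python to show what they do to the text. This is an illustration under simplifying assumptions, not PyCaret's implementation: the stop-word list is a tiny hypothetical one, the bigram rule is a bare frequency threshold, and lemmatization is left out because it needs a language model.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "in", "is", "to", "i"}  # tiny illustrative list

def preprocess(text):
    """Drop digits and special characters, lowercase, tokenize, remove stop words."""
    text = re.sub(r"[0-9]", " ", text)          # numeric characters -> blanks
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # special characters -> blanks
    tokens = text.lower().split()               # whitespace word tokenization
    return [t for t in tokens if t not in STOP_WORDS]

def merge_bigrams(docs, min_count=2):
    """Join adjacent word pairs seen at least min_count times, e.g. new_york."""
    pair_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and pair_counts[(doc[i], doc[i + 1])] >= min_count:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # both words were consumed by the bigram
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs

raw = ["I moved to New York in 2019!", "New York is the best city :)"]
docs = [preprocess(text) for text in raw]
print(merge_bigrams(docs))
# [['moved', 'new_york'], ['new_york', 'best', 'city']]
```

The pair ("new", "york") occurs in both documents, so it clears the threshold and is merged, while pairs seen only once are left as separate tokens.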

In [None]:
from pycaret.nlp import *

In [None]:
exp_nlp101 = setup(data = df, target = 'v2', session_id = 123)

####**Create a Topic Model**



**What is Topic Model?** 

- A **topic model** is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. 
- Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. 

Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. 

- The **topics** produced by topic modeling techniques are clusters of similar words. 

__[Read More](https://en.wikipedia.org/wiki/Topic_model)__
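The 10% cats / 90% dogs intuition above is simple arithmetic, worked through below. The helper `expected_topic_words` and the figure of 200 topical words are assumptions made for illustration.

```python
def expected_topic_words(n_topic_words, topic_percentages):
    """Split a document's topical words across topics by their percentage share."""
    return {topic: n_topic_words * pct // 100 for topic, pct in topic_percentages.items()}

# A document with 200 topical words that is 10% about cats and 90% about dogs.
counts = expected_topic_words(200, {"cats": 10, "dogs": 90})

print(counts)                            # {'cats': 20, 'dogs': 180}
print(counts["dogs"] / counts["cats"])   # 9.0
```

A topic model works in the opposite direction: it observes only the word counts and infers the topic proportions that most plausibly produced them.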



Creating a topic model in PyCaret 

- A topic model is created using **create_model()** function which takes one mandatory parameter i.e. name of model as a string

- Returns a trained model object
- There are 5 topic models available in PyCaret


In [None]:
lda = create_model('lda')

In [None]:
print(lda)

___
- Notice that the **num_topics** parameter is set to **4**, which is the default value.

- In the example below, we create an LDA model with 6 topics and also set the **multi_core** parameter to **True**.

- When **multi_core** is set to **True**, Latent Dirichlet Allocation (LDA) uses all CPU cores to parallelize and speed up model training.
___

In [None]:
lda2 = create_model('lda', num_topics = 6, multi_core = True)

In [None]:
print(lda2)

####**Assign a Model**



Now that we have created a topic model, we would like to assign the topic proportions to our dataset to analyze the results. We will do this with the `assign_model()` function. See an example below:

In [None]:
lda_results = assign_model(lda)
lda_results.head()

- Notice that 6 additional columns have been added to the dataframe.

- The text column (`v2` here, the column passed as `target`) holds the text after all pre-processing.
- `Topic_0 ... Topic_3` are the topic proportions and represent the distribution of topics for each document. `Dominant_Topic` is the topic number with the highest proportion, and `Perc_Dominant_Topic` is the share of the dominant topic out of 1 (only shown when models are stochastic, i.e. the proportions sum to 1).
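How the last two columns relate to the topic proportions can be sketched as follows. The helper `dominant_topic`, the sample proportions, and the `"Topic N"` label format are assumptions made for illustration; this is not PyCaret's internal code.

```python
def dominant_topic(proportions):
    """Return (topic label, share) for the largest topic proportion."""
    index = max(range(len(proportions)), key=lambda i: proportions[i])
    return f"Topic {index}", round(proportions[index], 4)

# Hypothetical proportions for Topic_0 ... Topic_3 of two documents.
rows = [
    [0.05, 0.80, 0.10, 0.05],
    [0.25, 0.10, 0.15, 0.50],
]

for proportions in rows:
    print(dominant_topic(proportions))
# ('Topic 1', 0.8)
# ('Topic 3', 0.5)
```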

####**Plot a Model**

#####Frequency Distribution of Topic 1

`plot_model()` can also generate plots for a specific topic. For topic-level plots, pass the trained model object to `plot_model()` and select the topic with the `topic_num` parameter. The example below generates the frequency distribution for `Topic 1` only.

In [None]:
#plot_model(lda, plot = 'frequency', topic_num = 'Topic 1')

#####Topic Distribution

In [None]:
#plot_model(lda, plot = 'topic_distribution')

Each document is a distribution of topics rather than a single topic. However, if the task is to categorize documents into specific topics, it is reasonable to assign each document to the topic with the highest proportion. In the plot above, each document is categorized into one topic using the largest topic weight. Most of the documents fall in `Topic 3`, with only a few in `Topic 1`. Hovering over a bar shows the keywords of that topic, which gives a basic idea of its theme. For example, `Topic 2` shows keywords like 'farmer', 'rice', and 'land', which probably means that the loan applicants in this category are seeking agricultural/farming loans. However, hovering over `Topic 0` and `Topic 3` reveals many repetitions and overlapping keywords: the words "loan" and "business" appear in both `Topic 0` and `Topic 3`.
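The categorization described above, one topic per document chosen by the largest weight and then counted per topic, can be sketched in a few lines. The topic weights below are hypothetical values invented for illustration, not results from this dataset.

```python
from collections import Counter

# Hypothetical topic weights for five documents (each row sums to 1).
doc_topics = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.80],
    [0.10, 0.20, 0.10, 0.60],
    [0.10, 0.60, 0.20, 0.10],
    [0.20, 0.10, 0.10, 0.60],
]

def assign_topic(weights):
    """Index of the topic with the largest weight."""
    return max(range(len(weights)), key=lambda i: weights[i])

# Count how many documents land in each topic and draw a text bar chart.
distribution = Counter(f"Topic {assign_topic(w)}" for w in doc_topics)
for topic, n_docs in sorted(distribution.items()):
    print(f"{topic}: {'#' * n_docs} ({n_docs})")
# Topic 0: # (1)
# Topic 1: # (1)
# Topic 3: ### (3)
```

A topic that never wins, such as `Topic 2` here, simply does not appear in the counts.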