
Short notes:

The **Natural Language Toolkit (NLTK)** is a Python platform for building programs that analyse text. One of its most useful features is part-of-speech tagging.

**Part-of-speech (POS) tagging** converts a sentence into a list of words and then into a list of tuples, where each tuple has the form (word, tag). The tag is a part-of-speech tag: it indicates whether the word is a noun, adjective, verb, and so on.

**keywords**:

Corpus : A body of text. The plural is corpora.

Lexicon : Words and their meanings.

Token : Each “entity” produced when text is split up according to a set of rules, for example a word or a punctuation mark.
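
The three keywords can be illustrated with plain Python. This is only a minimal sketch: the sample text, the splitting rule, and the lexicon entries are made up for illustration and are not part of NLTK.

```python
import re

# Hypothetical miniature corpus: a body of text to analyse.
corpus = "The desk is big. The desks are bigger."

# Tokens: the units produced by a splitting rule.
# Rule used here (an assumption): runs of letters, or single punctuation marks.
tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", corpus)

# Lexicon: words mapped to their meanings (hand-written, illustrative entries).
lexicon = {
    "desk": "a table used for writing or working",
    "big": "large in size",
}

# Tokens that have an entry in the lexicon.
known = [t for t in tokens if t.lower() in lexicon]

print(tokens)
print(known)
```

A real tokenizer such as `word_tokenize` applies more elaborate rules, but the idea is the same: a rule decides where one token ends and the next begins.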

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Aim: Implement problems on natural language processing using NLTK: part-of-speech tagging, n-grams and smoothing, and chunking.

**Tags and their meanings**

CD cardinal number

EX existential there (like: “there is” … think of it like “there exists”)

FW foreign word

IN preposition/subordinating conjunction

JJ adjective ‘big’

JJR adjective, comparative ‘bigger’

JJS adjective, superlative ‘biggest’

NN noun, singular ‘desk’

NNS noun plural ‘desks’

NNP proper noun, singular ‘Harrison’

NNPS proper noun, plural ‘Americans’

PDT predeterminer ‘all the kids’

POS possessive ending ‘parent’s’

PRP personal pronoun ‘I’, ‘he’, ‘she’

PRP$ possessive pronoun ‘my’, ‘his’, ‘hers’

RB adverb ‘very’, ‘silently’

RBR adverb, comparative ‘better’

RBS adverb, superlative ‘best’

RP particle ‘give up’
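
A small lookup table makes tagged output easier to read. The sketch below maps a few of the tags listed above to their meanings. The tagged words are written by hand for illustration and are not the output of an actual tagger, and `describe` is a hypothetical helper name.

```python
# A few tags from the list above, mapped to their meanings.
tag_meanings = {
    "JJ": "adjective",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "PRP": "personal pronoun",
    "RB": "adverb",
}

# Hand-tagged (word, tag) tuples, illustrative only (not tagger output).
tagged_example = [("she", "PRP"), ("silently", "RB"), ("big", "JJ"), ("desks", "NNS")]

def describe(tagged):
    """Turn (word, tag) tuples into readable 'word: meaning' strings."""
    return [f"{word}: {tag_meanings.get(tag, 'unknown tag')}" for word, tag in tagged]

for line in describe(tagged_example):
    print(line)
```

The same helper can be applied to the list returned by `nltk.pos_tag`, since that list has the same (word, tag) shape.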


**N-grams** are contiguous sequences of n words, symbols or tokens in a document. In other words, an n-gram is a run of n neighbouring items: a unigram is a single item, a bigram is two items in a row, and a trigram is three.

**Steps for n-gram model:**

Explore the dataset

Feature extraction

Train-test split

Basic pre-processing

Code to generate N-grams

Creating unigrams

Creating bigrams

Creating trigrams
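
The last three steps can be sketched in a few lines of plain Python. This is a minimal illustration on a made-up sentence, not the code used later in this notebook, and `ngrams_from_tokens` is a hypothetical helper name.

```python
from collections import Counter

def ngrams_from_tokens(tokens, n):
    """Return every run of n neighbouring tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Illustrative sentence, split on whitespace.
tokens = "the sun rises in the east".split()

unigrams = ngrams_from_tokens(tokens, 1)
bigrams = ngrams_from_tokens(tokens, 2)
trigrams = ngrams_from_tokens(tokens, 3)

print(len(unigrams), len(bigrams), len(trigrams))
# Counting n-grams is the usual next step when building an n-gram model.
print(Counter(unigrams).most_common(1))
```

A sentence with k tokens yields k - n + 1 n-grams, so the longer the n-gram, the fewer of them a sentence contains.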

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Resources needed for stop words, tokenization and POS tagging.
# Newer NLTK releases may instead ask for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng'.
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

stop_words = set(stopwords.words('english'))

# Dummy text
txt = "Hello. MCA S3 is fantastic. we learn many new concepts and implementations.List of all the data science is a new paper."

# sent_tokenize splits the text into sentences using NLTK's
# pre-trained PunktSentenceTokenizer
tokenized = sent_tokenize(txt)
for sentence in tokenized:
    # word_tokenize splits a sentence into words and punctuation
    wordList = word_tokenize(sentence)

    # remove stop words from wordList
    wordList = [w for w in wordList if w not in stop_words]

    # pos_tag labels each remaining token with a part-of-speech tag
    # and returns a list of (word, tag) tuples
    tagged = nltk.pos_tag(wordList)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
print(stop_words)

{'him', 'mustn', 'hadn', 'were', 'no', 'in', 'by', 'all', 'a', "doesn't", 'against', "you'll", 'on', "didn't", 'after', 'i', 'am', 'ma', 's', 'until', 'with', 'being', 'was', 'if', 'don', 'most', 'while', 'an', 'wasn', 'then', 'now', 'ours', 'doing', 'more', 'theirs', 'too', 'needn', 'weren', 'been', 'here', "haven't", 'at', 'do', 'myself', 'mightn', 'down', "you're", 'own', 'just', 'what', 'shan', "aren't", "shan't", 'his', "mightn't", 'hers', 'she', 'other', 'yourself', 'your', 'should', 've', 'this', 'so', 'off', 'where', 'wouldn', 'll', "mustn't", 'as', 'their', 'd', 're', 'you', 'each', 'again', 'but', 'through', 'how', "should've", 'of', 'and', 'same', 'y', "wasn't", 'because', 'did', 'any', 'yourselves', 'these', 'himself', 'than', 'such', 'who', 'during', "needn't", 'they', 'ourselves', 'its', 'her', 'are', 't', 'few', 'into', 'why', 'not', 'isn', 'both', 'herself', "hasn't", 'between', 'won', "wouldn't", 'for', 'it', 'itself', 'very', 'before', 'is', "don't", 'we', 'below', 'd

In [None]:
print(tokenized)

['Hello.', 'MCA S3 is fantastic.', 'we learn many new concepts and implementations.List of all the data science is a new paper.']
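
In the output above, the last two sentences stay together. This appears to be because the dummy text has no space after the period in `implementations.List`. One possible pre-processing step, sketched here with the standard `re` module, is to insert the missing space before tokenizing. The regular expression is an assumption chosen for this text, not a general-purpose rule.

```python
import re

txt = "Hello. MCA S3 is fantastic. we learn many new concepts and implementations.List of all the data science is a new paper."

# Insert a space after a period that is directly followed by an uppercase letter.
# Caveat: this would also break up abbreviations written like "U.S.A".
fixed = re.sub(r"\.(?=[A-Z])", ". ", txt)

print(fixed)
```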


N-gram model

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Note: matplotlib 3.6+ renamed this style to 'seaborn-v0_8'
plt.style.use(style='seaborn')

#get the data from https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news/version/5

colnames = ['Sentiment' , 'news']

df = pd.read_csv('all-data.csv.csv' , encoding="ISO-8859-1",names=colnames)
df.head()


Unnamed: 0,Sentiment,news
0,neutral,"According to Gran , the company has no plans t..."
1,neutral,Technopolis plans to develop in stages an area...
2,negative,The international electronic industry company ...
3,positive,With the new production plant the company woul...
4,positive,According to the company 's updated strategy f...


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4846 entries, 0 to 4845
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Sentiment  4846 non-null   object
 1   news       4846 non-null   object
dtypes: object(2)
memory usage: 75.8+ KB


In [None]:
df['Sentiment'].value_counts()

neutral     2879
positive    1363
negative     604
Name: Sentiment, dtype: int64
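
The counts above show that the classes are imbalanced. As a quick check, the share of each class can be computed directly from the printed numbers. The dictionary below simply copies those counts.

```python
# Class counts copied from the value_counts() output above.
counts = {"neutral": 2879, "positive": 1363, "negative": 604}

total = sum(counts.values())

# Percentage share of each class, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)
print(shares)
```

A model that always predicts the largest class would already be right for that share of the rows, which is a useful baseline to keep in mind when reading accuracy figures.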

In [None]:
y=df['Sentiment'].values
y.shape

(4846,)

In [None]:
x=df['news'].values
x.shape

(4846,)

In [None]:
#Split the data into train (60%) and test (40%) sets
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.4,random_state=42)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(2907,)
(1939,)
(2907,)
(1939,)
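
The printed shapes can be checked by hand. The sketch below assumes that the test count is the requested fraction rounded up and that the training set receives the remaining rows, which matches the shapes shown above.

```python
import math

n_samples = 4846   # rows in the full dataset
test_size = 0.4    # fraction passed to train_test_split

# Assumption: test count is rounded up, train gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```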


In [None]:
#removing punctuations
#library that contains punctuation
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
#Make train dataset as a dataframe
df1=pd.DataFrame(x_train)
df1=df1.rename(columns={0:'News'})
df2=pd.DataFrame(y_train)
df2=df2.rename(columns={0:'Sentiment'})
df_train=pd.concat([df1,df2],axis=1)
df_train

Unnamed: 0,News,Sentiment
0,Exel Composites ' long-term growth prospects r...,positive
1,The Samsung Mobile Applications Store was laun...,neutral
2,Altogether CapMan employs approximately 150 pe...,neutral
3,The segments through which the company operate...,neutral
4,UK 's Sarantel to outsource part of its proces...,neutral
...,...,...
2902,"The currency effect had a 3.0 pct , or 20 mln ...",negative
2903,`` Lidskoe Pivo 's investment program foresees...,positive
2904,Products include Consumer Electronics devices ...,neutral
2905,The bridge is part of the highway 14 developme...,neutral


In [None]:
#Make test dataset as a dataframe
#(built from x_test and y_test, not from the training arrays)
df3=pd.DataFrame(x_test)
df3=df3.rename(columns={0:'News'})
df4=pd.DataFrame(y_test)
df4=df4.rename(columns={0:'Sentiment'})
df_test=pd.concat([df3,df4],axis=1)
df_test


In [None]:
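#defining the function to remove punctuation
def remove_punctucation(text):
  #leave missing values unchanged (NaN is a float)
  if isinstance(text, float):
    return text
  #keep only the characters that are not punctuation
  ans=""
  for i in text:
    if i not in string.punctuation:
      ans+=i
  return ans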

In [None]:
text='WELCOME_ TO_ THE_ WORLD:'
remove_punctucation(text)

'WELCOME TO THE WORLD'
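
The same result can be obtained without a character-by-character loop. The sketch below uses the built-in `str.translate` with a deletion table. `strip_punctuation` is a hypothetical name for this alternative, not a function defined elsewhere in this notebook.

```python
import string

# Translation table that deletes every character in string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Remove punctuation from a string; return non-strings (e.g. NaN) unchanged."""
    if not isinstance(text, str):
        return text
    return text.translate(_PUNCT_TABLE)

print(strip_punctuation("WELCOME_ TO_ THE_ WORLD:"))
```

Building the table once and reusing it avoids repeated string concatenation, which can matter when the function is applied to every row of a DataFrame.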

In [None]:
#overwriting the News column with the punctuation-free text in both dataframes
df_train['News'] = df_train['News'].apply(remove_punctucation)
df_test['News'] = df_test['News'].apply(remove_punctucation)
df_train.head()

#punctuations are removed from news column in train dataset

Unnamed: 0,News,Sentiment
0,Exel Composites longterm growth prospects rem...,positive
1,The Samsung Mobile Applications Store was laun...,neutral
2,Altogether CapMan employs approximately 150 pe...,neutral
3,The segments through which the company operate...,neutral
4,UK s Sarantel to outsource part of its process...,neutral


In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
#method to generate n-grams:
#params:
#text-the text for which we have to generate n-grams
#ngram-number of words in each gram (1,2,3,4 etc., default value=1)
def generate_N_grams(text,ngram=1):
  #build the stopword set once; the comparison is case-sensitive, so 'The' is kept
  stop_words=set(stopwords.words('english'))
  words=[word for word in text.split(" ") if word not in stop_words]
  print("Sentence after removing stopwords:",words)
  #zip the word list with copies of itself shifted by 1..ngram-1 positions
  temp=zip(*[words[i:] for i in range(0,ngram)])
  ans=[' '.join(gram) for gram in temp]
  return ans

In [None]:
generate_N_grams("The sun rises in the east",2)

Sentence after removing stopwords: ['The', 'sun', 'rises', 'east']


['The sun', 'sun rises', 'rises east']

In [None]:
generate_N_grams("The sun rises in the east",3)

Sentence after removing stopwords: ['The', 'sun', 'rises', 'east']


['The sun rises', 'sun rises east']

In [None]:
generate_N_grams("The sun rises in the east",4)

Sentence after removing stopwords: ['The', 'sun', 'rises', 'east']


['The sun rises east']
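
The aim also mentions smoothing, which deals with n-grams that never occur in the training text. A minimal sketch of add-one (Laplace) smoothing for bigram probabilities is given below. The sample sentence and the helper name `laplace_bigram_prob` are made up for illustration, and this is only one of several smoothing methods.

```python
from collections import Counter

# Illustrative training text, split on whitespace.
tokens = "the sun rises in the east and the sun sets in the west".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
vocab_size = len(unigram_counts)

def laplace_bigram_prob(prev_word, word):
    """Estimate P(word | prev_word) with add-one smoothing."""
    return (bigram_counts[(prev_word, word)] + 1) / (unigram_counts[prev_word] + vocab_size)

# A bigram seen in the text and one that never occurs.
print(laplace_bigram_prob("the", "sun"))
print(laplace_bigram_prob("the", "rises"))
```

Adding one to every count gives unseen bigrams a small non-zero probability, while the estimates for a given previous word still sum to one over the vocabulary.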

In [None]:
s1 = ['ABEEL V ASHRAF','ADHITHYA BINU','AKASH MADHUSUDHANAN','AKSHAYA VIJAYAN']
s2 = [1,2,3,4]
s3 = zip(s1,s2)
print(set(s3))

{('ABEEL V ASHRAF', 1), ('AKASH MADHUSUDHANAN', 3), ('ADHITHYA BINU', 2), ('AKSHAYA VIJAYAN', 4)}


In [None]:
name=['KRISHNANAND','S']
s = ' '.join(name)
s

'KRISHNANAND S'
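
The two cells above show the building blocks of `generate_N_grams`: `zip` pairs up items from several lists, and `join` glues words into one string. The sketch below combines them step by step for bigrams, using the word list printed earlier.

```python
# Word list as printed after stopword removal.
words = ["The", "sun", "rises", "east"]
n = 2

# Copies of the list, each shifted one position further to the right.
shifted = [words[i:] for i in range(n)]

# zip stops at the shortest list, so each tuple holds n neighbouring words.
pairs = list(zip(*shifted))

# Join each tuple into a single space-separated string.
bigrams = [" ".join(p) for p in pairs]

print(shifted)
print(pairs)
print(bigrams)
```

Because `zip` stops at the shortest input, no explicit bounds check is needed at the end of the list.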