# **Python Text Pre-Processing**
**Dataset used**: Fake and real news detection dataset (`fake_and_real_news.csv`).

**Objectives:**

*   Text Cleaning
*   Tokenization
*   Stopwords
*   Stemming
*   Lemmatization
*   N-grams (bigrams and trigrams)
*   Word Frequency Analysis

**Loading Dataset**

In [1]:
import pandas as pd
df = pd.read_csv('/content/fake_and_real_news.csv')
df.head()


| | Text | label |
|---|---|---|
| 0 | Top Trump Surrogate BRUTALLY Stabs Him In The... | Fake |
| 1 | U.S. conservative leader optimistic of common ... | Real |
| 2 | Trump proposes U.S. tax overhaul, stirs concer... | Real |
| 3 | Court Forces Ohio To Allow Millions Of Illega... | Fake |
| 4 | Democrats say Trump agrees to work on immigrat... | Real |


In [2]:
df.columns


Index(['Text', 'label'], dtype='object')

In [26]:
len(df)

9900

**Pre-processing a single text**

In [3]:
sample_text = df['Text'][0]
sample_text


' Top Trump Surrogate BRUTALLY Stabs Him In The Back: ‘He’s Pathetic’ (VIDEO) It s looking as though Republican presidential candidate Donald Trump is losing support even from within his own ranks. You know things are getting bad when even your top surrogates start turning against you, which is exactly what just happened on Fox News when Newt Gingrich called Trump  pathetic. Gingrich knows that Trump needs to keep his focus on Hillary Clinton if he even remotely wants to have a chance at defeating her. However, Trump has hurt feelings because many Republicans don t support his sexual assault against women have turned against him, including House Speaker Paul Ryan (R-WI). So, that has made Trump lash out as his own party.Gingrich said on Fox News: Look, first of all, let me just say about Trump, who I admire and I ve tried to help as much as I can. There s a big Trump and a little Trump. The little Trump is frankly pathetic. I mean, he s mad over not getting a phone call? Trump s referr

**1. Text cleaning: removing non-letter characters (digits, punctuation) and lowercasing**

In [4]:
import re

cleaned_text = re.sub(r'[^a-zA-Z\s]', '', sample_text)
cleaned_text = cleaned_text.lower().strip()
cleaned_text


'top trump surrogate brutally stabs him in the back hes pathetic video it s looking as though republican presidential candidate donald trump is losing support even from within his own ranks you know things are getting bad when even your top surrogates start turning against you which is exactly what just happened on fox news when newt gingrich called trump  pathetic gingrich knows that trump needs to keep his focus on hillary clinton if he even remotely wants to have a chance at defeating her however trump has hurt feelings because many republicans don t support his sexual assault against women have turned against him including house speaker paul ryan rwi so that has made trump lash out as his own partygingrich said on fox news look first of all let me just say about trump who i admire and i ve tried to help as much as i can there s a big trump and a little trump the little trump is frankly pathetic i mean he s mad over not getting a phone call trump s referring to the fact that paul ry
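
In the output above, deleting punctuation outright fuses some words: "party.Gingrich" becomes "partygingrich" and "(R-WI)" becomes "rwi". The following is a minimal sketch of one possible alternative, not the notebook's method: it replaces each non-letter character with a space and then collapses repeated whitespace. The function name `clean_text` is hypothetical.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase text and replace every non-letter character with a space."""
    # Substituting a space (rather than '') keeps 'party.Gingrich' as two words.
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    # Collapse the runs of whitespace created by the substitution.
    text = re.sub(r"\s+", " ", text)
    return text.lower().strip()

print(clean_text("lash out as his own party.Gingrich said: Look!"))
# lash out as his own party gingrich said look
```

The trade-off is that contractions are split instead of merged ("He’s" becomes "he s" rather than "hes"), so which variant is preferable depends on the later steps.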

**2. Tokenization: split text into words**


In [10]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams
from collections import Counter

nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [14]:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(cleaned_text)
tokens[:30]


['top',
 'trump',
 'surrogate',
 'brutally',
 'stabs',
 'him',
 'in',
 'the',
 'back',
 'hes',
 'pathetic',
 'video',
 'it',
 's',
 'looking',
 'as',
 'though',
 'republican',
 'presidential',
 'candidate',
 'donald',
 'trump',
 'is',
 'losing',
 'support',
 'even',
 'from',
 'within',
 'his',
 'own']

**3. Stopwords: remove common stop words such as "in", "is", and "the"**

In [17]:
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]
filtered_tokens[:30]

['top',
 'trump',
 'surrogate',
 'brutally',
 'stabs',
 'back',
 'hes',
 'pathetic',
 'video',
 'looking',
 'though',
 'republican',
 'presidential',
 'candidate',
 'donald',
 'trump',
 'losing',
 'support',
 'even',
 'within',
 'ranks',
 'know',
 'things',
 'getting',
 'bad',
 'even',
 'top',
 'surrogates',
 'start',
 'turning']
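
Stop-word filtering is a plain set-membership test, and NLTK's English list is lowercase, so it only works here because the text was lowercased in step 1. The sketch below illustrates this with a tiny, hypothetical stop-word set and hypothetical tokens; it does not use the NLTK list.

```python
# Hypothetical mini stop-word set; NLTK's English list is much longer.
mini_stop_words = {"in", "is", "the", "him"}

raw_tokens = ["Stabs", "Him", "In", "The", "Back"]

# Without lowercasing, capitalised stop words are not matched and survive.
kept_raw = [t for t in raw_tokens if t not in mini_stop_words]

# Lowercasing each token first lets the membership test work as intended.
kept_lower = [t.lower() for t in raw_tokens if t.lower() not in mini_stop_words]

print(kept_raw)    # ['Stabs', 'Him', 'In', 'The', 'Back']
print(kept_lower)  # ['stabs', 'back']
```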

**4. Stemming: strip suffixes to reduce words to a common stem using PorterStemmer (stems such as 'surrog' need not be real words)**

In [18]:
ps = PorterStemmer()
stemmed_tokens = [ps.stem(word) for word in filtered_tokens]
stemmed_tokens[:30]

['top',
 'trump',
 'surrog',
 'brutal',
 'stab',
 'back',
 'he',
 'pathet',
 'video',
 'look',
 'though',
 'republican',
 'presidenti',
 'candid',
 'donald',
 'trump',
 'lose',
 'support',
 'even',
 'within',
 'rank',
 'know',
 'thing',
 'get',
 'bad',
 'even',
 'top',
 'surrog',
 'start',
 'turn']

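**5. Lemmatization: map words to their dictionary form (lemma), so the results are real words rather than truncated stems**

Note: `WordNetLemmatizer.lemmatize` treats each word as a noun unless a part-of-speech tag is passed, so verb forms such as 'looking' and 'losing' are left unchanged in the output below.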

In [19]:
lemm = WordNetLemmatizer()
lemmatized_tokens = [lemm.lemmatize(word) for word in filtered_tokens]
lemmatized_tokens[:30]

['top',
 'trump',
 'surrogate',
 'brutally',
 'stab',
 'back',
 'he',
 'pathetic',
 'video',
 'looking',
 'though',
 'republican',
 'presidential',
 'candidate',
 'donald',
 'trump',
 'losing',
 'support',
 'even',
 'within',
 'rank',
 'know',
 'thing',
 'getting',
 'bad',
 'even',
 'top',
 'surrogate',
 'start',
 'turning']

**6. N-grams (bigrams & trigrams): sequences of two and three consecutive words**

In [22]:
bigrams = list(ngrams(filtered_tokens, 2))
trigrams = list(ngrams(filtered_tokens, 3))
print("First 10 bigrams:\n", bigrams[:10])
print("\nFirst 10 trigrams:\n", trigrams[:10])

First 10 bigrams:
 [('top', 'trump'), ('trump', 'surrogate'), ('surrogate', 'brutally'), ('brutally', 'stabs'), ('stabs', 'back'), ('back', 'hes'), ('hes', 'pathetic'), ('pathetic', 'video'), ('video', 'looking'), ('looking', 'though')]

First 10 trigrams:
 [('top', 'trump', 'surrogate'), ('trump', 'surrogate', 'brutally'), ('surrogate', 'brutally', 'stabs'), ('brutally', 'stabs', 'back'), ('stabs', 'back', 'hes'), ('back', 'hes', 'pathetic'), ('hes', 'pathetic', 'video'), ('pathetic', 'video', 'looking'), ('video', 'looking', 'though'), ('looking', 'though', 'republican')]
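
To make the sliding-window idea concrete, here is a minimal standard-library sketch that builds the same kind of tuples by zipping staggered views of the token list. The helper name `make_ngrams` is hypothetical; it is meant as an illustration of the idea, not as a replacement for `nltk.util.ngrams`.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n staggered slices: tokens[0:], tokens[1:], ..., tokens[n-1:]
    return list(zip(*(tokens[i:] for i in range(n))))

sample = ['top', 'trump', 'surrogate', 'brutally']
print(make_ngrams(sample, 2))
# [('top', 'trump'), ('trump', 'surrogate'), ('surrogate', 'brutally')]
print(make_ngrams(sample, 3))
# [('top', 'trump', 'surrogate'), ('trump', 'surrogate', 'brutally')]
```

A list of k tokens yields k - n + 1 n-grams, which is why four tokens give three bigrams and two trigrams.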


**7. Frequency analysis: count how often each word occurs**

In [23]:
word_freq = Counter(filtered_tokens)
word_freq.most_common(10)

[('trump', 14),
 ('pathetic', 3),
 ('even', 3),
 ('paul', 3),
 ('ryan', 3),
 ('top', 2),
 ('donald', 2),
 ('support', 2),
 ('getting', 2),
 ('fox', 2)]
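
The same `Counter` approach also works on n-grams, which can surface recurring phrases rather than single words. The sketch below uses a short, hypothetical token list instead of the article's `filtered_tokens`, so the counts are illustrative only.

```python
from collections import Counter

# Hypothetical tokens, standing in for filtered_tokens.
example_tokens = ['paul', 'ryan', 'said', 'paul', 'ryan', 'agreed']

# Pair each token with its successor to form bigrams, then count them.
bigram_freq = Counter(zip(example_tokens, example_tokens[1:]))

print(bigram_freq.most_common(1))
# [(('paul', 'ryan'), 2)]
```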