## Text Preprocessing

### Importing libraries and setting path

In [1]:
import pandas as pd
import os
# text preprocessing
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Increase width to see the article_title clearly
pd.set_option('display.max_colwidth', None)

# Our project path
project_path = r"C:\Users\HP\Desktop\For CV\Project 5"
# check working directory
print("Working directory:",os.getcwd())
# change working directory and list all files in new directory
os.chdir(project_path)
print("New working directory", project_path)
print()
for dirname, _, filenames in os.walk(project_path):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Working directory: C:\Users\HP\Desktop\For CV\Project 5\preprocessing
New working directory C:\Users\HP\Desktop\For CV\Project 5

C:\Users\HP\Desktop\For CV\Project 5\.ipynb_checkpoints\articles_scraper-checkpoint.ipynb
C:\Users\HP\Desktop\For CV\Project 5\data cleaning\clean_df.csv
C:\Users\HP\Desktop\For CV\Project 5\data cleaning\data_cleaning.html
C:\Users\HP\Desktop\For CV\Project 5\data cleaning\data_cleaning.ipynb
C:\Users\HP\Desktop\For CV\Project 5\data cleaning\.ipynb_checkpoints\data_cleaning-checkpoint.ipynb
C:\Users\HP\Desktop\For CV\Project 5\data collection\articles_scraper.html
C:\Users\HP\Desktop\For CV\Project 5\data collection\articles_scraper.ipynb
C:\Users\HP\Desktop\For CV\Project 5\data collection\arts_2.csv
C:\Users\HP\Desktop\For CV\Project 5\data collection\arts_categories.txt
C:\Users\HP\Desktop\For CV\Project 5\data collection\economy_2.csv
C:\Users\HP\Desktop\For CV\Project 5\data collection\sports_2.csv
C:\Users\HP\Desktop\For CV\Project 5\data collection\
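
The same absolute path is typed out again in several later cells. As a minimal sketch (not part of the original notebook), the paths could be built once from the project root with `pathlib`. The variable names `project_root`, `clean_csv` and `output_dir` are assumptions made for illustration.

```python
from pathlib import PureWindowsPath

# PureWindowsPath is used so the sketch behaves the same on any OS;
# inside the notebook itself, pathlib.Path would be the natural choice.
project_root = PureWindowsPath(r"C:\Users\HP\Desktop\For CV\Project 5")

# Build the file locations used later from the single root.
clean_csv = project_root / "data cleaning" / "clean_df.csv"
output_dir = project_root / "preprocessing"

print(clean_csv)
print(output_dir / "final_df_1.csv")
```

With this layout, moving the project only requires changing `project_root`.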

### Load our data

In [2]:
df = pd.read_csv(r"C:\Users\HP\Desktop\For CV\Project 5\data cleaning\clean_df.csv", encoding='ISO-8859-1')
print(df.columns)
# drop the 'Unnamed: 0' column
df.drop(columns = 'Unnamed: 0', inplace=True)

Index(['Unnamed: 0', 'article_title', 'category'], dtype='object')


In [3]:
display(df.head())
print()
display(df.tail())

Unnamed: 0,article_title,category
0,"Egypt-based Paymob raises $50 mln in Series B funding round Egypt-based Paymob, a financial services for merchants platform, raised $50 million in Series B funding, which the platform will use to grow its product range, expanding in the Egyptian market as well as into new markets across the Middle East and Africa region.",economy
1,"Egyptâs inflation speeds up amid war in Ukraine, rising food and energy prices Egyptâs headline annual inflation rate accelerated to 14.9 percent in April, up from the 12.1 percent recorded in March and 4.4 percent in the corresponding month in 2021, the Central Agency for Public Mobilisation and Statistics (CAPMAS) announced on Tuesday.",economy
2,EBRD upgrades Egyptâs GDP growth forecasts in FY2021/22 by 0.8% The European Bank for Reconstruction and Development (EBRD) raised its projections for Egyptâs GDP growth for the current FY2021/22 to 5.7 percent â which ends in June â up from the 4.9 percent it projected in November before slowing down to 5 percent in FY2022/23 â which begins in July â according to the Regional Economic Prospects Report the bank released on Tuesday.,economy
3,"Gold prices down in Egypt amid uncertain economic outlook, expected dollar price hike Gold prices in the Egyptian market saw a fluctuation on Monday amid the uncertainty cast by an expected rise in the US dollar price and inflation caused by the Russian war in Ukraine, as well as rising food and energy prices.",economy
4,"Egypt, EU ink â¬138 mln development finance deals covering several sectors Egypt and the European Union (EU) delegation in Egypt signed on Monday a number of development finance agreements worth â¬138 million covering healthcare, administrative reform, the environment, rural and social development, and enhancing governance.",economy





Unnamed: 0,article_title,category
30594,Chekhov's Three Sisters visit AUC Anton ChekhovÃ¯Â¿Â½Ã¯Â¿Â½Ã¯Â¿Â½s Three Sisters directed by Frank Bradley will be performed by AUC students in De,art
30595,Egyptian critic wins Polish cultural award Professor Hanaa Abdel Fattah recognised by Poland's Ministry of Foreign Affairs,art
30596,Examining Acting Myths and Misconceptions Lecture by theatre director and professor Mahmoud El-Lozy at AUC,art
30597,"Julius Caesar with a twist Al Hayat Theatre Ensemble adds a twist to the Shakespearean play Julius Caesar, winning 3rd place at the El Sawy Culturewheel 8th Theatre Festival.",art
30598,"A Humourless Night On 12 November a stand-up comedy show with Akram Hosny took place at El Sakia culturewheel, with its theme - the current education system. The material was adapted from the satirical book Awel Mokarer by Haitham Dabbour.",art
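
Several rows above contain sequences such as `Egyptâs` and `â¬138`. This pattern is what typically appears when UTF-8 text is decoded as ISO-8859-1. The sketch below only illustrates that mechanism on a made-up string. Whether `clean_df.csv` is really UTF-8 is an assumption that has not been checked against the file.

```python
# Assumption: the source text was UTF-8 and used a curly apostrophe (U+2019).
original = "Egypt\u2019s inflation"

# Decoding the UTF-8 bytes as ISO-8859-1 turns the apostrophe into
# 'â' followed by two invisible control characters.
garbled = original.encode("utf-8").decode("iso-8859-1")
print(repr(garbled))

# Reversing the wrong decode recovers the text, as long as no bytes were lost.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired)
```

If the assumption holds, reading the file with `encoding='utf-8'` would avoid the garbling at the source. Sequences like `Ã¯Â¿Â½` in the last rows suggest that some characters were already lost before this step and cannot be recovered this way.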


In [4]:
# value counts
df['category'].value_counts()

art        10644
sports      9984
economy     9971
Name: category, dtype: int64

### Train of thought

We need to get the article_title column of our dataframe into a state where machine learning models can use it.

I will start by listing the steps to follow, and then write **helper functions** (if needed) to perform them.

Finally, I will put the whole process in one big function called **text_preprocessing** that takes a dataframe and returns a cleaned, preprocessed one.

In [5]:
# step1: Remove punctuation
# step2: .lower()
# step3: word tokenization(taking each sentence and splitting it into a list of words)
# step4: Stopwords Removal
# step5: Stemming 
# step6: Lemmatization
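
The steps above can be sketched as a single function on one string. This is a simplified illustration, not the notebook's code: `preprocess_text` is a hypothetical name, tokenization uses a plain whitespace split instead of `word_tokenize`, the default stop-word set is a tiny placeholder, and step 6 (lemmatization) is left out.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess_text(text, stop_words=frozenset({"a", "the", "in", "of"}), stem=None):
    """Hypothetical helper: run steps 1-5 on a single string."""
    no_punct = text.translate(_PUNCT_TABLE)             # step 1: remove punctuation
    lowered = no_punct.lower()                          # step 2: lowercase
    tokens = lowered.split()                            # step 3: tokenize (whitespace split)
    kept = [t for t in tokens if t not in stop_words]   # step 4: remove stop words
    if stem is not None:                                # step 5: optional stemming
        kept = [stem(t) for t in kept]
    return " ".join(kept)

example = preprocess_text("Gold prices down in Egypt, amid the war.")
print(example)
```

In the notebook, the same function could presumably be applied to a whole column with `df['article_title'].apply(preprocess_text, ...)`, passing NLTK's English stop words and a stemmer's `stem` method as arguments.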

#### Remove punctuation

In [6]:
def remove_punctuation(text):
    puncts_free = "".join([i for i in text if i not in string.punctuation])
    return puncts_free

In [7]:
df['puncts_free'] = df['article_title'].apply(lambda x: remove_punctuation(x))
df.head(1)

Unnamed: 0,article_title,category,puncts_free
0,"Egypt-based Paymob raises $50 mln in Series B funding round Egypt-based Paymob, a financial services for merchants platform, raised $50 million in Series B funding, which the platform will use to grow its product range, expanding in the Egyptian market as well as into new markets across the Middle East and Africa region.",economy,Egyptbased Paymob raises 50 mln in Series B funding round Egyptbased Paymob a financial services for merchants platform raised 50 million in Series B funding which the platform will use to grow its product range expanding in the Egyptian market as well as into new markets across the Middle East and Africa region
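
Note that deleting punctuation merges hyphenated words (`Egypt-based` becomes `Egyptbased`) and would turn a decimal like `14.9` into `149`. A possible alternative, shown here only as a sketch, is to replace punctuation with a space. `punctuation_to_space` is a hypothetical helper, and `string.punctuation` still covers ASCII characters only.

```python
import string

# Map every ASCII punctuation character to a space instead of deleting it.
_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})

def punctuation_to_space(text):
    # Hypothetical alternative to remove_punctuation: keeps word boundaries,
    # then collapses the repeated spaces left behind.
    return " ".join(text.translate(_TO_SPACE).split())

sample = "Egypt-based Paymob raises $50 mln; inflation at 14.9 percent."

# Same logic as remove_punctuation in the notebook.
deleted = "".join(ch for ch in sample if ch not in string.punctuation)
spaced = punctuation_to_space(sample)

print(deleted)
print(spaced)
```

Neither variant is ideal for numbers: one yields `149`, the other `14 9`. Which is preferable depends on how much the downstream model relies on numeric tokens.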


#### Lowercasing

In [8]:
df['puncts_free_lower'] = df['puncts_free'].apply(lambda x: x.lower())
df.head(1)

Unnamed: 0,article_title,category,puncts_free,puncts_free_lower
0,"Egypt-based Paymob raises $50 mln in Series B funding round Egypt-based Paymob, a financial services for merchants platform, raised $50 million in Series B funding, which the platform will use to grow its product range, expanding in the Egyptian market as well as into new markets across the Middle East and Africa region.",economy,Egyptbased Paymob raises 50 mln in Series B funding round Egyptbased Paymob a financial services for merchants platform raised 50 million in Series B funding which the platform will use to grow its product range expanding in the Egyptian market as well as into new markets across the Middle East and Africa region,egyptbased paymob raises 50 mln in series b funding round egyptbased paymob a financial services for merchants platform raised 50 million in series b funding which the platform will use to grow its product range expanding in the egyptian market as well as into new markets across the middle east and africa region


#### Word Tokenization

In [9]:
# spaCy or NLTK would both work here; this notebook uses NLTK's word_tokenize
df['tokenized_words'] = df['puncts_free_lower'].apply(word_tokenize)

In [10]:
# testing
df.head(1)

Unnamed: 0,article_title,category,puncts_free,puncts_free_lower,tokenized_words
0,"Egypt-based Paymob raises $50 mln in Series B funding round Egypt-based Paymob, a financial services for merchants platform, raised $50 million in Series B funding, which the platform will use to grow its product range, expanding in the Egyptian market as well as into new markets across the Middle East and Africa region.",economy,Egyptbased Paymob raises 50 mln in Series B funding round Egyptbased Paymob a financial services for merchants platform raised 50 million in Series B funding which the platform will use to grow its product range expanding in the Egyptian market as well as into new markets across the Middle East and Africa region,egyptbased paymob raises 50 mln in series b funding round egyptbased paymob a financial services for merchants platform raised 50 million in series b funding which the platform will use to grow its product range expanding in the egyptian market as well as into new markets across the middle east and africa region,"[egyptbased, paymob, raises, 50, mln, in, series, b, funding, round, egyptbased, paymob, a, financial, services, for, merchants, platform, raised, 50, million, in, series, b, funding, which, the, platform, will, use, to, grow, its, product, range, expanding, in, the, egyptian, market, as, well, as, into, new, markets, across, the, middle, east, and, africa, region]"


We only need the **'tokenized_words'** and **'category'** columns.

In [11]:
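# .copy() makes this an independent dataframe, so adding columns later does not trigger SettingWithCopyWarning
df = df[['tokenized_words', 'category']].copy()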
df.head()

Unnamed: 0,tokenized_words,category
0,"[egyptbased, paymob, raises, 50, mln, in, series, b, funding, round, egyptbased, paymob, a, financial, services, for, merchants, platform, raised, 50, million, in, series, b, funding, which, the, platform, will, use, to, grow, its, product, range, expanding, in, the, egyptian, market, as, well, as, into, new, markets, across, the, middle, east, and, africa, region]",economy
1,"[egyptâs, inflation, speeds, up, amid, war, in, ukraine, rising, food, and, energy, prices, egyptâs, headline, annual, inflation, rate, accelerated, to, 149, percent, in, april, up, from, the, 121, percent, recorded, in, march, and, 44, percent, in, the, corresponding, month, in, 2021, the, central, agency, for, public, mobilisation, and, statistics, capmas, announced, on, tuesday]",economy
2,"[ebrd, upgrades, egyptâs, gdp, growth, forecasts, in, fy202122, by, 08, the, european, bank, for, reconstruction, and, development, ebrd, raised, its, projections, for, egyptâs, gdp, growth, for, the, current, fy202122, to, 57, percent, â, which, ends, in, june, â, up, from, the, 49, percent, it, projected, in, november, before, slowing, down, to, 5, percent, in, fy202223, â, which, begins, in, july, â, according, to, the, regional, economic, prospects, report, the, bank, released, on, tuesday]",economy
3,"[gold, prices, down, in, egypt, amid, uncertain, economic, outlook, expected, dollar, price, hike, gold, prices, in, the, egyptian, market, saw, a, fluctuation, on, monday, amid, the, uncertainty, cast, by, an, expected, rise, in, the, us, dollar, price, and, inflation, caused, by, the, russian, war, in, ukraine, as, well, as, rising, food, and, energy, prices]",economy
4,"[egypt, eu, ink, â¬138, mln, development, finance, deals, covering, several, sectors, egypt, and, the, european, union, eu, delegation, in, egypt, signed, on, monday, a, number, of, development, finance, agreements, worth, â¬138, million, covering, healthcare, administrative, reform, the, environment, rural, and, social, development, and, enhancing, governance]",economy


#### Stopwords Removal

In [12]:
# keep the list under a new name so the imported nltk `stopwords` module is not overwritten
stop_words = stopwords.words('english')
print(stop_words, " | ",len(stop_words))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [13]:
# writing our function
def remove_stopwords(text_list):
    stop_words_free = [i for i in text_list if i not in stop_words]
    return stop_words_free

In [14]:
# testing
df['sw_free'] = df['tokenized_words'].apply(lambda x: remove_stopwords(x))
df.head(1)

Unnamed: 0,tokenized_words,category,sw_free
0,"[egyptbased, paymob, raises, 50, mln, in, series, b, funding, round, egyptbased, paymob, a, financial, services, for, merchants, platform, raised, 50, million, in, series, b, funding, which, the, platform, will, use, to, grow, its, product, range, expanding, in, the, egyptian, market, as, well, as, into, new, markets, across, the, middle, east, and, africa, region]",economy,"[egyptbased, paymob, raises, 50, mln, series, b, funding, round, egyptbased, paymob, financial, services, merchants, platform, raised, 50, million, series, b, funding, platform, use, grow, product, range, expanding, egyptian, market, well, new, markets, across, middle, east, africa, region]"
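
One side effect of the step order: punctuation is removed before stop words are filtered, so tokens such as `youre` or `thatll` no longer match list entries like `"you're"` or `"that'll"`. The sketch below shows one way to align the two, using only a few entries copied from the list printed above. `strip_punct` and `sample_stop_words` are hypothetical names.

```python
import string

def strip_punct(text):
    # Same logic as remove_punctuation in the notebook.
    return "".join(ch for ch in text if ch not in string.punctuation)

# A few entries copied from the NLTK list printed above.
sample_stop_words = ["you're", "it's", "that'll", "the", "in"]

# Apply the same punctuation removal to the stop words; a set also gives fast lookups.
normalised = {strip_punct(w) for w in sample_stop_words}

tokens = strip_punct("You're in the market, that'll do").lower().split()

with_raw_list = [t for t in tokens if t not in sample_stop_words]
with_normalised = [t for t in tokens if t not in normalised]

print(with_raw_list)
print(with_normalised)
```

With the raw list, `youre` and `thatll` survive the filter. With the normalised set they are removed along with `in` and `the`.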


#### Stemming

In [15]:
# defining stemming object
pstemmer = PorterStemmer()

In [16]:
def stemming(text):
    stemmed_text = [pstemmer.stem(word) for word in text]
    return stemmed_text

In [17]:
# testing our function
df['stemmed_title'] = df['sw_free'].apply(lambda x: stemming(x))
df.head(1)

Unnamed: 0,tokenized_words,category,sw_free,stemmed_title
0,"[egyptbased, paymob, raises, 50, mln, in, series, b, funding, round, egyptbased, paymob, a, financial, services, for, merchants, platform, raised, 50, million, in, series, b, funding, which, the, platform, will, use, to, grow, its, product, range, expanding, in, the, egyptian, market, as, well, as, into, new, markets, across, the, middle, east, and, africa, region]",economy,"[egyptbased, paymob, raises, 50, mln, series, b, funding, round, egyptbased, paymob, financial, services, merchants, platform, raised, 50, million, series, b, funding, platform, use, grow, product, range, expanding, egyptian, market, well, new, markets, across, middle, east, africa, region]","[egyptbas, paymob, rais, 50, mln, seri, b, fund, round, egyptbas, paymob, financi, servic, merchant, platform, rais, 50, million, seri, b, fund, platform, use, grow, product, rang, expand, egyptian, market, well, new, market, across, middl, east, africa, region]"


#### Lemmatization

Before starting the lemmatization process, open your cmd and:

* type python
* import nltk
* nltk.download('wordnet')
* nltk.download('omw-1.4')

For more info:
[stackoverflow wordnet](https://stackoverflow.com/questions/48152637/why-is-nltk-download-unable-to-download-wordnet-or-any-other-data)

These commands download the WordNet corpus and the Open Multilingual Wordnet data, which NLTK's WordNet corpus reader may load depending on the NLTK version.

If the required data is missing, **WordNetLemmatizer()** raises a LookupError the first time you try to lemmatize a word.

In [18]:
# defining lemmatization object
wnet_lemmatizer = WordNetLemmatizer()

In [19]:
def lemmatization(text):
    lemmatized_text = [wnet_lemmatizer.lemmatize(word) for word in text]
    return lemmatized_text

In [20]:
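# testing our function
# note: the input is the already-stemmed tokens; most stems are not dictionary words,
# so the lemmatizer returns them unchanged and the two columns end up nearly identical
df['lemmatized_title'] = df['stemmed_title'].apply(lambda x: lemmatization(x))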
df.head(1)

Unnamed: 0,tokenized_words,category,sw_free,stemmed_title,lemmatized_title
0,"[egyptbased, paymob, raises, 50, mln, in, series, b, funding, round, egyptbased, paymob, a, financial, services, for, merchants, platform, raised, 50, million, in, series, b, funding, which, the, platform, will, use, to, grow, its, product, range, expanding, in, the, egyptian, market, as, well, as, into, new, markets, across, the, middle, east, and, africa, region]",economy,"[egyptbased, paymob, raises, 50, mln, series, b, funding, round, egyptbased, paymob, financial, services, merchants, platform, raised, 50, million, series, b, funding, platform, use, grow, product, range, expanding, egyptian, market, well, new, markets, across, middle, east, africa, region]","[egyptbas, paymob, rais, 50, mln, seri, b, fund, round, egyptbas, paymob, financi, servic, merchant, platform, rais, 50, million, seri, b, fund, platform, use, grow, product, rang, expand, egyptian, market, well, new, market, across, middl, east, africa, region]","[egyptbas, paymob, rais, 50, mln, seri, b, fund, round, egyptbas, paymob, financi, servic, merchant, platform, rais, 50, million, seri, b, fund, platform, use, grow, product, rang, expand, egyptian, market, well, new, market, across, middl, east, africa, region]"


In [21]:
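# .copy() makes this an independent dataframe, so adding 'final_text' later does not trigger SettingWithCopyWarning
df = df[['stemmed_title', 'lemmatized_title', 'category']].copy()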
df

Unnamed: 0,stemmed_title,lemmatized_title,category
0,"[egyptbas, paymob, rais, 50, mln, seri, b, fund, round, egyptbas, paymob, financi, servic, merchant, platform, rais, 50, million, seri, b, fund, platform, use, grow, product, rang, expand, egyptian, market, well, new, market, across, middl, east, africa, region]","[egyptbas, paymob, rais, 50, mln, seri, b, fund, round, egyptbas, paymob, financi, servic, merchant, platform, rais, 50, million, seri, b, fund, platform, use, grow, product, rang, expand, egyptian, market, well, new, market, across, middl, east, africa, region]",economy
1,"[egyptâ, inflat, speed, amid, war, ukrain, rise, food, energi, price, egyptâ, headlin, annual, inflat, rate, acceler, 149, percent, april, 121, percent, record, march, 44, percent, correspond, month, 2021, central, agenc, public, mobilis, statist, capma, announc, tuesday]","[egyptâ, inflat, speed, amid, war, ukrain, rise, food, energi, price, egyptâ, headlin, annual, inflat, rate, acceler, 149, percent, april, 121, percent, record, march, 44, percent, correspond, month, 2021, central, agenc, public, mobilis, statist, capma, announc, tuesday]",economy
2,"[ebrd, upgrad, egyptâ, gdp, growth, forecast, fy202122, 08, european, bank, reconstruct, develop, ebrd, rais, project, egyptâ, gdp, growth, current, fy202122, 57, percent, â, end, june, â, 49, percent, project, novemb, slow, 5, percent, fy202223, â, begin, juli, â, accord, region, econom, prospect, report, bank, releas, tuesday]","[ebrd, upgrad, egyptâ, gdp, growth, forecast, fy202122, 08, european, bank, reconstruct, develop, ebrd, rais, project, egyptâ, gdp, growth, current, fy202122, 57, percent, â, end, june, â, 49, percent, project, novemb, slow, 5, percent, fy202223, â, begin, juli, â, accord, region, econom, prospect, report, bank, releas, tuesday]",economy
3,"[gold, price, egypt, amid, uncertain, econom, outlook, expect, dollar, price, hike, gold, price, egyptian, market, saw, fluctuat, monday, amid, uncertainti, cast, expect, rise, us, dollar, price, inflat, caus, russian, war, ukrain, well, rise, food, energi, price]","[gold, price, egypt, amid, uncertain, econom, outlook, expect, dollar, price, hike, gold, price, egyptian, market, saw, fluctuat, monday, amid, uncertainti, cast, expect, rise, u, dollar, price, inflat, caus, russian, war, ukrain, well, rise, food, energi, price]",economy
4,"[egypt, eu, ink, â¬138, mln, develop, financ, deal, cover, sever, sector, egypt, european, union, eu, deleg, egypt, sign, monday, number, develop, financ, agreement, worth, â¬138, million, cover, healthcar, administr, reform, environ, rural, social, develop, enhanc, govern]","[egypt, eu, ink, â¬138, mln, develop, financ, deal, cover, sever, sector, egypt, european, union, eu, deleg, egypt, sign, monday, number, develop, financ, agreement, worth, â¬138, million, cover, healthcar, administr, reform, environ, rural, social, develop, enhanc, govern]",economy
...,...,...,...
30594,"[chekhov, three, sister, visit, auc, anton, chekhovã¯â¿â½ã¯â¿â½ã¯â¿â½, three, sister, direct, frank, bradley, perform, auc, student, de]","[chekhov, three, sister, visit, auc, anton, chekhovã¯â¿â½ã¯â¿â½ã¯â¿â½, three, sister, direct, frank, bradley, perform, auc, student, de]",art
30595,"[egyptian, critic, win, polish, cultur, award, professor, hanaa, abdel, fattah, recognis, poland, ministri, foreign, affair]","[egyptian, critic, win, polish, cultur, award, professor, hanaa, abdel, fattah, recognis, poland, ministri, foreign, affair]",art
30596,"[examin, act, myth, misconcept, lectur, theatr, director, professor, mahmoud, ellozi, auc]","[examin, act, myth, misconcept, lectur, theatr, director, professor, mahmoud, ellozi, auc]",art
30597,"[juliu, caesar, twist, al, hayat, theatr, ensembl, add, twist, shakespearean, play, juliu, caesar, win, 3rd, place, el, sawi, culturewheel, 8th, theatr, festiv]","[juliu, caesar, twist, al, hayat, theatr, ensembl, add, twist, shakespearean, play, juliu, caesar, win, 3rd, place, el, sawi, culturewheel, 8th, theatr, festiv]",art


In [22]:
# Save preprocessed text to a csv file (the index is written too and comes back as 'Unnamed: 0' on reload)
df.to_csv(r"C:\Users\HP\Desktop\For CV\Project 5\preprocessing\final_df_1.csv")
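
Keep in mind that a CSV file can only hold text, so the list columns are written in their string form and come back as strings when the file is read again. The sketch below illustrates this with the standard `csv` module rather than pandas, and shows one way to parse the string back into a list. The sample row is made up from tokens seen above.

```python
import ast
import csv
import io

# Writing a list to CSV stores its string form, as happens with the token columns.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["stemmed_title", "category"])
writer.writerow([["gold", "price", "egypt"], "economy"])

# Read the row back: the token column is now a plain string.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["stemmed_title"]

# ast.literal_eval safely parses the string back into a Python list.
tokens = ast.literal_eval(raw)
print(type(raw).__name__, type(tokens).__name__, tokens)
```

After `pd.read_csv`, the equivalent would presumably be `df['stemmed_title'].apply(ast.literal_eval)`.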

#### Final Preprocessed text

In [23]:
df['final_text'] = df['lemmatized_title'].apply(lambda x: ' '.join(x))

In [24]:
# keep only the final text and its label
pp_df = df[['final_text', 'category']]
pp_df

Unnamed: 0,final_text,category
0,egyptbas paymob rais 50 mln seri b fund round egyptbas paymob financi servic merchant platform rais 50 million seri b fund platform use grow product rang expand egyptian market well new market across middl east africa region,economy
1,egyptâ inflat speed amid war ukrain rise food energi price egyptâ headlin annual inflat rate acceler 149 percent april 121 percent record march 44 percent correspond month 2021 central agenc public mobilis statist capma announc tuesday,economy
2,ebrd upgrad egyptâ gdp growth forecast fy202122 08 european bank reconstruct develop ebrd rais project egyptâ gdp growth current fy202122 57 percent â end june â 49 percent project novemb slow 5 percent fy202223 â begin juli â accord region econom prospect report bank releas tuesday,economy
3,gold price egypt amid uncertain econom outlook expect dollar price hike gold price egyptian market saw fluctuat monday amid uncertainti cast expect rise u dollar price inflat caus russian war ukrain well rise food energi price,economy
4,egypt eu ink â¬138 mln develop financ deal cover sever sector egypt european union eu deleg egypt sign monday number develop financ agreement worth â¬138 million cover healthcar administr reform environ rural social develop enhanc govern,economy
...,...,...
30594,chekhov three sister visit auc anton chekhovã¯â¿â½ã¯â¿â½ã¯â¿â½ three sister direct frank bradley perform auc student de,art
30595,egyptian critic win polish cultur award professor hanaa abdel fattah recognis poland ministri foreign affair,art
30596,examin act myth misconcept lectur theatr director professor mahmoud ellozi auc,art
30597,juliu caesar twist al hayat theatr ensembl add twist shakespearean play juliu caesar win 3rd place el sawi culturewheel 8th theatr festiv,art


In [25]:
pp_df.to_csv(r"C:\Users\HP\Desktop\For CV\Project 5\preprocessing\final_df_2.csv")

### Final Thoughts

Our data is cleaned, preprocessed, and ready.

In the next notebook, we will convert our text data to numeric form and build our machine learning model.