## Text Preprocessing

### Importing libraries and setting path

In [1]:
import pandas as pd
import os
# text preprocessing
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Increase width to see the article_title clearly
pd.set_option('display.max_colwidth', None)

# Our project path
project_path = r"C:\Users\HP\Desktop\For CV\Project 5"
# check working directory
print("Working directory:", os.getcwd())
# change working directory and list all files in new directory
os.chdir(project_path)
print("New working directory:", os.getcwd())
print()
for dirname, _, filenames in os.walk(project_path):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Working directory: C:\Users\HP\Desktop\For CV\Project 5\preprocessing



### Load our data

In [None]:
df = pd.read_csv(r"C:\Users\HP\Desktop\For CV\Project 5\data cleaning\clean_df.csv", encoding='ISO-8859-1')
print(df.columns)
# drop the 'Unnamed: 0' column (the index saved by an earlier to_csv call)
df.drop(columns = 'Unnamed: 0', inplace=True)

In [None]:
display(df.head())
print()
display(df.tail())

In [None]:
# value counts
df['category'].value_counts()

### Train of thought

We need to get the article titles in our dataframe into a state where they are ready to be used by machine learning models.

I will start by listing the steps to follow, and then write **helper functions** (if needed) to perform these steps.

Finally, I will put the whole process in one big function called **text_preprocessing** that takes a dataframe and returns a cleaned, preprocessed one.

In [None]:
# step1: Remove punctuation
# step2: .lower()
# step3: Word tokenization (taking each sentence and splitting it into a list of words)
# step4: Stopwords removal
# step5: Stemming
# step6: Lemmatization
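The steps above can be chained into a single function, in the spirit of the **text_preprocessing** wrapper described earlier. The sketch below is only an illustration under stated assumptions: `preprocess_title`, `DEMO_STOP_WORDS`, and the `tokenize`/`stem` parameters are hypothetical names, and plain whitespace splitting plus a tiny stopword set stand in for the NLTK pieces so that the block runs without any downloads.

```python
import string

# Hypothetical stand-in for NLTK's English stopword list (an assumption, not the real list)
DEMO_STOP_WORDS = {"the", "a", "an", "of", "in", "to", "is"}

def preprocess_title(title, tokenize=str.split, stop_words=DEMO_STOP_WORDS, stem=lambda w: w):
    """Apply the steps in order to one title and return a list of tokens."""
    # step1: remove punctuation
    text = "".join(ch for ch in title if ch not in string.punctuation)
    # step2: lowercase
    text = text.lower()
    # step3: tokenize (whitespace split here; word_tokenize could be passed instead)
    tokens = tokenize(text)
    # step4: drop stopwords
    tokens = [t for t in tokens if t not in stop_words]
    # step5/6: stemming and lemmatization are represented by one callable (identity by default)
    return [stem(t) for t in tokens]

print(preprocess_title("The Rise of AI in Healthcare!"))
```

With NLTK set up, one could pass `tokenize=word_tokenize` and a stemmer's `stem` method, then run the function over the whole column with `df['article_title'].apply(preprocess_title)`.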

#### Remove punctuation

In [None]:
def remove_punctuation(text):
    puncts_free = "".join([i for i in text if i not in string.punctuation])
    return puncts_free
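As an aside, the same result can be obtained with `str.translate`, which avoids the per-character Python loop. This is a sketch of an alternative, not the function used in the rest of the notebook; `remove_punctuation_translate` and `_PUNCT_TABLE` are hypothetical names. It also shows a limitation both versions share: `string.punctuation` only covers ASCII punctuation, so typographic quotes are left in place.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_translate(text):
    """Return text with all characters from string.punctuation removed."""
    return text.translate(_PUNCT_TABLE)

print(remove_punctuation_translate("Stocks fall 3%, then rebound!"))
# Curly quotes are not in string.punctuation, so they survive
print(remove_punctuation_translate("It’s a “smart” move"))
```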

In [None]:
df['puncts_free'] = df['article_title'].apply(remove_punctuation)
df.head(1)

#### Lowering

In [None]:
df['puncts_free_lower'] = df['puncts_free'].str.lower()
df.head(1)

#### Word Tokenization

In [None]:
# You can use either spaCy or NLTK; NLTK's word_tokenize is enough for our purposes here
df['tokenized_words'] = df['puncts_free_lower'].apply(word_tokenize)

In [None]:
# testing
df.head(1)

We only need the **'tokenized_words'** and **'category'** columns.

In [None]:
# .copy() avoids a SettingWithCopyWarning when we add new columns later
df = df[['tokenized_words', 'category']].copy()
df.head()

#### Stopwords Removal

In [None]:
# use a new name so we don't overwrite the imported `stopwords` module
stop_words = stopwords.words('english')
print(stop_words, " | ", len(stop_words))

In [None]:
# writing our function
def remove_stopwords(text_list):
    stop_words_free = [i for i in text_list if i not in stop_words]
    return stop_words_free
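Because NLTK's English stopword list is all lowercase, this filter only works as intended on lowercased tokens, which is why the lowering step comes first. A minimal illustration, using a hypothetical three-word stopword set (`demo_stop_words`) and a hypothetical helper (`drop_stop_words`) rather than the real list:

```python
# Hypothetical mini stopword set; like NLTK's English list, it is all lowercase
demo_stop_words = {"the", "is", "on"}

def drop_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in stop_words (case-sensitive match)."""
    return [t for t in tokens if t not in stop_words]

raw_tokens = ["The", "market", "is", "on", "fire"]

# Without lowercasing, "The" slips through the filter
print(drop_stop_words(raw_tokens, demo_stop_words))

# After lowercasing, every stopword is removed
print(drop_stop_words([t.lower() for t in raw_tokens], demo_stop_words))
```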

In [None]:
# testing
df['sw_free'] = df['tokenized_words'].apply(remove_stopwords)
df.head(1)

#### Stemming

In [None]:
# defining stemming object
pstemmer = PorterStemmer()

In [None]:
def stemming(text):
    stemmed_text = [pstemmer.stem(word) for word in text]
    return stemmed_text

In [None]:
# testing our function
df['stemmed_title'] = df['sw_free'].apply(stemming)
df.head(1)

#### Lemmatization

Before starting the lemmatization process, download the WordNet data that **WordNetLemmatizer()** depends on. Open your cmd and:

* type python
* import nltk
* nltk.download('wordnet')
* nltk.download('omw-1.4')

for more info:
[stackoverflow wordnet](https://stackoverflow.com/questions/48152637/why-is-nltk-download-unable-to-download-wordnet-or-any-other-data)

The first download fetches the WordNet corpus itself; the second fetches the Open Multilingual Wordnet, which some NLTK versions also require.

If the data is not downloaded, NLTK will throw an error when you try to use the **WordNetLemmatizer()**.

In [None]:
# defining lemmatization object
wnet_lemmatizer = WordNetLemmatizer()

In [None]:
def lemmatization(text):
    lemmatized_text = [wnet_lemmatizer.lemmatize(word) for word in text]
    return lemmatized_text

In [None]:
# testing our function
df['lemmatized_title'] = df['stemmed_title'].apply(lemmatization)
df.head(1)

In [None]:
# .copy() avoids a SettingWithCopyWarning when we add 'final_text' below
df = df[['stemmed_title', 'lemmatized_title', 'category']].copy()
df

In [None]:
# Save preprocessed text to a csv file
# index=False keeps the row index out of the file (no 'Unnamed: 0' column on reload)
df.to_csv(r"C:\Users\HP\Desktop\For CV\Project 5\preprocessing\final_df_1.csv", index=False)
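One caveat when saving token lists to CSV: each list is written as its string representation, so it comes back as a plain string when the file is read again. The sketch below shows the effect with the standard `csv` module (pandas stringifies list cells in the same way) and one possible way to recover the list with `ast.literal_eval`. The sample row is made up for illustration.

```python
import ast
import csv
import io

# Write one row whose first cell is a list of tokens, as in 'stemmed_title'
buffer = io.StringIO()
csv.writer(buffer).writerow([["stock", "market", "fall"], "business"])

# Read it back: the list is now just its string representation
buffer.seek(0)
cell, category = next(csv.reader(buffer))
print(type(cell).__name__, cell)

# ast.literal_eval safely parses that string back into a real list
tokens = ast.literal_eval(cell)
print(type(tokens).__name__, tokens)
```

When reloading with pandas, the same conversion could be applied at read time, for example by passing `converters={'stemmed_title': ast.literal_eval}` to `pd.read_csv`.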

#### Final Preprocessed text

In [None]:
df['final_text'] = df['lemmatized_title'].apply(' '.join)

In [None]:
# save preprocessed df
pp_df = df[['final_text', 'category']]
pp_df

In [None]:
pp_df.to_csv(r"C:\Users\HP\Desktop\For CV\Project 5\preprocessing\final_df_2.csv", index=False)

### Final Thoughts

Our data is cleaned, preprocessed, and ready.

In the next notebook, we will convert our text data to numeric form so that it is ready for machine learning models.