# Norwegian text preprocessing with spaCy

A basic NLP preprocessing pipeline for Norwegian text. It lemmatizes the text and removes stop words, but does not remove punctuation, symbols, newline characters, numbers, etc. The code can be used to process CSV files in any language for which spaCy provides a suitable [language model](https://spacy.io/usage/models). For example, to preprocess English text, run "nlp = spacy.load('en_core_web_trf')" instead. Remember to change "Define Stopwords" as well, so that it fits your language.

## Load libraries

In [None]:
import spacy
import pandas as pd
import csv
import spacy.lang.nb  # makes spacy.lang.nb.STOP_WORDS available

## Load the Norwegian language model

In [2]:
# If loading fails because the model is not installed, run this line first:
# spacy.cli.download('nb_core_news_lg')
nlp = spacy.load('nb_core_news_lg')

## Define Stopwords

In [3]:
stop_words = set(spacy.lang.nb.STOP_WORDS)

## Increase the maximum text size that can be processed

In [3]:
# Increase the maximum text length (in characters) that the spaCy model accepts,
# and the maximum field size accepted by the csv module
nlp.max_length = 20000000
csv.field_size_limit(5000000)

131072
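
The number printed above is the return value of csv.field_size_limit: when given a new limit, it returns the limit that was in effect before the call. A short sketch of that behaviour using only the standard library (the variable names are illustrative and not part of the original notebook):

```python
import csv

# Passing a new limit sets it and returns the limit that was active before
previous_limit = csv.field_size_limit(5_000_000)

# Calling without arguments only reads the current limit
current_limit = csv.field_size_limit()

print(previous_limit, current_limit)
```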

## Test that the language model has been loaded correctly

In [None]:
doc = nlp("Jeg gikk og går for å gå etter å ha gått, og jeg synger, liker å synge da jeg sang.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

## Load Norwegian dataframe

The input is a dataframe with one text item per row, various metadata columns, and a column called 'text' whose values are strings.

In [5]:
df = pd.read_csv('aftenposten_unprocessed.csv')
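
spaCy expects every value passed to nlp to be a string, so a missing value in the 'text' column (which pandas reads as NaN, a float) would raise an error during processing. A minimal sketch of a check that could be run after loading, using a small made-up dataframe in place of the real file (the name `sample` and its contents are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the real file: two rows, one with a missing text value
sample = pd.DataFrame({"title": ["a", "b"], "text": ["Dette er en test.", None]})

# Fail early if the expected column is absent
if "text" not in sample.columns:
    raise KeyError("expected a column named 'text'")

# Replace missing values with empty strings and make sure every value is a str
sample["text"] = sample["text"].fillna("").astype(str)
```

Whether empty strings or dropped rows are the better choice depends on how the processed file is used afterwards.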

## Create function to process the text

In [6]:
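# Define a function that lemmatizes a text and drops stop words
def preprocess_text(text):
    # Uses the stop_words set from "Define Stopwords" instead of rebuilding it on every call
    doc = nlp(text)
    # Keep the lemma of every token that is not a stop word; the comparison is done
    # in lowercase so that capitalized stop words are caught as well
    lemmatized_text = [token.lemma_ for token in doc
                       if not token.is_stop and token.text.lower() not in stop_words]
    return " ".join(lemmatized_text)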

## Apply the function to the text column

In [None]:
df["text"] = df["text"].apply(preprocess_text)
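
Calling the function through apply runs the model on one text at a time. For larger files, spaCy's nlp.pipe can process texts in batches, which is usually faster. A minimal sketch, assuming the nlp object and stop_words set defined earlier; the helper name `preprocess_texts` and the batch size of 50 are illustrative choices, not part of the original notebook:

```python
def preprocess_texts(texts, nlp, stop_words, batch_size=50):
    """Lemmatize an iterable of strings in batches and drop stop words.

    Hypothetical helper: `nlp` is a loaded spaCy pipeline and `stop_words`
    a set of lowercase stop words, both as defined earlier in the notebook.
    """
    results = []
    # nlp.pipe yields one Doc per input text, in input order
    for doc in nlp.pipe(texts, batch_size=batch_size):
        lemmas = [
            token.lemma_
            for token in doc
            if not token.is_stop and token.text.lower() not in stop_words
        ]
        results.append(" ".join(lemmas))
    return results
```

It could then stand in for the apply call, for example df["text"] = preprocess_texts(df["text"].tolist(), nlp, stop_words).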

## Save the processed .csv file

In [7]:
# index=False keeps the row index from being written as an extra unnamed column
df.to_csv('aftenposten_processed.csv', index=False)