
## Load all relevant Python libraries and a spaCy language model.

In [23]:
import json 
import spacy 

nlp = spacy.load("en_core_web_sm")

## Open the provided JSON file. 

The file contains a list of dictionaries, one per Wikipedia article summary. Each dictionary has three key-value pairs, whose keys `title`, `text`, and `url` correspond to:


1.   Title of the Wikipedia article the text is taken from.
2.   Wikipedia article text. (In this dataset we included only the summary.)
3.   Link to the Wikipedia article.
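
Before processing, it can help to confirm that every record has exactly these three keys. The following is a minimal sketch, assuming the structure described above; `sample_json`, `EXPECTED_KEYS`, and `check_records` are hypothetical names, and the sample record is invented for illustration rather than taken from `data.json`.

```python
import json

# Hypothetical sample mirroring the structure described above; the values are
# invented for illustration and are not taken from data.json.
sample_json = """
[
  {
    "title": "Example article",
    "text": "A short summary of the example article.",
    "url": "https://en.wikipedia.org/wiki/Example"
  }
]
"""

EXPECTED_KEYS = {"title", "text", "url"}


def check_records(records):
    """Return the indices of records whose keys differ from EXPECTED_KEYS."""
    return [i for i, record in enumerate(records) if set(record) != EXPECTED_KEYS]


records = json.loads(sample_json)
bad = check_records(records)
print(f"{len(records)} record(s), {len(bad)} with unexpected keys")
```

The same `check_records` call could be applied to the list loaded from the real file.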

In [24]:
# Wikipedia text contains non-ASCII characters, so read the file as UTF-8 explicitly.
with open('data.json', encoding='utf-8') as user_file:
    data = json.load(user_file)

## Create a Python function that takes a text string, performs the operations listed below, and returns a list of tokens (lemmas).

1. Lowercases the text string.
2. Creates a spaCy document with the text lemmas and their attributes using a spaCy model of your choice.
3. Removes stop words, punctuation, and other unclassified lemmas.
4. Returns a list of tokens (lemmas) found in the text.

In [28]:
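def extractLemmaTokens(s):
    """Return the lemmas of `s`, excluding stop words, punctuation, numbers, and unclassified tokens."""
    s = s.lower()
    # Replace newlines with spaces so words on adjacent lines are not fused together.
    doc = nlp(s.replace("\n", " "))

    tokens = [token.lemma_ for token in doc if
              not (
                    token.is_stop or
                    token.pos_ in ('PUNCT', 'SYM', 'X') or
                    token.dep_ == "" or
                    token.like_num or
                    # Skip whitespace tokens and bare underscores.
                    token.is_space or
                    token.text == '_'
                   )
              ]

    return tokens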

## Use this function to preprocess all text documents in the dataset (text field only), and add the resulting lists to the dictionaries from step 1.

You should end up with a list of dictionaries, each of which now has four key-value pairs:

* title: Title of the Wikipedia article the text is taken from.
* text: Wikipedia article text. (In this dataset we included only the summary.)
* tokenized_text: Tokenized Wikipedia article text.
* url: Link to the Wikipedia article.

In [None]:
for article in data:
    article['tokenized_text'] = extractLemmaTokens(article['text'])
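
The loop above appends `tokenized_text` to each dictionary in place. As a rough alternative sketch, the records can be rebuilt with the four keys in the order listed, leaving the input untouched. The names `with_tokens` and `whitespace_tokenize` are hypothetical; the whitespace tokenizer is only a stand-in so the example runs without a spaCy model, and `extractLemmaTokens` would be passed in its place.

```python
def with_tokens(records, tokenize):
    """Return new dicts with a tokenized_text field, leaving the input untouched."""
    return [
        {
            "title": record["title"],
            "text": record["text"],
            "tokenized_text": tokenize(record["text"]),
            "url": record["url"],
        }
        for record in records
    ]


def whitespace_tokenize(text):
    """Stand-in tokenizer (an assumption for this sketch): lowercase and split on whitespace."""
    return text.lower().split()


# Invented sample record, not taken from the dataset.
sample = [
    {"title": "Example", "text": "Cats chase mice", "url": "https://example.org/cats"}
]

result = with_tokens(sample, whitespace_tokenize)
print(list(result[0]))
```

Passing the tokenizer as an argument keeps the record-building step independent of the language model, which makes it easy to check on a small sample.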

## Save the new list of dictionaries in JSON format.

In [12]:
with open('data_with_token.json', 'w') as fp:
    json.dump(data, fp)
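
A quick way to confirm that the saved file is usable is to load it back and compare it with the in-memory list. The sketch below assumes records shaped like those described above; the sample record is invented, and a temporary directory is used so nothing in the working directory is overwritten.

```python
import json
import os
import tempfile

# Invented sample record with the four keys described above.
records = [
    {
        "title": "Example",
        "text": "Cats chase mice",
        "tokenized_text": ["cat", "chase", "mouse"],
        "url": "https://example.org/cats",
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data_with_token.json")

    # Write the records as UTF-8 and keep non-ASCII characters readable.
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(records, fp, ensure_ascii=False)

    # Read the file back to check that nothing was lost or altered.
    with open(path, encoding="utf-8") as fp:
        reloaded = json.load(fp)

round_trip_ok = reloaded == records
print(round_trip_ok)
```

Lists of strings survive the round trip unchanged, so the token lists can be read back directly by a later step.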

## Test

In [31]:
# txt = "POS Tagging in Spacy library is quite easy as seen in the below example. We just instantiate a Spacy object as doc. We iterate over doc object and use pos_ , tag_, to print the POS tag. Spacy also lets you access the detailed explanation of POS tags by using spacy.explain() function which is also printed in the same iteration along with POS tags."
# #data[0]['text'].lower()
# doc = nlp(article['text'])

# # for t in s:
# #     print(t, t.lemma_)

# rows = []
# rows.append(["Word", "Position", "Lowercase", "Lemma", "POS", "Alphanumeric","Stopword", "Dependency"])

# for token in doc:
#     if not (token.is_stop or token.pos_ in ('PUNCT','SYM', 'X') or token.is_space or token.text == '_' or token.dep_ == ""):
#         rows.append([token.text, str(token.i), token.lower_, token.lemma_, token.pos_, str(token.is_alpha), str(token.is_stop), token.dep_])
# columns = zip(*rows)
# column_widths = [max(len(item) for item in col) for col in columns]


# for row in rows:
#     print(''.join(' {:{width}} '.format(row[i], width=column_widths[i])  for i in range(0, len(row))))

In [None]:
# len(rows)