# NLP training set-up
This is the setup notebook for the natural language processing (NLP) workshop at the Indigidata event. This guide will help you set up your environment and ensure everything is working correctly before the session on Monday 26th August 2024.

### Install the packages 
Once Jupyter Notebook is open, work through the cells below in order. We will be using the following packages for this workshop, so you will need to make sure that they are installed and can be imported successfully beforehand.
#### Importing the NLTK package
The `nltk` package is the NLP library we will be using during the session. If you do not already have `nltk` installed, run the code cell below to install it. Alternatively, you can install the package from the command line (`pip install nltk`) or using your favourite Python package manager, such as conda (included with Anaconda).

In [None]:
# installing the nltk package - press ctrl + enter to run the code cell
!pip install nltk

Once the package has successfully installed, check that you can import the package:

In [None]:
# import the nltk package
import nltk
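
If you would rather check whether a package is available before importing it, the standard library can tell you if Python is able to find a module. The sketch below is one way to do this; the helper name `missing_packages` is made up for illustration and is not part of `nltk`.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the top-level module names that Python cannot find."""
    return [name for name in module_names if find_spec(name) is None]


# An empty list means nltk is installed and can be found.
print(missing_packages(["nltk"]))
```

Any name that appears in the printed list still needs to be installed.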

We will also be using the NLTK names corpus during this session. You will need to first download the names corpus:

In [None]:
# download the names corpus from the nltk package
nltk.download('names')

Import the names corpus and test that it loads by printing the first 15 names:

In [None]:
# import the names corpus
from nltk.corpus import names

# print the first 15 names in the corpus
print(names.words()[:15])

At this stage, if everything has worked, you should get some output that looks like a list of names starting with `['Abagael', 'Abagail', 'Abbe', 'Abbey', 'Abbi', 'Abbie', 'Abby',`.

Next we want to make sure that we can load the various other packages that we'll be using during the wānanga:

In [None]:
from IPython.display import Image
import random
import ast
from nltk import NaiveBayesClassifier
from nltk import classify

If the imports worked, you should see no error messages. It can also be useful to test that things work by using them.

Run the code below and check that the output from the print statements and from the final `nb_classifier.classify` line matches the expected output indicated in the code comments.

In [None]:
x = [1,2,3,4,5]
random.shuffle(x)
print(x) #should be in random order

y = '''('word','flag')'''
print(type(y)) # expect class 'str'
print(type(ast.literal_eval(y))) # expect tuple

nb_classifier = NaiveBayesClassifier.train([({'feature':'value'},'tag')]) #train on a single piece of "data"
nb_classifier.classify({'feature':'value'}) #expect to classify this as 'tag'
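
The `random` and `ast` checks above can also be written so that they give the same result on every run. This is a minimal sketch using only the standard library: it assumes a seeded generator (the seed `42` and the variable names are arbitrary choices for illustration) and shows `ast.literal_eval` reading a list of tuples rather than a single tuple.

```python
import ast
import random

# A seeded generator gives the same "random" order on every run.
rng = random.Random(42)
numbers = [1, 2, 3, 4, 5]
rng.shuffle(numbers)
print(numbers)          # shuffled, but identical each time this cell runs
print(sorted(numbers))  # [1, 2, 3, 4, 5]

# literal_eval turns text that looks like a Python literal into the object itself.
text = "[('word', 'flag'), ('other', 'flag')]"
pairs = ast.literal_eval(text)
print(type(pairs), type(pairs[0]))  # <class 'list'> <class 'tuple'>
print(pairs[0][0])                  # word
```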

#### Importing the selenium and BeautifulSoup4 packages
We will also be using the `selenium` and `BeautifulSoup4` packages to scrape names from a website. If you do not already have these installed, you will need to install them using the following commands (or your preferred alternative method):

In [None]:
# install the selenium package
!pip install selenium

# install BeautifulSoup4
!pip install BeautifulSoup4

Import the packages and test that they load by printing the selenium version and parsing a small piece of HTML with BeautifulSoup:

In [None]:
# print the version of the package to ensure it has installed correctly
import selenium
print("Selenium version:",selenium.__version__) #expect a version number, probably 4.31.1

from bs4 import BeautifulSoup
BeautifulSoup('''<p>words words words</p>''','html.parser').find_all('p')
# expect output that looks like '[<p>words words words</p>]'
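
If you would like to see version numbers for all of the workshop packages in one place, the standard library can look them up without importing the packages. This is a sketch rather than a required step; the helper name `installed_version` is made up for illustration, and it assumes Python 3.8 or later.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


# These are the names used with pip install, not the import names.
for name in ["nltk", "selenium", "beautifulsoup4"]:
    print(name, installed_version(name))
```

A value of `None` next to a package name means that package still needs to be installed.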

In [None]:
# clear outputs - just to keep things tidy.
from IPython.display import clear_output
clear_output()