<a href="https://colab.research.google.com/github/mrhallonline/NLP-Workshop/blob/main/Module_2_Workshop_Setting_Up_Natural_Language_Toolkit_(NLTK)_V3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2.0 Setting Up (?? minutes)

Click below to install the libraries:
1. NLTK
2. [Matplotlib](https://matplotlib.org/) library: a library for creating data visualizations.
3. [Gensim](https://pypi.org/project/gensim/) library: a natural language processing tool.
4. [PyPDF2](https://pypdf2.readthedocs.io/en/3.0.0/index.html#) library: NLTK normally works with text files; PyPDF2 lets you read, write, and merge PDF files and convert them to text.
5. NumPy
6. pandas



### Google Colab is similar in usage to software like RStudio, allowing you to run chunks or cells one at a time. Click in the next code cell and run it. An output should appear under it listing the version of Python used in this Colab notebook.

In [20]:
!python --version

Python 3.10.12
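
The same check can also be made from inside Python instead of through a shell command. The following is a minimal sketch using only the standard-library `sys` module; the version it prints depends on the Colab runtime and may differ from the output above.

```python
import sys

# sys.version_info holds the interpreter version as numbers (major, minor, micro)
major, minor, micro = sys.version_info[:3]
version_string = f"{major}.{minor}.{micro}"
print("Python", version_string)
```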


## 2.1 Installing NLTK supporting dependencies and libraries.
Once we know that Python is running, click the following code cell to download and install NLTK and the other libraries that we will use throughout this workshop. Don't worry if you have already installed them; in that case the output will mention "Requirement already satisfied:".


In [None]:
!pip install nltk
!pip install matplotlib
!pip install gensim
!pip install PyPDF2
!pip install numpy
!pip install pandas
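
To confirm which versions ended up installed, something like the following sketch could be run in a new cell. It relies on the standard-library `importlib.metadata` module; the helper name `installed_version` is made up for this example and is not part of any of the libraries above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Package names as used in the pip commands above
for name in ["nltk", "matplotlib", "gensim", "PyPDF2", "numpy", "pandas"]:
    print(name, installed_version(name) or "not installed")
```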

# 2.2 Importing NLTK and popular corpora

NLTK has access to a wide range of text and audio corpora that can be easily viewed and analyzed if you are ever in need of data to experiment with. We won't spend time with them today, but running the following code cell will import nltk and download the most widely used corpora. The `import nltk` line needs to be run at least once per session, but it can be placed at the top of any code cell just to be certain.

In [None]:
import nltk
nltk.download('popular')

## 2.3  Connecting/Mounting your Google drive to be accessible in Google Colab

Click the following cell to connect the current Google Colab notebook to your Google Drive so you can save and access data. This connection is temporary, and you will need to connect again after a period of inactivity. If you later run into errors stating that files can't be found, run this cell again to make sure the connection is still live.

#### When run, the output will let you know if the drive is already mounted. If it is not, you will be asked to authorize the connection to Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
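
If a later cell reports that a file can't be found, a quick check like the sketch below can tell you whether the Drive connection has dropped. It assumes the `/content/drive/MyDrive` folder used elsewhere in this notebook; the helper name `drive_is_mounted` is hypothetical.

```python
import os


def drive_is_mounted(my_drive="/content/drive/MyDrive"):
    """Return True if the given Drive folder is currently visible to the notebook."""
    return os.path.isdir(my_drive)


if drive_is_mounted():
    print("Drive is mounted.")
else:
    print("Drive is not mounted; run the mount cell again.")
```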


# 2.4 Importing the Google Sheets file containing our raw transcript corpus to be used in Google Colab.
Running the next code cell will download the spreadsheet file containing our data corpus and save it as an Excel file in the temporary Colab storage, so it can be accessed by this Colab notebook. You can use this same code to download any file that you have sharing access to: replace the file_id on line 7 with the new one, and change the filename and/or file type on line 16.


#### The file will show in your folder several seconds after running the code cell

#### This is also temporary unless saved directly in your Google Drive.

In [None]:
import requests

# This is the full shared Drive link,
# https://docs.google.com/spreadsheets/d/1iJ4SG-QXfY4zw5K9B7Ununv3rb3iBj8S/edit?usp=drive_link&ouid=106477043869312333876&rtpof=true&sd=true

# get the file ID from the shareable link and paste below
file_id = "1iJ4SG-QXfY4zw5K9B7Ununv3rb3iBj8S"

# construct the download URL
download_url = f"https://docs.google.com/uc?export=download&id={file_id}"

# send a GET request to the download URL and save the response content
response = requests.get(download_url)

# The next line names the downloaded file. If you change the name here, you will also need to change it in the cells that follow.
with open("uncertaintyText.xlsx", "wb") as f:
    f.write(response.content)
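
Instead of copying the file ID out of the share link by hand, it can be pulled out with a regular expression. This is only a sketch: it assumes the link has the `/d/<file id>/` shape of the link shown in the cell above and does not handle other link formats, and the helper name `extract_file_id` is made up for this example.

```python
import re


def extract_file_id(share_link):
    """Return the ID that follows '/d/' in a Drive share link, or None if absent."""
    match = re.search(r"/d/([A-Za-z0-9_-]+)", share_link)
    return match.group(1) if match else None


share_link = ("https://docs.google.com/spreadsheets/d/"
              "1iJ4SG-QXfY4zw5K9B7Ununv3rb3iBj8S/edit?usp=drive_link")
file_id = extract_file_id(share_link)
print(file_id)
```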

# 2.5 Working with csv Files

Now that the spreadsheet is locally accessible, running this code cell will open the Excel file called uncertaintyText.xlsx found in the content folder, copy each row of text in the column titled "transcript", write that data to a text file called "raw_uncertaintyText.txt", and save it in your Google Drive instead of the temporary cloud folder.
### Importantly, the whole text corpus is also saved in a variable called raw_uncertaintyText, which is what we will work with as our raw, unprocessed "uncertainty" data as we move forward. The text file is only needed if we have to repeat this process.

In [None]:
!pip install pandas openpyxl nltk

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

# Specify the path to the Excel file
excel_file_path = '/content/uncertaintyText.xlsx'

# Specify the column name you want to tokenize
column_name = 'transcript'

# Read the Excel file and extract the specified column
data = pd.read_excel(excel_file_path, engine='openpyxl')
text_column = data[column_name]


# Convert each item in the column to a string and then join them
raw_uncertaintyText = ' '.join(map(str, text_column))


# Save the string to a text file in your Google Drive (UTF-8, to match how it is read back later)
with open('/content/drive/MyDrive/raw_uncertaintyText.txt', 'w', encoding='utf-8') as file:
  file.write(raw_uncertaintyText)

print("Text saved to raw_uncertaintyText.txt")
print(raw_uncertaintyText[0:250])
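
One thing to watch for with `' '.join(map(str, text_column))` is that every cell is converted to a string, so an empty spreadsheet cell, which pandas reads as NaN, ends up in the corpus as the literal word "nan". The sketch below illustrates this with a plain Python list standing in for the column; the cell values are invented for the example and the helper `is_missing` is hypothetical.

```python
import math

# Stand-in for the "transcript" column; float("nan") plays the role of an empty cell
cells = ["It's table 6, right?", float("nan"), "I think so."]

# Same approach as the cell above: every item becomes a string, including NaN
joined_all = " ".join(map(str, cells))


def is_missing(value):
    """True for float NaN values, which is how pandas represents empty cells."""
    return isinstance(value, float) and math.isnan(value)


# Skipping the missing cells keeps the word "nan" out of the corpus
joined_clean = " ".join(str(c) for c in cells if not is_missing(c))

print(joined_all)
print(joined_clean)
```

With a real pandas column, calling `text_column.dropna()` before joining should have a similar effect.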

# 2.6 Working directly from Text Files
Text files can be used directly and don't need to be converted before being imported into NLTK. Most other file types, apart from raw text and downloaded corpora, must first be converted to text files before they can be used in our NLTK data flow. Use the following code cell if you want to load your data directly from a text file, without converting from csv or pdf, for example.

In [37]:
import nltk

# Load data from the existing text file; the "with" block closes the file automatically
filename = '/content/drive/MyDrive/raw_uncertaintyText.txt'
with open(filename, 'rt', encoding='utf-8', errors='replace') as uncertaintyText:
    raw_uncertaintyText = uncertaintyText.read()

# Word Tokenization
uncertainty_wordTokens = nltk.word_tokenize(raw_uncertaintyText)

# Creating a Text object from the tokens
uncertainty_wordTextObjects = nltk.Text(uncertainty_wordTokens)


# 2.7 Basic information about the data corpus

From the original file located in my Google Drive, you have now created five items:

1. Excel File = uncertaintyText.xlsx
2. Text File = raw_uncertaintyText.txt
3. Uncertainty text variable = raw_uncertaintyText
4. Uncertainty text as word tokens = uncertainty_wordTokens
5. Uncertainty text as an NLTK Text object = uncertainty_wordTextObjects

If you run the code cell below, you will notice some differences between the last three items. The specific differences are unimportant for now, but it is useful to know that you can always find out what type of data you are dealing with by running these print checks. It is also important to keep your items clearly named and organized, lest they get out of control.
* We will look at the utility of numbers 3 and 4 in the next module, using some of NLTK's natural language processing features.

In [39]:
print("Number 3 is a: ",type(raw_uncertaintyText))
print("Number 4 is a: ",type(uncertainty_wordTokens))
print("Number 5 is a: ",type(uncertainty_wordTextObjects))

print("Number of characters in number 3 is: ",len(raw_uncertaintyText))
print("Number of tokens in number 4 is: ",len(uncertainty_wordTokens))
print("Number of tokens in number 5 is: ",len(uncertainty_wordTextObjects))

print("Here are the first 100 characters in number 3: ",raw_uncertaintyText[0:100])
print("Here are the first 100 tokens in number 4: ",uncertainty_wordTokens[0:100])
print("Here are the first 100 tokens in number 5: ",uncertainty_wordTextObjects[0:100])

Number 3 is a:  <class 'str'>
Number 4 is a:  <class 'list'>
Number 5 is a:  <class 'nltk.text.Text'>
Number of characters in number 3 is:  66207
Number of tokens in number 4 is:  16771
Number of tokens in number 5 is:  16771
Here are the first 100 characters in number 3:  It's table 6, right? I think so. Always go with basic assumptions. Unless it's about someone in that
Here are the first 100 tokens in number 4:  ['It', "'s", 'table', '6', ',', 'right', '?', 'I', 'think', 'so', '.', 'Always', 'go', 'with', 'basic', 'assumptions', '.', 'Unless', 'it', "'s", 'about', 'someone', 'in', 'that', 'case', ',', 'do', "n't", '.', 'Eric', '.', 'Yes', '.', 'I', "'ve", 'been', 'sitting', 'with', 'you', '.', 'Do', "n't", 'like', 'it', 'because', 'we', "'re", 'right', 'in', 'front', 'of', 'Ms.', 'Fletcher', '.', 'But', 'it', "'s", 'okay', '.', 'You', 'can', 'live', 'with', 'it', '.', 'Did', 'you', '...', 'did', 'you', 'do', 'the', 'search', '?', 'Yes', '.', 'It', "'s", 'for', 'next', 'c
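
The gap between the two lengths above comes from what is being counted: the length of a string is its number of characters, while the length of a list is its number of items. A small sketch of the same idea follows, using `str.split()` as a simplified stand-in for `nltk.word_tokenize` (it only splits on whitespace, so punctuation stays attached to words).

```python
raw_text = "It's table 6, right? I think so."

# str.split() is a simple stand-in for nltk.word_tokenize (whitespace splitting only)
tokens = raw_text.split()

print(type(raw_text), len(raw_text))  # length counts characters
print(type(tokens), len(tokens))      # length counts list items (tokens)
print(raw_text[0:4])                  # slicing a string returns characters
print(tokens[0:4])                    # slicing a list returns whole tokens
```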

# 2.8 Messing Around With Concordance and Text Objects
## Anonymizing the data corpus


### 2.81 Searching for names

In [70]:
import nltk
from nltk.text import Text


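# Use the Text object created in section 2.6; the search ignores case, so "felix" also finds "Felix"
uncertainty_wordTextObjects.concordance("felix", lines=25, width=200)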

Displaying 3 of 3 matches:
u 're going to want to use that word a lot . That 's too slow to hit this . All right . Thanks , Felix . I guess Nate , what 'd you miss ? So the same trick that I had all the answers fitting out rig
t right next to each other , I 'm going to put both of them in the equation itself . Anyway , as Felix found , the answer found is in here , 4.62 . But this can also be attributed to , since we know 
e exactly ? I mean , it 's one radian , but what else is it exactly ? Yeah . Yeah , what is it , Felix ? It 's 57.29577 . Is that exact ? Still getting me the decimal point . I was just looking for l


### 2.82 Replacing names

In [72]:
replace_dict = {
    'Eric': 'Steven',
    'Felix': 'Jason'
}

# Replace words in your original token list
names_uncertainty_wordTokens = [replace_dict.get(word, word) for word in uncertainty_wordTokens]

# You can directly create a Text object from the tokens for concordance
text_obj = Text(names_uncertainty_wordTokens)

# Check replacements with concordance
text_obj.concordance('felix')
text_obj.concordance('eric')
text_obj.concordance('Steven')
text_obj.concordance('Jason')

no matches
no matches
Displaying 4 of 4 matches:
out someone in that case , do n't . Steven . Yes . I 've been sitting with you
w what to do with it . According to Steven , there 's no solutions . Yeah , th
 supposed to solve for X . What did Steven say ? There 's no solutions . Steve
teven say ? There 's no solutions . Steven 's wrong . But look at the graph . 
Displaying 3 of 3 matches:
w to hit this . All right . Thanks , Jason . I guess Nate , what 'd you miss ? 
in the equation itself . Anyway , as Jason found , the answer found is in here 
exactly ? Yeah . Yeah , what is it , Jason ? It 's 57.29577 . Is that exact ? S
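
Because `replace_dict.get(word, word)` only matches tokens exactly, including capitalization, a name written in a different case would pass through unchanged. The concordance checks above found no such cases in this corpus, but for other data a case-insensitive check of the leftovers, as sketched below, may be worthwhile. The token list here is invented for the example.

```python
replace_dict = {
    "Eric": "Steven",
    "Felix": "Jason",
}

# Invented tokens: "eric" is lowercase on purpose to show what the exact match misses
tokens = ["Thanks", ",", "Felix", ".", "eric", "said", "so", "."]
anonymized = [replace_dict.get(token, token) for token in tokens]

# Compare in lowercase so that differently cased originals are still detected
original_names = {name.lower() for name in replace_dict}
leftovers = [token for token in anonymized if token.lower() in original_names]

print(anonymized)
print("Names still present:", leftovers)
```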
