# 2023 COSS Text Mining

 Welcome to Compute Ontario Text Mining Workshop's Jupyter Notebook. The dataset we will be using is called FederalistDataset.xlsx . This dataset has numbered various text excerpts by certain authors and organized into three columns: number, author, text. Text mining is the process of extracting meaning, patterns, and trends from unstructured textual data. Massive amounts of unstructured text are prevalent in today's society. Traditional machine learning algorithms handle only numerical or categorical data. Existing data analytical platforms provide special components to facilitate the analysis of textual data. This workshop introduces the topic of text mining and provides a tour with hands-on exercises and demonstrations of some basic texting mining tools, each of which supports an interesting and diverse set of features.

## In this Notebook

 -  Use multiple Python libraries, such as Pandas, NLTK, string, numpy, scipy, textBlob, and sci-kit learn
 - Preprocessing Text Data using preprocessing techniques (Tokenization, lowercase conversion, lemmatization, stemming, stopword removal)
 - Text Vectorization using TF-IDF (Term Frequency-Inverse Document Frequency) with Python library (sci-kit learn)
 - Text Classification using textBlob and Naive-Bayes classifer (supervised machine learning algorithm)

In [None]:
#The dataset to be used is an excel sheet called FederalistDataSet.

In [None]:
import pandas as pd #for example 1

In [34]:
import nltk #for example 1

In [35]:
from nltk.corpus import stopwords #for example 1

In [36]:
from nltk.stem import WordNetLemmatizer, PorterStemmer #for example 1

In [37]:
import string #for example 1 and 2

In [38]:
import numpy as np #for Example 2

In [39]:
import scipy #for Example 2

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer #for Example 2

In [80]:
from sklearn.model_selection import train_test_split #for Example 3

In [87]:
from nltk.sentiment import SentimentIntensityAnalyzer #for Example 3

In [98]:
from sklearn.feature_extraction.text import TfidfVectorizer #for Example 3
from sklearn.linear_model import LogisticRegression #for Example 3

In [120]:
from sklearn.svm import SVC #imports support vector machine

In [121]:
from sklearn.metrics import accuracy_score #metrics for sentiment analysis after classification

In [141]:
from sklearn.naive_bayes import MultinomialNB #for Naive Bayes Classifier in small exercise in Example 3

### Example 1: Preprocessing Text Data

In [43]:
file_path = "\\Users\\haani.admin\\Downloads\\FederalistDataset.xlsx" #File path of dataset in my case. Will be slightly different for each person.

In [44]:
df = pd.read_excel(file_path) #reads excel sheet

In [45]:
print(df.head()) #testing

   number    author                                               text
0       1  HAMILTON  To the People of the State of New York <l> AFT...
1       2       JAY  To the People of the State of New York <l> WHE...
2       3       JAY  To the People of the State of New York <l> IT ...
3       4       JAY  To the People of the State of New York <l> MY ...
4       5       JAY  To the People of the State of New York <l> QUE...


In [46]:
print(df.columns) #testing

Index(['number', 'author', 'text'], dtype='object')


In [47]:
df['author'] = df['author'].str.lower() #makes the author column all lowercase using str.lower()

In [48]:
print(df['author']) #proof that all of author column is lowercase

0     hamilton
1          jay
2          jay
3          jay
4          jay
        ...   
80    hamilton
81    hamilton
82    hamilton
83    hamilton
84    hamilton
Name: author, Length: 85, dtype: object


In [49]:
#df['text'] = df['text'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation))) removing all punctuation from text column of dataset

In [50]:
#print(df['text']) #proof that all punctuation got removed from the column called text

0     To the People of the State of New York l AFTER...
1     To the People of the State of New York l WHEN ...
2     To the People of the State of New York l IT IS...
3     To the People of the State of New York l MY LA...
4     To the People of the State of New York l QUEEN...
                            ...                        
80    To the People of the State of New York l LET U...
81    To the People of the State of New York l THE e...
82    To the People of the State of New York l THE o...
83    To the People of the State of New York l IN TH...
84    To the People of the State of New York l ACCOR...
Name: text, Length: 85, dtype: object


In [51]:
#initialization code below

In [52]:
lemmatizer = WordNetLemmatizer() #lemmatizer initialized

In [53]:
stemmer = PorterStemmer() #stemming initialized

In [54]:
stopwords = set(stopwords.words('english')) #stopwords initialization

In [55]:
preprocessed_text = [] #array to be appended to

In [56]:
for text in df['text']: #for loop which tokenizes, lowercases, lemmatizes, stemms, and does stopword removing of text column
    tokens = nltk.word_tokenize(text) #tokenization (removes all punctuation)
    tokens = [token.lower() for token in tokens] #lowercase conversion
    tokens = [lemmatizer.lemmatize(token)for token in tokens]    #lemmatization
    tokens = [stemmer.stem(token) for token in tokens] #Stemming
    tokens = [token for token in tokens if token not in stopwords] #stopword removing
    preprocessed = ''.join(tokens)
    preprocessed_text.append(preprocessed)

In [57]:
df['text'] = preprocessed_text

In [None]:
df['text'].head()

In [None]:
print(df['text']) #tests if all changes of the for loop were successfully done to text (column in dataset)

### Example 2: Text Vectorization using TF-IDF (Term Frequency-Inverse Document Frequency) with Python Library (sci-kit learn)

In [68]:
text_data = df['text'].tolist() #converts dataframe into python list

In [69]:
#Creates an instance of TfidfVectorizer

In [70]:
vectorizer = TfidfVectorizer()

In [71]:
#Fits the vectorizer on the text data to learn the vocabulary 

In [72]:
vectorizer.fit(text_data)

In [73]:
#Transforms the text data into TF-IDF vectors

In [74]:
tfidf_vectors = vectorizer.transform(text_data)

In [75]:
#Print the TF-IDF vectors

In [146]:
print(tfidf_vectors.toarray())

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]]


In [None]:
#The code above, df['text'] refers to the text column of the dataFrame 'df'
#tolist() is used to convert the column into a list of text data
#the remaining steps are for creating TF-IDF vectors which are displayed
#in array format: each row corresponds to a document from the 'text' column
#each column represents a unique word or term from the vocab

### Example 3: Text Classification

In [None]:
#Goal of this exercise: Implement classification model (one of: Naïve Bayes, Support Vector Machines) using scikit-learn 

#### Example 3: Exercise 1 (using SentimentIntensityAnalyzer)

In [160]:
from textblob import TextBlob

# Apply sentiment analysis using TextBlob
df['sentiment'] = df['text'].apply(lambda x: TextBlob(x).sentiment.polarity)

In [161]:
# Assign sentiment labels based on polarity values
df['sentiment_label'] = df['sentiment'].apply(lambda x: 'positive' if x > 0 else 'negative' if x < 0 else 'neutral')

In [83]:
df['y'] = "" #creates a df for sentiments

In [88]:
sia = SentimentIntensityAnalyzer() #sia variable created to represent SentimentIntensityAnalyzer

In [128]:
for index, row in df.iterrows(): #sentiment breakdown to be used in a classification model
    text = row['text']
    sentiment_scores = sia.polarity_scores(text)
    if sentiment_scores['compound'] >= 0.05:
        df.at[index, 'y'] = 'positive'
    elif sentiment_scores['compound'] <= -0.05:
        df.at[index, 'y'] = 'negative'
    else:
        df.at[index, 'y'] = 'neutral'

In [129]:
print(df['y']) #assigns sentiment labelling

0     neutral
1     neutral
2     neutral
3     neutral
4     neutral
       ...   
80    neutral
81    neutral
82    neutral
83    neutral
84    neutral
Name: y, Length: 85, dtype: object


#### Example 3: End of Exercise 1

In [130]:
#X = df['text'] splits data into training + testing sets. 80% of data used for training, 20% for testing
#y = df['y']
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) #X=feature matrix, y=Target Variable


In [131]:
#vectorizer = TfidfVectorizer()

#### Example 3: Exercise 2

In [None]:
#Below is just a small exercise demonstrating sentiment analysis library called TextBlob which provides a simple API for 

In [None]:
#performing sentiment analysis on text data. TextBlob's sentiment analysis functionality uses a pre-trained sentiment polarity model based on Naive Bayes Algorithm

In [None]:
#Textblob's sentimental analysis model assigns a polarity score to each text input, indicating sentiment polarity as: Positive (1.0), Negative(-1.0) or Neutral(0.0)

In [None]:
#TextBlob's sentiment analysis model utilizes a Naive Bayes classifier trained on a large dataset of movie reviews that have been labeled with sentiment polarity. Capturing patterns in the text data to predict sentiment polarity based on the words and phrases present in the input.

In [150]:
from textblob import TextBlob #import statement

In [154]:
from newspaper import Article #import statement

In [None]:
# test out these three urls and find out what the sentiment score is!

In [None]:
#https://en.wikipedia.org/wiki/Computer_science

In [None]:
#https://www.cnbc.com/2020/06/07/stock-market-futures-open-to-close-news.html

In [None]:
#https://www.cnbc.com/2020/04/22/recession-depth-will-be-much-worse-than-2007-2009-lakshman-achuthan.html

In [155]:
url = 'https://en.wikipedia.org/wiki/Computer_science' #input any of the sample links from above here

In [156]:
article = Article(url)

In [157]:
blob = TextBlob(text)

In [158]:
sentiment = blob.sentiment.polarity #-1 to 1 where: -1 = Negative, 0 = Neutral, 1 = Positive

In [169]:
print(sentiment)

0.0


#### Example 3: End of Exercise 2