Use the following list to find open-source data sets for the take-home exercises. You can also apply these exercises to the data set provided for AT1.

[Open Data Sets](https://canvas.uts.edu.au/courses/32341/pages/open-data-sets-for-nlp-and-text-analysis?module_item_id=1878922)

In [2]:
import pandas as pd

# Dataset - https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment
data = pd.read_csv('Datasets/Tweets.csv')

In [3]:
data.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [4]:
text = data['text'].to_string()

In [5]:
text[:500]

# Things to clean up: numerals (the row index that to_string() writes before each tweet)

"0                      @VirginAmerica What @dhepburn said.\n1        @VirginAmerica plus you've added commercials t...\n2        @VirginAmerica I didn't today... Must mean I n...\n3        @VirginAmerica it's really aggressive to blast...\n4        @VirginAmerica and it's a really big bad thing...\n5        @VirginAmerica seriously would pay $30 a fligh...\n6        @VirginAmerica yes, nearly every time I fly VX...\n7        @VirginAmerica Really missed a prime opportuni...\n8          @virginamerica We"

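The preview above shows that `to_string()` keeps the row index in front of every tweet and cuts long tweets off with `...`, so part of each tweet never reaches the pre-processing step. Here is a minimal sketch of one alternative, assuming the goal is a single string containing the full tweets: join the column values directly. The two-row `sample` frame is a made-up stand-in for `Tweets.csv`, and `full_text` is a hypothetical name.

```python
import pandas as pd

# Made-up two-row sample standing in for Tweets.csv (assumption, not real data)
sample = pd.DataFrame({
    "text": [
        "@ExampleAir thanks for the upgrade!",
        "@ExampleAir this is a made-up tweet that is long enough to be cut off when a Series is printed on screen",
    ]
})

# Join the raw strings: no index labels and no display truncation
full_text = " ".join(sample["text"].astype(str))
print(full_text)
```

With this approach the digit-removal step only has to deal with numbers that appear inside the tweets themselves.
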
### Pre-processing

1. Calculate word associations in a large data set; try different methods of calculating them (e.g. PMI, chi-square test).
2. Compare lemmatization and stemming results.
3. Try adding another pre-processing step to remove all numbers/digits from the text.




In [7]:
# Taken from week 2 lab
# We create a TextPreprocessor class that encapsulates all the preprocessing steps. The class constructor allows for custom punctuation marks and stopwords to be added.
# Each preprocessing step is implemented as a separate method so we can define in which order they need to be called.

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re
import string

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')

class TextPreprocessor:
    def __init__(self, custom_punctuation=None, custom_stopwords=None):
        self.punctuation = string.punctuation
        if custom_punctuation:
            self.punctuation += custom_punctuation

        self.stop_words = set(stopwords.words('english'))
        if custom_stopwords:
            self.stop_words.update(custom_stopwords)

        self.stemmer = PorterStemmer()

    def remove_punctuation(self, text):
        return ''.join([char for char in text if char not in self.punctuation])

    # Custom one for the CNN dataset - try removing below and see results
    def add_space_after_parenthesis(self, text):
        return re.sub(r'\)', ') ', text)

    def to_lowercase(self, text):
        return text.lower()

    def remove_stopwords(self, text):
        words = word_tokenize(text)
        return ' '.join([word for word in words if word not in self.stop_words])

    # Cleans up after the "CNN)" fix: the space is added before punctuation removal,
    # so it won't affect the final preprocessed text if you're removing all punctuation
    def remove_extra_whitespace(self, text):
        return re.sub(r'\s+', ' ', text).strip()

    def stem_words(self, text):
        words = word_tokenize(text)
        return ' '.join([self.stemmer.stem(word) for word in words])

    # Remove every run of digits, including the row index that to_string() writes at the start of each line
    def remove_numerics(self, text):
        return re.sub(r'\d+', '', text)

    #Order matters - how you call these methods is how the text will be processed step-by-step
    # rearrange if we want to change the order of functions here
    def preprocess(self, text):
        text = self.add_space_after_parenthesis(text)
        text = self.remove_numerics(text)
        text = self.remove_punctuation(text)
        text = self.to_lowercase(text)
        text = self.remove_stopwords(text)
        text = self.remove_extra_whitespace(text)
        #text = self.stem_words(text)
        return text

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ff255\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ff255\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
preprocessor = TextPreprocessor()

text_cleaned = preprocessor.preprocess(text)

In [9]:
text_cleaned[:500]

'virginamerica dhepburn said virginamerica plus youve added commercials virginamerica didnt today must mean n virginamerica really aggressive blast virginamerica really big bad thing virginamerica seriously would pay fligh virginamerica yes nearly every time fly vx virginamerica really missed prime opportuni virginamerica well didnt…but virginamerica amazing arrived virginamerica know suicide th virginamerica lt pretty graphics muc virginamerica great deal alre virginamerica virginmedia im flying'

In [10]:
tokenized_words = word_tokenize(text_cleaned)

In [11]:
tokenized_words[:20]

['virginamerica',
 'dhepburn',
 'said',
 'virginamerica',
 'plus',
 'youve',
 'added',
 'commercials',
 'virginamerica',
 'didnt',
 'today',
 'must',
 'mean',
 'n',
 'virginamerica',
 'really',
 'aggressive',
 'blast',
 'virginamerica',
 'really']

In [12]:
# PMI tests
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
bigram_measures = BigramAssocMeasures()
#trigram_measures = nltk.collocations.TrigramAssocMeasures()
#fourgram_measures = nltk.collocations.QuadgramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokenized_words)

# Use PMI scores to quantify and rank the bigrams
finder.nbest(bigram_measures.pmi, 50)

[('aal', 'declared'),
 ('abused', 'threatened'),
 ('accommodated', 'dsm'),
 ('adamkarren', 'zj'),
 ('adolfo', 'garcia'),
 ('aggravating', 'zone'),
 ('aggressive', 'blast'),
 ('ahah😃💕🎵', 'whyni'),
 ('airlinegeeks', 'avgee'),
 ('airserv', 'contractors'),
 ('alamo', 'tat'),
 ('alert', 'immediately'),
 ('alicia', 'exceptionalservice'),
 ('americanaireveryone', 'weeksampthos'),
 ('americas', 'largest'),
 ('amiltx', 'forgiven'),
 ('amy', 'lloyd'),
 ('andyellwood', 'delk'),
 ('angriest', 'angstiest'),
 ('announcing', 'winn'),
 ('annual', 'marthas'),
 ('anticipating', 'weatherrela'),
 ('arkansas', 'gov'),
 ('aussie', 'cow'),
 ('authors', 'fiction'),
 ('avoidable', 'nonweather'),
 ('baftz', 'rcvd'),
 ('baldordash', 'rebookedarrived'),
 ('batman', 'spee'),
 ('beatriz', 'susan'),
 ('becky', 'piela'),
 ('beefjerky', 'snacks😉'),
 ('belabor', 'pointbut'),
 ('belligerent', 'jerk'),
 ('betsy', 'besty'),
 ('beware', 'barklays'),
 ('blackhistorymonth', 'commerc'),
 ('bleed', 'foot'),
 ('bloc
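
Most of the pairs above look like names, typos, and one-off hashtags. This is the usual behaviour of PMI on rare words: a pair that occurs only once can still get the top score. Below is a minimal sketch of two things to try for exercise 1, assuming a frequency threshold is acceptable: drop rare bigrams with `apply_freq_filter`, then rank the rest with both PMI and the chi-square measure. The toy `tokens` list and the threshold of 3 are illustrative assumptions; in the notebook you would pass `tokenized_words` instead.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Made-up toy tokens (assumption); in the notebook use tokenized_words
tokens = (
    "late flight delayed flight late flight customer service "
    "customer service late flight customer service rude agent"
).split()

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# Keep only bigrams that occur at least 3 times (arbitrary threshold)
finder.apply_freq_filter(3)

# Rank the remaining bigrams with two different association measures
top_pmi = finder.nbest(measures.pmi, 5)
top_chi = finder.nbest(measures.chi_sq, 5)

print("PMI:       ", top_pmi)
print("Chi-square:", top_chi)
```

Raising or lowering the threshold trades coverage for reliability, so it is worth comparing a few values on the full data set.
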

In [13]:
# Stemming
stemmer = PorterStemmer()  # create the stemmer once and reuse it for every token
stem_words = ' '.join(stemmer.stem(word) for word in tokenized_words)

In [14]:
stem_words[:500]

'virginamerica dhepburn said virginamerica plu youv ad commerci virginamerica didnt today must mean n virginamerica realli aggress blast virginamerica realli big bad thing virginamerica serious would pay fligh virginamerica ye nearli everi time fli vx virginamerica realli miss prime opportuni virginamerica well didnt…but virginamerica amaz arriv virginamerica know suicid th virginamerica lt pretti graphic muc virginamerica great deal alr virginamerica virginmedia im fli f virginamerica thank virg'
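
The stemmed text is easier to judge when each word sits next to its stem. The sketch below prints that mapping for a few words taken from the cleaned text above; the `words` list is a hand-picked sample and `pairs` is a hypothetical name.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few words copied from the cleaned text above
words = ["plus", "added", "commercials", "really", "aggressive", "flying"]

# Pair every word with its Porter stem
pairs = [(word, stemmer.stem(word)) for word in words]
for word, stem in pairs:
    print(f"{word:12} -> {stem}")
```

Stems such as `plu` or `realli` are not dictionary words, which is worth keeping in mind when comparing them with lemmas.
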

In [15]:
import spacy # Library for NLP
# Load the spaCy trained pipeline; the parser and NER components are disabled because they are not used here
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Function to lemmatize the tokens.
# Only tokens tagged as one of allowed_postags are kept, so the output is shorter than the input
# and does not line up token-for-token with the stemmed text.
def lemmatize(tokens, allowed_postags=("NOUN", "ADJ", "VERB")):
    doc = nlp(" ".join(tokens))
    return [token.lemma_ for token in doc if token.pos_ in allowed_postags]

In [16]:
# Lemmatization
lemma_words = lemmatize(tokenized_words)

In [37]:
lemma_words[:100]

['say',
 've',
 'add',
 'commercial',
 'today',
 'mean',
 'aggressive',
 'blast',
 'big',
 'bad',
 'thing',
 'pay',
 'time',
 'fly',
 'miss',
 'prime',
 'didnt',
 'arrive',
 'pretty',
 'graphic',
 'm',
 'fly',
 'schedule',
 'excite',
 'fly',
 'last',
 '❤',
 'fly',
 'know',
 'amazingl',
 'first',
 'fare',
 'love',
 'graphic',
 'make',
 'guy',
 'mess',
 'seat',
 'happen',
 'worry',
 'get',
 'seat',
 'bked',
 'cool',
 'birthday',
 'help',
 'leave',
 'expensive',
 'headphone',
 'await',
 'return',
 'phone',
 'call',
 'moodlighte',
 'way',
 'freddieaward',
 'do',
 'do',
 'support',
 'first',
 'time',
 'flyer',
 'next',
 'week',
 'help',
 'win',
 'bid',
 'unused',
 'ticket',
 'mov',
 'flight',
 'leave',
 'm',
 'elevategold',
 'blow',
 'way',
 'fly',
 'flight',
 'leave',
 'm',
 'excited',
 'know',
 'need',
 'm',
 'virginamerica',
 'new',
 'marketing',
 'song',
 'call',
 'week',
 'try',
 'm',
 'hold',
 'congrat',
 'win',
 'travel',
 'fine',
 'need',
 'change',
 'reservation']

### Topic Modeling

Try out topic modeling using the scikit-learn (sklearn) package. There are different algorithms you can read about and experiment with: Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Non-negative Matrix Factorization (NMF).

### Text Clustering

Try out text clustering with a different dataset and build an optimized model by re-evaluating the number of clusters.
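
One possible way to re-evaluate the number of clusters is to fit k-means for several values of `k` and compare silhouette scores. The sketch below assumes TF-IDF features and k-means; the `docs` list is made up, the range of `k` is arbitrary, and `scores` and `best_k` are hypothetical names. Other clustering algorithms and metrics are equally valid choices.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Made-up mini-corpus (assumption); replace with your own documents
docs = [
    "flight delayed two hours at the gate",
    "delayed flight missed connection",
    "flight cancelled and rebooked for tomorrow",
    "lost bag at baggage claim",
    "baggage claim still has no bag",
    "damaged suitcase at baggage claim",
    "friendly crew and great service",
    "great service from a friendly crew",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Fit k-means for each candidate k and record the silhouette score
scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# A higher silhouette score suggests better-separated clusters
best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette score:", best_k)
```
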