## Task 3: Named Entity Recognition (NER)

The goal of this task was to extract named entities, such as product names, organizations, and locations, from sustainability-related text using a pre-trained NER model (spaCy's `en_core_web_sm`). The model was applied to a keyword-filtered subset of a Twitter dataset to identify relevant entities, and challenges related to domain-specific entity recognition were examined. The detected entities were then visualized to analyze the output further.

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm

# Import necessary libraries

In [None]:
import pandas as pd
from spacy import displacy
import spacy
import re
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [None]:
# Load the dataset
df = pd.read_csv(
    "/content/drive/My Drive/Advanced NLP/Dataset/twitter_dataset.csv",
    encoding="ISO-8859-1",
    names=["target", "ids", "date", "flag", "user", "text"],
)

df.head()

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there."


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   target  1600000 non-null  int64 
 1   ids     1600000 non-null  int64 
 2   date    1600000 non-null  object
 3   flag    1600000 non-null  object
 4   user    1600000 non-null  object
 5   text    1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [None]:
df["target"].value_counts()

Unnamed: 0_level_0,count
target,Unnamed: 1_level_1
0,800000
4,800000


## Removing unnecessary columns

In [None]:
df.head()

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there."


We are only interested in the `target` and `text` columns; the rest will be removed.

In [None]:
df = df[["text", "target"]]

# Let's extract tweets that are related to sustainability

In [None]:
# We defined a set of sustainability-related keywords
sustainability_keywords = [
    'climate change', 'renewable energy', 'clean energy', 'sustainable', 'green energy',
    'carbon emissions', 'environment', 'recycling', 'solar power', 'wind energy', 'sustainability',
    'biofuel', 'global warming', 'sustainable transport', 'fossil fuels', 'net zero', 'greenhouse gases',
    'carbon footprint', 'conservation', 'pollution'
]

In [None]:
# Now we will filter rows that contain any of the sustainability-related keywords in the 'text' column
def contains_sustainability_keywords(text):
    text = text.lower()  # Convert text to lowercase for case-insensitive matching
    return any(keyword in text for keyword in sustainability_keywords)
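
The substring test above also matches longer words that merely contain a keyword; for instance, `'environment' in text` is true for the "virtualised environments" tweet shown later in this notebook. A stricter alternative, sketched here under the assumption that whole-word matches are wanted, compiles the keywords into one regular expression with word boundaries. The names `SHORT_KEYWORDS`, `KEYWORD_RE`, and `contains_keyword_as_word` are hypothetical and not part of the original notebook.

```python
import re

# Hypothetical shortened keyword list, for illustration only
SHORT_KEYWORDS = ['climate change', 'environment', 'global warming', 'net zero']

# One alternation wrapped in \b...\b so keywords match only as whole words or phrases
KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in SHORT_KEYWORDS) + r")\b",
    flags=re.IGNORECASE,
)

def contains_keyword_as_word(text):
    """Return True if any keyword occurs as a whole word or phrase in text."""
    return KEYWORD_RE.search(str(text)) is not None

print(contains_keyword_as_word("I hate global warming and i hate snow."))  # True
print(contains_keyword_as_word("how to secure virtualised environments"))  # False
```

The trade-off is that derived forms such as "environmental" are also excluded, so any wanted variants would need to be listed explicitly.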

In [None]:
# Keep only the rows containing sustainability-related keywords
sustainability_related_df = df[df['text'].apply(contains_sustainability_keywords)]

In [None]:
print(f"Sustainability Related DF contains {len(sustainability_related_df)}")

Sustainability Related DF contains 493


In [None]:
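# Work on an explicit copy so that adding columns later does not trigger SettingWithCopyWarning
df = sustainability_related_df.copy()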

# Clean the Data

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
import re
from nltk.corpus import stopwords

lemmatizer = WordNetLemmatizer()

# Clean the text: remove user mentions, URLs, special characters, stopwords, etc.
# A raw string avoids invalid-escape warnings for \S
TEXT_CLEANING_RE = r"@\S+|https?:\S+|[^A-Za-z0-9]+"
stop_words = stopwords.words("english")
stop_words.extend(['good', 'like', 'gt', 'amp', 'quot'])

def preprocess(text):
    # Remove unwanted characters and lemmatize
    text = re.sub(TEXT_CLEANING_RE, ' ', str(text).lower()).strip()
    tokens = [lemmatizer.lemmatize(token) for token in text.split() if token not in stop_words]
    return " ".join(tokens)

df["cleaned_text"] = df.text.apply(preprocess)
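
To see what the cleaning pattern alone does, before stopword removal and lemmatization, the following sketch applies a raw-string form of it to the first tweet in the dataset. It uses only the standard library; `CLEANING_RE` and `strip_noise` are hypothetical names introduced for illustration.

```python
import re

# Raw-string form of the cleaning pattern: mentions, URLs, then any non-alphanumeric run
CLEANING_RE = re.compile(r"@\S+|https?:\S+|[^A-Za-z0-9]+")

def strip_noise(text):
    """Lowercase the text, replace every match with a space, and split into tokens."""
    return CLEANING_RE.sub(" ", str(text).lower()).split()

tweet = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer."
print(strip_noise(tweet))  # ['awww', 'that', 's', 'a', 'bummer']
```

Note that contractions are split at the apostrophe, and HTML entities such as `&gt;` leave behind the bare token `gt`, which is why `gt`, `amp`, and `quot` are added to the stopword list.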

In [None]:
pd.set_option('display.max_colwidth', None)

df.head()

Unnamed: 0,text,target,cleaned_text
335,getting annoyed easily today &gt;&gt;&gt; biofuel proposal: getting annoyed easily today &gt;&gt;&gt; biof.. http://tinyurl.com/ceprvs,0,getting annoyed easily today biofuel proposal getting annoyed easily today biof
3363,Gosh It is raining in summer cause of the global warming?,0,gosh raining summer cause global warming
3862,"Arse, totally forgot about a webinar that I wanted to attend this morning. Now I'll never know how to secure virtualised environments",0,arse totally forgot webinar wanted attend morning never know secure virtualised environment
4722,http://twitpic.com/2ya1c - Good F'in Morning Springtime my ass. Global warming?! Suck it!,0,f morning springtime as global warming suck
6254,I hate global warming and i hate snow. ITS APRIL ffs.,0,hate global warming hate snow april ffs


# NER

In [None]:
# Load spaCy's small English model
nlp = spacy.load("en_core_web_sm")

In [None]:
nlp

<spacy.lang.en.English at 0x7ab3d4ee2080>

In [None]:
# Apply NER on the text
def extract_entities(text):
    doc = nlp(text)  # Process the text with spaCy NER
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

# text_df = df.copy()
# cleaned_text_df = df.copy()

df['entities'] = df['cleaned_text'].apply(extract_entities)

In [None]:
df[['cleaned_text', 'entities']].head()

Unnamed: 0,cleaned_text,entities
335,getting annoyed easily today biofuel proposal getting annoyed easily today biof,"[(today, DATE), (today, DATE)]"
3363,gosh raining summer cause global warming,"[(summer, DATE)]"
3862,arse totally forgot webinar wanted attend morning never know secure virtualised environment,"[(morning, TIME)]"
4722,f morning springtime as global warming suck,"[(morning, TIME)]"
6254,hate global warming hate snow april ffs,"[(april, DATE)]"
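
In the five rows shown, only `DATE` and `TIME` entities were detected. One plausible reason, stated here as an assumption rather than a measured result, is that lowercasing and stopword removal strip the capitalization and context cues that pre-trained NER models tend to rely on. A quick tally of entity labels makes such gaps visible. The sketch below uses the entity lists from the output above; `sample_entities` and `count_labels` are hypothetical names.

```python
from collections import Counter

# Entity lists copied from the five rows displayed above
sample_entities = [
    [("today", "DATE"), ("today", "DATE")],
    [("summer", "DATE")],
    [("morning", "TIME")],
    [("morning", "TIME")],
    [("april", "DATE")],
]

def count_labels(entity_lists):
    """Count how often each entity label occurs across all rows."""
    return Counter(label for ents in entity_lists for _, label in ents)

label_counts = count_labels(sample_entities)
print(label_counts.most_common())  # [('DATE', 4), ('TIME', 2)]
```

On the full DataFrame, `count_labels(df['entities'])` would give the label distribution across all filtered tweets.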


In [None]:
df = df[["cleaned_text", "entities"]].reset_index()

df.head()

Unnamed: 0,index,cleaned_text,entities
0,335,getting annoyed easily today biofuel proposal getting annoyed easily today biof,"[(today, DATE), (today, DATE)]"
1,3363,gosh raining summer cause global warming,"[(summer, DATE)]"
2,3862,arse totally forgot webinar wanted attend morning never know secure virtualised environment,"[(morning, TIME)]"
3,4722,f morning springtime as global warming suck,"[(morning, TIME)]"
4,6254,hate global warming hate snow april ffs,"[(april, DATE)]"


# Visualize Entities

In [None]:
def visualize_entities(text):
    doc = nlp(text)
    # Render the entity visualization inline in the notebook
    displacy.render(doc, style='ent', jupyter=True)

# Test the visualization on the first row
visualize_entities(df['cleaned_text'][0])
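
Because the entities are already stored in the `entities` column, they can in principle be rendered without running the model a second time. displaCy's manual mode accepts a dictionary with `text`, `ents` (character offsets plus label), and `title` keys. The sketch below builds that dictionary with the standard library only; `to_displacy_manual` is a hypothetical helper, and it assumes each stored entity string appears in the text in the order it was detected.

```python
def to_displacy_manual(text, entities):
    """Convert (entity_text, label) pairs into displaCy's manual 'ent' format."""
    ents = []
    cursor = 0  # search position, so repeated entities map to successive occurrences
    for ent_text, label in entities:
        start = text.find(ent_text, cursor)
        if start == -1:
            continue  # entity string not found after the cursor; skip it
        end = start + len(ent_text)
        ents.append({"start": start, "end": end, "label": label})
        cursor = end
    return {"text": text, "ents": ents, "title": None}

# First row of the notebook output, used as example input
example_text = "getting annoyed easily today biofuel proposal getting annoyed easily today biof"
example = to_displacy_manual(example_text, [("today", "DATE"), ("today", "DATE")])
print([example_text[e["start"]:e["end"]] for e in example["ents"]])  # ['today', 'today']

# In the notebook, the dictionary could then be rendered with:
# displacy.render(example, style="ent", manual=True, jupyter=True)
```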