# **Part 1: Text Collection and Loading**

##### **Objective:** *Collect and load a text dataset from a selected domain into a suitable format for processing.*
**Domain:** *Environmental*

**Kaggle Dataset:** https://www.kaggle.com/datasets/joseguzman/climate-sentiment-in-twitter



### **Necessary Imports**

In [14]:
# Import necessary libraries
import pandas as pd
import nltk
from nltk.corpus import gutenberg
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords


# Download the NLTK data files used in this notebook (uncomment any that are missing)
# nltk.download('gutenberg')
# nltk.download('punkt')
# nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### **Loading a CSV File**

In [1]:
# Load the CSV data into a DataFrame
df_climate_change = pd.read_csv("Climate_twitter.csv")

# Print the first five rows of the DataFrame
print(df_climate_change.head())

           id                 date  retweets                   source  \
0  2184934963  2020-12-22 23:22:20        71          Twitter Web App   
1   508658626  2020-12-10 14:30:00        14  Twitter for Advertisers   
2  2607105006  2020-12-22 21:28:52         0          Twitter Web App   
3    19609660  2020-12-22 21:24:10         0          Twitter Web App   
4    19609660  2020-12-21 22:52:09         1          Twitter Web App   

                         author  likes  \
0                      GO GREEN     91   
1               Elsevier Energy     98   
2                  Arwyn Thomas      1   
3  Tom Gillispie, EDITOR/WRITER      0   
4  Tom Gillispie, EDITOR/WRITER      1   

                                                text    twitter_name  \
0  The death of summer Arctic ice our Earth coole...    ECOWARRIORSS   
1  Elsevier and the EditorsinChief are pleased to...  ElsevierEnergy   
2  From better climate change education to improv...         siwarr5   
3  climate change Li
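
As a minimal sketch of a more explicit load, `parse_dates` can convert the `date` column to datetimes at read time, and `usecols` can restrict the frame to the columns needed later. The inline sample below is a stand-in for `Climate_twitter.csv`: its column names follow the output above, but its two rows are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for Climate_twitter.csv.
# Column names follow the notebook output; the values are illustrative only.
sample_csv = io.StringIO(
    "id,date,retweets,source,author,likes,text,twitter_name\n"
    "1,2020-12-22 23:22:20,71,Twitter Web App,Author A,91,First example tweet,name_a\n"
    "2,2020-12-10 14:30:00,14,Twitter Web App,Author B,98,Second example tweet,name_b\n"
)

# Keep only the columns needed downstream and parse the timestamp column
df_sample = pd.read_csv(
    sample_csv,
    usecols=["id", "date", "retweets", "likes", "text"],
    parse_dates=["date"],
)

print(df_sample.dtypes)
print(df_sample.shape)
```

With the real file, the same keyword arguments would be passed to `pd.read_csv("Climate_twitter.csv", ...)`.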

# **Part 2: Text Preprocessing**

### **Tokenization: Split the text into words and sentences.**

In [5]:
# Extract the column with the text data
texts = df_climate_change['text']

# Function to tokenize sentences
def tokenize_sentences(text):
    return sent_tokenize(text)

# Function to tokenize words
def tokenize_words(text):
    return word_tokenize(text)

# Apply the tokenization functions to the text column
df_climate_change['sentences'] = texts.apply(tokenize_sentences)
df_climate_change['words'] = texts.apply(tokenize_words)

# Display the DataFrame with the new tokenized columns
print(df_climate_change[['text', 'sentences', 'words']].head())


                                                text  \
0  The death of summer Arctic ice our Earth coole...   
1  Elsevier and the EditorsinChief are pleased to...   
2  From better climate change education to improv...   
3  climate change Links to FIXING CLIMATE CHANGE ...   
4  climate change The 11TH HOUR FOR THE EARTH cli...   

                                           sentences  \
0  [The death of summer Arctic ice our Earth cool...   
1  [Elsevier and the EditorsinChief are pleased t...   
2  [From better climate change education to impro...   
3  [climate change Links to FIXING CLIMATE CHANGE...   
4  [climate change The 11TH HOUR FOR THE EARTH cl...   

                                               words  
0  [The, death, of, summer, Arctic, ice, our, Ear...  
1  [Elsevier, and, the, EditorsinChief, are, plea...  
2  [From, better, climate, change, education, to,...  
3  [climate, change, Links, to, FIXING, CLIMATE, ...  
4  [climate, change, The, 11TH, HOUR, FOR, THE, E..
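
Because the texts are tweets, NLTK's `TweetTokenizer` may be worth comparing with `word_tokenize`: it is rule-based, needs no downloaded model, and keeps hashtags as single tokens. The sketch below is an alternative, not what this notebook uses; the helper name `tokenize_tweet` and the sample tweet are made up for illustration.

```python
from nltk.tokenize import TweetTokenizer

# strip_handles drops @mentions, reduce_len shortens long character runs,
# preserve_case=False lowercases the tokens.
tweet_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

def tokenize_tweet(text):
    # Return an empty list for missing values (e.g. NaN) instead of raising
    if not isinstance(text, str):
        return []
    return tweet_tokenizer.tokenize(text)

sample = "@someone Climate change is real #ClimateAction"
tokens = tokenize_tweet(sample)
print(tokens)
```

It could be applied to the DataFrame in the same way as the functions above, for example `df_climate_change['text'].apply(tokenize_tweet)`.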

### **Stemming: Reduce words to their root form.**

In [11]:
# Initialize the stemmer
stemmer = PorterStemmer()

# Function to stem words
def stem_words(words):
    return [stemmer.stem(word) for word in words]


# Apply the stemming function to the tokenized words column
df_climate_change['stemmed_words'] = df_climate_change['words'].apply(stem_words)

# Display the DataFrame with the stemmed words column
print(df_climate_change[['text', 'words', 'stemmed_words']].head())


                                                text  \
0  The death of summer Arctic ice our Earth coole...   
1  Elsevier and the EditorsinChief are pleased to...   
2  From better climate change education to improv...   
3  climate change Links to FIXING CLIMATE CHANGE ...   
4  climate change The 11TH HOUR FOR THE EARTH cli...   

                                               words  \
0  [The, death, of, summer, Arctic, ice, our, Ear...   
1  [Elsevier, and, the, EditorsinChief, are, plea...   
2  [From, better, climate, change, education, to,...   
3  [climate, change, Links, to, FIXING, CLIMATE, ...   
4  [climate, change, The, 11TH, HOUR, FOR, THE, E...   

                                       stemmed_words  
0  [the, death, of, summer, arctic, ice, our, ear...  
1  [elsevi, and, the, editorsinchief, are, pleas,...  
2  [from, better, climat, chang, educ, to, improv...  
3  [climat, chang, link, to, fix, climat, chang, ...  
4  [climat, chang, the, 11th, hour, for, the, ear..
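
To see what the Porter stemmer does to individual tokens, the following small sketch stems a few hand-picked words (chosen for illustration, not drawn from the dataset). Stems such as `climat` and `chang` are not dictionary words, and the stemmer lowercases its input by default, which matches the output above.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected variants of one word collapse to a shared stem
variants = ["change", "changes", "changed", "changing"]
stems = {word: stemmer.stem(word) for word in variants}
print(stems)

# Upper-case input is folded to lowercase by default
print(stemmer.stem("CLIMATE"))
```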

### **Lemmatization: Reduce words to their dictionary form (lemma).**

In [13]:
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to lemmatize words (without a POS tag, lemmatize() treats each word as a noun)
def lemmatize_words(words):
    return [lemmatizer.lemmatize(word) for word in words]

# Apply the lemmatization function to the tokenized words column
df_climate_change['lemmatized_words'] = df_climate_change['words'].apply(lemmatize_words)

# Display the DataFrame with the lemmatized words column
print(df_climate_change[['text', 'words', 'lemmatized_words']].head())


                                                text  \
0  The death of summer Arctic ice our Earth coole...   
1  Elsevier and the EditorsinChief are pleased to...   
2  From better climate change education to improv...   
3  climate change Links to FIXING CLIMATE CHANGE ...   
4  climate change The 11TH HOUR FOR THE EARTH cli...   

                                               words  \
0  [The, death, of, summer, Arctic, ice, our, Ear...   
1  [Elsevier, and, the, EditorsinChief, are, plea...   
2  [From, better, climate, change, education, to,...   
3  [climate, change, Links, to, FIXING, CLIMATE, ...   
4  [climate, change, The, 11TH, HOUR, FOR, THE, E...   

                                    lemmatized_words  
0  [The, death, of, summer, Arctic, ice, our, Ear...  
1  [Elsevier, and, the, EditorsinChief, are, plea...  
2  [From, better, climate, change, education, to,...  
3  [climate, change, Links, to, FIXING, CLIMATE, ...  
4  [climate, change, The, 11TH, HOUR, FOR, THE, E..

### **Stop Word Removal: Eliminate common words that may not be useful for analysis.**

In [15]:
# Get the list of stop words from NLTK
stop_words = set(stopwords.words('english'))

# Function to remove stop words
def remove_stop_words(words):
    return [word for word in words if word.lower() not in stop_words]

# Apply the stop word removal function to the tokenized words column
df_climate_change['filtered_words'] = df_climate_change['words'].apply(remove_stop_words)

# Display the DataFrame with the filtered words column
print(df_climate_change[['text', 'words', 'filtered_words']].head())

                                                text  \
0  The death of summer Arctic ice our Earth coole...   
1  Elsevier and the EditorsinChief are pleased to...   
2  From better climate change education to improv...   
3  climate change Links to FIXING CLIMATE CHANGE ...   
4  climate change The 11TH HOUR FOR THE EARTH cli...   

                                               words  \
0  [The, death, of, summer, Arctic, ice, our, Ear...   
1  [Elsevier, and, the, EditorsinChief, are, plea...   
2  [From, better, climate, change, education, to,...   
3  [climate, change, Links, to, FIXING, CLIMATE, ...   
4  [climate, change, The, 11TH, HOUR, FOR, THE, E...   

                                      filtered_words  
0  [death, summer, Arctic, ice, Earth, cooler, ye...  
1  [Elsevier, EditorsinChief, pleased, share, fir...  
2  [better, climate, change, education, improved,...  
3  [climate, change, Links, FIXING, CLIMATE, CHAN...  
4  [climate, change, 11TH, HOUR, EARTH, climatech..
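
`word_tokenize` emits punctuation marks as separate tokens, and such tokens pass through the stop word filter unless they happen to be in the stop word list. A possible extension, sketched here with a tiny hand-written stop list so that it runs without NLTK data, also lowercases the tokens and drops any that are not purely alphabetic. The function name `clean_tokens` and the demo data are hypothetical; in the notebook, the `stop_words` set built above would be passed instead.

```python
def clean_tokens(words, stop_words):
    """Lowercase tokens, then drop stop words and non-alphabetic tokens."""
    cleaned = []
    for word in words:
        lowered = word.lower()
        # isalpha() rejects punctuation, but also mixed tokens such as "11th"
        if lowered.isalpha() and lowered not in stop_words:
            cleaned.append(lowered)
    return cleaned

# Illustrative stand-in for set(stopwords.words('english'))
demo_stop_words = {"the", "of", "is", "to"}
demo_tokens = ["The", "future", "of", "the", "Arctic", "is", "uncertain", ".", "11TH"]
print(clean_tokens(demo_tokens, demo_stop_words))
```

If tokens such as `11TH` matter for the analysis, the `isalpha()` check could be relaxed to `isalnum()`.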