COVID-19 Sentiment Analysis


In [12]:
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [13]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [14]:
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

Data Processing


In [11]:
df = pd.read_csv('/content/train.csv', encoding='ISO-8859-1')

In [15]:
df.shape

(41157, 6)

In [16]:
df.head()

,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative


In [17]:
# Count the missing values in each column
df.isnull().sum()

UserName,0
ScreenName,0
Location,8590
TweetAt,0
OriginalTweet,0
Sentiment,0


In [18]:
# Fill missing locations with a placeholder.
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which operates on a copy and is deprecated in pandas.
df['Location'] = df['Location'].fillna('Unknown')
df.isnull().sum()


UserName,0
ScreenName,0
Location,0
TweetAt,0
OriginalTweet,0
Sentiment,0


Let's check the distribution of the 'Sentiment' column, which is our target variable.

In [20]:
display(df['Sentiment'].value_counts())

Sentiment,count
Positive,11422
Negative,9917
Neutral,7713
Extremely Positive,6624
Extremely Negative,5481
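
The class counts above also give a useful reference point for any later accuracy figure: a model that always predicts the most frequent class. The following is a minimal sketch that recomputes this baseline from the counts shown in the output; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above
counts = {
    "Positive": 11422,
    "Negative": 9917,
    "Neutral": 7713,
    "Extremely Positive": 6624,
    "Extremely Negative": 5481,
}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy obtained by always predicting the most frequent class
baseline_accuracy = counts[majority_class] / total

print(f"Total tweets: {total}")
print(f"Majority class: {majority_class}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

With five classes and the largest one covering a bit more than a quarter of the tweets, a classifier has to beat roughly 0.28 accuracy to be doing better than this trivial rule.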


# Task
Perform sentiment analysis on the 'OriginalTweet' column of the `df` DataFrame by cleaning and preprocessing the text (lowercase conversion, removal of punctuation, special characters, numbers, and stopwords, and stemming), converting the preprocessed text into TF-IDF features, splitting the data into training and testing sets, training a Logistic Regression model, evaluating its performance using accuracy score, and finally summarizing the entire process.

## Text Preprocessing

### Subtask:
Clean and preprocess the 'OriginalTweet' column. This involves converting text to lowercase, removing punctuation, special characters, numbers, and stopwords. We will also apply stemming to reduce words to their root form.


**Reasoning**:
I will initialize a Porter Stemmer, define a preprocessing function that handles lowercasing, punctuation/special character removal, stemming, and stopword removal, and then apply this function to the 'OriginalTweet' column to create a new 'Transformed_Tweet' column. This covers all the instructions for text preprocessing.



In [21]:
ps = PorterStemmer()
# Build the stopword set once: set lookups are fast, and this avoids
# reloading the full stopword list for every single word
stop_words = set(stopwords.words('english'))

def stemming_function(content):
    # Replace every non-letter character (digits, punctuation, symbols) with a space
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    # Convert to lowercase
    stemmed_content = stemmed_content.lower()
    # Split into words
    stemmed_content = stemmed_content.split()
    # Drop stopwords, then stem the remaining words
    stemmed_content = [ps.stem(word) for word in stemmed_content if word not in stop_words]
    # Join words back
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content

# Apply the stemming_function to the 'OriginalTweet' column
df['Transformed_Tweet'] = df['OriginalTweet'].apply(stemming_function)

print("Preprocessing complete. A new column 'Transformed_Tweet' has been added.")
print(df[['OriginalTweet', 'Transformed_Tweet']].head())

Preprocessing complete. A new column 'Transformed_Tweet' has been added.
                                       OriginalTweet  \
0  @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...   
1  advice Talk to your neighbours family to excha...   
2  Coronavirus Australia: Woolworths to give elde...   
3  My food stock is not the only one which is emp...   
4  Me, ready to go at supermarket during the #COV...   

                                   Transformed_Tweet  
0  menyrbi phil gahan chrisitv http co ifz fan pa...  
1  advic talk neighbour famili exchang phone numb...  
2  coronaviru australia woolworth give elderli di...  
3  food stock one empti pleas panic enough food e...  
4  readi go supermarket covid outbreak paranoid f...  
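
The first transformed tweet still contains fragments such as `http` and `co`, which are leftovers of a URL split by the letters-only regex. One possible refinement, not part of the original notebook, is to strip URLs and @mentions before the other cleaning steps. The sketch below assumes a hypothetical helper `strip_urls_and_mentions` and a made-up sample tweet.

```python
import re

# Patterns are deliberately simple; they are assumptions, not a complete URL grammar
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def strip_urls_and_mentions(text):
    """Remove URLs and @mentions, then collapse repeated whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return " ".join(text.split())

# Hypothetical sample tweet for illustration only
sample = "@MeNyrbie thanks https://t.co/abc123 stay safe #COVID19"
cleaned = strip_urls_and_mentions(sample)
print(cleaned)
```

A helper like this could be called at the start of `stemming_function`, so that URL fragments and user handles never reach the TF-IDF vocabulary. Whether that improves the model would need to be checked empirically.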


**Reasoning**:
Now that the text data has been preprocessed, the next step is to convert this text into numerical features using TF-IDF, as specified by the main task. This will allow the data to be used by machine learning models.



In [22]:
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X = tfidf_vectorizer.fit_transform(df['Transformed_Tweet']).toarray()
y = df['Sentiment'].values

print("TF-IDF vectorization complete. The feature matrix X and target vector y are ready.")
print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")

TF-IDF vectorization complete. The feature matrix X and target vector y are ready.
Shape of X: (41157, 5000)
Shape of y: (41157,)
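
Calling `.toarray()` converts the sparse TF-IDF matrix into a dense NumPy array. A quick back-of-the-envelope calculation, assuming the default `float64` dtype of `TfidfVectorizer`, shows how much memory that dense array needs:

```python
n_rows, n_cols = 41157, 5000   # shape of X reported above
bytes_per_value = 8            # float64 (assumed default dtype)

dense_bytes = n_rows * n_cols * bytes_per_value
print(f"Dense X needs about {dense_bytes / 1e9:.2f} GB")
```

Since most entries of a TF-IDF matrix are zero and `LogisticRegression` accepts sparse input, leaving out `.toarray()` would likely reduce memory use considerably. This is a suggestion rather than something the notebook measured.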


**Reasoning**:
The data has been preprocessed and vectorized into numerical features (X) and a target variable (y). The next step, as outlined in the main task, is to split this data into training and testing sets to prepare for model training and evaluation.



In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

print("Data split into training and testing sets.")
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Data split into training and testing sets.
Shape of X_train: (32925, 5000)
Shape of X_test: (8232, 5000)
Shape of y_train: (32925,)
Shape of y_test: (8232,)
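
Note that in the cells above the TF-IDF vectorizer was fitted on all tweets before the split, so vocabulary and IDF statistics from the test tweets also inform the features. An alternative, sketched below on a tiny made-up corpus, is to wrap the vectorizer and the classifier in a scikit-learn pipeline and fit it on the training text only. The toy texts and labels are assumptions for illustration, not data from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, already-preprocessed texts (hypothetical)
train_texts = [
    "food stock empti panic",
    "thank store worker help",
    "price high panic buy",
    "great help neighbour thank",
]
train_labels = ["Negative", "Positive", "Negative", "Positive"]
test_texts = ["panic buy toiletpap", "thank help"]

# The vectorizer is fitted inside the pipeline, on training text only
clf = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_texts, train_labels)

# Words that occur only in the test texts never enter the vocabulary
vocab = clf.named_steps["tfidfvectorizer"].vocabulary_
print("toiletpap" in vocab)

preds = clf.predict(test_texts)
print(list(preds))
```

In the notebook this would mean splitting `df['Transformed_Tweet']` and `df['Sentiment']` first and passing the raw text columns to the pipeline. The resulting accuracy may differ from the figure reported here.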


**Reasoning**:
With the data now split into training and testing sets, the next step is to train a Logistic Regression model as specified in the main task, using the `X_train` and `y_train` data.



In [24]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("Logistic Regression model trained successfully.")

Logistic Regression model trained successfully.


**Reasoning**:
The Logistic Regression model has been trained. The next step, as per the main task, is to evaluate its performance using accuracy score. This involves making predictions on the test set and then comparing them to the actual test labels.



In [25]:
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Model evaluation complete.")
print(f"Accuracy Score: {accuracy}")

Model evaluation complete.
Accuracy Score: 0.5771379980563655
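
Accuracy alone hides how the model performs on each of the five sentiment classes. A per-class view can be obtained with `confusion_matrix` and `classification_report` from `sklearn.metrics`. The sketch below uses small made-up label lists (`y_true_demo`, `y_pred_demo`) purely to show the calls; it does not reproduce the notebook's results.

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

labels = ["Negative", "Neutral", "Positive"]

# Hypothetical true and predicted labels for illustration
y_true_demo = ["Positive", "Negative", "Neutral", "Positive", "Negative", "Neutral"]
y_pred_demo = ["Positive", "Negative", "Positive", "Positive", "Neutral", "Neutral"]

# Rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm)

# Macro F1 weights every class equally, regardless of its frequency
macro_f1 = f1_score(y_true_demo, y_pred_demo, average="macro")
print(f"Macro F1: {macro_f1:.3f}")

# Precision, recall and F1 per class
print(classification_report(y_true_demo, y_pred_demo, labels=labels, zero_division=0))
```

Applied to the notebook, the same calls would take `y_test` and `y_pred`. The confusion matrix would then show, for instance, how often 'Extremely Positive' tweets are predicted as 'Positive'.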
