### Problem
Sentiment analysis remains one of the most widely applied problems in natural language processing. Given tweets from customers about tech firms that manufacture and sell mobiles, computers, laptops, etc., the task is to identify whether each tweet expresses a negative sentiment towards such companies or products.

### Data
train.csv - For training the models, we provide a labelled dataset of 7920 tweets. The dataset is provided in the form of a csv file with each line storing a tweet id, its label and the tweet.

test.csv - The test data file contains only tweet ids and the tweet text, with each tweet on a new line.

Most profane and vulgar terms in the tweets have been replaced with “$&@*#”. However, please note that the dataset still might contain text that may be considered profane, vulgar, or offensive.

### Importing Libraries & Dataset

In [1]:
import pandas as pd
import numpy as np
import nltk
import string
import re
import warnings
warnings.filterwarnings('ignore')

nltk.download('stopwords')
nltk.download('wordnet')


train_data = pd.read_csv(r'C:\Users\SHREE\Downloads\Python CODES\Customer Tweets Sentiment Analysis\train.csv')
test_data = pd.read_csv(r'C:\Users\SHREE\Downloads\Python CODES\Customer Tweets Sentiment Analysis\test.csv')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SHREE\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\SHREE\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
data = pd.concat([train_data, test_data], axis=0)

df = data.copy()

In [3]:
df['label'].value_counts()

0.0    5894
1.0    2026
Name: label, dtype: int64
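
The counts above show roughly three tweets labelled 0 for every tweet labelled 1. The sketch below works through that arithmetic using only the two printed counts (the variable names are illustrative). The ratio comes out close to the scale_pos_weight=3 passed to the boosted models further down, which matches the common heuristic of weighting the minority class by the majority/minority ratio. The majority-class share is the accuracy a model would get by always predicting label 0, so it is a useful floor when reading accuracy figures.

```python
# Label counts copied from the value_counts() output above
n_label_0 = 5894
n_label_1 = 2026
total = n_label_0 + n_label_1

# Ratio of label-0 to label-1 tweets
ratio = n_label_0 / n_label_1

# Accuracy of a trivial model that always predicts label 0
majority_baseline = n_label_0 / total

print(f"total labelled tweets: {total}")
print(f"label 0 : label 1 ratio: {ratio:.2f}")
print(f"majority-class accuracy: {majority_baseline:.3f}")
```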

### Creating Functions to Remove URLs, Punctuation & Stopwords, Tokenize and Lemmatize

In [4]:
# Removing URLs
def remove_url(text):
    return re.sub(r"http\S+", "", text)

#Removing Punctuations
def remove_punct(text):
    new_text = []
    for t in text:
        if t not in string.punctuation:
            new_text.append(t)
    return ''.join(new_text)


#Tokenizer
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')



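#Removing Stop words
from nltk.corpus import stopwords

# Build the stopword set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))

def remove_sw(text):
    new_text = []
    for t in text:
        if t not in stop_words:
            new_text.append(t)
    return new_text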

#Lemmatization
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def word_lemmatizer(text):
    new_text = []
    for t in text:
        lem_text = lemmatizer.lemmatize(t)
        new_text.append(lem_text)
    return new_text

### Applying the Functions to Tweets

In [5]:
# Each step consumes the output of the previous one, so the order matters
df['tweet'] = df['tweet'].apply(remove_url)

df['tweet'] = df['tweet'].apply(remove_punct)

df['tweet'] = df['tweet'].apply(lambda t: tokenizer.tokenize(t.lower()))

df['tweet'] = df['tweet'].apply(remove_sw)

df['tweet'] = df['tweet'].apply(word_lemmatizer)
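
To make the order of the steps concrete, here is a minimal standard-library sketch that traces one made-up tweet through URL removal, punctuation removal, lowercasing, tokenizing and stopword removal. It is a stand-in rather than the notebook's code: re.findall(r"\w+", ...) plays the role of RegexpTokenizer(r'\w+'), toy_stop_words is a hypothetical two-word list instead of NLTK's English list, and lemmatization is left out because it needs the WordNet data.

```python
import re
import string

# Hypothetical tweet, not taken from the dataset
tweet = "I hate my new phone!! $&@*# battery http://t.co/abc #fail"

# Tiny stand-in stopword list; the notebook uses NLTK's English list instead
toy_stop_words = {"i", "my"}

# 1. Drop URLs, 2. drop punctuation characters, 3. lowercase and tokenize
no_url = re.sub(r"http\S+", "", tweet)
no_punct = "".join(ch for ch in no_url if ch not in string.punctuation)
tokens = re.findall(r"\w+", no_punct.lower())

# 4. Drop stopwords
kept = [t for t in tokens if t not in toy_stop_words]

print(kept)  # ['hate', 'new', 'phone', 'battery', 'fail']
```

Two side effects are visible here: the profanity placeholder “$&@*#” consists only of punctuation characters, so it disappears completely, and the hashtag #fail survives as the plain word fail.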

### Splitting the Combined Data Back into the Labelled (Train) & Unlabelled (Test) Sets

In [6]:
features_set = df.copy()

train_set = features_set.iloc[:len(train_data), :]

test_set = features_set.iloc[len(train_data):, :]
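
The positional slicing above works because pd.concat keeps the rows in their original order, so the first len(train_data) rows are exactly the labelled tweets. The toy sketch below illustrates this with two small made-up frames (the column name id is an assumption; only label and tweet appear in the notebook's code). It also shows why the labels print as 0.0 and 1.0 in the value_counts() output: the rows from test.csv have no label, so pandas fills them with NaN and the column becomes float.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv; the column name 'id' is assumed
toy_train = pd.DataFrame(
    {"id": [1, 2], "label": [0, 1], "tweet": ["good phone", "bad laptop"]}
)
toy_test = pd.DataFrame({"id": [3], "tweet": ["new tablet"]})

combined = pd.concat([toy_train, toy_test], axis=0)

# Positional slicing recovers the two parts because concat preserves row order
train_part = combined.iloc[:len(toy_train), :]
test_part = combined.iloc[len(toy_train):, :]

print(combined["label"].dtype)           # float, because the test rows are NaN
print(len(train_part), len(test_part))
print(test_part["label"].isna().all())   # every unlabelled row has a missing label
```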

### Selecting Target & Feature Variables for Classifying Tweets

In [7]:
# Join each token list back into one string, since TfidfVectorizer expects raw text
X = train_set['tweet'].apply(' '.join)


Y = train_set['label']

### Feature Extraction

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

TfidV = TfidfVectorizer()

X = TfidV.fit_transform(X)
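
As a rough illustration of what fit_transform returns, the sketch below runs TfidfVectorizer with its default settings on three made-up, already-cleaned tweets. The result is a sparse matrix with one row per tweet and one column per distinct term, and with the default norm='l2' each row is scaled to unit length. The documents and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pre-processed tweets, already joined back into strings
docs = ["hate new phone", "love new laptop", "phone battery dead"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# One row per tweet, one column per distinct term
print(matrix.shape)
print(sorted(vectorizer.vocabulary_))

# Squared length of each row; 1.0 for every row under the default L2 norm
squared_norms = (matrix.toarray() ** 2).sum(axis=1)
print(squared_norms)
```

Note that the default token pattern only keeps tokens of two or more characters, so any single-character tokens left after cleaning are dropped at this stage.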

### Splitting the Labelled Data into Training & Hold-out Sets

In [9]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.1, random_state = 1234)
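
One thing to be aware of: in this notebook the TF-IDF vocabulary and IDF weights are fitted on all labelled tweets before the split, so the hold-out tweets influence the features. This can make hold-out scores slightly optimistic. A possible alternative, sketched below on made-up data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline so that TF-IDF is fitted on the training portion only. The texts, the labels and the reading of label 1 as "negative" are assumptions for illustration, not the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned tweets and labels (1 assumed to mean negative)
texts = [
    "hate phone battery", "love new laptop", "screen broke again", "great camera",
    "worst update ever", "fast delivery happy", "phone keeps crashing", "nice design",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first so the vectorizer never sees the hold-out tweets
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=1234, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then fits the classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(text_train, y_train)

# At prediction time the fitted vocabulary and IDF weights are reused as-is
preds = model.predict(text_test)
print(len(text_train), len(text_test), len(preds))
```

The same fitted pipeline could then be applied to the cleaned test.csv tweets with model.predict, without refitting anything.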

### Model Selection & Model Evaluation

In [10]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(x_train, y_train)

y_predict_lr = lr.predict(x_test)

In [11]:
from sklearn.metrics import confusion_matrix, f1_score

cm_lr = confusion_matrix(y_test, y_predict_lr)

f1_lr = f1_score(y_test, y_predict_lr)

# score() returns the mean accuracy on the hold-out split, not the F1 score
score_lr = lr.score(x_test, y_test)
score_lr

0.8573232323232324
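
The cell above computes a confusion matrix and an F1 score but only displays accuracy. Because the classes are imbalanced, the two metrics can differ a lot. The sketch below shows how accuracy, precision, recall and F1 follow from the four confusion-matrix cells. The counts are hypothetical and are NOT the notebook's results; for binary labels, scikit-learn lays the matrix out as [[tn, fp], [fn, tp]], and f1_score treats label 1 as the positive class by default.

```python
# Hypothetical confusion-matrix counts, NOT the notebook's actual results
tn, fp, fn, tp = 560, 30, 80, 122

# Share of all predictions that are correct
accuracy = (tp + tn) / (tn + fp + fn + tp)

# Of the tweets predicted as label 1, how many really are label 1
precision = tp / (tp + fp)

# Of the tweets that really are label 1, how many were found
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

In this made-up case accuracy is well above F1, because the many correctly classified label-0 tweets lift accuracy while the missed label-1 tweets pull recall down.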

In [12]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(class_weight='balanced')

rfc.fit(x_train, y_train)

y_predict_rfc = rfc.predict(x_test)

In [13]:
from sklearn.metrics import confusion_matrix, f1_score

cm_rfc = confusion_matrix(y_test, y_predict_rfc)

f1_rfc = f1_score(y_test, y_predict_rfc)

score_rfc = rfc.score(x_test, y_test)
score_rfc

0.8813131313131313

In [14]:
from xgboost import XGBClassifier

xgb = XGBClassifier(scale_pos_weight=3)

xgb.fit(x_train, y_train)

y_predict_xgb = xgb.predict(x_test)



In [15]:
from sklearn.metrics import confusion_matrix, f1_score

cm_xgb = confusion_matrix(y_test, y_predict_xgb)

f1_xgb = f1_score(y_test, y_predict_xgb)

score_xgb = xgb.score(x_test, y_test)
score_xgb

0.8762626262626263

In [16]:
from lightgbm import LGBMClassifier

lgb = LGBMClassifier(scale_pos_weight=3)

#lgb.fit(X, Y)

lgb.fit(x_train, y_train)

y_predict_lgb = lgb.predict(x_test)

In [17]:
from sklearn.metrics import confusion_matrix, f1_score

cm_lgb = confusion_matrix(y_test, y_predict_lgb)

f1_lgb = f1_score(y_test, y_predict_lgb)

score_lgb = lgb.score(x_test, y_test)
score_lgb

0.8762626262626263
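
To put the four accuracy figures side by side, the sketch below converts each one back into a count of correctly classified hold-out tweets. It assumes the hold-out set contains 792 tweets, which is what test_size=0.1 gives for the 7920 labelled tweets; the model names used as dictionary keys are just labels for display.

```python
# Accuracies printed by the four score cells above
scores = {
    "LogisticRegression": 0.8573232323232324,
    "RandomForest": 0.8813131313131313,
    "XGBoost": 0.8762626262626263,
    "LightGBM": 0.8762626262626263,
}

# Assumed hold-out size: test_size=0.1 applied to the 7920 labelled tweets
n_holdout = 792

# Print the models from highest to lowest accuracy with the implied counts
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    correct = round(acc * n_holdout)
    print(f"{name:<20} accuracy={acc:.4f}  correct={correct}/{n_holdout}")
```

Under that assumption, the gap between Random Forest and the two boosted models amounts to four tweets, so the ranking on this single split should be read with some caution.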

### Conclusion
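On the hold-out split, the XGBoost and LightGBM models both reach about 87.6% accuracy, close to Random Forest (88.1%) and above Logistic Regression (85.7%). Their F1 scores are computed above (f1_xgb and f1_lgb) but not displayed, so they should be checked alongside accuracy because the classes are imbalanced. Either model can be used to classify whether a tweet expresses negative sentiment towards such companies or products.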