# Tweets Sentiment Analysis Model

## Importing Libraries

In [7]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score


## Data Loading

In [5]:
data = pd.read_csv("C:/Users/Admin 21/Downloads/tweets.csv",
                encoding='ISO-8859-1', 
                header = None, 
                names= ["target", "tweet_id", "date", "flag", "user", "tweet"]
                )

data.head()

| | target | tweet_id | date | flag | user | tweet |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


## Data Inspection

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column    Non-Null Count    Dtype 
---  ------    --------------    ----- 
 0   target    1600000 non-null  int64 
 1   tweet_id  1600000 non-null  int64 
 2   date      1600000 non-null  object
 3   flag      1600000 non-null  object
 4   user      1600000 non-null  object
 5   tweet     1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [12]:
df = data.drop(columns = ["tweet_id", "date", "flag", "user"])
df.head()

| | target | tweet |
|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 2 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   target  1600000 non-null  int64 
 1   tweet   1600000 non-null  object
dtypes: int64(1), object(1)
memory usage: 24.4+ MB


## Data Preprocessing

In [14]:
df["target"] = df["target"].map({0:0,4:1})
df["target"].value_counts()

0    800000
1    800000
Name: target, dtype: int64
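
The mapping above recodes the original labels 0 and 4 as 0 and 1. As a minimal sketch of what a dictionary lookup like this does, the plain-Python example below applies the same dictionary to a short, made-up list of labels. The names `label_names`, `raw_targets`, `mapped`, and `unmapped` are illustrative only, and reading 0 as "negative" and 1 as "positive" is an assumption based on the sample tweets shown earlier. Any value missing from the dictionary has no mapping; in pandas, `Series.map` turns such values into `NaN`, so checking `value_counts()` afterwards is a sensible guard.

```python
# Same dictionary as in the cell above: original label -> binary label.
label_map = {0: 0, 4: 1}

# Assumed reading of the binary labels (not stated explicitly in the data).
label_names = {0: "negative", 1: "positive"}

# Made-up labels for illustration; 2 is deliberately not in label_map.
raw_targets = [0, 4, 4, 0, 2]

# dict.get returns None for a missing key, much like Series.map yields NaN.
mapped = [label_map.get(value) for value in raw_targets]
unmapped = [value for value in raw_targets if value not in label_map]

print(mapped)    # [0, 1, 1, 0, None]
print(unmapped)  # [2]
print([label_names.get(m, "unknown") for m in mapped])
```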

In [21]:
df["tweet"][0]

"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D"

### Dealing with @ Mentions

In [41]:
re.sub(r'@\w+', '',df["tweet"][0] )

" http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D"

### Dealing with Links/URLs

In [37]:
re.sub(r'http\S+', '',df["tweet"][0] )

"@switchfoot  - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D"

### Dealing with Punctuation

In [33]:
re.sub(r'[^\w\s]', '',df["tweet"][0] )

'switchfoot httptwitpiccom2y1zl  Awww thats a bummer  You shoulda got David Carr of Third Day to do it D'

### Dealing with Hashtags

In [53]:
text = df["tweet"][45668] + '#hio'

In [54]:
re.sub(r'#\w+', '',text )

"@cindypepper I like too, but I prefer the winter, because here in Brasil ai almost ALWAYS summer, so, I'm sick a little "

### Dealing with Numbers/Digits

In [59]:
text = df["tweet"][45668] + '5'
re.sub(r'\d+', '',text )

"@cindypepper I like too, but I prefer the winter, because here in Brasil ai almost ALWAYS summer, so, I'm sick a little "

### Dealing with extra Whitespace

In [66]:
df["tweet"][0]

"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D"

In [65]:
re.sub(r'\s+', ' ',df["tweet"][0] )

"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"

## Data Preprocessing Function

We will incorporate all of the text operations above into a single function that converts each tweet into clean, lowercase text that is easy to vectorize.

In [67]:
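def text_processing(text):
    text = re.sub(r'@\w+', '', text)      # remove @mentions
    text = re.sub(r'http\S+', '', text)   # remove links/URLs
    text = re.sub(r'#\w+', '', text)      # remove hashtags
    text = re.sub(r'[^\w\s]', '', text)   # remove punctuation
    text = re.sub(r'\d+', '', text)       # remove digits
    text = re.sub(r'\s+', ' ', text)      # collapse repeated whitespace
    text = text.lower()                   # normalize case
    return text.strip()                   # trim leading/trailing spaces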


In [68]:
text_processing(df["tweet"][0])

'awww thats a bummer you shoulda got david carr of third day to do it d'
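
The order of the substitutions matters: mentions are removed before punctuation, because stripping the `@` first would leave the user name behind as an ordinary word. The sketch below compares two hypothetical variants, `strip_punct_first` and `strip_punct_last` (names invented for this illustration), on a shortened copy of the first tweet.

```python
import re

def strip_punct_first(text):
    # Hypothetical variant: punctuation is removed BEFORE mentions and URLs.
    text = re.sub(r'[^\w\s]', '', text)   # '@' is gone, so the mention pattern no longer matches
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.lower().strip()

def strip_punct_last(text):
    # Same relative order as text_processing: mentions, URLs, then punctuation.
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.lower().strip()

sample = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer."
bad = strip_punct_first(sample)
good = strip_punct_last(sample)

print(bad)   # switchfoot awww thats a bummer
print(good)  # awww thats a bummer
```

In the first variant the user name `switchfoot` leaks into the cleaned text and would end up in the vocabulary as a feature.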

In [71]:
df["tweet"] = df["tweet"].apply(text_processing)

In [72]:
df["tweet"].head()

0    awww thats a bummer you shoulda got david carr...
1    is upset that he cant update his facebook by t...
2    i dived many times for the ball managed to sav...
3       my whole body feels itchy and like its on fire
4    no its not behaving at all im mad why am i her...
Name: tweet, dtype: object

## Model Building

### Features and Targets

We will split our clean data into the target (the sentiment label) and the feature (the tweet text).

In [73]:
X = df["tweet"]
y= df["target"]

### Train and Test Data

We will use scikit-learn's `train_test_split` to split the data into training and test sets, using an 80:20 train-to-test ratio.

In [74]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
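
To make the 80:20 split concrete, here is a minimal plain-Python sketch of a shuffled hold-out split. It is not scikit-learn's implementation, and the helper name `simple_train_test_split` is hypothetical; the same seed will not reproduce the rows that `train_test_split` selects. It only illustrates the idea: shuffle the row indices once, reserve 20% for testing, and keep the two sets disjoint.

```python
import random

def simple_train_test_split(items, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)        # reproducible shuffle
    n_test = int(round(len(items) * test_size)) # number of held-out rows
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

# Ten placeholder "tweets" stand in for the real data.
tweets = [f"tweet {i}" for i in range(10)]
train, test = simple_train_test_split(tweets)

print(len(train), len(test))  # 8 2
```

With the 1,600,000 tweets in this notebook, the same ratio gives 1,280,000 training rows and 320,000 test rows.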

In [76]:
print(X_train.info(), y_train.info())

<class 'pandas.core.series.Series'>
Int64Index: 1280000 entries, 1374558 to 121958
Series name: tweet
Non-Null Count    Dtype 
--------------    ----- 
1280000 non-null  object
dtypes: object(1)
memory usage: 19.5+ MB
<class 'pandas.core.series.Series'>
Int64Index: 1280000 entries, 1374558 to 121958
Series name: target
Non-Null Count    Dtype
--------------    -----
1280000 non-null  int64
dtypes: int64(1)
memory usage: 19.5 MB
None None


## Feature Extraction

In [77]:
vector = CountVectorizer(stop_words='english')

In [78]:
X_train_vector = vector.fit_transform(X_train)
X_test_vector = vector.transform(X_test)
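
`fit_transform` learns the vocabulary from the training tweets only, and `transform` reuses that vocabulary for the test tweets, so words that never appeared in training are ignored. The sketch below imitates this bag-of-words idea in plain Python. It is a simplified stand-in, not `CountVectorizer` itself: the helpers `fit_vocabulary` and `transform` are hypothetical, tokens are split on whitespace, stop words are not removed, and the output is a dense list of lists rather than a sparse matrix.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Assign a column index to every distinct token seen in the training docs."""
    vocab = sorted({token for doc in docs for token in doc.split()})
    return {token: idx for idx, token in enumerate(vocab)}

def transform(docs, vocab):
    """Count occurrences of known tokens; tokens outside the vocabulary are skipped."""
    rows = []
    for doc in docs:
        counts = Counter(token for token in doc.split() if token in vocab)
        rows.append([counts[token] for token in vocab])
    return rows

train_docs = ["love this song", "hate this rain rain"]
test_docs = ["love this weather"]  # "weather" was never seen in training

vocab = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocab)
test_matrix = transform(test_docs, vocab)

print(vocab)         # {'hate': 0, 'love': 1, 'rain': 2, 'song': 3, 'this': 4}
print(train_matrix)  # [[0, 1, 0, 1, 1], [1, 0, 2, 0, 1]]
print(test_matrix)   # [[0, 1, 0, 0, 1]]
```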

### Instantiate Model 

In [79]:
naves = MultinomialNB()

### Fit the Model on the Vectorized Training Data

In [80]:
naves.fit(X_train_vector, y_train)
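
For intuition about what fitting a multinomial Naive Bayes model involves, the following is a small from-scratch sketch on a toy corpus. It is not scikit-learn's code: `train_nb` and `predict_nb` are hypothetical helpers, tokens are split on whitespace, and the toy sentences are made up. It uses additive smoothing with `alpha=1.0`, which matches the default `alpha` of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Denominator: total tokens in the class plus alpha per vocabulary word.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy corpus: 1 = positive, 0 = negative (made-up sentences).
docs = ["i love this", "love it so much", "i hate this", "hate it so bad"]
labels = [1, 1, 0, 0]
prior, likelihood = train_nb(docs, labels)

print(predict_nb("love this", prior, likelihood))  # 1
print(predict_nb("hate this", prior, likelihood))  # 0
```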

## Model Prediction

In [84]:
y_pred = naves.predict(X_test_vector)
y_pred

array([1, 1, 1, ..., 1, 0, 0], dtype=int64)

## Model Evaluation

In [86]:
acc = accuracy_score(y_test,y_pred)
acc

0.76559375

The model's accuracy was about 76.6%, which means it predicts the wrong label for roughly 23 out of every 100 tweets in the test set.
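
As a quick check of the arithmetic, the accuracy can be converted back into counts. The sketch below takes the accuracy value printed above and the test-set size implied by the split (1,600,000 total rows minus 1,280,000 training rows); the variable names are illustrative.

```python
accuracy = 0.76559375            # value returned by accuracy_score above
n_test = 1_600_000 - 1_280_000   # rows held out for testing

n_correct = round(accuracy * n_test)  # correctly labelled test tweets
n_wrong = n_test - n_correct          # misclassified test tweets
error_rate = 1 - accuracy

print(n_test, n_correct, n_wrong)  # 320000 244990 75010
print(f"{error_rate:.2%}")         # 23.44%
```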

## Model Usage

In [154]:
tweet = [" I hate statistics", "I am a marine soldier", "# Protect our 5 forest", "i love God"]

tweet_tok = vector.transform(tweet)
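
The training tweets were cleaned with `text_processing` before vectorization, so one option is to clean new tweets the same way before calling `vector.transform`. The sketch below repeats the function from earlier in the notebook and applies it to the four example tweets; the name `cleaned` is illustrative, and whether this step changes the predictions here has not been checked.

```python
import re

def text_processing(text):
    # Same steps as the preprocessing function defined earlier.
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'#\w+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.lower().strip()

tweet = [" I hate statistics", "I am a marine soldier", "# Protect our 5 forest", "i love God"]
cleaned = [text_processing(t) for t in tweet]

print(cleaned)
# ['i hate statistics', 'i am a marine soldier', 'protect our forest', 'i love god']
```

The cleaned list could then be passed to `vector.transform` in place of the raw tweets.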


In [155]:
pred = naves.predict(tweet_tok)
pred

array([0, 0, 1, 1], dtype=int64)

In [152]:

# Label 1 is positive and label 0 is negative (original labels 4 and 0).
print(f"'{tweet[0]} ' tweet has a {'positive' if pred[0]==1 else 'negative'} sentiment label.")


' I hate statistics ' tweet has a negative sentiment label.


In [153]:
# Label 1 is positive and label 0 is negative (original labels 4 and 0).
for text, label in zip(tweet, pred):
    print(f"'{text} ' tweet has a {'positive' if label==1 else 'negative'} sentiment label.")

' I hate statistics ' tweet has a negative sentiment label.
'I am a marine soldier ' tweet has a negative sentiment label.
'# Protect our 5 forest ' tweet has a positive sentiment label.
'i love God ' tweet has a positive sentiment label.


End of our Sentiment Analysis. Thank you!