<a href="https://colab.research.google.com/github/mustaphamerakech/twitter-sentiment-analysis/blob/main/twitter_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [4]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
columns_name = ['target', 'ids', 'date', 'flag', 'user', 'text']
df = pd.read_csv('/content/drive/My Drive/twitter-sentiment-analysis/data.csv',names = columns_name, encoding='latin-1')

Mounted at /content/drive


In [6]:
df.head(5)

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [7]:
# print the English stop words
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

**Data processing**

In [8]:
# checking the number of rows and columns
df.shape

(1600000, 6)

In [9]:
# checking the missing values
df.isnull().sum()

Unnamed: 0,0
target,0
ids,0
date,0
flag,0
user,0
text,0


In [10]:
# the distribution of the target column
df['target'].value_counts()

target,count
0,800000
4,800000


In [11]:
#convert the target 4 to 1
df['target'] = df['target'].replace(4,1)

0 = negative tweet
1 = positive tweet

In [12]:
# the distribution of the target column after relabelling
df['target'].value_counts()

target,count
0,800000
1,800000


**Stemming** is the process of reducing words to their root form.
**For example**, the Porter stemmer maps `update` to `updat`, `feels` to `feel`, and `behaving` to `behav`.


In [13]:
port_stem = PorterStemmer()

In [14]:
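# build the stop-word set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words('english'))

def stemming(content):
  # replace every non-letter character with a space
  content = re.sub('[^a-zA-Z]', ' ', content)
  # convert to lower case
  content = content.lower()
  # split the content into words
  content = content.split()
  # drop stop words and stem the remaining words
  content = [port_stem.stem(word) for word in content if word not in stop_words]
  content = ' '.join(content)
  return content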
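
To make the cleaning steps concrete without needing NLTK, here is a minimal sketch of everything `stemming` does except the stemming itself. `STOP_WORDS` is a hypothetical hand-picked subset of the English list printed earlier, and `clean_tokens` is an illustrative name, not part of the notebook.

```python
import re

# Hypothetical mini stop-word set; the notebook uses NLTK's full English list
STOP_WORDS = {"my", "and", "its", "on", "is", "that", "he", "his", "by"}

def clean_tokens(text):
    """Keep letters only, lower-case, split, and drop stop words (no stemming)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = clean_tokens("my whole body feels itchy and like its on fire")
print(tokens)  # ['whole', 'body', 'feels', 'itchy', 'like', 'fire']
```

Running the Porter stemmer over these tokens is what produces `whole bodi feel itchi like fire` in the notebook output for the same tweet.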

In [15]:
df['stemmed_data'] = df['text'].apply(stemming)

In [16]:
df.head(5)

Unnamed: 0,target,ids,date,flag,user,text,stemmed_data
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",switchfoot http twitpic com zl awww bummer sho...
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,upset updat facebook text might cri result sch...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,kenichan dive mani time ball manag save rest g...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,whole bodi feel itchi like fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....",nationwideclass behav mad see


In [17]:
print(df['stemmed_data'])

0          switchfoot http twitpic com zl awww bummer sho...
1          upset updat facebook text might cri result sch...
2          kenichan dive mani time ball manag save rest g...
3                            whole bodi feel itchi like fire
4                              nationwideclass behav mad see
                                 ...                        
1599995                           woke school best feel ever
1599996    thewdb com cool hear old walt interview http b...
1599997                         readi mojo makeov ask detail
1599998    happi th birthday boo alll time tupac amaru sh...
1599999    happi charitytuesday thenspcc sparkschar speak...
Name: stemmed_data, Length: 1600000, dtype: object


In [18]:
print(df['target'])

0          0
1          0
2          0
3          0
4          0
          ..
1599995    1
1599996    1
1599997    1
1599998    1
1599999    1
Name: target, Length: 1600000, dtype: int64


In [19]:
# separating the data and the labels
X = df['stemmed_data']
Y = df['target']

In [22]:
# splitting the data into training (70%), validation (15%) and test (15%) sets
X_train, X_temp, Y_train, Y_temp = train_test_split(X, Y, test_size=0.3, stratify=Y, random_state=2)

X_val, X_test, Y_val, Y_test = train_test_split(X_temp, Y_temp, test_size=0.5, stratify=Y_temp, random_state=2)
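
As a quick sanity check on the two-stage split, the expected set sizes can be worked out with integer arithmetic. This is only a sketch of the arithmetic, not of how `train_test_split` is implemented; the variable names are illustrative.

```python
n_total = 1_600_000             # rows reported by df.shape

n_temp = n_total * 30 // 100    # first split holds out 30%
n_train = n_total - n_temp      # remaining 70% is used for training
n_test = n_temp // 2            # second split halves the held-out part
n_val = n_temp - n_test

print(n_train, n_val, n_test)   # 1120000 240000 240000
```

Because both calls pass `stratify`, each of the three sets should keep the 50/50 balance between negative and positive tweets.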

In [23]:
# converting the text data into numerical data (fit on the training set only)
vectorizer = TfidfVectorizer()

X_train = vectorizer.fit_transform(X_train)
X_val = vectorizer.transform(X_val)
X_test = vectorizer.transform(X_test)
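
The weighting behind the vectorizer can be sketched in plain Python. The sketch below follows the documented defaults of `TfidfVectorizer` (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation) but assumes pre-tokenised input and skips the vectorizer's own tokenisation; `fit_idf`, `transform` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn smoothed IDF weights from tokenised training documents."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def transform(doc, idf):
    """Map one tokenised document to an L2-normalised TF-IDF dict."""
    counts = Counter(t for t in doc if t in idf)  # terms unseen in training are dropped
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Toy training corpus (hypothetical stemmed tweets)
train_docs = [["happi", "birthday"], ["upset", "cri"], ["happi", "feel"]]
idf = fit_idf(train_docs)

# A "validation" tweet is transformed with the IDF learned from training only
vec = transform(["happi", "birthday", "unseenword"], idf)
print(sorted(vec))  # ['birthday', 'happi']
```

Fitting on the training split alone, as the notebook does, keeps vocabulary and document frequencies from the validation and test tweets out of the features.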

In [24]:
print(X_train)

  (0, 121348)	0.44599875194306554
  (0, 306319)	0.4371164314704369
  (0, 43969)	0.4748653389926461
  (0, 328004)	0.5314491135690684
  (0, 374778)	0.31949818170660904
  (1, 377777)	0.5390797727604117
  (1, 28528)	0.16997429420382654
  (1, 2941)	0.3329517305729672
  (1, 224362)	0.2471619420602169
  (1, 376966)	0.16230856596272636
  (1, 36162)	0.21760975021723117
  (1, 400139)	0.5008486447803013
  (1, 32222)	0.2885646935506797
  (1, 45367)	0.24454930105105105
  (1, 360908)	0.20232270128356877
  (2, 192775)	0.7270515839609044
  (2, 331146)	0.4577243030925928
  (2, 217307)	0.3586212341678004
  (2, 209149)	0.3650688524406097
  (3, 175917)	0.5797213273864051
  (3, 139076)	0.25759117714273316
  (3, 366377)	0.4018889828996008
  (3, 344001)	0.3565976148233519
  (3, 114004)	0.3165169852577187
  (3, 381278)	0.3044716859774938
  :	:
  (1119996, 41689)	0.2988684615709349
  (1119996, 21512)	0.23572264244863686
  (1119996, 382366)	0.2544848779868633
  (1119996, 295570)	0.33902005717805955
  (1119996, 

**Training the machine learning model**

**Logistic regression**

In [25]:
model = LogisticRegression(max_iter=1000)

In [26]:
model.fit(X_train, Y_train)
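
For intuition about what the fitted model does at prediction time, here is a minimal sketch of binary logistic regression scoring: a weighted sum of the TF-IDF features passed through a sigmoid and thresholded at 0.5. The weights, bias and feature values are made up for illustration and are not taken from the trained model.

```python
import math

def sigmoid(z):
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """features and weights are sparse dicts keyed by term."""
    z = bias + sum(value * weights.get(term, 0.0) for term, value in features.items())
    return sigmoid(z)

# Hypothetical learned coefficients: positive values push towards class 1
weights = {"happi": 1.8, "upset": -2.1, "cri": -1.5}
bias = 0.0

# Hypothetical TF-IDF values for one tweet
tweet = {"upset": 0.6, "cri": 0.5, "facebook": 0.62}

p = predict_proba(tweet, weights, bias)
label = 1 if p >= 0.5 else 0
print(label)  # 0
```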

In [27]:
# Accuracy score on the training data
X_train_prediction = model.predict(X_train)
# accuracy_score expects (y_true, y_pred)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [28]:
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  0.8039133928571428


In [29]:
# Accuracy score on the validation data
X_val_prediction = model.predict(X_val)
# accuracy_score expects (y_true, y_pred)
val_data_accuracy = accuracy_score(Y_val, X_val_prediction)

In [30]:
print('Accuracy score of the validation data : ', val_data_accuracy)

Accuracy score of the validation data :  0.7772458333333333


In [31]:
# Accuracy score on the test data
X_test_prediction = model.predict(X_test)
# accuracy_score expects (y_true, y_pred)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [32]:
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  0.7769166666666667
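
Accuracy alone does not show which class the errors fall on. The notebook imports `confusion_matrix` and `classification_report` without calling them, so the sketch below shows in plain Python the kind of breakdown they provide. The labels are toy values, not results from this model, and `confusion_counts` is an illustrative helper.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Toy labels for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found

print(tn, fp, fn, tp)        # 3 1 2 2
print(accuracy)              # 0.625
```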


**Saving the trained model**

In [33]:
# saving the trained model
import pickle
filename = 'trained_model.sav'
# the with-block closes the file once the model has been written
with open(filename, 'wb') as f:
  pickle.dump(model, f)

In [34]:
# loading the trained model
with open(filename, 'rb') as f:
  loaded_model = pickle.load(f)
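
The save and load steps can be checked with a small round trip. The sketch below uses a plain dictionary as a hypothetical stand-in for the fitted model and writes to a temporary directory, so it runs without scikit-learn or Google Drive.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted model
stand_in_model = {"coef": [0.5, -1.2], "intercept": 0.1}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "trained_model.sav")

    # Write the object, closing the file as soon as the dump finishes
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Read it back into a new object
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)  # True
```

With the real model, a comparable check would be to confirm that `loaded_model.predict` and `model.predict` agree on a few test rows. Pickle files can execute code when loaded, so only unpickle files from a trusted source.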

In [39]:
print(X_test.shape[0]) # Get the number of rows in X_test
print(len(Y_test))


240000
240000


In [55]:
X_new = X_test[1]
Y_new = Y_test.iloc[1]
if Y_new == 0:
  print('The real value is a negative tweet')
else:
  print('The real value is a positive tweet')

prediction = model.predict(X_new)

if prediction[0] == 0:
  print('The predicted value is a negative tweet')
else:
  print('The predicted value is a positive tweet')


The real value is a negative tweet
The predicted value is a negative tweet
