# Twitter Data Classification

Steps to build the Twitter data classifier:

1. Import the necessary modules
2. Load the Twitter data and remove duplicate tweets
3. Clean the tweets
4. Split the data into train and test sets
5. Prepare a pipeline to vectorize and transform the data and specify the model to train
6. Apply the trained model to the test data and calculate accuracy

### 1. Import the necessary modules

In [1]:
import nltk
import string
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

### 2. Load the Twitter data and remove duplicate tweets

In [2]:
# load the input CSV file from a local path
data = pd.read_csv('path_to_the_input_data')
# print the shape of the input data
print(data.shape)
# remove duplicate tweets (the first occurrence of each tweet is kept)
data = data.drop_duplicates(subset='Tweet')
# print the shape of the data after removing duplicates
print(data.shape)

(23516, 4)
(20896, 4)
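
To illustrate what `drop_duplicates(subset='Tweet')` does, here is a minimal sketch on a toy frame. The rows are made up for illustration; only the column names `Tweet` and `ADR_label` come from this notebook.

```python
import pandas as pd

# Toy frame: the rows are invented, the column names match the notebook.
toy = pd.DataFrame({
    "Tweet": ["headache after dose", "headache after dose", "feeling fine"],
    "ADR_label": [1, 1, 0],
})

# keep='first' is the default, so the first occurrence of each tweet survives.
deduped = toy.drop_duplicates(subset="Tweet")

print(toy.shape)      # shape before removing duplicates
print(deduped.shape)  # shape after removing duplicates
```

Only the `Tweet` column is compared, so two rows with the same text but different values in other columns would still count as duplicates.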


### 3. Clean the tweets

In [3]:
# clean a single tweet
def clean_doc(tweet):
    # initialize the empty list of tokens
    tokens = []
    # initialize the stemmer
    stemmer = PorterStemmer()
    # lower-case the tweet and replace every punctuation mark with a space
    translator = str.maketrans(string.punctuation, ' '*len(string.punctuation))
    tweet = tweet.lower().translate(translator)
    # split the tweet into sentences, then into word tokens
    for sent in nltk.sent_tokenize(tweet):
        tokens.extend(nltk.word_tokenize(sent))
    # keep only purely alphabetic tokens (drops numbers and mixed tokens)
    tokens = [word for word in tokens if word.isalpha()]
    # filter out English stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # stem the remaining tokens
    tokens = [stemmer.stem(word) for word in tokens]
    # return the tokens of the tweet joined into a single string
    return ' '.join(tokens)
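
The punctuation step in `clean_doc` can be sketched in isolation using only the standard library. The sample tweet is made up, and `str.split` stands in for the NLTK tokenizers here, so this is an approximation of the cleaning step rather than the function itself.

```python
import string

# Map every punctuation character to a space, as clean_doc does.
translator = str.maketrans(string.punctuation, " " * len(string.punctuation))

sample = "Can't sleep... #insomnia @doc!"  # made-up tweet
cleaned = sample.lower().translate(translator)

# Assumption: plain whitespace splitting replaces nltk.word_tokenize.
tokens = cleaned.split()
print(tokens)  # ['can', 't', 'sleep', 'insomnia', 'doc']
```

Because the apostrophe is also replaced, contractions such as "can't" break into two tokens, and the hashtag and mention symbols are dropped while their words are kept.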

In [4]:
# clean every tweet in place
# (DataFrame.set_value was removed in pandas 1.0, so apply is used instead)
data['Tweet'] = data['Tweet'].apply(clean_doc)
#print(data[0:20])

### 4. Split the train and test data

In [5]:
# assigning the target variable and splitting the data into train and test
y=data.ADR_label
x_train, x_test, y_train, y_test = train_test_split(data.Tweet, y, test_size = 0.25, random_state = 53)
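
As a rough illustration of how `test_size` and `random_state` behave, the sketch below splits eight placeholder documents with the same settings. The texts and labels are invented; with the 20896 deduplicated tweets above, a 0.25 test share corresponds to 5224 test rows and 15672 training rows.

```python
from sklearn.model_selection import train_test_split

texts = [f"tweet {i}" for i in range(8)]  # placeholder documents
labels = [0, 1] * 4                       # placeholder labels

# Same settings as the notebook: 25% test share, fixed seed.
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=53
)
print(len(x_tr), len(x_te))  # 6 2

# A fixed random_state makes the split reproducible across runs.
x_tr2, x_te2, _, _ = train_test_split(
    texts, labels, test_size=0.25, random_state=53
)
print(x_te == x_te2)  # True
```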

### 5. Prepare a pipeline to vectorize and transform the data and specify the model to train

The pipeline uses CountVectorizer to turn the tweets into bag-of-words count vectors, TfidfTransformer to reweight those counts by term frequency and inverse document frequency (tf-idf), and a Multinomial Naive Bayes model as the classifier.

In [6]:
# use the max_features parameter of CountVectorizer to limit the vocabulary to the 1500 most frequent terms
text_clf = Pipeline([('vect', CountVectorizer(max_features=1500)),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
# training the model
text_clf = text_clf.fit(x_train, y_train)
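
To make the first two pipeline stages more concrete, here is a small sketch on a made-up, already-cleaned corpus. It uses `max_features=2` rather than 1500 purely so the effect is visible on three documents.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up corpus that mimics stemmed, cleaned tweets.
docs = ["drug caus headach", "drug help sleep", "headach headach gone"]

# max_features keeps only the most frequent terms across the whole corpus.
vect = CountVectorizer(max_features=2)
counts = vect.fit_transform(docs)
print(sorted(vect.vocabulary_))  # ['drug', 'headach']

# TfidfTransformer reweights the raw counts; by default rows are L2-normalized.
tfidf = TfidfTransformer().fit_transform(counts)
row_norms = np.sqrt(tfidf.multiply(tfidf).sum(axis=1))
print(tfidf.shape)  # (3, 2)
```

Terms outside the retained vocabulary ("caus", "help", "sleep", "gone") are simply ignored, which is the trade-off of capping the feature count.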

### 6. Apply the trained model to test data and calculate accuracy

In [10]:
# predicting the test data target variable
predicted = text_clf.predict(x_test)
# print the accuracy as a percentage
print("The accuracy of the model is : ",np.mean(predicted == y_test)*100)

The accuracy of the model is :  83.7672281776
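
The accuracy expression above is the share of predictions that match the true labels. The sketch below checks, on made-up arrays, that it agrees with scikit-learn's `accuracy_score`; the label and prediction values are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 0, 0, 1, 1])  # made-up labels
y_pred = np.array([0, 0, 1, 1, 0])  # made-up predictions

# Element-wise comparison gives booleans; their mean is the fraction correct.
manual = np.mean(y_pred == y_true) * 100
library = accuracy_score(y_true, y_pred) * 100
print(manual, library)
```

Three of the five toy predictions match, so both expressions evaluate to 60 percent.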
