# Text classification using TF-IDF

### 1. Load the dataset from sklearn.datasets

In [1]:
from sklearn.datasets import fetch_20newsgroups

In [2]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

### 2. Training data

In [3]:
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


### 3. Test data

In [7]:
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

### a. Access the values of the target variable with the `.target` attribute

In [11]:
twenty_train.target

array([1, 1, 3, ..., 2, 2, 2], dtype=int64)

### b. Access the class names of the target variable with `.target_names`

In [9]:
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [149]:
twenty_train.data[0:2]

['From: sd345@city.ac.uk (Michael Collier)\nSubject: Converting images to HP LaserJet III?\nNntp-Posting-Host: hampton\nOrganization: The City University\nLines: 14\n\nDoes anyone know of a good way (standard PC application/PD utility) to\nconvert tif/img/tga files into LaserJet III format.  We would also like to\ndo the same, converting to HPGL (HP plotter) files.\n\nPlease email any response.\n\nIs this the correct group?\n\nThanks in advance.  Michael.\n-- \nMichael Collier (Programmer)                 The Computer Unit,\nEmail: M.P.Collier@uk.ac.city                The City University,\nTel: 071 477-8000 x3769                      London,\nFax: 071 477-8565                            EC1V 0HB.\n',
 "From: ani@ms.uky.edu (Aniruddha B. Deglurkar)\nSubject: help: Splitting a trimming region along a mesh \nOrganization: University Of Kentucky, Dept. of Math Sciences\nLines: 28\n\n\n\n\tHi,\n\n\tI have a problem, I hope some of the 'gurus' can help me solve.\n\n\tBackground of the probl

### 4. Fit `TfidfVectorizer` on the training data, then transform both the training and test data to get their TF-IDF features

Hint: Use ".fit_transform" on Train set and ".transform" on Test set

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf=TfidfVectorizer()
tf_train= tf.fit_transform(twenty_train.data)

In [31]:
tf_train.shape

(2257, 35788)

In [32]:
tf_test=tf.transform(twenty_test.data)

In [33]:
tf_test.shape

(1502, 35788)
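
Both matrices have 35788 columns because the vocabulary is learned from the training set only; `transform` reuses it and ignores words it has never seen. The following is a minimal pure-Python sketch of that idea, assuming scikit-learn's default smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation. The whitespace tokenizer, the helper names `fit_idf` and `transform`, and the toy documents are illustrative assumptions, not what `TfidfVectorizer` does internally.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.lower().split()))
    vocab = sorted(doc_freq)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    return vocab, idf

def transform(docs, vocab, idf):
    """Map documents to L2-normalised TF-IDF rows over the fitted vocabulary."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        row = [counts[t] * idf[t] for t in vocab]  # words outside vocab are ignored
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm if norm else 0.0 for v in row])
    return rows

train_docs = ["the cat sat", "the dog barked", "the cat purred"]  # toy data
test_docs = ["the bird sat"]  # "bird" never appeared in training

vocab, idf = fit_idf(train_docs)
train_rows = transform(train_docs, vocab, idf)
test_rows = transform(test_docs, vocab, idf)

# Train and test rows share the same width: the size of the fitted vocabulary.
print(len(vocab), len(train_rows[0]), len(test_rows[0]))
```

Fitting a second vectorizer on the test set would produce a different vocabulary and column order, which is why only `.transform` is called on the test data.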

### 5. Train `LogisticRegression` with the TF-IDF features as input and the targets as output, and report the train and test accuracy scores

In [17]:
from sklearn.linear_model import LogisticRegression

In [36]:
lr=LogisticRegression()
lr.fit(tf_train,twenty_train.target)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [37]:
y_pred=lr.predict(tf_test)

In [38]:
from sklearn.metrics import accuracy_score, confusion_matrix

In [39]:
train_accuracy=accuracy_score(twenty_train.target,lr.predict(tf_train))

In [40]:
print("the train accuracy is :",train_accuracy)

the train accuracy is : 0.9920248116969429


In [41]:
test_accuracy=accuracy_score(twenty_test.target,y_pred)

In [42]:
print("the test accuracy is   :",test_accuracy)

the test accuracy is   : 0.8974700399467377
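
A single accuracy number hides which newsgroups get confused with each other. `confusion_matrix` is imported above but never used; the sketch below shows how it is read, using made-up labels rather than the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels (hypothetical), standing in for twenty_test.target and y_pred.
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class
acc = accuracy_score(y_true, y_hat)   # diagonal sum divided by number of samples

print(cm)
print("accuracy:", acc)
```

In the notebook the equivalent call would be `confusion_matrix(twenty_test.target, y_pred)`, with rows and columns in the order given by `twenty_train.target_names`.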


## Sentiment analysis

The objective of this problem is to perform sentiment analysis on tweets that users posted about various mobile devices.
Based on the text of a tweet, we classify whether the sentiment the user expresses toward a particular device is positive or negative.

### 6. Read the dataset (tweets.csv) and drop rows with missing values while reading it

Hint: pd.read_csv('./tweets.csv',encoding = "ISO-8859-1").dropna()

In [44]:
import pandas as pd

In [51]:
data=pd.read_csv('./tweets.csv',encoding = "ISO-8859-1").dropna()

In [52]:
data.head()

|   | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [53]:
data.shape

(3291, 3)

### 7. Preprocess the text and add the preprocessed text in a column with name `text` in the dataframe.

In [49]:
import string, re
from nltk import word_tokenize 
def preprocess(text):
    try:
        # drop every character that is in string.punctuation
        nopunc = ''.join(char for char in text if char not in string.punctuation)
        # convert text to lower-case
        nopunc = nopunc.lower()
        # remove URLs; punctuation is already gone at this point, so only the
        # second pattern can still match (e.g. a token such as 'httptcoabc')
        nopunc = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+)|(http?://[^\s]+))', '', nopunc)
        nopunc = re.sub(r'http\S+', '', nopunc)
        # remove usernames (no effect here: '@' was stripped with the punctuation)
        nopunc = re.sub(r'@[^\s]+', '', nopunc)
        # remove the # in #hashtag (no effect here: '#' was stripped already)
        nopunc = re.sub(r'#([^\s]+)', r'\1', nopunc)
        return nopunc
    except Exception:
        # non-string input (e.g. NaN) falls back to an empty string
        return ""
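
Because `preprocess` removes punctuation first, characters such as `@`, `#`, `:` and `/` are already gone when the regular expressions run, which is why usernames like `wesley83` survive in the `text` column. One possible reordering, shown only as a sketch, applies the patterns before stripping punctuation. The name `preprocess_reordered` and the sample tweet are hypothetical, and swapping this function in would change the `text` column and therefore the accuracy figures reported later.

```python
import re
import string

def preprocess_reordered(text):
    """Strip URLs and @usernames, unwrap #hashtags, then drop punctuation."""
    text = text.lower()
    text = re.sub(r"(www\.\S+)|(https?://\S+)", "", text)  # remove URLs
    text = re.sub(r"@\S+", "", text)                        # remove usernames
    text = re.sub(r"#(\S+)", r"\1", text)                   # keep the hashtag word
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())                           # collapse whitespace

sample = "@user Loving my #iPad! Details: http://example.com/x"  # made-up tweet
print(preprocess_reordered(sample))
```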

In [54]:
data['text'] = [preprocess(text) for text in data.tweet_text]

In [62]:
data.groupby("is_there_an_emotion_directed_at_a_brand_or_product").count()

| is_there_an_emotion_directed_at_a_brand_or_product | tweet_text | emotion_in_tweet_is_directed_at | text |
|---|---|---|---|
| I can't tell | 9 | 9 | 9 |
| Negative emotion | 519 | 519 | 519 |
| No emotion toward brand or product | 91 | 91 | 91 |
| Positive emotion | 2672 | 2672 | 2672 |


The `is_there_an_emotion_directed_at_a_brand_or_product` column contains values other than positive or negative, so those rows are removed in the next step.

### 8. Consider only rows having a Positive or Negative emotion and remove other rows from the dataframe.

Hint: Use df = df[(df["col_name"] == "Positive emotion") | (df["col_name"] == "Negative emotion")]

In [64]:
df =data[(data["is_there_an_emotion_directed_at_a_brand_or_product"] == "Positive emotion") | (data["is_there_an_emotion_directed_at_a_brand_or_product"] == "Negative emotion")]

In [67]:
data.shape

(3291, 4)

In [66]:
df.shape

(3191, 4)
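
The same filter can be written with `isin`, and taking an explicit `.copy()` makes the result an independent DataFrame, so that adding columns to it later does not trigger pandas' `SettingWithCopyWarning`. A minimal sketch on a made-up frame (the column names `tweet` and `emotion` and the rows are placeholders, not the notebook's data):

```python
import pandas as pd

toy = pd.DataFrame({
    "tweet": ["great phone", "battery died", "just a link", "not sure"],
    "emotion": ["Positive emotion", "Negative emotion",
                "No emotion toward brand or product", "I can't tell"],
})

keep = ["Positive emotion", "Negative emotion"]
toy_df = toy[toy["emotion"].isin(keep)].copy()  # independent copy, safe to modify

# Adding a column to the copy is an ordinary assignment.
toy_df["Label"] = toy_df["emotion"].map({"Positive emotion": 1, "Negative emotion": 0})
print(toy_df)
```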

### 9. Represent text as numerical data using `CountVectorizer` and get the document term frequency matrix

Hint: Perform fit (".fit") and transform (".transform") on the whole data here; later steps apply CountVectorizer "fit_transform" and "transform" to the train and test sets separately

In [68]:
from sklearn.feature_extraction.text import CountVectorizer

In [69]:
vect = CountVectorizer(ngram_range=(1,2),stop_words='english', min_df = 0.01)

In [73]:
vect.fit(df["tweet_text"])
features = vect.transform(df["tweet_text"])

### 10. Find number of different words in vocabulary

#### Tip: To see all available attributes and methods of an object use `dir`, then pick the appropriate one to find the number of different words in the vocabulary

In [78]:
features.shape

(3191, 189)
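
The second dimension of `features.shape` is the vocabulary size, so the answer here is 189 terms. A fitted vectorizer also exposes the learned mapping as `vocabulary_`, whose length gives the same number. A small sketch on made-up documents (the names `toy_vect` and `toy_features` are placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple ipad launch", "google launch party", "apple store line"]  # toy corpus

toy_vect = CountVectorizer()
toy_features = toy_vect.fit_transform(docs)

# vocabulary_ maps each term to its column index in the matrix.
n_terms = len(toy_vect.vocabulary_)
print(n_terms, toy_features.shape)
```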

In [79]:
features_df = pd.DataFrame(features.toarray())
features_df.columns = vect.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

In [81]:
word_freq = features_df.mean().sort_values(ascending=False)

In [83]:
word_freq.shape

(189,)

In [85]:
word_freq[0:10]

sxsw          1.052335
mention       0.697274
link          0.367910
ipad          0.352241
apple         0.307427
rt            0.300219
rt mention    0.290505
google        0.239737
iphone        0.200877
quot          0.178941
dtype: float64

### 11. Find out how many Positive and Negative emotions are there.

Hint: Use value_counts on that column

In [87]:
df.head(2)

|   | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product | text |
|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion | wesley83 i have a 3g iphone after 3 hrs tweeti... |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion | jessedee know about fludapp awesome ipadiphon... |


In [96]:
df.groupby("is_there_an_emotion_directed_at_a_brand_or_product").count()

| is_there_an_emotion_directed_at_a_brand_or_product | tweet_text | emotion_in_tweet_is_directed_at | text |
|---|---|---|---|
| Negative emotion | 519 | 519 | 519 |
| Positive emotion | 2672 | 2672 | 2672 |


### 12. Map the Positive and Negative emotion labels to 1 and 0 respectively and store them in a new column named `Label` in the same dataframe

Hint: use map on that column and give labels

In [105]:
dic={ 'Positive emotion':1, 'Negative emotion':0}

In [106]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# that a direct df["Label"] = ... assignment raises on a filtered slice
df = df.assign(Label=df["is_there_an_emotion_directed_at_a_brand_or_product"].map(dic))


In [107]:
df.head(3)

|   | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product | text | Label |
|---|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion | wesley83 i have a 3g iphone after 3 hrs tweeti... | 0 |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion | jessedee know about fludapp awesome ipadiphon... | 1 |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion | swonderlin can not wait for ipad 2 also they s... | 1 |


### 13. Define the feature set (independent variable, X) as the `text` column and the `Label` column as the target (dependent variable), split them into train and test datasets, and display their shapes

In [111]:
from sklearn.model_selection import train_test_split
x = df['text']
y = df['Label']

In [113]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3,random_state=1)

In [114]:
x_train.shape

(2233,)

In [115]:
x_test.shape

(958,)

In [116]:
y_train.shape

(2233,)

In [117]:
y_test.shape

(958,)
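
The classes are imbalanced (2672 positive against 519 negative), so a purely random split can leave the train and test sets with slightly different class ratios. `train_test_split` accepts a `stratify` argument that keeps the ratio fixed. The notebook does not use it, so the sketch below is an optional variant on synthetic labels, not a reproduction of the split above.

```python
from sklearn.model_selection import train_test_split

# Synthetic data: 80 positive and 20 negative labels (made up for illustration).
texts = [f"tweet {i}" for i in range(100)]
labels = [1] * 80 + [0] * 20

tr_x, te_x, tr_y, te_y = train_test_split(
    texts, labels, test_size=0.3, random_state=1, stratify=labels
)

print(len(tr_x), len(te_x))
print("positive share in test:", sum(te_y) / len(te_y))
```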

### 14. Predicting the sentiment

Use (i) Naive Bayes and (ii) Logistic Regression and print their accuracy scores for predicting the sentiment of the given text

In [118]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [119]:
vect = CountVectorizer()

# create document-term matrices
x_train_dtm = vect.fit_transform(x_train)
x_test_dtm = vect.transform(x_test)

#### Logistic regression model building

In [126]:
lr1=LogisticRegression()
lr1.fit(x_train_dtm,y_train)
y_predict=lr1.predict(x_test_dtm)
print("train accuracy is   :",accuracy_score(y_train,lr1.predict(x_train_dtm)))
print("test accuracy is   :",accuracy_score(y_test,y_predict))

train accuracy is   : 0.9825347066726378
test accuracy is   : 0.8517745302713987


#### Naive Bayes model building

In [130]:
nbm=MultinomialNB()
nbm.fit(x_train_dtm,y_train)
y_predictions=nbm.predict(x_test_dtm)
print("train accuracy is   :",accuracy_score(y_train,nbm.predict(x_train_dtm)))
print("test accuracy is   :",accuracy_score(y_test,y_predictions))

train accuracy is   : 0.9561128526645768
test accuracy is   : 0.8455114822546973
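
These accuracies should be read against the class balance: with 2672 positive and 519 negative tweets in `df`, a model that always predicts the positive class would already be right about 84% of the time on the full data. A quick check of that arithmetic follows; the ratio inside the 958-row test split may differ slightly.

```python
# Class counts taken from the groupby output earlier in the notebook.
n_positive = 2672
n_negative = 519

# Accuracy of a classifier that always predicts the majority (positive) class.
majority_baseline = n_positive / (n_positive + n_negative)
print(f"majority-class baseline: {majority_baseline:.4f}")
```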


### 15. Create a function called `tokenize_predict` that takes a count vectorizer object as input, creates document-term matrices from x_train and x_test, trains a model on them and prints the accuracy

In [None]:
vect = CountVectorizer()

In [137]:

def tokenize_predict(vect,x_train,x_test,y_train,y_test):
    x_train_dtm = vect.fit_transform(x_train)
    print('Features: ', x_train_dtm.shape[1])
    x_test_dtm = vect.transform(x_test)
    nb = MultinomialNB()
    nb.fit(x_train_dtm, y_train)
    y_pred_class = nb.predict(x_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

In [138]:
tokenize_predict(vect,x_train,x_test,y_train,y_test)

Features:  5098
Accuracy:  0.8455114822546973


### 16. Create a count vectorizer with `ngram_range=(1, 2)` and pass it to the `tokenize_predict` function to print the accuracy score

Hint: vect = CountVectorizer(ngram_range=(1, 2))

In [140]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
tokenize_predict(vect,x_train,x_test,y_train,y_test)

Features:  24228
Accuracy:  0.848643006263048
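
The feature count jumps from 5098 to 24228 because `ngram_range=(1, 2)` adds every pair of adjacent tokens as its own column. A pure-Python sketch of that expansion follows, using a plain whitespace split instead of CountVectorizer's own tokenizer (an assumption made for brevity); the helper name `word_ngrams` is hypothetical.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams("can not wait for ipad"))
```

Bigrams such as "can not" keep a little word order that single tokens lose, at the cost of a much larger and sparser matrix.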


### 17. Create a count vectorizer with `stop_words='english'` and pass it to the `tokenize_predict` function to print the accuracy score

In [143]:
# remove English stop words
vect = CountVectorizer(stop_words = 'english')
tokenize_predict(vect,x_train,x_test,y_train,y_test)

Features:  4861
Accuracy:  0.8507306889352818


### 18. Create a count vectorizer with `stop_words='english'` and `max_features=300` and pass it to the `tokenize_predict` function to print the accuracy score

In [145]:
# remove English stop words and only keep 300 features
vect = CountVectorizer(stop_words = 'english',max_features =300)
tokenize_predict(vect,x_train,x_test,y_train,y_test)

Features:  300
Accuracy:  0.8058455114822547


### 19. Create a count vectorizer with `ngram_range=(1, 2)` and `max_features=15000` and pass it to the `tokenize_predict` function to print the accuracy score

In [147]:
# include 1-grams and 2-grams, and limit the number of features to 15000
vect = CountVectorizer(ngram_range=(1,2),max_features =15000)
tokenize_predict(vect,x_train,x_test,y_train,y_test)

Features:  15000
Accuracy:  0.8444676409185804


### 20. Create a count vectorizer with `ngram_range=(1, 2)` that only includes terms appearing in at least 2 documents (`min_df=2`) and pass it to the `tokenize_predict` function to print the accuracy score

In [148]:
# include 1-grams and 2-grams, and only include terms that appear in at least 2 documents
vect = CountVectorizer(ngram_range=(1,2),min_df =2)
tokenize_predict(vect,x_train,x_test,y_train,y_test)

Features:  7204
Accuracy:  0.8455114822546973
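
`min_df=2` is a document-frequency threshold: a term is kept only if it occurs in at least two different documents, however many times it repeats inside one of them. The sketch below illustrates the distinction in plain Python with made-up documents and a whitespace tokenizer (both assumptions).

```python
from collections import Counter

docs = ["ipad ipad ipad", "iphone line", "long line for iphone"]  # toy corpus

# Count in how many documents each term appears (not total occurrences).
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc.split()))

min_df = 2
kept = sorted(t for t, df_count in doc_freq.items() if df_count >= min_df)
print(kept)
```

Here "ipad" occurs three times but in a single document, so it is dropped, while "iphone" and "line" each appear in two documents and are kept.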
